8 Great Data Integration Development Tips

Just as with anything in IT, things never go as smoothly as they looked on the data integration process flow in the PowerPoint slide that your manager spoke to at the last departmental meeting. The report you projected takes longer to run than expected, getting the process working turns into seemingly endless lines of code, and you find yourself in twice as many debugging sessions as you ever expected.

Many can probably relate to this nightmare because it’s not always easy to get the dataflow right, especially with batch processes over big data volumes. I’m sure this is more common than one would admit, so I thought it would be helpful to provide some assistance by reviewing an article by Saggi Neumann, Co-Founder and CTO of Xplenty, called “Eight Best Practices for Data Integration Development.”

Below are small snippets from the article.

1. Start Small

“Start with a small sample of the dataset for development and debugging purposes.”

“Using too much data at this point only lengthens development time.”

“Process the entire dataset further down the line after you have confirmed that your dataflow works correctly.”
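A minimal sketch of this idea in Python, using pandas purely for illustration (the file name, row limit, and DEV_MODE toggle are all hypothetical, not from the article):

```python
import pandas as pd

DEV_MODE = True  # flip to False once the dataflow is verified

if DEV_MODE:
    # Develop and debug against a small sample of the dataset.
    df = pd.read_csv("events.csv", nrows=10_000)
else:
    # Process the entire dataset only after the flow works correctly.
    df = pd.read_csv("events.csv")
```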

2. Develop Gradually

“Developing a long and complicated dataflow only to see it fail can waste plenty of time, not to mention that it is rather hard to debug.”

“…develop it gradually, part by part.”

“…check the output after each intersection and make sure that the results are correct.”
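One way to picture “gradually” in pandas (column and file names are hypothetical): add one transformation at a time and inspect the intermediate output before building the next part.

```python
import pandas as pd

df = pd.read_csv("events.csv", nrows=10_000)  # small dev sample (tip 1)

# Part 1: parse timestamps, then verify before moving on.
df["ts"] = pd.to_datetime(df["ts"], errors="coerce")
print(df["ts"].isna().sum(), "rows failed to parse")

# Part 2: aggregate, then sanity-check the output before extending the flow.
daily = df.groupby(df["ts"].dt.date).size()
print(daily.head())
```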

3. Filter Out Useless Data

“Select only relevant fields via projection and use filters to keep irrelevant data out of the flow.”
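In pandas terms (the columns and the filter condition are assumptions for illustration), projection and filtering might look like this, done as early in the flow as possible:

```python
import pandas as pd

# Projection: load only the fields the flow actually needs.
df = pd.read_csv("events.csv", usecols=["user_id", "event", "ts"])

# Filter: keep irrelevant rows out of the rest of the flow.
df = df[df["event"] == "purchase"]
```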

4. Join Carefully

“…take a look at data from the join sources and manually check whether they are joined correctly by checking row counts and value histograms after the join.”

Types of joins: Replicated Join; Skewed Join; Merge Join; Merge-Sparse Join; Default Join

“Make sure, of course, that you put the relevant data source on the correct side depending on the join type.”
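A rough sketch of those post-join sanity checks in pandas (the files, the join key, and the country column are assumptions, not from the article):

```python
import pandas as pd

orders = pd.read_csv("orders.csv")
users = pd.read_csv("users.csv")

joined = orders.merge(users, on="user_id", how="left")

# Row count check: a left join should not grow the left side
# unless user_id is duplicated in users.csv.
print(len(orders), "->", len(joined))

# Value histogram check: unmatched keys surface as nulls.
print(joined["country"].value_counts(dropna=False))
```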

5. Store Results as Files

“Using the database as immediate output during development is not such a good idea – you will find out about errors, like an invalid schema, only when inserting the crunched data into the DB.”
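A hedged illustration of swapping the sink during development (the file paths and table name are hypothetical):

```python
import pandas as pd

# Stand-in for the crunched output of the dataflow.
result = pd.read_csv("events.csv", nrows=10_000)

# During development, write to a flat file; schema problems show up
# when you inspect the file rather than at database insert time.
result.to_csv("output/result.csv", index=False)

# Only once the flow is stable, point the sink at the database, e.g.:
# result.to_sql("results", engine, if_exists="append", index=False)
```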

6. Split Parallel Dataflows

“Dataflows where one data source is split into several parallel flows may work better when split into entirely separate dataflows.”
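As a sketch, instead of one script that reads a source and fans out into several branches, each branch can become its own self-contained flow over the same source (file and event names are hypothetical):

```python
# flow_purchases.py -- one separate dataflow per former branch.
import pandas as pd

df = pd.read_csv("events.csv")
df[df["event"] == "purchase"].to_csv("output/purchases.csv", index=False)

# flow_signups.py would be an analogous, entirely separate script
# filtering on "signup" and writing its own output file.
```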

7. Split Complex Dataflows

“Dataflows that are too big and complex should also be split into several dataflows.”

“This helps to debug each one more easily and make sure everything works correctly.”
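One way to sketch the split, assuming the stages hand off through intermediate files (all names hypothetical):

```python
# stage1_clean.py: clean the raw input and write an intermediate file.
import pandas as pd

raw = pd.read_csv("events.csv")
raw.dropna(subset=["user_id"]).to_csv("output/clean.csv", index=False)

# stage2_aggregate.py (a separate script) then picks up that file:
#   clean = pd.read_csv("output/clean.csv")
#   clean.groupby("user_id").size().to_csv("output/user_counts.csv")
```

Each stage can now be run, checked, and debugged on its own.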

8. Use GZIP

“Compressing input and output files saves plenty of time.”

“Yes, it takes more CPU power to compress and decompress data, but that’s nothing compared to the time saved transferring bytes over the network.”
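pandas, for one, handles gzip transparently, inferring the compression from a .gz extension (file names are hypothetical):

```python
import pandas as pd

# Reads gzip-compressed input directly; compression is inferred from ".gz".
df = pd.read_csv("events.csv.gz")

# Writes gzip-compressed output, trading a little CPU for smaller transfers.
df.to_csv("output/result.csv.gz", index=False, compression="gzip")
```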

Click here to read the entire article.

 
