8 Great Data Integration Development Tips

As with anything in IT, things never go as smoothly as they looked on the data integration process flow in the PowerPoint slide your manager spoke to at the last departmental meeting. The report you scoped takes longer than expected to run, getting the process working feels like endless lines of code, and you find yourself in twice as many debugging sessions as you ever anticipated.

Many can probably relate to this nightmare, because it's not always easy to get the dataflow right, especially with batch processes over big data volumes. I'm sure this horror show is more common than one would like to admit, so I thought it might be helpful to review an article written by Saggi Neumann, Co-Founder and CTO of Xplenty, called "Eight Best Practices for Data Integration Development."

Below are small snippets from the article.

1. Start Small

“Start with a small sample of the dataset for development and debugging purposes.”

“Using too much data at this point only lengthens development time.”

“Process the entire dataset further down the line after you have confirmed that your dataflow works correctly.”
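In Python terms, starting small can be as simple as slicing off the first few rows of the input before wiring up the rest of the flow. A minimal sketch (the CSV text and field names here are made up for illustration):

```python
import csv
import io
from itertools import islice

def sample_rows(csv_text, n):
    """Return only the first n data rows of a CSV source.

    Developing and debugging against a small sample keeps
    iteration fast; swap in the full dataset once the
    dataflow is confirmed to work correctly.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(islice(reader, n))

data = "id,amount\n1,10\n2,20\n3,30\n4,40\n"
print(sample_rows(data, 2))  # only the first two rows
```

The same idea applies whatever the source is: point the dev run at a `LIMIT`-ed query or a single small file, not the whole dataset.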

2. Develop Gradually

“Developing a long and complicated dataflow only to see it fail can waste plenty of time, not to mention that it is rather hard to debug.”

“…develop it gradually, part by part.”

“…check the output after each intersection and make sure that the results are correct.”
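One way to make "check the output after each part" mechanical is to run each stage through a small harness that validates intermediate results before moving on. A sketch, with made-up stage names and a toy check standing in for your own inspection:

```python
def run_stepwise(records, stages, check):
    """Apply each stage in order, validating the intermediate output.

    `stages` is an ordered list of (name, function) pairs and `check`
    is a per-stage sanity test -- a stand-in for manually inspecting
    results after each part of the flow.
    """
    for name, stage in stages:
        records = stage(records)
        if not check(records):
            raise ValueError(f"stage {name!r} produced bad output")
    return records

stages = [
    ("parse", lambda rs: [int(r) for r in rs]),
    ("double", lambda rs: [r * 2 for r in rs]),
]
result = run_stepwise(["1", "2", "3"], stages, check=lambda rs: len(rs) == 3)
print(result)  # [2, 4, 6]
```

A stage that breaks the flow fails loudly right where it was added, instead of surfacing at the end of a long run.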

3. Filter Out Useless Data

“Select only relevant fields via projection and use filters to keep irrelevant data out of the flow.”
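Projection and filtering can happen in one pass, as early in the flow as possible. A minimal sketch (the field names are hypothetical):

```python
def project_and_filter(rows, fields, keep):
    """Keep only the named fields, and only the rows passing `keep`.

    Dropping irrelevant columns and rows up front means every later
    stage moves less data.
    """
    return [{f: row[f] for f in fields} for row in rows if keep(row)]

rows = [
    {"id": 1, "country": "US", "debug_blob": "..."},
    {"id": 2, "country": "DE", "debug_blob": "..."},
]
print(project_and_filter(rows, ["id"], keep=lambda r: r["country"] == "US"))
# [{'id': 1}]
```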

4. Join Carefully

“…take a look at data from the join sources and manually check whether they are joined correctly by checking row counts and value histograms after the join.”

Types of joins: Replicated Join; Skewed Join; Merge Join; Merge-Sparse Join; Default Join

“Make sure, of course, that you put the relevant data source on the correct side depending on the join type.”
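The row-count and value-histogram checks the article suggests are easy to automate. The sketch below uses a plain hash join (not one of the specialized join types listed above) with made-up order/customer data, just to show the post-join sanity checks:

```python
from collections import Counter

def inner_join(left, right, key):
    """Simple hash join: index the right side, then match left rows."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

orders = [{"cust": "a", "amt": 5}, {"cust": "a", "amt": 7}, {"cust": "b", "amt": 1}]
custs = [{"cust": "a", "region": "EU"}, {"cust": "b", "region": "US"}]
joined = inner_join(orders, custs, "cust")

# Post-join sanity checks: a 1:N join should neither drop nor
# duplicate orders, and the histogram should look plausible.
assert len(joined) == len(orders)
print(Counter(r["region"] for r in joined))
```

If the row count balloons, you probably joined on a non-unique key; if it shrinks, rows are failing to match.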

5. Store Results as Files

“Using the database as immediate output during development is not such a good idea – you will find out about errors, like an invalid schema, only when inserting the crunched data into the DB.”
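During development, writing the crunched output to a flat file makes problems visible as readable text rather than failed inserts mid-run. A sketch using JSON lines (the path and records are placeholders):

```python
import json
import os
import tempfile

def write_results(records, path):
    """Write results as JSON lines instead of inserting into the DB.

    Schema surprises show up as odd-looking lines in a file you can
    open and inspect, not as insert errors at the end of the flow.
    """
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

path = os.path.join(tempfile.mkdtemp(), "out.jsonl")
write_results([{"id": 1}, {"id": 2}], path)
print(open(path).read())
```

Once the output file looks right, swap the file writer for the real database load.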

6. Split Parallel Dataflows

“Dataflows where one data source is split into several parallel flows may work better when split into entirely separate dataflows.”

7. Split Complex Dataflows

“Dataflows that are too big and complex should also be split into several dataflows.”

“This helps to debug each one more easily and make sure everything works correctly.”

8. Use GZIP

“Compressing input and output files saves plenty of time.”

“Yes, it takes more CPU power to compress and decompress data, but that’s nothing compared to the time saved transferring bytes over the network.”
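Python's standard library makes this trade-off easy to see. A sketch with a made-up repetitive payload standing in for a large input file:

```python
import gzip

# Hypothetical payload: CSV-like data, which compresses very well.
payload = b"id,amount\n" + b"12345,67.89\n" * 10_000

compressed = gzip.compress(payload)
print(len(payload), len(compressed))  # compressed is far smaller

# The round trip is lossless, so nothing is traded away but CPU time.
assert gzip.decompress(compressed) == payload
```

Fewer bytes on the wire usually more than pays for the compression cost, especially when the data crosses a network to cloud storage.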



Timothy King