8 Great Data Integration Development Tips

As with anything in IT, things never go as smoothly as they looked on the data integration process flow in the PowerPoint slide your manager spoke to at the last departmental meeting. The report you scoped takes longer than expected to run, getting the process working feels like endless lines of code, and you find yourself in twice as many debugging sessions as you ever anticipated.

Many can probably relate to this nightmare, because it's not always easy to get the dataflow right, especially with batch processes over big data volumes. I'm sure this horror show is more common than one would like to admit, so I thought it might be helpful to review an article written by Saggi Neumann, Co-Founder and CTO of Xplenty, called "Eight Best Practices for Data Integration Development."

Below are small snippets from the article.

1. Start Small

“Start with a small sample of the dataset for development and debugging purposes.”

“Using too much data at this point only lengthens development time.”

“Process the entire dataset further down the line after you have confirmed that your dataflow works correctly.”
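In Python terms, starting small can be as simple as slicing off the first few rows of the input before wiring up the rest of the flow. A minimal sketch (the CSV text and field names here are made up for illustration):

```python
import csv
import io
from itertools import islice

def sample_rows(csv_text, n):
    """Return only the first n data rows of a CSV source.

    Developing and debugging against a small sample keeps
    iteration fast; swap in the full dataset once the
    dataflow is confirmed to work correctly.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(islice(reader, n))

data = "id,amount\n1,10\n2,20\n3,30\n4,40\n"
print(sample_rows(data, 2))  # only the first two rows
```

The same idea applies whatever the source is: point the dev run at a `LIMIT`-ed query or a single small file, not the whole dataset.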

2. Develop Gradually

“Developing a long and complicated dataflow only to see it fail can waste plenty of time, not to mention that it is rather hard to debug.”

“…develop it gradually, part by part.”

“…check the output after each intersection and make sure that the results are correct.”
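One way to make "check the output after each part" mechanical is to run each stage through a small harness that validates intermediate results before moving on. A sketch, with made-up stage names and a toy check standing in for your own inspection:

```python
def run_stepwise(records, stages, check):
    """Apply each stage in order, validating the intermediate output.

    `stages` is an ordered list of (name, function) pairs and `check`
    is a per-stage sanity test -- a stand-in for manually inspecting
    results after each part of the flow.
    """
    for name, stage in stages:
        records = stage(records)
        if not check(records):
            raise ValueError(f"stage {name!r} produced bad output")
    return records

stages = [
    ("parse", lambda rs: [int(r) for r in rs]),
    ("double", lambda rs: [r * 2 for r in rs]),
]
result = run_stepwise(["1", "2", "3"], stages, check=lambda rs: len(rs) == 3)
print(result)  # [2, 4, 6]
```

A stage that breaks the flow fails loudly right where it was added, instead of surfacing at the end of a long run.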

3. Filter Out Useless Data

“Select only relevant fields via projection and use filters to keep irrelevant data out of the flow.”
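Projection and filtering can happen in one pass, as early in the flow as possible. A minimal sketch (the field names are hypothetical):

```python
def project_and_filter(rows, fields, keep):
    """Keep only the named fields, and only the rows passing `keep`.

    Dropping irrelevant columns and rows up front means every later
    stage moves less data.
    """
    return [{f: row[f] for f in fields} for row in rows if keep(row)]

rows = [
    {"id": 1, "country": "US", "debug_blob": "..."},
    {"id": 2, "country": "DE", "debug_blob": "..."},
]
print(project_and_filter(rows, ["id"], keep=lambda r: r["country"] == "US"))
# [{'id': 1}]
```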

4. Join Carefully

“…take a look at data from the join sources and manually check whether they are joined correctly by checking row counts and value histograms after the join.”

Types of joins: Replicated Join; Skewed Join; Merge Join; Merge-Sparse Join; Default Join

“Make sure, of course, that you put the relevant data source on the correct side depending on the join type.”
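The row-count and value-histogram checks the article suggests are easy to automate. The sketch below uses a plain hash join (not one of the specialized join types listed above) with made-up order/customer data, just to show the post-join sanity checks:

```python
from collections import Counter

def inner_join(left, right, key):
    """Simple hash join: index the right side, then match left rows."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

orders = [{"cust": "a", "amt": 5}, {"cust": "a", "amt": 7}, {"cust": "b", "amt": 1}]
custs = [{"cust": "a", "region": "EU"}, {"cust": "b", "region": "US"}]
joined = inner_join(orders, custs, "cust")

# Post-join sanity checks: a 1:N join should neither drop nor
# duplicate orders, and the histogram should look plausible.
assert len(joined) == len(orders)
print(Counter(r["region"] for r in joined))
```

If the row count balloons, you probably joined on a non-unique key; if it shrinks, rows are failing to match.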

5. Store Results as Files

“Using the database as immediate output during development is not such a good idea – you will find out about errors, like an invalid schema, only when inserting the crunched data into the DB.”
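During development, writing the crunched output to a flat file makes problems visible as readable text rather than failed inserts mid-run. A sketch using JSON lines (the path and records are placeholders):

```python
import json
import os
import tempfile

def write_results(records, path):
    """Write results as JSON lines instead of inserting into the DB.

    Schema surprises show up as odd-looking lines in a file you can
    open and inspect, not as insert errors at the end of the flow.
    """
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

path = os.path.join(tempfile.mkdtemp(), "out.jsonl")
write_results([{"id": 1}, {"id": 2}], path)
print(open(path).read())
```

Once the output file looks right, swap the file writer for the real database load.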

6. Split Parallel Dataflows

“Dataflows where one data source is split into several parallel flows may work better when split into entirely separate dataflows.”

7. Split Complex Dataflows

“Dataflows that are too big and complex should also be split into several dataflows.”

“This helps to debug each one more easily and make sure everything works correctly.”

8. Use GZIP

“Compressing input and output files saves plenty of time.”

“Yes, it takes more CPU power to compress and decompress data, but that’s nothing compared to the time saved transferring bytes over the network.”
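Python's standard library makes this trade-off easy to see. A sketch with a made-up repetitive payload standing in for a large input file:

```python
import gzip

# Hypothetical payload: CSV-like data, which compresses very well.
payload = b"id,amount\n" + b"12345,67.89\n" * 10_000

compressed = gzip.compress(payload)
print(len(payload), len(compressed))  # compressed is far smaller

# The round trip is lossless, so nothing is traded away but CPU time.
assert gzip.decompress(compressed) == payload
```

Fewer bytes on the wire usually more than pays for the compression cost, especially when the data crosses a network to cloud storage.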



Timothy King