The Benefits of Solutions Offering Open-Source Libraries of Transformation and Mapping Logic

By Jared Stiff, CTO at SoundCommerce

Solutions Review’s Contributed Content Series is a collection of contributed articles written by thought leaders in enterprise tech. In this feature, SoundCommerce’s Jared Stiff offers commentary on the benefits of solutions offering open-source libraries of transformation and mapping logic.

Fans of Isaac Asimov’s Foundation series (the books, not the TV show!) know that the Imperial Library on the planet Trantor was both the future galactic repository of human knowledge and the place where scientist protagonist Hari Seldon developed his theories of psychohistory – the ability to predict the future with advanced probabilistic mathematics. Asimov was a polymath, and his writings were amazingly prescient about today’s artificial intelligence and machine learning.

Back here on present-day Earth, data scientists face a similar cost/benefit conundrum: how can I develop useful machine learning algorithms on complex data sets and models without the heavy lift of engineering everything from scratch? Generative AI represents a step-change increase in the speed of analysis, but the utility of even the best GenAI tools is still constrained by the quality of the data and data models analyzed.

No one wants to build their data infrastructure from scratch. Thankfully, open cloud standards (the “modern data stack”) and popular programming languages like Python and SQL give data teams a massive head start toward useful, actionable, and ML-ready data. A growing number of commercial data integration tools let users leverage and expand shared libraries of mapping and modeling logic, which can greatly shorten both data time to value and analytics time to insight.

As with Asimov’s Imperial Library on Trantor, there are major advantages to using commercial data integration tools or software applications that offer open-source or community-maintained libraries of transformation and mapping logic.

First, these tools can help businesses save time and money by providing pre-built components, connectors, and transformations that can be easily integrated into their ETL or ELT workflows. This can reduce the need for custom development and testing, and speed up the overall development process.
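To make the idea of pre-built mapping logic concrete, here is a minimal sketch in Python. It is not taken from any of the products discussed in this article: the field names, the `SHARED_ORDER_MAPPING` table, and the `apply_mapping` helper are all hypothetical, and a real library would handle types, nulls, and errors far more carefully.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical shared mapping: source field -> (target field, transform).
# In practice a team would import this from a shared library instead of
# writing it by hand for every pipeline.
Mapping = Dict[str, Tuple[str, Callable[[Any], Any]]]

SHARED_ORDER_MAPPING: Mapping = {
    "order_id": ("order_id", str),
    "total": ("order_total_usd", lambda v: round(float(v), 2)),
    "email": ("customer_email", lambda v: str(v).strip().lower()),
}


def apply_mapping(record: Dict[str, Any], mapping: Mapping) -> Dict[str, Any]:
    """Rename and transform the mapped fields of one source record."""
    out: Dict[str, Any] = {}
    for source_field, (target_field, transform) in mapping.items():
        if source_field in record:  # fields missing from the source are skipped
            out[target_field] = transform(record[source_field])
    return out


raw = {"order_id": 1001, "total": "19.989", "email": "  Ada@Example.com "}
mapped = apply_mapping(raw, SHARED_ORDER_MAPPING)
print(mapped)
```

The time savings come from reuse: the mapping table is plain data, so one team can adopt another team's table as is and override only the entries that differ.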

Second, these tools can help businesses improve the quality and accuracy of their data integrations by providing a library of pre-built components and transformations that have been tested and validated by the community. This can help reduce errors and improve the reliability of data pipelines.
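One way a community can "test and validate" a shared transformation is to ship it together with example cases that anyone can re-run before adopting it. The sketch below illustrates that pattern under stated assumptions: `normalize_country`, its alias table, and `failing_cases` are hypothetical names invented for this example, not part of any vendor's library.

```python
from typing import Callable, List, Tuple


def normalize_country(value: str) -> str:
    """Hypothetical shared transformation: map common spellings to two-letter codes."""
    aliases = {
        "usa": "US", "united states": "US", "us": "US",
        "uk": "GB", "united kingdom": "GB", "gb": "GB",
    }
    key = value.strip().lower().replace(".", "")
    # Unknown values are passed through, trimmed and upper-cased.
    return aliases.get(key, value.strip().upper())


# Example cases contributed alongside the transformation: (input, expected).
COUNTRY_CASES: List[Tuple[str, str]] = [
    ("U.S.A.", "US"),
    (" united kingdom ", "GB"),
    ("ca", "CA"),
]


def failing_cases(
    transform: Callable[[str], str], cases: List[Tuple[str, str]]
) -> List[Tuple[str, str, str]]:
    """Return (input, expected, actual) for every case the transform gets wrong."""
    failures = []
    for given, expected in cases:
        actual = transform(given)
        if actual != expected:
            failures.append((given, expected, actual))
    return failures


failures = failing_cases(normalize_country, COUNTRY_CASES)
print("all cases pass" if not failures else failures)
```

Because the cases travel with the code, a contributor who changes the alias table can see immediately whether an existing user's expectations still hold.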

Third, platforms that allow end-users to use and contribute to ETL or ELT code written by other users can help foster collaboration and innovation within the data integration community. Users can share their own custom components and transformations, as well as learn from others and contribute to the development of the platform.

Overall, using commercial data integration tools or software applications that offer open-source or community-maintained libraries of transformation and mapping logic can help businesses build more efficient, reliable, and innovative data integrations.

Here are a few providers active today in the data onboarding ecosystem:

Fivetran

Fivetran is a cloud-based data integration platform that offers pre-built connectors for Snowflake and Google BigQuery, as well as a range of other data sources. The platform includes a library of pre-built transformations and mappings, and allows users to create their own custom integrations and contribute to the community library. Fivetran offers no advanced analytical modeling capability – users are expected to build their own models in the data warehouse using tools like dbt or Coalesce.

Matillion

Matillion is a cloud-native data integration platform that offers pre-built connectors for Snowflake and Google BigQuery, as well as a range of other data sources. The platform includes a library of pre-built components and transformations, and allows users to create their own custom integrations and contribute to the community library. Analytical modeling is performed downstream of Matillion in the data warehouse.

SoundCommerce

SoundCommerce is a newer entrant to the cloud data onboarding and modeling ecosystem, focused on the needs of consumer brands and the retail industry. SoundCommerce offers an intelligent data pipeline to facilitate data ingest, pre-emptive data cataloging (semantic labeling and mapping of data during onboarding), and analytical modeling prep as data is loaded into Snowflake and BigQuery. SoundCommerce provides an open, no-code mapping interface and library to speed data readiness.
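As a rough illustration of what semantic labeling during onboarding can look like, the sketch below tags incoming column names with labels from a small rule table. This is a generic, simplified pattern and not SoundCommerce's actual implementation; the rules, label names, and the `label_columns` function are assumptions made for the example.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Hypothetical labeling rules: (pattern on the column name, semantic label).
# The first matching rule wins.
LABEL_RULES: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"e[-_]?mail", re.IGNORECASE), "customer.email"),
    (re.compile(r"(order|txn)[-_]?(id|number)", re.IGNORECASE), "order.id"),
    (re.compile(r"(total|amount|revenue)", re.IGNORECASE), "order.revenue"),
]


def label_columns(columns: List[str]) -> Dict[str, str]:
    """Assign each source column a semantic label, or 'unlabeled' if no rule matches."""
    labels: Dict[str, str] = {}
    for column in columns:
        labels[column] = next(
            (label for pattern, label in LABEL_RULES if pattern.search(column)),
            "unlabeled",
        )
    return labels


catalog = label_columns(["Email_Address", "order_id", "grand_total", "utm_source"])
print(catalog)
```

Labeling at ingest means downstream models can refer to `order.revenue` rather than to each source system's own column name, which is what makes shared mapping libraries portable across data sources.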

Talend

Talend is an open-source data integration platform that offers pre-built connectors for Snowflake and Google BigQuery, as well as a range of other data sources. The platform includes a library of pre-built transformations and mappings, and allows users to create their own custom integrations and contribute to the community library.

Stitch

Stitch is a cloud-based data integration platform that offers pre-built connectors for Snowflake and Google BigQuery, as well as a range of other data sources. The platform includes a library of pre-built transformations and mappings, and allows users to create their own custom integrations and contribute to the community library.

Choosing the best data onboarding tool for Snowflake from among Fivetran, Matillion, SoundCommerce, Talend, and Stitch depends on a number of factors, including the specific requirements of your business, the complexity of your data integration needs, and your budget.

Here are some key criteria to consider when choosing a data onboarding tool for Snowflake or BigQuery:

Based on these criteria, the best data onboarding tool for Snowflake and BigQuery will depend on the specific needs and priorities of your business. All of the platforms listed above offer a range of features and capabilities, so it’s important to evaluate each one in terms of its suitability for your business.

Like the citizens of Asimov’s galactic empire, data practitioners can call upon rich libraries of content and code (or no-code logic) to more quickly capitalize on data to drive better outcomes – especially through Generative AI and ML algorithms. The key to fast time to useful data insights and activation is leveraging industry best practices in the form of shared data labels, mapping, and models to leapfrog the most tedious and time-consuming data engineering tasks!

This article was written by Jared Stiff on January 19, 2024
