AI Data Integration: AI Is Only as Smart as the Data You Feed It
Solutions Review’s Expert Insights Series is a collection of contributed articles written by industry experts in enterprise software categories. In this feature, Arcion CEO Gary Hagmueller offers commentary on AI data integration and why AI is only as smart as the data you feed it.
When ChatGPT gate-crashed the tech world in late 2022, it effectively ushered in the AI age. It brought a broad, collective realization that this “Age of AI” will be unprecedented, influencing almost every business, organization, technology, and product, as well as how we as humans interact with the world around us. The sheer amazement most people felt at having a machine produce content that was frequently as good as, if not better than, what most English majors generated was profound. And while the output that these AI systems produce promises to revolutionize many industries and occupations, what’s under the covers is equally fascinating and transformational.
Below the fancy headlines and the cleverly written pieces, all AI systems are really just highly refined data consumption applications. It may seem like the machine is conscious, but in reality, that sense of being is the result of the ingestion, processing and rendering of data.
Generative AI, such as the large language models (LLMs) built on generative pre-trained transformers (more commonly known today as GPT), is but one flavor of AI. While the applications of LLMs are vast, there are many other forms of AI modeling that enterprises will use to produce company-specific AI applications. Neural nets, unsupervised models, supervised techniques, and deep learning, to name a few, are other techniques that generate enterprise AI models with the potential to be transformational. No matter what the approach, all AI models have one thing in common: They’re only as useful as the data that is fed to them.
It is data that brings these AI models to life. The data can be in different formats, depending on the applications and users that generate and consume it over time, across various channels and in myriad ways. For applications that leverage AI models, once the input data is determined, a lengthy process of tuning, testing and refining will eventually determine the output models that are used in a production environment. AI projects typically involve data scientists, data engineers, database administrators, developers, the DevOps team, and others. All told, the investment involved in the development of even the most basic AI model can be substantial.
Once AI models are trained and refined, they become quite specialized and highly sensitive to material changes in the source data. For an enterprise, data is increasingly sourced from the transactional systems that underlie most functional operations. Astute enterprises know that real-time operational data represents the most valuable asset at their disposal, as it reflects the current intent or actions of that company’s customers, partners, counterparties, threats and assets: in short, everything that keeps the business running.
Unlocking Operational Data Is the Key to Deriving Value From Enterprise AI
Sometimes the changes in source data reflect new phenomena that should drive an action or create awareness. This is one of the most powerful value drivers in enterprise AI adoption: spotting a weak signal that represents a meaningful change as it happens and taking advantage of the insight immediately. Harnessing the power of data changes can lead to higher revenues, lower churn, reduced risks, more efficient operations, and fewer regulatory issues, to name just a few of the benefits that drive nearly all enterprise data projects.
But there’s another form of change that occurs frequently: alterations of the data schemas, objects and formats in upstream source systems. This form of change is nearly constant in most enterprise environments. It’s nearly impossible to police data from its origin to all possible end uses. In most companies, the needs of operational uses and users will trump all downstream consumers. Upstream data generators perform specific acts that the company mandates to properly conduct business. In the “data world,” systemic change is the only real constant.
It is this reality that makes most enterprise-level AI models far more brittle than most people realize. Broken models result in significant rework, inaccurate results, or failed deployments. Worse still, an undetected change may cause actions or inactions that can materially degrade the AI application’s ability to generate the value intended when it was built. At the very least, organizations plagued by an inability to detect and assimilate data changes in real time will need to invest far more into their data QA process, which will invariably slow adoption, reduce agility, and severely limit value generation.
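To make the idea of detecting an upstream schema change concrete, here is a minimal sketch in Python. It assumes a schema can be represented as a simple mapping of column name to type name; the `SchemaDrift` class and `diff_schema` function are hypothetical names introduced for illustration, not part of any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaDrift:
    """Differences between the schema a model was built on and the one observed."""
    added: dict = field(default_factory=dict)    # column -> new type
    removed: dict = field(default_factory=dict)  # column -> old type
    retyped: dict = field(default_factory=dict)  # column -> (old type, new type)

    @property
    def detected(self) -> bool:
        return bool(self.added or self.removed or self.retyped)


def diff_schema(expected: dict, observed: dict) -> SchemaDrift:
    """Compare two {column: type} mappings and report what changed."""
    drift = SchemaDrift()
    for name, type_name in observed.items():
        if name not in expected:
            drift.added[name] = type_name
        elif expected[name] != type_name:
            drift.retyped[name] = (expected[name], type_name)
    for name, type_name in expected.items():
        if name not in observed:
            drift.removed[name] = type_name
    return drift
```

A pipeline could run a check like this on each incoming batch or event and raise an alert when `detected` is true, rather than letting the model silently consume data in a shape it was never trained on.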
In addition to adapting to frequent data changes, AI applications are also critically dependent on transactional integrity. The order of events is essential to understanding the changing nature of most phenomena that AI applications seek to detect. AI applications are ill-suited to determine whether data is new or a repeat of previously ingested data. Such requirements bog them down and introduce the risk of inadvertently creating various forms of inaccurate data that can further contaminate the output. Data teams are increasingly realizing that modern AI apps should not have to struggle to absorb batch dumps that insert days’ or even weeks’ worth of data into downstream data consumption apps. Anyone who has ever built an AI app will recognize that batch loads and other connective techniques that do not guarantee data integrity are the bane of data teams looking to drive value.
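The ordering and duplicate concerns above can be handled before data ever reaches the AI application. The sketch below is one simple way to do it, under the assumption that every change event carries a contiguous, monotonically increasing sequence number starting at 1 (real database log positions are usually not contiguous, so this is a simplification). `OrderedApplier` is a hypothetical name used only for this example.

```python
class OrderedApplier:
    """Delivers events exactly once and in sequence order.

    Assumes each event has an integer sequence number, starting at 1,
    with no gaps in the stream as produced by the source.
    """

    def __init__(self):
        self.last_applied = 0   # highest sequence number delivered so far
        self.pending = {}       # early arrivals waiting for a gap to fill
        self.applied = []       # payloads in the order they were delivered

    def receive(self, seq: int, payload) -> None:
        # Drop repeats of events that were already delivered or buffered.
        if seq <= self.last_applied or seq in self.pending:
            return
        self.pending[seq] = payload
        # Deliver every event that is now contiguous with the last one.
        while self.last_applied + 1 in self.pending:
            self.last_applied += 1
            self.applied.append(self.pending.pop(self.last_applied))
```

With this in place, a stream that arrives as 1, 3, 1, 2, 3 is delivered downstream as 1, 2, 3, so the consuming model never has to decide whether a record is new or a repeat.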
Data pipe resilience and integrity have become critical factors in effectively deploying AI models in a production environment. What’s needed is the ability to effectively “self-heal” models when data changes. Transactional integrity is also vital to any high-performing AI application to guarantee accuracy. Finally, if data does not arrive close to when it is produced, it rapidly begins to lose its time value.
CDC Is the Missing Piece of the Data Puzzle
What modern data applications need is a form of connective technology that can meet the exacting requirements of AI applications. Enter change data capture (CDC). CDC is the only technology that allows users to build real-time pipes that stream data from operational data stores to downstream systems with transactional integrity and the ability to manage changes. It does so by reading the transaction logs that upstream databases produce and transmitting each new entry to downstream consumers in real time. Modern CDC vendors have designed distributed architectures that handle the load of modern data producers at wire speed. When schema changes occur, such as new columns or changes to a column’s data type, leading CDC solutions will propagate those changes to target systems without requiring any manual intervention.
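As a rough illustration of the log-based approach described above, the following Python sketch replays an in-memory change log against a target table, including a schema change. The event format (`op`, `key`, `values`, `column`) and the function names are assumptions made for this example; real CDC tools read database-specific logs and use their own formats.

```python
def apply_change(target: dict, event: dict) -> None:
    """Apply one change event to a target of the form
    {"columns": [...], "rows": {primary_key: {column: value}}}.
    """
    op = event["op"]
    if op == "add_column":
        # Propagate the schema change, backfilling existing rows.
        target["columns"].append(event["column"])
        for row in target["rows"].values():
            row[event["column"]] = event.get("default")
    elif op in ("insert", "update"):
        # Create the row with all known columns if it does not exist yet.
        row = target["rows"].setdefault(
            event["key"], dict.fromkeys(target["columns"])
        )
        row.update(event["values"])
    elif op == "delete":
        target["rows"].pop(event["key"], None)
    else:
        raise ValueError(f"unknown operation: {op!r}")


def replicate(log: list, target: dict, start: int = 0) -> int:
    """Apply log entries from offset `start` onward, in log order.

    Returns the next offset to read, so a later call can resume
    where this one stopped instead of reprocessing old entries.
    """
    for offset in range(start, len(log)):
        apply_change(target, log[offset])
    return len(log)
```

Because the log is applied strictly in order and the consumer remembers its offset, the target sees every change once, in the sequence the source committed it, and a new column shows up downstream without anyone editing the pipeline by hand.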
There are a host of other benefits, too many to elaborate on here. If you’re designing a new AI application for your enterprise or if you are struggling to keep one working, it’s worth your time to explore CDC as a basis for building your data pipes.