{"id":551,"date":"2024-01-01T10:51:45","date_gmt":"2024-01-01T10:51:45","guid":{"rendered":"https:\/\/solutionsreview.com\/expert\/?p=551"},"modified":"2024-02-02T14:31:58","modified_gmt":"2024-02-02T14:31:58","slug":"solving-the-data-wrangling-conundrum-can-machine-learning-transform-data-management","status":"publish","type":"post","link":"https:\/\/solutionsreview.com\/thought-leaders\/solving-the-data-wrangling-conundrum-can-machine-learning-transform-data-management\/","title":{"rendered":"Solving the Data Wrangling Conundrum &#8211; Can Machine Learning Transform Data Management?"},"content":{"rendered":"<p style=\"text-align: justify;\"><em>Do data scientists really spend 80% of their time wrangling data? Last time around, we examined this notion. But when it comes to data management, how can machine learning change data platforms for the better?<\/em><\/p>\n<p style=\"text-align: justify;\">In my last piece, I asked:\u00a0<a href=\"https:\/\/diginomica.com\/data-science-myths-and-realities-do-data-scientists-really-spend-80-their-time-wrangling-data\" class=\"external\" rel=\"nofollow\">do data scientists really spend 80% of their time wrangling data?<\/a>\u00a0Now it&#8217;s time for the follow-up: can machine learning make a difference in data management? Can it alter that 80\/20 data cleansing ratio?<\/p>\n<p style=\"text-align: justify;\">Machine Learning (ML) is a term that can mean just about anything. In evaluating a (proprietary) tool for data management that claims to use machine learning, you should understand what that means. 
It isn\u2019t necessary to see the math or even the code that implements the algorithm.<\/p>\n<p style=\"text-align: justify;\">It should suffice to understand what the algorithm evaluates, at least a high-level explanation of how it operates and what it produces.<\/p>\n<p style=\"text-align: justify;\">Keep in mind that the fundamental workings of the algorithms are usually proprietary, so the explanations, if given, will be pretty high level. How coherent the explanation is, though, should help you understand what is real.<\/p>\n<p style=\"text-align: justify;\">Despite its lofty name, machine learning isn\u2019t that mysterious. The most popular algorithms in use today are pretty mature. What makes them \u201cmachine learning\u201d instead of just statistical models is the use of massive amounts of data, which was not previously possible. Some machine learning algorithms that are\u00a0<a href=\"https:\/\/www.towardsdatascience.com\/\" class=\"external\" rel=\"nofollow\">common in use are<\/a>:<\/p>\n<ul style=\"text-align: justify;\">\n<li>Linear Regression<\/li>\n<li>Logistic Regression<\/li>\n<li>Linear Discriminant Analysis<\/li>\n<li>Classification and Regression Trees<\/li>\n<li>Naive Bayes<\/li>\n<li>K-Nearest Neighbors<\/li>\n<li>Learning Vector Quantization<\/li>\n<li>Support Vector Machines<\/li>\n<li>Bagging and Random Forest<\/li>\n<li>Boosting and AdaBoost<\/li>\n<\/ul>\n<h2 style=\"text-align: justify;\">Is Machine Learning Artificial Intelligence?<\/h2>\n<p style=\"text-align: justify;\">There is a tendency to conflate machine learning with Artificial Intelligence (AI). There are two general fields of AI. The first, Artificial General Intelligence (AGI), is about machines having human-like cognition and human intelligence, but there is some disagreement about when or if we will reach that threshold. Each new bold advance in what appears to be AGI demonstrates that what was assumed to be intelligence turns out not to be. 
Facial recognition is a good example.\u00a0The other is what is in place now: non-sentient machine intelligence, typically focused on a narrow task. This is where machine learning and AI get mixed up.<\/p>\n<p style=\"text-align: justify;\">For an ML algorithm to learn, it sifts through lots of data using a variety of statistical, non-parametric and other quantitative algorithms to find relationships, patterns and connections in the data. According to Judea Pearl, the Turing Award winner and author of \u201cThe Book of Why: The New Science of Cause and Effect,\u201d ML cannot understand cause and effect. ML without causal capabilities, as Pearl derisively claims, \u201cis just curve fitting.\u201d<\/p>\n<p style=\"text-align: justify;\">Pearl has led the field on the question of cause and effect, and while there is some truth in his comment, there are many useful applications for ML that are \u201cjust curve fitting.\u201d For example, sifting through billions of records to find what relates to what and how strongly, and then giving analysts or data stewards the opportunity to edit those findings. That\u2019s how ML actually \u201clearns.\u201d<\/p>\n<p style=\"text-align: justify;\">For a data discovery\/relationship discovery process to tie to a data catalog, the essential abilities are:<\/p>\n<ul style=\"text-align: justify;\">\n<li>The ability to scale as the data volumes are large; the processing is continuous.<\/li>\n<li>The ML algorithms operate in supervised and unsupervised modes.<\/li>\n<li>No ML discovery algorithm is perfect. 
User input is captured and cycled back into the ML process.<\/li>\n<li>Continuous relearning and adapting of the models.<\/li>\n<\/ul>\n<p style=\"text-align: justify;\"><a href=\"\/\/\/Users\/mac\/Downloads\/Machine%20learning%20and%20AI%3A%20Is%20it%20live%20or%20is%20...%20-%20SiliconANGLE.%20https%3A\/siliconangle.com\/2018\/08\/06\/machine-learning-ai-live-memorex\" class=\"external\" rel=\"nofollow\">I wrote a few years ago<\/a>:\u00a0The real magic in applying machine learning models to a software product is producing the right mix of things that are general enough to work with a wide range of situations and powerful enough to produce non-trivial results repeatedly (a useless example: \u201cMost auto injury accidents occur when the driver is at least 16 years old.\u201d). Supporting data science with integrated (no-code) tools requires creating and maintaining a comprehensive data catalog, but a few steps precede it.<\/p>\n<p style=\"text-align: justify;\"><strong>Relationship discovery<\/strong><\/p>\n<p style=\"text-align: justify;\">If you think about it, the most crucial part of managing collections of dissimilar data is finding relationships. Finding relationships between so many forms of data is practically impossible to do by hand. When dealing with tabular\/columnar data, column names can hint at which fields hold similar kinds of data, though names are not consistently accurate. Instead, the magic investigates the actual data to determine what it is.<\/p>\n<p style=\"text-align: justify;\">To put this in perspective, if you have a few billion instances to compare, this can be a computationally expensive (read, slow) process. Here is the first example of machine learning boosting the process. Using some of the algorithms mentioned above, an unsupervised machine learning model can quickly break down the similarities and converge to a solution. 
As the process flows through the data collection, it builds a relationship map that drives all of the elements of the system. Some powerful techniques that data discovery vendors are employing to find these relationships are:<\/p>\n<ul style=\"text-align: justify;\">\n<li>Recurrent Convolutional Neural Networks (RCNN).<\/li>\n<li>Semi-Structured Data Parsing: Hidden Markov Model\u00a0and Gene Sequencing\u00a0algorithms.<\/li>\n<\/ul>\n<p style=\"text-align: justify;\">Recommendations are then provided to help the analyst join data sets, enrich the data, choose columns, add filters, and aggregate the data. The algorithms convert the mapping recommendation problem into a machine translation problem using:<\/p>\n<ul style=\"text-align: justify;\">\n<li>An Encoder-Decoder architecture for primitive one-to-one mappings.<\/li>\n<li>Maximal grouping, applied next.<\/li>\n<li>An Attention Neural Network (ANN) to resolve the recommendation.<\/li>\n<\/ul>\n<p style=\"text-align: justify;\"><strong>Data flow<\/strong><\/p>\n<p style=\"text-align: justify;\">Machine learning-based discovery shows how data flows between databases and data sources and, ultimately, how it moves through the organization, revealing where data originates and the affinities in the data itself.<\/p>\n<p style=\"text-align: justify;\"><strong>Sensitive data discovery<\/strong><\/p>\n<p style=\"text-align: justify;\">There are two types of sensitive data in sources. The first is the obvious personal information such as name, social security number, date of birth, and demographic, sociographic, and psychographic data. The problem is that this data may not be identifiable by merely looking at the column names or other available metadata. Only by examining the data itself can an algorithm decide whether the data falls within the \u201csensitive\u201d realm.<\/p>\n<p style=\"text-align: justify;\">But there is a deeper problem. 
Personally Identifiable Information (PII) also covers seemingly non-sensitive information that can be combined with other non-sensitive details to create an \u201cemergent\u201d identity. Additionally, there may be information that company policy defines as sensitive or confidential to the organization, which also falls within the realm of \u201csensitive.\u201d<\/p>\n<p style=\"text-align: justify;\">Considering these types of sensitive data, there are many issues where it is essential to manage the process. First, of course, are regulatory issues, such as the General Data Protection Regulation (GDPR). But there are also organizational promises to customers and suppliers to be good stewards of the data collected about them. It is relatively easy to govern these policies when a single internal system generates and manages the data. Still, if the data is scattered across sources and locations, gaps in governance and even the \u201cemergent\u201d problem can occur.<\/p>\n<p style=\"text-align: justify;\">And finally, the gap between policy and digital processing is wide. The policy is stated in natural language, but implementing that policy in software can be pretty tricky.<\/p>\n<p style=\"text-align: justify;\"><strong>Impact analysis<\/strong><\/p>\n<p style=\"text-align: justify;\">Like a trend analysis, this captures changes in the source data at different points in time. For example, if new sensitive data is introduced into the database, impact analysis can determine when that occurred and quantify the delta.<\/p>\n<p style=\"text-align: justify;\"><strong>Redundant data analysis<\/strong><\/p>\n<p style=\"text-align: justify;\">Redundant data may, and usually does, have different modification cycles, leading to data confusion. Generally, there aren\u2019t redundant data sources of primary enterprise data (though it happens). 
Still, other data sets can creep into the universe of sources, such as saved analysis outputs, training data sets and even spreadsheets. The relationship map can identify these redundant sources and allow the analysts to choose the appropriate one.<\/p>\n<p style=\"text-align: justify;\">Organizations can accumulate vast quantities of redundant data. They may be impacted by storage costs and unknowingly leave such data unmanaged and unprotected. Redundant data also requires management so that, once it is identified, organizations can decide on the appropriate remediation steps as part of the data management process.<\/p>\n<p style=\"text-align: justify;\"><strong>Data catalog<\/strong><\/p>\n<p style=\"text-align: justify;\">This is the most important piece. The automated data catalog is driven by relationship discovery. The whole point of a semantically rich data catalog is to give analysts, data scientists, business and technology users (anyone who uses data, actually) a means to find the data they need and to understand what it means, how it relates to other data, and how it flows. It also supports collaboration and enables good data governance, data management and ultimately business analytics. Unlike the proprietary metadata of enterprise applications such as ERP or CRM, or of Business Intelligence and visualization tools, the catalog is not tied to a specific schema or model. Its generality is the key to its usefulness.<\/p>\n<p style=\"text-align: justify;\">The most common repositories of metadata relate to customer and product domains. There is no doubt that these repositories are useful, but they lack perhaps 90 percent of the valuable data for analytics and data science.<\/p>\n<h2 style=\"text-align: justify;\">My take<\/h2>\n<p style=\"text-align: justify;\">Machine Learning alone cannot break through the 80% problem, but it is the necessary element if applied intelligently. 
A unified platform, from data discovery to data catalog, can vastly reduce the time it takes to do the analytics required for digital transformation.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Do data scientists really spend 80% of their time wrangling data? Last time around, we examined this notion. But when it comes to data management, how can machine learning change data platforms for the better? In my last piece, I asked:\u00a0do data scientists really spend 80% of their time wrangling data?\u00a0Now it&#8217;s time for the [&hellip;]<\/p>\n","protected":false},"author":433,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":""},"categories":[11],"tags":[],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v23.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Solving the Data Wrangling Conundrum - Can Machine Learning Transform Data Management?<\/title>\n<meta name=\"robots\" content=\"noindex, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Solving the Data Wrangling Conundrum - Can Machine Learning Transform Data Management?\" \/>\n<meta property=\"og:description\" content=\"Do data scientists really spend 80% of their time wrangling data? Last time around, we examined this notion. But when it comes to data management, how can machine learning change data platforms for the better? 
In my last piece, I asked:\u00a0do data scientists really spend 80% of their time wrangling data?\u00a0Now it&#8217;s time for the [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/solutionsreview.com\/thought-leaders\/solving-the-data-wrangling-conundrum-can-machine-learning-transform-data-management\/\" \/>\n<meta property=\"og:site_name\" content=\"Solutions Review Thought Leaders\" \/>\n<meta property=\"article:published_time\" content=\"2024-01-01T10:51:45+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-02-02T14:31:58+00:00\" \/>\n<meta name=\"author\" content=\"Neil Raden\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Neil Raden\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/solutionsreview.com\/thought-leaders\/solving-the-data-wrangling-conundrum-can-machine-learning-transform-data-management\/\",\"url\":\"https:\/\/solutionsreview.com\/thought-leaders\/solving-the-data-wrangling-conundrum-can-machine-learning-transform-data-management\/\",\"name\":\"Solving the Data Wrangling Conundrum - Can Machine Learning Transform Data 
Management?\",\"isPartOf\":{\"@id\":\"https:\/\/solutionsreview.com\/thought-leaders\/#website\"},\"datePublished\":\"2024-01-01T10:51:45+00:00\",\"dateModified\":\"2024-02-02T14:31:58+00:00\",\"author\":{\"@id\":\"https:\/\/solutionsreview.com\/thought-leaders\/#\/schema\/person\/fe941647826b18f7a50b492466b043d9\"},\"breadcrumb\":{\"@id\":\"https:\/\/solutionsreview.com\/thought-leaders\/solving-the-data-wrangling-conundrum-can-machine-learning-transform-data-management\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/solutionsreview.com\/thought-leaders\/solving-the-data-wrangling-conundrum-can-machine-learning-transform-data-management\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/solutionsreview.com\/thought-leaders\/solving-the-data-wrangling-conundrum-can-machine-learning-transform-data-management\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/solutionsreview.com\/thought-leaders\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Solving the Data Wrangling Conundrum &#8211; Can Machine Learning Transform Data Management?\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/solutionsreview.com\/thought-leaders\/#website\",\"url\":\"https:\/\/solutionsreview.com\/thought-leaders\/\",\"name\":\"Solutions Review Thought Leaders\",\"description\":\"\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/solutionsreview.com\/thought-leaders\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/solutionsreview.com\/thought-leaders\/#\/schema\/person\/fe941647826b18f7a50b492466b043d9\",\"name\":\"Neil 
Raden\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/solutionsreview.com\/thought-leaders\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/0813278cb05cca09748dcebe9e2cc499?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/0813278cb05cca09748dcebe9e2cc499?s=96&d=mm&r=g\",\"caption\":\"Neil Raden\"},\"description\":\"Neil Raden is a mathematician, former P&amp;C actuary, consultant and industry analyst and has for more than a quarter-century devised and implemented analytical decision-making systems for industry and government He delivers context and advisory services in the application of analytics, decision management, AI and AI Ethics as an author and popular speaker.\",\"sameAs\":[\"https:\/\/www.hiredbrains.com\",\"www.linkedin.com\/in\/neilraden\/\"],\"url\":\"https:\/\/solutionsreview.com\/thought-leaders\/author\/neil-raden\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Solving the Data Wrangling Conundrum - Can Machine Learning Transform Data Management?","robots":{"index":"noindex","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"og_locale":"en_US","og_type":"article","og_title":"Solving the Data Wrangling Conundrum - Can Machine Learning Transform Data Management?","og_description":"Do data scientists really spend 80% of their time wrangling data? Last time around, we examined this notion. But when it comes to data management, how can machine learning change data platforms for the better? 
In my last piece, I asked:\u00a0do data scientists really spend 80% of their time wrangling data?\u00a0Now it&#8217;s time for the [&hellip;]","og_url":"https:\/\/solutionsreview.com\/thought-leaders\/solving-the-data-wrangling-conundrum-can-machine-learning-transform-data-management\/","og_site_name":"Solutions Review Thought Leaders","article_published_time":"2024-01-01T10:51:45+00:00","article_modified_time":"2024-02-02T14:31:58+00:00","author":"Neil Raden","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Neil Raden","Est. reading time":"7 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/solutionsreview.com\/thought-leaders\/solving-the-data-wrangling-conundrum-can-machine-learning-transform-data-management\/","url":"https:\/\/solutionsreview.com\/thought-leaders\/solving-the-data-wrangling-conundrum-can-machine-learning-transform-data-management\/","name":"Solving the Data Wrangling Conundrum - Can Machine Learning Transform Data 
Management?","isPartOf":{"@id":"https:\/\/solutionsreview.com\/thought-leaders\/#website"},"datePublished":"2024-01-01T10:51:45+00:00","dateModified":"2024-02-02T14:31:58+00:00","author":{"@id":"https:\/\/solutionsreview.com\/thought-leaders\/#\/schema\/person\/fe941647826b18f7a50b492466b043d9"},"breadcrumb":{"@id":"https:\/\/solutionsreview.com\/thought-leaders\/solving-the-data-wrangling-conundrum-can-machine-learning-transform-data-management\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/solutionsreview.com\/thought-leaders\/solving-the-data-wrangling-conundrum-can-machine-learning-transform-data-management\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/solutionsreview.com\/thought-leaders\/solving-the-data-wrangling-conundrum-can-machine-learning-transform-data-management\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/solutionsreview.com\/thought-leaders\/"},{"@type":"ListItem","position":2,"name":"Solving the Data Wrangling Conundrum &#8211; Can Machine Learning Transform Data Management?"}]},{"@type":"WebSite","@id":"https:\/\/solutionsreview.com\/thought-leaders\/#website","url":"https:\/\/solutionsreview.com\/thought-leaders\/","name":"Solutions Review Thought Leaders","description":"","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/solutionsreview.com\/thought-leaders\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/solutionsreview.com\/thought-leaders\/#\/schema\/person\/fe941647826b18f7a50b492466b043d9","name":"Neil 
Raden","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/solutionsreview.com\/thought-leaders\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/0813278cb05cca09748dcebe9e2cc499?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/0813278cb05cca09748dcebe9e2cc499?s=96&d=mm&r=g","caption":"Neil Raden"},"description":"Neil Raden is a mathematician, former P&amp;C actuary, consultant and industry analyst and has for more than a quarter-century devised and implemented analytical decision-making systems for industry and government He delivers context and advisory services in the application of analytics, decision management, AI and AI Ethics as an author and popular speaker.","sameAs":["https:\/\/www.hiredbrains.com","www.linkedin.com\/in\/neilraden\/"],"url":"https:\/\/solutionsreview.com\/thought-leaders\/author\/neil-raden\/"}]}},"_links":{"self":[{"href":"https:\/\/solutionsreview.com\/thought-leaders\/wp-json\/wp\/v2\/posts\/551"}],"collection":[{"href":"https:\/\/solutionsreview.com\/thought-leaders\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/solutionsreview.com\/thought-leaders\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/solutionsreview.com\/thought-leaders\/wp-json\/wp\/v2\/users\/433"}],"replies":[{"embeddable":true,"href":"https:\/\/solutionsreview.com\/thought-leaders\/wp-json\/wp\/v2\/comments?post=551"}],"version-history":[{"count":0,"href":"https:\/\/solutionsreview.com\/thought-leaders\/wp-json\/wp\/v2\/posts\/551\/revisions"}],"wp:attachment":[{"href":"https:\/\/solutionsreview.com\/thought-leaders\/wp-json\/wp\/v2\/media?parent=551"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/solutionsreview.com\/thought-leaders\/wp-json\/wp\/v2\/categories?post=551"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/solutionsreview.com\/thought-leaders\/wp-json\/wp\/v2\/tags?post=551"}],"curies":[{"name":"wp","href":"https:\/\/api
.w.org\/{rel}","templated":true}]}}