This is part of Solutions Review’s Premium Content Series, a collection of contributed columns written by industry experts in maturing software categories. In this submission, dotData CEO Ryohei Fujimaki gives his take on how to choose data science software.
Choosing the best data science and machine learning (DSML) platform can be daunting. Deciding how to choose data science software is even more complex for organizations new to machine learning, or for teams with a traditional BI background and no predictive analytics experience. The same holds for application developers and software architects searching for cloud AI services that expose AI and machine learning through APIs. When learning how to choose data science software, consider the technical features and capabilities first.
How to Choose Data Science Software
Before starting a data science platform evaluation process, the stakeholders should brainstorm to identify relevant use cases, develop requirements, and prioritize the impact and value to the business. The selection of the ideal platform is heavily dependent on the available resources, the data architecture of the company, and the skillset of the intended users. To make the best possible choice, AI and business leaders should seek answers to these fundamental questions:
- Who will be the primary users? The data science team, application developers, or the BI and analytics team?
- What are the skill level and data science expertise of the primary users? Are they expert data scientists with several years of experience, or just starting out?
- Which programming language is most used and preferred by the intended users – Python, Scala, R, or something else?
The rationale for selecting a particular DSML platform depends on the target user. If the intended users are experienced data scientists and the primary environment is Python, you need a platform that offers significant customization and flexibility. Experienced data scientists generally prefer to build, test, and tweak models manually; even so, they will appreciate a platform that automatically discovers and generates new features, helping them build accurate models faster and explore a broader feature space.
No-Code or Code-First: What Degree of Automation Will Accelerate the Data Science Workflow?
An important consideration is choosing between a no-code (or low-code) and a code-first approach to data science. Traditional, code-first DSML platforms require data science teams to generate features manually, a time-consuming process that demands deep domain knowledge. Once the features are built, AutoML platforms can accelerate the work by selecting algorithms and building ML models automatically. As an analytics and data science leader, you need to decide how much of this process to automate.
A no-code environment, on the other hand, relies on visual tools and drag-and-drop functionality. The BI and analytics team, or less experienced data scientists, will prefer an enterprise platform with AutoML 2.0 capabilities: end-to-end data science automation covering data preparation, automated feature engineering, machine learning, and one-click model deployment.
Here is a quick rundown of five significant attributes to consider when evaluating DSML platforms:
Data Ingestion and Preparation
How much manipulation of data must be performed before it is ready for ingestion by the DSML platform? Can you upload data to the platform without having to write additional SQL code?
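As a rough litmus test, "ready for ingestion" should mean that a raw tabular export loads and cleans up with a few lines of general-purpose code rather than a SQL staging pipeline. Here is a minimal sketch in plain Python; the file contents, column names, and default-value choices are hypothetical:

```python
import csv
import io

# Hypothetical raw export: a platform with good ingestion should accept
# this as-is, without a SQL staging step.
raw = io.StringIO(
    "customer_id,age,monthly_spend\n"
    "c001,34,120.50\n"
    "c002,,80.00\n"  # missing age: a preparation step must handle it
    "c003,51,210.75\n"
)

rows = list(csv.DictReader(raw))

# Light preparation: coerce types and mark the missing value explicitly.
for row in rows:
    row["age"] = int(row["age"]) if row["age"] else None
    row["monthly_spend"] = float(row["monthly_spend"])

ages = [r["age"] for r in rows if r["age"] is not None]
print(len(rows), sum(ages) / len(ages))  # 3 42.5
```

The less of this glue code the platform forces on your team, the faster the path from raw data to a first model.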
Feature Engineering Automation
How much manual work is involved in feature engineering? Will the platform support automated feature engineering, and can the AI engine automatically explore all available database entity relationships and discover and evaluate features based on available columns and relationships?
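To make the idea concrete, here is a toy sketch of what automated feature discovery over an entity relationship looks like: given a one-to-many relationship between a parent table and a child table, the engine enumerates candidate aggregate features per parent key. All table names, column names, and the small set of aggregates are hypothetical stand-ins for what a real engine would explore:

```python
from collections import defaultdict

# Hypothetical tables: a flat "customers" entity and a related
# one-to-many "orders" entity, as a feature-discovery engine might see them.
customers = [{"customer_id": "c001"}, {"customer_id": "c002"}]
orders = [
    {"customer_id": "c001", "amount": 10.0},
    {"customer_id": "c001", "amount": 30.0},
    {"customer_id": "c002", "amount": 5.0},
]

# Candidate aggregates to apply to each numeric child column.
AGGS = {"count": len, "sum": sum, "mean": lambda v: sum(v) / len(v)}

# Group the child rows by the relationship key.
grouped = defaultdict(list)
for o in orders:
    grouped[o["customer_id"]].append(o["amount"])

# Enumerate candidate features per parent row.
features = {}
for c in customers:
    vals = grouped.get(c["customer_id"], [])
    features[c["customer_id"]] = {
        f"orders_amount_{name}": (fn(vals) if vals else 0.0)
        for name, fn in AGGS.items()
    }

print(features["c001"])
# {'orders_amount_count': 2, 'orders_amount_sum': 40.0, 'orders_amount_mean': 20.0}
```

A production engine would do this across every relationship and column it can find, then evaluate which generated features actually improve the model; the point is that none of this enumeration needs to be written by hand.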
Machine Learning Automation
Does the system support automated machine learning and state-of-the-art ML libraries such as scikit-learn, XGBoost, LightGBM, TensorFlow, and PyTorch? Can users perform an automated hyper-parameter search across machine learning algorithms?
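Under the hood, an automated hyper-parameter search is conceptually simple: sample candidate configurations from a search space, score each against validation data, and keep the best. A minimal random-search sketch in plain Python, where the scoring function is a hypothetical stand-in for "train a model with these hyper-parameters and score it on a hold-out set":

```python
import random

random.seed(0)

# Hypothetical validation-score surface with a best point near
# learning_rate=0.1, max_depth=6 (higher is better).
def validation_score(learning_rate, max_depth):
    return -((learning_rate - 0.1) ** 2) - 0.01 * (max_depth - 6) ** 2

# Search space: each entry is a sampler for one hyper-parameter.
search_space = {
    "learning_rate": lambda: 10 ** random.uniform(-3, 0),  # log-uniform
    "max_depth": lambda: random.randint(2, 12),
}

# Random search: sample, score, keep the best, within a fixed trial budget.
best_params, best_score = None, float("-inf")
for _ in range(50):
    params = {name: sample() for name, sample in search_space.items()}
    score = validation_score(**params)
    if score > best_score:
        best_params, best_score = params, score

print(best_params, round(best_score, 4))
```

Real platforms replace the toy scorer with actual model training and often use smarter strategies than random sampling (e.g., Bayesian optimization), but the loop above is the essence of what "automated hyper-parameter search" means.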
Model Deployment and Monitoring
How easy is it to deploy machine learning models in a production environment? Can you monitor models, detect model drift, and quickly retrain models if production data changes over time?
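One common way platforms detect drift is to compare a feature's production distribution against its training distribution with the Population Stability Index (PSI). A self-contained sketch, assuming the widely cited rule of thumb that PSI above 0.2 signals drift worth a retrain; the sample data is hypothetical:

```python
import bisect
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of one feature."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def frac(sample):
        counts = [0] * bins
        for x in sample:
            counts[bisect.bisect_right(edges, x)] += 1
        # Small floor avoids log(0) for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical feature values at training time vs. in production.
train = [x / 100 for x in range(100)]               # uniform on [0, 1)
prod_same = [x / 100 for x in range(100)]           # unchanged
prod_shifted = [0.5 + x / 200 for x in range(100)]  # mass moved right

print(round(psi(train, prod_same), 3))     # 0.0 — no drift
print(psi(train, prod_shifted) > 0.2)      # True — retrain candidate
```

A platform that computes checks like this per feature, on a schedule, and wires the result to an automated retraining pipeline closes the loop that manual deployments leave open.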
Platform Integration, Ease of Use, and Deployment Flexibility
Can all steps of the data science process be executed seamlessly within a single platform without the need for moving between systems and applications?
Irrespective of which vendor you prefer, the most important thing to keep in mind is the user. Is it easy for non-data scientists to understand the application's workflow, its concepts, and the steps needed to proceed?