The lifecycles below will guide you from the initial phase of a Data Science project through to its successful completion. They will enable you to divide the work within the team, estimate effort, document all the steps of your project, and set realistic expectations for the project stakeholders.
I believe that implementing a standard process model should be the Data Science norm, not the exception.
CRISP-DM for Data Science
I’ve been using CRISP-DM (Cross-Industry Standard Process for Data Mining) as a process model for my Data Science project execution work for a few years, and I can confirm that it works.
The process consists of 6 major steps and all the Data Science sub-tasks can be mapped as below:
Data Science Lifecycle
While studying for the DELL EMC Data Science Associate Exam, I learned that DELL also recommends a Data Science lifecycle. In the course, the Data Science lifecycle is also divided into 6 phases, named differently, but having the same functions: Discovery – Data Prep – Model Planning – Model Building – Communicate Results – Operationalize.
The Data Science lifecycle is an iterative process; you’ll move through the phases if sufficient information is available.
1. Business Understanding
- Determine business objectives and goals
- Assess situation
- Produce project plan
It might seem like you need to do a lot of documentation even in the initial phase (and this is considered one of the few weaknesses of CRISP-DM), but you should at least produce a formal one-pager with the signed-off Business Case that clearly states both the business and machine learning objectives and lists past information related to similar efforts. This exercise will help you prioritize items in your backlog and protect you from scope creep.
Pic. Source: The Machine Learning Project Checklist
2. Data Understanding
- Collect initial data
- Describe data
- Explore data
- Verify data quality
This is the “scary”, time-consuming and crucial phase. I would sum it up as ETL/ELT + EDA = ♥
Acquiring data can be complex when it originates from both internal and external sources, in structured and unstructured formats, and without the help of a Data Engineer (they are Rare Unicorns nowadays). Without good-quality data, Machine Learning projects become useless. This part is missing from most online Data Science courses and competitions, so Data Scientists should learn how to do ETL/ELT by themselves.
EDA (Exploratory Data Analysis) has become simpler, as it can be performed automatically with packages like pandas-profiling, Sweetviz, D-Tale, and AutoViz in Python or DataExplorer, GGally, SmartEDA, and tableone in R. Some teams will use a mix of automated and custom EDA features.
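For the custom side of that mix, a few lines of pandas already cover the basics. The sketch below is only an illustration: the toy dataset, its column names, and the `quick_eda` helper are all made up for this example, not part of any of the packages above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names are invented for illustration.
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51, 38],
    "income": [32000, 45000, 81000, 52000, np.nan, 60000],
    "segment": ["a", "b", "b", "a", "c", "b"],
})


def quick_eda(frame: pd.DataFrame) -> dict:
    """Collect a few basic EDA summaries for a DataFrame."""
    numeric = frame.select_dtypes(include="number")
    return {
        "shape": frame.shape,                            # rows, columns
        "dtypes": frame.dtypes.astype(str).to_dict(),    # type of each column
        "missing_share": frame.isna().mean().to_dict(),  # share of NaNs per column
        "summary": numeric.describe(),                   # count, mean, std, quartiles
        "correlations": numeric.corr(),                  # pairwise Pearson correlation
    }


report = quick_eda(df)
print(report["shape"])
print(report["missing_share"])
print(report["summary"])
```

The automated packages go much further (plots, warnings, interactions), so treat this as a starting point for checks you want to keep under version control.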
3. Data Preparation
- Select data
- Clean data
- Construct data
- Integrate data
- Format data
This is another code-intensive phase. It requires knowing your dataset in great detail, which gives you the confidence to select data for modeling, clean it, and perform feature engineering. The EDA stage will tell you if you have missing data, outliers, highly correlated features, and so on, but in the Data Prep stage you have to decide what to do about them, for example which imputation strategy to apply. You might go back to Business Understanding if you form a new hypothesis or need confirmation from Subject Matter Experts.
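As one example of such a decision, here is a minimal sketch of median imputation with scikit-learn's `SimpleImputer`. The data and column names are invented, and the choice of the median over the mean is an assumption made because the toy data contains an outlier; your own dataset may call for a different strategy.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with missing values; the last income is an outlier.
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0],
    "income": [32000.0, np.nan, 81000.0, 1_000_000.0],
})

# The median is less sensitive to outliers than the mean.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(imputed)
```

In a real project the imputer should be fitted on the training data only and then applied to the test data, so that no information leaks from the test set.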
Feature engineering is a hot topic here. Until recently, this was a tedious, mostly manual task that involved creating new variables from the available features using data wrangling techniques, typically with pandas in Python and dplyr in R. Nowadays, frameworks like Featuretools will save the day. I wouldn’t rely 100% on an automated feature engineering tool, but overall, it’s a nice addition to a Data Science project.
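To make the manual approach concrete, here is a small pandas sketch that derives per-customer features from raw orders. The table, the column names, and the reference date are all hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical transaction table; names and values are invented.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "order_date": pd.to_datetime(
        ["2023-01-05", "2023-03-10", "2023-02-01", "2023-02-15", "2023-04-20"]
    ),
    "amount": [100.0, 50.0, 20.0, 30.0, 40.0],
})

# Aggregate raw rows into one row of features per customer.
features = (
    orders.groupby("customer_id")
    .agg(
        order_count=("amount", "size"),
        total_spend=("amount", "sum"),
        avg_spend=("amount", "mean"),
        last_order=("order_date", "max"),
    )
    .reset_index()
)

# Recency feature, measured against a fixed (assumed) reference date.
reference_date = pd.Timestamp("2023-05-01")
features["days_since_last_order"] = (reference_date - features["last_order"]).dt.days

print(features)
```

Automated tools generate many such aggregations at once; writing a few by hand first makes it easier to judge which of the generated features make business sense.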
- Feature importance: some algorithms like random forests or XGBoost allow you to determine which features were the most “important” in predicting the target variable’s value. By quickly creating one of these models and conducting feature importance, you’ll get an understanding of which variables are more useful than others.
- Dimensionality reduction: One of the most common dimensionality reduction techniques, Principal Component Analysis (PCA) takes a large number of features and uses linear algebra to reduce them to fewer features.
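Both techniques can be sketched in a few lines of scikit-learn. The example below runs on synthetic data from `make_classification`; the number of features, the number of trees, and the choice of 3 principal components are arbitrary assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, of which only a few carry signal.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=2, random_state=42
)

# Feature importance: fit a quick random forest and rank the features.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)
importances = forest.feature_importances_
ranking = importances.argsort()[::-1]  # feature indices, most important first
print("Features ranked by importance:", ranking)

# Dimensionality reduction: scale first, because PCA is sensitive to feature scale.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_scaled)
print("Reduced shape:", X_reduced.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

Note the trade-off: feature importance keeps the original, interpretable variables, while PCA replaces them with combinations that are harder to explain to the business.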
By the end of this phase, you’ve completed 70% – 80% of the project. But the fun begins in the next phase:
4. Modeling
- Select modeling techniques
- Generate test design
- Build model
- Assess model
Once it is clear which algorithm to try (or try first), you’ll have to split your data:
from sklearn.model_selection import train_test_split
Splitting your dataset into training and test sets is key to building a performant model. You’ll build the model on the training dataset (usually 70% of your data) and check how it performs on the test dataset (the remaining 30%).
Don’t forget to set random_state (the seed).
This parameter lets you reproduce the same random split.
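A minimal sketch of the 70/30 split with a fixed seed follows. The Iris dataset and the seed value 42 are arbitrary choices for illustration; any dataset and any integer seed would do.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris is used here only as a convenient built-in example dataset.
X, y = load_iris(return_X_y=True)

# 70% train / 30% test; random_state fixes the shuffle so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Running the split again with the same seed returns identical rows.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)
print((X_test == X_test_again).all())
```

For imbalanced classification targets, passing `stratify=y` keeps the class proportions the same in both sets.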
Depending on the complexity of the model, building it can be as quick as writing 2 lines of code. For Python, scikit-learn is your model-building library. In R, you have packages like caret, e1071, xgboost, randomForest, etc.
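To show what those two lines can look like in scikit-learn, here is a small sketch. The dataset (Iris) and the algorithm (logistic regression) are assumptions picked for brevity; the data loading and splitting around the two lines is setup.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Setup: example data and a reproducible 70/30 split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The two lines that actually build the model.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Mean accuracy on the held-out test set.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```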
While assessing the accuracy of your model, you might decide to go back to the previous step(s) and iterate. Time spent on modeling is subjective, as models can always be improved with more tuning, but if time is more valuable than the % increase in accuracy, you’ll want to move to the next step.
A rule of thumb in modeling is to shortlist the models that reach at least 70% accuracy for unsupervised and 80% for supervised learning. You should also look at the loss function, the classification threshold, the confusion matrix, and sensitivity/specificity.
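The sketch below shows how the threshold, the confusion matrix, and sensitivity/specificity relate to each other for a binary classifier. The dataset (scikit-learn's breast cancer data), the model, and the three thresholds are assumptions for illustration; `rates_at_threshold` is a hypothetical helper.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)


def rates_at_threshold(threshold: float):
    """Return (sensitivity, specificity) on the test set for a given cut-off."""
    proba = model.predict_proba(X_test)[:, 1]
    y_pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity


for threshold in (0.3, 0.5, 0.7):
    sens, spec = rates_at_threshold(threshold)
    print(f"threshold={threshold}: sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Lowering the threshold can only raise sensitivity and lower specificity (or leave them unchanged), which is why the cut-off should follow from the business cost of each type of error rather than default to 0.5.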
5. Evaluation
- Evaluate results
- Review process
- Determine the next steps
At this step, you’ll have to decide which model to select using tools like ROC Curves, the number of features, and also business feedback.
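As a sketch of the ROC part of that decision, the example below compares two candidate models by the area under their ROC curves. The dataset and the two candidates are assumptions chosen for illustration; a higher AUC alone does not settle the choice, since the number of features and business feedback also count.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Two hypothetical candidates from the modeling phase.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

auc_scores = {}
roc_points = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    auc_scores[name] = roc_auc_score(y_test, proba)
    # fpr and tpr are the points of the ROC curve, ready for plotting.
    fpr, tpr, _ = roc_curve(y_test, proba)
    roc_points[name] = (fpr, tpr)

best_name = max(auc_scores, key=auc_scores.get)
print(auc_scores)
print("Highest AUC:", best_name)
```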
You can find more reading on this topic here: I linked this blog as it’s written by a Data Scientist, with hands-on experience.
6. Deployment
- Plan deployment
- Plan monitoring and maintenance
- Produce final report
- Review project
While preparing for deployment, you should create a final report that documents whether the model met its objectives, start monitoring model stability and accuracy, and define when retraining should be triggered.
Also, always communicate back to business.
You probably noticed that I used the term “automated” several times in this article. Does this mean that Machine Learning can be 100% automated and that Data Scientists will not be in demand in the future? I believe quite the contrary: junior Data Scientists will be able to ramp up very quickly using tools like:
- Dataiku DSS
- Amazon SageMaker
- Google Cloud AutoML
- Qlik AutoML
- Azure Machine Learning Studio
Senior Data Scientists will also benefit from using autoML tools, especially when they’re supposed to scale ML at a fast pace. In this case, autoML is at the heart of MLOps: the practice of applying DevOps tools and techniques to Data Science / Machine Learning workflows to make the process efficient and reproducible.
MLOps is something new in my life. I’m still trying to understand the nuts and bolts of the whole concept. I invested in 2 O’Reilly books:
They helped me conclude and map the below MLOps Lifecycle:
MLOps means scaling end-to-end Data Science products in a reliable and automated way. Deploying a Data Science product is not always straightforward, so just imagine what it means to deploy several Data Science solutions each month. Without an automated way of running the above lifecycle, the whole process would be slow, and prone to error and dependencies.
Prepare for production
- Runtime environment
- Risk evaluation
If it works on your machine, it’s time to check whether it will work in production. First check that this is technically possible in a development environment, and only then send the project to production. In an ideal scenario, the Data Scientist exports the model and the Data Engineer deploys it. Real-life scenarios are more complex: due to a lack of resources, the Data Scientist might end up re-coding the model in a different programming language to be able to put it in production.
At this stage, the format required by the production environment should be agreed upon and tested. On success, the conversion of the model will be added as a step post the modeling stage.
Pic. Source: ML Infrastructure Tools for Production
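One way to test an agreed export format is a simple round trip: export the model, load it back, and confirm that the predictions are unchanged. The sketch below assumes joblib as the format purely for illustration; your production environment may require something else (for example ONNX or PMML), and the file name is made up.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a small example model.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Export the model and load it back, as the production side would.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")  # hypothetical artifact name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model must give exactly the same predictions.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print("Round trip preserved predictions:", same_predictions)
```

Keep in mind that joblib and pickle files are tied to the library versions that wrote them, so the production runtime should pin the same scikit-learn version.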
Another aspect to manage in the preparation for the production is data access: test internal and external (if applicable) data connections and configurations required.
The Risk evaluation milestone refers to the model risk. It has to be assessed and documented, with all risks logged and addressed or mitigated in time.
QA: preferably, adopt the mature tools and frameworks that classic software engineering uses.
Watch the video below to learn what unit tests look like for Data Scientists:
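As a written companion, here is a minimal sketch of unit tests for a data-cleaning step, using Python's built-in unittest module. The `clean_ages` function and its 0 to 120 validity range are hypothetical; the point is that data transformations can be tested like any other code.

```python
import unittest

import numpy as np
import pandas as pd


def clean_ages(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning step: null out impossible ages, then fill with the median."""
    cleaned = frame.copy()
    cleaned.loc[~cleaned["age"].between(0, 120), "age"] = np.nan
    cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
    return cleaned


class CleanAgesTest(unittest.TestCase):
    def setUp(self):
        self.raw = pd.DataFrame({"age": [25.0, -3.0, 40.0, np.nan, 300.0]})

    def test_no_missing_values_remain(self):
        self.assertFalse(clean_ages(self.raw)["age"].isna().any())

    def test_values_are_in_valid_range(self):
        self.assertTrue(clean_ages(self.raw)["age"].between(0, 120).all())

    def test_input_is_not_mutated(self):
        before = self.raw.copy()
        clean_ages(self.raw)
        pd.testing.assert_frame_equal(self.raw, before)


# Run the tests programmatically so the script keeps running afterwards.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleanAgesTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("All tests passed:", result.wasSuccessful())
```

In a real repository these tests would live in their own file and run automatically on every push.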
Development to production
- Elastic scaling
- Containerization (I’ll write more about this in a new blog post)
- CI/CD Pipelines
CI/CD is a key DevOps acronym: it stands for continuous integration and continuous delivery (or deployment). In MLOps, after full deployment, the Data Scientist should push the code, metadata, and documentation to a central repository and trigger a CI/CD Pipeline.
This book shares an example pipeline:
- Build the model
  - Build the model artifacts
  - Send the artifacts to long-term storage
  - Run checks
  - Generate fairness and explainability reports
- Deploy to a test environment
  - Run tests to validate ML & computational performance
  - Validate manually
- Deploy to a production environment
  - Deploy the model as canary
  - Fully deploy the model
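The key property of such a pipeline is that stages run in order and a failed stage blocks everything after it. The pure-Python sketch below illustrates only that gating logic: the stage functions, the context dictionary, and the accuracy threshold are all hypothetical, and a real pipeline would be defined in your CI/CD tool instead.

```python
from typing import Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Dict], bool]]


def run_pipeline(stages: List[Stage], context: Dict) -> List[str]:
    """Run stages in order; stop at the first one that reports failure."""
    completed = []
    for name, step in stages:
        if not step(context):
            print(f"Stage failed: {name}. Stopping the pipeline.")
            break
        completed.append(name)
    return completed


# Hypothetical stage implementations; real ones would call build,
# test, and deployment tooling.
def build_model(ctx: Dict) -> bool:
    ctx["artifact"] = "model-v1"
    return True


def run_checks(ctx: Dict) -> bool:
    return ctx["accuracy"] >= ctx["min_accuracy"]


def deploy_to_test(ctx: Dict) -> bool:
    ctx["environment"] = "test"
    return True


def deploy_canary(ctx: Dict) -> bool:
    ctx["environment"] = "production-canary"
    return True


def deploy_fully(ctx: Dict) -> bool:
    ctx["environment"] = "production"
    return True


stages = [
    ("build model", build_model),
    ("run checks", run_checks),
    ("deploy to test", deploy_to_test),
    ("deploy canary", deploy_canary),
    ("deploy fully", deploy_fully),
]

good_run = run_pipeline(stages, {"accuracy": 0.91, "min_accuracy": 0.80})
bad_run = run_pipeline(stages, {"accuracy": 0.55, "min_accuracy": 0.80})
print(good_run)
print(bad_run)
```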
End-to-end ML development lifecycle with artifact tracking and ML metadata store
Monitoring and Feedback loop
- Logging / alerting
- Input drift tracking
- Performance drift
- Online evaluation
It is known in advance that once the model is in production, its performance will degrade over time. How long we allow the model to stay live before we update it is addressed by managing Performance Drift. Input drift tracking is also important, so that we can identify changes in new data (data types, a schema that does not match the training schema, percentage of missing values, NaNs, infinities) using measures such as the Population Stability Index (PSI) and the Characteristic Stability Index (CSI).
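As an illustration of input drift tracking, here is a minimal sketch of the Population Stability Index for a single numeric feature. The `population_stability_index` helper, the 10 quantile bins, and the simulated data are assumptions for this example; the sketch also assumes the inputs contain no NaNs, which should be tracked separately.

```python
import numpy as np


def population_stability_index(expected, actual, bins: int = 10) -> float:
    """PSI between a reference sample and a new sample of one numeric feature."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Interior bin edges come from quantiles of the reference (training) data.
    inner_edges = np.quantile(expected, np.linspace(0, 1, bins + 1))[1:-1]

    # Assign every value to a bin index in 0..bins-1 and count per bin.
    expected_counts = np.bincount(
        np.searchsorted(inner_edges, expected, side="right"), minlength=bins
    )
    actual_counts = np.bincount(
        np.searchsorted(inner_edges, actual, side="right"), minlength=bins
    )

    # Convert counts to shares; clip to avoid log(0) for empty bins.
    eps = 1e-6
    expected_share = np.clip(expected_counts / len(expected), eps, None)
    actual_share = np.clip(actual_counts / len(actual), eps, None)

    return float(
        np.sum((actual_share - expected_share) * np.log(actual_share / expected_share))
    )


# Simulated feature values: training data, stable new data, and drifted new data.
rng = np.random.default_rng(0)
training_feature = rng.normal(0.0, 1.0, 5000)
stable_feature = rng.normal(0.0, 1.0, 5000)
drifted_feature = rng.normal(1.0, 1.0, 5000)

psi_stable = population_stability_index(training_feature, stable_feature)
psi_drifted = population_stability_index(training_feature, drifted_feature)
print(f"PSI stable:  {psi_stable:.4f}")
print(f"PSI drifted: {psi_drifted:.4f}")
```

A commonly quoted convention, not taken from this article's sources, reads a PSI below 0.1 as stable and above 0.25 as a significant shift. Applying the same calculation to each input feature gives the CSI view.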
In the monitoring phase, we might consider any system upgrades or environment changes that might occur and look at GPU memory allocation, network traffic, and disk usage. You can read more on this here.
Logging and alerting will help us stay on top of the problems.
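A minimal sketch of what this can look like with Python's built-in logging module follows. The `check_performance` function, the logger name, and the 5-point allowed drop are hypothetical; in practice the warning would be routed to whatever alerting channel your team uses.

```python
import logging

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s"
)
logger = logging.getLogger("model_monitor")  # hypothetical logger name


def check_performance(
    current_accuracy: float, baseline_accuracy: float, max_drop: float = 0.05
) -> bool:
    """Log the current accuracy and return True when retraining should be considered."""
    drop = baseline_accuracy - current_accuracy
    if drop > max_drop:
        logger.warning(
            "Accuracy dropped by %.3f (baseline %.3f, current %.3f): consider retraining",
            drop, baseline_accuracy, current_accuracy,
        )
        return True
    logger.info("Accuracy within tolerance (current %.3f)", current_accuracy)
    return False


needs_retraining = check_performance(current_accuracy=0.80, baseline_accuracy=0.90)
still_healthy = check_performance(current_accuracy=0.88, baseline_accuracy=0.90)
```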
More reading material for you: check out this paper on machine learning-specific risk factors and design patterns to be avoided or refactored where possible.
This is a personal blog. My opinion on what I share with you is that “All models are wrong, but some are useful”. Improve the accuracy of any model I present and make it useful!