Mentoring for Data Science

Coaching and mentoring are very important to me: both have guided me throughout my career in data and, more recently, as a Data Scientist. The two are closely related, but there are some key differences between them. In this article I’ll explain those differences and how you can benefit from having a coach, a mentor, or both.

What is mentoring?

Wikipedia says: “Mentorship is the influence, guidance, or direction given by a mentor. A mentor is someone who teaches or gives help and advice to a less experienced and often younger person. In an organisational setting, a mentor influences the personal and professional growth of a mentee. Most traditional mentorships involve having senior employees mentor more junior employees, but mentors do not necessarily have to be more senior than the people they mentor. What matters is that mentors have experience that others can learn from.”

How can mentoring help you?

Mentoring is like a student-teacher relationship. In Data Science, a mentor can be a Senior or Consultant Data Scientist who answers your questions and raises questions you hadn’t considered, to help you meet a specific objective, build a skill, or master a particular tool.

I had mentors while working for Symantec and DELL, as both companies ran formal mentoring programs. The “issue” was that these programs were not specific to Analytics or Data Science, and at that time I needed domain-specific support. To fill this need, I turned to LinkedIn (which tries to promote professionals as mentors), but the people I contacted either did not reply or answered very briefly. I looked deeper and found MentorCruise, a website dedicated to mentoring services. I found them cool and registered as a mentor myself. You can find my profile there. Having a mentor gives you confidence, and being one brings the great satisfaction of helping professionals similar to you. At the moment I’m mentoring two data professionals.
I really like the interaction and the feedback I have received so far. You can also sign up to become a mentor on MentorCruise.

While working for Symantec, I attended a three-day course on coaching and got to experience being both a coach and a coachee. Nowadays there are more in-depth specialisations, and coaching is a profession in its own right. I wanted you to learn about coaching from the best, so I asked a former colleague of mine, a trained coach, to write a guest post for www.thebabydatascientist.com.

Claudia is a Positive Psychologist and Coach specialising in career and leadership coaching for people in tech. She is based in Ireland and works globally with people who want to be happier at work.

Let’s see what Claudia advises:

What is coaching?

Coaching can be defined in many ways. In essence, coaching is a space for personal development through a thought-provoking and creative process that inspires you to maximise your personal and professional potential (International Coaching Federation).

In the business world, workplace coaching is often used as an effective tool to optimise performance and unlock untapped potential. Often, the goal of coaching interventions in the workplace is to build team cohesion, raise employee productivity and motivation, and develop authentic leadership skills.

However, coaching is much more than a tool to increase your productivity. Coaching is a journey towards greater self-awareness, a space to help you learn more about yourself. Typically you will gain a greater understanding of your values and strengths and learn how they impact your behaviour, your thought processes and your emotions. There is great power in understanding that triangular relationship: it is the starting point for self-determined behavioural change.

How is coaching different from mentoring?

Coaching is built on the core belief that you, the coachee or client, already have all the answers, skills and knowledge within yourself to reach your goals.
The coach’s role, then, is to help you uncover the answers you are seeking through a non-directive and non-judgmental dialogue.

That means coaching is less directive than mentoring. It is not about giving advice; it is about empowering you to discover what is best for you, in work and in life. While a mentor shares their own experiences and industry knowledge with the mentee, a coach is not necessarily an expert in your line of work. A coach asks you thought-provoking questions, offers new perspectives and guides you towards new insights that empower you to create a path forward that is right for you.

How can coaching help you?

Coaching is a very versatile approach to personal development and can help you achieve goals, make difficult decisions or face challenges with more confidence. Below are a few examples of typical coaching topics for career coaching.

How coaching can help you in your career

Qualified and accredited coaches often specialise in a topic or in the career stage a client is in. Here are some examples of what career coaching can help you with:

Career coaching. This can be anything from finding career clarity and building your confidence to interview effectively, to supporting you in managing up, progressing in your career or building your own brand.

Leadership coaching. Leadership coaches work with new and established leaders and often act as an independent sounding board for their clients. Typical topics in leadership coaching include finding your authentic leadership style, building confidence as a leader, communicating assertively, motivating employees, and building high-performing teams.

Career and leadership coaching is ideal for people who suffer from imposter feelings, have trouble creating or maintaining a healthy work-life balance, or find it difficult to be themselves at work.

Your coaching readiness checklist

Want to find out whether you are ready to engage a coach?
Here is a handy checklist to determine your coaching readiness:

I am determined to make a change
I am ready to ask myself

Data Science lifecycle and steps

The lifecycles below will guide you from the initial phase of a Data Science project through to its successful completion. They will enable you to divide the work within the team, estimate effort, document every step of your project, and set realistic expectations for the project stakeholders. I believe that implementing a standard process model should be the Data Science norm, not the exception.

CRISP-DM for Data Science

I’ve been using CRISP-DM (Cross-Industry Standard Process for Data Mining) as the process model for my Data Science project work for a few years, and I can confirm that it works. The process consists of 6 major steps, and all the Data Science sub-tasks can be mapped as below:

Data Science Lifecycle

While studying for the DELL EMC Data Science Associate Exam, I learned that DELL also recommends a Data Science lifecycle. In the course, the lifecycle is also divided into 6 phases, named differently but serving the same functions: Discovery – Data Prep – Model Planning – Model Building – Communicate Results – Operationalize. The Data Science lifecycle is an iterative process: you move forward to the next phase once sufficient information is available.

1. Business Understanding

Tasks: determine business objectives and goals; assess the situation; produce a project plan.

It might seem like a lot of documentation for an initial phase (and this is considered one of the few weaknesses of CRISP-DM), but you should at least produce a formal one-pager with the signed-off Business Case that clearly states both the business and the machine learning objectives and lists past information related to similar efforts. This exercise will help you prioritize items in your backlog and protect you from scope creep.

Pic. Source: The Machine Learning Project Checklist

2. Data Understanding

Tasks: collect initial data; describe data; explore data; verify data quality.

This is the “scary”, time-consuming and crucial phase.
I would sum it up as ETL/ELT + EDA = ♥

Acquiring data can be complex when it originates from both internal and external sources, in structured and unstructured formats, and you don’t have the help of a Data Engineer (they are rare unicorns nowadays). Without good-quality data, Machine Learning projects become useless. This part is missing from most online Data Science courses and competitions, so Data Scientists should learn how to do ETL/ELT by themselves. EDA (Exploratory Data Analysis) has become simpler, as it can be performed automatically with packages like pandas-profiling, Sweetviz, D-Tale and AutoViz in Python, or DataExplorer, GGally, SmartEDA and tableone in R. Some teams use a mix of automated and custom EDA features.

3. Data Preparation

Tasks: select data; clean data; construct data; integrate data; format data.

This is another code-intensive phase. It means getting to know your dataset in great detail, so that you have the confidence to select data for modeling, clean it and perform feature engineering. The EDA stage will tell you whether you have missing data, outliers, highly correlated features, and so on, but in the Data Prep stage you have to decide what to do about them, for example which imputation strategy to apply. You might go back to Business Understanding if you form a new hypothesis or need confirmation from Subject Matter Experts.

Feature engineering is a hot topic here. Until recently, this was a tedious, mostly manual task that involved creating new variables from the available features using data wrangling techniques, typically with pandas in Python or dplyr in R. Nowadays, frameworks like Featuretools save the day. I wouldn’t rely 100% on an automated feature engineering tool, but overall it’s a nice addition to a Data Science project.

Feature selection

Feature importance: some algorithms, like random forests or XGBoost, allow you to determine which features were the most “important” in predicting the target variable’s value.
By quickly creating one of these models and computing feature importance, you’ll get an understanding of which variables are more useful than others.

Dimensionality reduction: one of the most common dimensionality reduction techniques, Principal Component Analysis (PCA), takes a large number of features and uses linear algebra to reduce them to fewer features.

By the end of this phase, you’ve completed 70% to 80% of the project. But the fun begins in the next phase:

4. Modeling

Tasks: select modeling techniques; generate test design; build model; assess model.

Once it is clear which algorithm to try (or try first), you’ll have to: from sklearn.model_selection import train_test_split. Splitting your dataset into train and test sets is key to building a performant model. You’ll build the model on the training dataset (usually 70% of your data) and check how it performs on the test dataset (the remaining 30%). Don’t forget to set random_state (the seed): it lets you reproduce the same random split.

Depending on the complexity of the model, building it can be as quick as writing 2 lines of code. In Python, scikit-learn is your model-building library. In R you have packages like caret, e1071, xgboost, randomForest, etc. While assessing the accuracy of your model, you might decide to go back to the previous steps and iterate. Time spent on modeling is subjective, as models can always be improved with more tuning, but if time is more valuable than the percentage increase in accuracy, you’ll want to move on to the next step. A rule of thumb in modeling is to shortlist the models that reach at least 70% accuracy for unsupervised and 80% for supervised learning. You should also look at the loss function, the decision threshold, the confusion matrix, and sensitivity/specificity.

5. Evaluation

Tasks: evaluate results; review process; determine the next steps.

At this step, you’ll have to decide which model to select, using tools like ROC curves, the number of features, and business feedback.
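
As a rough illustration of the modeling and evaluation steps described above, here is a minimal sketch using scikit-learn. The synthetic dataset from make_classification, the choice of a random forest, the stratified split and the value 42 for random_state are all assumptions made for the example; they stand in for whatever data and algorithm your own project uses.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real, prepared dataset (illustration only).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=4, random_state=42
)

# 70/30 train/test split; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Building the model itself takes about two lines.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out test set with the area under the ROC curve.
test_scores = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, test_scores)
print(f"Test ROC AUC: {auc:.3f}")

# Rank features by the importance the random forest assigned to them.
ranked = sorted(
    enumerate(model.feature_importances_), key=lambda pair: pair[1], reverse=True
)
for index, importance in ranked[:3]:
    print(f"feature_{index}: {importance:.3f}")
```

Re-running the script with the same random_state gives the same split, which keeps results comparable from one iteration to the next.
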
You can find more reading on this topic here: I linked this blog as it’s written by a Data Scientist with hands-on experience.

6. Deployment

Tasks: plan deployment; plan monitoring and maintenance; produce final report; review project.

While preparing for deployment, you should create a final report and document whether the model met its objectives, start monitoring model stability and accuracy, and when retrain should


New job titles are confusing. When dad asked me what I do for a living, I told him that I’m a Data Scientist. By his look, I understood that a job title wasn’t enough to explain what I do, so I added: “I use current and past information to predict the future”. He seemed fascinated, and we continued the conversation with some work examples. Dad is my biggest supporter, so he asked how he could help promote my new project. So I created a small experiment for dad on my blog: I wanted to see how good I was at explaining what a Data Scientist is to someone with no connection to the industry. I created the Mini Data Science Quiz: 3 questions that can be answered in less than 1 minute. Dad was 70% correct. In Data Science, if the model accuracy is above 70%, you’ve got a decent model.

(Note: I participate in the Amazon affiliate program. This post may contain affiliate links from Amazon or other publishers I trust (at no extra cost to you). I may receive a small commission when you buy using my links; this helps to keep the blog alive! See disclosure for details.)

If you want to educate yourself in Data Science or start a career in Data Science, put these 3 materials on your list:

R for Data Science
Python Data Science Handbook
Doing Data Science with Python

Can you do better than dad? Take the Mini Data Science Quiz.

This is a personal blog. My opinion on what I share with you is that “All models are wrong, but some are useful”. Improve the accuracy of any model I present and make it useful!

DELL EMC Data Science Certified Associate

Since 2019, interest in Data Science education and accreditation has skyrocketed. Have a look below at the Google Trends comparison of two search terms, Data Science course vs Data Science certificate:

This shows that many people are looking for formal training in Data Science.

Generally speaking, a technical certification will be somewhat attractive on a CV, but a certification alone will not secure you a role. The majority of Data Science interviews include at least one technical test and multiple discussions. Some interviewers might even question you more closely on the topics covered by the certification.

To boost my confidence in my Data Science skills, I also decided to pursue a Data Science certification. I did my “Google research” and was pleasantly surprised by the results: the DELL EMC program ranked high in searches for the top Data Science certifications. As a Dell employee, this meant I could access multiple learning materials to prepare for the exam.

Structure

Dell offers a two-level Data Science certification: Associate and Specialist. The Associate level exam consists of 60 questions, and you have 90 minutes to answer them.
The minimum score to pass the exam is 63 and the topics assessed are:

MapReduce (15%): MapReduce framework and its implementation in Hadoop; Hadoop Distributed File System (HDFS); Yet Another Resource Negotiator (YARN)
Hadoop Ecosystem and NoSQL (15%): Pig; Hive; NoSQL; HBase; Spark
Natural Language Processing (NLP) (20%): NLP and the four main categories of ambiguity; Text Preprocessing; Language Modeling
Social Network Analysis (SNA) (23%): SNA and Graph Theory; Communities; Network Problems and SNA Tools
Data Science Theory and Methods (15%): Simulation; Random Forests; Multinomial Logistic Regression and Maximum Entropy
Data Visualization (12%): Perception and Visualization; Visualization of Multivariate Data

I took the Associate level exam recently (in January 2022) and I am currently studying for the Specialist level, so it is an ideal time to write about my learning and exam experiences.

Learning

The official website page for the exam and course info has details about the On Demand classes they offer, the exam link and practice tests. More sample questions and additional online practice tests are also available. The Data Science and Big Data Analytics course prepares you for the Data Scientist Associate v2 (DCA-DS) Certification. Once you pass the exam, you receive a Dell Technologies Certified Associate (DCA-DS) Certification.

Why is the Data Scientist Associate v2 (DCA-DS) a good certification for a junior data scientist?

Going through the topics included in the material will give you a good foundation in data science terminology. It gives an introduction to what big data is, the most basic algorithms, and an understanding of the responsibilities of a Data Scientist and the data science lifecycle. Learning all of this will enable immediate and effective participation in big data and other analytics projects.
You’ll get hands-on experience with Hadoop (including Pig, Hive, and HBase), Natural Language Processing, Social Network Analysis, Simulation, Random Forests, Multinomial Logistic Regression, and Data Visualization. The labs will prepare you to do data processing, apply algorithms and build data visualizations in R. It will empower you to keep studying and move on to the next-level certificate, as the DCA-DS Certification is a prerequisite for DCS-DS. The Advanced Methods in Data Science and Big Data Analytics course prepares you for the Specialist – Data Scientist, Advanced Analytics Version 1.0 (DCS-DS) Certification.

If you don’t have the availability to sit in a class for a full week (8 hours a day), you can study for the exam at your own pace. Dell EMC published the book below to help you prepare for the exam. It is rated very highly and it’s now discounted on Amazon:

When are you getting one of these?

If this is not motivation enough, I’ll leave below an interesting TED Talk on the influence of social networks (one of the topics of the course and exam), and you’ll see why Data Science is so cool: