Get data from a Google BigQuery table using Python 3 – a specific task that took me a while to complete due to little or confusing online resources. You might ask why a Data Scientist was stuck on such a trivial Data Engineering task? Well… because most of the time… there is no proper Data Engineering support in an organization.

Steps to follow if you want to connect to Google BigQuery and pull data using Python:

Step 0. Ask your GCP admin to generate a Google Cloud secret key and save it in a json file.

Install the libraries:

pip install google-cloud-bigquery
pip install google-cloud
pip install tqdm
pip install pandas_gbq

Import the libraries:

import os
import io
from google.cloud.bigquery.client import Client
from google.cloud import bigquery
import pandas_gbq

Set your Google credentials (the json file created at Step 0):

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path to your json file/filename.json'

Define a BQ client:

client = bigquery.Client(project='yourprojectname')

Define the query and save it in a variable:

query = f"""
SELECT * FROM `projectname.tablename`;
"""

Use pandas_gbq to read the results and save them in a dataframe:

queryResultsTbl = pandas_gbq.read_gbq(query, project_id=project_id, dialect="standard")

Something like this:

import os
import io
from google.cloud.bigquery.client import Client
from google.cloud import bigquery
import pandas_gbq

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'google-key_Cristina.json'

project_id = "project-name"
client = bigquery.Client(project=project_id)

query = f"""
SELECT * FROM `project-name.table-name`;
"""

queryResultsTbl = pandas_gbq.read_gbq(
    query,
    project_id=project_id,
    dialect="standard"
)

queryResultsTbl.head(10)
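If you prefer to skip pandas_gbq, the BigQuery client itself can return a pandas DataFrame. Below is a minimal sketch under the same setup; the project, dataset and table names are placeholders, and to_dataframe() needs pyarrow (or db-dtypes) installed:

import os
from google.cloud import bigquery

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path to your json file/filename.json'

project_id = "project-name"
client = bigquery.Client(project=project_id)

query = """
SELECT *
FROM `project-name.dataset-name.table-name`
LIMIT 1000
"""

# to_dataframe() runs the query and converts the result into a pandas DataFrame
queryResultsTbl = client.query(query).to_dataframe()
print(queryResultsTbl.head(10))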
This is a personal blog. My opinion on what I share with you is that "All models are wrong, but some are useful". Improve the accuracy of any model I present and make it useful!
Get data from a GCS bucket using Python 3 – a specific task that took me a while to complete due to little or confusing online resources. You might ask why a Data Scientist was stuck on such a trivial Data Engineering task? Well… because most of the time… there is no proper Data Engineering support in an organization. Anyway, my challenge was that I ran out of RAM while using a Google Colab notebook to read Google Cloud Storage (GCS) data, so I wanted to work locally in Python to leverage my 32 GB of RAM. In Colab I used to run gsutil commands and everything was easier because I was working inside a Google environment.

Steps to follow if you want to pull GCS data locally:

Ask your GCP admin to generate a Google Cloud secret key and save it in a json file.

Install (pip install google-cloud-storage) & import the GCP library: from google.cloud import storage

Define a GCS client: storage_client = storage.Client()

Loop through the buckets you want to read data from (I've saved my buckets in a list). For each bucket you'll need to:
- get the bucket name and prefix
- create an iterator
- save the blobs
- identify the CSVs (in my case)
- download the CSVs
- save the data in a pandas dataframe

Something like this:

import io
import os
import re

import pandas as pd
from google.cloud import storage

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'google-key_Cristina.json'

storage_client = storage.Client()

BM_GCS_Buckets = open('BM_buckets.txt', 'r')
BM_GCS_Buckets = list(BM_GCS_Buckets)

def find_bucket(plant_name):
    plant = plant_name
    buckets = BM_GCS_Buckets
    result = ''
    for bucket in buckets:
        if re.findall(plant, bucket):
            result = bucket
            break
    return result

for i in ['ML', 'MP']:
    bucket_name = find_bucket(i).strip()
    bucket = 'bucketname'
    prefix_name = bucket_name.replace('gs://bucketname/', '') + '/'
    iterator = storage_client.list_blobs(bucket, prefix=prefix_name, delimiter='/')
    blobs = storage_client.list_blobs(bucket, prefix=prefix_name)
    files = [blob for blob in blobs if blob.name.endswith('.csv')]
    data_hist = pd.DataFrame()
    all_df = []
    for file in files:
        df = pd.read_csv(io.BytesIO(file.download_as_bytes()), index_col=0)
        all_df.append(df)
    data_hist = pd.concat(all_df)

This is a personal blog. My opinion on what I share with you is that "All models are wrong, but some are useful". Improve the accuracy of any model I present and make it useful!
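Side note for the GCS post above: if you already know the exact object paths, pandas can read straight from GCS once the gcsfs package is installed (pip install gcsfs), which replaces the blob-listing code. A minimal sketch, with placeholder bucket and file names, and authentication still coming from GOOGLE_APPLICATION_CREDENTIALS:

import pandas as pd

# Reads a single CSV directly from GCS; the bucket and path are placeholders
df = pd.read_csv('gs://bucketname/prefix/file.csv', index_col=0)
print(df.shape)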
Run automatic EDA (Exploratory Data Analysis) in Python. With 2 lines of Python code you'll get an HTML report with all the important EDA aspects you need to understand your raw data.

Install pandas_profiling: pip install pandas_profiling

Import pandas_profiling: from pandas_profiling import ProfileReport

Create the autoEDA report:

profile = ProfileReport(rawdataTbl, title="Profiling Report")
profile.to_file("Profiling Report.html")

Check this website if you need additional configuration for your report: https://pypi.org/project/pandas-profiling/

This is a personal blog. My opinion on what I share with you is that "All models are wrong, but some are useful". Improve the accuracy of any model I present and make it useful!
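For completeness, here is the autoEDA snippet above as a small end-to-end script; the CSV file name is a placeholder for your own raw data:

import pandas as pd
from pandas_profiling import ProfileReport

rawdataTbl = pd.read_csv("your_raw_data.csv")  # placeholder file name

profile = ProfileReport(rawdataTbl, title="Profiling Report")
profile.to_file("Profiling Report.html")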
Coaching and mentoring are very important to me, as they guided me throughout my career in Data and, more recently, my career as a Data Scientist. The two are closely related, but there are still some key differences between them. In this article I'll explain the difference between the two and how you can benefit from having a coach and/or a mentor.

What is mentoring? Wiki says: "Mentorship is the influence, guidance, or direction given by a mentor. A mentor is someone who teaches or gives help and advice to a less experienced and often younger person. In an organisational setting, a mentor influences the personal and professional growth of a mentee. Most traditional mentorships involve having senior employees mentor more junior employees, but mentors do not necessarily have to be more senior than the people they mentor. What matters is that mentors have experience that others can learn from."

How can mentoring help you? Mentoring is like a student-teacher relationship. In Data Science, a mentor can be a Senior/Consultant Data Scientist who will answer your questions and raise questions you hadn't considered, to help you meet a certain objective, build a skill, or master a specific tool. I had mentors while working for Symantec and DELL, as these companies had formal mentoring programs running. The "issue" was that they were not Analytics/Data Science specific, and at that time I needed domain-specific support. To fill this need, I went on LinkedIn (as they try to promote professionals as mentors), but the people I contacted either did not reply or were very brief. I looked deeper and I found these guys: MentorCruise. They are a dedicated website for mentoring services. I found them cool and I registered as a Mentor myself. You can find my profile here:

Having a mentor, and being one, will give you confidence and the great satisfaction of being able to help professionals similar to you. At the moment I'm mentoring two data professionals. I really like the interaction and the feedback I have received so far. Click here if you want to become a mentor on MentorCruise.

While working for Symantec, I attended a 3-day course on coaching and I got to experience being both a coach and a coachee. Nowadays there are more in-depth specialisations, and coaching is a profession in its own right. I wanted you to learn about coaching from the best, so I asked a former colleague of mine who is a trained coach to guest post for www.thebabydatascientist.com. Claudia is a Positive Psychologist and Coach specialising in career and leadership coaching for people in tech. She is based in Ireland and works globally with people who want to be happier at work. Let's see what Claudia advises:

What is coaching? Coaching can be defined in many ways. In essence, coaching is a space for personal development through a thought-provoking and creative process that inspires you to maximise your personal and professional potential (International Coaching Federation). In the business world, workplace coaching is often used as an effective tool to optimise performance and unlock untapped potential. Often, the goal of coaching interventions in the workplace is to build team cohesion, improve employee productivity and motivation, and develop authentic leadership skills. However, coaching is much more than just a tool to increase your productivity. Coaching is a journey to more self-awareness, a space to help you learn more about yourself.
Typically you will gain a greater understanding of your values and your strengths, and learn how they impact your behaviour, your thought processes and your emotions. There is great power in understanding that triangular relationship – it's the starting point for self-determined behavioural change.

How is coaching different from mentoring? Coaching is built on the core belief that you, the coachee or client, already have all the answers, skills and knowledge within yourself to reach your goals. The coach's role, then, is to help you uncover the answers you are seeking through a non-directive and non-judgmental dialogue. That means coaching is less directive than mentoring. It is not about giving advice; it is about empowering you to discover what is best for you – in work and in life. While a mentor shares their own experiences and industry knowledge with the mentee, a coach is not necessarily an expert in your line of work. A coach asks you thought-provoking questions, offers new perspectives and guides you through the process of reaching new insights that empower you to create a path forward that is right for you.

How can coaching help you? Coaching is a very versatile approach to personal development and can help you achieve goals, make difficult decisions or face challenges with more confidence. Below are a few examples of typical coaching topics for career coaching.

How coaching can help you in your career
Qualified and accredited coaches often specialise in a topic or in the career stage a client is in. Here are some examples of what career coaching can help you with:
Career coaching. This can be anything from finding career clarity and building your confidence to interview effectively, to supporting you in managing up, progressing in your career or building your own brand.
Leadership coaching. Leadership coaches work with new and established leaders and often act as an independent sounding board for their clients. Typical topics in leadership coaching include finding your authentic leadership style, building confidence as a leader, communicating assertively, employee motivation, and building high-performing teams.
Career and leadership coaching is ideal for people who suffer from imposter feelings, have trouble creating or maintaining a healthy work-life balance, or find it difficult to be themselves at work.

Your coaching readiness checklist
Want to find out if you are ready to engage a coach? Here is a handy checklist to determine your coaching readiness:
- I am determined to make a change
- I am ready to ask myself
The lifecycles below will guide you from the initial phase of a Data Science project through to the project's successful completion. They will enable you to divide the work within the team, estimate effort, document all the steps of your project, and set realistic expectations for the project stakeholders. I believe that implementing a standard process model should be the Data Science norm, not the exception.

CRISP-DM for Data Science
I've been using CRISP-DM (Cross-Industry Standard Process for Data Mining) as a process model for my Data Science project execution work for a few years and I can confirm that it works. The process consists of 6 major steps, and all the Data Science sub-tasks can be mapped as below:

Data Science Lifecycle
While studying for the DELL EMC Data Science Associate exam, I learned that DELL also recommends a Data Science lifecycle. In the course, the Data Science lifecycle is also divided into 6 phases, named differently but having the same functions: Discovery – Data Prep – Model Planning – Model Building – Communicate Results – Operationalize. The Data Science lifecycle is an iterative process; you move from one phase to the next only when sufficient information is available.

1. Business Understanding
- Determine business objectives and goals
- Assess situation
- Produce project plan
It might seem like you need to do a lot of documentation even in the initial phase (and this is considered one of the few weaknesses of CRISP-DM), but all you need is a formal one-pager with the signed-off Business Case that clearly states both the business and the machine learning objectives and lists past information related to similar efforts. This exercise will prioritize the items in your backlog and protect you from scope creep. (Pic. source: The Machine Learning Project Checklist)

2. Data Understanding
- Collect initial data
- Describe data
- Explore data
- Verify data quality
This is the "scary", time-consuming and crucial phase. I would sum it up as ETL/ELT + EDA = ♥. Acquiring data can be complex when it originates from both internal and external sources, in structured and unstructured formats, and without the help of a Data Engineer (they are such rare unicorns nowadays). Without good quality data, Machine Learning projects become useless. This part is missing from most online Data Science courses/competitions, so Data Scientists should learn how to do ETL/ELT by themselves. EDA (Exploratory Data Analysis) has become simple, as it can be performed automatically with packages like pandas-profiling, Sweetviz, D-Tale and AutoViz in Python, or DataExplorer, GGally, SmartEDA and tableone in R. Some teams will use a mix of automated and custom EDA features.

3. Data Preparation
- Select data
- Clean data
- Construct data
- Integrate data
- Format data
This is another code-intensive phase, and it means knowing your dataset in great detail so that you have the confidence to select data for modeling, clean it and perform feature engineering. The EDA stage will tell you if you have missing data, outliers, highly correlated features, and so on, but in the Data Prep stage you have to decide what imputation strategy to apply. You might go back to Business Understanding in case you form a new hypothesis or you require confirmation from Subject Matter Experts. Feature engineering is a hot topic here. Until recently, this was a tedious, mostly manual task that involved creating new variables based on the available features using data wrangling techniques. Data Scientists will use pandas in Python and dplyr in R.
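As a small illustration of typical Data Prep decisions, here is a generic pandas sketch; the file and column names are hypothetical, and the imputation strategy should come from what EDA told you about your data:

import pandas as pd

df = pd.read_csv("raw_data.csv")  # placeholder file name

# Imputation: median for a skewed numeric column, mode for a categorical one
df["price"] = df["price"].fillna(df["price"].median())
df["segment"] = df["segment"].fillna(df["segment"].mode()[0])

# Simple feature engineering: derive new variables from existing ones
df["order_month"] = pd.to_datetime(df["order_date"]).dt.month
df["price_per_unit"] = df["price"] / df["quantity"]

# Drop one of a pair of highly correlated features flagged during EDA
df = df.drop(columns=["quantity"])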
Nowadays, frameworks like Featuretools will save the day. I wouldn't rely 100% on an automated feature engineering tool, but overall it's a nice addition to a Data Science project.

Feature selection
Feature importance: some algorithms, like random forests or XGBoost, allow you to determine which features were the most "important" in predicting the target variable's value. By quickly creating one of these models and computing feature importance, you'll get an understanding of which variables are more useful than others.
Dimensionality reduction: one of the most common dimensionality reduction techniques, Principal Component Analysis (PCA), takes a large number of features and uses linear algebra to reduce them to fewer features.
By the end of this phase, you've completed 70%–80% of the project. But the fun begins in the next phase:

4. Modeling
- Select modeling techniques
- Generate test design
- Build model
- Assess model
Once it is clear which algorithm to try (or try first), you'll have to: from sklearn.model_selection import train_test_split. Splitting your dataset into train and test sets is key to building a performant model. You'll build the model on the training dataset (usually 70% of your data) and check how it performs on the test dataset (the remaining 30%). Don't forget to set random_state / seed; this will help you reproduce the same random split (see the short sketch at the end of this post). Depending on the complexity of the model, building it can be as quick as writing 2 lines of code. For Python, scikit-learn is your model-building library; in R you have packages like caret, e1071, xgboost, randomForest, etc. While assessing the accuracy of your model, you might decide to go back to the previous step(s) and reiterate. Time spent on modeling is subjective, as models can always be improved with more tuning, but if time is more valuable than the % increase in accuracy, you'll want to move to the next step. A rule of thumb in modeling is to shortlist the models that reach at least 70% accuracy for unsupervised problems and 80% for supervised ones. You should also look at the loss function, setting the threshold, the confusion matrix, and sensitivity/specificity.

5. Evaluation
- Evaluate results
- Review process
- Determine the next steps
At this step, you'll have to decide which model to select using tools like ROC curves, the number of features, and also business feedback. You can find more reading on this topic here: I linked this blog as it's written by a Data Scientist with hands-on experience.

6. Deployment
- Plan deployment
- Plan monitoring and maintenance
- Produce final report
- Review project
While preparing for deployment, you should create a final report documenting whether the model met its objectives, start monitoring model stability and accuracy, and decide when retraining should
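To make the Modeling phase above concrete, here is a minimal scikit-learn sketch of the train/test split and a first model; the file name, column names, split size and model choice are illustrative only:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("prepared_data.csv")  # placeholder: the output of Data Preparation
X = df.drop(columns=["target"])        # hypothetical feature/target columns
y = df["target"]

# random_state makes the 70/30 split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Accuracy on the held-out test set
print(accuracy_score(y_test, model.predict(X_test)))

# Feature importance, as mentioned in the Feature selection section
print(sorted(zip(model.feature_importances_, X.columns), reverse=True))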
At this stage, you have managed to create your first Python project and saved all the packages you used in a virtual environment. The next step is to collaborate with other data professionals. For a smooth collaboration with your peers, you now have to create a requirements.txt file with all the Python packages used and their respective versions.

Create the requirements.txt. To create the file automatically, run the line below in the terminal:

pip freeze > requirements.txt

File usage. There are numerous situations in which you'll use a requirements.txt file:
- sharing a project with your peers;
- using a build system;
- copying the project to any other location;
- project documentation.

One more command line to remember for the first 3 situations:

pip install -r requirements.txt

It's that simple!

This is a personal blog. My opinion on what I share with you is that "All models are wrong, but some are useful". Improve the accuracy of any model I present and make it useful!
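To illustrate the requirements.txt post above: the file generated by pip freeze is just a plain-text list of pinned packages, one per line; the package versions below are made up for the example. Running pip install -r requirements.txt then recreates exactly these versions in a fresh environment:

google-cloud-bigquery==3.11.0
pandas==1.5.3
pandas-gbq==0.19.0
pandas-profiling==3.6.0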
Wait, what?? Why would you use RStudio as an IDE for running Python? There's a simple answer to this question: it is the perfect Data Science IDE when you use R and Python together. You'll find below the simple steps to set up a project in RStudio so you can start using Python:

Create an RStudio Project. Navigate and save your project: File -> New Project -> Existing Directory (linked to the GitHub folder, if applicable) / New Directory (if you work locally).

Create and activate a Python virtual environment. Each project might require different versions of packages, and this can be encapsulated in a virtual environment. You'll have to create the virtual environment, select it as the Python interpreter for the RStudio project, and then activate it.
- Install virtualenv with pip in the RStudio terminal window (initial setup): pip install virtualenv
- Create the virtual environment for the current project (initial setup): virtualenv environment_name
- Activate the virtual environment for the current project (initial setup):
  on Windows: environment_name\Scripts\activate.bat
  on Mac: source environment_name/bin/activate

Select the Python interpreter. Navigate and select your Python interpreter: Tools -> Global Options -> Python -> Select -> Virtual Environments.

When you open the project, just remember to activate the environment: environment_name\Scripts\activate.bat

(Note: I participate in the Amazon affiliate program. This post may contain affiliate links from Amazon or other publishers I trust (at no extra cost to you). I may receive a small commission when you buy using my links; this helps to keep the blog alive! See disclosure for details.)

If you're new to RStudio, you can browse this book. Once you start coding, you might also be interested in reading: the 17 Clean Code standards to adopt NOW! "Freeze" your Python environment by creating the requirements.txt file.

This is a personal blog. My opinion on what I share with you is that "All models are wrong, but some are useful". Improve the accuracy of any model I present and make it useful!
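One extra sanity check for the RStudio + virtualenv setup above: run the snippet below from a Python chunk or the Python REPL to confirm that RStudio picked up the interpreter from your virtual environment:

import sys

# Should print a path inside your virtual environment (e.g. .../environment_name/...)
print(sys.executable)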
New job titles are confusing. When dad asked me what I do for a living, I told him that I'm a Data Scientist. From his look, I understood that a job title wasn't enough to explain what I do, so I added: "I use current and past information to predict the future". He seemed fascinated and we continued the conversation with some work examples. Dad is my biggest supporter, so he asked how he could help promote my new project. So I created a small experiment for dad on my blog. I wanted to see how good I was at explaining what a Data Scientist is to someone who has no connection to the industry. I created the Mini Data Science Quiz: 3 questions that can be answered in less than 1 minute. Dad was 70% correct. In Data Science, if the model accuracy is above 70%, you've got a decent model.

(Note: I participate in the Amazon affiliate program. This post may contain affiliate links from Amazon or other publishers I trust (at no extra cost to you). I may receive a small commission when you buy using my links; this helps to keep the blog alive! See disclosure for details.)

If you want to educate yourself in Data Science or start a career in Data Science, put these 3 materials on your list: R for Data Science, the Python Data Science Handbook, and Doing Data Science with Python. Start a 10-day free trial at Pluralsight – over 5,000 courses available.

Can you do better than dad? Mini Data Science Quiz

This is a personal blog. My opinion on what I share with you is that "All models are wrong, but some are useful". Improve the accuracy of any model I present and make it useful!
You are here because you code, but how professional does your code look? Professional programmers think of systems as stories to be told rather than programs to be written. I have grouped 17 important clean coding standards into 4 sections. Make sure you bookmark the page and share it with your colleagues.

Naming
The name of a variable, function, or class should answer all the big questions: it should tell you why it exists, what it does, and how it is used.
Best to use Computer Science terms (algorithm names, pattern names, math, etc.) wherever possible. Otherwise, stick with application domain terms.
Do not try to be cute or funny when naming.
Spend your time wisely when trying to find the correct name. Shorter names are generally better than longer ones, as long as they are clear.
Add enough context to a name, but NO MORE.

Functions and Methods
FUNCTIONS SHOULD SERVE ONE PURPOSE. They should serve it well, with NO SIDE EFFECTS. This means one level of abstraction per function. As a guide: they can't be divided reasonably into sections, and they can't do anything hidden. If you must have some coupling, then you should AT LEAST make it clear in the name (e.g. serializeAndSetContext(…), startThreadAndLogWork(…)), although this isn't that pretty.
Functions should be small, hardly ever longer than 30 lines. Blocks within IF, ELSE, WHILE, FOR should be small, one line in an ideal world, and that line should probably be a function call. The indent depth within a function shouldn't be greater than 2, rarely 3. If you end up with 3 levels, first consider whether it is possible to break it into another function.
Arguments are hard for testing. The ideal number of arguments for a function is zero (niladic). Next comes one (monadic), followed closely by two (dyadic). More than 3 arguments require special treatment; most likely those arguments need to live in a class of their own. In professional programming, you have only 2 reasons for function arguments: asking a question about the argument or operating on it (transforming it, returning a dependent result).
In professional programming, flag arguments are bad practice. Passing a bool into a function proclaims that the function does more than one thing: one thing if the flag is true and another if the flag is false (see the small example at the end of this post).
In monadic functions, we should keep a verb phrase + noun pair: download(object), convertToString(date), etc.

Comments
Comments are, at best, a necessary evil. Most of them are excuses or justifications for bad code. Robert C. Martin: "Comments usually compensate for our failure to express ourselves in code." Clear and expressive code with few comments is far superior to cluttered and complex code with lots of comments. TODO comments are acceptable, but the TODO should be addressed when possible.

Module formatting
Use the newspaper metaphor for source code: you read it vertically. At the top you expect a headline that tells you what the story is about and lets you decide whether it is something you want to read. If one function calls another, they should be vertically close, and the caller should be above the callee, if at all possible. Surround assignment operators with white space to accentuate them.

(Note: I participate in the Amazon affiliate program. This post may contain affiliate links from Amazon or other publishers I trust (at no extra cost to you). I may receive a small commission when you buy using my links; this helps to keep the blog alive! See disclosure for details.)

Want to learn more? Grab your Clean Code Handbook: Clean Code
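To make the "one purpose, no flag arguments" rule concrete, here is a tiny Python sketch; the function names and bodies are made up for illustration:

# Avoid: the boolean flag means render_page does two different things.
def render_page(page, as_pdf):
    if as_pdf:
        ...  # build and return a PDF
    else:
        ...  # build and return HTML

# Prefer: two small, single-purpose functions with intention-revealing names.
def render_page_as_pdf(page):
    ...  # build and return a PDF

def render_page_as_html(page):
    ...  # build and return HTML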
This is a personal blog. My opinion on what I share with you is that "All models are wrong, but some are useful". Improve the accuracy of any model I present and make it useful!
Starting in 2019, interest in Data Science education and accreditation skyrocketed. Have a look below at the Google trend for the two search terms, Data Science course vs Data Science certificate: this shows that many people are looking to get formal training in Data Science. Generally speaking, a technical certification will be somewhat attractive on a CV, but a certification alone will not secure you a role. The majority of Data Science interviews will include at least one technical test and multiple discussions. Some interviewers might even question you more deeply on the topics of the certification.

To boost my confidence in my Data Science skills, I also decided to pursue a Data Science certification. I did my "Google research" and I was pleasantly surprised by the results: the DELL EMC program scored high in the top Data Science certifications search. For me, as a Dell employee, this meant I was able to access multiple learning materials to prepare for the exam.

Structure
Dell offers a two-level Data Science certification: Associate and Specialist. The Associate-level exam consists of 60 questions and you have 90 minutes to answer them. The minimum score to pass the exam is 63, and the topics assessed are:
- MapReduce (15%): MapReduce framework and its implementation in Hadoop; Hadoop Distributed File System (HDFS); Yet Another Resource Negotiator (YARN)
- Hadoop Ecosystem and NoSQL (15%): Pig; Hive; NoSQL; HBase; Spark
- Natural Language Processing (NLP) (20%): NLP and the four main categories of ambiguity; Text Preprocessing; Language Modeling
- Social Network Analysis (SNA) (23%): SNA and Graph Theory; Communities; Network Problems and SNA Tools
- Data Science Theory and Methods (15%): Simulation; Random Forests; Multinomial Logistic Regression and Maximum Entropy
- Data Visualization (12%): Perception and Visualization; Visualization of Multivariate Data

I recently (in January 2022) took the Associate-level exam and I am currently studying for the Specialist level, so it is an ideal time to write about my learning and exam experiences.

Learning
The official website page for the exam and course info is this. Here you will find details about the On Demand classes they offer, the exam link, and practice tests. You can also see more sample questions here and additional online practice tests. The Data Science and Big Data Analytics course prepares you for the Data Scientist Associate v2 (DCA-DS) certification. Once you pass the exam, you receive the Dell Technologies Certified Associate (DCA-DS) certification.

Why is the Data Scientist Associate v2 (DCA-DS) a good certification for a junior data scientist? Going through the topics included in the material will give you a good foundation in data science terminology. It gives an intro into what big data is, the most basic algorithms, and an understanding of the responsibilities of a Data Scientist and the data science lifecycle. Learning all of this will enable immediate and effective participation in big data and other analytics projects. You'll be hands-on with Hadoop (including Pig, Hive, and HBase), Natural Language Processing, Social Network Analysis, Simulation, Random Forests, Multinomial Logistic Regression, and Data Visualization. The labs will prepare you to do data processing, apply algorithms, and run data visualizations in R. It will empower you to keep studying and move forward to the next-level certificate, as the DCA-DS certification is a prerequisite for DCS-DS.
The Advanced Methods in Data Science and Big Data Analytics course prepares you for the Specialist – Data Scientist, Advanced Analytics Version 1.0 (DCS-DS) certification.

(Note: I participate in the Amazon affiliate program. This post may contain affiliate links from Amazon or other publishers I trust (at no extra cost to you). I may receive a small commission when you buy using my links; this helps to keep the blog alive! See disclosure for details.)

If you can't sit in a class for a full week (8 hours a day), you can study for the exam at your own pace. Dell EMC published the book below to help you prepare for the exam. It is rated very highly and it's now discounted on Amazon.

When are you getting one of these? If this is not motivation enough, I'll leave below an interesting TED Talk on the influence of social networks (one of the topics of the course/exam) and you'll see why Data Science is so cool:

This is a personal blog. My opinion on what I share with you is that "All models are wrong, but some are useful". Improve the accuracy of any model I present and make it useful!