GCS GCP Buckets

Get data from a GCS bucket using Python 3 – a specific task that took me a while to complete due to the scarce and confusing online resources. You might ask why a Data Scientist was stuck on such a trivial Data Engineering task? Well… because most of the time there is no proper Data Engineering support in an organization. Anyway, my challenge was that I ran out of RAM while using a Google Colab notebook to read Google Cloud Storage (GCS) data, so I wanted to work locally in Python to leverage my 32 GB of RAM. In Colab I used to run gsutil commands, and everything was easier because I was already inside a Google environment.

Steps to follow if you want to pull GCS data locally:

- Ask your GCP admin to generate a Google Cloud secret key and save it in a JSON file.
- Install (`pip install google-cloud-storage`) and import the GCP libraries: `from google.cloud import storage`
- Define a GCS client: `storage_client = storage.Client()`
- Loop through the buckets you want to read data from (I saved my buckets in a list). For each bucket you'll need to:
  - get the bucket name and prefix
  - list the blobs
  - identify the CSVs (in my case)
  - download the CSVs
  - save the data in a pandas DataFrame

Something like this:

```python
from google.cloud import storage
import pandas as pd
import io
import os
import re

# Point the client at the service-account key your GCP admin generated
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'google-key_Cristina.json'
storage_client = storage.Client()

# My bucket URIs are saved one per line in a text file
with open('BM_buckets.txt', 'r') as f:
    BM_GCS_Buckets = list(f)

def find_bucket(plant_name):
    """Return the first bucket URI that matches the plant name."""
    for bucket in BM_GCS_Buckets:
        if re.findall(plant_name, bucket):
            return bucket
    return ''

for i in ['ML', 'MP']:
    bucket_uri = find_bucket(i).strip()
    bucket = 'bucketname'  # the bucket itself (name redacted here)
    # Everything after the bucket in the URI becomes the prefix
    prefix_name = bucket_uri.replace('gs://bucketname/', '') + '/'
    blobs = storage_client.list_blobs(bucket, prefix=prefix_name)
    # Keep only the CSV files
    files = [blob for blob in blobs if blob.name.endswith('.csv')]
    all_df = []
    for file in files:
        df = pd.read_csv(io.BytesIO(file.download_as_bytes()), index_col=0)
        all_df.append(df)
    data_hist = pd.concat(all_df)
```

This is a personal blog. My opinion on what I share with you is that "All models are wrong, but some are useful". Improve the accuracy of any model I present and make it useful!
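P.S. In the spirit of making things more useful: if you already know the paths of the CSVs you need, pandas can read straight from GCS and skip the blob loop entirely. A minimal sketch, assuming gcsfs is installed (`pip install gcsfs`) and the same credentials file as above; the bucket and object path below are placeholders, not my real ones:

```python
import os
import pandas as pd

# gcsfs picks up the same service-account key via this env var
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'google-key_Cristina.json'

# pandas hands gs:// URLs off to gcsfs under the hood
df = pd.read_csv('gs://bucketname/some-prefix/some-file.csv', index_col=0)
```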