GCS GCP Buckets

Get data from a GCS bucket using Python 3 – a specific task that took me a while to complete due to the scarce and confusing online resources. You might ask why a Data Scientist was stuck on such a trivial Data Engineering task? Well… because most of the time there is no proper Data Engineering support in an organization. Anyway, my challenge was that I ran out of RAM while using a Google Colab notebook to read Google Cloud Storage (GCS) data, so I wanted to work locally in Python to leverage my 32 GB of RAM. In Colab I used to run gsutil commands, and everything was easier because I was already inside a Google environment.

Steps to follow if you want to pull GCS data locally:

- Ask your GCP admin to generate a Google Cloud secret key and save it in a JSON file.
- Install (`pip install google-cloud-storage`) and import the GCP libraries: `from google.cloud import storage`
- Define a GCS client: `storage_client = storage.Client()`
- Loop through the buckets you want to read data from (I saved my buckets in a list). For each bucket you'll need to:
  - get the bucket name and prefix
  - list the blobs
  - identify the CSVs (in my case)
  - download the CSVs
  - save the data in a pandas DataFrame

Something like this:

```python
from google.cloud import storage
import pandas as pd
import io
import os
import re

# Point the client at the service-account key your GCP admin generated
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'google-key_Cristina.json'
storage_client = storage.Client()

# My bucket URIs are saved one per line in a text file
with open('BM_buckets.txt', 'r') as f:
    BM_GCS_Buckets = list(f)

def find_bucket(plant_name):
    """Return the first bucket URI that matches the plant name."""
    for bucket in BM_GCS_Buckets:
        if re.findall(plant_name, bucket):
            return bucket
    return ''

for i in ['ML', 'MP']:
    bucket_uri = find_bucket(i).strip()
    bucket = 'bucketname'  # the bucket itself (name redacted here)
    # Everything after the bucket in the URI becomes the prefix
    prefix_name = bucket_uri.replace('gs://bucketname/', '') + '/'
    blobs = storage_client.list_blobs(bucket, prefix=prefix_name)
    # Keep only the CSV files
    files = [blob for blob in blobs if blob.name.endswith('.csv')]
    all_df = []
    for file in files:
        df = pd.read_csv(io.BytesIO(file.download_as_bytes()), index_col=0)
        all_df.append(df)
    data_hist = pd.concat(all_df)
```

This is a personal blog. My opinion on what I share with you is that "All models are wrong, but some are useful". Improve the accuracy of any model I present and make it useful!
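P.S. In the spirit of making things more useful: if you already know the paths of the CSVs you need, pandas can read straight from GCS and skip the blob loop entirely. A minimal sketch, assuming gcsfs is installed (`pip install gcsfs`) and the same credentials file as above; the bucket and object path below are placeholders, not my real ones:

```python
import os
import pandas as pd

# gcsfs picks up the same service-account key via this env var
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'google-key_Cristina.json'

# pandas hands gs:// URLs off to gcsfs under the hood
df = pd.read_csv('gs://bucketname/some-prefix/some-file.csv', index_col=0)
```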