Get data from a GCS bucket using Python 3

GCS GCP Buckets

Get data from a GCS bucket using Python 3 – a specific task that took me a while to complete due to scarce or confusing online resources


You might ask why a Data Scientist got stuck solving such a trivial Data Engineering task. Well… because, most of the time, there is no proper Data Engineering support in an organization.

Anyway, my challenge was that I kept running out of RAM while using a Google Colab notebook to read Google Cloud Storage (GCS) data, so I wanted to work locally in Python and leverage my 32 GB of RAM. In Colab I used to run gsutil commands, and everything was easier because I was already inside a Google environment.

Steps to follow if you want to pull GCS data locally:

  1. Ask your GCP admin to generate a Google Cloud service account key and save it in a JSON file.
  2. Install (pip install google-cloud-storage) & import the GCP library: from google.cloud import storage
  3. Define a GCS client: storage_client = storage.Client() (a quick sanity-check sketch follows this list)
  4. Loop through the buckets you want to read data from (I’ve saved my buckets in a list).
  5. For each bucket you’ll need to:
    1. get the bucket name
    2. get the prefix
    3. create an iterator
    4. save the blobs
    5. identify the CSVs (in my case)
    6. download the CSVs
    7. save the data in a pandas dataframe
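
Before running the full loop below, it can help to sanity-check that the key and the client actually work. A minimal sketch, assuming the key file is named my-gcp-key.json (the filename is just a placeholder):

import os
from google.cloud import storage

# Tell the client library where the service account key lives
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'my-gcp-key.json'

storage_client = storage.Client()

# If the key is valid, this prints every bucket the service account can see
for bucket in storage_client.list_buckets():
    print(bucket.name)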


Something like this:


import os
import io
import re

import pandas as pd

from google.cloud import storage

# Point the client library at the service account key file
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'google-key_Cristina.json'

storage_client = storage.Client()

# Bucket paths are stored one per line in a text file
with open('BM_buckets.txt', 'r') as f:
    BM_GCS_Buckets = list(f)

def find_bucket(plant_name):
    # Return the first bucket path that contains the plant name
    for bucket in BM_GCS_Buckets:
        if re.findall(plant_name, bucket):
            return bucket
    return ''

for i in ['ML', 'MP']:

    bucket_name = find_bucket(i).strip()

    bucket = 'bucketname'
    prefix_name = bucket_name.replace('gs://bucketname/', '') + '/'

    # list_blobs returns an iterator over the blobs under the prefix
    blobs = storage_client.list_blobs(bucket, prefix=prefix_name)

    # keep only the CSV files
    files = [blob for blob in blobs if blob.name.endswith('.csv')]

    # download each CSV and load it into a dataframe
    all_df = []
    for file in files:
        df = pd.read_csv(io.BytesIO(file.download_as_bytes()), index_col=0)
        all_df.append(df)

    # data_hist now holds all the CSV data for this bucket
    data_hist = pd.concat(all_df)
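
As a side note: if you also install gcsfs, pandas can read gs:// paths directly, which can be handy for one-off files. A minimal sketch (the bucket and file names below are placeholders, not my actual paths):

import pandas as pd

# requires: pip install gcsfs
# GOOGLE_APPLICATION_CREDENTIALS must point to the same service account key
df = pd.read_csv('gs://bucketname/some-prefix/some-file.csv', index_col=0)
print(df.head())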


This is a personal blog. My take on everything I share with you is that “all models are wrong, but some are useful”. Improve the accuracy of any model I present and make it useful!

Any comments are welcome.

