Get data from a GCS bucket using Python 3 – a specific task that took me a while to complete due to scarce or confusing online resources
You might ask why a Data Scientist got stuck on such a trivial Data Engineering task. Well… because most of the time there is no proper Data Engineering support in an organization.
Anyway, my challenge was that I ran out of RAM while using a Google Colab notebook to read Google Cloud Storage (GCS) data, so I wanted to work locally in Python to leverage my 32 GB of RAM. In Colab I used to run gsutil commands, and everything was easier because I was already inside a Google environment.
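For context, this is roughly what that Colab workflow looked like; the gs:// path below is a made-up placeholder, not one of my real buckets:
# Colab-only: authenticate as the Google account running the notebook
from google.colab import auth
auth.authenticate_user()
# shell commands like gsutil run directly in a Colab cell
!gsutil ls gs://my-bucket/my-prefix/
!gsutil cp gs://my-bucket/my-prefix/data.csv .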
Steps to follow if you want to pull GCS data locally:
- Ask your GCP admin to generate a Google Cloud service account key and save it in a JSON file.
- install the GCP library ( pip install google-cloud-storage ) and import it: from google.cloud import storage
- define a GCS client: storage_client = storage.Client()
- loop through the buckets you want to read data from (I’ve saved my buckets in a list)
- for each bucket you’ll need:
- bucket name
- prefix
- create an iterator
- save the blobs
- identify the CSVs (in my case)
- download the CSVs
- save data in a pandas dataframe
Something like this:
from google.cloud import storage
import pandas as pd
import io
import os
import re

# point the client at the service account key file
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'google-key_Cristina.json'
storage_client = storage.Client()

# bucket paths (gs://...) saved one per line in a text file
with open('BM_buckets.txt', 'r') as f:
    BM_GCS_Buckets = f.read().splitlines()

def find_bucket(plant_name):
    # return the first bucket path that matches the plant name
    for bucket in BM_GCS_Buckets:
        if re.findall(plant_name, bucket):
            return bucket
    return ''

for i in ['ML', 'MP']:
    bucket_path = find_bucket(i).strip()
    # split "gs://bucketname/some/prefix/" into the bucket name and the prefix
    bucket_name, _, prefix_name = bucket_path.replace('gs://', '').partition('/')
    # make sure the prefix ends with a slash
    prefix_name = prefix_name.rstrip('/') + '/'
    # list_blobs returns an iterator over all blobs under the prefix
    blobs = storage_client.list_blobs(bucket_name, prefix=prefix_name)
    # identify the CSVs
    files = [blob for blob in blobs if blob.name.endswith('.csv')]
    # download each CSV and load it into a pandas dataframe
    all_df = []
    for file in files:
        df = pd.read_csv(io.BytesIO(file.download_as_bytes()), index_col=0)
        all_df.append(df)
    # data_hist holds the combined data for the current bucket
    data_hist = pd.concat(all_df)
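As a side note, pandas can also read GCS paths directly once the gcsfs package is installed ( pip install gcsfs ), which skips the manual blob download. A minimal sketch, with a made-up path:
import pandas as pd

# pandas delegates gs:// URLs to gcsfs, which uses your default
# Google credentials (e.g. GOOGLE_APPLICATION_CREDENTIALS)
df = pd.read_csv('gs://my-bucket/my-prefix/history.csv', index_col=0)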
This is a personal blog. My view on everything I share here: “All models are wrong, but some are useful”. Improve the accuracy of any model I present and make it useful!