Get data from a GCS bucket using Python 3 – a specific task that took me a while to complete due to scarce or confusing online resources
You might ask why a Data Scientist got stuck on such a trivial Data Engineering task. Well… because most of the time there is no proper Data Engineering support in an organization.
Anyway, my challenge was that I ran out of RAM while using a Google Colab notebook to read Google Cloud Storage (GCS) data, so I wanted to work locally in Python to leverage my 32 GB of RAM. In Colab I used to run gsutil commands, and everything was easier because I was already inside a Google environment.
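For context, this is roughly what that Colab workflow looked like; the gs:// path below is a made-up placeholder, not one of my real buckets:
# Colab-only: authenticate as the Google account running the notebook
from google.colab import auth
auth.authenticate_user()
# shell commands like gsutil run directly in a Colab cell
!gsutil ls gs://my-bucket/my-prefix/
!gsutil cp gs://my-bucket/my-prefix/data.csv .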
Steps to follow if you want to pull GCS data locally:
- Ask your GCP admin to generate a Google Cloud service account key and save it in a JSON file.
- install the GCP library ( pip install google-cloud-storage ) and import it: from google.cloud import storage
- define a GCS client: storage_client = storage.Client()
- loop through the buckets you want to read data from (I’ve saved my buckets in a list)
- for each bucket you’ll need:
- bucket name
- prefix
- create an iterator
- save the blobs
- identify the CSVs (in my case)
- download the CSVs
- save data in a pandas dataframe
Something like this:
from google.cloud import storage
import pandas as pd
import io
import os
import re

# point the client at the service account key file
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'google-key_Cristina.json'
storage_client = storage.Client()

# bucket paths (gs://...) saved one per line in a text file
with open('BM_buckets.txt', 'r') as f:
    BM_GCS_Buckets = f.read().splitlines()

def find_bucket(plant_name):
    # return the first bucket path that matches the plant name
    for bucket in BM_GCS_Buckets:
        if re.findall(plant_name, bucket):
            return bucket
    return ''

for i in ['ML', 'MP']:
    bucket_path = find_bucket(i).strip()
    # split "gs://bucketname/some/prefix/" into the bucket name and the prefix
    bucket_name, _, prefix_name = bucket_path.replace('gs://', '').partition('/')
    # make sure the prefix ends with a slash
    prefix_name = prefix_name.rstrip('/') + '/'
    # list_blobs returns an iterator over all blobs under the prefix
    blobs = storage_client.list_blobs(bucket_name, prefix=prefix_name)
    # identify the CSVs
    files = [blob for blob in blobs if blob.name.endswith('.csv')]
    # download each CSV and load it into a pandas dataframe
    all_df = []
    for file in files:
        df = pd.read_csv(io.BytesIO(file.download_as_bytes()), index_col=0)
        all_df.append(df)
    # data_hist holds the combined data for the current bucket
    data_hist = pd.concat(all_df)
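As a side note, pandas can also read GCS paths directly once the gcsfs package is installed ( pip install gcsfs ), which skips the manual blob download. A minimal sketch, with a made-up path:
import pandas as pd

# pandas delegates gs:// URLs to gcsfs, which uses your default
# Google credentials (e.g. GOOGLE_APPLICATION_CREDENTIALS)
df = pd.read_csv('gs://my-bucket/my-prefix/history.csv', index_col=0)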
This is a personal blog. My view on everything I share here: “All models are wrong, but some are useful”. Improve the accuracy of any model I present and make it useful!