How to read csv file in Google Cloud Platform jupyter notebook - google-cloud-platform

I am working on Jupyter notebook in google cloud platform AI notebook. Now I want to read .csv file in GCP which is stored locally in my laptop.
My approach:
df = pd.read_csv("C:\Users\Desktop\New Folder\Data.csv")
But its not working. How to read local file in GCP AI notebbok.

I don't think there is a direct way to do this, but here you have three alternatives:
a) Upload the file from the Jupyter UI:
1.Open the Jupyter UI.
2.In the left pane of the screen, at the top, below the menus, click the "Upload files" button.
3.Select the file from your local file system and click Open.
4.Once the file is available in the left pane of the screen, right-click the file and select "Copy Path".
5.In your Notebook, type the following code, replacing test.csv with the path you just copied:
import pandas as pd
df2 = pd.read_csv("test.csv")
print(df2)
b. Upload the file to the Notebooks instance's file system
1.Go to the Compute Engine screen in the GCP console.
2.SSH to your AI Platform Notebooks instance, using the SSH button.
3.In the new terminal window, click the gear icon and the "Upload File" option
4.Select the file from your local file system and click Open.
5.The file will be stored in $HOME/, optionally move it to the desired path.
6.In your Notebook, type the following code, replacing the path accordingly:
import pandas as pd
df = pd.read_csv("/path/to_file/test.csv")
print(df2)
c)Store the file in a GCS bucket.
1.Upload your file to GCS.
2.In your Notebook, type the following code, replacing the bucket and file names accordingly:
import pandas as pd
from google.cloud import storage
from io import BytesIO
client = storage.Client()
bucket_name = "your-bucket"
file_name = "your_file.csv"
bucket = client.get_bucket(bucket_name)
blob = bucket.get_blob(file_name)
content = blob.download_as_string()
df = pd.read_csv(BytesIO(content))
print(df)

Related

Upload to BigQuery from Cloud Storage

Have ~50k compressed (gzip) json files daily that need to be uploaded to BQ with some transformation, no API calls. The size of the files may be up to 1Gb.
What is the most cost-efficient way to do it?
Will appreciate any help.
Most efficient way to use Cloud Data Fusion.
I would suggest below approach
Create cloud function and trigger on every new file upload to uncompress file.
Create datafusion job with GCS file as source and bigquery as sink with desired transformation.
Refer below my youtube video.
https://youtu.be/89of33RcaRw
Here is (for example) one way - https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json...
... but quickly looking over it however one can see that there are some specific limitations. So perhaps simplicity, customization and maintainability of solution can also be added to your “cost” function.
Not knowing some details (for example read "Limitations" section under my link above, what stack you have/willing/able to use, files names or if your files have nested fields etc etc etc ) my first thought is cloud function service ( https://cloud.google.com/functions/pricing ) that is "listening" (event type = Finalize/Create) to your cloud (storage) bucket where your files land (if you go this route put your storage and function in the same zone [if possible], which will make it cheaper).
If you can code Python here is some started code:
main.py
import pandas as pd
from pandas.io import gbq
from io import BytesIO, StringIO
import numpy as np
from google.cloud import storage, bigquery
import io
def process(event, context):
file = event
# check if its your file can also check for patterns in name
if file['name'] == 'YOUR_FILENAME':
try:
working_file = file['name']
storage_client = storage.Client()
bucket = storage_client.get_bucket('your_bucket_here')
blob = bucket.blob(working_file)
#https://stackoverflow.com/questions/49541026/how-do-i-unzip-a-zip-file-in-google-cloud-storage
zipbytes = io.BytesIO(blob.download_as_string())
#print for logging
print(f"file downloaded, {working_file}")
#read_file_as_df --- check out docs here = https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
# if nested might need to go text --> to dictionary and then do some preprocessing
df = pd.read_json(zipbytes, compression='gzip', low_memory= False)
#write processed to big query
df.to_gbq(destination_table ='your_dataset.your_table',
project_id ='your_project_id',
if_exists = 'append')
print(f"table bq created, {working_file}")
# if you want to delete processed file from your storage to save on storage costs uncomment 2 lines below
# blob.delete()
#print(f"blob delete, {working_file}")
except Exception as e:
print(f"exception occured {e}, {working_file}")
requirements.txt
# Function dependencies, for example:
# package>=version
google-cloud-storage
google-cloud-bigquery
pandas
pandas.io
pandas-gbq
PS
Some alternatives include
Starting up a VM and run your script on a schedule and shutting VM down once process is done ( for example cloud scheduler –-> pub/sub –-> cloud function –-> which starts up your vm --> which then runs your script)
Using app engine to run your script (similar)
Using cloud run to run your script (similar)
Using composer/airflow (not similar to 1,2&3) [ could use all types of approaches including data transfers etc, just not sure what stack you are trying to use or what you already have running ]
Scheduling vertex ai workbook (not similar to 1,2&3, basically write up a jupyter notebook and schedule it to run in vertex ai)
Try to query files directly (https://cloud.google.com/bigquery/external-data-cloud-storage#bq_1) and schedule that query (https://cloud.google.com/bigquery/docs/scheduling-queries) to run (but again not sure about your overall pipeline)
Setup for all (except #5 & #6) just in technical debt to me is not worth it if you can get away with functions
Best of luck,

Django open excel.xlsx with openpyxl from Google Cloud Storage

I need to open a .xlsx file from my bucket on Google Cloud Storage, the problem is I get :FileNotFoundError at /api/ficha-excel
[Errno 2] No such file or directory: 'ficha.xlsx'
These are the settings from my bucket.
UPLOAD_ROOT = 'reportes/'
MEDIA_ROOT = 'reportes'
These are the route bucket/reportes/ficha.xlsx
This is the code of my get function:
directorio = FileSystemStorage("/reportes").base_location
os.makedirs(directorio, exist_ok=True)
# read
print("Directorios: ", directorio)
plantilla_excel = openpyxl.load_workbook(f"{directorio}/ficha.xlsx")
print(plantilla_excel.sheetnames)
currentSheet = plantilla_excel['Hoja1']
print(currentSheet['A5'].value)
What is the problem with the path? I can't figure out.
The below solution doesn’t use Django FileStorage/Storage classes. It opens a .xlsx file from the Cloud Storage bucket on Google Storage using openpyxl.
Summary :
I uploaded the Excel file on GCS, read the Blob data with openpyxl via BytesIO and saved the data in the workbook using the .save() method.
Steps to Follow :
Create a Google Cloud Storage bucket. Choose a globally unique name for it. Keep with the defaults and finally enter Create.
Choose an Excel file from your local system and upload it in the bucket using the “Upload files” option.
Once you have the excel file in your bucket, follow the steps below :
Go to Google Cloud Platform and create a service account (API). Click
Navigation Menu> APIs & Services> Credentials to go to the screen.
Then click Manage Service Accounts.
On the next screen, click Create Service Account.
Enter the details of the service account for each item.
In the next section, you will create a role for Cloud Storage. Choose
Storage Admin (full permission).
Click the service account you created, click Add Key in the Keys
field, and select Create New Key.
Select JSON as the key type and "create" it. Since the JSON file is
downloaded in the local storage, use the JSON file in the next item
and operate Cloud Storage from Python.
We will install the libraries required for this project in Cloud
Shell First, install the Google Cloud Storage library with pip
install to access Cloud Storage:
pip install google-cloud-storage
Install openpyxl using :
pip install openpyxl
Create a folder (excel) with the name of your choice in your Cloud editor.
Create files within it :
main.py
JSON key file (the one that got downloaded in local storage, copy that
file into this folder)
excel
main.py
●●●●●●●●●●.json
Write the below lines of code in main.py file :
from google.cloud import storage
import openpyxl
import io
#Create a client instance for google cloud storage
client = storage.Client.from_service_account_json('●●●●●●●●●●.json') //The path to your JSON key file which is now
#Get an instance of a bucket
bucket = client.bucket(‘bucket_name’) //only the bucketname will do, full path not necessary.
##Get a blob instance of a file
blob = bucket.blob(‘test.xlsx') // test.xlsx is the excel file I uploaded in the bucket already.
buffer = io.BytesIO()
blob.download_to_file(buffer)
wb = openpyxl.load_workbook(buffer)
wb.save('./retest.xlsx')
You will see a file ‘retest.xlsx’ getting created at the same folder in Cloud Editor.

Rename file name in GCP Cloud Storage and remove square brackets []

We have many videos uploaded to GCP Cloud Storage.
We need to change file name and remove [].
Asking if there is a good solution.
file example:
gs://xxxxxx/xxxxxx/[BlueLobster] Saint Seiya The Lost Canvas - 06 [1080p].mkv
You can't rename file in Cloud Storage. Renaming file equals to copying the file with a new name and to delete the older name.
It will take time if you have a lot of (large) files, but it's not impossible.
Based on the given scenario, you want to bulk rename all the filenames with ¨[]¨. Based on this documentation, gsutil interprets these characters as wildcards. gsutil does not support this currently.
There´s a way to handle this kind of request by using a custom script to rename all the files with ´[´.
You may use any programming languages that have Cloud Storage client libraries. For this instructions, we´ll be using Python for the custom script.
On your Google Cloud Console, Click the Activate Cloud Shell on the top right of the Google Cloud Console beside the question mark sign. For more information, you may refer here.
On your Cloud Shell, Install the Python client library by using this command:
pip install --upgrade google-cloud-storage
For more information, please refer on this documentation.
After the installation of client library, launch the Cloud Shell Editor by clicking the Open Editor on the top right side of the Cloud Shell. You may refer here for more information.
On your Cloud Shell Editor, click the File menu and choose New File. Name it script.py. Click Ok.
This code assumes that all the objects on your bucket have the same name from the sample you provided.:
import re
from google.cloud import storage
storage_client = storage.Client()
bucket_name = "my_bucket"
bucket = storage_client.bucket(bucket_name)
storage_client = storage.Client()
blobs = storage_client.list_blobs(bucket_name)
pattern = r"[\([{})\]]"
for blob in blobs:
out_var = blob.name
fixed_var = re.sub(pattern, '', blob.name)
print(out_var + " " + fixed_var)
new_blob = bucket.rename_blob(blob, fixed_var)
Change the content of ¨my_bucket¨ to the name of your bucket.
Click File and then Save or you can just press Ctrl + S.
Go back to the terminal by clicking the Open Terminal on the top right section of the Cloud Shell Editor.
Copy and paste this code to the editor:
python script.py
To run the script, press the Enter key.
Files that have brackets are now renamed.
The files aren´t renamed in the backend. Under the hood, it's more of being rewritten with a new name and it's due to object immutability. This will only copy the old files with a new name and removes the old file afterwards.

Google Cloud Webapp to use a user uploaded file for processing

I recently moved my project from heroku to google cloud. It's written in flask and basically does some text summary (nothing fancy) of an uploaded .docx file. I was able to locally use files on heroku due to their ephemeral file system.
With google cloud, finding myself lost trying to use a file uploaded and running python functions on it.
The error I'm getting is:
with open(self.file, 'rb') as file: FileNotFoundError: [Errno 2] No such file or directory: 'http://storage.googleapis.com/...'
Edited the specifics out for now but when I open the link in a browser it brings up the download window. I know the file gets there since I go to google cloud and everything is in the proper bucket.
Also is there a way to delete from the bucket immediately after python goes through the document? Currently have the lifecycle set to a day but just need the data temporarily runover.
I'm sorry if these are silly questions. Very new to this and trying to learn.
Thanks
Oh and here's the current code
gcs = storage.Client()
user_file = request.files['file']
local = secure_filename(user_file.filename)
blob = bucket.blob(local)
blob.upload_from_string(user_file.read(),content_type=user_file.content_type)
this_file = f"http://storage.googleapis.com/{CLOUD_STORAGE_BUCKET}/{local}"
then a function is supposed to open this_file
returned a public_url to a file name to be processed and used
def open_file(self):
url = self.file
file = BytesIO(requests.get(url).content)
return docx.Document(file)

Uploading a Large Dataset(10 GB+) into Jupyter Notebooks/GCP AI Notebooks

TL;DR:
How to move a large dataset(over 30 GB) from BigQuery to Jupyter Notebooks(AI Notebook within GCP)
Problem:
I do have a ~ 30GB dataset(time series) that I want to upload to Jupyter Notebooks(AI Notebook) in order to test a NN model before deploying it in its own server. The dataset already has been built in Bigquery, and I did move it using wildcards(100 parts) into Storage.
What I have done:
However, I am stuck trying to upload it into the Notebook:
1) Bigquery does not allow to query it directly, also too slow
2) Can not download it and the upload locally
2) Did move it to the storage in avro format, but have not achived to query it using the wildcards:
from google.cloud import storage
from io import BytesIO
client = storage.Client()
bucket = "xxxxx"
file_path = "path"
blob = storage.blob.Blob(file_path,bucket)
content = blob.download_as_string()
train = pd.read_csv(BytesIO(content))
What I am missing? Should I make the model into a function and the using Dataflow somehow?
Best