In a Vertex AI workbench notebook, I'm trying to read data from Cloud Storage with Cloud Storage FUSE.
The file path to the dataset inside Cloud Storage is:
gs://my_bucket_name/cola_public/raw/in_domain_train.tsv, so I can read it into a pandas DataFrame as follows:
import pandas as pd
# Load the dataset into a pandas dataframe.
df = pd.read_csv("gs://my_bucket_name/cola_public/raw/in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])
# Report the number of sentences.
print('Number of training sentences: {:,}\n'.format(df.shape[0]))
# Display 10 random rows from the data.
df.sample(10)
The previous code works seamlessly. However, I want to update my code to read data with Cloud Storage FUSE (for Vertex AI Training later). Based on Read and write Cloud Storage files with Cloud Storage FUSE and this Codelab, I should be able to load my data using the following code:
df = pd.read_csv("/gcs/my_bucket_name/cola_public/raw/in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])
Unfortunately, it did not work for me. The error message is:
FileNotFoundError: [Errno 2] No such file or directory: '/gcs/my_bucket_name/cola_public/raw/in_domain_train.tsv'
How could I solve this problem?
Thank you in advance!
Thanks to Ayush Sethi for the answer:
"Did you try performing step 5 of the mentioned codelab ? The GCS buckets are mounted on performing step 5. So, the training application code that is containerised in step 4, should be able to access the data present in GCS buckets when run as training job on VertexAI which is described in step 5."
Related
I have ~50k compressed (gzip) JSON files daily that need to be uploaded to BQ with some transformation, no API calls. Each file may be up to 1 GB.
What is the most cost-efficient way to do it?
Will appreciate any help.
The most efficient way is to use Cloud Data Fusion.
I would suggest the approach below:
Create a Cloud Function triggered on every new file upload to uncompress the file (a minimal sketch follows below).
Create a Data Fusion pipeline with the GCS file as source and BigQuery as sink, with the desired transformation.
Refer to my YouTube video below:
https://youtu.be/89of33RcaRw
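For the first step, here is a minimal sketch of such a Cloud Function (the output bucket name is a placeholder, the .gz suffix is an assumption, and the whole file is decompressed in memory, which matters for files approaching 1 GB):
import gzip
from google.cloud import storage

def decompress_gcs_file(event, context):
    # Triggered by a new object in the landing bucket (event type = Finalize/Create).
    client = storage.Client()
    src_bucket = client.bucket(event["bucket"])
    data = gzip.decompress(src_bucket.blob(event["name"]).download_as_bytes())
    # Write the uncompressed copy where the Data Fusion pipeline can pick it up.
    dst_bucket = client.bucket("your-uncompressed-bucket")  # placeholder bucket
    name = event["name"]
    out_name = name[:-3] if name.endswith(".gz") else name  # assumes a .gz suffix
    dst_bucket.blob(out_name).upload_from_string(data)
    print(f"decompressed {name} -> {out_name}")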
Here is (for example) one way - https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json...
... but quickly looking it over, one can see that there are some specific limitations. So perhaps simplicity, customization and maintainability of the solution should also be added to your "cost" function.
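For reference, a minimal sketch of that direct-load route, assuming the files are newline-delimited JSON (bucket, path and table names are placeholders); note that any transformation would then have to happen in SQL afterwards:
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,  # gzipped NDJSON can be loaded as-is
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
# A wildcard URI picks up all of the day's compressed files (placeholder path).
load_job = client.load_table_from_uri(
    "gs://your-bucket/some-day/*.json.gz",
    "your-project.your_dataset.your_table",
    job_config=job_config,
)
load_job.result()  # wait for the load to finish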
Not knowing some details (for example, read the "Limitations" section under my link above; what stack you have or are willing/able to use; file names; whether your files have nested fields; etc.), my first thought is the Cloud Functions service (https://cloud.google.com/functions/pricing) "listening" (event type = Finalize/Create) to the Cloud Storage bucket where your files land (if you go this route, put your storage and function in the same region if possible, which will make it cheaper).
If you can code Python, here is some starter code:
main.py
import io
import pandas as pd
from google.cloud import storage, bigquery

def process(event, context):
    file = event
    # check if it's your file; you can also check for patterns in the name
    if file['name'] == 'YOUR_FILENAME':
        try:
            working_file = file['name']
            storage_client = storage.Client()
            bucket = storage_client.get_bucket('your_bucket_here')
            blob = bucket.blob(working_file)
            # https://stackoverflow.com/questions/49541026/how-do-i-unzip-a-zip-file-in-google-cloud-storage
            # download the whole object into memory
            zipbytes = io.BytesIO(blob.download_as_string())
            # print for logging
            print(f"file downloaded, {working_file}")
            # read file as df --- check out the docs here: https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
            # if nested, you might need to go text --> dictionary and then do some preprocessing
            df = pd.read_json(zipbytes, compression='gzip')
            # write processed data to BigQuery
            df.to_gbq(destination_table='your_dataset.your_table',
                      project_id='your_project_id',
                      if_exists='append')
            print(f"table bq created, {working_file}")
            # if you want to delete the processed file from storage to save on storage costs, uncomment the 2 lines below
            # blob.delete()
            # print(f"blob deleted, {working_file}")
        except Exception as e:
            print(f"exception occurred {e}, {working_file}")
requirements.txt
# Function dependencies, for example:
# package>=version
google-cloud-storage
google-cloud-bigquery
pandas
pandas-gbq
PS
Some alternatives include (a minimal sketch of the function that starts the VM in option 1 follows this list):
1) Starting up a VM, running your script on a schedule, and shutting the VM down once the process is done (for example Cloud Scheduler --> Pub/Sub --> Cloud Function --> which starts up your VM --> which then runs your script).
2) Using App Engine to run your script (similar).
3) Using Cloud Run to run your script (similar).
4) Using Composer/Airflow (not similar to 1, 2 & 3) [could use all types of approaches including data transfers etc.; just not sure what stack you are trying to use or what you already have running].
5) Scheduling a Vertex AI workbook (not similar to 1, 2 & 3; basically write up a Jupyter notebook and schedule it to run in Vertex AI).
6) Trying to query the files directly (https://cloud.google.com/bigquery/external-data-cloud-storage#bq_1) and scheduling that query (https://cloud.google.com/bigquery/docs/scheduling-queries) to run (but again, not sure about your overall pipeline).
The setup for all of these (except #5 & #6) adds technical debt that, to me, is not worth it if you can get away with Cloud Functions.
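As promised above, a minimal sketch of the Cloud Function for option 1 (Pub/Sub-triggered; project, zone and instance names are placeholders, and the VM's startup script is assumed to do the actual load and shut the machine down afterwards):
from google.cloud import compute_v1

PROJECT = "your-project-id"   # placeholder
ZONE = "us-central1-a"        # placeholder
INSTANCE = "loader-vm"        # placeholder; its startup script runs your load script

def start_loader_vm(event, context):
    # Triggered by the Pub/Sub message published by Cloud Scheduler.
    client = compute_v1.InstancesClient()
    operation = client.start(project=PROJECT, zone=ZONE, instance=INSTANCE)
    operation.result()  # block until the VM is up
    print(f"started {INSTANCE}")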
Best of luck,
I'm working with a large CSV (400M+ lines) located in a GCS bucket. I need to get a random sample of this CSV and export it to BigQuery for a preliminary exploration. I've looked all over the web and I just can't seem to find anything that addresses this question.
Is this possible and how do I go about doing it?
You can query your CSV file directly from BigQuery using external tables.
Try it with the TABLESAMPLE clause:
SELECT * FROM dataset.my_table TABLESAMPLE SYSTEM (10 PERCENT)
You can create an external table over GCS (to read directly from GCS) and then do something like this:
SELECT * FROM `<project>.<dataset>.<externalTableFromGCS>`
WHERE CAST(10*RAND() AS INT64) = 0
The result of the SELECT can be stored in GCS with an export, or stored in a table with an INSERT ... SELECT.
Keep in mind that you need to read the full file (and thus pay for the full file size) and then query a subset of it. You can't load only 10% of the volume into BigQuery.
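If you prefer to script it, here is a minimal sketch using the BigQuery Python client: it creates the external table over the CSV and materializes a ~10% sample into a native table (project, dataset, table and bucket names are placeholders):
from google.cloud import bigquery

client = bigquery.Client()

# External table definition pointing at the CSV in GCS.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://your-bucket/path/big_file.csv"]
external_config.autodetect = True

table = bigquery.Table("your-project.your_dataset.ext_big_csv")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

# Materialize roughly 10% of the rows into a native table for exploration.
job_config = bigquery.QueryJobConfig(
    destination="your-project.your_dataset.big_csv_sample",
    write_disposition="WRITE_TRUNCATE",
)
query = """
SELECT *
FROM `your-project.your_dataset.ext_big_csv`
WHERE RAND() < 0.1
"""
client.query(query, job_config=job_config).result()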
There is no direct way to load sample records from GCS to BigQuery, but we can achieve it in a different way. GCS lets you download only a specific byte range of a file, so the following simple Python code can load sample records to BQ from a large GCS file:
from google.cloud import storage
from google.cloud import bigquery
gcs_client = storage.Client()
bq_client = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format='CSV', autodetect=True, max_bad_records=1)
bucket = gcs_client.get_bucket("your-bucket")
blob = storage.Blob('gcs_path/file.csv', bucket)
with open('local_file.csv', 'wb') as f:  # downloading sample file
    gcs_client.download_blob_to_file(blob, f, start=0, end=2000)

with open('local_file.csv', "rb") as source_file:  # uploading to BQ
    job = bq_client.load_table_from_file(source_file, 'your-proj.dataset.table_id', job_config=job_config)

job.result()  # Wait for loading
The code above downloads 2 KB of data from your huge GCS file, but the last row in the downloaded CSV file may be incomplete, since we cannot know the byte length of each row in advance. The tricky part is max_bad_records=1 in the BQ job config, which makes the load job ignore the incomplete last row.
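If you would rather not rely on max_bad_records, a small variation (my own sketch, assuming a plain uncompressed CSV and placeholder names) is to trim the downloaded bytes back to the last complete line before loading, without writing a local file at all:
import io
from google.cloud import bigquery, storage

gcs_client = storage.Client()
bq_client = bigquery.Client()

bucket = gcs_client.get_bucket("your-bucket")      # placeholder bucket
blob = bucket.blob("gcs_path/file.csv")            # placeholder object path

raw = blob.download_as_bytes(start=0, end=2000)    # first ~2 KB of the object
sample = raw[: raw.rfind(b"\n") + 1]               # drop the (likely partial) last line

job_config = bigquery.LoadJobConfig(source_format="CSV", autodetect=True)
job = bq_client.load_table_from_file(
    io.BytesIO(sample), "your-proj.dataset.table_id", job_config=job_config
)
job.result()  # wait for loading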
Error in Google Cloud Data Labeling Service:
I am trying to create a dataset of images in Google's Data labeling service.
Using a single image to test it out.
Created a Google storage bucket named: my-bucket
Uploaded an image to my-bucket - image file name: testcat.png
Created and uploaded a CSV file (UTF-8) with the URI path of the image stored inside it.
image URI path as stored in csv file: gs://my-bucket//testcat.png
Named the CSV file: testimage.csv
Uploaded the CSV file to the GCS bucket my-bucket,
i.e. testimage.csv and testcat.png are in the same Google Storage bucket (my-bucket).
When I try to create the dataset in the Google Cloud console, GCP gives me the following error message:
Failed to import dataset gs://my-bucket/testcat.png is not a valid youtube uri nor a readable file path.
I've checked multiple times, and the URI for this image in Google Cloud Storage is exactly the same as what I've used. I've tried at least 10-15 times and the error persists.
Has anyone faced and successfully resolved this issue?
Your help is greatly appreciated.
Thanks!
As you can see in our AI Platform Data Labeling Service documentation, there is a service update due to the coronavirus (COVID-19) health emergency that states that data labeling services are limited or unavailable until further notice.
You can't start new data labeling tasks through the Cloud Console, Google Cloud SDK, or the API
You can request data labeling tasks only through email at cloudml-data-customer@google.com
New data labeling tasks can't contain personally identifiable information
I'm trying to use Cloud AI Platform for training (gcloud ai-platform jobs submit training).
I created my bucket and am sure the training file is there (gsutil ls gs://sat3_0_bucket/data/train_input.csv).
However, my job is failing with the log message:
File "/root/.local/lib/python3.7/site-packages/ktrain/text/data.py", line 175, in texts_from_csv
with open(train_filepath, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'gs://sat3_0_bucket/data/train_input.csv'
Am I missing something?
The error is probably happening because ktrain tries to auto-detect the character encoding using open(train_filepath, 'rb') which may be problematic with Google Cloud Storage. One solution is to explicitly provide the encoding to texts_from_csv as an argument so this step is skipped (default is None, which means auto-detect).
Alternatively, you can read the data in yourself as a pandas DataFrame using one of these methods. For instance, pandas evidently supports GCS, so you can simply do this: df = pd.read_csv('gs://bucket/your_path.csv')
Then, using ktrain, you can use ktrain.text.texts_from_df (or ktrain.text.texts_from_array) to load and preprocess your data.
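A minimal sketch of that second option (reading with pandas, then handing the DataFrame to ktrain); the column names here are placeholders, pandas needs the gcsfs package for gs:// URIs, and you should double-check the texts_from_df arguments against the ktrain docs:
import pandas as pd
from ktrain import text

# pandas reads the gs:// URI itself (via gcsfs), so ktrain never calls open() on a GCS path.
df = pd.read_csv("gs://sat3_0_bucket/data/train_input.csv")

(x_train, y_train), (x_test, y_test), preproc = text.texts_from_df(
    df,
    text_column="text",         # placeholder: your text column
    label_columns=["label"],    # placeholder: your label column(s)
    maxlen=400,
    preprocess_mode="bert",
)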
TL;DR:
How to move a large dataset (over 30 GB) from BigQuery to Jupyter Notebooks (AI Notebook within GCP)
Problem:
I have a ~30 GB dataset (time series) that I want to load into a Jupyter notebook (AI Notebook) in order to test an NN model before deploying it on its own server. The dataset has already been built in BigQuery, and I moved it into Storage using wildcards (100 parts).
What I have done:
However, I am stuck trying to load it into the notebook:
1) BigQuery does not allow me to query it directly; it is also too slow.
2) I cannot download it and then upload it locally.
3) I did move it to Storage in Avro format, but have not managed to read it using the wildcards:
from google.cloud import storage
from io import BytesIO
import pandas as pd

client = storage.Client()
bucket = client.get_bucket("xxxxx")
file_path = "path"
blob = bucket.blob(file_path)
content = blob.download_as_string()
train = pd.read_csv(BytesIO(content))
What am I missing? Should I turn the model into a function and then use Dataflow somehow?
Best
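For what it's worth, a minimal sketch of reading the exported Avro parts back with a prefix instead of a wildcard (the bucket and prefix are placeholders, it needs the third-party fastavro package, and all parts still have to fit into the notebook's RAM, so you may want to read only a few parts):
import io
import fastavro                      # pip install fastavro
import pandas as pd
from google.cloud import storage

client = storage.Client()

records = []
# The wildcard used for the BigQuery export becomes a prefix here (placeholders).
for blob in client.list_blobs("xxxxx", prefix="path/part-"):
    buf = io.BytesIO(blob.download_as_bytes())
    records.extend(fastavro.reader(buf))   # each Avro record becomes a dict

train = pd.DataFrame.from_records(records)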