Uploading data into BigQuery - google-cloud-platform

Is there a way to unzip a file that has been loaded into Google Cloud Platform?
I have a 33 GB ZIP file full of CSV files en route to Google BigQuery.
Unzipped, it comes to over 215 GB.
Is there a way to load the ZIP file and uncompress it in GCP instead of trying to upload 215 GB of raw data?
Thanks.

If the files are gzipped, then you can load them into BigQuery directly from Google Cloud Storage. Otherwise, I don't believe there is a way to unzip them once they're in GCS.
One option is to spin up a Google Compute Engine instance with enough disk space (preferably SSDs), do whatever processing you need there, push the results to GCS, and then shut the instance down.
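For reference, here is a minimal sketch of loading gzipped CSVs from GCS into BigQuery with the Python client library (the bucket, dataset, and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # assumes the CSVs have a header row
    autodetect=True,      # let BigQuery infer the schema
)

# BigQuery reads gzip-compressed CSVs (.gz) from GCS directly;
# no separate compression setting is needed for load jobs.
load_job = client.load_table_from_uri(
    "gs://your-bucket/data/*.csv.gz",        # placeholder URI
    "your-project.your_dataset.your_table",  # placeholder table ID
    job_config=job_config,
)
load_job.result()  # wait for the load to complete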

Related

GCP Data Fusion: transfer multiple files from Azure storage to Google Storage

I am trying to transfer multiple .csv files under a directory from an Azure storage container to Google Cloud Storage (as .txt files) through Data Fusion.
From Data Fusion, I can successfully transfer a single file and convert it to a .txt file as part of the GCS sink.
But when I try to transfer all the .csv files under the Azure container to GCS, it merges all the .csv data and generates a single .txt file in GCS.
Can someone help with how to transfer each file separately and convert it to .txt on the sink side?
What you're seeing is expected behavior when using the GCS sink.
You need an Azure to GCS copy action plugin, or more generally an HCFS to GCS copy action plugin. Unfortunately such a plugin doesn't already exist. You could consider writing one using https://github.com/data-integrations/example-action as a starting point.

Is there a way to zip a folder of files into one zip, gzip, bzip, etc file using Google Cloud?

My Goal: I have hundreds of Google Cloud Storage folders with hundreds of images in them. I need to be able to zip them up and email a user a link to a single zip file.
I made an attempt to zip these files on an external server using PHP's zip function, but that has proved to be fruitless given the ultimate size of the zip files I'm creating.
I have since found that Google Cloud offers a Bulk Compress Cloud Storage Files utility (docs are at https://cloud.google.com/dataflow/docs/guides/templates/provided-utilities#api). I was able to successfully call this utility, but it zips each file into its own bzip or gzip file.
For instance, if I had the following files in the folder I'm attempting to zip:
apple.jpg
banana.jpg
carrot.jpg
The resulting outputDirectory would have:
apple.bzip2
banana.bzip2
carrot.bzip2
Ultimately, I'm hoping to create a single file named fruits.bzip2 that can be unzipped to reveal these three files.
Here's an example of the request parameters I'm making to https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Bulk_Compress_GCS_Files
{
  "jobName": "ziptest15",
  "environment": {
    "zone": "us-central1-a"
  },
  "parameters": {
    "inputFilePattern": "gs://PROJECT_ID.appspot.com/testing/samplefolder1a/*.jpg",
    "outputDirectory": "gs://PROJECT_ID.appspot.com/testing/zippedfiles/",
    "outputFailureFile": "gs://PROJECT_ID.appspot.com/testing/zippedfiles/failure.csv",
    "compression": "BZIP2"
  }
}
The best way to achieve that is to create an app that (see the sketch after this list):
Downloads locally all the files under a GCS prefix (what you call a "directory", but directories don't exist on GCS, only files sharing the same prefix)
Creates an archive (it can be a ZIP or a TAR; ZIP won't really compress the images, since image formats are already compressed. What you mainly want is a single file with all the images in it)
Uploads the archive to GCS
Cleans up the local files
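A minimal sketch of such an app, using the google-cloud-storage client and a tar archive (the bucket name, prefix, and output object name are placeholders):

import tarfile
import tempfile
from pathlib import Path

from google.cloud import storage

def archive_prefix(bucket_name, prefix, output_blob_name):
    client = storage.Client()
    bucket = client.bucket(bucket_name)

    with tempfile.TemporaryDirectory() as tmp:
        local_dir = Path(tmp)
        archive_path = local_dir / "archive.tar.gz"

        # 1. Download every object that shares the prefix.
        local_files = []
        for blob in client.list_blobs(bucket_name, prefix=prefix):
            if blob.name.endswith("/"):  # skip "folder" placeholder objects
                continue
            target = local_dir / Path(blob.name).name
            blob.download_to_filename(str(target))
            local_files.append(target)

        # 2. Create a single archive containing all of them.
        with tarfile.open(archive_path, "w:gz") as tar:
            for f in local_files:
                tar.add(f, arcname=f.name)

        # 3. Upload the archive back to GCS.
        bucket.blob(output_blob_name).upload_from_filename(str(archive_path))
    # 4. The temporary directory (and all local copies) is removed on exit.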
Now you have to choose where to run this app.
On Cloud Run, you are limited by the space that you have in memory (for now; new features are coming). Currently you are limited to 8 GB of memory (and soon 16 GB), so your app will be able to process a total image size of about 45% of the memory capacity (45% for the images, 45% for the archive, 10% for the app's memory footprint). Set the concurrency parameter to 1.
If you need more space, you can use Compute Engine.
Set up a startup script that runs your app and automatically stops the VM at the end. The script reads its parameters from the metadata server and runs your app with the correct values (a sketch of reading such a metadata attribute follows below).
Before each run, update the Compute Engine metadata with the prefix to process (and maybe other app parameters).
-> The issue is that you can only run one process at a time. Otherwise you need to create a VM for each job, and then delete the VM at the end of the startup script instead of stopping it.
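A minimal sketch of reading a custom metadata attribute from inside the VM (the attribute name "prefix-to-process" is just an example):

import requests

# Custom instance metadata is served by the GCE metadata server;
# the Metadata-Flavor header is required.
METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/instance/attributes/"

def get_instance_attribute(name):
    resp = requests.get(METADATA_URL + name, headers={"Metadata-Flavor": "Google"})
    resp.raise_for_status()
    return resp.text

prefix_to_process = get_instance_attribute("prefix-to-process")  # example attribute name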
A side solution is to use Cloud Build. Run a build with the parameters in the substitution variables and perform the job in Cloud Build. You are limited to 10 builds in parallel. Use the diskSizeGb build option to set the correct disk size according to your file size requirements.
The Dataflow template only zips each file individually; it doesn't create an archive.

Unable to Upload Huge Files/Datasets on Google Colab

I am uploading a TSV file for processing on Google Colab. The file is 4 GB and the upload has not completed for a very long time (hours). Any pointers here would be of great help.
It could be your internet connection. Colab's upload function works best for small files such as .py scripts. For huge files, I'd suggest you upload the file to Google Drive in your account and then simply move or copy it to your Google Colab instance:
1. Copy the file you want to use:
%cp "path/to/the file/file_name.extension" "path/to/your/google-colab-instance"
The Google Colab instance path is usually /content/.
Similarly,
2. Move the file you want to use:
%mv "path/to/the file/file_name.extension" "path/to/your/google-colab-instance"
The first "" would be the path to where you uploaded the .csv file in your drive.
Hope this helps. Let me know in the comments.
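For completeness, a minimal sketch that mounts Drive first and then copies the file into /content (the Drive path is a placeholder):

from google.colab import drive

# Mount Google Drive into the Colab VM (you will be asked to authorize access).
drive.mount('/content/drive')

# Copy the uploaded file from Drive into the Colab instance.
# The path under "My Drive" is a placeholder -- adjust it to where you uploaded the file.
%cp "/content/drive/My Drive/data/file_name.tsv" "/content/"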

Pytorch: Google cloud storage, persistent disk for training models on DLVM?

I'm wondering what's the best way to go about
a) reading in the data
b) writing data while training a model?
My data is in a GCS bucket, about 1 TB, from a Dataflow job.
For writing data (all I want is model checkpoints and logs), do people just write to a zonal persistent disk rather than to Google Cloud Storage? It is a large model, so the checkpoints take up a fair bit of space.
I can't seem to write data to Cloud Storage without writing, say, a context manager and byte-string code everywhere I want to write.
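(For context, the kind of buffer-and-upload boilerplate meant here looks roughly like this; the bucket and object names are placeholders.)

import io
import torch
from google.cloud import storage

def save_checkpoint_to_gcs(state_dict, bucket_name, blob_name):
    # Serialize the checkpoint into an in-memory buffer...
    buffer = io.BytesIO()
    torch.save(state_dict, buffer)
    buffer.seek(0)
    # ...then upload the raw bytes to the GCS object.
    storage.Client().bucket(bucket_name).blob(blob_name).upload_from_file(buffer)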
Now for reading in data:
PyTorch doesn't have a good way to read data from a GCS bucket the way TensorFlow does?
So what should I do? I've tried gcsfuse, which I think could work, but when I 'mount' the bucket I can only see inside the directory I selected, not subdirectories. Is this normal?
Would gcsfuse be the right way to load data from GCS?
Thanks.

Unable to import more than 1000 files from Google Cloud Storage to Cloud Data Prep

I have been trying to run a Cloud Data Prep flow which takes files from Google Cloud Storage.
The files on Google Cloud Storage get updated daily and there are more than 1000 files in the bucket right now. However, I am not able to fetch more than 1000 files from the bucket.
Is there any way to get the data from Cloud Storage? If not, is there any alternative way from which we can achieve this?
You can load a large number of files using the + button next to a folder in the file browser. This will load all the files in that folder (or, more precisely, under that prefix) when running a job on Dataflow.
There is, however, a limit when browsing or using the parameterization feature. Some users might have millions of files, and searching among all of them is not possible (as GCS only allows filtering by prefix).
See the limitations on that page for more details:
https://cloud.google.com/dataprep/docs/html/Import-Data-Page_57344837