Unable to Upload Huge Files/Datasets on Google Colab - google-cloud-platform

I am uploading a 4 GB TSV file for processing on Google Colab, and the upload has not completed after a very long time (hours). Any pointers here would be a great help.

It could be your internet connection. The upload function in Google Colab is better suited to small files such as .py scripts. For huge files, I'd suggest you upload the file to Google Drive in your account and then simply move or copy it to your Google Colab instance:
1. Copy the file you want to use:
%cp "path/to/the file/file_name.extension" "path/to/your/google-colab-instance"
The Google Colab instance path is usually /content/.
Similarly,
2. Move the file you want to use:
%mv "path/to/the file/file_name.extension" "path/to/your/google-colab-instance"
The first quoted path would be the path to where you uploaded the file in your Drive.
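For reference, here is a minimal Python sketch of the same workflow; the Drive folder and file names below are placeholders, so adjust them to match where you uploaded the TSV:
from google.colab import drive
import shutil

# Mount your Google Drive inside the Colab VM (an authorization prompt appears once).
drive.mount('/content/drive')

# Hypothetical paths: replace with the location of your file in Drive.
src = '/content/drive/MyDrive/datasets/data.tsv'
dst = '/content/data.tsv'

# Copy the file onto the Colab instance's local disk.
shutil.copy(src, dst)
Reading the local copy from /content/ is usually much faster than reading a file of this size repeatedly over the Drive mount.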
Hope this helps. Let me know in the comments.

Related

Unable to import more than 1000 files from Google Cloud Storage to Cloud Data Prep

I have been trying to run a Cloud Data Prep flow which takes files from Google Cloud Storage.
The files on Google Cloud Storage get updated daily, and there are more than 1000 files in the bucket right now. However, I am not able to fetch more than 1000 files from the bucket.
Is there any way to get the data from Cloud Storage? If not, is there any alternative way from which we can achieve this?
You can load a large number of files using the + button next to a folder in the file browser. This will load all the files in that folder (or, more precisely, under that prefix) when running a job on Dataflow.
There is, however, a limit when browsing/using the parameterization feature. Some users might have millions of files, and searching among all of them is not possible (GCS only allows filtering by prefix).
See the limitations on that page for more details:
https://cloud.google.com/dataprep/docs/html/Import-Data-Page_57344837
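If you only need to pull the file list (or the files themselves) outside of Dataprep, the Cloud Storage client libraries paginate past the 1000-object limit automatically. A rough Python sketch, with a made-up bucket name and prefix:
from google.cloud import storage

client = storage.Client()
# Hypothetical bucket and prefix; list_blobs pages through the results for you,
# so more than 1000 matching objects is not a problem here.
for blob in client.list_blobs('my-daily-bucket', prefix='exports/'):
    print(blob.name)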

How to allow google cloud storage to save duplicate files and not overwrite

I am using a Google Cloud Storage bucket to save file uploads from my Django web application. However, if a file with the same name is uploaded, it overwrites the existing file, and I do not want this to happen. I want to allow duplicate file names to be saved at the same location. Before moving to Google Cloud Storage, when I used my computer's hard disk to save files, Django would smartly update the filename in the database as well as on the hard disk.
I upload files under the name given by the user and concatenate a timestamp including seconds and milliseconds. But clients see the file name as they added it, since I remove that part of the string when it is displayed in the view.
Example:
image1-16-03-2022-12-20-32-user-u123.pdf
image1-27-01-2022-8-22-32-usuario-anotheruser.pdf
both users would see the name image1
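A rough Python sketch of that naming scheme (the timestamp format and separators are just one way to do it, and it assumes the original name itself contains no dash):
from datetime import datetime
import os

def storage_name(original_name, username):
    # Append a timestamp down to milliseconds plus the user, keeping the extension.
    base, ext = os.path.splitext(original_name)
    stamp = datetime.now().strftime('%d-%m-%Y-%H-%M-%S-%f')[:-3]
    return f'{base}-{stamp}-user-{username}{ext}'

def display_name(stored_name):
    # Show clients only the part before the first dash, as in the examples above.
    base, ext = os.path.splitext(stored_name)
    return base.split('-', 1)[0] + ext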

Specify output filename of Cloud Vision request

So I'm sitting with Google Cloud Vision (for Node.js), trying to dynamically upload a document to a Google Cloud Storage bucket, process it using the Google Cloud Vision API, and then download the .json afterwards. However, when Cloud Vision processes my request and places the saved text extraction in my bucket, it appends output-1-to-n.json to the end of the filename. So let's say I'm processing a file called foo.pdf that's 8 pages long: the output will not be foo.json (even though I specified that), but rather foooutput1-to-8.json.
Of course, this could be remedied by checking the page count of the PDF before uploading it and appending that to the path I search for when downloading, but that seems like such an unnecessary, hacky solution. I can't seem to find anything in the documentation about not appending output-1-to-n to outputs. Extremely happy for any pointers!
You can't specify a single output file for asyncBatchAnnotate because, depending on your input, many files may get created. The output config is only a prefix, and you have to do a wildcard search in GCS for your given prefix (so you should make sure your prefix is unique).
For more details see this answer.
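The question is about the Node.js client, but the prefix-search idea is the same in any language. A Python sketch with made-up bucket and prefix names:
from google.cloud import storage

client = storage.Client()
# Hypothetical values: use the same gcs_destination prefix you passed to the Vision request,
# made unique per document (e.g. by including the source file name or an ID).
bucket_name = 'my-ocr-bucket'
prefix = 'results/foo-1234/'

# Cloud Vision writes one or more output-*.json shards under this prefix,
# so collect everything that matches instead of guessing a single file name.
for blob in client.list_blobs(bucket_name, prefix=prefix):
    print(blob.name)  # e.g. results/foo-1234/output-1-to-8.json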

Uploading data into BigQuery

Is there a way to unzip a file that has been loaded into Google Cloud Platform?
I have a 33 GB zip file filled with CSV files en route to Google BigQuery.
Unzipped, the data is over 215 GB.
Is there a way to load the zip file and uncompress it in GCP instead of trying to upload 215 GB of raw data?
Thanks.
If it's gzipped then you can load into BigQuery from Google Cloud Storage. Otherwise, I don't believe there is a way to unzip once it's in GCS.
One option is to spin up a Google Compute Engine instance with enough hard drive space (preferably SSDs), do whatever you want there, push the results to GCS from that instance, and then shut down the Compute Engine instance.
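For the gzip case, here is a minimal sketch with the BigQuery Python client (the dataset, table, and GCS URI are placeholders); gzip-compressed CSVs are decompressed as part of the load:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,       # infer the schema from the CSV
    skip_leading_rows=1,   # assuming a header row
)

# Hypothetical GCS URI and destination table.
load_job = client.load_table_from_uri(
    'gs://my-bucket/exports/data.csv.gz',
    'my_project.my_dataset.my_table',
    job_config=job_config,
)
load_job.result()  # wait for the load to finish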

.csv upload not working in Amazon Web Services Machine Learning - AWS

I have uploaded a simple 10-row CSV file (on S3) to the AWS ML website. It keeps giving me the error:
"We cannot find any valid records for this datasource."
There are records there, and the Y variable is continuous (not binary). I am pretty much stuck at this point because there is only one button to move forward to build the machine learning model. Does anyone know what I should do to fix it? Thanks!
The only way I have been able to upload a .csv file I created myself to S3 is by downloading an existing .csv file from my S3 bucket, modifying the data, uploading it, and then changing the name in the S3 console.
Could you post the first few lines of the contents of the .csv file? I am able to upload my own .csv file along with a schema that I have created, and it is working. However, I did have issues in that Amazon ML was unable to create the schema for me.
Also, did you try saving the data in something like Sublime, Notepad++, etc. in order to get a different format? On my Mac with Microsoft Excel, the CSV did not work, but when I tried LibreOffice on my Windows machine, the same file worked perfectly.
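If the Excel export is the culprit (older Mac versions of Excel wrote bare-CR line endings that some parsers reject), one possible workaround is to rewrite the file with Python's csv module before uploading; the file names below are placeholders:
import csv

# Read the original file (universal newline handling copes with CR, LF, or CRLF endings),
# then write it back out with standard CSV quoting and line terminators.
with open('original.csv', 'r', newline='') as src, open('clean.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow(row)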