batch predictions in GCP Vertex AI - google-cloud-platform

While trying out batch predictions in GCP Vertex AI for an AutoML model, I found that the results span several files, which is not convenient from a user perspective. If there were a single batch prediction result file, i.e. one covering all the records, the procedure would be much simpler.
For instance, I had 5,585 records in my input dataset file. The batch prediction results comprise 21 files, each holding somewhere between 200 and 300 records, covering the 5,585 records altogether.

Batch predictions on image, text, video and tabular AutoML models run using distributed processing: the data is spread across an arbitrary cluster of virtual machines and processed in an unpredictable order, which is why the prediction results end up stored across several files in Cloud Storage. Since the batch prediction output files are not generated in the same order as the input file, a feature request has been raised, and you can track updates on this request from this link.
We cannot provide an ETA at this moment, but you can follow the progress in the issue tracker and ‘STAR’ the issue to receive automatic updates and give it traction by referring to this link.
However, if you are doing batch prediction for a tabular AutoML model, you have the option of choosing BigQuery as the output destination. All the prediction output is then stored in a single table, and you can export the table data to a single CSV file.
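For the tabular case, a minimal sketch of requesting the batch prediction with a BigQuery destination could look like this with the google-cloud-aiplatform SDK (the project, region, model ID and table names below are placeholders, not values from the question):
from google.cloud import aiplatform

# Placeholder project, region and model ID.
aiplatform.init(project="my-project", location="us-central1")
model = aiplatform.Model("1234567890123456789")

# Ask for BigQuery output so all predictions land in a single table
# instead of many files in Cloud Storage.
job = model.batch_predict(
    job_display_name="tabular-batch-prediction",
    bigquery_source="bq://my-project.my_dataset.input_table",
    bigquery_destination_prefix="bq://my-project.my_dataset",
    instances_format="bigquery",
    predictions_format="bigquery",
    sync=True,
)
print(job.resource_name)
The destination table can then be exported to a single CSV file.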

Related

Ideas to take a stream amongst 200-1000 servers and create one single file quickly

We are in Google Cloud Platform, so technologies there would be a good win. We have a huge file that comes in, and Dataflow scales on the input to break up the file quite nicely. After that, however, it streams through many systems: microservice1, over to data connectors grabbing related data, over to ML, and finally over to a final microservice.
Since the final stage could be around 200-1000 servers depending on load, how can we take all the requests coming in and get them back into one file? (Yes, we have a file id attached to every request, including a customerRequestId in case a file is dropped multiple times.) We only need every line with the same customerRequestId written to the same output file.
What is the best method to do this? The resulting file is almost always a CSV file.
Any ideas or good options I can explore? Dataflow was good at ingesting and reading a massively large file in parallel, but is it equally good at taking in various inputs from a cluster of nodes (not a single node, which would bottleneck us)?
EDIT: I seem to recall HDFS has files partitioned across nodes, and I think they can be written by many nodes at the same time somehow (a node per partition). Does anyone know if Google Cloud Storage files work this way as well? Is there a way to have 200 nodes writing to 200 partitions of the same file in Google Cloud Storage in such a way that it is all one file?
EDIT 2:
I see that there is a streaming Pub/Sub to BigQuery option that could be done as one stage, in this list: https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming
HOWEVER, in this list there is no batch BigQuery to CSV template (what our customer wants). I do see a BigQuery to Parquet option, though, here: https://cloud.google.com/dataflow/docs/guides/templates/provided-batch
I would prefer to go directly to CSV, though. Is there a way?
thanks,
Dean
Your case is complex and hard (and expensive) to reproduce. My first idea is to use BigQuery: sink all the data into the same table with Dataflow.
Then, create a temporary table with only the data to export to CSV, like this:
CREATE TABLE `myproject.mydataset.mytemptable`
OPTIONS(
  expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
) AS
SELECT ....
Then export the temporary table to CSV. If the table is smaller than 1 GB, only one CSV file will be generated.
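As a sketch, the export step could be done with the BigQuery client library; the bucket and object names below are placeholders, and a single wildcard-free destination URI only works while the exported data stays under the 1 GB single-file limit:
from google.cloud import bigquery

client = bigquery.Client(project="myproject")

# Export the temporary table to a single CSV object in Cloud Storage.
extract_job = client.extract_table(
    "myproject.mydataset.mytemptable",
    "gs://my-bucket/exports/result.csv",
    job_config=bigquery.ExtractJobConfig(destination_format="CSV"),
)
extract_job.result()  # wait for the export to finish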
If you need to orchestrate these steps, you can use Workflows.

How to trigger an automatic Google BigQuery dataset update every time a CSV is uploaded to Google Cloud Storage

I am trying to automate the entire data loading: whenever I upload a file to Google Cloud Storage, it should automatically trigger a load of the data into the BigQuery dataset. I know that there is a scheduled daily update available, but I want something that triggers only when the CSV file is re-uploaded.
You have 2 possibilities:
Either you react to events: you can plug a function into Google Cloud Storage events. The event message tells you which file was stored in GCS and you can do what you want with it, for example run a load job from Google Cloud Storage (sketched below).
Or do nothing! Leave the file in GCS and create a BigQuery federated (external) table that reads from GCS.
With these 2 solutions, your data is accessible from BigQuery and your Data Studio graph can query BigQuery either way. However:
The load job is more efficient: you can partition and cluster your data to optimize speed and cost. However, you duplicate your data (from GCS) and you have to write and run your function. Still, the cost is very low and the function very simple. For big data it's my recommended solution.
The federated table is very useful when the quantity of data is low, for occasional access or for prototyping. You can't cluster or partition your data, and the queries are slower than on data loaded into BigQuery (because the CSV parsing is performed on the fly).
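A minimal sketch of the first option, written as a 2nd gen Cloud Function triggered by a Cloud Storage finalize event (the table name is a placeholder, and schema autodetection is an assumption):
import functions_framework
from google.cloud import bigquery

# Placeholder destination table.
TABLE_ID = "my-project.my_dataset.my_table"

@functions_framework.cloud_event
def gcs_to_bigquery(cloud_event):
    data = cloud_event.data
    uri = f"gs://{data['bucket']}/{data['name']}"

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        # Replace the table contents each time the CSV is re-uploaded.
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    # Run a load job from Cloud Storage for the file that triggered the event.
    client.load_table_from_uri(uri, TABLE_ID, job_config=job_config).result()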
So, big data is a wide area: do you need to transform the data before the load? Can you transform it after the load? How will you chain the queries one after another? ....
Don't hesitate if you have other questions on this!

Getting error "INTERNAL" when training a model with AutoML

I'm training a small model with AutoML entity extraction, but the training keeps failing with the error message "INTERNAL" and no other details.
I'm doing this from the Google Cloud console, and I've followed the same steps I've used successfully to train other models.
The dataset has two labels with a few hundred text items each, so I doubt it's a timeout or anything like that.
What might be causing this and is there a way to debug/get more visibility?
It could be that the dataset contains duplicate columns, which is not currently supported. If this is not your case, I'd suggest reaching out to GCP Support to check it internally.

Manipulate large number of files to reformat in google cloud

I have a large number of JSON files in Google Cloud Storage that I would like to load into BigQuery. The average file size is 5 MB, uncompressed.
The problem is that they are not newline delimited, so I can't load them as-is into BigQuery.
What's my best approach here? Should I use Cloud Functions or Dataprep, or just spin up a server and have it download each file, reformat it, upload it back to Cloud Storage and then load it into BigQuery?
Do not compress the data before loading it into BigQuery. Another point: 5 MB is small for BigQuery. I would look at consolidation strategies and maybe change the file format while processing each JSON file.
You can use Dataprep, Dataflow or even Dataproc. Depending on how many files you have, one of these may be the best choice. Anything larger than, say, 100,000 5 MB files will require one of these big systems with many nodes.
Cloud Functions would take too long for anything more than a few thousand files.
Another option is to write a simple Python program that preprocesses your files on Cloud Storage and directly loads them into BigQuery. We are only talking about 20 or 30 lines of code, unless you add consolidation. A 5 MB file would take about 500 ms to download, process and write back; I am not sure about the BigQuery load time. For 50,000 5 MB files, expect 12 to 24 hours for one thread on a large Compute Engine instance (you need high network bandwidth).
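As a rough sketch of such a script, assuming each source file holds a single JSON array (the bucket, prefix and table names are placeholders):
import json
from google.cloud import bigquery, storage

# Placeholder bucket, prefix and destination table.
BUCKET = "my-bucket"
TABLE_ID = "my-project.my_dataset.my_table"

storage_client = storage.Client()
bq_client = bigquery.Client()
bucket = storage_client.bucket(BUCKET)

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)

for blob in storage_client.list_blobs(BUCKET, prefix="raw/"):
    # Assumes each file is one JSON array; adjust the parsing if not.
    records = json.loads(blob.download_as_text())
    ndjson = "\n".join(json.dumps(r) for r in records)

    # Write the newline-delimited version back to GCS, then load it.
    out_blob = bucket.blob("ndjson/" + blob.name.split("/")[-1])
    out_blob.upload_from_string(ndjson, content_type="application/json")
    bq_client.load_table_from_uri(
        "gs://" + BUCKET + "/" + out_blob.name, TABLE_ID, job_config=job_config
    ).result()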
Another option is to spin up multiple Compute Engine instances. One engine puts the names of N files (something like 4 or 16) per message into Pub/Sub. Then multiple Compute instances subscribe to the same topic and process the files in parallel (see the sketch below). Again, this is only another 100 lines of code.
If your project consists of many millions of files, network bandwidth and compute time will be an issue unless time is not a factor.
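A sketch of the fan-out side of that Pub/Sub option, publishing batches of file names for the worker instances to pull (the project, topic, bucket and batch size are placeholders):
import json
from google.cloud import pubsub_v1, storage

# Placeholder project, topic, bucket and batch size.
PROJECT = "my-project"
TOPIC = "json-files-to-convert"
BUCKET = "my-bucket"
BATCH_SIZE = 16

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT, TOPIC)

names = [b.name for b in storage.Client().list_blobs(BUCKET, prefix="raw/")]

# Publish the file names in batches; each worker pulls a message,
# converts its batch of files and loads them into BigQuery.
for i in range(0, len(names), BATCH_SIZE):
    batch = names[i:i + BATCH_SIZE]
    publisher.publish(topic_path, json.dumps(batch).encode("utf-8")).result()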
You can use Dataflow to do this.
Choose the “Text Files on Cloud Storage to BigQuery” template:
A pipeline that can read text files stored in GCS, perform a transform via a user defined javascript function, and load the results into BigQuery. This pipeline requires a javascript function and a JSON describing the resulting BigQuery schema.
You will need to add a UDF in JavaScript that converts from JSON to newline-delimited JSON when creating the job.
This will retrieve the files from GCS, convert them and upload them to BigQuery automatically.

Pivoting 1,620 columns to rows in 360gb text file in aws

I have a pipe delimited text file that is 360GB, compressed (gzip).
It has over 1,620 columns. I can't show the exact field names, but here's basically what it is:
primary_key|property1_name|property1_value|property800_name|property800_value
12345|is_male|1|is_college_educated|1
Seriously, there are over 800 of these property name/value fields.
There are roughly 280 million rows.
The file is in an S3 bucket.
I need to get the data into Redshift, but the column limit in Redshift is 1,600.
The users want me to pivot the data. For example:
primary_key|key|value
12345|is_male|1
12345|is_college_educated|1
What is a good way to pivot the file in the aws environment? The data is in a single file, but I'm planning on splitting the data into many different files to allow for parallel processing.
I've considered using Athena. I couldn't find anything that states the maximum number of columns allowed by Athena. But, I found a page about Presto (on which Athena is based) that says “there is no exact hard limit, but we've seen stuff break with more than few thousand.” (https://groups.google.com/forum/#!topic/presto-users/7tv8l6MsbzI).
Thanks.
First, pivot your data, then load to Redshift.
In more detail, the steps are:
Run a Spark job (using EMR or possibly AWS Glue) which reads in your source S3 data and writes out a pivoted version to a different S3 folder. By this I mean that if a row has 800 name/value pairs, you write out 800 rows for it (see the sketch below). At the same time, you can split the output into multiple parts to enable a parallel load.
COPY this pivoted data into Redshift.
What I have learnt most of the time with AWS is that if you are hitting a limit, you are doing it the wrong way or not in a scalable way. Most of the time the services are designed with scalability and performance in mind.
We had similar problems, having 2000 columns. Here is how we solved it.
Split the file across 20 different tables, with 100+1 (primary key) columns each.
Do a select across all those tables in a single query to return all the data you want.
If you say you want to see all 1,600 columns in a select, then the business user is probably looking at the wrong columns for their analysis or even for machine learning.
To load 10 TB+ of data, we split the data into multiple files and loaded them in parallel; that way loading was faster.
Between Athena and Redshift, performance is the main difference; the rest is much the same. Redshift performs better than Athena: Athena's initial load time and scan time are higher than Redshift's.
Hope it helps.