Manipulate a large number of files to reformat in Google Cloud - google-cloud-platform

I have a large number of JSON files in Google Cloud Storage that I would like to load into BigQuery. The average file size is 5 MB, uncompressed.
The problem is that they are not newline delimited, so I can't load them as-is into BigQuery.
What's my best approach here? Should I use Cloud Functions or Dataprep, or just spin up a server and have it download each file, reformat it, and upload it back to Cloud Storage and then into BigQuery?

Do not compress the data before loading it into BigQuery. Also, 5 MB is small for BigQuery; I would look at consolidation strategies and maybe changing the file format while processing each JSON file.
You can use Dataprep, Dataflow or even Dataproc. Depending on how many files you have, one of these may be the best choice; anything larger than, say, 100,000 5 MB files will require one of these big systems with many nodes.
Cloud Functions would take too long for anything more than a few thousand files.
Another option is to write a simple Python program that preprocesses your files on Cloud Storage and loads them directly into BigQuery. We are only talking about 20 or 30 lines of code, unless you add consolidation. A 5 MB file would take about 500 ms to download, process and write back; I am not sure about the BigQuery load time. For 50,000 5 MB files, expect 12 to 24 hours for one thread on a large Compute Engine instance (you need high network bandwidth).
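A minimal sketch of that preprocessing idea, assuming each source file holds a single JSON array; the bucket and table names below are placeholders:

# Sketch: rewrite a JSON-array file as newline-delimited JSON, then load it into BigQuery.
# Assumes each source file contains one JSON array; all names are placeholders.
import json
from google.cloud import storage, bigquery

storage_client = storage.Client()
bq_client = bigquery.Client()

src_bucket = storage_client.bucket("my-source-bucket")      # placeholder bucket
out_bucket = storage_client.bucket("my-ndjson-bucket")      # placeholder bucket

def convert_and_load(blob_name):
    # Download and parse the original JSON array.
    records = json.loads(src_bucket.blob(blob_name).download_as_text())
    # Re-serialize as newline-delimited JSON and write it back to Cloud Storage.
    ndjson = "\n".join(json.dumps(r) for r in records)
    out_bucket.blob(blob_name).upload_from_string(ndjson, content_type="application/json")
    # Load the rewritten file into BigQuery.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    )
    load_job = bq_client.load_table_from_uri(
        f"gs://{out_bucket.name}/{blob_name}",
        "my-project.my_dataset.my_table",                   # placeholder table
        job_config=job_config,
    )
    load_job.result()  # wait for the load job to finish

for blob in storage_client.list_blobs("my-source-bucket"):
    convert_and_load(blob.name)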
Another option is to spin up multiple Compute Engine instances. One instance puts the names of N files (something like 4 or 16) into each Pub/Sub message; multiple Compute Engine instances then subscribe to the same topic and process the files in parallel. Again, this is only another 100 or so lines of code.
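A sketch of that fan-out, with placeholder project, topic and subscription names: one publisher batches file names, and each worker pulls a batch and processes it (reusing the convert-and-load routine from the sketch above):

# Sketch: publisher batches N file names per Pub/Sub message; workers pull and process them.
# Project, topic and subscription names are placeholders.
import json
from google.cloud import pubsub_v1, storage

PROJECT = "my-project"
BATCH_SIZE = 16

def publish_batches():
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT, "json-files")
    names = [b.name for b in storage.Client().list_blobs("my-source-bucket")]
    for i in range(0, len(names), BATCH_SIZE):
        batch = names[i:i + BATCH_SIZE]
        publisher.publish(topic_path, json.dumps(batch).encode("utf-8")).result()

def worker():
    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path(PROJECT, "json-files-sub")

    def callback(message):
        for blob_name in json.loads(message.data.decode("utf-8")):
            convert_and_load(blob_name)   # reformat + load, as in the sketch above
        message.ack()

    # Block and process messages as they arrive.
    subscriber.subscribe(sub_path, callback=callback).result()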
If your project consists of many millions of files, network bandwidth and compute time will be an issue unless time is not a factor.

You can use Dataflow to do this.
Choose the “Text Files on Cloud Storage to BigQuery” template:
A pipeline that can read text files stored in GCS, perform a transform
via a user defined javascript function, and load the results into
BigQuery. This pipeline requires a javascript function and a JSON
describing the resulting BigQuery schema.
You will need to add a UDF in JavaScript that converts from JSON to newline-delimited JSON when creating the job.
This will retrieve the files from GCS, convert them and upload them to BigQuery automatically.

Related

AWS Redshift and small datasets

I have an S3 bucket to which many different small files (2 files of 1 kB per minute) are uploaded.
Is it good practice to ingest them into Redshift immediately, triggered by a Lambda?
Or would it be better to push them to a staging area like Postgres and then, at the end of the day, do a batch ETL from the staging area to Redshift?
Or maybe build a manifest file that contains all of the file names for the day and use the COPY command to ingest them into Redshift?
As Mitch says, #3. Redshift wants to work on large data sets and if you ingest small things many times you will need to vacuum the table. Loading many files at once fixes this.
However, there is another potential problem: your files are too small for efficient bulk retrieval from S3. S3 is an object store, and each request needs to be translated from a bucket/object-key pair to a location in S3. This takes on the order of 0.5 seconds. That is not an issue when loading a few files at a time, but if you need to load a million of them in series, that's 500K seconds of lookup time. Redshift will do the COPY in parallel, but only up to the number of slices you have in your cluster; it is still going to take a long time.
So depending on your needs, you may need to think about a change in your use of S3. If so, you may end up with a Lambda that combines small files into bigger ones as part of your solution. You can run this in parallel with the Redshift COPY if you only need to load many, many files at once during some recovery process. But an archive of 1 billion 1 kB files will be near useless if they need to be loaded quickly.
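For option #3, here is a rough sketch of building a daily manifest and the COPY that consumes it; the bucket, prefix, table and IAM role below are placeholders:

# Sketch: build a daily manifest of small S3 files and COPY them into Redshift in one go.
# Bucket, prefix, table and IAM role are placeholders.
import json
import boto3

s3 = boto3.client("s3")
day = "2023-01-01"

# List the day's small files and write a Redshift manifest for them.
keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket", Prefix=f"incoming/{day}/"):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

manifest = {"entries": [{"url": f"s3://my-bucket/{k}", "mandatory": True} for k in keys]}
s3.put_object(Bucket="my-bucket", Key=f"manifests/{day}.manifest",
              Body=json.dumps(manifest).encode("utf-8"))

# Run this COPY from your SQL client or scheduler of choice.
copy_sql = f"""
COPY my_schema.my_table
FROM 's3://my-bucket/manifests/{day}.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS CSV
MANIFEST;
"""
print(copy_sql)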

batch predictions in GCP Vertex AI

While trying out batch predictions in GCP Vertex AI for an AutoML model, the batch prediction results span several files (which is not convenient from a user perspective). If it were a single batch prediction result file, i.e. all the records in one file, the procedure would be much simpler.
For instance, I had 5585 records in my input dataset file. The batch prediction results comprise 21 files, each with 200-300 records, covering the 5585 records altogether.
Batch prediction on an image, text, video or tabular AutoML model runs the job using distributed processing, which means the data is distributed among an arbitrary cluster of virtual machines and processed in an unpredictable order. That is why the prediction results are stored across various files in Cloud Storage. Since the batch prediction output files are not generated in the same order as the input file, a feature request has been raised, and you can track updates on this request from this link.
We cannot provide an ETA at this moment but you can follow the progress in the issue tracker and you can ‘STAR’ the issue to receive automatic updates and give it traction by referring to this link.
However, if you are doing batch prediction for a tabular AutoML model, you have the option to choose BigQuery as the output destination, so that all the prediction output is stored in a single table, and you can then export the table data to a single CSV file.
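A hedged sketch of that export step with the BigQuery Python client (project, dataset, table and bucket names are placeholders); a single non-wildcard destination URI yields one file as long as the exported data stays under BigQuery's roughly 1 GB per-file extract limit:

# Sketch: export the tabular batch-prediction output table to a single CSV in GCS.
# Table and bucket names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
extract_job = client.extract_table(
    "my-project.my_dataset.predictions_2023_01_01",   # table written by the batch prediction job
    "gs://my-bucket/predictions/predictions.csv",      # single URI => single output file
    job_config=bigquery.ExtractJobConfig(destination_format="CSV"),
)
extract_job.result()  # wait for the export to finish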

Ideas to take a stream amongst 200-1000 servers and create one single file quickly

We are on Google Cloud Platform, so technologies there would be a good win. We have a huge file that comes in, and Dataflow scales on the input to break up the file quite nicely. After that, however, it streams through many systems: microservice1 over to data connectors grabbing related data, over to ML, and finally over to a final microservice.
Since the final stage could be around 200-1000 servers depending on load, how can we take all the requests coming in and write them back out (yes, we have a file id attached to every request, including a customerRequestId in case a file is dropped multiple times)? We only need every line with the same customerRequestId to be written to the same output file.
What is the best method to do this? The resulting file is almost always a CSV file.
Any ideas or good options I can explore? Dataflow was good at ingesting and reading a massively large file in parallel; is it also good at taking in various inputs on a cluster of nodes (not a single node, which would bottleneck us)?
EDIT: I seem to recall HDFS has files partitioned across nodes, and I think they can be written by many nodes at the same time somehow (a node per partition). Does anyone know if Google Cloud Storage files work this way as well? Is there a way to have 200 nodes writing to 200 partitions of the same file in Google Cloud Storage in such a way that it is all one file?
EDIT 2:
I see that there is a streaming Pub/Sub to BigQuery option that could be done as one stage in this list: https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming
HOWEVER, in this list there is no batch BigQuery to CSV template (what our customer wants). I do see a BigQuery to Parquet option here: https://cloud.google.com/dataflow/docs/guides/templates/provided-batch
I would prefer to go directly to CSV, though. Is there a way?
thanks,
Dean
Your case is complex and hard (and expensive) to reproduce. My first idea is to use BigQuery: sink all the data into the same table with Dataflow.
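A rough Apache Beam (Python) sketch of that sink step, assuming the per-line results arrive on a Pub/Sub topic as JSON messages carrying a customerRequestId field; all names below are placeholders:

# Sketch: stream the per-line results from Pub/Sub into a single BigQuery table.
# Topic, table and schema are placeholder assumptions.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True, project="my-project", region="us-central1",
                          runner="DataflowRunner", temp_location="gs://my-bucket/tmp")

with beam.Pipeline(options=options) as p:
    (p
     | "ReadResults" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/final-results")
     | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "WriteToBQ" >> beam.io.WriteToBigQuery(
           "my-project:my_dataset.results",
           schema="customerRequestId:STRING,line:STRING",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))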
Then, create a temporary table containing only the data to export to CSV, like this:
CREATE TABLE `myproject.mydataset.mytemptable`
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
) AS
SELECT ....
Then export the temporary table to CSV. If the table is less than 1 GB, only one CSV file will be generated.
If you need to orchestrate these steps, you can use Workflows.

How to trigger auto update Google BigQuery Dataset every time CSV upload on Google Cloud Storage

I am trying to automate the entire data load: whenever I upload a file to Google Cloud Storage, it should automatically trigger the data to be loaded into the BigQuery dataset. I know there is a daily scheduled update available, but I want something that triggers only when a CSV file is (re)uploaded.
You have 2 possibilities:
Either you react to events: you can plug a Cloud Function into Google Cloud Storage events. The event message tells you which file was stored in GCS and you can do what you want with it, for example run a load job from Google Cloud Storage.
Or, do nothing! Leave the file in GCS and create a BigQuery federated (external) table that reads directly from GCS.
With these 2 solutions, your data is accessible from BigQuery, and your Data Studio graph can query BigQuery. However:
The load job is more efficient: you can partition and cluster your data to optimize speed and cost. However, you duplicate your data (from GCS) and you have to write and run your function. Still, the cost is very low and the function very simple; for big data it's my recommended solution.
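A minimal sketch of that first option, as a background Cloud Function on the Cloud Storage finalize event (the dataset and table names below are placeholder assumptions):

# Sketch: background Cloud Function triggered on object finalize in GCS.
# It runs a BigQuery load job for the uploaded CSV; dataset/table names are placeholders.
from google.cloud import bigquery

def load_csv_to_bq(event, context):
    """Triggered by google.storage.object.finalize."""
    if not event["name"].endswith(".csv"):
        return  # ignore non-CSV uploads
    uri = f"gs://{event['bucket']}/{event['name']}"
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    client.load_table_from_uri(uri, "my_dataset.my_table", job_config=job_config).result()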
Federated tables are very useful when the quantity of data is low, for occasional access or for prototyping. You can't cluster or partition the data, and queries are slower than on data loaded into BigQuery (because the CSV parsing is performed on the fly).
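And a sketch of the federated option, defining an external table over the CSV files in GCS (bucket, dataset and table names are placeholders):

# Sketch: create a BigQuery external (federated) table over CSV files sitting in GCS.
# Bucket, dataset and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/exports/*.csv"]
external_config.autodetect = True
external_config.options.skip_leading_rows = 1

table = bigquery.Table("my-project.my_dataset.my_external_table")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)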
So, big data is a wide area: do you need to transform the data before the load? Can you transform it after the load? How will you chain the queries one after another? ...
Don't hesitate if you have other questions on this!

Storing and processing a lot of tiny files

There are a lot of files, ranging in size from 1 kB to 5 MB, on our servers. The total size of those files is about 7 TB. The processing algorithm: read each file and make some decisions about it. The files may come in several formats: doc, txt, png, bmp, etc. Therefore I can't simply merge those files into bigger ones.
How can I effectively store and process those files? What technology fits this task well?
You can use various technologies to store and process these files. Below are the ones you can consider.
1 Apache Kafka: you can create a different topic for each file format and push your data into these topics (see the Kafka sketch after this list).
Advantage: based on your load, you can easily scale up your consumption speed.
2 Hadoop: you can store your data in HDFS and design MR jobs to process it.
3 You can use any document-oriented NoSQL database to store your data.
Note: all the above solutions store your data in a distributed fashion, and you can run them on commodity machines.
Store your data in a cloud (AWS, Google, Azure) and use their APIs to get and process the data (if you also want your data to be shared with other applications).
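For item 1, a sketch of the per-format Kafka topics using the kafka-python client; the broker address, topic names and paths are placeholders:

# Sketch: push each file into a Kafka topic chosen by its format (kafka-python client).
# Broker address, topic names and paths are placeholders.
import os
from kafka import KafkaProducer

# Raise max_request_size so files up to ~5 MB fit in a single message.
producer = KafkaProducer(bootstrap_servers="broker:9092", max_request_size=10_485_760)

TOPIC_BY_EXT = {".doc": "files-doc", ".txt": "files-txt",
                ".png": "files-png", ".bmp": "files-bmp"}

def publish_file(path):
    ext = os.path.splitext(path)[1].lower()
    topic = TOPIC_BY_EXT.get(ext, "files-other")
    with open(path, "rb") as f:
        # File name as the key, raw bytes as the value.
        producer.send(topic, key=path.encode("utf-8"), value=f.read())

for name in os.listdir("/data/incoming"):
    publish_file(os.path.join("/data/incoming", name))
producer.flush()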
Start by segregating files into different directories based on type. You can even have partitions within the individual directories, for example /data/images/YYYY-MM-DD, /data/text/YYYY-MM-DD (see the sketch after these steps).
Use MultipleInputs with an appropriate InputFormat for each path.
Normalize the data into a generic format before sending it to the reducer if needed.
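A small sketch of the segregation step above, assuming the files start out in one flat local directory; all paths are placeholders:

# Sketch: move files from a flat landing directory into /data/<type>/YYYY-MM-DD/ partitions.
# Source and destination paths are placeholder assumptions.
import datetime
import shutil
from pathlib import Path

TYPE_BY_EXT = {".png": "images", ".bmp": "images", ".txt": "text", ".doc": "documents"}
landing = Path("/data/landing")

for f in landing.iterdir():
    if not f.is_file():
        continue
    file_type = TYPE_BY_EXT.get(f.suffix.lower(), "other")
    day = datetime.date.fromtimestamp(f.stat().st_mtime).isoformat()  # YYYY-MM-DD
    dest = Path("/data") / file_type / day
    dest.mkdir(parents=True, exist_ok=True)
    shutil.move(str(f), str(dest / f.name))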
There are several ways to ingest the data for your needs:
Use Kafka to store data under different topics based on type (image, text) and then copy it to HDFS from Kafka.
Use Flume
As you have a huge amount of data, roll up the data in HDFS on a weekly basis. You can use Oozie or Falcon to automate the weekly rollup process.
Use CombineFileInputFormat in your Spark or MR code (see the PySpark sketch after this list).
Last but not least, map the data as tables using Hive to expose them to external clients.
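For the CombineFileInputFormat point, a short PySpark sketch that reads many small text files into a modest number of partitions (wholeTextFiles is backed by a combine-style input format); the HDFS paths are placeholders:

# Sketch: read many small text files into a small number of partitions with PySpark.
# wholeTextFiles groups small files into combined splits; the HDFS paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-files").getOrCreate()
sc = spark.sparkContext

# Each record is (file_path, file_content); minPartitions caps how finely the input is split.
files = sc.wholeTextFiles("hdfs:///data/text/2023-01-01/*", minPartitions=32)

# Example processing: a per-file decision based on content length.
decisions = files.mapValues(lambda content: "big" if len(content) > 100_000 else "small")
decisions.saveAsTextFile("hdfs:///data/decisions/2023-01-01")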
Hadoop Archives (HAR) are the usual way to address this.
More details about this are available on : https://hadoop.apache.org/docs/r2.7.0/hadoop-archives/HadoopArchives.html
You also have the option to use SequenceFiles or HBase, as described in: https://blog.cloudera.com/blog/2009/02/the-small-files-problem/
But looking at your use case, HAR fits the bill.