Max file count using BigQuery Data Transfer job - google-cloud-platform

I have about 54,000 files in my GCS bucket. When I try to schedule a BigQuery Data Transfer job to move the files from the bucket into BigQuery, I get the following error:
Error code 9 : Transfer Run limits exceeded. Max size: 15.00 TB. Max file count: 10000. Found: size = 267065994 B (0.00 TB) ; file count = 54824.
I thought the max file count was 10 million.

I think that the BigQuery Transfer Service lists all the files matching the wildcard and then uses that list to load them. So it is the same as providing the full list to bq load ..., and therefore you hit the 10,000-URI limit.
This is probably necessary because the BigQuery Transfer Service skips already-loaded files, so it needs to look at them one by one to decide which ones to actually load.
I think that your only option is to schedule a job yourself and load the files directly into BigQuery, for example using Cloud Composer or a small Cloud Run service that can be invoked by Cloud Scheduler.
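If you go down that route, here is a minimal sketch of the batching idea using the google-cloud-bigquery and google-cloud-storage Python clients; the project, bucket, prefix and table names are placeholders, and each load job stays under the 10,000-URI limit:
from google.cloud import bigquery, storage

# Placeholder names -- adjust to your own project, bucket and table.
PROJECT = "my-project"
BUCKET = "my-bucket"
PREFIX = "exports/"
TABLE = "my-project.my_dataset.my_table"
MAX_URIS_PER_JOB = 10000  # load-job limit on source URIs

bq = bigquery.Client(project=PROJECT)
gcs = storage.Client(project=PROJECT)

# List every object under the prefix and build gs:// URIs.
uris = [
    f"gs://{BUCKET}/{blob.name}"
    for blob in gcs.list_blobs(BUCKET, prefix=PREFIX)
    if blob.name.endswith(".csv")
]

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Submit one load job per batch of at most 10,000 URIs.
for start in range(0, len(uris), MAX_URIS_PER_JOB):
    batch = uris[start:start + MAX_URIS_PER_JOB]
    job = bq.load_table_from_uri(batch, TABLE, job_config=job_config)
    job.result()  # wait, so concurrent-job quotas are never exceeded
    print(f"Loaded {len(batch)} files in job {job.job_id}")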

The error message Transfer Run limits exceeded is, as mentioned before, related to a known limit for load jobs in BigQuery. Unfortunately this is a hard limit and cannot be changed. There is an ongoing feature request to increase it, but for now there is no ETA for it to be implemented.
The main recommendation is to split the single operation into multiple processes that send data in requests that don't exceed this limit. That covers the main question: "Why do I see this error message and how do I avoid it?".
It is then natural to ask "how can I automate these actions or make them easier?". I can think of involving more products:
Dataflow, which will help you process the data that will be added to BigQuery. This is where you can split the work into multiple requests.
Pub/Sub, which will help you listen to events and automate when the processing starts.
Please take a look at this suggested implementation, where the aforementioned scenario is described in more detail.
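As a rough illustration of the Dataflow part (not the exact pipeline from the linked suggestion), a minimal Apache Beam pipeline in Python that reads the CSV files from GCS and writes them to BigQuery could look like this; the project, bucket, table and column names are placeholders:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    # Placeholder options -- fill in your own project, bucket and region.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFiles" >> beam.io.ReadFromText("gs://my-bucket/exports/*.csv", skip_header_lines=1)
            | "ParseCsv" >> beam.Map(lambda line: dict(zip(["col_a", "col_b"], line.split(","))))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.my_table",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )

if __name__ == "__main__":
    run()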
Hope this is helpful! :)

Related

Fastest way to get exact count of rows for a 100GB CSV file stored on S3

What is the fastest way of getting an exact count of rows for a 100GB CSV file stored on Amazon S3 without using Athena nor any Fargate or EC2 VM? I can't use Athena, because the CSV file isn't clean-enough for it. I can't use Fargates or EC2 VMs, because I need a purely serverless solution. I can't use third-party services like Snowflake (native AWS services only).
Also, 100GB is too large to fit within a Lambda Function's /tmp (limited to 10GB). I could try to run something like DuckDB (or any other streaming database engine) on a Lambda and scan the entire file with a SELECT COUNT(*) FROM "s3://myBucket/myFile.csv" query, but the Lambda is quite likely to timeout, because its read bandwidth from S3 is 100MB/s at best, and it cannot run for more than 15 minutes (900s).
I know the approximate size of the file.
Note: I have an inaccurate estimate of the number of rows provided by AWS Glue Data Catalog's crawler, with an error margin of -50%/+100%. This could be used for some kind of iterative or dichotomous process, but I could not figure any out. For example, I tried adding an OFFSET with a value lower than but close to the number of rows to the aforementioned query, but the Lambda running DuckDB timed out. That was disappointing and somewhat surprising, because a query like SELECT * FROM "s3://myBucket/myFile.csv" LIMIT 10 OFFSET 10000000 worked well.
The fastest solution is probably to use SelectObjectContent with ScanRange to parallelize the request on chunks of 50MB or so.
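Here is a rough sketch of that idea with boto3; the bucket, key and chunk size are placeholders, and in a real setup each scan range would be handled by its own Lambda invocation in parallel rather than in a loop. S3 Select counts the records whose first byte falls inside each range, so summing over adjacent non-overlapping ranges gives the total (minus one if the file has a header row):
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "myBucket", "myFile.csv"   # placeholders
CHUNK = 50 * 1024 * 1024                 # ~50 MB scan ranges

size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]

def count_range(start, end):
    # Count the records whose first byte falls inside [start, end] (inclusive).
    resp = s3.select_object_content(
        Bucket=BUCKET,
        Key=KEY,
        ExpressionType="SQL",
        Expression="SELECT COUNT(*) FROM S3Object",
        InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},
        OutputSerialization={"CSV": {}},
        ScanRange={"Start": start, "End": end},
    )
    payload = b"".join(
        event["Records"]["Payload"] for event in resp["Payload"] if "Records" in event
    )
    return int(payload.decode().strip())

# Sequential here for brevity; fan this out across Lambdas for speed.
row_count = sum(
    count_range(s, min(s + CHUNK, size) - 1) for s in range(0, size, CHUNK)
)
print(row_count)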
Have you tried AWS S3 Select (https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-select.html)? It lets you run queries on S3 files. I use the service to get basic insight into any file on S3 (provided it can be queried).

Used Dataflow's DLP to read from GCS and write to BigQuery - Only 50% data written to BigQuery

I recently started a Dataflow job to load data from GCS and run it through DLP's identification template and write the masked data to BigQuery. I could not find a Google-provided template for batch processing hence used the streaming one (ref: link).
I see only 50% of the rows are written to the destination BigQuery table. There is no activity on the pipeline for a day even though it is in the running state.
Yes, the DLP Dataflow template is a streaming pipeline, but with some easy changes you can also use it as a batch pipeline. Here is the template source code. As you can see, it uses a FileIO transform and polls/watches for new files every 30 seconds. If you take out the window transform and the continuous polling syntax, you should be able to execute it as a batch job.
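The template itself is written in Java, but to illustrate the difference in Beam terms (shown in Python here), the streaming flavour keeps watching for new files while a batch flavour matches the pattern once and stops; the file pattern below is a placeholder:
import apache_beam as beam
from apache_beam.io import fileio

PATTERN = "gs://my-bucket/incoming/*.csv"  # placeholder

with beam.Pipeline() as p:
    files = (
        p
        # The streaming template keeps polling for new files, roughly:
        #   fileio.MatchContinuously(PATTERN, interval=30)
        # For a one-shot batch run, match whatever exists right now instead:
        | "MatchOnce" >> fileio.MatchFiles(PATTERN)
        | "ReadMatches" >> fileio.ReadMatches()
        | "FilePaths" >> beam.Map(lambda readable_file: readable_file.metadata.path)
    )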
As for the pipeline not processing all the data, can you confirm whether you are running a large file with the default settings (e.g. workerMachineType, numWorkers, maxNumWorkers)? The current pipeline code uses line-based offsetting, which requires a highmem machine type with a large number of workers if the input file is large; e.g. for 10 GB / 80M lines you may need 5 highmem workers.
One thing you can try is to trigger the pipeline with more resources, e.g. --workerMachineType=n1-highmem-8, numWorkers=10, maxNumWorkers=10, and see if it's any better.
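If you launch the template programmatically, those overrides go into the runtime environment. Here is a hedged sketch using the Dataflow REST API through the Python API client; the project, region, template path and job name are placeholders:
from googleapiclient.discovery import build

PROJECT, REGION = "my-project", "us-central1"  # placeholders
TEMPLATE = "gs://dataflow-templates/latest/Stream_DLP_GCS_Text_to_BigQuery"

dataflow = build("dataflow", "v1b3")
response = dataflow.projects().locations().templates().launch(
    projectId=PROJECT,
    location=REGION,
    gcsPath=TEMPLATE,
    body={
        "jobName": "dlp-deid-rerun",
        "parameters": {
            # Template parameters (inputFilePattern, datasetName, ...) go here.
        },
        # Worker overrides equivalent to the flags mentioned above.
        "environment": {
            "machineType": "n1-highmem-8",
            "numWorkers": 10,
            "maxWorkers": 10,
        },
    },
).execute()
print(response.get("job", {}).get("id"))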
Alternatively, there is a V2 solution that uses byte-based offsetting with the state and timer API for optimized batching and resource utilization, which you can try out.

How to process files serially in cloud function?

I have written a Cloud Storage-triggered Cloud Function. I have 10-15 files landing at 5-second intervals in the bucket, and each one loads data into a BigQuery table (truncate and load).
While there are 10 files in the bucket, I want the Cloud Function to process them sequentially, i.e. one file at a time, as all the files access the same table.
Currently the Cloud Function gets triggered for multiple files at a time and fails on the BigQuery operation, as multiple files try to access the same table.
Is there any way to configure this in the Cloud Function?
Thanks in Advance!
You can achieve this by using Pub/Sub and the max instances parameter on Cloud Functions.
Firstly, use the notification capability of Google Cloud Storage and sink the events into a Pub/Sub topic.
Now you will receive a message every time an event occurs on the bucket. If you want to filter on file creation only (OBJECT_FINALIZE), you can apply a filter on the subscription. I wrote an article on this.
Then, create an HTTP function (an HTTP function is required if you want to apply a filter) with max instances set to 1. This way, only one instance of the function can run at a time. So, no concurrency!
Finally, create a Pub/Sub push subscription on the topic, with or without a filter, to call your function over HTTP.
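A minimal sketch of what that HTTP function could look like, assuming a push subscription on the GCS notification topic; the table name is a placeholder, and the bucketId/objectId attributes come from the Cloud Storage notification format:
from google.cloud import bigquery

bq = bigquery.Client()
TABLE = "my-project.my_dataset.my_table"  # placeholder

def handle_gcs_notification(request):
    # HTTP Cloud Function target for the Pub/Sub push subscription.
    envelope = request.get_json()
    attrs = envelope["message"].get("attributes", {})

    # Cloud Storage notifications carry the bucket and object name as attributes.
    uri = f"gs://{attrs['bucketId']}/{attrs['objectId']}"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    job = bq.load_table_from_uri(uri, TABLE, job_config=job_config)
    job.result()  # block until the load finishes before acknowledging the message

    return ("", 204)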
EDIT
Thanks to your code, I understood what happens. BigQuery jobs are asynchronous: when you perform a query or a load job, a job is created and runs in the background.
In Python, you can explicitly wait for the end of the job, but with pandas I didn't find how!
I just found a Google Cloud page that explains how to migrate from pandas to the BigQuery client library. As you can see, there is a line at the end
# Wait for the load job to complete.
job.result()
that waits for the end of the job.
You did this correctly in the _insert_into_bigquery_dwh function, but it's not the case in the staging _insert_into_bigquery_staging one. This can lead to 2 issues:
The dwh function works on old data, because the staging load isn't finished yet when you trigger that job.
If the staging load takes, let's say, 10 seconds and runs in the background (you don't explicitly wait for it in your code) and the dwh job takes 1 second, the next file is processed at the end of the dwh function even though the staging job is still running in the background. And that leads to your issue.
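A hedged sketch of what the fix could look like if you replace the pandas call in the staging step with the BigQuery client and wait explicitly; the function name matches the one mentioned above and the table name is a placeholder:
from google.cloud import bigquery

bq = bigquery.Client()
STAGING_TABLE = "my-project.my_dataset.staging_table"  # placeholder

def _insert_into_bigquery_staging(dataframe):
    # Load a pandas DataFrame and wait for the job to finish.
    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    job = bq.load_table_from_dataframe(dataframe, STAGING_TABLE, job_config=job_config)
    job.result()  # without this, the dwh step can start on stale data
    return job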
The architecture you describe isn't the same as the one in the documentation you linked. Note that in the flow diagram and the code samples the storage events trigger the Cloud Function, which streams the data directly to the destination table. Since BigQuery allows multiple concurrent streaming inserts, several function instances can be executed at the same time without problems. In your use case, the intermediate table loaded with write-truncate for data cleaning makes a big difference, because each execution needs the previous one to finish, thus requiring a sequential processing approach.
I would like to point out that Pub/Sub doesn't let you configure the rate at which messages are delivered: if 10 messages arrive at the topic, they will all be sent to the subscriber, even if they are processed one at a time. Limiting the function to one instance may lead to overhead for the above reason and could increase latency as well. That said, since the expected workload is 15-30 files a day, this may not be a big concern.
If you'd like to have parallel executions, you may try creating a new table for each message and setting a short expiration on it with the table.expires setter (table.expires = exp_datetime), so that multiple executions don't conflict with each other. Here is the related library reference. Otherwise, the great answer from Guillaume would completely get the job done.
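A small sketch of that per-message table idea with the expiration set; the table name, suffix and schema are placeholders:
import datetime

from google.cloud import bigquery

bq = bigquery.Client()

def create_temp_table(table_id, schema):
    # Create a table that BigQuery deletes automatically after one hour.
    table = bigquery.Table(table_id, schema=schema)
    table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(hours=1)
    return bq.create_table(table)

# Example usage with a placeholder schema and a per-message suffix.
schema = [bigquery.SchemaField("col_a", "STRING"), bigquery.SchemaField("col_b", "INTEGER")]
table = create_temp_table("my-project.my_dataset.staging_msg_0001", schema)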

Compose Google Storage Objects without headers via CLI

I was wondering if it would be possible to compose Google Storage Objects (specifically csv files) without headers (i.e. without the row with column names) while using gsutil.
Currently, I can do the following:
gsutil compose gs://bucket/test_file_1.csv gs://bucket/test_file_2.csv gs://bucket/test-composition-files.csv
However, I will be unable to ingest test-composition-files.csv into Google BigQuery because compose blindly appended the files (including the column names).
One possible solution would be to download the file locally and process it with pandas, but this is not ideal for large files.
Is there any way to do this via the CLI? I could not find anything in the docs.
Reading the comments, I think you are spending effort in the wrong direction. I understand that you want to load your files into BigQuery, but the large number of files prevents you from doing this (too many API calls), and Dataflow is too slow.
Maybe you can think about it differently. I have 2 solutions to propose:
If you need "near real time" ingestion, and if the file size is below 1.5 GB, the best way is to build a function that reads the file and performs streaming inserts into BigQuery. The function is triggered by a Cloud Storage event. If several files arrive at the same time, several function instances will be spawned. Be careful: streaming inserts into BigQuery are not free.
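A minimal sketch of option 1, a storage-triggered function doing streaming inserts; the table name is a placeholder and, for simplicity, the CSV is assumed to map onto all-STRING columns:
import csv
import io

from google.cloud import bigquery, storage

bq = bigquery.Client()
gcs = storage.Client()
TABLE = "my-project.my_dataset.my_table"  # placeholder

def stream_file_to_bq(event, context):
    # Background Cloud Function triggered by google.storage.object.finalize.
    blob = gcs.bucket(event["bucket"]).blob(event["name"])
    content = blob.download_as_text()

    # Assumes the destination columns are STRING; adapt the parsing to your schema.
    rows = list(csv.DictReader(io.StringIO(content)))

    # Streaming inserts are billed, but the rows are queryable almost immediately.
    errors = bq.insert_rows_json(TABLE, rows)
    if errors:
        raise RuntimeError(f"Streaming insert failed: {errors}")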
If you can wait up to 2 minutes after a file arrives, I recommend building a Cloud Function triggered every 2 minutes. This function lists the file names in a bucket, moves them to a subdirectory, and performs a load job of all the files in that subdirectory. You are limited to 1,000 load jobs per day (and per table), and a day contains 1,440 minutes, so batching every 2 minutes (720 jobs) keeps you under the limit. Load jobs are free.
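And a sketch of option 2, a function invoked by Cloud Scheduler every 2 minutes; the bucket, prefixes and table are placeholders:
import time

from google.cloud import bigquery, storage

bq = bigquery.Client()
gcs = storage.Client()
BUCKET = "my-bucket"                       # placeholders
INBOX = "inbox/"
TABLE = "my-project.my_dataset.my_table"

def load_batch(request):
    # HTTP Cloud Function invoked by Cloud Scheduler every 2 minutes.
    bucket = gcs.bucket(BUCKET)
    batch_prefix = f"loading/{int(time.time())}/"  # one subdirectory per run

    blobs = list(gcs.list_blobs(BUCKET, prefix=INBOX))
    if not blobs:
        return ("nothing to load", 200)

    # Move the newly arrived files out of the inbox so the next run skips them.
    for blob in blobs:
        bucket.rename_blob(blob, batch_prefix + blob.name[len(INBOX):])

    # One free load job for the whole batch; 720 runs per day stays under the quota.
    job = bq.load_table_from_uri(
        f"gs://{BUCKET}/{batch_prefix}*",
        TABLE,
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
        ),
    )
    job.result()
    return (f"loaded {len(blobs)} files", 200)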
Are these acceptable alternatives?

Subscribing a Google Pub/Sub topic to a Cloud Storage Avro file gives me a "quota exceeded" error - in a beginner's tutorial?

I'm going through Google's Firestore to BigQuery pipeline tutorial and I've come to step 10, where I should set up an export from my topic to an Avro file saved on Cloud Storage.
However, when I try running the job, after doing exactly what's mentioned in the tutorial, I get an error telling me that my project has insufficient quotas to execute the workflow. In the quota summary of the message, I notice that it says 1230/818 disk GB. Does that mean that the job requires 1230 GB of disk space? Currently, there are only 100 documents in Firestore. This seems wrong to me.
All my Cloud storage buckets are empty:
But when I look at the resources used in the first export job I set up (Pubsub Topic to BigQuery) on page 9, I'm even more confused.
It seems like it's using CRAZY amounts of resources
Current vCPUs: 4
Total vCPU time: 2.511 vCPU hr
Current memory: 15 GB
Total memory time: 9.417 GB hr
Current PD: 1.2 TB
Total PD time: 772.181 GB hr
Current SSD PD: 0 B
Total SSD PD time: 0 GB hr
Can this be real, or have I done something completely wrong, since all these resources are being used? I mean, there's no activity at all; it's just a subscription, right?
Under the hood, that step is calling a Cloud Dataflow template (this one, to be exact) to read from Pub/Sub and write to GCS. In turn, Cloud Dataflow uses GCE instances (VMs) for its worker pool. Cloud Dataflow is requesting too many resources (GCE instances, which need disk, RAM, vCPUs, etc.) and is hitting your project's limit/quota.
You can override the default number of workers (try 1 to start with) and also set the smallest VM type (n1-standard-1) when configuring the job under the optional parameters. This should save you some money too. Bonus!
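If you end up launching the template from code instead of the console, the same overrides go into the runtime environment; a hedged sketch with the Dataflow API client, where the template path, job name and parameters are placeholders:
from googleapiclient.discovery import build

PROJECT, REGION = "my-project", "us-central1"  # placeholders
TEMPLATE = "gs://dataflow-templates/latest/Cloud_PubSub_to_Avro"

dataflow = build("dataflow", "v1b3")
response = dataflow.projects().locations().templates().launch(
    projectId=PROJECT,
    location=REGION,
    gcsPath=TEMPLATE,
    body={
        "jobName": "pubsub-to-avro-small",
        "parameters": {
            # inputTopic, outputDirectory, avroTempDirectory, ... go here.
        },
        # Keep the worker pool tiny so it fits inside the default disk quota.
        "environment": {
            "machineType": "n1-standard-1",
            "numWorkers": 1,
            "maxWorkers": 1,
        },
    },
).execute()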