Subscribing a Google Pub/Sub topic to a Cloud Storage Avro file gives me a "quota exceeded" error - in a beginner's tutorial?

I'm going through Google's Firestore-to-BigQuery pipeline tutorial and I've come to step 10, where I should set up an export from my topic to an Avro file saved on Cloud Storage.
However, when I try running the job after doing exactly what's described in the tutorial, I get an error telling me that my project has insufficient quota to execute the workflow. In the quota summary of the message, I notice that it says 1230/818 disk GB. Does that mean the job requires 1,230 GB of disk space? Currently there are only 100 documents in Firestore, so this seems wrong to me.
All my Cloud Storage buckets are empty.
But when I look at the resources used by the first export job I set up (Pub/Sub topic to BigQuery) on page 9, I'm even more confused.
It seems like it's using CRAZY amounts of resources:
Current vCPUs: 4
Total vCPU time: 2.511 vCPU hr
Current memory: 15 GB
Total memory time: 9.417 GB hr
Current PD: 1.2 TB
Total PD time: 772.181 GB hr
Current SSD PD: 0 B
Total SSD PD time: 0 GB hr
Can this be real, or have I done something completely wrong for all these resources to be used? I mean, there's no activity at all; it's just a subscription, right?

Under the hood, that step is calling a Cloud Dataflow template (this one, to be exact) to read from Pub/Sub and write to GCS. In turn, Cloud Dataflow uses GCE instances (VMs) for its worker pool. Cloud Dataflow is requesting too many resources (GCE instances, which need disk, RAM, vCPUs, etc.) and is hitting your project's limit/quota.
You can override the default number of workers (try 1 to start with) and also set the smallest VM type (n1-standard-1) when configuring the job under optional parameters. This should save you some money too. Bonus!
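For reference, if you launch this kind of pipeline with the Beam Java SDK instead of the console form, the equivalent knobs look roughly like this. This is only a minimal sketch; the project, region, bucket, and disk size below are placeholders, not values from the tutorial.

    import org.apache.beam.runners.dataflow.DataflowRunner;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class SmallWorkerPoolOptions {
      public static void main(String[] args) {
        // Keep the worker pool small so the job stays well inside the project quota.
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
        options.setRunner(DataflowRunner.class);
        options.setProject("my-project");               // placeholder
        options.setRegion("us-central1");               // placeholder
        options.setTempLocation("gs://my-bucket/tmp");  // placeholder
        options.setNumWorkers(1);                       // start with a single worker
        options.setMaxNumWorkers(1);                    // cap autoscaling
        options.setWorkerMachineType("n1-standard-1");  // smallest standard VM type
        options.setDiskSizeGb(30);                      // shrink the per-worker PD
        // ... build and run the pipeline with these options ...
      }
    }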

Related

Issue with reading millions of files from Cloud Storage using Dataflow in Google Cloud

Scenario: I am trying to read files and send the data to Pub/Sub.
Millions of files are stored in a Cloud Storage folder (GCP).
I have created a Dataflow pipeline using the "Text Files on Cloud Storage to Pub/Sub" template, publishing to the Pub/Sub topic.
But the template was not able to read millions of files and failed with the following error:
java.lang.IllegalArgumentException: Total size of the BoundedSource objects generated by split() operation is larger than the allowable limit. When splitting gs://filelocation/data/*.json into bundles of 28401539859 bytes it generated 2397802 BoundedSource objects with total serialized size of 199603686 bytes which is larger than the limit 20971520.
System configuration:
Apache Beam: 2.38 Java SDK
Machine: high-performance n1-highmem-16
Any idea how to solve this issue? Thanks in advance.
According to this document (1), you can work around this by modifying your custom BoundedSource subclass so that the generated BoundedSource objects become smaller than the 20 MB limit.
(1) https://cloud.google.com/dataflow/docs/guides/common-errors#boundedsource-objects-splitintobundles
You can also use TextIO.readAll() to avoid these limitations.
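For example, here is a minimal sketch of the readAll() approach (the file pattern is the one from your error message; the topic name is a placeholder). Because the pattern is expanded and read on the workers, no single split() call has to serialize millions of BoundedSource objects. In newer Beam SDKs the equivalent is FileIO.matchAll() plus TextIO.readFiles().

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;

    public class GcsFilesToPubsub {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("FilePattern", Create.of("gs://filelocation/data/*.json"))
         // readAll() expands the glob and reads the matched files on the workers,
         // instead of generating millions of BoundedSource objects in one split().
         .apply("ReadLines", TextIO.readAll())
         .apply("Publish", PubsubIO.writeStrings()
             .to("projects/my-project/topics/my-topic"));   // placeholder topic

        p.run();
      }
    }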

EMR ignores spark-submit parameters (memory/cores/etc.)

I'm trying to use all resources on my EMR cluster.
The cluster itself is 4 m4.4xlarge machines (1 driver and 3 workers), each with 16 vCores, 64 GiB memory, and 128 GiB EBS storage.
When launching the cluster through the CLI, I tried the following options (all 3 were executed within the same data pipeline):
1. Just use "maximizeResourceAllocation" without any other spark-submit parameter.
This only gives me 2 executors.
2. Don't put anything and leave spark-defaults to do its job.
This gives similarly small, low-spec executors.
3. Use AWS's guide on how to configure the cluster in EMR.
Following this guide, I deduced the following spark-submit parameters:
"--conf",
"spark.executor.cores=5",
"--conf",
"spark.executor.memory=18g",
"--conf",
"spark.executor.memoryOverhead=3g",
"--conf",
"spark.executor.instances=9",
"--conf",
"spark.driver.cores=5",
"--conf",
"spark.driver.memory=18g",
"--conf",
"spark.default.parallelism=45",
"--conf",
"spark.sql.shuffle.partitions=45",
And still no luck.
I've looked everywhere I could on the internet, but couldn't find any explanation of why EMR doesn't use all the resources provided. Maybe I'm missing something, or maybe this is expected behaviour, but when "maximizeResourceAllocation" only spawns 2 executors on a cluster with 3 workers, something is wrong.
UPDATE:
So today, while running a different data pipeline, I got the following using "maximizeResourceAllocation":
This is much, much better than the other runs, but it still falls short in terms of used memory and executors (although someone from the EMR team said that EMR merges executors into super-executors to improve performance).
I wanted to add my answer, even though I cannot explain all the cases that you are seeing.
But I will try to explain those that I can, as best as I can.
Case 1: Maximum Resource Allocation
This is an Amazon EMR specific configuration and I'm not completely sure what Amazon EMR does under the hood in this configuration.
When you enable maximizeResourceAllocation, EMR calculates the maximum compute and memory resources available for an executor on an instance in the core instance group. It then sets the corresponding spark-defaults settings based on the calculated maximum values.
The above text snippet is from here.
It looks like there might be some configuration for the cluster and that EMR calculates resource allocation for that. I'm not completely sure though.
One follow-up that I'd like to know is if you are running other applications on your cluster? That can change the calculations done by EMR to determine the resource allocation.
Case 2: Default Values
I can explain this.
This is using something called dynamic resource allocation in Spark. From the Spark website:
Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.
And looking at the link above, it mentions that Amazon EMR uses dynamic resource allocation as the default setting for Spark.
So the executor sizes and counts that you are seeing are due to the workload you are running.
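For context, dynamic allocation is driven by a handful of Spark properties. A minimal sketch of what they look like when set programmatically (the numbers below are illustrative, not EMR's defaults):

    import org.apache.spark.SparkConf;

    public class DynamicAllocationConf {
      public static void main(String[] args) {
        // Dynamic allocation lets Spark grow and shrink the executor count with
        // the workload; EMR enables it by default. Values are only illustrative.
        SparkConf conf = new SparkConf()
            .set("spark.dynamicAllocation.enabled", "true")
            .set("spark.dynamicAllocation.minExecutors", "1")
            .set("spark.dynamicAllocation.maxExecutors", "9")
            // The external shuffle service lets executors be removed without
            // losing their shuffle files.
            .set("spark.shuffle.service.enabled", "true");
        System.out.println(conf.toDebugString());
      }
    }

Note that, at least in some Spark versions, explicitly setting spark.executor.instances turns dynamic allocation off, which is another reason the three runs can behave so differently.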
Case 3: Specific Values
To understand the allocation here, you need to understand the memory allocation model of Spark.
Spark executors and drivers run as JVMs, and they divide the JVM memory into:
On-heap space, further divided into:
  Storage memory - for cached data
  Execution memory - temporarily stores data for shuffles, joins, and sorts
  User memory - for RDD transformations and dependency graphs
  Reserved memory - for Spark internal objects
Off-heap space - usually disabled by default
The memory allocation for the on-heap space is as follows:
X = executor memory configured
X' = X - 300 MB (reserved memory, taken off the top so Spark can start up)
User Memory = X' * 0.40
Storage + Execution Memory = X' * 0.60
Storage and execution memory are split 50-50 within that unified pool.
For your example
X = 18 GB ~ 18000 MB
X' = 18000 - 300 = 17700 MB remaining
User = 17700 * 0.40 = 7080 MB
Storage + Execution = 17700 * 0.60 = 10620 MB ~ 10 GB
And that is what you're seeing in your Storage Memory column and On-heap storage memory column.
Note: the boundary between storage and execution memory is flexible. Execution can borrow unused storage memory (evicting cached blocks if needed), and storage can borrow unused execution memory.
Plugging the numbers in, this lines up with your observations, which explains why you only see about 10 GB of the original 18 GB you allocated.
Hope that adds a bit more clarity to at least two of the cases you are seeing.
Have you tried setting the --master yarn parameter and replacing spark.executor.memoryOverhead with spark.yarn.executor.memoryOverhead?

How does AWS Athena manage to load 10 GB/s from S3? I've managed 230 MB/s from a c6gn.16xlarge

When running this query on AWS Athena, it manages to scan a 63 GB Trades.csv file:
SELECT * FROM Trades WHERE TraderID = 1234567
It takes 6.81 seconds, scanning 63.82 GB in the process (almost exactly the size of the Trades.csv file, so it is doing a full table scan).
What I'm shocked at is the unbelievable speed of data drawn from S3. It seems like AWS Athena's strategy is to use an unbelievably massive box with a ton of RAM and incredible S3 loading ability to get around the lack of indexing (although on a standard SQL DB you would have an index on TraderID and load millions of times less data).
But in my experiments I only managed to get these data reads from S3 (which are still impressive):
Instance type   | MB/s | Network card (Gigabit)
t2.2xlarge      | 113  | low
t3.2xlarge      | 140  | up to 5
c5n.2xlarge     | 160  | up to 25
c6gn.16xlarge   | 230  | 100
(that's megabytes rather than megabits)
I'm using an internal VPC endpoint for S3 in eu-west-1. Does anyone have any tricks/tips for getting S3 to load fast? Has anyone got over 1 GB/s read speeds from S3? Is this even possible?
It seems like AWS Athena's strategy is to use an unbelievably massive box with a ton of RAM
No, it's more like many small boxes, not a single massive box. Athena is running your query in parallel, on multiple servers at once. The exact details of that are not published anywhere as far as I am aware, but they make very clear in the documentation that your queries run in parallel.
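If you want to experiment with the same idea from your own instance, the usual approach is many concurrent ranged GETs against the object rather than one sequential read, since aggregate S3 throughput scales with the number of parallel readers. A rough sketch with the AWS SDK for Java v2 (the bucket name, object size, and reader count are placeholders):

    import java.io.InputStream;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.GetObjectRequest;

    public class ParallelRangeRead {
      public static void main(String[] args) throws InterruptedException {
        S3Client s3 = S3Client.create();
        long objectSize = 63L * 1024 * 1024 * 1024;   // ~63 GB, placeholder size
        int readers = 64;                             // concurrent ranged GETs
        long chunk = objectSize / readers;
        ExecutorService pool = Executors.newFixedThreadPool(readers);

        for (int i = 0; i < readers; i++) {
          long start = i * chunk;
          long end = (i == readers - 1) ? objectSize - 1 : start + chunk - 1;
          GetObjectRequest req = GetObjectRequest.builder()
              .bucket("my-bucket")                    // placeholder bucket
              .key("Trades.csv")
              .range("bytes=" + start + "-" + end)
              .build();
          pool.submit(() -> {
            // Stream the byte range and discard it; only throughput matters here.
            byte[] buf = new byte[1 << 20];
            try (InputStream in = s3.getObject(req)) {
              while (in.read(buf) != -1) { /* discard */ }
            }
            return null;
          });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
      }
    }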

Used Dataflow's DLP template to read from GCS and write to BigQuery - only 50% of the data written to BigQuery

I recently started a Dataflow job to load data from GCS, run it through DLP's de-identification template, and write the masked data to BigQuery. I could not find a Google-provided template for batch processing, hence I used the streaming one (ref: link).
I see only 50% of the rows written to the destination BigQuery table. There has been no activity on the pipeline for a day even though it is still in the running state.
Yes, the DLP Dataflow template is a streaming pipeline, but with some easy changes you can also use it as a batch pipeline. Here is the template source code. As you can see, it uses a FileIO transform and polls/watches for new files every 30 seconds. If you take out the window transform and the continuous polling, you should be able to run it as a batch job (see the sketch below).
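To make that concrete, here is a rough sketch of the difference. This is not the exact template code; the file pattern and step names are placeholders.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class BatchMatchSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        String pattern = "gs://my-bucket/input/*.csv";   // placeholder

        // Streaming, roughly what the template does today (for comparison):
        //   p.apply(FileIO.match()
        //           .filepattern(pattern)
        //           .continuously(Duration.standardSeconds(30), Watch.Growth.never()))
        //    .apply(FileIO.readMatches())
        //    .apply(Window.into(FixedWindows.of(Duration.standardSeconds(30))));

        // Batch: drop continuously() and the windowing, so the pipeline reads
        // whatever matches the pattern once and then terminates.
        p.apply("MatchOnce", FileIO.match().filepattern(pattern))
         .apply("ReadFiles", FileIO.readMatches());
        // ... the DLP transform and BigQuery write would follow here, as in the template.

        p.run();
      }
    }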
In terms of the pipeline not processing all the data, can you confirm whether you are running a large file with the default settings (e.g. workerMachineType, numWorkers, maxNumWorkers)? The current pipeline code uses line-based offsetting, which requires a highmem machine type and a large number of workers if the input file is large; e.g. for 10 GB and 80M lines you may need 5 highmem workers.
One thing you can try is to trigger the pipeline with more resources, e.g. --workerMachineType=n1-highmem-8, --numWorkers=10, --maxNumWorkers=10, and see if that helps.
Alternatively, there is a V2 solution that uses byte-based offsetting with the state and timer API for optimized batching and resource utilization that you can try out.

Max file count using a BigQuery Data Transfer Service job

I have about 54,000 files in my GCS bucket. When I try to schedule a BigQuery Data Transfer Service job to move the files from the bucket to BigQuery, I get the following error:
Error code 9 : Transfer Run limits exceeded. Max size: 15.00 TB. Max file count: 10000. Found: size = 267065994 B (0.00 TB) ; file count = 54824.
I thought the max file count was 10 million.
I think that the BigQuery Data Transfer Service lists all the files matching the wildcard and then uses that list to load them. So it is the same as providing the full list to bq load, and therefore it hits the 10,000-URI limit.
This is probably necessary because the transfer service skips already-loaded files, so it needs to look at them one by one to decide which ones to actually load.
I think your only option is to schedule a job yourself and load the files directly into BigQuery, for example using Cloud Composer or a little Cloud Run service invoked by Cloud Scheduler.
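For the "load them yourself in batches" route, here is a rough sketch with the BigQuery and Cloud Storage Java client libraries. The bucket, prefix, dataset, and table names are placeholders, and the files are assumed to be newline-delimited JSON.

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.FormatOptions;
    import com.google.cloud.bigquery.Job;
    import com.google.cloud.bigquery.JobInfo;
    import com.google.cloud.bigquery.LoadJobConfiguration;
    import com.google.cloud.bigquery.TableId;
    import com.google.cloud.storage.Blob;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;
    import java.util.ArrayList;
    import java.util.List;

    public class BatchedLoads {
      public static void main(String[] args) throws InterruptedException {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId table = TableId.of("my_dataset", "my_table");   // placeholders

        // Collect all object URIs under the prefix.
        List<String> uris = new ArrayList<>();
        for (Blob blob : storage.list("my-bucket",
                Storage.BlobListOption.prefix("exports/")).iterateAll()) {
          uris.add("gs://my-bucket/" + blob.getName());
        }

        // Issue one load job per batch of at most 10,000 URIs.
        int batchSize = 10_000;
        for (int i = 0; i < uris.size(); i += batchSize) {
          List<String> batch = uris.subList(i, Math.min(i + batchSize, uris.size()));
          LoadJobConfiguration config = LoadJobConfiguration
              .newBuilder(table, batch)
              .setFormatOptions(FormatOptions.json())   // assumes NDJSON input
              .build();
          Job job = bigquery.create(JobInfo.of(config));
          job.waitFor();   // wait so per-project load-job quotas are not exceeded
        }
      }
    }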
The error message "Transfer Run limits exceeded", as mentioned before, is related to a known limit for load jobs in BigQuery. Unfortunately this is a hard limit and cannot be changed. There is an ongoing feature request to increase it, but for now there is no ETA for it to be implemented.
The main recommendation for this issue is to split the single operation into multiple processes that send data in requests that don't exceed this limit. With this we cover the main question: "Why do I see this error message, and how do I avoid it?"
It is natural to ask next, "How can I automate or simplify these actions?" I can think of involving more products:
Dataflow, which will help you process the data that will be added to BigQuery. This is where you can send multiple requests.
Pub/Sub, which will help you listen to events and automate when the processing starts.
Please take a look at this suggested implementation, where the aforementioned scenario is described in more detail.
Hope this is helpful! :)