I'm new to Dataflow.
I'd like to use the Dataflow streaming template "Pub/Sub Subscription to BigQuery" to transfer some messages, say 10000 per day.
My question is about pricing since I don't understand how they're computed for the streaming mode, with Streaming Engine enabled or not.
I've used the Google Calculator which asks for the following:
Machine Type, Number of worker nodes used by the job, If streaming or Batch job, Number of GB of Persistent Disks (PD), Hours the job runs per month.
Consider the easiest case, since I don't need many resources, i.e.
Machine type: n1-standard1
Max Workers: 1
Job Type: Streaming
Price: in us-central1
Case 1: Streaming Engine DISABLED
Hours using the vCPU = 730 hours (1 month always active). Is this always true for the streaming mode? Or there can be a case in a streaming mode in which the usage is lower?
Persistent Disks: 430 GB HDD, which is the default value.
So I will pay:
(vCPU) 730 x $0.069(cost vCPU/hour) = $50.37
(PD) 730 x $0.000054 x 430 GB = $16.95
(RAM) 730 x $0.003557 x 3.75 GB = $9.74
TOTAL: $77.06, as confirmed by the calculator.
Case 2 Streaming Engine ENABLED.
Hours using the v CPU = 730 hours
Persistent Disks: 30 GB HDD, which is the default value
So I will pay:
(vCPU) 30 x $0.069(cost vCPU/hour) = $50.37
(PD) 30 x $0.000054 x 430 GB = $1.18
(RAM) 30 x $0.003557 x 3.75 GB = $9.74
TOTAL: $61.29 PLUS the amount of Data Processed (which is extra with Streaming Engine)
Considering messages of 1024 Byte, we have a traffic of 1024 x 10000 x 30 Bytes = 0.307 GB, and an extra cost of 0.307 GB x $0.018 = $0.005 (almost zero).
Actually, with this kind of traffic, I will save about $15 in using Streaming Engine.
Am I correct? Is there something else to consider or something wrong with my assumptions and my calculations?
Additionally, considering the low amount of data, is Dataflow really fitted for this kind of use? Or should I approach this problem in a different way?
Thank you in advance!
It's not false, but not perfectly accurate.
In the streaming mode, your Dataflow always listen the PubSub subscription and thus you need to but up full time.
In batch processing, you normally start the batch, it performs its job and then it stops.
In your comparison, you consider to have a batch job that runs full time. It's not impossible, but it doesn't fit your use case, I think.
About streaming and batching, all depends on your need of real time.
If you want to ingest the data in BigQuery with low latency (in few seconds) to have real time data, streaming is the good choice
If having data only updated every hour or every day, batch is a more suitable solution.
A latest remark, if your task is only to get message from PubSub and to stream write to BigQuery, you can consider to code it yourselves on Cloud Run or Cloud Functions. With only 10k messages per day, it will be free!
Related
I'm running a system with lots of AWS Lambda functions. Our load is not huge, let's say a function gets 100k invocations per month.
For quite a few of the Lambda functions, we're using warmup plugins to reduce cold start times. This is effectively a CloudWatch event triggered every 5 minutes to invoke the function with a dummy event which is ignored, but keeps that Lambda VM running. In most cases, this means one instance will be "warm".
I'm now looking at the native solution to the cold start problem: AWS Lambda Concurrent Provisioning, which at first glance looks awesome, but when I start calculating, either I'm missing something, or this will simply be a large cost increase for a system with only medium load.
Example, with prices from the eu-west-1 region as of 2020-09-16:
Consider function RAM M (GB), average execution time t (s), million requests per month N, provisioned concurrency C ("number of instances"):
Without provisioned concurrency
Cost per month = N⋅(16.6667⋅M⋅t + 0.20)
= $16.87 per million requests # M = 1 GB, t = 1 s
= $1.87 per million requests # M = 1 GB, t = 100 ms
= $1.69 per 100.000 requests # M = 1 GB, t = 1 s
= $1686.67 per 100M requests # M = 1GB, t = 1 s
With provisioned concurrency
Cost per month = C⋅0.000004646⋅M⋅60⋅60⋅24⋅30 + N⋅(10.8407⋅M⋅t + 0.20) = 12.04⋅C⋅M + N(10.84⋅M⋅t + 0.20)
= $12.04 + $11.04 = $23.08 per million requests # M = 1 GB, t = 1 s, C = 1
= $12.04 + $1.28 = $13.32 per million requests # M = 1 GB, t = 100 ms, C = 1
= $12.04 + $1.10 = $13.14 per 100.000 requests # M = 1 GB, t = 1 s, C = 1
= $12.04 + $1104.07 = $1116.11 per 100M requests # M = 1 GB, t = 1 s, C = 1
There are obviously several factors to take into account here:
How many requests per month is expected? (N)
How much RAM does the function need? (M)
What is the average execution time? (t)
What is the traffic pattern, few small bursts or even traffic (might mean C is low, high, or must be dynamically changed to follow peak hours etc)
In the end though, my initial conclusion is that Provisioned Concurrency will only be a good deal if you have a lot of traffic? In my example, at 100M requests per month there's a substantial saving (however, at that traffic it's perhaps likely that you would need a higher value of C as well; break-even at about C = 30). Even with C = 1, you need almost a million requests per month to cover the static costs.
Now, there are obviously other benefits of using the native solution (no ugly dummy events, no log pollution, flexible amount of warm instances, ...), and there are also probably other hidden costs of custom solutions (CloudWatch events, additional CloudWatch logging for dummy invocations etc), but I think they are pretty much neglectible.
Is my analysis fairly correct or am I missing something?
I think about provisioned concurrency as something that eliminates the cold starts and not something that saves money. There is a bit of saving if you can keep the lambda function running all the time (100%) utilization, but as you've calculated it becomes quite expensive when the provisioned capacity sits idle.
I am using AWS Redis for a project and ran into an Out of Memory (OOM) issue. In investigating the issue, I discovered a couple parameters that affect the amount of usable memory, but the math doesn't seem to work out for my case. Am I missing any variables?
I'm using:
3 shards, 3 nodes per shard
cache.t2.micro instance type
default.redis4.0.cluster.on cache parameter group
The ElastiCache website says cache.t2.micro has 0.555 GiB = 0.555 * 2^30 B = 595,926,712 B memory.
default.redis4.0.cluster.on parameter group has maxmemory = 581,959,680 (just under the instance memory) and reserved-memory-percent = 25%. 581,959,680 B * 0.75 = 436,469,760 B available.
Now, looking at the BytesUsedForCache metric in CloudWatch when I ran out of memory, I see nodes around 457M, 437M, 397M, 393M bytes. It shouldn't be possible for a node to be above the 436M bytes calculated above!
What am I missing; Is there something else that determines how much memory is usable?
I remember reading it somewhere but I can not find it right now. I believe BytesUsedForCache is a sum of RAM and SWAP used by Redis to store data/buffers.
As Elasticache's docs suggest that SWAP should not go higher than 300 MB.
I would suggest checking the SWAP metric at that time.
When I'm inserting rows on BigQuery using writeTableRows, performance is really bad compared to InsertAllRequest. Clearly, something is not setup correctly.
Use case 1: I wrote a Java program to process 'sample' Twitter stream using Twitter4j. When a tweet comes in I write it to BigQuery using this:
insertAllRequestBuilder.addRow(rowContent);
When I run this program from my Mac, it inserts about 1000 rows per minute directly into BigQuery table. I thought I could do better by running a Dataflow job on the cluster.
Use case 2: When a tweet comes in, I write it to a topic of Google's PubSub. I run this from my Mac which sends about 1000 messages every minute.
I wrote a Dataflow job that reads this topic and writes to BigQuery using BigQueryIO.writeTableRows(). I have a 8 machine Dataproc cluster. I started this job on the master node of this cluster with DataflowRunner. It's unbelievably slow! Like 100 rows every 5 minutes or so. Here's a snippet of the relevant code:
statuses.apply("ToBQRow", ParDo.of(new DoFn<Status, TableRow>() {
#ProcessElement
public void processElement(ProcessContext c) throws Exception {
TableRow row = new TableRow();
Status status = c.element();
row.set("Id", status.getId());
row.set("Text", status.getText());
row.set("RetweetCount", status.getRetweetCount());
row.set("FavoriteCount", status.getFavoriteCount());
row.set("Language", status.getLang());
row.set("ReceivedAt", null);
row.set("UserId", status.getUser().getId());
row.set("CountryCode", status.getPlace().getCountryCode());
row.set("Country", status.getPlace().getCountry());
c.output(row);
}
}))
.apply("WriteTableRows", BigQueryIO.writeTableRows().to(tweetsTable)//
.withSchema(schema)
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withTriggeringFrequency(org.joda.time.Duration.standardMinutes(2))
.withNumFileShards(1000)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
What am I doing wrong? Should I use a 'SparkRunner'? How do I confirm that it's running on all nodes of my cluster?
With BigQuery you can either:
Stream data in. Low latency, up to 100k rows per second, has a cost.
Batch data in. Way higher latency, incredible throughput, totally free.
That's the difference you are experiencing. If you only want to ingest 1000 rows, batching will be noticeably slower. The same with 10 billion rows will be way faster thru batching, and at no cost.
Dataflow/Bem's BigQueryIO.writeTableRows can either stream or batch data in.
With BigQueryIO.Write.Method.FILE_LOADS the pasted code is choosing batch.
I am running an Aerospike cluster in Google Cloud. Following the recommendation on this post, I updated to the last version (3.11.1.1) and re-created all servers. In fact, this change cause my 5 servers to operate in a much lower CPU load (it was around 75% load before, now it is on 20%, as show in the graph bellow:
Because of this low load, I decided to reduce the cluster size to 4 servers. When I did this, my application started to receive the following error:
All batch queues are full
I found this discussion about the topic, recommending to change the parameters batch-index-threads and batch-max-unused-buffers with the command
asadm -e "asinfo -v 'set-config:context=service;batch-index-threads=NEW_VALUE'"
I tried many combinations of values (batch-index-threads with 2,4,8,16) and none of them solved the problem, and also changing the batch-index-threads param. Nothing solves my problem. I keep receiving the All batch queues are full error.
Here is my aerospace.conf relevant information:
service {
user root
group root
paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
paxos-recovery-policy auto-reset-master
pidfile /var/run/aerospike/asd.pid
service-threads 32
transaction-queues 32
transaction-threads-per-queue 4
batch-index-threads 40
proto-fd-max 15000
batch-max-requests 30000
replication-fire-and-forget true
}
I use 300GB SSD disks on these servers.
A quick note which may or may not pertain to you:
A common mistake we have seen in the past is that developers decide to use 'batch get' as a general purpose 'get' for single and multiple record requests. The single record get will perform better for single record requests.
It's possible that you are being constrained by the network between the clients and servers. Reducing from 5 to 4 nodes reduced the aggregate pipe. In addition, removing a node will start cluster migrations which adds additional network load.
I would look at the batch-max-buffer-per-queue config parameter.
Maximum number of 128KB response buffers allowed in each batch index
queue. If all batch index queues are full, new batch requests are
rejected.
In conjunction with raising this value from the default of 255 you will want to also raise the batch-max-unused-buffers to batch-index-threads x batch-max-buffer-per-queue + 1 (at least). If you do not do that new buffers will be created and destroyed constantly, as the amount of free (unused) buffers is smaller than the ones you're using. The moment the batch response is served the system will strive to trim the buffers down to the max unused number. You will see this reflected in the batch_index_created_buffers metric constantly rising.
Be aware that you need to have enough DRAM for this. For example if you raise the batch-max-buffer-per-queue to 320 you will consume
40 (`batch-index-threads`) x 320 (`batch-max-buffer-per-queue`) x 128K = 1600MB
For the sake of performance the batch-max-unused-buffers should be set to 13000 which will have a max memory consumption of 1625MB (1.59GB) per-node.
Referring to the docs, you can specify the number of concurrent connection when pushing large files to Amazon Web Services s3 using the multipart uploader. While it does say the concurrency defaults to 5, it does not specify a maximum, or whether or not the size of each chunk is derived from the total filesize / concurrency.
I trolled the source code and the comment is pretty much the same as the docs:
Set the concurrency level to use when uploading parts. This affects
how many parts are uploaded in parallel. You must use a local file as
your data source when using a concurrency greater than 1
So my functional build looks like this (the vars are defined by the way, this is just condensed for example):
use Aws\Common\Exception\MultipartUploadException;
use Aws\S3\Model\MultipartUpload\UploadBuilder;
$uploader = UploadBuilder::newInstance()
->setClient($client)
->setSource($file)
->setBucket($bucket)
->setKey($file)
->setConcurrency(30)
->setOption('CacheControl', 'max-age=3600')
->build();
Works great except a 200mb file takes 9 minutes to upload... with 30 concurrent connections? Seems suspicious to me, so I upped concurrency to 100 and the upload time was 8.5 minutes. Such a small difference could just be connection and not code.
So my question is whether or not there's a concurrency maximum, what it is, and if you can specify the size of the chunks or if chunk size is automatically calculated. My goal is to try to get a 500mb file to transfer to AWS s3 within 5 minutes, however I have to optimize that if possible.
Looking through the source code, it looks like 10,000 is the maximum concurrent connections. There is no automatic calculations of chunk sizes based on concurrent connections but you could set those yourself if needed for whatever reason.
I set the chunk size to 10 megs, 20 concurrent connections and it seems to work fine. On a real server I got a 100 meg file to transfer in 23 seconds. Much better than the 3 1/2 to 4 minute it was getting in the dev environments. Interesting, but thems the stats, should anyone else come across this same issue.
This is what my builder ended up being:
$uploader = UploadBuilder::newInstance()
->setClient($client)
->setSource($file)
->setBucket($bicket)
->setKey($file)
->setConcurrency(20)
->setMinPartSize(10485760)
->setOption('CacheControl', 'max-age=3600')
->build();
I may need to up that max cache but as of yet this works acceptably. The key was moving the processor code to the server and not relying on the weakness of my dev environments, no matter how powerful the machine is or high class the internet connection is.
We can abort the process during upload and can halt all the operations and abort the upload at any instance. We can set Concurrency and minimum part size.
$uploader = UploadBuilder::newInstance()
->setClient($client)
->setSource('/path/to/large/file.mov')
->setBucket('mybucket')
->setKey('my-object-key')
->setConcurrency(3)
->setMinPartSize(10485760)
->setOption('CacheControl', 'max-age=3600')
->build();
try {
$uploader->upload();
echo "Upload complete.\n";
} catch (MultipartUploadException $e) {
$uploader->abort();
echo "Upload failed.\n";
}