I have several hundred million rows in a database as input data.
Processing every 10,000 rows takes approximately 15 minutes because external API requests are involved.
I am going to divide the data into chunks and process them with a hundred Amazon EC2 instances. Each process launched on an EC2 instance will also save its output to the database.
Is it possible to organize this multi-agent task as a MapReduce cluster using cloud services?
I'm creating a Cloud Function in GCP to automatically resize images uploaded to a bucket and transfer them to another bucket. Since the images arrive in batches and one folder might contain hundreds or thousands of images, is it better to incorporate into the code the ability to deal with multiple files, or is it better to let Cloud Functions be triggered on every image uploaded?
Parallel processing is really powerful with serverless products because they scale up and down automatically according to your workload.
If you receive thousands of images in a few seconds, the serverless product's scalability can struggle and you can lose some events (serverless scales up quickly, but it's not magic!).
A better solution is to publish the Cloud Storage events to Pub/Sub. That way you can easily retry failed messages.
If you continue to increase the number of images, or if you want to optimize cost, I recommend you have a look at Cloud Run.
You can plug a Pub/Sub push subscription into Cloud Run. The power of Cloud Run is its capacity to process several HTTP requests (Pub/Sub push messages -> Cloud Storage events) on the same instance, and therefore to process several images concurrently on the same instance. If the conversion process is compute intensive, you can have up to 4 CPUs on a Cloud Run instance.
And, as with Cloud Functions, you pay only for the number of active instances (those currently processing a request). With Cloud Functions you can process 1 request at a time, therefore 1 instance per file. With Cloud Run you can process up to 1000 concurrent requests, and you can therefore reduce the number of instances, and thus your cost, by up to a factor of 1000. However, pay attention to the CPU required for your processing; if it's compute intensive, you can't process 1000 images at the same time.
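As a rough illustration of that pipeline, a minimal Cloud Run handler in Python might look like the sketch below. The bucket names and the resize() helper are placeholders for your own logic, not a definitive implementation.

```python
# Sketch: Cloud Run service receiving Cloud Storage events via a Pub/Sub push
# subscription. Bucket names and resize() are placeholders.
import base64
import json

from flask import Flask, request
from google.cloud import storage

app = Flask(__name__)
storage_client = storage.Client()


@app.route("/", methods=["POST"])
def handle_push():
    envelope = request.get_json()
    if not envelope or "message" not in envelope:
        return "Bad Request: no Pub/Sub message", 400

    # The Cloud Storage notification is JSON, base64-encoded in the message data.
    event = json.loads(base64.b64decode(envelope["message"]["data"]).decode("utf-8"))
    bucket_name, object_name = event["bucket"], event["name"]

    blob = storage_client.bucket(bucket_name).blob(object_name)
    image_bytes = blob.download_as_bytes()
    resized = resize(image_bytes)  # hypothetical resize helper

    storage_client.bucket("my-resized-bucket").blob(object_name).upload_from_string(resized)

    # A 2xx response acknowledges the message; any other status makes Pub/Sub retry.
    return "", 204


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```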
The finalize event is sent when a new object is created (or an existing object is overwritten, and a new generation of that object is created) in the bucket.
A new function invocation will be triggered for each object uploaded. You can try compressing all those images into a ZIP file on the client and uploading that, so it triggers only one function, then upload the images back to storage after unzipping them. But make sure you don't hit any of the limits mentioned in the documentation.
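A rough sketch of that approach with a background Cloud Function and the google-cloud-storage client (bucket names are placeholders, and this assumes a finalize-triggered function):

```python
# Sketch: Cloud Function triggered by google.storage.object.finalize that
# unpacks an uploaded ZIP and re-uploads each contained image.
import io
import zipfile

from google.cloud import storage

storage_client = storage.Client()


def unpack_zip(event, context):
    """Download the uploaded ZIP, then re-upload each contained file."""
    if not event["name"].endswith(".zip"):
        return  # ignore anything that isn't a ZIP upload

    bucket = storage_client.bucket(event["bucket"])
    zip_bytes = bucket.blob(event["name"]).download_as_bytes()

    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.namelist():
            data = archive.read(member)
            # Writing to a separate bucket avoids re-triggering this function.
            storage_client.bucket("my-images-bucket").blob(member).upload_from_string(data)
```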
I have a requirement to import 50K (this number changes) records into our database, applying business logic to each record (or a batch of records). I plan to implement it by breaking the record set into multiple chunks of 500 records and sending messages to a HornetQ queue, where an MDB processes each chunk of records. This solution lets me spread the work across multiple processes by having an MDB pool of 30 threads, and since I use persistent queues my messages are persisted, so in case of a fault the entire process is not affected. Firstly, I would love to know if this is an ideal approach, and secondly, since we are completely on AWS, are there solutions (or combinations of them) in AWS which are designed to handle these kinds of applications?
If you are set on MapReduce, go ahead and use AWS Elastic MapReduce (EMR) to run your MapReduce workload with your custom processing, with the source data stored in S3 or pulled from any other source.
You have to manage the infrastructure; it is not a managed service.
Alternatively, you can use AWS Glue ETL jobs to do the same using Spark. This is a managed ETL service which gives you a pre-generated Spark script template to begin with (see the sketch below).
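For a feel of what that looks like, here is roughly the shape of a Glue job script in Python; the database, table, output path, and the transform() helper are placeholders, not the exact generated template.

```python
# Sketch of a Glue ETL job: read records from the Data Catalog, apply
# per-record business logic, write results back to S3.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source records from the Glue Data Catalog (or directly from S3).
records = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_source_table"
)

# Apply business logic per record; transform() is a hypothetical helper.
processed = records.map(transform)

# Write the results back out, here to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=processed,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/processed/"},
    format="parquet",
)

job.commit()
```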
To choose between Glue and EMR, read more about Spark vs. MapReduce and decide for yourself.
Hope this helps!!
According to the AWS Athena limits, you can submit up to 20 queries of the same type at a time, but this is a soft limit that can be increased on request. I use boto3 to interact with Athena, and my script submits 16 CTAS queries, each of which takes about 2 minutes to finish. In the AWS account, I am the only one using the Athena service. However, when I look at the state of the queries in the console, I see that only a few of them (5 on average) are actually being executed, despite all of them being in the RUNNING state. Here is what I would normally see in the Athena History tab:
I understand that, after I submit queries to Athena, it processes them by assigning resources based on the overall service load and the amount of incoming requests. But I tried running them on different days and at different hours, and I would still get only about 5 queries being executed at the same time.
So my question is: is this how it is supposed to be? If it is, then what is the point of being able to submit up to 20 queries if roughly 15 of them would be idling and waiting for available slots?
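For reference, the submission logic is essentially the following simplified boto3 sketch; the query text, database, and output location are placeholders:

```python
# Sketch: submit several CTAS queries and snapshot how many Athena reports
# as RUNNING. Database, queries, and output location are placeholders.
import boto3

athena = boto3.client("athena")

ctas_queries = [
    f"CREATE TABLE part_{i} WITH (format = 'PARQUET') AS "
    f"SELECT * FROM source_table WHERE bucket_id = {i}"
    for i in range(16)
]

execution_ids = []
for sql in ctas_queries:
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    execution_ids.append(response["QueryExecutionId"])

# Count how many queries Athena reports in each state at this moment.
states = [
    athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    for qid in execution_ids
]
print({state: states.count(state) for state in set(states)})
```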
Update 2019-09-26
I just stumbled across the Hive Connector page in the Presto documentation, which has a section on AWS Glue Catalog Configuration Properties. There we can see:
hive.metastore.glue.max-connections: Max number of concurrent connections to Glue (defaults to 5).
This got me wondering whether it has something to do with my issue. As I understand it, Athena is simply Presto running on an EMR cluster which is configured to use the AWS Glue Data Catalog as its metastore.
So what if my issue comes from the fact that the EMR cluster behind Athena simply uses the default value for concurrent connections to Glue, which is 5, and that is exactly how many concurrent queries are actually getting executed (on average) in my case?
Update 2019-11-27
The Athena team recently deployed a host of new functionality for Athena. Although QUEUED has been in the state enum for some time, it hasn't been used until now. So now I get correct info about query state in the History tab, but everything else remains the same.
Also, another post was published with a similar problem.
Your account's limit for the Athena service is not an SLA; it's more of a priority in the query scheduler.
Depending on available capacity, your queries may be queued even if you're not running any other queries. Exactly what a higher concurrency limit means is internal and could change, but in my experience it's best to think of it as the priority with which the query scheduler will deal with your queries. Queries for all accounts run in the same server pool(s), and if everyone is running queries there will not be any capacity left for you.
You can see this in action by running the same query over and over again and plotting the query execution metrics over time: you will notice that they vary a lot, and you will notice spikes in the time your queries spend queued at the top of every hour, when everyone else is running their scheduled queries.
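A rough sketch of how you could collect those metrics with boto3; the Statistics field names are as I recall them from the GetQueryExecution response, so verify them against the current API before relying on this:

```python
# Sketch: run a query, wait for it to finish, and record queue/engine time.
import time

import boto3

athena = boto3.client("athena")


def run_and_measure(sql, database, output_location):
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    while True:
        execution = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
        if execution["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(5)

    stats = execution["Statistics"]
    return {
        "queued_ms": stats.get("QueryQueueTimeInMillis"),
        "engine_ms": stats.get("EngineExecutionTimeInMillis"),
        "submitted_at": execution["Status"]["SubmissionDateTime"],
    }
```

Running the same query in a loop and plotting the queued_ms values over a day should make the hourly spikes visible.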
I am using DynamoDB to store configuration for an application; this configuration is likely to be changed a few times a day and will be on the order of tens of rows. My application will be deployed to a number of EC2 instances. I will eventually write another application to allow management of the configuration; in the meantime, configuration is managed by making changes to the table directly in the AWS console.
I am trying to use DynamoDB Streams to watch for changes to the configuration; when the application receives records to process, it simply rereads the entire DynamoDB table.
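(For illustration only, the same watch-and-reread pattern in a minimal boto3 sketch; my actual application uses the Kinesis Client Library with the DynamoDB Streams adapter. The table name and apply_config() helper below are placeholders.)

```python
# Sketch: poll the table's stream and reread the whole (tiny) config table
# whenever any change record arrives. Handles only a single shard for brevity.
import time

import boto3

dynamodb = boto3.client("dynamodb")
streams = boto3.client("dynamodbstreams")

stream_arn = dynamodb.describe_table(TableName="app-config")["Table"]["LatestStreamArn"]
shard = streams.describe_stream(StreamArn=stream_arn)["StreamDescription"]["Shards"][0]

iterator = streams.get_shard_iterator(
    StreamArn=stream_arn,
    ShardId=shard["ShardId"],
    ShardIteratorType="LATEST",
)["ShardIterator"]

while iterator:
    response = streams.get_records(ShardIterator=iterator, Limit=100)
    if response["Records"]:
        # Any change triggers a full reread of the configuration table.
        config_items = dynamodb.scan(TableName="app-config")["Items"]
        apply_config(config_items)  # hypothetical helper
    iterator = response.get("NextShardIterator")
    time.sleep(5)
```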
This works locally and when deployed to one instance, but when I deploy it to three instances it never initializes the IRecordProcessor, and doesn't pick up any changes to the table.
I suspect this is because the stream has only one shard, and the number of instances should not exceed the number of shards (at least for Kinesis streams; I understand that Kinesis and DynamoDB streams are actually different, though).
I know how to split shards in Kinesis streams, but I cannot seem to find a way to do this for DynamoDB streams. I read that, in fact, the number of shards in a DynamoDB stream is equal to the number of partitions in the DynamoDB table, and that you can increase the number of partitions by increasing read/write capacity. I don't want to increase the throughput, as this would be costly.
Does the condition that the number of shards must not be fewer than the number of instances also apply to DynamoDB streams? If so, is there another way to increase the number of shards? And if not, is there a known reason that DynamoDB streams on small tables fail when read by multiple instances?
Is there a better way to store and watch such configuration (ideally using AWS infrastructure)? I am going to investigate triggers.
I eventually solved this by adding the instance ID (EC2MetadataUtils.getInstanceId) to the stream name when setting up the KinesisClientLibConfiguration, so a new stream configuration is set up for each instance. This does result in a separate DynamoDB table being created for each instance, and I now need to delete old tables when I restart the app on new instances.
I also contacted AWS support, and received this response.
In the DynamoDB documentation, and in many places around the internet, I've seen that single-digit millisecond response times are typical, but I cannot seem to achieve that even with the simplest setup. I have configured a t2.micro EC2 instance and a DynamoDB table, both in us-west-2, and when running the command below from the AWS CLI on the EC2 instance I get responses averaging about 250 ms. The same command run from my local machine (Denver) averages about 700 ms.
aws dynamodb get-item --table-name my-table --key file://key.json
When looking at the CloudWatch metrics in the AWS console, though, it says the average get latency is 12 ms. If anyone could tell me what I'm doing wrong, or point me toward information that would let me solve this on my own, I would really appreciate it. Thanks in advance.
The response times you are seeing are largely due to the cold start time of the AWS CLI. When running your get-item command, the CLI has to be loaded into memory, fetch temporary credentials (if using an EC2 IAM role on your t2.micro instance), and establish a secure connection to the DynamoDB service. Only after all that is completed does it execute the get-item request and finally print the results to stdout. Your command also needs to read the key.json file off the filesystem, which adds further overhead.
My experience running on a t2.micro instance is that the AWS CLI has around 200 ms of overhead when it starts, which seems in line with what you are seeing.
This will not be an issue with long-running programs, as they pay a similar overhead price only once, at start time. I run a number of web services on t2.micro instances that work with DynamoDB, and the DynamoDB response times are consistently sub-20 ms.
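You can see the difference yourself with a long-lived client. A rough boto3 sketch (table name and key schema are placeholders): the first call pays the connection and credential overhead, and subsequent calls reuse the connection.

```python
# Sketch: time repeated GetItem calls from a single long-lived boto3 client.
import time

import boto3

table = boto3.resource("dynamodb", region_name="us-west-2").Table("my-table")

for i in range(5):
    start = time.perf_counter()
    table.get_item(Key={"id": "some-key"})  # placeholder key schema
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"request {i}: {elapsed_ms:.1f} ms")

# Typically only the first request is slow; the rest should be low double digits
# or better from an EC2 instance in the same region.
```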
There are a lot of factors that go into the latency you will see when making a REST API call. DynamoDB can provide latencies in the single-digit milliseconds, but there are some caveats and things you can do to minimize the latency.
The first things to consider are distance and the speed of light. Expect the best latency when accessing DynamoDB from an EC2 instance located in the same region. It is normal to see higher latencies when accessing DynamoDB from your laptop or another data center. Note that each region also has multiple data centers.
There are also performance costs on the client side based on the hardware, network connection, and programming language that you are using. When you are talking about millisecond latencies, the processing time on your own machine can make a difference.
Another likely source of latency is the TLS handshake. Establishing an encrypted connection requires multiple round trips and computation on both sides to get the encrypted channel established. However, as long as you are using keep-alive for the connection, you only pay this overhead for the first query. Successive queries will be substantially faster since they do not incur this initial penalty. Unfortunately the AWS CLI isn't going to keep the connection alive between requests, but the AWS SDKs for most languages will manage this for you automatically.
Another important consideration is that the latency DynamoDB reports in the web console is the average. While DynamoDB does provide reliable average low double-digit latency, the maximum latency will regularly be in the hundreds of milliseconds or even higher. This is visible by viewing the maximum latency in CloudWatch.
They recently announced DAX (Preview).
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache for DynamoDB that delivers up to a 10x performance improvement – from milliseconds to microseconds – even at millions of requests per second. For more information, see In-Memory Acceleration with DAX (Preview).