Concurrent Queries, COPY and Connections in AWS Redshift - amazon-web-services

I am trying to understand the difference between concurrent connections and concurrent queries in Redshift. According to the documentation, we can make 500 concurrent connections to a Redshift cluster, but it also says a maximum of 15 queries can be run at the same time in a cluster. So what is the exact value?
How many queries can be in a running state in a cluster at the same time? If it is 15, does it include queries in the RETURNING state as well?
How many concurrent COPY statements can run in a cluster?
We are evaluating Redshift as our primary reporting data store. If we cannot run a large number of queries simultaneously, it may be difficult for us to go with this model.

I think you have misread something; the maximum is 50 concurrent queries per WLM queue. Refer to the thread below for the Amazon support response and more detail.
How many queries can be in a running state in a cluster at the same time? If it is 15, does it include queries in the RETURNING state as well?
At any given time, a maximum of 50 queries can be running concurrently, and yes, that includes INSERT/UPDATE/DELETE and so on.
How many concurrent COPY statements can run in a cluster?
Ideally you could go up to 50 concurrently here as well, but COPY works a bit differently.
Amazon Redshift automatically loads in parallel from multiple data files.
If you use multiple concurrent COPY commands to load one table from multiple files, Amazon Redshift is forced to perform a serialized load, which is much slower and requires a VACUUM at the end if the table has a sort column defined. For more information about using COPY to load data in parallel, see Loading Data from Amazon S3.
Meaning, you can run concurrent COPY commands, but make sure you run only one COPY command at a time per table.
So in practice the achievable concurrency doesn't depend only on the nodes in the cluster, but on the number of tables as well.
So if you have only one table and you try to execute 50 loads against it concurrently, they will effectively run as a single COPY at a time.
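Under those constraints, a minimal sketch of the single-COPY-per-table approach might look like the following, assuming a psycopg2 connection and hypothetical cluster endpoint, bucket, table, and IAM role names. Pointing one COPY at an S3 prefix (or a manifest) lets Redshift split the files across slices itself:

import psycopg2

# Hypothetical cluster endpoint, credentials, bucket, and IAM role.
conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="loader",
    password="...",
)

# One COPY per table: Redshift loads the files under the prefix in parallel,
# instead of serializing many competing COPY commands against the same table.
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY my_table
        FROM 's3://my-bucket/my_table/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        FORMAT AS CSV;
    """)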

Related

Amazon DynamoDB read latency while writing

I have an Amazon DynamoDB table which is used for both read and write operations. Write operations are performed only when the batch job runs at certain intervals, whereas read operations happen consistently throughout the day.
I am facing a problem of increased read latency when a significant amount of write operations is happening due to the batch jobs. I explored having a separate read replica for DynamoDB but found nothing much of use. Global tables are not an option because that's not what they are for.
Any ideas how to solve this?
Going by the Dynamo paper, the concept of a read replica for a record or a table does not exist in Dynamo. Within the same region, you will have multiple copies of a record depending on the replication factor (R + W > N, where N is the replication factor). However, when the client reads, one of those copies is returned depending on cluster health.
Depending on how the coordinator node is chosen, either at the client library or at the cluster, the client can only ask for a record (get) or send a record (put) either to the cluster coordinator (one extra hop) or to the node assigned to the record (a single hop to the record). There is just no way for the client to say 'give me a read replica from another node'. The replicas are there for fault tolerance: if one of the nodes containing the master copy of the record dies, the replicas will be used.
I am researching the same problem in the context of hot keys. Every record gets assigned to a node in Dynamo, so a million reads on the same record will lead to hot keys, loss of reads/writes, and so on. How to deal with this? A read replica would work great because I could then manage the hot keys at the application level and move all extra reads to the read replica(s). This again is fraught with issues.

Alternatives for Athena to query the data on S3

I have around 300 GB of data on S3. Let's say the data looks like:
## S3://Bucket/Country/Month/Day/1.csv
S3://Countries/Germany/06/01/1.csv
S3://Countries/Germany/06/01/2.csv
S3://Countries/Germany/06/01/3.csv
S3://Countries/Germany/06/02/1.csv
S3://Countries/Germany/06/02/2.csv
We are doing some complex aggregation on the data, and because some countries' data is big and some countries' data is small, AWS EMR doesn't make sense to use: once the small countries are finished, their resources are wasted while the big countries keep running for a long time. Therefore, we decided to use AWS Batch (Docker containers) with Athena. One job works on one day of data per country.
Now there are roughly 1000 jobs which start together, and when they query Athena to read the data, the containers fail because they hit the Athena query limits.
Therefore, I would like to know what other possible ways there are to tackle this problem. Should I use a Redshift cluster, load all the data there, and have all the containers query the Redshift cluster, since it doesn't have such query limitations? But that is expensive and takes a lot of time to ramp up.
The other option would be to read the data on EMR and use Hive or Presto on top of it to query the data, but again it will run into query limitations.
It would be great if someone can give better options to tackle this problem.
As I understand it, you simply send queries to the AWS Athena service, and after all aggregation steps finish you retrieve the resulting CSV file from the S3 bucket where Athena saves results, so you end up with 1000 files (one for each job). The problem, then, is the number of concurrent Athena queries rather than the total execution time.
Have you considered using Apache Airflow for orchestrating and scheduling your queries? I see Airflow as an alternative to a combination of Lambda and Step Functions, but it is totally free. It is easy to set up on both local and remote machines, has a rich CLI and GUI for task monitoring, and abstracts away all the scheduling and retrying logic. Airflow even has hooks to interact with AWS services. Hell, it even has a dedicated operator for sending queries to Athena, so sending queries is as easy as:
from airflow.models import DAG
from airflow.contrib.operators.aws_athena_operator import AWSAthenaOperator
from datetime import datetime

with DAG(dag_id='simple_athena_query',
         schedule_interval=None,
         start_date=datetime(2019, 5, 21)) as dag:

    run_query = AWSAthenaOperator(
        task_id='run_query',
        query='SELECT * FROM UNNEST(SEQUENCE(0, 100))',
        output_location='s3://my-bucket/my-path/',
        database='my_database'
    )
I use it for similar types of daily/weekly tasks (processing data with CTAS statements) which exceed the limitation on the number of concurrent queries.
There are plenty of blog posts and documentation pages that can help you get started. For example:
Medium post: Automate executing AWS Athena queries and moving the results around S3 with Airflow.
Complete guide to installation of Airflow, link 1 and link 2
You can even set up an integration with Slack to send notifications when your queries terminate, either in a success or a failed state.
However, the main drawback I am facing is that only 4-5 queries actually get executed at the same time, whereas all the others just sit idle.
One solution would be to not launch all jobs at the same time, but pace them to stay within the concurrency limits. I don't know if this is easy or hard with the tools you're using, but it's never going to work out well if you throw all the queries at Athena at the same time. Edit: it looks like you should be able to throttle jobs in Batch, see AWS batch - how to limit number of concurrent jobs (by default Athena allows 25 concurrent queries, so try 20 concurrent jobs to have a safety margin – but also add retry logic to the code that launches the job).
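For the retry part, a small sketch with boto3 might look like this; the names and backoff policy are illustrative, and TooManyRequestsException is the error code Athena returns when it throttles StartQueryExecution:

import time
import boto3
from botocore.exceptions import ClientError

athena = boto3.client("athena")

def start_query_with_retry(query, database, output_location, max_attempts=5):
    """Start an Athena query, backing off and retrying if Athena throttles us."""
    for attempt in range(max_attempts):
        try:
            response = athena.start_query_execution(
                QueryString=query,
                QueryExecutionContext={"Database": database},
                ResultConfiguration={"OutputLocation": output_location},
            )
            return response["QueryExecutionId"]
        except ClientError as err:
            if err.response["Error"]["Code"] != "TooManyRequestsException":
                raise
            time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError("query could not be started after %d attempts" % max_attempts)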
Another option would be to not do it as separate queries, but try to bake everything together into fewer, or even a single query – either by grouping on country and date, or by generating all queries and gluing them together with UNION ALL. If this is possible or not is hard to say without knowing more about the data and the query, though. You'll likely have to post-process the result anyway, and if you just sort by something meaningful it wouldn't be very hard to split the result into the necessary pieces after the query has run.
Using Redshift is probably not the solution, since it sounds like you're doing this only once per day, and you wouldn't use the cluster very much. Athena is a much better choice, you just have to handle the limits better.
With my limited understanding of your use case I think using Lambda and Step Functions would be a better way to go than Batch. With Step Functions you'd have one function that starts N number of queries (where N is equal to your concurrency limit, 25 if you haven't asked for it to be raised), and then a poll loop (check the examples for how to do this) that checks queries that have completed, and starts new queries to keep the number of running queries at the max. When all queries are run a final function can trigger whatever workflow you need to run after everything is done (or you can run that after each query).
The benefit of Lambda and Step Functions is that you don't pay for idle resources. With Batch, you will pay for resources that do nothing but wait for Athena to complete. Since Athena, in contrast to Redshift for example, has an asynchronous API you can run a Lambda function for 100ms to start queries, then 100ms every few seconds (or minutes) to check if any have completed, and then another 100ms or so to finish up. It's almost guaranteed to be less than the Lambda free tier.
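A single-process version of that keep-N-running loop, sketched with boto3 (the concurrency limit, database, and output location are placeholders), could look roughly like this; a Step Functions implementation would split the same logic across start, poll, and finish states:

import time
import boto3

athena = boto3.client("athena")
MAX_CONCURRENT = 20  # stay below the default 25-query concurrency limit

def start(query, database, output_location):
    return athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

def finished(query_id):
    status = athena.get_query_execution(QueryExecutionId=query_id)
    return status["QueryExecution"]["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED")

def run_all(queries, database, output_location):
    """Drain the query list while keeping at most MAX_CONCURRENT queries running."""
    pending = list(queries)
    running = set()
    while pending or running:
        running = {qid for qid in running if not finished(qid)}
        while pending and len(running) < MAX_CONCURRENT:
            running.add(start(pending.pop(), database, output_location))
        time.sleep(5)  # poll interval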
As far as I know, Redshift Spectrum and Athena cost the same. You should not compare Redshift to Athena; they have different purposes. But first of all, I would think about addressing your data skew issue. Since you mentioned AWS EMR, I assume you use Spark. To deal with large and small partitions you need to repartition your dataset by month, or by some other equally distributed value. Or you can group by month and country. You get the idea.
You can use Redshift Spectrum for this purpose. Yes, it is a bit costly, but it is scalable and very good for performing complex aggregations.

Are 100-200 upserts and inserts in a 10-second window into a 3-node Redshift cluster a realistic architecture?

With 3 nodes on Redshift, we plan on doing 50-100 inserts every 10 seconds. Within that 10-second window we will also try to do the equivalent of a Redshift upsert, as documented here https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-upsert.html, on about 50 to 100 rows as well.
I'm basically unsure whether a 10-second window is realistic or a 10-minute window (etc.) is better for this kind of load. Should this be a daily batch? Should I try to re-architect to get rid of upserts?
My question is essentially: can Redshift handle this load? I feel the upsert is happening too many times. We are using Structured Streaming in Spark to handle all of this. If yes, what type of nodes should we be using? Does anyone who has done this have a ballpark estimate? If no, what are alternative architectures?
Essentially what we're trying to do is load entity data to be joined with the events in Redshift. But we want the analytics to be as near real time as possible, so we want to load as fast as we can.
There's probably no exact answer for this, so any explanation that can help me perform estimations on requirements based on load will be helpful.
I do not think you will achieve the performance you seek.
Running large numbers of INSERT statements is not an optimal way to load data into Amazon Redshift.
The best way is via running COPY from data stored in Amazon S3. This loads data in parallel across all nodes.
Unless you have a very real need to get data immediately into Redshift, it would be better to batch the data in S3 over a period of time (the larger the batch, the better), then load via COPY. This will also work well with the Staging Table approach to performing UPSERTS.
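As a rough illustration of that staging-table approach (the table name "events", key column "id", bucket, cluster endpoint, and IAM role are all hypothetical), the batched COPY plus merge could look something like this with psycopg2:

import psycopg2

MERGE_SQL = """
BEGIN;

CREATE TEMP TABLE events_staging (LIKE events);

COPY events_staging
FROM 's3://my-bucket/batched-events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV;

-- Remove rows that are being replaced, then insert the new versions.
DELETE FROM events USING events_staging WHERE events.id = events_staging.id;
INSERT INTO events SELECT * FROM events_staging;

DROP TABLE events_staging;

END;
"""

conn = psycopg2.connect(host="my-cluster.example.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="loader", password="...")
conn.autocommit = True  # the script above manages its own transaction
with conn.cursor() as cur:
    cur.execute(MERGE_SQL)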
The best way to discover whether Redshift will handle a particular load is to try it! Spin up another cluster and try the various methods, measuring the performance each time.
I would recommend using Kinesis Firehose to insert data to Redshift. It will optimize for time / load and insert accordingly.
We tried inserting manually in batches, but it does not seem to be the cleanest way of handling it when an optimized cloud service exists for exactly this purpose.
https://docs.aws.amazon.com/ses/latest/DeveloperGuide/event-publishing-redshift-firehose-stream.html
It collects records in batches, compresses them, and loads them to Redshift.
Upsert Process:
If you want upserts, this is how I would do them in a scalable way:
DynamoDB Table (Update) --> DynamoDB Streams --> Lambda --> Firehose --> Redshift
Have a scheduled job to cleanup any duplicate records based on created_timestamp.
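For the Lambda step in that pipeline, a minimal sketch might look like the following, assuming the Firehose delivery stream already targets Redshift and using a hypothetical stream name; it just forwards each change record's new image to Firehose:

import json
import boto3

firehose = boto3.client("firehose")

def handler(event, context):
    # DynamoDB Streams invokes the function with a batch of change records;
    # forward each record's NewImage to the Firehose stream that loads Redshift.
    records = []
    for record in event.get("Records", []):
        new_image = record.get("dynamodb", {}).get("NewImage")
        if new_image is None:  # e.g. REMOVE events carry no new image
            continue
        records.append({"Data": (json.dumps(new_image) + "\n").encode("utf-8")})
    if records:
        # put_record_batch accepts up to 500 records per call
        firehose.put_record_batch(
            DeliveryStreamName="my-redshift-delivery-stream",  # hypothetical
            Records=records,
        )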
Hope it helps.

How to make Athena process multiple queries concurrently

I'm launching several concurrent queries to Athena via a Python application.
Given Athena's history of queries, it seems that multiple queries are all indeed received at the same time by Athena, and processed concurrently.
However, it turns out that the overall query running time is not that different from sending queries one after the other.
Example: sending three queries sequentially vs concurrently:
# sequentially
          received at    took    finished at
query_1   22:01:14       6s      22:01:20
query_2   22:01:20       6s      22:01:27
query_3   22:01:27       5s      22:01:32

# concurrently
          received at    took    finished at
query_1   22:02:25       17s     22:02:42
query_2   22:02:25       17s     22:02:42
query_3   22:02:25       17s     22:02:42
According to these results, in the second case it seems that Athena, although appearing to accept the queries concurrently, effectively processed them in a sequential manner.
Is there some configuration I wouldn't be aware of, to make Athena effectively process multiple queries concurrently? Ideally, in this example, the three queries processed concurrently would take a global running time of 6s (the longest time of the three individual queries).
Note: these are three queries targeting the same database/table, backed by the same (single) Parquet file in S3. This Parquet file is approx. 70 MB and has 2.5M rows with half a dozen columns.
In general the way you run concurrent queries in Athena is to run as many StartQueryExecution calls as you need, collect the query execution IDs, and then poll using GetQueryExecution for each one to be completed. Athena runs each query independently, concurrently, and asynchronously.
Depending on how long you wait between polling each query execution ID it may look like queries take different amounts of time. You can use the Statistics.EngineExecutionTimeInMillis property of the response from GetQueryExecution to see how long the query executed in Athena, and the difference between the Status.SubmissionDateTime and Status.CompletionDateTime properties to see the total time between when Athena received the query and when the response was available. Usually these two numbers are very close, and if there is a difference your query got queued internally in Athena.
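As a rough sketch of that start-then-poll pattern with boto3 (the database, output location, and one-second poll interval are just placeholders), including where the EngineExecutionTimeInMillis figure comes from:

import time
import boto3

athena = boto3.client("athena")

def run_concurrently(queries, database, output_location):
    """Start every query, then poll each execution ID until it completes."""
    query_ids = [
        athena.start_query_execution(
            QueryString=query,
            QueryExecutionContext={"Database": database},
            ResultConfiguration={"OutputLocation": output_location},
        )["QueryExecutionId"]
        for query in queries
    ]
    results = {}
    pending = set(query_ids)
    while pending:
        for query_id in list(pending):
            execution = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
            state = execution["Status"]["State"]
            if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
                pending.discard(query_id)
                # How long Athena itself spent executing, regardless of polling delays
                results[query_id] = (state, execution["Statistics"].get("EngineExecutionTimeInMillis"))
        time.sleep(1)
    return results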
The numbers in your question look unlikely. That they ended on the exact same second after running for 17 seconds looks suspicious. How many times did you run your experiment? If you look at Statistics.EngineExecutionTimeInMillis do they differ in the number of milliseconds, or are all numbers identical? Did you set ClientRequestToken, and if so, was it the same value for all three queries (in that case you actually only ran one).
What do you mean by "concurrently", do you start and poll from different threads, or poll in a single loop? How long did you wait between each poll call?

RDS - max replica lag when using read replica

I am using RDS's read replica mechanism during a schema update to a very large MySQL table.
I ran an Alter command which locks the table for a long period of time (more than 24 hours).
In that period of time my read replica was not getting updated and I noticed the Replica lag value was slowly increasing.
When the table update was complete I saw that the Replica lag was slowly decreasing until the read replica finally caught up with the original DB.
While my Alter command was running, I did a small experiment and occasionally updated a specific row so I can follow it on my read replica. My experiment showed that the updates to this specific row indeed eventually happened also in the read replica (after the table was unlocked).
Based on the above experiment, I assume all updates that were blocked while my read replica was catching up were eventually also performed on my replicated DB after the table modification, but it would be hard to prove something like that for such a big table and such a long period of time.
I couldn't find any official documentation on how this mechanism works, and I was wondering where exactly all these updates are buffered and what the limit of this buffer is (e.g. when would I start losing changes that occurred on my master DB)?
This is covered in the documentation. Specifically, the replica ("slave") server's relay log is the place where the changes usually wait until they are actually executed on the replica.
http://dev.mysql.com/doc/refman/5.6/en/slave-logs.html
But, the limit to how far behind a replica can be -- but still, eventually, have data identical to the master -- is a combination of factors. It should not ever quietly "misplace" any of the buffered changes, as long as it's being monitored.
Each time the data on the master database changes, the master writes a replication event to its binary log, and these logs are delivered to the replica, usually in near-real-time, where they are stored, pretty much as-sent, in the relay logs, as the first step in a 2-step process.
The second step is for the replica to read through those logs, sequentially, and modify its local data set, according to what the master server sent. The statements are typically executed sequentially.
The two biggest factors that determine how far behind a replica can safely become are the amount of storage available for relay logs on the replica and the amount of storage plus log retention time on the master. RDS has additional logic on top of "stock" MySQL Server to prevent the master from purging its copy of the log until the replica(s) have received them.
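If it helps, here is a rough monitoring sketch, assuming a MySQL 5.6 replica and the pymysql client (host, user, and password are hypothetical). SHOW SLAVE STATUS exposes the relay log position and the Seconds_Behind_Master value, which roughly corresponds to the ReplicaLag metric RDS reports:

import pymysql

# Hypothetical connection details for the RDS read replica.
conn = pymysql.connect(
    host="my-replica.example.us-east-1.rds.amazonaws.com",
    user="monitor",
    password="...",
    cursorclass=pymysql.cursors.DictCursor,
)

with conn.cursor() as cur:
    # SHOW SLAVE STATUS reports how far the replica's SQL thread has progressed
    # through the relay log relative to what the master has sent.
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()
    print("Replica lag (s):", status["Seconds_Behind_Master"])
    print("Relay log file:", status["Relay_Log_File"])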