Cheapest way to insert one billion rows into AWS RDS - amazon-web-services

I need to import one billion rows of data from my local machine into AWS RDS.
The local machine has a high-speed internet connection of up to 100 MB/s, so the network is not the problem.
I'm using an AWS RDS r3.xlarge instance with 2000 PIOPS and 300 GB of storage.
However, since my PIOPS is capped at 2000, importing one billion rows is going to take about 7 days.
How can I speed up the process without paying more?
Thanks so much.

Your PIOPS are the underlying I/O provisioning for your database instance, that is, how many I/O operations per second the instance is guaranteed to be able to issue to persistent storage. You might be able to optimize that slightly by using larger write batches (depending on what your DB supports), so that each operation carries more data, but fundamentally it limits the throughput available for your import.
You can provision more I/O for the import and then scale it back down afterwards, though.
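As a rough illustration of the "larger write batches" idea, the sketch below groups rows into chunks and builds one multi-row INSERT per chunk, so that each round trip carries many rows. The table name, column names and batch size are hypothetical, and the %s placeholders assume a DB-API driver that uses the "format" parameter style. Whether this actually lowers the I/O cost per row depends on the engine.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence


def batched(rows: Iterable, batch_size: int) -> Iterator[List]:
    """Yield consecutive lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


def build_multi_row_insert(table: str, columns: Sequence[str], n_rows: int) -> str:
    """Build a parameterized INSERT that carries n_rows rows in one statement."""
    one_row = "(" + ", ".join(["%s"] * len(columns)) + ")"
    all_rows = ", ".join([one_row] * n_rows)
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES {all_rows}"


# Hypothetical usage with a DB-API cursor (not executed here):
#   for chunk in batched(rows, 1000):
#       sql = build_multi_row_insert("events", ["id", "payload"], len(chunk))
#       cursor.execute(sql, [value for row in chunk for value in row])
#       connection.commit()  # one commit per batch, not per row
```

Committing once per batch rather than once per row is the part most likely to matter, since each commit typically forces a flush to storage.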

Related

Fastest way to Import Data from AWS Redshift to BI Tool

I have a table with 15 million rows in AWS Redshift, running on ra3.xlplus with 2 nodes. I am retrieving the data on-premises at the office and trying to load it into memory in a BI tool. Importing the data over a JDBC connection takes a long time (12 minutes). I also tried an ODBC connection and got the same result. I tried spinning up an EC2 instance with a 25 gigabit connection on AWS, but got the same results.
For comparison, loading that data in CSV format takes about 90 seconds.
Are there any ways to speed up the data transfer?
There are ways to improve this, but the true limiter needs to be identified first. The likely bottleneck is the network bandwidth between AWS and your on-prem system. As you are bringing a large amount of data down from the cloud, you will want an efficient process for this transport.
JDBC and ODBC are not network-efficient, as you are seeing. The first thing that will help in moving the data is compression. The second is parallel transfer, since there is a fair amount of handshaking in the TCP protocol and there is more usable bandwidth than one connection can consume. The way I have done this in the past is to UNLOAD the data compressed to S3, then copy the files from S3 to the local machine in parallel, piping them through decompression and saving them. Lastly, these files are loaded into your BI tool.
Clearly, setting this up takes some time, so you want to be sure that the process will be used enough to justify the effort. Another way to go is to bring your BI tool closer to Redshift by running it on an EC2 instance. The shorter network distance and higher bandwidth should bring down the transfer time significantly. A downside of locating your database in the cloud is that it is in the cloud and not on-prem.
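A minimal sketch of the local end of that pipeline, assuming the gzip-compressed UNLOAD files have already been copied down from S3 by whatever tool you prefer. It only shows the "decompress in parallel" step. The file names, output directory and worker count are hypothetical.

```python
import gzip
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Iterable, List


def gunzip_one(src: Path, out_dir: Path) -> Path:
    """Decompress a single .gz file into out_dir and return the output path."""
    if src.name.endswith(".gz"):
        dst = out_dir / src.name[:-3]
    else:
        dst = out_dir / (src.name + ".out")
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # streams in chunks, not all in memory
    return dst


def gunzip_parallel(files: Iterable[Path], out_dir: Path, max_workers: int = 8) -> List[Path]:
    """Decompress many files concurrently; results keep the input order."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda f: gunzip_one(Path(f), out_dir), files))
```

The same thread pool shape could wrap the download step as well, one file per worker, which is where the parallel-connection benefit described above would come from.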

S3 docs: "one concurrent request per 85–90 MB/s of desired network throughput" -- Why?

On the page linked below, I found the following statement:
Make one concurrent request for each 85–90 MB/s of desired network throughput. To saturate a 10 Gb/s network interface card (NIC), you might use about 15 concurrent requests over separate connections. You can scale up the concurrent requests over more connections to saturate faster NICs, such as 25 Gb/s or 100 Gb/s NICs.
Performance Design Patterns for Amazon S3 - Horizontal Scaling and Request Parallelization for High Throughput
What is the origin of these numbers? I can't find any other documentation that justifies this. My guess is that this limitation is speaking more to the limitations of NIC on the EC2 instance rather than S3. Still, is there any other source that explains where these numbers came from?
To be clear, this is not a question about how to optimize S3 throughput -- I'm aware of the alternatives. This is a question about the AWS S3 documentation itself.
The only people who could answer this definitively are those who are working on S3 internals. And they're almost certainly covered by NDA. So what I'm about to write is complete speculation.
We know that S3 is distributed and redundant: each object is stored on multiple physical drives, across multiple availability zones.
We can infer, from the fact that S3 is available as a networked service, that there is some form of network interface between the S3 volume and the outside world. Obvious, yes, but if that network interface is limited to 1 Gbit/sec (125 Mbyte/sec raw), it might plausibly achieve approximately 85-90 Mbyte/sec sustained throughput.
It's also important to remember that AWS uses a software-defined network: so while the S3 service may in fact have a network interface that supports 10 Gbit/sec, AWS may restrict the bandwidth that is available to any given connection.
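For what it's worth, the documentation's figure of about 15 concurrent requests for a 10 Gb/s NIC is just the line rate divided by the per-request throughput. The arithmetic below assumes decimal units (1 Gb/s = 1000 Mb/s, 8 bits per byte). The function names are my own.

```python
import math


def nic_megabytes_per_second(gbit_per_s: float) -> float:
    """Raw line rate of a NIC in MB/s, using decimal units."""
    return gbit_per_s * 1000 / 8


def concurrent_requests(gbit_per_s: float, per_request_mb_s: float) -> int:
    """Requests needed to fill the NIC if each one sustains per_request_mb_s."""
    return math.ceil(nic_megabytes_per_second(gbit_per_s) / per_request_mb_s)


print(nic_megabytes_per_second(1))   # 125.0 MB/s raw for a 1 Gb/s link
print(concurrent_requests(10, 85))   # 15
print(concurrent_requests(10, 90))   # 14
```

So 85-90 MB/s per request is roughly 70% of a raw 1 Gb/s link, and the "about 15" in the docs is consistent with the lower end of that range.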
Far more interesting to me is this quote, from the same link:
we suggest making concurrent requests for byte ranges of an object at the granularity of 8–16 MB
This implies that redundancy is managed at a sub-object level, so that a large object is split into multiple pieces of maybe 64 MB, and those pieces are individually distributed. Which is how HDFS manages large files, so not a giant leap.
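To make the byte-range suggestion concrete, here is a small sketch that splits an object of known size into HTTP Range header values at a chosen granularity. HTTP byte ranges are inclusive on both ends. The 8 MiB default is my reading of the lower bound of the quoted 8 to 16 MB guidance, and the function name is hypothetical.

```python
from typing import List


def byte_ranges(object_size: int, part_size: int = 8 * 1024 * 1024) -> List[str]:
    """Return Range header values that together cover object_size bytes."""
    if part_size < 1:
        raise ValueError("part_size must be >= 1")
    ranges = []
    start = 0
    while start < object_size:
        # The end offset is inclusive, so the last byte of a part is start + part_size - 1.
        end = min(start + part_size, object_size) - 1
        ranges.append(f"bytes={start}-{end}")
        start = end + 1
    return ranges


print(byte_ranges(20, 8))  # ['bytes=0-7', 'bytes=8-15', 'bytes=16-19']
```

Each value would go into the Range header of a separate GET issued over its own connection. The fetching itself is not shown here.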
As for your supposition that it's a limit of EC2 rather than S3, I think that the suggestion to use multiple connections rules that out. Although it's possible that a single connection is limited to 1Gbit/sec by EC2, I would expect the S3 designers to be more concerned about load on their system. You can always test that out by opening a single connection between two EC2 instances with high-bandwidth networking, and see if it's throttled.

Does increasing the index cause the write IOPS of AWS RDS to rise?

Does increasing the index cause the write IOPS of AWS RDS to rise?
The AWS RDS instance I use is a db.m3.xlarge with 50 GB of storage.
The write IOPS of the instance is currently 120.
The peak write IOPS for 50 GB of storage is 150.
According to the official documentation:
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html
Baseline I/O performance for General Purpose SSD storage is 3 IOPS for each GiB, which means that larger volumes have better performance.
How do I find out why the RDS write IOPS is rising?
Let me try to answer this by dividing it into two parts:
Do I have an I/O problem?
Finding the reason behind high write IOPS on a MySQL (RDS) server
Do I have an I/O problem?
When using AWS RDS, one does not have traditional OS tools such as sysstat, iostat, dstat or sar. The tools for understanding what is happening in RDS are the CloudWatch metrics and the graphs provided.
Read and Write IOPS metrics:
By summing up ReadIOPS and WriteIOPS you will see how many IOPS your operations consume.
DiskQueueDepth metric: The DiskQueueDepth metric provides the number of outstanding I/Os (read/write requests) waiting to access the disk. If this metric is frequently above 2, you should expect to face performance issues sooner or later.
Using the above two graphs, it is easy to identify whether you are under-provisioned or over-provisioned in IOPS.
If your DiskQueueDepth is consistently between 0 and 0.5, you are over-provisioned.
If your DiskQueueDepth is consistently above 2, you are under-provisioned.
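The two rules of thumb above can be written down as a small check over DiskQueueDepth samples exported from CloudWatch. This is only a sketch: treating "consistently" as "at least 90% of samples" is my own assumption, as are the function name and the "inconclusive" bucket for everything in between.

```python
from typing import Sequence


def classify_queue_depth(samples: Sequence[float], low: float = 0.5,
                         high: float = 2.0, share: float = 0.9) -> str:
    """Classify IOPS provisioning from DiskQueueDepth samples."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    above = sum(1 for s in samples if s > high) / n   # fraction above 2
    below = sum(1 for s in samples if s <= low) / n   # fraction in 0..0.5
    if above >= share:
        return "under-provisioned"
    if below >= share:
        return "over-provisioned"
    return "inconclusive"


print(classify_queue_depth([0.1, 0.2, 0.0, 0.4]))  # over-provisioned
print(classify_queue_depth([3.0, 4.0, 2.5, 5.0]))  # under-provisioned
```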
Finding the reason behind high write IOPS on a MySQL (RDS) server
There are several ways to profile your queries, but as you are using RDS with MySQL, I would recommend using PERFORMANCE_SCHEMA, as you won't need external software (some of which is not fully RDS-compatible).
You can refer to this video with an introduction to query profiling, with examples like IOPS and temporary table creation monitoring by query pattern, user and table. For a more specific guide (especially for the configuration of metrics), you can have a look at the official manual and the sys schema documentation.
If you need a quick look at what is going on, you can run SHOW GLOBAL STATUS like 'com\_%'; and SHOW GLOBAL STATUS like 'Hand%'; at regular intervals to see whether there is an increase in the number of SQL queries per unit of time or in the number of engine row operations per unit of time.
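Since those status values are cumulative counters, what matters is the difference between two snapshots. A minimal sketch, assuming you have already collected the output of the two SHOW GLOBAL STATUS statements into dictionaries of name to integer value at two points in time. The function names and the sample numbers are made up for illustration.

```python
from typing import Dict, List, Tuple


def counter_rates(before: Dict[str, int], after: Dict[str, int],
                  interval_seconds: float) -> Dict[str, float]:
    """Per-second rate of change for counters present in both snapshots."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    rates = {}
    for name, new_value in after.items():
        old_value = before.get(name)
        if old_value is None:
            continue  # counter missing from the first snapshot
        rates[name] = (new_value - old_value) / interval_seconds
    return rates


def top_changes(rates: Dict[str, float], n: int = 5) -> List[Tuple[str, float]]:
    """The n counters that grew fastest over the interval."""
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Illustrative snapshots taken 60 seconds apart (values are invented).
before = {"Com_insert": 100, "Com_select": 1000, "Handler_write": 5000}
after = {"Com_insert": 700, "Com_select": 1060, "Handler_write": 35000}
print(top_changes(counter_rates(before, after, 60), 2))
# [('Handler_write', 500.0), ('Com_insert', 10.0)]
```

Comparing the rates from a quiet period with those from a period of high write IOPS should show which statement types or handler operations grew.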
To conclude, an increase in write IOPS may simply mean extra SQL load (obviously), but it can also mean many other things, such as too many temporary tables, or worse query plans being executed due to a change in the query optimiser's plan or in your data cardinality/size. It is critical to identify the underlying cause before taking any action.
Hope this helps!

AWS read replicas architecture

We have a service that runs in 6 AWS regions, and we have some requirements that must be met:
The latency of querying the database must be very low
It must support a high throughput of queries
It's been observed that the database update process is I/O intensive, so it increases query latency due to DB locks.
Delays on the order of seconds between update and read are acceptable
The architecture that we discussed has one service that updates the master DB and one slave in each region (6 slaves total).
We found some problems with that, and some possible solutions:
There is a limit of 5 read replicas when using AWS infrastructure.
To solve this issue, we thought of creating read replicas of read replicas. That should give us 25 instances.
There is a limitation in AWS that you cannot create a read replica of a read replica from another region.
To solve this issue, we thought of having the application update 2 master databases.
This approach creates the problem that, for a period of time, the databases can be inconsistent.
In the service implementation we can always recreate the data, so there is a job that re-updates the data from time to time (that is one of the reasons that the update is I/O intensive).
Has anyone had a similar problem? How do you handle it? Can we avoid creating and maintaining databases ourselves?
We are using MySQL, but we are pretty open to using other compatible DBs.
Unfortunately, there is no magical solution when it comes to inter-region: you lose latency.
I think you explored pretty much all the solutions from an RDS point of view with what you propose, e.g. a read replica of a read replica (I confirm you cannot do this from another region, but this is to save you from too high a replica lag).
Another solution would be to create databases on EC2 instances, but you would lose all the benefits of RDS (you could protect this traffic with an inter-region VPN between VPCs). Bear in mind, however, that too many read replicas will impact your performance.
My advice in your case would be:
to use caching massively at every possible level: ElastiCache between the DB and the servers, Varnish for HTTP pages, CloudFront for content delivery. If you want so many read replicas, it means that you are heavily dependent on reads. This way, you would save a lot of reads from hitting your database and improve latency significantly, and maybe 5 read replicas would then be enough.
to consider sharding or using several databases. This is not always a good solution, however, depending on your use case...
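To illustrate the caching advice, here is a tiny in-process cache-aside sketch with a time-to-live. It stands in for a real cache such as ElastiCache, and the class and parameter names are hypothetical. A TTL of a few seconds fits the stated requirement that delays on the order of seconds between update and read are acceptable.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Serve reads from memory and call the loader only on a miss or after expiry."""

    def __init__(self, loader: Callable[[Hashable], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader          # e.g. a function that queries the database
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so expiry can be tested
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}
        self.misses = 0

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]            # fresh hit: the database is not touched
        self.misses += 1
        value = self._loader(key)      # miss or stale entry: read through to the DB
        self._entries[key] = (now + self._ttl, value)
        return value
```

With a read-heavy workload, the number of loader calls (that is, database reads) drops to roughly one per key per TTL window, regardless of how many requests arrive.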
You can request an increase in the number of RDS for MySQL Read Replicas using the form at https://aws.amazon.com/contact-us/request-to-increase-the-amazon-rds-db-instance-limit/
Once the limit has been increased you'll want to test to make sure that the performance of having a large number of Read Replicas is acceptable to your application.
Hal

How to get sub 10ms response times from AWS DynamoDB?

In the DynamoDB documentation and in many places around the internet I've seen that single digit ms response times are typical, but I cannot seem to achieve that even with the simplest setup. I have configured a t2.micro ec2 instance and a DynamoDB table, both in us-west-2, and when running the command below from the aws cli on the ec2 instance I get responses averaging about 250 ms. The same command run from my local machine (Denver) averages about 700 ms.
aws dynamodb get-item --table-name my-table --key file://key.json
When looking at the CloudWatch metrics in the AWS console it says the average get latency is 12 ms though. If anyone could tell me what I'm doing wrong or point me in the direction of information where I can solve this on my own I would really appreciate it. Thanks in advance.
The response times you are seeing are largely due to the cold start time of the AWS CLI. When you run your get-item command, the CLI has to be loaded into memory, fetch temporary credentials (if using an EC2 IAM role when running on your t2.micro instance), and establish a secure connection to the DynamoDB service. Only after all that is completed does it execute the get-item request and finally print the results to stdout. Your command also needs to read the key.json file off the filesystem, which adds additional overhead.
My experience running on a t2.micro instance is that the AWS CLI has around 200 ms of overhead when it starts, which seems in line with what you are seeing.
This will not be an issue with long-running programs, as they pay a similar overhead only once, at start time. I run a number of web services on t2.micro instances which work with DynamoDB, and the DynamoDB response times are consistently sub 20 ms.
There are a lot of factors that go into the latency you will see when making a REST API call. DynamoDB can provide latencies in the single digit milliseconds but there are some caveats and things you can do to minimize the latency.
The first thing to consider is distance and speed of light. Expect to get the best latency when accessing DynamoDB when you are using an EC2 instance located in the same region. It is normal to see higher latencies when accessing DynamoDB from your laptop or another data center. Note that each region also has multiple data centers.
There are also performance costs from the client side based on the hardware, network connection, and programming language that you are using. When you are talking millisecond latencies the processing time on your machine can make a difference.
Another likely source of the latency will be the TLS handshake. Establishing an encrypted connection requires multiple round trips and computation on both sides to get the encrypted channel established. However, as long as you are using keep-alive for the connection, you will only pay this overhead for the first query. Successive queries will be substantially faster since they do not incur this initial penalty. Unfortunately, the AWS CLI isn't going to keep the connection alive between requests, but the AWS SDKs for most languages will manage this for you automatically.
Another important consideration is that the latency that DynamoDB reports in the web console is the average. While DynamoDB does provide a reliably low average latency (low double digits), the maximum latency will regularly be in the hundreds of milliseconds or even higher. This is visible by viewing the maximum latency in CloudWatch.
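To see the difference between the first request and the ones that follow, and between average and maximum, it helps to time repeated calls inside one long-running process instead of launching the CLI each time. The harness below is a sketch: call stands for whatever issues your request (for example an SDK get-item call made through a single long-lived client, which is not shown here), and the function names are my own.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_calls(call: Callable[[], object], n: int,
               clock: Callable[[], float] = time.perf_counter) -> List[float]:
    """Invoke call() n times in one process; return each latency in milliseconds."""
    latencies_ms = []
    for _ in range(n):
        start = clock()
        call()
        latencies_ms.append((clock() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Separate the first (cold) call from the later (warm) calls."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two measurements")
    warm = latencies_ms[1:]
    return {
        "first_ms": latencies_ms[0],             # includes connection setup
        "warm_mean_ms": statistics.mean(warm),   # steady-state average
        "warm_max_ms": max(warm),                # worst case after warm-up
    }
```

If the first call is much slower than the warm mean, the gap is connection and credential setup rather than DynamoDB itself.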
They recently announced DAX (Preview).
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache for DynamoDB that delivers up to a 10x performance improvement – from milliseconds to microseconds – even at millions of requests per second. For more information, see In-Memory Acceleration with DAX (Preview).