I have a table in AWS Redshift (ra3.xlplus, 2 nodes) which has 15 million rows. I am retrieving the data on-premises at the office and trying to load it into memory in a BI tool. It takes a lot of time (12 minutes) to import that data over a JDBC connection. I also tried an ODBC connection and got the same result. I tried spinning up an EC2 instance with a 25-gigabit connection on AWS, but got the same results.
For comparison, loading the same data from a CSV file takes about 90 seconds.
Are there any ways to speed up this data transfer?
There are ways to improve this, but the true limiter needs to be identified. The likely bottleneck is the network bandwidth between AWS and your on-prem system. As you are bringing a large amount of data down from the cloud, you will want an efficient process for this transport.
JDBC and ODBC are not network efficient, as you are seeing. The first thing that will help in moving the data is compression. The second is parallel transfer, since there is a fair amount of handshaking in the TCP protocol and there is more usable bandwidth than a single connection can consume. The way I have done this in the past is to UNLOAD the data compressed to S3, copy the files from S3 to the local machine in parallel while piping them through a decompressor as they are saved, and finally load those files into the BI tool.
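A minimal sketch of that flow, assuming psycopg2 and boto3 are available; the cluster endpoint, credentials, bucket, IAM role, and table names are all placeholders:

```python
# Sketch: UNLOAD compressed from Redshift to S3, then pull the parts down in parallel.
# Cluster endpoint, credentials, bucket, IAM role, and table names are placeholders.
import gzip
import os
from concurrent.futures import ThreadPoolExecutor

import boto3
import psycopg2

UNLOAD_SQL = """
    UNLOAD ('SELECT * FROM my_big_table')
    TO 's3://my-transfer-bucket/exports/my_big_table_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-unload-role'
    GZIP PARALLEL ON;
"""

# 1) Ask Redshift to write gzipped slices of the table to S3 (one file per slice).
with psycopg2.connect(host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
                      port=5439, dbname="analytics",
                      user="etl_user", password="...") as conn:
    with conn.cursor() as cur:
        cur.execute(UNLOAD_SQL)

# 2) Download and decompress the parts over several connections at once.
s3 = boto3.client("s3")
bucket, prefix, out_dir = "my-transfer-bucket", "exports/", "local_export"
os.makedirs(out_dir, exist_ok=True)

def fetch(key: str) -> str:
    """Download one gzipped part from S3, decompress it, and return the local CSV path."""
    gz_path = os.path.join(out_dir, os.path.basename(key))
    csv_path = gz_path.removesuffix(".gz")
    s3.download_file(bucket, key, gz_path)
    with gzip.open(gz_path, "rb") as src, open(csv_path, "wb") as dst:
        dst.write(src.read())
    os.remove(gz_path)
    return csv_path

keys = [o["Key"] for o in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", [])]
with ThreadPoolExecutor(max_workers=8) as pool:
    local_files = list(pool.map(fetch, keys))
```

The resulting local CSV files are then loaded into the BI tool.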
Clearly setting this up takes some time, so you want to be sure the process will be used enough to justify the effort. Another way to go is to bring your BI tool closer to Redshift by locating it on an EC2 instance. The shorter network distance and higher bandwidth should bring the transfer time down significantly. The downside is that your BI tool is then in the cloud rather than on-prem.
We recently released an open-source project to stream data to Redshift in near real time.
Github: https://github.com/practo/tipoca-stream
The real-time data pipeline streams data from RDS to Redshift.
Debezium writes the RDS events to Kafka.
We wrote Redshiftsink to sink data from Kafka to Redshift.
We have thousands of tables streaming to Redshift, loaded with the COPY command. We aim to load every ~10 minutes to keep the data as close to real time as possible.
Problem
Parallel loads become a bottleneck. Redshift is not good at ingesting data at such a short interval. We understand that Redshift is not a real-time database. What is the best that can be done? Does Redshift plan to address this in the future?
Workaround that works for us!
We have 1,000+ tables in Redshift but use no more than 400 of them on a given day. For this reason we now throttle loads for unused tables when needed. This makes sure the tables that are in use stay near real time and keeps Redshift less burdened. It has been very useful.
Looking for suggestions from the Redshift community!
I need to import one billion rows of data from a local machine into AWS RDS.
The local machine has a high-speed internet connection that goes up to 100 MB/s, so the network is not the problem.
I'm using an AWS RDS r3.xlarge with 2,000 PIOPS and 300 GB of storage.
However, since my PIOPS is capped at 2,000, importing one billion rows is going to take about 7 days.
How can I speed up the process without paying more?
Thanks so much.
Your PIOPS are the underlying IO provisioning for your database instance - that is, how much data per second the OS is guaranteed to be able to send to persistent storage. You might be able to optimize that slightly by using larger write batches (depending on what your DB supports), but fundamentally it limits the number of bytes per second available for your import.
You can, however, provision more IO for the import and then scale it back down afterwards.
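For illustration, a hedged sketch of temporarily raising provisioned IOPS with boto3; the instance identifier and IOPS values are placeholders, and note that RDS storage modifications take time to apply and are subject to cooldown limits:

```python
# Sketch: temporarily raise provisioned IOPS for the bulk import, then scale back down.
# The instance identifier and IOPS values are placeholders.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

def set_piops(instance_id: str, iops: int) -> None:
    # ApplyImmediately avoids waiting for the next maintenance window;
    # the storage modification itself can still take a while to complete.
    rds.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        Iops=iops,
        ApplyImmediately=True,
    )

set_piops("my-import-target", 10000)   # before the bulk load
# ... run the import ...
set_piops("my-import-target", 2000)    # back to the steady-state value
```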
I want to run an incremental nightly job that extracts hundreds of GBs of data from an Oracle data warehouse into HDFS. After processing, the results (a few GBs) need to be exported back to Oracle.
We are running Hadoop in Amazon AWS, and our data warehouse is on premises. The data link between AWS and on premises is 100 Mbps and not reliable.
If I use Sqoop import to bring the data from Oracle and the network experiences intermittent outages, how does Sqoop handle this?
Also, what happens if I imported (or exported) 70% of my data, and during the remaining 30%, the network goes down?
Since by default Sqoop uses JDBC, how does the data transfer happen at a network level? Can we compress the data in transit?
Does anyone know how fast the copy speed is from Amazon S3 to Redshift?
I only want to use Redshift for about an hour a day, to run updates on Tableau reports. The queries being run are always on the same database, but I need to run them each night to take into account new data that's come in that day.
I don't want to keep a cluster going 24x7 just to be used for one hour a day, but the only way I can see of doing this is to import the entire database into Redshift each night (I don't think you can suspend or pause a cluster). I have no idea what the copy speed is, so I have no idea whether it's going to be relatively quick to copy a 10 GB file into Redshift every night.
Assuming it's feasible, my thinking is to push the incremental changes from the SQL Server database into S3. Using CloudFormation, I automate the provisioning of a Redshift cluster at 1 am for 1 hour, import the database from S3, and schedule Tableau to run its queries in that window and collect its results. I keep an eye on how long the queries take, and if I need longer than an hour I just amend the CloudFormation template.
In this way I hope to keep a really 'lean' Tableau server by outsourcing all the ETL to Redshift, and buying only what I consume on Redshift.
Please feel free to critique my solution, or outright blow it out of the water. Otherwise, if the consensus is that importing is relatively quick, that gives me a thumbs up that I'm headed in the right direction with this solution.
Thanks for any assistance!
Redshift loads from S3 are very quick; however, Redshift clusters do not come up or tear down quickly at all. In the above example, most of your time (and money) would be spent waiting for the cluster to come up, existing data to load, refreshed data to unload, and the cluster to tear down again.
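(For reference, the load step itself is a single COPY per table. A minimal sketch using current Redshift syntax, assuming psycopg2, with placeholder cluster, bucket, IAM role, and table names:)

```python
# Sketch: load a nightly S3 extract into Redshift with COPY.
# Cluster endpoint, credentials, bucket, IAM role, and table are placeholders.
import psycopg2

COPY_SQL = """
    COPY sales_fact
    FROM 's3://my-nightly-extracts/2015-06-01/sales_fact_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
    CSV GZIP;
"""

with psycopg2.connect(host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
                      port=5439, dbname="reporting",
                      user="etl_user", password="...") as conn:
    with conn.cursor() as cur:
        cur.execute(COPY_SQL)
    conn.commit()
```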
In my opinion it would be better to use another approach for your overnight processing. I would suggest either:
For a couple of TB, InfiniDB on a largish EC2 instance with the database stored on an EBS volume.
For many TBs, Amazon EMR with the data stored on S3. If you don't want to get into Hadoop too much you can use Xplenty/Syncsort Ironcluster/etc. to orchestrate the Hadoop element.
While this question was asked three years ago, and it wasn't available at that time, a suitable solution now would be to use Amazon Athena, which allows on-demand SQL querying of data held in S3. It works on a pay-per-query model and is intended for ad-hoc and "quick" workloads like this.
Behind the scenes, Athena uses Presto and Elastic MapReduce, but the only required knowledge for a developer/analyst in practice is SQL.
Tableau also now has a built-in Athena connector (as of 10.3).
More on Athena here: https://aws.amazon.com/athena/
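A minimal sketch of running such a query through boto3; the database, table, and result bucket names are placeholders:

```python
# Sketch: run an ad-hoc SQL query against data in S3 via Athena.
# Database, table, and result bucket names are placeholders.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

start = athena.start_query_execution(
    QueryString="SELECT region, SUM(amount) FROM sales GROUP BY region",
    QueryExecutionContext={"Database": "my_reporting_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = start["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```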
You can presort the data you keep on S3; it will make VACUUM much faster.
This is the classic problem with Redshift. If you are open to a different route, Microsoft recently announced a new service called SQL Data Warehouse (which uses the PDW engine); I think they want to compete directly with Redshift. The most interesting aspects are the familiar SQL Server query language and toolset (including stored procedure support). They have also decoupled storage and compute, so you can have 1 GB of storage but 10 compute nodes for an intensive query, and vice versa. They claim that compute nodes start in a few seconds and that you don't have to take the cluster offline when you resize it. The cloud data warehouse battle is getting hot :)
We need to populate a database that sits on AWS { EC2 (Cluster Compute Eight Extra Large) + 1 TB EBS }. Given that we have close to 700 GB of data locally, how can I find out the (theoretical) time it would take to upload all of the data? I could not find any information on data upload/download speeds for EC2.
Since this will depend strongly on the networking between your site and Amazon's data centre...
Test it with a few GB and extrapolate (a rough back-of-the-envelope calculation is sketched below).
Be aware of AWS Import/Export and consider the option of simply couriering Amazon a portable hard drive. (Old saying: "Never underestimate the bandwidth of a station wagon full of tapes.") In fact, I note the page includes a "When to use..." section which gives some indication of transfer times vs. connection bandwidth.
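As an illustration of the extrapolation, a back-of-the-envelope calculation; the 50 Mbit/s sustained throughput is purely an example figure, so plug in whatever your small test measures:

```python
# Back-of-the-envelope upload time: 700 GB over an example sustained throughput.
# The 50 Mbit/s figure is purely illustrative; use what your own test measures.
data_gb = 700
sustained_mbit_per_s = 50

data_bits = data_gb * 8 * 1000**3              # decimal GB -> bits
seconds = data_bits / (sustained_mbit_per_s * 1000**2)
print(f"~{seconds / 3600:.1f} hours")          # ~31.1 hours at 50 Mbit/s
```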