I've tried default setup of Data Pipeline for cross-regional table copy. Copying one table to another in same region (eu-west-1).
On pipeline activation, EMR cluster is launched, runs for approx 20 minutes and then it's terminated with pipeline being in "success" state.
Problem is that only 389 entries are copied from my table :/ (number is the same on multiple runs). Total number of entries is close to 100000.
I've tried turning on logs (no errors there), walking through them, launching 4.1.0 cluster, increasing throughputs, etc, nothing solves the case.
Does cross-regional table copy work? What could be the problem? Why no error? How do I debug it?
Datapipeline config: https://gist.github.com/mariusgrigaitis/adceb18354b52d845278
Related
I have a DynamoDB table in production with about 1.5 billion objects. I am writing an EMR script to back up the table to S3. I would prefer it to be done as quickly as possible. I have a script that provisions an EMR cluster with 4 m4.2xlarge nodes and runs the following hive queries:
SET dynamodb.throughput.read.percent = 1.5;
SET dynamodb.throughput.write.percent = 1.5;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec = org.apache.hadoop.io.compress.GzipCodec;
CREATE DATABASE IF NOT EXISTS my_db;
USE my_db;
CREATE EXTERNAL TABLE IF NOT EXISTS ddb_table (composite_key string) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES ("dynamodb.table.name" = "my_ddb_table", "dynamodb.column.mapping" = "composite_key:composite_key");
INSERT OVERWRITE DIRECTORY 's3://s3-backups/ddb/' SELECT composite_key FROM ddb_table;
When I run the script with on demand RCU's the job spawns 62 tasks. When I run the script with provisioned RCU's, I get only one task (about 800 RCU's consumed per minute). Neither the amount of provisioned RCU's (I tested with 40,000 RCUs) nor autoscaling seems to change the number of tasks. Only on-demand RCU's seems to create additional tasks.
Other than AWS wanting me to pay more money, is there a reason for this behavior or a workaround? Provisioning more capacity and having less utilization of that provisioned capacity seems counter-intuitive.
For now I start my job with on demand RCU's and then switch to provisioned after the job starts...but that is not very fun.
Any ideas?
The new EMR versions seem to require dynamodb.throughput.write/read parameters that you need to manually specify(which wasn't the case in older versions). E.g.
SET dynamodb.throughput.write=40000 // depending on your RCU
SET dynamodb.throughput.write.percent=0.9
SET dynamodb.throughput.read=40000 // depending on your WCU
SET dynamodb.throughput.read.percent=0.9
I want to import a large csv file (around 1gb with 2.5m rows and 50 columns) into a DynamoDb, so have been following this blog from AWS.
However it seems I'm up against a timeout issue. I've got to ~600,000 rows ingested, and it falls over.
I think from reading the CloudWatch log that the timeout is occurring due to the boto3 read on the CSV file (it opens the entire file first, iterates through and batches up for writing)... I tried to reduce the file size (3 columns, 10,000 rows as a test), and I got a timeout after 2500 rows.
Any thoughts here?!
TIA :)
I really appreciate the suggestions (Chris & Jarmod). After trying and failing to break things programmatically into smaller chunks, I decided to look at the approach in general.
Through research I understood there were 4 options:
Lambda Function - as per the above this fails with a timeout.
AWS Pipeline - Doesn't have a template for importing CSV to DynamoDB
Manual Entry - of 2.5m items? no thanks! :)
Use an EC2 instance to load the data to RDS and use DMS to migrate to DynamoDB
The last option actually worked well. Here's what I did:
Create an RDS database (I used the db.t2.micro tier as it was free) and created a blank table.
Create an EC2 instance (free Linux tier) and:
On the EC2 instance: use SCP to upload the CSV file to the ec2 instance
On the EC2 instance: Firstly Sudo yum install MySQL to get the tools needed, then use mysqlimport with the --local option to import the CSV file to the rds MySQL database, which took literally seconds to complete.
At this point I also did some data cleansing to remove some white spaces and some character returns that had crept into the file, just using standard SQL queries.
Using DMS I created a replication instance, endpoints for the source (rds) and target (dynamodb) databases, and finally created a task to import.
The import took around 4hr 30m
After the import, I removed the EC2, RDS, and DMS objects (and associated IAM roles) to avoid any potential costs.
Fortunately, I had a flat structure to do this against, and it was only one table. I needed the cheap speed of the dynamodb, otherwise, I'd have stuck to the RDS (I almost did halfway through the process!!!)
Thanks for reading, and best of luck if you have the same issue in the future.
I have around 500GB compressed data in amazon s3. I wanted to load this data to Amazon Redshift. For that, I have created an internal table in AWS Athena and I am trying to load data in the internal table of Amazon Redshift.
Loading of this big data into Amazon Redshift is taking more than an hour. The problem is when I fired a query to load data it gets aborted after 1hour. I tried it 2-3 times but it's getting aborted after 1 hour. I am using Aginity Tool to fire the query. Also, in Aginity tool it is showing that query is currently running and the loader is spinning.
More Details:
Redshift cluster has 12 nodes with 2TB space for each node and I used 1.7 TB space.
S3 files are not the same size. One of them is 250GB. Some of them in MB.
I am using the command
create table table_name as select * from athena_schema.table_name
it stops exactly after 1hr.
Note: I have set the current query timeout in Aginity to 90000 sec.
I know this is an old thread, but for anyone coming here because of the same issue, I've realised that, at least for my case, the problem was the Aginity client; so, it's not related with Redshift or its Workload Manager, but only with such third party client called Aginity. In summary, use a different client like SQL Workbench and run the COPY command from there.
Hope this helps!
Carlos C.
More information, about my environment:
Redshift:
Cluster TypeThe cluster's type: Multi Node
Cluster: ds2.xlarge
NodesThe cluster's type: 4
Cluster Version: 1.0.4852
Client Environment:
Aginity Workbench for Redshift
Version 4.9.1.2686 (build 05/11/17)
Microsoft Windows NT 6.2.9200.0 (64-bit)
Network:
Connected to OpenVPN, via SSH Port tunneling.
The connection is not being dropped. This issue is only affecting the COPY command. The connection remains active.
Command:
copy tbl_XXXXXXX
from 's3://***************'
iam_role 'arn:aws:iam::***************:role/***************';
S3 Structure:
120 files of 6.2 GB each. 20 files of 874MB.
Output:
ERROR: 57014: Query (22381) cancelled on user's request
Statistics:
Start: ***************
End: ***************
Duration: 3,600.2420863
I'm not sure if following answer will solve your exact problem of timeout at exactly 1 Hr.
But, based on my experience, in case of Redshift loading data via Copy command is best and fast way. SO I feel that timeout issue shouldn't happen at all in your case.
The copy command in RedShift could load data from S3 or via SSH.
e.g.
Simple copy
copy sales from 'emr://j-SAMPLE2B500FC/myoutput/part-*' iam_role
'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '\t' lzop;
e.g. Using Menifest
copy customer
from 's3://mybucket/cust.manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;
PS: Even if you do it using Menifest and divide your data into Multiple files, it will be more faster as RedShift loads data in parallel.
My understanding was that spark will choose the 'default' number of partitions, solely based on the size of the file or if its a union of many parquet files, the number of parts.
However, in reading in a set of large parquet files, I see the that default # of partitions for an EMR cluster with a single d2.2xlarge is ~1200. However, in a cluster of 2 r3.8xlarge I'm getting default partitions of ~4700.
What metrics does Spark use to determine the default partitions?
EMR 5.5.0
spark.default.parallelism - Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
2X number of CPU cores available to YARN containers.
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#spark-defaults
Looks like it matches non EMR/AWS Spark as well
I think there was some transient issue because I restarted that EMR cluster with d2.2xlarge and it gave me the number of partitions I expected, which matched the r3.8xlarge, which was the number of files on s3.
If anyone knows why this kind of things happens though, I'll gladly mark yours as the answer.
I am trying to setup Replica outside of AWS and master is running at AWS RDS. And I do not want any downtime at my master. So I setup my slave node and now I want to backup my current database which is at AWS.
mysqldump -h RDS ENDPOINT -u root -p --skip-lock-tables --single-transaction --flush-logs --hex-blob --master-data=2 --all-databases > /root/dump.sql
I tested it on my VM and it worked fine but when tying it with RDS it gives me error
mysqldump: Couldn't execute 'FLUSH TABLES WITH READ LOCK': Access denied for user 'root'#'%' (using password: YES) (1045)
Is it because i do not have super user privilege or how to I fix this problem? Please someone suggest me.
RDS does not allow even the master user the SUPER privilege, and this is required in order to execute FLUSH TABLES WITH READ LOCK. (This is an unfortunate limitation of RDS).
The failing statement is being generated by the --master-data option, which is, of course, necessary if you want to be able to learn the precise binlog coordinates where the backup begins. FLUSH TABLES WITH READ LOCK acquires a global read lock on all tables, which allows mysqldump to START TRANSACTION WITH CONSISTENT SNAPSHOT (as it does with --single-transaction) and then SHOW MASTER STATUS to obtain the binary log coordinates, after which it releases the global read lock because it has a transaction that will keep the visible data in a state consistent with that log position.
RDS breaks this mechanism by denying the SUPER privilege and providing no obvious workaround.
There are some hacky options available to properly work around this, none of which may be particularly attractive:
do the backup during a period of low traffic. If the binlog coordinates have not changed between the time you start the backup and after the backup has begin writing data to the output file or destination server (assuming you used --single-transaction) then this will work because you know the coordinates didn't change while the process was running.
observe the binlog position on the master right before starting the backup, and use these coordinates with CHANGE MASTER TO. If your master's binlog_format is set to ROW then this should work, though you will likely have to skip past a few initial errors, but should not have to subsequently have any errors. This works because row-based replication is very deterministic and will stop if it tries to insert something that's already there or delete something that's already gone. Once past the errors, you will be at the true binlog coordinates where the consistent snapshot actually started.
as in the previous item, but, after restoring the backup try to determine the correct position by using mysqlbinlog --base64-output=decode-rows --verbose to read the master's binlog at the coordinates you obtained, checking your new slave to see which of the events must have already been executed before the snapshot actually started, and using the coordinates determined this way to CHANGE MASTER TO.
use an external process to obtain a read lock on each and every table on the server, which will stop all writes; observe that the binlog position from SHOW MASTER STATUS has stopped incrementing, start the backup, and release those locks.
If you use any of these approaches other than perhaps the last one, it's especially critical that you do table comparisons to be certain your slave is identical to the master once it is running. If you hit subsequent replication errors... then it wasn't.
Probably the safest option -- but also maybe the most annoying since it seems like it should not be necessary -- is to begin by creating an RDS read replica of your RDS master. Once it is up and synchronized to the master, you can stop replication on the RDS read replica by executing an RDS-provided stored procedure, CALL mysql.rds_stop_replication which was introduced in RDS 5.6.13 and 5.5.33 which doesn't require the SUPER privilege.
With the RDS replica slave stopped, take your mysqldump from the RDS read replica, which will now have an unchanging data set on it as of a specific set of master coordinates. Restore this backup to your off-site slave and then use the RDS read replica's master coordinates from SHOW SLAVE STATUS Exec_Master_Log_Pos and Relay_Master_Log_File as your CHANGE MASTER TO coordinates.
The value shown in Exec_Master_Log_Pos on a slave is the start of the next transaction or event to be processed, and that's exactly where your new slave needs to start reading on the master.
Then you can decommission the RDS read replica once your external slave is up and running.
Thanks Michael, I think the most correct solution and the recommended by AWS is do the replication using a read replica as a source as explained here.
Having a RDS master, RDS read replica and an instance with MySQL ready, the steps to get an external slave are:
On master, increase binlog retention period.
mysql> CALL mysql.rds_set_configuration('binlog retention hours', 12);
On read replica stop replication to avoid changes during the backup.
mysql> CALL mysql.rds_stop_replication;
On read replica annotate the binlog status (Master_Log_File and Read_Master_Log_Pos)
mysql> SHOW SLAVE STATUS;
On server instance do a backup and import it (Using mydumper as suggested by Max can speed up the process).
mysqldump -h RDS_READ_REPLICA_IP -u root -p YOUR_DATABASE > backup.sql
mysql -u root -p YOUR_DATABASE < backup.sql
On server instance set it as slave of RDS master.
mysql> CHANGE MASTER TO MASTER_HOST='RDS_MASTER_IP',MASTER_USER='myrepladmin', MASTER_PASSWORD='pass', MASTER_LOG_FILE='mysql-bin-changelog.313534', MASTER_LOG_POS=1097;
Relace MASTER_LOG_FILE and MASTER_LOG_POS to the values of Master_Log_File Read_Master_Log_Pos you saved before, also you need an user in RDS master to be used by slave replication.
mysql> START SLAVE;
On server instance check if replication was success.
mysql> SHOW SLAVE STATUS;
On RDS read replica resume replication.
mysql> CALL mysql.rds_start_replication;
for RDS binlog position you can use mydumper with --lock-all-tables, it will use LOCK TABLES ... READ just to get the binlog coordinates and then realease it instead of FTWRL.
Michael's answer is extremely helpful and focuses on the main sticking point: you simply cannot GRANT the required SUPER privilege on RDS, and therefore you can't use the --master-data flag that would make things so much easier.
I read that it may be possible to work around this by creating or modifying a Database Parameter Group via the API, but I think using the RDS procedures is a better option.
The multi-tiered replication approach works well, though, and can include tiers outside RDS/VPC so it's possible to replicate from "Classic" EC2 to VPC using this method.
A lot of the necessary functionality is only in later releases of MySQL 5.5 and 5.6, and I strongly recommend you run the same version on all the DBs involved in the replication stack, so you may have to do an upgrade of your old DB before all of this, which means yet more tedium and replication and so on.
I had faced a similar problem a quick workaround to this is:
Create a EBS Volume to have an extra space or extend current EBS volume on EC2. (or if you have an extra space you can use that).
Use mysqldump command without --master-data or --flush-data directive to generate a complete (FULL) backup of db.
mysqldump -h hostname --routines -uadmin -p12344 test_db > filename.sql
admin is DB name and 12344 is the password
Above is for taking backup of one single DB, if required to take all DBs then specify --all-databases and also mention DB Names.
Create a Cron of this command to run once a day that will automatically generate the dump.
Please note that this will incur an extra cost if your DB Size is huge. as it creates a complete DB dump.
hope this helps
You need to
1- create a read replica on aws
2- make sure that this instance is catching up with the master
3- stop the replication and get the log_file and log_position parameters by
show slave status \G
4- dump the database and use the parameters logged in step 3 to start the replication on your own server.
5- start the slave.
Detailed instructions here
Either things have changed, since #Michael - sqlbot 's response, or there is a misunderstanding going on here (could be on my part),
You can use COPY to import a csv file into rds, at least on the postgres version, you just need to use FROM STDIN instead of directly naming the file,
which means you end up piping things like:
cat data.csv | psql postgresql://server:5432/mydb -U user -c "COPY \"mytable\" FROM STDIN DELIMITER ',' "