Extracting multiple RDS MySQL tables to S3

Rather new to AWS Data Pipeline, so any help will be appreciated. I have used the pipeline template RDStoS3CopyActivity to extract all contents of a table in RDS MySQL. It seems to be working well. But there are 90 other tables to extract and dump to S3. I cannot imagine creating 90 pipelines, one for each table.
What is the best approach to this task? How could the pipeline be instructed to iterate through a list of table names?

I am not sure if this will ever get a response. However, at this early stage of exploration, I have developed a pipeline that fits a preliminary purpose: extracting from 10 RDS MySQL tables and copying each to its respective sub-folder (prefix) on S3.
The logic is rather simple.
- Configure the connection for the RDS MySQL instance.
- Extract the data by filling in the "Select Query" field for each table.
- Drop in a Copy Activity and link it up for each table above. It runs on a specified EC2 instance; if you're running an expensive query, make sure you choose an EC2 instance type with enough CPU and memory. This step copies the extracted dump, which lives temporarily in the EC2 instance's tmp filesystem, to the designated S3 bucket you will set up next.
- Finally, configure the designated target S3 destination.
By default, data extracted and loaded to the S3 bucket will be comma separated. If you need it tab delimited, then in the last target S3 destination (see the sketch after this list):
- Add an optional field... > select Data Format.
- Create a new data format. It will appear under the 'Others' category.
- Give it a name. I call it Tab Separated.
- Type: TSV. Hover the mouse over 'Type' to learn about the other data formats.
- Column separator: \t (I could have left this blank, since the type was already specified as TSV).
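For reference, the same data format can also be declared directly in the pipeline definition rather than through the console. Here is a minimal sketch as a Python dict of pipeline-object fields, assuming the field names shown above; the object id is made up:

# Hypothetical pipeline-definition object for the Tab Separated format.
tab_separated_format = {
    "id": "TabSeparatedFormat",   # made-up object id
    "name": "Tab Separated",
    "type": "TSV",
    "columnSeparator": "\t",      # redundant when type is TSV, as noted above
}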

If the tables are all in the same RDS instance, why not use a SqlActivity pipeline with a SQL statement containing multiple unload commands to S3?
You can just write one query and use one pipeline.
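If you would rather script the iteration yourself outside Data Pipeline, the loop over table names is straightforward. A minimal sketch using pymysql and boto3, where the host, credentials, bucket, and table names are all placeholders (for 90 large tables you would stream results rather than use fetchall()):

import csv
import io

import boto3
import pymysql

TABLES = ["orders", "customers", "invoices"]  # hypothetical table names
BUCKET = "my-extract-bucket"                  # hypothetical bucket

conn = pymysql.connect(host="mydb.example.rds.amazonaws.com",
                       user="admin", password="...", database="mydb")
s3 = boto3.client("s3")

for table in TABLES:
    with conn.cursor() as cur:
        cur.execute(f"SELECT * FROM `{table}`")
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(col[0] for col in cur.description)  # header row
        writer.writerows(cur.fetchall())
    # One object per table, e.g. s3://my-extract-bucket/orders/orders.csv
    s3.put_object(Bucket=BUCKET, Key=f"{table}/{table}.csv",
                  Body=buf.getvalue().encode("utf-8"))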

Related

AWS Glue: Add an Attribute to CSV to Distinguish Between Data Sets

I need to pull two companies' data from their respective AWS S3 buckets, map their columns in Glue, and export them to a specific schema in a Microsoft SQL database. The schema is to have one table, with the companies' data being distinguished with attributes for each of their sites (each company has multiple sites).
I am completely new to AWS and SQL; would someone mind explaining how to add an attribute to the data, or pointing me to some good literature on this? I feel like manipulating the .csv in the Python script I'm already running (which automatically downloads the data from another site and then uploads it to S3) could be an option: deleting NaN columns and adding a column for the site name. But I'm not entirely sure.
I apologize if this has already been answered elsewhere. Thanks!
I find this website generally pretty helpful for figuring out SQL. I've linked to the ALTER TABLE commands that would allow you to do this through SQL.
If you are running a Python script to edit the .csv to start with, then I would edit the data there, personally. Depending on the size of the data sets, you can run your script as a Lambda or Batch job to grab, edit, and then upload to S3. Then you can run your Glue crawler or whatever process you're using to map the columns.
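A hedged sketch of that suggestion using pandas, where the file names, site label, and bucket are made-up examples:

import boto3
import pandas as pd

df = pd.read_csv("company_a_site_1.csv")    # hypothetical input file
df = df.dropna(axis="columns", how="all")   # drop all-NaN columns
df["site"] = "company_a_site_1"             # distinguishing attribute

df.to_csv("company_a_site_1_tagged.csv", index=False)
boto3.client("s3").upload_file("company_a_site_1_tagged.csv",
                               "my-data-bucket",
                               "company_a/site_1/data.csv")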

Row level changes captured via AWS DMS

I am trying to migrate a database using AWS DMS. The source is Azure SQL Server and the destination is Redshift. Is there any way to know which rows were updated or inserted? We don't have any audit columns in the source database.
Redshift doesn't track changes, and you would need audit columns to do this at the user level. You may be able to deduce changes from Redshift query history plus saved data input files, but that will be solution-dependent.
Query history can be retrieved in a couple of ways, though both require some action. The first is to review the query logs, but these are only saved for a few days; if you need to look back further than that, you need a process that saves these tables so the information isn't lost. The other is to turn on Redshift logging to S3, but this must be enabled before you run the queries on Redshift.
There may be some logging from DMS that could be helpful, but the bottom line is that row-level change tracking is not something that is on in Redshift by default.
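To illustrate the first option, here is a sketch that pulls recent INSERT/UPDATE statements from Redshift's stl_query system table via psycopg2; connection details are placeholders, and remember this table is only retained for a few days:

import psycopg2

conn = psycopg2.connect(host="mycluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev",
                        user="admin", password="...")
with conn.cursor() as cur:
    # stl_query keeps the text and start time of recent statements.
    cur.execute("""
        SELECT query, starttime, TRIM(querytxt)
        FROM stl_query
        WHERE querytxt ILIKE 'insert%' OR querytxt ILIKE 'update%'
        ORDER BY starttime DESC
        LIMIT 50;
    """)
    for query_id, started, text in cur.fetchall():
        print(query_id, started, text[:80])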

Having trouble setting up multiple tables in AWS glue from a single bucket

So, I've used Glue before, but it's been with a single file <> single folder relationship.
What I'm trying to do now is to have a structure like this create individual tables for each folder:
- Data Bucket
  - Table 1 Folder
    - file1.csv
    - file2.csv
  - Table 2 Folder
    - file1.csv
    - file2.csv
...and so on.
But every time I create the crawler and set the Data Bucket as the data source, I only get a single table created. I've tried every combo of the "create single schema ...etc" I can think of.
I'm hoping that I don't have to add each sub-folder as a separate data source as my ultimate goal is to translate it eventually into an RDS instance. Hoping to keep the high-level bucket as the single data source if possible. I can easily tweak folder/file structure if needed.
And yes, I'm aware of partitioning, but isn't that only applicable to individual tables?
Thanks!
I ran into the same issue, and digging into the Glue docs I found that setting the table level in the crawler's output configuration does the trick.
The table level is counted from the bucket level; in your case, I believe setting the table level to 2 (the first folder after the root) would do the trick. A level of 2 means that the table definitions start at that point.
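For reference, here is a sketch of setting that option through the API instead of the console, using the crawler Configuration JSON's Grouping key; the crawler name, role, database, and path are placeholders:

import json

import boto3

glue = boto3.client("glue")
glue.create_crawler(
    Name="data-bucket-crawler",                      # hypothetical name
    Role="arn:aws:iam::123456789012:role/GlueRole",  # hypothetical role
    DatabaseName="data_bucket_db",
    Targets={"S3Targets": [{"Path": "s3://data-bucket/"}]},
    # Table level 2 = one table per first-level folder under the root.
    Configuration=json.dumps({
        "Version": 1.0,
        "Grouping": {"TableLevelConfiguration": 2},
    }),
)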
I've been trying to accomplish the same thing. I was hoping that Glue would magically see the different folders and automatically create separate tables, but Glue seems to want to create a single table, especially when the schemas overlap. In my example I'm using US census data, so there are some common fields, especially at the beginning of each file.
In the end, I was able to get this to work by creating multiple data stores in the Glue Crawler. By doing this, it would create the five separate tables I wanted, but I had to add each folder manually. Still hoping to find a way to get Glue to discover them automatically.
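Expressed against the hypothetical crawler sketched above, the "multiple data stores" workaround just means one S3 target per table folder:

targets = {
    "S3Targets": [
        {"Path": "s3://data-bucket/table-1-folder/"},
        {"Path": "s3://data-bucket/table-2-folder/"},
        # ...one entry per table folder, added manually
    ]
}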

AWS Data Pipeline - SQLActivity - update statement possible?

I need to build a data pipeline which takes input from a CSV file (stored on S3) and "updates" records in an Aurora RDS table. I understand the standard (out-of-the-box template) format for bulk record insertion, but for record updates or deletions, is there any standard way to put those statements in the SqlActivity?
I can write an update statement, but then the way CSV inputs are referenced, they are just positional question marks (?), with no way to refer to a specific column.
Can Data Pipeline be used this way? If so, is there a specific way to refer to CSV columns? Thanks in advance!
You will need to do some preprocessing of your CSV into a SQL script containing your bulk updates, and then invoke the SqlActivity with a reference to that script.
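A minimal sketch of that preprocessing step, assuming a made-up CSV layout of id,status and a made-up table name; real code should escape values properly:

import csv

with open("input.csv", newline="") as src, open("updates.sql", "w") as out:
    for row in csv.DictReader(src):
        # One UPDATE per CSV row; quotes in values are doubled for SQL.
        out.write(
            "UPDATE my_table SET status = '{status}' WHERE id = {id};\n"
            .format(status=row["status"].replace("'", "''"),
                    id=int(row["id"]))
        )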
If you only have inserts, you might be able to use CopyActivity (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html), which takes:
- an S3DataNode as the input, and
- a SqlDataNode as the output.
If performance is not a concern then this is the closest you can get to an out of the box transport using AWS Data Pipeline.
You can refer to the AWS Data Pipeline docs (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html) for more information.
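As a sketch, the three objects in that arrangement would look roughly like this in a pipeline definition (expressed here as Python dicts; the ids, paths, and the referenced RdsDatabase/Ec2Instance objects are made up, and the field names follow the CopyActivity, S3DataNode, and SqlDataNode docs):

copy_pipeline_objects = [
    {"id": "InputCsv", "type": "S3DataNode",
     "filePath": "s3://my-bucket/input/data.csv"},
    {"id": "TargetTable", "type": "SqlDataNode",
     "database": {"ref": "RdsDatabase"}, "table": "my_table",
     # Positional ? placeholders, as noted in the question above.
     "insertQuery": "INSERT INTO my_table VALUES (?, ?);"},
    {"id": "CsvToSql", "type": "CopyActivity",
     "input": {"ref": "InputCsv"}, "output": {"ref": "TargetTable"},
     "runsOn": {"ref": "Ec2Instance"}},
]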

Backup DynamoDB Table with dynamic columns to S3

I have read several other posts about this, in particular this question with an answer by greg about how to do it in Hive. I would like to know how to account for DynamoDB tables with varying numbers of columns, though.
That is, the original DynamoDB table has rows that were added dynamically with different columns. I have tried to view the exportDynamoDBToS3 script that Amazon uses in their Data Pipeline service, but it has code like the following, which does not seem to map the columns:
-- Map DynamoDB Table
CREATE EXTERNAL TABLE dynamodb_table (item map<string,string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "MyTable");
(As an aside, I have also tried using the Data Pipeline system but found it rather frustrating, as I could not figure out from the documentation how to perform simple tasks, like running a shell script, without everything failing.)
It turns out that the Hive script that I posted in the original question works just fine, but only if you are using the correct version of Hive. It seems that even with the install-hive command set to install the latest version, the version used actually depends on the AMI version.
After doing a fair bit of searching I managed to find the following in Amazon's docs (emphasis mine):
Create a Hive table that references data stored in Amazon DynamoDB. This is similar to the preceding example, except that you are not specifying a column mapping. The table must have exactly one column of type map<string, string>. If you then create an EXTERNAL table in Amazon S3 you can call the INSERT OVERWRITE command to write the data from Amazon DynamoDB to Amazon S3. You can use this to create an archive of your Amazon DynamoDB data in Amazon S3. Because there is no column mapping, you cannot query tables that are exported this way. *Exporting data without specifying a column mapping is available in Hive 0.8.1.5 or later, which is supported on Amazon EMR AMI 2.2.3 and later.*
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMR_Hive_Commands.html
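Putting the quoted docs together with the table definition above, the export step looks roughly like the following. The HiveQL is kept as a Python string for consistency with the other sketches here; the S3 path is a placeholder, and dynamodb_table is the DynamoDB-backed table defined in the script above:

EXPORT_HIVEQL = r"""
-- External table backed by S3: one map<string,string> column, no mapping.
CREATE EXTERNAL TABLE s3_archive (item map<string,string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION 's3://my-archive-bucket/MyTable/';

-- Archive the DynamoDB data to S3 (not queryable, per the docs above).
INSERT OVERWRITE TABLE s3_archive SELECT * FROM dynamodb_table;
"""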