I have 7+ million records stored in a CSV file hosted in an AWS S3 bucket, and I want to load them into a DynamoDB table. I've tried the AWS Data Pipeline service, but the job always failed because the service doesn't support importing the CSV format.
So I should first convert the CSV data into a format that DynamoDB can understand. Is there any way to make this conversion?
The AWS Data Pipeline service supports CSV import into DynamoDB. You can create a pipeline from the AWS console for Data Pipeline and choose "Import DynamoDB backup data from S3" to import CSV stored in S3 into DynamoDB.
See also
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBPipeline.html#DataPipelineExportImport.Importing
Related
Is there a way to export data from a SQL Server query to an AWS S3 bucket as a CSV file?
I created the bucket
arn:aws:s3:::s3tintegration
https://s3tintegration.s3.sa-east-1.amazonaws.com/key_prefix/
Can anybody help me?
If you are looking for an automated solution, there are several options in AWS.
Schedule or trigger a Lambda function that connects to RDS, executes the query, and saves the result as a CSV file in the S3 bucket. Remember that the Lambda function has to be in the same VPC and subnet as your SQL Server (a sketch follows below).
If the query takes a long time to run, you can use AWS Glue to run the task and write the output to S3 in CSV format. Glue can use a JDBC connection as well.
You can also use DMS, with SQL Server as the source and S3 (in CSV format) as the target. Keep in mind that DMS can migrate a full table or part of one, but not the result of an arbitrary query.
If you are familiar with big data tooling, you can also use Hive to run your query and write the output to S3 in CSV format.
The quickest and easiest way to start is Lambda.
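To make the Lambda option concrete, here is a minimal sketch, assuming pyodbc is packaged as a Lambda layer, the function runs in the same VPC/subnet as the SQL Server, and the query, object key, and connection environment variables are placeholders you would replace:

# Minimal sketch of the Lambda option above (not production code).
# Assumptions: pyodbc available via a Lambda layer, SQL Server reachable from
# the function's VPC, and all names below are placeholders.
import csv
import io
import os

import boto3
import pyodbc

s3 = boto3.client("s3")

BUCKET = "s3tintegration"                                   # bucket from the question
KEY = "key_prefix/query_export.csv"                         # placeholder object key
QUERY = "SELECT id, name, created_at FROM dbo.some_table"   # placeholder query


def handler(event, context):
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={os.environ['DB_HOST']};"
        f"DATABASE={os.environ['DB_NAME']};"
        f"UID={os.environ['DB_USER']};"
        f"PWD={os.environ['DB_PASSWORD']}"
    )
    cursor = conn.cursor()
    cursor.execute(QUERY)

    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])  # header row
    row_count = 0
    for row in cursor:
        writer.writerow(list(row))
        row_count += 1

    s3.put_object(Bucket=BUCKET, Key=KEY, Body=out.getvalue().encode("utf-8"))
    return {"rows_exported": row_count}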
I have a Data Pipeline that exports my DynamoDB table to an S3 bucket so I can use the S3 file for services like QuickSight, Athena and Forecast.
However, for my S3 file to work with these services, I need the file to be formatted as a CSV, like so:
date, journal, id
1589529457410, PLoS Genetics, 10.1371/journal.pgen.0030110
1589529457410, PLoS Genetics, 10.1371/journal.pgen.1000047
But instead, my exported file looks like this:
{"date":{"s":"1589529457410"},"journal":{"s":"PLoS Genetics"},"id":{"s":"10.1371/journal.pgen.0030110"}}
{"date":{"s":"1589833552714"},"journal":{"s":"PLoS Genetics"},"id":{"s":"10.1371/journal.pgen.1000047"}}
How can I specify the format for my exported file in S3 so I can operate with services like QuickSight, Athena and Forecast? I'd preferably do the data transformation using Data Pipeline as well.
Athena can read JSON data.
You can also use DynamoDB Streams to stream the data to S3. There are blog posts covering best practices and design patterns for streaming data from DynamoDB to S3 for use with Athena.
You can use DynamoDB streams to trigger an AWS Lambda function, which can transform the data and store it in Amazon S3, Amazon Redshift etc. With AWS Lambda you could also trigger Amazon Forecast to retrain, or pass the data to Amazon Forecast for a prediction.
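For the streams route, a minimal sketch of such a Lambda handler, assuming the attributes from the example above (date, journal, id, all strings) and a placeholder destination bucket:

# Minimal sketch of a Lambda handler on the table's DynamoDB stream (not production code).
# Assumes the attributes shown in the question (date, journal, id as strings);
# the destination bucket and key prefix are placeholders.
import csv
import io

import boto3

s3 = boto3.client("s3")
DEST_BUCKET = "my-csv-bucket"  # placeholder


def handler(event, context):
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["date", "journal", "id"])
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        image = record["dynamodb"]["NewImage"]  # stream images use DynamoDB JSON, e.g. {"S": "..."}
        writer.writerow([image["date"]["S"], image["journal"]["S"], image["id"]["S"]])
    # one small object per batch; a real pipeline would aggregate these further
    key = f"stream-batches/{context.aws_request_id}.csv"
    s3.put_object(Bucket=DEST_BUCKET, Key=key, Body=out.getvalue().encode("utf-8"))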
Alternatively, you could use AWS Data Pipeline to write the data to an S3 bucket as you currently do, and then use either a scheduled CloudWatch Events rule or an S3 event notification to run a Lambda function. That Lambda function can transform the file and store it in another S3 bucket for further processing.
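For that Data Pipeline route, a minimal sketch of the transform Lambda, assuming it is triggered by an S3 event notification on the export bucket, that each exported line is the DynamoDB-JSON shown above, and that the destination bucket is a placeholder:

# Minimal sketch of the transform Lambda (not production code).
# Assumes an S3 event notification on the export bucket, one DynamoDB-JSON
# object per line as shown in the question, and a placeholder destination bucket.
import csv
import io
import json
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")
DEST_BUCKET = "my-quicksight-ready-bucket"  # placeholder


def handler(event, context):
    for rec in event["Records"]:
        bucket = rec["s3"]["bucket"]["name"]
        key = unquote_plus(rec["s3"]["object"]["key"])  # S3 event keys are URL-encoded
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        out = io.StringIO()
        writer = csv.writer(out)
        writer.writerow(["date", "journal", "id"])
        for line in body.splitlines():
            if not line.strip():
                continue
            item = json.loads(line)  # e.g. {"date":{"s":"..."},"journal":{"s":"..."},"id":{"s":"..."}}
            writer.writerow([item["date"]["s"], item["journal"]["s"], item["id"]["s"]])

        s3.put_object(Bucket=DEST_BUCKET, Key=key + ".csv", Body=out.getvalue().encode("utf-8"))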
I have 3 types of CSV files in my S3 bucket and want to load them into their respective Redshift tables based on the CSV prefix. I am thinking of using Kinesis to stream the data to Redshift, since a file will be dropped into S3 every 5 minutes. I am completely new to AWS and not sure how to achieve this.
I have gone through the AWS documentation but am still not sure how to do it.
I found that we can use the spectrify Python module to convert to Parquet format, but I want to know which command will unload a table to an S3 location in Parquet format.
One more thing I found is that we can load Parquet-formatted data from S3 into Redshift using the COPY command: https://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html#r_COPY_command_examples-load-listing-from-parquet
Can we do the same to unload from Redshift to S3?
There is no need to use AWS Glue or a third-party Python module to unload Redshift data to S3 in Parquet format. The feature is now supported natively:
UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
FORMAT PARQUET
Documentation can be found at UNLOAD - Amazon Redshift
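For reference, a minimal sketch of issuing that UNLOAD through the Redshift Data API with boto3; the cluster, database, user, IAM role ARN, and bucket below are placeholders:

# Minimal sketch: running the UNLOAD above via the Redshift Data API (not production code).
# Cluster, database, user, IAM role ARN, and bucket path are placeholders.
import boto3

client = boto3.client("redshift-data")

sql = (
    "UNLOAD ('SELECT * FROM my_schema.my_table') "
    "TO 's3://my-bucket/parquet/my_table_' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftUnloadRole' "
    "FORMAT PARQUET"
)

response = client.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=sql,
)
# poll describe_statement(Id=response["Id"]) until the statement finishes
print(response["Id"])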
Have you considered AWS Glue? You can create a Glue Catalog based on your Redshift sources and then convert to Parquet. There is an AWS blog post for reference; although it talks about converting CSV to Parquet, you get the idea.
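A rough sketch of what such a Glue (PySpark) job can look like, assuming the Redshift table has already been crawled into the Glue Catalog; the catalog database, table name, temp directory, and output path are placeholders:

# Rough sketch of a Glue PySpark job that reads a cataloged Redshift table and
# writes Parquet to S3 (not production code). Catalog database/table names,
# the temp directory, and the output path are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Redshift sources read through the catalog need a temp dir for the intermediate unload
frame = glue_context.create_dynamic_frame.from_catalog(
    database="my_catalog_db",
    table_name="my_redshift_table",
    redshift_tmp_dir=args["TempDir"],
)

glue_context.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/parquet-output/"},
    format="parquet",
)

job.commit()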
The AWS docs for importing data from S3 into a DynamoDB table using Data Pipeline (https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb-part1.html) reference an S3 file (s3://elasticmapreduce/samples/Store/ProductCatalog) which is in this format:
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb-pipelinejson-verifydata2.html
Question is... how do I get a CSV of, say, 4 million rows into this format in the first place? Is there a utility for that?
Thanks for any suggestions... I've had a good google and haven't turned up anything.
steveprk84 already linked to this in his response, but I wanted to call it out: https://github.com/awslabs/data-pipeline-samples/tree/master/samples/DynamoDBImportCSV
Hive on EMR supports DynamoDB as an external table type. This sample uses a HiveActivity to create external Hive tables pointing to the target Dynamo table and the source CSV, and then it executes a Hive query to copy the data from one to the other.