I have a static set of data I want to get into AWS DynamoDB. I have downloaded the local version of DynamoDB (DynamoDB Local), tested the code that generates the data against it, and I now have the full dataset in the local database.
My question is: is there an efficient way to move the local database into the cloud? I know that I can transfer a CSV file to S3 and use a data pipeline from there. Is there a better way that avoids exporting the data and re-importing it?
The data is not that large, about 5 GB (so nothing that calls for AWS Snowball).
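For context, the kind of direct copy I have in mind would be something like the following untested sketch with boto3, where the table name, region, and local endpoint are placeholders and the cloud table is assumed to already exist:

import boto3

# DynamoDB Local on one side, the real service on the other.
local = boto3.resource("dynamodb", endpoint_url="http://localhost:8000",
                       region_name="us-east-1")
cloud = boto3.resource("dynamodb", region_name="us-east-1")

src = local.Table("my-table")   # table in DynamoDB Local
dst = cloud.Table("my-table")   # same table created in the cloud

# Scan the local table page by page and batch-write the items to the cloud.
with dst.batch_writer() as batch:
    kwargs = {}
    while True:
        page = src.scan(**kwargs)
        for item in page["Items"]:
            batch.put_item(Item=item)
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]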
Thanks!
There is a requirement to copy 10 TB of data from Azure Blob Storage to S3, and another 10 TB from Azure Synapse to Redshift.
What is the best way to achieve these two migrations?
For Redshift: you could export the data from Azure Synapse Analytics to blob storage in a compatible format, ideally compressed, and then copy it to S3. Importing data from S3 into Redshift is straightforward.
You will likely need a VM instance to read from Azure Storage and write to AWS S3 (where it runs doesn't matter much). The simplest option seems to be using the default CLIs (Azure and AWS) to read the content onto the migration instance and write it to the target bucket. Personally, though, I would write a small application that records checkpoints, so that if the migration is interrupted for any reason it does not have to start again from scratch.
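A minimal sketch of that checkpointing idea, assuming the azure-storage-blob and boto3 SDKs, with all names as placeholders (it also holds each object in memory, so very large blobs would need streaming or multipart uploads instead):

import os
import boto3
from azure.storage.blob import ContainerClient

AZURE_CONN_STR = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
SRC_CONTAINER = "source-container"   # placeholder
DST_BUCKET = "target-bucket"         # placeholder
CHECKPOINT_FILE = "copied.txt"       # blob names already copied

container = ContainerClient.from_connection_string(AZURE_CONN_STR, SRC_CONTAINER)
s3 = boto3.client("s3")

done = set()
if os.path.exists(CHECKPOINT_FILE):
    with open(CHECKPOINT_FILE) as f:
        done = {line.strip() for line in f}

with open(CHECKPOINT_FILE, "a") as ckpt:
    for blob in container.list_blobs():
        if blob.name in done:
            continue                                  # skip on restart
        data = container.download_blob(blob.name).readall()
        s3.put_object(Bucket=DST_BUCKET, Key=blob.name, Body=data)
        ckpt.write(blob.name + "\n")                  # checkpoint per object
        ckpt.flush()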
There are a few things you may tweak depending on the files to move: whether there are many small files or a few large ones, which regions you are moving between, and so on.
https://aws.amazon.com/premiumsupport/knowledge-center/s3-upload-large-files/
You may also consider S3 Transfer Acceleration; it may or may not help depending on the regions involved.
Please note that every major cloud provider charges for outbound data (egress); for 10 TB this can be a considerable cost.
I am trying to train a machine learning model on AWS EC2. I have over 50GB of data currently stored in an AWS S3 bucket. When training my model on EC2, I want to be able to access this data.
Essentially, I want to be able to call this command:
python3 train_model.py --train_files /data/train.csv --dev_files /data/dev.csv --test_files /data/test.csv
where /data/train.csv is my S3 bucket s3://data/. How can I do this? I currently only see ways to cp my S3 data into my EC2.
You can extend your code to read directly from S3 using boto3.
But if you want to access your S3 bucket as if it were another local filesystem, I would consider s3fs-fuse, explained further here.
Another option would be to use the AWS CLI to sync your data to a local folder.
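For the boto3 route, a minimal sketch (the bucket name, key, and the use of pandas are assumptions on my part) that reads a CSV straight from S3 into a dataframe without copying it to disk first:

import boto3
import pandas as pd

s3 = boto3.client("s3")
bucket = "my-training-data"          # placeholder bucket name

# Stream the object body directly into pandas instead of cp-ing it locally.
obj = s3.get_object(Bucket=bucket, Key="train.csv")
train_df = pd.read_csv(obj["Body"])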
How can I do this? I currently only see ways to cp my S3 data into my EC2.
S3 is an object storage system. It does not allow direct access to or reading of files the way a regular file system does.
Thus, to read your files, you need to download them first (downloading in parts is also possible) or have some third-party software such as s3fs-fuse do it for you. You can download the data to your instance, or store it on an external file system (e.g. EFS).
It's not clear from your question whether you have one 50 GB CSV file or multiple smaller ones. If you have one large 50 GB CSV file and you don't need all of it, you can reduce the amount of data read by using S3 Select:
With S3 Select, you can use a simple SQL expression to return only the data from the store you’re interested in, instead of retrieving the entire object. This means you’re dealing with an order of magnitude less data which improves the performance of your underlying applications.
Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format.
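A hedged boto3 sketch of that (the bucket, key, and "label" column are made up) which pulls only the matching rows out of a large CSV:

import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-training-data",                      # placeholder
    Key="train.csv",                                # placeholder
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object s WHERE s.\"label\" = '1'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; collect the record payloads.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))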
I am trying to set up a data lake and move all the data to S3.
I have to move Aurora MySQL data to S3 (most probably in Parquet format).
I tried an initial POC using AWS Database Migration Service (DMS), which can move all the data at once. The problem is that every time I run it, it copies the whole dataset again.
I want something like a near-real-time reflection of DB changes in S3.
Thanks in advance.
If you enable binary logs, you should be able to use Change Data Capture (CDC) and replicate ongoing changes, as explained in this blog post.
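As a rough boto3 sketch of that setup (all ARNs and identifiers below are placeholders, the source Aurora MySQL cluster needs binlog_format=ROW, and the S3 target endpoint can be configured to write Parquet):

import json
import boto3

dms = boto3.client("dms")

# Replicate every table in every schema.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-all",
        "object-locator": {"schema-name": "%", "table-name": "%"},
        "rule-action": "include",
    }]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="aurora-to-s3-cdc",              # placeholder
    SourceEndpointArn="arn:aws:dms:...:endpoint:SOURCE",       # placeholder
    TargetEndpointArn="arn:aws:dms:...:endpoint:S3-TARGET",    # placeholder
    ReplicationInstanceArn="arn:aws:dms:...:rep:INSTANCE",     # placeholder
    MigrationType="full-load-and-cdc",    # full load, then ongoing CDC
    TableMappings=json.dumps(table_mappings),
)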
We have a large amount of data stored on an ES cluster. I need to add one more field to the ES cluster and upload the data for this field from a Redshift table's column. I've never worked with such a data transfer, I'm new to AWS, and I'm not sure how to approach this task or what I should read up on to perform it. Do you know what the best approach would be?
Are you using Logstash to ingest the data? If yes, you can easily add the column in Logstash and restart Logstash from the beginning so that the additional column's data is ingested into the index. Let me know what your current setup is.
As I understand it, you want to dump data from the Elasticsearch cluster and load it into Redshift.
Here is an approach I would take:
Dump the data from Elasticsearch using https://github.com/taskrabbit/elasticsearch-dump
Copy the JSON file to S3 using the AWS CLI
Copy the JSON file from S3 to Redshift using https://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-copy-from-json.html
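For the last step, a hedged sketch with psycopg2 (the cluster endpoint, table, S3 path, and IAM role are placeholders):

import psycopg2

conn = psycopg2.connect(
    host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="dev", user="admin", password="...",    # placeholders
)
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY my_table
        FROM 's3://my-bucket/es-dump/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
        FORMAT AS JSON 'auto';
    """)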
Has anyone ever exported S3 data from AWS into a local database using EMR? I want to write a custom M/R solution that would extract certain data and load it in parallel into a local network database instance. I have not seen anything on the Amazon website that states whether that is possible or not, only a lot of mentions of moving data within AWS.
When you say a "local network database", are you referring to a database on an EC2 instance or your local network?
Either way is possible - if you are using a non-EC2 or non-AWS database, just make sure to open up your security groups / firewall to make the necessary network connections.
As for loading data from S3 into your local database:
You can crunch the data from S3 using EMR, convert it into CSV format in the mappers, and bulk-import that into your database (sketched below). This will likely be the fastest option, since a bulk import from CSV lets the database load the data very quickly.
You can use the EMR mappers to insert data directly into the database, but I don't recommend this approach. With multiple mappers writing to the database concurrently, you can easily overload it and cause stalls and failures.
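A minimal sketch of the first approach, as a Hadoop Streaming mapper (it assumes the S3 input is newline-delimited JSON; the field names are hypothetical):

#!/usr/bin/env python3
import csv
import json
import sys

# Read records from stdin (EMR feeds the S3 input through the mapper)
# and emit CSV rows for a later bulk import into the target database.
writer = csv.writer(sys.stdout)
for line in sys.stdin:
    try:
        record = json.loads(line)
    except ValueError:
        continue                      # skip malformed lines
    # Keep only the columns the target table needs (hypothetical fields).
    writer.writerow([record.get("id"), record.get("name"), record.get("created_at")])

The CSV part files EMR writes out can then be bulk-loaded with your database's native tool (e.g. LOAD DATA INFILE for MySQL).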