I am using AWS Data Pipeline to import some CSV data from S3 to Redshift. I also added a ShellCommandActivity to remove all S3 files after the copy activity completes. I attached a picture of the whole process.
Everything works fine, but each activity starts its own EC2 instance. Is it possible for the ShellCommandActivity to reuse the same EC2 instance as the RedshiftCopyActivity, after the copy command has completed?
Thank you!
Unless you can do all of the activities in shell or via the CLI, it is not possible to run everything on the same instance.
One suggestion I can give is to move on to newer technologies. AWS Data Pipeline is outdated (four years old). You should use AWS Lambda, which will cost you a fraction of what you are paying, and you can load the files into Redshift as soon as they are uploaded to S3. Clean-up is automatic, and Lambda is much more powerful than AWS Data Pipeline. The tutorial A Zero-Administration Amazon Redshift Database Loader is the one you want. Yes, there is some learning curve, but as the title suggests it is a zero-administration loader.
In order for the ShellCommandActivity to run on the same EC2 instance, I edited my ShellCommandActivity using Architect, and for the Runs On option I chose Ec2Instance. The ShellCommandActivity then gets mapped automatically to the same EC2Instance as the RedshiftCopyActivity. Now the whole process looks like the attached picture.
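In case it helps anyone who manages the pipeline programmatically rather than through Architect, below is a rough boto3 sketch of what the relevant part of the definition ends up looking like. The pipeline ID, object names, command, and timeout values are hypothetical; the only point is that both activities' runsOn fields reference the same Ec2Resource object, so Data Pipeline provisions a single instance for the whole run.

    import boto3

    dp = boto3.client('datapipeline')

    # Hypothetical IDs and names; both activities reference the same Ec2Resource via 'runsOn'.
    pipeline_objects = [
        {'id': 'MyEc2Instance', 'name': 'MyEc2Instance', 'fields': [
            {'key': 'type', 'stringValue': 'Ec2Resource'},
            {'key': 'terminateAfter', 'stringValue': '1 Hour'},
        ]},
        {'id': 'RedshiftCopy', 'name': 'RedshiftCopy', 'fields': [
            {'key': 'type', 'stringValue': 'RedshiftCopyActivity'},
            {'key': 'runsOn', 'refValue': 'MyEc2Instance'},
            # input, output and insertMode fields omitted for brevity
        ]},
        {'id': 'CleanupS3', 'name': 'CleanupS3', 'fields': [
            {'key': 'type', 'stringValue': 'ShellCommandActivity'},
            {'key': 'runsOn', 'refValue': 'MyEc2Instance'},    # same instance as the copy
            {'key': 'dependsOn', 'refValue': 'RedshiftCopy'},  # run only after the copy finishes
            {'key': 'command', 'stringValue': 'aws s3 rm s3://my-bucket/input/ --recursive'},
        ]},
    ]

    dp.put_pipeline_definition(pipelineId='df-EXAMPLE1234567', pipelineObjects=pipeline_objects)
    dp.activate_pipeline(pipelineId='df-EXAMPLE1234567')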
Thank you!
Can someone explain to me what's the best way to transfer data from a hard drive on an EC2 instance (running Windows Server 2012) to an S3 bucket for the same AWS account on a daily basis?
Background to this:
I'm generating a .csv file for one of our business partners daily at 11:00 am and I want to deliver it to S3 (he has access to our S3 bucket).
After that he can pull it out of S3 manually or automatically whenever he wants.
Hope you can help me. I have only found manual solutions with the CLI, but no automated way to do daily transfers.
Best Regards
You can mount S3 buckets directly as drives on your EC2 instances. That way you don't even need triggers or a daily task scheduler along with a third-party service, as objects written to the mounted drive are directly available in the S3 bucket.
For Linux you would typically use Filesystem in Userspace (FUSE). Take a look at this repo if you need it for Linux: https://github.com/s3fs-fuse/s3fs-fuse.
Regarding Windows, there is this tool:
https://tntdrive.com/mount-amazon-s3-bucket.aspx
If these tools don't suit you, or if you don't want to mount the S3 bucket directly, here is another option: whatever you can do with the CLI you should be able to do with the SDK. Therefore, if you can code in one of the languages AWS Lambda supports (C#, Java, Go, PowerShell, Python, Node.js, Ruby), you could automate the upload using a Lambda function along with a daily scheduled trigger firing at 11 a.m.
Hope this helps!
Create a small application that uploads your file to an S3 bucket (there are some examples here). Then use Task Scheduler to execute your application on a regular basis.
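For example, a minimal Python (boto3) version of such an application might look like the sketch below; the file path, bucket name, and key prefix are hypothetical and need to be adjusted to your setup.

    import boto3
    from datetime import date

    # Hypothetical values -- adjust to your environment.
    LOCAL_FILE = r'C:\exports\partner_report.csv'
    BUCKET = 'my-partner-bucket'
    KEY = 'daily/{}-partner_report.csv'.format(date.today().isoformat())

    def main():
        # Credentials come from the instance role or the local AWS configuration.
        s3 = boto3.client('s3')
        s3.upload_file(LOCAL_FILE, BUCKET, KEY)
        print('Uploaded {} to s3://{}/{}'.format(LOCAL_FILE, BUCKET, KEY))

    if __name__ == '__main__':
        main()

A Task Scheduler job that runs this script daily at 11:00 then takes care of the delivery.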
Working on a problem at the moment where I want to export a file on an EC2 instance running a Windows AMI to an S3 bucket at four-hour intervals. Currently, the architecture I'm considering is as follows.
1. CloudWatch Events rule using scheduled trigger
2. Rule triggers Lambda function to run
3. Lambda function would use some form of the AWS CLI on the Windows EC2 instance to extract (sync, cp, etc.) the file
4. File is placed in the S3 bucket
Does anyone see a path that's more efficient than this one? I want to ensure that I'm handling this in the most straightforward manner. Thanks in advance for any input!
It is quite difficult to have external code (e.g. an AWS Lambda function) cause something to execute on a Windows computer. You could use Systems Manager Run Command, but that's a rather complex solution.
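For completeness, if you did want to go the Run Command route, the Lambda function triggered by the CloudWatch Events rule could look roughly like the sketch below. The instance ID, file path, and bucket are hypothetical; the Lambda role needs ssm:SendCommand, and the instance needs the SSM agent plus an instance profile that allows writing to the bucket.

    import boto3

    # Hypothetical instance ID and paths.
    INSTANCE_ID = 'i-0123456789abcdef0'

    def handler(event, context):
        ssm = boto3.client('ssm')
        # Ask SSM to run a PowerShell command on the Windows instance,
        # which in turn pushes the file to S3 with the AWS CLI.
        ssm.send_command(
            InstanceIds=[INSTANCE_ID],
            DocumentName='AWS-RunPowerShellScript',
            Parameters={
                'commands': ['aws s3 cp C:\\exports\\report.csv s3://my-bucket/reports/'],
            },
        )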
It would be much simpler to have the Windows computer push the files to Amazon S3:
Create a scheduled task in Windows
Use aws s3 cp or aws s3 sync to copy the files to Amazon S3
Done!
Your solution seems solid. Alternatively, you may want to write a daemon-like service (background process) that runs on each EC2 instance and does the data transfer from that instance to S3. What I like about your solution is how easily you can control the scheduling centrally. For my distributed solution you could have the processes read from a central config, but that seems more complicated than the CloudWatch/Lambda solution.
For the EC2 process solution, this may be useful: How to mount Amazon S3 Bucket as a Windows Drive. But it should be easy (and more scalable) to just use the AWS SDK to talk to S3 instead.
I read through the documentation, which talks about MySQL and RDS, but I could not find anything on moving on-premises Hive/Hadoop data to S3. I would appreciate any links or articles.
You can use S3DistCp to copy HDFS data from your on-premises cluster to S3 and vice versa.
Normally Data Pipeline instantiates an Ec2Resource instance in the AWS cloud and runs the TaskRunner on that instance. The corresponding activity in the pipeline, the one whose 'runsOn' field references the Ec2Resource, is then run on this instance. For details refer to the documentation here.
But any S3DistCp running on an EC2 instance will not have access to your on-premises HDFS. To have access to on-premises resources, the corresponding activities have to be executed by a TaskRunner running on an on-premises box. For details on how to set this up refer to the documentation here.
The TaskRunner is a standalone Java application provided by AWS that can be run manually on any self-managed box. It connects to the Data Pipeline service over the AWS API to get metadata about tasks pending execution and then executes them on the same box where it is running.
In the case of automated Ec2Resource provisioning, Data Pipeline instantiates the EC2 instance and runs this same TaskRunner on it, and all of this is transparent to us.
I am new to this forum and to this technology, and I am looking for your advice. I am working on a POC and below are my requirements. Could you please guide me on the way to achieve the result?
Copy data from NAS to S3.
Use S3 as a source in EMR Job with target to S3/Redshift.
Any link or PDF will also be helpful.
Thanks,
Pardeep
There's a lot that you're asking here, and there's not a lot of info on your use case to go by, so I'm going to be very general in my answer and hopefully it at least points you in the right direction.
You can use Lambda to copy data from your NAS to S3. Assuming your NAS is on-premises, and assuming you have a VPN into your VPC or even Direct Connect configured, you can use a VPC-enabled Lambda function to read from the on-premises NAS and write to S3.
If your NAS is running on EC2 the above will remain the same except there's no need for VPN or Direct Connect.
Are you looking to kick off the EMR job from Lambda? You can use S3 as a source for EMR and have it write its output back to S3, and you can kick the job off either from within Lambda or via other means as well.
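As a rough illustration of the Lambda-to-EMR part (assuming a long-running cluster already exists; the cluster ID, script location, and S3 paths below are hypothetical), adding a step from Lambda with boto3 could look like this:

    import boto3

    # Hypothetical cluster ID and S3 locations.
    CLUSTER_ID = 'j-EXAMPLE12345'

    def handler(event, context):
        emr = boto3.client('emr')
        # Submit a Spark step that reads from S3 and writes its results back to S3.
        emr.add_job_flow_steps(
            JobFlowId=CLUSTER_ID,
            Steps=[{
                'Name': 'process-nas-data',
                'ActionOnFailure': 'CONTINUE',
                'HadoopJarStep': {
                    'Jar': 'command-runner.jar',
                    'Args': [
                        'spark-submit', 's3://my-bucket/scripts/etl_job.py',
                        '--input', 's3://my-bucket/input/',
                        '--output', 's3://my-bucket/output/',
                    ],
                },
            }],
        )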
If you can provide more info on your use case we could probably give you a better quality answer.
Copy data from NAS to S3.
It really depends on the amount of data and the frequency at which you run the copy job. If the data is in the GBs, then you can install the AWS CLI on a machine where the NFS share is mounted. AWS CLI commands like cp are multithreaded and can easily copy your datasets to S3 (there is a short SDK sketch after the links below). You might also enable S3 Transfer Acceleration to speed things up. Having AWS Direct Connect between your company network and AWS can also speed up any transfers from on-premises to AWS.
http://docs.aws.amazon.com/cli/latest/topic/s3-config.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/transfer-acceleration.html
https://aws.amazon.com/directconnect/
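If you end up scripting the copy with the SDK instead of the CLI, the same multithreading and acceleration options are available there too. A minimal boto3 sketch (hypothetical paths and bucket; Transfer Acceleration has to be enabled on the bucket beforehand):

    import boto3
    from botocore.config import Config
    from boto3.s3.transfer import TransferConfig

    # Use the S3 Transfer Acceleration endpoint (the bucket must have acceleration enabled).
    s3 = boto3.client('s3', config=Config(s3={'use_accelerate_endpoint': True}))

    # Multipart, multithreaded upload -- roughly what 'aws s3 cp' does under the hood.
    transfer_cfg = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,   # switch to multipart above 64 MB
        multipart_chunksize=64 * 1024 * 1024,
        max_concurrency=20,                     # parallel threads per file
    )

    s3.upload_file('/mnt/nas/export/data.csv', 'my-bucket', 'raw/data.csv', Config=transfer_cfg)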
If the data is in the TBs (and probably distributed across multiple volumes), then you might have to consider physical transfer services like AWS Snowball, AWS Import/Export, or AWS Snowmobile, depending on the use case.
https://aws.amazon.com/cloud-data-migration/
Use S3 as a source in EMR Job with target to S3/Redshift.
Again, as there are a lot of applications on EMR, there are a lot of choices. Redshift supports COPY/UNLOAD commands against S3, which any application can make use of. If you want to use Spark on EMR, then installing the Databricks spark-redshift driver is a viable option for you (see the sketch after the links below).
https://github.com/databricks/spark-redshift
https://databricks.com/blog/2015/10/19/introducing-redshift-data-source-for-spark.html
https://aws.amazon.com/blogs/big-data/powering-amazon-redshift-analytics-with-apache-spark-and-amazon-machine-learning/
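To give an idea of what the spark-redshift route looks like, here is a rough PySpark sketch. The JDBC URL, credentials, table name, IAM role, and S3 paths are hypothetical, and the spark-redshift package plus the Redshift JDBC driver have to be available on the cluster (e.g. added via --packages/--jars when submitting the job).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('s3-to-redshift').getOrCreate()

    # Read the source data from S3.
    df = spark.read.csv('s3://my-bucket/input/', header=True, inferSchema=True)

    # Write to Redshift; spark-redshift stages the data in 'tempdir' on S3
    # and then issues a COPY into the target table.
    (df.write
       .format('com.databricks.spark.redshift')
       .option('url', 'jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev?user=admin&password=...')
       .option('dbtable', 'public.my_table')
       .option('tempdir', 's3://my-bucket/tmp/')
       .option('aws_iam_role', 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole')
       .mode('append')
       .save())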
I have a subset of Windows EC2 instances that I would like to continuously copy files to whenever files are uploaded to a specific S3 bucket. Files will be uploaded to this bucket anywhere between once a month and several times a month, but they will need to be copied to the instances within an hour of upload. EC2 instances will be continually added to and removed from this subset of instances. I would like this functionality to be controlled by the EC2 instance, so that whenever a new instance is created, it can be configured to pull from this bucket. Ideally, this would happen instantaneously upon upload (vs. a cron job running periodically). I have researched AWS Lambda and S3 notifications, and I am unsure if these are the correct methods to use. What solution is best suited to this model of copying files?
If you don't need "real time" presence of the files, you might consider running aws s3 sync on each instance via a cron job (the easy option), or using an S3 event notification with a Lambda function that delivers an EC2 Run Command to the instances.
If the instances are in an Auto Scaling group, you can run aws s3 cp in the user data section of your launch configuration to accomplish this.