How to write a MapReduce program with Amazon EC2 and S3

I want to analyse data stored in Amazon S3. How can I write a Java program on Amazon EMR and access this data?
The data url is http://s3.amazonaws.com/aws-publicdatasets/trec/kba/FAKBA1/index.html

Amazon EMR has built-in integration with S3.
There are several guides on how to do it: see here and this video.
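The question asks about a Java MapReduce program; as a minimal alternative illustration, here is a PySpark sketch (EMR clusters can be launched with Spark installed) that reads the public dataset path from the question directly over EMR's S3 integration. The wildcard, application name, and the choice of PySpark are assumptions for the example, not the only way to do it.

```python
# Minimal PySpark sketch for an EMR cluster with Spark installed.
# EMR's S3 integration lets jobs read s3:// paths directly.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-trec-kba").getOrCreate()

# Prefix taken from the dataset URL in the question; the wildcard is illustrative.
input_path = "s3://aws-publicdatasets/trec/kba/FAKBA1/*"

records = spark.sparkContext.textFile(input_path)
print("Record count:", records.count())

spark.stop()
```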

Related

Move data from Amazon S3 to Amazon RDS

Can we connect Amazon S3 buckets in two different regions and migrate CSV file data into an Amazon RDS database in one particular region? I am trying to use AWS Glue.
There are certainly different ways to solve this use case. You can use AWS Glue. You can also build a workflow using AWS Step Functions: for example, a series of Lambda functions that read the CSV files in an Amazon S3 bucket, extract the values, and write them to an Amazon RDS database (a minimal sketch of such a function follows below). Both approaches are valid.
See these docs for reference:
https://aws.amazon.com/blogs/big-data/orchestrate-multiple-etl-jobs-using-aws-step-functions-and-aws-lambda/
https://aws.amazon.com/glue/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc
Keep in mind, however, that a Lambda-based workflow is not ideal when your data set is so large that processing exceeds Lambda's 15-minute timeout. In that case, you should use AWS Glue.
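As a rough illustration of the Lambda step mentioned above, here is a minimal sketch of a handler that reads a CSV object from S3 and inserts its rows into an RDS MySQL table. The bucket, key, credentials, table/column names, and the pymysql dependency are all assumptions for the example, not a drop-in solution.

```python
# Hypothetical Lambda handler: read a CSV object from S3 and insert its
# rows into an Amazon RDS (MySQL) table. Bucket, key, credentials and
# table/column names are placeholders for illustration only.
import csv
import io

import boto3
import pymysql  # must be packaged with the function or provided as a layer

s3 = boto3.client("s3")

def handler(event, context):
    obj = s3.get_object(Bucket="my-source-bucket", Key="incoming/data.csv")
    body = obj["Body"].read().decode("utf-8")
    rows = csv.reader(io.StringIO(body))

    conn = pymysql.connect(host="my-rds-endpoint.rds.amazonaws.com",
                           user="admin", password="secret", database="mydb")
    try:
        with conn.cursor() as cur:
            for col_a, col_b in rows:  # assumes a two-column CSV
                cur.execute(
                    "INSERT INTO my_table (col_a, col_b) VALUES (%s, %s)",
                    (col_a, col_b),
                )
        conn.commit()
    finally:
        conn.close()
```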

Transform files from one S3 bucket to another

I am new to Amazon AWS. I have a use case to read ORC files from one S3 bucket, convert them to JSON files, and write them to another S3 bucket.
The volume is about 100 GB and roughly a thousand files every day.
I should be able to run this on demand or schedule it to run daily. What are the options I should consider?
Any ideas would be helpful.
Amazon Athena
You can use Amazon Athena to convert file formats via the CREATE TABLE AS command. See: Creating a Table from Query Results (CTAS) - Amazon Athena
The question then becomes how to send the commands to Athena. For this, you could schedule an AWS Lambda function to run, which starts an Amazon EC2 instance. Then, run a script on the instance to send all the commands to Amazon Athena. See: Auto-Stop EC2 instances when they finish a task - DEV Community
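For example, the CTAS command can be submitted programmatically. Here is a minimal boto3 sketch; the database, table names, S3 locations and region are placeholders (Athena CTAS supports JSON as an output format).

```python
# Sketch: submit a CTAS query to Athena with boto3 to rewrite ORC data as JSON.
# Database, table and S3 locations are illustrative placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

ctas_sql = """
CREATE TABLE mydb.orders_json
WITH (
    format = 'JSON',
    external_location = 's3://my-target-bucket/orders-json/'
) AS
SELECT * FROM mydb.orders_orc
"""

response = athena.start_query_execution(
    QueryString=ctas_sql,
    QueryExecutionContext={"Database": "mydb"},
    ResultConfiguration={"OutputLocation": "s3://my-query-results-bucket/athena/"},
)
print("Query execution id:", response["QueryExecutionId"])
```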
AWS Glue ETL job
Alternatively, you could create an AWS Glue ETL job that uses Spark to transform the data. See: Built-In Transforms - AWS Glue
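As a rough sketch of what such a Glue job script could look like (the bucket paths are placeholders and this is not a complete production job):

```python
# Sketch of an AWS Glue ETL job script that reads ORC from one bucket
# and writes JSON to another. Bucket paths are illustrative placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read ORC files directly from S3 into a DynamicFrame.
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-source-bucket/orc/"]},
    format="orc",
)

# Write the same records back out as JSON to the target bucket.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-target-bucket/json/"},
    format="json",
)

job.commit()
```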

Download files from an AWS S3 bucket in parallel

I want to download millions of files from an S3 bucket, which would take more than a week to download one by one. Is there any way or command to download those files in parallel using a shell script?
Thanks,
AWS CLI
You can certainly issue GetObject requests in parallel. In fact, the AWS Command-Line Interface (CLI) does exactly that when transferring files, so that it can take advantage of available bandwidth. The aws s3 sync command will transfer the content in parallel.
See: AWS CLI S3 Configuration
If your bucket has a large number of objects, it can take a long time to list the contents of the bucket. Therefore, you might want to sync the bucket by prefix (folder) rather than trying it all at once.
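If you prefer scripting it yourself rather than relying on aws s3 sync, a thread pool around GetObject achieves the same parallelism. The sketch below uses boto3 with an illustrative bucket, prefix, destination directory and worker count.

```python
# Sketch: download all objects under a prefix in parallel with boto3 and a
# thread pool. Bucket, prefix, destination and concurrency are illustrative.
import os
from concurrent.futures import ThreadPoolExecutor

import boto3

BUCKET = "my-big-bucket"
PREFIX = "exports/2024/"   # sync one prefix (folder) at a time
DEST = "/data/downloads"

s3 = boto3.client("s3")

def download(key):
    local_path = os.path.join(DEST, key)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    s3.download_file(BUCKET, key, local_path)

# List the keys under the prefix (this is the slow part for huge buckets).
keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

# Issue GetObject requests concurrently; tune max_workers to your bandwidth.
with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(download, keys))
```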
AWS DataSync
You might instead want to use AWS DataSync:
AWS DataSync is an online data transfer service that simplifies, automates, and accelerates copying large amounts of data to and from AWS storage services over the internet or AWS Direct Connect... Move active datasets rapidly over the network into Amazon S3, Amazon EFS, or Amazon FSx for Windows File Server. DataSync includes automatic encryption and data integrity validation to help make sure that your data arrives securely, intact, and ready to use.
DataSync uses a protocol that takes full advantage of available bandwidth and will manage the parallel downloading of content. A fee of $0.0125 per GB applies.
AWS Snowball
Another option is to use AWS Snowcone (8 TB) or AWS Snowball (50 TB or 80 TB), which are physical devices that can be pre-loaded with content from S3 and shipped to your location. You then connect the device to your network and download the data. (It works in reverse too, for uploading bulk data to Amazon S3.)

What is the most optimal way to automate data (CSV file) transfer from S3 to Redshift without AWS Data Pipeline?

I am trying to take SQL data stored in a CSV file in an S3 bucket, transfer the data to AWS Redshift, and automate that process. Would writing ETL scripts with Lambda/Glue be the best way to approach this problem, and if so, how do I get the script/transfer to run periodically? If not, what would be the most optimal way to pipeline data from S3 to Redshift?
I tried using AWS Data Pipeline, but it is not available in my region. I also tried to use the AWS documentation for Lambda and Glue, but I don't know where to find the exact solution to the problem.
All systems (including AWS Data Pipeline) use the Amazon Redshift COPY command to load data from Amazon S3.
Therefore, you could write an AWS Lambda function that connects to Redshift and issues the COPY command. You'll need to include a compatible library (e.g. psycopg2) to be able to call Redshift.
You can use Amazon CloudWatch Events to call the Lambda function on a regular schedule. Or, you could get fancy and configure Amazon S3 Events so that, when a file is dropped in an S3 bucket, it automatically triggers the Lambda function.
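A minimal sketch of such a Lambda function is shown below; the cluster endpoint, credentials, table, S3 path and IAM role are placeholders, and psycopg2 would need to be packaged with the function.

```python
# Hypothetical Lambda handler that issues a Redshift COPY command with
# psycopg2. Endpoint, credentials, table, S3 path and IAM role are
# placeholders for illustration only.
import psycopg2  # must be packaged with the function (e.g. as a layer)

def handler(event, context):
    # With an S3 Event trigger, the bucket and key could instead be read
    # from the incoming `event` payload.
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="loader",
        password="secret",
    )
    try:
        with conn.cursor() as cur:
            cur.execute("""
                COPY public.sales
                FROM 's3://my-data-bucket/incoming/sales.csv'
                IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
                CSV IGNOREHEADER 1
            """)
        conn.commit()
    finally:
        conn.close()
```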
If you don't want to write it yourself, you could search for existing code on the web, including:
The very simple, Python-based christianhxc/aws-lambda-redshift-copy: AWS Lambda function that runs the copy command into Redshift
A more fully-featured, Node-based A Zero-Administration Amazon Redshift Database Loader | AWS Big Data Blog

Best option for data transfer between on-premises servers and AWS S3

My organization is evaluating options of Hybrid Data Warehouse using AWS Redshift and S3. Objective is to process the data on-premises and send processed copy to S3 and then load to Redshift for visualization.
As we are in initial stages, there is no file/storage gateway setup yet.
Initially we used the Informatica Cloud tool to upload data from the on-premises server to AWS S3, but it was taking a long time. The data volume is a few hundred million records of history and a few thousand records in the daily incremental load.
Now I have created custom UNIX scripts using the AWS CLI, and I use the cp command to transfer files between the on-premises server and AWS S3 in gzip-compressed format.
This option is working fine.
But I would like to understand from experts whether this is the right way of doing it, or if there are other, more optimized approaches available to achieve this.
If the volume of your data is more than 100 MB, AWS suggests using multipart upload for better performance.
You can refer to the following to get the benefit of this:
AWS Java SDK to upload large file in S3
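The linked example uses the AWS SDK for Java; the same idea in Python is sketched below using boto3's TransferConfig, which switches to multipart upload automatically once a file exceeds the configured threshold. The file name, bucket and tuning values are illustrative.

```python
# Sketch: upload a large file with boto3, which uses multipart upload
# automatically once the file exceeds multipart_threshold.
# File name, bucket and tuning values are illustrative placeholders.
import boto3
from boto3.s3.transfer import TransferConfig

MB = 1024 * 1024
config = TransferConfig(
    multipart_threshold=100 * MB,  # use multipart for files over ~100 MB
    multipart_chunksize=64 * MB,   # size of each uploaded part
    max_concurrency=10,            # parts uploaded in parallel
)

s3 = boto3.client("s3")
s3.upload_file(
    Filename="/data/export/history.csv.gz",
    Bucket="my-landing-bucket",
    Key="history/history.csv.gz",
    Config=config,
)
```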