How to work with csv using AWS EMR? - amazon-web-services

I'm copying .csv files into an S3 bucket and I need to join them like in an RDB. Is it possible to do this? I hope for your great minds. =)

You can do this using AWS Data Pipeline and EMR.
EMR supports CSV (and TSV) as input formats, meaning Hive on EMR can read the files and treat them as a table with data rows.
You keep these files in an S3 bucket, and the bucket location is exposed to Hive as a table, much like a table stored on HDFS (Hadoop Distributed File System). Once that is in place you can issue Hive queries (including joins) and do most of the things you need to.
I will point you to the doc from here on. You will need to spend some time reading and understanding the entire setup, but once mastered it is very handy.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-s3tos3hivecsv.html
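To make the Hive part of this answer more concrete, here is a minimal Python sketch that only builds the HiveQL text for two external CSV tables and a join between them. The bucket, table and column names are hypothetical, and the resulting statements would still have to be run in Hive on your EMR cluster (for example from a pipeline activity).

```python
def external_csv_table_ddl(table, columns, s3_location):
    """Build a HiveQL statement exposing CSV files under an S3 prefix as a table."""
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}'"
    )


def join_query(left, right, key, select="*"):
    """Build a plain inner join between two tables on a shared key column."""
    return f"SELECT {select} FROM {left} JOIN {right} ON {left}.{key} = {right}.{key}"


# Hypothetical tables: each LOCATION is a folder of CSV files, not a single file.
orders_ddl = external_csv_table_ddl(
    "orders",
    [("order_id", "INT"), ("customer_id", "INT"), ("total", "DOUBLE")],
    "s3://my-bucket/orders/",
)
customers_ddl = external_csv_table_ddl(
    "customers",
    [("customer_id", "INT"), ("name", "STRING")],
    "s3://my-bucket/customers/",
)
query = join_query("orders", "customers", "customer_id")

print(orders_ddl, customers_ddl, query, sep=";\n\n")
```

Keeping each CSV "table" in its own S3 prefix matters here, because an external table maps to a location rather than to one file.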

Related

AWS ETL Job to write data to s3 bucket in csv format

I am completely new to AWS and need your support to point me in the right direction on my requirement.
Requirement:
I need to read multiple CSV files from an S3 bucket, union the data, perform some transformations and load the result into another S3 bucket.
Issue:
I understand that Lambda is one option for this, but the data is huge, so I believe at some point the 15-minute limit will be an issue for me.
Also, for Glue ETL, from what I have read I understand it does support S3 as the output.
Ask:
Could you suggest any other ELT services and a link to help me get started?

AWS Glue Outputting Empty Files on Sequential Runs

I am trying to automate an ETL pipeline that outputs data from AWS RDS MySQL to AWS S3. I am currently using AWS Glue to do the job. When I do an initial load from RDS to S3, it captures all the data in the file, which is exactly what I want. However, when I add new data to the MySQL database and run the Glue job again, I get an empty file instead of the added rows. Any help would be MUCH appreciated.
Bookmarking rules for JDBC sources are here. The important point to remember for JDBC sources is that the bookmark key values have to be in increasing or decreasing order, and Glue only processes new data since the last checkpoint.
Typically, either an autogenerated sequence number or a datetime is used as the key for bookmarking.
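To illustrate why the ordering of the key matters, here is a toy Python model of checkpoint-style processing. It is not Glue's actual implementation, just a sketch of the idea under the assumption of an increasing key; the row layout and the function name are made up for the example.

```python
def rows_since_bookmark(rows, key, bookmark):
    """Return rows whose key is past the bookmark, plus the new bookmark value."""
    new_rows = [r for r in rows if bookmark is None or r[key] > bookmark]
    new_bookmark = max((r[key] for r in new_rows), default=bookmark)
    return new_rows, new_bookmark


table = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
first_run, bookmark = rows_since_bookmark(table, "id", None)  # initial load

table.append({"id": 3, "v": "c"})  # key keeps increasing: picked up
table.append({"id": 0, "v": "d"})  # key below the bookmark: skipped
second_run, bookmark = rows_since_bookmark(table, "id", bookmark)

print(first_run)
print(second_run)
```

In this model a second run comes back empty whenever every newly added row has a key at or below the stored checkpoint, which is one way to end up with an empty output file.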
For anybody who is still struggling with this (it drove me mad, because I thought my Spark code was wrong), disable bookmarking in the job details.

Read S3 Bucket from EC2 for ML Training

I am trying to train a machine learning model on AWS EC2. I have over 50GB of data currently stored in an AWS S3 bucket. When training my model on EC2, I want to be able to access this data.
Essentially, I want to be able to call this command:
python3 train_model.py --train_files /data/train.csv --dev_files /data/dev.csv --test_files /data/test.csv
where /data/train.csv is my S3 bucket s3://data/. How can I do this? I currently only see ways to cp my S3 data into my EC2.
You can develop an enhancement to your code using boto.
But if you want access to your S3 as if it was another local filesystem I would consider s3fs-fuse, explained further here.
Another option would be to use the aws-cli to sync your data to a local folder.
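As a rough sketch of the first option (extending your code with boto), the helper below downloads each file once and hands back local paths that can be passed to train_model.py. It assumes a boto3-style S3 client is passed in; the helper names and the local directory are hypothetical.

```python
import os


def ensure_local_copy(s3_client, bucket, key, local_path):
    """Download s3://bucket/key to local_path unless it is already there."""
    if not os.path.exists(local_path):
        os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
        s3_client.download_file(bucket, key, local_path)
    return local_path


def fetch_training_files(s3_client, bucket, keys, local_dir):
    """Map each S3 key to a file under local_dir and return the local paths."""
    return [
        ensure_local_copy(s3_client, bucket, key, os.path.join(local_dir, key))
        for key in keys
    ]


# Usage, assuming boto3 is installed and the instance has S3 read access:
#   import boto3
#   paths = fetch_training_files(
#       boto3.client("s3"), "data", ["train.csv", "dev.csv", "test.csv"], "/data"
#   )
```

Skipping files that already exist locally means a restarted training run does not download the 50GB again.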
How can I do this? I currently only see ways to cp my S3 data into my EC2.
S3 is an object storage system. It does not allow direct access to or reading of files like a regular file system.
Thus, to read your files, you need to download them first (downloading in parts is also possible), or have some third-party software do it for you, like s3-fuse. You can download them to your instance, or store them in an external file system (e.g. EFS).
It's not clear from your question whether you have one 50GB CSV file or multiple small ones. If you have one large 50GB CSV file and not all of it is needed at once, you can reduce the amount of data read by using S3 Select:
With S3 Select, you can use a simple SQL expression to return only the data from the store you’re interested in, instead of retrieving the entire object. This means you’re dealing with an order of magnitude less data which improves the performance of your underlying applications.
Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format.
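A minimal sketch of what an S3 Select call could look like from Python, assuming a boto3 S3 client is passed in, that S3 Select is available on your account, and that the CSV has a header row. The column names in the example query are hypothetical.

```python
def select_csv_rows(s3_client, bucket, key, sql):
    """Run an S3 Select SQL expression against one CSV object and return the text."""
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=sql,
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"CSV": {}},
    )
    chunks = []
    # The payload is a stream of events; only "Records" events carry data.
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    # Join the raw bytes before decoding so multi-byte characters stay intact.
    return b"".join(chunks).decode("utf-8")


# Usage, assuming boto3 and a hypothetical "label" column:
#   import boto3
#   text = select_csv_rows(
#       boto3.client("s3"), "data", "train.csv",
#       "SELECT s.label FROM s3object s WHERE s.label = 'cat'",
#   )
```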

Tagging objects read by spark on s3

I use pyspark to read objects from a bucket on Amazon S3. My bucket is composed of many JSON files, which I read and then save as Parquet files with
df = spark.read.json('s3://my-bucket/directory1/')
df.write.parquet('s3://bucket-with-parquet/', mode='append')
Every day I will upload some new files to s3://my-bucket/directory1/ and I would like to add them to s3://bucket-with-parquet/. Is there a way to ensure that I do not process the same data twice? My idea is to tag every file that I read with Spark (I do not know how to do it). I could then use those tags to tell Spark not to read the file again afterwards (I don't know how to do that either). If an AWS guru could help me with that I would be very grateful.
There are a couple of things you could do. One is to write a script that reads the last-modified timestamp from the metadata of the objects in the bucket and gives you the list of files added on that day. You can then process only the files in this list. (https://medium.com/faun/identifying-the-modified-or-newly-added-files-in-s3-11b577774729)
Second, you can enable versioning on the S3 bucket to make sure that if you overwrite any files you can retrieve the old version. You can also set an ACL for read-only and write-once permission, as mentioned here: Amazon S3 ACL for read-only and write-once access.
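The first suggestion could be sketched roughly as follows, assuming a boto3 S3 client is passed in and that you keep track of the time of the previous run yourself. The function name and the cutoff value are hypothetical.

```python
from datetime import datetime, timezone


def keys_modified_since(s3_client, bucket, prefix, since):
    """Return keys under prefix whose LastModified is strictly after `since`."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix carry no "Contents" entry at all.
        for obj in page.get("Contents", []):
            if obj["LastModified"] > since:
                keys.append(obj["Key"])
    return keys


# Hypothetical cutoff: the time of the previous successful run (timezone-aware).
cutoff = datetime(2020, 1, 2, tzinfo=timezone.utc)

# Usage, assuming boto3 and an existing SparkSession named spark:
#   import boto3
#   keys = keys_modified_since(boto3.client("s3"), "my-bucket", "directory1/", cutoff)
#   if keys:
#       df = spark.read.json([f"s3://my-bucket/{k}" for k in keys])
#       df.write.parquet("s3://bucket-with-parquet/", mode="append")
```

Storing the cutoff somewhere durable after each successful write is what keeps a file from being appended twice.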
I hope this helps.

Aws: best approach to process data from S3 to RDS

I'm trying to implement what I think is a very simple process, but I don't really know what the best approach is.
I want to read a big CSV file (around 30GB) from S3, apply some transformations and load it into RDS MySQL, and I want this process to be repeatable.
I thought that the best approach was AWS Data Pipeline, but I've found that this service is designed more for loading data from different sources into Redshift after several transformations.
I've also seen that the process of creating a pipeline is slow and a little bit messy.
Then I found the dataduct wrapper from Coursera, but after some research, it seems that this project has been abandoned (the last commit was one year ago).
So I don't know if I should continue trying with AWS Data Pipeline or take another approach.
I've also read about AWS Simple Workflow and Step Functions, but I don't know if they are simpler.
Then I saw a video about AWS Glue and it looks nice, but unfortunately it's not yet available and I don't know when Amazon will launch it.
As you can see, I'm a little bit confused. Can anyone enlighten me?
Thanks in advance
If you are trying to get them into RDS so you can query them, there are other options that do not require the data to be moved from S3 to RDS to do SQL like queries.
You can use Redshift Spectrum to read and query information from S3 now.
Using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables.
Step 1: Create an IAM Role for Amazon Redshift
Step 2: Associate the IAM Role with Your Cluster
Step 3: Create an External Schema and an External Table
Step 4: Query Your Data in Amazon S3
Or you can use Athena to query the data in S3 as well, if Redshift is too much horsepower for the job at hand.
Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL.
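For the Athena route, a query can also be submitted from code. The sketch below assumes a boto3 Athena client is passed in and that a table over the CSV files has already been defined; the database name, table name and results location are hypothetical.

```python
import time


def run_athena_query(athena_client, sql, database, output_location, poll_seconds=1.0):
    """Start a query, wait until it finishes, and return (query_id, final_state)."""
    query_id = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        status = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)


# Usage, assuming boto3 and a hypothetical table "big_csv" in database "mydb":
#   import boto3
#   query_id, state = run_athena_query(
#       boto3.client("athena"),
#       "SELECT COUNT(*) FROM big_csv",
#       "mydb",
#       "s3://my-bucket/athena-results/",
#   )
```

The result rows are written as a file under the output location, so no data has to be moved into RDS just to run the query.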
You could use an ETL tool to do the transformations on your CSV data and then load it into your RDS database. There are a number of open source tools that do not require large licensing costs. That way you can pull the data into the tool, do your transformations, and then the tool will load the data into your MySQL database. For example, there are Talend, Apache Kafka, and Scriptella. Here's some information on them for comparison.
I think Scriptella would be an option for this situation. It can use SQL scripts (or other scripting languages), and has JDBC/ODBC compliant drivers. With this you could create a script that would perform your transformations and then load the data into your MySQL database. And you would be using familiar SQL (I'm assuming you already can create SQL scripts) so there isn't a big learning curve.
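If you would rather script the same transform-then-load pattern by hand, here is a plain Python sketch (not Scriptella) that streams the CSV and inserts it in batches through any DB-API cursor, so the 30GB file never has to fit in memory. The table, columns and transformation are made up, and sqlite3 stands in for MySQL only so that the example runs on its own.

```python
import csv
import itertools
import sqlite3


def load_csv_in_batches(lines, cursor, insert_sql, transform, batch_size=1000):
    """Stream CSV rows, transform each one, and insert them batch by batch."""
    reader = csv.DictReader(lines)
    rows = (transform(row) for row in reader)
    total = 0
    while True:
        batch = list(itertools.islice(rows, batch_size))
        if not batch:
            return total
        cursor.executemany(insert_sql, batch)
        total += len(batch)


# Stand-in database and data; with MySQL you would pass a cursor from your
# MySQL driver and use its placeholder style (typically %s) in the INSERT.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (name TEXT, amount_cents INTEGER)")
sample = ["name,amount", "alice,1.50", "bob,2.25"]

count = load_csv_in_batches(
    sample,
    conn.cursor(),
    "INSERT INTO sales VALUES (?, ?)",
    lambda r: (r["name"].title(), round(float(r["amount"]) * 100)),
    batch_size=1,
)
conn.commit()
print(count, conn.execute("SELECT * FROM sales").fetchall())
```

Here `lines` can be any iterator of text lines, for example an open file or a streamed download, which is what keeps memory use flat.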