Running an EMR job on DynamoDB data - amazon-web-services

I'd like to run a MapReduce job on a DynamoDB Table.
My question is:
Is it OK to dump the whole table (even if it's very big, with tens of millions of entries) into one file on S3?
That is, will MapReduce know to take "chunks" of this file and distribute them to the mappers? Or is the atomic unit provided to a mapper a single file on S3, so that I need to break the table into lots of little files, for example files of at most 10,0000 rows?
If that is the case, is there a way to use AWS Data Pipeline to dump a DynamoDB table into several different files on S3?
Thanks!

This article shows how to export DynamoDB data to S3:
https://aws.amazon.com/articles/Elastic-MapReduce/28549
Check "Exporting data stored in DynamoDB to Amazon S3".
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/EMRforDynamoDB.html
There is also a video at
http://www.youtube.com/watch?v=RlKndm22bXw
Hope this helps.
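
If it turns out that the export does have to be split, the splitting itself is simple. The following is a minimal sketch, not the Data Pipeline mechanism: it groups rows into parts of at most `max_rows` and renders each part as newline-delimited JSON. The key prefix, the part naming scheme, and the function names are assumptions made for illustration; uploading each part to S3 is left out.

```python
import json
from itertools import islice


def chunk_rows(rows, max_rows):
    """Yield lists of at most max_rows items from any iterable of rows."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, max_rows))
        if not chunk:
            return
        yield chunk


def plan_parts(rows, prefix, max_rows):
    """Yield (s3_key, body) pairs, one per part file.

    The key layout "<prefix>/part-00000.json" is a hypothetical naming scheme.
    """
    for index, chunk in enumerate(chunk_rows(rows, max_rows)):
        key = f"{prefix}/part-{index:05d}.json"
        # One JSON object per line, so each part can be read line by line.
        body = "\n".join(json.dumps(row, sort_keys=True) for row in chunk) + "\n"
        yield key, body
```

Because `chunk_rows` consumes its input lazily, the rows can come from a paginated table scan without holding the whole table in memory.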

Related

Architecture to process AWS S3 files

I am working on a POC where we have millions of existing compressed JSON files in S3 (3+ MB each when uncompressed, with nested objects and arrays), with more being added every few minutes. We need to perform computations on the uncompressed data (on a per-file basis) and store the results in a DB table where we can then perform some column operations. The most common solution I found online is:
S3 (Add/update event notification) => SQS (main queue => dlq queue) <=> AWS lambda
We have a DB table for all S3 bucket key names that are being successfully loaded, so I can query this table and use the AWS SDK Node.js package to send messages to the SQS main queue. For newly added/updated files, S3 event notification will take care of it.
I think the above architecture will work in my case, but are there any other AWS services I should look at?
I looked at AWS Athena, which can read my compressed files and give me the raw output, but since I have big nested objects and arrays that I need to perform computations on, I am not sure it's ideal to write such complex logic in SQL.
I would really appreciate some guidance here.
If you plan to query the data in the future in ways you can't anticipate, I would strongly suggest you explore the Athena solution, since you would be plugging a very powerful SQL engine on top of your data. Athena can query compressed JSON directly and export it to other data formats that are much more efficient to query (like Parquet or ORC) and that support complex data structures.
The flow would be:
S3 (new file) => Athena ETL (json to, say, parquet)
see e.g. here.
For already existing data you can run a one-off query to convert it to the appropriate format (partitioning would be useful if your data volume is as big as it seems). Good partitioning is key to obtaining good performance on Athena, and you will need to think carefully about it in your ETL. More on partitioning, e.g., there.
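
As an illustration of the one-off conversion, here is a minimal sketch that builds an Athena CTAS (CREATE TABLE AS SELECT) statement writing partitioned Parquet. The table names, column names, and S3 location in the example are hypothetical, and the helper only builds the SQL text; submitting it to Athena is not shown.

```python
def build_ctas(source_table, target_table, output_location, columns, partition_columns):
    """Build an Athena CTAS statement that rewrites a table as partitioned Parquet.

    Athena expects partition columns to come last in the SELECT list,
    so they are appended after the regular columns.
    """
    if not partition_columns:
        raise ValueError("at least one partition column is required")
    select_list = ", ".join(list(columns) + list(partition_columns))
    partitioned_by = ", ".join(f"'{name}'" for name in partition_columns)
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (\n"
        f"  format = 'PARQUET',\n"
        f"  external_location = '{output_location}',\n"
        f"  partitioned_by = ARRAY[{partitioned_by}]\n"
        f") AS\n"
        f"SELECT {select_list}\n"
        f"FROM {source_table}"
    )


# Hypothetical names, for illustration only.
query = build_ctas(
    source_table="raw_events_json",
    target_table="events_parquet",
    output_location="s3://my-bucket/events-parquet/",
    columns=["event_id", "payload"],
    partition_columns=["dt"],
)
```

The resulting string can be run from the Athena console or passed to the Athena API.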

AWS Athena - What happens when you add new files to S3 folder

I have a sample working where I put a file in S3.
What I'm confused about is what happens when I add new CSV files (with the same format) to that folder.
Are they instantly available in queries? Or do you have to run Glue or something to process them? For example, what if I set up a Lambda function to extract a new CSV every hour, or even every 5 minutes, to that same S3 directory?
Does Athena actually load the data into some database somewhere in order to do fast performing queries?
If your table is not partitioned, or you add a file to an existing partition, the data will be available right away.
However, if you constantly add files, you may want to consider partitioning your table to optimize query performance, see:
Table Location in Amazon S3
Partitioning Data
Athena itself doesn't have any caching; every query reads from the table's S3 location.
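
With a partitioned table, files that land under a new partition location become visible only once that partition is registered. A minimal sketch of building the corresponding statement follows; the table name, partition columns, and S3 path are hypothetical, and the helper only produces the DDL text.

```python
def add_partition_ddl(table, partition, location):
    """Build an ALTER TABLE ... ADD PARTITION statement for Athena.

    `partition` maps partition column names to their values, in table order.
    """
    spec = ", ".join(f"{column}='{value}'" for column, value in partition.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


# Hypothetical table and layout, for illustration only.
ddl = add_partition_ddl(
    table="csv_exports",
    partition={"dt": "2020-09-15", "hour": "10"},
    location="s3://my-bucket/exports/dt=2020-09-15/hour=10/",
)
```

`IF NOT EXISTS` makes the statement safe to repeat, which is convenient when a scheduled job registers the current hour on every run.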

Automate loading data from S3 to Redshift

I want to load data from S3 to Redshift. The data arrives in S3 at around 5 MB (approximate size) per second.
I need to automate the loading of data from S3 to Redshift.
The data is dumped to S3 by a Kafka stream consumer application.
The S3 data is organized in a folder structure.
Example folder :
bucketName/abc-event/2020/9/15/10
files in this folder :
abc-event-2020-9-15-10-00-01-abxwdhf. 5MB
abc-event-2020-9-15-10-00-02-aasdljc. 5MB
abc-event-2020-9-15-10-00-03-thntsfv. 5MB
The files in S3 contain JSON objects separated by newlines.
This data needs to be loaded into the abc-event table in Redshift.
I know of a few options, like AWS Data Pipeline, AWS Glue, and the AWS Lambda Redshift loader (https://aws.amazon.com/blogs/big-data/a-zero-administration-amazon-redshift-database-loader/).
What would be the best way to do it?
I would really appreciate it if someone could guide me.
Thank you
=============================================
Thanks, Prabhakar, for the answer. I need some more help as a follow-up.
I created a table in the Data Catalog with a crawler, and
running an ETL job in Glue then loads the data from S3 to Redshift.
I am using approach 1, predicate pushdown.
New files get loaded in S3 in a different partition (say, when a new hour starts).
I am adding the new partition using an AWS Glue Python script job,
which adds the partition to the table through the Athena API (using ALTER TABLE ADD PARTITION).
I have checked in the console that the Python script job adds the new partition, and that the new partition appears in the Data Catalog table.
I then run the same job with a pushdown predicate that specifies the partition added by the Python script Glue job.
The job does not load the new files from S3 in this new partition to Redshift.
I can't figure out what I am doing wrong.
In your use case you can leverage AWS Glue to load the data periodically into Redshift. You can schedule your Glue job with a trigger to run every 60 minutes, which works out to around 18 GB per run in your case (5 MB per second times 3,600 seconds).
This interval can be changed according to your needs, depending on how much data you want to process in each run.
There are a couple of approaches you can follow to read this data:
Predicate pushdown:
This will load only the partitions that are mentioned in the job. You can calculate the partition values on the fly for every run and pass them to the filter. For this you need to run the Glue crawler on each run so that the partitions are updated in the table metadata.
If you don't want to use a crawler, you can use either boto3 create_partition or Athena ADD PARTITION, which is a free operation.
Job bookmark:
This will load only the latest S3 data that has accumulated since your Glue job completed its previous run. This approach might not be effective if no data is generated in S3 between some runs.
Once you have determined the data to be read, you can simply write it to the Redshift table on every run.
In your case the files are in subdirectories, so you need to enable recurse, as shown in the statement below.
datasource0 = glueContext.create_dynamic_frame.from_catalog(database="<name>", table_name="<name>", push_down_predicate="(year=='<2019>' and month=='<06>')", transformation_ctx="datasource0", additional_options={"recurse": True})
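
Calculating the partition values on the fly could look like the sketch below, which builds the predicate string for a given hour. The partition column names (`year`, `month`, `day`, `hour`) are assumptions: with a path layout like `2020/9/15/10`, which is not in key=value form, a crawler typically names the columns `partition_0`, `partition_1`, and so on, so check the catalog table and pass the real names. The values must also match how the partitions were registered (for example `9` versus `09`), which is what the `zero_pad` flag is for.

```python
from datetime import datetime, timedelta


def previous_hour(now):
    """Return the start of the hour before `now`."""
    return now.replace(minute=0, second=0, microsecond=0) - timedelta(hours=1)


def pushdown_predicate(ts, columns=("year", "month", "day", "hour"), zero_pad=False):
    """Build a Glue push_down_predicate string for the partition containing `ts`.

    `columns` are the partition column names, coarsest first (assumed names).
    Passing fewer columns selects a coarser partition, e.g. a whole month.
    """
    values = (ts.year, ts.month, ts.day, ts.hour)
    clauses = []
    for column, value in zip(columns, values):
        text = f"{value:02d}" if zero_pad else str(value)
        clauses.append(f"{column}=='{text}'")
    return "(" + " and ".join(clauses) + ")"


predicate = pushdown_predicate(previous_hour(datetime(2020, 9, 15, 11, 5)))
```

The resulting string can be passed as `push_down_predicate` in the statement above.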

Loading parquet file from S3 to DynamoDB

I have been looking at options to load (basically empty and restore) a Parquet file from S3 into DynamoDB. The Parquet file itself is created by a Spark job that runs on an EMR cluster. Here are a few things to keep in mind:
I cannot use AWS Data Pipeline.
The file is going to contain millions of rows (say 10 million), so I need an efficient solution. I believe the boto API (even with batch write) might not be that efficient?
Are there any other alternatives?
Can you just refer to the Parquet files in a Spark RDD and have the workers put the entries into DynamoDB? Ignoring the challenge of caching the DynamoDB client in each worker for reuse across rows, a bit of Scala that takes a row, builds an entry for DynamoDB, and PUTs it should be enough.
BTW: Use DynamoDB on-demand here, as it handles peak loads well without you having to commit to some SLA.
Look at the answer below:
https://stackoverflow.com/a/59519234/4253760
To explain the process:
Create the desired DataFrame.
Use .withColumn to create a new column, and use psf.collect_list to convert the data to the desired collection/JSON format in that new column of the same DataFrame.
Drop all unnecessary (tabular) columns and keep only the JSON-format DataFrame columns in Spark.
Load the JSON data into DynamoDB as explained in the answer.
My personal suggestion: whatever you do, do NOT use RDDs. The RDD interface, even in Scala, is 2-3 times slower than the DataFrame API in any language.
The DataFrame API's performance is programming-language agnostic, as long as you don't use UDFs.
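
For the final write step, a minimal Python sketch of a per-partition writer is shown below. It is not the method from the linked answer; it assumes rows arrive as plain dicts and that `table` is a boto3 DynamoDB `Table` resource, whose `batch_writer()` groups puts into batch requests. Floats are converted to `Decimal` because boto3's resource API does not accept Python floats.

```python
from decimal import Decimal


def to_dynamo_value(value):
    """Recursively convert floats to Decimal, as boto3's resource API requires."""
    if isinstance(value, float):
        return Decimal(str(value))
    if isinstance(value, dict):
        return {key: to_dynamo_value(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_dynamo_value(item) for item in value]
    return value


def write_rows(rows, table):
    """Write an iterable of dict rows through table.batch_writer().

    Returns the number of rows handed to the writer.
    """
    written = 0
    with table.batch_writer() as writer:
        for row in rows:
            writer.put_item(Item=to_dynamo_value(row))
            written += 1
    return written
```

On Spark this could be called from `df.foreachPartition`, converting each `Row` with `asDict()` and creating the boto3 table object inside the partition function so that each worker builds its own client.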

AWS Glue Crawler Overwrite Data vs. Append

I am trying to leverage Athena to run SQL on data that is pre-ETL'd by a third-party vendor and pushed to an internal S3 bucket.
CSV files are pushed to the bucket daily by the ETL vendor. Each file includes yesterday's data in addition to data going back to 2016 (i.e. new data arrives daily but historical data can also change).
I have an AWS Glue Crawler set up to monitor the specific S3 folder where the CSV files are uploaded.
Because each file contains updated historical data, I am hoping to figure out a way to make the crawler overwrite the existing table based on the latest file uploaded instead of appending. Is this possible?
Thanks very much in advance!
It is not possible the way you are asking. The crawler does not alter data.
The crawler only populates the AWS Glue Data Catalog with tables.
Please see here for details: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
If you want to clean the data using Athena/Glue before using it, follow these steps:
1. Map the data using a crawler into a temporary Athena database/table.
2. Profile your data using Athena SQL, QuickSight, etc. to get an idea of what you need to alter.
3. Use a Glue job to:
   a. transform, clean, rename, and dedupe the data using PySpark or Scala
   b. export the data to a new S3 location (.csv, .parquet, etc.), potentially partitioning it
4. Run one more crawler to map the cleaned data from the new S3 location into an Athena database.
The dedupe you are asking about happens in step 3.
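
To make the dedupe rule concrete, here is a plain-Python illustration of the logic only (in a Glue job it would be expressed in PySpark or Scala): for each business key, keep the row that came from the most recent file. The field names `record_id` and `file_date` are hypothetical.

```python
def keep_latest(rows, key_fields, version_field):
    """Return one row per key: the row with the highest version_field value.

    When two rows tie on version_field, the one seen later wins.
    """
    latest = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        current = latest.get(key)
        if current is None or row[version_field] >= current[version_field]:
            latest[key] = row
    return list(latest.values())


# Hypothetical rows: the same record appears in two daily files.
rows = [
    {"record_id": 1, "amount": 10, "file_date": "2020-01-01"},
    {"record_id": 2, "amount": 7, "file_date": "2020-01-01"},
    {"record_id": 1, "amount": 12, "file_date": "2020-01-02"},
]
deduped = keep_latest(rows, key_fields=["record_id"], version_field="file_date")
```

Since every daily file already contains the full history, an even simpler variant is to read only the newest file and skip the per-key comparison.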