Joining many large files on AWS - amazon-web-services

I am looking for advice on which service to use. I am new to big data and confused by the differences between the AWS offerings.
Use case:
I receive 60-100 CSV files daily (each one can be from a few MB to a few GB). There are six corresponding schemas, and each file belongs to exactly one table.
I need to load those files into the six database tables, execute joins between them, and generate a daily output. Once the output has been generated, the data in the database is no longer needed, so we can truncate the tables and wait for the next day's files.
Files have predictable naming patterns:
A_<timestamp1>.csv goes to A table
A_<timestamp2>.csv goes to A table
B_<timestamp1>.csv goes to B table
etc ...
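The naming pattern above can be turned into a routing rule. A minimal sketch in Python, assuming the table name is everything before the first underscore and that file names arrive as plain strings or S3-style keys (the function name and the regex are illustrative, not part of any AWS API):

```python
import re
from collections import defaultdict

# Assumed pattern: "<TABLE>_<timestamp>.csv", where the table prefix has no underscore.
FILE_PATTERN = re.compile(r"^(?P<table>[A-Za-z0-9]+)_(?P<timestamp>[^/]+)\.csv$")


def group_files_by_table(keys):
    """Map each table prefix to the sorted list of keys that belong to it."""
    groups = defaultdict(list)
    for key in keys:
        name = key.rsplit("/", 1)[-1]  # drop any folder prefix
        match = FILE_PATTERN.match(name)
        if match is None:
            continue  # ignore files that do not follow the naming pattern
        groups[match.group("table")].append(key)
    return {table: sorted(files) for table, files in groups.items()}


if __name__ == "__main__":
    daily_files = ["A_20240101T0100.csv", "A_20240101T0200.csv", "B_20240101T0100.csv", "notes.txt"]
    print(group_files_by_table(daily_files))
```

Each group can then be handed to whichever loader is chosen (a Redshift COPY, a Glue job, a Spark read), one table at a time.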
Which service could be used for that purpose?
AWS Redshift (execute here joins)
AWS Glue (load to redshift)
AWS EMR (spark)
or maybe something else? I heard that spark could be used to do the joins, but what is the proper, optimal and performant way of doing that?
Edit:
Thanks for the responses. I see two options for now:
Use AWS Glue: set up 6 crawlers that, on trigger, register the files as tables in the AWS Glue Data Catalog, then execute the SQL joins with Athena.
Use AWS Glue: set up 6 crawlers that, on trigger, register the files as tables in the AWS Glue Data Catalog, then trigger a Spark job (AWS Glue in its serverless form) to do the SQL joins and write the output to S3.
Edit 2:
But according to https://carbonrmp.com/knowledge-hub/tech-engineering/athena-vs-spark-lessons-from-implementing-a-fully-managed-query-system/:
Presto is designed for low latency and uses a massively parallel processing (MPP) approach which is fast but requires everything to happen at once and in memory. It’s all or nothing, if you run out of memory, then “Query exhausted resources at this scale factor”. Spark is designed for scalability and follows a map-reduce design [1]. The job is split and processed in chunks, which are generally processed in batches. If you double the workload without changing the resource, it should take twice as long instead of failing [2]
So Athena (which is built on Presto) does not scale as far as I need. I have seen "Query exhausted resources at this scale factor" in my case.
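To make the quoted distinction concrete, here is a toy illustration of joining partition by partition instead of building one lookup table for the whole dataset. It is only a sketch of the idea, not how Presto or Spark are actually implemented; the function names are made up for illustration, and a real engine would spill partitions to disk rather than keep them in Python lists:

```python
from collections import defaultdict


def partition(rows, key, num_partitions):
    """Split rows into buckets by the hash of the join key."""
    buckets = [[] for _ in range(num_partitions)]
    for row in rows:
        buckets[hash(row[key]) % num_partitions].append(row)
    return buckets


def partitioned_join(left, right, key, num_partitions=4):
    """Inner-join two lists of dicts, one partition at a time."""
    left_parts = partition(left, key, num_partitions)
    right_parts = partition(right, key, num_partitions)
    for left_part, right_part in zip(left_parts, right_parts):
        # Only this partition's lookup table is built at any one time.
        lookup = defaultdict(list)
        for row in right_part:
            lookup[row[key]].append(row)
        for row in left_part:
            for match in lookup.get(row[key], []):
                yield {**match, **row}
```

Rows with the same key always land in the same partition number on both sides, so joining the partitions pairwise gives the same result as one big join. More data means more or larger partitions and a longer run, rather than a single lookup table that has to fit in memory.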

Is there any possibility of converting the files to a columnar format like Parquet? Then you can use AWS EMR, and Spark should be able to handle the joins easily. You will still need to tune the query depending on the data size, cluster size, etc.

Related

What is the difference between AWS Glue ETL Job and AWS EMR?

If I had to perform ETL on a huge dataset (say 1 TB) stored in S3 as CSV files, both an AWS Glue ETL job and AWS EMR steps could be used. How, then, is AWS Glue different from AWS EMR, and which is the better solution in this case?

Most of the differences are already listed, so I'll focus on the use-case specifics.
When to choose AWS Glue
Data is huge but structured, i.e. it is tabular and in a known format (CSV, Parquet, ORC, JSON).
Lineage is required: if you need the data lineage graph while developing your ETL job, prefer developing the ETL with Glue's native libraries.
The developers don't need to tweak performance parameters such as the number of executors, per-executor memory, and so on.
You don't want the overhead of managing a large cluster, and you want to pay only for what you use.
When to use EMR
Data is huge but semi-structured or unstructured, so you can't take any benefit from the Glue catalog.
You care only about the outputs, and lineage is not required.
You need to define more memory per executor, depending on the type of job and its requirements.
You can manage the cluster easily, or you have many jobs that can run concurrently on the cluster, saving you money.
For structured data, you should use EMR when you want more Hadoop capabilities, such as Hive or Presto, for further analytics.
So it depends on your use case. Both are great services.

Glue allows you to submit ETL scripts directly in PySpark/Python/Scala, without the need for managing an EMR cluster. All setup/tear-down of infrastructure is managed.
There are also a few other managed components like Crawlers, Glue Data Catalog, etc which make it easier to work on your data.
You could use either for your use case. Glue would be faster; however, you may not have the flexibility you get with EMR.

Glue uses EMR under the hood. This is evident when you SSH into the driver of your Glue dev endpoint.
Since Glue is a managed Spark environment (or, say, a managed EMR environment), it comes with reduced flexibility. The types of workers that you can choose are limited. The number of language libraries that you can use in your Spark code is limited. Glue did not support packages like pandas and numpy until recently. Apps like Presto can't be integrated with Glue, although Athena is a good alternative to a separate Presto installation.
The main issue, however, is that Glue jobs have a cold start time of anywhere between 1 and 15 minutes.
EMR is a good choice for exploratory data analysis, but for a production environment with CI/CD, Glue seems to be the better choice.
EDIT - Glue jobs no longer have a cold start wait time

From the AWS Glue FAQ:
AWS Glue works on top of the Apache Spark environment to provide a scale-out execution environment for your data transformation jobs. AWS Glue infers, evolves, and monitors your ETL jobs to greatly simplify the process of creating and maintaining jobs.
Amazon EMR provides you with direct access to your Hadoop environment, affording you lower-level access and greater flexibility in using tools beyond Spark.
Source: https://aws.amazon.com/glue/faqs/

AWS Glue is an ETL service from AWS. AWS Glue will generate ETL code in Scala or Python to extract data from the source, transform the data to match the target schema, and load it into the target.
AWS EMR is a service where you can process large amounts of data; it is a big data platform that supports Hadoop, Spark, Flink, Presto, Hive, etc. You can spin up EC2 instances with the software listed above and build a similar ecosystem.
In your case, you want to process 1 TB of data. If you want to do computations on the same data, you can use EMR, and if you want to run analytics on the transformed data, use Glue.

Following is something that I compiled after working on analytics projects (though a lot of it depends on the use case), but generally speaking:
Criteria: Glue vs. EMR
Costs:
Glue: Comparatively costlier.
EMR: Much cheaper (due to Spot Instance functionality; there have been cases with savings of up to 50% compared with Glue costs, and even more depending on the use case).
Orchestration:
Glue: Built in (Glue Workflows & Triggers).
EMR: Through CloudWatch triggers & Step Functions.
Infra Work Required:
Glue: No infra setup, just select a worker type. However, roles & permissions are needed.
EMR: Identify the type of node needed, set up autoscaling rules, etc.
Cluster Resiliency & Robustness:
Glue: Highly resilient (AWS managed).
EMR: If Spot Instances are used, interruptions might occur with a 2-minute notification (though the system recovers automatically; for example, job times might lengthen).
Skill Sets Needed:
Glue: PySpark & intermediate AWS knowledge.
EMR: DevOps to set up & manage EMR, intermediate knowledge of orchestration via CloudWatch & Step Functions, PySpark.
Applicable Use Cases:
Glue: An attractive option when: 1. You are not worried about costs but need highly resilient infra. 2. Batch setups in which the job completes in a fixed time. 3. Short real-time streaming jobs that need to run for, let's say, hours during a day.
EMR: 1. The use case is volatile clusters, mostly used for batch processing (Day MINUS scenarios), which makes it a cost-effective solution for batch jobs. 2. An attractive option for 24/7 Spark Streaming programs. 3. You need a Hadoop ecosystem & related tools (like HDFS, Hive, Hue, Impala, etc.). 4. You need to run Flink programs, etc. 5. You need control over the infra & its tuning parameters.
Also, going back to the OP's use case of processing 1 TB of data: if it's one-time processing, Glue should suffice; if it's a once-daily batch, EMR and Glue will both be good (depending on how the job is tuned, Glue can be an attractive option); if it's a job that runs multiple times a day, then EMR is the better option (considering the balance of performance and cost).

What is the best way to copy large csv files from s3 to redshift?

I'm working on a task of copying CSV files from an S3 bucket to Redshift. I've found multiple ways to do so, but I'm not sure which one is best. Here's the scenario:
At regular intervals, multiple CSV files of around 500 MB - 1 GB each will be added to my S3 bucket. The data can contain duplicates. The task is to copy the data to a Redshift table while ensuring that no duplicate data ends up in Redshift.
Here are the ways I found which can be used:
Create an AWS Lambda function which will be triggered whenever a file is added to the S3 bucket.
Use AWS Kinesis
Use AWS Glue
I understand Lambda should not be used for jobs that take more than 5 minutes. So should I use it, or just eliminate this option?
Kinesis can handle large amounts of data, but is it the best way to do it?
I'm not familiar with Glue or Kinesis, but I have read that Glue can be slow.
If anyone can point me in the right direction, it would be really helpful.

You can definitely make it work with Lambda if you leverage Step Functions and the S3 Select option to filter subsets of the data into smaller chunks. Step Functions would manage your ETL orchestration, executing Lambdas that selectively pull from the large data file via S3 Select. Your pre-process state (see links below) could be used to determine execution requirements and then execute multiple Lambdas, even in parallel if you wish. Those Lambdas would process the subsets of data to remove duplicates and perform any other ETL operations you might require. Then you'd take the processed data and write it to Redshift. Here are links that will help you put that architecture together:
Trigger State Machine Execution from S3 Event
Manage Lambda Processing Executions and workflow state
Use S3 Select to pull subsets from large data objects
Also, here's a link to a Python ETL pipeline example for the CDK that I built. You'll see an example of an S3 event-driven lambda along with data processing and DDB or MySQL writes. Will give you an idea as to how you can build out comprehensive Lambdas for ETL operations. You would just need to add a psycopg2 layer to your deployment for Redshift. Hope this helps.
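As an illustration of the "remove dups" step that each Lambda would run on its chunk, here is a minimal Python sketch. It assumes the chunk arrives as CSV text with a header row and that a row is identified by one or more key columns; the function name and arguments are hypothetical and not part of any AWS SDK:

```python
import csv
import io


def dedupe_csv_chunk(chunk_text, key_columns, seen_keys):
    """Return rows from one CSV chunk whose key has not been seen before.

    seen_keys is a set that is updated in place, so it can be reused
    across several chunks handled by the same process.
    """
    unique_rows = []
    for row in csv.DictReader(io.StringIO(chunk_text)):
        key = tuple(row[col] for col in key_columns)
        if key in seen_keys:
            continue  # duplicate within this chunk or an earlier one
        seen_keys.add(key)
        unique_rows.append(row)
    return unique_rows
```

Note that parallel Lambda invocations do not share the `seen_keys` set, so this only removes duplicates within what one invocation sees. Duplicates across invocations, or against rows already in Redshift, would still need a final step on the database side, for example loading into a staging table and merging from there.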

Spark and continuous processing of data

I am new to Spark but I am reading up as much as I can. I have a small project where multiple data files (gzipped) will continuously land in an S3 bucket every hour. I need to be able to open and read these gzip files and consolidate/aggregate data across them, so I need to look at them as a whole. Which techniques and tools from AWS can be used, and how? Do I create interim files in an S3 folder, hold DataFrames in memory, or use some database and blow away the data after each hour? I am looking for ideas more than a piece of code.
So far, in AWS, I have written a PySpark script that reads one file at a time and creates an output file in an output S3 folder. But that leaves me with multiple output files for each hour. It would be nice if there were one file for a given hour.
From a technology perspective, I am using an EMR cluster with just 1 master and 1 core node, PySpark, and S3.
Thanks

You could use an AWS Glue ETL job written in PySpark. Glue jobs can be scheduled to run every hour.
I suggest reading the entire dataset, performing your operations, and then moving the data to another long-term storage location.
If you are working on a few GB of data, a PySpark job should complete within minutes. There's no need to keep an EMR cluster running for an hour if you'll only need it for 10 minutes. Consider using short-lived EMR clusters or a Glue ETL job.
Athena supports querying GZipped data. If you're performing some sort of analysis, maybe executing an Athena query with a time range will work?
You could also use a CTAS (Create Table As Select) statement in Athena to copy data to a new location and perform basic ETL on it at the same time.
What exactly does your PySpark code do?
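To show what "reading the entire dataset" for an hour and producing a single output could look like, here is a small local sketch using only the Python standard library. It stands in for the Spark or Glue job and assumes each gzip file is a CSV with a header row; the column names and function names are placeholders chosen for illustration:

```python
import csv
import gzip
import io
from collections import Counter


def aggregate_gzip_csvs(blobs, key_column, value_column):
    """Sum value_column per key_column across several gzip-compressed CSV payloads."""
    totals = Counter()
    for blob in blobs:
        with gzip.open(io.BytesIO(blob), mode="rt", newline="") as handle:
            for row in csv.DictReader(handle):
                totals[row[key_column]] += float(row[value_column])
    return dict(totals)


def to_csv(totals, key_column, value_column):
    """Render the combined totals as a single CSV document."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([key_column, value_column])
    for key in sorted(totals):
        writer.writerow([key, totals[key]])
    return out.getvalue()
```

The point is that all of the hour's files feed one aggregation and one output, rather than one output per input file. In PySpark the equivalent would be to read the whole hourly prefix into one DataFrame before aggregating.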

AWS data pipeline: dump data to 3 s3 nodes

I have a use case in which I want to take data from DynamoDB and apply some transformations to it. After this I want to create 3 CSV files (there will be 3 transformations of the same data) and dump them to 3 different S3 locations.
My architecture would be roughly as follows:
Is it possible to do so? I can't seem to find any documentation regarding it. If it's not possible using Data Pipeline, are there any other services which could help me with my use case?
These dumps will be scheduled daily. My other consideration was AWS Lambda, but as I understand it, Lambda is triggered by events rather than scheduled by time. Is that correct?

Yes, it is possible, but with EmrActivity rather than HiveActivity. If you look at the Data Pipeline documentation for HiveActivity, it clearly states its purpose, which does not suit your use case:
Runs a Hive query on an EMR cluster. HiveActivity makes it easier to set up an Amazon EMR activity and automatically creates Hive tables based on input data coming in from either Amazon S3 or Amazon RDS. All you need to specify is the HiveQL to run on the source data. AWS Data Pipeline automatically creates Hive tables with ${input1}, ${input2}, and so on, based on the input fields in the HiveActivity object.
Below is how your data pipeline should look. There is also a built-in template, Export DynamoDB table to S3, in the AWS Data Pipeline UI, which creates the basic structure for you; you can then extend and customize it to suit your requirements.
To your next question about Lambda: Lambda can of course be configured with event-based or schedule-based triggering, but I wouldn't recommend AWS Lambda for ETL operations, as Lambdas are time-bound and typical ETL jobs run longer than Lambda's time limit.
AWS has offerings optimized specifically for ETL, AWS Data Pipeline and AWS Glue, and I would always recommend choosing between these two. If your ETL involves data sources not managed within AWS compute and storage services, or a specialty use case that the two options above can't satisfy, then AWS Batch would be my next consideration.

Thanks amith for your answer. I have been busy for quite some time now. I did some digging after you posted your answer. It turns out we can dump the data to different S3 locations using HiveActivity as well.
This is how the data pipeline would look in that case.
But I believe writing multiple Hive activities when your input source is a DynamoDB table is not a good idea, since Hive doesn't load any data into memory. It runs all the computations against the actual table, which could degrade the table's performance. Even the documentation suggests exporting the data if you need to run multiple queries against the same data. Reference:
Enter a Hive command that maps a table in the Hive application to the data in DynamoDB. This table acts as a reference to the data stored in Amazon DynamoDB; the data is not stored locally in Hive and any queries using this table run against the live data in DynamoDB, consuming the table’s read or write capacity every time a command is run. If you expect to run multiple Hive commands against the same dataset, consider exporting it first.
In my case I needed to perform different types of aggregations on the same data once a day. Since DynamoDB doesn't support aggregations, I turned to Data Pipeline with Hive. In the end we went with AWS Aurora, which is MySQL-based.
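The "export first, then query many times" idea can be sketched locally as follows. This is only an illustration: it assumes the table has already been exported once into a list of dicts, and the field names (`status`, `customer`, `amount`) and the three aggregations are hypothetical stand-ins for the real ones:

```python
import csv
import io
from collections import defaultdict


def write_csv(header, rows):
    """Render rows as CSV text with the given header."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


def build_reports(snapshot):
    """Run three aggregations over one exported snapshot (a list of dicts)."""
    count_by_status = defaultdict(int)
    total_by_customer = defaultdict(float)
    max_by_status = {}
    for item in snapshot:
        status, customer = item["status"], item["customer"]
        amount = float(item["amount"])
        count_by_status[status] += 1
        total_by_customer[customer] += amount
        max_by_status[status] = max(amount, max_by_status.get(status, amount))
    # One CSV document per aggregation, each destined for its own location.
    return {
        "count_by_status": write_csv(["status", "count"], sorted(count_by_status.items())),
        "total_by_customer": write_csv(["customer", "total"], sorted(total_by_customer.items())),
        "max_by_status": write_csv(["status", "max_amount"], sorted(max_by_status.items())),
    }
```

The source table is read once, and all three outputs are derived from that copy, so the live table's read capacity is consumed only by the export.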

Lambda architecture on AWS: choose database for batch layer

We're building a Lambda architecture on the AWS stack. A lack of DevOps knowledge forces us to prefer AWS managed solutions over custom deployments.
Our workflow:
[Batch layer]
Kinesis Firehose -> S3 -Glue-> EMR (Spark) -Glue-> S3 views ------+
                                                                  |===> Serving layer (ECS) => Users
Kinesis -> EMR (Spark Streaming) -> DynamoDB/ElastiCache views ---+
[Speed layer]
We are already using 3 datastores: ElastiCache, DynamoDB and S3 (queried with Athena). The batch layer produces from 500,000 up to 6,000,000 rows each hour. Only the last hour's results need to be queried by the serving layer, with low-latency random reads.
None of our databases fits the batch-insert and random-read requirements. DynamoDB does not fit batch inserts: it is too expensive because of the throughput they require. Athena is MPP and, moreover, has a limit of 20 concurrent queries. ElastiCache is used by the streaming layer, and we are not sure it is a good idea to perform batch inserts there.
Should we introduce a fourth storage solution or stay with the existing ones?
Considered options:
Persist batch output to DynamoDB and ElastiCache (the part of the data that is rarely updated and can be compressed/aggregated goes to DynamoDB; frequently updated data, ~8 GB/day, goes to ElastiCache).
Introduce another database (HBase on EMR over S3, or Amazon Redshift?) as a solution.
Use S3 Select over Parquet to overcome Athena's concurrent query limits. That will also reduce query latency. But does S3 Select have any concurrent query limits? I can't find any related info.
The first option is bad because of the batch inserts into the ElastiCache instance used by streaming. Also, does keeping batch and speed layer views in the same data stores follow the Lambda architecture?
The second option is bad because it adds a fourth database, isn't it?

In this case you might want to use something like HBase or Druid; not only can they handle batch inserts and very low latency random reads, they could even replace the DynamoDB/ElastiCache component from your solution, since you can write directly to them from the incoming stream (to a different table).
Druid is probably superior for this, but as per your requirements, you'll want HBase, as it is available on EMR with the Amazon Hadoop distribution, whereas Druid doesn't come in a managed offering.
