Here is a scenario which is as follows:
There is a historical data in Hive which is keep growing on daily basis.
Daily fresh data comes in the form of batches which is merged to the data mentioned above.
As of now above process is being run on Hadoop Cluster as scheduled Hive jobs.
Above setup is to be migrated on AWS cloud.
Now problem is how to setup the cluster:
If we setup continuously running EMR cluster then cost will be too high and also most of the time cluster would be unused (as data is coming in batches once or twice in a day)
If we use scheduled EMR cluster (instance fleet) then at the processing time we will have to copy full historical data from the storage (lets say S3) to cluster for merging with fresh data, which will take too much time as its huge in volume and also after processing,again it is to be stored in S3.
Please suggest, which kind of EMR cluster will be suitable here...
Related
If I had to perform ETL on a huge dataset(say 1Tb) stored in S3 as csv files, Both AWS Glue ETL job and AWS EMR steps can be used. Then how is AWS Glue different from AWS EMR. And which is the better solution in this case.
Most of the differences are already listed so I'll focus more on the use case specific.
When to choose aws glue
Data size is huge but structured i.e. it is in the table structure and is of known format (CSV, parquet, orc, json).
Lineage is required, if you need the data lineage graph while developing your etl job prefer developing the etl using glue native libraries.
The developers don't need to tweak the performance parameters like setting number of executors, per executor memory and so on.
You don't want the overhead of managing large cluster and pay only for what you use.
When to use EMR
Data is huge but semi-structured or unstructured where you can't take any benefit from Glue catalog.
You believe only in the outputs and lineage is not required.
You need to define more memory per executor depending upon the type of your job and requirement.
You can manage the cluster easily or if you have so many jobs which can run concurrently on the cluster saving you money.
In case of structured data, you should use EMR when you want more Hadoop capabilities like hive, presto for further analytics.
So it depends on what your use case is. Both are great service.
Glue allows you to submit ETL scripts directly in PySpark/Python/Scala, without the need for managing an EMR cluster. All setup/tear-down of infrastructure is managed.
There are also a few other managed components like Crawlers, Glue Data Catalog, etc which make it easier to work on your data.
You could use either for your use-case, Glue would be faster however you may not have the flexibility you get with EMR.
Glue uses EMR under the hood. This is evident when you ssh into the driver of your Glue dev-endpoint.
Now since Glue is a managed spark environment or say managed EMR environment, it comes with reduced flexibility. The type of workers that you can chose is limited. The number of language libraries that you can use in your spark code is limited. Glue did not support packages like pandas, numpy until recently. Apps like presto cant be integrated with Glue although Athena is a good alternative to a separate presto installation.
The main issue however is that Glue jobs have a cold start time from anywhere between 1 minute to 15 minutes.
EMR is a good choice for exploratory data analysis but for a production environment with CI/CD, Glue seems to be the better choice.
EDIT - Glue jobs no longer have a cold start wait time
From the AWS Glue FAQ:
AWS Glue works on top of the Apache Spark environment to provide a scale-out execution environment for your data transformation jobs. AWS Glue infers, evolves, and monitors your ETL jobs to greatly simplify the process of creating and maintaining jobs.
Amazon EMR provides you with direct access to your Hadoop environment, affording you lower-level access and greater flexibility in using tools beyond Spark.
Source: https://aws.amazon.com/glue/faqs/
AWS Glue is a ETL service from AWS. AWS Glue will generate ETL code in Scala or Python to extract data from the source, transform the data to match the target schema, and load it into the target
AWS EMR is a service where you can process large amount of data , its a supporting big data platform .It Supports Hadoop,Spark,Flink,Presto, Hive etc.You can spin up EC2 with the above listed softwares and make a similar ecosystem.
In your case , you want to process 1 TB of data .Now if you want do computations on the same data , you can use EMR and if you want to run the analytics on the transformed data , use Glue .
Following is something that i compiled post working on analytics projects (though a lot of it depends on use case) - but generally speaking :
Criteria
Glue
EMR
Costs
Comparatively Costlier
Much Cheaper (Due to Spot Instance Functionality, There have been cases when there are saving of upto 50% over top-off glue costs - even more depending upon the use case)
Orchestration
Inbuilt(Glue WorkFlows & Triggers)
Through Cloud Watch Triggers & Step Functions
Infra Work Required
No Infra Setup - Select Worker Type However,Roles & Permissions are needed
Identify the Type of Node Needed & Setup Autoscaling rules etc
Cluster Resiliency & Robustness
Highly Resilient (AWS MANAGED)
If Spot Instances are used then interruption might occur with 2 min notification(Though the System Recovers Automatically - For eg - Job Times might elongate)
Skill Sets Needed
PySpark & Intermediate AWS Knowledge
DevOps to Setup EMR & Manage, Intermediate Knowledge of Orchestration via Cloud Watch & Step Function, PySpark
Applicable Use Cases
Attractive Option in event: 1. You are not worried about Costs but need highly resilient infra2. Batch Setups wherein the Job might complete in fixed time3. Short RealTime Streaming Jobs which need to run for let's say hrs during a day
1. Use Case is of Volatile Clusters - Mostly Used for Batch Processing (Day MINUS Scenarios) - Thereby making a costs effective solution for Batch Jobs2. Attractive option for 24/7 Spark Streaming Programs3. You Need a Hadoop Ecosystem & Related tools (like HDFS, HIVE, HUE, Impala etc)4. You need to run Flink Programs etc5. You need control over Infra & It's tuning parameters
Also going back to OP's use case of 1TB of data processing. If its one time processing Glue should suffice, if its a Daily Once Batch EMR & GLUE will both be good (depending on how job is tuned Glue can be an attractive option), if its a multiple time daily job - then EMR is a better option (Considering balance of performance and cost)
What are the advantages/disadvantages of using 'plain' Hadoop cluster Hortonworks with components HDFS, Hive, Oozie... vs some services on AWS like S3/Athena/Lambda?
my scenario data flow:
source data come from iot sensors in order to analytics and sometimes I need to query by deviceid & datetime with Hive/Athena ... (all conditions have been partitioned)
Disadvantages of installing Hadoop yourself in any cloud provider is obviously cost and a little bit of maintenance.
For example, HDFS disk gets full, add more volumes. You need to upgrade and patch software yourself. You're charged every machine hour, for every machine and turning off just the namenode of the cluster will render it unusable for a period of time; if you do not have any business use-case for running the cluster overnight, you're wasting money
Therefore the advantage of storing data in cloud is.
While slower than HDFS, object store in S3 is significantly cheaper and scalable
Triggering actions via Lambda or another scheduler, can actually happen faster than Oozie launching a YARN job. Your code isn't tied to Hadoop, either, so your functions should be able to be smaller, although you may be limited in language options. If you combine lambda or other filesystem triggers with container schedulers like Kubernetes, you can open lots of options.
Querying your data any time you want with tools like AWS Glue and Athena, decouples the maintenance of a Hive metastore and a compatible query engine, whether that's Hive, Presto, Impala, Drill, etc. Anyone with AWS access can run an Athena query without needing to know an address of your HiveServer and how to appropriately connect to it (for example, you should secure it and make it highly available)
I am doing some pricing comparison between AWS Glue against AWS EMR so as to chose between EMR & Glue.
I have considered 6 DPUs (4 vCPUs + 16 GB Memory) with ETL Job running for 10 minutes for 30 days. Expected crawler requests is assumed to be 1 million above free tier and is calculated at $1 for the 1 million additional requests.
On EMR I have considered m3.xlarge for both EC2 & EMR (pricing at $0.266 & $0.070 respectively) with 6 nodes, running for 10 minutes for 30 days.
On calculating for a month, I see that AWS Glue works out to be around $14.64, whereas for EMR it works out to be around $10.08. I have not taken into account other additional expenses such as S3, RDS, Redshift, etc. & DEV Endpoint which is optional, since my objective is to compare ETL job price benefits
Looks like EMR is cheaper when compared to AWS Glue. Is the EMR pricing correct, can someone please suggest if anything missing? I have tried the AWS price calculator for EMR, but confused, and not clear if normalized hours are billed into it.
Regards
Yuva
Yes, EMR does work out to be cheaper than Glue, and this is because Glue is meant to be serverless and fully managed by AWS, so the user doesn't have to worry about the infrastructure running behind the scenes, but EMR requires a whole lot of configuration to set up. So it's a trade off between user friendliness and cost, and for more technical users EMR can be the better option.
#user2889316 - Did you check my question wherein I had provided a comparison numbers?
Also please note Glue is roughly about 0.44 per hour / DPU for a job. I don't think you will have any AWS Glue JOB that is expected to running throughout the day? Are you talking about the Glue Dev end point or the Job?
A AWS Glue job requires a minimum of 2 DPUs to run, which means 0.88 per hour, which I think roughly about $21 per day? This is only for the GLUE job and there are additional charges such as S3, and any database / connection charges / crawler charges, etc.
Corresponding instance for EMR is m3.xlarge & its charges are (pricing at $0.266 & $0.070 respectively). This would be approximately less than $16 for 2 instance per day? plus other S3, database charges, etc. Am considering 2 EMR instances against the default DPUs for AWS Glue job.
Hope this would give you an idea.
Thanks
If your infrastructure doesn't need drastic scaling (and is mostly with fixed configuration), use EMR. But if it is needed, Glue is better choice as it is serverless. By just changing DPUs, your infrastructure is scaled. However in EMR, you have to decide on cluster type, number of nodes, auto-scaling rules. For each change, you will need to change cluster creation script, test it, deploy it - basically add overhead of standard release cycle for change. With change in infra config, you may want to change spark config to optimize jobs accordingly. So time to make new version release is higher with change in infra configuration. If you add high configuration to start, it will cost more. If you add low configuration to start, you need frequent changes in script.
Having said that, AWS Glue has fixed infra configuration for each DPU - e.g. 16GB memory per core. If your ETL demands more memory per core, you may have to shift to EMR. However, if your ETL is designed such a way that it will not exceed 11GB driver memory with 1 executor or 5.5GB with 2 executors (e.g. Take additional data volume in parallel on new core or divide volume in 5gb/11gb batch and run in for loop on same core), Glue is right choice.
If your ETL is complex and all jobs are going to keep cluster busy throughout day, I would recommend to go with EMR with dedicated devops team to manage EMR infra.
If you use Spot instance of EMR instead of On-Demand it will cost 1/3rd of on-Demand price and will turn out to be much cheaper. AWS Glue doesn't have that pricing benefits.
i'm looking at using AWS Redshift to let users submit queries against the old archived data which isn't available in my web page.
the total data i'm dealing with across all my users is a couple of terabytes. the data is already in an s3 bucket, split up into files by week. most requests won't deal with more than a few files totaling 100GB.
to keep costs down should i use snapshots and delete our cluster when not in use or should i have a smaller cluster which doesn't hold all of the data and only COPY data from S3 into a temporary table when running a query?
If you are just doing occasional queries where cost is more important than speed, you could consider using Amazon Athena, which can query data stored in Amazon S3. (Only in some AWS regions at the moment.) You are only charged for the amount of data read from disk.
To gain an appreciation for making Athena even better value, see: Analyzing Data in S3 using Amazon Athena
Amazon Redshift Spectrum can perform a similar job to Athena but requires an Amazon Redshift cluster to be running.
All other choices are really a trade-off between cost and access to your data. You could start by taking a snapshot of your Amazon Redshift database and then turning it off at night and on the weekends. Then, have a script that can restore it automatically for queries. Use fewer nodes to reduce costs -- this will make queries slower, but that doesn't seem to be an issue for you.
Does anyone know how fast the copy speed is from Amazon S3 to Redshift?
I only want to use RedShift for about an hour a day, to run updates on Tabelau reports. The queries being run are always on the same database, but I need to run them each night to take in to account new data that's come in that day.
I don't want to keep a cluster going 24x7 just to be used for one hour a day, but the only way that I can see of doing this is to Import the entire database each night into Redshift (I don't think you can't suspend or pause a cluster). I have no idea what the copy speed is so I have no idea if its going to be relatively quick to copy a 10GB file in to Redshift every night.
Assuming its feasible, my thinking is to push the incremental changes on SQL Server dbase in to S3. Using Cloud Formation, I automate the provisioning of a Redshift cluster at 1am for 1 hour, import the dbase from S3, and schedule Tableau to run its queries between that time and get its results. I keep an eye on how long the queries take, and If I need longer than an hour I just amend the cloud formation.
In this way I hope to keep a really 'lean' Tableau server by outsourcing all the ETL to Redshift, and buying only what I consume on Redshift.
Please feel free to critique my solution, or out right blow it out of the water. Otherwise If the consensus of the answer is that importing is relevantly quick, It gives me a thumbs up I'm headed in the right direction with this solution.
Thanks for any assistance!
Redshift loads from S3 are very quick, however Redshift clusters do not come up / tear down very quickly at all. In the above example most of your time (and money) would be spent waiting for the cluster to come up, existing data to load, refreshed data to unload and cluster to tear down again.
In my opinion it would be better to use another approach for your overnight processing. I would suggest either:
For a couple of TB, InfiniDB on a largish EC2 instance with the database stored on an EBS volume.
For many TBs, Amazon EMR with the data stored on S3. If you don't want to get into Hadoop too much you can use Xplenty/Syncsort Ironcluster/etc. to orchestrate the Hadoop element.
While this question was written three years ago and it wasn't available at that time, a suitable solution to this now would be to use Amazon Athena, which allows on-demand SQL querying of data held in S3. This works on a pay-per-query model, and is intended for ad-hoc and "quick" workloads like this.
Behind the scenes, Athena uses Presto and Elastic MapReduce, but the only required knowledge for a developer/analyst in practice is SQL.
Tableau also now has a built-in Athena connector (as of 10.3).
More on Athena here: https://aws.amazon.com/athena/
You can presort data you are keeping on S3. It will make Vacuum much faster.
This is the classic problem with Redshift... if you looking different way .. Microsoft recently announced new service called SQL Data Warehouse (Uses PDW Engine) I think they want to compete directly with Redshift.... Most interesting concept here is ... Familiar SQL Server Query language and Toolset (including Stored proc support). They also decoupled Storage and Compute so you can have 1 GB storage but 10 Compute node for intensive query and vice versa.... they are claiming that compute node start in few seconds and when you resize cluster you don't have to take it offline. Cloud Data Warehouse Battle getting hot :)