Tool/Ways to schedule Amazon's Elastic MapReduce jobs - mapreduce

I use EMR to create new instances and process the jobs and then shutdown instances.
My requirement is to schedule jobs in periodic fashion. One of the easy implementation can be to use quartz to trigger EMR jobs. But looking at longer run I am interested in using out of box mapreduce scheduling solution. My question is that is there any out of box scheduling feature provided by EMR or AWS-SDK, which i can use for my requirement? I can see there is scheduling in Auto scaling, but i want to schedule EMR jobflow instead.

There is Apache Oozie Workflow Scheduler for Hadoop to do just that.
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by
time (frequency) and data availabilty.
Oozie is integrated with the rest of the Hadoop stack supporting
several types of Hadoop jobs out of the box (such as Java map-reduce,
Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system
specific jobs (such as Java programs and shell scripts).
Oozie is a scalable, reliable and extensible system.
Here is a simple example of Elastic Map Reduce bootstrap actions for configuring apache oozie : https://github.com/lila/emr-oozie-sample
But to let you know oozie is a bit complicated and if and only if you have a lot of jobs to be scheduled/monitored/maintained then only you shall go for oozie or else just create a bunch of cron jobs if you have say just 2 or 3 jobs to be scheduled periodically.
You may also look into and explore simple workflow from Amazon.

Related

Triggering an alert when multiple dataflow jobs run in parallel in GCP

I am using google cloud dataflow to execute some resource intensive dataflow jobs. And at a given time, my system must execute no more than 2 jobs in parallel.
Since each job is quite resource intensive, I am looking for a way to trigger an alert when more than 2 dataflow jobs are running.
I tried implementing a custom_count which increments after the start of each job. But custom_couter only display after the job has executed. And it might be too late to trigger an alert by then.
You could modify the quota dataflow.googleapis.com/job_count of the project to be limited to 1, and no two jobs could run parallel in that project. The quota is at the project level, it would not affect other projects.
Another option is to use an GCP monitoring system that is observing the running Dataflow jobs. You can e.g. use Elastic Cloud (available via Marketplace) to load all relevant Metrics and Logs. Elastic can visualize and alert on every state you are interested in.
I found this terraform project very helpful in order to get started with that approach.

AWS Managed Apache airflow (MWAA) or AWS Batch for simple batch job flow

I have simple workflow to design where there will be 4 batch job running one after another sequentially and each jobs is running in multi node master/slave architecture.
My question is AWS Batch can manage simple workflow using job queue and can manage multi-node parallel job as well.
Now, should I use AWS Batch or Airflow ?
With Airflow , I can use KubernetesPodOperator and job will run in Kubernetes cluster. But Airflow does not inherently support multi node parallel jobs.
Note: The batch job is written in java using Spring batch remote partitioning framework that support master/slave architecture.
AWS Batch would fit your requirements better.
Airflow is a workflow orchestration tool, it's used to host many jobs that have multiple tasks each, with each task being light on processing. Its most common use is for ETL, but in your use case you would have an entire Airflow ecosystem for just a single job, which (unless you manually broke it out to smaller tasks) would not run multi-threaded.
AWS Batch on the other hand is for batch processing, and you can more finely-tune the servers/nodes that you want your code to execute on. I think in your use case it would also work out cheaper than Airflow too.

Want to clear Big picture about the AWS Glue

I want to clear big picture about the aws Glue regarding some of the following aspects.
How AWS Glue prepare and provision its infrastructure? However it's serverless but how does it manage it?
How it's using apache spark and hadoop to solve so many ETL jobs at a time, Almost jobs of hundreds of AWS Glue customers from every region.
Thanks
AWS Glue uses EMR underneath. It spawns a new cluster with required number of executors (depending on configured DPU) when a new job starts. However, to improve cold start time they have a buffer of already provisioned EMR clusters for the most common number of DPUs. To manage all this they have a set of automated services that monitor state of each cluster, start a new ones etc.

What is the difference between Google Cloud Dataflow and Google Cloud Dataproc?

I am using Google Data Flow to implement an ETL data ware house solution.
Looking into google cloud offering, it seems DataProc can also do the same thing.
It also seems DataProc is little bit cheaper than DataFlow.
Does anybody know the pros / cons of DataFlow over DataProc
Why does google offer both?
Yes, Cloud Dataflow and Cloud Dataproc can both be used to implement ETL data warehousing solutions.
An overview of why each of these products exist can be found in the Google Cloud Platform Big Data Solutions Articles
Quick takeaways:
Cloud Dataproc provides you with a Hadoop cluster, on GCP, and access to Hadoop-ecosystem tools (e.g. Apache Pig, Hive, and Spark); this has strong appeal if you are already familiar with Hadoop tools and have Hadoop jobs
Cloud Dataflow provides you with a place to run Apache Beam based jobs, on GCP, and you do not need to address common aspects of running jobs on a cluster (e.g. Balancing work, or Scaling the number of workers for a job; by default, this is automatically managed for you, and applies to both batch and streaming) -- this can be very time consuming on other systems
Apache Beam is an important consideration; Beam jobs are intended to be portable across "runners," which include Cloud Dataflow, and enable you to focus on your logical computation, rather than how a "runner" works -- In comparison, when authoring a Spark job, your code is bound to the runner, Spark, and how that runner works
Cloud Dataflow also offers the ability to create jobs based on "templates," which can help simplify common tasks where the differences are parameter values
Here are three main points to consider while trying to choose between Dataproc and Dataflow
Provisioning
Dataproc - Manual provisioning of clusters
Dataflow - Serverless. Automatic provisioning of clusters
Hadoop Dependencies
Dataproc should be used if the processing has any dependencies to tools in the Hadoop ecosystem.
Portability
Dataflow/Beam provides a clear separation between processing logic and the underlying execution engine. This helps with portability across different execution engines that support the Beam runtime, i.e. the same pipeline code can run seamlessly on either Dataflow, Spark or Flink.
This flowchart from the google website explains how to go about choosing one over the other.
https://cloud.google.com/dataflow/images/flow-vs-proc-flowchart.svg
Further details are available in the below link
https://cloud.google.com/dataproc/#fast--scalable-data-processing
Same reason as why Dataproc offers both Hadoop and Spark: sometimes one programming model is the best fit for the job, sometimes the other. Likewise, in some cases the best fit for the job is the Apache Beam programming model, offered by Dataflow.
In many cases, a big consideration is that one already has a codebase written against a particular framework, and one just wants to deploy it on the Google Cloud, so even if, say, the Beam programming model is superior to Hadoop, someone with a lot of Hadoop code might still choose Dataproc for the time being, rather than rewriting their code on Beam to run on Dataflow.
The differences between Spark and Beam programming models are quite large, and there are a lot of use cases where each one has a big advantage over the other. See https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison .
Cloud Dataflow is a serverless data processing service that runs jobs written using the Apache Beam libraries. When you run a job on Cloud Dataflow, it spins up a cluster of virtual machines, distributes the tasks in your job to the VMs, and dynamically scales the cluster based on how the job is performing. It may even change the order of operations in your processing pipeline to optimize your job.
So use cases are ETL (extract, transfer, load) job between various data sources / data bases. For example load big files from Cloud Storage into BigQuery.
Streaming works based on subscription to PubSub topic, so you can listen to real time events (for example from some IoT devices) and then further process.
Interesting concrete use case of Dataflow is Dataprep. Dataprep is cloud tool on GCP used for exploring, cleaning, wrangling (large) datasets. When you define actions you want to do with your data (like formatting, joining etc), job is run under the hood on Dataflow.
Cloud Dataflow also offers the ability to create jobs based on "templates," which can help simplify common tasks where the differences are parameter values.
Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.
Super fast — Without using Dataproc, it can take from five to 30
minutes to create Spark and Hadoop clusters on-premises or through
IaaS providers. By comparison, Dataproc clusters are quick to start,
scale, and shutdown, with each of these operations taking 90 seconds
or less, on average. This means you can spend less time waiting for
clusters and more hands-on time working with your data.
Integrated — Dataproc has built-in integration with other Google
Cloud Platform services, such as BigQuery, Cloud Storage, Cloud
Bigtable, Cloud Logging, and Cloud Monitoring, so you have more than
just a Spark or Hadoop cluster—you have a complete data platform.
For example, you can use Dataproc to effortlessly ETL terabytes of
raw log data directly into BigQuery for business reporting.
Managed — Use Spark and Hadoop clusters without the assistance of an
administrator or special software. You can easily interact with
clusters and Spark or Hadoop jobs through the Google Cloud Console,
the Cloud SDK, or the Dataproc REST API. When you're done with a
cluster, you can simply turn it off, so you don’t spend money on an
idle cluster. You won’t need to worry about losing data, because
Dataproc is integrated with Cloud Storage, BigQuery, and Cloud
Bigtable.
Simple and familiar — You don’t need to learn new tools or APIs to
use Dataproc, making it easy to move existing projects into Dataproc
without redevelopment. Spark, Hadoop, Pig, and Hive are frequently
updated, so you can be productive faster.
If you want to migrate from your existing Hadoop/Spark cluster to the cloud, or take advantage of so many well-trained Hadoop/Spark engineers out there in the market, choose Cloud Dataproc; if you trust Google's expertise in large scale data processing and take their latest improvements for free, choose DataFlow.
Here are three main points to consider while trying to choose between Dataproc and Dataflow
Provisioning
Dataproc - Manual provisioning of clusters
Dataflow - Serverless. Automatic provisioning of clusters
Hadoop Dependencies
Dataproc should be used if the processing has any dependencies to tools in the Hadoop ecosystem.
Portability
Dataflow/Beam provides a clear separation between processing logic and the underlying execution engine. This helps with portability across different execution engines that support the Beam runtime, i.e. the same pipeline code can run seamlessly on either Dataflow, Spark or Flink.
Cloud Dataproc and Cloud Dataflow can both be used for data processing, and there’s overlap in their batch and streaming capabilities. You can decide which product is a better fit for your environment.
Cloud Dataproc is good for environments dependent on specific Apache big data components:
- Tools/packages
- Pipelines
- Skill sets of existing resources
Cloud Dataflow is typically the preferred option for green field environments:
- Less operational overhead
- Unified approach to development of batch or streaming pipelines
- Uses Apache Beam
- Supports pipeline portability across Cloud Dataflow, Apache Spark, and Apache Flink as runtimes.
See more details here https://cloud.google.com/dataproc/
Pricing comparision:
DataProc
Dataflow
If you want to calculate and compare cost of more GCP resources, please refer this url https://cloud.google.com/products/calculator/
One of the other important difference is:
Cloud Dataproc:
Data mining and analysis in datasets of known size
Cloud Dataflow:
Manage datasets of unpredictable size
see
Cloud Dataflow
Is a serverless data processing service that runs jobs written using
the Apache Beam libraries.
When you run a job on Cloud Dataflow it gets operated like this:
It spins up a cluster of virtual machines
Distributes the tasks in your job to the VMs, and dynamically scale the cluster based on how the job is performing
Dataflow may even change the order of operations in your processing pipeline to optimize your job.
It supports both batch and streaming Jobs. So use cases are ETL (extract, transfer, load) jobs between various data sources/databases.
For example, load big files from Cloud Storage into Big Query.
Streaming works based on subscription to Pub-Sub topic, so you can listen to real-time events (for example from some IoT devices) and then further process the data.
An interesting concrete use case of Dataflow is Data prep.
Data prep is a cloud tool on GCP used for exploring, cleaning, and wrangling (large) datasets. When you define the actions you want to perform on your data (like formatting, joining etc.), the job run under the hood on Dataflow.
Cloud Dataflow also offers the ability to create jobs based on "templates" which can help simplify common tasks where the differences are parameter values.
Data proc
Is a managed Spark and Hadoop service that lets you take advantage of
open-source data tools for batch processing, querying, streaming, and
machine learning.
Data proc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.

How to increase the number of MapReduce jobs on WSO2 BAM

I am using WSO2 BAM 2.4.1 to run Hive Analytic scripts and by default, it only kicks off 1 MapReduce job as seen below. Need helps about how to configure WSO2 BAM to run multiple jobs instead.
Thanks!
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Standalone WSO2BAM doesn't start hadoop server internally, rather it uses the hadoop as library, and do a direct JVM call to run the hadoop jobs. Hence you can't have actual parallel job execution with many jobs launched with standalone BAM. To achieve this you need to configure external hadoop cluster and submit the job remotely to the cluster. See here for the configuration of hadoop cluster.