I am using WSO2 BAM 2.4.1 to run Hive analytic scripts and, by default, it only kicks off 1 MapReduce job, as seen below. I need help with configuring WSO2 BAM to run multiple jobs instead.
Thanks!
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Standalone WSO2 BAM doesn't start a Hadoop server internally; rather, it uses Hadoop as a library and makes a direct JVM call to run the Hadoop jobs. Hence you can't have actual parallel job execution with many jobs launched with standalone BAM. To achieve this, you need to configure an external Hadoop cluster and submit the jobs remotely to that cluster. See here for the configuration of the Hadoop cluster.
At our place, we use AWS services for all our data infrastructure and service needs. Our Hive tables are external tables, and the actual data files are stored in S3. We use Apache Spark for data ingestion and transformation. We have an ever-running EMR cluster with 1 master node (always running) and 1 core node (always running); whenever data processing happens, additional core nodes and task nodes are added and then removed once processing is done. Our EC2 instances have EBS volumes for temporary storage/scratch space for the executors.
Given this context, I am wondering why we need HDFS in our EMR cluster at all. I also see that the HDFS NameNode service is always running on the master node, and DataNode services are running on the core node. They do have some blocks they are managing, but I am not able to find which files they belong to. Also, the total size of all the blocks is very small (~2 GB).
Software versions used
Python version: 3.7.0
PySpark version: 2.4.7
EMR version: 5.32.0
If you know the answer to this question, can you please help me understand this need for HDFS? Please let me know if you have any questions for me.
HDFS in EMR is a built-in component that is provided to store secondary information, such as credentials when your Spark executors need to authenticate themselves to read a resource, or log files. In my personal experience, I have used it as a staging area to store partial results during a long computation, so that if something went wrong in the middle I would have a checkpoint from which to resume execution instead of starting the computation from scratch. It is strongly discouraged to store the final result on HDFS.
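To illustrate that staging-area pattern, here is a minimal PySpark sketch (the S3 path, HDFS path and the DataFrame itself are hypothetical) that persists a partial result to the cluster's HDFS so a later run can resume from it:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-example").getOrCreate()

# Hypothetical intermediate result of an expensive computation
intermediate_df = spark.read.parquet("s3://my-bucket/raw/").groupBy("key").count()

# Persist the partial result on the cluster-local HDFS as a checkpoint
intermediate_df.write.mode("overwrite").parquet("hdfs:///tmp/checkpoints/stage1")

# A later run (or a retry) can resume from the checkpoint instead of recomputing
resumed_df = spark.read.parquet("hdfs:///tmp/checkpoints/stage1")
```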
Spark on EMR runs on YARN, which itself uses HDFS. The Spark executors run inside of YARN containers, and Spark distributes the Spark code and config by placing it in HDFS and distributing it to all of the nodes running the Spark executors in the YARN containers. Additionally, the Spark Event Logs from each running and completed application are stored in HDFS by default.
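If you want to see which of those defaults your cluster is actually using, a quick sketch like the one below (the config keys are standard Spark/Hadoop ones; the values depend on your EMR setup) prints them from inside a PySpark session:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-usage-inspect").getOrCreate()

# Where Spark writes event logs (on EMR this typically points at HDFS by default)
print(spark.conf.get("spark.eventLog.dir", "not set"))

# The staging directory YARN uses to distribute Spark jars/config to executors
print(spark.conf.get("spark.yarn.stagingDir", "default: .sparkStaging on the default filesystem"))

# The default filesystem itself (hdfs://... on a typical EMR cluster)
print(spark.sparkContext._jsc.hadoopConfiguration().get("fs.defaultFS"))
```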
I have a simple workflow to design in which 4 batch jobs run one after another sequentially, and each job runs in a multi-node master/slave architecture.
As I understand it, AWS Batch can manage a simple workflow using a job queue and can manage multi-node parallel jobs as well.
Now, should I use AWS Batch or Airflow?
With Airflow, I can use the KubernetesPodOperator so the job runs in a Kubernetes cluster. But Airflow does not inherently support multi-node parallel jobs.
Note: the batch jobs are written in Java using the Spring Batch remote partitioning framework, which supports a master/slave architecture.
AWS Batch would fit your requirements better.
Airflow is a workflow orchestration tool. It's used to host many jobs that have multiple tasks each, with each task being light on processing. Its most common use is for ETL, but in your use case you would have an entire Airflow ecosystem for just a single job, which (unless you manually broke it out into smaller tasks) would not run multi-threaded.
AWS Batch, on the other hand, is built for batch processing, and you can more finely tune the servers/nodes that you want your code to execute on. I think in your use case it would also work out cheaper than Airflow.
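If you go the AWS Batch route, sequencing the four jobs is straightforward with job dependencies. A rough boto3 sketch (the queue and job-definition names are placeholders) would look like this:

```python
import boto3

batch = boto3.client("batch", region_name="us-east-1")

previous_job_id = None
for name in ["job-1", "job-2", "job-3", "job-4"]:
    kwargs = {
        "jobName": name,
        "jobQueue": "my-job-queue",            # placeholder queue name
        "jobDefinition": "my-job-definition",  # placeholder job definition
    }
    # Chain the jobs so each one starts only after the previous one succeeds
    if previous_job_id is not None:
        kwargs["dependsOn"] = [{"jobId": previous_job_id}]
    response = batch.submit_job(**kwargs)
    previous_job_id = response["jobId"]
    print(f"submitted {name}: {previous_job_id}")
```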
I have separate Spark and Airflow servers, and I don't have the Spark binaries on the Airflow servers. I am able to use the SSHOperator and run the Spark jobs in cluster mode perfectly well. I would like to know which would be better in the long run for submitting PySpark jobs: the SSHOperator or the SparkSubmitOperator. Any help would be appreciated in advance.
Below are the pros and cons of using the SSHOperator vs the SparkSubmitOperator in Airflow, followed by my recommendation.
SSHOperator: this operator SSHes into the remote Spark server and executes spark-submit on the remote cluster (a sketch follows the pros/cons below).
Pros:
No additional configuration required in the airflow workers
Cons:
Tough to maintain the spark configuration parameters
Need to open SSH (port 22) from the Airflow servers to the Spark servers, which raises security concerns (even if you are on a private network, SSH-based remote execution is not a best practice).
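A minimal SSHOperator sketch (the Airflow connection id, spark-submit options and application path are placeholders), assuming Airflow 2 with the SSH provider installed:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="spark_submit_via_ssh",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The spark-submit command string quickly becomes hard to maintain as configs grow
    submit_job = SSHOperator(
        task_id="submit_pyspark_job",
        ssh_conn_id="spark_edge_node",  # placeholder Airflow connection
        command=(
            "spark-submit --master yarn --deploy-mode cluster "
            "--num-executors 4 --executor-memory 4g "
            "/opt/jobs/my_job.py"  # placeholder application path
        ),
    )
```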
SparkSubmitOperator: this operator performs the spark-submit operation in a cleaner way, but you need additional infrastructure configuration (a sketch follows the pros/cons below).
Pros:
As mentioned above, it gives you structured Spark configuration and needs no additional effort to invoke spark-submit
Cons:
Need to install Spark on all Airflow servers.
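For comparison, a rough SparkSubmitOperator sketch (the connection id and application path are placeholders; I am assuming Airflow 2 with the Apache Spark provider installed), where the Spark settings become structured arguments instead of one long shell string:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_submit_native",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Spark settings are keyword arguments rather than a hand-built command line
    submit_job = SparkSubmitOperator(
        task_id="submit_pyspark_job",
        conn_id="spark_default",            # placeholder Spark connection
        application="/opt/jobs/my_job.py",  # placeholder application path
        conf={"spark.submit.deployMode": "cluster"},
        num_executors=4,
        executor_memory="4g",
    )
```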
Apart from these 2 options, I have listed 2 additional options.
Install a Livy server on the Spark clusters and use the pylivy Python library to interact with the Spark servers from Airflow. Refer: https://pylivy.readthedocs.io/en/stable/
If your Spark clusters are on AWS EMR, I would encourage using the EmrAddStepsOperator (a sketch follows below).
Refer here for additional discussion: To run Spark Submit programs from a different cluster (1**.1*.0.21) in airflow (1**.1*.0.35). How to connect remotely other cluster in airflow
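If you do run on EMR, a rough EmrAddStepsOperator sketch (the cluster id, connection id and script location are placeholders; the exact import path may differ between Amazon provider versions) could look like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator

# One EMR step that runs spark-submit via command-runner.jar
SPARK_STEPS = [
    {
        "Name": "run_pyspark_job",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                "s3://my-bucket/jobs/my_job.py",  # placeholder script location
            ],
        },
    }
]

with DAG(
    dag_id="spark_submit_via_emr_steps",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    add_step = EmrAddStepsOperator(
        task_id="add_spark_step",
        job_flow_id="j-XXXXXXXXXXXXX",  # placeholder EMR cluster id
        aws_conn_id="aws_default",
        steps=SPARK_STEPS,
    )
```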
SparkSubmitOperator is a specialized operator. That is, it should make writing tasks for submitting Spark jobs easier and the code itself more readable and maintainable. Therefore, I would use it if possible.
In your case, you should consider whether the effort of modifying the infrastructure so that you can use the SparkSubmitOperator is worth the benefits mentioned above.
Is there a command that lists all the Spark jobs running on the cluster?
I am new to this technology, and we have multiple users running spark-submit jobs on an AWS cluster. Is there a way to list all the running Spark jobs?
Thank you!
Use the Spark REST API. It can be invoked against an active Spark Web UI or against the History Server. Of course, as cricket_007 said, you can also list jobs in the UI. These UIs and REST services run on all cluster types.
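For example, a small sketch against the Spark monitoring REST API (the host is a placeholder; the History Server typically listens on port 18080 and an active driver UI on 4040):

```python
import requests

# Point this at an active driver UI (:4040) or the History Server (:18080)
BASE_URL = "http://spark-history-server.example.com:18080"

# List applications known to the UI / History Server
apps = requests.get(f"{BASE_URL}/api/v1/applications").json()
for app in apps:
    print(app["id"], app["name"])

# Drill into the jobs of one application
if apps:
    jobs = requests.get(f"{BASE_URL}/api/v1/applications/{apps[0]['id']}/jobs").json()
    for job in jobs:
        print(job["jobId"], job["status"], job["name"])
```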
I use EMR to create new instances and process the jobs and then shutdown instances.
My requirement is to schedule jobs periodically. One easy implementation would be to use Quartz to trigger the EMR jobs. But looking at the longer run, I am interested in using an out-of-the-box MapReduce scheduling solution. My question is: is there any out-of-the-box scheduling feature provided by EMR or the AWS SDK that I can use for my requirement? I can see there is scheduling in Auto Scaling, but I want to schedule EMR job flows instead.
There is Apache Oozie Workflow Scheduler for Hadoop to do just that.
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.

Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.

Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system-specific jobs (such as Java programs and shell scripts).

Oozie is a scalable, reliable and extensible system.
Here is a simple example of Elastic MapReduce bootstrap actions for configuring Apache Oozie: https://github.com/lila/emr-oozie-sample
But be aware that Oozie is a bit complicated. Go for Oozie only if you have a lot of jobs to be scheduled/monitored/maintained; if you have just 2 or 3 jobs to schedule periodically, simply create a bunch of cron jobs instead (a sketch follows below).
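For the cron route, a small boto3 sketch like the one below (instance types, release label, job JAR and data locations are placeholders) launches a transient cluster, runs one step and terminates the cluster afterwards, which matches your create-process-shutdown pattern; trigger it from cron or Quartz for periodic runs:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a transient cluster, run one custom-JAR step, and shut it down afterwards
response = emr.run_job_flow(
    Name="scheduled-jobflow",
    ReleaseLabel="emr-5.32.0",  # placeholder release label
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate the cluster when steps finish
    },
    Steps=[
        {
            "Name": "my_mapreduce_step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "s3://my-bucket/jars/my-job.jar",  # placeholder job JAR
                "Args": ["s3://my-bucket/input/", "s3://my-bucket/output/"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```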
You may also look into and explore Amazon Simple Workflow Service (SWF).