I am trying to submit multiple Hive queries using CLI and I want the queries to run concurrently. However, these queries are running sequentially.
Can somebody tell me how to invoke a number of Hive queries so that they do in fact run concurrently?
This is not a Hive limitation; it has to do with your Hadoop configuration. By default, Hadoop uses a simple FIFO queue for job submission and execution. You can, however, configure a different scheduling policy so that multiple jobs can run at once.
Here's a nice blog post from Cloudera back in 2008 on the matter: Job Scheduling in Hadoop
Pretty much any scheduler other than the default will support concurrent jobs, so take your pick!
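As a minimal sketch (assuming your Hadoop scheduler, e.g. the Fair Scheduler, is already configured to allow concurrent jobs), you can simply launch several `hive -e` invocations without waiting for each one to finish; the queries below are placeholders:

```python
import subprocess

# Placeholder queries; each `hive -e` call submits its own job, and whether the
# resulting jobs actually overlap is decided by the Hadoop scheduler, not Hive.
queries = [
    "SELECT COUNT(*) FROM table_a",
    "SELECT COUNT(*) FROM table_b",
]

# Launch one Hive CLI process per query without waiting for the previous one.
procs = [subprocess.Popen(["hive", "-e", q]) for q in queries]

# Wait for all of them and report their exit codes.
for proc in procs:
    proc.wait()
    print("hive exited with code", proc.returncode)
```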
I am trying to build a multi-node parallel job in AWS Batch running an R script. My R script independently runs multiple statistical models for multiple users. Hence, I want to split and distribute this job to run in parallel on a cluster of several servers for faster execution. My understanding is that at some point I have to prepare a containerized version of my R application code using a Dockerfile pushed to ECR. My question is:
Should the parallel logic be placed inside the R code, while using a single Dockerfile? If yes, how does Batch know how to split my job (into how many chunks)? Is the for-loop in the R code enough?
Or should I define the parallel logic somewhere in the Dockerfile, saying something like: container1 runs the models for users 1-5, container2 runs the models for users 6-10, etc.?
Could you please share some ideas or code on that topic for better understanding? Much appreciated.
AWS Batch does not inspect or change anything in your container; it just runs it. So you would need to handle the distribution of the work within the container itself.
Since these are independent processes (they don't communicate with each other over MPI, etc.), you can leverage AWS Batch array jobs. Batch multi-node parallel (MNP) jobs are for tightly coupled workloads that need inter-instance or inter-GPU communication using Elastic Fabric Adapter.
Your application code in the container can leverage the AWS_BATCH_JOB_ARRAY_INDEX environment variable to process a subset of users. Note that AWS_BATCH_JOB_ARRAY_INDEX is zero-based, so you will need to account for that if your user numbering or naming scheme is different. You can see an example of how to use the index in the AWS Batch docs.
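For illustration, here is a minimal sketch of that pattern in Python (the question is about R, but the idea carries over directly); the user list, chunk size, and model function are hypothetical placeholders:

```python
import os

# Hypothetical user list; in practice this might come from a file or a database.
ALL_USERS = ["user{}".format(i) for i in range(1, 51)]
USERS_PER_CHILD = 5  # how many users each array child job handles

def run_models_for(user):
    """Placeholder for the per-user statistical models."""
    print("running models for", user)

# AWS_BATCH_JOB_ARRAY_INDEX is 0 for the first child job, 1 for the second, etc.
index = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))

# Each child processes its own slice of the user list.
start = index * USERS_PER_CHILD
for user in ALL_USERS[start:start + USERS_PER_CHILD]:
    run_models_for(user)
```

Submitting the job with an array size of 10 would then cover users 1-50, five per child job, without any parallel logic in the Dockerfile itself.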
AWS Glue looks promising, but I'm having a challenge with the development cycle time. If I edit PySpark scripts through the AWS console, it takes several minutes to run even on a minimal test dataset. This makes it hard to iterate quickly if I have to wait 3-5 minutes just to see whether I called the right method on glueContext or understood a particular DynamicFrame behavior.
What techniques would allow me to iterate faster?
I suppose I could develop Spark code locally, and deploy it to Glue as an execution framework. But if I need to test code with Glue-specific extensions, I am stuck.
For developing and testing scripts, Glue has Development Endpoints, which you can use with notebooks like Zeppelin installed either on a local machine or on an Amazon EC2 instance (other options are a REPL shell and PyCharm Professional).
Please don't forget to remove the endpoint when you are done with testing since you pay for it even if it's idling.
I keep the PySpark code in one file and the Glue code in another file; we use Glue only for reading and writing data. We do test-driven development using pytest on a local machine, so there is no need for a dev endpoint or Zeppelin. Once all syntactic or business-logic bugs are fixed in the PySpark code, end-to-end testing is done using Glue. We also wrote a shell script that uploads the latest code to the S3 bucket from which the Glue job is run.
https://github.com/fatangare/aws-glue-deploy-utility
https://github.com/fatangare/aws-python-shell-deploy
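As a rough illustration of that separation (module, function, and column names are made up), the transform logic lives in a plain PySpark file with no Glue imports, and a pytest test exercises it locally:

```python
# transforms.py - plain PySpark, no Glue imports, so it can be unit-tested locally
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def rows_per_user(df: DataFrame) -> DataFrame:
    """Hypothetical business-logic transform: count rows per user_id."""
    return df.groupBy("user_id").agg(F.count("*").alias("row_count"))


# test_transforms.py - run locally with pytest, no dev endpoint needed
from pyspark.sql import SparkSession

def test_rows_per_user():
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([("a",), ("a",), ("b",)], ["user_id"])
    result = {r["user_id"]: r["row_count"] for r in rows_per_user(df).collect()}
    assert result == {"a": 2, "b": 1}
```

The Glue job script then only reads the sources into DynamicFrames, converts them to DataFrames, calls the tested functions, and writes the results back out.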
I managed to connect to Cloud SQL via JdbcIO:
DataSourceConfiguration.create("com.mysql.jdbc.Driver","jdbc:mysql://google/?cloudSqlInstance=::&socketFactory=com.google.cloud.sql.mysql.SocketFactory&user=&password=")
This works; however, the batch writes take between 2 and 5 minutes for 1000 records, which is terrible. I have tried different networks to see if this was related, and the results were consistent.
Anyone have any ideas?
Where are you initializing this connection? If you are doing this inside your DoFn, it will create latency as the socket is built up and torn down on each bundle.
Have a look at DoFn.Setup; it provides a clean way to initialize resources that will be persisted across bundle calls.
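The original pipeline uses the Java JdbcIO, but the same idea in the Beam Python SDK looks roughly like this (the pymysql driver, table, and credentials are placeholders, not the asker's actual setup):

```python
import apache_beam as beam
import pymysql  # placeholder driver; the question above uses the Java JdbcIO

class WriteToCloudSql(beam.DoFn):
    def setup(self):
        # Called once per DoFn instance (per worker), not once per bundle or
        # element, so the connection is reused across bundle calls.
        self.conn = pymysql.connect(host="127.0.0.1", user="user",
                                    password="secret", database="mydb")

    def process(self, row):
        # Reuse the already-open connection for each record.
        with self.conn.cursor() as cur:
            cur.execute("INSERT INTO my_table (id, value) VALUES (%s, %s)", row)
        self.conn.commit()
        yield row

    def teardown(self):
        # Called when the DoFn instance is discarded; close the connection here.
        self.conn.close()
```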
I am developing a web app with the following components :
Apache Spark running on a cluster with 3 nodes (Spark 1.4.0, Hadoop 2.4, and YARN)
Django Web App server
The Django app will create 'on demand' Spark jobs (they can be concurrent jobs, depending on how many users are using the app).
I would like to know if there is any way to submit Spark jobs from the Python code in Django. Can I integrate PySpark into Django? Or maybe I can call the YARN API directly to submit jobs?
I know that I can submit jobs to the cluster using the spark-submit script, but I'm trying to avoid using it (because it would require executing a shell command from the code, which is not very safe).
Any help would be much appreciated.
Thanks a lot,
JG
A partial, untested answer: Django is a web framework, so it's difficult for it to manage long-running jobs (more than 30 seconds), which is probably the case for your Spark jobs.
So you'll need an asynchronous job queue, such as Celery. It's a bit of a pain (not that bad, but still), but I would advise you to start with that.
You would then have:
Django to launch/monitor jobs
RabbitMQ/Celery asynchronous job queue
custom Celery tasks, using PySpark to launch the Spark jobs (see the sketch below)
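A rough sketch of what such a Celery task might look like, assuming PySpark is importable in the Celery worker's environment and the worker node is configured as a YARN client; the broker URL, paths, and job logic are placeholders:

```python
# tasks.py - a hypothetical Celery task that runs a PySpark job on YARN
from celery import Celery
from pyspark import SparkConf, SparkContext

app = Celery("sparktasks", broker="amqp://guest@localhost//")  # RabbitMQ broker

@app.task
def run_spark_job(input_path, output_path):
    # Each task creates its own SparkContext against the YARN cluster
    # ("yarn-client" is the master string for Spark 1.x).
    conf = SparkConf().setMaster("yarn-client").setAppName("django-on-demand-job")
    sc = SparkContext(conf=conf)
    try:
        # Placeholder job: a word count over the input path.
        counts = (sc.textFile(input_path)
                    .flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
        counts.saveAsTextFile(output_path)
    finally:
        sc.stop()
```

A Django view would then just call run_spark_job.delay("hdfs:///data/in", "hdfs:///data/out") and return immediately, while the Celery worker runs the Spark job in the background.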
There's a project on github called Ooyala's job server:
https://github.com/ooyala/spark-jobserver.
This allows you to submit Spark jobs to YARN via an HTTP request.
In Spark 1.4.0+, support was added to monitor job status via an HTTP request as well.
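Once a job server like this is running, submitting and monitoring jobs from Django reduces to plain HTTP calls. A rough sketch with the requests library follows; the host, port, app name, class path, and config are placeholders, and the exact endpoints and response format should be checked against the spark-jobserver README:

```python
import requests

JOBSERVER = "http://jobserver-host:8090"  # placeholder host/port

# Upload the application jar once (placeholder file and app name).
with open("my-spark-app.jar", "rb") as jar:
    requests.post("{}/jars/my-app".format(JOBSERVER), data=jar)

# Submit a job by naming the uploaded app and the job's main class.
resp = requests.post(
    "{}/jobs".format(JOBSERVER),
    params={"appName": "my-app", "classPath": "com.example.MyJob"},
    data="input.path = hdfs:///data/in",  # job config passed in the request body
)
job_id = resp.json().get("result", {}).get("jobId")

# Poll the job's status over HTTP.
status = requests.get("{}/jobs/{}".format(JOBSERVER, job_id)).json()
print(status)
```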
I am using AWS EMR clusters to run Hive. I want to be able to enforce that certain tables, such as reference tables, should never be empty after initial creation, and if they are found to be empty, to throw an error (or log a message) and stop processing.
Does anyone know of any ways to achieve this?
Thanks
You could install a cron job on the master node that periodically runs a check against your Hive table. If the table is found to be empty, you can terminate the cluster, stop the job flow, or take some other action. These actions can be executed using the EMR CLI tools: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-commands.html
These commands can also be run using the AWS SDK inside a Java program, in case you want all of this as a Java program instead of a script.
You have not specified whether the cluster is persistent or transient. If it is persistent, this script can run outside the master node.
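A minimal sketch of what that periodic check could look like, assuming it runs somewhere the Hive CLI is available; the table name, cluster id, and the action taken are placeholders:

```python
import subprocess
import sys

TABLE = "my_reference_table"  # placeholder table name

# `hive -S -e` runs a single query in silent mode, so stdout is just the result.
out = subprocess.check_output(
    ["hive", "-S", "-e", "SELECT COUNT(*) FROM {}".format(TABLE)]
)
count = int(out.decode().strip().split()[-1])

if count == 0:
    print("ERROR: {} is empty, stopping processing".format(TABLE), file=sys.stderr)
    # Alternatively, terminate the cluster here, e.g. by shelling out to the
    # EMR/AWS CLI (the cluster id below is a placeholder):
    # subprocess.call(["aws", "emr", "terminate-clusters",
    #                  "--cluster-ids", "j-XXXXXXXXXXXXX"])
    sys.exit(1)
```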