I am developing a web app with the following components:
Apache Spark running on a cluster with 3 nodes (Spark 1.4.0, Hadoop 2.4 and YARN)
A Django web app server
The Django app will create Spark jobs on demand (they can be concurrent, depending on how many users are using the app).
I would like to know if there is any way to submit Spark jobs from the Python code in Django. Can I integrate PySpark into Django? Or could I call the YARN API directly to submit jobs?
I know that I can submit jobs to the cluster using the spark-submit script, but I'm trying to avoid it (it would mean executing a shell command from the code, which is not very safe).
Any help would be much appreciated.
Thanks a lot,
JG
A partial, untested answer: Django is a web framework, so it's difficult to manage long-running jobs (more than 30 seconds) from it, which is probably what your Spark jobs are.
So you'll need an asynchronous job queue, such as Celery. It's a bit of a pain (not that bad, but still), but I would advise you to start with that.
You would then have:
Django to launch/monitor jobs
a RabbitMQ/Celery asynchronous job queue
custom Celery tasks that use PySpark and launch the Spark jobs (see the sketch below)
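As a rough sketch of that last piece, a Celery task can create a SparkContext against YARN directly. This assumes pyspark is importable on the Celery worker and that the worker host is configured as a YARN client; the task name and paths are purely illustrative:

# tasks.py -- illustrative sketch; assumes pyspark is on the worker's PYTHONPATH
from celery import shared_task
from pyspark import SparkConf, SparkContext

@shared_task
def run_word_count(input_path):
    # yarn-client keeps the driver inside the Celery worker process
    conf = SparkConf().setMaster("yarn-client").setAppName("django-spark-job")
    sc = SparkContext(conf=conf)
    try:
        return (sc.textFile(input_path)
                  .flatMap(lambda line: line.split())
                  .map(lambda word: (word, 1))
                  .reduceByKey(lambda a, b: a + b)
                  .count())
    finally:
        sc.stop()

Django would then trigger it with run_word_count.delay("hdfs:///some/path"). Keep in mind that only one SparkContext can live in a process at a time, so truly concurrent jobs need separate worker processes.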
There's a project on GitHub called Ooyala's spark-jobserver:
https://github.com/ooyala/spark-jobserver.
It allows you to submit Spark jobs to YARN via HTTP requests.
In Spark 1.4.0+, support was also added for monitoring job status via HTTP.
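For illustration, driving it from Python with requests could look roughly like the following; the host, port, app name, class path and response handling are placeholders and may differ by jobserver version, so check the project's README for the exact endpoints:

# Illustrative only: host, port, app name and class path are placeholders,
# and the exact endpoints/response shape depend on the jobserver version.
import requests

JOBSERVER = "http://jobserver-host:8090"

# 1. Upload the application jar once.
with open("my-spark-job.jar", "rb") as jar:
    requests.post(JOBSERVER + "/jars/my-app", data=jar)

# 2. Submit a job; the request body carries the job's config.
resp = requests.post(
    JOBSERVER + "/jobs",
    params={"appName": "my-app", "classPath": "com.example.MyJob"},
    data="input.path = /data/in",
)
job_id = resp.json()["result"]["jobId"]   # shape may vary by version

# 3. Poll the job's status until it finishes.
status = requests.get(JOBSERVER + "/jobs/" + job_id).json()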
I am creating a Django application for a school project. I want to schedule jobs (every workday at 9:00 and 17:00).
I am trying to do that with Celery right now, but I got badly stuck on it, and as the deadline is in sight I want to fall back to an alternative: just a cron job. I think a plain cron job works fine, but the user should be able to edit the times of the cron jobs through the Django web application (so without logging in over SSH and editing the crontab manually).
Is this possible? I can't find anything about it on the internet.
You need the django-celery-beat plugin, which adds a "Periodic tasks" model to the Django admin where you can manage cron schedules for your tasks.
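If you'd rather create the schedule from code than through the admin, something along these lines should work (the task path "myapp.tasks.my_job" is a placeholder for your own Celery task):

# Creates a weekday 09:00/17:00 schedule programmatically; the task path
# "myapp.tasks.my_job" is a placeholder for your own Celery task.
from django_celery_beat.models import CrontabSchedule, PeriodicTask

schedule, _ = CrontabSchedule.objects.get_or_create(
    minute="0",
    hour="9,17",
    day_of_week="1-5",
    day_of_month="*",
    month_of_year="*",
)
PeriodicTask.objects.get_or_create(
    crontab=schedule,
    name="Weekday job at 09:00 and 17:00",
    task="myapp.tasks.my_job",
)

A view that lets the user change the times then only has to update the CrontabSchedule row.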
As an alternative, if you really do not want to run background tasks, you can create Django management commands and use a library like python-crontab to add/modify/remove cron jobs on the system.
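A rough sketch of that approach with python-crontab, where the command string and the comment used to tag the entry are placeholders for your own setup:

# Adds or refreshes a system cron entry for a Django management command.
# The command string and the "myapp-report" comment are placeholders.
from crontab import CronTab

cron = CronTab(user=True)                  # current user's crontab
cron.remove_all(comment="myapp-report")    # drop any entry we created earlier
job = cron.new(
    command="/path/to/venv/bin/python /path/to/manage.py send_report",
    comment="myapp-report",
)
job.setall("0 9,17 * * 1-5")               # weekdays at 09:00 and 17:00
cron.write()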
I need to access the Spark History Server so that I can performance-tune a slow Spark job.
I was looking for a link within DSX but could not find one, so I have opened up the Spark service in the Bluemix console and have navigated to the Spark History Server directly from there (the Job History link).
Is there a way to access the Spark History Server directly from DSX?
It seems that you have to access the Spark History Server by logging in to the Bluemix console, as I have been doing.
There is a feature request here asking to make this process much more integrated: https://datascix.uservoice.com/forums/387207-general/suggestions/17707459-give-access-to-spark-ui-to-monitor-jobs-without-ha - please vote for this!
I'm running a PySpark 2 job on EMR 5.1.0 as a step. Even after the script is done, with a _SUCCESS file written to S3 and the Spark UI showing the job as completed, EMR still shows the step as "Running". I've waited for over an hour to see if Spark was just trying to clean itself up, but the step never shows as "Completed". The last thing written in the logs is:
INFO MultipartUploadOutputStream: close closed:false s3://mybucket/some/path/_SUCCESS
INFO DefaultWriterContainer: Job job_201611181653_0000 committed.
INFO ContextCleaner: Cleaned accumulator 0
I didn't have this problem with Spark 1.6. I've tried a bunch of different hadoop-aws and aws-java-sdk jars to no avail.
I'm using the default Spark 2.0 configuration, so I don't think anything else, such as metadata, is being written. Also, the size of the data doesn't seem to have an impact on this problem.
If you aren't doing so already, you should stop your Spark context:
sc.stop()
Also, if you are watching the Spark web UI in a browser, you should close it, as it can sometimes keep the Spark context alive. I recall seeing this on the Spark dev mailing list, but I can't find the JIRA for it.
We experienced this problem and resolved it by running the job in cluster deploy mode using the following spark-submit option:
spark-submit --deploy-mode cluster
It had something to do with the fact that, in client mode, the driver runs on the master instance and the spark-submit process was getting stuck even though the Spark context had been closed. This caused the instance controller to keep polling for the process, since it never received the completion signal. Running the driver on one of the instance nodes using the above option doesn't seem to have this problem. Hope this helps.
I experienced the same issue with Spark on AWS EMR and solved it by calling sys.exit(0) at the end of my Python script. The same worked for a Scala program with System.exit(0).
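Putting the script-side fixes from the answers above together, the tail end of a PySpark 2 step script might look like this sketch (the app name and S3 paths are placeholders):

# Sketch only: the app name and S3 paths are placeholders.
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my-emr-step").getOrCreate()
try:
    df = spark.read.parquet("s3://mybucket/some/input/")
    df.write.mode("overwrite").parquet("s3://mybucket/some/path/")
finally:
    spark.stop()   # let YARN/EMR see the application finish
sys.exit(0)        # make sure the Python process itself exits cleanly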
I am trying to submit multiple Hive queries using the CLI and I want them to run concurrently. However, the queries are running sequentially.
Can somebody tell me how to invoke a number of Hive queries so that they do in fact run concurrently?
This is not because of Hive; it has to do with your Hadoop configuration. By default, Hadoop uses a simple FIFO queue for job submission and execution. You can, however, configure a different scheduling policy so that multiple jobs can run at once.
Here's a nice blog post from Cloudera back in 2008 on the matter: Job Scheduling in Hadoop
Pretty much any scheduler other than the default will support concurrent jobs, so take your pick!
I'm building a simple website with Django that requires constant monitoring of text-based data from another website; that's just the way it has to be.
How could I run this service on my web host alongside Django? Would I have to start a separate app and run it via SSH so that it updates the database used by Django, or is there an easier/better way?
You could use Celery to schedule a job that reads data from the other website and does whatever you want with it.
As an alternative to Celery, you could also create a cron job that executes a custom django-admin command (sketched at the end of this answer). That would give you full access to your Django install and the ORM. The downside is that cron's smallest time resolution is 1 minute, so if you need it to be real-time, that won't work.
If you do need real-time updates, then writing a Python daemon might be a better option.
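If you go with the cron route, the custom command itself is just a standard Django management command, for example (the app, model and URL below are placeholders):

# myapp/management/commands/update_remote_data.py -- names are illustrative.
import requests
from django.core.management.base import BaseCommand

from myapp.models import RemoteSnapshot   # placeholder model


class Command(BaseCommand):
    help = "Fetch the monitored page and store a snapshot"

    def handle(self, *args, **options):
        text = requests.get("https://example.com/data.txt", timeout=30).text
        RemoteSnapshot.objects.create(body=text)
        self.stdout.write("Saved %d characters" % len(text))

Cron then just runs python manage.py update_remote_data as often as you need.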