I am using AWS EMR clusters to run Hive. I want to enforce that certain tables, such as reference tables, are never empty after initial creation, and if they are found to be empty, to throw an error (or log a message) and stop processing.
Does anyone know of any ways to achieve this?
Thanks
You could install a cron job on the master node that periodically runs a check against your Hive table. If the table is found to be empty, you can terminate the cluster, stop the job flow, or take some other action. These actions can be executed using the EMR CLI tools: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-commands.html
These commands can also be run using the AWS SDK inside a Java program, in case you want all of this as a Java program instead of a script.
You have not specified whether the cluster is persistent or transient. If it is persistent, this script can also run outside the master.
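A minimal sketch of such a check, assuming the Hive CLI is available where the cron job runs and using a placeholder table name; it runs a count through hive -e and stops with an error when the table is empty (you could instead call the EMR CLI/SDK at that point to terminate the cluster):

import subprocess
import sys

TABLE = "my_reference_table"  # placeholder: substitute your reference table

# Run a count query through the Hive CLI; -S suppresses informational output.
result = subprocess.run(
    ["hive", "-S", "-e", f"SELECT COUNT(*) FROM {TABLE};"],
    capture_output=True, text=True, check=True,
)
row_count = int(result.stdout.strip().splitlines()[-1])

if row_count == 0:
    # Log and stop processing; terminating the cluster or cancelling the
    # job flow would go here instead if that is the desired action.
    print(f"ERROR: {TABLE} is empty", file=sys.stderr)
    sys.exit(1)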
I have the following need - the code needs to call some APIs, get some data, and store it in a database (a flat file will do for our purpose). As the APIs give access to a huge number of records, we want to split the work into 30 parts, each part scraping a certain section of the data from the APIs. We want these 30 scrapers to run on 30 different machines, and for that we have a Python program that does the following:
Call the API and get the data, based on parameters (which part of the API to call).
Dump it to a local flat file.
And then later, we will merge the output from the 30 files into one giant DB.
Question is - which AWS tool to use for our purpose? We can use an EC2 instance, but we have to keep the EC2 console open on our desktop where we connect to it to run the Python program, and it is not feasible to keep 30 connections open on my laptop. It is very complicated to get remote desktop on those machines, so logging in there, starting the job, and then disconnecting is also not feasible.
What we want is this - start the tasks (one each on 30 machines), let them run and finish by themselves, and if possible notify me (or I can check on their health periodically myself).
Can anyone guide me which AWS tool suits our purpose, and how?
"We can use EC2 instance, but we have to keep the EC2 console open on
our desktop where we connect to it to run the Python program"
That just means you are running the script the wrong way; you need to look into running it as a service (for example under nohup or as a systemd unit) so it keeps running after you disconnect.
In general, you should look into queueing these tasks up in SQS and then triggering either EC2 Auto Scaling or Lambda functions, depending on whether your script can run within the Lambda runtime restrictions.
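As a rough sketch of the SQS side, assuming a hypothetical queue named scrape-tasks and a placeholder scrape_part() standing in for your existing scraper, the producer enqueues the 30 part numbers and each worker instance (running as a service) drains the queue:

import boto3

def scrape_part(part):
    # Placeholder: call the API for this part and dump the data to a flat file.
    ...

sqs = boto3.resource("sqs")
queue = sqs.get_queue_by_name(QueueName="scrape-tasks")  # hypothetical queue name

# Producer (run once, from anywhere): enqueue one message per part.
for part in range(30):
    queue.send_message(MessageBody=str(part))

# Worker (run on each instance): pull parts until the queue is drained.
while True:
    messages = queue.receive_messages(MaxNumberOfMessages=1, WaitTimeSeconds=20)
    if not messages:
        break
    for message in messages:
        scrape_part(int(message.body))
        message.delete()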
This seems like a good application for Step Functions. Step Functions allows you to orchestrate multiple Lambda functions, Glue jobs, and other services into a business process. You could write Lambda functions that call the API endpoints and store the results in S3. Once all the data is gathered, your step function could trigger a Lambda function, a Glue job, or something else that loads the data into your database. Step Functions helps with error handling and retries and allows easy monitoring of your process.
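A hedged sketch of one such Lambda handler, assuming a hypothetical fetch_chunk() wrapping your API call, an S3 bucket named api-scrape-results, and an event that carries the part number (for example from a Step Functions Map state):

import json
import boto3

s3 = boto3.client("s3")

def fetch_chunk(part):
    # Placeholder: call the API and return the records for this part.
    ...

def handler(event, context):
    part = event["part"]              # supplied in the state machine input
    records = fetch_chunk(part)
    s3.put_object(
        Bucket="api-scrape-results",  # hypothetical bucket
        Key=f"raw/part-{part}.json",
        Body=json.dumps(records),
    )
    return {"part": part, "status": "done"}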
I have an AWS Glue PySpark job which is long-running after a certain command. In the log, it is not writing anything after that command, not even a simple "print hello" statement.
How can I debug an AWS Glue PySpark job which is long-running and not even writing logs? The job is not throwing any error; it shows a running status in the console.
AWS Glue is based on Apache Spark, which means that until an action is called there will not be any actual execution. So if you put print statements in between and see them in the logs, that doesn't mean that your job has executed up to that point. As your job is long-running, check the article by AWS which explains Debugging Demanding Stages and Straggler Tasks. Also, this is a good blog to take a look at.
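A minimal PySpark illustration of that laziness (the output path is made up): the first few lines return immediately because they only build the execution plan, and nothing actually runs on the cluster until the write action at the end.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(10**9)                       # lazy: no job runs yet
doubled = df.selectExpr("id * 2 AS doubled")  # still lazy: just extends the plan
print("hello")                                # the driver prints this immediately

# Only this action triggers the distributed work; a long-running job
# "hangs" here even though earlier print statements already appeared.
doubled.write.mode("overwrite").parquet("s3://my-bucket/output/")  # hypothetical path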
I'm running Apache Spark 2.4.5 on an EMR-5.30 cluster. My driver node is doing some work to retrieve data from an external service, so I can put it into a text file and distribute a copy to all worker nodes. There were a few possible solutions I came up with for distributing files to all worker nodes, but realized they wouldn't work out:
Use EMR bootstrap actions to submit an EMR step that runs a shell script. This runs on all worker nodes, but the EMR cluster won't have the data necessary to create the file at this point in time.
Use org.apache.hadoop.mapreduce.Job to run a distributed job across all worker nodes, creating the file before the task is run by accessing HDFS. I was working with this approach, but we've put up some guards in our service to restrict access to HDFS.
Use sparkContext.addFile(path) to later retrieve this file with sparkContext.textFile(path). This would be nice, but it isn't possible, as the external dependency that needs the text file is coded to look for the file locally and wouldn't have access to any sparkContext.
I've been looking around for a while but can't seem to find other options, any tips?
Referencing my previous answer: https://stackoverflow.com/a/64458117/7094520
TL;DR: you can use the AWS CLI or SDK to do so. Something like (Python):
emr_client = boto3.client('emr')
ssm_client = boto3.client('ssm')
You can get the list of worker instances using emr_client.list_instances
and finally send a command to each of these instances using ssm_client.send_command
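Put together, a rough sketch of that flow (the cluster ID and the command being sent are placeholders, and the instances need the SSM agent plus an instance profile that allows Run Command):

import boto3

emr_client = boto3.client('emr')
ssm_client = boto3.client('ssm')

# List the worker (CORE/TASK) instances of the cluster.
instances = emr_client.list_instances(
    ClusterId='j-XXXXXXXXXXXXX',             # placeholder cluster ID
    InstanceGroupTypes=['CORE', 'TASK'],
)
instance_ids = [i['Ec2InstanceId'] for i in instances['Instances']]

# Send a shell command to every worker, e.g. copying the driver-generated
# file down from S3 to a local path (placeholder command).
ssm_client.send_command(
    InstanceIds=instance_ids,
    DocumentName='AWS-RunShellScript',
    Parameters={'commands': ['aws s3 cp s3://my-bucket/my-file.txt /tmp/my-file.txt']},
)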
Running a PySpark 2 job on EMR 5.1.0 as a step. Even after the script is done, with a _SUCCESS file written to S3 and the Spark UI showing the job as completed, EMR still shows the step as "Running". I've waited for over an hour to see if Spark was just trying to clean itself up, but the step never shows as "Completed". The last thing written in the logs is:
INFO MultipartUploadOutputStream: close closed:false s3://mybucket/some/path/_SUCCESS
INFO DefaultWriterContainer: Job job_201611181653_0000 committed.
INFO ContextCleaner: Cleaned accumulator 0
I didn't have this problem with Spark 1.6. I've tried a bunch of different hadoop-aws and aws-java-sdk jars to no avail.
I'm using the default Spark 2.0 configurations so I don't think anything else like metadata is being written. Also the size of the data doesn't seem to have an impact on this problem.
If you aren't doing so already, you should stop your Spark context at the end of the job:
sc.stop()
Also, if you are watching the Spark web UI in a browser, you should close it, as it sometimes keeps the Spark context alive. I recall seeing this on the Spark dev mailing list, but can't find the JIRA for it.
We experienced this problem and resolved it by running the job in cluster deploy mode using the following spark-submit option:
spark-submit --deploy-mode cluster
It has something to do with the fact that, when running in client mode, the driver runs on the master instance and the spark-submit process gets stuck despite the Spark context closing. This causes the instance controller to continuously poll for the process, as it never receives the completion signal. Running the driver on one of the instance nodes using the above option doesn't seem to have this problem. Hope this helps.
I experienced the same issue with Spark on AWS EMR and I solved it by calling sys.exit(0) at the end of my Python script. The same worked for a Scala program with System.exit(0).
I am trying to submit multiple Hive queries using the CLI, and I want the queries to run concurrently. However, these queries are running sequentially.
Can somebody tell me how to invoke a number of Hive queries so that they do in fact run concurrently?
This is not because of Hive; it has to do with your Hadoop configuration. By default, Hadoop uses a simple FIFO queue for job submission and execution. You can, however, configure a different scheduling policy so that multiple jobs can run at once.
Here's a nice blog post from Cloudera back in 2008 on the matter: Job Scheduling in Hadoop
Pretty much any scheduler other than the default will support concurrent jobs, so take your pick!
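On the submission side you also need to launch the queries from separate Hive sessions rather than one after another in a single CLI session; a rough sketch using Python's subprocess (the query files are placeholders):

import subprocess

# Placeholder query files; each one gets its own Hive CLI session.
query_files = ['query1.hql', 'query2.hql', 'query3.hql']

# Launch all sessions without waiting on each other.
procs = [subprocess.Popen(['hive', '-f', path]) for path in query_files]

# Wait for all of them; whether the underlying jobs actually run in
# parallel depends on the Hadoop scheduler configuration described above.
for proc in procs:
    proc.wait()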