I created a MapReduce job to ingest data into Elasticsearch nodes. Below is the command line I used to run this MR job for data ingestion.
hadoop jar <myjob.jar> <MainClass> inputdirs outputdir esnode1,esnode2,esnode3,esnode4
Running it with these command line arguments works fine.
But when I try to schedule it, I run into challenges: the Oozie MR action accepts only two arguments, inputdir and outputdir. I am not sure where to provide the 3rd argument (i.e. esNodes) in the Oozie workflow so that the MR action runs as I designed it.
Is it possible? If not, do I have to write a regular Java action in Oozie?
You can define the arguments for your MR job in the workflow.xml file. More information about workflow.xml can be found here.
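For example, a minimal workflow.xml sketch along these lines; the input/output properties are the standard Hadoop ones, and I'm assuming your job can read the Elasticsearch nodes from a configuration property (for instance es.nodes, as used by the elasticsearch-hadoop connector), with all names and paths as placeholders:

<workflow-app name="es-ingest-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="es-ingest"/>
    <action name="es-ingest">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <!-- your mapper/reducer/format properties go here as well -->
                <!-- the first two "arguments": input and output directories -->
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDirs}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
                <!-- the third "argument": the ES nodes, passed as a job property -->
                <property>
                    <name>es.nodes</name>
                    <value>esnode1,esnode2,esnode3,esnode4</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>ES ingest failed</message>
    </kill>
    <end name="end"/>
</workflow-app>

If your driver really must receive the nodes on the command line rather than from the configuration, the alternative is an Oozie java action, whose <arg> elements let you pass all three arguments to the driver's main().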
I have to check the status of a workflow, i.e. whether it completed within its scheduled time or not, using a SQL query. I also need to send an email with the workflow status, like 'completed within time' or 'not completed within time'. Please help me out.
You can do this using either Option 1 or Option 2 below.
Option 1: You need access to the repository metadata database.
Create a post-session shell script. You can pass the workflow name and a benchmark value (in seconds) to the script.
Get the workflow run time from the repository metadata using SQL like this:
SELECT WORKFLOW_NAME,(END_TIME-START_TIME)*24*60*60 diff_seconds
FROM
REP_WFLOW_RUN
WHERE WORKFLOW_NAME='myWorkflow'
You can then compare the value above with the benchmark value, and the shell script can send a mail depending on the outcome.
You will need to create another workflow to run this check against the first workflow.
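A minimal sketch of such a script, assuming an Oracle repository reachable via sqlplus (the connection details, mail recipient, and handling of multiple runs are placeholders you will need to adapt):

#!/bin/sh
# Usage: check_wf_runtime.sh <workflow_name> <benchmark_seconds>
WF_NAME=$1
BENCHMARK=$2

# Pull the run time in seconds from the repository view.
RUNTIME=$(sqlplus -s repuser/reppass@repdb <<EOF
SET HEADING OFF FEEDBACK OFF PAGESIZE 0
SELECT ROUND((END_TIME-START_TIME)*24*60*60)
FROM REP_WFLOW_RUN
WHERE WORKFLOW_NAME='${WF_NAME}';
EOF
)
RUNTIME=$(echo "$RUNTIME" | xargs)   # trim whitespace from the sqlplus output

if [ "$RUNTIME" -le "$BENCHMARK" ]; then
    STATUS="completed within time"
else
    STATUS="not completed within time"
fi

echo "Workflow ${WF_NAME}: ${STATUS} (${RUNTIME}s, benchmark ${BENCHMARK}s)" \
    | mail -s "Workflow runtime check: ${WF_NAME}" team@example.com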
Option 2: If you do not have access to the repository metadata, follow the same steps as above, minus the metadata SQL.
Use pmcmd GetWorkflowDetails to check the status, start time, and end time of a workflow:
pmcmd GetWorkflowDetails -sv service -d domain -f folder myWorkflow
You can then grep the start and end times from that output and compare them with your benchmark value. The catch is the timestamp format; you need a little bit of scripting here.
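Something along these lines, as a sketch only; the exact labels in the pmcmd output and the timestamp format vary by version and locale, so treat the grep patterns and the date parsing as assumptions to adjust:

#!/bin/sh
BENCHMARK=3600   # benchmark in seconds (placeholder)

OUT=$(pmcmd GetWorkflowDetails -sv service -d domain -f folder myWorkflow)

# Pull the timestamps out of the pmcmd output (the labels are assumptions).
START=$(echo "$OUT" | grep -i "Start time" | cut -d: -f2- | tr -d '[]' | xargs)
END=$(echo "$OUT" | grep -i "End time" | cut -d: -f2- | tr -d '[]' | xargs)

# GNU date: convert to epoch seconds and compare against the benchmark.
DIFF=$(( $(date -d "$END" +%s) - $(date -d "$START" +%s) ))

if [ "$DIFF" -le "$BENCHMARK" ]; then
    STATUS="completed within time"
else
    STATUS="not completed within time"
fi
echo "myWorkflow: ${STATUS} (${DIFF}s)" | mail -s "Workflow runtime check" team@example.com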
I've honed my transformations in Dataprep, and am now trying to run the Dataflow job directly using the gcloud CLI.
I've exported my template and template metadata file, and am trying to run them using gcloud dataflow jobs run, passing in the input and output locations as parameters.
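For reference, such an invocation takes roughly this shape (the job name, bucket paths, and parameter names below are placeholders; the real parameter names come from the exported template metadata file):

gcloud dataflow jobs run my-dataprep-job \
    --gcs-location gs://my-bucket/templates/my-template \
    --parameters inputLocations=gs://my-bucket/input/,outputLocations=gs://my-bucket/output/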
I'm getting the error:
Template metadata regex '[ \t\n\x0B\f\r]*\{[ \t\n\x0B\f\r]*((.|\r|\n)*".*"[ \t\n\x0B\f\r]*:[ \t\n\x0B\f\r]*".*"(.|\r|\n)*){17}[ \t\n\x0B\f\r]*\}[ \t\n\x0B\f\r]*' was too large. Max size is 1000 but was 1187.
I've not specified this on the command line, so I know it's coming from the metadata file, which is straight from Dataprep, unedited by me.
I have 17 input locations: one contains the source data and the others are lookups. There is a regex for each one, plus one extra.
If it runs when triggered from Dataprep but won't run via the CLI, am I missing something?
In this case I'd suspect the root cause is a limitation in gcloud that is not present in the Dataflow API or Dataprep. The best thing to do is to open a new Cloud Dataflow issue in the public tracker and provide the details there.
I am running a Pig script in HCatalog mode and it is failing during the MapReduce job execution. It says one of the MapReduce jobs is failing. What would be the best way to troubleshoot? I am trying to find the log file but could not locate it.
Is there any specific place I can find the logs?
The log of the Pig script is created in the working directory from which you ran the script or started the Pig console.
For the MapReduce logs, check the [HADOOP_HOME]/logs/userlogs directory. You will find the ERROR message in the stdout, stderr, or syslog file of the failed task attempt.
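For example, a quick way to locate the failing attempt (assuming $HADOOP_HOME points at your Hadoop install and the logs have not been aggregated elsewhere):

# list the per-task log directories for recent jobs
ls -lt $HADOOP_HOME/logs/userlogs/
# search the stdout/stderr/syslog files for the error
grep -Ril "ERROR" $HADOOP_HOME/logs/userlogs/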
Tidal - How to get the output of a web services job onto disk? I am trying to write the results of a web services job to disk. Please let me know how I can achieve this using Tidal.
Create a second job that depends on the web service job. In this second job, use the "output" command from the Tidal command line interface to save the output of the previous job to a file. It would look something like this:
command: /path/to/tidal/sacmd
parameters:
-i <JobID..1234> -o /path/to/output/file.out
For the job ID parameter, you select the predecessor web service job and the JobID reference to the predecessor gets filled in as shown above.
sacmd may be called tescmd depending on the OS and the version of Tidal you are using.
I know there are APIs to configure notifications when a job fails or finishes.
But what if, say, I run a Hive query that counts the number of rows in a table, and I want to send out emails to the concerned parties if the returned result is zero? How can I do that?
Thanks.
You may want to look at Airflow and Qubole's operator for Airflow. We use Airflow to orchestrate all jobs run on Qubole and, in some cases, in non-Qubole environments. We use the DataDog API to report the success or failure of each task (Qubole or non-Qubole). DataDog could be replaced here by Airflow's email operator; Airflow also has chat operators (e.g. Slack).
There is no direct API for triggering a notification based on the results of a query.
However, there is a way to do this using Qubole:
-Create a workflow in Qubole with the following steps:
1. Your query (any query), which writes its output to a particular location on S3.
2. A shell script that reads the result from S3 and fails the job based on any criteria; in your case, fail the job if the result returns 0 rows (a sketch follows after this list).
-Schedule this workflow using the "Scheduler" API to notify on failure.
You can also use the "sendmail" shell command to send mail based on the result in step 2 above.
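A minimal sketch of the step-2 script, assuming the query wrote a single count to an S3 location the cluster can read (the bucket path and mail recipient are placeholders):

#!/bin/bash
RESULT_PATH=s3://my-bucket/query-output/   # placeholder: where step 1 wrote its result

# Read the row count produced by the query in step 1.
ROWS=$(hadoop fs -cat ${RESULT_PATH}* | head -1)

if [ "${ROWS:-0}" -eq 0 ]; then
    # Optionally mail directly from here with sendmail...
    printf "Subject: Zero rows detected\n\nThe row count query returned 0 rows.\n" \
        | sendmail concerned-parties@example.com
    # ...and fail the job so the scheduler's on-failure notification also fires.
    exit 1
fi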