Is it possible to trigger/call another program when kafka HdfsSinkConnector finish - hdfs

I want to trigger the impala refresh job when kafka HdfsSinkConnector task finish it. Is it possible to get notification when task complete or any other way to trigger/call my other program?

HDFS has an inotify feature which essentially translates those log entries into events that can be consumed.
https://issues.apache.org/jira/browse/HDFS-6634
Here's a Java based example: https://github.com/onefoursix/hdfs-inotify-example
Alternatively, rather than having Oozie monitor many directories and waste resources, a script can execute 'hdfs dfs -ls -R /folder|grep|sed' every minute or so but that's still not event based, so it depends how fast of a reaction you need vs how easy you can implement/use the inotify API
https://community.cloudera.com/t5/Support-Questions/HDFS-Best-way-to-trigger-execution-at-File-arrival/td-p/163423

Related

JBeret Queueing Mechanism

I'm receiving a bunch of csv files (e.g. 200) at once which I want to read and process one after the other with a JBeret job. How would I configure JBeret to achieve that? Is there some sort of queueing mechanism? Thanks in advance.
When running batch jobs in WildFly (which contains jberet as a subsystem), submitted job execution requests will be started if there is sufficient processing resources available. Otherwise, requests will be queued for later execution. You can configure the max-threads attribute in batch-jberet subsystem to influence the number of concurrent job executions.

Could an HDFS read/write process be suspended/resumed?

I have one question regarding the HDFS read/write process:
Assuming that we have a client (for the sake of the example let's say that the client is a HADOOP map process) who requests to read a file from HDFS and or to write a file to HDFS, which is the process which actually does the read/write from/to the HDFS?
I know that there is a process for the Namenode and a process for each Datanode, what are their responsibilities to the system in general but I am confused in this scenario.
Is it the client's process by itself or is there another process in the HDFS, created and dedicated to the this specific client, in order to access and read/write from/to the HDFS?
Finally, if the second answer is true, is there any possibility that this process can be suspended for a while?
I have done some research and the most important solutions that I found were Oozie and JobControl class from hadoop API.
But, because I am not sure about the above workflow, I am not sure what process I am suspending and resuming with these tools.
Is it the client's process or a process which runs in HDFS in order to serve the request of the client?
Have a look at these SE posts to understand how HDFS writes work:
Hadoop 2.0 data write operation acknowledgement
Hadoop file write
Hadoop: HDFS File Writes & Reads
Apart from file/block writes, above question explain about datanode failure scenarios.
The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on. The failed datanode is removed from the pipeline, and a new pipeline is constructed from the two good datanodes.
One failure in datanode triggers corrective actions by framework.
Regarding your second query :
You have two types of schedulers :
FairScheduler
CapacityScheduler
Have a look at this article on suspend and resume
In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN.
When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs.
In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs.
So as far as I understand the process of a Datanode receives the data from the client's process (who requests to store some data in HDFS) and stores it. Then this Datanode forwards the exact same data to another Datanode (to achieve replication) and so on. When the replication will finish, an acknowledgement will go back to the Namenode who will finally inform the client about the completion of his write-request.
Based on the above flow, It is impossible to suspend an HDFS write operation in order to serve a second client's write-request (let's assume that the second client has higher priority) because if we suspend the Datanode by itself it will remain suspended for everyone who wants to write on it and as a result this part of the HDFS will be remained blocked. Finally, if I suspend a job from JobController class functions, I actually suspend the client's process (if I actually manage to catch it before his request will be done). Please correct me if I am wrong.

Recommendation for batch processing on aws

I'm new to using AWS, so any pointers would be appreciated.
I have a need to process large files using our in-house software.
It takes about 2GB of input and generates 5GB of output, running for 2 hours on a c3.8xlarge.
For now I do it manually, start an instance (either on-demand or spot-request), but now I want to reliably automate and scale this processing - what are good frameworks or platform or amazon services to do that?
Especially regarding the possibility that a spot-instance will be terminated half-way through (and I'll need to detect that and restart the job).
I heard about Python Celery, but does it work well with amazon and spot-instances?
Or are there other recommended mechanisms?
Thank you!
This is somewhat opinion-based, but you can mix and match some of the AWS pieces to make this easier:
put the input data on S3
push an entry into a SQS queue indicating a job needs to be processed with a long visibility timeout
set up an autoscaling policy based on SQS with your machine description in CloudFormation.
use UserData/cloudinit to set up the machine and start your application
write code to receive the queue entry, start processing, finish processing, then delete the SQS message.
code should check for another queued entry. If none, code should terminate machine.

Concurrency in running Oozie workflow: how many and how to throttle

Let us say we have a Oozie workflow that has a copy action node then a Shell action node. Can I start multiple instances of such a OOzie workflow and run them in parallel? How about the concurrency number could spike to thousands and/or even millions level. Is that possible, or even Oozie supports that high level concurrency?
If not, then we will have to consider throttling and enforce a cap on how many concurrent Oozie workflow instances can be. We'd prefer to throttle this on server/Oozie side (basically with any out of box Oozie software functionality), not on client/callee side. For example, we have a huge launch script with lines like this. We want to run that in a single shot, then let Oozie figure out how to throttle all these instances on itself. We don't want to split it into multiple smaller chunks, then kick off one chunk at a time.
oozie job -oozie http://myhost.com:11000/oozie -config job1.properties -run
oozie job -oozie http://myhost.com:11000/oozie -config job2.properties -run
......
oozie job -oozie http://myhost.com:11000/oozie -config job1000000.properties -run
You will not be able to have a higher Oozie workflow concurrency than the number of map slots on your cluster because a Shell action is run by a one-mapper-zero-reducer MR job.
If you have many instances of a workflow to get through then the best mechanism is to use an Oozie coordinator. This will keep track of the completion of each instance and easily manage concurrency. An Oozie coordinator has a <concurrency> tag that controls how many instances of the workflow will execute in parallel, and a <throttle> tag that controls how many instances are brought into a waiting state before there is free concurrency for one to begin.
See: https://oozie.apache.org/docs/3.1.3-incubating/CoordinatorFunctionalSpec.html#a6.3._Synchronous_Coordinator_Application_Definition
Note that the default behavior of an Oozie coordinator is to wait 5 minutes between each polling of whether a new instance should be created. If your workflows run in less than 5 minutes then the process will bottleneck on this interval. You can change this with the oozie.service.CoordMaterializeTriggerService.lookup.interval property (in seconds) in your oozie-site.xml file.

SQS/task-queue job retry count strategy?

I'm implementing a task queue with Amazon SQS ( but i guess the question applies to any task-queue ) , where the workers are expected to take different action depending on how many times the job has been re-tried already ( move it to a different queue, increase visibility timeout, send an alert..etc )
What would be the best way to keep track of failed job count? I'd like to avoid having to keep a centralized db for job:retry-count records. Should i look at time spent in the queue instead in a monitoring process? IMO that would be ugly or un-clean at best, iterating over jobs until i find ancient ones..
thanks!
Andras
There is another simpler way. With your message you can request ApproximateReceiveCount information and base your retry logic on that. This way you won't have to keep it in the database and can calculate it from the message itself.
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_ReceiveMessage.html
I've had good success combining SQS with SimpleDB. It is "centralized", but only as much as SQS is.
Every job gets a record in simpleDB and a task in SQS. You can put any information you like in SimpleDB like the job creation time. When a worker pulls a job from the queue it can grab the corresponding record from simpleDB to determine it's history. You can see how old the job is, and you can see how many times it has been attempted. Once you're done, you can add worker data to the SimpleDB record (completion time, outcome, logs, errors, stack-trace, whatever) and acknowledge the message from SQS.
I prefer this method because it helps diagnose faults by providing lots of debug info for failed tasks. It also allows workers to handle the job differently depending on how long the job has been queued, how many failures it's had, etc.
It also gives you the ability to query SimpleDB directly and calculate things like average time per task, percent failure rate, etc.
Amazon just released Simple workflow serice (swf) which you can think of as a more sophisticated/flexible version of GAE Task queues.
It will let you monitor your tasks (with hearbeats), configure retry strategies and create complicated workflows. It looks pretty promising abstracting out task dependencies, scheduling and fault tolerance for tasks (esp. asynchronous ones)
Checkout http://docs.amazonwebservices.com/amazonswf/latest/developerguide/swf-dg-intro-to-swf.html for overview.
SQS stands for "Simple Queue Service" which, in concept is the incorrect name for that service. The first and foremost feature of a "Queue" is FIFO (First in, First out), and SQS lacks that. Just wanting to clarify.
Also, Azure Queue Services lacks that as well. For the best cloud Queue service, use Azure's Service Bus since it's a TRUE Queue concept.