Could an HDFS read/write process be suspended/resumed? - hdfs

I have one question regarding the HDFS read/write process:
Assume we have a client (for the sake of the example, say a Hadoop map process) that requests to read a file from HDFS and/or write a file to HDFS. Which process actually performs the read/write from/to HDFS?
I know that there is a process for the Namenode and a process for each Datanode, and I know what their general responsibilities in the system are, but I am confused about this scenario.
Is it the client's process itself, or is there another process in HDFS, created for and dedicated to this specific client, that accesses HDFS and performs the read/write?
Finally, if the second answer is true, is there any possibility that this process can be suspended for a while?
I have done some research, and the most relevant solutions I found were Oozie and the JobControl class from the Hadoop API.
But, because I am not sure about the above workflow, I am not sure what process I am suspending and resuming with these tools.
Is it the client's process or a process which runs in HDFS in order to serve the request of the client?

Have a look at these SE posts to understand how HDFS writes work:
Hadoop 2.0 data write operation acknowledgement
Hadoop file write
Hadoop: HDFS File Writes & Reads
Apart from file/block writes, the questions above also explain datanode failure scenarios.
The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on. The failed datanode is removed from the pipeline, and a new pipeline is constructed from the two good datanodes.
A single datanode failure triggers corrective action by the framework.
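The recovery steps in the quoted passage can be sketched as a toy model. This is not HDFS code: the `Pipeline` class, the `generation_stamp` field standing in for the block's "identity", and the datanode names are all hypothetical and only mirror the steps described in the quote.

```python
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    """Toy stand-in for an HDFS write pipeline (not real HDFS code)."""
    block_id: int
    generation_stamp: int  # hypothetical name for the block's "identity"
    datanodes: list = field(default_factory=list)


def recover_from_failure(pipeline: Pipeline, failed: str) -> Pipeline:
    """Build a new pipeline from the good datanodes with a new block identity."""
    survivors = [dn for dn in pipeline.datanodes if dn != failed]
    if not survivors:
        raise RuntimeError("no good datanodes left in the pipeline")
    # The bumped stamp marks the partial replica on the failed node as stale.
    return Pipeline(pipeline.block_id, pipeline.generation_stamp + 1, survivors)


def is_stale(replica_stamp: int, current: Pipeline) -> bool:
    """A replica written under an older identity can be deleted later."""
    return replica_stamp < current.generation_stamp


old = Pipeline(block_id=1, generation_stamp=7, datanodes=["dn1", "dn2", "dn3"])
new = recover_from_failure(old, failed="dn2")
```

Here `new` contains only `dn1` and `dn3`, and a replica that `dn2` still holds under the old stamp is reported as stale once `dn2` comes back.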
Regarding your second query:
There are two types of schedulers:
FairScheduler
CapacityScheduler
Have a look at this article on suspend and resume
In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN.
When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs.
In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs.
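The suspend/resume semantics described in the quoted article can be illustrated with a small toy scheduler. It is only a sketch of the described behaviour, not YARN code; the `ToyScheduler` class and its method names are hypothetical.

```python
from collections import deque


class ToyScheduler:
    """Toy model of the quoted suspend/resume behaviour (not YARN code)."""

    def __init__(self):
        self.pending = {}       # job -> deque of tasks waiting for a container
        self.running = []       # (job, task) pairs that already hold a container
        self.suspended = set()

    def submit(self, job, tasks):
        self.pending[job] = deque(tasks)

    def suspend(self, job):
        self.suspended.add(job)

    def resume(self, job):
        self.suspended.discard(job)

    def allocate(self):
        """Give one new container to each job that is not suspended."""
        for job, tasks in self.pending.items():
            if job not in self.suspended and tasks:
                self.running.append((job, tasks.popleft()))

    def finish_running(self):
        """Already-running containers complete even if their job is suspended."""
        done, self.running = self.running, []
        return done


sched = ToyScheduler()
sched.submit("low", ["t1", "t2", "t3"])
sched.allocate()                 # t1 gets a container
sched.suspend("low")
sched.allocate()                 # no new container while suspended
first = sched.finish_running()   # t1 still runs to completion
sched.resume("low")
sched.allocate()                 # t2 is allocated after resume
second = sched.finish_running()
```

Note that suspension in this model only affects allocation of new containers; it never pauses a task that is already running, which matches the quoted description.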

So, as far as I understand, a Datanode process receives the data from the client's process (which requests to store some data in HDFS) and stores it. This Datanode then forwards the exact same data to another Datanode (to achieve replication), and so on. When replication finishes, an acknowledgement goes back to the Namenode, which finally informs the client that its write request has completed.
Based on the above flow, it is impossible to suspend an HDFS write operation in order to serve a second client's write request (let's assume the second client has higher priority), because if we suspend the Datanode itself it will remain suspended for everyone who wants to write to it, and as a result this part of HDFS will remain blocked. Finally, if I suspend a job through the JobControl class, I actually suspend the client's process (if I manage to catch it before its request completes). Please correct me if I am wrong.

Related

Is it possible to trigger/call another program when Kafka HdfsSinkConnector finishes

I want to trigger an Impala refresh job when a Kafka HdfsSinkConnector task finishes. Is it possible to get a notification when the task completes, or is there any other way to trigger/call my other program?
HDFS has an inotify feature which essentially translates the NameNode's edit log entries into events that can be consumed.
https://issues.apache.org/jira/browse/HDFS-6634
Here's a Java based example: https://github.com/onefoursix/hdfs-inotify-example
Alternatively, rather than having Oozie monitor many directories and waste resources, a script can execute 'hdfs dfs -ls -R /folder|grep|sed' every minute or so. That is still not event based, so it depends on how fast a reaction you need versus how easily you can implement/use the inotify API.
https://community.cloudera.com/t5/Support-Questions/HDFS-Best-way-to-trigger-execution-at-File-arrival/td-p/163423
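The polling alternative could look roughly like the sketch below, which compares two directory listings and reports newly arrived files. It assumes the usual eight-column output of `hdfs dfs -ls -R`; the function names, paths, and sample listings are made up for illustration.

```python
def parse_listing(ls_output: str) -> set:
    """Extract file paths from `hdfs dfs -ls -R` style output.

    Assumes the usual eight-column layout:
    permissions, replication, owner, group, size, date, time, path.
    """
    files = set()
    for line in ls_output.splitlines():
        parts = line.split(None, 7)
        # Skip headers such as "Found N items" and directory entries ("d...").
        if len(parts) == 8 and parts[0].startswith("-"):
            files.add(parts[7])
    return files


def new_files(previous: set, current: set) -> set:
    """Paths that appeared since the previous poll."""
    return current - previous


# Hypothetical listings from two consecutive polls.
before = parse_listing(
    "drwxr-xr-x   - etl supergroup          0 2020-01-01 10:00 /folder/topic\n"
)
after = parse_listing(
    "drwxr-xr-x   - etl supergroup          0 2020-01-01 10:00 /folder/topic\n"
    "-rw-r--r--   3 etl supergroup       1234 2020-01-01 10:01 /folder/topic/part-0.avro\n"
)
arrived = new_files(before, after)
```

In a real script the listing text would come from running the `hdfs` command (for example via `subprocess.run`), and a non-empty `arrived` set would be the cue to start the refresh job.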

JBeret Queueing Mechanism

I'm receiving a bunch of csv files (e.g. 200) at once which I want to read and process one after the other with a JBeret job. How would I configure JBeret to achieve that? Is there some sort of queueing mechanism? Thanks in advance.
When running batch jobs in WildFly (which contains JBeret as a subsystem), submitted job execution requests are started if sufficient processing resources are available. Otherwise, requests are queued for later execution. You can configure the max-threads attribute in the batch-jberet subsystem to influence the number of concurrent job executions.
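The queueing behaviour described above is similar in spirit to a bounded thread pool. The sketch below is only an analogy in plain Python, not JBeret or WildFly code: `max_workers` plays the role that max-threads plays in the answer, the file names are hypothetical, and the real thread requirements of a JBeret job are not modelled.

```python
from concurrent.futures import ThreadPoolExecutor

processed = []


def process_csv(name: str) -> str:
    """Placeholder for one job execution that handles a single file."""
    processed.append(name)
    return name


files = [f"input-{i:03d}.csv" for i in range(5)]  # hypothetical file names

# With one worker, extra submissions wait in the executor's queue
# and are started one after the other in submission order.
with ThreadPoolExecutor(max_workers=1) as pool:
    futures = [pool.submit(process_csv, name) for name in files]
    results = [f.result() for f in futures]
```

All five requests are accepted immediately, but only one runs at a time; the rest are queued until the worker is free.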

Routing an activity task to a specific worker in the SWF fleet

I have a fleet of multiple worker hosts polling for the following tasks of my SWF:
Activity 1: Perform some business logic to create a large file.
Activity 2: Wait for some time (a human approval, timer, etc.)
Activity 3: Transmit the file using some protocol (governed by input parameters of the SWF).
Activity 4: Clean-up the local-generated file.
The file generated in Step-1 needs to be used again in Step-3, and then eventually discarded at the end of the workflow.
The system would work fine if there is only 1 host polling for all tasks. However, when I have multiple workers, I cannot seem to ensure that task-1 and task-3 would end up on the same host.
I would like to avoid doing the following:
Uploading the file to a central repository (say S3) on step-1 and download it in step-3; or
Having a single activity for the task-1 and task-3.
I have the following questions:
Is it possible to control that subsequent activities be run on the same host as opposed to going to any random host in my fleet?
What are specific guidelines/best practices on re-using resources generated in different activities in a workflow?
Is it possible to control that subsequent activities be run on the same host as opposed to going to any random host in my fleet?
Yes, absolutely. The basic idea is that SWF task lists (queues used to deliver activity tasks) are dynamic. So each host can have its own task list, and the workflow can specify a particular task list name when calling an activity. See the fileprocessing sample, which executes the download activity on any host from the pool, then converts the file and uploads the result on the same host as the first one.
What are specific guidelines/best practices on re-using resources generated in different activities in a workflow?
Caching the result in the worker process memory or on the local disk is considered the best practice. Sometimes using an external data store and fetching the data each time also makes sense.
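The host-specific task list idea can be sketched with in-memory queues. This is a toy model rather than SWF or Flow Framework code; the `host-<name>` naming scheme, the `shared-list` name, and the helper functions are assumptions made for illustration.

```python
from collections import defaultdict, deque

task_lists = defaultdict(deque)  # task list name -> queued activity tasks
SHARED = "shared-list"           # hypothetical list polled by every host


def host_task_list(host: str) -> str:
    """Hypothetical naming scheme for a host-specific task list."""
    return f"host-{host}"


def schedule(activity: str, task_list: str) -> None:
    task_lists[task_list].append(activity)


def poll(host: str, task_list: str):
    """Worker on `host` takes the next task from `task_list`, if any."""
    tasks = task_lists[task_list]
    return (host, tasks.popleft()) if tasks else None


# Step 1 goes to the shared list; whichever host picks it up reports its own list.
schedule("create_file", SHARED)
host, _ = poll("worker-b", SHARED)
pinned = host_task_list(host)

# Later steps are scheduled on that host's list, so only that host sees them.
schedule("transmit_file", pinned)
schedule("cleanup_file", pinned)

seen_by_a = poll("worker-a", host_task_list("worker-a"))
seen_by_b = [poll("worker-b", pinned), poll("worker-b", pinned)]
```

Each worker would poll both the shared list and its own list; the first activity returns the host's list name so the workflow can route the follow-up activities there.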

How to relaunch a Spark executor after it crashes (in YARN client mode)?

Is it possible to relaunch a Spark executor after it crashes? I understand that the failed tasks are re-run in the existing working Spark executors, but I hope there is a way to relaunch the crashed Spark executor.
I am running pyspark 1.6 on YARN, in client mode
No, it is not possible to do this yourself. Spark takes care of it: when an executor dies, Spark will request a new one the next time it asks for "resource containers" for executors.
If the executor was close to the data to process, Spark will request a new executor given the locality preferences of the task(s), and chances are that the host where the executor died will be used again to run the new one.
An executor is a JVM process that spawns threads for tasks and honestly does not do much. If you're concerned with the data blocks you should consider using Spark's external shuffle service.
Consider reading the Job Scheduling page in the official documentation.

AWS SWF Simple Workflow - Best Way to Keep Activity Worker Scripts Running?

The maximum amount of time the pollForActivityTask method stays open polling for requests is 60 seconds. I am currently scheduling a cron job every minute to call my activity worker file so that my activity worker machine is constantly polling for jobs.
Is this the correct way to have continuous queue coverage?
The way the Java Flow SDK does it is that you create an ActivityWorker and give it a tasklist, domain, activity implementations, and a few other settings. You set both setPollThreadCount and setTaskExecutorSize. The polling threads long-poll and then hand work over to the executor threads to avoid blocking further polling. You call start on the ActivityWorker to boot it up, and when you want to shut the workers down you can call one of the shutdown methods (usually it is best to call shutdownAndAwaitTermination).
Essentially your workers are long lived and need to deal with a few factors:
New versions of Activities
Various tasklists
Scaling independently on tasklist, activity implementations, workflow workers, host sizes, etc.
Handle error cases and deal with polling
Handle shutdowns (in case of deployments and new versions)
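The poll-thread/executor split described above can be sketched in plain Python. This is not the Flow SDK: `ToyActivityWorker` is a hypothetical class, a local `queue.Queue` stands in for the remote task list, and the method names only loosely mirror the Java ones mentioned in the answer.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


class ToyActivityWorker:
    """Long-lived worker: poll threads hand tasks to a separate executor pool."""

    def __init__(self, source, handler, poll_threads=2, executor_size=4):
        self._source = source    # stands in for the remote task list
        self._handler = handler
        self._stop = threading.Event()
        self._executor = ThreadPoolExecutor(max_workers=executor_size)
        self._pollers = [
            threading.Thread(target=self._poll_loop, daemon=True)
            for _ in range(poll_threads)
        ]

    def _poll_loop(self):
        while not self._stop.is_set():
            try:
                # A short timeout stands in for a long poll that returns empty.
                task = self._source.get(timeout=0.05)
            except queue.Empty:
                continue         # nothing arrived, poll again
            # Hand off so this thread can go straight back to polling.
            self._executor.submit(self._run, task)

    def _run(self, task):
        try:
            self._handler(task)
        finally:
            self._source.task_done()

    def start(self):
        for thread in self._pollers:
            thread.start()

    def shutdown_and_await_termination(self):
        self._stop.set()
        for thread in self._pollers:
            thread.join()
        self._executor.shutdown(wait=True)


tasks = queue.Queue()
results = []
worker = ToyActivityWorker(tasks, results.append)
worker.start()
for i in range(5):
    tasks.put(i)
tasks.join()                     # wait until every task has been handled
worker.shutdown_and_await_termination()
```

Because the process stays alive and keeps polling in a loop, no external scheduler is needed to restart it after each poll times out.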
I ended up using a solution where another script file is called by a cron job every minute. This file checks whether an activity worker is already running in the background (if so, I assume a workflow execution is already being processed on the current server).
If no activity worker is there, then the previous long poll has completed and we launch the activity worker script again. If there is an activity worker already present, then the previous poll found a workflow execution and started processing so we refrain from launching another activity worker.
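One way to implement such a check is sketched below. The original post does not say how the running worker is detected, so the PID file approach, the file location, and the function names are assumptions; the existence check via signal 0 is POSIX only.

```python
import os
import tempfile


def worker_is_running(pid_file: str) -> bool:
    """True if pid_file names a process that still exists (POSIX only)."""
    try:
        with open(pid_file) as fh:
            pid = int(fh.read().strip())
        os.kill(pid, 0)  # signal 0 checks existence without killing
    except (FileNotFoundError, ValueError, ProcessLookupError):
        return False
    except PermissionError:
        return True      # process exists but belongs to another user
    return True


def should_launch_worker(pid_file: str) -> bool:
    """The cron-invoked script launches a worker only when none is alive."""
    return not worker_is_running(pid_file)


with tempfile.TemporaryDirectory() as tmp:
    pid_file = os.path.join(tmp, "worker.pid")  # hypothetical location
    launch_when_missing = should_launch_worker(pid_file)
    with open(pid_file, "w") as fh:
        fh.write(str(os.getpid()))              # pretend this process is the worker
    launch_when_alive = should_launch_worker(pid_file)
```

The activity worker script would write its own PID to the file on startup, and the cron script would start a new worker only when `should_launch_worker` returns true.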