Suppose several jobs are enqueued in a stream. When one job in the stream launches, does it block the stream so that the other jobs wait until it finishes, or, if resources are available for the stream, are the following jobs launched concurrently as well?
I discussed the question with an NVIDIA staff member, and this was the answer.
Asynchronous memory copies and kernel launches are non-blocking, both in the default stream and in user-defined streams.
I have created a gRPC async client in C++ that makes both streaming and unary requests to a server, using a completion queue.
In the destructor of the client class, the Shutdown method of the completion queue is called. I thought I could then call Next to drain the queue and obtain the pending tags, but instead the call to Next blocks everything.
The pending tags are needed because they are objects created with new and must be deleted to avoid leaks.
What is the correct way to drain a queue used for an async client?
The rule is one tag into the completion queue, one tag out, so every pending operation will have its tag returned from Next (even if the RPC gets canceled).
Next is most likely blocking because there are pending events that have not finished yet.
You may want to use TryCancel to terminate the calls quickly.
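The "one tag in, one tag out" rule, and the reason Next blocks, can be illustrated with a small model. This is only a toy sketch in Python, not the gRPC API: ToyCompletionQueue, start_op, complete, try_cancel_all and WouldBlock are hypothetical names that mimic the behaviour described above, where Next keeps waiting while operations are outstanding and reports "drained" only after Shutdown, once every tag has come out.

```python
from collections import deque


class WouldBlock(Exception):
    """Raised at the point where the real Next would block waiting for an event."""


class ToyCompletionQueue:
    """Hypothetical model of a completion queue; not the gRPC API."""

    def __init__(self):
        self._ready = deque()      # completed events waiting to be returned
        self._outstanding = set()  # tags handed in but not yet completed
        self._shutdown = False

    def start_op(self, tag):
        # One tag goes in for every operation that is started.
        self._outstanding.add(tag)

    def complete(self, tag, ok):
        # The operation finished (or was cancelled): its tag becomes ready.
        self._outstanding.remove(tag)
        self._ready.append((tag, ok))

    def try_cancel_all(self):
        # Models calling TryCancel on every pending call: each op ends with ok=False.
        for tag in list(self._outstanding):
            self.complete(tag, ok=False)

    def shutdown(self):
        self._shutdown = True

    def next(self):
        if self._ready:
            return self._ready.popleft()
        if self._shutdown and not self._outstanding:
            return None  # fully drained, like Next returning false
        raise WouldBlock("operations are still pending")


def drain(cq):
    """Cancel pending work, shut down, then collect every remaining tag."""
    cq.try_cancel_all()
    cq.shutdown()
    tags = []
    while True:
        event = cq.next()
        if event is None:
            break
        tags.append(event[0])  # in C++ this is where each tag would be deleted
    return tags
```

In this model, calling next() right after shutdown() while operations are still outstanding raises WouldBlock, which corresponds to the blocking call in the question; cancelling first lets every tag come back out so it can be released.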
I need to create a Kinesis stream programmatically and then start performing operations on it. Create Stream is an async operation. What is the best way to establish when the stream is ready?
Is it polling for the stream status and waiting for a response with StreamStatus ACTIVE?
You can poll. Otherwise, if you have the option, you can stream to an intermediate destination such as CloudWatch, and once the stream is active it will begin consuming those events.
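A minimal sketch of the polling approach, in Python. The function name wait_until_active and its parameters are made up for illustration; get_status is any callable that returns the current StreamStatus string, so the loop itself does not depend on a particular SDK.

```python
import time


def wait_until_active(get_status, timeout_s=300.0, interval_s=5.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns "ACTIVE" or the timeout expires.

    Returns True if the stream became ACTIVE, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "ACTIVE":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)  # back off before asking again
```

With boto3, get_status could be something like `lambda: client.describe_stream_summary(StreamName=name)["StreamDescriptionSummary"]["StreamStatus"]`, where client and name are your own Kinesis client and stream name. boto3 also appears to provide a built-in waiter (stream_exists) for the same purpose; check the SDK documentation before relying on it.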
I have one question regarding the HDFS read/write process:
Assume that we have a client (for the sake of the example, say a Hadoop map process) that requests to read a file from HDFS and/or to write a file to HDFS. Which process actually performs the read/write from/to HDFS?
I know that there is a process for the Namenode and a process for each Datanode, and I know their general responsibilities in the system, but I am confused about this scenario.
Is it the client's process itself, or is there another process in HDFS, created and dedicated to this specific client, that accesses HDFS and performs the read/write?
Finally, if the second answer is true, is there any possibility that this process can be suspended for a while?
I have done some research, and the most important solutions that I found were Oozie and the JobControl class from the Hadoop API.
But, because I am not sure about the above workflow, I am not sure what process I am suspending and resuming with these tools.
Is it the client's process or a process which runs in HDFS in order to serve the request of the client?
Have a look at these SE posts to understand how HDFS writes work:
Hadoop 2.0 data write operation acknowledgement
Hadoop file write
Hadoop: HDFS File Writes & Reads
Apart from file/block writes, the questions above also explain datanode failure scenarios.
The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on. The failed datanode is removed from the pipeline, and a new pipeline is constructed from the two good datanodes.
A single datanode failure triggers corrective actions by the framework.
Regarding your second query:
You have two types of schedulers:
Have a look at this article on suspend and resume:
In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN.
When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs.
In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs.
So, as far as I understand, a Datanode process receives the data from the client's process (which requests to store some data in HDFS) and stores it. This Datanode then forwards the exact same data to another Datanode (to achieve replication), and so on. When replication finishes, an acknowledgement goes back to the Namenode, which finally informs the client that its write request has completed.
Based on the above flow, it is impossible to suspend an HDFS write operation in order to serve a second client's write request (let's assume that the second client has higher priority): if we suspend the Datanode itself, it remains suspended for everyone who wants to write to it, and as a result this part of HDFS remains blocked. Finally, if I suspend a job using the JobControl class functions, I actually suspend the client's process (if I manage to catch it before its request is done). Please correct me if I am wrong.
I am trying to use an AWS Kinesis stream for one of our data streams. I would like to monitor pending messages on my stream for ops purposes (scaling downstream according to backlog), but I am unable to find any API that gives the (approximate) number of pending messages in my stream.
This looks strange, as messages expire after 7 days, and if the producers and consumers are isolated and can't communicate, how do you know that messages are expiring? How do you handle this problem?
There is no such concept as a "pending" message in Kinesis. All incoming data is placed on a shard.
Your consumer application should be running all the time to keep track of changes in your stream. The application (with the help of the KCL) will continue to poll the shard iterator in the background, so you will be notified of new data when it arrives.
Roughly, you can see Kinesis as a FIFO queue whose messages disappear after a short time if you don't pop them.
If your application will process only a few messages per hour, you should think about changing your architecture. Kinesis is probably not the correct tool for you.
I am trying to make a Kinesis consumer client. To work on it I went through the Kinesis Developer Guide and the AWS documentation at http://docs.aws.amazon.com/kinesis/latest/dev/kinesis-record-processor-implementation-app-java.html.
I was wondering whether it is possible to get data from two different streams and process it accordingly.
Say I have two different streams, stream1 and stream2.
Is it possible to get data from both streams and process each individually?
Why not? Do get_records from both streams.
If your streams have only a single shard each, a single worker will see all the events, since it is recommended to process each shard with a single worker. If your logic needs to join events from different sources/streams, you can implement it with a single worker reading from both streams.
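As a rough sketch of "do get_records from both streams" with a single worker, the helper below fetches one batch from every shard of each stream. read_once is a made-up name; client is assumed to be a boto3 Kinesis client (for example boto3.client("kinesis")), and the sketch ignores pagination of list_shards, checkpointing and retries, which a real consumer (or the KCL) would handle.

```python
def read_once(client, stream_names, limit=100):
    """Fetch one batch of records from every shard of each named stream.

    Returns a dict mapping stream name to the list of records read.
    """
    batches = {}
    for name in stream_names:
        records = []
        for shard in client.list_shards(StreamName=name)["Shards"]:
            iterator = client.get_shard_iterator(
                StreamName=name,
                ShardId=shard["ShardId"],
                ShardIteratorType="TRIM_HORIZON",  # start at the oldest available record
            )["ShardIterator"]
            response = client.get_records(ShardIterator=iterator, Limit=limit)
            records.extend(response["Records"])
        batches[name] = records
    return batches
```

A call such as read_once(client, ["stream1", "stream2"]) keeps the records of the two streams separate, so each stream can be processed individually or joined in the worker. To keep reading, a real loop would continue from the NextShardIterator value returned by each get_records call.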
Note that if you have streams with multiple shards, each of your workers will see only a part of the events. You have the following options:
Both streams have a single shard each - in this case you can read from both streams with a single worker and see all events from both streams. You can add timestamps or other keys to allow you to "join" these events in the worker.
One stream (stream1) with one shard and the second stream (stream2) with multiple shards - in this case you can read stream1 from all your workers, each of which will also process a single shard of stream2. Each of your workers will see all the events of stream1 and its share of the events of stream2. Note that there is a limit on how fast you can read events from stream1 with its single shard (2MB/second or 5 reads/second), and if you have many shards in stream2, this can be a real limit.
Both streams have multiple shards - in this case it will be more complex for you to ensure that you are able to "join" these events, as you need to sync both the writes and the reads to these streams. You can also read from all shards of both streams with a single worker, but this is not good practice, as it limits your ability to scale since you no longer have a distributed system. Another option is to use the same partition_key in both streams, with the same number of shards and the same partition definition for both streams, and verify that each of your workers reads from the "right" shard of each stream, and that it does so correctly every time one of your workers fails and restarts, which might be a bit complex.
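The "same partition_key, same shard layout" option relies on how a partition key is mapped to a shard: Kinesis hashes the key with MD5 into a 128-bit integer and picks the shard whose hash key range contains it. The sketch below models that mapping. It assumes the digest is read as a big-endian unsigned integer and that both streams split the key space evenly; even_ranges and shard_index are illustrative helpers, and real ranges should be taken from each shard's HashKeyRange.

```python
import hashlib


def hash_key(partition_key):
    """128-bit integer derived from the MD5 digest of the partition key."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def even_ranges(shard_count):
    """Split the 128-bit key space into contiguous ranges (assumed layout)."""
    size = 2**128 // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        # The last range absorbs any remainder so the whole space is covered.
        end = 2**128 - 1 if i == shard_count - 1 else start + size - 1
        ranges.append((start, end))
    return ranges


def shard_index(partition_key, ranges):
    """Index of the range (shard) that contains the key's hash."""
    key = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= key <= end:
            return index
    raise ValueError("ranges do not cover the key space")
```

If stream1 and stream2 use identical ranges, shard_index gives the same position for a given partition_key in both, which is what lets a worker pair up the "right" shards. Once either stream is resharded, the layouts diverge and the pairing has to be recomputed.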
Another option that you can consider is to write both types of events in a single stream, again using the same partition_key, and then filter them on the reader side if you need to process them differently (for example, to write them to different log files in S3).
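A small sketch of the reader-side filter for the single-stream option. It assumes each record's Data payload is JSON with a "type" field; that field name, and the route helper itself, are assumptions made for illustration, not something Kinesis defines.

```python
import json


def route(records, handlers):
    """Dispatch each record to the handler registered for its event type.

    Records whose type has no registered handler are skipped.
    """
    for record in records:
        event = json.loads(record["Data"])  # Data is assumed to hold JSON bytes
        handler = handlers.get(event.get("type"))
        if handler is not None:
            handler(event)
```

For example, route(records, {"click": write_click_log, "view": write_view_log}) would send the two kinds of events to different writers, where write_click_log and write_view_log stand in for your own functions.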