I didn't set the uid for my Flink operator. Is there any way to read the Flink state data?
The State Processor API requires a uid.
Thanks.
I know that Kinesis' typical use case is event streaming; however, we'd like to use it to broadcast some information so it is available in near real time in some apps, besides making it available for further stream processing. The KCL seems to be the only viable option, as the raw Kinesis Streams API is too low level.
As far as I understand, to use the KCL we'd have to generate a random applicationId so that all apps receive all the data, but this means creating a new DynamoDB table each time an application starts. Of course we can perform cleanup when an application stops, but when an application doesn't stop gracefully there would be a DynamoDB table left hanging around.
Is there a way/pattern to use Kinesis streams in a broadcast fashion?
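For concreteness, here is a minimal sketch of the pattern I mean, assuming KCL 1.x; the stream name, worker id, and record processor are placeholders:

import java.nio.charset.StandardCharsets;
import java.util.UUID;

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.v2.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.v2.IRecordProcessorFactory;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker;
import com.amazonaws.services.kinesis.clientlibrary.types.InitializationInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ProcessRecordsInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ShutdownInput;

public class BroadcastConsumer {

    // Minimal processor: every instance just prints what it sees.
    static class BroadcastProcessor implements IRecordProcessor {
        @Override public void initialize(InitializationInput input) { }
        @Override public void processRecords(ProcessRecordsInput input) {
            input.getRecords().forEach(r ->
                    System.out.println(StandardCharsets.UTF_8.decode(r.getData())));
            // No checkpointing: each restart re-reads from LATEST anyway.
        }
        @Override public void shutdown(ShutdownInput input) { }
    }

    public static void main(String[] args) {
        // A unique application name per consumer instance gives every app its own
        // KCL lease/checkpoint table in DynamoDB, so each app sees the whole stream.
        String applicationName = "broadcast-consumer-" + UUID.randomUUID();

        KinesisClientLibConfiguration config = new KinesisClientLibConfiguration(
                applicationName,
                "my-stream",                              // placeholder stream name
                new DefaultAWSCredentialsProviderChain(),
                "worker-" + UUID.randomUUID())            // placeholder worker id
                .withInitialPositionInStream(InitialPositionInStream.LATEST);

        IRecordProcessorFactory factory = BroadcastProcessor::new;

        Worker worker = new Worker.Builder()
                .recordProcessorFactory(factory)
                .config(config)
                .build();
        worker.run();   // blocks; the per-instance DynamoDB table still has to be cleaned up
    }
}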
Following is my use case:
A bunch of applications enqueue messages in Kafka under different topics.
Have the consumer of each topic distribute the work to a worker in a cluster. The work can be classified as long-running, memory-intensive, simple, etc., and the worker is chosen accordingly.
This has me exploring Akka Cluster for work distribution, routing, and scaling. I can use an Akka "supervisor" as a Kafka consumer and assign incoming work to the appropriate worker based on its classification.
But what I am still trying to understand is the correct way to implement resilient communication between the supervisor and the workers in the Akka cluster, because as soon as the supervisor consumes a message from Kafka, the Kafka offset is committed. If an error happens in processing after the offset commit, is the following an acceptable way to recover and start from where it last left off?
Make the supervisor a persistent actor by using a durable mailbox backed by Kafka. The supervisor enqueues work in Kafka, and a worker gets its work from Kafka and commits its offset only after completing the work.
As said by Jaakko, it really depends on the third-party library you are using.
For my part, I have successfully used Akka Streams Kafka, although I did enable offset auto-commit.
However, this library may still meet your needs, since it allows you to customize offset committing (see the sections "External Offset Storage" and "Offset Storage in Kafka").
The documentation says:
The Consumer.committableSource makes it possible to commit offset positions to Kafka. Compared to auto-commit this gives exact control of when a message is considered consumed.
To disable auto-commit, you have to complete your Akka application.conf file by adding an akka.kafka.consumer section:
akka.kafka.consumer {
  # Properties defined by org.apache.kafka.clients.consumer.ConsumerConfig
  # can be defined in this configuration section.
  kafka-clients {
    # Disable auto-commit by default
    enable.auto.commit = false
  }
}
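For illustration, here is a minimal Java sketch of the committableSource pattern that commits the offset only after the work has been processed; the broker address, topic, group id, and the processWork handler are placeholders of mine, not taken from the library's docs:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

import akka.Done;
import akka.actor.ActorSystem;
import akka.kafka.ConsumerSettings;
import akka.kafka.Subscriptions;
import akka.kafka.javadsl.Consumer;
import akka.stream.ActorMaterializer;
import akka.stream.Materializer;
import akka.stream.javadsl.Sink;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitConsumer {

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("work-distribution");
        Materializer materializer = ActorMaterializer.create(system);

        ConsumerSettings<String, String> settings =
                ConsumerSettings.create(system, new StringDeserializer(), new StringDeserializer())
                        .withBootstrapServers("localhost:9092")   // placeholder broker
                        .withGroupId("supervisor-group");         // placeholder group id

        Consumer.committableSource(settings, Subscriptions.topics("work-topic"))
                .mapAsync(1, msg ->
                        processWork(msg.record().value())          // do the work first...
                                .thenApply(done -> msg.committableOffset()))
                .mapAsync(1, offset -> offset.commitJavadsl())     // ...then commit the offset
                .runWith(Sink.ignore(), materializer);
    }

    // Stand-in for the real (possibly long-running) work; replace with your own logic.
    private static CompletionStage<Done> processWork(String payload) {
        System.out.println("processing: " + payload);
        return CompletableFuture.completedFuture(Done.getInstance());
    }
}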
The latest version of akka-stream-kafka_2.11 (0.16) is compatible with Akka 2.5.x, but you have to override the akka-stream_2.11 dependency with the one from the Akka toolkit. I am currently using this library with Akka 2.5.3 and it works really well.
Hope you will find what you are looking for :)
I have one question regarding the HDFS read/write process:
Assuming that we have a client (for the sake of the example, let's say the client is a Hadoop map process) that requests to read a file from HDFS and/or write a file to HDFS, which process actually does the read/write from/to HDFS?
I know that there is a process for the NameNode and a process for each DataNode, and I know their general responsibilities to the system, but I am confused in this scenario.
Is it the client's process by itself, or is there another process in HDFS, created and dedicated to this specific client, that accesses HDFS and does the read/write?
Finally, if the second is true, is there any possibility that this process can be suspended for a while?
I have done some research, and the most relevant solutions I found were Oozie and the JobControl class from the Hadoop API.
But because I am not sure about the above workflow, I am not sure which process I would be suspending and resuming with these tools.
Is it the client's process, or a process running in HDFS to serve the client's request?
Have a look at these SE posts to understand how HDFS writes work:
Hadoop 2.0 data write operation acknowledgement
Hadoop file write
Hadoop: HDFS File Writes & Reads
Apart from file/block writes, the questions above explain DataNode failure scenarios:
The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on. The failed datanode is removed from the pipeline, and a new pipeline is constructed from the two good datanodes.
A single DataNode failure triggers corrective actions by the framework.
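To make the roles concrete, here is a minimal client-side write sketch (the path is a placeholder and the default Configuration is assumed to point at your cluster). The FileSystem/DFSClient instance lives inside the client's own JVM: it asks the NameNode for block locations and then streams the bytes directly into the DataNode pipeline:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            // The write is driven by this client process: the NameNode only hands out
            // block locations, and the bytes flow from here into the DataNode pipeline.
            try (FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt"))) {
                out.writeBytes("hello hdfs\n");
            }
        }
    }
}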
Regarding your second query:
You have two types of schedulers (a configuration snippet follows the list):
FairScheduler
CapacityScheduler
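For reference, the scheduler in use is selected in yarn-site.xml via the yarn.resourcemanager.scheduler.class property; a minimal snippet (CapacityScheduler is typically the default in Apache Hadoop 2.x):

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <!-- or: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler -->
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>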
Have a look at this article on suspend and resume:
In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN.
When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs.
In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs.
So, as far as I understand, a DataNode's process receives the data from the client's process (which requests to store some data in HDFS) and stores it. Then this DataNode forwards the exact same data to another DataNode (to achieve replication), and so on. When the replication finishes, an acknowledgement goes back to the NameNode, which finally informs the client about the completion of its write request.
Based on the above flow, it is impossible to suspend an HDFS write operation in order to serve a second client's write request (let's assume that the second client has higher priority), because if we suspend the DataNode itself it will remain suspended for everyone who wants to write to it, and as a result this part of HDFS will remain blocked. Finally, if I suspend a job via the JobControl class functions, I actually suspend the client's process (if I manage to catch it before its request is done). Please correct me if I am wrong.
When a file is ingested using "hdfs dfs -put", the client computes the checksum and sends both the input data and the checksum to the DataNode for storage.
How does this checksum calculation/validation happen when a file is read/written using WebHDFS? How is data integrity ensured with WebHDFS?
The Apache Hadoop documentation doesn't mention anything about it.
WebHDFS is just a proxy through the usual datanode operations. Datanodes host the webhdfs servlets, which open standard DFSClients and read or write data through the standard pipeline. It's an extra step in the normal process but does not fundamentally change it. Here is a brief overview.
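For illustration, here is a minimal sketch of reading through WebHDFS from Java; the NameNode host, HTTP port 50070, and the path are assumptions on my side. The webhdfs:// URI resolves to the client-side WebHdfsFileSystem, while on the server side the datanode's webhdfs servlet uses a regular DFSClient, so block checksums are verified in the usual read pipeline:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WebHdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The webhdfs:// scheme maps to WebHdfsFileSystem on the client side.
        try (FileSystem fs = FileSystem.get(URI.create("webhdfs://namenode-host:50070"), conf);
             FSDataInputStream in = fs.open(new Path("/user/demo/output.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}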
Is there any document or step-by-step process that guides us on how we can use WSO2 DAS to pull data from Java class objects and display reports using this data in WSO2 Dashboards?
Any help would be really appreciated.
First, you can create an Event Stream by specifying its attributes and marking which attributes you need to persist. When events arrive at the stream, they will be stored in event tables [1].
Then you can create an Event Receiver for that Event Stream [2]. When creating an event receiver you can use a protocol such as Thrift, SOAP, HTTP, MQTT, JMS, Kafka, or WebSockets. You can write a simple Java application to publish data to the DAS receiver you created, using the message format of the protocol you selected. For instance, if you create a SOAP receiver you can send data in SOAP message format, and if you create an HTTP receiver you can use JSON format.
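For illustration, here is a minimal Java sketch that POSTs a JSON event to an HTTP receiver; the endpoint URL, receiver name, and the exact JSON envelope are assumptions on my side and must be checked against your receiver configuration and stream definition:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class DasHttpPublisher {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint: an HTTP event receiver named "sensorReceiver" on a local DAS node.
        URL url = new URL("http://localhost:9763/endpoints/sensorReceiver");

        // Placeholder payload: attribute names must match the Event Stream definition.
        String json = "{\"event\": {\"payloadData\": {\"sensorId\": \"s-1\", \"temperature\": 27.4}}}";

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("DAS responded with HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}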
You can create a dashboard and gadgets to visualize the event table that was created by your persisted stream [3]. Please note that this event table contains all the events WSO2 DAS received; you can process this data using Spark SQL [4] and create several streams which can be used in the Analytics Dashboard.
[1] https://docs.wso2.com/display/DAS300/Understanding+Event+Streams+and+Event+Tables
[2] https://docs.wso2.com/display/DAS300/Configuring+Event+Receivers
[3] https://docs.wso2.com/display/DAS300/Analytics+Dashboard
[4] https://docs.wso2.com/display/DAS300/Batch+Analytics+Using+Spark+SQL
The subject of your question and the body are contradictory: the subject says push data while the body says pull data.
If pushing data is what you want to achieve, you can refer to https://docs.wso2.com/pages/viewpage.action?pageId=45952633. This uses a Thrift client to push data to DAS.
Please refer to https://docs.wso2.com/display/DAS300/Analyzing+Data for how to analyze the raw data. You can write Spark scripts for the analysis.
Finally, you can refer to https://docs.wso2.com/display/DAS300/Communicating+Results for how to present the analyzed data. You may use the REST API exposed with DAS 3.0.0 to pull data from DAS.