It takes a lot of time to save state to HDFS in Flink

For Flink heap state, checkpointing to HDFS takes a very long time, or hangs, once the state grows beyond 10 GB. From the web dashboard we can see that most of the time is spent in the async phase; perhaps it is waiting on block writes to HDFS. Is there some way to solve this problem?
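For reference, here is a minimal sketch of the kind of checkpoint setup described above (paths and intervals are placeholders). One commonly suggested mitigation for state of this size is switching from the heap-based backend to RocksDB with incremental checkpoints, shown commented out; treat that as an assumption to verify rather than a confirmed fix for this case.

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.runtime.state.filesystem.FsStateBackend
// import org.apache.flink.contrib.streaming.state.RocksDBStateBackend

object CheckpointSetupSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Heap state snapshotted to HDFS on every checkpoint (the setup described above);
    // the second argument enables asynchronous snapshots.
    env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints", true))

    // Possible alternative for 10 GB+ state: RocksDB with incremental checkpoints,
    // so only changed files are uploaded to HDFS instead of the full state every time.
    // env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true))

    env.enableCheckpointing(60000L) // checkpoint every 60 s (placeholder interval)

    // ... define sources and operators, then call env.execute(...) ...
  }
}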

Related

How to slow down reads in the Kinesis Client Library?

We have an aggregation system where the aggregator is a KDA application running Flink, which aggregates data over a 6-hour time window and puts all of the data into an AWS Kinesis Data Stream.
We also have a consumer application that uses the KCL 2.x library, reads the data from KDS, and puts it into DynamoDB. We are using the default KCL configuration and have set the poll time to 30 seconds. The issue we are facing now is that the consumer application reads all of the data in KDS within a few minutes, causing a burst of writes to DynamoDB in a short period of time and, in turn, scaling issues in DynamoDB.
We would like to consume the KDS data slowly and even out consumption over time, allowing us to keep a lower provisioned capacity for WCUs.
One way to do that is to increase the polling time for the KCL consumer application. I am trying to see if there is any configuration that can limit the number of records we poll, helping us reduce the write throughput to DynamoDB, or any other way to fix this problem.
Appreciate any responses.
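If I read the KCL 2.x retrieval API correctly, the polling behaviour can be tuned through PollingConfig. A rough sketch (the stream name and numbers are placeholders, and whether these values spread the load enough for your DynamoDB writes is an assumption to verify):

import software.amazon.awssdk.services.kinesis.KinesisAsyncClient
import software.amazon.kinesis.retrieval.polling.PollingConfig

object SlowPollingSketch {
  val kinesisClient: KinesisAsyncClient = KinesisAsyncClient.builder().build()

  // Fetch fewer records per GetRecords call and wait longer between calls,
  // which spreads consumption (and the resulting DynamoDB writes) over time.
  val pollingConfig: PollingConfig =
    new PollingConfig("my-kds-stream", kinesisClient)  // placeholder stream name
      .maxRecords(200)                                 // the KCL default is much larger
      .idleTimeBetweenReadsInMillis(5000L)             // pause between polls

  // The config is then wired into the scheduler via
  // configsBuilder.retrievalConfig().retrievalSpecificConfig(pollingConfig).
}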

How does a Spark Streaming application work when it fails?

I started learning about Spark Streaming applications with Kinesis. I have a case where our Spark Streaming application fails and restarts, but when it restarts it tries to process more messages than it can handle and fails again. So:
Is there any way to limit the amount of data a Spark Streaming application can process, in terms of bytes?
Also, say a Spark Streaming application fails and remains down for 1 or 2 hours, and InitialPositionInStream is set to TRIM_HORIZON. When it restarts, it will start from the last messages processed in the Kinesis stream; but since there is live ingestion going on in Kinesis, how does the Spark Streaming application process this 1 or 2 hours of data present in Kinesis as well as the live data that is being ingested?
PS - Spark Streaming is running on EMR, the batch size is set to 15 seconds, and the Kinesis CheckpointInterval is set to 60 seconds, so every 60 seconds it writes the processed data details to DynamoDB.
If my questions are unclear or you need any more information to answer them, do let me know.
Thanks.
Assuming you are trying to read the data from message queues like Kafka or Event Hubs:
If that's the case, whenever the Spark streaming application goes down it will try to process the data from the offset it left off at before failing.
By the time you restart the job it will have accumulated more data, and it will try to process the whole backlog at once and fail, either with an out-of-memory error or with executors getting lost.
To prevent that, you can use something like the "maxOffsetsPerTrigger" option, which creates a backpressure mechanism and prevents the job from reading all the data at once. It smooths out the data pull and processing.
More details can be found here: https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html
From the official docs:
Rate limit on maximum number of offsets processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume.
Example of setting max offsets per trigger:
val df = spark
  .readStream                                       // streaming read; the option has no effect on a batch spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1")
  .option("subscribe", "topicName")
  .option("startingOffsets", "latest")
  .option("maxOffsetsPerTrigger", "10000")          // cap the number of offsets consumed per micro-batch
  .load()
To process the backlog as soon as possible and catch up with real-time data, you may need to scale up your infrastructure accordingly.
Some sort of auto scaling might help in this case.
After processing the backlogged data, your job can scale back down.
https://emr-etl.workshop.aws/auto_scale/00-setup.html
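The question itself uses the Kinesis DStream API rather than Kafka; for that case, as far as I know, the analogous knobs are Spark's receiver rate limit and backpressure settings rather than maxOffsetsPerTrigger. A rough sketch (the rates are placeholders to tune):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RateLimitedStreamingSketch {
  val conf = new SparkConf()
    .setAppName("kinesis-rate-limit-sketch")
    // Let Spark adapt the ingestion rate to what the job can actually process.
    .set("spark.streaming.backpressure.enabled", "true")
    // Rate used for the first batches, before backpressure has any feedback.
    .set("spark.streaming.backpressure.initialRate", "1000")
    // Hard per-receiver cap on records per second, so a restart after downtime
    // cannot pull the whole backlog into a single batch.
    .set("spark.streaming.receiver.maxRate", "1000")

  val ssc = new StreamingContext(conf, Seconds(15)) // 15 s batches, as in the question
  // ... create the Kinesis DStream on ssc, then start and await termination ...
}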

AWS S3 rate limit and SlowDown errors

I'm refactoring a job that uploads ~1.2 million small files to AWS; previously this upload was done file by file on a 64-CPU machine using processes. I switched to an async + multiprocess approach, following the S3 rate limits and the best practices and performance guidelines, to make it faster. With sample data I can achieve execution times as low as one tenth of the original. With production loads, S3 returns "SlowDown" errors.
Currently the business logic produces a folder structure like this:
s3://bucket/this/will/not/change/<shard-key>/<items>
The objects are split evenly across ~30 shard keys, so every prefix contains ~40k items.
Each process writes to its own prefix and launches batches of 3k PUT requests asynchronously until completion. There is a sleep after each batch write to ensure we do not send another batch before 1.1 seconds have passed, so that we stay within the limit of 3,500 PUT requests per second.
The problem is that we receive SlowDown errors for ~1 hour, and then the job writes all the files in ~15 minutes. If we lower the limit to 1k/sec it gets even worse, running for hours and never finishing.
This is the distribution of the errors over time for the 3k/sec limit:
We are using Python 3.6 with aiobotocore to run the requests asynchronously.
Doing trial and error to figure out how to mitigate this takes forever on production data, and testing with a smaller quantity of data gives different results (it works flawlessly).
Did I miss any documentation on how to make the system scale up correctly?
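Not an answer to the prefix question itself, but the usual guidance for 503 SlowDown responses is to retry with exponential backoff and jitter and to ramp the request rate up gradually while S3 scales the prefixes. A language-agnostic sketch of that retry pattern (written in Scala here, although the job above uses Python/aiobotocore; the limits are placeholders):

import scala.annotation.tailrec
import scala.util.{Failure, Random, Success, Try}

object BackoffSketch {
  // Retry an arbitrary upload thunk with exponential backoff and full jitter.
  // `put` stands in for whatever issues the PUT request; it is a placeholder.
  @tailrec
  def retryWithBackoff[A](put: () => A,
                          attempt: Int = 0,
                          maxAttempts: Int = 8,
                          baseDelayMs: Long = 100L): A =
    Try(put()) match {
      case Success(result) => result
      case Failure(_) if attempt < maxAttempts - 1 =>
        // Full jitter: sleep a random amount between 0 and base * 2^attempt, capped at 20 s.
        val cap = math.min(baseDelayMs << attempt, 20000L)
        Thread.sleep((Random.nextDouble() * cap).toLong)
        retryWithBackoff(put, attempt + 1, maxAttempts, baseDelayMs)
      case Failure(err) => throw err
    }
}

Combined with keeping the per-prefix PUT rate below the limit, retries like this spread the bursts out instead of hammering a prefix that S3 is still partitioning.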

Could an HDFS read/write process be suspended/resumed?

I have one question regarding the HDFS read/write process:
Assume we have a client (for the sake of the example, let's say the client is a Hadoop map process) that requests to read a file from HDFS and/or to write a file to HDFS. Which process actually does the read/write from/to HDFS?
I know that there is a process for the Namenode and a process for each Datanode, and what their responsibilities are in general, but I am confused by this scenario.
Is it the client's own process, or is there another process in HDFS, created and dedicated to this specific client, that accesses HDFS and does the read/write?
Finally, if the latter is true, is there any possibility that this process can be suspended for a while?
I have done some research, and the most relevant solutions I found were Oozie and the JobControl class from the Hadoop API.
But because I am not sure about the above workflow, I am not sure which process I would be suspending and resuming with these tools.
Is it the client's process, or a process that runs inside HDFS to serve the client's request?
Have a look at these SE posts to understand how HDFS writes work:
Hadoop 2.0 data write operation acknowledgement
Hadoop file write
Hadoop: HDFS File Writes & Reads
Apart from file/block writes, the questions above explain datanode failure scenarios.
The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on. The failed datanode is removed from the pipeline, and a new pipeline is constructed from the two good datanodes.
A failure in a datanode triggers corrective actions by the framework.
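To connect this back to the original question of which process does the write: as far as I understand, it is the client's own process, using the HDFS client library, that drives the data through the datanode pipeline; the Namenode only hands out block locations. A minimal sketch of such a client-side write (path and content are placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsWriteSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()          // picks up core-site.xml / hdfs-site.xml from the classpath
    val fs = FileSystem.get(conf)

    // The write pipeline (client -> DN1 -> DN2 -> DN3) is driven from inside this process:
    // the returned stream buffers data into packets and pushes them to the first datanode.
    val out = fs.create(new Path("/tmp/example.txt"))
    out.write("hello hdfs".getBytes("UTF-8"))
    out.close()                             // blocks until the data is acknowledged through the pipeline

    fs.close()
  }
}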
Regarding your second query:
You have two types of schedulers :
FairScheduler
CapacityScheduler
Have a look at this article on suspend and resume
In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN.
When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs.
In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs.
So, as far as I understand, a Datanode's process receives the data from the client's process (which requested to store some data in HDFS) and stores it. Then this Datanode forwards the exact same data to another Datanode (to achieve replication), and so on. When the replication finishes, an acknowledgement goes back to the Namenode, which finally informs the client that its write request has completed.
Based on the above flow, it is impossible to suspend an HDFS write operation in order to serve a second client's write request (let's assume that the second client has higher priority), because if we suspend the Datanode itself, it remains suspended for everyone who wants to write to it, and as a result this part of HDFS remains blocked. Finally, if I suspend a job via the JobControl class functions, I actually suspend the client's process (if I manage to catch it before its request is done). Please correct me if I am wrong.

Redis is taking too long to respond

We are experiencing very high response latency with Redis, to the point of not being able to get any output from the info command through redis-cli.
This server handles requests from around 200 concurrent processes, but it does not store much information (at least to our knowledge). When the server is responsive, the info command reports used memory of around 20-30 MB.
When running top on the server during periods of high response latency, CPU usage hovers around 95-100%.
What are some possible causes for this kind of behavior?
It is difficult to propose an explanation based only on the provided data, but here is my guess. I suppose that you have already checked the obvious latency sources (the ones linked to persistence), that no Redis command is hogging the CPU in the slow log, and that the size of the job data pickled by Python-rq is not huge.
According to the documentation, Python-rq inserts the jobs into Redis as hash objects and lets Redis expire the related keys (500 seconds seems to be the default value) to get rid of the jobs. If you have serious throughput, at some point you will have many items in Redis waiting to be expired, and their number will be high compared to the number of pending jobs.
You can check this point by looking at the number of items to be expired in the output of the INFO command.
Redis expiration is based on a lazy mechanism (applied when a key is accessed) and an active mechanism based on key sampling, which is run in the event loop (in pseudo-background mode, every 100 ms). The point is that while the active expiration mechanism is running, no Redis command can be processed.
To avoid impacting the performance of client applications too much, only a limited number of keys are processed each time the active mechanism is triggered (by default, 10 keys). However, if more than 25% of the sampled keys are found to be expired, it tries to expire more keys and loops. This is how this probabilistic algorithm automatically adapts its activity to the number of keys Redis has to expire.
When many keys are to be expired, this adaptive algorithm can impact the performance of Redis significantly though. You can find more information here.
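To make that adaptive behaviour concrete, here is a minimal sketch, not Redis's actual implementation, of the sampling loop described above, using the 10-key sample size and 25% threshold mentioned (the in-memory map is a stand-in for the keyspace):

import scala.collection.mutable
import scala.util.Random

object ActiveExpireSketch {
  // Keys mapped to their expiry timestamp in millis; a stand-in for Redis's expires dictionary.
  val expires = mutable.Map[String, Long]()

  // One expiration cycle: sample a small batch of keys, delete the expired ones,
  // and loop again while more than 25% of the sample was expired.
  def activeExpireCycle(sampleSize: Int = 10, now: Long = System.currentTimeMillis()): Unit = {
    var keepGoing = true
    while (keepGoing && expires.nonEmpty) {
      val sample = Random.shuffle(expires.keys.toSeq).take(sampleSize)
      val expired = sample.filter(k => expires(k) <= now)
      expired.foreach(expires.remove)           // this is the work that steals time from command processing
      keepGoing = expired.size > sampleSize / 4 // keep looping while > 25% of the sample was expired
    }
  }
}

When a burst of jobs all carry expiration dates in the same narrow window, that loop keeps running back to back, which matches the latency spikes described in the question.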
My suggestion would be to try to prevent Python-rq from delegating item cleanup to Redis by setting expirations. This is a poor design for a queuing system anyway.
I think reducing the TTL is not the right way to avoid CPU usage when Redis expires keys.
Didier makes a good point that the current architecture of Python-rq delegates the cleanup of jobs to Redis by using the key-expire feature, and surely, as Didier said, this is not the best way. (This is used only when result_ttl is greater than 0.)
The problem should then arise when you have a set of keys/jobs with expiration dates close to one another, which can happen when you have bursts of job creation.
But Python-rq sets the expiration when a job has finished in a worker,
so this shouldn't really happen, because the keys should be spread out over time, with enough time between them to avoid this situation.