I am facing an issue with Azure Stream Analytics reading data from Event Hub and pushing it to Azure Storage. The Event Hub publisher sends 10 messages, and the dashboard metrics show 10 messages as input. However, Stream Analytics misses 1 message while reading from Event Hub. This is an intermittent issue; it does not happen every time. Can anyone help me figure out why only 1 message is missed, and only occasionally?
The total data is around 8-9 MB. The Stream Analytics job has 3 streaming units.
1 input: reading from Event Hub as JSON
1 output: to a storage account
I am trying to ingest a payload into Kusto/ADX via Event Hub.
Limitation: 1 standard-tier Event Hub can only support throughput up to 40 Mbps.
Goal: increase the maximum throughput by sending a compressed payload, without handling the translation manually.
Example: payload = {
a: 1,
b: 2
}
We compress this payload manually and send it to Event Hub, and Kusto should store it as 1 row with 2 columns, a and b; we don't want to handle the decompression/translation on our end.
I am expecting Event Hub to handle the compressed data and the translation on its side.
It's 40 MB (megabytes) per second, not 40 Mb (megabits).
You can compress your payload with gzip.
Kusto will open it automatically as part of the ingestion process.
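As a minimal sketch of that approach (assuming the Python azure-eventhub SDK, with placeholder connection string and hub name), the producer side could look like this; the Kusto data connection's Compression setting still has to be set to Gzip as described below:

import gzip
import json

from azure.eventhub import EventHubProducerClient, EventData

# Hypothetical payload matching the example in the question.
payload = {"a": 1, "b": 2}

# Gzip-compress the serialized JSON before sending it.
compressed = gzip.compress(json.dumps(payload).encode("utf-8"))

# Placeholder connection string and hub name -- replace with your own.
producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hub-connection-string>",
    eventhub_name="<event-hub-name>",
)

# Send the compressed bytes as a single event; Kusto decompresses it
# during ingestion when the data connection's compression is set to Gzip.
batch = producer.create_batch()
batch.add(EventData(compressed))
producer.send_batch(batch)
producer.close()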
From Ingest data from event hub into Azure Data Explorer:

Setting     | Suggested value | Field description
Compression | None            | The compression type of the event hub messages payload. Supported compression types: None, Gzip.
Having said that, the right thing to do would probably be to switch to a higher Event Hubs tier.
I started learning about Spark Streaming applications with Kinesis. I have a case where our Spark Streaming application fails and restarts, but the issue is that when it restarts, it tries to process more messages than it can handle and fails again. So:
Is there any way we can limit the amount of data a Spark Streaming application processes, in terms of bytes?
Also, let's say a Spark Streaming application fails and remains down for 1 or 2 hours, and InitialPositionInStream is set to TRIM_HORIZON. When it restarts, it will resume from the last message processed in the Kinesis stream. But since live ingestion is still going on in Kinesis, how does the Spark Streaming application process this 1-2 hours of data sitting in Kinesis as well as the live data that keeps being ingested?
PS: The Spark Streaming application runs on EMR, the batch interval is set to 15 seconds, and the Kinesis CheckPointInterval is set to 60 seconds; every 60 seconds it writes the processed-data details to DynamoDB.
If my questions are unclear or you need any more information to answer them, do let me know.
Thanks..
Assuming you are trying to read the data from a message queue like Kafka or Event Hub:
If that's the case, then whenever the Spark Streaming application goes down, on restart it will try to process the data from the offset where it left off before failing.
By the time you restart the job, more data will have accumulated; the job will try to process the whole backlog at once and fail, either with out-of-memory errors or with executors getting lost.
To prevent that, you can use a configuration such as "maxOffsetsPerTrigger", which creates a backpressure mechanism and prevents the job from reading all the data at once. It streamlines the data pull and processing.
More details can be found here: https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html
From the official docs:
"Rate limit on maximum number of offsets processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume."
Example of setting max offsets per trigger (note that maxOffsetsPerTrigger only applies to streaming reads, so readStream is used here, and the limit takes effect once the streaming query is started with writeStream):
val df = spark
  .readStream                                   // streaming read; a batch spark.read ignores maxOffsetsPerTrigger
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1")
  .option("subscribe", "topicName")
  .option("startingOffsets", "latest")
  .option("maxOffsetsPerTrigger", "10000")      // at most 10,000 offsets per micro-batch
  .load()
To process the backlog as soon as possible and catch up with real-time data, you may need to scale up your infrastructure accordingly.
Some sort of auto-scaling might help in this case.
After the backlogged data is processed, your job can scale back down automatically.
https://emr-etl.workshop.aws/auto_scale/00-setup.html
I'm running into an issue reading GCP Pub/Sub from Dataflow: when I publish a large number of messages in a short period of time, Dataflow receives most of the sent messages, except that some messages are lost and some others are duplicated. The weirdest part is that the number of lost messages is exactly the same as the number of duplicated messages.
In one of the examples, I sent 4,000 messages in 5 seconds; 4,000 messages were received in total, but 9 messages were lost and exactly 9 messages were duplicated.
The way I determine the duplicates is via logging. I log every message published to Pub/Sub along with the message ID generated by Pub/Sub, and I also log each message right after reading it from PubsubIO, in a ParDo transformation; a sketch of the publish-side logging is shown below.
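For reference, the publish-side logging looks roughly like this (a minimal sketch using the google-cloud-pubsub Python client; the project, topic, and payload are placeholders); the message ID returned by the publish future is what I compare against on the Dataflow side:

import logging

from google.cloud import pubsub_v1

logging.basicConfig(level=logging.INFO)

publisher = pubsub_v1.PublisherClient()
# Placeholder project and topic names.
topic_path = publisher.topic_path("test-project", "test_topic")

for i in range(4000):
    data = f"message-{i}".encode("utf-8")
    future = publisher.publish(topic_path, data=data)
    # future.result() blocks until the publish succeeds and returns the
    # Pub/Sub-generated message ID, which is logged for later comparison.
    logging.info("published %s with message_id=%s", data, future.result())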
The way I read from Pub/Sub in Dataflow is with org.apache.beam.sdk.io.PubsubIO:
public interface Options extends GcpOptions, DataflowPipelineOptions {
    // PUBSUB URL
    @Description("Pubsub URL")
    @Default.String("https://pubsub.googleapis.com")
    String getPubsubRootUrl();
    void setPubsubRootUrl(String value);

    // TOPIC
    @Description("Topic")
    @Default.String("projects/test-project/topics/test_topic")
    String getTopic();
    void setTopic(String value);

    ...
}
public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    options.setStreaming(true);
    options.setRunner(DataflowRunner.class);
    ...

    Pipeline pipeline = Pipeline.create(options);
    pipeline.apply(PubsubIO
            .<String>read()
            .topic(options.getTopic())
            .withCoder(StringUtf8Coder.of())
        )
        .apply("Logging data coming out of Pubsub", ParDo
            .of(some_logging_transformation)
        )
        .apply("Saving data into db", ParDo
            .of(some_output_transformation)
        );

    pipeline.run().waitUntilFinish();
}
I wonder if this is a known issue in Pubsub or PubsubIO?
UPDATE:
I tried 4,000 requests with the Pub/Sub emulator: no missing data and no duplicates.
UPDATE #2:
I went through some more experiments and found that the duplicated messages take the message_id of the missing ones. Because the issue has diverged quite a bit from its origin, I decided to post another question with detailed logs as well as the code I used to publish and receive messages.
Link to the new question: Google Cloud Pubsub Data lost
I talked with someone from Google's Pub/Sub team; it seems to be caused by a thread-safety issue with the Python client. Please refer to the accepted answer on Google Cloud Pubsub Data lost for the response from Google.
I am using the EventHubStream provider in a project based on Orleans.
After the system has been running for a few minutes, Orleans starts throwing a QueueCacheMissException while trying to push an event to OnNext from a producer.
I have tried increasing the size of the cache, but that only helped for a while.
Is this normal behavior caused by the size of the cache?
In this situation, should I unsubscribe and subscribe again? I have tried resuming the stream, but that didn't work; the stream was in a faulted state... any ideas?
It is likely that the service is reading events from Event Hub faster than the grains are processing them. Event Hub can deliver events at a rate of ~1k/second per partition.
The latest version of the EventHub stream provider supports backpressure, which should prevent this problem, but it has not been released yet. You can, however, build your own NuGet packages.
How do I tell what percentage of the data in a Kinesis stream a reader has already processed? I know each reader has a per-shard checkpoint sequence number, and I can also get the StartingSequenceNumber of each shard from describe-stream; however, I don't know how far along in the data the reader currently is (I don't know the latest sequence number of the shard).
I was thinking of getting a LATEST iterator for each shard and reading the last record's sequence number; however, that doesn't work if no new data has arrived since I got the LATEST iterator.
Any ideas or tools for doing this out there?
Thanks!
I suggest you implement a custom metric or metrics in your applications to track this.
For example, you could include a send timestamp in each Kinesis message and, when processing the message, record the time difference as an AWS CloudWatch custom metric, as sketched below. This indicates how close your consumer is to the front of the stream.
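As a minimal sketch of that idea, assuming the producer embeds a sent_at epoch timestamp in a JSON payload (the namespace and metric name below are made up), the consumer side could do something like:

import json
import time

import boto3

cloudwatch = boto3.client("cloudwatch")

def record_consumer_lag(record_data: bytes) -> None:
    # Assumes the producer embedded a "sent_at" epoch timestamp in the payload.
    payload = json.loads(record_data)
    lag_seconds = time.time() - payload["sent_at"]

    # Publish the lag as a CloudWatch custom metric (namespace/name are illustrative).
    cloudwatch.put_metric_data(
        Namespace="MyKinesisConsumer",
        MetricData=[
            {
                "MetricName": "ConsumerLagSeconds",
                "Value": lag_seconds,
                "Unit": "Seconds",
            }
        ],
    )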
You could also record the number of messages pushed (at the producing application) and the number of messages received at the Kinesis consumer. If you compare these in a chart on CloudWatch, you can see whether the curves roughly follow each other, which indicates that the consumer is keeping up with the workload.
You could also monitor your Kinesis consumer to see how often it waits idly for records (i.e., no results are returned by Kinesis, suggesting it is at the front of the stream and all records have been processed).
Also note that there is no way to track a "percent processed" for the stream, since Kinesis messages expire after 24 hours (so the total set of messages is constantly rolling). There is also no direct API call to count the number of messages in your stream (unless you have recorded this yourself, as above).
If you use the KCL, you can do this by comparing IncomingRecords from the built-in CloudWatch metrics for Kinesis with RecordsProcessed, which is a custom metric published by the KCL.
Then you select a time range and an interval of, say, 1 day.
You would then get graphs of the following kind:
As you can see, many more records were added than processed. By looking at the values at each point, you will know exactly whether your processor is behind or not.
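If you want the same comparison programmatically, a rough sketch with boto3 could look like the following; the stream name, the KCL application name used as the metrics namespace, and the Operation dimension are assumptions you would need to adjust to what your KCL version actually publishes:

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")
window = dict(
    StartTime=datetime.utcnow() - timedelta(days=1),
    EndTime=datetime.utcnow(),
    Period=3600,            # one data point per hour
    Statistics=["Sum"],
)

# Records added to the stream (built-in Kinesis metric).
incoming = cloudwatch.get_metric_statistics(
    Namespace="AWS/Kinesis",
    MetricName="IncomingRecords",
    Dimensions=[{"Name": "StreamName", "Value": "my-stream"}],  # placeholder stream name
    **window,
)

# Records processed by the consumer. By default the KCL publishes its metrics
# under the application name as the namespace; the Operation dimension is an
# assumption -- check the dimensions your KCL application actually emits.
processed = cloudwatch.get_metric_statistics(
    Namespace="my-kcl-application",
    MetricName="RecordsProcessed",
    Dimensions=[{"Name": "Operation", "Value": "ProcessTask"}],
    **window,
)

added = sum(p["Sum"] for p in incoming["Datapoints"])
done = sum(p["Sum"] for p in processed["Datapoints"])
print(f"added={added:.0f} processed={done:.0f} backlog~={added - done:.0f}")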