Is Event Hubs intended to be used for Event Sourcing / append-only log architectures - azure-eventhub

Event Hubs doesn't let you store messages longer than 7 (maybe up to 30) days. What is Azure's suggested architecture for PaaS Event Sourcing given these limitations? If it's Event Hubs + snapshotting, what happens if we somehow need to rebuild that state? Additionally, is Azure Stream Analytics the Event Hubs answer to KSQL/Spark?

Great question!
Yes, Event Hubs is intended to be used for Event Sourcing / the append-only log pattern. Event Hubs can be used as a source/sink for stream processing and analytics engines like Spark, so it is not a competitor to them. In general, Event Hubs offers capabilities similar to Apache Kafka's.
And yes, snapshotting is definitely the recommended approach for rebuilding state from the append-only log!
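As a rough illustration of that pattern (the account-style aggregate, event names, and fields below are hypothetical), rebuilding state means loading the latest snapshot and folding in only the events recorded after it:

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    version: int      # sequence number of the last event folded into this snapshot
    balance: float    # aggregate state captured at that point

def apply(state: Snapshot, event: dict) -> Snapshot:
    # fold one append-only log event into the running state
    if event["type"] == "Deposited":
        return Snapshot(event["seq"], state.balance + event["amount"])
    if event["type"] == "Withdrawn":
        return Snapshot(event["seq"], state.balance - event["amount"])
    return state

def rebuild(snapshot: Snapshot, events_after_snapshot) -> Snapshot:
    # only events newer than the snapshot are replayed, which is why a bounded
    # retention window (1-7 days) is usually enough
    state = snapshot
    for event in events_after_snapshot:
        state = apply(state, event)
    return state
```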
While shaping Event Hubs as a product offering, our considerations for choosing a default retention period were:
most critical systems create snapshots every few minutes.
most design patterns around this suggest retaining older snapshots for rebuilds.
So it was clear that we don't need an infinite log, and a time bound of a day will do for most use cases. Hence we started with a default of 1 day and gave a knob up to 7 days.
If you think you will have a case where you need to go back more than 7 days to rebuild a snapshot (for example, for debugging - generally not a 99% scenario, but agreed that designing and accommodating for it is very wise), the recommended approach is to push the data to an archival store.
When our usage metrics showed that many customers dedicate one Event Hubs consumer group to pushing data to an archival store, we wanted to enable this capability out of the box, and so started to offer the Event Hubs Capture feature.
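For context, here is a hedged sketch of that dedicated archival consumer group (connection strings, names, and the blob layout are placeholders; Event Hubs Capture now gives you this out of the box):

```python
from azure.eventhub import EventHubConsumerClient
from azure.storage.blob import ContainerClient

# long-term store for raw events
archive = ContainerClient.from_connection_string(
    "<STORAGE_CONNECTION_STRING>", container_name="eventhub-archive"
)

client = EventHubConsumerClient.from_connection_string(
    "<EVENT_HUBS_CONNECTION_STRING>",
    consumer_group="archiver",            # consumer group dedicated to archival
    eventhub_name="<EVENT_HUB_NAME>",
)

def on_event(partition_context, event):
    # persist each event keyed by partition and offset so it can be replayed later
    blob_name = f"{partition_context.partition_id}/{event.offset}"
    archive.upload_blob(blob_name, event.body_as_str(), overwrite=True)

with client:
    client.receive(on_event=on_event, starting_position="-1")  # from the start of the retained log
```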
More on Event Hubs.

Event Hubs is meant for temporarily storing events while moving them between data storage instances. To keep them for an indefinite period, you would have to load them into some permanent storage, e.g. Cosmos DB.
KSQL is somewhat comparable to Azure Stream Analytics. Spark is a much broader product, but you can use Spark to process Event Hubs data.
P.S. I don't speak officially for Microsoft, so this is just my view.

Related

AWS Lambda best practices for Real Time Tracking

We currently run an AWS Lambda function that simply redirects the user to a different URL. The function is invoked via API Gateway.
For tracking purposes, we would like to create a widget on our dashboard that provides real-time insights into how many redirects are performed each second. The creation of the widget itself is not the problem.
My main question is which AWS service is best suited for telling our other services that an invocation took place. We plan to register the invocation in our database.
Some additional things:
low latency (< 5 seconds), so that the data is effectively real-time
almost no added wait time for the user; we aim to redirect the user as fast as possible
Many thanks in advance!
Best Regards
Martin
I understand that your goal is to simply persist the information that an invocation happened somewhere with minimal impact on the response time of the Lambda.
For that purpose I'd probably use an SQS standard queue and just send a message to the queue that the invocation happened.
You can then have an asynchronous process (Lambda, Docker, EC2) process the messages from the queue and update your Dashboard.
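A minimal sketch of that idea (the queue URL and payload fields are placeholders; the actual redirect logic is elided): the Lambda enqueues one small message before returning the redirect, so the user-facing latency barely changes.

```python
import json
import time

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-central-1.amazonaws.com/123456789012/redirect-events"  # placeholder

def handler(event, context):
    # fire-and-forget: record that an invocation happened
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"path": event.get("rawPath"), "ts": time.time()}),
    )
    # then redirect the user as before
    return {"statusCode": 302, "headers": {"Location": "https://example.com/target"}}
```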
Depending on the scalability requirements, looking into Kinesis Data Analytics might also be worth it.
It's a fully managed streaming data solution, and the analytics part lets you do sliding-window analyses using SQL on data in the stream.
In that case you'd write the info that something happened to the stream, which also has low latency.
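The producer side of that variant is similarly small; a hedged sketch with an assumed stream name (the sliding-window SQL itself would live in Kinesis Data Analytics, not here):

```python
import json
import time

import boto3

kinesis = boto3.client("kinesis")

def record_invocation(path: str) -> None:
    # one record per redirect; the per-second counts are aggregated downstream
    kinesis.put_record(
        StreamName="redirect-invocations",                      # placeholder
        Data=json.dumps({"path": path, "ts": time.time()}).encode("utf-8"),
        PartitionKey=path or "unknown",
    )
```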

Clear event hub for consumers after load testing

I'm load testing an ingestion app, part of which sends data to an event hub.
During this time, I disable the consuming Azure Functions, as I'm not currently testing that part of the system and don't wish to pay for them.
I then wish to test the consuming Azure Functions, but there is a huge backlog of items in the event hub.
I understand that it's a log and after the retention period it will be cleared.
But I'm hoping for a more immediate option. I don't need to "delete" the messages per se, just to inform the consumer group that it doesn't need to read those messages.
A few quick hacks I have tried are reducing the retention period down to 1 day and disabling and re-enabling the event hub.
Searching around, people say event hubs cannot be cleared. I suspect one option involves checkpoints, but I am open to alternatives (it being a dev environment means easier, albeit more drastic, techniques can be adopted). This is separate from specific questions about checkpointing, even if that is the answer, as I wish to highlight a "clearing" technique.
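For what it's worth, one low-effort option in a dev environment is to leave the hub alone and simply have consumers start at the end of the stream. A hedged sketch with the azure-eventhub SDK (connection string and names are placeholders; this is not the Functions-trigger mechanism itself, just an illustration that each consumer chooses its own starting position):

```python
from azure.eventhub import EventHubConsumerClient

client = EventHubConsumerClient.from_connection_string(
    "<EVENT_HUBS_CONNECTION_STRING>",
    consumer_group="$Default",
    eventhub_name="<EVENT_HUB_NAME>",
)

def on_event(partition_context, event):
    print(partition_context.partition_id, event.offset, event.body_as_str())

with client:
    # "@latest" starts at the end of each partition (when no checkpoint exists),
    # effectively skipping the load-test backlog without deleting anything
    client.receive(on_event=on_event, starting_position="@latest")
```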

Display real time data on website that scales?

I am starting a project where I want to create a website which will display LIVE flight information and status. We have all seen this at airports. An example is given here - http://www.computronics.biz/productimages/prodairport4.jpg. As you can see, this information changes continuously. The website will talk to a backend API, and this backend API will talk to the database. Now the important part is that the flight information in the database will be updated by the airlines themselves. There could be several airlines, and they will each update their own data. I have drawn a diagram and uploaded it here - https://imgur.com/a/ssw1S.
Now those airlines will obviously have an interface (website talking to some backend API) through which they will update the database.
Now here is my attempt to solve it. We need some sort of trigger such that if any airline updates a flight detail in the database between current time - 1 hour and current time + 4 hours (the website will only display a few hours of flights), we call the web API and push the update to the website in real time. The user must not have to refresh the page at all. At the same time, the website needs to scale well: if 1 million users are on the website and there is an update in the database within the correct time range, all 1 million users' pages should be updated within a decent amount of time.
I did some research and it looks like we need an event-based approach. For example, we need to create a function (AWS Lambda or Azure Function) that should be called whenever there is an update in the database (DynamoDB, for example) within the correct time range. This function should then call an API which updates the website through WebSocket technology, for example.
I am not looking for any code but just some alternative suggestions on how this can be solved in a scalable way. Also how do we test scalability?
Don't use serverless functions (Lambda/Azure Functions)
Although I am a huge fan of serverless functions, and am currently running a full web app on Lambda, I don't think it's needed for your use case and it doesn't make sense economically. As you've answered in the comments, each airline will not write directly to the database; they'll push to an API, meaning you are explicitly told when flights have changed. When an airline has sent you new data, you can simply propagate it to all the browser endpoints via WebSockets. This keeps the design very simple. There is no need to artificially create a database event that then triggers a function that will then tell you a flight has been updated. That's like removing your doorbell and replacing it with a motion detector that triggers a doorbell :)
Cost
Money always deserves its own section. Lambda is more of an economic breakthrough than a technological one. You have to know when it's cost effective. You pay per request, so if you're dealing with a process that handles 10,000 operations a month, or something that only fires 1,000 times a day, then Lambda is dirt cheap and practically free. You also pay for the length of time the function executes and the memory consumed while executing. Generally, it makes sense to use Lambda functions where a dedicated server would be sitting idle most of the time. So instead of a whole EC2 instance, AWS provides you with a container on demand. There are points at which high request rates and constantly running processes make Lambda more expensive than EC2. This article discusses how it's generally cheaper to use Lambda up to a point -> https://www.trek10.com/blog/lambda-cost/ The same applies to Azure Functions and Google's equivalent. They are all just containers offered on demand.
If you're dealing with flight information, I would imagine you will have thousands of flights being updated every minute, so your Lambda functions will be firing constantly as if you were running an EC2 instance. You will end up paying a lot more than EC2. When you have a service that needs to stay up and run 24/7 with high activity, that is most certainly a valid use case for a dedicated server or servers.
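A back-of-the-envelope comparison under assumed, illustrative prices (rough us-east-1 list prices; free tier, API Gateway, and data transfer are ignored):

```python
REQUESTS_PER_SECOND = 200                  # assumed steady flight-update load
SECONDS_PER_MONTH = 30 * 24 * 3600
requests = REQUESTS_PER_SECOND * SECONDS_PER_MONTH

LAMBDA_PER_MILLION = 0.20                  # $ per 1M requests (assumed)
LAMBDA_GB_SECOND = 0.0000166667            # $ per GB-second (assumed)
MEMORY_GB = 0.128
DURATION_S = 0.05                          # 50 ms per invocation (assumed)

lambda_cost = (requests / 1e6) * LAMBDA_PER_MILLION \
    + requests * MEMORY_GB * DURATION_S * LAMBDA_GB_SECOND

EC2_HOURLY = 0.0416                        # e.g. a t3.medium, on-demand (assumed)
ec2_cost = EC2_HOURLY * 24 * 30

print(f"Lambda: ~${lambda_cost:.0f}/month vs one EC2 instance: ~${ec2_cost:.0f}/month")
# At this constant load Lambda comes out several times more expensive.
```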
Proposed Solution
These are the components I would use below:
Message Queue of some sort (RabbitMQ or AWS SQS with SNS perhaps)
Web Socket Backend (The choice will depend on programming language)
Airline input API (REST, GraphQL, or maybe AWS Kinesis Data Firehose)
The airlines publish their data to a backend API. The updates are stored on a message queue, and the web application that actually displays the results to users, via WebSockets, reads from the queue.
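A minimal sketch of that flow (queue URL, port, and message format are assumptions; requires boto3 and the websockets package, 10.1+): flight updates arrive on the queue and are broadcast to every connected browser.

```python
import asyncio

import boto3
import websockets

QUEUE_URL = "https://sqs.eu-central-1.amazonaws.com/123456789012/flight-updates"  # placeholder
sqs = boto3.client("sqs")
connected = set()   # currently open browser sockets

async def handler(websocket):
    # register the browser and keep the socket open until it disconnects
    connected.add(websocket)
    try:
        await websocket.wait_closed()
    finally:
        connected.discard(websocket)

async def pump_queue():
    # poll the queue and fan each flight update out to all connected browsers
    loop = asyncio.get_running_loop()
    while True:
        resp = await loop.run_in_executor(
            None,
            lambda: sqs.receive_message(
                QueueUrl=QUEUE_URL, WaitTimeSeconds=20, MaxNumberOfMessages=10
            ),
        )
        for msg in resp.get("Messages", []):
            websockets.broadcast(connected, msg["Body"])   # JSON from the airline input API
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await pump_queue()

if __name__ == "__main__":
    asyncio.run(main())
```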
Scalability
For scalability you can run the WebSocket application on multiple EC2 instances (all reading from the same queuing service) in an autoscaling group, so extra load causes more instances to be created automatically, hence the name "autoscaling". Those instances can sit behind an elastic load balancer. There is lots of AWS documentation on how to do this, and it's their flagship design pattern. If you use AWS SQS you don't have to manage the scalability details yourself; AWS handles that. The only real components to scale are your WebSocket application and the flight data input endpoint. You can run the flight API in an autoscaling group as well, but AWS does offer an additional tool for high-traffic data processing. I detail that below.
Testing Scalability
It would be fairly easy to have a mock airline blast your service with thousands and thousands of fake updates, and on the other end you can run multiple threads of Selenium tests simulating browser clicks and validating that the UI is still operational.
Additional tools
If it ends up being large amounts of data, rather than using a conventional REST API for your flight update service, you could consider a service AWS offers specifically for dealing with large amounts of real-time updates (Kinesis Data Firehose): https://aws.amazon.com/kinesis/data-firehose/ But I've never used it.
First, please don't overthink this. This is a trivial problem to solve and doesn't require any special techniques, technologies, or trendy patterns and frameworks.
You actually have three functional areas you can address almost separately.
Ingestion - collection and normalization of the data from the various sources. For this, you'll need a process and transformation engine, Logic Apps or similar.
Your databases. You'll quickly learn that not all flights are the same ;). While it might seem so, the amount of data isn't that much. Instances of MySQL/SQL Server tuned for a particular function will work just fine. Hint, you don't need to have data for every movement ready to present all the time.
Presentation. The data API and UIs. This, really, is the easy part. I would suggest you use basic polling at first. For reasons you will never have any control over, the SLA for flight data is ~5 minutes, so the time you'd spend on a real-time client notification system is better spent elsewhere at first.
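For example, a polling client really can be this small (URL and the 30-second interval are arbitrary placeholders, comfortably within a ~5 minute data SLA):

```python
import time

import requests

API_URL = "https://api.example.com/flights?window=-1h,+4h"   # placeholder

while True:
    board = requests.get(API_URL, timeout=10).json()
    print(board)        # stand-in for redrawing the departures board
    time.sleep(30)
```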

Replay events with Google Pub/Sub

I'm looking into Google Cloud; it is very appealing, especially for data-intensive applications. I'm looking into Pub/Sub + Dataflow, and I'm trying to figure out the best way to replay events that were sent via Pub/Sub in case the processing logic changes.
As far as I can tell, Pub/Sub retention has an upper bound of 7 days and is per subscription; the topic itself does not retain data. Ideally, it would let me disable log compaction, like Kafka does, so I could replay data from the very beginning.
Now, since Dataflow promises that you can run the same jobs in batch and streaming mode, how effective would it be to simulate this desired behavior by dumping all events into Google Cloud Storage and replaying from there?
I'm also open for any other ideas.
Thank you
As you said, Cloud Pub/Sub does not currently support replays, so you need to save events somewhere to replay them later, and Cloud Storage sounds like a good place to do that.
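A hedged sketch of the "same pipeline, two sources" idea with Apache Beam (bucket, subscription, and the reprocess() logic are placeholders): the processing steps stay identical whether events come live from Pub/Sub (streaming) or from the Cloud Storage dump (batch replay).

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def reprocess(line: str) -> str:
    # new or changed processing logic goes here
    return line

def run(replay: bool):
    opts = PipelineOptions(streaming=not replay)
    with beam.Pipeline(options=opts) as p:
        if replay:
            # batch mode: re-read the archived events from Cloud Storage
            events = p | beam.io.ReadFromText("gs://my-bucket/events/*.json")
        else:
            # streaming mode: read live events from Pub/Sub (which yields bytes)
            events = (p
                      | beam.io.ReadFromPubSub(subscription="projects/my-project/subscriptions/my-sub")
                      | beam.Map(lambda b: b.decode("utf-8")))
        events | beam.Map(reprocess) | beam.Map(print)
```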
Cloud Pub/Sub now has the ability to replay previously acknowledged messages. Please see the quickstart and related blog post for information on how to use the feature.
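A short sketch of that replay via Seek (project, subscription, and the 24-hour window are placeholders; the subscription needs retain_acked_messages and a retention long enough to cover the window):

```python
import datetime

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "my-subscription")

# rewind the subscription by 24 hours: already-acknowledged messages published
# after this timestamp become deliverable again
target = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(hours=24)
subscriber.seek(request={"subscription": subscription, "time": target})
```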

Kafka Storm HDFS/S3 data flow

It is unclear if you can do a fan-out (duplication) in Kafka like you can in Flume.
I'd like to have Kafka save data to HDFS or S3 and send a duplicate of that data to Storm for real time processing. The output of Storm aggregations/analysis will be stored in Cassandra. I see some implementations flowing all data from Kafka into Storm and then two outputs from Storm. However, I'd like to eliminate the dependency of Storm for the raw data storage.
Is this possible? Are you aware of any documentation/examples/implementations like this?
Also, does Kafka have good support for S3 storage?
I saw Camus for storing to HDFS -- do you just run this job via cron to continually load data from Kafka to HDFS? What happens if a second instance of the job starts before the previous has finished? Finally, would Camus work with S3?
Thanks -- I appreciate it!
Regarding Camus,
Yeah, a scheduler that launches the job should work.
What they use at LinkedIn is Azkaban; you can look at that too.
If one launches before the other finishes, some amount of data will be read twice, since the second job will start reading from the same offsets used by the first one.
Regarding Camus with S3, I don't think that is currently in place.
Regarding Kafka support for S3 storage, there are several Kafka S3 consumers you can easily plug in to get your data saved to S3. kafka-s3-storage is one of them.
There are many possible ways to feed Storm with translated data. The main question that is not clear to me is what dependency you wish to eliminate and what tasks you wish to keep Storm from doing.
If it is considered OK for Storm to receive XML or JSON, you could easily read from the original queue using two consumers. Since each consumer controls the messages it reads, both can read the same messages. One consumer could insert the data into your storage, and the other would translate the information and send it to Storm. There is no real complexity to the feasibility of this, but I believe it is not the ideal solution for the following reasons:
Maintainability - a consumer needs supervision, so you would therefore need to supervise your running consumers. Depending on your deployment and the way you handle data types, this might be a non-trivial effort, especially when you already have Storm installed and therefore supervised.
Storm connectivity - you still need to figure out how to connect this data to Storm. Storm has a Kafka spout, which I have used and which works very well. But with the suggested architecture, this means an additional Kafka topic to place the translated messages on. This is not very efficient, as the spout could also read information directly from the original topic and translate it using a simple bolt.
The suggested way to handle this would be to form a topology, using a Kafka spout to read the raw data, one bolt to send the raw data to storage, and another one to translate it. But this solution depends on the reasons you wish to keep Storm out of the raw-data business.
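Either way, the fan-out itself is cheap in Kafka: independent consumer groups each track their own offsets on the same topic. A small sketch with kafka-python (topic, broker, and group names are hypothetical):

```python
from kafka import KafkaConsumer   # kafka-python

def consume(group_id: str) -> KafkaConsumer:
    return KafkaConsumer(
        "raw-events",                          # placeholder topic
        bootstrap_servers="localhost:9092",
        group_id=group_id,
        auto_offset_reset="earliest",          # start from the oldest retained event
    )

archiver = consume("hdfs-archiver")            # e.g. a Camus-style batch loader
analytics = consume("storm-ingest")            # e.g. feeds the Storm topology

for message in analytics:
    # the "archiver" group's progress is tracked separately, so neither
    # consumer blocks or loses data because of the other
    print(message.offset, message.value)
```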
Kafka actually retains events for a configurable period of time -- events are not purged immediately upon consumption as in other message or queue systems. This allows you to have multiple consumers that can read from Kafka either from the beginning (subject to the configurable retention time) or from an offset.
For the use case described, you would use Camus to batch-load events to Hadoop and Storm to read events off the same Kafka output. Just ensure both processes read new events before the configurable retention time expires.
Regarding Camus, ggupta1612 answered this aspect best:
A scheduler that launches the job should work. What they use at LinkedIn is Azkaban; you can look at that too.
If one launches before the other finishes, some amount of data will be read twice, since the second job will start reading from the same offsets used by the first one.