Kafka Storm HDFS/S3 data flow

It is unclear if you can do a fan-out (duplication) in Kafka like you can in Flume.
I'd like to have Kafka save data to HDFS or S3 and send a duplicate of that data to Storm for real-time processing. The output of Storm aggregations/analysis will be stored in Cassandra. I see some implementations flowing all data from Kafka into Storm and then two outputs from Storm. However, I'd like to eliminate the dependency on Storm for raw data storage.
Is this possible? Are you aware of any documentation/examples/implementations like this?
Also, does Kafka have good support for S3 storage?
I saw Camus for storing to HDFS -- do you just run this job via cron to continually load data from Kafka to HDFS? What happens if a second instance of the job starts before the previous has finished? Finally, would Camus work with S3?
Thanks -- I appreciate it!

Regarding Camus,
Yeah, a scheduler that launches the job should work.
What they use at LinkedIn is Azkaban; you can look at that too.
If one job launches before the other finishes, some amount of data will be read twice, since the second job will start reading from the same offsets used by the first one.
Regarding Camus with S3, I don't think that is currently in place.

Regarding Kafka support for S3 storage, there are several Kafka S3 consumers you can easily plug in to get your data saved to S3. kafka-s3-storage is one of them.

There are many possible ways to feed Storm with translated data. The main question that is not clear to me is which dependency you wish to eliminate and which tasks you wish to keep Storm from doing.
If it is considered OK for Storm to receive an XML or JSON, you could easily read from the original queue using two consumers. As each consumer controls the messages it reads, both could read the same messages. One consumer could insert the data into your storage and the other would translate the information and send it to Storm. There is no real complexity to the feasibility of this, but I believe it is not the ideal solution for the following reasons:
Maintainability - a consumer needs supervision, so you would need to supervise your running consumers. Depending on your deployment and the way you handle data types, this might be a non-trivial effort, especially when you already have Storm installed and therefore supervised.
Storm connectivity - you still need to figure out how to connect this data to Storm. Storm has a Kafka spout, which I have used, and it works very well. But with the suggested architecture this means an additional Kafka topic to place the translated messages on. This is not very efficient, as the spout could also read the information directly from the original topic and translate it with a simple bolt.
The suggested way to handle this would be to form a topology, using the Kafka spout to read the raw data, with one bolt sending the raw data to storage and another translating it. But this solution depends on the reasons you wish to keep Storm out of the raw data business.
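A minimal sketch of that topology, assuming the storm-kafka-client spout and illustrative broker/topic names; RawStorageBolt and TranslateBolt are hypothetical placeholders for your own storage and translation bolts:

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.TopologyBuilder;

// Sketch only: RawStorageBolt and TranslateBolt stand in for bolts you would write
// yourself (one persists the raw message to HDFS/S3, the other translates it).
public class RawAndTranslateTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // One spout reads the original (raw) Kafka topic.
        builder.setSpout("raw-events",
            new KafkaSpout<>(KafkaSpoutConfig.builder("kafka:9092", "raw-topic").build()));

        // Both bolts subscribe to the same spout, so every raw message reaches both:
        // one stores it untouched, the other translates it for the analysis flow.
        builder.setBolt("store-raw", new RawStorageBolt()).shuffleGrouping("raw-events");
        builder.setBolt("translate", new TranslateBolt()).shuffleGrouping("raw-events");

        StormSubmitter.submitTopology("raw-and-translate", new Config(), builder.createTopology());
    }
}
```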

Kafka actually retains events for a configurable period of time -- events are not purged immediately upon consumption like in other message or queue systems. This allows you to have multiple consumers that can read from Kafka either from the beginning (subject to the configurable retention time) or from an offset.
For the use case described, you would use Camus to batch-load events to Hadoop, and Storm to read events off the same Kafka topic. Just ensure both processes read new events before the configurable retention time expires.
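As an illustration of that multiple-consumer behaviour (a sketch using the standard Java consumer; the broker address, topic and group names are placeholders), two consumers with different group.id values each see every message on the topic independently:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Sketch: run one instance with group.id "hadoop-loader" and another with "storm-feed".
// Because the group ids differ, each group keeps its own offsets and reads all messages.
public class IndependentReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");   // illustrative broker address
        props.put("group.id", args[0]);                  // e.g. "hadoop-loader" or "storm-feed"
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("raw-topic"));    // both groups read the same topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
                }
            }
        }
    }
}
```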
Regarding Camus, ggupta1612 answered this aspect best:
A scheduler that launches the job should work. What they use at LinkedIn is Azkaban; you can look at that too.
If one job launches before the other finishes, some amount of data will be read twice, since the second job will start reading from the same offsets used by the first one.

Related

Can Google Dataflow connect to an API data source and insert data into BigQuery

We are exploring a few use cases where we might have to ingest data generated by SCADA/PIMS devices.
For security reasons, we are not allowed to connect directly to the OT devices or data sources. Instead, this data is exposed through REST APIs which can be used to consume it.
Please suggest whether Dataflow or any other GCP service can be used to capture this data and put it into BigQuery or any other relevant target service.
If possible, please share any relevant documentation/link around such requirements.
Yes!
Here is what you need to know: when you write an Apache Beam pipeline, your processing logic lives in DoFns that you create. These functions can call any logic you want. If your data source is unbounded or just large, you will author a "splittable DoFn" that can be read by multiple worker machines in parallel and checkpointed. You will need to figure out how to provide exactly-once ingestion from your REST API and how not to overwhelm your service; that is usually the hardest part.
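A minimal sketch of that idea (a plain DoFn, not a splittable one, and the endpoint URL plus the one-record-per-line response format are assumptions):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.transforms.DoFn;

// Sketch only: for each device id it receives, call a (hypothetical) REST endpoint
// and emit each returned line as one record downstream.
public class FetchReadingsFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String deviceId, OutputReceiver<String> out) throws Exception {
        URL url = new URL("https://scada-gateway.example.com/api/readings?device=" + deviceId);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.output(line);   // one JSON record per response line (assumed format)
            }
        } finally {
            conn.disconnect();
        }
    }
}
```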
That said, you may wish to use a different approach, such as pushing the data into Cloud Pub/Sub first. Then you would use Cloud Dataflow to read the data from Cloud Pub/Sub. This provides a natural, scalable queue between your devices and your data processing.
You can capture data with Pub/Sub and direct it to be processed in Dataflow and then saved into BigQuery (or Cloud Storage), using the appropriate IO connectors.
Stream messages from Pub/Sub by using Dataflow:
https://cloud.google.com/pubsub/docs/stream-messages-dataflow
Google-provided streaming templates (for Dataflow): PubSub->Dataflow->BigQuery:
https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming
Whole solution:
https://medium.com/codex/a-dataflow-journey-from-pubsub-to-bigquery-68eb3270c93
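If you go the Pub/Sub route, the core of such a pipeline is fairly small. A sketch (the subscription, table and the trivial JSON-to-TableRow mapping are illustrative placeholders):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

// Sketch: read messages from a Pub/Sub subscription, wrap each one in a TableRow,
// and stream the rows into an existing BigQuery table.
public class PubSubToBigQuery {
    public static void main(String[] args) {
        StreamingOptions options = PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
        options.setStreaming(true);

        Pipeline p = Pipeline.create(options);
        p.apply("ReadMessages",
                PubsubIO.readStrings().fromSubscription("projects/my-project/subscriptions/scada-sub"))
         .apply("ToTableRow", MapElements.into(TypeDescriptor.of(TableRow.class))
                .via(json -> new TableRow().set("payload", json)))   // replace with real parsing
         .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
                .to("my-project:telemetry.readings")
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
        p.run();
    }
}
```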

What is the correct way to organize multiple recordings from a single device: AWS Kinesis video streams

What is the correct way to create a searchable archive of videos from a single Raspberry Pi-type device?
Should I create a single stream per device, then whenever that device begins a broadcast it adds to that stream?
I would then create a client that lists the timestamps of those separate recordings on the stream? I have been trying to do this, but I have only gotten as far as ListFragments and GetClip, neither of which seems to do the job. What is the use case for working with fragments? I'd like to get portions of the stream separated by distinct timestamps. As in, if I have a recording from 2pm to 2:10pm, that would be a separate list item from a recording taken between 3pm and 3:10pm.
Or should I do a single stream per broadcast?
I would create a client to list the streams and then allow users to select between streams to view each video. This seems like an inefficient use of the platform, where if I have five 10-second recordings made by the same device over a few days, it creates five separate archived streams.
I realize there are implications related to data retention in here, but am also not sure how that would act if part of a stream expires, but another part does not.
I've been digging through the documentation to try to infer what best practices are related to this but haven't found anything directly answering it.
Thanks!
Hard to tell what your scenario really is. Some applications use sparsely populated streams per device and use the ListFragments API and other means to understand the sessions within the stream.
This doesn't work well if you have very sparse streams and a large number of devices. In this case, some customers implement a "stream leasing" mechanism by which their backend service or some centralized entity keeps track of a pool of streams and leases them to the requestor, potentially adding new streams to the pool. The stream lease times are then stored in a database somewhere so the consumer-side application can do its business logic. The producer application can also "embed" certain information within the stream using the FragmentMetadata concept, which really evaluates to outputting MKV tags into the stream.
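As an illustration of the ListFragments route (a sketch with the AWS SDK for Java; the stream name, region and session timestamps are placeholders):

```java
import java.util.Date;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.kinesisvideo.AmazonKinesisVideo;
import com.amazonaws.services.kinesisvideo.AmazonKinesisVideoClientBuilder;
import com.amazonaws.services.kinesisvideo.model.GetDataEndpointRequest;
import com.amazonaws.services.kinesisvideoarchivedmedia.AmazonKinesisVideoArchivedMedia;
import com.amazonaws.services.kinesisvideoarchivedmedia.AmazonKinesisVideoArchivedMediaClientBuilder;
import com.amazonaws.services.kinesisvideoarchivedmedia.model.Fragment;
import com.amazonaws.services.kinesisvideoarchivedmedia.model.FragmentSelector;
import com.amazonaws.services.kinesisvideoarchivedmedia.model.ListFragmentsRequest;
import com.amazonaws.services.kinesisvideoarchivedmedia.model.ListFragmentsResult;
import com.amazonaws.services.kinesisvideoarchivedmedia.model.TimestampRange;

// Sketch: list the fragments of one recording window (e.g. 2:00-2:10pm) inside a
// single per-device stream, treating that window as one distinct "session".
public class ListSessionFragments {
    public static void main(String[] args) {
        String streamName = "raspberry-pi-cam-01";   // illustrative stream name

        // ListFragments is served from a per-stream data endpoint, resolved first.
        AmazonKinesisVideo kvs = AmazonKinesisVideoClientBuilder.defaultClient();
        String endpoint = kvs.getDataEndpoint(new GetDataEndpointRequest()
                .withStreamName(streamName)
                .withAPIName("LIST_FRAGMENTS")).getDataEndpoint();

        AmazonKinesisVideoArchivedMedia archive = AmazonKinesisVideoArchivedMediaClientBuilder.standard()
                .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(endpoint, "us-east-1"))
                .build();

        ListFragmentsResult result = archive.listFragments(new ListFragmentsRequest()
                .withStreamName(streamName)
                .withFragmentSelector(new FragmentSelector()
                        .withFragmentSelectorType("PRODUCER_TIMESTAMP")
                        .withTimestampRange(new TimestampRange()
                                .withStartTimestamp(new Date(1690900800000L))     // placeholder: session start
                                .withEndTimestamp(new Date(1690901400000L)))));   // placeholder: session end

        for (Fragment f : result.getFragments()) {
            System.out.println(f.getFragmentNumber() + " produced at " + f.getProducerTimestamp());
        }
    }
}
```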
If you have any further scoped-down questions regarding the implementations, etc., don't hesitate to cut GitHub issues against the particular KVS assets in question, which would be the fastest way to get answers.

What happens to the data when uploading it to GCP BigQuery when there is no internet?

I am using GCP BigQuery to store some data. I have created a Pub/Sub job for the Dataflow of the event. Currently, I am facing an issue with data loss. Sometimes, due to no internet connection, the data is not uploaded to BigQuery, and the data for that time period is lost. How can I overcome this situation?
Or what kind of database should I use to store data offline and then upload it online whenever there is connectivity?
Thank You in advance!
What you need is either a retry mechanism or persistent storage. There are several ways to implement this.
You can use a message queue to store and process the data. The message queue can be either cloud based, like AWS SQS or Cloud Pub/Sub (GCP), or one you host yourself, like Kafka or RabbitMQ.
Another, somewhat less optimized, way could be to persist data locally until it is successfully uploaded to the cloud. Local storage can be some buffer, a database, etc. If the upload fails, you retry again from the storage. This is similar to the producer-consumer problem.
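A minimal sketch of that local-persistence idea (the spool path is illustrative and publishToPubSub is a hypothetical wrapper around whatever client you use to publish):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Sketch: records that fail to publish are appended to a local spool file and
// re-published later, so nothing is lost while the connection is down.
public class OfflineBuffer {
    private final Path spool = Paths.get("/var/spool/telemetry/pending.jsonl");

    // Hypothetical helper wrapping your Pub/Sub (or other) publishing client.
    private void publishToPubSub(String jsonRecord) throws IOException { /* ... */ }

    public void send(String jsonRecord) throws IOException {
        try {
            publishToPubSub(jsonRecord);                  // normal online path
        } catch (IOException offline) {
            Files.writeString(spool, jsonRecord + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);   // buffer while offline
        }
    }

    public void drainSpool() throws IOException {         // call periodically, e.g. on a timer
        if (!Files.exists(spool)) return;
        List<String> pending = Files.readAllLines(spool);
        for (String record : pending) {
            publishToPubSub(record);                      // rethrows if still offline
        }
        Files.delete(spool);                              // only reached once everything succeeded
    }
}
```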
You can use a Google Compute Engine instance to store your data and always run your data loading job from there. In that case, if your local internet connection is lost, the data will still continue to load into BigQuery.
From what I understood, you are publishing data to Pub/Sub and Dataflow does the rest to get the data into BigQuery, is that right?
The options I would suggest:
If your connection loss happens occasionally and for short amounts of time, a retry mechanism could be enough to solve the problem.
If you have frequent connection loss, or connection loss for long periods of time, I suggest that you combine a retry mechanism with some process redundancy. You could, for example, have two processes running on different machines to avoid this kind of situation. It's important to mention that for this case you could also try only a retry mechanism, but it would be more complex, because you would need to determine whether the process failed, save the data somewhere (if it's not already saved), and trigger the process again in the future.
I suggest that you take a look at Apache NiFi. It's a very powerful data flow automation tool that might help you solve this kind of issue. Apache NiFi has specific processors for pushing data directly to Pub/Sub.
As a last suggestion, you could create an automated process that runs data quality analysis after the data ingestion. With this in place you could determine more easily whether your process failed.

AWS Event-Sourcing implementation

I'm quite a newbie to microservices and Event Sourcing, and I was trying to figure out a way to deploy a whole system on AWS.
As far as I know there are two ways to implement an Event-Driven architecture:
Using AWS Kinesis Data Stream
Using AWS SNS + SQS
So my base strategy is that every command is converted to an event, which is stored in DynamoDB, and I exploit DynamoDB Streams to notify other microservices about a new event. But how? Which of the previous two solutions should I use?
The first one has the advantages of:
Message ordering
At-least-once delivery
But the disadvantages are quite problematic:
No built-in autoscaling (you can achieve it using triggers)
No message visibility functionality (apparently; asking to confirm that)
No topic subscription
Very strict read transactions: you can improve this using multiple shards, but from what I read here you must have a not-well-defined number of Lambdas with different invocation priorities and a not-well-defined strategy to avoid duplicate processing across multiple instances of the same microservice.
The second one has the advantages of:
Completely managed
Very high TPS
Topic subscriptions
Message visibility functionality
Drawbacks:
SQS messages have best-effort ordering; I still have no idea what that means.
It says "A standard queue makes a best effort to preserve the order of messages, but more than one copy of a message might be delivered out of order".
Does it mean that, given n copies of a message, the first copy is delivered in order while the others are delivered out of order relative to the other messages' copies? Or could "more than one" be "all"?
A very big thanks for any kind of advice!
I'm quite a newbie to microservices and Event Sourcing
Review Greg Young's talk Polyglot Data for more insight into what follows.
Sharing events across service boundaries has two basic approaches - a push model and a pull model. For subscribers that care about the ordering of events, a pull model is "simpler" to maintain.
The basic idea is that each subscriber tracks its own high-water mark for how many events in a stream it has processed, and queries an ordered representation of the event list to get updates.
In AWS, you would normally get this representation by querying the authoritative service for the updated event list (the implementation of which could include paging). The service might provide the list of events by querying DynamoDB directly, or by getting the most recent key from DynamoDB and then looking up cached representations of the events in S3.
In this approach, the "events" that are being pushed out of the system are really just notifications, allowing the subscribers to reduce the latency between the write into Dynamo and their own read.
I would normally reach for SNS (fan-out) for broadcasting notifications. Consumers that need bookkeeping support for which notifications they have handled would use SQS. But the primary channel for communicating the ordered events is pull.
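A rough sketch of what that pull loop looks like (the names below are illustrative, not a specific AWS API; fetchEventsAfter stands in for a call to the authoritative service's ordered event feed, and the checkpoint could live in any durable store):

```java
import java.util.List;

// Sketch: the shape of a pull-model subscriber -- a durable per-subscriber position
// (high-water mark) plus repeated queries against an ordered event feed.
public class PullSubscriber {
    record Event(long sequenceNumber, String payload) {}

    public void run() throws InterruptedException {
        long highWaterMark = loadCheckpoint();                        // last sequence number processed
        while (true) {
            // Ask the owning service for an ordered page of events after our position.
            List<Event> events = fetchEventsAfter(highWaterMark, 100);
            for (Event e : events) {
                handle(e);                                            // local business logic
                highWaterMark = e.sequenceNumber();
                saveCheckpoint(highWaterMark);                        // per-subscriber bookkeeping
            }
            if (events.isEmpty()) {
                Thread.sleep(5_000);   // caught up: wait for an SNS/SQS notification or the next poll
            }
        }
    }

    // Hypothetical helpers: in practice these would call the authoritative service's
    // API (backed by DynamoDB and/or S3) and a small durable checkpoint store.
    private long loadCheckpoint() { return 0L; }
    private void saveCheckpoint(long position) { }
    private List<Event> fetchEventsAfter(long position, int limit) { return List.of(); }
    private void handle(Event e) { }
}
```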
I haven't looked hard at Kinesis myself -- there's some general discussion in earlier questions -- but I think Kevin Sookocheff is onto something when he writes:
...if you dig a little deeper you will find that Kinesis is well suited for a very particular use case, and if your application doesn’t fit this use case, Kinesis may be a lot more trouble than it’s worth.
Kinesis’ primary use case is collecting, storing and processing real-time continuous data streams. Data streams are data that are generated continuously by thousands of data sources, which typically send in the data records simultaneously, and in small sizes (order of Kilobytes).
Another thing: the fact that I'm accessing data from another microservice's stream is an anti-pattern, isn't it?
Well, part of the point of dividing a system into microservices is to reduce the coupling between the capabilities of the system. Accessing data across the microservice boundaries increases the coupling. So there's some tension there.
But basically if I'm using a pull model I need to read data from other microservices' streams. Is it avoidable?
If you query the service you need for the information, rather than digging it out of the stream yourself, you reduce the coupling -- much like asking a service for data rather than reaching into an RDBMS and querying the tables yourself.
If you can avoid sharing the information between services at all, then you get even less coupling.
(Naive example: order fulfillment needs to know when an order has been paid for; so it needs a correlation id when the payment is made, but it doesn't need any of the other billing details.)

Microbatch loading into Redshift - Some practical questions

We are designing our workflow for micro-batch loading of data into Redshift. Basically, we get a series of requests coming through an API. The API pumps those into a queue that is later processed; each item is ETL'd and finally saved into a ready-to-load file in S3. So the steps are:
Client sends request to API.
API picks up and transforms into JSON
API queues JSON in queue
Consumer picks up from queue
Depending on its contents, it writes the request to the relevant file (which represents the table to load the data into)
My questions are around the coordination of this flow. At what point do we fire the COPY command from S3 into Redshift? I mean, this is an ongoing stream of data and each data batch is a minute wide. Are there any AWS tools that do this for us, or should we write it ourselves?
Thanks
AWS Lambda is made for this use case.
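For example, a sketch of a Lambda handler triggered by the S3 "object created" event for each batch file, firing the COPY over JDBC (the table name, IAM role and connection settings are illustrative placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;

// Sketch: each time a ready-to-load batch file lands in S3, run a COPY for it.
public class CopyBatchToRedshift implements RequestHandler<S3Event, String> {
    @Override
    public String handleRequest(S3Event event, Context context) {
        String bucket = event.getRecords().get(0).getS3().getBucket().getName();
        String key = event.getRecords().get(0).getS3().getObject().getKey();

        String copySql = "COPY events FROM 's3://" + bucket + "/" + key + "' "
                + "CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopyRole' "
                + "GZIP JSON 'auto'";

        try (Connection conn = DriverManager.getConnection(
                System.getenv("REDSHIFT_JDBC_URL"),
                System.getenv("REDSHIFT_USER"),
                System.getenv("REDSHIFT_PASSWORD"));
             Statement stmt = conn.createStatement()) {
            stmt.execute(copySql);
        } catch (Exception e) {
            throw new RuntimeException("COPY failed for " + key, e);
        }
        return "copied " + key;
    }
}
```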
Alternatively, if your queue is a Kafka queue, you might find Secor (https://github.com/pinterest/secor) useful. It dumps queue data into S3, where it can then be copied to Redshift.
Spotify's Luigi or AWS Data Pipeline are both good options for orchestrating the COPY command if you go the Secor route.
In the past, I've written similar logic a couple of times, and it's not an easy task; there is a lot of complexity there. You can use this article as a reference for the architecture.
These days, instead of implementing it yourself, you may want to look at Amazon Kinesis Firehose. It handles both the S3 logic and the writing into Redshift for you.