I am exploring AWS Kinesis for a data-processing requirement that replaces old batch ETL processing with a stream-based approach.
One of the key requirements for this project is the ability to reprocess data in cases where:
A bug is discovered and fixed and the application is redeployed, so data needs to be reprocessed from the beginning.
New features are added and the history needs to be reprocessed, either fully or partially.
These scenarios are nicely documented for Kafka here: https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Data+%28Re%29Processing+Scenarios.
I have seen the timestamp-based ShardIterator in Kinesis, and I think a Kafka-like resetter tool can be built using the Kinesis APIs, but it would be great if something like this already exists. Even if it doesn't, it would be good to learn from those who have solved similar problems.
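To make that concrete, here is roughly the replay loop I have in mind, as a minimal sketch assuming boto3; the stream name, timestamp, and record handler are placeholders, and shard pagination, resharding, and error handling are ignored.

```python
# Minimal sketch of a "replay from timestamp" tool built on the Kinesis API via boto3.
# Stream name and record handler are placeholders; resharding and errors are not handled.
import time
from datetime import datetime, timezone

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def replay(stream_name, start_time, handle_record):
    """Re-read every shard of the stream starting at start_time."""
    shards = kinesis.list_shards(StreamName=stream_name)["Shards"]
    for shard in shards:
        iterator = kinesis.get_shard_iterator(
            StreamName=stream_name,
            ShardId=shard["ShardId"],
            ShardIteratorType="AT_TIMESTAMP",
            Timestamp=start_time,
        )["ShardIterator"]
        while iterator:
            resp = kinesis.get_records(ShardIterator=iterator, Limit=1000)
            for record in resp["Records"]:
                handle_record(record)  # hand each record back to the application
            iterator = resp.get("NextShardIterator")
            if resp.get("MillisBehindLatest", 0) == 0:
                break  # caught up to the tip of this shard
            time.sleep(0.2)  # stay under the per-shard read limits

replay("my-stream", datetime(2021, 1, 1, tzinfo=timezone.utc), print)
```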
So, does anyone know of any existing resources, patterns and tools available to do this in Kinesis?
I have run into scenarios where I wanted to reprocess records that had already gone through Kinesis, and I have used Kinesis-VCR to replay them.
Kinesis-VCR records the Kinesis stream and maintains metadata about what was processed at a given time.
Later, you can use it to re-process/replay the events for any given time range.
Here is the GitHub link:
https://github.com/scopely/kinesis-vcr
Let me know if this works for you.
Thanks & Regards,
Srivignesh KN
I'm writing a concurrent tailing utility for watching multiple AWS CloudWatch log groups across many regions simultaneously. In CloudWatch Logs, log groups contain many log streams, which are rotated occasionally. To tail a log group, one must therefore find the latest log stream, read it in a loop, and occasionally check for a new log stream and start reading that in a loop.
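For context, here is a rough sketch of the per-group loop I have so far (boto3; the log-group name is a placeholder, and pagination and error handling are omitted). The open question is when it is safe to stop reading the old stream after a newer one appears.

```python
# Rough sketch of tailing one log group with boto3: follow the newest stream,
# and periodically check whether a newer stream has appeared.
# The group name is a placeholder; pagination and error handling are omitted.
import time

import boto3

logs = boto3.client("logs", region_name="us-east-1")

def latest_stream(group):
    streams = logs.describe_log_streams(
        logGroupName=group, orderBy="LastEventTime", descending=True, limit=1
    )["logStreams"]
    return streams[0]["logStreamName"] if streams else None

def tail_group(group):
    current = latest_stream(group)
    token = None
    while True:
        kwargs = {"logGroupName": group, "logStreamName": current, "startFromHead": True}
        if token:
            kwargs["nextToken"] = token
        resp = logs.get_log_events(**kwargs)
        for event in resp["events"]:
            print(event["message"])
        token = resp["nextForwardToken"]
        newest = latest_stream(group)
        if newest != current:
            # A newer stream exists, but when is it safe to stop reading `current`?
            current, token = newest, None
        time.sleep(2)

tail_group("/aws/lambda/my-function")
```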
I can't seem to find any documentation on this, but is there a set of published conditions from which I can conclude that a log stream has been "closed"? I'm assuming I'll need multiple tasks tailing multiple log streams in a group up until a certain cut-off point, but I don't know how to logically determine that a log stream has been completed so I can abandon tailing it.
Does anyone know whether such published conditions exist?
I don't think you'll find that published anywhere.
If AWS had some mechanism to know that a log stream was "closed" or would no longer receive log entries, I believe their own console for a stream would make use of it somehow. As it stands, when you view even a very old stream in the console, it will show this message at the bottom:
I know this is not a direct answer to your question, but I believe it is strong indirect evidence that AWS can't tell when a log stream is "closed" either. Resuming auto retry on an old log stream generates needless traffic, so if they had a way to know the stream was "closed", they would disable that option for such streams.
The documentation says:
A log stream is a sequence of log events that share the same source.
Since each new "source" will create a new log stream, and since CloudWatch supports many different services and options, there won't be a single answer. It depends on too many factors. For example, with the Lambda service, each lambda container will be a new source, and AWS Lambda may create new containers based on many factors like lambda execution volume, physical work in its data center, outages, changes to lambda code, etc. And that is just for one potential stream source for log streams.
You've probably explored options, but these may give some insights into ways to achieve what you're looking to do:
The CLI has a tail command that will include all log streams in a group: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/logs/tail.html. Though if you're building your own utility, the CLI likely won't be an option.
Some options are discussed in "how to view aws log real time (like tail -f)", but there is no mention of published conditions for when a stream is "closed".
"When does AWS CloudWatch create new log streams?" may yield some insights.
I am looking to get all of the Activity and Lead data in Marketo mirrored in an AWS S3 bucket so that I can build dashboards on it in QuickSight. Preferably I'd like to stream the data from Marketo into S3 in near real-time and then use Glue and Athena to connect the data to QuickSight. However, the only way to get large volumes of data out of Marketo appears to be their Bulk Extract tool (one for Leads, one for Activity data).
The problem is that these API interfaces make any attempt at near real-time streaming really clunky. Currently, I have Lambda functions being triggered every hour to pull the most recent hour of Lead/Activity data and saving it as a gzipped CSV in S3. But Marketo's Bulk Extract tool has a request queue and requests often take longer than 15 minutes to process (15 minutes being Lambda's max timeout length). So at least once a day my requests are getting dropped.
The solution seems to be to instead run this on an EC2 instance that can juggle multiple requests and patiently wait for Marketo's queue. But I'd rather not get into all the async and error-handling issues that that approach may entail if there is an easier way to accomplish this.
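For reference, the "juggle and patiently wait" logic that instance would run looks roughly like this. This is only a sketch: create_export_job, get_export_status, and download_export_file are hypothetical placeholders for Marketo's bulk-extract create/enqueue, status, and file calls, and the bucket name is made up.

```python
# Sketch of the long-poll worker an EC2 instance (or container) would run.
# The three Marketo helpers below are hypothetical placeholders, not real library calls.
import gzip
import time

import boto3

s3 = boto3.client("s3")
BUCKET = "my-marketo-dump"  # placeholder bucket name

def create_export_job(object_type, start, end):
    """Placeholder: call Marketo's bulk-extract create + enqueue endpoints here."""
    raise NotImplementedError

def get_export_status(object_type, job_id):
    """Placeholder: call Marketo's bulk-extract status endpoint here."""
    raise NotImplementedError

def download_export_file(object_type, job_id):
    """Placeholder: call Marketo's bulk-extract file endpoint and return CSV bytes."""
    raise NotImplementedError

def run_hourly_export(object_type, start, end):
    job_id = create_export_job(object_type, start, end)
    while True:
        status = get_export_status(object_type, job_id)  # e.g. "Queued", "Processing", "Completed"
        if status == "Completed":
            break
        if status in ("Failed", "Cancelled"):
            raise RuntimeError(f"Export {job_id} ended with status {status}")
        time.sleep(60)  # Marketo's queue can easily take longer than Lambda's 15-minute limit
    csv_bytes = download_export_file(object_type, job_id)
    key = f"{object_type}/{start:%Y/%m/%d/%H}.csv.gz"
    s3.put_object(Bucket=BUCKET, Key=key, Body=gzip.compress(csv_bytes))
```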
As an alternative solution, Amazon AppFlow integrates with Marketo. But last I checked, it only works with Lead data, not Activity data, and there are restrictions on the filters you have to apply to the Lead data that make it clunky to work with anyway.
On Google I have found several companies that claim to offer seamless, reliable Marketo-to-S3 ETL, but I haven't yet researched their pricing or quality.
If anyone knows of a good approach to set up reliable and cost-efficient ETL between Marketo and S3 in a short period of time, I would very much appreciate it.
In a case like this, I would be tempted to recommend using an EC2 instance to run Singer with a Marketo input and CSV output, then set up something to move the CSV over to S3 as needed. That would be the absolute cheapest ETL solution, but this does suppose you have some comfort and familiarity with Python.
Also worth noting is that Stitch, Singer's paid product equivalent, supports native S3 export, so you could always first test with a non-Marketo data source and see if that performs the way you'd like, if you prefer spending money over time.
When I analyze the contents of a Kinesis stream using a Kinesis Analytics SQL query that groups by time blocks, how certain can I be that all items in the stream are contained in the aggregates? Suppose I update the query at runtime; will the analytics application output aggregates v1 up to a point and then aggregates v2 for all items that were not yet reported on by v1? If something fails under the hood in the implementation, will a new node start reporting exactly from the point where the previous node ended? Or should you not rely on the completeness of these aggregations?
Answer posted on the AWS Forums, where I had cross-posted:
Please see the delivery semantics the service guarantees at https://docs.aws.amazon.com/kinesisanalytics/latest/dev/failover-checkpoint.html
The Analytics service maintains checkpoints, and if an update or any kind of failure happens, the application resumes from those checkpoints. Due to this design, it is possible for the service to reprocess some of the same data and produce duplicates. Downstream applications should be able to handle that.
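In other words, the guarantee is at-least-once, so the downstream store has to be idempotent. A minimal sketch of that idea, assuming each aggregate row carries a natural key such as (window_start, group_key); the DynamoDB table and attribute names here are hypothetical.

```python
# Minimal idempotent-write sketch for at-least-once aggregates: keep one row
# per (window_start, group_key), so replayed duplicates simply overwrite
# with the same value. Table and attribute names are hypothetical.
import boto3

table = boto3.resource("dynamodb").Table("analytics_aggregates")

def upsert_aggregate(window_start, group_key, value):
    table.put_item(
        Item={
            "window_start": window_start,  # partition key
            "group_key": group_key,        # sort key
            "value": value,                # duplicates overwrite, not double-count
        }
    )
```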
We are designing our workflow for doing micro-batch loading of data into Redshift. Basically, we get a series of requests coming through an API. The API pumps them into a queue that is later processed, each item is ETL'd, and finally saved into a ready-to-load file in S3. So the steps are:
Client sends a request to the API.
The API picks it up and transforms it into JSON.
The API queues the JSON in a queue.
A consumer picks it up from the queue.
Depending on its contents, the consumer writes the request to the relevant file (which represents the table to load it into).
My questions are around the coordination of this flow. At what point do we fire the COPY command from S3 into Redshift? This is an ongoing stream of data and each data batch is a minute wide. Are there any AWS tools that do this for us, or should we write it ourselves?
Thanks
AWS Lambda is made for this use case.
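For example, a function triggered by s3:ObjectCreated on the staging bucket can fire the COPY as each minute-wide file lands. A minimal sketch, assuming psycopg2 is packaged with the function, the first S3 prefix names the target table, and the connection details plus the COPY IAM role come from environment variables (all of these are assumptions about your setup, and the format options should match how your consumer writes the files).

```python
# Sketch of a Lambda handler triggered by s3:ObjectCreated on the staging bucket.
# It issues a Redshift COPY for each new file. Connection settings, the IAM role
# ARN, and the prefix-to-table mapping are placeholders.
import os

import psycopg2  # assumed to be packaged with the Lambda deployment

def handler(event, context):
    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=5439,
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    conn.autocommit = True
    cur = conn.cursor()
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        table = key.split("/")[0]  # assumption: the first prefix names the target table
        cur.execute(
            f"COPY {table} FROM 's3://{bucket}/{key}' "
            f"IAM_ROLE '{os.environ['COPY_ROLE_ARN']}' "
            f"FORMAT AS JSON 'auto' GZIP"  # adjust to match your file format
        )
    cur.close()
    conn.close()
```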
Alternatively, if your queue is a Kafka queue, you might find secor (https://github.com/pinterest/secor) useful. It dumps queue data into S3, where it can then be copied to Redshift.
Spotify's Luigi or AWS Data Pipeline are both good options for orchestrating the COPY command if you go the secor route.
In the past, I've written similar logic a couple of times, and it's not an easy task; there is a lot of complexity there. You can use these articles as a reference for the architecture.
These days, instead of implementing it yourself, you may want to look at Amazon Kinesis Firehose. It'll handle both the S3 logic and the writing into Redshift for you.
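If you go that route, the queue consumer stops writing files itself and just hands each record to the delivery stream; Firehose buffers to S3 and issues the COPY for you (note you would need one delivery stream per target table). A minimal producer sketch follows; the stream name is a placeholder, and the delivery stream is assumed to already be configured with a Redshift destination.

```python
# Minimal Firehose producer sketch: the queue consumer pushes each JSON record
# and lets the delivery stream buffer to S3 and COPY into Redshift.
# The delivery stream name is a placeholder and must already exist with a
# Redshift destination configured.
import json

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

def forward(record: dict):
    firehose.put_record(
        DeliveryStreamName="microbatch-to-redshift",
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )

forward({"id": 42, "payload": "example"})
```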
It is unclear if you can do a fan-out (duplication) in Kafka like you can in Flume.
I'd like to have Kafka save data to HDFS or S3 and send a duplicate of that data to Storm for real time processing. The output of Storm aggregations/analysis will be stored in Cassandra. I see some implementations flowing all data from Kafka into Storm and then two outputs from Storm. However, I'd like to eliminate the dependency of Storm for the raw data storage.
Is this possible? Are you aware of any documentation/examples/implementations like this?
Also, does Kafka have good support for S3 storage?
I saw Camus for storing to HDFS. Do you just run this job via cron to continually load data from Kafka to HDFS? What happens if a second instance of the job starts before the previous one has finished? Finally, would Camus work with S3?
Thanks, I appreciate it!
Regarding Camus:
Yeah, a scheduler that launches the job should work. What they use at LinkedIn is Azkaban; you can look at that too.
If one launches before the other finishes, some amount of data will be read twice, since the second job will start reading from the same offsets used by the first one.
Regarding Camus with S3, I don't think that is currently in place.
Regarding Kafka support for S3 storage, there are several Kafka S3 consumers you can easily plug in to get your data saved to S3; kafka-s3-storage is one of them.
There are many possible ways to feed Storm with translated data. The main thing that is not clear to me is which dependency you wish to eliminate and which tasks you wish to keep Storm from doing.
If it is considered OK for Storm to receive the XML or JSON, you could easily read from the original queue using two consumers. As each consumer controls the messages it reads, both could read the same messages: one consumer could insert the data into your storage, and the other could translate the information and send it to Storm. There is no real complexity in the feasibility of this, but I believe it is not the ideal solution, for the following reasons:
Maintainability - a consumer needs supervision, so you would therefore need to supervise your running consumers. Depending on your deployment and the way you handle data types, this might be a non-trivial effort, especially when you already have Storm installed and therefore supervised.
Storm connectivity - you still need to figure out how to connect this data to Storm. Storm has a Kafka spout that I have used, and it works very well. But using the suggested architecture, this means an additional Kafka topic to place the translated messages on. This is not very efficient, as the spout could also read the information directly from the original topic and translate it using a simple bolt.
The suggested way to handle this would be to form a topology, using the Kafka spout to read the raw data, one bolt to send the raw data to storage, and another to translate it. But this solution depends on the reasons you wish to keep Storm out of the raw-data business.
Kafka actually retains events for a configurable period of time; events are not purged immediately upon consumption as they are in other message or queue systems. This allows you to have multiple consumers that can read from Kafka either from the beginning (subject to the configurable retention time) or from an offset.
For the use case described, you would use Camus to batch-load events to Hadoop and Storm to read events off the same Kafka output. Just ensure both processes read new events before the configurable retention time expires.
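To illustrate the fan-out: two consumers in different consumer groups read the same topic independently, so the archival path and the Storm path never interfere with each other's offsets. A minimal sketch using kafka-python purely for illustration; the topic, broker address, group IDs, and handler functions are placeholders.

```python
# Minimal fan-out sketch with kafka-python: two consumer groups read the same
# topic independently, so the raw-storage path and the Storm-feeding path each
# track their own offsets. All names are placeholders.
from kafka import KafkaConsumer

def consume(group_id, handle):
    consumer = KafkaConsumer(
        "raw-events",
        bootstrap_servers="localhost:9092",
        group_id=group_id,
        auto_offset_reset="earliest",
    )
    for message in consumer:
        handle(message.value)

# Process 1: archive raw bytes to HDFS/S3 (handler is a placeholder).
# consume("raw-archiver", archive_to_storage)

# Process 2: translate and forward to the real-time pipeline (handler is a placeholder).
# consume("storm-feeder", send_to_storm)
```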
Regarding Camus, ggupta1612 answered this aspect best:
A scheduler that launches the job should work. What they use at LinkedIn is Azkaban; you can look at that too.
If one launches before the other finishes, some amount of data will be read twice, since the second job will start reading from the same offsets used by the first one.