How to handle bad events in a batch job on EMR

I am running an EMR job which processes around 15-20M log events. Sometimes a few log events contain badly formatted data that breaks my pipeline. I am looking for options to drop those log events into a file or a queue so that I can verify them, report them to the corresponding service, and reprocess them later, probably not in the same pipeline, since correcting the logs would take some time.
What are the best options available and widely used by companies running batch jobs?
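One common way to do this on Spark/EMR is a dead-letter output: instead of letting malformed events fail the job, tag records that fail parsing and write them to a separate S3 prefix (or push them to SQS) for later verification and reprocessing. A minimal PySpark sketch, where the S3 paths and the parse_event logic are placeholders for the real log format:

```python
# A rough sketch, not a definitive implementation: route events that fail
# parsing to a separate "dead letter" prefix instead of failing the job.
# The S3 paths and the parse_event() logic are placeholders for your format.
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log-batch-with-dead-letter").getOrCreate()

def parse_event(line):
    try:
        event = json.loads(line)
        if "timestamp" not in event:
            raise ValueError("missing timestamp")
        return ("ok", line)
    except Exception as exc:
        # Keep the raw line plus the error so the owning team can debug it later.
        return ("bad", json.dumps({"raw": line, "error": str(exc)}))

raw = spark.sparkContext.textFile("s3://my-bucket/input/2024-01-01/")
tagged = raw.map(parse_event).cache()

good = tagged.filter(lambda t: t[0] == "ok").map(lambda t: t[1])
bad = tagged.filter(lambda t: t[0] == "bad").map(lambda t: t[1])

# Good events continue through the normal pipeline; bad ones are parked in a
# dated prefix that can be reported to the owning service and replayed later.
good.saveAsTextFile("s3://my-bucket/clean/2024-01-01/")
bad.saveAsTextFile("s3://my-bucket/dead-letter/2024-01-01/")
```

If a queue is preferred, the bad records could be pushed to SQS instead; writing them to S3 simply keeps the batch job free of extra service dependencies.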

Related

Are there fixed conditions for how long a log stream is open?

I'm writing a concurrent tailing utility for watching multiple AWS CloudWatch log groups across many regions simultaneously. In CloudWatch Logs, a log group contains many log streams that are rotated occasionally, so to tail a log group one must find the latest log stream, read it in a loop, and occasionally check for a new log stream and start reading that in a loop.
I can't seem to find any documentation on this, but is there a set of published conditions from which I can conclude that a log stream has been "closed"? I'm assuming I'll need multiple tasks tailing multiple log streams in a group up until a certain cut-off point, but I don't know how to determine that a log stream has been completed so that I can abandon tailing it.
Does anyone know whether such published conditions exist?
I don't think you'll find that published anywhere.
If AWS had some mechanism to know that a log stream was "closed" or would no longer receive log entries, I believe their own console for a stream would make use of it somehow. As it stands, when you view even a very old stream in the console, it will show this message at the bottom:
I know it is not a direct answer to your question, but I believe that is strong indirect evidence that AWS can't tell when a log stream is "closed" either. Resuming auto retry on an old log stream generates needless traffic, so if they had a way to know the stream was "closed", they would presumably disable that option for such streams.
Documentation says
A log stream is a sequence of log events that share the same source.
Since each new "source" will create a new log stream, and since CloudWatch supports many different services and options, there won't be a single answer; it depends on too many factors. For example, with the Lambda service, each Lambda container is a new source, and AWS Lambda may create new containers based on many factors such as execution volume, physical work in its data centers, outages, changes to the Lambda code, etc. And that is just one potential source of log streams.
You've probably already explored some options, but these may give some insight into ways to achieve what you're looking to do:
The CLI has a tail command that will include all log streams in a group: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/logs/tail.html, though if you're building your own utility the CLI likely won't be an option.
Some options are discussed at "how to view aws log real time (like tail -f)", but there is no mention of published conditions for when a stream is "closed".
"When does AWS CloudWatch create new log streams?" may also yield some insights.
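For what it's worth, since no "closed" condition is published, one pragmatic alternative to per-stream bookkeeping is to poll the whole group with filter_log_events and advance a timestamp cursor, which is roughly what the CLI's tail does. A rough boto3 sketch, with the region, log group name, and poll interval as placeholders:

```python
# A rough sketch: tail an entire log group by polling filter_log_events and
# advancing a millisecond timestamp cursor. There is no "stream closed" check
# because no such signal is published; new streams simply show up in results.
import time

import boto3

logs = boto3.client("logs", region_name="us-east-1")

def tail_group(log_group, poll_seconds=5):
    cursor = int(time.time() * 1000)  # CloudWatch timestamps are epoch millis
    seen = set()  # event IDs already printed, to de-duplicate overlapping polls
    while True:
        paginator = logs.get_paginator("filter_log_events")
        for page in paginator.paginate(logGroupName=log_group, startTime=cursor):
            for event in page["events"]:
                if event["eventId"] in seen:
                    continue
                seen.add(event["eventId"])
                print(event["logStreamName"], event["message"].rstrip())
                cursor = max(cursor, event["timestamp"])
        time.sleep(poll_seconds)

tail_group("/aws/lambda/my-function")
```

A production version would bound the seen set and back off on throttling, but the point is that tracking individual streams and their lifetimes isn't strictly necessary just to tail a group.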

What happens to the data when uploading it to GCP BigQuery when there is no internet?

I am using GCP BigQuery to store some data. I have created a Pub/Sub job for the Dataflow of the event. Currently, I am facing an issue with data loss: sometimes, due to no internet connection, the data is not uploaded to BigQuery and the data for that time period is lost. How can I overcome this situation?
Or what kind of database should I use to store data offline and then upload it whenever there is connectivity?
Thank you in advance!
What you need is either a retry mechanism or persistent storage. There are several ways to implement this.
You can use a message queue to store and process the data. The queue can be either cloud based, like AWS SQS or Cloud Pub/Sub (GCP), or self-hosted, like Kafka or RabbitMQ.
Another, somewhat less optimized, approach is to persist data locally until it is successfully uploaded to the cloud. The local storage can be a buffer or a database; if an upload fails, you retry from that storage. This is similar to the producer-consumer problem.
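To make that second approach concrete, here is a minimal sketch of the "persist locally, then drain" idea, assuming SQLite as the local buffer and the google-cloud-pubsub client upstream; the project, topic, and table names are placeholders, not part of the original answer:

```python
# A sketch of "persist locally, then drain": every event is written to a
# local SQLite buffer first, and rows are deleted only after Pub/Sub
# acknowledges them, so a connectivity drop just leaves them queued.
import sqlite3

from google.cloud import pubsub_v1

DB_PATH = "pending_events.db"
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")

def init_db():
    with sqlite3.connect(DB_PATH) as db:
        db.execute("CREATE TABLE IF NOT EXISTS pending (id INTEGER PRIMARY KEY, payload BLOB)")

def enqueue(payload: bytes):
    # Always write locally first, so nothing is lost while offline.
    with sqlite3.connect(DB_PATH) as db:
        db.execute("INSERT INTO pending (payload) VALUES (?)", (payload,))

def drain():
    # Try to publish everything in the buffer, oldest first.
    with sqlite3.connect(DB_PATH) as db:
        rows = db.execute("SELECT id, payload FROM pending ORDER BY id").fetchall()
    for row_id, payload in rows:
        try:
            publisher.publish(topic_path, payload).result(timeout=30)
        except Exception:
            break  # still offline; keep the row and retry on the next drain
        with sqlite3.connect(DB_PATH) as db:
            db.execute("DELETE FROM pending WHERE id = ?", (row_id,))

init_db()
enqueue(b'{"sensor": "a", "value": 42}')
drain()
```

drain() can run on a timer or be triggered whenever connectivity is restored.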
You can use a Google Compute Engine instance to store your data and always run your data loading job from there. In that case, if your local internet connection is lost, data will still continue to load into BigQuery.
From what I understand, you are publishing data to Pub/Sub and Dataflow does the rest to get the data into BigQuery, is that right?
The options I would suggest:
If your connection loss happens occasionally and for short periods, a retry mechanism could be enough to solve this problem (a minimal backoff sketch follows these options).
If you have frequent connection loss, or lose the connection for long periods of time, I suggest you combine a retry mechanism with some process redundancy. You could, for example, have two processes running on different machines to avoid this kind of situation. It's worth mentioning that for this case you could also try a retry mechanism alone, but it would be more complex because you would need to determine whether the process failed, save the data somewhere (if it's not already saved), and trigger the process again later.
I suggest you take a look at Apache NiFi. It's a very powerful data flow automation tool that might help you solve this kind of issue, and it has processors to push data directly to Pub/Sub.
As a last suggestion, you could create an automated process to perform data quality analysis after the ingestion. With this in place, you could determine more easily whether your process failed.
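As a minimal sketch of the retry mechanism mentioned in the first option, here is a generic exponential-backoff wrapper; publish_row() is a placeholder for whatever call actually pushes a record toward Pub/Sub or BigQuery:

```python
# A minimal sketch of a retry mechanism with exponential backoff and jitter.
# publish_row() is a placeholder for the real Pub/Sub publish or BigQuery
# insert, which may raise ConnectionError while the network is down.
import random
import time

def publish_row(row):
    # Placeholder: replace with the real upload call.
    print("published", row)

def publish_with_retry(row, max_attempts=5, base_delay=1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return publish_row(row)
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up; a persistent buffer or alert should take over
            # Back off 1s, 2s, 4s, ... plus jitter before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.random())

publish_with_retry({"sensor": "a", "value": 42})
```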

Suppress message: The Mapping task failed to run. Another instance of the task is currently running

I have set up multiple jobs in Informatica Cloud to sync data from Oracle with Informatica objects. The jobs are scheduled to run every 3 minutes as per the business requirements. Sometimes a job runs long due to a Secure Agent resource crunch, and my team used to receive multiple emails like the one below:
The Mapping task failed to run. Another instance of the task is currently running.
Is there any way to suppress these failure emails in the mapping?
This won't be set at the mapping level but at the session or integration service level; see the following: https://network.informatica.com/thread/7312
This type of error occurs when the workflow/session is already running and you try to re-run it. You can use a script to check whether it is already running and, if so, wait (a rough sketch of such a guard follows the two options below). If you want to run multiple instances of the same workflow:
In the Workflow Properties, enable 'Configure Concurrent Execution' by checking the checkbox.
Once it is enabled, you have two options:
Allow concurrent run with the same instance name
Allow concurrent run only with a unique instance name
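To illustrate the "check by script, then wait" suggestion, here is a rough Python sketch; is_task_running() and trigger_task() are hypothetical placeholders to be wired to whatever your environment exposes (Informatica's REST API, pmcmd output, a status table), so only the wait-then-trigger pattern is the point:

```python
# A rough sketch of a guard that waits while a previous run is still active,
# so a new instance is never started on top of it (and no failure email is
# generated). is_task_running() and trigger_task() are hypothetical stubs.
import time

def is_task_running(task_name: str) -> bool:
    # Hypothetical: replace with a real status lookup for your environment.
    return False

def trigger_task(task_name: str) -> None:
    # Hypothetical: replace with the real call that starts the mapping task.
    print(f"starting {task_name}")

def run_when_free(task_name: str, poll_seconds: int = 30, max_wait_seconds: int = 600) -> None:
    waited = 0
    while is_task_running(task_name):
        if waited >= max_wait_seconds:
            raise TimeoutError(f"{task_name} still running after {max_wait_seconds}s")
        time.sleep(poll_seconds)
        waited += poll_seconds
    trigger_task(task_name)

run_when_free("oracle_sync_task")
```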
Notifications configured at the task level override those at the org level, so you could do this by configuring notifications at the task level and only sending warnings to the broader list. That said, some people should still receive the error-level notification, because if it recurs multiple times within a short period there may be another issue.
Another thought: a batch process that runs every three minutes but takes longer than three minutes is usually an opportunity to improve the design. Often a business requirement for short batch intervals comes from a "near real time" desire. If you also have the Cloud Application Integration service, you may want to set up an event to trigger the batch run. If there is still overlap based on events, you can use the Cloud Data Integration API to create a dynamic version of the task each time. For really simple integrations you could perform the integration in CAI, which does allow multiple instances to run at the same time.
HTH

How to handle reprocessing scenarios in AWS Kinesis?

I am exploring AWS Kinesis for a data processing requirement that replaces old batch ETL processing with a stream-based approach.
One of the key requirements for this project is the ability to reprocess data in cases when
A bug is discovered and fixed and the application is redeployed. Data needs to be reprocessed from the beginning.
New features are added and the history needs to be reprocessed either fully or partially.
The scenarios are very nicely documented here for Kafka - https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Data+%28Re%29Processing+Scenarios.
I have seen the timestamp-based ShardIterator in Kinesis, and I think a Kafka-like resetter tool can be built using the Kinesis APIs, but it would be great if something like this already exists. Even if it doesn't, it would be good to learn from those who have solved similar problems.
So, does anyone know of any existing resources, patterns and tools available to do this in Kinesis?
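For concreteness, here is a rough sketch of what such a timestamp-based resetter could look like, using boto3's AT_TIMESTAMP shard iterator; the stream name and restart timestamp are placeholders, and a real tool would also checkpoint progress and handle resharding:

```python
# A rough sketch of a timestamp-based replay using the Kinesis APIs directly.
# The stream name and restart timestamp are placeholders.
import datetime
import time

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def process(data: bytes):
    # Placeholder for the application's record handler.
    print(data)

def replay(stream_name: str, start_time: datetime.datetime):
    shards = kinesis.list_shards(StreamName=stream_name)["Shards"]
    for shard in shards:
        iterator = kinesis.get_shard_iterator(
            StreamName=stream_name,
            ShardId=shard["ShardId"],
            ShardIteratorType="AT_TIMESTAMP",
            Timestamp=start_time,
        )["ShardIterator"]
        while iterator:
            resp = kinesis.get_records(ShardIterator=iterator, Limit=1000)
            for record in resp["Records"]:
                process(record["Data"])  # hand off to the normal consumer logic
            if resp.get("MillisBehindLatest", 0) == 0:
                break  # caught up to the tip of this shard
            iterator = resp.get("NextShardIterator")
            time.sleep(0.2)  # stay under the per-shard read limits

replay("my-stream", datetime.datetime(2024, 1, 1, tzinfo=datetime.timezone.utc))
```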
I have run into scenarios where I wanted to reprocess records already processed from Kinesis, and I have used Kinesis-VCR for re-processing them.
Kinesis-VCR records the Kinesis streams and maintains metadata about the files processed by Kinesis at a given time.
Later, we can use it to re-process/replay the events for any given time range.
Here is the GitHub link for the same:
https://github.com/scopely/kinesis-vcr
Let me know if this works for you.
Thanks & Regards,
Srivignesh KN

Microbatch loading into Redshift - Some practical questions

We are designing our workflow for microbatch loading of data into Redshift. Basically, we get a series of requests coming through an API. The API pumps those into a queue that is processed later: each item is ETL'd and finally saved as a ready-to-load file in S3. So the steps are:
Client sends a request to the API.
The API picks it up and transforms it into JSON.
The API queues the JSON.
A consumer picks it up from the queue.
Depending on the contents, the consumer writes the request to the relevant file (which represents the table to load the data into).
My questions are around the coordination of this flow. At what point do we fire the COPY command from S3 into Redshift? This is an ongoing stream of data and each data batch is a minute wide. Are there any AWS tools that do this for us, or should we write it ourselves?
Thanks
AWS Lambda is made for this use case.
Alternatively, if your queue is a Kafka queue, you might find Secor (https://github.com/pinterest/secor) useful. It dumps queue data into S3, where it can then be copied into Redshift.
Spotify's Luigi or AWS Data Pipeline are both good options for orchestrating the COPY command if you go the Secor route.
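If you go the Lambda route, a minimal sketch is a function triggered by the S3 ObjectCreated notification on the ready-to-load prefix that issues the COPY for that file; the connection details, IAM role, and key-to-table naming convention here are placeholders:

```python
# A minimal sketch, assuming each ready-to-load file triggers an S3
# ObjectCreated notification and that the first key prefix segment names the
# target table. Connection details and the IAM role are placeholders; real
# credentials would come from Secrets Manager or environment variables.
import urllib.parse

import psycopg2  # bundled with the Lambda deployment package or a layer

CONN_INFO = "host=my-cluster.example.us-east-1.redshift.amazonaws.com port=5439 dbname=analytics user=loader password=..."
IAM_ROLE = "arn:aws:iam::123456789012:role/redshift-copy-role"

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        table = key.split("/")[0]  # e.g. "orders/2020-05-01-12-03.json" -> orders
        with psycopg2.connect(CONN_INFO) as conn:
            with conn.cursor() as cur:
                cur.execute(
                    f"COPY {table} FROM %s IAM_ROLE %s FORMAT AS JSON 'auto';",
                    (f"s3://{bucket}/{key}", IAM_ROLE),
                )
        conn.close()
```

Firehose, mentioned below, removes the need for this glue entirely, at the cost of less control over batching and table routing.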
In the past, I've written similar logic a couple of times, and it's not an easy task; there is a lot of complexity involved. You can use these articles as a reference for the architecture.
These days, instead of implementing this yourself, you may want to look at Amazon Kinesis Firehose. It will handle both the S3 logic and the loading into Redshift for you.