Can Kinesis Firehose do filtering? - amazon-web-services

So we have 100 different types of messages coming into our Kinesis stream. We only want to save 4 types. I know Kinesis can transform messages, but can it filter as well? How is this done?

Filtering is just a transform in which you decide not to output anything. You indicate this by returning the record with its result set to "Dropped", as per the documentation.
This post shows an example transform whose logic covers several cases: letting records pass through untouched (result "Ok"), transforming a record and outputting it (again, result "Ok"), dropping, i.e. filtering out, a record (result "Dropped"), and reporting an error with the result "ProcessingFailed".
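As a rough sketch (not from the linked post), a Firehose transformation Lambda in Python that keeps only a few message types and drops everything else could look like this; the type field and the set of kept types are assumptions for illustration:
import base64
import json

# Hypothetical set of message types to keep; everything else is dropped.
KEEP_TYPES = {"order", "payment", "refund", "signup"}

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        if payload.get("type") in KEEP_TYPES:
            # Pass the record through (optionally transformed) with result "Ok".
            output.append({"recordId": record["recordId"],
                           "result": "Ok",
                           "data": record["data"]})
        else:
            # Filtering: mark the record as "Dropped" so Firehose discards it.
            output.append({"recordId": record["recordId"],
                           "result": "Dropped",
                           "data": record["data"]})
    return {"records": output}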

Related

Join Kinesis Streams

I have two Kinesis streams and I would like to create a third stream that is the intersection of these two streams. My goal is to have a stream processor respond to an event on the resulting third stream without having to write a consumer that performs this intersection.
A record on stream a would be:
{
  "customer_id": 3,
  "first_name": "Marcy",
  "last_name": "Shurtleff"
}
and a record on stream b would be:
{
  "payment_id": 10001,
  "customer_id": 1,
  "amount": 234.56,
  "date": "2018-09-07T10:25:43.511Z"
}
I would like to perform a join (like I can in KSQL with Kafka) that will join stream a.customer_id to stream b.customer_id resulting in:
{
  "customer_id": 3,
  "first_name": "Marcy",
  "last_name": "Shurtleff",
  "payment_id": 10001,
  "amount": 234.56,
  "date": "2018-09-07T10:25:43.511Z"
}
(or whatever sql-like projection I choose).
I know this is possible with Kafka and KSQL, but is this possible with Kinesis?
Kinesis Data Analytics will not help as you cannot use more than one stream as a datasource in that product and you can only perform joins on 'in-application' streams.
I recently implemented a solution that does exactly what you are asking using Kinesis Data Analytics. Indeed, a KDA application takes only one stream as its input data source, so this limitation makes it necessary to standardize the schema of the data flowing into KDA when you are dealing with multiple sets of streams. To work around this, a small Python snippet can be used inside a Lambda to flatten and standardize any event by converting its entire payload to a JSON-encoded string. The image below shows how my whole solution is deployed:
The process of standardizing and flattening the streams is illustrated in detail below:
Note that after this stage both JSON events have the same schema and no nested fields, yet all the information is preserved. In addition, the ssn field is placed in the header so it can be used as the join key inside the KDA application.
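As a rough illustration of that flattening step (not the author's actual code), a preprocessing Lambda in Python could promote the join key into a fixed header and serialize the rest of the event into a single string field; the field names ssn, event_type and payload are assumptions:
import base64
import json

def handler(event, context):
    output = []
    for record in event["records"]:
        raw = json.loads(base64.b64decode(record["data"]))
        # Standardized, flat schema: a fixed header (join key + source type)
        # plus the whole original event serialized as one JSON-encoded string.
        flattened = {
            "ssn": raw.get("ssn"),                # join key promoted to the header
            "event_type": raw.get("event_type"),  # distinguishes the source streams
            "payload": json.dumps(raw),           # everything else, kept as a string
        }
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(flattened).encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}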
For more information about this solution, check this article I wrote: https://medium.com/@guilhermeepassos/joining-and-enriching-multiple-sets-of-streaming-data-with-kinesis-data-analytics-24b4088b5846

Delays when streaming data from the Google Search console API to BigQuery

So I've been trying to stream data from Google Search Console API to BigQuery in real time.
The data are retrieved from GSC API and streamed to the BigQuery stream buffer. However, I experience high latency before the streaming buffer can be flushed (up to 2 hours or more). So, the data stays in the streaming buffer but is not in the table.
The data are also not visible in the preview and the table size is 0B with 0 rows (actually after waiting for >1day I still see 0B even though there are more than 0 rows).
Another issue is that, some time after the data is stored in the table (table size and number of rows are correct), it simply disappears from it and appears in the streaming buffer (I only saw this once). -> This was explained by the second bullet in shollyman's answer.
What I want is to have the data in the table in real time. According to the documentation this seems possible but doesn't work in my case (2h of delay as stated above).
Here's the code responsible for that part:
import uuid
from googleapiclient.discovery import build

for row in response['rows']:
    keys = ','.join(row['keys'])
    # Stream one row at a time to BigQuery via the insertAll (streaming) API
    row_to_stream = {'keys': keys, 'f1': row['f1'], 'f2': row['f2'],
                     'ctr': row['ctr'], 'position': row['position']}
    insert_all_data = {
        "kind": "bigquery#tableDataInsertAllRequest",
        "skipInvalidRows": True,
        "ignoreUnknownValues": True,
        "rows": [{
            "insertId": str(uuid.uuid4()),
            "json": row_to_stream,
        }]
    }
    build('bigquery', 'v2', cache_discovery=False).tabledata().insertAll(
        projectId=projectid,
        datasetId=dataset_id,
        tableId=tableid,
        body=insert_all_data).execute(num_retries=5)
I've seen questions that seem very similar to mine on here but I haven't really found an answer. I therefore have 2 questions.
1. What could cause this issue?
Also, I'm new to GCP and I've seen other options (at least they seemed like options to me) for real time streaming of data to BigQuery (e.g., using PubSub and a few projects around real time Twitter data analysis).
2. How do you pick the best option for a particular task?
By default, the BigQuery web UI doesn't automatically refresh the state of a table. There is a Refresh button when you click into the details of a table, that should show you the updated size information for both managed storage and the streaming buffer (displayed below the main table details). Rows in the buffer are available to queries, but the preview button may not show results until some data is extracted from the streaming buffer to managed storage.
I suspect the case where you observed data disappearing from managed storage and appearing back in the streaming buffer may have been a case where the table was deleted and recreated with the same name, or was truncated in some fashion and streaming restarted. Data doesn't transition from managed storage back to the buffer.
Deciding what technology to use for streaming depends on your needs. Pub/Sub is a great choice when you have multiple consumers of the information (multiple pub/sub subscribers consuming the same stream of messages independently), or you need to apply additional transformations of the data between the producer and consumer. To get the data from pub/sub to BigQuery, you'll still need a subscriber to write the messages into BigQuery, as the two have no direct integration.
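As an illustrative sketch only (not from the original answer), a minimal Pub/Sub pull subscriber that writes each message into BigQuery with the google-cloud-pubsub and google-cloud-bigquery client libraries could look like this; the project, subscription and table names are placeholders:
import json
from google.cloud import bigquery, pubsub_v1

# Placeholder identifiers; replace with your own project, subscription and table.
SUBSCRIPTION = "projects/my-project/subscriptions/gsc-rows"
TABLE_ID = "my-project.my_dataset.gsc_data"

bq_client = bigquery.Client()
subscriber = pubsub_v1.SubscriberClient()

def callback(message):
    row = json.loads(message.data.decode("utf-8"))
    # insert_rows_json uses the same streaming insert mechanism as insertAll.
    errors = bq_client.insert_rows_json(TABLE_ID, [row])
    if errors:
        print("Streaming insert failed:", errors)
        message.nack()
    else:
        message.ack()

streaming_pull_future = subscriber.subscribe(SUBSCRIPTION, callback=callback)
streaming_pull_future.result()  # block and process messages until interrupted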

How to handle limitation of Dynamodb BatchWriteItem

Just wondering what's the best way to handle the fact that DynamoDB can only write batches of at most 25 items.
I have 3 Lambdas (there are more but I am simplifying down so we don't get side tracked)
GetNItemsFromExternalSourceLambda
SaveAllToDynamoDBLambda
AnalyzeDynamoDBLambda
Here is what happens:
GetNItemsFromExternalSourceLambda can potentially fetch 250 items in one REST call it makes to an external API.
It then invokes SaveAllToDynamoDBLambda and passes a) all these items and b) paging info, e.g. {pageNum:1, pageSize:250, numPages:5}, in the payload.
SaveAllToDynamoDBLambda needs to save all items to a DynamoDB table and then, based on the paging info, will either a) re-invoke GetNItemsFromExternalSourceLambda (to fetch the next page of data) or b) invoke AnalyzeDynamoDBLambda.
These steps can obviously loop many times until we have got all the data from the external source, before finally proceeding to the last step.
The final AnalyzeDynamoDBLambda is then the lambda that processes all the data that was fetched and saved to the DB.
So my problem lies in the fact that SaveAllToDynamoDBLambda can only write 25 items in a batch, which means I would have to tell my GetNItemsFromExternalSourceLambda to only fetch 25 items at a time from the external source, which is not ideal (being able to fetch 250 at a time would be a lot better).
One could extend the timeout period of SaveAllToDynamoDBLambda so that it could do multiple batch writes inside one invocation, but I don't like that approach.
I could also zip up the 250 items and save them to S3 in one upload, which could trigger a stream event, but I would have the same issue on the other side of that solution.
Just wondering what's a better approach, while still being able to invoke AnalyzeDynamoDBLambda only after all the info from all the REST calls has been saved to DynamoDB.
Basically the problem is that you need a way of subdividing the large batch (250 items in this case) into batches of 25 or less.
A very simple solution would be to use a Kinesis stream in the middle. Kinesis can take up to 500 records per PutRecords call. You can then use GetRecords with a Limit of 25 and put the records into Dynamo with a single BatchWriteItem call.
Make sure you look at the size limits as well before deciding if this solution will work for you.
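A rough sketch of that idea with boto3, assuming a hypothetical stream named items-buffer and table named my-table (placeholder names, single-shard consumption kept deliberately simple):
import json
import boto3

kinesis = boto3.client("kinesis")
dynamodb = boto3.client("dynamodb")

def buffer_items(items):
    # Producer side: up to 500 records per PutRecords call.
    kinesis.put_records(
        StreamName="items-buffer",
        Records=[{"Data": json.dumps(item).encode("utf-8"),
                  "PartitionKey": str(item["id"])}   # "id" is a hypothetical field
                 for item in items])

def drain_to_dynamo(shard_iterator):
    # Consumer side: read at most 25 records at a time, the BatchWriteItem limit.
    while shard_iterator:
        resp = kinesis.get_records(ShardIterator=shard_iterator, Limit=25)
        if resp["Records"]:
            # NOTE: a real implementation should also retry any UnprocessedItems
            # returned by batch_write_item.
            dynamodb.batch_write_item(RequestItems={
                "my-table": [{"PutRequest": {"Item": {
                    "id": {"S": str(json.loads(r["Data"])["id"])},
                    "payload": {"S": r["Data"].decode("utf-8")},
                }}} for r in resp["Records"]]
            })
        if not resp["Records"] and resp.get("MillisBehindLatest", 0) == 0:
            break  # caught up with the stream
        shard_iterator = resp.get("NextShardIterator")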

Dataflow pipeline waits for elements from all streams before performing GroupBy

We are running a Dataflow job that handles multiple input streams. Some of them are high traffic and some of them rarely get messages through. We are joining all streams with a "shared" stream that contains information relevant to all elements. This is a simplified example of the pipeline:
I noticed that the job will not produce any output until both streams contain some traffic.
For example, let's suppose that Stream 1 gets a steady flow of traffic, whereas Stream 2 does not produce any messages for a period of time. During this time, the job's DAG will show elements being accumulated in the GroupByKey step, but nothing will be propagated beyond it. I can also see the Flatten PCollections step showing input elements for the left side of the graph but not the right one. This creates a problem when dealing with high traffic and low traffic streams in the same job, since it will cause output to be delayed for as long as it takes Stream 2 to pick up messages.
I am not sure if the observation is correct, but I wanted to ask if this is how Flatten/GroupByKey works in general and if so, if the issue we're seeing can be avoided through an alternative way of constructing the pipeline.
(Example JobID: 2017-02-10_06_48_01-14191266875301315728)
As described in the documentation of group-by-key, the default behavior is to wait for all data within the window to have arrived -- this is necessary to ensure correctness of downstream results.
Depending on what you are trying to do, you may be able to use triggers to cause the aggregates to be output earlier.
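For illustration only (not from the original answer), an early-firing trigger in the Apache Beam Python SDK looks roughly like this; the window size and firing interval are arbitrary, and merged_streams stands in for the PCollection produced by the Flatten step:
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

windowed = (
    merged_streams
    | "Window" >> beam.WindowInto(
        window.FixedWindows(60),  # 60-second windows (arbitrary choice)
        # Emit speculative results every 30 seconds of processing time,
        # then a final result once the watermark passes the end of the window.
        trigger=trigger.AfterWatermark(early=trigger.AfterProcessingTime(30)),
        accumulation_mode=trigger.AccumulationMode.DISCARDING)
    | "Group" >> beam.GroupByKey())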
You may also be able to use the slow-stream as a side-input to the processing of the fast-stream.
If you're still stuck, it would help if you could describe in more detail the contents of the streams and how you're trying to use them, since more detailed answers depend on the goal.

WSO2 CEP Multiple rows in resultset

I wanted to know if a WSO2 CEP/Siddhi query supports returning multiple rows and, if yes, how data from those rows can be mapped to the output XML. For example, my event stream has a field statusCode which can have the values A/B/C. I wanted to write a query which gives me the count per status type for the past 5 minutes, e.g. A-10, B-5, C-2. In the current query I used group by statusCode to get the count per status:
My query: ...insert into TestStream statusCode, count(statusCode) as count group by statusCode
and my output XML is something like
<statusSmry>
  <status>{statusCode}</status>
  <count>{count}</count>
</statusSmry>
The output I receive is something like:
<statusSmry>
  <status>A</status>
  <count>10</count>
</statusSmry>
.....
<statusSmry>
  <status>B</status>
  <count>5</count>
</statusSmry>
....
<statusSmry>
  <status>C</status>
  <count>2</count>
</statusSmry>
Is it possible to get the results of the query in a single XML? i.e. in the above case, the counts for A, B and C in a single XML?
Thanks
Rajiv
What you asked is not possible in Siddhi.
This is because whenever there is an input event the total count is updated, and at the same time an output for the corresponding updated group needs to be triggered to notify the subscribers. Since this is a realtime process, Siddhi cannot accumulate all the events and output them as one event/XML. If it were to accumulate the events, there would be the questions of how long it should accumulate for (1 second? 1 day?) and in what format the output should be sent, so accumulation is currently not supported (as of WSO2 CEP 2.0.1).
If you need this feature then you have to send the output of CEP to an ESB and run some kind of an aggregation process.
Suho