I have two Kinesis streams and I would like to create a third stream that is the intersection of these two streams. My goal is to have a stream processor respond to an event on the resulting third stream without having to write a consumer that performs this intersection.
A record on stream a would be:
{
  "customer_id": 3,
  "first_name": "Marcy",
  "last_name": "Shurtleff"
}
and a record on stream b would be:
{
  "payment_id": 10001,
  "customer_id": 3,
  "amount": 234.56,
  "date": "2018-09-07T10:25:43.511Z"
}
I would like to perform a join (like I can in KSQL with Kafka) that will join stream a.customer_id to stream b.customer_id resulting in:
{
  "customer_id": 3,
  "first_name": "Marcy",
  "last_name": "Shurtleff",
  "payment_id": 10001,
  "amount": 234.56,
  "date": "2018-09-07T10:25:43.511Z"
}
(or whatever sql-like projection I choose).
I know this is possible with Kafka and KSQL, but is this possible with Kinesis?
Kinesis Data Analytics will not help, as you cannot use more than one stream as a data source in that product, and you can only perform joins on 'in-application' streams.
I recently implemented a solution that does exactly what you are asking using Kinesis Data Analytics. Indeed, a KDA in-application stream takes only one stream as its input data source, so this limitation makes it necessary to standardize the schema of the data flowing into KDA when you are dealing with multiple sets of streams. To work around this, a Python snippet inside a Lambda function can be used to flatten and standardize any event by converting its entire payload to a JSON-encoded string. The image below shows how my whole solution is deployed:
The process of standardizing and flattening the streams is illustrated in detail below:
Note that after this stage both JSON events have the same schema and no nested fields, yet all information is preserved. In addition, the ssn field is placed in the header so it can be used as the join key inside the KDA application.
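For a rough idea of what that flattening step can look like, here is a minimal Python Lambda sketch (my own illustration, not the code from the article): it decodes each incoming Kinesis record, promotes an assumed join key into a header field, and re-encodes the whole original payload as one JSON string so every event ends up with the same flat schema.
import base64
import json

def lambda_handler(event, context):
    standardized = []
    for record in event['Records']:
        raw = json.loads(base64.b64decode(record['kinesis']['data']))
        standardized.append({
            'join_key': raw.get('customer_id'),   # header field used as the join key in KDA
            'source': record['eventSourceARN'],   # which stream the event came from
            'payload': json.dumps(raw)            # entire original event as a JSON-encoded string
        })
    # Writing the standardized events to the stream feeding KDA
    # (e.g. with boto3 put_records) is omitted here.
    return standardized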
For more information about this solution, check this article I wrote: https://medium.com/@guilhermeepassos/joining-and-enriching-multiple-sets-of-streaming-data-with-kinesis-data-analytics-24b4088b5846
Related
So we have 100 different types of messages coming into our Kinesis stream. We only want to save 4 types. I know Kinesis can transform messages, but can it filter as well? How is this done?
Filtering is just a transform in which you decide not to output anything. You indicate this by sending the result with a value "Dropped" as per the documentation.
You can find an example of a transform in this post; the logic covers several cases: letting records pass through without any transform (status "Ok"), transforming and outputting a record (again, status "Ok"), dropping (i.e. filtering) a record (status "Dropped"), and signalling an error with the status "ProcessingFailed".
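As a concrete illustration, here is a minimal sketch of a Kinesis Data Firehose transformation Lambda that filters; the 'type' field and the set of types to keep are assumptions of mine, while the recordId/result/data response shape and the "Ok"/"Dropped" result values come from the transformation contract.
import base64
import json

KEEP_TYPES = {'type_a', 'type_b', 'type_c', 'type_d'}   # the 4 types to save (placeholders)

def lambda_handler(event, context):
    output = []
    for record in event['records']:
        payload = json.loads(base64.b64decode(record['data']))
        result = 'Ok' if payload.get('type') in KEEP_TYPES else 'Dropped'
        output.append({
            'recordId': record['recordId'],
            'result': result,            # 'Dropped' records are filtered out
            'data': record['data'],      # pass the original payload through unchanged
        })
    return {'records': output}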
I have data I have to join at the record level. For example, data about users is coming in from different source systems, but there is no common primary key or user identifier.
Example Data
Source System 1:
{userid = 123, first_name="John", last_name="Smith", many other columns...}
Source System 2:
{userid = EFCBA-09DA0, fname="J.", lname="Smith", many other columns...}
There are about 100 rules I can use to compare one record to another
to see if the customer in source system 1 is the same as the customer in source system 2.
Some rules may be able to infer record values and add data to a master record about a customer.
Because some rules may infer/add data to any particular record, the rules must be re-applied again when a record changes.
We have millions of records per day that we'd have to unify.
Apache Beam / Dataflow implementation
An Apache Beam DAG is by definition acyclic, but I could republish the data through Pub/Sub back into the same DAG to make the algorithm effectively cyclic.
I could create a PCollection of hashmaps that continuously does a self-join against all other elements, but this seems like an inefficient method.
The immutability of a PCollection is a problem if I want to be constantly modifying records as they pass through the rules. This sounds like it would be more efficient with Flink Gelly or Spark GraphX.
Is there any way you may know in dataflow to process such a problem efficiently?
Other thoughts
Prolog: I tried running a subset of this data with a subset of the rules, but SWI-Prolog did not seem scalable, and I could not figure out how I would continuously emit the results to other processes.
Drools/Jess/Rete: Forward chaining would be perfect for the inference and efficient partial application, but this algorithm is more about applying many, many rules to individual records rather than inferring record information from possibly related records.
Graph database: Something like Neo4j or Datomic would be nice since joins are at the record level rather than row/column scans, but I don't know if it's possible to do something similar in Beam.
BigQuery or Spanner: Brute-forcing these rules in SQL and doing full table scans per record is really slow. It would be much preferred to keep the graph of all records in memory and compute in-memory. We could also try to concatenate all columns and run multiple compare-and-update operations across all columns.
Or maybe there's a more standard way of solving this class of problems.
It is hard to say from what I have read so far which solution would work best for you. I would try to split the problem further and tackle different aspects separately.
From what I understand, the goal is to combine the matching records that represent the same thing in different sources:
records come from a number of sources:
it is logically the same data but formatted differently;
there are rules to tell if the records represent the same entity:
collection of rules is static;
So, the logic probably roughly goes like this (see the sketch after the list):
read a record;
try to find existing matching records;
if matching record found:
update it with new data;
otherwise save the record for future matching;
repeat;
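A very rough sketch of that loop, written as a Beam stateful DoFn in Python purely to make the shape concrete: the blocking key, the matches() placeholder standing in for the ~100 rules, and the naive dict merge are all assumptions of mine, not a recommended implementation.
import apache_beam as beam
from apache_beam.coders import PickleCoder
from apache_beam.transforms.userstate import BagStateSpec

class MatchRecords(beam.DoFn):
    # Per-key state holding every record seen so far for this blocking key.
    SEEN = BagStateSpec('seen', PickleCoder())

    def matches(self, a, b):
        # Placeholder for the ~100 comparison rules.
        return a.get('last_name', '').lower() == b.get('lname', '').lower()

    def process(self, keyed_record, seen=beam.DoFn.StateParam(SEEN)):
        _, record = keyed_record
        for candidate in seen.read():
            if self.matches(record, candidate):
                # Naive merge; the real merge policy would come from the rules.
                yield {**candidate, **record}
                break
        else:
            # No match yet; emit as-is and keep it for future matching.
            yield record
        seen.add(record)

# Usage: key each record by a cheap blocking key first, e.g.
#   records | beam.Map(lambda r: (blocking_key(r), r)) | beam.ParDo(MatchRecords())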
To me this looks very high level and there's probably no single 'correct' solution at this level of detail.
I would probably try to approach this by first understanding it in more detail (maybe you already do); a few thoughts:
what are the properties of the data?
are there patterns? E.g. when one system publishes something, do you expect something else from other systems?
what are the requirements in general?
latency, consistency, availability, etc;
how is the data read from the sources?
can all the systems publish the records in batches as files, submit them to Pub/Sub, or does your solution need to poll them, etc.?
can the data be read in parallel or is it a single stream?
then the main question, how you can efficiently match a record in general, will probably look different under different assumptions and requirements. For example, I would think about:
can you fit all data in memory;
are your rules dynamic; do they change at all, and what happens when they do;
can you split the data into categories that can be stored separately and matched efficiently, e.g. if you know you can try to match some things by id field, some other things by hash of something, etc;
do you need to match against all of historical/existing data?
can you have some quick elimination logic to not do expensive checks?
what is the output of the solution? What are the requirements for the output?
So I've been trying to stream data from Google Search Console API to BigQuery in real time.
The data are retrieved from the GSC API and streamed to the BigQuery streaming buffer. However, I experience high latency before the streaming buffer is flushed (up to 2 hours or more), so the data stays in the streaming buffer but is not in the table.
The data are also not visible in the preview, and the table size is 0B with 0 rows (actually, after waiting for more than a day I still see 0B even though there are more than 0 rows).
Another issue is that, some time after the data is stored in the table (table size and number of rows are correct), it simply disappears from it and appears in the streaming buffer (I only saw this once). -> This was explained by the second paragraph of shollyman's answer.
What I want is to have the data in the table in real time. According to the documentation this seems possible but doesn't work in my case (2h of delay as stated above).
Here's the code responsible for that part:
import uuid
from googleapiclient.discovery import build

# response, projectid, dataset_id and tableid are defined earlier in the script.
# Stream one row at a time into BigQuery via the tabledata.insertAll streaming API.
for row in response['rows']:
    keys = ','.join(row['keys'])
    row_to_stream = {'keys': keys, 'f1': row['f1'], 'f2': row['f2'],
                     'ctr': row['ctr'], 'position': row['position']}
    insert_all_data = {
        "kind": "bigquery#tableDataInsertAllRequest",
        "skipInvalidRows": True,
        "ignoreUnknownValues": True,
        'rows': [{
            'insertId': str(uuid.uuid4()),
            'json': row_to_stream,
        }]
    }
    build('bigquery', 'v2', cache_discovery=False).tabledata().insertAll(
        projectId=projectid,
        datasetId=dataset_id,
        tableId=tableid,
        body=insert_all_data).execute(num_retries=5)
I've seen questions that seem very similar to mine on here but I haven't really found an answer. I therefore have 2 questions.
1. What could cause this issue?
Also, I'm new to GCP and I've seen other options (at least they seemed like options to me) for real-time streaming of data to BigQuery (e.g., using Pub/Sub and a few projects around real-time Twitter data analysis).
2. How do you pick the best option for a particular task?
By default, the BigQuery web UI doesn't automatically refresh the state of a table. There is a Refresh button when you click into the details of a table, that should show you the updated size information for both managed storage and the streaming buffer (displayed below the main table details). Rows in the buffer are available to queries, but the preview button may not show results until some data is extracted from the streaming buffer to managed storage.
I suspect the case where you observed data disappearing from managed storage and appearing back in the streaming buffer may have been a case where the table was deleted and recreated with the same name, or was truncated in some fashion and streaming restarted. Data doesn't transition from managed storage back to the buffer.
Deciding what technology to use for streaming depends on your needs. Pub/Sub is a great choice when you have multiple consumers of the information (multiple pub/sub subscribers consuming the same stream of messages independently), or you need to apply additional transformations of the data between the producer and consumer. To get the data from pub/sub to BigQuery, you'll still need a subscriber to write the messages into BigQuery, as the two have no direct integration.
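For illustration only, a minimal subscriber along those lines might look like the sketch below; it assumes the google-cloud-pubsub and google-cloud-bigquery client libraries, and the subscription and table names are placeholders.
import json
from google.cloud import bigquery, pubsub_v1

SUBSCRIPTION = "projects/my-project/subscriptions/gsc-rows"   # placeholder
TABLE_ID = "my-project.my_dataset.gsc_data"                   # placeholder

bq = bigquery.Client()
subscriber = pubsub_v1.SubscriberClient()

def callback(message):
    row = json.loads(message.data.decode("utf-8"))
    errors = bq.insert_rows_json(TABLE_ID, [row])   # streaming insert
    if not errors:
        message.ack()
    else:
        message.nack()   # let Pub/Sub redeliver on failure

streaming_pull = subscriber.subscribe(SUBSCRIPTION, callback=callback)
streaming_pull.result()   # block and keep processing messages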
We are running a Dataflow job that handles multiple input streams. Some of them are high traffic and some of them rarely get messages through. We are joining all streams with a "shared" stream that contains information relevant to all elements. This is a simplified example of the pipeline:
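(The original pipeline snippet/diagram is not reproduced here; purely as a reconstruction of the shape described, a Beam Python sketch with assumed topic names, key field, and a 60-second window might look roughly like this.)
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

def keyed(message_bytes):
    # Key every element by an assumed 'id' field so it can be grouped later.
    record = json.loads(message_bytes.decode("utf-8"))
    return record["id"], record

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    shared = (p | 'ReadShared' >> beam.io.ReadFromPubSub(topic='projects/p/topics/shared')
                | 'KeyShared' >> beam.Map(keyed))
    stream1 = (p | 'ReadStream1' >> beam.io.ReadFromPubSub(topic='projects/p/topics/stream1')
                 | 'KeyStream1' >> beam.Map(keyed))
    stream2 = (p | 'ReadStream2' >> beam.io.ReadFromPubSub(topic='projects/p/topics/stream2')
                 | 'KeyStream2' >> beam.Map(keyed))

    flattened = (stream1, stream2, shared) | 'Flatten' >> beam.Flatten()
    grouped = (flattened
               | 'Window' >> beam.WindowInto(FixedWindows(60))
               | 'GroupByKey' >> beam.GroupByKey())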
I noticed that the job will not produce any output, until both streams contain some traffic.
For example, let's suppose that Stream 1 gets a steady flow of traffic, whereas Stream 2 does not produce any messages for a period of time. For this time, the job's DAG will show elements being accumulated in the GroupByKey step, but nothing will be propagated beyond it. I can also see the Flatten PCollections step showing input elements for the left side of the graph but not the right one. This creates a problem when dealing with high-traffic and low-traffic streams in the same job, since it will cause output to be delayed for as long as it takes for Stream 2 to pick up messages.
I am not sure if the observation is correct, but I wanted to ask if this is how Flatten/GroupByKey works in general and if so, if the issue we're seeing can be avoided through an alternative way of constructing the pipeline.
(Example JobID: 2017-02-10_06_48_01-14191266875301315728)
As described in the documentation for GroupByKey, the default behavior is to wait for all data within the window to have arrived; this is necessary to ensure correctness of downstream results.
Depending on what you are trying to do, you may be able to use triggers to cause the aggregates to be output earlier.
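For example (a hedged sketch only; the 60-second window, the 30-second early firing, and the flattened collection from the reconstruction in the question are my assumptions), speculative early panes let the fast stream's data come out before the window closes:
import apache_beam as beam
from apache_beam.transforms import trigger
from apache_beam.transforms.window import FixedWindows

# 'flattened' stands for the merged, keyed PCollection ahead of the GroupByKey.
grouped = (flattened
           | 'WindowWithEarlyFiring' >> beam.WindowInto(
                 FixedWindows(60),
                 trigger=trigger.AfterWatermark(early=trigger.AfterProcessingTime(30)),
                 accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
           | 'GroupByKey' >> beam.GroupByKey())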
You may also be able to use the slow stream as a side input to the processing of the fast stream.
If you're still stuck, it would help if you could describe in more detail the contents of the streams and how you're trying to use them, since more detailed answers depend on the goal.
Apparently (based on an exception) EventHubClient.SendBatch and EventHubClient.SendBatchAsync only support sending to a single partition per operation. This appears to be indicated in the documentation by the method summary "Sends a batch of event data to the logical partition represented by PartitionId", which appears to be copied from the partition-specific EventHubSender.SendBatch.
Are there design considerations (vs. just writing less code) in having the higher-level client not rebatch as needed?
The EventHubClient has control over the partition key hashing/distribution, which is not available to callers of EventHubClient that wish to send a batch of data with differing keys that may land on the same partition. Left to rebatch myself, I need to make calls on the order of the number of messages, as opposed to the number of partitions, which with small messages is easily a two-orders-of-magnitude difference.
Since it's already necessary to rebatch, it could be worse.
I was assuming the PartitionKey on the EventData objects in the batch would be used to partition them out, but apparently not.
However, Paolo Salvatori wrote a nice set of extension methods that provide good, easy support for sending to Event Hubs in batches.
You'll probably like his post here: http://blogs.msdn.com/b/paolos/archive/2015/03/26/how-to-implement-a-partitioned-sendbatch-method-for-azure-service-bus-entities.aspx
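For what it's worth, here is a rough Python sketch of the same grouping idea (not his C# extension methods), using the azure-eventhub SDK with placeholder connection details: group the events by partition key first, then send one batch per key instead of one call per message.
from collections import defaultdict
from azure.eventhub import EventData, EventHubProducerClient

# Placeholder connection details.
producer = EventHubProducerClient.from_connection_string(
    "Endpoint=sb://example.servicebus.windows.net/;SharedAccessKeyName=send;SharedAccessKey=secret",
    eventhub_name="myhub")

def send_grouped(messages):
    """messages: iterable of (partition_key, payload) pairs."""
    groups = defaultdict(list)
    for key, payload in messages:
        groups[key].append(payload)

    # One send per distinct partition key instead of one per message.
    for key, payloads in groups.items():
        batch = producer.create_batch(partition_key=key)
        for payload in payloads:
            batch.add(EventData(payload))
        producer.send_batch(batch)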