Is it possible to set multiple Input streams for Kinesis Analytics? - amazon-web-services

I have 3 MongoDB collections (from the same database) which act as input sources. Currently I'm using 3 Kinesis streams to access these collections. I need to analyze them by combining them. Can I use Kinesis Analytics to do this? I can't see an option to select multiple streams as inputs for a Kinesis Analytics app.

Kinesis Analytics does not yet support more than one streaming source and one reference data source per application.
See: http://docs.aws.amazon.com/kinesisanalytics/latest/dev/limits.html

You can use Drools Kinesis Analytics, which supports multiple input streams.
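
Another workaround (not mentioned in the answer above, just a sketch) is to fan the three source streams into one combined stream and point the Kinesis Analytics application at that single stream. A minimal Python sketch using boto3, with a hypothetical combined stream name and a Lambda attached to each of the three source streams:

```python
import base64
import boto3

# Hypothetical name of the single stream the Kinesis Analytics app reads from.
COMBINED_STREAM = "combined-collections-stream"

kinesis = boto3.client("kinesis")

def handler(event, context):
    """Lambda handler attached to each of the three source streams.

    Forwards every incoming record into one combined stream so that a
    single Kinesis Analytics application can consume all of them.
    """
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        kinesis.put_record(
            StreamName=COMBINED_STREAM,
            Data=payload,
            PartitionKey=record["kinesis"]["partitionKey"],
        )
```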

Related

How should we configure Datastream in Google Cloud

We have c. 60 tables whose changes we would like to capture (CDC) using GCP's Datastream and ingest into our data lake.
Are there any drawbacks to using Datastream? And should I set up one stream that ingests all tables? Or should I create a stream per group of tables (or per single table), as a fail-safe so that if a stream fails, the impact is localized to a few specific tables?
Thanks in advance.

Best way to get pub sub json data into bigquery

I am currently trying to ingest numerous types of Pub/Sub data (JSON format) into GCS and BigQuery using a Cloud Function. I am just wondering what the best way to approach this is?
At the moment I am just dumping the events to GCS (each event type is on its own directory path) and was trying to create an external table, but there are issues since the JSON isn't newline-delimited.
Would it be better just to write the data as JSON strings in BQ and do the parsing in BQ?
With BigQuery, you now have a native JSON data type. It makes querying JSON data much easier and could be the solution if you store your events in BigQuery.
As for your question about using Cloud Functions, it depends. If you have few events, Cloud Functions are great and not very expensive.
If you have a higher event rate, Cloud Run is a good alternative that lets you leverage concurrency and keep costs low.
If you have millions of events per hour or per minute, consider Dataflow with the Pub/Sub to BigQuery template.
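
If you do stay with Cloud Functions, a minimal sketch of streaming each Pub/Sub message straight into BigQuery might look like this (the project, dataset, table, and column names are placeholders, and the payload column is assumed to be JSON or STRING):

```python
import base64

from google.cloud import bigquery

# Hypothetical destination table with a single JSON (or STRING) column named "payload".
TABLE_ID = "my-project.my_dataset.raw_events"

bq = bigquery.Client()

def pubsub_to_bq(event, context):
    """Background Cloud Function triggered by a Pub/Sub message.

    Streams the raw JSON message into BigQuery and leaves any further
    parsing to BigQuery's JSON functions at query time.
    """
    message = base64.b64decode(event["data"]).decode("utf-8")
    errors = bq.insert_rows_json(TABLE_ID, [{"payload": message}])
    if errors:
        raise RuntimeError(f"BigQuery streaming insert failed: {errors}")
```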

How do I perform spatial data analysis on Kinesis streams?

I'm quite new to AWS products and have been learning a lot from their tutorials. Recently I've been building an end-to-end IoT application to do near-real-time spatial analytics. The scenario is that we will deploy several sensors in each room/space and examine the spatial correlation across all the sensor measurements every once in a while (maybe every 10 s). The IoT sensor data payload looks like this:
{"roomid": 1,"sensorid":"s001", "value": 0.012}
Kinesis seems powerful for real-time analytics, so I'm sending data from IoT to a Kinesis stream, and then using Kinesis Analytics to aggregate the data with a tumbling window and enrich the measurements with sensor locations. The processed data to be sent out now looks like this:
[image: Kinesis Analytics output data]
My plan was to send data from Kinesis Analytics to a Kinesis stream and then invoke a Lambda function to do the spatial analysis. It seems that records with the same ROWTIME (processing time) are sent within a single event in the Kinesis stream. But I want to apply the Lambda function to each room separately; how can I separate these records from the Kinesis stream? (Am I using the wrong strategy?)
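For context, this is roughly what I had in mind for splitting a batch per room inside the Lambda handler (just a rough sketch, assuming each Kinesis record carries the JSON payload shown above, and analyze_room is a placeholder for the actual spatial analysis):

```python
import base64
import json
from collections import defaultdict

def handler(event, context):
    """Lambda triggered by the Kinesis stream that Kinesis Analytics writes to.

    A single invocation may contain records for several rooms, so records
    are grouped by roomid before running the per-room analysis.
    """
    rooms = defaultdict(list)
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        rooms[payload["roomid"]].append(payload)

    for roomid, measurements in rooms.items():
        analyze_room(roomid, measurements)

def analyze_room(roomid, measurements):
    # Placeholder: compute the spatial correlation across sensors in one room.
    pass
```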
I'm also looking for a data storage solution for the sensor data. Any advice? There are many mature time-series DBs, but I want to find the DB best suited for spatial analysis. Shall I look into a graph DB, like Neo4j?
Thanks,
Sasa

Want to write events from Event Hub to Data Lake Store using C# without Stream Analytics

I want to write events from Event Hub to Data Lake Store using C#, without Stream Analytics.
We are able to write to Blob storage, but how can we write to Data Lake?
Other than Azure Stream Analytics, there are a few other options you could use. We have listed them at https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-data-scenarios - search for "Streamed data".
Do these meet your requirements?
Thanks,
Sachin Sheth
Program Manager,
Azure Data Lake.
You can try connecting the Event Hub-triggered function to Blob storage as an output binding and then orchestrate moving the data from Blob storage to Data Lake.
This way you can save some effort and process the data in batches as well.
Other options, such as having Azure Functions write directly to Data Lake, involve more complex operations (appending events to the same file until a threshold is reached and then flushing to Data Lake), which might not be ideal for a real-time environment.
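The question asks for C#, but just to illustrate the "move from blob to Data Lake" step described above, here is a rough sketch of the same idea in Python (all account names, credentials, and paths are placeholders):

```python
from azure.storage.blob import BlobServiceClient
from azure.datalake.store import core, lib

# Placeholder credentials and names -- replace with your own.
BLOB_CONN_STR = "<storage-connection-string>"
ADLS_STORE = "mydatalakestore"  # Data Lake Store (Gen1) account name
TENANT_ID, CLIENT_ID, CLIENT_SECRET = "<tenant>", "<client>", "<secret>"

def copy_blob_to_datalake(container: str, blob_name: str, adls_path: str) -> None:
    """Download a blob written by the Event Hub-triggered function and
    re-upload its contents to Azure Data Lake Store."""
    blob = (BlobServiceClient.from_connection_string(BLOB_CONN_STR)
            .get_blob_client(container=container, blob=blob_name))
    data = blob.download_blob().readall()

    token = lib.auth(tenant_id=TENANT_ID,
                     client_id=CLIENT_ID,
                     client_secret=CLIENT_SECRET)
    adls = core.AzureDLFileSystem(token, store_name=ADLS_STORE)
    with adls.open(adls_path, "wb") as f:
        f.write(data)
```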

ETL Possible Between S3 and Redshift with Kinesis Firehose?

My team is attempting to use Redshift to consolidate information from several different databases. In our first attempt to implement this solution, we used Kinesis Firehose to write records of POSTs to our APIs to S3, then issued a COPY command to load the inserted data into the correct tables in Redshift. However, this only allowed us to insert new data; it did not let us transform data, update rows when they were altered, or delete rows.
What is the best way to maintain an updated data warehouse in Redshift without using batch transformation? Ideally, we would like updates to occur "automatically" (< 5min) whenever data is altered in our local databases.
Neither Firehose nor Redshift has triggers; however, you could potentially use Lambda with Firehose to pre-process the data before it gets inserted, as described here: https://blogs.aws.amazon.com/bigdata/post/Tx2MUQB5PRWU36K/Persist-Streaming-Data-to-Amazon-S3-using-Amazon-Kinesis-Firehose-and-AWS-Lambda
In your case, you could extend this to trigger Lambda from S3 as Firehose creates new files, which would then execute the COPY/SQL updates.
Another alternative is just writing your own KCL client that would implement what Firehose does, and then executing the required updates after COPY of micro-batches (500-1000 rows).
I've done such an implementation (we needed to update old records based on new records) and it works all right from a consistency point of view, though I'd advise against such an architecture in general due to poor Redshift performance with regard to updates. In my experience, the key rule is that Redshift data is append-only, and it is often faster to filter out superseded rows at query time (with optional regular pruning, e.g. daily) than to delete/update those rows in real time.
Yet another alternative is to have Firehose dump data into staging table(s), and then have scheduled jobs take whatever is in those tables, do the processing, move the data, and rotate tables.
As a general reference architecture for real-time inserts into Redshift, take a look at this: https://blogs.aws.amazon.com/bigdata/post/Tx2ANLN1PGELDJU/Best-Practices-for-Micro-Batch-Loading-on-Amazon-Redshift
This has been implemented multiple times, and works well.
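
To illustrate the staging-table approach above, a Lambda triggered by the S3 objects Firehose creates might run something like the following (the table names, connection details, and IAM role are placeholders, and psycopg2 is just one way to reach Redshift):

```python
import psycopg2

# Placeholder connection details and IAM role -- replace with your own.
REDSHIFT_DSN = "host=example.redshift.amazonaws.com port=5439 dbname=dw user=loader password=secret"
COPY_ROLE = "arn:aws:iam::123456789012:role/redshift-copy-role"

MERGE_SQL = """
BEGIN;
COPY staging_events FROM 's3://{bucket}/{key}'
    IAM_ROLE '{role}' FORMAT AS JSON 'auto';
-- Upsert: drop target rows that are superseded, then append the new versions.
DELETE FROM events USING staging_events
    WHERE events.id = staging_events.id;
INSERT INTO events SELECT * FROM staging_events;
TRUNCATE staging_events;
COMMIT;
"""

def handler(event, context):
    """Lambda triggered by s3:ObjectCreated events on the Firehose bucket."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    with psycopg2.connect(REDSHIFT_DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(MERGE_SQL.format(bucket=bucket, key=key, role=COPY_ROLE))
```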