How do I perform spatial data analysis on Kinesis streams? - amazon-web-services

I'm quite new to AWS products and have been learning a lot from their tutorials. I'm currently building an end-to-end IoT application for near real-time spatial analytics. The scenario is that we deploy several sensors in each room/space and examine the spatial correlation across all the sensor measurements at a regular interval (maybe every 10 s). The IoT sensor data payload looks like this:
{"roomid": 1,"sensorid":"s001", "value": 0.012}
Kinesis seems powerful for real-time analytics, so I'm sending data from AWS IoT to a Kinesis stream and then using Kinesis Data Analytics to aggregate the data over a tumbling window and enrich the measurements with sensor locations. The processed data to be sent out now looks like this:
[image: Kinesis Analytics output data]
My plan was to send the output of Kinesis Data Analytics to another Kinesis stream and then invoke a Lambda function to do the spatial analysis. It seems that records with the same ROWTIME (processing time) are delivered together within a single event from the Kinesis stream. But I want to apply the Lambda function to each room separately; how can I separate these records coming from the Kinesis stream? (Am I using the wrong strategy?)
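One common approach, if the records do arrive batched in a single event, is to decode them inside the Lambda handler and group them by roomid before running the per-room analysis. A minimal sketch, assuming the field names from the payload above (spatial_analysis is a hypothetical placeholder for the correlation logic):

# Minimal sketch of a Lambda handler that splits one Kinesis batch by room.
# Field names (roomid, sensorid, value) follow the payload above;
# spatial_analysis() is a hypothetical placeholder for the per-room logic.
import base64
import json
from collections import defaultdict

def spatial_analysis(roomid, measurements):
    # Placeholder: compute spatial correlation across the sensors in one room.
    print(roomid, len(measurements))

def handler(event, context):
    by_room = defaultdict(list)
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        by_room[payload["roomid"]].append(payload)
    for roomid, measurements in by_room.items():
        spatial_analysis(roomid, measurements)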
I'm also looking for a data storage solution for the sensor data. Any advice? There are many mature time-series databases, but I want to find the DB best suited to spatial analysis. Should I look into a graph DB like Neo4j?
Thanks,
Sasa

Related

Best way to get Pub/Sub JSON data into BigQuery

I am currently trying to ingest numerous types of Pub/Sub data (JSON format) into GCS and BigQuery using a Cloud Function. I am just wondering what the best way to approach this is.
At the moment I am just dumping the events to GCS (each event type is on its own directory path) and was trying to create an external table, but there are issues since the JSON isn't newline delimited.
Would it be better just to write the data as JSON strings in BQ and do the parsing in BQ?
With BigQuery, you now have a JSON data type. It helps you query JSON data more easily, and it could be the solution if you store your events in BigQuery.
As for your question about using Cloud Functions: it depends. If you have few events, Cloud Functions are great and not very expensive.
If you have a higher rate of events, Cloud Run can be a good alternative to leverage concurrency and keep the cost low.
If you have millions of events per hour or per minute, consider Dataflow with the Pub/Sub to BigQuery template.
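For the Cloud Functions path, a minimal sketch of streaming each Pub/Sub event straight into a BigQuery table (assuming the google-cloud-bigquery client library; the project, dataset, and table names are placeholders):

# Minimal sketch: Pub/Sub-triggered Cloud Function streaming events into BigQuery.
# my-project.analytics.events is a placeholder table; storing the raw payload as a
# string (or in a JSON-typed column) lets you parse it later with BigQuery's JSON functions.
import base64
import json
from google.cloud import bigquery

client = bigquery.Client()
TABLE = "my-project.analytics.events"  # placeholder

def ingest(event, context):
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    row = {"event_type": payload.get("type"), "raw": json.dumps(payload)}
    errors = client.insert_rows_json(TABLE, [row])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")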

Data Storage and Analytics on AWS

I have a data analytics requirement on AWS. I have limited knowledge of big data processing, but based on my analysis, I have figured out some options.
The requirement is to collect data by calling a provider API every 30 minutes (data ingestion).
The data is mainly structured.
This data needs to be stored somewhere (an S3 data lake or Redshift, not sure), and various aggregations/dimensions from this data are to be provided through a REST API.
There is a future requirement to run ML algorithms on the original data, so the storage needs to be chosen accordingly. Based on this, can you suggest:
How to ingest the data (a Lambda running at a scheduled interval to pull data and store it, or is there a better way to pull data in AWS)?
How to store it (S3 or Redshift)?
Data analytics (currently some monthly/weekly aggregations): what tools can be used? What tools should I use if I am storing the data in S3?
How to expose the analytics results through an API (I hope I can use Lambda to query the analytics engine from the previous step).
Ingestion is simple. If the retrieval is relatively quick, then scheduling an AWS Lambda function is a good idea.
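For example, here is a minimal sketch of that pattern: an EventBridge schedule invoking a Lambda that calls the provider API and lands the raw response in S3 (the URL and bucket name are placeholders):

# Minimal sketch: Lambda on a 30-minute EventBridge schedule that pulls the
# provider API and stores the raw response in S3, keyed by timestamp.
# PROVIDER_URL and BUCKET are placeholders.
import datetime
import urllib.request
import boto3

PROVIDER_URL = "https://api.example.com/data"  # placeholder
BUCKET = "my-raw-data-bucket"                  # placeholder
s3 = boto3.client("s3")

def handler(event, context):
    with urllib.request.urlopen(PROVIDER_URL) as resp:
        payload = resp.read()
    key = datetime.datetime.utcnow().strftime("raw/%Y/%m/%d/%H%M.json")
    s3.put_object(Bucket=BUCKET, Key=key, Body=payload)
    return {"stored": key}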
However, the answers to your other questions really depend upon how you are going to use the data; work backwards from that.
For Storage, Amazon S3 makes sense at least for the initial storage of the retrieved data, but might (or might not) be appropriate for the API and Analytics.
If you are going to provide an API, then you will need to consider how the API code (e.g. using AWS API Gateway) will retrieve the data. For example, is it identical to the blob of data originally retrieved, or are complex transformations required, or perhaps combining of data from other locations and time intervals? This will help determine how the data should be stored so that it is easily retrieved.
Data analytics needs will also drive how your data is stored. Consider whether an SQL database is sufficient. If there are millions or billions of rows, you could consider using Amazon Redshift. If the data is kept in Amazon S3, then you might be able to use Amazon Athena. The correct answer depends completely upon how you intend to access and process the data.
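If the data stays in S3, a query against it with Athena can also be kicked off from code; a rough sketch with boto3 (the database, table, and results location are placeholders):

# Minimal sketch: run a monthly aggregation over data in S3 via Athena.
# The database, table, and results location are placeholders.
import boto3

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString=(
        "SELECT date_trunc('month', event_time) AS month, count(*) AS events "
        "FROM raw_events GROUP BY 1 ORDER BY 1"
    ),
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])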
Bottom line: Consider first how you will use the data, then determine the most appropriate place to store it. There is no generic answer that we can provide.

Real-time ingestion to Redshift without using S3?

I'm currently using Apache NiFi to ingest real-time data from Kafka, and I would like to take this data to Redshift to have a near real-time transaction table for online analytics of campaign results and other stuff.
The catch: if I use Kinesis or COPY from S3, I would have a LOT of reads/writes from/to S3, and I have found from previous experience that this becomes very expensive.
So, is there a way to send data to a Redshift table without constantly locking the destination table? The idea is to put the data in directly from NiFi without persisting it. I have an hourly batch process, so it would not be a problem if I lost a couple of rows on the online stream.
Why Redshift? That's our data lake platform, and it will come in handy for crossing the online data with other tables.
Any idea?
Thanks!

Write events from Event Hub to Data Lake Store using C# without Stream Analytics

We want to write events from Event Hub to Data Lake Store using C#, without Stream Analytics.
We are able to write to Blob storage, but how can we write to Data Lake?
Other than Azure Stream Analytics, there are a few other options you could use. We have listed them at https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-data-scenarios - search for "Streamed data".
Do these meet your requirements?
Thanks,
Sachin Sheth
Program Manager,
Azure Data Lake.
You can try connecting the Event Hub-triggered function to Blob storage via an output binding and then orchestrate the movement of data from Blob to Data Lake.
This way we can save some effort and process the data in batches as well.
Other options, such as writing from Azure Functions to Data Lake directly, involve complex operations such as appending events to the same file until a threshold is reached and then flushing to Data Lake, so they might not be ideal for a real-time environment.
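For clarity, the append-until-a-threshold pattern mentioned above would look roughly like this. This is a Python sketch against the ADLS Gen2 azure-storage-file-datalake SDK, purely illustrative since the question asks for C# (the C# Data Lake client exposes equivalent create/append/flush operations); all names here are placeholders.

# Minimal sketch: buffer events, then append and flush them to a Data Lake file.
# Assumes ADLS Gen2 and the azure-storage-file-datalake package; the connection
# string, filesystem, and file path are placeholders.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient.from_connection_string("<connection-string>")
filesystem = service.get_file_system_client("events")
file_client = filesystem.get_file_client("hub1/2024/01/01/batch-0001.json")
file_client.create_file()

events = [b'{"id": 1}', b'{"id": 2}']               # buffered Event Hub messages
data = b"".join(e + b"\n" for e in events)
file_client.append_data(data, offset=0, length=len(data))
file_client.flush_data(len(data))                   # commit once the threshold is hit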

Ordering of streaming data with Kinesis stream and Firehose

I have an architecture dilemma for my current project, which involves near real-time processing of a large amount of data. Here is a diagram of the current architecture:
Here is an explanation of my idea, which led me to that picture:
When the API Gateway receives a request, it is put into the stream (because of the "fire and forget" nature of my application); that's how I came to that conclusion. The input data is separated into shards based on a specific request attribute, which guarantees me the correct order.
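For reference, per-shard ordering in Kinesis Data Streams comes from the partition key; a minimal sketch of the producer side pinning a request attribute to a shard (the stream name and attribute name are placeholders):

# Minimal sketch: publish a request to Kinesis using a request attribute as the
# partition key, so all events sharing that key land in one shard, in order.
# The stream name and attribute name are placeholders.
import json
import boto3

kinesis = boto3.client("kinesis")

def publish(request):
    kinesis.put_record(
        StreamName="ingest-stream",                # placeholder
        Data=json.dumps(request).encode("utf-8"),
        PartitionKey=str(request["session_key"]),  # the ordering attribute
    )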
Then I have a Lambda that takes care of validating the input and anomaly detection, so it's an abstraction that keeps the data clean for the next layer: data enrichment. This Lambda sends the data to a Kinesis Firehose, because Firehose can back up the "raw" data (something I definitely want to have) and also attach a transformation Lambda which does the enrichment, so I don't have to handle saving the data to S3 myself; it comes out of the box. Everything is great until the moment I need the received data to keep its order (the enricher does sessionization), and that order is lost in Firehose, because there is no data separation there as there is in Kinesis streams.
So the only thing I could think of is to move the sessionization into the first Lambda, which breaks my abstraction, because it would start caring about data enrichment, and the bigger drawback is that the backed-up data would then contain enriched data, which also breaks the architecture. And all this happens because Firehose has no sharding concept.
So can someone think of a solution to this problem without losing the out-of-the-box features AWS provides?
I think that sessionization and data enrichment are two different abstractions and will need to be split between the Lambdas.
A session is a time-bound, strictly ordered flow of events bounded by a purpose or task. You only have that information at the first Lambda stage (from the Kinesis stream categorization), so you should label flows with session context at the source, where sessions can be bounded.
If storing session information in a backup is a problem, it may be that the definition of a session is not well specified or is subject to redefinition. If sessions are subject to future recasting, the session data already calculated can be ignored, provided enough additional data has also been recorded, in enough detail, to inform whatever unpredictable notion of a session may be needed later.
Additional enrichment providing business context (aka externally identifiable data) should process the sessions transactionally within the previously recorded boundaries.
If sessions aren't transactional at the business level, then the definition of a session is over or under specified. If that is the case, you are out of the stream processing business and into batch processing, where you will need to scale state to the number of possible simultaneous interleaved sessions and their maximum durations -- querying the entire corpus of events to bracket sessions of hopefully manageable time durations.
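To make "label flows with session context at the source" concrete, a minimal sketch of the first (validation) Lambda stamping each record before forwarding it to Firehose (field names and the delivery stream name are placeholders):

# Minimal sketch: the validation Lambda attaches session context taken from the
# Kinesis record metadata before handing the data to Firehose, so session
# boundaries and ordering survive even though Firehose has no shards.
# Field names and the delivery stream name are placeholders.
import base64
import json
import boto3

firehose = boto3.client("firehose")

def handler(event, context):
    out = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        payload["session_id"] = record["kinesis"]["partitionKey"]    # session context from the shard key
        payload["sequence"] = record["kinesis"]["sequenceNumber"]    # preserves ordering downstream
        out.append({"Data": (json.dumps(payload) + "\n").encode("utf-8")})
    firehose.put_record_batch(DeliveryStreamName="enrichment-stream", Records=out)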