IoT and DynamoDB - amazon-web-services

We have to work on an IoT system. Basically sensors sending data to the cloud, and users being able to access the data belonging to them.
The amount of data can be pretty substantial so we need to ensure something that covers both security and heavy load.
The typology of data is pretty straightforward, basically a data and its value at a specified time.
The idea was to use DynamoDB for this, having a table with :
[id of sensor-array]
[id of sensor]
[type of measure]
[value of measure]
[date of measure]
The idea was for the IoT system to put directly (in python) data into the database.
Our questions are :
In terms of performance :
will DynamoDB be able to handle a lot of insertions on a daily basis (we may be talking about hundredth of thousands insertions per minute) ?
does querying the table by giving the id of the sensor array and a minimal date will ensure being able to retrieve the data in a efficient fashion?
In terms of security is it okay to proceed this way?
We used to use NoSQL like MongoDB, so we're finding hard to apply our notions on DynamoDB where the data seems to be arranged in a pretty simple fashion.
Thanks.

will DynamoDB be able to handle a lot of insertions on a daily basis (we may be talking about hundredth of thousands insertions per minute) ?
Yes, DynamoDB will sustain (and cost) all the write throughput you provision for the table. As your data is small, aggregating before writing in batches (BatchWriteItem) is probably more cost efficient than individual writes.
does querying the table by giving the id of the sensor array and a minimal date will ensure being able to retrieve the data in a efficient fashion?
Yes, queries by hash (id) and range keys (date) would be very efficient. You may need secondary indexes for more complex queries though.
In terms of security is it okay to proceed this way?
Although data is encrypted in transit and client-side encryption is straightforward, there is a lot uncovered here. For example, AWS IoT provides TLS mutual authentication with certificates, IAM or Cognito, among other security features. AWS IoT can store data in DynamoDB using a simple rule.

Related

Schema design for Google BigTable

In my project, Im using Google BigQuery that holds loots of data.
The BigQuery columns are:
account_id, session_id, transaction_id, username, event, timestamp.
In my dashboard, Im fetching the entire data based on time stamp (last 30 days).
Since I have very large data, the performance are pretty slow (13 sec to fetch the last 30 days data).
Lately, I try to look on Google BigTable and I saw they have an option to get data based on time.
In my tests, the performance of the BigTable are slower from the BigQuery.
Is any suggested schema that can improve the performance with BigTable?
This is example to my schema in BigTable:
const row = {
key: `transactions#${timestamp_micros}`,
data: {
identifiers: {
session_id: `session_id-${startCounter}`,
account_id: `acount-${startCounter}`,
device_id: `device-${startCounter}`,
transaction_id: `transaction_id-${startCounter}`,
runtime_id: 'AQW+2Xx5AQAAstvxskK0c8NTk+vP5eBM',
page_id: `page_id-${startCounter}`,
start_time: timestamp,
},
},
};
Is anyone can suggest a better schema that will help me to fetch the data (based on timestamp range) with the best performance?
A good schema results in excellent performance and scalability, and a bad schema can lead to a poorly performing system. However, no single schema design provides the best fit for all use cases and hence your question is opinionated and will vary from person to person. The patterns described on this page provide a starting point to decide a schema for BigTable. Your unique dataset and the queries you plan to use are the most important things to consider as you design a schema for your time-series data.
As you've discovered from our docs, the row key format is the biggest decision you make when using Bigtable, as it determines which access patterns can be performed efficiently. Having row key transaction_id#reverse_timestampgets your data sorted from the latest timestamp. This could avoid hotspotting issues, which is one of the big reasons for slow query results.
However, you're also coming from a SQL architecture, which isn't always a good fit for Bigtable's schema/query model. So here are some questions to get you started:
Are you planning to perform lots of ad hoc queries like "SELECT A
FROM Bigtable WHERE B=x"? If so, strongly prefer BigQuery. Bigtable
can't support this query without performing a full table scan. (hence
it is slower than BigQuery)
Will you require multi-row OLTP transactions? Again, use BigQuery, as
Bigtable only supports transactions within a single row.
Are you streaming in new events at high QPS? Bigtable is much better
for these sorts of high-volume updates. Do you want to perform any
sort of large-scale complex transformations on the data? Again,
Bigtable is likely better here, as you can stream data out and back
in faster.
You can also combine the two services if you need some combination of these features. For example, say you're receiving high-volume updates all the time, but want to be able to perform complex ad hoc queries. If you're alright working with a slightly delayed version of the data, it could make sense to write the updates to Bigtable, then periodically scan the table using Dataflow and export a post-processed version of the latest events into BigQuery. GCP also allows BigQuery to serve queries directly from Bigtable in a some regions: https://cloud.google.com/bigquery/external-data-bigtable
My personal choice for your use case is Big Query. You can leverage the pruning in Big Query where BigQuery scans the partitions that match the filter and skip the remaining partitions. Not only does it make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of bytes read by a query. You can use time-unit column partitioning or ingestion time partitioning. When you create a table partitioned by ingestion time, BigQuery automatically assigns rows to partitions based on the time when BigQuery ingests the data. You can choose hourly, daily, monthly, or yearly granularity for the partitions.
So your query for fetching the entire data based on timestamp (last 30 days) should be something like this in BigQuery (when used partitioning):
SELECT
column
FROM
dataset.table
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-02')

Use Case for Amazon Athena

We are building an web application to allow customers insight into their activity based on events currently streaming into ElasticSearch. A customer is an organisation sending messages to people.
A concern has been raised that a requirement to host this data for three years infers a very large amount of storage and high cost of implementation given Elasticsearch.
An alternative is to process each day's data into a report CSV stored in S3 and use something like Amazon Athena to perform the queries. Is Athena something that our application can send ad-hoc queries to in response to a web browser request? It is unlikely to generate a large volume of requests all the time, but I'm uncertain what the latency could be like.
Yes, Athena would be a possible solution to this use case – and done right it could also be fairly cheap.
Athena is not a low latency query engine, but for reporting purposes it's usually good enough. There's no way to say for sure without knowing more, but done right we're talking low single digit seconds.
You can approach this in different ways, either you do as you say and generate a CSV every day, store these for as long as you need, and run queries against them as needed. From your description it sounds like these CSVs would already be aggregates, and I assume they would be significantly less than a megabyte per customer per day. If you partition by customer and month you should be able to run queries for arbitrary time periods in seconds.
Another approach would be to store all your data on S3 and run queries on the full data set. As you stream data into ElasticSearch, stream it to S3 too. Depending on how you do that you probably need some ETL in the form of Lambda functions that partitions the data per customer and time (day or month depending on the volume). You can then run Athena queries on the full historical data set. The downside would be slower queries (double digit seconds for most queries, but I don't know your data volumes), but the upside would be full flexibility on what you can query.
With more details about the particulars of the use case I could help you with the details.
Athena is serverless. You can quickly query your data without having to set up and manage any servers or data warehouses. Just point to your data in Amazon S3, define the schema, and start querying using the built-in query editor.
Amazon Athena automatically executes queries in parallel, so most results come back within seconds/mins.

Single query to get the data from DynamoDB and RDS

Looking for an advice on AWS architecture. Did some research on my own, but I'm far from an expert and I would really love to hear other opinions. This seems to be a pretty common problem for miscroservice architecture, but AWS looks like a different universe to me with its own rules (and tools), there should be best practices that I'm not aware of yet.
What we have:
SOA: Lambda per entity (usually node.js + DynamoDB)
Some Lambda functions use RDS (MySQL) as a DB (this data was supposed to be used by Quicksight)
GraphQL (AppSync)
First problem occurred when we understood that we have to display in Quicksight the data that is stored in DynamoDB. This was solved by Data Pipeline job that transfers the data from DynamoDB to S3 and then is fetched by Quicksight using Athena. In this case it's acceptable that the data for analysis is not updated in real time.
But now we need to create a table in the main application and combine the data that is stored in different data sources - DynamoDB and MySQL. For example, we have an entity payment with attributes like amount and currency, this data is stored in MySQL. And then there is a contract entity which is stored in DynamoDB. Payment can have a link to a contract (one to many relation). We need to create a table with a list of contracts, so the user can filter contracts by payments attributes like seeing the contracts that have payments in EUR or with total amount > 500 USD. This table must contain real time data and have common data grid features: filtering, sorting, pagination.
Options that I see at the moment:
use SQS to transfer payment attributes from payment service to DynamodDB and store it as a String Set in DynamoDB (e.g. column currencies: ['EUR', 'USD']).
use streams (DynamoDB streams, Kinesis?) to transfer data from DynamoDB to S3, and then query the data with Athena. Not sure it will work for us, I got really bad performance issues with Athena, queries stuck in queue for a couple of minutes, did I do something wrong?
remodel the architecture, merge entities into one DB. Probably this one will take far too long to be allowed by project managers.
Data duplication (and consistency issues as a result) was always a pain for me, but it seems to be unavoidable here.
Any thoughts or links to the articles that might help are highly appreciated.
P.S. The architecture was designed by a previous development team.

AWS Redshift + Tableau Performance Booster

I'm using AWS Redshift as a back-end to my tableau desktop. AWS cluster is running with two dc1.large nodes and database table which I'm analyzing is of 30GB (with redshift compression enabled), I chose Redshift over tableau extract for performance issue but seems like Redshift live connection is much slower than extract. Any suggestions where shall I look into?
To use Redshift as a backend for a BI platform like Tableau, there are four things you can do to address latency:
1) Concurrency: Redshift is not great at running multiple queries at the same time so before you start tuning the database, make sure your query is not waiting in line behind other queries. (If you are the only one on the cluster, this shouldn't be a problem.)
2) Table size: Whenever you can, use aggregate tables for better performance. Fewer rows to scan means less IO and faster turnaround!
3) Query complexity: Ideally, you want your BI tool to issue simple, fast performing queries. Make sure your source tables are fast, and that Tableau isn't being forced to do a bunch of joins. Also, if your query does need to join multiple tables, make sure any large tables have the same distribution key.
4) "Indexing": Technically, Redshift does not support true indexing, but you can get close to the same thing by using "interleaved" sort keys. Traditional compound sort keys won't help, but an interleaved sort key can allow you to quickly access rows from multiple vectors (date and customer_id, for instance) without having to scan the entire table.
Reality Check
After all of these things are optimized, you will often find that you still can't be as fast as a Tableau extract. Simply stated, a "fast" Tableau dashboard needs to return data to it's user in <5 seconds. If you have 7 visuals on your dashboard, and each of the underlying queries takes 800 milliseconds to return (which is super fast for a database query), then you still are just barely reaching your target performance. Now, if just one of those queries take 5 seconds or more, your dashboard is going to feel "slow" no matter what you do.
In Summary
Redshift can be tuned using the approach above, but it may or may not be worth the effort. The best applications for using a live Redshift query instead of Tableau Extracts are in cases where the data is physically too large to create an extract of, and when you require data at a level of granularity that makes pre-aggregation infeasible. One good strategy is to create your main dashboard using an extract so that exploration/discovery is as fast as possible, and then use direct (live) Redshift queries for your drill-through reports (for instance, when you want to see exactly which customers roll up into your totals).
Few pointers as below
1) Use vacuum & Analyze once your ETL completes
2) Have you created the Table with proper Dist key and Sort Key
3) Aggregation if it's ok from the point of Data Granularity, requirement etc
1.Remove cursor, tableau access data from redshift leader node using cursor. Cursor works iteratively. Thus, impacting the performance.
2. Perform manual analyze on the table, after running heavy load operations. https://docs.aws.amazon.com/redshift/latest/dg/r_ANALYZE.html
3.Check the dist key distribution to avoid data skewness and improve performance.

Modeling data in NoSQL DynamoDB

I'm trying to figure out how to model the following data in AWS DynamoDB table.
I have a lot of IOT devices, each sends telemetry data every few seconds.
Attributes
device_id
timestamp
malware_name
company_name
action_performed (two possible values)
Queries
Show all incidents that happened in the last week.
Show all incidents for a specific device_id.
Show all incidents with action "unable_to_remove".
show all incidents related to specific malware.
Show all incidents related to specific company.
Thoughts
I understand that I can add GSI's for each attribute, but I would like to use GSI's only if there is no other choice as it costs me more money.
What would be the main primary-key (partition-key:sort-key) ?
Please share you thoughts, I care about them more than I care about the perfect answer as I'm trying to learn how to think and what to consider instead of having an answer for a specific question.
Thanks a lot !
If you absolutely need the querability patterns mentioned, you have no way out but create GSIs for each. That too has its set of caveats:
For query #1, your GSI would be incident_date (or whatever) as partition-key and device_id as sort-key. This might lead to hot partitioning in DynamoDB, based on your access patterns.
There is a limit of 5 GSIs per table, that you'll use up right away. What'll you do if you need to support another kind of query in future?
While evaluating pros and cons of using NoSQL for a given situation, one needs to consider both read and write access patterns. So, the question you should ask is, why DynamoDB?
For e.g., do you really need realtime queries? If not, you can use DynamoDB as the main database and periodically sync data (using AWS Lambda or Kinesis Firehose) to EMR or Redshift for later batch processing.
Edit: Proposed primary key:
device_id as partition-key and incident_date as sort-key, if you know that no 2 or more incidents, for a given device_id, can come at exact same time.
If above doesn't work, then incident_id as partition-key and incident_date as sort-key.