Dist/Sort key for Redshift time series database - amazon-web-services

I am involved in a time series telemetry project where we store data in Amazon Redshift. We have a timestamp column for the collection time, plus ClientID and IOT-ID columns that together identify a unique IoT device within a client.
All our queries are time-bound: we query for a particular day, week, or month. Would the following be a good choice of dist/sort keys?
Distribution key - (ClientID, IOT-ID)
Sort key - timestamp

The general rule for Amazon Redshift is:
Set the Distribution Key to the field normally used to JOIN with other tables. This will put all data for a given value of that column on the same slice, making it easier to JOIN with other tables that have the same DISTKEY.
Set the Sort Key to the field that is most commonly used in a WHERE clause. Rows will be stored in order of this field, making it easier to "skip over" disk blocks that do not contain the desired data. (This is very powerful.)
So, it sounds like your timestamp field is ideal as the SORTKEY.
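The choice of DISTKEY depends on how you JOIN, but it can also help GROUP BY since the relevant data is co-located. Note that a Redshift DISTKEY is a single column, so a pair such as (ClientID, IOT-ID) would have to be reduced to one of the two columns or combined into a single column.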
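To see why sorting on the timestamp helps time-bound queries, here is a toy Python model of block skipping. It is only a sketch: the block size, the fake data, and the helper names (build_blocks, blocks_to_scan) are made up for illustration, and real Redshift works on disk blocks rather than Python lists.

```python
import random

BLOCK_SIZE = 100  # rows per block; an arbitrary value for this toy model


def build_blocks(timestamps, block_size=BLOCK_SIZE):
    """Split rows into fixed-size blocks and record each block's (min, max)."""
    blocks = []
    for i in range(0, len(timestamps), block_size):
        chunk = timestamps[i:i + block_size]
        blocks.append((min(chunk), max(chunk)))
    return blocks


def blocks_to_scan(blocks, start, end):
    """Count blocks whose [min, max] range overlaps the query range [start, end]."""
    return sum(1 for lo, hi in blocks if hi >= start and lo <= end)


# 10,000 fake collection times, stored in shuffled (arrival) order.
rng = random.Random(0)
unsorted_ts = list(range(10_000))
rng.shuffle(unsorted_ts)

sorted_blocks = build_blocks(sorted(unsorted_ts))  # like a table sorted on timestamp
unsorted_blocks = build_blocks(unsorted_ts)        # like a table with no useful sort order

# A time-bound query covering 5% of the data.
query = (2_000, 2_499)
scanned_sorted = blocks_to_scan(sorted_blocks, *query)
scanned_unsorted = blocks_to_scan(unsorted_blocks, *query)
print(scanned_sorted, "of", len(sorted_blocks), "blocks scanned when sorted")
print(scanned_unsorted, "of", len(unsorted_blocks), "blocks scanned when unsorted")
```

In the sorted layout only the blocks whose range overlaps the query window need to be read, while in the shuffled layout almost every block's min/max range overlaps it.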

Related

AWS DynamoDB sorting without partition key

I have a DynamoDB table with a partition key (UUID) and a few attributes (name, email, created date, etc.). Created date is one of the attributes in the item, and its format is YYYY-MM-DD. Now there is a requirement change: I have to sort by created date and return the entire data set (that is, not just the data in a particular partition, but the data from all partitions, in sorted order). I know this might take time, as DynamoDB has to fetch data from all the partitions and sort it afterwards. My questions are:
Is this query possible with the current design? I can see that a partition key is required in the query, which is why I am confused, because I cannot give a partition key here.
Is there a better way to redesign the table for such a use case?
Thanks in advance.
Because your table already exists, you cannot change its key schema now, so the table itself will keep UUID as its partition key.
However, you can create a global secondary index (GSI) on your DynamoDB table.
By using a GSI you can rearrange your data representation so that the creation date is the partition key of the index instead.
The reason partition keys are important is that data in DynamoDB is distributed across multiple nodes, with the items of a given partition stored together on the same node. A query that communicates with only one partition is more efficient, as there is no need to wait for other partitions to return results.
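As a rough illustration of querying such an index, the Python sketch below only builds the request parameters for a DynamoDB Query call. The table name, the index name created_date-index, and the attribute name created_date are assumptions made for the example, not names from the question.

```python
def build_gsi_query(table_name, index_name, created_date):
    """Build keyword arguments for a Query against a GSI keyed on created date."""
    return {
        "TableName": table_name,
        "IndexName": index_name,
        # "#d" and ":d" are placeholders for the attribute name and value.
        "KeyConditionExpression": "#d = :d",
        "ExpressionAttributeNames": {"#d": "created_date"},  # hypothetical attribute name
        "ExpressionAttributeValues": {":d": {"S": created_date}},
    }


# Hypothetical table and index names.
params = build_gsi_query("users", "created_date-index", "2021-03-15")
print(params)

# With boto3 installed and AWS credentials configured, the call could look like:
#   import boto3
#   response = boto3.client("dynamodb").query(**params)
```

With created date as the index's partition key, each query returns the items for one date, so covering all dates in order would mean issuing one query per date from the application.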

DynamoDB will not allow data to be inserted into a table unless the item contains the primary key set during table creation?

DynamoDB will not allow data to be inserted into a table unless the item contains the primary key set during table creation.
Dynamodb table:
id (primary key)
device_id
temperature_value
I am sending data from the IoT Core rule engine into DynamoDB (Split message into multiple columns of a DynamoDB table (DynamoDBv2)). However, data does not arrive in the DynamoDB table if the message is missing the id attribute.
Is there any way to make the primary key auto-increment every time a new data point arrives?
DynamoDB does not support auto-incrementing keys the way a relational database might.
Instead, you need to generate the key yourself at the time you insert the record into DynamoDB.
There are a few options for generating it:
Use a primary key composed of a partition key (referencing your sensor id) and a sort key (such as an event time, or a randomly generated string).
Generate a random string and insert that as the key.
Use a separate data store, such as a relational database or Redis, where you auto-increment a value and use it. This is really not ideal.
Use a separate DynamoDB table to hold this value, making sure you use a transactional write to lock the row and increment it, and a strongly consistent read to get the latest value. Again, this is not ideal.
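The first two options can be sketched in a few lines of Python. This is an illustration only: the function name with_generated_key and the event_time attribute are invented for the example, and where such code would run (on the device, or in some step before the write) is an assumption.

```python
import time
import uuid


def with_generated_key(message, now_ms=None):
    """Return a copy of an IoT message with key attributes filled in.

    Option 1: partition key = device_id, sort key = event time.
    Option 2: a random id (UUID4) for a table whose primary key is just "id".
    """
    item = dict(message)
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    item.setdefault("event_time", now_ms)     # hypothetical sort key attribute
    item.setdefault("id", str(uuid.uuid4()))  # random id if the message has none
    return item


item = with_generated_key(
    {"device_id": "sensor-1", "temperature_value": 21.5},
    now_ms=1_600_000_000_000,
)
print(item)
```

A message that already carries an id keeps it, since setdefault only fills in missing attributes.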

Using fake timestamps to create partitions on Google BigQuery

Google BigQuery (BQ) allows you to create a partition using timestamp or date types only.
99% of my data has a very clear selector, idClient. I've created views for my customers with a predicate like idClient = code, so privacy is guaranteed.
The problem with this strategy is that there are customers with 5M rows and others with 200K and as BQ does not have indexes, they are always processing data from each other (and the costs are rising).
I am intending to create a timestamp field where each customer will have a different timestamp that will be repeated for every Insert in every customer sensitive table and thus I can query by timestamp by fixing it as it would be with a standard ID.
Does this make any sense? If BQ was an indexed database I'd be concerned about skewed data but as it is always full table scan, I think I'd have only benefits and no downsides.
The solution to your problem is to add a cluster field to your table, which plays a role similar to an index in other databases.
This link provides the basics of how to use a cluster field:
Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns
Note: When using a cluster field, a BigQuery dry run doesn't show the cost improvement, which can only be seen after execution.
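As a minimal sketch of what this could look like, the Python helper below just assembles a BigQuery DDL statement that copies a table into a new one clustered on idClient. The dataset and table names are hypothetical, and it assumes clustering is applied without a partitioning column.

```python
def clustered_table_ddl(table, source_table, cluster_column):
    """Return BigQuery DDL that copies source_table into a table clustered on one column."""
    return (
        f"CREATE TABLE `{table}`\n"
        f"CLUSTER BY {cluster_column}\n"
        f"AS SELECT * FROM `{source_table}`"
    )


# Hypothetical dataset and table names.
ddl = clustered_table_ddl("my_dataset.orders_clustered", "my_dataset.orders", "idClient")
print(ddl)
```

The per-customer views with the idClient = code predicate could then be pointed at the clustered table.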

BigQuery - querying only a subset of keys in a table with key value schema

So I have a table with the following schema:
timestamp: TIMESTAMP
key: STRING
value: FLOAT
There are around 200 unique keys. I am partitioning the dataset by date.
I want to run several (5-6 currently, but I expect to add at least 15 more) queries on a daily basis on this database. Brute forcing these would cost me a lot daily, which I want to avoid.
The issue is that, because of this key-value format and BigQuery being a columnar database, each query scans the whole day's data, even though each query actually uses at most 4 keys. What is the best way to optimize this?
The best approach I can think of right now is to create separate temp tables for each key as a daily batch process, run my queries on them, and then delete them.
Ideally I would partition by key, but I am not sure there is any such provision.
You can try the recently introduced clustered partitioned tables.
When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query containing a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data.
Similarly, when you submit a query that aggregates data based on the values in the clustering columns, performance is improved because the sorted blocks colocate rows with similar values.
Update (moved from comments)
Also keep the following in mind:
Feature          Partitioning   Clustering
---------------  -------------  -------------
Cardinality      Less than 10k  Unlimited
Dry Run Pricing  Available      Not available
Query Pricing    Exact          Best Effort
Pay special attention to Dry Run Pricing. Unfortunately, clustered tables do not support a dry run (validation) based on the clustering keys; the dry run only reflects pruning based on partitions. But if you set up your clustering properly, the actual run will end up with a lower cost. You should try this with a smaller data set first to get comfortable with it.
See more at Clustering partitioned tables
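For the key-value schema in the question, a sketch could look like the Python below, which only assembles SQL strings. The table name my_dataset.telemetry and the AVG aggregate are placeholders, and the values are interpolated directly for readability; real code would more likely use query parameters.

```python
from datetime import date, timedelta

# Hypothetical table name; the columns follow the schema in the question.
DDL = """
CREATE TABLE `my_dataset.telemetry` (
  `timestamp` TIMESTAMP,
  `key` STRING,
  `value` FLOAT64
)
PARTITION BY DATE(`timestamp`)
CLUSTER BY `key`
"""


def daily_keys_query(table, day, keys):
    """Build a query restricted to one day's partition and a handful of keys."""
    next_day = day + timedelta(days=1)
    key_list = ", ".join(f"'{k}'" for k in keys)
    return (
        "SELECT `key`, AVG(`value`) AS avg_value\n"
        f"FROM `{table}`\n"
        # The timestamp range limits the scan to a single daily partition.
        f"WHERE `timestamp` >= TIMESTAMP('{day.isoformat()}')\n"
        f"  AND `timestamp` < TIMESTAMP('{next_day.isoformat()}')\n"
        # The filter on the clustering column lets BigQuery skip unrelated blocks.
        f"  AND `key` IN ({key_list})\n"
        "GROUP BY `key`"
    )


query = daily_keys_query("my_dataset.telemetry", date(2020, 1, 31), ["cpu", "mem"])
print(query)
```

Each of the daily queries would then filter on both the date range and its own small set of keys.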

DynamoDB dynamic schema

I'd like to use AWS DynamoDB as a datastore for a data-collection application, where the data schema may vary over time.
For example, initially an Item may represent attributes of people e.g. {name, age}. However, later the schema may be modified to contain {name, age, gender}.
Each schema modification will be tracked and versioned and older data won't need to be migrated - but it may still need to be queried alongside newer data.
Is it an acceptable pattern to store each data-schema change in its own table? Is there a straightforward mechanism to query aggregated data across tables?
Schemas for DynamoDB tables are dynamic in nature. The only thing that needs to be set up front is the key name and type. You can also add global secondary indexes (indexes with a different partition key) at any time. Local secondary indexes, which have the same partition key but a different sort key, must be added at table creation time. Because of this dynamic schema, you can start or stop adding fields at any time.
You need to design tables knowing how you will query them. Queries are quite restricted: you can filter, but that is not a fast or cheap operation. Fast queries rely on existing indexes. A query can fetch from a single table only; joins and unions aren't available.
A table scan takes no key criteria; only filters are available. With filters, data is still fetched from disk and then removed from the returned set, so it is an expensive operation in both cost and time. Queries passing a key are faster because they fetch data from a single partition. So you might want to design a key with both a partition key (userId, for instance) and a sort key (item id). Compound keys are common in DynamoDB.
It is also important to avoid hot spots inside a table. That is, data needs to be evenly distributed across partition key values.
Reference: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/BestPractices.html
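To make the dynamic-schema point concrete, here is a small Python sketch of items from two schema versions sharing one table and being read together. The user_id, item_id, and schema_version attribute names are assumptions for the example, and the items are plain dicts rather than real DynamoDB responses.

```python
def normalize_person(item):
    """Read a person item written under any schema version, filling gaps with None."""
    return {
        "name": item["name"],
        "age": item["age"],
        "gender": item.get("gender"),  # absent in version 1 items
        "schema_version": item.get("schema_version", 1),
    }


# Two items that could live in the same table: same key attributes, different fields.
items = [
    {"user_id": "u1", "item_id": "p1", "schema_version": 1, "name": "Ann", "age": 30},
    {"user_id": "u1", "item_id": "p2", "schema_version": 2, "name": "Bob", "age": 41, "gender": "m"},
]

people = [normalize_person(i) for i in items]
print(people)
```

Keeping all versions in one table and handling missing attributes at read time avoids the need to query across tables, which DynamoDB does not support directly.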