How to partition high cardinality data that is extended daily? - amazon-athena

I have an Athena table like this:
values_by_time:
Columns:
id string (contains UUIDs)
value string
created timestamp (this is a partition key)
Partition: [created]
But I want to query like this:
SELECT * FROM values_by_time WHERE id = 'ac09b809-471e-4296-b99b-e57408075609'
So my partition by [created] is inefficient.
I read that bucketing can be used for high cardinality columns, such as my id:
values_by_id:
Columns:
id string (bucket by this)
created timestamp
value string
Bucket By: [id]
This makes the look-up efficient:
SELECT * FROM values_by_id WHERE id = 'ac09b809-471e-4296-b99b-e57408075609'
However, I cannot do an INSERT INTO on a table with bucketing! This means that I cannot add new data to it after table creation. Creating the table fresh every day seems inefficient too.
How should I arrange my Athena table so that:
I can efficiently query by id
I can insert new data every day

Related

How to fetch all the values of one column using column name in dynamoDB table in java?

I have a DynamoDB table in my AWS account. I can create a client like this:
AmazonDynamoDB amazonDynamoDB = AmazonDynamoDBClient.builder()
    .withRegion("eu-west-1")
    .withCredentials(creds)
    .build();
DynamoDB dynamoDB = new DynamoDB(amazonDynamoDB);
Table table = dynamoDB.getTable("table name");
Suppose there is a column named "content". I want to get a list or set of all the values in the "content" column.

Querying a Global Secondary Index of a DynamoDB table without using the partition key

I have a DynamoDB table with partition key as userID and no sort key.
The table also has a timestamp attribute in each item. I wanted to retrieve all the items having a timestamp in the specified range (regardless of userID i.e. ranging across all partitions).
After reading the docs and searching Stack Overflow (here), I found that I need to create a GSI for my table.
Hence, I created a GSI with the following keys:
Partition Key: userID
Sort Key: timestamp
I am querying the index with Java SDK using the following code:
String lastWeekDateString = getLastWeekDateString();
AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard().build();
DynamoDB dynamoDB = new DynamoDB(client);
Table table = dynamoDB.getTable("user table");
Index index = table.getIndex("userID-timestamp-index");
QuerySpec querySpec = new QuerySpec()
    .withKeyConditionExpression("timestamp > :v_timestampLowerBound")
    .withValueMap(new ValueMap()
        .withString(":v_timestampLowerBound", lastWeekDateString));
ItemCollection<QueryOutcome> items = index.query(querySpec);
Iterator<Item> iter = items.iterator();
while (iter.hasNext()) {
    Item item = iter.next();
    // extract item attributes here
}
I am getting the following error on executing this code:
Query condition missed key schema element: userID
From what I know, I should be able to query the GSI using only the sort key without giving any condition on the partition key. Please help me understand what is wrong with my implementation. Thanks.
Edit: After reading the thread here, it turns out that we cannot query a GSI with only a range on the sort key. So, what is the alternative, if any, to query the entire table by a range query on an attribute? One suggestion I found in that thread was to use year as the partition key. This will require multiple queries if the desired range spans multiple years. Also, this does not distribute the data uniformly across all partitions, since only the partition corresponding to the current year will be used for insertions for one full year. Please suggest any alternatives.
When using the DynamoDB Query operation, you must specify at least the partition key. This is why you get the error that userID is required. From the AWS Query docs:
The condition must perform an equality test on a single partition key value.
The only way to get items without specifying the partition key is a Scan operation (but the results won't be sorted by your sort key!).
If you want all the items sorted, you would have to create a GSI whose partition key has the same value for all the items you need (e.g. add a new attribute to every item, such as "type": "item"). You can then query the GSI and specify #type = :item:
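// NameMap lives in com.amazonaws.services.dynamodbv2.document.utils, next to ValueMap
QuerySpec querySpec = new QuerySpec()
    // name placeholders avoid clashes with DynamoDB reserved words such as TIMESTAMP
    .withKeyConditionExpression("#type = :item AND #ts > :v_timestampLowerBound")
    .withNameMap(new NameMap()
        .with("#type", "type")
        .with("#ts", "timestamp"))
    .withValueMap(new ValueMap()
        .withString(":v_timestampLowerBound", lastWeekDateString)
        .withString(":item", "item"));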
A good solution for any custom querying requirement in DynamoDB is to design the right primary key schema for a GSI.
When designing a DynamoDB primary key, the main principle is that the hash key should partition the whole set of items, and the sort key should order the items within each partition.
Having said that, I recommend using the year of the timestamp as the hash key and the month-date as the sort key.
In this case you need at most two queries, provided the range spans no more than two calendar years.
You are right that you should avoid filtering or scanning as much as you can.
For example, if the start date and the end date fall in the same year, you need only one query:
.withKeyConditionExpression("#year = :year and #monthDate > :startMonthDate and #monthDate < :endMonthDate")
Otherwise you need two queries, one like this:
.withKeyConditionExpression("#year = :startYear and #monthDate > :startMonthDate")
and one like this:
.withKeyConditionExpression("#year = :endYear and #monthDate < :endMonthDate")
Finally, union the result sets from both queries.
Each query reads only the items in the requested range, so it consumes far less read capacity than a scan of the whole table.
To make sort key comparisons simpler, you might want to use a UNIX timestamp.
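The per-year splitting described above can be sketched in Python. This is only an illustration under assumptions: the helper name split_range_by_year is hypothetical, the hash key is assumed to hold the year as a string, and the sort key is assumed to hold a zero-padded "MM-DD" string. Each returned tuple would feed one Query against the GSI.

```python
from datetime import date


def split_range_by_year(start, end):
    """Return one (year, low_month_date, high_month_date) tuple per calendar year.

    Assumes the hash key stores the year as a string and the sort key stores
    a zero-padded "MM-DD" string, so string comparison matches date order.
    """
    if end < start:
        raise ValueError("end must not be earlier than start")
    ranges = []
    for year in range(start.year, end.year + 1):
        # Clamp the bounds to the current calendar year.
        low = start if year == start.year else date(year, 1, 1)
        high = end if year == end.year else date(year, 12, 31)
        ranges.append((str(year), low.strftime("%m-%d"), high.strftime("%m-%d")))
    return ranges


# A one-week window that crosses New Year needs two queries.
for year, low, high in split_range_by_year(date(2023, 12, 28), date(2024, 1, 3)):
    print(year, low, high)
```

A range inside a single year yields one tuple, and a range spanning more than two calendar years yields more than two, so the number of queries grows with the number of years covered.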
Thanks

Get latest 3 entries from DynamoDb

I have a DynamoDB table with the following schema:
{
"id": String [hash key]
"type": String [range key]
}
I have a use case where I need to fetch the last 3 rows for a given id when the type is unknown.
Your items need a timestamp attribute. Without one they can't be sorted or filtered by time. Once you have that, you can define a local secondary index with the id as partition key and the timestamp as the sort key (note that a local secondary index can only be defined when the table is created). You can then get the top three items by querying the index in descending order with a limit of 3.
Find more information about DynamoDb’s Local Secondary Index here.
Add a field to the schema to store the timestamp.
Use Query to fetch all the records for the given key.
Query always returns records sorted by the range key (ascending by default, descending with ScanIndexForward set to false); it cannot sort by another attribute without changing the table's schema, so sort the records by timestamp in your code.
Take the top 3 records.
If you have a lot of records, use filter expressions to drop extra results. E.g. if you know that the latest records will always have a timestamp no older than an hour (a day, a week or so), you could filter out older records.
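The "sort in your code and take the top 3" step could look roughly like this in Python. It is a sketch under assumptions: the helper name latest_items and the attribute name "timestamp" are hypothetical, and the timestamps are assumed to be mutually comparable (for example epoch numbers or ISO-8601 strings). The items stand in for whatever the Query for a given id returned.

```python
import heapq


def latest_items(items, n=3, timestamp_attr="timestamp"):
    """Return the n items with the largest timestamp, newest first."""
    # nlargest avoids fully sorting the list when only a few items are needed.
    return heapq.nlargest(n, items, key=lambda item: item[timestamp_attr])


# Hypothetical result of a Query for one id, in range-key ("type") order.
items = [
    {"id": "u1", "type": "a", "timestamp": 10},
    {"id": "u1", "type": "b", "timestamp": 40},
    {"id": "u1", "type": "c", "timestamp": 20},
    {"id": "u1", "type": "d", "timestamp": 30},
]
print([item["type"] for item in latest_items(items)])
```

This still reads every item for the id, so for ids with many items the index-based approach reads less data.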

DynamoDB : List all partition keys

I want to update the contents of a DynamoDB table, for which I need to iterate over the existing partition keys in the table.
Is there any way to fetch only the list of partition keys using Python? Scan and Query only work on the attributes of my table. Is there any way to get all the partition keys for a table?
If your table uses sort keys in addition to the partition keys (stated differently, if the keys are a composite of partition key + sort key) then the answer is: no, there is no way to query or scan for just the distinct partition keys. To clarify, you can still scan your table with a projection that returns the keys only, but it will return each partition key multiple times, once for each item that shares that partition key with a different sort key.
If your table schema uses partition keys only (no sort key) then you can write a scan with a projection of only the primary key and therefore, get the list of partition keys as a result.
Overview
To get all of the partition keys from a table you need to use Scan, which reads all of the items in the table. As you only want keys returned, you can use the ProjectionExpression parameter to specify which attributes should be returned.
Scan
The Scan operation returns one or more items and item attributes by accessing every item in a table or a secondary index. To have DynamoDB return fewer items, you can provide a FilterExpression operation.
ProjectionExpression
A string that identifies one or more attributes to retrieve from the specified table or index. These attributes can include scalars, sets, or elements of a JSON document. The attributes in the expression must be separated by commas. If no attribute names are specified, then all attributes will be returned. If any of the requested attributes are not found, they will not appear in the result.
Solution
My Table
pk   sk   col1  col2  col3
123  abc  data  data  data
456  def  data  data  data
789  ghi  data  data  data
Scan with ProjectionExpression
aws dynamodb scan \
--table-name MusicCollection \
--projection-expression "pk, sk"
Response
pk   sk
123  abc
456  def
789  ghi
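Since the question asks about Python, here is a minimal sketch of the same scan with de-duplication on the client side. The function name list_partition_keys is hypothetical. The client argument is assumed to behave like boto3's low-level DynamoDB client (boto3.client("dynamodb")), whose scan call accepts TableName, ProjectionExpression, ExpressionAttributeNames and ExclusiveStartKey and returns Items plus an optional LastEvaluatedKey.

```python
def list_partition_keys(client, table_name, pk_name):
    """Scan a table projecting only the partition key and return distinct values."""
    seen = set()
    kwargs = {
        "TableName": table_name,
        "ProjectionExpression": "#pk",
        # A name placeholder keeps this working if pk_name is a reserved word.
        "ExpressionAttributeNames": {"#pk": pk_name},
    }
    while True:
        page = client.scan(**kwargs)
        for item in page.get("Items", []):
            # Low-level items look like {"pk": {"S": "123"}}; unwrap the typed value.
            (value,) = item[pk_name].values()
            seen.add(value)
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break
        # A scan page is capped in size, so continue from where the last page ended.
        kwargs["ExclusiveStartKey"] = last_key
    return sorted(seen)
```

With real credentials this might be called as list_partition_keys(boto3.client("dynamodb"), "MusicCollection", "pk"). The scan still reads every item in the table, so it consumes read capacity for the whole table even though only one attribute is returned.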

Slow Selection Query even after indexing the table (sqlite and c++)

Create tables
I have a database composed of two tables:
ENTITE_CANDIDATE
VARIATIONS
Tables are created by using the following queries:
CREATE TABLE IF NOT EXISTS ENTITE_CANDIDATE (ID INTEGER PRIMARY KEY NOT NULL, ID_KBP TEXT NOT NULL, wiki_title TEXT, type TEXT NOT NULL);
CREATE TABLE IF NOT EXISTS VARIATIONS (ID INTEGER PRIMARY KEY NOT NULL, ID_ENTITE INTEGER, NAME TEXT, TYPE TEXT, LANGUAGE TEXT, FOREIGN KEY(ID_ENTITE) REFERENCES ENTITE_CANDIDATE(ID));
Table ENTITE_CANDIDATE is composed of 818,742 records
Table VARIATIONS is composed of 154,716,653 records
Index tables
I indexed the previous tables by using the following queries:
`CREATE INDEX var_id ON VARIATIONS (ID, ID_ENTITE, NAME);`
`CREATE INDEX entity_id ON ENTITE_CANDIDATE (ID, wiki_title);`
Retrieve information
I want to retrieve from table VARIATIONS the following records:
SELECT ID, ID_ENTITE, NAME FROM VARIATIONS WHERE NAME = 'foo';
Every select query takes around 5.414931 seconds. I know the table contains a very large number of records, but can I make the retrieval faster? Am I indexing the tables correctly?
The documentation says:
the index might be used if the initial columns of the index … appear in WHERE clause terms.
This query uses only the NAME column to search, so the var_id index cannot be used. (That index is useful only for lookups that use ID, which is mostly useless because the ID column is already indexed as PRIMARY KEY.)
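Following that rule, an index whose first column is NAME should be usable for this lookup. The sketch below uses Python's built-in sqlite3 module on an empty in-memory copy of the VARIATIONS table to compare the query plan before and after adding such an index. The index name var_name is an assumption, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE VARIATIONS (ID INTEGER PRIMARY KEY NOT NULL, "
    "ID_ENTITE INTEGER, NAME TEXT, TYPE TEXT, LANGUAGE TEXT)"
)
# The index from the question: NAME is its last column, not its first.
conn.execute("CREATE INDEX var_id ON VARIATIONS (ID, ID_ENTITE, NAME)")

QUERY = "SELECT ID, ID_ENTITE, NAME FROM VARIATIONS WHERE NAME = ?"


def plan(sql, params):
    """Return the EXPLAIN QUERY PLAN detail text for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


plan_before = plan(QUERY, ("foo",))

# Hypothetical index name; what matters is that NAME is the leading column.
conn.execute("CREATE INDEX var_name ON VARIATIONS (NAME)")
plan_after = plan(QUERY, ("foo",))

print("before:", plan_before)  # expected to be a full scan
print("after: ", plan_after)   # expected to be a search using var_name
```

On a table with 154 million rows, creating the index takes a while and uses extra disk space, but afterwards each lookup by NAME no longer has to read the whole table.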