Is it a good idea to have an index on a boolean?

Is it a good idea to have an index on a boolean? - google-cloud-platform

I have a table with a boolean field, IsNew, that indicates whether or not the corresponding entity is new. I want to periodically query for all entities in a particular state. What are the implications of having index on boolean (or enum)? Will it create a hotspot? Any limitations on QPS?

A secondary index is implemented internally as a table that has a primary key based on the declared secondary index key, plus whatever indexed table keys weren't mentioned in the secondary index explicitly. So, say you have a table like this:
CREATE TABLE UserThings (
UserId INT64 NOT NULL,
ThingId INT64 NOT NULL,
...
IsNew BOOL NOT NULL,
...
) PRIMARY KEY(UserId, ThingId), ...
And you create an index like this:
CREATE INDEX UserThingsByIsNew ON UserThings(IsNew, ThingId)
That'll create an internal table that looks something like this:
CREATE TABLE UserThingsByStatus_Index (
IsNew BOOL,
ThingId INT64 NOT NULL,
UserId INT64 NOT NULL,
) PRIMARY KEY(new, ThingId, UserId), ...
So, when you update rows of UserThings to change the value of the IsNew column, it will delete the old row in UserThingsByIsNew_Index, and insert an additional row. This will tend to create a lot of churn in the index if the IsNew value of rows is changing at a high frequency. This might not be a problem at all, but you will only really know by testing your scenario under a real-world workload for a sustained time.
If you don't update the IsNew field of entities too frequently, then you probably won't have any hot-spotting problems. That's why I mentioned earlier that Cloud Spanner also appends the original table keys to the keys of the index: assuming that your original table rows are well-distributed by the table's keys, then the portion of the index for IsNew=true and IsNew=false, respectively, will have a similar distribution, and shouldn't cause a hotspot.

Related

DynamoDB Query with filter

I like to write a dynamoDb query in which I filter for a certain field, sounds simple.
All the examples I find always include the partition key value, which really confuses me, since it is unique value, but I want a list.
I got id as the partition key and no sort key or any other index. I tried to add partner as an index did not make any difference.
AttributeValue attribute = AttributeValue.builder()
.s(partner)
.build();
Map<String, AttributeValue> expressionValues = new HashMap<>();
expressionValues.put(":value", attribute);
Expression expression = Expression.builder()
.expression("partner = :value")
.expressionValues(expressionValues)
.build();
QueryConditional queryConditional = QueryConditional
.keyEqualTo(Key.builder()
.partitionValue("id????")
.build());
Iterator<Product> results = productTable.query(r -> r.queryConditional(queryConditional)
Would appreciate any help. Is there a misunderstandig on my side?

DynamoDB has two distinct, but similar, operations - Query and Scan:
Scan is for reading the entire table, including all partition keys.
Query is for reading a specific partition key - and all sort key in it (or a contiguous range of sort key - hence the nickname "range key" for that key).
If your data model does not have a range key, Query is not relevant for you - you should use Scan.
However this means that each time you call this query, the entire table will be read. Unless your table is tiny, this doesn't make economic sense, and you should reconsider your data model. For example, if you frequently look up results by the "partner" attribute, you can consider creating a GSI (global secondary index) with "partner" as its partition key, allowing you to quickly and cheapy fetch the list of items with a given "partner" value without scanning the entire table.

Querying a Global Secondary Index of a DynamoDB table without using the partition key

I have a DynamoDB table with partition key as userID and no sort key.
The table also has a timestamp attribute in each item. I wanted to retrieve all the items having a timestamp in the specified range (regardless of userID i.e. ranging across all partitions).
After reading the docs and searching Stack Overflow (here), I found that I need to create a GSI for my table.
Hence, I created a GSI with the following keys:
Partition Key: userID
Sort Key: timestamp
I am querying the index with Java SDK using the following code:
String lastWeekDateString = getLastWeekDateString();
AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard().build();
DynamoDB dynamoDB = new DynamoDB(client);
Table table = dynamoDB.getTable("user table");
Index index = table.getIndex("userID-timestamp-index");
QuerySpec querySpec = new QuerySpec()
.withKeyConditionExpression("timestamp > :v_timestampLowerBound")
.withValueMap(new ValueMap()
.withString(":v_timestampLowerBound", lastWeekDateString));
ItemCollection<QueryOutcome> items = index.query(querySpec);
Iterator<Item> iter = items.iterator();
while (iter.hasNext()) {
Item item = iter.next();
// extract item attributes here
}
I am getting the following error on executing this code:
Query condition missed key schema element: userID
From what I know, I should be able to query the GSI using only the sort key without giving any condition on the partition key. Please help me understand what is wrong with my implementation. Thanks.
Edit: After reading the thread here, it turns out that we cannot query a GSI with only a range on the sort key. So, what is the alternative, if any, to query the entire table by a range query on an attribute? One suggestion I found in that thread was to use year as the partition key. This will require multiple queries if the desired range spans multiple years. Also, this does not distribute the data uniformly across all partitions, since only the partition corresponding to the current year will be used for insertions for one full year. Please suggest any alternatives.

When using dynamodb Query operation, you must specify at least the Partition key. This is why you get the error that userId is required. (In the AWS Query docs)
The condition must perform an equality test on a single partition key value.
The only way to get items without the Partition Key is by doing a Scan operation (but this wont be sorted by your sort key!)
If you want to get all the items sorted, you would have to create a GSI with a partition key that will be the same for all items you need (e.g. create a new attribute on all items, such as "type": "item"). You can then query the GSI and specify #type=:item
QuerySpec querySpec = new QuerySpec()
.withKeyConditionExpression(":type = #item AND timestamp > :v_timestampLowerBound")
.withKeyMap(new KeyMap()
.withString("#type", "type"))
.withValueMap(new ValueMap()
.withString(":v_timestampLowerBound", lastWeekDateString)
.withString(":item", "item"));

Always good solution for any customised querying requirements with DDB is to have right primary key scheme design for GSI.
In designing primary key of DDB, the main principal is that hash key should be designed for partitioning entire items, and sort key should be designed for sorting items within the partition.
Having said that, I recommend you to use year of timestamp as a hash key, and month-date as a sort key.
At most, the number of query you need to make is just 2 at max in this case.
you are right, you should avoid filtering or scanning as much as you can.
So for example, you can make the query like this If the year of start date and one of end date would be same, you need only one query:
.withKeyConditionExpression("#year = :year and #month-date > :start-month-date and #month-date < :end-month-date")
and else like this:
.withKeyConditionExpression("#year = :start-year and #month-date > :start-month-date")
and
.withKeyConditionExpression("#year = :end-year and #month-date < :end-month-date")
Finally, you should union the result set from both queries.
This consumes only 2 read capacity unit at most.
For better comparison of sort key, you might need to use UNIX timestamp.
Thanks

DynamoDB query/scan only returns subset of items

I noticed that DynamoDB query/scan only returns documents that contain a subset of the document, just the key columns it appears.
This means I need to do a separate Batch_Get to get the actual documents referenced by those keys.
I am not using a projection expression, and according to the documentation this means the whole item should be returned.1
How do I get query to return the entire document so I don't have to do a separate batch get?
One example bit of code that shows this is below. It prints out found documents, yet they contain only the primary key, the secondary key, and the sort key.
t1 = db.Table(tname)
q = {
'IndexName': 'mysGSI',
'KeyConditionExpression': "secKey= :val1 AND " \
"begins_with(sortKey,:status)",
'ExpressionAttributeValues': {
":val1": 'XXX',
":status": 'active-',
}
}
res = t1.query(**q)
for doc in res['Items']:
print(json.dumps(doc))

This situation is discussed in the documentation for the Select parameter. You have to read quite a lot to find this, which is not ideal.
If you query or scan a global secondary index, you can only request
attributes that are projected into the index. Global secondary index
queries cannot fetch attributes from the parent table.
Basically:
If you query the parent table then you get all attributes by default.
If you query an LSI then you get all attributes by default - they're retrieved from the projection in the LSI if all attributes are projected into the index (so that costs nothing extra) or from the base table otherwise (which will cost you more reads).
If you query or scan a GSI, you can only request attributes that are projected into the index. GSI queries cannot fetch attributes from the parent table.

Is it possible to sort a Cassandra Column Family by a specific column of a list of a user-defined datatype?

I'm having a little hard time understanding Cassandra. I simply couldn't write this question without making it look like confusing, but as I detail it below it may become clearer.
Suppose I have this datatype that I've created:
CREATE TYPE transaction (
transaction_id UUID,
value float,
transaction_date timestamp,
PRIMARY KEY (transaction_id, transaction_date)
);
PS: I'm using it as if it was a 'class', but that might be a logical mistake of mine, please correct me if it can't be used as such.
Anyway, also I have this Column Family, in which I've created a list of this 'transaction' datatype:
CREATE TABLE transactions_history_by_date (
wallet_address UUID,
user_id UUID,
transactions list <transaction>,
PRIMARY KEY (wallet_address, transaction_date))
WITH CLUSTERING ORDER BY (transaction_date DESC);
So what I'd like to know if this Column Family above is correct. I'd like to get all the transactions of a wallet, sorted by the transaction date (but the date is a column of the 'transaction' datatype - and to complicate it even more, in this Column Family there's a list of transactions, and not just a single one).

No, in Cassandra you can sort only on the value of the clustering column - in this case you need to move transaction_date into table itself...

To expand on Alex's answer, in your situation I think the best approach would probably be to denormalise your table. Rather than using a UDT, you could create something like this:
CREATE TABLE transactions_history_by_date (
wallet_address UUID,
user_id UUID,
transaction_id UUID,
value float,
transaction_date timestamp,
PRIMARY KEY ((wallet_address), transaction_date, transaction_id))
WITH CLUSTERING ORDER BY (transaction_date DESC);
Now you can make the following query and the results will be sorted by date:
SELECT * FROM transactions_history_by_date WHERE wallet_address = ...;
Note that I added transaction_id as a second clustering key. If this was omitted the table would not have been able to hold two transactions that had the same wallet_address and the same transaction_date. This is because unique rows are identified by the primary key.

Slow Selection Query even after indexing the table (sqlite and c++)

Create tables
I have a database composed of two tables:
ENTITE_CANDIDATE
VARIATIONS
Tables are created by using the following queries:
CREATE TABLE IF NOT EXISTS ENTITE_CANDIDATE (ID INTEGER PRIMARY KEY NOT NULL, ID_KBP TEXT NOT NULL, wiki_title TEXT, type TEXT NOT NULL);"
CREATE TABLE IF NOT EXISTS VARIATIONS (ID INTEGER PRIMARY KEY NOT NULL, ID_ENTITE INTEGER, NAME TEXT, TYPE TEXT, LANGUAGE TEXT, FOREIGN KEY(ID_ENTITE) REFERENCES ENTITE_CANDIDATE(ID));"
Table ENTITE_CANDIDATE is composed of 818,742 records
Table VARIATIONS is composed of 154,716,653 records
Index tables
I indexed the previous tables by using the following queries:
`CREATE INDEX var_id ON VARIATIONS (ID, ID_ENTITE, NAME);`
`CREATE INDEX entity_id ON ENTITE_CANDIDATE (ID, wiki_title);`
Retrieve information
I want to retrieve from table VARIATIONS the following records:
"SELECT ID, ID_ENTITE, NAME FROM VARIATIONS WHERE NAME=foo ;"
Every select query is taking around 5.414931 seconds. I know the table contains a very large number of records. But can I make the retrieval faster? Am I indexing correctly the tables?

The documentation says:
the index might be used if the initial columns of the index … appear in WHERE clause terms.
This query uses only the NAME column to search, so the var_id index cannot be used. (That index is useful only for lookups that use ID, which is mostly useless because the ID column is already indexed as PRIMARY KEY.)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Is it a good idea to have an index on a boolean? - google-cloud-platform

I have a table with a boolean field, IsNew, that indicates whether or not the corresponding entity is new. I want to periodically query for all entities in a particular state. What are the implications of having index on boolean (or enum)? Will it create a hotspot? Any limitations on QPS?

Related

DynamoDB Query with filter

Querying a Global Secondary Index of a DynamoDB table without using the partition key

DynamoDB query/scan only returns subset of items

Is it possible to sort a Cassandra Column Family by a specific column of a list of a user-defined datatype?

Slow Selection Query even after indexing the table (sqlite and c++)

Categories

Resources