I have a table in Amazon DynamoDB with a partition key and a range key.
Table structure:

Subscriber ID (partition key) | Item ID (range key) | Date        | ...
123                           | P_345               | some date 1 | ...
123                           | I_456               | some date 2 | ...
123                           | A_678               | some date 3 | ...
Now I want to retrieve data from the table using QueryAsync from the C# DynamoDB library, with multiple conditions:
HashKey = 123
Condition 1: Date is between 'some date 1' and 'some date 2'
Condition 2: Range key begins with I_ or P_
Is there any way I can achieve this using the C# DynamoDB APIs?
You'll need to do the following (I'm not a C# expert, but you can use these instructions to find the right C# syntax):
Because you are looking for a specific hash key, this will be a Query request, not a Scan.
You have a begins_with() condition on the range key. You specify that using the KeyConditionExpression parameter of the Query. The KeyConditionExpression will ask for HashKey = 123 AND begins_with(RangeKey, "P_").
However, KeyConditionExpression does not allow an OR (range key begins with either "P_" or "I_"), so you'll need to run two separate queries: one with "I_" and one with "P_" (you can even run the two queries in parallel, if you wish).
The date is not one of the key columns, so you will need to filter it with a FilterExpression parameter on the query. Note that filtering only happens in the last step, after DynamoDB has already read all the items matching the KeyConditionExpression above. This may increase your costs: if filtering removes a lot of items, you still pay for reading them.
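As a rough illustration of those steps with the low-level AWS SDK for .NET, here's a minimal sketch. The table and attribute names ("MyTable", "SubscriberId", "ItemId", "Date") and the string-typed values are assumptions; substitute your own:

using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

public static async Task<List<Dictionary<string, AttributeValue>>> QueryByPrefixAsync(
    IAmazonDynamoDB client, string prefix)
{
    var request = new QueryRequest
    {
        TableName = "MyTable",  // assumed table name
        // Key condition: exact hash key plus begins_with on the range key.
        KeyConditionExpression = "SubscriberId = :subId AND begins_with(ItemId, :prefix)",
        // Date is a non-key attribute, so it goes in a FilterExpression,
        // which is applied after the key condition has selected the items.
        FilterExpression = "#dt BETWEEN :from AND :to",
        // "Date" is a DynamoDB reserved word, so alias it.
        ExpressionAttributeNames = new Dictionary<string, string> { ["#dt"] = "Date" },
        ExpressionAttributeValues = new Dictionary<string, AttributeValue>
        {
            [":subId"]  = new AttributeValue { S = "123" },  // S assumed; use N if numeric
            [":prefix"] = new AttributeValue { S = prefix },
            [":from"]   = new AttributeValue { S = "some date 1" },
            [":to"]     = new AttributeValue { S = "some date 2" }
        }
    };
    // Paging via LastEvaluatedKey is omitted here for brevity.
    var response = await client.QueryAsync(request);
    return response.Items;
}

Since KeyConditionExpression has no OR, call it once per prefix; Task.WhenAll(QueryByPrefixAsync(client, "P_"), QueryByPrefixAsync(client, "I_")) runs the two queries in parallel.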
We have a DynamoDB table:
resource "aws_dynamodb_table" "prospectmaterials_table" {
name = "drinks"
hash_key = "PK"
billing_mode = "PAY_PER_REQUEST"
read_capacity = 5
write_capacity = 5
attribute {
name = "PK"
type = "S"
}
}
It currently contains 36,000 records.
An example of the data it contains:
PK                                              | Name       | Description          | Price
------------------------------------------------+------------+----------------------+------
Coke-Coke Cola-Classic beverage-1.00            | Coke Cola  | Classic beverage     | 1.00
Pepsi-Pepsi Cola-Another beverage-1.00          | Pepsi Cola | Another beverage     | 1.00
Dr. Pepper-Dr. Pepper-Yet another beverage-2.00 | Dr. Pepper | Yet another beverage | 2.00
We want to retrieve all ~1000 records with the word "beverage" in the Description field.
Via an API Gateway endpoint, we want to query the table to retrieve each record that contains "beverage". This query currently breaks with "Invalid operator used in KeyConditionExpression: contains":
{
  "TableName": "drinks",
  "ConsistentRead": true,
  "ExpressionAttributeValues": {
    ":m": {
      "S": "beverage"
    }
  },
  "KeyConditionExpression": "contains(PK,:m)"
}
How should I construct this query so that it performs quickly and returns all the records I require?
The CONTAINS operation you are trying to use is not supported in a KeyConditionExpression with the Query API. The only key condition operators available are EQ | LE | LT | GE | GT | BEGINS_WITH | BETWEEN (see docs). On top of that, all of those operators except EQ are reserved for the sort key. With a Query operation you must specify a single partition, i.e. by using the = operator on the partition key you are querying. This means that not only will you have to restructure your keys in order to accomplish the access pattern of:
We want to retrieve all ~1000 records with the word "beverage" in the Description field.
You will also probably have to change the access pattern itself. Something more feasible with DynamoDB might read as:
We want to retrieve all ~1000 items with the type of beverage
This is because equality on a partition key is a prerequisite to every single query operation you perform on your base table.
If you can't change the way your table is structured, then DynamoDB is likely not the right tool for the job. If you can, though, there are certainly ways of evaluating and shaping data to work with NoSQL tables in general, and DynamoDB specifically.
The best approach would be to lay out all of your access patterns, consult best practices documentation provided by AWS (linked earlier), and design your base table around your main patterns, while leveraging secondary indexes to supplement secondary patterns if necessary.
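For instance, if the table were remodeled so that the item type lives in its own attribute backing a global secondary index, the access pattern becomes a plain Query. A sketch in C#, where the index name "ItemTypeIndex" and the attribute "ItemType" are assumptions, not part of the original table:

// Query an assumed GSI whose partition key is the ItemType attribute.
var request = new QueryRequest
{
    TableName = "drinks",
    IndexName = "ItemTypeIndex",
    KeyConditionExpression = "ItemType = :t",
    ExpressionAttributeValues = new Dictionary<string, AttributeValue>
    {
        [":t"] = new AttributeValue { S = "beverage" }  // one partition holds all beverages
    }
};
var response = await client.QueryAsync(request);  // client: AmazonDynamoDBClient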
This is just a doubt I can't resolve from searching the internet.
I have a table like this:
| id | infos | ignored |
| 1  | abc   | true    |
| 2  | def   | false   |
| 3  | ghi   | false   |
I see that I can't create a DynamoDB GSI on boolean columns. Is that right?
I want to create a GSI on this ignored column.
DynamoDB only allows GSI key attributes of type string, binary, or number.
So you could use strings ("t" or "f"), numbers (1 or 0), or binary (also 1 or 0) to represent a boolean value if you'd like.
It sounds like you're trying to build a sparse index (e.g. only certain items are in the index). Keep in mind that you can do this by the mere existence of the attribute that makes up the GSI.
For example, you could include the ignored attribute on items you want to project into the index and remove the ignored attribute from items you do not want in the index.
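As an illustration of both ideas with the low-level .NET SDK (a sketch; the table name "MyTable" and the "t"/"f" encoding are assumptions):

// Represent the boolean as a string GSI key. For a sparse index, omit the
// attribute entirely on items that should not appear in the index.
var put = new PutItemRequest
{
    TableName = "MyTable",
    Item = new Dictionary<string, AttributeValue>
    {
        ["id"]      = new AttributeValue { S = "1" },
        ["infos"]   = new AttributeValue { S = "abc" },
        ["ignored"] = new AttributeValue { S = "t" }  // present -> projected into the GSI
        // Omit "ignored" entirely to keep the item out of the sparse index.
    }
};
await client.PutItemAsync(put);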
A Query operation, as specified in the DynamoDB documentation:
A query operation searches only primary key attribute values and supports a subset of comparison operators on key attribute values to refine the search process.
and a Scan operation:
A scan operation scans the entire table. You can specify filters to apply to the results to refine the values returned to you, after the complete scan.
Which is best based on performance and cost?
When creating a DynamoDB table, choose the primary key and local secondary indexes (LSIs) so that a Query operation returns the items you want.
Query operations only support an equality check on the partition key, but conditionals (=, <, <=, >, >=, Between, Begins With) on the sort key.
Scan operations are generally slower and more expensive, as the operation has to iterate through each item in your table to get the items you are requesting.
Example:
Table: CustomerId, AccountType, Country, LastPurchase
Primary Key: CustomerId + AccountType
In this example, you can use a Query operation to get:
A CustomerId with a conditional filter on AccountType
A Scan operation would need to be used to return:
All Customers with a specific AccountType
Items based on conditional filters by Country, i.e. all Customers from the USA
Items based on conditional filters by LastPurchase, i.e. all Customers that made a purchase in the last month
To avoid Scan operations for frequently used queries, create a Local Secondary Index (LSI) or Global Secondary Index (GSI).
Example:
Table: CustomerId, AccountType, Country, LastPurchase
Primary Key: CustomerId + AccountType
GSI: AccountType + CustomerId
LSI: CustomerId + LastPurchase
In this example a Query operation can allow you to get:
A CustomerId with a conditional filter on AccountType
[GSI] A conditional filter on CustomerIds for a specific AccountType
[LSI] A CustomerId with a conditional filter on LastPurchase
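To make the distinction concrete, here is a rough C# sketch of the two request shapes; the table and attribute names come from the example above, and the values are made up:

// Query: targets one partition, so DynamoDB reads only the matching items.
var query = new QueryRequest
{
    TableName = "Customers",
    KeyConditionExpression = "CustomerId = :id AND AccountType = :t",
    ExpressionAttributeValues = new Dictionary<string, AttributeValue>
    {
        [":id"] = new AttributeValue { S = "42" },
        [":t"]  = new AttributeValue { S = "premium" }
    }
};

// Scan: reads every item in the table, then applies the filter afterwards.
var scan = new ScanRequest
{
    TableName = "Customers",
    FilterExpression = "Country = :c",
    ExpressionAttributeValues = new Dictionary<string, AttributeValue>
    {
        [":c"] = new AttributeValue { S = "USA" }
    }
};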
Suppose your DynamoDB table has customer_country as its partition key/primary key. If you use Query, customer_country is a mandatory field for the operation, and all filters apply only to items that belong to that customer_country.
If you perform a table Scan, the filter is applied across all partition keys: it first fetches all the data, and applies the filter after fetching it from the table.
e.g., here customer_country is the partition key/primary key and id is the sort key:
------------------------------
customer_country | name  | id
------------------------------
VV               | Tom   | 1
VV               | Jack  | 2
VV               | Mary  | 4
BB               | Nancy | 5
BB               | Lom   | 6
BB               | XX    | 7
CC               | YY    | 8
CC               | ZZ    | 9
------------------------------
If you perform a Query operation, it applies to only one customer_country value, and the comparison on the partition key must be the equality operator (=), so only items equal to that partition key value are fetched.
If you perform a Scan operation, it fetches every item in the table and filters the data after retrieving it.
Note: be careful with Scan operations; they can exceed your provisioned RCU.
It's similar to a relational database.
With a Query, you use a primary key in the where condition; the computational complexity is O(log n), as most key structures are tree-based.
With a Scan, you have to scan the whole table and then apply a filter to every single row to find the right result. The performance is O(n), which is much slower if your table is big.
In short, use Query when you know the primary key, and save Scan for the worst case.
Also, consider a global secondary index to support different kinds of queries on different keys and meet your performance objectives.
In terms of performance, it's good practice to design your table so that applications use Query instead of Scan, because a Scan operation always scans the entire table before it filters out the desired values, which means it takes more time and capacity to process data operations such as reads, writes, and deletes. For more information, please refer to the official documentation.
Query is much better than Scan, performance-wise. Scan, as its name implies, will scan the whole table. But you must be well aware of the table's key, sort key, indexes, and their related sort keys in order to know when you can use Query.
If you filter your query using:
key
key & sort key
index
index & its related sort key
then use Query! Otherwise use Scan, which is more flexible about which columns you can filter on.
You can NOT use Query if:
there are more than two fields in the filter (e.g. key, sort key, and index)
you filter on the sort key only (of the primary key or an index)
you filter on regular fields (not a key, index, or sort key)
you mix indexes and sort keys (index1 with the sort key of index2)
...
A good explanation:
https://medium.com/@amos.shahar/dynamodb-query-vs-scan-sql-syntax-and-join-tables-part-1-371288a7cb8f
I am migrating my persistence tier from Riak to DynamoDB. My data model contains an optional business-identifier field, which I would like to be able to query as an alternative to the key.
It appears that DynamoDB secondary indexes can't be null and require a range key, so despite the similar name to Riak's secondary indexes, this appears to be quite a different beast.
Is there an elegant way to efficiently query my optional field, short of throwing the data into an external search index?
When you asked this question, DynamoDB did not have Global Secondary Indexes: http://aws.amazon.com/about-aws/whats-new/2013/12/12/announcing-amazon-dynamodb-global-secondary-indexes/
Now, it does.
A local secondary index is best thought of, and functions as, a secondary range key. @andreimarinescu is right: you still must query by the item's hash key; the secondary index just lets you use a limited subset of DynamoDB's query comparison operators (e.g. greater than, equal to, less than) on that alternate range key. So, you still need to know which "hash bucket" you're performing the comparison within.
Global secondary indexes are a bit of a different beast. They are more like a secondary version of your table (and Amazon charges you similarly in terms of provisioned throughput). You can use non-primary key attributes of your table as primary key attributes of your index in a global secondary index, and query them accordingly.
For example, if your table looks like:
| **Hash key**: Item ID | **Range key**: Serial No | **Attribute**: Business ID |
|-----------------------|--------------------------|----------------------------|
| 1                     | 12345                    | 1A                         |
| 2                     | 45678                    | 2B                         |
| 3                     | 34567                    | (empty)                    |
| 3                     | 12345                    | 2B                         |
Then, with a local secondary index on Business ID you could perform queries like "find all the items with a hash key of 3 and a business ID equal to 2B", but you could not do "find all items with a business ID equal to 2B", because the secondary index requires a hash key.
If you were to add a global secondary index using Business ID, then you could perform such queries. You would essentially be providing an alternate primary key for the table: you could perform a query like "find all items with a business ID equal to 2B" and get items 2-45678 and 3-12345 as a response.
Sparse indexes work fine with DynamoDB; it's perfectly allowable that not all items have a business ID, and this can let you keep the provisioned throughput on your index lower than that of the table, depending on how many items you anticipate having a business ID.
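As a rough sketch, querying such a global secondary index from C# might look like the following (the index name "BusinessIdIndex", the attribute name "BusinessId", and the table name "Items" are all assumptions):

// Query the assumed GSI by its hash key; only items that actually carry a
// BusinessId attribute are materialized in this sparse index.
var request = new QueryRequest
{
    TableName = "Items",
    IndexName = "BusinessIdIndex",
    KeyConditionExpression = "BusinessId = :b",
    ExpressionAttributeValues = new Dictionary<string, AttributeValue>
    {
        [":b"] = new AttributeValue { S = "2B" }
    }
};
var response = await client.QueryAsync(request);  // returns items 2-45678 and 3-12345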
The same is also possible using an LSI; just make sure that you don't write any data to the index's key attribute.
In my scenario, for an LSI, I was writing an empty string (""), which is not allowed. I skipped initialization of the sort key altogether and it worked fine: DynamoDB won't even create that attribute for the item.
The details of this behavior are explained in: How can I make a sparse index if the key is always required?
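To illustrate the difference (a sketch; the table name "Items" and LSI sort key "businessId" are assumptions):

// Writing an empty string to an index key attribute is rejected by DynamoDB,
// so leave the attribute out entirely to keep the item off the sparse LSI.
var item = new Dictionary<string, AttributeValue>
{
    ["id"] = new AttributeValue { S = "42" }
    // ["businessId"] = new AttributeValue { S = "" }  // would fail: empty index key
    // With "businessId" omitted, the item simply never appears in the LSI.
};
await client.PutItemAsync(new PutItemRequest { TableName = "Items", Item = item });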
I use the following query to create my table:
create table t1 (url varchar(250) unique);
Then I insert about 500 URLs, twice. Since the URLs are the same the second time, I expected no new entries to show up in the table, but instead the count doubles for:
select count(*) from t1;
What I want is that when I try to add a URL that is already in my table, it is skipped.
Have I declared something in my table declaration incorrectly?
I am using Redshift from AWS.
Sample
urlenrich=# insert into seed(url, source) select 'http://www.google.com', '1';
INSERT 0 1
urlenrich=# select * from seed;
url | wascrawled | source | date_crawled
-----------------------+------------+--------+--------------
http://www.google.com | 0 | 1 |
(1 row)
urlenrich=# insert into seed(url, source) select 'http://www.google.com', '1';
INSERT 0 1
urlenrich=# select * from seed;
url | wascrawled | source | date_crawled
-----------------------+------------+--------+--------------
http://www.google.com | 0 | 1 |
http://www.google.com | 0 | 1 |
(2 rows)
Output of \d seed
urlenrich=# \d seed
Table "public.seed"
Column | Type | Modifiers
--------------+-----------------------------+-----------
url | character varying(250) |
wascrawled | integer | default 0
source | integer | not null
date_crawled | timestamp without time zone |
Indexes:
"seed_url_key" UNIQUE, btree (url)
Figured out the problem: Amazon Redshift does not enforce constraints...
As explained here:
http://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html
They said they may get around to changing it at some point.
NEW 11/21/2013: RDS has added support for PostgreSQL. If you need uniqueness and the like, a PostgreSQL RDS instance is now the best way to go.
In Redshift, constraints are recommended but not enforced; they just help the query planner select better ways to perform the query.
Columnar databases usually do not manage indexes or constraints.
Although Amazon Redshift doesn't enforce unique constraints, there are some ways to delete duplicated records that can be helpful.
See the following link for the details:
copy data from Amazon s3 to Red Shift and avoid duplicate rows
Primary and unique key enforcement in distributed systems, never mind column store systems, is difficult. Both Redshift (ParAccel) and Vertica face the same problems.
The challenge with a column store is that the question being asked is "does this table row have a relevant entry in another table row", but column stores are not designed for row operations.
In HP Vertica there is an explicit command to report on constraint violations.
In Redshift it appears that you have to roll your own.
SELECT COUNT(*) AS TotalRecords, COUNT(DISTINCT {your PK_Column}) AS UniqueRecords
FROM {Your table}
HAVING COUNT(*) > COUNT(DISTINCT {your PK_Column})
Obviously, if you have a multi-column PK you have to do something more heavyweight.
SELECT COUNT(*)
FROM (
    SELECT {PkColumns}
    FROM {Your Table}
    GROUP BY {PkColumns}
    HAVING COUNT(*) > 1
) AS DT
If the above returns a value greater than zero then you have a primary key violation.
For anyone who:
Needs to use Redshift
Wants unique inserts in a single query
Doesn't care too much about query performance
Only really cares about inserting a single unique value at a time
Here's an easy way to get it done:
INSERT INTO MY_TABLE (MY_COLUMNS)
SELECT MY_UNIQUE_VALUE
WHERE MY_UNIQUE_VALUE NOT IN (
    SELECT MY_UNIQUE_VALUE FROM MY_TABLE
    WHERE MY_UNIQUE_COLUMN = MY_UNIQUE_VALUE
)