Cassandra CQL - update (insert) if not equal to - if-statement

I have a scenario where I need to update (or insert) a record if a (non-key) field is not equal to some string OR the record does not exist. For example, given something like:
UPDATE mytable SET firstname='John', lastname='Doe' WHERE id='1' IF lastname != 'Doe';
If the lastname is not currently 'Doe', then update it, or if the record does not exist, update (insert) it. My assumption was that the IF condition would yield true if there was no record, but apparently not. Is there an alternative?

In Cassandra an UPDATE behaves very much like an INSERT statement, as explained in the Apache CQL documentation:
"Note that unlike in SQL, UPDATE does not check the prior existence of the row by default (except through IF, see below): the row is created if none existed before, and updated otherwise. Furthermore, there are no means to know whether a creation or update occurred." - CQL Documentation - Update
I did a simple test and it did work:
cqlsh:test_keyspace> select * from conditional_updating ;
id | firstname | lastname
----+-----------+----------
(0 rows)
cqlsh:test_keyspace> update conditional_updating
set firstname = 'John',
lastname = 'Doe'
WHERE id = 1 IF lastname != 'Doe';
[applied]
-----------
True
cqlsh:test_keyspace> select * from conditional_updating ;
id | firstname | lastname
----+-----------+----------
1 | John | Doe
(1 rows)
cqlsh:test_keyspace> update conditional_updating
set lastname = 'New'
WHERE id = 1 IF lastname != 'Doe';
[applied] | lastname
-----------+----------
False | Doe
Note that using an IF condition isn't free. Under the hood it triggers a lightweight transaction (LWT), also known as CAS (Compare And Set). Such queries require a read and a write, and they also need to reach consensus among all replicas, which makes them considerably more expensive.
"But, please note that using IF conditions will incur a non-negligible performance cost (internally, Paxos will be used) so this should be used sparingly." - CQL Documentation - Update
If you are interested in why lightweight transactions are considered an anti-pattern in Cassandra, I encourage you to have a look here: Lightweight Transactions In Cassandra
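For completeness, here is a minimal sketch of checking that [applied] result from application code. It assumes the Python cassandra-driver and the test table above; the contact point is a placeholder.

from cassandra.cluster import Cluster

# Connect to the keyspace used in the cqlsh demo above.
session = Cluster(["127.0.0.1"]).connect("test_keyspace")

# Conditional update: under the hood this is a lightweight transaction.
result = session.execute(
    "UPDATE conditional_updating SET firstname = 'John', lastname = 'Doe' "
    "WHERE id = 1 IF lastname != 'Doe'"
)

# was_applied mirrors the [applied] column shown in cqlsh.
if result.was_applied:
    print("row inserted or updated")
else:
    print("condition failed, current values:", result.one())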

Please refer to this documentation, as it covers what you need; in Cassandra an UPDATE query acts as an insert if the row does not exist.
update with condition
Syntax:
UPDATE [keyspace_name.]table_name
[USING option [AND option]]
SET assignment [, assignment] ...
WHERE row_specification
[IF column_name = literal [AND column_name = literal] ... | IF EXISTS]
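As a small, hedged illustration of the difference (again assuming the Python cassandra-driver and the conditional_updating table from the earlier answer): a plain UPDATE upserts the row, while UPDATE ... IF EXISTS is only applied when the row already exists.

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("test_keyspace")

# Plain UPDATE: creates the row for id = 2 even though it did not exist before (upsert).
session.execute("UPDATE conditional_updating SET lastname = 'Roe' WHERE id = 2")

# IF EXISTS: not applied for a missing row, so nothing is inserted for id = 3.
result = session.execute(
    "UPDATE conditional_updating SET lastname = 'Roe' WHERE id = 3 IF EXISTS"
)
print(result.was_applied)  # False -> row 3 was not created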

Related

Where/when to check the version of the row when doing optimistic locking? [duplicate]

This question already has answers here:
Optimistic vs. Pessimistic locking
(13 answers)
Closed 6 months ago.
I want to implement optimistic locking for a relational database.
Suppose there is a users table
id | name  | version
---+-------+--------
 1 | Jhon  | 1
 2 | Jhane | 1
My application fetches the Jhon user to change his name
SELECT id, name, version FROM users WHERE id = 1;
jhon = get_user_by_id('1');
jhon.change_name_to('Jhin');
jhon.save() // this method should fail or succeed depending on the version of the row in the database
So where do I need to compare the version of the selected row with the version of the row that is in the database?
Is a database transaction a good place to fetch the existing version of the row and compare it with the already fetched record?
transaction_begin()
jhon = get_user_by_id('1')
if (jhon.version !== updated_jhon.version) { // versions do not match
    // someone else changed the row in the meantime, so roll back
    transaction_rollback();
} else {
    // versions match, so update and commit
    query("UPDATE users SET name = {updated_jhon.name}, version = {jhon.version + 1} WHERE id = 1;")
}
transaction_commit()
I found an answer in a similar question:
How does Hibernate do row version check for Optimistic Locking before committing the transaction
The answer would be not to read a version at all.
Optimistic locking does not require any extra SELECT to get and check the version after the entity was modified
In order to update a record (user), we also need to pass the expected version:
UPDATE users SET name = 'Jhin', version = version + 1 WHERE id = 1 AND version = 1;
If the number of affected records is greater than 0, it means the row was not changed by someone else while we were updating the name.
If the number of affected records is equal to 0, it means someone else changed the row during our modification.
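Here is a minimal sketch of that pattern in application code, assuming psycopg2 and the users table above; the connection string and function name are placeholders.

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string

def rename_user(user_id, new_name, expected_version):
    # The UPDATE itself performs the version check; no extra SELECT is needed.
    with conn, conn.cursor() as cur:
        cur.execute(
            "UPDATE users SET name = %s, version = version + 1 "
            "WHERE id = %s AND version = %s",
            (new_name, user_id, expected_version),
        )
        if cur.rowcount == 0:
            # 0 affected rows: someone else bumped the version (or the row is gone).
            raise RuntimeError("optimistic lock failed, reload and retry")

rename_user(1, "Jhin", expected_version=1)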

Querying DynamoDB with many records to get results with contains

We have a DynamoDB table:
resource "aws_dynamodb_table" "prospectmaterials_table" {
name = "drinks"
hash_key = "PK"
billing_mode = "PAY_PER_REQUEST"
read_capacity = 5
write_capacity = 5
attribute {
name = "PK"
type = "S"
}
}
It currently contains 36,000 records.
An example of the data it contains:
PK                                              | Name       | Description          | Price
------------------------------------------------+------------+----------------------+------
Coke-Coke Cola-Classic beverage-1.00            | Coke Cola  | Classic beverage     | 1.00
Pepsi-Pepsi Cola-Another beverage-1.00          | Pepsi Cola | Another beverage     | 1.00
Dr. Pepper-Dr. Pepper-Yet another beverage-2.00 | Dr. Pepper | Yet another beverage | 2.00
We want to retrieve all ~1000 records with the word "beverage" in the Description field.
Via an API Gateway endpoint, we want to query the table to retrieve each record which contains "beverage". This query currently breaks with "Invalid operator used in KeyConditionExpression: contains":
{
    "TableName": "drinks",
    "ConsistentRead": true,
    "ExpressionAttributeValues": {
        ":m": {
            "S": "beverage"
        }
    },
    "KeyConditionExpression": "contains(PK,:m)"
}
How should I construct this query so that it performs quickly and returns all the records I require?
The CONTAINS operation you are trying to use is not supported in a KeyConditionExpression with the Query API. The only operators available in a KeyConditionExpression are EQ | LE | LT | GE | GT | BEGINS_WITH | BETWEEN (see docs). On top of that, all of those operators except for EQ are reserved for the sort key: a Query operation must target a single partition, i.e. use the = operator on the partition key. This means that not only will you have to restructure your keys in order to accomplish the access pattern of:
We want to retrieve all ~1000 records with the word "beverage" in the Description field.
You will also probably have to change the access pattern itself. Something more feasible with DynamoDB might read as:
We want to retrieve all ~1000 items with the type of beverage
This is because equality on a partition key is a prerequisite to every single query operation you perform on your base table.
If you can't change the way your table is structured, then DynamoDB is likely not the right tool for the job. If you can, there are certainly ways of evaluating and shaping data to work with NoSQL tables in general, and DynamoDB specifically.
The best approach would be to lay out all of your access patterns, consult best practices documentation provided by AWS (linked earlier), and design your base table around your main patterns, while leveraging secondary indexes to supplement secondary patterns if necessary.
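For illustration only, here is a sketch of what such a query could look like with boto3, assuming the data has been reshaped so that a hypothetical item_type attribute serves as the partition key of a GSI named item_type-index; none of these names exist in the original table.

import boto3

dynamodb = boto3.client("dynamodb")

def get_beverages():
    # Query the hypothetical GSI, paging through results until exhausted.
    items, start_key = [], None
    while True:
        kwargs = {
            "TableName": "drinks",
            "IndexName": "item_type-index",  # hypothetical GSI
            "KeyConditionExpression": "item_type = :t",
            "ExpressionAttributeValues": {":t": {"S": "beverage"}},
        }
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        resp = dynamodb.query(**kwargs)
        items.extend(resp["Items"])
        start_key = resp.get("LastEvaluatedKey")
        if not start_key:
            return items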

Amazon DynamoDB multiple scan conditions with multiple BeginsWith

I have a table in Amazon DynamoDB with a partition key and a range key.
Table structure
Subscriber ID (partition key) | Item Id (Range Key) | Date |...
123 | P_345 | some date 1 | ...
123 | I_456 | some date 2 |
123 | A_678 | some date 3 | ...
Now I want to retrieve the data from the table using the C# QueryAsync API with multiple conditions:
HashKey = 123
Condition 1: Date is between 'some date 1' and 'some date 2'
Condition 2: Range key begins_with 'I_' and 'P_'
Is there any way I can achieve this using the C# DynamoDB APIs?
Please help
You'll need to do the following (I'm not a C# expert, but you can use the following instructions to find the right C# syntax to do it):
Because you are looking for a specific hashkey, this will be a Query request, not a Scan.
You have a begins_with() condition on the range key. You specify that using the KeyConditionExpression parameter to the Query. The KeyConditionExpression will ask for HashKey=123 AND begins_with(RangeKey,"P_").
However, KeyConditionExpression does not allow an "OR" (rangekey begins with either "P_" or "I_"). You'll just need to run two separate queries - one with "I_" and one with "P_" (you can even do the two queries in parallel, if you wish).
The date is not one of the key columns, so you will need to filter it with a FilterExpression parameter to the Query. Note that filtering only happens in the last step, after DynamoDB has already read all the items matching the KeyConditionExpression above (this may increase your costs if the filter discards a lot of items, since you still pay for reading them).
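Here is a sketch of that two-query approach using boto3 (Python) for illustration; the expressions carry over unchanged to the C# QueryAsync call. The table and attribute names (my_table, SubscriberId, ItemId, Date) are assumptions based on the question.

import boto3

table = boto3.resource("dynamodb").Table("my_table")  # hypothetical table name

def query_prefix(prefix):
    resp = table.query(
        # Key condition: exact partition key plus begins_with on the range key.
        KeyConditionExpression="SubscriberId = :h AND begins_with(ItemId, :p)",
        # Date filter runs after the key lookup; filtered-out items are still read (and paid for).
        FilterExpression="#d BETWEEN :d1 AND :d2",
        ExpressionAttributeNames={"#d": "Date"},  # Date is a reserved word
        ExpressionAttributeValues={
            ":h": "123",
            ":p": prefix,
            ":d1": "some date 1",
            ":d2": "some date 2",
        },
    )
    return resp["Items"]

# KeyConditionExpression cannot express OR, so run one query per prefix and merge.
items = query_prefix("P_") + query_prefix("I_")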

Declare a variable in RedShift

SQL Server has the ability to declare a variable, then call that variable in a query like so:
DECLARE @StartDate date;
SET @StartDate = '2015-01-01';
SELECT *
FROM Orders
WHERE OrderDate >= @StartDate;
Does this functionality work in Amazon Redshift? From the documentation, it looks like DECLARE is used solely for cursors. SET looks to be the function I am looking for, but when I attempt to use it, I get an error.
set session StartDate = '2015-01-01';
[Error Code: 500310, SQL State: 42704] [Amazon](500310) Invalid operation: unrecognized configuration parameter "startdate";
Is it possible to do this in RedShift?
Slavik Meltser's answer is great. As a variation on this theme, you can also use a WITH construct:
WITH tmp_variables AS (
    SELECT
        '2015-01-01'::DATE AS StartDate,
        'some string' AS some_value,
        5556::BIGINT AS some_id
)
SELECT *
FROM Orders
WHERE OrderDate >= (SELECT StartDate FROM tmp_variables);
Actually, you can simulate a variable using a temporary table: create one, set the data and you are good to go.
Something like this:
CREATE TEMP TABLE tmp_variables AS SELECT
'2015-01-01'::DATE AS StartDate,
'some string' AS some_value,
5556::BIGINT AS some_id;
SELECT *
FROM Orders
WHERE OrderDate >= (SELECT StartDate FROM tmp_variables);
The temp table will be dropped automatically at the end of the session.
Temp tables are bound to the session (connection) and therefore cannot be shared across sessions.
No, Amazon Redshift does not have the concept of variables. Redshift presents itself as PostgreSQL, but is highly modified.
There was mention of User Defined Functions at the 2014 AWS re:Invent conference, which might meet some of your needs.
Update in 2016: Scalar User Defined Functions can perform computations but cannot act as stored variables.
Note that if you are using the psql client to query, psql variables can still be used as always with Redshift:
$ psql --host=my_cluster_name.clusterid.us-east-1.redshift.amazonaws.com \
--dbname=your_db --port=5432 --username=your_login -v dt_format=DD-MM-YYYY
# select current_date;
date
------------
2015-06-15
(1 row)
# select to_char(current_date,:'dt_format');
to_char
------------
15-06-2015
(1 row)
# \set
AUTOCOMMIT = 'on'
...
dt_format = 'DD-MM-YYYY'
...
# \set dt_format 'MM/DD/YYYY'
# select to_char(current_date,:'dt_format');
to_char
------------
06/15/2015
(1 row)
You can now use user defined functions (UDF's) to do what you want:
CREATE FUNCTION my_const()
RETURNS VARCHAR IMMUTABLE AS
$$ return 'my_string_constant' $$ LANGUAGE plpythonu;
You can then call it with SELECT my_const();
Unfortunately, this does require certain access permissions on your Redshift database.
Not an exact answer but in DBeaver, you can set up variables to use in your local queries in the IDE. Our team has found this helpful in testing before we put code into production.
From this answer: https://stackoverflow.com/a/58308439/220997
You should then be able to do:
@set date = '2019-10-09'
SELECT ${date}::DATE, ${date}::TIMESTAMP WITHOUT TIME ZONE
which produces:
| date | timestamp |
|------------|---------------------|
| 2019-10-09 | 2019-10-09 00:00:00 |
Again note: This only works in the DBeaver IDE. This SQL won't work when integrated in stored procedures or called from other tools

Amazon RedShift: Unique Column not being honored

I use the following query to create my table.
create table t1 (url varchar(250) unique);
Then I insert about 500 URLs, twice. I was expecting that the second time, since I already had the URLs, no new entries would show up in my table, but instead the count doubles for:
select count(*) from t1;
What I want is that when I try and add a url that is already in my table, it is skipped.
Have I declared something in my table declaration incorrectly?
I am using RedShift from AWS.
Sample
urlenrich=# insert into seed(url, source) select 'http://www.google.com', '1';
INSERT 0 1
urlenrich=# select * from seed;
url | wascrawled | source | date_crawled
-----------------------+------------+--------+--------------
http://www.google.com | 0 | 1 |
(1 row)
urlenrich=# insert into seed(url, source) select 'http://www.google.com', '1';
INSERT 0 1
urlenrich=# select * from seed;
url | wascrawled | source | date_crawled
-----------------------+------------+--------+--------------
http://www.google.com | 0 | 1 |
http://www.google.com | 0 | 1 |
(2 rows)
Output of \d seed
urlenrich=# \d seed
Table "public.seed"
Column | Type | Modifiers
--------------+-----------------------------+-----------
url | character varying(250) |
wascrawled | integer | default 0
source | integer | not null
date_crawled | timestamp without time zone |
Indexes:
"seed_url_key" UNIQUE, btree (url)
Figured out the problem
Amazon RedShift does not enforce constraints...
As explained here
http://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html
They said they may get around to changing it at some point.
NEW 11/21/2013
RDS has added support for PostgreSQL; if you need unique constraints and the like, a PostgreSQL RDS instance is now the best way to go.
In Redshift, constraints are recommended but not enforced; they just help the query planner select better ways to perform the query.
Usually, columnar databases do not manage indexes or constraints.
Although Amazon Redshift doesn't support unique constraints, there are some ways to delete duplicated records that can be helpful.
See the following link for the details.
copy data from Amazon s3 to Red Shift and avoid duplicate rows
Primary and unique key enforcement in distributed systems, never mind column-store systems, is difficult. Both Redshift (ParAccel) and Vertica face the same problems.
The challenge with a column store is that the question that is being asked is "does this table row have a relevant entry in another table row" but column stores are not designed for row operations.
In HP Vertica there is an explicit command to report on constraint violations.
In Redshift it appears that you have to roll your own.
SELECT COUNT(*) AS TotalRecords, COUNT(DISTINCT {your PK_Column}) AS UniqueRecords
FROM {Your table}
HAVING COUNT(*)> COUNT(DISTINCT {your PK_Column})
Obviously, if you have a multi-column PK you have to do something more heavyweight.
SELECT COUNT(*)
FROM (
SELECT {PkColumns}
FROM {Your Table}
GROUP BY {PKColumns}
HAVING COUNT(*)>1
) AS DT
If the above returns a value greater than zero then you have a primary key violation.
For anyone who:
Needs to use redshift
Wants unique inserts in a single query
Doesn't care too much about query performance
Only really cares about inserting a single unique value at a time
Here's an easy way to get it done
INSERT INTO MY_TABLE (MY_COLUMNS)
SELECT MY_UNIQUE_VALUE WHERE MY_UNIQUE_VALUE NOT IN (
    SELECT MY_UNIQUE_VALUE FROM MY_TABLE
    WHERE MY_UNIQUE_COLUMN = MY_UNIQUE_VALUE
)