For a select query, for what rows are read locks acquired? Is it only the rows that match the filter, or all rows that had to be scanned?
First, note that locks are only acquired for read-write transactions, not for read-only transactions (https://cloud.google.com/spanner/docs/reads).
Cloud Spanner will acquire locks on all the returned rows. It will also acquire enough extra locks to avoid “false negatives”, which are rows that aren’t returned because they don’t initially match the filter, but are then modified to match the filter before your transaction commits. These false negatives are often called “phantom rows”: you execute a query and get a set of results, and then in the same transaction you execute the exact same query and get more rows.
If the query plan does a scan over the base table, we will take a range lock on the whole table, so that no phantom rows can appear until your transaction completes. If the query plan uses an index to find rows with value ‘X’ for field ‘Y’, then we’ll lock the range of the index corresponding to all possible index entries for ‘Y = X’, so that any transaction that wants to insert a new index entry with ‘Y = X’ has to wait until your transaction completes.
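To see the read-only vs. read-write distinction in code, here is a minimal Go sketch using the cloud.google.com/go/spanner client. The database path, table, and column names are placeholders; the same query takes no locks in a read-only transaction but acquires read locks (plus the range locks described above) inside a read-write transaction.

package main

import (
    "context"
    "log"

    "cloud.google.com/go/spanner"
)

func main() {
    ctx := context.Background()
    // Placeholder database path.
    client, err := spanner.NewClient(ctx, "projects/my-project/instances/my-instance/databases/my-db")
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    // Placeholder query; table and column names are illustrative only.
    stmt := spanner.Statement{
        SQL:    "SELECT Id FROM Orders WHERE Status = @s",
        Params: map[string]interface{}{"s": "OPEN"},
    }

    // Read-only transaction: no locks are acquired for this query.
    ro := client.ReadOnlyTransaction()
    if err := ro.Query(ctx, stmt).Do(func(r *spanner.Row) error { return nil }); err != nil {
        log.Fatal(err)
    }
    ro.Close()

    // Read-write transaction: the same query takes read locks on the returned
    // rows plus range locks that prevent phantoms, held until commit.
    _, err = client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
        return txn.Query(ctx, stmt).Do(func(r *spanner.Row) error {
            return nil // process the row
        })
    })
    if err != nil {
        log.Fatal(err)
    }
}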
Related
Here are my tables:
Table1
Id (String, composite PK partition key)
IdTwo (String, composite PK sort key)
Table2
IdTwo (String, simple PK partition key)
Timestamp (Number)
I want to PutItem in Table1 only if IdTwo does not exist in Table2 or the item in Table2 with the same IdTwo has Timestamp less than the current time (can be given as outside input).
The simple approach I know would work is:
GetItem on Table2 with ConsistentRead=true. If the item exists and its Timestamp is not less than the current time, exit early.
PutItem on Table1.
However, this is two network calls to DDB. I'd prefer to optimize it, for example by using TransactWriteItems, which is a single network call. Is that possible for my use case?
If you want to share code, I'd prefer Go, but any language is fine.
First off, the operation you're looking for is TransactWriteItems - https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_TransactWriteItems.html
This is the API operation that lets you do atomic, transactional, conditional write operations. There are two parts to your question; I'm not sure they can be done together, but then they might not need to be.
The first part, inserting into Table1 if a condition is met in Table2, is simple enough: you add the item you want in Table1 in the Put section of the API call and phrase the existence check for Table2 in the ConditionCheck section.
You can't do multiple checks right now, so the check that the timestamp is lower than the current time is a separate operation, also expressed as a ConditionCheck. You can't combine the two or do just one of them because of your rules.
I'd suggest doing a bit of optimistic concurrency here. Try the TransactWriteItems with the second ConditionCheck, where the write will succeed only if the timestamp is less than the current time. This is what should happen in most cases. If the transaction fails, you then need to check whether it failed because the timestamp was not lower or because the item doesn't exist yet.
If it doesn't exist yet, do a TransactWriteItems where you populate the timestamp, with a condition check to make sure it still doesn't exist (another thread might have written it in the meantime), and then retry the first operation.
You basically want to keep retrying the first operation (the write with a condition check that the timestamp is lower) until it succeeds or fails for a good reason. If it fails because the data is uninitialized, initialize it, taking race conditions into account, and then try again.
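Since you asked for Go, here is a minimal sketch of that first attempt using the AWS SDK for Go v2. The table, key, and attribute names come from your question; the retry and initialization logic described above is omitted, and the keys and timestamp are example values.

package main

import (
    "context"
    "log"
    "strconv"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        log.Fatal(err)
    }
    client := dynamodb.NewFromConfig(cfg)

    id, idTwo := "example-id", "example-id-two" // example keys
    now := strconv.FormatInt(time.Now().Unix(), 10)

    // One network call: put into Table1 only if Table2's item for IdTwo has
    // Timestamp < now. If the condition fails (including when the Table2 item
    // is missing), the whole transaction is canceled.
    _, err = client.TransactWriteItems(ctx, &dynamodb.TransactWriteItemsInput{
        TransactItems: []types.TransactWriteItem{
            {
                ConditionCheck: &types.ConditionCheck{
                    TableName: aws.String("Table2"),
                    Key: map[string]types.AttributeValue{
                        "IdTwo": &types.AttributeValueMemberS{Value: idTwo},
                    },
                    ConditionExpression:      aws.String("#ts < :now"),
                    ExpressionAttributeNames: map[string]string{"#ts": "Timestamp"},
                    ExpressionAttributeValues: map[string]types.AttributeValue{
                        ":now": &types.AttributeValueMemberN{Value: now},
                    },
                },
            },
            {
                Put: &types.Put{
                    TableName: aws.String("Table1"),
                    Item: map[string]types.AttributeValue{
                        "Id":    &types.AttributeValueMemberS{Value: id},
                        "IdTwo": &types.AttributeValueMemberS{Value: idTwo},
                    },
                },
            },
        },
    })
    if err != nil {
        // Inspect the error (types.TransactionCanceledException and its
        // CancellationReasons) to decide whether the Table2 item needs to be
        // initialized and the operation retried, as described above.
        log.Fatal(err)
    }
}

Note that a missing Table2 item and a too-recent Timestamp both surface as a ConditionalCheckFailed cancellation reason, so you still need some way to tell them apart before deciding to initialize and retry.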
I thought of this scenario when querying/scanning a DynamoDB table.
What if I want to get a single item from a table that has 20k items, and the item I'm looking for is around the 19,000th row? I'm using Scan with a Limit of 1000, for example. Does each page consume throughput even though the first 18 or so pages don't return any item? For instance,
I have a User table:
type UserTable {
    userId: ID!
    username: String
    password: String
}
and then my Scan parameters:
var params = {
    TableName: "UserTable",
    FilterExpression: "username = :username",
    ExpressionAttributeValues: {
        ":username": username
    },
    Limit: 1000
};
How do I handle this effectively?
According to the doc
A Scan operation always scans the entire table or secondary index. It
then filters out values to provide the result you want, essentially
adding the extra step of removing data from the result set.
Performance
If possible, you should avoid using a Scan operation on a large table
or index with a filter that removes many results. Also, as a table or
index grows, the Scan operation slows
Read units
The Scan operation examines every item for the requested values and can
use up the provisioned throughput for a large table or index in a
single operation. For faster response times, design your tables and
indexes so that your applications can use Query instead of Scan
For better performance and lower read-unit consumption, I advise you to create a GSI and use it with Query.
A Scan operation will look at the entire table and visit every record to find out which of them match your filter criteria. So it will consume enough throughput to read all of the visited records. A Scan operation is also very slow, especially if the table is large.
As for how to handle this effectively, you can create a secondary index on the table with username as the hash key. Then you can convert the Scan operation into a Query. That way it will only consume enough throughput to fetch one record.
Read about Secondary Indices Here
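As a sketch of what that Query looks like in Go with the AWS SDK for Go v2 (the GSI name username-index is an assumption; use whatever name you give the index when you create it):

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        log.Fatal(err)
    }
    client := dynamodb.NewFromConfig(cfg)

    username := "alice" // example value

    // Query the GSI directly instead of scanning the whole table: only the
    // matching item(s) are read, so only their size counts against throughput.
    out, err := client.Query(ctx, &dynamodb.QueryInput{
        TableName:              aws.String("UserTable"),
        IndexName:              aws.String("username-index"), // assumed GSI name
        KeyConditionExpression: aws.String("username = :username"),
        ExpressionAttributeValues: map[string]types.AttributeValue{
            ":username": &types.AttributeValueMemberS{Value: username},
        },
        Limit: aws.Int32(1),
    })
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("matched items:", len(out.Items))
}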
When we scan a DynamoDB table, we can/should use LastEvaluatedKey to track the progress so that we can resume in case of failures. The documentation says that
LastEvaluatedKey is: The primary key of the item where the operation stopped, inclusive of the previous result set. Use this value to start a new operation, excluding this value in the new request.
My question is if I start a scan, pause, insert a few rows and resume the scan from the previous LastEvaluatedKey, will I get those new rows after resuming the scan?
My guess is that I might miss some or all of the new rows, because the new keys will be hashed and their values could be smaller than LastEvaluatedKey.
Is my guess right? Any explanation or documentation links are appreciated.
It goes sequentially through your data, and it does not know about all the items that were added while it is in progress:
Scan operations proceed sequentially; however, for faster performance
on a large table or secondary index, applications can request a
parallel Scan operation by providing the Segment and TotalSegments
parameters.
Not only can it miss some of the items that were added after you've started scanning, it can also miss some of the items that were added before the scan started if you are using eventually consistent reads:
Scan uses eventually consistent reads when accessing the data in a
table; therefore, the result set might not include the changes to data
in the table immediately before the operation began.
If you need to keep track of items that were added after you've started a scan, you can use DynamoDB Streams for that.
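For reference, here is a minimal Go sketch (AWS SDK for Go v2; the table name is a placeholder) of pausing and resuming a Scan via LastEvaluatedKey / ExclusiveStartKey. Anything inserted behind the saved key is simply not visited when you resume:

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        log.Fatal(err)
    }
    client := dynamodb.NewFromConfig(cfg)

    // startKey would be the LastEvaluatedKey you persisted before pausing;
    // nil means "start from the beginning of the table".
    var startKey map[string]types.AttributeValue

    for {
        out, err := client.Scan(ctx, &dynamodb.ScanInput{
            TableName:         aws.String("MyTable"), // placeholder table name
            Limit:             aws.Int32(1000),
            ExclusiveStartKey: startKey,
        })
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("items in this page:", len(out.Items))

        // An empty LastEvaluatedKey means this scan reached the end of the
        // table; rows inserted behind the saved key are not revisited.
        if len(out.LastEvaluatedKey) == 0 {
            break
        }
        startKey = out.LastEvaluatedKey // persist this to resume later
    }
}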
Situation
I'm using multiple storage databases as attachments to one central "manager" DB.
The storage tables share one pseudo-AUTOINCREMENT index across all storage databases.
I need to iterate over the shared index frequently.
The final number and names of storage tables are not known on storage DB creation.
On some signal, a then-given range of entries will be deleted.
It is vital that no insertion fails and no entry gets deleted before its signal.
Energy outage is possible, data loss in this case is hardly, if ever, tolerable. Any solutions that may cause this (in-memory databases etc) are not viable.
Database access is currently controlled using strands. This takes care of sequential access.
Due to the high frequency of INSERT transactions, I must trigger WAL checkpoints manually. I've seen journals of up to 2GB in size otherwise.
Current solution
I'm inserting datasets using parameter binding to a precreated statement.
INSERT INTO datatable VALUES (:idx, ...);
Doing that, I remember the start and end index. Next, I bind it to an insert statement into the registry table:
INSERT INTO regtable VALUES (:idx, datatable);
My query determines the datasets to return like this:
SELECT MIN(rowid), MAX(rowid), tablename
FROM (SELECT rowid,tablename FROM entryreg LIMIT 30000)
GROUP BY tablename;
After that, I query
SELECT * FROM datatable WHERE rowid >= :minid AND rowid <= :maxid;
where I use predefined statements for each datatable and bind both variables to the first query's results.
This is too slow. As soon as I create the registry table, my insertions slow down so much I can't meet benchmark speed.
Possible Solutions
There are several other ways I can imagine it can be done:
Create a view of all indices as a UNION or OUTER JOIN of all table indices. This can't be done persistently on attached databases.
Create triggers for INSERT/REMOVE on table creation that fill a registry table. This can't be done persistently on attached databases.
Create a trigger for CREATE TABLE on database creation that will create the triggers described above. Requires user functions.
Questions
Now, before I go and add user functions (something I've never done before), I'd like some advice on whether this has any chance of solving my performance issues.
Assuming I create the databases using a separate connection before attaching them: can I create views and/or triggers on the database (as the main schema) that will still work later when I connect to the database via ATTACH?
From what it looks like, an AFTER INSERT trigger will fire after every single inserted row. If it inserts rows into another table, does that mean I'm increasing my number of transactions from 2 to 1+N? Or is there a mechanism that speeds up triggered interaction? The first case would slow things down horribly.
Is there any chance that a FULL OUTER JOIN (I know that I need to create it from other JOIN commands) is faster than filling a registry with insertion transactions every time? We're talking roughly ten transactions per second with an average of 1000 elements (insert) vs. one query of 30000 every two seconds (query).
Open the sqlite3 databases in multi-threaded mode and handle the insert/update/query/delete functions on separate threads. I prefer to transfer query results to an STL container for processing.
OS : Solaris
Database : Informix
I have a process which has 2 threads:
Thread 1 dealing with new transactions and doing DB INSERTS
Thread 2 dealing with existing transactions and doing DB DELETES
PROBLEM
Thread 1 is continuously doing INSERTs (adding new transactions) on a table.
Thread 2 is continuously doing DELETEs (removing expired transactions) from the same table, based on the primary key.
INSERTs are failing with Informix error 244, which occurs due to page/table locking.
I guess the DELETE is taking a table lock instead of a row lock and preventing the INSERTs from working.
Is there any way I can prevent this deadlocking?
EDIT
I found another clue. The 244 error is caused by a SELECT query.
Both the insert and the delete operations do a SELECT from a frequently updated table before doing the operation.
Isolation is set to COMMITTED READ. When I manually do a SELECT on this table from dbaccess while the deletes are happening, I get the same error.
I would be very surprised if a DELETE were doing a full table lock when removing single elements by primary key. Rather, it is likely that the longevity of one (or both) of the transactions themselves is eventually tripping a table lock due to the number of modified rows. In general, you can avoid deadlocks in volatile tables such as this by eliminating all but single-row operations in each transaction and ensuring your transaction model is read-committed. At least, that has been my experience.
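To illustrate "single-row operations in each transaction", here is a rough Go sketch using database/sql. The driver name, DSN, and table/column names are placeholders (an Informix-compatible database/sql driver would need to be registered); the point is simply that each expired row is deleted in its own short transaction, so locks are released quickly and concurrent INSERTs are not held up for long.

package main

import (
    "database/sql"
    "log"
    // A blank import of an Informix-compatible database/sql driver would go
    // here; the driver name used below is a placeholder.
)

// deleteExpired removes expired transactions one row at a time, committing
// after each row so that locks are held only briefly.
func deleteExpired(db *sql.DB, expiredIDs []int64) error {
    for _, id := range expiredIDs {
        tx, err := db.Begin()
        if err != nil {
            return err
        }
        // Single-row DELETE by primary key; table and column names are placeholders.
        if _, err := tx.Exec("DELETE FROM transactions WHERE id = ?", id); err != nil {
            tx.Rollback()
            return err
        }
        if err := tx.Commit(); err != nil {
            return err
        }
    }
    return nil
}

func main() {
    // Placeholder driver name and DSN.
    db, err := sql.Open("informix", "DSN=mydb")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    if err := deleteExpired(db, []int64{1, 2, 3}); err != nil {
        log.Fatal(err)
    }
}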