Is optimistic locking absolutely safe?

When using an optimistic locking strategy, it can solve a concurrency problem like the one below:
| the first transaction started        |
|                                      |
| select a row                         |
|                                      | the second transaction started
| update the row with version checking |
|                                      | select the same row
| commit txn                           |
|                                      | update the row with version checking
|                                      |
|                                      | rolls back because version is dirty
But what about the extremely rare case where the update in the second transaction happens after the update in the first transaction but before the first transaction commits?
| the first transaction started        |
|                                      | the second transaction started
| select a row                         |
|                                      | select the same row
| update the row with version checking |
|                                      | update the row with version checking
| commit txn                           |
|                                      | rolls back because version is dirty // will it?
In an experiment I ran, the update in the second transaction could not read the 'dirty' version because the first transaction had not been committed yet. Will the second transaction fail in this case?

You didn't say in your question what database system you're actually using, so I don't know the details of your system.
But in any case, under an optimistic locking system, a process cannot just check the row versions when it performs the update statement, because of exactly the problem you are worried about.
For fully serializable, isolated transactions, each process must atomically check the row versions of all the rows it examined and modified, at commit time. So in your second scenario, the right-hand process will not detect a conflict until it tries to commit (a step which you did not include for the right-hand process). When it tries to commit, it will detect the conflict and roll back.
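To make the timelines above concrete, here is a minimal JDBC sketch of what "update the row with version checking" typically looks like at the application level. The connection URL, table and column names (account, id, balance, version) are hypothetical, and, as noted above, whether the conflict surfaces at this UPDATE or only at commit time depends on the database's isolation behaviour.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class VersionCheckedUpdate {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection URL and schema: a table account(id, balance, version).
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:demo")) {
            con.setAutoCommit(false);

            long id = 1L;
            long versionSeenAtSelect = 3L;   // the version read earlier in this transaction

            // The UPDATE matches zero rows if another transaction already bumped the version.
            String sql = "UPDATE account SET balance = ?, version = version + 1"
                       + " WHERE id = ? AND version = ?";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setLong(1, 100L);
                ps.setLong(2, id);
                ps.setLong(3, versionSeenAtSelect);

                if (ps.executeUpdate() == 1) {
                    con.commit();
                } else {
                    con.rollback();          // optimistic-lock failure: retry or report a conflict
                }
            }
        }
    }
}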

As you've already discovered, optimistic locking is subject to a TOCTOU (time-of-check to time-of-use) race condition: between the version check and the actual commit there is a short time window during which another transaction can modify the data.
To make optimistic locking 100% safe, you have to make sure that the second transaction waits until the first transaction commits and only then performs its version check:
You can achieve this by acquiring a row-level (select for update) lock prior to the update statement.
jOOQ does that for you. In Hibernate, you have to lock the row manually:
// Acquire a shared row-level lock (SELECT ... FOR SHARE / FOR UPDATE, depending on the dialect)
// so the version check cannot race with a concurrent writer.
var pessimisticRead = new LockOptions(LockMode.PESSIMISTIC_READ);
session.buildLockRequest(pessimisticRead).lock(entity);
Beware that you can't reproduce that annoying TOCTOU race condition in Hibernate on a single JVM: Hibernate will smoothly resolve it thanks to the shared persistence context. When the transactions run on different JVMs, Hibernate can't help, and you'll have to add extra locking.
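For reference, a minimal sketch of the same lock-then-update flow with the plain JPA API. The persistence unit name, the Account entity and the values are all hypothetical; jakarta.persistence is assumed (use javax.persistence on older stacks).

import jakarta.persistence.Entity;
import jakarta.persistence.EntityManager;
import jakarta.persistence.EntityManagerFactory;
import jakarta.persistence.Id;
import jakarta.persistence.LockModeType;
import jakarta.persistence.Persistence;
import jakarta.persistence.Version;

@Entity
class Account {                          // hypothetical entity
    @Id Long id;
    @Version long version;               // bumped automatically on commit
    long balance;
}

public class LockThenUpdate {
    public static void main(String[] args) {
        // "demo-unit" is a hypothetical persistence unit configured elsewhere.
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("demo-unit");
        EntityManager em = emf.createEntityManager();
        try {
            em.getTransaction().begin();

            Account account = em.find(Account.class, 1L);

            // Take a shared row-level lock before modifying, so the version
            // check cannot race with a concurrent writer.
            em.lock(account, LockModeType.PESSIMISTIC_READ);

            account.balance += 10;

            em.getTransaction().commit();  // the version is checked and incremented here
        } finally {
            em.close();
            emf.close();
        }
    }
}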

Related

How do I keep a running count in DynamoDB without a hot table row?

We have a completely serverless architecture and have been using DynamoDB almost since it was released, but I am struggling to see how to deal with tabulating global numbers at a large scale. Say we have users who choose to do either A or B. We want to keep track of how many users do each, and the choices could happen at a high rate. According to DynamoDB best practices, you are not supposed to write continually to a single row. What is the best way to handle this without using another service like CouchDB or ElastiCache?
You could bucket your users by the first letter of their usernames (or something similar) as the partition key, and either A or B as the sort key, with a regular numeric attribute holding the count.
For example:
PARTITION KEY | SORT KEY | COUNT
--------------------------------
a | A | 5
a | B | 7
b | B | 15
c | A | 1
c | B | 3
The advantage is that you can reduce the risk of hot partitions by spreading your writes across multiple partitions.
Of course, you're trading hot writes for more expensive reads, since now you'll have to scan + filter(A) to get the total count that chose A, and another scan + filter(B) for the total count of B. But if you're writing a bunch and only reading on rare occasions, this may be ok.
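For illustration, here is a rough sketch of the write and read sides with the AWS SDK for Java v2. The table name (choice_counts), key names (bucket, choice) and counter attribute (cnt) are assumptions, and pagination of the scan is omitted for brevity.

import java.util.HashMap;
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.ScanRequest;
import software.amazon.awssdk.services.dynamodb.model.ScanResponse;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemRequest;

public class BucketedCounters {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();
        String table = "choice_counts";                              // hypothetical table name

        // Write side: atomically bump the counter for bucket "a", choice "A".
        Map<String, AttributeValue> key = new HashMap<>();
        key.put("bucket", AttributeValue.builder().s("a").build());
        key.put("choice", AttributeValue.builder().s("A").build());

        ddb.updateItem(UpdateItemRequest.builder()
                .tableName(table)
                .key(key)
                .updateExpression("ADD #c :inc")                     // ADD creates the attribute if missing
                .expressionAttributeNames(Map.of("#c", "cnt"))
                .expressionAttributeValues(Map.of(":inc", AttributeValue.builder().n("1").build()))
                .build());

        // Read side: scan + filter on the choice and sum the per-bucket counts
        // (a single page only; follow LastEvaluatedKey for large tables).
        ScanResponse page = ddb.scan(ScanRequest.builder()
                .tableName(table)
                .filterExpression("#ch = :a")
                .expressionAttributeNames(Map.of("#ch", "choice"))
                .expressionAttributeValues(Map.of(":a", AttributeValue.builder().s("A").build()))
                .build());

        long total = page.items().stream()
                .mapToLong(item -> Long.parseLong(item.get("cnt").n()))
                .sum();
        System.out.println("users who chose A: " + total);
    }
}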

SQL not returning when executed on top of a large data set

I have the SQL below, which gets stuck in an Oracle database for more than 2 hours. This happens only when it is executed via the C++ application. Interestingly, while it was stuck I could execute it manually through SQL Developer and it returned within seconds. My table has millions of rows and about 100 columns. Can someone please point out how I can overcome this issue?
select *
from MY_TABLE
INNER JOIN ( (select max(concat(DATE ,concat('',to_char(INDEX, '0000000000')))) AS UNIQUE_ID
from MY_TABLE
WHERE ((DATE < '2018/01/29')
OR (DATE = '2018/01/29' AND INDEX <= 100000))
AND EXISTS ( select ID
from MY_TABLE
where DATE = '2018/01/29'
AND INDEX > 100000
AND LATEST =1)
group by ID ) SELECTED_SET )
ON SELECTED_SET.UNIQUE_ID = concat(DATE, concat('',to_char(INDEX, '0000000000')))
WHERE (FIELD_1 = 1 AND FIELD_2 = 1 AND FIELD_3='SomeString');
UPDATE:
db file sequential read is present on the session.
SELECT p3, count(*) FROM v$session_wait WHERE event='db file sequential read' GROUP BY p3;
P3 | COUNT(*)
-------------
 1 | 2
"I can execute it through sql developer manually and it returns within seconds"
Clearly the problem is not intrinsic to the query. So it must be a problem with your application.
Perhaps you have a slow network connection between your C++ application and the database. To check this you should talk to your network admin team. They are likely to be resistant to the suggestion that the network is the problem. So you may need to download and install Wireshark, and investigate it yourself.
Or your C++ is just very inefficient in handling the data. Is the code instrumented? Do you know what it's been doing for those two hours?
"the session is shown as 'buffer busy wait'"
Buffer busy waits indicate contention for blocks between sessions. If your application has many sessions running this query then you may have a problem. Buffer busy waits can indicate that there are sessions waiting on a full table scan to complete; but as the query returned results when you ran it in SQL Developer I think we can discount this. Perhaps there are other sessions updating MY_TABLE. How many sessions are reading or writing to it?
Also, what is the output of this query?
SELECT p3, count(*)
FROM v$session_wait
WHERE event='buffer busy wait'
GROUP BY p3
;
I worked with our DBA and he disabled the plan directives at the system level using
alter system set "_optimizer_dsdir_usage_control"=0;
According to him, SQL plan directives had been created because of cardinality mis-estimates after executing the SQL. After disabling them, the timing improved greatly and the problem was solved.

DynamoDB Schema Design

I'm thinking of using Amazon AWS DynamoDB for a project that I'm working on. Here's the gist of the situation:
I'm going to be gathering a ton of energy usage data for hundreds of machines (energy readings are taken around every 5 minutes). Each machine is in a zone, and each zone is in a network.
I'm then going to roll up these individual readings by zone and network, by hour and day.
My thinking is that by doing this, I'll be able to perform one query against DynamoDB on the network_day table, and return the energy usage for any given day quickly.
Here's my schema at this point:
table_name | hash_key | range_key | attributes
______________________________________________________
machine_reading | machine.id | epoch | energy_use
machine_hour | machine.id | epoch_hour | energy_use
machine_day | machine.id | epoch_day | energy_use
zone_hour | machine.id | epoch_hour | energy_use
zone_day | machine.id | epoch_day | energy_use
network_hour | machine.id | epoch_hour | energy_use
network_day | machine.id | epoch_day | energy_use
I'm not immediately seeing that great of performance in tests when I run the rollup cronjob, so I'm just wondering if someone with more experience could comment on my key design? The only experience I have so far is with RDS, but I'm very much trying to learn about DynamoDB.
EDIT:
Basic structure for the cronjob that I'm using for rollups:
foreach network
    foreach zone
        foreach machine
            add_unprocessed_readings_to_dynamo()
            roll_up_fixture_hours_to_dynamo()
            roll_up_fixture_days_to_dynamo()
        end
        roll_up_zone_hours_to_dynamo()
        roll_up_zone_days_to_dynamo()
    end
    roll_up_network_hours_to_dynamo()
    roll_up_network_days_to_dynamo()
end
I use the previous function's values in Dynamo for the next roll up, i.e. I use zone hours to roll up zone days, and then zone days to roll up network days.
This is what (I think) is causing a lot of unnecessary reads/writes. Right now I can manage with low throughputs because my sample size is only 100 readings. My concerns begin when this scales to what is expected to contain around 9,000,000 readings.
First things first, time series data in DynamoDB is hard to do right, but not impossible.
DynamoDB uses the hash key to shard the data, so using machine.id means that some of your keys are going to be hot. However, this is really a function of the amount of data and what you expect your IOPS to be. DynamoDB doesn't create a second shard until you push past 1000 read or write IOPS. If you expect to be well below that level you may be fine, but if you expect to scale beyond that then you may want to redesign; specifically, include a date component in your hash key to break things up.
Regarding performance, are you hitting your provisioned read or write throughput level? If so, raise them to some crazy high level and re-run the test until the bottleneck becomes your code. This could be as simple as setting the throughput level appropriately.
However, regarding your actual code, without seeing the actual DynamoDB queries you are performing, a possible issue would be reading too much data. Make sure you are not reading more data than you need from DynamoDB. Since your range key is a date field, use a range condition (not a filter) to reduce the number of records you need to read, as in the sketch below.
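For example, a sketch (AWS SDK for Java v2) of reading one day of readings with a key condition rather than a filter. The attribute names (machine_id, epoch) and the epoch values are assumptions chosen to match the schema above.

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class ReadOneDayOfReadings {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();

        long dayStart = 1354320000L;            // hypothetical epoch range for one day
        long dayEnd   = dayStart + 86400L;

        // The key condition (not a filter) means DynamoDB only reads the items in the range.
        QueryResponse resp = ddb.query(QueryRequest.builder()
                .tableName("machine_reading")   // table name from the schema in the question
                .keyConditionExpression("#m = :m AND #t BETWEEN :from AND :to")
                .expressionAttributeNames(Map.of("#m", "machine_id", "#t", "epoch"))
                .expressionAttributeValues(Map.of(
                        ":m",    AttributeValue.builder().s("machine-42").build(),   // hypothetical id
                        ":from", AttributeValue.builder().n(Long.toString(dayStart)).build(),
                        ":to",   AttributeValue.builder().n(Long.toString(dayEnd)).build()))
                .build());

        System.out.println("readings fetched: " + resp.count());
    }
}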
Make sure your code executes the rollup using multiple threads. If you are not able to saturate the DynamoDB provisioned capacity, the issue may not be DynamoDB; it may be your code. By performing the rollups using multiple threads in parallel you should be able to see some performance gains.
What's the provisioned throughput on the tables you are using? How are you performing the rollup? Are you reading everything and filtering, or filtering on range keys, etc.?
Do you need a rollup/cron job in this situation?
Why not use a table for the readings
machine_reading | machine.id | epoch_timestamp | energy_use
and a table for the aggregates
hash_key can be aggregate type and range key can be aggregate name
example:
zone, zone1
zone, zone3
day, 03/29/1940
When getting machine data, dump it into the first table and after that use atomic counters to increment the entities in the second table:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.AtomicCounters
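For illustration, here is an atomic-counter increment against such an aggregates table using the AWS SDK for Java v2; the table name and the attribute names are assumptions.

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemRequest;

public class IncrementAggregate {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();

        // Bump the running total for aggregate type "zone", aggregate name "zone1".
        ddb.updateItem(UpdateItemRequest.builder()
                .tableName("aggregates")                             // hypothetical table name
                .key(Map.of(
                        "aggregate_type", AttributeValue.builder().s("zone").build(),
                        "aggregate_name", AttributeValue.builder().s("zone1").build()))
                .updateExpression("ADD #e :use")                     // atomic counter; creates the attribute if absent
                .expressionAttributeNames(Map.of("#e", "energy_use"))
                .expressionAttributeValues(Map.of(":use", AttributeValue.builder().n("42").build()))
                .build());
    }
}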

Change data in table and copying to new table

I would like to make a macro in Excel, but I think it's too complicated to do it with recording... That's why I'm coming here for assistance.
The file:
I have a list of warehouse boxes, each with a specific ID, a location (town), a location (in or out) and a date.
Whenever boxes change location, this needs to be changed in this list and the date should be adjusted accordingly (this should be a manual input, since the changing of the list might not happen on the same day as the movement of the box).
On top of that, I need to count the number of times the location changes from in to out (so that I know how many times the box has been used).
The way of inputting:
A good way of inputting would be to make a list of the boxes whose information you want to change, e.g.:
ID | Location (town) | Location (in/out) | Date
------------------------------------------------
123-4 | Paris | OUT | 9-1-14
124-8 | London | IN | 9-1-14
999-84| London | IN | 10-1-14
124-8 | New York | OUT | 9-1-14
Then I'd make a button that uses the macro to change the data mentioned above in the master list (where all the boxes are listed) and in some way count the number of times OUT changes to IN, etc.
Is this possible?
I'm not entirely sure what you want updated in your main list, but I don't think you need macros at all to achieve this. You can count the number of times a box's location has changed by simply making a list of all your boxes in one column and the count in the next column. For the count, use the COUNTIFS formula to count all the rows where the box ID matches and the location is in/out. Check VLOOKUP for updating your main list values.
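For example, assuming the change log shown in the question lives on a sheet named Changes with the IDs in column A and the in/out status in column C (both assumptions), the usage count for the box ID in cell A2 of the master list could be something like:
=COUNTIFS(Changes!A:A, A2, Changes!C:C, "OUT")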

How can I code this problem? (C++)

I am writing a simple game which stores datasets in a 2D grid (like a chess board). Each cell in the grid may contain a single integer (0 means the cell is empty). If the cell contains a number > 0, it is said to be "filled". The set of "filled" cells on the grid is known as a "configuration".
My problem is being able to "recognize" a particular configuration, regardless of where the configuration of cells is in the MxN grid.
The problem (in my mind), breaks down into the following 2 sub problems:
Somehow "normalising" the position of a configuration (e.g. "rebasing" its position to (0,0), such that the smallest rectangle containing all cells in the configuration has its top-left vertex at (0,0) in the MxN grid)
Computing some kind of similarity metric (or maybe simply set difference?), to determine if the current "normalised" configuration is one of the known configurations (i.e. "recognized")
I suspect that I will need to use std::set (one of the few STL containers I haven't used so far, in all my years as a C++ coder!). I would appreciate any ideas/tips from anyone who has solved such a problem before. Any code snippets, pseudocode and/or links would be very useful indeed.
Similarity metrics are a massive area of academic research. You need to decide what level of sophistication is required for your task. It may be that you can simply drag a "template pattern" across your board, raster style, and for each location score +1 for a hit and -1 for a miss and sum the score. Then find the location where the template got the highest score. This score_max is then your similarity metric for that template. If this method is inadequate then you may need to go in to more detail about the precise nature of the game.
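For what it's worth, here is a minimal sketch of that raster-style scoring, written in Java to keep it short (the loops translate directly to C++). It reads the +1/-1 rule as scoring agreement of filled/empty at every overlapping cell, which is one reasonable interpretation; the sample grids are illustrative only.

public class TemplateMatch {

    // Slide the template over the board; at each offset score +1 when the
    // filled/empty state agrees and -1 when it does not, keeping the best score.
    static int bestScore(int[][] board, int[][] template) {
        int best = Integer.MIN_VALUE;
        for (int r = 0; r + template.length <= board.length; r++) {
            for (int c = 0; c + template[0].length <= board[0].length; c++) {
                int score = 0;
                for (int tr = 0; tr < template.length; tr++) {
                    for (int tc = 0; tc < template[0].length; tc++) {
                        boolean boardFilled = board[r + tr][c + tc] > 0;
                        boolean templFilled = template[tr][tc] > 0;
                        score += (boardFilled == templFilled) ? 1 : -1;
                    }
                }
                best = Math.max(best, score);
            }
        }
        return best;  // similarity metric for this template
    }

    public static void main(String[] args) {
        int[][] board = {
            {0, 0, 0, 0},
            {0, 1, 1, 0},
            {0, 0, 1, 0},
            {0, 0, 0, 0}
        };
        int[][] template = {
            {1, 1},
            {0, 1}
        };
        System.out.println(bestScore(board, template)); // 4: a perfect 2x2 match
    }
}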
Maybe you could use some hash function to identify configurations. Since you need to recognize patterns even if they are not at the same position on the board, this hash should not depend on the position of the cells but on the way they are organized.
If you store your 2D grid in a 1D array, you would need to find the first filled cell and start calculating the hash from here, until the last filled cell.
Ex:
-----------------
| | | | |
-----------------
| | X | X | |
-----------------
| | | X | |
-----------------
| | | | |
-----------------
----------------+---------------+---------------+----------------
| | | | | | X | X | | | | X | | | | | |
----------------+---------------+---------------+----------------
|_______________________|
|
Compute hash based on this part of the array
However, there are cases where this won't work, e.g. when the pattern is shifted across lines:
-----------------
| | | | X |
-----------------
| X | | | |
----------------- Different configuration in 2D.
| X | | | |
-----------------
| | | | |
-----------------
----------------+---------------+---------------+----------------
| | | | X | X | | | | X | | | | | | | |
----------------+---------------+---------------+----------------
|_______________________|
Seems similar in 1D
So you'll need some way of dealing with these cases. I don't have any solution yet, but I'll try to find something if my schedule allows it!
After thinking a bit about it, maybe you could use two different representations for the grid: one where the rows are appended into a 1D array, and another where the columns are appended into a 1D array. The hash would be calculated from both representations, and that would (I think) resolve the problem mentioned above.
This may be overkill for a small application, but OpenCV has some excellent image recognition and blob finding routines. If you treat the 2D board as an image, and the integer as brightness, it should be possible to use functions from that library.
And the link:
http://opencv.willowgarage.com/documentation/index.html
You can use a neural network for that job.
If you look for "neural network shape recognition" I think you can find something useful. There are tons of libraries that may help you, but if you have no experience with NNs this could be a little hard. Still, I think it is the easiest way.
Sounds like you want to feed your chessboard to a neural network trained to recognize the configuration.
This is very similar to the classic examples of image classification, with the only complication being that you don't know exactly where your configuration thingy will appear in the grid, unless you're always considering the whole grid - in that case a classic two-layer network should work.
HTM (Hierarchical Temporal Memory) neural network implementations solve the offset problem out of the box. You can find plenty of ready-to-use implementations starting from here. Obviously you'll have to tweak the heck out of the implementations you'll find, but to my understanding you should be able to achieve exactly what you're asking for.
If you want to further investigate this route the Numenta forum will be a good starting point.
This reminds me of HashLife which uses QuadTrees. Check out the wikipedia pages on HashLife and Quadtrees.
There's also a link at the bottom of the Wikipedia page to a DrDobbs article about actually implementing such an algorithm: http://www.ddj.com/hpc-high-performance-computing/184406478
I hope those links help. They're interesting reads if nothing else.
As to the first part of the question, i.e. rebasing, try this:
Make a structure with 2 integers and declare a pointer of that struct type. Read (or compute) the number of live cells and allocate that much storage (using routines like calloc). Store the coordinates of each live cell in those structures. Compute the minimum x coordinate and the minimum y coordinate. Then, in the grid, move each cell from [x][y] to [x-minx][y-miny]. Though expensive when reading from an already filled grid, it works and helps with the subsequent part of the question.
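To make that concrete, here is a minimal sketch of the normalisation plus a set comparison. It is written in Java for brevity, but the same logic maps directly onto a std::set (or std::unordered_set) of row/column pairs in C++; the names and the sample grids are illustrative only.

import java.util.HashSet;
import java.util.Set;

public class GridNormaliser {

    // A filled cell, identified by its row/column on the board.
    record Cell(int row, int col) {}

    // Collect the filled cells and translate them so the smallest filled
    // row and column both become 0 (the "rebasing" step described above).
    static Set<Cell> normalise(int[][] grid) {
        int minRow = Integer.MAX_VALUE, minCol = Integer.MAX_VALUE;
        for (int r = 0; r < grid.length; r++) {
            for (int c = 0; c < grid[r].length; c++) {
                if (grid[r][c] > 0) {
                    minRow = Math.min(minRow, r);
                    minCol = Math.min(minCol, c);
                }
            }
        }
        Set<Cell> cells = new HashSet<>();
        for (int r = 0; r < grid.length; r++) {
            for (int c = 0; c < grid[r].length; c++) {
                if (grid[r][c] > 0) {
                    cells.add(new Cell(r - minRow, c - minCol));
                }
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        int[][] a = {{0, 0, 0}, {0, 1, 1}, {0, 0, 1}};
        int[][] b = {{1, 1, 0}, {0, 1, 0}, {0, 0, 0}};

        // Two boards hold the same configuration iff their normalised cell sets are equal.
        System.out.println(normalise(a).equals(normalise(b))); // true
    }
}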