How to deal with an atomicity situation - C++

Hi, imagine I have code like this:
0. void someFunction()
1. {
2.     ...
3.     if (x > 5)
4.         doSmth();
5.
6.     writeDataToCard(handle, data1);
7.
8.     writeDataToCard(handle, data2);
9.
10.     incrementDataOnCard(handle, data);
11. }
The problem is this: if steps 6 and 8 are executed and then someone, say, removes the card, operation 10 will not complete. That would be a bug in my system: if 6 and 8 are executed, then 10 MUST also be executed. How do I deal with such situations?
Quick summary: after step 8 someone may remove the physical card, which means step 10 will never be reached, and that causes a problem in my system, namely a card initialized with incomplete data.

You will have to create some kind of protocol, for instance writing to the card a list of operations to complete:
Step6, Step8, Step10
and as you complete the tasks you remove the corresponding entry from the list.
When you next read the data from the card, you check whether any entries remain in the list. If any do, the previous operation did not complete successfully and you restore a previous state.
Unless you can somehow physically prevent the user from removing the card, there is no other way.
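As a rough illustration of that idea, here is a minimal C++ sketch. The pending-list helpers (writePendingList, clearPendingEntry, readPendingList, restorePreviousState) are placeholders I invented for whatever your card library actually provides, not a real API:

// Sketch only: all card-API calls below are assumed/illustrative.
#include <vector>

enum class Step { WriteData1, WriteData2, IncrementData };

struct Data;                                                     // your payload type
void writeDataToCard(int handle, const Data& d);                 // from the question
void incrementDataOnCard(int handle, const Data& d);             // from the question
void writePendingList(int handle, const std::vector<Step>& s);   // assumed card API
void clearPendingEntry(int handle, Step s);                      // assumed card API
std::vector<Step> readPendingList(int handle);                   // assumed card API
void restorePreviousState(int handle);                           // assumed card API

void writeWithJournal(int handle, const Data& d1, const Data& d2, const Data& d)
{
    // Record the whole list of pending operations on the card first.
    writePendingList(handle, { Step::WriteData1, Step::WriteData2, Step::IncrementData });

    writeDataToCard(handle, d1);
    clearPendingEntry(handle, Step::WriteData1);

    writeDataToCard(handle, d2);
    clearPendingEntry(handle, Step::WriteData2);

    incrementDataOnCard(handle, d);
    clearPendingEntry(handle, Step::IncrementData);
}

// Run on every card insertion / program start.
void recoverIfNeeded(int handle)
{
    if (!readPendingList(handle).empty())
        restorePreviousState(handle);   // interrupted last time: roll back (or complete)
}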

If the transaction is interrupted then the card is in a fault state. You have three options:
Do nothing. The card is in a fault state, and it will remain there. Advise users not to play with the card. The card may then be eligible for a complete wipe or format.
Roll back the transaction the next time the card becomes available. You need enough information on the card and/or some central repository to perform the rollback.
Complete the transaction the next time the card becomes available. You need enough information on the card and/or some central repository to perform the completion.
In all three cases you need to have a flag on the card denoting a transaction in progress.
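That flag can be as small as one reserved byte that is set before the first write and cleared only after the last step succeeds. A minimal sketch, assuming hypothetical readByte/writeByte card primitives and an arbitrary flag address:

// Sketch: TXN_FLAG_ADDR, readByte and writeByte stand in for your card API.
const unsigned TXN_FLAG_ADDR = 0x00;
unsigned char readByte(int handle, unsigned addr);               // assumed primitive
void writeByte(int handle, unsigned addr, unsigned char value);  // assumed primitive

void doTransaction(int handle)
{
    writeByte(handle, TXN_FLAG_ADDR, 1);   // mark "transaction in progress"
    // ... steps 6, 8 and 10 from the question go here ...
    writeByte(handle, TXN_FLAG_ADDR, 0);   // cleared only if everything finished
}

bool cardNeedsRecovery(int handle)
{
    // If this returns true, pick one of the three options above:
    // do nothing, roll back, or complete the transaction.
    return readByte(handle, TXN_FLAG_ADDR) != 0;
}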

More details are required in order to answer this.
However, making some assumption, I will suggest two possible solutions (more are possible...).
I assume the write operations are persistent - hence data written to the card is still there after the card is removed and reinserted - and that you are referring to the coherency of the data on the card, not the state of the program performing the function calls.
Also assumed is that the increment method increments the data already written, and the system must have this operation done in order to guarantee consistency:
For each record written, maintain another data element (on the card) that indicates the record's state. This state will be initialized to something (say "WRITING" state) before performing the writeData operation. This state is then set to "WRITTEN" after the incrementData operation is (successfully!) performed.
When reading from the card, you first check this state and ignore (or delete) the record if it's not WRITTEN.
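A minimal sketch of this first option, with invented writeRecordState/readRecordState helpers standing in for whatever your card API offers, alongside the writeDataToCard/incrementDataOnCard calls from the question:

// Sketch of option 1: a per-record state stored on the card.
// writeRecordState / readRecordState are illustrative, not a real API.
enum class RecordState : unsigned char { Writing = 0, Written = 1 };

struct Data;                                                        // your payload type
void writeDataToCard(int handle, const Data& d);                    // from the question
void incrementDataOnCard(int handle, const Data& d);                // from the question
void writeRecordState(int handle, int record, RecordState s);       // assumed card API
RecordState readRecordState(int handle, int record);                // assumed card API

void writeRecord(int handle, int record, const Data& d1, const Data& d2, const Data& d)
{
    writeRecordState(handle, record, RecordState::Writing);
    writeDataToCard(handle, d1);
    writeDataToCard(handle, d2);
    incrementDataOnCard(handle, d);
    writeRecordState(handle, record, RecordState::Written);  // only after everything succeeded
}

bool recordIsValid(int handle, int record)
{
    return readRecordState(handle, record) == RecordState::Written;
}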
Another option will be to maintain two (persistent) counters on the card: one counting the number of records that began writing, the other counts the number of records that ended writing.
You increment the first before performing the write, and then increment the second after (successfully) performing the incrementData call.
When later reading from the card, you can easily check whether a record is indeed valid or needs to be discarded.
This option is valid if the written records are somehow ordered or indexed, so you can see which and how many records are valid just by checking the counter. It has the advantage of requiring only two counters for any number of records (compared to 1 state for EACH record in option 1.)
On the host (software) side you then need to check that the card is available prior to beginning the write (don't write if it's not there). If after the incrementData op you detect that the card was removed, you need to be sure to tidy things up (remove unfinished records, update the counters) either once you detect that the card is reinserted, or before doing another write. For this you'll need to maintain state information on the software side.
Again, the type of solution (out of many more) depends on the exact system and requirements.

Isn't that just:
Copy data to temporary_data.
Write to temporary_data.
Increment temporary_data.
Rename data to old_data.
Rename temporary_data to data.
Delete the old_data.
You will still have a race condition (if a lucky user removes the card) at the two rename steps, but you might restore the data or temporary_data.
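If the card is mounted as a filesystem, that sequence maps fairly directly onto std::filesystem calls. A sketch, assuming C++17 and placeholder file names ("data", "temporary_data", "old_data"):

// Sketch of the copy / write / rename sequence using std::filesystem (C++17).
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

void updateAtomically(const fs::path& dir)
{
    fs::copy_file(dir / "data", dir / "temporary_data",
                  fs::copy_options::overwrite_existing);      // copy data -> temporary_data

    {
        std::ofstream f(dir / "temporary_data", std::ios::binary | std::ios::app);
        // ... write to / increment temporary_data here ...
    }

    fs::rename(dir / "data", dir / "old_data");               // race window starts here
    fs::rename(dir / "temporary_data", dir / "data");         // race window ends here
    fs::remove(dir / "old_data");
}

// On startup, if "old_data" or a stray "temporary_data" exists, the previous
// run was interrupted and you can restore from whichever copy is complete.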

You haven't said what you're incrementing (or why), or how your data is structured (presumably there is some relationship between whatever you're writing with writeDataToCard and whatever you're incrementing).
So, while there may be clever techniques specific to your data, we don't have enough to go on. Here are the obvious general-purpose techniques instead:
the simplest thing that could possibly work - full-card commit-or-rollback
Keep two copies of all the data, the good one and the dirty one. A single byte at the lowest address is sufficient to say which is the current good one (it's essentially an index into an array of size 2).
Write your new data into the dirty area, and when that's done, update the index byte (so swapping clean & dirty).
Either the index is updated and your new data is all good, or the card is pulled out and the previous clean copy is still active.
Pro - it's very simple
Con - you're wasting exactly half your storage space, and you need to write a complete new copy to the dirty area when you change anything. You haven't given enough information to decide whether this is a problem for you.
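A sketch of that layout, assuming invented readByte/writeByte/readBytes/writeBytes card primitives and that a single-byte write on the card is itself atomic:

// Sketch of full-card commit-or-rollback: two copies plus one index byte.
const unsigned INDEX_ADDR   = 0;                      // which copy is the good one (0 or 1)
const unsigned COPY_SIZE    = 4096;                   // illustrative size of one copy
const unsigned COPY_ADDR[2] = { 1, 1 + COPY_SIZE };

unsigned char readByte(int handle, unsigned addr);                        // assumed
void writeByte(int handle, unsigned addr, unsigned char value);           // assumed
void readBytes(int handle, unsigned addr, void* dst, unsigned n);         // assumed
void writeBytes(int handle, unsigned addr, const void* src, unsigned n);  // assumed

void commitNewData(int handle, const void* newData)
{
    unsigned char good  = readByte(handle, INDEX_ADDR) & 1;
    unsigned char dirty = good ^ 1;

    writeBytes(handle, COPY_ADDR[dirty], newData, COPY_SIZE);  // fill the dirty copy first
    writeByte(handle, INDEX_ADDR, dirty);                      // single-byte commit point
}

void readCurrentData(int handle, void* dst)
{
    unsigned char good = readByte(handle, INDEX_ADDR) & 1;
    readBytes(handle, COPY_ADDR[good], dst, COPY_SIZE);
}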
... now using less space ... - commit-or-rollback smaller subsets
If you can't waste 50% of your storage, split your data into independent chunks and version each of those independently. Now you only need enough space to duplicate your largest single chunk, but instead of a simple index you need an offset or pointer for each chunk.
Pro - still fairly simple
Con - you can't handle dependencies between chunks, they have to be isolated
journalling
As per RedX's answer, this is used by a lot of filesystems to maintain integrity.
Pro - it's a solid technique, and you can find documentation and reference implementations for existing filesystems
Con - you just wrote a modern filesystem. Is this really what you wanted?

Related

Efficient lookup of a buffer with stack of data modifications applied

I am trying to write a C++11 library as part of a wider project that implements a stack of changes (modification, insertion and deletion) implemented on top of an original buffer. Then, the aim is to be able to quickly look "through" the changes and get the modified data out.
My current approach is:
Maintain an ordered list of changes, ordered by offset of the start of the change
Also maintain a stack of the same changes, so they can be rolled back in order
New changes are pushed onto the stack and inserted into the list at the right place
The changes-by-offset list may be modified if the change interacts with others
For example, a modification of bytes 5-10 invalidates the start of an earlier modification from 8-12
Also, insertion or deletion changes will change the apparent offset of data occurring after them (deleting bytes 5-10 means that what used to be byte 20 is now found at 15)
To find the modified data, you can look through the list for the change that applies (and the offset within that change that applies - another change might have invalidated some of it), or find the right offset in the original data if no change touched that offset
The aim here is to make the lookup fast - adding a change might take some effort to mess with the list, but later lookups, which will greatly outnumber the modifications, should be pretty straightforward in an ordered list.
Also you don't need to continuously copy data - each change's data is kept with it, and the original data is untouched
Undo is then implemented by popping the last change off the stack and rolling back any changes made to it by this change's addition.
This seems to be quite a difficult task - there are a lot of things to take care of and I am quickly piling up complex code!
I feel sure that this must be problem that has been dealt with in other software, but looking around various hex editors and so on hasn't pointed me to a useful implementation. Is there a name for this problem ("data undo stack" and friends hasn't got me very far!), or a library that can be used, even as a reference, for this kind of thing?
I believe the most common approach (one I have used successfully in the past) is to simply store the original state and then put each change operation (what's being done + arguments) on the undo stack. Then, to get to a particular prior state you start from the original and apply all changes except the ones you want undone.
This is a lot easier to implement than trying to identify what parts of the data changed, and it works well unless the operations themselves are very time-consuming (and therefore slow to "replay" onto the original state).
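A minimal sketch of that replay approach, under the assumption that each edit can be captured as a callable applied to the whole buffer (the Operation alias and loadFile below are illustrative, not part of the question's code):

// Sketch of "keep the original + replay recorded operations" undo.
#include <cstdint>
#include <functional>
#include <vector>

using Buffer    = std::vector<std::uint8_t>;
using Operation = std::function<void(Buffer&)>;  // one recorded edit

class UndoStack {
public:
    explicit UndoStack(Buffer original) : original_(std::move(original)) {}

    void apply(Operation op) { ops_.push_back(std::move(op)); }
    void undo()              { if (!ops_.empty()) ops_.pop_back(); }

    // Rebuild the current state by replaying every remaining operation.
    Buffer current() const {
        Buffer b = original_;
        for (const auto& op : ops_) op(b);
        return b;
    }

private:
    Buffer original_;
    std::vector<Operation> ops_;
};

// Usage sketch (loadFile is hypothetical):
//   UndoStack s(loadFile());
//   s.apply([](Buffer& b) { b[5] = 0xFF; });   // record a modification
//   s.undo();                                  // drop the last operation
//   Buffer now = s.current();                  // replay what's left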
I would look at persistent data structures, such as https://en.wikipedia.org/wiki/Persistent_data_structure and http://www.toves.org/books/persist/#s2 - or websearch on terms from these. I think you could do this with a persistent tree whose leaves carry short strings.

Is there any workaround to "reserve" a cache fraction?

Assume I have to write a C or C++ computational intensive function that has 2 arrays as input and one array as output. If the computation uses the 2 input arrays more often than it updates the output array, I'll end up in a situation where the output array seldom gets cached because it's evicted in order to fetch the 2 input arrays.
I want to reserve one fraction of the cache for the output array and enforce somehow that those lines don't get evicted once they are fetched, in order to always write partial results in the cache.
Update1(output[]) // Output gets cached
DoCompute1(input1[]); // Input 1 gets cached
DoCompute2(input2[]); // Input 2 gets cached
Update2(output[]); // Output is not in the cache anymore and has to get cached again
...
I know there are mechanisms to help eviction: clflush, clevict, _mm_clevict, etc. Are there any mechanisms for the opposite?
I am thinking of 3 possible solutions:
Using _mm_prefetch from time to time to fetch the data back if it has been evicted. However this might generate unnecessary traffic, plus I need to be very careful about when to introduce it;
Trying to do processing on smaller chunks of data. However this would work only if the problem allows it;
Disabling hardware prefetchers where that's possible to reduce the rate of unwanted evictions.
Other than that, is there any elegant solution?
Intel CPUs have something called No Eviction Mode (NEM) but I doubt this is what you need.
While you are attempting to optimise the second (unnecessary) fetch of output[], have you given thought to using SSE2/3/4 registers to store your intermediate output values, updating them when necessary, and writing them back only when all updates related to that part of output[] are done?
I have done something similar while computing FFTs (Fast Fourier Transforms) where part of the output is in registers and they are moved out (to memory) only when it is known they will not be accessed anymore. Until then, all updates happen to the registers. You'll need to introduce inline assembly to effectively use SSE* registers. Of course, such optimisations are highly dependent on the nature of the algorithm and data placement.
I am trying to get a better understanding of the question:
If it is true that the 'output' array is strictly for output, and you never do something like
output[i] = Foo(newVal, output[i]);
then all elements in output[] are write-only. If so, all you would ever need to 'reserve' is one cache line. Isn't that correct?
In this scenario, all writes to 'output' generate cache fills and could compete with the cache lines needed for the 'input' arrays.
Wouldn't you want a cap on the cache lines 'output' can consume, as opposed to reserving a certain number of lines?
I see two options, which may or may not work depending on the CPU you are targeting, and on your precise program flow:
If output is only written to and not read, you can use streaming-stores, i.e., a write instruction with a no-read hint, so it will not be fetched into cache.
You can use prefetching with a non-temporal (NTA) hint for input. I don't know how this is implemented in general, but I know for sure that on some Intel CPUs (e.g., the Xeon Phi) each hardware thread uses a specific way of the cache for NTA data, i.e., with an 8-way cache 1/8th per thread.
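For reference, both hints are exposed as x86 intrinsics. A rough sketch, assuming SSE, a 16-byte-aligned output array and n being a multiple of 4; whether it actually helps depends entirely on the target CPU:

// Sketch: non-temporal (streaming) stores for output[] plus NTA prefetch for
// the inputs. The addition is just a stand-in for the real computation.
#include <cstddef>
#include <immintrin.h>

void compute(const float* input1, const float* input2, float* output, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 4) {
        _mm_prefetch(reinterpret_cast<const char*>(input1 + i + 16), _MM_HINT_NTA);
        _mm_prefetch(reinterpret_cast<const char*>(input2 + i + 16), _MM_HINT_NTA);

        __m128 a = _mm_loadu_ps(input1 + i);
        __m128 b = _mm_loadu_ps(input2 + i);
        __m128 r = _mm_add_ps(a, b);            // placeholder computation

        _mm_stream_ps(output + i, r);           // streaming store, bypasses the cache
    }
    _mm_sfence();                               // make the streaming stores globally visible
}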
I guess the solution to this is hidden in the algorithm employed, the L1 cache size and the cache line size.
Though I am not sure how much performance improvement we would see from this.
We could probably introduce artificial reads which cleverly dodge the compiler and, at run time, do not hurt the computation either. A single artificial read should fill as many cache lines as are needed to accommodate one page. Therefore, the algorithm should be modified to compute blocks of the output array, something like the blocking used when multiplying huge matrices on GPUs: they use blocks of the matrices for computing and writing the result.
As pointed out earlier, the writes to the output array should happen as a stream.
To bring in the artificial read, we should initialize the output array at the right places at compile time, once in each block, probably with 0 or 1.

Sending list of connected users to newly connected user in multithreaded iocp server

I need some advice how to send properly doubly linked list of connected users. Some basic information about my code and my approach so far:
I keep information about all connected users in a doubly linked list, which is shared among threads. I store the head of the list in a global variable: PPER_user g_usersList, and the struct for users looks like:
typedef struct _user {
    char id;
    char status;
    struct _user *pCtxtBack;
    struct _user *pCtxtForward;
} user, *PPER_user;
When a new user connects to the server, data about the connected users is gathered from the linked list and sent to them:
WSABUF wsabuf; PPER_user pTemp1, pTemp2; unsigned int c = 0;
.....
EnterCriticalSection(&g_CSuserslist);
pTemp1 = g_usersList;
while (pTemp1) {
    pTemp2 = pTemp1->pCtxtBack;
    wsabuf.buf[c++] = pTemp1->id;     // fill buffer with data about all users
    wsabuf.buf[c++] = pTemp1->status; //
    pTemp1 = pTemp2;
}
WSASend(..., wsabuf, ...);
LeaveCriticalSection(&g_CSuserslist);
But a few things about code above makes me confused:
the linked list is under rather heavy use by other threads. The more users are connected (for example 100 or 1000), the longer the list stays locked for the entire duration of gathering the data. Should I just accept that or find some better way to do this?
it seems that while one thread locks the list and the while loop walks through all the chained structs (users) gathering every id and status, other threads must use the same critical section (&g_CSuserslist) when users want to change their own id, status etc. But this will likely kill performance. Maybe I should change the whole design of my app or something?
Any insight you may have would be appreciated. Thanks in advance.
The only problem I see in your code (and more generally in the description of your app) is the size of the critical section that protects g_usersList. The rule is: avoid any time-consuming operation while in a critical section.
So you must protect:
adding a new user
removing a user at disconnection
taking a snapshot of the list for further processing
All those operations are memory-only, so unless you are under really heavy load, all should be fine provided you put all I/O outside of critical sections (1), because this only happens when users are connecting/disconnecting. If you put the WSASend outside of the critical section, all should go fine and IMHO that is enough.
Edit per comment :
Your struct user is reasonably small, I would say between 10 and 18 useful bytes (depending on pointer size, 4 or 8 bytes), and a total of 12 or 24 bytes including padding. With 1000 connected users you only have to copy less than 24k bytes of memory, and only have to test whether the next user is null (or at most keep the current number of connected users to have a simpler loop). Anyway, maintaining such a buffer should also be done in a critical section. IMHO until you have far more than 1000 users (between 10k and 100k, but then you could get other problems...) a simple global lock (like your critical section) around the whole doubly linked list of users should be enough. But all that needs to be measured, because it may depend on external things like hardware...
Longer discussion:
As you describe your application, you only gather the list of connected users when a new user connects, so you have exactly one full read per two writes (one at connection and one at disconnection): IMHO it is no use trying to implement shared locks for reading and exclusive ones for writing. If you did many reads between a connection and a disconnection, it would be a different matter, and you should try to allow concurrent reads.
If you really find the contention too heavy, because you have a very large number of connected users and very frequent connections/disconnections, you could try to implement row-level-style locking. Instead of locking the whole list, only lock what you are processing: the head and first node for an insertion, the current record plus previous and next for a deletion, and the current and next while reading. But it will be hard to write and test, and much more time consuming, because you will have to do many lock/release operations while reading the list, and you will have to be very cautious to avoid deadlock conditions. So my advice is: don't do that unless it is really required.
(1) in the code you show, the WSASend(...,wsabuf,...); is inside the critical section when it should be outside. Write instead :
...
LeaveCriticalSection(&g_CSuserslist);
WSASend(...,wsabuf,...);
The first performance problem is the linked list itself: It takes quite a bit longer to traverse a linked list than to traverse an array/std::vector<>. A single linked list has the advantage of allowing thread safe insertion/deletion of elements via atomic types/compare-and-swap operations. A double linked list is much harder to maintain in a thread safe fashion without resorting to mutexes (which are always the big, heavy guns).
So, if you go with mutex to lock the list, use std::vector<>, but you can also solve your problem with a lock-free implementation of a single linked list:
You have a single linked list with one head that is a global, atomic variable.
All entries are immutable once they are published.
When you add a user, take the current head and store it in a thread-local variable (atomic read). Since the entries won't change, you have all the time in the world to traverse this list, even if other threads add more users while you are traversing it.
To add the new user, create a new list head containing it, then use a compare-and-swap operation to replace the old list head pointer with the new one. If that fails, retry.
To remove a user, traverse the list until you find the user in the list. While you walk the list, copy its contents to newly allocated nodes in a new linked list. Once you find the user to delete, set the next pointer of the last user on the new list to the deleted user's next pointer. Now the new list contains all users of the old one except the removed user. So you can now publish that list by another compare-and-swap on the list head. Unfortunately, you'll have to redo the work should the publishing operation fail.
Do not set the next pointer of the deleted object to NULL, another thread might still need it to find the rest of the list (in its view the object won't have been removed yet).
Do not delete the old list head right away, another thread might still be using it. The best thing to do is to enqueue its nodes in another list for cleanup. This cleanup list should be replaced from time to time with a new one, and the old one should be cleaned up after all threads have given their OK (you can implement this by passing around a token; when it comes back to the originating process, you can safely destroy the old objects).
Since the list head pointer is the only globally visible variable that can ever change, and since that variable is atomic, such an implementation guarantees a total ordering of all add/remove operations.
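A sketch of the add-and-traverse part of that scheme with C++11 std::atomic; removal and the deferred-cleanup list described above are deliberately left out, so this is an illustration rather than a complete implementation:

// Sketch of the lock-free "immutable nodes + atomic head" idea.
#include <atomic>

struct User {
    char id;
    char status;
    const User* next;   // immutable once the node is published
};

std::atomic<const User*> g_head{nullptr};

void addUser(char id, char status)
{
    User* u = new User{id, status, g_head.load()};
    // Retry until we successfully swing g_head from u->next to u; on failure
    // compare_exchange_weak reloads the current head into u->next for us.
    while (!g_head.compare_exchange_weak(u->next, u)) {
    }
}

void snapshotUsers(char* buf)   // buf assumed large enough; traversal never blocks writers
{
    unsigned c = 0;
    for (const User* p = g_head.load(); p != nullptr; p = p->next) {
        buf[c++] = p->id;
        buf[c++] = p->status;
    }
}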
The "correct" answer is probably to send less data to your users. Do they really NEED to know the id and status of every other user or do they only need to know aggregate information which can be kept up-to-date dynamically.
If your app must send this information (or such changes are considered too much work), then you could cut down your processing significantly by only making this calculation, say, once per second (or even per minute). Then, when someone logged on, they would receive a copy of this information that is, at most, 1 second old.
The real question here is just how urgent it is to send every byte of that list to the new user?
How well does the client side track this list data?
If the client can handle partial updates, wouldn't it make more sense to 'trickle' the data to each user - perhaps using a timestamp to indicate freshness of the data and not have to lock the list in such a massive fashion?
You could also switch to a rwsem style lock where list access is only exclusive if the user intends to modify the list.

Atomic operations on critical section data

Some part of shared memory, modified in a critical section, consists of a considerable amount of data, but only a small portion of it is changed in a single pass (e.g. a free memory pages bitmap).
How do I make sure that when the program is interrupted/killed the data remains in a consistent state? Any suggestions other than having two copies
(like the copy & swap in the example below, or having some kind of rollback segment)?
struct some_data {
    int a;
    int t[100000]; // large amount of data in total, but only a few bytes changed in a single pass (e.g. free entries bitmap/tree)
};
short int active = 0;
some_data section_data[2];
//---------------------------------------------------
// semaphore down
int inactive = (active + 1) % 2;
section_data[inactive] = section_data[active];
// now, make changes to the section data (section_data[inactive])
active = inactive;
// semaphore up
You are looking for transactional consistency: a transaction occurs in whole, or not at all.
A common pattern is a journal, where you store the changes you intend to make before you apply them. Anyone accessing the shared memory and detecting the crashed process (such as noticing that they somehow acquired the semaphore with a partially present journal) takes responsibility for replaying the journal before continuing.
You still have one race condition: the actual writing of a bit signalling to all processes that there is, in fact, a journal to consume. However, that is a small enough piece of information that you can send it through whatever channel you please, such as another semaphore or clever use of fences.
It's best if the journal is sufficiently independent of the state of the memory such that the repairing process can just start at the start of the journal and replay the whole thing. If you have to identify which entry in the journal is "next," then you need a whole lot more synchronization.
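A sketch of such a journal living in the shared segment next to the some_data structure from the question. The layout is illustrative, and in a real implementation the valid flag would need proper atomics/fences so it cannot be observed before the entries it protects:

// Sketch of a journal kept alongside some_data in shared memory.
struct JournalEntry {
    int offset;      // which element of some_data::t is being changed
    int newValue;    // value we intend to write
};

struct Journal {
    int          valid;          // 0 = empty, 1 = journal complete and pending
    int          count;          // number of entries
    JournalEntry entries[16];    // plenty for "a few bytes changed per pass"
};

void applyChanges(some_data& d, Journal& j, const JournalEntry* changes, int n)
{
    // 1. Record the intended changes, then mark the journal valid (commit point).
    for (int i = 0; i < n; ++i) j.entries[i] = changes[i];
    j.count = n;
    j.valid = 1;                 // after this point, replaying is always safe

    // 2. Apply them to the real data.
    for (int i = 0; i < n; ++i) d.t[changes[i].offset] = changes[i].newValue;

    // 3. Retire the journal.
    j.valid = 0;
}

void replayIfCrashed(some_data& d, Journal& j)   // run by whoever detects the crash
{
    if (j.valid) {
        for (int i = 0; i < j.count; ++i)
            d.t[j.entries[i].offset] = j.entries[i].newValue;
        j.valid = 0;
    }
}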

Optimal strategy to make a C++ hash table, thread safe

(I am interested in the design of an implementation, NOT a ready-made construct that will do it all.)
Suppose we have a class HashTable (not a hash map implemented as a tree, but a hash table),
and say there are eight threads.
Suppose the read-to-write ratio is about 100:1, or even better, 1000:1.
Case A) Only one thread is a writer and the others, including the writer, can read from the HashTable (they may simply iterate over the entire hash table).
Case B) All threads are identical and all can read/write.
Can someone suggest the best strategy to make the class thread safe with the following considerations:
1. Top priority: least lock contention
2. Second priority: least number of locks
My understanding so far is this:
One BIG reader-writer lock (semaphore).
Specialize the semaphore so that there can be eight writer-resource instances for case B, where each writer resource locks one row (or range, for that matter).
(So I guess 1 + 8 mutexes.)
Please let me know if I am thinking along the correct lines, and how we could improve on this solution.
With such high read/write ratios, you should consider a lock free solution, e.g. nbds.
EDIT:
In general, lock free algorithms work as follows:
arrange your data structures such that for each function you intend to support there is a point at which you are able to, in one atomic operation, determine whether its results are valid (i.e. other threads have not mutated its inputs since they have been read) and commit to them; with no changes to state visible to other threads unless you commit. This will involve leveraging platform-specific functions such as Win32's atomic compare-and-swap or Cell's cache line reservation opcodes.
each supported function becomes a loop that repeatedly reads the inputs and attempts to perform the work, until the commit succeeds.
In cases of very low contention, this is a performance win over locking algorithms since functions mostly succeed the first time through without incurring the overhead of acquiring a lock. As contention increases, the gains become more dubious.
Typically the amount of data it is possible to atomically manipulate is small - 32 or 64 bits is common - so for functions involving many reads and writes, the resulting algorithms become complex and potentially very difficult to reason about. For this reason, it is preferable to look for and adopt a mature, well-tested and well-understood third party lock free solution for your problem in preference to rolling your own.
Hashtable implementation details will depend on various aspects of the hash and table design. Do we expect to be able to grow the table? If so, we need a way to copy bulk data from the old table into the new safely. Do we expect hash collisions? If so, we need some way of walking colliding data. How do we make sure another thread doesn't delete a key/value pair between a lookup returning it and the caller making use of it? Some form of reference counting, perhaps? - but who owns the reference? - or simply copying the value on lookup? - but what if values are large?
Lock-free stacks are well understood and relatively straightforward to implement (to remove an item from the stack, get the current top, attempt to replace it with its next pointer until you succeed, return it; to add an item, get the current top and set it as the item's next pointer, until you succeed in writing a pointer to the item as the new top; on architectures with reserve/conditional write semantics, this is enough, on architectures only supporting CAS you need to append a nonce or version number to the atomically manipulated data to avoid the ABA problem). They are one way of keeping track of free space for keys/data in an atomic lock free manner, allowing you to reduce a key/value pair - the data actually stored in a hashtable entry - to a pointer/offset or two, a small enough amount to be manipulated using your architecture's atomic instructions. There are others.
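A sketch of such a stack with C++11 atomics; note that the ABA problem and safe memory reclamation mentioned above are deliberately not handled (popped nodes are simply leaked), so this only illustrates the shape of the CAS loops:

// Sketch of a lock-free (Treiber-style) stack.
#include <atomic>
#include <utility>

template <typename T>
class LockFreeStack {
    struct Node { T value; Node* next; };
    std::atomic<Node*> head_{nullptr};

public:
    void push(T value) {
        Node* n = new Node{std::move(value), head_.load()};
        // Retry until head_ is swung from n->next to n; on failure the current
        // head is reloaded into n->next.
        while (!head_.compare_exchange_weak(n->next, n)) {
        }
    }

    bool pop(T& out) {
        Node* n = head_.load();
        // Retry while another thread changes the head under us; stop when empty.
        while (n != nullptr && !head_.compare_exchange_weak(n, n->next)) {
        }
        if (n == nullptr) return false;
        out = std::move(n->value);
        return true;   // n is intentionally leaked: no safe reclamation in this sketch
    }
};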
Reads then become a case of looking up the entry, checking the kvp against the requested key, doing whatever it takes to make sure the value will remain valid when we return it (taking a copy / increasing its reference count), checking the entry hasn't been modified since we began the read, returning the value if so, undoing any reference count changes and repeating the read if not.
Writes will depend on what we're doing about collisions; in the trivial case, they are simply a case of finding the correct empty slot and writing the new kvp.
The above is greatly simplified and insufficient to produce your own safe implementation, especially if you are not familiar with lock-free/wait-free techniques. Possible complications include the ABA problem, priority inversion, starvation of particular threads; I have not addressed hash collisions.
The nbds page links to an excellent presentation on a real world approach that allows growth / collisions. Others exist, a quick Google finds lots of papers.
Lock free and wait free algorithms are fascinating areas of research; I encourage the reader to Google around. That said, naive lock free implementations can easily look reasonable and behave correctly much of the time while in reality being subtly unsafe. While it is important to have a solid grasp on the principles, I strongly recommend using an existing, well-understood and proven implementation over rolling your own.
You may want to look at Java's ConcurrentHashMap implementation for one possible implementation.
The basic idea is NOT to lock for every read operation but only for writes. Since your question specifically mentions an extremely high read:write ratio, it makes sense to try to push as much of the overhead as possible into writes.
The ConcurrentHashMap divides the hashtable into so called "Segments" that are themselves concurrently readable hashtables and keep every single segment in a consistent state to allow traversing without locking.
When reading you basically have the usual hashmap get() with the difference that you have to worry about reading stale values, so things like the value of the correct node, the first node of the segment table and next pointers have to be volatile (with c++'s non-existent memory model you probably can't do this portably; c++0x should help here, but haven't looked at it so far).
When putting a new element in there you get all the overhead, first of all having to lock the given segment. After locking it's basically a usual put() operation, but you have to guarantee atomic writes when updating the next pointer of a node (pointing to the newly created node whose next pointer has to be already correctly pointing to the old next node) or overwriting the value of a node.
When growing the segment, you have to rehash the existing nodes and put them into the new, larger table. The important part is to clone nodes for the new table so as not to influence the old table (by changing their next pointers too early) until the new table is complete and replaces the old one (they use some clever trick there that means they only have to clone about 1/6 of the nodes - nice, but I'm not really sure how they reach that number).
Note that garbage collection makes this a whole lot easier because you don't have to worry about the old nodes that weren't reused - as soon as all readers are finished they will automatically be GCed. That's solvable though, but I'm not sure what the best approach would be.
I hope the basic idea is somewhat clear - obviously there are several points that aren't trivially ported to c++, but it should give you a good idea.
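A much-simplified C++ sketch of the segment idea, using one std::shared_mutex per segment instead of Java's machinery; resizing and the lock-free read path discussed above are not attempted here:

// Sketch: a segmented map where readers take shared locks and writers take an
// exclusive lock on the single segment they touch (C++17).
#include <cstddef>
#include <functional>
#include <shared_mutex>
#include <unordered_map>

template <typename K, typename V, std::size_t Segments = 16>
class SegmentedMap {
    struct Segment {
        mutable std::shared_mutex mtx;
        std::unordered_map<K, V> map;
    };
    Segment segments_[Segments];

    Segment& segmentFor(const K& key) {
        return segments_[std::hash<K>{}(key) % Segments];
    }

public:
    void put(const K& key, V value) {
        Segment& s = segmentFor(key);
        std::unique_lock<std::shared_mutex> lock(s.mtx);   // exclusive, one segment only
        s.map[key] = std::move(value);
    }

    bool get(const K& key, V& out) {
        Segment& s = segmentFor(key);
        std::shared_lock<std::shared_mutex> lock(s.mtx);   // readers don't block each other
        auto it = s.map.find(key);
        if (it == s.map.end()) return false;
        out = it->second;
        return true;
    }
};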
No need to lock the whole table, just have a lock per bucket. That immediately gives parallelism. Inserting a new node to the table requires a lock on the bucket about to have the head node modified. New nodes are always added at the head of the table so that readers can iterate through the nodes without worrying about seeing new nodes.
Each node has an r/w lock; readers iterating take a read lock. Node modification requires a write lock.
Iteration without the bucket lock leading to node removal requires an attempt to take the bucket lock, and if it fails it must release the locks and retry to avoid deadlock because the lock order is different.
That's just a brief overview.
You can try atomic_hashtable for C:
https://github.com/Taymindis/atomic_hashtable for reading, writing, and deleting without locking while multiple threads access the data. Simple and stable.
API documentation is given in the README.