Expecting Concurrency Exceptions - concurrency

Is it bad practice to expect concurrency exceptions all the time and to just keep retrying? I don't mean the rare circumstance where two users happen to edit the same data at the same time; I mean cases where it is guaranteed to happen ALL the time. For example:
A league is an aggregate root, which has league matches. When a match is complete, an event is raised for that match and the league updates its standings. Most matches are played at the same time, so imagine 4 played simultaneously, resulting in 4 "MatchPlayedEvent" events raised at the same time. We then have the following handler called for the same league 4 times at the exact same time:
UpdateLeagueStandings
UpdateLeagueStandings
UpdateLeagueStandings
UpdateLeagueStandings
This always throws concurrency exceptions and then retries.
So again, is it bad to always have this, accept it, and just keep retrying whenever we get concurrency exceptions?
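For concreteness, the retry pattern being described looks roughly like the sketch below. It assumes optimistic concurrency (the save fails if the aggregate's version changed since it was loaded); ConcurrencyException, LoadLeague, SaveLeague and the retry limit are hypothetical stand-ins for whatever the real persistence layer provides.

```cpp
#include <stdexcept>

// Hypothetical stand-ins for the real persistence layer and aggregate.
struct ConcurrencyException : std::runtime_error {
    using std::runtime_error::runtime_error;
};

struct League {
    void UpdateStandings(int /*matchId*/) { /* recompute the table */ }
};

League LoadLeague(int /*leagueId*/) { return League{}; }      // reads aggregate + its version
void SaveLeague(const League& /*league*/) { /* throws ConcurrencyException on version mismatch */ }

// Handler for MatchPlayedEvent: optimistic concurrency with a bounded retry loop.
void UpdateLeagueStandings(int leagueId, int matchId) {
    for (int attempt = 0; attempt < 5; ++attempt) {
        League league = LoadLeague(leagueId);   // re-read the latest version on every attempt
        league.UpdateStandings(matchId);
        try {
            SaveLeague(league);                 // succeeds only if nobody wrote in between
            return;
        } catch (const ConcurrencyException&) {
            // Another match's handler won the race; loop, reload, and reapply.
        }
    }
    throw ConcurrencyException("could not update league standings after 5 attempts");
}
```

The important detail is that the aggregate is re-read inside the loop, so every retry reapplies the change to the latest standings.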

Related

How to handle delays of scheduled jobs and the accidental duplications they cause? (caching? message brokers?)

In our project, we have scheduled jobs which send shipment requests for orders every 60 seconds. There must be exactly one request per order. Some jobs are delayed (taking around 70 seconds instead), which results in a request being sent twice for the same order, just because the previous job was delayed and a new one had already started. How can we ensure that only one request is sent per order, no matter what the delay is?
My assumptions so far:
Add a flag to the database and look it up before processing a request for an order (we use DynamoDB)
Temporarily store the result in a cache (I'd assume even something like 10 minutes would do, because delayed jobs usually don't take longer than 1.5 minutes, so it'd be a safe assumption)
Temporarily store it in some message broker (similar to caching). We already use SQS and SNS in our project. Would it be appropriate to store messages there about orders which were already processed? Are message brokers ever used to keep scheduled jobs from duplicating each other?
Increase the interval between jobs to 2 minutes. Even though delays are no longer than 1.5 minutes in total right now, this does not guarantee protection against longer delays in the future. However, this solution would be simple enough.
What do you think? What would be a good solution in this case, in terms of simple implementation, fast performance and preventing duplicates?
So, if you want to make your operation idempotent by using de-duplication logic, then you should ask the following questions to narrow down the possible options:
In the worst case, how many times would you receive the exact same request?
In the worst case, how much time would pass between the first and the last duplicate request?
In the worst case, how many requests would need to be evaluated at nearly the same time during peak hours?
Which storage systems allow me to use a point query instead of a scan?
Which storage system has the lowest write overhead for recording the "I have seen this" flag?
...
Depending on your answers, you can judge whether a given storage system is suitable for your needs or not.
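As an illustration of the first option (a durable "already processed" flag), the sketch below shows the semantics you need: checking the flag and setting it must be one atomic step, otherwise two overlapping jobs can both see "not processed" and both send the request. The in-memory set here is only a stand-in; with DynamoDB the same effect would come from a conditional write (put-if-absent on the order key). All names are hypothetical.

```cpp
#include <mutex>
#include <string>
#include <unordered_set>

// Stand-in for the durable "I have seen this" flag. In production this would be a
// conditional write (put-if-absent) against the real store, so that checking the
// flag and setting it happen as one atomic step.
class ProcessedOrders {
public:
    // Returns true exactly once per order id, even with concurrent callers.
    bool TryMarkProcessed(const std::string& orderId) {
        std::lock_guard<std::mutex> lock(mutex_);
        return seen_.insert(orderId).second;    // false if the flag already existed
    }

private:
    std::mutex mutex_;
    std::unordered_set<std::string> seen_;
};

void SendShipmentRequest(const std::string& /*orderId*/) { /* call the shipping service */ }

void ProcessOrder(ProcessedOrders& flags, const std::string& orderId) {
    if (!flags.TryMarkProcessed(orderId))
        return;                                 // a previous (possibly delayed) job already handled it
    SendShipmentRequest(orderId);
}
```

Note that a plain read-then-write against the database does not give you this guarantee; it reintroduces the very race you are trying to remove.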

Is my understanding of transactional memory as described below correct?

I am trying to understand TM. I have read Ben's answer here and tried to understand some other articles on the Internet. I am still not quite sure whether I understood correctly, though. In my understanding, in transactional memory the threads may execute transactions in parallel. If two (or more) threads try to access the same transactional variable, all threads except one will abort their transaction and start over (at some point, not necessarily immediately). The one that doesn't abort updates the transactional variable.
So, in a nutshell: in TM all threads run in parallel and we hope there won't be any overlapping accesses to transactional variables; if there are, we let only one thread continue while the others roll back and retry. Is this understanding of TM correct?
That is a pretty good synopsis. The details are quite convoluted, and it is possible that some transactions cannot be expressed in a given TM monitor, which means that you may have to include two implementations of your transaction - an optimistic and a pessimistic one.
The cache is the underlying implementation; when you make a transactional reference to memory, the cache notes this, and either generates an alarm (restart) when any of those references are modified, or rejects the transaction at commit time if any have been modified.
The number of transactional variables may have to, in general, be lower than your cache's associativity; otherwise they would evict one another from the cache, resulting in a transaction that could never complete.
How interrupts function in the midst of a transaction remains an open problem.
In short, it was a bit of a fascinating idea 20 years ago. As it nears general usability, it seems to have rapidly expanding hardware requirements. It may be more useful for warming cold climes than accelerating computer systems.
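To make the optimistic/pessimistic split concrete, here is a minimal sketch using Intel's RTM intrinsics (one flavor of hardware TM). It is illustrative only: real code must first check at runtime that the CPU supports TSX, it must be compiled with -mrtm, and production lock elision has more subtleties than shown here.

```cpp
#include <immintrin.h>   // _xbegin, _xend, _xabort: compile with -mrtm, needs TSX hardware
#include <atomic>

std::atomic<bool> fallbackLocked{false};   // simple spinlock for the pessimistic path
long sharedCounter = 0;                    // the "transactional variable"

void Increment() {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        // Optimistic path: executes speculatively. Reading the lock flag puts it in
        // our read-set, so a thread that grabs the fallback lock aborts us as well.
        if (fallbackLocked.load(std::memory_order_relaxed))
            _xabort(0xff);
        ++sharedCounter;
        _xend();                           // commit; any conflict detected so far aborts instead
        return;
    }
    // Pessimistic fallback: the transaction aborted (conflict, capacity, interrupt, ...).
    while (fallbackLocked.exchange(true, std::memory_order_acquire)) {
        // spin until the lock is free
    }
    ++sharedCounter;
    fallbackLocked.store(false, std::memory_order_release);
}
```

If the speculative path aborts for any reason (conflicting access, cache capacity, an interrupt), execution resumes at _xbegin() with an abort status and the code falls through to the ordinary lock.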

How do I profile Hiccups in performance?

Usually profile data is gathered by randomly sampling the stack of the running program to see which function is executing; over a long enough run it is possible to be statistically sure which methods/function calls eat the most time and need intervention in case of bottlenecks.
However, this has to do with overall application/game performance. Sometimes there are singular, isolated hiccups in performance that cause usability trouble anyway (the user notices them, lag is introduced into some internal mechanism, etc.). With regular profiling over a few seconds of execution it is not possible to tell which code is responsible. Even if the hiccup lasts long enough (say 30 ms, which is not really enough) to detect some method that is called too often, we will still miss the execution of many other methods that are simply "skipped" because of the random sampling.
So are there any techniques for profiling hiccups, in order to keep the framerate more stable after fixing those kinds of "rare bottlenecks"? I'm assuming usage of languages like C# or C++.
This has been answered before, but I can't find it, so here goes...
The problem is that the DrawFrame routine sometimes takes too long.
Suppose it normally takes less than 1000/30 = 33ms, but once in a while it takes longer than 33ms.
At the beginning of DrawFrame, set a timer interrupt that will expire after, say, 40ms.
Then at the end of DrawFrame, disable the interrupt.
So if it triggers, you know DrawFrame is taking an unusually long time.
Put a breakpoint in the interrupt handler, and when it gets there, examine the stack.
Chances are pretty good that you have caught it in the process of doing the costly thing.
That's a variation on random pausing.
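A rough POSIX sketch of that idea: arm a one-shot timer before DrawFrame and disarm it afterwards, so the signal handler only ever fires when the frame has blown its budget. The 40 ms value, the function names, and the use of SIGTRAP to break into an attached debugger are placeholders.

```cpp
#include <csignal>
#include <sys/time.h>   // setitimer, ITIMER_REAL (POSIX)

void OnFrameOverrun(int) {
    // Only reached if DrawFrame blew its budget; with a debugger attached,
    // SIGTRAP stops execution here so the current stack can be inspected.
    raise(SIGTRAP);
}

void ArmWatchdog(long milliseconds) {
    signal(SIGALRM, OnFrameOverrun);
    itimerval timeout{};                               // it_interval stays zero: one-shot timer
    timeout.it_value.tv_sec  = milliseconds / 1000;
    timeout.it_value.tv_usec = (milliseconds % 1000) * 1000;
    setitimer(ITIMER_REAL, &timeout, nullptr);
}

void DisarmWatchdog() {
    itimerval off{};                                   // a zero it_value cancels the pending alarm
    setitimer(ITIMER_REAL, &off, nullptr);
}

void DrawFrame() { /* ... render the frame ... */ }    // the routine under suspicion

void RenderLoop() {
    for (;;) {
        ArmWatchdog(40);                               // anything past ~33 ms is a dropped frame
        DrawFrame();
        DisarmWatchdog();
    }
}
```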

When a ConcurrencyException occurs, what gets written?

We use RavenDB in our production environment. It stores millions of documents, and gets updated pretty much constantly during the day.
We have two boxes load-balanced using a round-robin strategy which replicate to one another.
Every week or so, we get a ConcurrencyException from Raven. I understand that this basically means that one of the servers was told to insert or update the same document twice within a short timeframe - it's kind of like a conflict exception, except it occurs on a single server instead of between two replicating servers.
What happens when this error occurs? Can I assume that at least one of the writes succeeded? Can I predict which one? Is there anything I can do to make these exceptions less likely?
ConcurrencyException means that on a single server, you have two writes to the same document at the same instant.
That leads to:
One write is accepted.
One write is rejected (with concurrency exception).

Debugging crashes in production environments

First, I should give you a bit of context. The program in question is
a fairly typical server application implemented in C++. Across the
project, as well as in all of the underlying libraries, error
management is based on C++ exceptions.
My question is pertinent to dealing with unrecoverable errors and/or
programmer errors---the loose equivalent of "unchecked" Java
exceptions, for want of a better parallel. I am especially interested
in common practices for dealing with such conditions in production
environments.
For production environments in particular, two conflicting goals stand
out in the presence of the above class of errors: ease of debugging
and availability (in the sense of operational performance). Each of
these suggests in turn a specific strategy:
Install a top-level exception handler to absorb all uncaught
exceptions, thus ensuring continuous availability. Unfortunately,
this makes error inspection more involved, forcing the programmer to
rely on fine-grained logging or other code "instrumentation"
techniques.
Crash as hard as possible; this enables one to perform a post-mortem
analysis of the condition that led to the error via a core
dump. Naturally, one has to provide a means for the system to resume
operation in a timely manner after the crash, and this may be far
from trivial.
So I end up with two half-baked solutions; I would like a compromise
between service availability and debugging facilities. What am I
missing?
Note: I have flagged the question as C++ specific, as I am interested
in solutions and idiosyncrasies that apply to it in particular;
nonetheless, I am aware there will be considerable overlap with other
languages/environments.
Disclaimer: much like the OP, I code for servers, so this entire answer is focused on that specific use case. The strategy for embedded software or deployed applications is probably quite different; no idea.
First of all, there are two important (and rather different) aspects to this question:
Easing investigation (as much as possible)
Ensuring recovery
Let us treat both separately, since to divide is to conquer. And let's start with the tougher bit.
Ensuring Recovery
The main issue with the C++/Java style of try/catch is that it is extremely easy to corrupt your environment, because the code inside try and catch can mutate what is outside its own scope. Note: contrast this with Rust and Go, in which a task should not share mutable data with other tasks, and a failure kills the whole task without hope of recovery.
As a result, there are 3 recovery situations:
unrecoverable: the process memory is corrupted beyond repair
recoverable, manually: the process can be salvaged in the top-level handler at the cost of reinitializing a substantial part of its memory (caches, ...)
recoverable, automatically: okay, once we reach the top-level handler, the process is ready to be used again
A completely unrecoverable error is best addressed by crashing. Actually, in a number of cases (such as a pointer outside your process memory), the OS will help in making it crash. Unfortunately, in some cases it won't (a dangling pointer may still point within your process memory), and that's how memory corruption happens. Oops. Valgrind, ASan, Purify, etc. are tools designed to help you catch those unfortunate errors as early as possible; the debugger will assist (somewhat) with those which make it past that stage.
An error that can be recovered from, but requires manual cleanup, is annoying. You will forget to clean up in some rarely hit case. Thus it should be statically prevented. A simple transformation (moving the caches inside the scope of the top-level handler) turns this into an automatically recoverable situation.
In the latter case, obviously, you can just catch, log, and resume your process, waiting for the next query. Your goal should be for this to be the only situation occurring in Production (cookie points if it does not even occur).
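A sketch of that transformation: the caches are declared inside the scope guarded by the top-level handler, so unwinding destroys them and the next iteration rebuilds them from scratch (the request and cache types are hypothetical).

```cpp
#include <exception>
#include <iostream>

struct Request { /* ... */ };
struct Caches  { /* per-iteration state: memoized lookups, buffers, ... */ };

Request WaitForRequest() { return Request{}; }            // placeholder: block for the next query
void Handle(const Request&, Caches&) { /* may throw */ }  // placeholder: the actual work

void ServerLoop() {
    for (;;) {
        try {
            Caches caches;                 // lives inside the guarded scope, so a thrown
                                           // exception cannot leave stale state behind
            for (;;) {
                Request request = WaitForRequest();
                Handle(request, caches);
            }
        } catch (const std::exception& e) {
            std::cerr << "recovered from: " << e.what() << '\n';
            // caches was destroyed during unwinding; loop around and rebuild it
        }
    }
}
```

The point is that nothing an exception might have left half-updated survives into the next request.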
Easing Investigation
Note: I will take the opportunity to promote a project by Mozilla called rr which could really, really, help investigating once it matures. Check the quick note at the end of this section.
Without surprise, in order to investigate you will need data. Preferably, as much as possible, and well ordered/labelled.
There are two (practiced) ways to obtain data:
continuous logging, so that when an exception occurs, you have as much context as possible
exception logging, so that upon an exception, you log as much as possible
Logging continuously implies performance overhead and (when everything goes right) a flood of useless logs. On the other hand, exception logging implies having enough trust in the system's ability to perform some actions in case of exceptions (which, in the case of bad_alloc... oh well).
In general, I would advise a mix of both.
Continuous Logging
Each log should contain:
a timestamp (as precise as possible)
(possibly) the server name, the process ID and thread ID
(possibly) a query/session correlator
the filename, line number and function name of where this log came from
of course, a message, which should contain dynamic information (if you have a static message, you can probably enrich it with dynamic information)
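A minimal sketch of a logging macro that carries those fields (the query/session correlator is left to the caller; the format and names are illustrative only):

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <sstream>
#include <thread>

// Emits: <microseconds since epoch> [thread id] file:line (function) message
#define LOG(message)                                                              \
    do {                                                                          \
        const auto now_ = std::chrono::duration_cast<std::chrono::microseconds>(  \
            std::chrono::system_clock::now().time_since_epoch()).count();         \
        std::ostringstream line_;                                                 \
        line_ << now_ << "us [thread " << std::this_thread::get_id() << "] "      \
              << __FILE__ << ":" << __LINE__ << " (" << __func__ << ") "          \
              << message;                                                         \
        std::clog << line_.str() << '\n';                                         \
    } while (0)

// Usage: dynamic information goes straight into the message.
void HandleQuery(int queryId, std::size_t payloadBytes) {
    LOG("received query " << queryId << " with payload of " << payloadBytes << " bytes");
}
```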
What is worth logging?
At least I/O. All inputs, at least, and outputs can help spotting the first deviation from expected behavior. I/O includes: the inbound query and corresponding response, as well as interactions with other servers, databases, various local caches, timestamps (for time-related decisions), ...
The goal of such logging is to be able to reproduce the issue in a controlled environment (which can be set up thanks to all this information). As a bonus, it can be useful as a crude performance monitor, since it gives some check-points during the process (note: I am talking about monitoring and not profiling for a reason; this can allow you to raise alerts and spot roughly where time is spent, but you will need more advanced analysis to understand why).
Exception Logging
The other option is to enrich exceptions. As an example of a crude exception: std::out_of_range yields the following reason (from what()): vector::_M_range_check when thrown from libstdc++'s vector.
This is pretty much useless if, like me, vector is your container of choice and therefore there are about 3,640 locations in your code where this could have been thrown.
The basics, to get a useful exception, are:
a precise message: "access to index 32 in vector of size 4" is slightly more helpful, no?
a call stack: it requires platform-specific code to retrieve, but it can be captured automatically in your base exception constructor, so go for it!
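On Linux/glibc, such a base exception can capture the stack with backtrace(), roughly as in the sketch below (linking with -rdynamic improves symbol names, and real code would translate addresses to source lines with a tool such as addr2line). The class name and the range-check helper are made up for illustration.

```cpp
#include <execinfo.h>   // backtrace, backtrace_symbols (glibc)
#include <cstddef>
#include <cstdlib>      // free
#include <stdexcept>
#include <string>

// Base exception that records the call stack at the point where it is constructed,
// i.e. at the throw site.
class TracedError : public std::runtime_error {
public:
    explicit TracedError(const std::string& what) : std::runtime_error(what) {
        void* frames[64];
        const int count = backtrace(frames, 64);
        if (char** symbols = backtrace_symbols(frames, count)) {
            for (int i = 0; i < count; ++i) {
                stack_ += symbols[i];
                stack_ += '\n';
            }
            free(symbols);
        }
    }

    const std::string& stack() const { return stack_; }

private:
    std::string stack_;
};

// Produces the "access to index 32 in vector of size 4" style of message.
void RangeCheck(std::size_t index, std::size_t size) {
    if (index >= size)
        throw TracedError("access to index " + std::to_string(index) +
                          " in vector of size " + std::to_string(size));
}
```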
Note: once you have a call-stack in your exceptions, you will quickly find yourself addicted and wrapping lesser-abled 3rd party software into an adapter layer if only to translate their exceptions into yours; we all did it ;)
On top of those basics, there is a very interesting feature of RAII: attaching notes to the current exception during unwinding. A simple handler retaining a reference to a variable and checking whether an exception is unwinding in its destructor costs only a single if check in general, and does all the important logging when unwinding (but then, exception propagation is costly already, so...).
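That RAII handler can look roughly like this sketch; since C++17, std::uncaught_exceptions() makes the "are we unwinding?" check reliable even during nested unwinding. The names are illustrative, and writing to std::clog stands in for whatever logging sink you use.

```cpp
#include <exception>
#include <iostream>

// Logs its note only if an exception is propagating through its scope.
class LogOnUnwind {
public:
    LogOnUnwind(const char* label, const int& value)
        : label_(label), value_(value), exceptionsOnEntry_(std::uncaught_exceptions()) {}

    ~LogOnUnwind() {
        // The single "if": formatting and logging happen only while unwinding.
        if (std::uncaught_exceptions() > exceptionsOnEntry_)
            std::clog << "while unwinding: " << label_ << " = " << value_ << '\n';
    }

private:
    const char* label_;
    const int&  value_;
    int         exceptionsOnEntry_;
};

void ProcessOrder(int orderId) {
    LogOnUnwind note("orderId", orderId);
    // ... work that may throw; if it does, orderId is logged during unwinding ...
}
```

Because the destructor only holds a reference and an integer, the cost on the happy path is the single comparison mentioned above.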
Finally, you can also enrich and rethrow in catch clauses, but this quickly litters the code with try/catch blocks so I advise using RAII instead.
Note: there is a reason that std exceptions do NOT allocate memory: it allows throwing exceptions without the throw itself being preempted by a std::bad_alloc. I advise consciously choosing richer exceptions in general, accepting the possibility of a std::bad_alloc being thrown while creating an exception (which I have yet to see happen). You have to make your own choice.
And Delayed Logging ?
The idea behind delayed logging is that instead of calling your log handler as usual, you defer logging all the finer-grained traces and only get to them in case of an issue (i.e., an exception).
The idea, therefore, is to split logging:
important information is logged immediately
finer-grained information is written to a scratch-pad, which can be asked to emit its contents in case of an exception
Of course, there are questions:
the scratch pad is (mostly) lost in case of a crash; you should be able to access it via your debugger if you get a memory dump, though that is not as pleasant.
the scratch pad requires a policy: when to discard it? (end of the session? end of the transaction? ...), how much memory? (as much as it wants? bounded? ...)
what of the performance cost: even without writing the logs to disk/network, it still costs something to format them!
I have actually never used such a scratch pad; so far, all the non-crasher bugs I have ever had were solved solely using I/O logging and rich exceptions. Still, should I implement one, I would recommend making it:
transaction local: since I/O is logged, we should not need more insight than this
memory bounded: evicting older traces as we progress
log-level driven: just as with regular logging, I would want to be able to allow only some logs into the scratch pad
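A sketch along those lines: one pad per transaction, memory-bounded, and flushed to the real log only from the catch clause (the bound, the flush target, and the trace contents are arbitrary).

```cpp
#include <cstddef>
#include <deque>
#include <iostream>
#include <string>

// Transaction-local buffer of fine-grained traces, flushed only on failure.
class ScratchPad {
public:
    explicit ScratchPad(std::size_t maxEntries) : maxEntries_(maxEntries) {}

    void Trace(std::string line) {
        if (entries_.size() == maxEntries_)
            entries_.pop_front();               // memory bounded: evict the oldest trace
        entries_.push_back(std::move(line));
    }

    void Flush() {                              // called from the catch clause
        for (const auto& line : entries_)
            std::clog << line << '\n';
        entries_.clear();
    }

private:
    std::size_t maxEntries_;
    std::deque<std::string> entries_;
};

void HandleTransaction() {
    ScratchPad pad(256);                        // one pad per transaction
    try {
        pad.Trace("looked up customer 42");
        pad.Trace("reserved 3 items");
        // ... more work that may throw ...
    } catch (...) {
        pad.Flush();                            // only now do the traces reach the real log
        throw;
    }
}
```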
And Conditional / Probabilistic Logging ?
Writing one trace out of every N is not really interesting; it's actually more confusing than anything. On the other hand, logging one transaction out of every N in depth can help!
The idea here is to reduce the amount of logs written in general, whilst still getting a chance to observe bug traces in detail in the wild. The reduction is generally driven by the constraints of the logging infrastructure (there is a cost to transferring and writing all those bytes) or by the performance of the software (formatting the logs slows the software down).
The idea of probabilistic logging is to "flip a coin" at the start of each session/transaction to decide whether it'll be a fast one or a slow one :)
A similar idea (conditional logging) is to read a special debug field in the transaction that triggers full logging for it (at the cost of speed).
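Sketched below: the coin flip is made once per session/transaction, and a debug flag carried by the transaction can force verbose logging regardless of the coin (both the probability and the flag are hypothetical).

```cpp
#include <random>

// Decided once per session/transaction, then consulted by every fine-grained log site.
bool ShouldLogVerbosely(bool debugFlagInTransaction, double probability = 0.01) {
    if (debugFlagInTransaction)
        return true;                             // conditional logging: the client asked for it
    static thread_local std::mt19937 rng{std::random_device{}()};
    std::bernoulli_distribution coin(probability);
    return coin(rng);                            // probabilistic logging: roughly 1 session in 100
}
```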
A quick note on rr
With an overhead of only 20%, and this overhead applying only to the CPU processing, it might actually be worth using rr systematically. If that is not feasible, it could still be practical to launch 1 out of N servers under rr and use them to catch hard-to-find bugs.
This is similar to A/B testing, but for debugging purposes, and can be driven either by a willing commitment of the client (flag in the transaction) or with a probabilistic approach.
Oh, and in the general case, when you are not hunting down anything, it can be easily deactivated altogether. No sense in paying those 20% then.
That's all folks
I could apologize for the lengthy read, but the truth is I probably just skimmed the topic. Error recovery is hard. I would appreciate comments and remarks to help improve this answer.
If the error is unrecoverable, by definition there is nothing the application can do in a production environment to recover from it. In other words, the top-level exception handler is not really a solution. Even if the application displays a friendly message like "access violation", "possible memory corruption", etc., that doesn't actually increase availability.
When the application crashes in a production environment, you should get as much information as possible for post-mortem analysis (your second solution).
That said, if you get unrecoverable errors in a production environment, the main problems are your product's QA process (it's lacking) and, well before that, writing unsafe/untested code.
When you finish investigating such a crash, you should not only fix the code, but also fix your development process so that such crashes are no longer possible (e.g. if the corruption is an uninitialized pointer write, go over your code base and initialize all pointers, and so on).