Processing instrument capture data - C++

I have an instrument that produces a stream of data; my code accesses this data through a callback onDataAcquisitionEvent(const InstrumentOutput &data). The data processing algorithm is potentially much slower than the rate of data arrival, so I cannot hope to process every single piece of data (and I don't have to), but I would like to process as many as possible. Think of the instrument as an environmental sensor whose rate of data acquisition I don't control. InstrumentOutput could, for example, be a class that contains three simultaneous pressure measurements in different locations.
I also need to keep some short history of the data. Assume, for example, that I can reasonably hope to process a sample of data every 200 ms or so. Most of the time I would be happy processing just the single last sample, but occasionally I would need to look at a couple of seconds' worth of data that arrived prior to that latest sample, depending on whether abnormal readings are present in the last sample.
The other requirement is to get out of the onDataAcquisitionEvent() callback as soon as possible, to avoid data loss in the sensor.
The data acquisition library (third party) collects the instrument data on a separate thread.
I thought of the following design: have a single-producer/single-consumer queue and push the data tokens into the synchronized queue in the onDataAcquisitionEvent() callback.
On the receiving end, there is a loop that pops the data from the queue. The loop will almost never sleep because of the high rate of data arrival. On each iteration, the following happens:
Pop all the available data from the queue,
The popped data is copied into a circular buffer (I used boost circular buffer), this way some history is always available,
Process the last element in the buffer (and potentially look at the prior ones),
Repeat the loop.
Questions:
Is this design sound, and what are the pitfalls?
What could be a better design?
Edit: One problem I thought of is when the size of the circular buffer is not large enough to hold the needed history; currently I simply reallocate the circular buffer, doubling its size. I hope I would only need to do that once or twice.
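For concreteness, a minimal sketch of the design I have in mind, assuming a plain mutex-protected queue for the hand-off; the InstrumentOutput fields, the capacities, and the processing step are illustrative placeholders, not actual code:

    #include <boost/circular_buffer.hpp>
    #include <deque>
    #include <mutex>
    #include <vector>

    struct InstrumentOutput { double p1, p2, p3; };   // illustrative: three pressure readings

    std::mutex g_queueMutex;
    std::deque<InstrumentOutput> g_queue;             // the "synchronized queue"

    // Called on the acquisition library's thread: do as little as possible here.
    void onDataAcquisitionEvent(const InstrumentOutput& data) {
        std::lock_guard<std::mutex> lock(g_queueMutex);
        g_queue.push_back(data);
    }

    // Consumer loop on the processing thread.
    void processingLoop() {
        boost::circular_buffer<InstrumentOutput> history(1024);   // a couple of seconds of history
        std::vector<InstrumentOutput> batch;
        for (;;) {
            {   // 1. Pop all available data.
                std::lock_guard<std::mutex> lock(g_queueMutex);
                batch.assign(g_queue.begin(), g_queue.end());
                g_queue.clear();
            }
            if (batch.empty()) continue;              // or sleep briefly when nothing arrived
            // 2. Copy into the circular buffer so history is always available.
            for (const auto& sample : batch) history.push_back(sample);
            // 3. Process history.back(), looking at earlier entries in `history`
            //    if the latest reading looks abnormal.
        }
    }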

I have a bit of experience with data acquisition, and I can tell you a lot of developers have problems with premature feature creep. Because it sounds easy to simply capture data from the instrument into a log, folks tend to add non-essential components to the system before verifying that logging is actually robust. This is a big mistake.
The other requirement is to get out of the onDataAcquisitionEvent() callback as soon as possible, to avoid data loss in the sensor.
That's the only requirement until that part of the product is working 110% under all field conditions.
Most of the time I would be happy processing just a single last sample, but occasionally I would need to look at a couple of seconds worth of data that arrived prior to that latest sample, depending on whether abnormal readings are present in the last sample.
"Most of the time" doesn't matter. Code for the worst case, because onDataAcquisitionEvent() can't be spending its time thinking about contingencies.
It sounds like you're falling into the pitfall of designing it to work with the best data that might be available, and leaving open what might happen if it's not available or if providing the best data to the monitor is ultimately too expensive.
Decimate the data at the source. Specify how many samples will be needed for the abnormal case processing, and attempt to provide that many, at a constant sample rate, plus a margin of maybe 20%.
There should certainly be no loops that never sleep. A circular buffer is fine, but just populate it with whatever minimum you need, and analyze it only as frequently as necessary.
The quality of the system is determined by its stability and determinism, not by trying to go the extra mile and provide as much as possible.
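As a sketch of what decimating at the source might look like (the decimation factor is an illustrative assumption, chosen so that the kept rate still covers the worst-case history plus a margin):

    #include <atomic>

    constexpr unsigned kKeepEveryNth = 8;     // illustrative: keep 1 sample in 8

    std::atomic<unsigned> g_sampleCounter{0};

    void onDataAcquisitionEvent(const InstrumentOutput& data) {
        // Drop most samples right here, at a constant rate, so everything
        // downstream sees a bounded, predictable data rate.
        if (g_sampleCounter.fetch_add(1, std::memory_order_relaxed) % kKeepEveryNth != 0)
            return;
        // ... hand the kept sample `data` off to the processing thread as before ...
    }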

Your producer/consumer design is exactly the right design. In real-time systems we often also give different run-time priorities to the consuming threads; I'm not sure that applies in your case.
Use a data structure that's basically a doubly-linked list, so that if it grows you don't need to re-allocate everything, and you still have O(1) access to the samples you need (the newest ones at the back).
If your memory isn't large enough to hold your several seconds' worth of data (which it should be -- one sample every 200 ms is only 5 samples per second), then you need to see whether you can stand reading from auxiliary memory; but that is a throughput question and in your case has nothing to do with your design and the requirement of "getting out of the callback as soon as possible".
Consider an implementation of the queue that does not need locking (remember: single reader and single writer only!), so that your callback doesn't stall.
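Boost provides exactly this in boost::lockfree::spsc_queue, which is valid for exactly one producer and one consumer. A minimal sketch, with an illustrative capacity:

    #include <boost/circular_buffer.hpp>
    #include <boost/lockfree/spsc_queue.hpp>

    boost::lockfree::spsc_queue<InstrumentOutput,
                                boost::lockfree::capacity<4096>> g_spscQueue;

    // Acquisition thread: push() never blocks; it returns false (and the
    // sample is dropped) if the queue is full.
    void onDataAcquisitionEvent(const InstrumentOutput& data) {
        g_spscQueue.push(data);
    }

    // Processing thread: drain everything that arrived since the last iteration.
    void drain(boost::circular_buffer<InstrumentOutput>& history) {
        g_spscQueue.consume_all([&](const InstrumentOutput& sample) {
            history.push_back(sample);
        });
    }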
If your callback is really quick, consider disabling interrupts/giving it a high priority. May not be necessary if it can never block and has the right priority set.

Questions: (1) is this design sound, and what are the pitfalls, and (2) what could be a better design? Thanks.
Yes, it is sound. But for performance reasons, you should design the code so that it processes an array of input samples at each processing stage, instead of just a single sample at a time. This results in much more efficient code on current state-of-the-art CPUs.
The length of such an array (= a chunk of data) is either fixed (simpler code) or variable (more flexible, but some processing may become more complicated).
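A trivial sketch of what "an array of samples per stage" means in practice; the per-sample operation is a placeholder:

    #include <cstddef>

    // One processing stage operating on a whole chunk: a tight loop the
    // compiler can vectorize and pipeline, instead of a call per sample.
    void processStage(const double* in, double* out, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            out[i] = 0.5 * in[i] + 1.0;   // placeholder per-sample operation
    }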
As a second design choice, you probably should ignore the history at this architectural level, and relegate that feature...
Most of the time I would be happy processing just a single last sample, but occasionally I would need to look at a couple of seconds worth of data [...]
Maybe tracking a history should be implemented in just that special part of the code that occasionally requires access to it. Maybe it should not be part of the "overall architecture". If so, it simplifies the overall processing.

Related

In Vulkan, is it beneficial for the graphics queue family to be separate from the present queue family?

As far as I can tell it is possible for a queue family to support presenting to the screen but not support graphics. Say I have a queue family that supports both graphics and presenting, and another queue family that only supports presenting. Should I use the first queue family for both processes or should I delegate the first to graphics and the latter to presenting? Or would there be no noticeable difference between these two approaches?
No such HW exists, so the best approach is no approach. If you want to be really nice, you can handle the separate-present-queue-family case while expending minimal brain power on it, though you have no way to test it on real HW that needs it. So I would say aborting with a nice error message would be adequate, until you can get your hands on actual HW that actually works that way.
I think there is a bit of a design error here on Khronos's part. A separate present queue does look like the more explicit way. But then, the present op itself is not a queue operation, so the driver can use whatever it wants anyway. Also, a separate present queue requires an extra semaphore, and a Queue Family Ownership Transfer (or a VK_SHARING_MODE_CONCURRENT resource). History went such that no driver is extreme enough to report a separate present queue. So I made KhronosGroup/Vulkan-Docs#1234.
For a rough notion of what happens at vkQueuePresentKHR, you can inspect the Mesa code: https://github.com/mesa3d/mesa/blob/bf3c9d27706dc2362b81aad12eec1f7e48e53ddd/src/vulkan/wsi/wsi_common.c#L1120-L1232. There's probably no monkey business there using the queue you provided, except waiting on your semaphore or at most blitting the image. If you (voluntarily) want to use a separate present queue, you need to measure, and whitelist it only for drivers (and probably other influences) where it actually helps (if any such exist, and if it is even worth your time).
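For reference, the usual way to pick a single queue family that supports both graphics and present, a sketch using the standard queries with error handling omitted:

    #include <vulkan/vulkan.h>
    #include <cstdint>
    #include <vector>

    // Returns the index of a queue family supporting both graphics and present,
    // or UINT32_MAX if none exists.
    uint32_t pickGraphicsPresentFamily(VkPhysicalDevice phys, VkSurfaceKHR surface) {
        uint32_t count = 0;
        vkGetPhysicalDeviceQueueFamilyProperties(phys, &count, nullptr);
        std::vector<VkQueueFamilyProperties> families(count);
        vkGetPhysicalDeviceQueueFamilyProperties(phys, &count, families.data());

        for (uint32_t i = 0; i < count; ++i) {
            VkBool32 presentSupported = VK_FALSE;
            vkGetPhysicalDeviceSurfaceSupportKHR(phys, i, surface, &presentSupported);
            if ((families[i].queueFlags & VK_QUEUE_GRAPHICS_BIT) && presentSupported)
                return i;
        }
        return UINT32_MAX;   // fall back to separate families, or abort with a message
    }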
First off, I assume you mean "beneficial" in terms of performance, and whenever it comes to questions like that you can never have a definite answer except by profiling the different strategies. If your application needs to run on a variety of hardware, you can have it profile the different strategies the first time it's run and save the results locally for repeated use, provide the user with a benchmarking utility they can run if they see poor performance, etc. etc. Trying to reason about it in the abstract can only get you so far.
That aside, I think the easiest way to think about questions like this is to remember that when it comes to graphics programming, you want to both maximize the amount of work that can be done in parallel and minimize the amount of work overall. If you want to present an image from a non-graphics queue and you need to perform graphics operations on it, you'll need to transfer ownership of it to the non-graphics queue when graphics operations on it have finished. Presumably, that will take a bit of time in the driver if nothing else, so it's only worth doing if it will save you time elsewhere somehow.
A common situation where this would probably save you time is if the device supports async compute and also lets you present from the compute queue. For example, a 3D game might use the compute queue for things like lighting, blur, UI, etc. that make the most sense to do after geometry processing is finished. In this case, the game engine would transfer ownership of the image to be presented to the compute queue first anyway, or even have the compute queue own the swapchain image from beginning to end, so presenting from the compute queue once its work for the frame is done would allow the graphics queue to stay busy with the next frame. AMD and NVIDIA recommend this sort of approach where it's possible.
If your application wouldn't otherwise use the compute queue, though, I'm not sure how much sense it makes or not to present on it when you have the option. The advantage of that approach is that once graphics operations for a given frame are over, you can have the graphics queue immediately release ownership of the image for it and acquire the next one without having to pause to present it, which would allow presentation to be done in parallel with rendering the next frame. On the other hand, you'll have to transfer ownership of it to the compute queue first and set up presentation there, which would add some complexity and overhead. I'm not sure which approach would be faster and I wouldn't be surprised if it varies with the application and environment. Of course, I'm not sure how many realtime Vulkan applications of any significant complexity fit this scenario today, and I'd guess it's not very many as "per-pixel" things tend to be easier and faster to do with a compute shader.

Multithreaded Realtime audio programming - To block or Not to block

When writing audio software, many people on the internet say it is paramount not to use either memory allocation or blocking code, i.e. no locks, because these are non-deterministic and could cause the output buffer to underflow, making the audio glitch.
Real Time Audio Programming
When I write video software, I generally use both, i.e. allocating video frames on the heap and passing them between threads using locks and condition variables (bounded buffers). I love the power this provides, as a separate thread can be used for each operation, allowing the software to max out each of the cores, giving the best performance.
With audio I'd like to do something similar, passing frames of maybe 100 samples between threads, however, there are two issues.
How do I generate the frames without using memory allocation? I suppose I could use a pool of frames that have been pre-allocated but this seems messy.
I'm aware you can use a lock-free queue, and that Boost has a nice library for this. This would be a great way to share data between threads, but constantly polling the queue to see if data is available seems like a massive waste of CPU time.
In my experience using mutexes doesn't actually take much time at all, provided that the section where the mutex is locked is short.
What is the best way to achieve passing audio frames between threads, whilst keeping latency to a minimum, not wasting resources and using relatively little non-deterministic behaviour?
Seems like you did your research! You've already identified the two main problems that could be the root cause of audio glitches. The question is: how much of this was important 10 years ago, and how much is only folklore and cargo-cult programming these days?
My two cents:
1. Heap allocations in the rendering loop:
These can have quite a lot of overhead, depending on how small your processing chunks are. The main culprit is that very few runtimes have a per-thread heap, so each time you mess with the heap, your performance depends on what the other threads in your process are doing. If, for example, a GUI thread is currently deleting thousands of objects and you access the heap from the audio rendering thread at the same time, you may experience a significant delay.
Writing your own memory management with pre-allocated buffers may sound messy, but in the end it's just two functions that you can hide somewhere in a utility source. Since you usually know your allocation sizes in advance, there is a lot of opportunity to fine-tune and optimize your memory management. You can store your segments as a simple linked list, for example. Done right, this has the benefit that you allocate the last-used buffer again, and that buffer has a very high probability of being in the cache.
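A minimal sketch of such a pre-allocated pool with last-freed-first-reused behaviour; the chunk size and count are illustrative, and all allocation happens once, up front:

    #include <cstddef>
    #include <vector>

    class ChunkPool {
    public:
        ChunkPool(std::size_t samplesPerChunk, std::size_t chunkCount)
            : storage_(chunkCount, std::vector<float>(samplesPerChunk)) {
            for (auto& chunk : storage_) free_.push_back(&chunk);   // all chunks start free
        }

        // "allocate": hand out the most recently freed chunk (likely still cache-warm).
        std::vector<float>* acquire() {
            if (free_.empty()) return nullptr;     // pool exhausted; caller decides what to do
            std::vector<float>* chunk = free_.back();
            free_.pop_back();
            return chunk;
        }

        // "free": push the chunk back on top of the free list.
        void release(std::vector<float>* chunk) { free_.push_back(chunk); }

    private:
        std::vector<std::vector<float>> storage_;   // fixed after construction
        std::vector<std::vector<float>*> free_;     // LIFO free list
    };

This is not thread-safe by itself; acquire() and release() would be wrapped in whatever locking or lock-free scheme is discussed below.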
If fixed-size allocators don't work for you, have a look at ring buffers. They fit the use cases of streaming audio very well.
2. To lock, or not to lock:
I'd say these days using mutex and semaphore locks is fine if you can estimate that you do fewer than 1000 to 5000 of them per second (on a PC; things are different on something like a Raspberry Pi, etc.). If you stay below that range, it is unlikely that the overhead shows up in a performance profile.
Translated to your use case: if you work with, say, 48 kHz audio and 100-sample chunks, you generate roughly 960 lock/unlock operations per second in a simple two-thread producer/consumer pattern. That is well within the range. Even if you completely max out the rendering thread, the locking will not show up in profiling. If, on the other hand, you only use something like 5% of the available processing power, the locks may show up, but you will not have a performance problem either :-)
Going lock-less is also an option, but so are hybrid solutions that first do some lock-less tries and then fall back to hard locking. You'll get the best of both worlds that way. There is a lot of good stuff to read about this topic on the net.
In any case:
You should gently raise the thread priority of your non-GUI threads to make sure that if they run into a lock, they get out of it quickly. It is also a good idea to read up on what priority inversion is and what you can do to avoid it:
https://en.wikipedia.org/wiki/Priority_inversion
'I suppose I could use a pool of frames that have been pre-allocated but this seems messy' - not really. Either allocate an array of frames, or new up frames in a loop, and then shove the indices/pointers onto a blocking queue. Now you have an auto-managed pool of frames. Pop one off when you need a frame, push it back on when you are done with it. No continual malloc/free/new/delete, no chance of memory runaway, simpler debugging, and frame flow control (if the pool runs out, threads asking for frames will wait until frames are released back into the pool) - all built in.
Using an array may seem easier/safer/faster than a new loop, but newing individual frames does have an advantage - you can easily change the number of frames in the pool at runtime.
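A sketch of that auto-managed pool, using a condition variable as the blocking queue of free frames; the frame type and sizes are illustrative:

    #include <condition_variable>
    #include <cstddef>
    #include <mutex>
    #include <queue>
    #include <vector>

    struct Frame { std::vector<float> samples; };      // illustrative frame type

    class FramePool {
    public:
        FramePool(std::size_t frameCount, std::size_t samplesPerFrame)
            : frames_(frameCount) {
            for (auto& f : frames_) {
                f.samples.resize(samplesPerFrame);
                free_.push(&f);                        // every frame starts out free
            }
        }

        // Pop a frame off the pool; blocks if none are free (built-in flow control).
        Frame* acquire() {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return !free_.empty(); });
            Frame* f = free_.front();
            free_.pop();
            return f;
        }

        // Push a frame back when you are done with it.
        void release(Frame* f) {
            {
                std::lock_guard<std::mutex> lock(mutex_);
                free_.push(f);
            }
            cv_.notify_one();
        }

    private:
        std::vector<Frame> frames_;                    // owns all frames; no further allocation
        std::queue<Frame*> free_;
        std::mutex mutex_;
        std::condition_variable cv_;
    };

If the real-time audio callback itself has to acquire frames, a non-blocking try-acquire that fails gracefully would be the safer variant; the blocking version belongs on the non-real-time side.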
Um, why are you passing frames of 100 samples between threads?
Assuming that you are working at a nominal sample rate of 44.1 kHz and passing 100 samples at a time between threads, the threads must switch at least every 100 samples / (44100 samples/s * 2); the 2 accounts for both the producer and the consumer. That means you have a time slice of ~1.13 ms for every 100 samples you send. Nearly all operating systems run at time slices greater than 10 ms, so it is impossible to build an audio engine where you are sharing only 100 samples at a time between threads at 44.1 kHz on a modern OS.
The solution is to buffer more samples per time slice, either via a queue or by using larger frames. Most modern real time audio APIs use 128 samples per channel (on dedicated audio hardware) or 256 samples per channel (on game consoles).
Ultimately, the answer to your question is mostly the answer you would expect... Pass around uniquely owned queues of pointers to buffers, not the buffers themselves; manage ALL audio buffers in a fixed pool allocated at program start; and lock all queues for as little time as necessary.
Interestingly, this is one of the few good situations in audio programming where there is a distinct performance advantage to busting out the assembly code. You definitely don't want a malloc and free occurring with every queue lock. Operating-system provided atomic locking functions can ALWAYS be improved upon, if you know your CPU.
One last thing: there's no such thing as a lockfree queue. All multithread "lockfree" queue implementations rely on a CPU barrier intrinsic or a hard compare-and-swap somewhere to make sure that exclusive access to memory is guaranteed per thread.

Writer/Reader buffer mechanism for large-size, high-frequency data - C++

I need a single-writer and multiple-reader (up to 5) mechanism, where the writer pushes data packages of almost 1 MB each, 15 packages per second, continuously; this will be written in C++. What I'm trying to do is have one thread keep writing the data while 5 readers simultaneously perform search operations according to the timestamp of the data. I have to keep each data package for 60 min, and then it can be removed from the container.
Since the data grows at about 15 MB * 60 sec * 60 min = 54000 MB/h, I need roughly 54 GB of space to keep the data and make the operations fast enough for both the writer and the readers. But the thing is, we cannot keep that much data in cache or RAM, so it must be on a hard drive, i.e. an SSD (an HDD would be too slow for that kind of operation).
Up to now what I've been thinking of is making a circular buffer (since I can calculate the max size) implemented directly on an SSD, for which I couldn't find a suitable example and I don't know whether it is even possible, or implementing some kind of mapping mechanism, where one circular array in RAM just keeps the timestamps of the data and the physical addresses on the hard drive for searching the data. That way at least the search operations would be faster, I guess.
Since any kind of lock, mutex or semaphore will slow down the operations (the write is especially critical; we cannot lose data because of any read operation), I don't want to use them. I know there are some shared locks available, but I think they have some drawbacks too. Is there any way/idea to implement such a system that is lock-free, wait-free and thread-safe as well? Any data structure (container), pattern, example code/project or other kind of suggestion would be highly appreciated, thank you…
EDIT: Is there any other idea besides a bigger amount of RAM?
This can be done on a commodity PC (and can scale to a server without code changes).
Locks are not a problem. With a single writer and few consumers that do time-consuming tasks on big data, you will have rare locking and practically zero lock contention, so it's a non-issue.
Anything from a simple spinlock (if you're really desperate for low latency) to, preferably, a pthread_mutex (which falls back to being a spinlock most of the time anyway) will do fine. Nothing fancy.
Note that you do not acquire a lock, receive a megabyte of data from a socket, write it to disk, and then release the lock. That's not how it works.
You receive a megabyte of data and write it to a region that you own exclusively, then acquire a lock, change a pointer (and thus transfer ownership), and release the lock. The lock protects the metadata, not every single byte in a gigabyte-sized buffer. Long running tasks, short lock times, contention = zero.
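In code, the thing the lock guards is tiny, something like the following sketch (the block indexing and names are illustrative):

    #include <cstddef>
    #include <mutex>

    struct RingBounds {
        std::size_t begin = 0;   // oldest valid block index
        std::size_t end   = 0;   // one past the newest valid block index
    };

    std::mutex g_boundsMutex;
    RingBounds g_bounds;

    // Producer: the 1 MiB block has already been written into a region no reader
    // is allowed to touch yet; the lock is held only to publish the new bounds.
    void publishBlock(std::size_t totalBlocks) {
        std::lock_guard<std::mutex> lock(g_boundsMutex);
        g_bounds.end = (g_bounds.end + 1) % totalBlocks;
        if (g_bounds.end == g_bounds.begin)                       // buffer full:
            g_bounds.begin = (g_bounds.begin + 1) % totalBlocks;  // retire the oldest block
    }

    // Consumers copy the bounds under the same lock, then work on the data lock-free.
    RingBounds snapshotBounds() {
        std::lock_guard<std::mutex> lock(g_boundsMutex);
        return g_bounds;
    }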
As for the actual data, writing out 15 MiB/s is absolutely no challenge; a normal hard disk will do 5-6 times as much, and an SSD will easily do 10 to 20 times that. It also isn't something you even need to do yourself. It's something you can leave to the operating system to manage.
I would create a 54.1 GB [1] file on disk and memory-map it (assuming it's a 64-bit system, a reasonable assumption when talking of multi-gigabyte-RAM servers, this is no problem). The operating system takes care of the rest. You just write your data to the mapped region, which you use as a circular buffer [2].
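A rough sketch of setting up such a mapping on a POSIX system; the file name and sizes are illustrative, and error handling is minimal:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstddef>

    constexpr std::size_t kBlockSize = 1024ull * 1024;            // 1 MiB per package
    constexpr std::size_t kNumBlocks = 54ull * 1024 + 100;        // ~1 hour plus spare blocks
    constexpr std::size_t kFileSize  = kBlockSize * kNumBlocks;   // ~54.1 GiB

    char* mapRingFile(const char* path) {
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0) return nullptr;
        if (ftruncate(fd, static_cast<off_t>(kFileSize)) != 0) return nullptr;  // size the backing file
        void* base = mmap(nullptr, kFileSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        return base == MAP_FAILED ? nullptr : static_cast<char*>(base);
    }

    // The producer then writes block i at:
    //     base + (i % kNumBlocks) * kBlockSize
    // and the OS takes care of paging; recently written blocks stay resident.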
What was most recently written will be more or less guaranteed [3] to be resident in RAM, so the consumers can access it without faulting. Older data may or may not be in RAM, depending on whether your server has enough physical RAM available.
Data that is older can still be accessed, but likely at slightly slower speed (if there is not enough physical RAM to keep the whole set resident). It will however not affect the producer or the consumers reading the recently written data (unless the machine is so awfully low-spec that it can't even hold 2-3 of your 1MiB blocks in RAM, but then you have a different problem!).
You are not very concrete about how you intend to process the data, other than that there will be 5 consumers, so I will not go too deep into this part. You may have to implement a job scheduling system, or you can just divide each incoming block into 5 smaller chunks, or whatever -- depending on what exactly you want to do.
What you need to account for in any case is the region (either as pointer, or better as offset into the mapping) of data in your mapped ringbuffer that is "valid" and the region that is "unused".
The producer is the owner of the mapping, and it "allows" the consumers to access the data within the bounds given in the metadata (a begin/end pair of offsets). Only the producer may change this metadata.
Anyone (including the producer) accessing this metadata needs to acquire a lock.
It is probably even possible to do this with atomic operations, but seeing how you only lock rarely, I wouldn't even bother. It's a no-brainer using a lock, and there are no subtle mistakes that you can make.
Since the producer knows that the consumers will only look at data within well-defined bounds, it can write to areas outside the bounds (the area known to be "empty") without locking. It only needs to lock to change the bounds afterwards.
As 54.1 GiB > 54 GiB, you have a hundred spare 1 MiB blocks in the mapping that you can write to. That's probably much more than needed (2 or 3 should do), but it doesn't hurt to have a few extra. As you write to a new block (and increase the valid range by one), also adjust the other end of the "valid range". That way, threads will no longer be allowed to access an old block, but a thread still working in that block can finish its work (the data still exists).
If one is strict about correctness, this may create a race condition if processing a block takes extremely long (over 1 1/2 minutes in this case). If you want to be absolutely sure, you'll need another lock, which may in the worst case block the producer. That's something you absolutely didn't want, but blocking the producer in the worst case is the only thing that is 100% correct in every contrived case, unless a hypothetical computer has unlimited memory.
Given the situation, I think this theoretical race is an "allowable" thing. If processing a single block really takes that long with so much data steadily coming in, you have a much more serious problem at hand, so practically, it's a non-issue.
If your boss decides, at some point in the future, that you should keep more than 1 hour of backlog, you can enlarge the file and remap, and when the "empty" region is next at the end of the old buffer's size, simply extend the "known" file size, and adjust your max_size value in the producer. The consumer threads don't even need to know. You could of course create another file, copy the data, swap, and keep the consumers blocked in the mean time, but I deem that an inferior solution. It is probably not necessary for a size increase to be immediately visible, but on the other hand it is highly desirable that it is an "invisible" process.
If you put more RAM into the computer, your program will "magically" use it, without you needing to change anything. The operating system will simply keep more pages in RAM. If you add another few consumers, it will still work the same.
[1] Intentionally bigger than what you need, so there are a few "extra" 1 MiB blocks.
[2] Preferably, you can madvise the operating system (if you use a system that has a destructive DONT_NEED hint, such as Linux) that you are no longer interested in the contents before overwriting a region. But if you don't do that, it will work either way, only slightly less efficiently, because the OS will possibly do a read-modify-write operation where a write operation would have been enough.
[3] There is of course never really a guarantee, but it's what will be the case anyway.
54 GB/hour = 15 MB/s. A good SSD these days can write 300+ MB/s. If you keep 1 hour in RAM and then occasionally flush older data to disk, you should be able to handle 10x more than 15 MB/s (provided your search algorithm is fast enough to keep up).
Regarding a fast locking mechanism between your threads, I would suggest looking into RCU (Read-Copy-Update). The Linux kernel uses it to achieve very efficient locking.
Do you have some minimum hardware requirements? 54 GB in memory is perfectly possible these days (many motherboards can take 4x16 GB, and that's not even server hardware). So if you are willing to require an SSD, you could maybe just as well require a lot of RAM and have an in-memory circular buffer as you suggest.
Also, if there's sufficient redundancy in the data, it may be viable to use some cheap compression algorithm (one that is easy on the CPU, i.e. some sort of "level 0" compression). I.e. you don't store the raw data, but some compressed format (and possibly some index) which is decompressed by the readers.
Many good recommendations here. I'd just like to add that for a circular buffer implementation you can have a look at Boost Circular Buffer.

The fastest way to write data while producing it

In my program I am simulating an N-body system for a large number of iterations. For each iteration I produce a set of 6N coordinates which I need to append to a file and then use for executing the next iteration. The code is written in C++ and currently makes use of ofstream's write() method to write the data in binary format at each iteration.
I am not an expert in this field, but I would like to improve this part of the program, since I am in the process of optimizing the whole code. I feel that the latency associated with writing the result of the computation at each cycle significantly slows down the performance of the software.
I'm confused because I have no experience in actual parallel programming and low level file I/O. I thought of some abstract techniques that I imagined I could implement, since I am programming for modern (possibly multi-core) machines with Unix OSes:
Writing the data in the file in chunks of n iterations (there seem to be better ways to proceed...)
Parallelizing the code with OpenMP (how to actually implement a buffer so that the threads are synchronized appropriately, and do not overlap?)
Using mmap (the file size could be huge, on the order of GBs, is this approach robust enough?)
However, I don't know how to best implement them and combine them appropriately.
Of course, writing into a file at each iteration is inefficient and will most likely slow down your computation (as a rule of thumb; it depends on your actual case).
You have to use a producer -> consumer design pattern. They will be linked by a queue, like a conveyor belt.
The producer will try to produce as fast as it can, only slowing if the consumer can't handle it.
The consumer will try to "consume" as fast as it can.
By splitting the two, you can increase performance more easily because each part is simpler and has less interference from the other.
If the producer is faster, you need to improve the consumer, in your case by writing to the file in the most efficient way, most likely chunk by chunk (as you said).
If the consumer is faster, you need to improve the producer, most likely by parallelizing it as you said.
There is no need to optimize both. Only optimize the slowest (the bottleneck).
Practically, you use threads and a synchronized queue between them. For implementation hints, have a look here, especially §18.12 "The Producer-Consumer Pattern".
Regarding flow management, you'll have to add a little more complexity by selecting a "max queue size" and making the producer(s) wait if the queue does not have enough space. Beware of deadlocks then; code it carefully (see the Wikipedia link I gave about that).
Note: it's a good idea to use Boost threads, because raw threads are not very portable (well, they are since C++0x, but C++0x availability is not yet good).
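A minimal sketch of such a bounded hand-off using standard threads (C++11 or later); the snapshot type and the queue limit are illustrative:

    #include <condition_variable>
    #include <cstddef>
    #include <mutex>
    #include <queue>
    #include <vector>

    struct Snapshot { std::vector<double> coords; };   // the 6N coordinates of one iteration

    class BoundedWriterQueue {
    public:
        explicit BoundedWriterQueue(std::size_t maxQueued) : max_(maxQueued) {}

        // Called by the simulation thread after each iteration; blocks only if
        // the writer has fallen more than maxQueued iterations behind.
        void push(Snapshot s) {
            std::unique_lock<std::mutex> lock(m_);
            notFull_.wait(lock, [this] { return q_.size() < max_; });
            q_.push(std::move(s));
            notEmpty_.notify_one();
        }

        // Called by the writer thread; returns false once finish() was called
        // and the queue has been drained.
        bool pop(Snapshot& out) {
            std::unique_lock<std::mutex> lock(m_);
            notEmpty_.wait(lock, [this] { return !q_.empty() || done_; });
            if (q_.empty()) return false;
            out = std::move(q_.front());
            q_.pop();
            notFull_.notify_one();
            return true;
        }

        void finish() {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
            notEmpty_.notify_all();
        }

    private:
        std::queue<Snapshot> q_;
        std::mutex m_;
        std::condition_variable notFull_, notEmpty_;
        std::size_t max_;
        bool done_ = false;
    };

The writer thread then just loops while (queue.pop(s)) and writes each snapshot to the file in one chunk, exiting when the simulation calls finish().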
It's better to split the operation into two independent parts: data producing and file writing. Data producing would use some buffer for iteration-wise data passing, and file writing would use a queue to store write requests. Then, data producing would just post a write request and go on, while file writing would cope with the writing in the background.
Essentially, if the data is produced much faster than it can possibly be stored, you'll quickly end up holding most of it in the buffer. In that case your actual approach seems to be quite reasonable as is, since little can be done programmatically then to improve the situation.
If you don't want to play with doing stuff in different threads, you could try using aio_write(), which allows asynchronous writes. Essentially, you give the OS the buffer to write and the function returns immediately; the OS finishes the write while your program continues, and you can check later to see whether the write has completed.
This solution still suffers from the producer/consumer problem mentioned in other answers: if your algorithm produces data faster than it can be written, eventually you will run out of memory to store the results between the algorithm and the write, so you'd have to try it and see how it works out.
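A sketch of what that looks like with POSIX AIO (aio.h; link with -lrt on older glibc). The file name and buffer sizes are illustrative, and the buffer must stay untouched until the write has completed:

    #include <aio.h>
    #include <cerrno>
    #include <fcntl.h>
    #include <unistd.h>
    #include <vector>

    int main() {
        std::vector<double> coords(6 * 1000);              // one iteration's 6N coordinates
        int fd = open("trajectory.bin", O_WRONLY | O_CREAT | O_APPEND, 0644);

        aiocb cb{};                                        // zero-initialize the control block
        cb.aio_fildes = fd;
        cb.aio_buf    = coords.data();
        cb.aio_nbytes = coords.size() * sizeof(double);
        // With O_APPEND the data is appended regardless of aio_offset.

        if (aio_write(&cb) != 0) return 1;                 // failed to queue the request

        // ... continue computing the next iteration here ...

        while (aio_error(&cb) == EINPROGRESS) { /* still in flight: keep working, or aio_suspend() */ }
        ssize_t written = aio_return(&cb);                 // final status, like write()'s return value
        (void)written;
        close(fd);
    }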
"Using mmap (the file size could be huge, on the order of GBs, is this
approach robust enough?)"
mmap is the OS's method of loading programs, shared libraries and the page/swap file - it's as robust as any other file I/O and generally higher performance.
BUT on most OSes it's bad/difficult/impossible to expand the size of a mapped file while it's being used. So if you know the size of the data, or you are only reading, it's great. For a log/dump that you are continually adding to, it's less suitable - unless you know some maximum size.

What is the definition of realtime, near realtime and batch? Give examples of each?

I'm trying to get a good definition of realtime, near realtime and batch. I am not talking about sync and async, although to me they are different dimensions. Here is what I'm thinking:
Realtime is sync web services or async web services.
Near realtime could be JMS or messaging systems or most event driven systems.
Batch to me is more of a timed system that processes when it wakes up.
Give examples of each and feel free to fix my assumptions.
https://stackoverflow.com/tags/real-time/info
Real-Time
Real-time means that the time of an activity's completion is part of its functional correctness. For example, the sqrt() function's correctness is something like:
The sqrt() function is implemented correctly if, for all x >= 0, sqrt(x) = y implies y^2 == x.
In this setting, the time it takes to execute the sqrt() procedure is not part of its functional correctness. A faster algorithm may be better in some qualitative sense, but no more or less correct.
Suppose we have a mythical function called sqrtrt(), a real-time version of square root. Imagine, for instance, we need to compute the square root of velocity in order to properly execute the next brake application in an anti-lock braking system. In this setting, we might say instead:
The sqrtrt() function is implemented correctly if, for all x >= 0, sqrtrt(x) = y implies y^2 == x, and sqrtrt() returns a result in <= 275 microseconds.
In this case, the time constraint is not merely a performance parameter. If sqrtrt() fails to complete in 275 microseconds, you may be late applying the brakes, triggering either a skid or reduced braking efficiency, possibly resulting in an accident. The time constraint is part of the functional correctness of the routine. Lift this up a few layers, and you get a real-time system as one (at least partially) composed of activities that have timeliness as part of their functional correctness conditions.
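Just to make the idea concrete (this is an illustration of the definition, not how a hard real-time system would actually be verified): a check that treats the deadline as part of the result, using std::sqrt as a stand-in for the hypothetical sqrtrt():

    #include <chrono>
    #include <cmath>

    // Returns true only if the value is right AND it arrived within the deadline.
    bool sqrtWithinDeadline(double x, std::chrono::microseconds deadline, double& y) {
        auto start = std::chrono::steady_clock::now();
        y = std::sqrt(x);                                        // stand-in for sqrtrt()
        auto elapsed = std::chrono::steady_clock::now() - start;

        bool valueOk = std::abs(y * y - x) <= 1e-9 * (x + 1.0);  // numeric correctness
        bool timeOk  = elapsed <= deadline;                      // timeliness is part of correctness
        return valueOk && timeOk;
    }

    // e.g. double y; sqrtWithinDeadline(2.0, std::chrono::microseconds(275), y);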
Near Real-Time
A near real-time system is one in which activities' completion times, responsiveness, or perceived latency when measured against wall-clock time are important aspects of system quality. The canonical example of this is a stock ticker system -- you want to get quotes reasonably quickly after the price changes. For most of us non-high-speed-traders, what this means is that the perceived delay between data being available and our seeing it is negligible.
The difference between "real-time" and "near real-time" is both a difference in precision and magnitude. Real-time systems have time constraints that range from microseconds to hours, but those time constraints tend to be fairly precise. Near-real-time usually implies a narrower range of magnitudes -- within human perception tolerances -- but typically aren't articulated precisely.
I would claim that near-real-time systems could be called real-time systems, but that their time constraints are merely probabilistic:
The stock price will be displayed to the user within 500 ms of its change at the exchange, with probability p > 0.75.
Batch
Batch operations are those which are perceived to be large blocks of computing tasks with only macroscopic, human- or process-induced deadlines. The specific context of computation is typically not important, and a batch computation is usually a self-contained computational task. Real-time and near-real-time tasks are often strongly coupled to the physical world, and their time constraints emerge from demands from physical/real-world interactions. Batch operations, by contrast, could be computed at any time and at any place; their outputs are solely defined by the inputs provided when the batch is defined.
Original Post
I would say that real-time means that the time (rather than merely the correct output) to complete an operation is part of its correctness.
Near real-time is weasel words for wanting the same thing as real-time but not wanting to go to the discipline/effort/cost to guarantee it.
Batch is "near real-time" where you are even more tolerant of long response times.
Often these terms are used (badly, IMHO) to distinguish among human perceptions of latency/performance. People think real-time is real-fast, e.g., milliseconds or something. Near real-time is often seconds or milliseconds. Batch is a latency of seconds, minutes, hours, or even days. But I think those aren't particularly useful distinctions. If you care about timeliness, there are disciplines to help you get that.
I'm curious for feedback myself on this. Real-time and batch are well defined and covered by others (though be warned that they are terms-of-art with very specific technical meanings in some contexts). However, "near real-time" seems a lot fuzzier to me.
I favor (and have been using) "near real-time" to describe a signal-processing system which can 'keep up' on average, but lags sometimes. Think of a system processing events which only happen sporadically... Assuming it has sufficient buffering capacity and the time it takes to process an event is less than the average time between events, it can keep up.
In a signal processing context:
- Real-time seems to imply a system where processing is guaranteed to complete with a specified (short) delay after the signal has been received. A minimal buffer is needed.
- Near real-time (as I have been using it) means a system where the delay between receiving and completion of processing may get relatively large on occasion, but the system will not (except under pathological conditions) fall behind so far that the buffer gets filled up.
- Batch implies post-processing to me. The incoming signal is just saved (maybe with a bit of real-time pre-processing) and then analyzed later.
This gives the nice framework of real-time and near real-time being systems where they can (in theory) run forever while new data is being acquired... processing happens in parallel with acquisition. Batch processing happens after all the data has been collected.
Anyway, I could be conflicting with some technical definitions I'm unaware of... and I assume someone here will gleefully correct me if needed.
There are issues with all of these answers in that the definitions are flawed. For instance, "batch" simply means that transactions are grouped and sent together. Real time implies transactional, but may also have other implications. So when you combine batch in the same attribute as real time and near real time, clarity of purpose for that attribute is lost. The definition becomes less cohesive, less clear. This would make any application created with the data more fragile. I would guess that practitioners would be better off with a clearly modeled taxonomy such as:
Attribute1: Batched (grouped) or individual transactions.
Attribute2: Scheduled (time-driven), event-driven.
Attribute3: Speed per transaction. For batch that would be the average speed/transaction.
Attribute4: Protocol/Technology: SOAP, REST, combination, FTP, SFTP, etc. for data movement.
Attributex: Whatever.
Attribute4 is more related to something I am doing right now, so you could throw that out or expand the list for what you are trying to achieve. For each of these attribute values, there would likely be additional, specific attributes. But to bring the information together, we need to think about what is needed to make the collective data useful. For instance, what do we need to know between batched and transactional flows to make them useful together? For example, you may consider attributes for each to provide the ability to understand total throughput for a given time period. It seems funny how we may create conceptual, logical, and physical data models (hopefully) for our business clients, but we don't always apply that kind of thought to how we define terminology in our discussions.
Any system in which the time at which output is produced is significant. This is usually because the input corresponds to some movement in the physical environment or world, and the output has to relate to that same movement. The lag from input time to output time must be sufficiently small for acceptable timeliness.