Should threads act on separate memory?

Should threads act on separate memory? - c++

I have a C++ program whose task is to analyse a stream of binary data (typically it's a file on disk) and extract some information. This task is "memory-less", meaning that the outcome of each step is independent of the previous one. Because of this, I thought to speed it up by giving the data to separate threads in order to improve performance.
For now, the data is read in blocks of 1GB at a time and saved in an array to avoid I/O bottlenecks. Should I separate the data in n chunks/arrays (where n is the number of threads) or is a single array accessed by multiple threads not an issue?
I have a C++ program whose task is to analyse a stream of binary data (typically it's a file on disk) and extract some information. This task is "memory-less", meaning that the outcome of each step is independent of the previous one. Because of this, I thought to speed it up by giving the data to separate threads in order to improve performance.
For now, the data is read in blocks of 1GB at a time and saved in an array to avoid I/O bottlenecks. Should I separate the data in n chunks/arrays (where n is the number of threads) or is a single array accessed by multiple threads not an issue?
EDIT 1: data and anlaysis specification
I realize that the wording of the problem might be too broad, as pointed out by one of the comments. I will try to go into a bit more detail.
The data being analysed is a series of unsigned 64bits integers generated by a so-called "time-to-digital" converter (TDC), storing a timestamp information about some event they register. My TDC has multiple channels, so each timestamp has information about which channel triggered (first 3 bits), whether that was a rising or falling edge trigger (4th bit), and the actual time (in clock ticks since powering up the TDC, last 60 bits).
The timestamps, of course, are saved in the file chronologically. The task is finding coincidence events between channels within a certain time window which the user sets. So you keep reading the timestamps and when you find two in the channels of interest whose distance in time is less than the set one, you increase the number of coincidence events.
These files can be quite big (tens of GB) and the number of timestamps enormous (one clock tick is 80 picoseconds).
For now I only go through the whole file once, and the idea was to "cut it" in smaller pieces that would then be analysed by different threads. The possible loss of events between the cuts is acceptable for me, since, at most will be 2 over hundreds of thousands.
Of course, they would only read data from the file/memory. I can write the coincidence counts in three separate variables and then sum them when all threads finish, if this helps avoiding sync problems.
I hope now things are clearer.

Yes, the same array can be accessed by multiple threads: if threads only read the array (which seems to be the case here), you won't have false sharing effects.
And to optimize cache use, you could make each thread read consecutive elements of the array (i.e not interleave reads between threads).
As a side-note, you may want to reconsider the 1GB block : that's a lot! Have you measured that it's better than, say, 1MB or 10KB ?
You also might want to parallelize "file reading" (one small chunk at a time) and "processing the content that was read" (using many threads as you do), using (at least) 2 arrays (one is being processed, the other will receive the next read)

Related

Read large CSV file in C++(~4GB)

I want to read and store a large CSV file into a map. I started by just reading the file and seeing how long it takes to process. This is my loop:
while(!gFile.eof()){
gFile >> data;
}
It is taking me ~35 mins to process the csv file that contains 35 million lines and six columns. Is there any way to speed this up? Pretty new to SO, so apologies if not asking correctly.

Background
Files are stream devices or concepts. The most efficient usage of reading a file is to keep the data streaming (flowing). For every transaction there is an overhead. The larger the data transfer, the less impact the overhead has. So, the goal is to keep the data flowing.
Memory faster than file access
Search memory is many times faster than searching a file. So, searching for a "word" or delimiter is going to be faster than reading a file character by character to find the delimiter.
Method 1: Line by line
Using std::getline is much faster than using operator>>. Although the input code may read a block of data; you are only performing one transaction to read a record versus one transaction per column. Remember, keep data flowing and searching memory for the columns is faster.
Method 2: Block reading
In the spirit of keeping the stream flowing, read a block of memory into a buffer (large buffer). Process the data from the buffer. This is more efficient than reading line by line because you can read in multiple lines of data with one transaction, reducing the overhead of a transaction.
One caveat is that you may have a record cross buffer boundaries, so you'll need to come up with an algorithm to handle that. The execution penalty is small and only happens once per transaction (consider this part of the overhead of a transaction).
Method 3: Multiple threads
In the spirit of keeping the data streaming, you could create multiple threads. One thread is in charge or reading the data into a buffer while another thread processes the data from the buffer. This technique will have better luck keeping the data flowing.
Method 4: Double buffering & multiple threads
This takes Method 3 above and adds multiple buffers. The reading thread can fill up one buffer then start filling a second buffer. The data processing thread will wait until the first buffer is filled before processing the data. This technique is used to better match the speed of reading data to the speed of processing the data.
Method 5: Memory mapped files
With a memory mapped file, the operating system handles the reading of the file into memory on demand. Less code that you have to write, but you don't get as much control over when the file is read into memory. This is still faster than reading field by field.

Lets start with the bottlenecks.
Reading from disk
Decoding the data
Store in map
Memory speed
Amount of memory
Read from disk
Read till you drop, if you can't read fast enough to use all
bandwidth on the disk you can go faster. Ignore all other steps and only read.
Start by adding buffers to your instream
Set hints for reading
use mmap
4GB is a trivial size, if you don't already have 32 GB upgrade
Too slow buy M.2 disk.
Still to slow then more exotic, change disk driver, dump OS. Mirror disks, only you $£€ is the limit.
Decode the data
if you data is in lines where are all the same length then all decodes can be done in parallel, limited only by memory bandwidth.
if the line lengths only wary a little the find end of line can be done in parallel followed by parallel decode.
if the order of the lines doesn't matter for the final map just split the file in #hardwarethreads parts and let each process their part until the first newline in the next threads part.
memory bandwidth will most likely be reach far before the CPU is anyway near used up.
Store in map
hopefully you have thought about this map in advance as none of the std maps are thread safe.
if you don't care about order a std::array can be used and you can run at full memory bandwidth.
lets say you want to use std::unordered_map, there is a problem that it needs to update the size after each write, so effectively your are limited to 1 thread writing to it.
You could use 1 thread at a time to update while the other precompute the hash of the record.
having one thread write has the problem that nearly every write will be a cache miss severely limiting speed.
so if that is not fast enough roll your own hash_map, without a size that must be updated every write.
to ensure thread safety you also need to protect the write, having one mutex makes you as slow or slower than the single writer.
you could try to make it lock and wait free ... if your not an expert you will get severe headache instead.
if you have selected a bucket design for you hash then you could make X times number of writer threads mutexes, use the hash value to select the mutex. The extra mutexes increase the likelihood that two threads won't collide.
Memory speed
Each line will be transferred at least 4 times over the memory bus, once from the disk to ram (at least once more if the driver is not good), once when the data is decoded, once when the map makes a reads request, and one more for when the map writes.
A good setup can save one more memory access if the driver writes to cache and therefore decode not will result in a LLC-miss.
Amount of memory
you should have enough memory to hold the total file, the data structure and some intermediate data.
Check if RAM is cheaper than your programming time.

How to write data into a buffer and write the buffer into a binary file with a second thread?

I am getting data from a sensor(camera) and writing the data into a binary file. The problem is it takes lot of space on the disk.
So, I used the compression from boost (zlib) and the space reduced a lot! The problem is the compression process is slow and lots of data is missing.
So, I want to implement two threads, with one getting the data from the camera and writing the data into a buffer. The second thread will take the front data of the buffer and write it into the binary file. And in this case, all the data will be present.
How do I implement this buffer? It needs to expand dynamically and pop_front. Shall I use std::deque, or does something better already exist?

First, you have to consider these four rates (or speeds):
Speed of Production (SP): The average number of bytes your sensor produces per second.
Speed of Compression (SC): The average number of bytes per second you can compress. This is the number of input bytes to the compression algorithm.
Rate of Compression (RC): The average ratio of compressed data to uncompressed data your compress algorithm produces (ratio of size of output to the input of compression.) (This is obviously somewhere between 0 and 1.)
Speed of Writing (SW): The average number of bytes you can write to disk, per second.
If SC is less than SP, you are in trouble. It means you can't compress all the data you gather from your sensor, in real time. Which means you'll eventually run out of buffer memory. You'll have to find a faster compression algorithm, or dedicate more CPU cores to compression.
If SW is less than SP times RC (which is the size of sensor data after compression,) you are again in trouble. It means you can't write out your output data as fast as you are producing and compressing them, and again, you will eventually run out of buffer memory, no matter how much you have. You might be able to gain some speed by adopting a better write strategy or file system, but a real gain in SW comes from a better disk system (RAID, SSD, better hardware, etc.)
Now, if everything is OK speed-wise, you can probably employ something like the following architecture to read, compress and write the data out:
You'll have three threads (or two, described later) that do one part of the pipeline each. You'll also have two thread-safe queues, one for communication from each stage of the pipeline to the next.
Assuming the two queues are named Q1 and Q2, the high-level operation of the threads will look like this:
Input Thread:
Read K bytes of sensor data
Put the whole K bytes as a unit on Q1.
Go to 1.
Compression Thread:
Wait till there is something on Q1.
Pop one buffer of data (probably K bytes) from Q1.
Compress the buffer into a hopefully smaller buffer and put it on Q2.
Go to 1.
Output Thread:
Wait till there is something on Q2.
Pop one buffer of data from Q2.
Write the buffer to the output file.
Go to 1.
The most CPU-intensive part of the work is in the second thread, and the other two probably don't consume much CPU time and therefore probably can share a CPU core. This means that the above strategy may be runnable on two cores. But it can also run on a single core if the workload is light, or require many many cores. That all depends on the four rates I described up top.
Using asynchronous writes (e.g. IOCP on Windows or epoll on Linux,) you can drop the third thread and the second queue altogether. Then your second thread needs to execute something like this:
Wait till there is something on Q1.
Pop one buffer of data (probably K bytes) from Q1.
Compress the buffer into a hopefully smaller buffer.
Issue an asynchronous write request to the OS to write out the compressed buffer to disk.
Go to 1.
There are four more issues worth mentioning:
K should be selected so that the time required for various (usually constant time) activities associated with allocating a buffer, pushing it into and popping it from a thread-safe queue, starting a compression run and issuing a write request into a file become negligible relative to doing the actual work (reading sensor data, compressing bytes and writing to disk.) This usually means that K needs to be as large as possible. But if K is very large (many megabytes or hundreds of megabytes) then if your application crashes, you'll lose a lot of data. You need to find a balance between performance and risk of data loss. I suggest (without any knowledge of your specific needs and constraints) a value between 10KiB to 1MiB for K.
Implementing a thread-safe queue is easy if you have some knowledge and experience with concurrent/parallel programming, but rather hard and error-prone if you do not. Finding good examples and implementations should not be hard. A normal std::deque or std::list or std::anything won't be usable by itself, but can used as a good basis for writing a thread-safe queue.
Note that you are queuing buffers of data, not individual numbers or bytes. If you pass your data one number at a time through this pipeline, it will be painfully slow and wasteful.
Some compression algorithms are limited in how much data they can consume in each invocation, or that you must sync the output of each one call to compression routine with one call to the decompression routine later on. These might affect the choice of K, and also how you write your output file. You might have to add some metadata so that you can be able to actually decompress and read the data later.

Multi threaded reading from a file in c++?

My application uses text file to store data to file.
I was testing for the fastest way of reading it by multi threading the operation.
I used the following 2 techniques:
Use as many streams as NUMBER_OF_PROCESSORS environment variable. Each stream is on a different thread. Divide total no of lines in file equally for each stream. Parse the text.
Only one stream parses the entire file and loads the data in memory. Create threads (= NUMBER_OF_PROCESSORS - 1) to parse data from memory.
The test was run on various file sizes 100kB - 800MB.
Data in file:
100.23123 -42343.342555 ...(and so on)
4928340 -93240.2 349 ...
...
The data is stored in 2D array of double.
Result: Both methods take approximately the same time for parsing the file.
Question: Which method should I choose?
Method 1 is bad for the Hard disk as multiple read access are performed at random locations simultaneously.
Method 2 is bad because memory required is proportional to file size. This can be partially overcome by limiting the container to a fixed size, deleting the parsed content and fill it again from the reader. But this increases the processing time.

Method 2 has a sequential bottleneck (the single-threaded reading and handing out of the work items). This will not scale indefinitely according to Amdahls Law. It is a very fair and reliable method, though.
Method 1 has not bottleneck and will scale. Be sure to not cause random IO on the disk. I'd use a mutex to have only one thread read at a time. Read in big sequential block of maybe 4-16MB. In the time the disk does a single head seek it could have read about 1MB of data.
If parsing the lines takes a considerable amount of time, you can't use method 2 because of the big sequential part. It would not scale. If parsing is fast, though, use method 2 because it is easier to get right.
To illustrate the concept of a bottleneck: Imagine 1.000.000 computation threads asking one reader thread to give them lines. That one reader thread would not be able to keep up handing out lines as quickly as they are demanded. You would not get 1e6 times the throughput. This would not scale. But if 1e6 threads read independently from a very fast IO device, you would get 1e6 times the throughput because there is no bottleneck. (I have used extreme numbers to make the point. The same idea applies in the small.)

I'd prefer slightly modified 2 method. I would read data sequentally in single thread by big chunks. Ready chunk is passed to a thread pool where data is processed. So you have concurrent reading & processing

With enough RAM you can do it without single-thread bottleneck. For Linux:
1) mmap you whole file to RAM with MAP_LOCKED, requires root or system wide permissions tune. Or without MAP_LOCKED for SSD, they handle random access well.
2) give each thread a start position. Process data from first newline after self start position to first newline after next thread start position.
PS What is your program CPU load? Probably HDD is the bottleneck.

CPU Cores not Utilized properly using QThreads

Using : C++ (MinGW), Qt4.7.4, Vista (OS), intel core2vPro
I need to process 2 huge files in exactly the same way. So i would like to call the processing routine from 2 separate threads for 2 separate files. The GUI thread does nothing heavy; just displays a label and runs an event loop to check for emission of thread termination conditions and quits the main Application accordingly. I expected this to utilize the two cores (intel core2) somewhat equally, but on the contrary i see from Task Manager that one of the cores is highly utilized and the other is not (though not every time i run the code); also the time taken to process the 2 files is much more than the time taken to process one file (i thought it should have been equal or a little more but this is almost equal to processing the 2 files one after another in a non-threaded application). Can i somehow force the threads to use the cores that i specify?
QThread* ptrThread1=new QThread;
QThread* ptrThread2=new QThread;
ProcessTimeConsuming* ptrPTC1=new ProcessTimeConsuming();
ProcessTimeConsuming* ptrPTC2=new ProcessTimeConsuming();
ptrPTC1->moveToThread(ptrThread1);
ptrPTC2->moveToThread(ptrThread2);
//make connections to specify what to do when processing ends, threads terminate etc
//display some label to give an idea that the code is in execution
ptrThread1->start();
ptrThread2->start(); //i want this thread to be executed in the core other than the one used above
ptrQApplication->exec(); //GUI event loop for label display and signal-slot monitoring

Reading in parallel from a single mechanical disk often times (and probably in your case) will not yield any performance gain, since the mechanical head of the disk needs to spin every time to seek the next reading location, effectively making your reads sequential. Worse, if a lot of threads are trying to read, the performance may even degrade with respect to the sequential version, because the disk head is bounced to different locations of the disk and thus needs to spin back where it left off every time.
Generally, you cannot do better than reading the files in a sequence and then processing them in parallel using perhaps a producer-consumer model.

With mechanical hard drives, you need to explicitly control the ratio of time spent doing sequential reads vs. time spent seeking. The canonical way of doing it is with n+m objects running on m+min(n, QThread::idealThreadCount()) threads. Here, m is the number of hard drives that the files are on, and n is the number of files.
Each of m objects is reading files from given hard drive in a round robin fashion. Each read must be sufficiently large. On modern hard drives, let's budget 70Mbytes/s of bandwidth (you can benchmark the real value), 5ms for a seek. To waste at most 10% of the bandwidth, you only have 100ms or 100ms/(5ms/seek)=20 seeks per second. Thus you must read at least 70Mbytes/(20seeks+1)=3.3 Megabytes from each file before reading from the next file. This thread fills a buffer with file data, and the buffer then signals the relevant computation object that is attached to the other side of the buffer. When a buffer is busy, you simply skip reading from given file until the buffer becomes available again.
The other n objects are computation objects, they perform a computation upon a signal from a buffer that indicates the buffer is full. As soon as the buffer data is not needed anymore, the buffer is "reset" so that the file reader can refill it.
All reader objects need their own threads. The computation objects can be distributed among their own threads in a round-robin fashion, so that the threads all have within +1, -0 objects of each other.

I thought my empirical data might be of some use to this discussion. I have a directory with 980 txt files that I would like to read. In the Qt/C++ framework and running on an Intel i5 quad core, I created a GUI Application and added a class worker to read a file given its path. I pushed the worker into a thread, then repeated adding an additional thread each run. I timed roughly 13 mins with 1 thread, 9 minutes with 2, and 8 minutes with 3. So, in my case there was some benefit, but it degraded quickly.

What is the best way to determine the number of threads to fire off in a machine with n cores? (C++)

I have a vector<int> with 10,000,000 (10 million) elements, and that my workstation has four cores. There is a function, called ThrFunc, that operates on an integer. Assume that the runtime for ThrFunc for each integer in the vector<int> is roughly the same.
How should I determine the optimal number of threads to fire off? Is the answer as simple as the number of elements divided by the number of cores? Or is there a more subtle computation?
Editing to provide extra information
No need for blocking; each function invocation needs only read-only
access

The optimal number of threads is likely to be either the number of cores in your machine or the number of cores times two.
In more abstract terms, you want the highest possible throughput. Getting the highest throughput requires the fewest contention points between the threads (since the original problem is trivially parallelizable). The number of contention points is likely to be the number of threads sharing a core or twice that, since a core can either run one or two logical threads (two with hyperthreading).
If your workload makes use of a resource of which you have fewer than four available (ALUs on Bulldozer? Hard disk access?) then the number of threads you should create will be limited by that.
The best way to find out the correct answer is, with all hardware questions, to test and find out.

Borealid's answer includes test and find out, which is impossible to beat as advice goes.
But there's perhaps more to testing this than you might think: you want your threads to avoid contention for data wherever possible. If the data is entirely read-only, then you might see best performance if your threads are accessing "similar" data -- making sure to walk through the data in small blocks at a time, so each thread is accessing data from the same pages over and over again. If the data is completely read-only, then there is no problem if each core gets its own copy of the cache lines. (Though this might not make the most use of each core's cache.)
If the data is in any way modified, then you will see significant performance enhancements if you keep the threads away from each other, by a lot. Most caches store data along cache lines, and you desperately want to keep each cache line from bouncing among CPUs for good performance. In that case, you might want to keep the different threads running on data that is actually far apart to avoid ever running into each other.
So: if you're updating the data while working on it, I'd recommend having N or 2*N threads of execution (for N cores), starting them with SIZE/N*M as their starting point, for threads 0 through M. (0, 1000, 2000, 3000, for four threads and 4000 data objects.) This will give you the best chance of feeding different cache lines to each core and allowing updates to proceed without cache line bouncing:
+--------------+---------------+--------------+---------------+--- ...
| first thread | second thread | third thread | fourth thread | first ...
+--------------+---------------+--------------+---------------+--- ...
If you're not updating the data while working on it, you might wish to start N or 2*N threads of execution (for N cores), starting them with 0, 1, 2, 3, etc.. and moving each one forward by N or 2*N elements with each iteration. This will allow the cache system to fetch each page from memory once, populate the CPU caches with nearly identical data, and hopefully keep each core populated with fresh data.
+-----------------------------------------------------+
| 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 ... |
+-----------------------------------------------------+
I also recommend using sched_setaffinity(2) directly in your code to force the different threads to their own processors. In my experience, Linux aims to keep each thread on its original processor so much it will not migrate tasks to other cores that are otherwise idle.

Assuming ThrFunc is CPU-bound then you want probably one thread per core, and divide the elements between them.
If there's an I/O element to the function then the answer is more complicated, because you can have one or more threads per core waiting for I/O while another is executing. Do some tests and see what happens.

I agree with the previous comments. You should run tests to determine what number yields the best performance. However, this will only yield the best performance for the particular system you're optimizing for. In most scenarios, your program will be run on other people's machines, on the architecture of which you should not make too many assumptions.
A good way to numerically determine the number of threads to start would be to use
std::thread::hardware_concurrency()
This is part of the C++11 and should yield the number of logical cores in the current system. Logical cores means either the physical number of cores - in case the processor does not support hardware threads (ie HyperThreading) - or the number of hardware threads.
There's also a Boost-function that does the same, see Programmatically find the number of cores on a machine.

The optimal number of threads should equal the number of cores, in which situation the computation capacity of each core will be fully utilized, if the computation on each element is independently.

The optimal number of cores (threads) will probably be determined by when you achieve saturation of the memory system (caches and RAM). Another factor that could come into play is that of inter-core locking (locking a memory area that other cores might want to access, updating it and then unlocking it) and how efficient it is (how long the lock is in place and how often it is locked/unlocked).
A single core running a generic software whose code and data are not optmized for multi-core will come close to saturating memory all by itself. Adding more cores will, in such a scenario, result in a slower application.
So unless your code economizes heavily on memory accesses I'd guess the answer to your question is one (1).

I've found a real world example I'll put here for the ones who want a less technical / more intuitional answer:
Having multiple threads per core is like having two queues in an airport for each scanner(which people on both queues eventually have to pass through).
Two people at a time can put their baggage on the conveyer belt, but only one at a time can pass through the scanner. Now at this point, obviously there's a contention point at the entrance of the scanner, but what happens in reality is most of the times both queues function very well.
In this example, the queues represent threads and the scanner is the main functions of a core. As a general rule of thumb, the impact of each thread is 1.25th a core, i.e., it's not like having an entire new core. So if the task is CPU-bound slightly over the number of available processors is probably best.
But notice that if the task is IO-Bound, where threads will be spending most of their time waiting for external resources such as database connections, file systems, or other external sources of data, then you can assign (many) more threads than the number of available processors.
Source1, Source2

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js