Excessive thread count yields better results with file reading - c++

I have a hundred million files, and my program reads all of them at every startup. I have been looking for ways to make this process faster, and along the way I encountered something strange. My CPU has 4 physical cores, yet reading these files with far higher thread counts yields much better results, which is surprising given that creating more threads than the CPU's logical core count should be somewhat pointless.
8 Threads: 29.858 s
16 Threads: 15.882 s
32 Threads: 9.989 s
64 Threads: 7.965 s
128 Threads: 8.275 s
256 Threads: 8.159 s
512 Threads: 8.098 s
1024 Threads: 8.253 s
4096 Threads: 8.744 s
16001 Threads: 10.033 s
Why might this occur? Is it some disk bottleneck?
I did the homework and profiled the code: literally 95% of the runtime consists of read(), open() and close().
I am reading the first 4096 bytes of every file (my page size).
Ubuntu 18.04
Intel i7 6700HQ
Samsung 970 Evo Plus NVMe SSD
GCC/G++ 11

Why might this occur?
If you open one file at "/a/b/c/d/e" then read one block of data from the file; the OS may have to fetch directory info for "/a", then fetch directory info for "/a/b", then fetch directory info for "/a/b/c", then... It might add up to a total of 6 blocks fetched from disk (5 blocks of directory info then one block of file data), and those blocks might be scattered all over the disk.
If you open 100 million files and read one block of file data from each, this might involve fetching 600 million things (500 million pieces of directory info, and 100 million pieces of file data).
What is the optimal order to do these 600 million things?
Often there are directory info caches and file data caches involved (and any request that can be satisfied by data that's already cached should be done ASAP, before that data is evicted from the cache(s) to make room for other data). Often the disk hardware also has rules (e.g. it is faster to access all blocks within the same "group of disk blocks" before switching to the next "group of disk blocks"). Sometimes there's parallelism in the disk hardware (e.g. two requests from the same zone can't be done in parallel, but two requests from different zones can).
The optimal order in which to do these 600 million things is something the OS can figure out.
More specifically: the OS can figure out the optimal order if, and only if, it actually knows about all of the requests.
If you have, say, 8 threads that each send one request (e.g. to open a file) and then block (using no CPU time) until the pending request completes, then the OS only knows about at most 8 requests at a time. In other words, the operating system's ability to optimize the order in which file IO requests are performed is constrained by the number of pending requests, which is constrained by the number of threads you have.
Ideally, a single thread would be able to ask the OS to "open all the files in this list of a hundred million files" so that the OS could fully optimize the order (with the least thread management overhead). Sadly, most operating systems don't support anything like this (e.g. POSIX asynchronous IO fails to support any kind of "asynchronous open").
Having a large number of threads (all blocked and using no CPU time while they wait for their requests to actually be done by the file system and/or disk driver) is the only way to improve the operating system's ability to optimize the order of IO requests.
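For illustration, here is a minimal sketch of the pattern the question describes (assumed code, not the asker's program): the file list is split across N worker threads, and each thread blocks in open()/read()/close() for its slice, so the kernel sees up to N pending IO requests at once and can reorder them.

```cpp
// Minimal sketch (assumed structure, not the asker's code): spread the file
// list across N threads; each thread blocks in open()/read()/close(), so the
// kernel has up to N IO requests pending at once and can reorder them.
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

static void read_first_page(const std::vector<std::string>& files,
                            std::size_t begin, std::size_t end)
{
    char buf[4096];
    for (std::size_t i = begin; i < end; ++i) {
        int fd = open(files[i].c_str(), O_RDONLY);
        if (fd < 0)
            continue;                       // skip files we cannot open
        (void)read(fd, buf, sizeof buf);    // first 4096 bytes only
        close(fd);
    }
}

void read_all(const std::vector<std::string>& files, std::size_t numThreads)
{
    std::vector<std::thread> workers;
    const std::size_t chunk = (files.size() + numThreads - 1) / numThreads;
    for (std::size_t t = 0; t < numThreads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end   = std::min(files.size(), begin + chunk);
        if (begin >= end)
            break;
        workers.emplace_back(read_first_page, std::cref(files), begin, end);
    }
    for (auto& w : workers)
        w.join();
}
```

With only 4 physical cores, most of these threads are blocked at any given moment, which is exactly why thread counts far above the core count still help here.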

Related

Latency jitters when using shared memory for IPC

I am using shared memory for transferring data between two processes, using boost::interprocess::managed_shared_memory to allocate a vector as a buffer and atomic variables for enforcing memory synchronization (similar to boost::lockfree::spsc_queue).
I was measuring the end-to-end latency for the setup with 2 processes:
Sender process: writes to the buffer in shared memory, then sleeps, so it pushes data at intervals of around 55 microseconds.
Receiver process: runs a busy loop to see if something can be consumed from the buffer.
I am using a RingBuffer of size 4K (high for safety), although ideally a maximum of 1 element will be present in the buffer with the current setup. Also, I am pushing data around 3 million times to get a good estimate of the end-to-end latency.
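For context, a minimal sketch of the kind of single-producer/single-consumer ring buffer being described (an assumed layout that would be placed inside the shared segment, not the asker's actual boost::interprocess code):

```cpp
// Sketch of an SPSC ring buffer (assumed layout, not the asker's code).
// For two processes, this struct would live inside the shared memory segment.
// One producer advances 'tail', one consumer advances 'head'; the
// acquire/release pairs provide the memory synchronization mentioned above.
#include <atomic>
#include <cstddef>
#include <cstdint>

struct SpscRing {
    static constexpr std::size_t kCapacity = 4096;   // "size 4K" as in the question
    std::atomic<std::size_t> head{0};                // next slot to read
    std::atomic<std::size_t> tail{0};                // next slot to write
    uint64_t slots[kCapacity];

    bool push(uint64_t value) {                      // called by the sender only
        std::size_t t = tail.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % kCapacity;
        if (next == head.load(std::memory_order_acquire))
            return false;                            // full
        slots[t] = value;
        tail.store(next, std::memory_order_release);
        return true;
    }

    bool pop(uint64_t& value) {                      // called by the receiver only
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;                            // empty
        value = slots[h];
        head.store((h + 1) % kCapacity, std::memory_order_release);
        return true;
    }
};
```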
To measure the latency, I get the current time in nanoseconds and store it in a vector (resized to 3 million elements at the beginning). I have a 6-core setup with isolated CPUs, and I use taskset to pin the sender and receiver processes to different cores. I also make sure no other program of mine is running on the machine while doing this testing. Output of /proc/cmdline:
initrd=\initramfs-linux-lts.img root=PARTUUID=cc2a533b-d26d-4995-9166-814d7f59444d rw isolcpus=0-4 intel_idle.max_cstate=0 idle=poll
I have already verified that all data transfer is accurate and nothing is lost, so a simple row-wise subtraction of the timestamps is sufficient to get the latency.
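For concreteness, the measurement scheme might look roughly like this sketch (the clock choice and the single-process simulation are assumptions; in the real setup the two timestamps are taken in the sender and receiver processes):

```cpp
// Sketch of the measurement scheme (assumed code): each sample gets a send
// timestamp and a receive timestamp in nanoseconds; latency is the row-wise
// difference computed after the run.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

static uint64_t now_ns() {
    using namespace std::chrono;
    return duration_cast<nanoseconds>(
               steady_clock::now().time_since_epoch()).count();
}

int main() {
    constexpr std::size_t kSamples = 3'000'000;
    std::vector<uint64_t> sent(kSamples), received(kSamples);  // pre-sized

    for (std::size_t i = 0; i < kSamples; ++i) {
        sent[i] = now_ns();      // in the real setup: stamped by the sender
        // ... value travels through the shared-memory ring buffer ...
        received[i] = now_ns();  // in the real setup: stamped by the receiver
    }

    std::vector<uint64_t> latency(kSamples);
    for (std::size_t i = 0; i < kSamples; ++i)
        latency[i] = received[i] - sent[i];   // simple row-wise subtraction
}
```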
I am getting a latency of around 300-400 nanoseconds as the mean and median of the distribution, but the standard deviation was too high (a few thousand nanoseconds). Looking at the numbers, I found that there are 2-3 instances where the latency shoots up to 600000 nanoseconds and then gradually comes down (in steps of around 56000 nanoseconds - probably queueing is happening and consecutive pops from the buffer are successful). Attaching a sample "jitter" here:
568086 511243 454416 397646 340799 284018 227270 170599 113725 57022 396
If I filter out these jittery data points, the standard deviation becomes much smaller. So I went digging into what the reason could be. Initially I looked for a pattern, or whether it occurs periodically, but it does not seem so in my opinion.
I ran the receiver process with perf stat -d, and it clearly shows the number of context switches to be 0.
Interestingly, when looking at the receiver process's /proc/${pid}/status, I monitor
voluntary_ctxt_switches and nonvoluntary_ctxt_switches and see that nonvoluntary_ctxt_switches increases at a rate of around 1 per second, while voluntary_ctxt_switches stays constant once the data sharing starts. But the problem is that over the roughly 200 seconds of my setup's runtime, the number of latency spikes is only around 2 or 3 and does not match the frequency of these context-switch numbers. (What does this count represent, then?)
I also followed a thread which feels relevant, but couldn't get anything out of it.
For the core running the receiver process (core 1), the context-switch trace is shown below (but the number of spikes this time was 5):
$ grep " 1)" trace | grep "=>"
1) jemallo-22010 => <idle>-0
2) <idle>-0 => kworker-138
3) kworker-138 => <idle>-0
I also checked the difference between /proc/interrupts before and after the run of the setup.
The differences are:

name                                receiver_core   sender_core
enp1s0f0np1-0                       2               0
eno1                                0               3280
Non-maskable interrupts             25              25
Local timer interrupts              2K              ~3M
Performance monitoring interrupts   25              25
Rescheduling interrupts             9               12
Function call interrupts            120             110
machine-check polls                 1               1
I am not exactly sure what most of these numbers represent, but I am curious why there are rescheduling interrupts, and what enp1s0f0np1-0 is.
It might be that the spikes are not caused by context switches in the first place, but a number in the range of 600 microseconds does hint towards that. Leads in any other direction would be very helpful. I have also tried restarting the server.
It turns out the problem was indeed not related to context switches.
I was also dumping the received data to a file. Stopping that recording removed the spikes. So the high latency was due to some kind of write flush happening.

Identify the reason for a 200 ms freeze in a time-critical loop

New description of the problem:
I currently run our new data acquisition software in a test environment. The software has two main threads. One contains a fast loop which communicates with the hardware and pushes the data into a dual buffer. Every few seconds, this loop freezes for 200 ms. I did several tests, but none of them let me figure out what the software is waiting for. Since the software is rather complex and the test environment could also interfere with the software, I need a tool/technique to test what the recorder thread is waiting for while it is blocked for 200 ms. What tool would be useful to achieve this?
Original question:
In our data acquisition software, we have two threads that provide the main functionality. One thread is responsible for collecting the data from the different sensors, and a second thread saves the data to disc in big blocks. The data is collected in a double buffer. It typically contains 100000 bytes per item and collects up to 300 items per second. One buffer is used to write to in the data collection thread, and one buffer is used to read the data and save it to disc in the second thread. If all the data has been read, the buffers are switched. The switch of the buffers seems to be a major performance problem. Each time the buffer switches, the data collection thread blocks for about 200 ms, which is far too long. However, once in a while the switching is much faster, taking nearly no time at all. (Test PC: Windows 7 64 bit, i5-4570 CPU @ 3.2 GHz (4 cores), 16 GB DDR3 (800 MHz).)
My guess is that the performance problem is linked to the data being exchanged between cores. Only if the threads happen to run on the same core would the exchange be much faster. I thought about setting the thread affinity mask to force both threads onto the same core, but that also means I lose real parallelism. Another idea was to let the buffers collect more data before switching, but this dramatically reduces the update frequency of the data display, since it has to wait for the buffer switch before it can access the new data.
My question is: Is there a technique to move data from one thread to another which does not disturb the collection thread?
Edit: The double buffer is implemented as two std::vectors which are used as ring buffers. A bool (int) variable is used to tell which buffer is the active write buffer. Each time the double buffer is accessed, the bool value is checked to know which vector should be used. Switching the buffers in the double buffer just means toggling this bool value. Of course, during the toggling, all reading and writing is blocked by a mutex. I don't think that this mutex could possibly be blocking for 200 ms. By the way, the 200 ms delay is very reproducible for each switch event.
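For reference, a minimal sketch of the double buffer described in this edit (assumed structure, simplified to push_back instead of the ring-buffer indexing; not the actual code):

```cpp
// Minimal sketch of the double buffer described above (assumed structure).
#include <mutex>
#include <vector>

struct Item { char data[100000]; };             // ~100000 bytes per item

class DoubleBuffer {
public:
    void push(const Item& item) {               // data collection thread
        std::lock_guard<std::mutex> lock(mutex_);
        buffers_[writeIndex_].push_back(item);
    }

    // Saving thread: toggle the flag and take the filled buffer for the disc writer.
    std::vector<Item> takeFilledBuffer() {
        std::lock_guard<std::mutex> lock(mutex_);
        const int filled = writeIndex_;
        writeIndex_ = 1 - writeIndex_;          // toggle the "bool"
        return std::move(buffers_[filled]);     // leaves that vector empty
    }

private:
    std::mutex mutex_;
    std::vector<Item> buffers_[2];
    int writeIndex_ = 0;                        // which vector is the active write buffer
};
```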
Locking and releasing a mutex just to switch one bool variable will not take 200ms.
The main problem is probably that the two threads are blocking each other in some way.
This kind of blocking is called lock contention. Basically, it occurs whenever one process or thread attempts to acquire a lock held by another process or thread. Instead of parallelism you have two threads waiting for each other to finish their part of the work, with a similar effect to a single-threaded approach.
For further reading I recommend this article, which describes lock contention in more detail.
Since you are running on Windows, maybe you use Visual Studio? If so, I would resort to the VS profiler, which is (IMHO) quite good in such cases, as long as you don't need to inspect data/instruction caches (then Intel's VTune is the natural choice). In my experience, VS is good enough to catch contention problems as well as CPU bottlenecks. You can run it directly from VS or as a standalone tool; you don't need VS installed on your test machine, you can just copy the tool and run it locally.
VSPerfCmd.exe /start:SAMPLE /attach:12345 /output:samples - attach to process 12345 and gather CPU sampling info
VSPerfCmd.exe /detach:12345 - detach from process
VSPerfCmd.exe /shutdown - shutdown the profiler, the samples.vsp is written (see first line)
Then you can open the file and inspect it in Visual Studio. If you don't see anything keeping your CPU busy, switch to contention profiling: just change the "start" argument from "SAMPLE" to "CONCURRENCY".
The tool is located under %YourVSInstallDir%\Team Tools\Performance Tools\; AFAIR it has been available since VS2010.
Good luck
After discussing the problem in the chat, it turned out that the Windows Performance Analyser is a suitable tool to use. The software is part of the Windows SDK and can be opened using the command wprui in a command window. (Alois Kraus posted this useful link: http://geekswithblogs.net/akraus1/archive/2014/04/30/156156.aspx in the chat). The following steps revealed what the software had been waiting on:
Record information with the WPR using the default settings and load the saved file in the WPA.
Identify the relevant thread. In this case, the recording thread and the saving thread obviously had the highest CPU load. The saving thread could be easily identified: since it saves data to disc, it is the one with file access. (Look at Memory->Hard Faults.)
Check out Computation->CPU usage (Precise) and select Utilization by Process, Thread. Select the process you are analysing. Best display the columns in the order: NewProcess, ReadyingProcess, ReadyingThreadId, NewThreadID, [yellow bar], Ready (µs) sum, Wait(µs) sum, Count...
Under ReadyingProcess, I looked for the process with the largest Wait (µs) since I expected this one to be responsible for the delays.
Under ReadyingThreadID I checked each line referring to the thread with the delays in the NewThreadId column. After a short search, I found a thread that showed frequent Waits of about 100 ms, which always showed up as a pair. In the column ReadyingThreadID, I was able to read the id of the thread the recording loop was waiting for.
According to its CPU usage, this thread did basically nothing. In our special case, this led me to the assumption that the serial port IO commands could be causing this wait. After deactivating them, the delay was gone. The important discovery was that the 200 ms delay was in fact composed of two 100 ms delays.
Further analysis showed that the fetch-data command sent via the virtual serial port pair sometimes gets lost. This might be linked to the very high CPU load in the data saving and compression loop. If the fetch command gets lost, no data is received, and both the first and the second attempt to receive the data time out with their 100 ms timeout.

Arranging physical disk sectors before writing to disk

Background:
I'm developing the new SparkleDB NoSQL database. The database is ACID and has its own disk space manager (DSM) for all of its database file storage access. The DSM allows multiple threads to perform concurrent I/O operations on the same physical file, i.e. asynchronous I/O or overlapped I/O. We disable disk caching, thus writing pages directly to the disk, as this is required for ACID databases.
My question is:
Is there a performance gain from arranging contiguous disk page writes from many threads before sending the I/O requests to the underlying OS I/O subsystem (thus merging the data to be written if it is contiguous), or does the I/O subsystem do this for you? My question applies to UNIX, Linux, and Windows.
Example (all of this happens within a span of 100 ms):
Thread #1: Write 4k to physical file address 4096
Thread #2: Write 4k to physical file address 0
Thread #3: Write 4k to physical file address 8192
Thread #4: Write 4k to physical file address 409600
Thread #5: Write 4k to physical file address 413696
Using this information, the DSM issues a single 12 KB write to physical file address 0, and a single 8 KB write to physical file address 409600.
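A minimal sketch of that coalescing step (assumed types, not SparkleDB's actual DSM code): sort the pending writes by offset and merge runs that are byte-contiguous before issuing them.

```cpp
// Sketch of coalescing contiguous page writes (assumed types, not the DSM).
#include <algorithm>
#include <cstdint>
#include <vector>

struct PendingWrite {
    uint64_t offset;               // physical file address
    std::vector<char> data;        // page payload (4 KB in the example)
};

// Merge byte-contiguous requests into larger ones, preserving the data order.
std::vector<PendingWrite> coalesce(std::vector<PendingWrite> writes) {
    std::sort(writes.begin(), writes.end(),
              [](const PendingWrite& a, const PendingWrite& b) {
                  return a.offset < b.offset;
              });
    std::vector<PendingWrite> merged;
    for (auto& w : writes) {
        if (!merged.empty() &&
            merged.back().offset + merged.back().data.size() == w.offset) {
            merged.back().data.insert(merged.back().data.end(),
                                      w.data.begin(), w.data.end());
        } else {
            merged.push_back(std::move(w));
        }
    }
    return merged;
}
```

In the example above, coalesce() would turn the five 4 KB requests into one 12 KB request at offset 0 and one 8 KB request at offset 409600; whether this is worth doing on top of what the OS I/O scheduler already does is exactly the question being asked.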
Update:
The DSM does all the physical file address positioning by providing an OVERLAPPED structure on Windows, io_prep_pwrite on Linux AIO, and aiocb's aio_offset on POSIX AIO.
The most efficient way to use a hard drive is to keep writing as much data as you can while the platters are still spinning. This means reducing the number of writes and increasing the amount of data per write. If that is possible, then having a disk area of contiguous sectors will help.
For each write, the OS needs to translate the write to your file into logical or physical coordinates on the drive. This may involve reading the directory, searching for your file and locating the mapping of your file within the directory.
After the OS determines the location, it sends data across the interface to the hard drive. Your data may be cached along the way many times until it is placed onto the platters. An efficient write will use the block sizes of the caches and data interfaces.
Now the questions are: 1) how much time does this save, and 2) is the time saving significant? For example, if all this work saves you 1 second, that second may be lost again while waiting for a response from the user.
Many programs, OSes and drivers postpone writes to a hard drive to non-critical or non-peak periods. For example, while you are waiting for user input, you could be writing to the hard drive. This postponing of writes may be less effort than optimizing the disk writes and have a more significant impact on your application.
BTW, this has nothing to do with C++.

What is the fastest way to search all the files in hard disk?

I am currently trying to search all the files on the hard disk.
I'll be searching a lot of documents on Windows 7. That means a lot of file I/O...
I am thinking I should use multithreading or asynchronous I/O.
What do you think?
If you think about it the right way, this can lend itself well to a worker pipeline: thread 1 consumes a list of directories to retrieve and fetches directory listings. Thread 2 consumes directory listings and dispatches additional directories back to thread 1 while forwarding files to thread 3.
Thread 3 meanwhile has a simple job: fetch N pages of data at a time from files and forward them to thread 4 which searches pages of memory for matches.
Because the application is largely going to be IO bound, you can comfortably afford to invest some CPU in thread 3 to optimize the concurrency and priority of requests to try and ensure that you maximize the speed with which new pages are delivered to thread 4 and thus how quickly the entire process completes.
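A compressed sketch of that pipeline (assumed structure, C++17; stages 1 and 2 are folded into one directory-walker thread, and stages 3 and 4 into worker threads that read and scan pages):

```cpp
// Sketch of the worker pipeline described above (assumed structure, C++17):
// one thread walks directories and queues file paths; several worker threads
// pop paths, read them page by page, and scan each page for the needle.
#include <condition_variable>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

namespace fs = std::filesystem;

class PathQueue {                        // simple thread-safe FIFO
public:
    void push(fs::path p) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(p)); }
        cv_.notify_one();
    }
    bool pop(fs::path& p) {              // returns false once closed and drained
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        p = std::move(q_.front());
        q_.pop();
        return true;
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<fs::path> q_;
    bool closed_ = false;
};

int main(int argc, char** argv) {
    if (argc < 3) return 1;              // usage: search <root> <needle>
    const fs::path root = argv[1];
    const std::string needle = argv[2];
    PathQueue files;

    // Stages 1+2: walk the directory tree and hand file paths to the scanners.
    std::thread walker([&] {
        for (const auto& e : fs::recursive_directory_iterator(
                 root, fs::directory_options::skip_permission_denied))
            if (e.is_regular_file()) files.push(e.path());
        files.close();
    });

    // Stages 3+4: read N pages at a time and search them for the needle.
    std::vector<std::thread> scanners;
    for (int i = 0; i < 4; ++i)
        scanners.emplace_back([&] {
            fs::path p;
            std::vector<char> page(64 * 4096);        // read 64 pages per call
            while (files.pop(p)) {
                std::ifstream in(p, std::ios::binary);
                while (in.read(page.data(), page.size()) || in.gcount() > 0) {
                    // Matches straddling a page boundary are ignored in this sketch.
                    std::string chunk(page.data(), in.gcount());
                    if (chunk.find(needle) != std::string::npos) {
                        std::cout << p << '\n';
                        break;
                    }
                }
            }
        });

    walker.join();
    for (auto& t : scanners) t.join();
}
```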
OTOH, you may find that just switching to memory-mapped IO will produce a less complex solution with a good-enough speed.

Writing large files in c++

I am writing large files, in the range of 70-700 GB. Does anyone have experience with whether memory-mapped files would be more efficient than regular writing in chunks?
The code will be in C++ and run on Linux 2.6.
If you are writing the file from the beginning and onwards, there is nothing to be gained from memory mapping the file.
If you are writing the file in any other pattern, please update the question :)
Typical sustained hard drive transfer speeds for consumer grade drives are around 60 megabytes per second, with the sun shining, a stiff breeze in the back and the file system not too fragmented so the disk drive head doesn't have to seek too often.
So a hard lower limit on the amount of time it takes to write 700 gigabytes is 700 * 1024 / 60 = 11947 seconds or 3 hours and 20 minutes. No amount of buffering is going to fix that, it will quickly be overwhelmed by the drastic mismatch between the disk write speed and the ability of the processor to fill the fire hose. Start looking for a problem in your code or the disk drive state only when it takes a couple of times longer than that.
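For reference, "regular writing in chunks" on Linux might look like the following sketch (chunk size, file name and total size are assumptions); with chunks this large, a sequential write is disk-bound exactly as described above:

```cpp
// Sketch of sequential chunked writing on Linux (assumed parameters).
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

int main() {
    const std::size_t chunkSize = 8 * 1024 * 1024;           // 8 MB per write()
    std::vector<char> chunk(chunkSize, 0);                   // dummy payload

    int fd = open("output.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return 1;

    const uint64_t totalBytes = 70ULL * 1024 * 1024 * 1024;  // e.g. 70 GB
    uint64_t written = 0;
    while (written < totalBytes) {
        const std::size_t toWrite = static_cast<std::size_t>(
            std::min<uint64_t>(chunk.size(), totalBytes - written));
        ssize_t n = write(fd, chunk.data(), toWrite);
        if (n < 0) { close(fd); return 1; }                  // report the error
        written += static_cast<uint64_t>(n);                 // handles short writes
    }
    close(fd);
    return 0;
}
```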