Linux AIO: Poor Scaling - C++

I am writing a library that uses the Linux asynchronous I/O system calls, and would like to know why the io_submit function is exhibiting poor scaling on the ext4 file system. If possible, what can I do to get io_submit not to block for large IO request sizes? I already do the following (as described here):
Use O_DIRECT.
Align the IO buffer to a 512-byte boundary.
Set the buffer size to a multiple of the page size.
In order to observe how long the kernel spends in io_submit, I ran a test in which I created a 1 GB test file using dd and /dev/urandom, and repeatedly dropped the system cache (sync; echo 1 > /proc/sys/vm/drop_caches) and read increasingly larger portions of the file. At each iteration, I printed the time taken by io_submit and the time spent waiting for the read request to finish. I ran the following experiment on an x86-64 system running Arch Linux, with kernel version 3.11. The machine has an SSD and a Core i7 CPU. The first graph plots the number of pages read against the time spent waiting for io_submit to finish. The second graph displays the time spent waiting for the read request to finish. The times are measured in seconds.
For comparison, I created a similar test that uses synchronous IO by means of pread. Here are the results:
It seems that the asynchronous IO works as expected up to request sizes of around 20,000 pages. After that, io_submit blocks. These observations lead to the following questions:
Why isn't the execution time of io_submit constant?
What is causing this poor scaling behavior?
Do I need to split up all read requests on ext4 file systems into multiple requests, each of size less than 20,000 pages?
Where does this "magic" value of 20,000 come from? If I run my program on another Linux system, how can I determine the largest IO request size to use without experiencing poor scaling behavior?
The code used to test the asynchronous IO follows below. I can add other source listings if you think they are relevant, but I tried to post only the details that I thought might be relevant.
#include <cstddef>
#include <cstdint>
#include <cstdlib> // std::atoi, std::free, EXIT_FAILURE
#include <cstring>
#include <chrono>
#include <iostream>
#include <memory>
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
// For `__NR_*` system call definitions.
#include <sys/syscall.h>
#include <linux/aio_abi.h>
static int
io_setup(unsigned n, aio_context_t* c)
{
return syscall(__NR_io_setup, n, c);
}
static int
io_destroy(aio_context_t c)
{
return syscall(__NR_io_destroy, c);
}
static int
io_submit(aio_context_t c, long n, iocb** b)
{
return syscall(__NR_io_submit, c, n, b);
}
static int
io_getevents(aio_context_t c, long min, long max, io_event* e, timespec* t)
{
return syscall(__NR_io_getevents, c, min, max, e, t);
}
int main(int argc, char** argv)
{
using namespace std::chrono;
const auto n = 4096 * size_t(std::atoi(argv[1]));
// Initialize the file descriptor. If O_DIRECT is not used, the kernel
// will block on `io_submit` until the job finishes, because non-direct
// IO via the `aio` interface is not implemented (to my knowledge).
auto fd = ::open("dat/test.dat", O_RDONLY | O_DIRECT | O_NOATIME);
if (fd < 0) {
::perror("Error opening file");
return EXIT_FAILURE;
}
char* p;
auto r = ::posix_memalign((void**)&p, 512, n);
if (r != 0) {
std::cerr << "posix_memalign failed." << std::endl;
return EXIT_FAILURE;
}
auto del = [](char* p) { std::free(p); };
std::unique_ptr<char[], decltype(del)> buf{p, del};
// Initialize the IO context.
aio_context_t c{0};
r = io_setup(4, &c);
if (r < 0) {
::perror("Error invoking io_setup");
return EXIT_FAILURE;
}
// Setup I/O control block.
iocb b;
std::memset(&b, 0, sizeof(b));
b.aio_fildes = fd;
b.aio_lio_opcode = IOCB_CMD_PREAD;
// Command-specific options for `pread`.
b.aio_buf = (uint64_t)buf.get();
b.aio_offset = 0;
b.aio_nbytes = n;
iocb* bs[1] = {&b};
auto t1 = high_resolution_clock::now();
r = io_submit(c, 1, bs);
if (r != 1) {
if (r == -1) {
::perror("Error invoking io_submit");
}
else {
std::cerr << "Could not submit request." << std::endl;
}
return EXIT_FAILURE;
}
auto t2 = high_resolution_clock::now();
auto count = duration_cast<duration<double>>(t2 - t1).count();
// Print the wait time.
std::cout << count << " ";
io_event e[1];
t1 = high_resolution_clock::now();
r = io_getevents(c, 1, 1, e, NULL);
t2 = high_resolution_clock::now();
count = duration_cast<duration<double>>(t2 - t1).count();
// Print the read time.
std::cout << count << std::endl;
r = io_destroy(c);
if (r < 0) {
::perror("Error invoking io_destroy");
return EXIT_FAILURE;
}
}

My understanding is that very few (if any) filesystems on Linux fully support AIO. Some filesystem operations still block, and sometimes io_submit() will, indirectly via filesystem operations, invoke such blocking calls.
My understanding is further that the main users of kernel AIO primarily care about AIO being truly asynchronous on raw block devices (i.e. no filesystem): essentially database vendors.
Here's a relevant post from the linux-aio mailing list (head of the thread).
A possibly useful recommendation:
Add more requests via /sys/block/xxx/queue/nr_requests and the problem
will get better.
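For what it's worth, you can check the current value of that knob programmatically before deciding how to tune it. A minimal sketch, assuming the backing device is sda (substitute your own; raising the value itself requires root):
#include <fstream>
#include <iostream>

int main()
{
    // "sda" is a placeholder; use the disk that backs your filesystem.
    std::ifstream f("/sys/block/sda/queue/nr_requests");
    unsigned long nr_requests = 0;
    if (f >> nr_requests)
        std::cout << "nr_requests = " << nr_requests << std::endl;
    else
        std::cerr << "Could not read nr_requests." << std::endl;
}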

Why isn't the execution time of io_submit constant?
Because you are submitting I/Os that are so big, the block layer has to split them up and then queue the resulting requests. This can then cause you to hit resource limitations that in turn cause io_submit() to behave as if it's blocking...
What is causing this poor scaling behavior?
The bigger the I/O is over the splitting threshold (see below), the more likely it becomes that the number of splits done to turn it into appropriately sized requests will also increase (presumably actually doing the splits costs a small amount of time too). With direct I/O, io_submit() does not return until all its requests have been allocated and queued at the block layer level. Further, the number of requests that can be queued by the block layer for a given disk is limited to /sys/block/[disk_device]/queue/nr_requests. Exceeding this limit leads to io_submit() blocking until enough request slots have been freed up such that all its allocations have been satisfied (this is related to what Arvid was recommending).
Do I need to split up all read requests on ext4 file systems into multiple requests, each of size less than 20,000 pages?
Ideally you should split your requests into far smaller amounts than that - 20000 pages (assuming a 4096-byte page, which is what is used on x86 platforms) is roughly 78 megabytes! This doesn't just apply to ext4 - doing such large io_submit() I/O sizes to other filesystems or even directly to block devices is unlikely to perform well.
If you work out which disk device your filesystem is on and look at /sys/block/[disk_device]/queue/max_sectors_kb, that will give you an upper bound, but the bound at which splitting starts may be even smaller, so you may want to limit the size of each I/O to /sys/block/[disk_device]/queue/max_segments * PAGE_SIZE instead.
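To make that concrete, here is a rough sketch (my own illustration, not code from this answer) that derives such a cap from sysfs; "sda" is a placeholder for whichever disk backs your filesystem:
#include <algorithm>
#include <fstream>
#include <iostream>
#include <unistd.h>

// Read a single integer from a sysfs file; returns 0 on failure.
static unsigned long read_sysfs_ulong(const char* path)
{
    std::ifstream f(path);
    unsigned long v = 0;
    f >> v;
    return v;
}

int main()
{
    const unsigned long max_sectors_kb = read_sysfs_ulong("/sys/block/sda/queue/max_sectors_kb");
    const unsigned long max_segments   = read_sysfs_ulong("/sys/block/sda/queue/max_segments");
    const unsigned long page_size      = static_cast<unsigned long>(::sysconf(_SC_PAGESIZE));
    // Upper bound on a single I/O before the block layer is likely to split it.
    const unsigned long cap_bytes = std::min(max_sectors_kb * 1024, max_segments * page_size);
    std::cout << "Suggested per-I/O cap: " << cap_bytes << " bytes" << std::endl;
}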
Where does this "magic" value of 20,000 come from?
This is likely down to some combination of:
The maximum size each I/O can be before the block layer splits it (at most this will be /sys/block/[disk_device]/queue/max_sectors_kb but the observed split limit may be even lower)
The maximum number of I/Os that can be queued before blocking occurs (/sys/block/[disk_device]/queue/nr_requests)
Your hardware's command queue depth (/sys/block/[disk_device]/device/queue_depth)
How fast your disk is at completing requests. When the kernel can't queue any more I/Os to the real device (because the hardware queue_depth is full and the kernel's additional queues are full), it blocks new requests until in-flight ones sent to the hardware have completed.
If I run my program on another Linux system, how can I determine the largest IO request size to use without experiencing poor scaling behavior?
Limit each I/O request to the lower of /sys/block/[disk_device]/queue/max_sectors_kb or /sys/block/[disk_device]/queue/max_segments * PAGE_SIZE. I would imagine I/Os no bigger than 524288 bytes should be safe, but your hardware may be able to cope with a larger size and thus achieve higher throughput, possibly at the expense of completion (as opposed to submission) latency.
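As an illustration of that splitting, here is a sketch (mine, not the asker's code) of a fragment that would replace the single-iocb setup in the question's main(). It breaks one large read into chunks no bigger than a hypothetical max_io cap and submits them together; it assumes the fd, buf, n, c and io_* wrappers from the question, plus <vector> and <algorithm>, and that io_setup() was called with a queue depth of at least chunks.
// Split one large O_DIRECT read into several smaller iocbs, each at most
// `max_io` bytes (524288 is the conservative figure mentioned above; it must
// stay a multiple of 512 for O_DIRECT).
const size_t max_io = 524288;
const size_t chunks = (n + max_io - 1) / max_io;
std::vector<iocb> blocks(chunks);
std::vector<iocb*> ptrs(chunks);
for (size_t i = 0; i != chunks; ++i) {
    std::memset(&blocks[i], 0, sizeof(iocb));
    blocks[i].aio_fildes = fd;
    blocks[i].aio_lio_opcode = IOCB_CMD_PREAD;
    blocks[i].aio_buf = (uint64_t)(buf.get() + i * max_io);
    blocks[i].aio_offset = i * max_io;
    blocks[i].aio_nbytes = std::min(max_io, n - i * max_io);
    ptrs[i] = &blocks[i];
}
// Submit every chunk at once; reap them later with
// io_getevents(c, chunks, chunks, events, nullptr).
if (io_submit(c, (long)chunks, ptrs.data()) != (long)chunks) { /* handle error */ }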
If possible, what can I do to get io_submit not to block for large IO request sizes?
There's going to be an upper "good" limit and if you surpass it there are going to be consequences which you can't escape.
Related questions
asynchronous IO io_submit latency in Ubuntu Linux

You are missing the purpose of using AIO in the first place. The referenced example shows a sequence of [fill-buffer], [write], [write], [write], ... [read], [read], [read], ... operations. In effect you are stuffing data down a pipe. Eventually the pipe fills up when you reach the I/O bandwidth limit of your storage. Now you busy-wait, which shows up as your linear performance-degradation behavior.
The performance gain for an AIO write is that the application fills a buffer and then tells the kernel to begin the write operation; control returns to the application immediately while the kernel still owns the data buffer and its contents; until the kernel signals I/O complete, the application must not touch the data buffer, because you don't know yet what part (if any) of the buffer has actually made it to the media: modify the buffer before the I/O is complete and you've corrupted the data going out to the media.
Conversely, the gain from an AIO read is when the application allocates an I/O buffer, and then tells the kernel to begin filling the buffer. Control returns to the application immediately and the application must leave the buffer alone until the kernel signifies it is finished with the buffer by posting the I/O completion event.
So the behavior you see is the example quickly filling a pipeline to the storage. Eventually data are generated faster than the storage can suck in the data and performance drops to linearity while the pipeline gets refilled as quickly as it is emptied: linear behavior.
The example program does use AIO calls but it's still a linear stop-and-wait program.

Related

Qt QTcpSocket Reading Data Overlap Causes Invalid TCP Behavior During High Bandwidth Reading and Writing

Summary: Some of the data read from the TCP socket appears to be overwritten by other incoming data.
Application:
A client/server system that uses TCP within Qt (QTcpSocket and QTcpServer). The client requests a frame from the server (just a simple string message), and the server responds with that frame (614400 bytes for testing purposes). Frame sizes are established in advance and are fixed.
Implementation Details:
From the guarantees of the TCP protocol (Server -> Client), I know that I should be able to read the 614400 bytes from the socket and that they are in order. If either of these two things fails, the connection must have failed.
Important Code:
Assuming the socket is connected.
This code requests a frame from the server. Known as the GetFrame() function.
// Prompt the server to send a frame over
if(socket->isWritable() && !is_receiving) { // Validate that socket is ready
is_receiving = true; // Forces only one request to go out at a time
qDebug() << "Getting frame from socket..." << image_no;
int written = SafeWrite((char*)"ReadyFrame"); // Writes then flushes the write buffer
if (written == -1) {
qDebug() << "Failed to write...";
return temp_frame.data();
}
this->SocketRead();
is_receiving = false;
}
qDebug() << image_no << "- Image Received";
image_no ++;
return temp_frame.data();
This code waits for the frame just requested to be read. This is the SocketRead() function
size_t byte_pos = 0;
qint64 bytes_read = 0;
do {
if (!socket->waitForReadyRead(500)) { // If it timed out return existing frame
if (!(socket->bytesAvailable() > 0)) {
qDebug() << "Timed Out" << byte_pos;
break;
}
}
bytes_read = socket->read((char*)temp_frame.data() + byte_pos, frame_byte_size - byte_pos);
if (bytes_read < 0) {
qDebug() << "Reading Failed" << bytes_read << errno;
break;
}
byte_pos += bytes_read;
} while (byte_pos < frame_byte_size && is_connected); // While we still have more pixels
qDebug() << "Finished Receiving Frame: " << byte_pos;
As shown in the code above, I read until the frame is fully received (where the number of bytes read is equal to the number of bytes in the frame).
The issue that I'm having is that the QTcpSocket read operation is skipping bytes in ways that are not in line with the guarantees of the TCP protocol. Since I skip bytes I end up not reaching the end of the while loop and just "Time Out". Why is this happening?
What I have done so far:
The data that the server sends is directly converted into uint16_t (short) integers which are used in other parts of the client. I have changed the server to simply output data that counts up, adding one for each number sent. Since the data type is uint16_t and the number of values exceeds the maximum for that integer type, the values wrap around after 65535.
This is data visualization software, so this debugging configuration (on the client side) leads to something like this:
I have determined (and as you can see a little at the bottom of the graphic) that some bytes are being skipped. In the memory of temp_frame it is possible to see the exact point at which the data skips:
Under correct circumstances, this should count up sequentially.
From Wireshark and following this specific TCP connection, I have determined that all of the bytes are in fact arriving (all 614400), and that all the numbers are in order (I used a Python script to ensure counting was sequential).
This is work on an open source project so this is the whole code base for the client.
Overall, I don't see how I could be doing something wrong in this solution, all I am doing is reading from the socket in the standard way.
Caveat: This isn't a definitive answer to your problem, but some things to try (it's too large for a comment).
With (e.g.) GigE, your data rate is ~100MB/s. With a [total] amount of kernel buffer space of 614400, this will be refilled ~175 times per second. IMO, this is still too small. When I've used SO_RCVBUF [for a commercial product], I've used a minimum of 8MB. This allows a wide(er) margin for task switch delays.
Try setting something huge like 100MB to eliminate this as a factor [during testing/bringup].
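If you want to try that from the Qt side, one option (assuming Qt 5.3 or later; older versions would need setsockopt(SO_RCVBUF) on socketDescriptor()) is a sketch like this, run before the first read. Note the kernel may clamp the value to net.core.rmem_max:
#include <QTcpSocket>

// Sketch: enlarge the client socket's receive buffer (100 MB here, purely for
// testing as suggested above).
void enlargeReceiveBuffer(QTcpSocket* socket)
{
    socket->setSocketOption(QAbstractSocket::ReceiveBufferSizeSocketOption,
                            100 * 1024 * 1024);
}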
First, it's important to verify that the kernel and NIC driver can handle the throughput/latency.
You may be getting too many interrupts/second and the ISR prolog/epilog overhead may be too high. The NIC driver can implement polled vs. interrupt-driven operation with NAPI for ethernet cards.
See: https://serverfault.com/questions/241421/napi-vs-adaptive-interrupts
See: https://01.org/linux-interrupt-moderation
Your process/thread may not have high enough priority to be scheduled quickly.
You can use the R/T scheduler with sched_setscheduler, SCHED_RR, and a priority of (e.g.) 8. Note: going higher than 11 kills the system because at 12 and above you're at a higher priority than most internal kernel threads--not a good thing.
You may need to disable IRQ balancing and set the IRQ affinity to a single CPU core.
You can then set your input process/thread locked to that core [with sched_setaffinity and/or pthread_setaffinity].
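A rough sketch of those two steps on Linux (my own example values, not recommendations; SCHED_RR requires root or CAP_SYS_NICE, and glibc's CPU_* macros need _GNU_SOURCE, which g++ defines by default):
#include <cerrno>
#include <cstdio>
#include <pthread.h>
#include <sched.h>

// Sketch: put the calling thread on the round-robin real-time scheduler at the
// given priority (e.g. 8) and pin it to one CPU core (e.g. 2).
static bool make_realtime_and_pin(int priority, int cpu)
{
    sched_param sp{};
    sp.sched_priority = priority;
    if (sched_setscheduler(0, SCHED_RR, &sp) != 0) {
        perror("sched_setscheduler");
        return false;
    }
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (err != 0) {
        errno = err;                       // pthread functions return the error
        perror("pthread_setaffinity_np");
        return false;
    }
    return true;
}
You would call something like make_realtime_and_pin(8, 2) from the receiving thread once the IRQ affinity has been arranged.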
You might need some sort of "zero copy" to bypass the kernel copying from its buffers into your userspace buffers.
You can mmap the kernel socket buffers with PACKET_MMAP. See: https://sites.google.com/site/packetmmap/
I'd be careful about the overhead of your qDebug output. It looks like an iostream type implementation. The overhead may be significant. It could be slowing things down significantly.
That is, you're not measuring the performance of your system. You're measuring the performance of your system plus the debugging code.
When I've had to debug/trace such things, I've used a [custom] "event" log implemented with an in-memory ring queue with a fixed number of elements.
Debug calls such as:
eventadd(EVENT_TYPE_RECEIVE_START,some_event_specific_data);
Here eventadd populates a fixed-size "event" struct with the event type, event data, and a hires timestamp (e.g. struct timespec from clock_gettime(CLOCK_MONOTONIC, ...)).
The overhead of each such call is quite low. The events are just stored in the event ring. Only the last N are remembered.
At some point, your program triggers a dump of this queue to a file and terminates.
This mechanism is similar to [and modeled on] an H/W logic analyzer. It is also similar to dtrace.
Here's a sample event element:
struct event {
long long evt_tstamp; // timestamp
int evt_type; // event type
int evt_data; // type specific data
};
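A minimal sketch of eventadd around that struct (my illustration, not the actual implementation; as written it is not thread-safe, so use an atomic index if events come from multiple threads):
#include <ctime>

#define RINGSIZE 4096          // only the last RINGSIZE events are kept

static struct event ring[RINGSIZE];
static unsigned ring_idx;

static void eventadd(int type, int data)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    struct event* e = &ring[ring_idx++ % RINGSIZE];
    e->evt_tstamp = (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;   // ns
    e->evt_type = type;
    e->evt_data = data;
}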

Busy Loop/Spinning sometimes takes too long under Windows

I'm using a Windows 7 PC to output voltages at a rate of 1 kHz. At first I simply ended the thread with sleep_until(nextStartTime), however this has proven to be unreliable, sometimes working fine and sometimes being off by up to 10 ms.
I found other answers here saying that a busy loop might be more accurate, however mine for some reason also sometimes takes too long.
while (true) {
doStuff(); //is quick enough
logDelays();
nextStartTime = chrono::high_resolution_clock::now() + chrono::milliseconds(1);
spinStart = chrono::high_resolution_clock::now();
while (chrono::duration_cast<chrono::microseconds>(nextStartTime -
chrono::high_resolution_clock::now()).count() > 200) {
spinCount++; //a volatile int
}
int spintime = chrono::duration_cast<chrono::microseconds>
(chrono::high_resolution_clock::now() - spinStart).count();
cout << "Spin Time micros :" << spintime << endl;
if (spinCount > 100000000) {
cout << "reset spincount" << endl;
spinCount = 0;
}
}
I was hoping that this would work to fix my issue, however it produces the output:
Spin Time micros :9999
Spin Time micros :9999
...
I've been stuck on this problem for the last 5 hours and I'd very thankful if somebody knows a solution.
According to the comments this code waits correctly:
auto start = std::chrono::high_resolution_clock::now();
const auto delay = std::chrono::milliseconds(1);
while (true) {
doStuff(); //is quick enough
logDelays();
auto spinStart = std::chrono::high_resolution_clock::now();
while (start > std::chrono::high_resolution_clock::now() + delay) {}
int spintime = std::chrono::duration_cast<std::chrono::microseconds>
(std::chrono::high_resolution_clock::now() - spinStart).count();
std::cout << "Spin Time micros :" << spintime << std::endl;
start += delay;
}
The important part is the busy-wait while (start > std::chrono::high_resolution_clock::now() + delay) {} and start += delay; which will in combination make sure that delay amount of time is waited, even when outside factors (windows update keeping the system busy) disturb it. In case that the loop takes longer than delay the loop will be executed without waiting until it catches up (which may be never if doStuff is sufficiently slow).
Note that missing an update (due to the system being busy) and then sending 2 at once to catch up might not be the best way to handle the situation. You may want to check the current time inside doStuff and abort/restart the transmission if the timing is wrong by more than some acceptable amount.
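A rough sketch of that check, reusing start and the clock from the loop above (max_drift is a made-up tolerance):
// Inside the loop, before (or instead of) bursting to catch up:
const auto max_drift = std::chrono::milliseconds(5);   // example tolerance
auto now = std::chrono::high_resolution_clock::now();
if (now - start > max_drift) {
    // Timing is off by more than we accept: abort/restart the transmission,
    // then drop the backlog and resume on a fresh schedule.
    start = now;
}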
On Windows I don't think it's possible to ever get such precise timing, because you cannot guarantee your thread is actually running at the time you desire. Even with low CPU usage and your thread set to real-time priority, it can still be interrupted (by hardware interrupts, as I understand it; I never fully investigated, but I have seen even a simple while(true) ++i; style loop at real-time priority get interrupted and then moved between CPU cores). While such interrupts and switching are very quick for a real-time thread, they are still significant if you are trying to directly drive a signal without buffering.
Instead you really want to read and write buffers of digital samples (so at 1 kHz each sample is 1 ms). You need to be sure to queue another buffer before the last one is completed, which will constrain how small they can be, but at 1 kHz at real-time priority, if the code is simple and there is no other CPU contention, a single-sample buffer (1 ms) might even be possible, which is at worst 1 ms extra latency over "immediate", but you would have to test. You then leave it up to the hardware and its drivers to handle the precise timing (e.g. to make sure each output sample is "exactly" 1 ms, to the accuracy the vendor claims).
This basically means your code only has to be accurate to 1 ms in the worst case, rather than trying to pursue something far smaller than the OS really supports, such as microsecond accuracy.
As long as you are able to queue a new buffer before the hardware has used up the previous buffer, it will be able to run at the desired frequency without issue (to use audio as an example again: while the tolerated latencies are often much higher, and thus the buffers as well, if you overload the CPU you can still sometimes hear audible glitches where an application didn't queue up new raw audio in time).
With careful timing you might even be able to get down to a fraction of a millisecond by waiting to process and queue your next sample as long as possible (e.g. if you need to reduce latency between input and output), but remember that the closer you cut it the more you risk submitting it too late.

Why is my C++ disk write test much slower than a simple file copy using bash?

Using the program below, I try to test how fast I can write to disk using std::ofstream.
I achieve around 300 MiB/s when writing a 1 GiB file.
However, a simple file copy using the cp command is at least twice as fast.
Is my program hitting the hardware limit or can it be made faster?
#include <chrono>
#include <iostream>
#include <fstream>
char payload[1000 * 1000]; // 1 MB
void test(int MB)
{
// Configure buffer
char buffer[32 * 1000];
std::ofstream of("test.file");
of.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
auto start_time = std::chrono::steady_clock::now();
// Write a total of 1 GB
for (auto i = 0; i != MB; ++i)
{
of.write(payload, sizeof(payload));
}
double elapsed_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - start_time).count();
double megabytes_per_ns = 1e3 / elapsed_ns;
double megabytes_per_s = 1e9 * megabytes_per_ns;
std::cout << "Payload=" << MB << "MB Speed=" << megabytes_per_s << "MB/s" << std::endl;
}
int main()
{
for (auto i = 1; i <= 10; ++i)
{
test(i * 100);
}
}
Output:
Payload=100MB Speed=3792.06MB/s
Payload=200MB Speed=1790.41MB/s
Payload=300MB Speed=1204.66MB/s
Payload=400MB Speed=910.37MB/s
Payload=500MB Speed=722.704MB/s
Payload=600MB Speed=579.914MB/s
Payload=700MB Speed=499.281MB/s
Payload=800MB Speed=462.131MB/s
Payload=900MB Speed=411.414MB/s
Payload=1000MB Speed=364.613MB/s
Update
I changed from std::ofstream to fwrite:
#include <chrono>
#include <cstdio>
#include <iostream>
char payload[1024 * 1024]; // 1 MiB
void test(int number_of_megabytes)
{
FILE* file = fopen("test.file", "w");
auto start_time = std::chrono::steady_clock::now();
// Write a total of 1 GB
for (auto i = 0; i != number_of_megabytes; ++i)
{
fwrite(payload, 1, sizeof(payload), file );
}
fclose(file); // TODO: RAII
double elapsed_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - start_time).count();
double megabytes_per_ns = 1e3 / elapsed_ns;
double megabytes_per_s = 1e9 * megabytes_per_ns;
std::cout << "Size=" << number_of_megabytes << "MiB Duration=" << long(0.5 + 100 * elapsed_ns/1e9)/100.0 << "s Speed=" << megabytes_per_s << "MiB/s" << std::endl;
}
int main()
{
test(256);
test(512);
test(1024);
test(1024);
}
Which improves the speed to 668MiB/s for a 1 GiB file:
Size=256MiB Duration=0.4s Speed=2524.66MiB/s
Size=512MiB Duration=0.79s Speed=1262.41MiB/s
Size=1024MiB Duration=1.5s Speed=664.521MiB/s
Size=1024MiB Duration=1.5s Speed=668.85MiB/s
Which is just as fast as dd:
time dd if=/dev/zero of=test.file bs=1024 count=0 seek=1048576
real 0m1.539s
user 0m0.001s
sys 0m0.344s
First, you're not really measuring the disk writing speed, but (partly) the speed of writing data to the OS disk cache. To really measure the disk writing speed, the data should be flushed to disk before calculating the time. Without flushing there could be a difference depending on the file size and the available memory.
There seems to be something wrong in the calculations too. You're not using the value of MB.
Also make sure the buffer size is a power of two, or at least a multiple of the disk page size (4096 bytes): char buffer[32 * 1024];. You might as well do that for payload too. (It looks like you changed that from 1024 to 1000 in an edit where you added the calculations.)
Do not use streams to write a (binary) buffer of data to disk, but instead write directly to the file, using FILE*, fopen(), fwrite(), fclose(). See this answer for an example and some timings.
To copy a file: open the source file in read-only (and, if possible, forward-only) mode, and use fread()/fwrite():
while fread() from source to buffer
fwrite() buffer to destination file
This should give you a speed comparable to the speed of an OS file copy (you might want to test some different buffer sizes).
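A sketch of that loop (the 1 MiB buffer size is just a starting point to tune):
#include <cstddef>
#include <cstdio>
#include <vector>

// Sketch: copy src to dst with fread()/fwrite(); returns false on error.
bool copy_file(const char* src, const char* dst, std::size_t buf_size = 1 << 20)
{
    std::FILE* in = std::fopen(src, "rb");
    std::FILE* out = std::fopen(dst, "wb");
    if (!in || !out) {
        if (in) std::fclose(in);
        if (out) std::fclose(out);
        return false;
    }
    std::vector<char> buf(buf_size);
    std::size_t n;
    bool ok = true;
    while ((n = std::fread(buf.data(), 1, buf.size(), in)) > 0) {
        if (std::fwrite(buf.data(), 1, n, out) != n) { ok = false; break; }
    }
    ok = ok && !std::ferror(in);
    std::fclose(in);
    ok = (std::fclose(out) == 0) && ok;   // write errors can surface on close
    return ok;
}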
This might be slightly faster using memory mapping:
open src, create memory mapping over the file
open/create dest, set file size to size of src, create memory mapping over the file
memcpy() src to dest
For large files smaller mapped views should be used.
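A sketch of the memory-mapped variant on POSIX (assumes a non-empty source file and maps the whole file at once; as noted above, for very large files you would map and copy smaller views instead):
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Sketch: copy src to dst via mmap() + memcpy().
bool mmap_copy(const char* src, const char* dst)
{
    int in = ::open(src, O_RDONLY);
    if (in < 0) return false;
    struct stat st;
    if (::fstat(in, &st) != 0 || st.st_size == 0) { ::close(in); return false; }

    int out = ::open(dst, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (out < 0 || ::ftruncate(out, st.st_size) != 0) {
        ::close(in);
        if (out >= 0) ::close(out);
        return false;
    }
    void* s = ::mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, in, 0);
    void* d = ::mmap(nullptr, st.st_size, PROT_WRITE, MAP_SHARED, out, 0);
    bool ok = (s != MAP_FAILED && d != MAP_FAILED);
    if (ok) std::memcpy(d, s, static_cast<std::size_t>(st.st_size));

    if (s != MAP_FAILED) ::munmap(s, st.st_size);
    if (d != MAP_FAILED) ::munmap(d, st.st_size);
    ::close(in);
    ::close(out);
    return ok;
}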
Streams are slow
cp uses syscalls directly: read(2) or mmap(2).
I'd wager that it's something clever inside either cp or the filesystem. If it's inside cp then it might be that the file that you are copying has a lot of 0s in it and cp is detecting this and writing a sparse version of your file. The man page for cp says "By default, sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well." This could mean a few things, but one of them is that cp could make a sparse version of your file, which would require less disk write time.
If it's within your filesystem then it might be Deduplication.
As a long-shot 3rd, it might also be something within your OS or your disk firmware that is translating the read and write into some specialized instruction that doesn't require as much synchronization as your program requires (lower bus use means less latency).
You're using a relatively small buffer size. Small buffers mean more operations per second, which increases overhead. Disk systems have a small amount of latency before they receive the read/write request and begin processing it; a larger buffer amortizes that cost a little better. A smaller buffer may also mean that the disk is spending more time seeking.
You're not issuing multiple simultaneous requests - you require one read to finish before the next starts. This means that the disk may have dead time where it is doing nothing. Since all writes depend on all reads, and your reads are serial, you're starving the disk system of read requests (doubly so, since writes will take away from reads).
The total of requested read bytes across all read requests should be larger than the bandwidth-delay product of the disk system. If the disk has 0.5 ms delay and a 4 GB/sec performance, then you want to have 4 GB * 0.5 ms = 2 MB worth of reads outstanding at all times.
You're not using any of the operating system's hints that you're doing sequential reading.
To fix this:
Change your code to have more than one outstanding read request at all times.
Have enough read requests outstanding such that you're waiting on at least 2 MBs worth of data.
Use the posix_fadvise() flags to help the OS disk schedule and page cache optimize.
Consider using mmap to cut down on overhead.
Use a larger buffer size per read request to cut down on overhead.
This answer has more information:
https://stackoverflow.com/a/3756466/344638
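For the posix_fadvise() point in the list above, a minimal sketch (the kernel is free to ignore the hint):
#include <fcntl.h>

// Sketch: tell the kernel the descriptor will be read sequentially so it can
// use more aggressive readahead. Returns 0 on success, an errno value otherwise.
int hint_sequential(int fd)
{
    return ::posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}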
The problem is that you specify too small a buffer for your fstream:
char buffer[32 * 1000];
std::ofstream of("test.file");
of.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
Your app runs in user mode. To write to disk, ofstream calls the system write function, which executes in kernel mode. write then transfers the data to the system cache, then to the HDD cache, and then it is written to the disk.
This buffer size affects the number of system calls (one call for every 32*1000 bytes). During a system call the OS must switch execution context from user mode to kernel mode and then back. Switching context is overhead; in Linux it is equivalent to about 2500-3500 simple CPU instructions. Because of that, your app is spending most of its CPU time in context switching.
In your second app you use
FILE* file = fopen("test.file", "w");
FILE uses a bigger buffer by default, which is why it produces more efficient code. You can try to specify a small buffer with setvbuf. In that case you should see the same performance degradation.
Please note that in your case the bottleneck is not HDD performance; it is context switching.
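If you want to see that effect directly, you can force the same small buffer onto a FILE* with setvbuf (or enlarge it to shrink the syscall count); a sketch:
#include <cstdio>

// Sketch: give the stream a 32*1000-byte buffer like the fstream version used;
// the write() count, and therefore the context-switch overhead, goes back up.
static char small_buf[32 * 1000];

std::FILE* open_with_small_buffer(const char* path)
{
    std::FILE* f = std::fopen(path, "w");
    if (f)
        std::setvbuf(f, small_buf, _IOFBF, sizeof(small_buf));
    return f;
}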

Windows shared memory access time slow

I am currently using shared memory with two mapped files (1.9 GBytes for the first one and 600 MBytes for the second) in a piece of software.
I am using a process that reads data from the first file, processes the data and writes the results to the second file.
I have sometimes noticed a long delay (the reason is unknown to me) when reading from or writing to the mapped view with the memcpy function.
Mapped files are created this way:
m_hFile = ::CreateFileW(SensorFileName,
GENERIC_READ | GENERIC_WRITE,
0,
NULL,
CREATE_ALWAYS,
FILE_ATTRIBUTE_NORMAL,
NULL);
m_hMappedFile = CreateFileMapping(m_hFile,
NULL,
PAGE_READWRITE,
dwFileMapSizeHigh,
dwFileMapSizeLow,
NULL);
And memory mapping is done this way:
m_lpMapView = MapViewOfFile(m_hMappedFile,
FILE_MAP_ALL_ACCESS,
dwOffsetHigh,
dwOffsetLow,
m_i64ViewSize);
The dwOffsetHigh/dwOffsetLow values are matched to the allocation granularity from the system info.
The process reads about 300 KB, N times, stores that in a buffer, processes it, and then writes the processed contents of the previous buffer to the second file, again about 300 KB at a time, N times.
I have two different memory views (created/moved with the MapViewOfFile function) with a size of 10 MBytes as the default size.
For the memory view size, I tested 10 kBytes, 100 kB, 1 MB, 10 MB and 100 MB. Statistically there is no difference; 80% of the time the reading process behaves as described below (~200 ms), but the writing process is really slow.
Normally:
1/ Reading is done in ~200 ms.
2/ Processing is done in 2.9 seconds.
3/ Writing is done in ~200 ms.
I can see that 80% of the time, either reading or writing (in the worst case both are slow) will take between 2 and 10 seconds.
Example: for writing, I am using the code below
for (unsigned int i = 0 ; i < N ; i++) // N = 500~3k
{
// Check the position of the memory view for ponderation
if (###)
MoveView(iOffset);
if (m_lpMapView)
{
memcpy((BYTE*)m_lpMapView + iOffset, pANNHeader, uiANNStatus);
// uiSize = ~300 kBytes
memcpy((BYTE*)m_lpMapView + iTemp, pLine[i], uiSize);
}
else
return uiANNStatus;
}
After using the GetTickCount function to pinpoint where the delay is, I can see that the second memcpy call is always the one taking most of the time.
So far, I am seeing N (for testing, I used N = 500) calls to memcpy take 10 seconds in the worst case when using this shared memory.
I made a temporary program that performed the same number of memcpy calls on the same amount of data and couldn't reproduce the problem.
For the tests, I used the following conditions; they all show the same delay:
1/ I can see this on various computers, 32 or 64 bits from windows 7 to windows 10.
2/ Using the main thread or multi-threads (up to 8 with critical sections for synchronization purpose) for reading/writing.
3/ OS on SATA or SSD, memory mapped files of the software physically on a SATA or SSD hard-disk, and if on external hard-disk, tests were done through USB1, USB2 or USB3.
I am kindly asking what you think my mistake is that makes memcpy this slow.
Best regards.
I found a solution that works for me but might not be the case for others.
Following Thomas Matthews' comments, I checked MSDN and found two interesting functions, FlushViewOfFile and FlushFileBuffers (but couldn't find anything interesting about locking memory).
Calling both after the for loop forces an update of the mapped file.
I am having no more "random" delays, but instead of the expected 200 ms I have an average of 400 ms, which is enough for my application.
After doing some tests I saw that calling those too often will cause heavy hard-disk access and will make the delay worse (10 seconds for every for loop), so the flush should be used carefully.
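For reference, the flush sequence looks roughly like this (a sketch using the m_lpMapView and m_hFile members from the question, called once after the for loop rather than inside it):
if (m_lpMapView)
{
    ::FlushViewOfFile(m_lpMapView, 0);   // 0 = flush the entire mapped view
    ::FlushFileBuffers(m_hFile);         // then push the dirty data to the device
}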
Thanks.

Linux & Windows: Using Large Files to save physical memory

Good afternoon. We have implemented a C++ cKeyArray class to test whether we can use the Large File API to save physical memory. During CentOS Linux testing, we found that the Linux File API was just as fast as using the heap for random-access processing. Here are the numbers, for a 2,700,000-row SQL database where the KeySize for each row is 62 bytes:
cKeyArray class using LINUX File API
BruteForceComparisons = 197275 BruteForceTimeElapsed = 1,763,504,445 microsecs
Each brute-force comparison requires two random accesses, so the mean time required for each random access = 1,763,504,445 microsecs / (2 * 197275) = 4470 microsecs
Heap , no cKeyArray class
BruteForceComparisons = 197275 BruteForceTimeElapsed = 1,708,442,690microsecs
the mean time required for each random access = 4300 microsecs.
On 32-bit Windows, the numbers are:
cKeyArray class using Windows File API
BruteForceComparisons = 197275 BruteForceTimeElapsed = 9243787 millisecs
the mean time for each random access is 23.4 millisec
Heap, no cKeyArray class
BruteForceComparisons = 197275 BruteForceTimeElapsed = 2,141,941 millisecs
the mean time required for each random access is 5.4 millisec
We are wondering why the Linux cKeyArray numbers are just as good as the Linux heap numbers, while on 32-bit Windows the mean heap random-access time is 4 times as fast as the cKeyArray Windows File API. Is there some way we can speed up the Windows cKeyArray File API?
Previously, we received a lot of good suggestions from Stack Overflow on using the Windows Memory Mapped File API. Based on these Stack Overflow suggestions we have implemented a Memory Mapped File MRU caching class which functions properly.
Because we want to develop a cross-platform solution, we want to do due diligence to understand why the Linux File API is so fast. Thank you. We are posting a portion of the cKeyArray class implementation below.
#define KEYARRAY_THRESHOLD 100000000
// Use file instead of memory if requirement is above this number
cKeyArray::cKeyArray(long RecCount_,int KeySize_,int MatchCodeSize_, char* TmpFileName_) {
RecCount=RecCount_;
KeySize=KeySize_;
MatchCodeSize=MatchCodeSize_;
MemBuffer=0;
KeyBuffer=0;
MemFile=0;
MemFileName[0]='\x0';
ReturnBuffer=new char[MatchCodeSize + 1];
if (RecCount*KeySize<=KEYARRAY_THRESHOLD) {
InMemory=true;
MemBuffer=new char[RecCount*KeySize];
memset(MemBuffer,0,RecCount*KeySize);
} else {
InMemory=false;
strcpy(MemFileName,TmpFileName_);
try {
MemFile=
new cFile(MemFileName,cFile::CreateAlways,cFile::ReadWrite);
}
catch (cException e)
{
throw e;
}
try {
MemFile->SetFilePointer(
(int64_t)(RecCount*KeySize),cFile::FileBegin);
}
catch (cException e)
{
throw e;
}
if (!(MemFile->SetEndOfFile()))
throw cException(ERR_FILEOPEN,MemFileName);
KeyBuffer=new char[KeySize];
}
}
char *cKeyArray::GetKey(long Record_) {
memset(ReturnBuffer,0,MatchCodeSize + 1);
if (InMemory) {
memcpy(ReturnBuffer,MemBuffer+Record_*KeySize,MatchCodeSize);
} else {
MemFile->SetFilePointer((int64_t)(Record_*KeySize),cFile::FileBegin);
MemFile->ReadFile(KeyBuffer,KeySize);
memcpy(ReturnBuffer,KeyBuffer,MatchCodeSize);
}
return ReturnBuffer;
}
uint32_t cKeyArray::GetDupeGroup(long Record_) {
uint32_t DupeGroup(0);
if (InMemory) {
memcpy((char*)&DupeGroup,
MemBuffer+Record_*KeySize + MatchCodeSize,sizeof(uint32_t));
} else {
MemFile->SetFilePointer(
(int64_t)(Record_*KeySize + MatchCodeSize) ,cFile::FileBegin);
MemFile->ReadFile((char*)&DupeGroup,sizeof(uint32_t));
}
return DupeGroup;
}
On Linux, the OS aggressively caches file data in main memory -- so although you haven't explicitly allocated memory for the file contents, they are nevertheless stored in RAM. Here's a decent link with some more information about the page cache -- only one thing is missing from that description, which is that most Linux filesystems actually implement the standard I/O interfaces as thin wrappers around the page cache. That means that even though you haven't explicitly memory mapped the file, the system is still treating it as though it were memory mapped under-the-covers. That's why you see roughly equivalent performance with either approach.
I second the suggestion to factor the platform-specific stuff out, and use whichever appropach is fastest for each platform. Be sure to benchmark -- don't ever make assumptions about performance.
Your memory-mapped solution should be as much as 10x faster than the file solution even in Linux. That is the speed I experience in my test cases.
Each file access system call takes hundreds of CPU cycles to complete. Time which your program could be using to do real work.
One explanation for why the speeds are similar could be that your memory map has not been used before. When a memory mapped page is accessed for the first time it must be assigned to a physical page of RAM and zeroed out or if it is a disk file it must be loaded from disk into RAM. All of that takes a considerable amount of time.
If you touch (read or write a value in) each 4K page of the mapping before using it, you should see a significant speed increase in the memory-mapped case.
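A sketch of that warm-up pass (touching one byte per 4K page; on Linux, madvise(MADV_WILLNEED) on the mapping is a gentler alternative):
#include <cstddef>

// Sketch: fault in every page of a mapped region before the timed run, so the
// first real accesses don't pay the page-fault / disk-load cost.
void prefault(volatile const char* base, std::size_t length, std::size_t page_size = 4096)
{
    for (std::size_t off = 0; off < length; off += page_size)
        (void)base[off];   // a read is enough to populate the page
}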