File read() hangs on large binary file - c++

I'm working on a benchmark program. Upon making the read() system call, the program appears to hang indefinitely. The target file is 1 GB of binary data and I'm attempting to read directly into buffers that can be 1, 10 or 100 MB in size.
I'm using std::vector<char> to implement dynamically-sized buffers and handing off &vec[0] to read(). I'm also calling open() with the O_DIRECT flag to bypass kernel caching.
The essential coding details are captured below:
std::string fpath{"/path/to/file"};
size_t tries{};
int fd{};
while (errno == EINTR && tries < MAX_ATTEMPTS) {
fd = open(fpath.c_str(), O_RDONLY | O_DIRECT | O_LARGEFILE);
tries++;
}
// Throw exception if error opening file
if (fd == -1) {
ostringstream ss {};
switch (errno) {
case EACCES:
ss << "Error accessing file " << fpath << ": Permission denied";
break;
case EINVAL:
ss << "Invalid file open flags; system may also not support O_DIRECT flag, required for this benchmark";
break;
case ENAMETOOLONG:
ss << "Invalid path name: Too long";
break;
case ENOMEM:
ss << "Kernel error: Out of memory";
}
throw invalid_argument {ss.str()};
}
size_t buf_sz{1024*1024}; // 1 MiB buffer
std::vector<char> buffer(buf_sz); // Creates vector pre-allocated with buf_sz chars (bytes)
// Result is 0-filled buffer of size buf_sz
auto bytes_read = read(fd, &buffer[0], buf_sz);
Poking through the executable with gdb shows that buffers are allocated correctly, and the file I've tested with checks out in xxd. I'm using g++ 7.3.1 (with C++11 support) to compile my code on a Fedora Server 27 VM.
Why is read() hanging on large binary files?
Edit: Code example updated to more accurately reflect error checking.

There are multiple problems with your code.
This code will never work properly if errno ever has a value equal to EINTR:
while (errno == EINTR && tries < MAX_ATTEMPTS) {
fd = open(fpath.c_str(), O_RDONLY | O_DIRECT | O_LARGEFILE);
tries++;
}
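If errno happens to be EINTR when the loop starts, a successful open() does not reset it, so the loop keeps reopening the file and leaking file descriptors until tries reaches MAX_ATTEMPTS. If errno is anything else, the loop body never runs and the file is never opened at all.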
This would be better:
do
{
fd = open(fpath.c_str(), O_RDONLY | O_DIRECT | O_LARGEFILE);
tries++;
}
while ( ( -1 == fd ) && ( EINTR == errno ) && ( tries < MAX_ATTEMPTS ) );
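As a rough sketch, the retry logic above could be wrapped in a small helper. The name open_retrying and its max_attempts parameter are made up for illustration; the point is only that the loop stops on success or on any error other than EINTR, and that fd starts at -1 rather than 0.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper (not from the question): open `path`, retrying only
// when open() itself was interrupted by a signal (EINTR).
// Returns the descriptor, or -1 with errno set by the last open() call.
int open_retrying(const char *path, int flags, int max_attempts)
{
    assert(max_attempts > 0);
    int fd = -1;                       // -1, not 0: 0 would be STDIN_FILENO
    for (int tries = 0; tries < max_attempts; ++tries) {
        fd = ::open(path, flags);
        if (fd != -1 || errno != EINTR)
            break;                     // success, or an error retrying won't fix
    }
    return fd;
}
```

A call such as open_retrying(fpath.c_str(), O_RDONLY | O_DIRECT | O_LARGEFILE, MAX_ATTEMPTS) would then replace the loop, followed by the existing fd == -1 check.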
Second, as noted in the comments, O_DIRECT can impose alignment restrictions on memory. You might need page-aligned memory:
So
size_t buf_sz{1024*1024}; // 1 MiB buffer
std::vector<char> buffer(buf_sz); // Creates vector pre-allocated with buf_sz chars (bytes)
// Result is 0-filled buffer of size buf_sz
auto bytes_read = read(fd, &buffer[0], buf_sz);
becomes
size_t buf_sz{1024*1024}; // 1 MiB buffer
// page-aligned buffer; compare the result against MAP_FAILED before using it
char *buffer = static_cast<char *>(mmap(nullptr, buf_sz, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
auto bytes_read = read(fd, buffer, buf_sz);
Note also that the Linux implementation of O_DIRECT can be very dodgy. It's been getting better, but there are still potential pitfalls that aren't very well documented at all. Along with alignment restrictions, if the last chunk of data in the file isn't a full page, for example, you may not be able to read it if the filesystem's implementation of direct IO doesn't allow you to read anything but full pages (or some other block size). Likewise for write() calls: you may not be able to write just any number of bytes, and you might be constrained to something like a 4k page.
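One way to sidestep the alignment pitfalls is to round sizes up to a block multiple and allocate with posix_memalign(). The sketch below assumes a 4096-byte block in its usage; the real requirement depends on the filesystem and device, and the helper names round_up and alloc_aligned are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>

// Round n up to the next multiple of block (block must be non-zero).
std::size_t round_up(std::size_t n, std::size_t block)
{
    assert(block != 0);
    return (n + block - 1) / block * block;
}

// Allocate at least `size` bytes whose address is a multiple of `alignment`
// (a power of two). The size is padded to a whole number of blocks so the
// buffer can be passed to read() with a block-multiple length.
// Returns nullptr on failure; release with free().
void *alloc_aligned(std::size_t alignment, std::size_t size)
{
    void *p = nullptr;
    if (::posix_memalign(&p, alignment, round_up(size, alignment)) != 0)
        return nullptr;
    return p;
}
```

For the 1 MiB case this would be alloc_aligned(4096, buf_sz), with the returned pointer handed to read() in place of &buffer[0].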
This is also critical:
Most examples of read() hanging appear to be when using pipes or non-standard I/O devices (e.g., serial). Disk I/O, not so much.
Some devices simply do not support direct IO. They should return an error, but again, the O_DIRECT implementation on Linux can be very hit-or-miss.

I pasted your program and ran it on my Linux system, and it worked without hanging.
The most likely cause of the failure is that the file is not a regular file-system item, or that it sits on hardware which is not working.
Try a smaller size to confirm, and try a different machine to help diagnose.
My complete code (with no error checking)
#include <vector>
#include <string>
#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
int main( int argc, char ** argv )
{
std::string fpath{"myfile.txt" };
auto fd = open(fpath.c_str(), O_RDONLY | O_DIRECT | O_LARGEFILE);
size_t buf_sz{1024*1024}; // 1 MiB buffer
std::vector<char> buffer(buf_sz); // Creates vector pre-allocated with buf_sz chars (bytes)
// Result is 0-filled buffer of size buf_sz
auto bytes_read = read(fd, &buffer[0], buf_sz);
}
myfile.txt was created with
dd if=/dev/zero of=myfile.txt bs=1024 count=1024
If the file is not exactly 1 MiB in size, the read may fail.
If the file is a pipe, it can block until the data is available.

Most examples of read() hanging appear to be when using pipes or non-standard I/O devices (e.g., serial). Disk I/O, not so much.
O_DIRECT flag is useful for filesystems and block devices. With this flag people normally map pages into the user space.
For sockets, pipes and serial devices it is plain useless because the kernel does not cache that data.
Your updated code hangs because fd is initialized with 0 which is STDIN_FILENO and it never opens that file, then it hangs reading from stdin.

Related

Why is data corrupt when reading back from a file as it's being written with O_DIRECT

I have a C++ program that uses the POSIX API to write a file opened with O_DIRECT. Concurrently, another thread is reading back from the same file via a different file descriptor. I've noticed that occasionally the data read back from the file contains all zeroes, rather than the actual data I wrote. Why is this?
Here's an MCVE in C++17. Compile with g++ -std=c++17 -Wall -otest test.cpp or equivalent. Sorry I couldn't seem to make it any shorter. All it does is write 100 MiB of constant bytes (0x5A) to a file in one thread and read them back in another, printing a message if any of the read-back bytes are not equal to 0x5A.
WARNING, this MCVE will delete and rewrite any file in the current working directory named foo.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <thread>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
constexpr size_t CHUNK_SIZE = 1024 * 1024;
constexpr size_t TOTAL_SIZE = 100 * CHUNK_SIZE;
int main(int argc, char *argv[])
{
::unlink("foo");
std::thread write_thread([]()
{
int fd = ::open("foo", O_WRONLY | O_CREAT | O_DIRECT, 0777);
if (fd < 0) std::exit(-1);
uint8_t *buffer = static_cast<uint8_t *>(
std::aligned_alloc(4096, CHUNK_SIZE));
std::fill(buffer, buffer + CHUNK_SIZE, 0x5A);
size_t written = 0;
while (written < TOTAL_SIZE)
{
ssize_t rv = ::write(fd, buffer,
std::min(TOTAL_SIZE - written, CHUNK_SIZE));
if (rv < 0) { std::cerr << "write error" << std::endl; std::exit(-1); }
written += rv;
}
});
std::thread read_thread([]()
{
int fd = ::open("foo", O_RDONLY, 0);
if (fd < 0) std::exit(-1);
uint8_t *buffer = new uint8_t[CHUNK_SIZE];
size_t checked = 0;
while (checked < TOTAL_SIZE)
{
ssize_t rv = ::read(fd, buffer, CHUNK_SIZE);
if (rv < 0) { std::cerr << "read error" << std::endl; std::exit(-1); }
for (ssize_t i = 0; i < rv; ++i)
if (buffer[i] != 0x5A)
std::cerr << "readback mismatch at offset " << checked + i << std::endl;
checked += rv;
}
});
write_thread.join();
read_thread.join();
}
(Details such as proper error checking and resource management are omitted here for the sake of the MCVE. This is not my actual program but it shows the same behavior.)
I'm testing on Linux 4.15.0 with an SSD. About 1/3 of the time I run the program, the "readback mismatch" message prints. Sometimes it doesn't. In all cases, if I examine foo after the fact I find that it does contain the correct data.
If you remove O_DIRECT from the ::open() flags in the write thread, the problem goes away and the "readback mismatch" message never prints.
I could understand why my ::read() might return 0 or something to indicate I've already read everything that has been flushed to disk so far. But I can't understand why it would perform what appears to be a successful read, but with data other than what I wrote. Clearly I'm missing something, but what is it?
So, O_DIRECT has some additional constraints that might not make it what you're looking for. From the open(2) man page:
Applications should avoid mixing O_DIRECT and normal I/O to the same
file, and especially to overlapping byte regions in the same file.
Even when the filesystem correctly handles the coherency issues in
this situation, overall I/O throughput is likely to be slower than
using either mode alone.
Instead, I think O_SYNC might be better, since it does provide the expected guarantees:
O_SYNC provides synchronized I/O file integrity completion, meaning
write operations will flush data and all associated metadata to the
underlying hardware. O_DSYNC provides synchronized I/O data
integrity completion, meaning write operations will flush data to the
underlying hardware, but will only flush metadata updates that are
required to allow a subsequent read operation to complete
successfully. Data integrity completion can reduce the number of
disk operations that are required for applications that don't need
the guarantees of file integrity completion.
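A minimal sketch of the O_DSYNC variant follows, assuming a POSIX system. It writes sequentially and then reads back through a normal descriptor, so it only illustrates the flags and the verification step, not the concurrent reader from the question. The helper names and the 64 KiB chunk size are arbitrary choices for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>
#include <fcntl.h>
#include <unistd.h>

// Write `total` bytes of `value` to `path` through a descriptor opened with
// O_DSYNC, so each write() returns only once the data has been handed to the
// underlying storage. No O_DIRECT, hence no alignment requirements.
bool write_synced(const char *path, unsigned char value, std::size_t total)
{
    assert(path != nullptr);
    int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DSYNC, 0644);
    if (fd < 0) return false;
    const std::vector<unsigned char> chunk(64 * 1024, value);
    std::size_t written = 0;
    while (written < total) {
        ssize_t rv = ::write(fd, chunk.data(),
                             std::min(chunk.size(), total - written));
        if (rv < 0) { ::close(fd); return false; }
        written += static_cast<std::size_t>(rv);
    }
    return ::close(fd) == 0;
}

// Read `path` back with ordinary buffered I/O and count the bytes that differ
// from `value`. Returns -1 if the file cannot be opened or read.
long count_mismatches(const char *path, unsigned char value)
{
    int fd = ::open(path, O_RDONLY);
    if (fd < 0) return -1;
    std::vector<unsigned char> chunk(64 * 1024);
    long bad = 0;
    ssize_t rv;
    while ((rv = ::read(fd, chunk.data(), chunk.size())) > 0)
        bad += std::count_if(chunk.begin(), chunk.begin() + rv,
                             [value](unsigned char b) { return b != value; });
    ::close(fd);
    return rv < 0 ? -1 : bad;
}
```

Because both descriptors go through the page cache here, the reader and writer are not mixing direct and buffered I/O on the same byte ranges.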

Size error on read file

RESOLVED
I'm trying to make a simple file loader.
I aim to get the text from a shader file (plain text file) into a char* that I will compile later.
I've tried this function:
char* load_shader(char* pURL)
{
FILE *shaderFile;
char* pShader;
// File opening
fopen_s( &shaderFile, pURL, "r" );
if ( shaderFile == NULL )
return "FILE_ER";
// File size
fseek (shaderFile , 0 , SEEK_END);
int lSize = ftell (shaderFile);
rewind (shaderFile);
// Allocating size to store the content
pShader = (char*) malloc (sizeof(char) * lSize);
if (pShader == NULL)
{
fputs ("Memory error", stderr);
return "MEM_ER";
}
// copy the file into the buffer:
int result = fread (pShader, sizeof(char), lSize, shaderFile);
if (result != lSize)
{
// size of file 106/113
cout << "size of file " << result << "/" << lSize << endl;
fputs ("Reading error", stderr);
return "READ_ER";
}
// Terminate
fclose (shaderFile);
return 0;
}
But as you can see in the code I have a strange size difference at the end of the process which makes my function crash.
I must say I'm quite a beginner in C so I might have missed some subtleties regarding memory allocation, types, pointers...
How can I solve this size issue?
EDIT 1:
First, I shouldn't return 0 at the end but pShader; that seemed to be what crashed the program.
Then, I changed the type of result to size_t, and added a terminating character to pShader, adding pShader[result] = '\0'; after its declaration so I can display it correctly.
Finally, as @JamesKanze suggested, I turned fopen_s into fopen, as the former was not useful in my case.
First, for this sort of raw access, you're probably better off
using the system level functions: CreateFile or open,
ReadFile or read and CloseHandle or close, with
GetFileSize or stat to get the size. Using FILE* or
std::filebuf will only introduce an additional level of
buffering and processing, for no gain in your case.
As to what you are seeing: there is no guarantee that an ftell
will return anything exploitable as a numeric value; it could
very well be just a magic cookie. On most current systems, it
is a byte offset into the physical file, but on any non-Unix
system, the offset into the physical file will not map directly
to the logical file you are reading unless you open the file in
binary mode. If you use "rb" to open the file, you'll
probably see the same values. (Theoretically, you could get
extra 0's at the end of the file, but practically, the OS's
where that happened are either extinct, or only used on legacy
mainframes.)
EDIT:
Since the answer stating this has been deleted: you should loop
on the fread until it returns 0 (setting errno to 0 before
each call, and checking it after the return to see whether the
function returned because of an error or because it reached the
end of file). Having said this: if you're on one of the usual
Windows or Unix systems, and the file is local to the machine,
and not too big, fread will read it all in one go. The
difference in size you are seeing (given the numerical values
you posted) is almost certainly due to the fact that the two
byte Windows line endings are being mapped to a single '\n'
character. To avoid this, you must open in binary mode;
alternatively, if you really are dealing with text (and want
this mapping), you can just ignore the extra bytes in your
buffer, setting the '\0' terminator after the last byte
actually read.
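The advice above (binary mode, loop on fread, terminate after the last byte actually read) might look roughly like this. The sketch swaps two details: it returns a std::string instead of a malloc'd char*, which keeps the terminator automatically, and it checks ferror() rather than errno. The name load_file is invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Read a whole file in binary mode into `out`.
// Returns false if the file cannot be opened or a read error occurs.
bool load_file(const char *path, std::string &out)
{
    assert(path != nullptr);
    std::FILE *f = std::fopen(path, "rb");   // "rb": no line-ending translation
    if (!f) return false;
    out.clear();
    char buf[4096];
    std::size_t n;
    // fread may return fewer bytes than requested; keep going until it returns 0.
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0)
        out.append(buf, n);
    const bool ok = !std::ferror(f);         // 0 bytes read: end of file or error?
    std::fclose(f);
    return ok;
}
```

The shader source is then available as out.c_str(), and out.size() is the number of bytes actually read rather than the value reported by ftell.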

Can't seem to get a linecount from this file

I'm trying to read a file and find out how many lines are in it.
These are code snippets without the 'if' error handling.
...
int fd = -1;
struct stat buff;
char * logbuff;
fd = open("/home/path/to/test", O_RDONLY, 0);
fstat(fd, &buff);
logbuff =
(char*)mmap(
NULL, //OS chooses address
buff.st_size, //Size of file
PROT_READ, //Read only
MAP_ANON|MAP_SHARED, //copy on write
fd,
0
);
const char * ch = &logbuff[0];
for ( i = 0; i < buff.st_size; i++ ) {
if ( ch[i] == '\n' ) {
newlines++;
cout << ch[i];
}
}
cout << "lines: " << newlines << endl;
I get 'lines: 0'
I have a
Let's read the documentation for MAP_ANON and MAP_ANONYMOUS:
MAP_ANON
Synonym for MAP_ANONYMOUS. Deprecated.
MAP_ANONYMOUS
The mapping is not backed by any file; its contents are initialized to zero.
The fd and offset arguments are ignored; however, some implementations require
fd to be -1 if MAP_ANONYMOUS (or MAP_ANON) is specified, and portable applications
should ensure this. The use of MAP_ANONYMOUS in conjunction with MAP_SHARED is
only supported on Linux since kernel 2.4.
So we can see that when we use MAP_ANON, the fd parameter is ignored! You actually are looking for MAP_PRIVATE:
MAP_PRIVATE
Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.
Therefore, the call to mmap should be:
logbuff = (char*)mmap(NULL, buff.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
Additionally, you'll want to verify that the call to mmap() succeeds. To do that, make sure that the returned pointer is not MAP_FAILED.
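Putting the pieces together, a line counter built on a file-backed MAP_PRIVATE mapping could be sketched as follows. The function name count_newlines is made up; the empty-file branch is there because mmap() rejects a zero length.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Count '\n' bytes in `path` via a read-only, file-backed private mapping.
// Returns -1 if the file cannot be opened, stat'ed, or mapped.
long count_newlines(const char *path)
{
    assert(path != nullptr);
    int fd = ::open(path, O_RDONLY);
    if (fd == -1) return -1;
    struct stat st {};
    if (::fstat(fd, &st) == -1) { ::close(fd); return -1; }
    if (st.st_size == 0) { ::close(fd); return 0; }   // nothing to map
    void *map = ::mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd);                      // the mapping stays valid after close()
    if (map == MAP_FAILED) return -1;
    const char *p = static_cast<const char *>(map);
    long lines = 0;
    for (off_t i = 0; i < st.st_size; ++i)
        if (p[i] == '\n') ++lines;
    ::munmap(map, st.st_size);
    return lines;
}
```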

SSD raw I/O benchmarks with random read/write

My laptop has an SSD that has a 512 byte physical disk sector size and a 4,096 byte logical disk sector size. I'm working on an ACID database system that has to bypass all OS caches, so I write directly from allocated internal memory (RAM) to the SSD. I also extend the files before I run the tests and don't resize them during the tests.
Now here is my problem: according to SSD benchmarks, random read and write should be in the range of 30 MB/s to 90 MB/s, respectively. But here is my (rather horrible) telemetry from my numerous performance tests:
1.2 MB/s when reading random 512 byte blocks (physical sector size)
512 KB/s when writing random 512 byte blocks (physical sector size)
8.5 MB/s when reading random 4,096 byte blocks (logical sector size)
4.9 MB/s when writing random 4,096 byte blocks (logical sector size)
In addition to using asynchronous I/O, I also set the FILE_SHARE_READ and FILE_SHARE_WRITE flags to disable all OS buffering, which I must do because our database is ACID. I also tried FlushFileBuffers(), but that gave me even worse performance. I also wait for each async I/O operation to complete, as is required by some of our code.
Here is my code. Is there a problem with it, or am I stuck with this bad I/O performance?
HANDLE OpenFile(const wchar_t *fileName)
{
// Set access method
DWORD desiredAccess = GENERIC_READ | GENERIC_WRITE ;
// Set file flags
DWORD fileFlags = FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING /*| FILE_FLAG_RANDOM_ACCESS*/;
//File or device is being opened or created for asynchronous I/O
fileFlags |= FILE_FLAG_OVERLAPPED ;
// Exclusive use (no share mode)
DWORD shareMode = 0;
HANDLE hOutputFile = CreateFile(
// File name
fileName,
// Requested access to the file
desiredAccess,
// Share mode. 0 equals exclusive lock by the process
shareMode,
// Pointer to a security attribute structure
NULL,
// Action to take on file
CREATE_NEW,
// File attributes and flags
fileFlags,
// Template file
NULL
);
if (hOutputFile == INVALID_HANDLE_VALUE)
{
int lastError = GetLastError();
std::cerr << "Unable to create the file '" << fileName << "'. [CreateFile] error #" << lastError << "." << std::endl;
}
return hOutputFile;
}
DWORD ReadFromFile(HANDLE hFile, void *outData, _UINT64 bytesToRead, _UINT64 location, OVERLAPPED *overlappedPtr,
asyncIoCompletionRoutine_t completionRoutine)
{
DWORD bytesRead = 0;
if (overlappedPtr)
{
// Windows demands that you split the file byte location into high & low 32-bit halves
overlappedPtr->Offset = (DWORD)_UINT64LO(location);
overlappedPtr->OffsetHigh = (DWORD)_UINT64HI(location);
// Should we use a callback function or a manual event
if (!completionRoutine && !overlappedPtr->hEvent)
{
// No manual event supplied, so create one. The caller must reset and close it themselves
overlappedPtr->hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
if (!overlappedPtr->hEvent)
{
DWORD errNumber = GetLastError();
std::wcerr << L"Could not create a new event. [CreateEvent] error #" << errNumber << L".";
}
}
}
BOOL result = completionRoutine ?
ReadFileEx(hFile, outData, (DWORD)(bytesToRead), overlappedPtr, completionRoutine) :
ReadFile(hFile, outData, (DWORD)(bytesToRead), &bytesRead, overlappedPtr);
if (result == FALSE)
{
DWORD errorCode = GetLastError();
if (errorCode != ERROR_IO_PENDING)
{
std::wcerr << L"Can't read sectors from file. [ReadFile] error #" << errorCode << L".";
}
}
return bytesRead;
}
Random IO performance is not measured well in MB/sec. It is measured in IOPS. "1.2 MB/s when reading random 512 byte blocks" => roughly 2,300 IOPS (1.2 MB/s divided by 512 bytes). Not bad. Double the block size and you'll get 199% the MB/sec and 99% the IOPS, because it takes almost the same time to read 512 bytes as it does to read 1024 bytes (almost no time at all). SSDs are not free of seeking costs, as is sometimes mistakenly assumed.
So the numbers are not actually bad at all.
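The conversion from throughput to IOPS is just throughput divided by block size. A tiny sketch, assuming 1 MB means 10^6 bytes (the question does not say which convention its figures use):

```cpp
#include <cassert>

// Operations per second implied by a throughput figure and a fixed block size.
double iops(double bytes_per_second, double block_size_bytes)
{
    assert(block_size_bytes > 0);
    return bytes_per_second / block_size_bytes;
}
```

Calling iops(1.2e6, 512) and iops(8.5e6, 4096) for the two read measurements in the question makes it easy to compare them on the same scale, which is where the near-constant per-operation cost shows up.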
SSDs benefit from high queue depth. Try issuing multiple IOs at once and keep that number outstanding at all times. The optimal concurrency will be somewhere in the range of 1-32.
Because SSDs have hardware concurrency you can expect a small multiple of the single-threaded performance. My SSD has 4 parallel "banks" for example.
Using FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING is all that is needed to achieve direct writes to hardware. If these flags do not work your hardware does not respect these flags and you can't do anything about it. All server hardware respects these flags and I have not seen a consumer disk that doesn't.
The sharing flags are not meaningful in this context.
The code is fine although I don't see why you use async IO and later wait on an event to wait for completion. That makes no sense. Either use synchronous IO (which will perform about the same as async IO) or use async IO with completion ports and without waiting.
Use hdparm -I /dev/sdx to check your logical and physical block size. Most modern SSDs have a 4096-byte physical block size but also support 512-byte blocks for backward compatibility with older drives and OS software. This is done by "512-byte emulation", a.k.a. 512e. If your drive is one of those that does 512-byte emulation, your 512-byte accesses are actually read-modify-write operations. The SSD will try to turn sequential accesses into 4k block writes.
If you can switch to 4k block writes you will (probably) see much better numbers for IOPS as well as bandwidth since this makes for much less work on the SSD. Random 512 block writes also have a big impact on long term performance due to increased write amplification.

Does posix_fallocate work with files opened in append mode?

I'm trying to preallocate disk space for file operations. However, I've run into a weird issue: posix_fallocate only allocates one byte when I call it on a file opened in append mode, and the file contents are also unexpected. Has anyone seen this issue? My test code is:
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <cerrno>
int main(int argc, char **argv)
{
FILE *fp = fopen("append.txt", "w");
for (int i = 0; i < 5; ++i)
fprintf(fp, "## Test loop %d\n", i);
fclose(fp);
sleep(1);
int fid = open("append.txt", O_WRONLY | O_APPEND);
struct stat status;
fstat(fid, &status);
printf("INFO: sizeof 'append.txt' is %ld Bytes.\n", status.st_size);
int ret = posix_fallocate(fid, (off_t)status.st_size, 1024);
if (ret) {
switch (ret) {
case EBADF:
fprintf(stderr, "ERROR: %d is not a valid file descriptor, or is not opened for writing.\n", fid);
break;
case EFBIG:
fprintf(stderr, "ERROR: exceed the maximum file size.\n");
break;
case ENOSPC:
fprintf(stderr, "ERROR: There is not enough space left on the device\n");
break;
default:
break;
}
}
fstat(fid, &status);
printf("INFO: sizeof 'append.txt' is %ld Bytes.\n", status.st_size);
const char *hello = "hello world\n";
write(fid, hello, 12);
close(fid);
return 0;
}
And the expected result should be,
## Test loop 0
## Test loop 1
## Test loop 2
## Test loop 3
## Test loop 4
hello world
However, the result of above program is,
## Test loop 0
## Test loop 1
## Test loop 2
## Test loop 3
## Test loop 4
^@hello world
So, what's "^@"?
And the message shows,
INFO: sizeof 'append.txt' is 75 Bytes.
INFO: sizeof 'append.txt' is 76 Bytes.
Any clues?
Thanks
Quick Answer
Yes, posix_fallocate does work with files opened in APPEND mode, if your filesystem supports the fallocate system call. If your filesystem does not support it, the glibc emulation adds a single 0 byte to the end in APPEND mode.
More Information
This was a strange one and really puzzled me. I found the answer by using the strace program which shows what system calls are being made.
Check this out:
fallocate(3, 0, 74, 1000) = -1 EOPNOTSUPP (Operation not supported)
fstat(3, {st_mode=S_IFREG|0664, st_size=75, ...}) = 0
fstatfs(3, {f_type=0xf15f, f_bsize=4096, f_blocks=56777565, f_bfree=30435527, f_bavail=27551380, f_files=14426112, f_ffree=13172614, f_fsid={1863489073, -1456395543}, f_namelen=143, f_frsize=4096}) = 0
pwrite(3, "\0", 1, 1073) = 1
It looks like the GNU C Library is trying to help you here. The fallocate system call is apparently not implemented on your filesystem, so GLibC is emulating it by using pwrite to write a 0 byte out at the end of the requested allocation, thus extending the file.
This works fine in normal write mode. But in APPEND mode the write is always done at the end of the file so the pwrite writes one 0 byte at the end.
Not what was intended. Might be a GNU C Library bug.
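The append-mode behaviour described above can be observed directly. The sketch below relies on the Linux-specific rule that pwrite() on an O_APPEND descriptor ignores the offset and appends; other systems may honour the offset. The helper name and the 10-byte starting file are arbitrary.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Create `path` with 10 bytes, reopen it with O_WRONLY | extra_flags, pwrite()
// a single zero byte at `offset`, and return the resulting file size.
// Returns -1 on any error.
long size_after_pwrite(const char *path, int extra_flags, off_t offset)
{
    assert(offset >= 0);
    int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) return -1;
    bool ok = ::write(fd, "0123456789", 10) == 10;
    ::close(fd);
    if (!ok) return -1;

    fd = ::open(path, O_WRONLY | extra_flags);
    if (fd == -1) return -1;
    struct stat st {};
    ok = ::pwrite(fd, "\0", 1, offset) == 1 && ::fstat(fd, &st) == 0;
    ::close(fd);
    return ok ? static_cast<long>(st.st_size) : -1;
}
```

On Linux, passing 0 for extra_flags and an offset of 100 extends the file to 101 bytes, whereas passing O_APPEND grows it by just the one zero byte, which matches the single stray byte seen in the question.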
It looks like ext4 does support fallocate, and if I write the file into /tmp it works. It fails in my home directory because I am using an encrypted home directory in Ubuntu with the ecryptfs filesystem.
Per POSIX:
If the offset+len is beyond the current file size, then posix_fallocate() shall adjust the file size to offset+len. Otherwise, the file size shall not be changed.
So it doesn't make sense to use posix_fallocate with append mode, since it will extend the size of the file (filled with null bytes) and subsequent writes will take place after those null bytes, in space that's not yet reserved.
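One possible alternative, sketched under the assumption that the program can track its own write position: open the file without O_APPEND, reserve space past the current end, and place data with pwrite() at the remembered offset. The helper reserve_past_end is hypothetical.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: reserve `extra` bytes past the current end of `fd`,
// which must have been opened for writing WITHOUT O_APPEND.
// Returns the old size, i.e. the offset where the caller should place its
// next pwrite(), or -1 on failure.
off_t reserve_past_end(int fd, off_t extra)
{
    assert(extra > 0);
    struct stat st {};
    if (::fstat(fd, &st) == -1) return -1;
    // posix_fallocate returns an error number directly instead of setting errno.
    if (::posix_fallocate(fd, st.st_size, extra) != 0) return -1;
    return st.st_size;
}
```

The reported file size then includes the reserved region, so the logical end of the data has to be stored by the application itself rather than read back from fstat().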
As for why it's only extending the file by one byte, are you sure that's correct? Have you measured? That sounds like a bug in the implementation.