I'm trying to read a file and find out how many lines are in it.
These snippets omit the `if` error-handling checks for brevity.
...
int fd = -1;
struct stat buff;
char * logbuff;
fd = open("/home/path/to/test", O_RDONLY, 0);
fstat(fd, &buff);
logbuff =
(char*)mmap(
NULL, //OS chooses address
buff.st_size, //Size of file
PROT_READ, //Read only
MAP_ANON|MAP_SHARED, //copy on write
fd,
0
);
const char * ch = &logbuff[0];
int newlines = 0;
for ( off_t i = 0; i < buff.st_size; i++ ) {
if ( ch[i] == '\n' ) {
newlines++;
cout << ch[i];
}
}
cout << "lines: " << newlines << endl;
I get 'lines: 0'
Let's read the documentation for MAP_ANON and MAP_ANONYMOUS:
MAP_ANON
Synonym for MAP_ANONYMOUS. Deprecated.
MAP_ANONYMOUS
The mapping is not backed by any file; its contents are initialized to zero.
The fd and offset arguments are ignored; however, some implementations require
fd to be -1 if MAP_ANONYMOUS (or MAP_ANON) is specified, and portable applications
should ensure this. The use of MAP_ANONYMOUS in conjunction with MAP_SHARED is
only supported on Linux since kernel 2.4.
So we can see that when we use MAP_ANON, the fd parameter is ignored! What you are actually looking for is MAP_PRIVATE:
MAP_PRIVATE
Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.
Therefore, the call to mmap should be:
logbuff = static_cast<char*>(mmap(NULL, buff.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
Additionally, you'll want to verify that the call to mmap() succeeds. To do that, make sure that the returned pointer is not MAP_FAILED.
Related
I am mapping a huge file to avoid my app thrashing to main virtual memory, and to be able to run the app with more than the RAM I have. The code is c++ but partly follows old c APIs. When I work with the allocated pointer, the memory does get backed to the file as desired. However, when I run the app next time, I want the memory to be read from this same file which already has the prepared data. For some reason, on the next run, I read back all zeros. What am I doing wrong? Is it the ftruncate call? Is it the fopen call with wrong flag? Is it the mmap flags?
int64_t mmbytes = 1LL << 36; // note: a plain int literal (1 << 36) would overflow
FILE *file = fopen(filename, "w+");
int fd = fileno(file);
int r = ftruncate(fd, mmbytes );
if (file == NULL || r){
perror("Failed: ");
throw std::runtime_error(std::strerror(errno));
} //
if ((mm = mmap(0, mmbytes,
PROT_READ | PROT_WRITE, MAP_FILE | MAP_SHARED, fd, 0)) == MAP_FAILED)
{
fprintf(stderr,"mmap error for output, errno %d\n", errno);
exit(-1);
}
}
FILE *file = fopen(filename, "w+");
I refer you to fopen's manual page, which describes "w+" as follows:
w+ Open for reading and writing. The file is created if it does
not exist, otherwise it is truncated. The stream is positioned
at the beginning of the file.
I specifically draw your attention to the "it is truncated" part. In other words, if there's anything in an existing file this ends up nuking it from high orbit.
Depending on what else you're doing, "a" (append) may work better.
Even better would be to forget fopen entirely, and simply use open:
int fd=open(filename, O_RDWR|O_CREAT, 0666);
There's your file descriptor, without jumping through any hoops. The file gets created, and left untouched if it already exists.
I'm working on a benchmark program. Upon making the read() system call, the program appears to hang indefinitely. The target file is 1 GB of binary data and I'm attempting to read directly into buffers that can be 1, 10 or 100 MB in size.
I'm using std::vector<char> to implement dynamically-sized buffers and handing off &vec[0] to read(). I'm also calling open() with the O_DIRECT flag to bypass kernel caching.
The essential coding details are captured below:
std::string fpath{"/path/to/file"};
size_t tries{};
int fd{};
while (errno == EINTR && tries < MAX_ATTEMPTS) {
fd = open(fpath.c_str(), O_RDONLY | O_DIRECT | O_LARGEFILE);
tries++;
}
// Throw exception if error opening file
if (fd == -1) {
ostringstream ss {};
switch (errno) {
case EACCES:
ss << "Error accessing file " << fpath << ": Permission denied";
break;
case EINVAL:
ss << "Invalid file open flags; system may also not support O_DIRECT flag, required for this benchmark";
break;
case ENAMETOOLONG:
ss << "Invalid path name: Too long";
break;
case ENOMEM:
ss << "Kernel error: Out of memory";
}
throw invalid_argument {ss.str()};
}
size_t buf_sz{1024*1024}; // 1 MiB buffer
std::vector<char> buffer(buf_sz); // Creates vector pre-allocated with buf_sz chars (bytes)
// Result is 0-filled buffer of size buf_sz
auto bytes_read = read(fd, &buffer[0], buf_sz);
Poking through the executable with gdb shows that buffers are allocated correctly, and the file I've tested with checks out in xxd. I'm using g++ 7.3.1 (with C++11 support) to compile my code on a Fedora Server 27 VM.
Why is read() hanging on large binary files?
Edit: Code example updated to more accurately reflect error checking.
There are multiple problems with your code.
This code will never work properly if errno ever has a value equal to EINTR:
while (errno == EINTR && tries < MAX_ATTEMPTS) {
fd = open(fpath.c_str(), O_RDONLY | O_DIRECT | O_LARGEFILE);
tries++;
}
That code won't stop when the file has been successfully opened: once errno is EINTR it keeps looping, reopening the file over and over and leaking a file descriptor on every iteration.
This would be better:
do
{
fd = open(fpath.c_str(), O_RDONLY | O_DIRECT | O_LARGEFILE);
tries++;
}
while ( ( -1 == fd ) && ( EINTR == errno ) && ( tries < MAX_ATTEMPTS ) );
Second, as noted in the comments, O_DIRECT can impose alignment restrictions on memory. You might need page-aligned memory:
So
size_t buf_sz{1024*1024}; // 1 MiB buffer
std::vector<char> buffer(buf_sz); // Creates vector pre-allocated with buf_sz chars (bytes)
// Result is 0-filled buffer of size buf_sz
auto bytes_read = read(fd, &buffer[0], buf_sz);
becomes
size_t buf_sz{1024*1024}; // 1 MiB buffer
// page-aligned buffer (note: the offset argument is 0, not NULL, and the result is a void*)
void *buffer = mmap( 0, buf_sz, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0 );
auto bytes_read = read(fd, buffer, buf_sz);
Note also that the Linux implementation of O_DIRECT can be very dodgy. It's been getting better, but there are still potential pitfalls that aren't very well documented at all. Along with alignment restrictions, if the last amount of data in the file isn't a full page, for example, you may not be able to read it if the filesystem's implementation of direct IO doesn't allow you to read anything but full pages (or some other block size). Likewise for write() calls - you may not be able to write just any number of bytes, you might be constrained to something like a 4k page.
This is also critical:
Most examples of read() hanging appear to be when using pipes or non-standard I/O devices (e.g., serial). Disk I/O, not so much.
Some devices simply do not support direct IO. They should return an error, but again, the O_DIRECT implementation on Linux can be very hit-or-miss.
Pasting your program and running it on my Linux system gave a working, non-hanging program.
The most likely cause of the failure is that the file is not a regular file-system object, or that the underlying hardware is not working.
Try with a smaller size to confirm, and try on a different machine to help diagnose the problem.
My complete code (with no error checking)
#include <vector>
#include <string>
#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
int main( int argc, char ** argv )
{
std::string fpath{"myfile.txt" };
auto fd = open(fpath.c_str(), O_RDONLY | O_DIRECT | O_LARGEFILE);
size_t buf_sz{1024*1024}; // 1 MiB buffer
std::vector<char> buffer(buf_sz); // Creates vector pre-allocated with buf_sz chars (bytes)
// Result is 0-filled buffer of size buf_sz
auto bytes_read = read(fd, &buffer[0], buf_sz);
}
myfile.txt was created with
dd if=/dev/zero of=myfile.txt bs=1024 count=1024
If the file is not 1 MiB in size, the read may fail.
If the file is a pipe, the read can block until data is available.
Most examples of read() hanging appear to be when using pipes or non-standard I/O devices (e.g., serial). Disk I/O, not so much.
The O_DIRECT flag is useful for filesystems and block devices. With this flag people normally map pages into the user space.
For sockets, pipes and serial devices it is plain useless, because the kernel does not cache that data.
Your updated code hangs because fd is initialized to 0, which is STDIN_FILENO: since errno is not EINTR at startup, the while loop never executes, the file is never opened, and read() ends up blocking on standard input.
I've searched MSDN but did not find any information about sharing the same HANDLE between both WriteFile and ReadFile. NOTE: I did not use the CREATE_ALWAYS flag, so there's no chance of the file being replaced with an empty one.
The reason I tried to use the same HANDLE is performance: my code downloads some data (writes it to a file), reads it back immediately, then deletes it.
In my mind, a file HANDLE is just a memory address that serves as an entry point for doing I/O.
This is how the error occurs:
CreateFile(OK) --> WriteFile(OK) --> GetFileSize(OK) --> ReadFile(Failed) --> CloseHandle(OK)
Since WriteFile was called synchronously, there should be no problem with this ReadFile call; even GetFileSize after WriteFile returns the correct value (the new, modified file size). But in fact ReadFile reads the pre-modification state (lpNumberOfBytesRead is always the old value). A thought just came to my mind: caching!
Then I tried to learn more about Windows File Caching which I have no knowledge with. I even tried Flag FILE_FLAG_NO_BUFFERING, and FlushFileBuffers function but no luck. Of course I know I can do CloseHandle and CreateFile again between WriteFile and ReadFile, I just wonder if there's some possible way to achieve this without calling CreateFile again?
Above is the minimum about my question, down is the demo code I made for this concept:
int main()
{
HANDLE hFile = CreateFile(L"C://temp//TEST.txt", GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL| FILE_FLAG_WRITE_THROUGH, NULL);
//step one write 12345 to file
std::string test = "12345";
char * pszOutBuffer;
pszOutBuffer = (char*)malloc(strlen(test.c_str()) + 1); //create buffer for 12345 plus a null terminator
ZeroMemory(pszOutBuffer, strlen(test.c_str()) + 1); //zero the buffer, including the terminator
memcpy(pszOutBuffer, test.c_str(), strlen(test.c_str())); //copy 12345 into the buffer
DWORD wmWritten;
WriteFile(hFile, pszOutBuffer, strlen(test.c_str()), &wmWritten, NULL); //write 12345 to file
//according to msdn this refresh the buffer
FlushFileBuffers(hFile);
std::cout << "bytes writen to file(num):"<< wmWritten << std::endl; //got output 5 here as expected; 5 bytes have been written to the file
//step two getfilesize and read file
//get file size of C://temp//TEST.txt
DWORD dwFileSize = 0;
dwFileSize = GetFileSize(hFile, NULL);
if (dwFileSize == INVALID_FILE_SIZE)
{
return -1; //unable to get filesize
}
std::cout << "GetFileSize result is:" << dwFileSize << std::endl; //got output 5 here as expected
char * bufFstream;
bufFstream = (char*)malloc(sizeof(char)*(dwFileSize + 1)); //create buffer with filesize & a null terminator
memset(bufFstream, 0, sizeof(char)*(dwFileSize + 1));
std::cout << "created a buffer for ReadFile with size:" << dwFileSize + 1 << std::endl; //got output 6 as expected here
if (bufFstream == NULL) {
return -1;//ERROR_MEMORY;
}
DWORD nRead = 0;
bool bBufResult = ReadFile(hFile, bufFstream, dwFileSize, &nRead, NULL); //dwFileSize is 5 here
if (!bBufResult) {
free(bufFstream);
return -1; //copy file into buffer failed
}
std::cout << "nRead is:" << nRead << std::endl; //!!!got nRead 0 here!!!? why?
CloseHandle(hFile);
free(pszOutBuffer);
free(bufFstream);
return 0;
}
then the output is:
bytes writen to file(num):5
GetFileSize result is:5
created a buffer for ReadFile with size:6
nRead is:0
nRead should be 5 not 0.
Win32 files have a single file pointer, both for read and write; after the WriteFile it is at the end of the file, so if you try to read from it it will fail. To read what you just wrote you have to reposition the file pointer at the start of the file, using the SetFilePointer function.
Also, the FlushFileBuffers call isn't needed - the operating system ensures that reads and writes on the file handle see the same state, regardless of the status of the buffers.
After the first write, the file cursor points at the end of the file, so there is nothing left to read. You can rewind it back to the beginning using SetFilePointer:
::DWORD const result(::SetFilePointer(hFile, 0, nullptr, FILE_BEGIN));
if(INVALID_SET_FILE_POINTER == result)
{
::DWORD const last_error(::GetLastError());
if(NO_ERROR != last_error)
{
// TODO do error handling...
}
}
When you try to read the file, from what position are you reading?
The FILE_OBJECT maintains a "current" position (the CurrentByteOffset member), which is used as the default position (for synchronous files only, i.e. those opened without FILE_FLAG_OVERLAPPED!) when you read or write the file. This position is advanced by n bytes after every read or write of n bytes.
The best solution is always to use an explicit file offset with ReadFile (or WriteFile). The offset goes in the last parameter, lpOverlapped - look at the Offset and OffsetHigh members of OVERLAPPED: the read operation starts at the offset that is specified in the OVERLAPPED structure.
This is more efficient and simpler than the special API call SetFilePointer, which adjusts CurrentByteOffset in the FILE_OBJECT (and does not work for asynchronous file handles, i.e. those created with FILE_FLAG_OVERLAPPED).
Despite a very common misconception, OVERLAPPED is not only for asynchronous I/O: it is simply an additional parameter to ReadFile (or WriteFile) and can always be used, with any file handle.
I have two code samples, the first is the following:
//THIS CODE READS IN THE CALC.EXE BINARY INTO MEMORY BUFFER USING ISTREAM
ifstream in("calc.exe", std::ios::binary | std::ios::ate);
int size = in.tellg();
char* buffer = new char[size];
ifstream input("calc.exe", std::ios::binary);
input.read(buffer, size);
This is the second:
//THIS CODE GETS FILE MAPPING IMAGE OF SAME BINARY
handle = CreateFile("calc.exe", GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, 0, NULL);
mappinghandle = CreateFileMapping(handle, NULL, PAGE_READONLY, 0, 0, NULL);
image = (DWORD) MapViewOfFile(mappinghandle, FILE_MAP_READ, 0, 0, 0);
My question is, what exactly is the difference between these two methods? If we are ignoring the sizing issue that is better handled by file mapping, are these two objects returned essentially the same? Will not the image variable point to essentially the same thing as the buffer variable - this being an image of the binary executable in memory? What are all the differences between the two?
The method using std::ifstream actually copies the file's data into RAM when input.read() is called, whereas MapViewOfFile doesn't read any file data at that point.
MapViewOfFile() returns a pointer, but the data will only actually be read from disk when you access the virtual memory that the pointer points to.
// This creates just a "view" of the file but doesn't read data from the file.
const char *buffer = reinterpret_cast<const char*>( MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0) );
// Only at this point the file actually gets read. The virtual memory manager now
// reads a page of 4KiB into memory.
char value = buffer[ 10 ];
To further illustrate the difference, say we read a byte at offset 12345 from the memory-mapped file:
char value = buffer[ 12345 ];
Now the virtual memory manager won't read all the data up to this offset; instead it will map only the page containing that offset into memory. That would be the page located between offset 12288 (=4096*3) and 16384 (=4096*4).
The first one reads from the file into a buffer, the buffer is independent of the original file.
The second one accesses the file itself: you won't be able to delete the file while the mapping exists, and although you can't make changes through a read-only mapping, changes made to the file outside your program can become visible in your mapping.
The following is the code I am using to mmap a file on Ubuntu with huge pages, but the call fails with "invalid argument". However, when I pass the
MAP_ANON flag with no file descriptor, mmap succeeds. I cannot work out the reason behind this.
Secondly, I don't understand why mmaping a file is allowed with MAP_PRIVATE, when that flag means no change will be written back to the file. Couldn't this always be accomplished with MAP_ANON, or is there something I am missing?
Can someone help me with these?
int32_t main(int32_t argc, char** argv) {
int32_t map_length = 16*1024*1024; // 16 MB , huge page size is 2 MB
int32_t protection = PROT_READ | PROT_WRITE;
int32_t flags = MAP_SHARED | MAP_HUGETLB;
int32_t file__ = open("test",O_RDWR|O_CREAT | O_LARGEFILE,S_IRWXU | S_IRGRP | S_IROTH);
if(file__ < 0 ) {
std::cerr << "Unable to open file\n";
return -1;
}
if (ftruncate(file__, map_length) < 0) {
std::cerr
<< "main :: unable to truncate the file\n"
<< "main :: " << strerror(errno) << "\n"
<< "main :: error number is " << errno << "\n";
return -1;
}
void *addr= mmap(NULL, map_length, protection, flags, file__, 0);
if (addr == MAP_FAILED) {
perror("mmap");
return -1;
}
const char* msg = "Hello World\n";
int32_t len = strlen(msg);
memcpy(addr,msg,len);
munmap(addr, map_length);
close(file__);
return 0;
}
Both your questions come down to the same point: Using mmap() you can obtain two kinds of mappings: anonymous memory and files.
Anonymous memory is (as stated in the man page) not backed by any file in the file system. Instead the memory you get back from an MAP_ANON call to mmap() is plain system memory. The main user of this interface is the C library which uses it to obtain backing storage for malloc/free. So, using MAP_ANON is explicitly saying that you don't want to map a file.
File-backed memory kind of blends in a file (or portions of it) into the address space of your application. In this case, the memory content is actually backed by the file's content. Think of the MAP_PRIVATE flag as first allocating memory for the file and then copying the content into this memory. In truth this will not be what the kernel is doing, but let's just pretend.
HUGE_TLB is a feature the kernel provides for anonymous memory (see Documentation/vm/hugetlbpage.txt as referenced in the mmap() man page). This should be the reason for your mmap() call failing when using HUGETLB for a file. *Edit: not completely correct. There is a RAM file system (hugetlbfs) that does support huge pages. However, huge_tlb mappings won't work on arbitrary files, as I understand the docs.*
For details on how to use HUGE_TLB and the corresponding in-memory file system (hugetlbfs), you might want to consider the following articles on LWN:
Huge Pages, Part 1 (Intro)
Huge Pages, Part 2 (Interfaces)
Huge Pages, Part 3 (Administration)
Huge Pages, Part 4 (Benchmarking)
Huge Pages, Part 5 (TLB costs)
Adding MAP_PRIVATE to the flags fixed this for me.