Multiple access to a file in r/w - C++

I'm planning to write a program which has to access a certain file many times for reading and writing.
So I decided to use fstream, since I can use this class for both reading and writing.
My idea is to open the file at the startup of the application and then close it when the application exits.
Since the file can be arbitrarily big, I was planning to use a "paging" structure, in which:
1) preallocate a fixed amount of memory for each page and a fixed number of pages
2) load part of the file into the first free page
3) if there is no free page, select a non-empty one according to some criterion, commit all edits in it (if there are any) and then load the part of the file into that page.
That's not so hard to code. But I was wondering if I'm going to reinvent the wheel... maybe fstream itself is written in a smart way, so that it already implements a similar paging mechanism. In that case, I would not have to take care of it myself; I could just write and read at any time.
Any suggestions?

Don't do this by yourself. Unless you are using a very exotic implementation, the fstream class already implements such a mechanism efficiently.
Check out the "Buffers and Synchronization" section of http://www.cplusplus.com/doc/tutorial/files/
There are possible issues if you are seeking into files larger than 2GB with an old kernel or an old implementation of the standard library. Check this:
Large file support in C++
or use Boost.Filesystem
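If you still want to influence the buffering yourself, a minimal sketch of handing the stream your own buffer through pubsetbuf (file name and buffer size are placeholders; note that pubsetbuf should be called before any I/O is performed on the stream, otherwise the effect is implementation-defined):
#include <fstream>
#include <vector>

int main()
{
    std::vector<char> buffer(1 << 20); // 1 MB stream buffer
    std::fstream file;
    // Set the buffer before open(); afterwards the request may be ignored.
    file.rdbuf()->pubsetbuf(buffer.data(), buffer.size());
    file.open("data.bin", std::ios::in | std::ios::out | std::ios::binary);
    // ... read and write through the stream; the library pages through `buffer`.
}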

The internal workings of the standard C++ library vary by implementation, hence a test would be needed to get some real data on your preferred platform. Generally, memory-mapped files are considered to be the fastest way to access data stored in a file (as Uflex has mentioned in his comment), but they have some drawbacks as well (see the linked wiki page). You can either use the standard (POSIX) C functions mmap() and munmap(), or the Boost C++ libraries, which also have a portable C++ interface for memory-mapped files.
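A rough sketch of the POSIX route (the file name is a placeholder and error handling is kept minimal):
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) != 0) return 1;
    // Map the whole file read-only; the kernel pages it in on demand.
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;
    const char* bytes = static_cast<const char*>(p);
    (void)bytes; // ... access bytes[0 .. st.st_size - 1] like ordinary memory ...
    munmap(p, st.st_size);
    close(fd);
}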

Related

Are the C++ Standard Library File stream operations crippled in Microsoft?

I'm asking this question because I have been working on a project that requires collecting a lot of data REALLY fast, depending on the scenario: 5.7 GBytes (with a capital BYTE) per second or 11.4 GBytes per second.
We are working with a small striped RAID array using 3 Samsung Pro NVMe drives (for 11.4 GB/s we have a larger array).
Currently, the project has been developed on Windows. I wanted to make things as portable as possible, so I focused on using the C++ Standard Library; however, no matter what I did, I could not crack transferring files faster than 1.5 GB/s.
The strategy was simple: create a couple of huge swap buffers and write them directly to disk as a huge unformatted binary file.
Using std::ofstream
and benchmarking while manually setting various buffer sizes through:
rdbuf()->pubsetbuf(buffer, BUFFER_SIZE);
open(Filename, std::ios::binary|std::ios::trunc);
followed by my managed write loop, I was able to find a sweet spot, but I was never able to crack 1.5 GB/s.
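Condensed, that setup looks roughly like this (the buffer size and names here are stand-ins for the values I swept):
#include <cstddef>
#include <fstream>
#include <vector>

constexpr std::size_t BUFFER_SIZE = 64 * 1024 * 1024; // one of the swept sizes

void write_run(const char* filename, const char* data, std::size_t size)
{
    std::vector<char> streambuf(BUFFER_SIZE);
    std::ofstream out;
    out.rdbuf()->pubsetbuf(streambuf.data(), streambuf.size());
    out.open(filename, std::ios::binary | std::ios::trunc);
    out.write(data, static_cast<std::streamsize>(size)); // 64-bit count accepted
}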
I then found the Windows SDK and its CreateFile function,
in particular, opening the file with the FILE_FLAG_NO_BUFFERING flag.
This was a game-changer: as long as I made sure I fed it sector-aligned data (in my case, everything needed to be some multiple of 512 bytes), I was suddenly able to take full advantage of the RAID array throughput.
I revisited std::ofstream in an attempt to stick with more OS-agnostic functions; however, even though one can specify a zero-size buffer for std::ofstream, there doesn't appear to be any documentation regarding the caveats of using it with no buffer.
std::ofstream allows 64-bit values for its write size, unlike the Windows SDK's WriteFile, which only accepts DWORDs; there, the maximum write size is the largest multiple of 512 one can squeeze into a uint32_t, and you must manage your writes in a loop if your file exceeds 4 GB (mine do).
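For reference, a sketch of the Win32 path as I've described it (names are simplified, and the data pointer is assumed to come from sector-aligned storage such as VirtualAlloc or _aligned_malloc, with a size that is a multiple of the sector size):
#include <windows.h>
#include <cstdint>

// Largest multiple of 512 that fits in a DWORD, as discussed above.
constexpr DWORD MAX_CHUNK = 0xFFFFFE00;

bool write_unbuffered(const wchar_t* path, const char* data, std::uint64_t size)
{
    HANDLE h = CreateFileW(path, GENERIC_WRITE, 0, nullptr, CREATE_ALWAYS,
                           FILE_FLAG_NO_BUFFERING, nullptr);
    if (h == INVALID_HANDLE_VALUE) return false;
    std::uint64_t done = 0;
    while (done < size) // files above 4 GB need this loop
    {
        DWORD chunk = (size - done < MAX_CHUNK) ? static_cast<DWORD>(size - done) : MAX_CHUNK;
        DWORD written = 0;
        if (!WriteFile(h, data + done, chunk, &written, nullptr)) break;
        done += written;
    }
    CloseHandle(h);
    return done == size;
}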
This just raises the question: is Microsoft simply not giving the C++ Standard Library Devs access to the necessary OS-level system calls to take advantage of ultra-high-speed drive arrays? Or am I missing something in how to use the C++ Standard Library to its full potential?
"is Microsoft simply not giving the C++ Standard Library Devs..."
You might notice that the product you're using is called Microsoft Visual Studio. The Standard Library developers for Visual Studio work at Microsoft, although in a different team than the Windows developers.
The reason is a bit simpler: the Visual C++ devs can't possibly know about and optimize for every possible usage scenario. It's a bit unusual to do text formatting at such high speeds. Remember, the point of ostream is to provide operator<<. ofstream is for formatted output to files. But for high-speed I/O you want binary output anyway.
To put it bluntly, the bandwidth you're aiming for is within the ballpark of the physical limits of current commodity hardware (~24 GByte/s for 16× PCIe 4.0), and in my own work I found it very challenging to reach single-core memory transfer rates above 8 GByte/s without the use of "dark magic" (aka hand-crafted assembly and optimized system-call code); it involved carefully aligning memory accesses and making use of vector extensions. But most importantly, reaching these levels of optimization requires being aware of the kind of data being processed and what kind of access patterns to expect, and/or building caching intermediaries to accommodate the underlying hardware.
Such optimizations are plainly and simply outside the scope of general-purpose standard libraries. Standard libraries must, in their implementation, adhere to the behaviours written down in the specification, and some of those requirements tend to collide with what has to be done to make the most of the underlying hardware.
So I'm sorry to tell you, but you'll probably have to bite the bullet and use the low-level system APIs directly, bypassing the standard library.

FILE I/O to in-memory I/O

I have a legacy C library which accepts a file, works on the file payload, and writes the processed payload to an output file. The functions in the library are tightly coupled with FILE, i.e., they pass FILE handles around, and the functions do file I/O to retrieve the necessary data.
I want to modify this library so that it works with in-memory data (no file I/O), i.e., pass in a binary array and get back a binary array.
I have 2 solutions in mind:
1) Implement an in-memory file module (which implements all operations of a C FILE) and override the default file operations with the new implementation using typedef or #define.
2) Pass a binary array around to all the functions of the library and retrieve the necessary data from it.
Which of these is better, or is there any better way to solve the problem?
I would suggest not changing the legacy code if any other code depends on it.
If you are building for a somewhat POSIX-compliant platform, you can use fmemopen: http://pubs.opengroup.org/onlinepubs/9699919799/functions/fmemopen.html
For Windows, maybe this might help:
C - create file in memory
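A minimal sketch of the fmemopen route (the buffer contents are just an example):
#include <stdio.h>
#include <string.h>

int main(void)
{
    char payload[] = "payload that is already in memory";
    /* Wrap the buffer in a FILE* that the legacy library can consume unchanged. */
    FILE* in = fmemopen(payload, strlen(payload), "r");
    if (!in) return 1;
    char buf[8];
    fread(buf, 1, sizeof buf, in); /* ordinary stdio calls work on it */
    fclose(in);
    return 0;
}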
I don't know the exact purpose of changing the legacy code. The issue, as I understand it, is the overhead caused by reading and writing. But there are many methods available to resolve overhead issues, such as:
As already mentioned, you can use fmemopen.
You may also use mmap; for plain read/write it should make little difference, since either way everything happens through the filesystem cache/buffers.
You can also use tmpfs to leverage memory as (temporary) file storage, also known as a RAM disk. The files are easily washed out, since they are temporary in nature anyway.
Another solution: you can use an in-memory database (TimesTen, for example).

How to encrypt data files in C++?

I am writing an app in C++, and currently my app reads models or parameters from several data files. Those files, e.g. a self-defined dictionary, are currently stored in plain text and loaded dynamically at runtime.
However, I don't want those files to be easily readable by my clients when they get the released application, so I need to encrypt the files first. What's the general practice for this situation?
Those files are huge in size, so compiling them into a resource file is not a good option.
Actually, I just need a simple 'encryption', at least so that no plain text is stored in the released version. And I don't want an encryption library that loads the whole file into memory first in order to perform decryption, since the files are huge and there is no need to load the whole body into memory at one time.
Thanks!
Usually, when you want to deal with encryption in C++, people tend to go for the OpenSSL libraries, which encompass all of the functionality in a pretty standard way.
You'd have to get yourself a copy of the library and some code samples, but it's a pretty common thing and there's lots of documentation around.
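Since you don't want to load the whole file at once, note that OpenSSL's EVP interface lets you feed data in chunks. A minimal sketch, assuming AES-256-CBC and leaving key/IV management out (the function and its file handling are illustrative, not a hardened design):
#include <openssl/evp.h>
#include <stdio.h>

/* Encrypt `in` to `out` in 4 KB chunks, so the whole file never sits in memory. */
int encrypt_stream(FILE* in, FILE* out, const unsigned char* key, const unsigned char* iv)
{
    EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();
    if (!ctx) return 0;
    EVP_EncryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, key, iv);
    unsigned char inbuf[4096], outbuf[4096 + EVP_MAX_BLOCK_LENGTH];
    int outlen;
    size_t n;
    while ((n = fread(inbuf, 1, sizeof inbuf, in)) > 0)
    {
        EVP_EncryptUpdate(ctx, outbuf, &outlen, inbuf, (int)n);
        fwrite(outbuf, 1, outlen, out);
    }
    EVP_EncryptFinal_ex(ctx, outbuf, &outlen); /* flush the final padded block */
    fwrite(outbuf, 1, outlen, out);
    EVP_CIPHER_CTX_free(ctx);
    return 1;
}
Decryption works the same way with the EVP_Decrypt* counterparts, again chunk by chunk.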

Simple API for random access into a compressed data file

Please recommend a technology suitable for the following task.
I have a rather big (500MB) data chunk, which is basically a matrix of numbers. The data entropy is low (it should be well-compressible) and the storage is expensive where it sits.
What I am looking for is to compress it with a good compression algorithm (like, say, gzip) with markers that would enable very occasional random access, as in "read a byte from location [64-bit address] in the original (uncompressed) stream". This is a little different from the classic deflate libraries like zlib, which let you decompress the stream only continuously. What I would like is random access at a latency of, say, as much as 1 MB of decompression work per byte read.
Of course, I hope to use an existing library rather than reinvent the NIH wheel.
If you're working in Java, I just published a library for that: http://code.google.com/p/jzran.
Byte Pair Encoding allows random access to data.
You won't get as good compression with it, but you're sacrificing adaptive (variable) hash trees for a single tree, which is what makes the random access possible.
However, you'll still need some kind of index in order to find a particular "byte". Since you're okay with 1 MB of latency, you'll be creating an index for every 1 MB. Hopefully you can figure out a way to make your index small enough to still benefit from the compression.
One of the benefits of this method is random access editing too. You can update, delete, and insert data in relatively small chunks.
If it's accessed rarely, you could compress the index with gzip and decode it when needed.
If you want to minimize the work involved, I'd just break the data into 1 MB (or whatever) chunks, then put the pieces into a PKZIP archive. You'd then need a tiny bit of front-end code to take a file offset, and divide by 1M to get the right file to decompress (and, obviously, use the remainder to get to the right offset in that file).
Edit: Yes, there is existing code to handle this. Recent versions of Info-ZIP's unzip (6.0 is current) include api.c. Among other things, that includes UzpUnzipToMemory -- you pass it the name of a ZIP file and the name of one of the files in that archive that you want to retrieve, and you get back a buffer holding the contents of that file. For updating, you'll need api.c from zip 3.0, using ZpInit and ZpArchive (though these aren't quite as simple to use as the unzip side).
Alternatively, you can just run a copy of zip/unzip in the background to do the work. This isn't quite as neat, but undoubtedly a bit simpler to implement (as well as allowing you to switch formats pretty easily if you choose).
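The front-end arithmetic is nothing more than a division and a remainder; a sketch (chunk size and naming scheme are placeholders):
#include <cstdint>
#include <string>
#include <utility>

constexpr std::uint64_t CHUNK = 1 << 20; // 1 MB per archived piece

// Map an offset in the original stream to (chunk file name, offset within it).
std::pair<std::string, std::uint64_t> locate(std::uint64_t offset)
{
    return { "chunk_" + std::to_string(offset / CHUNK), offset % CHUNK };
}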
Take a look at my project - csio. I think it is exactly what you are looking for: stdio-like interface and multithreaded compressor included.
It is a library, written in C, which provides a CFILE structure and the functions cfopen, cfseek, cftello, and others. You can use it with regular (not compressed) files and with files compressed with the help of the dzip utility. This utility is included in the project and written in C++. It produces a valid gzip archive, which can be handled by standard utilities as well as by csio. dzip can compress in many threads (see the -j option), so it can compress very big files very fast.
Typical usage:
dzip -j4 myfile
...
CFILE* file = cfopen("myfile.dz", "r"); /* open the dzip-compressed file */
off_t some_offset = 673820;
cfseek(file, some_offset);              /* seek in uncompressed coordinates */
char buf[100];
cfread(buf, 100, 1, file);
cfclose(file);
It is MIT-licensed, so you can use it in your projects without restrictions. For more information, visit the project page on GitHub: https://github.com/hoxnox/csio
Compression algorithms usually work in blocks, I think, so you might be able to come up with something based on block size.
I would recommend using the Boost Iostreams Library. Boost.Iostreams can be used to create streams to access TCP connections or as a framework for cryptography and data compression. The library includes components for accessing memory-mapped files, for file access using operating system file descriptors, for code conversion, for text filtering with regular expressions, for line-ending conversion and for compression and decompression in the zlib, gzip and bzip2 formats.
The Boost library has been accepted by the C++ standards committee as part of TR2, so it will eventually be built into most compilers (under std::tr2::sys). It is also cross-platform compatible.
Boost Releases
Boost Getting Started Guide. NOTE: Only some parts of boost::iostreams are header-only; the rest require separately-compiled library binaries or special treatment when linking.
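For example, a gzip stream can be read through a filtering stream like this (the file name is a placeholder; the gzip filter is one of the parts that needs the compiled boost_iostreams library and zlib). Be aware that a plain filtering stream gives sequential decompression, so you would still need a chunking or marker scheme on top for random access:
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <fstream>
#include <iostream>

int main()
{
    std::ifstream file("data.gz", std::ios::binary);
    boost::iostreams::filtering_istream in;
    in.push(boost::iostreams::gzip_decompressor()); // decompress on the fly
    in.push(file);                                  // the underlying source
    char buf[4096];
    while (in.read(buf, sizeof buf) || in.gcount() > 0)
        std::cout.write(buf, in.gcount());
}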
Sort the big file first.
Divide it into chunks of your desired size (1 MB) with some sequence in the name (File_01, File_02, ..., File_NN).
Take the first ID from each chunk plus the filename and put both into another (index) file.
Compress the chunks.
You will then be able to search the index file using whatever method you wish, maybe a binary search, and open each chunk file as you need it (see the sketch below).
If you need deep indexing, you could use a B-tree algorithm where the "pages" are the files.
Several implementations of this exist on the web, because the code is a little tricky.
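A sketch of that binary search over the index (the structure and names are hypothetical; it assumes the index is sorted by first ID and the queried ID is not smaller than the first entry's):
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

struct IndexEntry
{
    long long first_id;   // first ID stored in the chunk
    std::string filename; // e.g. "File_01"
};

// Return the chunk whose ID range contains `id`.
const std::string& find_chunk(const std::vector<IndexEntry>& index, long long id)
{
    auto it = std::upper_bound(index.begin(), index.end(), id,
        [](long long value, const IndexEntry& e) { return value < e.first_id; });
    return std::prev(it)->filename; // last chunk whose first_id <= id
}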
You could use bzip2 and make your own API pretty easily, based on James Taylor's seek-bzip2.

How to create a virtual file?

I'd like to simulate a file without writing it to disk. I have a file at the end of my executable and I would like to give its path to a dll. Of course, since it doesn't have a real path, I have to fake it.
I first tried using named pipes under Windows to do it. That would allow for a path like \\.\pipe\mymemoryfile, but I can't make it work, and I'm not sure the dll would support a path like this.
Second, I found CreateFileMapping and GetMappedFileName. Can they be used to simulate a file inside a fragment of another one? I'm not sure that's what this API does.
What I'm trying to do seems similar to BoxedApp. Any ideas about how they do it? I suppose it's something like API interception (like Detours), but that would be a lot of work. Is there another way to do it?
Why? I'm interested in this specific solution because I'd like to hide the data, and for the benefit of distributing only one file, but also for the geeky reasons of making it work that way ;)
I agree that copying the data to a temporary file would work and be a much easier solution.
Use BoxedApp and do not worry.
You can store the data in an NTFS stream. That way you can get a real path pointing to your data that you can give to your dll in the form of
x:\myfile.exe:mystreamname
This works precisely like a normal file; however, it only works if the file system used is NTFS. This is standard under Windows nowadays, but it is of course not an option if you want to support older systems or would like to be able to run this from a USB stick or similar. Note that any streams present in a file will be lost if the file is sent as an attachment in mail or simply copied from an NTFS partition to a FAT32 partition.
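For what it's worth, a short sketch of writing and reading such a stream (the stream name is as in the example path above; this assumes the executable sits on NTFS):
#include <fstream>

int main()
{
    // On NTFS, a colon in the path addresses an alternate data stream of the file.
    std::ofstream out("myfile.exe:mystreamname", std::ios::binary);
    out << "hidden payload";
    out.close();
    // The same path reads the stream back like an ordinary file.
    std::ifstream in("myfile.exe:mystreamname", std::ios::binary);
}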
I'd say that the most compatible way would be to write your data to an actual file, though you could of course do it one way on NTFS systems and another on FAT systems; I recommend against that because of the added complexity. The appropriate way would be to distribute your files separately, of course, but since you've indicated that you don't want this, you should in that case write the data to a temporary file and give the dll the path to that file. Make sure you write the temporary file to the user's temp directory (you can find the path using GetTempPath in C/C++).
Your other option would be to write a filesystem filter driver, but that is a road I strongly advise against. That sort of defeats the purpose of using a single file as well...
Also, in case you want only a single file for distribution, how about using a zip file or an installer?
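A sketch of the temp-file route mentioned above, using GetTempPath together with GetTempFileName (error handling omitted; the prefix is arbitrary):
#include <windows.h>

int main()
{
    wchar_t dir[MAX_PATH], path[MAX_PATH];
    GetTempPathW(MAX_PATH, dir);            // the user's temp directory
    GetTempFileNameW(dir, L"dat", 0, path); // creates a unique empty file
    // ... write the embedded data to `path` and hand `path` to the dll ...
}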
Pipes are for communication between processes running concurrently. They don't store data for later access, and they don't have the same semantics as files (you can't seek or rewind a pipe, for instance).
If you're after file-like behaviour, your best bet will always be to use a file. Under Windows, you can pass FILE_ATTRIBUTE_TEMPORARY to CreateFile as a hint to the system to avoid flushing data to disk if there's sufficient memory.
If you're worried about the performance hit of writing to disk, the above should be sufficient to avoid the performance impact in most cases. (If the system is low enough on memory to force the file data out to disk, it's probably also swapping heavily anyway -- you've already got a performance problem.)
If you're trying to avoid writing to disk for some other reason, can you explain why? In general, it's quite hard to stop data from ever hitting the disk -- the user can always hibernate the machine, for instance.
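A sketch of that hint in use (FILE_FLAG_DELETE_ON_CLOSE is an addition beyond what's described above, but pairs naturally with it so the file vanishes when the handle closes):
#include <windows.h>

int main()
{
    HANDLE h = CreateFileW(L"scratch.tmp", GENERIC_READ | GENERIC_WRITE, 0,
                           nullptr, CREATE_ALWAYS,
                           FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE,
                           nullptr);
    if (h == INVALID_HANDLE_VALUE) return 1;
    DWORD written = 0;
    // The cache manager tries to keep this data in memory rather than flush it.
    WriteFile(h, "data", 4, &written, nullptr);
    CloseHandle(h); // with DELETE_ON_CLOSE the file is removed here
}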
Since you don't have control over the DLL, you have to assume that the DLL expects an actual file. It probably makes that assumption at some point, which is why named pipes are failing on you.
The simplest solution is to create a temporary file in the temp directory, write the data from your EXE to the temp file and then delete the temporary file.
Is there a reason you are embedding this "pseudo-file" at the end of your EXE instead of just distributing it with your application? You are obviously already distributing this third-party DLL with your application, so one more file doesn't seem like it is going to hurt you.
Another question: will this data be changing? That is, are you expecting to write data back to this "pseudo-file" in your EXE? I don't think that will work well. Standard users may not have write access to the EXE, and that would probably drive anti-virus software nuts.
And no, CreateFileMapping and GetMappedFileName definitely won't work, since they don't give you a file name that can be passed to CreateFile. If you could somehow get this DLL to accept a HANDLE, then that would work.
And I wouldn't even bother with API interception. Just hand the DLL a path to an actual file.
Reading your question made me think: if you can pretend an area of memory is a file and have a kind of "virtual path" to it, then this would allow loading a DLL directly from memory, which is what LoadLibrary forbids by design by asking for a path name. And this is why people write their own PE loader when they want to achieve that.
I would say you can't achieve what you want with file mapping: the purpose of file mapping is to treat a portion of a file as if it were physical memory, and you want the reciprocal.
Using Detours implies that you would have to replicate everything the intercepted DLL function does, except obtaining data from a real file; hence, it's not generic. Or, even more intricate: let's pretend the DLL uses fopen; then you provide your own fopen that detects a special pattern in the path, and you mimic the C runtime internals... Hmm, is it really worth all the pain? :D
Please explain why you can't extract the data from your EXE and write it to a temporary file. Many applications do this -- it's the classic solution to this problem.
If you really must provide a "virtual file", the cleanest solution is probably a filesystem filter driver. "clean" doesn't mean "good" -- a filter is a fully documented and supported solution, so it's cleaner than API hooking, injection, etc. However, filesystem filters are not easy.
OSR Online is the best place to find Windows filesystem information. The NTFSD mailing list is where filesystem developers hang out.
How about using some sort of RAM disk and writing the file to that disk? I have tried some RAM disks myself, though I never found a good one; tell me if you are successful.
Well, if you need to have the virtual file allocated in your exe, you will need to create a vector, stream, or char array big enough to hold all of the virtual data you want to write.
That is the only solution I can think of that avoids any I/O to disk (even if you don't write to a file).
If you need to keep a file-like path syntax, just write a class that mimics that behaviour, and instead of writing to a file, write to your memory buffer. It's as simple as it gets. Remember KISS.
Cheers
Open the file called "NUL:" for writing. It's writable, but the data are silently discarded. Kinda like /dev/null of *nix fame.
You cannot memory-map it though. Memory-mapping implies read/write access, and NUL is write-only.
I'm guessing that this dll can't take a stream? It's almost too simple to ask, BUT if it can, you could just use that.
Have you tried using the \\?\ prefix when using named pipes? Many APIs support using \\?\ to pass the remainder of the path directly through without any parsing/modification.
http://msdn.microsoft.com/en-us/library/aa365247(VS.85).aspx
Why not just add it as a resource - http://msdn.microsoft.com/en-us/library/7k989cfy(VS.80).aspx - the same way you would add an icon?