Providing data directly instead of a file path - C++

I am dealing with a closed-source library which needs some data to be passed to it in order to work. This data is around 150 MB. I have this data loaded in memory at the moment I initialize the main class of this lib, which has the following constructor:
Foo::Foo(const std::string path_to_data_file);
Foo accepts the data as a file, and there is no other overload that accepts the data directly (as a string or byte array...).
The only possible way to call the library in my case is to write the data I have to disk and then pass the path of the file to the library, which is a very bad idea.
Is there any technique to pass some kind of virtual path to the library that will result in reading the data from memory directly instead of from disk?
In other words, I am looking for (if it exists or is even possible) some technique that creates a virtual file that leads to a memory address rather than a physical address on the disk.
I know that the right solution is to edit the library and isolate the data layer from the processing layer. However, this is not possible, at least for now.
Edit:
The solution should be cross-platform. However, I can guess that such problems are usually OS-dependent, so I am looking for Linux and Windows solutions.
The library does some computer vision work, and the data is a kind of trained model.

It is probably operating-system specific. But you could put the data into some RAM- or virtual-memory-based filesystem like tmpfs.
Then you don't need to change the library; you just pass it some file in a tmpfs file system.
BTW, on some OSes, if you have recently written a file, it sits in the page cache (so it is in RAM).
Notice also that reading 150 MB should not take long. If you can't put it on some tmpfs or RAM disk, at least try to use an SSD.
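A minimal sketch of this idea, assuming /dev/shm is a tmpfs mount (true on most Linux distributions; the helper name and the fallback to /tmp are mine):

```cpp
#include <fstream>
#include <string>
#include <sys/stat.h>

// Write an in-memory blob to a tmpfs-backed path so a path-only API can read
// it without the data ever hitting a physical disk.  /dev/shm is assumed to
// be tmpfs; we fall back to /tmp (which may be disk-backed) if it is missing.
std::string write_to_tmpfs(const std::string& data, const std::string& name) {
    struct stat st;
    std::string dir = (stat("/dev/shm", &st) == 0) ? "/dev/shm" : "/tmp";
    std::string path = dir + "/" + name;
    std::ofstream out(path, std::ios::binary);
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
    return path;
}
```

The questioner's Foo could then be constructed with the returned path and the file removed afterwards. Windows has no tmpfs as such, though a RAM-disk driver or a file created with the FILE_ATTRIBUTE_TEMPORARY hint (which encourages the OS to keep it in cache) gets close.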

Related

RAM consumption on opening a file

I have a Binary file of ~400MB which I want to convert to CSV format. The output CSV file will be ~1GB (according to my calculations).
I read the binary file and store it in an array of structures (required for other processing too), and when the user wants to export it to CSV, I am creating a file (or opening an existing file - depending on the user's choice), opening it using fopen and then writing to it using fwrite, line by line.
Coming to my question, this link from CPlusPlus.com says:
The returned stream is fully buffered by default if it is known to not
refer to an interactive device
My query is: when I open this file, will it be loaded into RAM? Like when, at the end, my file is ~1 GB, will it consume that much RAM or will it just be on the hard disk?
This code will run on Windows as well as Android.
FILE* stream buffering is a C feature, used to reduce system-call overhead (i.e. not calling read() for each fgetc(), which is expensive). Usually the buffer is small, e.g. 512 bytes.
The page cache and similar mechanisms are different beasts -- they are used to reduce the number of disk operations. Usually the operating system uses free memory to cache data previously read from or written to disk, so that subsequent operations use RAM.
If there is a shortage of free memory, data is evicted from the page cache.
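To make the distinction concrete, here is a sketch in which the stdio buffer is set explicitly with setvbuf. Only this buffer occupies the process's memory, no matter how large the output file grows:

```cpp
#include <cstdio>
#include <vector>

// Write CSV lines through a stdio stream with an explicit 64 KB buffer.
// The full file contents go to disk (transiting the OS page cache); the
// process itself only ever holds this one buffer.
long write_csv(const char* path, int nlines) {
    FILE* f = std::fopen(path, "wb");
    if (!f) return -1;
    std::vector<char> buf(64 * 1024);
    std::setvbuf(f, buf.data(), _IOFBF, buf.size());   // fully buffered, 64 KB
    for (int i = 0; i < nlines; ++i)
        std::fprintf(f, "%d,%d,%d\n", i, i * 2, i * 3);
    long bytes = std::ftell(f);
    std::fclose(f);   // flushes; buf must outlive the stream, which it does here
    return bytes;
}
```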
It is operating system, file system, and computer specific. And it might not matter that much. Read about the page cache.
BTW, you might be interested by sqlite
From an application writer point of view, you should care more about virtual memory and address space of your process than about RAM. Physical RAM is managed by the operating system.
On Linux and Android, if you want to optimize that you might consider (later) using posix_fadvise(2) and perhaps madvise(2). I'm not sure it is worth the pain in your case (since a gigabyte file is not that much today).
I read the binary file and store it in an array of structures (required for other processing too), and when the user wants to export it to CSV
Reading per se doesn't use a lot of memory; as myaut says, the buffer is small. The elephant in the room here is: do you read the whole file and put all the data into structures, or do you start processing after one or a few reads, once you have the minimum amount of data needed? Doing the former will indeed use ~400 MB or more of memory; doing the latter will use quite a lot less. That said, it all depends on how much data is needed to start processing, and maybe you do need all the data loaded at once.
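A sketch of the batched ("latter") approach, using a hypothetical record layout since the question doesn't show the real one:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical fixed-size record; the actual structure is not shown in the question.
struct Record { int32_t id; float value; };

// Convert binary records to CSV in batches of 4096 records, so peak memory is
// one batch (~32 KB here) instead of the whole ~400 MB file.
bool convert_to_csv(const char* in_path, const char* out_path) {
    FILE* in = std::fopen(in_path, "rb");
    if (!in) return false;
    FILE* out = std::fopen(out_path, "wb");
    if (!out) { std::fclose(in); return false; }
    std::vector<Record> batch(4096);
    size_t n;
    while ((n = std::fread(batch.data(), sizeof(Record), batch.size(), in)) > 0)
        for (size_t i = 0; i < n; ++i)
            std::fprintf(out, "%d,%g\n", batch[i].id, batch[i].value);
    std::fclose(in);
    std::fclose(out);
    return true;
}
```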

Read with File Mapping Objects in C++

I am trying to use Memory Mapped File (MMF) to read my .csv data file (very large and time consuming).
I've heard that MMF is very fast since it caches the content of the file, so users can access the content on disk as if it were in memory.
May I know if MMF is any faster than using other reading methods?
If this is true, can anyone show me a simple example how to read a file from disk?
Many thanks in advance.
May I know if MMF is any faster than using other reading methods?
If you're reading the entire file sequentially in one pass, then a memory-mapped file is probably approximately the same as using conventional file I/O.
can anyone show me a simple example how to read a file from disk?
Memory mapped files are typically an operating system feature, so you'd have to tell us which platform you're on to get an example of using it.
If you want to read a file sequentially, you can use the C++ ifstream class or the C run-time functions like fopen, fread, and fclose.
Whether it's faster or not depends on many different factors (such as what data you are accessing, how you are accessing it, etc.). To determine what is right for YOUR case, you need to benchmark different solutions and see which is best.
The main benefit of memory mapped files is that the data can be copied directly from the filesystem into the user-accessible memory.
In traditional file reading (fstream::read(), fread(), etc.), the content of the file is read into a temporary buffer in the OS, then (part of) that buffer is copied to the user-supplied buffer. This is because the OS can't rely on the user's memory being there, and things get messy pretty quickly otherwise. For memory-mapped files, the OS knows directly where the memory for the different sections of the file is (because it's the OS's task to assign that memory and keep track of it), so it can copy the data straight in.
However, I strongly suspect that the method of reading the file is a minor part, and that the actual interpretation/parsing/copying of the data may well be the larger part. [Speculation: we haven't seen your code, of course.] And of course, the I/O speed of the disk itself may be a large factor if the file is very large.
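For the POSIX case, a minimal read-through-mmap sketch (Windows would use CreateFileMapping/MapViewOfFile instead; the helper name is mine):

```cpp
#include <fcntl.h>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Read a whole file through a read-only, private memory mapping.
std::string read_via_mmap(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return "";
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return ""; }
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       // the mapping stays valid after close
    if (p == MAP_FAILED) return "";
    std::string data(static_cast<const char*>(p), st.st_size);
    munmap(p, st.st_size);
    return data;
}
```

Copying into a std::string here throws away the zero-copy benefit; real code would parse directly out of the mapped region.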

Execute a process from memory within another process?

I would like to have a small "application loader" program that receives other binary application files over TCP from an external server and runs them.
I could do this by saving the transmitted file to the hard disk and using the system() call to run it. However, I am wondering if it would be possible to launch the new application from memory without it ever touching the hard drive.
The state of the loader application does not matter after loading a new application. I prefer to stick to C, but C++ solutions are welcome as well. I would also like to stick to standard Linux C functions and not use any external libraries, if possible.
Short answer: no.
Long answer: It's possible but rather tricky to do this without writing it out to disk. You can theoretically write your own ELF loader that reads the binary, maps some memory, handles the dynamic linking as required, and then transfers control, but that's an awful lot of work that's hardly ever going to be worth the effort.
The next best solution is to write it to disk and call unlink ASAP. The disk doesn't even have to be "real" disk, it can be tmpfs or similar.
The alternative I've been using recently is to not pass complete compiled binaries around, but to pass LLVM bytecode instead, which can then be JIT'd/interpreted/saved as fit. This also has the advantage of making your application work in heterogeneous environments.
It may be tempting to try a combination of fmemopen, fileno and fexecve, but this won't work for two reasons:
From fexecve() manpage:
"The file descriptor fd must be opened read-only, and the caller must have permission to execute the file that it refers to"
I.e. it needs to be an fd that refers to a real file.
From fmemopen() manpage:
"There is no file descriptor associated with the file stream returned by these functions (i.e., fileno(3) will return an error if called on the returned stream)"
Much easier than doing it in C would be to set up a tmpfs file system. You'd have all the advantages of the interface of a hard disk; from your program / server / whatever you could just do an exec. These virtual filesystems are quite efficient nowadays; there would really be just one copy of the executable, in the page cache.
As Andy points out, for such a scheme to be efficient you'd have to ensure that you don't use buffered writes to the file, but that you "write" (in a broader sense) directly in place:
- Know in advance how large your executable will be
- Create a file on your tmpfs
- Scale it to that size with ftruncate
- Map that file into memory with mmap to obtain the address of a buffer
- Pass that address directly to the recv call to write the data in place
- munmap the file
- Call exec with the file
- rm the file; this can be done even while the executable is still running
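The steps above can be sketched as follows. memcpy stands in for recv, the exec/unlink steps are left as a comment (they require a real executable payload), and the path is assumed to live on a tmpfs mount such as /dev/shm:

```cpp
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Size the file, map it, and write the payload directly into the mapping.
bool write_in_place(const char* path, const void* payload, size_t size) {
    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0700);
    if (fd < 0) return false;
    if (ftruncate(fd, size) != 0) { close(fd); return false; }
    void* addr = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { close(fd); return false; }
    std::memcpy(addr, payload, size);   // in real code: recv(sock, addr, size, ...)
    munmap(addr, size);
    close(fd);
    return true;   // then: fork + execve(path, ...), and unlink(path)
}
```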
You might want to look at and reuse UPX, which decompresses the executable to memory, and then transfers control to ld-linux to start it.

What free, tiniest flash file system would you advise for an embedded system?

I've got a TMS320VC5509A DSP and an AT26DF321 NOR flash on the board, and I'm going to store named data on the flash. I don't need a directory hierarchy or wear leveling (I hope the system will write to flash very few times), but CRC is strongly needed.
Thank you
You could look at the ELM Petit FAT File System Module for a good small file system implementation. I'm not sure it has CRC, but you can add that in your low-level hardware drivers.
On a NOR flash, especially one that also contains my boot code and application, I generally avoid the overhead of a formal file system. Instead, I store each "interesting" object starting at an erase block boundary, and beginning with a header structure that at minimum holds the object size and a checksum. Adding a name or resource ID to the header is a natural extension.
The boot loader looks for a valid application by verifying the checksum before using the block. Similarly, other resources can be confirmed to be valid before use.
It also makes it easy for a firmware update utility to validate the object before erasing and programming it to the FLASH.
A pool of small resources might be best handled by wrapping it in a container for flashing. If the runtime resources support it, I would be tempted to use ZIP to wrap the files, wrapping the image of the ZIP archive in a size and checksum header and storing it at an erase block boundary. If you can't afford the decompression runtime, it is still possible to use ZIP with uncompressed files, or to use a simpler format such as tar.
Naturally, the situation is very different for a NAND flash. There, I would strongly recommend picking an established (commercial or open source) filesystem designed for the quirks of NAND flash.
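As an illustration of that header scheme, here is a sketch in which the field layout and the rotate-xor checksum are my own stand-ins (a real design would likely use CRC-32):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical object header stored at each erase-block boundary.
struct ObjectHeader {
    uint32_t magic;      // marks a valid object, e.g. 0x4F424A31 ("OBJ1")
    uint32_t size;       // payload bytes following the header
    uint32_t id;         // resource ID / name hash
    uint32_t checksum;   // checksum of the payload
};

// Simple rotate-xor checksum; stand-in for a real CRC-32.
uint32_t simple_checksum(const uint8_t* data, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; ++i)
        sum = (sum << 1 | sum >> 31) ^ data[i];
    return sum;
}

// What a boot loader would do before trusting an object in flash.
bool object_valid(const ObjectHeader& h, const uint8_t* payload) {
    return h.magic == 0x4F424A31u && simple_checksum(payload, h.size) == h.checksum;
}
```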

Temp file that exists only in RAM?

I'm trying to write an encryption tool using the OTP (one-time pad) method. In keeping with the security theory, I need the plain-text documents to be stored only in memory and never written to a physical drive. The tmpnam function appears to be what I need, but from what I can see it saves the file on the disk and not in RAM.
Using C++, is there any (platform-independent) method that allows a file to exist only in RAM? I would like to avoid using a RAM-disk method if possible.
Thanks
Edit:
Thanks, it's more just a learning thing for me. I'm new to encryption and just working through different methods; I don't actually plan on using many of them (especially OTP, due to it doubling the original file size because of the "pad").
If I'm totally honest, I'm a Linux user, so ditching Windows wouldn't be too bad. I'm looking into using RAM disks for now, as FUSE seems a bit overkill for a "learning" thing.
The simple answer is: no, there is no platform-independent way. Even if you keep the data only in memory, it still risks being swapped out to disk by the virtual memory manager.
On Windows, you can use VirtualLock() to force the memory to stay in RAM. You can also use CryptProtectMemory() to prevent other processes from reading it.
On POSIX systems (e.g. BSD, Linux) you can use mlock() to lock memory in RAM.
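A minimal POSIX sketch of that advice (the helper name is mine; mlock can fail when RLIMIT_MEMLOCK is exhausted, so its result should be checked, and an unlocked buffer may still be swapped out):

```cpp
#include <cstring>
#include <sys/mman.h>

// Try to pin the buffer in RAM, use the secret, then wipe it before unlocking.
// Returns whether the lock actually succeeded.
bool use_secret_locked(char* buf, size_t len, const char* secret) {
    bool locked = (mlock(buf, len) == 0);
    std::strncpy(buf, secret, len - 1);
    buf[len - 1] = '\0';
    // ... encrypt/decrypt with the pad here ...
    std::memset(buf, 0, len);        // wipe plaintext regardless of lock status
    if (locked) munlock(buf, len);
    return locked;
}
```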
Not really, unless you count in-memory streams (like std::stringstream).
No, especially and specifically for security purposes: any piece of data can be swapped to disk on virtual-memory systems.
Generally, if you are concerned about security, you have to use platform-specific methods for controlling access: what good is keeping your data in RAM if everyone can read it?
You might want to look at TrueCrypt's source code. Getting code at the file system level might be your best bet.
OTP is an awful encryption method for arbitrary files, unless you have a massive amount of entropy that you can guarantee never repeats itself (that's why it's called "one-time"!)
If you want to create a file-like object that only exists in memory and you don't care about Windows, I'd look at writing a custom FUSE filesystem (http://fuse.sourceforge.net/); this way you guarantee what will and will not get written to disk, and your files are accessible by all programs.
Using std::stringstream or fmemopen will get you file-like access to blocks of memory. If (for security) you want to avoid the data being swapped out, use mlock, which is probably easier to use with fmemopen's buffer than with std::stringstream. Combining mlock with std::stringstream would probably need to be done via a custom allocator (used as a template parameter).
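That combination might look like this (POSIX-only, since Windows lacks fmemopen; the helper names are mine):

```cpp
#include <cstdio>
#include <cstring>
#include <sys/mman.h>

// mlock the backing buffer so it stays out of swap, then let fmemopen wrap it
// in a FILE* for ordinary stdio-style access.
FILE* open_locked_memfile(char* buf, size_t len) {
    if (mlock(buf, len) != 0) return nullptr;  // lock failed: refuse to proceed
    std::memset(buf, 0, len);
    return fmemopen(buf, len, "w+");
}

void close_locked_memfile(FILE* f, char* buf, size_t len) {
    std::fclose(f);
    std::memset(buf, 0, len);                  // wipe the plaintext
    munlock(buf, len);
}
```

Note that stdio buffers writes, so an fflush is needed before the bytes are guaranteed to land in the locked buffer.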