How do I read from a binary file (bytes, also known as unsigned char in C++) into a vector, skipping the first value, which is an unsigned 32-bit integer? The first value is the size of the vector.
The first value is also the size of the whole file.
You could try something like this:
uint32_t data_size = 0;
data_file.read((char *) &data_size, sizeof(data_size));
std::vector<uint8_t> data(data_size);
data_file.read((char *) &data[0], data_size);
The above code fragment first reads the size or quantity of the data from the file.
A std::vector is created, using the quantity value that was read in.
Finally, the data is read into the vector.
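For reference, a slightly more complete sketch of the same idea, with the file opened in binary mode and the reads checked (the file name here is only a placeholder):

#include <cstdint>
#include <fstream>
#include <vector>

int main()
{
    std::ifstream data_file("data.bin", std::ifstream::binary); // hypothetical file name

    uint32_t data_size = 0;
    if (!data_file.read(reinterpret_cast<char*>(&data_size), sizeof(data_size)))
        return 1; // could not read the size prefix

    std::vector<uint8_t> data(data_size);
    if (!data_file.read(reinterpret_cast<char*>(data.data()), data.size()))
        return 1; // the file is shorter than the size prefix claims

    // ... use data ...
}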
Edit 1: Memory-mapped files
You may want to consider opening the data file as a memory-mapped file. This is where the operating system treats the file as memory: you don't have to store the data in memory or read it in explicitly. Since memory-mapped file APIs vary among operating systems, you'll have to search your operating system's API to find out how to use the feature.
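On POSIX systems, for example, a read-only mapping of this data file might look roughly like the sketch below (the file name is a placeholder and error checking is omitted):

#include <cstdint>
#include <cstring>
#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, munmap
#include <sys/stat.h>  // fstat
#include <unistd.h>    // close

int main()
{
    int fd = open("data.bin", O_RDONLY);   // hypothetical file name
    struct stat st;
    fstat(fd, &st);

    // Map the whole file; the OS pages it in on demand instead of copying it.
    auto* base = static_cast<uint8_t*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));

    uint32_t data_size;
    std::memcpy(&data_size, base, sizeof(data_size)); // the 4-byte size prefix
    const uint8_t* bytes = base + sizeof(data_size);  // the payload follows it

    // ... use bytes[0 .. data_size - 1] ...

    munmap(base, st.st_size);
    close(fd);
}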
Related
I have a big file (let's assume I can make it binary) that cannot fit in RAM, and I want to sort the numbers in it. In the process I need to read/write a large amount of numbers from/to the file (from/to a vector<int> or int[]) quickly, so I'd like not to read/write them one by one but in blocks of a fixed size. How can I do it?
I have a big file (let's assume I can make it binary) that cannot fit in RAM, and I want to sort the numbers in it.
Given that the file is binary, perhaps the simplest and presumably most efficient solution is to memory-map the file. Unfortunately there is no standard interface to perform memory mapping; on POSIX systems, there is the mmap function.
Now, the memory-mapped file is simply an array of raw bytes. Treating it as an array of integers is technically not allowed until C++20, which introduces implicit object creation for such C-style patterns. In practice, this already works on most current language implementations.[1]
For this reinterpretation to work, the representation of the integers in the file must match the representation of integers used by the CPU. The file will not be portable to the same program running on other, incompatible systems.
We can simply use std::sort on this array. The operating system should take care of paging the file in and out of memory. The algorithm used by std::sort isn't necessarily optimised for this use case however. To find the optimal algorithm, you may need to do some research.
[1] In case pre-C++20 standard conformance is a concern, it is possible to iterate over the array, copy the underlying bytes of each element into an integer, and placement-new an integer object into that memory using the copied value. A compiler can optimise these operations into zero instructions, and doing this makes the program's behaviour well defined.
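Putting the pieces of this approach together, a rough POSIX-only sketch (assuming the file contains nothing but raw native-endian ints, and with error handling omitted) might look like this:

#include <algorithm>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    int fd = open("numbers.bin", O_RDWR);  // hypothetical file of raw ints
    struct stat st;
    fstat(fd, &st);

    // Map the file read/write and shared, so the sorted result lands back in the file.
    void* base = mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    // Reinterpreting the mapping as ints relies on C++20 implicit object creation,
    // or on the de-facto behaviour of current implementations, as discussed above.
    int* numbers = reinterpret_cast<int*>(base);
    std::size_t count = st.st_size / sizeof(int);

    std::sort(numbers, numbers + count);   // the OS pages the file in and out as needed

    munmap(base, st.st_size);
    close(fd);
}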
You can use ostream::write to write into a file and istream::read to read from a file.
To make the process clean, it is good to store the number of items in the file as well.
Let's say you have a vector<int>.
You can use the following code to write its contents to a file.
std::vector<int> myData;
// .. Fill up myData;
// Open a file to write to, in binary mode.
std::ofstream out("myData.bin", std::ofstream::binary);
// Write the size first.
auto size = myData.size();
out.write(reinterpret_cast<char const*>(&size), sizeof(size));
// Write the data.
out.write(reinterpret_cast<char const*>(myData.data()), sizeof(int)*size);
You can read the contents of such a file using the following code.
std::vector<int> myData;
// Open the file to read from, in binary mode.
std::ifstream in("myData.bin", std::ifstream::binary);
// Read the size first.
auto size = myData.size(); // gives size the same type that was used when writing
in.read(reinterpret_cast<char*>(&size), sizeof(size));
// Resize myData so it has enough space to read into.
myData.resize(size);
// Read the data.
in.read(reinterpret_cast<char*>(myData.data()), sizeof(int)*size);
If not all of the data can fit into RAM, you can read and write the data in smaller chunks. However, if you read/write them in smaller chunks, I don't know how you would sort them.
I did not know binary files could be read with mmap(). I used to think mmap() could only be used for IPC (interprocess communication) in Linux, to exchange data between unrelated processes.
Can someone explain how files are read with mmap()? I heard it has huge advantage when binary files are randomly accessed.
Well, mmapping a file is done the same way as it is done for IPC or for mapping anonymous memory. In the case of mapping anonymous memory, the parts that have not been written to will be filled with zeroed pages on demand.
In the case of a mapped file, the pages that correspond to the file's contents are read upon access (and upon writes too) from the file or the buffer cache. Reading or writing outside the size of the file will result in SIGBUS. Essentially, the pointer returned by mmap can be treated much like the pointer returned by malloc, except that, up to the size of the mapping (or up to end-of-file), the bytes within the mapping are automatically read from, and possibly written back to, the backing file transparently.
Example:
fd = open("input.txt", O_RDWR, 0666);
fstat(fd, &stat_result);
char* contents = mmap(0, stat_result->st_size,
PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
(error checking omitted)
After executing that, you can consider contents as pointing to the first byte of a character array of stat_result.st_size characters. You can use it just like an ordinary array, and the operating system will transparently write back the changes into the file.
With mmap, the operating system will have a better view of which parts of the file should be kept in memory / buffer cache and which shouldn't.
If I have a huge file (e.g. 1 TB, or any size that does not fit into RAM; the file is stored on disk) that is delimited by spaces, and my RAM is only 8 GB, can I read that file with ifstream? If not, how do I read a block of the file (e.g. 4 GB)?
There are a couple of things that you can do.
First, there's no problem opening a file that is larger than the amount of RAM that you have. What you won't be able to do is copy the whole file live into your memory. The best thing would be for you to find a way to read just a few chunks at a time and process them. You can use ifstream for that purpose (with ifstream.read, for instance). Allocate, say, one megabyte of memory, read the first megabyte of that file into it, rinse and repeat:
ifstream bigFile("mybigfile.dat");
constexpr size_t bufferSize = 1024 * 1024;
unique_ptr<char[]> buffer(new char[bufferSize]);
while (bigFile)
{
bigFile.read(buffer.get(), bufferSize);
// process data in buffer
}
Another solution is to map the file to memory. Most operating systems will allow you to map a file to memory even if it is larger than the physical amount of memory that you have. This works because the operating system knows that each memory page associated with the file can be mapped and unmapped on-demand: when your program needs a specific page, the OS will read it from the file into your process's memory and swap out a page that hasn't been used in a while.
However, this can only work if the file is smaller than the maximum amount of memory that your process can theoretically use. This isn't an issue with a 1TB file in a 64-bit process, but it wouldn't work in a 32-bit process.
Also be aware of the spirits that you're summoning. Memory-mapping a file is not the same thing as reading from it. If the file is suddenly truncated by another program, your program is likely to crash. If you modify the data, it's possible that you will run out of memory if you can't save back to the disk. Also, your operating system's algorithm for paging memory in and out may not behave in a way that benefits you significantly. Because of these uncertainties, I would consider mapping the file only if reading it in chunks using the first solution cannot work.
On Linux/OS X, you would use mmap for it. On Windows, you would open a file and then use CreateFileMapping then MapViewOfFile.
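For reference, the Windows side might look roughly like this sketch (read-only, error checking omitted, file name is a placeholder):

#include <windows.h>

int main()
{
    HANDLE file = CreateFileA("mybigfile.dat", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);

    // Map the whole file; the view is just a pointer to its bytes.
    const char* data = static_cast<const char*>(MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));

    // ... read the file's contents through data ...

    UnmapViewOfFile(data);
    CloseHandle(mapping);
    CloseHandle(file);
}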
I am sure you don't have to keep the whole file in memory. Typically one wants to read and process the file in chunks. If you want to use ifstream, you can do something like this:
ifstream is("/path/to/file");
char buf[4096];
do {
is.read(buf, sizeof(buf));
process_chunk(buf, is.gcount());
} while(is);
A more advanced approach is, instead of reading the whole file or its chunks into memory, to map it into memory using platform-specific APIs:
Under Windows: CreateFileMapping(), MapViewOfFile()
Under Linux: open(2) / creat(2), shm_open, mmap
You will need to compile a 64-bit app to make it work.
For more details, see here: CreateFileMapping, MapViewOfFile, how to avoid holding up the system memory
You can use fread:
char buffer[size];                      // note: in standard C++, size must be a compile-time constant here
fread(buffer, sizeof(char), size, fp);  // read size elements of one byte each from fp
Or, if you want to use C++ fstreams you can use read as buratino said.
Also keep in mind that you can open a file regardless of its size; the idea is to open it and read it in chunks that fit in your RAM.
I need to parse binary files which contain a sequence of elements. The format of an element is as follows:
4 bytes: name of element
4 bytes: size of the element
variable size: data for the element
I just need to parse through the file and extract the name, position and size of each element. Typical element size is around 100 KB, and typical file size is around 10 GB.
What is the fastest way of going through such a file? Read all of the file's data, seek to the next element, other approach?
Does it make a difference if the file is local or over the network?
The one thing you do not want to do is to use unbuffered reads (i.e. OS calls) to read every individual element.
You can get OK performance with the naive approach of buffered reads. If memory is not a concern whatsoever, you might squeeze out some time by using memory-mapped files and having a pre-fetcher thread populate the mapping.
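As a rough illustration of the buffered approach (an ifstream is buffered), you can read just the two 4-byte headers of each element and seek past its data. The file name and the assumption that the size field is stored in native byte order are mine, not the question's:

#include <cstdint>
#include <fstream>
#include <iostream>

int main()
{
    std::ifstream in("elements.bin", std::ifstream::binary); // hypothetical file name

    char name[4];
    uint32_t size = 0;
    while (in.read(name, sizeof(name)) &&
           in.read(reinterpret_cast<char*>(&size), sizeof(size))) // assumes native byte order
    {
        std::streamoff pos = in.tellg();      // the element's data starts here
        std::cout.write(name, sizeof(name));
        std::cout << " at offset " << pos << ", " << size << " bytes\n";
        in.seekg(size, std::ifstream::cur);   // skip over the element's data
    }
}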
Essentially I need to implement a program to act as a user-space file system that implements very simple operations such as viewing what is on the disk, copying files between the native file system and my file system (which is contained in a single file called "disk01"), and deleting files from my file system.
I'm basically looking for a springboard or some hint as to where I can start, as I'm unsure how to create my own "disk" and put other files inside it. This is a homework assignment.
Just a C++ student looking for some direction.
Edit:
I know this is a concept that is already used in several different places as "VFSs" or virtual file systems, kind of like zip files (you can only view the contents through a program that can handle zip files). I'm basically trying to write my own program similar to zip or winrar or whatever, but not nearly as complicated or feature-rich.
Thank you for your suggestions so far! You all are a tremendous help!
It's very simple to create a FAT-like disk structure.
At a fixed location in the file, most likely at the start, you have a structure with some information about the disk. Then follows the "FAT", a table of simple structures detailing the files on the disk. This is basically a fixed-size table of structures similar to this:
struct FATEntry
{
char name[20]; /* Name of file */
uint32_t pos; /* Position of file on disk (sector, block, something else) */
uint32_t size; /* Size in bytes of file */
uint32_t mtime; /* Time of last modification */
};
After this table you have a fixed-size area used for a bitmap of free blocks on the disk. If the filesystem can grow or shrink dynamically, the bitmap might not be needed. This is then followed by the actual file data.
For a system like this, all files must be laid out contiguously on the disk. This will lead to fragmentation as you add, remove, and resize files.
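To make that concrete, a sketch of loading such a table from the disk image could look like this (the image name "disk01" comes from the question; the table size and its offset are assumptions made to match the layout described above):

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <vector>

struct FATEntry
{
    char name[20];
    uint32_t pos;
    uint32_t size;
    uint32_t mtime;
};

int main()
{
    const std::size_t kMaxFiles = 64;           // assumed fixed number of table entries
    const std::streamoff kTableOffset = 512;    // assumed: the disk-info structure fills the first block

    std::fstream disk("disk01", std::fstream::in | std::fstream::out | std::fstream::binary);
    disk.seekg(kTableOffset);

    std::vector<FATEntry> table(kMaxFiles);
    disk.read(reinterpret_cast<char*>(table.data()), table.size() * sizeof(FATEntry));

    // table[i].name / pos / size / mtime now describe the files on the "disk".
}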
Another way is to use the linked-list approach, used for example on the old Amiga filesystems. Using this scheme all blocks are simply linked lists.
Like before you need a structure for the actual disk data, and possibly a bitmap showing free/allocated disk blocks. The only field needed in the disk-data structure is a "pointer" to the first file.
By pointer I mean an integer pointing out the location on disk of a block.
The files themselves can be similar to the above FAT-like system:
struct FileNode
{
char name[12]; /* Name of file */
uint32_t next; /* Next file, zero for last file */
uint32_t prev; /* Previous file, zero for first file */
uint32_t data; /* Link to first data block */
uint32_t mtime; /* Last modification time */
uint32_t size; /* Size in bytes of the file */
};
The data blocks are themselves linked lists:
struct DataNode
{
uint32_t next; /* Next data block for file, zero for last block */
char data[BLOCK_SIZE - 4]; /* Actual data, -4 for the block link */
};
The good thing about a linked-list filesystem is that it will never be fragmented. The drawbacks are that you might have to jump all over the disk to fetch data blocks, and that the data blocks can't be used in full as they need at least one link to the next data block.
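Below is a sketch of how a file's contents would be read back under this scheme; the block size, the block-index arithmetic, and the helper names are illustrative assumptions:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <vector>

const std::size_t BLOCK_SIZE = 512;      // assumed block size

struct DataNode
{
    uint32_t next;                       // next data block for the file, zero for last block
    char data[BLOCK_SIZE - 4];           // actual data, -4 for the block link
};

// Hypothetical helper: read the data block stored at block index 'block' of the image.
DataNode readBlock(std::fstream& disk, uint32_t block)
{
    DataNode node;
    disk.seekg(static_cast<std::streamoff>(block) * BLOCK_SIZE);
    disk.read(reinterpret_cast<char*>(&node), sizeof(node));
    return node;
}

// Follow the chain starting at 'first', collecting 'size' bytes of file data.
std::vector<char> readFile(std::fstream& disk, uint32_t first, uint32_t size)
{
    std::vector<char> contents;
    for (uint32_t block = first; block != 0 && contents.size() < size; )
    {
        DataNode node = readBlock(disk, block);
        std::size_t want = std::min<std::size_t>(sizeof(node.data), size - contents.size());
        contents.insert(contents.end(), node.data, node.data + want);
        block = node.next;
    }
    return contents;
}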
A third way, common in Unix-like systems, is to have the file node contain a set of links to the data blocks. Then the data blocks don't have to be stored contiguously on the disk. It shares some of the drawbacks of the linked-list method, in that blocks could be stored all over the disk and the maximum size of a file is limited. One advantage is that the data blocks can be fully utilized.
Such a structure could look like
struct FileNode
{
char name[16]; /* Name of file */
uint32_t size; /* Size in bytes of file */
uint32_t mtime; /* Last modification time of file */
uint32_t data[26]; /* Array of data-blocks */
};
The above structure limits the maximum file-size to 26 data blocks.
Open a file for non-destructive read/write. With an fstream this might be fstream stream(filename, ios::in | ios::out | ios::binary).
Then use the seek functions to move around in it. With a C++ fstream this is stream.seekg(position) for reads and stream.seekp(position) for writes.
Then you'd want binary read and write functions so you'd use stream.read(buffer, len) and stream.write(buffer, len).
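A minimal sketch of those pieces together (the file name, offsets, and block size are placeholders):

#include <fstream>

int main()
{
    // Open the disk image for non-destructive binary read/write.
    std::fstream stream("disk01", std::fstream::in | std::fstream::out | std::fstream::binary);

    char buffer[512];
    stream.seekg(1024);                    // move the read position to byte 1024
    stream.read(buffer, sizeof(buffer));   // read one 512-byte block

    stream.seekp(1024);                    // move the write position back to the same spot
    stream.write(buffer, sizeof(buffer));  // write the block back unchanged
}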
An easy way to start a filesystem is to decide on a block size. Most people used 512 bytes in the old days. You could do that, or use 4K, or make it completely adjustable. Then you set aside a block near the beginning for a free-space map. This can be a bit per block or, if you are lazy, one byte per block. Then after that you have a root directory. FAT did this the easy way: it's just a list of names, some metadata like timestamps, file size and block offsets. I think FAT blocks had a pointer to the next block in the file, so that it could fragment a file without needing to run a defrag while writing.
So then you search the directory, find the file, go to the offset and read the blocks.
Where real filesystems get complicated is in the hard tasks: allocating blocks for files so they have room to grow at the end without wasting space, handling fragmentation, keeping good performance while multiple threads or programs write at the same time, and recovering robustly in the face of unexpected disk errors or power loss.