Essentially I need to implement a program that acts as a user-space file system implementing very simple operations, such as viewing what is on the disk, copying files between the native file system and my file system (which is contained in a single file called "disk01"), and deleting files from my file system.
I'm basically looking for a springboard or some hint as to where I can start, as I'm unsure how to create my own "disk" and put other files inside it, and this is a homework assignment.
Just a C++ student looking for some direction.
Edit:
I know this is a concept that is already used in several different places as "VFSs" or virtual file systems, kind of like zip files (you can only view the contents through a program that can handle zip files). I'm basically trying to write my own program similar to zip or winrar or whatever, but not nearly as complicated and feature-rich.
Thank you for your suggestions so far! You all are a tremendous help!
It's very simple to create a FAT-like disk structure.
At a fixed location in the file, most likely at the start, you have a structure with some information about the disk. Then follows the "FAT", a table of simple structures detailing the files on the disk. This is basically a fixed-size table of structures similar to this:
struct FATEntry
{
    char     name[20];  /* Name of file */
    uint32_t pos;       /* Position of file on disk (sector, block, something else) */
    uint32_t size;      /* Size in bytes of file */
    uint32_t mtime;     /* Time of last modification */
};
After this table you have a fixed-sized area used for a bitmap of free blocks on the disk. If the filesystem can grow or shrink dynamically then the bitmap might not be needed. This is then followed by the actual file data.
For a system like this, all files must be laid out contiguously on the disk. This will lead to fragmentation as you add, remove, and resize files.
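A minimal sketch of what the fixed disk-information structure at the start of the file might look like (the field names, sizes and magic value are illustrative assumptions, not a standard format):

#include <cstdint>

struct DiskHeader
{
    char     magic[8];      /* identifies the disk image, e.g. "MYFS0001"  */
    uint32_t block_size;    /* size of one block in bytes                  */
    uint32_t block_count;   /* total number of data blocks                 */
    uint32_t fat_entries;   /* number of FATEntry records that follow      */
    uint32_t data_start;    /* byte offset where the data area begins      */
};

The header is written once at offset 0 of "disk01"; the FAT table, the free-block bitmap and finally the data area follow it at offsets computed from these fields.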
Another way is to use the linked-list approach, used for example by the old Amiga filesystems. In this scheme the disk blocks form simple linked lists.
Like before you need a structure for the actual disk data, and possibly a bitmap showing free/allocated disk blocks. The only field needed in the disk-data structure is a "pointer" to the first file.
By pointer I mean an integer pointing out the location on disk of a block.
The files themselves can be similar to the above FAT-like system:
struct FileNode
{
    char     name[12];  /* Name of file */
    uint32_t next;      /* Next file, zero for last file */
    uint32_t prev;      /* Previous file, zero for first file */
    uint32_t data;      /* Link to first data block */
    uint32_t mtime;     /* Last modification time */
    uint32_t size;      /* Size in bytes of the file */
};
The data blocks are themselves linked lists:
struct DataNode
{
    uint32_t next;                  /* Next data block for file, zero for last block */
    char     data[BLOCK_SIZE - 4];  /* Actual data, -4 for the block link */
};
The good thing about a linked-list filesystem is that it will never be fragmented. The drawbacks are that you might have to jump all over the disk to fetch data blocks, and that the data blocks can't be used in full as they need at least one link to the next data block.
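As an illustration, reading a whole file in this scheme is just a walk along the chain of data blocks. This is only a sketch, assuming the FileNode and DataNode layouts above and a hypothetical helper loadDataBlock() that reads one block from the disk image:

#include <cstdint>
#include <string>

/* Hypothetical helper: load the block with the given block number from the
   disk image into 'node'. Its implementation depends on your disk layout. */
void loadDataBlock(uint32_t blockNo, DataNode &node);

/* Follow the chain of DataNodes and append their payload to a string. */
std::string readWholeFile(const FileNode &file)
{
    std::string out;
    out.reserve(file.size);

    uint32_t remaining = file.size;
    uint32_t blockNo   = file.data;            /* first data block */
    while (blockNo != 0 && remaining > 0)
    {
        DataNode node;
        loadDataBlock(blockNo, node);

        uint32_t chunk = (remaining < sizeof(node.data)) ? remaining
                                                         : (uint32_t)sizeof(node.data);
        out.append(node.data, chunk);
        remaining -= chunk;
        blockNo = node.next;                   /* zero marks the last block */
    }
    return out;
}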
A third way, common in Unix-like systems, is to have the file node contain a set of links to the data blocks. The data blocks then don't have to be stored contiguously on the disk. It shares some of the drawbacks of the linked-list method, in that blocks can end up scattered all over the disk, and the maximum size of a file is limited. One advantage is that the data blocks can be fully utilized.
Such a structure could look like this:
struct FileNode
{
    char     name[16];  /* Name of file */
    uint32_t size;      /* Size in bytes of file */
    uint32_t mtime;     /* Last modification time of file */
    uint32_t data[26];  /* Array of data-blocks */
};
The above structure limits the maximum file-size to 26 data blocks.
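With this layout, finding the byte on disk that corresponds to a given offset within the file is just an array lookup. A minimal sketch, assuming BLOCK_SIZE and the start of the data area come from your own disk header (both are assumptions):

#include <cstdint>

/* Hypothetical helper: translate an offset within the file into the byte
   offset within the disk image that holds it. Returns 0 on failure. */
uint64_t fileOffsetToDiskOffset(const FileNode &file, uint32_t offset,
                                uint64_t dataStart /* start of data area */)
{
    uint32_t blockIndex = offset / BLOCK_SIZE;      /* which of the 26 slots */
    if (blockIndex >= 26 || offset >= file.size)
        return 0;

    uint32_t blockNo = file.data[blockIndex];       /* disk block holding it */
    return dataStart + (uint64_t)blockNo * BLOCK_SIZE + offset % BLOCK_SIZE;
}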
Open the file for non-destructive read/write in binary mode. With an fstream this might be fstream stream(filename, ios::in | ios::out | ios::binary).
Then use the seek functions to move around in it. If you are using a C++ fstream this is stream.seekg(position) for reading and stream.seekp(position) for writing.
Then you'd want binary read and write functions so you'd use stream.read(buffer, len) and stream.write(buffer, len).
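Putting those pieces together, a minimal sketch for reading and writing one block of the disk image could look like this; BLOCK_SIZE, the "disk01" name and the lack of error handling are all simplifications:

#include <cstddef>
#include <fstream>
#include <vector>

const std::size_t BLOCK_SIZE = 512;   /* illustrative block size */

/* Open the disk image for non-destructive read/write in binary mode. */
std::fstream disk("disk01", std::ios::in | std::ios::out | std::ios::binary);

void readBlock(std::size_t blockNo, std::vector<char> &buffer)
{
    buffer.resize(BLOCK_SIZE);
    disk.seekg(blockNo * BLOCK_SIZE);
    disk.read(buffer.data(), BLOCK_SIZE);
}

void writeBlock(std::size_t blockNo, const std::vector<char> &buffer)
{
    disk.seekp(blockNo * BLOCK_SIZE);
    disk.write(buffer.data(), BLOCK_SIZE);
}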
An easy way to start a filesystem is to decide on a block size. Most people used 512 bytes in the old days. You could do that, use 4K, or make it completely adjustable. Then you set aside a block near the beginning for a free-space map. This can be a bit per block or, if you are lazy, one byte per block. After that you have a root directory. FAT did this an easy way: it's just a list of names, some metadata like timestamps, file sizes and block offsets. FAT also keeps, for each block, an entry pointing to the next block of the file, so a file can be fragmented without needing to run a defrag while writing.
So then you search the directory, find the file, go to the offset and read the blocks.
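For the lazy one-byte-per-block free-space map, allocating a block is just a linear scan. A small sketch, assuming the map has already been read from the disk image into memory, with 0 meaning free and 1 meaning used (that encoding is an assumption):

#include <cstddef>
#include <vector>

/* Find the first free block and mark it used; return -1 if the disk is full.
   The caller is responsible for writing the map back to the disk image. */
int allocateBlock(std::vector<char> &freeMap)
{
    for (std::size_t i = 0; i < freeMap.size(); ++i)
    {
        if (freeMap[i] == 0)
        {
            freeMap[i] = 1;          /* mark the block as used */
            return (int)i;
        }
    }
    return -1;                       /* no free blocks left */
}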
Where real filesystems get complicated is in hard tasks like allocating blocks for files so they have room to grow at the ends without wasting space. Handling fragmentation. Having good performance while multiple threads or programs are writing at the same time. Robust recovery in the face of unexpected disk errors or power loss.
Related
I'm currently using google's protobuffer library to store and load data from disk.
It's pretty convenient because it's fast, provides a nice way of defining my data structures, and allows me to compress/decompress the data when writing to or reading from the file.
So far that has served me well. The problem is that I now have to deal with a data structure that is several hundred gigabytes in size, and with protobuf I can only write and load the whole file at once.
The data structure looks something like this:
struct Layer {
    std::vector<float> weights_;
    std::vector<size_t> indices_;
};

struct Cell {
    std::vector<Layer> layers_;
};

struct Data {
    int some_header_fields;
    ...
    std::vector<Cell> cells_;
};
There are two parts to the algorithm.
In the first part, data is added (not in sequence, the access pattern is random, weights and indices might be added to any layer of any cell). No data is removed.
In the second part, the algorithm accesses one cell at a time and processes the data in it, but the access order of cells is random.
What I'd like would be something similar to protobuf, but backed by some file storage that doesn't need to be serialized/deserialized in one go.
Something that would allow me to do things like
Data.cells_[i].layers_[j].FlushToDisk();
at which point the weights_ and indices_ arrays/lists would write their current data to the disk (and free the associated memory) but retain their indices, so that I can add more data to them as I go.
And later during the second part of the algorithm, I could do something like
Data.cells_[i].populate(); //now all data for cell i has been loaded into ram from the file
... process cell i...
Data.cells_[i].dispose(); //now all data for cell i is removed from memory but remains in the file
In addition to storing the data on disk, I'd like it to also support compression of the data. It should also allow multithreaded access.
What library would enable me to do this? Or can I still use protobuf for this in some way? (I guess not, because I would not write the data to disk in a serialized fashion)
//edit:
Performance is very important. So when I populate a cell, I need its data to be in main memory and in contiguous arrays.
How do I read from a binary file (bytes, i.e. unsigned char in C++) into a vector, skipping the first value, which is an unsigned 32-bit integer? That first value is the size of the vector.
The first value is also the size of the whole file.
You could try something like this:
// data_file is assumed to be a std::ifstream opened in binary mode
uint32_t data_size = 0;
data_file.read((char *) &data_size, sizeof(data_size));   // read the size prefix
std::vector<uint8_t> data(data_size);
data_file.read((char *) &data[0], data_size);              // read the payload
The above code fragment first reads the size or quantity of the data from the file.
A std::vector is created, using the quantity value that was read in.
Finally, the data is read into the vector.
Edit 1: Memory-mapped files
You may want to consider opening the data file as a memory mapped file. This is where the Operating System treats the file as memory. You don't have to store the data in memory nor read it in. Since memory mapped file APIs vary among operating systems, you'll have to search your Operating System API to find out how to use the feature.
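For example, on a POSIX system (Linux, macOS) a minimal memory-mapping sketch could look like this; the 4-byte size prefix matches the layout described in the question, and error handling is reduced to early returns:

#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

bool readMapped(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return false; }

    void *addr = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                        /* the mapping stays valid after close */
    if (addr == MAP_FAILED) return false;

    const uint8_t *bytes = static_cast<const uint8_t *>(addr);
    uint32_t data_size;
    std::memcpy(&data_size, bytes, sizeof(data_size));  /* the size prefix */
    const uint8_t *data = bytes + sizeof(data_size);    /* payload follows */
    /* ... use data[0 .. data_size - 1] directly, no copying needed ... */

    munmap(addr, st.st_size);
    return true;
}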
For my neural-network-training project, I've got a very large file of input data. The file format is binary, and it consists of a very large number of fixed-size records. The file is currently ~13GB, but in the future it could become larger; for the purposes of this question let's assume it will be too large to just hold all of it in my computer's RAM at once.
Today's problem involves a little utility program I wrote (in C++, although I think choice of language doesn't matter too much here as one would likely encounter the same problem in any language) that is intended to read the big file and output a similar big file -- the output file is to contain the same data as the input file, except with the records shuffled into a random ordering.
To do this, I mmap() the input file into memory, then generate a list of integers from 1 to N (where N is the number of records in the input file), randomly shuffle the ordering of that list, then iterate over the list, writing out to the output file the n'th record from the mmap'd memory area.
This all works correctly, as far as it goes; the problem is that it's not scaling very well; that is, as the input file's size gets bigger, the time it takes to do this conversion is increasing faster than O(N). It's getting to the point where it's become a bottleneck for my workflow. I suspect the problem is that the I/O system (for MacOS/X 10.13.4, using the internal SSD of my Mac Pro trashcan, in case that's important) is optimized for sequential reads, and jumping around to completely random locations in the input file is pretty much the worst-case scenario as far as caching/read-ahead/other I/O optimizations are concerned. (I imagine that on a spinning disk it would perform even worse due to head-seek delays, but fortunately I'm at least using SSD here)
So my question is, is there any clever alternative strategy or optimization I could use to make this file-randomization-process more efficient -- one that would scale better as the size of my input files increases?
If the problem is related to swapping and random disk access while reading random file locations, can you at least read the input file sequentially?
When you're accessing some chunk of an mmap-ed file, the prefetcher will assume that you'll need adjacent pages soon, so it will load them as well. But you won't, so those pages will be discarded and the loading time will be wasted.
Create an array of N toPosition entries, so that toPosition[i] = i.
Randomize the destinations (are you using Knuth's shuffle?).
Then toPosition[i] is the destination of input[i]. So read the input data sequentially from the start and place each record into the corresponding place in the destination file.
Perhaps this will be more prefetcher-friendly. Of course, writing data randomly is slow too, but at least you won't waste prefetched pages from the input file.
An additional benefit is that when you've processed a few million input data pages, those gigabytes will be unloaded from RAM and you'll never need them again, so you won't pollute the actual disk cache. Remember that the memory page size is at least 4K, so even when you randomly access 1 byte of an mmap-ed file, at least 4K of data has to be read from disk into the cache.
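A minimal sketch of that idea with POSIX calls, assuming fixed-size records and that toPosition has already been filled and shuffled by the caller (the name scatterRecords and the record length parameter are illustrative):

#include <cstddef>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>
#include <vector>

/* Read records sequentially from inPath and write record i to slot
   toPosition[i] of outPath. */
void scatterRecords(const char *inPath, const char *outPath,
                    const std::vector<size_t> &toPosition, size_t recordLength)
{
    int in  = open(inPath, O_RDONLY);
    int out = open(outPath, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    std::vector<char> record(recordLength);
    for (size_t i = 0; i < toPosition.size(); ++i)
    {
        if (read(in, record.data(), recordLength) != (ssize_t)recordLength)
            break;                                    /* short read: stop */
        off_t dest = (off_t)(toPosition[i] * recordLength);
        pwrite(out, record.data(), recordLength, dest);
    }
    close(in);
    close(out);
}

Whether this wins depends on how well the OS write-back cache absorbs the random writes, so it is worth benchmarking against the mmap version.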
I'd recommend not using mmap() - there's no way all that memory pressure is any help at all, and unless you're re-reading the same data multiple times, mmap() is often the worst-performing way to read data.
First, generate your N random offsets, then, given those offsets, use pread() to read the data - and use low-level C-style IO.
This uses the fcntl() function to disable the page cache for your file. Since you're not re-reading the same data, the page cache likely does you little good, but it does use up RAM, slowing other things down. Try it both with and without the page cache disabled and see which is faster. Note also that I've left out all error checking:
(I've used C-style strings and arrays to match the POSIX-style I/O functions while keeping the code simpler; open(), fcntl(), pread() and close() come from the POSIX headers, not namespace std.)
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>
#include <fcntl.h>

/* RECORD_LENGTH and processRecord() are assumed to be defined elsewhere. */
void sendRecords( const char *dataFile, const off_t *offsets, size_t numOffsets )
{
    int fd = open( dataFile, O_RDONLY );

    // try with and without this
    fcntl( fd, F_NOCACHE, 1 );

    // can also try using page-aligned memory here
    char data[ RECORD_LENGTH ];

    for ( size_t ii = 0; ii < numOffsets; ii++ )
    {
        ssize_t bytesRead = pread( fd, data, sizeof( data ), offsets[ ii ] );

        // process this record
        processRecord( data );
    }

    close( fd );
}
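For the first variant, a hypothetical caller might build and shuffle the offsets like this (numRecords and RECORD_LENGTH are assumed to come from your own file format):

#include <algorithm>
#include <random>
#include <sys/types.h>
#include <vector>

std::vector<off_t> makeShuffledOffsets( size_t numRecords )
{
    std::vector<off_t> offsets( numRecords );
    for ( size_t i = 0; i < numRecords; i++ )
        offsets[ i ] = (off_t)( i * RECORD_LENGTH );   // offset of record i

    std::mt19937_64 rng( std::random_device{}() );
    std::shuffle( offsets.begin(), offsets.end(), rng );
    return offsets;
}

// usage: sendRecords( dataFile, makeShuffledOffsets( numRecords ).data(), numRecords );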
Assuming you have a file containing precalculated random offsets:
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>
#include <fcntl.h>

void sendRecords( const char *dataFile, const char *offsetFile )
{
    int datafd = open( dataFile, O_RDONLY );

    // try with and without this
    fcntl( datafd, F_NOCACHE, 1 );

    int offsetfd = open( offsetFile, O_RDONLY );

    // can also try using page-aligned memory here
    char data[ RECORD_LENGTH ];

    for ( ;; )
    {
        off_t offset;
        ssize_t bytesRead = read( offsetfd, &offset, sizeof( offset ) );
        if ( bytesRead != sizeof( offset ) )
        {
            break;
        }

        bytesRead = pread( datafd, data, sizeof( data ), offset );

        // process this record
        processRecord( data );
    }

    close( datafd );
    close( offsetfd );
}
You can go faster, too, since that code alternates reading and processing, and it'd probably be faster to use multiple threads to read and process simultaneously. It's not that hard to use one or more threads to read data into preallocated buffers that you then queue up and send to your processing thread.
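A bare-bones sketch of that pipeline, with one reader thread filling a bounded queue of record buffers and a processing loop draining it; the queue depth, RECORD_LEN and the processing hook are all assumptions:

#include <condition_variable>
#include <mutex>
#include <queue>
#include <sys/types.h>
#include <thread>
#include <unistd.h>
#include <vector>

static std::queue<std::vector<char>> pending;   // records waiting to be processed
static std::mutex              mtx;
static std::condition_variable cv;
static bool doneReading = false;
static const size_t MAX_QUEUED = 64;            // bound on queued records
static const size_t RECORD_LEN = 2048;          // assumption: fixed record size

void readerThread( int fd, const std::vector<off_t> &offsets )
{
    for ( off_t off : offsets )
    {
        std::vector<char> buf( RECORD_LEN );
        if ( pread( fd, buf.data(), RECORD_LEN, off ) != (ssize_t)RECORD_LEN )
            break;
        std::unique_lock<std::mutex> lock( mtx );
        cv.wait( lock, [] { return pending.size() < MAX_QUEUED; } );
        pending.push( std::move( buf ) );
        cv.notify_all();
    }
    std::lock_guard<std::mutex> lock( mtx );
    doneReading = true;
    cv.notify_all();
}

void processorLoop()
{
    for ( ;; )
    {
        std::unique_lock<std::mutex> lock( mtx );
        cv.wait( lock, [] { return !pending.empty() || doneReading; } );
        if ( pending.empty() && doneReading )
            break;
        std::vector<char> buf = std::move( pending.front() );
        pending.pop();
        lock.unlock();
        cv.notify_all();
        // processRecord( buf.data() );          // hypothetical processing hook
    }
}

Start the reader with std::thread reader( readerThread, fd, offsets ), run processorLoop() on the main thread, then reader.join().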
Thanks to advice of various people in this thread (in particular Marc Glisse and Andrew Henle) I was able to reduce the execution time of my program on a 13GB input file, from ~16 minutes to ~2 minutes. I'll document how I did it in this answer, since the solution wasn't very much like either of the answers above (it was more based on Marc's comment, so I'll give Marc the checkbox if/when he restates his comment as an answer).
I tried replacing the mmap() strategy with pread(), but that didn't seem to make much difference; and I tried passing F_NOCACHE and various other flags to fcntl(), but they seemed to either have no effect or make things slower, so I decided to try a different approach.
The new approach is to do things in a 2-layer fashion: rather than reading in single records at a time, my program now loads in "blocks" of sequential records from the input file (each block containing around 4MB of data).
The blocks are loaded in random order, and I load in blocks until I have a certain amount of block-data held in RAM (currently ~4GB, as that is what my Mac's RAM can comfortably hold). Then I start grabbing random records out of random in-RAM blocks, and writing them to the output file. When a given block no longer has any records left in it to grab, I free that block and load in another block from the input file. I repeat this until all blocks from the input file have been loaded and all their records distributed to the output file.
This is faster because all of my output is strictly sequential, and my input is mostly sequential (i.e. 4MB of data is read after each seek rather than only ~2kB). The ordering of the output is slightly less random than it was, but I don't think that will be a problem for me.
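In outline, a sketch of that two-layer approach might look like the following; the block size, RAM budget, record length and function names are all assumptions, and error handling is omitted:

#include <algorithm>
#include <cstddef>
#include <fstream>
#include <random>
#include <vector>

const std::size_t RECORD_LEN  = 2048;                        // assumed record size
const std::size_t BLOCK_BYTES = 4u << 20;                    // ~4MB per block
const std::size_t MAX_LOADED  = (4ull << 30) / BLOCK_BYTES;  // ~4GB resident

struct LoadedBlock
{
    std::vector<char>        bytes;   // raw block data
    std::vector<std::size_t> left;    // record indices not yet written out
};

void shuffleFile(std::ifstream &in, std::ofstream &out,
                 std::size_t numBlocks, std::mt19937_64 &rng)
{
    // Visit the input blocks in random order.
    std::vector<std::size_t> order(numBlocks);
    for (std::size_t i = 0; i < numBlocks; i++) order[i] = i;
    std::shuffle(order.begin(), order.end(), rng);

    std::vector<LoadedBlock> loaded;
    std::size_t nextBlock = 0;

    while (nextBlock < numBlocks || !loaded.empty())
    {
        // Top up the in-RAM pool of blocks.
        while (loaded.size() < MAX_LOADED && nextBlock < numBlocks)
        {
            LoadedBlock b;
            b.bytes.resize(BLOCK_BYTES);
            in.clear();
            in.seekg((std::streamoff)(order[nextBlock] * BLOCK_BYTES));
            in.read(b.bytes.data(), BLOCK_BYTES);
            std::size_t got = (std::size_t)in.gcount() / RECORD_LEN;
            for (std::size_t r = 0; r < got; r++) b.left.push_back(r);
            if (got > 0) loaded.push_back(std::move(b));
            nextBlock++;
        }
        if (loaded.empty()) break;

        // Write one random record from one random loaded block to the output.
        std::size_t bi = rng() % loaded.size();
        LoadedBlock &b = loaded[bi];
        std::size_t ri = rng() % b.left.size();
        out.write(b.bytes.data() + b.left[ri] * RECORD_LEN, RECORD_LEN);
        std::swap(b.left[ri], b.left.back());
        b.left.pop_back();

        if (b.left.empty())               // block exhausted: free it
        {
            std::swap(loaded[bi], loaded.back());
            loaded.pop_back();
        }
    }
}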
If I have a huge file (e.g. 1TB, or any size that does not fit into RAM; the file is stored on disk), it is delimited by spaces, and my RAM is only 8GB, can I read that file with an ifstream? If not, how do I read a block of the file (e.g. 4GB)?
There are a couple of things that you can do.
First, there's no problem opening a file that is larger than the amount of RAM that you have. What you won't be able to do is copy the whole file live into your memory. The best thing would be for you to find a way to read just a few chunks at a time and process them. You can use ifstream for that purpose (with ifstream.read, for instance). Allocate, say, one megabyte of memory, read the first megabyte of that file into it, rinse and repeat:
#include <fstream>
#include <memory>

std::ifstream bigFile("mybigfile.dat", std::ios::binary);
constexpr std::size_t bufferSize = 1024 * 1024;
std::unique_ptr<char[]> buffer(new char[bufferSize]);
while (bigFile)
{
    bigFile.read(buffer.get(), bufferSize);
    // process bigFile.gcount() bytes of data in buffer
}
Another solution is to map the file to memory. Most operating systems will allow you to map a file to memory even if it is larger than the physical amount of memory that you have. This works because the operating system knows that each memory page associated with the file can be mapped and unmapped on-demand: when your program needs a specific page, the OS will read it from the file into your process's memory and swap out a page that hasn't been used in a while.
However, this can only work if the file is smaller than the maximum amount of memory that your process can theoretically use. This isn't an issue with a 1TB file in a 64-bit process, but it wouldn't work in a 32-bit process.
Also be aware of the spirits that you're summoning. Memory-mapping a file is not the same thing as reading from it. If the file is suddenly truncated by another program, your program is likely to crash. If you modify the data, it's possible that you will run out of memory if you can't save back to the disk. Also, your operating system's algorithm for paging memory in and out may not behave in a way that benefits you significantly. Because of these uncertainties, I would consider mapping the file only if reading it in chunks using the first solution cannot work.
On Linux/OS X, you would use mmap for it. On Windows, you would open a file and then use CreateFileMapping then MapViewOfFile.
I am sure you don't have to keep the whole file in memory. Typically one wants to read and process the file in chunks. If you want to use ifstream, you can do something like this:
std::ifstream is("/path/to/file", std::ios::binary);
char buf[4096];
do {
    is.read(buf, sizeof(buf));
    process_chunk(buf, is.gcount());
} while (is);
A more advanced approach is, instead of reading the whole file or its chunks into memory, to map it into memory using platform-specific APIs:
Under Windows: CreateFileMapping(), MapViewOfFile()
Under Linux: open(2) / creat(2), shm_open, mmap
You will need to build a 64-bit application to make this work.
For more details see: CreateFileMapping, MapViewOfFile, how to avoid holding up the system memory
You can use fread:
const size_t size = 4096;                  /* chunk size that fits in RAM */
char buffer[size];
size_t nread = fread(buffer, 1, size, fp); /* fp comes from fopen(filename, "rb") */
Or, if you want to use C++ fstreams, you can use read as buratino said.
Also bear in mind that you can open a file regardless of its size; the idea is to open it and read it in chunks that fit in your RAM.
I need to sequentially read a file in C++, dealing with 4 characters at a time (but it's a sliding window, so the next character is handled along with the 3 before it). I could read chunks of the file into a buffer (I know mmap() will be more efficient but I want to stick to platform-independent plain C++), or I could read the file a character at a time using std::cin.read(). The file could be arbitrary large, so reading the whole file is not an option.
Which approach is more efficient?
The most efficient method is to read a lot of data into memory using the fewest function calls or requests.
The objective is to keep the hard drive delivering data. One of the bottlenecks is waiting for the hard drive to spin up to the correct speed. Another is trying to locate the sectors on the hard drive where your requested data lives. A third bottleneck is contention for the disk and memory with other processes.
So I vote for the method of reading into a buffer and searching the buffer.
Determine the largest chunk of data you can read at a time, then read the file in those chunks.
Say you can only deal with 2K characters at a time. Then, use:
std::ifstream in(filename, std::ios::binary);
char chunk[2048];
while (in.read(chunk, sizeof(chunk)) || in.gcount() > 0)
{
    std::streamsize nread = in.gcount();
    // Process nread characters of the chunk.
}
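One detail worth handling with a 4-character sliding window is the chunk boundary, since a window can straddle two consecutive reads. A small sketch that feeds the window one character at a time from each chunk (handleWindow() is a hypothetical per-window callback):

#include <cstddef>
#include <fstream>

void handleWindow( const char window[4] );   // hypothetical per-window callback

void scanFile( const char *filename )
{
    std::ifstream in( filename, std::ios::binary );
    char chunk[2048];
    char window[4];
    std::size_t filled = 0;      // how many characters of the window are valid

    while ( in.read( chunk, sizeof( chunk ) ) || in.gcount() > 0 )
    {
        std::streamsize nread = in.gcount();
        for ( std::streamsize i = 0; i < nread; ++i )
        {
            if ( filled == 4 )   // slide the window left by one character
            {
                window[0] = window[1];
                window[1] = window[2];
                window[2] = window[3];
                filled = 3;
            }
            window[filled++] = chunk[i];
            if ( filled == 4 )
                handleWindow( window );   // one call per window position
        }
    }
}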