Speedy splitting, or how to make the filesystem recognize an array of bytes as a file? - c++

I want to split a big file into smaller ones without copying parts of the file, and without using file streams or functions that use them (if that is possible).
Imagine we have a big file that consists of 3 files:
[[File1bytes][File2bytes][File3bytes]]
In my opinion we can do this with these steps:
Use the SetEndOfFile function to truncate the bytes of the last file ([File3bytes] in our example); a minimal sketch of this step follows the list.
Somehow force our file system to recognize those truncated bytes ([File3bytes]) as a real file (maybe by adding some entry to the MFT, or doing something with NTFS if that is possible, or using some function or method which can do all of the above for us).
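Here is a minimal sketch of step 1 using the Win32 API, assuming the offset where [File3bytes] begins (tailOffset below) is already known; step 2 is the part for which no ready-made API exists:

```cpp
#include <windows.h>

// Cut the file off at tailOffset; everything at and past that offset
// is released back to the filesystem.
bool TruncateAt(const wchar_t* path, LONGLONG tailOffset)
{
    HANDLE h = CreateFileW(path, GENERIC_WRITE, 0, nullptr,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h == INVALID_HANDLE_VALUE) return false;

    LARGE_INTEGER pos;
    pos.QuadPart = tailOffset;
    bool ok = SetFilePointerEx(h, pos, nullptr, FILE_BEGIN) && SetEndOfFile(h);
    CloseHandle(h);
    return ok;
}
```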
Any suggestions?

How about creating a file system nested over the existing file system where the very large file actually resides, and defining some IOCTL commands for splitting? Check this link:
How can I write my own 'filesystem' within Windows?


std::filesystem::is_regular_file() vs ::exists() - what counts as a regular file

I am using Visual Studio 2017 C++ with MFC.
I have a program in MFC that collects file paths as strings and adds them to a zip file.
I want to add an error check where I verify that a file exists before trying to add it. If it exists, perfect, I add the file to the zip. If not, no problem, I continue to the next file path.
I came across std::filesystem (here) and I see two different functions that I think can work:
is_regular_file() (here) and exists() (here).
However, I am not sure which of the two to use. The file types I will be zipping vary from .txt to .zip, plus many other arbitrary file types.
From my research, they appear to be similar, both returning a bool value.
What is the difference between the two functions, and which is the better one to use?
Furthermore, from my understanding, the library is relatively new and might not be suitable for use with MFC. Is this true? And if so, what else could I use to check, given a file path, that the file exists on the computer?
Thank you in advance! :D
tl;dr: use std::filesystem::is_regular_file(), std::filesystem::is_symlink() and std::filesystem::is_directory().
There are several file types defined in std::filesystem:
Constant: Meaning
none: indicates that the file status has not been evaluated yet, or an error occurred when evaluating it
not_found: indicates that the file was not found (this is not considered an error)
regular: a regular file
directory: a directory
symlink: a symbolic link
block: a block special file
character: a character special file
fifo: a FIFO (also known as a pipe) file
socket: a socket file
implementation-defined: an additional implementation-defined constant for each additional file type supported by the implementation (e.g. the MSVC STL defines junction for NTFS junctions)
unknown: the file exists but its type could not be determined
So at first glance, std::filesystem::exists() would seem to be what you want.
But you're adding files to a ZIP archive, and AFAICT ZIP only supports regular files, symlinks, and directories, so those are the types you should check for.
You wouldn't archive sockets or NTFS junctions even if they existed and were given on the command line.
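A minimal sketch of such a check, using the non-throwing overloads so an inaccessible path simply yields false (ShouldArchive is an illustrative name, not a library function):

```cpp
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Accept only the entry kinds a ZIP archive can represent.
bool ShouldArchive(const fs::path& p)
{
    std::error_code ec;  // non-throwing overloads: a bad path is not fatal
    return fs::is_regular_file(p, ec) ||
           fs::is_symlink(p, ec) ||
           fs::is_directory(p, ec);
}
```

As for MFC: std::filesystem ships with the VS2017 toolset (as std::experimental::filesystem in early updates, and as the C++17 std::filesystem from 15.7), so it can be used from MFC code like any other standard library facility.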
To clarify:
A "regular file" is a file that has data that is (a) stored by the filesystem and (b) has no semantics assigned to it by the filesystem - i.e. the filesystem doesn't know how to read the file or act on it.
For example, an MS Word document is something that MS Word understands, but the filesystem only knows that it's a bunch of bytes.
A symlink, on the other hand, is something the filesystem stores, and knows how to act on (follow the link).
A block special file is something that is not actually stored by the filesystem, but rather just a piece of metadata that other parts of the OS act on.
A ZIP archive, or RAR, are files that the filesystem stores. But it doesn't know how to read them or act on them, so they are regular files.
In the old Windows Explorer, the "shortcuts" on your desktop were small files stored by the filesystem, but it was Explorer that knew how to read them - they weren't symbolic links, they were regular files. (Maybe that's still the case, I haven't checked recently.)

concatenate/append/merge files in c++ (windows) without copying

How can I concatenate a few large files (total size ~3 TB) into one file using C/C++ on Windows?
I can't copy the data because it takes too much time, so I can't use:
cmd copy
Appending One File to Another File (https://msdn.microsoft.com/en-us/library/windows/desktop/aa363778%28v=vs.85%29.aspx)
and so on (stream::readbuf(), ...)
I just need to represent a few files as one.
If this is inside your own program only, then you can create a class that virtually glues the files together, so you can read over it and make it appear as a single file.
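Something along these lines, as a minimal sketch (ConcatReader and its interface are illustrative, not an existing class):

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Presents several files as one readable byte sequence without copying
// any data on disk.
class ConcatReader {
    std::vector<std::string> paths_;
    std::size_t index_ = 0;  // file currently being read
    std::ifstream in_;
public:
    explicit ConcatReader(std::vector<std::string> paths)
        : paths_(std::move(paths)) {}

    // Reads up to n bytes, transparently crossing file boundaries.
    std::size_t read(char* buf, std::size_t n) {
        std::size_t total = 0;
        while (total < n && index_ < paths_.size()) {
            if (!in_.is_open())
                in_.open(paths_[index_], std::ios::binary);
            in_.read(buf + total, static_cast<std::streamsize>(n - total));
            total += static_cast<std::size_t>(in_.gcount());
            if (in_.eof() || !in_) {  // finished (or failed on) this file
                in_.close();
                in_.clear();
                ++index_;
            }
        }
        return total;
    }
};
```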
If you want to physically have a single file, then no, it is not possible.
That requires opening file 1 and appending the others,
or creating a new file and appending all the files.
Neither the C/C++ standard library nor the Windows API has a means to concatenate files in place.
Even if such an API were available, it would be restrictive in that the first file would have to be of a size that is a multiple of the disk allocation size.
Going really, really low level, and assuming the multiple-of-allocation-size condition is fulfilled... yes, if you unmount the drive, physically override the file system, and mess around with the file system structures, you could "stitch" the files together, but that would be a challenge to do for FAT, and near impossible for NTFS.

hardlink multiple files to one file

I have many files in a folder. I want to concatenate all these files into a single file, for example cat * > final_file.
But this will use additional disk space. Is there a way I can hardlink all the files to final_file? For example, ln * final_file.
This is not possible using links.
If you really need this kind of feature and cannot afford to create one large file, you could go for a custom file system driver. FUSE will allow you to write a simple file system driver which runs in user space and allows access to the files as if they were one large file.
You could also write a custom block device (e.g. by emulating the NBD "Network Block Device" protocol) which combines two or more files into one large block device.
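In either approach the core job is the same: translate a read offset in the combined view into a (file, offset-within-file) pair before forwarding the read to the underlying file. A minimal sketch of that mapping (the FUSE/NBD plumbing around it is omitted):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Map a logical offset in the concatenated view to (file index, offset
// within that file). A FUSE read handler or NBD server for such a
// "concat" device would do this before reading the underlying file.
std::pair<std::size_t, std::uint64_t>
MapOffset(const std::vector<std::uint64_t>& fileSizes, std::uint64_t offset)
{
    for (std::size_t i = 0; i < fileSizes.size(); ++i) {
        if (offset < fileSizes[i])
            return {i, offset};      // the offset falls inside file i
        offset -= fileSizes[i];
    }
    return {fileSizes.size(), 0};    // past the end of the combined view
}
```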
Getting to know the concrete use case would help to give a better answer.
No. Hardlinking links 2 files, nothing more. The filesystem does not support that at an underlying level.

Include static data/text file

I have a text file (>50k lines) of ASCII numbers, with string identifiers, that can be thought of as a collection of data vectors. Based on user input, the application only needs one of these data vectors at runtime.
As far as I can see, I have 3 options for getting the information from this text file:
Keep it as a text file and extract the required vector at run-time. I believe the downside is that you can't have a relative path in the code, so the user would have to point to the file's correct location (?). Alternatively, get the configure script to inject the absolute path as a macro.
Convert it to a static unsigned char array using xxd (as explained here) and then include the resulting file. The downside is that a 5MB file turns into a 25MB include file. Am I correct in thinking that this 25MB is loaded into memory for the duration of the runtime? (See the sketch after this list.)
Convert it to an object file and link it using objcopy, as explained here. This seems to keep the file size about the same - are there other trade-offs?
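For reference, option 2 ends up looking roughly like this in use (vectors.txt is a stand-in name; xxd derives the symbol names from the input filename):

```cpp
// xxd -i vectors.txt > vectors.h generates approximately:
//
//   unsigned char vectors_txt[] = { 0x31, 0x32, 0x20, /* ... */ };
//   unsigned int vectors_txt_len = 5242880;
//
// The array lives in the binary's data segment, so it is mapped into the
// process along with the executable; pages are generally faulted in on
// first touch rather than all read up front.
#include "vectors.h"  // hypothetical generated header

const unsigned char* VectorFileData() { return vectors_txt; }
unsigned             VectorFileSize() { return vectors_txt_len; }
```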
Is there a standard/recommended method for doing this? I can use C or C++ if that makes a difference.
Thanks.
(Running on linux with gcc)
I would go with number 1 and pass the file path into the program as an argument. There's nothing wrong with doing that, and it is simple and straightforward.
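A minimal sketch of that approach, assuming each line is a string identifier followed by whitespace-separated numbers (an assumption about your format):

```cpp
#include <fstream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Scan the file for the line whose identifier matches `id` and parse the
// numbers that follow it. Only the requested vector is kept in memory.
std::vector<double> LoadVector(const std::string& path, const std::string& id)
{
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream row(line);
        std::string name;
        if (row >> name && name == id)
            return {std::istream_iterator<double>(row),
                    std::istream_iterator<double>()};
    }
    return {};  // identifier not found
}
```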
You should have a look at the answers here:
Directory of running program
The top-voted answer gives you a clue how to handle your data file. But instead of the home folder, I would suggest saving it under /usr/share, as explained in the link.
I'd prefer to use zlib (and both ways are possible: a side file, or an include with compressed data).
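A hedged sketch of the embedded-and-compressed route; uncompress() is zlib's real one-shot API, while the embedded symbol names and header are assumptions:

```cpp
#include <vector>
#include <zlib.h>
#include "vectors_z.h"  // hypothetical: compressed blob + uncompressed size

// Inflate the embedded, compressed copy of the data file at startup.
std::vector<unsigned char> InflateEmbeddedData()
{
    std::vector<unsigned char> out(vectors_txt_orig_len);
    uLongf outLen = static_cast<uLongf>(out.size());
    if (uncompress(out.data(), &outLen, vectors_z, vectors_z_len) != Z_OK)
        return {};  // corrupt blob or wrong expected size
    out.resize(outLen);
    return out;
}
```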

Truncating a file in c++

I am writing a program in C++ and wonder if anyone can help me with the situation explained here.
Suppose I have a log file of about 30MB, and I have copied the last 2MB of the file to a buffer within the program.
I delete the file (or clear its contents) and then write my 2MB back to the file.
Everything works fine up to this point. But the concern is that I read the file (the last 2MB), clear the file (the whole 30MB), and then write the last 2MB back.
Far too much time will be needed in a scenario where I am copying the last 300MB from a 1GB file.
Does anyone have an idea for making this process simpler?
When dealing with a large log file, the following reasons should be considered:
Disk space: Log files are uncompressed plain text and consume large amounts of space. Typical compression reduces the file size by 10:1. However, a file cannot be compressed while it is in use (locked), so a log file must be rotated out of use.
System resources: Opening and closing a file regularly consumes a lot of system resources, and it would reduce the performance of the server.
File size: Small files are easier to back up and restore in case of a failure.
I just do not want to copy, clear, and re-write the last specific lines to a file. I am looking for a simpler process... :-)
EDIT: Not making any in-house process to support log rotation.
logrotate is the tool.
I would suggest a slightly different approach.
Create a new temporary file
Copy the required data from the original file to the temporary file
Close both files
Delete the original file
Rename the temp file to the same name as the original file
To improve the performance of the copy, copy the data in chunks; you can play around with the chunk size to find the optimal value.
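A hedged sketch of those steps (KeepTail is an illustrative name, and the 1 MiB chunk size is just a starting point to tune):

```cpp
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Copy the last `keep` bytes into a temp file, then swap it into place.
bool KeepTail(const std::string& path, std::uint64_t keep)
{
    std::ifstream src(path, std::ios::binary | std::ios::ate);
    if (!src) return false;
    const std::uint64_t size = static_cast<std::uint64_t>(src.tellg());
    src.seekg(static_cast<std::streamoff>(size > keep ? size - keep : 0));

    const std::string tmp = path + ".tmp";
    std::ofstream dst(tmp, std::ios::binary | std::ios::trunc);
    if (!dst) return false;

    std::vector<char> buf(1 << 20);  // 1 MiB chunks; tune for your disk
    while (src.read(buf.data(), static_cast<std::streamsize>(buf.size())) ||
           src.gcount() > 0)
        dst.write(buf.data(), src.gcount());

    src.close();
    dst.close();
    return std::remove(path.c_str()) == 0 &&
           std::rename(tmp.c_str(), path.c_str()) == 0;
}
```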
If this is your file before:
-----------------++++
Where - is what you don't want and + is what you do want, the most portable way of getting:
++++
...is just as you said: read in the section you want (+), delete/clear the file (as with fopen(..., "wb") or something similar), and write out the bit you want (+).
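A sketch of that portable sequence, assuming the tail fits comfortably in memory (RewriteTail is an illustrative name):

```cpp
#include <cstdio>
#include <vector>

// Read the tail into memory, reopen with "wb" (which truncates to zero
// length), and write the tail back out.
bool RewriteTail(const char* path, long keep)
{
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    std::fseek(f, 0, SEEK_END);
    const long size = std::ftell(f);
    if (keep > size) keep = size;
    std::fseek(f, size - keep, SEEK_SET);

    std::vector<char> tail(static_cast<std::size_t>(keep));
    const std::size_t got = std::fread(tail.data(), 1, tail.size(), f);
    std::fclose(f);

    f = std::fopen(path, "wb");  // truncates the existing contents
    if (!f) return false;
    std::fwrite(tail.data(), 1, got, f);
    std::fclose(f);
    return true;
}
```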
Anything more complicated requires OS-specific help and isn't portable. Unfortunately, I don't believe any major OS out there has support for what you want. There might be support for "truncate after position X" (a sort of head), but not the tail-like operation you're requesting.
Such an operation would be difficult to implement, as varying block sizes across filesystems (if the filesystem has a block size at all) would cause trouble. At best, you'd be limited to cutting on block-size boundaries, and even that would be hairy. This is such a rare case that it is probably why such a procedure is not directly supported.
A better approach might be not to let the file grow that big in the first place, but rather to use rotating log files, with a set maximum size per log file and a maximum number of old files being kept.
If you can control the writing process, what you probably want to do here is write to the file as a circular buffer. That way you can keep the last X bytes of data without having to do what you're suggesting at all.
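A minimal sketch of such a circular-buffer file, assuming you control the writer; the first 8 bytes hold the current write position so a reader can reassemble the order (layout and class name are assumptions):

```cpp
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <string>

// Fixed-capacity log: writes wrap around instead of growing the file.
class CircularLog {
    std::fstream f_;
    std::uint64_t cap_;      // data capacity in bytes (excludes the header)
    std::uint64_t pos_ = 0;  // next write position within the data area
public:
    CircularLog(const std::string& path, std::uint64_t cap) : cap_(cap) {
        f_.open(path, std::ios::in | std::ios::out | std::ios::binary);
        if (!f_) {  // create the file on first use, then reopen read/write
            std::ofstream(path, std::ios::binary).close();
            f_.open(path, std::ios::in | std::ios::out | std::ios::binary);
        }
    }

    void write(const char* data, std::uint64_t n) {
        while (n > 0) {
            // Write up to the wrap point, then continue from the start.
            const std::uint64_t chunk = std::min(n, cap_ - pos_);
            f_.seekp(static_cast<std::streamoff>(8 + pos_));
            f_.write(data, static_cast<std::streamsize>(chunk));
            pos_ = (pos_ + chunk) % cap_;
            data += chunk;
            n -= chunk;
        }
        f_.seekp(0);  // persist the write position in the 8-byte header
        f_.write(reinterpret_cast<const char*>(&pos_), sizeof pos_);
        f_.flush();
    }
};
```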
Even if you can't control the writing process, if you can at least control which file it writes to, then maybe you could get it to write to a named pipe. You could attach your own program to the end of this named pipe, and have it write to a circular buffer as discussed.