concatenate/append/merge files in c++ (windows) without coping - c++

How can i concatenate few large files(total size~ 3 Tb) in 1 file using c/c++ on windows?
I cant copy data, because it takes too much time, so i cant use:
cmd copy
Appending One File to Another File(https://msdn.microsoft.com/en-us/library/windows/desktop/aa363778%28v=vs.85%29.aspx)
and so on(stream::readbuf(),...)
I just need represent few files as one.

if this is inside your own program only, then you can create a class that would virtually glue the files together so you can read over it and make it apear as a single file.
if you want to physically have a single file. then no, not possible.
that requires opening file 1 and appending the others.
or creating a new file and appending all the files.
neither the C/C++ library nor the windows API have a means to concatenate files
even if such an API would be available, it would be restrictive in that the first file would have to be of a size that is a multiple of the disk allocation size.
Going really really low level, and assuming the multiple of allocation size is fulfilled... yes, if you unmount the drive, and physically override the file system and mess around with the file system structures, you could "stitch" the files together but that would be a challenge to do for FAT, and near impossible for NTFS.

Related

Speedy splitting. Or how to make filesystem to recognize array of bytes as a file?

I want to split a big file into smaller ones without copying part of file, and without using filestream or functions which use it (if it is possible).
Imagine, we have big file which is consisted of 3 files:
[[File1bytes][File2bytes][File3bytes]]
In my opinion we can do this with these steps:
Use SetEndOfFile function to truncate the bytes of the last file ([File3bytes] in our example)
Somehow force our file system to recognize those truncated bytes ([File3bytes]) as a real file (maybe by adding some info to MFT table, or doing something with NTFS if it is possible, or using some function or method which can do all mentioned for us).
Any suggestions?
How about create a file system nesting over the existing file system where the very large file actually resides and define some IOCTL commands for splitting? Check this link:
How can I write my own 'filesystem' within Windows?

hardlink multiple file to one file

I have many files in a folder. I want to concatenate all these files to a single file. For example cat * > final_file;
But this will increase disk space. Is there is a way where I can hardlink all the files to final_file? For example ln * final_file.
This is not possible using links.
If you really need this kind of feature and can not afford to create one large file you could go for a custom file system driver. FUSE will allow you to write a simple file system driver which runs in the user space and allows to access the files as they were one large file.
You could also write a custom block device (e.g. by emulating the NBD "Network Block Device" protocol) which combines two or more files into one large block device.
Getting to know the concrete use case would help to give a better answer.
No. Hardlinking links 2 files, nothing more. The filesystem does not support that at an underlying level.

Truncating the file in c++

I was writing a program in C++ and wonder if anyone can help me with the situation explained here.
Suppose, I have a log file of about size 30MB, I have copied last 2MB of file to a buffer within the program.
I delete the file (or clear the contents) and then write back my 2MB to the file.
Everything works fine till here. But, the concern is I read the file (the last 2MB) and clear the file (the 30MB file) and then write back the last 2MB.
To much of time will be needed if in a scenario where I am copying last 300MB of file from a 1GB file.
Does anyone have an idea of making this process simpler?
When having a large log file the following reasons should and will be considered.
Disk Space: Log files are uncompressed plain text and consume large amounts of space.
Typical compression reduce the file size by 10:1. However a file cannot be compressed
when it is in use (locked). So a log file must be rotated out of use.
System resources: Opening and closing a file regularly will consume lots of system
resources and it would reduce the performance of the server.
File size: Small files are easier to backup and restore in case of a failure.
I just do not want to copy, clear and re-write the last specific lines to a file. Just a simpler process.... :-)
EDIT: Not making any inhouse process to support log rotation.
logrotate is the tool.
I would suggest an slightly different approach.
Create a new temporary file
Copy the required data from the original file to the temporary file
Close both files
Delete the original file
Rename the temp file to the same name as the original file
To improve the performance of the copy, you can copy the data in chunks, you can play around with the chunk size to find the optimal value.
If this is your file before:
-----------------++++
Where - is what you don't want and + is what you do want, the most portable way of getting:
++++
...is just as you said. Read in the section you want (+), delete/clear the file (as with fopen(... 'wb') or something similar and write out the bit you want (+).
Anything more complicated requires OS-specific help, and isn't portable. Unfortunately, I don't believe any major OS out there has support for what you want. There might be support for "truncate after position X" (a sort of head), but not the tail like operation you're requesting.
Such an operation would be difficult to implement, as varying blocksizes on filesystems (if the filesystem has a block size) would cause trouble. At best, you'd be limited to cutting on blocksize boundaries, but this would be harry. This is such a rare case, that this is probably why such a procudure is not directly supported.
A better approach might be not to let the file grow that big but rather use rotating log files with a set maximum size per log file and a maximum number of old files being kept.
If you can control the writing process, what you probably want to do here is to write to the file like a circular buffer. That way you can keep the last X bytes of data without having to do what you're suggesting at all.
Even if you can't control the writing process, if you can at least control what file it writes to, then maybe you could get it to write to a named pipe. You could attach your own program at the end of this named pipe that writes to a circular buffer as discussed.

Out of Core Implementation of a Quadtree

I am trying to build a Quadtree data structure(or let's just say a tree) on the secondary memory(Hard Disk).
I have a C++ program to do so and I use fopen to create the files. Also, I am using tesseral coding to store each cell in a file named with its corresponding code to store it on the disk in one directory.
The problem is that after creating about 1,100 files, fopen just returns NULL and stops creating new files. I can create further files manually in that directory, but using C++ it can not create any further files.
I know about max limit of inode on ext3 filesystem which is (from Wikipedia) 32,000 but mine is way less than that, also note that I can create files manually on the disk; just not through fopen.
Also, I really appreciate any idea regarding the best way to store a very dynamic quadtree on disk(I need the nodes to be in separate files and the quadtree might have a depth of 50).
Using nested directories is one idea, but I think it will slow down the performance because of following the links on the filesystem to access the file.
Thanks,
Nima
Whats the errno value of the failed fopen() call?
Do you keep the files you have created open? If yes you are most probably exceeding the maximum number of open files per process.
When you use directories as data structures, you delegate the work of maintaining that structure to the file system, which is not necessarily designed to do that.
Edit: Frank is probably right that you'v exceeded the number of available file descriptors. You can increase those, but that shows that you're also using internals of your ABI as a data structure. Slow and (as resources are exhausted) unstable.
Either code for a very specific OS installation, or use a SQL database.
I have no idea why fopen wouldn't work. Look at errno.
However, storing everything in one directory is a bad idea. When you add a lot of files, it will get slow. Having a directory for every level of the tree will also be slow.
Instead, combine multiple levels into one directory. You could, for example, have one directory for every four levels of the tree. This would limit the number of directories, amount of nesting, and number of files per directory, giving very good performance.
The limitation could come from:
stdio (C library). most 256 handles. Can be increased to 1024 (in VC, call _setmaxstdio)
OS kernel on the file hanldes per process (usually 1024).

Quick file access in a directory with 500,000 files

I have a directory with 500,000 files in it. I would like to access them as quickly as possible. The algorithm requires me to repeatedly open and close them (can't have 500,000 file open simultaneously).
How can I do that efficiently? I had originally thought that I could cache the inodes and open the files that way, but *nix doesn't provide a way to open files by inode (security or some such).
The other option is to just not worry about it and hope the FS does good job on file look up in a directory. If that is the best option, which FS's would work best. Do certain filename patterns look up faster than others? eg 01234.txt vs foo.txt
BTW this is all on Linux.
Assuming your file system is ext3, your directory is indexed with a hashed B-Tree if dir_index is enabled. That's going to give you as much a boost as anything you could code into your app.
If the directory is indexed, your file naming scheme shouldn't matter.
http://lonesysadmin.net/2007/08/17/use-dir_index-for-your-new-ext3-filesystems/
A couple of ideas:
a) If you can control the directory layout then put the files into subdirectories.
b) If you can't move the files around, then you might try different filesystems, I think xfs might be good for directories with lots of entries?
If you've got enough memory, you can use ulimit to increase the maximum number of files that your process can have open at one time, I have successfully done with with 100,000 files, 500,000 should work as well.
If that isn't a option for you, try to make sure that your dentry cache has enough room to store all the entries. The dentry cache is the filename -> inode mapping that the kernel uses to speed up file access based on filename, accessing huge numbers of different files can effectively eliminate the benefit of the dentry cache as well as introduce an additional performance hit. Stock 2.6 kernel has a hash with up to 256 * MB RAM entries in it at a time, if you have 2GB of memory you should be okay for up to a little over 500,000 files.
Of course, make sure you perform the appropriate profiling to determine if this really causes a bottlneck.
The traditional way to do this is with hashed subdirectories. Assume your file names are all uniformly-distributed hashes, encoded in hexadecimal. You can then create 256 directories based on the first two characters of the file name (so, for instance, the file 012345678 would be named 01/2345678). You can use two or even more levels if one is not enough.
As long as the file names are uniformly distributed, this will keep the directory sizes manageable, and thus make any operations on them faster.
Another question is how much data is in the files? Is an SQL back end an option?