hardlink multiple files to one file - c++

I have many files in a folder. I want to concatenate all these files into a single file, for example cat * > final_file;
But this will use additional disk space. Is there a way I can hard-link all the files to final_file? For example, ln * final_file.

This is not possible using links.
If you really need this kind of feature and cannot afford to create one large file, you could go for a custom file system driver. FUSE allows you to write a simple file system driver that runs in user space and presents the files as if they were one large file.
You could also write a custom block device (e.g. by emulating the NBD "Network Block Device" protocol) which combines two or more files into one large block device.
Knowing the concrete use case would help in giving a better answer.
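For illustration, here is a minimal, untested sketch of the FUSE idea, assuming libfuse 3 on Linux. The backing file names and the /combined name are assumptions made up for the example; it exposes one read-only file whose contents are the backing files laid out back to back.

// Build roughly like: g++ -std=c++17 catfs.cpp $(pkg-config --cflags --libs fuse3) -o catfs
// Run: ./catfs <mountpoint>   then read <mountpoint>/combined
#define FUSE_USE_VERSION 31
#include <fuse3/fuse.h>

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#include <algorithm>
#include <cerrno>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical list of files to be presented as one.
static const std::vector<std::string> kParts = {"/data/a.bin", "/data/b.bin"};

static off_t total_size() {
    off_t sum = 0;
    for (const auto& p : kParts) {
        struct stat st {};
        if (stat(p.c_str(), &st) == 0) sum += st.st_size;
    }
    return sum;
}

static int cat_getattr(const char* path, struct stat* st, struct fuse_file_info*) {
    std::memset(st, 0, sizeof(*st));
    if (std::strcmp(path, "/") == 0) { st->st_mode = S_IFDIR | 0755; st->st_nlink = 2; return 0; }
    if (std::strcmp(path, "/combined") == 0) {
        st->st_mode = S_IFREG | 0444; st->st_nlink = 1; st->st_size = total_size(); return 0;
    }
    return -ENOENT;
}

static int cat_readdir(const char* path, void* buf, fuse_fill_dir_t filler,
                       off_t, struct fuse_file_info*, enum fuse_readdir_flags) {
    if (std::strcmp(path, "/") != 0) return -ENOENT;
    const auto none = static_cast<fuse_fill_dir_flags>(0);
    filler(buf, ".", nullptr, 0, none);
    filler(buf, "..", nullptr, 0, none);
    filler(buf, "combined", nullptr, 0, none);
    return 0;
}

// Map a read at a logical offset onto the right backing file(s).
static int cat_read(const char* path, char* buf, size_t size, off_t offset,
                    struct fuse_file_info*) {
    if (std::strcmp(path, "/combined") != 0) return -ENOENT;
    size_t done = 0;
    off_t part_start = 0;
    for (const auto& p : kParts) {
        struct stat st {};
        if (stat(p.c_str(), &st) != 0) return -EIO;
        const off_t part_end = part_start + st.st_size;
        const off_t cur = offset + static_cast<off_t>(done);   // next logical byte we need
        if (done < size && cur < part_end) {
            const size_t want = std::min<size_t>(size - done, static_cast<size_t>(part_end - cur));
            int fd = open(p.c_str(), O_RDONLY);
            if (fd < 0) return -EIO;
            ssize_t n = pread(fd, buf + done, want, cur - part_start);
            close(fd);
            if (n < 0) return -EIO;
            done += static_cast<size_t>(n);
            if (static_cast<size_t>(n) < want) break;          // short read: stop here
        }
        part_start = part_end;
    }
    return static_cast<int>(done);
}

int main(int argc, char* argv[]) {
    fuse_operations ops {};
    ops.getattr = cat_getattr;
    ops.readdir = cat_readdir;
    ops.read    = cat_read;
    return fuse_main(argc, argv, &ops, nullptr);
}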

No. Hard-linking just gives one existing file a second name, nothing more. The filesystem does not support anything like this at the underlying level.

Related

concatenate/append/merge files in c++ (windows) without copying

How can I concatenate a few large files (total size ~3 TB) into one file using C/C++ on Windows?
I can't copy the data because it takes too much time, so I can't use:
cmd copy
Appending One File to Another File (https://msdn.microsoft.com/en-us/library/windows/desktop/aa363778%28v=vs.85%29.aspx)
and so on (stream::readbuf(), ...)
I just need to represent a few files as one.
If this is inside your own program only, then you can create a class that virtually glues the files together, so you can read over them and make them appear as a single file.
If you want to physically have a single file, then no, that is not possible.
It requires opening file 1 and appending the others, or creating a new file and appending all the files.
Neither the C/C++ library nor the Windows API has a means to concatenate files.
Even if such an API were available, it would be restrictive in that the first file would have to have a size that is a multiple of the disk allocation unit.
Going really, really low level, and assuming the multiple-of-allocation-size condition is fulfilled... yes, if you unmount the drive and physically override the file system and mess around with the file system structures, you could "stitch" the files together, but that would be a challenge to do for FAT, and near impossible for NTFS.
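To make the "virtually glue the files together" idea from the first answer concrete, here is a small, hedged sketch (class and member names are invented for the example) of a read-only view that maps a logical offset onto the right underlying file:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Presents several files as one logical, read-only byte range.
class ConcatenatedFile {
public:
    explicit ConcatenatedFile(const std::vector<std::string>& paths) {
        std::uint64_t offset = 0;
        for (const auto& p : paths) {
            std::ifstream f(p, std::ios::binary | std::ios::ate);
            std::uint64_t size = 0;
            if (f) size = static_cast<std::uint64_t>(f.tellg());
            parts_.push_back({p, offset, size});
            offset += size;
        }
        total_ = offset;
    }

    std::uint64_t size() const { return total_; }

    // Reads up to `count` bytes starting at logical `offset`; returns bytes actually read.
    std::size_t read(std::uint64_t offset, char* out, std::size_t count) const {
        std::size_t done = 0;
        for (const auto& part : parts_) {
            if (done == count) break;
            const std::uint64_t cur = offset + done;            // next logical byte we need
            if (cur >= part.offset + part.size) continue;       // this part lies before it
            const std::uint64_t local = cur - part.offset;      // position within this part
            const std::size_t want = static_cast<std::size_t>(
                std::min<std::uint64_t>(count - done, part.size - local));
            std::ifstream f(part.path, std::ios::binary);
            f.seekg(static_cast<std::streamoff>(local));
            f.read(out + done, static_cast<std::streamsize>(want));
            const std::size_t got = static_cast<std::size_t>(f.gcount());
            done += got;
            if (got < want) break;                              // I/O error or truncated file
        }
        return done;
    }

private:
    struct Part { std::string path; std::uint64_t offset; std::uint64_t size; };
    std::vector<Part> parts_;
    std::uint64_t total_ = 0;
};

Re-opening the file on every read keeps the sketch short; a real implementation would keep the streams open (or memory-map the parts) and could expose an istream-like interface on top.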

Is it a good idea to include a large text variable in compiled code?

I am writing a program that produces a formatted file for the user, but it's not only producing the formatted file, it does more.
I want to distribute a single binary to the end user and when the user runs the program, it will generate the xml file for the user with appropriate data.
In order to achieve this, I want to put the file contents into a char array variable that is compiled into the code. When the user runs the program, I will write out the char array to generate an XML file for the user.
const char* buffers = "a xml format file contents, \
this represent many block text \
from a file,...";
I have two questions.
Q1. Do you have any other ideas for how to compile my file contents into the binary, i.e., distribute it as one binary file?
Q2. Is this even a good idea as I described above?
What you describe is by far the norm for C/C++. For large amounts of text data, or for arbitrary binary data (or indeed any data you can store in a file, e.g. a zip file), you can keep the data in a file and link it into your program directly.
An example may be found on sites like this one
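Since the link is not reproduced here, the following is only an assumed illustration of that approach: convert the file into a C array at build time (for example with xxd -i) and compile the generated header in. The file name template.xml is an assumption, and the variable names are whatever xxd derives from it:

// Build step (shell):   xxd -i template.xml > template_xml.h
// The generated header contains, roughly:
//   unsigned char template_xml[] = { 0x3c, 0x3f, ... };
//   unsigned int  template_xml_len = 1234;
#include <fstream>

#include "template_xml.h"   // generated header (hypothetical name)

int main() {
    std::ofstream out("output.xml", std::ios::binary);
    out.write(reinterpret_cast<const char*>(template_xml), template_xml_len);
}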
I'd recommend using a separate file to hold the data rather than putting the data into the binary, unless you have your own reasons. I don't know of other portable ways to put strings into a binary file, but your solution seems OK.
However, note that when using \ at the end of a line to form a string spanning multiple lines, the indentation needs to be taken care of, because each continuation is concatenated starting from the beginning of the next line:
const char* buffers = "a xml format file contents, \
this represent many block text \
from a file,...";
Or you can use another form:
const char *buffers =
"a xml format file contents,"
"this represent many block text"
"from a file,...";
Probably my answer provides much redundant information for the topic starter, but here is what I'm aware of:
Embedding in source code: the plain C/C++ solution. It is a bad idea because each time you want to change your content, you will need to:
recompile
relink
It is acceptable only if your content changes very rarely or never, or if build time is not an issue (if your app is small).
Embedding in the binary: a few slightly more flexible ways of embedding content in executables exist, but none of them are cross-platform (you have not stated your target platform):
Windows: resource files. With most IDEs it is very simple.
Linux: objcopy (a sketch follows at the end of this answer).
MacOS: application bundles. Even simpler than on Windows.
You will not need to recompile the C++ file(s), only re-link.
Application virtualization: there are special utilities that wrap all your application resources into a single executable and run it much as if it were on a virtual machine.
I'm only aware of such utilities for Windows (ThinApp, BoxedApp), but there are probably such things for other OSes too, or even cross-platform ones.
Consider distributing your application in some form of installer: when the installer starts, it creates all the resources and unpacks the executable. This is similar to generating the whole lot from the main executable itself. It can be a large and complex package or even a simple self-extracting archive.
Of course the choice depends on what kind of application you are creating, who your target audience is, how you will ship the package to end users, etc. If it is a game and you are targeting children, it's not the same as a Unix console utility for C++ coders =)
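As a hedged illustration of the objcopy item above, on Linux (the file name data.xml is an assumption; GNU objcopy derives the _binary_..._start/_end symbol names from the input file name):

// Build steps (shell):
//   objcopy -I binary -O elf64-x86-64 -B i386:x86-64 data.xml data.o
//   g++ main.cpp data.o -o app
#include <string>

extern "C" const char _binary_data_xml_start[];   // symbols emitted by objcopy
extern "C" const char _binary_data_xml_end[];

std::string embedded_xml() {
    return std::string(_binary_data_xml_start,
                       _binary_data_xml_end - _binary_data_xml_start);
}

After changing data.xml, only the objcopy and link steps need to be repeated; no C++ file is recompiled.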
It depends. If you are doing some small Unix-style utility with no prospect of internationalization, then it's probably fine. You don't want to bloat the distribution with a file no one would ever touch anyway.
But in general it is bad practice, because eventually someone might want to modify this data, and he or she would have to rebuild the whole thing just to fix a typo.
The decision is really up to you.
If you just want to keep your distribution in one piece, you might also find this thread interesting: Store data in executable
Why don't you distribute your application with an additional configuration file, i.e. package your application executable and the config file together?
If you do want to make it into a single file, try embedding your config file into the executable as a resource.
I see this as more of an OS issue than a C/C++ one. You can add the text to the resource part of your binary/program. In Windows programs, HTML, graphics and even movie files are often compiled into resources that become part of the final binary.
That is handy for possible future translation into another language, plus you can modify the resource part of the binary without recompiling the code.
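A small sketch of the Windows resource approach mentioned above; the id IDR_XML_TEMPLATE, its numeric value, and the file name are assumptions made for the example:

// resource.h:   #define IDR_XML_TEMPLATE 101
// app.rc:       IDR_XML_TEMPLATE RCDATA "template.xml"
#include <windows.h>

#include <string>

// Returns the raw bytes of an RCDATA resource, or an empty string on failure.
std::string load_rcdata(int id) {
    HMODULE mod = GetModuleHandle(nullptr);
    HRSRC   res = FindResource(mod, MAKEINTRESOURCE(id), RT_RCDATA);
    if (!res) return {};
    HGLOBAL block = LoadResource(mod, res);
    const char* data = block ? static_cast<const char*>(LockResource(block)) : nullptr;
    if (!data) return {};
    return std::string(data, SizeofResource(mod, res));
}

Because the resource lives in its own section of the executable, a translated or edited data file can be swapped in with a resource editor without touching the code, which is exactly the advantage mentioned above.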

Hiding application resources

I'm making a simple game with SFML 1.6 in C++. Of course, I have a lot of picture, level, and data files. The problem is, I don't want these files visible. Right now they're just plain picture files in a res/ subdirectory, and I want to either conceal them or encrypt them. Is it possible to put the raw data from the files into a resource file or something? Any solution is okay with me; I just don't want the files exposed to the user.
EDIT
Cross-platform solutions are best, but if they don't exist, that's okay; I'm working on Windows. But I don't really want to use a library if it's not needed.
Most environments come with a resource compiler that converts images/icons/etc. into data and includes it in the final program.
Another common technique is to copy them onto the end of the final .exe as the last part of the build process. Then at run time, open the .exe as a file and read the data from a known offset; see Embedding a filesystem in an executable?
The ideal way to do this is to make your own archive format, which would contain all of your files' data along with the extra information needed to locate each file distinctly within it.
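As a rough illustration of such a home-grown archive format (the layout, names and types below are assumptions, not an established format), the reading side might look like this:

// Assumed .pak layout, no compression or encryption:
//   [uint32 file count]
//   for each file: [uint32 name length][name bytes][uint32 data size]
//   then all data blobs back to back, in the same order.
#include <cstdint>
#include <fstream>
#include <istream>
#include <map>
#include <string>
#include <vector>

struct PakEntry { std::uint64_t offset; std::uint32_t size; };

// Reads the table of contents; returns name -> (offset, size) within the pak stream.
std::map<std::string, PakEntry> load_index(std::istream& pak) {
    std::uint32_t count = 0;
    pak.read(reinterpret_cast<char*>(&count), sizeof(count));
    struct Raw { std::string name; std::uint32_t size = 0; };
    std::vector<Raw> raw(count);
    for (auto& r : raw) {
        std::uint32_t len = 0;
        pak.read(reinterpret_cast<char*>(&len), sizeof(len));
        r.name.resize(len);
        pak.read(&r.name[0], len);
        pak.read(reinterpret_cast<char*>(&r.size), sizeof(r.size));
    }
    std::map<std::string, PakEntry> index;
    std::uint64_t offset = static_cast<std::uint64_t>(pak.tellg());  // data starts right after the TOC
    for (const auto& r : raw) {
        index[r.name] = {offset, r.size};
        offset += r.size;
    }
    return index;
}

std::string read_entry(std::istream& pak, const PakEntry& e) {
    std::string data(e.size, '\0');
    pak.seekg(static_cast<std::streamoff>(e.offset));
    pak.read(&data[0], e.size);
    return data;
}

The matching packer just writes the same header and then appends each file's bytes. Note that this only keeps files away from casual browsing; anyone determined can still extract them, so add real encryption on top if that matters.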

Out of Core Implementation of a Quadtree

I am trying to build a quadtree data structure (or let's just say a tree) in secondary memory (hard disk).
I have a C++ program to do so, and I use fopen to create the files. Also, I am using tesseral coding, so each cell is stored in a file named with its corresponding code, all in one directory on the disk.
The problem is that after creating about 1,100 files, fopen just returns NULL and stops creating new files. I can create further files manually in that directory, but from C++ it cannot create any further files.
I know about the limit on the number of inodes on the ext3 filesystem, which is (from Wikipedia) 32,000, but mine is way less than that; also note that I can create files manually on the disk, just not through fopen.
Also, I would really appreciate any ideas regarding the best way to store a very dynamic quadtree on disk (I need the nodes to be in separate files, and the quadtree might have a depth of 50).
Using nested directories is one idea, but I think it will slow down performance because of following the links on the filesystem to access the files.
Thanks,
Nima
What's the errno value of the failed fopen() call?
Do you keep the files you have created open? If so, you are most probably exceeding the maximum number of open files per process.
When you use directories as data structures, you delegate the work of maintaining that structure to the file system, which is not necessarily designed to do that.
Edit: Frank is probably right that you've exceeded the number of available file descriptors. You can increase those, but that shows that you're also using internals of your ABI as a data structure. Slow and (as resources are exhausted) unstable.
Either code for a very specific OS installation, or use a SQL database.
I have no idea why fopen wouldn't work. Look at errno.
However, storing everything in one directory is a bad idea. When you add a lot of files, it will get slow. Having a directory for every level of the tree will also be slow.
Instead, combine multiple levels into one directory. You could, for example, have one directory for every four levels of the tree. This would limit the number of directories, amount of nesting, and number of files per directory, giving very good performance.
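A tiny sketch of the "one directory for every four levels" suggestion, assuming the node's tesseral code is stored as a string of quadrant digits (the .node suffix and function name are made up):

#include <cstddef>
#include <string>

// E.g. node_path("/quadtree", "0312203112") -> "/quadtree/0312/2031/12.node"
std::string node_path(const std::string& root, const std::string& code) {
    std::string path = root;
    std::size_t i = 0;
    for (; i + 4 < code.size(); i += 4)          // every full group of 4 levels becomes a directory
        path += "/" + code.substr(i, 4);
    return path + "/" + code.substr(i) + ".node";
}

The intermediate directories would still need to be created (mkdir(), or std::filesystem::create_directories in C++17) before writing the node file.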
The limitation could come from:
stdio (the C library): at most 256 handles; can be increased to 1024 (in VC, call _setmaxstdio)
the OS kernel limit on file handles per process (usually 1024)
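For diagnosing this, a small hedged sketch: print errno when fopen fails (EMFILE strongly suggests too many files are being kept open by the process), and on Linux optionally raise the per-process descriptor limit with setrlimit. Closing node files as soon as you are done with them is still the real fix:

#include <sys/resource.h>

#include <algorithm>
#include <cerrno>
#include <cstdio>
#include <cstring>

FILE* open_node_file(const char* path) {
    FILE* f = std::fopen(path, "wb");
    if (!f) {
        // EMFILE = per-process "too many open files", ENFILE = system-wide table is full.
        std::fprintf(stderr, "fopen(%s) failed: %s\n", path, std::strerror(errno));
    }
    return f;
}

// Tries to raise the soft RLIMIT_NOFILE limit up to `wanted` (capped by the hard limit).
bool raise_fd_limit(rlim_t wanted) {
    rlimit rl {};
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) return false;
    if (rl.rlim_cur >= wanted) return true;
    rl.rlim_cur = std::min<rlim_t>(wanted, rl.rlim_max);
    return setrlimit(RLIMIT_NOFILE, &rl) == 0;
}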

Quick file access in a directory with 500,000 files

I have a directory with 500,000 files in it. I would like to access them as quickly as possible. The algorithm requires me to repeatedly open and close them (I can't have 500,000 files open simultaneously).
How can I do that efficiently? I had originally thought that I could cache the inodes and open the files that way, but *nix doesn't provide a way to open files by inode (security or some such).
The other option is to just not worry about it and hope the FS does a good job of file lookup in a directory. If that is the best option, which FSs would work best? Do certain filename patterns look up faster than others? E.g. 01234.txt vs foo.txt
BTW this is all on Linux.
Assuming your file system is ext3, your directory is indexed with a hashed B-tree if dir_index is enabled. That's going to give you as much of a boost as anything you could code into your app.
If the directory is indexed, your file naming scheme shouldn't matter.
http://lonesysadmin.net/2007/08/17/use-dir_index-for-your-new-ext3-filesystems/
A couple of ideas:
a) If you can control the directory layout, then put the files into subdirectories.
b) If you can't move the files around, then you might try different filesystems; I think XFS might be good for directories with lots of entries.
If you've got enough memory, you can use ulimit to increase the maximum number of files that your process can have open at one time; I have successfully done this with 100,000 files, so 500,000 should work as well.
If that isn't an option for you, try to make sure that your dentry cache has enough room to store all the entries. The dentry cache is the filename -> inode mapping that the kernel uses to speed up file access based on filename; accessing huge numbers of different files can effectively eliminate the benefit of the dentry cache as well as introduce an additional performance hit. The stock 2.6 kernel has a hash with up to 256 * MB RAM entries in it at a time; if you have 2 GB of memory you should be okay for up to a little over 500,000 files.
Of course, make sure you perform the appropriate profiling to determine if this really causes a bottleneck.
The traditional way to do this is with hashed subdirectories. Assume your file names are all uniformly-distributed hashes, encoded in hexadecimal. You can then create 256 directories based on the first two characters of the file name (so, for instance, the file 012345678 would be named 01/2345678). You can use two or even more levels if one is not enough.
As long as the file names are uniformly distributed, this will keep the directory sizes manageable, and thus make any operations on them faster.
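A short C++17 sketch of that naming scheme (the function name and the levels parameter are made up for the example):

#include <cstddef>
#include <filesystem>
#include <string>

// hashed_path("/data", "012345678", 1) -> "/data/01/2345678"
// hashed_path("/data", "012345678", 2) -> "/data/01/23/45678"
std::filesystem::path hashed_path(const std::filesystem::path& root,
                                  const std::string& name, int levels = 1) {
    std::filesystem::path p = root;
    std::size_t pos = 0;
    for (int i = 0; i < levels && pos + 2 < name.size(); ++i, pos += 2)
        p /= name.substr(pos, 2);        // first two hex characters, then the next two, ...
    return p / name.substr(pos);         // the rest of the name is the file itself
}

As noted above, this only helps if the names are reasonably uniformly distributed; otherwise some buckets will still end up huge.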
Another question is how much data is in the files? Is an SQL back end an option?