Concatenating text and binary data into one file


I am developing an application, and I have several pieces of data that I want to be able to save to and open from the same file. The first is several lines that are essentially human readable that store simple data about certain attributes. The data is stored in AttributeList objects that support operator<< and operator>>. The rest are .png images which I have loaded into memory.
How can I save all this data to one file in such a way that I can then easily read it back into memory? Is there a way to store the image data in memory that will make this easier?

How can I save all this data to one file in such a way that I can then
easily read it back into memory? Is there a way to store the image
data in memory that will make this easier?
Yes.
In an embedded system I once worked on, the requirement was to capture system configuration into a RAM file system (1 megabyte).
We used zlib to compress and 'merge' multiple files into a single storage file.
Perhaps any compression system can work for you. On Linux, I would use popen() to run gzip or gunzip, etc.
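As a rough illustration of the popen() idea, here is a minimal sketch that runs a shell command and collects its standard output. The helper name run_and_capture is hypothetical (it is not the class used in the demo below), and popen()/pclose() are POSIX calls, so this assumes Linux or a similar system.

```cpp
#include <cassert>
#include <stdio.h>      // popen, pclose, fread (popen/pclose are POSIX)
#include <stdexcept>
#include <string>

// Hypothetical helper: runs `cmd` through the shell and returns everything
// the command wrote to its standard output.
std::string run_and_capture(const std::string& cmd)
{
    assert(!cmd.empty());
    FILE* pipe = popen(cmd.c_str(), "r");
    if (pipe == nullptr)
        throw std::runtime_error("popen failed for: " + cmd);

    std::string output;
    char chunk[4096];
    size_t n;
    while ((n = fread(chunk, 1, sizeof chunk, pipe)) > 0)
        output.append(chunk, n);

    // pclose waits for the child process to finish.
    if (pclose(pipe) == -1)
        throw std::runtime_error("pclose failed for: " + cmd);
    return output;
}
```

A call such as run_and_capture("gzip -c input.txt") would return the compressed bytes as a string; the file name there is only an example.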
update 2017-08-07
In my popen demo (for this question), I build the command string with standard shell commands:
std::string cmd;
cmd += "tar -cvf dumy514DEMO.tar dumy*V?.cc ; gzip dumy514DEMO.tar ; ls -lsa *.tar.gz";
// tar without compression ; next do compress
Then construct my popen-wrapped-in-a-class instance and invoke the popen read action. There is normally very little feedback to the user (as is the style of UNIX Philosophy, i.e. no success messages), so I included (for this demo) the -v (for verbose option). The resulting feedback lists the 4 files tar'd together, and I list the resulting .gz file.
dumy514V0.cc
dumy514V1.cc
dumy514V2.cc
dumy514V3.cc
8 -rw-rw-r-- 1 dmoen dmoen 7983 Aug 7 17:23 dumy514DEMO.tar.gz
And a snippet from the dir listing shows my executable, my source code, and the newly created tar.gz.
-rwxrwxr-x 1 dmoen dmoen 86416 Aug 7 17:18 dumy514DEMO
-rw-rw-r-- 1 dmoen dmoen 13576 Aug 7 17:18 dumy514DEMO.cc
-rw-rw-r-- 1 dmoen dmoen 7983 Aug 7 17:23 dumy514DEMO.tar.gz
As you can see, the tar.gz is about 8000 bytes. The 4 files add to about 70,000 bytes.
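If shelling out to tar and gzip is not an option, a different technique from the one above is a simple length-prefixed container: each piece (the attribute text, then each PNG held in memory as raw bytes) is written as an 8-byte length followed by the data. The sketch below is only an illustration under assumptions: the function names are hypothetical, the streams must be opened in binary mode, and AttributeList is assumed to serialize cleanly through operator<<.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Serializes any type with operator<< (for example the question's
// AttributeList) into a string that can be stored as one chunk.
template <typename T>
std::string to_text(const T& value)
{
    std::ostringstream out;
    out << value;
    return out.str();
}

// Writes one chunk: 8-byte little-endian length, then the raw bytes.
void write_chunk(std::ostream& out, const std::string& bytes)
{
    const std::uint64_t len = bytes.size();
    char header[8];
    for (int i = 0; i < 8; ++i)
        header[i] = static_cast<char>((len >> (8 * i)) & 0xFF);
    out.write(header, 8);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
}

// Reads one chunk into `bytes`; returns false on a clean end of file.
bool read_chunk(std::istream& in, std::string& bytes)
{
    char header[8];
    in.read(header, 8);
    if (in.gcount() == 0) return false;
    if (in.gcount() != 8) throw std::runtime_error("truncated chunk header");

    std::uint64_t len = 0;
    for (int i = 0; i < 8; ++i)
        len |= static_cast<std::uint64_t>(static_cast<unsigned char>(header[i])) << (8 * i);

    bytes.resize(static_cast<std::size_t>(len));
    if (len > 0) {
        in.read(&bytes[0], static_cast<std::streamsize>(len));
        if (static_cast<std::uint64_t>(in.gcount()) != len)
            throw std::runtime_error("truncated chunk body");
    }
    return true;
}
```

Keeping each image as a std::string or std::vector<char> of the original PNG bytes makes saving trivial, because the bytes can be written back out unchanged.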

Related

Can I redirect additional file descriptors from the command line while launching a program?

From C++, you can output to cout and cerr, which are handled by the file descriptors 1 and 2 respectively. Outside of the C++ program, I can then redirect this output wherever I want it (in this case, writing it to two separate files):
$ my-program 1>output 2>errors;
Am I stuck with only file descriptors 1 and 2, or can I "create my own"? Let's say, I wanted a third output that saves debug information, or a fourth output that mails the administrator?
$ my-program 1>output 2>errors 3>>/logs/my-program.log 4>&1 | scripts/email-admin.sh;
Can I write to file descriptors 3 and 4 within my program?

Opening all your files in a wrapper script is not usually a good design. When you want your program to be smarter, and able to close a big log file and start a new one, you'll need the logic in your program.
But to answer the actual question:
Yes, you can have the shell open whatever numbered file descriptors you like, for input, output or both. Having the parent open them before execve(2) is identical to what you'd get from opening them with code in the child process (at least on a POSIX system, where stdin/out/err are just like other file descriptors, and not special). File descriptors can be marked as close-on-exec or not: use open(2) with O_CLOEXEC, or after opening use fcntl(2) to set FD_CLOEXEC.
They don't have to refer to regular files, either. Any of them can be ttys, pipes, block or character device files, sockets, or even directories. (There's no shell redirection syntax for opening directories, because you can only use readdir(3) on them, not read(2) or write(2).)
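From the C++ side, using an inherited descriptor could look roughly like the sketch below. The helper names are hypothetical; the calls themselves (fcntl(2) with F_GETFD/F_SETFD and write(2)) are standard POSIX. Checking whether the descriptor is open first matters because nothing guarantees the caller actually redirected fd 3.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>     // fcntl, F_GETFD, F_SETFD, FD_CLOEXEC
#include <string>
#include <unistd.h>    // write, read, pipe, close

// True if `fd` refers to an open descriptor in this process.
bool fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}

// Writes the whole string to `fd`, retrying on short writes and EINTR.
bool write_all(int fd, const std::string& text)
{
    assert(fd >= 0);
    std::size_t done = 0;
    while (done < text.size()) {
        const ssize_t n = write(fd, text.data() + done, text.size() - done);
        if (n < 0) {
            if (errno == EINTR) continue;
            return false;
        }
        done += static_cast<std::size_t>(n);
    }
    return true;
}

// Marks `fd` close-on-exec so child programs started with execve(2)
// do not inherit it.
bool set_cloexec(int fd)
{
    const int flags = fcntl(fd, F_GETFD);
    if (flags == -1) return false;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) != -1;
}
```

With the program started as in the question (3>>/logs/my-program.log), something like `if (fd_is_open(3)) write_all(3, "debug: started\n");` would append to that log.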
See this bash redirection tutorial. And just as a quick example of my own:
peter@tesla:~$ yes | sleep 60 4> >(cat) 5</etc/passwd 9>/dev/tcp/localhost/22 42<>/tmp/insecure_temp &
[1] 25644
peter@tesla:~$ ll /proc/$!/fd
total 0
lr-x------ 1 peter peter 64 Sep 9 21:31 0 -> pipe:[46370587]
lrwx------ 1 peter peter 64 Sep 9 21:31 1 -> /dev/pts/19
lrwx------ 1 peter peter 64 Sep 9 21:31 2 -> /dev/pts/19
l-wx------ 1 peter peter 64 Sep 9 21:31 4 -> pipe:[46372222]
lrwx------ 1 peter peter 64 Sep 9 21:31 42 -> /tmp/insecure_temp
lr-x------ 1 peter peter 64 Sep 9 21:31 5 -> /etc/passwd
l-wx------ 1 peter peter 64 Sep 9 21:31 63 -> pipe:[46372222]
lrwx------ 1 peter peter 64 Sep 9 21:31 9 -> socket:[46372228]
# note the rwx permissions: some are read-write, some are one-way
The >(cmd) process substitution syntax expands to a filename like /dev/fd/63. Using 4> >(cmd) opens that fd as fd 4, as well.
Redirecting stderr to a pipe takes some juggling of file descriptors, because there's no 2| cmd syntax. 2> >(cmd) works, but the cmd runs in the background:
peter@tesla:~$ (echo foo >&2 ) 2> >(wc) # next prompt printed before wc output
peter@tesla:~$ 1 1 4
peter@tesla:~$ ( echo foo >&2; ) 2>&1 | wc
1 1 4

The usual way to handle something like sending debug information somewhere else would be to choose a logging system that writes directly to a file (rather than sending it to another virtual file descriptor, and then redirecting that file descriptor in bash to file).
One option would be to use mkfifo, and have your program write to the fifo as a file, then use some other means to direct the fifo to other locations.
Another option for the mail script would be to run the mail script as a subprocess from in the C++ program and write to its stdin with internal piping. Piping into sendmail is the traditional way for a program to send mail on a Unix system.
The best thing to do is to find libraries that handle the things that you want to do.
Examples
mkfifo
http://linux.die.net/man/3/mkfifo
You use the mkfifo command to create a special file, and then you can fopen and fprint* to it as usual. You can pass the file name to the program like any other.
The bash part looks something like this:
mkfifo mailfifo
yourprogram mailfifo &
cat mailfifo | scripts/email-admin.s
Pipe to subprocess
http://www.gnu.org/software/libc/manual/html_node/Pipe-to-a-Subprocess.html
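A minimal sketch of the pipe-to-subprocess option, assuming a POSIX system: popen() in "w" mode connects a FILE* to the child's standard input. The function name pipe_to_command is hypothetical, and the mail command you pass in (for example the question's scripts/email-admin.sh) is up to you.

```cpp
#include <cassert>
#include <stdio.h>      // popen, pclose, fwrite (popen/pclose are POSIX)
#include <string>
#include <sys/wait.h>   // WIFEXITED, WEXITSTATUS

// Hypothetical helper: feeds `text` to the standard input of `cmd`.
// Returns the command's exit code, or -1 if it could not be run,
// the write failed, or the command did not exit normally.
int pipe_to_command(const std::string& cmd, const std::string& text)
{
    assert(!cmd.empty());
    FILE* child = popen(cmd.c_str(), "w");
    if (child == nullptr) return -1;

    const bool wrote = fwrite(text.data(), 1, text.size(), child) == text.size();

    // pclose flushes buffered data, closes the pipe (the child sees EOF),
    // and waits for the child to exit.
    const int status = pclose(child);
    if (!wrote || status == -1 || !WIFEXITED(status)) return -1;
    return WEXITSTATUS(status);
}
```

One caveat: if the child exits without reading its input, the write can raise SIGPIPE in the parent, so a long-running program may want to ignore or handle that signal.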

Windows Get list of ALL files on volume with size

Question: how to list all files on volume with size they occupy on disk?
Applicable solutions:
cmd script
free tool with sqlite/txt/xls/xml/json output
C++ / winapi code
The problem:
There are many tools and APIs to list files, but their results don't match chkdsk and the actual free space info:
Tool                                Size     Count (x1000)
chkdsk c:                           67 GB    297
dir /S                              42 GB    267
FS Inspect                          47 GB    251
Total Commander (Ctrl+L)            47 GB    251
explorer (selection size)           44 GB    268
explorer (volume info)              67 GB    -
WinDirStat                          45 GB    245
TreeSize                            couldn't download it - site unavailable
C++ FindFirstFile/FindNextFile      50 GB    288
C++ GetFileInformationByHandleEx    50 GB    288
Total volume size is 70 GB, about 3 GB is actually free.
I'm aware of:
A file can occupy more space on disk than its actual size; I need the size it occupies (i.e. the greater one).
Symlinks, junctions etc. - it would be good to see them (though I don't think this alone can really account for a 20 GB difference in my case).
The filesystem uses some space for indexes and system info (chkdsk shows this is negligible; it doesn't account for 20 GB).
I run all tools with admin privileges, and hidden files are shown.
FindFirstFile/FindNextFile C++ solution - this doesn't give correct results. I don't know why, but it gives the same totals as Total Commander, NOT the same as chkdsk.
Practical problem:
I have a 70 GB SSD; all the tools report that about 50 GB is occupied, but in fact it's almost full.
Formatting everything and reinstalling is not an option, since this will happen again quite soon.
I need a report about filesizes. Report total must match actual used and free space. I'm looking for an existing solution - a tool, a script or a C++ library or C++ code.
(Actual output below)
chkdsk c:
Windows has scanned the file system and found no problems.
No further action is required.
73715708 KB total disk space.
70274580 KB in 297259 files.
167232 KB in 40207 indexes.
0 KB in bad sectors.
463348 KB in use by the system.
65536 KB occupied by the log file.
2810548 KB available on disk.
4096 bytes in each allocation unit.
18428927 total allocation units on disk.
702637 allocation units available on disk.
dir /S
Total Files Listed:
269966 File(s) 45 071 190 706 bytes
143202 Dir(s) 3 202 871 296 bytes free
FS Inspect http://sourceforge.net/projects/fs-inspect/
47.4 GB 250916 Files
Total Commander
49709355k, 48544M 250915 Files
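As a sanity check on the figures, the chkdsk numbers above are internally consistent: allocation units times 4096 bytes gives the reported totals, and files + indexes + system + available adds up to the total disk space (which suggests the 65536 KB log file is already counted inside "in use by the system", though that is an inference from the arithmetic). The sketch below just replays that arithmetic and shows the usual round-up-to-allocation-unit rule for size on disk; it ignores filesystem specifics such as compression or sparse files.

```cpp
#include <cassert>
#include <cstdint>

// Figures copied from the chkdsk output above.
const std::uint64_t kBytesPerUnit = 4096;
const std::uint64_t kTotalUnits   = 18428927;
const std::uint64_t kFreeUnits    = 702637;
const std::uint64_t kTotalKB      = 73715708;
const std::uint64_t kFilesKB      = 70274580;
const std::uint64_t kIndexesKB    = 167232;
const std::uint64_t kBadKB        = 0;
const std::uint64_t kSystemKB     = 463348;
const std::uint64_t kFreeKB       = 2810548;

// Converts a number of allocation units to kilobytes.
std::uint64_t units_to_kb(std::uint64_t units)
{
    assert(units <= UINT64_MAX / kBytesPerUnit);  // guard against overflow
    return units * kBytesPerUnit / 1024;
}

// Space a file of `size` bytes occupies when rounded up to whole
// allocation units (a simplification, see the note above).
std::uint64_t size_on_disk(std::uint64_t size)
{
    return (size + kBytesPerUnit - 1) / kBytesPerUnit * kBytesPerUnit;
}
```

Since 70274580 KB is about 67 GB, the chkdsk row in the table matches its own raw output, so the gap lies in what the enumerating tools fail to see or count.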

On a POSIX system, the answer would be to use the stat function. Unfortunately, it does not give the number of allocated blocks on Windows, so it does not meet your requirements.
The correct function from Windows API is GetFileInformationByHandleEx. You can use FindFirstFile, FindNextFile to browse the full disk, and ask for FileStandardInfo to get a FILE_STANDARD_INFO that contains for a file (among other fields): LARGE_INTEGER AllocationSize for the allocated size and LARGE_INTEGER EndOfFile for the used size.
Alternatively, you can use directly GetFileInformationByHandleEx on directories, asking for FileIdBothDirectoryInfo to get a FILE_ID_BOTH_DIR_INFO structure. This allows you to get information on many files in a single call. My advice would be to use that one, even if it is of less common usage.

To get a list of all files (including hidden and system files), sorted within directories by descending size, you can go to your cmd.exe and type:
dir /s/a:-d/o:-s C:\* > "list_of_files.txt"
Where:
/s lists files within the specified directory and all subdirectories,
/a:-d lists only files (no directories),
/o:-s put files within directory in descending size order,
C:\* means all directories on disk C,
> "list_of_files.txt" means save output to list_of_files.txt file
Listing files grouped by directory may be a little inconvenient, but it's the easiest way to list all files. For more information, take a look at technet.microsoft.com
Checked on Win7 Pro.

Correcting VirtualBox raw disk VMDK file after resizing a partition

I have a Windows 8.1 installation on second partition of my second HDD (/dev/sdb2 in Ubuntu) created using the command
VBoxManage internalcommands createrawvmdk -filename sdb2.vmdk -rawdisk /dev/sdb -partitions 2
Everything worked just fine - the Windows installation was runnable from VirtualBox and even bootable normally from GRUB. The last time I was installing some software in Windows (PC booted up directly into Windows), I discovered there was not enough space on the system partition (/dev/sdb2) and enlarged it by the 15 GB that were left spare on the HDD.
These changes made the Windows installation unusable in VirtualBox of course - it fails to boot offering some repair options. The first thing which I realized that is needed to do was enlarging the partition in the VMDK file, so I backed up the old sdb2.vmdk and sdb2-pt.vmdk files and recreated them with the same command as before.
This, however, made no change, because sdb2-pt.vmdk seems to be storing the boot record (MBR in my case, currently with GRUB) and some more stuff needed for Windows to work properly. My next attempt was replacing the new sdb2-pt.vmdk with the old one (with Windows bootloader and perhaps the old partition table) - this didn't work either.
How to update the VMDK files with the new partition size to make the enlarged Windows 8.1 installation bootable from VirtualBox again?

I have finally found the solution myself. Since the VBoxManage internalcommands createrawvmdk -filename sdb2.vmdk -rawdisk /dev/sdb -partitions 2 command produces two valid files based on the current disk structure, the only change needed was to recover the Windows boot loader from the old sdb2-pt.vmdk file which is a rather straightforward process. If you only wish to learn the recovery steps, you can skip the following theoretical part.
Some background information on the VMDK file format
VMware Disk Format (VMDK) consists of two files - a descriptor file (sdb2.vmdk in the original question) and an extent file (sdb2-pt.vmdk). Their internal structure is well defined in the specification from VMware. I'll sum up the most important parts:
The descriptor file (sdb2.vmdk) contains a section annotated # Extent description which can look something like this:
# Extent description
RW 63 FLAT "sdb2-pt.vmdk" 0
RW 41943040 ZERO
RW 83886080 FLAT "/dev/sdb" 58722304
RW 2 FLAT "sdb2-pt.vmdk" 63
RW 1191843568 ZERO
One extent description (a row from those above) has the following structure:
Access Size in sectors Type of extent Filename (Offset)
The offset parameter (specified only for FLAT type extents) specifies the offset (in sectors) of the given extent within the file Filename. Notice that file sdb2-pt.vmdk consists of two extents, the first 63 sectors long and the second only 2 sectors long.
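To make the row layout concrete, here is a small sketch that parses one extent description line into its fields. It is an illustration based only on the structure described above, not on the full VMDK specification; the struct and function names are hypothetical, and 512-byte sectors are an assumption taken from this answer.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

struct Extent {
    std::string access;          // e.g. "RW"
    std::uint64_t sectors = 0;   // size in sectors
    std::string type;            // e.g. "FLAT" or "ZERO"
    std::string filename;        // empty when the row names no file
    std::uint64_t offset = 0;    // offset in sectors within `filename`
};

// Parses a row such as:  RW 63 FLAT "sdb2-pt.vmdk" 0
// Returns false if the row does not match the expected layout.
bool parse_extent(const std::string& line, Extent& out)
{
    std::istringstream in(line);
    Extent e;
    if (!(in >> e.access >> e.sectors >> e.type)) return false;

    in >> std::ws;                       // skip blanks before the filename
    if (in.peek() == '"') {
        in.get();                        // consume the opening quote
        if (!std::getline(in, e.filename, '"') || in.eof())
            return false;                // closing quote is missing
        in >> e.offset;                  // stays 0 if no offset is given
    }
    out = e;
    return true;
}

// Size of an extent in bytes for a given sector size.
std::uint64_t extent_bytes(const Extent& e, std::uint64_t sector_size = 512)
{
    assert(sector_size > 0);
    return e.sectors * sector_size;
}
```

Summing extent_bytes over the two rows that name sdb2-pt.vmdk gives (63 + 2) * 512 = 33280 bytes, the expected length of that file.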
The FLAT extent file sdb2-pt.vmdk is a raw data binary file identical to one you would obtain e.g. using the dd command on Unix-like systems. Since the sector size was 512 bytes in my case (I don't know if this is a general rule), the sdb2-pt.vmdk file (based on the new disk partitioning described in the extent description above) was (63+2)*512 bytes long.
Now to the second extent (the one with only 2 sectors in size). This is a padding extent which arose in my new partition table after enlarging the Windows partition (third extent in the description table). Since my previous partition table did not contain any such padding, the old sdb2-pt.vmdk file only contained the first 63 sectors long extent and thus was 1 024 bytes smaller than the new one generated by the VBoxManage internalcommands createrawvmdk -filename sdb2.vmdk -rawdisk /dev/sdb -partitions 2 command. This obviously rendered the old extent file and the new one incompatible.
The recovery process
Please be aware that the following steps apply to the old MBR disk structure only!
You surely want to keep the new partition structure and to propagate any changes made in the partition table to the VMDK file. Proceed with these steps:
Back up your old descriptor file (sdb2.vmdk) and extent file (sdb2-pt.vmdk). In the following steps, you will only need the second one, but you never know what else could happen.
Generate new descriptor and extent files issuing the command:
VBoxManage internalcommands createrawvmdk -filename sdb2.vmdk -rawdisk /dev/sdb -partitions 2
Now, the first extent entry in your new descriptor file (sdb2.vmdk) should look like this:
RW ## FLAT "sdb2-pt.vmdk" 0
With the knowledge that you want to keep the new partition table (with everything that follows) and only restore the Windows boot loader stored in the backed up extent file (old sdb2-pt.vmdk), you have to copy the first 440 bytes (boot loader) from the old extent file to the new one. This can either be done with a hex editor (copy all values starting from address 0x0 up to 0x1B8 exclusive) or on a Unix-like system using the command below. Note conv=notrunc: without it, dd truncates the output file to the 440 bytes written.
dd if=old-sdb2-pt.vmdk of=sdb2-pt.vmdk bs=1 count=440 conv=notrunc
Voilà.
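For anyone without dd or a hex editor at hand, the same in-place patch can be sketched in C++. The function name is hypothetical; the key point is opening the destination for update (in | out) so the existing file is overwritten at offset 0 rather than truncated. Work on copies of the files until you have verified the result.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Copies the first `count` bytes of `old_path` over the start of `new_path`,
// leaving the rest of `new_path` (partition table and later sectors) as is.
bool copy_boot_code(const std::string& old_path,
                    const std::string& new_path,
                    std::size_t count = 440)
{
    assert(count > 0);
    std::ifstream src(old_path.c_str(), std::ios::binary);
    if (!src) return false;

    std::vector<char> boot(count);
    src.read(boot.data(), static_cast<std::streamsize>(count));
    if (static_cast<std::size_t>(src.gcount()) != count) return false;

    // in | out opens the existing file for update without truncating it.
    std::fstream dst(new_path.c_str(),
                     std::ios::in | std::ios::out | std::ios::binary);
    if (!dst) return false;
    dst.seekp(0);
    dst.write(boot.data(), static_cast<std::streamsize>(count));
    dst.flush();
    return dst.good();
}
```

A call such as copy_boot_code("old-sdb2-pt.vmdk", "sdb2-pt.vmdk") mirrors the dd step in the recovery process.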

On GitHub there is a tool that will do that automatically (and re-running it with the same options will update its vmdk and auxiliary files, so you can change partitions later too): https://github.com/vasi/vmdk-raw-parts

"too many open files" error after deleting many files

My program creates a log file every 10 seconds in a specified directory. Then in a different thread, it iterates the files in that directory. If the file has content it compresses it and uploads it to external storage, if the file is empty, it deletes it. After the program runs for a while I get an error "too many open files" (gzopen failed, errno = 24).
When I looked inside /proc/<pid>/fd I see many broken links to files in the same directory where the logs are created and the word (deleted) next to the link.
Any idea what I am doing wrong? I checked the return values in both threads: of the close function (in the thread which writes the logs), and of boost::filesystem::remove (in the thread which compresses and uploads the non-empty log files and deletes empty log files). All the return values are zero, while the list of (deleted) links gets longer by 1 every 10 seconds.
I think this problem never happened to me on 32 bits but recently I moved to 64 bits and now I got this surprise.

You are neglecting to close files you open.
From your description, it sounds like you close the files you open for logging in your logging thread, but you go on to say that you just boost::filesystem::remove files after compressing and/or uploading.
Remember that:
Any compressed file you opened with gzopen has to be gzclosed
Any uncompressed file you open to compress it has to be closed.
If you open a file to check if it's empty, you have to close it.
If you open a file to transfer it, you have to close it.
Output of /proc/pid/fd would be very helpful in narrowing this down, but unfortunately you don't post it. Examples of how seemingly unhelpful output gives subtle hints:
# You forgot to gzclose the output file after compressing it
l-wx------ user group 64 Apr 9 10:17 43 -> /tmp/file.gz (deleted)
# You forgot to close the input file after compressing it
lr-x------ user group 64 Apr 9 10:17 43 -> /tmp/file (deleted)
# You forgot to close the log file after writing to it
l-wx------ user group 64 Apr 9 10:17 43 -> /tmp/file (deleted)
# You forgot to close the input file after transferring it
lr-x------ user group 64 Apr 9 10:17 43 -> /tmp/file.gz (deleted)
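One way to make a forgotten close impossible is to tie the handle to a scope. The sketch below does this for a plain FILE* with std::unique_ptr and a custom deleter; the names are hypothetical. The same pattern should carry over to a gzFile with a deleter that calls gzclose, but zlib is left out here to keep the example self-contained.

```cpp
#include <cassert>
#include <cstdio>
#include <memory>
#include <string>

// Deleter that closes the file when the owning unique_ptr goes out of scope.
struct FileCloser {
    void operator()(std::FILE* f) const { std::fclose(f); }
};
typedef std::unique_ptr<std::FILE, FileCloser> FilePtr;

// Returns true if the file can be opened and contains no bytes.
// The handle is closed on every return path, including early returns.
bool file_is_empty(const std::string& path)
{
    assert(!path.empty());
    FilePtr f(std::fopen(path.c_str(), "rb"));
    if (!f) return false;
    return std::fgetc(f.get()) == EOF;
}
```

Removing a file while a descriptor is still open only unlinks the name; the descriptor stays in /proc/<pid>/fd marked (deleted) until it is closed, which matches the symptom in the question.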

Reading binary files, Linux Buffer Cache

I am busy writing something to test the read speeds for disk IO on Linux.
At the moment I have something like this to read the files:
Edited to change code to this:
const int segsize = 1048576;
char buffer[segsize];
ifstream file;
file.open(sFile.c_str());
while(file.readsome(buffer,segsize)) {}
For foo.dat, which is 150GB, the first time I read it in, it takes around 2 minutes.
However if I run it within 60 seconds of the first run, it will then take around 3 seconds to run. How is that possible? Surely the only place that could be read from that fast is the buffer cache in RAM, and the file is too big to fit in RAM.
The machine has 50GB of ram, and the drive is a NFS mount with all the default settings. Please let me know where I could look to confirm that this file is actually being read at this speed? Is my code wrong? It appears to take a correct amount of time the first time the file is read.
Edited to Add:
Found out that my files were only reading up to a random point. I've managed to fix this by changing segsize down to 1024 from 1048576. I have no idea why changing this allows the ifstream to read the whole file instead of stopping at a random point.
Thanks for the answers.

On Linux, you can do this for a quick throughput test:
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.863904 s, 243 MB/s
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.0748273 s, 2.8 GB/s
$ sync && echo 3 > /proc/sys/vm/drop_caches
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.919688 s, 228 MB/s
echo 3 > /proc/sys/vm/drop_caches will flush the cache properly

in_avail doesn't give the length of the file, but a lower bound of what is available (especially if the buffer has already been used, it returns the size available in the buffer). Its goal is to know what can be read without blocking.
unsigned int is most probably unable to hold a length of more than 4GB, so what is read can very well be in the cache.
C++0x Stream Positioning may be interesting to you if you are using large files

in_avail returns the lower bound of how much is available to read in the stream's read buffer, not the size of the file. To read the whole file via the stream, just keep calling the stream's readsome() method and checking how much was read with the gcount() method - when that returns zero, you have read everything.
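Because how much readsome() delivers depends on what the stream buffer already holds (and is implementation-dependent), a variation that may be more predictable is to loop on read() and use gcount() for the final short chunk. This is a swapped-in technique rather than the readsome() loop described above, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Reads `path` to the end in fixed-size chunks and returns the number of
// bytes read, or -1 if the file cannot be opened.
std::int64_t count_bytes(const std::string& path,
                         std::size_t chunk_size = 1 << 20)
{
    assert(chunk_size > 0);
    std::ifstream file(path.c_str(), std::ios::binary);
    if (!file) return -1;

    std::vector<char> buffer(chunk_size);   // on the heap, not a 1 MiB stack array
    std::int64_t total = 0;

    // read() blocks until the chunk is full or EOF is reached. The last,
    // shorter chunk sets failbit, so gcount() is checked before stopping.
    while (file.read(buffer.data(), static_cast<std::streamsize>(buffer.size()))
           || file.gcount() > 0) {
        total += file.gcount();
    }
    return total;
}
```

Comparing the returned total against the known file size is a quick way to confirm that a timing run really covered the whole file.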

It appears to take a correct amount of time the first time the file is read.
On that first read, you're reading 150GB in about 2 minutes. That works out to about 10 gigabits per second. Is that what you're expecting (based on the network to your NFS mount)?

One possibility is that the file could be at least in part sparse. A sparse file has regions that are truly empty - they don't even have disk space allocated to them. These sparse regions also don't consume much cache space, and so reading the sparse regions will essentially only require time to zero out the userspace pages they're being read into.
You can check with ls -lsh. The first column will be the on-disk size - if it's less than the file size, the file is indeed sparse. To de-sparse the file, just write to every page of it.
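The same check can be done from code with stat(2), comparing st_size with st_blocks. The sketch assumes Linux, where st_blocks is counted in 512-byte units; the struct and function names are hypothetical. Over NFS the block count is whatever the server reports, and some filesystems store tiny files without allocating blocks, so treat the result as a hint.

```cpp
#include <cassert>
#include <cstdint>
#include <sys/stat.h>   // stat, struct stat

struct FileSizes {
    std::int64_t logical;    // st_size: length of the file in bytes
    std::int64_t allocated;  // st_blocks * 512: bytes allocated on disk
};

// Fills `out` for `path`; returns false if stat fails.
bool get_file_sizes(const char* path, FileSizes& out)
{
    assert(path != nullptr);
    struct stat st;
    if (stat(path, &st) != 0) return false;
    out.logical = static_cast<std::int64_t>(st.st_size);
    out.allocated = static_cast<std::int64_t>(st.st_blocks) * 512;
    return true;
}

// A file whose allocated size is below its length has holes (or is stored
// in some other compact form).
bool looks_sparse(const FileSizes& s)
{
    return s.allocated < s.logical;
}
```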
If you would like to test for true disk speeds, one option would be to use the O_DIRECT flag to open(2) to bypass the cache. Note that all IO using O_DIRECT must be page-aligned, and some filesystems do not support it (in particular, it won't work over NFS). Also, it's a bad idea for anything other than benchmarking. See some of Linus's rants in this thread.
Finally, to drop all caches on a Linux system for testing, you can do:
echo 3 > /proc/sys/vm/drop_caches
If you do this on both client and server, you will force the file out of memory. Of course, this will have a negative performance impact on anything else running at the time.
