How to load a GZIP file in a distributed way? - c++

I have a gzip file "dat.gz". The original file contains only ASCII text, line by line. The .gz file was generated by 'pigz -i'.
I want to load "dat.gz" into several processes for parallel data processing. The programming language must be C or C++, under Linux.
For example, if the original file contains "1\n2\n3" and I load the .gz file into 3 processes (p0, p1, p2), then p0 gets "1", p1 gets "2" and p2 gets "3".
I read the file format of gz here: http://tools.ietf.org/pdf/rfc1952.pdf , and I found that each block of one .gz file starts with "\x1f\x8b". So I cut the .gz file by "\x1f\x8b" into blocks. But when I use the decompress lib of boost to process the block, something goes wrong.
Maybe my method was wrong at root.
My test .gz file can be downloaded here: https://drive.google.com/file/d/0B9DaAjBTb3bbcEM1N1c4OEg0SWc/view?usp=sharing
My C++ test code follows. I build and run it with "g++ -std=c++11 test.cpp -lboost_iostreams && ./a.out", and it throws an exception:
terminate called after throwing an instance of
boost::exception_detail::clone_impl >'
what(): gzip error
Aborted
#include <stdio.h>
#include <stdlib.h>
#include <string>
#include <fstream>
#include <iostream>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/iostreams/copy.hpp>
#include <sstream>
//define buffer size of fread: 128KB
#define BUFSIZE 128*1024
void get_first_block(char *fn) {
FILE* fin = fopen(fn, "rb");
char buf[BUFSIZE] = {0};
int pos = 0;
//skip first 2 byte
fread(buf, sizeof(char), 2, fin);
int i;
while (1) {
int sz = fread(buf, sizeof(char), BUFSIZE, fin);
if (sz <= 1) {
break;
}
for (i=0; i<sz-1; ++i) {
if (buf[i] == (char)0x1f && buf[i+1] == (char)0x8b) {
break;
}
}
pos += sz;
}
//first block start: 0
//first block end: pos + i -1
int len = pos+i;
fseek(fin, 0, SEEK_SET);
char *blk = (char*)malloc(len);
fread(blk, 1, len, fin);
using namespace boost::iostreams;
filtering_streambuf<input> in;
in.push( gzip_decompressor() );
in.push( boost::iostreams::array_source(blk , len) );
std::stringstream _sstream;
boost::iostreams::copy(in, _sstream);
std::cout << _sstream.rdbuf() ;
}
int main() {
get_first_block("0000.gz");
return 0;
}

It's unlikely that there is more than one of those blocks in a .gz file, also see the Wikipedia article about gzip:
Although its file format also allows for multiple such streams to be
concatenated (zipped files are simply decompressed concatenated as if
they were originally one file), gzip is normally used to compress
just single files.
This is especially true for your test file, because if you additionally look at the "compression method" flag, you can expand the search string to 0x1F, 0x8B, 0x08 which only appears once at the very beginning of your test file.
When trying to split a .gz file into blocks, you've got to do some more parsing instead of just looking for 0x1F, 0x8B, because this can also appear inside compressed data blocks or other parts of the member.
You have to parse the members and the compressed data. Unfortunately, a gzip member records only the uncompressed length of the data (the ISIZE field in its trailer), not the compressed length, so you can't just skip the compressed data without parsing it.
The compressed data will be deflate data (there are other, but unused compression types), see RFC 1951. For non-compressed deflate blocks (chapter 3.2.4), there's a LEN field in the header so you can skip those easily. But unfortunately, there's no length field in the header of compressed blocks, so you'll have to completely parse those.
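To illustrate the "more parsing" part, here is a minimal sketch of measuring a gzip member header according to RFC 1952. The function name gzip_header_length is my own, not part of any library. It only tells you where the deflate data starts; it does not help you find where the compressed data ends.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the length in bytes of the gzip member header starting at data[pos],
// or 0 if no well-formed header (deflate method) starts there. See RFC 1952.
std::size_t gzip_header_length(const std::vector<std::uint8_t>& data,
                               std::size_t pos = 0) {
    const std::size_t n = data.size();
    if (pos > n || n - pos < 10) return 0;            // fixed part is 10 bytes
    if (data[pos] != 0x1f || data[pos + 1] != 0x8b) return 0;  // ID1, ID2
    if (data[pos + 2] != 8) return 0;                 // CM: 8 means deflate
    const std::uint8_t flg = data[pos + 3];
    if (flg & 0xe0) return 0;                         // reserved bits must be 0
    std::size_t p = pos + 10;                         // skip MTIME, XFL, OS
    if (flg & 0x04) {                                 // FEXTRA: XLEN + payload
        if (n - p < 2) return 0;
        const std::size_t xlen = data[p] | (data[p + 1] << 8);
        p += 2;
        if (n - p < xlen) return 0;
        p += xlen;
    }
    const std::uint8_t string_flags[] = {0x08, 0x10}; // FNAME, FCOMMENT
    for (std::uint8_t bit : string_flags) {
        if (flg & bit) {                              // zero-terminated string
            while (p < n && data[p] != 0) ++p;
            if (p == n) return 0;
            ++p;
        }
    }
    if (flg & 0x02) {                                 // FHCRC: 2-byte CRC16
        if (n - p < 2) return 0;
        p += 2;
    }
    return p - pos;
}
```

A return value of 0 at a given offset means the 0x1F, 0x8B found there cannot be the start of a member, which already filters out many false hits.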

pigz -i compresses each block independently, which permits random access at each block boundary. Between each block is an empty stored block, which ends with the sequence of bytes 00 00 ff ff. You can search for that sequence, and attempt to decompress after that. There are 39 such markers in your example file.
There is nothing that prevents 00 00 ff ff from appearing in the middle of a compressed block, not marking a block boundary. So you should expect that occasionally you will get a false indication of such a boundary, indicated by a failure to decompress. In that case, simply move on to the next such marker.
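A minimal sketch of the marker search described above, assuming the whole .gz file has already been read into memory. The function name candidate_block_starts is hypothetical. Each returned offset is only a candidate: you would still try a raw-deflate decompression from there (for example zlib's inflateInit2 with negative windowBits) and skip the candidate if that fails.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the offset just past every occurrence of the byte sequence
// 00 00 ff ff, i.e. every place where an independent block may start.
std::vector<std::size_t> candidate_block_starts(
        const std::vector<std::uint8_t>& data) {
    std::vector<std::size_t> starts;
    for (std::size_t i = 0; i + 4 <= data.size(); ++i) {
        if (data[i] == 0x00 && data[i + 1] == 0x00 &&
            data[i + 2] == 0xff && data[i + 3] == 0xff) {
            starts.push_back(i + 4);  // decompression would begin here
        }
    }
    return starts;
}
```

The candidates can then be divided among the worker processes, each one decompressing from its candidate up to the next one.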

Related

Reading multiple bytes from file and storing them for comparison in C++

I want to binary read a photo in 1460 bytes increments and compare consecutive packets for corrupted transmission. I have a python script that i wrote and want to translate in C++, however I'm not sure that what I intend to use is correct.
for i in range(0, fileSize-1):
    buff=f.read(1460)  # buff stores a packet of 1460 bytes where f is the opened file
    secondPacket=''
    for j in buff:
        secondPacket+="{:02x}".format(j)
    if(secondPacket==firstPacket):
        print(f'Packet {i+1} identical with {i}')
    firstPacket=secondPacket
I have found int fseek ( FILE * stream, long int offset, int origin ); but it's unclear if it reads the first byte that is located offset away from origin or everything in between.
Thanks for clarifications.
#include <iostream>
#include <fstream>
#include <array>
std::array<char, 1460> firstPacket;
std::array<char, 1460> secondPacket;
int i=0;
int main() {
std::ifstream file;
file.open("photo.jpg", std::ios::binary);
while (file.read(firstPacket.data(), firstPacket.size())){
++i;
if (firstPacket==secondPacket)
std::cout<<"Packet "<<i<<" is a copy of packet "<<i-1<<std::endl;
memcpy(&secondPacket, &firstPacket, firstPacket.size());
}
std::cout<<i; //tested to check if i iterate correctly
return 0;
}
This is the code I have so far, which doesn't work.
fseek doesn't read; it just moves the position where the next read operation will begin. If you read the file from start to end, you don't need it.
To read binary data you want the aptly named std::istream::read. You can use it like this with a fixed-size buffer:
// char is one byte, could also be uint8_t, but then you would need a cast later on
std::array<char, 1460> bytes;
while(myInputStream.read(bytes.data(), bytes.size())) {
// do work with the read data
}
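Putting that loop together with the comparison from the question, a sketch could look like the following. The function name find_repeated_packets is my own, and treating a short final packet as a packet of its own is an assumption. It takes any std::istream, so it works with a std::ifstream opened with std::ios::binary.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Returns the 1-based indices of packets that are byte-for-byte identical
// to the packet immediately before them.
std::vector<std::size_t> find_repeated_packets(std::istream& in) {
    constexpr std::size_t kPacketSize = 1460;
    std::array<char, kPacketSize> previous{};
    std::array<char, kPacketSize> current{};
    std::size_t previous_len = 0;
    std::size_t index = 0;
    std::vector<std::size_t> repeats;
    while (true) {
        in.read(current.data(), current.size());
        // gcount() is the number of bytes the last read actually delivered.
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) break;
        ++index;
        if (index > 1 && got == previous_len &&
            std::equal(current.begin(), current.begin() + got,
                       previous.begin())) {
            repeats.push_back(index);
        }
        previous = current;   // std::array supports plain assignment
        previous_len = got;
        if (!in) break;       // that was a short final packet
    }
    return repeats;
}
```

Because std::array can be assigned and compared directly, no memcpy is needed.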

Why read system call stops reading when less than block is missing?

Introduction and general objective
I am trying to send an image from a child process (generated by calling popen from the parent) to the parent process.
The image is a grayscale png image. It is opened with the OpenCV library and encoded using imencode function of the same library. So the resulting encoded data is stored into a std::vector structure of type uchar, namely the buf vector in the code below.
No error in sending preliminary image information
First the child sends the following image information needed by the parent:
size of the buf vector containing the encoded data: this piece of information is needed so that the parent will allocate a buffer of the same size where to write the image information that it will receive from the child. Allocation is performed as follows (buf in this case is the array used to received data not the vector containing the encoded data):
u_char *buf = (u_char*)malloc(val*sizeof(u_char));
number of rows of the original image: needed by the parent to decode the image after all data have been received;
number of columns of the original image: needed by the parent to decode the image after all data have been received.
These data are written by the child on the standard output using cout and read by the parent using fgets system call.
These pieces of information are correctly sent and received, so no problem so far.
Sending image data
The child writes the encoded data (i.e. the data contained in the vector buf) to the standard output using write system call while the parent uses the file-descriptor returned by popen to read the data. Data is read using read system call.
Data writing and reading is performed in blocks of 4096 bytes inside while loops. The writing line is the following:
written += write(STDOUT_FILENO, buf.data()+written, s);
where STDOUT_FILENO tells to write on standard output.
buf.data() returns the pointer to the first element in the array used internally by the vector structure.
written stores the number of bytes that have been written until now and it is used as index. s is the number of bytes (4096) that write will try to send each time.
write returns the number of bytes that actually have been written and this is used to update written.
Data reading is very similar and it is performed by the following line:
bytes_read = read(fileno(fp), buf+total_bytes, bytes2Copy);
fileno(fp) tells read where to read from (fp is the stream returned by popen). buf is the array where received data is stored, and total_bytes is the number of bytes read so far, so it is used as an index. bytes2Copy is the number of bytes expected to be received: it is either BUFLEN (i.e. 4096) or, for the last block, the remaining data (if for example the total is 5000 bytes, then after 1 block of 4096 bytes another block of 5000-4096 is expected).
The code
Consider this example. The following is a process launching a child process with popen
#include <stdlib.h>
#include <unistd.h>//read
#include "opencv2/opencv.hpp"
#include <iostream>
#define BUFLEN 4096
int main(int argc, char *argv[])
{
//file descriptor to the child process
FILE *fp;
cv::Mat frame;
char temp[10];
size_t bytes_read_tihs_loop = 0;
size_t total_bytes_read = 0;
//launch the child process with popen
if ((fp = popen("/path/to/child", "r")) == NULL)
{
//error
return 1;
}
//read the number of btyes of encoded image data
fgets(temp, 10, fp);
//convert the string to int
size_t bytesToRead = atoi((char*)temp);
//allocate memory where to store encoded iamge data that will be received
u_char *buf = (u_char*)malloc(bytesToRead*sizeof(u_char));
//some prints
std::cout<<bytesToRead<<std::endl;
//initialize the number of bytes read to 0
bytes_read_tihs_loop=0;
int bytes2Copy;
printf ("bytesToRead: %ld\n",bytesToRead);
bytes2Copy = BUFLEN;
while(total_bytes_read<bytesToRead &&
(bytes_read_tihs_loop = read(fileno(fp), buf+total_bytes_read, bytes2Copy))
)
{
//bytes to be read at this iteration: either 4096 or the remaining (bytesToRead-total)
bytes2Copy = BUFLEN < (bytesToRead-total_bytes_read) ? BUFLEN : (bytesToRead-total_bytes_read);
printf("%d btytes to copy\n", bytes2Copy);
//read the bytes
printf("%ld bytes read\n", bytes_read_tihs_loop);
//update the number of bytes read
total_bytes_read += bytes_read_tihs_loop;
printf("%lu total bytes read\n\n", total_bytes_read);
}
printf("%lu bytes received over %lu expected\n", total_bytes_read, bytesToRead);
printf("%lu final bytes read\n", total_bytes_read);
pclose(fp);
cv::namedWindow( "win", cv::WINDOW_AUTOSIZE );
frame = cv::imdecode(cv::Mat(1,total_bytes_read,0, buf), 0);
cv::imshow("win", frame);
return 0;
}
and the process opened by the above corresponds to the following:
#include <unistd.h> //STDOUT_FILENO
#include "opencv2/opencv.hpp"
#include <iostream>
using namespace std;
using namespace cv;
#define BUFLEN 4096
int main(int argc, char *argv[])
{
Mat frame;
std::vector<uchar> buf;
//read image as grayscale
frame = imread("test.png",0);
//encode image and put data into the vector buf
imencode(".png",frame, buf);
//send the total size of vector to parent
cout<<buf.size()<<endl;
unsigned int written= 0;
int i = 0;
size_t toWrite = 0;
//send until all bytes have been sent
while (written<buf.size())
{
//send the current block of data
toWrite = BUFLEN < (buf.size()-written) ? BUFLEN : (buf.size()-written);
written += write(STDOUT_FILENO, buf.data()+written, toWrite);
i++;
}
return 0;
}
The error
The child reads an image, encodes it and sends first the dimensions (size, #rows, #cols) to the parent and then the encoded image data.
The parent reads first the dimensions (no prob with that), then it starts reading data. Data is read 4096 bytes at each iteration. However when less than 4096 bytes are missing, it tries to read only the missing bytes: in my case the last step should read 1027 bytes (115715%4096), but instead of reading all of them it just reads 15.
What I got printed for the last two iterations is:
4096 btytes to copy
1034 bytes read
111626 total bytes read
111626 bytes received over 115715 expected
111626 final bytes read
OpenCV(4.0.0-pre) Error: Assertion failed (size.width>0 && size.height>0) in imshow, file /path/window.cpp, line 356
terminate called after throwing an instance of 'cv::Exception'
what(): OpenCV(4.0.0-pre) /path/window.cpp:356: error: (-215:Assertion failed) size.width>0 && size.height>0 in function 'imshow'
Aborted (core dumped)
Why isn't read reading all the missing bytes?
I am working on this image:
There might be errors also on how I am trying to decode back the image so any help there would be appreciated too.
EDIT
In my opinion as opposed to some suggestions the problem is not related to the presence of \n or \r or \0.
In fact when I print data received as integer with the following lines:
for (int ii=0; ii<val; ii++)
{
std::cout<<(int)buf[ii]<< " ";
}
I see 0, 10 and 13 values (the ASCII values of the above mentioned characters) in the middle of data so this makes me think it is not the problem.
fgets(temp, 10, fp);
...
read(fileno(fp), ...)
This cannot possibly work.
stdio routines are buffered. Buffers are controlled by the implementation. fgets(temp, 10, fp); will read an unknown number of bytes from the file and put it in a buffer. These bytes will never be seen by low level file IO again.
You never, ever, use the same file with both styles of IO. Either do everything with stdio, or do everything with low-level IO. The first option is the easiest by far, you just replace read with fread.
If for some ungodly reason known only to the evil forces of darkness you want to keep both styles of IO, you can try that by calling setvbuf(fp, NULL, _IONBF, 0) before doing anything else, which makes the stream unbuffered. I have never done that and cannot vouch for this method, but they say it should work. I don't see a single reason to use it though.
On a possibly unrelated note, your reading loop has some logic in its termination condition that is not so easy to understand and could be invalid. The normal way to read a file looks approximately as follows:
left = data_size;
total = 0;
while (left > 0 &&
(got=read(file, buf+total, min(chunk_size, left))) > 0) {
left -= got;
total += got;
}
if (got == 0) ... // reached the end of file
else if (got < 0) ... // encountered an error
The more correct way would be to try again if got < 0 && errno == EINTR, so the modified condition could look like
while (left > 0 &&
(((got=read(file, buf+total, min(chunk_size, left))) > 0) ||
(got < 0 && errno == EINTR))) {
but at this point readability starts to suffer and you may want to split this in separate statements.
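Split into separate statements, the same idea might look like this sketch. The helper name read_fully is my own, and it asks for all remaining bytes on each call instead of a fixed chunk size.

```cpp
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Reads up to `size` bytes from fd into buf, continuing after short reads and
// retrying when a signal interrupts the call. Returns the number of bytes
// read (smaller than `size` only if end of file was reached), or -1 on error.
ssize_t read_fully(int fd, unsigned char* buf, std::size_t size) {
    std::size_t total = 0;
    while (total < size) {
        const ssize_t got = read(fd, buf + total, size - total);
        if (got > 0) {
            total += static_cast<std::size_t>(got);
        } else if (got == 0) {
            break;              // end of file
        } else if (errno != EINTR) {
            return -1;          // real error
        }                       // EINTR: just try again
    }
    return static_cast<ssize_t>(total);
}
```

A short read from a pipe is normal, which is why the loop keeps going until it has everything or hits end of file.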
You're writing binary data to standard output, which expects text. Newline characters (\n) and/or return characters (\r) can be added or removed depending on your system's encoding for end-of-line in text files. Since you're missing characters, it appears that your system is removing one of those two characters.
You need to write your data to a file that you open in binary mode, and you should read in your file in binary.
Updated Answer
I am not the world's best at C++, but this works and will give you a reasonable starting point.
parent.cpp
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <iostream>
#include <vector>
#include "opencv2/opencv.hpp"
int main(int argc, char *argv[])
{
// File descriptor to the child process
FILE *fp;
// Launch the child process with popen
if ((fp = popen("./child", "r")) == NULL)
{
return 1;
}
// Read the number of bytes of encoded image data
std::size_t filesize;
fread(&filesize, sizeof(filesize), 1, fp);
std::cout << "Filesize: " << filesize << std::endl;
// Allocate memory to store encoded image data that will be received
std::vector<uint8_t> buffer(filesize);
int bufferoffset = 0;
int bytesremaining = filesize;
while(bytesremaining>0)
{
std::cout << "Attempting to read: " << bytesremaining << std::endl;
int bytesread = fread(&buffer[bufferoffset],1,bytesremaining,fp);
if (bytesread <= 0) break; // end of file or error: stop rather than loop forever
bufferoffset += bytesread;
bytesremaining -= bytesread;
std::cout << "Bytesread/remaining: " << bytesread << "/" << bytesremaining << std::endl;
}
pclose(fp);
// Display that image
cv::Mat frame;
frame = cv::imdecode(buffer, -CV_LOAD_IMAGE_ANYDEPTH);
cv::imshow("win", frame);
cv::waitKey(0);
}
child.cpp
#include <cstdio>
#include <cstdint>
#include <vector>
#include <fstream>
#include <cassert>
#include <iostream>
int main()
{
std::FILE* fp = std::fopen("image.png", "rb");
assert(fp);
// Seek to end to get filesize
std::fseek(fp, 0, SEEK_END);
std::size_t filesize = std::ftell(fp);
// Rewind to beginning, allocate buffer and slurp entire file
std::fseek(fp, 0, SEEK_SET);
std::vector<uint8_t> buffer(filesize);
std::fread(buffer.data(), sizeof(uint8_t), buffer.size(), fp);
std::fclose(fp);
// Write filesize to stdout, followed by PNG image
std::cout.write((const char*)&filesize,sizeof(filesize));
std::cout.write((const char*)buffer.data(),filesize);
}
Original Answer
There are a couple of issues:
Your while loop writing the data from the child process is incorrect:
while (written<buf.size())
{
//send the current block of data
written += write(STDOUT_FILENO, buf.data()+written, s);
i++;
}
Imagine your image is 4097 bytes. You will write 4096 bytes the first time through the loop and then try and write 4096 (i.e. s) bytes on the second pass when there's only 1 byte left in your buffer.
You should write whichever is the lesser of 4096 and bytes remaining in buffer.
There's no point sending the width and height of the file, they are already encoded in the PNG file you are sending.
There's no point calling imread() in the child to convert the PNG file from disk into a cv::Mat and then calling imencode() to convert it back into a PNG to send to the parent. Just open() and read the file as binary and send that - it is already a PNG file.
I think you need to be clear in your mind whether you are sending a PNG file or pure pixel data. A PNG file will have:
PNG header,
image width and height,
date of creation,
color type, bit-depth
compressed, checksummed pixel data
A pixel-data only file will have:
RGB, RGB, RGB, RGB
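For the first issue above (write the lesser of 4096 and the bytes remaining), a sketch of the child's sending loop could be written as below. The name write_all and its chunk parameter are my own. It also covers the case where write() accepts fewer bytes than requested.

```cpp
#include <algorithm>
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Writes all `size` bytes of `data` to fd, at most `chunk` bytes per call.
// Returns true on success, false if write() fails for a reason other than
// being interrupted by a signal.
bool write_all(int fd, const unsigned char* data, std::size_t size,
               std::size_t chunk = 4096) {
    std::size_t written = 0;
    while (written < size) {
        // Never ask for more than what is left in the buffer.
        const std::size_t to_write = std::min(chunk, size - written);
        const ssize_t n = write(fd, data + written, to_write);
        if (n < 0) {
            if (errno == EINTR) continue;   // interrupted: retry
            return false;
        }
        written += static_cast<std::size_t>(n);
    }
    return true;
}
```

In the child this would be called as write_all(STDOUT_FILENO, buf.data(), buf.size()).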

Binary Files in C++, changing the content of raw data on an audio file

I have never worked with binary files before. I opened an .mp3 file using the mode ios::binary, read data from it, assigned 0 to each byte read and then rewrote them to another file opened in ios::binary mode. I opened the output file on a media player, it sounds corrupted but I can still hear the song. I want to know what happened physically.
How can I access/modify the raw data ( bytes ) of an audio ( video, images, ... ) using C++ ( to practice file encryption/decryption later )?
Here is my code:
#include <iostream>
#include <fstream>
#include <cstring>
using namespace std;
int main(){
char buffer[256];
ifstream inFile;
inFile.open("Backstreet Boys - Incomplete.mp3",ios::binary);
ofstream outFile;
outFile.open("Output.mp3",ios::binary);
while(!inFile.eof()){
inFile.read(buffer,256);
for(int i = 0; i<strlen(buffer); i++){
buffer[i] = 0;
}
outFile.write(buffer,256);
}
inFile.close();
outFile.close();
}
What you did has nothing to do with binary files or audio. You simply copied the file while zeroing some of the bytes. (The reason you didn't zero all of the bytes is because you use i<strlen(buffer), which simply counts up to the first zero byte rather than reporting the size of the buffer. Also you modify the buffer which means strlen(buffer) will report the length as zero after you zero the first byte.)
So the exact change in audio you get is entirely dependent on the mp3 file format and the audio compression it uses. MP3 is not an audio format that can be directly manipulated in useful ways.
If you want to manipulate digital audio, you need to learn about how raw audio is represented by computers.
It's actually not too difficult. For example, here's a program that writes out a raw audio file containing just a 400Hz tone.
#include <cmath>
#include <fstream>
#include <limits>
int main() {
const double pi = 3.1415926535;
double tone_frequency = 400.0;
int samples_per_second = 44100;
double output_duration_seconds = 5.0;
int output_sample_count =
static_cast<int>(output_duration_seconds * samples_per_second);
std::ofstream out("signed-16-bit_mono-channel_44.1kHz-sample-rate.raw",
std::ios::binary);
for (int sample_i = 0; sample_i < output_sample_count; ++sample_i) {
double t = sample_i / static_cast<double>(samples_per_second);
double sound_amplitude = std::sin(t * 2 * pi * tone_frequency);
// encode amplitude as a 16-bit, signed integral value
short sample_value =
static_cast<short>(sound_amplitude * std::numeric_limits<short>::max());
out.write(reinterpret_cast<char const *>(&sample_value),
sizeof sample_value);
}
}
To play the sound you need a program that can handle raw audio, such as Audacity. After running the program to generate the audio file, you can File > Import > Raw data..., to import the data for playing.
How can I access/modify the raw data ( bytes ) of an audio ( video, images, ... ) using C++ ( to practice file encryption/decryption later )?
As pointed out earlier, the reason your existing code is not completely zeroing out the data is because you are using an incorrect buffer size: strlen(buffer). The correct size is the number of bytes read() put into the buffer, which you can get with the function gcount():
inFile.read(buffer,256);
int buffer_size = inFile.gcount();
for(int i = 0; i < buffer_size; i++){
buffer[i] = 0;
}
outFile.write(buffer, buffer_size);
Note: if you were to step through your program using a debugger you probably would have pretty quickly seen the problem yourself when you noticed the inner loop executing less than you expected. Debuggers are a really handy tool to learn how to use.
I notice you're using open() and close() methods here. This is sort of pointless in this program. Just open the file in the constructor, and allow the file to be automatically closed when inFile and outFile go out of scope:
{
ifstream inFile("Backstreet Boys - Incomplete.mp3",ios::binary);
ofstream outFile("Output.mp3",ios::binary);
// don't bother calling .close(), it happens automatically.
}
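Combining both points, the whole read/modify/write loop can be sketched as one small function. The name copy_transformed is my own and the transform is whatever per-byte operation you want to practice with. It works on any stream, so for files you would pass an ifstream and ofstream opened with ios::binary.

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Copies `in` to `out` in 256-byte chunks, applying `transform` to every byte
// that was actually read. Returns the total number of bytes copied.
template <typename Transform>
std::size_t copy_transformed(std::istream& in, std::ostream& out,
                             Transform transform) {
    char buffer[256];
    std::size_t total = 0;
    while (true) {
        in.read(buffer, sizeof buffer);
        const std::streamsize got = in.gcount();  // bytes really in the buffer
        if (got <= 0) break;
        for (std::streamsize i = 0; i < got; ++i) {
            buffer[i] = transform(buffer[i]);
        }
        out.write(buffer, got);                   // write only what was read
        total += static_cast<std::size_t>(got);
    }
    return total;
}
```

For example, copy_transformed(inFile, outFile, [](char) { return '\0'; }) zeroes every byte, and the output file ends up exactly as long as the input.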

Append to gzipped Tar-Archive

I've written a program, generating a tarball, which gets compressed by zlib.
At regular intervals, the same program is supposed to add a new file to the tarball.
By definition, a tarball needs empty records (512-byte blocks) at its end to work properly, which already shows my problem.
According to documentation gzopen is unable to open the file in r+ mode, meaning I can't simply jump to the beginning of the empty records, append my file information and seal it again with empty records.
Right now, I'm at my wits end. Appending works fine with zlib, as long as the empty records are not involved, yet I need them to 'finalize' my compressed tarball.
Any ideas?
Ah yes, it would be nice if I could avoid decompressing the whole thing and/or parsing the entire tarball.
I'm also open for other (preferably simple) file formats I could implement instead of tar.
This is two separate problems, both of which are solvable.
The first is how to append to a tar file. All you need to do there is overwrite the final two zeroed 512-byte blocks with your file. You would write the 512-byte tar header, your file rounded up to an integer number of 512-byte blocks, and then two 512-byte blocks filled with zeros to mark the new end of the tar file.
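The block arithmetic for that step is simple enough to sketch. The helper names below are my own, not part of tar or zlib.

```cpp
#include <cstdint>

constexpr std::uint64_t kTarBlock = 512;

// Size of a file's data area inside a tar archive: the file length rounded
// up to a whole number of 512-byte blocks.
constexpr std::uint64_t tar_padded_size(std::uint64_t file_size) {
    return (file_size + kTarBlock - 1) / kTarBlock * kTarBlock;
}

// Bytes written when appending one file over the old end marker:
// one header block, the padded data, and two zeroed end-of-archive blocks.
constexpr std::uint64_t tar_append_size(std::uint64_t file_size) {
    return kTarBlock + tar_padded_size(file_size) + 2 * kTarBlock;
}
```

For a 1000-byte file that is 512 + 1024 + 1024 = 2560 bytes, written starting at the offset of the old end marker.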
The second is how to frequently append to a gzip file. The simplest approach is to write separate gzip streams and concatenate them. Write the last two 512-byte zeroed blocks in a separate gzip stream, and remember where that starts. Then overwrite that with a new gzip stream with the new tar entry, and then another gzip stream with the two end blocks. This can be done by seeking back in the file with lseek() and then using gzdopen() to start writing from there.
That will work well, with good compression, for added files that are large (at a minimum several 10's of K). If however you are adding very small files, simply concatenating small gzip streams will result in lousy compression, or worse, expansion. You can do something more complicated to actually add small amounts of data to a single gzip stream so that the compression algorithm can make use of the preceding data for correlation and string matching. For that, take a look at the approach in gzlog.h and gzlog.c in examples/ in the zlib distribution.
Here is an example of how to do the simple approach:
/* tapp.c -- Example of how to append to a tar.gz file with concatenated gzip
streams. Placed in the public domain by Mark Adler, 16 Jan 2013. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <unistd.h>
#include <fcntl.h>
#include "zlib.h"
#define local static
/* Build an allocated string with the prefix string and the NULL-terminated
sequence of words strings separated by spaces. The caller should free the
returned string when done with it. */
local char *build_cmd(char *prefix, char **words)
{
size_t len;
char **scan;
char *str, *next;
len = strlen(prefix) + 1;
for (scan = words; *scan != NULL; scan++)
len += strlen(*scan) + 1;
str = malloc(len); assert(str != NULL);
next = stpcpy(str, prefix);
for (scan = words; *scan != NULL; scan++) {
*next++ = ' ';
next = stpcpy(next, *scan);
}
return str;
}
/* Usage:
tapp archive.tar.gz addthis.file andthisfile.too
tapp will create a new archive.tar.gz file if it doesn't exist, or it will
append the files to the existing archive.tar.gz. tapp must have been used
to create the archive in the first place. If it did not, then tapp will
exit with an error and leave the file unchanged. Each use of tapp appends a
new gzip stream whose compression cannot benefit from the files already in
the archive. As a result, tapp should not be used to append a small amount
of data at a time, else the compression will be particularly poor. Since
this is just an instructive example, the error checking is done mostly with
asserts.
*/
int main(int argc, char **argv)
{
int tgz;
off_t offset;
char *cmd;
FILE *pipe;
gzFile gz;
int page;
size_t got;
int ret;
ssize_t raw;
unsigned char buf[3][512];
const unsigned char z1k[] = /* gzip stream of 1024 zeros */
{0x1f, 0x8b, 8, 0, 0, 0, 0, 0, 2, 3, 0x63, 0x60, 0x18, 5, 0xa3, 0x60,
0x14, 0x8c, 0x54, 0, 0, 0x2e, 0xaf, 0xb5, 0xef, 0, 4, 0, 0};
if (argc < 2)
return 0;
tgz = open(argv[1], O_RDWR | O_CREAT, 0644); assert(tgz != -1);
offset = lseek(tgz, 0, SEEK_END); assert(offset == 0 || offset >= (off_t)sizeof(z1k));
if (offset) {
if (argc == 2) {
close(tgz);
return 0;
}
offset = lseek(tgz, -sizeof(z1k), SEEK_END); assert(offset != -1);
raw = read(tgz, buf, sizeof(z1k)); assert(raw == sizeof(z1k));
if (memcmp(buf, z1k, sizeof(z1k)) != 0) {
close(tgz);
fprintf(stderr, "tapp abort: %s was not created by tapp\n", argv[1]);
return 1;
}
offset = lseek(tgz, -sizeof(z1k), SEEK_END); assert(offset != -1);
}
if (argc > 2) {
gz = gzdopen(tgz, "wb"); assert(gz != NULL);
cmd = build_cmd("tar cf - -b 1", argv + 2);
pipe = popen(cmd, "r"); assert(pipe != NULL);
free(cmd);
got = fread(buf, 1, 1024, pipe); assert(got == 1024);
page = 2;
while ((got = fread(buf[page], 1, 512, pipe)) == 512) {
if (++page == 3)
page = 0;
ret = gzwrite(gz, buf[page], 512); assert(ret == 512);
} assert(got == 0);
ret = pclose(pipe); assert(ret != -1);
ret = gzclose(gz); assert(ret == Z_OK);
tgz = open(argv[1], O_WRONLY | O_APPEND); assert(tgz != -1);
}
raw = write(tgz, z1k, sizeof(z1k)); assert(raw == sizeof(z1k));
close(tgz);
return 0;
}
In my opinion this is not possible with TAR conforming to standard strictly. I have read through zlib[1] manual and GNU tar[2] file specification. I did not find any information how appending to TAR can be implemented. So I am assuming it has to be done by over-writing the empty blocks.
So I assume, again, you can do it by using gzseek(). However, you would need to know how large is the uncompressed archive (size) and set offset to size-2*512.
Note that this might be cumbersome, since "The whence parameter is defined as in lseek(2); the value SEEK_END is not supported."[1] and you can't open the file for reading and writing at the same time, i.e. to inspect where the end blocks are.
However, it should be possible abusing TAR specs slightly. The GNU tar[2] docs mention something funny:
"
Each file archived is represented by a header block which describes the file, followed by zero or more blocks which give the contents of the file. At the end of the archive file there are two 512-byte blocks filled with binary zeros as an end-of-file marker. A reasonable system should write such end-of-file marker at the end of an archive, but must not assume that such a block exists when reading an archive. In particular GNU tar always issues a warning if it does not encounter it.
"
This means, you can deliberately not write those blocks. This is easy if you wrote the tarball compressor. Then you can use zlib in the normal append mode, remembering that the TAR decompressor must be aware of the "broken" TAR file.
[1]http://www.zlib.net/manual.html#Gzip
[2]http://www.gnu.org/software/tar/manual/html_node/Standard.html#SEC182

Need Convert Binary file to Txt file

I have a .dat (binary) file and I wish to convert it into an ASCII (.txt) file using C++, but I am very new to C++ programming. So far I have just opened my two files, myBinaryfile and myTxtFile, but I don't know how to read data from the .dat file and then write that data into the new .txt file. I want to write C++ code that takes a binary .dat file as input and converts it to ASCII text in an output file. If this is possible, please help me write this code. Thanks.
Sorry for asking the same question again, but I still haven't solved my problem, so I will explain it more clearly: I have a text file called “A.txt”, and I want to convert it into a binary file (B.dat) and also do the reverse. Two questions:
1. how to convert “A.txt” into “B.dat” in c++
2. how to convert “B.dat” into “C.txt” in c++ (need convert result of the 1st output again into new ascii file)
my text file is like (no header):
1st line: 1234.123 543.213 67543.210 1234.67 12.000
2nd line: 4234.423 843.200 60543.232 5634.60 72.012
It has more than 1000 lines in a similar style (5 columns per line).
Since I don't have experience in C++, I am struggling here and need your help. Many thanks.
All files are just a stream of bytes. You can open files in binary mode or text mode. The latter simply means that it may have extra newline handling.
If you want your text file to contain only safe human readable characters you could do something like base64 encode your binary data before saving it in the text file.
Very easy:
1. Create target or destination file (a.k.a. open).
2. Open source file in binary mode, which prevents OS from translating the content.
3. Read an octet (byte) from source file; unsigned char is a good variable type for this.
4. Write the octet to the destination using your favorite conversion, hex, decimal, etc.
5. Repeat at 3 until the read fails.
6. Close all files.
Research these keywords: ifstream, ofstream, hex modifier, dec modifier, istream::read, ostream::write.
There are utilities and applications that already perform this operation. On the *nix and Cygwin side, try od (octal dump) and redirect the output to a file.
There is the debug utility on MS-DOS system.
A popular format is:
AAAAAA bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb cccccccccccccccc
where:
AAAAAA -- Offset from beginning of file in hexadecimal or decimal.
bb -- Hex value of byte using ASCII text.
c -- Character representation of byte, '.' if the value is not printable.
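As a rough illustration of that layout, here is a sketch that formats one line of up to 16 bytes. The function name format_dump_line is my own, and the 6-digit uppercase hex offset is just one of the variants described above.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <string>

// Formats up to 16 bytes as one dump line: a 6-digit hex offset, the bytes
// in hex (padded to 16 columns), then the printable characters.
std::string format_dump_line(std::size_t offset, const unsigned char* data,
                             std::size_t count) {
    char field[16];
    std::snprintf(field, sizeof field, "%06zX", offset);
    std::string line = field;
    std::string text;
    for (std::size_t i = 0; i < 16; ++i) {
        if (i < count) {
            std::snprintf(field, sizeof field, " %02X",
                          static_cast<unsigned>(data[i]));
            line += field;
            // '.' stands in for bytes that are not printable.
            text += std::isprint(data[i]) ? static_cast<char>(data[i]) : '.';
        } else {
            line += "   ";  // keep the character column aligned
        }
    }
    return line + " " + text;
}
```

Calling this once per 16-byte chunk of the input, with the offset advanced by 16 each time, gives a dump in the format shown.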
Please edit your post to provide more details, including an example layout for the target file.
Edit:
A complex example (not tested):
#include <iostream>
#include <fstream>
#include <cstdio>
#include <cstdlib>
using namespace std;
const unsigned int READ_BUFFER_SIZE = 1024 * 1024;
const unsigned int WRITE_BUFFER_SIZE = 2 * READ_BUFFER_SIZE;
// std::istream::read and std::ostream::write take char*, so use char buffers.
char read_buffer[READ_BUFFER_SIZE];
char write_buffer[WRITE_BUFFER_SIZE];
int main(void)
{
int program_status = EXIT_FAILURE;
static const char hex_chars[] = "0123456789ABCDEF";
do
{
ifstream srce_file("binary.dat", ios::binary);
if (!srce_file)
{
cerr << "Error opening input file." << endl;
break;
}
ofstream dest_file("binary.txt");
if (!dest_file)
{
cerr << "Error creating output file." << endl;
break;
}
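// Keep going as long as a read delivers any bytes. read() sets failbit on a
// final short chunk, so also check gcount() or that chunk would be dropped.
while (srce_file.read(&read_buffer[0], READ_BUFFER_SIZE) || srce_file.gcount() > 0)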
{
// Get the number of bytes actually read.
const unsigned int bytes_read = srce_file.gcount();
// Define the index and byte variables outside
// of the loop to maybe save some execution time.
unsigned int i = 0;
unsigned char byte = 0;
// For each byte that was read:
for (i = 0; i < bytes_read; ++i)
{
// Get source, binary value.
byte = read_buffer[i];
// Convert the Most Significant nibble to an
// ASCII character using a lookup table.
// Write the character into the output buffer.
write_buffer[i * 2 + 0] = hex_chars[(byte >> 4)];
// Convert the Least Significant nibble to an
// ASCII character and put into output buffer.
write_buffer[i * 2 + 1] = hex_chars[byte & 0x0f];
}
// Write the output buffer to the output, text, file.
dest_file.write(&write_buffer[0], 2 * bytes_read);
// Flush the contents of the stream buffer as a precaution.
dest_file.flush();
}
dest_file.flush();
dest_file.close();
srce_file.close();
program_status = EXIT_SUCCESS;
} while (false);
return program_status;
}
The above program reads 1MB chunks from the binary file, converts to ASCII hex into an output buffer, then writes the chunk to the text file.
I think you are misunderstanding something: the difference between a binary file and a text file is in the interpretation of the contents.
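For the A.txt to B.dat to C.txt round trip in the question, one possible interpretation is that the binary file holds each number as a raw double. The sketch below assumes exactly that; the function names, the 5-column layout and the 3 decimal places are assumptions taken from the sample lines. Such a binary file is only readable on machines with the same double format and byte order.

```cpp
#include <cstddef>
#include <iomanip>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Reads whitespace-separated numbers from text and writes each one as a raw
// double. Returns how many values were converted.
std::size_t text_to_binary(std::istream& text_in, std::ostream& bin_out) {
    std::size_t count = 0;
    double value;
    while (text_in >> value) {
        bin_out.write(reinterpret_cast<const char*>(&value), sizeof value);
        ++count;
    }
    return count;
}

// Reads raw doubles and writes them back as text, `columns` values per line.
std::size_t binary_to_text(std::istream& bin_in, std::ostream& text_out,
                           std::size_t columns = 5) {
    std::size_t count = 0;
    double value;
    text_out << std::fixed << std::setprecision(3);
    while (bin_in.read(reinterpret_cast<char*>(&value), sizeof value)) {
        ++count;
        text_out << value << (count % columns == 0 ? '\n' : ' ');
    }
    return count;
}
```

With files, the text side would be a plain ifstream/ofstream and the binary side would be opened with ios::binary.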