Seek in libarchive, how to reset header? - c++

Is it possible to read a decompressed file a second time?
Let's imagine I called archive_read_next_header(a, &entry),
and then read an unknown number of bytes using archive_read_data(a, ptr_to_buffer, buffer_size). Now I want to reset it and start reading again from the beginning. I am trying to override seekoff(std::streamoff off, std::ios_base::seekdir way, std::ios_base::openmode which). I understand that it might be impossible to simply seek inside decompressed data because of how compression algorithms work, and the data is not stored anywhere except for a limited number of bytes in libarchive's internal buffer.
The idea is to reset everything and then read std::streamoff off bytes; that way I could implement a backward seek. A forward seek would be easy: just read std::streamoff off bytes. It's really inefficient, but let's hope seek won't be used much.
The archive structure was initialized like this:
archive_read_set_read_callback(a, read_callback);
archive_read_set_callback_data(a, container);
archive_read_set_seek_callback(a, seek_callback);
archive_read_set_skip_callback(a, skip_callback);
int r = (archive_read_open1(a));
where container holds, most importantly, a std::istream, and the callbacks are functions that manipulate that stream.
Template of what I would like to achieve:
`
std::streampos seek_beg(std::streamoff off) {
    if (off >= 0) {
        // read/skip 'off' bytes
    } else {
        // reset (a)
        // read/skip 'off' bytes
    }
    // return position
}
`
My underflow() method is implemented like this:
`
int underflow() {
    // archive_read_data returns la_ssize_t: bytes read, 0 at end of entry, negative on error
    la_ssize_t r = archive_read_data(ar, ptr, BUFFER_SIZE);
    if (r < 0) {
        throw std::runtime_error("ERROR");
    } else if (r == 0) {
        return std::streambuf::traits_type::eof();
    } else {
        setg(ptr, ptr, ptr + r);
    }
    return std::streambuf::traits_type::to_int_type(*ptr);
}
`

The libarchive documentation, more precisely the wishlist in the libarchive wiki on GitHub, says:
A few people have asked for the ability to efficiently "re-read"
particular archive entries. This is a tricky subject. For many
formats, the performance gains from this would be very modest. For
example, with a little performance work, the seeking Zip reader could
support very fast re-reading from the beginning since it only involves
re-parsing the central directory. The cases where there would be real
gains (e.g., tar.gz) are going to be very difficult to handle. The
most likely implementation would be some form of checkpointing so that
clients can explicitly ask for a checkpoint object and then restore
back to that checkpoint. The checkpoint object could be complex if you
have a series of stacked read filters plus state in the format handler
itself.
As far as I can see, seeking in archives with libarchive is not currently possible, so my solution was to remember all the data I read whenever I suspect that I might want to re-read it, or, alternatively, to push it back into the stream.
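The "remember what was read" idea can be sketched without touching libarchive at all. The class below is a hypothetical helper (ReplayReader and its Source callback are names made up for illustration, not libarchive API): it wraps any forward-only byte source, caches everything that has been produced, and serves backward seeks from that cache. In real code the callback would presumably call archive_read_data(); error handling is left out, and memory use grows with the amount of data read.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper (not part of libarchive): wraps a forward-only byte
// source and remembers everything it has produced so far.
class ReplayReader {
public:
    // source(buf, n) writes up to n bytes into buf and returns the count,
    // or 0 at end of data. In real code this could call archive_read_data().
    using Source = std::function<std::size_t(char*, std::size_t)>;

    explicit ReplayReader(Source source) : source_(std::move(source)) {
        assert(source_ && "a byte source is required");
    }

    // Read up to n bytes starting at the current position.
    std::size_t read(char* out, std::size_t n) {
        fill_to(pos_ + n);
        const std::size_t take = std::min(n, cache_.size() - pos_);
        std::copy_n(cache_.begin() + static_cast<std::ptrdiff_t>(pos_), take, out);
        pos_ += take;
        return take;
    }

    // Seek to an absolute offset in the decompressed data. Backward seeks are
    // served from the cache; forward seeks pull more data from the source.
    // Returns the resulting position (clamped to the end of the data).
    std::size_t seek(std::size_t target) {
        fill_to(target);
        pos_ = std::min(target, cache_.size());
        return pos_;
    }

private:
    // Pull from the source until the cache holds at least `wanted` bytes
    // or the source is exhausted.
    void fill_to(std::size_t wanted) {
        char chunk[4096];
        while (!eof_ && cache_.size() < wanted) {
            const std::size_t got = source_(chunk, sizeof chunk);
            if (got == 0) { eof_ = true; break; }
            cache_.insert(cache_.end(), chunk, chunk + got);
        }
    }

    Source source_;
    std::vector<char> cache_;
    std::size_t pos_ = 0;
    bool eof_ = false;
};
```

A seekoff() override could then map std::ios_base::beg and std::ios_base::cur onto seek(). The trade-off is the one the question already accepts: seeking stays cheap, but the whole entry may end up held in memory.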

Related

How to copy every N-th byte(s) of a C array

I am writing a bit of code in C++ where I want to play a .wav file and perform an FFT (with fftw) on it as it comes (and eventually display that FFT on screen with ncurses). This is mainly just a "for giggles/to see if I can" project, so I have no restrictions on what I can or can't use, aside from wanting to keep the result fairly lightweight and cross-platform (I'm doing this on Linux for the moment). I'm also trying to do this "right" and not just hack it together.
I'm using SDL2_audio to achieve the playback, which is working fine. The callback is called at some interval requesting N bytes (seems to be desiredSamples*nChannels). My idea is that at the same time I'm copying the memory from my input buffer to SDL I might as well also copy it in to fftw3's input array to run an FFT on it. Then I can just set ncurses to refresh at whatever rate I'd like separate from the audio callback frequency and it'll just pull the most recent data from the output array.
The catch is that the input file interleaves the channels, i.e. "(LR) (LR) (LR) ...". So while SDL expects this, I need a way to get just one channel to send to FFTW.
The audio callback format from SDL looks like so:
void myAudioCallback(void* userdata, Uint8* stream, int len) {
    // Clear len bytes; sizeof(stream) would only be the size of the pointer
    SDL_memset(stream, 0, len);
    SDL_memcpy(stream, audio_pos, len);
    audio_pos += len;
}
where userdata is (currently) unused, stream is the array that SDL wants filled, and len is the length of stream (I.E the number of bytes SDL is looking for).
As far as I know there's no way to get memcpy to just copy every other sample (read: Copy N bytes, skip M, copy N, etc). My current best idea is a brute-force for loop a la...
// pseudocode
for (int i=0; i<len/2; i++) {
    fftw_in[i] = audio_pos + 2*i*sizeof(sample)
}
or even more brute force by just reading the file a second time and only taking every other byte or something.
Is there another way to go about accomplishing this, or is one of these my best option? It feels kind of kludgey to go from a nice one line memcpy to send to the data to SDL to some sort of weird loop to send it to fftw.
The OP's solution can be simplified (for copying bytes):
// pseudocode
const char* s = audio_pos;
for (int d = 0; s < audio_pos + len; d++, s += 2*sizeof(sample)) {
    fftw_in[d] = *s;
}
If I knew what fftw_in is, I would memcpy blocks of sizeof(*fftw_in).
Please check the assembly generated by @S.M.'s solution.
If the code is not vectorized, I would use intrinsics (depending on your hardware support) like _mm_mask_blend_epi8
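For reference, here is a minimal sketch of the plain-loop approach with concrete types filled in. It assumes the audio is signed 16-bit PCM in native byte order and that the FFT input is an array of double; neither is stated in the question, and extract_channel is a made-up name. The loop is simple enough that a compiler may vectorize it, but that is worth confirming in the generated assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copy one channel out of interleaved PCM into a contiguous buffer of doubles.
// Assumptions: samples are signed 16-bit in native byte order, and `len` is
// the byte count handed to the audio callback.
std::vector<double> extract_channel(const std::uint8_t* bytes, std::size_t len,
                                    std::size_t channels, std::size_t channel) {
    assert(channels > 0 && channel < channels);
    const std::size_t frame_bytes = channels * sizeof(std::int16_t);
    const std::size_t frames = len / frame_bytes;  // a partial trailing frame is ignored
    std::vector<double> out(frames);
    for (std::size_t i = 0; i < frames; ++i) {
        std::int16_t sample;
        // memcpy avoids unaligned or type-punned reads from the byte buffer
        std::memcpy(&sample,
                    bytes + i * frame_bytes + channel * sizeof(std::int16_t),
                    sizeof sample);
        out[i] = static_cast<double>(sample);
    }
    return out;
}
```

Inside the callback this would be called as extract_channel(audio_pos, len, 2, 0) for the left channel of a stereo stream. Allocating a vector on every callback is not ideal for real-time audio, so a preallocated output buffer would be the next refinement.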

Capnp: Move to previous position in BufferedInputStreamWrapper

I have a binary file with multiple Capnp messages which I want to read. Reading sequentially works well, but I have a use case where I want to jump to a previously known position.
The data consists of sequential images with metadata, including their timestamps. I would like to be able to jump back and forth (like in a video player).
This is what I have tried:
int fd = open(filePath.c_str(), O_RDONLY);
kj::FdInputStream fdStream(fd);
kj::BufferedInputStreamWrapper bufferedStream(fdStream);
for (;;) {
    kj::ArrayPtr<const kj::byte> framePtr = bufferedStream.tryGetReadBuffer();
    if (framePtr != nullptr) {
        capnp::PackedMessageReader message(bufferedStream);
        // This should reset the buffer to the last read message?
        bufferedStream.read((void*)framePtr.begin(), framePtr.size());
        // ...
    }
    else {
        // reset to beginning
    }
}
But I get this error:
capnp/serialize.c++:186: failed: expected segmentCount < 512; Message has too many segments
I was assuming that tryGetReadBuffer() returns the position and size of the next packed message. But then again, how is the BufferedInputStream supposed to know what "a message" is.
Question: How can I get position and size of messages and read these messages later on from the BufferedInputStreamWrapper?
Alternative: Reading the whole file once, take ownership of the data and save it to a vector. Such as described here (https://groups.google.com/forum/#!topic/capnproto/Kg_Su1NnPOY). Better solution all along?
BufferedInputStream is not seekable. In order to seek backwards, you will need to destroy bufferedStream and then seek the underlying file descriptor, e.g. with lseek(), then create a new buffered stream.
Note that reading the current position (in order to pass to lseek() later to go back) is also tricky if a buffered stream is present, since the buffered stream will have read past the position in order to fill the buffer. You could calculate it by subtracting off the buffer size, e.g.:
// Determine current file position, so that we can seek to it later.
off_t messageStartPos = lseek(fd, 0, SEEK_CUR) -
    bufferedStream.tryGetReadBuffer().size();
// Read a message
{
    capnp::PackedMessageReader message(bufferedStream);
    // ... do stuff with `message` ...
    // Note that `message` is destroyed at this }. It's important that this
    // happens before querying the buffered stream again, because
    // PackedMessageReader updates the buffer position in its destructor.
}
// Determine the end position of the message (if you need it?).
off_t messageEndPos = lseek(fd, 0, SEEK_CUR) -
    bufferedStream.tryGetReadBuffer().size();
bufferedStream.read((void*)framePtr.begin(), framePtr.size());
FWIW, the effect of this line is "advance past the current buffer and on to the next one". You don't want to do this when using PackedMessageReader, as it will already have advanced the stream itself. In fact, because PackedMessageReader might have already advanced past the current buffer, framePtr may now be invalid, and this line might segfault.
Alternative: Reading the whole file once, take ownership of the data and save it to a vector. Such as described here (https://groups.google.com/forum/#!topic/capnproto/Kg_Su1NnPOY). Better solution all along?
If the file fits comfortably in RAM, then reading it upfront is usually fine, and probably a good idea if you expect to be seeking back and forth a lot.
Another option is to mmap() it. This makes it appear as if the file is in RAM, but the operating system will actually read in the contents on-demand when you access them.
However, I don't think this will actually simplify the code much. Now you'll be dealing with an ArrayInputStream (a subclass of BufferedInputStream). To "seek" you would create a new ArrayInputStream based on a slice of the buffer starting at the point where you want to start.
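The position bookkeeping described in this answer can be illustrated without Cap'n Proto. The sketch below uses a hypothetical BufferedFd class as a stand-in for a buffered stream over a file descriptor (it is not a kj type, and POSIX is assumed): the logical position is the descriptor's offset minus the bytes that were read ahead but not yet consumed, and seeking means dropping the buffer before calling lseek().

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical minimal buffered reader over a POSIX file descriptor.
// It stands in for a buffered stream wrapper purely for illustration.
class BufferedFd {
public:
    explicit BufferedFd(int fd) : fd_(fd) { assert(fd >= 0); }

    // Reads up to n bytes; returns the number read (less than n at EOF or on error).
    std::size_t read(char* out, std::size_t n) {
        std::size_t done = 0;
        while (done < n) {
            if (begin_ == end_) {  // buffer is empty: read ahead from the descriptor
                const ssize_t got = ::read(fd_, buf_, sizeof buf_);
                if (got <= 0) break;
                begin_ = 0;
                end_ = static_cast<std::size_t>(got);
            }
            out[done++] = buf_[begin_++];
        }
        return done;
    }

    // Logical position = descriptor offset minus bytes buffered but not yet consumed.
    off_t tell() const {
        return ::lseek(fd_, 0, SEEK_CUR) - static_cast<off_t>(end_ - begin_);
    }

    // Seeking: discard the read-ahead buffer, then move the descriptor.
    off_t seek(off_t pos) {
        begin_ = end_ = 0;
        return ::lseek(fd_, pos, SEEK_SET);
    }

private:
    int fd_;
    char buf_[4096];
    std::size_t begin_ = 0;
    std::size_t end_ = 0;
};
```

With the real kj classes the same two steps apply: record the position before each message as computed in the answer, and to go back, destroy the buffered wrapper, lseek() the descriptor, and construct a new wrapper.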

C++ Winsock Download File Cut off HTTP Header

I'm downloading the bytes of a file from the web using winsock2. So far so good.
The problem is that the downloaded bytes include the HTTP header, which I don't need and which corrupts the contents of my file.
Example:
I know I can find the position where the header is ending by finding "\r\n\r\n".
But somehow I can't find or at least cut it... :(
int iResponseBytes = 0;
ofstream ofDownloadedFile;
ofDownloadedFile.open(pathonclient, ios::binary);
do {
    iResponseBytes = recv(this->Socket, responseBuffer, pageBufferSize, 0);
    if (iResponseBytes > 0) // if bytes received
    {
        // write only the bytes actually received, not the whole buffer
        ofDownloadedFile.write(responseBuffer, iResponseBytes);
    }
    else if (iResponseBytes == 0) //Done
    {
        break;
    }
    else //fail
    {
        cout << "Error while downloading" << endl;
        break;
    }
} while (iResponseBytes > 0);
I tried searching the array / the pointer using strncmp etc.
Hopefully someone can help me.
Best greetings
You have no guarantees, whatsoever, that the \r\n\r\n sequence will be received completely within a single recv() call.
For example, the first recv() call could end up reading everything up to and including the first two characters of the sequence, \r\n. Your code then goes around the loop again, and the second recv() call receives the remaining \r\n, followed by the first part of the actual content. The chance of this happening is small, but it cannot be ignored and must be handled correctly.
If your goal is to trim everything up until the \r\n\r\n, your current approach is not going to work very well.
Instead, what you should do is invest some time studying how file stream buffering actually works. Ponder, for a moment, how std::istream/std::ostream read/write large chunks of data at a time, yet provide a character-oriented interface. std::istream, for example, reads a bufferful of file data at a time, placing it into an internal buffer, which your code can then retrieve one character at a time (if it wishes to). How does that work? Think about it.
To do this correctly, you need to implement the same algorithm yourself: recv() from the socket a buffer at a time, then provide a byte-oriented interface, to return the received contents one byte at a time.
Then, the main code becomes a simple loop, reading the streamed socket contents one byte at a time, at which point discarding everything up until the code sees \r\n\r\n becomes trivial (although there are still a few non-obvious gotchas in doing this right, but that can be a new question).
Of course, once the \r\n\r\n gets processed, it is certainly possible to optimize things going forward, by flushing out whatever's still buffered internally, to the output file, and then resume reading from the socket a whole buffer-at-a-time, and copying it to the output file without burning CPU cycles dealing with the byte-oriented interface.
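One possible shape for this, as a sketch rather than a complete HTTP client: a small state machine that is fed each recv() chunk, scans byte by byte until it has seen \r\n\r\n (even when the sequence is split across chunks), and from then on passes whole chunks through untouched. HeaderStripper and feed are hypothetical names; Content-Length, chunked transfer encoding, and other gotchas are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: feed it every chunk returned by recv(); it returns
// only the bytes that come after the end of the HTTP header.
class HeaderStripper {
public:
    std::string feed(const char* data, std::size_t n) {
        assert(data != nullptr || n == 0);
        static const char term[] = "\r\n\r\n";
        std::size_t i = 0;
        // Byte-oriented scan, used only until the header terminator is seen.
        while (!in_body_ && i < n) {
            const char c = data[i++];
            if (c == term[matched_]) {
                // matched_ counts how many terminator bytes were seen in a row;
                // it is a member, so it survives across chunk boundaries.
                if (++matched_ == 4) in_body_ = true;
            } else {
                // A stray '\r' may itself start a new terminator.
                matched_ = (c == '\r') ? 1 : 0;
            }
        }
        // Everything after the terminator (or the whole chunk once in the
        // body) is passed through in one piece.
        return std::string(data + i, n - i);
    }

    bool in_body() const { return in_body_; }

private:
    std::size_t matched_ = 0;
    bool in_body_ = false;
};
```

In the download loop this would look like `std::string body = stripper.feed(responseBuffer, iResponseBytes);` followed by writing body.data() and body.size() to the output file.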

Reading a Potentially incomplete File C++

I am writing a program to reformat a DNS log file for insertion to a database. There is a possibility that the line currently being written to in the log file is incomplete. If it is, I would like to discard it.
I started off believing that the eof function might be a good fit for my application, however I noticed a lot of programmers dissuading the use of the eof function. I have also noticed that the feof function seems to be quite similar.
Any suggestions/explanations that you guys could provide about the side effects of these functions would be most appreciated, as would any suggestions for more appropriate methods!
Edit: I currently am using the istream::peek function in order to skip over the last line, regardless of whether it is complete or not. While acceptable, a solution that determines whether the last line is complete would be preferred.
The specific comparison I'm using is: logFile.peek() != EOF
I would consider using
int fseek ( FILE * stream, long int offset, int origin );
with SEEK_END
and then
long int ftell ( FILE * stream );
to determine the number of bytes in the file, and therefore - where it ends. I have found this to be more reliable in detecting the end of the file (in bytes).
Could you detect an (End of Record/Line) EOR marker (CRLF perhaps) in the last two or three bytes of the file? (3 bytes might be used for CRLF^Z...depends on the file type). This would verify if you have a complete last row
fseek (stream, -2,SEEK_END);
fread (2 bytes... etc
If you try to open the file with exclusive locks, you can detect (by the failure to open) that the file is in use, and try again in a second...(or whenever)
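The end-of-record check suggested in this answer might look like the following sketch. It assumes records are terminated by '\n' (which also covers CRLF, since the final byte is still '\n'); a file with a trailing ^Z would need a different check. last_line_complete is a made-up name.

```cpp
#include <cassert>
#include <cstdio>

// Returns true if the file can be opened, is non-empty, and its final byte
// is '\n', i.e. the last line has been completely written.
bool last_line_complete(const char* path) {
    assert(path != nullptr);
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    bool complete = false;
    // Position on the last byte; this fails for an empty file.
    if (std::fseek(f, -1, SEEK_END) == 0) {
        const int c = std::fgetc(f);
        complete = (c == '\n');
    }
    std::fclose(f);
    return complete;
}
```

If this returns false, the reader can process every line except the last one and pick the remainder up on the next pass. The result is only a snapshot, because the writer may append more data right after the check.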
If you need to capture the file contents as the file is being written, it's much easier if you eliminate as many layers of indirection and buffering between your logic and the actual bytes of data in the file.
Do not use C++ IO streams of any type - you have no real control over them. Don't use FILE *-based functions such as fopen() and fread() - those are buffered, and even if you disable buffering there are layers of code between your code and the data that once again you can't control and don't know what's happening.
In a POSIX environment, you can use low-level C-style open() and read()/pread() calls. And use fstat() to know when the file contents have changed - you'll see the st_size member of the struct stat argument change after a call to fstat().
You'd open the file like this:
int logFileFD = open( "/some/file/name.log", O_RDONLY );
Inside a loop, you could do something like this (error checking and actual data processing omitted):
size_t lastSize = 0;
while ( !done )
{
    struct stat statBuf;
    fstat( logFileFD, &statBuf );
    if ( statBuf.st_size == lastSize )
    {
        sleep( 1 ); // or however long you want
        continue; // go to next loop iteration
    }
    // process new data - might need to keep some of the old data
    // around to handle lines that cross boundaries
    processNewContents( logFileFD, lastSize, statBuf.st_size );
    // remember how far we have processed so the same data isn't handled twice
    lastSize = statBuf.st_size;
}
processNewContents() could look something like this:
void processNewContents( int fd, size_t start, size_t end )
{
    static char oldData[ BUFSIZE ];
    static char newData[ BUFSIZE ];
    // assumes amount of data will fit in newData...
    // pread( fd, buf, count, offset ): read end - start bytes beginning at offset start
    ssize_t bytesRead = pread( fd, newData, end - start, start );
    // process the data that was read here
    return;
}
You may also find that you need to add some code to close() then re-open() the file in case your application doesn't seem to be "seeing" data written to the file. I've seen that happen on some systems - the application somehow sees a cached copy of the file size somewhere while an ls run in another context gets the more accurate, updated size. If, for example, you know your log file is written to every 10-15 seconds, if you go 30 seconds without seeing any change to the file you know to try reopening the file.
You can also track the inode number in the struct stat results to catch log file rotation.
In a non-POSIX environment, you can replace open(), fstat() and pread() calls with the low-level OS equivalent, although Windows provides most of what you'd need. On Windows, lseek() followed by read() would replace pread().

C++ read text line-by-line, speed/efficiency savings needed

I have a series of large text files (10s - 100s of thousands of lines) that I want to parse line-by-line. The idea is to check if the line has a specific word/character/phrase and to, for now, record to a secondary file if it does.
The code I've used so far is:
ifstream infile1("c:/test/test.txt");
while (getline(infile1, line)) {
    if (line.empty()) continue;
    if (line.find("mystring") != std::string::npos) {
        outfile1 << line << '\n';
    }
}
The end goal is to be writing those lines to a database. My thinking was to write them to the file first and then to import the file.
The problem I'm facing is the time taken to complete the task. I'm looking to minimize the time as far as possible, so any suggestions as to time savings on the read/write scenario above would be most welcome. Apologies if anything is obvious, I've only just started moving into C++.
Thanks
EDIT
I should say that I'm using VS2015
EDIT 2
So this was my own dumb fault, when switching to Release and changing the architecture type I had noticeable speed increases. Thanks to everyone for pointing me in that direction. I'm also looking at the mmap stuff and that's proving useful too. Thanks guys!
When you use ifstream to read and process really big files, it helps to increase the default buffer size (normally 512 bytes in my experience, although the actual default depends on the standard library implementation).
The best buffer size depends on your needs, but as a hint you can use the partition block size of the file(s) you're reading/writing. There are a lot of tools, or even code, that can tell you that value.
Example in Windows:
fsutil fsinfo ntfsinfo c:
Now, you have to create a new buffer to ifstream like this:
size_t newBufferSize = 4 * 1024; // 4K
char * newBuffer = new char[newBufferSize];
ifstream infile1;
infile1.rdbuf()->pubsetbuf(newBuffer, newBufferSize);
infile1.open("c:/test/test.txt");
while (getline(infile1, line)) {
    /* ... */
}
infile1.close();     // stop using the buffer before releasing it
delete[] newBuffer;  // array form, to match new char[]
Do the same with the output stream, and don't forget to set the new buffer before opening the file, or it may not take effect.
You can play with values to find the very best size for you.
You'll note the difference.
C-style I/O functions are often faster than fstream, although the difference depends on the implementation and build settings.
You may use fgets/fputs to read/write each text line.
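A minimal sketch of the fgets/fputs variant of the loop from the question, for comparison. The function name, the 4096-byte line buffer, and the return convention are assumptions made for illustration. Lines longer than the buffer are read in pieces, so a match straddling two pieces would be missed. Whether this actually beats the ifstream version is something to measure in a Release build.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Copies every line of in_path that contains needle to out_path.
// Returns the number of lines written, or -1 if a file could not be opened.
long filter_lines(const char* in_path, const char* out_path, const char* needle) {
    assert(in_path && out_path && needle);
    std::FILE* in = std::fopen(in_path, "r");
    if (!in) return -1;
    std::FILE* out = std::fopen(out_path, "w");
    if (!out) {
        std::fclose(in);
        return -1;
    }
    char line[4096];  // assumption: no line is longer than this
    long matches = 0;
    while (std::fgets(line, sizeof line, in)) {
        // fgets keeps the trailing '\n', so fputs writes the line unchanged
        if (std::strstr(line, needle)) {
            std::fputs(line, out);
            ++matches;
        }
    }
    std::fclose(out);
    std::fclose(in);
    return matches;
}
```

For the example in the question the call would be filter_lines("c:/test/test.txt", some_output_path, "mystring"), where some_output_path is whatever file the matches should go to.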