Problems parsing a Microsoft compound document - c++

I'm having a bit of a struggle wrestling with the compound document format.
I'm working in C at the moment but am having problems with locating the directory sector.
I can obtain the compound doc header, which is trivial, and I know the formula for finding the file offset of a sector ID ((secid + 1) << sec_size), but whenever I use this formula to convert the directory's SecID to a file offset I get random values.
Can someone help me understand how I resolve secid offsets properly and maybe also how to develop secid chains from the sector allocation table in a compound document?
Here is an example of what I've tried:
comp_doc_header* cdh((comp_doc_header*)buffer);
printf("cdoc header:%d\n", sizeof(cd_dir_entry));
if(cdh->rev_num == 0x003E)printf("rev match\n");
//check magic number
if(cdh->comp_doc_id[0] != (unsigned char)0xD0 ||
cdh->comp_doc_id[1] != (unsigned char)0xCF ||
cdh->comp_doc_id[2] != (unsigned char)0x11 ||
cdh->comp_doc_id[3] != (unsigned char)0xE0 ||
cdh->comp_doc_id[4] != (unsigned char)0xA1 ||
cdh->comp_doc_id[5] != (unsigned char)0xB1 ||
cdh->comp_doc_id[6] != (unsigned char)0x1A ||
cdh->comp_doc_id[7] != (unsigned char)0xE1)
return 0;
buffer += 512;
//here i try and get the first directory entry
cd_dir_entry* cde((cd_dir_entry*)&buffer[(cdh->first_sector_id + 1) << 512]);
EDIT: (secid + 1) * 512 should be (secid + 1) * sec_size

Is this C? I can't parse the first or last lines you posted:
cd_dir_entry* cde((cd_dir_entry*)&buffer[(cdh->first_sector_id + 1) << 512]);
It appears you're declaring cde as a function that returns a pointer to a cd_dir_entry, but the parameter list is all wrong. The other reading is that you're calling a function cde(), multiplying cd_dir_entry by its result, and promptly ignoring the product.
Edit
My simplification trying to understand the line
cd_dir_entry* cde(<cast>&buffer[(cdh->first_sector_id + 1) << 512]);
cd_dir_entry* cde(<cast>&buffer[<elem>]);
cd_dir_entry* cde(<parameter>);
/* this is either a function prototype */
/* or a multiplication with `cd_dir_entry` and the value returned from cde() */
/* in either case it does nothing (no side-effects present), */
/* unless cde messes with global variables */
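On the offset question itself: read as C++, the two lines are direct-initializations (cd_dir_entry* cde(expr); declares a pointer initialized with expr), so they compile with a C++ compiler but not as C. The arithmetic is where it goes wrong. << 512 shifts by 512 bits instead of by the sector-size exponent, and buffer += 512 followed by (secid + 1) counts the header twice. Below is a minimal sketch of the offset formula and of following a SecID chain. It assumes the sector allocation table has already been loaded into a vector; the helper names are hypothetical, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Special SecID that terminates a chain in the sector allocation table.
constexpr std::uint32_t kEndOfChain = 0xFFFFFFFEu;

// Read little-endian fields byte by byte instead of casting the buffer
// to a struct, which avoids padding and alignment surprises.
inline std::uint16_t read_le16(const unsigned char* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

inline std::uint32_t read_le32(const unsigned char* p) {
    return static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Offset of a sector measured from the start of the file (header included).
// sector_shift is the exponent stored in the header: 9 means 512-byte sectors.
inline std::size_t sector_offset(std::uint32_t secid, unsigned sector_shift) {
    assert(sector_shift < 32);  // a shift count, never a byte count like 512
    return (static_cast<std::size_t>(secid) + 1) << sector_shift;
}

// Follow a chain through an already-loaded allocation table: sat[id] holds
// the SecID of the sector that comes after sector id.
std::vector<std::uint32_t> sector_chain(const std::vector<std::uint32_t>& sat,
                                        std::uint32_t first) {
    std::vector<std::uint32_t> chain;
    std::uint32_t id = first;
    // Stop at the end marker, at any out-of-range id (corrupt file or a
    // special value), or once the chain is longer than the table (a cycle).
    while (id != kEndOfChain && id < sat.size() && chain.size() < sat.size()) {
        chain.push_back(id);
        id = sat[id];
    }
    return chain;
}
```

With file pointing at the first byte of the file (a hypothetical name), the sector shift would come from read_le16(file + 30) and the first directory SecID from read_le32(file + 48), assuming the standard header layout; the first directory entry then sits at file + sector_offset(dir_secid, shift), with no separate buffer += 512.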

Related

How to walk along UTF-16 codepoints?

I have the following definition of varying ranges which correspond to codepoints and surrogate pairs:
https://en.wikipedia.org/wiki/UTF-16#Description
My code is based on ConvertUTF.c from the Clang implementation.
I'm currently struggling with wrapping my head around how to do this.
The code which is most relevant from LLVM's implementation that I'm trying to understand is:
unsigned short bytesToWrite = 0;
const u32char_t byteMask = 0xBF;
const u32char_t byteMark = 0x80;
u8char_t* target = *targetStart;
utf_result result = kConversionOk;
const u16char_t* source = *sourceStart;
while (source < sourceEnd) {
    u32char_t ch;
    const u16char_t* oldSource = source; /* In case we have to back up because of target overflow. */
    ch = *source++;
    /* If we have a surrogate pair, convert to UTF32 first. */
    if (ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_HIGH_END) {
        /* If the 16 bits following the high surrogate are in the source buffer... */
        if (source < sourceEnd) {
            u32char_t ch2 = *source;
            /* If it's a low surrogate, convert to UTF32. */
            if (ch2 >= UNI_SUR_LOW_START && ch2 <= UNI_SUR_LOW_END) {
                ch = ((ch - UNI_SUR_HIGH_START) << halfShift)
                    + (ch2 - UNI_SUR_LOW_START) + halfBase;
                ++source;
            } else if (flags == kStrictConversion) { /* it's an unpaired high surrogate */
                --source; /* return to the illegal value itself */
                result = kSourceIllegal;
                break;
            }
        } else { /* We don't have the 16 bits following the high surrogate. */
            --source; /* return to the high surrogate */
            result = kSourceExhausted;
            break;
        }
    } else if (flags == kStrictConversion) {
        /* UTF-16 surrogate values are illegal in UTF-32 */
        if (ch >= UNI_SUR_LOW_START && ch <= UNI_SUR_LOW_END) {
            --source; /* return to the illegal value itself */
            result = kSourceIllegal;
            break;
        }
    }
    ...
Specifically they say in the comments:
If we have a surrogate pair, convert to UTF32 first.
and then:
If it's a low surrogate, convert to UTF32.
I get lost at "if we have..." and "if it's...": while reading the comments I keep asking "what do we have?" and "what is it?"
I believe ch and ch2 are the first char16 and the next char16 (if one exists); the code checks whether the second is part of a surrogate pair, and then walks along each char16 (or do you walk along pairs of chars?) until the end.
I'm also lost on how they use UNI_SUR_HIGH_START, UNI_SUR_HIGH_END, UNI_SUR_LOW_START, UNI_SUR_LOW_END, and on their use of halfShift and halfBase.
Wikipedia also notes:
There was an attempt to rename "high" and "low" surrogates to "leading" and "trailing" due to their numerical values not matching their names. This appears to have been abandoned in recent Unicode standards.
Making note of "leading" and "trailing" in any responses may help clarify things as well.
ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_HIGH_END checks if ch is in the range where high surrogates are, that is, [D800-DBFF]. That's it. Then the same is done for checking if ch2 is in the range where low surrogates are, meaning [DC00-DFFF].
halfShift and halfBase are just used as prescribed by the UTF-16 decoding algorithm, which turns a pair of surrogates into the scalar value they represent. There's nothing special being done here; it's the textbook implementation of that algorithm, without any tricks.
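To make the walk concrete, here is a minimal sketch of the same decoding step on its own. It steps through the input one 16-bit code unit at a time and consumes a second unit only when a high (leading) surrogate is followed by a low (trailing) one. The function name is hypothetical, the constants are written out as literals (10 and 0x10000 are, to my knowledge, the values behind halfShift and halfBase), and unlike the strict mode above it simply passes unpaired surrogates through.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode UTF-16 code units into Unicode scalar values (lenient: an unpaired
// surrogate is copied through unchanged instead of being reported).
std::vector<std::uint32_t> utf16_to_code_points(
        const std::vector<std::uint16_t>& units) {
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < units.size(); ++i) {
        std::uint32_t ch = units[i];
        // High (leading) surrogate range is [D800, DBFF].
        if (ch >= 0xD800 && ch <= 0xDBFF && i + 1 < units.size()) {
            const std::uint32_t ch2 = units[i + 1];
            // Low (trailing) surrogate range is [DC00, DFFF].
            if (ch2 >= 0xDC00 && ch2 <= 0xDFFF) {
                // Each surrogate carries 10 payload bits; the pair encodes
                // the code point minus 0x10000.
                ch = ((ch - 0xD800) << 10) + (ch2 - 0xDC00) + 0x10000;
                assert(ch <= 0x10FFFF);
                ++i;  // the trailing unit has been consumed as well
            }
        }
        out.push_back(ch);
    }
    return out;
}
```

For example, the units 0xD83D 0xDE00 combine into 0x1F600: (0x3D << 10) is 0xF400, adding 0x200 gives 0xF600, and adding 0x10000 gives 0x1F600.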

Byte length of a MySQL column in C++

I am using C++ to write a program for a MySQL database. I am trying to check a condition by comparing the length of a column (in bytes) to pass/fail. Here is the code:
while (row = mysql_fetch_row(result))
{
lengths = mysql_fetch_lengths(result);
num_rows = mysql_num_rows(result);
for (i = 0; i < num_fields; i++)
{
if (strstr(fields[i].name, "RSSI") != NULL)
{
if (lengths[*row[i]] == ??)
printf("current value is %s \t", row[i]);
}
}
}
So basically what I am trying to do is look for the string "RSSI" in the columns, and if the string is present I want to print that value. The values in each column are 3 bytes in length if present. So how do I check whether lengths[*row[i]] is 3 bytes in length? Thanks
According to the official MySQL documentation mysql_fetch_lengths returns an array of unsigned long with the lengths of the columns of the current row. Although the description isn't clear whether it's in bytes or something else, the example shown clarifies it.
So you should compare directly against 3.
Also, there are some syntactic and semantic errors, and a possible refactoring in your code, among them the following:
Given the lengths variable is an array with the current rows' lengths, the expression lengths[*row[i]] should just be lengths[i] because i is the index of the current column.
The two ifs inside the for could be merged with the && operator for better readability.
Some variables are not defined or used correctly.
The code would look like this:
// Properly assign a value to fields variable.
fields = mysql_fetch_fields(result);
// Getting the number of fields outside the loop is better.
num_fields = mysql_num_fields(result);
while (row = mysql_fetch_row(result))
{
lengths = mysql_fetch_lengths(row);
for (i = 0; i < num_fields; i++)
if (strstr(fields[i].name, "RSSI") != NULL && lengths[i] == 3)
printf("current value is %s \t", row[i]);
printf("\n"); // For better output print each row in a new line.
}
You should really read the documentation carefully in order to avoid compilation or logic errors from using the wrong function.
I think there is a typo in the code above. The dev docs state:
(http://dev.mysql.com/doc/refman/5.0/en/mysql-fetch-lengths.html)
...
num_fields = mysql_num_fields(result);
lengths = mysql_fetch_lengths(result);
for(i = 0; i < num_fields; i++)
NOT
lengths = mysql_fetch_lengths(row);
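Putting both answers together, the per-column test can be sketched without any database at all. The helpers below are hypothetical and take plain arrays; in the real loop the names would come from fields[i].name and the lengths from mysql_fetch_lengths(result), indexed by the column number i.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// True when a column should be printed: its name contains "RSSI" and the
// value in the current row is exactly `wanted` bytes long.
bool is_rssi_with_length(const char* field_name, unsigned long length,
                         unsigned long wanted = 3) {
    assert(field_name != nullptr);
    return std::strstr(field_name, "RSSI") != nullptr && length == wanted;
}

// Count the matching columns of one row; names[i] and lengths[i] describe
// the same column, which is why both are indexed with i.
std::size_t count_rssi_columns(const char* const* names,
                               const unsigned long* lengths,
                               std::size_t num_fields) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < num_fields; ++i) {
        if (is_rssi_with_length(names[i], lengths[i])) {
            ++count;
        }
    }
    return count;
}
```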

Adapting Boyer-Moore Implementation

I'm trying to adapt the Boyer-Moore c(++) Wikipedia implementation to get all of the matches of a pattern in a string. As it is, the Wikipedia implementation returns the first match. The main code looks like:
char* boyer_moore (uint8_t *string, uint32_t stringlen, uint8_t *pat, uint32_t patlen) {
    int i;
    int delta1[ALPHABET_LEN];
    int *delta2 = malloc(patlen * sizeof(int));
    make_delta1(delta1, pat, patlen);
    make_delta2(delta2, pat, patlen);
    i = patlen-1;
    while (i < stringlen) {
        int j = patlen-1;
        while (j >= 0 && (string[i] == pat[j])) {
            --i;
            --j;
        }
        if (j < 0) {
            free(delta2);
            return (string + i+1);
        }
        i += max(delta1[string[i]], delta2[j]);
    }
    free(delta2);
    return NULL;
}
I have tried to modify the block after if (j < 0) to add the index to an array/vector and letting the outer loop continue, but it doesn't appear to be working. In testing the modified code I still only get a single match. Perhaps this implementation wasn't designed to return all matches, and it needs more than a few quick changes to do so? I don't understand the algorithm itself very well, so I'm not sure how to make this work. If anyone can point me in the right direction I would be grateful.
Note: The functions make_delta1 and make_delta2 are defined earlier in the source (check Wikipedia page), and the max() function call is actually a macro also defined earlier in the source.
Boyer-Moore's algorithm exploits the fact that when searching for, say, "HELLO WORLD" within a longer string, the letter you find at a given position restricts what can be found around that position if there is to be a match at all. It is a bit like a game of Battleship: if you find open sea four cells from the border, you needn't test the four remaining cells in case a 5-cell carrier is hiding there; it can't be.
If you found, for example, a 'D' in the eleventh position, it might be the last letter of HELLO WORLD; but if you found a 'Q', which occurs nowhere in HELLO WORLD, the searched-for string can't be anywhere in the first eleven characters, and you can skip searching there altogether. An 'L', on the other hand, might mean that HELLO WORLD is there, starting at position 11-3 (the third letter of HELLO WORLD is an L), 11-4, or 11-10.
When searching, you keep track of these possibilities using the two delta arrays.
So when you find a match, you ought to do this:
    if (j < 0)
    {
        // Found a match occupying positions i+1 to i+patlen
        // Add to vector or whatever is needed; check we don't overflow it
        // (one slot is kept for the terminating zero).
        if (index_counter + 1 >= index_size)
        {
            index[index_counter] = 0;
            free(delta2);
            return index_size;
        }
        index[index_counter++] = i+1;
        // Reinitialize j to restart search
        j = patlen-1;
        // i is now one before the match start, so i+patlen+1 puts the end of
        // the window one position past the end of this match
        i += patlen + 1;
        // Do not free delta2 here; the search still needs it
        // Continue loop without altering i again
        continue;
    }
    i += max(delta1[string[i]], delta2[j]);
}
free(delta2);
index[index_counter] = 0;
return index_counter;
This should return a zero-terminated list of indexes, provided you pass something like a size_t *indexes to the function.
The function will then return 0 (not found), index_size (too many matches) or the number of matches between 1 and index_size-1.
This makes it possible, for example, to collect additional matches without repeating the whole search for the (index_size-1) substrings already found: you increase num_indexes by new_num, realloc the indexes array, then pass the function the new array at offset old_index_size-1, new_num as the new size, and the haystack string starting from the offset of the match at index old_index_size-1 plus one (not, as I wrote in a previous revision, plus the length of the needle string; see comment).
This approach will report also overlapping matches, for example searching ana in banana will find b*ana*na and ban*ana*.
UPDATE
I tested the above and it appears to work. I modified the Wikipedia code by adding these two includes to keep gcc from grumbling
#include <stdio.h>
#include <string.h>
then I modified the if (j < 0) to simply output what it had found
if (j < 0) {
printf("Found %s at offset %d: %s\n", pat, i+1, string+i+1);
//free(delta2);
// return (string + i+1);
i += patlen + 1;
j = patlen - 1;
continue;
}
and finally I tested with this
int main(void)
{
char *s = "This is a string in which I am going to look for a string I will string along";
char *p = "string";
boyer_moore(s, strlen(s), p, strlen(p));
return 0;
}
and got, as expected:
Found string at offset 10: string in which I am going to look for a string I will string along
Found string at offset 51: string I will string along
Found string at offset 65: string along
If the string contains two overlapping sequences, BOTH are found:
char *s = "This is an andean andeandean andean trouble";
char *p = "andean";
Found andean at offset 11: andean andeandean andean trouble
Found andean at offset 18: andeandean andean trouble
Found andean at offset 22: andean andean trouble
Found andean at offset 29: andean trouble
To avoid overlapping matches, the quickest way is to not store the overlaps. It could be done in the function, but that would mean reinitializing the first delta vector and updating the string pointer; we would also need to store a second i index as i2 to keep saved indexes from going nonmonotonic. It isn't worth it. Better:
if (j < 0) {
    // We have found a patlen match at i+1
    // Is it an overlap? It is if it starts before the last stored match ends.
    if (index && (indexes[index] + patlen > i+1))
    {
        // Yes, it is. So we don't store it.
        // We could store the last of several overlaps
        // It's not exactly trivial, though:
        // searching 'anana' in 'Bananananana'
        // finds FOUR matches, and the fourth is NOT overlapped
        // with the first. So in case of overlap, if we want to keep
        // the LAST of the bunch, we must save info somewhere else,
        // say last_conflicting_overlap, and check twice.
        // Then again, the third match (which is the last to overlap
        // with the first) would overlap with the fourth.
        // So the "return as many non overlapping matches as possible"
        // is actually accomplished by doing NOTHING in this branch of the IF.
    }
    else
    {
        // Not an overlap, so store it.
        indexes[++index] = i+1;
        if (index == max_indexes) // Too many matches already found?
            break; // Stop searching and return found so far
    }
    // Adapt i and j to keep searching
    i += patlen + 1;
    j = patlen - 1;
    continue;
}
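For comparison, here is a self-contained sketch of the "report every match, then optionally drop overlaps" idea. To stay short it does not use the Wikipedia delta1/delta2 code: it swaps in the simpler Boyer-Moore-Horspool variant, which keeps only a bad-character table. The function names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Start offsets of every occurrence of pat in text, overlapping ones too.
std::vector<std::size_t> find_all(const std::string& text,
                                  const std::string& pat) {
    std::vector<std::size_t> hits;
    const std::size_t n = text.size(), m = pat.size();
    if (m == 0 || m > n) return hits;
    // shift[c]: distance from the last occurrence of c in pat (final
    // character excluded) to the end of pat; m if c does not occur there.
    std::array<std::size_t, 256> shift;
    shift.fill(m);
    for (std::size_t k = 0; k + 1 < m; ++k)
        shift[static_cast<unsigned char>(pat[k])] = m - 1 - k;
    std::size_t pos = 0;  // start of the current window
    while (pos + m <= n) {
        std::size_t j = m;  // compare right to left, as Boyer-Moore does
        while (j > 0 && text[pos + j - 1] == pat[j - 1]) --j;
        if (j == 0) {
            hits.push_back(pos);
            pos += 1;  // move one step so overlapping matches are still seen
        } else {
            const std::size_t step =
                shift[static_cast<unsigned char>(text[pos + m - 1])];
            assert(step >= 1);  // the search always makes progress
            pos += step;
        }
    }
    return hits;
}

// Keep a match only if it starts at or after the end of the last kept one.
std::vector<std::size_t> drop_overlaps(const std::vector<std::size_t>& starts,
                                       std::size_t patlen) {
    assert(std::is_sorted(starts.begin(), starts.end()));
    std::vector<std::size_t> kept;
    for (std::size_t s : starts)
        if (kept.empty() || kept.back() + patlen <= s) kept.push_back(s);
    return kept;
}
```

Searching "ana" in "banana" yields offsets 1 and 3, and filtering the offsets 1, 3, 5, 7 of 'anana' in 'Bananananana' leaves 1 and 7, matching the reasoning in the comments above.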

.wav C++ data not outputting correctly

So I know I have asked this question before; however, I am still stuck and can't move on with my project. Basically, I'm trying to read in a .wav file. I have read in all of the required header information and then stored all the data inside a char array. This is all good; however, I then recast the data as an integer and try to output the data.
I have tested the data in MatLab, however, I get very different results:
Matlab -0.0078
C++: 1031127695
Now these are very different results, and someone here kindly said it's because I'm outputting it as an integer; however, I have tried pretty much every single data type and still get the wrong results. Someone has suggested that it could be something to do with endianness (http://en.wikipedia.org/wiki/Endianness). Does this seem logical?
Here is the code:
bool Wav::readHeader(ifstream &file)
{
file.read(this->chunkId, 4);
file.read(reinterpret_cast<char*>(&this->chunkSize), 4);
file.read(this->format, 4);
file.read(this->formatId, 4);
file.read(reinterpret_cast<char*>(&this->formatSize), 4);
file.read(reinterpret_cast<char*>(&this->format2), 2);
file.read(reinterpret_cast<char*>(&this->numChannels), 2);
file.read(reinterpret_cast<char*>(&this->sampleRate), 4);
file.read(reinterpret_cast<char*>(&this->byteRate), 4);
file.read(reinterpret_cast<char*>(&this->align), 2);
file.read(reinterpret_cast<char*>(&this->bitsPerSample), 4);
char testing[4] = {0};
int testingSize = 0;
while(file.read(testing, 4) && (testing[0] != 'd' ||
testing[1] != 'a' ||
testing[2] != 't' ||
testing[3] != 'a'))
{
file.read(reinterpret_cast<char*>(&testingSize), 4);
file.seekg(testingSize, std::ios_base::cur);
}
this->dataId[0] = testing[0];
this->dataId[1] = testing[1];
this->dataId[2] = testing[2];
this->dataId[3] = testing[3];
file.read(reinterpret_cast<char*>(&this->dataSize), 4);
this->data = new char[this->dataSize];
file.read(data, this->dataSize);
unsigned int *te;
te = reinterpret_cast<int*>(&this->data);
cout << te[3];
return true;
}
Any help would be really appreciated. I hope I've given enough details.
Thank you.
I think there are several issues with your code. One of them is the casts, for instance this one:
unsigned int *te;
te = reinterpret_cast<int*>(&this->data); // (2)
cout << te[3];
te is a pointer to unsigned int, while you cast to a pointer to int. I would expect a compilation error at line (2)...
And what do you mean by te[3]? &this->data is the address of the data pointer member itself, not of the buffer it points to, so I expect some garbage from the *(te + 3) memory location to be output here.
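As for getting numbers comparable to MATLAB's: its readers typically return samples scaled to the range [-1, 1), so the raw bytes have to be combined into signed samples and divided by full scale. The sketch below assumes 16-bit little-endian PCM, which is common but not guaranteed; the bitsPerSample header field decides. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert raw 16-bit little-endian PCM bytes to doubles in [-1, 1).
// A trailing odd byte, if any, is ignored.
std::vector<double> decode_pcm16le(const std::vector<unsigned char>& bytes) {
    std::vector<double> samples;
    samples.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        // Low byte first, then high byte: this is where endianness matters.
        const std::uint32_t u =
            bytes[i] | (static_cast<std::uint32_t>(bytes[i + 1]) << 8);
        // Reinterpret the 16-bit pattern as a signed two's-complement value.
        const int s = (u >= 0x8000u) ? static_cast<int>(u) - 0x10000
                                     : static_cast<int>(u);
        assert(s >= -32768 && s <= 32767);
        samples.push_back(s / 32768.0);
    }
    return samples;
}
```

The byte pair 0x00 0xFF decodes to -256, and -256 / 32768 is -0.0078125, the same magnitude as the MATLAB value quoted in the question.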

Vector is out of range

A good day to all stackers.
I am running my program in Quincy2005 and I get the following error:
"Terminate called after throwing an instance of 'std::out_of_range'"
"what(): vector::_M_range_check"
Below is my code:
int ptextLoc,ctextLoc; //location of the plain/cipher txt
char ctextChar; //cipher text variable
//by default, the location of the plain text is even
bool evenNumberLocBool = true;
ifstream ptextFile;
//open plain text file
ptextFile.open("ptext.txt");
//character by character encryption
while (!ptextFile.eof())
{
//get (next) character from file and store it in a variable ptextChar
char ptextChar = ptextFile.get();
//find the position of the ptextChar in keyvector Vector
ptextLoc = std::find(keyvector.begin(), keyvector.end(), ptextChar) - keyvector.begin();
//if the location of the plain text is even
if ( ((ptextLoc % 2) == 0) || (ptextLoc == 0) )
evenNumberLocBool = true;
else
evenNumberLocBool = false;
//if the location of the plain text is even/odd, find the location of the cipher text
if (evenNumberLocBool)
ctextLoc = ptextLoc + 1;
else
ctextLoc = ptextLoc - 1;
//store the cipher pair in ctextChar variable
ctextChar = keyvector.at(ctextLoc);
cout << ctextChar;
}
Contents of ptext.txt
ab cd ef
If the first letter is 'a', which is at position 0, the paired cipher letter will be keyvector[1].
LATEST UPDATE: I have found the line which has been creating this error.
ctextChar = keyvector.at(ctextLoc);
However, I am not sure why it is happening with this line.
I hope someone will be able to guide me.
if (ctextLoc > keyvector.size())
should probably be
if (ctextLoc >= keyvector.size())
std::vector's size() returns the number of elements N, whereas at() only accepts indices 0 through N - 1. Therefore, you should use:
if (keyvector.size() != 0 && ctextLoc > keyvector.size() - 1)
break;
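Two things in the posted loop can plausibly produce an out-of-range index. First, eof() only becomes true after a read has failed, so the last get() returns the end-of-file marker. Second, any character that is not in keyvector (the spaces in ptext.txt, for instance, if the key has none) makes std::find return end(), so ptextLoc equals size(). A minimal sketch that guards against both follows; the function names are hypothetical, and passing unknown characters through unchanged is my assumption, not something the question specifies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Swap each character with its pair partner in key (0<->1, 2<->3, ...).
std::string substitute_pairs(std::istream& in, const std::vector<char>& key) {
    std::string out;
    char ch;
    // get(ch) fails at end of input, so no end-of-file marker is processed.
    while (in.get(ch)) {
        const auto it = std::find(key.begin(), key.end(), ch);
        if (it == key.end()) {  // not in the key: copy through (assumption)
            out += ch;
            continue;
        }
        const std::size_t loc = static_cast<std::size_t>(it - key.begin());
        assert(loc < key.size());
        const std::size_t partner = (loc % 2 == 0) ? loc + 1 : loc - 1;
        // The last element of an odd-sized key has no partner; keep it as is.
        out += (partner < key.size()) ? key[partner] : ch;
    }
    return out;
}

// Convenience wrapper for in-memory text.
std::string substitute_pairs_text(const std::string& text,
                                  const std::vector<char>& key) {
    std::istringstream in(text);
    return substitute_pairs(in, key);
}
```

With the key a, b, c, d, e, f, the text "ab cd ef" becomes "ba dc fe".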