Performing string operations on data that might contain null \0

I'm implementing a pluggable MIME filter for IE (this question concerns IInternetProtocol::Read(void*, ULONG, ULONG*)) and I'm intercepting incoming HTML with a view to modify the HTML.
The HTML is generally UTF-8 encoded, except that it contains some \0 (null) characters, and it sits inside a char buffer. I want to load it into a std::string instance so I can perform string operations such as std::string::find, as well as insert content by copying substrings into a destination buffer around my injected string, something like this:
string received( this->buffer );
size_t index = received.find("<p id=\"foo\">");
if( index != string::npos ) {
memcpy( destination , received.data() , index );
memcpy( destination + index , "Injected content" , 16 ); // 16 characters, terminator not copied
memcpy( destination + index + 16 , received.data() + index , received.size() - index );
} else {
memcpy( destination , this->buffer , this->bufferSize );
}
The problem is that the buffer might contain null bytes (it's a quirk of the website I'm working with). To what extent would \0 character values interact with string operations such as find? Neither the documentation on MSDN nor CPlusPlus.com says.
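As a minimal sketch of one way to approach this (not taken from any answer on this page): std::string stores its own length, so a string built with an explicit byte count keeps embedded \0 bytes, and find and insert operate on the whole range. The function name injectBefore and its parameters are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: inserts `injected` immediately before the first
// occurrence of `marker` in a buffer that may contain '\0' bytes.
std::string injectBefore(const char* buffer, std::size_t bufferSize,
                         const std::string& marker,
                         const std::string& injected)
{
    // The (pointer, length) constructor copies exactly bufferSize bytes.
    // The single-argument constructor would stop at the first '\0'.
    std::string received(buffer, bufferSize);

    // find() searches the full stored length, past any embedded '\0'.
    const std::size_t index = received.find(marker);
    if (index != std::string::npos)
        received.insert(index, injected);

    return received;
}
```

Note that the needle is passed as a std::string here; the find overload taking a plain const char* treats the needle (not the haystack) as a null-terminated C string.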

Related

Calculate SHA1 hash like Git with Qt C++

I'd like to hash a file in the same way that git hash-object does, so I can compare it to an existing hash, but using Qt and C++.
The answers to this question show how to get the same hash, but none of the examples use C++.
So far this is what we've tried:
QString fileName = entry.toObject().value( "name" ).toString();
QByteArray shaJson = entry.toObject().value( "sha" ).toString().toUtf8();
QByteArray shaFile;
QFile f( QString( "%1/%2" ).arg( QCoreApplication::applicationDirPath() ).arg( fileName ) );
if( f.open(QFile::ReadOnly ) )
{
QCryptographicHash hash(QCryptographicHash::Sha1);
hash.addData( QString( "blob " ).toUtf8() ); // start with the string "blob "
hash.addData( QString( "%1" ).arg( f.size() ).toUtf8() ); // add size in bytes of the content
hash.addData( QString( "\0" ).toUtf8() ); // null byte
hash.addData( f.readAll() ); // actual file content
shaFile = hash.result().toHex();
if( shaFile != shaJson ){
}
}
How to implement this hashing method with Qt?
Edit:
Here's an example hash output:
ccbf4f0a52fd5ac59e18448ebadf2ef37c62f54f
Computed with git hash-object from this file:
https://raw.githubusercontent.com/ilia3101/MLV-App/master/pixel_maps/80000301_1808x1007.fpm
So that's the hash we would also like to compute with Qt.

The problem is that QString( "\0" ) is constructed from a C string, so construction stops at the \0 and the result is empty: your third addData call adds nothing. QByteArray, on the other hand, always keeps an extra \0 after its data. From Qt's docs:
Using QByteArray is much more convenient than using const char *.
Behind the scenes, it always ensures that the data is followed by a
\0 terminator, and uses implicit sharing (copy-on-write) to reduce
memory usage and avoid needless copying of data.
https://doc.qt.io/qt-5/qbytearray.html
That terminator is not counted in size(), so addData( QByteArray ) never hashes it, and in your case the null byte after the header is missing from the hashed data. One workaround is the following code:
QFile file(path);
if( file.open(QFile::ReadOnly ) )
{
QCryptographicHash hash(QCryptographicHash::Sha1);
QByteArray header = QString("blob %1").arg(file.size()).toUtf8();
hash.addData(header.data(), header.size() + 1);
hash.addData(file.readAll());
shaFile = hash.result().toHex();
qDebug() << shaFile;
}
QByteArray::data() returns a pointer to the data stored in the byte array. The pointer can be used to access and modify the bytes that compose the array. The data is '\0'-terminated, i.e. the number of bytes in the returned character string is size() + 1 including the '\0' terminator. Therefore we do not need to add the \0 explicitly; QByteArray does that for us. We only need to add 1 to the size, because size() reports the length of the array without the \0 character.
The code above generated ccbf4f0a52fd5ac59e18448ebadf2ef37c62f54f for your file, so I guess it is a correct hash.
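The same idea can be sketched without Qt, to make the byte layout explicit. This is only an illustration of the header described in the question ("blob ", the size in bytes, then one null byte); gitBlobHeader is a hypothetical name and the hashing step itself is left out.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: builds the header that precedes the file content,
// i.e. "blob <size>" followed by exactly one '\0' byte.
std::string gitBlobHeader(std::size_t contentSize)
{
    std::string header = "blob " + std::to_string(contentSize);
    header.push_back('\0'); // the null byte is part of the hashed data
    return header;
}
```

Whatever hashing API is used, it has to be given header.size() bytes, so that the trailing null byte is included.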

How is memset working in this snippet of code?

I think this snippet of code is enough to get the idea of what I'm doing.
I'm using getline to read input data from a text file that has lines that might look something like this: The cat is fat/And likes to sing
From searching around the internet I was able to get it working, but I'd like to better understand WHY it is working. My primary question is how the
memcpy(id, buffer, temp - buffer);
line is working. I read what memcpy() does but do not understand how the temp - buffer part is working.
So from my understanding I'm setting *temp to the '/' in that line. Then I'm copying the line up until the '/' into it. But how does the temp, which is at '/' minus the buffer (which is the whole line from getline) work out to just be The cat is fat?
Hopefully that made some sense.
#define MAX_SIZE 255
char buffer[MAX_SIZE + 1] = { 0 };
cin.getline(buffer, MAX_SIZE);
memset(id, 0, 256);
memset(title, 0, 256);
char* temp = strchr(buffer, '/');
memcpy(id, buffer, temp - buffer);
temp++;
strcpy(title, temp);
Also, if I can double dip, why would MAX_SIZE be defined at 255 but MAX_SIZE+1 is often used. Does this have to do with a delimiter or white space at the end of a line?
Thanks for the help.

In my opinion it is simply bad code. :)
I would write it like this:
const size_t MAX_SIZE = 256;
char buffer[MAX_SIZE] = {};
std::cin.getline( buffer, MAX_SIZE );
id[0] = '\0';
title[0] = '\0';
if ( char* temp = strchr( buffer, '/' ) )
{
std::memcpy( id, buffer, temp - buffer );
id[temp - buffer] = '\0';
std::strcpy( title, temp + 1 );
}
else
{
std::strcpy( id, buffer );
}
As for memcpy in this statement
memcpy(id, buffer, temp - buffer);
it copies temp - buffer bytes from buffer to id. Because id was previously set to zeroes, after the memcpy it contains a zero-terminated string.

Your question concerns pointer-difference calculation, part of the family of arithmetic operations that make up pointer arithmetic.
Most beginners don't have too much trouble grasping how pointer-addition works. Given this:
char buffer[256];
char *p = buffer + 10;
it is usually clear that p points to the slot at index 10 in the buffer char array. But you need to remember that the pointer type is important. The same construct you see above also works for more complicated data types:
struct Something
{
char name[128];
int ident;
int supervisor;
} people[64];
struct Something *p = people+10; // NOTE: same line, different types
Just as before, p points to the element at index 10 in the array, but note the arithmetic: the size of the underlying type is used to calculate the relevant memory offset. You don't need to do it yourself. No sizeof required here.
So why do you care? Because just like regular math, pointer math has certain properties, one of them being the following:
char buffer[256];
char *p = buffer+10; // p addresses the slot at index 10 in the array
size_t len = p-buffer; // len is the typed difference between p and buffer.
In this case, len will be 10, the same as the offset of p. So how does this relate to your question? Well...
char* temp = strchr(buffer, '/');
memcpy(id, buffer, temp - buffer);
Setting aside the horrid nature of this code (if there is no '/' in the buffer array, temp is NULL and the ensuing memcpy will all but guarantee a massive segfault), it finds the location in the string where '/' resides. Once it has that, the calculation temp - buffer uses pointer arithmetic (specifically pointer differencing) to calculate the distance between the address in temp and the address at the base of the array. The result is the element count, not including the slash itself. Therefore this code copies up to, but not including, the discovered slash into the id buffer. The rest of the id buffer retains the 0 values populated by the memset, and therefore the string is terminated (which is way more work than you need to do, btw).
After that line, the remainder:
temp++;
strcpy(title, temp);
post-increments the temp pointer, which says "move to the next element in the array". Then the strcpy copies the remaining chars of the null-terminated buffer string into title. Worth noting this could have simply been:
strcpy(title, ++temp);
And likewise:
strcpy(title, temp+1);
which retains temp at the '/' position. In all of the above, the result in title will be the same: all chars after the slash, but not including it.
I hope that explains what is going on. Best of luck.
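For illustration, here is a minimal sketch of the same split using the pointer difference as a length, with the missing NULL check added. The function name splitAtSlash is hypothetical, and it returns std::string objects instead of filling the id and title arrays from the question.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical helper: splits "id/title" at the first '/'.
// If there is no '/', the whole line becomes the id and the title is empty.
std::pair<std::string, std::string> splitAtSlash(const char* line)
{
    const char* slash = std::strchr(line, '/');
    if (!slash)
        return {std::string(line), std::string()};

    // slash - line is the number of chars before the slash.
    const std::size_t idLength = static_cast<std::size_t>(slash - line);
    return {std::string(line, idLength), std::string(slash + 1)};
}
```

With the sample line from the question, the pointer difference is 14, the length of "The cat is fat".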

MAX_SIZE+1 is reserving space for the null terminator at the end of the string ('\0')
memcpy(id, buffer, temp - buffer)
This is copying (temp-buffer) bytes from buffer to id. Since strchr finds the '/' character in the input, temp is pointing inside buffer (assuming it's found). So for example assume buffer points to a location in memory:
buffer = 0x781230001
and the third byte is the '/', after strchr, you have
temp = 0x781230003
temp - buffer therefore is 2.
HOWEVER: If the '/' is not found, then temp will be NULL and the code will most likely crash. You should check the result of strchr before doing the pointer arithmetic.

Here you calculate the position of the first / in buffer.
char* temp = strchr(buffer, '/');
Now temp points to the / in buffer. If you want to copy this part of buffer, it's enough to have a pointer to the start and the length of the string. So temp - buffer evaluates to the length.
=================================
The cat is fat/And likes to sing
=================================
^             ^
buffer        temp
|   length    | = temp - buffer
The end of a null-terminated string is marked by \0 (or simply 0). So if you need to store N chars you need to allocate a buffer of size N+1.

Efficient means of null terminating an unsigned char buffer in a string append function?

I've been writing a "Byte Buffer" utility module - just a set of functions for personal use in low level development.
Unfortunately, my ByteBuffer_Append(...) function doesn't work properly when it null terminates the character at the end, and/or adds extra room for the null termination character. One result, when this is attempted, is when I call printf() on the buffer's data (a cast to (char*) is performed): I'll get only a section of the string, as the first null termination character within the buffer will be found.
So, what I'm looking for is a means to incorporate some kind of null terminating functionality within the function, but I'm kind of drawing a blank in terms of what would be a good way of going about this, and could use a point in the right direction.
Here's the code, if that helps:
void ByteBuffer_Append( ByteBuffer_t* destBuffer, uInt8* source, uInt32 sourceLength )
{
if ( !destBuffer )
{
puts( "[ByteBuffer_Append]: param 'destBuffer' received is NULL, bailing out...\n" );
return;
}
if ( !source )
{
puts( "[ByteBuffer_Append]: param 'source' received is NULL, bailing out...\n" );
return;
}
size_t byteLength = sizeof( uInt8 ) * sourceLength;
// check to see if we need to reallocate the buffer
if ( destBuffer->capacity < byteLength || destBuffer->length >= sourceLength )
{
destBuffer->capacity += byteLength;
uInt8* newBuf = ( uInt8* ) realloc( destBuffer->data, destBuffer->capacity );
if ( !newBuf )
{
Mem_BadAlloc( "ByteBuffer_Append - realloc" );
}
destBuffer->data = newBuf;
}
uInt32 end = destBuffer->length + sourceLength;
// use a separate pointer for the source data as
// we copy it into the destination buffer
uInt8* pSource = source;
for ( uInt32 iBuffer = destBuffer->length; iBuffer < end; ++iBuffer )
{
destBuffer->data[ iBuffer ] = *pSource;
++pSource;
}
// the commented code below
// is where the null termination
// was happening
destBuffer->length += sourceLength; // + 1;
//destBuffer->data[ destBuffer->length - 1 ] = '\0';
}
Many thanks to anyone providing input on this.

Looks like your issue is caused by memory corruption.
You have to fix the following three problems:
1 check if allocated space is enough
if ( destBuffer->capacity < byteLength || destBuffer->length >= sourceLength )
does not properly check if buffer reallocation is needed,
replace with
if ( destBuffer->capacity <= destBuffer->length+byteLength )
2 allocating enough space
destBuffer->capacity += byteLength;
is better to become
destBuffer->capacity = destBuffer->length + byteLength + 1;
3 properly null terminating
destBuffer->data[ destBuffer->length - 1 ] = '\0';
should become
destBuffer->data[ destBuffer->length ] = '\0';
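Putting the three fixes together, an append routine could look roughly like the sketch below. It is not the asker's module: ByteBuffer, byteBufferAppend and the bool return value are hypothetical stand-ins for ByteBuffer_t, ByteBuffer_Append and Mem_BadAlloc, and the growth policy is the simple exact-fit one suggested above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for the question's ByteBuffer_t.
struct ByteBuffer {
    std::uint8_t* data = nullptr;
    std::size_t length = 0;   // bytes in use, excluding the terminator
    std::size_t capacity = 0; // bytes allocated
};

// Appends sourceLength bytes and keeps one '\0' after the data.
// Returns false on bad input or allocation failure.
bool byteBufferAppend(ByteBuffer& dest, const std::uint8_t* source,
                      std::size_t sourceLength)
{
    if (!source)
        return false;

    // Room for the existing data, the new data and the terminator.
    const std::size_t needed = dest.length + sourceLength + 1;
    if (dest.capacity < needed) {
        void* grown = std::realloc(dest.data, needed);
        if (!grown)
            return false; // the old block is still valid and unchanged
        dest.data = static_cast<std::uint8_t*>(grown);
        dest.capacity = needed;
    }

    std::memcpy(dest.data + dest.length, source, sourceLength);
    dest.length += sourceLength;
    dest.data[dest.length] = '\0'; // terminator is not counted in length
    return true;
}
```

Because the terminator sits outside length, appended data may itself contain '\0' bytes without confusing later appends; only code that prints data as a C string is affected by them.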

In C/C++, a sequence of chars terminated by a '\0' is a string. There is a set of string functions, such as strcpy() and strcmp(), that take a char * as a parameter, and when they find a '\0' they treat the string as ending there. In your case, printf("%s", buf) treats buf as a string, so when it finds a '\0' it stops printing.
If you are writing a buffer, then any data, including '\0', is normal data in the buffer. So you should avoid using string functions on it. To print a buffer, you need to implement your own function.
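One possible shape for such a function is sketched below: it walks the buffer by length and escapes every non-printable byte, so embedded '\0' bytes show up as \x00 instead of cutting the output short. The name escapeBytes and the \xNN notation are choices made for this example only.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: renders a byte buffer as printable text.
// Printable ASCII is kept as-is; everything else becomes \xNN.
std::string escapeBytes(const unsigned char* data, std::size_t length)
{
    std::string out;
    for (std::size_t i = 0; i < length; ++i) {
        const unsigned char byte = data[i];
        if (byte >= 0x20 && byte < 0x7f && byte != '\\') {
            out.push_back(static_cast<char>(byte));
        } else {
            char hex[5]; // backslash, 'x', two hex digits, terminator
            std::snprintf(hex, sizeof hex, "\\x%02x",
                          static_cast<unsigned>(byte));
            out += hex;
        }
    }
    return out;
}
```

If the raw bytes should go to the output unchanged, fwrite(data, 1, length, stdout) is an alternative, since it also takes an explicit length.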

Error on memcpy, length not correct

I am copying data in Gateway (contains the string Oct/10/12) to dest_data but dest_data is getting more characters than the source:
unsigned_8 *dest_data
int_16 len;
len = (int_16)strlen( Gateway ); // len got 9 correctly
(void)memcpy( dest_data, GatewayApplicationRlsDate, len );
The final output of dest_data is "Oct/10/1210.1.3"
Do I have to clean the dest_data before copying?

You copy your string content, but not the terminating null character. Add one to len, and you should be fine. But the proper solution would be to use strcpy(), which copies the trailing null character automatically.
Also, remember to allocate memory for dest_data (malloc((len + 1) * sizeof(*dest_data));)
unsigned_8 *dest_data;
int_16 len;
len = (int_16)strlen( Gateway ) + 1;
dest_data = malloc(len * sizeof(*dest_data));
(void)strcpy( dest_data, GatewayApplicationRlsDate );

No memory has been allocated for dest_data (it is an uninitialised pointer) and the memcpy() is not copying the null terminator. Allocate len + 1 bytes of memory for dest_data and copy len + 1 to also copy the null terminator.

You need to copy len + 1 bytes
At the moment you forget to copy the null terminator \0.
When you try to access the copy, the string functions search until they find a \0, which could be anywhere.

Shouldn't your strlen use the length from the GatewayApplicationRlsDate?
ie:
len = (int_16)strlen( GatewayApplicationRlsDate );

You should use strcpy, this will also copy the trailing null byte.
strcpy( dest_data, GatewayApplicationRlsDate );
Of course all the caveats about handling raw pointers apply. Really you should probably be using std::string or std::vector<char>.
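As a small sketch of the len + 1 advice combined with the std::vector<char> suggestion: the destination is sized for the terminator and the terminator is copied along with the text. The function name copyWithTerminator is hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: copies a C string, including its '\0',
// into storage that is exactly large enough to hold it.
std::vector<char> copyWithTerminator(const char* source)
{
    const std::size_t len = std::strlen(source); // excludes the '\0'
    std::vector<char> dest(len + 1);             // room for the '\0'
    std::memcpy(dest.data(), source, len + 1);   // copy the '\0' too
    return dest;
}
```

For the string from the question, strlen reports 9, so 10 bytes are allocated and copied.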

How do the variable length fields in the windows EVENTLOGRECORD structure work?

I've tried, with little success, to identify how the variable length portion of the EVENTLOGRECORD data works.
Winnt.h defines the structure, and the following data, as follows:
typedef struct _EVENTLOGRECORD {
DWORD Length; // Length of full record
DWORD Reserved; // Used by the service
DWORD RecordNumber; // Absolute record number
DWORD TimeGenerated; // Seconds since 1-1-1970
DWORD TimeWritten; // Seconds since 1-1-1970
DWORD EventID;
WORD EventType;
WORD NumStrings;
WORD EventCategory;
WORD ReservedFlags; // For use with paired events (auditing)
DWORD ClosingRecordNumber; // For use with paired events (auditing)
DWORD StringOffset; // Offset from beginning of record
DWORD UserSidLength;
DWORD UserSidOffset;
DWORD DataLength;
DWORD DataOffset; // Offset from beginning of record
//
// Then follow:
//
// WCHAR SourceName[]
// WCHAR Computername[]
// SID UserSid
// WCHAR Strings[]
// BYTE Data[]
// CHAR Pad[]
// DWORD Length;
//
} EVENTLOGRECORD, *PEVENTLOGRECORD;
I can pull out the first chunk, which appears to be the source, with the following code, but it's certainly not the intended method:
memcpy(&strings, pRecord+sizeof(EVENTLOGRECORD), tmpLog->UserSidOffset);
But from the comments in Winnt.h, I'm also getting the computer name.
So can someone explain how to determine the "SourceName" length from the EVENTLOGRECORD structure, and explain what StringOffset, DataLength and DataOffset are?
Thanks.

Note: throughout the answer I'll assume that you have a pointer to that structure like this:
EVENTLOGRECORD * elr;
to shorten the code snippets.
So can someone explain how to determine the "SourceName" length from the EVENTLOGRECORD structure
There's no field that specifies how long it is, but you can determine it quite easily: it is the first field of the record after the well-defined fields, so you can simply do:
WCHAR * SourceName=(WCHAR *)((unsigned char *)elr + sizeof(*elr));
Now, in SourceName you have a pointer to that string; you can easily determine its length with the usual string functions.
By the way, after the terminator of SourceName there should be the ComputerName string.
and explain what StringLength
There's no StringLength member, what are you talking about?
DataLength and DataOffset are?
An event log is composed also of arbitrary binary data, that is embedded in the record.
The DataOffset member specifies the offset of such data from the beginning of the record, and DataLength specifies how long that data is. If you were to copy that data to a buffer (assuming that it's big enough), you'd do:
memcpy(targetBuffer,(unsigned char *)elr + elr->DataOffset,elr->DataLength);
By the way, instead of reading the include files directly you should read the documentation; it's far easier to understand.
Addendum about StringOffset
The StringOffset field specifies the offset of the strings associated to the event from the beginning of the record.
The StringOffset field works very much like the DataOffset field described above, but there's no corresponding StringLength field, since the length of each string can be easily determined using the normal string functions (in fact the string section is just made of several NUL-terminated strings put one after the other).
Moreover, the location where the strings section ends can be easily determined by examining the DataOffset member; in fact the strings section ends where the data chunk begins. The EVENTLOGRECORD structure also provides the NumStrings field to determine the number of strings contained in the strings section (thanks Remy Lebeau).
If you were to put these strings in a vector<wstring> you'd do something like this (careful, untested code):
vector<wstring> strings;
for(
wchar_t * ptr=(wchar_t *)((unsigned char *)elr + elr->StringOffset);
strings.size()<elr->NumStrings;
ptr+=strings.back().length() + 1
)
strings.push_back(wstring(ptr));
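A bounds-checked variant of the same walk might look like the sketch below. It is written against a plain wchar_t range so that it does not need the Windows headers; readStrings and its parameters are hypothetical. On Windows, begin would presumably be the address at StringOffset, end the address at DataOffset and count the NumStrings value, assuming wchar_t has the same size as WCHAR (true on Windows, not elsewhere).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: reads up to `count` consecutive NUL-terminated
// wide strings from [begin, end) without reading past `end`.
std::vector<std::wstring> readStrings(const wchar_t* begin,
                                      const wchar_t* end,
                                      std::size_t count)
{
    std::vector<std::wstring> strings;
    const wchar_t* p = begin;
    while (strings.size() < count && p < end) {
        // Find the terminator of the current string, bounded by `end`.
        const wchar_t* q = p;
        while (q < end && *q != L'\0')
            ++q;
        strings.emplace_back(p, q);
        // Skip the terminator if one was found inside the range.
        p = (q < end) ? q + 1 : end;
    }
    return strings;
}
```

The end bound guards against a record whose last string is missing its terminator, which the unbounded wstring(ptr) constructor would read past.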

So can someone explain how to determine the "SourceName" length from the EVENTLOGRECORD structure,
From what I can see, SourceName[] and Computername[] follow one after the other, separated by a '\0': the first starts right after the DataOffset field, and the second starts right after the '\0' of the first and runs up to two bytes before UserSidOffset, with a trailing '\0'.
and explain what StringLength, DataLength and DataOffset are?
StringLength I cannot find (and StringOffset is where Strings[] starts), DataLength is the number of bytes in Data[], and DataOffset is where Data[] starts.
To read the strings, you could do something like this:
// Beware, brain-compiled code ahead!
void f(EVENTLOGRECORD* rec)
{
// Cast to a byte pointer before adding sizeof, so the offset is in bytes
// rather than in multiples of the whole structure.
const wchar_t* first = reinterpret_cast<const wchar_t*>(
reinterpret_cast<const unsigned char*>( rec )
+ sizeof(EVENTLOGRECORD) );
std::wstring source_name( first );
// length() counts wchar_t elements, so step with a wchar_t pointer.
std::wstring computer_name( first + source_name.length() + 1 );
// ...
}

Please read the documentation. It tells you what the StringLength, DataLength and DataOffset members are.
As for the SourceName and ComputerName members, they are both null-terminated strings (with potentially extra padding after ComputerName to align the UserSid member). You saw the ComputerName appear in your buffer because you told memcpy() to copy the raw bytes of both members together. Try using lstrlenW() and lstrcpyW() (or equivalent functions).
