Gzip compress/uncompress a long char array - c++

I need to compress a large byte array. I'm already using the Crypto++ library in the application, so having the compression/decompression part in the same library would be great.
This little test works as expected:
///
string test = "bleachbleachtestingbiatchbleach123123bleachbleachtestingb.....more";
string compress(string input)
{
string result ("");
CryptoPP::StringSource(input, true, new CryptoPP::Gzip(new CryptoPP::StringSink(result), 1));
return result;
}
string decompress(string _input)
{
string _result ("");
CryptoPP::StringSource(_input, true, new CryptoPP::Gunzip(new CryptoPP::StringSink(_result), 1));
return _result;
}
int main()
{
string compressed = compress(test);
string decompressed = decompress(compressed);
cout << "original size :" << test.length() << endl;
cout << "compressed size :" << compressed.length() << endl;
cout << "decompressed size :" << decompressed.length() << endl;
system("PAUSE");
return 0;
}
I need to compress something like this:
unsigned char long_array[194506]
{
0x00,0x00,0x02,0x00,0x00,0x04,0x00,0x00,0x00,
0x01,0x00,0x02,0x00,0x00,0x04,0x02,0x00,0x04,
0x04,0x00,0x02,0x00,0x01,0x04,0x02,0x00,0x04,
0x01,0x00,0x02,0x02,0x00,0x04,0x02,0x00,0x00,
0x03,0x00,0x02,0x00,0x00,0x04,0x01,0x00,0x04,
....
};
I tried to pass long_array to the compress function both as a const char * and as a byte array. It seems to be compressed, but the decompressed result has a size of 4 and is clearly incomplete. Maybe it's too long.
How could I rewrite those compress/uncompress functions to work with that byte array?
Thank you all. :)

I tried to use the array as const char * and as byte then feed it to the compress function, it seems to be compressed but the decompressed one has a size of 4, and it's clearly incomplete.
Use the alternate StringSource constructor that takes a pointer and a length. It is immune to embedded NUL bytes, whereas the const char * overload treats the data as a C string and stops at the first zero byte.
CryptoPP::StringSource ss(long_array, sizeof(long_array), true,
new CryptoPP::Gzip(
new CryptoPP::StringSink(result), 1)
);
Or, you can use:
Gzip zipper(new StringSink(result), 1);
zipper.Put(long_array, sizeof(long_array));
zipper.MessageEnd();
Crypto++ added an ArraySource at 5.6. You can use it too (but it's really a typedef for a StringSource):
CryptoPP::ArraySource as(long_array, sizeof(long_array), true,
new CryptoPP::Gzip(
new CryptoPP::StringSink(result), 1)
);
The 1 passed to Gzip is the deflate level, and 1 is one of the lowest compression levels. You might consider using 9 or Gzip::MAX_DEFLATE_LEVEL (which is 9). The default log2 window size is already the maximum, so there's no need to turn any knobs on it.
Gzip zipper(new StringSink(result), Gzip::MAX_DEFLATE_LEVEL);
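You should also name your declarations (for example, StringSource ss(...) rather than an unnamed temporary). I've seen GCC generate bad code when using anonymous declarations.
Finally, use a name such as long_array rather than array. array is not a keyword, but since C++11 it names a standard container (std::array), so it can clash when using namespace std is in effect.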

Related

How to read custom string with C++ from binary recursively

I've recently been getting in to IO with C++. I am trying to read a string from a binary file stream.
The custom type is saved like this:
Each string is prefixed with its length, so Hello would be stored like this: 6Hello\0.
I am basically reading text from a table (in this case a name table) in a binary file. The file header tells me the offset of this table (112 bytes in this case) and the number of names (318).
Using this information I can read the first byte at this offset. This tells me the length of the string (e.g. 6). So I'll start at the next byte and read 5 more to get the full string "Hello". This seems to work fine with the first name at the offset. Trying to read the rest recursively produces a lot of garbage. I've tried using loops and recursive functions, but it's not working out so well. I'm not sure what the problem is, so I reverted to the original method that retrieves one name. Here's the code:
int printName(fstream& fileObj, __int8 buff, DWORD offset, int& iteration){
fileObj.seekg(offset);
fileObj.read((char*)&buff, sizeof(char));
int nameSize = (int)buff;
char* szName = new char[nameSize];
for(int i=1; i <= nameSize; i++){
fileObj.seekg(offset+i);
fileObj.read((char*)&szName[i-1], sizeof(char));
}
cout << szName << endl;
return 0;
}
Any idea how to iterate through all 318 names without creating dodgy output?
Thanks for taking the time to look through this, your help is greatly appreciated.
You're overcomplicating a bit. There's no need to seek before each sequential read, because the stream position advances by itself as you read.
Removing unused and pointless parameters, I would write this function something like this:
void printName(fstream& fileObj, DWORD offset) {
char size = 0;
if (fileObj.seekg(offset) && fileObj.read(&size, sizeof(char)))
{
char* name = new char[size];
if (fileObj.read(name, size))
{
cout << name << endl;
}
delete [] name;
}
}

Writing/Reading strings in binary file-C++

I searched for a similar post but couldn't find something that could help me.
I'm trying to first write an integer containing the length of a string, and then write the string itself to the binary file.
However, when I read the data back from the binary file, the integers I read have the value 0 and my strings contain junk.
For example, when I type 'asdfgh' for the username and 'qwerty100' for the password,
I get 0,0 for both string lengths and then I read junk from the file.
This is how i write data to the file.
std::fstream file;
file.open("filename",std::ios::out | std::ios::binary | std::ios::trunc );
Account x;
x.createAccount();
int usernameLength= x.getusername().size()+1; //+1 for null terminator
int passwordLength=x.getpassword().size()+1;
file.write(reinterpret_cast<const char *>(&usernameLength),sizeof(int));
file.write(x.getusername().c_str(),usernameLength);
file.write(reinterpret_cast<const char *>(&passwordLength),sizeof(int));
file.write(x.getpassword().c_str(),passwordLength);
file.close();
Right below in the same function i read the data
file.open("filename",std::ios::binary | std::ios::in );
char username[51];
char password[51];
char intBuffer[4];
file.read(intBuffer,sizeof(int));
file.read(username,atoi(intBuffer));
std::cout << atoi(intBuffer) << std::endl;
file.read(intBuffer,sizeof(int));
std::cout << atoi(intBuffer) << std::endl;
file.read(password,atoi(intBuffer));
std::cout << username << std::endl;
std::cout << password << std::endl;
file.close();
When reading the data back in you should do something like the following:
int result;
file.read(reinterpret_cast<char*>(&result), sizeof(int));
This will read the bytes straight into the memory of result with no implicit conversion to int. This will restore the exact binary pattern written to the file in the first place and thus your original int value.
file.write(reinterpret_cast<const char *>(&usernameLength),sizeof(int));
This writes sizeof(int) bytes starting at &usernameLength, i.e. the binary representation of the integer, whose byte order depends on the machine architecture (little endian vs. big endian).
atoi(intBuffer))
This converts ASCII text to an integer and expects the input to contain a character representation, e.g. intBuffer = { '1', '2', '\0' } would return 12.
You can try to read it back the same way you wrote it:
*(reinterpret_cast<int *>(&intBuffer))
But this can lead to unaligned memory access issues. It is better to use a serialization format such as JSON, which can be read back in a cross-platform way.

Get SHA1 of Unicode string in Crypto++

I am studying C++ on my own and I have a problem that I haven't been able to solve for more than a week. I hope you can help me.
I need to get a SHA1 digest of a Unicode string (like Привет), but I don't know how to do that.
I tried to do it like this, but it returns a wrong digest!
For wstring('Ы')
It returns - A469A61DF29A7568A6CC63318EA8741FA1CF2A7
I need - 8dbe718ab1e0c4d75f7ab50fc9a53ec4f0528373
Regards and sorry for my English :).
CryptoPP 5.6.2
MSVC++ 2013
#include <iostream>
#include "cryptopp562\cryptlib.h"
#include "cryptopp562\sha.h"
#include "cryptopp562\hex.h"
int main() {
std::wstring string(L"Ы");
int bs_size = (int)string.length() * sizeof(wchar_t);
byte* bytes_string = new byte[bs_size];
int n = 0; //real bytes count
for (int i = 0; i < string.length(); i++) {
wchar_t wcharacter = string[i];
int high_byte = wcharacter & 0xFF00;
high_byte = high_byte >> 8;
int low_byte = wcharacter & 0xFF;
if (high_byte != 0) {
bytes_string[n++] = (byte)high_byte;
}
bytes_string[n++] = (byte)low_byte;
}
CryptoPP::SHA1 sha1;
std::string hash;
CryptoPP::StringSource ss(bytes_string, n, true,
new CryptoPP::HashFilter(sha1,
new CryptoPP::HexEncoder(
new CryptoPP::StringSink(hash)
)
)
);
std::cout << hash << std::endl;
return 0;
}
I need to get a SHA1 digest of a Unicode string (like Привет), but I don't know how to do that.
The trick here is that you need to know how to encode the Unicode string. On Windows, a wchar_t is 2 octets, while on Linux a wchar_t is 4 octets. There's a Crypto++ wiki page on it at Character Set Considerations, but it's not that good.
To interoperate most effectively, always use UTF-8. That means you convert UTF-16 or UTF-32 to UTF-8. Because you are on Windows, you will want to call the WideCharToMultiByte function to convert it using CP_UTF8. If you were on Linux, then you would use libiconv.
Crypto++ has a built-in function called StringNarrow that uses C++. It's in the file misc.h. Be sure to call setlocale before using it.
Stack Overflow has a few questions on using the Windows function. See, for example, How do you properly use WideCharToMultiByte.
I need - 8dbe718ab1e0c4d75f7ab50fc9a53ec4f0528373
What is the hash (SHA-1, SHA-256, ...)? Is it a HMAC (keyed hash)? Is the information salted (like a password in storage)? How is it encoded? I have to ask because I cannot reproduce your desired results:
SHA-1: 2805AE8E7E12F182135F92FB90843BB1080D3BE8
SHA-224: 891CFB544EB6F3C212190705F7229D91DB6CECD4718EA65E0FA1B112
SHA-256: DD679C0B9FD408A04148AA7D30C9DF393F67B7227F65693FFFE0ED6D0F0ADE59
SHA-384: 0D83489095F455E4EF5186F2B071AB28E0D06132ABC9050B683DA28A463697AD
1195FF77F050F20AFBD3D5101DF18C0D
SHA-512: 0F9F88EE4FA40D2135F98B839F601F227B4710F00C8BC48FDE78FF3333BD17E4
1D80AF9FE6FD68515A5F5F91E83E87DE3C33F899661066B638DB505C9CC0153D
Here's the program I used. Be sure to specify the length of the wide string. If you don't (and use -1 for the length), then WideCharToMultiByte will include the terminating ASCII-Z in its calculations. Since we are using a std::string, we don't need the function to include the ASCII-Z terminator.
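// Include paths for the Crypto++ headers depend on how the library is installed.
#include <windows.h>
#include <iostream>
#include <stdexcept>
#include <string>
#include "sha.h"
#include "hex.h"
#include "filters.h"
#include "channels.h"
using namespace std;
using namespace CryptoPP;
int main(int argc, char* argv[])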
{
wstring m1 = L"Привет"; string m2;
int req = WideCharToMultiByte(CP_UTF8, 0, m1.c_str(), (int)m1.length(), NULL, 0, NULL, NULL);
if(req < 0 || req == 0)
throw runtime_error("Failed to convert string");
m2.resize((size_t)req);
int cch = WideCharToMultiByte(CP_UTF8, 0, m1.c_str(), (int)m1.length(), &m2[0], (int)m2.length(), NULL, NULL);
if(cch < 0 || cch == 0)
throw runtime_error("Failed to convert string");
// Should not be required
m2.resize((size_t)cch);
string s1, s2, s3, s4, s5;
SHA1 sha1; SHA224 sha224; SHA256 sha256; SHA384 sha384; SHA512 sha512;
HashFilter f1(sha1, new HexEncoder(new StringSink(s1)));
HashFilter f2(sha224, new HexEncoder(new StringSink(s2)));
HashFilter f3(sha256, new HexEncoder(new StringSink(s3)));
HashFilter f4(sha384, new HexEncoder(new StringSink(s4)));
HashFilter f5(sha512, new HexEncoder(new StringSink(s5)));
ChannelSwitch cs;
cs.AddDefaultRoute(f1);
cs.AddDefaultRoute(f2);
cs.AddDefaultRoute(f3);
cs.AddDefaultRoute(f4);
cs.AddDefaultRoute(f5);
StringSource ss(m2, true /*pumpAll*/, new Redirector(cs));
cout << "SHA-1: " << s1 << endl;
cout << "SHA-224: " << s2 << endl;
cout << "SHA-256: " << s3 << endl;
cout << "SHA-384: " << s4 << endl;
cout << "SHA-512: " << s5 << endl;
return 0;
}
You say ‘but it returns wrong digest’ – what are you comparing it with?
Key point: digests such as SHA-1 don't work with sequences of characters, but with sequences of bytes.
What you're doing in this snippet of code is generating an ad-hoc encoding of the Unicode characters in the string "Ы". This encoding will (as it turns out) match the UTF-16 encoding if the characters in the string are all in the BMP (‘basic multilingual plane’, which is true in this case) and if the numbers that end up in wcharacter are integers representing Unicode codepoints (which is sort-of probably correct, but not, I think, guaranteed).
If the digest you're comparing it with turns an input string into a sequence of bytes using the UTF-8 encoding (which is quite likely), then that will produce a different byte sequence from yours, so that the SHA-1 digest of that sequence will be different from the digest you calculate here.
So:
Check what encoding your test string is using.
You'd be best off using some library functions to specifically generate a UTF-16 or UTF-8 (as appropriate) encoding of the string you want to process, to ensure that the byte sequence you're working with is what you think it is.
There's an excellent introduction to unicode and encodings in the aptly-named document The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
This seems to work fine for me.
Rather than fiddling about trying to extract the pieces I simply cast the wide character buffer to a const byte* and pass that (and the adjusted size) to the hash function.
int main() {
std::wstring string(L"Привет");
CryptoPP::SHA1 sha1;
std::string hash;
CryptoPP::StringSource ss(
reinterpret_cast<const byte*>(string.c_str()), // cast to const byte*
string.size() * sizeof(std::wstring::value_type), // adjust for size
true,
new CryptoPP::HashFilter(sha1,
new CryptoPP::HexEncoder(
new CryptoPP::StringSink(hash)
)
)
);
std::cout << hash << std::endl;
return 0;
}
Output:
C6F8291E68E478DD5BD1BC2EC2A7B7FC0CEE1420
EDIT: To add.
The result is going to be encoding dependent. For example, I ran this on Linux, where wchar_t is 4 bytes. On Windows I believe wchar_t may be only 2 bytes.
For consistency it may be better to use UTF-8 and store the text in a normal std::string. This also makes calling the API simpler:
int main() {
std::string string("Привет"); // UTF-8 encoded
CryptoPP::SHA1 sha1;
std::string hash;
CryptoPP::StringSource ss(
string,
true,
new CryptoPP::HashFilter(sha1,
new CryptoPP::HexEncoder(
new CryptoPP::StringSink(hash)
)
)
);
std::cout << hash << std::endl;
return 0;
}
Output:
2805AE8E7E12F182135F92FB90843BB1080D3BE8

how to read a particular string from a buffer

I have a buffer
char buffer[size];
which I am using to store the file contents of a stream (pStream here):
HRESULT hr = pStream->Read(buffer, size, &cbRead );
Now I have all the contents of this stream in buffer, which holds size bytes. I know that two strings,
"<!doctortype html" and ".html>"
are present somewhere inside the stored contents of this buffer (we don't know their locations), and I want to copy just the contents of the buffer from
"<!doctortype html" up to the string ".html>"
into another buffer, buffer2[SizeWeDontKnow].
How can I do that? (The contents between these two locations are the contents of an HTML file, and I want to store only that HTML file's contents.) Any ideas?
You can use the strnstr function to find the right position in your buffer. Note that strnstr is a BSD extension; strstr is the standard alternative, but it requires a NUL-terminated buffer. After you've found the starting and ending tags, you can extract the text in between using strncpy, or use it in place if performance is an issue.
You can calculate the needed size from the positions of the tags and the length of the first tag: nLength = nPosEnd - nPosStart - nStartTagLength
Look for HTML parsers for C/C++.
Another way is to keep a char pointer at the start of the buffer and then check each char thereafter to see whether it matches your requirement.
If that's the only operation which operates on HTML code in your app, then you could use the solution I provided below (you can also test it online - here). However, if you are going to do some more complicated parsing, then I suggest using some external library.
#include <iostream>
#include <cstdio>
#include <cstring>
using namespace std;
int main()
{
const char* beforePrefix = "asdfasdfasdfasdf";
const char* prefix = "<!doctortype html";
const char* suffix = ".html>";
const char* postSuffix = "asdasdasd";
const unsigned size = 1024; // constant, so the arrays below are not variable-length arrays
char buf[size];
sprintf(buf, "%s%sTHE STRING YOU WANT TO GET%s%s", beforePrefix, prefix, suffix, postSuffix);
cout << "Before: " << buf << endl;
const char* firstOccurenceOfPrefixPtr = strstr(buf, prefix);
const char* firstOccurenceOfSuffixPtr = strstr(buf, suffix);
if (firstOccurenceOfPrefixPtr && firstOccurenceOfSuffixPtr)
{
unsigned textLen = (unsigned)(firstOccurenceOfSuffixPtr - firstOccurenceOfPrefixPtr - strlen(prefix));
char newBuf[size];
strncpy(newBuf, firstOccurenceOfPrefixPtr + strlen(prefix), textLen);
newBuf[textLen] = 0;
cout << "After: " << newBuf << endl;
}
return 0;
}
EDIT
I get it now :). You should use strstr to find the first occurrence of the prefix then. I edited the code above, and updated the link.
Are you limited to C, or can you use C++?
In the C library reference there are plenty of useful ways of tokenising strings and comparing for matches (string.h):
http://www.cplusplus.com/reference/cstring/
Using C++ I would do the following (using buffer and size variables from your code):
// copy char array to std::string
std::string text(buffer, buffer + size);
// define what we're looking for
std::string begin_text("<!doctortype html");
std::string end_text(".html>");
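// find the start and end of the text we need to extract
size_t begin_pos = text.find(begin_text);
size_t end_pos = text.find(end_text);
// create a substring from the positions; substr takes (start, length), not (start, end)
std::string extract;
if (begin_pos != std::string::npos && end_pos != std::string::npos && end_pos >= begin_pos + begin_text.length()) {
begin_pos += begin_text.length();
extract = text.substr(begin_pos, end_pos - begin_pos);
}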
// test that we got the extract
std::cout << extract << std::endl;
If you need C string compatibility you can use (c_str() returns a pointer to const):
const char* tmp = extract.c_str();

C++ Character Encoding

This is my C++ code, where I'm trying to encode the received file path to UTF-8.
#include <string>
#include <cstring>
#include <iostream>
using namespace std;
void latin1_to_utf8(unsigned char *in, unsigned char *out);
string encodeToUTF8(string _strToEncode);
int main(int argc,char* argv[])
{
// Code to receive fileName from Sockets
cout << "recvd ::: " << recvdFName << "\n";
string encStr = encodeToUTF8(recvdFName);
cout << "encoded :::" << encStr << "\n";
}
void latin1_to_utf8(unsigned char *in, unsigned char *out)
{
while (*in)
{
if (*in<128)
{
*out++=*in++;
}
else
{
*out++=0xc2+(*in>0xbf);
*out++=(*in++&0x3f)+0x80;
}
}
*out = '\0';
}
string encodeToUTF8(string _strToEncode)
{
int len= _strToEncode.length();
unsigned char* inpChar = new unsigned char[len+1];
unsigned char* outChar = new unsigned char[2*(len+1)];
memset(inpChar,'\0',len+1);
memset(outChar,'\0',2*(len+1));
memcpy(inpChar,_strToEncode.c_str(),len);
latin1_to_utf8(inpChar,outChar);
string _toRet = (const char*)(outChar);
delete[] inpChar;
delete[] outChar;
return _toRet;
}
And the output is
recvd ::: /Users/zeus/ÄÈÊÑ.txt
encoded ::: /Users/zeus/AÌEÌEÌNÌ.txt
The function latin1_to_utf8 above was given as a solution in Convert ISO-8859-1 strings to UTF-8 in C/C++, and that answer was accepted, so it appears to work. I think I must be making some mistake, but I'm not able to identify what it is. Can someone please help me out with this?
I first posted this question on Code Review, but I'm not getting any answers there, so sorry for the duplication.
Are you using a framework, or are you building directly on top of std? Many people need such conversions, so there are libraries for it. I strongly recommend using a library, because it is tested and usually implements the best-known approach.
One library I found that does this is Boost.Locale.
It is a de facto standard. If you use Qt, I recommend Qt's conversion facilities for this (they are platform independent).
In case you want to do it yourself (to see how it works or for any other reason):
1. Make sure that you allocate memory! This is very important in C and C++. Since you use iostream, use new to allocate memory and delete to release it (this matters too: C++ won't figure out when to release it for you. That is the developer's job here; C++ is hardcore :D )
2. Check that you allocate the right amount of memory. I expect the Unicode text to need more memory (it encodes more symbols and sometimes uses larger numbers).
3. As already mentioned above, read from somewhere (terminal or file) but write the output to a new file. After that, when you open the file with a text editor, make sure you set the encoding to UTF-8 (your text editor has to know how to interpret the data).
I hope that helps.
You are first outputting the original Latin-1 string to a terminal expecting a certain encoding, probably Latin-1. You then transcode to UTF-8 and output it to the same terminal, which interprets it differently. Classic mojibake. Try the following with the output instead:
for(size_t i=0, len=strlen(reinterpret_cast<const char*>(outChar)); i!=len; ++i)
std::cout << static_cast<unsigned>(static_cast<unsigned char>(outChar[i])) << ' ';
Note that the two casts are there first to get the unsigned byte value and then to get an unsigned integer, which keeps the stream from treating it as a char. Your char might already be unsigned, but that's compiler-dependent. The cast inside strlen is needed because outChar is an unsigned char* in your code.