Why are Crypto++ and Ruby generating slightly different SHA-1 hashes? - c++

I'm using two different libraries to generate a SHA-1 hash for use in file validation - an older version of the Crypto++ library and the Digest::SHA1 class implemented by Ruby. While I've seen other instances of mismatched hashes caused by encoding differences, the two libraries are outputting hashes that are almost identical.
For instance, passing a file through each process produces the following results:
Crypto++:
01c15e4f46d8181b984fa2a2c740f8f67130acac
Ruby:
eac15e4f46d8181b984fa2a2c740f8f67130acac
As you can see, only the first two characters of the hash string differ, and this behavior repeats itself across many files. I've looked at the source code for each implementation, and the only difference I found at first glance was in the initialization constants used for the 160-bit state. I have no idea how those constants are used in the algorithm, and I figured it'd probably be quicker to ask here in case anyone has encountered this issue before.
I've included the initialization data from the respective libraries below. I also included the values from OpenSSL, since each of the three libraries formats these values slightly differently.
Crypto++:
digest[0] = 0x67452301L;
digest[1] = 0xEFCDAB89L;
digest[2] = 0x98BADCFEL;
digest[3] = 0x10325476L;
digest[4] = 0xC3D2E1F0L;
Ruby:
context->state[0] = 0x67452301;
context->state[1] = 0xEFCDAB89;
context->state[2] = 0x98BADCFE;
context->state[3] = 0x10325476;
context->state[4] = 0xC3D2E1F0;
OpenSSL:
#define INIT_DATA_h0 0x67452301UL
#define INIT_DATA_h1 0xefcdab89UL
#define INIT_DATA_h2 0x98badcfeUL
#define INIT_DATA_h3 0x10325476UL
#define INIT_DATA_h4 0xc3d2e1f0UL
By the way, here is the code being used to generate the hash in Ruby. I do not have access to the source code for the Crypto++ implementation.
File.class_eval do
  def self.hash_digest filename, options = {}
    opts = {:buffer_length => 1024, :method => :sha1}.update(options)
    hash_func = (opts[:method].to_s == 'sha1') ? Digest::SHA1.new : Digest::MD5.new
    open(filename, "r") do |f|
      while !f.eof
        b = f.read
        hash_func.update(b)
      end
    end
    hash_func.hexdigest
  end
end


I would guess that you are off by a byte in printing out the SHA-1 hashes. Can we see the code that is printing them? If not, here are a couple of potentially useful diagnostics:
Make a very short file (say, one word), and put its contents in as a hex string at http://www.fileformat.info/tool/hash.htm. You would need to know the exact hex contents of the file, though. You can use xxd to get that on Unix, but you'll have to watch out for endianness issues. I'm not sure how to do it on other OSs.
Does running the same file through the same SHA-1 implementation several times always print the same value in that first byte? If so, does that value change when you change files?

This isn't making much sense. If there were something wrong with the SHA-1 implementation, such as with those initialization values, it would produce hashes that are completely different from the real SHA-1 hashes, not hashes that are just one byte off. Even if there were something wrong with your file-reading loop, such that it dropped a newline or something, changing one byte in the stream would still give you a completely different hash; it wouldn't be one byte off from the real SHA-1 hash.
When I use your method in the following program, I get the correct results.
#!/usr/bin/env ruby
require 'digest/sha1'
require 'digest/md5'

File.class_eval do
  def self.hash_digest filename, options = {}
    opts = {:buffer_length => 1024, :method => :sha1}.update(options)
    hash_func = (opts[:method].to_s == 'sha1') ? Digest::SHA1.new : Digest::MD5.new
    open(filename, "r") do |f|
      while !f.eof
        b = f.read
        hash_func.update(b)
      end
    end
    hash_func.hexdigest
  end
end
puts File.hash_digest(ARGV[0])
And its output compared with that of OpenSSL.
tmp$ dd if=/dev/urandom of=random.bin bs=1MB count=1
1+0 records in
1+0 records out
1000000 bytes (1.0 MB) copied, 0.287903 s, 3.5 MB/s
tmp$ ./digest.rb random.bin
a511d8153426ebea4e4694cde78db4e3a9e413d1
tmp$ openssl sha1 random.bin
SHA1(random.bin)= a511d8153426ebea4e4694cde78db4e3a9e413d1
So there's nothing wrong with your hashing method. Something is going wrong between its return value and it being printed.
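As an illustration of where such printing bugs often hide, here is a minimal, hedged sketch of hex-printing a 20-byte digest in C++ (this is not the asker's actual code, just the safe pattern):

#include <cstdio>

void print_digest(const unsigned char digest[20])
{
    // Index from 0 and print each byte as unsigned: printing through a
    // plain (signed) char can sign-extend bytes >= 0x80 (printing
    // "ffffffea" instead of "ea"), and any off-by-one in the loop bounds
    // or buffer offset will garble the output.
    for (int i = 0; i < 20; ++i)
        std::printf("%02x", digest[i]);
    std::printf("\n");
}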

Related

ZLib GZIP Returning Z_BUF_ERROR(-5)

I am using the zlib library (compiled from src) to deflate/inflate gzip/zlib/raw bytes. I have created a wrapper class for decompressing and compressing (Compressor/Decompresser). I have also created several test cases (GZIP, ZLib, Raw, Auto-Detect). The tests pass for Zlib/Raw/Auto-Detect(Zlib), but not for GZip (window bits of 15u | 16u).
Here is my compress function.
std::vector<char> out(zlib->avail_in + 8);
deflateInit2(zlib.get(), Z_DEFAULT_COMPRESSION, Z_DEFLATED, static_cast<int32_t>(mode), 8, Z_DEFAULT_STRATEGY);
zlib->avail_out = out.size();
zlib->next_out = reinterpret_cast<Bytef*>(out.data());
deflate(zlib.get(), Z_FINISH);
out.resize(zlib->total_out + 3);
deflateEnd(zlib.get());
return std::move(out);
And here is decompress
uIntf multiplier = 2;
uIntf currentSize = zlib->avail_in * (multiplier++) * 1000 /* Just to make sure enough output space(will implement loop) */;
std::vector<char> out(currentSize);
inflateInit2(zlib.get(), static_cast<int>(mode));
zlib->avail_out = out.size();
zlib->next_out = reinterpret_cast<Bytef*>(out.data());
inflate(zlib.get(), Z_FINISH);
out.resize(zlib->total_out);
inflateEnd(zlib.get());
return std::move(out);
Input is set in a separate function (which is called first) that looks like this; the char* is not deleted while compress/decompress runs:
zlib->next_in = reinterpret_cast<Bytef*>(bytes);
zlib->avail_in = static_cast<uIntf>(length);
I also have a mode enum
enum class Mode : int32_t {
AUTO = 15u | 32u, // Never used on compress
GZIP = 15u | 16u,
ZLIB = 15,
RAW = -15
};
Note: the test cases with mode AUTO (paired with zlib), ZLib, and RAW work; GZip fails. (The test case is just a simple alphanumeric character array.)
I also debugged the output of the gzip decompress (after it failed), and the output is missing the last 3 characters (y, z, and the terminating character).
Another note:
The constructor of the wrapper classes look like this
zlib->zalloc = Z_NULL;
zlib->zfree = Z_NULL;
zlib->opaque = Z_NULL;
zlib->avail_in = 0;
zlib->next_in = Z_NULL;
First off, a bunch of scattered code fragments with no context makes it impossible to see what's happening. See How to create a Minimal, Reproducible Example for how to provide a decent example.
Second, you are not saying what is returning Z_BUF_ERROR. There aren't even any places in your code fragments where you retain the return values of deflate() or inflate(), so it's not even possible for you to see a Z_BUF_ERROR! You need to at least do something like int ret = deflate(zlib.get(), Z_FINISH); and then check the value of ret.
Third, I cannot tell in your code fragments where or even if you set the input pointer and length. Is the length set to zero before the inits? Or is it set to the data? Or is the data pointer and length set after the inits? See the MRE link above.
Fourth, we don't have the example data that you're using. So we cannot reproduce the error. Again, see the MRE link.
OK, so making a stab in the dark here, I will guess that deflate() is returning the error. Then the problem is likely that you have not provided enough output space, and you have asked for Z_FINISH, which tells deflate() that you have provided enough output space. In that case, deflate() returning Z_BUF_ERROR means that you didn't. Compression can expand the data if it is not compressible, and gzip adds more header and trailer information than zlib. Your + 8 is inadequate to account for either of those things: a zlib header and trailer is six bytes, whereas a gzip header and trailer is at least 18 bytes, and the expansion is a multiplier on the input, adding some fraction of a percent, whereas you have no multiplier on the length at all.
zlib provides a function for just this purpose, deflateBound(). You would call that after deflateInit() with the size of your input, and it will return the maximum size of the compressed output.
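As a minimal sketch (a fragment only, requiring <zlib.h> and <vector>; error handling omitted, and sourceLen standing in for the input length you put in avail_in):

z_stream strm = {};
// Initialize for gzip output (15 | 16), as in the failing test case.
deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 15 | 16, 8,
             Z_DEFAULT_STRATEGY);
// Worst-case size of the compressed output for sourceLen input bytes,
// including the gzip header and trailer.
uLong bound = deflateBound(&strm, sourceLen);
std::vector<char> out(bound);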
However, it is better to call deflate() multiple times in a loop, and for most practical applications it is necessary to call inflate() multiple times in a loop. Your own comment acknowledges this, as does your attempt (also inadequate) to account for the possible size of the inflated data by multiplying by a thousand.
You can find a heavily commented example of how to use the zlib functions properly, with loops, at zlib Usage Example.
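Condensed from that example, the core of a deflate loop looks roughly like this (a sketch only: it requires <zlib.h>, assumes next_in/avail_in are already set on strm, that outbuf is a CHUNK-sized byte buffer, and that outfile is some output sink such as a std::ofstream):

int ret;
do {
    strm.avail_out = CHUNK;
    strm.next_out = outbuf;
    ret = deflate(&strm, Z_FINISH);   // retain and check the return value
    size_t have = CHUNK - strm.avail_out;
    outfile.write(reinterpret_cast<char*>(outbuf), have);
} while (ret != Z_STREAM_END);        // Z_FINISH is complete only at Z_STREAM_END
deflateEnd(&strm);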

Why pkzip accept two passwords?

I'm trying to do this homework: https://www.root-me.org/en/Challenges/Cryptanalysis/File-PKZIP. I wrote a function to crack it:
import subprocess
from time import sleep

file = open('/home/begood/Downloads/SecLists-master/Passwords/'
            'rockyou-75.txt', 'r')
lines = file.readlines()
file.close()
for line in lines:
    command = 'unzip -P ' + line.strip() + ' /home/begood/Downloads/ch5.zip'
    print command
    p = subprocess.Popen(
        command,
        stdout=subprocess.PIPE, shell=True).communicate()[0]
    if 'replace' in p:
        print 'y\n'
        sleep(1)
It stops at the password scooter:
unzip -P scooter /home/begood/Downloads/ch5.zip
replace readme.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename:
but when I use that password to unzip, it says:
inflating: /home/begood/readme.txt
error: invalid compressed data to inflate
And the real password is 14535. Why does pkzip accept two passwords?
I presume that the encryption being used is the old, very weak, encryption that was part of the original PKZIP format.
That encryption method has a 12-byte salt header before the compressed data. From the PKWare specification:
After the header is decrypted, the last 1 or 2 bytes in Buffer
should be the high-order word/byte of the CRC for the file being
decrypted, stored in Intel low-byte/high-byte order. Versions of
PKZIP prior to 2.0 used a 2 byte CRC check; a 1 byte CRC check is
used on versions after 2.0. This can be used to test if the password
supplied is correct or not.
It was originally two bytes in the 1.0 specification, but in the 2.0 specification, and in the associated version of PKZIP, the check value was changed to one byte in order to make password searches like the one you are doing more difficult. The result is that about one out of every 256 random passwords will pass that first check, proceed to try to decompress the incorrectly decrypted compressed data, and only then run into an error.
So it's far, far more than two passwords that will be "accepted". However it won't take very many bytes of decompressed data to detect that the password was nevertheless incorrect.
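To make that concrete, here is a hedged C++ sketch of the 2.0-style check described in the specification quoted above (the names are illustrative, not from any particular implementation):

// After decrypting the 12-byte encryption header with a candidate
// password, only the last byte is compared against the high-order byte
// of the file's CRC-32, so roughly 1 in 256 wrong passwords still pass.
bool password_passes_quick_check(const unsigned char decrypted_header[12],
                                 unsigned long file_crc32)
{
    return decrypted_header[11] == (unsigned char)(file_crc32 >> 24);
}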

encrypting a large file by cryptoapi in C++

I am using CryptoAPI to encrypt a file (asymmetric encryption). Everything is OK, but when the file is large, it cannot encrypt it. I searched and found that I must encrypt block by block, with the Final flag in the CryptEncrypt function set to false for every block except the last.
I understand the concept, but I don't know how to implement it; that is, how to read, encrypt, and write block by block.
Can you give me a real code example?
Update:
I used the code of this website: http://blogs.msdn.com/b/alejacma/archive/2008/01/28/how-to-generate-key-pairs-encrypt-and-decrypt-data-with-cryptoapi.aspx
I am writing this solution for programmers who will have this problem in the future:
In this link has been shown how to encrypt large file (block by block):
https://msdn.microsoft.com/en-us/library/windows/desktop/aa382358%28v=vs.85%29.aspx
Note: some things must be changed when you use the above code:
1) In encryption, the block size must be set to 128 - 11 (DWORD dwBlockLen = 128 - 11)
2) In decryption, the block size must be set to 128 (DWORD dwBlockLen = 128)
Both were tested on Windows 7.
Try something like:

final_flag <- false
repeat
    this_block <- read_next_block(file)
    if (is_EoF(file)) { final_flag <- true }
    encrypt(this_block, final_flag)
until (final_flag == true)

I don't know enough about C++ file handling to write a working end-of-file check, but there should be one in there somewhere.
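Filling in that pseudocode, here is a hedged C++ sketch of the read/encrypt/write loop, assuming an already-acquired key handle hKey and the 1024-bit RSA block sizes from the update above (error handling omitted; a zero-length final call can occur when the file size is an exact multiple of the block size):

#include <windows.h>
#include <wincrypt.h>
#include <cstdio>

bool encrypt_file(HCRYPTKEY hKey, const char* inPath, const char* outPath)
{
    const DWORD dwBlockLen = 128 - 11; // plaintext block for a 1024-bit RSA key
    const DWORD dwBufLen   = 128;      // room for the encrypted block
    BYTE buffer[128];

    std::FILE* in  = std::fopen(inPath, "rb");
    std::FILE* out = std::fopen(outPath, "wb");
    if (!in || !out) return false;

    BOOL finalBlock = FALSE;
    while (!finalBlock) {
        DWORD dwCount = (DWORD)std::fread(buffer, 1, dwBlockLen, in);
        finalBlock = std::feof(in) ? TRUE : FALSE; // last block => Final = TRUE
        // Encrypt in place; dwCount is updated to the encrypted length.
        if (!CryptEncrypt(hKey, 0, finalBlock, 0, buffer, &dwCount, dwBufLen))
            return false;
        std::fwrite(buffer, 1, dwCount, out);
    }
    std::fclose(in);
    std::fclose(out);
    return true;
}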

Ensure File Extension Matches File Type in C++

Hi, all. I am trying to write a C++ program that will iterate through a user-specified directory (e.g. /home/alpernick/Pictures). Primarily, this is to ensure that there are no duplicates (checked via md5sum).
But one feature I truly want to include is to ensure that the extension of a filename matches the file's type.
For example, if the file's name is "sunrise.png" I want to ensure that it actually is indeed a PNG and not a mislabeled JPEG (for example).
I am approaching this with four functions, as follows.
string extension(string fileName) // returns the extension of fileName (including .tar.gz handling, so it isn't blindly just returning the last 3 characters)
string fileType(string fileName) // This one is the key -- it returns the actual file type, so if the file named fileName is a PNG, fileType() will return PNG, regardless of the return value of extension()
string basename(string fileName) // Returns the basename of the file, i.e. everything before the extension (so, for sunset.jpg, it would return sunset; for fluffytarball.tar.gz, it would return fluffytarball)
string renameFile(string incorrectFileName, string fileNameBeforeExtension, string actualFileType) // Returns a string whose value is the basename concatenated with the correct file extension.
string file = "sunset.jpg";
/* Setting file to a hard-coded value for illustrative purposes only */
if (extension(file) != fileType(file))
{
    string fixedName = renameFile(file, basename(file), fileType(file));
    puts(fixedName.c_str());
}
I have zero issues with the string processing. I'm stuck, however, on fileType(). I want this program to not only run on my primary machine (Kubuntu 14.04), but also to be capable of being run on a Windows machine as well. So, it seems I need some library or set of libraries that would be common to both (or at the least compiled for both).
Any help/advice?
There are more exceptions than rules for guessing the actual type of a file based on its contents.
This is exacerbated by the fact that a file can be valid and useful when interpreted as two completely different file types.
For a good program that tries to guess from such insufficient data, look at the file utility on Unix-like systems.
You could try looking at the file source code: https://github.com/file/file.
But as Wikipedia states:
file's position-sensitive tests are normally implemented by matching various locations within the file against a textual database of magic numbers (see the Usage section). This differs from other simpler methods such as file extensions and schemes like MIME.
In most implementations, the file command uses a database to drive the probing of the lead bytes. That database is implemented in a file called magic, whose location is usually in /etc/magic, /usr/share/file/magic or a similar location.
So it does not seem trivial.
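That said, for a small, known set of formats you can get a useful approximation by checking well-known magic numbers yourself. Here is a minimal, portable sketch of a fileType() along those lines (it covers only three formats and is in no way a substitute for libmagic's database):

#include <cstdio>
#include <cstring>
#include <string>

std::string fileType(const std::string& fileName)
{
    unsigned char buf[8] = {0};
    std::FILE* f = std::fopen(fileName.c_str(), "rb");
    if (!f) return "unknown";
    std::size_t n = std::fread(buf, 1, sizeof buf, f);
    std::fclose(f);

    // PNG: 89 50 4E 47 0D 0A 1A 0A; JPEG: FF D8 FF; GIF: "GIF8"
    static const unsigned char png[8] = {0x89,'P','N','G','\r','\n',0x1A,'\n'};
    if (n >= 8 && std::memcmp(buf, png, 8) == 0) return "png";
    if (n >= 3 && buf[0] == 0xFF && buf[1] == 0xD8 && buf[2] == 0xFF) return "jpg";
    if (n >= 4 && std::memcmp(buf, "GIF8", 4) == 0) return "gif";
    return "unknown";
}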

Read text file step-by-step

I have a file which has text like this:
#1#14#ADEADE#CAH0F#0#0.....
I need to create a code that will find text that follows # symbol, store it to variable and then writes it to file WITHOUT # symbol, but with a space before. So from previous code I will get:
1 14 ADEADE CAH0F 0 0......
I first tried to do it in Python, but the files are really big and it takes a huge amount of time to process them, so I decided to write this part in C++. However, I know nothing about C++ regex, and I'm looking for help. Could you please recommend an easy, well-documented regex library (I don't know C++ very well)? It would be even better if you provided a small example. (I know how to write to a file using fstream, but I need help with reading the file, as I said before.)
This looks like a job for std::locale and his trusty sidekick imbue:
#include <locale>
#include <iostream>
#include <vector>

struct hash_is_space : std::ctype<char> {
    hash_is_space() : std::ctype<char>(get_table()) {}
    static mask const* get_table()
    {
        // Start from the classic table so ordinary whitespace still
        // separates words, then additionally classify '#' as whitespace.
        static std::vector<mask> rc(classic_table(),
                                    classic_table() + table_size);
        rc['#'] = std::ctype_base::space;
        return rc.data();
    }
};

int main() {
    using std::string;
    using std::cin;
    using std::locale;

    // Replace cin's ctype facet so operator>> treats '#' as a delimiter.
    cin.imbue(locale(cin.getloc(), new hash_is_space));

    string word;
    while (cin >> word) {
        std::cout << word << " ";
    }
    std::cout << "\n";
}
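For example, assuming the program above is compiled to a.out, the sample line from the question comes out as desired (each token followed by a single space):

$ echo '#1#14#ADEADE#CAH0F#0#0' | ./a.out
1 14 ADEADE CAH0F 0 0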
IMO, C++ is not the best choice for your task. But if you have to do it in C++ I would suggest you have a look at Boost.Regex, part of the Boost library.
If you are on Unix, a simple sed 's/#/ /g' <infile >outfile would suffice.
Sed stands for 'stream editor' (and supports regexes! whoo!), so it would be well-suited for the performance that you are looking for.
Alright, I'm just going to make this an answer instead of a comment. Don't use regex; it's almost certainly overkill for this task. I'm a little rusty with C++, so I'll not post any ugly code, but essentially what you could do is parse the file one character at a time, putting anything that isn't a # into a buffer, then writing the buffer out to the output file along with a space when you do hit a #. In C#, at least two really easy methods for solving this come to mind:
StreamReader fileReader = new StreamReader(new FileStream("myFile.txt",
                                                          FileMode.Open));
string fileContents = fileReader.ReadToEnd();
string outFileContents = fileContents.Replace("#", " ");
StreamWriter outFileWriter = new StreamWriter(new FileStream("outFile.txt",
                                                             FileMode.Create),
                                              Encoding.UTF8);
outFileWriter.Write(outFileContents);
outFileWriter.Flush();
Alternatively, you could replace
string outFileContents = fileContents.Replace("#", " ");
With
StringBuilder outFileContents = new StringBuilder();
string[] parts = fileContents.Split('#');
foreach (string part in parts)
{
    outFileContents.Append(part);
    outFileContents.Append(" ");
}
I'm not saying you should do it either of these ways or my suggested method for C++, nor that any of these methods are ideal - I'm just pointing out here that there are many many ways to parse strings. Regex is awesome and powerful and may even save the day in extreme circumstances, but it's not the only way to parse text, and may even destroy the world if used for the wrong thing. Really.
If you insist on using regex (or are forced to, as in for a homework assignment), then I suggest you listen to Chris and use Boost.Regex. Alternatively, I understand Boost has a good string library as well if you'd like to try something else. Just look out for Cthulhu if you do use regex.
You've left out one crucial point: if you have two (or more) consecutive #s in the input, should they turn into one space, or into the same number of spaces as there are #s?
If you want to turn an entire run of #s into a single space, then @Rob's solution should work quite nicely.
If you want each # turned into a space, then it's probably easiest to just write C-style code:
#include <stdio.h>

int main() {
    int ch;
    while (EOF != (ch = getchar()))
        if (ch == '#')
            putchar(' ');
        else
            putchar(ch);
    return 0;
}
So, you want to replace each single character '#' with the single character ' ', right?
Then it's easy to do, since you can replace any portion of the file with a string of exactly the same length without disturbing the organization of the file.
Repeating such a replacement lets you transform the file chunk by chunk, so you avoid reading the whole file into memory, which is a problem when the file is very big.
Here's the code in Python 2.7.
Chunk-by-chunk replacement alone may not be enough to make it fast, and writing the same thing in C++ may be hard (a C++ sketch follows the notes below). But in general, when I have proposed code like this, it has improved execution time satisfactorily.
def treat_file(file_path, chunk_size):
    from os import fsync
    from os.path import getsize
    file_size = getsize(file_path)
    with open(file_path, 'rb+') as g:
        fd = g.fileno()  # file descriptor, an integer
        while True:
            x = g.read(chunk_size)
            g.seek(-len(x), 1)
            g.write(x.replace('#', ' '))
            g.flush()
            fsync(fd)
            if g.tell() == file_size:
                break
Comments:
open(file_path, 'rb+')
It is essential to open the file in binary mode 'b' to control precisely the positions and movements of the file pointer; mode '+' makes the file readable AND writable.
fd = g.fileno()
The file descriptor, an integer.
x = g.read(chunk_size)
Reads a chunk of size chunk_size. Matching it to the size of the underlying read buffer would be ideal, but I don't know how to find that buffer's size, so a good choice is a power of 2.
g.seek(-len(x), 1)
Moves the file pointer back to the position from which the chunk was just read. It must be len(x), not chunk_size, because the last chunk read is generally shorter than chunk_size.
g.write(x.replace('#', ' '))
Writes the modified chunk over the same length.
g.flush()
fsync(fd)
These two instructions force the write; otherwise the modified chunk could remain in the write buffer and be written at an uncontrolled moment.
if g.tell() == file_size: break
After reading the last portion of the file, whatever its length (less than or equal to chunk_size), the file pointer is at the end of the file, that is to say at file_size, and the program must stop.
If you would like to replace several consecutive '#'s with only one space, the code is easily modified to meet that requirement, since writing a shortened chunk doesn't erase characters further along in the file that haven't been read yet. It just needs two file pointers.
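For reference, here is a hedged C++ sketch of the same chunk-by-chunk, in-place replacement (same constraints as the Python version: each '#' becomes exactly one ' '; error handling omitted):

#include <algorithm>
#include <fstream>
#include <vector>

void treat_file(const char* path, std::size_t chunk_size)
{
    // Open for reading and writing in binary mode, as in the Python version.
    std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
    std::vector<char> buf(chunk_size);
    for (;;) {
        f.read(buf.data(), buf.size());
        std::streamsize n = f.gcount();
        if (n == 0) break;              // nothing left to process
        f.clear();                      // clear EOF so seek/write still work
        std::replace(buf.begin(), buf.begin() + n, '#', ' ');
        f.seekp(-n, std::ios::cur);     // back to where the chunk started
        f.write(buf.data(), n);
        f.seekg(f.tellp());             // resync the get position before reading on
    }
}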