How to calculate CRC of a WinRAR file?

I know the CRC calculation algorithm from Wikipedia, and I read about the structure of a RAR file here. For example, it says:
The file has the magic number of:
0x 52 61 72 21 1A 07 00
Which is a breakdown of the following to describe an Archive Header:
0x6152 - HEAD_CRC
0x72 - HEAD_TYPE
0x1A21 - HEAD_FLAGS
0x0007 - HEAD_SIZE
If I understand correctly, HEAD_CRC (0x6152) is the CRC value of the Marker Block (MARK_HEAD). Somewhere I read that the CRC of a WinRAR file is calculated with the standard polynomial 0xEDB88320, but that when the size of the CRC is less than 4 bytes, only the less significant bytes are used. In this case (if I understand correctly) the CRC value is 0x6152, so it is 2 bytes. Now I don't know which bytes I have to take as the less significant ones. From the standard polynomial (0xEDB88320)? Then 0x8320 would presumably be the less significant bytes of that polynomial. And then how do I calculate the CRC of the Marker Block (i.e. of the bytes 0x 52 61 72 21 1A 07 00) once I have the right polynomial?

There was likely a 16-bit check for an older format that is not derived from a 32-bit CRC. The standard 32-bit CRC, used by zip and rar, applied to the last five bytes of the header has no portion equal to the first two bytes. The Polish page appears to be incorrect in claiming that the two-byte check is the low two bytes of a 32-bit CRC.
It does appear from the documentation that that header is constructed in the same way as other blocks in the older format, so that the author could, for fun, arrange for his format to give the check value "Ra" and spell out "Rar!" followed by a text-terminating control-Z.
I found another 16-bit check in the unrar source code, but that check does not result in those values either.
Oh, and no, you can't take part of a CRC polynomial and expect that to be a good CRC polynomial for a smaller check. What the page in Polish is saying is that you would compute the full 32-bit CRC and then take the low two bytes of the result. However, that doesn't work for the magic number header.
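A quick way to check that claim (a Python sketch using zlib's standard CRC-32; not part of the original answer):
import zlib

marker = bytes([0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x00])
crc = zlib.crc32(marker[2:])            # CRC-32 of the five bytes after the HEAD_CRC field
print(hex(crc & 0xffff), hex(0x6152))   # the low 16 bits do not come out to 0x6152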

Per the WinRAR TechNote.txt file included with the install:
The marker block is actually considered as a fixed byte sequence: 0x52 0x61 0x72 0x21 0x1a 0x07 0x00
And as you already indicated, at the very end you can read:
The CRC is calculated using the standard polynomial 0xEDB88320. In case the size of the CRC is less than 4 bytes, only the low order bytes are used.
In Python, the calculation and grabbing of the 2 low-order bytes goes like this:
import zlib
zlib.crc32(correct_byte_range) & 0xffff
rerar has some code that does this, just like the rarfile library that it uses. ReScene .NET source code has an algorithm in C# for calculating the CRC32 hash. See also How do I calculate CRC32 mathematically?
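As a rough sketch of how one might verify the HEAD_CRC of an ordinary block (field layout per the TechNote; the assumption that the CRC covers the header bytes from HEAD_TYPE to the end of the header is mine, so treat this as illustrative):
import struct
import zlib

def check_block_crc(block: bytes) -> bool:
    # HEAD_CRC, HEAD_TYPE, HEAD_FLAGS, HEAD_SIZE are stored little-endian
    head_crc, head_type, head_flags, head_size = struct.unpack("<HBHH", block[:7])
    # assumption: the stored CRC covers everything after the 2-byte CRC field,
    # up to HEAD_SIZE bytes, keeping only the low 16 bits of the 32-bit CRC
    computed = zlib.crc32(block[2:head_size]) & 0xffff
    return computed == head_crc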

Related

How to calculate CRC-32 for a 24-bit-long hex value (for example 0xAAAAAA00)?

Generally, CRC-32 is calculated for 32 bits and multiples thereof. I want to calculate CRC-32 for a 24-bit number. How do I perform such an operation? I'm not from a computer science background, so I don't have a thorough understanding of CRC-32; kindly help.
The actual math is, in effect, appending 32 zero bits to the 24-bit number when calculating a CRC. A software version emulates this by cycling the CRC as needed.
To simplify things, assume the number is stored in big-endian format. Then the 24-bit value could be placed into a 32-bit register, and the 32-bit register cycled 32 times (emulating appending 32 zero bits) to produce a CRC. Since putting a 24-bit number into a 32-bit register leaves 8 leading zero bits, the first step could just shift the 24-bit number left 8 bits and then cycle the CRC 24 times, as in the sketch below.
If processing a byte at a time using a table lookup and 3 bytes of data that hold the 24-bit number, the process XORs the next byte into the upper 8 bits of the 32-bit CRC register, then uses the table to emulate cycling the 32-bit CRC 8 times.
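A minimal sketch of the bit-at-a-time version described above (a pure CRC with initial value 0, no reflection and no final XOR; real-world CRC-32 variants such as the one used by zip add reflection and XOR steps):
def crc32_of_24bit_value(value: int, poly: int = 0x04C11DB7) -> int:
    # place the 24-bit number in the top of a 32-bit register (shift left 8),
    # then cycle 24 times, which is equivalent to appending 32 zero bits
    crc = (value << 8) & 0xFFFFFFFF
    for _ in range(24):
        if crc & 0x80000000:
            crc = ((crc << 1) ^ poly) & 0xFFFFFFFF
        else:
            crc = (crc << 1) & 0xFFFFFFFF
    return crc

print(hex(crc32_of_24bit_value(0xAAAAAA)))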

Invert orientation of memory space?

I'm trying to read some bytes from a file.
This is what I've done:
struct HeaderData {
char format[2];
char n_trks[2];
char division[2];
};
HeaderData* header = new HeaderData;
Then, to get the data directly from the file into header, I do
file.read(reinterpret_cast<char*>(header), sizeof(HeaderData));
If the first two bytes are 00 06, header->format[0] will be 00 and header->format[1] will be 06. These two numbers combined represent the number 0x0006, which is 6 in decimal, which is the desired value.
When I do something like
*reinterpret_cast<unsigned*>(header->format) // In this case, the result is 0x0600
it erroneously returns the number 0x0600, so it seems that it inverts the reading of bytes.
My question is: what is a workaround to correctly read these numbers as unsigned values?
This is going to be an endianness mismatch.
When you read in from the file in that fashion, the bytes will be placed into your structure in the exact order they were in the file.
When you read from the structure with an unsigned, the processor will interpret those bytes in whatever order the architecture requires it to do (most are hardcoded but some can be set to either order).
Or to put it another way
These two numbers combined represent the number 0x0006 which is 6 in decimal.
That's not necessarily remotely true. It's perfectly permissible for the processor of your choice to represent 6 in decimal as 0x06 0x00; this would be the little-endian scheme, which is used on very common processors like x86. Representing it as 0x00 0x06 would be big-endian.
As M.M has stated in his comment, if your format explicitly defines the integer to be little-endian, you should explicitly read it as little-endian, e.g. format[0] + format[1] * 256, or if it is defined to be big-endian, you should read it as format[0] * 256 + format[1]. Don't rely on the processor's endianness happening to match the endianness of the data.
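For illustration, here are the same two bytes interpreted both ways (a small Python sketch of the idea; the question itself is C++):
import struct

raw = bytes([0x00, 0x06])            # the two bytes exactly as they appear in the file
print(struct.unpack(">H", raw)[0])   # big-endian read: 6, what the file format means
print(struct.unpack("<H", raw)[0])   # little-endian read: 1536 (0x0600), what a raw x86 load gives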

Reducing a CRC32 value to 16 or 8 bit

In a message framing scheme I want to secure packets with a CRC error detection. Those packets are sent over TCP connection.
For small packets with a length smaller than 16 bytes, I'd choose CRC8. For packets smaller than 4096 bytes a CRC16 algorithm. For larger packets the CRC32 algorithm.
The most attractive CRC implementation currently is CRC32C because of hardware support (at least on some Intel CPUs). But there are no special instructions for 8-bit and 16-bit CRCs.
My question is now: is it possible to reduce the 32-bit value of the CRC32C algorithm to a 16- or 8-bit value, without hurting the error detection performance in comparison to native CRC16 or CRC8 algorithms?
An example:
char buffer[256];
...
uint32_t crc32 = compute_crc32c_of_block( buffer, 256 );
uint16_t fake_crc16 = ( crc32 >> 16 ) ^ crc32;
uint8_t fake_crc8 = ( fake_crc16 >> 8 ) ^ fake_crc16;
Is fake_crc8 as good as a real CRC8 implementation?
Thanks in advance.
The low 8 bits of a 32-bit CRC will not have as good error-detection properties as an 8-bit CRC, e.g. the assurance of detecting burst errors. However, it may be good enough for your application, depending on the characteristics of the noise source. If you have a good number of bit errors with correlation in their location, then you should use a real CRC. If you have either rare bit flips or lots of gross errors, then a portion of a CRC would likely work just as well.
There is no substitute for testing to see how they perform.
If you can specify the polynomial for the hardware CRC, then padding it with zeros for the bits that you don't want will result in a CRC that also has zeros at those bit positions. Then you simply discard them.
Using the default data from the calculator at http://www.sunshine2k.de/coding/javascript/crc/crc_js.html (0x31 0x32 0x33 0x34 0x35 0x36 0x37 0x38 0x39, i.e. the string "123456789"), with initial value 0x0 and final XOR 0x0, here is the example:
CRC8 with 0x7 poly is 0xF4
CRC16 with 0x700 poly is 0xF400
CRC32 with 0x07000000 is 0xF4000000
Now, the problem is that the hardware may not support this kind of polynomial. For example, the 16-bit SPI hardware CRC calculator in STM32 processors only supports odd polynomials: 0x07 is odd, but the padded 0x0700 is even, and it produces garbage.
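To see the zero-padding trick in action, here is a small bit-by-bit CRC in Python (initial value 0, no reflection, no final XOR, matching the calculator settings above; this sketch is not from the original answer):
def crc_msb_first(data: bytes, poly: int, width: int) -> int:
    # plain MSB-first bitwise CRC
    crc = 0
    top = 1 << (width - 1)
    mask = (1 << width) - 1
    for byte in data:
        crc ^= byte << (width - 8)
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & mask if crc & top else (crc << 1) & mask
    return crc

data = b"123456789"
print(hex(crc_msb_first(data, 0x07, 8)))     # 0xf4
print(hex(crc_msb_first(data, 0x0700, 16)))  # 0xf400: the 8-bit result shifted up, low byte all zeros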

What encryption scheme meets requirement of decimal plaintext & ciphertext and preserves length?

I need an encryption scheme where the plaintext and ciphertext are composed entirely of decimal digits.
In addition, the plaintext and ciphertext must be the same length.
Also the underlying encryption algorithm should be an industry-standard.
I don't mind if it's symmetric (e.g. AES) or asymmetric (e.g. RSA) - but it must be a recognized algorithm for which I can get a FIPS-140 approved library. (Otherwise it won't get past the security review stage.)
Using AES OFB is fine for preserving the length of hex-based input (i.e. where each byte has 256 possible values: 0x00 --> 0xFF). However, this will not work for my purposes, as plaintext and ciphertext must be entirely decimal.
NB: "Entirely decimal" may be interpreted two ways - both of which are acceptable for my requirements:
Input & output bytes are characters '0' --> '9' (i.e. byte values: 0x30 -> 0x39)
Input & output bytes have the 100 (decimal) values: 0x00 --> 0x99 (i.e. BCD)
Some more info:
The max plaintext & ciphertext length is likely to be 10 decimal digits.
(I.e. 10 bytes if using '0'-->'9' or 5 bytes if using BCD)
Consider the following sample to see why AES fails:
Input string is 8 digit number.
Max 8-digit number is: 99999999
In hex this is: 0x5f5e0ff
This could be treated as 4 bytes: <0x05><0xf5><0xe0><0xff>
If I use AES OFB, I will get 4 byte output.
Highest possible 4-byte ciphertext output is <0xFF><0xFF><0xFF><0xFF>
Converting this back to an integer gives: 4294967295
I.e. a 10-digit number.
==> Two digits too long.
One last thing - there is no limit on the length any keys / IVs required.
Use AES/OFB, or any other stream cipher. It will generate a keystream of pseudorandom bits. Normally, you would XOR these bits with the plaintext. Instead:
For every decimal digit in the plaintext
Repeat
Take 4 bits from the keystream
Until the bits form a number less than 10
Add this number to the plaintext digit, modulo 10
To decrypt, do the same but subtract instead in the last step.
I believe this should be as secure as using the stream cipher normally. If a sequence of numbers 0-15 is indistinguishable from random, the subsequence consisting only of the numbers smaller than 10 should still be indistinguishable from random. Using add/subtract instead of XOR should still produce random output if one of the inputs is random.
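A minimal sketch of that digit-by-digit scheme (the keystream here is a SHA-256 counter stand-in just to keep the example self-contained; in practice you would take the bits from AES-OFB or another vetted stream cipher, with a fresh nonce per message):
import hashlib

def keystream_nibbles(key: bytes, nonce: bytes):
    # stand-in keystream: SHA-256 in counter mode, yielded 4 bits at a time
    counter = 0
    while True:
        block = hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        for byte in block:
            yield byte >> 4
            yield byte & 0xF
        counter += 1

def crypt_digits(digits: str, key: bytes, nonce: bytes, decrypt: bool = False) -> str:
    ks = keystream_nibbles(key, nonce)
    out = []
    for ch in digits:
        r = next(ks)
        while r >= 10:                 # rejection sampling: keep only values 0..9
            r = next(ks)
        d = int(ch)
        out.append(str((d - r) % 10 if decrypt else (d + r) % 10))
    return "".join(out)

ct = crypt_digits("0123456789", b"shared key", b"unique nonce")
pt = crypt_digits(ct, b"shared key", b"unique nonce", decrypt=True)   # back to "0123456789"
Because encryption and decryption run the same rejection process over the same keystream, they stay in sync.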
One potential candidate is the FFX encryption mode, which has recently been submitted to NIST.
Stream ciphers require a nonce for security; the same key stream state must never be re-used for different messages. That nonce adds to the effective ciphertext length.
A block cipher used in a streaming mode has essentially the same issue: a unique initialization vector must be included with the cipher text.
Many stream ciphers are also vulnerable to ciphertext manipulation, where flipping a bit in the ciphertext undetectably flips the corresponding bit in the plaintext.
If the numbers are randomly chosen, and each number is encrypted only once, and the numbers are shorter than the block size, ECB offers good security. Under those conditions, I'd recommend AES in ECB mode as the solution that minimizes ciphertext length while providing strong privacy and integrity protection.
If there is some other information in the context of the ciphertext that could be used as an initialization vector (or nonce), then this could work. This could be something explicit, like a transaction identifier during a purchase, or something implicit like the sequence number of a message (which could be used as the counter in CTR mode). I guess that VeriShield is doing something like this.
I am not a cipher guru, but an obvious question comes to mind: would you be allowed to use One Time Pad encryption? Then you can just include a large block of truly random bits in your decoding system, and use the random data to transform your decimal digits in a reversible way.
If this would be acceptable, we just need to figure out how the decoder knows where in the block of randomness to look to get the key to decode any particular message. If you can send a plaintext timestamp with the ciphertext, then it's easy: convert the timestamp into a number, say the number of seconds since an epoch date, take that number modulo the length of the randomness block, and you have an offset within the block.
With a large enough block of randomness, this should be uncrackable. You could have the random bits be themselves encrypted with strong encryption, such that the user must type in a long password to unlock the decoder; in this way, even if the decryption software was captured, it would still not be easy to break the system.
If you have any interest in this and would like me to expand further, let me know. I don't want to spend a lot of time on an answer that doesn't meet your needs at all.
EDIT: Okay, with the tiny shred of encouragement ("you might be on to something") I'm expanding my answer.
The idea is that you get a block of randomness. One easy way to do this is to just pull data out of the Linux /dev/random device. Now, I'm going to assume that we have some way to find an index into this block of randomness for each message.
Index into the block of randomness and pull out ten bytes of data. Each byte is a number from 0 to 255. Add each of these numbers to the respective digit from the plaintext, modulo by 10, and you have the digits of the ciphertext. You can easily reverse this as long as you have the block of random data and the index: you get the random bits and subtract them from the cipher digits, modulo 10.
You can think of this as arranging the digits from 0 to 9 in a ring. Adding is counting clockwise around the ring, and subtracting is counting counter-clockwise. You can add or subtract any number and it will work. (My original version of this answer suggested using only 3 bits per digit. Not enough, as pointed out below by #Baffe Boyois. Thank you for this correction.)
If the plain text digit is 6, and the random number is 117, then: 6 + 117 == 123, modulo 10 == 3. 3 - 117 == -114, modulo 10 == 6.
As I said, the problem of finding the index is easy if you can use external plaintext information such as a timestamp. Even if your opponent knows you are using the timestamp to help decode messages, it does no good without the block of randomness.
The problem of finding the index is also easy if the message is always delivered; you can have an agreed-upon system of generating a series of indices, and say "This is the fourth message I have received, so I use the fourth index in the series." As a trivial example, if this is the fourth message received, you could agree to use an index value of 40 (4 for the fourth message, times 10 bytes per one-time pad). But you could also use numbers from an approved pseudorandom number generator, initialized with an agreed constant value as a seed, and then you would get a somewhat unpredictable series of indexes within the block of randomness.
Depending on your needs, you could have a truly large chunk of random data (hundreds of megabytes or even more). If you use 10 bytes as a one-time pad, and you never use overlapping pads or reuse pads, then 1 megabyte of random data would yield over 100,000 one-time pads.
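A minimal sketch of the digit-wise pad described above (the pad source and the offset handling are placeholders; the real pad must be shared in advance, kept secret, and never reused):
import os

pad = os.urandom(1_000_000)            # stand-in for the shared block of randomness

def otp_digits(digits: str, offset: int, decrypt: bool = False) -> str:
    out = []
    for i, ch in enumerate(digits):
        r = pad[offset + i]            # one pad byte (0..255) per plaintext digit
        d = int(ch)
        out.append(str((d - r) % 10 if decrypt else (d + r) % 10))
    return "".join(out)

ct = otp_digits("0123456789", offset=12345)
pt = otp_digits(ct, offset=12345, decrypt=True)   # "0123456789" again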
You could use the octal format, which uses digits 0-7, and three digits make up a byte. This isn't the most space-efficient solution, but it's quick and easy.
Example:
Text: Hello world!
Hexadecimal: 48 65 6C 6C 6F 20 77 6F 72 6C 64 21
Octal: 110 145 154 154 157 040 167 157 162 154 144 041
(spaces added for clarity to separate bytes)
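For instance, the same round trip in Python (just an illustration of the encoding, no encryption involved):
text = b"Hello world!"
octal = " ".join(f"{b:03o}" for b in text)                    # '110 145 154 ... 041'
restored = bytes(int(group, 8) for group in octal.split())    # b'Hello world!'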
I don't believe your requirement can be met (at all easily, anyway), though it's possible to get pretty close.
AES (like most encryption algorithms) is written to work with octets (i.e. 8-bit bytes), and it's going to produce 8-bit bytes. Once it's done its thing, converting the result to use only decimal digits or BCD values won't be possible. Therefore, your only choice is to convert the input from decimal or BCD digits to something that fills an octet as completely as possible. You can then encrypt that, and finally re-encode the output to use only decimal or BCD digits.
When you convert the ASCII digits to fill the octets, it'll "compress" the input somewhat. The encryption will then produce the same size of output as the input. You'll then encode that to use only decimal digits, which will expand it back to roughly the original size.
The problem is that neither 10 nor 100 is a number that you're going to easily fit exactly into a byte. Numbers from 1 to 100 can be encoded in 7 bits. So, you'll basically treat those as a bit-stream, putting them in 7 bits at a time, but taking them out 8 bits at a time to get bytes to encrypt.
That uses the space somewhat better, but it's still not perfect. 7 bits can encode values from 0 to 127, not just 0 to 99, so even though you'll use all 8 bits, you won't use every possible combination of those 8 bits. Likewise, in the result, one byte will turn into three decimal digits (0 to 255), which clearly wastes some space. As a result, your output will be slightly larger than your input.
To get closer than that, you could compress your input with something like Huffman or an LZ* compression (or both) before encrypting it. Then you'd do roughly the same thing: encrypt the bytes, and encode the bytes using values from 0 to 9 or 0 to 99. This will give better usage of the bits in the bytes you encrypt, so you'd waste very little space in that transformation, but does nothing to improve the encoding on the output side.
For those doubting FFX mode AES, please feel free to contact me for further details. Our implementation is a mode of AES that effectively sits on top of existing ciphers. The specification with proof/validation is up on the NIST modes website. FFSEM mode AES is included under FFX mode.
http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/proposedmodes/ffx/ffx-spec.pdf
If it's meaningful, you can also have a conversation with NIST directly about their status in respect to modes submission/AES modes acceptance to address your FIPS question. FFX has security proofs, independent cryptographic review, and is not a "new cipher". It is, however, based on methods that go back 20+ years - proven techniques. In implementation we have the ability to encrypt data whilst preserving length, structure, integrity, type and format. For example, you can specify an explicit format policy that the output will be NNN-NN-NNNN.
So, as a mode of AES, in a mainframe environment we can simply use the native AES processor on a z10; similarly, on open systems with HSM devices we can sit on top of an existing AES implementation.
Format Preserving Encryption (as it's often referred to) in this way is already being used in industry, is available in off-the-shelf products, and is rather quick to deploy - already used in POS devices, payment systems, enterprise deployments, etc.
Mark Bower
VP Product Management
Voltage Security
Drop a note to info#voltage.com or take a look at our website for more info.
Something like a Feistel cipher should fit your requirements. Split your input number into two parts (say 8 digits each), pass one part through a not-necessarily-reversible-or-bijective function, and subtract the other part from the result of that function (modulo e.g. 100,000,000). Then rearrange the digits somehow and repeat a bunch of times. Ideally, one should slightly vary the function which is used each time. Decryption is similar to encryption except that one starts by undoing the last rearrangement step, then subtracts the second part of the message from the result of using the last function one used on the first part of the message (again, modulo 100,000,000), then undoes the previous rearrangement step, etc.
The biggest difficulties with a Feistel cipher are finding a function which achieves good encryption with a reasonable number of rounds, and figuring out how many rounds are required to achieve good encryption. If speed is not important, one could probably use something like AES to perform the scrambling function (since it doesn't have to be bijective, you could arbitrarily pad the data before each AES step, and interpret the result as a big binary number modulo 100,000,000). As for the number of rounds, 10 is probably too few, and 1000 is probably excessive. I don't know what value in between would be best.
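Here is a toy sketch of such a Feistel construction on 16-digit numbers (the round function, key schedule and round count are placeholders, and the digit-rearrangement step mentioned above is omitted; this only shows the shape of the idea, not a vetted cipher):
import hashlib

MOD = 10**8   # each half of a 16-digit number is 8 decimal digits

def round_fn(half: int, round_key: bytes) -> int:
    # toy round function: hash the half together with a per-round key, reduce mod 10**8
    digest = hashlib.sha256(round_key + str(half).encode()).digest()
    return int.from_bytes(digest, "big") % MOD

def encrypt(number: int, round_keys) -> int:
    left, right = divmod(number, MOD)                 # split into two 8-digit halves
    for k in round_keys:
        left, right = right, (round_fn(right, k) - left) % MOD
    return left * MOD + right

def decrypt(number: int, round_keys) -> int:
    left, right = divmod(number, MOD)
    for k in reversed(round_keys):
        left, right = (round_fn(left, k) - right) % MOD, left
    return left * MOD + right

keys = [b"round-key-%d" % i for i in range(16)]
c = encrypt(1234567890123456, keys)
assert decrypt(c, keys) == 1234567890123456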
Using only 10 digits as input/output is completely insecure. It is so insecure that it is very likely to be cracked in a real application, so consider using at least 39 digits (the equivalent of 128 bits). If you are going to use only 10 digits, there is no point in using AES; in that case you might as well invent your own (insecure) algorithm.
The only way you might get out of this is to use a stream cipher. Use a 256-bit key "SecureKey" and an initialisation vector IV, which should be different at the start of each session.
Translate this number into a 77-digit (decimal) number and use addition modulo 10 (addition with the overflow discarded) on each digit.
For instance
AES(IV,KEY) = 4534670 // and a lot more digits
secret_message = 01235
+ and mod 10
---------------------------------------------
ciphertext = 46571 // and you still have 70 digits left for the next message
When you run out of digits from the stream cipher AES(IV,KEY), increase the IV and repeat: IV = IV + 1.
Keep in mind that you should absolutely never use the same IV twice, so you should have some scheme in place to prevent this.
Another concern is in generating the streams. If you generate a number that is higher than 10^77, you should discard that number, increase the IV and try again with the new IV. Otherwise there is a high probability that you will end up with biased numbers and a vulnerability.
It is also very likely that there is a flaw in this scheme, or that there will be one in your implementation.

What's the concept behind zip compression?

What's the concept behind zip compression? I can understand the concept of removing empty space etc, but presumably something has to be added to say how much/where that free space needs to be added back in during decompression?
What's the basic process for compressing a stream of bytes?
A good place to start would be to look up the Huffman compression scheme. The basic idea behind Huffman is that in a given file some bytes appear more frequently than others (in a plaintext file many bytes won't appear at all). Rather than spending 8 bits to encode every byte, why not use a shorter bit sequence to encode the most common characters, and longer sequences to encode the less common characters (these sequences are determined by creating a Huffman tree)?
Once you get a handle on using these trees to encode/decode files based on character frequency, imagine that you then start working on word frequency - instead of encoding "they" as a sequence of 4 characters, why not consider it to be a single character due to its frequency, allowing it to be assigned its own leaf in the Huffman tree? This is more or less the basis of ZIP and other lossless compression - they look for common "words" (sequences of bytes) in a file (including sequences of just 1 byte if common enough) and use a tree to encode them. The zip file then only needs to include the tree info (a copy of each sequence and the number of times it appears) to allow the tree to be reconstructed and the rest of the file to be decoded.
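For a feel of how such a code table is built, here is a small Huffman sketch in Python (illustrative only; this is not how zip itself stores its trees):
import heapq
from collections import Counter

def huffman_codes(data: bytes) -> dict:
    # build a Huffman tree from byte frequencies and return byte -> bit-string codes
    heap = [[count, [byte, ""]] for byte, count in Counter(data).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return {byte: code for byte, code in heap[0][1:]}

codes = huffman_codes(b"this is an example of huffman coding")
# frequent bytes such as b' ' end up with shorter bit strings than rare ones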
Follow up:
To better answer the original question, the idea behind lossless compression is not so much to remove empty space as to remove redundant information.
If you created a database to store music lyrics, you'd find a lot of space was being used to store the chorus, which repeats several times. Instead of using all that space, you could simply place the word CHORUS before the first instance of the chorus lines, and then every time the chorus is to be repeated, just use CHORUS as a placeholder. (In fact, this is pretty much the idea behind LZW compression - in LZW, each line of the song would have a number shown before it. If a line repeats later in the song, rather than writing out the whole line, only the number is shown.)
The basic concept is that instead of using eight bits to represent each byte, you use shorter representations for more frequently occurring bytes or sequences of bytes.
For example, if your file consists solely of the byte 0x41 (A) repeated sixteen times, then instead of representing it as the 8-bit sequence 01000001 shorten it to the 1-bit sequence 0. Then the file can be represented by 0000000000000000 (sixteen 0s). So then the file of the byte 0x41 repeated sixteen times can be represented by the file consisting of the byte 0x00 repeated twice.
So what we have here is that for this file (0x41 repeated sixteen times) the bits 01000001 don't convey any additional information over the bit 0. So, in this case, we throw away the extraneous bits to obtain a shorter representation.
That is the core idea behind compression.
As another example, consider the eight byte pattern
0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48
and now repeat it 2048 times. One way to follow the approach above is to represent bytes using three bits.
000 0x41
001 0x42
010 0x43
011 0x44
100 0x45
101 0x46
110 0x47
111 0x48
Now we can represent the above byte pattern by 00000101 00111001 01110111 (this is the three-byte pattern 0x05 0x39 0x77) repeated 2048 times.
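As a quick sanity check of that packing (illustrative Python, not part of the original answer):
pattern = bytes(range(0x41, 0x49))
codes = {byte: index for index, byte in enumerate(pattern)}     # 0x41 -> 000, ..., 0x48 -> 111
bits = "".join(f"{codes[b]:03b}" for b in pattern * 2048)
packed = int(bits, 2).to_bytes(len(bits) // 8, "big")
# packed starts 0x05 0x39 0x77 and is 6,144 bytes long, versus 16,384 bytes for the raw data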
But an even better approach is to represent the byte pattern
0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48
by the single bit 0. Then we can represent the above byte pattern by 0 repeated 2048 times which becomes the byte 0x00 repeated 256 times. Now we only need to store the dictionary
0 -> 0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48
and the byte pattern 0x00 repeated 256 times and we compressed the file from 16,384 bytes to (modulo the dictionary) 256 bytes.
That, in a nutshell is how compression works. The whole business comes down to finding short, efficient representations of the bytes and byte sequences in a given file. That's the simple idea, but the details (finding the representation) can be quite challenging.
See for example:
Data compression
Run length encoding
Huffman compression
Shannon-Fano coding
LZW
The concept behind compression is basically statistical. If you've got a series of bytes, the chance of byte N being X in practice depends on the value distribution of the previous bytes 0..N-1. Without compression, you allocate 8 bits for each possible value X. With compression, the number of bits allocated for each value X depends on the estimated probability p(N,X).
For instance, given a sequence "aaaa", a compression algorithm can assign a high value to p(5,a) and lower values to p(5,b). When p(N,X) is high, the bit string assigned to X will be short; when p(N,X) is low, a longer bit string is used. In this way, if p(N,X) is a good estimate, then the average bit string will be shorter than 8 bits.