I'm trying to read some bytes from a file.
This is what I've done:
struct HeaderData {
char format[2];
char n_trks[2];
char division[2];
};
HeaderData* header = new HeaderData;
Then, to get the data directly from the file to header I do
file.read(reinterpret_cast<char*>(header), sizeof(HeaderData))
If the first two bytes are 00 06, header->format[0] will be 00 and header->format[1] will be 06. These two numbers combined represent the number 0x0006, which is 6 in decimal, the desired value.
When I do something like
*reinterpret_cast<unsigned*>(header->format) // In this case, the result is 0x0600
it erroneously returns the number 0x0600, so it seems that it inverts the reading of bytes.
My question is what is some workaround to correctly read the numbers as unsigned.
This is going to be an endianness mismatch.
When you read from the file in that fashion, the bytes are placed into your structure in exactly the order they appear in the file.
When you read from the structure with an unsigned, the processor will interpret those bytes in whatever order the architecture requires it to do (most are hardcoded but some can be set to either order).
Or to put it another way
These two numbers combined represent the number 0x0006 which is 6 in decimal.
That's not necessarily true. It's perfectly permissible for the processor of your choice to store decimal 6 as the bytes 0x06 0x00; this is the little-endian scheme, which is used on very common processors like x86. Storing it as 0x00 0x06 would be big-endian.
As M.M has stated in his comment, if your format explicitly defines the integer to be little-endian, you should explicitly read it as little-endian, e.g. format[0] + format[1] * 256, or if it is defined to be big-endian, you should read it as format[0] * 256 + format[1]. Don't rely on the processor's endianness happening to match the endianness of the data.
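A minimal sketch of the explicit reads described above. The helper names read_be16 and read_le16 are made up for illustration and are not from the question. The bytes go through unsigned char first, because plain char may be signed and a byte such as 0x80 would otherwise be sign-extended before the arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: interpret two bytes as big-endian
// (most significant byte first), independent of the host CPU.
inline std::uint16_t read_be16(const char* bytes) {
    assert(bytes != nullptr);
    // Convert through unsigned char so bytes >= 0x80 are not sign-extended
    // on platforms where plain char is signed.
    const unsigned hi = static_cast<unsigned char>(bytes[0]);
    const unsigned lo = static_cast<unsigned char>(bytes[1]);
    return static_cast<std::uint16_t>((hi << 8) | lo);
}

// Hypothetical helper: interpret two bytes as little-endian
// (least significant byte first).
inline std::uint16_t read_le16(const char* bytes) {
    assert(bytes != nullptr);
    const unsigned lo = static_cast<unsigned char>(bytes[0]);
    const unsigned hi = static_cast<unsigned char>(bytes[1]);
    return static_cast<std::uint16_t>((hi << 8) | lo);
}
```

With the struct from the question, read_be16(header->format) would give 6 for the bytes 00 06 on any host, assuming the file format really is big-endian.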
I'd like to preface this by saying that my knowledge of CRC techniques is very limited, I spent most of the day googlin' and reading things, but I can't quite find what I'm looking for. It may very well not be possible, if so just let me know!
What I have is a sequence of seemingly random data:
0xAF 0xBC 0x1F 0x5C... etc
Within this data, there is a field that is not random (that I put there), and I want to use a CRC check of the entire data set to see if this field is set to the correct value (let's say 0x12 0x34 0x56 0x78). I am trying to do this sneakily, and this is key, because I don't want a casual observer to know that I am looking for that field. This is why I don't just read out the location I want and compare against the expected value.
The field's value is constant, the rest is pretty much random. There are some fields here and there that will also be constant if that helps.
Is this possible to do? I am not limited in the number of times I do the CRC check, or which direction I go through the data, or whether I change the polynomial, or really anything. I can also start from the middle of the array, or the third, or wherever, but I would prefer not to start near my field of interest.
The only function that comes to mind that will do what you want is a discrete wavelet transform. (A CRC will always depend on all of the bits that you are computing it over — that's kind of the point.)
You can find the coefficients to apply to the set of discrete wavelet basis functions that will give you a function with a finite basis that covers only the region of interest, using the orthogonality of the basis functions. It will appear that the wavelet functions are over the entire message, but the coefficients are rigged so that the values outside the region of interest cancel in the sum.
While this all may not be obvious to a casual reader of the code, it would be straightforward to write down the functions and coefficients, and multiply it out to see what bytes in the message are selected by the coefficients.
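To illustrate the point that a CRC depends on every bit it is computed over, here is a small sketch. It assumes the common CRC-32 variant (reflected polynomial 0xEDB88320, as used by zlib and PNG), which the question does not specify, and the function names are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32, reflected polynomial 0xEDB88320 (an assumed choice).
inline std::uint32_t crc32_bitwise(const unsigned char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            // XOR in the polynomial only when the low bit is set.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Flips each bit of the buffer in turn (restoring it afterwards) and
// reports whether every single flip changed the CRC.
inline bool every_bit_affects_crc(unsigned char* data, std::size_t len) {
    const std::uint32_t original = crc32_bitwise(data, len);
    for (std::size_t i = 0; i < len; ++i) {
        for (int bit = 0; bit < 8; ++bit) {
            const unsigned char flip = static_cast<unsigned char>(1u << bit);
            data[i] ^= flip;
            const bool changed = crc32_bitwise(data, len) != original;
            data[i] ^= flip;  // restore the original byte
            if (!changed) return false;
        }
    }
    return true;
}
```

Because flipping any one bit, random or not, changes the result, a CRC over the whole buffer cannot single out one field while ignoring the random bytes around it.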
OK, so, to confirm, you have something like this as your data:
0xAF 0xBC 0x1F 0x5C 0x11 0x1F 0x5C 0x11
0x2D 0xAB 0xBB 0xCC 0x00 0xBB 0xCC 0x00
0x12 0x34 0x56 0x78 0xFF 0x56 0x78 0xFF
and you're trying to isolate something in a particular location of that data, e.g., to find the 0x12 0x34 0x56 0x78 value there.
To clarify, do you want to 1) check that value (that particular address range's value), and 2) then do a CRC on the whole? Or do you want to integrate the hunt for the value into the CRC algorithm?
Honestly trying to understand where you're going. I realize this isn't really an answer, but it's a better place for this than in a comment.
Hey! I was looking at this code at http://www.gnu.org/software/m68hc11/examples/primes_8c-source.html
I noticed that in some situations they used hex numbers, like in line 134:
for (j = 1; val && j <= 0x80; j <<= 1, q++)
Now why would they use the 0x80? I am not that good with hex but I found an online hex to decimal and it gave me 128 for 0x80.
Also before line 134, on line 114 they have this:
small_n = (n & 0xffff0000) == 0;
The hex to decimal gave me 4294901760 for that hex number.
So here in this line they are making a bit AND and comparing the result to 0??
Why not just use the number?
Can anyone please explain and please do give examples of other situations.
Also I have seen large lines of code where it's just hex numbers and never really understood why :(
In both cases you cite, the bit pattern of the number is important, not the actual number.
For example,
In the first case,
j is going to be 1, then 2, 4, 8, 16, 32, 64 and finally 128 as the loop progresses.
In binary, that is,
0000:0001, 0000:0010, 0000:0100, 0000:1000, 0001:0000, 0010:0000, 0100:0000 and 1000:0000.
There's no option for binary constants in C (until C23) or C++ (until C++14), but it's a bit clearer in Hex:
0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, and 0x80.
In the second example,
the goal was to remove the lower two bytes of the value.
So given a value of 1,234,567,890 we want to end up with 1,234,567,168.
In hex, it's clearer: start with 0x4996:02d2, end with 0x4996:0000.
There's a direct mapping between hex (or octal for that matter) digits and the underlying bit patterns, which is not the case with decimal. A decimal '9' represents something different with respect to bit patterns depending on what column it is in and what numbers surround it - it doesn't have a direct relationship to a bit pattern. In hex, a '9' always means '1001', no matter which column. 0x9 = '1001', 0x95 = '1001 0101' and so forth.
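Both cases can be tried out directly. This is just a sketch, with function names invented for illustration; the loop bound and the mask are the ones from the code in the question.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Collects the values j takes in a loop shaped like
// for (j = 1; j <= 0x80; j <<= 1): one set bit, walking left.
inline std::vector<unsigned> single_bit_masks() {
    std::vector<unsigned> masks;
    for (unsigned j = 1; j <= 0x80; j <<= 1) {
        masks.push_back(j);
    }
    return masks;
}

// Keeps the upper two bytes and clears the lower two.
inline std::uint32_t high_half(std::uint32_t n) {
    return n & 0xffff0000u;
}

// Same test as small_n in the question: true when the upper 16 bits are zero.
inline bool fits_in_16_bits(std::uint32_t n) {
    return (n & 0xffff0000u) == 0;
}

// Worked examples from the text above.
inline void check_examples() {
    assert(single_bit_masks().back() == 0x80);
    assert(high_half(0x499602d2u) == 0x49960000u);  // 1,234,567,890 -> 1,234,567,168
}
```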
As a vestige of my 8-bit days I find hex a convenient shorthand for anything binary. Bit twiddling is a dying skill. Once (about 10 years ago) I saw a third year networking paper at university where only 10% (5 out of 50 or so) of the people in the class could calculate a bit-mask.
It's a bit mask. Hex values make it easy to see the underlying binary representation. n & 0xffff0000 keeps the top 16 bits of n and clears the rest. 0xffff0000 means "16 1s and 16 0s in binary".
0x80 means "10000000", so you start with "00000001" and keep shifting that bit over to the left, "00000010", "00000100", etc., until "10000000".
0xffff0000 is easy to understand that it's 16 times "1" and 16 times "0" in a 32 bit value, while 4294901760 is magic.
I find it maddening that the C family of languages have always supported octal and hex but not binary. I have long wished that they would add direct support for binary:
int mask = 0b00001111;
Many years/jobs ago, while working on a project that involved an enormous amount of bit-level math, I got fed up and generated a header file that contained defined constants for all possible binary values up to 8 bits:
#define b0 (0x00)
#define b1 (0x01)
#define b00 (0x00)
#define b01 (0x01)
#define b10 (0x02)
#define b11 (0x03)
#define b000 (0x00)
#define b001 (0x01)
...
#define b11111110 (0xFE)
#define b11111111 (0xFF)
It has occasionally made certain bit-level code more readable.
The single biggest use of hex is probably in embedded programming. Hex numbers are used to mask off individual bits in hardware registers, or split multiple numeric values packed into a single 8, 16, or 32-bit register.
When specifying individual bit masks, a lot of people start out by:
#define bit_0 1
#define bit_1 2
#define bit_2 4
#define bit_3 8
#define bit_4 16
etc...
After a while, they advance to:
#define bit_0 0x01
#define bit_1 0x02
#define bit_2 0x04
#define bit_3 0x08
#define bit_4 0x10
etc...
Then they learn to cheat, and let the compiler generate the values as part of compile time optimization:
#define bit_0 (1<<0)
#define bit_1 (1<<1)
#define bit_2 (1<<2)
#define bit_3 (1<<3)
#define bit_4 (1<<4)
etc...
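In C++ the same shift-generated masks can be written as typed constants instead of macros. The following is only a sketch: the names kBit0, set_bits and so on are hypothetical, and an 8-bit register is assumed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names: one mask per bit, generated with shifts.
constexpr std::uint8_t kBit0 = 1u << 0;
constexpr std::uint8_t kBit3 = 1u << 3;
constexpr std::uint8_t kBit4 = 1u << 4;

// Mask for bit n of an 8-bit register.
inline std::uint8_t bit(unsigned n) {
    assert(n < 8);
    return static_cast<std::uint8_t>(1u << n);
}

// Turn on every bit selected by mask.
inline std::uint8_t set_bits(std::uint8_t reg, std::uint8_t mask) {
    return static_cast<std::uint8_t>(reg | mask);
}

// Turn off every bit selected by mask.
inline std::uint8_t clear_bits(std::uint8_t reg, std::uint8_t mask) {
    return static_cast<std::uint8_t>(reg & ~mask);
}

// True if at least one bit selected by mask is on.
inline bool any_set(std::uint8_t reg, std::uint8_t mask) {
    return (reg & mask) != 0;
}
```

Setting kBit0 and kBit4 on an empty register gives 0x11, which is easy to check by eye in hex and much less obvious as decimal 17.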
Sometimes the visual representation of values in HEX makes code more readable or understandable. For instance bitmasking or use of bits becomes non-obvious when looking at decimal representations of numbers.
This can sometimes have to do with the amount of space a particular value type has to offer, so that may also play a role.
A typical example might be in a binary setting, so instead of using decimal to show some values, we use binary.
Let's say an object has a non-exclusive set of three properties, each either on or off. One way to represent the state of those properties is with 3 bits.
The valid representations are 0 through 7 in decimal, but that is not so obvious. More obvious is the binary representation:
000, 001, 010, 011, 100, 101, 110, 111
Also, some people are just very comfortable with hex. Note also that hard-coded magic numbers are just that, and it does not matter much which numbering system you write them in.
I hope that helps.
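To see the eight states of the three-property example as bit patterns, a small helper along these lines would do. It is a sketch only, and to_binary is a made-up name.

```cpp
#include <cassert>
#include <string>

// Formats the low `width` bits of value as a string of '0' and '1',
// most significant bit first.
inline std::string to_binary(unsigned value, unsigned width) {
    assert(width >= 1 && width <= 32);
    std::string digits(width, '0');
    for (unsigned i = 0; i < width; ++i) {
        if ((value >> i) & 1u) {
            digits[width - 1 - i] = '1';
        }
    }
    return digits;
}
```

Calling to_binary(n, 3) for n from 0 through 7 produces the list 000, 001, ..., 111 shown above, where each character is one property's on/off state.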
Generally, hex numbers are used instead of decimal because the computer works with bits (binary numbers), and when you're working with bits hexadecimal is easier to follow: going from hex to binary is easier than going from decimal to binary.
0xFF = 1111 1111 ( F = 1111 )
but
255 = 1111 1111
because
255 / 2 = 127 (remainder 1)
127 / 2 = 63 (remainder 1)
63 / 2 = 31 (remainder 1)
... etc
Can you see that? It's much simpler to go from hex to binary.
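The difference between the two conversions shows up in code as well. In this sketch (function names invented here), hex needs only a table lookup per digit, while decimal needs the repeated division shown above.

```cpp
#include <cassert>
#include <string>

// Hex to binary: each hex digit maps to exactly four bits, independently
// of its neighbours, so a lookup table is enough.
inline std::string hex_to_binary(const std::string& hex) {
    static const char* const nibbles[16] = {
        "0000", "0001", "0010", "0011", "0100", "0101", "0110", "0111",
        "1000", "1001", "1010", "1011", "1100", "1101", "1110", "1111"};
    std::string bits;
    for (char c : hex) {
        int value = -1;
        if (c >= '0' && c <= '9') value = c - '0';
        else if (c >= 'a' && c <= 'f') value = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') value = c - 'A' + 10;
        assert(value >= 0 && "not a hex digit");
        if (value < 0) continue;  // skip invalid input in release builds
        bits += nibbles[value];
    }
    return bits;
}

// Decimal to binary: repeated division by 2, collecting remainders
// from least significant to most significant.
inline std::string decimal_to_binary(unsigned n) {
    if (n == 0) return "0";
    std::string bits;
    while (n > 0) {
        bits.insert(bits.begin(), static_cast<char>('0' + n % 2));
        n /= 2;
    }
    return bits;
}
```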
There are 8 bits in a byte. Hex, base 16, is terse. Any possible byte value is expressed using two characters from the collection 0..9, plus a,b,c,d,e,f.
Base 256 would be more terse. Every possible byte could have its own single character, but most human languages don't use 256 characters, so Hex is the winner.
To understand the importance of being terse, consider that back in the 1970's, when you wanted to examine your megabyte of memory, it was printed out in hex. The printout would use several thousand pages of big paper. Octal would have wasted even more trees.
Each hex (hexadecimal) digit represents 4 bits of data, 0 to 15, or in hex 0 to F. Two hex digits represent a byte.
To be more precise, hex and decimal values are all NUMBERS. The radix (base 10, 16, etc.) is a way to present those numbers in a manner that is either clearer or more convenient.
When discussing "how many of something there are" we normally use decimal. When we are looking at addresses or bit patterns on computers, hex is usually preferred, because often the meaning of individual bytes might be important.
Hex (and octal) have the property that their bases are powers of two, so they map onto groupings of bits nicely. Hex maps 4 bits to one hex digit, a nibble (0-F), so a byte is written as two nibbles (00-FF). Octal was popular on Digital Equipment (DEC) and other older machines, but one octal digit maps to three bits, so it doesn't cross byte boundaries as nicely.
Overall, the choice of radix is a way to make your programming easier - use the one that matches the domain best.
Looking at the file, that's some pretty groady code. Hope you are good at C and not using it as a tutorial...
Hex is useful when you're working directly at the bit level or just above it, e.g., working on a driver where you're looking directly at the bits coming in from a device and twiddling the results so that someone else can read a coherent result. It's a compact, fairly easy-to-read representation of binary.
What's the concept behind zip compression? I can understand the concept of removing empty space etc, but presumably something has to be added to say how much/where that free space needs to be added back in during decompression?
What's the basic process for compressing a stream of bytes?
A good place to start would be to look up the Huffman compression scheme. The basic idea behind Huffman is that in a given file some bytes appear more frequently than others (in a plaintext file many bytes won't appear at all). Rather than spend 8 bits to encode every byte, why not use a shorter bit sequence to encode the most common characters, and longer sequences to encode the less common characters (these sequences are determined by creating a Huffman tree).
Once you get a handle on using these trees to encode/decode files based on character frequency, imagine that you then start working on word frequency - instead of encoding "they" as a sequence of 4 characters, why not consider it to be a single character due to its frequency, allowing it to be assigned its own leaf in the Huffman tree. This is more or less the basis of ZIP and other lossless compression - they look for common "words" (sequences of bytes) in a file (including sequences of just 1 byte if common enough) and use a tree to encode them. The zip file then only needs to include the tree info (a copy of each sequence and the number of times it appears) to allow the tree to be reconstructed and the rest of the file to be decoded.
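The character-frequency step can be sketched as follows. This covers only the per-character Huffman idea, not the actual ZIP format, and the function name huffman_code_lengths is invented for illustration. It returns how many bits each character's code would get.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Returns the Huffman code length in bits for each distinct character of text.
inline std::map<char, int> huffman_code_lengths(const std::string& text) {
    std::map<char, std::size_t> freq;
    for (char c : text) ++freq[c];

    std::map<char, int> lengths;
    if (freq.empty()) return lengths;

    // Leaves have left == right == -1; internal nodes index their children.
    struct Node { std::size_t weight; int left; int right; char symbol; };
    std::vector<Node> nodes;

    // Min-heap of (weight, node index): the two rarest subtrees come out first.
    typedef std::pair<std::size_t, int> Item;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item> > heap;
    for (const auto& entry : freq) {
        nodes.push_back(Node{entry.second, -1, -1, entry.first});
        heap.push(Item(entry.second, static_cast<int>(nodes.size()) - 1));
    }

    // Repeatedly merge the two lightest subtrees under a new parent.
    while (heap.size() > 1) {
        const Item a = heap.top(); heap.pop();
        const Item b = heap.top(); heap.pop();
        nodes.push_back(Node{a.first + b.first, a.second, b.second, '\0'});
        heap.push(Item(a.first + b.first, static_cast<int>(nodes.size()) - 1));
    }
    assert(heap.size() == 1);

    // The depth of a leaf is the length of its code. Walk with an explicit stack.
    std::vector<std::pair<int, int> > stack;  // (node index, depth)
    stack.push_back(std::make_pair(heap.top().second, 0));
    while (!stack.empty()) {
        const std::pair<int, int> top = stack.back();
        stack.pop_back();
        const Node node = nodes[top.first];
        if (node.left < 0) {
            // A lone symbol still needs one bit per occurrence.
            lengths[node.symbol] = top.second == 0 ? 1 : top.second;
        } else {
            stack.push_back(std::make_pair(node.left, top.second + 1));
            stack.push_back(std::make_pair(node.right, top.second + 1));
        }
    }
    return lengths;
}
```

For the input "aaaabbc" the frequent 'a' gets a 1-bit code while 'b' and 'c' get 2 bits each, so the seven characters take 10 bits instead of 56.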
Follow up:
To better answer the original question, the idea behind lossless compression is not so much to remove empty space, but to remove redundant information.
If you created a database to store music lyrics, you'd find a lot of space was being used to store the chorus, which repeats several times. Instead of using all that space, you could simply place the word CHORUS before the first instance of the chorus lines, and then every time the chorus is to be repeated, just use CHORUS as a placeholder. (In fact this is pretty much the idea behind LZW compression - in LZW each line of the song would have a number shown before it. If a line repeats later in the song, rather than write out the whole line, only the number is shown.)
The basic concept is that instead of using eight bits to represent each byte, you use shorter representations for more frequently occurring bytes or sequences of bytes.
For example, if your file consists solely of the byte 0x41 (A) repeated sixteen times, then instead of representing it as the 8-bit sequence 01000001 shorten it to the 1-bit sequence 0. Then the file can be represented by 0000000000000000 (sixteen 0s). So then the file of the byte 0x41 repeated sixteen times can be represented by the file consisting of the byte 0x00 repeated twice.
So what we have here is that for this file (0x41 repeated sixteen times) the bits 01000001 don't convey any additional information over the bit 0. So, in this case, we throw away the extraneous bits to obtain a shorter representation.
That is the core idea behind compression.
As another example, consider the eight byte pattern
0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48
and now repeat it 2048 times. One way to follow the approach above is to represent bytes using three bits.
000 0x41
001 0x42
010 0x43
011 0x44
100 0x45
101 0x46
110 0x47
111 0x48
Now we can represent the above byte pattern by 00000101 00111001 01110111 (this is the three-byte pattern 0x05 0x39 0x77) repeated 2048 times.
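The packing step above can be checked with a short sketch. The name pack_codes is hypothetical. It writes fixed-width codes back to back, most significant bit first, and pads the last byte with zero bits.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Packs fixed-width codes into bytes, most significant bit first.
// Any unused bits in the final byte are left as zero.
inline std::vector<std::uint8_t> pack_codes(const std::vector<unsigned>& codes,
                                            unsigned width) {
    assert(width >= 1 && width <= 8);
    std::vector<std::uint8_t> out;
    std::uint32_t acc = 0;  // bits waiting to be written
    unsigned filled = 0;    // how many bits of acc are in use
    for (unsigned code : codes) {
        acc = (acc << width) | (code & ((1u << width) - 1u));
        filled += width;
        while (filled >= 8) {
            // Emit the oldest eight pending bits as one byte.
            out.push_back(static_cast<std::uint8_t>(acc >> (filled - 8)));
            filled -= 8;
        }
        acc &= (1u << filled) - 1u;  // keep only the bits not yet written
    }
    if (filled > 0) {
        out.push_back(static_cast<std::uint8_t>(acc << (8 - filled)));
    }
    return out;
}
```

Packing the codes 0 through 7 at three bits each gives the bytes 0x05 0x39 0x77 from the text, and sixteen 1-bit zero codes give two 0x00 bytes, as in the first example.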
But an even better approach is to represent the byte pattern
0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48
by the single bit 0. Then we can represent the above byte pattern by 0 repeated 2048 times which becomes the byte 0x00 repeated 256 times. Now we only need to store the dictionary
0 -> 0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48
and the byte pattern 0x00 repeated 256 times and we compressed the file from 16,384 bytes to (modulo the dictionary) 256 bytes.
That, in a nutshell, is how compression works. The whole business comes down to finding short, efficient representations of the bytes and byte sequences in a given file. That's the simple idea, but the details (finding the representation) can be quite challenging.
See for example:
Data compression
Run length encoding
Huffman compression
Shannon-Fano coding
LZW
The concept behind compression is basically statistical. If you've got a series of bytes, the chance of byte N being X in practice depends on the value distribution of the previous bytes 0..N-1. Without compression, you allocate 8 bits for each possible value X. With compression, the number of bits allocated for each value X depends on the estimated chance p(N,X).
For instance, given a sequence "aaaa", a compression algorithm can assign a high value to p(5,a) and a lower value to p(5,b). When p(N,X) is high, the bitstring assigned to X will be short; when p(N,X) is low, a long bitstring is used. In this way, if p(N,X) is a good estimate, the average bitstring will be shorter than 8 bits.
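One way to put numbers on "high p means a short bitstring" is Shannon's information content, -log2(p) bits for a value with estimated probability p. The answer above does not name a formula, so treat this as an assumed illustration, and ideal_code_length_bits as a made-up name.

```cpp
#include <cassert>
#include <cmath>

// Ideal code length in bits for a value with estimated probability p,
// using Shannon's information content (an assumption of this sketch).
inline double ideal_code_length_bits(double p) {
    assert(p > 0.0 && p <= 1.0);
    return -std::log2(p);
}
```

A value with p = 1/256 (no better than a uniform guess over all byte values) costs 8 bits, one with p = 0.5 costs 1 bit, and one predicted with p = 0.9 costs well under 1 bit on average, which only schemes such as arithmetic coding can actually exploit.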