I'm trying to implement the simple POSIX cksum using Boost.CRC.
The code I'm using amounts to this:
for(int i = 1; i < argc; ++i)
{
    support::file current(argv[i], support::file::access::read);
    size_t octets = 0;
    boost::crc_32_type crc;
    while(true)
    {
        size_t bytes_read = current.read_some(buffer_size, buffer);
        octets += bytes_read;
        crc.process_bytes(&buffer[0], bytes_read);
        if(bytes_read < buffer_size)
            break;
    }
    if(i > 1)
        support::print("\n");
    support::print(boost::lexical_cast<string>(crc.checksum()) + " " + boost::lexical_cast<string>(octets) + " " + argv[i]);
}
Where support::file is a simple fopen/fread binary file I/O wrapper, which I successfully used for a cat implementation. support::print gives the same output as std::cout, but I need it for reliable non-ASCII output on Windows.
The Boost header has this:
typedef crc_optimal<32, 0x04C11DB7, 0xFFFFFFFF, 0xFFFFFFFF, true, true> crc_32_type;
as the only 32-bit CRC typedef. It gives the wrong answer (checked against GNUWin32 coreutils cksum) for an empty file (touch test && cksum test). I have tried using the above typedef and also modifying one or both of the 0xFFFFFFFF values to 0; with that I get the correct result for an empty file, but any other file still gives different results.
How is Boost.CRC related to the POSIX cksum specification?
This isn't really an answer to your question, but rather the results of an investigation I carried out. I used to struggle with CRCs when I was building a simple Ethernet controller implementation in VHDL, and I do realize that implementations can vary considerably for reasons that are not always obvious.
Okay, let's begin. The typedef that you found in boost/crc.hpp:
typedef crc_optimal<32, 0x04C11DB7, 0xFFFFFFFF, 0xFFFFFFFF, true, true> crc_32_type;
is a declaration of a CRC generator that produces the CRC used by Ethernet. The template parameters are as follows: Bits (the number of bits output by the generator), TruncPoly (the polynomial used by the generator), InitRem (the initial remainder fed into the algorithm before processing the first byte of input), FinalXor (the value the output is XORed with after processing all input bytes), and ReflectIn and ReflectRem (whether the input bytes and/or the output should be bit-reflected, e.g. bit 0 becomes bit 7, and so on). Ethernet not only requires the output to be calculated with the given polynomial, it also imposes the constraints you can read from that typedef.
According to the specification of cksum, the typedef for its CRC generator should look something like this:
typedef crc_optimal<32, 0x04C11DB7, 0, 0xFFFFFFFF, false, false> cksum_crc_type;
This is because:
The specification doesn't specify a starting value for the generator, thus 0.
The output value should be complemented, according to number 4 of the specification. The same result can be achieved by XORing the value with ones.
Bit reflecting is not mentioned anywhere, and thus will not be performed.
However, there is one significant difference when it comes to cksum as opposed to ordinary CRCs:
(...) followed by one or more octets representing the length of the
file as a binary value, least significant octet first. The smallest
number of octets capable of representing this integer shall be used.
Ordinary CRC generators don't take into account the number of octets that were processed. This would also explain why you got a good result when processing a zero-length file, but a bad one when processing larger files.
Unfortunately, I don't see an easy solution for that problem. I guess what you could do is modify the process_bytes method in the following way:
template < std::size_t Bits, BOOST_CRC_PARM_TYPE TruncPoly,
           BOOST_CRC_PARM_TYPE InitRem, BOOST_CRC_PARM_TYPE FinalXor,
           bool ReflectIn, bool ReflectRem >
inline
void
BOOST_CRC_OPTIMAL_NAME::process_bytes
(
    void const * buffer,
    std::size_t byte_count
)
{
    unsigned char const * const b = static_cast<unsigned char const *>(
        buffer );
    process_block( b, b + byte_count );
    // mix in the byte count, least significant octet first, as cksum requires
    for (; byte_count; byte_count >>= 8)
        rem_ = (rem_ << 8) ^ crc_table_type::table_[((rem_ >> 24) ^ byte_count) & 0xFF];
}
With such an implementation, the method gives the same result as cksum (note that this mixes the count in on every call, so it only matches if you process the whole file in a single process_bytes call). The for loop is courtesy of GNU coreutils.
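For what it's worth, the same result can be had without patching Boost: process_byte on a non-reflected crc_optimal performs exactly the table step shown above, so you can keep the stock typedef and feed it the length octets yourself once the whole file has been processed. A sketch (the helper name is mine):

#include <boost/crc.hpp>
#include <cstddef>

typedef boost::crc_optimal<32, 0x04C11DB7, 0, 0xFFFFFFFF, false, false> cksum_crc_type;

// Feed in the length octets, least significant first, using the smallest
// number of octets, as the cksum specification requires, then finish.
unsigned long cksum_value(cksum_crc_type &crc, std::size_t octets)
{
    for (; octets; octets >>= 8)
        crc.process_byte(static_cast<unsigned char>(octets & 0xFF));
    return crc.checksum();
}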
Hope I helped.
This question is about endianness.
My goal is to write 2 bytes to a file for a game, and I want to make sure that people with different computers get the same result, whether their machine is little- or big-endian.
Which of these snippets do I use?
char a[2] = { 0x5c, 0x7B };
fout.write(a, 2);
or
int a = 0x7B5C;
fout.write((char*)&a, 2);
Thanks a bunch.
From wikipedia:
In its most common usage, endianness indicates the ordering of bytes within a multi-byte number.
So for char a[2] = { 0x5c, 0x7B };, a[1] will always be 0x7B.
However, for int a = 0x7B5C;, the first byte seen through char* oneByte = (char*)&a; (that is, oneByte[0]) depends on the machine: on a little-endian machine it is 0x5C, while on a big-endian machine it is whatever the most significant byte of the int happens to be (0x00 for a typical 4-byte int). As you can see, you have to play with casts and byte pointers (and bear in mind that this kind of type punning is for explanation purposes only; it can easily run into undefined behaviour).
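If you want to inspect the bytes without the cast gymnastics, a small sketch using memcpy (which is always well-defined for reading an object's bytes) might look like this:

#include <cstdio>
#include <cstring>

int main()
{
    int a = 0x7B5C;
    unsigned char bytes[sizeof a];
    std::memcpy(bytes, &a, sizeof a);  // copy out the object representation
    std::printf("first byte: 0x%02X\n", bytes[0]); // 0x5C on a little-endian machine
    return 0;
}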
One way that is used quite often is to write a 'signature' or 'magic' number as the first data in the file - typically a 16-bit integer whose value, when read back, will depend on whether or not the reading platform has the same endianness as the writing platform. If you then detect a mismatch, all data (of more than one byte) read from the file will need to be byte swapped.
Here's some outline code:
#include <cstdio>
#include <cstdint>

void ByteSwap(void *buffer, size_t length)
{
    unsigned char *p = static_cast<unsigned char *>(buffer);
    for (size_t i = 0; i < length / 2; ++i) {
        unsigned char tmp = *(p + i);
        *(p + i) = *(p + length - i - 1);
        *(p + length - i - 1) = tmp;
    }
    return;
}

bool WriteData(void *data, size_t size, size_t num, FILE *file)
{
    uint16_t magic = 0xAB12; // Something that can be tested for byte-reversal
    if (fwrite(&magic, sizeof(uint16_t), 1, file) != 1) return false;
    if (fwrite(data, size, num, file) != num) return false;
    return true;
}

bool ReadData(void *data, size_t size, size_t num, FILE *file)
{
    uint16_t test_magic;
    bool is_reversed;
    if (fread(&test_magic, sizeof(uint16_t), 1, file) != 1) return false;
    if (test_magic == 0xAB12) is_reversed = false;
    else if (test_magic == 0x12AB) is_reversed = true;
    else return false; // Error - needs handling!
    if (fread(data, size, num, file) != num) return false;
    if (is_reversed && (size > 1)) {
        for (size_t i = 0; i < num; ++i)
            ByteSwap(static_cast<char *>(data) + (i*size), size);
    }
    return true;
}
Of course, in the real world, you wouldn't need to write/read the 'magic' number for every input/output operation - just once per file, and store the is_reversed flag for future use when reading data back.
Also, with proper use of C++, you would probably be using std::istream/std::ostream arguments rather than the FILE* I have shown; but the sample I have posted has been extracted (with only very little modification) from code that I actually use in my projects (to do just this test). Conversion to more modern C++ should be straightforward.
Feel free to ask for further clarification and/or explanation.
NOTE: The ByteSwap function I have provided is not ideal! It almost certainly breaks strict aliasing rules and may well cause undefined behaviour on some platforms, if used carelessly. Also, it is not the most efficient method for small data units (like int variables). One could (and should) provide one's own byte-reversal function(s) to handle specific types of variables (a good case for overloading the function with different argument types).
Which of these snippets do I use?
The first one. It has the same output regardless of native endianness.
But you'll find that if you need to interpret those bytes as some integer value, that is not so straightforward. char a[2] = { 0x5c, 0x7B } can represent either 0x5c7B (big endian) or 0x7B5c (little endian). So, which one did you intend?
The solution for cross-platform interpretation of integers is to decide on a particular byte order for reading and writing. The de facto "standard" for cross-platform data is big endian.
To write a number in big endian, start by bit-shifting the input value right so that the most significant byte moves into the place of the least significant byte. Mask off all other bytes (technically redundant in the first iteration, but we'll loop back soon). Write this byte to the output. Repeat for the remaining bytes in order of significance.
This algorithm produces same output regardless of the native endianness - it will even work on exotic "middle" endian systems if you ever encounter one. Writing to little endian is similar, but in reverse order.
To read a big endian value, read the first byte of input, shift it left so that it goes to the place of most significant byte. Combine the shifted byte with the result (initially zero) using bitwise-or. Repeat with the next byte by shifting to the second most significant place and so on.
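A minimal sketch of the write and read loops just described, for a 16-bit value (function names are mine):

#include <cstdint>
#include <istream>
#include <ostream>

// Write `value` big endian, independent of native byte order.
void write_u16_be(std::ostream &out, std::uint16_t value)
{
    char bytes[2];
    bytes[0] = static_cast<char>((value >> 8) & 0xFF); // most significant byte first
    bytes[1] = static_cast<char>(value & 0xFF);
    out.write(bytes, 2);
}

// Read it back: shift each byte into place and OR it into the result.
std::uint16_t read_u16_be(std::istream &in)
{
    unsigned char bytes[2];
    in.read(reinterpret_cast<char *>(bytes), 2);
    return static_cast<std::uint16_t>((bytes[0] << 8) | bytes[1]);
}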
How to know the endianness of a computer?
To know the endianness of a system, you can use std::endian in the upcoming C++20. Prior to that, you can use implementation-specific macros from the endian.h header. Or you can do a simple calculation like you suggest.
But you never really need to know the endianness of a system. You can simply use the algorithms that I described, which work on systems of all endianness without having to know what that endianness is.
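For completeness, here is what the C++20 check mentioned above looks like (a one-line sketch):

#include <bit>  // C++20

constexpr bool native_is_little = (std::endian::native == std::endian::little);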
Despite the fact that big-endian computers are not very widely used, I want to store the double datatype in an endianness-independent format.
For int, this is really simple, since bit shifts make that very convenient.
int number;
const int size = sizeof(number); // constant bound, so the array below is standard C++
char bytes[size];
for (int i = 0; i < size; ++i)
    bytes[size-1-i] = (number >> 8*i) & 0xFF;
This code snippet stores the number in big-endian format, regardless of the machine it runs on. What is the most elegant way to do this for double?
The best way for portability, and to take the format into account, is to serialize/deserialize the mantissa and the exponent separately. For that you can use the frexp()/ldexp() functions.
For example, to serialize:
#include <cmath>   // frexp
#include <climits> // ULLONG_MAX

int exp;
unsigned long long mant;
// frexp splits the number into a mantissa in [0.5, 1) and a base-2 exponent
mant = (unsigned long long)(ULLONG_MAX * frexp(number, &exp));
// then serialize exp and mant.
And then to deserialize:
// deserialize to exp and mant.
double result = ldexp ((double)mant / ULLONG_MAX, exp);
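Two caveats with the snippet as written: casting a negative value to unsigned long long is not well-defined, and multiplying by ULLONG_MAX rounds, so the round trip is only approximate. A sketch of the same frexp()/ldexp() idea that handles the sign and round-trips exactly (my variation, using 2^53, the size of a double's mantissa; names are mine):

#include <cmath>
#include <cstdint>

const double two53 = 9007199254740992.0; // 2^53

void serialize_double(double number, bool &negative, int &exp, std::uint64_t &mant)
{
    negative = std::signbit(number);
    // the mantissa is in [0.5, 1), so mant fits exactly in 53 bits
    mant = (std::uint64_t)(std::frexp(std::fabs(number), &exp) * two53);
}

double deserialize_double(bool negative, int exp, std::uint64_t mant)
{
    double result = std::ldexp((double)mant / two53, exp);
    return negative ? -result : result;
}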
The elegant thing to do is to limit the endianness problem to as small a scope as possible. That narrow scope is the I/O boundary between your program and the outside world. For example, the functions that send binary data to / receive binary data from some other application need to be aware of the endian problem, as do the functions that write binary data to / read binary data from some data file. Make those interfaces cognizant of the representation problem.
Make everything else blissfully ignorant of the problem. Use the local representation everywhere else. Represent a double precision floating point number as a double rather than an array of 8 bytes, represent a 32 bit integer as an int or int32_t rather than an array of 4 bytes, et cetera. Dealing with the endianness problem throughout your code is going to make your code bloated, error prone, and ugly.
The same. Any numeric object, including a double, is in the end just a sequence of bytes interpreted in a specific order according to endianness. So if you reverse the order of the bytes you get exactly the same value in the reversed endianness.
#include <cstddef>

// mask = 7 if the native and target byte orders differ
// mask = 0 if they already match
void copy_doubles_swapped(const char *src_data, char *dst_data,
                          size_t N, size_t mask)
{
    for (size_t i = 0; i < N * sizeof(double); i++)
        *dst_data++ = src_data[i ^ mask];
}

The elegance lies in mask, which also handles the short and int types: it is sizeof(elem)-1 when the source and target endianness differ, and 0 when they match.
Not very portable and standards violating, but something like this:
#include <array>
#include <cstddef>
#include <cstdint>

std::array<unsigned char, 8> serialize_double( double const* d )
{
    std::array<unsigned char, 8> retval;
    unsigned char const* begin = reinterpret_cast<unsigned char const*>(d);
    union
    {
        std::uint8_t i8s[8];
        std::uint16_t i16s[4];
        std::uint32_t i32s[2];
        std::uint64_t i64s;
    } u;
    u.i64s = 0x0001020304050607ull; // one byte order
    // u.i64s = 0x0706050403020100ull; // the other byte order
    for (std::size_t index = 0; index < 8; ++index)
    {
        retval[ u.i8s[index] ] = begin[index];
    }
    return retval;
}
This might handle a platform with 8-bit chars, 8-byte doubles, and any crazy byte ordering (e.g. big endian within words but little endian between words for 64-bit values).
Now, this doesn't cover the case where the byte order of doubles differs from that of 64-bit ints.
An easier approach might be to reinterpret the bytes of your double as a 64-bit unsigned value, then output that as you would any other integer.
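For instance, a sketch assuming an 8-byte IEEE-754 double (memcpy sidesteps the aliasing problems of a plain cast):

#include <cstdint>
#include <cstring>

std::uint64_t double_bits(double d)
{
    std::uint64_t bits;
    static_assert(sizeof bits == sizeof d, "assumes an 8-byte double");
    std::memcpy(&bits, &d, sizeof bits);
    return bits; // now serialize big endian like any other integer
}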
#include <cstring>  // memcpy
#include <utility>  // std::swap

void reverse_endian(double number, char (&bytes)[sizeof(double)])
{
    const int size = sizeof(number);
    std::memcpy(bytes, &number, size);
    for (int i = 0; i < size/2; ++i)
        std::swap(bytes[i], bytes[size-i-1]);
}
I know C++11 has some standard facilities which would allow getting integral values from unaligned memory. How could something like this be written in a more standard way?
template <class R>
inline R get_unaligned_le(const unsigned char p[], const std::size_t s) {
    R r = 0;
    for (std::size_t i = 0; i < s; i++)
        r |= static_cast<R>(*p++ & 0xff) << (i * 8); // widen each byte to R before shifting
    return r;
}
To take the values stored in little-endian order, you can then write:
uint_least16_t value1 = get_unaligned_le<uint_least16_t>(&buffer[0], 2);
uint_least32_t value2 = get_unaligned_le<uint_least32_t>(&buffer[2], 4);
How did the integral values get into the unaligned memory to begin with? If they were memcpyed in, then you can use memcpy to get them out.
If they were read from a file or the network, you have to know their format: how they were written to begin with. If they are four-byte big-endian 2's complement (the usual network format), then something like:
// Supposes native int is at least 32 bits...
unsigned
getNetworkInt( unsigned char const* buffer )
{
    return buffer[0] << 24
         | buffer[1] << 16
         | buffer[2] << 8
         | buffer[3];
}
This will work for any unsigned type, provided the type you're aiming for is at least as large as the type you input. For signed, it depends on just how portable you want to be. If all of your potential target machines are 2's complement and have an integral type with the same size as your input type, then you can use exactly the same code as above. If your native machine is a 1's complement 36-bit machine (e.g. a Unisys mainframe) and you're reading signed network-format integers (32-bit 2's complement), you'll need some additional logic, as sketched below.
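One possible shape for that additional logic, as a sketch for the common case of a 2's complement host where std::int32_t exists (a 1's complement machine would need yet more care):

#include <cstdint>

// Interpret a 32-bit two's complement bit pattern portably. The
// subtraction of 2^31 is split in two so no intermediate overflows.
std::int32_t networkToSigned( std::uint32_t u )
{
    if ( u < 0x80000000u )
        return static_cast<std::int32_t>( u );
    return static_cast<std::int32_t>( u - 0x80000000u ) - 0x40000000 - 0x40000000;
}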
As always, create the desired variable and populate it byte-wise:
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <type_traits>

template <typename R>
R get(unsigned char * p, std::size_t len = sizeof(R))
{
    static_assert(std::is_trivially_copyable<R>::value,
                  "R must be trivially copyable");
    assert(len >= sizeof(R));
    R result;
    std::copy(p, p + sizeof(R), reinterpret_cast<unsigned char *>(&result));
    return result;
}
This only works universally for trivially copyable types, though you can probably use it for non-trivial types if you have additional guarantees from elsewhere.
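Hypothetical usage, assuming buffer holds at least sizeof(std::uint32_t) valid bytes at any alignment:

unsigned char buffer[16] = {};
std::uint32_t v = get<std::uint32_t>(buffer); // bytes interpreted in native order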
I have two arrays: char data1[length], where length is a multiple of 8, i.e. length can be 8, 16, 24, ... The array contains binary data read from a file that is open in binary mode. I will keep reading from the file, and every time I read I will store the read value in a hash table. The binary data is randomly distributed. I would like to hash each array and store it in the hash table, in order to be able to look up the entry with that specific data again. What would be a good hashing function for this task? Thanks.
Please note that I am writing this in C++ and C, so a solution in either language would be great.
If the data that you read is 8 bytes long and really distributed randomly, and your hashcode needs to be 32 bits, what about this:
#include <stdint.h>

// defined first so hashcode() sees a declaration (required in C)
uint32_t get_uint32_le(const unsigned char *data) {
    uint32_t value = 0;
    value |= (uint32_t)data[0] << 0;
    value |= (uint32_t)data[1] << 8;
    value |= (uint32_t)data[2] << 16;
    value |= (uint32_t)data[3] << 24;
    return value;
}

uint32_t hashcode(const unsigned char *data) {
    uint32_t hash = 0;
    hash ^= get_uint32_le(data + 0); // fold the two halves of the
    hash ^= get_uint32_le(data + 4); // 8 random bytes together
    return hash;
}
If you need more speed, this code can probably be made a lot faster if you can guarantee that data is always properly aligned, so that it can be interpreted as a const uint32_t *.
I have successfully used MurmurHash3 in one of my projects.
Pros:
It is fast. Very fast.
It supposedly has a low collision rate.
Cons:
It's not suitable for cryptography applications.
It's not standardized in any shape or form.
It's not portable to non-x86 platforms. However, it's small enough that you should be able to port it if you really need to - I was able to port it to Java, although that's not nearly the same thing.
It's a good possibility for use in e.g. a fast hash-table implementation...
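If you want something trivially portable instead, a tiny hash like 32-bit FNV-1a (a different, widely known non-cryptographic hash; shown here only as a sketch) also works well for hash tables:

#include <stdint.h>
#include <stddef.h>

// FNV-1a: XOR each byte in, then multiply by the FNV prime.
uint32_t fnv1a(const unsigned char *data, size_t len)
{
    uint32_t hash = 2166136261u;  // FNV offset basis
    for (size_t i = 0; i < len; i++) {
        hash ^= data[i];
        hash *= 16777619u;        // FNV prime
    }
    return hash;
}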
In C/C++, is there an easy way to apply bitwise operators (specifically left/right shifts) to dynamically allocated memory?
For example, let's say I did this:
unsigned char * bytes=new unsigned char[3];
bytes[0]=1;
bytes[1]=1;
bytes[2]=1;
I would like a way to do this:
bytes>>=2;
(then the 'bytes' would have the following values):
bytes[0]==0
bytes[1]==64
bytes[2]==64
Why the values should be that way:
After allocation, the bytes look like this:
[00000001][00000001][00000001]
But I'm looking to treat the bytes as one long string of bits, like this:
[000000010000000100000001]
A right shift by two would cause the bits to look like this:
[000000000100000001000000]
Which finally looks like this when separated back into the 3 bytes (thus the 0, 64, 64):
[00000000][01000000][01000000]
Any ideas? Should I maybe make a struct/class and overload the appropriate operators? Edit: If so, any tips on how to proceed? Note: I'm looking for a way to implement this myself (with some guidance) as a learning experience.
I'm going to assume you want bits carried from one byte to the next, as John Knoeller suggests.
The requirements here are insufficient. You need to specify the order of the bits relative to the order of the bytes: when the least significant bit falls out of one byte, does it go to the next higher or the next lower byte?
What you are describing, though, used to be very common for graphics programming. You have basically described a monochrome bitmap horizontal scrolling algorithm.
Assuming that "right" means higher addresses but less significant bits (ie matching the normal writing conventions for both) a single-bit shift will be something like...
void scroll_right (unsigned char* p_Array, int p_Size)
{
    unsigned char orig_l = 0; // carry from the previous (left) byte
    unsigned char orig_r;
    unsigned char* dest = p_Array;
    while (p_Size > 0)
    {
        p_Size--;
        orig_r = *p_Array++;
        // the low bit of the previous byte becomes the high bit of this one
        *dest++ = (orig_l << 7) + (orig_r >> 1);
        orig_l = orig_r;
    }
}
Adapting the code for variable shift sizes shouldn't be a big problem. There are obvious opportunities for optimisation (e.g. doing 2, 4 or 8 bytes at a time) but I'll leave that to you.
To shift left, though, you should use a separate loop which should start at the highest address and work downwards.
If you want to expand "on demand", note that the orig_l variable contains the last byte above. To check for an overflow, check if (orig_l << 7) is non-zero. If your bytes are in an std::vector, inserting at either end should be no problem.
EDIT I should have said - optimising to handle 2, 4 or 8 bytes at a time will create alignment issues. When reading 2-byte words from an unaligned char array, for instance, it's best to do the odd byte read first so that later word reads are all at even addresses up until the end of the loop.
On x86 this isn't necessary, but it is a lot faster. On some processors it's necessary. Just do a switch based on the base (address & 1), (address & 3) or (address & 7) to handle the first few bytes at the start, before the loop. You also need to special case the trailing bytes after the main loop.
Decouple the allocation from the accessors/mutators.
Next, see if a standard container like std::bitset can do the job for you.
Otherwise, check out boost::dynamic_bitset.
If all else fails, roll your own class.
Rough example:
#include <climits> // CHAR_BIT
#include <cstddef>

typedef unsigned char byte;
// Extract `bitcount` bits starting at `startbit` (1 = the most significant bit).
byte extract(byte value, int startbit, int bitcount)
{
    byte result;
    result = (byte)(value << (startbit - 1));
    result = (byte)(result >> (CHAR_BIT - bitcount));
    return result;
}
byte *right_shift(byte *bytes, size_t nbytes, size_t n) {
    byte rollover = 0;
    for (size_t i = 0; i < nbytes; ++i) {
        // grab the low n bits before they are shifted away
        byte next = extract(bytes[ i ], CHAR_BIT - n + 1, n);
        // shift, then splice in the bits carried over from the previous byte
        bytes[ i ] = (bytes[ i ] >> n) | (byte)(rollover << (CHAR_BIT - n));
        rollover = next;
    }
    return &bytes[ 0 ];
}
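Hypothetical usage, with the three bytes from the question:

byte data[3] = { 1, 1, 1 };  // [00000001][00000001][00000001]
right_shift(data, 3, 2);     // data is now { 0, 64, 64 }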
Here's how I would do it for two bytes:
unsigned int rollover = byte[0] & 0x3;
byte[0] >>= 2;
byte[1] = byte[1] >> 2 | (rollover << 6);
From there, you can generalize this into a loop for n bytes (see the sketch below). For flexibility, you will want to compute the magic numbers (0x3 and 6) rather than hardcode them.
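A sketch of that generalization (names are mine; shift is assumed to be between 1 and 7):

#include <stddef.h>

void shift_right_array(unsigned char *b, size_t n, unsigned shift)
{
    unsigned char carry = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char next = b[i] & ((1u << shift) - 1); // the bits that fall off
        b[i] = (unsigned char)((b[i] >> shift) | (carry << (8 - shift)));
        carry = next;
    }
}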
I'd look into something similar to this:
#include <cstddef>

#define number_of_bytes 3

template<size_t num_bytes>
union MyUnion
{
    char bytes[num_bytes];
    __int64 ints[num_bytes / sizeof(__int64) + 1]; // __int64 is MSVC-specific
};

int main()
{
    MyUnion<number_of_bytes> mu;
    mu.bytes[0] = 1;
    mu.bytes[1] = 1;
    mu.bytes[2] = 1;
    mu.ints[0] >>= 2;
}
Just play with it. You'll get the idea I believe.
Operator overloading is syntactic sugar. It's really just a way of calling a function and passing your byte array without having it look like you are calling a function.
So I would start by writing this function
unsigned char * ShiftBytes(unsigned char * bytes, size_t count_of_bytes, int shift);
Then if you want to wrap this up in an operator overload in order to make it easier to use or because you just prefer that syntax, you can do that as well. Or you can just call the function.
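If you do want the operator syntax, a hypothetical wrapper (just to show the sugar; ShiftBytes is the function declared above) could look like:

#include <cstddef>

struct ByteArray
{
    unsigned char * bytes;
    std::size_t count;

    ByteArray & operator>>=(int shift)
    {
        ShiftBytes(bytes, count, shift); // the plain function does the real work
        return *this;
    }
};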