Is this checksum calculation completely waterproof? - bit-manipulation

long make_checksum(const char* str)
{
    long chk = 0;
    long rot = 0;
    while (*str)
    {
        rot <<= 9;
        rot |= (rot >> 23);
        rot ^= *(char*)str++;
        chk += rot;
    }
    return chk;
}
Not waterproof means: there's a chance I can get the same checksum for two different strings.

As there are more possible strings than long values (with a 32-bit long there are only 2^32 checksums, but already 256^5 > 2^40 five-byte strings), the pigeonhole principle says there are surely two different strings resulting in the same checksum.

A checksum can never be waterproof, since it contains less data than the original data of which you are calculating the checksum.
If you want a truly waterproof 'checksum', you need to keep a second 'instance' of your data and verify that it contains exactly the same data as the original, although it does not have to be in the same format (it can be encrypted or compressed).
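To see the pigeonhole effect at a scale you can actually brute-force, truncate the checksum to 16 bits: there are about 857,000 three-character printable strings but only 65,536 possible 16-bit values, so a collision is guaranteed. A minimal sketch (the truncation is purely for demonstration; with a full 32- or 64-bit long the same argument holds, just over longer strings):

#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_map>

long make_checksum(const char* str); // as defined in the question

int main()
{
    // 95^3 ~= 857,000 three-character printable strings, but only
    // 65,536 possible 16-bit values: a collision is guaranteed.
    std::unordered_map<std::uint16_t, std::string> seen;
    for (char a = ' '; a <= '~'; ++a)
        for (char b = ' '; b <= '~'; ++b)
            for (char c = ' '; c <= '~'; ++c)
            {
                std::string s{a, b, c};
                auto h = static_cast<std::uint16_t>(make_checksum(s.c_str()));
                auto ins = seen.emplace(h, s);
                if (!ins.second)
                {
                    std::printf("\"%s\" and \"%s\" both give %u\n",
                                ins.first->second.c_str(), s.c_str(), h);
                    return 0;
                }
            }
}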


Why don't identical bitsets convert to identical ulongs?

I am dealing with data in a vector of std::bitset<16>, which I have to convert both to and from unsigned long (through std::bitset::to_ulong()) and to and from strings using a self-made function (the exact algorithm is irrelevant for this question).
The conversions between the bitset vector and a string do at first seem to work fine: if I first convert a vector of bitsets to a string and then back to bitsets, the result is identical, which I have verified with a program that includes this:
for (std::bitset<16>& B : my_bitset16vector) std::cout<<B<<std::endl; // print bitsets before conversion
bitset_to_string(my_bitset16vector,my_str);
string_to_bitset(my_bitset16vector,my_str);
std::cout<<std::endl;
for (std::bitset<16>& B : my_bitset16vector) std::cout<<B<<std::endl; // print bitsets after conversion
the output could look somewhat like this (in this case with only 4 bitsets):
1011000011010000
1001010000011011
1110100001101111
1001000011001111
1011000011010000
1001010000011011
1110100001101111
1001000011001111
Judging by this, the bitsets before and after conversion are clearly identical. Despite this, however, the bitsets convert completely differently when I tell them to convert to unsigned long, in a program which could look like this:
for (std::bitset<16>& B : my_bitset16vector) std::cout<<B<<".to_ulong()="<<B.to_ulong()<<std::endl; // print bitsets before conversion
bitset_to_string(my_bitset16vector,my_str);
string_to_bitset(my_bitset16vector,my_str);
std::cout<<std::endl;
for (std::bitset<16>& B : my_bitset16vector) std::cout<<B<<".to_ulong()="<<B.to_ulong()<<std::endl; // print bitsets after conversion
the output could look somewhat like this:
1011000011010000.to_ulong()=11841744
1001010000011011.to_ulong()=1938459
1110100001101111.to_ulong()=22472815
1001000011001111.to_ulong()=18649295
1011000011010000.to_ulong()=45264
1001010000011011.to_ulong()=37915
1110100001101111.to_ulong()=59503
1001000011001111.to_ulong()=37071
Firstly, it is obvious that the bitsets are still, beyond all reasonable doubt, identical when displayed as binary, but when converted to unsigned long, the identical bitsets return completely different values (completely ruining my program).
Why is this? Can it be that the bitsets are not identical, even though they print as the same? Can the error be in my bitset to/from string converters, despite the bitsets being identical?
Edit: not all programs including my conversions have this problem; it only happens when I have modified the bitset after creating it (from a string), in my case in an attempt to encrypt it. That code cannot be cut down to something short and simple, but my most compressed way of writing it looks like this (and that is even without including the definition of the public key struct and the modular power function):
int main(int argc, char** argv)
{
    if (argc != 3)
    {
        std::cout<<"only 2 arguments allowed: plaintext user"<<std::endl;
        return 1;
    }
    unsigned long k=123456789; // any huge number loaded from an external file
    unsigned long m=123456789; // any huge number loaded from an external file
    std::vector< std::bitset<16> > data;
    std::string datastring=std::string(argv[1]);
    string_to_bitset(data,datastring); // string_to_bitset and bitset_to_string also empty the string and bitset vector; this is not the cause of the problem
    for (std::bitset<16>& C : data)
    {
        C = std::bitset<16>(modpow(C.to_ulong(),k,m)); // repeated squaring to compute C.to_ulong()^k % m
    }
    // and now the problem happens
    for (std::bitset<16>& C : data) std::cout<<C<<".to_ulong()="<<C.to_ullong()<<std::endl;
    std::cout<<std::endl;
    bitset_to_string(data,datastring);
    string_to_bitset(data,datastring);
    //bitset_to_string(data,datastring);
    for (std::bitset<16>& C : data) std::cout<<C<<".to_ulong()="<<C.to_ullong()<<std::endl;
    std::cout<<std::endl;
    return 0;
}
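(For reference, a generic repeated-squaring modpow, not necessarily identical to my omitted definition, has this shape; note the multiplications here would overflow unsigned long for a genuinely huge m:)

unsigned long modpow(unsigned long base, unsigned long exp, unsigned long mod)
{
    unsigned long result = 1;
    base %= mod;
    while (exp > 0)
    {
        if (exp & 1)
            result = (result * base) % mod; // overflows if mod*mod > ULONG_MAX
        base = (base * base) % mod;
        exp >>= 1;
    }
    return result;
}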
I am well aware that you are all now thinking that I am doing the modular power function wrong (which I guarantee I am not), but what I am doing to make this happen doesn't actually matter, for my question was not "what is wrong in my program"; my question was: why don't identical bitsets (which print identical binary 1's and 0's) convert to identical unsigned longs?
Other edit: I must also point out that the first printed unsigned long values are "correct" in that they allow me to decrypt the bitset perfectly, whereas the values printed afterwards are "wrong" in that they produce a completely wrong result.
The "11841744" value is correct in the lower 16 bits, but has some extra set bits above the 16th. This could be a bug in your STL implementation where to_long accesses bits past the 16 it should be using.
Or (from your comment above) you're adding more bits to the bitset than it can hold and you're experiencing Undefined Behavior.
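The arithmetic backs this up: constructing a std::bitset<16> from a value wider than 16 bits keeps only the low 16 bits, and 11841744 mod 65536 is exactly 45264, the value printed after the round trip. A minimal sketch demonstrating the truncation:

#include <bitset>
#include <iostream>

int main()
{
    // 11841744 is wider than 16 bits; bitset<16> keeps only the low 16,
    // so the stored value is 11841744 % 65536 == 45264.
    std::bitset<16> b(11841744UL);
    std::cout << b << " -> " << b.to_ulong() << '\n'; // 1011000011010000 -> 45264
    // A conforming to_ulong() on a bitset<16> can never exceed 65535, so the
    // earlier printout of 11841744 means the object carried stray bits above
    // bit 15, consistent with out-of-range writes (undefined behavior).
}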

Variable length int C++

I have a program that takes in a file that is just a list of sets, each with their own integer identifier, and then sorts the sets by the number of members.
File format (the number after R is the ID):
R1
0123
0000
R2
0321
R3
0002
...
struct CodeBook {
    string Residue;            // stores the R lines
    vector<string> CodeWords;  // stores the lines between the R lines
};
vector<CodeBook> DATA;
With each file I run, the number of sets gets larger, and currently I am just storing everything in the huge vector DATA. My latest file is large enough that I've exhausted the server's memory and am spilling over into swap. This will be the last file I process before I possibly switch to a more RAM-friendly algorithm. With that file, the number of sets is larger than an unsigned 32-bit int can hold.
I can calculate how many there will be, and the number of sets is important for calculation purposes, so overflow is not an option. Going all the way up to an unsigned long long int is not an option either, because I've already pretty much maxed out memory usage.
How could I implement a variable-length integer to store everything more efficiently, so I can calculate everything more efficiently?
Ex: small id ints get 1 or 2 bytes and the largest ints get 5 bytes
PS: Given the size of what I'm working with, speed is also a factor if it can be helped, but it's not the most important concern :/
Store everything in two huge vectors.
Define two structs with associated vectors:
struct u   { int id; unsigned count; };
struct ull { int id; unsigned long long count; };
std::vector<u> u_vector;
std::vector<ull> ull_vector;
Read the count into an unsigned long long. If the count fits in an unsigned, store it in u_vector, otherwise in ull_vector. Sort both vectors; output u_vector first and ull_vector second.
Don't be tempted to try doing the same thing with unsigned char - the structure will be the same size as u (because of padding to make the id aligned).
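A minimal sketch of this read-dispatch-sort scheme (restating the structs for completeness; the function names and output format are illustrative, and sorting is by count as the question requires):

#include <algorithm>
#include <cstdio>
#include <vector>

struct u   { int id; unsigned count; };
struct ull { int id; unsigned long long count; };

std::vector<u>   u_vector;
std::vector<ull> ull_vector;

// Dispatch one record: counts that fit in 32 bits go to the compact vector.
void store(int id, unsigned long long count)
{
    if (count <= 0xFFFFFFFFull)
        u_vector.push_back({id, static_cast<unsigned>(count)});
    else
        ull_vector.push_back({id, count});
}

void sort_and_print()
{
    auto by_count = [](const auto& x, const auto& y) { return x.count < y.count; };
    std::sort(u_vector.begin(), u_vector.end(), by_count);
    std::sort(ull_vector.begin(), ull_vector.end(), by_count);
    // Every count in u_vector fits in 32 bits, so the whole vector sorts
    // before ull_vector, whose counts all exceed 32 bits.
    for (const auto& r : u_vector)   std::printf("R%d %u\n", r.id, r.count);
    for (const auto& r : ull_vector) std::printf("R%d %llu\n", r.id, r.count);
}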

How to read and write data in 8-bit integer form with C++ file functions

Is it possible to store data in integer form from 0 to 255 rather than as 8-bit characters? Although both are the same thing, how can we do it, for example, with the write() function?
Is it OK to directly cast any integer to char and vice versa? Does something like
{
    int a[1] = {213};
    write((char*)a, 1);
}
and
{
    int a[1];
    read((char*)a, 1);
    cout << a[0];
}
work to get 213 back from the same location in the file? It may work on this computer, but is it portable; in other words, is it suitable for cross-platform projects? If I create a file format for each game level (which will store objects' coordinates in the current level's file) using this principle, will it work on other computers/systems/platforms so that the same level loads correctly?
The code you show would write the first (lowest-address) byte of a[0]'s object representation - which may or may not be the byte with the value 213. The particular object representation of an int is implementation-defined.
The portable way of writing one byte with the value of 213 would be
unsigned char c = a[0];
write(&c, 1);
You have the right idea, but it could use a bit of refinement.
{
    int intToWrite = 213;
    unsigned char byteToWrite = 0;
    if ( intToWrite > 255 || intToWrite < 0 )
    {
        doError();
        return;
    }
    // since your range is 0-255, you really want the low-order byte of the int.
    // Just reading the 1st byte may or may not work for your architecture. I
    // prefer to let the compiler handle the conversion via casting.
    byteToWrite = (unsigned char) intToWrite;
    write( &byteToWrite, sizeof(byteToWrite) );
    // you can hard-code the size, but I try to be in the habit of using sizeof
    // since it is better when dealing with multibyte types
}
{
    int a = 0;
    unsigned char toRead = 0;
    // just like the write, the byte ordering of the int will depend on your
    // architecture. You could write code to explicitly handle this, but it's
    // easier to let the compiler figure it out via implicit conversions
    read( &toRead, sizeof(toRead) );
    a = toRead;
    cout << a;
}
If you need to minimize space or otherwise can't afford the extra char sitting around, then it's definitely possible to read/write a particular byte of your integer. However, it can require pulling in extra headers (e.g. for htons/ntohs) or annoying platform #defines.
It will work, with some caveats:
Use reinterpret_cast<char*>(x) instead of (char*)x to be explicit that you’re performing a cast that’s ordinarily unsafe.
sizeof(int) varies between platforms, so you may wish to use a fixed-size integer type from <cstdint> such as int32_t.
Endianness can also differ between platforms, so you should be aware of the platform byte order and swap byte orders to a consistent format when writing the file. You can detect endianness at runtime and swap bytes manually, or use htonl and ntohl to convert between host and network (big-endian) byte order.
Also, as a practical matter, I recommend you prefer text-based formats—they’re less compact, but far easier to debug when things go wrong, since you can examine them in any text editor. If you determine that loading and parsing these files is too slow, then consider moving to a binary format.
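Putting those caveats together, a sketch of a portable fixed-width write/read might look like this (using POSIX htonl/ntohl; on Windows they live in winsock2.h):

#include <cstdint>
#include <istream>
#include <ostream>
#include <arpa/inet.h> // htonl/ntohl (POSIX)

// Write and read one int32_t in big-endian ("network") byte order, so the
// file means the same thing on every platform.
void write_i32(std::ostream& out, std::int32_t value)
{
    std::uint32_t wire = htonl(static_cast<std::uint32_t>(value));
    out.write(reinterpret_cast<const char*>(&wire), sizeof wire);
}

std::int32_t read_i32(std::istream& in)
{
    std::uint32_t wire = 0;
    in.read(reinterpret_cast<char*>(&wire), sizeof wire);
    return static_cast<std::int32_t>(ntohl(wire));
}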

Reversibly Combining Two uint32_t's Without Changing Datatype

Here's my issue: I need to pass back two uint32_t's via a single uint32_t (because of how the API is set up...). I can hard code whatever other values I need to reverse the operation, but the parameter passed between functions needs to stay a single uint32_t.
This would be trivial if I could just bit-shift the two 32-bit ints into a single 64-bit int (like what was explained here), but the compiler wouldn't like that. I've also seen mathematical pairing functions, but I'm not sure if that's what I need in this case.
I've thought of setting up a simple cipher: the uint32_t could be the ciphertext, and I could just hard-code the key. But that seems like overkill.
Is this even possible?
It is not possible to store more than 32 bits of information using only 32 bits. This is a basic result of information theory.
If you know that you're only using the low-order 16 bits of each value, you could shift one left 16 bits and combine them that way. But there's absolutely no way to get 64 bits worth of information (or even 33 bits) into 32 bits, period.
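In that 16-bit case the pack/unpack is just shifts and masks; a minimal sketch:

#include <cstdint>

// Pack two values, each known to fit in 16 bits, into one uint32_t.
uint32_t pack(uint32_t hi, uint32_t lo)
{
    return (hi << 16) | (lo & 0xFFFFu); // caller guarantees hi, lo <= 0xFFFF
}

void unpack(uint32_t packed, uint32_t& hi, uint32_t& lo)
{
    hi = packed >> 16;
    lo = packed & 0xFFFFu;
}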
Depending on how much trouble this is really worth, you could:
create a global array or vector of std::pair<uint32_t,uint32_t>
pass an index into the function, then your "reverse" function just looks up the result in the array.
write some code to decide which index to use when you have a pair to pass. The index needs to not be in use by anyone else, and since the array is global there may be thread-safety issues. Essentially what you are writing is a simple memory allocator.
As a special case, on a machine with 32 bit data pointers you could allocate the struct and reinterpret_cast the pointer to and from uint32_t. So you don't need any globals.
Beware that you need to know whether or not the function you pass the value into might store the value somewhere to be "decoded" later, in which case you have a more difficult resource-management problem than if the function is certain to have finished using it by the time it returns.
In the easy case, and if the code you're writing doesn't need to be re-entrant at all, then you only need to use one index at a time. That means you don't need an array, just one pair. You could pass 0 to the function regardless of the values, and have the decoder ignore its input and look in the global location.
If both special cases apply (32 bit and no retaining of the value), then you can put the pair on the stack, and use no globals and no dynamic allocation even if your code does need to be re-entrant.
None of this is really recommended, but it could solve the problem you have.
You can use an intermediate global data structure to store the pair of uint32_t on it, using your only uint32_t parameter as the index on the structure:
struct my_pair {
    uint32_t a, b;
};
std::map<uint32_t, my_pair> global_pair_map;
uint32_t next_key = 1;

uint32_t register_new_pair(uint32_t a, uint32_t b) {
    // Add the pair (a, b) to global_pair_map under a fresh key, and
    // return that key.
    uint32_t key = next_key++;
    global_pair_map[key] = my_pair{a, b};
    return key;
}
void release_pair(uint32_t key) {
    // Remove the key from global_pair_map.
    global_pair_map.erase(key);
}
void callback(uint32_t user_data) {
    my_pair& p = global_pair_map[user_data];
    // Use your pair of uint32_t via p.a and p.b.
}
int main() {
    uint32_t key = register_new_pair(number1, number2);
    register_callback(callback, key);
}

String container for large N

I am looking for a string container suited to a large number of strings (> 10^9). Strings have variable length, averaging about 10 bytes. The container must be fast for insertion and lookup and have frugal memory use. Strings are unordered when the container is filled, lookup is done on the exact string value, and erasability is optional. N is unknown in advance. This is for a 64-bit architecture. Use case: think of an associative array for AWK.
map<string> has about 20-40 bytes of overhead per string, and each insertion calls malloc once (or twice). So it is neither fast nor frugal.
Can someone point me to a C/C++ library, a data structure or a paper?
Relevant: Comparison of Hash Table Libraries
EDIT
I've removed "big data", bumped N to larger value, clarified requirements.
There is no silver bullet, but a radix tree gives the advantages of a trie (fast lookup and insert, at least asymptotically) with better space consumption.
However, both are considered not "cache efficient", which might matter, especially if iteration over the data is required at some point.
For your problem, a pointer on a 64-bit machine nearly matches the length of your data, so using multiple pointers per string (average length under 10 bytes) would make the size of the data structure dominate the size of your input.
One general way to deal with that is not to use pointers to represent your strings. A specialized representation using a 32-bit offset into a large page where all your strings get stored would halve the pointer memory requirements, at the cost of an addition to a base pointer to retrieve your string.
Edit: Below is a sample (untested) implementation of such a representation (using struct for the sake of simplicity; an actual implementation would of course make only the user interface public). The representation assumes hash-table insertion, hence leaving room for a next_ link. Note that the offsets are scaled by the size of hash_node, to allow representation in a 32-bit offset.
struct hash_node {
    uint32_t next_;
    char * str () { return (char *)(&next_ + 1); }
    const char * str () const { return (const char *)(&next_ + 1); }
};
struct hash_node_store {
    std::vector<hash_node> page_; /* holds the allocated memory for nodes */
    uint32_t free_;               /* offset of the first free node; starts at 1 */
    hash_node * ptr (uint32_t offset) {
        if (offset == 0) return 0;
        return &page_[offset-1];
    }
    uint32_t allocate (const char *str) {
        hash_node *hn = ptr(free_);
        uint32_t len = strlen(str) + 1;
        /* slots needed: one for next_, plus enough to hold the
           NUL-terminated string, rounded up to hash_node granularity */
        uint32_t node_size =
            1 + (len / sizeof(hash_node)) + !!(len % sizeof(hash_node));
        strcpy(hn->str(), str);
        free_ += node_size;
        return 1 + (hn - &page_[0]);
    }
};
A hash table would contain a node store, and a hash bucket vector.
struct hash_table {
    hash_node_store store_;
    std::vector<uint32_t> table_; /* holds allocated memory for buckets */
    uint32_t hash_func (const char *s) { /* ... */ }
    uint32_t find_at (uint32_t node_offset, const char *str);
    bool insert_at (uint32_t &node_offset, const char *str);
    bool insert (const char *str) {
        uint32_t bucket = hash_func(str) % table_.size();
        return insert_at(table_[bucket], str);
    }
    bool find (const char *str) {
        uint32_t bucket = hash_func(str) % table_.size();
        return find_at(table_[bucket], str);
    }
};
Where find_at and insert_at are just simple functions implemented in the expected way.
uint32_t hash_table::find_at (uint32_t node_offset, const char *str) {
    hash_node *hn = store_.ptr(node_offset);
    while (hn) {
        if (strcmp(hn->str(), str) == 0) break;
        node_offset = hn->next_;
        hn = store_.ptr(node_offset);
    }
    return node_offset;
}
bool hash_table::insert_at (uint32_t &node_offset, const char *str) {
    if (! find_at(node_offset, str)) {
        uint32_t new_node = store_.allocate(str);
        store_.ptr(new_node)->next_ = node_offset;
        node_offset = new_node;
        return true;
    }
    return false;
}
As you're only inserting values, the string data itself can be concatenated as it's inserted, each string followed by a delimiter character such as NUL. The character offset into that single buffer then uniquely identifies the string. This means that sets of strings sharing a common substring are each stored completely and redundantly, but in exchange no effort is spent trying to find or encode such sharing, which could backfire for highly unrelated string values (e.g. random text).
To find the strings, a hash table can be used. Given your aim of avoiding frequent dynamic memory allocations, collisions can be handled efficiently with displacement lists: when inserting a string that hashes to an already-used bucket, you add an offset (wrapping around the table if necessary) and try that other bucket, continuing until an empty bucket is found. This means you need a list of displacements to try: you can hand-code a finite list to get started, or even nest loops over a "big displacement" list whose values are added to those from a "small displacement" list until an empty bucket is found; for example, two hand-coded lists of 10 displacements yield 100 combinations. (Alternative hashing algorithms can be used instead of, or combined with, displacement lists.) You do need a reasonable ratio of total to used buckets, though; I'd expect something around 1.2 to work well typically, with larger values prioritising speed over space. You could populate your system with sample data and tune to taste.
So, the space requirement is:
total_bytes_of_string_data + N delimiters + total_buckets * sizeof(string_offset)
where sizeof(string_offset) probably needs to be 8 bytes, as 10^9 * 10 is already more than 2^32.
For 10^9 strings of ~10 characters and 1.2*10^9 buckets, this is around 10^9 * (10+1) + 1.2*10^9 * 8 bytes = 20.6*10^9 bytes, or about 19.2 GiB.
It's worth noting that 64 bit virtual address space means you can safely allocate much more space for the concatenated string data and hash table than you actually need, and only those pages actually accessed will require virtual memory (initially physical memory, but it could later be swapped to disk through normal virtual memory mechanisms).
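A compressed (and untested) sketch of the concatenated-buffer scheme with displacement probing; the hash function, displacement list, and sizes are illustrative placeholders, and it assumes the bucket headroom discussed above so probing terminates:

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

struct string_set {
    std::string buffer_;            // all strings concatenated, NUL-delimited
    std::vector<uint64_t> table_;   // offset+1 into buffer_; 0 marks an empty bucket
    static constexpr int displacements_[4] = {1, 7, 31, 127};

    explicit string_set(std::size_t buckets) : table_(buckets, 0) {}

    static uint64_t hash(const char* s) {             // FNV-1a, as a placeholder
        uint64_t h = 14695981039346656037ull;
        for (; *s; ++s) h = (h ^ (unsigned char)*s) * 1099511628211ull;
        return h;
    }

    // Returns true if s was inserted, false if it was already present.
    bool insert(const char* s) {
        uint64_t bucket = hash(s) % table_.size();
        for (int d = 0; ; ++d) {
            uint64_t& slot = table_[bucket];
            if (slot == 0) {                          // empty bucket: append and claim it
                slot = buffer_.size() + 1;
                buffer_.append(s);
                buffer_.push_back('\0');
                return true;
            }
            if (std::strcmp(buffer_.c_str() + (slot - 1), s) == 0)
                return false;                         // exact string already stored
            // try the next bucket in the displacement list
            bucket = (bucket + displacements_[d % 4]) % table_.size();
        }
    }
};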
Discussion
There's no way to guarantee reduction in the string memory usage without assumptions / insights about repetitions in the string data or character set used.
If all the insertions were followed by a huge number of searches, sorting the string data and using binary searches would be an ideal solution. But, for quick insertions interspersed with searches the above is reasonable.
You could also have an index based around balanced binary trees, but to avoid memory allocations for each insertion you'd need to group a lot of nodes into one memory page and manually manage the ordering and splitting thereof on a less granular level: painful to implement. There might be a library doing it already but I haven't heard of one.
You've added "associative arrays in AWK" as an example of what this could be used for. You could simply embed each mapped-to value immediately after its string key in the concatenated data.
Is a (low) false positive rate acceptable? If so, then Bloom filters would be an option. If you'd be content with a false positive rate of one in a million, or 2^(-20), you'd want to use a buffer size in bits of around 30 times the number of strings you expect, or 3*10^10 bits. That's less than 4GB. You'd also need around 20 independent hash functions.
If you can't accept false positives, you should consider putting a Bloom filter in front of whatever other solution you build, to weed out most negatives really quickly.
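A minimal Bloom-filter sketch; it uses double hashing to derive the k probe positions from two base hashes, a standard substitute for k truly independent hash functions:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

class bloom_filter {
    std::vector<bool> bits_;
    std::size_t k_;
public:
    bloom_filter(std::size_t num_bits, std::size_t num_hashes)
        : bits_(num_bits), k_(num_hashes) {}

    void add(const std::string& s) {
        std::uint64_t h1 = std::hash<std::string>{}(s);
        std::uint64_t h2 = std::hash<std::string>{}(s + '\x01'); // cheap second hash
        for (std::size_t i = 0; i < k_; ++i)
            bits_[(h1 + i * h2) % bits_.size()] = true;
    }

    // false: definitely absent; true: probably present (rate set by sizing)
    bool maybe_contains(const std::string& s) const {
        std::uint64_t h1 = std::hash<std::string>{}(s);
        std::uint64_t h2 = std::hash<std::string>{}(s + '\x01');
        for (std::size_t i = 0; i < k_; ++i)
            if (!bits_[(h1 + i * h2) % bits_.size()]) return false;
        return true;
    }
};

// Sizing per the answer above: ~30 bits and ~20 probes per string gives a
// ~2^-20 false-positive rate, e.g. bloom_filter(30000000000ull, 20) for
// 10^9 strings (under 4 GB of bits).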