cross correlation of two string in c++ - c++

Say I have two matrices of 1's and 0's. I want to store them as bool matrices, but OpenCV stores them as uchar Mat instead, so the memory use is 8 times larger (each element takes 8 bits instead of 1 bit).
My code is basically as follows:
Mat mat1, mat2; //I want each index to be 1 bit
load(mat1); //data size is not important in memory
load(mat2);
corr2(mat1, mat2); //this corr2 is same as Matlab's cross correlation.
I'm doing this part 10M times, so loading takes a lot of time. My matrices are 1K*1K, so I am able to store them in 1 MB each, but I want them to take 128 KB (Matlab stores them in approximately 178 KB).
Here is my question: I want to store my matrices as string and instead of Mat operation, I want to use string.
For example, size of mat1 and mat2 is 2*8.
mat1:
0 1 0 0 0 0 1 0 (66=B)
0 1 1 1 0 1 1 1 (119=w)
mat2:
0 1 0 0 0 0 1 1 (67=C)
0 1 1 1 1 0 0 0 (120=x)
I will store str1=Bw and str2=Cx
Is there a way to cross correlate str1, str2?
Thanks in advance,

Note: This is not an answer, but rather a long comment. I'm posting it as an answer in order to avoid spamming the comments section of the OP.
Storing 1M elements of numeric type is never going to be a problem on any modern computer.
You should learn a little bit more about C and memory storage; a bool is not stored as a single bit but occupies at least one addressable byte, so one-bit bool storage only exists virtually. Packing several bits into a char is a good idea, but you should have a look at C++'s std::bitset if you want to be efficient.
Understand that there might be a significant difference between the way you store your data on hard disk and the format that is best suited for processing in active memory (e.g. RAM). This is probably the reason behind the odd size of Matlab's storage; storing additional information and/or using seemingly inefficient storage units is often desirable to make the algorithms easier to write and the elementary operations execute faster on the CPU.
Overall, I think the advantage of switching to "bool-packed chars" storage like you suggest would be negligible in terms of processing speed, and it would certainly make the programming harder and the code more difficult to maintain. You are better off sticking with chars for the processing and switching to single-bit storage for write-on-disk operations.
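For completeness, if the packed strings are kept anyway, the correlation coefficient can be computed on them directly. The sketch below assumes that "corr2" means Matlab's 2-D correlation coefficient and that both matrices hold only 0/1 values packed 8 per char; under those assumptions the sums of A, B and A.*B reduce to bit counts. The names packed_corr2 and popcount8 are made up for this illustration, and returning 0.0 for a constant matrix is an arbitrary choice (the coefficient is undefined there).

```cpp
#include <bitset>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>

// Number of set bits in one packed byte.
inline std::size_t popcount8(unsigned char byte) {
    return std::bitset<8>(byte).count();
}

// Correlation coefficient of two equally sized 0/1 matrices stored
// 8 elements per char. For 0/1 data, sum(A) == sum(A^2) and
// sum(A .* B) is the bit count of (A & B).
// Returns 0.0 when either matrix is constant (coefficient undefined).
double packed_corr2(const std::string& a, const std::string& b) {
    assert(a.size() == b.size());
    const double n = 8.0 * static_cast<double>(a.size());  // element count
    double sa = 0.0, sb = 0.0, sab = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const unsigned char x = static_cast<unsigned char>(a[i]);
        const unsigned char y = static_cast<unsigned char>(b[i]);
        sa  += static_cast<double>(popcount8(x));
        sb  += static_cast<double>(popcount8(y));
        sab += static_cast<double>(popcount8(static_cast<unsigned char>(x & y)));
    }
    const double denom = std::sqrt((n * sa - sa * sa) * (n * sb - sb * sb));
    return denom == 0.0 ? 0.0 : (n * sab - sa * sb) / denom;
}
```

For the 2*8 bit rows in the question (bytes 0x42, 0x77 against 0x43, 0x78) this gives roughly 0.378. Processing 64 bits at a time with a hardware popcount would follow the same pattern.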

Related

Using part of a variable as bool

Let's say memory is precious, and I have a class with a uint32_t member variable ui and I know that the values will stay below 1 million. The class also has some bool members.
Does it make sense to use the highest (highest 2,3,..) bit(s) of ui in order to save memory, since bool is 1 byte?
If it does make sense, what is the most efficient way to get the highest (leftmost?) bit (or 2nd)? I read a few old threads and there seems to be disagreement about using inline ASM or some sort of shift.
It's a bit dangerous to use some of the bits as a bool. Because of the way numbers are represented in binary, it is hard to keep that scheme correct.
Negative numbers are stored in two's complement form. Check this for more explanation. You may assign the number 10 and then flip the bool bit from false to true, and the number may turn into a huge negative number as a result.
As for getting if n-th bit is 0 or 1 you can use this, where 0-th bit is the right most:
int nth_bit(unsigned int a, int n){
    // Shift the n-th bit down to position 0, then mask everything else off.
    return (a >> n) & 1u;
}
It will return 0 or 1 identifying the n-th bit.
Well, if the memory is in fact precious, you should look deeper.
1,000,000 uses only 20 bits. This is less than 3 bytes. So you can allocate 3 bytes to keep your value and up to four booleans. Obviously, access will be a bit more complicated, but you save 25% of memory!
If you know that the values are below 524,288 (19 bits), for example, you can save about another 15% by packing it (with bool) into 20 bits :)
Also, keeping bool in a separate array (as you said in a comment) would kill performance if you need to access the value and a corresponding bool simultaneously because they are far apart and will likely never be in a cache.
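A minimal sketch of the packing idea, assuming the value lives in the low 20 bits of an unsigned 32-bit word and the remaining 12 bits are used as flags. The layout and the helper names (get_value, set_flag, and so on) are invented for this example. Using an unsigned type sidesteps the sign-bit problem mentioned above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout: bits 0..19 hold the value, bits 20..31 hold flags.
constexpr std::uint32_t kValueBits = 20;
constexpr std::uint32_t kValueMask = (std::uint32_t{1} << kValueBits) - 1;

inline std::uint32_t get_value(std::uint32_t packed) {
    return packed & kValueMask;
}

// Replaces the value and leaves the flag bits untouched.
inline std::uint32_t set_value(std::uint32_t packed, std::uint32_t value) {
    assert(value <= kValueMask);
    return (packed & ~kValueMask) | value;
}

// flag 0 is the lowest flag bit (bit 20), flag 11 is the topmost bit.
inline bool get_flag(std::uint32_t packed, unsigned flag) {
    assert(flag < 32 - kValueBits);
    return ((packed >> (kValueBits + flag)) & 1u) != 0;
}

inline std::uint32_t set_flag(std::uint32_t packed, unsigned flag, bool on) {
    assert(flag < 32 - kValueBits);
    const std::uint32_t bit = std::uint32_t{1} << (kValueBits + flag);
    return on ? (packed | bit) : (packed & ~bit);
}
```

Plain shifts and masks like these compile to one or two instructions on common compilers, so inline assembly is unlikely to buy anything here.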

Relinearizing one Ciphertext in SEAL

Let's say I calculated the addition or multiplication of 2 Ciphertexts and saved the result in a third one. If I want to perform additional mathematical operations on my result Ciphertext (destination Ciphertext), is it advisable to use evaluator.relinearize() on it before doing so? Because if I understood it correctly, some operations on Ciphertexts cause the size of the result Ciphertext to be larger than 2. If yes, then would this be a good approach for relinearizing one Ciphertext?
EvaluationKeys ev_keys;
int size = result.size();
keygen.generate_evaluation_keys(size - 2, ev_keys); // We need size - 2 ev_keys for performing this relinearization.
evaluator.relinearize(result, ev_keys);
Only Evaluator::multiply will increase the size of your ciphertext. Every ciphertext has size at least 2 (fresh encryptions have size 2), and multiplying ciphertexts of sizes a and b results in a ciphertext of size a+b-1. Thus, multiplying two ciphertexts of size 2 you'll end up with a ciphertext of size 3. In almost all cases you'll want to relinearize at this point to bring the size back down to 2, as further operations on size 3 ciphertexts can be significantly more computationally costly.
There are some exceptions to this rule: say you want to compute the sum of many products. In this case you might want to relinearize only the final sum and not the individual summands, since computing sums of size 3 ciphertexts is still very fast.
To make relinearization possible, the party who generates keys needs to also generate evaluation keys as follows:
EvaluationKeys ev_keys;
keygen.generate_evaluation_keys(60, ev_keys);
Later the evaluating party can use these as:
evaluator.relinearize(result, ev_keys);
Here I used 60 as the decomposition_bit_count in generate_evaluation_keys, which is the fastest and most often the best choice. You should probably never use an int count parameter different from 1 (default) in generate_evaluation_keys. This is meant for use-cases where you let your ciphertexts grow in size beyond 3 and need to bring them down from e.g. size 4 or 5 down to 2.

rdbuf vs getline vs ">>"

I want to load a map from a text file (If you can come up with whatever else way to load a map to an array, I'm open for anything new).
What's written in the text file is something like this, but on a larger scale.
6 6 10 (Nevermind what this number "10" is but the two other are the map size.)
1 1 1 1 1 1
1 0 2 0 0 1
1 0 0 0 2 1
1 2 2 0 0 1
1 0 0 0 0 1
1 1 1 1 1 1
Where 1 is border, 0 is empty, 2 is wall.
Now I want to read this text file, but I'm not sure which way would be best.
What I have in mind so far is:
Reading the whole text file at once in a stringstream and convert it to string later via rdbuf() and then split the string and put it in the array.
Reading it number by number via getline().
Reading it number by number using the >> operator.
My question is which of the mentioned ways (or any other way, if available) is better in terms of RAM use and speed.
Note: I'd also like to know whether or not using rdbuf() is a good way. I'd appreciate a good comparison between different ways of splitting a string, for example splitting a text into words at whitespace.
Where 1 is border, 0 is empty, 2 is wall. Now i want to read this text file but I'm not sure what way would be best. What i have in mind yet is:
You don't have enough data to make a significant impact on performance by any of the means you mentioned. In other words, concentrate on correctness and robustness of your program then come back and optimize the parts that are slow.
Reading the whole text file at once in a stringstream and convert it to string later via rdbuf() and then split the string and put it in the array.
The best method for inputting data is to keep the input stream flowing. This usually means reading large chunks of data per transaction versus many small transactions of small quantities. Memory is a lot faster to search and process than an input stream.
I suggest using istream::read before using rdbuf. For either one, I recommend reading into a preallocated area of memory, that is either an array or if using string, reserve a large space in the string when constructing it. You don't want the reallocation of std::string data to slow your program.
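As a rough illustration of the "read everything in one go" route, here is a sketch that uses the rdbuf() variant from the question rather than istream::read; the function name slurp is made up. The whole stream is copied into one string, which can then be parsed from memory (for instance through a std::istringstream).

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Copies everything remaining in `in` into one string with a single
// bulk transfer between stream buffers.
std::string slurp(std::istream& in) {
    assert(in.good());
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}
```

With a std::ifstream opened on the map file, slurp(file) would return the file's text. For a map of this size the difference from reading number by number is unlikely to be measurable.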
Reading it number by number via getline().
Since your data is line oriented, this could be beneficial. You read one row and process that row. It is a good technique to start with; it is a bit more complicated than the one below, but simpler than the previous method.
Reading it number by number using the >> operator.
IMO, this is the technique you should be using. The technique is simple and easy to get working; enabling you to work on the remainder of your project.
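A minimal sketch of the >> approach, assuming the file starts with three integers (taken here as rows, columns and the extra number) followed by rows*cols cell values. The Map struct and load_map are names invented for this example, and which of the two header numbers is the row count is a guess.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

struct Map {
    int rows = 0, cols = 0, extra = 0;  // `extra` is the third header number
    std::vector<int> cells;             // row-major, rows * cols entries

    int at(int r, int c) const {
        assert(r >= 0 && r < rows && c >= 0 && c < cols);
        return cells[static_cast<std::size_t>(r) * static_cast<std::size_t>(cols)
                     + static_cast<std::size_t>(c)];
    }
};

// Reads "rows cols extra" followed by rows*cols integers.
// Returns false if the header is invalid, the stream ends early,
// or non-numeric text is encountered; `out` is left untouched then.
bool load_map(std::istream& in, Map& out) {
    Map m;
    if (!(in >> m.rows >> m.cols >> m.extra) || m.rows <= 0 || m.cols <= 0)
        return false;
    m.cells.resize(static_cast<std::size_t>(m.rows) *
                   static_cast<std::size_t>(m.cols));
    for (int& cell : m.cells)
        if (!(in >> cell)) return false;
    out = m;
    return true;
}
```

The same function works on a std::ifstream or on a std::istringstream, which makes it easy to test without touching the disk.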
Changing the Data Format
If you want to make the input faster, you can change the format of the data. Binary data, data that doesn't need translations, is the fastest format to read. It bypasses the translation of textual format to internal representation. The binary data is the internal representation.
One of the caveats to binary data is that it is hard to read and modify.
Optimizing
Don't. Focus on finishing the project: correctly and robustly.
Don't. Usually, the time you gain is wasted in waiting for I/O or the User. Development time is costly, unnecessary optimization is a waste of development time thus a waste of money.
Profile your executable. Optimize the parts that occupy the most execution time.
Reduce requirements / Features before changing code.
Optimize the design or architecture before changing the code.
Change compiler optimization settings before changing the code.
Change data structures & alignment for cache optimization.
Optimize I/O if your program is I/O bound.
Reduce branches / jumps / changes in execution flow.

Fast code for searching bit-array for contiguous set/clear bits?

Is there some reasonably fast code out there which can help me quickly search a large bitmap (a few megabytes) for runs of contiguous zero or one bits?
By "reasonably fast" I mean something that can take advantage of the machine word size and compare entire words at once, instead of doing bit-by-bit analysis which is horrifically slow (such as one does with vector<bool>).
It's very useful for e.g. searching the bitmap of a volume for free space (for defragmentation, etc.).
Windows has an RTL_BITMAP data structure one can use along with its APIs.
But I needed the code for this sometime ago, and so I wrote it here (warning, it's a little ugly):
https://gist.github.com/3206128
I have only partially tested it, so it might still have bugs (especially on reverse). But a recent version (only slightly different from this one) seemed to be usable for me, so it's worth a try.
The fundamental operation for the entire thing is being able to -- quickly -- find the length of a run of bits:
long long GetRunLength(
const void *const pBitmap, unsigned long long nBitmapBits,
long long startInclusive, long long endExclusive,
const bool reverse, /*out*/ bool *pBit);
Everything else should be easy to build upon this, given its versatility.
I tried to include some SSE code, but it didn't noticeably improve the performance. However, in general, the code is many times faster than doing bit-by-bit analysis, so I think it might be useful.
It should be easy to test if you can get a hold of vector<bool>'s buffer somehow -- and if you're on Visual C++, then there's a function I included which does that for you. If you find bugs, feel free to let me know.
I can't figure how to do well directly on memory words, so I've made up a quick solution which is working on bytes; for convenience, let's sketch the algorithm for counting contiguous ones:
Construct two tables of size 256 in which you store, for each number between 0 and 255, the number of consecutive 1's at the beginning of the byte and at the end of the byte. For example, for the number 167 (10100111 in binary), put 1 in the first table and 3 in the second table. Let's call the first table BBeg and the second table BEnd. Then, for each byte b, there are two cases: if it is 255, add 8 to the running length of your current contiguous set of ones, and you are still in a region of ones. Otherwise, you end a region with BBeg[b] bits and begin a new one with BEnd[b] bits.
Depending on what information you want, you can adapt this algorithm (this is a reason why I don't put here any code, I don't know what output you want).
A flaw is that it does not count (small) contiguous set of ones inside one byte ...
Beside this algorithm, a friend tells me that if it is for disk compression, just look for bytes different from 0 (empty disk area) and 255 (full disk area). It is a quick heuristic to build a map of what blocks you have to compress. Maybe it is beyond the scope of this topic ...
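The table-based idea can be sketched as follows for one possible output, the length of the longest run of ones. Bits are taken most-significant first within each byte, matching the 167 example. A third table (inner, an addition that is not part of the description above) covers runs that lie entirely inside one byte, which addresses the flaw mentioned. All names here are made up for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct RunTables {
    std::array<std::uint8_t, 256> beg{};    // ones at the start (MSB side)
    std::array<std::uint8_t, 256> end{};    // ones at the end (LSB side)
    std::array<std::uint8_t, 256> inner{};  // longest run anywhere in the byte
};

RunTables make_run_tables() {
    RunTables t;
    for (int b = 0; b < 256; ++b) {
        int lead = 0, trail = 0, best = 0, cur = 0;
        while (lead < 8 && (b & (0x80 >> lead))) ++lead;
        while (trail < 8 && (b & (1 << trail))) ++trail;
        for (int i = 0; i < 8; ++i) {
            cur = (b & (1 << i)) ? cur + 1 : 0;
            best = std::max(best, cur);
        }
        t.beg[static_cast<std::size_t>(b)] = static_cast<std::uint8_t>(lead);
        t.end[static_cast<std::size_t>(b)] = static_cast<std::uint8_t>(trail);
        t.inner[static_cast<std::size_t>(b)] = static_cast<std::uint8_t>(best);
    }
    // Worked example from the text: 167 = 10100111.
    assert(t.beg[167] == 1 && t.end[167] == 3);
    return t;
}

std::size_t longest_run_of_ones(const std::vector<std::uint8_t>& bytes) {
    static const RunTables t = make_run_tables();
    std::size_t best = 0;
    std::size_t current = 0;  // run of ones reaching the end of the bytes seen so far
    for (std::uint8_t b : bytes) {
        if (b == 0xFF) {
            current += 8;  // still inside a region of ones
        } else {
            // The open run ends with beg[b] more bits; a new one starts with end[b].
            best = std::max({best, current + t.beg[b], std::size_t{t.inner[b]}});
            current = t.end[b];
        }
    }
    return std::max(best, current);
}
```

Counting runs of zeros works the same way on the inverted bytes.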
Sounds like this might be useful:
http://www.aggregate.org/MAGIC/#Population%20Count%20%28Ones%20Count%29
and
http://www.aggregate.org/MAGIC/#Leading%20Zero%20Count
You don't say if you wanted to do some sort of RLE or to simply count in-bytes zeros and one bits (like 0b1001 should return 1x1 2x0 1x1).
A lookup table plus a SWAR algorithm for a fast check might give you that information easily.
A bit like this:
byte lut[0x10000] = { /* see below */ };
for (uint * word = words; word < words + bitmapSize; word++) {
    if (*word == 0 || *word == (uint)-1) // Fast bailout (compare the value, not the pointer)
    {
        // Do what you want if all 0 or all 1
        continue;
    }
    byte hiVal = lut[*word >> 16], loVal = lut[*word & 0xFFFF];
    // Do what you want with hiVal and loVal
}
The LUT will have to be constructed depending on your intended algorithm. If you want to count the number of contiguous 0 and 1 in the word, you'll build it like this:
for (int i = 0; i < sizeof(lut); i++)
lut[i] = countContiguousZero(i); // Or countContiguousOne(i)
// The implementation of countContiguousZero can be slow, you don't care
// The result of the function should return the largest number of contiguous zero (0 to 15), using the 4 low bits of the byte, and might return the position of the run in the 4 high bits of the byte
// Since you've already dismissed word = 0, you don't need the 16 contiguous zero case.
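One way the table could be filled in, as a sketch: countContiguousZero is taken from the snippet above, while build_zero_run_lut and longest_zero_run_in_halves are names invented here. This version stores only the run length, not the position. Note that a 16-bit half can be all zero even when the full word is not, so this sketch lets the table hold values up to 16.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Longest run of zero bits within the low 16 bits of `v` (0..16).
// Slow bit-by-bit loop; it only runs while the table is being built.
int countContiguousZero(std::uint32_t v) {
    int best = 0, cur = 0;
    for (int bit = 0; bit < 16; ++bit) {
        cur = ((v >> bit) & 1u) ? 0 : cur + 1;
        if (cur > best) best = cur;
    }
    return best;
}

// One entry per possible 16-bit half-word.
std::vector<std::uint8_t> build_zero_run_lut() {
    std::vector<std::uint8_t> lut(0x10000);
    for (std::uint32_t i = 0; i < lut.size(); ++i)
        lut[i] = static_cast<std::uint8_t>(countContiguousZero(i));
    return lut;
}

// Longest zero run lying entirely inside the upper or the lower half of
// `word`. Runs that cross the 16-bit boundary are NOT merged here.
int longest_zero_run_in_halves(std::uint32_t word,
                               const std::vector<std::uint8_t>& lut) {
    assert(lut.size() == 0x10000);
    const int hi = lut[word >> 16];
    const int lo = lut[word & 0xFFFF];
    return hi > lo ? hi : lo;
}
```

Merging runs across the half-word and word boundaries would need leading/trailing counts as well, in the same spirit as the byte tables described earlier.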

how to efficiently access 3^20 vectors in a 2^30 bits of memory

I want to store a 20-dimensional array where each coordinate can have 3 values,
in a minimal amount of memory (2^30 or 1 Gigabyte).
It is not a sparse array, I really need every value.
Furthermore I want the values to be integers of arbitrary but fixed precision,
say 256 bits or 8 words
example;
set_big_array(1,0,0,0,1,2,2,0,0,2,1,1,2,0,0,0,1,1,1,2, some_256_bit_value);
and
get_big_array(1,0,0,0,1,2,2,0,0,2,1,1,2,0,0,0,1,1,1,2, &some_256_bit_value);
Because 3 is relatively prime to 2, it's difficult to implement this using
efficient bitwise shift, and and or operators.
I want this to be as fast as possible.
any thoughts?
Seems tricky to me without some compression:
3^20 = 3486784401 values to store
256bits / 8bitsPerByte = 32 bytes per value
3486784401 * 32 = 111577100832 size for values in bytes
111577100832 / (1024^3) = 104 GB
You're trying to fit 104 GB in 1 GB. There'd need to be some pattern to the data that could be used to compress it.
Sorry, I know this isn't much help, but maybe you can rethink your strategy.
There are 3.48e9 variants of 20-tuple of indexes that are 0,1,2. If you wish to store a 256 bit value at each index, that means you're talking about 8.92e11 bits - about a terabit, or about 100GB.
I'm not sure what you're trying to do, but that sounds computationally expensive. It may be reasonably feasible as a memory-mapped file, and may be reasonably fast as a memory-mapped file on an SSD.
What are you trying to do?
So, a practical solution would be to use a 64-bit OS and a large memory-mapped file (preferably on an SSD) and simply compute the address for a given element in the typical way for arrays, i.e. as sum-of(forall-i(i-th-index * 3^i)) * 32 bytes in pseudo-math. Or, use a very very expensive machine with that much memory, or another algorithm that doesn't require this array in the first place.
A few notes on platforms: Windows 7 supports just 192GB of memory, so using physical memory for a structure like this is possible but really pushing it (more expensive editions support more), if you can find such a machine at all, that is. According to Microsoft's page on the matter, the user-mode virtual address space is 7-8TB, so mmap/virtual memory should be doable. Alex Ionescu explains why there's such a low limit on virtual memory despite an apparently 64-bit architecture. Wikipedia puts Linux's addressable limits at 128TB, though probably that's before the kernel/usermode split.
Assuming you want to address such a multidimensional array, you must process each index at least once: that means any algorithm will be O(N) where N is the number of indexes. As mentioned before, you don't need to convert to base-2 addressing or anything else, the only thing that matters is that you can compute the integer offset - and which base the maths happens in is irrelevant. You should use the most compact representation possible and ignore the fact that each dimension is not a multiple of 2.
So, for a 16-dimensional array, that address computation function could be:
int offset = 0;
for(int ii=0;ii<16;ii++)
offset = offset*3 + indexes[ii];
return &the_array[offset];
As previously said, this is just the common array indexing formula, nothing special about it. Note that even for "just" 16 dimensions, if each item is 32 bytes, you're dealing with a little more than a gigabyte of data.
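For the full 20 dimensions the same formula applies, but the offset no longer fits in a 32-bit int (3^20 is larger than 2^31), so a 64-bit type is needed. A sketch under that assumption, with made-up names (element_index, byte_offset) and the indexes passed as an array of 20 digits:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kDims = 20;          // dimensions in the question
constexpr std::uint64_t kValueBytes = 32;  // one 256-bit value

// Horner-style evaluation of the base-3 index.
// The result is an element number in [0, 3^20).
std::uint64_t element_index(const std::array<std::uint8_t, kDims>& idx) {
    std::uint64_t offset = 0;
    for (std::uint8_t digit : idx) {
        assert(digit < 3);
        offset = offset * 3 + digit;
    }
    return offset;
}

// Byte position of the element inside a flat file or mapping.
std::uint64_t byte_offset(const std::array<std::uint8_t, kDims>& idx) {
    return element_index(idx) * kValueBytes;
}
```

The byte offset could then be added to the base address of a memory-mapped file, as suggested above.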
Maybe i understand your question wrong. But can't you just use a normal array?
INT256 bigArray[3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3];
OR
INT256 ********************bigArray = malloc(3^20 * 8);
bigArray[1][0][0][1][2][0][1][1][0][0][0][0][1][1][2][1][1][1][1][1] = some_256_bit_value;
etc.
Edit:
Will not work because you would need 3^20 * 32 bytes = ca. 104 GByte (a 256-bit value takes 32 bytes, not 8).
The malloc variant is wrong.
I'll start by doing a direct calculation of the address, then see if I can optimize it
address = 0;
for(i=15; i>=0; i--)
{
address = 3*address + array[i];
}
address = address * number_of_bytes_needed_for_array_value;
2^30 bits is 2^27 bytes so not actually a gigabyte, it's an eighth of a gigabyte.
It appears impossible to do because of the mathematics, although of course you can create the data at a bigger size and then compress it, which may get you down to the required size, although this cannot be guaranteed. (It must fail some of the time, as the compression is lossless.)
If you do not require immediate "random" access your solution may be a "variable sized" two-bit word so your most commonly stored value takes only 1 bit and the other two take 2 bits.
If 0 is your most common value then:
0 = 0
10 = 1
11 = 2
or something like that.
In that case you will be able to store your bits in sequence this way.
It could take up to 2^40 bits this way but probably will not.
You could pre-run through your data and see which is the commonly occurring value and use that to indicate your single-bit word.
You can also compress your data after you have serialized it in up to 2^40 bits.
My assumption here is that you will be using disk possibly with memory mapping as you are unlikely to have that much memory available.
My assumption is that space is everything and not time.
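The variable-length code above could look like this in practice. This is only a sketch: it assumes 0 is the most common value, handles a sequence of three-valued symbols, uses std::vector<bool> as a simple bit container, and the names encode_trits and decode_trits are invented. Because the codes have different lengths, decoding has to proceed sequentially from the start, which is why random access is lost.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Prefix code from the text: 0 -> "0", 1 -> "10", 2 -> "11".
std::vector<bool> encode_trits(const std::vector<std::uint8_t>& trits) {
    std::vector<bool> bits;
    for (std::uint8_t t : trits) {
        assert(t < 3);
        if (t == 0) {
            bits.push_back(false);
        } else {
            bits.push_back(true);
            bits.push_back(t == 2);
        }
    }
    return bits;
}

// Sequential decoder: a 0 bit is a complete symbol, a 1 bit means
// the next bit selects between the values 1 and 2.
std::vector<std::uint8_t> decode_trits(const std::vector<bool>& bits) {
    std::vector<std::uint8_t> trits;
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (!bits[i]) {
            trits.push_back(0);
            continue;
        }
        assert(i + 1 < bits.size());  // a leading 1 must be followed by one more bit
        ++i;
        trits.push_back(static_cast<std::uint8_t>(bits[i] ? 2 : 1));
    }
    return trits;
}
```

Six symbols such as 0, 1, 2, 0, 0, 2 take 9 bits here instead of 12 with a fixed two-bit encoding.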
You might want to take a look at something like STXXL, an implementation of the STL designed for handling very large volumes of data
You can actually use a pointer-to-array20 to have your compiler implement the index calculations for you:
/* Note: there are 19 of the [3]'s below */
my_256bit_type (*foo)[3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3];
foo = allocate_giant_array();
foo[0][1][1][0][2][1][2][2][0][2][1][0][2][1][0][0][2][1][0][0] = some_256bit_value;