OpenCL: Confusing results depending on local_item_size - c++

My code performs 2D matrix multiplication (based on http://gpgpu-computing4.blogspot.de/2009/09/matrix-multiplication-2-opencl.html).
The dimensions of the matrices are 1000*1000, 10000*10000 and 100000*100000.
My Hardware is: NVIDIA Corporation GM204 [GeForce GTX 980] (MAX_WORK_GROUP_SIZES: 1024 1024 64).
The question is:
I am getting some confusing results depending on local_item_size and I need to understand what is happening:
1000 X 1000 matrices & local_item_size = 16: INVALID_WORKGROUP_SIZE.
1000 X 1000 matrices & local_item_size = 8: WORKS :).
1000 X 1000 matrices & local_item_size = 10: WORKS :) (though the execution time with 8 was better).
10000 X 10000 matrices & local_item_size = 8 or 16: CL_OUT_OF_RESOURCES.
Thanks in advance,

To your second question, this is the reasoning behind the results:
1000 / 8 = 125, ok.
1000 / 16 = 62.5, wrong! INVALID_WORKGROUP_SIZE: the global work size must be an integer multiple of the local work size in each dimension.
1000 / 10 = 100, ok, but 10 and multiples of 10 will never fully use the GPU cores. I.e.: if you have 16 warps, 6 are wasted; if you have 32, 2 are wasted, and so on.
10000 x 10000 = 400 MB (at least, if using floats) for just one input matrix, so something is getting too big for the device memory, hence CL_OUT_OF_RESOURCES.
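A common way around the divisibility problem, assuming the kernel guards against out-of-range indices, is to round the global work size up to the next multiple of the work-group size on the host. A minimal sketch (not the original poster's code; names are illustrative):
// Round the global NDRange up so that e.g. 1000x1000 works with a 16x16 group.
// Assumes the kernel itself discards work-items whose global id is >= N.
#include <CL/cl.h>
#include <cstddef>

size_t roundUp(size_t value, size_t multiple) {
    return ((value + multiple - 1) / multiple) * multiple;
}

void enqueueMatMul(cl_command_queue queue, cl_kernel kernel, size_t N) {
    const size_t local[2]  = {16, 16};               // 256 work-items, below the 1024 limit
    const size_t global[2] = {roundUp(N, local[0]),  // 1000 -> 1008
                              roundUp(N, local[1])};
    clEnqueueNDRangeKernel(queue, kernel, 2, nullptr, global, local,
                           0, nullptr, nullptr);
}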


Optimal way to compress 60 bit string

Given 15 random hexadecimal digits (60 bits) where there is always at least 1 duplicate in every 20-bit run (5 hexadecimals).
What is the optimal way to compress the bytes?
Here are some examples:
01230 45647 789AA
D8D9F 8AAAF 21052
20D22 8CC56 AA53A
AECAB 3BB95 E1E6D
9993F C9F29 B3130
Initially I tried Huffman encoding on just 20 bits, because Huffman coding can go from 20 bits down to ~10 bits, but storing the table takes more than 9 bits.
Here is the breakdown showing 20 bits -> 10 bits for 01230:
Character Frequency Assignment Space Savings
0 2 0 2×4 - 2×1 = 6 bits
2 1 10 1×4 - 1×2 = 2 bits
1 1 110 1×4 - 1×3 = 1 bit
3 1 111 1×4 - 1×3 = 1 bit
I then tried Huffman encoding on all 300 bits (five 60-bit runs), and here is the mapping given the above example:
Character Frequency Assignment Space Savings
---------------------------------------------------------
a 10 101 10×4 - 10×3 = 10 bits
9 8 000 8×4 - 8×3 = 8 bits
2 7 1111 7×4 - 7×4 = 0 bits
3 6 1101 6×4 - 6×4 = 0 bits
0 5 1100 5×4 - 5×4 = 0 bits
5 5 1001 5×4 - 5×4 = 0 bits
1 4 0010 4×4 - 4×4 = 0 bits
8 4 0111 4×4 - 4×4 = 0 bits
d 4 0101 4×4 - 4×4 = 0 bits
f 4 0110 4×4 - 4×4 = 0 bits
c 4 1000 4×4 - 4×4 = 0 bits
b 4 0011 4×4 - 4×4 = 0 bits
6 3 11100 3×4 - 3×5 = -3 bits
e 3 11101 3×4 - 3×5 = -3 bits
4 2 01000 2×4 - 2×5 = -2 bits
7 2 01001 2×4 - 2×5 = -2 bits
This yields a savings of 8 bits overall, but 8 bits isn't enough to store the Huffman table. It seems that, because of the randomness of the data, the more bits you try to encode with Huffman the less effective it becomes. Huffman encoding seemed to work best with 20 bits (50% reduction), but storing the table in 9 or fewer bits isn't possible AFAIK.
In the worst case for a 60-bit string there are still at least 3 duplicates; in the average case there are more than 3 duplicates (my assumption). As a result of the at least 3 duplicates, the most distinct symbols you can have in a run of 60 bits is just 12.
Because of the duplicates, plus having fewer than 16 distinct symbols, I can't help but feel there is some type of compression that can be used.
If I simply count the number of 20-bit values with at least two hexadecimal digits equal, there are 524,416 of them. A smidge more than 2^19. So the most you could possibly save is a little less than one bit out of the 20.
Hardly seems worth it.
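A quick sanity check of that count (a small sketch, not part of the original answer): of the 16^5 = 1,048,576 possible 5-digit values, 16*15*14*13*12 = 524,160 have all digits distinct, which leaves 524,416 with at least one duplicate.
// Count the 20-bit (5 hex digit) values containing at least one repeated digit.
#include <cmath>
#include <cstdio>

int main() {
    long long total    = 1LL << 20;                 // 16^5 possible values
    long long distinct = 16LL * 15 * 14 * 13 * 12;  // all five digits different
    long long withDup  = total - distinct;          // at least one duplicate
    std::printf("%lld values, log2 = %.4f\n", withDup, std::log2((double)withDup));
    // prints: 524416 values, log2 = 19.0004 -- a smidge more than 2^19
    return 0;
}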
If I split your question in two parts:
How do I compress (perfect) random data: You can't. Every bit is some new entropy which can't be "guessed" by a compression algorithm.
How to compress "one duplicate in five characters": There are exactly 10 options for where the duplicate can be (see the table below). This is basically the entropy. Just store which option it is (maybe grouped for the whole line).
These are the options:
AAbcd = 1 AbAcd = 2 AbcAd = 3 AbcdA = 4 (<-- cases where first character is duplicated somewhere)
aBBcd = 5 aBcBd = 6 aBcdB = 7 (<-- cases where second character is duplicated somewhere)
abCCd = 8 abCdC = 9 (<-- cases where third character is duplicated somewhere)
abcDD = 0 (<-- case where the last two characters are duplicates)
So for your first example:
01230 45647 789AA
The first one (01230) is option 4, the second (45647) is option 3 and the third (789AA) is option 0.
You can compress this by multiplying each consecutive option by 10: (4*10 + 3)*10 + 0 = 430.
And uncompress it by using divide and modulo: 430 % 10 = 0, (430 / 10) % 10 = 3, (430 / 10 / 10) % 10 = 4. So you could store your number like this:
1AE 0123 4567 789A
^^^ this is 430 in hex and requires only 10 bits
The three options combined can take 1000 different values (0-999), so 10 bits are enough.
Compared to storing these 3 characters normally you save 2 bits. As someone else already commented, this is probably not worth it. For the whole line it's even less: 2 bits / 60 bits = 3.3% saved.
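A small sketch of that pack/unpack scheme (illustrative code, not from the answer above): three option digits, each 0-9, packed into one value below 1000, which fits in 10 bits.
#include <cassert>
#include <cstdint>

// Pack three duplicate-position options (each 0-9) into a single value < 1000.
uint16_t pack(int opt1, int opt2, int opt3) {
    return static_cast<uint16_t>((opt1 * 10 + opt2) * 10 + opt3);
}

// Recover the three options with divide and modulo.
void unpack(uint16_t packed, int &opt1, int &opt2, int &opt3) {
    opt3 = packed % 10;
    opt2 = (packed / 10) % 10;
    opt1 = (packed / 100) % 10;
}

int main() {
    // "01230 45647 789AA" -> options 4, 3, 0 as worked out above
    uint16_t p = pack(4, 3, 0);   // = 430 = 0x1AE
    assert(p == 430);
    int a, b, c;
    unpack(p, a, b, c);
    assert(a == 4 && b == 3 && c == 0);
    return 0;
}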
If you want to get rid of the duplicates first, do this and then look at the links at the bottom; if you don't want to get rid of the duplicates, still look at the links at the bottom:
// Helper: linear search for a value in the array.
Array.prototype.contains = function(v) {
    for (var i = 0; i < this.length; i++) {
        if (this[i] === v) return true;
    }
    return false;
};
// Return a new array with the duplicates removed.
Array.prototype.unique = function() {
    var arr = [];
    for (var i = 0; i < this.length; i++) {
        if (!arr.contains(this[i])) {
            arr.push(this[i]);
        }
    }
    return arr;
};
var duplicates = [1, 3, 4, 2, 1, 2, 3, 8];
var uniques = duplicates.unique(); // result = [1, 3, 4, 2, 8]
console.log(uniques);
That would shorten the data you have to deal with. Then you might want to check out Smaz:
Smaz is a simple compression library suitable for compressing strings.
If that doesn't work, then you could take a look at this:
http://ed-von-schleck.github.io/shoco/
Shoco is a C library to compress and decompress short strings. It is very fast and easy to use. The default compression model is optimized for English words, but you can generate your own compression model based on your specific input data.
Let me know if it works!

For finding top K elements using heap, which approach is better - NlogK or KLogN?

For finding top K elements using heap, which approach is better?
NlogK, use Minheap of size K and remove the minimum element so top k elements remain in heap
KlogN, use Maxheap, store all elements and then extract top K elements
I did some calculations and at no point do I see NlogK being better than KlogN.
N= 16 (2^4), k = 8 (2^3)
O(Nlog(K)) = 16* 3 = 48
O(Klog(N)) = 8 * 4 = 32
N= 16 (2^4), k = 12 (log to base 2 = 3.5849)
O(Nlog(K)) = 16* 3.5849 = 57.3584
O(Klog(N)) = 12 * 4 = 48
N= 256 (2^8), k = 4 (2^2)
O(Nlog(K)) = 256* 2 = 512
O(Klog(N)) = 4 * 8 = 32
N= 1048576 (2^20), k = 16 (2^4)
O(Nlog(K)) = 1048576* 4 = 4194304
O(Klog(N)) = 16 * 20 = 320
N= 1048576 (2^20), k = 1024 (2^10)
O(Nlog(K)) = 1048576* 10 = 10485760
O(Klog(N)) = 1024 * 20 = 20480
N= 1048576 (2^20), k = 524288 (2^19)
O(Nlog(K)) = 1048576* 19 = 19922944
O(Klog(N)) = 524288 * 20 = 10485760
I just wanted to confirm that my approach is correct and that adding all elements to a heap and extracting the top K elements is always the best approach (and also the simpler one).
I did some calculations and at no point do I see NlogK being better than KlogN.
Since K <= N, NlogK will always be greater than or equal to KlogN.
That does not mean the min-heap approach is going to take more time than the max-heap approach.
You need to consider the following:
1. In the min-heap approach, we update the heap only if the next value is larger than the head. If the array is in ascending order we will do this (N-K) times; if it is in descending order we will not update it at all. On average, the number of times the tree gets updated is considerably less than N.
2. In the max-heap approach, you need to heapify a tree of size N. If K is negligibly small compared to N, this time can become the dominant factor, whereas in the min-heap case heapify works on the smaller set of K elements. Also, as mentioned in point 1, most of the N values will not trigger an update of the tree.
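For reference, a minimal C++ sketch of the min-heap approach (not the benchmark program mentioned below; names are illustrative):
// Keep a min-heap of the K largest values seen so far; only push when a new
// value beats the smallest of the current top K (the heap's root).
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

std::vector<int> topK(const std::vector<int>& data, std::size_t k) {
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap; // min-heap
    for (int v : data) {
        if (heap.size() < k) {
            heap.push(v);
        } else if (v > heap.top()) {   // most values fail this test, so the
            heap.pop();                // heap is updated far fewer than N times
            heap.push(v);
        }
    }
    std::vector<int> result;
    while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
    return result;                     // the top K, in ascending order
}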
I wrote a small program to compare both approaches. The source can be accessed here.
Results for an array ranging from 0 to 1M in random order are below:
Test for Array size: 1000000
k MinHeapIterations MinHeapTime(ms) MaxHeapTime(ms) MaxTime/MinTime
1 15 6.07 72.03 11.88
10 114 3.85 70.09 18.19
100 913 4.11 69.60 16.93
1000 6874 5.32 72.94 13.71
10000 46123 16.52 79.89 4.83
100000 230385 78.19 132.27 1.69
1000000 0 35.86 453.57 12.65
As you can see,
The min heap does outperform the max heap for all values of K.
The number of tree updates in the case of the min heap (MinHeapIterations) is much less than (N-K).

dramatic error in lp_solve?

I have a simple problem that I passed to lp_solve via the IDE (version 5.5.2.0):
/* Objective function */
max: +r1 +r2;
/* Constraints */
R1: +r1 +r2 <= 4;
R2: +r1 -2 b1 = 0;
R3: +r2 -3 b2 = 0;
/* Variable bounds */
b1 <= 1;
b2 <= 1;
/* Integer definitions */
int b1,b2;
The obvious solution to this problem is 3. SCIP as well as CBC give 3 as the answer, but lp_solve does not; here I get 2. Is there a major bug in the solver?
Thanks in advance.
I have been in contact with the developer group that maintains the lp_solve software. The error will be fixed in the next version of lp_solve.
When I tried it, I got 3 as the optimal value for the objective function.
Model name: 'LPSolver' - run #1
Objective: Maximize(R0)
SUBMITTED
Model size: 3 constraints, 4 variables, 6 non-zeros.
Sets: 0 GUB, 0 SOS.
Using DUAL simplex for phase 1 and PRIMAL simplex for phase 2.
The primal and dual simplex pricing strategy set to 'Devex'.
Relaxed solution 4 after 4 iter is B&B base.
Feasible solution 2 after 6 iter, 3 nodes (gap 40.0%)
Optimal solution 2 after 7 iter, 4 nodes (gap 40.0%).
Excellent numeric accuracy ||*|| = 0
MEMO: lp_solve version 5.5.2.0 for 32 bit OS, with 64 bit REAL variables.
In the total iteration count 7, 1 (14.3%) were bound flips.
There were 2 refactorizations, 0 triggered by time and 0 by density.
... on average 3.0 major pivots per refactorization.
The largest [LUSOL v2.2.1.0] fact(B) had 8 NZ entries, 1.0x largest basis.
The maximum B&B level was 3, 0.8x MIP order, 3 at the optimal solution.
The constraint matrix inf-norm is 3, with a dynamic range of 3.
Time to load data was 0.001 seconds, presolve used 0.017 seconds,
... 0.007 seconds in simplex solver, in total 0.025 seconds.
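For completeness, a brute-force check of this tiny model (a sketch that does not use the lp_solve API) confirms that the optimum really is 3:
#include <algorithm>
#include <iostream>

int main() {
    int best = -1;
    for (int b1 = 0; b1 <= 1; ++b1) {
        for (int b2 = 0; b2 <= 1; ++b2) {
            int r1 = 2 * b1;                      // R2: r1 - 2 b1 = 0
            int r2 = 3 * b2;                      // R3: r2 - 3 b2 = 0
            if (r1 + r2 <= 4)                     // R1: r1 + r2 <= 4
                best = std::max(best, r1 + r2);
        }
    }
    std::cout << best << "\n";                    // prints 3 (b1 = 0, b2 = 1)
    return 0;
}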

In Doom3's source code, why did they use bitshift to generate the number instead of hardcoding it?

Why did they do this:
Sys_SetPhysicalWorkMemory( 192 << 20, 1024 << 20 ); //Min = 201,326,592 Max = 1,073,741,824
Instead of this:
Sys_SetPhysicalWorkMemory( 201326592, 1073741824 );
The article I got the code from
A neat property is that shifting a value by << 10 is the same as multiplying it by 1024 (1 KiB), and << 20 multiplies it by 1024*1024 (1 MiB).
Shifting by successive multiples of 10 yields all of our standard units of computer storage:
1 << 10 = 1 KiB (Kibibyte)
1 << 20 = 1 MiB (Mebibyte)
1 << 30 = 1 GiB (Gibibyte)
...
So that function is expressing its arguments to Sys_SetPhysicalWorkMemory(int minBytes, int maxBytes) as 192 MB (min) and 1024 MB (max).
Self commenting code:
192 << 20 means 192 * 2^20 = 192 * 2^10 * 2^10 = 192 * 1024 * 1024 = 192 MByte
1024 << 20 means 1024 * 2^20 = 1 GByte
Computations on constants are optimized away so nothing is lost.
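For instance (a small illustrative snippet, not from the Doom 3 source), the equivalence can be checked at compile time:
// The shifts are constant expressions, so the compiler folds them; a
// static_assert makes the equivalence with the hardcoded values explicit (C++11).
static_assert((192  << 20) == 201326592,  "192 << 20  is 192 MiB in bytes");
static_assert((1024 << 20) == 1073741824, "1024 << 20 is 1 GiB in bytes");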
I might be wrong (and I didn't study the source), but I guess it's just for readability reasons.
I think the point (not mentioned yet) is that all but the most basic compilers will do the shift at compilation time. Whenever you use operators with constant expressions, the compiler will be able to do this before the code is even generated.
Note that before constexpr and C++11, this did not extend to functions.

calculation of bits needed

I need help with this.
I was asked, for an unsigned integer range of 1 to 1 billion, how many bits are needed?
How do we calculate this?
Thank you.
UPDATE!!!!
This is what I wanted to know, because the interviewer said 17.
Take the log base 2 of 1 billion and round up.
Alternatively, you should know that integers (with over 4 billion values) require 32 bits; therefore 2 billion needs 31 bits and 1 billion needs 30 bits.
Another handy thing to know is that every 10 bits increase the number of values you can represent by a factor of just over 1000 (1024), so 1000 needs 10 bits, 1 million needs 20 bits, and 1 billion needs 30 bits.
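If you prefer to compute it, a small C++ sketch (illustrative, names are my own) is to shift the value right until it reaches zero and count the shifts:
#include <cstdint>
#include <iostream>

int bitsNeeded(std::uint64_t n) {
    int bits = 0;
    while (n > 0) {
        n >>= 1;   // drop the lowest bit
        ++bits;
    }
    return bits;
}

int main() {
    std::cout << bitsNeeded(1000000000ULL) << "\n";  // prints 30
    std::cout << bitsNeeded(1000000ULL)    << "\n";  // prints 20
    std::cout << bitsNeeded(1000ULL)       << "\n";  // prints 10
    return 0;
}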
Calculate log2(1000000000) and round it up. It works out to 30 bits.
For example in Python you can calculate it like this:
>>> import math
>>> math.ceil(math.log(1000000000, 2))
30.0
2^10 = 1024
2^10 * 2^10 = 2^20 = 1024*1024 = 1048576
2^10 * 2^10 * 2^10 = 2^30 = 1024*1024*1024 ~= 1,000,000,000
=> 30 bits