Fast inner product of ternary vectors - c++

Consider two vectors, A and B, of size n, 7 <= n <= 23. Both A and B consist of -1s, 0s and 1s only.
I need a fast algorithm which computes the inner product of A and B.
So far I've thought of storing the signs and values in separate uint32_ts using the following encoding:
sign 0, value 0 → 0
sign 0, value 1 → 1
sign 1, value 1 → -1.
The C++ implementation I've thought of looks like the following:
struct ternary_vector {
uint32_t sign, value;
};
int inner_product(const ternary_vector & a, const ternary_vector & b) {
uint32_t psign = a.sign ^ b.sign;
uint32_t pvalue = a.value & b.value;
psign &= pvalue;
pvalue ^= psign;
return __builtin_popcount(pvalue) - __builtin_popcount(psign);
}
This works reasonably well, but I'm not sure whether it is possible to do it better. Any comment on the matter is highly appreciated.

I like having the two uint32_t fields, but I think your actual calculation is a bit wasteful.
Just a few minor points:
I'm not sure about the reference (taking a and b by const &): this may add a level of indirection compared to putting them on the stack. When the code is this small (a couple of clocks, maybe), that is significant. Try passing by value and see what you get.
__builtin_popcount can be, unfortunately, very inefficient. I've used it myself, but found that even a very basic implementation I wrote was far faster than this. However - this is dependent on the platform.
Basically, if the platform has a hardware popcount implementation, __builtin_popcount uses it. If not - it uses a very inefficient replacement.
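As a rough illustration of what a hand-written replacement could look like, here is the widely known bit-parallel (SWAR) population count. It is not the implementation referred to above, only a common portable fallback, and the name popcount32 is made up for this sketch. Whether it beats __builtin_popcount depends on the target, so it is worth measuring.

```cpp
#include <cstdint>

// Portable 32-bit population count (classic SWAR bit trick).
// popcount32 is a hypothetical name chosen for this sketch.
inline int popcount32(std::uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);                  // 2-bit partial counts
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  // 4-bit partial counts
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                  // 8-bit partial counts
    return static_cast<int>((x * 0x01010101u) >> 24);  // sum of the four bytes
}
```

It could be dropped in wherever __builtin_popcount is called in the question's code.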

The one serious problem here is the reuse of the psign and pvalue variables for the positive and negative vectors. You are doing neither your compiler nor yourself any favors by obfuscating your code in this way.

Would it be possible for you to encode your ternary state in a std::bitset<2> and define the product in terms of and? For example, if your ternary types are:
1 = P = (1, 1)
0 = Z = (0, 0)
-1 = M = (1, 0) or (0, 1)
I believe you could define their product as:
1 * 1 = 1 => P * P = P => (1, 1) & (1, 1) = (1, 1) = P
1 * 0 = 0 => P * Z = Z => (1, 1) & (0, 0) = (0, 0) = Z
1 * -1 = -1 => P * M = M => (1, 1) & (1, 0) = (1, 0) = M
Then the inner product could start by taking the and of the bits of the elements and... I am working on how to add them together.
Edit:
My foolish suggestion did not consider that (-1)(-1) = 1, which cannot be handled by the representation I proposed. Thanks to #user92382 for bringing this up.

Depending on your architecture, you may want to optimize away the temporary bit vectors -- e.g. if your code is going to be compiled to FPGA, or laid out to an ASIC, then a sequence of logical operations will be better in terms of speed/energy/area than storing and reading/writing to two big buffers.
In this case, you can do:
int inner_product(const ternary_vector & a, const ternary_vector & b) {
return __builtin_popcount( a.value & b.value & ~(a.sign ^ b.sign))
- __builtin_popcount( a.value & b.value & (a.sign ^ b.sign));
}
This will lay out very well -- the (a.value & b.value & ... ) can enable/disable an XOR gate, whose output splits into two signed accumulators, with the first pathway NOTed before accumulation.
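To compare any of the variants above against a reference, a small check along the following lines may help. It is only a sketch: encode and naive_inner_product are hypothetical helpers that do not appear in the question, the structure is passed by value as suggested in one of the answers, and __builtin_popcount assumes GCC or Clang.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct ternary_vector {
    std::uint32_t sign = 0, value = 0;
};

// Hypothetical helper: pack up to 32 entries from {-1, 0, 1} using the
// question's encoding (value bit set for non-zero, sign bit set for -1).
ternary_vector encode(const std::vector<int>& v) {
    ternary_vector t;
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i] != 0) t.value |= std::uint32_t{1} << i;
        if (v[i] < 0)  t.sign  |= std::uint32_t{1} << i;
    }
    return t;
}

// Same computation as in the question, with one name per intermediate mask.
int inner_product(ternary_vector a, ternary_vector b) {
    const std::uint32_t nonzero  = a.value & b.value;        // both entries non-zero
    const std::uint32_t negative = nonzero & (a.sign ^ b.sign);  // product is -1
    const std::uint32_t positive = nonzero ^ negative;       // product is +1
    return __builtin_popcount(positive) - __builtin_popcount(negative);
}

// Straightforward reference implementation used for checking.
int naive_inner_product(const std::vector<int>& a, const std::vector<int>& b) {
    int sum = 0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}
```

Feeding random vectors with 7 <= n <= 23 through both functions and comparing the results is a cheap way to validate a faster variant before benchmarking it.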

Related

How to store output of very large Fibonacci number?

I am making a program for nth Fibonacci number. I made the following program using recursion and memoization.
The main problem is that the value of n can go up to 10000 which means that the Fibonacci number of 10000 would be more than 2000 digit long.
With a little bit of googling, I found that I could use arrays and store every digit of the solution in an element of the array, but I am still not able to figure out how to implement this approach with my program.
#include<iostream>
using namespace std;
long long int memo[101000];
long long int n;
long long int fib(long long int n)
{
if(n==1 || n==2)
return 1;
if(memo[n]!=0)
return memo[n];
return memo[n] = fib(n-1) + fib(n-2);
}
int main()
{
cin>>n;
long long int ans = fib(n);
cout<<ans;
}
How do I implement that approach, or is there another method that can be used to handle such large values?
One thing that I think should be pointed out is that there are other ways to implement fib that are much easier for something like C++ to compute.
Consider the following pseudocode:
function fib (n) {
let a = 0, b = 1, _;
while (n > 0) {
_ = a;
a = b;
b = b + _;
n = n - 1;
}
return a;
}
This doesn't require memoisation and you don't have to be concerned about blowing up your stack with too many recursive calls. Recursion is a really powerful looping construct but it's one of those fubu things that's best left to langs like Lisp, Scheme, Kotlin, Lua (and a few others) that support it so elegantly.
That's not to say tail call elimination is impossible in C++, but unless you're doing something to optimise/compile for it explicitly, I'm doubtful that whatever compiler you're using would support it by default.
As for computing the exceptionally large numbers, you'll have to either get creative doing adding The Hard Way or rely upon an arbitrary precision arithmetic library like GMP. I'm sure there's other libs for this too.
Adding The Hard Way™
Remember how you used to add big numbers when you were a little tater tot, fresh off the aluminum foil?
5-year-old math
  1259601512351095520986368
+   50695640938240596831104
---------------------------
                          ?
Well you gotta add each column, right to left. And when a column overflows into the double digits, remember to carry that 1 over to the next column.
                 ... <-001
  1259601512351095520986368
+   50695640938240596831104
---------------------------
                  ... <-472
The 10,000th fibonacci number is thousands of digits long, so there's no way that's going to fit in any integer C++ provides out of the box. So without relying upon a library, you could use a string or an array of single-digit numbers. To output the final number, you'll have to convert it to a string tho.
(Wolfram Alpha: fibonacci 10000)
Doing it this way, you'll perform roughly ten million single-digit additions; it might take a while, but it should be a breeze for any modern computer to handle. Time to get to work!
Here's an example of a Bignum module in JavaScript:
const Bignum =
{ fromInt: (n = 0) =>
n < 10
? [ n ]
: [ n % 10, ...Bignum.fromInt (n / 10 >> 0) ]
, fromString: (s = "0") =>
Array.from (s, Number) .reverse ()
, toString: (b) =>
[...b] .reverse () .join ("") // copy first: reverse () mutates the array in place
, add: (b1, b2) =>
{
const len = Math.max (b1.length, b2.length)
let answer = []
let carry = 0
for (let i = 0; i < len; i = i + 1) {
const x = b1[i] || 0
const y = b2[i] || 0
const sum = x + y + carry
answer.push (sum % 10)
carry = sum / 10 >> 0
}
if (carry > 0) answer.push (carry)
return answer
}
}
We can verify that the Wolfram Alpha answer above is correct
const { fromInt, toString, add } =
Bignum
const bigfib = (n = 0) =>
{
let a = fromInt (0)
let b = fromInt (1)
let _
while (n > 0) {
_ = a
a = b
b = add (b, _)
n = n - 1
}
return toString (a)
}
bigfib (10000)
// "336447 ... 366875"
Try not to use recursion for a simple problem like Fibonacci. And if you'll only use the result once, don't use an array to store all results. An array of 2 elements containing the 2 previous Fibonacci numbers is enough: in each step, you only have to sum those 2 numbers.
How can you store 2 consecutive Fibonacci numbers? When you have 2 consecutive integers, one is even and one is odd. You can use that property to decide where to get or place a Fibonacci number: for fib(i), if i is even (i%2 is 0), place it in the first element of the array (index 0); otherwise (i%2 is 1), place it in the second element (index 1).
Why can you just place it there? When you're calculating fib(i), the slot where fib(i) should go currently holds fib(i-2) (because (i-2)%2 is the same as i%2). But you won't need fib(i-2) any more: fib(i+1) only needs fib(i-1) (still in the array) and fib(i) (just written to the array).
So you could replace the recursion calls with a for loop like this:
int fibonacci(int n){
if( n <= 0){
return 0;
}
int previous[] = {0, 1}; // start with fib(0) and fib(1)
for(int i = 2; i <= n; ++i){
// modulo can be implemented with bit operations(much faster): i % 2 = i & 1
previous[i&1] += previous[(i-1)&1]; //shorter way to say: previous[i&1] = previous[i&1] + previous[(i-1)&1]
}
//Result is in previous[n&1]
return previous[n&1];
}
Recursion is actually discouraged while programming because of the time (function calls) and resources (stack) it consumes. So each time you use recursion, try to replace it with a loop, plus a stack with simple pop/push operations if you need to save the "current position" (in C++ one can use a vector). In the case of Fibonacci, the stack isn't even needed, but if you are iterating over a tree data structure, for example, you'll need a stack (depending on the implementation). As I was looking for my solution, I saw #naomik provided a solution with the while loop. That one is fine too, but I prefer the array with the modulo operation (a bit shorter).
Now, concerning the limited size of long long int: it can be solved by using external libraries that implement operations for big numbers (like the GMP library or Boost.Multiprecision). But you could also create your own version of a BigInteger-like class from Java and implement the basic operations, like the one I have. I've only implemented addition in my example (try to implement the others, they are quite similar).
The main idea is simple, a BigInt represents a big decimal number by cutting its little endian representation into pieces (I'll explain why little endian at the end). The length of those pieces depends on the base you choose. If you want to work with decimal representations, it will only work if your base is a power of 10: if you choose 10 as base each piece will represent one digit, if you choose 100 (= 10^2) as base each piece will represent two consecutive digits starting from the end(see little endian), if you choose 1000 as base (10^3) each piece will represent three consecutive digits, ... and so on. Let's say that you have base 100, 12765 will then be [65, 27, 1], 1789 will be [89, 17], 505 will be [5, 5] (= [05,5]), ... with base 1000: 12765 would be [765, 12], 1789 would be [789, 1], 505 would be [505]. It's not the most efficient, but it is the most intuitive (I think ...)
The addition is then a bit like the addition on paper we learned at school:
1. Begin with the lowest piece of the BigInt.
2. Add it to the corresponding piece of the other one.
3. The lowest piece of that sum (= the sum modulo the base) becomes the corresponding piece of the final result.
4. The "bigger" pieces of that sum are added ("carried") to the sum of the following pieces.
5. Go to step 2 with the next piece.
6. If no piece is left, add the carry and the remaining bigger pieces of the other BigInt (if it has pieces left).
For example:
9542 + 1097855 = [42, 95] + [55, 78, 09, 1]
lowest piece = 42 and 55 --> 42 + 55 = 97 = [97]
---> lowest piece of result = 97 (no carry, carry = 0)
2nd piece = 95 and 78 --> (95+78) + 0 = 173 = [73, 1]
---> 2nd piece of final result = 73
---> remaining: [1] = 1 = carry (will be added to sum of following pieces)
no piece left in first `BigInt`!
--> add carry ( [1] ) and remaining pieces from second `BigInt`( [9, 1] ) to final result
--> first additional piece: 9 + 1 = 10 = [10] (no carry)
--> second additional piece: 1 + 0 = 1 = [1] (no carry)
==> 9542 + 1 097 855 = [42, 95] + [55, 78, 09, 1] = [97, 73, 10, 1] = 1 107 397
Here is a demo where I used the class above to calculate the fibonacci of 10000 (result is too big to copy here)
Good luck!
PS: Why little endian? For ease of implementation: it lets you use push_back when adding digits, and the loops that implement the operations start from the first piece of the array instead of the last.
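Since the question is about C++, the piece-wise addition described above could be sketched as follows. This is not the BigInt class referred to in the answer: it is a minimal stand-in under the assumption that pieces are stored little endian in base 10^9, and the names BigInt, kBase, add, to_decimal and big_fib are chosen for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Little-endian pieces in base 10^9 (illustrative alias, addition only).
using BigInt = std::vector<std::uint32_t>;
constexpr std::uint32_t kBase = 1000000000u;

// Schoolbook addition: add piece by piece and carry into the next piece.
BigInt add(const BigInt& x, const BigInt& y) {
    BigInt result;
    std::uint32_t carry = 0;
    for (std::size_t i = 0; i < x.size() || i < y.size() || carry != 0; ++i) {
        std::uint64_t sum = carry;
        if (i < x.size()) sum += x[i];
        if (i < y.size()) sum += y[i];
        result.push_back(static_cast<std::uint32_t>(sum % kBase));
        carry = static_cast<std::uint32_t>(sum / kBase);
    }
    return result;
}

// Convert to decimal text; every piece except the highest is zero-padded to 9 digits.
// Assumes x holds at least one piece.
std::string to_decimal(const BigInt& x) {
    std::string out = std::to_string(x.back());
    for (std::size_t i = x.size() - 1; i-- > 0;) {
        const std::string piece = std::to_string(x[i]);
        out += std::string(9 - piece.size(), '0') + piece;
    }
    return out;
}

// Iterative Fibonacci that keeps only the two most recent values.
std::string big_fib(unsigned n) {
    BigInt a{0}, b{1};
    for (unsigned i = 0; i < n; ++i) {
        BigInt next = add(a, b);
        a = std::move(b);
        b = std::move(next);
    }
    return to_decimal(a);
}
```

Calling big_fib(10000) returns the decimal digits as a std::string that can be written directly to cout. A larger base means fewer pieces per number, at the cost of the zero-padding step when printing.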

How to increase speed of code in VS 2008 [closed]

I have a portion of code that is called an enormous amount of times. How can I speed it up?
#define SUM(p0, p1, p2, p3, offset) \
((p0)[offset] - (p1)[offset] - (p2)[offset] + (p3)[offset])
inline int Calc::compute( int offset ) const
{
int b = SUM( p[5], p[6], p[9], p[10], offset );
int a1 = SUM(...);
int a2 = SUM(...);
....
return (uchar)(((a1 >= b) << 7) |
((a2 >= b) << 6) |
((a3 >= b) << 5) |
((a4 >= b) << 4) |
((a5 >= b) << 3) |
((a6 >= b) << 2) |
((a7 >= b) << 1) |
(a8 >= b));
}
Thank you.
I see 3 possible opportunities here:
Pack your p0, p1, p2 and p3 data contiguously in memory (so basically contiguously in an array) in order to prevent cache misses:
#define SUM(array, offset) \
((array)[(offset)] - (array)[(offset) + 1] - (array)[(offset) + 2] + (array)[(offset) + 3])
....
//make sure pArray contains all the p0, p1, ... values.
int a1 = SUM(pArray, offset);
Replace the bit-shift operators with an if structure that ORs together constant literals when (a1 >= b) holds, and so on. The values are going to be the same every time anyway:
uint8_t bitmask = 0;
if(a1 >= b)
bitmask |= 0x80;
if(a2 >= b)
bitmask |= 0x40;
...
Try to make sure that SIMD instructions are being used. This involves dumping the assembly and checking whether these kinds of instructions are being generated.
EDIT: Reaction on the comment below:
In order to prevent cache misses, everything revolves around accessing your data in a predictable manner. A problem with your original code is that you resolve the offset with your p-variables.
So you have something like:
int b = SUM( p[5], p[6], p[9], p[10], _offset );
int a1 = SUM( p[0], p[1], p[4], p[5], _offset );
int a2 = SUM( p[1], p[2], p[5], p[6], _offset );
You could create an array which contains these values in the order you use them.
So I'd try to create an array that looks like this at a certain offset:
p[5], p[6], p[9], p[10], p[0], p[1], p[4], p[5], p[1], p[2], p[5], p[6]
And now you can define your sum function like this:
#define SUM(array, offset, calculationOffset) \
((array)[(offset) + (calculationOffset) * 4] - (array)[(offset) + (calculationOffset) * 4 + 1] \
- (array)[(offset) + (calculationOffset) * 4 + 2] + (array)[(offset) + (calculationOffset) * 4 + 3])
Your calls can be transformed into this:
int b = SUM(pArray, offset, 0);
int a1 = SUM(pArray, offset, 1);
int a2 = SUM(pArray, offset, 2);
There's only one problem with this: if you have to create the array and copy all the data every function call, this might remove any benefit of what we did, but you may be able to construct this kind of array before using this function and pass it as an argument.
There's still the possibility to change/improve the algorithm itself. We cannot help you with this unless the algorithm you use is known (I mean the algorithm which uses the code you have shown).
If there are any known properties of the variables used and their values, this could give room for improvement.
I don't know whether the cast to uchar could impact performance. Why do you need this? I think this should be changed, since you are adding int and returning int. But you'd have to measure performance to see if, for example, using & 0xff would give better performance.
Also, you should try the difference it makes when you replace the bitwise OR operators ("|") with a simple addition (it should work the same in this case because each summand has a single unique bit set to 1).
Then try the effect of altering ((a1 >= b) << 7) to ((a1 >= b) ? 128 : 0) etc.
All of the above might or might not affect performance, and you have to measure the effect with the compiler you use.
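The alternatives mentioned so far (shifting, branching on constants, and adding conditional values) can be written side by side so that they are easy to benchmark against each other. This is a sketch only: the function names are invented, and the eight sums a1..a8 are assumed to have been stored in an array beforehand.

```cpp
#include <cstdint>

// Variant 1: comparison results shifted into place (as in the question).
inline std::uint8_t pack_shift(const int (&a)[8], int b) {
    unsigned mask = 0;
    for (int i = 0; i < 8; ++i) {
        mask |= static_cast<unsigned>(a[i] >= b) << (7 - i);
    }
    return static_cast<std::uint8_t>(mask);
}

// Variant 2: branches that OR in constant literals.
inline std::uint8_t pack_branch(const int (&a)[8], int b) {
    unsigned mask = 0;
    if (a[0] >= b) mask |= 0x80u;
    if (a[1] >= b) mask |= 0x40u;
    if (a[2] >= b) mask |= 0x20u;
    if (a[3] >= b) mask |= 0x10u;
    if (a[4] >= b) mask |= 0x08u;
    if (a[5] >= b) mask |= 0x04u;
    if (a[6] >= b) mask |= 0x02u;
    if (a[7] >= b) mask |= 0x01u;
    return static_cast<std::uint8_t>(mask);
}

// Variant 3: conditional expressions combined by addition instead of OR.
inline std::uint8_t pack_add(const int (&a)[8], int b) {
    return static_cast<std::uint8_t>(
        (a[0] >= b ? 128 : 0) + (a[1] >= b ? 64 : 0) +
        (a[2] >= b ? 32 : 0)  + (a[3] >= b ? 16 : 0) +
        (a[4] >= b ? 8 : 0)   + (a[5] >= b ? 4 : 0) +
        (a[6] >= b ? 2 : 0)   + (a[7] >= b ? 1 : 0));
}
```

All three produce the same byte, so any difference that shows up in a profile comes from code generation alone; an optimizing compiler may well emit near-identical code for them.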
But most important: if your problem with analyzing all these images is the total amount of time, you should look into processing different images at the same time (if you are on a multiprocessor machine with sufficient RAM). You have several options:
1. Parallelize the code that processes a single image (using something like OpenMP).
2. Use one thread per image (IMHO much easier, and I'd expect better overall throughput than with option 1).
3. Move the concurrency to where you start your program (i.e. a script).

possible to index a set by a variable?

I am trying to do something that logically should be possible to do. However, I am not sure how to do this within the realm of linear programming. I am using ZIMPL/SCIP, but this should be readable to most.
set I := {1,2,3,4,5};
param u[I] := <1> 10, <2> 20, <3> 30, <4> 40, <5> 50;
var a;
var b;
subto bval:
b == 2;
subto works:
a == u[2];
#subto does_not_work:
# a == u[b];
I am trying to make sure that the variable a is equal to the value at the index b in u. So for example, I ensure that b == 2 and then I try to set the constraint that a == u[b], but that does not work. It complains that I am trying to index with a variable. I am able to just do a == u[2] however, which makes a equal to 20.
Is there a way to easily access u at an index specified by a variable? Thanks for any help/guidance.
EDIT: I think the consensus is that this is not possible because it no longer becomes an LP. In that case, can anyone think of another way to write this so that, depending on the value of b, I can get an associated value from the set u? This would have to avoid directly indexing it.
SOLUTION: Based on the response from Ram, I was able to try it out and found that it was definitely a viable and linear solution. Thanks, Ram! Here is sample solution code in ZIMPL:
set I := {1,2,3,4,5};
param u[I] := <1> 10, <2> 20, <3> 30, <4> 40, <5> 50;
var a;
var b;
var y[I] binary;
subto bval:
b == 4;
subto only_one:
sum <i> in I : y[i] == 1;
subto trick:
b == (sum <i> in I : y[i] * i);
subto aval:
(sum <i> in I : u[i]*y[i]) == a;
Yes, you can rewrite and linearize your constraints, by introducing a few extra 0/1 variables (indicator variables). These kinds of tricks are not uncommon in Integer Programming.
Constraints In English
b can take on values from 1 through 5. b = {1..5}
and depending on b's value, the variable a should become u[b]
Indicator Variables
Let's introduce 5 Y variables - Y1..Y5 (one for each possible value of b)
Only one of them can be true at any given time.
Y1 + Y2 + Y3 + Y4 + Y5 = 1
All Y's are binary {0,1}
Here's the trick. We introduce one linear constraint to ensure that the corresponding Y variable will take on value 1, only when b is that value.
b - 1xY1 - 2xY2 - 3xY3 - 4xY4 - 5xY5 = 0
(For example, if b is 3, the constraint above will force Y3 to be 1.)
Now, we want a to take on the value u[b].
a = u[1]xY1 + u[2]xY2 + u[3]xY3 + u[4]xY4 + u[5]xY5
Since u[1] ... u[5] are constants known beforehand, the constraint above is also linear.
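For an arbitrary index set I, the same construction can be stated compactly. The notation below (y_i for the indicator variables, u_i for the constants u[i]) is just a restatement of the constraints above, not an additional modelling step.

```latex
\begin{aligned}
& y_i \in \{0, 1\} && \text{for all } i \in I, \\
& \sum_{i \in I} y_i = 1, \\
& b = \sum_{i \in I} i \, y_i, \\
& a = \sum_{i \in I} u_i \, y_i .
\end{aligned}
```

With I = {1, ..., 5}, u = (10, 20, 30, 40, 50) and b fixed to 4, the first three lines force y_4 = 1 and every other y_i = 0, so the last line gives a = u_4 = 40.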
Here is one reference on these kinds of IF-THEN conditions in Integer Programming. Many of these tricks involve the Big-M, though we didn't need it in this case.
Hope that helps you move forward.

evaluate whether a number is integer power of 4

The following function is claimed to evaluate whether a number is an integer power of 4. I do not quite understand how it works.
bool fn(unsigned int x)
{
if ( x == 0 ) return false;
if ( x & (x - 1) ) return false;
return x & 0x55555555;
}
The first condition rules out 0, which is obviously not a power of 4 but would incorrectly pass the following two tests. (EDIT: No, it wouldn't, as pointed out. The first test is redundant.)
The next one is a nice trick: the number gets past this test if and only if it is a power of 2. A power of two is characterized by having only one bit set. A number with one bit set minus one results in a number with all bits below that bit being set (i.e. binary 1000 minus one is binary 0111). AND those two numbers, and you get 0. In any other case (i.e. not a power of 2), there will be at least one bit that overlaps.
So at this point, we know it's a power of 2.
x & 0x55555555 returns non-zero (= true) if any even bit is set (bit 0, bit 2, bit 4, bit 6, etc). That means it's a power of 4 (i.e. 2 doesn't pass, but 4 passes, 8 doesn't pass, 16 passes, etc).
Every power of 4 must be in the form of 1 followed by an even number of zeros (binary representation): 100...00:
100 = 4
10000 = 16
1000000 = 64
The 1st test ("if") is obvious.
When subtracting 1 from a number of the form XY100...00 you get XY011...11. So, the 2nd test checks whether there is more than one "1" bit in the number (XY in this example).
The last test checks whether this single "1" is in the correct position, i.e. bit #0, 2, 4, 6, etc. If it is not, the masking (&) will return zero.
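The three checks described above can be exercised at compile time. The sketch below restates the question's function as a constexpr expression (the name is_power_of_4 is new, and a 32-bit unsigned int is assumed, as in the original mask) and pins down a few of the cases mentioned in the answers.

```cpp
// Same logic as fn() in the question, written as a single expression.
// Assumes unsigned int is 32 bits wide, matching the 0x55555555 mask.
constexpr bool is_power_of_4(unsigned int x) {
    return x != 0                      // zero is not a power of 4
        && (x & (x - 1)) == 0          // exactly one bit set: a power of 2
        && (x & 0x55555555u) != 0;     // that bit sits at an even position
}

static_assert(is_power_of_4(1u),   "4^0");
static_assert(!is_power_of_4(2u),  "power of 2 only");
static_assert(is_power_of_4(4u),   "4^1");
static_assert(!is_power_of_4(8u),  "power of 2 only");
static_assert(is_power_of_4(16u),  "4^2");
static_assert(!is_power_of_4(20u), "two bits set");
static_assert(!is_power_of_4(0u),  "zero");
```

Dropping the x != 0 term leaves the results unchanged, because 0 & 0x55555555 is already 0; it is kept here only to mirror the original.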
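The solution below checks whether a is an integer power of b, for any base b >= 2 (for example 2, 4 or 16).
public static boolean isPowerOf(int a, int b)
{
// p runs through 1, b, b^2, ...; long avoids int overflow
long p = 1;
while(p < a)
{
p *= b;
}
return p == a;
}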
isPowerOf(4,2) > true
isPowerOf(8,2) > true
isPowerOf(8,3) > false
isPowerOf(16,4) > true
var isPowerOfFour = function (n) {
let x = Math.log(n) / Math.log(4)
if (Number.isInteger(x)) {
return true;
}
else {
return false
}
};
isPowerOfFour(4) ->true
isPowerOfFour(1) ->true
isPowerOfFour(5) ->false

Math question in regards to functions in the form (1) / ( b ^ c )

I've found that functions which follow the pattern 1 / b^c produce nice curves which can be coupled with interpolation functions really nicely.
The way I use the function is by treating 'c' as the changing value, i.e. the interpolation value between 0 and 1, while varying b for 'sharpness'. I use it to work out an interpolation value between 0 and 1, so generally the function I use is as such:
float interpolationvalue = 1 - 1/pow(100,c);
linearinterpolate( val1, val2, interpolationvalue);
Up to this point I've been using a hacked approach to make it 'work', since at c = 1 the value of 1 / b^c is very close to, but not quite, 0.
So I was wondering, is there a function of this form, or one which can reproduce curves similar to the ones produced by 1 / b^c, where at c = 0 result = 1 and at c = 1 result = 0?
Or even c = 0, result = 0 and c = 1, result = 1.
Thanks for any help!
For interpolation, the approach offering the most flexibility is using splines; in your case quadratic splines would seem sufficient. The Wikipedia page is math heavy, but you can find adapted descriptions on Google.
1 - c ^ b with small values for b? Another option would be to use a cubic polynomial and specifying the slope at 0 and 1.
You could use a similar curve of the form A - 1 / b^(c + a), choosing values of A and a to match your constraints. So, for c = 0, result = 1:
1 = A - 1/b^a => A = 1 + 1/b^a
and for c = 1, result = 0:
0 = A - 1/b^(1+a) => A = 1/b^(1+a)
Combining these, we can find a in terms of b:
1 + 1/b^a = 1/b^(1+a)
b^(1+a) + b = 1
b * (b^a + 1) = 1
b^a = 1/b - 1
So:
a = log_b(1/b - 1) = log(1/b - 1) / log(b)
A = 1 + 1/b^a = 1 / (1-b)
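As a quick numerical check of the result, the curve can be evaluated directly. This is a sketch under the assumption that 0 < b < 1, which is what makes log(1/b - 1) defined; the function name curve is made up.

```cpp
#include <cmath>

// Evaluates A - 1 / b^(c + a) with A and a chosen as derived above,
// so that curve(0, b) == 1 and curve(1, b) == 0 up to rounding.
// Assumes 0 < b < 1.
double curve(double c, double b) {
    const double a = std::log(1.0 / b - 1.0) / std::log(b);
    const double A = 1.0 / (1.0 - b);
    return A - 1.0 / std::pow(b, c + a);
}
```

For b = 0.5 the constants come out as a = 0 and A = 2, so the curve reduces to 2 - 2^c, which runs from 1 at c = 0 to 0 at c = 1. Using 1 - curve(c, b) instead flips the endpoints to 0 and 1.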
In real numbers, the ones that mathematicians use, no function of the form you specify is ever going to return 0; division can't do that. (1/x)==0 has no real solutions. In floating point arithmetic, the poor relation of real arithmetic that computers use, you could write 1/(MAX_FP_VALUE^1), which will give you as close to 0 as you are ever going to get (actually, it might give you a NaN or one of the other odd returns that IEEE 754 allows).
And, as I'm sure you've noticed, 1/(b^0) always returns 1 since b^0 is, by definition of 0-th power, always 1.
So, no function with c = 0 will produce a result of 0.
For c = 1, result = 1, set b = 1
But I guess this is only a partial answer, I'm not terribly sure I understand what you are trying to do.
Regards
Mark