Concatenation of prefixes of a boolean array - bit-manipulation

I have a boolean array A of size n that I want to transform in the following way: concatenate every prefix of size 1, 2, ..., n.
For instance, for n=5, it will transform "abcde" into "aababcabcdabcde".
Of course, a simple approach can loop over every prefix and use elementary bit operations (shift, mask, add); the complexity of this approach is obviously O(n).
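For concreteness, a minimal sketch of this simple approach, assuming the array is packed into a 64-bit word with the first element in the top bit of an n-bit value (so n*(n+1)/2 must be <= 64):
#include <cstdint>

uint64_t concat_prefixes_naive(uint64_t a, unsigned n) {
    uint64_t result = 0;
    for (unsigned k = 1; k <= n; ++k) {
        uint64_t prefix = a >> (n - k);  // top k bits of A = the prefix of size k
        result = (result << k) | prefix; // append it
    }
    return result;
}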
There are some well-known tricks regarding bit manipulation, like the ones here.
My question is: is it possible to achieve the transformation described above faster, with a complexity better than O(n), by using bit manipulation?
I am aware that the interest of such an improvement may be purely theoretical, because the simple approach could well be the fastest in practice, but I am still curious whether a theoretical improvement exists or not.
To clarify: I need to perform this transformation p times, where p can be much bigger than n; some pre-computation could be done for a given n and reused for the p transformations.

I'm not sure if this is what you're looking for, but here is a different algorithm, which may be interesting depending on your assumptions.
First compute two masks that depend only on n, so for any particular n these are just constants:
C (copy mask), a mask that has every n'th bit set and is n² bits long. So for n = 5, C = 0000100001000010000100001. This will be used to create n copies of A concatenated together.
E (extract mask), a mask that indicates which bits to take from the big concatenation. It is built up from n blocks of n bits, with values 1, 3, 7, 15, ...; e.g. for n = 5, E = 1111101111001110001100001. Pad the left with zeroes if necessary.
Then the real computation that takes an A and constructs the concatenation of prefixes is:
pext(A * C, E)
Where pext is compress_right, discarding the bits for which the extraction mask is 0 and compacting the remaining bits to the right.
The multiplication can be replaced by a "doubling" technique like this (which can also be used to compute the C mask):
uint64_t l = n;
while (l < n * n) {
    A |= A << l; // double the number of concatenated copies of A
    l *= 2;
}
This in general produces too many concatenated copies of A, but you can just pretend the excess isn't there by not looking at it (the pext drops any excess input anyway). Efficiently producing E for an unknown and arbitrarily large n seems harder, but that's not really a use case for this approach anyway.
The actual time complexity depends on your assumptions, but of course in the full "arbitrary bit count" setting both multiplication and compaction are heavyweight operations, and the fact that the output has a size quadratic in the input size really doesn't help. For small enough n such that e.g. n² <= 64 (depending on word size), so that everything fits in a machine word, it all works out well (even for unknown n, since all the required masks can be precomputed). On the other hand, for such small n it is also feasible to table the entire thing, doing a lookup for a pair (n, A).
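For n = 5, so that n² = 25 bits fit in one machine word, a sketch of this approach using the BMI2 pext instruction could look as follows; it assumes element i of A is stored in bit i, so the first prefix comes out in the low bits of the result.
#include <cstdint>
#include <immintrin.h> // _pext_u64; requires BMI2, compile with -mbmi2

uint64_t concat_prefixes_pext5(uint64_t a) {
    const uint64_t C = 0b0000100001000010000100001; // copy mask: every 5th bit set
    const uint64_t E = 0b1111101111001110001100001; // extract mask: blocks 1, 3, 7, 15, 31
    return _pext_u64(a * C, E); // 5 copies of A, keep a growing prefix of each
}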

I may have found another way to proceed.
The idea is to use multiplication to propagate the initial input I to the correct position. The multiplication coefficient J is the vector whose bits are set to one at position i*(i-1)/2 for i in [1:n].
However, a direct multiplication I by J will provide many unwanted terms, so the idea is to
mask some bits from vectors I and J
multiply these masked vectors
remove some junk bits from the result.
We thus have several iterations to perform; the final result is the sum of the intermediate results. We can write the result as "sum over i of ((I & Ai) * Bi) & Ci", so we have 2 masks and 1 multiplication per iteration (Ai, Bi and Ci are constants depending on n).
This algorithm seems to be O(log(n)), so it is better than O(log(n)^2), BUT it needs multiplication, which may be costly. Note also that this algorithm requires registers of size n*(n+1)/2, which is better than n².
Here is an example for n=7
Input:
I = abcdefg
We set J = 1101001000100001000001
We also note temporary results:
Xi = I & Ai
Yi = Xi * Bi
Zi = Yi & Ci
iteration 1
----------------------------
                     1        A1
      11 1  1   1    1     1  B1
11 1  1   1    1     1        C1
----------------------------
                     a        X1
aa a  a   a    a     a        Y1
aa a  a   a    a     a        Z1
iteration 2
----------------------------
                      11      A2
       1 1  1   1    1     1  B2
  1 11 11  11   11    11      C2
----------------------------
                      bc      X2
  bcbc bc  bc   bc    bc      Y2
  b bc bc  bc   bc    bc      Z2
iteration 3
----------------------------
                        1111  A3
            1   1    1     1  B3
         1   11   111   1111  C3
----------------------------
                        defg  X3
         defgdefg defg  defg  Y3
         d   de   def   defg  Z3
FINAL SUM
----------------------------
aa a  a   a    a     a        Z1
  b bc bc  bc   bc    bc      Z2
         d   de   def   defg  Z3
----------------------------
aababcabcdabcdeabcdefabcdefg  SUM
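Transcribed into code, a sketch for n = 7 could look like this (the masks are copied from the tables above, reading the leftmost column as the most significant bit, so element a of I sits in bit 6 and the first prefix lands in the top bits of the 28-bit result; only this one size has been worked out):
#include <cstdint>

uint64_t concat_prefixes7(uint64_t I) {
    const uint64_t A[3] = { 0b1000000, 0b0110000, 0b0001111 };
    const uint64_t B[3] = { 0b1101001000100001000001,
                            0b101001000100001000001,
                            0b1000100001000001 };
    const uint64_t C[3] = { 0b1101001000100001000001000000,
                            0b0010110110011000110000110000,
                            0b0000000001000110001110001111 };
    uint64_t sum = 0;
    for (int k = 0; k < 3; ++k)
        sum += ((I & A[k]) * B[k]) & C[k]; // Zk = ((I & Ak) * Bk) & Ck
    return sum;
}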

Related

How to write this floating point code in a portable way?

I am working on a cryptocurrency and there is a calculation that nodes must make:
average /= total;
double ratio = average/DESIRED_BLOCK_TIME_SEC;
int delta = -round(log2(ratio));
It is required that every node has the exact same result no matter what architecture or stdlib is being used by the system. My understanding is that log2 might have different implementations that yield very slightly different results, or that flags like -ffast-math could impact the results.
Is there a simple way to convert the above calculation to something that is verifiably portable across different architectures (fixed point?), or am I overthinking the precision that is needed (given that I round the answer at the end)?
EDIT: Average is a long and total is an int... so average ends up rounded to the closest second.
DESIRED_BLOCK_TIME_SEC = 30.0 (it's a float) that is #defined
For this kind of calculation to be exact, one must either calculate all the divisions and logarithms exactly -- or one can work backwards.
-round(log2(x)) == round(log2(1/x)), meaning that one of the divisions can be turned around to get (1/x) >= 1.
round(log2(x)) == floor(log2(x * sqrt(2))) == binary_log((int)(x*sqrt(2))).
One minor detail here is whether (double)sqrt(2) rounds down or up. If it rounds up, then there might exist one or more values x * sqrt2 == 2^n + epsilon (after rounding), whereas if it rounds down, we would get 2^n - epsilon. One would give the integer value n, the other n-1. Which is correct?
Naturally, the correct one is the one whose ratio to the theoretical midpoint x * sqrt(2) is smaller.
x * sqrt(2) / 2^(n-1) < 2^n / (x * sqrt(2)) -- multiply by x*sqrt(2)
x^2 * 2 / 2^(n-1) < 2^n -- multiply by 2^(n-1)
x^2 * 2 < 2^(2*n-1)
In order for this comparison to be exact, x^2 or pow(x,2) must be exact as well on the boundary - and it matters what range the original values are in. A similar analysis can and should be done while expanding x = a/b, so that the inexactness of the division can be mitigated at the cost of possible overflow in the multiplication...
Then again, I wonder how all the other similar applications handle the corner cases, which may not even exist -- and those could be brute force searched assuming that average and total are small enough integers.
EDIT
Because average is an integer, it makes sense to tabulate those exact integer values, which are on the boundaries of -round(log2(average)).
From octave: d=-round(log2((1:1000000)/30.0)); find(d(2:end) ~= d(1:end-1))
1 2 3 6 11 22 43 85 170 340 679 1358 2716
5431 10862 21723 43445 86890 173779 347558 695115
All the averages in [1, 2) -> 5
All the averages in [2, 3) -> 4
All the averages in [3, 6) -> 3
..
All the averages in [43445, 86890) -> -11
int a = find_lower_bound(average, table); // linear or binary search
return 5 - a;
No floating point arithmetic needed
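Concretely, the lookup could be implemented like this (a sketch: the boundaries are the octave output above, and average >= 1 is assumed):
#include <algorithm>
#include <iterator>

int delta_from_table(long average) {
    static const long bounds[] = {
        1, 2, 3, 6, 11, 22, 43, 85, 170, 340, 679, 1358, 2716,
        5431, 10862, 21723, 43445, 86890, 173779, 347558, 695115
    };
    // index of the last boundary <= average; binary search, no floating point
    int a = int(std::upper_bound(std::begin(bounds), std::end(bounds), average)
                - std::begin(bounds)) - 1;
    return 5 - a;
}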

Data structure for fast range searches of dense dataset 4D vectors

I have millions of unstructured 3D vectors associated with arbitrary values - making for a set of 4D vectors. To make it simpler to understand: I have unix timestamps associated with hundreds of thousands of 3D vectors each. And I have many timestamps, making for a very large dataset; upwards of 30 million vectors.
I need to search particular datasets for specific timestamps.
So lets say I have the following data:
For time stamp 1407633943:
(0, 24, 58, 1407633943)
(9, 2, 59, 1407633943)
...
For time stamp 1407729456:
(40, 1, 33, 1407729456)
(3, 5, 7, 1407729456)
...
etc etc
And I wish to make a very fast query along the lines of:
Query Example 1:
Give me vectors between:
X > 4 && X < 9 && Y > -29 && Y < 100 && Z > 0.58 && Z < 0.99
Give me list of those vectors, so I can find the timestamps.
Query Example 2:
Give me vectors between:
X > 4 && X < 9 && Y > -29 && Y < 100 && Z > 0.58 && Z < 0.99 && W (timestamp) = 1407729456
So far I've used SQLite for the task, but even after column indexing, it takes between 500 ms and 7 s per query. I'm looking for a solution in the 50-200 ms per query range.
What sort of structures or techniques can I use to speed the query up?
Thank you.
kd-trees can be helpful here. Range search in a kd-tree is a well-known problem. The time complexity of one query depends on the output size, of course (in the worst case, the whole tree is traversed if all vectors fit the range). But it can work pretty fast on average.
I would use octree. In each node I would store arrays of vectors in a hashtable using the timestamp as a key.
To further increase the performance you can use CUDA, OpenCL, OpenACC, OpenMP and implement the algorithms to be executed in parallel on the GPU or a multi-core CPU.
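To illustrate the octree suggestion, the node layout might look roughly like this (a sketch; the names are illustrative, not from any particular library):
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <vector>

struct Vec3 { float x, y, z; };

struct OctreeNode {
    float cx, cy, cz, half;               // cube center and half-width
    std::unique_ptr<OctreeNode> child[8]; // all null in leaf nodes
    // vectors stored per timestamp, so Query Example 2 can filter on W
    // after the spatial descent
    std::unordered_map<uint32_t, std::vector<Vec3>> byTimestamp;
};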
BKaun: please accept my attempt at giving you some insight into the problem at hand. I suppose you have thought of every one of my points, but maybe seeing them here will help.
Regardless of how the ingest data is presented, consider that, using the C programming language, you can reduce the storage size of the data to minimize space and search time. You would be searching for, loading, and parsing a few bits of a vector instead of, say, a short int, which is 2 bytes for every entry - or a float, which is much more. The object, as I understand it, is to search the given data for given values of X, Y, and Z, and then find the timestamp associated with these 3, while optimizing the search. My solution does not go into the search, but merely the data that is used in a search.
To illustrate my hints simply, I'm considering that the data consists of 4 vectors:
X between -2 and 7,
Y between 0.17 and 3.08,
Z between 0 and 50,
timestamp (many of same size - 10 digits)
To optimize, consider how many various numbers each vector can have in it:
1. X can be only 10 numbers (include 0)
2. Y can be 3.08 minus 0.17 = 2.91 x 100 = 291 steps of 0.01 (292 numbers)
3. Z can be 51 numbers
4. timestamp can be many (but in this scenario,
you are not searching for a certain one)
Consider how each variable is stored as a binary:
1. Each entry in Vector X COULD be stored in 4 bits, using the first bit=1 for
the negative sign:
7="0111"
6="0110"
5="0101"
4="0100"
3="0011"
2="0010"
1="0001"
0="0000"
-1="1001"
-2="1010"
However, the original data that you are searching through may range
from -10 to 20!
Therefore, adding another 2 bits gives you a table like this:
-10="101010"
-9="101001" ...
...
-2="100010"
-1="100001" ...
...
8="001000"
9="001001" ...
...
19="001001"
20="010100"
And that's only 6 bits to store each X vector entry for integers from -10 to 20
For search purposes on a range of -10 to 20, there are 31 different X Vector entries
possible to search through.
Each entry in Vector Y COULD be stored in 9 bits (no extra sign bit is needed)
The 1's and 0's COULD be stored (accessed, really) in 2 parts
(tens place, and a 2 digit decimal).
Part 1 can be 0, 1, 2, or 3 (4 2-place bits from "00" to "11")
However, if the range of the entire Y dataset is 0 to 10,
part 1 can be 0, 1, ...9, 10 (which is 11 4-place bits
from "0000" to "1010"
Part 2 can be 00, 01,...98, 99 (100 7-place bits from "0000000" to "1100100"
Total storage bits for Vector Y entries is 4 + 7 = 11 bits in the
range 00.00 to 10.99
For search purposes on a range 00.00 to 10.99, there are 1100 different Y Vector
entries possible to search through (11 x 100)
Each entry in Vector Z in the range of 0 to 50 COULD be stored in 6 bits
("000000" to "110010").
Again, the actual data range may be 7 bits long (for simplicity's sake)
0 to 64 ("0000000" to "1000000")
For search purposes on a range of 0 to 64, there are 65 different Z Vector entries
possible to search through.
Consider that you will be storing the data in this optimized format, in a single
succession of bits:
X=4 bits + 2 range bits = 6 bits
+ Y=4 bits part 1 and 7 bits part 2 = 11 bits
+ Z=7 bits
+ timestamp (10 numbers - each from 0 to 9 ("0000" to "1001") 4 bits each = 40 bits)
= TOTAL BITS: 6 + 11 + 7 + 40 = 64 stored bits for each 4D vector
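As a sketch, that 64-bit layout could be expressed with C++ bit-fields (hypothetical field names; bit-field packing is implementation-defined, so a production version would use explicit shifts and masks):
#include <cstdint>

struct PackedVec {
    uint64_t x  : 6;  // sign-magnitude integer, -10..20
    uint64_t y  : 11; // 4 bits integer part (0..10) + 7 bits hundredths (0..99)
    uint64_t z  : 7;  // 0..64
    uint64_t ts : 40; // 10 decimal digits, 4 bits each
};
static_assert(sizeof(PackedVec) == 8, "expected to pack into one 64-bit word");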
THE SEARCH:
Input xx, yy, zz to search for in arrays X, Y and Z (which are stored in binary)
Change xx, yy, and zz to binary bit strings per optimized format above.
function(xx, yy, zz)
Search for X first, since it has 31 possible outcomes (range is -10 to 20)
- the lowest number of any array
First search for positive targets (there are 8 of them and better chance
of finding one)
These all start with "000"
7="000111"
6="000110"
5="000101"
4="000100"
3="000011"
2="000010"
1="000001"
0="000000"
So you can check if the first 3 bits = "000". If so, you have a number
between 0 and 7.
Found: search for Z
Else search for xx=-2 or -1: does X = -2="100010" or -1="100001" ?
(do second because there are only 2 of them)
Found: Search for Z
NotFound: next X
Search for Z after X is Found: (Z second, since it has 65 possible outcomes
- range is 0 to 64)
You are searching for 6 bits of a 7 bit binary number
("0000000" to "1000000") If bits 1,2,3,4,5,6 are all "0", analyze bit 0.
If it is "1" (it's 64), next Z
Else begin searching 6 bits ("000000" to "110010") with LSB first
Found: Search for Y
NotFound: Next X
Search for Y (Y last, since it has 1100 possible outcomes - range is 0.00 to 10.99)
Search for Part 1 (decimal place) bits (you are searching for
"0000", "0001" or "0011" only, so use yyPt1=YPt1)
Found: Search for Part 2 ("0000000" to "1100100") using yyPt2=YPt2
(direct comparison)
Found: Print out X, Y, Z, and timestamp
NotFound: Search criteria for X, Y, and Z not found in data.
Print X,Y,Z,"timestamp not found". Ask for new X, Y, Z. New search.

Is (n & m) <= m always true?

Given n and m of unsigned integral types, will the expression
(n & m) <= m
always be true?
Yes, it is true.
It should be readily apparent that a necessary condition for y > x is that at least one bit position is set to 1 in y but 0 in x. As & cannot set a bit to 1 if the corresponding operand bits were not already 1, the result cannot be larger than the operands.
Yes, it is always true for unsigned integral data types.
Depending on the value of the mask n, some 1-bits in m may become 0-bits; all bits that are 0 in m will remain 0 in the result. From the point of view of keeping the result as high as possible, the best thing that could happen is that all 1-bits remain in place, in which case the result equals m. In all other cases, the result will be less than m.
Let m be m1m2m3m4m5m6m7m8 where mi is a bit.
Now let n be n1n2n3n4n5n6n7n8.
What would m & n be? All bits in m that were originally 0 will stay 0, because 0 & anything is 0. All bits that were 1 will be 1 only if the corresponding bit in n is 1.
Meaning that in the "best" case the number stays the same, but it can never get bigger, since no 1s can be created by 0 & anything.
Let's go through an example in order to get a better intuition:
Let m be 11101011.
What are some numbers bigger than m? 11111111 (trivial), 11111011, 11101111, 11111010, 11111110, and so on.
n1 n2 n3 n4 n5 n6 n7 n8
↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓   &
1  1  1  0  1  0  1  1
-----------------------
There is no way you can get any of the above combinations from doing this.
Yes. To supplement the reasons described in the other answers, a few binary examples make it pretty clear that it will not be possible to make the result greater than either of the args:
     0
     1
------
     0

     1
     1
------
     1

111011
 11111
------
 11011

111011
111011
------
111011
Given the two arguments, the highest result we can achieve is when both arguments are the same value; the result is then equal to the value of the two arguments (last example above).
It is impossible to make the result larger, no matter what we set the arguments to. If you are able to, then we are in serious trouble. ;)
n & m has all the bits that are set in m and in n.
~n & m has all the bits that are set in m but not in n.
Adding both quantities will give all the bits that are set in m. That is, m:
m = (n & m) + (~n & m)
m ≥ (n & m)
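For extra confidence, the claim is also easy to check exhaustively for small widths (a quick sanity test, not a proof):
#include <cassert>

int main() {
    for (unsigned n = 0; n < 256; ++n)
        for (unsigned m = 0; m < 256; ++m)
            assert((n & m) <= m); // never fires
    return 0;
}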

Fast Exponentiation when only k digits are required - continued

Where I need help...
What I want to do now is translate this solution, which calculates the mantissa of a number, to C++:
n^m = exp10(m log10(n)) = exp(q (m log(n)/q)) where q = log(10)
Finding the first n digits from the result can be done like this:
"the first K digits of exp10(x) = the first K digits of exp10(frac(x))
where frac(x) = the fractional part of x = x - floor(x)."
My attempts (sparked by the math and this code) failed...:
unsigned long long getPrefix(long double exponent, long double length /*length of prefix*/)
{
    long double dummy; //unused but necessary for modf
    long double q = log(10);
    unsigned long long temp = floor(pow(10.0, exp(q * modf((exponent * log(2) / q), &dummy) + length - 1)));
    return temp;
}
If anyone out there can correctly implement this solution, I need your help!!
EDIT
Example output from my attempts:
n: 2
m: 0
n^m: 1
Calculated mantissa: 1.16334
n: 2
m: 1
n^m: 2
Calculated mantissa: 2.32667
n: 2
m: 2
n^m: 4
Calculated mantissa: 4.65335
n: 2
m: 98
n^m: 3.16913e+29
Calculated mantissa: 8.0022
n: 2
m: 99
n^m: 6.33825e+29
Calculated mantissa: 2.16596
I'd avoid pow for this. It's notoriously hard to implement correctly. There are lots of SO questions where people got burned by a bad pow implementation in their standard library.
You can also save yourself a good deal of pain by working in the natural base instead of base 10. You'll get code that looks like this:
long double foo = m * logl(n);
foo = fmodl(foo, logl(10.0)) + some_epsilon;
sprintf(some_string, "%.9Lf", expl(foo));
/* boring string parsing code here */
to compute the appropriate analogue of m log(n). Notice that the largest m * logl(n) that can arise is just a little bigger than 2e10. When you divide that by 2^64 and round up to the nearest power of two, you see that an ulp of foo is 2^-29 at worst. This means, in particular, that you cannot get more than 8 digits out of this method using long doubles, even with a perfect implementation.
some_epsilon will be the smallest long double that makes expl(foo) always exceed the mathematically correct result; I haven't computed it exactly, but it should be on the order of 1e-9.
In light of the precision difficulties here, I might suggest using a library like MPFR instead of long doubles. You may also be able to get something to work using a double double trick and quad-precision exp, log, and fmod.
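Putting the pieces together, a sketch of the above that truncates instead of parsing the printed string might look like this (some_epsilon is the placeholder nudge discussed above, not a computed bound, and k should stay at 8 or below per the ulp analysis):
#include <cmath>

unsigned long long leading_digits(unsigned n, unsigned m, int k) {
    const long double some_epsilon = 1e-9L;
    long double foo = m * logl((long double)n);   // m * ln(n)
    foo = fmodl(foo, logl(10.0L)) + some_epsilon; // reduce mod ln(10)
    long double mantissa = expl(foo);             // exp of the fractional part, in [1, 10)
    return (unsigned long long)(mantissa * powl(10.0L, k - 1)); // first k digits
}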

Generating random numbers given a uniform random number generator

I was asked to generate a random number between a and b, inclusive, using random(0,1). random(0,1) generates a uniform random number between 0 and 1.
I answered
(a+(((1+random(0,1))*b))%(b-a))
My interviewer was not satisfied with my usage of b in this piece of the expression:
(((1+random(0,1))*b))
Then I tried changing my answer to:
int*z=(int*)malloc(sizeof(int));
(a+(((1+random(0,1))*(*z)))%(b-a));
Later the question changed to generate random(1,7) from random(1,5). I responded with:
A = rand(1,5)%3
B = (rand(1,5)+1)%3
C = (rand(1,5)+2)%3
rand(1,7) = rand(1,5)+ (A+B+C)%3
Were my answers correct?
I think you were confused between a random integral-number generator and a random floating-point number generator. In C++, rand() generates a random integral number between 0 and RAND_MAX (at least 32767). Thus, to generate a random number from 1 to 10, we write rand() % 10 + 1. As such, to generate a random number from integer a to integer b, we write rand() % (b - a + 1) + a.
The interviewer told you that you had a random generator from 0 to 1. It means floating-point number generator.
How to get the answer mathematically:
1. Shift the problem so that the lower bound is 0.
2. Scale the range by multiplication.
3. Re-shift to the required range.
For example: to generate R such that
a <= R <= b.
Applying rule 1, we get a-a <= R - a <= b-a, i.e.
0 <= R - a <= b - a.
Think of R - a as R1. How do we generate R1 such that R1 has range from 0 to (b-a)?
R1 = rand(0, 1) * (b-a) // by applying rule 2
Now substitute R1 by R - a
R - a = rand(0,1) * (b-a) ==> R = a + rand(0,1) * (b-a)
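In code, the result of the derivation is simply the following (assuming random01() returns a uniform floating-point value in [0,1]):
double random01(void); // assumed given

double random_range(double a, double b) {
    return a + random01() * (b - a); // R = a + rand(0,1) * (b-a)
}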
==== 2nd question - without explanation ====
We have 1 <= R1 <= 5
==> 0 <= R1 - 1 <= 4
==> 0 <= (R1 - 1)/4 <= 1
==> 0 <= 6 * (R1 - 1)/4 <= 6
==> 1 <= 1 + 6 * (R1 - 1)/4 <= 7
Thus, Rand(1,7) = 1 + 6 * (rand(1,5) - 1) / 4
random(a,b) from random(0,1):
random(0,1)*(b-a)+a
random(c,d) from random(a,b):
(random(a,b)-a)/(b-a)*(d-c)+c
or, simplified for your case (a=1,b=5,c=1,d=7):
random(1,5) * 1.5 - 0.5
(note: I assume we're talking about float values and that rounding errors are negligible)
random(a,b) from random(c,d) = a + (b-a)*((random(c,d) - c)/(d-c))
No?
[random(0,1)*(b-a)] + a, I think, would give random numbers between a and b.
([random(1,5)-1]/4)*6 + 1 should give random numbers in the range (1,7).
I am not sure whether the above destroys the uniform distribution...
Were my answers correct?
I think there are some problems.
First off, I'm assuming that random() returns a floating point value - otherwise to generate any useful distribution of a larger range of numbers using random(0,1) would require repeated calls to generate a pool of bits to work with.
I'm also going to assume C/C++ is the intended platform, since the question is tagged as such.
Given these assumptions, one problem with your answers is that C/C++ do not allow the use of the % operator on floating point types.
But even if we imagine that the % operator were replaced with a function that performed a modulo operation on floating point arguments in a reasonable way, there are still some problems. In your initial answer, if b is zero (say the range given for a and b is (-5, 0)), then your result will be decidedly non-uniform: the result would always be b. The same applies to the uninitialized *z allocated in your second attempt - I'm assuming this is a kind of bizarre way to get an arbitrary value, or is something else intended?
Finally, I'm certainly no statistician, but in your final answer (to generate random(1,7) from random(1,5)), I'm pretty sure that A+B+C would be non-uniform and would therefore introduce a bias in the result.
I think that there is a nicer answer to this. There is one value (with probability tending to zero) for which this overflows, and that is why the modulus is there.
Take a random number x in the interval [0,1].
Increment your upper_bound (which could be a parameter) by one.
Calculate (int(random() / (1.0 / upper_bound)) % upper_bound) + 1 + lower_bound.
This ought to return a number in your desired interval.
Given random(0,5), you can generate random(0,7) in the following way:
A = random(0,5)*random(0,5)
now the range of A is 0-25
if we simply take the modulo 7 of A, we can get random numbers, but they won't be truly uniform: for values of A from 21-25, you will get the values 0-4 after the modulo operation, hence taking modulo 7 over the range (0,25) will bias the output towards 0-4. This is because 7 does not evenly divide 25: the largest multiple of 7 less than or equal to 25 is 7*3=21, and it is the numbers in the incomplete range from 21-25 that cause the bias.
The easiest way to fix this problem is to discard those numbers (from 21-25) and to keep trying again until a number in the suitable range comes up.
Obviously, this is true when we assume that we want random integers.
However, to get random float numbers we need to modify the range accordingly, as described in the posts above.
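A sketch of that discard-and-retry idea, with one change: the 0..24 value is built as 5*(r1-1) + (r2-1), which is uniform, rather than as a product of two rolls, which would not be. rand_1_5() is assumed given and uniform on [1,5].
int rand_1_5(void); // assumed given, uniform on [1,5]

int rand_1_7(void) {
    for (;;) {
        int a = 5 * (rand_1_5() - 1) + (rand_1_5() - 1); // uniform on 0..24
        if (a < 21)           // discard 21..24, the incomplete range
            return a % 7 + 1; // uniform on 1..7
    }
}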