How does this array indexer help coalesced memory access? - c++

This function is defined here:
template <typename T, typename = std::enable_if_t<is_uint32_v<T> || is_uint64_v<T>>>
inline T reverse_bits(T operand, int bit_count)
{
// Just return zero if bit_count is zero
return (bit_count == 0) ? T(0)
: reverse_bits(operand) >> (sizeof(T) * static_cast<std::size_t>(bits_per_byte) -
static_cast<std::size_t>(bit_count));
}
At a later point, this function is used to store elements in a scrambled way into an array:
inv_root_powers_[reverse_bits(i - 1, coeff_count_power_) + 1].set(power, modulus_);
The justification given is that this makes the memory access coalesced. However, I don't see why such seemingly random indices would make memory access any easier. For example, here are some values:
reverse_bits(3661, 12) +1 = 2856
reverse_bits(3662, 12) +1 = 1832
reverse_bits(3663, 12) +1 = 3880
reverse_bits(3664, 12) +1 = 168
reverse_bits(3665, 12) +1 = 2216
reverse_bits(3666, 12) +1 = 1192
reverse_bits(3667, 12) +1 = 3240
reverse_bits(3668, 12) +1 = 680
reverse_bits(3669, 12) +1 = 2728
It seems like things are stored far apart.

You're right - the accesses you see in NTTTables::initialize are random-access and not serial. It is slower because of this "scramble". However, most of the work happens only later in DWTHandler::transform_to_rev, when the transform itself is applied.
There, they need to access the roots in bit-reversed order. Because the array is pre-scrambled, all the accesses to it are now serial: you can see this in the r = *++roots; lines.
The bit-reversed access pattern has a good, real reason behind it: they're doing a variant of the Fast Fourier Transform (FFT). The memory access patterns used in those algorithms (sometimes called butterflies) follow a bit-reversed order.
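As a rough illustration of the trade-off described above, here is a minimal sketch. The names reverse_low_bits and scramble are made up for this example and are not the library's own functions. The table is written once with scattered stores, so that later reads in bit-reversed order become a plain front-to-back walk.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the library's reverse_bits):
// reverses the lowest `bit_count` bits of x.
inline std::uint32_t reverse_low_bits(std::uint32_t x, int bit_count)
{
    std::uint32_t result = 0;
    for (int i = 0; i < bit_count; ++i) {
        result = (result << 1) | (x & 1u);  // move the lowest bit of x into result
        x >>= 1;
    }
    return result;
}

// One-time setup: scattered (slow) writes into bit-reversed positions.
// Assumes natural.size() == 1 << log_n.
inline std::vector<std::uint32_t> scramble(const std::vector<std::uint32_t>& natural, int log_n)
{
    std::vector<std::uint32_t> scrambled(natural.size());
    for (std::size_t i = 0; i < natural.size(); ++i) {
        scrambled[reverse_low_bits(static_cast<std::uint32_t>(i), log_n)] = natural[i];
    }
    return scrambled;
}
```

With an 8-entry table holding 0..7, scramble returns 0, 4, 2, 6, 1, 5, 3, 7, so a loop that simply increments a pointer over the scrambled table sees the entries in bit-reversed order.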

Related

Explanation of the usage of std::max in that code?

I am unsure whether I should have posted this question on codereview.stackexchange.com instead. Anyway, here we go ...
Please consider the following code snippet which is a literal (I have only changed the formatting) excerpt from here and has been printed (in stripped-down form) in the German computer magazine c't, issue 23/2019:
while (lo <= hi) {
std::streamoff pos = std::streamoff((uint64_t(lo) + uint64_t(hi)) / 2);
pos -= pos % std::streamoff(PasswordHashAndCount::size);
pos = std::max<int64_t>(0, pos);
phc.read(mInputFile, pos);
++nReads;
if (hash > phc.hash) {
lo = pos + std::streamoff(PasswordHashAndCount::size);
}
else if (hash < phc.hash) {
hi = pos - std::streamoff(PasswordHashAndCount::size);
}
else {
safe_assign(readCount, nReads);
return phc;
}
}
Why do we need the fourth line pos = std::max<int64_t>(0, pos);?
From the second line, we see that pos is greater than or equal to 0, because it is half of the sum of two numbers which are themselves of type uint64_t.
The third line can't make pos lower than 0. Proof:
For simplicity, replace pos by A and std::streamoff(PasswordHashAndCount::size) by B. Then the third line reads A -= A % B which is equivalent to A = A - (A % B), where A and B are integers, A being equal to or greater than 0, and B being greater than 0 (because ::size is always greater than 0).
First, if A < B, A % B = A. In this case, the third line becomes A = A - A, that is, A = 0.
Secondly, if A == B, A % B becomes A % A which is 0. Therefore, the third line becomes A = A - 0, which is equivalent to a null operation. In other words, A does not change in that case; notably, it remains 0 or greater than that.
Third, if A > B, A - (A % B) is greater than 0. This is because A % B is smaller than B, and thus, A - (A % B) is greater than A - B. The latter in turn is greater than 0, because the condition here was A > B.
Of course, the three cases A > B, A < B and A == B are all cases which can occur. In every case, the third line assigns a new value to A which is 0 or positive.
Coming back to the original variable naming, that means that pos is always 0 or greater than that after execution of the third line.
Given that, I don't understand what the fourth line does. After all, max(0, pos) is always equivalent to pos if pos is 0 or positive.
What am I missing? Is there an error in the reasoning above?
Let's consider what exactly it does:
std::streamoff pos = std::streamoff((uint64_t(lo) + uint64_t(hi)) / 2);
pos = std::max<int64_t>(0, pos);
std::streamoff is some implementation-defined signed integer type. Let's consider a case where it is a 64 bit type or smaller: The value of pos will not be changed by the conversion to int64_t, because int64_t is at least as wide, nor when converting back in the assignment, because the original value must have been representable.
Let's consider a case where std::streamoff is a 128 bit type or wider: The value comes from (uint64_t(lo) + uint64_t(hi)) / 2 which cannot exceed maximum of int64_t. Thus, the value cannot be changed by the conversion in this case either.
Thus, the use of int64_t has no effect in any case.
The third line can't make pos lower than 0
Is there an error in the reasoning above?
I cannot find any error.
Given that, I don't understand what the fourth line does.
The line has no effect at all on the behaviour of the program. The program would have equivalent behaviour if the line was written:
;
Besides, on most systems that you find on desktop or server, std::streamoff and int64_t have the same number of bits.
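The rounding step discussed above can be isolated in a small sketch. The name align_down and the use of std::int64_t instead of std::streamoff are assumptions made for this example only. For pos >= 0 and record_size > 0, the remainder pos % record_size lies between 0 and pos, so the result can never drop below 0.

```cpp
#include <cstdint>

// Hypothetical helper: rounds pos down to a multiple of record_size.
// Assumes pos >= 0 and record_size > 0, as in the question's reasoning.
inline std::int64_t align_down(std::int64_t pos, std::int64_t record_size)
{
    // 0 <= pos % record_size <= pos, hence 0 <= result <= pos.
    return pos - pos % record_size;
}
```

With a made-up record size of 24, align_down maps 0 and 23 to 0, 24 to 24 and 50 to 48, covering the A < B, A == B and A > B cases from the proof.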

Understanding type inference

I believe I am having a problem with both the data type and ownership of iter. It is first declared inside the for loop expression. I believe Rust infers that iter is of type u16 because it is being used inside of my computation on line 4.
1 let mut numbers: [Option<u16>; 5];
2 for iter in 0..5 {
3 let number_to_add: u16 = { // `iter` moves to inner scope
4 ((iter * 5) + 2) / (4 * 16) // Infers `iter: u16`
5 };
6
7 numbers[iter] = Some(number_to_add); // Expects `iter: usize`
8 }
I am receiving the following error:
error[E0277]: the type `[std::option::Option<u16>]` cannot be indexed by `u16`
--> exercises/option/option1.rs:3:9
|
7 | numbers[iter] = Some(number_to_add);
| ^^^^^^^^^^^^^ slice indices are of type `usize` or ranges of `usize`
I tried casting iter as u16 inside the computation in line 4, but I am still having issues.
Where is my misconception?
Your assumption is correct. And your fix was ok too (it led to a different error, see below).
Your first problem was that for slice indexing, iter needs to be of type usize so either
numbers[iter as usize] = Some(number_to_add);
or
((iter as u16 * 5) + 2) / (4 * 16)
will lead to correct type inference through rustc.
Your second problem was that numbers was not initialized, so rustc correctly reports an error when you try to modify numbers. Assigning a value, e.g.,
let mut numbers: [Option<u16>; 5] = [None; 5];
will let you compile your program.
Your reasoning is correct. Just to add, if you only want to initialize your array, you might also consider this way of doing it:
let arr_elem = |i: u16| Some(((i * 5) + 2) / (4 * 16));
let numbers : [Option<u16>; 5] = [
arr_elem(0),
arr_elem(1),
arr_elem(2),
arr_elem(3),
arr_elem(4),
];
This way, you do not need to have it mut (at the cost of writing a helper function to initialize a single element and stating the initializer elements, but that could be automated e.g. via a macro or some helper traits).
Aside from the existing answers, a somewhat higher level approach could also be cleaner (depending on taste)
let mut numbers = [None; 5];
for (i, n) in numbers.iter_mut().enumerate() {
let iter = i as u16;
let number_to_add: u16 =
((iter * 5) + 2) / (4 * 16);
*n = Some(number_to_add);
}
Another alternative would be a lazier approach, but there's (afaik) no way to e.g. try_collect into an array, only try_from a slice to an array, so you'd need to collect() to a vec, then try_from to an array, which seems less than useful. Though you could always use an iterator to initialise your array:
let mut it = (0u16..5).map(|i| ((i * 5) + 2) / (4 * 16));
let numbers = [it.next(), it.next(), it.next(), it.next(), it.next()];
Also
// `iter` moves to inner scope
iter is Copy so it's just... copied. Kinda. And the block is not useful either, it only contains a simple expression.

How to store output of very large Fibonacci number?

I am making a program for nth Fibonacci number. I made the following program using recursion and memoization.
The main problem is that the value of n can go up to 10000 which means that the Fibonacci number of 10000 would be more than 2000 digit long.
With a little bit of googling, I found that i could use arrays and store every digit of the solution in an element of the array but I am still not able to figure out how to implement this approach with my program.
#include<iostream>
using namespace std;
long long int memo[101000];
long long int n;
long long int fib(long long int n)
{
if(n==1 || n==2)
return 1;
if(memo[n]!=0)
return memo[n];
return memo[n] = fib(n-1) + fib(n-2);
}
int main()
{
cin>>n;
long long int ans = fib(n);
cout<<ans;
}
How do I implement that approach or if there is another method that can be used to achieve such large values?
One thing that I think should be pointed out is that there are other ways to implement fib that are much easier for something like C++ to compute
consider the following pseudo code
function fib (n) {
let a = 0, b = 1, _;
while (n > 0) {
_ = a;
a = b;
b = b + _;
n = n - 1;
}
return a;
}
This doesn't require memoisation and you don't have to be concerned about blowing up your stack with too many recursive calls. Recursion is a really powerful looping construct but it's one of those fubu things that's best left to langs like Lisp, Scheme, Kotlin, Lua (and a few others) that support it so elegantly.
That's not to say tail call elimination is impossible in C++, but unless you're doing something to optimise/compile for it explicitly, I'm doubtful that whatever compiler you're using would support it by default.
As for computing the exceptionally large numbers, you'll have to either get creative doing adding The Hard Way or rely upon an arbitrary precision arithmetic library like GMP. I'm sure there's other libs for this too.
Adding The Hard Way™
Remember how you used to add big numbers when you were a little tater tot, fresh off the aluminum foil?
5-year-old math
1259601512351095520986368
+ 50695640938240596831104
---------------------------
?
Well you gotta add each column, right to left. And when a column overflows into the double digits, remember to carry that 1 over to the next column.
... <-001
1259601512351095520986368
+ 50695640938240596831104
---------------------------
... <-472
The 10,000th fibonacci number is thousands of digits long, so there's no way that's going to fit in any integer C++ provides out of the box. So without relying upon a library, you could use a string or an array of single-digit numbers. To output the final number, you'll have to convert it to a string tho.
(wolfram alpha: fibonacci 10000)
Doing it this way, you'll perform a couple million single-digit additions; it might take a while, but it should be a breeze for any modern computer to handle. Time to get to work !
Here's an example of a Bignum module in JavaScript
const Bignum =
{ fromInt: (n = 0) =>
n < 10
? [ n ]
: [ n % 10, ...Bignum.fromInt (n / 10 >> 0) ]
, fromString: (s = "0") =>
Array.from (s, Number) .reverse ()
, toString: (b) =>
b .reverse () .join ("")
, add: (b1, b2) =>
{
const len = Math.max (b1.length, b2.length)
let answer = []
let carry = 0
for (let i = 0; i < len; i = i + 1) {
const x = b1[i] || 0
const y = b2[i] || 0
const sum = x + y + carry
answer.push (sum % 10)
carry = sum / 10 >> 0
}
if (carry > 0) answer.push (carry)
return answer
}
}
We can verify that the Wolfram Alpha answer above is correct
const { fromInt, toString, add } =
Bignum
const bigfib = (n = 0) =>
{
let a = fromInt (0)
let b = fromInt (1)
let _
while (n > 0) {
_ = a
a = b
b = add (b, _)
n = n - 1
}
return toString (a)
}
bigfib (10000)
// "336447 ... 366875"
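Since the question is about C++, the same column-by-column idea can be sketched there too. This is only one possible shape, and the names Digits, add_digits and big_fib are invented for this example. Digits are stored least significant first so that a carry can simply be appended at the end.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Digits = std::vector<int>;  // decimal digits, least significant first (hypothetical alias)

// Schoolbook addition: add column by column and carry into the next column.
inline Digits add_digits(const Digits& x, const Digits& y)
{
    Digits sum;
    sum.reserve(std::max(x.size(), y.size()) + 1);
    int carry = 0;
    for (std::size_t i = 0; i < x.size() || i < y.size(); ++i) {
        int column = carry;
        if (i < x.size()) column += x[i];
        if (i < y.size()) column += y[i];
        sum.push_back(column % 10);
        carry = column / 10;
    }
    if (carry > 0) sum.push_back(carry);
    return sum;
}

// Iterative Fibonacci on digit vectors; returns the number as a decimal string.
inline std::string big_fib(unsigned n)
{
    Digits a{0};
    Digits b{1};
    for (; n > 0; --n) {
        Digits next = add_digits(a, b);
        a = std::move(b);
        b = std::move(next);
    }
    std::string text;
    for (auto it = a.rbegin(); it != a.rend(); ++it) {
        text.push_back(static_cast<char>('0' + *it));  // most significant digit first
    }
    return text;
}
```

Calling big_fib(100) gives "354224848179261915075", and big_fib(10000) begins with 336447 and ends with 366875, matching the value quoted above.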
Try not to use recursion for a simple problem like fibonacci. And if you'll only use it once, don't use an array to store all results. An array of 2 elements containing the 2 previous fibonacci numbers will be enough. In each step, you then only have to sum up those 2 numbers.
How can you save 2 consecutive fibonacci numbers? Well, you know that of 2 consecutive integers one is even and one is odd. So you can use that property to know where to get/place a fibonacci number: for fib(i), if i is even (i%2 is 0) place it in the first element of the array (index 0), else (i%2 is then 1) place it in the second element (index 1).
Why can you just place it there? Well, when you're calculating fib(i), the value currently stored in the slot where fib(i) should go is fib(i-2) (because (i-2)%2 is the same as i%2). But you won't need fib(i-2) any more: fib(i+1) only needs fib(i-1) (that's still in the array) and fib(i) (that just got inserted in the array).
So you could replace the recursion calls with a for loop like this:
int fibonacci(int n){
if( n <= 0){
return 0;
}
int previous[] = {0, 1}; // start with fib(0) and fib(1)
for(int i = 2; i <= n; ++i){
// modulo can be implemented with bit operations(much faster): i % 2 = i & 1
previous[i&1] += previous[(i-1)&1]; //shorter way to say: previous[i&1] = previous[i&1] + previous[(i-1)&1]
}
//Result is in previous[n&1]
return previous[n&1];
}
Recursion is often discouraged in this kind of code because of the time (function calls) and resources (stack) it consumes. So each time you use recursion, try to replace it with a loop, plus a stack with simple pop/push operations if you need to save the "current position" (in C++ one can use a vector). In the case of fibonacci, the stack isn't even needed, but if you are iterating over a tree data structure, for example, you'll need a stack (depending on the implementation, though). As I was looking for my solution, I saw @naomik provided a solution with the while loop. That one is fine too, but I prefer the array with the modulo operation (a bit shorter).
Now concerning the problem of the size long long int has, it can be solved by using external libraries that implement operations for big numbers (like the GMP library or Boost.multiprecision). But you could also create your own version of a BigInteger-like class from Java and implement the basic operations like the one I have. I've only implemented the addition in my example (try to implement the others they are quite similar).
The main idea is simple, a BigInt represents a big decimal number by cutting its little endian representation into pieces (I'll explain why little endian at the end). The length of those pieces depends on the base you choose. If you want to work with decimal representations, it will only work if your base is a power of 10: if you choose 10 as base each piece will represent one digit, if you choose 100 (= 10^2) as base each piece will represent two consecutive digits starting from the end(see little endian), if you choose 1000 as base (10^3) each piece will represent three consecutive digits, ... and so on. Let's say that you have base 100, 12765 will then be [65, 27, 1], 1789 will be [89, 17], 505 will be [5, 5] (= [05,5]), ... with base 1000: 12765 would be [765, 12], 1789 would be [789, 1], 505 would be [505]. It's not the most efficient, but it is the most intuitive (I think ...)
The addition is then a bit like the addition on paper we learned at school:
1. begin with the lowest piece of the BigInt
2. add it with the corresponding piece of the other one
3. the lowest piece of that sum (= the sum modulo the base) becomes the corresponding piece of the final result
4. the "bigger" pieces of that sum will be added ("carried") to the sum of the following pieces
5. go to step 2 with next piece
6. if no piece left, add the carry and the remaining bigger pieces of the other BigInt (if it has pieces left)
For example:
9542 + 1097855 = [42, 95] + [55, 78, 09, 1]
lowest piece = 42 and 55 --> 42 + 55 = 97 = [97]
---> lowest piece of result = 97 (no carry, carry = 0)
2nd piece = 95 and 78 --> (95+78) + 0 = 173 = [73, 1]
---> 2nd piece of final result = 73
---> remaining: [1] = 1 = carry (will be added to sum of following pieces)
no piece left in first `BigInt`!
--> add carry ( [1] ) and remaining pieces from second `BigInt`( [9, 1] ) to final result
--> first additional piece: 9 + 1 = 10 = [10] (no carry)
--> second additional piece: 1 + 0 = 1 = [1] (no carry)
==> 9542 + 1 097 855 = [42, 95] + [55, 78, 09, 1] = [97, 73, 10, 1] = 1 107 397
Here is a demo where I used the class above to calculate the fibonacci of 10000 (result is too big to copy here)
Good luck!
PS: Why little endian? For ease of implementation: it lets you use push_back when adding digits, and iteration while implementing the operations starts from the first piece instead of the last piece in the array.
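The piece-wise addition described above could look roughly like the sketch below. It is not the BigInt class this answer refers to: add_pieces is a hypothetical free function written for this example, with pieces stored little endian and the base passed as a parameter.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: adds two little-endian piece vectors in the given base.
// Assumes every piece is already smaller than base.
inline std::vector<std::uint32_t> add_pieces(const std::vector<std::uint32_t>& a,
                                             const std::vector<std::uint32_t>& b,
                                             std::uint32_t base = 100)
{
    std::vector<std::uint32_t> result;
    std::uint32_t carry = 0;
    // Keep going while either number has pieces left or a carry is pending.
    for (std::size_t i = 0; i < a.size() || i < b.size() || carry != 0; ++i) {
        std::uint32_t sum = carry;
        if (i < a.size()) sum += a[i];
        if (i < b.size()) sum += b[i];
        result.push_back(sum % base);  // lowest piece of the sum
        carry = sum / base;            // the rest is carried to the next piece
    }
    return result;
}
```

For the worked example, add_pieces({42, 95}, {55, 78, 9, 1}) yields {97, 73, 10, 1}, which reads as 1 107 397 when the pieces are written from last to first.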

Is there a way to find the next item in random sequence?

I know that there was a program like this:
#include <cstdlib>
#include <iostream>
#include <string>
int main() {
const std::string alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
std::string temp = "1234567890";
srand(MAGICNUMBER);
for (int i = 0;; ++i) {
for (int j = 0; j < 10; ++j)
temp[j] = alphabet[rand() % alphabet.size()];
std::cout << temp << std::endl;
}
}
Basically, random 10-symbol string generator.
I also know that the 124660967-th generated string was "2lwd9JjVnE". Is there a way to find what the MAGICNUMBER is, or, at least, the next string in the sequence?
Brute-forcing would be painful, given the time it takes to generate one such sequence, but I have some info about the compiler used (if that helps?): it was 64-bit g++ 4.8 for Linux.
UPD. Finding the next item would already be very helpful; can I do that in reasonable amount of time (especially without a seed)?
Yes, given typical rand() implementations this is likely to be possible, fairly easy, even.
rand() is typically a linear congruential generator such that each internal state of the generator is formed from a simple arithmetic equation of the previous state: x1 = (a*x0 + c) % m. You'll need to know the constants a, c and m used by the particular implementation you're targeting, and the method of producing the output value from the state (usually the values are either the entire state, or the upper half of the state). It's also important that the state is typically only 32-bits. A larger state would be more difficult.
So you need to find a state for the pRNG such that the next ten states produce the particular sequence of indices that produce the 10 characters you're looking for: 2lwd9JjVnE. So assuming the entire state is output by rand(), you need to find some 32-bit number x such that:
x % 62 == 54
(x1 = (a*x + c) % m) % 62 == 11
(x2 = (a*x1 + c) % m) % 62 == 22
(x3 = (a*x2 + c) % m) % 62 == 3
(x4 = (a*x3 + c) % m) % 62 == 61
(x5 = (a*x4 + c) % m) % 62 == 35
(x6 = (a*x5 + c) % m) % 62 == 9
(x7 = (a*x6 + c) % m) % 62 == 47
(x8 = (a*x7 + c) % m) % 62 == 13
(x9 = (a*x8 + c) % m) % 62 == 30
This could be done without too much difficulty by trying all 2^32 possible state values (assuming the typical 32-bit state). However, since the constants used were probably chosen to ensure that the RNG runs through a complete 32-bit period, you can simply choose any state at all and run it until you find this sequence.
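The residues 54, 11, 22, ... in the conditions above are just the positions of the target characters in the alphabet string. A small sketch that derives them (target_indices is a name invented for this example):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: maps each character of target to its index in alphabet.
inline std::vector<std::size_t> target_indices(const std::string& alphabet,
                                               const std::string& target)
{
    std::vector<std::size_t> indices;
    for (char ch : target) {
        indices.push_back(alphabet.find(ch));  // std::string::npos if ch is missing
    }
    return indices;
}
```

For the alphabet from the question and the string "2lwd9JjVnE" this returns 54, 11, 22, 3, 61, 35, 9, 47, 13, 30, the right-hand sides listed above.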
Either way, once you know the state that produces these values, you then simply have to run the generator backwards for 124660967 * 10 steps in order to find which state was used as the original seed. To do that you'll need to compute the modular multiplicative inverse of a mod m. Alternatively you could run it forward for (period - 124660967*10) steps.
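Running a generator backwards might be sketched as follows, under loud assumptions: the modulus is taken to be 2^32 (so unsigned wrap-around does the reduction), and the constants kA and kC are placeholders chosen for illustration, not values confirmed for the rand() in question. The inverse of an odd multiplier modulo 2^32 is found by Newton iteration.

```cpp
#include <cstdint>

// Placeholder constants (assumption): substitute the real generator's a and c.
const std::uint32_t kA = 1103515245u;
const std::uint32_t kC = 12345u;

// Forward step: x1 = (a * x0 + c) mod 2^32, via unsigned wrap-around.
inline std::uint32_t lcg_next(std::uint32_t x)
{
    return kA * x + kC;
}

// Multiplicative inverse of an odd a modulo 2^32.
inline std::uint32_t inverse_mod_2_32(std::uint32_t a)
{
    std::uint32_t inv = a;        // a * a == 1 (mod 8), so 3 bits are already correct
    for (int i = 0; i < 5; ++i) {
        inv *= 2u - a * inv;      // each round doubles the number of correct bits
    }
    return inv;
}

// Backward step: x0 = a^-1 * (x1 - c) mod 2^32.
inline std::uint32_t lcg_prev(std::uint32_t x)
{
    return inverse_mod_2_32(kA) * (x - kC);
}
```

With these definitions, lcg_prev(lcg_next(x)) gives back x for any 32-bit x, so repeating lcg_prev walks the state sequence in reverse. If the real generator has a different modulus or is not a linear congruential generator at all, this sketch does not apply as written.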
No, it's close to impossible. As @chux pointed out in their comment, the exact implementation isn't specified in the C++ standard.
You'd need to check the sequences generated by all possible seeds, which would take an unreasonable amount of computing time.
Though if the compiler is well known, and the implementation is open source (as is in your specific case), there could be ways to find out the initial seed value, knowing the specific rand() result for a specific iteration on the call.
If you have access to the program, disassemble it to attempt to learn what the magic number was.
Otherwise the standard doesn't specify anything about storing the srand value so you're stuck with alternate approaches, such as brute-forcing all seeds, or possibly trying to store the sequence of random numbers looking for the ten in a row that generate the string you're interested in.

Fast inner product of ternary vectors

Consider two vectors, A and B, of size n, 7 <= n <= 23. Both A and B consist of -1s, 0s and 1s only.
I need a fast algorithm which computes the inner product of A and B.
So far I've thought of storing the signs and values in separate uint32_ts using the following encoding:
sign 0, value 0 → 0
sign 0, value 1 → 1
sign 1, value 1 → -1.
The C++ implementation I've thought of looks like the following:
struct ternary_vector {
uint32_t sign, value;
};
int inner_product(const ternary_vector & a, const ternary_vector & b) {
uint32_t psign = a.sign ^ b.sign;
uint32_t pvalue = a.value & b.value;
psign &= pvalue;
pvalue ^= psign;
return __builtin_popcount(pvalue) - __builtin_popcount(psign);
}
This works reasonably well, but I'm not sure whether it is possible to do it better. Any comment on the matter is highly appreciated.
I like having the 2 uint32_t, but I think your actual calculation is a bit wasteful
Just a few minor points:
I'm not sure about the reference (getting a and b by const &) - this adds a level of indirection compared to putting them on the stack. When the code is this small (a couple of clocks maybe) this is significant. Try passing by value and see what you get
__builtin_popcount can be, unfortunately, very inefficient. I've used it myself, but found that even a very basic implementation I wrote was far faster than this. However - this is dependent on the platform.
Basically, if the platform has a hardware popcount implementation, __builtin_popcount uses it. If not - it uses a very inefficient replacement.
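For reference, a basic portable bit count could look like the sketch below. It is a well-known bit-parallel technique, not the implementation mentioned above, and popcount32_swar is a name chosen for this example. Whether it beats __builtin_popcount depends on the platform and compiler flags, so it is worth measuring.

```cpp
#include <cstdint>

// Bit-parallel population count for a 32-bit word.
inline int popcount32_swar(std::uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                  // counts per 2-bit group
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  // counts per 4-bit group
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                  // counts per byte
    return static_cast<int>((x * 0x01010101u) >> 24);  // sum of the four bytes
}
```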
The one serious problem here is the reuse of the psign and pvalue variables for the positive and negative vectors. You are doing neither your compiler nor yourself any favors by obfuscating your code in this way.
Would it be possible for you to encode your ternary state in a std::bitset<2> and define the product in terms of and? For example, if your ternary types are:
1 = P = (1, 1)
0 = Z = (0, 0)
-1 = M = (1, 0) or (0, 1)
I believe you could define their product as:
1 * 1 = 1 => P * P = P => (1, 1) & (1, 1) = (1, 1) = P
1 * 0 = 0 => P * Z = Z => (1, 1) & (0, 0) = (0, 0) = Z
1 * -1 = -1 => P * M = M => (1, 1) & (1, 0) = (1, 0) = M
Then the inner product could start by taking the and of the bits of the elements and... I am working on how to add them together.
Edit:
My foolish suggestion did not consider that (-1)(-1) = 1, which cannot be handled by the representation I proposed. Thanks to @user92382 for bringing this up.
Depending on your architecture, you may want to optimize away the temporary bit vectors -- e.g. if your code is going to be compiled to FPGA, or laid out to an ASIC, then a sequence of logical operations will be better in terms of speed/energy/area than storing and reading/writing to two big buffers.
In this case, you can do:
int inner_product(const ternary_vector & a, const ternary_vector & b) {
return __builtin_popcount( a.value & b.value & ~(a.sign ^ b.sign))
- __builtin_popcount( a.value & b.value & (a.sign ^ b.sign));
}
This will lay out very well -- the (a.value & b.value & ... ) can enable/disable an XOR gate, whose output splits into two signed accumulators, with the first pathway NOTed before accumulation.
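To check a bit-level formula like this against the plain definition, a small test harness helps. The sketch below is an assumption-laden example: encode, portable_popcount and naive_inner_product are hypothetical helpers, the encoding is the one from the question (value bit set for non-zero entries, sign bit set for -1), and std::bitset is used for counting so the code does not depend on a compiler builtin.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ternary_vector {
    std::uint32_t sign = 0, value = 0;
};

// Hypothetical helper: packs up to 32 entries from {-1, 0, 1}.
inline ternary_vector encode(const std::vector<int>& entries)
{
    ternary_vector packed;
    for (std::size_t i = 0; i < entries.size(); ++i) {
        if (entries[i] != 0) packed.value |= std::uint32_t{1} << i;
        if (entries[i] < 0) packed.sign |= std::uint32_t{1} << i;
    }
    return packed;
}

inline int portable_popcount(std::uint32_t x)
{
    return static_cast<int>(std::bitset<32>(x).count());
}

// Same formula as above: (+1 products) minus (-1 products).
inline int inner_product(const ternary_vector& a, const ternary_vector& b)
{
    const std::uint32_t nonzero = a.value & b.value;  // both entries non-zero
    const std::uint32_t negative = a.sign ^ b.sign;   // signs differ
    return portable_popcount(nonzero & ~negative) - portable_popcount(nonzero & negative);
}

// Reference implementation straight from the definition.
inline int naive_inner_product(const std::vector<int>& a, const std::vector<int>& b)
{
    int total = 0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) total += a[i] * b[i];
    return total;
}
```

For A = {1, -1, 0, 1, -1, 0, 1} and B = {1, -1, 1, 0, 1, 0, 1}, both inner_product(encode(A), encode(B)) and naive_inner_product(A, B) evaluate to 2.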