Problem with initialising 2D vector in C++ - c++

I was implementing a solution for this problem to get a feel for the language. My reasoning is as follows:
Notice that the pattern on the diagonal is 2*n+1.
The elements to the left and upwards are alternating arithmetic progressions or additions/subtractions of the elements from the diagonal to the boundary.
Create a 2D vector and initialise all the diagonal elements. Then use a temporary variable to fill in the remaining cells by adding to or subtracting from the diagonal elements.
My code is as follows:
#include <vector>
using namespace std;
const long value = 1e9;
vector<vector<long>> spiral(value, vector<long> (value));
long temp;
void build(){
spiral[0][0] = 1;
for(int i = 1; i < 5e8; i++){
spiral[i][i]= 2*i+1;
temp = i;
long counter = temp;
while(counter){
if(temp % 2 ==0){
spiral[i][counter]++;
spiral[counter][i]--;
counter--;
temp--;
}else{
spiral[i][counter]--;
spiral[counter][i]++;
counter--;
temp--;
}
}
}
}
int main(){
spiral[0][0] = 1;
build();
int y, x;
cin >> y >> x;
cout << spiral[y][x] << endl;
}
The problem is that the programme doesn't output anything. I can't figure out why my vector won't print any elements. I've tested it with spiral[1][1], and all I get is some obscure assembler message after waiting 5 or 10 minutes. What's wrong with my reasoning?

A long is probably 4 or 8 bytes for you (commonly 4 bytes on Windows and on x86 Linux, and 8 bytes on x64 Linux), so let's assume 4. Then 1e9 * 4 is 4 gigabytes of contiguous memory for each vector<long> (value).
The outer vector then creates 1e9 copies of that, which is 4 exabytes (4 million terabytes) for a 32-bit long, or double that for a 64-bit long, ignoring the overhead of each std::vector. It is highly unlikely that you have that much memory plus swap, and because spiral is a global, the allocation is attempted before main() is called.
So you are not going to be able to store all this data directly, you will need to think about what data actually needs to be stored to get the result you desire.
If you run under a debugger set to stop on exceptions, you might see a std::bad_alloc being thrown, with the call stack indicating the cause (e.g. Visual Studio will display something like "dynamic initializer for 'spiral'" in the call stack). On Linux it is also possible that the OS just kills the process first: Linux can over-commit memory, so new etc. succeeds, and the failure only shows up when a program actually reads or writes the memory, at which point the kernel sends SIGKILL to some process to free memory. This doesn't seem entirely predictable; I copy-pasted your code onto Ubuntu 18 and on the command line got "terminate called after throwing an instance of 'std::bad_alloc'".
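To put a number on the estimate above, here is a minimal sketch that computes the payload size of a rows x cols grid, ignoring the per-vector overhead just like the text does. The helper names gridBytes and toExabytes are made up for illustration, and double is used only because the byte count gets close to the limit of 64-bit integers.

```cpp
#include <cstddef>

// Hypothetical helper: approximate payload bytes of a rows x cols grid,
// ignoring the bookkeeping each std::vector adds on top.
double gridBytes(double rows, double cols, std::size_t elemSize) {
    return rows * cols * static_cast<double>(elemSize);
}

// Convert bytes to decimal exabytes (1 EB = 1e18 bytes).
double toExabytes(double bytes) {
    return bytes / 1e18;
}
```

Calling toExabytes(gridBytes(1e9, 1e9, sizeof(long))) gives about 4 where long is 4 bytes and about 8 where it is 8 bytes.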

The problem actually asks you to find an analytical formula for the solution, not to simulate the pattern. All you need to do is to carefully analyze the pattern:
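#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
// 64-bit arithmetic: with coordinates up to 1e9 (as in the question),
// n * n does not fit in 32 bits.
std::uint64_t get_n(std::uint64_t row, std::uint64_t col) {
    assert(row >= 1 && col >= 1);
    const auto n = std::max(row, col);
    if (n % 2 == 0)
        std::swap(row, col);
    if (col == n)
        return n * n + 1 - row;
    else
        return (n - 1) * (n - 1) + col;
}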

Math is your friend, here, not std::vector. One of the constraints of this puzzle is a memory limit of 512MB, but a vector big enough for all the tests would require several GB of memory.
Consider how the square is filled. If you choose the maximum between the given x and y (call it w), you have "delimited" a square of size w^2. Now you have to consider the outer edge of this square to find the actual index.
E.g. take x = 6 and y = 3. The maximum is 6 (even, remember the zig-zag pattern), so the number is (6 - 1)^2 + 3 = 28
* * * * * 26
* * * * * 27
* * * * * [28]
* * * * * 29
* * * * * 30
36 35 34 33 32 31
Here, a proof of concept.

Related

TCS MockVita 2019 Round 2 Question: Hop Game

I am trying to solve a problem asked in TCS MockVita 2019 Round 2:
Problem Description
Dr Felix Kline, the Math teacher at Gauss School introduced the following game to teach his students problem solving. He places a series of “hopping stones” (pieces of paper) in a line with points (a positive number) marked on each of the stones.
Students start from one end and hop to the other end. One can step on a stone and add the number on the stone to their cumulative score or jump over a stone and land on the next stone. In this case, they get twice the points marked on the stone they land but do not get the points marked on the stone they jumped over.
At most once in the journey, the student is allowed (if they choose) to do a “double jump”– that is, they jump over two consecutive stones – where they would get three times the points of the stone they land on, but not the points of the stone they jump over.
The teacher expected his students to do some thinking and come up with a plan to get the maximum score possible. Given the numbers on the sequence of stones, write a program to determine the maximum score possible.
Constraints
The number of stones in the sequence< 30
Input Format
The first line contains N, the number of integers (this is a positive integer)
The next line contains the N points (each a positive integer) separated by commas. These are the points on the stones in the order the stones are placed.
Output
One integer representing the maximum score
Test Case
Explanation
Example 1
Input
3
4,2,3
Output
10
Explanation
There are 3 stones (N=3), and the points (in the order laid out) are 4,2 and 3 respectively.
If we step on the first stone and skip the second, we get 4 + 2 x 3 = 10. A double jump to the third stone will get only 9. Hence the result is 10, and the double jump is not used.
Example 2
Input
6
4,5,6,7,4,5
Output
35
Explanation
N=6, and the sequence of points is given. One way of getting 35 is to start with a double jump to stone 3 (3 x 6=18), go to stone 4 (7) and jump to stone 6 (10 points) for a total of 35. The double jump was used only once, and the result is 35.
I found that it's a Dynamic programming problem, but I don't know what I did wrong because my solution is not able to pass all the test cases. My code passed all the tests I created.
unordered_map<int, int> lookup;
int res(int *arr, int n, int i){
if(i == n-1){
return 0;
}
if(i == n-2){
return arr[i+1];
}
if(lookup.find(i) != lookup.end())
return lookup[i];
int maxScore = 0;
if(i< n-3 && flag == false){
flag = true;
maxScore = max(maxScore, 3 * (arr[i+3]) + res(arr, n, i+3));
flag = false;
}
maxScore = max(maxScore, (arr[i+1] + res(arr,n,i+1)));
lookup[i] = max(maxScore, 2 * (arr[i+2]) + res(arr, n, i+2));
return lookup[i];
}
cout << res(arr, n, 0) + arr[0]; // It is inside the main()
I expect you to find the mistake in my code and give the correct solution, and any test case which fails this solution. Thanks :)
You don't need any map. All you need to remember are the last few maximal values. On every move (except the first two) you have two options: end with the double jump (dj) already made, or without it. If you don't want to make a dj, your best choice is the maximum of (last stone + current) and (stone before last + 2 * current): max(no_dj[2] + arr[i], no_dj[1] + 2 * arr[i]).
On the other hand, if you want the dj to have been made, then you have three options: step one stone after some previous dj, dj[2] + arr[i]; jump over the last stone after some dj, dj[1] + 2 * arr[i]; or do the double jump in the current move, no_dj[0] + 3 * arr[i].
int res(int *arr, int n){
int no_dj[3]{ 0, 0, arr[0]};
int dj[3]{ 0, 0, 0};
for(int i = 1; i < n; i++){
int best_nodj = max(no_dj[1] + 2 * arr[i], no_dj[2] + arr[i]);
int best_dj = 0;
if(i > 1) best_dj = max(max(dj[1] + 2 * arr[i], dj[2] + arr[i]), no_dj[0] + 3 * arr[i]);
no_dj[0] = no_dj[1];
no_dj[1] = no_dj[2];
no_dj[2] = best_nodj;
dj[0] = dj[1];
dj[1] = dj[2];
dj[2] = best_dj;
}
return max(no_dj[2], dj[2]);
}
All you have to remember are two arrays of three elements. Last three maximum values after double jump and last three maximum values without double jump.
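For comparison, the recursive approach from the question can also be made to work if the memo table is keyed on both the position and whether the double jump has already been used; a table keyed on the position alone mixes the two cases together. The following is only a sketch under my reading of the statement (the student starts before the first stone, as Example 2 implies, and all points are positive); the names best and hopScore are made up.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Best additional score from position p to the last stone.
// p = 0 means "before the first stone"; otherwise the student stands on
// stone p (1-based), so the next stone is a[p].
// used tells whether the single double jump has been spent.
// memo[p][used] caches results; -1 means "not computed yet".
long long best(const std::vector<int>& a, std::size_t p, bool used,
               std::vector<std::vector<long long>>& memo) {
    const std::size_t n = a.size();
    if (p == n) return 0;                        // standing on the last stone
    long long& slot = memo[p][used ? 1 : 0];
    if (slot >= 0) return slot;
    long long res = a[p] + best(a, p + 1, used, memo);                 // step
    if (p + 2 <= n)                                                    // jump
        res = std::max(res, 2LL * a[p + 1] + best(a, p + 2, used, memo));
    if (!used && p + 3 <= n)                                    // double jump
        res = std::max(res, 3LL * a[p + 2] + best(a, p + 3, true, memo));
    return slot = res;
}

long long hopScore(const std::vector<int>& a) {
    std::vector<std::vector<long long>> memo(a.size() + 1,
                                             std::vector<long long>(2, -1));
    return best(a, 0, false, memo);
}
```

With the two examples from the statement, hopScore({4, 2, 3}) gives 10 and hopScore({4, 5, 6, 7, 4, 5}) gives 35.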

C++: What are some general ways to make code more efficient for use with large numbers?

Please when answering this question try to be as general as possible to help the wider community, rather than just specifically helping my issue (although helping my issue would be great too ;) )
I seem to be encountering this problem time and time again with the simple problems on Project Euler. Most commonly are the problems that require a computation of the prime numbers - these without fail always fail to terminate for numbers greater than about 60,000.
My most recent issue is with Problem 12:
The sequence of triangle numbers is generated by adding the natural numbers. So the 7th triangle number would be 1 + 2 + 3 + 4 + 5 + 6 + 7 = 28. The first ten terms would be:
1, 3, 6, 10, 15, 21, 28, 36, 45, 55, ...
Let us list the factors of the first seven triangle numbers:
1: 1
3: 1,3
6: 1,2,3,6
10: 1,2,5,10
15: 1,3,5,15
21: 1,3,7,21
28: 1,2,4,7,14,28
We can see that 28 is the first triangle number to have over five divisors.
What is the value of the first triangle number to have over five hundred divisors?
Here is my code:
#include <iostream>
#include <vector>
#include <cmath>
using namespace std;
int main() {
int numberOfDivisors = 500;
//I begin by looping from 1, with 1 being the 1st triangular number, 2 being the second, and so on.
for (long long int i = 1;; i++) {
long long int triangularNumber = (pow(i, 2) + i)/2;
//Once I have the i-th triangular, I loop from 1 to itself, and add 1 to count each time I encounter a divisor, giving the total number of divisors for each triangular.
int count = 0;
for (long long int j = 1; j <= triangularNumber; j++) {
if (triangularNumber%j == 0) {
count++;
}
}
//If the number of divisors is 500, print out the triangular and break the code.
if (count == numberOfDivisors) {
cout << triangularNumber << endl;
break;
}
}
}
This code gives the correct answers for smaller numbers, and then either fails to terminate or takes an age to do so!
So firstly, what can I do with this specific problem to make my code more efficient?
Secondly, what are some general tips both for myself and other new C++ users for making code more efficient? (I.e. applying what we learn here in the future.)
Thanks!
The key problem is that your end condition is bad. You are supposed to stop when count > 500, but you look for an exact match of count == 500, therefore you are likely to blow past the correct answer without detecting it, and keep going ... maybe forever.
If you fix that, you can post it to code review. They might say something like this:
Break it down into separate functions for finding the next triangle number, and counting the factors of some number.
When you find the next triangle number, you execute pow. I perform a single addition.
For counting the number of factors in a number, a google search might help. (e.g. http://www.cut-the-knot.org/blue/NumberOfFactors.shtml ) You can build a list of prime numbers as you go, and use that to quickly find a prime factorization, from which you can compute the number of factors without actually counting them. When the numbers get big, that loop gets big.
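A rough sketch of the first two suggestions plus the corrected end condition, before going as far as prime factorization: the divisor loop only runs up to the square root, and the next triangle number is obtained by a single addition. The function names are my own, not from the question.

```cpp
#include <cstdint>

// Count the divisors of n by pairing each divisor d <= sqrt(n) with n / d.
std::uint64_t countDivisors(std::uint64_t n) {
    std::uint64_t count = 0;
    for (std::uint64_t d = 1; d * d <= n; ++d) {
        if (n % d == 0)
            count += (d * d == n) ? 1 : 2;   // a square root pairs with itself
    }
    return count;
}

// First triangle number with more than minDivisors divisors.
std::uint64_t firstTriangleWithOver(std::uint64_t minDivisors) {
    std::uint64_t triangle = 0;
    for (std::uint64_t i = 1;; ++i) {
        triangle += i;                               // one addition, no pow
        if (countDivisors(triangle) > minDivisors)   // "over", not "exactly"
            return triangle;
    }
}
```

firstTriangleWithOver(5) returns 28, matching the example in the problem statement.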
Tldr: 76576500.
About your Euler problem, some math:
Preliminary 1:
Let's call the n-th triangle number T(n).
T(n) = 1 + 2 + 3 + ... + n = (n^2 + n)/2 (sometimes attributed to Gauss, sometimes someone else). It's not hard to figure it out:
1+2+3+4+5+6+7+8+9+10 =
(1+10) + (2+9) + (3+8) + (4+7) + (5+6) =
11 + 11 + 11 + 11 + 11 =
55 =
110 / 2 =
(10*10 + 10)/2
Because of its definition, it's trivial that T(n) + n + 1 = T(n+1), and that with a<b, T(a)<T(b) is true too.
Preliminary 2:
Let's call the divisor count D. D(1)=1, D(4)=3 (because 1 2 4).
For an n with c non-repeating prime factors (not just any divisors, but prime factors, e.g. n = 42 = 2 * 3 * 7 has c = 3), D(n) is 2^c: for each factor there are two possibilities (use it or not). The 8 possible divisors for the example are: 1, 2, 3, 7, 6 (2*3), 14 (2*7), 21 (3*7), 42 (2*3*7).
More generally, with repeated factors, D(n) is the product of (power + 1) over all prime factors. Example: 126 = 2^1 * 3^2 * 7^1. Because it has two 3s, the question is not "use 3 or not" but "use it once, twice, or not at all" (if once, it makes no difference whether it is the "first" or the "second" 3). With the powers 1 2 1, D(126) is 2*3*2 = 12.
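As a small illustration of the (power + 1) rule, here is a sketch that factorizes n by plain trial division and multiplies the exponents plus one together. It assumes n >= 1, and the name divisorCount is made up for this example.

```cpp
#include <cstdint>

// D(n) as the product of (exponent + 1) over the prime factorization of n.
// Assumes n >= 1.
std::uint64_t divisorCount(std::uint64_t n) {
    std::uint64_t result = 1;
    for (std::uint64_t p = 2; p * p <= n; ++p) {
        std::uint64_t exponent = 0;
        while (n % p == 0) {      // strip every copy of p out of n
            n /= p;
            ++exponent;
        }
        result *= exponent + 1;   // exponent 0 multiplies by 1, i.e. no effect
    }
    if (n > 1) result *= 2;       // one prime factor left over, exponent 1
    return result;
}
```

This reproduces the values used in the text: divisorCount(4) is 3, divisorCount(42) is 8 and divisorCount(126) is 12.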
Preliminary 3:
A number n and n+1 can't have any common prime factor x other than 1 (technically, 1 isn't a prime, but whatever). Because if both n/x and (n+1)/x are natural numbers, (n+1)/x - n/x has to be too, but that is 1/x.
Back to Gauss: If we know the prime factors for a certain n and n+1 (needed to calculate D(n) and D(n+1)), calculating D(T(n)) is easy. T(n) = (n^2 + n) / 2 = n * (n+1) / 2. As n and n+1 don't have common prime factors, just throwing together all factors and removing one 2 because of the "/2" is enough. Example: n is 7, factors 7 = 7^1, and n+1 = 8 = 2^3. Together it's 2^3 * 7^1, removing one 2 is 2^2 * 7^1. Powers are 2 1, D(T(7)) = 3*2 = 6. To check, T(7) = 28 = 2^2 * 7^1, the 6 possible divisors are 1 2 4 7 14 28.
What the program could do now: Loop through all n from 1 to something, always factorize n and n+1, use this to get the divisor count of the n-th triangle number, and check if it is >500.
There's just the tiny problem that there are no efficient algorithms for prime factorization. But for somewhat small numbers, today's computers are still fast enough, and keeping all factorizations found from 1 to n also helps in finding the next one (for n+1). A second potential problem is numbers too large for long long, but again, this is no problem here (as can be found out by trying).
With the described process and the program below, I got
the 12375th triangle number is 76576500 and has 576 divisors
#include <iostream>
#include <vector>
#include <cstdint>
using namespace std;
const int limit = 500;
vector<uint64_t> knownPrimes; //2 3 5 7...
//eg. [14] is 1 0 0 1 ... because 14 = 2^1 * 3^0 * 5^0 * 7^1
vector<vector<uint32_t>> knownFactorizations;
void init()
{
knownPrimes.push_back(2);
knownFactorizations.push_back(vector<uint32_t>(1, 0)); //factors for 0 (dummy)
knownFactorizations.push_back(vector<uint32_t>(1, 0)); //factors for 1 (dummy)
knownFactorizations.push_back(vector<uint32_t>(1, 1)); //factors for 2
}
void addAnotherFactorization()
{
uint64_t number = knownFactorizations.size();
size_t len = knownPrimes.size();
for(size_t i = 0; i < len; i++)
{
if(!(number % knownPrimes[i]))
{
//dividing with a prime gets a already factorized number
knownFactorizations.push_back(knownFactorizations[number / knownPrimes[i]]);
knownFactorizations[number][i]++;
return;
}
}
//if this failed, number is a newly found prime
//because a) it has no known prime factors, so it must have others
//and b) if it is not a prime itself, then it's factors should've been
//found already (because they are smaller than the number itself)
knownPrimes.push_back(number);
len = knownFactorizations.size();
for(size_t s = 0; s < len; s++)
{
knownFactorizations[s].push_back(0);
}
knownFactorizations.push_back(knownFactorizations[0]);
knownFactorizations[number][knownPrimes.size() - 1]++;
}
uint64_t calculateDivisorCountOfN(uint64_t number)
{
//factors for number must be known
uint64_t res = 1;
size_t len = knownFactorizations[number].size();
for(size_t s = 0; s < len; s++)
{
if(knownFactorizations[number][s])
{
res *= (knownFactorizations[number][s] + 1);
}
}
return res;
}
uint64_t calculateDivisorCountOfTN(uint64_t number)
{
//factors for number and number+1 must be known
uint64_t res = 1;
size_t len = knownFactorizations[number].size();
vector<uint32_t> tmp(len, 0);
size_t s;
for(s = 0; s < len; s++)
{
tmp[s] = knownFactorizations[number][s]
+ knownFactorizations[number+1][s];
}
//remove /2
tmp[0]--;
for(s = 0; s < len; s++)
{
if(tmp[s])
{
res *= (tmp[s] + 1);
}
}
return res;
}
int main()
{
init();
uint64_t number = knownFactorizations.size() - 2;
uint64_t DTn = 0;
while(DTn <= limit)
{
number++;
addAnotherFactorization();
DTn = calculateDivisorCountOfTN(number);
}
uint64_t tn;
if(number % 2) tn = ((number+1)/2)*number;
else tn = (number/2)*(number+1);
cout << "the " << number << "th triangle number is "
<< tn << " and has " << DTn << " divisors" << endl;
return 0;
}
About your general question about speed:
1) Algorithms.
How to know them? For (relatively) simple problems, either reading a book/Wikipedia/etc. or figuring it out if you can. For harder stuff, learning more basic things and gaining experience is necessary before it's even possible to understand them, eg. studying CS and/or maths ... number theory helps a lot for your Euler problem. (It will help less to understand how a MP3 file is compressed ... there are many areas, it's not possible to know everything.).
2a) Automated compiler optimizations of frequently used code parts / patterns
2b) Manually timing which program parts are the slowest, and (when not replacing them with another algorithm) changing them in a way that e.g. requires less data to be sent to slow devices (HDD, network...), fewer RAM accesses, fewer CPU cycles, works better together with the OS scheduler and memory management strategies, uses the CPU pipeline/caches better, etc. ... this is both education and experience (and a big topic).
And because long variables have a limited size, sometimes it is necessary to use custom types that use eg. a byte array to store a single digit in each byte. That way, it's possible to use the whole RAM for a single number if you want to, but the downside is you/someone has to reimplement stuff like addition and so on for this kind of number storage. (Of course, libs for that exist already, without writing everything from scratch).
Btw., pow is a floating point function and may get you inaccurate results. It's not appropriate to use it in this case.
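For the triangle number itself, an integer-only version is easy. This is just a sketch (the name triangle is mine): one of n and n + 1 is always even, so halving that one first keeps the result exact and postpones overflow.

```cpp
#include <cstdint>

// n-th triangle number using integer arithmetic only.
// Halve whichever of n and n + 1 is even, then multiply.
std::uint64_t triangle(std::uint64_t n) {
    return (n % 2 == 0) ? (n / 2) * (n + 1)
                        : ((n + 1) / 2) * n;
}
```

triangle(7) is 28 and triangle(12375) is 76576500, the value reported above.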

C++ Optimizing this Algorithm

After watching some Terence Tao videos, I wanted to try implementing algorithms into c++ code to find all the prime numbers up to a number n. In my first version, where I simply had every integer from 2 to n tested to see if they were divisible by anything from 2 to sqrt(n), I got the program to find the primes between 1-10,000,000 in ~52 seconds.
Attempting to optimize the program, and implementing what I now know to be the Sieve of Eratosthenes, I assumed the task would be done much faster than 51 seconds, but sadly, that wasn't the case. Even going up to 1,000,000 took a considerable amount of time (didn't time it, though)
#include <iostream>
#include <vector>
using namespace std;
void main()
{
vector<int> tosieve = {};
for (int i = 2; i < 1000001; i++)
{
tosieve.push_back(i);
}
for (int j = 0; j < tosieve.size(); j++)
{
for (int k = j + 1; k < tosieve.size(); k++)
{
if (tosieve[k] % tosieve[j] == 0)
{
tosieve.erase(tosieve.begin() + k);
}
}
}
//for (int f = 0; f < tosieve.size(); f++)
//{
// cout << (tosieve[f]) << endl;
//}
cout << (tosieve.size()) << endl;
system("pause");
}
Is it the repeated referencing of the vectors or something? Why is this so slow? Even if I'm completely overlooking something (could be, complete beginner at this :I) I would think that finding the primes between 2 and 1,000,000 with this horrible inefficient method would be faster than my original way of finding them from 2 to 10,000,000.
Hope someone has a clear answer to this - hopefully I can use whatever knowledge is gleaned in the future when optimizing programs using a lot of recursion.
The problem is that 'erase' moves every element in the vector down one, meaning it is an O(n) operation.
There are three alternative choices:
1) Just mark deleted elements as 'empty' (make them 0, for example). This will mean future passes have to pass over those empty positions, but that isn't that expensive.
2) Make a new vector, and push_back new values into there.
3) Use std::remove_if: This will move the elements down, but do it in a single pass so will be more efficient. If you use std::remove_if, then you will have to remember it doesn't resize the vector itself.
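A minimal sketch of option 3, usually called the erase-remove idiom. The helper name dropMultiples is invented for this example; it removes all multiples of p except p itself in one pass.

```cpp
#include <algorithm>
#include <vector>

// Remove every multiple of p (other than p itself) in a single pass.
// std::remove_if only shifts the kept elements forward and returns the new
// logical end; the erase call is what actually shrinks the vector.
void dropMultiples(std::vector<int>& values, int p) {
    auto newEnd = std::remove_if(values.begin(), values.end(),
                                 [p](int v) { return v != p && v % p == 0; });
    values.erase(newEnd, values.end());
}
```

Each call is O(n) in total, instead of O(n) per erased element as with repeated erase() calls.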
Most vector operations that insert or remove elements anywhere but at the end, including erase(), have O(n) linear time complexity.
Since you have two nested loops of size 10^6 over a vector of size 10^6, your algorithm executes up to 10^18 operations.
Cubic algorithms for such a big N will take a huge amount of time.
N = 10^6 is large enough to be a problem even for quadratic algorithms.
Please read carefully about the Sieve of Eratosthenes. The fact that the full search and your Sieve of Eratosthenes took about the same time means that you have implemented the second one wrong.
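For reference, a textbook version of the sieve might look like the sketch below (the function name primesUpTo is my own). The key differences from the code in the question are that nothing is ever erased and no division or modulo is used: multiples are crossed off by stepping through a flag array.

```cpp
#include <cstddef>
#include <vector>

// Returns all primes <= limit with a plain Sieve of Eratosthenes.
std::vector<int> primesUpTo(int limit) {
    std::vector<int> primes;
    if (limit < 2) return primes;
    std::vector<bool> composite(static_cast<std::size_t>(limit) + 1, false);
    for (long long p = 2; p <= limit; ++p) {
        if (composite[static_cast<std::size_t>(p)]) continue;
        primes.push_back(static_cast<int>(p));
        // Cross off multiples starting at p * p; smaller multiples were
        // already crossed off by smaller primes.
        for (long long m = p * p; m <= limit; m += p)
            composite[static_cast<std::size_t>(m)] = true;
    }
    return primes;
}
```

primesUpTo(1000000).size() is 78498, the number of primes below one million.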
I see two performance issues here:
First of all, push_back() will have to reallocate the dynamic memory block once in a while. Use reserve():
vector<int> tosieve = {};
tosieve.reserve(1000001);
for (int i = 2; i < 1000001; i++)
{
tosieve.push_back(i);
}
Second, erase() has to move all elements behind the one you try to remove. Set the elements to 0 instead and do a run over the vector at the end (untested code):
for (auto& x : tosieve) {
    for (auto y = tosieve.begin(); *y < x; ++y) // this check works only in
                                                // the case of an ordered vector
        if (*y != 0 && x % *y == 0) x = 0; // y is an iterator, so dereference it
}
{ // this block will make sure that sieved is released afterwards
    auto sieved = vector<int>{};
    for (auto x : tosieve)
        if (x != 0) sieved.push_back(x); // keep only the elements not marked as 0
    swap(tosieve, sieved);
} // the large memory block is released now, just keep the sieved elements.
Consider using standard algorithms instead of hand-written loops. They help you state your intent. In this case I see std::transform() for the outer loop of the sieve, std::any_of() for the inner loop, std::generate_n() for filling tosieve at the beginning and std::copy_if() for filling sieved (untested code):
vector<int> tosieve = {};
tosieve.reserve(999999);
generate_n(back_inserter(tosieve), 999999, []() -> int { // the values 2..1000000
    static int i = 2; return i++;
});
transform(begin(tosieve), end(tosieve), begin(tosieve), [&tosieve](int i) -> int {
    return any_of(begin(tosieve), begin(tosieve) + (i - 2),
        [i](int j) -> bool {
            return j != 0 && i % j == 0;
        }) ? 0 : i;
});
tosieve = [&tosieve]() -> vector<int> { // call the lambda and assign its result
    auto sieved = vector<int>{};
    copy_if(begin(tosieve), end(tosieve), back_inserter(sieved),
        [](int i) -> bool { return i != 0; });
    return sieved;
}();
EDIT:
Yet another way to get that done:
vector<int> tosieve = {};
tosieve.reserve(999999);
generate_n(back_inserter(tosieve), 999999, []() -> int { // the values 2..1000000
    static int i = 2; return i++;
});
tosieve = [&tosieve]() -> vector<int> { // call the lambda and assign its result
    auto sieved = vector<int>{};
    copy_if(begin(tosieve), end(tosieve), back_inserter(sieved),
        [&tosieve](int i) -> bool {
            return !any_of(begin(tosieve), begin(tosieve) + (i - 2),
                [i](int j) -> bool {
                    return i % j == 0;
                });
        });
    return sieved;
}();
Now, instead of first marking the elements we don't want and copying afterwards, we directly copy only the elements we want to keep. This is not only faster than the above suggestion, but also better states the intent.
Very interesting task you have. Thanks!
With pleasure I implemented from scratch my own versions of solving it.
I created 3 separate (independent) functions, all based on Sieve of Eratosthenes. These 3 versions are different in their complexity and speed.
Just a quick note, my simplest (slowest) version finds all primes below your desired limit of 10'000'000 within just 0.025 sec (i.e. 25 milli-seconds).
I also tested all 3 versions to find primes below 2^32 (4'294'967'296), which is solved by "simple" version within 47 seconds, by "intermediate" version within 30 seconds, by "advanced" within 12 seconds. So within just 12 seconds it finds all primes below 4 Billion (there are 203'280'221 such primes below 2^32, see OEIS sequence)!!!
For simplicity I will describe in details only Simple version out of 3. Here's code:
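#include <cstddef>
#include <cstdint>
#include <vector>
using u8 = std::uint8_t; // byte alias used below; presumably defined in the full source
template <typename T>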
std::vector<T> GenPrimes_SieveOfEratosthenes(size_t end) {
// https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes
if (end <= 2)
return {};
size_t const cnt = end >> 1;
std::vector<u8> composites((cnt + 7) / 8);
auto Get = [&](size_t i){ return bool((composites[i / 8] >> (i % 8)) & 1); };
auto Set = [&](size_t i){ composites[i / 8] |= u8(1) << (i % 8); };
std::vector<T> primes = {2};
size_t i = 0;
for (i = 1; i < cnt; ++i) {
if (Get(i))
continue;
size_t const p = 2 * i + 1, start = (p * p) >> 1;
primes.push_back(p);
if (start >= cnt)
break;
for (size_t j = start; j < cnt; j += p)
Set(j);
}
for (i = i + 1; i < cnt; ++i)
if (!Get(i))
primes.push_back(2 * i + 1);
return primes;
}
This code implements the simplest but still fast algorithm for finding primes, called the Sieve of Eratosthenes. As a small optimization of speed and memory, I search only over odd numbers. This odd-numbers optimization lets me store half as much memory and do half as many steps, hence improves both speed and memory consumption by about a factor of 2.
The algorithm is simple: we allocate an array of bits; this array at position K has bit 1 if K is composite, or 0 if K is probably prime. At the end, all 0 bits in the array signify definite primes. Also, due to the odd-numbers optimization, this bit array stores only odd numbers, so the K-th bit actually stands for the number 2 * K + 1.
Then we go over this array of bits left to right, and if we meet a 0 bit at position K, it means we have found a prime number P = 2 * K + 1. Starting from position (P * P) / 2, we then set every P-th bit to 1. This marks the odd multiples of P from P * P upward as composite, because they are divisible by P.
We do this procedure only until P * P becomes greater than or equal to our limit End (we're finding all primes < End). This limit guarantees that after reaching it ALL zero bits inside the array signify prime numbers.
The second version of the code adds only one optimization to this Simple version: it makes everything multi-core (multi-threaded). But this one optimization makes the code much bigger and more complex. Basically it slices the whole range of bits across all cores, so that they write bits to memory in parallel.
I'll explain only my third Advanced version, it is most complex of 3 versions. It does not only multi-threaded optimization, but also so-called Primorial optimization.
What is a primorial? It is the product of the first few primes; for example I take the primorial 2 * 3 * 5 * 7 = 210.
We can see that any primorial splits the infinite range of integers into wheels by modulus of this primorial. For example primorial 210 splits it into the ranges [0; 210), [210; 2*210), [2*210; 3*210), etc.
Now it is easy to prove mathematically that inside all ranges of a primorial we can mark the same positions as composite; namely, we can mark all numbers that are multiples of 2 or 3 or 5 or 7 as composite.
We can see that out of 210 remainders there are 162 remainders that are for sure composite, and only 48 remainders are probably prime.
Hence it is enough for us to check primality of only 48/210 = 22.9% of the whole search space. This reduction of the search space makes the task more than 4x faster and more than 4x less memory-consuming.
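The count of 48 can be checked with a few lines. The sketch below is not part of my sieve code, just an illustration with made-up names: it counts the remainders modulo a primorial that share no factor with it, which are the only positions a wheel sieve still has to examine.

```cpp
// Greatest common divisor by Euclid's algorithm.
unsigned gcdOf(unsigned a, unsigned b) {
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Number of remainders r in [0, primorial) that are coprime to primorial.
unsigned wheelCandidates(unsigned primorial) {
    unsigned count = 0;
    for (unsigned r = 0; r < primorial; ++r)
        if (gcdOf(r, primorial) == 1) ++count;
    return count;
}
```

wheelCandidates(210) gives 48, and wheelCandidates(2) gives 1, which corresponds to the odd-only case.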
One can see that my first Simple version in fact due to odd-only optimization was actually using Primorial equal to 2 optimization. Yes, if we take primorial 2 instead of primorial 210, then we gain exactly first version (Simple) algorithm.
All 3 of my versions are tested for correctness and speed, although some tiny bugs may still remain. Note: it is recommended not to use my code straight away in production unless it has been tested thoroughly.
All 3 versions are tested for correctness by re-using each other answers. I thoroughly test correctness by feeding all limits (end value) from 0 to 2^18. It takes some time to do this.
See main() function to figure out how to use my functions.
Try it online!
SOURCE CODE GOES HERE. Due to StackOverflow limit of 30K symbols per post, I can't inline source code here, as it is almost 30K in size and together with English post above it takes more than 30K. So I'm providing source code on separate Github Gist server, link below. Note that Try it online! link above also contains full source code, but I reduced search limit of 2^32 to smaller one due to GodBolt limit of running time to 3 seconds.
Github Gist code
Output:
10M time 'Simple' 0.024 sec
Time 2^32 'Simple' 46.924 sec, number of primes 203280221
Time 2^32 'Intermediate' 30.999 sec
Time 2^32 'Advanced' 11.359 sec
All checked till 0
All checked till 5000
All checked till 10000
All checked till 15000
All checked till 20000
All checked till 25000

OpenCL crash on big 2d range

In my program, I need to run the kernel once on every item of a large 2D array. The program works correctly for small ranges - up to around 50x50, sometimes up to 100x100.
For bigger datasets however, calling the kernel causes the video card driver to crash.
I have tested this program on two computers with different AMD cards, and they exhibit the exact same behaviour. Other, one-dimensional kernels work properly, even for huge datasets of ~10 000 x 10 000 items.
Also, removing the i variable from the matrix[i + (N + 1) * j] expression causes the kernel to work without errors.
Am I setting the range incorrectly, making a mistake in the kernel, or does the problem lie elsewhere?
enqueued range:
cl::EnqueueArgs args(queue,cl::NDRange(offset, offset+1),cl::NDRange(N+1, N),cl::NullRange);
kernel:
void kernel sub(global float* matrix, global const float* vec, int N, int offset) {
int i = get_global_id(0);
int j = get_global_id(1);
matrix[i + (N + 1) * j] -= matrix[i + (N + 1) * offset] * vec[j];
}
One possible reason: if your kernel runs for too long, the driver may abort it. Dice up the problem area into smaller blocks.
Consider this, for a 100x100 input array you will use N=100, hence the maximum value of i in your kernel will be 100 because of the N+1 used in the enqueue args, while the maximum for j will be 99. I have assumed that offset = 0. Therefore i + (N + 1) * j = 100 + 101*99 = 10099 which is outside of your 2D array.
When offset = 1, the minimums for i and j will be 1 and 2 respectively, while the maximums will be 101 and 100. Therefore i + (N + 1) * j = 101 + 101*100 = 10201.
In my experience, GPUs are not very good at catching segmentation faults when accessing global memory. Your attempt at purposefully creating one may work on some cards sometimes but no guarantees.
The problem could be caused by local-work-size and global-work-size. It is important while using two dimensional arrays to properly calculate them. It could be that for big values your global_id(0) is bigger than you specified in clEnqueueNDRangeKernel().

Scale an array of values with lightning speed

I have an array of double with 12,000 entries. I need to scale each entry's value by a factor (e.g. 0.3345 or 6.78, whatever).
What I did was to loop over each entry and perform the multiplication. As I am working on a PPC-based 100MHz embedded system, the large number of multiplications is slowing it down tremendously.
Is there a way to do this faster? An analogy would be initializing a block of memory -- one would use memset, which is very fast. I wonder if there is an equivalent method.
I'd like to answer with a question: Do you really need to actually multiply each value?
Personally I would consider using a better data structure which hides the actual content of the array in a private variable and provides a scale-function which just updates a scale-field. The public access methods of the data structure can then simply scale the values according to the scale-field on a per-need basis.
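A minimal sketch of what I mean, with invented names (ScaledArray, scale, get); adapt it to whatever interface your code needs. Scaling becomes a single multiplication of the stored factor, and the per-element multiplication is paid only for values that are actually read.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical wrapper: stores the raw values plus one pending scale factor.
class ScaledArray {
public:
    explicit ScaledArray(std::vector<double> raw) : raw_(std::move(raw)) {}

    // O(1): only the factor changes, the stored entries are left untouched.
    void scale(double factor) { scale_ *= factor; }

    // The multiplication happens here, once per value that is read.
    double get(std::size_t i) const { return raw_[i] * scale_; }

    std::size_t size() const { return raw_.size(); }

private:
    std::vector<double> raw_;
    double scale_ = 1.0;
};
```

Whether this pays off depends on the access pattern: if every element is read many times after each scaling, multiplying once up front is cheaper.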
There is a reason why memset can be very fast: there is no dependency on the previous value of the memory. This is not your case.
There are a few solutions for your problem. The first is to change the algorithm so that you can avoid the multiplication in the first place. This is what I would be shooting for. An example is wrapping the array in a structure that multiplies an element only when it is accessed.
If the multiplication in the data can not be avoided your best bet is to parallelize the multiplication, dividing the array in n parts (where n is equal to the amount of processors), where each part gets assigned to a thread for the multiplication. This is an example:
void multiply_block(double *array, const double val, const size_t len) {
if (len == 0) return; /* the switch below would otherwise run one full pass of 8 */
int n = (len + 7) / 8;
/* duff's device */
switch (len % 8) {
case 0: do { *array++ *= val;
case 7: *array++ *= val;
case 6: *array++ *= val;
case 5: *array++ *= val;
case 4: *array++ *= val;
case 3: *array++ *= val;
case 2: *array++ *= val;
case 1: *array++ *= val;
} while(--n > 0);
}
}
void multiply_block_parallel(double *array, const double val, const size_t len) {
const int threads = get_num_processors();
int i = 0;
/* start all but the last thread */
while (i < (threads - 1)) {
start_thread(multiply_block,
array + i * (len / threads), val, len / threads);
i++;
}
/* start last thread with remaining data */
start_thread(multiply_block,
array + i * (len / threads), val, len - i * (len / threads));
}
In this example get_num_processors returns the amount of processors, and start_thread(func, args...) is a function that starts a new thread executing func with the arguments given. You should obviously replace those functions with real-life equivalents.
First of all, I would suggest you consider switching to fixed point if you can; it would greatly improve performance by simplifying the task to integer multiplication.
In this case you could pre-calculate a "multiplication table". Thus, say you want to multiply a lot of numbers x < 256 by 3, you would generate:
1 * 3 = 3
2 * 3 = 6
4 * 3 = 12
8 * 3 = 24
16 * 3 = 48
...
128 * 3 = 384
It's even very fast to build, as each entry is just the previous one shifted left by one. Then, for each element you have to multiply, you take its lowest bit, add the corresponding table entry to the result if that bit is set, and shift the element right by one. This way you reduce a multiplication to at most 8 additions.
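To make the fixed-point suggestion concrete, here is a sketch assuming the samples can be kept as non-negative 32-bit integers and the factor is stored in Q16.16 format (16 fractional bits). The names toQ16 and scaleQ16 are made up, and this uses one integer multiply per element rather than the table approach above.

```cpp
#include <cstddef>
#include <cstdint>

// Convert a scale factor to Q16.16 fixed point (16 fractional bits).
// Assumes 0 <= factor < 32768 so that the result fits in 32 bits.
std::int32_t toQ16(double factor) {
    return static_cast<std::int32_t>(factor * 65536.0 + 0.5);
}

// Scale integer samples in place: one integer multiply and one shift per
// element, no floating point inside the loop.
// Assumes non-negative values; the result is truncated toward zero.
void scaleQ16(std::int32_t* values, std::size_t count, std::int32_t factorQ16) {
    for (std::size_t i = 0; i < count; ++i) {
        const std::int64_t product =
            static_cast<std::int64_t>(values[i]) * factorQ16;
        values[i] = static_cast<std::int32_t>(product >> 16);
    }
}
```

The factor is converted once outside the loop, e.g. scaleQ16(data, 12000, toQ16(0.3345)). Whether this beats the floating-point loop depends on the FPU and the integer multiplier of the particular PPC core, so measure it.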