How do I speed up this recursive function? When it reaches a 10x10 matrix, it takes a minute or so just to solve a problem. I included the event function as well so you can see where the calculation takes place.
void determinantsFrame::OnCalculateClick(wxCommandEvent &event)
{
double elem[MAX][MAX]; double det; string test; bool doIt = true;
for (int i = 0; i < n; i++)
{
for (int j = 0; j < n; j++)
{
test = (numbers[i][j]->GetValue()).mb_str();
if (test == "")
{
doIt = false;
break;
}
for (int k = 0; k < test.length(); k++)
if (isalpha(test[k]) || test[k] == ' ')
{
doIt = false;
break;
}
else if (ispunct(test[k]))
{
if (test[k] == '.' && test.length() == 1)
doIt = false;
else if (test[k] == '.' && test.length() != 1)
doIt = true;
else if (test[k] != '.')
doIt = false;
}
if (doIt == false)
break;
}
if (doIt == false)
break;
}
if (doIt)
{
for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++)
elem[i][j] = static_cast<double>(wxAtof(numbers[i][j]->GetValue()));
det = determinant(elem, n);
wxMessageBox(wxString::Format(wxT("The determinant is: %.4lf"),det));
}
else
wxMessageBox(wxT("You may have entered an invalid character. Please try again"));
}
double determinantsFrame::determinant(double matrix[MAX][MAX], int order) // Here's the recursive algorithm
{
double det = 0; double temp[MAX][MAX]; int row, col;
if (order == 1)
return matrix[0][0];
else if (order == 2)
return ((matrix[0][0] * matrix[1][1]) - (matrix[0][1] * matrix[1][0]));
else
{
for (int r = 0; r < order; r++)
{
col = 0; row = 0;
for (int i = 1; i < order; i++)
{
for (int j = 0; j < order; j++)
{
if (j == r)
continue;
temp[row][col] = matrix[i][j];
col++;
if (col == order - 1)
col = 0;
}
row++;
}
det = det + (matrix[0][r] * pow(-1, r) * determinant(temp, order - 1));
}
return det;
}
}
You can do a bit better while keeping the same algorithm, but it is at least O(n!) (probably worse), so higher-order matrices will be slow no matter how much you optimize it. Note that I ran the benchmarks in MSVC 2010 and they are only for rough comparison purposes. Each change is cumulative as you go down the list and is compared to the original algorithm.
Skip Col Check -- As Surt suggested, removing this gets us a speed increase of 1%.
Add 3x3 Case -- Adding another explicit check for a 3x3 matrix gets us the most, 55%
Change pow() -- Changing the pow() call to (r % 2 ? -1.0 : 1.0) gets us a little bit more, 57%
Change to switch -- Changing the order check to a switch gets us a little bit more, 58%
Add 4x4 Case -- Adding another explicit check for a 4x4 matrix gets more, 85%
Things that don't work include:
memcpy -- As Surt suggested, this actually loses a good deal of speed, -100%
Threads -- Creating order threads doesn't work well at all, -160%
I was hoping that using threads could get us a significant performance increase but even with all the optimization it is slower than the original. I think the copying of all the memory is making it not very parallel.
Adding the 3x3 and 4x4 cases has the most effect and is the primary reason for the more than 6x increase in speed. In theory you could add more explicit cases (probably by creating a program to output the required code) to reduce the run time even further. Of course, at some point this kind of defeats the purpose of using a recursive algorithm to begin with.
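For reference, the hot path after the changes above looks roughly like this (a sketch, not the exact code I benchmarked; it assumes the same MAX constant and also folds in Surt's column-reset simplification):
double determinant(double matrix[MAX][MAX], int order)
{
    switch (order)
    {
    case 1:
        return matrix[0][0];
    case 2:
        return matrix[0][0] * matrix[1][1] - matrix[0][1] * matrix[1][0];
    case 3: // explicit 3x3 expansion saves one level of recursion and all the copying
        return matrix[0][0] * (matrix[1][1] * matrix[2][2] - matrix[1][2] * matrix[2][1])
             - matrix[0][1] * (matrix[1][0] * matrix[2][2] - matrix[1][2] * matrix[2][0])
             + matrix[0][2] * (matrix[1][0] * matrix[2][1] - matrix[1][1] * matrix[2][0]);
    default:
    {
        double det = 0;
        double temp[MAX][MAX];
        for (int r = 0; r < order; r++)
        {
            int row = 0;
            for (int i = 1; i < order; i++)
            {
                int col = 0;
                for (int j = 0; j < order; j++)
                {
                    if (j == r) continue;
                    temp[row][col++] = matrix[i][j];
                }
                row++;
            }
            // (r % 2 ? -1.0 : 1.0) replaces the pow(-1, r) call
            det += matrix[0][r] * (r % 2 ? -1.0 : 1.0) * determinant(temp, order - 1);
        }
        return det;
    }
    }
}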
To get more performance you would probably have to consider a different algorithm. In theory you can change the recursive function into an iterative one by managing your own stack, but it is considerable work and you aren't guaranteed a performance increase anyway.
It could be a branch mispredict problem (see also). The test
if (col == order - 1)
col = 0;
is not needed as far as I can see.
The test is true only about 1/order of the time per loop iteration and its cost dominates for small order, which is why larger N aren't as affected. The run time is still a huge O(N!^3) (afaik), so don't expect miracles.
col = 0; row = 0;
for (int i = 1; i < order; i++) {
for (int j = 0; j < order; j++) {
if (j == r)
continue;
temp[row][col] = matrix[i][j];
col++;
//if (col == order - 1)
// col = 0;
}
col = 0; // no need to test
row++;
}
The algorithm will get a further slowdown when it hits the L2 cache, at the latest at N=64.
Also, the matrix copy might be inefficient; this could be far more efficient for large order, at the cost of being less efficient at low order.
for (int r = 0; r < order; r++) {
row = 0;
for (int i = 1; i < order; i++) {
memcpy(temp[row], matrix[i], r*sizeof(double)); // if r==0 will this work?
memcpy(&temp[row][r], &matrix[i][r+1], (order-r-1)*sizeof(double));
// amount of copied elements r+(order-r-1)=order-1.
row++;
}
Run a test against the original code (computing the same determinant) to check that I got the indexes right!
I was recently doing this problem, taken directly and translated from day 1, task 3 of IOI 2010, "Quality of life", and I encountered a weird phenomenon.
I was setting up a 0-1 matrix and using it to calculate a prefix-sum matrix in one loop nest:
for (int i = 1; i <= m; i++)
{
for (int j = 1; j <= n; j++)
{
if (a[i][j] < x) {lower[i][j] = 0;} else {lower[i][j] = 1;}
b[i][j] = b[i-1][j] + b[i][j-1] - b[i-1][j-1] + lower[i][j];
}
}
and I got TLE (time limit exceeded) on 4 tests (the time limit is 2.0s), while using 2 separate loop nests:
for (int i = 1; i <= m; i++)
{
for (int j = 1; j <= n; j++)
{
if (a[i][j] < x) {lower[i][j] = 0;} else {lower[i][j] = 1;}
}
}
for (int i = 1; i <= m; i++)
{
for (int j = 1; j <= n; j++)
{
b[i][j] = b[i-1][j] + b[i][j-1] - b[i-1][j-1] + lower[i][j];
}
}
got me full AC (accepted).
As we can see from the 4 pictures here:
TLE result, picture 1 : https://i.stack.imgur.com/9o5C2.png
TLE result, picture 2 : https://i.stack.imgur.com/TJwX5.png
AC result, picture 1 : https://i.stack.imgur.com/1fo2H.png
AC result, picture 2 : https://i.stack.imgur.com/CSsZ2.png
the two-loop code generally ran a bit faster (even on accepted test cases), contradicting my intuition that the single loop should be quicker. Why did this happen?
Full code (AC) : https://pastebin.com/c7at11Ha (Please ignore all the nonsense bits and stuff like using namespace std;, as this is from a competitive programming contest).
Note : The judge server, lqdoj.edu.vn is built on dmoj.ca, a global competitive programming contest platform.
If you look at the assembly you'll see the source of the difference:
Single loop:
{
if (a[i][j] < x)
{
lower[i][j] = 0;
}
else
{
lower[i][j] = 1;
}
b[i][j] = b[i-1][j]
+ b[i][j-1]
- b[i-1][j-1]
+ lower[i][j];
}
In this case, there's a data dependency. The assignment to b depends on the value from the assignment to lower. So the operations go sequentially in the loop - first assignment to lower, then to b. The compiler can't optimize this code significantly because of the dependency.
Separation of assignments into 2 loops:
The assignment to lower is now independent, and the compiler can use SIMD instructions, which leads to a performance boost in the first loop. The second loop stays more or less similar to the original assembly.
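As a side note (my sketch, not from the original submission), the first loop can even be written branch-free, which makes the vectorization opportunity obvious:
for (int i = 1; i <= m; i++)
{
    for (int j = 1; j <= n; j++)
    {
        lower[i][j] = (a[i][j] >= x); // the comparison already yields 0 or 1; no dependency on b
    }
}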
A hexagonal grid is represented by a two-dimensional array with R rows and C columns. The first row always comes "before" the second in the hexagonal grid construction (see image below). Let k be the number of turns. Each turn, an element of the grid is 1 if and only if the number of neighbours of that element that were 1 the turn before is odd. Write C++ code that outputs the grid after k turns.
Limitations:
1 <= R <= 10, 1 <= C <= 10, 1 <= k <= 2^(63) - 1
An example with input (in the first row are R, C and k, then comes the starting grid):
4 4 3
0 0 0 0
0 0 0 0
0 0 1 0
0 0 0 0
Simulation: image, yellow elements represent '1' and blank elements represent '0'.
This problem is easy to solve if I simulate and produce a grid each turn, but with a big enough k it becomes too slow. What is a faster solution?
EDIT: code (n and m are used instead of R and C):
#include <cstdio>
#include <cstring>
using namespace std;
int old[11][11];
int _new[11][11];
int n, m;
long long int k;
int main() {
scanf ("%d %d %lld", &n, &m, &k);
for (int i = 0; i < n; i++) {
for (int j = 0; j < m; j++) scanf ("%d", &old[i][j]);
}
printf ("\n");
while (k) {
for (int i = 0; i < n; i++) {
for (int j = 0; j < m; j++) {
int count = 0;
if (i % 2 == 0) {
if (i) {
if (j) count += old[i-1][j-1];
count += old[i-1][j];
}
if (j) count += (old[i][j-1]);
if (j < m-1) count += (old[i][j+1]);
if (i < n-1) {
if (j) count += old[i+1][j-1];
count += old[i+1][j];
}
}
else {
if (i) {
if (j < m-1) count += old[i-1][j+1];
count += old[i-1][j];
}
if (j) count += old[i][j-1];
if (j < m-1) count += old[i][j+1];
if (i < n-1) {
if (j < m-1) count += old[i+1][j+1];
count += old[i+1][j];
}
}
if (count % 2) _new[i][j] = 1;
else _new[i][j] = 0;
}
}
for (int i = 0; i < n; i++) {
for (int j = 0; j < m; j++) old[i][j] = _new[i][j];
}
k--;
}
for (int i = 0; i < n; i++) {
for (int j = 0; j < m; j++) {
printf ("%d", old[i][j]);
}
printf ("\n");
}
return 0;
}
For a given R and C, you have N=R*C cells.
If you represent those cells as a vector of elements in GF(2), i.e, 0s and 1s where arithmetic is performed mod 2 (addition is XOR and multiplication is AND), then the transformation from one turn to the next can be represented by an N*N matrix M, so that:
turn[i+1] = M*turn[i]
You can exponentiate the matrix to determine how the cells transform over k turns:
turn[i+k] = (M^k)*turn[i]
Even if k is very large, like 2^63-1, you can calculate M^k quickly using exponentiation by squaring: https://en.wikipedia.org/wiki/Exponentiation_by_squaring This only takes O(log(k)) matrix multiplications.
Then you can multiply your initial state by the matrix to get the output state.
From the limits on R, C, k, and time given in your question, it's clear that this is the solution you're supposed to come up with.
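A rough sketch of how that could look (my illustration, not a reference solution; the Mat type and the function names are made up, and cells are packed into bitsets so a GF(2) row operation is a single XOR):
#include <bitset>
#include <cstdint>
#include <vector>

const int N = 100;                        // at most 10*10 cells
using Mat = std::vector<std::bitset<N>>;  // N rows, entries in GF(2)

// C = A*B over GF(2): wherever A[i][k] is 1, XOR row k of B into row i of C.
Mat mul(const Mat& A, const Mat& B) {
    Mat C(N);
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            if (A[i][k]) C[i] ^= B[k];
    return C;
}

// M^k by exponentiation by squaring: O(log k) matrix multiplications.
Mat mat_pow(Mat M, uint64_t k) {
    Mat R(N);
    for (int i = 0; i < N; i++) R[i][i] = 1;  // identity matrix
    while (k) {
        if (k & 1) R = mul(R, M);
        M = mul(M, M);
        k >>= 1;
    }
    return R;
}
M itself is filled by setting M[cell][neighbour] = 1 for every neighbour relation your counting loop uses, and the state after k turns is mat_pow(M, k) applied to the starting vector over GF(2).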
There are several ways to speed up your algorithm.
You do the neighbour calculation with the out-of-bounds checking in every turn. Do some preprocessing and calculate the neighbours of each cell once at the beginning. (Aziuth has already proposed that.)
Then you don't need to count the neighbours of all cells. Each cell is on if an odd number of neighbouring cells were on in the last turn and it is off otherwise.
You can think of this differently: Start with a clean board. For each active cell of the previous move, toggle the state of all surrounding cells. When an odd number of neighbours cause a toggle, the cell ends up on; with an even number, the toggles cancel each other out. Look at the first step of your example. It's like playing Lights Out, really.
This method is faster than counting the neighbours if the board has only few active cells. Its worst case is a board whose cells are all on, in which case it is as good as neighbour-counting, because you have to touch each neighbour of each cell.
The next logical step is to represent the board as a sequence of bits, because bits already have a natural way of toggling, the exclusive-or or xor operator, ^. If you keep the list of neighbours for each cell as a bit mask m, you can then toggle the board b via b ^= m.
These are the improvements that can be made to the algorithm. The big improvement is to notice that the patterns will eventually repeat. (The toggling bears a resemblance to Conway's Game of Life, where there are also repeating patterns.) Also, the given maximum number of possible iterations, 2⁶³, is suspiciously large.
The playing board is small. The example in your question will repeat at least after 2¹⁶ turns, because the 4×4 board can have at most 2¹⁶ layouts. In practice, turn 127 reaches the ring pattern of the first move after the original, and it loops with a period of 126 from then on.
The bigger boards may have up to 2¹⁰⁰ layouts, so they may not repeat within 2⁶³ turns. A 10×10 board with a single active cell near the middle has a period of 2,162,622. This may indeed be a topic for a maths study, as Aziuth suggests, but we'll tackle it with mundane means: keep a hash map of all previous states and the turns at which they occurred, then check in each turn whether the pattern has occurred before.
We now have:
a simple algorithm for toggling the cells' state and
a compact bitwise representation of the board, which allows us to create a hash map of the previous states.
Here's my attempt:
#include <iostream>
#include <map>
/*
* Bit representation of a playing board, at most 10 x 10
*/
struct Grid {
unsigned char data[16];
Grid() : data() {
}
void add(size_t i, size_t j) {
size_t k = 10 * i + j;
data[k / 8] |= 1u << (k % 8);
}
void flip(const Grid &mask) {
size_t n = 13;
while (n--) data[n] ^= mask.data[n];
}
bool ison(size_t i, size_t j) const {
size_t k = 10 * i + j;
return ((data[k / 8] & (1u << (k % 8))) != 0);
}
bool operator<(const Grid &other) const {
size_t n = 13;
while (n--) {
if (data[n] > other.data[n]) return true;
if (data[n] < other.data[n]) return false;
}
return false;
}
void dump(size_t n, size_t m) const {
for (size_t i = 0; i < n; i++) {
for (size_t j = 0; j < m; j++) {
std::cout << (ison(i, j) ? 1 : 0);
}
std::cout << '\n';
}
std::cout << '\n';
}
};
int main()
{
size_t n, m, k;
std::cin >> n >> m >> k;
Grid grid;
Grid mask[10][10];
for (size_t i = 0; i < n; i++) {
for (size_t j = 0; j < m; j++) {
int x;
std::cin >> x;
if (x) grid.add(i, j);
}
}
for (size_t i = 0; i < n; i++) {
for (size_t j = 0; j < m; j++) {
Grid &mm = mask[i][j];
if (i % 2 == 0) {
if (i) {
if (j) mm.add(i - 1, j - 1);
mm.add(i - 1, j);
}
if (j) mm.add(i, j - 1);
if (j < m - 1) mm.add(i, j + 1);
if (i < n - 1) {
if (j) mm.add(i + 1, j - 1);
mm.add(i + 1, j);
}
} else {
if (i) {
if (j < m - 1) mm.add(i - 1, j + 1);
mm.add(i - 1, j);
}
if (j) mm.add(i, j - 1);
if (j < m - 1) mm.add(i, j + 1);
if (i < n - 1) {
if (j < m - 1) mm.add(i + 1, j + 1);
mm.add(i + 1, j);
}
}
}
}
std::map<Grid, size_t> prev;
std::map<size_t, Grid> pattern;
for (size_t turn = 0; turn < k; turn++) {
Grid next;
std::map<Grid, size_t>::const_iterator it = prev.find(grid);
if (it != prev.end()) {
size_t start = it->second;
size_t period = turn - start;
size_t index = (k - turn) % period;
grid = pattern[start + index];
break;
}
prev[grid] = turn;
pattern[turn] = grid;
for (size_t i = 0; i < n; i++) {
for (size_t j = 0; j < m; j++) {
if (grid.ison(i, j)) next.flip(mask[i][j]);
}
}
grid = next;
}
for (size_t i = 0; i < n; i++) {
for (size_t j = 0; j < m; j++) {
std::cout << (grid.ison(i, j) ? 1 : 0);
}
std::cout << '\n';
}
return 0;
}
There is probably room for improvement. In particular, I'm not so sure how it fares for big boards. (The code above uses an ordered map. We don't need the ordering, so an unordered map would yield faster code. The example above with a single active cell on a 10×10 board took significantly longer than a second with the ordered map.)
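If you do switch to an unordered map, Grid needs a hash function and an equality operator; a possible sketch (untested, and the hash choice is just an assumption on my part):
#include <cstring>
#include <unordered_map>

// FNV-1a over the 13 payload bytes of Grid, plus the equality operator
// std::unordered_map requires.
struct GridHash {
    std::size_t operator()(const Grid &g) const {
        unsigned long long h = 14695981039346656037ull;
        for (std::size_t n = 0; n < 13; n++) {
            h ^= g.data[n];
            h *= 1099511628211ull;
        }
        return static_cast<std::size_t>(h);
    }
};

bool operator==(const Grid &a, const Grid &b) {
    return std::memcmp(a.data, b.data, 13) == 0;
}

// std::unordered_map<Grid, size_t, GridHash> prev;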
Not sure how you did it - and you should really always post code here - but let's try to optimize things.
First of all, there is not really a difference between this and a square grid. The neighbour relationships are different, but that is just a small translation function. If you have a problem there, we should treat it separately, maybe on CodeReview.
Now, the naive solution is:
for all fields
count neighbors
if odd: add a marker to update to one, else to zero
for all fields
update all fields by marker of former step
this is obviously O(N). Iterating twice roughly doubles the run time, but that should not be too bad. Try not to allocate space every time you do this; reuse existing structures.
I'd propose this solution:
at the start:
create a std::vector or std::list "activated" of pointers to all fields that are activated
each iteration:
create a vector "new_activated"
for all items in activated
count neighbors, if odd add to new_activated
for all items in activated
set to inactive
replace activated by new_activated*
for all items in activated
set to active
*this can be done efficiently by putting them in a smart pointer and using move semantics
This code only works on the activated fields. As long as they stay within some smaller area, this is far more efficient. However, I have no idea when this changes - if there are activated fields all over the place, this might be less efficient. In that case, the naive solution might be the best one.
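A rough sketch of one turn with that idea (my code, not part of the original proposal; the flat cell indices and the precomputed neighbour lists are assumptions):
#include <utility>
#include <vector>

// One turn, touching only the active cells. nbrs[c] is the precomputed list of
// neighbours of cell c; count is a scratch vector of zeros, one entry per cell.
void step(std::vector<int> &activated,
          const std::vector<std::vector<int>> &nbrs,
          std::vector<int> &count)
{
    std::vector<int> touched;                  // cells whose counter we incremented
    for (int c : activated)
        for (int nb : nbrs[c])
            if (count[nb]++ == 0) touched.push_back(nb);

    std::vector<int> new_activated;
    for (int cell : touched) {
        if (count[cell] % 2) new_activated.push_back(cell);
        count[cell] = 0;                       // reset scratch for the next turn
    }
    activated = std::move(new_activated);      // move semantics, no copy
}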
EDIT: now that you have posted your code... it is quite procedural. This is C++: use classes and represent things properly. You probably do the neighbour search right, but it is easy to make mistakes there, so you should isolate that part in a function, or better, a method. Raw arrays are bad, and variables like n or k are bad. But before I start tearing your code apart, I instead repeat my recommendation: put the code on CodeReview and have people tear it apart until it is perfect.
This started off as a comment, but I think it could be helpful as an answer in addition to what has already been stated.
You stated the following limitations:
1 <= R <= 10, 1 <= C <= 10
Given these restrictions, I'll take the liberty of representing the grid/matrix M of R rows and C columns in constant space (i.e. O(1)), and of checking an element in O(1) instead of O(R*C) time, thus removing this part from our time-complexity analysis.
That is, the grid can simply be declared as bool grid[10][10];.
The key input is the large number of turns k, stated to be in the range:
1 <= k <= 2^(63) - 1
The problem is that, AFAIK, you're required to perform k turns. This makes the algorithm O(k). Thus, no proposed solution can do better than O(k)[1].
To improve the speed in a meaningful way, this upper-bound must be lowered in some way[1], but it looks like this cannot be done without altering the problem constraints.
The fact that k can be so large is the main issue. The most anyone can do is improve the rest of the implementation, but this will only improve by a constant factor; you'll have to go through k turns regardless of how you look at it.
Therefore, unless some clever fact and/or detail is found that allows this bound to be lowered, there's no other choice.
[1] For example, it's not like trying to determine whether some number n is prime, where you can check all numbers in range(2, n) to see if they divide n, making it an O(n) process, or notice that some improvements include only looking at odd numbers after checking that n is not even (a constant factor; still O(n)), and then checking odd numbers only up to √n, i.e., in range(3, √n, 2), which meaningfully lowers the upper bound to O(√n).
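To make the footnote concrete, here is the √n version as code (an illustration only, unrelated to the grid problem):
// O(sqrt(n)) primality check: handle even n once, then try only odd divisors
// up to sqrt(n), versus the naive O(n) scan of range(2, n).
bool is_prime(long long n) {
    if (n < 2) return false;
    if (n % 2 == 0) return n == 2;
    for (long long d = 3; d * d <= n; d += 2)
        if (n % d == 0) return false;
    return true;
}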
LIBIQTOOL_API void Hist(std::vector<double>input, std::vector<double> bins, std::vector<double>& histogram)
{
double minY = *std::min_element(std::begin(input), std::end(input));
double maxY = *std::max_element(std::begin(input), std::end(input));
std::vector<double> edges;
edges.push_back(-1 * std::numeric_limits<double>::infinity());
for (int i = 0; i < bins.size() - 1; i++)
{
edges.push_back(bins[i] + 0.0100 / 2);
}
edges.push_back(std::numeric_limits<double>::infinity());
//histC
histogram.resize(edges.size() - 1);
#pragma omp parallel for
for (int i = 0; i < input.size(); i++)
{
for (int j = 0; j < edges.size() - 1; j++)
{
if ((edges[j] < input[i]) && (input[i] <= edges[j + 1]))
{
histogram[j] = histogram[j] + 1;
break;
}
}
}
histogram[histogram.size() - 1] = histogram[histogram.size() - 1] + histogram[histogram.size() - 2];
histogram.pop_back();
}
The input vector has 3,000,000+ elements and the number of bins is ~7000.
I have taken Matlab's hist() function and recreated the code I need in C++.
However, it takes very long to run. Can you see more runtime optimizations that can be done here?
I did:
a. break when you find the bin to place the current number in
b. use OpenMP
Possible optimizations:
Do not pass your input data by value, but by const reference
Do not check the lower bound, only the upper bound for each bin when doing the linear search for the correct bin.
Alternatively: Since your bins are ordered monotonically and there are no gaps, do a binary search for the correct bin, not a linear search.
The last one should give you the greatest gains; the others are more trivial to implement.
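For the binary-search suggestion, a minimal sketch (assuming the edges vector is built exactly as in your code, i.e. strictly increasing with the ±infinity sentinels):
#include <algorithm>
#include <cstddef>
#include <vector>

// Each value is placed by finding the first edge >= v; with edges[j] < v <= edges[j+1]
// that edge is edges[j+1], so the bin index is one less than the lower_bound position.
void hist_binary(const std::vector<double> &input,
                 const std::vector<double> &edges,
                 std::vector<double> &histogram)
{
    histogram.assign(edges.size() - 1, 0.0);
    for (double v : input) {
        std::size_t idx = std::lower_bound(edges.begin(), edges.end(), v) - edges.begin();
        histogram[idx - 1] += 1.0;   // idx >= 1 because edges.front() is -infinity
    }
}
The merging of the last two bins at the end of your function would still be applied afterwards.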
Btw the way you fill the edges vector looks strange.
I have a raw image with different Bayer patterns.
This is what I have implemented in order to separate the channels.
Speed is very important here since this is going to run on thousands of large images.
Can you please suggest code optimizations?
I know % (modulo) isn't very fast; how can I replace it, for example?
Thanks
void Utilities::SeparateChannels(int** _image, int*& gr, int*& r, int*& b, int*& gb,int _width, int _height, int _colorOrder)
{
//switch on the color order
int counter_R = 0;
int counter_GR = 0;
int counter_GB = 0;
int counter_B = 0;
switch (_colorOrder)
{
//rggb
case 0:
for (int i = 0; i < _height; i++)
{
for (int j = 0; j < _width; j++)
{
if (i % 2 == 0 && j % 2 == 0)
{
r[counter_R] = _image[i][j];
counter_R++;
}
else if (i % 2 == 0 && j % 2 == 1)
{
gr[counter_GR] = _image[i][j];
counter_GR++;
}
else if (i % 2 == 1 && j % 2 == 0)
{
gb[counter_GB] = _image[i][j];
counter_GB++;
}
else if (i % 2 == 1 && j % 2 == 1)
{
b[counter_B] = _image[i][j];
counter_B++;
}
}
}
break;
default:
break;
}
}
One possibility that might be worth considering would be to set up the arrays for the destination channel data as an array itself:
int *channels[] = {r, gr, gb, b};
Likewise, set up the counters as an array:
int counters[4] = {0};
...then your code could come out something like this:
for (int i=0; i<_height; i++)
    for (int j=0; j<_width; j++) {
        int channel = ((i&1) << 1) + (j&1);   // note the parentheses: + binds tighter than <<
        int &counter = counters[channel];
        channels[channel][counter++] = _image[i][j];
    }
The basic idea is that we combine the low bits of the i and j into a single number that we can use as a channel address. Then we use that number to index into the channel and the counter for that channel.
It's possible your compiler is already optimizing the existing code to be roughly equivalent to this (or possibly even better than this produces), but it's also possible it isn't.
I wouldn't normally expect a lot of improvement (at least on a typical desktop computer) though. I'd expect the bottleneck to be the bandwidth to main memory, almost regardless of the details of how you write the loop.
You should unroll the loop to process in 2x2 blocks. This way you will always know the parities and won't need to test them.
r[counter_R] = _image[i][j];
counter_R++;
gr[counter_GR] = _image[i][j+1];
counter_GR++;
gb[counter_GB] = _image[i+1][j];
counter_GB++;
b[counter_B] = _image[i+1][j+1];
counter_B++;
(Also adapt the loop parameters.)
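The adapted loops could look roughly like this (a sketch; it assumes even _width and _height and the rggb case from the question):
for (int i = 0; i < _height; i += 2)
{
    for (int j = 0; j < _width; j += 2)
    {
        // parities are known inside the 2x2 block, so no modulo tests are needed
        r[counter_R++]   = _image[i][j];
        gr[counter_GR++] = _image[i][j + 1];
        gb[counter_GB++] = _image[i + 1][j];
        b[counter_B++]   = _image[i + 1][j + 1];
    }
}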
Edit: to clarify, the problem is with the second algorithm.
I have a bit of C++ code that samples cards from a 52 card deck, which works just fine:
void sample_allcards(int table[5], int holes[], int players) {
int temp[5 + 2 * players];
bool try_again;
int c, n, i;
for (i = 0; i < 5 + 2 * players; i++) {
try_again = true;
while (try_again == true) {
try_again = false;
c = fast_rand52();
// reject collisions
for (n = 0; n < i + 1; n++) {
try_again = (temp[n] == c) || try_again;
}
temp[i] = c;
}
}
copy_cards(table, temp, 5);
copy_cards(holes, temp + 5, 2 * players);
}
I am implementing code to sample the hole cards according to a known distribution (stored as a 2d table). My code for this looks like:
void sample_allcards_weighted(double weights[][HOLE_CARDS], int table[5], int holes[], int players) {
// weights are distribution over hole cards
int temp[5 + 2 * players];
int n, i;
// table cards
for (i = 0; i < 5; i++) {
bool try_again = true;
while (try_again == true) {
try_again = false;
int c = fast_rand52();
// reject collisions
for (n = 0; n < i + 1; n++) {
try_again = (temp[n] == c) || try_again;
}
temp[i] = c;
}
}
for (int player = 0; player < players; player++) {
// hole cards according to distribution
i = 5 + 2 * player;
bool try_again = true;
while (try_again == true) {
try_again = false;
// weighted-sample c1 and c2 at once
// h is a number < 1325
int h = weighted_randi(&weights[player][0], HOLE_CARDS);
// i2h uses h and sets temp[i] to the 2 cards implied by h
i2h(&temp[i], h);
// reject collisions
for (n = 0; n < i; n++) {
try_again = (temp[n] == temp[i]) || (temp[n] == temp[i+1]) || try_again;
}
}
}
copy_cards(table, temp, 5);
copy_cards(holes, temp + 5, 2 * players);
}
My problem? The weighted sampling algorithm is a factor of 10 slower. Speed is very important for my application.
Is there a way to improve the speed of my algorithm to something more reasonable? Am I doing something wrong in my implementation?
Thanks.
edit: I was asked about this function, which I should have posted, since it is key
inline int weighted_randi(double *w, int num_choices) {
double r = fast_randd();
double threshold = 0;
int n;
for (n = 0; n < num_choices; n++) {
threshold += *w;
if (r <= threshold) return n;
w++;
}
// shouldn't get this far
cerr << n << "\t" << threshold << "\t" << r << endl;
assert(n < num_choices);
return -1;
}
...and i2h() is basically just an array lookup.
Your collision rejection is turning an O(n) algorithm into (I think) an O(n^2) operation.
There are two ways to select cards from a deck: shuffle and pop, or pick sets until the elements of the set are unique; you are doing the latter, which requires a considerable amount of backtracking.
I didn't look at the details of the code, just a quick scan.
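For the shuffle-and-pop route, a minimal sketch of the uniform part of the sampling (std::rand is used only for brevity; plug in your fast RNG):
#include <cstdlib>
#include <cstring>

// Partial Fisher-Yates: only the first 5 + 2*players positions are shuffled,
// which already gives a uniform sample without any collision rejection.
void sample_allcards_shuffled(int table[5], int holes[], int players) {
    int deck[52];
    for (int i = 0; i < 52; i++) deck[i] = i;
    int need = 5 + 2 * players;
    for (int i = 0; i < need; i++) {
        int j = i + std::rand() % (52 - i);
        int t = deck[i]; deck[i] = deck[j]; deck[j] = t;
    }
    std::memcpy(table, deck, 5 * sizeof(int));
    std::memcpy(holes, deck + 5, 2 * players * sizeof(int));
}
The weighted hole cards would still need your weighted sampling; this only replaces the rejection loop for the uniform draws.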
You could gain some speed by replacing all the loops that check if a card is taken with a bit mask, e.g. for a pool of 52 cards we prevent collisions like so:
DWORD dwMask[2] = {0}; //64 bits
//...
int nCard;
while(true)
{
nCard = rand_52();
if(!(dwMask[nCard >> 5] & 1 << (nCard & 31)))
{
dwMask[nCard >> 5] |= 1 << (nCard & 31);
break;
}
}
//...
My guess would be the memcpy(1326*sizeof(double)) within the retry loop. It doesn't seem to change, so does it need to be copied each time?
Rather than tell you what the problem is, let me suggest how you can find it. Either 1) single-step it in the IDE, or 2) randomly halt it to see what it's doing.
That said, sampling by rejection, as you are doing, can take an unreasonably long time if you are rejecting most samples.
Your inner "try_again" for loop should stop as soon as it sets try_again to true - there's no point in doing more work after you know you need to try again.
for (n = 0; n < i && !try_again; n++) {
try_again = (temp[n] == temp[i]) || (temp[n] == temp[i+1]);
}
The second question, about picking from a weighted set, also has an algorithmic replacement that should be less time complex. This is based on the principle that what is pre-computed does not need to be re-computed.
In an ordinary (uniform) selection, all bins have the same width, which makes picking a bin an O(1) operation. Your weighted_randi function has bins of varying real-valued width, so selection in your current version operates in O(n) time. Since you don't say (but do imply) that the vector of weights w is constant, I'll assume that it is.
You aren't interested in the width of the bins per se; you are interested in the locations of their edges, which you re-compute on every call to weighted_randi using the variable threshold. If w really is constant, pre-computing a list of edges (that is, the value of threshold for each *w) is your O(n) step, which need only be done once. If you put the results in a (naturally) ordered list, a binary search on all future calls yields O(log n) time complexity, with an increase in space of only one entry per weight.
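A sketch of that pre-computation (assuming the weight row really is constant and sums to 1):
#include <algorithm>
#include <vector>

// Build the cumulative edges once: edges[n] is the value `threshold` would have
// after adding w[0..n] in the original weighted_randi.
std::vector<double> make_edges(const double *w, int num_choices) {
    std::vector<double> edges(num_choices);
    double acc = 0.0;
    for (int n = 0; n < num_choices; n++) {
        acc += w[n];
        edges[n] = acc;
    }
    return edges;
}

// O(log n) replacement for the linear scan: the first edge >= r is the chosen bin.
// (If rounding pushes r past edges.back(), clamp as the original's fallback would.)
int weighted_randi_fast(const std::vector<double> &edges, double r) {
    return static_cast<int>(std::lower_bound(edges.begin(), edges.end(), r) - edges.begin());
}
You would build the edges once per weight row and then call weighted_randi_fast(edges, fast_randd()) in place of weighted_randi.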