Problem
I have some code that I need to optimize for work. Given two datasets, I need to compare every element in one dataset with every element in the other. The elements of the datasets are string vectors that look like this: {"AB", "BB", "AB", "AA", "AB", ...}, where there are 3 possible values: AB, BB, and AA. So, for example, one dataset would be something like:
AB AA BB BB AA AB
AB AA AA AA BB AB
AA AA AB BB BB BB
while the other dataset might be
BB AB AB AA AB AB
AA AA BB BB BB BB
Note: The vector length will be the same within and between datasets. In this case, it's length 6.
So the first dataset contains three vectors and the second dataset contains two, for a total of 6 comparisons.
This example compares 3 vs 2 vectors. My real problem will have something like 1.3M vs 6,000.
Reproducible Example
The following code builds the datasets at the desired sizes, similar to how they'll appear in my real code. The first part of the main function simply generates the datasets; this part doesn't need to be optimized, because in practice the data will be read from a file. I'm generating it here for the sake of this question. The part that needs to be optimized is the nested for loop in the latter part of the main function.
#include <chrono>
#include <iostream>
#include <vector>
// Takes in a 2D string vector by reference, and fills it to the required size with AA, AB, or BB
void make_genotype_data(int numRows, int numCols, std::vector<std::vector<std::string>>& geno) {
std::string vals[3] = {"AA", "AB", "BB"};
for (int i = 0; i < numRows; i++) {
std::vector<std::string> markers;
for (int j = 0; j < numCols; j++) {
int randIndex = rand() % 3;
markers.push_back(vals[randIndex]);
}
geno.push_back(std::move(markers)); // move avoids copying; markers is recreated each iteration
}
}
int main(int argc, char **argv) {
// Timing Calculation
using timepoint = std::chrono::time_point<std::chrono::high_resolution_clock>;
auto print_exec_time = [](timepoint start, timepoint stop) {
auto duration_s = std::chrono::duration_cast<std::chrono::seconds>(stop - start); // only seconds are printed
std::cout << duration_s.count() << " s\n";
};
// Create the data
auto start = std::chrono::high_resolution_clock::now();
int numMarkers = 100;
std::vector<std::vector<std::string>> old_genotypes;
std::vector<std::vector<std::string>> new_genotypes;
make_genotype_data(50, numMarkers, old_genotypes);
make_genotype_data(6000, numMarkers, new_genotypes);
auto stop = std::chrono::high_resolution_clock::now();
std::cout << "*****************" << std::endl;
std::cout << "Total time for creating data" << std::endl;
print_exec_time(start, stop);
std::cout << "*****************" << std::endl;
int nCols = old_genotypes[0].size();
float threshold = 0.8;
// Compare old_genotypes with new_genotypes
start = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < old_genotypes.size(); i++) {
const auto& og = old_genotypes[i]; // reference, so the row is not copied
for (size_t j = 0; j < new_genotypes.size(); j++) {
const auto& ng = new_genotypes[j];
int numComparisons = 0;
int numMatches = 0;
for (int k = 0; k < nCols; k++) { // check every column
if (ng[k] != "--" && og[k] != "--") {
if (ng[k] == og[k]) {
numMatches++;
}
numComparisons++;
}
}
if (numComparisons == 0) continue; // guard against division by zero when every column is "--"
float similarity = (float) numMatches / numComparisons;
if (similarity >= threshold) {
std::cout << i << " from old_genotypes and " << j << " from new_genotypes have high similarity: " << similarity << std::endl;
}
}
}
stop = std::chrono::high_resolution_clock::now();
std::cout << "*****************" << std::endl;
std::cout << "Total time for comparison" << std::endl;
print_exec_time(start, stop);
std::cout << "*****************" << std::endl;
}
On 6,000 vs 5,000 it takes about 4 minutes. So for 6,000 vs 1.3M, it'll take about 17 hours.
It's quite slow. And I have no idea what I can do to improve the speed of the nested for loop. I'm a bit new to C++, so I don't know too many of the intricacies, but I can follow along with the jargon.
I'd really appreciate some pointers (pun intended :D) to help me optimize this. I am willing to try parallelization by breaking one of the datasets up into chunks and feeding each chunk to a core to compare against the second dataset (though don't know how to parallelize in C++). But I only want to explore parallelization after taking the serialized version as far as it can go (it'll help with the parallelized version anyway).
Comparing every element of the first list to every element of the second list does n^2 comparisons. You can get much faster (except on degenerate lists) by instead sorting the two lists (which takes n log n comparisons per list), then walking a pair of counters down the lists and comparing what is at the two counters:
- if they are equal, you found a match: record the set of matches you just found, then advance each counter to the first differing element in its list;
- if the element from list 1 is less than the element from list 2, increment counter 1 (because the lists are sorted, this element of list 1 cannot match anything further along in list 2);
- if the element from list 1 is greater than the element from list 2, increment counter 2.
As soon as one counter goes past the end of its list, you are done. Each step does one comparison and advances one counter, and a counter can advance at most n times before hitting the end of its list, so the walk is linear time except for "record the set of matches you just found", which could be a large set for a degenerate pair of lists where most of both lists are the same. But even in that worst case, it only falls back to the same n^2 behavior as your current algorithm.
Related
Let's say I have a target number and a list of possible values that I can pick from to create a sequence that, once every picked number is summed, will reach the target:
target = 31
list = 2, 3, 4
possible sequence: 3 2 4 2 2 2 4 2 3 2 3 2
I'd like to:
first decide whether any sequence can reach the target
return one of the (possibly many) such sequences
This is my attempt:
#include <iostream>
#include <random>
#include <chrono>
#include <vector>
inline int GetRandomInt(int min = 0, int max = 1) {
// Seed the engine once; re-seeding on every call is slow and weakens the randomness.
static std::mt19937_64 rng(std::chrono::high_resolution_clock::now().time_since_epoch().count());
std::uniform_int_distribution<int> unif(min, max);
return unif(rng);
}
void CreateSequence(int target, std::vector<int> &availableNumbers) {
int numAttempts = 1;
int count = 0;
std::vector<int> elements;
while (count != target) {
while (count < target) {
int elem = availableNumbers[GetRandomInt(0, availableNumbers.size() - 1)];
count += elem;
elements.push_back(elem);
}
if (count != target) {
numAttempts++;
count = 0;
elements.clear();
}
}
int size = elements.size();
std::cout << "count: " << count << " | " << "num elements: " << size << " | " << "num attempts: " << numAttempts << std::endl;
for (auto it = elements.begin(); it != elements.end(); it++) {
std::cout << *it << " ";
}
}
int main() {
std::vector<int> availableNumbers = { 2, 3, 4 };
CreateSequence(31, availableNumbers);
}
But it can loop forever if no combination of the listed numbers can reach the sum; for example:
std::vector<int> availableNumbers = { 3 };
CreateSequence(8, availableNumbers);
No sequence of 3s will sum to 8. Also, if the list is huge and the target number is high, it can lead to a huge amount of processing (because lots of the while checks fail).
How would you implement this kind of algorithm?
Your suggested code is possibly very fast, since it is heuristic. But as you said, it gets potentially trapped in a nearly endless loop.
If you want to avoid this situation, you have to search the complete set of possible combinations.
Abstraction
Let's define our algorithm as a function f with a scalar target t and a vector <b> as parameters returning a vector of coefficients <c>, where <b> and <c> have the same dimension:
<c> = f(t, <b>)
First, the given set of numbers Sg should be reduced to a reduced set Sr, so that we shrink the dimension of our solution vector <c>. E.g. {2,3,4,11} can be reduced to {2,3}. We get this by calling our algorithm recursively: split Sg by taking one element as a new target ti with the remaining numbers as the new given set Sgi, and ask the algorithm whether it finds any solution (a non-zero vector). If so, remove that target ti from the original given set Sg. Repeat this recursively until no more solutions are found.
Now we can view this set of numbers as a polynomial for which we are looking for possible coefficients ci to reach our target t. Let's call each element in Sr bi, with i = {1..n}.
Our test sum ts is the sum over all i for ci * bi, where each ci can run from 0 to ni = floor(t/bi).
The number of possible tests N is now the product over all ni+1: N = (n1+1) * (n2+1) * ... * (ni+1).
Now iterate over all possibilities by representing the coefficient vector <c> as a vector of integers: increment c1, carry any overrun over to the next element in the vector (resetting c1), and so forth.
Example
#include <random>
#include <chrono>
#include <vector>
#include <iostream>
using namespace std;
static int evaluatePolynomial(const vector<int> &base, const vector<int> &coefficients)
{
int v=0;
for(unsigned long i=0; i<base.size(); i++){
v += base[i]*coefficients[i];
}
return v;
}
static bool isZeroVector(vector<int> &v)
{
for (auto it = v.begin(); it != v.end(); it++) {
if(*it != 0){
return false;
}
}
return true;
}
static vector<int> searchCoeffs(int target, vector<int> &set) {
// TODO: reduce given set
vector<int> n = set;
vector<int> c = vector<int>(set.size(), 0);
for(unsigned long int i=0; i<set.size(); i++){
n[i] = target/set[i];
}
c[0] = 1;
bool overflow = false;
while(!overflow){
if(evaluatePolynomial(set, c) == target){
return c;
}
// increment coefficient vector
overflow = true;
for(unsigned long int i=0; i<c.size(); i++){
c[i]++;
if(c[i] > n[i]){
c[i] = 0;
}else{
overflow = false;
break;
}
}
}
return vector<int>(set.size(), 0);
}
static void print(int target, vector<int> &set, vector<int> &c)
{
for(unsigned long i=0; i<set.size(); i++){
for(int j=0; j<c[i]; j++){
cout << set[i] << " ";
}
}
cout << endl;
cout << target << " = ";
for(unsigned long i=0; i<set.size(); i++){
cout << " +" << set[i] << "*" << c[i];
}
cout << endl;
}
int main() {
vector<int> set = {4,3,2};
int target = 31;
auto c = searchCoeffs(target, set);
print(target, set,c);
}
That code prints
4 4 4 4 4 4 4 3
31 = +4*7 +3*1 +2*0
Further Thoughts
production code should test for zeros in the given values
the search could be improved by incrementing the next coefficient once the evaluated polynomial already exceeds the target value
a further speedup is possible: when c1 is set to zero, compute the difference between the target value and the evaluated polynomial and check whether that difference is a multiple of b1; if not, c2 can be incremented straight away
perhaps there are shortcuts exploiting the least common multiple
As ihavenoidea proposed, I would also try backtracking. In addition, I would sort the numbers in decreasing order, in order to speed up the process.
Note: a comment would be more appropriate than an answer, but I am not allowed to comment yet. Hope it helps. I will delete this answer if requested.
I must add: I am calling my linear search 15,000 times, and the lowest range I am searching within is up to 50,000, increasing with each iteration. That means there are 15,000 * 50,000 lookups on the first iteration alone. This should take longer than 0 ms.
I have this basic Linear search:
bool linearSearch(std::vector<int>&primes, int number, int range) {
for (int i = 0; i < range; i++) {
if (primes[i] == number)
return true;
}
return false;
}
I take time using:
void timeLinearSearch(std::vector<int>& primes) {
clock_t start, stop;
size_t NRND = 15000; // 15000 primes per clock
for (int N = 50000; N <= 500000; N += 50000) // increase by 50k each iteration
{
for (int repeat = 0; repeat < 5; repeat++) {
start = clock();
for (int j = 0; j < NRND; j++) {
linearSearch(primes, rand(), N);
}
stop = clock();
std::cout << stop - start << ", " << N << std::endl;
}
}
}
The problem here is that the time taken is 0ms. The vector 'primes' has roughly 600 000 elements in it so the search stays within range.
In the linear search, if I change:
if(primes[i] == number)
to:
if(primes.at(i) == number)
then I get time > 0 taken for the search.
I have compared my linear search with the primes.at(i) to std::find() by:
for (int j = 0; j < NRND; j++) {
std::find(primes.begin(), primes.begin() + N, rand());
}
and this is roughly 200ms faster than my .at() find.
Why is my search with std::vector[i] giving me 0ms time?
When the compiler can see into the implementation of linearSearch, it can optimize it out entirely when you use operator[], because there are no side effects. That is why you see zero time.
at(..), on the other hand, has a side effect (throwing when the index is out of bounds) so the compiler has no option of optimizing it out.
You can fix your benchmark to ensure that the call is kept in place, for example, by counting the number of matches:
int count = 0;
for (int j = 0; j < NRND; j++) {
count += linearSearch(primes, rand(), N);
}
std::cout << stop - start << ", " << N << " " << count << std::endl;
You do need to be careful when writing comparison code like this; make sure you have a statistically rigorous way of interpreting your data. Assuming you do, here's an explanation:
[i] does not have to check if i is within the bounds of a vector, whereas at(i) must check.
That explains the difference in the speed: your compiler is able to generate faster code for [].
It feels to me you are comparing apples and oranges.
You ask it to find rand(), so it is a different number in every run.
What about looking for elements like this instead (assuming you have N primes):
primes[N/10], primes[N/2], primes[3*N/4], ... for elements to be found
(add +1 to the value if you want the item not to be found)
Careful: if your primes array is sorted in increasing order, you might want to return false as soon as primes[i] > number rather than walking the whole array (or even do a binary search), unless you are purely evaluating .at().
I'm trying to solve this exercise http://main.edu.pl/en/archive/amppz/2014/dzi and I have no idea how to improve the performance of my code. Problems occur when the program has to handle over 500,000 unique numbers (up to 2,000,000 as in the description). Then it takes 1-8 s to loop over all those numbers. The tests I have used are from http://main.edu.pl/en/user.phtml?op=tests&c=52014&task=1263, and I test with the command
program.exe < data.in > result.out
Description:
You are given a sequence of n integers a1, a2, ..., an. You should determine the number of ordered pairs (i, j) such that i, j ∈ {1, ..., n}, i != j, and ai is a divisor of aj.
The first line of input contains one integer n(1 <= n <= 2000000)
The second line contains a sequence of n integers a1, a2, ..., an(1 <= ai <= 2000000).
The first and only line of output should contain one integer, denoting the number of pairs sought.
For the input data:
5
2 4 5 2 6
the correct answer is: 6
Explanation: There are 6 pairs: (1, 2) = 4/2, (1, 4) = 2/2, (1, 5) = 6/2, (4, 1) = 2/2, (4, 2) = 4/2, (4, 5) = 6/2.
For example:
- with 2M total numbers and 635k unique numbers, there are 345 million iterations in total
- with 2M total numbers and 2M unique numbers, there are 1,885 million iterations in total
#include <iostream>
#include <math.h>
#include <algorithm>
#include <time.h>
#define COUNT_SAME(count) (((count) - 1) * (count)) // parenthesized to be safe as a macro
int main(int argc, char **argv) {
std::ios_base::sync_with_stdio(0);
int n; // Total numbers
scanf("%d", &n);
clock_t start, finish;
double duration;
int minVal = 2000000;
long long *countVect = new long long[2000001](); // 1-2,000,000; the () zero-initializes, otherwise the counts start as garbage
unsigned long long counter = 0;
unsigned long long operations = 0;
int tmp;
int duplicates = 0;
for (int i = 0; i < n; i++) {
scanf("%d", &tmp);
if (countVect[tmp] > 0) { // Not best way, but works
++countVect[tmp];
++duplicates;
} else {
if (minVal > tmp)
minVal = tmp;
countVect[tmp] = 1;
}
}
start = clock();
int valueJ;
int sqrtValue, valueIJ;
int j;
for (int i = 2000000; i > 0; --i) {
if (countVect[i] > 0) { // Not all entries are set
if (countVect[i] > 1)
counter += COUNT_SAME(countVect[i]); // Sum same values
sqrtValue = sqrt(i);
for (j = minVal; j <= sqrtValue; ++j) {
if (i % j == 0) {
valueIJ = i / j;
if (valueIJ != i && countVect[valueIJ] > 0 && valueIJ > sqrtValue)
counter += countVect[i] * countVect[valueIJ];
if (i != j && countVect[j] > 0)
counter += countVect[i] * countVect[j];
}
++operations;
}
}
}
finish = clock();
duration = (double)(finish - start) / CLOCKS_PER_SEC;
printf("Loops time: %2.3f", duration);
std::cout << "s\n";
std::cout << "\n\nCounter: " << counter << "\n";
std::cout << "Total operations: " << operations;
std::cout << "\nDuplicates: " << duplicates << "/" << n;
return 0;
}
I know I shouldn't sort the array at the beginning, but I have no idea how to do it in a better way.
Any tips will be great, thanks!
Here is the improved algorithm - 2M unique numbers within 0.5 s. Thanks to @PJTraill!
#include <iostream>
#include <math.h>
#include <algorithm>
#include <time.h>
#define COUNT_SAME(count) (((count) - 1) * (count)) // parenthesized to be safe as a macro
int main(int argc, char **argv) {
std::ios_base::sync_with_stdio(0);
int n; // Total numbers
scanf("%d", &n);
clock_t start, finish;
double duration;
int maxVal = 0;
long long *countVect = new long long[2000001](); // 1-2,000,000; the () zero-initializes, otherwise the counts start as garbage
unsigned long long counter = 0;
unsigned long long operations = 0;
int tmp;
int duplicates = 0;
for (int i = 0; i < n; i++) {
scanf("%d", &tmp);
if (countVect[tmp] > 0) { // Not best way, but works
++countVect[tmp];
++duplicates;
} else {
if (maxVal < tmp)
maxVal = tmp;
countVect[tmp] = 1;
}
}
start = clock();
int j;
int jCounter = 1;
for (int i = 0; i <= maxVal; ++i) {
if (countVect[i] > 0) { // Not all entries are set
if (countVect[i] > 1)
counter += COUNT_SAME(countVect[i]); // Sum same values
j = i * ++jCounter;
while (j <= maxVal) {
if (countVect[j] > 0)
counter += countVect[i] * countVect[j];
j = i * ++jCounter;
++operations;
}
jCounter = 1;
}
}
finish = clock();
duration = (double)(finish - start) / CLOCKS_PER_SEC;
printf("Loops time: %2.3f", duration);
std::cout << "s\n";
std::cout << "\n\nCounter: " << counter << "\n";
std::cout << "Total operations: " << operations;
std::cout << "\nDuplicates: " << duplicates << "/" << n;
return 0;
}
I expect the following to work a lot faster than the OP's algorithm (further optimisations aside):
(The type of values and frequencies should be 32-bit unsigned, counts 64-bit – promote before calculating a count, if your language would not.)
Read the number of values, N.
Read each value v, adding one to its frequency freq[v] (no need to store it).
(freq[MAX] (or MAX+1) can be statically allocated for probably optimal initialisation to all 0)
Calculate the number of pairs involving 1 from freq[1] and the number of values.
For every i in 2..MAX (with freq[i] > 0):
Calculate the number of pairs (i,i) from freq[i].
For every multiple m of i in 2i..MAX:
(Use m as the loop counter and increment it, rather than multiplying)
Calculate the number of pairs (i,m) from freq[i] and freq[m].
(if freq[i] = 1, one can omit the (i,i) calculation and perform a variant of the loop optimised for freq[i] = 1)
(One can perform the previous (outer) loop from 2..MAX/2, and then from MAX/2+1..MAX omitting the processing of multiples)
The number of pairs (i,i) = C(freq[i], 2) = ( freq[i] * (freq[i] - 1) ) / 2 .
The number of pairs (i,j) = freq[i] * freq[j] for i ≠ j.
This avoids sorting, sqrt and division.
Other optimisations
One can store the distinct values, and scan that array instead (the order does not matter); the gain or loss due to this depends on the density of the values in 1..MAX.
If the maximum frequency is < 2^16, which sounds very probable, all products will fit in 32 bits. One could take advantage of this by writing functions with the numeric type as a template, tracking the maximum frequency and then choosing the appropriate instance of the template for the rest. This costs N*(compare+branch) and may gain by performing D^2 multiplications with 32 bits instead of 64, where D is the number of distinct values. I see no easy way to deduce that 32 bits suffice for the total, apart from N < 2^16.
If parallelising this for n processors, one could let different processors process different residues modulo n.
I considered keeping track of the number of even values, to avoid a scan of half the frequencies, but I think that for most datasets within the given parameters that would yield little advantage.
Ok, I am not going to write your whole algorithm for you, but it can definitely be done faster. So i guess this is what you need to get going:
So you have your list sorted, and there are a lot of assumptions you can make from this. Take for instance the highest value: it won't have any multiples in the list. The highest value that does have one is at most the highest value divided by two.
There is also one other very useful fact here: a multiple of a multiple is also a multiple. (Still following? ;)) Take for instance the list [2 4 12]. Now you've found (4,12) as a multiple pair. If you then also find (2,4), you can deduce that 12 is also a multiple of 2.
And since you only have to count the pairs, you can just keep a count for each number how many multiples it has, and add that when you see that number as a multiple itself.
This means that it is probably best to iterate your sorted list backwards, and look for divisors instead.
And maybe store it in some way that goes like
[ (three 2's ), (two 5's), ...]
i.e. store how often a number occurs. Once again, you don't have to keep track of its id, since you only need to give the total number of pairs.
Storing your list this way helps you, because all the 2's are going to have the same amount of multiples. So calculate once and then multiply.
Using VexCL in C++, I am trying to count all values in a vector above a certain minimum, and I would like to perform this count on the device. The default reductors only provide MIN, MAX and SUM, and the examples do not show very clearly how to perform such an operation. This code is slow, as it is probably executed on the host instead of the device:
int amount = 0;
int minimum = 5;
for (vex::vector<int>::iterator i = vector.begin(); i != vector.end(); ++i)
{
if (*i >= minimum)
{
amount++;
}
}
The vector I am using will consist of a large number of values, say millions, mostly zeros. Besides the count of values above the minimum, I would also like to retrieve a list of the vector indices holding those values. Is this possible?
If you only needed to count elements above the minimum, this would be as simple as
vex::Reductor<int, vex::SUM> sum(ctx);
int amount = sum( vec >= minimum );
The vec >= minimum expression results in a sequence of ones and zeros, and sum then counts ones.
Now, since you also need to get the positions of the elements above the minimum, it gets a bit more complicated:
#include <iostream>
#include <vexcl/vexcl.hpp>
int main() {
vex::Context ctx(vex::Filter::Env && vex::Filter::Count(1));
// Input vector
vex::vector<int> vec(ctx, {1, 3, 5, 2, 6, 8, 0, 2, 4, 7});
int n = vec.size();
int minimum = 5;
// Put result of (vec >= minimum) into key, and element indices into pos:
vex::vector<int> key(ctx, n);
vex::vector<int> pos(ctx, n);
key = (vec >= minimum);
pos = vex::element_index();
// Get number of interesting elements in vec.
vex::Reductor<int, vex::SUM> sum(ctx);
int amount = sum(key);
// Sort pos by key in descending order.
vex::sort_by_key(key, pos, vex::greater<int>());
// First 'amount' of elements in pos now hold indices of interesting
// elements. Lets use slicer to extract them:
vex::vector<int> indices(ctx, amount);
vex::slicer<1> slice(vex::extents[n]);
indices = slice[vex::range(0, amount)](pos);
std::cout << "indices: " << indices << std::endl;
}
This gives the following output:
indices: {
0: 2 4 5 9
}
@ddemidov
Thanks for your help, it is working. However, it is much slower than my original code which copies the device vector to the host and sorts using Boost. Below is the sample code with some timings:
#include <iostream>
#include <cstdio>
#include <vexcl/vexcl.hpp>
#include <vector>
#include <boost/range/algorithm.hpp>
using namespace std; // needed for the unqualified cout/endl below
int main()
{
clock_t start, end;
// initialize vector with random numbers
std::vector<int> hostVector(1000000);
for (int i = 0; i < hostVector.size(); ++i)
{
hostVector[i] = rand() % 20 + 1;
}
// copy to device
vex::Context cpu(vex::Filter::Type(CL_DEVICE_TYPE_CPU) && vex::Filter::Any);
vex::Context gpu(vex::Filter::Type(CL_DEVICE_TYPE_GPU) && vex::Filter::Any);
vex::vector<int> vectorCPU(cpu, 1000000);
vex::vector<int> vectorGPU(gpu, 1000000);
copy(hostVector, vectorCPU);
copy(hostVector, vectorGPU);
// sort results on CPU
start = clock();
boost::sort(hostVector);
end = clock();
cout << "C++: " << (end - start) / (CLOCKS_PER_SEC / 1000) << " ms" << endl;
// sort results on OpenCL
start = clock();
vex::sort(vectorCPU, vex::greater<int>());
end = clock();
cout << "vexcl CPU: " << (end - start) / (CLOCKS_PER_SEC / 1000) << " ms" << endl;
start = clock();
vex::sort(vectorGPU, vex::greater<int>());
end = clock();
cout << "vexcl GPU: " << (end - start) / (CLOCKS_PER_SEC / 1000) << " ms" << endl;
return 0;
}
which results in:
C++: 17 ms
vexcl CPU: 737 ms
vexcl GPU: 1670 ms
using an i7 3770 CPU and a (slow) HD4650 graphics card. From what I've read, OpenCL should be able to sort large vectors quickly. Do you have any advice on how to perform a fast sort using OpenCL and VexCL?
I've edited the original text to save potential readers some time and health. Maybe someone will actually use this.
I know it's basic stuff. Probably like very, very basic.
How to get all possible combinations of given set.
E.g.
string set = "abc";
I expect to get:
a b c aa ab ac aaa aab aac aba abb abc aca acb acc baa bab ...
and the list goes on (if no limit for length is set).
I'm looking for very clean code for that - everything I've found was kind of dirty and not working correctly. The same goes for the code I wrote.
I need such code because I'm writing a brute-force (MD5) implementation working on multiple threads. The pattern is that there's a parent process that feeds the threads with chunks of their very own combinations, so they can work on them on their own.
Example: first thread gets package of 100 permutations, second gets next 100 etc.
Let me know if I should post the final program anywhere.
EDIT #2
Once again thank you guys.
Thanks to you I've finished my Slave/Master Brute-Force application implemented with MPICH2 (yep, can work under linux and windows across for example network) and since the day is almost over here, and I've already wasted a lot of time (and sun) I'll proceed with my next task ... :)
You've shown me that the StackOverflow community is awesome - thanks!
Here's some C++ code that generates permutations of a power set up to a given length.
The function getPowPerms takes a set of characters (as a vector of strings) and a maximum length, and returns a vector of permuted strings:
#include <iostream>
using std::cout;
#include <string>
using std::string;
#include <vector>
using std::vector;
vector<string> getPowPerms( const vector<string>& set, unsigned length ) {
if( length == 0 ) return vector<string>();
if( length == 1 ) return set;
vector<string> substrs = getPowPerms(set,length-1);
vector<string> result = substrs;
for( unsigned i = 0; i < substrs.size(); ++i ) {
for( unsigned j = 0; j < set.size(); ++j ) {
result.push_back( set[j] + substrs[i] );
}
}
return result;
}
int main() {
const int MAX_SIZE = 3;
string str = "abc";
vector<string> set; // use vector for ease-of-access
for( unsigned i = 0; i < str.size(); ++i ) set.push_back( str.substr(i,1) );
vector<string> perms = getPowPerms( set, MAX_SIZE );
for( unsigned i = 0; i < perms.size(); ++i ) cout << perms[i] << '\n';
}
When run, this example prints
a b c aa ba ca ab bb cb ... acc bcc ccc
Update: I'm not sure if this is useful, but here is a "generator" function called next that creates the next item in the list given the current item.
Perhaps you could generate the first N items and send them somewhere, then generate the next N items and send them somewhere else.
string next( const string& cur, const string& set ) {
string result = cur;
bool carry = true;
int loc = cur.size() - 1;
char last = *set.rbegin(), first = *set.begin();
while( loc >= 0 && carry ) {
if( result[loc] != last ) { // increment
size_t found = set.find(result[loc]); // find returns size_t, not int
if( found != string::npos && found + 1 < set.size() ) {
result[loc] = set.at(found+1);
}
carry = false;
} else { // reset and carry
result[loc] = first;
}
--loc;
}
if( carry ) { // overflow
result.insert( result.begin(), first );
}
return result;
}
int main() {
string set = "abc";
string cur = "a";
for( int i = 0; i < 20; ++i ) {
cout << cur << '\n'; // displays a b c aa ab ac ba bb bc ...
cur = next( cur, set );
}
}
C++ has std::next_permutation(), but I don't think that's what you want.
You should be able to do it quite easily with a recursive function. e.g.
void combinations(string s, int len, string prefix) {
if (len<1) {
cout << prefix << endl;
} else {
for (size_t i = 0; i < s.size(); i++) {
combinations(s, len-1, prefix + s[i]);
}
}
}
EDIT: For the threading part, I assume you are working on a password brute forcer?
If so, I guess the password testing part is what you want to speed up rather than password generation.
Therefore, you could simply create a parent process which generates all combinations, then every kth password is given to thread k mod N (where N is the number of threads) for checking.
Another take on permutations is in Python's standard library, although your question is about C++.
http://docs.python.org/library/itertools.html#itertools.permutations
But your list contains an infinite sequence of each character, so I think you should first define how those should be ordered, and state your algorithm clearly.
I can't give you the code, but what you need is a recursive algorithm; here is some pseudocode.
The idea is simple: concatenate each string in your set with each and every other string, then permute the strings. Add all the smaller strings to your set and do the same thing again with the new set. Keep going till you are tired :)
Might be a bit confusing but think about it a little ;)
set = { "a", "b", "c"}
build_combinations(set)
{
new_set={}
for( Element in set ){
new_set.add(Element);
for( other_element in set )
new_element = concatenate(Element, other_element);
new_set.add(new_element);
}
new_set = permute_all_elements(new_set);
return build_combinations(new_set);
}
This will obviously cause a stack overflow because there is no terminating condition :) so put whatever condition you like into the build_combinations function (maybe the size of the set?) to terminate the recursion.
Here's an odd and normally not ideal way of doing it, but hey, it works, and it doesn't use recursion :-)
void permutations(char c[], int l) // l is the length of c
{
int length = 1;
while (length < 5)
{
for (int j = 0; j < int(pow(double(l), double(length))); j++) // for each word of a particular length
{
for (int i = 0; i < length; i++) // for each character in a word
{
cout << c[(j / int(pow(double(l), double(length - i - 1))) % l)];
}
cout << endl;
}
length++;
}
}
I know you've got a perfectly good answer already (multiple ones in fact), but I was thinking a bit about this problem and I came up with a pretty neat algorithm that I might as well share.
Basically, you can do this by starting with a list of the symbols, and then appending each symbol to each other symbol to make two symbol words, and then appending each symbol to each word. That might not make much sense like that, so here's what it looks like:
Start with 'a', 'b' and 'c' as the symbols and add them to a list:
a
b
c
Append 'a', 'b' and 'c' to each word in the list. The list then looks like:
a
b
c
aa
ab
ac
ba
bb
bc
ca
cb
cc
Then append 'a', 'b' and 'c' to each new word in the list so the list will look like this:
a
b
c
aa
ab
ac
ba
bb
bc
ca
cb
cc
aaa
aab
aac
aba
abb
... and so on
You can do this easily by using an iterator and just let the iterator keep going from the start.
This code prints out each word that is added to the list.
void permutations(string symbols)
{
list<string> l;
// add each symbol to the list
for (int i = 0; i < symbols.length(); i++)
{
l.push_back(symbols.substr(i, 1));
cout << symbols.substr(i, 1) << endl;
}
// infinite loop that looks at each word in the list
for (list<string>::iterator it = l.begin(); it != l.end(); it++)
{
// append each symbol to the current word and add it to the end of the list
for (int i = 0; i < symbols.length(); i++)
{
string s(*it);
s.push_back(symbols[i]);
l.push_back(s);
cout << s << endl;
}
}
}
a Python example:
import itertools
import string
characters = string.ascii_lowercase
max_length = 3
count = 1
while count < max_length+1:
for current_tuple in itertools.product(characters, repeat=count):
current_string = "".join(current_tuple)
print(current_string)
count += 1
The output is exactly what you expect to get:
a b c aa ab ac aaa aab aac aba abb abc aca acb acc baa bab ...
(the example is using the whole ASCII lowercase chars set, change "characters = ['a','b','c']" to reduce the size of output)
What you want is called Permutation.
Check this for a permutation implementation in Java.