Generate integers with even Hamming weight (popcount) in C++

I want to efficiently (by using bit hacks) generate all integers up to a given number, k, that have an even Hamming weight, without explicitly calculating their Hamming weights. It does not matter to me whether this is done in ascending or descending order.
A bonus (related task) would be to generate all integers with even Hamming weight that are subsets (in the Gray code sense) of k.
Example:
input-> k=14 (binary 1110)
output all-> 3 (0011), 5(0101), 6 (0110), 9 (1001), 10 (1010), 12 (1100)
output subsets-> 6 (0110), 10 (1010), 12 (1100)
Example code using popcount:
for (unsigned int sub=1; sub<k; sub++){
if (__builtin_popcount(sub) % 2 == 0){
cout << sub << endl;
}
}
Example code using popcount for subsets:
for (unsigned int sub=((k-1)&k); sub!=0; sub=((sub-1)&k)){
if (__builtin_popcount(sub) % 2 == 0){
cout << sub << endl;
}
}

We can build a tree with numbers in the nodes, where each node has two children: one with bit number x flipped and the other with bit number x left unchanged. We need to exclude all children with a value greater than the initial value. We can store the popcount in a variable and increment or decrement it each time we flip a bit, depending on the flipped bit's value, thus avoiding recalculating the popcount each time the number changes.
I don't know whether this method is faster. I guess it may be, but the overhead of the recursive function may be too big.
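One way to realize the idea of flipping one bit at a time while tracking parity, sketched here under assumptions rather than taken from the answers: consecutive Gray codes differ in exactly one bit, so the Gray code i ^ (i >> 1) has the same parity as the lowest bit of i, and every even index i yields an even Hamming weight. The helper name evenWeightBelow is hypothetical, and the results come out in Gray-code order rather than sorted.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: all n with 0 < n < k and even Hamming weight,
// produced without calling popcount.
std::vector<unsigned> evenWeightBelow(unsigned k) {
    std::vector<unsigned> out;
    if (k < 2) return out;
    // Smallest power of two that is >= k; Gray codes of i < limit stay below limit.
    unsigned long long limit = 1;
    while (limit < k) limit <<= 1;
    // Only even i: the Gray code of an even index has even weight.
    for (unsigned long long i = 2; i < limit; i += 2) {
        const unsigned g = static_cast<unsigned>(i ^ (i >> 1));
        if (g < k) out.push_back(g);  // a plain comparison, no bit counting
    }
    return out;
}
```

For k = 14 this yields 3, 6, 5, 12, 10, 9. The loop visits half of the values below the next power of two, so up to half of the candidates may be rejected by the comparison with k.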
That was fun:
#include <cstdio>
#include <iostream>
#include <vector>
#include <algorithm>
#include <climits>
#include <cinttypes>
#include <cassert>
#include <bitset>
#include <cstring>
namespace gen {
bool isEven(unsigned int x) {
return x % 2 == 0;
}
// find last set bit, just like ffs, but backwards
unsigned int fls(unsigned int x)
{
assert(x >= 1);
if (x == 0) {
return 0;
}
#ifdef __GNUC__
const unsigned int clz = __builtin_clz(x);
#else
#error find clz function in C++
#endif
assert((sizeof(x) * CHAR_BIT) >= clz + 1); // clz is 0 when the top bit of x is set
return (sizeof(x) * CHAR_BIT) - clz - 1;
}
unsigned int popcount(unsigned int x) {
#ifdef __GNUC__
return __builtin_popcount(x);
#else
return std::bitset<sizeof(x)*CHAR_BIT>(x).count();
#endif
}
/**
* Generates all integers up to a given number k with even Hamming weight
* @param out - output vector that receives the results via push_back
* @param greycodesubset - set to true if only interested in Gray code subset integers
* @param startk - starting k value
* @param k - the current number value
* @param pos - one plus the position of the bit in number k that we will change in this run
* @param popcount - Hamming weight of number k up to position pos
* @param changes - the number of bits changed in number k since startk. Used only if greycodesubset = true
*/
void loop(std::vector<unsigned int>& out, const bool& greycodesubset,
const unsigned int& startk,
unsigned int k, unsigned int pos, unsigned int popcount,
unsigned int changes)
{
// k > startk may happen for example for 0b10, if we flip the last bit, then k = 0b11
if (startk < k) {
return;
}
// end of recursive function
if (pos == 0) {
if (isEven(popcount) && k != 0) {
out.push_back(k);
}
return;
}
// decrement pos
--pos;
const int mask = 1 << pos;
const bool is_pos_bit_set = k & mask;
// call without changes
loop(out, greycodesubset, startk,
k, pos, popcount + (is_pos_bit_set ? +1 : 0), changes);
// when finding Gray code subsets only, we can change at most 1 bit
if (greycodesubset) {
if (changes >= 1) {
return;
}
++changes;
}
// call with toggled bit number pos
loop(out, greycodesubset, startk,
k ^ mask, pos, popcount + (!is_pos_bit_set ? +1 : 0), changes);
}
std::vector<unsigned int> run(const unsigned int& k, const bool& greycodesubsetonly)
{
assert(k > 0);
std::vector<unsigned int> out;
if (k < 2) return out;
loop(out, greycodesubsetonly, k, k, fls(k) + 1, 0, 0);
return out;
}
} // namespace gen
int main()
{
const unsigned int k = 14;
const int bits_in_k = 4;
std::vector<unsigned int> out = gen::run(k, false);
std::vector<unsigned int> out_subset = gen::run(k, true);
std::cout << "input-> k=" << k << "(" << std::bitset<bits_in_k>(k).to_string() << ") " << std::endl;
std::cout << "output all-> ";
std::for_each(out.begin(), out.end(), [](int v) {
std::cout << v << "(" << std::bitset<bits_in_k>(v).to_string() << ") ";
});
std::cout << std::endl;
std::cout << "output subsets-> ";
std::for_each(out_subset.begin(), out_subset.end(), [](int v) {
std::cout << v << "(" << std::bitset<bits_in_k>(v).to_string() << ") ";
});
std::cout << std::endl;
return 0;
}
input-> k=14(1110)
output all-> 12(1100) 10(1010) 9(1001) 6(0110) 5(0101) 3(0011)
output subsets-> 12(1100) 10(1010) 6(0110)
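For the "subsets" case from the question (submasks of k, as in the (sub-1)&k loop), a similar Gray-code walk can be sketched, assuming that order does not matter. Each step flips exactly one of k's set bits, so the weight of the current submask has the same parity as the step counter. The helper name evenWeightSubmasks is hypothetical; like the question's loop, it skips k itself and 0.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: all proper, non-empty submasks of k with an even
// number of set bits, found without any popcount call.
std::vector<unsigned> evenWeightSubmasks(unsigned k) {
    // Collect the single-bit masks of k, lowest bit first.
    std::vector<unsigned> bits;
    for (unsigned rest = k; rest != 0; rest &= rest - 1) {
        bits.push_back(rest & (~rest + 1u));  // isolate the lowest set bit
    }
    std::vector<unsigned> out;
    const unsigned long long steps = 1ULL << bits.size();
    unsigned sub = 0;
    for (unsigned long long i = 1; i < steps; ++i) {
        // Gray-code walk: step i flips the bit indexed by the trailing zeros of i.
        unsigned pos = 0;
        while (((i >> pos) & 1ULL) == 0) ++pos;
        sub ^= bits[pos];
        // After i single-bit flips, the weight of sub has the parity of i.
        if ((i & 1ULL) == 0 && sub != k) out.push_back(sub);
    }
    return out;
}
```

For k = 14 the walk produces 6, 12, 10, the same set as the expected "output subsets" line.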

Related

Formatting Commas into a long long integer

this is my first time posting a question. I was hoping to get some help on a very old computer science assignment that I never got around to finishing. I'm no longer taking the class, just want to see how to solve this.
Read in an integer (any valid 64-bit
integer = long long type) and output the same number but with commas inserted.
If the user entered -1234567890, your program should output -1,234,567,890. Commas
should appear after every three significant digits (provided more digits remain) starting
from the decimal point and working left toward more significant digits. If the number
entered does not require commas, do not add any. For example, if the input is 234 you
should output 234. The input 0 should produce output 0. Note in the example above
that the number can be positive or negative. Your output must maintain the case of the
input.
I'm relatively new to programming, and this was all I could come up with:
#include <iostream>
#include <cmath>
using namespace std;
int main()
{
long long n;
cout << "Enter an integer:" << endl;
cin >> n;
int ones = n % 10;
int tens = n / 10 % 10;
int hund = n / 100 % 10;
int thous = n / 1000 % 10;
int tthous = n / 10000 % 10;
cout << tthous << thous << "," << hund << tens << ones << endl;
return 0;
}
The original assignment prohibited the use of strings, arrays, and vectors, so please refrain from giving suggestions/solutions that involve these.
I'm aware that some sort of for-loop would probably be required to properly insert the commas in the necessary places, but I just do not know how to go about implementing this.
Thank you in advance to anyone who offers their help!
Just to give you an idea of how to solve this, I've made a simple implementation. Keep in mind that it is just a simple example:
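#include <iostream>
#include <iomanip>
using namespace std;
int main()
{
long long n = -1234567890;
if ( n < 0 )
cout << '-';
// Work on the magnitude as an unsigned value so that LLONG_MIN does not overflow
unsigned long long m = n < 0 ? 0ULL - static_cast<unsigned long long>(n)
: static_cast<unsigned long long>(n);
bool started = false; // true once the leading group has been printed
// 10^18 is the largest power of 1000 that fits in a 64-bit integer
for (unsigned long long i = 1000000000000000000ULL; i > 0; i /= 1000) {
unsigned long long group = m / i;
m %= i;
if ( !started && group == 0 && i > 1 ) continue; // skip leading empty groups
if ( started )
cout << ',' << setw(3) << setfill('0') << group; // inner groups keep their leading zeros
else
cout << group;
started = true;
}
cout << endl;
return 0;
}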
http://coliru.stacked-crooked.com/a/150f75db89c46e99
The easy solution would be to use ios::imbue to set a locale that would do all the work for you:
std::cout.imbue(std::locale(""));
std::cout << n << std::endl;
However, if the constraints don't allow for strings or vectors, I doubt that this would be a valid solution. Instead you could use recursion:
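// counter is the number of digits to the right of the current digit
void print(long long n, int counter) {
if (n > 0) {
print(n / 10, counter + 1);
std::cout << n % 10;
// a comma follows this digit when whole three-digit groups lie to its right
if (counter > 0 && counter % 3 == 0) {
std::cout << ",";
}
}
}
void print(long long n) {
if (n == 0) {
std::cout << "0"; // the recursion prints nothing for zero
return;
}
if (n < 0) {
std::cout << "-";
n *= -1; // note: overflows for LLONG_MIN
}
print(n, 0);
}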
And then in the main simply call print(n);
A small template class comma_sep may be a solution, the usage may be as simple as:
cout << comma_sep<long long>(7497592752850).sep() << endl;
Which outputs:
7,497,592,752,850
Picked from here:
https://github.com/arloan/libimsux/blob/main/comma_sep.hxx
template <class I = int, int maxdigits = 32>
class comma_sep {
char buff[maxdigits + maxdigits / 3 + 2];
char * p;
I i;
char sc;
public:
comma_sep(I i, char c = ',') : p(buff), i(i), sc(c) {
if (i < 0) {
buff[0] = '-';
*++p = '\0';
}
}
const char * sep() {
return _sep(std::abs(i));
}
private:
const char * _sep(I i) {
I r = i % 1000;
I n = i / 1000;
if (n > 0) {
_sep(n);
p += sprintf(p, "%c%03d", sc, (int)r);
*p = '\0';
} else {
p += sprintf(p, "%d", (int)r);
*p = '\0';
}
return buff;
}
};
The above class handles only integral numbers; float/double numbers need a partially specialized version:
template<int maxd>
class comma_sep<double, maxd> {
comma_sep<int64_t, maxd> _cs;
char fs[64];
double f;
public:
const int max_frac = 12;
comma_sep(double d, char c = ',') : _cs((int64_t)d, c) {
double np;
f = std::abs(modf(d, &np));
}
const char * sep(int frac = 3) {
if (frac < 1 || frac > max_frac) {
throw std::invalid_argument("factional part too too long or invalid");
}
auto p = _cs.sep();
strcpy(fs, p);
char fmt[8], tmp[max_frac+3];
sprintf(fmt, "%%.%dlf", frac);
sprintf(tmp, fmt, f);
return strcat(fs, tmp + 1);
}
};
The two above classes can be improved by adding type-traits like std::is_integral and/or std::is_floating_point, though.

C++ Binomial Coefficient is too slow

I've tried to compute the binomial coefficient by making a recursion with Pascal's triangle. It works great for small numbers, but 20 up is either really slow or doesn't work at all.
I've tried to look up some optimization techniques, such as "caching", but they don't really seem to be well integrated in C++.
Here's the code if that helps you.
int binom(const int n, const int k)
{
double sum;
if(n == 0 || k == 0){
sum = 1;
}
else{
sum = binom(n-1,k-1)+binom(n-1,k);
}
if((n== 1 && k== 0) || (n== 1 && k== 1))
{
sum = 1;
}
if(k > n)
{
sum = 0;
}
return sum;
}
int main()
{
int n;
int k;
int sum;
cout << "Enter a n: ";
cin >> n;
cout << "Enter a k: ";
cin >> k;
sum = binom(n,k);
cout << endl << endl << "Number of possible combinations: " << sum <<
endl;
}
My guess is that the program wastes a lot of time calculating results it has already calculated. It somehow must remember past results.
My guess is that the program wastes a lot of time calculating results it has already calculated.
That's definitely true.
On this topic, I'd suggest you have a look at dynamic programming.
There is a class of problems whose naive solutions have exponential runtime complexity but which can be solved with dynamic programming techniques.
That reduces the runtime complexity to polynomial (most of the time at the expense of increased space complexity).
The common approaches for dynamic programming are:
Top-Down (exploiting memoization and recursion).
Bottom-Up (iterative).
Here is my bottom-up solution (fast and compact):
int BinomialCoefficient(const int n, const int k) {
if (k == 0) return 1; // C(n, 0) = 1; also avoids indexing an empty vector
std::vector<int> aSolutions(k);
aSolutions[0] = n - k + 1;
for (int i = 1; i < k; ++i) {
aSolutions[i] = aSolutions[i - 1] * (n - k + 1 + i) / (i + 1);
}
return aSolutions[k - 1];
}
This algorithm has a runtime complexity of O(k) and a space complexity of O(k).
In other words, it is linear in k.
Moreover, this solution is simpler and faster than the recursive approach, and it is very CPU cache-friendly.
Note also that the running time does not depend on n.
I derived this result with simple math operations, which give the following formula:
C(n, k) = C(n - 1, k - 1) * n / k
Some math references on the binomial coefficient.
Note
The algorithm does not really need a space complexity of O(k).
Indeed, the solution at i-th step depends only on (i-1)-th.
Therefore, there is no need to store all intermediate solutions but just the one at the previous step. That would make the algorithm O(1) in terms of space complexity.
However, I would prefer keeping all intermediate solutions in solution code to better show the principle behind the Dynamic Programming methodology.
Here is my repository with the optimized algorithm.
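The O(1)-space variant mentioned in the note can be sketched as follows. This is an illustration under assumptions, not the code from the repository: the function name binomial is hypothetical, it uses unsigned long long, and the intermediate product can overflow before the final result does, so it is only safe for moderate n.

```cpp
#include <cassert>

// Hypothetical O(1)-space variant: keeps only the previous step's value.
unsigned long long binomial(unsigned n, unsigned k) {
    if (k > n) return 0;
    if (k > n - k) k = n - k;  // symmetry C(n, k) = C(n, n - k) shortens the loop
    unsigned long long result = 1;
    for (unsigned i = 1; i <= k; ++i) {
        // Before this step result == C(n - k + i - 1, i - 1); the product
        // below equals i * C(n - k + i, i), so the division is exact.
        result = result * (n - k + i) / i;
    }
    return result;
}
```

For example, binomial(20, 10) evaluates to 184756 in ten loop iterations.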
I would cache the results of each calculation in a map. A std::pair<int, int> works as a map key; alternatively, you could turn the key into a string.
string key = to_string(n) + "," + to_string(k);
Then have a global map:
map<string, double> cachedValues;
You can then do a lookup with the key and, if it is found, return immediately. Otherwise, store the result in the map before you return.
I began mapping out what would happen with a call to 4,5. It gets messy, with a LOT of calculations. Each level deeper results in 2^n lookups.
I don't know if your basic algorithm is correct, but if so, then I'd move this code to the top of the method:
if(k > n)
{
return 0;
}
As it appears that if k > n, you always return 0, even for something like 6,100. I don't know if that's correct or not, however.
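A minimal sketch of such a cache, assuming a std::pair<int, int> key (std::pair is ordered lexicographically, so std::map accepts it directly) and with the k > n check moved to the top. The name binomMemo is hypothetical, and the boundary cases follow Pascal's triangle rather than the question's exact conditions.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Hypothetical memoized version of the Pascal's triangle recursion.
unsigned long long binomMemo(int n, int k) {
    if (k < 0 || k > n) return 0;    // outside the triangle
    if (k == 0 || k == n) return 1;  // border of the triangle
    static std::map<std::pair<int, int>, unsigned long long> cache;
    const auto key = std::make_pair(n, k);
    const auto it = cache.find(key);
    if (it != cache.end()) return it->second;  // already computed
    const unsigned long long value = binomMemo(n - 1, k - 1) + binomMemo(n - 1, k);
    cache.emplace(key, value);
    return value;
}
```

Each (n, k) pair is computed once, so binomMemo(30, 15) needs a few hundred additions instead of millions of recursive calls.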
You're computing some binomial values multiple times. A quick solution is memoization.
Untested:
int binom(int n, int k);
int binom_mem(int n, int k)
{
static std::map<std::pair<int, int>, std::optional<int>> lookup_table;
auto const input = std::pair{n,k};
if (lookup_table[input].has_value() == false) {
lookup_table[input] = binom(n, k);
}
return *lookup_table[input]; // unwrap the optional; it is guaranteed to hold a value here
}
int binom(int n, int k)
{
double sum;
if (n == 0 || k == 0){
sum = 1;
} else {
sum = binom_mem(n-1,k-1) + binom_mem(n-1,k);
}
if ((n== 1 && k== 0) || (n== 1 && k== 1))
{
sum = 1;
}
if(k > n)
{
sum = 0;
}
return sum;
}
A better solution would be to make the recursion tail-recursive (not easy with double recursion) or, better yet, not to use recursion at all ;)
I found this very simple (perhaps a bit slow) method of writing the binomial coefficient even for non integers, based on this proof (written by me):
double binomial_coefficient(float k, int a) {
double b=1;
for(int p=1; p<=a; p++) {
b=b*(k+1-p)/p;
}
return b;
}
If you can tolerate wasting some compile time memory, you can pre-compute a Pascal-Triangle at compile time. With a simple lookup mechanism, this will give you maximum speed.
The downside is that the table is limited in size: only rows up to n = 67 fit completely, because C(68, 34) already overflows even an unsigned long long.
So, we simply use a constexpr function and calculate the values for a Pascal triangle in a 2 dimensional compile-time constexpr std::array.
The nCr function simply uses an index into that array (into Pascals Triangle).
Please see the following example code:
#include <iostream>
#include <utility>
#include <array>
#include <iomanip>
#include <cmath>
// Number of rows stored. Rows 0..67 fit in a 64 bit variable; C(68, 34) would overflow
constexpr size_t MaxN = 68u;
// If we store Pascal Triangle in a 2 dimensional array, the size will be that
constexpr size_t ArraySize = MaxN;
// This function will generate Pascals triangle stored in a 2 dimension std::array
constexpr auto calculatePascalTriangle() {
// Result of function. Here we will store Pascals triangle as a 2 dimensional array
std::array<std::array<unsigned long long, ArraySize>, ArraySize> pascalTriangle{};
// Go through all rows and columns of Pascals triangle
for (size_t row{}; row < MaxN; ++row) for (size_t col{}; col <= row; ++col) {
// Border values are always one
unsigned long long result{ 1 };
if (col != 0 && col != row) {
// And calculate the new value for the current row
result = pascalTriangle[row - 1][col - 1] + pascalTriangle[row - 1][col];
}
// Store new value
pascalTriangle[row][col] = result;
}
// And return array as function result
return pascalTriangle;
}
// This is a constexpr std::array<std::array<unsigned long long,ArraySize>, ArraySize> with the name PPP, containing all nCr results
constexpr auto PPP = calculatePascalTriangle();
// To calculate nCr, we simply look up the value in the array
constexpr unsigned long long nCr(size_t n, size_t r) {
return PPP[n][r];
}
// Some debug test driver code. Print Pascal triangle
int main() {
constexpr size_t RowsToPrint = 16u;
const size_t digits = static_cast<size_t>(std::ceil(std::log10(nCr(RowsToPrint, RowsToPrint / 2))));
for (size_t row{}; row < RowsToPrint; ++row) {
std::cout << std::string((RowsToPrint - row) * ((digits + 1) / 2), ' ');
for (size_t col{}; col <= row; ++col)
std::cout << std::setw(digits) << nCr(row, col) << ' ';
std::cout << '\n';
}
return 0;
}
We can also store Pascals Triangle in a 1 dimensional constexpr std::array. But then we need to additionally calculate the Triangle numbers to find the start index for a row. But also this can be done completely at compile time.
Then the solution would look like this:
#include <iostream>
#include <utility>
#include <array>
#include <iomanip>
#include <cmath>
// Number of rows stored. Rows 0..67 fit in a 64 bit variable; C(68, 34) would overflow
constexpr size_t MaxN = 68u; // C(67, 33) = 14226520737620288370
// If we store Pascal Triangle in an 1 dimensional array, the size will be that
constexpr size_t ArraySize = (MaxN + 1) * MaxN / 2;
// To get the offset of a row of a Pascals Triangle stored in a 1 dimensional array
constexpr size_t getTriangleNumber(size_t row) {
size_t sum{};
for (size_t i = 1; i <= row; i++) sum += i;
return sum;
}
// Generate a std::array with n elements of a given type and a generator function
template <typename DataType, DataType(*generator)(size_t), size_t... ManyIndices>
constexpr auto generateArray(std::integer_sequence<size_t, ManyIndices...>) {
return std::array<DataType, sizeof...(ManyIndices)>{ { generator(ManyIndices)... } };
}
// This is a std::array<size_t,MaxN> with the name TriangleNumber, containing triangle numbers for up to MaxN
constexpr auto TriangleNumber = generateArray<size_t, getTriangleNumber>(std::make_integer_sequence<size_t, MaxN>());
// This function will generate Pascals triangle stored in an 1 dimension std::array
constexpr auto calculatePascalTriangle() {
// Result of function. Here we will store Pascals triangle as an 1 dimensional array
std::array <unsigned long long, ArraySize> pascalTriangle{};
size_t index{}; // Running index for storing values in the array
// Go through all rows and columns of Pascals triangle
for (size_t row{}; row < MaxN; ++row) for (size_t col{}; col <= row; ++col) {
// Border values are always one
unsigned long long result{ 1 };
if (col != 0 && col != row) {
// So, we are not at the border. Get the start index the upper 2 values
const size_t offsetOfRowAbove = TriangleNumber[row - 1] + col;
// And calculate the new value for the current row
result = pascalTriangle[offsetOfRowAbove] + pascalTriangle[offsetOfRowAbove - 1];
}
// Store new value
pascalTriangle[index++] = result;
}
// And return array as function result
return pascalTriangle;
}
// This is a constexpr std::array<unsigned long long,ArraySize> with the name PPP, containing all nCr results
constexpr auto PPP = calculatePascalTriangle();
// To calculate nCr, we simply look up the value in the array
constexpr unsigned long long nCr(size_t n, size_t r) {
return PPP[TriangleNumber[n] + r];
}
// Some debug test driver code. Print Pascal triangle
int main() {
constexpr size_t RowsToPrint = 16; // MaxN - 1;
const size_t digits = static_cast<size_t>(std::ceil(std::log10(nCr(RowsToPrint, RowsToPrint / 2))));
for (size_t row{}; row < RowsToPrint; ++row) {
std::cout << std::string((RowsToPrint - row+1) * ((digits+1) / 2), ' ');
for (size_t col{}; col <= row; ++col)
std::cout << std::setw(digits) << nCr(row, col) << ' ';
std::cout << '\n';
}
return 0;
}

CUDA histogram reduce_by_key failing

I have the following CUDA Thrust code that uses reduce_by_key to histogram the values [0, 1024) into 256 buckets. I expect each bucket to have a count = 4, yet I see bucket 0 has 256, bucket 255 has 3, and the remainder have 4.
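#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <thrust/device_vector.h>
#include <thrust/extrema.h>
#include <thrust/pair.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/iterator/constant_iterator.h>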
#define SIZE 1024
struct binFunc {
const float minVal;
const float valRange;
const int numBins;
binFunc(float _minVal, float _valRange, int _numBins) :
minVal(_minVal), valRange(_valRange), numBins(_numBins) {}
__host__ __device__
int operator()(float v) const {
int b = int((v - minVal) / valRange * float(numBins));
return b;
}
};
int main() {
thrust::device_vector<float> d_vec(SIZE);
for (int i = 0; i < SIZE; ++i)
d_vec[i] = float(i);
thrust::device_vector<float>::iterator min;
thrust::device_vector<float>::iterator max;
thrust::pair<thrust::device_vector<float>::iterator,
thrust::device_vector<float>::iterator> minmax =
thrust::minmax_element(d_vec.begin(), d_vec.end());
min = minmax.first;
max = minmax.second;
float minVal = *min;
float maxVal = *max;
std::cout << "The minimum value is " << minVal
<< " and the maximum value is " << maxVal << "." << std::endl;
float valRange = maxVal - minVal;
std::cout << "The range is " << valRange << "." << std::endl;
int numBins = 256;
thrust::device_vector<int> d_binResults(SIZE);
thrust::transform(d_vec.begin(), d_vec.end(), d_binResults.begin(),
binFunc(minVal, valRange, numBins));
thrust::device_vector<int>::iterator d_binResults_iter =
d_binResults.begin();
for (int i = 0; i < 10; ++i) {
int b = *d_binResults_iter;
printf("d_binResults[%d]=%d\n", i, b);
d_binResults_iter++;
}
std::cout << "The numBins is " << numBins << "." << std::endl;
thrust::device_vector<int> d_binsKeys(numBins);
thrust::device_vector<int> d_binsValues(numBins);
thrust::pair<thrust::device_vector<int>::iterator,
thrust::device_vector<int>::iterator> keys_and_values =
thrust::reduce_by_key(d_binResults.begin(), d_binResults.end(),
thrust::constant_iterator<int>(1), d_binsKeys.begin(),
d_binsValues.begin());
thrust::device_vector<int>::iterator d_binsKeys_begin_iter =
d_binsKeys.begin();
thrust::device_vector<int>::iterator d_binsValues_begin_iter =
d_binsValues.begin();
for (int i = 0; i < numBins; ++i) {
int key = *d_binsKeys_begin_iter;
int val = *d_binsValues_begin_iter;
printf("d_binsValues[%d]=(%d,%d)\n", i, key, val);
d_binsKeys_begin_iter++;
d_binsValues_begin_iter++;
}
return 0;
}
The salient part of the output is:
d_binsValues[0]=(0,256)
d_binsValues[1]=(1,4)
d_binsValues[2]=(2,4)
...
d_binsValues[254]=(254,4)
d_binsValues[255]=(255,3)
So, bucket 0 has 256 elements, and bucket 255 has 3 elements? What's going on here?
If you print out all the d_binResults[] values instead of the first 10, you will discover that the last element (d_binResults[1023]) has a value of 256! But that is an invalid bin index. For numBins = 256, the valid indices are 0..255.
It is occurring due to the calculation arithmetic in your functor:
int b = int((v - minVal) / valRange * float(numBins));
Plugging in the relevant values for the last element, we have:
(1023 - 0)/1023*256 = 256
But 256 is an invalid bin index. It turns out that this breaks the reduce_by_key operation, causing both the last bin to have 3 elements and the first bin to be "corrupted".
If you fix this you will fix both issues you describe (first bin has 256 elements, last bin has 3.)
As a simple proof, add this line of code:
d_binResults[1023] = 255;
immediately after your thrust::transform operation. The results are then correct. How you choose to correct your bin calculation arithmetic is up to you. (possibly "fixable" by adding 1 to valRange but that may imply something about your expected histogram values).
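One possible correction, sketched here as plain host-side C++ under assumptions: clamp the computed index into the valid range so that the maximum value lands in the last bin. ClampedBin is a hypothetical name; in the Thrust program the operator would additionally need the __host__ __device__ qualifiers, and the standard-library min/max calls may need replacing with device-compatible ones.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical variant of binFunc that never returns an out-of-range bin.
struct ClampedBin {
    float minVal;
    float valRange;
    int numBins;
    int operator()(float v) const {
        // Same arithmetic as the original functor...
        const int b = static_cast<int>((v - minVal) / valRange * static_cast<float>(numBins));
        // ...but v == minVal + valRange now maps to numBins - 1 instead of numBins.
        return std::min(std::max(b, 0), numBins - 1);
    }
};
```

With minVal = 0, valRange = 1023 and numBins = 256, the value 1023 now maps to bin 255. The bin boundaries are otherwise unchanged, so whether this matches the intended histogram is still a design decision.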

thrust vector distance calculation

Consider the following dataset and centroids. There are 7 individuals and two means, each with 8 dimensions. They are stored in row-major order.
short dim = 8;
float centroids[] = {
0.223, 0.002, 0.223, 0.412, 0.334, 0.532, 0.244, 0.612,
0.742, 0.812, 0.817, 0.353, 0.325, 0.452, 0.837, 0.441
};
float data[] = {
0.314, 0.504, 0.030, 0.215, 0.647, 0.045, 0.443, 0.325,
0.731, 0.354, 0.696, 0.604, 0.954, 0.673, 0.625, 0.744,
0.615, 0.936, 0.045, 0.779, 0.169, 0.589, 0.303, 0.869,
0.275, 0.406, 0.003, 0.763, 0.471, 0.748, 0.230, 0.769,
0.903, 0.489, 0.135, 0.599, 0.094, 0.088, 0.272, 0.719,
0.112, 0.448, 0.809, 0.157, 0.227, 0.978, 0.747, 0.530,
0.908, 0.121, 0.321, 0.911, 0.884, 0.792, 0.658, 0.114
};
I want to calculate every Euclidean distance: c1 - d1, c1 - d2, and so on.
On CPU I would do:
float dist = 0.0, dist_sqrt;
for(int i = 0; i < 2; i++)
for(int j = 0; j < 7; j++)
{
float dist_sum = 0.0;
for(int k = 0; k < dim; k++)
{
dist = centroids[i * dim + k] - data[j * dim + k];
dist_sum += dist * dist;
}
dist_sqrt = sqrt(dist_sum);
// do something with the distance
std::cout << dist_sqrt << std::endl;
}
Is there any built in solution of vector distance calculation in THRUST?
It can be done in thrust. Explaining how will be rather involved, and the code is rather dense.
The key observation to start with is that the core operation can be done via a transformed reduction. The thrust transform operation is used to perform the elementwise subtraction of the vectors (individual-centroid) and squaring of each result, and the reduction sums the results together to produce the square of the euclidean distance. The starting point for this operation is thrust::reduce_by_key, but it gets rather involved to present the data correctly to reduce_by_key.
The final results are produced by taking the square root of each result from above, and we can use an ordinary thrust::transform for this.
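To make the "transformed reduction" concrete for a single centroid/individual pair, here is a host-only sketch using std::inner_product, where the "multiply" step is replaced by a squared difference and the "add" step is an ordinary sum. The function name euclidean is hypothetical, and this is a CPU analogue of the idea, not Thrust code.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <numeric>

// Hypothetical helper: Euclidean distance between two dim-element vectors.
float euclidean(const float* a, const float* b, int dim) {
    // Transform: (x - y)^2 per element; reduce: sum of the transformed values.
    const float sumSq = std::inner_product(
        a, a + dim, b, 0.0f, std::plus<float>(),
        [](float x, float y) { const float d = x - y; return d * d; });
    // The final square root corresponds to the separate transform step.
    return std::sqrt(sumSq);
}
```

For the vectors (0, 0) and (3, 4) this returns 5.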
The above is a summary description of the only 2 lines of thrust code that do all the work. However, the first line has considerable complexity to it. In order to exploit parallelism, the approach I took was to virtually "lay out" the necessary vectors in sequence, to be presented to reduce_by_key. To take a simple example, suppose we have 2 centroids and 4 individuals, and suppose our dimension is 2.
centroid 0: C00 C01
centroid 1: C10 C11
individ 0: I00 I01
individ 1: I10 I11
individ 2: I20 I21
individ 3: I30 I31
We can "lay out" the vectors like this:
C00 C01 C00 C01 C00 C01 C00 C01 C10 C11 C10 C11 C10 C11 C10 C11
I00 I01 I10 I11 I20 I21 I30 I31 I00 I01 I10 I11 I20 I21 I30 I31
To facilitate the reduce_by_key, we will also need to generate key values to delineate the vectors:
0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
The "laid-out" data sets above can be quite large, and we don't want to incur storage and retrieval cost, so we will generate these "on-the-fly" using thrust's collection of fancy iterators. This is where things get quite dense. With the above strategy in mind, we will use thrust::reduce_by_key to do the work. We'll create a custom functor provided to a transform_iterator to do the subtraction (and squaring) of the I and C vectors, which will be zipped together for this purpose. The "lay out" of the vectors will be created on the fly using permutation iterators with additional custom index-creation functors, to help with the replicated patterns in each of I and C.
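The index arithmetic behind the virtual "lay out" can be checked on the host before involving any iterators. The sketch below materializes the three index sequences for a given problem size; Layout and layOut are hypothetical names, and the key here is simply position / dim, which changes at every vector boundary (any key sequence that changes at those boundaries delineates the segments equally well).

```cpp
#include <cassert>
#include <vector>

// Hypothetical container for the three generated index sequences.
struct Layout {
    std::vector<int> centroidIdx;  // index into the flat centroid array
    std::vector<int> dataIdx;      // index into the flat data array
    std::vector<int> keys;         // segment key for the reduction
};

Layout layOut(int dim, int numData, int numCentroids) {
    Layout l;
    const int total = dim * numData * numCentroids;
    for (int pos = 0; pos < total; ++pos) {
        // Each centroid vector is repeated numData times before moving on.
        l.centroidIdx.push_back((pos % dim) + dim * (pos / (dim * numData)));
        // The whole data array is repeated once per centroid.
        l.dataIdx.push_back(pos % (dim * numData));
        // One key per (centroid, individual) pair of dim elements.
        l.keys.push_back(pos / dim);
    }
    return l;
}
```

For dim = 2, 4 individuals and 2 centroids, centroidIdx comes out as 0 1 0 1 0 1 0 1 2 3 2 3 2 3 2 3 and dataIdx as 0..7 followed by 0..7 again, matching the two laid-out rows shown above.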
Therefore, working from the "inside out", the sequence of steps is as follows:
1. For both I (data) and C (centr), use a counting_iterator combined with a custom indexing functor inside of a transform_iterator to produce the indexing sequences we will need.
2. Using the indexing sequences created in step 1 and the base I and C vectors, virtually "lay out" the vectors via a permutation_iterator (one for each laid-out vector).
3. Zip the 2 "laid out" virtual I and C vectors together, to create a <float, float> tuple vector (virtual).
4. Take the zip_iterator from step 3, and combine it with a custom distance-calculation functor ((I-C)^2) in a transform_iterator.
5. Use another transform_iterator, combining a counting_iterator with a custom key-generating functor, to produce the key sequence (virtual).
6. Pass the iterators in steps 4 and 5 to reduce_by_key as the inputs (keys, values) to be reduced. The output vectors for reduce_by_key are also keys and values. We don't need the keys, so we'll use a discard_iterator to dump those. The values we will save.
The above steps are all accomplished in a single line of thrust code.
Here's code illustrating the above:
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/copy.h>
#include <math.h>
#include <time.h>
#include <sys/time.h>
#include <stdlib.h>
#define MAX_DATA 100000000
#define MAX_CENT 5000
#define TOL 0.001
unsigned long long dtime_usec(unsigned long long prev){
#define USECPSEC 1000000ULL
timeval tv1;
gettimeofday(&tv1,0);
return ((tv1.tv_sec * USECPSEC)+tv1.tv_usec) - prev;
}
unsigned verify(float *d1, float *d2, int len){
unsigned pass = 1;
for (int i = 0; i < len; i++)
if (fabsf(d1[i] - d2[i]) > TOL){
std::cout << "mismatch at: " << i << " val1: " << d1[i] << " val2: " << d2[i] << std::endl;
pass = 0;
break;}
return pass;
}
void eucl_dist_cpu(const float *centroids, const float *data, float *rdist, int num_centroids, int dim, int num_data, int print){
int out_idx = 0;
float dist, dist_sqrt;
for(int i = 0; i < num_centroids; i++)
for(int j = 0; j < num_data; j++)
{
float dist_sum = 0.0;
for(int k = 0; k < dim; k++)
{
dist = centroids[i * dim + k] - data[j * dim + k];
dist_sum += dist * dist;
}
dist_sqrt = sqrt(dist_sum);
// do something with the distance
rdist[out_idx++] = dist_sqrt;
if (print) std::cout << dist_sqrt << ", ";
}
if (print) std::cout << std::endl;
}
struct dkeygen : public thrust::unary_function<int, int>
{
int dim;
int numd;
dkeygen(const int _dim, const int _numd) : dim(_dim), numd(_numd) {};
__host__ __device__ int operator()(const int val) const {
return (val/dim);
}
};
typedef thrust::tuple<float, float> mytuple;
struct my_dist : public thrust::unary_function<mytuple, float>
{
__host__ __device__ float operator()(const mytuple &my_tuple) const {
float temp = thrust::get<0>(my_tuple) - thrust::get<1>(my_tuple);
return temp*temp;
}
};
struct d_idx : public thrust::unary_function<int, int>
{
int dim;
int numd;
d_idx(int _dim, int _numd) : dim(_dim), numd(_numd) {};
__host__ __device__ int operator()(const int val) const {
return (val % (dim*numd));
}
};
struct c_idx : public thrust::unary_function<int, int>
{
int dim;
int numd;
c_idx(int _dim, int _numd) : dim(_dim), numd(_numd) {};
__host__ __device__ int operator()(const int val) const {
return (val % dim) + (dim * (val/(dim*numd)));
}
};
struct my_sqrt : public thrust::unary_function<float, float>
{
__host__ __device__ float operator()(const float val) const {
return sqrtf(val);
}
};
unsigned long long eucl_dist_thrust(thrust::host_vector<float> &centroids, thrust::host_vector<float> &data, thrust::host_vector<float> &dist, int num_centroids, int dim, int num_data, int print){
thrust::device_vector<float> d_data = data;
thrust::device_vector<float> d_centr = centroids;
thrust::device_vector<float> values_out(num_centroids*num_data);
unsigned long long compute_time = dtime_usec(0);
thrust::reduce_by_key(thrust::make_transform_iterator(thrust::make_counting_iterator<int>(0), dkeygen(dim, num_data)), thrust::make_transform_iterator(thrust::make_counting_iterator<int>(dim*num_data*num_centroids), dkeygen(dim, num_data)),thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(d_centr.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator<int>(0), c_idx(dim, num_data))), thrust::make_permutation_iterator(d_data.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator<int>(0), d_idx(dim, num_data))))), my_dist()), thrust::make_discard_iterator(), values_out.begin());
thrust::transform(values_out.begin(), values_out.end(), values_out.begin(), my_sqrt());
cudaDeviceSynchronize();
compute_time = dtime_usec(compute_time);
if (print){
thrust::copy(values_out.begin(), values_out.end(), std::ostream_iterator<float>(std::cout, ", "));
std::cout << std::endl;
}
thrust::copy(values_out.begin(), values_out.end(), dist.begin());
return compute_time;
}
int main(int argc, char *argv[]){
int dim = 8;
int num_centroids = 2;
float centroids[] = {
0.223, 0.002, 0.223, 0.412, 0.334, 0.532, 0.244, 0.612,
0.742, 0.812, 0.817, 0.353, 0.325, 0.452, 0.837, 0.441
};
int num_data = 8;
float data[] = {
0.314, 0.504, 0.030, 0.215, 0.647, 0.045, 0.443, 0.325,
0.731, 0.354, 0.696, 0.604, 0.954, 0.673, 0.625, 0.744,
0.615, 0.936, 0.045, 0.779, 0.169, 0.589, 0.303, 0.869,
0.275, 0.406, 0.003, 0.763, 0.471, 0.748, 0.230, 0.769,
0.903, 0.489, 0.135, 0.599, 0.094, 0.088, 0.272, 0.719,
0.112, 0.448, 0.809, 0.157, 0.227, 0.978, 0.747, 0.530,
0.908, 0.121, 0.321, 0.911, 0.884, 0.792, 0.658, 0.114,
0.721, 0.555, 0.979, 0.412, 0.007, 0.501, 0.844, 0.234
};
std::cout << "cpu results: " << std::endl;
float dist[num_data*num_centroids];
eucl_dist_cpu(centroids, data, dist, num_centroids, dim, num_data, 1);
thrust::host_vector<float> h_data(data, data + (sizeof(data)/sizeof(float)));
thrust::host_vector<float> h_centr(centroids, centroids + (sizeof(centroids)/sizeof(float)));
thrust::host_vector<float> h_dist(num_centroids*num_data);
std::cout << "gpu results: " << std::endl;
eucl_dist_thrust(h_centr, h_data, h_dist, num_centroids, dim, num_data, 1);
float *data2, *centroids2, *dist2;
num_centroids = 10;
num_data = 1000000;
if (argc > 2) {
num_centroids = atoi(argv[1]);
num_data = atoi(argv[2]);
if ((num_centroids < 1) || (num_centroids > MAX_CENT)) {std::cout << "Num centroids out of range" << std::endl; return 1;}
if ((num_data < 1) || (num_data > MAX_DATA)) {std::cout << "Num data out of range" << std::endl; return 1;}
if ((long long)num_data * dim * num_centroids > 2000000000LL) {std::cout << "data set out of range" << std::endl; return 1;}}
std::cout << "Num Data: " << num_data << std::endl;
std::cout << "Num Cent: " << num_centroids << std::endl;
std::cout << "result size: " << ((num_data*num_centroids*4)/1048576) << " Mbytes" << std::endl;
data2 = new float[dim*num_data];
centroids2 = new float[dim*num_centroids];
dist2 = new float[num_data*num_centroids];
for (int i = 0; i < dim*num_data; i++) data2[i] = rand()/(float)RAND_MAX;
for (int i = 0; i < dim*num_centroids; i++) centroids2[i] = rand()/(float)RAND_MAX;
unsigned long long dtime = dtime_usec(0);
eucl_dist_cpu(centroids2, data2, dist2, num_centroids, dim, num_data, 0);
dtime = dtime_usec(dtime);
std::cout << "cpu time: " << dtime/(float)USECPSEC << "s" << std::endl;
thrust::host_vector<float> h_data2(data2, data2 + (dim*num_data));
thrust::host_vector<float> h_centr2(centroids2, centroids2 + (dim*num_centroids));
thrust::host_vector<float> h_dist2(num_data*num_centroids);
dtime = dtime_usec(0);
unsigned long long ctime = eucl_dist_thrust(h_centr2, h_data2, h_dist2, num_centroids, dim, num_data, 0);
dtime = dtime_usec(dtime);
std::cout << "gpu total time: " << dtime/(float)USECPSEC << "s, gpu compute time: " << ctime/(float)USECPSEC << "s" << std::endl;
if (!verify(dist2, &(h_dist2[0]), num_data*num_centroids)) {std::cout << "Verification failure." << std::endl; return 1;}
std::cout << "Success!" << std::endl;
return 0;
}
Notes:
The code is set up to do two passes. The first is a short one using a data set similar to yours, with printout for a visual check. The second uses a larger data set whose size can be set via command-line parameters (number of centroids, then number of individuals), for benchmark comparison and validation of results.
Contrary to what I stated in the comments, the thrust code is only running about 25% faster than the naive single-threaded CPU code. Your mileage may vary.
This is just one way to think about handling it. I have had other ideas, but not enough time to flesh them out.
The data sets can become rather large. The code right now is intended to be limited to data sets where the product of dimension*number_of_centroids*number_of_individuals is less than 2 billion. However, as you approach even this number, you will need a GPU and CPU that both have a few GB of memory. I briefly explored larger data set sizes. A few code changes would be needed in various places to extend from e.g. int to unsigned long long, etc. However I haven't provided that as I am still investigating an issue with that code.
For another, non-thrust-related look at computing euclidean distances on the GPU, you may be interested in this question. If you follow the sequence of optimizations that were made there, it may shed some light on either how this thrust code might be improved, or else how another non-thrust realization could be used.
Sorry I wasn't able to squeeze more performance out.
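As a sketch of the kind of change the size note above describes, the size check can be done in 64-bit arithmetic so that the product cannot wrap around for large inputs. The helper name fits_in_limit and its default limit are illustrative assumptions, not part of the program above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when dim * num_centroids * num_data <= limit,
// evaluated without relying on 32-bit int arithmetic.
bool fits_in_limit(int dim, int num_centroids, int num_data,
                   std::int64_t limit = 2000000000LL) {
    assert(limit > 0);
    if (dim < 1 || num_centroids < 1 || num_data < 1) return false;
    // The product of two positive ints always fits in 64 bits.
    std::int64_t partial = static_cast<std::int64_t>(dim) * num_centroids;
    // Compare by division so the third factor cannot overflow either.
    return partial <= limit / num_data;
}
```

With the defaults used in main (dim 8, 10 centroids, 1000000 data points) this returns true; raising the data count to 100000000 makes it return false.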

How to put bit sequence into bytes (C/C++)

I have a couple of integers, for example (in binary representation):
00001000, 01111111, 10000000, 00000001
and I need to pack them in sequence into an array of bytes (chars), without the leading zeros, like so:
10001111 11110000 0001000
I understand that it must be done by bit shifting with << and >> and using binary OR (|). But I can't find the correct algorithm. Can you suggest the best approach?
The integers I need to put there are unsigned long long ints, so the length of one can be anywhere from 1 bit to 8 bytes (64 bits).
You could use a std::bitset:
#include <bitset>
#include <iostream>
int main() {
unsigned i = 242122534;
std::bitset<sizeof(i) * 8> bits;
bits = i;
std::cout << bits.to_string() << "\n";
}
There are doubtless other ways of doing it, but I would probably go with the simplest:
std::vector<unsigned char> integers; // Has your list of bytes
integers.push_back(0x02);
integers.push_back(0xFF);
integers.push_back(0x00);
integers.push_back(0x10);
integers.push_back(0x01);
std::string str; // Will have your resulting string
for(unsigned int i=0; i < integers.size(); i++)
for(int j=0; j<8; j++)
str += ((integers[i]<<j) & 0x80 ? "1" : "0");
std::cout << str << "\n";
size_t begin = str.find("1");
if(begin > 0) str.erase(0,begin);
std::cout << str << "\n";
I wrote this up before you mentioned that you were using long ints or whatnot, but that doesn't actually change very much of this. The mask needs to change, and the j loop variable, but otherwise the above should work.
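A sketch of that adaptation for unsigned long long values, assuming a 64-bit unsigned long long: the mask becomes the top bit of the word and the inner loop runs over 64 positions. The function name to_trimmed_bits is made up for illustration. Like the snippet above, it only strips the leading zeros of the whole concatenated string, not of each individual value.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: writes every value as 64 bits, most significant bit
// first, then removes the leading zeros of the combined string.
std::string to_trimmed_bits(const std::vector<unsigned long long>& values) {
    static_assert(sizeof(unsigned long long) * 8 == 64, "sketch assumes 64-bit values");
    const unsigned long long top = 1ULL << 63;  // mask for the highest bit
    std::string str;
    for (unsigned long long v : values)
        for (int j = 0; j < 64; ++j)
            str += ((v << j) & top) ? '1' : '0';
    assert(str.size() == 64 * values.size());
    std::size_t begin = str.find('1');
    if (begin == std::string::npos) return "";  // no set bit anywhere
    str.erase(0, begin);
    return str;
}
```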
Convert them to strings, then erase all leading zeros:
#include <iostream>
#include <sstream>
#include <string>
#include <cstdint>
std::string to_bin(uint64_t v)
{
std::stringstream ss;
for(size_t x = 0; x < 64; ++x)
{
if(v & 0x8000000000000000)
ss << "1";
else
ss << "0";
v <<= 1;
}
return ss.str();
}
// Despite its name, this strips the leading (left-hand) zeros from the string.
void trim_right(std::string& in)
{
size_t non_zero = in.find_first_not_of("0");
if(std::string::npos != non_zero)
in.erase(in.begin(), in.begin() + non_zero);
else
{
// no 1 in data set, what to do?
in = "<no data>";
}
}
int main()
{
uint64_t v1 = 437148234;
uint64_t v2 = 1;
uint64_t v3 = 0;
std::string v1s = to_bin(v1);
std::string v2s = to_bin(v2);
std::string v3s = to_bin(v3);
trim_right(v1s);
trim_right(v2s);
trim_right(v3s);
std::cout << v1s << "\n"
<< v2s << "\n"
<< v3s << "\n";
return 0;
}
A simple approach would be having the "current byte" (acc in the following), the associated number of used bits in it (bitcount) and a vector of fully processed bytes (output):
int acc = 0;
int bitcount = 0;
std::vector<unsigned char> output;
void writeBits(int size, unsigned long long x)
{
while (size > 0)
{
// sz = how many bits we're about to copy
int sz = size;
// max avail space in acc
if (sz > 8 - bitcount) sz = 8 - bitcount;
// get the bits
acc |= ((x >> (size - sz)) << (8 - bitcount - sz));
// zero them off in x (1ULL keeps the shift 64-bit; a plain int 1 breaks for shifts of 31 or more)
x &= (1ULL << (size - sz)) - 1;
// acc got bigger and x got smaller
bitcount += sz;
size -= sz;
if (bitcount == 8)
{
// got a full byte!
output.push_back(acc);
acc = bitcount = 0;
}
}
}
void writeNumber(unsigned long long x)
{
// How big is it?
int size = 0;
while (size < 64 && x >= (1ULL << size))
size++;
writeBits(size, x);
}
Note that at the end of the processing you should check whether any bits are still in the accumulator (bitcount > 0) and, if so, flush them with output.push_back(acc);. The unused low bits of that last byte are zero padding.
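Putting the pieces together, including the final flush, might look like the following sketch. It wraps the same accumulator idea in a small struct; the names BitPacker and flush are assumptions made for illustration, and each chunk is masked rather than cleared from x.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative bit packer (names are not from the answer above).
struct BitPacker {
    std::vector<std::uint8_t> output;  // fully processed bytes
    unsigned acc = 0;                  // current, partially filled byte
    int bitcount = 0;                  // number of bits already used in acc

    // Append the low `size` bits of x, most significant bit first.
    void writeBits(int size, std::uint64_t x) {
        assert(size >= 0 && size <= 64);
        while (size > 0) {
            int sz = size;                             // bits to copy this round
            if (sz > 8 - bitcount) sz = 8 - bitcount;  // limited by free space in acc
            std::uint64_t chunk = (x >> (size - sz)) & ((1ULL << sz) - 1);
            acc |= static_cast<unsigned>(chunk << (8 - bitcount - sz));
            bitcount += sz;
            size -= sz;
            if (bitcount == 8) {                       // got a full byte
                output.push_back(static_cast<std::uint8_t>(acc));
                acc = 0;
                bitcount = 0;
            }
        }
    }

    // Append x without its leading zeros (writes nothing for x == 0).
    void writeNumber(std::uint64_t x) {
        int size = 0;
        while (size < 64 && x >= (1ULL << size)) ++size;
        writeBits(size, x);
    }

    // Emit the last partial byte, padded with zero bits on the right.
    void flush() {
        if (bitcount > 0) {
            output.push_back(static_cast<std::uint8_t>(acc));
            acc = 0;
            bitcount = 0;
        }
    }
};
```

For the values from the question (8, 127, 128, 1), calling writeNumber on each and then flush leaves the three bytes 0x8F, 0xF0, 0x10 in output, i.e. the 20 packed bits followed by four padding zeros.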
Note also that if speed is an issue, then using a bigger accumulator is probably a good idea (however the output will then depend on machine endianness). Also, finding how many bits a number uses can be done much faster than with the linear search above (for example, x86 has a dedicated machine instruction, BSR, for this).
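As a sketch of the faster bit count: GCC and Clang expose a count-leading-zeros builtin, __builtin_clzll, which typically compiles down to a single instruction on x86. The function name bit_length is an assumption for illustration, and other compilers fall back to the linear loop here.

```cpp
#include <cassert>
#include <climits>

// Number of significant bits in x (0 for x == 0).
inline int bit_length(unsigned long long x) {
    if (x == 0) return 0;  // the clz builtin is undefined for 0
#if defined(__GNUC__) || defined(__clang__)
    int n = static_cast<int>(sizeof(x) * CHAR_BIT) - __builtin_clzll(x);
#else
    int n = 0;             // portable fallback: linear search
    while (x != 0) { ++n; x >>= 1; }
#endif
    assert(n >= 1);
    return n;
}
```

The result can be passed straight to writeBits as the size argument in place of the while loop in writeNumber. With C++20, std::bit_width from <bit> offers the same operation in the standard library.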