I have a list of items called L, and a sophisticated Python function called func; the usual way is a Python list comprehension like:
out = [func(item) for item in L]
But it's single-threaded, so I want to implement a function in C++ and bind it with pybind11.
The C++ side:
m.def("test_func_iter", [](const py::object &func, const py::sequence &iter) {
    auto n = len(iter);
    py::list l(n);
    unsigned int k = std::thread::hardware_concurrency();
    std::thread threads[k];
    auto stride = n / k;
    // [0, n//k), [n//k, ...), [..., n)
    for (unsigned int w = 0; w < k; ++w) {
        if (w < k - 1) {
            threads[w] = std::thread([&l, &func, &iter](size_t start, size_t end) {
                for (size_t i = start; i < end; ++i) {
                    std::cout << "h: " << i << std::endl;
                    l[i] = func(iter[i]);
                }
            }, w * stride, (w + 1) * stride);
        } else {
            threads[w] = std::thread([&l, &func, &iter](size_t start, size_t end) {
                for (size_t i = start; i < end; ++i) {
                    std::cout << "h: " << i << std::endl;
                    l[i] = func(iter[i]);
                }
            }, w * stride, n);
        }
    }
    std::cout << "Done spawning threads! Now wait for them to finish.\n";
    for (auto& t : threads) {
        t.join();
    }
    std::cout << "end" << std::endl;
    return py::type::of(iter)(l);
});
And when I invoke the corresponding bound function in Python, like:
import gmpy2

def func(i):
    # simplified stand-in for the actual logic: a sophisticated function that is hard to rewrite entirely in C++
    print(i, i == 0)
    return int(gmpy2.mpz(i) + 100)

b = test_func_iter(func, list(range(100)))
print(b)
And I get the output and error like:
h: h: Done spawning threads! Now wait for them to finish.
050
0 True
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
I have tried a few things:
Without threads: everything is OK in Python.
With threads and k=1 (a single thread): everything is OK in Python.
With threads and k>=2: crash.
BTW, I use a Mac M1 laptop, and the clang version is 12.0.5.
I am new to C++ and guess the reason may be the use of threads, but I could not find any suggestions on Google. Can anybody give some hints? (Or some suggestions about the original problem: an elegant way to get multi-thread support with pybind11.) Thanks!
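(No answer is included above, but for what it's worth: pybind11 requires the GIL to be held around any operation on Python objects -- calling func, indexing iter, assigning into l -- and the worker threads above never acquire it, which is consistent with "fine with one thread, SIGSEGV with two". Below is a hedged sketch, not a verified fix, of how the binding might be restructured with pybind11's py::gil_scoped_release / py::gil_scoped_acquire; the same includes and module boilerplate as above are assumed. Note that because func is pure Python, the calls still serialize on the GIL, so this mainly avoids the crash rather than buying real parallelism.)

m.def("test_func_iter", [](const py::object &func, const py::sequence &iter) {
    auto n = len(iter);
    py::list l(n);
    {
        py::gil_scoped_release release;              // let the workers take the GIL
        unsigned int k = std::thread::hardware_concurrency();
        std::vector<std::thread> threads(k);
        auto stride = n / k;
        for (unsigned int w = 0; w < k; ++w) {
            size_t start = w * stride;
            size_t end = (w == k - 1) ? n : (w + 1) * stride;
            threads[w] = std::thread([&l, &func, &iter, start, end] {
                for (size_t i = start; i < end; ++i) {
                    py::gil_scoped_acquire gil;      // required before touching any Python object
                    l[i] = func(iter[i]);
                }
            });
        }
        for (auto &t : threads) t.join();
    }                                                 // GIL re-acquired when 'release' goes out of scope
    return py::type::of(iter)(l);
});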
I have to calculate the position of an element of an n-dimensional matrix inside a 1D array, at compile time.
In short, a mapping array[i][j][k]...[n] -> array[...], where i,j,k,...,n are in {1,2,3} and the dimension of every index is DIM = 3. That means every row and column has 3 elements.
My main problem is writing the summation over n indices (a parameter pack) as a template, using constexpr to evaluate the sum at compile time.
My research in other Stack Overflow posts resulted in the following formula for 3 dimensions:
a[i][j][k] -> a[(i*DIM*DIM) + (j*DIM) + k]
If we expand it to n dimensions, it results in the following formula:
a[i][j][k]...[n] -> a[(n*DIM^(indexAmount-1)) + ... + (i*DIM*DIM) + (j*DIM) + k]
Furthermore, I wrote code to generate the addends of the sum using templates and constexpr, which is shown below.
/**
 * calculates DIM3^(indexPos)
 */
template<auto count>
int constexpr multiple_dim(){
    if constexpr (count == 0){
        return 1;
    }else{
        return DIM3 * multiple_dim<count-1>();
    }
}

/**
 * calculates addends for the end summation
 * e.g. if we have 3 indices i,j,k. j would be at position 2
 * and j = 1. The parameters would be IndexPos = 2, index = 1.
 */
template<auto indexPos, auto index>
int constexpr calculate_flattened_index(){
    if constexpr (indexPos == 0){
        return (index-1);
    }else{
        return (index-1) * multiple_dim<indexPos>();
    }
}

/**
 * calculates the position of an element inside a
 * nD matrix and maps it to a position in 1D
 * A[i][j]..[n] -> ???? not implemented yet
 * @tparam Args
 * @return
 */
template<auto ...Args>
[[maybe_unused]] auto constexpr pos_nd_to_1d(){
    /* maybe iterate over all indices inside the parameter pack?
    const int count = 1;
    for(int x : {Args...}){
    }
    return count;
    */
}
Example output for elements of a 3D matrix A:
A111, A121, A131. The sum over the 3 addends gives the position in 1D, e.g. A121 -> 0 + 3 + 0 = 3, so A121 would be placed at array[3] in a 1-dimensional array.
std::cout << "Matrix A111" << std::endl;
//A111
std::cout << calculate_flattened_index<0 , 1>() << std::endl;
std::cout << calculate_flattened_index<1 , 1>() << std::endl;
std::cout << calculate_flattened_index<2 , 1>() << std::endl;
std::cout << "Matrix A121" << std::endl;
//A121
std::cout << calculate_flattened_index<0 , 1>() << std::endl;
std::cout << calculate_flattened_index<1 , 2>() << std::endl;
std::cout << calculate_flattened_index<2 , 1>() << std::endl;
std::cout << "Matrix A131" << std::endl;
//A131
std::cout << calculate_flattened_index<0 , 1>() << std::endl;
std::cout << calculate_flattened_index<1 , 3>() << std::endl;
std::cout << calculate_flattened_index<2 , 1>() << std::endl;
Output:
Matrix A111
0
0
0
Matrix A121
0
3
0
Matrix A131
0
6
0
A desired output could look like the following:
Function call
pos_nd_to_1d<1,1,1>() //A111
pos_nd_to_1d<1,2,1>() //A121
pos_nd_to_1d<1,3,1>() //A131
Output:
0 //0+0+0
3 //0+3+0
6 //0+6+0
If I understand correctly... you're looking for something like the following
template <auto ... as>
auto constexpr pos_nd_to_1d ()
{
    std::size_t i { 0u };
    ((i *= DIM, i += as - 1u), ...);
    return i;
}
Or maybe you can use std::common_type, for i,
std::common_type_t<decltype(as)...> i {};
but for indices I suggest the use of std::size_t (also std::size_t ... as).
The following is a full compiling example
#include <iostream>

constexpr auto DIM = 3u;

template <auto ... as>
auto constexpr pos_nd_to_1d ()
{
    std::size_t i { 0u };
    ((i *= DIM, i += as - 1u), ...);
    return i;
}

int main ()
{
    std::cout << pos_nd_to_1d<1u, 1u, 1u>() << std::endl;
    std::cout << pos_nd_to_1d<1u, 2u, 1u>() << std::endl;
    std::cout << pos_nd_to_1d<1u, 3u, 1u>() << std::endl;
}
-- EDIT --
The OP asks:
could you explain how this code works? I am a bit new to c++.
I'm better at coding than at explaining, anyway...
What I've used here
((i *= DIM, i += as - 1u), ...);
//...^^^^^^^^^^^^^^^^^^^^^^ repeated part
is called a "fold expression" (also "folding" or "template folding"); it's a C++17 feature (you can obtain the same result in C++14, and even C++11 though not constexpr, but in a less simple and elegant way) that consists of expanding a variadic template pack over an operator.
For example, if you want to sum the indices, you can simply write
(as + ...);
and the expression becomes
(a0 + (a1 + (a2 + (/* etc */))));
In this case I've used the fact that the comma is an operator, so the expression
((i *= DIM, i += as - 1u), ...);
becomes
((i *= DIM, i += a0 - 1u),
((i *= DIM, i += a1 - 1u),
((i *= DIM, i += a2 - 1u),
/* etc. */ )))))
Observe that, this way, the first i *= DIM has no effect (because i is initialized to zero), but the subsequent i *= DIM steps multiply the previously accumulated as - 1u terms by DIM the right number of times.
So when as... is 1, 2, 1, for example, you get
(1 - 1)*DIM*DIM + (2 - 1)*DIM + (1 - 1)
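A quick compile-time check of those worked values, reusing pos_nd_to_1d and DIM = 3 from the full example above:

static_assert(pos_nd_to_1d<1u, 1u, 1u>() == 0u, "A111");
static_assert(pos_nd_to_1d<1u, 2u, 1u>() == 3u, "A121");
static_assert(pos_nd_to_1d<1u, 3u, 1u>() == 6u, "A131");
static_assert(pos_nd_to_1d<1u, 1u, 2u>() == 1u, "A112: the last index varies fastest");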
I am trying to implement a parallel quadratic sieve using OpenMP. In the sieving phase, I am using log approximations to check divisibility. This is my code.
#pragma omp parallel for schedule (dynamic) num_threads(4)
for (int i = 0; i < factorBase.size(); ++i) {
    const uint32_t p = factorBase[i];
    const float logp = std::log(factorBase[i]) / std::log(2);
    // Sieve first sequence.
    while (startIndex.first[i] < intervalEnd) {
        logApprox[startIndex.first[i] - intervalStart] -= logp;
        startIndex.first[i] += p;
    }
    if (p == 2)
        continue; // a^2 = N (mod 2) only has one root.
    // Sieve second sequence.
    while (startIndex.second[i] < intervalEnd) {
        logApprox[startIndex.second[i] - intervalStart] -= logp;
        startIndex.second[i] += p;
    }
}
Here factorBase and logApprox are std::vectors initialized as follows:
std::vector<float> logApprox(INTERVAL_LENGTH, 0);
std::vector<uint32_t> factorBase;
Whenever I run this code and compare the running times, there is not much difference between the sequential and parallel runs. What are some optimizations that can be done? I am a beginner in OpenMP and any help is appreciated. Thanks.
Very interesting task you have! Thanks!
I decided to make my own implementation with a great many optimizations.
I achieved a 20.4x boost compared to your original code (your code takes 17.86 seconds, mine takes 0.87 seconds). I also used 2x less memory for sieving compared to your algorithm, while achieving the same goal.
To make the comparison I simplified your code in such a way that it still does almost the same thing and runs in exactly the same time, but looks much simpler:
#pragma omp parallel for
for (size_t i = 0; i < factorBase.size(); ++i) {
    auto const p = factorBase[i];
    float const logp = std::log(p) / std::log(2);
    while (startIndex[i] < logApprox.size()) {
        logApprox[startIndex[i]] += logp;
        startIndex[i] += p;
    }
}
You can see that I kept only a single sieve loop; the second one does the same thing and is not necessary for the demonstration, so I removed it. I also removed startInterval as it is irrelevant to the speed demonstration. And for simplicity I did += of the logarithm instead of your -=.
One important note regarding your algorithm is that it doesn't do any synchronization, which means that different CPU cores may write to the same entry of the logApprox array and hence give a wrong result.
And as I have measured, this wrong result happens once or twice per hundred million entries of the logApprox array. My optimized code overcame this limitation and did correct synchronization, besides doing all the speed optimizations.
I made the following improvements to gain the 20x speedup:
I split the whole array into blocks of approximately 2^13 elements. Each group of blocks is processed by a separate thread/CPU core, so no synchronization between threads is needed. Besides avoiding synchronization, what is very important is that a 2^13 block fits fully into the L1 or L2 cache of the CPU, which speeds things up a lot.
Each 2^13 block is processed for all possible primes. To keep track of which offsets of which primes are needed, I created a special ring buffer of 2^7 size; this ring buffer is indexed by block number modulo 2^7 and keeps track of which primes with which offsets are needed for each block (modulo 2^7).
I use as many threads as there are CPU cores. For each thread I precompute the starting offsets of all primes for that thread; these starting offsets are computed through modular arithmetic based on the startIndex array that you provided in your original code.
To speed things up even more, instead of a float logarithm I use an integer logarithm based on uint16_t. This integer logarithm is computed as uint16_t integer_log = uint16_t(std::log2(p) * (1 << 8) + 0.5);. Besides speeding up the += of integer logarithms, this also halves the occupied memory. If for some reason a uint16_t logarithm is not enough for you, then please replace using ILog2T = u16; with using ILog2T = u32; in my code, but this will double the amount of used memory.
My code outputs the following to the console:
time_simple 17.859 sec, time_optimized 0.874 sec, boost 20.434, correct_ratio 0.999999993
time_simple is the time of your original code sieving an array of size 2^28, time_optimized is my code on the same array, and boost is how much faster my code is (you can see it is 20x faster). correct_ratio says whether there are any errors in your code due to the absence of multi-core synchronization (as you can see, it is sometimes less than 1.0, hence there are some errors).
Full optimized code below:
Try it online!
#include <cstdint>
#include <random>
#include <iostream>
#include <iomanip>
#include <chrono>
#include <thread>
#include <type_traits>
#include <vector>
#include <stdexcept>
#include <sstream>
#include <mutex>
#include <omp.h>
#define ASSERT_MSG(cond, msg) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "! Msg: '" + std::string(msg) + "'."); }
#define ASSERT(cond) ASSERT_MSG(cond, "")
#define OSTR(code) ([&]{ std::ostringstream ss; ss code; return ss.str(); }())
#define COUT(code) { std::unique_lock<std::mutex> lock(cout_mux); std::cout code; std::cout << std::flush; }
#define LN { COUT(<< "LN " << __LINE__ << std::endl); }
#define DUMP(var) { COUT(<< #var << " = (" << (var) << ")" << std::endl); }
using u16 = uint16_t;
using u32 = uint32_t;
using u64 = uint64_t;
using ILog2T = u16;
using PrimeT = u32;
std::mutex cout_mux;
template <typename T>
std::vector<T> GenPrimes(size_t end) {
thread_local std::vector<T> primes = {2, 3};
while (primes.back() < end) {
for (T p = primes.back() + 2;; p += 2) {
bool is_prime = true;
for (auto d: primes) {
if (u64(d) * d > p)
break;
if (p % d == 0) {
is_prime = false;
break;
}
}
if (is_prime) {
primes.push_back(p);
break;
}
}
}
primes.pop_back();
return primes;
}
void SieveA(std::vector<float> & logApprox, std::vector<PrimeT> const & factorBase, std::vector<PrimeT> startIndex) {
#pragma omp parallel for
for (size_t i = 0; i < factorBase.size(); ++i) {
auto const p = factorBase[i];
float const logp = std::log(p) / std::log(2);
while (startIndex[i] < logApprox.size()) {
logApprox[startIndex[i]] += logp;
startIndex[i] += p;
}
}
}
size_t NThreads() {
//return 1;
return std::thread::hardware_concurrency();
}
ILog2T LogToI(double x) { return ILog2T(x * (1ULL << (sizeof(ILog2T) * 8 - 8)) + 0.5); }
double IToLog(ILog2T x) { return x / double(1ULL << (sizeof(ILog2T) * 8 - 8)); }
double Time() {
static auto const gtb = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::duration<double>>(
std::chrono::high_resolution_clock::now() - gtb).count();
}
std::string FloatToStr(double x, size_t round = 6) {
return OSTR(<< std::fixed << std::setprecision(round) << x);
}
double SieveB(std::vector<ILog2T> & logs, std::vector<PrimeT> const & primes, std::vector<PrimeT> const & starts0) {
auto const nthr = NThreads();
std::vector<std::vector<PrimeT>> starts(nthr, std::vector<PrimeT>(primes.size()));
std::vector<std::vector<ILog2T>> plogs(nthr, std::vector<ILog2T>(primes.size()));
std::vector<std::pair<u64, u64>> ranges(nthr);
size_t constexpr block_log2 = 13, block = 1 << block_log2, ring_log2 = 6, ring_size = 1ULL << ring_log2, ring_mask = ring_size - 1;
std::vector<std::vector<std::vector<std::pair<u32, u32>>>> ring(nthr, std::vector<std::vector<std::pair<u32, u32>>>(ring_size));
#pragma omp parallel for
for (size_t ithr = 0; ithr < nthr; ++ithr) {
size_t const nblock = ((logs.size() + nthr - 1) / nthr + block - 1) / block * block,
begin = ithr * nblock, end = std::min<size_t>(logs.size(), (ithr + 1) * nblock);
ranges[ithr] = {begin, end};
for (size_t i = 0; i < primes.size(); ++i) {
PrimeT const p = primes[i];
size_t const mod0 = begin % p, mod = starts0[i] < mod0 ? p + starts0[i] - mod0 : starts0[i] - mod0;
starts[ithr][i] = mod;
plogs[ithr][i] = LogToI(std::log2(p));
ring[ithr][((begin + starts[ithr][i]) >> block_log2) & ring_mask].push_back({i, begin + starts[ithr][i]});
}
}
auto tim = Time();
#pragma omp parallel for
for (size_t ithr = 0; ithr < nthr; ++ithr) {
auto const [begin, end] = ranges[ithr];
auto const [bbegin, bend] = std::make_tuple(begin / block, (end - 1) / block + 1);
auto const & cstarts = starts.at(ithr);
auto const & cplogs = plogs.at(ithr);
auto & cring = ring[ithr];
std::decay_t<decltype(cring[0])> tmp;
size_t hit_cnt = 0, miss_cnt = 0;
for (size_t iblock = bbegin; iblock < bend; ++iblock) {
size_t const cbegin = iblock << block_log2, cend = std::min<size_t>(end, (iblock + 1) << block_log2);
auto & ring_cur = cring[iblock & ring_mask];
tmp = ring_cur;
ring_cur.clear();
for (auto [ip, off]: tmp)
if (off >= cend) {
//++miss_cnt;
ring_cur.push_back({ip, off});
} else {
//++hit_cnt;
auto const p = primes[ip];
auto const plog = cplogs[ip];
for (; off < cend; off += p) {
//if (8192 - 10 <= off && off <= 8192 + 10) COUT(<< "logs.size() " << logs.size() << " begin " << begin << " end " << end << " bbegin " << bbegin << " bend " << bend << " cbegin " << cbegin << " cend " << cend << " iblock " << iblock << " off " << off << " p " << p << " plog " << plog << std::endl);
logs[off] += plog;
}
if (off < end)
cring[(off >> block_log2) & ring_mask].push_back({ip, off});
}
}
//COUT(<< "hit_ratio " << std::fixed << std::setprecision(6) << double(hit_cnt) / (hit_cnt + miss_cnt) << std::endl);
}
return Time() - tim;
}
void Test() {
size_t constexpr len = 1ULL << 28;
std::mt19937_64 rng{123};
auto const primes = GenPrimes<PrimeT>(1 << 12);
std::vector<PrimeT> starts;
for (auto p: primes)
starts.push_back(rng() % p);
ASSERT(primes.size() == starts.size());
double tA = 0, tB = 0;
std::vector<float> logsA(len);
std::vector<ILog2T> logsB(len);
{
tA = Time();
SieveA(logsA, primes, starts);
tA = Time() - tA;
}
{
tB = SieveB(logsB, primes, starts);
}
size_t correct = 0;
for (size_t i = 0; i < len; ++i) {
//ASSERT_MSG(std::abs(logsA[i] - IToLog(logsB[i])) < 0.1, "i " + std::to_string(i) + " logA " + FloatToStr(logsA[i], 3) + " logB " + FloatToStr(IToLog(logsB[i]), 3));
if (std::abs(logsA[i] - IToLog(logsB[i])) < 0.1)
++correct;
}
std::cout << std::fixed << std::setprecision(3) << "time_simple " << tA << " sec, time_optimized " << tB << " sec, boost " << (tA / tB) << ", correct_ratio " << std::setprecision(9) << double(correct) / len << std::endl;
}
int main() {
try {
omp_set_num_threads(NThreads());
Test();
return 0;
} catch (std::exception const & ex) {
std::cout << "Exception: " << ex.what() << std::endl;
return -1;
}
}
Output:
time_simple 17.859 sec, time_optimized 0.874 sec, boost 20.434, correct_ratio 0.999999993
In my opinion, you should switch the schedule to static and give it a chunk size (https://software.intel.com/en-us/articles/openmp-loop-scheduling).
A small optimization would be:
outside of the big for loop, declare a const initialized to 1/std::log(2), and then inside the loop multiply by that const instead of dividing by std::log(2); division is very expensive in CPU cycles.
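For what it's worth, here is a sketch of both suggestions applied to the loop from the question; the chunk size of 16 and the helper name sieve are arbitrary choices of mine, and only the first start-index sequence is shown.

#include <cmath>
#include <cstdint>
#include <vector>

// Static schedule with an explicit chunk size, plus 1/log(2) hoisted out of
// the loop so each prime uses a multiply instead of a divide.
void sieve(std::vector<float>& logApprox,
           const std::vector<uint32_t>& factorBase,
           std::vector<uint32_t>& startIndex,          // simplified: one start per prime
           uint32_t intervalStart, uint32_t intervalEnd)
{
    const float inv_log2 = 1.0f / std::log(2.0f);
    #pragma omp parallel for schedule(static, 16) num_threads(4)
    for (int i = 0; i < (int)factorBase.size(); ++i) {
        const uint32_t p = factorBase[i];
        const float logp = std::log((float)p) * inv_log2;
        while (startIndex[i] < intervalEnd) {
            logApprox[startIndex[i] - intervalStart] -= logp;
            startIndex[i] += p;
        }
    }
}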
We are trying to understand the accumarray function of MATLAB, and wanted to write C/C++ code for it to aid our understanding. Can someone help us with a sample/pseudo code?
According to the documentation,
The function processes the input as follows:
1. Find out how many unique indices there are in subs. Each unique index defines a bin in the output array. The maximum index value in subs determines the size of the output array.
2. Find out how many times each index is repeated. This determines how many elements of vals are going to be accumulated at each bin in the output array.
3. Create an output array. The output array is of size max(subs) or of size sz.
4. Accumulate the entries in vals into bins using the values of the indices in subs and apply fun to the entries in each bin.
5. Fill the values in the output for positions not referred to by subs. The default fill value is zero; use fillval to set a different value.
So, translating to C++ (this is untested code),
#include <algorithm>   // std::max_element, std::fill_n
#include <functional>  // std::plus
#include <iterator>    // std::iterator_traits
#include <vector>
#include <cstddef>     // std::size_t

template< typename sub_it, typename val_it, typename out_it,
          typename fun = std::plus< typename std::iterator_traits< val_it >::value_type >,
          typename T = typename fun::result_type >
out_it accumarray( sub_it first_index, sub_it last_index,
                   val_it first_value, // val_it last_value, -- 1 value per index
                   out_it first_out,
                   fun f = fun(), T fillval = T() ) {
    // 1. Get size (indices treated as 0-based here, so max index + 1 bins).
    std::size_t sz = *std::max_element( first_index, last_index ) + 1;
    std::vector< bool > used_indexes( sz, false ); // 2-3. remember which indexes are used
    std::fill_n( first_out, sz, T() );             // 4. initialize output
    while ( first_index != last_index ) {
        std::size_t index = *first_index;
        used_indexes[ index ] = true;              // 2-3. remember that this index was used
        first_out[ index ] = f( first_out[ index ], *first_value ); // 5. accumulate
        ++first_value;
        ++first_index;
    }
    // If fill is different from zero, reinitialize untouched values
    if ( fillval != T() ) {
        out_it fill_it = first_out;
        for ( std::vector< bool >::iterator used_it = used_indexes.begin();
              used_it != used_indexes.end(); ++used_it, ++fill_it ) {
            if ( !*used_it ) *fill_it = fillval;
        }
    }
    return first_out + sz;
}
This has a few shortcomings, for example the accumulation function is called repeatedly instead of once with the entire column vector. The output is placed in pre-allocated storage referenced by first_out. The index vector must be the same size as the value vector. But most of the features should be captured pretty well.
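For reference, a minimal usage sketch of the template above (after the fixes noted in its comments), mirroring the MATLAB example val = 101:105, subs = [1;2;4;2;4] but with 0-based indices. It assumes a pre-C++20 standard, since the default T relies on std::plus<>::result_type.

#include <iostream>
#include <vector>

int main() {
    // 0-based equivalent of the MATLAB example subs = [1;2;4;2;4], val = 101:105
    std::vector<std::size_t> subs = {0, 1, 3, 1, 3};
    std::vector<double> vals = {101, 102, 103, 104, 105};
    std::vector<double> out(4);   // must already hold max(subs) + 1 bins
    accumarray(subs.begin(), subs.end(), vals.begin(), out.begin());
    for (double v : out)
        std::cout << v << '\n';   // 101, 206, 0, 208
}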
Many thanks for your response. We were able to fully understand it and implement the same in C++ (we used Armadillo). Here is the code:
colvec TestProcessing::accumarray(icolvec cf, colvec T, double nf, int p)
{
    /* ******* Description *******
    here cf is the matrix of indices
    T holds the values whose data is to be
    accumulated in the output array S.
    if T is not given (or is a scalar) then accumarray simply reduces
    to calculating a histogram of the input data
    nf is the size of the output array
    nf >= max(cf)
    so pass the argument accordingly
    p is not used in the function
    ********************************/
    colvec S;                 // output array
    S.set_size(int(nf));      // preallocate the output array
    for(int i = 0 ; i < (int)nf ; i++)
    {
        // find the indices in cf corresponding to 1 to nf
        // and store them in the unsigned integer array q1
        uvec q1 = find(cf == (i+1));
        vec q;
        double sum1 = 0;
        if(!q1.is_empty())
        {
            q = T.elem(q1);       // find the elements in T having indices in q1
                                  // make sure q1 is not empty
            sum1 = arma::sum(q);  // calculate the sum and store it in the output array
            S(i) = sum1;
        }
        // if q1 is an empty array just put 0 at that particular location
        else
        {
            S(i) = 0;
        }
    }
    return S;
}
Hope this will help others too!
Thanks again to everybody who contributed :)
Here's what I came up with. Note: I went for readability (since you wanted to understand it), rather than optimization. Oh, and I've never used MATLAB; I was just going off of this sample I saw just now:
val = 101:105;
subs = [1; 2; 4; 2; 4]
subs =
1
2
4
2
4
A = accumarray(subs, val)
A =
101 % A(1) = val(1) = 101
206 % A(2) = val(2)+val(4) = 102+104 = 206
0 % A(3) = 0
208 % A(4) = val(3)+val(5) = 103+105 = 208
Anyway, here's the code sample:
#include <iostream>
#include <stdio.h>
#include <cstdlib>   // abs
#include <vector>
#include <map>
class RangeValues
{
public:
RangeValues(int startValue, int endValue)
{
int range = endValue - startValue;
// Reserve all needed space up front
values.resize(abs(range) + 1);
unsigned int index = 0;
for ( int i = startValue; i != endValue; iterateByDirection(range, i), ++index )
{
values[index] = i;
}
}
std::vector<int> GetValues() const { return values; }
private:
void iterateByDirection(int range, int& value)
{
( range < 0 ) ? --value : ++value;
}
private:
std::vector<int> values;
};
typedef std::map<unsigned int, int> accumMap;
accumMap accumarray( const RangeValues& rangeVals )
{
accumMap aMap;
std::vector<int> values = rangeVals.GetValues();
unsigned int index = 0;
std::vector<int>::const_iterator itr = values.begin();
for ( itr; itr != values.end(); ++itr, ++index )
{
aMap[index] = (*itr);
}
return aMap;
}
int main()
{
// Our value range will be from -10 to 10
RangeValues values(-10, 10);
accumMap aMap = accumarray(values);
// Now iterate through and check out what values map to which indices.
accumMap::const_iterator itr = aMap.begin();
for ( itr; itr != aMap.end(); ++itr )
{
std::cout << "Index: " << itr->first << ", Value: " << itr->second << '\n';
}
//Or much like the MATLAB Example:
std::cout << aMap[5]; // -5, since our range was from -10 to 10
}
In addition to Vicky Budhiraja's "armadillo" example, this one is a 2D version of accumarray using semantics similar to the matlab function:
arma::mat accumarray (arma::mat& subs, arma::vec& val, arma::rowvec& sz)
{
    arma::u32 ar = sz.col(0)(0);
    arma::u32 ac = sz.col(1)(0);
    arma::mat A; A.set_size(ar, ac);
    for (arma::u32 r = 0; r < ar; ++r)
    {
        for (arma::u32 c = 0; c < ac; ++c)
        {
            arma::uvec idx = arma::find(subs.col(0) == r &&
                                        subs.col(1) == c);
            if (!idx.is_empty())
                A(r, c) = arma::sum(val.elem(idx));
            else
                A(r, c) = 0;
        }
    }
    return A;
}
The sz input is a two-column vector that contains the number of rows / number of columns for the output matrix A. The subs matrix has 2 columns and the same number of rows as val. The number of rows of val is basically sz.rows * sz.cols.
The sz (size) input is not really mandatory and can easily be deduced by searching for the max in the subs columns:
arma::u32 sz_rows = arma::max(subs.col(0)) + 1;
arma::u32 sz_cols = arma::max(subs.col(1)) + 1;
or
arma::u32 sz_rows = arma::max(subs.col(0)) + 1;
arma::u32 sz_cols = val.n_elem / sz_rows;
the output matrix is now :
arma::mat A (sz_rows, sz_cols);
the accumarray function becomes:
arma::mat accumarray (arma::mat& subs, arma::vec& val)
{
    arma::u32 sz_rows = arma::max(subs.col(0)) + 1;
    arma::u32 sz_cols = arma::max(subs.col(1)) + 1;
    arma::mat A (sz_rows, sz_cols);
    for (arma::u32 r = 0; r < sz_rows; ++r)
    {
        for (arma::u32 c = 0; c < sz_cols; ++c)
        {
            arma::uvec idx = arma::find(subs.col(0) == r &&
                                        subs.col(1) == c);
            if (!idx.is_empty())
                A(r, c) = arma::sum(val.elem(idx));
            else
                A(r, c) = 0;
        }
    }
    return A;
}
For example :
arma::vec val = arma::regspace(101, 106);
arma::mat subs;
subs << 0 << 0 << arma::endr
<< 1 << 1 << arma::endr
<< 2 << 1 << arma::endr
<< 0 << 0 << arma::endr
<< 1 << 1 << arma::endr
<< 3 << 0 << arma::endr;
arma::mat A = accumarray (subs, val);
A.raw_print("A =");
Produces this result:
A =
205 0
0 207
0 103
106 0
This example is found here : http://fr.mathworks.com/help/matlab/ref/accumarray.html?requestedDomain=www.mathworks.com
except for the indices in subs: armadillo uses 0-based indices where matlab is 1-based.
Unfortunately, the previous code is not suitable for big matrices. Two for-loops with a find on a vector in between is a really bad thing. The code is good for understanding the concept, but it can be optimized into a single loop like this one:
arma::mat accumarray(arma::mat& subs, arma::vec& val)
{
    arma::u32 ar = arma::max(subs.col(0)) + 1;
    arma::u32 ac = arma::max(subs.col(1)) + 1;
    arma::mat A(ar, ac);
    A.zeros();
    for (arma::u32 r = 0; r < subs.n_rows; ++r)
        A(subs(r, 0), subs(r, 1)) += val(r);
    return A;
}
The only changes are:
init the output matrix with zeros,
loop over the subs rows to get the output indices,
accumulate val into the output (subs & val are row-synchronized).
A 1-D version (vector) of the function can be something like :
arma::vec accumarray (arma::ivec& subs, arma::vec& val)
{
    arma::u32 num_elems = arma::max(subs) + 1;
    arma::vec A (num_elems);
    A.zeros();
    for (arma::u32 r = 0; r < subs.n_rows; ++r)
        A(subs(r)) += val(r);
    return A;
}
For testing 1D version :
arma::vec val = arma::regspace(101, 105);
arma::ivec subs;
subs << 0 << 2 << 3 << 2 << 3;
arma::vec A = accumarray(subs, val);
A.raw_print("A =");
The result conforms to the matlab examples (see the previous link):
A =
101
0
206
208
This is not a strict copy of the matlab accumarray function. For example, the matlab function allows the output vec/mat to have a size defined by sz that is larger than the intrinsic size of the subs/val duo.
Maybe that could be an idea for an addition to the armadillo API: allowing a single interface for different dimensions & types.
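Following up on that last point, a hedged sketch (mine, not part of the answer above) of what a 1-D overload taking an explicit output size, in the spirit of MATLAB's accumarray(subs, val, sz), might look like:

// Hypothetical variant: the caller supplies the output size, which may be
// larger than max(subs) + 1; the extra entries simply stay at zero.
arma::vec accumarray (arma::ivec& subs, arma::vec& val, arma::uword sz)
{
    arma::vec A (sz);
    A.zeros();
    for (arma::u32 r = 0; r < subs.n_rows; ++r)
        A(subs(r)) += val(r);
    return A;
}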
Hey, my friends and I are trying to beat each other's runtimes for generating "Self Numbers" between 1 and a million. I've written mine in C++ and I'm still trying to shave off precious time.
Here's what I have so far:
#include <iostream>

using namespace std;

bool v[1000000];

int main(void) {
    long non_self = 0;
    for(long i = 1; i < 1000000; ++i) {
        if(!(v[i])) std::cout << i << '\n';
        non_self = i + (i%10) + (i/10)%10 + (i/100)%10 + (i/1000)%10 + (i/10000)%10 + (i/100000)%10;
        v[non_self] = 1;
    }
    std::cout << "1000000" << '\n';
    return 0;
}
The code works fine now, I just want to optimize it.
Any tips? Thanks.
I built an alternate C solution that doesn't require any modulo or division operations:
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    int v[1100000];
    int j1, j2, j3, j4, j5, j6, s, n = 0;
    memset(v, 0, sizeof(v));
    for (j6 = 0; j6 < 10; j6++) {
        for (j5 = 0; j5 < 10; j5++) {
            for (j4 = 0; j4 < 10; j4++) {
                for (j3 = 0; j3 < 10; j3++) {
                    for (j2 = 0; j2 < 10; j2++) {
                        for (j1 = 0; j1 < 10; j1++) {
                            s = j6 + j5 + j4 + j3 + j2 + j1;
                            v[n + s] = 1;
                            n++;
                        }
                    }
                }
            }
        }
    }
    for (n = 1; n <= 1000000; n++) {
        if (!v[n]) printf("%6d\n", n);
    }
}
It generates 97786 self numbers including 1 and 1000000.
With output, it takes
real 0m1.419s
user 0m0.060s
sys 0m0.152s
When I redirect output to /dev/null, it takes
real 0m0.030s
user 0m0.024s
sys 0m0.004s
on my 3 Ghz quad core rig.
For comparison, your version produces the same number of numbers, so I assume we're either both correct or equally wrong; but your version chews up
real 0m0.064s
user 0m0.060s
sys 0m0.000s
under the same conditions, or about 2x as much.
That, or the fact that you're using longs, which is unnecessary on my machine. Here, int goes up to 2 billion. Maybe you should check INT_MAX on yours?
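To make that check concrete, a trivial sketch:

#include <climits>
#include <cstdio>

int main() {
    printf("INT_MAX = %d\n", INT_MAX);   // 2147483647 on platforms with a 32-bit int
    return 0;
}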
Update
I had a hunch that it may be better to calculate the sum piecewise. Here's my new code:
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    char v[1100000];
    int j1, j2, j3, j4, j5, j6, s, n = 0;
    int s1, s2, s3, s4, s5;
    memset(v, 0, sizeof(v));
    for (j6 = 0; j6 < 10; j6++) {
        for (j5 = 0; j5 < 10; j5++) {
            s5 = j6 + j5;
            for (j4 = 0; j4 < 10; j4++) {
                s4 = s5 + j4;
                for (j3 = 0; j3 < 10; j3++) {
                    s3 = s4 + j3;
                    for (j2 = 0; j2 < 10; j2++) {
                        s2 = s3 + j2;
                        for (j1 = 0; j1 < 10; j1++) {
                            v[s2 + j1 + n++] = 1;
                        }
                    }
                }
            }
        }
    }
    for (n = 1; n <= 1000000; n++) {
        if (!v[n]) printf("%d\n", n);
    }
}
...and what do you know, that brought down the time for the top loop from 12 ms to 4 ms. Or maybe 8, my clock seems to be getting a bit jittery way down there.
State of affairs, Summary
The actual finding of self numbers up to 1M is now taking roughly 4 ms, and I'm having trouble measuring any further improvements. On the other hand, as long as output is to the console, it will continue to take about 1.4 seconds, my best efforts to leverage buffering notwithstanding. The I/O time so drastically dwarfs computation time that any further optimization would be essentially futile. Thus, although inspired by further comments, I've decided to leave well enough alone.
All times cited are on my (pretty fast) machine and are for comparison purposes with each other only. Your mileage may vary.
Generate the numbers once, copy the output into your code as a gigantic string. Print the string.
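A sketch of that idea; the literal below is truncated, of course, the real one would carry the full precomputed list up to 1000000:

#include <cstdio>

// First few self numbers pasted in as one literal; in the real program the
// table would continue all the way to 1000000.
static const char kSelfNumbers[] =
    "1\n3\n5\n7\n9\n20\n31\n42\n53\n64\n75\n86\n97\n108\n" /* ... truncated ... */;

int main() {
    std::fwrite(kSelfNumbers, 1, sizeof(kSelfNumbers) - 1, stdout);
    return 0;
}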
Those mods (%) look expensive. If you are allowed to move to base 16 (or even base 2), then you can probably code this a lot faster. If you have to stay in decimal, try creating an array of digits for each place (units, tens, hundreds) and build some rollover code. That will make summating the numbers far easier.
Alternatively, you could recognise the behaviour of the core self function (let's call it s):
s = n + f(b,n)
where f(b,n) is the sum of the digits of the number n in base b.
For base 10, it's clear that as the ones (least significant) digit moves through 0,1,2,...,9, n and f(b,n) proceed in lockstep as you move from n to n+1; it's only in the 10% of cases where a 9 rolls over to 0 that they don't, so:
f(b,n+1) = f(b,n) + 1 // 90% of the time
thus the core self function s advances as
n+1 + f(b,n+1) = n + 1 + f(b,n) + 1 = n + f(b,n) + 2
s(n+1) = s(n) + 2 // again, 90% of the time
In the remaining (and easily identifiable) 10% of cases, the 9 rolls back to zero and adds one to the next digit; in the simplest case this subtracts (9-1) from the running total, but it might cascade up through a series of 9s, subtracting 99-1, 999-1, etc.
So the first optimisation can remove most of the work from 90% of your cycles!
if ((n % 10) != 0)
{
n + f(b,n) = n-1 + f(b,n-1) + 2;
}
or
if ((n % 10) != 0)
{
s = old_s + 2;
}
That should be enough to substantially increase your performance without really changing your algorithm.
If you want more, then work out a simple algorithm for the change between iterations for the remaining 10%.
If you want your output to be fast, it may be worth investigating replacing iostream output with plain old printf() - depends on the rules for winning the competition whether this is important.
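Not the answerer's code, just a sketch of how the incremental idea above might be bolted onto the original sieve: keep a running digit sum, bump it in the 90% case, and fall back to a full recompute only when a trailing 9 rolls over.

#include <cstdio>

int main() {
    static bool hit[1000000 + 64] = {false};   // slack for the i + digitsum(i) overshoot
    int digitsum = 0;                          // digit sum of i, maintained incrementally
    int ones = 0;                              // least significant digit of i
    for (int i = 1; i <= 1000000; ++i) {
        if (ones != 9) {                       // the 90% case: the digit sum grows by 1,
            ++ones;                            // so s(n) = n + f(10, n) grows by 2
            ++digitsum;
        } else {                               // a trailing 9 rolled over to 0:
            ones = 0;                          // recompute the digit sum from scratch
            digitsum = 0;
            for (int t = i; t > 0; t /= 10)
                digitsum += t % 10;
        }
        if (!hit[i]) std::printf("%d\n", i);   // any marker of i came from a smaller i
        hit[i + digitsum] = true;              // i + digitsum(i) is not a self number
    }
    return 0;
}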
Multithread (use different arrays/ranges for every thread). Also, don't use more threads than your number of CPU cores =)
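My own illustration of the per-range idea, not the answerer's code: each thread owns a slice of the sieve and only marks entries inside its slice, so no locking is needed; because i + digitsum(i) exceeds i by at most 54 in this range, each thread also scans the 54 numbers just before its slice to catch markers that spill in.

#include <cstdio>
#include <thread>
#include <vector>

static const int LIMIT = 1000000;
static const int MAX_DIGIT_SUM = 54;   // 9 * 6 digits
static bool hit[LIMIT + 1];

static int digit_sum(int n) {
    int s = 0;
    for (; n > 0; n /= 10) s += n % 10;
    return s;
}

int main() {
    unsigned nthreads = std::thread::hardware_concurrency();
    if (nthreads == 0) nthreads = 4;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t) {
        pool.emplace_back([t, nthreads] {
            const int chunk = LIMIT / (int)nthreads;
            const int lo = 1 + (int)t * chunk;
            const int hi = (t + 1 == nthreads) ? LIMIT + 1 : lo + chunk;
            // markers that land in [lo, hi) can only come from i >= lo - MAX_DIGIT_SUM
            for (int i = (lo > MAX_DIGIT_SUM ? lo - MAX_DIGIT_SUM : 1); i < hi; ++i) {
                const int m = i + digit_sum(i);
                if (m >= lo && m < hi) hit[m] = true;   // only touch our own slice
            }
        });
    }
    for (auto& th : pool) th.join();
    for (int i = 1; i <= LIMIT; ++i)
        if (!hit[i]) std::printf("%d\n", i);
    return 0;
}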
cout or printf within a loop will be slow. If you can remove any prints from inside loops, you will see a significant performance increase.
Since the range is limited (1 to 1000000) the maximum sum of the digits does not exceed 9*6 = 54. This means that to implement the sieve a circular buffer of 54 elements should be perfectly sufficient (and the size of the sieve grows very slowly as the range increases).
You already have a sieve-based solution, but it is based on pre-building the full-length buffer (sieve of 1000000 elements), which is rather inelegant (if not completely unacceptable). The performance of your solution also suffers from non-locality of memory access.
For example, this is a possible very simple implementation
#define N 1000000U

void print_self_numbers(void)
{
#define NMARKS 64U /* make it 64 just in case (and to make division work faster :) */
    unsigned char marks[NMARKS] = { 0 };
    unsigned i, imark;
    for (i = 1, imark = i; i <= N; ++i, imark = (imark + 1) % NMARKS)
    {
        unsigned digits, sum;
        if (!marks[imark])
            printf("%u ", i);
        else
            marks[imark] = 0;
        sum = i;
        for (digits = i; digits > 0; digits /= 10)
            sum += digits % 10;
        marks[sum % NMARKS] = 1;
    }
}
(I'm not going for the best possible performance in terms of CPU clocks here, just illustrating the key idea with the circular buffer.)
Of course, the range can easily be turned into a parameter of the function, while the size of the circular buffer can easily be calculated at run time from the range.
As for "optimizations"... there's no point in trying to optimize code that contains I/O operations. You won't achieve anything by such optimizations. If you want to analyze the performance of the algorithm itself, you'll have to put the generated numbers into an output array and print them later.
For such a simple task, the best option would be to think of alternative algorithms that produce the same result. %10 is not usually considered a fast operation.
Why not use the recurrence relation given on the wikipedia page instead?
That should be blazingly fast.
EDIT: Ignore this... the recurrence relation generates some but not all of the self numbers.
In fact only very few of them. That's not particularly clear from the wikipedia page though :(
This may help speed up C++ iostreams output:
cin.tie(0);
ios::sync_with_stdio(false);
Put them in main before you start writing to cout.
I created a CUDA-based solution based on Carl Smotricz's second algorithm. The code to identify Self Numbers itself is extremely fast -- on my machine it executes in ~45 nanoseconds; this is about 150 x faster than Carl Smotricz's algorithm, which ran in 7 milliseconds on my machine.
There is a bottleneck, however, and that seems to be the PCIe interface. It took my code a whopping 43 milliseconds to move the computed data from the graphics card back to RAM. This might be optimizable, and I will look in to this.
Still, 45 nanoseconds is pretty darn fast. Scary fast, actually, and I added code to my program which runs Carl Smotricz's algorithm and compares the results for accuracy. The results are accurate. Here is the program output (compiled in VS2008 64-bit, Windows 7):
UPDATE
I recompiled this code in release mode with full optimization and using static runtime libraries, with significant results. The optimizer seems to have done very well with Carl's algorithm, reducing the runtime from 7 ms to 1 ms. The CUDA implementation sped up as well, from 35 us to 20 us. The memory copy from video card to RAM was unaffected.
Program Output:
Running on device: 'Quadro NVS 295'
Reference Implementation Ran In 15603 ticks (7 ms)
Kernel Executed in 40 ms -- Breakdown:
[kernel] : 35 us (0.09%)
[memcpy] : 40 ms (99.91%)
CUDA Implementation Ran In 111889 ticks (51 ms)
Compute Slots: 1000448 (1954 blocks X 512 threads)
Number of Errors: 0
The code is as follows:
file : main.h
#pragma once
#include <cstdlib>
#include <functional>
typedef std::pair<int*, size_t> sized_ptr;
static sized_ptr make_sized_ptr(int* ptr, size_t size)
{
return make_pair<int*, size_t>(ptr, size);
}
__host__ void ComputeSelfNumbers(sized_ptr hostMem, sized_ptr deviceMemory, unsigned const blocks, unsigned const threads);
inline std::string format_elapsed(double d)
{
char buf[256] = {0};
if( d < 0.00000001 )
{
// show in ps with 4 digits
sprintf(buf, "%0.4f ps", d * 1000000000000.0);
}
else if( d < 0.00001 )
{
// show in ns
sprintf(buf, "%0.0f ns", d * 1000000000.0);
}
else if( d < 0.001 )
{
// show in us
sprintf(buf, "%0.0f us", d * 1000000.0);
}
else if( d < 0.1 )
{
// show in ms
sprintf(buf, "%0.0f ms", d * 1000.0);
}
else if( d <= 60.0 )
{
// show in seconds
sprintf(buf, "%0.2f s", d);
}
else if( d < 3600.0 )
{
// show in min:sec
sprintf(buf, "%01.0f:%02.2f", floor(d/60.0), fmod(d,60.0));
}
// show in h:min:sec
else
sprintf(buf, "%01.0f:%02.0f:%02.2f", floor(d/3600.0), floor(fmod(d,3600.0)/60.0), fmod(d,60.0));
return buf;
}
inline std::string format_pct(double d)
{
char buf[256] = {0};
sprintf(buf, "%.2f", 100.0 * d);
return buf;
}
file: main.cpp
#define _CRT_SECURE_NO_WARNINGS
#include <windows.h>
#include "C:\CUDA\include\cuda_runtime.h"
#include <cstdlib>
#include <iostream>
#include <string>
using namespace std;
#include <cmath>
#include <map>
#include <algorithm>
#include <list>
#include "main.h"
int main()
{
unsigned numVals = 1000000;
int* gold = new int[numVals];
memset(gold, 0, sizeof(int)*numVals);
LARGE_INTEGER li = {0}, li2 = {0};
QueryPerformanceFrequency(&li);
__int64 freq = li.QuadPart;
// get cuda properties...
cudaDeviceProp cdp = {0};
cudaError_t err = cudaGetDeviceProperties(&cdp, 0);
cout << "Running on device: '" << cdp.name << "'" << endl;
// first run the reference implementation
QueryPerformanceCounter(&li);
for( int j6=0, n = 0; j6<10; j6++ )
{
for( int j5=0; j5<10; j5++ )
{
for( int j4=0; j4<10; j4++ )
{
for( int j3=0; j3<10; j3++ )
{
for( int j2=0; j2<10; j2++ )
{
for( int j1=0; j1<10; j1++ )
{
int s = j6 + j5 + j4 + j3 + j2 + j1;
gold[n + s] = 1;
n++;
}
}
}
}
}
}
QueryPerformanceCounter(&li2);
__int64 ticks = li2.QuadPart-li.QuadPart;
cout << "Reference Implementation Ran In " << ticks << " ticks" << " (" << format_elapsed((double)ticks/(double)freq) << ")" << endl;
// now run the cuda version...
unsigned threads = cdp.maxThreadsPerBlock;
unsigned blocks = numVals/threads;
if( numVals%threads ) ++blocks;
unsigned computeSlots = blocks * threads; // this may be != the number of vals since we want 32-thread warps
// allocate device memory for test
int* deviceTest = 0;
err = cudaMalloc(&deviceTest, sizeof(int)*computeSlots);
err = cudaMemset(deviceTest, 0, sizeof(int)*computeSlots);
int* hostTest = new int[numVals]; // the repository for the resulting data on the host
memset(hostTest, 0, sizeof(int)*numVals);
// run the CUDA code...
LARGE_INTEGER li3 = {0}, li4={0};
QueryPerformanceCounter(&li3);
ComputeSelfNumbers(make_sized_ptr(hostTest, numVals), make_sized_ptr(deviceTest, computeSlots), blocks, threads);
QueryPerformanceCounter(&li4);
__int64 ticksCuda = li4.QuadPart-li3.QuadPart;
cout << "CUDA Implementation Ran In " << ticksCuda << " ticks" << " (" << format_elapsed((double)ticksCuda/(double)freq) << ")" << endl;
cout << "Compute Slots: " << computeSlots << " (" << blocks << " blocks X " << threads << " threads)" << endl;
unsigned errorCount = 0;
for( size_t i = 0; i < numVals; ++i )
{
if( gold[i] != hostTest[i] )
{
++errorCount;
}
}
cout << "Number of Errors: " << errorCount << endl;
return 0;
}
file: self.cu
#pragma warning( disable : 4231)
#include <windows.h>
#include <cstdlib>
#include <vector>
#include <iostream>
#include <string>
#include <iomanip>
using namespace std;
#include "main.h"
__global__ void SelfNum(int * slots)
{
__shared__ int N;
N = (blockIdx.x * blockDim.x) + threadIdx.x;
const int numDigits = 10;
__shared__ int digits[numDigits];
for( int i = 0, temp = N; i < numDigits; ++i, temp /= 10 )
{
digits[numDigits-i-1] = temp - 10 * (temp/10) /*temp % 10*/;
}
__shared__ int s;
s = 0;
for( int i = 0; i < numDigits; ++i )
s += digits[i];
slots[N+s] = 1;
}
__host__ void ComputeSelfNumbers(sized_ptr hostMem, sized_ptr deviceMem, const unsigned blocks, const unsigned threads)
{
LARGE_INTEGER li = {0};
QueryPerformanceFrequency(&li);
double freq = (double)li.QuadPart;
LARGE_INTEGER liStart = {0};
QueryPerformanceCounter(&liStart);
// run the kernel
SelfNum<<<blocks, threads>>>(deviceMem.first);
LARGE_INTEGER liKernel = {0};
QueryPerformanceCounter(&liKernel);
cudaMemcpy(hostMem.first, deviceMem.first, hostMem.second*sizeof(int), cudaMemcpyDeviceToHost); // dont copy the overflow - just throw it away
LARGE_INTEGER liMemcpy = {0};
QueryPerformanceCounter(&liMemcpy);
// display performance stats
double e = double(liMemcpy.QuadPart - liStart.QuadPart)/freq,
eKernel = double(liKernel.QuadPart - liStart.QuadPart)/freq,
eMemcpy = double(liMemcpy.QuadPart - liKernel.QuadPart)/freq;
double pKernel = eKernel/e,
pMemcpy = eMemcpy/e;
cout << "Kernel Executed in " << format_elapsed(e) << " -- Breakdown: " << endl
<< " [kernel] : " << format_elapsed(eKernel) << " (" << format_pct(pKernel) << "%)" << endl
<< " [memcpy] : " << format_elapsed(eMemcpy) << " (" << format_pct(pMemcpy) << "%)" << endl;
}
UPDATE2:
I refactored my CUDA implementation to try to speed it up a bit. I did this by unrolling loops manually, fixing some questionable use of __shared__ memory which might have been an error, and getting rid of some redundancy.
The output of my new kernel is:
Reference Implementation Ran In 69610 ticks (5 ms)
Kernel Executed in 2 ms -- Breakdown:
[kernel] : 39 us (1.57%)
[memcpy] : 2 ms (98.43%)
CUDA Implementation Ran In 62970 ticks (4 ms)
Compute Slots: 1000448 (1954 blocks X 512 threads)
Number of Errors: 0
The only code I changed is the kernel itself, so that's all I will post here:
__global__ void SelfNum(int * slots)
{
int N = (blockIdx.x * blockDim.x) + threadIdx.x;
int s = 0;
int temp = N;
s += temp - 10 * (temp/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
slots[N+s] = 1;
}
I wonder if multi-threading would help. This algorithm looks like it would lend itself well to multi-threading. (Poor-man's test of this: Create two copies of the program and run them at the same time. If it runs in less than 200% of the time, multi-threading may help).
I was actually surprised that the code below was faster than any other posted here. I probably measured it wrong, but maybe it helps; or at least it is interesting.
#include <iostream>
#include <cstring>      // memset
#include <algorithm>    // std::for_each
#include <boost/progress.hpp>
class SelfCalc
{
private:
bool array[1000000];
int non_self;
public:
SelfCalc()
{
memset(&array, 0, sizeof(array));
}
void operator()(const int i)
{
if (!(array[i]))
std::cout << i << '\n';
non_self = i + (i%10) + (i/10)%10 + (i/100)%10 + (i/1000)%10 + (i/10000)%10 +(i/100000)%10;
array[non_self] = true;
}
};
class IntIterator
{
private:
int value;
public:
IntIterator(const int _value):value(_value){}
int operator*(){ return value; }
bool operator!=(const IntIterator &v){ return value != v.value; }
int operator++(){ return ++value; }
};
int main()
{
boost::progress_timer t;
SelfCalc selfCalc;
IntIterator i(1), end(100000);
std::for_each(i, end, selfCalc);
std::cout << 100000 << std::endl;
return 0;
}
Fun problem. The problem as stated does not specify what base it must be in. I fiddled around with it some and wrote a base-2 version. It generates an extra few thousand entries because the termination point of 1,000,000 is not as natural with base-2. This pre-counts the number of bits in a byte for a table lookup. The generation of the result set (without the I/O) took 2.4 ms.
One interesting thing (assuming I wrote it correctly) is that the base-2 version has about 250,000 "self numbers" up to 1,000,000 while there are just under 100,000 base-10 self numbers in that range.
#include <windows.h>
#include <stdio.h>
#include <string.h>
void StartTimer( _int64 *pt1 )
{
QueryPerformanceCounter( (LARGE_INTEGER*)pt1 );
}
double StopTimer( _int64 t1 )
{
_int64 t2, ldFreq;
QueryPerformanceCounter( (LARGE_INTEGER*)&t2 );
QueryPerformanceFrequency( (LARGE_INTEGER*)&ldFreq );
return ((double)( t2 - t1 ) / (double)ldFreq) * 1000.0;
}
#define RANGE 1000000
char sn[0x100000 + 32];
int bitCount[256];
// precompute bitcounts for each byte
void PreCountBits()
{
int i;
// generate count of bits in each byte
memset( bitCount, 0, sizeof( bitCount ));
for ( i = 0; i < 256; i++ )
{
int tmp = i;
while ( tmp )
{
if ( tmp & 0x01 )
bitCount[i]++;
tmp >>= 1;
}
}
}
void GenBase2( )
{
int i;
int *b1, *b2, *b3;
int b1sum, b2sum, b3sum;
i = 0;
for ( b1 = bitCount; b1 < bitCount + 256; b1++ )
{
b1sum = *b1;
for ( b2 = bitCount; b2 < bitCount + 256; b2++ )
{
b2sum = b1sum + *b2;
for ( b3 = bitCount; b3 < bitCount + 256; b3++ )
{
sn[i++ + *b3 + b2sum] = 1;
}
}
// 1000000 does not provide a great termination number for base 2. So check
// here. Overshoots the target some but avoids repeated checks
if ( i > RANGE )
return;
}
}
int main( int argc, char* argv[] )
{
int i = 0;
__int64 t1;
memset( sn, 0, sizeof( sn ));
StartTimer( &t1 );
PreCountBits();
GenBase2();
printf( "Generation time = %.3f\n", StopTimer( t1 ));
#if 1
for ( i = 1; i <= RANGE; i++ )
if ( !sn[i] ) printf( "%d\n", i );
#endif
return 0;
}
Maybe try just computing the recurrence relation defined below?
http://en.wikipedia.org/wiki/Self_number