I need to do something like this in the fastest way possible (O(1) would be perfect):
for (int j = 0; j < V; ++j)
{
if(!visited[j]) required[j]=0;
}
I came up with this solution:
for (int j = 0; j < V; ++j)
{
required[j]=visited[j]&required[j];
}
This made the program run 3 times faster, but I believe there is an even better way to do this. Am I right?
By the way, required and visited are dynamically allocated arrays:
bool *required;
bool *visited;
required = new bool[V];
visited = new bool[V];
When you're working with a list of simple objects, you are most likely best served by the functionality the C++ Standard Library provides. Containers like std::valarray and std::vector are recognized and optimized well by modern compilers.
There is much debate about how far you can rely on your compiler, but it was built alongside the standard library, so relying on it for basic functionality (such as your problem) is generally a safe bet.
Never be afraid to run your own timing tests and race your compiler! It's a fun exercise, and one that is increasingly difficult to win.
Construct a valarray (well optimized in C++11 and later):
std::valarray<bool> valRequired(required, V);
std::valarray<bool> valVisited(visited, V);
valRequired &= valVisited;
Alternatively, you could do it with one line using transform:
std::transform(required, required + V, visited, required, [](bool r, bool v){ return r & v; });
Edit: fewer lines does not mean faster, but your compiler will likely vectorize this operation.
I also tested their timing:
#include <algorithm>
#include <chrono>
#include <iostream>
#include <valarray>
int main(int argc, const char * argv[]) {
auto clock = std::chrono::high_resolution_clock{};
{
bool visited[5] = {1,0,1,0,0};
bool required[5] = {1,1,1,0,1};
auto start = clock.now();
for (int i = 0; i < 5; ++i) {
required[i] &= visited[i];
}
auto end = clock.now();
std::cout << "1: " << (end - start).count() << std::endl;
}
{
bool visited[5] = {1,0,1,0,0};
bool required[5] = {1,1,1,0,1};
auto start = clock.now();
for (int i = 0; i < 5; ++i) {
required[i] = visited[i] & required[i];
}
auto end = clock.now();
std::cout << "2: " << (end - start).count() << std::endl;
}
{
bool visited[5] = {1,0,1,0,0};
bool required[5] = {1,1,1,0,1};
auto start = clock.now();
std::transform(required, required + 5, visited, required, [](bool r, bool v){ return r & v; });
auto end = clock.now();
std::cout << "3: " << (end - start).count() << std::endl;
}
{
bool visited[5] = {1,0,1,0,0};
bool required[5] = {1,1,1,0,1};
std::valarray<bool> valVisited(visited, 5);
std::valarray<bool> valrequired(required, 5);
auto start = clock.now();
valrequired &= valVisited;
auto end = clock.now();
std::cout << "4: " << (end - start).count() << std::endl;
}
}
Output:
1: 102
2: 55
3: 47
4: 45
Program ended with exit code: 0
Along the lines of @AlanStokes's suggestion, use packed binary data and combine it with the AVX-512 intrinsic _mm512_and_epi64, 512 bits at a time. Be prepared for things to get hairy.
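As a portable illustration of the packed-bits idea, here is a minimal sketch that uses plain 64-bit words instead of AVX-512 intrinsics. The names PackedFlags, pack, and_in_place and test_bit are made up for this example and do not come from the thread; it assumes both arrays are packed to the same length.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type: one bit per flag, 64 flags per word.
using PackedFlags = std::vector<std::uint64_t>;

// Pack a plain bool array into 64-bit words (flag j lives in bit j % 64 of word j / 64).
PackedFlags pack(const bool* flags, std::size_t n) {
    PackedFlags out((n + 63) / 64, 0);
    for (std::size_t j = 0; j < n; ++j)
        if (flags[j])
            out[j / 64] |= std::uint64_t{1} << (j % 64);
    return out;
}

// required &= visited, 64 flags per loop iteration.
// Assumes both vectors hold the same number of words.
void and_in_place(PackedFlags& required, const PackedFlags& visited) {
    for (std::size_t w = 0; w < required.size(); ++w)
        required[w] &= visited[w];
}

// Read back a single flag.
bool test_bit(const PackedFlags& flags, std::size_t j) {
    return (flags[j / 64] >> (j % 64)) & 1u;
}
```

This is still linear in V (roughly V/64 word operations), not O(1), and it probably only pays off if the flags are kept packed throughout rather than repacked before every AND.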
I have a list of items called L and a sophisticated Python function called func; the normal way is to use a Python loop like:
out = [func(item) for item in L]
But it's single-threaded, so I want to implement a function in C++ and bind it with pybind11:
The C++ side:
m.def("test_func_iter", [](const py::object &func, const py::sequence &iter) {
auto n = len(iter);
py::list l(n);
unsigned int k = std::thread::hardware_concurrency();
std::thread threads[k];
auto stride = n / k;
// [0, n//k), [n//k, ...), [...,n)
for (unsigned int w = 0; w < k; ++w) {
if (w < k - 1) {
threads[w] = std::thread([&l, &func, &iter](size_t start, size_t end) {
for (size_t i = start; i < end; ++i) {
std::cout << "h: "<< i << std::endl;
l[i] = func(iter[i]);
}
}, w * stride, (w + 1) * stride);
} else {
threads[w] = std::thread([&l, &func, &iter](size_t start, size_t end) {
for (size_t i = start; i < end; ++i) {
std::cout << "h: "<< i << std::endl;
l[i] = func(iter[i]);
}
}, w * stride, n);
}
}
std::cout << "Done spawning threads! Now wait for them to finish.\n";
for (auto& t: threads) {
t.join();
}
std::cout << "end" << std::endl;
return py::type::of(iter)(l);
});
When I invoke the bound function from Python like this:
def func(i):
# just simplify the actual logic, a sophisticated function that is hard to re-write totally in c++
print(i, i == 0)
return int(gmpy2.mpz(i) + 100)
b = test_func_iter(func, list(range(100)))
print(b)
I get the following output and error:
h: h: Done spawning threads! Now wait for them to finish.
050
0 True
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
Here is what I have tried:
Without threads: everything is OK in Python
With threads and k=1 (a single thread): everything is OK in Python
With threads and k>=2: crash.
BTW, I use an M1 Mac laptop, and the clang version is 12.05.
I am new to C++ and guess the cause may be the use of threads, but I could not find any suggestions on Google. Can anybody give some hints? (Or some suggestions about the original problem: an elegant way to support multithreading with pybind11.) Thanks!
What is the best way to compare two unsorted std::vectors?
std::vector<int> v1 = {1, 2, 3, 4, 5};
std::vector<int> v2 = {2, 3, 4, 5, 1};
What I am currently doing is
const auto is_similar = v1.size() == v2.size() && std::is_permutation(v1.begin(), v1.end(), v2.begin());
Here, two vectors are similar only when they have the same size and contain the same elements.
What would be a better approach for:
two small std::vectors (size well under 50 elements)
two large std::vectors
std::is_permutation appears to be very slow for large arrays. For similar arrays of 64 K elements it already takes around 5 seconds to give an answer, while regular sorting takes 0.007 seconds for arrays of this size. Timings are provided in my code below.
I suggest the following: compute a simple (and fast) hash of the elements that is independent of their order. If the hashes of the two arrays differ, the arrays are not similar; in other words, they are not equal as multisets. Only if the hashes match do you sort the arrays and compare them for equality.
The things suggested in my answer are meant for large arrays, to make the computation fast; for tiny arrays std::is_permutation is probably enough, although everything in this answer applies to small arrays too.
The following code implements three functions, SimilarProd(), SimilarSort() and SimilarIsPermutation(); the first uses my suggestion of computing a hash first and then sorting.
As a position-independent hash function I took the product of all elements, each shifted by (added to) a fixed random 64-bit value. For integer arrays this is computed very quickly thanks to the auto-vectorization capabilities of modern compilers (such as Clang and GCC), which use SIMD instructions to speed up the computation.
In the code below I timed all three implementations. For similar arrays (the same multiset of numbers) of 64 K elements, std::is_permutation() takes about 5 seconds, while both the hash approach and the sort approach take about 0.007 seconds. For dissimilar arrays std::is_permutation is very fast, below 0.00005 seconds, while sort still takes about 0.007 seconds and hash is about 100 times faster than sort, at 0.00008 seconds.
So std::is_permutation is very slow for large similar arrays and very fast for dissimilar ones. The sort approach runs at the same speed in both cases. The hash approach is fast for similar arrays and much faster still for dissimilar ones, where it is about as fast as std::is_permutation.
So out of the three approaches, the hash approach is a clear win.
See the timings after the code.
Update: for comparison I added one more method, SimilarMap(), which counts the number of occurrences of each integer using std::unordered_map. It turned out to be a bit slower than sorting, so the Hash+Sort method is still the fastest, although for very large arrays this map-counting method should outperform sorting.
#include <cstdint>
#include <array>
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>
#include <iomanip>
#include <iostream>
#include <unordered_map>
bool SimilarProd(std::vector<int> const & a, std::vector<int> const & b) {
using std::size_t;
using u64 = uint64_t;
if (a.size() != b.size())
return false;
u64 constexpr randv = 0x6A7BE8CD0708EC4CULL;
size_t constexpr num_words = 8;
std::array<u64, num_words> prodA = {}, prodB = {};
std::fill(prodA.begin(), prodA.end(), 1);
std::fill(prodB.begin(), prodB.end(), 1);
for (size_t i = 0; i < a.size() - a.size() % num_words; i += num_words)
for (size_t j = 0; j < num_words; ++j) {
prodA[j] *= (randv + u64(a[i + j])) | 1;
prodB[j] *= (randv + u64(b[i + j])) | 1;
}
for (size_t i = a.size() - a.size() % num_words; i < a.size(); ++i) {
prodA[0] *= (randv + u64(a[i])) | 1;
prodB[0] *= (randv + u64(b[i])) | 1;
}
for (size_t i = 1; i < num_words; ++i) {
prodA[0] *= prodA[i];
prodB[0] *= prodB[i];
}
if (prodA[0] != prodB[0])
return false;
auto a2 = a, b2 = b;
std::sort(a2.begin(), a2.end());
std::sort(b2.begin(), b2.end());
return a2 == b2;
}
bool SimilarSort(std::vector<int> a, std::vector<int> b) {
if (a.size() != b.size())
return false;
std::sort(a.begin(), a.end());
std::sort(b.begin(), b.end());
return a == b;
}
bool SimilarIsPermutation(std::vector<int> const & a, std::vector<int> const & b) {
return a.size() == b.size() && std::is_permutation(a.begin(), a.end(), b.begin());
}
bool SimilarMap(std::vector<int> const & a, std::vector<int> const & b) {
if (a.size() != b.size())
return false;
std::unordered_map<int, int> m;
for (auto x: a)
++m[x];
for (auto x: b)
--m[x];
for (auto const & p: m)
if (p.second != 0)
return false;
return true;
}
void Test() {
using std::size_t;
auto TimeCur = []{ return std::chrono::high_resolution_clock::now(); };
auto const gtb = TimeCur();
auto Time = [&]{ return std::chrono::duration_cast<
std::chrono::microseconds>(TimeCur() - gtb).count() / 1000000.0; };
std::mt19937_64 rng{123};
auto RandV = [&](size_t n) {
std::vector<int> v(n);
for (size_t i = 0; i < v.size(); ++i)
v[i] = rng() % (1 << 30);
return v;
};
size_t constexpr n = 1 << 16;
auto a = RandV(n), b = a, c = RandV(n);
std::shuffle(b.begin(), b.end(), rng);
std::cout << std::boolalpha << std::fixed;
double tb = 0;
tb = Time(); std::cout << "Prod "
<< SimilarProd(a, b) << " time " << (Time() - tb) << std::endl;
tb = Time(); std::cout << "Sort "
<< SimilarSort(a, b) << " time " << (Time() - tb) << std::endl;
tb = Time(); std::cout << "IsPermutation "
<< SimilarIsPermutation(a, b) << " time " << (Time() - tb) << std::endl;
tb = Time(); std::cout << "Map "
<< SimilarMap(a, b) << " time " << (Time() - tb) << std::endl;
tb = Time(); std::cout << "Prod "
<< SimilarProd(a, c) << " time " << (Time() - tb) << std::endl;
tb = Time(); std::cout << "Sort "
<< SimilarSort(a, c) << " time " << (Time() - tb) << std::endl;
tb = Time(); std::cout << "IsPermutation "
<< SimilarIsPermutation(a, c) << " time " << (Time() - tb) << std::endl;
tb = Time(); std::cout << "Map "
<< SimilarMap(a, c) << " time " << (Time() - tb) << std::endl;
}
int main() {
Test();
}
Output:
Prod true time 0.009208
Sort true time 0.008080
IsPermutation true time 4.436632
Map true time 0.010382
Prod false time 0.000082
Sort false time 0.008750
IsPermutation false time 0.000036
Map false time 0.016390
What would be a better approach
Remove the v1.size() == v2.size() && expression and instead pass v2's end iterator to std::is_permutation as well (the four-iterator overload, available since C++14).
You tagged C++11, but for those who can use C++20, I recommend the following:
std::ranges::is_permutation(v1, v2)
If you can modify the vectors, then it will be asymptotically faster to sort them and compare equality. If you cannot modify, then you could create a sorted copy if you can afford the storage cost.
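The two suggestions above could look roughly like this. It is a minimal sketch, assuming int elements and at least C++14; the function names similar_permutation and similar_sort_in_place are invented for the example.

```cpp
#include <algorithm>
#include <vector>

// C++14 four-iterator overload: it returns false when the lengths differ,
// so no separate size comparison is needed.
bool similar_permutation(const std::vector<int>& a, const std::vector<int>& b) {
    return std::is_permutation(a.begin(), a.end(), b.begin(), b.end());
}

// In-place variant for when reordering the inputs is acceptable:
// sorts both vectors (O(n log n)) and compares them element by element.
bool similar_sort_in_place(std::vector<int>& a, std::vector<int>& b) {
    if (a.size() != b.size())
        return false;
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    return a == b;
}
```

If the vectors must stay untouched, taking the parameters by value instead of by reference gives the sorted-copy version at the cost of the extra storage.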
I'm trying to optimize my code using multithreading. Not only does the program fail to run twice as fast, as I expected on this dual-core computer, it is MUCH slower. I just want to know whether I'm doing something wrong or whether it's normal that multithreading doesn't help in this case. I wrote this reproduction of how I used multithreading; on my computer the parallel version takes 4 times as long as the normal version:
#include <iostream>
#include <random>
#include <thread>
#include <chrono>
#include <numeric>
using namespace std;
default_random_engine ran;
inline bool get(){
return ran() % 3;
}
void normal_serie(unsigned repetitions, unsigned &result){
for (unsigned i = 0; i < repetitions; ++i)
result += get();
}
unsigned parallel_series(unsigned repetitions){
const unsigned hardware_threads = std::thread::hardware_concurrency();
cout << "Threads in this computer: " << hardware_threads << endl;
const unsigned threads_number = (hardware_threads != 0) ? hardware_threads : 2;
const unsigned its_per_thread = repetitions / threads_number;
unsigned *results = new unsigned[threads_number]();
std::thread *threads = new std::thread[threads_number - 1];
for (unsigned i = 0; i < threads_number - 1; ++i)
threads[i] = std::thread(normal_serie, its_per_thread, std::ref(results[i]));
normal_serie(its_per_thread, results[threads_number - 1]);
for (unsigned i = 0; i < threads_number - 1; ++i)
threads[i].join();
auto result = std::accumulate(results, results + threads_number, 0);
delete[] results;
delete[] threads;
return result;
}
int main()
{
constexpr unsigned repetitions = 100000000;
auto to = std::chrono::high_resolution_clock::now();
cout << parallel_series(repetitions) << endl;
auto tf = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(tf - to).count();
cout << "Parallel duration: " << duration << "ms" << endl;
to = std::chrono::high_resolution_clock::now();
unsigned r = 0;
normal_serie(repetitions, r);
cout << r << endl;
tf = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::milliseconds>(tf - to).count();
cout << "Normal duration: " << duration << "ms" << endl;
return 0;
}
Things I already know about but left out to keep this code short:
I should set a max_iterations_per_thread, because you don't want to run only 10 iterations per thread, but in this case we are doing one hundred million iterations, so that is not going to happen.
The number of iterations must be divisible by the number of threads; otherwise some iterations are skipped.
This is the output I get on my computer:
Threads in this computer: 2
66665160
Parallel duration: 4545ms
66664432
Normal duration: 1019ms
(Partially solved by making these changes:)
inline bool get(default_random_engine &ran){
return ran() % 3;
}
void normal_serie(unsigned repetitions, unsigned &result){
default_random_engine eng;
unsigned saver_result = 0;
for (unsigned i = 0; i < repetitions; ++i)
saver_result += get(eng);
result += saver_result;
}
All your threads are tripping over each other fighting for access to ran which can only perform one operation at a time because it only has one state and each operation advances its state. There is no point in running operations in parallel if the vast majority of each operation involves a choke point that cannot support any concurrency.
All elements of results are likely to share a cache line, which means there is lots of inter-core communication going on.
Try modifying normal_serie to accumulate into a local variable and only write it to results in the end.
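Putting the points above together, a parallel version might look like the sketch below: each thread owns its own engine and adds into a local variable, touching the shared results array only once at the end. The names count_worker and parallel_count and the per-thread seeds are assumptions made for this example, not part of the original code.

```cpp
#include <functional>
#include <random>
#include <thread>
#include <vector>

// Each worker has a private engine (no shared state to fight over) and
// accumulates into a local variable, writing to shared memory only once.
void count_worker(unsigned repetitions, unsigned seed, unsigned& out) {
    std::default_random_engine eng(seed);
    unsigned local = 0;
    for (unsigned i = 0; i < repetitions; ++i)
        local += (eng() % 3 != 0);   // same test as get() in the question
    out = local;
}

// Splits the work evenly; assumes repetitions is divisible by threads_number.
unsigned parallel_count(unsigned repetitions, unsigned threads_number) {
    std::vector<unsigned> results(threads_number, 0);
    std::vector<std::thread> threads;
    const unsigned per_thread = repetitions / threads_number;
    for (unsigned t = 0; t < threads_number; ++t)
        threads.emplace_back(count_worker, per_thread, t + 1, std::ref(results[t]));
    for (auto& th : threads)
        th.join();
    unsigned total = 0;
    for (unsigned r : results)
        total += r;
    return total;
}
```

Because each thread writes to results only once, it should matter little that the elements share a cache line. Giving each thread a distinct seed keeps the threads from all generating the same sequence.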
I am trying to implement a parallel quadratic sieve using OpenMP. In the sieving phase, I am using log approximations to check divisibility. This is my code:
#pragma omp parallel for schedule (dynamic) num_threads(4)
for (int i = 0; i < factorBase.size(); ++i) {
const uint32_t p = factorBase[i];
const float logp = std::log(factorBase[i]) / std::log(2);
// Sieve first sequence.
while (startIndex.first[i] < intervalEnd) {
logApprox[startIndex.first[i] - intervalStart] -= logp;
startIndex.first[i] += p;
}
if (p == 2)
continue; // a^2 = N (mod 2) only has one root.
// Sieve second sequence.
while (startIndex.second[i] < intervalEnd) {
logApprox[startIndex.second[i] - intervalStart] -= logp;
startIndex.second[i] += p;
}
}
Here factorBase and logApprox are std::vectors initialized as follows:
std::vector<float> logApprox(INTERVAL_LENGTH, 0);
std::vector<uint32_t> factorBase;
Whenever I run this code and compare the running times, there is not much difference between the sequential and parallel runs. What optimizations can be done? I am a beginner in OpenMP and any help is appreciated. Thanks.
Very interesting task you have! Thanks!
I decided to make my own implementation with many optimizations.
I achieved a 20.4x speedup compared to your original code (your code takes 17.86 seconds, mine takes 0.87 seconds). I also used half as much memory for sieving as your algorithm while achieving the same goal.
To make the comparison, I simplified your code so that it still does almost the same thing and runs in the same time, but looks much simpler:
#pragma omp parallel for
for (size_t i = 0; i < factorBase.size(); ++i) {
auto const p = factorBase[i];
float const logp = std::log(p) / std::log(2);
while (startIndex[i] < logApprox.size()) {
logApprox[startIndex[i]] += logp;
startIndex[i] += p;
}
}
As you can see, I left only a single sieve loop; the second one does the same thing and is not necessary for the demonstration, so I removed it. I also removed intervalStart, as it is irrelevant to the speed demonstration. And for simplicity I add the logarithm with += instead of subtracting it with -= as you do.
One important note about your algorithm is that it doesn't do any synchronization, which means that different CPU cores may write to the same entry of the logApprox array and hence give a wrong result.
By my measurements, this wrong result occurs once or twice per hundred million entries of the logApprox array. My optimized code overcomes this limitation and handles synchronization correctly, in addition to all the speed optimizations.
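For reference, the most direct way to remove that race is to make each update atomic. This is not the block-partitioning approach this answer takes. The sketch below is only an illustration: the function name sieve_atomic is invented, it needs OpenMP enabled (for example -fopenmp) to run in parallel, and the atomic updates may well cost more than they gain.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Same loop as the simplified sieve, but every += on the shared array is atomic,
// so two threads hitting the same entry can no longer lose an update.
void sieve_atomic(std::vector<float>& logApprox,
                  const std::vector<std::uint32_t>& factorBase,
                  std::vector<std::uint32_t> startIndex) {
    float* const out = logApprox.data();
    const std::size_t len = logApprox.size();
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(factorBase.size());
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        const std::uint32_t p = factorBase[i];
        const float logp = std::log2(static_cast<float>(p));
        for (std::size_t j = startIndex[i]; j < len; j += p) {
            #pragma omp atomic
            out[j] += logp;
        }
    }
}
```

Without OpenMP the pragmas are ignored and the function simply runs serially with the same result.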
I made the following improvements to gain the 20x speedup:
I split the whole array into blocks of about 2^13 elements each. Each group of blocks is processed by a separate thread/CPU core, so no synchronization between threads is needed. Besides avoiding synchronization, it is very important that a 2^13-element block fits fully into the CPU's L1 or L2 cache, which speeds things up a lot.
Each block of 2^13 elements is processed for all relevant primes. To keep track of which offsets of which primes are needed, I created a ring buffer of 2^6 entries (ring_size in the code); it is indexed by the block number modulo the ring size and records which primes, at which offsets, are needed for each block.
I use as many threads as there are CPU cores. For each thread I precompute the starting offsets of all primes; these are computed with modular arithmetic from the startIndex array in your original code.
To speed things up even more, I use an integer (fixed-point) logarithm stored in uint16_t instead of a float logarithm. It is computed as uint16_t integer_log = uint16_t(std::log2(p) * (1 << 8) + 0.5);. Besides making += faster, this halves the memory used. If uint16_t is not enough for you, replace using ILog2T = u16; with using ILog2T = u32; in my code, but this will double the amount of memory used.
My code prints the following to the console:
time_simple 17.859 sec, time_optimized 0.874 sec, boost 20.434, correct_ratio 0.999999993
time_simple is the time of your original code sieving an array of size 2^28, time_optimized is my code on the same array, and boost is how many times faster my code is (about 20x). correct_ratio shows whether there are any errors in your code due to the absence of multi-core synchronization (as you can see, it is sometimes less than 1.0, hence there are some errors).
The full optimized code is below:
#include <cstdint>
#include <cmath>
#include <algorithm>
#include <string>
#include <tuple>
#include <utility>
#include <random>
#include <iostream>
#include <iomanip>
#include <chrono>
#include <thread>
#include <type_traits>
#include <vector>
#include <stdexcept>
#include <sstream>
#include <mutex>
#include <omp.h>
#define ASSERT_MSG(cond, msg) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "! Msg: '" + std::string(msg) + "'."); }
#define ASSERT(cond) ASSERT_MSG(cond, "")
#define OSTR(code) ([&]{ std::ostringstream ss; ss code; return ss.str(); }())
#define COUT(code) { std::unique_lock<std::mutex> lock(cout_mux); std::cout code; std::cout << std::flush; }
#define LN { COUT(<< "LN " << __LINE__ << std::endl); }
#define DUMP(var) { COUT(<< #var << " = (" << (var) << ")" << std::endl); }
using u16 = uint16_t;
using u32 = uint32_t;
using u64 = uint64_t;
using ILog2T = u16;
using PrimeT = u32;
std::mutex cout_mux;
template <typename T>
std::vector<T> GenPrimes(size_t end) {
thread_local std::vector<T> primes = {2, 3};
while (primes.back() < end) {
for (T p = primes.back() + 2;; p += 2) {
bool is_prime = true;
for (auto d: primes) {
if (u64(d) * d > p)
break;
if (p % d == 0) {
is_prime = false;
break;
}
}
if (is_prime) {
primes.push_back(p);
break;
}
}
}
primes.pop_back();
return primes;
}
void SieveA(std::vector<float> & logApprox, std::vector<PrimeT> const & factorBase, std::vector<PrimeT> startIndex) {
#pragma omp parallel for
for (size_t i = 0; i < factorBase.size(); ++i) {
auto const p = factorBase[i];
float const logp = std::log(p) / std::log(2);
while (startIndex[i] < logApprox.size()) {
logApprox[startIndex[i]] += logp;
startIndex[i] += p;
}
}
}
size_t NThreads() {
//return 1;
return std::thread::hardware_concurrency();
}
ILog2T LogToI(double x) { return ILog2T(x * (1ULL << (sizeof(ILog2T) * 8 - 8)) + 0.5); }
double IToLog(ILog2T x) { return x / double(1ULL << (sizeof(ILog2T) * 8 - 8)); }
double Time() {
static auto const gtb = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::duration<double>>(
std::chrono::high_resolution_clock::now() - gtb).count();
}
std::string FloatToStr(double x, size_t round = 6) {
return OSTR(<< std::fixed << std::setprecision(round) << x);
}
double SieveB(std::vector<ILog2T> & logs, std::vector<PrimeT> const & primes, std::vector<PrimeT> const & starts0) {
auto const nthr = NThreads();
std::vector<std::vector<PrimeT>> starts(nthr, std::vector<PrimeT>(primes.size()));
std::vector<std::vector<ILog2T>> plogs(nthr, std::vector<ILog2T>(primes.size()));
std::vector<std::pair<u64, u64>> ranges(nthr);
size_t constexpr block_log2 = 13, block = 1 << block_log2, ring_log2 = 6, ring_size = 1ULL << ring_log2, ring_mask = ring_size - 1;
std::vector<std::vector<std::vector<std::pair<u32, u32>>>> ring(nthr, std::vector<std::vector<std::pair<u32, u32>>>(ring_size));
#pragma omp parallel for
for (size_t ithr = 0; ithr < nthr; ++ithr) {
size_t const nblock = ((logs.size() + nthr - 1) / nthr + block - 1) / block * block,
begin = ithr * nblock, end = std::min<size_t>(logs.size(), (ithr + 1) * nblock);
ranges[ithr] = {begin, end};
for (size_t i = 0; i < primes.size(); ++i) {
PrimeT const p = primes[i];
size_t const mod0 = begin % p, mod = starts0[i] < mod0 ? p + starts0[i] - mod0 : starts0[i] - mod0;
starts[ithr][i] = mod;
plogs[ithr][i] = LogToI(std::log2(p));
ring[ithr][((begin + starts[ithr][i]) >> block_log2) & ring_mask].push_back({i, begin + starts[ithr][i]});
}
}
auto tim = Time();
#pragma omp parallel for
for (size_t ithr = 0; ithr < nthr; ++ithr) {
auto const [begin, end] = ranges[ithr];
auto const [bbegin, bend] = std::make_tuple(begin / block, (end - 1) / block + 1);
auto const & cstarts = starts.at(ithr);
auto const & cplogs = plogs.at(ithr);
auto & cring = ring[ithr];
std::decay_t<decltype(cring[0])> tmp;
size_t hit_cnt = 0, miss_cnt = 0;
for (size_t iblock = bbegin; iblock < bend; ++iblock) {
size_t const cbegin = iblock << block_log2, cend = std::min<size_t>(end, (iblock + 1) << block_log2);
auto & ring_cur = cring[iblock & ring_mask];
tmp = ring_cur;
ring_cur.clear();
for (auto [ip, off]: tmp)
if (off >= cend) {
//++miss_cnt;
ring_cur.push_back({ip, off});
} else {
//++hit_cnt;
auto const p = primes[ip];
auto const plog = cplogs[ip];
for (; off < cend; off += p) {
//if (8192 - 10 <= off && off <= 8192 + 10) COUT(<< "logs.size() " << logs.size() << " begin " << begin << " end " << end << " bbegin " << bbegin << " bend " << bend << " cbegin " << cbegin << " cend " << cend << " iblock " << iblock << " off " << off << " p " << p << " plog " << plog << std::endl);
logs[off] += plog;
}
if (off < end)
cring[(off >> block_log2) & ring_mask].push_back({ip, off});
}
}
//COUT(<< "hit_ratio " << std::fixed << std::setprecision(6) << double(hit_cnt) / (hit_cnt + miss_cnt) << std::endl);
}
return Time() - tim;
}
void Test() {
size_t constexpr len = 1ULL << 28;
std::mt19937_64 rng{123};
auto const primes = GenPrimes<PrimeT>(1 << 12);
std::vector<PrimeT> starts;
for (auto p: primes)
starts.push_back(rng() % p);
ASSERT(primes.size() == starts.size());
double tA = 0, tB = 0;
std::vector<float> logsA(len);
std::vector<ILog2T> logsB(len);
{
tA = Time();
SieveA(logsA, primes, starts);
tA = Time() - tA;
}
{
tB = SieveB(logsB, primes, starts);
}
size_t correct = 0;
for (size_t i = 0; i < len; ++i) {
//ASSERT_MSG(std::abs(logsA[i] - IToLog(logsB[i])) < 0.1, "i " + std::to_string(i) + " logA " + FloatToStr(logsA[i], 3) + " logB " + FloatToStr(IToLog(logsB[i]), 3));
if (std::abs(logsA[i] - IToLog(logsB[i])) < 0.1)
++correct;
}
std::cout << std::fixed << std::setprecision(3) << "time_simple " << tA << " sec, time_optimized " << tB << " sec, boost " << (tA / tB) << ", correct_ratio " << std::setprecision(9) << double(correct) / len << std::endl;
}
int main() {
try {
omp_set_num_threads(NThreads());
Test();
return 0;
} catch (std::exception const & ex) {
std::cout << "Exception: " << ex.what() << std::endl;
return -1;
}
}
Output:
time_simple 17.859 sec, time_optimized 0.874 sec, boost 20.434, correct_ratio 0.999999993
In my opinion, you should switch the schedule to static and give it a chunk size (https://software.intel.com/en-us/articles/openmp-loop-scheduling).
A small optimization would be:
outside of the big for loop, declare a const initialized to 1/std::log(2), and then inside the loop multiply by that constant instead of dividing by std::log(2); division is expensive in CPU cycles.
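The hoisting described above could be sketched as follows. The function name log2_table and the idea of precomputing all logarithms into a table are assumptions for the example; the question computes logp inside the loop over primes.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Computes log2(p) for every prime in the factor base, using one division
// up front and a multiplication per prime.
std::vector<float> log2_table(const std::vector<std::uint32_t>& factorBase) {
    const float inv_log2 = 1.0f / std::log(2.0f);   // computed once, outside the loop
    std::vector<float> out;
    out.reserve(factorBase.size());
    for (std::uint32_t p : factorBase)
        out.push_back(std::log(static_cast<float>(p)) * inv_log2);
    return out;
}
```

Since logp is computed once per prime rather than once per sieve update, the gain from this change is probably small. std::log2 gives the same value directly.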
I was curious and did a little benchmark to determine the performance difference between primitive types such as int or float and user-defined types.
I created a template class Var with some inline arithmetic operators. The test consisted of running this loop for both the primitive and the Var vectors:
for (unsigned i = 0; i < 1000; ++i) {
in1[i] = i;
in2[i] = -i;
out[i] = (i % 2) ? in1[i] + in2[i] : in2[i] - in1[i];
}
I was quite surprised by the results: my Var class turns out to be faster most of the time. With int, the loop took on average about 5700 nsec less with the class. Out of 3000 runs, int was faster 11 times and Var was faster 2989 times. The results with float are similar: Var is 15100 nsec faster than float in 2991 of the runs.
Shouldn't primitive types be faster?
Edit: Compiler is a rather ancient mingw 4.4.0, build options are the defaults of QtCreator, no optimizations:
qmake call: qmake.exe C:\...\untitled15.pro -r -spec win32-g++ "CONFIG+=release"
OK, posting the full source; the platform is 64-bit Win7, 4 GB DDR2-800, Core2Duo @ 3 GHz
#include <QTextStream>
#include <QVector>
#include <QElapsedTimer>
template<typename T>
class Var{
public:
Var() {}
Var(T val) : var(val) {}
inline T operator+(Var& other)
{
return var + other.value();
}
inline T operator-(Var& other)
{
return var - other.value();
}
inline T operator+(T& other)
{
return var + other;
}
inline T operator-(T& other)
{
return var - other;
}
inline void operator=(T& other)
{
var = other;
}
inline T& value()
{
return var;
}
private:
T var;
};
int main()
{
QTextStream cout(stdout);
QElapsedTimer timer;
unsigned count = 1000000;
QVector<double> pin1(count), pin2(count), pout(count);
QVector<Var<double> > vin1(count), vin2(count), vout(count);
unsigned t1, t2, pAcc = 0, vAcc = 0, repeat = 10, pcount = 0, vcount = 0, ecount = 0;
for (int cc = 0; cc < 5; ++cc)
{
for (unsigned c = 0; c < repeat; ++c)
{
timer.restart();
for (unsigned i = 0; i < count; ++i)
{
pin1[i] = i;
pin2[i] = -i;
pout[i] = (i % 2) ? pin1[i] + pin2[i] : pin2[i] - pin1[i];
}
t1 = timer.nsecsElapsed();
cout << t1 << endl;
timer.restart();
for (unsigned i = 0; i < count; ++i)
{
vin1[i] = i;
vin2[i] = -i;
vout[i] = (i % 2) ? vin1[i] + vin2[i] : vin2[i] - vin1[i];
}
t2 = timer.nsecsElapsed();
cout << t2 << endl;
pAcc += t1;
vAcc += t2;
}
pAcc /= repeat;
vAcc /= repeat;
if (pAcc < vAcc) {
cout << "primitive was faster" << endl;
pcount++;
}
else if (pAcc > vAcc) {
cout << "var was faster" << endl;
vcount++;
}
else {
cout << "amazingly, both are equally fast" << endl;
ecount++;
}
cout << "Average for primitive type is " << pAcc << ", average for Var is " << vAcc << endl;
}
cout << "int was faster " << pcount << " times, var was faster " << vcount << " times, equal " << ecount << " times, " << pcount + vcount + ecount << " times ran total" << endl;
}
In relative terms, the Var class is 6-7% faster with floats and about 3% faster with ints.
I also ran the test with vector length of 10 000 000 instead of the original 1000 and results are still consistent and in favor of the class.
With QVector replaced by std::vector, at -O2 optimization level, code generated by GCC for the two types is exactly the same, instruction for instruction.
Without the replacement, the generated code is different, but that's hardly surprising, considering that QVector is implemented differently for primitive and non-primitive types (look for QTypeInfo<T>::isComplex in qvector.h).
Update: It looks like isComplex does not affect the inner loop, i.e. the measured part. The loop code still differs for the two types, albeit very slightly. It looks like the difference is due to GCC.
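One plausible reason the compiler can treat the wrapper like the primitive is that a class holding a single T has the same size and is trivially copyable. The sketch below checks that for a cut-down Var (const-correct operators, not the exact class from the question) and runs the benchmark loop on std::vector for both element types. fill_out is a name invented for the example, and it uses signed values rather than the question's unsigned -i.

```cpp
#include <type_traits>
#include <vector>

// Cut-down version of the question's wrapper: a single T member, inline operators.
template <typename T>
class Var {
public:
    Var() {}
    Var(T val) : var(val) {}
    T operator+(const Var& other) const { return var + other.var; }
    T operator-(const Var& other) const { return var - other.var; }
private:
    T var;
};

// The wrapper adds no storage and can be copied like a plain double.
static_assert(sizeof(Var<double>) == sizeof(double), "wrapper adds no storage");
static_assert(std::is_trivially_copyable<Var<double>>::value, "wrapper copies like a double");

// The benchmark loop, parameterized on the element type (double or Var<double>).
template <typename Elem>
std::vector<double> fill_out(unsigned count) {
    std::vector<Elem> in1(count), in2(count);
    std::vector<double> out(count);
    for (unsigned i = 0; i < count; ++i) {
        in1[i] = static_cast<double>(i);
        in2[i] = -static_cast<double>(i);
        out[i] = (i % 2) ? in1[i] + in2[i] : in2[i] - in1[i];
    }
    return out;
}
```

Calling fill_out<double>(n) and fill_out<Var<double>>(n) produces identical results; whether the machine code is also identical depends on the compiler and optimization level.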
I benchmarked running time and memory allocation for QVector and float*, with very little difference between the two.