There are several IPTables of different sizes (e.g. 255, 16384 or 512000 entries!). Every entry of each table holds a unique IP address (in hex format) and some other values. The total number of IPs is 8 million.
All IPs in all IPTables are sorted.
We need to search the IPTables 300,000 times per second. Our current algorithm for finding an IP is as follows:
// 10 < number of IPTables < 20
// _rangeCount = number of IPTables
s_EntryItem* searchIPTable(const uint32_t & ip) {
    for (int i = 0; i < _rangeCount; i++) {
        if (ip > _ipTable[i].start && ip < _ipTable[i].end) {
            int index = ip - _ipTable[i].start;
            return (_ipTable[i].p_entry + index);
        }
    }
    return NULL;
}
As can be seen, in the worst case the number of comparisons for a given IP address is _rangeCount * 2, and the number of "if" statements evaluated is _rangeCount.
Suppose I want to change searchIPTable to use a more efficient way of finding an IP address in the IPTables. As far as I know, for a sorted array, the best software implementation of a well-known search algorithm such as binary search needs log2(n) comparisons in the worst case.
So the number of comparisons needed to find an IP address is log2(8,000,000), which is approximately 23.
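A side note on that count: because each table maps ip - start straight to an index, a binary search would only have to locate the right table among the 10 to 20 ranges (roughly 4 to 5 comparisons), not the right IP among 8 million. Below is a minimal sketch of that idea. It assumes each table covers one contiguous, non-overlapping IP range with inclusive bounds; IpRange, EntryItem and find_entry are hypothetical names, not the code shown above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-IP record.
struct EntryItem { std::uint32_t ip; };

// Hypothetical description of one table (assumption: bounds are inclusive).
struct IpRange {
    std::uint32_t start;   // first IP covered by this table
    std::uint32_t end;     // last IP covered by this table
    EntryItem*    entries; // entries[ip - start] is the record for ip
};

// `ranges` must be sorted by `start` and must not overlap.
inline EntryItem* find_entry(const std::vector<IpRange>& ranges, std::uint32_t ip) {
    // Find the first range whose start is greater than ip;
    // the only possible match is the range just before it.
    auto it = std::upper_bound(
        ranges.begin(), ranges.end(), ip,
        [](std::uint32_t value, const IpRange& r) { return value < r.start; });
    if (it == ranges.begin()) return nullptr;  // ip lies below every range
    --it;
    assert(it->start <= it->end);
    if (ip > it->end) return nullptr;          // ip falls into a gap between ranges
    return it->entries + (ip - it->start);
}
```

Whether this beats the linear scan for so few ranges is something only a benchmark on the real data can tell.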
Question 1:
As can be seen, there is only a small gap between the number of comparisons needed by the two algorithms (_rangeCount vs. 23), but the first method contains some "if" statements that could affect performance. If you run the first algorithm 10 times, it obviously performs better, but I have no idea what happens when the two algorithms run 3,000,000 times! What do you think?
Question 2:
Is there a more efficient algorithm or solution to search IPs?
Curiosity piqued, I wrote a test program (below) and ran it on my MacBook.
It suggests that a naive solution based on a std::unordered_map (lookup time == constant time) is able to search an IPv4 address table with 8 million entries 5.6 million times per second.
This easily outperforms the requirements.
Update: responding to my critics, I have increased the test space to the required 8 million IP addresses. I have also increased the test size to 100 million searches, 20% of which will be a hit.
With a test this large we can clearly see the performance benefits of using an unordered_map when compared to an ordered map (logarithmic time lookups).
All test parameters are configurable.
#include <iostream>
#include <vector>
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <type_traits>
#include <unordered_map>
#include <unordered_set>
#include <map>
#include <random>
#include <tuple>
#include <iomanip>
#include <utility>
namespace detail
{
    // detects whether T has a reserve(std::size_t) member function
    template<class T>
    struct has_reserve
    {
        template<class U> static auto test(U*p) -> decltype(p->reserve(std::declval<std::size_t>()), void(), std::true_type());
        template<class U> static auto test(...) -> decltype(std::false_type());
        using type = decltype(test<T>((T*)0));
    };
}
template<class T>
using has_reserve = typename detail::has_reserve<T>::type;
using namespace std::literals;
struct data_associated_with_ip {};
using ip_address = std::uint32_t;
using candidate_vector = std::vector<ip_address>;
static constexpr std::size_t search_space_size = 8'000'000;
static constexpr std::size_t size_of_test = 100'000'000;
std::vector<ip_address> make_random_ip_set(std::size_t size)
{
    std::unordered_set<ip_address> results;
    std::random_device rd;
    std::default_random_engine eng(rd());
    auto dist = std::uniform_int_distribution<ip_address>(0, 0xffffffff);
    // keep drawing until we have `size` distinct addresses
    while (results.size() < size)
    {
        auto candidate = dist(eng);
        results.emplace(candidate);
    }
    return { std::begin(results), std::end(results) };
}
template<class T, std::enable_if_t<not has_reserve<T>::value> * = nullptr>
void maybe_reserve(T& container, std::size_t size)
{
    // nop
}
template<class T, std::enable_if_t<has_reserve<T>::value> * = nullptr>
decltype(auto) maybe_reserve(T& container, std::size_t size)
{
    return container.reserve(size);
}
template<class MapType>
void build_ip_map(MapType& result, candidate_vector const& chosen)
{
    maybe_reserve(result, chosen.size());
    for (auto& ip : chosen)
    {
        result.emplace(ip, data_associated_with_ip{});
    }
}
// build a vector of candidates to try against our map
// some percentage of the time we will select a candidate that we know is in the map
candidate_vector build_candidates(candidate_vector const& known)
{
    std::random_device rd;
    std::default_random_engine eng(rd());
    auto ip_dist = std::uniform_int_distribution<ip_address>(0, 0xffffffff);
    auto select_known = std::uniform_int_distribution<std::size_t>(0, known.size() - 1);
    auto chance = std::uniform_real_distribution<double>(0, 1);
    static constexpr double probability_of_hit = 0.2;
    candidate_vector result;
    std::generate_n(std::back_inserter(result), size_of_test, [&]
    {
        if (chance(eng) < probability_of_hit)
        {
            return known[select_known(eng)];
        }
        return ip_dist(eng);
    });
    return result;
}
int main()
{
    candidate_vector known_candidates = make_random_ip_set(search_space_size);
    candidate_vector random_candidates = build_candidates(known_candidates);
    auto run_test = [&known_candidates, &random_candidates]
    (auto const& search_space)
    {
        std::size_t hits = 0;
        auto start_time = std::chrono::high_resolution_clock::now();
        for (auto& candidate : random_candidates)
        {
            auto ifind = search_space.find(candidate);
            if (ifind != std::end(search_space))
            {
                ++hits;
            }
        }
        auto stop_time = std::chrono::high_resolution_clock::now();
        using fns = std::chrono::duration<long double, std::chrono::nanoseconds::period>;
        using fs = std::chrono::duration<long double, std::chrono::seconds::period>;
        auto interval = fns(stop_time - start_time);
        auto time_per_hit = interval / random_candidates.size();
        auto hits_per_sec = fs(1.0) / time_per_hit;
        std::cout << "ip addresses in table: " << search_space.size() << std::endl;
        std::cout << "ip addresses searched: " << random_candidates.size() << std::endl;
        std::cout << "total search hits : " << hits << std::endl;
        std::cout << "searches per second : " << std::fixed << hits_per_sec << std::endl;
    };
    {
        std::cout << "building unordered map:" << std::endl;
        std::unordered_map<ip_address, data_associated_with_ip> um;
        build_ip_map(um, known_candidates);
        std::cout << "testing with unordered map:" << std::endl;
        run_test(um);
    }
    {
        std::cout << "\nbuilding ordered map :" << std::endl;
        std::map<ip_address, data_associated_with_ip> m;
        build_ip_map(m, known_candidates);
        std::cout << "testing with ordered map :" << std::endl;
        run_test(m);
    }
}
Example results:
building unordered map:
testing with unordered map:
ip addresses in table: 8000000
ip addresses searched: 100000000
total search hits : 21681856
searches per second : 5602458.505577
building ordered map :
testing with ordered map :
ip addresses in table: 8000000
ip addresses searched: 100000000
total search hits : 21681856
searches per second : 836123.513710
Test conditions:
MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Release build (-O2)
Running on mains power.
In these kinds of situations, the only practical way to determine the fastest implementation is to implement both approaches, and then benchmark each one.
And, sometimes, it's faster to do that than to try to figure out which one will be faster. And, sometimes, if you do that, and then proceed with your chosen approach, you will discover that you were wrong.
It looks like your problem is not the performance cost of an if statement, but rather what data structure can give you an answer to the question “do you contain this element?” as fast as possible. If that is true, how about using a Bloom Filter?
Data structures that offer fast lookup (faster than logarithmic complexity) are hash tables, which, on average, have O(1) complexity. One such implementation is in Boost.Unordered.
Of course you'd need to test with real data... but thinking of IPv4, I would first try a different approach:
EntryItem* searchIPTable(uint32_t ip) {
    EntryItem** tab = master_table[ip >> 16];
    return tab ? tab[ip & 65535] : NULL;
}
In other words a master table of 65536 entries that are pointers to detail tables of 65536 entries each.
Depending on the type of data a different subdivision instead of 16+16 bits could work better (less memory).
It could also make sense to have detail pages to be directly IP entries instead of pointers to entries.
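To make the idea concrete, here is a minimal sketch of such a 16+16 bit table. TwoLevelIpTable, its insert and find members, and the EntryItem payload are hypothetical names chosen for illustration; detail pages are allocated lazily, so only the /16 blocks that actually contain entries cost memory.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

struct EntryItem { int payload; };  // hypothetical record type

class TwoLevelIpTable {
    // One detail page: 65536 slots, indexed by the low 16 bits of the IP.
    using Page = std::array<EntryItem*, 65536>;
    // Master table: 65536 slots, indexed by the high 16 bits of the IP.
    std::vector<std::unique_ptr<Page>> master_;

public:
    TwoLevelIpTable() : master_(65536) {}

    void insert(std::uint32_t ip, EntryItem* entry) {
        auto& page = master_[ip >> 16];
        if (!page) {
            page = std::make_unique<Page>();  // value-initialised: every slot is null
        }
        (*page)[ip & 0xFFFF] = entry;
    }

    // Two array indexings and one null check, independent of table size.
    EntryItem* find(std::uint32_t ip) const {
        const auto& page = master_[ip >> 16];
        return page ? (*page)[ip & 0xFFFF] : nullptr;
    }
};
```

With 8-byte pointers each detail page takes 512 KiB, so the memory cost depends heavily on how the addresses cluster; that is where a different bit split might pay off.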
I was not satisfied with the performance of the below thrust::reduce_by_key, so I rewrote it in a variety of ways with little gained benefit (including removing the permutation iterator). However, it wasn't until after replacing it with a thrust::for_each() (see below) that capitalizes on atomicAdd(), that I gained almost a 75x speedup! The two versions produce the exact same results. What could be the biggest cause for the dramatic performance differences?
Complete code for comparison between the two approaches:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <ctime>
#include <iostream>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/sort.h>
constexpr int NumberOfOscillators = 100;
int SeedRange = 500;
struct GetProduct
template<typename Tuple>
__host__ __device__
int operator()(const Tuple & t)
return thrust::get<0>(t) * thrust::get<1>(t);
int main()
using namespace std;
using namespace thrust::placeholders;
thrust::device_vector<int> dv_OscillatorsVelocity(NumberOfOscillators);
thrust::device_vector<int> dv_outputCompare(NumberOfOscillators);
thrust::device_vector<int> dv_Connections_Strength((NumberOfOscillators - 1) * NumberOfOscillators);
thrust::device_vector<int> dv_Connections_Active((NumberOfOscillators - 1) * NumberOfOscillators);
thrust::device_vector<int> dv_Connections_TerminalOscillatorID_Map(0);
thrust::device_vector<int> dv_Permutation_Connections_To_TerminalOscillators((NumberOfOscillators - 1) * NumberOfOscillators);
thrust::device_vector<int> dv_Connection_Keys((NumberOfOscillators - 1) * NumberOfOscillators);
srand((unsigned int)time(NULL));
thrust::fill(dv_OscillatorsVelocity.begin(), dv_OscillatorsVelocity.end(), 0);
for (int c = 0; c < NumberOfOscillators * (NumberOfOscillators - 1); c++)
dv_Connections_Strength[c] = (rand() % SeedRange) - (SeedRange / 2);
dv_Connections_Active[c] = 0;
int curOscillatorIndx = -1;
for (int c = 0; c < NumberOfOscillators * NumberOfOscillators; c++)
if (c % NumberOfOscillators == 0)
if (c % NumberOfOscillators != curOscillatorIndx)
dv_Connections_TerminalOscillatorID_Map.push_back(c % NumberOfOscillators);
for (int n = 0; n < NumberOfOscillators; n++)
for (int p = 0; p < NumberOfOscillators - 1; p++)
thrust::make_counting_iterator<int>(dv_Connections_TerminalOscillatorID_Map.size()), // indices from 0 to N
dv_Connections_TerminalOscillatorID_Map.begin(), // array data
dv_Permutation_Connections_To_TerminalOscillators.begin() + (n * (NumberOfOscillators - 1)), // result will be written here
_1 == n);
for (int c = 0; c < NumberOfOscillators * (NumberOfOscillators - 1); c++)
dv_Connection_Keys[c] = c / (NumberOfOscillators - 1);
auto t = clock();
for (int x = 0; x < 5000; ++x) //Set x maximum to a reasonable number while testing performance.
//dv_Connection_Keys = 0,0,0,...1,1,1,...2,2,2,...3,3,3...
dv_Connection_Keys.begin(), //keys_first The beginning of the input key range.
dv_Connection_Keys.end(), //keys_last The end of the input key range.
), //values_first The beginning of the input value range.
thrust::make_discard_iterator(), //keys_output The beginning of the output key range.
dv_OscillatorsVelocity.begin() //values_output The beginning of the output value range.
std::cout << "iterations time for original: " << (clock() - t) * (1000.0 / CLOCKS_PER_SEC) << "ms\n" << endl << endl;
thrust::copy(dv_OscillatorsVelocity.begin(), dv_OscillatorsVelocity.end(), dv_outputCompare.begin());
t = clock();
for (int x = 0; x < 5000; ++x) //Set x maximum to a reasonable number while testing performance.
thrust::make_counting_iterator(0) + dv_Connections_Active.size(),
s = dv_OscillatorsVelocity.size() - 1,
dv_b = thrust::raw_pointer_cast(,
dv_c = thrust::raw_pointer_cast(, //3,6,9,0,7,10,1,4,11,2,5,8
dv_ppa = thrust::raw_pointer_cast(,
dv_pps = thrust::raw_pointer_cast(
] __device__(int i) {
const int readIndex = i / s;
dv_b + readIndex,
(dv_ppa[dv_c[i]] * dv_pps[dv_c[i]])
std::cout << "iterations time for new: " << (clock() - t) * (1000.0 / CLOCKS_PER_SEC) << "ms\n" << endl << endl;
std::cout << "***" << (dv_OscillatorsVelocity == dv_outputCompare ? "success" : "fail") << "***\n";
return 0;
Extra info.:
My results are using a single GTX 980 TI.
There are 100 * (100 - 1) = 9,900 elements in all of the "Connection" vectors.
Each of the 100 unique keys found in dv_Connection_Keys has 99 elements each.
Use this compiler option: --expt-extended-lambda
What could be the biggest cause for the dramatic performance differences?
You are evidently building a debug project, that is your compilation settings include the -G switch. Although you were asked for your compilation settings in the comments, you didn't mention this.
It's important.
CUDA device code can have dramatically different performance characteristics when compiled with -G.
Don't evaluate performance of a debug project, or code compiled with -G.
When I compile and run your code without -G, I get:
iterations time for original: 210ms
iterations time for new: 70ms
When I compile your code with the debug switch -G, and run, I get:
iterations time for original: 12330ms
iterations time for new: 320ms
Returning to your question, that accounts for the biggest factor in the difference.
The following answer tries to explain or at least motivate the remaining difference in performance after going from a debug build to a release build as explained in Robert Crovella's answer.
As the accesses in both kernels are not coalesced due to the permutation_iterator/indirection through dv_c, going by the plain number of accesses will overestimate the performance in this case. thrust::reduce_by_key (or pretty much any Thrust algorithm) is not and cannot be optimized for general permutations of the input, as the performance of these bandwidth-bound kernels depends strongly on coalesced memory access. Naturally, the algorithms are written such that accesses are coalesced for normal contiguous input. So if you need to access the data in permuted order more than once (which might happen in a single reduction algorithm), it could be faster to actually permute the data in memory once using thrust::gather or thrust::scatter, so that at least all following accesses are efficient. I would not expect the for_each solution to beat reduce_by_key without that permutation.
Newer versions of nvcc will try to automatically use warp-aggregated atomics to reduce the number of actual atomic instructions on the same address. As neighboring threads (same warp) tend to atomically write to the same address, this optimization is crucial for the performance of your custom reduction. Another important detail is that s = NumberOfOscillators is relatively small (100) in your code compared to typical thread-block sizes (256, 512, 1024; locality of atomic writes) and the amount of parallelism in the for_each (~NumberOfOscillators^2). So for smaller NumberOfOscillators I expect your custom reduction to get worse than reduce_by_key due to the vanishing amount of parallelism, while for bigger NumberOfOscillators you get both much more parallelism and more thread blocks/warps writing to the same location, so it is not quite clear which one will win without benchmarking it for the given hardware and compiler.
C++17 added std::hardware_destructive_interference_size and std::hardware_constructive_interference_size. At first, I thought it was just a portable way to get the size of an L1 cache line, but that is an oversimplification.
How are these constants related to the L1 cache line size?
Is there a good example that demonstrates their use cases?
Both are defined static constexpr. Is that not a problem if you build a binary and execute it on other machines with different cache line sizes? How can it protect against false sharing in that scenario when you are not certain on which machine your code will be running?
The intent of these constants is indeed to get the cache-line size. The best place to read about the rationale for them is in the proposal itself:
I'll quote a snippet of the rationale here for ease-of-reading:
[...] the granularity of memory that does not interfere (to the first-order) [is] commonly referred to as the cache-line size.
Uses of cache-line size fall into two broad categories:
Avoiding destructive interference (false-sharing) between objects with temporally disjoint runtime access patterns from different threads.
Promoting constructive interference (true-sharing) between objects which have temporally local runtime access patterns.
The most significant issue with this useful implementation quantity is the questionable portability of the methods used in current practice to determine its value, despite their pervasiveness and popularity as a group. [...]
We aim to contribute a modest invention for this cause, abstractions for this quantity that can be conservatively defined for given purposes by implementations:
Destructive interference size: a number that’s suitable as an offset between two objects to likely avoid false-sharing due to different runtime access patterns from different threads.
Constructive interference size: a number that’s suitable as a limit on two objects’ combined memory footprint size and base alignment to likely promote true-sharing between them.
In both cases these values are provided on a quality of implementation basis, purely as hints that are likely to improve performance. These are ideal portable values to use with the alignas() keyword, for which there currently exists nearly no standard-supported portable uses.
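As a small illustration of the alignas() use mentioned in the rationale, the sketch below keeps two counters that are written by different threads on separate cache lines. PerThreadCounters and kDestructive are hypothetical names, and the fallback value of 64 is an assumption for standard libraries that do not provide the constant.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <new>

// Use the C++17 constant when the standard library provides it; otherwise
// fall back to 64 (an assumption that matches many current x86-64 CPUs).
#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kDestructive = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kDestructive = 64;
#endif

// Two counters updated by different threads. Aligning each member to the
// destructive interference size puts them at least that many bytes apart,
// so a write to one should not invalidate the cache line holding the other.
struct PerThreadCounters {
    alignas(kDestructive) std::atomic<long> produced{0};
    alignas(kDestructive) std::atomic<long> consumed{0};
};

static_assert(alignof(PerThreadCounters) == kDestructive, "");
static_assert(sizeof(PerThreadCounters) >= 2 * kDestructive, "");
```

The price is memory: the struct grows from two longs to two full interference-size blocks, which is why this is usually reserved for data that really is contended.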
"How are these constants related to the L1 cache line size?"
In theory, pretty directly.
Assume the compiler knows exactly what architecture you'll be running on - then these would almost certainly give you the L1 cache-line size precisely. (As noted later, this is a big assumption.)
For what it's worth, I would almost always expect these values to be the same. I believe the only reason they are declared separately is for completeness. (That said, maybe a compiler wants to estimate L2 cache-line size instead of L1 cache-line size for constructive interference; I don't know if this would actually be useful, though.)
"Is there a good example that demonstrates their use cases?"
At the bottom of this answer I've attached a long benchmark program that demonstrates false-sharing and true-sharing.
It demonstrates false-sharing by allocating an array of int wrappers: in one case multiple elements fit in the L1 cache-line, and in the other a single element takes up the L1 cache-line. In a tight loop, a single fixed element is chosen from the array and updated repeatedly.
It demonstrates true-sharing by allocating a single pair of ints in a wrapper: in one case, the two ints within the pair do not fit in L1 cache-line size together, and in the other they do. In a tight loop, each element of the pair is updated repeatedly.
Note that the code for accessing the object under test does not change; the only difference is the layout and alignment of the objects themselves.
I don't have a C++17 compiler (and assume most people currently don't either), so I've replaced the constants in question with my own. You need to update these values to be accurate on your machine. That said, 64 bytes is probably the correct value on typical modern desktop hardware (at the time of writing).
Warning: the test will use all cores on your machines, and allocate ~256MB of memory. Don't forget to compile with optimizations!
On my machine, the output is:
Hardware concurrency: 16
sizeof(naive_int): 4
alignof(naive_int): 4
sizeof(cache_int): 64
alignof(cache_int): 64
sizeof(bad_pair): 72
alignof(bad_pair): 4
sizeof(good_pair): 8
alignof(good_pair): 4
Running naive_int test.
Average time: 0.0873625 seconds, useless result: 3291773
Running cache_int test.
Average time: 0.024724 seconds, useless result: 3286020
Running bad_pair test.
Average time: 0.308667 seconds, useless result: 6396272
Running good_pair test.
Average time: 0.174936 seconds, useless result: 6668457
I get ~3.5x speedup by avoiding false-sharing, and ~1.7x speedup by ensuring true-sharing.
"Both are defined static constexpr. Is that not a problem if you build a binary and execute it on other machines with different cache line sizes? How can it protect against false sharing in that scenario when you are not certain on which machine your code will be running?"
This will indeed be a problem. These constants are not guaranteed to map to any cache-line size on the target machine in particular, but are intended to be the best approximation the compiler can muster up.
This is noted in the proposal, and in the appendix they give an example of how some libraries try to detect cache-line size at compile time based on various environmental hints and macros. You are guaranteed that this value is at least alignof(max_align_t), which is an obvious lower bound.
In other words, this value should be used as your fallback case; you are free to define a precise value if you know it, e.g.:
constexpr std::size_t cache_line_size() {
#ifdef KNOWN_L1_CACHE_LINE_SIZE
    return KNOWN_L1_CACHE_LINE_SIZE;
#else
    return std::hardware_destructive_interference_size;
#endif
}
During compilation, if you want to assume a cache-line size just define KNOWN_L1_CACHE_LINE_SIZE.
Hope this helps!
Benchmark program:
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <iostream>
#include <mutex>
#include <random>
#include <thread>
#include <tuple>
#include <vector>
constexpr std::size_t hardware_destructive_interference_size = 64;
constexpr std::size_t hardware_constructive_interference_size = 64;
constexpr unsigned kTimingTrialsToComputeAverage = 100;
constexpr unsigned kInnerLoopTrials = 1000000;
typedef unsigned useless_result_t;
typedef double elapsed_secs_t;
// wraps an int, default alignment allows false-sharing
struct naive_int {
    int value;
};
static_assert(alignof(naive_int) < hardware_destructive_interference_size, "");
// wraps an int, cache alignment prevents false-sharing
struct cache_int {
    alignas(hardware_destructive_interference_size) int value;
};
static_assert(alignof(cache_int) == hardware_destructive_interference_size, "");
// wraps a pair of int, purposefully pushes them too far apart for true-sharing
struct bad_pair {
    int first;
    char padding[hardware_constructive_interference_size];
    int second;
};
static_assert(sizeof(bad_pair) > hardware_constructive_interference_size, "");
// wraps a pair of int, ensures they fit nicely together for true-sharing
struct good_pair {
    int first;
    int second;
};
static_assert(sizeof(good_pair) <= hardware_constructive_interference_size, "");
// accesses a specific array element many times
template <typename T, typename Latch>
useless_result_t sample_array_threadfunc(
Latch& latch,
unsigned thread_index,
T& vec) {
// prepare for computation
std::random_device rd;
std::mt19937 mt{ rd() };
std::uniform_int_distribution<int> dist{ 0, 4096 };
auto& element = vec[vec.size() / 2 + thread_index];
// compute
for (unsigned trial = 0; trial != kInnerLoopTrials; ++trial) {
element.value = dist(mt);
return static_cast<useless_result_t>(element.value);
// accesses a pair's elements many times
template <typename T, typename Latch>
useless_result_t sample_pair_threadfunc(
Latch& latch,
unsigned thread_index,
T& pair) {
// prepare for computation
std::random_device rd;
std::mt19937 mt{ rd() };
std::uniform_int_distribution<int> dist{ 0, 4096 };
// compute
for (unsigned trial = 0; trial != kInnerLoopTrials; ++trial) {
pair.first = dist(mt);
pair.second = dist(mt);
return static_cast<useless_result_t>(pair.first) +
//////// UTILITIES:
// utility: allow threads to wait until everyone is ready
class threadlatch {
public:
    explicit threadlatch(const std::size_t count) :
        count_{ count }
    {}
    void count_down_and_wait() {
        std::unique_lock<std::mutex> lock{ mutex_ };
        if (--count_ == 0) {
            // last thread to arrive releases everyone
            cv_.notify_all();
        }
        else {
            cv_.wait(lock, [&] { return count_ == 0; });
        }
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t count_;
};
// utility: runs a given function in N threads
std::tuple<useless_result_t, elapsed_secs_t> run_threads(
const std::function<useless_result_t(threadlatch&, unsigned)>& func,
const unsigned num_threads) {
threadlatch latch{ num_threads + 1 };
std::vector<std::future<useless_result_t>> futures;
std::vector<std::thread> threads;
for (unsigned thread_index = 0; thread_index != num_threads; ++thread_index) {
std::packaged_task<useless_result_t()> task{
std::bind(func, std::ref(latch), thread_index)
const auto starttime = std::chrono::high_resolution_clock::now();
for (auto& thread : threads) {
const auto endtime = std::chrono::high_resolution_clock::now();
const auto elapsed = std::chrono::duration_cast<
endtime - starttime
useless_result_t result = 0;
for (auto& future : futures) {
result += future.get();
return std::make_tuple(result, elapsed);
// utility: sample the time it takes to run func on N threads
void run_tests(
const std::function<useless_result_t(threadlatch&, unsigned)>& func,
const unsigned num_threads) {
useless_result_t final_result = 0;
double avgtime = 0.0;
for (unsigned trial = 0; trial != kTimingTrialsToComputeAverage; ++trial) {
const auto result_and_elapsed = run_threads(func, num_threads);
const auto result = std::get<useless_result_t>(result_and_elapsed);
const auto elapsed = std::get<elapsed_secs_t>(result_and_elapsed);
final_result += result;
avgtime = (avgtime * trial + elapsed) / (trial + 1);
<< "Average time: " << avgtime
<< " seconds, useless result: " << final_result
<< std::endl;
int main() {
const auto cores = std::thread::hardware_concurrency();
std::cout << "Hardware concurrency: " << cores << std::endl;
std::cout << "sizeof(naive_int): " << sizeof(naive_int) << std::endl;
std::cout << "alignof(naive_int): " << alignof(naive_int) << std::endl;
std::cout << "sizeof(cache_int): " << sizeof(cache_int) << std::endl;
std::cout << "alignof(cache_int): " << alignof(cache_int) << std::endl;
std::cout << "sizeof(bad_pair): " << sizeof(bad_pair) << std::endl;
std::cout << "alignof(bad_pair): " << alignof(bad_pair) << std::endl;
std::cout << "sizeof(good_pair): " << sizeof(good_pair) << std::endl;
std::cout << "alignof(good_pair): " << alignof(good_pair) << std::endl;
std::cout << "Running naive_int test." << std::endl;
std::vector<naive_int> vec;
vec.resize((1u << 28) / sizeof(naive_int)); // allocate 256 mibibytes
run_tests([&](threadlatch& latch, unsigned thread_index) {
return sample_array_threadfunc(latch, thread_index, vec);
}, cores);
std::cout << "Running cache_int test." << std::endl;
std::vector<cache_int> vec;
vec.resize((1u << 28) / sizeof(cache_int)); // allocate 256 mibibytes
run_tests([&](threadlatch& latch, unsigned thread_index) {
return sample_array_threadfunc(latch, thread_index, vec);
}, cores);
std::cout << "Running bad_pair test." << std::endl;
bad_pair p;
run_tests([&](threadlatch& latch, unsigned thread_index) {
return sample_pair_threadfunc(latch, thread_index, p);
}, cores);
std::cout << "Running good_pair test." << std::endl;
good_pair p;
run_tests([&](threadlatch& latch, unsigned thread_index) {
return sample_pair_threadfunc(latch, thread_index, p);
}, cores);
I would almost always expect these values to be the same.
Regarding the above, I would like to make a minor contribution to the accepted answer. A while ago, I saw a very good use case in the folly library where these two should be defined separately. Please see the caveat about the Intel Sandy Bridge processor.
// Memory locations within the same cache line are subject to destructive
// interference, also known as false sharing, which is when concurrent
// accesses to these different memory locations from different cores, where at
// least one of the concurrent accesses is or involves a store operation,
// induce contention and harm performance.
// Microbenchmarks indicate that pairs of cache lines also see destructive
// interference under heavy use of atomic operations, as observed for atomic
// increment on Sandy Bridge.
// We assume a cache line size of 64, so we use a cache line pair size of 128
// to avoid destructive interference.
// mimic: std::hardware_destructive_interference_size, C++17
constexpr std::size_t hardware_destructive_interference_size =
kIsArchArm ? 64 : 128;
static_assert(hardware_destructive_interference_size >= max_align_v, "math?");
// Memory locations within the same cache line are subject to constructive
// interference, also known as true sharing, which is when accesses to some
// memory locations induce all memory locations within the same cache line to
// be cached, benefiting subsequent accesses to different memory locations
// within the same cache line and heping performance.
// mimic: std::hardware_constructive_interference_size, C++17
constexpr std::size_t hardware_constructive_interference_size = 64;
static_assert(hardware_constructive_interference_size >= max_align_v, "math?");
I've tested the above code, but I think there is a minor error that prevents us from understanding the underlying behavior: a single cache line should not be shared between two distinct atomics if false sharing is to be prevented.
I've changed the definition of those structs.
struct naive_int
alignas ( sizeof ( int ) ) atomic < int > value;
struct cache_int
alignas ( hardware_constructive_interference_size ) atomic < int > value;
struct bad_pair
// two atomics sharing a single 64 bytes cache line
alignas ( hardware_constructive_interference_size ) atomic < int > first;
atomic < int > second;
struct good_pair
// first cache line begins here
alignas ( hardware_constructive_interference_size ) atomic < int >
// That one is still in the first cache line
atomic < int > first_s;
// second cache line starts here
alignas ( hardware_constructive_interference_size ) atomic < int >
// That one is still in the second cache line
atomic < int > second_s;
And the resulting run:
Hardware concurrency := 40
sizeof(naive_int) := 4
alignof(naive_int) := 4
sizeof(cache_int) := 64
alignof(cache_int) := 64
sizeof(bad_pair) := 64
alignof(bad_pair) := 64
sizeof(good_pair) := 128
alignof(good_pair) := 64
Running naive_int test.
Average time: 0.060303 seconds, useless result: 8212147
Running cache_int test.
Average time: 0.0109432 seconds, useless result: 8113799
Running bad_pair test.
Average time: 0.162636 seconds, useless result: 16289887
Running good_pair test.
Average time: 0.129472 seconds, useless result: 16420417
I experienced a lot of variance in the last result, but I never dedicated any core specifically to that problem. Anyway, this ran on 2 Xeon 2690 v2 processors, and from various runs using 64 or 128 for hardware_constructive_interference_size I found 64 to be more than enough and 128 a very poor use of the available cache.
I suddenly realized that your question helps me understand what Jeff Preshing was talking about, all about payload!?
So I'm working on developing an online game, and one of the features of this game (like many other MMORPG's) is the drop system & upgrade system.
The drop system decides what items will drop from monsters when they are killed.
The upgrade system decides if an item will successfully upgrade to the next level or not.
They both need to be able to use probability to determine if:
An item Drops
An item upgrades successfully.
I've developed a system that generates a random number between 0 and 100000. In this system a 1% probability of either of the above happening would be represented by 1000. Similarly, a 0.5% would be 500... and 50% would be 50000.
Here is the guts of this code...
int RandomValueInRange(const int start, const int end)
{
    std::random_device rd;
    std::mt19937 generator(rd());
    const int stable_end = ((end < start) ? start : end);
    std::uniform_int_distribution<int> distribution(start, stable_end);
    return distribution(generator);
}
Now, in order to determine if an item drops or upgrades successfully, all I have to do is this...
const int random_value = RandomValueInRange(0, 100000);
const int probability = item.GetProbability(); // This simply returns an integer stored in a config file which represents the probability of this item being dropped/upgraded.
if (random_value <= probability)
    std::cout << "Probability Success!" << std::endl;
else
    std::cout << "Probability Failed!" << std::endl;
I would expect the above to work, but for whatever reason it seems faulty... Players are able to get items that have a 0.1% probability with ease (something that should almost never happen!).
Does anyone know of a better system or how I can improve this system to truly follow the probability guidelines....
std::random_device rd;
std::mt19937 generator(rd());
return distribution(generator);
I think the problem is here. The standard C++ library gives you a uniform distribution
if you reuse the random_device and the mt19937, but you recreate them on every call,
which is not how they are meant to be used.
Keep the std::random_device rd, the std::mt19937 and the distribution somewhere and reuse them.
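A minimal sketch of that advice, assuming the engine can live in a function-local `thread_local` variable so it is seeded once per thread. `RollSucceeds` is a hypothetical helper that is not in the original code; it draws from [1, 100000] instead of [0, 100000] so that exactly `probability` of the 100000 possible outcomes count as a success.

```cpp
#include <random>

// Same interface as the function in the question, but the engine is
// seeded once and then reused instead of being rebuilt on every call.
int RandomValueInRange(int start, int end)
{
    // Seeded from std::random_device on first use only (once per thread).
    thread_local std::mt19937 generator{std::random_device{}()};
    const int stable_end = (end < start) ? start : end;
    std::uniform_int_distribution<int> distribution(start, stable_end);
    return distribution(generator);
}

// Hypothetical helper: succeeds with probability `probability` / 100000.
bool RollSucceeds(int probability)
{
    return RandomValueInRange(1, 100000) <= probability;
}
```

With this helper a configured value of 1000 succeeds in 1000 of 100000 outcomes, i.e. 1%, and a value of 0 never succeeds.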
Ok, so the problem with your code is that you are choosing a random number between 0 and 100,000. Anyone can get between 1 and 100 with a bit of luck, because, if you think about it, 100 is a pretty big number and shouldn't be too hard to get.
Also, if you go back to Primary/Elementary (or whatever you want to call it) school maths books, you will see in the 'probability and chance' chapter, some questions like:
If there are 6 balls in a bag, 3 red, 1 green and 2 blue, then what is the chance of choosing a blue?
Of course, you would've answered 2/6 or 1/3. In C++, this can be changed to something like this:
#include <iostream>
#include <ctime>
#include <algorithm>
#include <random>
using namespace std;

class MoreProbability {
public:
    static void GetProbability(int min, int max, int probability) {
        int probabilityArray[100000];
        // The first `probability` slots hold 1 (success), the rest hold 0
        for (int i = 0; i < max; i++) {
            if (i < probability) {
                probabilityArray[i] = 1;
            }
            else {
                probabilityArray[i] = 0;
            }
        }
        // Be sure to seed the generator, otherwise every run shuffles the same way
        static mt19937 engine(static_cast<unsigned>(time(nullptr)));
        // Arrays go from 0 to max-1, so the end of the range is one past max-1
        shuffle(&probabilityArray[0], &probabilityArray[0] + max, engine);
        // Check if the first element of the randomly shuffled array is equal to 1
        if (probabilityArray[0] == 1) {
            cout << "Probability Successful" << endl;
        }
        else {
            cout << "Probability Failed" << endl;
        }
    }
};

int main() {
    MoreProbability::GetProbability(0, 100000, 100);
    return 0;
}
It may give a StackOverflowException. To fix this, simply increase the 'Stack Reserve Size'.
After changing the code around a bit to return a 1 or a 0 based on the outcome, and putting it into a for loop which repeated itself 1000 times (I do NOT recommend trying this as it takes a while to complete), I got an output of 1, clearly showing that this piece of code works perfectly.
My code requires continuously computing a value from the following function:
inline double f (double x) {
    return ( tanh( 3*(5-x) ) *0.5 + 0.5);
}
Profiling indicates that this part of the program is where most of the time is spent. Since the program will run for weeks if not months, I would like to optimize this operation and am considering the use of a lookup table.
I know that the efficiency of a lookup table depends on the size of the table itself, and on the way it's designed. Currently I cannot use less than 100 MB and can use up to 2GB. Values between two points in the matrix will be linearly interpolated.
Would using a lookup table be faster than doing the computation? Also, would using an N-dimensional matrix be better than a 1-D std::vector and what is the threshold (if any) on the size of the table that should not be crossed?
I'm writing code that continuously needs to compute a value from a particular function. After some profiling, I discovered that this part of my program is where most of the time is spent.
So far, I'm not allowed to use less than 100 MB, and I can use up to 2GB. Linear interpolation will be used for points between two points in the matrix.
If you have a huge lookup table (hundreds of MB, as you said) that does not fit in cache, the memory lookup time will most likely be much higher than the calculation itself. RAM is "very slow", especially when fetching from random locations in huge arrays.
Here is a synthetic test:
#include <boost/progress.hpp>
#include <iostream>
#include <ostream>
#include <vector>
#include <cmath>
using namespace boost;
using namespace std;

inline double calc(double x)
{
    return ( tanh( 3*(5-x) ) *0.5 + 0.5);
}

template<typename F>
void test(F &&f)
{
    progress_timer t;    // prints the elapsed time when it goes out of scope
    volatile double res; // volatile keeps the loop from being optimized away
    for(unsigned i=0;i!=1<<26;++i)
        res = f(i);
}

int main()
{
    const unsigned size = (1 << 26) + 1;
    vector<double> table(size);
    cout << "table size is " << 1.0*sizeof(double)*size/(1 << 20) << "MiB" << endl;
    cout << "calc ";
    test(calc);
    cout << "dummy lookup ";
    test([&](unsigned i){return table[(i << 12)%size];}); // dummy lookup, not real values
}
Output on my machine is:
table size is 512MiB
calc 0.52 s
dummy lookup 0.92 s
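For comparison, here is a sketch of the opposite design: a small, linearly interpolated table that can stay in cache. The class `InterpolatedTable`, the clamping outside the sampled interval, and the idea of choosing the point count yourself are all assumptions for illustration, not something measured above. Whether it beats calling tanh directly has to be timed on the target machine.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

inline double calc(double x)
{
    return std::tanh(3 * (5 - x)) * 0.5 + 0.5;
}

// Hypothetical helper: samples calc() at n evenly spaced points on
// [lo, hi] and interpolates linearly between neighbouring samples.
class InterpolatedTable {
public:
    InterpolatedTable(double lo, double hi, std::size_t n)
        : lo_(lo), inv_step_((n - 1) / (hi - lo)), values_(n)
    {
        assert(n >= 2 && hi > lo);
        for (std::size_t i = 0; i < n; ++i)
            values_[i] = calc(lo + i / inv_step_);
    }

    double operator()(double x) const
    {
        const double pos = (x - lo_) * inv_step_;
        // Assumption: outside [lo, hi] the end values are good enough.
        if (pos <= 0) return values_.front();
        if (pos >= values_.size() - 1) return values_.back();
        const std::size_t i = static_cast<std::size_t>(pos);
        const double frac = pos - i;
        return values_[i] + frac * (values_[i + 1] - values_[i]);
    }

private:
    double lo_;
    double inv_step_;
    std::vector<double> values_;
};
```

A table built as `InterpolatedTable table(0.0, 10.0, 4096);` holds 32 KiB of doubles, and `table(x)` can then be timed with the same `test` harness as above.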
My C++ code evaluates very large integrals on timeseries data (t2 >> t1). The integrals are fixed length and currently stored in [m x 2] column array of doubles. Column 1 is time. Column 2 is the signal that's being integrated. The code is running on a quadcore or 8 core machine.
For a machine with k cores, I want to:
Spin off k-1 worker processes (one for each of the remaining cores) to evaluate portions of the integral (trapezoidal integrations) and return their results to the waiting master thread.
Achieve the above without deep copying portions of the original array.
Implement C++11 async template for portability
How can I achieve the above without hardcoding the number of available cores?
I am currently using VS 2012.
Update for clarity:
For example, here's the rough pseudo-code:
data is [100000,2] double
result = MyIntegrator(data[1:50000,1:2]) + MyIntegrator(data[50001:100000, 1:2]);
I need the MyIntegrator() functions to be evaluated in separate threads. The master thread waits for the two results.
Here is source that does a multi-threaded integration of the problem.
#include <cstddef>
#include <utility>
#include <vector>
#include <memory>
#include <future>
#include <iterator>
#include <iostream>

struct sample {
    double duration;
    double value;
};

// A half-open range of samples; only pointers are passed around, so nothing is copied
typedef std::pair<sample*, sample*> data_range;
sample* begin( data_range const& r ) { return r.first; }
sample* end( data_range const& r ) { return r.second; }

typedef std::unique_ptr< std::future< double > > todo_item;

double integrate( data_range r ) {
    double total = 0.;
    for( auto&& s:r ) {
        total += s.duration * s.value;
    }
    return total;
}

todo_item threaded_integration( data_range r ) {
    return todo_item( new std::future<double>( std::async( integrate, r )) );
}

double integrate_over_threads( data_range r, std::size_t threads ) {
    if (threads > std::size_t(r.second-r.first))
        threads = r.second-r.first;
    if (threads == 0)
        threads = 1;
    sample* begin = r.first;
    sample* end = r.second;

    std::vector< std::unique_ptr< std::future< double > > > todo_list;

    sample* highwater = begin;
    while (highwater != end) {
        // Give this task an even share of whatever is left
        sample* new_highwater = (end-highwater)/threads+highwater;
        --threads; // one fewer task remains, so the last task takes the rest
        todo_item item = threaded_integration( data_range(highwater, new_highwater) );
        todo_list.push_back( std::move(item) );
        highwater = new_highwater;
    }
    double total = 0.;
    for (auto&& item: todo_list) {
        total += item->get();
    }
    return total;
}

sample data[5] = {
    {1., 1.},
    {1., 2.},
    {1., 3.},
    {1., 4.},
    {1., 5.},
};

int main() {
    using std::begin; using std::end;
    double result = integrate_over_threads( data_range( begin(data), end(data) ), 2 );
    std::cout << result << "\n";
}
It requires some modification to read data in exactly the format you specified.
But you can call it with std::thread::hardware_concurrency() as the number of threads, and it should work.
(In particular, to keep it simple, I have pairs of (duration, value) rather than (time, value), but that is just a minor detail).
What about std::thread::hardware_concurrency()?
Get the number of available cores; usually this can be found with std::thread::hardware_concurrency():
Returns number of concurrent threads supported by the implementation. The value should be considered only a hint.
If this is zero then you can try running specific commands based on the OS.
This seems to be a good way to find out the number of cores.
You'll still need to do testing to determine whether multithreading will even give you tangible benefits; remember not to optimize prematurely :)
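A sketch of how the zero case might be handled. The names `worker_count` and `parallel_sum` and the fallback of 2 are hypothetical; the summation merely stands in for the real integration, and only pointers into the caller's array are handed to the tasks, so nothing is copied.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <numeric>
#include <thread>
#include <vector>

// hardware_concurrency() is only a hint and may return 0 when unknown.
std::size_t worker_count(std::size_t fallback = 2)
{
    const unsigned hint = std::thread::hardware_concurrency();
    return hint != 0 ? hint : fallback;
}

// Sums [first, last) in `workers` contiguous chunks, one task per chunk.
double parallel_sum(const double* first, const double* last, std::size_t workers)
{
    const std::size_t n = static_cast<std::size_t>(last - first);
    workers = std::max<std::size_t>(1, std::min(workers, n));

    std::vector<std::future<double>> parts;
    const double* chunk = first;
    for (std::size_t w = 0; w < workers; ++w) {
        // Even share of what is left; the last chunk takes the remainder.
        const double* chunk_end =
            chunk + static_cast<std::size_t>(last - chunk) / (workers - w);
        parts.push_back(std::async(std::launch::async, [chunk, chunk_end] {
            return std::accumulate(chunk, chunk_end, 0.0);
        }));
        chunk = chunk_end;
    }

    double total = 0.0;
    for (auto& part : parts)
        total += part.get();
    return total;
}
```

A call such as `parallel_sum(v.data(), v.data() + v.size(), worker_count())` then adapts to the machine without hardcoding the core count.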
You could overschedule and see if it hurts your performance. Split your array into small fixed-length intervals (each computable within one scheduler time quantum, and perhaps fitting in one cache page) and see how that compares in performance with splitting according to the number of CPUs.
Use std::packaged_task and pass it to a thread to make sure that you're not hurt by "launch" configuration.
The next step would be introducing a thread pool, but that's more complicated.
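The std::packaged_task suggestion can be sketched as follows. The function name `sum_on_own_thread` is hypothetical and the summation again stands in for the real integration. Unlike std::async with the default launch policy, the task here is guaranteed to run on a new thread.

```cpp
#include <future>
#include <numeric>
#include <thread>
#include <utility>

// Runs one chunk of work on an explicitly created thread.
double sum_on_own_thread(const double* first, const double* last)
{
    std::packaged_task<double()> task([first, last] {
        return std::accumulate(first, last, 0.0);
    });
    std::future<double> result = task.get_future();

    // The task is move-only, so it is moved into the thread.
    std::thread worker(std::move(task));
    worker.join();

    // The result (or a stored exception) is retrieved after the join.
    return result.get();
}
```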
You could accept a command-line parameter for the number of worker threads.
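One way this might look, as a sketch: the helper name `workers_from_args` and the convention that the count is the first argument are assumptions. Anything that is not a positive decimal number falls back to the hardware hint.

```cpp
#include <cstddef>
#include <cstdlib>
#include <thread>

// Returns the worker count given as argv[1], or a hint-based default.
std::size_t workers_from_args(int argc, char** argv)
{
    if (argc > 1 && argv[1][0] >= '0' && argv[1][0] <= '9') {
        char* end = nullptr;
        const unsigned long requested = std::strtoul(argv[1], &end, 10);
        // Accept only a fully parsed, non-zero number.
        if (*end == '\0' && requested > 0)
            return requested;
    }
    const unsigned hint = std::thread::hardware_concurrency();
    return hint != 0 ? hint : 1;
}
```

In `main(int argc, char** argv)` the result can be passed straight to `integrate_over_threads` as the thread count.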