Lookup table vs runtime computation efficiency - C++

My code requires continuously computing a value from the following function:
inline double f(double x) {
    return ( tanh( 3*(5-x) ) * 0.5 + 0.5 );
}
Profiling indicates that this part of the program is where most of the time is spent. Since the program will run for weeks if not months, I would like to optimize this operation and am considering the use of a lookup table.
I know that the efficiency of a lookup table depends on the size of the table itself, and on the way it's designed. Currently I cannot use less than 100 MB and can use up to 2GB. Values between two points in the matrix will be linearly interpolated.
Would using a lookup table be faster than doing the computation? Also, would using an N-dimensional matrix be better than a 1-D std::vector and what is the threshold (if any) on the size of the table that should not be crossed?
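For reference, the kind of 1-D table with linear interpolation I have in mind looks roughly like the sketch below (the table size and the covered range [0, 10] are arbitrary illustrative choices here, and values outside the range are clamped):
#include <cmath>
#include <cstddef>
#include <vector>

struct LerpTable {
    double lo, hi, step;
    std::vector<double> y;

    // n must be >= 2; samples f(x) = tanh(3*(5-x))*0.5 + 0.5 on [lo_, hi_]
    LerpTable(double lo_, double hi_, std::size_t n)
        : lo(lo_), hi(hi_), step((hi_ - lo_) / (n - 1)), y(n)
    {
        for (std::size_t i = 0; i < n; ++i)
            y[i] = std::tanh(3 * (5 - (lo + i * step))) * 0.5 + 0.5;
    }

    double operator()(double x) const {
        if (x <= lo) return y.front();   // clamp below
        if (x >= hi) return y.back();    // clamp above
        double t = (x - lo) / step;
        std::size_t i = static_cast<std::size_t>(t);
        double frac = t - i;
        return y[i] + frac * (y[i + 1] - y[i]);   // linear interpolation
    }
};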

If you have a huge lookup table (hundreds of MB, as you said) that does not fit in the cache, the memory lookup time will most likely be much higher than the calculation itself. RAM is "very slow", especially when fetching from random locations of huge arrays.
Here is a synthetic test:
#include <boost/progress.hpp>
#include <iostream>
#include <ostream>
#include <vector>
#include <cmath>
using namespace boost;
using namespace std;

inline double calc(double x)
{
    return ( tanh( 3*(5-x) ) *0.5 + 0.5);
}

template<typename F>
void test(F &&f)
{
    progress_timer t;
    volatile double res;
    for(unsigned i=0;i!=1<<26;++i)
        res = f(i);
    (void)res;
}

int main()
{
    const unsigned size = (1 << 26) + 1;
    vector<double> table(size);
    cout << "table size is " << 1.0*sizeof(double)*size/(1 << 20) << "MiB" << endl;
    cout << "calc ";
    test(calc);
    cout << "dummy lookup ";
    test([&](unsigned i){return table[(i << 12)%size];}); // dummy lookup, not real values
}
Output on my machine is:
table size is 512MiB
calc 0.52 s
dummy lookup 0.92 s


prime number below 2 billion - usage of std::list hinders performance

The problem statement is to find the prime numbers below 2 billion in a time frame of < 20 sec.
I followed the approaches below.
Divide the number n by the list of numbers k (k < sqrt(n)) - took 20 sec.
Divide the number n by the list of prime numbers below sqrt(n). In this scenario I stored the prime numbers in a std::list - took more than 180 sec.
Can someone help me understand why the 2nd approach took longer even though we reduced the number of divisions by roughly 50%? Or did I choose the wrong data structure?
Approach 1:
#include <iostream>
#include <list>
#include <cmath>
#include <ctime>
using namespace std;

list<long long> primeno;
void ListPrimeNumber();

int main()
{
    clock_t time_req = clock();
    ListPrimeNumber();
    time_req = clock() - time_req;
    cout << "time taken " << static_cast<float>(time_req) / CLOCKS_PER_SEC << " seconds" << endl;
    return 0;
}

void check_prime(int i);

void ListPrimeNumber()
{
    primeno.push_back(2);
    primeno.push_back(3);
    primeno.push_back(5);
    for (long long i = 6; i <= 20000000; i++)
    {
        check_prime(i);
    }
}

void check_prime(int i)
{
    try
    {
        int j = 0;
        int limit = sqrt(i);
        for (j = 2; j <= limit; j++)
        {
            if (i % j == 0)
            {
                break;
            }
        }
        if (j > limit)
        {
            primeno.push_back(i);
        }
    }
    catch (exception ex)
    {
        std::cout << "Message";
    }
}
Approach 2:
#include <iostream>
#include <list>
#include <cmath>
#include <ctime>
using namespace std;

list<long long> primeno;
int noofdiv = 0;
void ListPrimeNumber();

int main()
{
    clock_t time_req = clock();
    ListPrimeNumber();
    time_req = clock() - time_req;
    cout << "time taken " << static_cast<float>(time_req) / CLOCKS_PER_SEC << " seconds" << endl;
    cout << "No of divisions : " << noofdiv;
    return 0;
}

void check_prime(int i);

void ListPrimeNumber()
{
    primeno.push_back(2);
    primeno.push_back(3);
    primeno.push_back(5);
    for (long long i = 6; i <= 10000; i++)
    {
        check_prime(i);
    }
}

void check_prime(int i)
{
    try
    {
        int limit = sqrt(i);
        for (int iter : primeno)
        {
            noofdiv++;
            if (iter <= limit && (i % iter) == 0)
            {
                break;
            }
            else if (iter > limit)
            {
                primeno.push_back(i);
                break;
            }
        }
    }
    catch (exception ex)
    {
        std::cout << "Message";
    }
}
The reason your second example takes longer is that you're iterating a std::list.
A std::list in C++ is a linked list, which means it doesn't use contiguous memory. This is bad because to iterate the list you must jump from node to node in a way that is unpredictable to the CPU's prefetcher. Also, you're most likely only "using" a few bytes of each cache line. RAM is slow: fetching a byte from RAM takes a lot longer than fetching it from L1. CPUs are fast these days, so your program spends most of its time doing nothing, waiting for memory to arrive.
Use a std::vector instead. It stores all values one after the other and iterating is very cheap. Since you're iterating forward in memory without jumping, you use the full cache line and your prefetcher will be able to fetch further pages before you need them, because your memory access pattern is predictable.
It has been proven by numerous people, including Bjarne Stroustrup, that std::vector is in a lot of cases faster than std::list, even in cases where the std::list has "theoretically" better complexity (random insert, delete, ...) just because caching helps a lot. So always use std::vector as your default. And if you think a linked list would be faster in your case, measure it and be surprised that - most of the time - std::vector dominates.
Edit: as others have noted, your method of finding primes isn't very efficient. I just played around a bit and implemented a Sieve of Eratosthenes using a bitset.
constexpr int max_prime = 1000000000;
std::bitset<max_prime> *bitset = new std::bitset<max_prime>{};
// Note: Bit SET means NO prime
bitset->set(0);
bitset->set(1);
for (int i = 4; i < max_prime; i += 2)
    bitset->set(i); // set all even numbers
int max = sqrt(max_prime);
for (int i = 3; i < max; i += 2) { // No point testing even numbers as they can't be prime
    if (!bitset->test(i)) { // If i is prime
        for (int j = i * 2; j < max_prime; j += i)
            bitset->set(j); // set all multiples of i to non-prime
    }
}
This takes about 30 seconds (it used to take between 4.2 and 4.5 seconds; not sure why it changed that much after slight modifications... must be an optimization I'm not hitting anymore) to find all primes below one billion (1,000,000,000) on my machine. Your approach took way too long even for 100 million. I cancelled the 1 billion search after about two minutes.
Comparison for 100 million:
time taken: 63.515 seconds
time taken bitset: 1.874 seconds
No of divisions : 1975961174
No of primes found: 5761455
No of primes found bitset: 5761455
I'm not a mathematician, so I'm pretty sure there are still ways to optimize it further; I only optimize for even numbers.
The first thing to do is make sure you are compiling with optimisations enabled. The C++ standard library template classes tend to perform very poorly with unoptimised code as they generate lots of function calls. The optimiser inlines most of these function calls, which makes them much cheaper.
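For example, with GCC or Clang that just means turning the optimiser on when building (a typical invocation; the file names are placeholders and the exact flags depend on your toolchain):
g++ -O2 primes.cpp -o primes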
std::list is a linked list. It is mostly useful where you want to insert or remove elements randomly (i.e. not from the end).
For the case where you are only appending to the end of a list std::list has the following issues:
Iterating through the list is relatively expensive as the code has to follow node pointers and then retrieve the data
The list uses quite a lot more memory, each element needs a pointer to the previous and next nodes in addition to the actual data. On a 64-bit system this equates to 20 bytes per element rather than 4 for a list of int
As the elements in the list are not contiguous in memory the compiler can't perform as many SIMD optimisations and you will suffer more from CPU cache misses
A std::vector would solve all of the above as its memory is contiguous and iterating through it is basically just a case of incrementing an array index. You do need to make sure that you call reserve on your vector at the beginning with a sufficiently large value so that appending to the vector doesn't cause the whole array to be copied to a new larger array.
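A minimal sketch of that change for the prime list in this question (the 1.2 * n / ln(n) capacity estimate is only a rough guess at the prime count, not an exact bound):
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<long long> primes;

void init_primes(long long n)
{
    // reserve roughly enough space for all primes below n (pi(n) ~ n / ln n),
    // so push_back never has to reallocate the whole array
    primes.reserve(static_cast<std::size_t>(1.2 * static_cast<double>(n) / std::log(static_cast<double>(n))));
    primes.push_back(2);
    primes.push_back(3);
    primes.push_back(5);
}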
A bigger optimisation than the above would be to use the Sieve of Eratosthenes to calculate your primes. As generating this might require random deletions (depending on your exact implementation), std::list might perform better than std::vector there, though even in this case the benefits of std::list might not outweigh its costs.
A test at Ideone (the OP's code with a few superficial alterations) completely contradicts the claims made in this question:
check_prime__list:
        time taken      No of divisions    No of primes
10M:    0.873 seconds       286144936          664579
20M:    2.169 seconds       721544444         1270607
 2B:    projected time: at least 16 minutes but likely much more (*)

check_prime__nums:
        time taken      No of divisions    No of primes
10M:    4.650 seconds      1746210131          664579
20M:   12.585 seconds      4677014576         1270607
 2B:    projected time: at least 3 hours but likely much more (*)
I also changed the type of the number-of-divisions counter to long int, because it was wrapping around its data type limit; the OP could have been misled by that.
But the run time wasn't being affected by that. A wall clock is a wall clock.
The most likely explanation seems to be sloppy testing by the OP, with different values used in each test case by mistake.
(*) The time projection was made by the empirical orders of growth analysis:
100**1.32 * 2.169 / 60 = 15.8
100**1.45 * 12.585 / 3600 = 2.8
Empirical orders of growth, as measured on the given range of sizes, were noticeably better for the list algorithm: n^1.32 vs. the n^1.45 for testing by all numbers. This is entirely expected from theoretical complexity, since there are fewer primes than all numbers up to n, by a factor of log n, for a total complexity of O(n^1.5 / log n) vs. O(n^1.5). It is also highly unlikely for any implementational discrepancy to beat an actual algorithmic advantage.
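For reference, those exponents follow from the ratio of run times when the problem size doubles (from 10M to 20M):
empirical order of growth ≈ log( t(2n) / t(n) ) / log 2
list version: log( 2.169 / 0.873 ) / log 2 ≈ 1.3
all-numbers version: log( 12.585 / 4.650 ) / log 2 ≈ 1.4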

How much do "if" statements affect performance?

There are some IPTables with different sizes (e.g. 255, 16384, or 512000!). Every entry of each table holds a unique IP address (in hex format) and some other values. The total number of IPs is 8 million.
All IPs of all IPTables are sorted.
We need to search the IPTables 300,000 times per second. Our current algorithm for finding an IP is as follows:
// 10 < number of IPTables < 20
// _rangeCount = number of IPTables
s_EntryItem* searchIPTable(const uint32_t& ip) {
    for (int i = 0; i < _rangeCount; i++) {
        if (ip > _ipTable[i].start && ip < _ipTable[i].end) {
            int index = ip - _ipTable[i].start;
            return (_ipTable[i].p_entry + index);
        }
    }
    return NULL;
}
As can be seen, in the worst case the number of comparisons for a given IP address is _rangeCount * 2, and the number of "if" checks is _rangeCount.
Suppose I want to change searchIPTable and use a more efficient way to find an IP address in the IPTables. As far as I know, for a sorted array a well-known search algorithm such as binary search needs log(n) comparisons in the worst case.
So the number of comparisons to find an IP address is log2(8,000,000), which is about 23.
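For illustration, a binary search over the sorted ranges could look like the sketch below (assuming _ipTable is sorted by start and the ranges do not overlap; note that with only 10-20 ranges this is log2(20) ≈ 4-5 steps, while the ~23 figure would apply to a binary search over all 8 million individual IPs):
s_EntryItem* searchIPTableBinary(const uint32_t& ip) {
    int lo = 0, hi = _rangeCount - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (ip <= _ipTable[mid].start) {
            hi = mid - 1;   // ip lies below this range (bounds kept exclusive, as in the loop above)
        } else if (ip >= _ipTable[mid].end) {
            lo = mid + 1;   // ip lies above this range
        } else {
            return _ipTable[mid].p_entry + (ip - _ipTable[mid].start);
        }
    }
    return NULL;
}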
Question 1:
As can be seen, there is a small gap between the number of comparisons needed by the two algorithms (_rangeCount vs. 23), but in the first method there are some "if" statements that could affect performance. If you run the first algorithm only 10 times, it obviously performs better, but I have no idea what happens when running the two algorithms 300,000 times per second! What is your idea?
Question 2:
Is there a more efficient algorithm or solution to search IPs?
Curiosity piqued, I wrote a test program (below) and ran it on my MacBook.
It suggests that a naive solution, based on a std::unordered_map (lookup time == constant time), is able to search an IPv4 address table with 8 million entries 5.6 million times per second.
This easily outperforms the requirements.
Update: responding to my critics, I have increased the test space to the required 8M IP addresses. I have also increased the test size to 100 million searches, 20% of which will be a hit.
With a test this large we can clearly see the performance benefits of using an unordered_map when compared to an ordered map (logarithmic time lookups).
All test parameters are configurable.
#include <iostream>
#include <vector>
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <map>
#include <random>
#include <tuple>
#include <iomanip>
#include <utility>

namespace detail
{
    template<class T>
    struct has_reserve
    {
        template<class U> static auto test(U* p) -> decltype(p->reserve(std::declval<std::size_t>()), void(), std::true_type());
        template<class U> static auto test(...) -> decltype(std::false_type());
        using type = decltype(test<T>((T*)0));
    };
}
template<class T>
using has_reserve = typename detail::has_reserve<T>::type;

using namespace std::literals;

struct data_associated_with_ip {};
using ip_address = std::uint32_t;
using candidate_vector = std::vector<ip_address>;

static constexpr std::size_t search_space_size = 8'000'000;
static constexpr std::size_t size_of_test = 100'000'000;

std::vector<ip_address> make_random_ip_set(std::size_t size)
{
    std::unordered_set<ip_address> results;
    results.reserve(size);
    std::random_device rd;
    std::default_random_engine eng(rd());
    auto dist = std::uniform_int_distribution<ip_address>(0, 0xffffffff);
    while (results.size() < size)
    {
        auto candidate = dist(eng);
        results.emplace(candidate);
    }
    return { std::begin(results), std::end(results) };
}

template<class T, std::enable_if_t<not has_reserve<T>::value> * = nullptr>
void maybe_reserve(T& container, std::size_t size)
{
    // nop
}

template<class T, std::enable_if_t<has_reserve<T>::value> * = nullptr>
decltype(auto) maybe_reserve(T& container, std::size_t size)
{
    return container.reserve(size);
}

template<class MapType>
void build_ip_map(MapType& result, candidate_vector const& chosen)
{
    maybe_reserve(result, chosen.size());
    result.clear();
    for (auto& ip : chosen)
    {
        result.emplace(ip, data_associated_with_ip{});
    }
}

// build a vector of candidates to try against our map
// some percentage of the time we will select a candidate that we know is in the map
candidate_vector build_candidates(candidate_vector const& known)
{
    std::random_device rd;
    std::default_random_engine eng(rd());
    auto ip_dist = std::uniform_int_distribution<ip_address>(0, 0xffffffff);
    auto select_known = std::uniform_int_distribution<std::size_t>(0, known.size() - 1);
    auto chance = std::uniform_real_distribution<double>(0, 1);
    static constexpr double probability_of_hit = 0.2;

    candidate_vector result;
    result.reserve(size_of_test);
    std::generate_n(std::back_inserter(result), size_of_test, [&]
    {
        if (chance(eng) < probability_of_hit)
        {
            return known[select_known(eng)];
        }
        else
        {
            return ip_dist(eng);
        }
    });
    return result;
}

int main()
{
    candidate_vector known_candidates = make_random_ip_set(search_space_size);
    candidate_vector random_candidates = build_candidates(known_candidates);

    auto run_test = [&known_candidates, &random_candidates]
    (auto const& search_space)
    {
        std::size_t hits = 0;
        auto start_time = std::chrono::high_resolution_clock::now();
        for (auto& candidate : random_candidates)
        {
            auto ifind = search_space.find(candidate);
            if (ifind != std::end(search_space))
            {
                ++hits;
            }
        }
        auto stop_time = std::chrono::high_resolution_clock::now();
        using fns = std::chrono::duration<long double, std::chrono::nanoseconds::period>;
        using fs = std::chrono::duration<long double, std::chrono::seconds::period>;
        auto interval = fns(stop_time - start_time);
        auto time_per_hit = interval / random_candidates.size();
        auto hits_per_sec = fs(1.0) / time_per_hit;
        std::cout << "ip addresses in table: " << search_space.size() << std::endl;
        std::cout << "ip addresses searched: " << random_candidates.size() << std::endl;
        std::cout << "total search hits : " << hits << std::endl;
        std::cout << "searches per second : " << std::fixed << hits_per_sec << std::endl;
    };

    {
        std::cout << "building unordered map:" << std::endl;
        std::unordered_map<ip_address, data_associated_with_ip> um;
        build_ip_map(um, known_candidates);
        std::cout << "testing with unordered map:" << std::endl;
        run_test(um);
    }

    {
        std::cout << "\nbuilding ordered map :" << std::endl;
        std::map<ip_address, data_associated_with_ip> m;
        build_ip_map(m, known_candidates);
        std::cout << "testing with ordered map :" << std::endl;
        run_test(m);
    }
}
example results:
building unordered map:
testing with unordered map:
ip addresses in table: 8000000
ip addresses searched: 100000000
total search hits : 21681856
searches per second : 5602458.505577
building ordered map :
testing with ordered map :
ip addresses in table: 8000000
ip addresses searched: 100000000
total search hits : 21681856
searches per second : 836123.513710
Test conditions:
MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Release build (-O2)
Running on mains power.
In these kinds of situations, the only practical way to determine the fastest implementation is to implement both approaches, and then benchmark each one.
And sometimes it's faster to do that than to try to figure out which one will be faster. And sometimes, if you just try to figure it out and then proceed with your chosen approach, you will eventually discover that you were wrong.
It looks like your problem is not the performance cost of an if statement, but rather what data structure can give you an answer to the question “do you contain this element?” as fast as possible. If that is true, how about using a Bloom Filter?
Data structures that offer fast lookup (faster than logarithmic complexity) are hash tables, which, on average, have O(1) complexity. One such implementation is in Boost.Unordered.
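A minimal sketch of that idea with the standard library's std::unordered_map (Boost.Unordered offers an equivalent interface); s_EntryItem stands in for the entry type from the question:
#include <cstdint>
#include <unordered_map>

struct s_EntryItem;  // entry type from the question

// built once from the existing tables, then queried in O(1) on average
std::unordered_map<uint32_t, s_EntryItem*> ip_index;

s_EntryItem* lookupIP(uint32_t ip) {
    auto it = ip_index.find(ip);
    return it != ip_index.end() ? it->second : nullptr;
}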
Of course you'd need to test with real data... but thinking of IPv4 I would first try a different approach:
EntryItem* searchIPTable(uint32_t ip) {
    EntryItem** tab = master_table[ip >> 16];
    return tab ? tab[ip & 65535] : NULL;
}
In other words a master table of 65536 entries that are pointers to detail tables of 65536 entries each.
Depending on the type of data a different subdivision instead of 16+16 bits could work better (less memory).
It could also make sense for the detail pages to hold IP entries directly instead of pointers to entries.
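A rough sketch of how such a master/detail structure might be populated (assuming an EntryItem type with an ip field, as hinted at by the question; allocation is kept naive and the detail tables are never freed here):
#include <cstdint>

struct EntryItem { uint32_t ip; /* other fields */ };

// master_table[high 16 bits] -> detail table of 65536 EntryItem* (or NULL)
EntryItem** master_table[65536] = {};

void insertIP(EntryItem* entry) {
    uint32_t hi = entry->ip >> 16;
    if (!master_table[hi]) {
        master_table[hi] = new EntryItem*[65536]();  // zero-initialized, all slots NULL
    }
    master_table[hi][entry->ip & 65535] = entry;
}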

Most efficient way to remove punctuation marks from a string in C++

I'm trying to find the most efficient way to remove punctuation marks from a string in C++; this is what I currently have.
#include <iostream>
#include <string>
#include <fstream>
#include <iomanip>
#include <stdlib.h>
#include <cctype>
#include <algorithm>
using namespace std;

void PARSE(string a);

int main()
{
    string f;
    PARSE(f);
    cout << f;
}

void PARSE(string a)
{
    a = "aBc!d:f'a";
    a.erase(remove_if(a.begin(), a.end(), ispunct), a.end());
    cout << a << endl;
}
Is there an easier/more efficient way to do this?
I was thinking of using str.len to get the length of the string, running it through a for loop, checking ispunct, and then removing the character if it is punctuation.
No string copies. No heap allocation. No heap deallocation.
void strip_punct(string& inp)
{
    auto to = begin(inp);
    for (auto from : inp)
        if (!ispunct(from))
            *to++ = from;
    inp.resize(distance(begin(inp), to));
}
Comparing to:
void strip_punct_re(string& inp)
{
    inp.erase(remove_if(begin(inp), end(inp), ispunct), end(inp));
}
I created a variety of workloads. As a baseline input, I created a string containing all char values between 32 and 127. I appended this string num-times to create my test string. I called both strip_punct and strip_punct_re with a copy of the test string iters-times. I performed these workloads 10 times timing each test. I averaged the timings after dropping the lowest and highest results. I tested using release builds (optimized) from VS2015 on Windows 10 on a Microsoft Surface Book 4 (Skylake). I SetPriorityClass() for the process to HIGH_PRIORITY_CLASS and timed the results using QueryPerformanceFrequency/QueryPerformanceCounter. All timings were performed without a debugger attached.
num iters seconds seconds (re) improvement
10000 1000 2.812 2.947 4.78%
1000 10000 2.786 2.977 6.85%
100 100000 2.809 2.952 5.09%
By varying num and iters while keeping the number of processed bytes the same, I was able to see that the cost is primarily influenced by the number of bytes processed rather than per-call overhead. Reading the disassembly confirmed this.
So this version is ~5% faster and generates 30% of the code.

Pre-compute cos() and sin() tables once

I'd like to improve performance of my Dynamic Linked Library (DLL).
For that I want to use lookup tables of cos() and sin() as I use a lot of them.
As I want maximum performance, I want to create a table from 0 to 2PI that contains the resulting cos and sin computations.
For a good result in terms of precision, I think a table of 1 MB for each function is a good trade-off between size and precision.
I would like to know how to create and use these tables without using an external file (as it is a DLL): I want to keep everything within one file.
Also I don't want to compute the sin and cos function when the plugin starts : they have to be computed once and put in a standard vector.
But how do I do that in C++?
EDIT1: the code from jons34yp is very good for creating the vector files.
I did a small benchmark and found that if you need good precision and good speed, you can build a 250,000-entry vector and linearly interpolate between the entries: you get a maximum error of 7.89E-11 (!) and it is the fastest of all the approximations I tried (more than 12x faster than sin(); 13.296x faster, to be exact).
The easiest solution is to write a separate program that creates a .cc file with the definition of your vector.
For example:
#include <iostream>
#include <fstream>
#include <iomanip>
#include <cmath>
int main()
{
    std::ofstream out("values.cc");
    out << "#include \"static_values.h\"\n";
    out << "#include <vector>\n";
    out << "std::vector<float> pi_values = {\n";
    out << std::setprecision(10);
    // We only need to compute the range from 0 to PI/2, and use trigonometric
    // transformations for values outside this range.
    double range = 3.14159265358979 / 2;
    unsigned num_results = 250000;
    for (unsigned i = 0; i < num_results; i++) {
        double value = (range / num_results) * i;
        double res = std::sin(value);
        out << "    " << res << ",\n";
    }
    out << "};\n";
    out.close();
}
Note that this is unlikely to improve performance, since a table of this size probably won't fit in your L2 cache. This means a large percentage of trigonometric computations will need to access RAM; each such access costs roughly several hundreds of CPU cycles.
By the way, have you looked at approximate SSE SIMD trigonometric libraries? This looks like a good use case for them.
You can compute the values at startup instead of storing them already precomputed in the executable:
double precomputed_sin[65536];
struct table_filler {
    table_filler() {
        for (int i = 0; i < 65536; i++) {
            precomputed_sin[i] = sin(i * 2 * 3.141592654 / 65536);
        }
    }
} table_filler_instance;
This way the table is computed just once at program startup and it's still at a fixed memory address. After that tsin and tcos can be implemented inline as
inline double tsin(int x) { return precomputed_sin[x & 65535]; }
inline double tcos(int x) { return precomputed_sin[(x + 16384) & 65535]; }
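Note that x here is in table units (65536 units per full turn), not radians, so a caller would convert first, roughly like this (a sketch; it truncates rather than rounds, which costs a little accuracy):
inline double tsin_rad(double theta) {
    // map radians to table units: 65536 units == 2*pi radians
    int idx = static_cast<int>(theta * (65536.0 / (2 * 3.141592654)));
    return tsin(idx);   // tsin wraps idx with & 65535
}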
The usual answer to this sort of question is to write a small program which generates a C++ source file with the values in a table, and compile it into your DLL. If you're thinking of tables with 128000 entries (128000 doubles are 1MB), however, you might run up against some internal limits in your compiler.
In that case, you might consider writing the values out to a file as a memory dump, and mmaping this file when you load the DLL. (Under Windows, I think you could even put this second file into a second stream of your DLL file, so you wouldn't have to distribute a second file.)
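A minimal POSIX-flavoured sketch of the dump-and-mmap idea (assuming the table was written as raw doubles to a file such as "sin_table.bin"; on Windows the equivalent would be CreateFileMapping/MapViewOfFile):
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Map a file of raw doubles into memory; returns nullptr on failure.
const double* map_sin_table(const char* path, std::size_t count)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    void* p = mmap(nullptr, count * sizeof(double), PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    return (p == MAP_FAILED) ? nullptr : static_cast<const double*>(p);
}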

Applying Pre-fetching in this C++ code for finding maximum value index

The following code finds min/max and its location in a given array:
// Finding the array index
#include "stdafx.h"
#include <iostream>
#include <iterator>
#include <list>
#include <algorithm>
using namespace std;

int main() {
    int A[4] = {0, 2, 1, 1};
    const int N = sizeof(A) / sizeof(int);
    cout << "Index of max element: "
         << distance(A, max_element(A, A + N))
         << endl;
    return 0;
}
I want to improve this code for 2D arrays and take advantage of prefetching.
So my data is now something like this:
int A[3][10] = { {3,7,2,9,39,4,9,2,19,20},
                 {3,7,2,9,33,4,22,2,19,21},
                 {3,7,2,36,33,4,9,2,19,22}
};
In the actual case the data will be much larger.
Will I really get any advantage from prefetching here? If so, how do I go about it? Also, is there any compiler directive that can instruct the compiler to prefetch the data in A?
Edit:
I need the maximum value of the whole 2D array and also the corresponding index. This will be running on x86 (Intel i3), Windows 7.
In this code, as you can see, I first find the maximum value and then find its location. Is there a way I can simplify this two-step process into a single pass and thus speed it up?
Update:
I modified this code so it processes the data in one pass, unlike earlier when it first found the maximum value and then found the index. The question remains: how do I use prefetching to improve the performance?
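For reference, a single-pass scan that tracks the maximum and its index together, with an optional software prefetch hint, could look like the sketch below (__builtin_prefetch is a GCC/Clang intrinsic and is purely a hint; MSVC has _mm_prefetch instead, and for a small array scanned sequentially the hardware prefetcher usually makes an explicit hint unnecessary):
#include <cstddef>
#include <iostream>

int main() {
    int A[3][10] = { {3,7,2,9,39,4,9,2,19,20},
                     {3,7,2,9,33,4,22,2,19,21},
                     {3,7,2,36,33,4,9,2,19,22} };
    int best = A[0][0];
    std::size_t best_row = 0, best_col = 0;
    for (std::size_t r = 0; r < 3; ++r) {
#if defined(__GNUC__)
        if (r + 1 < 3)
            __builtin_prefetch(&A[r + 1][0]);  // hint: start fetching the next row
#endif
        for (std::size_t c = 0; c < 10; ++c) {
            if (A[r][c] > best) {              // track max and its location in one pass
                best = A[r][c];
                best_row = r;
                best_col = c;
            }
        }
    }
    std::cout << "max " << best << " at (" << best_row << ", " << best_col << ")\n";
}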