I was implementing a rolling median solution and was not sure why my Python implementation was around 40 times slower than the C++ implementation.
Here are the complete implementations:
C++
#include <iostream>
#include <vector>
#include <string.h>
using namespace std;
int tree[17][65536];
void insert(int x) { for (int i=0; i<17; i++) { tree[i][x]++; x/=2; } }
void erase(int x) { for (int i=0; i<17; i++) { tree[i][x]--; x/=2; } }
int kThElement(int k) {
    int a=0, b=16;
    while (b--) { a*=2; if (tree[b][a]<k) k-=tree[b][a++]; }
    return a;
}
long long sumOfMedians(int seed, int mul, int add, int N, int K) {
    long long result = 0;
    memset(tree, 0, sizeof(tree));
    vector<long long> temperatures;
    temperatures.push_back( seed );
    for (int i=1; i<N; i++)
        temperatures.push_back( ( temperatures.back()*mul+add ) % 65536 );
    for (int i=0; i<N; i++) {
        insert(temperatures[i]);
        if (i>=K) erase(temperatures[i-K]);
        if (i>=K-1) result += kThElement( (K+1)/2 );
    }
    return result;
}
// default input
// 47 5621 1 125000 1700
// output
// 4040137193
int main()
{
    int seed,mul,add,N,K;
    cin >> seed >> mul >> add >> N >> K;
    cout << sumOfMedians(seed,mul,add,N,K) << endl;
    return 0;
}
Python
def insert(tree,levels,n):
    for i in xrange(levels):
        tree[i][n] += 1
        n /= 2

def delete(tree,levels,n):
    for i in xrange(levels):
        tree[i][n] -= 1
        n /= 2

def kthElem(tree,levels,k):
    a = 0
    for b in reversed(xrange(levels)):
        a *= 2
        if tree[b][a] < k:
            k -= tree[b][a]
            a += 1
    return a

def main():
    seed,mul,add,N,K = map(int,raw_input().split())
    levels = 17
    tree = [[0] * 65536 for _ in xrange(levels)]
    temps = [0] * N
    temps[0] = seed
    for i in xrange(1,N):
        temps[i] = (temps[i-1]*mul + add) % 65536
    result = 0
    for i in xrange(N):
        insert(tree,levels,temps[i])
        if (i >= K):
            delete(tree,levels,temps[i-K])
        if (i >= K-1):
            result += kthElem(tree,levels,((K+1)/2))
    print result

# default input
# 47 5621 1 125000 1700
# output
# 4040137193
main()
On the above-mentioned input (in the comments of the code), the C++ code took around 0.06 seconds while the Python code took around 2.3 seconds.
Can someone suggest the possible problems with my Python code and how to improve it to less than a 10x performance hit?
I don't expect it to be anywhere near the C++ implementation, but within the order of 5-10x. I know I can optimize this by using libraries like NumPy (and/or SciPy). I am asking this question from the point of view of using Python for solving programming challenges, where these libraries are usually not allowed. I am just asking whether it is even possible to beat the time limit for this algorithm in Python.
If somebody is interested, the C++ code is borrowed from the Floating Median problem at http://community.topcoder.com/tc?module=Static&d1=match_editorials&d2=srm310
[Edit]
For those who think using NumPy arrays will improve the performance: it does not. Just using a NumPy ndarray instead of a list of lists degraded performance further, to around 14 seconds, which is more than a 200x slowdown from C++.
Pure Python code which is compute-bound and written procedurally is likely to be slow, as you have found. If you want to make something in Python which runs quickly for tasks like this, you'll need to use some C (or C++, Fortran, or other) extensions, which are abundant. For example, statistics and math people use NumPy and SciPy and related tools, which are easy to use from Python but which are actually implemented in compiled languages and have high performance (if used carefully).
If you want to try to squeeze a bit more performance out of pure Python, you can try using the "cProfile" module to analyze your code. But it probably won't get anywhere near C++ speed unless you use smarter modules like NumPy or write your own extensions.
You might gain a small amount by refactoring this:
reversed(xrange(levels))
Especially if you are using Python 2.x, as this will create an actual list. You can instead do something like this:
xrange(levels - 1, -1, -1)
Can some one suggest [...] how to improve to less than 10x performance hit?
Profile the code.
Look into using NumPy instead of native lists.
If that turns out to not be enough, look into using Cython for the critical part.
Related
I am using LAPACK to invert a matrix: I pass by reference, i.e. I work on the addresses. Below is the function with an input matrix and an output matrix referenced by their addresses.
The issue is that I am obliged to convert F_matrix into a 1D array, and I think this wastes performance at runtime: how could I get rid of this supplementary task, which is time consuming if I call the function matrix_inverse_lapack many times?
Below is the function concerned:
// Passing Matrixes by Reference
void matrix_inverse_lapack(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {
// Index for loop and arrays
int i, j, ip, idx;
// Size of F_matrix
int N = F_matrix.size();
int *IPIV = new int[N];
// Statement of main array to inverse
double *arr = new double[N*N];
// Output Diagonal block
double *diag = new double[N];
for (i = 0; i<N; i++){
for (j = 0; j<N; j++){
idx = i*N + j;
arr[idx] = F_matrix[i][j];
}
}
// LAPACKE routines
int info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, arr, N, IPIV);
int info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, arr, N, IPIV);
for (i = 0; i<N; i++){
for (j = 0; j<N; j++){
idx = i*N + j;
F_output[i][j] = arr[idx];
}
}
delete[] IPIV;
delete[] arr;
}
For example, I call it this way:
vector<vector<double>> CO_CL(lsize*(2*Dim_x+Dim_y), vector<double>(lsize*(2*Dim_x+Dim_y), 0));
... some code
matrix_inverse_lapack(CO_CL, CO_CL);
The inversion performance is not what I expected; I think this is due to the 2D -> 1D conversion that I described in the function matrix_inverse_lapack.
Update
I was advised to install MAGMA on my macOS Big Sur 11.3 machine, but I am having a lot of difficulty setting it up.
I have an AMD Radeon Pro 5600M graphics card, and the OpenCL framework is already installed with the default Big Sur version (maybe I am wrong about that). Could anyone describe the procedure to follow to install MAGMA? I saw that a MAGMA software package exists at http://magma.maths.usyd.edu.au/magma/ but it is really expensive and doesn't correspond to what I want: I just need the SDK (headers and libraries), if possible built for my GPU card. I have already installed the Intel oneAPI SDK on my macOS machine; maybe I could link it to a MAGMA installation.
I saw another link, https://icl.utk.edu/magma/software/index.html, where MAGMA seems to be public: there is no link with the non-free version above, is there?
First of all, let me complain that OP did not provide all the necessary data. The program is almost complete, but it is not a minimal, reproducible example. This is important because (a) it wastes time and (b) it hides potentially relevant information, e.g. about the matrix initialization. Second, OP did not provide any details on the compilation, which, again, may be relevant.
Last, but not least, OP didn't check the status codes returned by the Lapack functions for possible errors, which could also be important for a correct interpretation of the results.
Let's start with a minimal reproducible example:
#include <lapacke.h>
#include <vector>
#include <chrono>
#include <iostream>

using Matrix = std::vector<std::vector<double>>;

std::ostream &operator<<(std::ostream &out, Matrix const &v)
{
    const auto size = std::min<int>(10, v.size());
    for (int i = 0; i < size; i++)
    {
        for (int j = 0; j < size; j++)
        {
            out << v[i][j] << "\t";
        }
        if (size < std::ssize(v)) out << "...";
        out << "\n";
    }
    return out;
}

void matrix_inverse_lapack(Matrix const &F_matrix, Matrix &F_output, std::vector<int> &IPIV_buffer,
                           std::vector<double> &matrix_buffer)
{
    // std::cout << F_matrix << "\n";
    auto t0 = std::chrono::steady_clock::now();

    const int N = F_matrix.size();

    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < N; j++)
        {
            auto idx = i * N + j;
            matrix_buffer[idx] = F_matrix[i][j];
        }
    }

    auto t1 = std::chrono::steady_clock::now();

    // LAPACKE routines
    int info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, matrix_buffer.data(), N, IPIV_buffer.data());
    int info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, matrix_buffer.data(), N, IPIV_buffer.data());

    auto t2 = std::chrono::steady_clock::now();

    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < N; j++)
        {
            auto idx = i * N + j;
            F_output[i][j] = matrix_buffer[idx];
        }
    }

    auto t3 = std::chrono::steady_clock::now();

    auto whole_fun_time = std::chrono::duration<double>(t3 - t0).count();
    auto lapack_time = std::chrono::duration<double>(t2 - t1).count();

    // std::cout << F_output << "\n";
    std::cout << "status: " << info1 << "\t" << info2 << "\t" << (info1 == 0 && info2 == 0 ? "Success" : "Failure")
              << "\n";
    std::cout << "whole function: " << whole_fun_time << "\n";
    std::cout << "LAPACKE matrix operations: " << lapack_time << "\n";
    std::cout << "conversion: " << (whole_fun_time - lapack_time) / whole_fun_time * 100.0 << "%\n";
}

int main(int argc, const char *argv[])
{
    const int M = 5; // number of test repetitions
    const int N = (argc > 1) ? std::stoi(argv[1]) : 10;
    std::cout << "Matrix size = " << N << "\n";

    std::vector<int> IPIV_buffer(N);
    std::vector<double> matrix_buffer(N * N);

    // Test matrix_inverse_lapack M times
    for (int i = 0; i < M; i++)
    {
        Matrix CO_CL(N);
        for (auto &v : CO_CL) v.resize(N);

        int idx = 1;
        for (auto &v : CO_CL)
        {
            for (auto &x : v)
            {
                x = idx + 1.0 / idx;
                idx++;
            }
        }
        matrix_inverse_lapack(CO_CL, CO_CL, IPIV_buffer, matrix_buffer);
    }
}
Here, operator<< is overkill, but it may be useful for anyone wanting to verify half-manually that the code works (by uncommenting the two commented-out std::cout lines), and ensuring that the code is correct is more important than measuring its performance.
The code can be compiled with
g++ -std=c++20 -O3 main.cpp -llapacke
The program relies on an external library, lapacke, which needs to be installed (headers + binaries) for the code to compile and run.
My code differs a bit from OP's: it is closer to "modern C++" in that it refrains from using naked pointers; I also added external buffers to matrix_inverse_lapack to avoid repeatedly invoking the memory allocator and deallocator, a small improvement that measurably reduces the 2D-1D-2D conversion overhead. I also had to initialize the matrix and guess what value of N OP might use, and I added some timer readings for benchmarking. Apart from this, the logic of the code is unchanged.
Now a benchmark carried out on a decent workstation. It lists the percentage of time the conversion takes relative to the total time taken by matrix_inverse_lapack. In other words, I measure the conversion overhead:
N = 10, 3.5%
N = 30, 1.5%
N = 100, 1%
N = 300, 0.5%
N = 1000, 0.35%
N = 3000, 0.1%
The time taken by Lapack scales nicely as N³, as expected (data not shown). The time to invert a matrix is about 16 seconds for N = 3000, and about 5-6 μs for N = 10.
I assume that an overhead of even 3% is completely acceptable. I believe OP uses matrices of size larger than 100, in which case an overhead at or below 1% is certainly acceptable.
So what could OP (or anyone having a similar problem) have done wrong to obtain "unacceptable" conversion overhead values? Here's my short list:
Improper compilation
Improper matrix initialization (for tests)
Improper benchmarking
1. Improper compilation
If one forgets to compile in Release mode, one ends up with optimized Lapacke competing with unoptimized conversion. On my machine this peaks at a 33% overhead for N = 20.
2. Improper matrix initialization (for tests)
If one initializes the matrix like this:
for (auto &v : CO_CL)
{
    for (auto &x : v)
    {
        x = idx; // rather than, e.g., idx + 1.0/idx
        idx++;
    }
}
then the matrix is singular and Lapack returns quite quickly with a non-zero status. This increases the relative importance of the conversion part. But singular matrices are not what one wants to invert (it's impossible to do).
3. Improper benchmarking
Here's an example of the program output for N = 10:
./a.out 10
Matrix size = 10
status: 0 0 Success
whole function: 0.000127658
LAPACKE matrix operations: 0.000126783
conversion: 0.685425%
status: 0 0 Success
whole function: 1.2497e-05
LAPACKE matrix operations: 1.2095e-05
conversion: 3.21677%
status: 0 0 Success
whole function: 1.0535e-05
LAPACKE matrix operations: 1.0197e-05
conversion: 3.20835%
status: 0 0 Success
whole function: 9.741e-06
LAPACKE matrix operations: 9.422e-06
conversion: 3.27482%
status: 0 0 Success
whole function: 9.939e-06
LAPACKE matrix operations: 9.618e-06
conversion: 3.2297%
One can see that the first call to the Lapack functions can take 10 times longer than the subsequent calls. This is quite a stable pattern, as if Lapack needed some time for self-initialization, and it can badly affect the measurements for small N.
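One simple mitigation (my own suggestion, not something from the listing above) is to make one untimed warm-up call before the measured repetitions and discard its printed timings:
// Untimed warm-up call: let any one-time Lapack initialization happen here and
// ignore the timings it prints; only the loop below is used for the benchmark.
// (Assumes CO_CL, IPIV_buffer and matrix_buffer are set up as in main() above.)
matrix_inverse_lapack(CO_CL, CO_CL, IPIV_buffer, matrix_buffer);

for (int i = 0; i < M; i++)
    matrix_inverse_lapack(CO_CL, CO_CL, IPIV_buffer, matrix_buffer);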
4. What else can be done?
OP appears to believe that their approach to 2D arrays is good and that Lapack is strange and old-fashioned in packing a 2D array into a 1D array. No. It is Lapack that is right.
If one defines a 2D array as vector<vector<double>>, one obtains one advantage: code simplicity. This comes at a price. Each row of such a matrix is allocated separately from the others. Thus, a 100-by-100 matrix may be stored in 100 completely different memory blocks. This has a bad impact on cache (and prefetcher) utilization. Lapack (and other linear algebra packages) enforces compaction of the data into a single, contiguous array, precisely to minimize cache and prefetcher misses. If OP had used such an approach from the very beginning, they would probably have gained more than the 1-3% they now pay for the conversion.
This compaction can be achieved in at least three ways.
Write a custom class for a 2D matrix, with the internal data stored in a 1D array and convenient access member functions (e.g. operator()), or find a library that does just that.
Write a custom allocator for std::vector (or find a library). This allocator should allocate the memory from a preallocated 1D vector exactly matching the data storage pattern used by Lapack.
Use std::vector<double*> and initialize the pointers with the addresses pointing at the appropriate elements of a preallocated 1D array.
Each of the above solutions forces some changes to the surrounding code, which OP might not want to do. All depends on the code complexity and expected performance gains.
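For illustration, here is a minimal sketch of the first option; the class name, member layout, and the idea of handing data.data() straight to LAPACKE are mine, not OP's code:
#include <cstddef>
#include <vector>

// Row-major 2D matrix in one contiguous buffer, so it can be handed to
// LAPACK_ROW_MAJOR routines directly, with no 2D-1D-2D copying.
struct Matrix2D {
    int n = 0;
    std::vector<double> data;
    explicit Matrix2D(int n_) : n(n_), data(static_cast<std::size_t>(n_) * n_) {}
    double &operator()(int i, int j) { return data[static_cast<std::size_t>(i) * n + j]; }
    double operator()(int i, int j) const { return data[static_cast<std::size_t>(i) * n + j]; }
};

// With such a class, one can call, e.g.,
//   LAPACKE_dgetrf(LAPACK_ROW_MAJOR, m.n, m.n, m.data.data(), m.n, IPIV.data());
// on the matrix in place, and the conversion loops disappear entirely.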
EDIT: Alternative libraries
An alternative approach is to use a library that is known to be highly optimized. Lapack by itself can be regarded as a standard interface with many implementations, and it may happen that OP is using an unoptimized one. Which library to choose may depend on the hardware/software platform OP is interested in, and may vary over time.
As of now (mid-2021), decent suggestions are:
Lapack https://www.netlib.org/lapack/
Atlas https://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Algebra_Software http://math-atlas.sourceforge.net/
OpenBlas https://www.openblas.net/
Magma https://developer.nvidia.com/magma
Plasma https://bitbucket.org/icl/plasma/src/main/
If OP uses matrices of size at least 100, then the GPU-oriented MAGMA might be worth trying.
An easier way (to install and run) might be a parallel CPU library, e.g. Plasma. Plasma is Lapack-compliant, has been developed by a large team of people including Jack Dongarra, and should be rather easy to compile locally as it is provided with a CMake script.
An example of how much a parallel, multicore CPU implementation of the LU decomposition can outperform a single-threaded one can be found, for example, here: https://cse.buffalo.edu/faculty/miller/Courses/CSE633/Tummala-Spring-2014-CSE633.pdf (short answer: 5 to 15 times for matrices of size 1000).
I was reading an interesting post about Why is it faster to process a sorted array than an unsorted array? and saw a comment made by @mp31415 that said:
Just for the record. On Windows / VS2017 / i7-6700K 4GHz there is NO difference between two versions. It takes 0.6s for both cases. If number of iterations in the external loop is increased 10 times the execution time increases 10 times too to 6s in both cases
So I tried it on an online C/C++ compiler (with, I suppose, a modern server architecture), and I get ~1.9s for the sorted case and ~1.85s for the unsorted case, not very different but repeatable.
So I wonder whether it is still true for modern architectures.
The question was from 2012, not so far from now...
Or where am I wrong?
Clarification of the question for reopening:
Please forget about my adding the C code as an example. That was a terrible mistake. Not only was the code erroneous, posting it misled people into focusing on the code itself rather than on the question.
When I first tried the C++ code used in the link above, I got only a 2% difference (1.9s & 1.85s).
My first question and intent was about the previous post, its C++ code, and the comment of @mp31415.
@rustyx made an interesting comment, and I wondered if it could explain what I observed:
Interestingly, a debug build exhibits 400% difference between sorted/unsorted, and a release build at most 5% difference (i7-7700).
In other words, my question is:
Why did the C++ code in the previous post not perform as well as the previous OP claimed?
made more precise by:
Could the difference between a release build and a debug build explain it?
You're a victim of the as-if rule:
... conforming implementations are required to emulate (only) the observable behavior of the abstract machine ...
Consider the function under test ...
const size_t arraySize = 32768;
int *data;
long long test()
{
    long long sum = 0;
    for (size_t i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (size_t c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    return sum;
}
And the generated assembly (VS 2017, x86_64 /O2 mode)
The machine does not execute your loops; instead, it executes a similar program that does this:
long long test()
{
    long long sum = 0;
    // Primary loop
    for (size_t c = 0; c < arraySize; ++c)
    {
        for (size_t i = 0; i < 20000; ++i)
        {
            if (data[c] >= 128)
                sum += data[c] * 5;
        }
    }
    return sum;
}
Observe how the optimizer reversed the order of the loops and defeated your benchmark.
Obviously the latter version is much more branch-predictor-friendly.
You can in turn defeat the loop hoisting optimization by introducing a dependency in the outer loop:
long long test()
{
    long long sum = 0;
    for (size_t i = 0; i < 100000; ++i)
    {
        sum += data[sum % 15]; // <== dependency!
        // Primary loop
        for (size_t c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    return sum;
}
Now this version again exhibits a massive difference between sorted/unsorted data. On my system (i7-7700) 1.6s vs 11s (or 700%).
Conclusion: the branch predictor is more important than ever these days, when we are facing unprecedented pipeline depths and instruction-level parallelism.
I've been doing some of the LeetCode problems, and I've noticed that the C solutions are a couple of times faster than the exact same thing in C++. For example:
Updated with a couple of simpler examples:
Given a sorted array and a target value, return the index if the target is found. If not, return the index where it would be if it were inserted in order. You may assume no duplicates in the array. (Link to question on LeetCode)
My solution in C runs in 3 ms:
int searchInsert(int A[], int n, int target) {
    int left = 0;
    int right = n;
    int mid = 0;
    while (left<right) {
        mid = (left + right) / 2;
        if (A[mid]<target) {
            left = mid + 1;
        }
        else if (A[mid]>target) {
            right = mid;
        }
        else {
            return mid;
        }
    }
    return left;
}
My solution in C++, exactly the same but as a member function of the Solution class, runs in 13 ms:
class Solution {
public:
    int searchInsert(int A[], int n, int target) {
        int left = 0;
        int right = n;
        int mid = 0;
        while (left<right) {
            mid = (left + right) / 2;
            if (A[mid]<target) {
                left = mid + 1;
            }
            else if (A[mid]>target) {
                right = mid;
            }
            else {
                return mid;
            }
        }
        return left;
    }
};
Even simpler example:
Reverse the digits of an integer. Return 0 if the result will overflow. (Link to question on LeetCode)
The C version runs in 6 ms:
int reverse(int x) {
    long rev = x % 10;
    x /= 10;
    while (x != 0) {
        rev *= 10L;
        rev += x % 10;
        x /= 10;
        if (rev>(-1U >> 1) || rev < (1 << 31)) {
            return 0;
        }
    }
    return rev;
}
And the C++ version, exactly the same but as a member function of the Solution class, runs in 19 ms:
class Solution {
public:
    int reverse(int x) {
        long rev = x % 10;
        x /= 10;
        while (x != 0) {
            rev *= 10L;
            rev += x % 10;
            x /= 10;
            if (rev>(-1U >> 1) || rev < (1 << 31)) {
                return 0;
            }
        }
        return rev;
    }
};
I see how there could be considerable overhead from using a vector of vectors as a 2D array in the original example if the LeetCode testing system doesn't compile the code with optimisation enabled. But the simpler examples above shouldn't suffer from that issue, because the data structures are pretty raw, especially in the second case where all you have is long and int arithmetic. That's still slower by a factor of three.
I'm starting to think that there might be something odd happening with the way LeetCode does its benchmarking in general, because even in the C version of the integer-reversing problem you get a huge bump in running time just from replacing the line
if (rev>(-1U >> 1) || rev < (1 << 31)) {
with
if (rev>INT_MAX || rev < INT_MIN) {
Now, I suppose having to #include <limits.h> might have something to do with that, but it seems a bit extreme that this simple change bumps the execution time from just 6 ms to 19 ms.
Lately I've been seeing the vector<vector<int>> suggestion a lot for doing 2D arrays in C++, and I've been pointing out to people why this really isn't a good idea. It's a handy trick to know when slapping together temporary code, but there's (almost) never any reason to use it for real code. The right thing to do is to use a class that wraps a contiguous block of memory.
So my first reaction might be to point to this as a possible source for the disparity. However you're also using int** in the C version, which is generally a sign of the exact same problem as vector<vector<int>>.
So instead I decided to just compare the two solutions.
http://coliru.stacked-crooked.com/a/fa8441cc5baa0391
6468424
6588511
That's the time taken by the 'C version' vs the 'C++ version' in nanoseconds.
My results don't show anything like the disparity you describe. Then it occurred to me to check a common mistake people make when benchmarking:
http://coliru.stacked-crooked.com/a/e57d791876b9252b
18386695
42400612
Notice that the -O3 flag from the first example has become -O0, which disables optimization.
Conclusion: you're probably comparing unoptimized executables.
C++ supports building rich abstractions that don't require overhead, but eliminating the overhead does require certain code transformations that play havoc with the 'debuggability' of code.
That means debug builds avoid those transformations, and therefore C++ debug builds are often slower than debug builds of C-style code, because C-style code just doesn't use much abstraction. Seeing a 130% slowdown such as the above is not at all surprising when timing, for example, machine code that uses function calls in place of simple store instructions.
Some code really needs optimizations in order to have reasonable performance even for debugging, so compilers often offer a mode that applies some optimizations which don't cause too much trouble for debuggers. Clang and gcc use -O1 for this, and you can see that even this level of optimization essentially eliminates the gap in this program between C style code and the more C++ style code:
http://coliru.stacked-crooked.com/a/13967ebcfcfa4073
8389992
8196935
Update:
In those later examples optimization shouldn't make a difference, since the C++ is not using any abstraction beyond what the C version is doing. I'm guessing that the explanation for this is that the examples are being compiled with different compilers or with some other different compiler options. Without knowing how the compilation is done I would say it makes no sense to compare these runtime numbers; LeetCode is clearly not producing an apples to apples comparison.
You are using a vector of vectors in your C++ code snippet. Vectors are sequence containers in C++ that are like arrays but can change in size. It would be better to use statically allocated arrays instead of vector<vector<int>>. You may also use your own Array class with operator[] overloaded, but vector has more overhead as it dynamically resizes when you add more elements than its original capacity. In C++, you can also pass by reference to further reduce your time compared with C. C++ should run even faster if written well.
The code is written using C++11. Each process gets two sparse matrices as data. The test data can be downloaded from enter link description here
The test data contains 2 files: a0 (sparse matrix 0) and a1 (sparse matrix 1). Each line in a file is "i j v", meaning that the sparse matrix has value v at row i, column j. i, j, v are all integers.
I use the C++11 unordered_map as the sparse matrix's data structure:
unordered_map<int, unordered_map<int, double> > matrix1 ;
matrix1[i][j] = v ; //means at row i column j of matrix1 is value v;
The following code took about 2 minutes. The compile command is g++ -O2 -std=c++11 ./matmult.cpp.
The g++ version is 4.8.1 on openSUSE 13.1. My computer's info: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz, 4 GB memory.
#include <iostream>
#include <fstream>
#include <unordered_map>
#include <vector>
#include <thread>
using namespace std;
void load(string fn, unordered_map<int,unordered_map<int, double> > &m) {
    ifstream input ;
    input.open(fn);
    int i, j ; double v;
    while (input >> i >> j >> v) {
        m[i][j] = v;
    }
}

unordered_map<int,unordered_map<int, double> > m1;
unordered_map<int,unordered_map<int, double> > m2;
//vector<vector<int> > keys(BLK_SIZE);

int main() {
    load("./a0",m1);
    load("./a1",m2);
    for (auto r1 : m1) {
        for (auto r2 : m2) {
            double sim = 0.0 ;
            for (auto c1 : r1.second) {
                auto f = r2.second.find(c1.first);
                if (f != r2.second.end()) {
                    sim += (f->second) * (c1.second) ;
                }
            }
        }
    }
    return 0;
}
The code above is too slow. How can I make it run faster? I tried using multithreading.
The new code follows; the compile command is g++ -O2 -std=c++11 -pthread ./test.cpp. It took about 1 minute, and I want it to be faster.
How can I make the task faster? Thank you!
#include <iostream>
#include <fstream>
#include <unordered_map>
#include <vector>
#include <thread>
#define BLK_SIZE 8
using namespace std;
void load(string fn, unordered_map<int,unordered_map<int, double> > &m) {
    ifstream input ;
    input.open(fn);
    int i, j ; double v;
    while (input >> i >> j >> v) {
        m[i][j] = v;
    }
}

unordered_map<int,unordered_map<int, double> > m1;
unordered_map<int,unordered_map<int, double> > m2;
vector<vector<int> > keys(BLK_SIZE);

void thread_sim(int blk_id) {
    for (auto row1_id : keys[blk_id]) {
        auto r1 = m1[row1_id];
        for (auto r2p : m2) {
            double sim = 0.0;
            for (auto col1 : r1) {
                auto f = r2p.second.find(col1.first);
                if (f != r2p.second.end()) {
                    sim += (f->second) * col1.second ;
                }
            }
        }
    }
}

int main() {
    load("./a0",m1);
    load("./a1",m2);

    int df = BLK_SIZE - (m1.size() % BLK_SIZE);
    int blk_rows = (m1.size() + df) / (BLK_SIZE - 1);

    int curr_thread_id = 0;
    int index = 0;
    for (auto k : m1) {
        keys[curr_thread_id].push_back(k.first);
        index++;
        if (index==blk_rows) {
            index = 0;
            curr_thread_id++;
        }
    }
    cout << "ok" << endl;

    std::thread t[BLK_SIZE];
    for (int i = 0 ; i < BLK_SIZE ; ++i){
        t[i] = std::thread(thread_sim,i);
    }
    for (int i = 0; i< BLK_SIZE; ++i)
        t[i].join();

    return 0 ;
}
Most of the time, when working with sparse matrices, one uses more efficient representations than the nested maps you have. Typical choices are Compressed Sparse Row (CSR) or Compressed Sparse Column (CSC). See https://en.wikipedia.org/wiki/Sparse_matrix for details.
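For illustration only, a minimal CSR sketch with a row-by-row dot product; the struct and function names are mine, and it assumes the column indices within each row are stored in ascending order:
#include <vector>

// Illustrative CSR layout for the "i j v" triples above (names are mine, not from the question).
struct CSRMatrix {
    int rows = 0;
    std::vector<int> row_ptr;   // row_ptr[r] .. row_ptr[r+1] delimit the entries of row r
    std::vector<int> col;       // column index of each stored entry
    std::vector<double> val;    // the stored values, row by row
};

// Dot product of row r1 of a with row r2 of b, assuming sorted column indices within
// each row; this replaces the per-element hash lookups with a simple merge.
double row_dot(const CSRMatrix &a, int r1, const CSRMatrix &b, int r2) {
    int i = a.row_ptr[r1], j = b.row_ptr[r2];
    double sum = 0.0;
    while (i < a.row_ptr[r1 + 1] && j < b.row_ptr[r2 + 1]) {
        if (a.col[i] < b.col[j])      ++i;
        else if (a.col[i] > b.col[j]) ++j;
        else                          sum += a.val[i++] * b.val[j++];
    }
    return sum;
}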
You haven't specified the time you expect your example to run in or the platform you hope to run on. These are important design constraints in this example.
There are several areas I can think of for improving the efficiency of this:
Improve the way the data is stored
Improve the multithreading
Improve the algorithm
The first point is geared toward the way the system stores the sparse arrays and the interfaces to enable the data to be read. Nested unordered_maps are a good option when speed isn't important but there may be more specific data structures available that are geared toward this problem. At best you may find a library that provides a better way to store the data than nested maps, at worst you may have to come up with something yourself.
The second point refers to the way multithreading is supported in the language. The original spec for the multithreading system was meant to be platform independent and might miss handy features some systems have. Decide what system you want to target and use the OS's threading system. You'll have more control over the way the threading works and possibly reduce the overhead, but you will lose out on cross-platform support.
The third point will take a bit of work. Is the way you're multiplying the matrices really the most efficient, given the nature of the data? I'm no expert on these things, but it is something to consider, though it will take a bit of effort.
Lastly, you can always be very specific about the platform you're running on and head into the world of assembly programming. Modern CPUs are complicated beasts. They can sometimes perform operations in parallel. For example, you may be able to do SIMD operations or do parallel integer and floating point operations. Doing this does require a deep understanding of what's going on and there are useful tools to help you out. Intel did have a tool called VTune (it may be something else now) that would analyse code and highlight potential bottlenecks. Ultimately, you'll be wanting to eliminate areas of the algorithm where the CPU is idle waiting for something to happen (like waiting for data from RAM) either by finding something else for the CPU to do or improving the algorithm (or both).
Ultimately, in order to improve the overall speed, you'll need to know what is slowing it down. This generally means knowing how to analyse your code and understand the results. Profilers are the general tool for this but there are platform specific tools available as well.
I know this isn't quite what you want but making code fast is really hard and very time consuming.
I have a C++ application in which I need to compare two values and decide which is greater. The only complication is that one number is represented in log-space, the other is not. For example:
double log_num_1 = log(1.23);
double num_2 = 1.24;
If I want to compare num_1 and num_2, I have to use either log() or exp(), and I'm wondering if one is easier to compute than the other (i.e. runs in less time, in general). You can assume I'm using the standard cmath library.
In other words, the following are semantically equivalent, so which is faster:
if(exp(log_num_1) > num_2)) cout << "num_1 is greater";
or
if(log_num_1 > log(num_2)) cout << "num_1 is greater";
AFAIK, the algorithms have the same complexity; the difference should be only a (hopefully negligible) constant factor.
Because of this, I'd use exp(a) > b, simply because it doesn't break on invalid input.
Do you really need to know? Is this going to occupy a large fraction of your running time? How do you know?
Worse, it may be platform dependent. Then what?
So sure, test it if you care, but spending much time agonizing over micro-optimization is usually a bad idea.
Edit: Modified the code to avoid exp() overflow. This caused the margin between the two functions to shrink considerably. Thanks, fredrikj.
Code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main(int argc, char **argv)
{
    if (argc != 3) {
        return 0;
    }

    int type = atoi(argv[1]);
    int count = atoi(argv[2]);
    double (*func)(double) = type == 1 ? exp : log;

    int i;
    for (i = 0; i < count; i++) {
        func(i%100);
    }
    return 0;
}
(Compile using:)
emil@lanfear /home/emil/dev $ gcc -o test_log test_log.c -lm
The results seem rather conclusive:
emil@lanfear /home/emil/dev $ time ./test_log 0 10000000
real 0m2.307s
user 0m2.040s
sys 0m0.000s
emil@lanfear /home/emil/dev $ time ./test_log 1 10000000
real 0m2.639s
user 0m2.632s
sys 0m0.004s
A bit surprisingly, log seems to be the faster one.
Pure speculation:
Perhaps the underlying mathematical Taylor series converges faster for log or something? It actually seems to me like the natural logarithm is easier to calculate than the exponential function:
ln(1+x) = x - x^2/2 + x^3/3 ...
e^x = 1 + x + x^2/2! + x^3/3! + x^4/4! ...
I'm not sure that the C library functions even do it like this, however. But it doesn't seem totally unlikely.
Since you're working with values << 1, note that x-1 > log(x) for x<1,
which means that x-1 < log(y) implies log(x) < log(y), which already
takes care of 1/e ~ 37% of the cases without having to use log or exp.
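For concreteness, a small sketch of that shortcut; the function name and the assumption that both values lie in (0, 1) are mine, not from the question:
#include <cmath>

// Sketch of the shortcut above: num_1 arrives as log_num_1 = log(num_1), and num_2
// is a plain value assumed to be in (0, 1). Since log(x) < x - 1 for x in (0, 1),
// the cheap test num_2 - 1 <= log_num_1 already proves log(num_2) < log_num_1
// without calling log() or exp().
bool num1_is_greater(double log_num_1, double num_2)
{
    if (num_2 - 1.0 <= log_num_1)
        return true;                        // decided without log()/exp()
    return log_num_1 > std::log(num_2);     // fall back to the full comparison
}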
Some quick tests in Python (which uses C for math):
$ time python -c "from math import log, exp;[exp(100) for i in xrange(1000000)]"
real 0m0.590s
user 0m0.520s
sys 0m0.042s
$ time python -c "from math import log, exp;[log(100) for i in xrange(1000000)]"
real 0m0.685s
user 0m0.558s
sys 0m0.044s
This would indicate that log is slightly slower.
Edit: it seems the C functions are being optimized out by the compiler, so the loop is what is taking up the time.
Interestingly, in C they seem to be the same speed (possibly for the reasons Mark mentioned in a comment):
#include <math.h>
void runExp(int n) {
    int i;
    for (i=0; i<n; i++) {
        exp(100);
    }
}

void runLog(int n) {
    int i;
    for (i=0; i<n; i++) {
        log(100);
    }
}

int main(int argc, char **argv) {
    if (argc <= 1) {
        return 0;
    }
    if (argv[1][0] == 'e') {
        runExp(1000000000);
    } else if (argv[1][0] == 'l') {
        runLog(1000000000);
    }
    return 0;
}
giving times:
$ time ./exp l
real 0m2.987s
user 0m2.940s
sys 0m0.015s
$ time ./exp e
real 0m2.988s
user 0m2.942s
sys 0m0.012s
It can depend on your libm, platform and processor. You're best off writing some code that calls exp/log a large number of times, and using time to call it a few times to see if there's a noticeable difference.
Both take basically the same time on my computer (Windows), so I'd use exp, since it's defined for all values (assuming you check for ERANGE). But if it's more natural to use log, you should use that instead of trying to optimise without good reason.
It would make sense for log to be faster... exp has to perform several multiplications to arrive at its answer, whereas log only has to convert the mantissa and exponent from base-2 to base-e.
Just be sure to boundary check (as many others have said) if you're using log.
If you are sure that this is a hotspot, compiler intrinsics are your friends, although they are platform-dependent (if you go for performance in places like this, you cannot be platform-agnostic).
So the question really is: which one maps to an asm instruction on your target architecture, and with what latency and cycle count? Without this, it is pure speculation.