How to troubleshoot cudaErrorUnknown from cudaDeviceSynchronize()?

How to troubleshoot cudaErrorUnknown from cudaDeviceSynchronize()? - c++

I have a large code base that performs RGB to YUV color conversion with CUDA kernels. Since I am doing a lot of parallel conversions, I use streams (maybe that is relevant here). The code is running on Linux, It is working fine on a Quadro K4200 GPU but I recently got a new Quadro P4000 GPU on which I constantly get cudaErrorUnknown when calling cudaDeviceSynchronize(). Before this happens the only things I do are a call to cuMemcpy2DAsync to copy the pixel data and after that a call to my kernel. The code base is large and I can share some relevant parts, but can anyone give advice how could I troubleshoot this? Since I was working with the K4200 all the time, I haven't changed the CUDA compiler flags. Should I do that? I am currently compiling the same code for both cards with the following flags:
--compiler-bindir /usr/bin/gcc-4.9 -gencode=arch=compute_30,code=\"sm_30,compute_30\" -cudart static -maxrregcount=0 --machine 64 --compile -g -G -std=c++11 -D_MWAITXINTRIN_H_INCLUDED
But in that case is it even possible to make a single object that runs on different GPUs?
This is the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P4000 Off | 00000000:04:00.0 Off | N/A |
| 46% 39C P0 29W / 105W | 0MiB / 8112MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Quadro K4200 Off | 00000000:84:00.0 Off | N/A |
| 30% 40C P0 26W / 110W | 0MiB / 4036MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
Should I disable the old card, could the driver seeing both cards start to behave incorrectly? Are there any internal NVIDIA logs/tools that I can use to get a more detailed description of what is failing?

How to troubleshoot ... ?
By transforming your program into a
Minimal, Complete, Verifiable Example (MCVE)
of this issue manifesting.
This will focus your "list of suspects" to very few CUDA API calls, which should either be enough for you to figure out the problem by yourself or would make it possible for you to post the whole thing (in a different question) here and get proper help. Or you'll find out the problem goes away as you drop supposedly-irrelevant parts of the code, meaning that it lies with what you've just removed.

Recompiling the kernel with the correct architecture flags -gencode=arch=compute_61,code=sm_61 as suggested by #tera fixed it for Quadro P4000, however now the same code fails on Quadro K4200, but this time with a reasonable error cudaErrorNoKernelImageForDevice:
This indicates that there is no kernel image available that is suitable for the device. This can occur when a user specifies code generation options for a particular CUDA source file that do not include the corresponding device configuration.
So apparently my biggest problem was the lack of knowledge to understand what could be causing the cudaErrorUnknown.

Related

While using X11 on a GPU, does XShmGetImage give you host/device memory back?

If you have X11 running on a GPU like so:
Fri Aug 2 23:52:39 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.30 Driver Version: 430.30 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M60 Off | 00000000:00:1E.0 Off | 0 |
| N/A 28C P8 14W / 150W | 141MiB / 7618MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3255 G /usr/lib/xorg/Xorg 57MiB |
| 0 3286 G /usr/bin/gnome-shell 81MiB |
+-----------------------------------------------------------------------------+
If you run XShmGetImage(), does it give you a pointer to a memory address in GPU memory or host memory?
If the GPU, I assume you can do other operations on the NVIDIA card with it, like H264 encode that data.
Is there a way to copy the memory from one GPU memory block to a different GPU memory block?
I am using NVENC libraries.

Reading the MIT Shared Memory extension's documentation:
The next step is to create the shared memory segment. This is best
done after the creation of the XImage, since you need to make use of
the information in that XImage to know how much memory to allocate. To
create the segment, you need a call like:
shminfo.shmid = shmget(IPC_PRIVATE, image->bytes_per_line * image->height, IPC_CREAT|0777);
This implies the extension regards "shared memory" as "that which is returned by shmget or equivalent". Since shmget is incapable of allocating GPU memory, my answer is the XImage is in host memory, not device.

How to check the region where your GoogleColab is running?

Following this tweet I tried without luck to use this GPU on Google Colab. I'm wondering if this is due to the region where my notebook is running but I don't have idea how to check this.
Am I missing something setting up the GPU? I followed this post [ Solved! See UPDATE 2]
How can I check in which region I'm from colab?
UPDATE
The output of !nvidia-smi is
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 |
| N/A 32C P0 54W / 149W | 121MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
UPDATE 2
Tesla T4 is now available but I'm still interested in how to check in which region my instances are running.

You can check server location by curl:
curl ipinfo.io
And you can change server by factory reset runtime, but you can't choose region.

In your Colab notebook, create a new cell and run:
!curl ipinfo.io
Output:
{
"ip": "31.141.44.210",
"hostname": "31.141.44.210.bc.googleusercontent.com",
"city": "Groningen",
"region": "Groningen", <--
"country": "NL",
"loc": "52.2192,6.1667",
"org": "AS396981 Google LLC",
"postal": "9711",
"timezone": "Europe/Amsterdam",
"readme": "https://ipinfo.io/missingauth"
}
Just running curl ipinfo.io (without the leading !) didn't work for me. So, extending #Krishna answers here.

For me !nvidia-smi is not working it tells me:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Moreover I still have the K80 so

multithread read from disk?

Suppose I need to read many distinct, independent chunks of data from the same file saved on disk.
Is it possible to multi-thread this upload?
Related: Do all threads on the same processor use the same IO device to read from disk? In this case, multi-threading would not speed up the upload at all - the threads would just be waiting in line.
(I am currently multi-threading with OpenMP.)

Yes, it is possible. However:
Do all threads on the same processor use the same IO device to read from disk?
Yes. The read head on the disk. As an example, try copying two files in parallel as opposed to in series. It will take significantly longer in parallel, because the OS uses scheduling algorithms to make sure the IO rate is "fair," or equal between the two threads/processes. Because of this, the read head will jump back and forth between different parts of the disk, slowing the process down A LOT. The time to actually read the data is pretty small compared to the time to seek to it, and when you're reading two different parts of the disk at once, you spend most of the time seeking.
Note that all of this assumes you're using a hard disk. If you're using an SSD, it will not be slower in parallel, but it will not be faster either. Edit: according to comments parallel is actually faster for an SSD. With RAID the situation becomes more complicated, and (obviously) depends on what kind of RAID you're using.
This is what it looks like (I've unwrapped the circular disk into a rectangle because ascii circles are hard, and simplified the data layout to make it easier to read):
Assume the files are separated by some space on the platter like so:
| |
A series read will look like (* indicates reading)
space ----->
| *| t
| *| i
| *| m
| *| e
| *| |
| / | |
| / | |
| / | V
| / |
|* |
|* |
|* |
|* |
While a parallel read will look like
| \ |
| *|
| / |
| / |
| / |
| / |
|* |
| \ |
| \ |
| \ |
| \ |
| *|
| / |
| / |
| / |
| / |
|* |
| \ |
| \ |
| \ |
| \ |
| *|
etc

If you're doing this on Windows you might want to look into the ReadFileScatter function. It will let you read multiple segments from a file in a single asynchronous call. This will allow the OS to better control the file IO bottle neck and hopefully optimizes the reads.
The matching write call on Windows would be WriteFileGather.
For UNIX you're looking at readv and writev to do the same thing.

As mentioned in the other answers a parallel read may be slower depending on the way the file is physically stored on disk. So if the head has to move a significant distance it can cause an actual slowdown. This being said there are however storage systems which can support multiple simultaneous reads and writes efficiently. The most simple one I can imagine is a SSD disk. I myself worked with magnificent storage systems from IBM which could perform simultaneous reads and writes with no slowdown.
So let's assume you have such a file system and physical storage which will not slow down on parallel reads.
In that case parallel reads are very logical. In general there are two ways to achieve that:
If you want to use the standard C/C++ library to perform the IO then the only option you have is to keep one open file handle (descriptor) per thread. This is because the file pointer (which points to where to read or write from in the file) is kept per handle. So if you try to read simultaneously from the same file handle you will not have any way of knowing what you are actually reading.
Use platform specific API to perform asynchronous (OVERLAPPED) IO. On windows you use the WinAPI functions with what is called OVERLAPPED IO. On Unix/Linux you have posix AIO although I understand that it's use is discouraged although I didn't see any satisfactory explanation as to why that is the case.
I myself implemented the both fd/thread approach on both linux and windows and the OVERLAPPED approach on windows. Both work great.

You won't be able to speed up the process of reading to disk. If you're calculating at the same time as you're writing, parallelizing will help. But the pure writing will be limited by the bandwidth of the lane between processor and hard drive and, more notably, by the harddrive itself (my hard drive does 30 MB/s, I've heard about raid setups serving 120 MB/s over network, but don't rely on that).

Multiple reads from a disk should be thread-safe by the design of the op system if you use the standard system functions there's no need to manually locking it, open the files read-only though. (Otherwise you'll get file access errors.)
Btw you are not necessary reading from the disk in practice, the op system will decide where it will serve you from. It typically prefetches the reads and serves from the memory.

Benchmarking (python vs. c++ using BLAS) and (numpy)

I would like to write a program that makes extensive use of BLAS and LAPACK linear algebra functionalities. Since performance is an issue I did some benchmarking and would like know, if the approach I took is legitimate.
I have, so to speak, three contestants and want to test their performance with a simple matrix-matrix multiplication. The contestants are:
Numpy, making use only of the functionality of dot.
Python, calling the BLAS functionalities through a shared object.
C++, calling the BLAS functionalities through a shared object.
Scenario
I implemented a matrix-matrix multiplication for different dimensions i. i runs from 5 to 500 with an increment of 5 and the matricies m1 and m2 are set up like this:
m1 = numpy.random.rand(i,i).astype(numpy.float32)
m2 = numpy.random.rand(i,i).astype(numpy.float32)
1. Numpy
The code used looks like this:
tNumpy = timeit.Timer("numpy.dot(m1, m2)", "import numpy; from __main__ import m1, m2")
rNumpy.append((i, tNumpy.repeat(20, 1)))
2. Python, calling BLAS through a shared object
With the function
_blaslib = ctypes.cdll.LoadLibrary("libblas.so")
def Mul(m1, m2, i, r):
no_trans = c_char("n")
n = c_int(i)
one = c_float(1.0)
zero = c_float(0.0)
_blaslib.sgemm_(byref(no_trans), byref(no_trans), byref(n), byref(n), byref(n),
byref(one), m1.ctypes.data_as(ctypes.c_void_p), byref(n),
m2.ctypes.data_as(ctypes.c_void_p), byref(n), byref(zero),
r.ctypes.data_as(ctypes.c_void_p), byref(n))
the test code looks like this:
r = numpy.zeros((i,i), numpy.float32)
tBlas = timeit.Timer("Mul(m1, m2, i, r)", "import numpy; from __main__ import i, m1, m2, r, Mul")
rBlas.append((i, tBlas.repeat(20, 1)))
3. c++, calling BLAS through a shared object
Now the c++ code naturally is a little longer so I reduce the information to a minimum.
I load the function with
void* handle = dlopen("libblas.so", RTLD_LAZY);
void* Func = dlsym(handle, "sgemm_");
I measure the time with gettimeofday like this:
gettimeofday(&start, NULL);
f(&no_trans, &no_trans, &dim, &dim, &dim, &one, A, &dim, B, &dim, &zero, Return, &dim);
gettimeofday(&end, NULL);
dTimes[j] = CalcTime(start, end);
where j is a loop running 20 times. I calculate the time passed with
double CalcTime(timeval start, timeval end)
{
double factor = 1000000;
return (((double)end.tv_sec) * factor + ((double)end.tv_usec) - (((double)start.tv_sec) * factor + ((double)start.tv_usec))) / factor;
}
Results
The result is shown in the plot below:
Questions
Do you think my approach is fair, or are there some unnecessary overheads I can avoid?
Would you expect that the result would show such a huge discrepancy between the c++ and python approach? Both are using shared objects for their calculations.
Since I would rather use python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?
Download
The complete benchmark can be downloaded here. (J.F. Sebastian made that link possible^^)

UPDATE (30.07.2014):
I re-run the the benchmark on our new HPC.
Both the hardware as well as the software stack changed from the setup in the original answer.
I put the results in a google spreadsheet (contains also the results from the original answer).
Hardware
Our HPC has two different nodes one with Intel Sandy Bridge CPUs and one with the newer Ivy Bridge CPUs:
Sandy (MKL, OpenBLAS, ATLAS):
CPU: 2 x 16 Intel(R) Xeon(R) E2560 Sandy Bridge # 2.00GHz (16 Cores)
RAM: 64 GB
Ivy (MKL, OpenBLAS, ATLAS):
CPU: 2 x 20 Intel(R) Xeon(R) E2680 V2 Ivy Bridge # 2.80GHz (20 Cores, with HT = 40 Cores)
RAM: 256 GB
Software
The software stack is for both nodes the sam. Instead of GotoBLAS2, OpenBLAS is used and there is also a multi-threaded ATLAS BLAS that is set to 8 threads (hardcoded).
OS: Suse
Intel Compiler: ictce-5.3.0
Numpy: 1.8.0
OpenBLAS: 0.2.6
ATLAS:: 3.8.4
Dot-Product Benchmark
Benchmark-code is the same as below. However for the new machines I also ran the benchmark for matrix sizes 5000 and 8000.
The table below includes the benchmark results from the original answer (renamed: MKL --> Nehalem MKL, Netlib Blas --> Nehalem Netlib BLAS, etc)
Single threaded performance:
Multi threaded performance (8 threads):
Threads vs Matrix size (Ivy Bridge MKL):
Benchmark Suite
Single threaded performance:
Multi threaded (8 threads) performance:
Conclusion
The new benchmark results are similar to the ones in the original answer. OpenBLAS and MKL perform on the same level, with the exception of Eigenvalue test.
The Eigenvalue test performs only reasonably well on OpenBLAS in single threaded mode.
In multi-threaded mode the performance is worse.
The "Matrix size vs threads chart" also show that although MKL as well as OpenBLAS generally scale well with number of cores/threads,it depends on the size of the matrix. For small matrices adding more cores won't improve performance very much.
There is also approximately 30% performance increase from Sandy Bridge to Ivy Bridge which might be either due to higher clock rate (+ 0.8 Ghz) and/or better architecture.
Original Answer (04.10.2011):
Some time ago I had to optimize some linear algebra calculations/algorithms which were written in python using numpy and BLAS so I benchmarked/tested different numpy/BLAS configurations.
Specifically I tested:
Numpy with ATLAS
Numpy with GotoBlas2 (1.13)
Numpy with MKL (11.1/073)
Numpy with Accelerate Framework (Mac OS X)
I did run two different benchmarks:
simple dot product of matrices with different sizes
Benchmark suite which can be found here.
Here are my results:
Machines
Linux (MKL, ATLAS, No-MKL, GotoBlas2):
OS: Ubuntu Lucid 10.4 64 Bit.
CPU: 2 x 4 Intel(R) Xeon(R) E5504 # 2.00GHz (8 Cores)
RAM: 24 GB
Intel Compiler: 11.1/073
Scipy: 0.8
Numpy: 1.5
Mac Book Pro (Accelerate Framework):
OS: Mac OS X Snow Leopard (10.6)
CPU: 1 Intel Core 2 Duo 2.93 Ghz (2 Cores)
RAM: 4 GB
Scipy: 0.7
Numpy: 1.3
Mac Server (Accelerate Framework):
OS: Mac OS X Snow Leopard Server (10.6)
CPU: 4 X Intel(R) Xeon(R) E5520 # 2.26 Ghz (8 Cores)
RAM: 4 GB
Scipy: 0.8
Numpy: 1.5.1
Dot product benchmark
Code:
import numpy as np
a = np.random.random_sample((size,size))
b = np.random.random_sample((size,size))
%timeit np.dot(a,b)
Results:
System | size = 1000 | size = 2000 | size = 3000 |
netlib BLAS | 1350 ms | 10900 ms | 39200 ms |
ATLAS (1 CPU) | 314 ms | 2560 ms | 8700 ms |
MKL (1 CPUs) | 268 ms | 2110 ms | 7120 ms |
MKL (2 CPUs) | - | - | 3660 ms |
MKL (8 CPUs) | 39 ms | 319 ms | 1000 ms |
GotoBlas2 (1 CPU) | 266 ms | 2100 ms | 7280 ms |
GotoBlas2 (2 CPUs)| 139 ms | 1009 ms | 3690 ms |
GotoBlas2 (8 CPUs)| 54 ms | 389 ms | 1250 ms |
Mac OS X (1 CPU) | 143 ms | 1060 ms | 3605 ms |
Mac Server (1 CPU)| 92 ms | 714 ms | 2130 ms |
Benchmark Suite
Code:
For additional information about the benchmark suite see here.
Results:
System | eigenvalues | svd | det | inv | dot |
netlib BLAS | 1688 ms | 13102 ms | 438 ms | 2155 ms | 3522 ms |
ATLAS (1 CPU) | 1210 ms | 5897 ms | 170 ms | 560 ms | 893 ms |
MKL (1 CPUs) | 691 ms | 4475 ms | 141 ms | 450 ms | 736 ms |
MKL (2 CPUs) | 552 ms | 2718 ms | 96 ms | 267 ms | 423 ms |
MKL (8 CPUs) | 525 ms | 1679 ms | 60 ms | 137 ms | 197 ms |
GotoBlas2 (1 CPU) | 2124 ms | 4636 ms | 147 ms | 456 ms | 743 ms |
GotoBlas2 (2 CPUs)| 1560 ms | 3278 ms | 116 ms | 295 ms | 460 ms |
GotoBlas2 (8 CPUs)| 741 ms | 2914 ms | 82 ms | 262 ms | 192 ms |
Mac OS X (1 CPU) | 948 ms | 4339 ms | 151 ms | 318 ms | 566 ms |
Mac Server (1 CPU)| 1033 ms | 3645 ms | 99 ms | 232 ms | 342 ms |
Installation
Installation of MKL included installing the complete Intel Compiler Suite which is pretty straight forward. However because of some bugs/issues configuring and compiling numpy with MKL support was a bit of a hassle.
GotoBlas2 is a small package which can be easily compiled as a shared library. However because of a bug you have to re-create the shared library after building it in order to use it with numpy.
In addition to this building it for multiple target plattform didn't work for some reason. So I had to create an .so file for each platform for which i want to have an optimized libgoto2.so file.
If you install numpy from Ubuntu's repository it will automatically install and configure numpy to use ATLAS. Installing ATLAS from source can take some time and requires some additional steps (fortran, etc).
If you install numpy on a Mac OS X machine with Fink or Mac Ports it will either configure numpy to use ATLAS or Apple's Accelerate Framework.
You can check by either running ldd on the numpy.core._dotblas file or calling numpy.show_config().
Conclusions
MKL performs best closely followed by GotoBlas2.
In the eigenvalue test GotoBlas2 performs surprisingly worse than expected. Not sure why this is the case.
Apple's Accelerate Framework performs really good especially in single threaded mode (compared to the other BLAS implementations).
Both GotoBlas2 and MKL scale very well with number of threads. So if you have to deal with big matrices running it on multiple threads will help a lot.
In any case don't use the default netlib blas implementation because it is way too slow for any serious computational work.
On our cluster I also installed AMD's ACML and performance was similar to MKL and GotoBlas2. I don't have any numbers tough.
I personally would recommend to use GotoBlas2 because it's easier to install and it's free.
If you want to code in C++/C also check out Eigen3 which is supposed to outperform MKL/GotoBlas2 in some cases and is also pretty easy to use.

I've run your benchmark. There is no difference between C++ and numpy on my machine:
Do you think my approach is fair, or are there some unnecessary overheads I can avoid?
It seems fair due to there is no difference in results.
Would you expect that the result would show such a huge discrepancy between the c++ and python approach? Both are using shared objects for their calculations.
No.
Since I would rather use python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?
Make sure that numpy uses optimized version of BLAS/LAPACK libraries on your system.

Here's another benchmark (on Linux, just type make): http://dl.dropbox.com/u/5453551/blas_call_benchmark.zip
http://dl.dropbox.com/u/5453551/blas_call_benchmark.png
I do not see essentially any difference between the different methods for large matrices, between Numpy, Ctypes and Fortran. (Fortran instead of C++ --- and if this matters, your benchmark is probably broken.)
Your CalcTime function in C++ seems to have a sign error. ... + ((double)start.tv_usec)) should be instead ... - ((double)start.tv_usec)). Perhaps your benchmark also has other bugs, e.g., comparing between different BLAS libraries, or different BLAS settings such as number of threads, or between real time and CPU time?
EDIT: failed to count the braces in the CalcTime function -- it's OK.
As a guideline: if you do a benchmark, please always post all the code somewhere. Commenting on benchmarks, especially when surprising, without having the full code is usually not productive.
To find out which BLAS Numpy is linked against, do:
$ python
Python 2.7.2+ (default, Aug 16 2011, 07:24:41)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy.core._dotblas
>>> numpy.core._dotblas.__file__
'/usr/lib/pymodules/python2.7/numpy/core/_dotblas.so'
>>>
$ ldd /usr/lib/pymodules/python2.7/numpy/core/_dotblas.so
linux-vdso.so.1 => (0x00007fff5ebff000)
libblas.so.3gf => /usr/lib/libblas.so.3gf (0x00007fbe618b3000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbe61514000)
UPDATE: If you can't import numpy.core._dotblas, your Numpy is using its internal fallback copy of BLAS, which is slower, and not meant to be used in performance computing!
The reply from #Woltan below indicates that this is the explanation for the difference he/she sees in Numpy vs. Ctypes+BLAS.
To fix the situation, you need either ATLAS or MKL --- check these instructions: http://scipy.org/Installing_SciPy/Linux Most Linux distributions ship with ATLAS, so the best option is to install their libatlas-dev package (name may vary).

Given the rigor you've shown with your analysis, I'm surprised by the results thus far. I put this as an 'answer' but only because it's too long for a comment and does provide a possibility (though I expect you've considered it).
I would've thought the numpy/python approach wouldn't add much overhead for a matrix of reasonable complexity, since as the complexity increases, the proportion that python participates in should be small. I'm more interested in the results on the right hand side of the graph, but orders of magnitude discrepancy shown there would be disturbing.
I wonder if you're using the best algorithms that numpy can leverage. From the compilation guide for linux:
"Build FFTW (3.1.2):
SciPy Versions >= 0.7 and Numpy >= 1.2:
Because of license, configuration, and maintenance issues support for FFTW was removed in versions of SciPy >= 0.7 and NumPy >= 1.2. Instead now uses a built-in version of fftpack.
There are a couple ways to take advantage of the speed of FFTW if necessary for your analysis.
Downgrade to a Numpy/Scipy version that includes support.
Install or create your own wrapper of FFTW. See http://developer.berlios.de/projects/pyfftw/ as an un-endorsed example."
Did you compile numpy with mkl? (http://software.intel.com/en-us/articles/intel-mkl/). If you're running on linux, the instructions for compiling numpy with mkl are here: http://www.scipy.org/Installing_SciPy/Linux#head-7ce43956a69ec51c6f2cedd894a4715d5bfff974 (in spite of url). The key part is:
[mkl]
library_dirs = /opt/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64
include_dirs = /opt/intel/composer_xe_2011_sp1.6.233/mkl/include
mkl_libs = mkl_intel_lp64,mkl_intel_thread,mkl_core
If you're on windows, you can obtain a compiled binary with mkl, (and also obtain pyfftw, and many other related algorithms) at: http://www.lfd.uci.edu/~gohlke/pythonlibs/, with a debt of gratitude to Christoph Gohlke at the Laboratory for Fluorescence Dynamics, UC Irvine.
Caveat, in either case, there are many licensing issues and so on to be aware of, but the intel page explains those. Again, I imagine you've considered this, but if you meet the licensing requirements (which on linux is very easy to do), this would speed up the numpy part a great deal relative to using a simple automatic build, without even FFTW. I'll be interested to follow this thread and see what others think. Regardless, excellent rigor and excellent question. Thanks for posting it.

Let me contribute to a somewhat weird finding
My numpy is linked to mkl, as given by numpy.show_config(). I have no idea what kind of libblas.so C++/BLAS was used. I hope someone can tell me a way to figure it out.
I think the results strongly depend on the library was used. I cannot isolate the efficiency of C++/BLAS.

Detecting memory leaks in C++ Qt combine?

I have an application that interacts with external devices using serial communication. There are two versions of the device differing in their implementations.
-->One is developed and tested by my team
-->The other version by a different team.
Since the other team has left, our team is looking after it's maintenance. The other day while testing the application I noticed that the application takes up 60 Mb memory at startup and to my horror it's memory usage starts increasing with 200Kb chunks, in 60 hrs it shoots up to 295 Mb though there is no slow down in the responsiveness and usage of application. I tested it again and again and the same memory usage pattern is repeated.
The application is made in C++,Qt 4.2.1 on RHEL4.
I used mtrace to check for any memory leaks and it shows no such leaks. I then used valgrind memcheck tool, but the messages it gives are cryptic and not very conclusive, it shows leaks in graphical elements of Qt, which on scrutiny can be straightaway rejected.
I am in a fix as to what other tools/methodologies can be adopted to pinpoint the source of these memory leaks if any.
-->Also, in a larger context, how can we detect and debug presence of memory leaks in a C++ Qt application?
-->How can we check, how much memory a process uses in Linux?
I had used gnome-system-monitor and top command to check for memory used by the application, but I have heard that results given by above mentioned tools are not absolute.
EDIT:
I used ccmalloc for detecting memory leaks and this is the error report I got after I closed the application. During application execution, there were no error messages.
|ccmalloc report|
=======================================================
| total # of| allocated | deallocated | garbage |
+-----------+-------------+-------------+-------------+
| bytes| 387325257 | 386229435 | 1095822 |
+-----------+-------------+-------------+-------------+
|allocations| 1232496 | 1201351 | 31145 |
+-----------------------------------------------------+
| number of checks: 1 |
| number of counts: 2434332 |
| retrieving function names for addresses ... done. |
| reading file info from gdb ... done. |
| sorting by number of not reclaimed bytes ... done. |
| number of call chains: 3 |
| number of ignored call chains: 0 |
| number of reported call chains: 3 |
| number of internal call chains: 3 |
| number of library call chains: 1 |
=======================================================
|
| 3.1% = 33.6 KB of garbage allocated in 47 allocations
| |
| | 0x???????? in
| |
| | 0x081ef2b6 in
| | at src/wrapper.c:489
| |
| | 0x081ef169 in <_realloc>
| | at src/wrapper.c:435
| |
| `-----> 0x081ef05c in
| at src/wrapper.c:318
|
| 0.8% = 8722 Bytes of garbage allocated in 35 allocations
| |
| | 0x???????? in
| |
| | 0x081ef134 in
| | at src/wrapper.c:422
| |
| `-----> 0x081ef05c in
| at src/wrapper.c:318
|
| 0.1% = 1144 Bytes of garbage allocated in 5 allocations
| |
| | 0x???????? in
| |
| | 0x081ef1cb in
| | at src/wrapper.c:455
| |
| `-----> 0x081ef05c in
| at src/wrapper.c:318
|
`------------------------------------------------------
free(0x09cb650c) after reporting
(This can happen with static destructors.
When linking put `ccmalloc.o' at the end (for gcc) or
in front of the list of object files.)
free(0x09cb68f4) after reporting
free(0x09cb68a4) after reporting
free(0x09cb6834) after reporting
free(0x09cb6814) after reporting
free(0x09cb67a4) after reporting
free(0x09cb6784) after reporting
free(0x09cb66cc) after reporting
free(0x09cb66ac) after reporting
free(0x09cb65e4) after reporting
free(0x09cb65c4) after reporting
free(0x09cb653c) after reporting
ccmalloc_report() called in non valid state
I have no clue, what this means, it doesn't seem to indicate any memory leaks to me? I may be wrong. Does anyone of you have come across such a scenario?
link|edit|delete

Valgrind can be a bitch if you don't really read the manuals or whatever documentation is actually available (man page for starters) - but they are worth it.
Basicly, you could start by running the valgrind on your application with --gen-suppressions=all and then create a suppressions for each block that is originating from QT itself and then use the suppression file to block those errors and you should be left with only with errors in your own code.
Also, you could try to use valgrind thru a alleyoop frontend if that makes things easier for you.
There are also bunch of other tools that can be used to detect memory leaks and Linux Journal has article about those here: http://www.linuxjournal.com/article/6556
And last, in some cases, some static analysis tools can spot memory errors too..

I'd like to make the minor point that just because the meory used by a process is increasing, it does not follow that you have a memory leak. Take a word processor as an example - as you write text, the memory usage increases, but there is no leak. Most processes in fact increase their memoryy usage as they run, often until they reach some sort of near steady-state, where objects been created are balanced by old objects being destroyed.

You said you tried Valgrind's memcheck tool; you should also try the massif tool, which should be able to graph the heap usage over time, and tell you where the memory was allocated from.
One of the reasons why top isn't too useful to measure memory usage is that they don't take into account that memory is often shared between processes. For the best overview on where the process has allocated memory, I recommend using a recent Linux kernel and checking /proc/<pid>/maps for your process. This shows what memory is mapped to that process and from where. For example, here's a snippet from konqueror on my system.
b732a000-b7a20000 r-xp 00000000 fd:05 205437 /usr/lib/qt3/lib/libqt-mt.so.3.3.8
Size: 7128 kB
Rss: 3456 kB
Pss: 347 kB
Shared Clean: 3452 kB
Shared Dirty: 0 kB
Private Clean: 4 kB
Private Dirty: 0 kB
Referenced: 3452 kB
The important thing here is that, although the resident set resulting from the load of libqt-mt.so.3.3.8 is 3456kB, all but 4kB of that is shared between all processes which loaded the library, so it's a one-off system-wide cost. top doesn't expose this information, so just reading the RSS from top is misleading.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js