Performance Loss when Writing to Memory Buffer (C++)

I am writing a small renderer (based on the rasterisation algorithm). It's a personal project I am doing to test different techniques. I was measuring the time it took to render a bunch of triangles, and while doing this I noticed something strange. What the program does is write to an image buffer (a 1D array of Vec3ui) if a given pixel overlaps a 2D triangle and passes some other tests (it writes the color of that triangle into the buffer).
Vec3<unsigned char> *fb = new Vec3<unsigned char>[w * h];
...
void rasterize(
    ...,
    Vec3<unsigned char> *&fb,
    float *&zbuffer)
{
    Vec3<unsigned char> randcol(drand48() * 255, drand48() * 255, drand48() * 255);
    ...
    uint32_t x, y;
    // loop over the bounding box of the triangle
    // and check if a given pixel is in the triangle
    for (y = ymin, p.y = ymin; y <= ymax; ++y, ++p.y)
    {
        for (x = xmin, p.x = xmin; x <= xmax; ++x, ++p.x)
        {
            if (pixelOverTriangle(...)) {
                fb[y * w + x] = randcol;
            }
        }
    }
}
When I measured the stats, I expected that what would take the longest in the process would be rendering the triangles, doing all the tests, etc. It turns out that when I run the program with a given number of triangles I get the following render time:
74 ms
But when I comment out the line where I write to the image buffer I get:
5 ms
So to be clear I do:
if (pixelOverTriangle(...)) {
    // fb[y * w + x] = randcol;
}
In fact more than 90% of the time is spent writing to the image buffer!
I have to say that I tried optimising how the index used to access elements in the array is computed, but this is not where the time goes. The time goes into actually copying the variable on the right into the buffer (so it seems, anyway).
I am very surprised by these numbers.
So I have a few questions:
Is it expected?
Am I doing something wrong?
Can I make it better? What technique can I use to optimise this?

A lot more goes into a memory read/write than C++ makes it seem. More often than not, your processor caches blocks of memory for quick access; this vastly improves performance for data in contiguous memory: arrays, structs, and the stack, for example. However, upon trying to access memory that has not been cached (a cache miss), the processor has to fetch a new block of memory, which takes significantly longer (on the order of minutes if a single cycle were scaled to one second). By accessing scattered segments of a long block of memory – like your image – you are practically guaranteeing continual cache misses.
To make matters worse, computer memory (RAM) is managed in virtual pages (usually around 4 KB each) that can be swapped in and out of physical memory. If your image is big and the system is under memory pressure, some of its pages may be swapped out, in which case your operating system ends up loading and unloading data from secondary storage (your hard drive), which you can imagine taking much longer than a direct read from memory.
I found an article from another Stack Overflow question about cache performance that might answer your question better than I can. Really, it's just important to be aware of what a memory read/write is actually doing, and how that can drastically affect performance.
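As an illustration of the difference access order alone can make (a minimal sketch, not taken from the question's renderer; the two functions do exactly the same amount of arithmetic):

#include <vector>

// Sum a w*h image stored row by row, the way fb[y * w + x] is laid out.
unsigned long long sumRowMajor(const std::vector<unsigned char> &img, int w, int h)
{
    unsigned long long sum = 0;
    for (int y = 0; y < h; ++y)      // walks memory contiguously,
        for (int x = 0; x < w; ++x)  // so most accesses are cache hits
            sum += img[y * w + x];
    return sum;
}

// Same work, but with a stride of w bytes between consecutive accesses.
unsigned long long sumColumnMajor(const std::vector<unsigned char> &img, int w, int h)
{
    unsigned long long sum = 0;
    for (int x = 0; x < w; ++x)      // jumps a whole row each step,
        for (int y = 0; y < h; ++y)  // causing far more cache misses on large images
            sum += img[y * w + x];
    return sum;
}

On a large image both functions touch the same bytes, yet the second typically runs several times slower, purely because of the cache.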

A possible answer which you'll have to check out...
With the write commented out, the compiler might notice that the loop no longer does anything observable and remove it entirely, so the 5 ms figure may not be measuring what you think. Look at the disassembly of the function and see if it is actually doing any calculations.
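A cheap way to check this without reading the disassembly is to replace the framebuffer write with a cheap dependency and keep the result alive through a volatile sink (a sketch of the question's loop; checksum and keep are made-up names):

unsigned long long checksum = 0;
for (y = ymin, p.y = ymin; y <= ymax; ++y, ++p.y)
{
    for (x = xmin, p.x = xmin; x <= xmax; ++x, ++p.x)
    {
        if (pixelOverTriangle(...)) {
            checksum += x ^ y;  // stands in for fb[y * w + x] = randcol;
        }
    }
}
volatile unsigned long long keep = checksum; // the volatile store keeps the loop from being removed
(void)keep;

If this version still runs in about 5 ms, the framebuffer write really is the bottleneck; if it jumps back up, the 5 ms figure was measuring a loop the compiler had largely discarded.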

Related

Qt QOpenGLWidget: how to modify individual vertex values in a VBO without using a data block copy?

I don't know if it is possible or not:
I have an array of QVector3D vertices that I copy to a VBO
sometimes I want to modify only the z value of a range of vertices between the values (x1, y1) and (x2, y2) - the concerned vertices strictly follow each other
my "good" idea is to only modify the z values with a direct access to the VBO.
I have searched a lot, but all the solutions I saw use memcpy, something like this:
m_vboPos.bind();
GLfloat* PosBuffer = (GLfloat*) (m_vboPos.map(QOpenGLBuffer::WriteOnly));
if (PosBuffer != (GLfloat*) NULL) {
    memcpy(PosBuffer, m_Vertices.constData(), m_Vertices.size() * sizeof(QVector3D));
    m_vboPos.unmap();
    m_vboPos.release();
}
But it is to copy blocks of data.
I don't think using memcpy to change only 1 float value in every concerned vertex would be very efficient (I have several millions of vertices in the VBO).
I'd just like to optimize, because copying millions of vertices takes (too) long: is there a way to achieve my goal (without memcpy?), for only one float here and there? (I already tried that but couldn't make it work; I must be missing something.)
This call here
GLfloat* PosBuffer = (GLfloat*) (m_vboPos.map(QOpenGLBuffer::WriteOnly));
will internally call glMapBuffer, which means that it just maps the buffer contents into the address space of your process (see also the OpenGL Wiki on Buffer Object Mapping).
Since you map it write-only, you can simply overwrite each and every bit of the buffer, as you see fit. There is no need to use memcpy, you can just use any means to write to memory, e.g. you can directly do
PosBuffer[3*vertex_id + 2] = 42.0f; // assuming 3 floats per vertex
I don't think using memcpy to change only 1 float value in every concerned vertex would be very efficient (I have several millions of vertices in the VBO).
Yes, doing a million separate memcpy() calls for 4 bytes each will not be a good idea. A modern compiler might actually inline it, so it might be equivalent to just individual assignments, though. But you can also do the assignments directly, since memcpy is not gaining you anything here.
However, it is not clear what the performance impact of all this is. glMapBuffer might return a pointer to
some local copy of the VBO in system memory, whose contents will later have to be copied to the GPU. Since the driver does not know which values you changed and which you did not, it might have to re-transmit the whole buffer.
some system memory inside the GART area, which is mapped on the GPU, so the GPU will directly access this memory when reading from the buffer.
some I/O-mapped region in VRAM. In this case, the caching behavior of the memory region might be significantly different, and changing 4 bytes in every 12-byte block might not be the most ideal access pattern. Just re-copying the whole sub-block as one big chunk might yield better performance.
The mapping itself is also not for free; it involves changing the page tables, and the GL driver might have to synchronize its threads, or, in the worst case, synchronize with the GPU (to prevent you from overwriting data the GPU is still using for a previous draw call which is still in flight).
sometimes I want to modify only the z value of a range of vertices between the values (x1, y1) and (x2, y2) - the concerned vertices strictly follow each other
So you have a contiguous sub-region of the buffer which you want to modify. I would recommend looking at two alternatives:
Use glMapBufferRange (if available in your OpenGL version) to map only the region you care about.
Forget about buffer mapping completely, and try glBufferSubData(). Not individually on each z component of each vertex, but as one big chunk for the whole range of modified vertices. This implies you keep a local copy of the buffer contents in your memory somewhere; just update it and send the modified range to the GL.
Which option is better will depend on a lot of different factors, and I would not rule either of them out without benchmarking in the actual scenario, on the actual implementations you care about. Also have a look at the general strategies for Buffer Object Streaming in OpenGL. A persistently mapped buffer might or might not also be a good option for your use case.
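A rough sketch of both alternatives, assuming a Qt version that provides QOpenGLBuffer::mapRange (Qt 5.4 or newer) and using made-up names (firstVertex, vertexCount, newZ, localVertices) for data the question does not show:

const int floatsPerVertex = 3;
const int offsetBytes = firstVertex * floatsPerVertex * sizeof(GLfloat);
const int lengthBytes = vertexCount * floatsPerVertex * sizeof(GLfloat);

// Option 1: map only the modified sub-range (wraps glMapBufferRange).
m_vboPos.bind();
GLfloat *range = static_cast<GLfloat *>(
    m_vboPos.mapRange(offsetBytes, lengthBytes, QOpenGLBuffer::RangeWrite));
if (range != NULL) {
    for (int v = 0; v < vertexCount; ++v)
        range[floatsPerVertex * v + 2] = newZ[v]; // touch only the z component
    m_vboPos.unmap();
}
m_vboPos.release();

// Option 2: no mapping at all; upload the whole modified range as one block
// from a local copy (wraps glBufferSubData).
// m_vboPos.bind();
// m_vboPos.write(offsetBytes, localVertices.constData() + firstVertex, lengthBytes);
// m_vboPos.release();

Which of the two wins depends on the driver, as discussed above, so benchmark both.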
The glMap method works great and is really FAST!
Thanks a lot genpfault, the speed gain is so great that the 3D rendering isn't choppy anymore.
Here is my new code, simplified to offer an easy-to-understand answer:
vertexbuffer.bind();
GLfloat* posBuffer = (GLfloat*) (vertexbuffer.map(QOpenGLBuffer::WriteOnly));
if (posBuffer != (GLfloat*) NULL) {
    int index = NumberOfVertices(area.y + 1, image.cols); // index of first vertex on line area.y
    for (row = ...) for (col = ...) {
        if (mask.at<uchar>(row, col) != 0)
            posBuffer[3 * index + 2] = depthmap.at<uchar>(row, col) * depth;
        index++;
    }
}
vertexbuffer.unmap();
vertexbuffer.release();

Double-checking understanding of memory coalescing in CUDA

Suppose I define some arrays which are visible to the GPU:
double* doubleArr = createCUDADouble(fieldLen);
float* floatArr = createCUDAFloat(fieldLen);
char* charArr = createCUDAChar(fieldLen);
Now, I have the following CUDA thread:
__global__ void thread() {
    int o = getOffset(); // the same for all threads in the launch
    double d = doubleArr[threadIdx.x + o];
    float f = floatArr[threadIdx.x + o];
    char c = charArr[threadIdx.x + o];
}
I'm not quite sure whether I correctly interpret the documentation, and it's very critical for my design: will the memory accesses for double, float and char be nicely coalesced? (Guess: yes, it will fit into sizeof(type) * blockSize.x / (transaction size) transactions, plus maybe one extra transaction at the upper and lower boundary.)
Yes, for all the cases you have shown, and assuming createCUDAxxxxx translates into some kind of ordinary cudaMalloc type operation, everything should nicely coalesce.
If we have ordinary 1D device arrays allocated via cudaMalloc, in general we should have good coalescing behavior across threads if our load pattern includes an array index of the form:
data_array[some_constant + threadIdx.x];
It really does not matter what data type the array is - it will coalesce nicely.
However, from a performance perspective, global loads (assuming an L1 miss) will occur in a minimum 128-byte granularity. Therefore loading larger sizes per thread (say, int, float, double, float4, etc.) may give slightly better performance. The caches tend to mitigate any difference, if the loads are across a large enough number of warps.
It's pretty easy also to verify this on a particular piece of code with a profiler. There are many ways to do this depending on which profiler you choose, but for example with nvprof you can do:
nvprof --metrics gld_efficiency ./my_exe
and it will return an average percentage number that more or less exactly reflects the percentage of optimal coalescing that is occurring on global loads.
This is the presentation I usually cite for additional background info on memory optimization.
I suppose someone will come along and notice that this pattern:
data_array[some_constant + threadIdx.x];
roughly corresponds to the access type shown on slides 40-41 of the above presentation. And aha!! efficiency drops to 50%-80%. That is true, if only a single warp-load is being considered. However, referring to slide 40, we see that the "first" load will require two cachelines to be loaded. After that however, additional loads (moving to the right, for simplicity) will only require one additional/new cacheline per warp-load (assuming the existence of an L1 or L2 cache, and reasonable locality, i.e. lack of thrashing). Therefore, over a reasonably large array (more than just 128 bytes), the average requirement will be one new cacheline per warp, which corresponds to 100% efficiency.

Blend two images using GPU

I need to blend thousands of pairs of images very fast.
My code currently does the following: _apply is a function pointer to a function like Blend. It is one of the many functions we can pass, but it is not the only one. Any such function takes two values and outputs a third, and it is applied to each channel of each pixel. I would prefer a solution that is general to any such function rather than a specific solution for blending.
typedef byte (*Transform)(byte src1,byte src2);
Transform _apply;
for (int i = 0; i < _frameSize; i++)
{
    source[i] = _apply(source[i], blend[i]);
}

byte Blend(byte src, byte blend)
{
    int resultPixel = (src + blend) / 2;
    return (byte)resultPixel;
}
I was doing this on the CPU but the performance is terrible. It is my understanding that doing this on the GPU is very fast. My program needs to run on computers that will have either Nvidia GPUs or Intel GPUs, so whatever solution I use needs to be vendor independent. If I use the GPU, it has to be OpenGL to be platform independent as well.
I think using a GLSL pixel shader would help, but I am not familiar with pixel shaders or how to use them to 2D objects (like my images).
Is that a reasonable solution? If so, how do I do this in 2D?
If there is a library that already does that it is also great to know.
EDIT: I am receiving the image pairs from different sources. One always comes from a 3D graphics component in OpenGL (so it is on the GPU originally). The other one comes from system memory, either from a socket (in a compressed video stream) or from a memory-mapped file. The "sink" of the resulting image is the screen. I am expected to show the images on the screen, so going through the GPU is an option, as is using something like SDL to display them.
The blend function that is going to be executed the most is this one
byte Patch(byte delta, byte lo)
{
    int resultPixel = (2 * (delta - 127)) + lo;
    if (resultPixel > 255)
        resultPixel = 255;
    if (resultPixel < 0)
        resultPixel = 0;
    return (byte)resultPixel;
}
EDIT 2: The image coming from GPU land arrives in this fashion: from FBO to PBO to system memory.
glBindFramebuffer(GL_FRAMEBUFFER,fbo);
glReadBuffer( GL_COLOR_ATTACHMENT0 );
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glReadPixels(0,0,width,height,GL_BGR,GL_UNSIGNED_BYTE,0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
void* mappedRegion = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
It seems like it is probably better to just keep everything in GPU memory. The other bitmap can come from system memory. We may eventually get it from a video decoder in GPU memory as well.
Edit 3: One of my images will come from D3D while the other one comes from OpenGL. It seems that something like Thrust or OpenCL is the best option.
From the looks of your Blend function, this is an entirely memory-bound operation. The caches on the CPU can likely only hold a very small fraction of the thousands of images you have, meaning most of your time is spent waiting for RAM to fulfill load/store requests, and the CPU will idle a lot.
You will NOT get any speedup by having to copy your images from RAM to GPU, have the GPU arithmetic units idle while they wait for GPU RAM to feed them data, wait for GPU RAM again to write results, then copy it all back to main RAM. Using GPU for this could actually slow things down substantially.
But I could be wrong and you might not be saturating your memory bus already. You will have to try it on your system and profile it. Here are some simple things you can try to optimize.
1. Multi-thread
I would focus on optimizing the algorithm directly on the CPU. The simplest thing is to go multi-threaded, which can be as simple as enabling OpenMP in your compiler and updating your for loop:
#include <omp.h> // add this along with enabling OpenMP support in your compiler
...
#pragma omp parallel for // <--- compiler magic happens here
for (int i = 0; i < _frameSize; i++)
{
    source[i] = _apply(source[i], blend[i]);
}
If your memory bandwidth is not saturated, this will likely speed up the blending by however many cores your system has.
2. Micro-optimizations
Another thing you can try is to implement your Blend using SIMD instructions which most CPUs have nowadays. I can't help you with that without knowing what CPU you are targeting.
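Purely as an illustration, assuming an x86 target with SSE2, that byte is unsigned char, and that _frameSize is a multiple of 16 (BlendSSE2 is a made-up name): _mm_avg_epu8 averages 16 byte pairs per instruction, although it rounds up ((a + b + 1) / 2) where the scalar Blend truncates, so results may differ by one.

#include <emmintrin.h> // SSE2 intrinsics

void BlendSSE2(const byte *src, const byte *blend, byte *result, int frameSize)
{
    for (int i = 0; i < frameSize; i += 16)
    {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src + i));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(blend + i));
        // 16 byte-wise averages in a single instruction
        _mm_storeu_si128(reinterpret_cast<__m128i *>(result + i), _mm_avg_epu8(a, b));
    }
}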
You can also try unrolling your for loop to mitigate some of the loop overhead.
One easy way to achieve both of these is to leverage the Eigen matrix library by wrapping your data in its data structures.
// initialize your data and result buffer
byte *source = ...
byte *blend = ...
byte *result = ...
// tell Eigen where your data/buffers are, and to treat them like dynamic vectors of bytes
// this is a cheap shallow copy
Map<Matrix<byte, Dynamic,1> > sourceMap(source, _frameSize);
Map<Matrix<byte, Dynamic,1> > blendMap(blend, _frameSize);
Map<Matrix<byte, Dynamic,1> > resultMap(result, _frameSize);
// perform blend using all manner of insane optimization voodoo under the covers
resultMap = (sourceMap + blendMap)/2;
3. Use GPGPU
Finally, I will provide a direct answer to your question with an easy way to leverage the GPU without having to know much about GPU programming. The simplest thing to do is try the Thrust library. You will have to rewrite your algorithms as STL style algorithms, but that's pretty easy in your case.
// functor for blending
struct blend_functor
{
    template <typename Tuple>
    __host__ __device__
    void operator()(Tuple t)
    {
        // C[i] = (A[i] + B[i]) / 2;
        thrust::get<2>(t) = (thrust::get<0>(t) + thrust::get<1>(t)) / 2;
    }
};
// initialize your data and result buffer
byte *source = ...
byte *blend = ...
byte *result = NULL;
// copy the data to the vectors on the GPU
thrust::device_vector<byte> A(source, source + _frameSize);
thrust::device_vector<byte> B(blend, blend + _frameSize);
// allocate result vector on the GPU
thrust::device_vector<byte> C(_frameSize);
// process the data on the GPU device
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(
A.begin(), B.begin(), C.begin())),
thrust::make_zip_iterator(thrust::make_tuple(
A.end(), B.end(), C.end())),
blend_functor());
// copy the data back to main RAM
thrust::host_vector<byte> resultVec = C;
result = resultVec.data();
A really neat thing about thrust is that once you have written the algorithms in a generic way, it can automagically use different back ends for doing the computation. CUDA is the default back end, but you can also configure it at compile time to use OpenMP or TBB (Intel threading library).
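For example (a sketch; file names and include paths are placeholders, and the Thrust headers ship with the CUDA toolkit), the back end is selected with the THRUST_DEVICE_SYSTEM macro at compile time:

nvcc -O2 blend.cu
g++ -O2 -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -I/path/to/cuda/include blend.cpp

The first command builds against the default CUDA back end; the second runs the same thrust::for_each on the host using OpenMP threads.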

Efficiency in C and C++

So my teacher tells me that I should compute intermediate results as needed on the fly rather than storing them, because the speed of processors nowadays is much faster than the speed of memory.
So when we compute an intermediate result, we also need to use some memory, right? Can anyone please explain it to me?
Your teacher is right: the speed of processors nowadays is much faster than the speed of memory. Access to RAM is slower than access to the CPU's internal memory: caches, registers, etc.
Suppose you want to compute a trigonometric function: sin(x). To do this you can either call a function that computes the value (the math library offers one, or you can implement your own), or you can use a lookup table stored in memory to get the result, which amounts to storing the intermediate values (sort of).
Calling a function will result in executing a number of instructions, while using a lookup table will result in fewer instructions (getting the address of the LUT, getting the offset to the desired element, reading from address + offset). In this case, storing the intermediate values is faster.
But if you were to do c = a + b, computing the value will be much faster than reading it from somewhere in RAM. Notice that in this case the number of instructions to be executed would be similar.
So while it is true that access to RAM is slower, whether it's worth accessing RAM instead of doing the computation is a sensible question, and several things need to be considered: the number of instructions to be executed, whether the computation happens in a loop and you can take advantage of the architecture's pipeline, the cache, etc.
There is no one answer, you need to analyze each situation individually.
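A minimal sketch of the lookup-table idea (the table size, the restriction to [0, 2*pi) and the lack of interpolation are arbitrary simplifications, not part of the answer):

#include <cmath>

const int TABLE_SIZE = 1024;
const double TWO_PI = 2.0 * std::acos(-1.0);
double sinTable[TABLE_SIZE];                 // the stored "intermediate results"

void initSinTable()
{
    for (int i = 0; i < TABLE_SIZE; ++i)
        sinTable[i] = std::sin(TWO_PI * i / TABLE_SIZE);
}

double sinLookup(double x)                   // expects x in [0, 2*pi)
{
    int idx = static_cast<int>(x / TWO_PI * TABLE_SIZE);
    return sinTable[idx];                    // address + offset + one memory read
}

Whether the table actually beats std::sin depends on whether it stays resident in the cache, which is exactly the kind of situation-by-situation analysis described above.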
Your teacher's advice is oversimplifying advice on a complex topic.
If you think of "intermediate" as a single term (in the arithmetical sense of the word), then ask yourself: is your code re-using that term anywhere else? I.e. if you have code like:
void calculate_sphere_parameters(double radius, double & area, double & volume)
{
area = 4 * (4 * atan(1)) * radius * radius;
volume = 4 * (4 * atan(1)) * radius * radius * radius / 3;
}
should you instead write:
void calculate_sphere_parameters(double radius, double & area, double & volume)
{
double quarter_pi = atan(1);
double pi = 4 * quarter_pi;
double four_pi = 4 * pi;
double four_thirds_pi = four_pi / 3;
double radius_squared = radius * radius;
double radius_cubed = radius_squared * radius;
area = four_pi * radius_squared;
volume = four_thirds_pi * radius_cubed; // maybe use "(area * radius) / 3" ?
}
It's not unlikely that a modern optimizing compiler will emit the same binary code for these two. I leave it to the reader to determine which they prefer to see in the source code ...
The same is true for a lot of simple arithmetics (at the very least, if no function calls are involved in the calculation). In addition to that, modern compilers and/or CPU instruction sets might have the ability to do "offset" calculations for free, i.e. something like:
for (int i = 0; i < N; i++) {
do_something_with(i, i + 25, i + 314159);
}
will turn out the same as:
for (int i = 0; i < N; i++) {
int j = i + 25;
int k = i + 314159;
do_something_with(i, j, k);
}
So the main rule should be, if your code's readability doesn't benefit from creating a new variable to hold the result of a "temporary" calculation, it's probably overkill to use one.
If, on the other hand, you're using i + 12345 a dozen times in ten lines of code ... name it, and comment why this strange hardcoded offset is so important.
Remember, just because your source code contains a variable doesn't mean the binary code emitted by the compiler will allocate memory for this variable. The compiler might come to the conclusion that the value isn't even used (and completely discard the calculation assigning it), or it might come to the conclusion that it's "only an intermediate" (never used later in a place where it would have to be retrieved from memory) and so store it in a register, to be overwritten after its last use. It is far more efficient to calculate a value like i + 1 each time you need it than to retrieve it from a memory location.
My advice would be:
keep your code readable first and foremost - too many variables obscure rather than help.
don't bother saving "simple" intermediates - addition/subtraction or scaling by powers of two is pretty much a "free" operation
if you reuse the same value ("arithmetic term") in multiple places, save it if it is expensive to calculate (for example involves function calls, a long sequence of arithmetics, or a lot of memory accesses like an array checksum).
So when we compute an intermediate result, we also need to use some memory, right? Can anyone please explain it to me?
There are several levels of memory in a computer. The layers look like this
registers – the CPU does all its calculations on these, and access is essentially instant
caches – memory that is tightly coupled to the CPU core; all accesses to main system memory actually go through the cache, and to the program it looks as if the data comes and goes from system memory. If the data is present in the cache and the access is well aligned, the access is almost instant as well and hence very fast.
main system memory – connected to the CPU through a memory controller and shared by the CPU cores in a system. Accessing main memory introduces latencies through addressing and the limited bandwidth between memory and the CPUs.
When you work with intermediary results calculated in place, those values often never leave the registers, or go only as far as the cache, and thus are not limited by the available system memory bandwidth or held up by memory bus arbitration or address-generation interlocks.
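A small sketch of that point (the volatile qualifier is only there to force the second version to really go through memory):

double with_register_temp(double a, double b, double c)
{
    double tmp = a * b;             // intermediate normally lives in a register
    return tmp + tmp * c;           // reused without any memory traffic
}

volatile double g_stored;           // volatile forces real loads and stores

double with_memory_temp(double a, double b, double c)
{
    g_stored = a * b;               // intermediate pushed out to memory
    return g_stored + g_stored * c; // every use reloads it from memory
}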
This hurts me.
Ask your teacher (or better, don't, because with his level of competence in programming I wouldn't trust him), whether he has measured it, and what the difference was. The rule when you are programming for speed is: If you haven't measured it, and measured it before and after a change, then what you are doing is purely based on presumption and worthless.
In reality, an optimising compiler will take the code that you write and translate it to the fastest possible machine code. As a result, it is unlikely that there is any difference in code or speed.
On the other hand, using intermediate variables will make complex expressions easier to understand and easier to get right, and it makes debugging a lot easier. If your huge complex expression gives what looks like the wrong result, intermediate variables make it possible to check the calculation bit by bit and find where the error is.
Now even if he was right and removing intermediate variables made your code faster, and even if anyone cared about the speed difference, he would be wrong: Making your code readable and easier to debug gets you to a correctly working version of the code quicker (and if it doesn't work, nobody cares how fast it is). Now if it turns out that the code needs to be faster, the time you saved will allow you to make changes that make it really faster.

Std::vector fill time goes from 0ms to 16ms after a certain threshold?

Here is what I'm doing. My application takes points from the user while dragging and in real time displays a filled polygon.
It basically adds the mouse position on MouseMove. This point is a USERPOINT and has bezier handles, because eventually I will do bezier curves, which is why I must transfer them into a vector.
So basically MousePos -> USERPOINT. The USERPOINT gets added to a std::vector<USERPOINT>. Then in my UpdateShape() function, I do this:
DrawingPoints is defined like this:
std::vector<std::vector<GLdouble>> DrawingPoints;
Contour[i].DrawingPoints.clear();
for(unsigned int x = 0; x < Contour[i].UserPoints.size() - 1; ++x)
SetCubicBezier(
Contour[i].UserPoints[x],
Contour[i].UserPoints[x + 1],
i);
SetCubicBezier() currently looks like this:
void OGLSHAPE::SetCubicBezier(USERFPOINT &a, USERFPOINT &b, int &currentcontour)
{
    std::vector<GLdouble> temp(2);
    if (a.RightHandle.x == a.UserPoint.x && a.RightHandle.y == a.UserPoint.y
        && b.LeftHandle.x == b.UserPoint.x && b.LeftHandle.y == b.UserPoint.y)
    {
        temp[0] = (GLdouble)a.UserPoint.x;
        temp[1] = (GLdouble)a.UserPoint.y;
        Contour[currentcontour].DrawingPoints.push_back(temp);
        temp[0] = (GLdouble)b.UserPoint.x;
        temp[1] = (GLdouble)b.UserPoint.y;
        Contour[currentcontour].DrawingPoints.push_back(temp);
    }
    else
    {
        // do cubic bezier calculation
    }
}
So, because of the cubic bezier, I need to turn USERPOINTs into GLdouble[2] (since the GLU tessellator takes a plain array of doubles).
So I did some profiling. At ~ 100 points, the code:
for(unsigned int x = 0; x < Contour[i].UserPoints.size() - 1; ++x)
SetCubicBezier(
Contour[i].UserPoints[x],
Contour[i].UserPoints[x + 1],
i);
Took 0 ms to execute. Then, at around 120 points, it jumps to 16 ms and never looks back. I'm positive this is due to std::vector. What can I do to make it stay at 0 ms? I don't mind using lots of memory while generating the shape and then removing the excess when the shape is finalized, or something like that.
0ms is no time...nothing executes in no time. This should be your first indicator that you might want to check your timing methods over timing results.
Namely, timers typically don't have good resolution. Your pre-16 ms results are probably just 1 ms - 15 ms being incorrectly reported as 0 ms. In any case, if we could tell you how to keep it at 0 ms, we'd be rich and famous.
Instead, find out which parts of the loop take the longest, and optimize those. Don't work towards an arbitrary time measure. I'd recommend getting a good profiler to get accurate results. Then you don't need to guess what's slow (something in the loop), but can actually see what part is slow.
You could use vector::reserve() to avoid unnecessary reallocations in DrawingPoints:
Contour[i].DrawingPoints.reserve(2 * Contour[i].UserPoints.size()); // room for two points per segment
for(unsigned int x = 0; x < Contour[i].UserPoints.size() - 1; ++x) {
...
}
If you actually timed the second code snippet only (as you stated in your post), then you're probably just reading from the vector. This means the cause cannot be the re-allocation cost of the vector. In that case, it may be due to cache effects in the CPU (i.e. small datasets can be read at lightning speed from the CPU cache, but whenever the dataset is larger than the cache [or when alternately reading from different memory locations], the CPU has to access RAM, which is distinctly slower than cache access).
If the part of the code, which you profiled, appends data to the vector, then use std::vector::reserve() with an appropriate capacity (number of expected entries in vector) before filling it.
However, observe two general rules for profiling/benchmarking:
1) Use time measurement methods with high resolution (as others stated, the resolution of your timer IS too low).
2) In any case, run the code snippet more than once (e.g. 100 times), get the total time of all runs and divide it by the number of runs. This will give you some REAL numbers (a short timing sketch follows).
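Both rules together look roughly like this with std::chrono (steady_clock is monotonic and usually offers microsecond resolution or better; the loop body stands in for the profiled snippet):

#include <chrono>

const int runs = 100;
auto start = std::chrono::steady_clock::now();
for (int r = 0; r < runs; ++r)
{
    // ... the code under test, e.g. the SetCubicBezier loop ...
}
auto stop = std::chrono::steady_clock::now();
double avgMs = std::chrono::duration<double, std::milli>(stop - start).count() / runs;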
There's a lot of guessing going on here. Good guesses, I imagine, but guesses nevertheless. And when you try to measure the time functions take, that doesn't tell you why they take it. You can see, if you try different things, that the time will change, and from that you can get some suggestion of what was taking the time, but you can't really be certain.
If you really want to know what's taking the time, you need to catch it when it's taking that time, and find out for certain what it's doing. One way is to single-step it at the instruction level through that code, but I suspect that's out of the question. The next best way is to get stack samples. You can find profilers that are based on stack samples. Personally, I rely on the manual technique, for the reasons given here.
Notice that it's not really about measuring time. It's about finding out why that extra time is being spent, which is a very different question.