Julia: access @time timing and memory allocation values from within code - profiling

I am profiling my Julia application, in particular the execution time and memory allocation of function calls. I would like to automate storage of this information to a database so it can run without supervision.
The information I want to store is that returned by @time, of the form:
@time some_function()
86.278909 seconds (6.94 M allocations: 383.520 MB, 0.08% gc time)
Is it possible to access this information from within the code itself, rather than it just being printed out?
I know I can access the time component using tic() and toq(), but what about the memory allocation?

There is a @timed macro that gives you all of this information:
julia> @timed sleep(1)
(nothing,1.00417,624,0.0,Base.GC_Diff(624,0,0,13,0,0,0,0,0))
help?> @timed
@timed
A macro to execute an expression, and return the value of the expression, elapsed
time, total bytes allocated, garbage collection time, and an object with various
memory allocation counters.

First, Julia gives you very good access to its internals, so if something does something, you can just look inside and see how. In the case of the @time macro, looking inside is done with macroexpand in the REPL:
macroexpand(:(@time some_function()))
Having done so, the equivalent of tic() and toq() for allocations is before = Base.gc_num() and diff = Base.GC_Diff(Base.gc_num(), before).
The diff variable now holds the allocations statistics.
Base.gc_alloc_count(diff) gives the allocation count for example.
Cheers!

Related

Why does a pointer to a class take less SRAM than a "classic" variable?

I have an Arduino Micro with 3 time-of-flight LIDAR sensors soldered to it. In my code I was creating 3 global variables like this:
Adafruit_VL53L0X lox0 = Adafruit_VL53L0X();
Adafruit_VL53L0X lox1 = Adafruit_VL53L0X();
Adafruit_VL53L0X lox2 = Adafruit_VL53L0X();
And it took like ~80% of the memory
Now i am creating my objects like this
Adafruit_VL53L0X *lox_array[3] = {new Adafruit_VL53L0X(), new Adafruit_VL53L0X(), new Adafruit_VL53L0X()};
And it takes 30% of my entire program.
I tried to look at the Arduino documentation but I didn't find anything that could help me.
I can understand that creating a "classic" object can fill the memory. But where is the memory located when the pointer is created?
You use the same amount of memory either way. (Actually, the second way uses a tiny bit more, because the pointers need to be stored as well.)
It's just that with the first way, the memory is already allocated statically from the start and part of the data size of your program, so your compiler can tell you about it, while with the second way, the memory is allocated at runtime dynamically (on the heap), so your compiler doesn't know about it up front.
I dare say that the second method is more dangerous, because consider the following scenario: Let's assume your other code and data already uses 90% of the memory at compile-time. If you use your first method, you will fail to upload the program because it would now use something like 150%, so you already know it won't work. But if you use your second method, your program will compile and upload just fine, but then crash when trying to allocate the extra memory at runtime.
(By the way, the compiler message is a bit incomplete. It should rather say "leaving 1750 bytes for local variables and dynamically allocated objects" or something along those lines).
You can check it yourself using this function, which estimates the amount of free memory at runtime by comparing the top of the heap with the bottom (physically, not logically) of the stack; the latter is obtained by taking the address of a local variable, which will have been allocated on the stack at that point:
int freeRam () {
  extern int __heap_start, *__brkval;
  int v;  // lives at the current end of the stack
  // If nothing has been malloc'd yet (__brkval == 0), the heap ends at
  // __heap_start; otherwise it ends at __brkval.
  return (int) &v - (__brkval == 0 ? (int) &__heap_start : (int) __brkval);
}
See also: https://playground.arduino.cc/Code/AvailableMemory/

Execution time overhead at index 2^21

What do I want to do?
I have written a program which reads data from binary files and does calculations based on the read values. Execution time is most important for this program. To validate that my program is operating within the specified time limits, I tried to log all the calculations by storing them inside a std::vector<std::string>. After the time-critical execution is done, I write this vector to a file.
What is stored inside the vector?
In the vector I store the execution time (std::chrono::steady_clock::now()) and the current clock time (std::chrono::system_clock::now(), formatted with date.h by Howard Hinnant).
What did I observe?
While analyzing the results I stumbled over the following pattern. Independent of the input data, the mean execution time of 0.003 ms for one operation explodes to ~20 ms for a single operation at one specific, reproducible index. After this, the execution time of all operations goes back to 0.003 ms. The index of the execution time explosion is 2097151 every time. Since 2^21 equals 2097152, something happens at 2^21 that slows down the entire program. The same effect can be observed at 2^22 and 2^23. Even more interesting is that the lag doubles each time (2^21 = ~20 ms, 2^22 = ~43 ms, 2^23 = ~81 ms). I googled this specific number and the only thing I found was some node.js stuff which uses C++ under the hood.
What do I suspect?
At index 2^21 a memory area must be expanded, and that is why the delay occurs.
Questions
Is my assumption correct and the size of the vector is the problem?
How can I debug such a phenomenon? (To be certain that the vector alone is the problem)
Can I allocate enough memory beforehand to avoid the memory expansion?
What could I use instead of a std::vector that supports more than 10,000,000,000 elements?
I was able to solve my problem by reserving memory with std::vector::reserve() before the time-critical part of my program. Thanks to everyone in the comments.
Here the working code I used:
std::vector<std::string> myLogVector;
myLogVector.reserve(12000000);
//...do time critical stuff, without reallocating storage

Lua garbage collection and C userdata

In my game engine I expose my Vector and Color objects to Lua, using userdata.
Now, for every Vector and Color created from within Lua scripts, even locally, Lua's memory usage goes up a bit, and it doesn't fall until the garbage collector runs.
The garbage collector causes a small lagspike in my game.
Shouldn't the Vector and Color objects be immediately deleted if they are only used as arguments? For example like: myObject:SetPosition( Vector( 123,456 ) )
They aren't right now - the memory usage of Lua rises to 1.5 MB each second, then the lag spike occurs and it goes back to about 50 KB.
How can I solve this problem, is it even solvable?
You can run a lua_setgcthreshold(L,0) to force an immediate garbage collection after you exit the function.
Edit: for 5.1 I'm seeing the following:
int lua_gc (lua_State *L, int what, int data);
Controls the garbage collector.
This function performs several tasks, according to the value of the parameter what:
* LUA_GCSTOP: stops the garbage collector.
* LUA_GCRESTART: restarts the garbage collector.
* LUA_GCCOLLECT: performs a full garbage-collection cycle.
* LUA_GCCOUNT: returns the current amount of memory (in Kbytes) in use by Lua.
* LUA_GCCOUNTB: returns the remainder of dividing the current amount of bytes of memory in use by Lua by 1024.
* LUA_GCSTEP: performs an incremental step of garbage collection. The step "size" is controlled by data (larger values mean more steps) in a non-specified way. If you want to control the step size you must experimentally tune the value of data. The function returns 1 if the step finished a garbage-collection cycle.
* LUA_GCSETPAUSE: sets data as the new value for the pause of the collector (see §2.10). The function returns the previous value of the pause.
* LUA_GCSETSTEPMUL: sets data as the new value for the step multiplier of the collector (see §2.10). The function returns the previous value of the step multiplier.
In Lua, the only way an object like a userdata can be deleted is by the garbage collector. You can call the garbage collector directly, like B Mitch wrote (use lua_gc(L, LUA_GCSTEP, ...)), but there is no guarantee that exactly your temporary object will be freed.
The best way to solve this is to avoid the creation of temporary objects. If you need to pass fixed parameters to methods like SetPosition, try to modify the API so that it also accepts numeric arguments, avoiding the creation of a temporary object, like so:
myObject:SetPosition(123, 456)
Lua Programming Gems has a nice piece about optimization of Lua programs.
Remember, Lua doesn't know until runtime whether or not you saved those objects - you could have put them in a table in the registry, for example. You shouldn't even notice the impact of collecting 1.5 MB; there's another problem here.
Also, you're really being wasteful making a new object for that. Remember that in Lua every object has to be dynamically allocated, so you're calling malloc to... make a Vector object that holds two numbers? Write your function to take a pair of numeric arguments as an overload.

Can static local variables cut down on memory allocation time?

Suppose I have a function in a single threaded program that looks like this
void f(some arguments) {
    char buffer[32];
    // some operations on buffer
}
and f appears inside some loop that gets called often, so I'd like to make it as fast as possible. It looks to me like the buffer needs to get allocated every time f is called, but if I declare it to be static, this wouldn't happen. Is that correct reasoning? Is that a free speed up? And just because of that fact (that it's an easy speed up), does an optimizing compiler already do something like this for me?
No, it's not a free speedup.
First, the allocation is almost free to begin with (since it consists merely of adding 32 to the stack pointer), and second, there are at least two reasons why a static variable might be slower:
you lose cache locality. Data allocated on the stack is going to be in the CPU cache already, so accessing it is extremely cheap. Static data is allocated in a different area of memory, so it may not be cached, in which case it will cause a cache miss and you'll have to wait hundreds of clock cycles for the data to be fetched from main memory.
you lose thread safety. If two threads execute the function simultaneously, it'll crash and burn, unless a lock is placed so only one thread at a time is allowed to execute that section of the code. And that would mean you'd lose the benefit of having multiple CPU cores.
So it's not a free speedup. But it is possible that it is faster in your case (although I doubt it).
So try it out, benchmark it, and see what works best in your particular scenario.
Incrementing 32 bytes on the stack will cost virtually nothing on nearly all systems. But you should test it out. Benchmark a static version and a local version and post back.
For implementations that use a stack for local variables, allocation often just means advancing a register (adding a value to it), such as the Stack Pointer (SP) register. This time is negligible, usually one instruction or less.
However, initialization of stack variables takes a little longer, but again, not much. Check out your assembly language listing (generated by compiler or debugger) for exact details. There is nothing in the standard about the duration or number of instructions required to initialize variables.
Allocation of static local variables is usually treated differently. A common approach is to place these variables in the same area as global variables. Usually all the variables in this area are initialized before calling main(). Allocation in this case is a matter of assigning addresses to registers or storing the area information in memory. Not much execution time wasted here.
Dynamic allocation is the case where execution cycles are burned. But that is not in the scope of your question.
The way it is written now, there is no cost for allocation: the 32 bytes are on the stack. The only real work is that you need to zero-initialize.
A local static is not a good idea here. It won't be faster, and your function can't be used from multiple threads anymore, as all calls share the same buffer. Not to mention that local static initialization is not guaranteed to be thread-safe.
I would suggest a more general approach to this problem: if you have a function that is called many times and needs some local variables, consider wrapping it in a class and making those variables data members. Consider the case where you needed the size to be dynamic, so that instead of char buffer[32] you had std::vector<char> buffer(requiredSize), which is more expensive than an array to initialise every time through the loop:
class BufferMunger {
public:
    BufferMunger() {};
    void DoFunction(args);
private:
    char buffer[32];
};

BufferMunger m;
for (int i = 0; i < 1000; i++) {
    m.DoFunction(arg[i]); // only one allocation of buffer
}
There's another implication of making the buffer static, which is that the function is now unsafe in a multithreaded application, as two threads may call it and overwrite the data in the buffer at the same time. On the other hand it's safe to use a separate BufferMunger in each thread that requires it.
Note that block-level static variables in C++ (as opposed to C) are initialized on first use. This implies that you'll be introducing the cost of an extra runtime check. The branch potentially could end up making performance worse, not better. (But really, you should profile, as others have mentioned.)
Regardless, I don't think it's worth it, especially since you'd be intentionally sacrificing re-entrancy.
If you are writing code for a PC, there is unlikely to be any meaningful speed advantage either way. On some embedded systems, it may be advantageous to avoid all local variables. On some other systems, local variables may be faster.
An example of the former: on the Z80, the code to set up the stack frame for a function with any local variables was pretty long. Further, the code to access local variables was limited to using the (IX+d) addressing mode, which was only available for 8-bit instructions. If X and Y were both global/static or both local variables, the statement "X=Y" could assemble as either:
; If both are static or global: 6 bytes; 32 cycles
ld HL,(_Y) ; 16 cycles
ld (_X),HL ; 16 cycles
; If both are local: 12 bytes; 56 cycles
ld E,(IX+_Y) ; 14 cycles
ld D,(IX+_Y+1) ; 14 cycles
ld (IX+_X),E ; 14 cycles
ld (IX+_X+1),D ; 14 cycles
A 100% code space penalty and 75% time penalty in addition to the code and time to set up the stack frame!
On the ARM processor, a single instruction can load a variable which is located within +/-2K of an address pointer. If a function's local variables total 2K or less, they may be accessed with a single instruction. Global variables will generally require two or more instructions to load, depending upon where they are stored.
With gcc, I do see some speedup:
void f() {
    char buffer[4096];
}

int main() {
    int i;
    for (i = 0; i < 100000000; ++i) {
        f();
    }
}
And the time:
$ time ./a.out
real 0m0.453s
user 0m0.450s
sys 0m0.010s
changing buffer to static:
$ time ./a.out
real 0m0.352s
user 0m0.360s
sys 0m0.000s
Depending on what exactly the variable is doing and how it's used, the speedup is somewhere between almost nothing and nothing. Because (on x86 systems) stack memory is allocated for all local variables at once with a single instruction (sub esp, amount), having just one other stack variable eliminates any gain. The only exception is really huge buffers, in which case the compiler might insert a call to _chkstk to allocate the memory (but if your buffer is that big, you should re-evaluate your code). The compiler cannot turn stack memory into static memory via optimization, as it cannot assume the function will be used in a single-threaded environment; besides, it would mess with object constructors and destructors, etc.
If there are any local automatic variables in the function at all, the stack pointer needs to be adjusted. The time taken for the adjustment is constant, and will not vary based on the number of variables declared. You might save some time if your function is left with no local automatic variables whatsoever.
If a static variable is initialized, there will be a flag somewhere to determine if the variable has already been initialized. Checking the flag will take some time. In your example the variable is not initialized, so this part can be ignored.
Static variables should be avoided if your function has any chance of being called recursively or from two different threads.
It will make the function substantially slower in most real cases. This is because the static data segment is not near the stack and you will lose cache locality, so you will get a cache miss when you try to access it. When you allocate a regular char[32] on the stack, however, it is right next to all your other needed data and costs very little to access. The initialization costs of a stack-based array of char are meaningless.
This is ignoring that statics have many other problems.
You really need to actually profile your code and see where the slowdowns are, because no profiler will tell you that allocating a statically-sized buffer of characters is a performance problem.

What is the cost of a function call?

Compared to
Simple memory access
Disk access
Memory access on another computer (on the same network)
Disk access on another computer (on the same network)
in C++ on windows.
relative timings (shouldn't be off by more than a factor of 100 ;-)
memory-access in cache = 1
function call/return in cache = 2
memory-access out of cache = 10 .. 300
disk access = 1000 .. 1e8 (amortized depends upon the number of bytes transferred)
depending mostly upon seek times
the transfer itself can be pretty fast
involves at least a few thousand ops, since the user/system threshold must be crossed at least twice; an I/O request must be scheduled, the result must be written back; possibly buffers are allocated...
network calls = 1000 .. 1e9 (amortized depends upon the number of bytes transferred)
same argument as with disk i/o
the raw transfer speed can be quite high, but some process on the other computer must do the actual work
A function call is basically a push of the return address and frame pointer onto the stack, plus the addition of a new frame on top of that. The function parameters are moved into registers for use, and the stack pointer is advanced to the new top of the stack for the execution of the function.
In comparison with time
Function call ~ simple memory access
Function call < Disk Access
Function call < memory access on another computer
Function call < disk access on another computer
Compared to a simple memory access - slightly more, negligible really.
Compared to every thing else listed - orders of magnitude less.
This should hold true for just about any language on any OS.
In general, a function call is going to be slightly slower than memory access since it in fact has to do multiple memory accesses to perform the call. For example, multiple pushes and pops of the stack are required for most function calls using __stdcall on x86. But if your memory access is to a page that isn't even in the L2 cache, the function call can be much faster if the destination and the stack are all in the CPU's memory caches.
For everything else, a function call is many (many) orders of magnitude faster.
Hard to answer because there are a lot of factors involved.
First of all, "simple memory access" isn't simple. At modern clock speeds, a CPU can add two numbers faster than it can get a number from one side of the chip to the other (the speed of light - it's not just a good idea, it's the LAW).
So, is the function being called inside the CPU memory cache? Is the memory access you're comparing it to?
Then there is the fact that a function call may flush the CPU instruction pipeline, which will affect speed in a non-deterministic way.
Assuming you mean the overhead of the call itself, rather than what the callee might do, it's definitely far, far quicker than all but the "simple" memory access.
It's probably slower than the memory access, but note that since the compiler can do inlining, function call overhead is sometimes zero. Even if not, it's at least possible on some architectures that some calls to code already in the instruction cache could be quicker than accessing main (uncached) memory. It depends how many registers need to be spilled to stack before making the call, and that sort of thing. Consult your compiler and calling convention documentation, although you're unlikely to be able to figure it out faster than disassembling the code emitted.
Also note that "simple" memory access sometimes isn't - if the OS has to bring the page in from disk then you've got a long wait on your hands. The same would be true if you jump into code currently paged out on disk.
If the underlying question is "when should I optimise my code to minimise the total number of function calls made?", then the answer is "very close to never".
This link comes up a lot in Google. For future reference, I ran a short program in C# measuring the cost of a function call, and the answer is: "about six times the cost of inline". Details below; see // Output at the bottom. UPDATE: To better compare apples with apples, I changed Class1.Method1 to return void, like so: public void Method1() { // return 0; }
Still, inline is faster by 2x: inline (avg): 610 ms; function call (avg): 1380 ms. So the answer, updated, is "about two times".
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;

namespace FunctionCallCost
{
    class Program
    {
        static void Main(string[] args)
        {
            Debug.WriteLine("stop1");
            int iMax = 100000000; // 100M
            DateTime funcCall1 = DateTime.Now;

            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < iMax; i++)
            {
                // gives about 5.94 seconds to do a billion loops,
                // or 0.594 for 100M, about 6 times faster than
                // the method call.
            }
            sw.Stop();
            long iE = sw.ElapsedMilliseconds;
            Debug.WriteLine("elapsed time of main function (ms) is: " + iE.ToString());

            Debug.WriteLine("stop2");
            Class1 myClass1 = new Class1();
            Stopwatch sw2 = Stopwatch.StartNew();
            int dummyI;
            for (int ie = 0; ie < iMax; ie++)
            {
                dummyI = myClass1.Method1();
            }
            sw2.Stop();
            long iE2 = sw2.ElapsedMilliseconds;
            Debug.WriteLine("elapsed time of helper class function (ms) is: " + iE2.ToString());
            Debug.WriteLine("Hi3");
        }
    }
}

// Class 1 here
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace FunctionCallCost
{
    class Class1
    {
        public Class1()
        {
        }

        public int Method1()
        {
            return 0;
        }
    }
}
// Output:
stop1
elapsed time of main function (ms) is: 595
stop2
elapsed time of helper class function (ms) is: 3780
stop1
elapsed time of main function (ms) is: 592
stop2
elapsed time of helper class function (ms) is: 4042
stop1
elapsed time of main function (ms) is: 626
stop2
elapsed time of helper class function (ms) is: 3755
The cost of actually calling the function, but not executing it in full? Or the cost of actually executing the function? Simply setting up a function call is not a costly operation (update the PC?), but obviously the cost of a function executing in full depends on what the function is doing.
Let's not forget that C++ has virtual calls (significantly more expensive, about 10x) and that on Windows you can expect VS to inline calls (zero cost by definition, as there is no call left in the binary).
It depends on what that function does; it would fall 2nd on your list if it were doing logic with objects in memory, and further down the list if it included disk/network access.
A function call usually involves merely a couple of memory copies (often into registers, so they should not take up much time) and then a jump operation. This will be slower than a memory access, but faster than any of the other operations mentioned above, because they require communication with other hardware. The same should usually hold true on any OS/language combination.
If the function is inlined at compile time, the cost of the function becomes equivalent to 0.
0 of course being what you would have gotten by not having a function call, i.e. by inlining it yourself.
This of course sounds excessively obvious when I write it like that.
The cost of a function call depends on the architecture. On 32-bit x86 it is considerable (a few clocks plus a clock or so per function argument), while on x86-64 it is much less because most function arguments are passed in registers instead of on the stack.
A function call is actually a copy of the parameters onto the stack (multiple memory accesses), a register save, the actual code execution, and finally the result copy and register restore (what is saved and restored depends on the system).
So.. speaking relatively:
Function call > Simple memory access.
Function call << Disk access - compared with memory it can be hundreds of times more expensive.
Function call << Memory access on another computer - the network bandwidth and protocol are the grand time killers here.
Function call <<< Disk access on another computer - all of the above and more :)
Only memory access is faster than a function call.
But the call can be avoided if the compiler inlines it; for GCC (and not only GCC), inlining is enabled when using optimization level 3 (-O3).