Measure the peak RAM usage of a Python 2 program correctly - python-2.7

Is there any function that can help me run a Python 2 program and get its peak RAM usage, or some way to measure the RAM usage of this program exactly?
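One common approach is to ask the operating system for the process's peak resident set size through the standard-library resource module. Below is a minimal sketch of that idea (run_workload is only a hypothetical stand-in for the actual program being measured). Measuring from the outside also works: on Linux, GNU time invoked as /usr/bin/time -v python script.py prints a "Maximum resident set size" line.

import resource

def run_workload():
    # Hypothetical stand-in for the program whose memory is being measured.
    data = [i * i for i in range(10 ** 6)]
    return sum(data)

run_workload()

# ru_maxrss is the peak resident set size of this process so far:
# kilobytes on Linux, bytes on macOS.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("peak RSS: %d" % peak)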

Related

Timing C++ program on Mac

I want to time my threading program on my Mac, and I know that in Linux environments you can time it with "time ./a.out".
I tried this in my Mac terminal and it seems to work, but the output is slightly different. On Linux the output has the format (example times):
real 0m0.792s
user 0m0.046s
sys 0m0.218s
while the Mac gives: ./a.out 0.84s user 1.49s system 29% cpu 7.866 total
What is the main difference between "real" and "user", and are there other commands to time the execution of programs?
This is well explained in the man pages, in both Linux and macOS.
Linux
These statistics consist of (i) the elapsed real time between invocation and termination, (ii) the user CPU time and (iii) the system CPU time.
macOS
time writes the total time elapsed, the time consumed by system overhead, and the time used to execute utility to the standard error stream.
Real means the total time needed to execute the whole program, including startup and exit.
User is the CPU time spent running your own code.
System is the CPU time spent calling system functions on your behalf.
The difference is not between "real" and "user"; the numbers are also in a different order.
The difference is between "real" and "total", and there isn't one. They're just different ways of saying the same thing: total, wall-clock, real-world time to execute your program, including all user code, system calls and (crucially) any time that your CPU spent on other processes/tasks.
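To see the same breakdown from inside a program, here is a minimal Python sketch (illustrative, Unix-only) that compares wall-clock time with the user and system CPU times reported by os.times():

import os
import time

wall_start = time.time()
cpu_start = os.times()

# Some arbitrary CPU-bound work to measure.
total = sum(i * i for i in range(10 ** 7))

cpu_end = os.times()
wall_end = time.time()

print("real %.3fs" % (wall_end - wall_start))      # wall-clock, like "real"/"total"
print("user %.3fs" % (cpu_end[0] - cpu_start[0]))  # CPU time in your own code
print("sys  %.3fs" % (cpu_end[1] - cpu_start[1]))  # CPU time in system calls

Note that the "real"/"total" figure can exceed user + sys when the process is waiting on I/O or on other tasks, and can be smaller than their sum on multi-core machines.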

Is this behavior showing that I have a memory problem?

I have an LP problem with ~4 million variables and ~4 million constraints. I use Gurobi to solve it. My PC has 4 cores and 8 GB of memory.
According to the log file, it takes ~100 seconds to find the optimal solution. Then the CPU is released, but almost all of the memory is still in use. The script hangs there, doing nothing for hours, before it finally continues with the statements after the solve (e.g. the print below).
results = opt.solve(model, tee=True)
print("model solved")
I used the barrier method with crossover disabled, which worked best. I also tried different numbers of threads; using 4 turned out to be best in terms of the hanging time (but it still takes hours).
This hanging significantly increases the total run time, which is not desired.
I plan to upgrade the memory, but I first want confirmation from the community that this is indeed a memory issue. Is this a memory problem?
Most likely the problem does not fit in memory, so virtual memory (i.e. disk) is used. When it gets really bad this is called thrashing, and it can bring your machine to its knees. Depending on the number of nonzeros in the problem, the presolve statistics and the number of threads you are using, you need at least 16 GB (and maybe more like 32 GB).
Also, try to reduce the number of threads Gurobi is using. It may be better to use 1 thread (after benchmarking which LP algorithm works best: primal simplex, dual simplex or barrier). By default a concurrent LP method is used: several LP algorithms run in parallel, which significantly increases the memory footprint.
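A minimal Pyomo sketch of those settings follows; it assumes opt was created with SolverFactory('gurobi') and uses Gurobi's standard Threads, Method and Crossover parameters. The two-variable model here is only a stand-in for the real ~4-million-variable LP.

from pyomo.environ import (ConcreteModel, Var, Objective, Constraint,
                           NonNegativeReals, SolverFactory)

# Tiny stand-in model; in the question this is the ~4M-variable LP.
model = ConcreteModel()
model.x = Var(domain=NonNegativeReals)
model.y = Var(domain=NonNegativeReals)
model.obj = Objective(expr=model.x + 2 * model.y)
model.con = Constraint(expr=model.x + model.y >= 1)

opt = SolverFactory('gurobi')
opt.options['Threads'] = 1    # one thread: avoids the concurrent methods' extra memory
opt.options['Method'] = 2     # 2 = barrier (0 = primal simplex, 1 = dual simplex)
opt.options['Crossover'] = 0  # keep crossover disabled, as in the question
results = opt.solve(model, tee=True)
print("model solved")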

Why is a C++ multithreaded program slower on one of two machines with the same processor?

I wrote a simple C++ multithreaded program that runs several FFT iterations; the objective is to get an MFLOPS score for the machine.
I have two (virtual) machines, both running ubuntu:
machine 1: 2 cores, 8 GB RAM
machine 2: 2 cores, 16 GB RAM
Both machines have exactly the same characteristics except for the memory; however, the average results are:
machine 1: 530 MFLOPS
machine 2: 850 MFLOPS
Here is the top command output showing the resources consumed; the one on the left is the 8 GB machine and the one on the right is the 16 GB machine.
Memory swapping shouldn't be an issue; each thread consumes 1 MB of memory. Any ideas why this might be happening?
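A purely diagnostic sketch (Linux-only, all names illustrative): run it on both VMs to confirm that they really expose the same CPU model, frequency and feature flags, and that neither is dipping into swap while the benchmark runs.

def read_proc(path, keys):
    # Return the first value found for each requested key in a /proc file.
    result = {}
    with open(path) as f:
        for line in f:
            key = line.split(':')[0].strip()
            if key in keys and key not in result:
                result[key] = line.split(':', 1)[1].strip()
    return result

print(read_proc('/proc/cpuinfo', ('model name', 'cpu MHz', 'flags')))
print(read_proc('/proc/meminfo', ('MemTotal', 'SwapTotal', 'SwapFree')))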

C++ profiling: clock cycle count

I'm using valgrind --tool=callgrind to profile a critical part of my C++ program.
The part itself takes less than a microsecond to execute, so I'm profiling a large number of loops over that part.
I noticed that instructions take multiples of 0.13% of the time to execute (as a percentage of the program's total execution time), so I only see 0.13, 0.26, 0.52 and so on.
My question is: should I assume that this atomic quantity corresponds to a single CPU cycle? (The callgrind output is viewed graphically in KCachegrind.)
Edit: By the way, looking at machine code, I see mov takes 0.13 so that's probably a clock cycle indeed.
Callgrind doesn't measure CPU time; it measures instruction reads, which is where the "Ir" term comes from. If the values are multiples of 0.13% (especially since you confirmed this with mov), it means they are counting a single instruction read. There are also cache simulation options that let it estimate how likely you are to have cache misses.
Note that not all instructions will take the same time to execute, so the percentages do not exactly match the amount of time spent in each section. However, it still gives you a good idea of where your program is doing the most work, and likely spending the most time.

Confusing gprof output

I ran gprof on a C++ program that took 16.637s, according to time(), and I got this for the first line of output:
  %    cumulative    self                self     total
 time    seconds    seconds     calls   s/call   s/call   name
31.07       0.32       0.32   5498021     0.00     0.00   [whatever]
Why does it list 31.07% of the time if it only took 0.32 seconds? Is this a per-call time? (Wouldn't that be "self s/call"?)
This is my first time using gprof, so please be kind :)
Edit: by scrolling down, it appears that gprof only thinks my program takes 1.03 seconds. Why might it be getting it so wrong?
The bottleneck turned out to be in file I/O (see "Is std::ifstream significantly slower than FILE?"). I switched to reading the entire file into a buffer and it sped up enormously.
The problem here was that gprof doesn't appear to generate accurate profiling when waiting for file I/O (see http://www.regatta.cs.msu.su/doc/usr/share/man/info/ru_RU/a_doc_lib/cmds/aixcmds2/gprof.htm). In fact, seekg and tellg were not even on the profiling list, and they were the bottleneck!
Self seconds is the time spent in [whatever].
Cumulative seconds is a running total: the self time of [whatever] plus the self time of every function listed above it in the flat profile.
Neither of those includes time spent in functions called from [whatever]. That's why you're not seeing more time listed.
If your [whatever] function is calling lots of printf's, for example, your gprof output is telling you that printf is eating the majority of that time.
This seems to be a pretty good overview of how to read gprof output. The 31.07% you are looking at is the portion of the total run time that gprof thinks is being spent in just that function (not including the functions it calls). Odds are the reason the percentage is so large while the time is small is that gprof thinks the program doesn't take as long as you think it does. This is easily checked by scrolling down the first section of the gprof output: cumulative seconds will keep getting larger and larger until it tops out at the total running time of the program (from gprof's perspective). I think you will find this is about one second rather than the 16 you are expecting.
As for why there is such a large discrepancy there, I can't say. Perhaps gprof isn't seeing all of the code. Or did you use time on the instrumented code while it was profiling? I wouldn't expect that to work correctly...
Have you tried some of the other tools mentioned in this question? It would be interesting to see how they compare.
You're experiencing a problem common to gprof and other profilers based on the same concepts - 1) sample the program counter to get some kind of histogram, 2) instrument the functions to measure times, counts, and get a call graph.
For actually locating performance problems, they are missing the point.
It's not about measuring routines, it's about finding guilty code.
Suppose you have a sampler that stroboscopically X-rays the program at random wall-clock times. In each sample, the program may be in the middle of I/O, it may be in code that you compiled, it may be in some library routine like malloc.
But no matter where it is, the responsibility for it spending that slice of time is jointly shared by every line of code on the call stack, because if any one of those calls had not been made, it would not be in the process of carrying out the work requested by that call.
So look at every line of code that shows up on multiple samples of the call stack (the more samples it is on, the better). That's where the money is. Don't just look where the program counter is. There are "deep pockets" higher up the stack.
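A rough sketch of that sampling idea, written in Python for illustration (the same principle applies to the C++ program, e.g. by interrupting it in a debugger a few times and reading the call stacks); every name below is illustrative. It pauses at random wall-clock moments, records the whole call stack of the main thread, and counts how often each line appears anywhere on the stack.

import collections
import random
import sys
import threading
import time
import traceback

samples = collections.Counter()
stop = threading.Event()

def sampler(main_thread_id, mean_interval=0.01):
    # Sample the main thread's call stack at random-ish wall-clock times.
    while not stop.is_set():
        time.sleep(random.uniform(0, 2 * mean_interval))
        frame = sys._current_frames().get(main_thread_id)
        if frame is not None:
            # Credit every line on the stack, not just the innermost one.
            for filename, lineno, func, _ in traceback.extract_stack(frame):
                samples[(filename, lineno, func)] += 1

def workload():
    # Stand-in for the real program being investigated.
    total = 0
    for i in range(3 * 10 ** 6):
        total += i * i
    return total

main_id = threading.current_thread().ident
t = threading.Thread(target=sampler, args=(main_id,))
t.start()
workload()
stop.set()
t.join()

# Lines that show up in many samples are where the time is going.
for (filename, lineno, func), hits in samples.most_common(5):
    print("%4d samples  %s:%d (%s)" % (hits, filename, lineno, func))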
Yes, those "seconds" values aren't per call. The percentage of time is for the entire run of the program. In effect, your program spent 31% of its time in that function (due to the number of calls plus the time taken per call).
You might want to read up on how to analyze gprof's flat profiles.
Correction: Sorry, those first two seconds values are cumulative as pointed out by the OP.
I think that it's odd that you're seeing 0 for "self" and "total s/call".
Quoting the section on gprof accuracy: "The actual amount of error is usually more than one sampling period. In fact, if a value is n times the sampling period, the expected error in it is the square-root of n sampling periods. If the sampling period is 0.01 seconds and foo's run-time is 1 second, the expected error in foo's run-time is 0.1 seconds. It is likely to vary this much on the average from one profiling run to the next. (Sometimes it will vary more.)"
Also, possibly related, it might be worth noting that gprof doesn't profile multithreaded programs. You're better off using something like Sysprof or OProfile in such cases.