I've profiled my app in several situations and came to the conclusion that my bottleneck is the template rendering. Example dump:
61323 function calls (59462 primitive calls) in 0.827 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.827 0.827 /home/haki/src/CalcSite/calc/views.py:279(home)
1 0.000 0.000 0.815 0.815 /usr/local/lib/python2.7/dist-packages/django/shortcuts/__init__.py:31(render)
3/1 0.000 0.000 0.814 0.814 /usr/local/lib/python2.7/dist-packages/django/template/loader.py:151(render_to_string)
4/1 0.000 0.000 0.808 0.808 /usr/local/lib/python2.7/dist-packages/django/template/base.py:136(render)
5/1 0.000 0.000 0.808 0.808 /usr/local/lib/python2.7/dist-packages/django/template/base.py:133(_render)
286/1 0.002 0.000 0.808 0.808 /usr/local/lib/python2.7/dist-packages/django/template/base.py:836(render)
714/2 0.000 0.000 0.808 0.404 /usr/local/lib/python2.7/dist-packages/django/template/debug.py:76(render_node)
1 0.000 0.000 0.808 0.808 /usr/local/lib/python2.7/dist-packages/django/template/loader_tags.py:100(render)
6 0.000 0.000 0.803 0.134 /usr/local/lib/python2.7/dist-packages/django/template/loader_tags.py:48(render)
According to the docs, enabling cached templates can have a significant effect on performance, so I tried adding this setting:
TEMPLATE_LOADERS = (
    ('django.template.loaders.cached.Loader', (
        'django.template.loaders.filesystem.Loader',
        'django.template.loaders.app_directories.Loader',
    )),
)
All my templates are in app/templates. I'm not using many template fragments or includes, and all my app tags (~4) are thread safe. Looking at the DB trace for this session, I got 6 queries returning in 9 ms, so this is not the holdup.
I don't see any difference in the performance reports. Am I missing something here? Am I testing it wrong?
The cached template loader still needs to render the templates. The savings come from not having to read and compile the template from the filesystem each time, which saves I/O.
If you want to cache the rendered output, look at the {% cache %} template tag. But be aware that you need to include the correct keys.
From Django 1.8 onward, the loaders setting goes in the OPTIONS block of the TEMPLATES setting.
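As a rough illustration of the Django 1.8+ layout, here is a minimal settings sketch that wraps the same two loaders from the question in the cached loader. The empty DIRS list is a placeholder, not something taken from the original project.

```python
# settings.py fragment (sketch): cached loader configured through TEMPLATES.
TEMPLATES = [
    {
        "BACKEND": "django.template.backends.django.DjangoTemplates",
        "DIRS": [],  # placeholder; list project-level template directories here
        # APP_DIRS must be left unset (or False) when "loaders" is given explicitly.
        "OPTIONS": {
            "loaders": [
                ("django.template.loaders.cached.Loader", [
                    "django.template.loaders.filesystem.Loader",
                    "django.template.loaders.app_directories.Loader",
                ]),
            ],
        },
    },
]
```

As with the older setting, this only avoids re-reading and re-compiling templates; rendering still runs on every request.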
Q1: Which uses less CPU: std::future::wait() or checking a flag in a while loop?
std::atomic_bool isRunning{false};
void foo() {
    isRunning.store(true);
    doSomethingTimeConsuming();
    isRunning.store(false);
}
std::future<void> f = std::async(std::launch::async, foo);
Use std::future::wait():
if (f.valid())
    f.wait();
Check the flag in a while loop:
if (f.valid()) {
    using namespace std::chrono_literals;  // needed for the 1ms literal
    while (isRunning.load())
        std::this_thread::sleep_for(1ms);
}
Q2: Does the same conclusion apply to std::thread::join() or std::condition_variable::wait()?
Thanks in advance.
std::this_thread::sleep_for keeps waking up the thread unnecessarily at the wrong times. The average latency between the result being ready and the waiter thread noticing it is half the sleep_for timeout.
std::future::wait is more efficient because it blocks in the kernel till the result is ready, without doing multiple syscalls unnecessarily, unlike std::this_thread::sleep_for.
If you run the two versions with
void doSomethingTimeConsuming() {
    std::this_thread::sleep_for(1s);
}
under perf stat, the results for std::future::wait are:
1.803578 task-clock (msec) # 0.002 CPUs utilized
2 context-switches # 0.001 M/sec
0 cpu-migrations # 0.000 K/sec
116 page-faults # 0.064 M/sec
6,356,215 cycles # 3.524 GHz
4,511,076 instructions # 0.71 insn per cycle
835,604 branches # 463.304 M/sec
22,313 branch-misses # 2.67% of all branches
Whereas for std::this_thread::sleep_for(1ms):
11.715249 task-clock (msec) # 0.012 CPUs utilized
901 context-switches # 0.077 M/sec
6 cpu-migrations # 0.512 K/sec
118 page-faults # 0.010 M/sec
40,177,222 cycles # 3.429 GHz
25,401,055 instructions # 0.63 insn per cycle
2,286,806 branches # 195.199 M/sec
156,400 branch-misses # 6.84% of all branches
I.e. in this particular test, sleep_for burns roughly 6 times as many CPU cycles.
Note that there is a race condition between isRunning.load() and isRunning.store(true). A fix is to initialize isRunning{true};.
To identify the step that is using most of the computation time, I ran cProfile and got the following result:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.014 0.014 216.025 216.025 func_poolasync.py:2(<module>)
11241 196.589 0.017 196.589 0.017 {method 'acquire' of 'thread.lock' objects}
982 0.010 0.000 196.532 0.200 threading.py:309(wait)
1000 0.002 0.000 196.498 0.196 pool.py:565(get)
1000 0.005 0.000 196.496 0.196 pool.py:557(wait)
515856/3987 0.350 0.000 13.434 0.003 artist.py:230(stale)
The step on which most of the time is spent clearly seems to be method 'acquire' of 'thread.lock' objects. I have not used threading; rather I've used pool.apply_async with several processors, so I'm confused why thread.lock is the issue?
I am hoping to have some light shed on why this is the bottleneck? And how to bring down this time?
Here's how the code looks:
import pickle
import time
from multiprocessing import Pool
path='/usr/home/work'
filename='filename'
with open(path+filename+'/'+'result.pickle', 'rb') as f:
    pdata = pickle.load(f)
if __name__ == '__main__':
    pool = Pool()
    results=[]
    data=list(range(1000))
    print('START')
    start_time = int(round(time.time()))
    # func is the worker function, defined elsewhere in the module
    result_objects = [pool.apply_async(func, args=(nomb,pdata[0],pdata[1],pdata[2])) for nomb in data]
    results = [r.get() for r in result_objects]
    pool.close()
    pool.join()
    print('END', int(round(time.time()))-start_time)
Revision:
By switching from pool.apply_async to pool.map I am able to reduce the execution time by ~3 times.
Output:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.113 0.113 70.824 70.824 func.py:2(<module>)
4329 28.048 0.006 28.048 0.006 {method 'acquire' of 'thread.lock' objects}
4 0.000 0.000 28.045 7.011 threading.py:309(wait)
1 0.000 0.000 28.044 28.044 pool.py:248(map)
1 0.000 0.000 28.044 28.044 pool.py:565(get)
1 0.000 0.000 28.044 28.044 pool.py:557(wait)
Modification to code:
from functools import partial
if __name__ == '__main__':
    pool = Pool()
    data=list(range(1000))
    print('START')
    start_time = int(round(time.time()))
    funct = partial(func,pdata[0],pdata[1],pdata[2])
    results = pool.map(funct,data)
    print('END', int(round(time.time()))-start_time)
However, some of the iterations now produce nonsensical results. I'm not sure why this is happening. In any case, 'method 'acquire' of 'thread.lock' objects' is still the rate-determining step.
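One thing that may be worth checking for the nonsensical results (an observation about functools.partial, not a confirmed diagnosis): the apply_async version passes nomb as the first argument, while a positional partial fills the leading parameters, so pool.map delivers nomb last. The sketch below uses a hypothetical stand-in for func with made-up parameter names a, b and c, and placeholder data instead of the unpickled pdata.

```python
from functools import partial


def func(nomb, a, b, c):
    # Hypothetical stand-in for the real worker: returns its arguments as received.
    return (nomb, a, b, c)


pdata = ("A", "B", "C")  # placeholder for the unpickled data

# Positional partial: the bound values take the leading parameters,
# so the mapped item ends up in the last position.
shifted = partial(func, pdata[0], pdata[1], pdata[2])
print(shifted(7))  # ('A', 'B', 'C', 7)

# Keyword partial: the first parameter stays free for the mapped item.
aligned = partial(func, a=pdata[0], b=pdata[1], c=pdata[2])
print(aligned(7))  # (7, 'A', 'B', 'C')
```

If the real func expects nomb first, something like pool.map(aligned, data) would keep the argument order of the original apply_async call.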
Testing something else I stumbled across something that I haven't managed to figure out yet.
Let's look at this snippet:
#include <iostream>
#include <chrono>
int main () {
    int i = 0;
    using namespace std::chrono_literals;
    auto const end = std::chrono::system_clock::now() + 5s;
    while (std::chrono::system_clock::now() < end) {
        ++i;
    }
    std::cout << i;
}
I've noticed that the counts heavily depend on the machine I execute it on.
I've compiled with gcc 7.3 and 8.2, and clang 6.0, with -std=c++17 -O3.
On i7-4790 (4.17.14-arch1-1-ARCH kernel): ~3e8
but on a Xeon E5-2630 v4 (3.10.0-514.el7.x86_64): ~8e6
Now this is a difference that I would like to understand so I've checked with perf stat -d
on the i7:
4999.419546 task-clock:u (msec) # 0.999 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
120 page-faults:u # 0.024 K/sec
19,605,598,394 cycles:u # 3.922 GHz (49.94%)
33,601,884,120 instructions:u # 1.71 insn per cycle (62.48%)
7,397,994,820 branches:u # 1479.771 M/sec (62.53%)
34,788 branch-misses:u # 0.00% of all branches (62.58%)
10,809,601,166 L1-dcache-loads:u # 2162.171 M/sec (62.41%)
13,632 L1-dcache-load-misses:u # 0.00% of all L1-dcache hits (24.95%)
3,944 LLC-loads:u # 0.789 K/sec (24.95%)
1,034 LLC-load-misses:u # 26.22% of all LL-cache hits (37.42%)
5.003180401 seconds time elapsed
4.969048000 seconds user
0.016557000 seconds sys
Xeon:
5001.000000 task-clock (msec) # 0.999 CPUs utilized
42 context-switches # 0.008 K/sec
2 cpu-migrations # 0.000 K/sec
412 page-faults # 0.082 K/sec
15,100,238,798 cycles # 3.019 GHz (50.01%)
794,184,899 instructions # 0.05 insn per cycle (62.51%)
188,083,219 branches # 37.609 M/sec (62.49%)
85,924 branch-misses # 0.05% of all branches (62.51%)
269,848,346 L1-dcache-loads # 53.959 M/sec (62.49%)
246,532 L1-dcache-load-misses # 0.09% of all L1-dcache hits (62.51%)
13,327 LLC-loads # 0.003 M/sec (49.99%)
7,417 LLC-load-misses # 55.65% of all LL-cache hits (50.02%)
5.006139971 seconds time elapsed
What stands out is the low number of instructions per cycle on the Xeon, as well as the nonzero context switches, which I don't understand. However, I wasn't able to use these diagnostics to come up with an explanation.
To add a bit more weirdness to the problem, while debugging I also compiled statically on one machine and executed on the other.
On the Xeon, the statically compiled executable gives a ~10% lower count, with no difference between compiling on the Xeon or the i7.
Doing the same thing on the i7, the counter actually drops from 3e8 to ~2e7.
So in the end I'm left with two questions:
Why do I see such a significant difference between the two machines?
Why does a statically linked executable perform worse, when I would expect the opposite?
Edit: after updating the kernel on the CentOS 7 machine to 4.18, we actually see an additional drop from ~8e6 to 5e6.
perf interestingly shows different numbers though:
5002.000000 task-clock:u (msec) # 0.999 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
119 page-faults:u # 0.024 K/sec
409,723,790 cycles:u # 0.082 GHz (50.00%)
392,228,592 instructions:u # 0.96 insn per cycle (62.51%)
115,475,503 branches:u # 23.086 M/sec (62.51%)
26,355 branch-misses:u # 0.02% of all branches (62.53%)
115,799,571 L1-dcache-loads:u # 23.151 M/sec (62.51%)
42,327 L1-dcache-load-misses:u # 0.04% of all L1-dcache hits (62.50%)
88 LLC-loads:u # 0.018 K/sec (49.96%)
2 LLC-load-misses:u # 2.27% of all LL-cache hits (49.98%)
5.005940327 seconds time elapsed
0.533000000 seconds user
4.469000000 seconds sys
It's interesting that there are no more context switches and the instructions per cycle went up significantly, but the cycle count and therefore the reported clock are super low!
I've been able to reproduce the respective measurements on the two machines, thanks to @Imran's comment above. (Posting this answer to close the question; if Imran should post one, I'm happy to accept his instead.)
It is indeed related to the available clocksource. The Xeon, unfortunately, had the notsc flag in its kernel parameters, which is why the tsc clocksource was neither available nor selected.
Thus for anyone running into this problem:
1. check your clocksource in /sys/devices/system/clocksource/clocksource0/current_clocksource
2. check available clocksources in /sys/devices/system/clocksource/clocksource0/available_clocksource
3. If you can't find tsc, run dmesg | grep tsc to check your kernel parameters for notsc
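The first two checks above can be scripted. This is a minimal sketch that only reads the two sysfs files named in the list; the helper name read_clocksources is made up for illustration, and the files exist only on Linux.

```python
from pathlib import Path

CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources read from sysfs (illustrative helper)."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


if __name__ == "__main__":
    try:
        current, available = read_clocksources()
    except OSError:
        print("clocksource sysfs files are not readable on this system")
    else:
        print("current:", current)
        print("available:", ", ".join(available))
        if "tsc" not in available:
            print("tsc is not offered; check the kernel parameters for notsc")
```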
I have a C++ library that I am wrapping and exposing via python. For various reasons I need to overload the __call__ of the functions when exposing them via python.
A minimal example is below using time.sleep to mimic functions with various compute times
import sys
import time
class Wrap_Func(object):
    def __init__(self, func, name):
        self.name = name
        self.func = func
    def __call__(self, *args, **kwargs):
        # do stuff
        return self.func(*args, **kwargs)
def wrap_funcs():
    thismodule = sys.modules[__name__]
    for i in range(3):
        fname = 'example{}'.format(i)
        setattr(thismodule, fname, Wrap_Func(lambda: time.sleep(i), fname))
wrap_funcs()
When profiling my code via cProfile I get a list of __call__ routines.
I am unable to ascertain which routines are taking the majority of the compute time.
>>> import cProfile
>>> cProfile.runctx('example0(); example1(); example2()', globals(), locals())
11 function calls in 6.000 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
3 0.000 0.000 6.000 2.000 <ipython-input-48-e8126c5f6ea3>:11(__call__)
3 0.000 0.000 6.000 2.000 <ipython-input-48-e8126c5f6ea3>:20(<lambda>)
1 0.000 0.000 6.000 6.000 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
3 6.000 2.000 6.000 2.000 {time.sleep}
Expected
By manually defining the functions (in practice this needs to be dynamic and wrapped as above) as
def example0():
    time.sleep(0)
def example1():
    time.sleep(1)
def example2():
    time.sleep(2)
I get the expected output of
>>> import cProfile
>>> cProfile.runctx('example0(); example1(); example2()', globals(), locals())
8 function calls in 3.000 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <ipython-input-58-688d247cb941>:1(example0)
1 0.000 0.000 0.999 0.999 <ipython-input-58-688d247cb941>:4(example1)
1 0.000 0.000 2.000 2.000 <ipython-input-58-688d247cb941>:7(example2)
1 0.000 0.000 3.000 3.000 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
3 3.000 1.000 3.000 1.000 {time.sleep}
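A side note on the totals: the wrapped run takes 6.000 seconds while the hand-written functions take 3.000. A likely reason, separate from the profiling question, is that lambda: time.sleep(i) looks up i when it is called, by which time the loop has finished and i is 2 for every wrapper. The sketch below shows the effect with plain values instead of sleeps; the function names are made up for illustration.

```python
def make_callbacks_late():
    # Each lambda looks up i when it is called, so all of them see the final value.
    return [lambda: i for i in range(3)]


def make_callbacks_bound():
    # A default argument is evaluated when the lambda is created, freezing the current i.
    return [lambda i=i: i for i in range(3)]


print([f() for f in make_callbacks_late()])   # [2, 2, 2]
print([f() for f in make_callbacks_bound()])  # [0, 1, 2]
```

Writing the wrapper as lambda i=i: time.sleep(i) would give each example its own delay.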
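The following combines answers from @alex-martelli, to resolve patching the special __call__ method, and @martijn-pieters, to solve the problem of properly renaming a function's code object.
Edit: In Python 3, types.CodeType takes code.co_kwonlyargcount as its second argument; remove this argument for Python 2. From Python 3.8 (PEP 570) the constructor also expects code.co_posonlyargcount, so the positional call needs adjusting there; code.replace(co_name=new_name) is the simpler route on those versions.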
import types
def rename_code_object(func, new_name):
    code = func.__code__
    return types.FunctionType(
        types.CodeType(
            code.co_argcount,
            code.co_kwonlyargcount,  # Python 3 only; remove this line for Python 2
            code.co_nlocals,
            code.co_stacksize, code.co_flags,
            code.co_code, code.co_consts,
            code.co_names, code.co_varnames,
            code.co_filename, new_name,
            code.co_firstlineno, code.co_lnotab,
            code.co_freevars, code.co_cellvars),
        func.__globals__, new_name, func.__defaults__, func.__closure__)
class ProperlyWrapFunc(Wrap_Func):
    def __init__(self, func, name):
        super(ProperlyWrapFunc, self).__init__(func, name)
        # Copy the inherited __call__ under the wrapper's own name.
        renamed = rename_code_object(super(ProperlyWrapFunc, self).__call__, name)
        # Give this instance a per-name subclass so its __call__ reports that name.
        self.__class__ = type(name, (ProperlyWrapFunc,), {'__call__': renamed})
After calling a modified wrap_funcs() that uses the new class, we get the expected output. This should also be compatible with exception tracebacks.
>>> import cProfile
>>> cProfile.runctx('example0(); example1(); example2()', globals(), locals())
11 function calls in 6.000 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 2.000 2.000 <ipython-input-1-96920f80be1c>:9(example0)
1 0.000 0.000 2.000 2.000 <ipython-input-1-96920f80be1c>:9(example1)
1 0.000 0.000 2.000 2.000 <ipython-input-1-96920f80be1c>:9(example2)
3 0.000 0.000 6.000 2.000 <ipython-input-9-ed938f395cb4>:30(<lambda>)
1 0.000 0.000 6.000 6.000 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
3 6.000 2.000 6.000 2.000 {time.sleep}
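On Python 3.8 and later, the renaming step can be sketched without spelling out the types.CodeType arguments, because code objects provide a replace() method. The helper name rename_function is made up for illustration. The sketch only changes co_name; newer interpreters also carry a co_qualname field, which is left alone here.

```python
import types


def rename_function(func, new_name):
    """Return a copy of func whose code object reports new_name (Python 3.8+)."""
    code = func.__code__.replace(co_name=new_name)
    return types.FunctionType(
        code, func.__globals__, new_name, func.__defaults__, func.__closure__)


def original(x, y=2):
    return x + y


renamed = rename_function(original, "example0")
print(renamed.__name__, renamed.__code__.co_name, renamed(1))
```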
I was hoping someone could give me advice on how to easily change the values of various variables within a Fortran input file from a C++ application.
I have a model written in Fortran, and I am writing a C++ application that executes the model in a loop, changing the values of the model parameters after each execution.
Any advice would be appreciated.
Thanks!
.Pnd file Subbasin: 1 7/26/2012 12:00:00 AM ArcSWAT 2009.93.7
Pond inputs:
0.000 | PND_FR : Fraction of subbasin area that drains into ponds. The value for PND_FR should be between 0.0 and 1.0. If PND_FR = 1.0, the pond is at the outlet of the subbasin on the main channel
0.000 | PND_PSA: Surface area of ponds when filled to principal spillway [ha]
0.000 | PND_PVOL: Volume of water stored in ponds when filled to the principal spillway [104 m3]
0.000 | PND_ESA: Surface area of ponds when filled to emergency spillway [ha]
0.000 | PND_EVOL: Volume of water stored in ponds when filled to the emergency spillway [104 m3]
0.000 | PND_VOL: Initial volume of water in ponds [104 m3]
0.000 | PND_SED: Initial sediment concentration in pond water [mg/l]
0.000 | PND_NSED: Normal sediment concentration in pond water [mg/l]
0.000 | PND_K: Hydraulic conductivity through bottom of ponds [mm/hr].
0 | IFLOD1: Beginning month of non-flood season
0 | IFLOD2: Ending month of non-flood season
0.000 | NDTARG: Number of days needed to reach target storage from current pond storage
10.000 | PSETLP1: Phosphorus settling rate in pond for months IPND1 through IPND2 [m/year]
10.000 | PSETLP2: Phosphorus settling rate in pond for months other than IPND1-IPND2 [m/year]
5.500 | NSETLP1: Initial dissolved oxygen concentration in the reach [mg O2/l]
5.500 | NSETLP2: Initial dissolved oxygen concentration in the reach [mg O2/l]
1.000 | CHLAP: Chlorophyll a production coefficient for ponds [ ]
1.000 | SECCIP: Water clarity coefficient for ponds [m]
0.000 | PND_NO3: Initial concentration of NO3-N in pond [mg N/l]
0.000 | PND_SOLP: Initial concentration of soluble P in pond [mg P/L]
0.000 | PND_ORGN: Initial concentration of organic N in pond [mg N/l]
0.000 | PND_ORGP: Initial concentration of organic P in pond [mg P/l]
5.000 | PND_D50: Median particle diameter of sediment [um]
1 | IPND1: Beginning month of mid-year nutrient settling "season"
1 | IPND2: Ending month of mid-year nutrient settling "season"
Wetland inputs:
0.000 | WET_FR : Fraction of subbasin area that drains into wetlands
0.000 | WET_NSA: Surface area of wetlands at normal water level [ha]
0.000 | WET_NVOL: Volume of water stored in wetlands when filled to normal water level [104 m3]
0.000 | WET_MXSA: Surface area of wetlands at maximum water level [ha]
0.000 | WET_MXVOL: Volume of water stored in wetlands when filled to maximum water level [104 m3]
0.000 | WET_VOL: Initial volume of water in wetlands [104 m3]
0.000 | WET_SED: Initial sediment concentration in wetland water [mg/l]
0.000 | WET_NSED: Normal sediment concentration in wetland water [mg/l]
0.000 | WET_K: Hydraulic conductivity of bottom of wetlands [mm/hr]
0.000 | PSETLW1: Phosphorus settling rate in wetland for months IPND1 through IPND2 [m/year]
0.000 | PSETLW2: Phosphorus settling rate in wetlands for months other than IPND1-IPND2 [m/year]
0.000 | NSETLW1: Nitrogen settling rate in wetlands for months IPND1 through IPND2 [m/year]
0.000 | NSETLW2: Nitrogen settling rate in wetlands for months other than IPND1-IPND2 [m/year]
0.000 | CHLAW: Chlorophyll a production coefficient for wetlands [ ]
0.000 | SECCIW: Water clarity coefficient for wetlands [m]
0.000 | WET_NO3: Initial concentration of NO3-N in wetland [mg N/l]
0.000 | WET_SOLP: Initial concentration of soluble P in wetland [mg P/l]
0.000 | WET_ORGN: Initial concentration of organic N in wetland [mg N/l]
0.000 | WET_ORGP: Initial concentration of organic P in wetland [mg P/l]
0.000 | PNDEVCOEFF: Actual pond evaporation is equal to the potential evaporation times the pond evaporation coefficient
0.000 | WETEVCOEFF: Actual wetland evaporation is equal to the potential evaporation times the wetland evaporation coefficient.
Take a look at Fortran's namelist I/O which may suit your purposes. The IBM XL Fortran documentation for this feature is here. Make sure you consult the documentation for your own compiler, each one has some quirky variations on the standard it seems (or perhaps the standard isn't quite tight enough).
It looks like there is a number inside a 20-byte ASCII field, a '|' symbol, and a variable name terminated by a ':'. The lines are variable length. So something like this:
// ./bin/bin/g++ -o hackpond hackpond.cpp
#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>
int main()
{
    std::string variable_you_want = "PND_SOLP";
    std::string line;
    std::fstream pond("test.pnd");  // opened for both reading and writing
    std::getline(pond, line);       // skip the header line
    while (true)
    {
        // Remember where the next line starts before reading it.
        std::streampos loc = pond.tellg();
        if (!std::getline(pond, line))
            break;
        if (line == "Pond inputs:" || line == "Wetland inputs:")
            continue;
        // The variable name sits between the '|' separator and the ':'.
        std::size_t colon = line.find(':', 20);
        if (colon == std::string::npos || colon < 22)
            continue;
        std::string variable = line.substr(22, colon - 22);
        if (variable == variable_you_want)
        {
            double new_value = 666.66;
            // Build a 20-character, right-aligned field holding the new value.
            std::ostringstream thing;
            thing << std::setw(20) << new_value;
            // Overwrite the old field in place at the start of the line.
            pond.seekp(loc);
            pond.write(thing.str().c_str(), thing.str().length());
            pond.flush();
            break;  // only one variable is replaced in this sketch
        }
    }
}
The basic idea is that you 1. note the starting location of the line you want to change (tellg), 2. create a 20 character string containing the new value, 3. go to the saved start position of that line (seekp) and write the new chunk.
I'm not an expert on FORTRAN, but I think I would look into using command line arguments for each parameter that needs to change. This is only possible, however, if you are using FORTRAN 2003. I think that FORTRAN 95 is still one of the most common FORTRAN standards in use, but I don't know that for sure. If your implementation of FORTRAN does support command line arguments, see if this site can help you figure out how to use them: http://fortranwiki.org/fortran/show/Command-line+arguments
Otherwise, you could have the FORTRAN program read in those values from a text file. If you want to get really nasty, you could have your C++ program actually alter the FORTRAN source file and recompile it each time it needs to run. If you do that, then back it up first. Good luck.
As far as I understand your problem, you merely need some text-editing in C++.
I guess the easiest thing would be to have all the parameters in the C++ code and let that code write the complete text file for each run.
If you don't want to do that, I'd recommend using a templating toolbox, maybe something like cpptemplate or teng; there are probably many more templating engines for C++ out there that might suit your use case.
I know you asked for C++, but I have a very similar situation with my code and I use shell scripts.
I use sed 's/var_name/value/' <template >input_file to change occurrences of var_name in the template. My file format is set up to make this easy, but sed is a super flexible tool and I am sure it could do what you are asking. There is a tutorial here.
With that set up I write a script to loop through these sed commands and the application calls. This works like a charm. Besides, learning a little scripting will help you handle all kind of other tasks like sorting through the data generated from all those different runs.
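In the same scripted spirit, here is a minimal Python sketch that rewrites one value in the "value | NAME: description" layout shown in the question. The function name set_parameter, the 16-character field width and the three-decimal format are assumptions for illustration; check them against the format your Fortran model actually reads. Integer fields such as IFLOD1 would need separate handling.

```python
def set_parameter(text, name, new_value, width=16):
    """Return text with the value of the named parameter replaced (sketch)."""
    out = []
    for line in text.splitlines():
        # Split "value | NAME: description" at the first '|'.
        field, sep, rest = line.partition("|")
        variable = rest.split(":", 1)[0].strip()
        if sep and variable == name:
            # Right-align the new value in an assumed fixed-width field.
            line = f"{new_value:>{width}.3f} |{rest}"
        out.append(line)
    return "\n".join(out) + "\n"


sample = (
    "header line\n"
    "Pond inputs:\n"
    "           0.000 | PND_FR : Fraction of subbasin area\n"
    "           0.000 | PND_SOLP: Initial concentration of soluble P\n"
)
updated = set_parameter(sample, "PND_SOLP", 666.66)
print(updated)
```

A driver script could call this once per run, write the result to the input file, and then launch the model.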