Score-P callpath depth limitation of 30 exceeded - profiling

I am profiling a code with Scalasca 2.0 that uses some recoursions.
When I run the analyzer with scalasca -analyze myexec , it does not rise any error to the end, where it says:
Score-P callpath depth limitation of 30 exceeded.
Reached callpath depth was 34
At this point, the scalasca results are corrupted and I cannot run cube over the produced output files.
I know for sure that the number of self-calls, of the recoursions won't be greater than 34.
I have read that there is a variable taking into account the number of "measured call-paths" (see. https://www.dkrz.de/Nutzerportal-en/doku/blizzard/program-analysis/profiling). So, I also tried to run scalasca with export ESD_FRAMES=40 but scalasca still says the limit is 30.
So, Is there a way to shift this scalasca limit to an higher value?

I write my answer 2 months after you posted the question so chances are you have already found a solution.
In score-p 1.4+ it can be fixed with:
export SCOREP_PROFILING_MAX_CALLPATH_DEPTH=128

Related

Nextjs export timeout configuration

I am building a website with NextJS that takes some time to build. It has to create a big dictionary, so when I run next dev it takes around 2 minutes to build.
The issue is, when I run next export to get a static version of the website there is a timeout problem, because the build takes (as I said before), 2 minutes, whihc exceeds the 60 seconds limit pre-configured in next.
In the NEXT documentation: https://nextjs.org/docs/messages/static-page-generation-timeout it explains that you can increase the timeout limit, whose default is 60 seconds: "Increase the timeout by changing the staticPageGenerationTimeout configuration option (default 60 in seconds)."
However it does not specify WHERE you can set that configuration option. In next.config.json? in package.json?
I could not find this information anywhere, and my blind tries of putting this parameter in some of the files mentioned before did not work out at all. So, Does anybody know how to set the timeout of next export? Thank you in advance.
They were a bit more clear in the basic-features/data-fetching part of the docs that it should be placed in the next.config.js
I added this to mine and it worked (got rid of the Error: Collecting page data for /path/[pk] is still timing out after 2 attempts. See more info here https://nextjs.org/docs/messages/page-data-collection-timeout build error):
// next.config.js
module.exports = {
// time in seconds of no pages generating during static
// generation before timing out
staticPageGenerationTimeout: 1000,
}

Do kernel NFS module have concurrent limit?

Background: I was testing a nfs-server with fio. And I find that no matter how much "iodepth" is set to fio. The nfs-server can only have "64 Inflight". So I just suspect that somewhere around "nfs protocol" limits the max concurrent(max io in flight).
fio command is
fio -numjobs=1 -iodepth=128 -direct=1 -ioengine=libaio -sync=1 -rw=write -bs=4k -size=500M -time_based -runtime=90 -name=Fiow -directory=/75
My nfs-server is based on ganesha, and got conclusion "64 Inflight" by using ganesha_stats.py.
So I have two options for now:
Study the calling-graph and read code to find the problem
I download linux kernel code, but struggle to . Which function/source file should I begin , maybe vfs.c:nfsd_write?
Tried to use 'perf' to trace calling-graph to speed up my code reading tour for linux kernel, but failed. Because 'perf report' shows shared library symbol without function name.
Learn the nfs protocol/mount cmd to seek the limit.
Can somebody can help me with this? :)
Assuming you're using NFSv4.1 (RFC 5661):
In NFSv4.1, the number of outstanding requests is bounded by the size of the slot table [...].
And in Linux:
#define NFS4_DEF_SLOT_TABLE_SIZE (64U)
It is the default for this module param:
module_param(max_session_slots, ushort, 0644);
MODULE_PARM_DESC(max_session_slots, "Maximum number of outstanding NFSv4.1 "
"requests the client will negotiate");
IIUC the total maximum for this is:
#define NFS4_MAX_SLOT_TABLE (1024U)

Error: VECTORSZ is too small

I am new to working with Promela and in particular SPIN. I have a model which I am trying verify and can't understand SPIN's output to resolve the problem.
Here is what I did:
spin -a untitled.pml
gcc -o pan pan.c
./pan
The output was as follows:
pan:1: VECTORSZ is too small, edit pan.h (at depth 0)
pan: wrote untitled.pml.trail
(Spin Version 6.4.5 -- 1 January 2016)
Warning: Search not completed
+ Partial Order Reduction
Full statespace search for:
never claim - (none specified)
assertion violations +
acceptance cycles - (not selected)
invalid end states +
State-vector 8172 byte, depth reached 0, errors: 1
0 states, stored
0 states, matched
0 transitions (= stored+matched)
0 atomic steps
hash conflicts: 0 (resolved)
I then ran SPIN again to try to determine the cause of the problem by examining the trail file. I used this command:
spin -t -v -p untitled.pml
This was the result:
using statement merging
spin: trail ends after -4 steps
#processes: 1
( global variable dump omitted )
-4: proc 0 (:init::1) untitled.pml:173 (state 1)
1 process created
According to this output (as I understand it), the verification is failing during the "init" procedure. The relevant code from within untitled.pml is this:
init {
int count = 0;
int ordinal = N;
do // This is line 173
:: (count < 2 * N + 1) ->
At this point I have no idea what is causing the problem since to me, the "do" statement should execute just fine.
Can anyone please help me in understanding SPINs output so I can remove this error during the verification process? The model does produce the correct output for reference.
You can simply ignore the trail file in this case, it is not relevant at all.
The error message
pan:1: VECTORSZ is too small, edit pan.h (at depth 0)
tells you that the size of directive VECTORSZ is too small to successfully verify your model.
By default, VECTORSZ has size 1024.
To fix this issue, try compiling your verifier with a larger VECTORSZ size:
spin -a untitled.pml
gcc -DVECTORSZ=2048 -o run pan.c
./run
If 2048 doesn't work too, try some more (increasingly larger) values.

Interpreting PGI_ACC_TIME output

I have some OpenACC-accelerated C++ code that I've compiled using the PGI compiler. Things seem to be working, so now I want to play efficiency whack-a-mole with profiling information.
I generate some timing info by setting:
export PGI_ACC_TIME=1
And then running the program.
The following output results:
-bash-4.2$ ./a.out
libcupti.so not found
Accelerator Kernel Timing data
PGI_ACC_SYNCHRONOUS was set, disabling async() clauses
/home/myuser/myprogram.cpp
_MyProgram NVIDIA devicenum=1
time(us): 97,667
75: data region reached 2 times
75: data copyin transfers: 3
device time(us): total=101 max=82 min=9 avg=33
76: compute region reached 1000 times
76: kernel launched 1000 times
grid: [1938] block: [128]
elapsed time(us): total=680,216 max=1,043 min=654 avg=680
95: compute region reached 1000 times
95: kernel launched 1000 times
grid: [1938] block: [128]
elapsed time(us): total=487,365 max=801 min=476 avg=487
110: data region reached 2000 times
110: data copyin transfers: 1000
device time(us): total=6,783 max=140 min=3 avg=6
125: data copyout transfers: 1000
device time(us): total=7,445 max=190 min=6 avg=7
real 0m3.864s
user 0m3.499s
sys 0m0.348s
It raises some questions:
I see time(us): 97,667 at the top. This seems like a total time, but, at the bottom, I see real 0m3.864s. Why is there such a difference?
If time(us): 97,667 is the total, why is it so much smaller than values lower down, such as elapsed time(us): total=680,216?
This kernel including the line (elapsed time(us): total=680,216 max=1,043 min=654 avg=680) was run 1000 times. Are max, min, and avg values based on per-run values of the kernel?
Since the [grid] and [block] values may vary, are the elapsed total values still a good indicator of hotspots?
For data regions (device time(us): total=6,783) is the measurement transfer time or the entire time spent dealing with the data (preparing to transfer, post-receipt operations)?
The line numbering is weird. For instance, Line 76 in my program is clearly a for loop, Line 95 in is a close-brace, and Line 110 is a variable definition. Should line numbers be interpreted as "the loop most closely following the indicated line number", or in some other way?
The kernel at 76 contains the kernel at 95. Are the times calculated for 76 inclusive of time spent in 95? If so, is there a convenient way to find the time spent in a kernel minus the times of all the subkernels?
(Some of these questions are a bit anal retentive, but I haven't found documentation for this, so I thought I'd be thorough.)
Part of the issue here is that the runtime can't find the CUDA Profiling library (libcupti.so), hence you're only seeing the PGI CPU side profiling not the device profiling. PGI ships libcupti.so library with the compilers (under $PGI/[linux86-64|linuxpower]/2017/cuda/[7.5|8.0]/lib64) but this is an optional install so you may not have it install on the system you're running. CUPTI also ships with the CUDA SDK, so if the system has CUDA install, you can try setting you're LD_LIBRARY_PATH there instead. On my system it's installed in "/opt/cuda-8.0/extras/CUPTI/lib64/".
The missing CUPTI library is why you're seeing the bad time, 97,667, for the file time. Also since you're missing CUPTI, the time you're seeing is being measured from the host. With CUPTI, in addition to the elapsed time, you'd see the device time for each kernel. The difference between the elapsed time and the device time is the launch overhead per kernel.
Are max, min, and avg values based on per-run values of the kernel?
Yes.
4.Since the [grid] and [block] values may vary, are the elapsed total values still a good indicator of hotspots?
I tend to first look at the avg time since there's typically more opportunities to tune these loops. If you are varying the amount of work per kernel iteration (i.e the grid size changes), then it might not be as useful, but a good starting point.
Now if you had a low average but many calls, then the elapsed time may be dominated by kernel launch overhead. In which case, I'd look to see if you can combine loops or push more work into each loop.
5.For data regions (device time(us): total=6,783) is the measurement transfer time or the entire time spent dealing with the data
(preparing to transfer, post-receipt operations)?
Just the data transfer time. For the overhead, you would need to use PGPROF/NVPROF.
6.The line numbering is weird. For instance, Line 76 in my program is clearly a for loop, Line 95 in is a close-brace, and Line 110 is a
variable definition. Should line numbers be interpreted as "the loop
most closely following the indicated line number", or in some other
way?
It's because the code's been optimized so the line number may be a bit off though it should correspond to the line numbers from compiler feedback messages (-Minfo=accel). So "the loop most closely..." option should be correct.

Getting OpenCV Error: Insufficient memory while running OpenCV Sample Program: "stitching_detailed.cpp"

I recently starting working with OpenCV with the intent of stitching large amounts of images together to create massive panoramas. To begin my experimentation, I looked into the sample programs that come with the OpenCV files to get an idea about how to implement the OpenCV libraries. Since I was interested in image stitching, I went straight for the "stitching_detailed.cpp." The code can be found at:
https://code.ros.org/trac/opencv/browser/trunk/opencv/samples/cpp/stitching_detailed.cpp?rev=6856
Now, this program does most of what I need it to do, but I ran into something interesting. I found that for 9 out of 15 of the optional projection warpers, I receive the following error when I try to run the program:
Insufficient memory (Failed to allocate XXXXXXXXXX bytes) in unknown function,
file C:\slave\winInstallerMegaPack\src\opencv\modules\core\src\alloc.cpp,
line 52
where the "X's" mark integer that change between the different types of projection (as though different methods require different amounts of space). The full source code for "alloc.cpp" can be found at the following website:
https://code.ros.org/trac/opencv/browser/trunk/opencv/modules/core/src/alloc.cpp?rev=3060
However, the line of code that emits this error in alloc.cpp is:
static void* OutOfMemoryError(size_t size)
{
--HERE--> CV_Error_(CV_StsNoMem, ("Failed to allocate %lu bytes", (unsigned long)size));
return 0;
}
So, I am simply lost as to the possible reasons that this error may be occurring. I realize that this error would normally occur if the system is out of memory, but I when running this program with my test images I am never using more that ~3.5GB of RAM, according to my Task Manager.
Also, since the program was written as an sample of the OpenCV stitching capabilities BY OpenCV developers I find it hard to believe that there is a drastic memory error present within the source code.
Finally, the program works fine if I use some of the warping methods:
- spherical
- fisheye
- transverseMercator
- compressedPlanePortraitA2B1
- paniniPortraitA2B1
- paniniPortraitA1.5B1)
but as ask the program to use any of the others (through the command line tag
--warp [PROJECTION_NAME]):
- plane
- cylindrical
- stereographic
- compressedPlaneA2B1
- mercator
- compressedPlaneA1.5B1
- compressedPlanePortraitA1.5B1
- paniniA2B1
- paniniA1.5B1
I get the error mentioned above. I get pretty good results from the transverseMercator project warper, but I would like to test the stereographic in particular. Can anyone help me figure this out?
The pictures that I am trying to process are 1360 x 1024 in resolution and my computer has the following stats:
Model: HP Z800 Workstation
Operating System: Windows 7 enterprise 64-bit OPS
Processor: Intel Xeon 2.40GHz (12 cores)
Memory: 14GB RAM
Hard Drive: 1TB Hitachi
Video Card: ATI FirePro V4800
Any help would be greatly appreciated, thanks!
When I run OpenCV's APP traincascade, i get just the same error as you:
Insufficient memory (Failed to allocate XXXXXXXXXX bytes) in unknown function,
file C:\slave\winInstallerMegaPack\src\opencv\modules\core\src\alloc.cpp,
line 52
at the time, only about 70% pecent of my RAM(6G) was occupied. And when runnig trainscascade step by step, I found that the error would be thrown.when it use about more than 1.5G RAM space.
then, I found the are two arguments which can control how many memory should be used:
-precalcValBufSize
-precalcIdxBufSize
so i tried to set these two to 128, it run. I hope my experience can help you.
I thought this problem is nothing about memory leak, it is just relate to how many memory the OS limits a application occupy. I expect someone can check my guess.
I've recently had a similar issue with OpenCV image stitching. I've used create method to create stitcher instance and provided 5 images in vertical order to stitch method, but I've received insufficient memory error.
Panorama was successfully created after setting:
setWaveCorrection(false)
This solution will not be applicable if you need wave correction.
This may be related to the sequence of the stitching, I split a big picture into 3*3, and firstly I stitch them row by row and there is no problem, when I stitch them column by column, there is the problem same as you.