Callgrind: Profile a specific part of my code

Callgrind: Profile a specific part of my code - c++

I'm trying to profile (with Callgrind) a specific part of my code by removing noise and computation that I don't care about.
Here is an example of what I want to do:
for (int i=0; i<maxSample; ++i) {
//Prepare data to be processed...
//Method to be profiled with these data
//Post operation on the data
}
My use-case is a regression test, I want to make sure that the method in question is still fast enough (something like less than 10% extra instructions since the last implementation).
This is why I'd like to have the cleaner output form Callgrind.
(I need a for loop in order to have a significant amount of data processed in order to have a good estimation of the behavior of the method I want to profile)
My first try was to change the code to:
for (int i=0; i<maxSample; ++i) {
//Prepare data to be processed...
CALLGRIND_START_INSTRUMENTATION;
//Method to be profiled with these data
CALLGRIND_STOP_INSTRUMENTATION;
//Post operation on the data
}
CALLGRIND_DUMP_STATS;
Adding the Callgrind macros to control the instrumentation. I also added the --instr-atstart=no options to be sure that I profile only the part of the code I want...
Unfortunately with this configuration when I start to launch my executable with callgrind, it never ends... It is not a question of slowness, because a full instrumentation run last less than one minute.
I also tried
for (int i=0; i<maxSample; ++i) {
//Prepare data to be processed...
CALLGRIND_TOGGLE_COLLECT;
//Method to be profiled with these data
CALLGRIND_TOGGLE_COLLECT;
//Post operation on the data
}
CALLGRIND_DUMP_STATS;
(or the --toggle-collect="myMethod" option)
But Callgrind returned me a log without any call (KCachegrind is white as snow :( and says zero instructions...)
Did I use the macros/options correctly? Any idea of what I need to change in order to get the expected result?

I finally managed to solve this issue... This was a config issue:
I kept the code
for (int i=0; i<maxSample; ++i) {
//Prepare data to be processed...
CALLGRIND_TOGGLE_COLLECT;
//Method to be profiled with these data
CALLGRIND_TOGGLE_COLLECT;
//Post operation on the data
}
CALLGRIND_DUMP_STATS;
But ran the callgrind with --collect-atstart=no (and without the --instr-atstart=no!!!) and it worked perfectly, in a reasonable time (~1min).
The issue with START/STOP instrumentation was that callgrind dumps a file (callgrind.out.#number) at each iteration (each STOP) thus it was really really slow... (after 5min I had only 5000 runs for a 300 000 iterations benchmark... unsuitable for a regression test).

The toggle-collect option is very picky in how you specify the method to use as trigger. You actually need to specify its argument list as well, and even the whitespace needs to match! Use the method name exactly as it appears in the callgrind output. For instance, I am using this invokation:
$ valgrind
--tool=callgrind
--collect-atstart=no
"--toggle-collect=ctrl_simulate(float, int)"
./swaag
Please observe:
The double quotes around the option.
The argument list including parentheses.
The whitespace after the comma character.

Related

Parallel Writes to NFS-backed File

UPDATE: I had each node write to a separate file, and when the separate files were concatenated together the result was correct. I also updated the code to attempt a channel flush and file sync after each write of a single record, but there are still issues between nodes 0 and 1, now. If I make Node 0 sleep for a few seconds before it starts its iteration of the coforall loop, the records come out correct. If not, the last few hundred bytes of Node 0's records seem to be reliably overwritten with NULL bytes, up to the start of Node 1's records. The issues between Node 1 and Node 2, and Node 2 and Node 3, seem to not show up anymore.
Additionally, if I suppress either Node 0 or Node 1 from writing, I see the fully-formed records from the un-suppressed node written correctly to the file. In the case that Node 1 is suppressed, I see 9,997 100B records (or 999,700) correct bytes followed by NULL bytes in the file where Node 1's suppressed records would go. In the case that Node 0 is suppressed, I see exactly 999,700 NULL bytes in the file, after which Node 1's records begin.
Original Post:
I'm trying to troubleshoot an issue with parallel writes from different nodes to a shared NFS-backed file on disk. At the moment, I suspect that something is wrong with the way writes to the disk happen on the NFS server.
I'm working on adapting MPI+C code that uses pwrite to write to coordinated chunks of a file. If I try to have the equivalent locales in Chapel write to the file inside of a coforall loop, I end up with the bits of the file around the node boundaries messed up - usually the final few hundred bytes of each node's data are garbled. However, if I have just one locale iterate through the data on all locales and write it, the data comes out correctly. That is, I use the same data structures to calculate the offsets, but only Locale 0 seeks to that offset and performs the writes.
I've verified that the offsets into the file that each locale runs do not overlap, and I'm using a single channel per task, defined from within the on loc do block, so that tasks don't share a single channel.
Are there known issues with writing to a file from different locales? A lot of the documentation makes it seem like this is known to be safe, but an unsubstantiated guess seems to indicate that there are issues with caching of file contents; when examining the incorrect data, the bits that are incorrect seem to be the original data from the file in that location at the beginning of the program.
I've included the relevant routine below, in case you easily spot something I missed. To make this serial, I convert the coforall loc in Locales and on loc do block into a for j in 0..numLocales-1 loop, and replace here.id with j. Please let me know what else would help get to the bottom of this. Thanks!
proc write_share_of_data(data_filename: string, ref record_blocks) throws {
coforall loc in Locales {
on loc do {
var data_file: file = open(data_filename, iomode.cwr);
var data_writer = data_file.writer(kind=ionative, locking=false);
var line: [1..100] uint(8);
const indices = record_blocks[here.id].D;
var local_record_offset = + reduce record_blocks[0..here.id-1].D.size;
writeln("Loc ", here.id, ": record offset is ", local_record_offset);
var local_bytes_offset = terarec.terarec_width_disk * local_record_offset;
data_writer.seek(start=local_bytes_offset);
for i in indices {
var write_rec: terarec_t = record_blocks[here.id].records[i];
line[1..10] = write_rec.key;
line[11..98] = write_rec.value;
line[99] = 13; // \r
line[100] = 10; // \n
data_writer.write(line);
lines_written += 1;
}
data_file.fsync();
data_writer.close();
data_file.close();
}
}
return;
}

Adding an answer here that solved my particular problem, though it doesn't explain the behavior seen. I ended up changing the outer loop from coforall loc in Locales to for loc in Locales. This isn't too big of an issue since it's all writing to one file anyway - I doubt that multiple locales can actually make much headway in all attempting to write concurrently to a single file on an NFS server. As a result, the change still allows nodes to write the data they have locally to NFS, rather than forcing Node 0 to collect and then write the data on behalf of all locales. This amounts to only adding idle time to the write operation commensurate with the time it takes Locale 0 to start the remote task on other nodes when the previous node has finished writing, which for the application at hand is not a concern.

Have you tried specifying start/end in file.writer instead of using seek? Does that change anything? What about specifying the end offset for the channel.seek call? Does it matter if the file is created and has the appropriate size before you start?
Other than that, I wonder if this issue would appear for both NFS and Lustre. If it appears for both it might well be a Chapel bug. It sounds from your description that the C program was using this pattern, which points to it being a bug. But, have you run C code doing this on your setup? If it being a Chapel bug seems most likely after further investigation, we would appreciate a bug report issue with a reproducer.
I know that NFS does not always do what one would like, in terms of data consistency. It's my understanding that it has "close to open" semantics but it's unclear to me what that means in the context of opening a file and writing to a particular region within it, in parallel from different locales.
From Why NFS Sucks by Olaf Kirch:
An NFS client is permitted to cache changes locally and send them to
the server whenever it sees fit. This sort of lazy write-back greatly
helps write performance, but the flip side is that everyone else will
be blissfully unaware of these change before they hit the server. To
make things just a little harder, there is also no requirement for a
client to transmit its cached write in any particular fashion, so
dirty pages can (and often will be) written out in random order.
I read two implications from this paragraph that are relevant to your situation here:
The writes you do on different locales can be observed by the NFS server in an arbitrary order. (However as I understand it, the data should be sent to the server by the time your fsync call returns).
These writes are done at an OS page granularity (usually 4k). (Note that this is more a hypothesis I am making than it is a fact. It should be tested or further investigated).
It would be interesting to check if 2. is a plausible explanation for the behavior you are seeing. For example, you could explore having each locale operate on a multiple of 4096 records (or potentially try writing records of 4096 bytes each) and see if that changes the behavior. If 2 is indeed the explanation, it should be possible to create a C program that demonstrates the behavior as well.

Reusing FFMPEG AVFilterGraph

In my code I apply same filtering on multiple input files. In first version of code I created AVFilterGraph for every input, but I think these actions might be excessive.
However, when I try to reuse the same graph, I face with the error during sending frame to abuffer filter. At the previous iteration over input files, I passed EOF to it for flushing, and the av_buffersrc_add_frame function has a check for this:
BufferSourceContext *s = ctx->priv;
...
if (s->eof)
return AVERROR(EINVAL);
which crashes the execution on the second iteration.
Unfortunately, I couldn't find any functions that can restore buffer filter or something like that.
I would like to know if avfilter implies the possibility to reuse once created filter graph, or there are some fundamental misconceptions in my understanding of ffmpeg logic by passing input after EOF.
Thank you!

Arduino substring doesn't work

I have a static method that searches (and returns) into String msg the value between a TAG
this is the code function:
static String genericCutterMessage(String TAG, String msg){
Serial.print("a-----");
Serial.println(msg);
Serial.print("b-----");
Serial.println(TAG);
if(msg.indexOf(TAG) >= 0){
Serial.print("msg ");
Serial.println(msg);
int startTx = msg.indexOf(TAG)+3;
int endTx = msg.indexOf(TAG,startTx)-2;
Serial.print("startTx ");
Serial.println(startTx);
Serial.print("endTx ");
Serial.println(endTx);
String newMsg = msg.substring(startTx,endTx);
Serial.print("d-----");
Serial.println(newMsg);
Serial.println("END");
Serial.println(newMsg.length());
return newMsg;
} else {
Serial.println("d-----TAG NOT FOUND");
return "";
}
}
and this is output
a-----[HS][TS]5132[/TS][TO]5000[/TO][/HS]
b-----HS
msg [HS][TS]5132[/TS][TO]5000[/TO][/HS]
startTx 4
endTx 30
d-----
END
0
fake -_-'....go on! <-- print out of genericCutterMessage
in that case I want return the string between HS tag, so my expected output is
[TS]5132[/TS][TO]5000[/TO]
but I don't know why I receive a void string.
to understand how substring works I just followed tutorial on official Arduino site
http://www.arduino.cc/en/Tutorial/StringSubstring
I'm not an expert in C++ and Arduino but this looks like a flushing or buffering problem, isn't it?
Any idea?

Your code is correct, this should not happen. Which forces you to consider the unexpected ways that this could possibly fail. There is really only one candidate mishap I can think of, your Arduino is running out of RAM. It has very little, the Uno only has 2 kilobytes for example. It doesn't take a lot of string munching to fill that up.
This is not reported in a smooth way. All I can do is point you to the relevant company page. Quoting:
If you run out of SRAM, your program may fail in unexpected ways; it will appear to upload successfully, but not run, or run strangely. To check if this is happening, you can try commenting out or shortening the strings or other data structures in your sketch (without changing the code). If it then runs successfully, you're probably running out of SRAM. There are a few things you can do to address this problem:
If your sketch talks to a program running on a (desktop/laptop) computer, you can try shifting data or calculations to the computer, reducing the load on the Arduino.
If you have lookup tables or other large arrays, use the smallest data type necessary to store the values you need; for example, an int takes up two bytes, while a byte uses only one (but can store a smaller range of values).
If you don't need to modify the strings or data while your sketch is running, you can store them in flash (program) memory instead of SRAM; to do this, use the PROGMEM keyword.
That's not very helpful in your specific case, you'll have to look at the rest of the program for candidates. Or upgrade your hardware, StackExchange has a dedicated site for Arduino enthusiasts, surely the best place to get advice.

Debugging C++ in an Eclipse-based IDE - is there something like "step over loop/cycle"?

At the moment I'm using an eclipse-like IDE and the corresponding debug perspective, that most of you are probably familiar with. While debugging code I quite often find myself stepping through many lines of code and observing variables and double checking if everything is as it is supposed to be.
But suppose there is something like this:
1. important line, e.g. generating a new object;
2. another important line, e.g. some tricky class method;
3. for (int i = 0; i < some_limit; ++i)
4. some_array[i]++;
5. more important stuff;
Obviously I'm interested in what happens in lines 1,2 and 5 (I know this is a poor example, but please bear with me for a little while longer) but I don't want to step through all hundreds (or even thousands) of iterations of lines 3/4.
So, finally, my question: Is there some way to step directly over the for-cycle? What I do right now is set a new breakpoint at line 5 and let the program run as soon as I hit line 3 and I believe this is not an optimal solution.
edit: The eclipse implementation of what ks1322 proposed is called "Run to line" and is mapped to ctrl-r

Use until command instead of next.
From gdb documentation:
Continue running until a source line past the current line, in the
current stack frame, is reached. This command is used to avoid single
stepping through a loop more than once.
If you will use until instead of next, gdb will step over loops only once, which almost exactly what you want.

WxTextCtrl unable to load large texts

I've read about the solutuon written here on a post a year ago
wx.TextCtrl.LoadFile()
Now I have a windows application that will generate color frequency statistics that are saved in 3D arrays. Here is a part of my code as you will see on the code below the printing of the statistics is dependent on a slider which specifies the threshold.
void Project1Frm::WxButton2Click(wxCommandEvent& event)
{
char stat[32] ="";
int ***report = pGLCanvas->GetPixel();
float max = pGLCanvas->GetMaxval();
float dist = WxSlider5->GetValue();
WxRichTextCtrl1->Clear();
WxRichTextCtrl1->SetMaxLength(100);
if(dist>0)
{
WxRichTextCtrl1->AppendText(wxT("Statistics\nR\tG\tB\t\n"));
for(int m=0; m<256; m++){
for(int n=0; n<256; n++){
for(int o=0; o<256; o++){
if((report[m][n][o]/max)>=(dist/100.0))
{
sprintf(stat,"%d\t%d\t%d\t%3.6f%%\n",m,n,o,report[m][n][o]/max*100.0);
WxRichTextCtrl1->AppendText(wxT(stat));
}
}
}
}
}
else if(dist==0) WxRichTextCtrl1->LoadFile("histodata.txt");
}
The solution I've tried so far is that when I am to print all the statistics I'll get it from a text file rather than going through the 3D array... I would like to ask if the Python implementation of the segmenting can be ported to C++ or are there better ways to deal with this problem. Thank you.
EDIT:
Another reason why I used a text file instead is that I observed that whenever I do sprintf only [with the line WxRichTextCtrl1->AppendText(wxT(stat)); was commented out] the computer starts to slow down.
-Ric

Disclaimer: My answer is more of an alternative than a solution.
I don't believe that there's any situation in which a user of this application is going to find it useful to have a scrolled text window containing ~16 million lines of numbers. It would be impossible to scroll to one specific location in the list that the user might need to see easily. This is all assuming that every single number you output here has some significance to the user of course (you are showing them on the screen for a reason). Providing the user with controls to look up specific, fixed (reasonable) ranges of those numbers would be a better solution, not only in regards to a better user experience, but also in helping to resolve your issue here.
On the other hand, if you still insist on one single window containing all 64 million numbers, you seem to have a very rigid data structure here, which means you can (and should) take advantage of using a virtual grid control (wxGrid), which is intended to work smoothly even with incredibly large data sets like this. The user will likely find this control easier to read and find the section of data they are looking for.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js