Controlling a specific core in a TRACE32 multi-core debugging environment - trace32

I am trying to write a practice script to debug a multi-core system in TRACE32. As per the docs, I started the script as follows:
SYStem.CPU CortexA55
SYStem.CONFIG CoreNumber 3
Core.Number 3
I can single-step a single core by using the buttons in that core's data list window, but I am not sure how to do that using commands. Their docs mention:
Where the target is configured for Symmetric Multi-Processing (SMP),
a single instance of TRACE32 PowerView is used to control all cores.
In this case many PRACTICE commands accept a /CORE option to indicate that the command should be run with reference to a particular core or cores.
But I tried to execute these commands
Go.Backover
Step
Print Register(pc)
but none of them seems to have this /CORE option.

All commands used without the option /CORE act on the currently selected logical core. You can see which core is currently selected by checking the status bar: to the left of the system state (probably showing "stopped"), you'll see a digit indicating the active logical core.
You can change the active logical core with the command CORE.select. Note that after any Step or Go command the active logical core can change, because when the SoC goes into a running state and then enters the halt state again, the active logical core becomes the core which caused the SoC to halt.
So in your case the following should be safe:
CORE.select 0
Step.Over
CORE.select 0
Step
CORE.select 0
ECHO Register(PP)
Instead of CORE.select 0 you can of course also use CORE.select 1 or CORE.select 2.
I think there is no command called Go.Backover, so I am not sure whether you are referring to Step.BackOver, Step.Over, Step.Back, or Go.Back.
By the way:
You have used SYStem.CPU CortexA55. That is not wrong, but probably not optimal: selecting a base core like CortexA55 is intended for self-made SoCs and usually requires a lot of additional SYStem.CONFIG commands to tell the debugger the location of the CoreSight components. If you have not created your own SoC, try to find the exact chip name in the SYStem.CPU window.

Related

How to implement googletest sharding in C++?

I want to parallelise my googletest cases in C++.
I have read the documentation on googletest sharding, but I am unable to implement it in my C++ code.
As I'm new to coding, can anyone explain the documentation in the link below with a code example?
https://github.com/google/googletest/blob/master/googletest/docs/advanced.md
Does googletest sharding work across different machines, or can it be implemented on the same machine using multiple threads?
Sharding isn't done in code; it's done using the environment. Your machine specifies two environment variables: GTEST_TOTAL_SHARDS, which is the total number of machines you are running on, and GTEST_SHARD_INDEX, which is unique to each machine. When googletest starts up, it selects a subset of the tests.
If you want to simulate this, then you need to set these environment variables (which can be done in code).
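For example, here is a minimal sketch of setting the two variables from code (my own illustration, assuming a Windows build with _putenv_s; googletest reads these variables when RUN_ALL_TESTS() runs):

#include <cstdlib>
#include <gtest/gtest.h>

int main(int argc, char **argv)
{
    ::testing::InitGoogleTest(&argc, argv);
    // Pretend this process is shard 0 of 4; in practice the values would come
    // from the machine/CI configuration rather than being hard-coded.
    _putenv_s("GTEST_TOTAL_SHARDS", "4");
    _putenv_s("GTEST_SHARD_INDEX", "0");
    return RUN_ALL_TESTS();   // only the tests belonging to this shard are run
}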
I would probably try something like this (on Windows) in a .bat file:
set GTEST_TOTAL_SHARDS=10
FOR /L %%I in (1,1,10) DO cmd.exe /c "set GTEST_SHARD_INDEX=%%I && start mytest.exe"
And hope that the new cmd instance has its own environment.
Running the following in a command window worked for me (very similar to James Poag's answer, but note change of range from "1,1,10" to "0,1,9", "%%" -> "%" and "set" to "set /A"):
set GTEST_TOTAL_SHARDS=10
FOR /L %I in (0,1,9) DO cmd.exe /c "set /A GTEST_SHARD_INDEX=%I && start mytests.exe"
After further experimentation it is also possible to do this in C++. But it is not straightforward and I did not find a portable way of doing it. I can't post the code as it was done at work.
Essentially, from main, create n new processes (where n is the number of cores available), capture the results from each shard, then merge them and output to the screen.
To get each process running a different shard, the total number of shards and the instance number are given to each child process by the controller.
This is done by retrieving and copying the current environment, and setting in the copy the two environment variables (GTEST_TOTAL_SHARDS and GTEST_SHARD_INDEX) as required. GTEST_TOTAL_SHARDS is always the same, but GTEST_SHARD_INDEX will be the instance number of the child.
Merging the results is tedious but straightforward string manipulation. I successfully managed to get a correct total at the end, adding up the results of all the separate shards.
I was using Windows, so used CreateProcessA to create the new processes, passing in the custom environment.
It turned out that creating new processes takes a significant amount of time, but my program was taking about 3 minutes to run, so there were good benefits to be had from parallel running - the time came down to about 30 seconds on my 12-core PC.
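In case it is useful, here is a rough sketch of the spawning part on Windows (not the actual code from work; the binary name mytests.exe is assumed, and the output capture/merge step is omitted). Rather than building a custom environment block, it sets the two variables in the parent just before each CreateProcessA call, so each child inherits a different GTEST_SHARD_INDEX:

#include <windows.h>
#include <cstdio>
#include <string>
#include <vector>

int main()
{
    const int totalShards = 4;   // e.g. the number of cores available
    SetEnvironmentVariableA("GTEST_TOTAL_SHARDS", "4");

    std::vector<PROCESS_INFORMATION> children;
    for (int shard = 0; shard < totalShards; ++shard)
    {
        // Each child snapshots the parent's environment at creation time.
        SetEnvironmentVariableA("GTEST_SHARD_INDEX", std::to_string(shard).c_str());

        STARTUPINFOA si = {}; si.cb = sizeof(si);
        PROCESS_INFORMATION pi = {};
        char cmd[] = "mytests.exe";
        if (!CreateProcessA(NULL, cmd, NULL, NULL, FALSE, 0,
                            NULL /* inherit the current environment */,
                            NULL, &si, &pi))
        {
            std::fprintf(stderr, "failed to start shard %d\n", shard);
            continue;
        }
        children.push_back(pi);
    }

    for (size_t i = 0; i < children.size(); ++i)   // wait for all shards to finish
    {
        WaitForSingleObject(children[i].hProcess, INFINITE);
        CloseHandle(children[i].hThread);
        CloseHandle(children[i].hProcess);
    }
    return 0;
}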
Note that if this all seems overkill, there is a Python script which does what I have described here (I think - I haven't used it). That might be more straightforward.

Process spawned by Windows service runs 3 to 4 times more slowly than spawned by GUI

I have written a service application in Borland C++. It works fine. In the ServiceStart(TService *Sender,bool &Started) routine, I call mjwinrun to launch a process which picks up and processes macros. This process has no UI and any errors are logged to a file. It continues to run, until the server is restarted, shut down, or the process is terminated using Task Manager. Here is mjwinrun :-
int mjwinrun(AnsiString cmd)
{
    STARTUPINFO mjstupinf;
    PROCESS_INFORMATION mjprcinf;

    memset(&mjstupinf, 0, sizeof(STARTUPINFO));
    mjstupinf.cb = sizeof(STARTUPINFO);

    // Launch cmd in the current directory, letting the child inherit handles.
    if (!CreateProcess(NULL, cmd.c_str(), NULL, NULL, TRUE, 0, NULL,
                       GetCurrentDir().c_str(), &mjstupinf, &mjprcinf))
    {
        LogMessage("Could not launch " + cmd);
        return -1;
    }

    // The handles aren't needed; just return the new process ID.
    CloseHandle(mjprcinf.hThread);
    CloseHandle(mjprcinf.hProcess);
    return mjprcinf.dwProcessId;
}
cmd is the command line for launching the macro queue processor. I used a macro that is CPU/Memory intensive and got it to write its timings to a file. Here is what I found :-
1) If the macro processor is launched from the command line within a logged on session, no matter what Windows core it is running under, the macro is completed in 6 seconds.
2) If the macro processor is launched from a service starting up on Vista core or earlier (using mjwinrun above), the macro is completed in 6 seconds.
3) If the macro processor is launched from a service starting up on Windows 7 core or later (using mjwinrun above), the macro is completed in more than 18 seconds.
I have tried all the different flags for CreateProcess and none of them make a difference. I have tried all different accounts for the service and that makes no difference. I tried setting all of the various priorities for tasks, I/O and Page, but they all make no difference. It's as if the service's spawned processes are somehow throttled, not in I/O terms, but in CPU/memory usage terms. Any ideas what changed in Windows 7 onwards?
I isolated code to reproduce this, and it eventually boiled down to calls to the database engine to look up a field definition (the TTable methods FindField and FieldByName). These took much longer on a table with a lot of fields when run from a service app instead of a GUI app.
I devised my own method to store mappings from field names to field definitions, since I always opened my databases with a central routine. I used an array of strings indexed by the Tag property on each table (common to all BCB objects), where each string is composed of ;fieldname;fieldnumber; pairs, and then did a .Pos of the field name to get the field number. fieldnumber is zero-padded to a width of 4. This only uses a few hundred KB of RAM for the entire app and all of its databases. Once in place, the service app runs at the same speed as the GUI app.
The only thing I can think of that may explain this is that service apps have a fixed heap (I think I read 48 MBytes somewhere by default) for themselves and any process they spawn. With lots of fields, the memory overflowed and had to thrash to VM on the disk. The GUI app had no such limit and was able to do the lookup entirely in real memory. However, I may be completely wrong.
One thing I have learnt is that FieldByName and FindField are expensive TTable functions to call, and I have now supplanted them all with my own mechanism, which seems to work much better and much faster. Here is my lookup routine :-
AnsiString fldsbytag[MXSPRTBLS+100];

TField *fldfromtag(TAdsTable *tbl, AnsiString fld)
{
    // Look up ";FIELDNAME;" in this table's mapping string (built when the table is opened).
    int fi = fldsbytag[tbl->Tag].Pos(";" + fld.UpperCase() + ";"), gi;
    if (fi == 0) return tbl->FindField(fld);             // not found - fall back to the slow path
    // The 4-digit, zero-padded field number follows the name and its trailing ';'.
    gi = StrToIntDef(fldsbytag[tbl->Tag].SubString(fi + fld.Length() + 2, 4), -1);
    if (gi < 0 || gi >= tbl->Fields->Count) return tbl->FindField(fld);
    return tbl->Fields->Fields[gi];
}
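For context, the mapping string for each table would be built roughly like this when the table is opened by the central routine (a sketch of the description above, not the actual code; the name buildfldmap is made up, and it assumes tbl->Tag has already been assigned a unique index):

void buildfldmap(TAdsTable *tbl)
{
    AnsiString map = ";";
    for (int i = 0; i < tbl->Fields->Count; i++)
    {
        // Append ;FIELDNAME;nnnn; pairs, with the field number zero-padded to 4 digits.
        map += tbl->Fields->Fields[i]->FieldName.UpperCase() + ";" +
               AnsiString().sprintf("%04d", i) + ";";
    }
    fldsbytag[tbl->Tag] = map;
}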
It will be very difficult to give an authoritative answer to this question without a lot more details.
However, a factor to consider is the Windows foreground priority boost described here.
You may want to read Russinovich's book chapter on processes/threads, in particular the stuff on scheduling. You can find PDFs of the book online (there are two that together make up the whole book). I believe the latest (or next to latest) edition covers changes in Win 7.

Distributed software debug with gdb

I am currently developing distributed software in C++ on Linux, which is executed on more than 20 nodes simultaneously. One of the most challenging issues I have found is how to debug it.
I heard that it is possible to manage multiple remote sessions from a single gdb session (e.g. on the master node I create the gdb session, and on every other node I launch the program using gdbserver). Is that possible? If so, can you give an example? Do you know any other way to do it?
Thanks
You can try to do it like this:
First, start gdbserver on the remote hosts. It is even possible to start it without a program to debug, if you start it with the --multi flag. When the server is in multi mode, you can control it from your local session; that is, you can make it start the program you want to debug.
Then start multiple inferiors in your gdb session:
gdb> add-inferior -copies <number of servers>
Switch each of them to a remote target and connect it to its remote server:
gdb> inferior 1
gdb> target extended-remote host:port // use extended to switch gdbserver to multi mode
// start a program if gdbserver was started in multi mode
gdb> inferior 2
...
Now you have them all attached to one gdb session. The problem is that, AFAIK, it is not much better than starting multiple gdbs from different console tabs. On the other hand, you can write scripts or automated tests this way. See the gdb tutorial: server and inferiors.
I don't believe there is one, simple, answer to debugging "many remote applications". Yes, you can attach to a process on another machine, and step through it in GDB. But it's quite awkward to debug a large number of interdependent processes, especially when the problem is complicated.
I believe a good set of logging capabilities in the code, supplemented with additional logs for specific debugging as needed, is more likely to give you a good/fast result.
Another option might be to run the processes on one machine, rather than on multiple machines. Perhaps even use threads within one process, to simulate the behaviour of multiple machines, simplifying the debugging process. Of course, this doesn't prevent bugs that appear ONLY when you run 20 processes on 20 different machines. But the basic idea is to reduce the number of those bugs to a minimum, and debug most things in a "simpler environment".
Aggressive use of defensive programming paradigms, such as liberal use of assert, is clearly a good idea (perhaps with a macro to turn it off for production runs). But make sure that you don't just leave error paths completely unchecked - it is MUCH harder to detect that the reason something crashes is that a memory allocation failed than to track down where that NULL pointer came from, some 20 function calls away from the failed allocation.
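To illustrate the assert-with-an-off-switch idea, here is a minimal sketch (the macro name and build flag are my own, not from the answer):

#include <cstdio>
#include <cstdlib>

// Build with -DDISABLE_CHECKS to compile the checks out for production runs.
#ifdef DISABLE_CHECKS
#define CHECK(cond) ((void)0)
#else
#define CHECK(cond) \
    do { if (!(cond)) { \
        std::fprintf(stderr, "CHECK failed: %s (%s:%d)\n", #cond, __FILE__, __LINE__); \
        std::abort(); } } while (0)
#endif

int *allocate_buffer(std::size_t n)
{
    int *p = static_cast<int *>(std::malloc(n * sizeof(int)));
    CHECK(p != NULL);   // fail loudly here rather than crashing 20 calls later on a NULL pointer
    return p;
}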

Extract execution log from gdb record in a VirtualBox VM

I am attempting to use gdb's record feature to generate a list of the instructions executed for the tutorial example.
I can use gdb record to step forward and back successfully and save the execution log to a file using "record save".
I think what I want to do is "record instruction-history", which according to the docs:
Disassembles instructions from the recorded execution log
But when I attempt this I get the error:
You can't do that when your target is 'record-full'
Attempting to set the record target to btrace returns the error:
Target does not support branch tracing.
I am running gdb 7.6 in a VirtualBox VM. Do I need to be running natively, or is there some other magic I'm missing?
Your problem comes from a limitation of VirtualBox itself when performing this operation. As you can see in this link, more specifically in these lines:
if (packet->support != PACKET_ENABLE)
error (_("Target does not support branch tracing."));
This problem is explained here:
But VirtualBox does NOT emulate certain debugging features of modern x86 CPUs like branch trace store or performance counters.
My best guess is to look for other VirtualBox features that would allow you to perform such operations, or to switch to a different virtualization environment.
I'll keep searching for information.

Running multiprocess applications from MATLAB

I've written a multiprocess application in VC++ and tried to execute it with command-line arguments using the system command from MATLAB. It runs, but only on one core --- any suggestions?
Update: In fact, it doesn't even see the second core. I used OpenMP and called omp_get_max_threads() and omp_get_thread_num() to check; omp_get_max_threads() seems to be 1 when I execute the application from MATLAB, but it's 2 (as expected) if I run it from the command window.
Question: My task manager reports that CPU usage is close to 100% --- could this mean that the aforementioned API is malfunctioning and it is in fact still running as a multiprocess application?
Confirmation:
I used Process Explorer to check if there were any differences in the number of threads.
When I call the application from the command window, 1 thread goes to cmd.exe and 2 go to my application.
When I call it from MATLAB, 26 threads are for MATLAB.exe, 1 for cmd.exe and 1 for my application.
Any ideas?
The question is how Matlab is affecting your app's behavior, since it's a separate process. I suspect Matlab is modifying environment variables in a manner that affects OMP, maybe because it uses OMP internally, and the process you are spawning from Matlab is inheriting this modified environment.
Do a "set > plain.txt" from the command window where you're launching you app plain, and "system('set > from_matlab.txt')" from within Matlab, and diff the outputs. This will show you the differences in environment variables that Matlab is introducing. When I do this, this appears in the environment inherited from Matlab, but not in the plain command window's environment.
OMP_NUM_THREADS=1
That looks like an OpenMP setting related to the function calls in your question. I'll bet your spawned app is seeing that and respecting it.
I don't know why Matlab is setting it. But as a workaround, when you launch the app from Matlab, instead of calling it directly, call a wrapper .bat file that clears the OMP_NUM_THREADS environment variable, or sets it to a higher number.
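Alternatively (my own suggestion, not from the original answer), if you control the spawned application's source, it can override the inherited setting itself:

#include <omp.h>

int main()
{
    // Ignore any OMP_NUM_THREADS value inherited from Matlab and use every core.
    omp_set_num_threads(omp_get_num_procs());

    #pragma omp parallel
    {
        // ... existing parallel work goes here ...
    }
    return 0;
}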
Run the command outside of Matlab and see how many cores it's using. There should be no difference running it from within Matlab, because it's just a call down to the operating system, i.e. equivalent to running it from the command line.
EDIT
OK, odd - what do you get when you call feature('NumCores')? What version of Matlab are you using?
Does it help to enable this?
You have to execute this in the MATLAB command line:
setenv OMP_NUM_THREADS 4
if you want to use 4 threads.