XCode 6: Unable to create watchpoint - c++

I am trying to watch the following variable
vector<Vec3f> lines[2];
in XCode (where Vec3f is an OpenCV datatype, a vector of 3 floats).
But when I right-click the variable in Variable View and choose Watch "lines", I am being yelled at by XCode:
error: Watchpoint creation failed (addr=0x16fd92d48, size=48, variable
expression='lines'). error: watch size of 48 is not supported
This seems to happen with other variables of type vector<T> as well, but only if it is a local variable. I can watch the vector passed in as a method parameter just fine.
double computeReprojectionError(vector<Point2f>& imgpts1, vector<Point2f>& imgpts2, Mat& inlier_mask, const Mat& F)
{
    // ^ I can watch this guy
    vector<Vec3f> lines[2];             // <- I cannot watch this guy (size 48)
    vector<Point2f> imgpts1_copy(npt),  // <- I cannot watch this guy (size 24)
                    imgpts2_copy(npt);
    ...
I googled the error with no success. Can somebody shed light on the matter?

Watchpoints are in general fairly limited resources. You didn't say what architecture you were debugging, but x86_64, for instance, has only 4 hardware watchpoint registers, which can at most watch 8 bytes each. So you wouldn't be able to watch a 48 byte region on x86_64 in any case.
But you should be able to watch a 24 byte region by using 3 8-byte watches. I tried this locally, and it looks like there is a bug in the watchpoint setting - it doesn't divvy up a request larger than the native watchpoint size into several smaller watches. So you have to break up the request into 1/2/4/8 byte chunks by hand.
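For the 24-byte vector, for example, you could set the three 8-byte watches by hand from the lldb console. This is only a sketch, assuming a 64-bit build where the vector is laid out as three 8-byte pointers; substitute your own variable in the expression:
(lldb) watchpoint set expression -w write -s 8 -- (char *)&imgpts1_copy
(lldb) watchpoint set expression -w write -s 8 -- (char *)&imgpts1_copy + 8
(lldb) watchpoint set expression -w write -s 8 -- (char *)&imgpts1_copy + 16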
I filed a bug for this with the Apple bug reporter. If you want to track the resolution yourself, feel free to file your own at http://bugreporter.apple.com and I'll dup mine to it.

How to diagnose a visual studio project slowing down as time goes on?

Computer:
Processor: Intel Xeon Silver 4114 CPU @ 2.19 GHz (2 processors)
RAM: 96 GB at 2666 MHz (12 × 8 GB sticks)
OS: Windows 10
GPU: None
Hard drive: Samsung MZVLB512HAJQ-000H2 - 512GB M.2 PCIe NVMe
IDE:
Visual Studio 2019
I am including what I am doing in case it is relevant. I am running a program built in Visual Studio that reads data off a GSC PCI SIO4B Sync Card 256K. Using the API for this card (documentation: http://www.generalstandards.com/downloads/GscApi.1.6.10.1.pdf), I read 150 bytes of data at a rate of 100 Hz using the code below. That data is then split into the message structure of my device. I can't give info on the message structure, but the data is combined into the various words using a union and added to an integer array int Data[100];
Union Example:
union data_set{
    unsigned int integer;    // assembled word value
    unsigned char input[2];  // raw bytes read from the card
} word;
Example of how the data is read:
PLX_PHYSICAL_MEM cpRxBuffer;
#define TEST_BUFFER_SIZE 0x400
//allocates memory for the buffer
cpRxBuffer.Size = TEST_BUFFER_SIZE;
status = GscAllocPhysicalMemory(BoardNum, &cpRxBuffer);
status = GscMapPhysicalMemory(BoardNum, &cpRxBuffer);
memset((unsigned char*)cpRxBuffer.UserAddr, 0xa5, sizeof(cpRxBuffer));
// start data reception:
status = GscSio4ChannelReceivePlxPhysData(BoardNum, iRxChannel, &cpRxBuffer, SetMaxBytes, &messageID);
// wait for Rx operation to complete
status = GscSio4ChannelWaitForTransfer(BoardNum, iRxChannel, 7000, messageID, &amount);
if (status)
{
    // If we have an error, "bytesTransferred" will contain the number of bytes that we
    // actually transmitted.
    DisplayErrorMessage(status);
    printf("\n\t%04X bytes out of %04X transferred", amount, SetMaxBytes);
}
My issue is that this code works fine and keeps up for around 5 minutes, then it randomly stops being able to keep up and the FIFO (first in, first out) register on the PCI card begins to fill up faster than the code can process the data. To me this seems like a memory leak issue, since the code works fine for a long time and then starts to slow down when nothing has changed: all the code is doing is reading the data off the card. We used to save the data in a really large array, but even after removing that we had the same issue.
I am unsure how to figure out exactly what is happening, and I'm hoping for a way to determine whether there is a memory leak and, if there is, how to fix it.
A memory leak is only a guess, though; it could very well be something else, so any out-of-the-box suggestions for diagnosing the problem are also appreciated.
Similar to Paul's answer, but I like to strategically place two (or more) _CrtMemCheckpoint calls followed by a _CrtMemDifference, to cut down the noise.
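A minimal sketch of that checkpoint/difference pattern (MSVC debug CRT, Debug builds only; run_suspect_code is a placeholder for the code you want to bracket):
#include <crtdbg.h>

void run_suspect_code();   // placeholder for the code under test

void check_for_leaks()
{
    _CrtMemState before, after, diff;

    _CrtMemCheckpoint(&before);        // snapshot the debug heap
    run_suspect_code();
    _CrtMemCheckpoint(&after);         // snapshot again after the suspect code

    // _CrtMemDifference returns nonzero if the two snapshots differ,
    // i.e. allocations made between them were not freed.
    if (_CrtMemDifference(&diff, &before, &after))
        _CrtMemDumpStatistics(&diff);  // report the difference to the Output window
}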
Memory leaks can be detected and reported on (in Debug builds) by calling the _CrtDumpMemoryLeaks function. When running under the debugger, this will tell you (in the output tab) how many allocations you have at the time that it is called and the file and line number that each was allocated from.
Call this right at the end of your program, after you (think you) have freed all the resources you use. Anything left over is a candidate for being a leak.
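For example, a minimal sketch of that end-of-program check (Debug build under MSVC; defining _CRTDBG_MAP_ALLOC before including <crtdbg.h> adds file/line information for CRT allocations):
#define _CRTDBG_MAP_ALLOC
#include <stdlib.h>
#include <crtdbg.h>

int main()
{
    char* leaked = (char*)malloc(150);   // deliberately never freed, so it shows up in the report
    (void)leaked;

    _CrtDumpMemoryLeaks();               // dumps any outstanding allocations to the Output tab
    return 0;
}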

Question about the Cortex-M3 vector table placement

I am trying to understand the placement of the vector table for Cortex-M3 processor.
According to the Cortex-M3 arch ref manual, the reset behavior is like this (some parts are omitted):
So, we can see that the vector table comes from the VTOR (Vector Table Offset Register).
According to the Cortex-M3 tech ref manual, the VTOR is defined as:
So we can see it has a reset value of 0x0. Based on the above two points, the Cortex-M3 processor expects a vector table at the absolute address 0x0 in the Code area after reset.
But in my MDK uVision IDE, I see my application is placed in the IROM1 area, which starts at 0x8000000, which is within the 0.5G Code memory area according to the Cortex-M3 memory map.
And since it has the Startup button checked, I guess that means the IROM1 area should contain the vector table (please correct me if I am wrong about this).
So I think the vector table should lie at the beginning of the IROM1 area, i.e. 0x8000000. And indeed it does. The pic below shows that the beginning of IROM1 holds the vector table's first entry, the initial SP value.
And what's more strange, the VTOR register (at 0xE000ED08) still holds a 0x0 value:
So, how could my vector table be found with a 0x0 VTOR reset value?
And just out of curiosity, I checked the memory content at 0x0, there contains exactly the same vector table content as IROM1. So who did this magic copy??
ADD 1 - 4:39 PM 10/9/2020
I guess there must be something I don't know about the startup check box in below pic.
ADD 2 - 5:09 PM 10/9/2020
Thanks to @RealtimeRik and @domen. I downloaded the datasheet for the STM32F103x8_xB (https://www.st.com/resource/en/datasheet/stm32f103c8.pdf). In section 4, Memory mapping, I saw the below diagram:
So it seems the [0x0, 0x8000000) range does get aliased to somewhere else. But I haven't found how to determine where it is aliased to...
ADD 3 - 5:39 PM 10/9/2020
Now I found it!
I downloaded the STM32Fxxx reference manual (btw it's really huge).
In section 3.4 Boot configuration, it specifies the boot mode configured through the BOOT[1:0] pins.
And with different boot mode, different address aliasing is used:
Depending on the selected boot mode, main Flash memory, system memory or SRAM is accessible as follows:
- Boot from main Flash memory: the main Flash memory is aliased in the boot memory space (0x0000 0000), but still accessible from its original memory space (0x800 0000). In other words, the Flash memory contents can be accessed starting from address 0x0000 0000 or 0x800 0000.
- Boot from system memory: the system memory is aliased in the boot memory space (0x0000 0000), but still accessible from its original memory space (0x1FFF B000 in connectivity line devices, 0x1FFF F000 in other devices).
- Boot from the embedded SRAM: SRAM is accessible only at address 0x2000 0000.
What I saw is Boot from main Flash memory.
Well finally I can explain why 0x800 0000 is chosen...
ADD 4 - 3:19 PM 10/15/2020
The placement/expectation of the interrupt vector table at the address 0 is similar to the IA32 processor in real mode...
There is no "Magic Copy". 0x00000000 is aliased to 0x08000000.
The actual memory is physically located at 0x08000000 but can also be access at 0x00000000.
If you look in the processor specific reference manual you should find this in the the memory map section.
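If you want to convince yourself of the aliasing at runtime, here is a small sketch (an assumption-laden example for an STM32F103-class part booted from main Flash; the VTOR address is from the Cortex-M3 TRM):
#include <stdint.h>

#define VTOR (*(volatile uint32_t *)0xE000ED08)   // Vector Table Offset Register

int vector_table_is_aliased(void)
{
    volatile uint32_t *alias = (volatile uint32_t *)0x00000000;   // boot alias region
    volatile uint32_t *flash = (volatile uint32_t *)0x08000000;   // main Flash

    // With BOOT[1:0] selecting "boot from main Flash", both reads return the
    // same word: the initial SP stored in the first vector table entry,
    // even though VTOR still reads as 0.
    return (alias[0] == flash[0]) && (VTOR == 0u);
}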

IRQ 8 isn't working... HW or SW?

First, I program for Vintage computer groups. What I write is specifically for MS-DOS and not windows, because that's what people are running. My current program is for later systems and not the 8086 line, so the plan was to use IRQ 8. This allows me to set the interrupt rate in binary values from 2 / second to 8192 / second (2, 4, 8, 16, etc...)
Only, for some reason, on the newer old systems (OK, that sounds weird) it doesn't seem to be working. In emulation, and on the 386 system I have access to, it works just fine, but on the P3 system I have (GA-6BXC MB with a P3 800 CPU) it just doesn't work.
The code
setting up the interrupt
disable();
oldrtc = getvect(0x70); //Reads the vector for IRQ 8
setvect(0x70,countdown); //Sets the vector for IRQ 8 to our handler
outportb(0x70,0x8a);
y = inportb(0x71) & 0xf0;
outportb(0x70,0x8a);
outportb(0x71,y | _MRATE_); //Adjustable value, set for 64 interrupts per second
outportb(0x70,0x8b);
y = inportb(0x71);
outportb(0x70,0x8b);
outportb(0x71,y | 0x40);
enable();
at the end of the interrupt
outportb(0x70,0x0c);
inportb(0x71); //Reading the C register resets the interrupt
outportb(0xa0,0x20); //Resets the PIC (turns interrupts back on)
outportb(0x20,0x20); //There are 2 PICs on AT machines and later
When closing program down
disable();
outportb(0x70,0x8b);
y = inportb(0x71);
outportb(0x70,0x8b);
outportb(0x71,y & 0xbf);
setvect(0x70,oldrtc);
enable();
I don't see anything in the code that can be causing the problem. But it just doesn't seem to make sense. While I don't completely trust the information, MSD "does" report IRQ 8 as the RTC Counter and says it is present and working just fine. Is it possible that later systems have moved the vector? Everything I find says that IRQ 8 is vector 0x70, but the interrupt never triggers on my Pentium III system. Is there some way to find if the Vectors have been changed?
It's been a LONG time since I've done any MS-DOS code and I don't think I ever worked with this particular interrupt. (I'm pretty sure you can just read a memory location to fetch the time too, and IRQ 0 can be used to trigger you at an interval as well, so maybe that's better.) Anyway, given my rustiness, forgive me for kinda link dumping.
http://wiki.osdev.org/Real_Time_Clock - the bottom of that page has someone saying they've had problems on some machines too. RBIL suggests it might be a BIOS thing: http://www.ctyme.com/intr/rb-7797.htm
Without DOS, I'd just capture IRQ 0 itself, remap all of the interrupts to my own interrupt numbers, and change the timing as needed. I've done that somewhat recently! I think that's a bad idea on DOS though; this looks more recommended for that: http://www.ctyme.com/intr/rb-2443.htm
Anyway though, I betcha it has to do with the BIOS thing:
"Notes: Many BIOSes turn off the periodic interrupt in the INT 70h handler unless in an event wait (see INT 15/AH=83h,INT 15/AH=86h).. May be masked by setting bit 0 on I/O port A1h "

OpenCL SHA1 Throughput Optimisation

Hoping someone more experienced in OpenCL usage may be able to help me here! I'm doing a project (to help me learn a bit more crypto and to try my hand at GPGPU programming) where I'm trying to implement my own SHA-1 algorithm.
Ultimately my question is about maximizing my throughput rates. At present I'm seeing something like 56.1 MH/sec, which compares very badly to open source programs I've looked at, such as John the Ripper and OCLHashcat, which give 1,000 and 1,500 MH/sec respectively (heck, I'd be well-chuffed with a third of that!).
So, what I'm doing
I've written a SHA-1 implementation in an OpenCL kernel and a C++ host application to load data to the GPU (using CL 1.2 C++ wrapper). I'm generating blocks of candidate data to hash in a threaded fashion on the CPU and loading this data onto the global GPU memory using the CL C++ call to enqueueWriteBuffer (using uchars to represent the bytes to hash):
errorCode = dispatchQueue->enqueueWriteBuffer(
    inputBuffer,
    CL_FALSE, //CL_TRUE,
    0,
    sizeof(cl_uchar) * inputBufferSize,
    passwordBuffer,
    NULL,
    &dispatchDelegate);
I'm enqueuing data using enqueueNDRangeKernel in the following manner (where the global work size is a user-defined variable; at present I've set this to my GPU's maximum flattened global work size of 16.777 million per run):
errorCode = dispatchQueue->enqueueNDRangeKernel(
    *kernel,
    NullRange,
    NDRange(globalWorkgroupSize, 1),
    NullRange,
    NULL,
    NULL);
This means that (per dispatch) I load 16.777 million items in a 1D array and index into this from my kernel using get_global_id(0).
My Kernel signature:
__kernel void sha1Crack(__global uchar* out, __global uchar* in,
__constant int* passLen, __constant int* targetHash,
__global bool* collisionFound)
{
//Kernel Instance Global GPU Mem IO Mapping:
__private int id = get_global_id(0);
__private int inputIndexStart = id * passwordLen;
//Select Password input key space:
#pragma unroll
for (i = 0; i < passwordLen; i++)
{
inputMem[i] = in[inputIndexStart + i];
}
//SHA1 Code omitted for brevity...
}
So, given all this: am I doing something fundamentally wrong in the way I'm loading data? I.e. one call to enqueueNDRangeKernel for 16.7 million kernel executions over a 1D input vector? Should I be using a 2D space and sub-dividing into local workgroup ranges? I tried playing with this but it didn't seem any quicker.
Or, perhaps as likely, is my algorithm itself the source of the slowness? I've spent a good while optimizing it and manually unrolling all of the loop stages using pre-processor directives.
I've read about memory coalescing on the hardware. Could that be my issue? :S
Any advice at all appreciated! If I've missed anything important please let me know and I'll update.
Thanks in advance! ;)
Update: 16,777,216 is the device's maximum reported workgroup size (256**3). The global array of boolean values is a single boolean. It's set to false at the start of the kernel enqueue, and a branching statement sets it to true only if a collision is found - will that force a convergence? passwordLen is the length of the current input value and targetHash is an int[4] encoded hash to check against.
Your 'maximum flattened global worksize' should be multiplied by passwordLen. It is the number of kernel instances you can run, not the maximum length of an input array. You can most likely send much more data than this to the GPU.
Other potential issues: you are 'generating blocks of candidate data to hash in a threaded fashion on the CPU'; try doing this in advance of the kernel iterations to see whether the delay is in the generation of the data blocks or in the processing of the kernels. Your SHA-1 algorithm is the other obvious potential issue. I'm not sure how much you've really optimised it by 'unrolling' the loops; usually the bigger optimisation issue is 'if' statements (if a single kernel instance within a workgroup tests true, then all of the lockstepped workgroup instances must follow that branch in parallel).
And DarkZeros is correct: you should manually play with the local workgroup size, picking one that evenly divides the global size and matches the number of work-items the card can run at once. The easiest way to do this is to round the global work size up to the next multiple of the card's capacity and add an if{} statement in the kernel so that it only does work for global_id values less than the actual number of work-items you want to run.
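A sketch of that round-up-and-guard idea, using the same C++ wrapper calls as the question (workItemsWanted and localSize are illustrative names, not from the original code):
// Host side: pad the global size up to a multiple of the chosen local size.
size_t workItemsWanted = 16777216;   // candidates to test in this dispatch
size_t localSize       = 256;        // pick per device (e.g. from CL_KERNEL_WORK_GROUP_SIZE)
size_t globalSize      = ((workItemsWanted + localSize - 1) / localSize) * localSize;

errorCode = dispatchQueue->enqueueNDRangeKernel(
    *kernel,
    NullRange,
    NDRange(globalSize, 1),
    NDRange(localSize, 1),
    NULL,
    NULL);

// Kernel side: guard the padded work-items (pass workItemsWanted in as an argument):
//   if (get_global_id(0) >= workItemsWanted) return;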
Dave.

Getting OpenCV Error: Insufficient memory while running OpenCV Sample Program: "stitching_detailed.cpp"

I recently started working with OpenCV with the intent of stitching large amounts of images together to create massive panoramas. To begin my experimentation, I looked into the sample programs that come with the OpenCV files to get an idea of how to use the OpenCV libraries. Since I was interested in image stitching, I went straight for "stitching_detailed.cpp." The code can be found at:
https://code.ros.org/trac/opencv/browser/trunk/opencv/samples/cpp/stitching_detailed.cpp?rev=6856
Now, this program does most of what I need it to do, but I ran into something interesting. I found that for 9 out of 15 of the optional projection warpers, I receive the following error when I try to run the program:
Insufficient memory (Failed to allocate XXXXXXXXXX bytes) in unknown function,
file C:\slave\winInstallerMegaPack\src\opencv\modules\core\src\alloc.cpp,
line 52
where the "X's" mark integer that change between the different types of projection (as though different methods require different amounts of space). The full source code for "alloc.cpp" can be found at the following website:
https://code.ros.org/trac/opencv/browser/trunk/opencv/modules/core/src/alloc.cpp?rev=3060
However, the line of code that emits this error in alloc.cpp is:
static void* OutOfMemoryError(size_t size)
{
--HERE--> CV_Error_(CV_StsNoMem, ("Failed to allocate %lu bytes", (unsigned long)size));
return 0;
}
So, I am simply lost as to the possible reasons this error may be occurring. I realize that this error would normally occur if the system were out of memory, but when running this program with my test images I never use more than ~3.5 GB of RAM, according to my Task Manager.
Also, since the program was written as a sample of OpenCV's stitching capabilities by the OpenCV developers, I find it hard to believe that there is a drastic memory error present in the source code.
Finally, the program works fine if I use some of the warping methods:
- spherical
- fisheye
- transverseMercator
- compressedPlanePortraitA2B1
- paniniPortraitA2B1
- paniniPortraitA1.5B1
but when I ask the program to use any of the others (through the command-line flag
--warp [PROJECTION_NAME]):
- plane
- cylindrical
- stereographic
- compressedPlaneA2B1
- mercator
- compressedPlaneA1.5B1
- compressedPlanePortraitA1.5B1
- paniniA2B1
- paniniA1.5B1
I get the error mentioned above. I get pretty good results from the transverseMercator projection warper, but I would like to test the stereographic one in particular. Can anyone help me figure this out?
The pictures that I am trying to process are 1360 x 1024 in resolution and my computer has the following stats:
Model: HP Z800 Workstation
Operating System: Windows 7 Enterprise 64-bit
Processor: Intel Xeon 2.40GHz (12 cores)
Memory: 14GB RAM
Hard Drive: 1TB Hitachi
Video Card: ATI FirePro V4800
Any help would be greatly appreciated, thanks!
When I run OpenCV's traincascade app, I get just the same error as you:
Insufficient memory (Failed to allocate XXXXXXXXXX bytes) in unknown function,
file C:\slave\winInstallerMegaPack\src\opencv\modules\core\src\alloc.cpp,
line 52
At the time, only about 70 percent of my RAM (6 GB) was occupied. And when running traincascade step by step, I found that the error would be thrown when it used more than about 1.5 GB of RAM.
Then I found there are two arguments which control how much memory is used:
-precalcValBufSize
-precalcIdxBufSize
So I tried setting these two to 128, and it ran. I hope my experience can help you.
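For reference, an invocation with those buffers capped might look something like this (the -data/-vec/-bg paths are placeholders for your own training setup):
opencv_traincascade -data classifier -vec samples.vec -bg negatives.txt -precalcValBufSize 128 -precalcIdxBufSize 128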
I think this problem is not about a memory leak; it is just related to how much memory the OS allows an application to occupy. I expect someone can check my guess.
I've recently had a similar issue with OpenCV image stitching. I used the create method to create a Stitcher instance and provided 5 images in vertical order to the stitch method, but I received an insufficient memory error.
The panorama was successfully created after setting:
setWaveCorrection(false)
This solution will not be applicable if you need wave correction.
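A minimal sketch of that approach (OpenCV Stitcher API; the image file names are placeholders):
#include <opencv2/opencv.hpp>
#include <string>
#include <vector>

int main()
{
    std::vector<cv::Mat> imgs;
    for (int i = 1; i <= 5; ++i)
        imgs.push_back(cv::imread("img" + std::to_string(i) + ".jpg"));

    cv::Ptr<cv::Stitcher> stitcher = cv::Stitcher::create(cv::Stitcher::PANORAMA);
    stitcher->setWaveCorrection(false);   // skip wave correction to reduce memory use

    cv::Mat pano;
    if (stitcher->stitch(imgs, pano) == cv::Stitcher::OK)
        cv::imwrite("pano.jpg", pano);
    return 0;
}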
This may be related to the order of the stitching. I split a big picture into 3×3 tiles; when I stitch them row by row there is no problem, but when I stitch them column by column I get the same problem as you.