I have some code that processes images. Performance is critical, so I'm trying to implement multi-threading using a BoundedBuffer. The image data is stored as unsigned char* (dictated by the SDK I'm using to process the image data).
The problem occurs in the processData function called in the consumer thread. Inside processData, another function (from the image processing SDK) uses cudaMemcpy2D. The CUDA function always throws an exception saying Access violation reading location.
However, the CUDA function works fine if I call processData directly within the producer thread or from deposit. When I call processData from the consumer thread (as desired), I get the exception from the CUDA function. I even tried calling processData from fetch and got the same exception.
My guess is that after the data is deposited into the rawImageBuffer by the producer thread, somehow the memory pointed to by unsigned char* changes, thus the consumer thread (or fetch) actually sends bad image data to processData (and the cuda function).
This is what my code looks like:
void processData(vector<unsigned char*> unProcessedData)
{
// Process the data
}
struct BoundedBuffer {
queue<vector<unsigned char*>> buffer;
int capacity;
std::mutex lock;
std::condition_variable not_full;
std::condition_variable not_empty;
BoundedBuffer(int capacity) : capacity(capacity) {}
void deposit(vector<unsigned char*> vData)
{
std::unique_lock<std::mutex> l(lock);
bool bWait = not_full.wait_for(l, 3000ms, [this] {return buffer.size() != capacity; }); // Wait if full
if (bWait)
{
buffer.push(vData); // only push data when timeout doesn't expire
not_empty.notify_one();
}
}
vector<unsigned char*> fetch()
{
std::unique_lock<std::mutex> l(lock);
not_empty.wait(l, [this]() {return buffer.size() != 0; }); // Wait if empty
vector<unsigned char*> result{};
result = buffer.front();
buffer.pop();
not_full.notify_one();
return result;
}
};
void producerTask(BoundedBuffer &rawImageBuffer)
{
for(;;)
{
// Produce Data
vector<unsigned char*> producedDataVec{dataElement0, dataElement1};
rawImageBuffer.deposit(producedDataVec);
} //loop breaks upon user interception
}
void consumerTask(BoundedBuffer &rawImageBuffer)
{
for(;;)
{
vector<unsigned char*> fetchedDataVec{};
fetchedDataVec = rawImageBuffer.fetch();
processData(fetchedDataVec);
} //loop breaks upon user interception
}
int main()
{
BoundedBuffer rawImageBuffer(6);
thread consumer(consumerTask, ref(rawImageBuffer));
thread producer(producerTask, ref(rawImageBuffer));
consumer.join();
producer.join();
return 0;
}
Am I correct in my guess about why the exception is being thrown? How do I resolve this? For reference, each vector element contains data for a 2448 × 2048 px image in 8-bit RGBa format.
UPDATES:
After someone pointed out in the comments that the unsigned char* pointers could be invalid, I found that the address pointed to by the pointers is in fact a real memory location. In the exception Access violation reading location X, the X is larger than the address the pointer points to.
After some more debugging, I've found that the memory pointed to by the unsigned char* in the unProcessedData vector in processData doesn't remain intact: the pointer address is correct, but some blocks of memory are unreadable. I found this by printing each char in the unsigned char* in processData. When processData is called by the producer thread (when cuda doesn't throw the exception), all chars get printed nicely (I'm printing 2048*2448*4 chars, dictated by the aforementioned image resolution and format). But when processData is called by the consumer thread, printing the chars throws the same exception, around the 40th char (around the 40th, not always exactly the 40th).
Okay, so now I'm pretty sure that not only are my pointers pointing to real memory locations, I also know that the first memory block pointed to by the pointer holds the expected value, at least for as many times as I've tested this. To test this, in producerTask I deliberately write a test value (such as int 42, or a char) to the 0th memory block pointed to by the unsigned char*. In processData, I check whether the memory block still contains the test value, and it does. So now I know that some of the memory blocks pointed to by the pointer become unreadable for some unknown reason. Also, my test doesn't prove that the first memory block is immune to becoming inaccessible, just that it didn't become inaccessible in the few tests I did. TL;DR for Updates 1 to 3: The unProcessedData pointers are valid; they point to a real memory address, and that address holds the expected value.
Another debugging attempt. Now I'm using Visual Studio's memory window to visually inspect the data. The debugger tells me that unProcessedData[0] points to 0x00000279d7c76070. This is what memory around 0x00000279d7c76070 looks like:
Memory seems sensible; the RGBa format can clearly be seen, and since the image is all black it makes sense that the RGB channels are close to 0 whereas alpha is ff. I scrolled down for a long time to inspect the memory, and all the way to 0x00000279D8F9606F the data looks good (RGBa values as expected). The 0x00000279D8F9606F number also makes sense: 0x00000279D8F9606F - 0x00000279d7c76070 = 20054015, which means there are 20054016 valid chars, as expected (2048 height * 2448 width * 4 channels = 20054016). Okay, so far so good. Note that all this is right before running the cuda function. After stepping through the cuda function I get the same exception: Access violation reading location 0x00000279D80B8000. Note that 0x00000279D80B8000 is between 0x00000279d7c76070 and 0x00000279D8F9606F, the parts of memory which I visually checked to be correct. Now, after running the cuda function, here is what the memory between 0x00000279d7c76070 and 0x00000279D8F9606F looks like:
When I cout anything in processData before calling the cuda function, the memory pointed to by the pointer changes: all the chars become 0xDD. This page on MSDN says that the freed blocks kept unused in the debug heap's linked list when the _CRTDBG_DELAY_FREE_MEM_DF flag is set are currently filled with 0xDD.
But when I call processData from the producer thread, the pointed-to memory doesn't change after I cout anything.
Right now the most upvoted comment on this question is telling me to learn more about pointers. I am doing that (hopefully as my updates suggest), but what topics about them do I need to learn? I do know how pointers work. I know the pointers are pointing to a valid memory location (see Update 2). I know some memory blocks pointed to by the pointer become unreadable (see Update 3). But I don't know why the memory blocks become inaccessible. In particular, I don't know why they only become inaccessible when processData is called from the consumer thread (note that no exception is thrown when processData is called from the producer thread). Is there anything else I can do to help narrow down this problem?
The problem was fairly simple; n.m.'s comments guided me in the right direction and I'm thankful for that.
In my updates I mentioned that printing anything using cout caused the data to become corrupt. It seemed like that was happening, but after putting some breakpoints in fetch and deposit, I got a complete picture of what was really going on.
The way I produced the image data was by using another SDK supplied with the camera: the SDK gave me image data as a wrapped pointer type. I then converted the image format, unwrapped the converted image to get the pointer to the raw image, stored that pointer in producedDataVec, and deposited it into rawImageBuffer. The problem was that as soon as the converted image went out of scope, my data became corrupted. So the cout statements weren't really responsible for corrupting my data; with breakpoints placed everywhere, I could see the data becoming corrupt just after the converted image went out of scope. To resolve this, my producer now deposits the wrapped pointer directly into the buffer. The consumer fetches the wrapped pointer, the converted image is obtained by converting the format in the consumer, and then the raw image pointer is obtained. Now the converted image only goes out of scope after processData has returned, so the exception is never thrown.
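For anyone hitting the same issue, here is a minimal sketch of the shape of the fix. WrappedImage, ConvertedImage, camera.GetNextImage(), convert() and GetRawPointer() are hypothetical stand-ins for the camera SDK's actual types and calls; the point is only that the owning wrapper must outlive the call to processData:
struct BoundedBuffer {
    std::queue<WrappedImage> buffer;   // store the owning wrapper, not unsigned char*
    // ... mutex, condition variables, deposit() and fetch() as before ...
};
void producerTask(BoundedBuffer &rawImageBuffer)
{
    for (;;)
    {
        WrappedImage img = camera.GetNextImage();  // hypothetical SDK call
        rawImageBuffer.deposit(std::move(img));    // the wrapper stays alive inside the queue
    }
}
void consumerTask(BoundedBuffer &rawImageBuffer)
{
    for (;;)
    {
        WrappedImage img = rawImageBuffer.fetch();
        ConvertedImage converted = convert(img);   // format conversion now happens here
        processData(converted.GetRawPointer());    // 'converted' outlives this call,
    }                                              // so the raw pointer stays valid
}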
I have a recursive method in a class that computes some stages (it doesn't matter what that means). If it notices that the probability of success for a stage is too low, the stage is delayed by storing it in a queue, and the program will look for delayed stages later. It grabs a stage, copies the data for working, and deletes the stage.
The program runs fine, but I have a memory problem. Since this program is randomized, it can happen that it delays up to 10 million stages, which results in something like 8 to 12 GB of memory usage (or more, but it crashes before that happens). It seems the program never frees the memory for a deleted stage before reaching the end of the call stack of the recursive function.
class StageWorker
{
private:
queue<Stage*> delayedStages;
void perform(Stage** start)
{
// [long value_from_stage = (**start).value; ]
delete *start;
// Do something here
// Delay a stage if necessary like this
this->delayedStages.push(new Stage(value));
// Do something more here
// Look for delayed stages like this
Stage* front = this->delayedStages.front();
this->delayedStages.pop();
this->perform(&front);
}
};
I use a pointer to a pointer because I thought the memory was not freed while there was a pointer (front) still pointing to the stage. So I used a pointer that points to that pointer, so I could delete through it. But it doesn't seem to work. If I watch the memory usage in Performance Monitor (on Windows), it looks like this:
I marked the end of the recursive calls. This is just an example plot: real data, but from a very small scenario.
Any ideas how to free the memory of no-longer-used stages before reaching the end of the call stack?
Edit
I followed some advice and removed the pointers. So now it looks like this:
class StageWorker
{
private:
queue<Stage> delayedStages;
void perform(Stage& start)
{
// [long value_from_stage = start.value; ]
// Do something here
// Delay a stage if necessary like this
this->delayedStages.push(Stage(value));
// Do something more here
// Look for delayed stages like this
this->perform(this->delayedStages.front());
this->delayedStages.pop();
}
};
But this changed nothing, memory usage is the same as before.
As mentioned, it may be a problem with the monitoring. Is there a better way to check the exact memory usage?
Just to close this question as answered:
Mats Petersson (thanks a lot) mentioned in the comments that it could be a problem with how the memory usage is monitored.
In fact, this was exactly the problem. The code I provided in the edit (after replacing the pointers by references) frees the space correctly, but it is not given back to the OS (in this case Windows). So from the monitor it looks like the space is not freed.
But then I made some experiments and saw that freed memory is reused by the application, so it does not have to ask the OS for more memory.
So what the monitor shows is the peak memory the recursive function required; the whole block is only given back to the OS after the first call of the function ends.
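If anyone wants to double-check this kind of behaviour without relying on the OS monitor, one rough trick is to count live heap bytes yourself by replacing the global allocation functions. A sketch (added here for illustration, not from the original question; it only tracks new/delete, not direct malloc):
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <new>
static std::atomic<std::size_t> g_liveBytes{0};
// Reserve an aligned header in front of each block to remember its size.
static const std::size_t HEADER = alignof(std::max_align_t);
void* operator new(std::size_t n)
{
    void* base = std::malloc(n + HEADER);
    if (!base) throw std::bad_alloc();
    *static_cast<std::size_t*>(base) = n;          // stash the size
    g_liveBytes += n;
    return static_cast<char*>(base) + HEADER;
}
void operator delete(void* p) noexcept
{
    if (!p) return;
    char* base = static_cast<char*>(p) - HEADER;
    g_liveBytes -= *reinterpret_cast<std::size_t*>(base);
    std::free(base);
}
// Call this at interesting points, e.g. after draining the delayed-stage queue:
void reportLiveHeap() { std::printf("live heap bytes: %zu\n", g_liveBytes.load()); }
With this in place you can watch the live-byte count drop as stages are deleted, even while the OS-level working set stays flat.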
Unhandled exception at 0x764F135D (kernel32.dll) in RFNReader_NFCP.exe.4448.dmp: 0xC0000005: Access violation writing location 0x00000001.
void Notify( const char* buf, size_t len )
{
for( auto it = m_observerList.begin(); it != m_observerList.end(); )
{
auto item = it->lock();
if( item )
{
item->Update( buf, len );
++it;
}
else
{
it = m_observerList.erase( it );
}
}
}
The variable item's value in the debug window:
item shared_ptr {m_interface="10.243.112.12" m_port="8889" m_clientSockets={ size=0 } ...} [3 strong refs, 2 weak refs] [default] std::tr1::shared_ptr
But in item->Update():
the item (this) becomes null!
Why??
The problem here is most likely not the weak_ptr, which is used correctly.
In fact, the code you posted is completely fine, so the error must be elsewhere. The raw pointer and length arguments indicate a possible memory corruption.
Be aware that the debugger might lie to you if you accidentally mess up stack frames due to memory corruption. Since you seem to be debugging this from a minidump it might also be that the dumping swallowed some info here.
Mind you, the corrupted this pointer that you are seeing here is just a value on the stack! The underlying object is most probably still alive, as you are maintaining several shared_ptrs to it (you can verify this in a debug build by checking if the original memory location of the object was overwritten by magic numbers). It's really just your stack values that are bogus. I would definitely recommend you double check the stack manually using VS's memory and register windows. If you do have a memory corruption, it should become visible there.
Also consider temporarily cranking up the amount of data saved to the minidump if it threw away too much.
Finally, be sure you double check your buffer handling. It's very likely that you messed up there somewhere and an out-of-bounds buffer write caused the corruption.
Note that your this is invalid (0x00000001), i.e. the object got destroyed. The Notify member function was called on a destroyed object. This obviously crashes as soon as Notify tries to access an object member.
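If the object really can be destroyed while a notification is in flight, one common mitigation is to manage the subject with shared_ptr and pin it for the duration of Notify. A sketch (assuming shared_ptr ownership, which the posted code doesn't show; Observer is an invented interface here):
#include <cstddef>
#include <list>
#include <memory>
struct Observer
{
    virtual ~Observer() = default;
    virtual void Update( const char* buf, std::size_t len ) = 0;
};
class Subject : public std::enable_shared_from_this<Subject>
{
public:
    void Notify( const char* buf, std::size_t len )
    {
        auto self = shared_from_this();  // keeps *this alive until Notify returns
        for( auto it = m_observerList.begin(); it != m_observerList.end(); )
        {
            if( auto item = it->lock() )
            {
                item->Update( buf, len );
                ++it;
            }
            else
            {
                it = m_observerList.erase( it );
            }
        }
    }
private:
    std::list<std::weak_ptr<Observer>> m_observerList;
};
Note that this only helps if Notify is entered while at least one shared_ptr to the subject still exists; it narrows the race rather than eliminating it, and it does nothing about a genuine buffer overrun.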
Shared memory is giving me a hard time, and GDB isn't being much help. I've got 32 KB of shared memory allocated, and I used shmat and cast the result to a pointer to a struct containing A) a bool and B) a queue of objects containing one std::string, three ints, and one bool, plus assorted methods. (I don't know if this matryoshka structure is how you're supposed to do it, but it's the only way I know. Using a message queue isn't an option, and I need to use multiple processes.)
Pushing one object onto the queue works, but when I try to push a second, the program freezes. No error message, no nothing. What's causing this? I doubt it's a lack of memory, but if it is, how much do I need?
EDIT: In case I was unclear -- the objects in the queue are of a class with the five data members described.
EDIT 2: I changed the class of the queue's entries so that it doesn't use std::string. (Embarrassingly enough, I was able to represent the data with a primitive.) The program still freezes on the second push().
EDIT 3: I tried calling front() from the same queue immediately after the first push(), and it froze the program too. Checking the value of the bool outside the queue, however, worked fine, so it's gotta be something wrong with the queue itself.
EDIT 4: As an experiment, I added an std::queue<int> to the struct I was using for the shared memory. It showed the same behavior -- push() worked once, then front() made it freeze. So it's not a problem with the class I'm using for the queue items, either.
This question suggests I'm not likely to solve this with std::queue. Is that so? Should I use boost like it says? (In my case, I'm executing shmget() and shmat() in the parent process and trying to let two child processes communicate, so it's slightly different.)
EDIT 5: The other child process also freezes when it calls front(). A semaphore ensures this happens after the first push() call.
Putting std::string objects into a shared memory segment can't possibly work.
It should work fine for a single process, but as soon as you try to access it from a second process, you'll get garbage: the string will contain a pointer to heap-allocated data, and that pointer is only valid in the process that allocated it.
I don't know why your program freezes, but it is completely pointless to even think about.
As I said in my comment, your problem stems from attempting to use objects that internally require heap allocation in a structure which should be self-contained (i.e. requires no further dynamically allocated memory).
I would tweak your setup, and change the std::string to some fixed size character array, something like
// this structure fits nicely into a typical cache line
struct Message
{
boost::array<char, 48> some_string;
int a, b, c;
bool d;
};
Now, when you need to post something on the queue, copy the string content into some_string. Of course you should size your strings appropriately (and boost::array probably isn't the best - ideally you want some length information too) but you get the idea...
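For illustration, a small helper (the name is mine, not from the answer) that copies a std::string into the fixed-size field with truncation and a terminating NUL:
#include <algorithm>
#include <cstring>
#include <string>
void setMessageString( Message& m, const std::string& s )
{
    // Truncate to fit, always leaving room for a terminating NUL.
    const std::size_t n = std::min( s.size(), m.some_string.size() - 1 );
    std::memcpy( m.some_string.data(), s.data(), n );
    m.some_string[n] = '\0';
}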
I've come across a strange issue in some code that I'm working on. Basically what's going on is that whenever I try to get some information from an empty map, the program segfaults. Here's the relevant code:
(note that struct Pair is a data structure that is defined earlier, and sendMasks is a valid pointer to a std::map)
std::map<std::string*, struct Pair*>::iterator it;
for(it = sendMasks->begin(); it != sendMasks->end(); it++){ //segfault
//(some code goes here)
}
I know that the pointer to the map is good; I can do
it = sendMasks->begin();
it = sendMasks->end();
before my loop, and it doesn't segfault at all then.
Now, if I put the following test before the for loop, it will segfault:
if( sendMasks->empty() )
As will any other attempt to determine if the map is empty.
This issue only occurs if the map is empty. My only thought is that because I am updating sendMasks in a separate thread, it may not have been updated properly; that, however, doesn't make any sense, because this only happens if the map is empty, and this code has worked perfectly fine before now. Any other thoughts on what could be happening?
EDIT:
I figured out what the problem was.
At an earlier part of my code, I was allocating a new char array and putting that pointer into another array of length 4. I was then putting a NUL character at the end of my new array, but accidentally subscripted the outer array instead, which went off the end of that array and overwrote a pointer. Somehow, this managed to work properly occasionally. (valgrind doesn't detect this problem.)
The sequence was something like this:
object = NULL;   // (pointer overwritten in memory)
object->method();
// Inside object::method():
map->size();     // segfault: reads at offset 0x24 from 'this',
                 // which is NULL, so address 0x24 is invalid
I wasn't expecting the instance of the object itself to be null, because in Java this method call would fail before it even got that far, and in C this would be done quite differently. (I don't do much object-oriented programming in C++.)
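To make the bug pattern concrete, here is a hypothetical reconstruction of the kind of mistake described above (illustrative, not the actual code):
char* rows[4];               // array of four char* pointers
char* row = new char[8];
rows[0] = row;
// Intended: terminate the new array, i.e. rows[0][7] = '\0';
rows[7] = 0;                 // Bug: subscripts the outer array instead, writing
                             // past its end and silently clobbering a pointer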
If you are accessing a data structure from different threads, you must have some kind of synchronization. You should ensure that your object is not accessed simultaneously from different threads. As well, you should ensure that the changes done by one of the threads are fully visible to other threads.
A mutex (or critical section on Windows) should do the trick: the structure should be locked for each access. This ensures exclusive access to the data structure and provides the needed memory barriers for you.
Welcome to the multithreaded world!
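A minimal sketch of what that guarded access might look like for the map in this question (the writer-side function is invented, just to show both sides taking the same lock):
#include <map>
#include <mutex>
#include <string>
struct Pair;                 // defined elsewhere, as in the question
std::mutex sendMasksMutex;   // guards every access to *sendMasks
void iterateMasks( std::map<std::string*, Pair*>* sendMasks )
{
    std::lock_guard<std::mutex> guard( sendMasksMutex );
    for( auto it = sendMasks->begin(); it != sendMasks->end(); it++ )
    {
        // (some code goes here)
    }
}
void updateMasks( std::map<std::string*, Pair*>* sendMasks,
                  std::string* key, Pair* value )  // runs in the other thread
{
    std::lock_guard<std::mutex> guard( sendMasksMutex );
    (*sendMasks)[key] = value;
}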
Either:
You made a mistake somewhere, and have corrupted your memory. Run your application through valgrind to find out where.
You are not using locks around access to objects that you share between threads. You absolutely must do this.
I know that the pointer to the map is good; I can do
it = sendMasks->begin();
it = sendMasks->end();
before my loop, and it doesn't segfault at all then.
This logic is flawed.
Segmentation faults aren't some consistent, reliable indicator of an error. They are just one possible symptom of a completely unpredictable system, that comes into being when you have invoked Undefined Behaviour.
this code has worked perfectly fine before now
The same applies here. It may have been silently "working" for years, quietly overwriting bytes in memory that it may or may not have had safe access to.
This issue will only occur if the map is empty.
You just got lucky that, when the map is empty, your bug is evident. Pure chance.
It seems that this question gets asked frequently, but I am not coming to any definitive conclusion. I need a little help on determining whether or not I should (or must!) implement locking code when accessing/modifying global variables when I have:
global variables defined at file scope
a single "worker" thread reading/writing to global variables
calls from the main process thread calling accessor functions which return these globals
So the question is, should I be locking access to my global variables with mutexes?
More specifically, I am writing a C++ library which uses a webcam to track objects on a page of paper -- computer vision is CPU intensive, so performance is critical. I have a single worker thread which is spun off in an Open() function. This thread handles all of the object tracking. It is terminated (indirectly w/global flag) when a Close() function is called.
It feels like I'm just asking for memory corruption, but I have observed no deadlock issues, nor have I experienced any bad values returned from these accessor functions. And after several hours of research, the general impression I get is, "Meh, probably. Whatever. Have fun with that." If I indeed should be using mutexes, why have I not experienced any problems yet?
Here is an over-simplification of my current program:
// *********** lib.h ***********
// Structure definitions
struct Pointer
{
int x, y;
};
// more...
// API functions
Pointer GetPointer();
void Open();
void Close();
// more...
The implementation looks like this...
// *********** lib.cpp ***********
// Globals
Pointer p1;
bool isRunning = false;
HANDLE hWorkerThread;
// more...
// API functions
Pointer GetPointer()
{
// NOTE: my current implementation is actually returning a pointer to the
// global object in memory, not a copy of it, like below...
// Return copy of pointer data
return p1;
}
// more "getters"...
void Open()
{
// Create worker thread -- continues until Close() is called by API user
hWorkerThread = CreateThread(NULL, 0, DoWork, NULL, 0, NULL);
}
void Close()
{
isRunning = false;
// Wait for the thread to close nicely or else you WILL get nasty
// deadlock issues on close
WaitForSingleObject(hWorkerThread, INFINITE);
}
DWORD WINAPI DoWork(LPVOID lpParam)
{
while (isRunning)
{
// do work, including updating 'p1' about 10 times per sec
}
return 0;
}
Finally, this code is being called from an external executable. Something like this (pseudocode):
// *********** main.cpp ***********
int main()
{
Open();
while ( <esc not pressed> )
{
Pointer p = GetPointer();
<wait 50ms or so>
}
Close();
}
Is there perhaps a different approach I should be taking? This non-issue issue is driving me nuts today :-/ I need to ensure this library is stable and returning accurate values. Any insight would be greatly appreciated.
Thanks
If only one thread accesses an object (both read and write), then no locks are required.
If an object is read-only, then no locks are required (assuming you can guarantee that only one thread accesses the object during construction).
If any thread writes to (changes the state of) an object while other threads access it, then ALL accesses (both read and write) must be locked. You may use read locks that allow multiple readers, but write operations must be exclusive, and no readers can access the object while the state is being changed.
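Applied to the globals in this question, the third rule might look like the sketch below (UpdatePointer is an invented name for whatever the worker does when it writes p1):
#include <mutex>
struct Pointer { int x, y; };       // as in the question
Pointer p1;
std::mutex p1Mutex;                 // guards every read and write of p1
Pointer GetPointer()                // called from the main thread
{
    std::lock_guard<std::mutex> guard( p1Mutex );
    return p1;                      // the copy is made while the lock is held
}
void UpdatePointer( int x, int y )  // called from the worker thread
{
    std::lock_guard<std::mutex> guard( p1Mutex );
    p1.x = x;
    p1.y = y;                       // both fields change under one lock: no torn reads
}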
I suppose it depends on what you are doing in your DoWork() function. Let's assume it writes a point value to p1. At the very least you have the following possibility of a race condition that will return invalid results to the main thread:
Suppose the worker thread wants to update the value of p1, for example, changing p1 from (A, B) to (C, D). This involves at least two operations: store C in x and store D in y. If the main thread decides to read the value of p1 in the GetPointer() function, it must also perform at least two operations: load the value of x and load the value of y. If the sequence of operations is:
update thread: store C
main thread: load x (main thread receives C)
main thread: load y (main thread receives B)
update thread: store D
The main thread will get the point (C, B), which is not correct.
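Since Pointer is small and trivially copyable, one lock-free way to rule out that interleaving is std::atomic<Pointer>, which publishes both fields as a single indivisible value. A sketch (not the poster's code; on x64 an 8-byte struct is typically lock-free):
#include <atomic>
struct Pointer { int x, y; };
std::atomic<Pointer> p1{ Pointer{ 0, 0 } };
void workerUpdate( int x, int y )   // worker thread
{
    p1.store( Pointer{ x, y } );    // (C, D) is written as one atomic unit
}
Pointer GetPointer()                // main thread
{
    return p1.load();               // sees (A, B) or (C, D), never (C, B)
}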
This particular problem is not a good use of threads, since the main thread isn't doing any real work. I would use a single thread, and an API like WaitForMultipleObjectsEx which allows you to simultaneously wait for input from the keyboard stdin handle, an I/O event from the camera, and a timeout value.
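A rough sketch of that single-threaded structure, using the plain WaitForMultipleObjects variant; hCameraEvent is a hypothetical event handle that the capture layer signals when a frame arrives:
#include <windows.h>
void runLoop( HANDLE hCameraEvent )
{
    HANDLE handles[2] = { GetStdHandle( STD_INPUT_HANDLE ), hCameraEvent };
    for (;;)
    {
        DWORD r = WaitForMultipleObjects( 2, handles, FALSE, 50 /* ms timeout */ );
        if ( r == WAIT_OBJECT_0 )
        {
            // Console input pending: drain it with ReadConsoleInput
            // and break out of the loop if ESC was pressed.
        }
        else if ( r == WAIT_OBJECT_0 + 1 )
        {
            // The camera signalled a frame: run the tracking code.
        }
        else if ( r == WAIT_TIMEOUT )
        {
            // Periodic housekeeping.
        }
    }
}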
You won't get a deadlock, but you may see an occasional bad value, with extremely low probability: since reading and writing take fractions of a nanosecond and you only read the variable 50 times per second, the chance of a collision is something like 1 in 50 million.
If this is happening on Intel 64, and "Pointer" is aligned to an 8-byte boundary and is read and written in one operation (all 8 bytes with one assembly instruction), then the accesses are atomic and you don't need a mutex.
If either of those conditions are not satisfied, there's a possibility that the reader will get bad data.
I'd put a mutex just to be on the safe side, since it's only going to be used 50 times a second and it's not going to be a performance issue.
The situation is pretty clear cut: the reader may not see updates until something triggers synchronisation (a mutex, memory barrier, atomic operation...). Many things processes do implicitly trigger such synchronisation, e.g. external function calls (for reasons explained in the Usenet threading FAQ, http://www.lambdacs.com/cpt/FAQ.html; see Dave Butenhof's answer on the need for volatile), so if your code deals in values small enough that they can't be half-written (e.g. numbers rather than strings, a fixed address rather than dynamic (re)allocations), then it can limp along without explicit syncs.
If your idea of performance is getting more loops through your write code, then you'll get a nicer number if you leave out the synchronisation. But if you're interested in minimising the average and worst-case latency, and in how many distinct updates the reader can actually see, then you should do synchronisation from the writer.
You may not be seeing problems because of the nature of the information in Pointer. If it is tracking coordinates of some object that is not moving very fast, and the position is updated during a read, then the coordinates might be a little off, but not enough to notice.
For instance, assume that after an update, p.x is 100 and p.y is 100. The object you are tracking moves a bit, so after the next update, p.x is 102 and p.y is 102. If you happen to read in the middle of this update, after x is updated but before y is updated, you will end up getting a pointer value with p.x as 102 and p.y as 100.