WriteFile returning error code 995 - C++

I have searched Stack Overflow and googled thoroughly for this problem but have not been able to find a clue as to why it is happening.
I am writing a program in C++ which communicates with a measurement device connected through USB. The program is multithreaded and several threads will communicate with the device. A mutex is used to guarantee that no two threads try to read from or write to the device at the same time.
Commands are sent to the device using WriteFile and the responses and measured values are read using ReadFile - both operations are done synchronously.
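For clarity, the access pattern looks roughly like this (a minimal sketch with hypothetical names, not the actual program):

#include <windows.h>
#include <mutex>

std::mutex g_deviceMutex;  // guards all I/O with the device
HANDLE g_device;           // opened elsewhere with CreateFile

// Send a command and read the response under one lock, so no two
// threads ever talk to the device at the same time.
bool Transact(const void* cmd, DWORD cmdLen, void* resp, DWORD respLen)
{
    std::lock_guard<std::mutex> lock(g_deviceMutex);
    DWORD written = 0, read = 0;
    if (!WriteFile(g_device, cmd, cmdLen, &written, NULL))
        return false;  // GetLastError() == 995 is the failure in question
    if (!ReadFile(g_device, resp, respLen, &read, NULL))
        return false;  // GetLastError() == 121 on the device's timeout
    return true;
}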
Sometimes, while reading the measured value from the device, the measurement fails with a timeout (GetLastError() returns error code 121) due to a synchronization error inside the measurement device itself -- which is OK and expected.
When I try to continue the measurement by sending a new command, WriteFile will sometimes (roughly 50% of the time) fail, and GetLastError() returns error code 995, which is described in MSDN as:
ERROR_OPERATION_ABORTED
995 (0x3E3)
The I/O operation has been aborted because of either a thread exit or an application request.
There is no thread exit after the timeout occurs, and no read or write operation is cancelled. I am able to resume communication only by closing and re-opening the connection to the device using CloseHandle and CreateFile. However, this takes time away from the measurement and is not an ideal solution.
My question is: why does WriteFile return error code 995 in this case, and what can I do to avoid having to close and re-open the connection to the device?

Contact the OEM of your USB-serial device -- we can't help you further with this, since we don't have access to the driver's plumbing. If the device OEM can't help you -- contact the USB-serial chipset's manufacturer; if they refuse to help, throw the USB-to-serial adapter in the garbage can and buy one with a chipset that actually has manufacturer support (such as the FTDI or Silicon Labs USB-serial chips), not some cheap clone garbage.

Related

Windows 10 virtual COM port - ReadFile stops working after some period of time

I am in the unenviable position of having to debug code that was written by someone 10+ years ago who no longer works at the company.
The premise is fairly simple: this is a Windows based test tool that is intended to communicate with an external device that our company builds. The communication is over RS-232 using a Windows COM port via a USB-to-Serial converter. The communication is a simple request/response scheme. The program runs a continuous loop of successive WriteFile() and ReadFile() calls to communicate with the external device. WriteFile to send a command, followed by ReadFile to read the response.
All works well initially, but after some period of time (roughly 10 minutes - although I haven't confirmed that it's always consistent), the ReadFile call stops working - as in, it times out and returns 0 characters every single time after the initial failure. Since I have the ability to debug the external device simultaneously, obviously the first thing I did was to check if the failure was there, but I have confirmed that even after the ReadFile call stops working, the external device still correctly receives the commands sent via the WriteFile call and responds on the same COM port.
// Flush buffer
PurgeComm(hComm, PURGE_RXABORT|PURGE_RXCLEAR|PURGE_TXABORT|PURGE_TXCLEAR);
// Send command
WriteFile(hComm, dataOut_ptr, write_size, &dwBytesWritten, NULL);
//...
// Read Response
ReadFile(hComm, dataIn_ptr, read_size, &dwBytesRead, NULL);
//This sequence works for a while
//At a certain point, the ReadFile call times out and dwBytesRead is 0
//After that point, every call to ReadFile times out in the same way
//WriteFile still works fine and I know that the external device is still responding on the same UART channel
If I close and re-open the COM port after the timeout as shown below, nothing changes.
//This is the code inside the COM close function
PurgeComm(hComm, PURGE_RXABORT);
CloseHandle(hComm);
//...
//This is the COM open code that gets called in a separate function:
hComm = CreateFile(name,
                   GENERIC_READ | GENERIC_WRITE,
                   0,
                   0,
                   OPEN_EXISTING,
                   0,
                   0);
GetCommTimeouts(hComm,&ctmoOld);
ctmoNew.ReadTotalTimeoutConstant = 200;
ctmoNew.ReadTotalTimeoutMultiplier = 0;
ctmoNew.WriteTotalTimeoutMultiplier = 0;
ctmoNew.WriteTotalTimeoutConstant = 0;
SetCommTimeouts(hComm, &ctmoNew);
dcbCommPort.DCBlength = sizeof(DCB);
GetCommState(hComm, &dcbCommPort);
BuildCommDCB("9600,O,8,1", &dcbCommPort);
SetCommState(hComm, &dcbCommPort);
However, if I set a breakpoint on the external device just before it responds, close the test program, open the COM port in a serial terminal like RealTerm, and then let the external device proceed, the data comes in fine. At the same time, if I kill and restart the test program entirely, it will also work again for a period of time before experiencing the same timeout issue.
I have tried playing with the Rx timeout, as well as inserting an additional delay between the WriteFile and ReadFile calls with no success.
I don't get it. Based on this behaviour I don't suspect the Windows USB-to-Serial driver that's being used, and it feels like something is going wrong specifically with the use of ReadFile in the test program.
Is there a possibility that the buffer is not being flushed properly and simply stops working because it overflows? Are there known issues with the ReadFile or PurgeComm functions on Windows 10? This is a legacy program that normally runs on a Windows XP machine without issue. I'm having to run it on Windows 10 because I'm using it to test an upgrade of the external device and that's the PC I have.
Edit: To clarify, the "failed" call to ReadFile still returns 1 (so calling GetLastError() is not relevant here); it's just that the number of characters read is 0.
Edit 2: Some more details about the communication being attempted...
The Purge-WriteFile-ReadFile sequence alternates between 2 types of commands (same sequence for both commands):
- a 'write' command, in which a 134-byte packet (128-byte payload + 6 bytes overhead) is sent to the external device, to which the device responds with a 4-byte 'ok' or 'not ok' handshake
- a 'read' command, which is a 6-byte packet with the ID of the data to be read back (specifically the data that was just written), to which the device responds with a 130-byte (128 bytes data + 2 bytes overhead) response
The timeout always initially occurs during the 'read' command. So the ReadFile call is expecting a length of 130 bytes. After that, the ReadFile call during the 'write' command (where expected bytes read is 4) also times out.
This time noting that the OP's system tends to work some of the time, verifying basic communication, there are some interesting points and questions. (And for some reason I can't "comment" and must post any questions using an "answer".)
One interesting feature is that the re-open uses 0 for both WriteTotalTimeoutConstant and WriteTotalTimeoutMultiplier. I can't tell if this is the initial condition as well, or only the "reopen" state after the first failure. We normally use the MAXDWORD value for WriteTotalTimeoutConstant. The apparent effect is that the program may not be waiting for the write to complete before going on to the read.
And 200 ms is a very short read timeout: if the read doesn't complete within 200 ms of the initiation of the write, it times out. Transmitting the packet at 9600 baud will take at least 130 ms of that 200 ms budget, so any delay in the (unreliable) operating system write might mean that the data is still being transmitted when the read times out.
I would certainly experiment with MAXDWORD in WriteTotalTimeoutConstant, and a much longer read timeout; see the sketch after the documentation excerpt below. Remember that the system won't actually wait for the timeout if it receives a full "readsize" packet, but I can't tell whether that is set to the exact packet size or whether the timeout is being relied on to tell when the receive is over (thus usually wasting 200 ms). Also, if you are depending on the timeout to recognize when the device has finished responding (that is, reading more than the size of the actual response packet), then I would look at using the inter-byte timeouts as well -- but that is a more complex topic.
Docs on write timeouts:
WriteTotalTimeoutMultiplier
The multiplier used to calculate the total time-out period for write operations, in milliseconds. For each write operation, this value is multiplied by the number of bytes to be written.
WriteTotalTimeoutConstant
A constant used to calculate the total time-out period for write operations, in milliseconds. For each write operation, this value is added to the product of the WriteTotalTimeoutMultiplier member and the number of bytes to be written.
A value of zero for both the WriteTotalTimeoutMultiplier and WriteTotalTimeoutConstant members indicates that total time-outs are not used for write operations.
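For illustration, here is a sketch of the suggested experiment (hComm as in the question; the 2000 ms read value is only an example, not a value from the post):

COMMTIMEOUTS to = {0};
// Writes: block until the write actually completes, instead of
// "total time-outs are not used" (the 0/0 case quoted above).
to.WriteTotalTimeoutConstant   = MAXDWORD;
to.WriteTotalTimeoutMultiplier = 0;
// Reads: far more headroom than 200 ms; at 9600 baud the 130-byte
// response alone occupies roughly 135 ms on the wire.
to.ReadTotalTimeoutConstant    = 2000;
to.ReadTotalTimeoutMultiplier  = 0;
to.ReadIntervalTimeout         = 0;
if (!SetCommTimeouts(hComm, &to)) {
    // handle GetLastError()
}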
In addition, I would check the success of the write operation to make sure there was no problem before starting the read, as some form of failure could lock up writing. Also note that nothing is said about hardware or software flow control, so investigate possible reasons for the write not finishing in the expected time.
Also note that the "system" (as a whole) might be in an unsynchronized state after a random failure. This is because a portion of the transmit block is terminated by the flushing (PurgeComm) operation, then restarted. (Specifically because the write operation is potentially asynchronous in the above code and doesn't wait for end of the write, a subsequent flush kills the write block before it is finished.) The device under test must have a way to know that the partial (aborted) packet has been restarted, and some delays to allow for the device under test to resynchronize should be implemented on any failure condition.
Oh, and I am suspicious of PurgeComm problems that are not recognized -- I have removed all flushing operations from my own code because of random issues not unlike those of the OP (the original poster). I would look into not using flush, controlling exactly what is flushed (the flags to PurgeComm), and only flushing after a failure and once upon system initialization. Also implement significant delays upon any failure so external systems can settle. I have also changed to using a read function with a timeout to flush input, rather than some equivalent of flushing, because I was having problems when I did that (but can't explain why). I am especially suspicious of flushing (purging) everything, including any ongoing transmission, because the device under test may receive only a partial packet, and thus that end needs a recovery mechanism.
Also suspicious of flushing everything (read and write) before every test. Indeed, if serial cables are removed and reconnected before a test, random characters will occur in the input buffer. And especially if there is any flow control of the output, that might be held up as well. So there are some reasons to flush all that.

But now imagine that some error occurs and a partial packet is sent out corrupted (will get back to that). Then the device under test sees a partial packet, then perhaps a complete packet spliced into that partial one. What does it do? Is there a delay of over 200 ms? So assume in this circumstance there is a delay. The program times out in 200 ms, loops, and sends another packet. The device under test then receives another packet, but was still handling or responding to the last one. Meanwhile the test program has flushed any response that may have been underway, because it looped and flushed both input and output. The cycle continues every 200 ms, and the result is exactly the behavior in the OP. When the program used to run on an old slow XP machine, there were much greater delays, and perhaps the multiple packets were not occurring every 200 ms; but a modern multi-gigahertz, multi-threaded computer can be writing (see above, without waiting for the write to finish) and starting a new cycle every 200 ms. (Which could be as fast as 5 times a second.)
How to know the device under test is "still responding"? If you debug-break the device, that breaks the loop and the system changes, so you may then receive a complete packet -- which is not the same as responding correctly during the possible "looping" above. I suggest a scope, and/or special code in the device under test to report its activity, or a device simulator hooked up by null modem. There are even LED serial monitors that can show whether the device under test is sending a packet back each time; that will give more clues, but it is still possible the flush "ate" the response due to timing, so integrating time delays into the program along with such testing, to see whether a packet response is given, may be useful.
(Yeah, kind of obsessive here -- but working on converted 25-year-old Linux serial port code right now and having similar issues!)
PS: The issue of potential weird behavior of PurgeComm:
My library uses these:
void
Serial::SerialImpl::flushInput ()
{
if (is_open_ == false) {
throw PortNotOpenedException("Serial::flushInput");
}
PurgeComm(fd_, PURGE_RXCLEAR);
}
void
Serial::SerialImpl::flushOutput ()
{
if (is_open_ == false) {
throw PortNotOpenedException("Serial::flushOutput");
}
PurgeComm(fd_, PURGE_TXCLEAR);
}
I stopped using either of the "flush" options that call PurgeComm. I don't remember the exact problems, but they remind me of those described here -- and by that I mean complete failure to do serial transactions after an unexplained intermittent failure. If all else fails, I would figure out a way to skip that call. For example, a read "flush" can be done by reading one character with a very short timeout, in a loop, stopping on timeout. This adds the specified delay only once (when the final read times out and the input is flushed), so it does not cost much. (There may even be a zero-delay option for this.) Combine that with making sure most of the write is done before reading (rather than a 0 write timeout -- see above) and with checking for write failures.
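A sketch of that read-based flush (my assumption, not the library's code; it presumes hComm is a non-overlapped handle and that a short ReadTotalTimeoutConstant of a few milliseconds is in effect):

// Drain the input buffer by reading until a timeout, instead of PurgeComm.
void DrainInput(HANDLE hComm)
{
    char ch;
    DWORD got = 0;
    // With total read timeouts set, ReadFile returns TRUE with got == 0
    // when nothing arrives in time; stop at that point.
    while (ReadFile(hComm, &ch, 1, &got, NULL) && got == 1) {
        // discard the byte
    }
}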
Also read the post on GetCommTimeouts in this thread -- very applicable to weird problems like the OP's, and spot on.
GetCommTimeouts is a debugging function. You should never use it in production code unless you want a program that randomly fails depending on what arbitrary configuration is leftover on the port from the previous application that opened it.
Instead of calling GetCommTimeouts, start with a zero-filled COMMTIMEOUTS structure and then set every documented member explicitly. Currently you're leaving one unchanged, ReadIntervalTimeout, which is potentially highly relevant. Do not allow your code to inherit the previous configuration of ReadIntervalTimeout. Set it explicitly to the value you want.
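A sketch of that advice: zero-fill the structure and set every documented member (the values are the question's, with ReadIntervalTimeout now explicit):

COMMTIMEOUTS ctmo = {0};               // zero-fill: inherit nothing
ctmo.ReadIntervalTimeout         = 0;  // explicit, not leftover state
ctmo.ReadTotalTimeoutConstant    = 200;
ctmo.ReadTotalTimeoutMultiplier  = 0;
ctmo.WriteTotalTimeoutConstant   = 0;
ctmo.WriteTotalTimeoutMultiplier = 0;
if (!SetCommTimeouts(hComm, &ctmo)) {
    // handle GetLastError()
}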
The same applies to GetCommState. You never, ever, want to inherit port configuration leftover by some other application.
The BuildCommDCB function adjusts only those members of the DCB structure that are specifically affected by the lpDef parameter, with the following exceptions
That's really not what you want, you want a 100% predictable configuration in order to get consistent behavior. Do not use configuration that you found leftover from the previous user. Set it entirely yourself, starting with a zero-filled DCB.
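A sketch of the same idea for the DCB, building the question's "9600,O,8,1" configuration from a zero-filled structure (a sketch only; set every flow-control member your device relies on as well):

DCB dcb;
ZeroMemory(&dcb, sizeof(dcb));
dcb.DCBlength = sizeof(dcb);
dcb.BaudRate  = CBR_9600;
dcb.ByteSize  = 8;
dcb.Parity    = ODDPARITY;   // the "O" in "9600,O,8,1"
dcb.StopBits  = ONESTOPBIT;
dcb.fBinary   = TRUE;        // Windows requires binary mode
// ... also set fOutxCtsFlow, fRtsControl, fDtrControl, etc. explicitly
// rather than inheriting them.
if (!SetCommState(hComm, &dcb)) {
    // handle GetLastError()
}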
(Been dealing with serial ports for 50+ years. And supporting a company that builds equipment that is tested and connected by an RS232 serial port.)
First you need to know whether the serial port is actually working. I do that with an oscilloscope (I don't understand how someone can debug these systems without one), but you can get a "null modem" and set up a separate port or computer. I would recommend Teraterm, which will send and receive text (ASCII) characters. So set up two Teraterm terminals, one on the port in question and the other on another port connected through the null modem. Set both to the same communication settings (8 bit, 1 stop, no parity, for example) and the same rate (9600 baud, for example). Then hit characters on one terminal and see that they appear on the other, and vice versa. After you know your ports themselves work, move on to the serial library and program. If the port isn't working, that explains why the software no longer talks to the equipment.
Next, configure your program to send a simple ASCII message (e.g. copy the program and add some test code at the beginning of the copy). After that, wait for a single character and print out what is received, in a loop. You can kill the window to end the program; you don't need fancy programming to test. Then hit keys on the other terminal, just like the terminal-to-terminal test above. Be sure your program is configured with the same parameters as the Teraterm window connected through the null modem. You should see the string you send out, and then see the characters you hit on the keyboard received back by your program loop. That verifies that the basic serial library interface is working. If not, you can concentrate on where it goes wrong. For example, if it doesn't send, there is no point spending time debugging the wait times for receive. Once some detail is known, you can decide where to look next.
As to the discussion of why one might purge before each write -- this is because you want to start in a clean state before reading the next packet response. If something was in the input buffer before sending the packet, the response would not be associated with the packet that was sent. However, one must be careful about timing; for example, don't time out too soon and cancel the previous packet send or receive before it is completed. The only reasons for a timeout at that point are that the packet was not received by the equipment (so it doesn't respond), the response was sent but not received, or the equipment doesn't actually respond. You need to run down which of those issues is applicable, and the testing above will help verify the system as a whole before getting into those details.
(Note: I had misunderstood that the timeout occurred every time. Note my point about using an oscilloscope to observe; this might extend to writing a message-simulator program that responds to the test program over the null-modem connection, with reporting, so the exact transmit state at the time of the failure is known.)

Simultaneous read/write on the same serial port

I am building an application that intercepts a serial communication line by receiving the transmission, modifying the data, and echoing the changed result.
The transmitted data consists of status sentences at a high baud rate, with a lot of data.
I have created two threads: one reads the sentences and pushes a pointer to each new sentence into a queue, and the other pops the pointers out of the queue, manipulates them, sends them to the serial port, and deletes the pointer.
The queue operations are in external functions with CriticalSection locks, so that part works fine.
To make sure the queue doesn't overflow quickly, I need to send the messages quickly and not wait for the receiving to end.
To my understanding, serial ports can receive and transmit simultaneously, but trying to do so gives an access-restriction error.
The other solution is to split the system across two different ports, but I am trying to avoid that because of the hardware changes and the need for another USB port and converter.
I have read about OVERLAPPED structures but didn't fully understand their usage; as I understood it, they manage asynchronous operation, whereas my issue is parallel operation.
Sorry for my poor English; any help or explanation will help.
I used this class for the serial communication, setting overlapped to enabled when opening the COM port to allow wait-event timeouts:
http://www.codeproject.com/Articles/992/Serial-library-for-C
Thanks in advance.
Roman.
Clarification:
I'm not opening the port twice; I open it once in the main program and pass the handle to both threads (writing it out now, I suspect this approach is the source of the problem).
More details:
The error comes from the Cserial library:
"Cserial::read overlapped complete without result." Commenting out the send-back-to-serial command in the sending thread does not raise the error, and the queue fills and displays correctly.
I'm on a classified system without internet access, so I can't upload the sample; I'm writing from my tablet. The error occurs after I get the first sentence, which triggers the first send command as soon as the queue's size changes; then the receiving thread exits because the receive fails, so the queue stops filling and nothing is sent out.
Probably because both use the same serial handle -- but what is the alternative for accessing the same port simultaneously without locking one thread or the other?
Ignoring error 996, which is the error ID of "read overlapped completed without results", and not exiting the thread when it is detected makes both the received and the transmitted data wrong (missing bytes).
The bottom line, after asking a lot of questions:
Why is a read operation interrupted by a write operation if these are two separate communication lines? Can I use two handles, one for each task, on the same port?
Are the D+/D- lines in USB dedicated transmit/receive lines, or are both used for transmit and receive?
"Cserial::read overlapped complete without result"
Are you preventing the read from being interrupted by the OS switching execution to the write thread? You need to protect this from happening by using a mutex or similar.
The real solution is to switch to an asynchronous library, such as boost::asio.
Why is a read operation interrupted by a write operation if these are two separate communication lines?
Here is a possible hand-waving visualization of what happens if you use synchronous operations in two threads without locking them against each other. (I am guessing at the details of how you arranged your software.)
1. Your app receives a read request from the port.
2. Your app requests the OS to start the read thread.
3. The OS agrees, and your read thread completes the read.
4. Your app does its processing.
5. Your app asks the OS to start the write thread.
6. The OS agrees, and your write thread starts a write.
7. A second read request arrives on the port. This does not interrupt anything; it just waits.
8. The write is not yet finished, but the OS decides that the write thread has had enough time. It decides to switch context to the read thread, which is waiting.
9. The read thread starts reading.
10. Again the OS decides that the running thread (read) has had a fair crack at the CPU. It switches context back to the write thread. This crashes the unfinished read. Note that this happens in your software, not in the hardware or the hardware driver.
This should give you a general insight into the sort of problems that occur unless you keep the OS from running the reads and writes over the top of each other. It is a matter of opinion whether it is better to use multithreading with mutexes (or equivalent) or asynchronous, event-driven designs.
Two threads can't operate on a single port / file descriptor. Depending on what library you used, you should try to do this asynchronously, or by checking how many bytes can be read/written without blocking the thread. (If it is a raw Linux file descriptor, you should look at poll / select.)
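For what it's worth, Windows does allow one thread to read while another writes on the same handle, provided the port is opened with FILE_FLAG_OVERLAPPED and each operation carries its own OVERLAPPED structure and event. A minimal sketch of that standard Win32 pattern (not Cserial code; error handling mostly omitted):

#include <windows.h>

// Open once, overlapped; share this one handle with both threads.
HANDLE hPort = CreateFileA("\\\\.\\COM2", GENERIC_READ | GENERIC_WRITE,
                           0, NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);

// Reader thread: its own OVERLAPPED and event.
DWORD WINAPI ReadLoop(LPVOID)
{
    OVERLAPPED ov = {0};
    ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
    char buf[256];
    for (;;) {
        DWORD got = 0;
        if (!ReadFile(hPort, buf, sizeof(buf), &got, &ov) &&
            GetLastError() == ERROR_IO_PENDING)
            GetOverlappedResult(hPort, &ov, &got, TRUE);  // wait for completion
        // ... push 'got' bytes into the queue
    }
}

// Writer thread: a separate OVERLAPPED and event, same handle.
void Send(const char* data, DWORD len)
{
    OVERLAPPED ov = {0};
    ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
    DWORD sent = 0;
    if (!WriteFile(hPort, data, len, &sent, &ov) &&
        GetLastError() == ERROR_IO_PENDING)
        GetOverlappedResult(hPort, &ov, &sent, TRUE);
    CloseHandle(ov.hEvent);
}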

ctb::SerialPort - time-out in Write()

I'm writing a program that should control a piece of scientific hardware over a COM port. The program itself is written in wxWidgets and uses the ctb library. To test it before I connect it to the 300 k€ equipment, I use com0com (a null-modem emulator) to forward the COM2 port. To emulate my hardware I use wxTerminal (COM3). Altogether it works nicely: one can not only debug in VS or GDB but also see the whole data transfer in wxTerminal.
Now to my problem. To send data to the COM port, I use the ctb::SerialPort::Write() function.
device->Write( (char*)line.c_str(), line.size() );
However, if I disconnect the connection on the wxTerminal side (i.e. COM2 -> NULL), the program hangs in this function.
It's obvious that I should add some function to test whether my equipment is still there, but to do that I need to send a data packet to it and expect some answer. So I'm back to Write().
"Just in case" I've also tried ctb::IOBase::Writev (char ∗ buf, size_t len, unsigned int timeout_in_ms) with timeout set to 100ms and I've still got program hanging in the same line. It's actually expected behavior as in this case timeout means only that the connection line is blocked till whole buffer is transferred or timeout is reached.
Connecting wxTerminal to COM3 un-freezes the debugger or the stand-alone program. The sun is shining, the birds are singing.
Can somebody give me a hint on how to overcome my problem? I'd appreciate it if comments were restricted to the wxWidgets world -- I really do not want to re-write the whole program with another toolkit.
If your COM port library does not provide effective timeouts on blocking writes (presumably because of hardware flow control), you could implement your own by threading off the write. You could use a couple of events/semaphores/condvars/whatever: one to signal to the thread that there is something in a buffer to send, and another that you can wait on with a timeout, signaled by the thread after it has sent the buffer. If the 'ack' wait times out, your COM port is stuck and you can pop up some 'Check cable' message box. I don't know what other calls your port library supports, so I don't know how you could implement flushes/retries.
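A sketch of that scheme, assuming ctb::SerialPort::Write() is the blocking call (the event names, buffer, and TimedWrite helper are hypothetical; the events are created elsewhere with CreateEvent(NULL, FALSE, FALSE, NULL)):

#include <windows.h>
#include <cstring>
// assumes the ctb headers are included for ctb::SerialPort

HANDLE gSendReq;            // auto-reset event: "buffer ready to send"
HANDLE gSendAck;            // auto-reset event: "buffer has been sent"
char   gBuf[512];
size_t gLen;
ctb::SerialPort* gDevice;   // the port whose Write() may block

DWORD WINAPI WriterThread(LPVOID)
{
    for (;;) {
        WaitForSingleObject(gSendReq, INFINITE);
        gDevice->Write(gBuf, gLen);  // may block on flow control
        SetEvent(gSendAck);
    }
}

// Returns false if the write did not complete within timeoutMs.
bool TimedWrite(const char* data, size_t len, DWORD timeoutMs)
{
    memcpy(gBuf, data, len);
    gLen = len;
    SetEvent(gSendReq);
    if (WaitForSingleObject(gSendAck, timeoutMs) == WAIT_TIMEOUT)
        return false;  // port stuck: time for the 'Check cable' box
    return true;
}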

Windows WriteFile problem when using threads

My company is developing hardware that needs to communicate with software. To do this, we have made a driver that enables writing to and reading from the hardware. To access the driver, we use the command:
HANDLE device = CreateFile(DEVICE_NAME,
                           GENERIC_READ | GENERIC_WRITE,
                           0x00000007,
                           &sec,
                           OPEN_EXISTING,
                           0,
                           NULL);
Reading and writing is done using the functions:
WriteFile(device,&package,package.datasize,&bytesWritten,NULL);
and
ReadFile(device,returndata,returndatasize,&bytesRead,NULL);
And finally, CloseHandle(device), to close the file.
This works just fine when the functions are called from the main thread. If they are called from some other thread, we get error 998 (ERROR_NOACCESS) when trying to write more than a couple of elements. The threads are created using
CreateThread(NULL, 0, thread_func, NULL, 0, &thread_id);
I'm running out of ideas here, any suggestions?
edit:
When running the following sequence:
Main_thread:
    CreateFile
    Write
    Close
    CreateThread
    WaitForThread
Thread_B:
    CreateFile
    Write
    Close
Main_Thread succeeds and Thread_B does not. However, when writing small sets of data, this works fine. Might this be because Thread_B does not inherit all of Main_Thread's access privileges?
edit2:
a lot of good thinking going on here, much appreciated! After some work on this problem, the following seems to be the case:
The API contains a queue thread handling all packages going to and from the device. This thread handles pointers to package objects. When a pointer reaches the front of the queue, a "send_and_get" function is called. If the arrays in the package are allocated in the same thread that calls the "send_and_get" function, everything works fine. If the arrays are allocated in some other thread, sending fails. How to fix this, though, I don't know.
According to winerror, Win32 error 998 is one of the following native status values (which would be returned by the O/S or the driver):
998 ERROR_NOACCESS <--> 0x80000002 STATUS_DATATYPE_MISALIGNMENT
998 ERROR_NOACCESS <--> 0xc0000005 STATUS_ACCESS_VIOLATION
998 ERROR_NOACCESS <--> 0xc00002c5 STATUS_DATATYPE_MISALIGNMENT_ERROR
Access violation might be a likely candidate, based on your saying "when trying to Write more than a couple of elements." Are you sure the buffer that you're sending is large enough?
The alignment errors are fairly exotic, but might be relevant if the device has some alignment requirements and the developer chose to use these particular errors.
-scott
Still sounds to me like it's concurrent access.
Your separate threads writing to this device will need to properly protect access to the file using a mutex or similar. Either open the handle in the main thread and leave it open, or protect the whole Open -> Write -> Close sequence that can occur in each thread (with a mutex).
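A minimal sketch of the second option (Package, DEVICE_NAME and sec are from the question; the mutex is the addition):

#include <windows.h>

HANDLE gDevMutex = CreateMutex(NULL, FALSE, NULL);  // shared by all threads

bool WritePacket(const Package& package)  // Package: the question's packet type
{
    WaitForSingleObject(gDevMutex, INFINITE);
    HANDLE device = CreateFile(DEVICE_NAME, GENERIC_READ | GENERIC_WRITE,
                               0x00000007, &sec, OPEN_EXISTING, 0, NULL);
    DWORD bytesWritten = 0;
    bool ok = (device != INVALID_HANDLE_VALUE) &&
              WriteFile(device, &package, package.datasize,
                        &bytesWritten, NULL) != 0;
    if (device != INVALID_HANDLE_VALUE)
        CloseHandle(device);
    ReleaseMutex(gDevMutex);
    return ok;
}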
As a debugging measure, since it's your own driver, you could get the driver to log the requests it is receiving, e.g., into the event log. Set up two test runs which are identical except that one runs all the code in the main thread and the other runs all the code in a second thread. Comparing the results should give you a better insight into what is happening.
It would also be a good idea to get your driver to report any error codes that it is returning to the operating system.
The first thing that you should check is whether the error (998) is reported by your driver or by the kernel-mode I/O manager (which is responsible for initiating the IRP and calling your driver), i.e. even before the request reaches your driver. You should be able to discover this, since it is your driver. Just log the calls to the driver's Dispatch routine, what it returns, and what it does (does it call other drivers, call IoCompleteRequest with an error code, etc.), and things should become clear.
From the scenario you describe, it seems most likely that the error is caused by your driver. For instance, your driver may allocate some global state structure in response to CreateFile (the driver's IRP_MJ_CREATE) and purge it when the file is closed. Such a driver won't function correctly if two files are opened simultaneously and then one is closed while the second still receives I/O requests.

GetQueuedCompletionStatus stops reading a serial port

I have a device that generates messages over a serial port. When I reboot the device, the IO Completion Port stops reading bytes.
The code calls GetQueuedCompletionStatus():
BOOL bRet = GetQueuedCompletionStatus(
    m_hCompletionPort,
    &dwBytesTransferred,
    &dwCompletionKey,
    &pOverlapped,
    INFINITE);
PortMon looks like:
...
IRP_MJ_WRITE Serial1 SUCCESS LENGTH: 7 REBOOT.
IRP_MJ_READ Serial1 CANCELLED LENGTH: 1
Logging shows the following result:
bRet=true, dwBytesTransferred=7, pOverlapped=0x0202B028, GetLastError()=997
(sleep forever)
Is there any way to detect this failure and reestablish communications?
I can monitor a heartbeat and close/reopen the serial port, but it doesn't seem right that the Windows API allows serial communications to silently drop like this.
If you do WaitForSingleObject on the handle for the serial port that you opened to start reading data, does the handle become signalled when the device is rebooted? Maybe this is a way to tell when you need to open the port again?
IO Completion Ports can certainly handle this case without problem. You don't need to close and reopen the device.
The most likely problem in this case is that you have an error on the line (caused by the device reset) that you have not cleared using ClearCommError().
You need to use SetCommState() and SetCommTimeouts() appropriately for your device up front. In the DCB you pass to SetCommState(), you need to set fAbortOnError. If you do dequeue an error, you need to call ClearCommError() before you queue another read.
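A sketch of that sequence as I read it (hPort and the surrounding error handling are my assumptions):

// Up front: enable abort-on-error in the DCB.
DCB dcb = {0};
dcb.DCBlength = sizeof(dcb);
GetCommState(hPort, &dcb);
dcb.fAbortOnError = TRUE;
SetCommState(hPort, &dcb);

// After a failed or aborted operation: clear the error state
// before queueing the next read.
DWORD commErrors = 0;
COMSTAT stat = {0};
if (ClearCommError(hPort, &commErrors, &stat)) {
    // commErrors may contain CE_FRAME, CE_OVERRUN, CE_RXPARITY, ...
    // now it is safe to issue the next ReadFile
}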
RE: janm (I can't seem to add a comment to your answer, sorry)
I did try setting various flags, including the DCB's fAbortOnError, but GetQueuedCompletionStatus() would still wait infinitely. I also tried periodically timing out the call and checking the serial port for errors. The serial port always looked fine, yet the disconnection would still permanently break the IO completion port. The device rebooting probably creates a transient error state... I say probably, because I've never been able to detect it!
A fellow developer also had a crack at this problem, and they too failed. So we just rewrote the code to use overlapped serial port reads, and now it works fine.
There is probably something, somewhere that we missed... in the end we wasted more time trying to solve the mystery than it took to rewrite the code.