I am in the unenviable position of having to debug code that was written by someone 10+ years ago who no longer works at the company.
The premise is fairly simple: this is a Windows based test tool that is intended to communicate with an external device that our company builds. The communication is over RS-232 using a Windows COM port via a USB-to-Serial converter. The communication is a simple request/response scheme. The program runs a continuous loop of successive WriteFile() and ReadFile() calls to communicate with the external device. WriteFile to send a command, followed by ReadFile to read the response.
All works well initially, but after some period of time (roughly 10 minutes - although I haven't confirmed that it's always consistent), the ReadFile call stops working - as in, it times out and returns 0 characters every single time after the initial failure. Since I have the ability to debug the external device simultaneously, obviously the first thing I did was to check if the failure was there, but I have confirmed that even after the ReadFile call stops working, the external device still correctly receives the commands sent via the WriteFile call and responds on the same COM port.
// Flush buffer
PurgeComm(hComm, PURGE_RXABORT|PURGE_RXCLEAR|PURGE_TXABORT|PURGE_TXCLEAR);
// Send command
WriteFile(hComm, dataOut_ptr, write_size, &dwBytesWritten, NULL);
//...
// Read Response
ReadFile(hComm, dataIn_ptr, read_size, &dwBytesRead, NULL);
//This sequence works for a while
//At a certain point, the ReadFile call times out and dwBytesRead is 0
//After that point, every call to ReadFile times out in the same way
//WriteFile still works fine and I know that the external device is still responding on the same UART channel
If I close and re-open the COM port after the timeout as shown below, nothing changes.
//This is the code inside the COM close function
PurgeComm(hComm, PURGE_RXABORT);
CloseHandle(hComm);
//...
//This is the COM open code that gets called in a separate function:
hComm = CreateFile( name,
GENERIC_READ | GENERIC_WRITE,
0,
0,
OPEN_EXISTING,
0,
0);
GetCommTimeouts(hComm,&ctmoOld);
ctmoNew.ReadTotalTimeoutConstant = 200;
ctmoNew.ReadTotalTimeoutMultiplier = 0;
ctmoNew.WriteTotalTimeoutMultiplier = 0;
ctmoNew.WriteTotalTimeoutConstant = 0;
SetCommTimeouts(hComm, &ctmoNew);
dcbCommPort.DCBlength = sizeof(DCB);
GetCommState(hComm, &dcbCommPort);
BuildCommDCB("9600,O,8,1", &dcbCommPort);
SetCommState(hComm, &dcbCommPort);
However, if I set a break-point on the external device just before it responds, close the test program and open the COM port in a serial terminal like RealTerm then let the external device proceed, the data comes in fine. At the same time, if I kill and restart the test program entirely, it will also work again for a period of time before again experiencing the same timeout issue.
I have tried playing with the Rx timeout, as well as inserting an additional delay between the WriteFile and ReadFile calls with no success.
I don't get it. Based on this behaviour I don't suspect the Windows USB-to-Serial driver that's being used and feel like there is something going wrong specifically with the use of ReadFile in the test program.
Is there a possibility that the buffer is not being flushed properly and simply stops working because it overflows? Are there known issues with the ReadFile or PurgeComm functions on Windows 10? This is a legacy program that normally runs on a Windows XP machine without issue. I'm having to run it on Windows 10 because I'm using it to test an upgrade of the external device and that's the PC I have.
Edit: To clarify, the "failed" call to ReadFile still returns 1 (so calling GetLastError() is not relevant here), just the number of characters read is 0
Edit 2: Some more details about the communication being attempted...
The Purge-WriteFile-ReadFile sequence alternates between 2 types of commands (same sequence for both commands):
a 'write' command, in which a 134 byte packet (128 byte payload + 6 bytes overhead) is sent to the external device, to which the device responds with a 4-byte 'ok' or 'not ok' handshake
a 'read' command, which is a 6 byte packet with the ID of the data to be read-back (specifically the data that was just written), to which the device responds with a 130 byte (128 bytes data + 2 bytes overhead) response
The timeout always initially occurs during the 'read' command. So the ReadFile call is expecting a length of 130 bytes. After that, the ReadFile call during the 'write' command (where expected bytes read is 4) also times out.
This time noting that the OP's system tends to work some of the time, verifying basic communication, there are some interesting points and questions. (And for some reason I can't "comment" and must post any questions using an "answer".)
One interesting feature is that the re-open uses 0 for both WriteTotalTimeoutConstant and WriteTotalTimeoutMultiplier. I can't tell if this is the initial condition as well, or only the "reopen" state after first fail. We normally use MAXDWORD value for WriteTotalTimeoutConstant. The apparent effect is that the program may not be waiting for the write when going to read.
And 200 ms is a very short timeout on read, so if the read doesn't occur within 200 ms of the initiation of the write, then the read times out. The transmission of the packet at 9600 baud will take at least 130 ms of that 200 ms timeout (closer to 150 ms for the 130-byte response if each byte occupies 11 bit times, as with 8 data bits, odd parity and 1 stop bit), so any delay in the (unreliable) operating system write might mean that the data was still being transmitted when the read times out.
I would certainly experiment with MAXDWORD in WriteTotalTimeoutConstant and a much longer read timeout. Remember that the system won't actually wait for the timeout if it receives a full "readsize" packet, but I can't tell whether that is set to the exact packet size or whether the code depends on the timeout to tell when the receive is over (thus usually wasting 200 ms). Also, if you are depending on the timeout to recognize when the device has finished responding (that is, reading more than the size of the actual response packet), then I would look at using the inter-byte timeouts as well, but that is a more complex topic.
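To put rough numbers on the timing, here is a small sketch in C. The function names are made up for illustration; the only inputs are the packet sizes and the "9600,O,8,1" setting quoted in the question.

```c
#include <assert.h>

/* Bits on the wire per byte: 1 start bit + data bits + optional parity + stop bits. */
static unsigned bits_per_frame(unsigned data_bits, int has_parity, unsigned stop_bits)
{
    return 1u + data_bits + (has_parity ? 1u : 0u) + stop_bits;
}

/* Milliseconds needed to shift nbytes through the UART at the given baud rate. */
static double wire_time_ms(unsigned nbytes, unsigned baud, unsigned frame_bits)
{
    assert(baud > 0);
    return 1000.0 * (double)nbytes * (double)frame_bits / (double)baud;
}
```

With 8 data bits, odd parity and 1 stop bit, bits_per_frame(8, 1, 1) is 11, so wire_time_ms(130, 9600, 11) comes to roughly 149 ms for the 130-byte response and wire_time_ms(134, 9600, 11) to roughly 154 ms for the 134-byte 'write' packet. Either one uses up most of a 200 ms budget before any OS or USB-adapter latency is counted.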
Docs on write timeouts:
WriteTotalTimeoutMultiplier
The multiplier used to calculate the total time-out period for write operations, in milliseconds. For each write operation, this value is multiplied by the number of bytes to be written.
WriteTotalTimeoutConstant
A constant used to calculate the total time-out period for write operations, in milliseconds. For each write operation, this value is added to the product of the WriteTotalTimeoutMultiplier member and the number of bytes to be written.
A value of zero for both the WriteTotalTimeoutMultiplier and WriteTotalTimeoutConstant members indicates that total time-outs are not used for write operations.
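The rule quoted above can be written out as a tiny helper. This is only a sketch of the documented arithmetic, not a Win32 call; the function name and the NO_TOTAL_TIMEOUT sentinel are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sentinel meaning "total time-outs are not used". */
#define NO_TOTAL_TIMEOUT UINT64_MAX

/* Total write time-out in ms, per the documented formula:
   multiplier * bytes_to_write + constant, with 0/0 meaning "no total time-out". */
static uint64_t write_total_timeout_ms(uint32_t multiplier, uint32_t constant,
                                       uint32_t nbytes)
{
    if (multiplier == 0 && constant == 0)
        return NO_TOTAL_TIMEOUT;
    return (uint64_t)multiplier * nbytes + constant;
}

/* Small self-check using the 134-byte packet from the question. */
static void write_timeout_examples(void)
{
    assert(write_total_timeout_ms(0, 0, 134) == NO_TOTAL_TIMEOUT); /* OP's settings */
    assert(write_total_timeout_ms(2, 100, 134) == 368);            /* 2*134 + 100 */
}
```

So the 0/0 pair used in the re-open code does not mean "time out immediately"; per the docs it means no total write time-out is applied at all.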
In addition I would check on the success of the write operation to make sure there was no problem, before starting the read as well, as some form of failure could lock up writing. Also note that nothing is given as to hardware or software flow control, so investigate possible reasons for write to not finish in the time expected.
Also note that the "system" (as a whole) might be in an unsynchronized state after a random failure. This is because a portion of the transmit block is terminated by the flushing (PurgeComm) operation, then restarted. (Specifically because the write operation is potentially asynchronous in the above code and doesn't wait for end of the write, a subsequent flush kills the write block before it is finished.) The device under test must have a way to know that the partial (aborted) packet has been restarted, and some delays to allow for the device under test to resynchronize should be implemented on any failure condition.
OH, and I am suspicious of PurgeComm problems that are not recognized -- having removed all flushing operations from my code because of random issues not unlike those of the OP (opening post). I would look into not using flush, controlling exactly what is flushed (flags to PurgeComm), and only flushing after a failure and once upon system initialization. Also implementing significant delays upon any failure occurrence to let external systems settle. I also have changed to using a read function with timeout to flush input, rather than using some equivalent of flushing, because I was having problems when I did that (but can't explain why). I am especially suspicious of flushing (purge) everything including any ongoing transmission because the device under test may only receive a partial packet and thus that end needs a recovery mechanism.
I am also suspicious of flushing everything (read and write) before every test. Granted, if serial cables are removed and reconnected before a test, random characters will appear in the input buffer, and if there is any flow control on the output, that might be held up as well. So there are some reasons to flush all that. But now imagine that some error occurs and a partial, corrupted packet is sent out (I will get back to that). The device under test sees a partial packet, then perhaps a complete packet spliced onto that partial one. What does it do? Is there a delay of over 200 ms? Assume that in this circumstance there is. The program times out after 200 ms, loops, and sends another packet. The device under test then receives another packet while it is still handling or responding to the last one. Meanwhile the test program has flushed any response that may have been underway, because it looped and flushed both input and output. The cycle repeats every 200 ms, and the behaviour described in the OP is exactly what results. When the program ran on an old, slow XP machine, there were much greater delays and perhaps the packets were not going out every 200 ms, but a modern multi-gigahertz, multi-threaded computer can write (see above, without waiting for the write to finish) and start a new cycle every 200 ms, which could be as fast as 5 times a second.
How do you know the device under test is "still responding"? If you break into the device with a debugger, that breaks the loop and changes the system, so the program may then receive a complete packet; that is not the same as the device responding correctly during the possible "looping" described above. I suggest a scope, and/or special code in the device under test to report its activity, or a device simulator hooked up through a null modem. There are even LED serial monitors that can show whether the device under test is sending a packet back each time. These will give more clues, but it is still possible that the flush "ate" the response due to timing, so adding time delays to the program during such testing, to see whether a packet response arrives, may be useful.
(Yeah, kind of obsessive here--but working on converted 25 year old Linux serial port code right now and having similar issues!)
PS: The issue of potential weird behavior of PurgeComm:
My library uses these:
void
Serial::SerialImpl::flushInput ()
{
  if (is_open_ == false) {
    throw PortNotOpenedException("Serial::flushInput");
  }
  // Discard anything waiting in the receive buffer
  PurgeComm(fd_, PURGE_RXCLEAR);
}

void
Serial::SerialImpl::flushOutput ()
{
  if (is_open_ == false) {
    throw PortNotOpenedException("Serial::flushOutput");
  }
  // Discard anything still queued in the transmit buffer
  PurgeComm(fd_, PURGE_TXCLEAR);
}
I stopped using either of the "flush" options that call PurgeComm. I don't remember the exact problems, but they remind me of those described here, by which I mean complete failure to do serial transactions after an unexplained intermittent failure. If all else fails, I would figure out a way to skip calling it. For example, a read "flush" can be done by reading one character with a very short timeout in a loop and stopping when the read times out. This costs one timeout period, once, at the point where the read finally times out and the input is known to be flushed, so it does not add much delay. (There may even be a zero-delay option for this.) Combine that with waiting until most of the write is done (rather than a 0 timeout, see above) and checking for write failures.
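A rough sketch of that read-until-timeout "flush" in C follows. To keep it independent of any particular serial library, the port is hidden behind a hypothetical read_byte callback; on Windows that callback would presumably wrap a one-byte ReadFile on a handle configured with a short read timeout. The fake_port type exists only so the sketch can be exercised without hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: try to read one byte with a short timeout.
   Returns 1 if a byte was stored in *out, 0 if the read timed out. */
typedef int (*read_byte_fn)(void *ctx, unsigned char *out);

/* Discard pending input by reading until a read times out.
   max_bytes bounds the loop so a device that never stops talking cannot hang us. */
static size_t drain_input(read_byte_fn read_byte, void *ctx, size_t max_bytes)
{
    size_t discarded = 0;
    unsigned char junk;

    assert(read_byte != NULL);
    while (discarded < max_bytes && read_byte(ctx, &junk) == 1)
        discarded++;
    return discarded;
}

/* Stand-in for a real port: 'pending' stale bytes are waiting, then reads time out. */
struct fake_port { size_t pending; };

static int fake_read_byte(void *ctx, unsigned char *out)
{
    struct fake_port *p = (struct fake_port *)ctx;
    if (p->pending == 0)
        return 0;           /* simulated timeout */
    p->pending--;
    *out = 0xFF;            /* simulated stale byte */
    return 1;
}
```

The only delay added is the single timeout at the end, when the buffer is finally empty. The max_bytes cap is a design choice of this sketch, not something from the library quoted above.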
Also read the post on GetCommTimeouts in this thread--very applicable to weird problems like in the OP, and spot on.
GetCommTimeouts is a debugging function. You should never use it in production code unless you want a program that randomly fails depending on what arbitrary configuration is leftover on the port from the previous application that opened it.
Instead of calling GetCommTimeouts, start with a zero-filled COMMTIMEOUTS structure and then set every documented member explicitly. Currently you're leaving one unchanged, ReadIntervalTimeout, which is potentially highly relevant. Do not allow your code to inherit the previous configuration of ReadIntervalTimeout. Set it explicitly to the value you want.
The same applies to GetCommState. You never, ever, want to inherit port configuration leftover by some other application.
The BuildCommDCB function adjusts only those members of the DCB structure that are specifically affected by the lpDef parameter, with the following exceptions
That's really not what you want, you want a 100% predictable configuration in order to get consistent behavior. Do not use configuration that you found leftover from the previous user. Set it entirely yourself, starting with a zero-filled DCB.
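As a sketch of the "start from zero, set everything yourself" approach for the timeouts: the member names below are the five documented COMMTIMEOUTS members, but the helper name make_timeouts and the choice of 0 for ReadIntervalTimeout are assumptions for illustration. Off Windows, a stand-in struct is used so the snippet still compiles.

```c
#include <assert.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Stand-in mirroring the documented COMMTIMEOUTS members, for non-Windows builds only. */
typedef unsigned long DWORD;
typedef struct {
    DWORD ReadIntervalTimeout;
    DWORD ReadTotalTimeoutMultiplier;
    DWORD ReadTotalTimeoutConstant;
    DWORD WriteTotalTimeoutMultiplier;
    DWORD WriteTotalTimeoutConstant;
} COMMTIMEOUTS;
#endif

/* Build a fully specified timeout configuration; nothing is inherited from the port. */
static COMMTIMEOUTS make_timeouts(DWORD read_total_ms)
{
    COMMTIMEOUTS t;

    assert(read_total_ms > 0);           /* with every member 0, reads would never time out */
    memset(&t, 0, sizeof t);             /* start from a known state */
    t.ReadIntervalTimeout         = 0;   /* assumption: no inter-byte timeout wanted */
    t.ReadTotalTimeoutMultiplier  = 0;
    t.ReadTotalTimeoutConstant    = read_total_ms;
    t.WriteTotalTimeoutMultiplier = 0;   /* same write settings as the question */
    t.WriteTotalTimeoutConstant   = 0;
    return t;
}
```

The result would be passed to SetCommTimeouts(hComm, &t) with no GetCommTimeouts call anywhere, so a ReadIntervalTimeout left behind by a previous application can never leak in.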
(Been dealing with serial ports for 50+ years. And supporting a company that builds equipment that is tested and connected by an RS232 serial port.)
First you need to know if the serial port is actually working. I do that with an oscilloscope (don't understand how someone can debug systems without that), but you can get a "null modem" and set up a separate port or computer. I would recommend Teraterm, which will send and receive text (ASCII) characters. So set up two Teraterm terminals, one on the port in question and the other on another port which you connect through a null modem. Set both to the same communication settings (8 bit, 1 stop, no parity for example) and the same rate (9600 baud for example). Then hit characters on one terminal and see that they appear on the other, and vice versa. After you know your ports themselves work, move on to the serial library and program. If the port isn't working, then that explains why the software no longer talks to the equipment.
Next configure your program to send a simple ASCII message. (e.g copy program and add some test code at the beginning in the copy). After that wait for a single character and print out what is received in a loop. You can kill the window to end the program, don't need fancy programming to test. So then hit keys on the other terminal, just like the terminal-terminal test above. Be sure your program is configured to same parameters as the Teraterm window that is connected through the null modem. You should see the string you send out, and then see characters you hit on the keyboard received back to your program loop. That verifies that the basic serial library interface is working. If not, you can concentrate on where it goes wrong. For example if doesn't send, no point in spending time debugging the wait times for receive. After some detail is known then can decide where to look next.
As to discussion of why one might purge before each write -- this is because you want to start in a clean state with reading the next packet response. If something was in the input buffer before sending the packet, the response would not be associated with the packet that was sent. However one must be careful about timing, for example don't time out and cancel the previous sending or receiving packet before it is completed, for example by timing out too soon. Only reason for timeout at that point is if the packet is not received by the equipment (thus it doesn't respond), the response (given) is not received, or the equipment doesn't actually respond. You need to run down which of those issues is applicable, and the testing above will help verify the system as whole before these details.
(Note, I had misunderstood that the timeout occurred every time. Note my point about using an oscilloscope to observe, and might extend to suggest writing a message simulator program to respond to the test program from null modem connection with reporting so know the exact transmit state at the time of the failure condition.)
I'm trying to capture a signal much like UART communication.
This specific signal is composed by:
1 start bit(low)
16 data bits
1 stop bit (high)
From testing, I figured out that the signal is about 8-9 μs per bit. This led me to believe that the baud rate is around 115.2 kbps (one bit at 115200 baud lasts about 8.68 μs).
My first idea was to try a "manual" approach, so I wrote a small C program, but I couldn't sample the signal with the proper timing.
From here, I decided to look for libraries that could do the job. I did try "termios" and "asio::serial_port" from boost, but those don't seem to be able to receive 16 bit characters.
Am I being naive trying to configure a 16 bit receiver?
Does a "16 bit UART" even make sense?
Thanks!
-nls
There's nothing fundamentally wrong with the idea of a UART which supports a 16-data-bit configuration, but I'm not aware of any which do. 8 or 9 is usually the limit.
If you're communicating with a device which only supports that configuration (what the heck kind of device is that?), your only real option is bit-banging, which would be best done by a MCU dedicated to the purpose. You are not going to get microsecond-accurate timing in user space on a multitasking operating system, no matter what libraries you bring to bear.
EDIT: Note that you could do this, more or less, with bit-banging from a dedicated kernel-space driver. But that would make the system nearly unusable. The whole reason UARTs exist is because the CPU has better things to do than poll a line every few microseconds.
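For what it's worth, the decoding half of bit-banging is simple once the samples exist; the hard part is capturing them with accurate timing, which is why an MCU or logic analyzer is the realistic source. Below is a sketch in C that decodes one frame (1 low start bit, 16 data bits, 1 high stop bit) from a buffer of already-captured line samples. The function names, the one-sample-per-array-element layout, and LSB-first bit order are all assumptions; the encoder exists only to produce test input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one frame from line samples (0 or 1) taken at spb samples per bit.
   Returns 0 and stores the word on success, -1 if no valid frame is found.
   Assumption: data bits arrive LSB first, as on a conventional UART. */
static int decode_frame16(const uint8_t *samples, size_t n, size_t spb, uint16_t *word)
{
    size_t start = 0, mid, bit;
    uint16_t value = 0;

    assert(spb >= 1);
    /* Look for the idle-high to low transition that begins the start bit. */
    while (start + 1 < n && !(samples[start] == 1 && samples[start + 1] == 0))
        start++;
    if (start + 1 >= n)
        return -1;
    start++;                               /* first low sample of the start bit */
    if (start + 18 * spb > n)
        return -1;                         /* not enough samples for 18 bit times */

    mid = start + spb / 2;                 /* sample in the middle of each bit */
    if (samples[mid] != 0)
        return -1;                         /* start bit was only a glitch */
    for (bit = 0; bit < 16; bit++)
        if (samples[mid + (bit + 1) * spb])
            value |= (uint16_t)(1u << bit);
    if (samples[mid + 17 * spb] != 1)
        return -1;                         /* framing error: stop bit not high */
    *word = value;
    return 0;
}

/* Test helper: one idle bit time, then start, 16 data bits, stop. Returns sample count. */
static size_t encode_frame16(uint16_t word, size_t spb, uint8_t *out, size_t cap)
{
    size_t n = 0, bit, s;

    if (cap < 19 * spb)
        return 0;
    for (s = 0; s < spb; s++) out[n++] = 1;
    for (s = 0; s < spb; s++) out[n++] = 0;
    for (bit = 0; bit < 16; bit++)
        for (s = 0; s < spb; s++) out[n++] = (uint8_t)((word >> bit) & 1u);
    for (s = 0; s < spb; s++) out[n++] = 1;
    return n;
}
```

Sampling in the middle of each bit is the usual way to tolerate small clock differences between sender and receiver; it does nothing for the scheduling jitter described above, which has to be solved at capture time.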
I am trying to understand why I get CE_FRAME error in a serial communication. The documentation reads:
The hardware detected a framing error. Returned when the SERIAL_LSR_FE bit is detected in the LSR hardware register.
This is the framing error indicator. It is set whenever the hardware detects that the incoming serial data unit does not have a valid stop bit. This bit is cleared by reading this register:
#define SERIAL_LSR_FE 0x08
But I don't really know what shall I do with this valid stop bit. Can I just ignore this?
I have no other issues with the communication. Every packet of data (sent by the device) is being captured on the PC. On the PC I am using ClearCommError() to collect statistics on the channel, and from time to time I get this CE_FRAME flag set.
I am not sure if I have to provide details about the CreateFile() and SetCommState() function calls in my code, as there are nothing 'special' about them. But if needed, I can.
If you are programming on Windows then the application programmer does not set start and stop bits, the 'system' takes care of applying the start/stop bits as well as possible parity bits, baud rate and even some other settings. The critical ones are baud rate, start and stop bits and parity bits.
The system being the hardware or operating system. I think it is the UART chip which adds the start and stop bits. But you need to set the actual configuration to use in software.
What you do have to do is set the start and stop bits the same on both ends. So if you are communicating with a device which uses 1 start bit and 2 stop bits, then you have to set this same setting for your communication end.
You are likely to get framing errors if these settings are NOT set the same on both ends of the communication. I have seen framing errors where for example I set the baud rate 1200 one end but 9600 the other end. Actually my start and stop bits were correctly set both ends. So it may well be something simple like that.
I have an old RM-1501 digital tachometer which I'm using to try to identify the speed of an object.
According to the manual I should be able to read the data over a serial link. Unfortunately, I don't appear to be able to get any sensible output from the device (never gives a valid speed). I think it might be a signalling problem because disconnecting the CTS line starts to get some data through..
Has anyone ever developed anything for one of these / had any success?
The manual does not specify that flow control is used. Open the port with hardware/software flow control disabled.
The manual does not specify the connection - whether it is DTE<->DCE or Null Modem; are you using the cable supplied with the device?
I don't know if this info is still useful, but I tried even parity and got the data. I think the protocol in the document is incorrect (at least for the version I am using now). It is a 5 character display (9999), so we only need 3 bytes to get the required information; the 4th byte should always be zero. Hence, with 0x0D as the starting byte, the following 6 bytes make up the entire packet, i.e., 0xD0 B1 B2 B3 D1 D2 D3. Bytes B1, B2 and B3 contain the divisor, status, units, functions and error flags, whereas the last three bytes (D1, D2, D3) are the data, with D1 as LSB and D3 as MSB. I would also like to add that maybe the manufacturer changed the firmware without changing the user manual :), so my version of the protocol might be true for some and wrong for others.
I tried every combination of hardware flow control (both enabled and disabled) I could think of, so I think it must be a hardware problem. Removing the CTS link between the PC and the device solved the issue.
Is it actually sending data to indicate speed, or is it providing make / break on one of the pins?
I'm trying to implement a protocol over serial port on a windows(xp) machine.
The problem is that message synchronization in the protocol is done via a gap in the messages, i.e., x millisecond gap between sent bytes signifies a new message.
Now, I don't know if it is even possible to accurately detect this gap.
I'm using win32/serport.h api to read in one of the many threads of our server. Data from the serial port gets buffered, so if there is enough (and there will be enough) latency in our software, I will get multiple messages from the port buffer in one sequence of reads.
Is there a way of reading from the serial port, so that I would detect gaps in when particular bytes were received?
If you want more control over a Windows serial port, you will have to write your own driver.
The problem I see is that Windows may be executing other tasks or programs (such as virus checking) which will cause timing issues with your application. Your application will not know when it has been swapped out for another application.
If possible, I suggest your program time stamp the end of the last message. When the next message arrives, another time stamp is taken. The difference between time stamps may help in detecting new messages.
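A minimal sketch of the time-stamp idea in C, assuming each received byte has been tagged with a millisecond tick count at the moment the application read it (the function name and array layout are invented for this example):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split a byte stream into messages wherever the gap between consecutive
   arrival times is at least gap_ms. Stores the index of each message's first
   byte in starts[] (up to max_starts) and returns the number of messages found. */
static size_t split_on_gaps(const uint32_t *t_ms, size_t n, uint32_t gap_ms,
                            size_t *starts, size_t max_starts)
{
    size_t count = 0, i;

    assert(t_ms != NULL || n == 0);
    for (i = 0; i < n; i++) {
        /* Unsigned subtraction stays correct across tick-counter wraparound. */
        if (i == 0 || (uint32_t)(t_ms[i] - t_ms[i - 1]) >= gap_ms) {
            if (count < max_starts)
                starts[count] = i;
            count++;
        }
    }
    return count;
}
```

The catch is the one already raised in the question: these are the times the program saw the bytes, not the times they arrived on the wire. If the driver hands over several buffered messages in one read, they all get nearly the same stamp and the gap disappears, so this only helps when the reads keep up with the line.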
I highly suggest changing the protocol so that timing is not a factor.
I've had to do something similar in the past. Although the protocol in question did not use any delimiter bytes, it did have a crc and a few fixed value bytes at certain positions so I could speculatively decode the message to determine if it was a complete individual message.
It always amazes me when I encounter these protocols that have no context information in them.
Look for crc fields, length fields, type fields with a corresponding indication of the expected message length or any other fixed offset fields with predictable values that could help you determine when you have a single complete message.
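Here is a sketch of what speculative decoding can look like in C. The frame layout is purely hypothetical (one length byte, that many payload bytes, then a CRC-8 over the length and payload), and the CRC polynomial 0x07 is just a common choice; a real protocol's fields would replace both.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-8, polynomial 0x07, initial value 0 (an assumption for this sketch). */
static uint8_t crc8(const uint8_t *p, size_t n)
{
    uint8_t crc = 0;
    while (n--) {
        crc ^= *p++;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07) : (uint8_t)(crc << 1);
    }
    return crc;
}

/* Try every offset in buf as a possible frame start.
   Hypothetical frame: [len][len payload bytes][crc8 over len + payload].
   Returns the offset of the first candidate whose CRC checks out, or -1. */
static long find_frame(const uint8_t *buf, size_t n, size_t *frame_len)
{
    assert(buf != NULL && frame_len != NULL);
    for (size_t off = 0; off + 2 <= n; off++) {
        size_t len = buf[off];
        size_t total = 1 + len + 1;

        if (len == 0 || off + total > n)
            continue;                       /* implausible or incomplete candidate */
        if (crc8(buf + off, 1 + len) == buf[off + total - 1]) {
            *frame_len = total;
            return (long)off;
        }
    }
    return -1;
}
```

An 8-bit check will occasionally accept garbage (about 1 candidate in 256 by chance), so in practice it helps to combine it with the other fixed-value or type fields mentioned above before trusting a match.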
Another approach might be to use the CreateFile, ReadFile and WriteFile API functions. There are settings you can change using the SetCommTimeouts function that allows you to halt the i/o operation when a certain time gap is encountered.
Doing that along with some speculative decoding could be your best bet.
It sounds odd that there is no sort of data format delineating a "message" from the device. Every serial port device I've worked with has had some form of a header that described the data it transmitted.
Just throwing this out there, but could you use the Win32 Asynchronous ReadFileEx() and WriteFileEx() system calls? They allow you to attach a callback function, and then you might be able to manage a timer within the callback. The timer would only provide you a rough estimation, however.
If you need to write your own driver, the Windows Driver Kit has a sample that shows how to write a serial port driver. I can't imagine that you'll be able to override the Windows serial port bus driver(the driver that directly controls the serial port on your Windows machine), but you might be able to write a driver that sits on top of the bus driver.
I thought so. You all grew up with the web, I didn't, though I was present at the birth. Let me guess, the one byte is 1(SOH) or 2(STX)? IMVEO it is enough. You just need to think outside the box.
You receive message_delimiter followed by 4 (as length) and then 4 bytes of data. A valid message is not those 6 bytes.
message_delimiter - 1 byte
4 - length - 1 byte
(4 data bytes) - 4 bytes
A valid message is always bounded by the message_delimiter, so it would look like
message_delimiter - 1 byte
4 - length - 1 byte
(4 data bytes) - 4 bytes
message_delimiter - 1 byte
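In C, a check for that delimiter-bounded layout might look like the sketch below. The function name is invented and the delimiter value is passed in as a parameter, since the actual byte used by the protocol is not stated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Look for: delimiter, length byte, 'length' data bytes, delimiter.
   Returns the offset of the opening delimiter and stores the payload length,
   or returns -1 if no complete, properly bounded message is in the buffer. */
static long find_bounded_frame(const uint8_t *buf, size_t n, uint8_t delim,
                               size_t *payload_len)
{
    assert(buf != NULL && payload_len != NULL);
    for (size_t off = 0; off + 3 <= n; off++) {
        size_t len, end;

        if (buf[off] != delim)
            continue;
        len = buf[off + 1];
        end = off + 2 + len;            /* where the closing delimiter must sit */
        if (end < n && buf[end] == delim) {
            *payload_len = len;
            return (long)off;
        }
    }
    return -1;
}
```

With a delimiter of 0x02, the 7 bytes 02 04 AA BB CC DD 02 are accepted, while the first 6 of them on their own are not, which is the point made above: the message only counts once the trailing delimiter has shown up.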