I am in the unenviable position of having to debug code that was written by someone 10+ years ago who no longer works at the company.
The premise is fairly simple: this is a Windows-based test tool intended to communicate with an external device that our company builds. The communication is over RS-232, using a Windows COM port via a USB-to-Serial converter, with a simple request/response scheme. The program runs a continuous loop of successive WriteFile() and ReadFile() calls: WriteFile to send a command, followed by ReadFile to read the response.
All works well initially, but after some period of time (roughly 10 minutes - although I haven't confirmed that it's always consistent), the ReadFile call stops working - as in, it times out and returns 0 characters every single time after the initial failure. Since I have the ability to debug the external device simultaneously, obviously the first thing I did was to check if the failure was there, but I have confirmed that even after the ReadFile call stops working, the external device still correctly receives the commands sent via the WriteFile call and responds on the same COM port.
// Flush buffer
PurgeComm(hComm, PURGE_RXABORT|PURGE_RXCLEAR|PURGE_TXABORT|PURGE_TXCLEAR);
// Send command
WriteFile(hComm, dataOut_ptr, write_size, &dwBytesWritten, NULL);
//...
// Read Response
ReadFile(hComm, dataIn_ptr, read_size, &dwBytesRead, NULL);
//This sequence works for a while
//At a certain point, the ReadFile call times out and dwBytesRead is 0
//After that point, every call to ReadFile times out in the same way
//WriteFile still works fine and I know that the external device is still responding on the same UART channel
If I close and re-open the COM port after the timeout as shown below, nothing changes.
//This is the code inside the COM close function
PurgeComm(hComm, PURGE_RXABORT);
CloseHandle(hComm);
//...
//This is the COM open code that gets called in a separate function:
hComm = CreateFile(name,
                   GENERIC_READ | GENERIC_WRITE,
                   0,
                   0,
                   OPEN_EXISTING,
                   0,
                   0);
GetCommTimeouts(hComm,&ctmoOld);
ctmoNew.ReadTotalTimeoutConstant = 200;
ctmoNew.ReadTotalTimeoutMultiplier = 0;
ctmoNew.WriteTotalTimeoutMultiplier = 0;
ctmoNew.WriteTotalTimeoutConstant = 0;
SetCommTimeouts(hComm, &ctmoNew);
dcbCommPort.DCBlength = sizeof(DCB);
GetCommState(hComm, &dcbCommPort);
BuildCommDCB("9600,O,8,1", &dcbCommPort);
SetCommState(hComm, &dcbCommPort);
However, if I set a breakpoint on the external device just before it responds, close the test program, open the COM port in a serial terminal like RealTerm, and then let the external device proceed, the data comes in fine. Likewise, if I kill and restart the test program entirely, it will work again for a period of time before experiencing the same timeout issue.
I have tried playing with the Rx timeout, as well as inserting an additional delay between the WriteFile and ReadFile calls with no success.
I don't get it. Based on this behaviour, I don't suspect the Windows USB-to-Serial driver that's being used; it feels like something is going wrong specifically with the use of ReadFile in the test program.
Is there a possibility that the buffer is not being flushed properly and simply stops working because it overflows? Are there known issues with the ReadFile or PurgeComm functions on Windows 10? This is a legacy program that normally runs on a Windows XP machine without issue. I'm having to run it on Windows 10 because I'm using it to test an upgrade of the external device and that's the PC I have.
Edit: To clarify, the "failed" call to ReadFile still returns 1 (i.e. success, so calling GetLastError() is not relevant here); it's just that the number of characters read is 0
Edit 2: Some more details about the communication being attempted...
The Purge-WriteFile-ReadFile sequence alternates between 2 types of commands (same sequence for both commands):
a 'write' command, in which a 134-byte packet (128-byte payload + 6 bytes of overhead) is sent to the external device, to which the device responds with a 4-byte 'ok' or 'not ok' handshake
a 'read' command, a 6-byte packet with the ID of the data to be read back (specifically the data that was just written), to which the device responds with a 130-byte (128 bytes of data + 2 bytes of overhead) response
The timeout always initially occurs during the 'read' command. So the ReadFile call is expecting a length of 130 bytes. After that, the ReadFile call during the 'write' command (where expected bytes read is 4) also times out.
Noting that the OP's system tends to work for a while, which verifies basic communication, there are some interesting points and questions. (And for some reason I can't "comment" and must post any questions using an "answer".)
One interesting feature is that the re-open uses 0 for both WriteTotalTimeoutConstant and WriteTotalTimeoutMultiplier. I can't tell if this is the initial condition as well, or only the "reopen" state after the first failure. We normally use the MAXDWORD value for WriteTotalTimeoutConstant. The apparent effect is that the program may not be waiting for the write to complete before going on to the read.
And 200 ms is a very short read timeout: if the response doesn't arrive within 200 ms of initiating the write, the read times out. At 9600 baud with 11 bits per character (start + 8 data + parity + stop), the 130-byte response alone takes roughly 150 ms of that 200 ms budget, so any delay in the (unreliable) operating system write might mean that the data is still being transmitted when the read times out.
I would certainly experiment with MAXDWORD in WriteTotalTimeoutConstant, and with a much longer read timeout. Remember that the system won't actually wait out the timeout if it receives a full "readsize" packet, but I can't tell whether read_size is set to the exact packet size or whether the code depends on the timeout to tell when the receive is over (thus usually wasting the full 200 ms). Also, if you are depending on the timeout to recognize when the device has finished responding (that is, reading more than the size of the actual response packet), then I would look at using the inter-byte timeout (ReadIntervalTimeout) as well, but that is a more complex topic.
Docs on write timeouts:
WriteTotalTimeoutMultiplier
The multiplier used to calculate the total time-out period for write operations, in milliseconds. For each write operation, this value is multiplied by the number of bytes to be written.
WriteTotalTimeoutConstant
A constant used to calculate the total time-out period for write operations, in milliseconds. For each write operation, this value is added to the product of the WriteTotalTimeoutMultiplier member and the number of bytes to be written.
A value of zero for both the WriteTotalTimeoutMultiplier and WriteTotalTimeoutConstant members indicates that total time-outs are not used for write operations.
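A minimal sketch of that experiment (the specific values are starting points to tune, not known-good figures; hComm is the OP's handle):
COMMTIMEOUTS ctmo;
ZeroMemory(&ctmo, sizeof(ctmo));             // start from a known state
ctmo.ReadIntervalTimeout         = 50;       // inter-byte timeout, ms
ctmo.ReadTotalTimeoutMultiplier  = 2;        // ~2 ms/byte headroom at 9600 baud
ctmo.ReadTotalTimeoutConstant    = 500;      // much longer than the original 200 ms
ctmo.WriteTotalTimeoutMultiplier = 0;
ctmo.WriteTotalTimeoutConstant   = MAXDWORD; // effectively: wait for writes to finish
if (!SetCommTimeouts(hComm, &ctmo)) {
    // report GetLastError() and abort the transaction
}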
In addition, I would check that the write operation succeeded before starting the read, as some form of failure could lock up writing. Also note that nothing is given about hardware or software flow control, so investigate possible reasons for the write not finishing in the expected time.
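For example, a minimal check (a sketch using the OP's variable names):
DWORD dwBytesWritten = 0;
if (!WriteFile(hComm, dataOut_ptr, write_size, &dwBytesWritten, NULL) ||
    dwBytesWritten != write_size) {
    // The write failed or was cut short: report GetLastError(), delay,
    // and resynchronize with the device before attempting another cycle.
}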
Also note that the "system" (as a whole) might be in an unsynchronized state after a random failure, because a portion of the transmitted block is terminated by the flushing (PurgeComm) operation and then restarted. (Specifically, because the write in the code above is potentially asynchronous and the program doesn't wait for it to end, a subsequent flush kills the write block before it is finished.) The device under test must have a way to recognize that the partial (aborted) packet has been restarted, and some delays should be implemented on any failure condition to allow the device under test to resynchronize.
Oh, and I am suspicious of PurgeComm problems that go unrecognized -- I removed all flushing operations from my own code because of random issues not unlike the OP's. I would look into not flushing at all, controlling exactly what is flushed (the flags to PurgeComm), and flushing only after a failure and once at system initialization. Also implement significant delays on any failure to let the external systems settle. I also changed to flushing input with a read function with a timeout, rather than some equivalent of flushing, because I was having problems when I did it the other way (though I can't explain why). I am especially suspicious of purging everything, including any ongoing transmission, because the device under test may receive only a partial packet, so that end needs a recovery mechanism.
I am also suspicious of flushing everything (read and write) before every test. Granted, there are reasons to do it: if serial cables are removed and reconnected before a test, random characters can appear in the input buffer, and if there is any flow control on the output, pending data might be held up as well.

But now imagine that some error occurs and a partial, corrupted packet goes out. The device under test sees a partial packet, then perhaps a complete packet spliced onto it. What does it do? Is there a delay of over 200 ms? Assume there is. The program times out after 200 ms, loops, and sends another packet. The device under test then receives another packet while it is still handling or responding to the last one; meanwhile the test program, having looped and flushed both input and output, has discarded any response that may have been under way. The cycle repeats every 200 ms, and the symptom is exactly what the OP describes. When the program ran on an old, slow XP machine, the delays were much greater and perhaps multiple packets were not going out every 200 ms; a modern multi-gigahertz, multi-threaded computer can be writing (see above, without waiting for the write to finish) and starting a new cycle every 200 ms -- as often as five times a second.
How do you know the device under test is "still responding"? Breaking into the device with the debugger breaks the loop and changes the system's state, so a complete packet may then come through -- that's not the same as responding correctly during the possible "looping" above. I suggest a scope, special code on the device under test to report its activity, or a device simulator hooked up via a null modem. There are even LED serial monitors that can show whether the device under test is sending a packet back each time; that gives more clues, but it's still possible the flush "ate" the response due to timing, so integrating time delays into the program along with such testing, to see whether a packet response is given, may be useful.
(Yeah, kind of obsessive here--but working on converted 25 year old Linux serial port code right now and having similar issues!)
PS: The issue of potential weird behavior of PurgeComm:
My library uses these:
void
Serial::SerialImpl::flushInput ()
{
  if (is_open_ == false) {
    throw PortNotOpenedException("Serial::flushInput");
  }
  PurgeComm(fd_, PURGE_RXCLEAR);
}

void
Serial::SerialImpl::flushOutput ()
{
  if (is_open_ == false) {
    throw PortNotOpenedException("Serial::flushOutput");
  }
  PurgeComm(fd_, PURGE_TXCLEAR);
}
I stopped using either of the "flush" options that call PurgeComm. I don't remember the exact problems, but they remind me of those described here -- by which I mean complete failure to do serial transactions after an unexplained intermittent failure. If all else fails, I would figure out a way to skip those calls. For example, a read "flush" can be done by reading one character with a very short timeout, in a loop, stopping on timeout. This only incurs the timeout delay once (on the final read, by which point the input is flushed), so it does not add much delay; there may even be a zero-delay option for this. Combine it with making sure most of the write is done before reading (rather than a 0 write timeout -- see above) and with checking for write failures.
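A sketch of that read-based flush, using the Win32 names from this thread (it assumes the port's read timeout has already been set to a short value, e.g. 10 ms, via SetCommTimeouts):
// Drain the receive queue by reading single bytes until a timeout
// (dwRead == 0) or an error ends the loop.
void FlushInputByReading(HANDLE hComm)
{
    BYTE b;
    DWORD dwRead;
    do {
        if (!ReadFile(hComm, &b, 1, &dwRead, NULL))
            break;              // read error: let the caller decide what to do
    } while (dwRead == 1);      // 0 bytes means the short timeout expired
}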
Also read the post on GetCommTimeouts in this thread--very applicable to weird problems like in the OP, and spot on.
GetCommTimeouts is a debugging function. You should never use it in production code unless you want a program that randomly fails depending on what arbitrary configuration is leftover on the port from the previous application that opened it.
Instead of calling GetCommTimeouts, start with a zero-filled COMMTIMEOUTS structure and then set every documented member explicitly. Currently you're leaving one unchanged, ReadIntervalTimeout, which is potentially highly relevant. Do not allow your code to inherit the previous configuration of ReadIntervalTimeout. Set it explicitly to the value you want.
The same applies to GetCommState. You never, ever, want to inherit port configuration leftover by some other application.
The BuildCommDCB function adjusts only those members of the DCB structure that are specifically affected by the lpDef parameter, with the following exceptions
That's really not what you want, you want a 100% predictable configuration in order to get consistent behavior. Do not use configuration that you found leftover from the previous user. Set it entirely yourself, starting with a zero-filled DCB.
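For instance, a sketch that mirrors the OP's "9600,O,8,1" parameters but sets every member itself (the DTR/RTS choices here are assumptions to check against the actual wiring; the COMMTIMEOUTS structure deserves the same zero-fill treatment):
DCB dcb;
ZeroMemory(&dcb, sizeof(dcb));          // inherit nothing from the previous user
dcb.DCBlength = sizeof(DCB);
dcb.BaudRate  = CBR_9600;
dcb.ByteSize  = 8;
dcb.Parity    = ODDPARITY;
dcb.fParity   = TRUE;                   // actually check received parity
dcb.StopBits  = ONESTOPBIT;
dcb.fBinary   = TRUE;                   // Win32 requires binary mode
dcb.fOutxCtsFlow = FALSE;               // disable every flow-control option
dcb.fOutxDsrFlow = FALSE;               //   explicitly instead of inheriting it
dcb.fDsrSensitivity = FALSE;
dcb.fOutX = FALSE;
dcb.fInX  = FALSE;
dcb.fDtrControl = DTR_CONTROL_ENABLE;   // assumption: device expects DTR/RTS asserted
dcb.fRtsControl = RTS_CONTROL_ENABLE;
if (!SetCommState(hComm, &dcb)) {
    // report GetLastError() and fail the open
}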
(Been dealing with serial ports for 50+ years. And supporting a company that builds equipment that is tested and connected by an RS232 serial port.)
First you need to know whether the serial port is actually working. I do that with an oscilloscope (I don't understand how someone can debug these systems without one), but you can get a "null modem" and set up a separate port or computer. I would recommend Teraterm, which will send and receive text (ASCII) characters. So set up two Teraterm terminals, one on the port in question and the other on another port connected through the null modem, both with the same communication settings (8 bits, 1 stop, no parity, for example) and the same rate (9600 baud, for example). Then hit characters on one terminal and see that they appear on the other, and vice versa. Once you know your ports themselves work, move on to the serial library and program. If the port isn't working, that explains why the software no longer talks to the equipment.
Next, configure your program to send a simple ASCII message (e.g. copy the program and add some test code at the beginning of the copy). After sending, wait for single characters and print out what is received in a loop. You can kill the window to end the program; no fancy programming is needed for this test. Then hit keys on the other terminal, just like in the terminal-to-terminal test above. Be sure your program is configured with the same parameters as the Teraterm window connected through the null modem. You should see the string you send out, and then see the characters you type received back by your program loop. That verifies that the basic serial library interface is working. If not, you can concentrate on where it goes wrong -- for example, if it doesn't send, there is no point spending time debugging the receive wait times. Once some detail is known, you can decide where to look next.
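A minimal sketch of such a test program (synchronous Win32 I/O; the port name and settings are placeholders that must match the Teraterm side):
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE h = CreateFileA("\\\\.\\COM3", GENERIC_READ | GENERIC_WRITE,
                           0, NULL, OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE) { printf("open failed\n"); return 1; }

    DCB dcb = {0};
    dcb.DCBlength = sizeof(DCB);
    dcb.BaudRate  = CBR_9600;              // must match the other terminal
    dcb.ByteSize  = 8;
    dcb.Parity    = NOPARITY;
    dcb.StopBits  = ONESTOPBIT;
    dcb.fBinary   = TRUE;
    SetCommState(h, &dcb);

    COMMTIMEOUTS t = {0};
    t.ReadTotalTimeoutConstant = 1000;     // 1 s poll keeps the loop responsive
    SetCommTimeouts(h, &t);

    DWORD n;
    const char msg[] = "hello from the test program\r\n";
    WriteFile(h, msg, sizeof(msg) - 1, &n, NULL);

    for (;;) {                             // echo whatever the far end types
        char c;
        if (ReadFile(h, &c, 1, &n, NULL) && n == 1) {
            putchar(c);
            fflush(stdout);
        }
    }
}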
As to the discussion of why one might purge before each write: it's because you want to start reading the next packet response from a clean state. If something was in the input buffer before the packet was sent, the response would not be associated with the packet that was sent. However, one must be careful about timing -- for example, don't time out too soon and cancel the previous packet while it is still being sent or received. The only reasons for a timeout at that point are that the packet was not received by the equipment (so it doesn't respond), the response was given but not received, or the equipment didn't actually respond. You need to run down which of those applies, and the testing above will help verify the system as a whole before chasing these details.
(Note: I had misunderstood that the timeout occurred every time. See my point about using an oscilloscope to observe, which I might extend to suggest writing a message-simulator program, connected via null modem, that responds to the test program and reports what it sees, so you know the exact transmit state at the time of the failure condition.)
I am new to Qt and serial programming.
I am using Qt 5.3 on a RHEL 7 server.
I receive a packet of 75 bytes on the serial port. When I read it using the QSerialPort::readAll() function, only 8 bytes are read at a time; checking QSerialPort::bytesAvailable() likewise shows 8 bytes.
On Google I found that QSerialPort::readAll() can read 512 bytes in one go.
Am I missing something that needs to be specified explicitly?
Thank you in advance
That could be a matter of your specific serial port's FIFO buffer settings. You could try changing those system settings, for example in your Device Manager. However, I would suggest not relying on that, and instead connecting to the QSerialPort::readyRead() signal, reading the data, and putting it into a user-defined buffer.
QByteArray _MyBuffer;

// Slot connected to QSerialPort::readyRead(); _serialPort is your QSerialPort.
void MyReadHandler()
{
    _MyBuffer.append(_serialPort->readAll());
    if (_MyBuffer.size() >= 75) {
        // A complete packet has arrived; do something with your data...
    } else {
        // Wait for more data...
    }
}
In this slot you can wait for a specific amount of data before further processing it.
Partial reads are always possible with serial programming (and communications programming in general). Serial ports are also slow and have no concept of a "packet". The operating system will feed you characters as they arrive. There are ways of optimising this but that doesn't stop your code and protocol from having to deal with comms timing and failures.
You are likely to end up with code that uses a timer, builds up a packet gradually, checks it, and then processes it.
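A sketch of that pattern with QSerialPort, assuming the fixed 75-byte packets from the question (the class and member names are illustrative, not from any library):
#include <QObject>
#include <QSerialPort>
#include <QTimer>

// Accumulates bytes from readyRead(), emits complete 75-byte packets,
// and uses a single-shot timer to discard a stalled partial frame.
class PacketReader : public QObject
{
    Q_OBJECT
public:
    explicit PacketReader(QSerialPort *port, QObject *parent = nullptr)
        : QObject(parent), m_port(port)
    {
        connect(m_port, &QSerialPort::readyRead, this, &PacketReader::onReadyRead);
        m_timeout.setSingleShot(true);
        connect(&m_timeout, &QTimer::timeout, this, &PacketReader::onTimeout);
    }

signals:
    void packetReady(const QByteArray &packet);

private slots:
    void onReadyRead()
    {
        m_buffer.append(m_port->readAll());
        while (m_buffer.size() >= PacketSize) {   // buffer may hold several packets
            emit packetReady(m_buffer.left(PacketSize));
            m_buffer.remove(0, PacketSize);
        }
        if (!m_buffer.isEmpty())
            m_timeout.start(100);                 // partial frame: arm the timeout
        else
            m_timeout.stop();
    }

    void onTimeout() { m_buffer.clear(); }        // give up on a stalled frame

private:
    static const int PacketSize = 75;
    QSerialPort *m_port;
    QByteArray   m_buffer;
    QTimer       m_timeout;
};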
I have an AHRS (attitude heading reference system) that interfaces with my C++ application. I receive a 50 Hz stream of messages via Ethernet from the AHRS, and as part of this message I get UTC time. My system will also have NTPD running as the time server for our embedded network. The AHRS also has a 1PPS output that indicates the second rollover time for UTC. I would like to synchronize the NTPD time with the UTC. After some research, I have found that there are techniques that use a serial port as input for the 1PPS. From what I can find, these techniques use GPSD to read the 1PPS and communicate with NTPD to synchronize the system time. However, GPSD expects an NMEA-formatted message from a GPS, and I don't have that.
The way I see it now, I have a couple of possible approaches:
Don't use GPSD. Write a program that reads the 1PPS and the Ethernet messages containing UTC, and then somehow communicates this information to NTPD.
Use GPSD. Write a program that repackages the Ethernet message into something that can be sent to GPSD, and let it handle the interaction with NTPD.
Something else?
Any suggestions would be very much appreciated.
EDIT:
I apologize for this poorly constructed question.
My solution to this problem is as follows:
1 - interface 1PPS to RS232 port, which as it turns out is a standard approach that is handled by GPSD.
2 - write a custom C++ application to read the Ethernet messages containing UTC, and from that build an NMEA message containing the UTC.
3 - feed the NMEA message to GPSD, which in turn interfaces with NTPD to synchronize the GPS/1PPS information with system time.
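The fiddly part of step 2 is the NMEA framing: the checksum is the XOR of every character between the '$' and the '*'. A sketch of building a ZDA (UTC time and date) sentence; the field values would come from the AHRS Ethernet message:
#include <cstdio>
#include <string>

// Build an NMEA ZDA sentence: $GPZDA,hhmmss.ss,day,month,year,zone_h,zone_m*CS
std::string makeZdaSentence(int hh, int mm, int ss, int day, int month, int year)
{
    char body[64];
    std::snprintf(body, sizeof(body), "GPZDA,%02d%02d%02d.00,%02d,%02d,%04d,00,00",
                  hh, mm, ss, day, month, year);

    unsigned char csum = 0;
    for (const char *p = body; *p != '\0'; ++p)
        csum ^= static_cast<unsigned char>(*p);   // XOR of all chars after '$'

    char sentence[80];
    std::snprintf(sentence, sizeof(sentence), "$%s*%02X\r\n", body, csum);
    return sentence;
}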
I don't know why you would want to drive a PPS device with a signal that is delivered via Ethernet frames. Moreover, PPS does not work the way you seem to think it does: there is no timecode in a PPS signal, so you can't sync the time to it. The PPS signal is simply used to inform the computer of exactly how long a second is.
There are examples that show how a PPS signal can be read in using a serial port, e.g. by attaching it to an interrupt-capable pin -- that might be Ring Indicator (RI) or something else with comparable features. The problem I see there is that any code-driven servicing of an interrupt has its latencies and jitter. These are defined by your system design (and, if you write your own system-tailored interrupt handler routine, by that handler -- on a PC, even good old ISA-bus NMI handlers might show such effects).
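For reference, on Linux the blocking wait for such a modem-line edge can be done from user space with the TIOCMIWAIT ioctl. A sketch, assuming the 1PPS is wired to the RI pin of /dev/ttyS0 (the latency and jitter caveats above still apply):
#include <cstdio>
#include <ctime>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <termios.h>
#include <unistd.h>

int main()
{
    int fd = open("/dev/ttyS0", O_RDWR | O_NOCTTY);
    if (fd < 0) { perror("open"); return 1; }

    for (;;) {
        // Sleep in the kernel until the Ring Indicator line changes state.
        if (ioctl(fd, TIOCMIWAIT, TIOCM_RI) < 0) { perror("TIOCMIWAIT"); break; }
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);   // timestamp the edge as early as possible
        std::printf("PPS edge at %ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
    }
    close(fd);
    return 0;
}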
To my best understanding, people doing time sync on a "computer" use a true hardware timer-counter (with e.g. 64 bits) and a latch that is triggered to sample and hold the timer's value on every incoming 1PPS pulse. Folks already do this with PTP over Ethernet, with the small variation that a specific edge of the incoming data is used as the trigger; sender and receiver can then be synchronized by program logic that grabs the resulting value from the built-in PTP hardware latch.
see here: https://en.wikipedia.org/wiki/Precision_Time_Protocol
along with e.g. 802.1AS: http://www.ieee802.org/1/pages/802.1as.html
described on Wikipedia in the section "Related initiatives" as:
"IEEE 802.1AS-2011 is part of the IEEE Audio Video Bridging (AVB) group of standards, further extended by the IEEE 802.1 Time-Sensitive Networking (TSN) Task Group. It specifies a profile for use of IEEE 1588-2008 for time synchronization over a virtual bridged local area network (as defined by IEEE 802.1Q). In particular, 802.1AS defines how IEEE 802.3 (Ethernet), IEEE 802.11 (Wi-Fi), and MoCA can all be parts of the same PTP timing domain."
some article (in German): https://www.elektronikpraxis.vogel.de/ethernet-fuer-multimediadienste-im-automobil-a-157124/index4.html
and some presentation: http://www.ieee802.org/1/files/public/docs2008/as-kbstanton-8021AS-overview-for-dot11aa-1108.pdf
My rationale on your question:
Yes, it's possible, but it is a precision-limited design because of internal factors like the latency and jitter of the interrupt handler you are forced to use. The achievable overall precision, per pulse and over the long term, is hard to say, but it might range from some 10 ms at startup with a single pulse down to maybe (guessing) 0.1 ms. Doing it means proving it: long-term observations should help you unveil the true practical limits with your very specific computer and selected software environment.
I had no experience with Controller Area Networks (CAN) or the ValueCAN3 prior to this project, and I based my message reading on an example from Intrepid. However, I am having issues with the efficiency and frequency of the updates to my GUI, which displays the analog and digital signals I am reading.
My GUI consists of 16 numeric up/down boxes for analog channels and 36 buttons that change to green depending on whether a digital signal is turned on (1) or off (0). While reading in my CAN messages, I update the GUI controls to display the appropriate feedback. The digital channels respond almost instantly when I press a button on the CAN joystick that is plugged in, whereas the analog signals don't update that fast when I vary them with the string pots; sometimes it takes an analog signal 1-2 seconds to respond.
Currently I set the GUI controls, then read the values back from those controls and send them out over a socket connection to another application via UDP. I should probably change this to send the data from the signals I receive directly, rather than reading from the GUI controls I am setting, but I don't think that is the issue.
I am using System::Timers::Timer objects to update, read messages, and send out data packets. I need a rate of 50 Hz to 100 Hz, preferably closer to 100 Hz. Watching the socket on the other end, I can see that my packets are sent frequently enough, but the data doesn't change smoothly or frequently for the analog channels. If anyone has any thoughts on what I might be doing wrong, or how to process the data more efficiently, please say so.
Here is the segment of code from Intrepid that reads the CAN messages:
// Read the messages every timer event (1000 ms)
if (m_bPortOpen) // only if the port is open
{
    // call icsneoGetMessages to read out the messages
    lResult = icsneoGetMessages(hObject, stMessages, &lNumberOfMessages, &lNumberOfErrors);
    if (lResult != 0)
    {
        // a successful read
        mNumberOfErrorsRead = lNumberOfErrors;
        mNumberOfMessagesRead = lNumberOfMessages; // store the number of messages in the current buffer
    }
}
My form requests a message from my CanReader object using:
msg = can->GetLatestMsg();
and that method grabs the last message received.
public: icsSpyMessage* GetLatestMsg()
{
    return &stMessages[mNumberOfMessagesRead - 1];
};
I think this GetLatestMsg() seems like a bad way to implement retrieval of the latest message, but I'm not entirely sure how much it is affecting my program, or how else to do it: the CanReader is separate from the Form, so I think I would otherwise have to pass an array of messages. I do suspect this might be skipping messages, because it only reads the last one grabbed and not the ones leading up to it; if those were read, the GUI output should appear smoother during transitions.
Another thing to note is that I am reading from 6 different PGNs; the analog signals correspond to 4 of the PGNs and the digital signals to the other 2.
UPDATE
After playing with my application and using the string pots on different analog channels, I am noticing that some channels update more than others. And by checking the PGNs being accessed, I find that I am accessing some more frequently than others.
Doesn't a CAN device broadcast the data at relatively the same rates for the different PGNs? And if yes, then my GetLatestMsg() method must not be reading the different PGNs effectively. It reads a new message every 5 milliseconds.
Additionally, does anyone know if I should make separate reading timers to detect the different PGNs separately?
If there is additional code I can provide for clarity please let me know.
After lots of testing and debugging I have found a stable solution. I am posting this answer in hopes that someone else may benefit from this explanation.
When reading messages on a CAN bus, the messages are broadcast across the bus, and you have to pick out the messages you want based on their parameter group numbers (PGNs). Each PGN should be defined for a different piece of data.
For instance, engine RPM would correspond to a particular PGN, and when you check the messages in your buffer, you will be looking for the messages whose PGN corresponds to engine RPM.
In my case, I am reading/wanting 6 different PGNs from the ValueCAN3 device I am using. Each of these messages contains different data corresponding to the analog and digital signals. The efficiency problem was related to how I was retrieving the latest message from the buffer.
When you are just streaming the same data in every packet, it makes sense to grab the latest or most recent message from your buffer. However, in my case, since I require 6 different PGNs, grabbing the last message off the buffer gives me just 1 out of the 6 PGNs; and since the messages on my CAN bus are sent out at a constant rate and in the same order, I was reading about half of the packets all the time and missing the other half. By grabbing the 2 most recent packets, I found that all my channels now update at the speed needed for my application. All my signals change without noticeable lag and easily meet the 50 Hz to 100 Hz range I required.
However, I think the optimal solution would be to read the latest message for each PGN when I pull from the buffer, rather than just the latest two messages (which could correspond to any of my 6 PGNs, because I read more often than I receive packets).
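A sketch of that per-PGN approach. The message struct here is a simplified stand-in for icsSpyMessage (check the real field names in the Intrepid headers), and the PGN extraction assumes J1939-style 29-bit identifiers:
#include <cstdint>
#include <unordered_map>

struct CanMsg {                  // simplified stand-in for icsSpyMessage
    uint32_t arbId;              // 29-bit extended identifier
    uint8_t  data[8];
};

// Extract the J1939 PGN from a 29-bit ID: take bits 8..25, and when the
// PDU format byte is < 240 the low byte is a destination address, not
// part of the PGN, so zero it.
static uint32_t pgnOf(uint32_t arbId)
{
    uint32_t pgn = (arbId >> 8) & 0x3FFFF;
    if (((pgn >> 8) & 0xFF) < 240)
        pgn &= 0x3FF00;
    return pgn;
}

// Scan everything read this cycle and keep only the newest message per PGN,
// instead of just the last one or two entries in the buffer.
void keepLatestPerPgn(const CanMsg *msgs, int count,
                      std::unordered_map<uint32_t, CanMsg> &latest)
{
    for (int i = 0; i < count; ++i)
        latest[pgnOf(msgs[i].arbId)] = msgs[i];   // later entries overwrite earlier
}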
I have to write a C++ application that reads from the serial port byte by byte. This is an important requirement, because the application receives messages over radio transmission using Modbus, and the end of a Modbus transmission is defined by a 3.5-character-length idle period, so I MUST be able to get the message byte by byte.

The current system does this under DOS using hardware interrupts. We wish to move this software to Linux, but we lack expertise in this area. I have tried a number of things: polling with a non-blocking read, using select with very short timeout values, setting the serial port's read buffer size to one byte, and even using a signal handler on SIGIO, but none of these provides quite what I require.

My boss informs me that the DOS application we currently run uses hardware interrupts to get notified when there is something available to read from the serial port, and that the hardware is accessible directly. Is there any way to get this functionality from a user-space Linux application? Could I do it by writing a custom driver (despite never having done that before and having close to zero knowledge of how the kernel works)? I have heard that Linux is a very popular OS for hardware control and embedded devices, so I am guessing this kind of thing must be possible somehow, but I have spent literally weeks on this so far and still have no concrete idea of how best to proceed.
I'm not quite sure how reading byte-by-byte helps you with fractional-character reception, unless the point is that information is encoded in the duration of the intervals between characters, so that you need to know the timing of when they are received.
At any rate, I do suspect you are going to need to make custom modifications to the serial port kernel driver; that's really not all that bad as projects go, and you will learn a lot. You will probably also need to change the configuration of the UART "chip" (really just a tiny corner of some larger device) to make it interrupt after every single byte (i.e. emulate a 16450) instead of when its typically-16-byte buffer (emulating a 16550) is partway full. The code of the DOS program might actually be a help there. An alternative, if the baud rate is not too fast, would be to poll the hardware in the kernel or a realtime extension (or, if it is really, really slow, as it might be on an HF radio link, maybe even in userspace).
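To illustrate the scale of the hardware change involved: on a classic PC UART, disabling the 16550's FIFO (so it interrupts per byte, like a 16450) is a single write to the FIFO Control Register. A userspace sketch using ioperm (x86 Linux only, needs root, and it will fight the kernel driver over the port, so treat it purely as an experiment):
#include <cstdio>
#include <sys/io.h>     // ioperm(), outb() -- x86 Linux only

int main()
{
    const unsigned short base = 0x3F8;      // COM1's legacy I/O base address
    if (ioperm(base, 8, 1) < 0) {           // gain access to the UART registers
        perror("ioperm (are you root?)");
        return 1;
    }
    outb(0x00, base + 2);                   // FCR = 0: disable FIFOs (16450 mode)
    std::printf("FIFO disabled on UART at 0x%x\n", base);
    return 0;
}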
If I'm right about needing to know the timing of character reception, another option would be to offload the reception to a microcontroller with dual UARTs (or, even better, one UART and one USB interface). You could then have the micro watch the serial stream and output to the PC (either on the other serial port at a much faster baud rate, or over USB) little packages of data that include one received character and a timestamp -- or even have it decode the protocol for you. The nice thing about this is that it gets you operating system independence and works on legacy-free machines (byte-by-byte access is probably going to fail with an off-the-shelf USB-serial dongle). You can probably even build it from a cheap eval board rather than manufacturing any custom hardware.