How to correctly parse incoming HTTP requests


I've created a C++ application using WinSock that includes a small HTTP server (it handles just the few features I need). It is used to communicate with the outside world via HTTP requests. It works, but sometimes requests are not handled correctly because the parsing fails. I'm quite sure the requests are correctly formed, since they are sent by major web browsers like Firefox/Chrome or by Perl/C# (which have HTTP modules/DLLs).
After some debugging I found out that the problem is in fact in receiving the message. When the message arrives in more than one part (it is not read in a single recv() call), the parsing sometimes fails. I have tried numerous ways to resolve this, but nothing seems to be reliable enough.
What I do now is read data until I find the "\r\n\r\n" sequence, which indicates the end of the header. If WSAGetLastError() reports anything other than 10035 (WSAEWOULDBLOCK), i.e. the connection was closed or failed, before such a sequence is found, I discard the message. Once I know I have the whole header, I parse it and look for information about the body length. However, I'm not sure whether this information is mandatory (I think not) or what I should do if it is missing: does that mean there will be no body? Another problem is that I do not know whether I should look for a "\r\n\r\n" after the body (if its length is greater than zero).
Does anybody know how to reliably parse an HTTP message?
Note: I know there are implementations of HTTP servers out there. I want my own for various reasons. And yes, reinventing the wheel is bad, I know that too.

If you're set on writing your own parser, I'd take the Zed Shaw approach: use the Ragel state machine compiler and build your parser based on that. Ragel can handle input arriving in chunks, if you're careful.
Honestly, though, I'd just use something like this.
Your go-to resource should be RFC 2616, which describes HTTP/1.1 and gives you what you need to construct a parser (it has since been superseded by RFC 7230 and, later, RFC 9112). Good luck!

You could try looking at their code to see how they handle an HTTP message.
Or you could look at the spec; there are message length fields you should use. Only buggy browsers send additional CRLFs at the end, apparently.

In any case, an HTTP request has "\r\n\r\n" at the end of the request headers and before the request data (if any), even if the request is just "GET / HTTP/1.0\r\n\r\n".
If the method is "POST", you should read as many bytes after the "\r\n\r\n" as the Content-Length field specifies.
So the pseudocode is:
read_until(buf, "\r\n\r\n");
if (buf.starts_with("POST"))
{
    // header names are case-insensitive, so match accordingly
    contentLength = regex("^Content-Length: (\d+)$").find(buf)[1];
    read_all(buf, contentLength);
}
There will be a "\r\n\r\n" after the content only if the content itself includes it. The content may be binary data; it has no terminating sequence, and apart from chunked transfer encoding the only way to get its size is the Content-Length field.
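The pseudocode above can be fleshed out into a small buffer that tolerates requests arriving in several recv() calls. This is only a minimal sketch under some assumptions: the names RequestAssembler and HttpRequest are made up for illustration, it needs C++17 for std::optional, it handles only Content-Length bodies (no chunked transfer encoding), and it does not cap the buffer size the way a real server should.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper type, not part of WinSock or any library.
struct HttpRequest {
    std::string head;  // request line + header lines, without the blank line
    std::string body;  // exactly Content-Length bytes (may be empty)
};

class RequestAssembler {
public:
    // Append whatever a single recv() call returned.
    void feed(const char* data, std::size_t len) { buf_.append(data, len); }

    // Returns a request once head and body are fully buffered, else nullopt.
    std::optional<HttpRequest> poll() {
        const std::size_t headEnd = buf_.find("\r\n\r\n");
        if (headEnd == std::string::npos) return std::nullopt;  // header incomplete

        const std::size_t bodyLen = contentLength(buf_.substr(0, headEnd));
        const std::size_t total = headEnd + 4 + bodyLen;
        if (buf_.size() < total) return std::nullopt;  // body incomplete

        HttpRequest req{buf_.substr(0, headEnd), buf_.substr(headEnd + 4, bodyLen)};
        buf_.erase(0, total);  // keep any bytes that belong to the next request
        return req;
    }

private:
    // Finds the Content-Length header case-insensitively; returns 0 when absent.
    static std::size_t contentLength(std::string head) {
        std::transform(head.begin(), head.end(), head.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        const std::string key = "\r\ncontent-length:";
        const std::size_t pos = head.find(key);
        if (pos == std::string::npos) return 0;
        std::size_t i = pos + key.size();
        while (i < head.size() && (head[i] == ' ' || head[i] == '\t')) ++i;
        std::size_t value = 0;
        while (i < head.size() && std::isdigit(static_cast<unsigned char>(head[i]))) {
            value = value * 10 + static_cast<std::size_t>(head[i] - '0');
            ++i;
        }
        return value;
    }

    std::string buf_;
};
```

The idea is to call feed() after every successful recv() and then poll() in a loop until it returns nothing. Parsing never happens on a partial header, and a missing Content-Length is treated as "no body".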

HTTP GET/HEAD requests normally have no body, and a POST request can have an empty body too. Check whether it's a GET/HEAD; if it is, no content (body/message) is sent. If it is a POST, do as the spec says about parsing a message of known/unknown length, as @gbjbaanb said.

Related

Parsing JSON data from TCP stream

I am using nlohmann's json library for parsing JSON data from a TCP stream. I am not quite sure how to handle partial JSON reads from a local socket. Suppose that in the first read() I get:
{
"MessageType": "CancelOrder",
"Account":11111,
"CustomerNo":11111,
"Side":"A",
"DestinationMarket":"DUMB_MARKET",
"Symbol":"DUMB_SYMBOL",
"PositionEffect":"D",
"Limi
and in the following read() from socket, I get:
tPrice":0,
"Quantity":1,
"OrderType":"DUMB_TYPE",
"StopPrice":0,
"TimeInForce":"01.06.1999",
"ExpireDate":0,
"OrderID": "DUMB_ID",
"IsStopOrder":"DUMB_STOP",
"CorrelationId": 456
}
Partial reads cannot be parsed by the library since they are not valid. Does the library offer a solution to this? Or should I implement a solution myself?
What should be the best practice here?

You've gotten some good answers in the comments. I'm going to assemble some and add one more choice.
If you have control over both ends of the communications, then some people feel you should change the communications in one of two ways:
Send the length of text first
Or use a smarter messaging system over the socket
Either of these would solve your problem for you.
I'll offer two more possible choices.
Send an "end of data" indicator -- something that won't appear in the JSON. For instance, a null-byte. Break before the EOD character.
Try successively parsing data until it parses successfully.
The second one is kind of ugly. You'd parse { and get an exception. Then you'd parse {" for an exception, over and over until finally you have complete JSON. I bet it's slow, but it might work, and it doesn't depend on changing the data stream in any way.
Personally, I'd consider in order:
Use a proper messaging protocol
Use an End of Data indicator
Send the length
The hack of parsing and catching the exceptions until it parses
I think any of these would work. The last one is the only one that doesn't force you to change both ends of your data stream.
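For the "send the length" option, here is a minimal sketch of what the framing could look like. The 4-byte big-endian length prefix and the names frame and FrameReader are my own assumptions for illustration, not something the JSON library provides; it needs C++17 for std::optional.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Sender side: prefix a payload with its length as a 4-byte big-endian integer.
std::string frame(const std::string& payload) {
    const std::uint32_t n = static_cast<std::uint32_t>(payload.size());
    std::string out;
    out.push_back(static_cast<char>((n >> 24) & 0xFF));
    out.push_back(static_cast<char>((n >> 16) & 0xFF));
    out.push_back(static_cast<char>((n >> 8) & 0xFF));
    out.push_back(static_cast<char>(n & 0xFF));
    out += payload;
    return out;
}

// Receiver side: collects bytes from read() and hands out complete messages.
class FrameReader {
public:
    // Append whatever a single read() call returned.
    void feed(const char* data, std::size_t len) { buf_.append(data, len); }

    // Returns the next complete message, or nullopt if more bytes are needed.
    std::optional<std::string> next() {
        if (buf_.size() < 4) return std::nullopt;  // length prefix incomplete
        std::uint32_t n = 0;
        for (int i = 0; i < 4; ++i)
            n = (n << 8) | static_cast<unsigned char>(buf_[i]);
        if (buf_.size() - 4 < n) return std::nullopt;  // payload incomplete
        std::string msg = buf_.substr(4, n);
        buf_.erase(0, 4 + static_cast<std::size_t>(n));
        return msg;
    }

private:
    std::string buf_;
};
```

Each string returned by next() is a whole message, so it can be handed to nlohmann::json::parse without worrying about partial reads.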

How should I handle a huge amount of mails with Chilkat?

I'm trying to fetch a huge number of mails (2500 and more) from an IMAP server. Currently I'm using the imap.FetchHeaders() function, but this is not THAT fast. Then I tried imap.FetchSingleHeader(), but this is so much slower than imap.FetchHeaders()...
What would you recommend?

The imap.FetchHeaders() method will send a single IMAP command to fetch the headers. The IMAP server will send all headers in a single reply. The majority of the time it takes for the entire operation to complete is likely the IMAP server "think time", to process the request and send the response. If you turn on verbose logging (set the imap.VerboseLogging property = true) and then examine the contents of the imap.LastErrorText property, you should see timing information in elapsed milliseconds.
In summary, it's unlikely that fetching 2500 headers can be made any faster.
One note: To avoid problems we've seen when trying to fetch huge numbers of emails, Chilkat will ask for at most 1000 headers in a single request. This means that inside the FetchHeaders method (for the case of fetching 2500 headers), three separate request/response pairs will occur.
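The count of three is just a ceiling division of 2500 by 1000. As a tiny illustration of that arithmetic only (requestCount is a made-up name, and this says nothing about how Chilkat implements it internally):

```cpp
#include <cstddef>

// Number of requests needed to fetch `total` headers, `perRequest` at a time.
constexpr std::size_t requestCount(std::size_t total, std::size_t perRequest) {
    return (total + perRequest - 1) / perRequest;  // ceiling division
}
```

With total = 2500 and perRequest = 1000 this gives 3: two full batches of 1000 and one of 500.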

Thanks Howard,
This is to answer your question in the comment above about GetMailboxStatus.
The GetMailboxStatus method sends a STATUS command requesting the following items: (MESSAGES RECENT UIDNEXT UIDVALIDITY UNSEEN)
Given that it's part of the IMAP protocol standard (at https://www.rfc-editor.org/rfc/rfc3501#section-6.3.10 ), it should be valid for all servers. (I don't recall ever fielding a support question where GetMailboxStatus did not work correctly.)

Implementing Telegram bot webhooks in ColdFusion

I am developing an application in ColdFusion (CFML) to create generic, stateful, bots to be run on the Telegram messaging platform. I've found so far plenty of examples in PHP, some in other languages (Ruby,...), none in CFML. So, here I am.
The "getUpdates" (i.e., polling) way runs like a breeze, but it's not feasible polling the Telegram server for new updates at a rate decent for interactive use (some 30 sec). So, I've turned to Webhooks.
I will skip over the webhook setup for a self-signed certificate; it's out of scope here, but I'm happy to explain how I overcame that issue.
My problem is: how do I decode the posts received from the Telegram server when an update occurs?
What my application server (ColdFusion + Tomcat + Apache2) gets from Telegram is an HTTP request with a header like this:
struct
accept-encoding: gzip, deflate
connection: keep-alive
content-length: 344
content-type: application/json
host: demo.bigopen.eu
and a content section like this:
binary
1233411711210097116101951..... (*cut*)
Please note that the data section (ASCII) contains only decimal digits, not hex. I've been struggling to decode that stuff; I'm trying to get a JSON representation of a single message.
I've been trying the CFML tools I have, such as BinaryDecode(), CharsetEncode(), Java GZip libraries, etc., but with no success so far. I was expecting some serialized JSON in the reply, but it's encoded in a way I cannot decode. I've found no hint in the literature, since only calls to language-specific libraries (such as file_get_contents for PHP) are shown.
I don't expect to be given the actual CFML code, but hopefully someone can tell me what kind of encoding is performed on the Telegram side.

After some effort I was able to solve this issue. The encoding is handled by ColdFusion itself. The data sent by Telegram in a webhook update is binary, and CF treats it as a ByteArray (actually, it's declared as "Array" but not directly addressable). Nonetheless, the ToString() function, if applied, returns a fully valid string.
So, the first thing to do is:
<cfset reply = DeserializeJSON(ToString(StructFind(GetHttpRequestData(), "content"))) >
BTW, StructFind() just extracts the "content" section from the structure returned by GetHttpRequestData().
After that, reply is a structure holding what is needed, such as:
<cfset message_id = reply.message.message_id />
<cfset message_text = reply.message.text />
and so on.
I hope this is useful to someone.
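To see why a plain byte-to-string conversion is all that's needed, here is a small sketch in C++ rather than CFML. It assumes the digit dump in the question is a run of decimal byte values (123, 34, 117, ...); that split is my reading of the dump, and bytesToString is a made-up helper name.

```cpp
#include <string>
#include <vector>

// Interpret each byte value as one character (fine for ASCII / UTF-8 text).
std::string bytesToString(const std::vector<unsigned char>& bytes) {
    return std::string(bytes.begin(), bytes.end());
}
```

Feeding it the values 123, 34, 117, 112, 100, 97, 116, 101, 95 yields {"update_ which is consistent with the start of a Telegram update object (its first field is update_id). So the content was plain JSON text all along, just displayed as raw byte values.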

Is it OK to reuse HTTP status codes like 416?

I want to notify a client of a specific error condition using an HTTP status code.
The closest I can find is "416 Range Not Satisfiable", although the service has nothing to do with serving byte ranges from files.
Can I liberally interpret the meaning of "Range Not Satisfiable" or must I respect the technical definition involving byte-ranges of files?

You can liberally interpret that. However, that doesn't make it the correct thing to do.
Errors that aren't specifically handled by the current 4xx set generally use the more generic 400 error along with an added explanation as to why. The general rule is that, if your error is an exact match to the more specific code, use it, otherwise use the less specific code.
Overloading the meaning of the specific codes is likely to lead to mass confusion.
As per RFC7231, section 6.5 (my italics):
The 4xx (Client Error) class of status code indicates that the client seems to have erred. Except when responding to a HEAD request, the server SHOULD send a representation containing an explanation of the error situation, and whether it is a temporary or permanent condition. These status codes are applicable to any request method. User agents SHOULD display any included representation to the user.
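As a rough sketch of the "generic 400 plus explanation" approach, a server could build its response along these lines. The function name badRequest and the text/plain body are assumptions for illustration; any representation the client can display would do.

```cpp
#include <string>

// Build a minimal HTTP/1.1 response whose body explains the error.
std::string badRequest(const std::string& explanation) {
    return "HTTP/1.1 400 Bad Request\r\n"
           "Content-Type: text/plain; charset=utf-8\r\n"
           "Content-Length: " + std::to_string(explanation.size()) + "\r\n"
           "\r\n" + explanation;
}
```

The status line stays generic, while the body carries the specific condition and whether it is temporary or permanent, which is what the quoted section asks for.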

c++ accessed url log

I'm currently developing a standalone C++ program that lists all the URLs accessed in a browser and their corresponding response times.
At this point I can already sniff all incoming and outgoing packets. I am using WinPcap for this.
The retrieved packets are filtered to only those matching 'tcp port 80 (http) or 443 (https)'.
Now I want to read some HTTP headers. The problem I have is that the IP packets are usually fragmented.
I want to know how to reassemble them and how to get some details about the HTTP traffic.
Note: I want to implement what Wireshark does. For every packet/frame, it has a
'REASSEMBLED TCP SEGMENT'
Any ideas or tutorials on how I can easily achieve this?
Thanks a lot!

You'll have to do the same thing TCP does to reassemble packets, which means parsing the header of each packet and sequencing the payloads into another buffer. The hardest part of the program logic is probably dealing with missing information; you'll then have to see if it was flagged and retransmitted.
There are a number of RFCs which cover this: 675, 793, 1122 and others. If looking through those seems overwhelming, maybe back off and look at the Roadmap RFC, RFC 4614.
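To give a feel for the sequencing step, here is a minimal sketch that reorders the payloads of one direction of one TCP connection by sequence number. It is not how Wireshark or WinPcap do it: StreamReassembler is a made-up name, it assumes you already extracted the sequence number and payload from each captured packet, and it ignores sequence number wraparound and SYN/FIN handling.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical sketch: orders the payloads of one TCP direction by sequence number.
class StreamReassembler {
public:
    // initialSeq is the sequence number of the first payload byte (SYN seq + 1).
    explicit StreamReassembler(std::uint32_t initialSeq) : next_(initialSeq) {}

    // Store one captured segment; if the same seq is seen twice, the first copy wins.
    void addSegment(std::uint32_t seq, const std::string& payload) {
        if (!payload.empty()) pending_.emplace(seq, payload);
    }

    // Return the bytes that are now contiguous; stop at the first gap.
    std::string drain() {
        std::string out;
        auto it = pending_.begin();
        while (it != pending_.end() && it->first <= next_) {
            const std::uint32_t skip = next_ - it->first;  // bytes already delivered
            if (skip < it->second.size()) {
                out.append(it->second, skip, std::string::npos);
                next_ += static_cast<std::uint32_t>(it->second.size() - skip);
            }
            it = pending_.erase(it);  // consumed, or a pure retransmission
        }
        return out;
    }

private:
    std::uint32_t next_;                            // next expected sequence number
    std::map<std::uint32_t, std::string> pending_;  // out-of-order segments by seq
};
```

The bytes returned by drain() can be appended to a per-connection buffer and searched for the "\r\n\r\n" that ends the HTTP headers. A segment that arrives early simply waits in the map until the gap before it is filled by a retransmission.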
