How can a web server know when an HTTP request is fully received?


I'm currently writing a very simple web server to learn more about low level socket programming. More specifically, I'm using C++ as my main language and I am trying to encapsulate the low level C system calls inside C++ classes with a more high level API.
I have written a Socket class that manages a socket file descriptor and handles opening and closing using RAII. This class also exposes the standard socket operations for a connection oriented socket (TCP) such as bind, listen, accept, connect etc.
After reading the man pages for the send and recv system calls I realized that I needed to call these functions inside some form of loop in order to guarantee that all bytes are successfully sent/received.
My API for sending and receiving looks similar to this
void SendBytes(const std::vector<std::uint8_t>& bytes) const;
void SendStr(const std::string& str) const;
std::vector<std::uint8_t> ReceiveBytes() const;
std::string ReceiveStr() const;
For the send functionality I decided to use a blocking send call inside a loop such as this (it is an internal helper function that works for both std::string and std::vector).
template<typename T>
void Send(const int fd, const T& bytes)
{
using ValueType = typename T::value_type;
using SizeType = typename T::size_type;
const ValueType *const data{bytes.data()};
SizeType bytesToSend{bytes.size()};
SizeType bytesSent{0};
while (bytesToSend > 0)
{
const ValueType *const buf{data + bytesSent};
const ssize_t retVal{send(fd, buf, bytesToSend, 0)};
if (retVal < 0)
{
throw ch::NetworkError{"Failed to send."};
}
const SizeType sent{static_cast<SizeType>(retVal)};
bytesSent += sent;
bytesToSend -= sent;
}
}
This seems to work fine and guarantees that all bytes are sent once the member function returns without throwing an exception.
However, I started running into problems when I began implementing the receive functionality. For my first attempt I used a blocking recv call inside a loop and exited the loop if recv returned 0 indicating that the underlying TCP connection was closed.
template<typename T>
T Receive(const int fd)
{
using SizeType = typename T::size_type;
using ValueType = typename T::value_type;
T result;
const SizeType bufSize{1024};
ValueType buf[bufSize];
while (true)
{
const ssize_t retVal{recv(fd, buf, bufSize, 0)};
if (retVal < 0)
{
throw ch::NetworkError{"Failed to receive."};
}
if (retVal == 0)
{
break; /* Connection is closed. */
}
const SizeType offset{static_cast<SizeType>(retVal)};
result.insert(std::end(result), buf, buf + offset);
}
return result;
}
This works fine as long as the connection is closed by the sender after all bytes have been sent. However, this is not the case when using e.g. Chrome to request a webpage. The connection is kept open and my receive member function stays blocked on the recv system call after receiving all bytes in the request. I managed to get around this problem by setting a timeout on the recv call using setsockopt. Basically, I return all bytes received so far once the timeout expires. This feels like a very inelegant solution and I do not think that this is the way web servers handle this issue in reality.
So, on to my question.
How does a web server know when an HTTP request has been fully received?
A GET request in HTTP 1.1 does not seem to include a Content-Length header. See e.g. this link.

HTTP/1.1 is a text-based protocol, with binary POST data added in a somewhat hacky way. When writing a "receive loop" for HTTP, you cannot completely separate the data receiving part from the HTTP parsing part. This is because in HTTP, certain characters have special meaning. In particular, the CRLF (0x0D 0x0A) token is used to separate headers, and two CRLF tokens one after the other end the header section (and, for a request without a body, the request itself).
So to stop receiving, you need to keep receiving data until one of the following happens:
Timeout – followed by sending a timeout response
Two CRLF in the request – followed by parsing the request, then responding as needed (parsed correctly? request makes sense? send data?)
Too much data – certain HTTP exploits aim to exhaust server resources like memory or processes (see e.g. slow loris)
And perhaps other edge cases. Also note that this only applies to requests without a body. For POST requests, you first wait for two CRLF tokens, then read Content-Length bytes in addition. And this is even more complicated when the client is using multipart encoding.
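As a rough illustration of the "two CRLF" and "too much data" conditions above, here is a minimal sketch of a check that could run after every recv(). The names CheckHeaders, HeaderStatus and maxHeaderBytes are made up for this example, and the timeout is assumed to be handled separately at the socket level.

```cpp
#include <cstddef>
#include <string>

// Outcome of inspecting the bytes received so far (hypothetical names).
enum class HeaderStatus { NeedMoreData, Complete, TooLarge };

struct HeaderCheck
{
    HeaderStatus status;
    std::size_t headerLength; // bytes up to and including "\r\n\r\n" when Complete
};

// Call with everything received so far; maxHeaderBytes is an assumed server limit.
HeaderCheck CheckHeaders(const std::string& received, const std::size_t maxHeaderBytes)
{
    const std::string terminator{"\r\n\r\n"};
    const std::size_t pos{received.find(terminator)};
    if (pos != std::string::npos)
    {
        const std::size_t length{pos + terminator.size()};
        if (length > maxHeaderBytes)
        {
            return {HeaderStatus::TooLarge, 0};
        }
        return {HeaderStatus::Complete, length};
    }
    if (received.size() >= maxHeaderBytes)
    {
        return {HeaderStatus::TooLarge, 0}; // give up instead of buffering forever
    }
    return {HeaderStatus::NeedMoreData, 0}; // keep calling recv()
}
```

The receive loop would stop on Complete, answer with an error on TooLarge, and otherwise go back to recv().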

A request header is terminated by an empty line (two CRLFs with nothing between them).
So, when the server has received a request header, and then receives an empty line, and if the request was a GET (which has no payload), it knows the request is complete and can move on to dealing with forming a response. In other cases, it can move on to reading Content-Length worth of payload and act accordingly.
This is a reliable, well-defined property of the syntax.
No Content-Length is required or useful for a GET: the content is always zero-length. A hypothetical Header-Length is more like what you're asking about, but you'd have to parse the header first in order to find it, so it does not exist and we use this property of the syntax instead. As a result of this, though, you may consider adding an artificial timeout and maximum buffer size, on top of your normal parsing, to protect yourself from the occasional maliciously slow or long request.
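To make the "empty line, then Content-Length worth of payload" rule concrete, here is a small sketch that computes how many bytes a request should occupy in total. FindContentLength and ExpectedRequestSize are hypothetical helpers (C++17 assumed for std::optional); the sketch ignores chunked bodies, is strict about trailing whitespace, and does not guard against overflow.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Looks for a Content-Length field in a header block (field name is case-insensitive).
std::optional<std::size_t> FindContentLength(const std::string& headerBlock)
{
    const std::string name{"content-length:"};
    std::size_t lineStart{0};
    while (lineStart < headerBlock.size())
    {
        std::size_t lineEnd{headerBlock.find("\r\n", lineStart)};
        if (lineEnd == std::string::npos) lineEnd = headerBlock.size();
        const std::string line{headerBlock.substr(lineStart, lineEnd - lineStart)};
        lineStart = lineEnd + 2;

        bool matches{line.size() > name.size()};
        for (std::size_t i{0}; matches && i < name.size(); ++i)
        {
            matches = std::tolower(static_cast<unsigned char>(line[i])) == name[i];
        }
        if (!matches) continue;

        std::size_t pos{name.size()};
        while (pos < line.size() && (line[pos] == ' ' || line[pos] == '\t')) ++pos;
        if (pos == line.size()) return std::nullopt; // no value present
        std::size_t value{0};
        for (; pos < line.size(); ++pos)
        {
            if (!std::isdigit(static_cast<unsigned char>(line[pos]))) return std::nullopt;
            value = value * 10 + static_cast<std::size_t>(line[pos] - '0');
        }
        return value;
    }
    return std::nullopt;
}

// Total size of the request (headers plus body), or nullopt while the
// empty line that ends the headers has not arrived yet.
std::optional<std::size_t> ExpectedRequestSize(const std::string& received)
{
    const std::size_t end{received.find("\r\n\r\n")};
    if (end == std::string::npos) return std::nullopt;
    const std::size_t bodyLength{FindContentLength(received.substr(0, end)).value_or(0)};
    return end + 4 + bodyLength;
}
```

A GET without Content-Length is complete as soon as ExpectedRequestSize returns a value no larger than the number of bytes received.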

The solution is within your link
A GET request in HTTP 1.1 does not seem to include a Content-Length header. See e.g. this link.
There it says:
It must use CRLF line endings, and it must end in \r\n\r\n

The answer is formally defined in the HTTP protocol specifications 1:
in W3C's spec for HTTP 0.9.
in RFC 1945 for HTTP 1.0, specifically in Section 4: HTTP Message, Section 5: Request, and Section 7: Entity.
in RFC 2616 for HTTP 1.1, specifically in Section 4: HTTP Message, in particular 4.3: Message Body and 4.4: Message Length.
in RFC 7230 (and 7231...7235) for HTTP 1.1, specifically in Section 3: Message Format, in particular 3.3: Message Body.
So, to summarize, the server first reads the message's initial start-line to determine the request type. If the HTTP version is 0.9, the request is done, as the only supported request is GET without any headers. Otherwise, the server reads the message-headers until a terminating CRLF (an empty line) is reached. Then, only if the request type has a defined message body, the server reads the body according to the transfer format outlined by the request headers (requests and responses are not restricted to using a Content-Length header in HTTP 1.1).
In the case of a GET request, there is no message body defined, so the message ends after the start-line in HTTP 0.9, and after the terminating CRLF of the message-headers in HTTP 1.0 and 1.1.
1: I'm not going to get into HTTP 2.0, which is a whole different ballgame.
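The first step of that summary, reading the start-line and deciding whether headers follow, could be sketched as below. RequestLine and ParseRequestLine are names invented for this example (C++17 assumed), and the parser is deliberately lenient: it only splits on whitespace and does not validate the method or version token.

```cpp
#include <optional>
#include <sstream>
#include <string>

struct RequestLine
{
    std::string method;
    std::string target;
    std::string version; // empty for an HTTP 0.9 style request ("GET /path")
};

// Splits a start-line (without its trailing CRLF) into its parts.
std::optional<RequestLine> ParseRequestLine(const std::string& line)
{
    std::istringstream in{line};
    RequestLine result;
    if (!(in >> result.method >> result.target))
    {
        return std::nullopt; // need at least a method and a target
    }
    in >> result.version; // optional: missing means HTTP 0.9
    std::string extra;
    if (in >> extra)
    {
        return std::nullopt; // more than three tokens is malformed
    }
    return result;
}

// With no version token the request ends at the start-line;
// otherwise headers follow until the empty line.
bool HeadersFollow(const RequestLine& requestLine)
{
    return !requestLine.version.empty();
}
```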

Related

How to parse/check an HTTP message in PCapPlusPlus?

In PCap++, I want to detect whether a payload is an HTTP request or not. For this, I am trying to parse the string and expect the library to allow me to check whether this was done successfully.
Unfortunately, I was unable to achieve this:
I can create a RawPacket with the message
I can create a Packet with the message, but it does not contain any HttpRequestLayer, so the parsing is useless for detecting the validity of the message.
I cannot create an HttpRequestLayer directly from the message.
Some examples:
std::string msg= "GET /index.html HTTP/1.1\nHost: example.com\n\n";
// Try to get a RawPacket: works, but does not help a lot
struct timeval tp; // requires <sys/time.h>
gettimeofday(&tp, nullptr);
RawPacket rp(static_cast<const uint8_t*>(msg.data()), static_cast<int>(msg.size()), tp, false);
// Trying to parse it: works, but detects a generic network layer only, no HTTP
Packet p(&rp, false, HTTP);
// Trying to create an HttpRequestLayer directly: crash
HttpRequestLayer http(static_cast<const uint8_t*>(msg.data()), static_cast<int>(msg.size()), nullptr, nullptr);
My question is:
How to detect if a message is a valid HTTP message with PCap++?
Note: I am looking for an efficient solution (very sub-optimal solutions, like generating TCP layers, are not an option).

PcapPlusPlus can parse packets, not messages. A RawPacket object expects a stream of bytes that represents a network packet, typically with a data link layer (e.g. Ethernet), network layer (e.g. IP), transport layer (e.g. TCP) and application layer (HTTP in this case). PcapPlusPlus will parse this byte stream into a list of layers/protocols you can look into.
HTTP is an application protocol, hence any HTTP packet will contain the other layers mentioned above. So providing just the HTTP message is not enough and PcapPlusPlus won't be able to parse it as a packet.
You can learn more about PcapPlusPlus from the tutorials: https://pcapplusplus.github.io/docs/tutorials
Specifically you can look into the packet parsing tutorial:
https://pcapplusplus.github.io/docs/tutorials/packet-parsing

pcpp::Packet has a method for getting the layer you need: getLayerOfType. You could use it to detect an HTTP message.
Example:
timeval tm;
gettimeofday(&tm, NULL);
pcpp::RawPacket rawPacket((uint8_t*)rawPacketFromNet.data(), rawPacketFromNet.size(), tm, false, pcpp::LinkLayerType::LINKTYPE_RAW);
pcpp::Packet parsedPacket(&rawPacket);
pcpp::HttpRequestLayer* httpLayer = parsedPacket.getLayerOfType<pcpp::HttpRequestLayer>();
if (httpLayer)
{
// you have this layer in your packet
uint8_t* dataPtr = httpLayer->getData();
size_t size = httpLayer->getDataLen();
}
I think your example could have worked if you had started with a pcpp::Packet and then added an HTTP layer to it. To construct the HTTP layer in your case, try using HttpRequestLayer(HttpMethod method, std::string uri, HttpVersion version);

Windows XP socket error with recv()

I'm having a strange behaviour with the recv() function.
My C++ (MFC) application with WinSock implements a simple HTTP client (non-blocking socket) for accessing HTML pages on a web server. Some of these pages take a few seconds to load. On Windows 7 this is not a problem, because recv() also returns partial data. But on Windows XP the recv() function always returns SOCKET_ERROR and the error code is WSAEWOULDBLOCK. Only when the connection is finished is all the data returned in a single call.
Has anyone seen this problem? How can I force Windows XP to also return partial data?
I set the buffer size (SO_RCVBUF) to 1000 bytes. On Windows 7 this is also reflected in the TCP window size; on XP it is not.
The real problem I have with this issue is that I don't know how to check whether the connection is still alive. How can I check if a connection is still alive? Or how can I specify a timeout (max time between two received packets from the server)?

By default, a socket operates in blocking mode, so the only way you can get a WSAEWOULDBLOCK error at all is if you explicitly put the socket into non-blocking mode instead. Doing so, you agree to handle WSAEWOULDBLOCK (otherwise, don't use non-blocking mode).
WSAEWOULDBLOCK is not a real error, it is just an indication that the operation you attempted to perform cannot be completed at that moment because it would block the calling thread. You need to detect this "error" and simply retry the same operation again at a later time, preferably after a socket state change is detected.
For recv(), WSAEWOULDBLOCK simply means there is no data available on the socket to be read at that moment. In non-blocking mode, you should be using select() (or WSAEventSelect(), or WSAAsyncSelect(), or Overlapped I/O, or an I/O Completion Port) to detect inbound data before you then read it.
That being said, you are implementing an HTTP client, so you must follow the HTTP protocol properly, regardless of the socket I/O mode you are using, regardless of your socket buffer sizes. You must follow the pseudo code logic I outlined in this answer on another question:
You must follow the rules outlined in RFC 2616. Namely:
1. Read until the "\r\n\r\n" sequence is encountered. Do not read any more bytes past that yet.
2. Analyze the received headers, per the rules in RFC 2616 Section 4.4. They tell you the actual format of the remaining response data.
3. Read the data per the format discovered in #2.
4. Check the received headers for the presence of a Connection: close header if the response is using HTTP 1.1, or the lack of a Connection: keep-alive header if the response is using HTTP 0.9 or 1.0. If detected, close your end of the socket connection because the server is closing its end. Otherwise, keep the connection open and re-use it for subsequent requests (unless you are done using the connection, in which case do close it).
5. Process the received data as needed.
In short, you need to do something more like this instead (pseudo code):
string headers[];
byte data[];
string statusLine = read a CRLF-delimited line;
int statusCode = extract from status line;
string responseVersion = extract from status line;
do
{
string header = read a CRLF-delimited line;
if (header == "") break;
add header to headers list;
}
while (true);
if ( !((statusCode in [1xx, 204, 304]) || (request was "HEAD")) )
{
if (headers["Transfer-Encoding"] ends with "chunked")
{
do
{
string chunk = read a CRLF delimited line;
int chunkSize = extract from chunk line;
if (chunkSize == 0) break;
read exactly chunkSize number of bytes into data storage;
read and discard until a CRLF has been read;
}
while (true);
do
{
string header = read a CRLF-delimited line;
if (header == "") break;
add header to headers list;
}
while (true);
}
else if (headers["Content-Length"] is present)
{
read exactly Content-Length number of bytes into data storage;
}
else if (headers["Content-Type"] == "multipart/byteranges")
{
string boundary = extract from Content-Type header;
read into data storage until terminating boundary has been read;
}
else
{
read bytes into data storage until disconnected;
}
}
if (!disconnected)
{
if (responseVersion == "HTTP/1.1")
{
if (headers["Connection"] == "close")
close connection;
}
else
{
if (headers["Connection"] != "keep-alive")
close connection;
}
}
check statusCode for errors;
process data contents, per info in headers list;
As you can see, HTTP requires reading CRLF-delimited lines of text, or fixed lengths of raw bytes. To do that, you must call recv() in a loop until you encounter the terminating CRLF, or have received the expected number of bytes, whichever the case may be. Whether you use a synchronous loop that just ignores WSAEWOULDBLOCK errors while looping, or you use a state machine driven by asynchronous events/callbacks, that is up to you to decide. That doesn't change how you must process the HTTP protocol.
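The chunked branch of the pseudo code above can be tried out on data that is already in memory. The following is only a sketch under that assumption: DecodeChunked is a made-up name, it requires the whole chunked body to be present, it skips chunk extensions, and it stops at the zero-size chunk without parsing trailer headers (C++17 assumed).

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Decodes a chunked body held in memory. Returns nullopt if the data is
// malformed or the final zero-size chunk has not arrived yet.
std::optional<std::string> DecodeChunked(const std::string& raw)
{
    std::string body;
    std::size_t pos{0};
    while (true)
    {
        const std::size_t lineEnd{raw.find("\r\n", pos)};
        if (lineEnd == std::string::npos) return std::nullopt;

        // The chunk size is hexadecimal; anything after ';' is an extension.
        std::size_t size{0};
        std::size_t i{pos};
        for (; i < lineEnd && raw[i] != ';'; ++i)
        {
            const char c{raw[i]};
            std::size_t digit{0};
            if (c >= '0' && c <= '9') digit = static_cast<std::size_t>(c - '0');
            else if (c >= 'a' && c <= 'f') digit = static_cast<std::size_t>(c - 'a') + 10;
            else if (c >= 'A' && c <= 'F') digit = static_cast<std::size_t>(c - 'A') + 10;
            else return std::nullopt;
            size = size * 16 + digit;
        }
        if (i == pos) return std::nullopt; // no size digits at all

        pos = lineEnd + 2;
        if (size == 0) return body; // last chunk; trailers (if any) follow

        const std::size_t remaining{raw.size() - pos};
        if (size > remaining || remaining - size < 2) return std::nullopt;
        body.append(raw, pos, size);
        if (raw.compare(pos + size, 2, "\r\n") != 0) return std::nullopt;
        pos += size + 2;
    }
}
```

In a real client the same steps would be driven by recv() instead of a complete string, but the line/size/data/CRLF sequence stays the same.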
This applies to all versions of Windows (even all platforms that use BSD-style socket APIs). What you are encountering is not a Windows bug at all. It is an underlying flaw in your understanding of how to use socket I/O correctly and effectively.
As for checking if the connection is alive, recv() will return 0 if the server closed the connection gracefully, or will report an error otherwise (usually WSAECONNABORTED or WSAECONNRESET, though there can be others). But an abnormal disconnect may take a long time to detect, so you should implement timeouts in your code instead. In synchronous mode, you can use setsockopt(SO_RCVTIMEO). In non-blocking mode, you can use select(). In asynchronous (overlapped) mode, you can use WaitForSingleObject() on whatever event/object you use to drive your state machine.

You can't expect recv to give you any data on a non-blocking socket. If there's no data available it returns WOULDBLOCK. You just need to call recv again (normally after select notifies you some data is available). Whether you get data on the first (or any) call is going to depend on how fast the server is sending it.
When the socket is closed you'll get a different error from recv, like WSAECONNRESET or WSAENOTCONN. select will also notify you when the socket is closed.

It's very strange.
Today I have changed my software to use blocking sockets. But it still doesn't work on Windows XP. Windows 7 is no problem.
So I thought: Let's try another PC. On this PC (also Windows XP) it does work. Now I tried a 3rd PC with Windows XP and here it also works.
I still don't know what the problem is but I think there must be a bug with the PC.

Server programming in C++

I'd like to make a chat program using Winsock in C/C++. (I am a total newbie.)
The first question is about how to check whether the client receives packets from the server.
For instance, a server sends "aaaa" to a client.
And if the client doesn't receive the packet "aaaa", the server should re-send the packet (I think). However, I don't know how to check for that.
Here are my thoughts below.
First case.
Server --- "aaaa" ---> Client.
Server waits, with some kind of timer, for a confirmation msg from the client.
Client --- "I received it" ---> Server.
Server won't re-send the packet.
The other case.
Server --- "aaaa" ---> Client.
Server is waiting for client msg until time out
Server --- "aaaa" ---> Client again.
But these are probably inappropriate.
Look at second case. Server is waiting a msg from client for a while.
And if time's out, server will re-send a packet again.
In this case, client might receive the packet twice.
The second question is how to send a packet of unlimited size.
A book says a packet should have a type, size, and msg.
Following that, I can only send a msg of a certain size.
But I want to send a msg of 1 MB or more (unlimited).
How do I do that?
Does anyone have a good link, or can anyone explain the correct logic to me as simply as possible?
Thanks.

Use TCP. Think "messages" at the application level, not packets.
TCP already handles network-level packet data, error checking & resending lost packets. It presents this to the application as a "stream" of bytes, but without necessarily guaranteed delivery (since either end can be forcibly disconnected).
So at the application level, you need to handle Message Receipts & buffering -- with a re-connecting client able to request previous messages, which they hadn't (yet) correctly received.
Here are some data structures:
class or struct Message {
int type; // const MESSAGE.
int messageNumber; // sequentially incrementing.
int size; // 4 bytes, probably signed; allows up to 2GB data.
byte[] data;
}
class or struct Receipt {
int type; // const RECEIPT.
int messageNumber; // last #, successfully received.
}
You may also want a Connect/ Hello and perhaps a Disconnect/ Goodbye handshake.
class Connect {
int type; // const CONNECT.
int lastReceivedMsgNo; // last #, successfully received.
// plus, who they are?
short nameLen;
char[] name;
}
etc.
If you can be really simple & don't need to buffer/ re-send messages to re-connecting clients, it's even simpler.
You could also adopt a "uniform message structure" which had TYPE and SIZE (4-byte int) as the first two fields of every message or handshake. This might help standardize your routines for handling these, at the expense of some redundancy (eg in 'name' field-sizes).
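A "uniform message structure" with TYPE and SIZE up front could be sketched roughly as follows. Frame, EncodeFrame and TryDecodeFrame are made-up names, both fields are assumed to be unsigned 4-byte big-endian integers, and the receiver is assumed to append whatever recv() returns to one growing buffer (C++17 assumed).

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

struct Frame
{
    std::uint32_t type;
    std::string payload;
};

// Appends a 32-bit value in big-endian (network) byte order.
void AppendU32(std::vector<std::uint8_t>& out, const std::uint32_t value)
{
    for (int shift{24}; shift >= 0; shift -= 8)
    {
        out.push_back(static_cast<std::uint8_t>((value >> shift) & 0xFF));
    }
}

std::uint32_t ReadU32(const std::vector<std::uint8_t>& in, const std::size_t offset)
{
    std::uint32_t value{0};
    for (std::size_t i{0}; i < 4; ++i)
    {
        value = (value << 8) | in[offset + i];
    }
    return value;
}

// Serializes a frame as: type (4 bytes), size (4 bytes), payload.
std::vector<std::uint8_t> EncodeFrame(const Frame& frame)
{
    std::vector<std::uint8_t> out;
    AppendU32(out, frame.type);
    AppendU32(out, static_cast<std::uint32_t>(frame.payload.size()));
    out.insert(out.end(), frame.payload.begin(), frame.payload.end());
    return out;
}

// Takes one complete frame off the front of the receive buffer, or returns
// nullopt (leaving the buffer untouched) if more bytes are needed.
std::optional<Frame> TryDecodeFrame(std::vector<std::uint8_t>& buffer)
{
    constexpr std::size_t headerSize{8};
    if (buffer.size() < headerSize) return std::nullopt;
    const std::uint32_t type{ReadU32(buffer, 0)};
    const std::size_t size{ReadU32(buffer, 4)};
    if (buffer.size() - headerSize < size) return std::nullopt;

    const auto payloadBegin = buffer.begin() + headerSize;
    const auto payloadEnd = payloadBegin + static_cast<std::ptrdiff_t>(size);
    Frame frame{type, std::string(payloadBegin, payloadEnd)};
    buffer.erase(buffer.begin(), payloadEnd);
    return frame;
}
```

Because the size travels ahead of the data, a 1 MB message simply takes many recv() calls before TryDecodeFrame succeeds; a real server would also want an upper limit on the size field.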

For first part, have a look over TCP.
It provides an ordered and reliable packet transfer. Plus, you can customize it a lot by implementing the same scheme yourself on top of UDP.
Broadly, what it does is,
Server:
1. Numbers each packet and sends it
2. Waits for acknowledgement of a specific packet number. And then re-transmits the lost packets.
Client:
1. Receives a packet and maintains a buffer (sliding window)
2. It keeps on collecting packets in the buffer until the buffer overflows or a wrongly sequenced packet arrives. As soon as that happens, the packets with the right sequence are 'delivered', and the sequence number of the last correct packet is sent with the acknowledgement.
For second part:
I would use HTTP for it.
With some modifications. Like you should have some very unique indicator to tell client that transmission is complete now, etc

boost::asio read - return after all data were read from socket, without waiting for EOF

I'm quite new to boost::asio and I've run into a problem I don't really know how to fix. Could you please help me?
In general I'm trying to implement a proxy based on boost::asio. I'm using the async_read_some function to read the response from the server, something like this:
_ssocket.async_read_some(boost::asio::buffer(_sbuffer),
boost::bind(&connection::handle_server_read_body_some,
shared_from_this(),
boost::asio::placeholders::error,
boost::asio::placeholders::bytes_transferred
));
Everything is fine: it reads a chunk of data and calls the handler. The problem arises when I call async_read_some and there is no more data to read from the socket. The handler is then not called for about 15 seconds, until EOF is raised (i.e. the server socket disconnects). I've tried different read functions, and all of them return only when 1 or more bytes were read or there was some error.
The thing is that sometimes I don't know how many bytes I need to read, so I just need to read everything that is present. I tried to use
boost::asio::socket_base::bytes_readable
or
_ssocket.available(err)
to figure out how many bytes are available on the socket, but those functions return the number of bytes that can be read without blocking, so I can't base my implementation on that. In tests I even see that bytes_readable sometimes returns 0 and the next call of async_read_some on the same socket reads a chunk of data.
My question is: is there any way to get an immediate return (in case of a synchronous call) or handler call (in case of async) when there is no more data to read from the socket? Because currently it just hangs for 15 seconds until EOF.
I will appreciate any advice or tips you can give me.

There's nothing wrong with your usage of Boost.Asio. The problem is that you need to know how to deal with HTTP messages. Basically, you need to detect the message type and parse the message to learn its length. Server disconnection does not always mark the end of a message, because HTTP supports keep-alive (the same connection is used for multiple messages). Please read the following quote from RFC 2616:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html
4.4 Message Length
The transfer-length of a message is the length of the message-body as
it appears in the message; that is, after any transfer-codings have
been applied. When a message-body is included with a message, the
transfer-length of that body is determined by one of the following (in
order of precedence):
1. Any response message which "MUST NOT" include a message-body (such as the 1xx, 204, and 304 responses and any response to a HEAD request)
is always terminated by the first empty line after the header fields,
regardless of the entity-header fields present in the message.
2. If a Transfer-Encoding header field (section 14.41) is present and has any value other than "identity", then the transfer-length is
defined by use of the "chunked" transfer-coding (section 3.6), unless
the message is terminated by closing the connection.
3. If a Content-Length header field (section 14.13) is present, its decimal value in OCTETs represents both the entity-length and the
transfer-length. The Content-Length header field MUST NOT be sent if
these two lengths are different (i.e., if a Transfer-Encoding
header field is present). If a message is received with both a
Transfer-Encoding header field and a Content-Length header field,
the latter MUST be ignored.
4. If the message uses the media type "multipart/byteranges", and the transfer-length is not otherwise specified, then this self-delimiting
media type defines the transfer-length. This media type MUST NOT be
used unless the sender knows that the recipient can parse it; the
presence in a request of a Range header with multiple byte-range
specifiers from a 1.1 client implies that the client can parse
multipart/byteranges responses.
A range header might be forwarded by a 1.0 proxy that does not
understand multipart/byteranges; in this case the server MUST
delimit the message using methods defined in items 1,3 or 5 of
this section.
5. By the server closing the connection. (Closing the connection cannot be used to indicate the end of a request body, since that would leave
no possibility for the server to send back a response.)
For compatibility with HTTP/1.0 applications, HTTP/1.1 requests
containing a message-body MUST include a valid Content-Length header
field unless the server is known to be HTTP/1.1 compliant. If a
request contains a message-body and a Content-Length is not given, the
server SHOULD respond with 400 (bad request) if it cannot determine
the length of the message, or with 411 (length required) if it wishes
to insist on receiving a valid Content-Length.
All HTTP/1.1 applications that receive entities MUST accept the
"chunked" transfer-coding (section 3.6), thus allowing this mechanism
to be used for messages when the message length cannot be determined
in advance.
Messages MUST NOT include both a Content-Length header field and a
non-identity transfer-coding. If the message does include a
non-identity transfer-coding, the Content-Length MUST be ignored.
When a Content-Length is given in a message where a message-body is
allowed, its field value MUST exactly match the number of OCTETs in
the message-body. HTTP/1.1 user agents MUST notify the user when an
invalid length is received and detected.
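The order of precedence in the quoted section could be turned into a small decision function like the one below. This is a sketch only: BodyFraming and ResponseFraming are names invented here, the header map is assumed to contain lower-cased field names, and rule 4 (multipart/byteranges) is left out for brevity.

```cpp
#include <map>
#include <string>

enum class BodyFraming { None, Chunked, ContentLength, UntilClose };

// Decides how the end of a response body is found, following rules 1, 2, 3
// and 5 of RFC 2616 section 4.4 in that order.
BodyFraming ResponseFraming(const int statusCode,
                            const std::string& requestMethod,
                            const std::map<std::string, std::string>& headers)
{
    // Rule 1: these responses never carry a body, whatever the headers say.
    if ((statusCode >= 100 && statusCode < 200) || statusCode == 204 ||
        statusCode == 304 || requestMethod == "HEAD")
    {
        return BodyFraming::None;
    }
    // Rule 2: a non-identity Transfer-Encoding wins over Content-Length.
    const auto transferEncoding = headers.find("transfer-encoding");
    if (transferEncoding != headers.end() && transferEncoding->second != "identity")
    {
        return BodyFraming::Chunked;
    }
    // Rule 3: an explicit byte count.
    if (headers.count("content-length") != 0)
    {
        return BodyFraming::ContentLength;
    }
    // Rule 5: the server signals the end by closing the connection.
    return BodyFraming::UntilClose;
}
```

Only the UntilClose case needs to wait for EOF; in the other cases the proxy knows when the response ends without waiting for the server to disconnect.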

Broken HTML - browsers don't download whole HTTP response from my webserver, CURL does

Symptom
I think I messed something up, because both Mozilla Firefox and Google Chrome produce the same error: they don't receive the whole response the webserver sends them. CURL never misses; the last line of the quick-scrolling response is always "</html>".
Reason
The reason is that I send the response in several parts:
sendHeaders(); // calls sendResponse with a fixed header
sendResponse(html_opening_part);
for ( ...scan some data... ) {
sendResponse(the_data);
} // for
sendResponse(html_closing_part)
The browsers stop receiving data between sendResponse() calls. Also, the webserver does not close() the socket until the very end.
(Why I'm doing it this way: the program I'm writing is designed for a non-Linux system; it will run on an embedded computer. It does not have much memory, most of which is occupied by the lwIP stack. So, to avoid collecting the relatively huge webpage in memory, I send it in parts. Browsers like it there; no broken HTML occurred as it does under Linux.)
Environment
The platform is GNU/Linux (Ubuntu 32-bit with 3.0 kernel). My small webserver sends the stuff back to the client standard way:
int sendResponse(char* data,int length) {
int x = send(fd,data,length,MSG_NOSIGNAL);
if (x == -1) {
perror("this message never printed, so there's no error \n");
if (errno == EPIPE) return 0;
if (errno == ECONNRESET) return 0;
... panic() ... (never happened) ...
} // if send()
} // sendResponse()
And here's the fixed header I am using:
sendResponse(
"HTTP/1.0 200 OK\n"
"Server: MyTinyWebServer\n"
"Content-Type: text/html; charset=UTF-8\n"
"Cache-Control: no-store, no-cache\n"
"Pragma: no-cache\n"
"Connection: close\n"
"\n"
);
Question
Is this normal? Do I have to send the whole response with a single send()? (Which I'm working on now, until a quick solution arrives.)

If you read RFC 2616, you'll see that you should be using CR+LF for the ends of lines.
Aside from that, open the browser developer tools to see the exact requests they are making. Use a tool like Netcat to duplicate the requests, then eliminate each header in turn until it starts working.

Gotcha!
As @Jim advised, I tried sending the same headers with CURL as Mozilla does: fail, broken pipe, etc. I deleted half of the headers: okay. I added them back one by one: fail. Deleted another half of the headers: okay... So the error occurs only if the header is too long. Bingo.
As I said, there is very little memory in the embedded device. So I don't read the whole request header, only 256 bytes of it. I need only the GET params and the "Host" header (and I don't really need even that, just to perform redirects with the same "Host" instead of the IP address).
So, if I don't recv() the whole request header, I cannot send() back the whole response.
Thanks for your advice, dudes!
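On a device with very little memory, the rest of the request header can be read and thrown away without storing it. Here is a minimal sketch of that idea: a scanner that remembers only how far into "\r\n\r\n" it has got, so it can be fed small recv() chunks one after another. HeaderEndScanner is a name invented for this example; keeping the first 256 bytes for parsing would be done separately.

```cpp
#include <cstddef>

// Finds the end of the request header across several recv() chunks while
// keeping only a few bytes of state.
class HeaderEndScanner
{
public:
    // Feeds one chunk. Returns the number of bytes of this chunk that belong
    // to the header (up to and including the terminator) once the terminator
    // is seen, or 0 if the header continues in the next chunk.
    std::size_t Feed(const char* data, const std::size_t length)
    {
        if (done_) return 0;
        for (std::size_t i{0}; i < length; ++i)
        {
            // The pattern "\r\n\r\n" expects '\r' at even and '\n' at odd positions.
            const char expected{(matched_ % 2 == 0) ? '\r' : '\n'};
            if (data[i] == expected)
            {
                if (++matched_ == 4)
                {
                    done_ = true;
                    return i + 1;
                }
            }
            else
            {
                // A stray '\r' can still start a new match; anything else resets.
                matched_ = (data[i] == '\r') ? 1 : 0;
            }
        }
        return 0;
    }

    bool Done() const { return done_; }

private:
    int matched_{0};
    bool done_{false};
};
```

The server would keep calling recv() into a small scratch buffer and feeding it to the scanner until Done() is true, and only then start sending the response.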
