Quick question - is there a maximum size for the Status-Line of an HTTP response?
In the RFC I could not find this information, just something like this:
Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF
According to this I could assume:
HTTP-Version is usually 8 Bytes (e.g. HTTP/1.1)
Status-Code is 3 Bytes
2 Spaces + CRLF is 4 Bytes
Reason-Phrase -> the longest one listed in the RFC is "Requested range not satisfiable", so 31 Bytes
This would be a sum of 46 Bytes.
Is this assumption correct or did I miss anything?
UPDATE:
Due to the answer below, I just want to specify my problem a bit:
I am parsing a kind of log file containing TCP messages from a server. It holds some random data I don't care about and some HTTP messages which I want to read. I currently scan all the data I get for a \r\n to find the Status-Line. Since I have to assume that a header may be split across several TCP packets, I just buffer all data and parse it.
If there is no maximum size for the Status-Line, I need to buffer all data until the next \r\n occurs. In the worst case this means I keep kilobytes upon kilobytes of random data, since it could (but most likely will not) be part of the Status-Line.
Or would it, in this case, be more appropriate to search for the HTTP-Version string instead of the CRLF?
RFC 2616, 6.1.1:
The reason phrases listed here are only recommendations -- they MAY be
replaced by local equivalents without affecting the protocol.
Aside from this, the HTTP protocol is "allowed" to add more status codes (in a new RFC) without changing the HTTP version to 1.2, provided that the new codes don't introduce additional requirements on HTTP clients. Clients are supposed to treat an unknown status code as if it were x00 (where x is the first digit of the code they get, indicating the category of response), except that they shouldn't cache the response.
So the only limit is the max length of an HTTP header line or of the response headers in total. As far as I can see, the RFC doesn't define any limit, although specific servers impose their own.
What you can be sure of is that the user-agent may ignore the Reason Phrase entirely. So if it's big, you can read it in small pieces and throw them away one at a time until you reach CRLF. If you want to display a human-readable message, mostly you can use the recommended Reason Phrase for the status code that the server provides, regardless of what Reason Phrase the server sends.
I don't think there is any limit on the length of the Reason-Phrase. The W3C doc states it is a "short textual description", but that is not a normative limit.
I would not assume the version is 8 characters. A future version could have more digits, e.g. HTTP/10.1. The syntax specifies that HTTP-Version is delimited by a SPACE, so I would parse it by stopping at the first SPACE.
https://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html
The Reason-Phrase is intended to give a short textual description of the Status-Code. The Status-Code is intended for use by automata and the Reason-Phrase is intended for the human user. The client is not required to examine or display the Reason-Phrase.
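To make the buffering question concrete, here is a minimal sketch in Python of one way to combine the suggestions above: look for the HTTP-Version token rather than only for CRLF, and cap how much unmatched data is kept. The class name `StatusLineScanner` and the 8192-byte cap are assumptions made for illustration; no RFC prescribes that limit.

```python
import re

# Assumed cap on how much data to hold while waiting for CRLF (not from any RFC).
MAX_STATUS_LINE = 8192

# HTTP-Version SP Status-Code SP Reason-Phrase CRLF; version digits are not
# assumed to be single characters.
STATUS_LINE = re.compile(rb"(HTTP/\d+\.\d+) (\d{3}) ([^\r\n]*)\r\n")


class StatusLineScanner:
    """Hypothetical helper: feed it raw chunks, get parsed status lines back."""

    def __init__(self, max_len=MAX_STATUS_LINE):
        self.buf = b""
        self.max_len = max_len

    def feed(self, chunk):
        self.buf += chunk
        found = []
        end = 0
        for m in STATUS_LINE.finditer(self.buf):
            found.append((m.group(1).decode("ascii"),
                          int(m.group(2)),
                          m.group(3).decode("latin-1")))
            end = m.end()
        rest = self.buf[end:]
        # Keep only data that could still be the start of a status line.
        start = rest.find(b"HTTP/")
        while start != -1 and len(rest) - start > self.max_len:
            # This candidate has grown past the cap without a CRLF; skip it.
            start = rest.find(b"HTTP/", start + 1)
        if start == -1:
            # Keep the last 4 bytes in case "HTTP/" is split across chunks.
            rest = rest[-4:]
        else:
            rest = rest[start:]
        self.buf = rest
        return found
```

With this approach random data is discarded as soon as it cannot be the beginning of a status line, so the buffer stays small even if no CRLF ever arrives. A false positive is still possible if the random data happens to contain something shaped like a status line.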
I am writing a DNS reply parser with libpcap and find that some CNAMEs' TLDs seem to be missing from the corresponding DNS packet payload. One example is shown in an example packet's wireshark dissection where wireshark shows the actual CNAME is
prd-push-access-net5-175542503.us-east-1.elb.amazonaws.com
but I can only find
prd-push-access-net5-175542503.us-east-1.elb.amazonaws
(i.e. no ".com") in the corresponding part of the payload. How can one (and how does Wireshark) parse the full CNAME (with ".com") out of this payload?
(Also this CNAME seems malformed since per RFC1035, a QNAME in question section should "terminates with the zero length octet for the null label of the root" and I guess the same applies for CNAME?)
DNS packets use name compression, see https://www.rfc-editor.org/rfc/rfc1035 section 4.1.4
In many places (where names appear), each label can be represented by a pointer to a former place in the packet where it appears already, instead of the string.
In your example, we can clearly see com in myfoscam.com earlier in the packet.
So with the content 03656c6209616d617a6f6e617773c019c02e00 (only the end of the name, because it is tedious to extract data from an image; you should have pasted things as text), we have to analyze it like this:
03: the following is a string of length 3
656c62: this is the string elb, length 3 as advertised
09: the following is a string of length 9
616d617a6f6e617773: this is the string amazonaws
c0: this has the first two bits set to 1 (its value is 192, so greater than or equal to 128+64), which means it is the first byte of a two-byte pointer. Hence c019 is a pointer to offset 25 in decimal (19 in hexadecimal) into the packet.
So if you start from the whole packet, and switch to offset 25, you should find the sequence 03636f6d which is com (with the prefix of a length of 3).
Note that a compression pointer always ends a name: per RFC 1035 section 4.1.4, a name is either a sequence of labels ending in a zero octet, a pointer, or a sequence of labels ending with a pointer. So the CNAME in the payload stops at c019, and the 00 null label that terminates it (the root, the hidden . at the end of any name) is the one found after com at the pointed-to location, i.e. 03636f6d00 at offset 25.
The c02e that follows is another pointer, to offset 46 in the message, and most likely belongs to the next record rather than to this name. See the example in the RFC (and/or provide the whole packet content as text in your question).
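The label-and-pointer walk described above can be sketched in Python as follows. The function name `read_name` is hypothetical, and the demo packet is synthetic: it is padded so that 03636f6d00 sits at offset 25, which is what the answer expects to find there, not the asker's real capture.

```python
def read_name(packet, offset):
    """Decode a possibly compressed DNS name starting at `offset`.

    Returns (dotted_name, offset_after_name), where the second value is the
    position just past the name as it appears at the original offset.
    """
    labels = []
    after = None          # where parsing resumes once the name is done
    hops = 0
    while True:
        length = packet[offset]
        if length & 0xC0 == 0xC0:
            # Two-byte pointer: the low 14 bits are an offset into the packet.
            if after is None:
                after = offset + 2
            offset = ((length & 0x3F) << 8) | packet[offset + 1]
            hops += 1
            if hops > 64:     # arbitrary guard against pointer loops
                raise ValueError("too many compression pointers")
            continue
        if length == 0:
            # Null label: the root, end of the name.
            if after is None:
                after = offset + 1
            return ".".join(labels), after
        offset += 1
        labels.append(packet[offset:offset + length].decode("ascii"))
        offset += length


# Synthetic packet: 25 padding bytes, "com" + null label at offset 25,
# then the tail of the name from the question ending in the pointer c019.
packet = bytes(25) + b"\x03com\x00" + bytes.fromhex("03656c6209616d617a6f6e617773c019")
name, after = read_name(packet, 30)
print(name)  # elb.amazonaws.com
```

The returned offset matters when walking a whole record: after a pointer, parsing continues two bytes past the pointer in the original position, not at the place the pointer jumped to.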
The RFC822/RFC2822 standard says that "Header fields are lines composed of a field name, followed by a colon (':'), followed by a field body, and terminated by CRLF".
But I see at least one RFC822 MIME parser that auto-normalizes payloads that use LF ("\n") into CRLF ("\r\n") before proceeding with parsing.
How safe is it to use an RFC822 format for serializing data that may have been hand-edited in places to use LF instead of CRLF? Would it be safe to send this data around to different programs & expect them to be able to parse it with various RFC822 parser libraries?
In the general case, not safe at all. Be conservative in what you send / generate.
Having said that, most Unix tools expect locally stored email files to use local line ending conventions. RFC5322 really only codifies the format used on the wire.
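If you control the sending side, one conservative option is to normalize line endings to CRLF before the data leaves your program. The following Python sketch assumes the data is held as bytes; `to_crlf` is a hypothetical helper name, and the standard library `email` parser is used only to show that the normalized result still parses.

```python
import re
from email import message_from_bytes


def to_crlf(data):
    """Rewrite bare LF (and existing CRLF) line endings as CRLF."""
    return re.sub(rb"\r?\n", b"\r\n", data)


hand_edited = b"Subject: hi\nX-Id: 42\r\n\nbody\n"
wire = to_crlf(hand_edited)
msg = message_from_bytes(wire)
print(msg["Subject"], msg["X-Id"])
```

This does not make lenient parsing on the receiving side unnecessary, but it avoids depending on every downstream library tolerating bare LF.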
In the content retrieved with ColdFusion http object there are some characters that are returned as question marks; namely these are roman numerals (like Ⅱ) which are displayed without problems when I visit the same page with a browser.
The server I make the request to does not seem to provide any charset information in the response headers (the value of Content-Type is just "text/html" and the charset property in the result of cfhttp is blank), but the encoding is declared in the page's HTML as "charset=EUC-JP" (it is a page in Japanese). So I make the request with charset set to EUC-JP.
The content in Japanese (Japanese characters) is retrieved correctly, but the roman numerals are turned into question marks.
I tried requesting with charset set to UTF-8, but in this case everything gets scrambled. To me it seems that those roman numerals are Unicode, so my understanding is that the server I make the request to mixes encodings (but I may be wrong about this).
How do I get those special characters to display correctly in the fileContent of cfhttp?
Thanks!
The only way I can think of is to make 2 requests with the different encodings and then merge the data together. The first request would use a charset of EUC-JP and the second UTF-8. After the second request, look through the content from the first, and for every question mark look up the corresponding character in the second. For example, when you hit the 5th question mark in the first set of content, look for the 5th roman numeral in the second set. It's not guaranteed to work, but it's all I can think of.
I have a program that searches the reply from a curl request for specific strings. I sometimes get gzipped data. Is there a way to find out whether the reply is text or in gzipped format?
The headers sometimes contain a gzip/deflate header, but it's not consistent. Is there a way to inspect the body itself and find out whether it's gzipped?
You could try taking a look at the first two bytes of data. For gzipped data, they should be 0x1f, 0x8b.
Member header and trailer
ID1 (IDentification 1)
ID2 (IDentification 2)
These have the fixed values ID1 = 31 (0x1f, \037), ID2 = 139 (0x8b, \213),
to identify the file as being in gzip format.
You could look at the first bytes of the file. Perhaps they contain a magic number.
The gzip file format starts with some "magic bytes". You can check whether the body starts with these, and if it does, push back the bytes into the stream and start unzipping it.
You could pipe it through zcat, and if it fails, use the string as is. Sloppy I know, but it ought to be reliable; a plain text file would never contain valid gzipped data.
Standards-compliant HTTP responses will contain a Content-Encoding: or Transfer-Encoding: header specifying "gzip" for compressed responses, eliminating the need to guess by looking at the magic number. Unfortunately, lots of sites get these headers wrong.
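The magic-byte check and the "try to decompress, fall back to the raw data" idea from the answers above can be combined. This is a minimal Python sketch, assuming the whole body is already in memory as bytes; `body_as_text` is a hypothetical name and UTF-8 is an assumed default encoding.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # ID1, ID2 from the gzip member header


def body_as_text(body, encoding="utf-8"):
    """Return the body as text, decompressing it first if it looks gzipped."""
    if body[:2] == GZIP_MAGIC:
        try:
            body = gzip.decompress(body)
        except (OSError, EOFError, zlib.error):
            # Magic bytes matched but the data is not valid gzip: use it as is.
            pass
    return body.decode(encoding, errors="replace")


print(body_as_text(gzip.compress(b"hello")))  # hello
print(body_as_text(b"plain text"))            # plain text
```

The two-byte check is cheap and avoids attempting decompression on every response, while the fallback covers data that merely happens to start with those two bytes.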
I'm working on a web application that parses and displays email messages in a threaded format (among other things). Emails may come from any number of different mail clients, and in either text or HTML format.
Given that most people have a tendency to top post, I'd like to be able to hide the duplicated message in an email reply in a manner similar to how Gmail does it (e.g. "show quoted text").
Determining which part of the message is the reply is somewhat challenging. Personally, I use "> " delimiters at the beginning of the quoted text when replying. I created a regexp that looks for these lines and wraps a div around them to allow some JS to hide or show this block of text.
I then noticed that Outlook doesn't use the "> " characters by default, it simply adds a header block above the reply with the summary of the headers (From, Subject, Date, etc.). The reply is untouched. I can match on this and hide the rest of the email, working with the assumption that it's a top quote.
I then looked at Thunderbird, and it uses "> " for text, and <blockquote> for HTML mails. I still haven't looked at what Apple Mail does, what Notes does, or what any of the other millions of mail clients out there do.
Will I be writing a special case regexp for every single client out there? or is there something I'm missing?
Any suggestions, sample code or pointers to third party libraries much appreciated!
It'll be pretty hard to duplicate the way gmail does it since it doesn't care about whether it was a quoted piece or not, like Zac says, it just seems to care about the diff.
It's actually pretty hard to get this right 100% of the time. Plain text email is "lossy"; it's entirely possible for you to send
> Here is my long line that is over 74 chars (email line length limit)
Which can get encoded as something like
> Here is my long line that is over 74 chars (email=
line length limit)
And then is decoded as
> Here is my long line that is over 74 chars (email
line length limit)
Making it indistinguishable from an inline-reply.
This is email, so variations abound. Email usually line-wraps at something like 74 characters, and encoding schemes can differ. It's a real PITA. If you can access the HTML version, you will probably have better luck looking for quote tags and the like. Another idea would be to parse both the plain text and HTML versions to try to determine the boundaries.
Additionally, it's best to just plan for specific client hacks. They all construct MIME messages differently, both in structure and header content.
Edit: I say this with the experience of writing an email processing system as well as seeing several people try to do the -exact- thing you're doing. It always only got "ok" results.
From what I can tell, gmail does not bother about prefixed lines or section headings, except to ignore them. If the text lines appeared earlier in the thread, and then reappear, it is considered to be quoted. Thus, e.g., if you send multiple messages and don't change your signature, the signature is considered to be quoted. If you've already dealt with the '>' prefix, a simple diff should do most of the rest. No need to get fancy.
The first thing I think I'd do is strip out all the whitespace (or collapse it to a single space between words) and the special characters from both blocks, then look for the old one in the new one.
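A rough Python sketch of the "seen it earlier in the thread" idea from the two answers above follows. It compares line by line rather than running a full diff, which is a simplification, and the names `normalize` and `mark_quoted` are hypothetical. Stripping leading "> " markers and collapsing whitespace are the only normalizations applied here.

```python
import re


def normalize(line):
    """Drop leading '>' quote markers and collapse runs of whitespace."""
    line = re.sub(r"^(\s*>)+\s?", "", line)
    return re.sub(r"\s+", " ", line).strip()


def mark_quoted(new_message, earlier_messages):
    """Return (line, is_quoted) pairs for each line of new_message.

    A line counts as quoted if its normalized form already appeared in any
    earlier message of the thread.
    """
    seen = set()
    for message in earlier_messages:
        for line in message.splitlines():
            key = normalize(line)
            if key:
                seen.add(key)
    return [(line, normalize(line) in seen)
            for line in new_message.splitlines()]


earlier = ["Can we meet on Friday?\n-- \nAlice"]
reply = "Sure, Friday works.\n\n> Can we meet   on Friday?\n-- \nAlice"
for line, quoted in mark_quoted(reply, earlier):
    print("QUOTED" if quoted else "NEW   ", repr(line))
```

As noted above, an unchanged signature is flagged as quoted too, since it already appeared earlier in the thread. Lines that were re-wrapped by a mail client will not match line by line, so a real implementation would likely need to compare on longer normalized runs of text.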
Here's a mozdev project that may be helpful for others who stumble across this page looking for a Thunderbird solution:
http://quotecollapse.mozdev.org/