In which order should I put LEN/CRC/DATA in a message? Should CRC protect the LEN field?

In which order should I put LEN/CRC/DATA in a message? Should CRC protect the LEN field? - crc

There's a section (2.5) in Xz format inadequate for long-term archiving:
According to Koopman (p. 50), one of the "Seven Deadly Sins" (i.e.,
bad ideas) of CRC and checksum use is failing to protect a message
length field. This causes vulnerabilities due to framing errors. Note
that the effects of a framing error in a data stream are more serious
than what Figure 1 suggests. Not only data at a random position are
interpreted as the CRC. Whatever data that follow the bogus CRC will
be interpreted as the beginning of the following field, preventing the
successful decoding of any remaining data in the stream.
He talks about this case, when a message is like this:
ID LEN DATA CRC
If LEN is damaged, then a CRC at a random position will be used. But I fail to see, why it is a problem. At that random position, almost surely there will be an invalid CRC value, so the error is detected.
And he talks about decoding the following data. I fail to see, if the LEN is protected, how one is able to decode the following data either. If LEN is damaged, we cannot find the next message in both cases.
For example, PNG doesn't protect the length field.
So, why is it obviously better, when a LEN field is protected by CRC?
If I were to design a message structure, which is the best way to do that? What order should I use, and what should I protect with CRC? Suppose that the message has the following parts:
message type ID (variable length integer)
message length (variable length integer)
CRC
the message data itself
My current design is this:
CRC, protects the whole message
message type ID (variable length integer)
message length (variable length integer)
the message data itself
Is there any drawback of this method?

What Koopman actually says (here) is:
Failing to protect message length field
Results in pointing to data as FCS, giving HD=1
HD is the Hamming distance, meaning that the probability of a false positive can go up significantly on a low bit-error-rate stream if you look at part of the data as the (faux) check value, instead of the actual check value. To really do it right, you should protect the length field and other header values with their own check value before the data.
As for your design, putting the CRC first has the disadvantage of having to buffer all of the message to compute the CRC before you can write the message in a stream. You could do type id, length, header crc, message, message crc.

Related

Step function exceeding the maximum number of characters service limit

My state in a step function flow returns an error of state/task returned a result with a size exceeding the maximum number of characters service limit.. In the step function documentation, the limit for characters for input/output is 32,768 characters. Upon checking the total characters of my result data if falls below the limit. Are there any other scenarios that it will throw that error? Thanks!

2020-09-29 Edit: Step Functions now supports 256KB payloads!
256KB is the maximum size of the payload that can be passed between states. You could also exceed this limit from a Map or Parallel state, whose final output is an array with the output of each iteration or branch.
https://aws.amazon.com/about-aws/whats-new/2020/09/aws-step-functions-increases-payload-size-to-256kb
The recommended solution from the Step Functions documentation is to store the data somewhere else (e.g. S3) and pass around the ARN instead of raw JSON.
https://docs.aws.amazon.com/step-functions/latest/dg/avoid-exec-failures.html
You can also use OutputPath to reduce the output to the fields you want to pass to the next state.
https://docs.aws.amazon.com/step-functions/latest/dg/input-output-outputpath.html

Most efficient way to use AWS SQS (with Golang)

When using the AWS SQS (Simple Queue Service) you pay for each request you make to the service (push, pull, ...). There is a maximum of 256kb for each message you can send to a queue.
To save money I'd like to buffer messages sent to my Go application before I send them out to SQS until I have enough data to efficiently use the 256kb limit.
Since my Go application is a webserver, my current idea is to use a string mutex and append messages as long as I would exceed the 256kb limit and then issue the SQS push event. To save even more space I could gzip every single message before appending it to the string mutex.
I wonder if there is some kind of gzip stream that I could use for this. My assumption is that gzipping all concatenated messages together will result in smaller size then gzipping every message before appending it to the string mutex. One way would be to gzip the string mutex after every append to validate its size. But that might be very slow.
Is there a better way? Or is there a total better approach involving channels? I'm still new to Go I have to admit.

I'd take the following approach
Use a channel to accept incoming "internal" messages to a go routine
In that go routine keep the messages in a "raw" format, so 10 messages is 10 raw uncompressed items
Each time a new raw item arrives, compress all the raw messages into one. If the size with the new message > 256k then compress messages EXCEPT the last one and push to SQS
This is computationally expensive. Each individual message causes a full compression of all pending messages. However it is efficient for SQS use

You could guesstimate the size of the gzipped messages and calculate whether you've reached the max size threshold. Keep track of a message size counter and for every new message increment the counter by it's expected compressed size. Do the actual compression and send to SQS only if your counter will exceed 256kb. So you could avoid compressing every time a new message comes in.
For a use-case like this, running a few tests on a sample set of messages should give the rough percentage of compression expected.

Before you get focused on compression, eliminate redundant data that is known on both sides. This is what encodings like msgpack, protobuf, AVRO, and so on do.
Let's say all of your messages are a struct like this:
type Foo struct {
bar string
qux int
}
and you were thinking of encoding it as JSON. Then the most efficient you could do is:
{"bar":"whatever","qux",123}
If you wanted to just append all of those together in memory, you might get something like this:
{"bar":"whatever","qux",123}{"bar":"hello world","qux",0}{"bar":"answer to life, the universe, and everything","qux",42}{"bar":"nice","qux",69}
A really good compression algorithm might look at hundreds of those messages and identify the repetitiveness of {"bar":" and ","qux",.
But compression has to do work to figure that out from your data each time.
If the receiving code already knows what "schema" (the {"bar": some_string, "qux": some_int} "shape" of your data) each message has, then you can just serialize the messages like this:
"whatever"123"hello world"0"answer to life, the universe, and everything"42"nice"69
Note that in this example encoding, you can't just start in the middle of the data and unambiguously find your place. If you have a bunch of messages such as {"bar":"1","qux":2}, {"bar":"2","qux":3}, {"bar":"3","qux":4}, then the encoding will produce: "1"2"2"3"3"4, and you can't just start in the middle and know for sure if you're looking at a number or a string - you have to count from the ends. Whether or not this matters will depend on your use case.
You can come up with other simple schemes that are more unambiguous or make the code for writing or reading messages easier or simpler, like using a field separator or message separator character which is escaped in your encoding of the other data (just like how \ and " would be escaped in quoted JSON strings).
If you can't have the receiver just know/hardcode the expected message schema - if you need the full flexibility of something like JSON and you always unmarshal into something like a map[string]interface{} or whatever - then you should consider using something like BSON.
Of course, you can't use msgpack, protobuf, AVRO, or BSON directly - they need a medium that allows arbitrary bytes like 0x0. And according to the AWS SQS FAQ:
Q: What kind of data can I include in a message?
Amazon SQS messages can contain up to 256 KB of text data, including XML, JSON and unformatted text. The following Unicode characters are accepted:
#x9 | #xA | #xD | [#x20 to #xD7FF] | [#xE000 to #xFFFD] | [#x10000 to #x10FFFF]
So if you want to aim for maximum space efficiency for your exact usecase, you'd have to write your own code which use the techniques from those encoding schemes, but only used bytes which bytes are allowed in SQS messages.
Relatedly, if you have a lot of integers, and you know most of them are small (or clump around a certain spot of the number line, so that by adding a constant offset to all of them you can make most of them small), you can use one of the variable length quantity techniques to encode all of those integers. In fact several of those common encoding schemes mentioned above use variable length quantities in their encoding of integers. If you use a "piece size" of six (6) bits (instead of the standard implicitly assumed piece size of eight (8) bits to match a byte) then you can use base64. Not full base64 encoding, because the padding will completely defeat the purpose - just map from the 64 possible values that fit in six bits to the 64 distinct ASCII characters that base64 uses.
Anyway, unless you know your data has a lot repetition (but not the kind that you can just not send, like the same field names in every message) I would start with all of that, and only then would I look at compression.
Even so, if you want minimal size, I would aim for LZMA, and if you want minimal computing overhead, I would use LZ4. Gzip is not bad per se - if it's much easier to use gzip then just use it - but if you're optimizing for either size or for speed, there are better options. I don't know if gzip is even a good "middle ground" of speed and output size and working memory size - it's pretty old and maybe there's compression algorithms which are just strictly superior in speed and output and memory size by now. I think gzip, depending on implementation, also includes headers and framing information (like version metadata, size, checksums, and so on), which if you really need to minimize for size you probably don't want, and in the context of SQS messages you probably don't need.

I wonder about crc error probability. How can I get 2^(-n)?

I wonder about crc error probability.
In most papers, crc error rate is described like 1-2(-n)
For example, the probability of crc-16 is 1-2(-16),
so 2(-16)=1∕65536=0.0015%, prob = 99.9984%
I want to know how I can get this formula: 2^(-n).
If 2(-n) is correct rate, the rate of crc-16 and crc-ccitt is same?
And if message bit is bigger than before, the rate is same?

For an n-bit CRC, there are 2n possible values of that CRC. Therefore the probability that a message with random errors applied, regardless of the length of the message (so long as it's four bytes or more), has the same CRC as the original message is 2-n. This true for any hash function, including any variant of a CRC, that mixes the input bits well into the output.

struct.unpack for network byte order binary encoded numbers

I am totally new to Python. I have to parse a .txt file that contains network byte order binary encoded numbers (see here for the details on the data). I know that I have to use the package struct.unpack in Python. My questions are the following:
(1) Since I don't really understand how the function struct.unpack works, is it straight forward to parse the data? By that, I mean that if you look at the data structure it seems that I have to write a code for each type of messages. But if I look online for the documentation on struct.unpack it seems more straight forward but I am not sure how to write the code. A short sample would be appreciated.
(2) What's the best practice once I parse the data? I would like to save the parsed file in order to avoid parsing the file each time I need to make a query. In what format should I keep the parsed file that would be the most efficient?

This should be relatively straight forward. I can't comment on how you're actually supposed to get the byte encoded packets of information, but I can help you parse them.
First, here's a list of some of the packet types you'll be dealing with that I gathered from section 4 of the documentation:
TimeStamp
System Event Message
Stock Related Messages
Stock Directory
Stock Trading Action
Reg SHO Short Sale Price Test Restricted Indicator
Market Participant Position
Add Order Message
This continues on. But as an example, let's see how to decode one or two of these:
System Event Message
A System Event Message packet has 3 portions, which is 6 bytes long:
A Message Type, which starts at byte 0, is 1 byte long, with a Value of S (a Single Character)
A TimeStamp, which starts at byte 1, is 4 bytes long, and should be interpreted an in Integer.
An Event Code, which starts at byte 5, is 1 byte long and is a String (Alpha).
Looking up each type in the struct.unpack code table, we'll need to build a string to represent this sequence. First, we have a Character, then a 4Byte Unsigned Integer, then another Character. This corresponds to the encoding and decoding string of "cIc".
*NOTE: The unsigned portion of the Integer is documented in Section 3: Data Types of their documentation
Construct a fake packet
This could probably be done better, but it's functional:
>>> from datetime import datetime
>>> import time
>>> data = struct.pack('cIc', 'S', int(time.mktime(datetime.now().timetuple())), 'O')
>>> print repr(data) # What does the bytestring look like?
'S\x00\x00\x00\xa6n\x8dRO' # Yep, that's bytes alright!
Unpack the data
In this example, we'll use the fake packet above, but in the real world we'd use a real data response:
>>> response_tuple = struct.unpack('cIc', data)
>>> print(repr(response_tuple))
('S', 1385000614, 'O')
In this case, the 3rd item in the tuple (the 'O') is a key, to be looked up in another table called System Event Codes - Daily and System Event Codes - As Needed.
If you need additional examples, feel free to ask, but that's the jist of it.
Recommendations on how to store this data. Well, I suppose that depends on what you'd like to do long term to this data. Probably, a database makes sense here. However, without further information, I cannot say.
Hope that helps!

How can I obfuscate/de-obfuscate integer properties?

My users will in some cases be able to view a web version of a database table that stores data they've entered. For various reasons I need to include all the stored data, including a number of integer flags for each record that encapsulate adjacencies and so forth within the data (this is for speed and convenience at runtime). But rather than exposing them one-for-one in the webview, I'd like to have an obfuscated field that's just called "reserved" and contains a single unintelligible string representing those flags that I can easily encode and decode.
How can I do this efficiently in C++/Objective C?
Thanks!

Is it necessary that this field is exposed to the user visually, or just that it’s losslessly captured in the HTML content of the webview? If possible, can you include the flags as a hidden input element with each row, i.e., <input type=“hidden” …?

Why not convert each of the fields to hex, and append them as a string and save that value?
As long as you always append the strings in the same order, breaking them back apart and converting them back to numbers should be trivial.

Use symmetric encryption (example) to encode and decode the values. Of course, only you should know of the key.
Alternatively, Assymetric RSA is more powerfull encryption but is less efficient and is more complex to use.
Note: i am curios about the "various reasons" that require this design...

Multiply your flag integer by 7, add 3, and convert to base-36. To check if the resulting string is modified, convert back to base-2, and check if the result modulo 7 is still 3. If so, divide by 7 to get the flags. note that this is subject to replay attacks - users can copy any valid string in.

Just calculate a CRC-32 (or similar) and append it to your value. That will tell you, with a very high probability, if your value has been corrupted.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js