Error correcting codes - C++

I need to use an error correcting technique on short messages (between 100 and 200 bits). Space available to add the redundant bits is constrained to 20-50%.
I will have to implement the coding and decoding in C/C++, so it needs to be either open source or sufficiently easy to program. (I have had some experience in the past with decoding algorithms - they are dreadful!)
Can anyone suggest a suitable error-correcting code to use (with relevant parameters)?

Take a look at Reed Solomon error correction.
Sample implementation in C++ is available here.
For a different option look here - see item #11
EDIT: If you want a commercial library - http://www.schifra.com/faq.html

Reed-Solomon encoders are described in the form RS(CAPACITY,PAYLOAD). The capacity is always 2^SYMBOL-1, where SYMBOL is the number of bits in each Reed-Solomon symbol. Quite often, this SYMBOL size is 8 bits (a normal byte). It can typically be anything from 3 to 16 bits. For an 8-bit symbol, the Reed-Solomon encoder will be named RS(255,PAYLOAD).
The PAYLOAD is the number of non-parity symbols. If you want 4 parity symbols, you would specify RS(255,251).
To effectively correct errors in your data block, you must first package the data as symbols (groups of bits, quite often just 8-bit bytes). Your goal is to try to arrange (if possible) for any errors to be clustered into the smallest number of symbols possible.
For example, if an error occurs on average every 8 bits, then an 8-bit symbol will not be appropriate; pretty much every symbol will have an error! You might go for 4-bit symbols and use an RS(15,11) codec -- for up to 11 4-bit symbols at a time, producing 4 parity symbols per block. The smaller the symbol size, the lower the CAPACITY (eg. for a SYMBOL size of 4 bits, 2^4-1 == 15 symbol CAPACITY).
But typically, you would use 8-bit symbols. If you have a more realistic error rate of, say, 10% of your 8-bit symbols being erroneous, then you might use an RS(255,205) -- 50 parity symbols per 255 symbol Reed-Solomon "codeword", with a maximum PAYLOAD of 205 bytes. This gives us ~25% parity, allowing us to correct a codeword containing up to ~12.5% errors.
Using https://github.com/pjkundert/ezpwd-reed-solomon's c++/ezpwd/rs Reed-Solomon API, you would specify this as:
#include <ezpwd/rs>
...
ezpwd::RS<255,205> rscodec;
Put your data in a std::string (it can handle raw 8-bit binary data just fine) or a std::vector and call the API, adding the 50 symbols of parity:
std::string data;
// ... fill data with a fixed size block, up to 205 bytes
rscodec.encode( data );
Send your data, and later on, after you receive the data+parity, recover the original data (and discard the 50 parity symbols):
int corrected = rscodec.decode( data );
If the data could be recovered, the number of symbols corrected will be returned, or -1 if the Reed-Solomon codeword contained too many errors.
Enjoy!

Related

Fixed-length compression algorithm?

I'm trying to find a compression algorithm which I can use to encode a blob using only 16 fixed-length symbols (0b0000 - 0b1111).
Without any compression, I could use those 16 symbols to encode their respective bit-values (e.g. symbol 5 (0b0101) encodes bits 0101), so if my blob is 100 bits long I would need 25 symbols to represent it - but doing so provides no compression.
I think what I need is something like a reverse Huffman code (in the sense that the code is fixed-length, but it represents variable-length output).
Any ideas?
I only need to do this for one specific blob that's about 2 KB, so it doesn't need to be super-efficient.
If I am understanding your question correctly, use any standard compressor to compress your data, and then simply encode each byte of the result as two of your symbols.

Why RS(255,233) has 32 redundant symbols?

How is the Reed-Solomon code RS(255,233) formed?
I understand how RS(255,223) is formed because
n=2^8-1=255
r=32, k=n-r=223
but how about RS(255,233)?
I read somewhere on the internet that RS(255,233) has 32 redundant symbols, but why? Isn't it supposed to be 22 redundant symbols?
Any link that I can refer to would be appreciated. Thank you.
It was a mistake. RS(255,233) would be 22 parity symbols, RS(255,223) would be 32 parity symbols.
https://www.cs.cmu.edu/~guyb/realworld/reedsolomon/reed_solomon_codes.html
Note, in some cases of RS(n,k), n-k is an odd number, so 2t+1 parity symbols.
Another note, in the Wiki article, t means the number of parity symbols, instead of the number of errors that can be corrected:
https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction
The Wiki article also covers the original view for RS code, which is actually a different code with the same name. In this case, for GF(2^8), the max value for n is 256 instead of 255. Some erasure only codes use original view encoding, called "Vandermonde" encoding. Another encoding method is "Cauchy".

CBC MAC: message length and length prepending

I want to use CBC-MAC in C++. First I hope to find an implementation of a block cipher, which I will use in CBC mode - which, I understand, gives CBC-MAC. But I have two questions:
1) If the length of the message to be authenticated is not multiple of block cipher block length, what shall I do?
2) To strengthen CBC-MAC, one recommended way, as mentioned on the Wiki, is to put the length of the message in the first block. But how should I encode the length - as a string, or in binary? If the block length of the cipher is, say, 64 bits, do I encode the number as a 64-bit number? E.g. if the message length is 230, should I use the following value as the first block?
00000000 00000000 00000000 00000000 00000000 00000000 00000000 11100110
The answer to your first question depends on the second. You must "pad" the message with something until it is a multiple of the block size. The pad bytes are added to the message before the MAC is calculated, but only the original message is transmitted/stored/etc.
For a MAC, the easiest thing to do is pad with zeros. However, this has a vulnerability: if the message ends in one or more zeros, an attacker could add or remove zeros without changing the MAC. But if you do step 2, this and another attack are both mitigated.
If you prepend the length of the message before the message (i.e. not just somewhere in the first block, but as the first thing in the first block), it removes the ability to add or remove trailing zeros. It also prevents an attacker from forging a message with an entire arbitrary extra block appended. So this is a good thing to do. And it is a good idea for completely practical reasons too - you know how many bytes the message is without relying on any external means.
It does not matter what the format of the length is - some encoded ASCII version or binary. However as a practical matter it should always be simple binary.
There is no reason that the number of bits in the length must match the cipher block size. The size of the length field must be large enough to represent the message sizes. For example, if the message size could range from 0 to 1000 bytes long, you could prepend an unsigned 16 bit integer.
This is done first before the MAC is calculated on both the sender and receiver. In essence the length is verified at the same time as the rest of the message, eliminating the ability for an attacker to forge a longer or shorter message.
There are many open source C implementations of block ciphers like AES that would be easy to find and get working.
Caveat
Presumably the purpose of the question is just learning. Any serious use should consider a stronger MAC, such as those suggested in other comments, and a good crypto library. There are other weaknesses and attacks that can be very subtle, so you should never attempt to implement your own crypto for production use. Neither of us is a crypto expert, and this should be done only for learning.
BTW, I recommend both of the following books by Bruce Schneier:
http://www.amazon.com/Applied-Cryptography-Protocols-Algorithms-Source/dp/0471117099/ref=asap_bc?ie=UTF8
http://www.amazon.com/Cryptography-Engineering-Principles-Practical-Applications/dp/0470474246/ref=asap_bc?ie=UTF8

Writing binary data in c++

I am in the process of building an assembler for a rather unusual machine that a few other people and I are building. This machine takes 18-bit instructions, and I am writing the assembler in C++.
I have collected all of the instructions into a vector of 32 bit unsigned integers, none of which is any larger than what can be represented with an 18 bit unsigned number.
However, there does not appear to be any way (as far as I can tell) to output such an unusual number of bits to a binary file in C++. Can anyone help me with this?
(I would also be willing to use C's stdio and FILE structures; however, there still does not appear to be any way to output such an arbitrary number of bits.)
Thank you for your help.
Edit: It looks like I didn't specify how the instructions will be stored in memory well enough.
Instructions are contiguous in memory. Say the instructions start at location 0 in memory:
The first instruction will be at 0. The second instruction will be at 18, the third instruction will be at 36, and so on.
There are no gaps and no padding in the instructions. There can be a few superfluous 0s at the end of the program if needed.
The machine uses big endian instructions. So an instruction stored as 3 should map to: 000000000000000011
Keep an eight-bit accumulator.
Shift bits from the current instruction into to the accumulator until either:
The accumulator is full; or
No bits remain of the current instruction.
Whenever the accumulator is full:
Write its contents to the file and clear it.
Whenever no bits remain of the current instruction:
Move to the next instruction.
When no instructions remain:
Shift zeros into the accumulator until it is full.
Write its contents.
End.
For n instructions, this will leave (8 - 18n mod 8) mod 8 zero bits after the last instruction.
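In C++ (the asker's language), the accumulator approach above might look like this sketch; `pack18` is a made-up name, and the bits are emitted MSB-first to match the big-endian requirement:

```cpp
#include <cstdint>
#include <vector>

// Pack 18-bit instructions (held in the low bits of 32-bit words) into a
// tight big-endian bit stream, zero-padding the final byte if needed.
std::vector<uint8_t> pack18(const std::vector<uint32_t>& insns) {
    std::vector<uint8_t> out;
    uint8_t acc = 0;   // the eight-bit accumulator
    int used = 0;      // bits currently in the accumulator
    for (uint32_t insn : insns) {
        for (int bit = 17; bit >= 0; --bit) {     // MSB of the 18 bits first
            acc = static_cast<uint8_t>((acc << 1) | ((insn >> bit) & 1));
            if (++used == 8) {                    // accumulator full: flush
                out.push_back(acc);
                acc = 0;
                used = 0;
            }
        }
    }
    if (used > 0)                                  // shift in trailing zeros
        out.push_back(static_cast<uint8_t>(acc << (8 - used)));
    return out;
}
```

For example, the single instruction 3 (binary 000000000000000011) packs into the three bytes 0x00, 0x00, 0xC0, with six padding zeros at the end.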
There are a lot of ways you can achieve the same end result (I am assuming the end result is a tight packing of these 18 bits).
A simple method would be to create a bit-packer class that accepts the 32-bit words, and generates a buffer that packs the 18-bit words from each entry. The class would need to do some bit shifting, but I don't expect it to be particularly difficult. The last byte can have a few zero bits at the end if the original vector length is not a multiple of 4. Once you give all your words to this class, you can get a packed data buffer, and write it to a file.
You could maybe represent your data in a bitset and then write the bitset to a file.
Wouldn't work with fstreams write function, but there is a way that is described here...
The short answer: your C++ program should output the 18-bit values in the format expected by your unusual machine.
We need more information - specifically, the format that your "unusual machine" expects, or more precisely, the format that your assembler should be outputting. Once you understand the format of the output you're generating, the answer should be straightforward.
One possible format — I'm making things up here — is that we could take two of your 18-bit instructions:
instruction 1 instruction 2 ...
MSB LSB MSB LSB ...
bits → ABCDEFGHIJKLMNOPQR abcdefghijklmnopqr ...
...and write them in an 8-bits/byte file thus:
KLMNOPQR CDEFGHIJ 000000AB klmnopqr cdefghij 000000ab ...
...this is basically arranging the values in "little-endian" form, with 6 zero bits padding the 18-bit values out to 24 bits.
But I'm assuming: the padding, the little-endianness, the number of bits / byte, etc. Without more information, it's hard to say if this answer is even remotely near correct, or if it is exactly what you want.
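As a sketch, that made-up little-endian layout (low byte first, with the top six bits of the third byte zero) could be produced like this:

```cpp
#include <cstdint>
#include <vector>

// Write each 18-bit instruction as 3 bytes, least-significant byte first,
// leaving the top 6 bits of the third byte as zero padding
// (the hypothetical little-endian layout sketched above).
std::vector<uint8_t> to_le24(const std::vector<uint32_t>& insns) {
    std::vector<uint8_t> out;
    for (uint32_t v : insns) {
        out.push_back(static_cast<uint8_t>(v & 0xff));          // KLMNOPQR
        out.push_back(static_cast<uint8_t>((v >> 8) & 0xff));   // CDEFGHIJ
        out.push_back(static_cast<uint8_t>((v >> 16) & 0x03));  // 000000AB
    }
    return out;
}
```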
Another possibility is a tight packing:
ABCDEFGH IJKLMNOP QRabcdef ghijklmn opqr0000
or
ABCDEFGH IJKLMNOP abcdefQR ghijklmn 0000opqr
...but I've made assumptions about where the corner cases go here.
Just output them to the file as 32 bit unsigned integers, just as you have in memory, with the endianness that you prefer.
And then, when the loader / EEPROM writer / JTAG or whatever method you use to send the code to the machine reads each 32-bit word, just omit the 14 most significant bits and send the real 18 bits to the target.
Unless, of course, you have written a FAT driver for your machine...

error correcting code over a 4 element alphabet

I need to develop an error correcting code.
My alphabet is {0,1,2,3} (4 elements)
Codeword size n will be 8 or 12
expected error correction capability = 1 digit
expected error detection capability = 2 digit
I have reviewed many ECC techniques (RS, LDPC, etc.), yet I still don't know where to start or how to proceed.
Can anybody please help me construct it?
Thanks.
Have you considered a checksum?
There are tons of ways to implement this, but a common approach would be to use a Reed-Solomon code.
Since you need to detect all two-symbol errors and correct all one-symbol errors, that means you will need two check symbols.
You say you have 2-bit (4-element) symbols, which limits your code length to 3 symbols.
Add that up and you have 1 data symbol and 2 check symbols for each 6-bit codeword.
Not very efficient, eh? For that efficiency, you might as well just repeat each symbol three times, giving the same codeword size and the same detection and correction power.
To use Reed-Solomon more effectively, you'll need to use large symbols. This is true for most other types of codes as well.
EDIT:
You may want to consider generalized BCH codes which don't have quite as many limitations as Reed-Solomon codes (which are a subset of BCH codes), at the expense of more complex decoding:
http://en.wikipedia.org/wiki/BCH_code