Using Reed-Solomon decoding, do we need to know which shards are correct?

I am using Reed-Solomon error correction in a Java project. The library I use is JavaReedSolomon (https://github.com/Backblaze/JavaReedSolomon). There is an example of decoding using JavaReedSolomon:
byte[][] shards = new byte[NUM_SHARDS][SHARD_SIZE];
// shards is the array containing all the shards (data and parity)
boolean[] shardPresent = new boolean[NUM_SHARDS];
// shardPresent[i] is true when shard i is known to be intact
ReedSolomon reedSolomon = ReedSolomon.create(NUM_DATA_SHARDS, NUM_PARITY_SHARDS);
reedSolomon.decodeMissing(shards, shardPresent, 0, SHARD_SIZE);
The array shardPresent indicates which shards are known to be correct; for example, if you are sure that the 4th shard is correct, then shardPresent[3] is true.
My question is: does Reed-Solomon decoding necessarily need to know which shards are correct, or is that just how this library implements it?

The answer is no: the decoding procedure can recover from both unknown errors and known errors (erasures). A Reed-Solomon code (in fact, any MDS code) can correct twice as many erasures as errors, and there are multiple ways to determine the error locations when they are not known in advance (the Berlekamp-Massey algorithm, for example).
The library's API most likely reflects its intended use case, i.e. the caller typically has some side-channel information about which shards are intact (for example, a shard's file is missing or fails its own checksum).
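For illustration, here is a minimal sketch of erasure decoding with the API shown in the question; the shard counts and the simulated losses are made up for the example:
import com.backblaze.erasure.ReedSolomon;

public class ErasureDemo {
    private static final int NUM_DATA_SHARDS = 4;
    private static final int NUM_PARITY_SHARDS = 2;
    private static final int NUM_SHARDS = NUM_DATA_SHARDS + NUM_PARITY_SHARDS;
    private static final int SHARD_SIZE = 16;

    public static void main(String[] args) {
        // Fill the data shards with something recognizable, then compute parity.
        byte[][] shards = new byte[NUM_SHARDS][SHARD_SIZE];
        for (int i = 0; i < NUM_DATA_SHARDS; i++) {
            for (int j = 0; j < SHARD_SIZE; j++) {
                shards[i][j] = (byte) (i * SHARD_SIZE + j);
            }
        }
        ReedSolomon reedSolomon = ReedSolomon.create(NUM_DATA_SHARDS, NUM_PARITY_SHARDS);
        reedSolomon.encodeParity(shards, 0, SHARD_SIZE);

        // Simulate two erasures: we know *which* shards are gone (e.g. their files
        // are missing or fail a checksum), we just no longer have their contents.
        boolean[] shardPresent = new boolean[NUM_SHARDS];
        java.util.Arrays.fill(shardPresent, true);
        shards[1] = new byte[SHARD_SIZE];
        shardPresent[1] = false;
        shards[4] = new byte[SHARD_SIZE];
        shardPresent[4] = false;

        // With 2 parity shards, any 2 known-missing shards can be rebuilt in place.
        reedSolomon.decodeMissing(shards, shardPresent, 0, SHARD_SIZE);
    }
}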

Related

Parallel bzip2 decompression scan may fail?

Compressing a bzip2 byte stream in parallel can be done easily with a FIFO queue, where every chunk is processed as a parallel task and streamed into a file.
Going the other way round, parallel decompression is not so easy, because everything is bit-aligned and the exact bit length of a block is only known after it has been decompressed.
As far as I can see, parallel decompression implementations use magic numbers for block start and stream end and perform a bit scan. Isn't there a small chance that one of the streams contains such a magic value by coincidence?
Possible block validations:
4 bytes CRC
6 bytes "compressed magic"
6 bytes "end of stream magic"
some bit combinations for the Huffman trees are not allowed
max. x bytes of Huffman stream (range to search for the next magic)
Per file:
4 bytes file CRC
padding at the end
I could implement such a scan by bit-shifting through the stream until I find a magic value. But then, when I read block N and it fails to decode, I should (or maybe should not) take into account that the match was a false positive. For a parallel implementation I could then stop all tasks for blocks N, N+1, N+2, ..., try to find the next signature, and go on. That makes everything very complicated, and I don't know whether it's worth the effort. I guess maybe not, but is there a chance that a parallel bzip2 implementation fails?
I'm wondering why a file format uses magic numbers as markers but doesn't include jump hints. I guess the magic numbers are important for filesystem recovery, but anyway, why can't a block contain e.g. 16 bits telling how far to jump to the next block?
Yes, the source code you linked notes that the magic 48-bit value can show up in compressed data by chance. It also notes the probability, around 10^-14 (actually 2^-48, closer to 3.55x10^-15). That probability applies at every bit position sampled, so on average one false match will occur in every 32 terabytes of compressed data. That's about one month of run time on one core on my machine. Not all that long. In a production environment, you should assume that it will happen. Because it will.
Also as noted in the source you linked, due to the possibility of a false positive you then need to validate the remainder of the block. You would not stop processing the subsequent candidate blocks, since it is extremely likely that they are all valid blocks. Just validate all of them and keep the ones that pass. When combining, verify that the valid blocks exactly cover the input, with no overlaps. A properly implemented parallel bzip2 decompressor will always work on valid bzip2 streams.
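To make the scan concrete, here is a rough sketch (not taken from any particular parallel decompressor) of searching a buffer bit by bit for the 48-bit block magic 0x314159265359; a hit is only a candidate block start and still has to be validated as described above:
public class BlockMagicScanner {
    // Returns the bit offset of the first candidate block magic at or after startBit,
    // or -1 if none is found in this buffer.
    static long findBlockMagic(byte[] data, long startBit) {
        final long BLOCK_MAGIC = 0x314159265359L;  // BCD digits of pi: the bzip2 block-header magic
        final long MASK = (1L << 48) - 1;
        long window = 0;
        int bitsInWindow = 0;
        long totalBits = (long) data.length * 8;
        for (long bit = startBit; bit < totalBits; bit++) {
            int shift = (int) (7 - (bit & 7));                // bzip2 packs bits MSB-first
            int b = (data[(int) (bit >>> 3)] >>> shift) & 1;
            window = ((window << 1) | b) & MASK;
            bitsInWindow = Math.min(bitsInWindow + 1, 48);
            if (bitsInWindow == 48 && window == BLOCK_MAGIC) {
                return bit - 47;                              // bit offset where the candidate block starts
            }
        }
        return -1;                                            // caller may need more data
    }
}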
It would need to be more than 16 bits, but yes, in principle a block could have contained the offset to the next block, since it already contains a CRC at the start of the block. Julian did consider that in the revision of bzip2, but decided against it:
bzip2-1.0.X, 0.9.5 and 0.9.0 use exactly the same file format as the
original version, bzip2-0.1. This decision was made in the interests
of stability. Creating yet another incompatible compressed file format
would create further confusion and disruption for users.
...
The compressed file format was never designed to be handled by a
library, and I have had to jump through some hoops to produce an
efficient implementation of decompression. It's a bit hairy. Try
passing decompress.c through the C preprocessor and you'll see what I
mean. Much of this complexity could have been avoided if the
compressed size of each block of data was recorded in the data stream.

AWS S3 returning differing "Content-Range" compared to "Range" request header specification

We send the "Range" Header with our HTTP Get Requests to S3 for resuming failed downloads of our ~3GB application that we host there. A strange behavior has occurred. While at "lower" completion percentages (0-50%), the returned Content-Range Header from the server exactly matches up with what we requested. However, at a certain, yet undetermined point, the returned Content-Range is differing from our requests. We always request from XYZ bytes to end. I yet had no time to debug whether there is a well defined "boarder" after which this behavior occurs, I can only assure that it always happens after 75%, a mark at which I tested a dozen times already, and doesn't happen at anything lower than 50%.
My question is whether this is expectable behavior as I cannot see any documentation about this. If it is, is there any resources to get familiar with this behavior or even a way to prevent it? For example, we requested the range to start at 2566960807 bytes, S3 responded with Content-Range: bytes 1499653561-3227660049. This is quite a large chunk that would needed to be re-downloaded. The amount of successful, preceding partial content requests seems to be of no relevancy, e.g. the same behavior is to be experienced when multiple or none "Range" requests (which had returned the "correct" byte range) have been made on the same object beforehand.
If this information is of any relevancy, between the failure of the initial download (killed internet connection) until the attempted ranged continuation only are a couple seconds. (~10-20 seconds max)
In case someone is interested in the solution: the issue was surprisingly stupid. The library we used for the download was built against .NET 3.5, where HttpWebRequest.AddRange only supports int. Passing in a long parameter led to a cast error once the int range was exceeded. I have to say, it is a really questionable implementation by Avira (we use their FileDownloader) to cast a long to int unchecked.
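The numbers in the question are consistent with that explanation. Here is a small sketch, in Java only to show the arithmetic (the affected code was .NET), assuming the wrapped negative value ended up being sent as a suffix range, i.e. "the last N bytes":
public class RangeOverflowDemo {
    public static void main(String[] args) {
        long requestedStart = 2566960807L;   // offset the downloader meant to resume from
        long objectLength  = 3227660050L;    // implied by "bytes 1499653561-3227660049"

        // A 32-bit int cannot hold 2566960807, so an unchecked cast wraps around:
        int truncated = (int) requestedStart;
        System.out.println(truncated);       // -1728006489

        // If that negative value is sent as a suffix range ("last 1728006489 bytes"),
        // the returned range starts at:
        long suffixStart = objectLength - Math.abs((long) truncated);
        System.out.println(suffixStart);     // 1499653561, matching the observed Content-Range
    }
}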

Write a C++ struct to a file and read file using another programming language?

I have a challenging situation; we will have programs on Mac, PC, iOS and Android receiving files in a legacy format and parsing data from those files. We cannot change how those files are created.
The files are produced by a C++ program filling a struct with numbers and Strings and then writing it out. Here's a sanitized version.
struct MyObject {
    String Kfkj(MAXKYS);
    String Oern(MAXKYS);
    String Vdflj(MAXKYS, 9);
    int Muic;
    int Tdfkj;
    int VdfkAsdk;
    int SsdjsdDsldsk;
    int Ndsoief;
    String TdflsajPdlj;
    String TdckjdfPas;
    String AdsfakjIdd;
    int IdkfjdKasdkj;
    int AsadkjaKadkja(MAXKYS);
    int Kasldsdkj;
    bool Usadl;
    String PsadkjOasdj(9);
    String PasdkjOsdkj;
};
Primitives and Strings, as you can see.
Then here is how they write it out to a file:
MyObject MyInstance;
const char* FileName = "C:\\MyFile.ab2";
ofstream fout(FileName, ios::binary);
fout.write((char*)&MyInstance, sizeof(MyInstance));
There is no option for us to translate it once and then distribute the file to other platforms; we must translate it on each and every different platform, and this is what we have to work with. I'd appreciate any information on how C++ serializes data, so we know how to parse the file.
EDIT: solution
The feedback I received from multiple answers here was VERY helpful. Using that, I did extensive analysis with hex editors and discovered:
the elements come in the file one after another
a "String," in this case, starts with an int describing how many characters follow the int for that String. If the String does not exist, it will still have that int with a value of 0.
integers, for the files and machines I saw, are two bytes, little-endian, and MOSTLY unsigned (there were a few that were signed, just to keep me on my toes)
the boolean was two bytes, with apparently -1 (FF FF) representing "true"
So far we have not run into issues with different padding or endianness on different devices, but those are very real concerns. The skilled notes and warnings in these answers provide us with more ammunition to try to convince the client to change to a less fragile alternative, such as XML or JSON, for transferring data online across platforms.
As for those of you asking if the developer was fired... well, let's just say their code is very old, but after multiple conversations we're still having trouble convincing them that writing out the C++ struct and trying to read it back on different platforms is not a good idea.
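Based on the layout described above, a reader on any platform boils down to walking the bytes in the struct's declaration order. Here is a rough sketch in Java (suitable for the Android side); the field names, the character set, and the exact sequence of reads are placeholders that have to be matched to the real struct, and it assumes the string length prefix and the bool are the same two-byte little-endian integers observed for the other fields:
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LegacyRecordReader {
    private final ByteBuffer buf;

    public LegacyRecordReader(byte[] fileBytes) {
        this.buf = ByteBuffer.wrap(fileBytes).order(ByteOrder.LITTLE_ENDIAN);
    }

    // Two-byte little-endian integer, read as unsigned.
    int readUInt16() {
        return buf.getShort() & 0xFFFF;
    }

    // A "String" is a character count followed by that many bytes; a count of 0 means absent.
    String readString() {
        int length = readUInt16();
        byte[] chars = new byte[length];
        buf.get(chars);
        return new String(chars, StandardCharsets.US_ASCII);
    }

    // The bool was observed as two bytes, with 0xFFFF meaning true.
    boolean readBool() {
        return readUInt16() != 0;
    }

    public static void main(String[] args) throws Exception {
        LegacyRecordReader r = new LegacyRecordReader(
                Files.readAllBytes(Paths.get("MyFile.ab2")));
        // Reads must mirror the struct's declaration order exactly; for example,
        // if the first field really is a single length-prefixed string:
        String first = r.readString();
        // ...continue with readUInt16(), readBool(), etc. for the remaining fields.
    }
}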
You're going to run into many problems.
C++ doesn't have a specific format for serializing data per se. It is highly dependent on the computer architecture/processor that you are running on.
The compiler is allowed to add padding to help alignment on systems. When we say alignment we basically are referring to an architecture/processor's affinity for having data lie on specific byte boundaries. For example, some processors vastly prefer floating point numbers to lie at 4 or 8 byte boundaries - if they don't the processor may work much slower or may not work at all.
So you can't magically know what padding a given system's compiler has added.
What you can do is use #pragma pack(push, 1) before the struct and #pragma pack(pop) after it to stop your compiler from inserting padding.
PS: You also have to worry about endianness. What if one computer is big-endian and the other is little-endian? Without a conversion they will interpret the bytes differently.
Simply put, you either have to fix the application generating the files so it uses a proper serialization scheme OR you need to look at it running on a SPECIFIC computer, look at exactly how it writes the files, and write a translator for every target platform (which is just silly).
Interesting Suggestion
If you're really stuck, write an app that monitors the folder where you write files. Have the app pick up the files (since it's on the same PC it'll be able to read their format without issue). Have it write the files back in XML or some other true serialization format and distribute those instead.
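For what it's worth, here is a minimal sketch of such a watcher in Java using the standard WatchService API; the folder path is hypothetical, and the actual translation step is only indicated by a comment:
import java.nio.file.*;

public class LegacyFileWatcher {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("C:/legacy-output");           // hypothetical drop folder
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);

        while (true) {
            WatchKey key = watcher.take();                  // blocks until something happens
            for (WatchEvent<?> event : key.pollEvents()) {
                Path created = dir.resolve((Path) event.context());
                System.out.println("New legacy file: " + created);
                // Parse the binary file here and write it back out as XML/JSON.
            }
            if (!key.reset()) {
                break;                                      // folder no longer accessible
            }
        }
    }
}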
Whoa - that's crazy. So the String objects don't contain any pointers? They must not, because you claim this is working code.
Anyway, that code isn't doing any serialization. It's just writing the structure out to the file exactly the way it is laid out in memory. The only issue you have is that on some platforms the padding and the sizeof integral types like int may be different.
You'll have to find the sizes of the integral types used on the legacy platform and use that information in the readers for the newer platforms, so the fields are interpreted exactly as they were laid out there.
You're running a real risk with that code though. As it is, a compiler change could suddenly cause the file layout to change.
The format of your data file is entirely down to the compiler that your C++ program is compiled with, and the definition of your String class. You can rely on the fields being in the order they're declared in, and in this case I think you can rely on there not being any padding at the start, but that's about all. Some tips that might help you out in this case:
You don't give the definition of the String class you're using. If it's a typedef for std::string, you're completely screwed, because the contents of the string aren't stored inside the object itself. I assume your C++ programmers are using some special local buffer, in which case I'd guess you will find the first bytes of the object are the string, and there is some amount of useless padding afterwards. I hope the struct contains an int at the start telling you how much of the data in it is useful.
You'll probably find the int fields are four bytes long.
You'll probably find the bool field is one byte long, followed by three bytes of useless padding. Only one bit, most likely the bottom bit, will be set.
That's about all the useful guesswork I can offer you. In your target language, make sure to read the whole file in as the closest thing to a byte array available in the language, and only after that, use the language features to convert it into the right kind of thing in your language. Don't try reading it in as integers, as that won't let you byte-swap if you're on a platform with different endianness to the C++ program. I suggest also looking through the file in a text editor to reverse-engineer it and help you find the offset of each field.
Last piece of advice: consider printing P45s (or pink slips, or whatever you have in your country) for whichever programmers or project managers thought this kind of 'serialization' was a good idea. This kind of sloppy work might have been acceptable in a life-or-death situation, but they have seriously screwed you over in a way you're going to find it very hard to recover from. Writing the code to read in these files will not be that hard, if it's only one struct like this, but keeping it reliable will be a world of pain, and they've effectively made it impossible for themselves to change compilers or compiler version safely.
The way it's done, the struct is written in raw form to a file. So basically what you need to know to parse this file is the binary layout of your struct.
Basically, the fields are just one after the other, so to read an int, you just read 4 bytes and cast that to an int, etc.
Strings are a particular case. It's not clear from your code whether this "String" type is an inline array of characters, or a pointer to such an array. In the first case, you need to know how many characters each string contains and simply read that number of characters sequentially. In the second case, you won't be able to get the string back, since it won't have been written to file. The pointer will be useless to you.
One last concern is whether the struct is packed or not. Since you gave no indication of that, by default struct fields are aligned to their natural boundaries (commonly 4 bytes for an int), so there may be padding, for instance after the boolean field, that you need to account for. If the struct is packed, then each field comes directly after the previous one.
So, to make a long story short, figure out your struct binary layout using its definition and, if all else fails, inspecting the memory at run-time with the debugger, or use a hex editor to study the output file. Then write that specification down somewhere and this will give you what you need to read from the file. It's impossible to tell exactly what that layout is simply by looking at the pseudo-definition you gave.
Writing to an ofstream does not serialize data. This code writes the raw memory content of the struct as if it were a string of char. Depending on your compiler, its version, its options, and the system it is running on, the content will be completely different.
Even the number of bits in a char is allowed to change between C++ implementations.
Data referenced by the struct's members won't be written (forget the content of a std::string).
If you cannot change the writer code, you must know the alignment policy, the size of the base types, and the data representation. You will have to analyze files produced by hand, for example with a hexadecimal editor like this one:
http://www.physics.ohio-state.edu/~prewett/hexedit/
and probably look at your compiler documentation.
If you can change the writer code, use proper serialization such as JSON, Protocol Buffers, or simply XML.
No one has pointed out something that sticks out to me as particularly problematic (maybe because I've been bit by it). That problem: the data member bool Usadl;. sizeof(bool) varies across platforms, across compilers, and even across releases of the same compiler. Common values for sizeof(bool) are 4 and 1. This will bite you. It's getting hard to find a big endian machine nowadays, very, very hard to find a computer where CHAR_BIT is not 8 or sizeof(int) is not 4. This is not the case for sizeof(bool).
In agreement with everyone else, Chad's team needs to document the structure of the records in the file, and then make sure the program that produces the file writes this structure explicitly, including element sizes, padding, and endianness. Don't depend on class layout to do this for you. That's just asking for trouble.
The best way would probably be to use JSON, or if you want a more robust solution, go with something like Avro. Avro has a C++ API and a Java API, so it covers most of the cases you're encountering.

Where is this encryption/decryption algorithm going wrong?

I've been working on a basic string encryption/decryption algorithm in C++ (the source is here: http://pastebin.com/MLnn8D82)
The problem I'm having is that it doesn't decrypt properly. The encryption equation is:
strInput[nPos]=(((strInput[nPos])+(nPos+1))*2);
And the decryption equation is:
strPassword[nPos]=(((strPassword[nPos])-(nPos+1))/2);
When I try it with just the addition/subtraction operators, it works perfectly. But when I multiply during encryption and divide during decryption, I get a seemingly random string as output.
At first I thought it may be because the password is written to and retrieved from a file before being decrypted, but I tried outputting it directly from the main function and I ended up with the same results.
Is there a problem with dividing/multiplying strings? It worked before with C-style (char array) strings, but I guess this could be different.
Any help is appreciated!
Edit: Thanks for the answers so far. I know that this isn't secure and that I shouldn't use it; I'm only doing it for practice.
Also, it's not a memory problem. I've tried dividing in the encryption stage rather than multiplying, but I still get a random string rather than the original string.
It's quite likely your multiplication is overflowing for some characters, meaning your division will never be able to recover the original.
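A concrete illustration of that overflow, sketched in Java with a signed 8-bit byte standing in for the char in the linked C++ code:
public class OverflowDemo {
    public static void main(String[] args) {
        int nPos = 0;
        byte c = 'z';                                            // 122
        byte encrypted = (byte) ((c + (nPos + 1)) * 2);          // (122 + 1) * 2 = 246, wraps to -10 in 8 bits
        byte decrypted = (byte) ((encrypted - (nPos + 1)) / 2);  // (-10 - 1) / 2 = -5, not 122
        byte correctedOrder = (byte) ((encrypted / 2) - (nPos + 1)); // (-10 / 2) - 1 = -6: still not 122
        System.out.println(encrypted + " " + decrypted + " " + correctedOrder);
    }
}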
On a side note, why are you writing the encryption algorithm yourself? If you're going to be using it for anything real, rather than just learning, you would be much better off using a library written by cryptography experts that is known to be secure. Something like Keyczar would be a good idea because it's designed to be difficult to get wrong (which is very easy to do in ways that are very subtle when it comes to cryptography).
There are multiple things wrong with this algorithm:
This is just a basic change to a standard Vigenère Cipher, which is well known to be very insecure. Do not use it for anything more than writing letters to a girlfriend, which other students should not be able to read. Even a somewhat decent math teacher will be able to decipher it easily.
Do not ever try to invent a cryptographic algorithm unless you have a doctorate in number theory or cryptography. Even with a degree in one of these fields, writing a cryptographic algorithm that is fairly secure is a very hard task. And even if you find an algorithm, do not try to implement it yourself; rather, try to find an implementation that is already available. There is a lot you can get wrong, as can be seen from the various security flaws that were caused by badly implemented cryptographic algorithms.
You do not have any support for a passphrase in your algorithm. This means anybody who knows the algorithm can easily decipher your encrypted data. Usually a cryptographic algorithm takes a passphrase as an input, which is then used to decipher the data. This way the algorithm can be made public and only the passphrase must be kept secret. If the algorithm is kept secret, this is considered a fatal flaw by the cryptographic community.
Your multiplication might overflow, in case it yields a result, which is bigger than what could be stored in a char. In that case a division will not be able to retrieve the original data. This has been pointed out by others as well.
The order of operations is wrong. In your encryption step you add first and then you multiply. Have a look at the resulting equation. Solving that equation for the input means you also have to reverse the order: you first have to divide and then subtract. However, in your code you are first subtracting and then dividing.
These are all the things I can tell you for now. This is not meant to discourage you from trying out this kind of stuff. I wrote a fair number of similar algorithms when I was much younger. You just need to be very aware that they will not be very secure.
There are two issues here.
One appears to stem from the use of strings and the input/output streams. If you set a breakpoint and step through this, you'll realize that in the fRetrieve function the values of strPassword[nPos] can be negative. You are essentially reading binary data into a string and trying to act on it.
What you should be doing is processing your data in a binary buffer, such as a char array, that solely stores bytes. Then your decryption will get purely binary data back, and you can convert that into a string. This will ensure the integrity of your data when writing to and reading from the file. Playing with strings and high ASCII values is asking for the data to be interpreted incorrectly.
The second is that your decryption algorithm is not properly reversed. So even if you did read the data back correctly, you would be off by 1 every time. This is an order-of-operations issue.
For example, assume an 'A' (65) and nPos of 0. Encrypt:
(65 + (0+1)) * 2 = 132
Then reverse with the current decryption:
132 - (0+1) = 131, and 131 / 2 = 65.5
which will be rounded or truncated, since it's an integer data type. The proper reverse is
(strPassword[nPos] / 2) - (nPos+1)
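And a quick round-trip of the corrected order, sketched in Java with plain ints so the overflow discussed in the other answer doesn't get in the way:
public class RoundTripDemo {
    public static void main(String[] args) {
        String input = "Secret";
        int[] encrypted = new int[input.length()];
        char[] decrypted = new char[input.length()];

        // Encrypt: add the position key first, then multiply by 2.
        for (int nPos = 0; nPos < input.length(); nPos++) {
            encrypted[nPos] = (input.charAt(nPos) + (nPos + 1)) * 2;
        }
        // Decrypt: undo the operations in reverse order: divide by 2 first, then subtract.
        for (int nPos = 0; nPos < encrypted.length; nPos++) {
            decrypted[nPos] = (char) ((encrypted[nPos] / 2) - (nPos + 1));
        }
        System.out.println(new String(decrypted));   // prints "Secret"
    }
}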

error correcting code over a 4 element alphabet

I need to develop an error-correcting code.
My alphabet is {0,1,2,3} (4 elements).
The codeword size n will be 8 or 12.
Expected error correction capability: 1 digit.
Expected error detection capability: 2 digits.
I have reviewed many ECC techniques (RS, LDPC, etc.), yet I still don't know where to start or how to go about it.
Can anybody please help me construct it?
Thanks
Have you considered a checksum?
There are tons of ways to implement this, but a common approach would be to use a Reed-Solomon code.
Since you need to detect all two-symbol errors and correct all one-symbol errors, that means you will need two check symbols.
You say you have 2-bit (4-element) symbols, which limits your code length to 3 symbols.
Add that up and you have 1 data symbol and 2 check symbols in each 3-symbol (6-bit) code word.
Not very efficient, eh? For that efficiency, you might as well just repeat each symbol three times, which gives the same code word size and the same detection and correction power.
To use Reed-Solomon more effectively, you'll need to use large symbols. This is true for most other types of codes as well.
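For comparison, the triplication scheme mentioned above is trivial to implement. A minimal sketch over the {0,1,2,3} alphabet, with a codeword size of 12 as in the question (majority voting corrects any single corrupted position):
public class TriplicationCode {
    // Encode 4 data symbols (values 0..3) into a 12-symbol code word by repeating each three times.
    static int[] encode(int[] data) {
        int[] code = new int[data.length * 3];
        for (int i = 0; i < data.length; i++) {
            code[3 * i] = code[3 * i + 1] = code[3 * i + 2] = data[i];
        }
        return code;
    }

    // Decode by majority vote; a single wrong copy per group of three is corrected.
    static int[] decode(int[] code) {
        int[] data = new int[code.length / 3];
        for (int i = 0; i < data.length; i++) {
            int a = code[3 * i], b = code[3 * i + 1], c = code[3 * i + 2];
            data[i] = (a == b || a == c) ? a : b;   // otherwise b == c, or the group is undecodable
        }
        return data;
    }

    public static void main(String[] args) {
        int[] word = encode(new int[] {2, 0, 3, 1});
        word[5] = 3;                                // corrupt one symbol
        int[] recovered = decode(word);             // back to {2, 0, 3, 1}
        System.out.println(java.util.Arrays.toString(recovered));
    }
}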
EDIT:
You may want to consider generalized BCH codes which don't have quite as many limitations as Reed-Solomon codes (which are a subset of BCH codes), at the expense of more complex decoding:
http://en.wikipedia.org/wiki/BCH_code