Which error correction code should I use for GF(32)?

I searched for comparisons between Reed-Solomon, Turbo, and LDPC codes, but they all seem to focus on efficiency. I'm more interested in whether available libraries can be used commercially, in ease of use, and in support for GF(32), i.e. a code with only 32 symbols (the Reed-Solomon implementations I found work for GF(256) and above).
Efficiency (speed) is not relevant. The messages consist of 24 symbols.
Can you provide a quick comparison of the most well-known Reed-Solomon, Turbo, and LDPC codes for this case, in which speed is not relevant?
Thanks.

Basically, Reed-Solomon is optimal, meaning that you can correct exactly up to (n-k)/2 errors (k = length of your message, n = length of the message plus EC symbols), while Turbo codes and LDPC codes are near-optimal, meaning that you can correct up to (n-k-e)/2 errors, where e is a small constant; in ideal cases you are very close to (n-k)/2 (that's why they are called near-optimal: they come close to the Shannon limit). Turbo codes and LDPC codes have similar error-correction power, and there are lots of variants depending on your needs (you can find plenty of literature reviews and presentations).
What the different variants of LDPC or Turbo codes do is optimize the algorithm to fit certain characteristics of the erasure channel (i.e., the data) so as to reduce the constant e (and thus approach the Shannon limit). So the best variant in your case depends on the details of your erasure channel. Also, to my knowledge, they are all in the public domain now (some Turbo code patents may not have expired yet, but if not, they will soon).
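If you end up adapting an existing Reed-Solomon implementation to GF(32), the part that changes is the symbol arithmetic. As a minimal sketch (not taken from any particular library), here is multiplication in GF(2^5), assuming the primitive polynomial x^5 + x^2 + 1; any primitive degree-5 polynomial would do. A full codec also needs the generator polynomial, encoding, and syndrome decoding on top of this.

#include <cassert>
#include <cstdint>

// Addition in GF(2^5) is plain XOR; multiplication is shift-and-XOR,
// reducing by the chosen polynomial whenever the degree reaches 5.
std::uint8_t gf32_mul(std::uint8_t a, std::uint8_t b) {
    std::uint8_t result = 0;
    for (int i = 0; i < 5; ++i) {
        if (b & 1)
            result ^= a;           // add the current multiple of a
        b >>= 1;
        bool overflow = a & 0x10;  // bit 4 set: shifting would reach x^5
        a = (a << 1) & 0x1F;
        if (overflow)
            a ^= 0x05;             // x^5 == x^2 + 1 modulo x^5 + x^2 + 1
    }
    return result;
}

int main() {
    assert(gf32_mul(1, 23) == 23);   // 1 is the multiplicative identity
    assert((17 ^ 17) == 0);          // every element is its own additive inverse
}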

Related

What string search algorithm does strstr use?

I was reading through the String-searching algorithm Wikipedia article, and it made me wonder what algorithm strstr uses in Visual Studio. Should I try to use another implementation, or is strstr fairly fast?
Thanks!
The implementation of strstr in Visual Studio is not known to me, and I am uncertain whether it is known to anyone. However, I found these interesting sources and an example implementation. The latter shows that the algorithm runs in worst-case quadratic time with respect to the size of the searched string; the aggregate cost should be less than that, and that is also the algorithmic limit for non-stochastic solutions.
What is actually the case is that, depending on the size of the input, different algorithms may be used, mainly ones optimized to the metal. However, one cannot really bet on that. If you are doing DNA sequencing, strstr and family are very important, and you will most probably have to write your own customized version. Usually standard implementations are optimized for the general case, but on the other hand the people working on compilers know their stuff. At any rate, you should not bet your own skills against the pros.
But really, all this discussion about development time is hurting the effort to write good software. Be certain that the benefit of writing a custom strstr outweighs the effort that will be needed to maintain and tune it for your specific case before you embark on this task.
As others have recommended: Profile. Perform valid performance tests.
Without the profile data, you could be optimizing a part of the code that runs 20% of the time, a waste of ROI.
Development costs are the prime concern with modern computers, not execution time. The best use of time is to develop the program to operate correctly with few errors before entering System Test. This is where the focus should be. Also due to this reasoning, most people don't care how Visual Studio implements strstr as long as the function works correctly.
Be aware that there is a line, or point, at which a linear search outperforms other searches. Where that line lies depends on the size of the data or the search criteria. For example, a linear search on a processor with branch prediction and a large instruction cache may outperform other techniques for small and medium data sizes. A more complicated algorithm may have more branches that cause reloading of the instruction cache or data cache (wasting execution time).
Another method for optimizing your program is to make the data organization easier for searching. For example, making the string small enough to fit into a cache line. This also depends on the quantity of searching. For a large amount of searches, optimizing the data structure may gain some performance.
In summary, optimize if and only if the program is not working correctly, the User is complaining about speed, it is missing timing constraints or it doesn't fit in the allocated memory. Next step is then to profile and optimize the areas where most of the time is spent. Any other optimization is futile.
The C++ standard refers to the C standard for the description of what strstr does. The C standard doesn't seem to put any restrictions on the complexity, so pretty much any algorithm that finds the first instance of the substring would be compliant.
Thus different implementations may choose different algorithms. You'd have to look at your particular implementation to determine which it uses.
The simple, brute-force approach is likely O(m×n) where m and n are the lengths of the strings. If you need better than that, you can try other libraries, like Boost, or implement one of the sub-linear searches yourself.
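C++17 also ships Boyer-Moore-style searchers in the standard library, so you may not need Boost or a hand-rolled implementation. A minimal sketch (the strings here are just illustrative):

#include <algorithm>
#include <functional>  // std::boyer_moore_horspool_searcher (C++17)
#include <iostream>
#include <string>

int main() {
    std::string haystack = "the quick brown fox jumps over the lazy dog";
    std::string needle   = "lazy";

    // The searcher preprocesses the needle once, which pays off when the
    // same needle is searched for repeatedly.
    auto it = std::search(haystack.begin(), haystack.end(),
                          std::boyer_moore_horspool_searcher(needle.begin(),
                                                             needle.end()));
    if (it != haystack.end())
        std::cout << "found at offset " << (it - haystack.begin()) << '\n';
}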

About optimized math functions, ranges and intervals

I'm trying to wrap my head around how people who write math functions for games and rendering engines can use an optimized math function in an efficient way; let me explain that further.
There is a high need for fast trigonometric functions in those fields; at times you can optimize a sin, a cos, or another function by rewriting it in a different form that is valid only for a given interval. Often this means that your approximation of f(x) covers just the first quadrant, i.e. 0 <= x <= pi/2.
Now the input for your f(x) still spans all 4 quadrants, but the formula only covers 1/4 of that interval. The straightforward solution is to detect the quadrant by analyzing the input and seeing in which range it falls, and then adjust the result of the formula accordingly if the input comes from a quadrant other than the first.
This is good in theory, but it also presents a couple of really bad problems, especially considering that you are doing all this to steal a couple of cycles from your CPU (you also get a consistent implementation that is not platform dependent, unlike a hardcoded fsin in Intel x86 asm, which only works on x86 and has a certain error range; all of this may differ on other platforms with other asm instructions), so you want to keep things working concurrently and at a high performance level.
The reason I can't wrap my head around the "switch case" with quadrants solution is:
it just prevents possible optimizations, namely memoization, considering that you usually want to put that switch-case inside the same function that actually computes f(x); the situation could probably be improved by implementing the formula for f(x) outside said function, but this will lead to a doubling of the number of functions to maintain for any given math library
it increases the probability of branching under concurrent execution
generally speaking it doesn't lead to better, cleaner, DRY code, and conditional statements are often a potential source of bugs; I don't really like switch-cases and similar constructs.
Assuming that I can implement my cross-platform f(x) in C or C++, how do programmers in this field usually address the problem of translating and mapping the inputs and quadrants to the result via the actual implementation?
Note: In the below answer I am speaking very generally about code.
Assuming that I can implement my cross-platform f(x) in C or C++, how do programmers in this field usually address the problem of translating and mapping the inputs and quadrants to the result via the actual implementation?
The general answer to this is: In the most obvious and simplest way possible that achieves your purpose.
I'm not entirely sure I follow most of your arguments/questions but I have a feeling you are looking for problems where really none exist. Do you truly have the need to re-implement the trigonometric functions? Don't fall into the trap of NIH (Not Invented Here).
the straightforward solution is to detect the quadrant
Yes! I love straightforward code. Code where it is perfectly obvious at a glance what it does. Now, sometimes, just sometimes, you have to do some crazy things to get it to do what you want: for performance, or to avoid bugs out of your control. But the first version should be the most obvious and simple code that solves your problem. From there you do testing, profiling, and benchmarking, and if (only if) you find performance or other issues, then you go into the crazy stuff.
This is good in theory but this also presents a couple of really bad problems,
I would say that this is good in theory and in practice for most cases and I definitely don't see any "bad" problems. Minor issues in specific corner cases or design requirements at most.
A few things on a few of your specific comments:
approximation of f(x) is just for the first quadrant: Yes, and there are multiple reasons for this. One is simply that most trigonometric functions have identities, so you can easily use these to reduce the range of the input parameters (see the sketch after this list of points). This is important, as many numerical techniques only work over a specific range of inputs, or are more accurate/performant for small inputs. Next, for very large inputs you'll have to trim the range anyway for most numerical techniques to work, or at least to work in a decent amount of time and with sufficient accuracy. For example, look at the Taylor expansion for cos() and see how long it takes to converge sufficiently for large vs. small inputs.
it just prevents possible optimizations: Chances are your C++ compiler these days is way better at optimizations than you are. Sometimes it isn't, but the general procedure is to let the compiler do its optimization and only do manual optimizations where you have measured and proven that you need them. These days it is very non-intuitive to tell which code is faster just by looking at it (you can read all the questions on SO about performance issues and how surprising some of the root causes are).
namely memoization: I've never seen memoization used for a function of a double. Just think how many doubles there are between 0 and 1. In reduced-accuracy situations you can take advantage of it, but that is easily implemented as a custom function tailored for that exact situation. Thinking about it, I'm not exactly sure how to implement memoization for a function of a double in a way that actually means anything and doesn't lose accuracy or performance in the process.
increase probability of more branching with a concurrent execution: I'm not sure I'd implement trigonometric functions in a concurrent manner, but I suppose it's entirely possible to get some performance benefits. Again, the compiler is generally better at optimizations than you, so let it do its job and then benchmark/profile to see if you really need to do better.
doesn't lead to better, clean, dry code: I'm not sure what exactly you mean here, or what "DRY code" is for that matter. Yes, sometimes you can get into trouble with too many or too complex if/switch blocks, but I can't see how a simple check for 4 quadrants applies here... it's a pretty basic and simple case.
So for any platform I get the same y for the same values of x: My guess is that getting "exact" values for all 53 bits of a double across multiple platforms and systems is not going to be possible. What is the result if you only have 52 bits correct? This would be a great area to do some tests in and see what you get.
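As promised above, a minimal sketch of the quadrant-mapping idea for cos(). The kernel here is a deliberately simple low-order Taylor polynomial, so the accuracy is only illustrative; a real library would use a minimax polynomial and a more careful (e.g. Payne-Hanek) argument reduction for large inputs.

#include <cmath>
#include <cstdio>

// Degree-6 Taylor polynomial for cos, adequate only on the first quadrant.
static double cos_first_quadrant(double x) {
    double x2 = x * x;
    return 1.0 - x2 / 2.0 + x2 * x2 / 24.0 - x2 * x2 * x2 / 720.0;
}

// Full-range cos built from the first-quadrant kernel plus symmetries.
// Note: the naive fmod reduction loses accuracy for very large |x|.
double my_cos(double x) {
    const double two_pi  = 6.283185307179586;
    const double pi      = 3.141592653589793;
    const double half_pi = pi / 2.0;

    x = std::fabs(x);            // cos is even
    x = std::fmod(x, two_pi);    // reduce to [0, 2*pi)

    if (x < half_pi)      return  cos_first_quadrant(x);
    if (x < pi)           return -cos_first_quadrant(pi - x);
    if (x < pi + half_pi) return -cos_first_quadrant(x - pi);
    return cos_first_quadrant(two_pi - x);
}

int main() {
    for (double x : {0.3, 2.0, 4.0, 100.0})
        std::printf("x=%6.2f  my_cos=%+.6f  std::cos=%+.6f\n",
                    x, my_cos(x), std::cos(x));
}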
I've used trigonometric functions in C for over 20 years and 99% of the time I just use whatever built-in function is supplied. In the rare case I need more performance (or accuracy) as proven by testing or benchmarking, only then do I actually roll my own custom implementation for that specific case. I don't rewrite the entire gamut of <math.h> functions in the hope that one day I might need them.
I would suggest trying to code a few of these functions in as many ways as you can find and doing some accuracy and benchmark tests. This will give you some practical knowledge and some hard data on whether you actually need to reimplement these functions or not. At the very least this should give you some practical experience with implementing these types of functions, and chances are it will answer a lot of your questions in the process.

Fast implementation of MD5 in C++

First of all, to be clear, I'm aware that a huge number of MD5 implementations exist in C++. What I'm wondering is whether there is a comparison showing which implementation is faster than the others. Since I'm using this MD5 hash function on files larger than 10 GB, speed is indeed a major concern here.
I think the point avakar is trying to make is: with modern processing power, the IO speed of your hard drive is the bottleneck, not the calculation of the hash. Getting a more efficient algorithm will not help you, as that is (likely) not the slowest point.
If you are doing anything special (thousands of rounds, for example) then it may be different, but if you are just calculating the hash of a file, you need to speed up your IO, not your math.
I don't think it matters much (on the same hardware; GPGPUs are different, and perhaps faster, hardware for that kind of problem). The main part of MD5 is a fairly complex loop of arithmetic operations. What does matter is the quality of compiler optimizations.
What also matters is how you read the file. On Linux, mmap, madvise, and readahead could be relevant. Disk speed is probably the bottleneck (use an SSD if you can).
And are you sure you want MD5 specifically? There are simpler and faster hashing algorithms (MD4, etc.). Still, your problem is more I/O-bound than CPU-bound.
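To make the I/O-bound point concrete, here is a minimal sketch of hashing a large file in big sequential chunks (it assumes OpenSSL 1.1+ and its EVP API; swap in whatever MD5 implementation you actually use). The loop is dominated by fread(), not by the digest update.

#include <openssl/evp.h>   // assumes OpenSSL 1.1+ is available
#include <cstdio>
#include <vector>

// Hash a file in 1 MiB sequential reads; returns false if the file can't be opened.
bool md5_file(const char* path, unsigned char digest[EVP_MAX_MD_SIZE],
              unsigned int* digest_len) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;

    EVP_MD_CTX* ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_md5(), nullptr);

    std::vector<unsigned char> buf(1 << 20);   // 1 MiB read buffer
    size_t n;
    while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0)
        EVP_DigestUpdate(ctx, buf.data(), n);

    EVP_DigestFinal_ex(ctx, digest, digest_len);
    EVP_MD_CTX_free(ctx);
    std::fclose(f);
    return true;
}

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    unsigned char digest[EVP_MAX_MD_SIZE];
    unsigned int len = 0;
    if (md5_file(argv[1], digest, &len)) {
        for (unsigned int i = 0; i < len; ++i)
            std::printf("%02x", digest[i]);
        std::printf("\n");
    }
}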
I'm sure there are plenty of CUDA/OpenCL adaptations of the algorithm out there which should give you a definite speedup. You could also take the basic algorithm, think a bit, and get a CUDA/OpenCL implementation going.
Block-ciphers are perfect candidates for this type of implementation.
You could also get a C implementation of it and grab a copy of the Intel C compiler and see how good that is. The vectorization extensions in Intel CPUs are amazing for speed boosts.
A table is available here:
http://www.golubev.com/gpuest.htm
It looks like your bottleneck will probably be your hard drive IO.

LZF may compress with different algorithms

I am using libLZF for compression in my application. In the documentation, there is a comment that concerns me:
lzf_compress might use different algorithms on different systems and
even different runs, thus might result in different compressed strings
depending on the phase of the moon or similar factors.
I plan to compare compressed data to know whether the inputs were identical. Obviously, if different algorithms were used, then the compressed data would differ. Is there a solution to this problem? Possibly a way to force a certain algorithm each time? Or is this comment simply never true in practice? After all, "phase of the moon, or similar factors" is a little strange.
The reason for the "moon phase dependency" is that they omit initialization of some data structures to squeeze out a little bit of performance (only where it does not affect decompression correctness, of course). Not an uncommon trick, as compression libraries go. So if you put your compression code in a separate, one-shot process, and your OS zeroes memory before handing it over to a process (all "big" OSes do, but some smaller ones may not), then you'll always get the same compression result.
Also, take note of the following, from lzfP.h:
/*
* You may choose to pre-set the hash table (might be faster on some
* modern cpus and large (>>64k) blocks, and also makes compression
* deterministic/repeatable when the configuration otherwise is the same).
*/
#ifndef INIT_HTAB
# define INIT_HTAB 0
#endif
So I think you only need to #define INIT_HTAB 1 when compiling libLZF to make it deterministic, though I wouldn't bet on it too much without further analysis.
Decompress on the fly, then compare.
libLZF's web site states that "decompression [...] is basically at (unoptimized) memcpy-speed".
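A minimal sketch of the decompress-then-compare approach (the function signature is as declared in liblzf's lzf.h, but check it against your copy of the header):

extern "C" {
#include <lzf.h>
}
#include <cstring>

// Returns true if `compressed` decompresses to exactly `original`.
// `scratch` must point to at least original_len writable bytes.
bool compressed_matches(const void* compressed, unsigned int compressed_len,
                        const void* original, unsigned int original_len,
                        void* scratch) {
    unsigned int n = lzf_decompress(compressed, compressed_len,
                                    scratch, original_len);
    return n == original_len &&
           std::memcmp(scratch, original, original_len) == 0;
}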

What are the functions in the standard library that can be implemented faster with programming hacks? [closed]

I have recently read an article about fast sqrt calculation. Therefore, I have decided to ask the SO community and its experts to help me find out which STL algorithms or mathematical calculations can be implemented faster with programming hacks.
It would be great if you can give examples or links.
Thanks in advance.
System library developers have more concerns than just performance in mind:
Correctness and standards compliance: Critical!
General use: No optimisations are introduced, unless they benefit the majority of users.
Maintainability: Good hand-written assembly code can be faster, but you don't see much of it. Why?
Portability: Decent libraries should be portable to more than just Windows/x86/32bit.
Many optimisation hacks that you see around violate one or more of the requirements above.
In addition, optimisations that become useless, or even break, when the next generation of CPUs comes around the corner are not welcome.
If you don't have profiler evidence that it is really useful, don't bother optimising the system libraries. If you do, work on your own algorithms and code first anyway...
EDIT:
I should also mention a couple of other all-encompassing concerns:
The cost/effort to profit/result ratio: Optimisations are an investment. Some of them are seemingly-impressive bubbles. Others are deeper and more effective in the long run. Their benefits must always be considered in relation to the cost of developing and maintaining them.
The marketing people: No matter what you think, you'll end up doing whatever they want - or think they want.
Probably all of them can be made faster for a specific problem domain.
Now the real question is, which ones should you hack to make faster? None, until the profiler tells you to.
Several of the algorithms in <algorithm> can be optimized for vector<bool>::[const_]iterator. These include:
find
count
fill
fill_n
copy
copy_backward
move // C++0x
move_backward // C++0x
swap_ranges
rotate
equal
I've probably missed some. But all of the above algorithms can be optimized to work on many bits at a time instead of just one bit at a time (as would a naive implementation).
This is an optimization that I suspect is sorely missing from most STL implementations. It is not missing from this one:
http://libcxx.llvm.org/
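As a rough illustration of the word-at-a-time idea (this is not libc++'s actual code), here is a count over 64-bit words using std::popcount from C++20; since vector<bool> doesn't expose its underlying storage, a plain vector<uint64_t> stands in for the packed representation.

#include <bit>        // std::popcount (C++20)
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts set bits 64 at a time, instead of testing one "element" per iteration
// as a naive bit-by-bit loop would.
std::size_t count_set_bits(const std::vector<std::uint64_t>& words) {
    std::size_t total = 0;
    for (std::uint64_t w : words)
        total += std::popcount(w);
    return total;
}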
This is where you really need to listen to project managers and MBAs. What you're suggesting is re-implementing parts of the STL and/or the standard C library. There is an associated cost in implementation time and maintenance burden, so you shouldn't do it unless you really, genuinely need to, as John points out. The rule is simple: is the calculation you're doing slowing you down (i.e., are you CPU bound)? If not, don't create your own implementation just for the sake of it.
Now, if you're really interested in fast maths, there are a few places you can start. The GNU multi-precision library implements many algorithms from modern computer arithmetic and seminumerical algorithms, all about doing maths on arbitrary-precision integers and floats insanely fast. The people who write it optimise in assembly per build platform; it is about as fast as you can get in single-core mode. This is the most general case I can think of for optimised maths, i.e. maths that isn't specific to a certain domain.
Bringing my first and second paragraphs together with what thkala has said: consider that GMP/MPIR have optimised assembly versions for each CPU architecture and OS they support. Really. It's a big job, but it is what makes those libraries so fast on a specific, small subset of programming problems.
Sometimes domain-specific enhancements can be made. This is about understanding the problem in question. For example, when doing arithmetic in Rijndael's finite field you can, based on the knowledge that the field has characteristic 2 and its elements are polynomials with 8 binary coefficients, store each element in a uint8_t and implement addition and subtraction as XOR. How does this work? Basically, if you add or subtract two coefficients, each is either zero or one. If they're both zero or both one, the result is zero; if they differ, the result is one. Term by term, that is equivalent to XOR across an 8-bit binary string, where each bit represents a term in the polynomial. Multiplication is also relatively efficient (see the sketch below). You can bet that Rijndael was designed to take advantage of this kind of result.
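A minimal sketch of this arithmetic, assuming the Rijndael reduction polynomial x^8 + x^4 + x^3 + x + 1; the multiplication check is the worked example from the AES specification.

#include <cassert>
#include <cstdint>

// Addition and subtraction in GF(2^8) are the same operation: XOR.
std::uint8_t gf256_add(std::uint8_t a, std::uint8_t b) { return a ^ b; }

// Shift-and-XOR multiplication, reducing modulo x^8 + x^4 + x^3 + x + 1 (0x11B).
std::uint8_t gf256_mul(std::uint8_t a, std::uint8_t b) {
    std::uint8_t result = 0;
    for (int i = 0; i < 8; ++i) {
        if (b & 1)
            result ^= a;
        bool high = a & 0x80;
        a <<= 1;
        if (high)
            a ^= 0x1B;        // x^8 == x^4 + x^3 + x + 1
        b >>= 1;
    }
    return result;
}

int main() {
    assert(gf256_add(0x57, 0x57) == 0);      // every element is its own negative
    assert(gf256_mul(0x57, 0x83) == 0xC1);   // worked example from the AES spec
}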
That's a very specific result. It depends entirely on what you're doing to make things efficient. I can't imagine many STL functions are optimised purely for CPU speed, because among other things the STL provides collections via templates (which are about memory), file access (which is about storage), exception handling, and so on. In short, being really fast is a narrow subset of what the STL does and what it aims to achieve. Also, you should note that optimisation has different aspects. For example, if your app is heavy on IO, you are IO bound. Having a massively efficient square root calculation isn't really helpful, since "slowness" really means waiting on the disk, the OS, or your file-parsing routine.
In short, you as a developer of an STL library are trying to build an "all round" library for many different use cases.
But, since these things are always interesting, you might well be interested in bit twiddling hacks. I can't remember where I saw that, but I've definitely stolen that link from somebody else on here.
Almost none. The standard library is designed the way it is for a reason.
Taking sqrt, which you mention as an example, the standard library version is written to be as fast as possible, without sacrificing numerical accuracy or portability.
The article you mention is really beyond useless. There are some good articles floating around the net describing more efficient ways to implement square roots, but this article isn't among them (it doesn't even measure whether the described algorithms are faster!). Carmack's trick is slower than std::sqrt on a modern CPU, as well as being less accurate.
It was used in a game something like 12 years ago, when CPUs had very different performance characteristics. It was faster then, but CPUs have changed, and today it's both slower and less accurate than the CPU's built-in sqrt instruction.
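For reference, this is roughly what the trick in question looks like: a bit-level initial guess followed by one Newton-Raphson step, approximating 1/sqrt(x) (sqrt(x) is then x times that). The version below is rewritten with memcpy, since the pointer cast used in the original is undefined behaviour in C++.

#include <cstdint>
#include <cstring>

// Approximate reciprocal square root; shown only to illustrate the trick.
// On current hardware std::sqrt beats it in both speed and accuracy.
float fast_inv_sqrt(float x) {
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof i);        // reinterpret the float's bits
    i = 0x5f3759df - (i >> 1);            // "magic constant" initial guess
    float y;
    std::memcpy(&y, &i, sizeof y);
    y = y * (1.5f - 0.5f * x * y * y);    // one Newton-Raphson refinement
    return y;                              // ~ 1/sqrt(x)
}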
You can implement a square root function which is faster than std::sqrt without losing accuracy, but then you lose portability, as it'll rely on CPU features not present on older CPUs.
Speed, accuracy, portability: choose any two. The standard library tries to balance all three, which means that the speed isn't as good as it could be if you were willing to sacrifice accuracy or portability, and accuracy is good, but not as good as it could be if you were willing to sacrifice speed, and so on.
In general, forget any notion of optimizing the standard library. The question you should be asking is whether you can write more specialized code.
The standard library has to cover every case. If you don't need that, you might be able to speed up the cases that you do need. But then it is no longer a suitable replacement for the standard library.
Now, there are no doubt parts of the standard library that could be optimized. The C++ IOStreams library in particular comes to mind. It is often naively, and very inefficiently, implemented. The C++ committee's technical report on C++ performance has an entire chapter dedicated to exploring how IOStreams could be implemented to be faster.
But that's I/O, where performance is often considered to be "unimportant".
For the rest of the standard library, you're unlikely to find much room for optimization.