Error detection and correction general algorithms - error-correction

I am making some data encoding software for a very lossy format of data storage (namely magnetic tape) and as such will need to incorporate digital error correction.
My current plan is to split the data into small and large "blocks", add (7,4) Hamming codes to the small blocks and high-redundancy Reed-Solomon codes to the large blocks, and then redistribute all the data throughout itself in a recoverable pattern.
My logic here is that data errors tend to occur in bursts, so redistributing the data amongst itself should spread the errors out evenly. The Hamming codes should then catch all the single-bit errors, leaving the Reed-Solomon codes to catch any errors the Hamming codes couldn't.
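For concreteness, the (7,4) Hamming step I have in mind is roughly the following (a minimal, illustrative pure-Python sketch rather than my actual implementation; the bit layout and function names are just placeholders):

```python
# Minimal (7,4) Hamming sketch: encode 4 data bits into 7 bits, correct any
# single-bit error on decode. Positions 1, 2 and 4 (1-indexed) hold parity.

def hamming74_encode(d):              # d = [d1, d2, d3, d4]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def hamming74_decode(c):              # c = 7 received bits
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-indexed position of a single-bit error
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1          # flip the flagged bit
    return [c[2], c[4], c[5], c[6]]   # recovered data bits

print(hamming74_decode(hamming74_encode([1, 0, 1, 1])))  # -> [1, 0, 1, 1]
```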
I'm looking for general suggestions of better digital error-correction schemes that would be more suitable for this project, as I'm aware that I have VERY limited knowledge of this subject area.
While I would be grateful for some further reading on this topic, my self-imposed deadline is looming, so book recommendations are only of limited help.
TLDR:
I want error correction algorithms that:
Are easy to implement
Have adjustable amounts of redundant bits
Can cope with very high bit-loss rates
Are well documented
Thanks so much (:
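P.S. By "redistribute all the data throughout itself" I mean something like a plain row/column block interleaver, roughly like this sketch (the block dimensions are arbitrary placeholders):

```python
# Rough block-interleaver sketch: write data row by row into a rows x cols
# grid, read it out column by column. A burst of errors on the interleaved
# stream then lands on positions that are far apart after de-interleaving.

def interleave(bits, rows, cols):
    assert len(bits) == rows * cols
    return [bits[r * cols + c] for c in range(cols) for r in range(rows)]

def deinterleave(bits, rows, cols):
    assert len(bits) == rows * cols
    return [bits[c * rows + r] for r in range(rows) for c in range(cols)]

data = list(range(12))                       # stand-in for encoded bits
assert deinterleave(interleave(data, 3, 4), 3, 4) == data
```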

Related

Sentiment analysis feature extraction

I am new to NLP and feature extraction. I wish to create a machine learning model that can determine the sentiment of stock-related social media posts. For feature extraction on my dataset I have opted to use Word2Vec. My question is:
Is it important to train my word2vec model on a corpus of stock-related social media posts? The datasets that are available for this are not very large. Should I just use a much larger set of pretrained word vectors?
The only way to tell what will work better for your goals, within your constraints of data/resources/time, is to try alternate approaches & compare the results on a repeatable quantitative evaluation.
Having training texts that are properly representative of your domain-of-interest can be quite important. You may need your representation of the word 'interest', for example, to reflect its sense in the stock/financial world, rather than the more general sense of the word.
But quantity of data is also quite important. With smaller datasets, none of your words may get great vectors, and words important to evaluating new posts may be missing or of very-poor quality. In some cases taking some pretrained set-of-vectors, with its larger vocabulary & sharper (but slightly-mismatched to domain) word-senses may be a net help.
Because these pull in different directions, there's no general answer. It will depend on your data, goals, limits, & skills. Only trying a range of alternative approaches, and comparing them, will tell you what should be done for your situation.
This iterative, comparative experimental pattern repeats endlessly as your projects & knowledge grow – it's what the experts do! – so it's also important to learn and practice it. There's no authority you can ask for a certain answer to many of these tradeoff questions.
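If it helps to see the two routes side by side, here is a rough sketch with gensim (assuming a gensim 4.x API; the tiny corpus and the pretrained-vector name are just placeholders, and loading the pretrained vectors triggers a download):

```python
# Sketch of the two alternatives discussed above (gensim 4.x API assumed).
from gensim.models import Word2Vec
import gensim.downloader as api

# Option A: train your own vectors on a (possibly small) domain corpus.
# `domain_posts` is a placeholder: a list of token lists from stock posts.
domain_posts = [["interest", "rates", "hit", "stocks"],
                ["buy", "the", "dip"]]
own_model = Word2Vec(domain_posts, vector_size=100, window=5, min_count=1)
print(own_model.wv.most_similar("interest", topn=3))

# Option B: load larger, general-purpose pretrained vectors (downloads data).
pretrained = api.load("glove-wiki-gigaword-100")   # ~400k-word vocabulary
print(pretrained.most_similar("interest", topn=3))
```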
Other observations on what you've said:
If you don't have a large dataset of posts, and well-labeled 'ground truth' for sentiment, your results may not be good. All these techniques benefit from larger training sets.
Sentiment analysis is often approached as a classification problem (assigning texts to bins of 'positive' or 'negative' sentiment, perhaps of multiple intensities) or a regression problem (assigning texts a value on a numerical scale). There are many simpler ways to create features for such processes that do not involve word2vec vectors – a somewhat more advanced technique, which adds complexity. (In particular, word-vectors only give you features for individual words, not texts of many words, unless you add some other choices/steps.) If new to the sentiment-analysis domain, I would recommend against starting with word-vector features. Only consider adding them later, after you've achieved some initial baseline results without their extra complexity/choices. At that point, you'll also be able to tell if they're helping or not.
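For example, a bare-bones baseline without word vectors might look like the sketch below (scikit-learn, with made-up toy posts and labels, purely to show where such features fit):

```python
# Minimal bag-of-words baseline for sentiment classification (scikit-learn).
# Toy data only; replace with your labeled stock-post dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = ["great earnings, buying more", "terrible guidance, selling",
         "strong quarter, very bullish", "missed estimates, bearish"]
labels = [1, 0, 1, 0]                      # 1 = positive, 0 = negative

baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                         LogisticRegression())
baseline.fit(posts, labels)
print(baseline.predict(["bullish on this quarter"]))
```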

Which error correction code should I use for GF(32)?

I searched for comparisons between Reed-Solomon, Turbo and LDPC codes, but they all seem to focus on efficiency. I'm more interested in the licensing of available libs for commercial use, ease of implementation, and GF(32) support, i.e. a code with 32 symbols only (available Reed-Solomon implementations work for GF(256) and above).
Efficiency (speed) is not relevant. The messages consist of 24 symbols.
Can you provide a quick comparison on the most well-known Reed-Solomon, Turbo and LDPC codes for this case in which speed is not relevant?
Thanks.
Basically, Reed-Solomon is optimal, which means you can correct exactly up to (n-k)/2 errors (k = length of your message, n = length of message + EC symbols), while Turbo codes and LDPC are near-optimal, meaning you can correct up to (n-k-e)/2 errors, where e is a small constant; in ideal cases you are very close to (n-k)/2 (that's why it's called near-optimal: it's close to the Shannon limit). Turbo codes and LDPC have similar error correction power, and there are lots of variants depending on your needs (you can find lots of literature reviews and presentations).
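For instance, with your 24-symbol messages over GF(32), a full-length Reed-Solomon codeword has n = 31, so k = 24 leaves n - k = 7 check symbols and up to 3 correctable symbol errors per codeword (7/2, rounded down).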
What the different variants of LDPC or Turbo codes do is optimize the algorithm to fit certain characteristics of the erasure channel (i.e., the data) so as to reduce the constant e (and thus approach the Shannon limit). So the best variant in your case depends on the details of your erasure channel. Also, to my knowledge, they are all in the public domain now (perhaps not yet some Turbo-code patents, but if not, they will be soon).
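If the sticking point is only that off-the-shelf Reed-Solomon code assumes GF(256), note that the underlying field arithmetic is easy to rebuild for GF(32). A rough sketch of the exp/log tables such a codec needs (using x^5 + x^2 + 1, one common choice of primitive polynomial):

```python
# Sketch of GF(2^5) = GF(32) arithmetic tables, the building block a
# Reed-Solomon codec over 32 symbols needs. Primitive polynomial: x^5 + x^2 + 1.
PRIM = 0b100101          # x^5 + x^2 + 1
FIELD = 32

exp = [0] * (2 * FIELD)  # exp[i] = alpha**i, doubled so products index safely
log = [0] * FIELD
x = 1
for i in range(FIELD - 1):
    exp[i] = x
    log[x] = i
    x <<= 1
    if x & FIELD:        # reduce modulo the primitive polynomial
        x ^= PRIM
for i in range(FIELD - 1, 2 * FIELD):
    exp[i] = exp[i - (FIELD - 1)]

def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return exp[log[a] + log[b]]

assert exp[FIELD - 1] == exp[0] == 1               # alpha has order 31
assert sorted(log[1:]) == list(range(FIELD - 1))   # every nonzero element has a log
```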

dimension reduction in spam filtering

I'm performing an experiment in which I need to compare classification performance of several classification algorithms for spam filtering, viz. Naive Bayes, SVM, J48, k-NN, RandomForests, etc. I'm using the WEKA data mining tool. While going through the literature I came to know about various dimension reduction methods which can be broadly classified into two types-
Feature Reduction: Principal Component Analysis, Latent Semantic Analysis, etc.
Feature Selection: Chi-Square, InfoGain, GainRatio, etc.
I have also read this WEKA tutorial by Jose Maria on his blog: http://jmgomezhidalgo.blogspot.com.es/2013/02/text-mining-in-weka-revisited-selecting.html
In this blog he writes, "A typical text classification problem in which dimensionality reduction can be a big mistake is spam filtering". So, now I'm confused whether dimensionality reduction is of any use in case of spam filtering or not?
Further, I have also read in the literature about Document Frequency and TF-IDF being used as feature reduction techniques, but I'm not sure how they work and come into play during classification.
I know how to use WEKA, chain filters and classifiers, etc. The problem I'm facing is that, since I don't have a good grasp of feature selection/reduction (including TF-IDF), I am unable to decide which feature selection techniques and classification algorithms I should combine, and how, to make my study meaningful. I also have no idea about the optimal threshold value that I should use with chi-square, info gain, etc.
In the StringToWordVector class, I have an IDFTransform option, so does it make sense to set it to TRUE and also use a feature selection technique, say InfoGain?
Please guide me and if possible please provide links to resources where I can learn about dimension reduction in detail and can plan my experiment meaningfully!
Well, Naive Bayes seems to work best for spam filtering, and it doesn't play nicely with dimensionality reduction.
Many dimensionality reduction methods try to identify the features with the highest variance. This of course won't help much with spam detection; you want discriminative features.
Plus, there is not only one type of spam, but many. This is likely why naive Bayes works better than many other methods that assume there is only one type of spam.
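That said, if you do want to see how IDF weighting chains with a feature-selection step in front of a classifier, here is an analogous sketch outside WEKA (scikit-learn, with chi-square standing in for InfoGain; the toy data and the k threshold are placeholders, not recommendations):

```python
# Analogue of StringToWordVector (with IDF) followed by a feature-selection
# filter, using scikit-learn instead of WEKA. Toy data; k is a placeholder.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

mails = ["cheap meds buy now", "meeting moved to 3pm",
         "win a free prize now", "project report attached"]
is_spam = [1, 0, 1, 0]

pipe = make_pipeline(TfidfVectorizer(),          # ~ StringToWordVector + IDF
                     SelectKBest(chi2, k=5),     # ~ InfoGain/chi-square ranking
                     MultinomialNB())
pipe.fit(mails, is_spam)
print(pipe.predict(["free prize meds"]))
```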

How many samples are optimal in one class using k-nearest neighbor?

I have implemented the k-nearest neighbor algorithm in my system. It consists of 26 classes, each with 100 samples. In my case, K=7, and it was completely trial and error to get the best classification result.
I know that K should be chosen wisely to reduce the noise on the classification. But what about the number of samples? Is there any general rule such as "the more samples the better result"? Does it depend on something?
Thank you for all your responses.
You could try considering whatever underlying mechanism is generating your data, or whatever background knowledge you have on the problem, which might give you an idea of the relative size of the noise and the true underlying variation. E.g., when predicting favourite sports team from location I would expect more variation than when predicting favourite sport, so I would use a smaller k. However, I don't know of much general guidance, except to use cross-validation.
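A quick way to run that cross-validation sweep over k is sketched below (scikit-learn; the random data is only a stand-in for your 26-class, 100-samples-per-class set):

```python
# Pick k by cross-validation instead of trial and error (scikit-learn sketch).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2600, 10))          # stand-in for 26 classes x 100 samples
y = np.repeat(np.arange(26), 100)

scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y,
                             cv=5).mean()
          for k in (1, 3, 5, 7, 9, 11)}
print(max(scores, key=scores.get), scores)
```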

Does Compressed Sensing bring anything new to data Compression? [closed]

Compressed sensing is great for situations where capturing data is expensive (either in energy or time). It works by taking a smaller number of samples and using linear or convex programming to reconstruct the original reference signal away from the sensor.
However, in situations like image compression, given that the data is already on the computer -- does compressed sensing offer anything? For example, would it offer better data compression? Would it result in better image search?...
With regards to your question
"...given that the data is already on the computer -- does compressed sensing offer anything? For example, would it offer better data compression? Would it result in better image search?..."
In general the answer to your question is no, it would not offer better data compression, at least initially! This is the case for images, where nonlinear schemes like jpeg do better than compressed sensing by a factor of 4 to 5; that factor comes from the k log(N/k) term found in diverse theoretical results in different papers.
I said initially because right now compressed sensing is mostly focused on the concept of sparsity, but there is new work coming up that tries to use additional information, such as the fact that wavelet decompositions come in clumps, which could improve the compression. This work and others are likely to provide additional improvement, perhaps with the possibility of getting close to nonlinear transforms such as jpeg.
The other thing you have to keep in mind is that jpeg is the result of a focused effort of the whole industry and many years of research. So it really is difficult to do better than that but compressive sensing really provides some means of compression of other datasets without the need for the years of experience and manpower.
Finally, there is something immensely awe inspiring in the compression found in compressive sensing. It is universal, this means that right now you may "decode" image to a certain level of detail and then in ten years, using the same data you might actually "decode" a better image/dataset (this is with the caveat that the information was there in the first place) because your solvers will be better. You cannot do that with jpeg or jpeg2000 because the data that is compressed is intrinsically connected to the decoding scheme.
(disclosure: I write a small blog on compressed sensing)
Since the whole point of compressed sensing is to avoid taking measurements, which, as you say, can be expensive to take, it should come as no surprise that the compression ratio will be worse than if the compression implementation is allowed to make all the measurements it wants and cherry-pick the ones that generate the best outcome.
As such, I very much doubt that an implementation utilizing compressed sensing for data already present (in effect, already having all the measurements), is going to produce better compression ratios than the optimal result.
Now, having said that, compressed sensing is also about picking a subset of the measurements that will reproduce a result that is similar to the original when decompressed, but might lack some of the detail, simply because you're picking that subset. As such, it might also be that you can indeed produce better compression ratios than the optimal result, at the expense of a bigger loss of detail. Whether this is better than, say, a jpeg compression algorithm where you simply throw out more of the coefficients, I don't know.
Also, if, say, an image compression implementation that utilizes compressed sensing can reduce the time it takes to compress the image from the raw bitmap data, that might give it some traction in scenarios where the time used is an expensive factor, but the detail level is not. For instance.
In essence, if you have to trade speed for quality of results, a compressed sensing implementation might be worth looking into. I have yet to see widespread usage of this though so something tells me it isn't going to be worth it, but I could be wrong.
I don't know why you bring up image search though; I don't see how the compression algorithm can help with image search, unless you somehow use the compressed data to search for images. This will probably not do what you want, since very often you search for images that contain certain visual patterns but aren't 100% identical.
This may not be the exact answer to your question, but I just want to emphasise other important application domains of CS. Compressive sensing can be a great advantage in wireless multimedia networks, where there is great emphasis on the power consumption of the sensor node. Here the sensor node has to transmit the information (say, an image taken by a surveillance camera). If it has to transmit all the samples, we cannot afford to improve the network lifetime. Whereas if we use JPEG compression, it brings in high complexity on the encoder (sensor node) side, which is again undesirable. So compressive sensing helps in moving the complexity from the encoder side to the decoder side.
As researchers in the area, we have successfully transmitted an image and a video over a lossy channel with considerable quality by sending only 52% of the total samples.
One of the benefits of compressed sensing is that the sensed signal is not only compressed but it's encrypted as well. The only way a reference signal can be reconstructed from its sensed signal is to perform optimization (linear or convex programming) on a reference signal estimate when applied to the basis.
Does it offer better data compression? That's going to be application dependent. First, it will only work on sparse reference signals, meaning it's probably only applicable to image, audio, or RF signal compression, and not applicable to general data compression. In some cases it may be possible to get a better compression ratio using compressed sensing than other approaches, and in other instances that won't be the case. It depends on the nature of the signal being sensed.
Would it result in better image search? I have little hesitation answering this "no". Since the sensed signal is both compressed and encrypted, there is virtually no way to reconstruct the reference signal from the sensed signal without the "key" (basis function). In those instances where the basis function is available, the reference signal still would need to be reconstructed to perform any sort of image processing / object identification / characterization or the like.
Compressed sensing means some data can be reconstructed from a small number of measurements. Most data can be linearly transformed into another space in which most of the dimensions can be ignored.
So we can reconstruct most of the data from only some of the dimensions, and that "some" can be a small fraction of the original number of dimensions.
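As a tiny numerical illustration of that idea, here is a sketch that recovers a sparse signal from far fewer random measurements than its length (scikit-learn's orthogonal matching pursuit used as the sparse solver; all sizes are arbitrary):

```python
# Toy compressed-sensing demo: recover a sparse signal from far fewer random
# measurements than its length, using orthogonal matching pursuit as the solver.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(1)
n, m, k = 256, 64, 5                      # signal length, measurements, sparsity

x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.normal(size=k)    # k-sparse signal
A = rng.normal(size=(m, n)) / np.sqrt(m)                   # random sensing matrix
y = A @ x                                                   # m measurements

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False).fit(A, y)
print(np.allclose(omp.coef_, x, atol=1e-6))                 # usually True
```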