QR code generation algorithm, data masking penalty evaluation edge case - C++

I'm implementing a QR code generation algorithm as explained on thonky.com and I'm trying to understand one of the cases:
As stated on this page, after computing the percentage of dark modules in the whole code, I should take the two nearest multiples of five (for example, 45 and 50 for 48%). But what if the percentage is itself a multiple of 5, for example exactly 45.0%? Which numbers should be taken then: just 45? 40 and 50? 40 and 45? 45 and 50? Something else entirely? I couldn't find an answer to that anywhere.
Thank you very much in advance for the help!

Indeed the Thonky tutorial is unclear in this respect, so let's turn to the official standard (behind a paywall at ISO but easy to find online). Section 8.8.2, page 52, Table 24:
Evaluation condition: 50 ± (5 × k)% to 50 ± (5 × (k + 1))%
Points: N₄ × k
Here, N₄ = 10, and
k is the rating of the deviation of the proportion of dark modules in the symbol from 50% in steps of 5%.
So for exactly 45% dark modules, you'd have k = 1, resulting in a penalty of 10 points.
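In case it helps to see that table as code, here is a minimal C++ sketch of evaluation condition 4 under that reading (the function name is mine):

#include <cmath>

// Evaluation condition 4 (balance of dark modules) as read from Table 24:
// k counts whole 5% steps of deviation from 50%, penalty = N4 * k, N4 = 10.
int darkModulePenalty(int darkModules, int totalModules) {
    double percent = 100.0 * darkModules / totalModules;
    int k = static_cast<int>(std::floor(std::fabs(percent - 50.0) / 5.0));
    return 10 * k;
}

With this reading, exactly 45% dark gives k = 1 and a penalty of 10, while 48% gives k = 0 and no penalty, which matches the tutorial's worked method.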
Also note that it doesn't really matter if you get this slightly wrong. Because the mask pattern identifier is encoded in the format string, a reader can still decode the QR code even if you accidentally chose a slightly suboptimal mask pattern.

Related

Correct values for SsaSpikeEstimator's pvalueHistoryLength

In the creation of a SsaSpikeEstimator instance by the DetectSpikeBySsa method, there is a parameter called pvalueHistoryLength - could anybody please help me understand, for a given time series with X points, what the optimal value for this parameter is?
I had a similar issue. When I read the paper https://arxiv.org/pdf/1206.6910.pdf, I noticed one paragraph:
Also, simulations and theory (Golyandina, 2010) show that it is
better to choose window length L smaller than half of the time series length
N. One of the recommended values is N/3.
Maybe that's why, in the ML.NET Power Anomaly example, the value is chosen to be 30 for the 90-point dataset.
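If you just want that rule of thumb in code, a tiny sketch (the helper name is mine, not part of the ML.NET API):

#include <algorithm>

// SSA rule of thumb quoted above: window length around N / 3 (and below N / 2).
// For the 90-point example this gives 30.
int suggestedWindowLength(int seriesLength) {
    return std::max(2, seriesLength / 3);
}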

PCA: What does it mean that the number of necessary PCs for a given explanation percentage changes?

Say one has a program that performs PCA.
The program calculates the number of PCs necessary in order to cover a given share of total variation in the data, e.g. 95 %.
Say the number of PCs necessary in order to cover 95 % of the variance is 10 for the data used at time t=1.
At t=2 we re-run the program with data from t=2.
For t=2 the number of PCs necessary in order to cover 95 % of the variance is 5.
Hence the number of necessary PCs in order to cover 95 % of the variance has dropped from 10 to 5 from t=1 to t=2.
Main question:
Can we make any conclusions about changes in the data from t=1 to t=2 in this case?
Example:
Can we say something like: "Since the number of PCs decreases from t=1 to t=2, there is more correlation in the data at t=2 than at t=1. With more correlation in the data, fewer PCs are needed to cover a given share of the variance in the data."
Yes. If the original variables are strongly correlated, a reduced number of components can explain 80% to 90% of the variance, and the percentage of variance corresponds to the percentage of information from your data that has been kept by the PCs. Furthermore, if you'd like more details about PCA, you can read this great answer: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/140579#140579
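To make the "how many PCs for 95%" part concrete, here is a small sketch using the Eigen library (my choice; the question doesn't name one). It centers the data, takes the eigenvalues of the covariance matrix, and counts components until the cumulative variance reaches the threshold:

#include <Eigen/Dense>

// Count how many principal components are needed to reach `threshold`
// (e.g. 0.95) of the total variance. `data` holds one sample per row.
int componentsForVariance(const Eigen::MatrixXd& data, double threshold) {
    // center the columns, then form the sample covariance matrix
    Eigen::MatrixXd centered = data.rowwise() - data.colwise().mean();
    Eigen::MatrixXd cov =
        (centered.adjoint() * centered) / double(data.rows() - 1);

    // eigenvalues of the covariance matrix are the per-component variances;
    // SelfAdjointEigenSolver returns them in increasing order, so reverse them
    Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> solver(cov);
    Eigen::VectorXd variances = solver.eigenvalues().reverse();

    double total = variances.sum();
    double cumulative = 0.0;
    for (int i = 0; i < variances.size(); ++i) {
        cumulative += variances(i);
        if (cumulative / total >= threshold) return i + 1;
    }
    return static_cast<int>(variances.size());
}

With strongly correlated columns, most of the variance sits in the first few eigenvalues, so the returned count drops, which is exactly the effect described in the question.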

Hamming codes formulas

This is the question:
Determine if the Hamming codes (15,10), (14,10) and (13,10) can correct a single error (SEC), detect a single error (SED) or detect double bit errors (DED).
I do know how Hamming distance works and how you can detect an error if you have the data word that you want to transmit. But I don't know how to do it without the data word.
Only for SEC, which has the formula:
2^m >= m + k + 1
where
m = check bits
k = data bits
But are there any formulas for SED and DED? I have searched Google all day long without any success.
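As a quick sanity check of that inequality, here is a small sketch that evaluates it for the three codes in the question, taking m = n - k check bits for an (n, k) code:

#include <iostream>

// SEC condition from the question: 2^m >= m + k + 1,
// with m check bits and k data bits.
bool canCorrectSingleError(int m, int k) {
    return (1 << m) >= m + k + 1;
}

int main() {
    // (n, k) codes from the question; m = n - k check bits
    int codes[][2] = {{15, 10}, {14, 10}, {13, 10}};
    for (auto& code : codes) {
        int n = code[0], k = code[1], m = n - k;
        std::cout << "(" << n << "," << k << "): SEC "
                  << (canCorrectSingleError(m, k) ? "possible" : "not possible")
                  << "\n";
    }
}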
I learned how to check the Hamming codes through this YouTube clip. There is no shortcut in it; I wish there were a faster way to solve the problem.
https://www.youtube.com/watch?v=JAMLuxdHH8o
This does not properly answer the question, and I know what user "user3314356" actually needs, but I could not comment because I did not have 50 reputation points.

Simple Curve Fitting Implementation in C++ (SVD Least Squares Fit or similar)

I have been scouring the internet for quite some time now, trying to find a simple, intuitive, and fast way to approximate a 2nd degree polynomial using 5 data points.
I am using VC++ 2008.
I have come across many libraries, such as cminpack, cmpfit, lmfit, etc., but none of them seem very intuitive and I have had a hard time implementing the code.
Ultimately I have a set of discrete values in a 1D array, and I am trying to find the 'virtual max point' by fitting a curve to the data and then locating the maximum of that curve at a non-integer position (whereas just reading the array only gives the maximum at integer positions).
Anyway, if someone has done something similar to this, and can point me to the package they used, and maybe a simple implementation of the package, that would be great!
I am happy to provide some test data and graphs to show you what kind of stuff I'm working with, but I feel my request is pretty straightforward. Thank you so much.
EDIT: Here is the code I wrote which works!
http://pastebin.com/tUvKmGPn
Change size in the code to adjust how many input points are used. Sample output:
0 0
1 1
2 4
4 16
7 49
a: 1 b: 0 c: 0
Press any key to continue . . .
Thanks for the help!
Assuming that you want to fit a standard parabola of the form
y = ax^2 + bx + c
to your 5 data points, all you need is to solve a 3 x 3 matrix equation. Take a look at this example, http://www.personal.psu.edu/jhm/f90/lectures/lsq2.html - it works through the same problem you seem to be describing (only using more data points). If you have a basic grasp of calculus and are able to invert a 3x3 matrix (or do something nicer numerically, which I am guessing you can, given that you refer specifically to SVD in your question title), then this example will clarify what you need to do.
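If a self-contained example helps, here is a minimal sketch of exactly that approach, with no external libraries: it accumulates the sums that appear in the normal equations and solves the resulting 3 x 3 system with Cramer's rule (the function names are mine):

#include <cstddef>
#include <iostream>

// Least-squares fit of y = a*x^2 + b*x + c by solving the 3x3 normal equations.
void fitParabola(const double* x, const double* y, std::size_t n,
                 double& a, double& b, double& c) {
    // accumulate the sums that appear in the normal equations
    double Sx = 0, Sx2 = 0, Sx3 = 0, Sx4 = 0, Sy = 0, Sxy = 0, Sx2y = 0;
    for (std::size_t i = 0; i < n; ++i) {
        double xi = x[i], xi2 = xi * xi;
        Sx += xi; Sx2 += xi2; Sx3 += xi2 * xi; Sx4 += xi2 * xi2;
        Sy += y[i]; Sxy += xi * y[i]; Sx2y += xi2 * y[i];
    }
    // normal equations written as a 3x3 system M * [a b c]^T = v
    double M[3][3] = {{Sx4, Sx3, Sx2}, {Sx3, Sx2, Sx}, {Sx2, Sx, double(n)}};
    double v[3] = {Sx2y, Sxy, Sy};
    // solve by Cramer's rule (fine for a fixed 3x3 system)
    auto det3 = [](double m[3][3]) {
        return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
             - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
             + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
    };
    double D = det3(M);
    double sol[3];
    for (int col = 0; col < 3; ++col) {
        double Mi[3][3];
        for (int r = 0; r < 3; ++r)
            for (int cc = 0; cc < 3; ++cc)
                Mi[r][cc] = (cc == col) ? v[r] : M[r][cc];
        sol[col] = det3(Mi) / D;
    }
    a = sol[0]; b = sol[1]; c = sol[2];
}

int main() {
    double x[] = {0, 1, 2, 4, 7};   // the points from the sample output above
    double y[] = {0, 1, 4, 16, 49};
    double a, b, c;
    fitParabola(x, y, 5, a, b, c);
    std::cout << "a: " << a << " b: " << b << " c: " << c << "\n";
    // the fitted extremum ("virtual max point") is at x = -b / (2*a) when a != 0
}

Run on the five points from the sample output above, it reproduces a: 1 b: 0 c: 0.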
Look at this Wikipedia page on Polynomial Regression: https://en.wikipedia.org/wiki/Polynomial_regression

Calculating the mean for a set of numbers while neglecting outliers

First of all, this is more of a math question than a coding one, so please be patient.
I am trying to figure out an algorithm to calculate the mean of a set of numbers. However, I need to neglect any numbers that are not close to the majority of the results. Here is an example of what I am trying to do:
Let's say I have a set of numbers similar to the following:
{ 90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400 }
It is clear for the set above that the majority of the numbers lie between 90 and 99; however, there are some outliers like { 300, 400, 2, 3 }. I need to calculate the mean of those numbers while neglecting the outliers. I remember reading about something like this in a statistics class, but I can't remember what it was or how to approach the solution.
I would appreciate any help.
Thanks
What you could do is:
estimate the percentage of outliers in your data: about 25% (4/15) in the provided dataset,
compute the appropriate quantiles: 8-quantiles for your dataset, so as to exclude the outliers,
estimate the mean between the first and the last quantile.
PS: Outliers constituting 25% of your dataset is a lot!
PPS: For the second step, we assumed the outliers are "symmetrically distributed". A common variant uses 4-quantiles (quartiles) and excludes points more than 1.5 times the interquartile range (IQR) below Q1 or above Q3.
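A minimal sketch of that quartile/IQR variant (nearest-rank quantiles, helper name mine):

#include <algorithm>
#include <cstddef>
#include <vector>

// Mean of the values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (the Tukey fence).
double fencedMean(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    auto quantile = [&](double q) {
        // nearest-rank quantile; good enough for a sketch
        return v[static_cast<std::size_t>(q * (v.size() - 1))];
    };
    double q1 = quantile(0.25), q3 = quantile(0.75);
    double iqr = q3 - q1;
    double lo = q1 - 1.5 * iqr, hi = q3 + 1.5 * iqr;

    double sum = 0.0;
    std::size_t count = 0;
    for (double x : v)
        if (x >= lo && x <= hi) { sum += x; ++count; }
    return count ? sum / count : 0.0;
}

For the example set above this keeps the eleven values between 90 and 99 (mean about 93.1) and ignores { 2, 3, 300, 400 }.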
First you need to determine the standard deviation and mean of the full set. The outliers are those values that are greater than 3 standard deviations from the (full set) mean.
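A straightforward sketch of that approach; note that with gross outliers the mean and standard deviation of the full set are themselves inflated, so the fence can end up wide:

#include <cmath>
#include <cstddef>
#include <vector>

// Mean of the values within 3 standard deviations of the full-set mean.
double threeSigmaMean(const std::vector<double>& v) {
    double mean = 0.0;
    for (double x : v) mean += x;
    mean /= v.size();

    double var = 0.0;
    for (double x : v) var += (x - mean) * (x - mean);
    double sd = std::sqrt(var / v.size());

    double sum = 0.0;
    std::size_t count = 0;
    for (double x : v)
        if (std::fabs(x - mean) <= 3.0 * sd) { sum += x; ++count; }
    return count ? sum / count : mean;
}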
A simple method that works well is to take the median instead of the average. The median is far more robust to outliers.
You could also minimize a Geman-McClure function:
x_hat = argmin over x' of sum_i G(x_i - x'), where G(x) = x^2 / (x^2 + sigma^2)
If you plot the G function, you will find that it saturates, which is a good way of softly excluding outliers.
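One practical way to minimize that objective is iteratively reweighted least squares; the update below follows from setting the derivative of the sum to zero, and sigma is a scale parameter you must choose yourself (roughly the spread of the inliers):

#include <vector>

// Robust location estimate minimizing sum_i G(x_i - mu) with the
// Geman-McClure function G(r) = r^2 / (r^2 + sigma^2), via iteratively
// reweighted least squares. The weights are w_i = 1 / (r_i^2 + sigma^2)^2;
// the constant factor 2*sigma^2 in the derivative cancels out of the
// weighted mean.
double gemanMcClureMean(const std::vector<double>& v, double sigma,
                        int iterations = 50) {
    double mu = 0.0;                  // start from the ordinary mean
    for (double x : v) mu += x;
    mu /= v.size();

    for (int it = 0; it < iterations; ++it) {
        double num = 0.0, den = 0.0;
        for (double x : v) {
            double r = x - mu;
            double d = r * r + sigma * sigma;
            double w = 1.0 / (d * d); // saturating: big residuals get tiny weight
            num += w * x;
            den += w;
        }
        mu = num / den;
    }
    return mu;
}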
I'd be very careful about this. You could be doing yourself and your conclusions a great disservice.
How is your program supposed to recognize outliers? The normal distribution would say that about 99.7% of the values fall within +/- three standard deviations of the mean, so you could calculate both for the unfiltered data, exclude the values that fall outside the assumed range, and recalculate.
However, you might be throwing away something significant by doing so. The normal distribution isn't sacred; outliers are far more common in real life than the normal distribution would suggest. Read Taleb's "Black Swan" to see what I mean.
Be sure you understand fully what you're excluding before you do so. I think it'd be far better to leave all the data points, warts and all, and come up with a good written explanation for them.
Another approach would be to use an alternative measure like the median, which is less sensitive to outliers than the mean. It's harder to calculate, though.
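For what it's worth, computing the median only needs a partial selection rather than a full sort, as in this sketch:

#include <algorithm>
#include <cstddef>
#include <vector>

// Median via std::nth_element: O(n) on average, no full sort needed.
double median(std::vector<double> v) {
    std::size_t mid = v.size() / 2;
    std::nth_element(v.begin(), v.begin() + mid, v.end());
    double m = v[mid];
    if (v.size() % 2 == 0) {
        // even count: average the two middle values; after nth_element the
        // lower middle is the largest element of the first half
        double lower = *std::max_element(v.begin(), v.begin() + mid);
        m = (lower + m) / 2.0;
    }
    return m;
}

For the example set in the question, the median is 92.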