Standard Deviation of Principal Components - pca

Is it possible that the standard deviation of one or more of the principal components obtained from the features is higher than that of any of the features?
For example:
If the standard deviations of my features feat1, feat2, feat3, feat4, feat5, feat6 are 0.019, 0.027, 0.026, 0.025, 0.026, 0.030, I have obtained the following standard deviations for the principal components:
PC1, PC2, PC3, PC4, PC5, PC6: 0.050, 0.020, 0.018, 0.016, 0.014, 0.012
As you can see, PC1 has a higher standard deviation than any of the features. Is this possible?

Is it possible that the standard deviation of one or more of the principal components obtained from the features is higher than that of any of the features?
Yes, and that is the very purpose of PCA: we look for a set of orthogonal axes along which the variance (and therefore the standard deviation) of the data set is maximized.
See the explanation here for more
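In fact, the standard deviation of PC1 can never be smaller than that of the most variable feature. A one-line argument (assuming, as usual for PCA, centered data x):

\operatorname{Var}(\mathrm{PC}_1) \;=\; \max_{\lVert w\rVert=1} \operatorname{Var}(w^{\top}x) \;\ge\; \operatorname{Var}(e_j^{\top}x) \;=\; \operatorname{Var}(\mathrm{feat}_j) \quad \text{for every feature } j,

because each coordinate direction e_j is itself a unit vector, and therefore one of the candidates the maximization runs over.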

Related

64/72 bit SECDED ECC

I would like to know whether parity/syndrome generation for 64/72-bit SEC-DED coding is standardized, or whether there is a de facto method in use. I am going through some papers, and they all seem to use different combinations to generate the check bits.
There's no standard but the de facto method used is described here:
https://www.xilinx.com/support/documentation/application_notes/xapp383.pdf
or here:
https://www.youtube.com/watch?v=ms-Lnm1wJ48
The latter explains how it maps to DRAM (which uses 64/72-bit coding), though once you understand the general concept you can easily adapt it to any number of bits.
A variety of different H-matrices are used. The original paper, which gives a method of constructing them, is I believe M. Y. Hsiao, "A Class of Optimal Minimum Odd-weight-column SEC-DED Codes": https://people.eecs.berkeley.edu/~culler/cs252-s02/papers/hsiao70.pdf
Different matrices will have slightly different probabilities of miscorrecting triple errors or detecting quadruple errors. See Table 2 of that paper.
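To make the mechanics concrete, here is a rough C++ sketch of how check-bit generation and syndrome computation look once you have an H-matrix; the row masks are placeholders, not the actual Hsiao (72,64) columns, so treat this as an outline rather than a drop-in implementation:

#include <bit>        // std::popcount (C++20)
#include <cstdint>

// rows[i] selects, as a 64-bit mask, the data bits that feed check bit i.
uint8_t make_check_bits(uint64_t data, const uint64_t rows[8]) {
    uint8_t check = 0;
    for (int i = 0; i < 8; ++i)
        check |= static_cast<uint8_t>((std::popcount(data & rows[i]) & 1) << i);
    return check;
}

// Zero syndrome: no error. Syndrome equal to a column of H: single-bit error at that
// position (correctable). Non-zero syndrome of even weight: uncorrectable double error
// (this is where the odd-weight-column property of Hsiao codes comes in).
uint8_t syndrome(uint64_t data, uint8_t stored_check, const uint64_t rows[8]) {
    return make_check_bits(data, rows) ^ stored_check;
}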

Type safe physics operations in C++

Does it make sense in C++ to define physics units as separate types and define valid operations between those types?
Is there any advantage in introducing a lot of types and a lot of operator overloading instead of using just plain floating point values to represent them?
Example:
class Time{...};
class Length{...};
class Speed{...};
...
Time operator""_s(long double val){...}
Length operator""_m(long double val){...}
...
Speed operator/(const Length&, const Time&){...}
Here, Time, Length, and Speed can be created only as return types from the various operators.
Does it make sense in C++ to define physics units as separate types and define valid operations between those types?
Absolutely. The standard <chrono> library already does this for time points and durations.
Is there any advantage in introducing a lot of types and a lot of operator overloading instead of using just plain floating point values to represent them?
Yes: you can use the type system to catch errors like adding a mass to a distance at compile time, without adding any runtime overhead.
If you don't feel like defining the types and operators yourself, Boost has a Units library for that.
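To make that concrete, here is a minimal sketch of what the classes from the question might look like (the names and member functions are just illustrative, not any particular library's API):

#include <iostream>

// Each quantity wraps a double in SI base units; construction is explicit so plain
// doubles don't silently convert.
class Time {
public:
    explicit Time(double s) : s_(s) {}
    double seconds() const { return s_; }
private:
    double s_;
};

class Length {
public:
    explicit Length(double m) : m_(m) {}
    double meters() const { return m_; }
private:
    double m_;
};

class Speed {
public:
    explicit Speed(double mps) : mps_(mps) {}
    double metersPerSecond() const { return mps_; }
private:
    double mps_;
};

Time   operator""_s(long double v) { return Time(static_cast<double>(v)); }
Length operator""_m(long double v) { return Length(static_cast<double>(v)); }

Speed operator/(const Length& d, const Time& t) { return Speed(d.meters() / t.seconds()); }

int main() {
    Speed v = 100.0_m / 9.58_s;        // fine: Length / Time yields Speed
    // Speed bad = 100.0_m / 50.0_m;   // compile-time error: no Length / Length overload
    std::cout << v.metersPerSecond() << " m/s\n";
}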
I would really recommend boost::units for this. It does all the conversions at compile time, and it also gives you a compile-time error if you try to use the wrong dimensions.
Pseudo-code example:
length l1, l2, l3;
area a1 = l1 * l2; // Compiles
area a2 = l1 * l2 * l3; // Compile time error, an area can't be the product of three lengths.
volume v1 = l1 * l2 * l3; // Compiles
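If I remember the Boost.Units API correctly, the non-pseudo version of the example above looks roughly like this (check the docs before copying, this is from memory):

#include <boost/units/quantity.hpp>
#include <boost/units/systems/si/length.hpp>
#include <boost/units/systems/si/area.hpp>
#include <boost/units/systems/si/io.hpp>
#include <iostream>

int main() {
    using namespace boost::units;
    using namespace boost::units::si;

    quantity<length> l1 = 2.0 * meter;
    quantity<length> l2 = 3.0 * meter;

    quantity<area> a1 = l1 * l2;            // compiles: length * length is an area
    // quantity<area> a2 = l1 * l2 * l2;    // compile-time error: that would be a volume

    std::cout << a1 << std::endl;           // prints "6 m^2"
}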
I've gone down this road. The advantages are all the usual, numerous advantages of type safety. The disadvantages I've run into:
You'll want to save off intermediate values in calculations, such as seconds squared. Having these values be a type is somewhat meaningless (seconds^2 obviously isn't a type in the way velocity is).
You'll want to do increasingly complex calculations, which will require more and more overloads/operator definitions to support.
At the end of the day, it's extremely clean for simple calculations and simple purposes. But when math gets complicated, it's hard to have a typed unit system play nice.
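One way around the "meaningless intermediate type" problem (sketched here hypothetically; it's essentially what Boost.Units does for you) is to encode the dimension exponents in the type, so seconds squared is just another well-formed instantiation and a single operator template covers every product:

// Dimensions as template parameters: mass, length, time exponents.
template <int M, int L, int T>
struct Quantity {
    double value;
};

// One multiplication template covers every combination; the exponents simply add up.
template <int M1, int L1, int T1, int M2, int L2, int T2>
Quantity<M1 + M2, L1 + L2, T1 + T2> operator*(Quantity<M1, L1, T1> a, Quantity<M2, L2, T2> b) {
    return { a.value * b.value };
}

using Time        = Quantity<0, 0, 1>;
using Length      = Quantity<0, 1, 0>;
using TimeSquared = Quantity<0, 0, 2>;   // the awkward intermediate is still a valid type

int main() {
    Time t{9.8};
    TimeSquared t2 = t * t;   // no dedicated overload needed
    (void)t2;
}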
Everyone has mentioned the type-safety guarantees as a plus. Another HUGE plus is the ability to abstract the concept (length) from the unit (meter).
For example, a common issue when dealing with units is mixing metric with imperial. When the concepts are abstracted as classes, this is no longer an issue:
Length width = Length::fromMeters(2.0);
Length height = Length::fromFeet(6.5);
Area area = width * height; //Area is computed correctly!
cout << "The total area is " << area.toInches() << " inches squared.";
The user of the class doesn't need to know what units the internal-representation uses... at least, as long as there are no severe rounding issues.
I really wish more trigonometry libraries did this with angles, because I always have to look up whether they're expecting degrees or radians...
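For completeness, a minimal sketch of what Length/Area classes like the ones above might look like, storing meters internally and converting on the way in and out (1 ft = 0.3048 m, 1 in = 0.0254 m; the class and method names just mirror the snippet above):

class Length {
public:
    static Length fromMeters(double m) { return Length(m); }
    static Length fromFeet(double ft)  { return Length(ft * 0.3048); }
    double toMeters() const { return m_; }
    double toInches() const { return m_ / 0.0254; }
private:
    explicit Length(double m) : m_(m) {}
    double m_;   // internal representation: meters
};

class Area {
public:
    double toInches() const { return m2_ / (0.0254 * 0.0254); }   // square inches
private:
    friend Area operator*(const Length&, const Length&);
    explicit Area(double m2) : m2_(m2) {}
    double m2_;  // internal representation: square meters
};

inline Area operator*(const Length& a, const Length& b) {
    return Area(a.toMeters() * b.toMeters());
}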
For those looking for a powerful compile-time type-safe unit library, but are hesitant about dragging in a boost dependency, check out units. The library is implemented as a single .h file with no dependencies, and comes with a project to build unit tests/documentation. It's tested with msvc2013, 2015, and gcc-4.9.2, and should work with later versions of those compilers as well.
Full Disclosure: I'm the author of the library
Yes, it makes sense. Not only in physics, but in any discipline. In finance, for example, interest rates are in units of inverse time intervals (typically expressed per year). Money has many different units. Converting between them can only be done with a cross-rate, which has dimensions of one currency divided by another. Interest payments, dividend payments, principal payments, etc. ordinarily occur at some frequency.
It can prevent multiplying two values and ending up with an illegal value. It can prevent summing dollars and euros, etc.
I'm not saying you're wrong to do so, but we've gone overboard with that on the project I'm working on and frankly I doubt its benefits outweigh its hassle. Particularly if you're on a team, good variable naming (just spell the darn things out), code reviews, and unit testing will prevent any problems. On the other hand, if you can use Boost, units might be something to check into (I haven't).
To check for type safety, you can use a dedicated library.
The most widely used is boost::units; it works perfectly, with no execution-time overhead and a lot of features. In theory this library solves your problem. From a more practical point of view, the interface is so awkward and badly documented that you may have problems. Moreover, the compilation time increases drastically with the number of dimensions, so before adopting it, check that a large project still compiles in a reasonable time.
doc : http://www.boost.org/doc/libs/1_56_0/doc/html/boost_units.html
An alternative is unit_lite. It has fewer features than the Boost library, but compilation is faster, the interface is simpler, and the error messages are readable. This library requires C++11.
code : https://github.com/pierreblavy2/unit_lite
The link to the doc is in the github description (I'm not allowed to post more than 2 links here !!!).
I gave a tutorial presentation at CppCon 2015 on the Boost.Units library. It's a powerful library that every scientific application should be using. But it's hard to use due to poor documentation. Hopefully my tutorial helps with this. You can find the slides/code here:
If you want a very lightweight, header-only library based on C++20, you can use TU (Typesafe Units). It supports manipulation of all SI units (*, /, +, -, sqrt, pow to arbitrary floating-point exponents, and unary operations on scalar units). It supports units with floating-point dimensions, such as length^(d) where d is a decimal number. Moreover, it is simple to define your own units. It also comes with a test suite...
...and yes, I'm the author of this library.

Is there a way to generate a random variate from a non-standard distribution without computing CDF?

I'm trying to write a Monte Carlo simulation. In my simulation I need to generate many random variates from a discrete probability distribution.
I do have a closed-form solution for the distribution, and it has finite support; however, it is not a standard distribution. I am aware that I could draw a uniform [0,1) random variate and compare it to the CDF to get a random variate from my distribution, but the parameters of the distribution are always changing, so using this method is too slow.
So I guess my question has two parts:
Is there a method/algorithm to quickly generate finite, discrete random variates without using the CDF?
Is there a Python module and/or a C++ library which already has this functionality?
Acceptance/rejection:
Find a function that is always higher than the pdf. Generate two random variates: scale the first one to pick a candidate value, and use the second to decide whether to accept or reject that candidate. Rinse and repeat until you accept a value.
Sorry I can't be more specific, but I haven't done it for a while.
It's a standard algorithm, but I'd personally implement it from scratch, so I'm not aware of any existing implementations.
Indeed, acceptance/rejection is the way to go if you know your pdf analytically. Let's call it f(x). Find a pdf g(x) such that there exists a constant c with c·g(x) ≥ f(x) everywhere, and such that you know how to simulate a variable with pdf g(x). For example, since you work with a distribution with finite support, a uniform will do: g(x) = 1/(size of your domain) over the domain.
Then draw a pair (G, U) such that G is simulated with pdf g(x) and U is uniform on [0, c·g(G)]. If U < f(G), accept G as your variable; otherwise draw again. The G you finally accept will have f as its pdf.
Note that the constant c determines the efficiency of the method: the smaller c is, the more efficient you will be, since on average you will need c draws to get one accepted variable. So choose a g that is simple enough to sample from (don't forget you need to draw variables using g as a pdf) but that gives the smallest possible c.
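A minimal C++ sketch of that recipe for the discrete, finite-support case in the question (f is your pmf on {0, ..., n-1}; with the uniform proposal g(k) = 1/n you need c >= n * max_k f(k); all names here are illustrative):

#include <random>

template <class Pmf>
int sample_rejection(const Pmf& f, int n, double c, std::mt19937& rng) {
    std::uniform_int_distribution<int> propose(0, n - 1);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    for (;;) {
        int g = propose(rng);              // G drawn from the proposal g(k) = 1/n
        double u = unif(rng) * (c / n);    // U uniform on [0, c*g(G)), which here is [0, c/n)
        if (u < f(g)) return g;            // accept G; on average this loops c times
    }
}

Because the pmf is only evaluated pointwise, nothing has to be rebuilt when the parameters change, which is exactly the situation described in the question.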
If acceptance/rejection is also too inefficient, you could try a Markov chain Monte Carlo (MCMC) method. These generate a sequence of samples, each one dependent on the previous one, so by keeping only every k-th sample you can obtain a more or less independent set. They only need the pdf, or even just a multiple of it. They are usually used with fixed distributions, but they can also be adapted to slowly changing ones.
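For reference, one step of such a chain with a uniform (hence symmetric) proposal can be sketched as follows; f only needs to be known up to a constant factor, and the names are illustrative:

#include <random>

// Accept the proposed state y with probability min(1, f(y) / f(x)); otherwise stay at x.
template <class Pmf>
int metropolis_step(int x, const Pmf& f, int n, std::mt19937& rng) {
    std::uniform_int_distribution<int> propose(0, n - 1);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    int y = propose(rng);
    if (unif(rng) < f(y) / f(x)) return y;
    return x;
}

Consecutive states are correlated, which is why the thinning (keeping only every k-th sample) mentioned above is needed.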

Random numbers from Beta distribution, C++

I've written a simulation in C++ that generates (1,000,000)^2 numbers from a specific probability distribution and then does something with them. So far I've used Exponential, Normal, Gamma, Uniform and Poisson distributions. Here is the code for one of them:
#include <boost/random.hpp>
...main...
srand(time(NULL));
seed = rand();
boost::random::mt19937 igen(seed);
boost::random::variate_generator<boost::random::mt19937, boost::random::normal_distribution<> >
    norm_dist(igen, boost::random::normal_distribution<>(mu, sigma));
Now I need to run it for the Beta distribution. All of the distributions I've done so far took 10-15 hours. The Beta distribution is not in the boost/random package so I had to use the boost/math/distributions package. I found this page on StackOverflow which proposed a solution. Here it is (copy-pasted):
#include <boost/math/distributions.hpp>
using namespace boost::math;
double alpha, beta, randFromUnif;
//parameters and the random value on (0,1) you drew
beta_distribution<> dist(alpha, beta);
double randFromDist = quantile(dist, randFromUnif);
I replicated it and it worked. The run-time estimates for my simulation are linear and accurately predictable, and they suggest that this will run for 25 days. I see two possibilities:
1. the method proposed is inferior to the one I was using previously for other distributions
2. the Beta distribution is just much harder to generate random numbers from
Bear in mind that I have a below-minimal understanding of C++ coding, so the questions I'm asking may be silly. I can't wait a month for this simulation to complete, so is there anything I can do to improve it? Perhaps use the initial method I was using and modify it to work with the boost/math/distributions package? I don't even know if that's possible.
Another piece of information that may be useful is that the parameters are the same for all (1,000,000)^2 of the numbers that I need to generate. I'm saying this because the Beta distribution does have a nasty PDF and perhaps the knowledge that the parameters are fixed can somehow be used to simplify the process? Just a random guess.
The beta distribution is related to the gamma distribution. Let X be a random number drawn from Gamma(α,1) and Y from Gamma(β,1), where the first argument to the gamma distribution is the shape parameter. Then Z=X/(X+Y) has distribution Beta(α,β). With this transformation, it should only take twice as much time as your gamma distribution test.
Note: The above assumes the most common representation of the gamma distribution, Gamma(shape,scale). Be aware that different implementations of the gamma distribution random generator will vary with the meaning and order of the arguments.
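In code, following the Boost.Random style already used in the question (from memory, so double-check the header and constructor arguments; the second argument is the scale and defaults to 1):

#include <boost/random.hpp>

// X ~ Gamma(alpha, 1), Y ~ Gamma(beta, 1), then Z = X / (X + Y) ~ Beta(alpha, beta).
double draw_beta(double alpha, double beta, boost::random::mt19937& igen) {
    boost::random::gamma_distribution<> gamma_a(alpha, 1.0);   // (shape, scale)
    boost::random::gamma_distribution<> gamma_b(beta, 1.0);
    double x = gamma_a(igen);
    double y = gamma_b(igen);
    return x / (x + y);
}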
If you want a distribution that is very Beta-like, but has a very simple closed-form inverse CDF, it's worth considering the Kumaraswamy distribution:
http://en.wikipedia.org/wiki/Kumaraswamy_distribution
It's used as an alternative to the Beta distribution when a large number of random samples are required quickly.
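Its CDF is F(x) = 1 - (1 - x^a)^b, so inverting it takes one line; given a uniform draw u on (0,1), something like:

#include <cmath>

// Kumaraswamy(a, b) sample via the closed-form inverse CDF:
// F^{-1}(u) = (1 - (1 - u)^(1/b))^(1/a).
double kumaraswamy_from_uniform(double a, double b, double u) {
    return std::pow(1.0 - std::pow(1.0 - u, 1.0 / b), 1.0 / a);
}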
Try compiling with optimization. Using the -O3 flag will usually speed things up. See this post on optimisation flags or this overview for slightly more detail.

is there a function in C or C++ to do "saturation" on an integer

I am doing some 3D graphics and I have an open ocean. For this ocean, I have a matrix representing the sea state (i.e. wave heights) for a particular rectangular subsection of the sea. The rest of the ocean is flat. My problem is that my controlled sea, where there are waves, is positioned in the middle of open flat sea, and the discontinuity at the edges of my grid causes some bad artifacts. The reason I am only generating waves for a subsection and not the entire sea is because my noise function is prohibitively expensive to compute on the entire sea (and I know the easiest solution is to use a cheaper noise function like simplex noise, but that's not an option).
Having said that my question is really rather simple. If say I have a grid (aka matrix aka 2d array) of size 100x40, and I want to find the value for position 120x33, I simply want to take the nearest neighbour, which would be 100x33. So for any number that lies outside a given range, I want that number to saturate to lie within the given range. Is there a function in C or C++ that does this?
Edit: the position parameters are of type float
I know I can do this with some simple if statements, but it just seems like something that the standard libraries would include.
And now there is, in the form of std::clamp (C++17). And I'm barely seven years late :)
#include <algorithm>   // for std::min and std::max (std::clamp lives here too)

template<typename T>
T saturate(T val, T min, T max) {
    return std::min(std::max(val, min), max);
}
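Applied to the grid from the question (float positions, 100x40 grid), usage might look like the following; adjust the bounds to your indexing convention (with 0-based indices the upper bounds would be 99 and 39):

#include <algorithm>   // std::clamp (C++17)

float clamp_x(float px) { return std::clamp(px, 0.0f, 100.0f); }   // 120 -> 100, as in the question
float clamp_y(float py) { return std::clamp(py, 0.0f, 40.0f); }    // 33 stays 33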
ISO/IEC JTC1 SC22 WG14 N1169 (Programming languages - C - Extensions to support embedded processors) specifies a _Sat type qualifier for saturating data types. I have never tried to use it in any compiler, but it is included in the GCC 4.x documentation.
VC++ 2003 onward supports MMX intrinsics that allow saturating arithmetic.
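For illustration, the SSE2 versions of those intrinsics (the original MMX forms live in <mmintrin.h>) give you packed saturating arithmetic like this:

#include <emmintrin.h>   // SSE2
#include <cstdint>

// Saturating add of eight signed 16-bit lanes (the PADDSW instruction):
// 30000 + 30000 saturates to 32767 instead of wrapping around.
int16_t saturating_add_demo() {
    __m128i a = _mm_set1_epi16(30000);
    __m128i b = _mm_set1_epi16(30000);
    __m128i sum = _mm_adds_epi16(a, b);
    return static_cast<int16_t>(_mm_extract_epi16(sum, 0));   // 32767
}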
Min and max, no?
x = std::min(100.0f, x);   // and std::max in the same way for the lower edge
y = std::min(40.0f, y);
Or if you like complicated defines:
#define DERIVE_MAX (100)
//Saturate X at +/-DERIVE_MAX
#define SAT(x) ( ((x) > DERIVE_MAX) ? DERIVE_MAX : ( (-(x) > DERIVE_MAX) ? (-DERIVE_MAX) : (x) ) )