Principal Component Analysis being too slow (MLPY Python) - pca

I am using the PCAFast method from the MLPY API in python (http://mlpy.sourceforge.net/docs/3.2/dim_red.html)
The method is executed pretty fast when it learns a feature matrix generated as follows:
x = np.random.rand(100, 100)
Sample output of this command is:
[[ 0.5488135 0.71518937 0.60276338 ..., 0.02010755 0.82894003
0.00469548]
[ 0.67781654 0.27000797 0.73519402 ..., 0.25435648 0.05802916
0.43441663]
[ 0.31179588 0.69634349 0.37775184 ..., 0.86219152 0.97291949
0.96083466]
...,
[ 0.89111234 0.26867428 0.84028499 ..., 0.5736796 0.73729114
0.22519844]
[ 0.26969792 0.73882539 0.80714479 ..., 0.94836806 0.88130699
0.1419334 ]
[ 0.88498232 0.19701397 0.56861333 ..., 0.75842952 0.02378743
0.81357508]]
However when the feature matrix x consists of data such as the following:
x = 7.55302582e-05*np.ones((n, d[i]))
Sample output:
[[ 7.55302582e-05 7.55302582e-05 7.55302582e-05 ..., 7.55302582e-05
7.55302582e-05 7.55302582e-05]
[ 7.55302582e-05 7.55302582e-05 7.55302582e-05 ..., 7.55302582e-05
7.55302582e-05 7.55302582e-05]
[ 7.55302582e-05 7.55302582e-05 7.55302582e-05 ..., 7.55302582e-05
7.55302582e-05 7.55302582e-05]
...,
[ 7.55302582e-05 7.55302582e-05 7.55302582e-05 ..., 7.55302582e-05
7.55302582e-05 7.55302582e-05]
[ 7.55302582e-05 7.55302582e-05 7.55302582e-05 ..., 7.55302582e-05
7.55302582e-05 7.55302582e-05]
[ 7.55302582e-05 7.55302582e-05 7.55302582e-05 ..., 7.55302582e-05
7.55302582e-05 7.55302582e-05]]
The method becomes very, very slow. Why does this happen? Does it have something to do with the type of the data stored in the x feature matrix?
Any ideas on how to solve this?

This is a terrible (poorly conditioned) matrix to run principal component analysis on. It has an eigenvalue of zero (which by itself may be problematic), and the remaining eigenvalues equal to 1 (you can subtract rows from one another to get a degenerate matrix). The library may be using a naive eigensystem solver that relies on the matrix being reasonably regular (all eigenvalues distinct and sufficiently well separated from zero and from each other).

I am not familiar with the method, but my feeling, based on the "fast fixed point" name, is that it relies on matrix powers blowing up the leading eigenvalue: if A = Σ_k λ_k u_k u_k' for the appropriate orthogonal vectors u_k, and λ_1 > λ_2 > … > λ_p > 0, then A^n ≈ λ_1^n u_1 u_1' for a sufficiently large power n. This idea simply does not work when you feed an array of identical values as input: you just keep getting a top eigenvalue of one that does not separate from the others. Worse, for the specific matrix that you feed in, (7·10^-5)^20 already gets close to the limit of what can be represented as a double-precision number, so in the end the output may be complete garbage. There are computational linear algebra methods that are much more numerically stable and reliable. The decision to implement one method or another is the developer's judgment call; besides speed, one also needs to think about how robust the method is. No offense, but I would take a stable slow method over a quick and dirty one on most occasions.
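To see the degeneracy concretely, here is a small numpy-only sketch (it deliberately avoids mlpy's PCAFast and just inspects the covariance spectrum that any PCA has to work with); the shapes and the constant value are taken from the question:

import numpy as np

rng = np.random.RandomState(0)
x_random = rng.rand(100, 100)
x_constant = 7.55302582e-05 * np.ones((100, 100))

for name, x in [("random", x_random), ("constant", x_constant)]:
    # eigenvalues of the covariance of the mean-centred data: this is what PCA extracts
    eigvals = np.linalg.eigvalsh(np.cov(x, rowvar=False))
    print(name, "largest:", eigvals[-1], "smallest:", eigvals[0])

For the random matrix the spectrum is well spread out; for the constant matrix every column has zero variance, so the covariance is all zeros and there is nothing for an iterative eigensolver to converge on.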

Related

Displaying numpy matrices with fewer digits past the decimal

If I type M at my IDLE cursor to see the values in my matrix M, I get:
matrix([[ 1.65930000e+03, -2.34000000e+01, 1.50000000e+01,
0.00000000e+00],
[ 3.30000000e+00, 1.68600000e+03, -2.17000000e+01,
0.00000000e+00],
[ -1.70000000e+00, 5.00000000e+00, 1.69440000e+03,
0.00000000e+00],
[ -6.18000000e+01, 7.02000000e+01, -4.18000000e+01,
1.00000000e+00]])
Using the print statement from this answer only works if I convert the numpy matrix to an array first:
print(np.array_str(np.array(M), precision=2))
[[ 1.66e+03 -2.34e+01 1.50e+01 0.00e+00]
[ 3.30e+00 1.69e+03 -2.17e+01 0.00e+00]
[ -1.70e+00 5.00e+00 1.69e+03 0.00e+00]
[ -6.18e+01 7.02e+01 -4.18e+01 1.00e+00]]
This is helpful, but it is a lot of typing to do when I want to debug. Is there a quicker way to reduce the precision when inspecting during debugging?
I tried this also, but it's much worse. I like that the scientific notation is removed, but the precision has expanded.
print(np.array_str(M.astype(np.ndarray), precision=2))
[ matrix([[1659.2999999999988, -23.399999999999995, 14.999999999999995, 0.0]], dtype=object)]
[ matrix([[3.3, 1686.0000000000002, -21.700000000000003, 0.0]], dtype=object)]
[ matrix([[-1.699999999999999, 5.000000000000001, 1694.4, 0.0]], dtype=object)]
[ matrix([[-61.799999999998704, 70.20000000000171, -41.799999999998306,
1.0000000000000002]], dtype=object)]]
NumPy comes with helper functions such as set_printoptions so you can use
numpy.set_printoptions(precision=x)
to set the displayed precision.
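For example, a minimal sketch (the matrix values are rounded versions of the one in the question; suppress=True is optional and additionally turns off scientific notation for small numbers):

import numpy as np

M = np.matrix([[ 1659.3,  -23.4,   15.0, 0.0],
               [    3.3, 1686.0,  -21.7, 0.0],
               [   -1.7,    5.0, 1694.4, 0.0],
               [  -61.8,   70.2,  -41.8, 1.0]])

np.set_printoptions(precision=2, suppress=True)
print(M)   # now displays with at most 2 digits after the decimal point

If you only want the change temporarily, newer NumPy versions also provide numpy.printoptions as a context manager, e.g. with np.printoptions(precision=2): print(M).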

Netlogo comparing list of global variables to numbers

I apologise in advance for how simple the answer probably is to this question, I am very new to netlogo and very out of my depth.
I am trying to read a water-temperature from a file and consequently get my turtles to die/breed depending on the temperature. I have finally got the file to read and set water-temperature as a global variable, but I am now stuck on the comparison part. It won't let me compare the variable to a number because I think the variable is a list. The following error message comes up:
The > operator can only be used on two numbers, two strings, or two agents of the same type, but not on a list and a number.
error while turtle 7 running >
called by procedure REPRODUCE
called by procedure GO
called by Button 'go'
Code is below:
globals [
  year
  month
  water-temperature
]

extensions [ csv ]

to setup
  ca
  load-data
  create-turtles 50 [
    set size 1
    set color red
    setxy random-xcor random-ycor
  ]
  reset-ticks
end

to go
  ask turtles [
    move
    reproduce
  ]
  run-temperature
end

to load-data
  file-close-all
  file-open "C:\\Users\\Hannah\\Documents\\Summer research project\\test3.csv"
end

to run-temperature
  file-close-all
  file-open "C:\\Users\\Hannah\\Documents\\Summer research project\\test3.csv"
  while [ not file-at-end? ] [
    set water-temperature csv:from-row file-read-line
    tick
  ]
  file-close
end

to move
  rt random 50
  lt random 50
  fd 1
end

to reproduce
  if water-temperature > 35 [ die ]
  if water-temperature > 30 and water-temperature < 34 [ hatch 1 rt random-float 360 fd 1 ]
  if water-temperature > 25 and water-temperature < 29 [ hatch 2 rt random-float 360 fd 1 ]
  if water-temperature > 20 and water-temperature < 24 [ hatch 3 rt random-float 360 fd 1 ]
end
I would be so grateful for any help!
Thanks :)
Hannah
Welcome to Stack Overflow. Can you please provide an example of the first few lines of your "test3.csv" file? That will help get your question sorted out: if you have a header or multiple columns, that could be causing your problems, since multiple columns might be getting read in as a list. As well, I think you want file-read instead of file-read-line.
A few other things: your load-data procedure is unnecessary as far as I can tell (you only need the loading to occur in run-temperature).
More importantly, your code right now says something like: "All turtles, move and reproduce. Now, read the whole temperature file line by line." The problem is that your while statement is saying "until you have reached the end of the file, read a line, tick, and move to the next line." So your model will tick once per line, without the turtles ever doing anything in between; it is probably simpler to just have your tick at the very end of your go procedure, and to avoid while in go altogether, since it loops until its condition is satisfied.
It might be easier to just read your whole test.csv and store it in a variable for easier access- here is one example. Using this setup:
globals [
  water-temperature
  water-temperature-list
]

to setup
  ca
  crt 50 [
    setxy random-xcor random-ycor
  ]
First, tell Netlogo water-temperature-list is a list using set and []. Then, do the same file close/open as before to prep your file for reading. Then, use a similar while loop to read your temperatures into water-temperature-list, using lput:
  set water-temperature-list []
  file-close-all
  file-open "test3.csv"
  while [ not file-at-end? ] [
    set water-temperature-list lput file-read water-temperature-list
  ]
  file-close-all
  reset-ticks
end
Now your model can access those values more simply, since they are stored in a model variable directly. You can use the ticks value with item as an index into that list: on tick 0 the first element in the list will be accessed, on tick 1 the second element, and so on. For example:
to go
  set water-temperature item ticks water-temperature-list
  ask turtles [
    if water-temperature > 30 [
      die
    ]
    if water-temperature <= 30 [
      rt random 60
      fd 1
    ]
  ]
  tick
end
Note that with this setup, once you get to the end of your temperatures there will be an error telling you that NetLogo can't find the next list element; you'll have to put a stop condition somewhere (for example, stopping when ticks reaches the length of water-temperature-list) to prevent that.
I know this is an alternative to your approach, but I hope it's helpful. For another similar but more complicated example, check out this model by Uri Wilensky.

How to plot gaussian-like histogram (definitely NOT a gaussian over a histogram)?

For an assignment I'm asked to plot, in front of a histogram, the histogram of a gaussian distribution. The question is simple, as I already know how to plot the gaussian on top of histograms (and also how to plot histograms).
I already have this:
#bins=array that indicates the boundaries of the bars
#array is an object with attributes such as; standard deviation (stdv); mean or average (mean).
In [63]: 1/(array.stdv*np.sqrt(2*np.pi))*np.exp(-(bins-array.mean)**2/(2*array.stdv**2))
Out[63]:
array([ 1.46863468e-03, 1.71347711e-03, 1.99065837e-03, ...,
5.37698408e-15, 3.25989619e-15, 1.96798911e-15])
In [64]: bins
Out[64]: array([ 33., 34., 35., ..., 187., 188., 189.])
When I try to plot with this, I get:
In [59]: plt.hist(1/(array.stdv*np.sqrt(2*np.pi))*np.exp(-(bins-array.mean)**2/(2*array.stdv**2)),bins,range=(min(histmain.array),max(histmain.array)),histtype='step')
Out[59]:
(array([ 0., 0., 0., ..., 0., 0., 0.]),
array([ 33., 34., 35., ..., 187., 188., 189.]),
<a list of 1 Patch objects>)
[Figure][1]
I think I already know the problem, though. I guess the hist function receives an array of data, and it plots according to the frequencies. What I have is an array of the frequencies. I tried multiplying the function by 3200, but what I got was this atrocity:
In [61]: plt.hist(3200/(array.stdv*np.sqrt(2*np.pi))*np.exp(-(bins-array.mean)**2/(2*array.stdv**2)),bins,range=(min(histmain.array),max(histmain.array)))
Out[61]:
(array([ 1., 1., 0., ..., 0., 0., 0.]),
array([ 33., 34., 35., ..., 187., 188., 189.]),
<a list of 156 Patch objects>)
[Figure][2]
All of the above is without the histogram of the data shown.
So sorry. I didn't notice the images weren't displaying. Here's a link to them:
Figure 1 is the flat one. Figure 2 is the horrendous one. What I want to accomplish is a gaussian-like histogram.
Ok, I figured it out myself. This is the algorithm I used to 'transform' a gaussian distribution into a histogram.
Of course it has yet to be optimized, since I just came up with the idea after hours of trial and error.
Let bins and gauss be arrays with the bin distribution and the probability distribution of the left side of each bin respectively.
First, I created a list with the average probability of each bin:
avrgeprob = []
for i in range(len(gauss)):
    try:
        avrgeprob.append((gauss[i]+gauss[i+1])/2)
    except IndexError:
        avrgeprob.append(gauss[i])
Second, I created another list with the expected frequency of each bin (the array that I used to test this had 3200 values in it)
expfreq = []
for x in avrgeprob:
    expfreq.append(3200*x)
You can see that
In [121]: sum(expfreq)
Out[121]: 3173.5316995750118
so it's pretty close. Third, I create one last list containing the lower value of each bin repeated as many times as expected:
fakelist = []
for i in range(len(bins)):
    if int(expfreq[i]) != 0:
        fakelist += [bins[i]]*int(expfreq[i])
    else:
        fakelist.append(0)
After all that, I let the magic happen:
In [122]: plt.hist(fakelist,bins=bins,histtype=u'step')
Out[122]:
http://imgur.com/a/uADK4
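For completeness, here is a self-contained sketch of the whole procedure above, using a synthetic sample of 3200 normally distributed values and unit-width bins (the sample, its mean/standard deviation, and the bin range are assumptions chosen to mimic the question; the steps are the ones described above):

import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(loc=110, scale=25, size=3200)   # hypothetical sample
bins = np.arange(33, 190)                                # unit-width bins, as in the question

# the data histogram, drawn first so the gaussian-like histogram sits in front of it
plt.hist(data, bins=bins, histtype='step', label='data')

# gaussian probability density at the left edge of each bin
gauss = 1/(data.std()*np.sqrt(2*np.pi))*np.exp(-(bins-data.mean())**2/(2*data.std()**2))

# 1) average probability of each bin
avrgeprob = []
for i in range(len(gauss)):
    try:
        avrgeprob.append((gauss[i]+gauss[i+1])/2)
    except IndexError:
        avrgeprob.append(gauss[i])

# 2) expected frequency of each bin (density times sample size; bin width is 1 here)
expfreq = [3200*x for x in avrgeprob]

# 3) repeat each bin's left edge as many times as expected, then histogram that list
fakelist = []
for i in range(len(bins)):
    fakelist += [bins[i]]*int(expfreq[i])

plt.hist(fakelist, bins=bins, histtype='step', label='gaussian-like')
plt.legend()
plt.show()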

Replicate IDL 'smooth' in Python 2.7

I have been trying to work out how to replicate IDL's smooth function in Python and I just can't get anything like the same results. (Disclaimer: It is probably 10 years since I touched this kind of mathematical problem so it has been dumped to make way for information like where to find the cheapest local fuel). I am trying to code this:
smooth(b,w,/nan)
where b is a 2D float array containing NANs (zeros - missing data - have also been converted to NAN).
From the IDL documents, it appears smooth uses a boxcar, so from scipy.ndimage.filters I have tried:
bsmooth = uniform_filter(b, w)
I am aware that there are some fundamental differences here:
- The default edge behaviour in IDL is "the end points are copied from the original array to the result with no smoothing", whereas I don't seem to have the option to do this with the uniform filter.
- Treatment of the NaN elements. In IDL, the /nan keyword seems to mean that, where possible, the NaN values will be filled by the result of the other points in the window; if there are no valid points to generate a result, the value is set by the MISSING keyword. I thought I could approximate this behaviour after the smoothing using scipy.interpolate's NearestNDInterpolator (thanks to the brilliant explanation by Alex on here: filling gaps on an image using numpy and scipy).
Here is my test array:
>>> b
array([[ 0.97599638,  0.93114936,  0.87070072,  0.5379253 ],
       [ 0.34873217,         nan,  0.40985891,  0.22407863],
       [        nan,         nan,         nan,  0.67532134],
       [        nan,         nan,  0.85441768,         nan]])
My answers bore not the SLIGHTEST resemblance to IDL, whether I use the /nan keyword or not.
IDL> smooth(b,2,/nan)
0.97599638 0.93114936 0.87070072 0.53792530
0.34873217 0.70728749 0.60817236 0.22407863
NaN 0.53766960 0.54091913 0.67532134
NaN NaN 0.85441768 NaN
IDL> smooth(b,2)
0.97599638 0.93114936 0.87070072 0.53792530
0.34873217 -NaN -NaN 0.22407863
-NaN -NaN -NaN 0.67532134
-NaN -NaN 0.85441768 NaN
I confess I find the scipy documentation rather sparse on detail, so I have no idea if I am really doing what I think I am doing. The fact that the two python approaches which I believed would both smooth the image give different answers suggests that things are not what I understood them to be.
>>> uniform_filter(b, 2)
array([[ 0.97599638,  0.95357287,  0.90092504,  0.70431301],
       [ 0.66236428,         nan,         nan,         nan],
       [        nan,         nan,         nan,         nan],
       [        nan,         nan,         nan,         nan]])
I thought it was a bit odd that it was so empty, so I tried this with an array of 100 elements (still using a window of 2) and output the images. The results (first image is 'b', second is 'bsmooth') are not quite what I was hoping for.
Going back to the smaller array and following the examples in: http://scipy.github.io/old-wiki/pages/Cookbook/SignalSmooth which I thought would give the same output as uniform_filter, I tried:
>>> box = np.array([1,1,1,1])
>>> box = box.reshape(2,2)
>>> box
array([[1, 1],
[1, 1]])
>>> bsmooth = scipy.signal.convolve2d(b,box,mode='same')
>>> print bsmooth
[[ 0.97599638 1.90714574 1.80185008 1.40862602]
[ 1.32472855 nan nan 2.04256356]
[ nan nan nan nan]
[ nan nan nan nan]]
Obviously I have completely misunderstood the scipy functions, maybe even the IDL one. If anyone can help me to replicate the IDL smooth function as closely as possible, I would be extremely grateful. I am under considerable time pressure to get a solution for this that doesn't rely on IDL and I am tossing a coin to decide whether to code the function from scratch or develop a very contagious illness.
How can I perform the same smoothing in python?
First: please use matplotlib.pyplot.imshow with interpolation="none"; that's nicer to look at, and maybe use a greyscale colormap.
So for your example: there is actually no convolution (filter) within scipy or numpy that treats NaN as missing values (they propagate NaNs through the convolution). At least I've found none so far, and your boundary treatment is also (to my knowledge) not implemented. But the boundary could just be replaced afterwards.
If you want to do convolution with NaNs you can, for example, use astropy.convolution.convolve. There, NaNs are interpolated using the kernel of your filter. But their convolution has some drawbacks as well: border handling like you want isn't implemented there either, your kernel must have an odd shape, and the sum of your kernel must not be zero (or very close to it).
For example:
from astropy.convolution import convolve
import numpy as np
array = np.random.uniform(10,100, (4,4))
array[1,1] = np.nan
kernel = np.ones((3,3))
convolve(array, kernel)
As an example, an initial array of
array([[ 97.19514587, 62.36979751, 93.54811286, 30.23567842],
[ 51.02184613, nan, 46.14769821, 60.08088041],
[ 20.86482452, 42.39661484, 36.96961278, 96.89180175],
[ 45.54453509, 76.61274347, 46.44485141, 25.40985372]])
will become:
array([[ 266.9009961 , 406.59680717, 348.69637399, 230.01236989],
[ 330.16243546, 506.82785931, 524.95440336, 363.87378443],
[ 292.75477064, 422.31693304, 487.26826319, 311.94469828],
[ 185.41871792, 268.83318211, 324.72547798, 205.71611967]])
if you want to "normalize" it, astropy offers the normalize_kernel parameter:
convolved = convolve(array, kernel, normalize_kernel=True)
array([[ 29.58753936, 42.09982189, 49.31793529, 33.00203873],
[ 49.87040638, 65.67695002, 66.10447436, 40.44026448],
[ 52.51126383, 63.03914444, 60.85474739, 35.88011742],
[ 39.40188443, 46.82350749, 40.1380926 , 22.46090152]])
If you want to replace the "edge" values with the ones from the original array just replace them:
convolved[0,:] = array[0,:]
convolved[-1,:] = array[-1,:]
convolved[:,0] = array[:,0]
convolved[:,-1] = array[:,-1]
So that's what the existing packages offer (as far as I know). If you want to learn a bit of Cython or numba you can easily write your own convolution that is not much slower (only a factor of 2-10) than the numpy/scipy ones but does EXACTLY what you want, without messing around.
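As a rough illustration of that suggestion, here is a minimal numba sketch of a NaN-ignoring boxcar mean; the function name, the odd-window assumption and the copy-the-edges behaviour are my own choices for illustration, not a drop-in replacement for IDL's smooth:

import numpy as np
from numba import njit

@njit
def nanboxcar(a, w):
    # w is assumed odd; edge pixels are simply copied from the input
    r, c = a.shape
    h = w // 2
    out = a.copy()
    for i in range(h, r - h):
        for j in range(h, c - h):
            s = 0.0
            n = 0
            for di in range(-h, h + 1):
                for dj in range(-h, h + 1):
                    v = a[i + di, j + dj]
                    if not np.isnan(v):
                        s += v
                        n += 1
            out[i, j] = s / n if n > 0 else np.nan
    return out

Calling nanboxcar(b, 3) then behaves roughly like the "replace" option discussed below: NaNs inside a window are ignored, and a window with no finite values stays NaN.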
Since this is not something that is available in the python packages and because I saw the question asked several times during my research without satisfactory answers, here is how I solved the issue.
Provided is a test version of my function that I'm off to tidy up. I am sure there will be better ways to do the things I have done as I'm still fairly new to Python - please do recommend any appropriate changes.
Plots use autumn colourmap just because it allowed me to see the NaNs clearly.
My results:
IDL propagate
0.033369284 0.067915268 0.96602046 0.85623550
0.30435592 NaN NaN 100.00000
0.94065958 NaN NaN 0.90966976
0.018516513 0.044460904 0.051047217 NaN
python propagate
[[ 3.33692829e-02 6.79152655e-02 9.66020487e-01 8.56235492e-01]
[ 3.04355923e-01 nan nan 1.00000000e+02]
[ 9.40659566e-01 nan nan 9.09669768e-01]
[ 1.85165123e-02 4.44609040e-02 5.10472165e-02 nan]]
IDL replace
0.033369284 0.067915268 0.96602046 0.85623550
0.30435592 0.47452110 14.829881 100.00000
0.94065958 0.33833817 17.002417 0.90966976
0.018516513 0.044460904 0.051047217 NaN
python replace
[[ 3.33692829e-02 6.79152655e-02 9.66020487e-01 8.56235492e-01]
[ 3.04355923e-01 4.74521092e-01 1.48298812e+01 1.00000000e+02]
[ 9.40659566e-01 3.38338177e-01 1.70024175e+01 9.09669768e-01]
[ 1.85165123e-02 4.44609040e-02 5.10472165e-02 nan]]
My function:
#!/usr/bin/env python
# smooth.py
__version__ = 0.1
# Version 0.1 29 Feb 2016 ELH Test release

import numpy as np
import matplotlib.pyplot as mp

def Smooth(v1, w, nanopt):
    # v1 is the input 2D numpy array.
    # w is the width of the square window along one dimension
    # nanopt can be replace or propagate
    '''
    v1 = np.array(
        [[3.33692829e-02, 6.79152655e-02, 9.66020487e-01, 8.56235492e-01],
         [3.04355923e-01, np.nan        , 4.86013025e-01, 1.00000000e+02],
         [9.40659566e-01, 5.23314093e-01, np.nan        , 9.09669768e-01],
         [1.85165123e-02, 4.44609040e-02, 5.10472165e-02, np.nan        ]])
    w = 2
    '''
    mp.imshow(v1, interpolation='None', cmap='autumn')
    mp.show()

    # make a copy of the array for the output:
    vout = np.copy(v1)
    # If w is even, add one
    if w % 2 == 0:
        w = w + 1
    # get the size of each dim of the input:
    r, c = v1.shape
    # Assume that w, the width of the window, is always square.
    startrc = (w - 1) / 2
    stopr = r - ((w + 1) / 2) + 1
    stopc = c - ((w + 1) / 2) + 1
    # For all pixels within the border defined by the box size,
    # calculate the average in the window.
    # There are two options:
    #   Ignore NaNs and replace the value where possible.
    #   Propagate the NaNs
    for col in range(startrc, stopc):
        # Calculate the window start and stop columns
        startwc = col - (w / 2)
        stopwc = col + (w / 2) + 1
        for row in range(startrc, stopr):
            # Calculate the window start and stop rows
            startwr = row - (w / 2)
            stopwr = row + (w / 2) + 1
            # Extract the window
            window = v1[startwr:stopwr, startwc:stopwc]
            if nanopt == 'replace':
                # If we're replacing NaNs, then select only the finite elements
                window = window[np.isfinite(window)]
            # Calculate the mean of the window
            vout[row, col] = np.mean(window)

    mp.imshow(vout, interpolation='None', cmap='autumn')
    mp.show()
    return vout
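As a quick usage sketch (the array is the one from the function's docstring; the calls assume Smooth has been defined or imported as above):

import numpy as np

b = np.array(
    [[3.33692829e-02, 6.79152655e-02, 9.66020487e-01, 8.56235492e-01],
     [3.04355923e-01, np.nan,         4.86013025e-01, 1.00000000e+02],
     [9.40659566e-01, 5.23314093e-01, np.nan,         9.09669768e-01],
     [1.85165123e-02, 4.44609040e-02, 5.10472165e-02, np.nan]])

print(Smooth(b, 2, 'replace'))    # interior NaNs replaced by the mean of the finite values in each window
print(Smooth(b, 2, 'propagate'))  # NaNs propagate through any window that contains them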

Keeping track of an objects local coordinate space

Ok, so - this is heavily related to my previous question Transforming an object between two coordinate spaces, but a lot more direct and it should have an obvious answer.
An object's local coordinate space: how do I "get a hold of it"? Say that I load an Orc into my game, how do I know programmatically where its head, left arm, right arm and origin (belly button?) are? And when I know where they are, do I need to save them manually, or is it something that magically exists in the DirectX API? In my other question one person said something about storing a vertex for the X, Y, Z directions and an origin? How do I find these vertices? Do I pick them arbitrarily, assign them in the model before loading it into my game? etc.
This is not about transforming between coordinate spaces
Actually, it is about transforming between coordinate spaces.
Ok, you understand that you can have a Matrix that does Translation or Rotation. (Or scaling. Or skew. Etc.) That you can multiply such a Matrix by a point (V) to get the new (translated/rotated/etc) point. E.g.:
Mtranslate * V = [ 1 0 0 Tx ] [ Vx ]   [ Vx + Tx ]
                 [ 0 1 0 Ty ] [ Vy ] = [ Vy + Ty ]
                 [ 0 0 1 Tz ] [ Vz ]   [ Vz + Tz ]
                 [ 0 0 0 1  ] [ 1  ]   [ 1       ]
Or rotation about the X/Y/Z axis:
Mrotate_x * V = [ 1   0    0   0 ] [ Vx ]   [ Vx              ]
                [ 0  cos -sin  0 ] [ Vy ] = [ Vy*cos - Vz*sin ]
                [ 0  sin  cos  0 ] [ Vz ]   [ Vy*sin + Vz*cos ]
                [ 0   0    0   1 ] [ 1  ]   [ 1               ]
That you can combine multiple operations by multiplying the matrices together. For example: Mnew = Mtranslate * Mrotate_x.
(And yes, the order does matter!)
So given a real-world base origin, you could translate to your (known/supplied) orc's xy location, deduce from the terrain at that point what your orc's feet's z location is and translate to that. Now you have your orc's "feet" in global coordinates. From there, you might want to rotate your orc so he faces a particular direction.
Then translate upwards (based on your orc's model) to find its neck. From the neck, we can translate upwards (and perhaps rotate) to find its head, or over to find its shoulder. From the shoulder, we can rotate and translate to find the elbow. From the elbow we can rotate and translate to find the hand.
Each component of your orc model has a local reference frame. Each is related to the reference frame where it is attached by one or more transforms (matrices). By multiplying through all the local matrices (transforms), we can eventually see how to go from the orc's feet to his hand.
(Mind you, I'd animate the orc and just use his feet x,y,z location. Otherwise it's too much work.)
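To make that chain of local transforms concrete, here is a minimal numpy sketch; the limb offsets, angles and helper names are made up purely for illustration (nothing here comes from DirectX or any particular model format):

import numpy as np

def translate(tx, ty, tz):
    M = np.eye(4)
    M[:3, 3] = [tx, ty, tz]           # homogeneous translation matrix
    return M

def rotate_x(theta):
    c, s = np.cos(theta), np.sin(theta)
    M = np.eye(4)
    M[1:3, 1:3] = [[c, -s], [s, c]]   # rotation about the X axis
    return M

# feet -> neck -> shoulder -> elbow -> hand, each step expressed in its parent's frame
feet_to_neck      = translate(0, 0, 1.6)                    # hypothetical torso height
neck_to_shoulder  = translate(0.25, 0, -0.1)
shoulder_to_elbow = rotate_x(np.radians(30)) @ translate(0, 0, -0.3)
elbow_to_hand     = rotate_x(np.radians(45)) @ translate(0, 0, -0.3)

feet_to_hand = feet_to_neck @ neck_to_shoulder @ shoulder_to_elbow @ elbow_to_hand
hand_in_feet_frame = feet_to_hand @ np.array([0, 0, 0, 1])  # hand origin in the feet frame
print(hand_in_feet_frame[:3])

Reversing the order of any of these multiplications gives a different hand position, which is the "order matters" point made above.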
You see a lot of this sort of thing in robotics which can provide a nice elementary introduction to forward kinematics. In robotics, at least with the simpler robots, you have a motor which rotates, and a limb-length to translate by. And then there's another motor. (In contrast, the human shoulder-joint can rotate about all three axis at once.)
You might look over the course HANDOUTS for CMU's Robotic Manipulation class (15-384).
Perhaps something like their Forward Kinematics Intro.