Clustering a list of dates - python-2.7

I have a list of dates I'd like to cluster into 3 clusters. Now, I can see hints that I should be looking at k-means, but all the examples I've found so far are related to coordinates, in other words, pairs of list items.
I want to take this list of dates and append them to three separate lists indicating whether they were before, during or after a certain event. I don't have the time for this event, but that's why I'm guessing it by breaking the date/times into three groups.
Can anyone please help with a simple example on how to use something like numpy or scipy to do this?

k-means is designed exclusively for coordinates, and more precisely for continuous, linear values.
The reason is the mean function. Many people overlook the role of the mean in k-means (despite it being in the name...).
On non-numerical data, how do you compute the mean?
There exist some variants for binary or categorical data. IIRC there is k-modes, for example, and there is k-medoids (PAM, partitioning around medoids).
It's unclear to me what you want to achieve overall... your data seems to be 1-dimensional, so you may want to look at the many questions here about 1-dimensional data (as the data can be sorted, it can be processed much more efficiently than multidimensional data).
In general, even if you projected your data onto Unix time (seconds since 1970-01-01), k-means will likely only return mediocre results for you. The reason is that it will try to make the three intervals have the same length.
Do you have any reason to suspect that "before", "during" and "after" have the same duration? If not, don't use k-means.
You may, however, want to have a look at kernel density estimation (KDE) and plot the estimated density. Once you have understood the role of density for your task, you can start looking at appropriate algorithms (e.g. take the derivative of your density estimate and look for the largest increase/decrease, or estimate an "average" level and look for the longest above-average interval).
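To make that concrete, here is a minimal sketch, assuming the dates are first converted to seconds and using scipy.stats.gaussian_kde (the toy data and variable names are just illustrative):

    # Sketch: estimate the density of event times with a Gaussian KDE and
    # look at where the density rises and falls. `dates` is toy data.
    import numpy as np
    from scipy.stats import gaussian_kde
    from datetime import datetime

    dates = [datetime(2013, 5, 1, 10, 0), datetime(2013, 5, 1, 10, 5),
             datetime(2013, 5, 1, 13, 0), datetime(2013, 5, 1, 13, 2),
             datetime(2013, 5, 1, 18, 30)]

    # Convert to seconds since the earliest date (1-dimensional, sortable).
    t = np.array([(d - min(dates)).total_seconds() for d in dates])

    kde = gaussian_kde(t)
    grid = np.linspace(t.min(), t.max(), 500)
    density = kde(grid)

    # The largest increase/decrease of the density hints at the boundaries
    # between "before", "during" and "after".
    slope = np.gradient(density, grid[1] - grid[0])
    print(grid[np.argmax(slope)])   # candidate "start of event" boundary
    print(grid[np.argmin(slope)])   # candidate "end of event" boundary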

Here are some workaround methods that may not be the best answer but should help.
You can convert the dates to durations from a starting date (such as the beginning of a week),
representing each date as a number of minutes or hours from that starting point.
These values all lie along a single x-axis, but k-means can still be applied and the clustering will still be visible on a graph.
Here are more examples with numpy: Python k-means algorithm
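A minimal sketch of that workaround with scipy's k-means (the toy dates, variable names and k=3 are just illustrative):

    # Sketch: convert datetimes to hours from the earliest date, run k-means
    # with k=3, then split the dates into "before", "during" and "after" by
    # ordering the clusters chronologically.
    import numpy as np
    from scipy.cluster.vq import kmeans2
    from datetime import datetime

    dates = [datetime(2013, 5, 1, 10, 0), datetime(2013, 5, 1, 10, 5),
             datetime(2013, 5, 1, 13, 0), datetime(2013, 5, 1, 13, 2),
             datetime(2013, 5, 1, 18, 30), datetime(2013, 5, 1, 18, 40)]

    start = min(dates)
    hours = np.array([[(d - start).total_seconds() / 3600.0] for d in dates])

    centres, labels = kmeans2(hours, 3, minit='points')

    # Rank the clusters by their centre so index 0 is earliest, 2 is latest.
    rank = np.argsort(np.argsort(centres[:, 0]))
    groups = [[], [], []]
    for d, lab in zip(dates, labels):
        groups[rank[lab]].append(d)
    before, during, after = groups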

Related

Shape-matching of plots using non-linear least squares

What would be the best way to implement a simple shape-matching algorithm to match a plot interpolated from just 8 points (x, y) against a database of similar plots (> 12 000 entries), each plot having >100 nodes. The database has 6 categories of plots (signals measured under 6 different conditions), and the main aim is to find the right category (so for every category there are around 2000 plots to compare against).
The 8-node plot would represent actual data from measurement, but for now I am simulating this by selecting a random plot from the database, then 8 points from it, then smearing it using gaussian random number generator.
What would be the best way to implement non-linear least-squares to compare the shape of the 8-node plot against each plot from the database? Are there any c++ libraries you know of that could help with this?
Is it necessary to find the actual formula (f(x)) of the 8-node plot to use it with least squares, or will it be sufficient to use interpolation in requested points, such as interpolation from the gsl library?
You can certainly use least squares without knowing the actual formula. If all of your plots are measured at the same x values, then this is easy -- you simply compute the sum in the normal way:
chi^2 = sum_i [ (y_i - Y(x_i))^2 / sigma_i^2 ]
where y_i is a point in your 8-node plot, sigma_i is the error on that point and Y(x_i) is the value of the plot from the database at the same x position as y_i. You can see why this is trivial if all your plots are measured at the same x values.
If they're not, you can get Y(x_i) either by fitting the plot from the database with some function (if you know it) or by interpolating between the points (if you don't know it). The simplest interpolation is just to connect the points with straight lines and find the value of the straight lines at the x_i that you want. Other interpolations might do better.
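Here is a rough sketch of that comparison in Python/numpy (the question mentions C++ and gsl, but the logic carries over directly; all names here are illustrative):

    # Sketch: chi-squared comparison of a sparse 8-point measurement against
    # a densely sampled reference plot, using straight-line interpolation to
    # evaluate the reference at each measured x position.
    import numpy as np

    def chi2(x_meas, y_meas, sigma, x_ref, y_ref):
        # x_ref must be sorted in increasing order for np.interp.
        y_interp = np.interp(x_meas, x_ref, y_ref)
        return np.sum(((y_meas - y_interp) / sigma) ** 2)

    def best_match(x_meas, y_meas, sigma, database):
        # `database` is assumed to be a list of (category, x_ref, y_ref)
        # tuples; return the category of the entry with the smallest chi^2.
        best = min(database,
                   key=lambda e: chi2(x_meas, y_meas, sigma, e[1], e[2]))
        return best[0]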
In my field, we use ROOT for this kind of thing. However, scipy has a great collection of functions, and it might be easier to get started with -- if you don't mind using Python.
One major problem you could have would be that the two plots are not independent. Wikipedia suggests McNemar's test in this case.
Another problem you could have is that you don't have much information in your test plot, so your results will be affected greatly by statistical fluctuations. In other words, if you only have 8 test points and two plots match, how will you know if the underlying functions are really the same, or if the 8 points simply jumped around (inside their error bars) in such a way that it looks like the plot from the database -- purely by chance! ... I'm afraid you won't really know. So the plots that test well will include false positives (low purity), and some of the plots that don't happen to test well were probably actually good matches (low efficiency).
To solve that, you would need to either use a test plot with more points or else bring in other information. If you can throw away plots from the database that you know can't match for other reasons, that will help a lot.

How to combine two images with different gains in Matlab/ C++

I have two images, both taken at the same time from the same detector.
Both images have 11-bit resolution (yes, it's odd but that is the case here). The difference between the two images is that one image has been amplified by a factor of 1 and the other has been amplified by a factor of 10.
How can I take these two 11 bit images, and combine their pixel values to get a single 16 bit image? Basically, this increases the dynamic range of the final image.
I am fairly new to image processing. I know there is a solution for this, since other systems do this on the fly pixel-by-pixel in an FPGA. I was just hoping to be able to do this in Matlab post processing instead of live. I know doing bitwise operations in Matlab can be kinda difficult, but we do have an educational license with every toolbox available.
As mentioned below, this looks an awful lot like HDR processing. The goal isn't artistic, though; rather, it's data preservation. This is eventually going to be put in C++ and flown on an autonomous flight computer, and running standard bloated HDR software on the fly would kill our timing requirements.
Thanks for the help!
As a side note, I'd like to be able to do this for any combination of gains, e.g. 2x and 30x, 4x and 8x, etc. In my gut I feel like this is a deceptively simple algorithm or interpolation, but I just don't know where to start.
Gains
Since there is some confusion on what the gains mean, I'll try to explain. The image sensor (CMOS) being used on our custom camera has the capability to simultaneously output two separate images, both taken from the same exposure. It can do this because the sensor has 2 different electrical amplifiers along its data path.
In photography terms, it would be like your DSLR being able to take a picture using 2 different ISO values at the same time.
Sorry for the confusion
The problem you pose is known as "High Dynamic Range Imaging" and "Tone Mapping". I suggest you start with those Wikipedia articles, then drill down to the bibliography cited therein.
You don't provide enough details about your imagery to give a more specific answer. What is the "gain" you mention? Did you crank up the sensor's gain (to what ISO-equivalent number?), or did you use a longer exposure time? Are the 11-bit pixel values linear or already gamma-compressed?
To upscale an 11-bit range to a 16-bit range, multiply by (2^16-1)/(2^11-1).
(This assumes you want a linear scaling, which is reasonable when scaling up.)
If the gain was discrete (applied within the 11-bit range), then you have two 11-bit images, some of whose values may be saturated.
If the gain was applied in a continuous (analog) or floating-point range, then your values can go beyond the original 11 bits. In that case, the values were probably scaled to another range first, e.g. [0,1] (by dividing by (2^11-1)).
If the values were scaled to another range, you will have to divide by the maximum of the new range instead of by (2^11-1).
Either way (whether the gain was in the 11-bit range or not), due to the gain and the addition, the resulting values may be larger than the original range. In this case, you need to decide how you want to scale them (see the sketch below):
Do you want to scale the original 11-bit range to 16 bits (possibly causing saturation)?
If so, multiply by (2^16-1)/(2^11-1).
Do you want to scale the maximum possible value to 2^16-1?
If so, multiply by (2^16-1)/((2^11-1) * (G1+G2)).
Do you want to scale the actual maximum value to 2^16-1?
If so, multiply by (2^16-1)/max(I1+I2).
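A small numpy sketch of those three choices, applied to the direct sum of the two images (the question is about Matlab/C++, but the arithmetic ports directly; the gains and toy arrays are placeholders):

    # Sketch: sum two 11-bit images and rescale the result into 16 bits in
    # three different ways. I1, I2, G1 and G2 are placeholders.
    import numpy as np

    I1 = np.random.randint(0, 2**11, size=(4, 4)).astype(np.float64)  # 1x gain
    I2 = np.clip(I1 * 10, 0, 2**11 - 1)                               # 10x gain
    G1, G2 = 1.0, 10.0

    summed = I1 + I2

    # 1) Keep the original 11-bit scale (the sum may saturate, hence the clip):
    scale = (2**16 - 1) / float(2**11 - 1)
    out1 = np.clip(summed * scale, 0, 2**16 - 1).astype(np.uint16)

    # 2) Scale the maximum *possible* value to 2^16-1:
    out2 = (summed * (2**16 - 1) / ((2**11 - 1) * (G1 + G2))).astype(np.uint16)

    # 3) Scale the *actual* maximum value to 2^16-1:
    out3 = (summed * (2**16 - 1) / summed.max()).astype(np.uint16)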
Edit:
Since you do not want to add the images, but rather use the different details in them, perhaps this article will help you:
Digital Photography with Flash and No-Flash Image Pairs

Analyzing gaze tracking data

I have an image which was shown to groups of people with different domain knowledge of its content. I then recorded gaze fixation data while they watched the image.
I now want to compare the results of the two groups - so what I need to know is whether there is a correlation between the positions of the sampled fixations of the two groups or not.
I have the original image as well as the fixation coords. Do you have any good idea how to start analyzing the data?
It's more about the idea or the plan so you don't have to be too technical on that one.
Thanks
Simple idea: render all the coordinates on the original image in a 'heat map'-like way, one image for each group. You can then visually compare the images for correlation, and you have some nice graphics for your paper.
There is something like the two-dimensional correlation coefficient. With software like R or Matlab you can do the number crunching for the correlation.
Matlab has a function for this:
Two-Dimensional Correlation Function: corr2
Computes the two-dimensional correlation coefficient between two matrices; the matrices must be of the same size. r = corr2(A, B)
In gaze tracking, the most interesting data lies in two areas.
Where all people look: for that you can use the heat map Daan suggests. Make a heat map for all people, and heat maps for the separate groups of people.
When people look there: for that I would recommend you start by making heat maps as above, but for short time intervals, starting from the time the picture was first shown. Again, do this for all people and for the separate groups you have.
The resulting set of heat-maps, perhaps animated for the ones from the second point, should give you some pointers for further analysis.
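A small sketch of the heat-map idea in Python/numpy, including a corr2-style comparison (the fixation arrays, image size and smoothing width are illustrative):

    # Sketch: turn fixation coordinates into per-group heat maps via a 2-D
    # histogram plus Gaussian smoothing, then correlate the two maps.
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def heat_map(fixations, width, height, sigma=15):
        # fixations: array of (x, y) coordinates in image pixels.
        hist, _, _ = np.histogram2d(fixations[:, 1], fixations[:, 0],
                                    bins=[height, width],
                                    range=[[0, height], [0, width]])
        return gaussian_filter(hist, sigma)

    def corr2(a, b):
        # Two-dimensional correlation coefficient, like Matlab's corr2.
        a, b = a - a.mean(), b - b.mean()
        return (a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum())

    group1 = np.random.rand(200, 2) * [640, 480]   # toy (x, y) fixations
    group2 = np.random.rand(200, 2) * [640, 480]
    m1 = heat_map(group1, 640, 480)
    m2 = heat_map(group2, 640, 480)
    print(corr2(m1, m2))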

Finding the spread of each cluster from Kmeans

I'm trying to detect how well an input vector fits a given cluster centre. I can find the best match quite easily (the centre with the minimum euclidean distance to the input vector is the best), however, I now need to work out how good a match that is.
To do this I need to find the spread (standard deviation?) of the vectors which build up the centroid, then see if the distance from my input vector to the centre is less than the spread. If it's more than the spread, then I should be able to say that I have no clusters to fit it (given that the best doesn't fit the input vector well).
I'm not sure how to find the spread per cluster. I have all the centre vectors, and all the training vectors are labelled with their closest cluster, I just can't quite fathom exactly what I need to do to get the spread.
I hope that's clear? If not I'll try to reword it!
TIA
Ian
Use the distance function and calculate the distance from your center point to each point labelled with that cluster. The mean of those distances gives you an average spread; for something closer to a standard deviation, take the square root of the mean of the squared distances (the RMS distance from the center).
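A quick sketch of that in Python/numpy, with the spread taken as the RMS distance from the centre (the arrays and names are illustrative):

    # Sketch: per-cluster RMS distance ("spread") from the centroid, plus a
    # check of whether a new vector falls within the spread of its best centre.
    import numpy as np

    def cluster_spreads(X, labels, centres):
        # X: (n, d) training vectors; labels: cluster index for each vector;
        # centres: (k, d) cluster centres. Returns the RMS distance per cluster.
        spreads = np.empty(len(centres))
        for k, c in enumerate(centres):
            d = np.linalg.norm(X[labels == k] - c, axis=1)
            spreads[k] = np.sqrt(np.mean(d ** 2))
        return spreads

    def fits(v, centres, spreads):
        # Return the index of the best-matching centre, or None if the input
        # vector is farther from it than that cluster's spread.
        d = np.linalg.norm(centres - v, axis=1)
        best = int(np.argmin(d))
        return best if d[best] <= spreads[best] else None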
If you switch to using a different algorithm, such as Mixture of Gaussians, you get the spread (e.g., std. deviation) as part of the model (clustering result).
http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/mixture.html
http://en.wikipedia.org/wiki/Mixture_model
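If you go the mixture-of-Gaussians route, here is a minimal sketch with scikit-learn (one possible tool for this, not mentioned above; the data and component count are placeholders):

    # Sketch: fit a Gaussian mixture and read the per-component covariance as
    # the "spread". X and the number of components are illustrative.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.random.rand(300, 4)                      # toy training vectors
    gmm = GaussianMixture(n_components=5).fit(X)

    v = np.random.rand(1, 4)                        # new input vector
    probs = gmm.predict_proba(v)[0]                 # fit to each component
    best = int(np.argmax(probs))
    spread = np.sqrt(np.diag(gmm.covariances_[best]))  # per-dimension std. dev.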

Generating contour lines from regularly spaced data

I am currently working on a data visualization project. My aim is to produce contour lines, in other words iso-lines, from gridded data. The data can be temperature, weather data or any other kind of environmental parameter; the only condition is that it must be regularly spaced.
I searched on the internet, however I could not find a good algorithm, pseudo-code or source code for producing contour lines from grids.
Does anybody know a library, source code or an algorithm for producing contour lines from gridded data?
It would be good if your suggestion has good run-time performance; I don't want to make my users wait too long :)
Edit: thanks for the responses, but isolines have some constraints, such as that they should not intersect,
so just generating Bezier curves does not accomplish my goal.
See this question: How to approximate a vector contour from an elevation raster?
It's a near duplicate, but uses quite different terminology. You'll find that cartography and computer graphics solve many of the same problems, but use different terminology for them.
There's some reasonably good contouring available in GNUplot; if you're able to use GPL code, that may help.
If your data is placed at regular intervals, this can be done fairly easily (assuming I understand your problem correctly). First you need to determine at what interval you want your contours. Next, create the grid you are going to use to store the contour information (I'm assuming just a simple on/off or elevation-at-this-contour-level type of data), which should be one interval smaller than the source data.
Now the trick here is to offset the two grids by half an interval (this won't actually show up in code like this, but it's the concept I'm dealing with here) and compare the 4 coordinates surrounding the current point in the contour data grid you are calculating. If any of the 4 points are in a different interval range, then that 'pixel' in the contour grid should be set to true (or to the value of the contour range being crossed).
With this method, there will be a problem when the interval is too fine which will cause several contours to overlap onto each other.
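A rough numpy sketch of that corner-comparison idea (the contour interval and the toy elevation grid are illustrative; a ready-made marching-squares routine such as skimage.measure.find_contours may serve just as well):

    # Sketch: mark every grid cell whose four corners do not all fall in the
    # same contour band. The result is one cell smaller than the source grid
    # in each dimension, matching the half-interval offset described above.
    import numpy as np

    def contour_cells(data, interval):
        bands = np.floor(data / float(interval))   # contour band of each sample
        c00 = bands[:-1, :-1]                      # the four corners of each cell
        c01 = bands[:-1, 1:]
        c10 = bands[1:, :-1]
        c11 = bands[1:, 1:]
        lo = np.minimum(np.minimum(c00, c01), np.minimum(c10, c11))
        hi = np.maximum(np.maximum(c00, c01), np.maximum(c10, c11))
        return hi != lo                            # True where a contour passes

    # Toy example: a radial "elevation" field with a contour every 10 units.
    y, x = np.mgrid[0:100, 0:100]
    mask = contour_cells(np.hypot(x - 50, y - 50), 10)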
As the link from Paul Tomblin suggests, Bezier curves (which are a subset of B-splines) are a ripe solution for your problem. If runtime performance is an issue, Bezier curves have the added benefit of being constructable via the very fast de Casteljau algorithm, instead of drawing them according to the parametric equations. On the off chance you're working with DirectX, it has a library function for the de Casteljau, but it should not be challenging to brew one yourself using the 1001 web pages that describe it.
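If you do end up drawing Bezier curves, de Casteljau evaluation is only a few lines; here is a small sketch (the control points are illustrative):

    # Sketch: evaluate a Bezier curve at parameter t by repeatedly linearly
    # interpolating between neighbouring control points (de Casteljau).
    import numpy as np

    def de_casteljau(points, t):
        pts = np.asarray(points, dtype=float)      # (n, 2) control points
        while len(pts) > 1:
            pts = (1 - t) * pts[:-1] + t * pts[1:]
        return pts[0]

    # Sample a cubic Bezier at 50 parameter values.
    ctrl = [(0, 0), (1, 3), (3, 3), (4, 0)]
    curve = np.array([de_casteljau(ctrl, t) for t in np.linspace(0, 1, 50)])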