I'm trying to implement the K-means clustering algorithm on 600 data points, each with 60 dimensions. A line of input looks like:
28.7812 34.4632 31.3381 31.2834 28.9207 33.7596 25.3969 27.7849 35.2479 27.1159 32.8717 29.2171 36.0253 32.337 34.5249 32.8717 34.1173 26.5235 27.6623 26.3693 25.7744 29.27 30.7326 29.5054 33.0292 25.04 28.9167 24.3437 26.1203 34.9424 25.0293 26.6311 35.6541 28.4353 29.1495 28.1584 26.1927 33.3182 30.9772 27.0443 35.5344 26.2353 28.9964 32.0036 31.0558 34.2553 28.0721 28.9402 35.4973 29.747 31.4333 24.5556 33.7431 25.0466 34.9318 34.9879 32.4721 33.3759 25.4652 25.8717
I'm thinking of having a struct for each data point, holding a vector of attributes, like
struct Point {
    std::vector<double> attributes;
};
and I guess when iterating through all the points I'd add up the attributes with an index i in a for loop? Is this the best way to go about it?
I'm not sure exactly what you are asking, but since every point has the same fixed number of dimensions, with C++11 you could use std::array, so you might have
std::vector<std::array<double,60>> myvec;
Then myvec[2] and myvec[10] (assuming myvec.size() > 10) are both elements of type std::array<double,60> so you can of course use myvec[2][7] and myvec[10][59]
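For example, the accumulation the asker describes (summing attributes across points, say to compute a cluster centroid for K-means) could look like the following sketch. The centroid helper and the kDims constant are my names, not anything from the question, and points is assumed to be non-empty:

#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kDims = 60;

// Sum the attributes of all points dimension by dimension, then divide
// by the number of points, giving the centroid of the set.
std::array<double, kDims> centroid(
        const std::vector<std::array<double, kDims>>& points) {
    std::array<double, kDims> sum{};  // value-initialized to all zeros
    for (const auto& p : points)
        for (std::size_t d = 0; d < kDims; ++d)
            sum[d] += p[d];
    for (std::size_t d = 0; d < kDims; ++d)
        sum[d] /= points.size();
    return sum;
}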
600 data points is a small number. Computing the distances from one point to all 600 points in 60-dimensional space takes about 600 × 60 = 36,000 operations. That's manageable.
Keep in mind that your data is very, very sparse, though. A more realistic data set for 60 dimensions would have far, far more points. In that case you might need to think about pre-partitioning the space, which would complicate your data structure.
One intermediate-level technique is to realize that distances only ever add up. When looking for a neighbor of point P, you calculate the distance to the first point over all 60 dimensions. This establishes a current best distance D. But when you calculate the distance to the second point, you may find that you already exceed D after 59 dimensions, so that point cannot be closer. The tricky bit is that you cannot afford to check this after every single dimension for every point; that would be excessive. You may need to manually unroll loops, and exactly how depends on your data distribution.
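A minimal sketch of that early-exit idea, assuming squared Euclidean distances (the function name and the check-every-8-dimensions stride are mine, purely for illustration):

#include <array>
#include <cstddef>

constexpr std::size_t kDims = 60;

// Squared distance with early exit: stop once the running sum exceeds
// `bound` (the best squared distance found so far), since this
// candidate can then no longer be the nearest neighbor. Checking the
// bound only every 8 dimensions keeps branch overhead low; the right
// stride depends on your data distribution.
double boundedSquaredDistance(const std::array<double, kDims>& a,
                              const std::array<double, kDims>& b,
                              double bound) {
    double sum = 0.0;
    for (std::size_t d = 0; d < kDims; ++d) {
        const double diff = a[d] - b[d];
        sum += diff * diff;
        if (d % 8 == 7 && sum > bound)
            break;  // already worse than the current best
    }
    return sum;
}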
Assume I have the following array of objects:
Object 0:
[0]=1.1344
[1]=2.18
...
[N]=1.86
-----------
Object 1 :
[0]=1.1231
[1]=2.16781
...
[N]=1.8765
-------------
Object 2 :
[0]=1.2311
[1]=2.14781
...
[N]=1.5465
--------
Object 17:
[0]=1.31
[1]=2.55
...
[N]=0.75
How can I compare those objects?
You can see that object 0 and object 1 are very similar, but object 17 is not like any of them.
I would like an algorithm that will give me all the similar objects in my array.
You tagged this question with algorithm (and I am no expert in C++), so let's use pseudocode.
First, you should set a threshold: two values whose difference is below that threshold count as similar. The second step is to loop over all pairs of elements and check them for similarity.
Let A be an array with n objects and m the number of fields in each object.
threshold = 0.1
for i in (0, n):
    for j in (i+1, n):
        flag = true
        for k in (0, m):
            if abs(A[i][k] - A[j][k]) > threshold:
                flag = false  // a field differs by more than the threshold: not similar
                break         // no need to continue checks
        if flag:
            print: elements i and j are similar // and do whatever
Time complexity is O(m * n^2).
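Since the question is tagged C++ as well, here is a direct translation of the pseudocode above; it is only a sketch, and the function and variable names are mine:

#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Compare every pair of objects field by field against the threshold
// and report the similar pairs.
void printSimilarPairs(const std::vector<std::vector<double>>& A,
                       double threshold = 0.1) {
    const std::size_t n = A.size();
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            bool similar = true;
            for (std::size_t k = 0; k < A[i].size(); ++k) {
                if (std::abs(A[i][k] - A[j][k]) > threshold) {
                    similar = false;  // one field differs too much
                    break;            // no need to check further fields
                }
            }
            if (similar)
                std::cout << "objects " << i << " and " << j
                          << " are similar\n";
        }
    }
}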
Notice that you can use the same idea to sort the objects array: declare a comparison function as the maximum difference between fields and sort accordingly.
Hope that helps!
Your problem essentially boils down to nearest neighbor search, which is a well-researched problem in data mining.
There are different approaches to this problem.
I would suggest deciding first how many similar elements you want, OR setting a threshold for the similarity. Then you have to iterate through all the vectors and compute a distance function between the query vector and each vector in the database.
I would suggest using the Euclidean distance in your case, since you have real-valued data.
You can read more about the topics of nearest neighbor search and Euclidean distance here and here. Good luck!
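For reference, the Euclidean distance between two equally sized vectors is straightforward to compute; a minimal sketch (the function name is mine):

#include <cmath>
#include <cstddef>
#include <vector>

// Plain Euclidean distance: the square root of the sum of squared
// per-component differences. Assumes a and b have the same size.
double euclideanDistance(const std::vector<double>& a,
                         const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}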
What you need is a classifier; for your problem there are two suitable algorithms, depending on what you want.
If you need to find which object is most similar to a chosen object m, you can use the nearest neighbor algorithm; if you need to find similar sets of objects, you can use the k-means algorithm to find k such sets.
Well, straight to the point: I have two arrays, say oldArray[SIZE] and newArray[SIZE]. I want to find the difference between each pair of corresponding elements, e.g.:
oldArray[0]-newArray[0] =
oldArray[1]-newArray[1] =
oldArray[2]-newArray[2] =
:
:
oldArray[SIZE-1]-newArray[SIZE-1] =
If the difference is zero, no worries, but if the difference is nonzero, store it along with its index. What's the best way to store this? I want to send the difference data to a client over the network. The only ways I am aware of are a vector or a dynamic array. I'd really appreciate help with this.
Update: oldArray[] and newArray[] are two image frames of a video sequence holding depth values for each pixel. I want to compute the difference between the two frames and send only the difference over the network; at the other end I will reconstruct the image frame. The data is integers in the range 0 to 1024. Hope this helps.
I'd go for a std::map<int, std::pair<T,T>> where the key is the index in question and the std::pair contains the old value in first and the new value in second; no entries where first and second are equal.
Given your update, a std::map<int,int> where the key is the index and the value is the difference might be sufficient to keep your bitmaps synchronized.
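A minimal sketch of building that map from the two frames; the frameDiff name is mine, and I assume the depth values fit in int as the update says:

#include <cstddef>
#include <map>
#include <vector>

// Collect index -> difference for every pixel that changed between the
// two frames; unchanged pixels get no entry.
std::map<int, int> frameDiff(const std::vector<int>& oldFrame,
                             const std::vector<int>& newFrame) {
    std::map<int, int> diff;
    for (std::size_t i = 0; i < oldFrame.size(); ++i)
        if (oldFrame[i] != newFrame[i])
            diff[static_cast<int>(i)] = oldFrame[i] - newFrame[i];
    return diff;
}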
How to serialize that properly over the network is a different kettle of fish.
I'm not a specialist in signal processing. I'm doing simple processing on a 1D signal using C++. I really want to know how I can determine the part that has the highest zero-crossing rate (the highest frequency!). Is there a simple way or method to find the beginning and the end of this part?
This image illustrates the form of my signal, and this image shows what I need (the two indexes of the beginning and the end).
Edited:
Actually, I have no prior idea about the width of the beginning and the end; it is quite variable.
I can calculate the number of zero crossings, but I have no idea how to determine the range of the high-rate part:
int calculateZC(const std::vector<double>& signals) {
    int ZC_counter = 0;
    const int size = static_cast<int>(signals.size());
    for (int i = 0; i < size - 1; i++) {
        // a zero crossing is a sign change between consecutive samples
        if ((signals[i] >= 0 && signals[i+1] < 0) ||
            (signals[i] < 0 && signals[i+1] >= 0)) {
            ZC_counter++;
        }
    }
    return ZC_counter;
}
Here is a fairly simple strategy which might give you a starting point. The outline of the algorithm is as follows:
Input: a vector of your data points {y0, y1, ...}
Parameters:
    window size sigma
    a threshold 0 < p < 1 defining when to start looking for a region
Output: the start and end point {t0, t1} of the region with the most zero-crossings
The method should be easy to implement; a C++ sketch is given after the conclusion below. As an example, let us use the following function:
What we want is the region between about 480 and 600, where the zero density is higher than in the front part. The first step of the algorithm is to calculate the positions of the zeros. You can do this with what you already have, but instead of counting, store the value of i whenever you meet a zero.
This will give you a list of zero positions.
From this list (you can do it directly in the above for-loop!) you create a list of the same size as your input data, which looks like {0,0,0,...,1,0,...,1,0,...}: every zero-crossing position in your input data is marked with a 1.
The next step is to smooth this list with a smoothing filter of size sigma. Here you can use whatever you like; in the simplest case, a moving average or a Gaussian filter. The larger you choose sigma, the bigger the look-around window becomes that measures how many zero-crossings there are around a certain point. The image shows the output of this filter together with the original zero positions; note that I used a Gaussian filter of size 10 here.
In the next step, you go through the filtered data and find the maximum value; in this case it is about 0.15. Now you choose your second parameter, some percentage p of this maximum, say p = 0.6.
The final step is to go through the filtered data again: whenever a value is greater than p times the maximum, you start a new region, and as soon as the value drops below that level, you end the region and remember its start and end point. Once you have walked through all the data, you are left with a list of regions, each defined by a start and an end point. Now you choose the region with the biggest extent and you are done.
(Optionally, you could widen the final region by the filter size at each end.)
For the above example, I get 11 regions, as follows:
{{164,173},{196,205},{220,230},{241,252},{259,271},{278,290},
{297,309},{318,327},{341,350},{458,468},{476,590}}
where the one with the biggest extent is the last one, {476,590}. The final result (with half a filter region of padding) is shown in the image.
Conclusion
Please don't be discouraged by the length of my answer. I tried to explain everything in detail. The implementation is really just a few loops:
one loop to create the zero-crossings list {0,0,..,1,0,...}
one nested loop for the moving-average filter (or you use some library's Gaussian filter); here you can extract the maximum value at the same time
one loop to extract all regions
one loop to extract the largest region, if you haven't already done so in the step above
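For concreteness, here is the promised C++ sketch of these loops, using a simple moving average in place of the Gaussian filter; all identifiers are mine, and the edge handling is deliberately simplistic:

#include <algorithm>
#include <utility>
#include <vector>

// Find the region with the highest zero-crossing density.
// sigma is the half-width of the smoothing window, p the threshold
// fraction of the maximum density. Returns {t0, t1}.
std::pair<int, int> highestZeroCrossRegion(const std::vector<double>& signal,
                                           int sigma, double p) {
    const int n = static_cast<int>(signal.size());

    // Step 1: mark every zero-crossing position with a 1.
    std::vector<double> marks(n, 0.0);
    for (int i = 0; i + 1 < n; ++i)
        if ((signal[i] >= 0 && signal[i + 1] < 0) ||
            (signal[i] < 0 && signal[i + 1] >= 0))
            marks[i] = 1.0;

    // Step 2: smooth the marks with a moving average of width 2*sigma+1
    // and keep track of the maximum of the filtered values.
    std::vector<double> density(n, 0.0);
    double maxDensity = 0.0;
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        int count = 0;
        for (int j = std::max(0, i - sigma); j <= std::min(n - 1, i + sigma); ++j) {
            sum += marks[j];
            ++count;
        }
        density[i] = sum / count;
        maxDensity = std::max(maxDensity, density[i]);
    }

    // Steps 3 and 4: collect the regions where the density exceeds
    // p * maxDensity and keep the widest one.
    const double cut = p * maxDensity;
    std::pair<int, int> best{0, 0};
    int start = -1;
    for (int i = 0; i < n; ++i) {
        if (density[i] > cut && start < 0) {
            start = i;                         // a region begins
        } else if ((density[i] <= cut || i == n - 1) && start >= 0) {
            if (i - start > best.second - best.first)
                best = {start, i};             // widest region so far
            start = -1;
        }
    }
    return best;  // optionally pad by sigma on each side
}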
For a school project, I have a simple program which compares 20x20 photos. I store 20 photos, then I add a 21st photo, which is compared against the existing 20, and the program reports which photo I inserted (or which one is most similar). The thing is, my teacher wanted me to use the nearest neighbour algorithm, so I am computing the distance to every photo. I have everything working, but if the photos are too similar, I have a problem deciding which one is closer to mine. For example, I get these distances with 2 different photos (well, they are ALMOST the same):
0 distance: 1353.07982026191
1 distance: 1353.07982026191
That is already 15 digits, and I am using the double type. I have read that long double is the same. Is there any "easy" way to store numbers with more than 15 significant digits and do math on them?
I compute the distance using the Euclidean distance.
I just need more precision. Or is that a limit I probably won't get past, so I should tell my teacher that I can't compare such similar photos?
I think you need this: gmplib.org
There's a guide on how to install this library on that site, too.
And here's the article about floats: http://gmplib.org/manual/C_002b_002b-Interface-Floats.html#C_002b_002b-Interface-Floats
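A minimal sketch of what the distance computation might look like with GMP's C++ interface, assuming gmpxx is installed (compile with -lgmpxx -lgmp; the 256 bits of precision is an arbitrary choice):

#include <cstddef>
#include <gmpxx.h>
#include <vector>

// Euclidean distance accumulated at 256-bit precision instead of
// double's roughly 15-16 significant decimal digits.
mpf_class preciseDistance(const std::vector<double>& a,
                          const std::vector<double>& b) {
    mpf_class sum(0, 256);  // value 0, 256 bits of precision
    for (std::size_t i = 0; i < a.size(); ++i) {
        mpf_class diff(a[i], 256);
        diff -= b[i];
        sum += diff * diff;
    }
    return sqrt(sum);  // gmpxx provides sqrt for mpf_class
}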
Maybe you could use an algebraic approach.
Let us assume that you are trying to calculate whether vector x is closer to a or to b. What you need is the sign of
d2(x,a) - d2(x,b)
which becomes (omitting some passages for brevity)
sum_i (x_i - a_i)^2 - sum_i (x_i - b_i)^2
and then, factoring each term as a difference of squares,
sum_i (b_i - a_i) * ((x_i - a_i) + (x_i - b_i))
which only contains products and sums of differences between values that should be very similar. Summing over such small values yields better precision than working on the aggregates.
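A minimal sketch of that comparison, using the factored form above (the function name is mine); a negative result means x is closer to a, a positive one that it is closer to b:

#include <cstddef>
#include <vector>

// Sign of d2(x,a) - d2(x,b), computed as a sum of products of small
// differences instead of a difference of two large aggregates.
double distanceSign(const std::vector<double>& x,
                    const std::vector<double>& a,
                    const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        sum += (b[i] - a[i]) * ((x[i] - a[i]) + (x[i] - b[i]));
    return sum;
}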
Okay, I think I might be over-complicating this issue, but I truly am stuck. Basically, I am trying to model a weight set, specifically an Olympic weight set. So I have the bar, which is 45 lbs; then I have 2 plates of 2.5 lbs, 4 of 5 lbs, and then 2 each of 10, 25, 35, and 45 lbs. This makes a total of 300 lbs.
bar = 45 lbs
2 of 2.5
4 of 5
2 of 10
2 of 25
2 of 35
2 of 45
I want to model this weight set so that I have this information: the weight and the quantity of weights I have. I know I could hard-code this but I eventually want to let the user enter how many of each weight they may have.
Anyways, originally I thought I could simply have an NSDictionary with the key being the weight, such as 35, and the value being the quantity.
Naturally, I cannot store primitives in an NSDictionary or another Cocoa collection, so I have to wrap each integer in an NSNumber. However, the point of modeling this weight set is to simulate the use of certain weights. For example, if I use a 35 lb weight, that takes 2 off (one for each side), so I have to go and edit the value for the 35 lb weight to reflect the fact that I have deducted 2 from the quantity.
This involves the tedious task of unboxing the NSNumber, converting back to a primitive, doing the math, and then re-boxing into an NSNumber and assigning that new result to the appropriate location in the NSDictionary. After searching around a bit, I confirmed my initial premonition that this was not a good idea.
So I have a couple questions. First of all, is there a better way of modeling a weight set aside from using a dictionary-style solution? If not, what is the suggested way to go about doing this? Do I have to leave the cocoa-realm and resort to using some sort of C++ STL template such as a map?
I have seen some information on NSDecimalNumber, should I just use that?
Like I said, I wouldn't be surprised if I am over-complicating this. I would really appreciate any help, thanks.
EDIT: I am beginning to think that the weight set 'definition' as described should indeed be immutable, as it is a definition after all. Then when I use a certain weight, I can add to some sort of tally. The thing is that the tally will also be some form of collection whose values I will be modifying (adding to), so that I can correlate it to the specific weight. So I guess I am in the same problem.
I think where I am trying to get at is creating a 'clone' so to speak of the weight set definition which I can easily modify (to simulate the usage of individual weights).
Sorry, I'm burned out.
Storing this in a dictionary isn't a natural fit. I think the best approach would be to make a Weight class that represents the weights and stick them in an NSCountedSet. You can get the individual kinds of Weight and the counts for each kind, and you can get the weight of the whole set with [weightSet valueForKeyPath:@"@sum.weightInPounds"] (assuming the Weights have a weightInPounds property that represents how heavy they are).
You could use NSNumbers in the NSCountedSet and sum them with @sum.integerValue if you wanted, but it seems a bit awkward to me. At any rate, NSCountedSet is definitely a more natural collection than an NSDictionary for storing — well, a counted set.
There's nothing wrong with storing your numbers in an NSDictionary! The question you referenced was about complicated, frequent math. Converting from NSNumber and back is slow compared to simple int addition, but is still super-fast compared to human perception. I think your dictionary idea is EDIT: not as good as Chuck's NSCountedSet idea. :)