Search similar object - c++

Assume I have the following array of objects:
Object 0:
[0]=1.1344
[1]=2.18
...
[N]=1.86
-----------
Object 1 :
[0]=1.1231
[1]=2.16781
...
[N]=1.8765
-------------
Object 2 :
[0]=1.2311
[1]=2.14781
...
[N]=1.5465
--------
Object 17:
[0]=1.31
[1]=2.55
...
[N]=0.75
How can I compare those objects?
You can see that object 0 and object 1 are very similar, but object 17 is not like any of them.
I would like an algorithm that gives me all the similar objects in my array.

You tagged this question with Algorithm (and I am not an expert in C++), so let me give pseudocode.
First, set a threshold: two values whose difference is under that threshold count as similar. The second step is to loop over all pairs of objects and check them for similarity.
Let A be an array with n objects and m be the number of fields in each object.
threshold = 0.1
for i in (0, n):
    for j in (i+1, n):
        flag = true
        for k in (0, m):
            if (abs(A[i][k] - A[j][k]) > threshold)
                flag = false // if the absolute difference is above the threshold, the objects are not similar
                break // no need to continue checking
        if (flag)
            print: elements i and j are similar // and do whatever you need
Time complexity is O(m * n^2).
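For reference, the pseudocode above could be sketched in C++ roughly as follows. The names isSimilar and similarPairs are made up for illustration, and the sketch assumes every object is stored as a std::vector<double> of the same length.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: two objects are "similar" when every field
// differs by at most `threshold`.
bool isSimilar(const std::vector<double>& a,
               const std::vector<double>& b,
               double threshold) {
    assert(a.size() == b.size());  // assumption: same number of fields
    for (std::size_t k = 0; k < a.size(); ++k) {
        if (std::abs(a[k] - b[k]) > threshold) {
            return false;  // one field is too far apart, stop checking
        }
    }
    return true;
}

// Returns the index pairs (i, j), i < j, of all similar objects.
// Runs in O(m * n^2) like the pseudocode.
std::vector<std::pair<std::size_t, std::size_t>>
similarPairs(const std::vector<std::vector<double>>& objects, double threshold) {
    std::vector<std::pair<std::size_t, std::size_t>> result;
    for (std::size_t i = 0; i < objects.size(); ++i) {
        for (std::size_t j = i + 1; j < objects.size(); ++j) {
            if (isSimilar(objects[i], objects[j], threshold)) {
                result.emplace_back(i, j);
            }
        }
    }
    return result;
}
```

Calling similarPairs(objects, 0.1) gives you the list of matching index pairs, which you can then print or group as needed.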
Note that you can reuse the same idea to sort the array of objects: define a comparison function based on the maximum difference between fields, then sort accordingly.
Hope that helps!

Your problem essentially boils down to nearest neighbor search, which is a well-researched problem in data mining.
There are different approaches to this problem.
I would suggest deciding first either how many similar elements you want or which similarity threshold to use. Then you have to iterate through all the vectors and compute a distance function between the query vector and each vector in the database.
I would suggest using Euclidean distance in your case, since you have real-valued numeric data.
You can read more about nearest neighbor search and Euclidean distance here and here. Good luck!
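A minimal sketch of the threshold variant in C++, assuming each object is a std::vector<double> of equal length. The function names euclideanDistance and neighborsWithin are invented for this example; it is a plain linear scan, not an indexed nearest neighbor structure.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean distance between two vectors of the same length.
double euclideanDistance(const std::vector<double>& a,
                         const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        const double d = a[k] - b[k];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Hypothetical helper: indices of all database vectors whose distance
// to `query` is at most `maxDistance`.
std::vector<std::size_t>
neighborsWithin(const std::vector<std::vector<double>>& database,
                const std::vector<double>& query,
                double maxDistance) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < database.size(); ++i) {
        if (euclideanDistance(database[i], query) <= maxDistance) {
            hits.push_back(i);
        }
    }
    return hits;
}
```

If you want the k most similar objects instead of a threshold, you could collect all distances and partially sort them.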

What you need is a classifier. There are two algorithms for your problem, depending on what you want.
If you need to find which object is most similar to a chosen object m, you can use the nearest neighbor algorithm. If instead you need to find sets of similar objects, you can use the k-means algorithm to find k sets.

Related

Get an interval of values possible in regression using sci-kit learn machine learning

I am trying to use regression to predict a value. For a given set of independent variables, I get a fixed number as the expected value. However, is it possible to get a range of values, so that I can say the maximum possible value is x and the minimum possible value is y?
Using
regr = linear_model.LinearRegression()
regr.fit(X_train, Y_train)
pred = regr.predict([[a, b]])
The value of pred comes out to be, say, 10, but I would rather want something like max = 12 and min = 8.
Simply put, a range of values.
UPDATE
Tried looking into GMM, not sure if that works for this.
Tried Gaussian processes, but that again gives a single value, something like 11.137631, which really doesn't help, as I am looking for a range of values rather than a single value.
Linear regression always gives the same result for a given input vector. However, fitting a random forest regressor repeatedly gives a different result on each iteration, and that can be used to get a minimum and maximum possible value for the forecast of a given input vector.

Best way to store a vector with 60 dimensions - C++

I'm trying to implement K-means clustering algorithm on 600 data points, all with 60 dimensions. A line of input would be something like:
28.7812 34.4632 31.3381 31.2834 28.9207 33.7596 25.3969 27.7849 35.2479 27.1159 32.8717 29.2171 36.0253 32.337 34.5249 32.8717 34.1173 26.5235 27.6623 26.3693 25.7744 29.27 30.7326 29.5054 33.0292 25.04 28.9167 24.3437 26.1203 34.9424 25.0293 26.6311 35.6541 28.4353 29.1495 28.1584 26.1927 33.3182 30.9772 27.0443 35.5344 26.2353 28.9964 32.0036 31.0558 34.2553 28.0721 28.9402 35.4973 29.747 31.4333 24.5556 33.7431 25.0466 34.9318 34.9879 32.4721 33.3759 25.4652 25.8717
I'm thinking of having a struct of data points, where each point has a vector of attributes, like
struct Point{
std::vector<double> attributes;
};
and I guess when iterating through all the points, add up the attributes with i as an iterator in a for loop? Is this the best way to go about this?
Not sure what you are asking, but with C++11 you could use std::array, so you might have something like
std::vector<std::array<double,60>> myvec;
Then myvec[2] and myvec[10] (assuming myvec.size() > 10) are both elements of type std::array<double,60> so you can of course use myvec[2][7] and myvec[10][59]
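As a rough illustration of how that layout might be used in a K-means step, here is a sketch that averages a set of points dimension by dimension. The names kDims, Point and centroid are assumptions made for this example, not anything from the question.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kDims = 60;         // assumed fixed dimensionality
using Point = std::array<double, kDims>;  // one data point

// Hypothetical helper: the mean of all points, computed per dimension.
Point centroid(const std::vector<Point>& points) {
    assert(!points.empty());
    Point mean{};  // value-initialized, so every entry starts at 0.0
    for (const Point& p : points) {
        for (std::size_t d = 0; d < kDims; ++d) {
            mean[d] += p[d];
        }
    }
    for (double& v : mean) {
        v /= static_cast<double>(points.size());
    }
    return mean;
}
```

Because std::array has a compile-time size, all 600 points sit contiguously in the vector's storage, which tends to be friendlier to the cache than a vector of vectors.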
600 data points is a small enough number. Computing distances in 60-dimensional space for 600 points is about 36,000 operations. That's manageable.
Keep in mind that your data is very, very sparse though. A more realistic data set for 60 dimensions would have far, far more points. In that case you might need to think about pre-partitioning space. That would complicate your data structure.
One intermediate-level technique is to exploit the fact that the per-dimension contributions to a distance only add up. When looking for the nearest neighbor of point P, you calculate the full 60-dimensional distance to the first candidate. This establishes an upper bound D on the nearest-neighbor distance. When you calculate the distance to the second candidate, you may find that the partial sum already exceeds D before all 60 dimensions are added, and you can stop early. The tricky bit is that you should not check this for every point after adding every dimension; that would be excessive. You may need to manually unroll loops, and exactly how depends on your data distribution.
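A sketch of that early-exit idea, under the assumption that squared Euclidean distance is used and that the bound is checked only every `stride` dimensions. The function names and the default stride of 8 are illustrative guesses, not tuned values.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Squared distance between a and b, abandoned once the running sum exceeds
// `bound`. The check runs only every `stride` dimensions to keep it cheap.
// If the computation is abandoned, the returned partial sum is > bound.
double boundedSquaredDistance(const std::vector<double>& a,
                              const std::vector<double>& b,
                              double bound,
                              std::size_t stride = 8) {
    assert(a.size() == b.size() && stride > 0);
    double sum = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        const double d = a[k] - b[k];
        sum += d * d;
        if ((k + 1) % stride == 0 && sum > bound) {
            return sum;  // already worse than the best candidate so far
        }
    }
    return sum;
}

// Index of the point closest to `query`, using the best distance found so
// far as the bound for every later candidate.
std::size_t nearestIndex(const std::vector<std::vector<double>>& points,
                         const std::vector<double>& query) {
    assert(!points.empty());
    std::size_t best = 0;
    double bestDist = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < points.size(); ++i) {
        const double d = boundedSquaredDistance(points[i], query, bestDist);
        if (d < bestDist) {
            bestDist = d;
            best = i;
        }
    }
    return best;
}
```

Whether the early exit pays off depends on the data; with only 600 points it may not be measurable.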

how to calculate nth term of mth row of this table?

there is a table which grows as
1,1
1,1,2
1,1,3,3
1,1,4,4,6
1,1,5,5,10,10
1,1,6,6,15,15,20
.....and so on
If I want to find a specific element of the table, for example the 4th element of the 6th row, then the answer is 6. But how do I find the nth element of the mth row for any n>=1, m>=1?
These numbers look like binomial coefficients, so this "table" could be Pascal's triangle row-wise re-ordered by size.
Though, this is just one of the infinitely many "tables" that'd start like this. If you don't name a specific production rule or another way to deduce arbitrary values of the "table", there's no way telling for sure which of those infinitely many "tables" you have here.
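If the table really is Pascal's triangle with each row sorted by size (which is only a guess that fits the rows shown), then the nth element of the mth row would be the binomial coefficient C(m, floor((n-1)/2)). A small sketch under that assumption, with made-up function names:

```cpp
#include <cassert>
#include <cstdint>

// Binomial coefficient C(m, k) via the multiplicative formula. Each
// intermediate value is itself a binomial coefficient, so the division is
// exact. Overflows for large m; fine for small tables.
std::uint64_t binomial(unsigned m, unsigned k) {
    assert(k <= m);
    if (k > m - k) {
        k = m - k;  // use symmetry to shorten the loop
    }
    std::uint64_t result = 1;
    for (unsigned i = 1; i <= k; ++i) {
        result = result * (m - k + i) / i;
    }
    return result;
}

// n-th element (1-based) of row m (1-based), ASSUMING the table is
// Pascal's triangle with each row sorted in ascending order.
std::uint64_t tableEntry(unsigned m, unsigned n) {
    assert(m >= 1 && n >= 1 && n <= m + 1);  // row m has m + 1 entries
    return binomial(m, (n - 1) / 2);
}
```

For the example in the question, tableEntry(6, 4) evaluates C(6, 1), which matches the expected answer of 6.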
I assume you want to hold the values in a kind of table without wasting memory by for example giving each line more slots than necessary.
To do that I'd suggest a vector of vectors (assuming your values are integers):
std::vector< std::vector<int> > table;
Provided you are sure that a value at (m, n) exists you can get it with:
int value = table[m][n];
(Note that m and n count from 0.)
If you're not sure use the safer
int value = table.at(m).at(n);
which will throw an exception if (m, n) doesn't exist.
To add a row you could call
table.resize(table.size() + 1);
and to add a column to a row
table[m].resize(table[m].size() + 1);
I'd recommend putting the table into the protected or private section of a dedicated class and adding functions to access the elements as needed.

C++, determine the part that have the highest zero crosses

I'm not a specialist in signal processing. I'm doing simple processing on a 1D signal using C++. I really want to know how I can determine the part that has the highest zero-crossing rate (highest frequency!). Is there a simple way or method to tell the beginning and the end of this part?
This image illustrates the form of my signal, and this image is what I need to do (two indexes, beginning and end)
Edited:
Actually I have no prior idea about the width between the beginning and the end, it's so variable.
I could calculate the number of zero crossings, but I have no idea how to define its range
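// Counts sign changes between consecutive samples.
int calculateZC(const vector<double>& signals){
    int ZC_counter=0;
    int size=signals.size();
    for (int i=0; i<size-1; i++){
        // a crossing occurs when neighbouring samples lie on opposite sides of zero
        if((signals[i]>=0 && signals[i+1]<0) || (signals[i]<0 && signals[i+1]>=0)){
            ZC_counter++;
        }
    }
    return ZC_counter;
}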
Here is a fairly simple strategy which might give you a starting point. The outline of the algorithm is as follows
Input: Vector of your data points {y0,y1,...}
Parameters:
Window size sigma.
A threshold 0<p<1 defining when to start looking for a region.
Output: The start- and endpoint {t0,t1} of the region with the most zero-crossings
I won't give any C++ code, but the method should be easy to implement. As an example, let us use the following function
What we want is the region between about 480 and 600, where the zero density is higher than in the front part. The first step in the algorithm is to calculate the positions of the zeros. You can do this with what you already have, but instead of counting, you store the values of i where you met a zero.
This will give you a list of zero positions
From this list (you can do this directly in the above for-loop!) you create a list having the same size as your input data which looks like {0,0,0,...,1,0,..,1,0,..}. Every zero-crossing position in your input data is marked with a 1.
The next step is to smooth this list with a smoothing filter of size sigma. Here, you can use what you like; in the simplest case a moving average or a Gaussian filter. The larger you choose sigma, the bigger your look-around window becomes, which measures how many zero-crossings are around a certain point. Let me give the output of this filter together with the original zero positions. Note that I used a Gaussian filter of size 10 here
In the next step, you go through the filtered data and find the maximum value. In this case it is about 0.15. Now you choose your second parameter p, which is a fraction of this maximum. Let's say p=0.6.
The final step is to go through the filtered data, and when the value rises above p times the maximum you start to remember a new region. As soon as the value drops below that level, you end this region and remember its start and endpoint. Once you are finished walking through the data, you are left with a list of regions, each defined by a start and an endpoint. Now you choose the region with the largest extent and you are done.
(Optionally, you could add the filter size to each end of the final region)
For the above example, I get 11 regions as follows
{{164,173},{196,205},{220,230},{241,252},{259,271},{278,290},
{297,309},{318,327},{341,350},{458,468},{476,590}}
where the one with the largest extent is the last one {476,590}. The final result looks like this (with 1/2 filter region padding)
Conclusion
Please don't be discouraged by the length of my answer. I tried to explain everything in detail. The implementation is really just some loops:
one loop to create the zero-crossings list {0,0,..,1,0,...}
one nested loop for the moving average filter (or you use some library Gaussian filter). Here you can at the same time extract the maximum value
one loop to extract all regions
one loop to extract the largest region if you haven't already extracted it in the above step
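The loops listed above could be sketched like this. It is only one possible reading of the method: it uses a centered moving average instead of a Gaussian filter, treats the window parameter as a half-width, and all function names are invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Step 1: mark every position where the sign changes to the next sample.
std::vector<double> markZeroCrossings(const std::vector<double>& y) {
    std::vector<double> marks(y.size(), 0.0);
    for (std::size_t i = 0; i + 1 < y.size(); ++i) {
        if ((y[i] >= 0) != (y[i + 1] >= 0)) marks[i] = 1.0;
    }
    return marks;
}

// Step 2: centered moving average with window 2 * half + 1 (the window is
// truncated at both ends of the data).
std::vector<double> movingAverage(const std::vector<double>& x, std::size_t half) {
    std::vector<double> out(x.size(), 0.0);
    for (std::size_t i = 0; i < x.size(); ++i) {
        const std::size_t lo = i >= half ? i - half : 0;
        const std::size_t hi = std::min(x.size() - 1, i + half);
        double sum = 0.0;
        for (std::size_t j = lo; j <= hi; ++j) sum += x[j];
        out[i] = sum / static_cast<double>(hi - lo + 1);
    }
    return out;
}

// Steps 3 and 4: find the longest run where the smoothed density stays
// above p * max. Returns {start, end} inclusive, or {0, 0} if the signal
// has no zero-crossings at all.
std::pair<std::size_t, std::size_t>
densestRegion(const std::vector<double>& y, std::size_t half, double p) {
    assert(!y.empty() && p > 0.0 && p < 1.0);
    const std::vector<double> density = movingAverage(markZeroCrossings(y), half);
    const double cutoff = p * *std::max_element(density.begin(), density.end());
    std::pair<std::size_t, std::size_t> best(0, 0);
    std::size_t bestLen = 0;
    std::size_t i = 0;
    while (i < density.size()) {
        if (density[i] > cutoff) {
            const std::size_t start = i;
            while (i < density.size() && density[i] > cutoff) ++i;
            if (i - start > bestLen) {
                bestLen = i - start;
                best = std::make_pair(start, i - 1);
            }
        } else {
            ++i;
        }
    }
    return best;
}
```

The optional padding by the filter size is left out here; you could subtract and add `half` to the returned indexes if you want it.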

recursively find subsets

Here is a recursive function that I'm trying to create that finds all the subsets of an STL set passed in. The two params are an STL set to search for subsets, and a number i >= 0 which specifies how big the subsets should be. If the integer is bigger than the set, return an empty subset.
I don't think I'm doing this correctly. Sometimes it's right, sometimes it's not. The STL set gets passed in fine.
list<set<int> > findSub(set<int>& inset, int i)
{
    list<set<int> > the_list;
    list<set<int> >::iterator el = the_list.begin();
    if(inset.size()>i)
    {
        set<int> tmp_set;
        for(int j(0); j<=i;j++)
        {
            set<int>::iterator first = inset.begin();
            tmp_set.insert(*(first));
            the_list.push_back(tmp_set);
            inset.erase(first);
        }
        the_list.splice(el,findSub(inset,i));
    }
    return the_list;
}
From what I understand, you are actually trying to generate all subsets of 'i' elements from a given set, right?
Modifying the input set is going to get you into trouble; you'd be better off not modifying it.
I think the idea is simple enough, though I would say that you got it backwards. Since it looks like homework, I won't give you a C++ algorithm ;)
generate_subsets(set, sizeOfSubsets) # I assume sizeOfSubsets cannot be negative
                                     # use a type that enforces this for god's sake!
    if sizeOfSubsets is 0 then return {}
    else if sizeOfSubsets is 1 then
        result = []
        for each element in set do result <- result + {element}
        return result
    else
        result = []
        baseSubsets = generate_subsets(set, sizeOfSubsets - 1)
        for each subset in baseSubsets
            for each element in set
                if element not in subset then result <- result + { subset + element }
        return result
The key points are:
generate the subsets of lower rank first, as you'll have to iterate over them
don't try to insert an element in a subset if it already is, it would give you a subset of incorrect size
Now, you'll have to understand this and transpose it to 'real' code.
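For readers who want to compare notes afterwards, one possible transposition is sketched below. It deviates from the pseudocode in two labeled ways: size 0 yields the single empty subset rather than nothing, and the results are kept in a std::set so that the same subset reached in different orders is stored only once. The name subsetsOfSize is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

using Subset = std::set<int>;

// All subsets of `input` with exactly k elements. The input is taken by
// const reference and never modified. std::size_t rules out negative sizes.
std::set<Subset> subsetsOfSize(const Subset& input, std::size_t k) {
    std::set<Subset> result;
    if (k > input.size()) {
        return result;  // no subset can be that large
    }
    if (k == 0) {
        result.insert(Subset());  // assumption: the empty subset counts
        return result;
    }
    // Build the subsets of size k - 1 first, then grow each by one element.
    const std::set<Subset> smaller = subsetsOfSize(input, k - 1);
    for (const Subset& base : smaller) {
        for (int element : input) {
            if (base.count(element) == 0) {
                Subset grown = base;
                grown.insert(element);
                assert(grown.size() == k);
                result.insert(grown);  // the set drops duplicates
            }
        }
    }
    return result;
}
```

This is not the most efficient approach, since each subset is generated several times before the duplicates are dropped, but it stays close to the pseudocode.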
I have been staring at this for several minutes and I can't figure out what your train of thought is for thinking that it would work. You are permanently removing several members of the input list before exploring every possible subset that they could participate in.
Try working out the solution you intend in pseudo-code and see if you can see the problem without the stl interfering.
It seems (I'm not a native English speaker) that what you could do is compute the power set (the set of all subsets) and then select from it only the subsets that match the condition.
You can find methods for calculating the power set on the Wikipedia Power set page and on Math Is Fun (the link is in the External links section of that Wikipedia page, named Power Set from Math Is Fun; I cannot post it here directly because of the spam prevention mechanism). On Math Is Fun, see mainly the section It's binary.
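A sketch of that binary idea: every number from 0 to 2^n - 1 is read as a bit mask that says which elements are in the subset, and only the subsets of the wanted size are kept. The function name subsetsByMask is invented, and the sketch assumes fewer than 32 elements stored in a std::vector<int>.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Enumerates the whole power set of `items` through bit masks and keeps
// the subsets that have exactly k elements.
std::vector<std::vector<int>>
subsetsByMask(const std::vector<int>& items, std::size_t k) {
    assert(items.size() < 32);  // assumption: the mask fits in 32 bits
    std::vector<std::vector<int>> result;
    const std::uint32_t total = std::uint32_t(1) << items.size();
    for (std::uint32_t mask = 0; mask < total; ++mask) {
        std::vector<int> subset;
        for (std::size_t bit = 0; bit < items.size(); ++bit) {
            // bit `bit` set means items[bit] belongs to this subset
            if (mask & (std::uint32_t(1) << bit)) {
                subset.push_back(items[bit]);
            }
        }
        if (subset.size() == k) {
            result.push_back(subset);
        }
    }
    return result;
}
```

This visits all 2^n subsets even when only a few have the right size, so it is only practical for small sets.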
I also can't see what this is supposed to achieve.
If this isn't homework with specific restrictions i'd simply suggest testing against a temporary std::set with std::includes().