Split Data knowing its common ID - c++

I want to split this data,
ID x y
1 2.5 3.5
1 85.1 74.1
2 2.6 3.4
2 86.0 69.8
3 25.8 32.9
3 84.4 68.2
4 2.8 3.2
4 24.1 31.8
4 83.2 67.4
I was able to match them with their partners, like:
ID x y ID x y
1 2.5 3.5 1 85.1 74.1
2 2.6 3.4 2 86.0 69.8
3 25.8 32.9
4 24.1 31.8
However, as you can see, some of the new rows for ID 4 were placed wrongly, because they just got added in the next few rows. I want to split them properly without having to use the complex logic I am already using... Can someone give me an algorithm or idea?
It should look like:
ID x y ID x y ID x y
1 2.5 3.5 1 85.1 74.1 3 25.8 32.9
2 2.6 3.4 2 86.0 69.8 4 24.1 31.8
4 2.8 3.2 3 84.4 68.2
4 83.2 67.4

It seems that your question is really about clustering, and that the ID column has nothing to do with determining which points correspond to which.
A common algorithm to achieve that would be k-means clustering. However, your question implies that you don't know the number of clusters in advance. This complicates matters, and a lot of questions have already been asked here on StackOverflow regarding this issue:
Kmeans without knowing the number of clusters?
compute clustersize automatically for kmeans
How do I determine k when using k-means clustering?
How to optimal K in K - Means Algorithm
K-Means Algorithm
Unfortunately, there is no "right" solution for this. Two clusters in one specific problem could indeed be considered one cluster in another problem. This is why you'll have to decide that for yourself.
Nevertheless, if you're looking for something simple (and probably inaccurate), you can use the Euclidean distance as a measure: compute the distances between all pairs of points (e.g. using pdist), and group points whose distance falls below a certain threshold.
Example
%// Sample input
A = [1,  2.5,  3.5;
     1, 85.1, 74.1;
     2,  2.6,  3.4;
     2, 86.0, 69.8;
     3, 25.8, 32.9;
     3, 84.4, 68.2;
     4,  2.8,  3.2;
     4, 24.1, 31.8;
     4, 83.2, 67.4];

%// Cluster points
pairs = nchoosek(1:size(A, 1), 2);                              %// Pairs of row indices
d = sqrt(sum((A(pairs(:, 1), :) - A(pairs(:, 2), :)) .^ 2, 2)); %// d = pdist(A)
thr = d < 10;                                                   %// Distances below threshold
kk = 1;
idx = 1:size(A, 1);
C = cell(size(idx));                                            %// Preallocate memory
while any(idx)
    i0 = find(idx, 1);                                          %// First point not yet assigned
    %// Include the seed point i0 itself, so an isolated point still forms
    %// its own cluster instead of looping forever
    x = unique([i0, reshape(pairs(pairs(:, 1) == i0 & thr, :), 1, [])]);
    C{kk} = A(x, :);
    idx(x) = 0;                                                 %// Remove indices from list
    kk = kk + 1;
end
C = C(~cellfun(@isempty, C));                                   %// Remove empty cells
The result is a cell array C, each cell representing a cluster:
C{1} =
1.0000 2.5000 3.5000
2.0000 2.6000 3.4000
4.0000 2.8000 3.2000
C{2} =
1.0000 85.1000 74.1000
2.0000 86.0000 69.8000
3.0000 84.4000 68.2000
4.0000 83.2000 67.4000
C{3} =
3.0000 25.8000 32.9000
4.0000 24.1000 31.8000
Note that this simple approach has the flaw of restricting the cluster radius to the threshold. However, you wanted a simple solution, so bear in mind that it gets complicated as you add more "clustering logic" to the algorithm.
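If you end up doing this in Python instead, the same thresholded grouping is available off the shelf as single-linkage hierarchical clustering. Below is a minimal sketch using SciPy's fclusterdata (my addition, not part of the original answer); note that it clusters on the x/y columns only, whereas the snippet above includes the ID column in the distances:

import numpy as np
from scipy.cluster.hierarchy import fclusterdata

# Same sample input as above: [ID, x, y]
A = np.array([[1,  2.5,  3.5], [1, 85.1, 74.1],
              [2,  2.6,  3.4], [2, 86.0, 69.8],
              [3, 25.8, 32.9], [3, 84.4, 68.2],
              [4,  2.8,  3.2], [4, 24.1, 31.8], [4, 83.2, 67.4]])

# Single linkage, cutting the dendrogram at distance 10 (the same threshold)
labels = fclusterdata(A[:, 1:], t=10, criterion='distance', method='single')
clusters = [A[labels == k] for k in np.unique(labels)]
for c in clusters:
    print(c)

Since single linkage merges clusters through chains of close points, it also avoids the fixed-radius restriction mentioned above.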

Related

How to extract surface triangles from a tetrahedral mesh?

I want to render a tetrahedral mesh using some 3D software. However, I cannot directly load the tetrahedral mesh in the software of my choice (e.g. Blender) as the file format that I have for tetrahedral meshes is not supported. So I should somehow extract the faces with corresponding vertex indices myself.
For a cube, my tetrahedral file contains the vertex IDs of each tetrahedron (each tetrahedron has 4 faces), as follows:
v 0.41 0.41 0.41
v 0.41 0.41 -0.41
v 0.41 -0.41 0.41
v 0.41 -0.41 -0.41
v -0.41 0.41 0.41
v -0.41 0.41 -0.41
v -0.41 -0.41 0.41
v -0.41 -0.41 -0.41
t 0 1 2 4
t 5 1 4 7
t 1 2 4 7
t 3 1 7 2
t 6 4 2 7
However, I'm not sure how I can extract the surface mesh given this data. Does someone know how I can do this, or what the algorithm is?
Here is a simplistic brute-force method. For each tetrahedron (for example the third one, t: 1 2 4 7), remove each vertex in turn to generate all four combinations of three vertices out of the four tetrahedral vertices, i.e.
face[t][0]: 1 2 4, face[t][1]: 1 2 7, face[t][2]: 1 4 7, face[t][3]: 2 4 7
and sort each triangle's integer labels in ascending order (for uniqueness).
This way, you can generate the list (or some kind of array) of all faces of all tetrahedra in the tetrahedral mesh.
Now run a loop over the list of all triangle faces that you have just generated, looking for duplicates. Whenever a triangle is contained twice in the list of all triangle faces, remove it, because it is an interior triangle: two adjacent tetrahedra share this triangular face, so it is an interior face and not a boundary one.
Whatever is left after this procedure are the boundary (i.e. surface) triangle faces of the tetrahedral mesh.
Here is an example of this algorithm written in Python:
import numpy as np

def list_faces(t):
    # Sort each tetrahedron's vertices so every face comes out in ascending order
    t.sort(axis=1)
    n_t, m_t = t.shape
    f = np.empty((4*n_t, 3), dtype=int)
    i = 0
    for j in range(4):
        # Drop vertex j of every tetrahedron to form one of its four faces
        f[i:i+n_t, 0:j] = t[:, 0:j]
        f[i:i+n_t, j:3] = t[:, j+1:4]
        i = i + n_t
    return f

def extract_unique_triangles(t):
    # Keep only the faces that occur exactly once: these are the boundary faces
    _, indxs, count = np.unique(t, axis=0, return_index=True, return_counts=True)
    return t[indxs[count == 1]]

def extract_surface(t):
    f = list_faces(t)
    f = extract_unique_triangles(f)
    return f
V = np.array([
    [ 0.41,  0.41,  0.41],
    [ 0.41,  0.41, -0.41],
    [ 0.41, -0.41,  0.41],
    [ 0.41, -0.41, -0.41],
    [-0.41,  0.41,  0.41],
    [-0.41,  0.41, -0.41],
    [-0.41, -0.41,  0.41],
    [-0.41, -0.41, -0.41]])

T = np.array([
    [0, 1, 2, 4],
    [5, 1, 4, 7],
    [1, 2, 4, 7],
    [3, 1, 7, 2],
    [6, 4, 2, 7]])

F_all = list_faces(T)
print(F_all)
print(F_all.shape)

F_surf = extract_surface(T)
print(F_surf)
print(F_surf.shape)
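For the cube above, F_all has shape (20, 3) and F_surf has shape (12, 3): each of the four faces of the central tetrahedron [1, 2, 4, 7] is shared with one of the other four tetrahedra, so 8 of the 20 faces are interior duplicates and the 12 surface triangles of the cube remain.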
A very efficient method uses hashsets (set in Python, std::unordered_set in C++, HashSet in Rust).
The principle of retrieving the envelope of a volume of tetrahedra is the same as retrieving the outline of a surface of triangles (as you can find here).
This gives the following code, easy to translate into any language with hashsets (Python here, for simplicity):
def extract_envelope(tetraedrons):
    def canonical(face):
        # rotate so the smallest vertex index comes first, keeping the winding,
        # so the same face always hashes to the same tuple
        i = face.index(min(face))
        return face[i:] + face[:i]

    envelope = set()
    for tet in tetraedrons:
        # the four faces of the tetrahedron, all with a uniform winding
        for face in ((tet[0], tet[1], tet[2]),
                     (tet[0], tet[2], tet[3]),
                     (tet[0], tet[3], tet[1]),
                     (tet[1], tet[3], tet[2])):
            face = canonical(face)
            # if the face has already been encountered, it's not on the envelope
            # the magic of hashsets makes that check O(1) (i.e. extremely fast)
            if face in envelope:
                envelope.remove(face)
            # if not encountered yet, add it flipped
            else:
                envelope.add(canonical((face[2], face[1], face[0])))
    # only faces encountered once remain (or an odd number of times
    # for paradoxical meshes)
    return envelope
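As a quick check (the wrapper name extract_envelope above is my addition, not part of the original answer), the cube mesh from the question yields its 12 surface triangles:

tets = [(0, 1, 2, 4), (5, 1, 4, 7), (1, 2, 4, 7), (3, 1, 7, 2), (6, 4, 2, 7)]
print(len(extract_envelope(tets)))  # 12 oriented boundary triangles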

How can I write this algorithm that returns the count between x and y in a list?

I am given this algorithmic problem and need to find a way to return the count of entries in a list S and another list L that fall between some variable x and some variable y, inclusive, with a query that runs in O(1) time:
I've issued a challenge against Jack. He will submit a list of his favorite years (from 0 to 2020). If Jack really likes a year,
he may list it multiple times. Since Jack comes up with this list on the fly, it is in no
particular order. Specifically, the list is not sorted, nor do years that appear in the list
multiple times appear next to each other in the list.
I will also submit such a list of years.
I then will ask Jack to pick a random year between 0 and 2020. Suppose Jack picks the year x.
At the same time, I will also then pick a random year between 0 and 2020. Suppose I
pick the year y. Without loss of generality, suppose that x ≤ y.
Once x and y are picked, Jack and I get a very short amount of time (perhaps 5
seconds) to decide if we want to re-do the process of selecting x and y.
If no one asks for a re-do, then we count the number of entries in Jack's list that are
between x and y inclusively and the number of entries in my list that are between x and
y inclusively.
More technically, here is the situation. You are given lists S and L of m and n integers,
respectively, in the range [0, k], representing the collections of years selected by Jack and
I. You may preprocess S and L in O(m+n+k) time. You must then give an algorithm
that runs in O(1) time – so that I can decide if I need to ask for a re-do – that solves the
following problem:
Input: Two integers, x as a member of [0,k] and y as a member of [0,k]
Output: the number of entries in S in the range [x, y], and the number of entries in L in [x, y].
For example, suppose S = {3, 1, 9, 2, 2, 3, 4}. Given x = 2 and y = 3, the returned count
would be 4.
I would prefer pseudocode; it helps me understand the problem a bit easier.
Here is an implementation of user3386109's approach, taking care of the edge case x = 0.
user3386109: Make a histogram, and then compute the accumulated sum for each entry in the histogram. Suppose S = {3,1,9,2,2,3,4} and k is 9. The histogram is H = {0,1,2,2,1,0,0,0,0,1}. After accumulating, H = {0,1,3,5,6,6,6,6,6,7}. Given x = 2 and y = 3, the count is H[y] - H[x-1] = H[3] - H[1] = 5 - 1 = 4. Of course, x = 0 is a corner case that has to be handled.
# INPUT
S = [3, 1, 9, 2, 2, 3, 4]
L = [2, 9, 4, 6, 8, 5, 3]
k = 9
x = 2
y = 3

# Histogram for S
S_hist = [0] * (k + 1)
for element in S:
    S_hist[element] = S_hist[element] + 1

# Storing prefix sums in S_hist
total = S_hist[0]
for index in range(1, k + 1):
    total = total + S_hist[index]
    S_hist[index] = total

# Similar approach for L
# Histogram for L
L_hist = [0] * (k + 1)
for element in L:
    L_hist[element] = L_hist[element] + 1

# Storing prefix sums in L_hist
total = L_hist[0]
for index in range(1, k + 1):
    total = total + L_hist[index]
    L_hist[index] = total

# Finding the number of elements between x and y (inclusive) in S
print("number of elements between x and y (inclusive) in S:")
if x == 0:
    print(S_hist[y])
else:
    print(S_hist[y] - S_hist[x - 1])

# Finding the number of elements between x and y (inclusive) in L
print("number of elements between x and y (inclusive) in L:")
if x == 0:
    print(L_hist[y])
else:
    print(L_hist[y] - L_hist[x - 1])
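As a quick sanity check (my addition, not part of the original answer), the O(1) prefix-sum queries can be compared against a direct O(m) count:

# Direct count for comparison: O(m) per query, versus O(1) with the prefix sums
print(sum(1 for v in S if x <= v <= y))  # 4, matches S_hist[y] - S_hist[x-1]
print(sum(1 for v in L if x <= v <= y))  # 2, matches L_hist[y] - L_hist[x-1]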

pulp shadow price difference with gurobi

I am comparing the values of the shadow prices (pi) calculated with Gurobi and with PuLP. I get different values for the same input, and I am not sure how to do this correctly with PuLP. Here is the LP file that I use:
Minimize
x[0] + x[1] + x[2] + x[3]
Subject To
C[0]: 7 x[0] >= 211
C[1]: 3 x[1] >= 395
C[2]: 2 x[2] >= 610
C[3]: 2 x[3] >= 97
Bounds
End
For the above lp file, gurobi gives me shadow prices:
[0.14285714285714285, 0.3333333333333333, 0.5, 0.5]
and with pulp I get:
[0.14285714, 0.33333333, 0.5, 0.5]
But If I execute the following lp model:
Minimize
x[0] + x[1] + x[2] + x[3] + x[4]
Subject To
C[0]: 7 x[0] + 2 x[4] >= 211
C[1]: 3 x[1] >= 395
C[2]: 2 x[2] + 2 x[4] >= 610
C[3]: 2 x[3] >= 97
Bounds
End
With gurobi I get:
[0.0, 0.3333333333333333, 0.5, 0.5]
and with pulp I get:
[0.14285714, 0.33333333, 0.5, 0.5]
The correct value is the one that Gurobi returns (I think?).
Why do I get the same shadow prices with PuLP for different models? How can I get the same results as Gurobi?
(I did not supply the source code because the question would be too long; I think the LP models are enough.)
In the second example, there are two dual solutions that are optimal: the one PuLP gives you, and the one you get by calling Gurobi directly. The unique optimal primal solution is [0.0, 131.67, 199.5, 48.5, 105.5], which makes the slacks of all the constraints 0 in the optimal primal solution. For C[0], if you reduce the right-hand side, you get no reduction in the objective, but if you increase it, the cheapest way to keep the constraint feasible is by increasing x[0]. Gurobi only guarantees that it will produce an optimal primal and dual solution; the specific optimal solution you get is arbitrary.
The first example is just a precision issue.
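For reference, here is a minimal sketch of how the duals can be read from PuLP after solving; it rebuilds the second LP file, and the variable and constraint names are my own:

import pulp

prob = pulp.LpProblem("example", pulp.LpMinimize)
x = [pulp.LpVariable(f"x{i}", lowBound=0) for i in range(5)]
prob += pulp.lpSum(x)
prob += 7 * x[0] + 2 * x[4] >= 211, "C0"
prob += 3 * x[1] >= 395, "C1"
prob += 2 * x[2] + 2 * x[4] >= 610, "C2"
prob += 2 * x[3] >= 97, "C3"
prob.solve()

# Each constraint exposes its dual value as .pi after the solve; with a
# degenerate optimum, different solvers may return different, equally
# optimal dual vectors
for name, con in prob.constraints.items():
    print(name, con.pi)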

c++: next_combination with struct

Before I put up my code, which I built based on Thomas Draper's code (and thanks to Jarod42),
I will explain it with an example:
Take this data in a txt file, where each integer is associated with a probability:
1 0.933 2 0.865 3 0.919 4 0.726
3 0.906 2 0.854 4 0.726
4 0.865 3 0.933 5 0.919
Let the user input threshold = 1.5.
I want to apply next_combination to my data in a loop from k = 1 until there are no more combinations.
When k = 1, the result will be:
First step: generate all sets of size 1, where the frequency of an item is the sum of its probabilities across the rows.
{1}= 0.933
{2}= 0.865 + 0.854= 1.719
{3}= 0.919 + 0.906 + 0.933 = 2.758
{4}= 0.726 + 0.726 + 0.865 = 2.317
{5}= 0.919
Second step: erase all sets of size 1 that have frequency < threshold.
==> We erase sets {1}, {5}, and save the deleted sets in another set (the erased set).
Repeat the steps for k = 2.
First step: generate all sets of size 2.
We check whether each generated set is a superset of a set in the erased set.
We know that {1}, {5} were already erased,
so there is no need to generate any superset that includes {1} or {5}.
The remaining generated sets will be:
{2,3} = (0.865 * 0.919) + (0.906 * 0.854) = 1.568659
{2,4}= (0.865 * 0.726) + (0.854 * 0.726)= 1.247994
{3,4}= (0.919 * 0.726) + (0.906 * 0.726) + (0.865 *0.933)= 2.131995
Second step: erase all sets of size 2 that have frequency < threshold.
==> We erase set {2,4} and save it, so the erased set is now {1}, {5}, {2,4}.
Repeat the steps for k = 3.
From the previous step we only have {2,3} and {3,4}.
The newly generated set will be
{2,3,4} = (0.865 * 0.919 * 0.726) + (0.906 * 0.854 * 0.726) = 1.138846434 < threshold
I have done this before with a vector of vectors of integers; it gives the correct answer but poor runtime. (code)
Here is where I need help: I get a lot of errors that I can't deal with, because I don't know how to use next_combination with a struct. (code)
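To make the intended generate-and-prune loop concrete, here is a minimal sketch of it in Python (itertools.combinations plays the role of next_combination; the data layout and names are my own, and the same loop structure maps onto a C++ struct of item/probability pairs). Note that applying the superset rule strictly prunes {2,3,4} without computing its frequency, since {2,4} was erased; the surviving sets match the walkthrough above:

from itertools import combinations

transactions = [            # each row: {item: probability}, as in the txt file
    {1: 0.933, 2: 0.865, 3: 0.919, 4: 0.726},
    {3: 0.906, 2: 0.854, 4: 0.726},
    {4: 0.865, 3: 0.933, 5: 0.919},
]
threshold = 1.5

def frequency(itemset):
    # Sum, over the rows containing the whole itemset, of the product
    # of its members' probabilities
    total = 0.0
    for row in transactions:
        if all(i in row for i in itemset):
            p = 1.0
            for i in itemset:
                p *= row[i]
            total += p
    return total

items = sorted({i for row in transactions for i in row})
erased = []                 # erased sets; no superset of these is generated
k = 1
while True:
    candidates = [frozenset(c) for c in combinations(items, k)
                  if not any(e <= frozenset(c) for e in erased)]
    if not candidates:
        break
    for c in candidates:
        f = frequency(c)
        print(sorted(c), round(f, 6))
        if f < threshold:
            erased.append(c)
    k += 1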

how do i calculate the centroid of the brightest spot in a line of pixels?

I'd like to be able to calculate the 'mean brightest point' in a line of pixels. It's for a primitive 3D scanner.
For testing, I simply stepped through the pixels, and if the current pixel is brighter than the one before, the brightest point of that line is set to the current pixel. This, of course, gives very jittery results throughout the image(s).
I'd like to get the 'average center of the brightness' instead, if that makes sense.
It has to be a common thing; I'm simply lacking the right words for a Google search.
Calculate the intensity-weighted average of the offset.
Given your example's intensities (guessed) and offsets:
0 0 0 0 1 3 2 3 1 0 0 0 0 0
1 2 3 4 5 6 7 8 9 10 11 12 13 14
this would give you (5+3*6+2*7+3*8+9)/(1+3+2+3+1) = 7
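As a minimal sketch (the function name and the 1-based offsets are my choices, matching the example above):

import numpy as np

def weighted_centroid(intensities):
    # Intensity-weighted average position: the sub-pixel centroid of the spot
    w = np.asarray(intensities, dtype=float)
    offsets = np.arange(1, len(w) + 1)  # 1-based offsets, as in the example
    return (w * offsets).sum() / w.sum()

print(weighted_centroid([0, 0, 0, 0, 1, 3, 2, 3, 1, 0, 0, 0, 0, 0]))  # -> 7.0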
You're looking for 1D convolution, where a filter window is slid over the image. For example, you can use a median filter (borrowing an example from Wikipedia):
x = [2 80 6 3]
y[1] = Median[2 2 80] = 2
y[2] = Median[2 80 6] = Median[2 6 80] = 6
y[3] = Median[80 6 3] = Median[3 6 80] = 6
y[4] = Median[6 3 3] = Median[3 3 6] = 3
so
y = [2 6 6 3]
So here, the window size is 3, since you're looking at 3 pixels at a time and replacing the pixel at the center of this window with the median. A window of 3 means we look at the first pixel before and the first pixel after the pixel we're currently evaluating; 5 would mean 2 pixels before and after, etc.
For a mean filter, you do the same thing, except you replace the pixel at the center of the window with the average of all the values in it, i.e.
x = [2 80 6 3]
y[1] = Mean[2 2 80] = 28
y[2] = Mean[2 80 6] = 29.33
y[3] = Mean[80 6 3] = 29.667
y[4] = Mean[6 3 3] = 4
so
y = [28 29.33 29.667 4]
So for your problem, y[3] is the "mean brightest point".
Note how the borders are handled for y[1] (no pixels before it) and y[4] (no pixels after it): this example "replicates" the pixel nearest the border. Therefore, we generally "pad" an image with replicated or constant borders, convolve the image, and then remove those borders.
This is a standard operation which you'll find in many computational packages.
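For instance, with SciPy (assuming SciPy is available; mode='nearest' replicates the border pixels, as in the example above):

import numpy as np
from scipy.ndimage import median_filter, uniform_filter1d

x = np.array([2, 80, 6, 3], dtype=float)
print(median_filter(x, size=3, mode='nearest'))     # [2. 6. 6. 3.]
print(uniform_filter1d(x, size=3, mode='nearest'))  # [28. 29.33 29.67 4.] (approx.)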
Your problem is like the longest-sequence problem: once you are able to determine a sequence (the starting point and the length), all that remains is finding the median, which is the central element.
For finding the sequence, a definition of bright and dark has to be present, either relative (to the previous value, or a couple of previous values) or absolute (a fixed threshold).