Remove duplicates algorithm - c++

I'm trying to write an algorithm to remove duplicates from a vector<struct xxxx*>.
struct xxxx{
int value; // This is just to make you understand
xxxx* one;
xxxx* two;
}
As you see my struct it's like a tree but the pointers are not in order. The pointers can point to any(actually not any but most) of the others. And the vector doesn't contain the structs but pointers, so I couldn't use the std algorithms to help me neither.
I'm trying to delete duplicates with exactly same value and the same two pointers, but in the same time if I have two similar structs (Let's say A and B) and C.one or C.two points to B. Then I need to change it to A and viceversa.
In other words: if A == B then remove B and change C.one to point A.
I think I can write the brute-force, so if there's no better algorithm I'll write it by myself.

Yesterday, I tried to explain the reasonable approach to a very similar problem to a coworker who had used an N squared solution to an N log N problem.
First create a helper struct, that is basically a wrapper around an xxxx* with a comparison operator checking the contents (not the pointer value) and probably with some other utility functions. This wrapper struct isn't strictly needed vs. just using xxxx*, but from experience, I think it makes the task cleaner.
Create a std::set of those helper structs, into which you will only insert unique elements, and likely another set into which you will insert recursively unresolved elements.
Loop through the original vector and at each position recurse through its children. If you hit a child already in the unique set, that is a final value for that child pointer. If you hit a child that matches a unique element without being the one it matches, then fix the pointer that got you there. If there is also the possibility of null pointers that should bottom the recursion, and if loops are possible you need to detect them (with that recursively unresolved set) and some decision about what to do with a loop. At some point you hit resolved unique elements and add that to the unique set.
The performance and maybe even soundness of the idea depends on the depth and complexity of the loops and what you want to do with loops. There are some messy cases where a loop would map onto another loop, but detecting that could be very tricky. If your phase "like a tree" meant "no loops" then the recursion bottoms cleanly and efficiently without the extra complexity of explicitly managing the recursively unresolved elements.
Obviously I left out some of the grunt work detail around detecting unique / non-unique as you back out of the recursion and around detecting "already did it during an earlier recursion" as you hit an item in the main loop above the recursion. But all those details should be pretty obvious as you write the relevant parts of the code.
Edit: To understand how few node visits there are despite nesting a recursion inside a sequential loop, think from the point of view of the pointers. We follow each pointer at most once (some duplicates are pre detected without following their pointers). For N nodes, there are N top level pointers (if I understood your description correctly) and significantly less than 2N internal pointers (the more tree-like it is, the closer it will be to N-1 internal pointers, rather than 2N). So each node is visited on average less than 3 times and a minority of those visits require both the pre lookup and the post recursion lookup, and each lookup is log U where U is the number of unique items found up to that point. So we can trivially see a bound of 6 N log N.

Related

How can I remove too close points in a list

I have a list of points with x,y coordinates:
List_coord=[(462, 435), (491, 953), (617, 285),(657, 378)]
This list lenght (4 element here) can be very large from few hundred up to 35000 elements.
I want to remove too close points by threshold in this list.
note:Points are never at the exact same position.
My current code for that:
while iteration<5:
for pt in List_coord:
for PT in List_coord:
if (abs(pt[0]-PT[0])+abs(pt[1]-PT[1]))!=0 and abs(pt[0]-PT[0])<threshold and abs(pt[1]-PT[1])<threshold:
List_coord.remove(PT)
iteration=iteration+1
Explication of my terrible code :) :
I check if the very distance is 0 then it means that i am comparing
the same point
then i check the distance in x and in y..
Iteration:
I need few iterations to avoid missing one remove because the list change inside the loop itself...
This code is working but it is a very low process!
I am sure there is another method much easier but i wasn't able to find even if some allready answered questions are close to mine..
note:I would like to avoid using extra library for that code if it is possible
Python will be a bit slow at this ;-)
The solution you will probably want is called quad-trees, but I'll mention a simpler approach first, in case it's preferable.
The usual approach is to group the points so that you can easily reject points that are clearly far away from each other.
One approach might be to sort the list twice, once by x once by y. You can prove that if two points are too-close, they must be close in one dimension or the other. Thus your inner loop can break out early. If it sees a point that is too far away from the outer point in the sorted direction, it can know for a fact that all future points in that list are also too far away. Thus it doesn't have to look any further. Do this in X and Y and you're set!
This approach is going to tend to be dominated by the O(n log n) sort times. However, if all of your points share a single x value, you'll end up doing the same slow O(n^2) iteration that you're doing right now because you never terminate the inner loop early.
The more robust solution is to use quadtrees. Quadtrees are designed to solve the kind of problem you are looking at. The idea is to build a tree such that you can rapidly exclude large numbers of points. I'd recommend this.
If your number of points gets too large, I'd recommend getting a clustering library. Efficient clustering is a very difficult task, and often done in C++ or another fast language.

Which is the better way to delete an array member?

I'm learning OOP, so I have to interact with arrays, not linked list. I have sorted data. The problem is to delete a member of the array (let's call it DL). The 1st method I came up with was overwrite data at i+1 to istarting at DL's index and decrease the amount of reading by 1. Later I found out that I can swap the DLwith the last member then decrease the counting variable by 1. However, I'll have to sort the data again. So which one is better?
If it needs to stay sorted, I'd say it's better to overwrite it by shifting every element after your target back one. Swapping it with the end element and then resorting would require more work, as a swap requires three actions:
1) Copying element one to a temp variable.
2) Copying element two to element one.
3) Copying the temp element to element two.
And this needs to be repeated multiple times in a sorting algorithm. And if you're working with an array of objects of a struct or class with multiple private data member each, the workload increases even more.
The overwrite takes fewer moves per iteration:
1) Copy i + 1 to i.
So, Id definitely go with overwriting, by moving all elements back one and decreasing count by one.
At any rate, it's probably just best to time both, with your specific data set, and see which one is faster. This is really simple to do by counting the milliseconds between start and finish of your implementation.
"Better" is a very subjective term and which one is more suitable (for whatever definition you choose) depends a great deal on the sort of data sets you're talking about (size, etc).
But I will mention this, the relative time complexities of array shuffle and most "regular" sorts are respectively O(n) and O(n log n).
That means the shuffle is likely to be faster in the vast majority of cases.

Is a partially ordered sort possible without using a tree?

Background:
So I have a few hundred elements (image is a simple case) that I need to constantly re-sort since their sorting value changes whenever the elements X or Y value changes. A normal (absolute) sort is not possible since many elements have an undefined relation to each other (like the purple and orange block) that would just break a merge/quick/bubble-sort. However, changing a single element can potentially change many ordering relationships if that element had an edge to many others (like if the green block were removed)
I understand the idea behind building a tree and doing a topological sort but this seems horribly inefficient to do all the time because of a single element changing.
If the above is still unclear, check out http://shaunlebron.com/IsometricBlocks since that is fairly similar to what I am trying to do.
My question:
I can't help but think that a tree is not necessary (at least for my case) but that a linked list would be fine, since my case guarantees that there will never be a cycle. Wouldn't it be sufficient (with ascending order) to just always place an element after the last element that it is greater than, but before the first element that it is less than? Wouldn't this effectively allow sorting of a partially ordered set?
Is there some case that prevents a person from just skipping the tree step and going straight to a list? Every simulation I've done on paper seems to suggest that this would work.
Simply put, yes, a partially ordered sort is possible without using a tree.
If you have your (original) unsorted list "U" and (empty, destination) sorted list "S", you will need to loop over list "U" and move entries which have no dependents remaining to the end list "S".
An entry has no dependents remaining if no other entry in list "U" is dependent upon it. If you are using {-1, 0, 1} to measure equality, then you will need to pick either "-1" or "1" to imply dependency. Whereas "0" would imply that there is no valid ordering for the two elements.

Accessing adjacent elements of a map in c++

Suppose I have a float-integer map m:
m[1.23] = 3
m[1.25] = 34
m[2.65] = 54
m[3.12] = 51
Imagine that I know that there's a mapping between 2.65 and 54, but I don't know about any other mappings.
Is there any way to visit the adjacent mappings without iterating from the beginning or searching using the find function?
In other words: can I directly access the adjacent values by just knowing about a single mapping...such as m[2.65]=54?
UPDATE Perhaps a more important "point" than my answer, brought up by #MattMcNabb:
Floating point keys in std:map
Can I directly access the adjacent values by just knowing about a single mapping (m[2.65]=54)
Yes. std::map is an ordered collection; which is to say that if an operator< exists (more generally, std::less) for the key type you can expect it to have sorted access. In fact--you won't be able to make a map for a key type if it doesn't have this comparison operator available (unless you pass in a predicate function to perform this comparison in the template invocation)
Note there is also a std::unordered_map which is often preferable for cases where you don't need this property of being able to navigate quickly between "adjacent" map entries. However you will need to have std::hash defined in that case. You can still iterate it, but adjacency of items in the iteration won't have anything to do with the sort order of the keys.
UPDATE also due to #MattMcNabb
Is there any way to visit the adjacent mappings without iterating from the beginning or searching using the find function?
You allude to array notation, and the general answer here would be "not really". Which is to say there is no way of saying:
if (not m[2.65][-2]) {
std::cout << "no element 2 steps prior to m[2.65]";
} else {
std::cout << "the element 2 before m[2.65] is " << *m[2.65][-2];
}
While no such notational means exist, the beauty (and perhaps the horror) of C++ is that you could write an augmentation of map that did that. Though people would come after you with torches and pitchforks. Or maybe they'd give you cult status and put your book on the best seller list. It's a fine line--but before you even try, count the letters and sequential consonants in your last name and make sure it's a large number.
What you need to access the ordering is an iterator. And find will get you one; and all the flexibility that it affords.
If you only use the array notation to read or write from a std::map, it's essentially a less-capable convenience layer built above iterators. So unless you build your own class derived from map, you're going to be stuck with the limits of that layer. The notation provides no way to get information about adjacent values...nor does it let you test for whether a key is in the map or not. (With find you can do this by comparing the result of a lookup to end(m) if m is your map.)
Technically speaking, find gives you the same effect as you could get by walking through the iterators front-to-back or back-to-front and comparing, as they are sorted. But that would be slower if you're seeking arbitrary elements. All the containers have a kind of algorithmic complexity guarantee that you can read up on.
When dereferencing an iterator, you will receive a pair whose first element is the key and second element is the value. The value will be mutable, but the key is constant. So you cannot find an element, then navigate to an adjacent element, and alter its key directly...just its value.

Implementing Skip List in C++

[SOLVED]
So I decided to try and create a sorted doubly linked skip list...
I'm pretty sure I have a good grasp of how it works. When you insert x the program searches the base list for the appropriate place to put x (since it is sorted), (conceptually) flips a coin, and if the "coin" lands on a then that element is added to the list above it(or a new list is created with element in it), linked to the element below it, and the coin is flipped again, etc. If the "coin" lands on b at anytime then the insertion is over. You must also have a -infinite stored in every list as the starting point so that it isn't possible to insert a value that is less than the starting point (meaning that it could never be found.)
To search for x, you start at the "top-left" (highest list lowest value) and "move right" to the next element. If the value is less than x than you continue to the next element, etc. until you have "gone too far" and the value is greater than x. In this case you go back to the last element and move down a level, continuing this chain until you either find x or x is never found.
To delete x you simply search x and delete it every time it comes up in the lists.
For now, I'm simply going to make a skip list that stores numbers. I don't think there is anything in the STL that can assist me, so I will need to create a class List that holds an integer value and has member functions, search, delete, and insert.
The problem I'm having is dealing with links. I'm pretty sure I could create a class to handle the "horizontal" links with a pointer to the previous element and the element in front, but I'm not sure how to deal with the "vertical" links (point to corresponding element in other list?)
If any of my logic is flawed please tell me, but my main questions are:
How to deal with vertical links and whether my link idea is correct
Now that I read my class List idea I'm thinking that a List should hold a vector of integers rather than a single integer. In fact I'm pretty positive, but would just like some validation.
I'm assuming the coin flip would simply call int function where rand()%2 returns a value of 0 or 1 and if it's 0 then a the value "levels up" and if it's 0 then the insert is over. Is this incorrect?
How to store a value similar to -infinite?
Edit: I've started writing some code and am considering how to handle the List constructor....I'm guessing that on its construction, the "-infinite" value should be stored in the vectorname[0] element and I can just call insert on it after its creation to put the x in the appropriate place.
http://msdn.microsoft.com/en-us/library/ms379573(VS.80).aspx#datastructures20_4_topic4
http://igoro.com/archive/skip-lists-are-fascinating/
The above skip lists are implemented in C#, but can work out a c++ implementation using that code.
Just store 2 pointers. One called above, and one called below in your node class.
Not sure what you mean.
According to wikipedia you can also do a geometric distribution. I'm not sure if the type of distribution matters for totally random access, but it obviously matters if you know your access pattern.
I am unsure of what you mean by this. You can represent something like that with floating point numbers.
You're making "vertical" and "horizontal" too complicated. They are all just pointers. The little boxes you draw on paper with lines on them are just to help visualize something when thinking about them. You could call a pointer "elephant" and it would go to the next node if you wanted it to.
eg. a "next" and "prev" pointer are the exact same as a "above"/"below" pointer.
Anyway, good luck with your homework. I got the same homework once in my data structures class.