STL Container For Best Performance? - c++

I have a project where I need to read a text file and record how many occurrences there are of each string, character, or number that is read until the EoF.
I then need to print the top 10 most used words.
For example, the file would contain "This is a test for this project". I would read this and store each word in a container as well as its current count.
Now, we are graded on how efficient our time complexity is as input grows. So, I need some help choosing which STL container would be the most efficient.
It seems order is not important, I can always insert at the end, and I will never have to make deletions. I will, however, have to search through the container for the top 10 most used words. Which STL container has the best time complexity for requirements like this?
Also, if you could explain your reasoning so I know more going forward, that would be great!

Let's assume that you decide to use a std::unordered_map<std::string, int> to get a frequency count of the items. That is a good start, but the other part of the question that needs to be addressed is to get the top 10 items.
Whenever a question asks "get the top N" or "get the smallest N", or similar, there are various methods of getting this information.
One way is to sort the data and get the first N items. Using std::sort or a good sorting routine, that operation should take O(N*log(N)) in time complexity.
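A variation on the full-sort approach is std::partial_sort, which only orders the first N positions. Here is a small sketch; the helper name and the vector-of-pairs input are my own choices, not from the question:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Copy the (word, count) entries, then order only the first n slots by
// descending count. partial_sort does O(M*log(n)) work for M entries,
// instead of the O(M*log(M)) of a full sort.
std::vector<std::pair<std::string, int>>
top_n_by_sort(const std::vector<std::pair<std::string, int>>& entries,
              std::size_t n)
{
    std::vector<std::pair<std::string, int>> v(entries);
    n = std::min(n, v.size());
    std::partial_sort(v.begin(), v.begin() + n, v.end(),
                      [](const auto& a, const auto& b)
                      { return a.second > b.second; });
    v.resize(n); // keep only the top n
    return v;
}
```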
The other method is to use a min-heap or max-heap of N items, depending on whether you want to get the top N or bottom N, respectively.
Assume you have working code using the unordered_map to get the frequency count. Here is a routine that uses the STL heap functions to get the top N items. It has not been fully tested, but it should demonstrate how to handle the heap.
#include <vector>
#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_map>

void print_top_n(const std::unordered_map<std::string, int>& theMap, size_t n)
{
    // This is the heap
    std::vector<std::pair<std::string, int>> vHeap;

    // This lambda is the predicate used to build and maintain the heap
    auto heapfn = [](const std::pair<std::string, int>& p1,
                     const std::pair<std::string, int>& p2) -> bool
                  { return p1.second > p2.second; };

    // Go through each entry in the map
    for (auto& m : theMap)
    {
        if (vHeap.size() < n)
        {
            // Add the item, since we haven't reached n items yet
            vHeap.push_back(m);
            // If we have now reached n items, it is time to build the heap
            if (vHeap.size() == n)
                // make the min-heap of the n elements
                std::make_heap(vHeap.begin(), vHeap.end(), heapfn);
            continue;
        }
        // Heap has been built. Check if the next element is larger than the
        // top of the heap
        if (vHeap.front().second <= m.second)
        {
            // Remove the front of the heap by placing it at the end of the vector
            std::pop_heap(vHeap.begin(), vHeap.end(), heapfn);
            // Get rid of that item now
            vHeap.pop_back();
            // Add the new item
            vHeap.push_back(m);
            // Heapify
            std::push_heap(vHeap.begin(), vHeap.end(), heapfn);
        }
    }

    // Sort the heap
    std::sort_heap(vHeap.begin(), vHeap.end(), heapfn);

    // Output the results
    for (auto& v : vHeap)
        std::cout << v.first << " " << v.second << "\n";
}

int main()
{
    std::unordered_map<std::string, int> test = {
        { "abc", 10 },   { "123", 5 },    { "456", 1 },   { "xyz", 15 },
        { "go", 8 },     { "text1", 7 },  { "text2", 17 }, { "text3", 27 },
        { "text4", 37 }, { "text5", 47 }, { "text6", 9 },  { "text7", 7 },
        { "text8", 22 }, { "text9", 8 },  { "text10", 2 } };
    print_top_n(test, 10);
}
int main()
{
std::unordered_map<std::string, int> test = { {"abc", 10},
{ "123",5 },
{ "456",1 },
{ "xyz",15 },
{ "go",8 },
{ "text1",7 },
{ "text2",17 },
{ "text3",27 },
{ "text4",37 },
{ "text5",47 },
{ "text6",9 },
{ "text7",7 },
{ "text8", 22 },
{ "text9", 8 },
{ "text10", 2 } };
print_top_n(test, 10);
}
Output:
text5 47
text4 37
text3 27
text8 22
text2 17
xyz 15
abc 10
text6 9
text9 8
go 8
The advantage of using a heap is that:
Each heap adjustment costs O(log(N)), rather than the O(N*log(N)) that a full sorting routine will give you.
Note that we only need to heapify if we detect that the top item on the min-heap is going to be discarded.
We don't need to store an entire (multi)map of frequency counts to strings in addition to the original map of strings to frequency counts.
The heap will only store N elements, regardless of how many items are in the original map.
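For comparison, the same size-N min-heap bookkeeping can be written with std::priority_queue, which wraps the push_heap/pop_heap calls internally. A rough sketch; the function name and the flipped (count, word) pair layout are my own choices, not from the answer above:

```cpp
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Keep a size-n min-heap of (count, word): the smallest of the current
// top n sits on top and is evicted whenever something bigger arrives.
std::vector<std::pair<std::string, int>>
top_n_heap(const std::vector<std::pair<std::string, int>>& counts,
           std::size_t n)
{
    using Entry = std::pair<int, std::string>; // (count, word)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    for (const auto& c : counts) {
        pq.emplace(c.second, c.first);
        if (pq.size() > n) pq.pop();           // evict the current minimum
    }
    std::vector<std::pair<std::string, int>> out;
    while (!pq.empty()) {                      // drain in ascending count order
        out.emplace_back(pq.top().second, pq.top().first);
        pq.pop();
    }
    return out;                                // smallest of the top n first
}
```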

I have used two containers for such tasks: std::unordered_map<std::string, int> to store word frequencies, and std::map<int, std::string> to track the most frequent words.
While updating the first map with a new word, you also update the second one. To keep it neat, erase the least-frequent word whenever the size of that second map goes over 10.
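A minimal sketch of that dual-container update, using std::map<int, std::set<std::string>> so words with equal counts don't clobber each other (the helper name is mine, and the trimming-to-10 step is left out):

```cpp
#include <map>
#include <set>
#include <string>
#include <unordered_map>

// freq maps count -> the set of words with that count. On every new
// occurrence, move the word from its old count bucket to the new one.
void add_word(std::unordered_map<std::string, int>& words,
              std::map<int, std::set<std::string>>& freq,
              const std::string& w)
{
    int old_count = words[w]++;        // 0 if the word is new
    if (old_count > 0) {
        auto it = freq.find(old_count);
        it->second.erase(w);           // leave the old bucket
        if (it->second.empty())
            freq.erase(it);            // drop empty buckets
    }
    freq[old_count + 1].insert(w);     // enter the new bucket
}
```

The most frequent words are then read off freq.rbegin(), as in the benchmark below.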
UPDATE
In response to the comments below, I did some benchmarking.
First, @PaulMcKenzie - you are correct: to keep ties, I need std::map<int, std::set<std::string>> (that became obvious as soon as I started to implement this).
Second, @dratenik - it turns out you are correct, too. While constantly cleaning up the frequency map keeps it small, the overhead doesn't pay for the benefits. Also, this would only be needed if the client wants to see a "running total" (as I was asked for in my project). It makes no sense at all in post-processing, when all the words are already loaded.
For the test, I used alice29.txt (available online), pre-processed - I removed all punctuation marks and converted to upper case. Here is my code:
#include <chrono>
#include <fstream>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <unordered_map>

int main()
{
    auto t1 = std::chrono::high_resolution_clock::now();
    std::ifstream src("c:\\temp\\alice29-clean.txt");
    std::string str;
    std::unordered_map<std::string, int> words;
    std::map<int, std::set<std::string>> freq;
    int i(0);
    while (src >> str)
    {
        words[str]++;
        i++;
    }
    for (auto& w : words)
    {
        freq[w.second].insert(w.first);
    }
    int count(0);
    for (auto it = freq.rbegin(); it != freq.rend(); ++it)
    {
        for (auto& w : it->second)
        {
            std::cout << w << " - " << it->first << std::endl;
            ++count;
        }
        if (count >= 10)
            break;
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() << std::endl;
    return i;
}

Related

Is there an efficient, time saving method of maintaining a heap in which elements are removed in the middle?

I'm working on a path planning program where I have a priority queue 'U':
using HeapKey = pair<float, float>;
vector<pair<HeapKey, unsigned int>> U;
I order and maintain my priority queue as a binary min-heap (i.e., the cheapest node first in the queue), using greater as my comparison function to get a min-heap (maybe not important). While the program is executing and planning a path, it adds nodes to 'U' with push_back() followed by push_heap() to get that node into the correct order, and everything is working fine there...
However, the algorithm I'm using sometimes calls for updating a node already present in 'U' with new values. It does this by removing it from 'U' (I find it with find_if() and remove it with erase(), if that's important) and then calls the function to re-insert (again push_back() followed by push_heap()) so the node has its updated values.
This has proved a bit of an unexpected problem for me. I'm no expert at this, but as far as I can tell, since the node is removed some place INSIDE 'U', removing it messes up the order of the heap. I've been able to get the program to work by calling make_heap() after the node is removed. However, this solution brought another issue, since the program now takes a lot more time to complete, and longer the larger my map/the more nodes in the heap, presumably because make_heap() is re-organizing/iterating through the entire heap every time I update a node, thus slowing down the overall planning.
Sadly I don't have time to change my program and get new results; I can only make use of simple, easy solutions I can implement fast. I'm mostly here to learn, and perhaps to collect suggestions about how to efficiently maintain a heap/priority queue when you aren't just removing the first or last elements but elements somewhere in the middle. Reducing the time taken to plan is the only thing I am missing.
Attempt at minimal reproducible example without going into the actual algorithm and such:
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

using Cost = float;
using HeapKey = pair<Cost, Cost>;

pair<Cost, Cost> PAIR1;
vector<pair<HeapKey, unsigned int>> U;
using KeyCompare = std::greater<std::pair<HeapKey, unsigned int>>;
int in_U[20];

ostream& operator<<(ostream& os, pair<Cost, Cost> const& p) {
    return os << "<" << p.first << ", " << p.second << ">";
}

bool has_neightbor(unsigned int id) {
    if ((in_U[id + 1]) && (in_U[id - 1])) {
        return true;
    }
    return false;
}

void insert(unsigned int id, HeapKey k) {
    U.push_back({ k, id });
    push_heap(U.begin(), U.end(), KeyCompare());
    in_U[id]++;
}

void update(unsigned int id) {
    Cost x;
    Cost y;
    if (id != 21) { // let's say 21 is the goal
        x = U[id].first.first;
        y = U[id].first.second;
    }
    if (in_U[id]) {
        auto it = find_if(U.begin(), U.end(), [=](auto p) { return p.second == id; });
        U.erase(it);
        make_heap(U.begin(), U.end(), KeyCompare());
        in_U[id]--;
    }
    int r1 = rand() % 10 + 1;
    int r2 = rand() % 10 + 1;
    if (x != y) {
        insert(id, { x + r1, y + r2 });
    }
}

int main() {
    U.push_back({ {8, 2}, 1 });    in_U[1]++;
    U.push_back({ {5, 1}, 2 });    in_U[2]++;
    U.push_back({ {6, 1}, 3 });    in_U[3]++;
    U.push_back({ {6, 5}, 4 });    in_U[4]++;
    U.push_back({ {2, 3}, 5 });    in_U[5]++;
    U.push_back({ {2, 9}, 6 });    in_U[6]++;
    U.push_back({ {9, 2}, 7 });    in_U[7]++;
    U.push_back({ {4, 7}, 8 });    in_U[8]++;
    U.push_back({ {11, 4}, 9 });   in_U[9]++;
    U.push_back({ {2, 2}, 10 });   in_U[10]++;
    U.push_back({ {1, 2}, 11 });   in_U[11]++;
    U.push_back({ {7, 2}, 12 });   in_U[12]++;
    make_heap(U.begin(), U.end(), KeyCompare());

    PAIR1.first = 14;
    PAIR1.second = 6;

    while (U.front().first < PAIR1) {
        cout << "Is_heap?: " << is_heap(U.begin(), U.end(), KeyCompare()) << endl;
        cout << "U: ";
        for (auto p : U) {
            cout << p.second << p.first << " - ";
        }
        cout << endl;
        auto uid = U.front().second;
        pop_heap(U.begin(), U.end(), KeyCompare());
        U.pop_back();
        if (has_neightbor(uid)) {
            update(uid - 1);
            update(uid + 1);
        }
    }
    //getchar();
}
Yes, the algorithm is relatively simple. Note that when considering an item at index i, its "parent" in a heap is at index (i-1)/2, and its children are at indices i*2+1 and i*2+2.
Swap item_to_pop for the last item in the range. This moves that item to the desired (last) position, but inserts a "small" item in the middle of the heap. This needs to be fixed.
If the "small" item at item_to_pop position is larger than it's current parent, then swap with it's parent. Repeat until that item is either no longer larger than it's current parent or is the new root. Then we're done. Notably, this is the same algorithm as push_heap, except with the shortcut that we start in the middle instead of at the end.
If the "small" item at item_to_pop position is smaller than either current child, then swap with the larger child. Repeat until that item is larger than any of its current children (note that near the end it might only have one or no children). Then we're done. Notably, this is the same algorithm as pop_heap, except with the shortcut that we start in the middle instead of at the top.
This algorithm will do at most log2(n)+1 swaps, and log2(n)*2+1 comparisons, making it almost as fast as pop_heap and push_heap. Which isn't really surprising since it's the same algorithm.
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

template< class RandomIt, class Compare >
constexpr void pop_mid_heap(RandomIt first, RandomIt last, RandomIt item_to_pop, Compare comp) {
    assert(std::is_heap(first, last)); // this is compiled out of release builds
    assert(first <= item_to_pop);
    assert(item_to_pop < last);
    using std::swap;
    std::size_t new_size = last - first - 1;
    if (new_size == 0) return;
    // swap the end of the range and item_to_pop, so that item_to_pop is at the end
    swap(*item_to_pop, *--last);
    if (new_size == 1) return;
    // If item_to_pop is bigger than its parent, then swap with the parent
    bool moved_up = false;
    RandomIt swap_itr;
    while (true) {
        std::size_t offset = item_to_pop - first;
        if (offset == 0) break; // item_to_pop is at the root: exit loop
        swap_itr = first + (offset - 1) / 2;
        if (comp(*item_to_pop, *swap_itr))
            break; // item_to_pop smaller than its parent: exit loop
        swap(*item_to_pop, *swap_itr); // swap with parent and repeat
        item_to_pop = swap_itr;
        moved_up = true;
    }
    if (moved_up) return; // if we moved the item up, then the heap is complete: exit
    // If the biggest child is bigger than item_to_pop, then swap with that child
    while (true) {
        std::size_t offset = item_to_pop - first;
        std::size_t swap_idx = offset * 2 + 1;
        if (swap_idx >= new_size) break; // no children: exit loop
        swap_itr = first + swap_idx;
        if (swap_idx + 1 < new_size && comp(*swap_itr, *(swap_itr + 1))) // if the right child exists and is bigger, use that instead
            ++swap_itr;
        if (!comp(*item_to_pop, *swap_itr)) break; // item_to_pop bigger than its biggest child: exit loop
        swap(*item_to_pop, *swap_itr); // swap with the bigger child and repeat
        item_to_pop = swap_itr;
    }
}

template< class RandomIt >
constexpr void pop_mid_heap(RandomIt first, RandomIt last, RandomIt item_to_pop) {
    pop_mid_heap(first, last, item_to_pop, std::less<>{});
}
https://ideone.com/zNW7h7
Theoretically one can optimize out the "or is the new root" check in the push_heap part, but the checks to detect that case adds complexity that doesn't seem worth it.
IMO, this is useful and should be part of the C++ standard library.
In general it's expensive to update a node in the middle of a binary heap not because the update operation is expensive but because finding the node is an O(n) operation. If you know where the node is in the heap, updating its priority is very easy. My answer at https://stackoverflow.com/a/8706363/56778 shows how to delete a node. Updating a node's priority is similar: rather than replacing the node with the last one in the heap, you just sift the node up or down as required.
If you want the ability to find a node quickly, then you have to build an indexed heap. Basically, you have a dictionary entry for each node. The dictionary key is the node's ID (or whatever you use to identify it), and the value is the node's index in the binary heap. You modify the heap code so that it updates the dictionary entry whenever the node is moved around in the heap. It makes the heap a little bit slower (by a constant factor), but makes finding an arbitrary node an O(1) operation.
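A minimal sketch of such an indexed binary min-heap; all names and the (priority, id) layout are my own, and this is an illustration of the idea rather than a drop-in for the asker's code:

```cpp
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Indexed min-heap sketch: `pos` records where each id currently lives
// in `heap`, so updating a node's priority is O(log n) with O(1) lookup.
struct IndexedHeap {
    std::vector<std::pair<float, int>> heap;  // (priority, id)
    std::unordered_map<int, std::size_t> pos; // id -> index in heap

    void swap_nodes(std::size_t a, std::size_t b) {
        std::swap(heap[a], heap[b]);
        pos[heap[a].second] = a;              // keep the index map in sync
        pos[heap[b].second] = b;
    }
    void sift_up(std::size_t i) {
        while (i > 0 && heap[i].first < heap[(i - 1) / 2].first) {
            swap_nodes(i, (i - 1) / 2);
            i = (i - 1) / 2;
        }
    }
    void sift_down(std::size_t i) {
        for (;;) {
            std::size_t best = i, l = 2 * i + 1, r = 2 * i + 2;
            if (l < heap.size() && heap[l].first < heap[best].first) best = l;
            if (r < heap.size() && heap[r].first < heap[best].first) best = r;
            if (best == i) return;
            swap_nodes(i, best);
            i = best;
        }
    }
    void push(int id, float prio) {
        heap.emplace_back(prio, id);
        pos[id] = heap.size() - 1;
        sift_up(heap.size() - 1);
    }
    void update(int id, float prio) {          // no linear search needed
        std::size_t i = pos.at(id);
        float old = heap[i].first;
        heap[i].first = prio;
        if (prio < old) sift_up(i); else sift_down(i);
    }
    int top_id() const { return heap.front().second; }
};
```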
Or, you can replace the binary heap with a pairing heap, a skip list, or any of the other "heap" types that work with node pointers. My experience has been that although the theoretical performance of those two isn't as good as that of a Fibonacci heap, their real-world performance is much better.
With either of those, it's a whole lot easier to maintain an index: you just add a node reference to it when you add a node to the heap, and remove the reference when the node is removed from the heap. Both of those heap types are easy to build, and performance will be about the same as for a binary heap, although they use somewhat more memory. From experience I'll say that a pairing heap is easier to build than a skip list, but a skip list is a more generally useful data structure.

How can I write a function that checks whether a word is repeated two or more times in a vector, and outputs how many times it repeats? - c++

First time on Stack here, I hope to learn from you guys!
My code has the user read a passage from a text file and add the words into a vector. That vector is passed into a word-count function that prints how many times each word repeats.
For example:
count per word: age = 2, belief = 1, best = 1, it = 10
However, I'm trying to come up with a function that loops over the same vector and prints out the words that are repeated two or more times. In this case, the word "it" is repeated more than two times.
map<string, int> get_word_count(const vector<string>& words) {
    map<string, int> word_count{};
    for (string word : words) {
        auto search = word_count.find(word);
        if (search == word_count.end()) {
            word_count[word] = 1; // not found - add word with count of 1
        }
        else {
            word_count[word] += 1; // found - increment count for word
        }
    }
    return word_count;
}
This is my snippet of code that counts how many times each word is repeated in the vector. However, I'm struggling to figure out how to add a condition that checks whether a word is repeated two or more times. I tried adding a condition like word_count > 2, then printing the words repeated twice, but it did not work. I hope to hear hints from you guys, thanks.
No need to check, as a std::map automatically checks whether the entry exists. If not, it creates a new one; if yes, the value is handled correctly.
Simply loop over the std::map which holds the words vs. counts and use a condition as needed. See the full example.
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main()
{
    std::vector< std::string > words{ "belief", "best", "it", "it", "it" };
    std::map< std::string, int > counts;
    for ( const auto& s : words )
    {
        counts[s]++;
    }
    for ( const auto& p : counts )
    {
        if ( p.second > 2 ) { std::cout << p.first << " repeats " << p.second << std::endl; }
    }
}
Hint:
If you write auto x : y, you get a copy of each element of y, which is typically not what you want. If you write const auto& x : y, you get a const reference to the element of your container, which in your case is much faster/more efficient, as there is no need to create a copy. The compiler may be able to "see" that the copy is not needed and optimize it out, but the reference makes it clearer to the reader of the source code what is intended!
First of all, I really suggest you explore the documentation about C++ before coding.
Your code can actually be rewritten this way:
map<string, int> get_word_count(const vector<string>& words) {
    map<string, int> word_count{};
    for (const string& word : words) {
        word_count[word] += 1;
    }
    return word_count;
}
This works because map::operator[] (like unordered_map::operator[]) doesn't work like map::at does (which throws a std::out_of_range exception if the key is not in the map). The difference is that operator[] returns a reference to the value at the given key, and if the key is not already in the map, it is inserted and default-initialized (in your case, an int is value-initialized to 0).
Operator[] on cppreference
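A tiny illustration of that operator[] vs. at() difference (the demo function is hypothetical, not from the answer):

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// operator[] default-inserts a missing key (the int is value-initialized
// to 0); at() throws std::out_of_range instead of inserting anything.
void demo_subscript_vs_at()
{
    std::map<std::string, int> counts;
    counts["new"]++;                   // inserts "new" with 0, then increments
    assert(counts["new"] == 1);

    bool threw = false;
    try {
        counts.at("missing");          // does not insert
    } catch (const std::out_of_range&) {
        threw = true;
    }
    assert(threw);
    assert(counts.count("missing") == 0); // still not in the map
}
```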
In order to add the "twice or more than twice" part you can modify your code by adding the condition in the for loop.
map<string, int> get_word_count(const vector<string>& words) {
    map<string, int> word_count{};
    for (const string& word : words) {
        auto& map_ref = word_count[word];
        map_ref += 1;
        if (map_ref == 2) {
            // Here goes your code
        }
    }
    return word_count;
}
If you're interested in how many times a word is repeated, you will have to scan the map again with a loop.

How to chain delete pairs from a vector in C++?

I have this text file where I am reading each line into a std::vector<std::pair>,
handgun bullets
bullets ore
bombs ore
turret bullets
The first item depends on the second item. And I am writing a delete function where, when the user inputs an item name, it deletes the pair containing the item as second item. Since there is a dependency relationship, the item depending on the deleted item should also be deleted since it is no longer usable. For example, if I delete ore, bullets and bombs can no longer be usable because ore is unavailable. Consequently, handgun and turret should also be removed since those pairs are dependent on bullets which is dependent on ore i.e. indirect dependency on ore. This chain should continue until all dependent pairs are deleted.
I tried to do this for the current example and came with the following pseudo code,
for vector_iterator_1 = vector.begin to vector.end
{
    if user_input == vector_iterator_1->second
    {
        for vector_iterator_2 = vector.begin to vector.end
        {
            if vector_iterator_1->first == vector_iterator_2->second
            {
                delete pair_of_vector_iterator_2
            }
        }
        delete pair_of_vector_iterator_1
    }
}
Not a very good algorithm, but it explains what I intend to do. In the example, if I delete ore, then bullets and bombs get deleted too. Subsequently, pairs depending on ore and bullets will also be deleted (bombs have no dependents). Since there is only one single-length chain (ore-->bullets), there is only one nested for loop to check for it. However, there may be zero or a large number of dependencies in a single chain, resulting in many or no nested for loops. So this is not a very practical solution. How would I do this with a chain of dependencies of variable length? Please tell me. Thank you for your patience.
P. S. : If you didn't understand my question, please let me know.
One (naive) solution:
Create a queue of items-to-delete
Add in your first item (user-entered)
While(!empty(items-to-delete)) loop through your vector
Every time you find your current item as the second-item in your list, add the first-item to your queue and then delete that pair
Easy optimizations:
Ensure you never add an item to the queue twice (hash table/etc)
personally, I would just use the standard library for removal:
vector.erase(remove_if(vector.begin(), vector.end(), [](const pair<string, string>& p){ return p.second == "ore"; }), vector.end());
remove_if() gives you an iterator to the first of the elements matching the criteria, so you could have a function that takes in a .second value to erase, and erases matching pairs while saving the .first values of those being erased. From there, you could loop until nothing is removed.
For your solution, it might be simpler to use find_if inside a loop, but either way, the standard library has some useful things you could use here.
I couldn't help myself and wrote a solution using standard algorithms and data structures from the C++ standard library. I'm using a std::set to remember which objects we delete (I prefer it since it has logarithmic access and does not contain duplicates). The algorithm is basically the same as the one proposed by @Beth Crane.
#include <iostream>
#include <vector>
#include <utility>
#include <algorithm>
#include <string>
#include <set>

int main()
{
    std::vector<std::pair<std::string, std::string>> v
        { {"handgun", "bullets"},
          {"bullets", "ore"},
          {"bombs", "ore"},
          {"turret", "bullets"} };

    std::cout << "Initially: " << std::endl << std::endl;
    for (auto&& elem : v)
        std::cout << elem.first << " " << elem.second << std::endl;

    // let's remove "bullets"; this set is our "queue"
    std::set<std::string> to_remove{"bullets"}; // unique elements
    while (!to_remove.empty()) // loop as long as we still have elements to remove
    {
        // "pop" an element, then remove it via the erase-remove idiom
        // and a bit of lambdas
        std::string obj = *to_remove.begin();
        v.erase(
            std::remove_if(v.begin(), v.end(),
                [&to_remove](const std::pair<const std::string,
                             const std::string>& elem) -> bool
                {
                    // is it on the first position?
                    if (to_remove.find(elem.first) != to_remove.end())
                    {
                        return true;
                    }
                    // is it in the queue?
                    if (to_remove.find(elem.second) != to_remove.end())
                    {
                        // add the first element to the queue
                        to_remove.insert(elem.first);
                        return true;
                    }
                    return false;
                }
            ),
            v.end()
        );
        to_remove.erase(obj); // delete it from the queue once we're done with it
    }

    std::cout << std::endl << "Finally: " << std::endl << std::endl;
    for (auto&& elem : v)
        std::cout << elem.first << " " << elem.second << std::endl;
}
@vsoftco I looked at Beth's answer and went off to try the solution. I did not see your code until I came back. On closer examination of your code, I see that we have done pretty much the same thing. Here's what I did,
std::string Node;
std::cout << "Enter Node to delete: ";
std::cin >> Node;

std::queue<std::string> Deleted_Nodes;
Deleted_Nodes.push(Node);

while (!Deleted_Nodes.empty())
{
    std::vector<std::pair<std::string, std::string>>::iterator
        Current_Iterator = Pair_Vector.begin(), Temporary_Iterator;
    while (Current_Iterator != Pair_Vector.end())
    {
        Temporary_Iterator = Current_Iterator;
        Temporary_Iterator++;
        if (Deleted_Nodes.front() == Current_Iterator->second)
        {
            Deleted_Nodes.push(Current_Iterator->first);
            Pair_Vector.erase(Current_Iterator);
        }
        else if (Deleted_Nodes.front() == Current_Iterator->first)
        {
            Pair_Vector.erase(Current_Iterator);
        }
        Current_Iterator = Temporary_Iterator;
    }
    Deleted_Nodes.pop();
}
To answer the question in the comment on my question: that's what the else if statement is for. It's supposed to be a directed graph, so it removes only next-level elements in the chain. Higher-level elements are not touched.
1 --> 2 --> 3 --> 4 --> 5
Remove 5: 1 --> 2 --> 3 --> 4
Remove 3: 1 --> 2 4 5
Remove 1: 2 3 4 5
Although my code is similar to yours, I am no expert in C++ (yet). Tell me if I made any mistakes or overlooked anything. Thanks. :-)

Rearrange list the same way as another one

I bumped into a page where there were a lot of categories and next to each one the number of items in each category, wrapped in parenthesis. Something really common. It looked like this:
Category 1 (2496)
Category 2 (34534)
Category 3 (1039)
Category 4 (9)
...
So I was curious and I wanted to see which categories had more items and such, and since all categories were all together in the page I could just select them all and copy them in a text file, making things really easy.
I made a little program that reads all the numbers, stores them in a list and sorts them. In order to know which category a number belonged to, I would just Ctrl+F the number in the browser.
But I thought it would be nice to have the name of the category next to the number in my text file, and I managed to parse them in another file. However, they are not ordered, obviously.
This is what I could do so far:
bool is_number(const string& s) {
    return !s.empty() && find_if(s.begin(), s.end(), [](char c) { return !isdigit(c); }) == s.end();
}

int main() {
    ifstream file;
    ofstream file_os, file_t_os;
    string word, text; // word is the item count and text the category name
    list<int> words_list; // list of item counts
    list<string> text_list; // list of category names

    file.open("a.txt");
    file_os.open("a_os.txt");
    file_t_os.open("a_t_os.txt");

    while (file >> word) {
        if (word.front() == '(' && word.back() == ')') { // something wrapped in parentheses is being read
            string old_word = word;
            word.erase(word.begin());
            word.erase(word.end() - 1);
            if (is_number(word)) { // check if it's a number (item count)
                words_list.push_back(atoi(word.c_str()));
                text.pop_back(); // get rid of an extra space in the category name
                text_list.push_back(text);
                text.clear();
            } else { // it's part of the category name
                text.append(old_word);
                text.append(" ");
            }
        } else {
            text.append(word);
            text.append(" ");
        }
    }

    words_list.sort();
    for (list<string>::iterator it = text_list.begin(); it != text_list.end(); ++it) {
        file_t_os << *it << endl;
    }
    for (list<int>::iterator it = words_list.begin(); it != words_list.end(); ++it) {
        file_os << fixed << *it << endl;
    }
    cout << text_list.size() << endl << words_list.size() << endl; // I'm getting the same count
}
Now forget about having the name next to the number, because something more interesting occurred to me. I thought it would be interesting to find a way to rearrange the strings in text_list (which contains the names of the categories) in the exact same way the list with the item counts was sorted.
Let me explain with an example, lets say we have the following categories:
A (5)
B (3)
C (10)
D (6)
The way I'm doing it I will have a list<int> containing this: {10, 6, 5, 3} and a list<string> containing this: {A, B, C, D}.
What I'm saying is I want to find a way I can keep track of the way the elements were rearranged in the first list and apply that very pattern to the second list. What would be the rearrange pattern? It would be: the first item (5) goes to the third position, the second one (3) to the fourth one, the third one (10) to the first one, and so on.... Then this pattern should be applied to the other list, so that it would end up like this: {C, D, A, B}.
The thing would be to keep track of the Pattern and apply it to the list below.
Is there any way I can do this? Any particular function that could help me? Any way to track all the swaps and switches the sort algorithm does so it can be applied to a different list with the same size? What about a different sorting algorithm?
I know this might be highly inefficient and a bad idea, but it seemed like a little challenge.
I also know I could just pair both string and int, category and item count, in some sort of container like pair or map or make a container class of my own and sort the items based on the item count (I guess map would be the best choice, what do you think?), but this is not what I am asking.
The best way to do this would be to create a single list that contains both pieces of information you want to sort, and feed in a custom comparison function.
For instance:
struct Record {
    string name;
    int count;
};

list<Record> myList;
myList.sort([](const Record& a, const Record& b) {
    return a.count < b.count;
});
In the general case, it's always better to manage one list of a complex datatype, than to try to separately manage two or more lists of simple datatypes, especially when they're mutable.
A few more improvements. First, some notes:
It's recommended to store the category name and item count together, for clarity, ease of reading the code, etc...
It's better to use std::vector instead of std::list (see Bjarne Stroustrup's opinion).
The code loads the file with the format specified in your question and stores the info pairs in the vector.
Use the std::sort function to sort only by item count (categories with the same count can end up in any order; if you would like to break ties by category name, change the lambda body to return std::tie(left.items, left.name) > std::tie(right.items, right.name);).
Also added a version with the info split: one collection holds the item counts plus an index (to correlate counts with names), and the other holds the names.
Code:
#include <iostream>
#include <fstream>
#include <algorithm>
#include <cstdlib>
#include <string>
#include <vector>

bool is_number(const std::string& s) {
    return !s.empty() &&
           find_if(s.begin(), s.end(), [](char c) { return !isdigit(c); }) ==
               s.end();
}

struct category_info {
    std::string name;
    int items;
};

struct category_items_info {
    int items;
    size_t index;
};

int main() {
    std::ifstream file("H:\\save.txt");
    std::vector<category_info> categories;
    std::vector<category_items_info> categories_items;
    std::vector<std::string> categories_names;
    std::string word;
    std::string text;

    while (file >> word) {
        if (word.front() == '(' && word.back() == ')') {
            std::string inner_word = word.substr(1, word.size() - 2);
            if (is_number(inner_word)) {
                std::string name = text.substr(0, text.size() - 1);
                int items = atoi(inner_word.c_str());
                categories.push_back(category_info{name, items});
                categories_names.push_back(name);
                categories_items.push_back(
                    category_items_info{items, categories_items.size()});
                text.clear();
            } else { // it's part of the category name
                text.append(word);
                text.append(" ");
            }
        } else {
            text.append(word);
            text.append(" ");
        }
    }

    std::sort(categories.begin(), categories.end(),
              [](const category_info& left, const category_info& right) {
                  return left.items > right.items;
              });
    std::sort(
        categories_items.begin(), categories_items.end(),
        [](const category_items_info& left, const category_items_info& right) {
            return left.items > right.items;
        });

    std::cout << "Using the same storage." << std::endl;
    for (auto c : categories) {
        std::cout << c.name << " (" << c.items << ")" << std::endl;
    }
    std::cout << std::endl;
    std::cout << "Using separated storage." << std::endl;
    for (auto c : categories_items) {
        std::cout << categories_names[c.index] << " (" << c.items << ")"
                  << std::endl;
    }
}
Output obtained:
Using the same storage.
Category 2 (34534)
Category 1 (2496)
Category 3 (1039)
Category 4 (9)
Using separated storage.
Category 2 (34534)
Category 1 (2496)
Category 3 (1039)
Category 4 (9)
Lists do not support random access iterators, so this is going to be a problem: a list can't be permuted based on a vector (or array) of indices without doing a lot of list traversal back and forth to emulate random access. NetVipeC's solution was to use vectors instead of lists to get around this problem. With vectors, you could generate a vector (or array) of indices into the vector to be sorted, then sort those indices using a custom compare operator. You could then copy both vectors out according to the sorted indices. It's also possible to reorder a vector in place according to the indices, but that algorithm also sorts the vector of indices, so you're stuck either making a copy of the sorted indices (to sort the second vector) or copying each vector in sorted-index order.
If you really want to use lists, you could implement your own std::list::sort that performs the same operations on both lists. The Microsoft version of std::list::sort uses an array of lists, where the number of nodes in array[i] is 2^i; it merges nodes one at a time into the array, and when all nodes have been processed, it merges the lists in the array to produce a sorted list. You'd need two arrays, one for each list to be sorted. I can post example C code for this type of list sort if you want.

Erasing multiple objects from a std::vector?

Here is my issue: let's say I have a std::vector with ints in it,
for example 50, 90, 40, 90, 80, 60, 80.
I know I need to remove the second, fifth and third elements. I don't necessarily always know the order of the elements to remove, nor how many there are. The issue is that erasing an element changes the indexes of the other elements. So how could I erase these while compensating for the index change? (Sorting and then linearly erasing with an offset is not an option.)
Thanks
I am offering several methods:
1. A fast method that does not retain the original order of the elements:
Assign the current last element of the vector to the element to erase, then erase the last element. This will avoid big moves and all indexes except the last will remain constant. If you start erasing from the back, all precomputed indexes will be correct.
void quickDelete( int idx )
{
    vec[idx] = vec.back();
    vec.pop_back();
}
I see this is essentially a hand-coded version of the erase-remove idiom pointed out by Klaim ...
2. A slower method that retains the original order of the elements:
Step 1: Mark all vector elements to be deleted, i.e. with a special value. This has O(|indexes to delete|).
Step 2: Erase all marked elements using v.erase( remove (v.begin(), v.end(), special_value), v.end() );. This has O(|vector v|).
The total run time is thus O(|vector v|), assuming the index list is shorter than the vector.
3. Another slower method that retains the original order of the elements:
Use a predicate and remove_if as described in https://stackoverflow.com/a/3487742/280314 . To make this efficient while respecting the requirement of
not "sorting then linearly erasing with an offset", my idea is to implement the predicate using a hash table and adjust the indexes stored in the hash table as the deletion proceeds on returning true, as Klaim suggested.
Using a predicate and the algorithm remove_if you can achieve what you want: see http://www.cplusplus.com/reference/algorithm/remove_if/
Don't forget to erase the removed items afterwards (see the remove-erase idiom).
Your predicate will simply hold the indexes of the values to remove and decrement all the indexes it keeps each time it returns true.
That said if you can afford just removing each object using the remove-erase idiom, just make your life simple by doing it.
Erase the items backwards. In other words erase the highest index first, then next highest etc. You won't invalidate any previous iterators or indexes so you can just use the obvious approach of multiple erase calls.
I would move the elements which you don't want to erase to a temporary vector and then replace the original vector with this.
While this answer by Peter G. in variant one (the swap-and-pop technique) is the fastest when you do not need to preserve the order, here is the unmentioned alternative which maintains the order.
With C++17 and C++20 the removal of multiple elements from a vector is possible with standard algorithms. The run time is O(N * Log(N)) due to std::stable_partition. There are no external helper arrays, no excessive copying, everything is done inplace. Code is a "one-liner":
template <class T>
inline void erase_selected(std::vector<T>& v, const std::vector<int>& selection)
{
    v.resize(std::distance(
        v.begin(),
        std::stable_partition(v.begin(), v.end(),
            [&selection, &v](const T& item) {
                return !std::binary_search(
                    selection.begin(),
                    selection.end(),
                    static_cast<int>(static_cast<const T*>(&item) - &v[0]));
            })));
}
The code above assumes that selection vector is sorted (if it is not the case, std::sort over it does the job, obviously).
To break this down, let us declare a number of temporaries:
// We need an explicit item index of an element
// to see if it should be in the output or not
int itemIndex = 0;

// The checker lambda returns `true` if the element is NOT in `selection`,
// i.e. if it should be kept in the output
auto filter = [&itemIndex, &selection](const T& item) {
    return !std::binary_search(selection.begin(),
                               selection.end(),
                               itemIndex++);
};
This checker lambda is then fed to the std::stable_partition algorithm, which is guaranteed to call the lambda only once for each element of the original (unpermuted!) array v.
auto end_of_selected = std::stable_partition(v.begin(), v.end(), filter);
The end_of_selected iterator points right past the last element that should remain in the output array, so we can now resize v down. To calculate the number of remaining elements we use std::distance on the two iterators.
v.resize(std::distance(v.begin(), end_of_selected));
This is different from the code at the top (it uses itemIndex to keep track of the array element). To get rid of the itemIndex, we capture the reference to source array v and use pointer arithmetic to calculate itemIndex internally.
Over the years (on this and other similar sites) multiple solutions have been proposed, but usually they employ multiple "raw loops" with conditions and some erase/insert/push_back calls. The idea behind stable_partition is explained beautifully in this talk by Sean Parent.
This link provides a similar solution (and it does not assume that selection is sorted, using std::find_if instead of std::binary_search), but it also employs a helper (incremented) variable, which precludes parallelizing the processing of larger arrays.
Starting from C++17, there is a new first argument to std::stable_partition (the ExecutionPolicy) which allows auto-parallelization of the algorithm, further reducing the run-time for big arrays. To make yourself believe this parallelization actually works, there is another talk by Hartmut Kaiser explaining the internals.
Would this work:
void DeleteAll(vector<int>& data, const vector<int>& deleteIndices)
{
    vector<bool> markedElements(data.size(), false);
    vector<int> tempBuffer;
    tempBuffer.reserve(data.size() - deleteIndices.size());

    for (vector<int>::const_iterator itDel = deleteIndices.begin();
         itDel != deleteIndices.end(); ++itDel)
        markedElements[*itDel] = true;

    for (size_t i = 0; i < data.size(); i++)
    {
        if (!markedElements[i])
            tempBuffer.push_back(data[i]);
    }
    data = tempBuffer;
}
It's an O(n) operation, no matter how many elements you delete. You could gain some efficiency by reordering the vector inline (but I think this way it's more readable).
This is non-trivial because, as you delete elements from the vector, the indexes of the remaining elements change.
[0] hi
[1] you
[2] foo
>> delete [1]
[0] hi
[1] foo
If you keep a counter of how many elements you have deleted, and your list of indexes to delete is sorted in ascending order, then:
int counter = 0;
for (int k : IndexesToDelete) {
    events.erase(events.begin() + k + counter);
    counter -= 1;
}
You can use this method, if the order of the remaining elements doesn't matter
#include <iostream>
#include <vector>
using namespace std;

int main()
{
    vector<int> vec;
    vec.push_back(1);
    vec.push_back(-6);
    vec.push_back(3);
    vec.push_back(4);
    vec.push_back(7);
    vec.push_back(9);
    vec.push_back(14);
    vec.push_back(25);

    cout << "The elements before" << endl;
    for (size_t i = 0; i < vec.size(); i++) cout << vec[i] << endl;

    vector<bool> toDeleted;
    int YesOrNo = 0;
    for (size_t i = 0; i < vec.size(); i++)
    {
        cout << "You need to delete this element? " << vec[i]
             << ", if yes enter 1 else enter 0" << endl;
        cin >> YesOrNo;
        if (YesOrNo)
            toDeleted.push_back(true);
        else
            toDeleted.push_back(false);
    }

    // Deleting, beginning from the last element to the first one
    for (int i = toDeleted.size() - 1; i >= 0; i--)
    {
        if (toDeleted[i])
        {
            vec[i] = vec.back();
            vec.pop_back();
        }
    }

    cout << "The elements after" << endl;
    for (size_t i = 0; i < vec.size(); i++) cout << vec[i] << endl;
    return 0;
}
Here's an elegant solution in case you want to preserve the indices: the idea is to replace the values you want to delete with a special value that is guaranteed not to be used anywhere, and then, at the very end, to perform the erase itself:
std::vector<int> vec = {1, 2, 3, 4, 5, 6, 7, 8, 9};
// marking 3 elements to be deleted
vec[2] = std::numeric_limits<int>::lowest();
vec[5] = std::numeric_limits<int>::lowest();
vec[3] = std::numeric_limits<int>::lowest();
// erase
vec.erase(std::remove(vec.begin(), vec.end(), std::numeric_limits<int>::lowest()), vec.end());
// print values => 1 2 5 7 8 9
for (const auto& value : vec) std::cout << ' ' << value;
std::cout << std::endl;
It's very quick if you delete a lot of elements because the deletion itself is happening only once. Items can also be deleted in any order that way.
If you use a struct instead of an int, you can still mark an element of that struct, e.g. dead = true, and then use remove_if instead of remove:
struct MyObj
{
int x;
bool dead = false;
};
std::vector<MyObj> objs = {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}};
objs[2].dead = true;
objs[5].dead = true;
objs[3].dead = true;
objs.erase(std::remove_if(objs.begin(), objs.end(), [](const MyObj& obj) { return obj.dead; }), objs.end());
// print values => 1 2 5 7 8 9
for (const auto& obj : objs) std::cout << ' ' << obj.x;
std::cout << std::endl;
This one is a bit slower, running at around 80% of the speed of the remove version.