I came across a page listing a lot of categories, each followed by the number of items in that category, wrapped in parentheses. Something really common. It looked like this:
Category 1 (2496)
Category 2 (34534)
Category 3 (1039)
Category 4 (9)
...
So I was curious and wanted to see which categories had the most items and so on. Since all the categories were together on the page, I could just select them all and copy them into a text file, which made things really easy.
I made a little program that reads all the numbers, stores them in a list and sorts them. To find out which category a number belonged to, I would just Ctrl + F the number in the browser.
But I thought it would be nice to have the name of the category next to the number in my text file, and I managed to parse the names into another file. However, they are not ordered, obviously.
This is what I could do so far:
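#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <list>
#include <string>
using namespace std;
bool is_number(const string& s) {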
return !s.empty() && find_if(s.begin(), s.end(), [](char c) { return !isdigit(c); }) == s.end();
}
int main() {
ifstream file;
ofstream file_os, file_t_os;
string word, text; // word is the item count and text the category name
list<int> words_list; // list of item counts
list<string> text_list; // list of category names
file.open("a.txt");
file_os.open("a_os.txt");
file_t_os.open("a_t_os.txt");
while (file >> word) {
if (word.front() == '(' && word.back() == ')') { // check whether the word read is wrapped in parentheses
string old_word = word;
word.erase(word.begin());
word.erase(word.end()-1);
if (is_number(word)) { // check if it's a number (item count)
words_list.push_back(atoi(word.c_str()));
text.pop_back(); // get rid of an extra space in the category name
text_list.push_back(text);
text.clear();
} else { // it's part of the category name
text.append(old_word);
text.append(" ");
}
} else {
text.append(word);
text.append(" ");
}
}
words_list.sort();
for (list<string>::iterator it = text_list.begin(); it != text_list.end(); ++it) {
file_t_os << *it << endl;
}
for (list<int>::iterator it = words_list.begin(); it != words_list.end(); ++it) {
file_os << fixed << *it << endl;
}
cout << text_list.size() << endl << words_list.size() << endl; // I'm getting the same count
}
Now forget about having the name next to the number, because something more interesting occurred to me. I thought it would be interesting to find a way to rearrange the strings in text_list, which contains the category names, in exactly the same way the list of item counts was sorted.
Let me explain with an example. Let's say we have the following categories:
A (5)
B (3)
C (10)
D (6)
The way I'm doing it, I will have a list<int> containing {10, 6, 5, 3} (sorted in descending order for this example) and a list<string> containing {A, B, C, D}.
What I'm saying is that I want to keep track of how the elements were rearranged in the first list and apply that very pattern to the second list. What would the rearrangement pattern be? The first item (5) goes to the third position, the second one (3) to the fourth, the third one (10) to the first, and so on. Then this pattern should be applied to the other list, so that it ends up like this: {C, D, A, B}.
The thing would be to keep track of the Pattern and apply it to the list below.
Is there any way I can do this? Any particular function that could help me? Any way to track all the swaps and switches the sort algorithm does so it can be applied to a different list with the same size? What about a different sorting algorithm?
I know this might be highly inefficient and a bad idea, but it seemed like a little challenge.
I also know I could just pair both string and int, category and item count, in some sort of container like pair or map or make a container class of my own and sort the items based on the item count (I guess map would be the best choice, what do you think?), but this is not what I am asking.
The best way to do this would be to create a list that contains both sets of information you want to sort and feed in a custom sorting function.
For instance:
struct Record {
string name;
int count;
};
list<Record> myList;
// std::list has no random-access iterators, so use its member sort()
myList.sort([](const Record& a, const Record& b) {
return a.count < b.count;
});
In the general case, it's always better to manage one list of a complex datatype, than to try to separately manage two or more lists of simple datatypes, especially when they're mutable.
A somewhat improved approach:
First, some notes:
It's recommended to store the category name and item count together, for clarity, readability, etc.
It's better to use std::vector instead of std::list (see Bjarne Stroustrup's opinion).
The code loads the file in the format specified in your question and stores each name/count pair in the vector.
It uses std::sort to sort by item count only (categories with the same count may end up in any order; to also order those by category name, change the lambda body to return std::tie(left.items, left.name) > std::tie(right.items, right.name);, which needs <tuple>).
I also added a version with the info split: one collection holds the item counts plus an index (to correlate counts with names), and the other holds the names.
Code:
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
bool is_number(const std::string& s) {
return !s.empty() &&
std::find_if(s.begin(), s.end(), [](char c) { return !std::isdigit(static_cast<unsigned char>(c)); }) ==
s.end();
}
struct category_info {
std::string name;
int items;
};
struct category_items_info {
int items;
size_t index;
};
int main() {
std::ifstream file("H:\\save.txt");
std::vector<category_info> categories;
std::vector<category_items_info> categories_items;
std::vector<std::string> categories_names;
std::string word;
std::string text;
while (file >> word) {
if (word.front() == '(' && word.back() == ')') {
std::string inner_word = word.substr(1, word.size() - 2);
if (is_number(inner_word)) {
std::string name = text.substr(0, text.size() - 1);
int items = atoi(inner_word.c_str());
categories.push_back(category_info{name, items});
categories_names.push_back(name);
categories_items.push_back(
category_items_info{items, categories_items.size()});
text.clear();
} else { // it's part of the category name
text.append(word);
text.append(" ");
}
} else {
text.append(word);
text.append(" ");
}
}
std::sort(categories.begin(), categories.end(),
[](const category_info& left, const category_info& right) {
return left.items > right.items;
});
std::sort(
categories_items.begin(), categories_items.end(),
[](const category_items_info& left, const category_items_info& right) {
return left.items > right.items;
});
std::cout << "Using the same storage." << std::endl;
for (auto c : categories) {
std::cout << c.name << " (" << c.items << ")" << std::endl;
}
std::cout << std::endl;
std::cout << "Using separated storage." << std::endl;
for (auto c : categories_items) {
std::cout << categories_names[c.index] << " (" << c.items << ")"
<< std::endl;
}
}
Output obtained:
Using the same storage.
Category 2 (34534)
Category 1 (2496)
Category 3 (1039)
Category 4 (9)
Using separated storage.
Category 2 (34534)
Category 1 (2496)
Category 3 (1039)
Category 4 (9)
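The tie-breaking variant mentioned in the notes could be sketched roughly like this. The function name sort_by_items_then_name is made up for illustration; the struct mirrors category_info from the code above. Note that because both fields go through a single >, categories with equal counts come out in descending name order.

```cpp
#include <algorithm>
#include <string>
#include <tuple>
#include <vector>

struct category_info {
    std::string name;
    int items;
};

// Hypothetical helper: sorts by item count, largest first. Categories with
// equal counts are ordered by name, also descending, because std::tie
// compares the (items, name) tuples lexicographically with one '>'.
void sort_by_items_then_name(std::vector<category_info>& categories) {
    std::sort(categories.begin(), categories.end(),
              [](const category_info& left, const category_info& right) {
                  return std::tie(left.items, left.name) >
                         std::tie(right.items, right.name);
              });
}
```

If ascending names are preferred for ties, the lambda would have to compare the two fields separately instead of relying on one tuple comparison.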
Lists do not support random access iterators, so this is going to be a problem, since a list can't be permuted based on a vector (or array) of indices without doing a lot of list traversal back and forth to emulate random access iteration. NetVipeC's solution was to use vectors instead of lists to get around this problem. If using vectors, then you could generate a vector (or array) of indices to the vector to be sorted, then sort the vector indices using a custom compare operator. You could then copy the vectors according to the vector of sorted indices. It's also possible to reorder a vector in place according to the indices, but that algorithm also sorts the vector of indices, so you're stuck making a copy of the sorted indices (to sort the second vector), or copying each vector in sorted index order.
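As a rough illustration of the index-vector idea, here is a minimal sketch that assumes both sequences are stored in vectors rather than lists. The names descending_order and apply_order are invented for this example. It takes the "copy each vector in sorted index order" route rather than reordering in place.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Returns the indices 0..n-1 arranged so that counts[order[0]] is the
// largest count, counts[order[1]] the next largest, and so on.
std::vector<std::size_t> descending_order(const std::vector<int>& counts) {
    std::vector<std::size_t> order(counts.size());
    std::iota(order.begin(), order.end(), std::size_t{0});  // 0, 1, 2, ...
    // stable_sort keeps equal counts in their original relative order
    std::stable_sort(order.begin(), order.end(),
                     [&counts](std::size_t a, std::size_t b) {
                         return counts[a] > counts[b];
                     });
    return order;
}

// Builds a reordered copy of values: result[i] = values[order[i]].
template <typename T>
std::vector<T> apply_order(const std::vector<T>& values,
                           const std::vector<std::size_t>& order) {
    std::vector<T> result;
    result.reserve(values.size());
    for (std::size_t index : order) {
        result.push_back(values[index]);
    }
    return result;
}
```

With the counts {5, 3, 10, 6} from the question, descending_order gives {2, 3, 0, 1}, and applying that order to {A, B, C, D} gives {C, D, A, B}. The same order vector can be applied to any number of parallel vectors of the same size.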
If you really want to use lists, you could implement your own std::list::sort, that would perform the same operations on both lists. The Microsoft version of std::list::sort uses an array of lists where the number of nodes in array[i] = 2^i, and it merges nodes one at a time into the array, then when all nodes are processed, it merges the lists in the array to produce a sorted list. You'd need two arrays, one for each list to be sorted. I can post example C code for this type of list sort if you want.
Related
I have a project where I need to read a text file and record how many occurrences there are of each string, character, or number that is read until the EoF.
I then need to print the top 10 most used words.
For example, the file would contain "This is a test for this project". I would read this and store each word in a container as well as its current count.
Now, we are graded on how efficient our time complexity is as the input grows. So, I need some help choosing which STL container would be the most efficient.
It seems order is not important: I can always insert at the end, and I will never have to insert anywhere else. I will, however, have to search through the container for the top 10 most used words. Which STL container has the best time complexity for requirements like this?
Also, if you could explain your reasoning so I know more going forward, that would be great!
Let's assume that you decide to use a std::unordered_map<std::string, int> to get a frequency count of the items. That is a good start, but the other part of the question that needs to be addressed is to get the top 10 items.
Whenever a question asks "get the top N" or "get the smallest N", or similar, there are various methods of getting this information.
One way is to sort the data and get the first N items. Using std::sort or a good sorting routine, that operation should take O(M*log(M)) time, where M is the total number of items being sorted.
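A minimal sketch of the sort-based approach might look like this. The function name top_n_by_sorting and the alphabetical tie-break are choices made for this example only.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: copies the map entries into a vector, sorts them by
// count (largest first, ties broken alphabetically), and keeps the first n.
std::vector<std::pair<std::string, int>> top_n_by_sorting(
    const std::unordered_map<std::string, int>& counts, std::size_t n) {
    std::vector<std::pair<std::string, int>> items(counts.begin(), counts.end());
    std::sort(items.begin(), items.end(),
              [](const std::pair<std::string, int>& a,
                 const std::pair<std::string, int>& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });
    if (items.size() > n) {
        items.resize(n);  // drop everything past the first n entries
    }
    return items;
}
```

This sorts every entry even though only the first n are needed, which is the cost the heap-based method tries to avoid.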
The other method is to use a min-heap or max-heap of N items, depending on whether you want to get the top N or bottom N, respectively.
Assume you have working code using the unordered_map to get the frequency count. Here is a routine that uses the STL heap functions to get the top N items. It has not been fully tested, but should demonstrate how to handle the heap.
#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>
void print_top_n(const std::unordered_map<std::string, int>& theMap, size_t n)
{
// This is the heap
std::vector<std::pair<std::string, int>> vHeap;
// This lambda is the predicate to build and perform the heapify
auto heapfn =
[](const std::pair<std::string, int>& p1, const std::pair<std::string, int>& p2) -> bool
{ return p1.second > p2.second; };
// Go through each entry in the map
for (auto& m : theMap)
{
if (vHeap.size() < n)
{
// Add item to the heap, since we haven't reached n items yet
vHeap.push_back(m);
// if we have reached n items, now is the time to build the heap
if (vHeap.size() == n)
// make the min-heap of the N elements
std::make_heap(vHeap.begin(), vHeap.end(), heapfn);
continue;
}
else
// Heap has been built. Check if the next element is larger than the
// top of the heap
if (vHeap.front().second <= m.second)
{
// adjust the heap
// remove the front of the heap by placing it at the end of the vector
std::pop_heap(vHeap.begin(), vHeap.end(), heapfn);
// get rid of that item now
vHeap.pop_back();
// add the new item
vHeap.push_back(m);
// heapify
std::push_heap(vHeap.begin(), vHeap.end(), heapfn);
}
}
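// if the map held fewer than n items, the heap was never built, so build it now
if (vHeap.size() < n)
std::make_heap(vHeap.begin(), vHeap.end(), heapfn);
// sort the heap
std::sort_heap(vHeap.begin(), vHeap.end(), heapfn);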
// Output the results
for (auto& v : vHeap)
std::cout << v.first << " " << v.second << "\n";
}
int main()
{
std::unordered_map<std::string, int> test = { {"abc", 10},
{ "123",5 },
{ "456",1 },
{ "xyz",15 },
{ "go",8 },
{ "text1",7 },
{ "text2",17 },
{ "text3",27 },
{ "text4",37 },
{ "text5",47 },
{ "text6",9 },
{ "text7",7 },
{ "text8", 22 },
{ "text9", 8 },
{ "text10", 2 } };
print_top_n(test, 10);
}
Output:
text5 47
text4 37
text3 27
text8 22
text2 17
xyz 15
abc 10
text6 9
text9 8
go 8
The advantage of using a heap is that:
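Each heap adjustment (a pop followed by a push) costs O(log(N)) for a heap of N items, so processing M map entries costs at most O(M*log(N)) overall, rather than the O(M*log(M)) that sorting all M entries would give you.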
Note that we only need to heapify if we detect that the top item on the min-heap is going to be discarded.
We don't need to store an entire (multi)map of frequency counts to strings in addition to the original map of strings to frequency counts.
The heap will only store N elements, regardless of the number of items in the original map.
I have used two containers for such a task: std::unordered_map<std::string, int> to store word frequencies, and std::map<int, std::string> to track the most frequent words.
While updating the first map with a new word, you also update the second one. To keep it neat, erase the least-frequent word if the size of the second map grows beyond 10.
UPDATE
In response to the comments below, I did some benchmarking.
First, @PaulMcKenzie - you are correct: to keep the ties, I need std::map<int, std::set<std::string>> (that became obvious as soon as I started to implement this).
Second, @dratenik - turns out you are correct, too. While constantly cleaning up the frequency map keeps it small, the benefit doesn't justify the overhead. Also, this would only be needed if the client wants to see the "running total" (as I was asked for in my project). It makes no sense at all in post-processing, when all words are already loaded.
For the test, I used alice29.txt (available online), pre-processed - I removed all punctuation marks and converted to upper case. Here is my code:
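#include <chrono>
#include <fstream>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <unordered_map>
int main()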
{
auto t1 = std::chrono::high_resolution_clock::now();
std::ifstream src("c:\\temp\\alice29-clean.txt");
std::string str;
std::unordered_map<std::string, int> words;
std::map<int, std::set<std::string>> freq;
int i(0);
while (src >> str)
{
words[str]++;
i++;
}
for (auto& w : words)
{
freq[w.second].insert(w.first);
}
int count(0);
for (auto it = freq.rbegin(); it != freq.rend(); ++it)
{
for (auto& w : it->second)
{
std::cout << w << " - " << it->first << std::endl;
++count;
}
if (count >= 10)
break;
}
auto t2 = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() << std::endl;
return i;
}
First time on Stack here, I hope to learn from you guys!
So my code reads a passage from a text file and adds each word to a vector. That vector is then passed to a word-count function, which reports how many times each word is repeated.
For example:
count per word: age = 2, belief = 1, best = 1, it = 10
However, I'm trying to come up with a function that loops over the same vector and prints the words that are repeated more than twice. In this case, the word "it" is repeated more than twice.
map<string, int> get_word_count(const vector<string>& words) {
map<string, int> word_count{};
for (string word : words) {
auto search = word_count.find(word);
if (search == word_count.end()) {
word_count[word] = 1; // not found - add word with count of 1
}
else {
word_count[word] += 1; // found - increment count for word
}
}
return word_count;
}
This is my snippet of code that counts how many times each word from the vector is repeated. However, I'm struggling to figure out how to add a condition that checks whether a word is repeated twice or more. I tried adding a condition like if word_count > 2 and then printing the repeated word, but it did not work. I hope to hear some hints from you guys, thanks.
There is no need to check, as std::map's operator[] automatically checks whether the entry exists. If it does not, a new entry is created with a value of zero; if it does, the existing value is used.
Simply loop over the std::map which holds the words vs. counts and use a condition as needed. See full example.
#include <iostream>
#include <map>
#include <string>
#include <vector>
int main()
{
std::vector< std::string > words{ "belief","best","it","it","it" };
std::map< std::string, int > counts;
for ( const auto& s: words )
{
counts[s]++;
}
for ( const auto& p: counts )
{
if ( p.second > 2 ) { std::cout << p.first << " repeats " << p.second << std::endl; }
}
}
Hint:
If you write auto x: y you get a copy of each element of y, which is typically not what you want. If you write const auto& x: y you get a const reference to each element of your container, which in your case is more efficient, as there is no need to create a copy. The compiler may be able to "see" that the copy is not needed and optimize it out, but the const reference makes the intent clearer to the reader of the source code!
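To make the difference visible, here is a small sketch using a made-up type, Tracked, whose copy constructor counts how often it runs. Everything in it is hypothetical and exists only to illustrate the hint.

```cpp
#include <vector>

// Hypothetical helper type that counts how often it is copied.
struct Tracked {
    static int copies;
    Tracked() = default;
    Tracked(const Tracked&) { ++copies; }
    Tracked& operator=(const Tracked&) { ++copies; return *this; }
};
int Tracked::copies = 0;

// Iterates by value: every element is copied into `item`.
int copies_made_by_value(const std::vector<Tracked>& items) {
    Tracked::copies = 0;
    for (auto item : items) { (void)item; }
    return Tracked::copies;
}

// Iterates by const reference: no element is copied.
int copies_made_by_const_ref(const std::vector<Tracked>& items) {
    Tracked::copies = 0;
    for (const auto& item : items) { (void)item; }
    return Tracked::copies;
}
```

For a vector of three elements, the by-value loop performs three copies and the const-reference loop performs none. Because the copy here has a visible side effect, the compiler is not allowed to drop it.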
First of all, I really suggest exploring the C++ documentation before coding.
Your code can actually be rewritten this way:
map<string, int> get_word_count(const vector<string>& words) {
map<string, int> word_count{};
for (const string& word : words) {
word_count[word] += 1;
}
return word_count;
}
This works because map::operator[] (like unordered_map::operator[]) doesn't behave like map::at (which throws a std::out_of_range exception if the key is not in the map). The difference is that operator[] returns a reference to the value at the given key, and if the key is not already in the map, it is inserted and value-initialized (in your case, an int is value-initialized to 0).
Operator[] on cppreference
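The difference between the two can be checked with a short sketch like the following. The two function names are invented for this example.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Returns true if at() threw std::out_of_range for a missing key and
// left the map unchanged (still empty).
bool at_throws_for_missing_key(const std::string& key) {
    std::map<std::string, int> counts;
    try {
        (void)counts.at(key);
    } catch (const std::out_of_range&) {
        return counts.empty();
    }
    return false;
}

// operator[] inserts a missing key with a value-initialized int (0),
// so the first increment leaves the count at 1.
int count_after_first_increment(const std::string& key) {
    std::map<std::string, int> counts;
    counts[key] += 1;
    return counts[key];
}
```

This is why the explicit find() check in the original function is not needed: the first increment of an unseen word already starts from zero.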
In order to add the "twice or more than twice" part you can modify your code by adding the condition in the for loop.
map<string, int> get_word_count(const vector<string>& words) {
map<string, int> word_count{};
for (const string& word : words) {
auto& map_ref = word_count[word];
map_ref += 1;
if(map_ref == 2){
// Here goes your code
}
}
return word_count;
}
If you're interested in how many times each word is repeated, you need to loop over the map again afterwards.
I have this text file where I am reading each line into a std::vector<std::pair>,
handgun bullets
bullets ore
bombs ore
turret bullets
The first item depends on the second item. And I am writing a delete function where, when the user inputs an item name, it deletes the pair containing the item as second item. Since there is a dependency relationship, the item depending on the deleted item should also be deleted since it is no longer usable. For example, if I delete ore, bullets and bombs can no longer be usable because ore is unavailable. Consequently, handgun and turret should also be removed since those pairs are dependent on bullets which is dependent on ore i.e. indirect dependency on ore. This chain should continue until all dependent pairs are deleted.
I tried to do this for the current example and came with the following pseudo code,
for vector_iterator_1 = vector.begin to vector.end
{
if user_input == vector_iterator_1->second
{
for vector_iterator_2 = vector.begin to vector.end
{
if vector_iterator_1->first == vector_iterator_2->second
{
delete pair_of_vector_iterator_2
}
}
delete pair_of_vector_iterator_1
}
}
Not a very good algorithm, but it explains what I intend to do. In the example, if I delete ore, then bullets and bombs get deleted too. Subsequently, pairs depending on bullets will also be deleted (nothing depends on bombs). Since there is only a single chain of length one (ore-->bullets), one nested for loop is enough to check for it. However, a chain may contain zero dependencies or a large number of them, which would require anywhere from no nested loops to many. So, this is not a very practical solution. How would I do this with a chain of dependencies of variable length? Please tell me. Thank you for your patience.
P. S. : If you didn't understand my question, please let me know.
One (naive) solution:
Create a queue of items-to-delete
Add in your first item (user-entered)
While(!empty(items-to-delete)) loop through your vector
Every time you find your current item as the second-item in your list, add the first-item to your queue and then delete that pair
Easy optimizations:
Ensure you never add an item to the queue twice (hash table/etc)
Personally, I would just use the standard library for removal:
vector.erase(remove_if(vector.begin(), vector.end(), [](const pair<string,string>& p){ return p.second == "ore"; }), vector.end());
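remove_if() moves the elements that do not match the criteria to the front and returns an iterator to the new logical end; the elements past that iterator are left in an unspecified state. So you could have a function that takes in a .second value to erase, saves the .first values of the matching pairs (before or while removing them), and then erases those pairs. From there, you could loop until nothing is removed.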
For your solution, it might be simpler to use find_if inside a loop, but either way, the standard library has some useful things you could use here.
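The "erase, remember the .first values, repeat" idea could be sketched as below. The names Dependency, erase_dependents and erase_with_dependents are made up for this example, and it follows the question's rule of deleting only pairs whose second item is being removed.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using Dependency = std::pair<std::string, std::string>;  // (item, what it needs)

// Erases every pair whose second item is `item` and returns the first
// items of the erased pairs, i.e. the items that just became unusable.
std::vector<std::string> erase_dependents(std::vector<Dependency>& deps,
                                          const std::string& item) {
    std::vector<std::string> orphaned;
    // Collect the .first values before remove_if runs, because elements
    // past the iterator it returns are left in an unspecified state.
    for (const Dependency& d : deps) {
        if (d.second == item) orphaned.push_back(d.first);
    }
    deps.erase(std::remove_if(deps.begin(), deps.end(),
                              [&item](const Dependency& d) {
                                  return d.second == item;
                              }),
               deps.end());
    return orphaned;
}

// Erases every pair that depends on `item`, directly or through a chain.
// Pairs in which `item` itself is the first element are left alone.
void erase_with_dependents(std::vector<Dependency>& deps,
                           const std::string& item) {
    std::vector<std::string> pending(1, item);
    while (!pending.empty()) {
        std::string current = pending.back();
        pending.pop_back();
        std::vector<std::string> orphaned = erase_dependents(deps, current);
        pending.insert(pending.end(), orphaned.begin(), orphaned.end());
    }
}
```

With the four pairs from the question, removing "ore" empties the vector, while removing "bullets" leaves the two pairs that depend on ore. The loop ends because every item added to pending corresponds to a pair that has already been erased.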
I couldn't resist writing a solution using standard algorithms and data structures from the C++ standard library. I'm using a std::set to remember which objects we delete (I prefer it since it has logarithmic lookup and does not contain duplicates). The algorithm is basically the same as the one proposed by @Beth Crane.
#include <iostream>
#include <vector>
#include <utility>
#include <algorithm>
#include <string>
#include <set>
int main()
{
std::vector<std::pair<std::string, std::string>> v
{ {"handgun", "bullets"},
{"bullets", "ore"},
{"bombs", "ore"},
{"turret", "bullets"}};
std::cout << "Initially: " << std::endl << std::endl;
for (auto && elem : v)
std::cout << elem.first << " " << elem.second << std::endl;
// let's remove "bullets" (use "ore" to follow the question's example); this set is our "queue"
std::set<std::string> to_remove{"bullets"}; // unique elements
while (!to_remove.empty()) // loop as long we still have elements to remove
{
// "pop" an element, then remove it via erase-remove idiom
// and a bit of lambdas
std::string obj = *to_remove.begin();
v.erase(
std::remove_if(v.begin(), v.end(),
[&to_remove](const std::pair<std::string,
std::string>& elem)->bool
{
// is it on the first position?
if (to_remove.find(elem.first) != to_remove.end())
{
return true;
}
// is it in the queue?
if (to_remove.find(elem.second) != to_remove.end())
{
// add the first element in the queue
to_remove.insert(elem.first);
return true;
}
return false;
}
),
v.end()
);
to_remove.erase(obj); // delete it from the queue once we're done with it
}
std::cout << std::endl << "Finally: " << std::endl << std::endl;
for (auto && elem : v)
std::cout << elem.first << " " << elem.second << std::endl;
}
@vsoftco I looked at Beth's answer and went off to try the solution. I did not see your code until I came back. On closer examination of your code, I see that we have done pretty much the same thing. Here's what I did:
std::string Node;
std::cout << "Enter Node to delete: ";
std::cin >> Node;
std::queue<std::string> Deleted_Nodes;
Deleted_Nodes.push(Node);
while(!Deleted_Nodes.empty())
{
std::vector<std::pair<std::string, std::string>>::iterator Current_Iterator = Pair_Vector.begin();
while(Current_Iterator != Pair_Vector.end())
{
if(Deleted_Nodes.front() == Current_Iterator->second)
{
Deleted_Nodes.push(Current_Iterator->first);
// erase() invalidates iterators at and after the erased element,
// so continue from the iterator it returns
Current_Iterator = Pair_Vector.erase(Current_Iterator);
}
else if(Deleted_Nodes.front() == Current_Iterator->first)
{
Current_Iterator = Pair_Vector.erase(Current_Iterator);
}
else
{
++Current_Iterator;
}
}
Deleted_Nodes.pop();
}
To answer your question in the comment of my question, that's what the else if statement is for. It's supposed to be a directed graph so it removes only next level elements in the chain. Higher level elements are not touched.
1 --> 2 --> 3 --> 4 --> 5
Remove 5: 1 --> 2 --> 3 --> 4
Remove 3: 1 --> 2 4 5
Remove 1: 2 3 4 5
Although my code is similar to yours, I am no expert in C++ (yet). Tell me if I made any mistakes or overlooked anything. Thanks. :-)
I have a list in the following format:
2323 0 1212
2424 0 1313
2525 1 1414
I need to store every row of these values and I need the possibility to access each of them individually and to be able to search for the occurrence of any number which is stored in whatever I use.
What can I use? Should I use multiple vectors or can I store them in a multimap or maybe a boost::tuple?
I can't use c++11 and I have only limited boost support (1.36 is installed and I can't update).
I already have a parser which can parse the list (found here): http://en.highscore.de/cpp/boost/parser.html
Thank you in advance
If I understand your question correctly, you could define a struct (taking names from your comment):
struct Item
{
int position;
int direction;
int nextPosition;
};
And then just have a std::vector<Item>. The row would be the index. To count the occurrences of a value, you can pass a custom predicate to std::count_if or just define your own function to do so, as I think using std::count_if without C++11 lambdas might be a bit difficult.
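Since the question rules out C++11, a function object can stand in for the lambda. The following is only a sketch: ContainsValue and count_rows_containing are invented names, and "occurrence" is taken to mean that the value appears in any of the three fields of a row.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Item {
    int position;
    int direction;
    int nextPosition;
};

// Function object standing in for a lambda, which C++03 lacks.
struct ContainsValue {
    int value;
    explicit ContainsValue(int v) : value(v) {}
    bool operator()(const Item& item) const {
        return item.position == value || item.direction == value ||
               item.nextPosition == value;
    }
};

// Counts the rows in which `value` appears in any of the three fields.
std::size_t count_rows_containing(const std::vector<Item>& items, int value) {
    return static_cast<std::size_t>(
        std::count_if(items.begin(), items.end(), ContainsValue(value)));
}
```

For the three sample rows in the question, the value 0 is found in two rows and 1414 in one.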
EDIT: To make things easier for you, as suggested by Thomas Matthews, you could overload operator>> for your struct to read directly from a file:
struct Item
{
int position;
int direction;
int nextPosition;
friend std::istream& operator>>(std::istream& stream, Item& item);
};
std::istream& operator>>(std::istream& stream, Item& item)
{
stream >> item.position >> item.direction >> item.nextPosition;
return stream;
}
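With such an operator in place, reading a whole list could look roughly like this. The helper name read_items is an assumption for this sketch; it accepts any std::istream, so the same code works for a file stream or a string stream, and it avoids C++11 features.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct Item {
    int position;
    int direction;
    int nextPosition;
};

std::istream& operator>>(std::istream& stream, Item& item) {
    stream >> item.position >> item.direction >> item.nextPosition;
    return stream;
}

// Reads rows until extraction fails (end of input or a malformed row).
// A row is only stored if all three numbers were read successfully.
std::vector<Item> read_items(std::istream& input) {
    std::vector<Item> items;
    Item item;
    while (input >> item) {
        items.push_back(item);
    }
    return items;
}
```

Passing a std::ifstream opened on the data file, or a std::istringstream holding the three sample rows, yields one Item per row.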
That depends on what you want to do with it. A vector of vectors would work to access all numbers and keep them row-by-row. That would also preserve the ordering of the numbers within the rows. You can find numbers by walking through the vectors (or use the STL find functions which do the same)
If you need to insert or delete rows/numbers, you might consider a list instead of a vector. Lists have better performance for insert/delete but you lose random access.
If you just need to know whether specific numbers exist, then you might put them into a multiset. You will lose the ability to know in which row a number is and also you lose the ordering of the numbers within the rows.
It might be fastest (although I haven't tested it) to simply throw the numbers into a std::vector, one after the other, totally ignoring the row structure, and just step through the whole thing to detect a match at index i. You can then get the elements of that row with a bit of index wrangling as ( i - i % 3 ) + {0|1|2}.
If you can safely assume that every row always has 3 columns, then use a std::vector<int>. Then when you want to know the row/column that a number originated from, you can use a function like this:
bool find( const std::vector<int>& numbers, int target, int& row, int& column )
{
std::vector<int>::const_iterator it = std::find( numbers.begin(), numbers.end(), target );
if( it == numbers.end() )
{
return false;
}
int index = it - numbers.begin();
row = index / 3;
column = index % 3;
return true;
}
// Example:
std::vector<int> numbers = ...
int row, column;
if( find( numbers, 10, row, column ) )
{
std::cout << "Found at row " << row << ", column " << column << std::endl;
}
else
{
std::cout << "Not found" << std::endl;
}
Say I have the following:
string myArray[] = { "adam", "aaron", "brad", "brandon" };
cout << "Please type a name: ";
I want it so that when a user types "bra" and hits enter, the program returns
brad
brandon
If the user types "a", the program returns
adam
aaron
If the user types "adam", the program returns
adam
I have tried strstr, mystring.compare(str), mystring.compare(x, n, str) - I can't find anything that works.
What function would be the best way of handling this operation?
This is a great time for lambdas and std::copy_if. Assuming the string you want to find is named to_find:
std::copy_if(std::begin(myArray), std::end(myArray),
std::ostream_iterator<std::string>(std::cout, "\n"),
[&](const std::string& s){
return s.find(to_find) == 0;
});
Specifically, to test whether some string a starts with another string b, we can do:
a.find(b) == 0
std::string::find returns the index at which b is found, or npos if it's not found. We want 0, since you only want prefixes. We then wrap that in a lambda and pass it into copy_if, which copies every element that passes our predicate and writes it to the output iterator we provide - in this case an ostream_iterator<std::string> which writes to std::cout and uses "\n" as a delimiter.
To write out the same without C++11 would look something like:
const size_t size = sizeof(myArray) / sizeof(*myArray);
for (size_t i = 0; i < size; ++i) {
if (myArray[i].find(to_find) == 0) {
std::cout << myArray[i] << std::endl;
}
}
Depending on how big your list of names is going to be, it can quickly become very slow to scan through the whole thing. You might want to look into implementing your own prefix tree (aka trie).
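For reference, a very small prefix tree could be sketched like this. The class name Trie and its member names are invented for illustration, it stores one map node per character, and it makes no attempt at memory efficiency.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// A minimal prefix tree: each node maps a character to a child node.
class Trie {
public:
    void insert(const std::string& word) {
        Node* node = &root_;
        for (char c : word) {
            std::unique_ptr<Node>& child = node->children[c];
            if (!child) child.reset(new Node());
            node = child.get();
        }
        node->is_word = true;
    }

    // Returns every stored word that starts with `prefix`, in sorted order.
    std::vector<std::string> with_prefix(const std::string& prefix) const {
        std::vector<std::string> result;
        const Node* node = &root_;
        for (char c : prefix) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return result;  // no match
            node = it->second.get();
        }
        std::string current = prefix;
        collect(*node, current, result);
        return result;
    }

private:
    struct Node {
        bool is_word = false;
        std::map<char, std::unique_ptr<Node>> children;
    };

    // Depth-first walk that appends every complete word below `node`.
    static void collect(const Node& node, std::string& current,
                        std::vector<std::string>& out) {
        if (node.is_word) out.push_back(current);
        for (const auto& entry : node.children) {
            current.push_back(entry.first);
            collect(*entry.second, current, out);
            current.pop_back();
        }
    }

    Node root_;
};
```

After inserting the four names from the question, with_prefix("bra") returns brad and brandon. Results come back in sorted order rather than in the order of the original array, so a lookup for "a" yields aaron before adam. A lookup only walks the characters of the prefix plus the matching subtree, instead of scanning every name.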