Creating indexes of highest value in struct for top 5 - c++

Let's say I have the struct below:
struct info
{
string firstname;
string lastname;
double kids;
double income;
double cars;
int index;
};
Let's say I have 500 people stored in structs like this, each containing a first name, last name, number of kids, income and number of cars.
I created an int called index so that I can rank who has the most income, from highest to lowest.
What method would you use, or how would you go about finding the top 5 people with the most income and giving them an index of 1, 2, 3, 4, 5, so that I can tell who the top 5 are if I wish to print their names out?
I am looking for a simple method, as I'm still learning about trees and such.
Thanks!

Use a vector of structs and supply a custom comparison function that gets called during the sort.
The comparison function should compare by income, in descending order.
The first 5 elements of the sorted vector are then your answer.
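As a minimal sketch of this approach: the helper names higherIncome and rankByIncome are made up for illustration, and the struct is repeated from the question so the block stands alone. Setting index to 0 for unranked people is also an assumption, not something the question specifies.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct info {
    std::string firstname;
    std::string lastname;
    double kids;
    double income;
    double cars;
    int index;  // rank by income: 1 = highest, 0 = not ranked (assumed convention)
};

// Comparison used by std::sort: true when a should come before b,
// which here means a has the higher income (descending order).
bool higherIncome(const info& a, const info& b) {
    return a.income > b.income;
}

// Sorts people by income, highest first, and numbers the first topN as 1..topN.
void rankByIncome(std::vector<info>& people, std::size_t topN) {
    std::sort(people.begin(), people.end(), higherIncome);
    for (std::size_t i = 0; i < people.size(); ++i) {
        people[i].index = (i < topN) ? static_cast<int>(i + 1) : 0;
    }
}
```

Calling rankByIncome(people, 5) would leave the five highest earners in people[0] through people[4], with index set to 1 through 5.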

If you just want the top 5 (and don't need them in order) you can use std::nth_element to find them. This is normally faster than sorting.
If you do want the top 5 in order, you could use std::partial_sort to do the job, something like this:
std::partial_sort(x.begin(), x.begin() + 5, x.end(),
    [](const auto& a, const auto& b) { return b.income < a.income; });
Note that I've swapped the two parameters when comparing them to get it to sort in descending order instead of ascending.
I don't see a very good way to use the index field you've put into the structure. To work very well, you'd want the index separate from the data you're sorting, and you'd do an indirect sort on the indexes (that is, you'd sort the indexes based on the income for the item at that index).
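The indirect sort described above can be sketched roughly as follows. The function name topIndexes is hypothetical, and the incomes are assumed to be available as a plain vector of doubles (with the struct from the question you would compare people[a].income and people[b].income instead).

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns the positions of the k largest incomes, highest first.
// The incomes vector is left untouched; only the index vector is reordered.
std::vector<std::size_t> topIndexes(const std::vector<double>& incomes, std::size_t k) {
    std::vector<std::size_t> idx(incomes.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});  // 0, 1, 2, ...
    k = std::min(k, idx.size());
    std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(k), idx.end(),
                      [&incomes](std::size_t a, std::size_t b) {
                          // Swapped operands give descending order by income.
                          return incomes[b] < incomes[a];
                      });
    idx.resize(k);
    return idx;
}
```

The returned vector plays the role of the index field: element 0 is the position of the top earner, element 1 the runner-up, and so on, while the original data stays in its original order.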

Related

Hashing Function/Code

So I'm just learning (or trying to learn) a bit about hashing. I'm attempting to make a hashing function, but I'm confused about where I save the data. I'm trying to calculate the number of collisions and print that out. I have made 3 different files: one with 10,000 words, one with 20,000 words, and one with 30,000 words. Each word is just 10 random numbers/letters.
long hash(char* s){
    long h = 0; // start from zero before accumulating
    for(int i = 0; i < 10; i++){
        h = h + (int)s[i];
    }
    //A lot of examples then mod h by the table size
    //I'm a bit confused what this table is... Is it an array of
    //10,000 (or however many words)?
    //h % TABLE_SIZE
    return h;
}
int main (int argc, char* argv[]){
    fstream input(argv[1]);
    char nextWord[11]; // a 10-character word plus the terminating '\0'
    while(input >> nextWord){ // stops cleanly at end of input
        hash(nextWord);
    }
}
So that's what I currently have, but I can't figure out what the table is exactly, as I said in the comments above... Is it a predefined array in my main with the number of words in it? For example, if I have a file of 10 words, do I make an array of size 10 in my main? Then if/when I return h, let's say the order goes: 3, 7, 2, 3
The 4th word is a collision, correct? When that happens, I add 1 to the collision count and then add 1 to the index to check whether slot 4 is also full?
Thanks for the help!
The point of hashing is to have constant-time access to every element you store. I'll try to explain with a simple example below.
First, you need to know how much data you have to store. Suppose, for example, that you want to store numbers and you know you won't store numbers greater than 10. The simplest solution is to create an array with 10 elements. That array is your "table", where you store your numbers. So how do you achieve that amazing constant-time access? With a hashing function! Its job is to return an index into your array. Let's create a simple one: if you'd like to store 7, you just save it in the array at position 7. Every time you want to look up element 7, you pass it to your hashing function and, bam, you have the position of your element in constant time! But what if you'd like to store more elements with value 7? Your simple hashing function returns 7 for every such element, and now that position is already occupied! How do you solve that? There are not many solutions; the simplest are:
1: Open addressing (linear probing) - you simply save the element in the next free position. This has a significant drawback: imagine you want to delete some element ... (this is the method you are describing in the question)
2: Separate chaining (linked lists) - if you create an array of pointers to linked lists, you can easily add your new element to the end of the linked list at position 7!
Both of these simple solutions have their drawbacks. I guess you can see them. As @rwols has said, you don't have to use an array. You can also use a tree, or be a real C++ master and use unordered_map and unordered_set with a custom hash function, which is quite cool. There is also a structure named a trie, which is useful when you'd like to create some sort of dictionary (where it is really hard to know how many words you will need to store).
To sum up: you have to know how many things you want to store, and then create an ideal hashing function that covers an array of appropriate size. In a perfect world it has a uniform index distribution with no collisions. (Achieving this is pretty hard, and in the real world I guess it is impossible, so the fewer collisions, the better.)
Your hash function is pretty bad. It will have a lot of collisions (strings such as "ab" and "ba" hash to the same value), and you also need to take it modulo m, with m being the size of your array (a.k.a. the table), so that the result is a valid index into that array. The modulo is a way of making the hash function "fit" the table you specified at the beginning, because you can't save an element at position 11, 12, ... if you have an array of 10.
What should a good hashing function look like? Well, there are better sources than me. Some example (Alert! It's in Java)
As for your example: you simply can't save 10k or more words into a table of size 10. That will create a lot of collisions and you lose the main benefit of hashing: constant-time access to the elements you saved.
And how would your code look? Something like this:
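int main (int argc, char* argv[]){
    fstream input(argv[1]);
    char nextWord[11]; // a 10-character word plus the terminating '\0'
    TypeOfElement table[size_of_table];
    while(input >> nextWord){
        // reduce the hash to a valid index before using it
        table[hash(nextWord) % size_of_table] = /* desired element which you want to save */;
    }
}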
But I guess your goal isn't to save something somewhere, but to count the number of collisions. Also note that the code above doesn't resolve collisions. If you'd like to count collisions, create a table of ints and initialize it to zero. Then just increment the value stored at the index returned by your hash function (reduced modulo the table size), like this:
table[hash(nextWord) % size_of_table]++;
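A minimal, self-contained sketch of that counting idea follows. The names simpleHash and countCollisions are hypothetical, the words are assumed to be already read into a vector, and tableSize is assumed to be greater than zero. A "collision" here means a word landing in a slot that some earlier word already used.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Same idea as the hash in the question: sum the character codes,
// then reduce the sum to a valid table index with modulo.
std::size_t simpleHash(const std::string& word, std::size_t tableSize) {
    unsigned long h = 0;
    for (unsigned char c : word) {
        h += c;
    }
    return h % tableSize;
}

// Counts how many words landed in a slot that was already occupied.
std::size_t countCollisions(const std::vector<std::string>& words, std::size_t tableSize) {
    std::vector<int> table(tableSize, 0);  // one counter per slot, all zero
    std::size_t collisions = 0;
    for (const std::string& w : words) {
        const std::size_t slot = simpleHash(w, tableSize);
        if (table[slot] > 0) {
            ++collisions;
        }
        ++table[slot];
    }
    return collisions;
}
```

With a table of size 10, "ab" and "ba" both land in the same slot, which is exactly the weakness of an additive hash mentioned above.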
I hope this helped. Please specify what else you want to know.
If a hash table is required then, as others have stated, std::unordered_map will work in most cases. If you need something more powerful because of a large number of entries, I would suggest looking into tries. Tries combine the concepts of array (vector) insertion, hashing, and linked lists. The run time is close to O(M), where M is the number of characters in the string, if you are hashing a string. It helps to remove the chance of collisions. And the more you add to a trie, the less work has to be done, as more of the needed nodes already exist. The one drawback is that tries require more memory.
Your trie's array size may vary depending on what you are storing, but the overall concept and construction are the same. If you were doing a word-definition lookup, you might want an array of 26 entries, or a few more, one for each possible character.
To count the number of words that have the same hash, we need to know the hashes of all previous words. When you compute the hash of a word, you should write it down, for example in an array. So you need an array with a size equal to the number of words.
Then you should compare each new hash with all previous ones. The method of counting depends on what you need: the number of colliding pairs or the number of identical elements.
Hash function should not be responsible for storing data. Normally you would have a container that uses hash function internally.
From what you wrote, I understood that you want to create a hash table. One way you could do that (probably not the most efficient one, but it should give you an idea):
#include <fstream>
#include <vector>
#include <string>
#include <map>
#include <memory>
using namespace std;
namespace example {
    // Same additive hash as in the question, applied to the whole word.
    long hash(const std::string& s){
        long h = 0; // start from zero before accumulating
        for(char c : s){
            h = h + (int)c;
        }
        return h;
    }
}
int main (int argc, char* argv[]){
    fstream input(argv[1]);
    std::string nextWord; // std::string manages its own storage
    std::map<long, std::unique_ptr<std::vector<std::string>>> hashtable;
    while(input >> nextWord){ // stops cleanly at end of input
        long newHash = example::hash(nextWord);
        auto it = hashtable.find(newHash);
        if (it == hashtable.end()) {
            // First word with this hash: start a new bucket.
            hashtable.insert(std::make_pair(newHash, std::unique_ptr<std::vector<std::string>>(new std::vector<std::string> { nextWord } )));
        }
        else {
            // Collision: another word already produced this hash.
            it->second->push_back(nextWord);
        }
    }
}
I used some C++ 11 features to write an example faster.
I am not sure that I understand what you do not understand. The explanations below might help you.
A hash table is a kind of associative array. It is used to map keys to values in a similar manner an array is used to map indexes (keys) to values. For instance, an array of three numbers, { 11, -22, 33 }, associates index 0 to 11, index 1 to -22 and index 2 to 33.
Now, let us assume that we would like to associate 1 to 11, 2 to -22 and 3 to 33. The solution is simple: we keep the same array, only we transform the key by subtracting one from it, thus obtaining the original index.
This is fine until we realize that this is just a particular case. What if the keys are not so “predictable”? A solution would be to put the associations in a list of {key, value} pairs and when someone is asking for a key, just search the list: { 123, 11}, {3, -22}, {0, 33} If the value associated to 3 is asked, we simply search the keys in list for a match and find -22. That’s fine, but if the list is large we’re in trouble. We could speed the search if we sort the array by keys and use binary search, but still the search may take some time if the list is large.
The search speed may be further enhanced if we break the list in sub-lists (or buckets) made of related pairs. This is what a hash function does: puts together pairs by related keys (an ideal hash function would associate one key to one value).
A hash table is a two-column table (an array):
The first column is the hash key (the index computed by a hash function). The size of the hash table is given by the number of distinct values the hash function can return. If, for instance, the last step in computing the hash function is modulo 10, the size of the table will be 10; the list of pairs will be broken into 10 sub-lists.
The second column is a list (bucket) of key/value pairs (the sub-list I was talking about).
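The two-column picture above can be sketched as a small class. This is only an illustration under stated assumptions: the class name BucketTable and its methods put and find are made up, keys and values are both ints as in the {key, value} example, the hash function is simply the key modulo the number of buckets, and the bucket count must be greater than zero.

```cpp
#include <cstddef>
#include <list>
#include <utility>
#include <vector>

class BucketTable {
public:
    explicit BucketTable(std::size_t bucketCount) : buckets_(bucketCount) {}

    // Inserts the pair, or overwrites the value if the key already exists.
    void put(int key, int value) {
        std::list<std::pair<int, int>>& bucket = buckets_[indexFor(key)];
        for (auto& kv : bucket) {
            if (kv.first == key) {
                kv.second = value;
                return;
            }
        }
        bucket.emplace_back(key, value);
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const int* find(int key) const {
        for (const auto& kv : buckets_[indexFor(key)]) {
            if (kv.first == key) {
                return &kv.second;
            }
        }
        return nullptr;
    }

private:
    // "First column": the hash function maps a key to a bucket index.
    std::size_t indexFor(int key) const {
        return static_cast<unsigned int>(key) % buckets_.size();
    }

    // "Second column": each slot holds the sub-list of pairs that hash there.
    std::vector<std::list<std::pair<int, int>>> buckets_;
};
```

With 10 buckets, the keys 123, 3 and 13 all end up in bucket 3, so a lookup only has to search that one short sub-list instead of the whole collection.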

Sort vector of vectors with doubles according to a column in C++

I have a matrix stored as a vector of rows, where each row is itself a vector holding that row's column values. I would like to sort the rows according to the 1st column.
Each element inside this matrix is a double, although the first column contains a number that serves as an identifier (but is not unique).
My goal is to have something like the aggregate functions available in SQL, such as count() and sum() when I group by the first column.
For instance, if I have:
ID VALUE
1 10
2 20
1 30
2 40
3 60
I would like to get:
ID COUNT MEAN
1 2 20
2 2 30
3 1 60
However, I am stuck in the very first step: how do I sort the rows according to the value of the first element of each row?
I found a clue on this topic, and adapted the comparator to:
bool compareFunction (double i,double j)
{
return (i<j);
}
But the compiler was not very happy about that (making a reference to the stl_algo.h file):
error: cannot convert 'std::vector<double>' to 'double' in argument passing
I was therefore wondering if there is a way to sort such a vector of vectors when it contains doubles.
Answer (imho): use a different data structure. What you are trying to do is set up a multimap. Oh hey look:
http://www.cplusplus.com/reference/map/multimap/
stl::multimap - how do i get groups of data?
It'll be faster for large numbers of elements. And it is actually a map rather than a vector of vectors of doubles.
Either that, or skip the sorting altogether and count by key using std::map, std::unordered_map, or (if you know the number of keys and/or the keys are offset by 1 with no breaks) std::vector.
To expand: sorting your list to get means will be slow. Sorting (using std::sort) is O(nlogn), and will be O(nlogn) every time you compute this mean. And it is an unnecessary step: your stuff is grouped by key regardless of order. std::map and std::multimap will "sort as you go", which will be just a little faster than sorting every time, and you won't have to sort the whole thing to get the list. Then you can just iterate the multimap to get the mean, O(n) for each mean calculation. (It is still O(nlg(n)) to add all the elements to the multimap.)
But if you know the keys are going to be 1,2,3...n-1,n, then sorting is a complete waste of time. Just make a counter for each key (since you know what the keys can be) and add to the counter for that key while iterating the array.
BUT WAIT THERE IS MORE
If the keys are actually set up the way you are thinking, then the best way from the get-go is to forget the table structure and build it like this:
Index VALUE
0 10,30
1 20,40
2 60
Count is now constant time for each row. Mean for each row is O(n). Getting a list is constant time for each row. EVERYBODY WINS.
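The "count by key without sorting" idea can be sketched with std::map as follows. The names Stats and groupByFirstColumn are hypothetical, each row is assumed to be {id, value}, and using a double as the map key assumes the IDs are exact values such as 1.0 or 2.0 rather than results of floating-point arithmetic.

```cpp
#include <cstddef>
#include <map>
#include <vector>

struct Stats {
    std::size_t count = 0;
    double sum = 0.0;
    double mean() const { return count ? sum / count : 0.0; }
};

// Groups rows by their first column and accumulates count and sum per ID.
// Rows may arrive in any order; std::map keeps the IDs sorted as they are added.
std::map<double, Stats> groupByFirstColumn(const std::vector<std::vector<double>>& rows) {
    std::map<double, Stats> groups;
    for (const auto& row : rows) {
        if (row.size() < 2) {
            continue;  // skip malformed rows
        }
        Stats& s = groups[row[0]];
        ++s.count;
        s.sum += row[1];
    }
    return groups;
}
```

Iterating over the returned map visits the IDs in ascending order, so printing each key with its count and mean() gives the table the question asks for.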
You need to create a comparator function comparing vector<double>:
struct VecComp {
    bool operator()(const vector<double>& _a, const vector<double>& _b) const {
        // compare first elements (assumes every row is non-empty)
        return _a[0] < _b[0];
    }
};
Then you can use std::sort on your structure with the new comparator function:
std::sort(myMat.begin(), myMat.end(), VecComp());
If you are using c++11 features you can also utilize lambda functions here:
std::sort(myMat.begin(), myMat.end(), [](const vector<double>& a, const vector<double>& b) {
    // compare the first elements (assumes every row is non-empty)
    return a[0] < b[0];
}
);
You need to write your own comparator functor and pass it to std::sort:
struct comp {
    bool operator() (const std::vector<double>& i,
                     const std::vector<double>& j) const {
        return i[0] < j[0];
    }
};
Have you tried just this?:
std::sort(vecOfVecs.begin(), vecOfVecs.end());
That should work as std::vector has operator< which provides lexicographical sorting, which is (a little more specific than) what you want.

C++: Time complexity of using STL's sort in order to sort a 2d array of integers on different columns

Let's say we have the following 2d array of integers:
1 3 3 1
1 0 2 2
2 0 3 1
1 1 1 0
2 1 1 3
I was trying to create an implementation where the user could give as input the array itself and a string. An example of a string in the above example would be 03 which would mean that the user wants to sort the array based on the first and the fourth column.
So in this case the result of the sorting would be the following:
1 1 1 0
1 3 3 1
1 0 2 2
2 0 3 1
2 1 1 3
I didn't know a lot about the compare functions that are being used inside the STL's sort function, however after searching I created the following simple implementation:
I created a class called Comparator.h
class Comparator{
private:
std::string attr;
public:
Comparator(std::string attr) { this->attr = attr; }
bool operator()(const int* first, const int* second){
std::vector<int> left;
std::vector<int> right;
size_t i;
for(i=0;i<attr.size();i++){
left.push_back(first[attr.at(i) - '0']);
right.push_back(second[attr.at(i) - '0']);
}
for(i=0;i<left.size();i++){
if(left[i] < right[i]) return true;
else if(left[i] > right[i]) return false;
}
return false;
}
};
I need to know the information inside the string so I need to have a class where this string is a private variable. Inside the operator I would have two parameters first and second, each of which will refer to a row. Now having this information I create a left and a right vector where in the left vector I have only the numbers of the first row that are important to the sorting and are specified by the string variable and in the right vector I have only the numbers of the second row that are important to the sorting and are specified by the string variable.
Then I do the needed comparisons and return true or false. The user can use this class by calling this function inside the Sorting.cpp class:
void Sorting::applySort(int **data, std::string attr, int amountOfRows){
std::sort(data, data+amountOfRows, Comparator(attr));
}
Here is an example use:
int main(void){
//create a data[][] variable and fill it with integers
Sorting sort;
sort.applySort(data, "03", number_of_rows);
}
I have two questions:
First question
Can my implementation get better? I use extra variables like the left and right vectors, and then I have some for loops which brings some extra costing to the sorting operation.
Second question
Due to the extra cost, how much worse does the time complexity of the sorting become? I know that STL's sort is O(n*logn), where n is the number of integers that you want to sort. Here n has a different meaning: n is the number of rows, and each row can have up to m integers, which are examined inside the Comparator class by overloading operator() and using extra variables (the vectors) and for loops.
Because I'm not sure exactly how STL's sort is implemented, I can only make some estimates.
My initial estimate would be O(n*m*log(n)), where m is the number of columns that are important to the sorting; however, I'm not 100% certain about it.
Thank you in advance
You can certainly improve your comparator. There's no need to copy the columns and then compare them. Instead of the two push_back calls, just compare the values and either return true, return false, or continue the loop according to whether they're less, greater, or equal.
The relevant part of the complexity of sort is O(n * log n) comparisons (in C++11. C++03 doesn't give quite such a good guarantee), where n is the number of elements being sorted. So provided your comparator is O(m), your estimate is OK to sort the n rows. Since attr.size() <= m, you're right.
First question: you don't need left and right - you add elements one by one and then iterate over the vectors in the same order. So instead of pushing values to vectors and then iterating over them, simply use the values as you generate them in the first loop, like so:
for(i=0;i<attr.size();i++){
int left = first[attr.at(i) - '0'];
int right = second[attr.at(i) - '0'];
if(left < right) return true;
else if(left > right) return false;
}
Second question: can the time complexity be improved? Not with a sorting algorithm that uses direct comparison. On the other hand, the problem you are solving here is somewhat similar to radix sort, so I believe you should be able to do the sorting in O(n*m), where m is the number of sorting criteria, assuming the values in each column fall in a small, bounded range.
1) To start off, you should convert the string into an integer array in the constructor, validating that each value is less than the number of columns.
(You could also have another constructor that takes an integer array as a parameter.
A slight enhancement is to allow negative values to indicate that the order of the sort is reversed for that column. In this case the values would be -N..-1 , 1..N)
2) There is no need for the intermediate left, right arrays.
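Both points can be sketched together as follows. The class name ColumnComparator and the extra columnCount constructor parameter are assumptions made for this illustration (the original Comparator does not know the row width), and the optional negative-value extension for reversed order is left out.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

class ColumnComparator {
public:
    // attr holds one digit per sort column, e.g. "03"; columnCount is the row width.
    // The string is parsed and validated once here, not on every comparison.
    ColumnComparator(const std::string& attr, std::size_t columnCount) {
        for (char c : attr) {
            if (c < '0' || c > '9' ||
                static_cast<std::size_t>(c - '0') >= columnCount) {
                throw std::invalid_argument("bad column index in sort string");
            }
            columns_.push_back(static_cast<std::size_t>(c - '0'));
        }
    }

    // Compares two rows column by column, with no intermediate vectors.
    bool operator()(const int* first, const int* second) const {
        for (std::size_t col : columns_) {
            if (first[col] < second[col]) return true;
            if (first[col] > second[col]) return false;
        }
        return false;  // equal on every sort column
    }

private:
    std::vector<std::size_t> columns_;
};
```

It would be passed to std::sort the same way as before, for example std::sort(data, data + amountOfRows, ColumnComparator(attr, columnCount)), where columnCount is whatever row width the caller knows.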

maintaining the order of vectors

Let's say I have one vector of names and another vector for the telephone numbers. First, the user enters the names (not sorted, meaning they are not ordered from a to z), then the user enters the corresponding telephone numbers.
After both vectors are filled, the program sorts the name vector (vector 1). The problem is now vector 2, since there is no mechanism to keep it in step with vector 1.
Example:
vector name | vector telephone
f 232132
a 34242342
b 997345
the result will be
vector name | vector telephone
a 232132
b 34242342
f 997345
As you can see, the telephone vector hasn't been adjusted. How can we adjust it? Thanks.
Create a struct that holds a string for the name and a string/int for the phone number. Go through it linearly and record the name information. Go through it again and record the phone # information. Then sort.
If you do not wish to create a class, you can use a pair object.
vector<pair<string,int> > nameAndNumber;
Edit: fixed a bug, thanks smocking
the vector telephone hasnt been adjusted. how can we adjust this??
By combining both entities, "name" and "telephone", inside one data structure and then using a vector of that structure.
struct NameNumber {
std::string t_Name;
unsigned long t_Number;
bool operator < (const NameNumber&) const; // use 't_Name' inside
};
std::vector<NameNumber> v;
For completeness of the solution, I have mentioned the operator < which will sort the vector according to the names.
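A minimal sketch of the whole flow, assuming the two parallel vectors from the question already exist. The helper name sortedPhoneBook is made up for illustration, and operator< is defined inline here rather than only declared.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct NameNumber {
    std::string t_Name;
    unsigned long t_Number;
    // Order entries by name; the number travels with its name during the sort.
    bool operator<(const NameNumber& other) const { return t_Name < other.t_Name; }
};

// Pairs up the two parallel vectors, then sorts the pairs by name.
std::vector<NameNumber> sortedPhoneBook(const std::vector<std::string>& names,
                                        const std::vector<unsigned long>& numbers) {
    std::vector<NameNumber> book;
    const std::size_t n = std::min(names.size(), numbers.size());
    for (std::size_t i = 0; i < n; ++i) {
        book.push_back({names[i], numbers[i]});
    }
    std::sort(book.begin(), book.end());  // uses NameNumber::operator<
    return book;
}
```

For the example in the question, the sorted result keeps 34242342 with "a", 997345 with "b", and 232132 with "f".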

Suitable Data Structure for storing and calculating Highest Scoring K items

I need to store W items. Each item has a 'string' attribute and a 'double' attribute (the item's score) associated with it. In each iteration, additional C items are added to the set. After the iteration is complete, score of some of the items is updated by a small amount. Now, out of the W+C items only W items need to be taken forward to the next iteration. Highest Scoring 'W' items will be selected that will go to the next generation.
In every iteration a different set of 'C' items are added.
W is of the order of 10,000. C is of the order of 600.
What is the best data structure to use this in terms of time complexity. Hash Table, Heap, Binary Search Tree??
I am using C++. Some boost references will be appreciated
I would store these values in two parallel structures. First, have an array of the double values, each of which stores a pointer. Next, store all the strings in a hash table along with an auxiliary integer. The idea is that the pointers in the array point to the nodes in the hash table or trie holding the string associated with the double, while the integer value with each string stores the index of the double paired with that string.
To insert a string/double pair into this structure, you add the string to the hash table, append the double to the array, then store a pointer to the new string in the array and the index of the double in the hash table. This has complexity O(k), where k is the length of the string.
To change a priority, look up the string in the hash table, then get the index of the double in the array. You can then modify that element to change the associated priority. This also has complexity O(k).
To discard all but the top W key/value pairs, run a selection algorithm on the array to put the top W elements in one part of the array and the remaining C elements in the other. Whenever you perform a swap, follow the pointers out of the array and into the hash table and update the indices of the elements you just swapped. Finally, iterate across the last C elements of the array, follow their pointers back into the hash table, and remove the elements they point at from the table. This takes expected O(n) time to do the selection step, or worst-case O(n) time using the median-of-medians algorithm, followed by O(n) time to remove the elements from the hash table, for an expected runtime of O(n), where n is the number of elements in the structure.
To summarize, this gives you O(k) insertion and lookup of any string, where k is the string length, and O(n) retaining of the best elements, where n is the total number of elements.
Well, I think you will be fine just using a std::vector<Item> and doing a std::nth_element (on the score) once at end of iteration. E.g. if you want to keep 10000 items, do like this:
#include <algorithm>
#include <string>
#include <vector>
struct Item {
    double score;
    std::string name;
};
// Orders items by descending score.
bool comparator(const Item& a, const Item& b) {
    return a.score > b.score;
}
// At the end of each iteration, with std::vector<Item> items:
if (items.size() > 10000) {
    // Make sure the first 10,000 elements contain the highest scores.
    std::nth_element(items.begin(), items.begin() + 10000, items.end(),
                     comparator);
    // Only keep the first 10,000 elements.
    items.resize(10000);
}
Actually, if you do it like this, updating values (by linear search and string comparison) will probably be slower than sorting. You can speed up the comparisons by putting a string hash into your Item instead of the pure strings.
If you want even faster updating: Before updating, sort items on string hash. Then you can do a binary search instead of linear search to find the item you want to update.
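The hash-plus-binary-search idea could look roughly like this. It is a sketch under assumptions: Item gains a hypothetical nameHash field, std::hash<std::string> stands in for whatever string hash you prefer, and the helper names makeItem, sortByHash and findByName are made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct Item {
    double score;
    std::string name;
    std::size_t nameHash;  // hash of name, computed once when the item is created
};

Item makeItem(const std::string& name, double score) {
    return Item{score, name, std::hash<std::string>{}(name)};
}

// Sorts by the cached hash, so comparisons are integer compares, not string compares.
void sortByHash(std::vector<Item>& items) {
    std::sort(items.begin(), items.end(),
              [](const Item& a, const Item& b) { return a.nameHash < b.nameHash; });
}

// Requires items to be sorted by sortByHash. Returns nullptr if the name is absent.
Item* findByName(std::vector<Item>& items, const std::string& name) {
    const std::size_t h = std::hash<std::string>{}(name);
    auto it = std::lower_bound(items.begin(), items.end(), h,
                               [](const Item& item, std::size_t value) {
                                   return item.nameHash < value;
                               });
    // Different names can share a hash, so confirm with a string comparison.
    for (; it != items.end() && it->nameHash == h; ++it) {
        if (it->name == name) {
            return &*it;
        }
    }
    return nullptr;
}
```

Note that running std::nth_element on the scores afterwards reorders the vector, so the hash ordering would have to be restored with sortByHash before the next round of updates.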