Sort two-dimensional array without touching second dimension

Sort two-dimensional array without touching second dimension - c++

First, you may wonder: Why do you need to have two dimensions, just create two one-dimensional arrays.
Well, here is my task: user types number of queries and then for each query he enters ID of the task, for ID = 0 I need to use first array ([][0]) and for ID = 1 second one, would be much prettier to do it without if statement (if (ID == 0)...blablabla).
Of course there is a way to do this with sorting algorithms (bubble, quick ...), but I'm kind of curios: Can it be done with std::sort(); function?
Thanks.
P.s. Here is quick example to clarify:
I so sorry to give such a bad example I,m gonna edit it, first here goes data that I have
2 1
14 5
5 7
3 45 `
so I need to sort It like this:
2 1
3 5
5 7
14 45
so when queries are entered for example id=1 number=2 just output array[2][1];

You can represent the data as a structure of some kind
struct data{
int id;
int other_data;
...
};
To sort by ID only...
{
std::vector<data> my_vec;
//Populate my_vec
std::sort(my_vec.begin(), my_vec.end(), [](const data& d1, const data& d2){
return d1.id < d2.id;
});
}
Three benefits:
If you later need other_data_2 to also stay ordered with id, it's trivial to add it
You can avoid using complicated multi-dimensional arrays
You can avoid using arrays (by using vectors)

Related

Iterating through one variable in a vector of struct with lower/upper bound

I have 2 structs, one simply has 2 values:
struct combo {
int output;
int input;
};
And another that sorts the input element based on the index of the output element:
struct organize {
bool operator()(combo const &a, combo const &b)
{
return a.input < b.input;
}
};
Using this:
sort(myVector.begin(), myVector.end(), organize());
What I'm trying to do with this, is iterate through the input varlable, and check if each element is equal to another input 'in'.
If it is equal, I want to insert the value at the same index it was found to be equal at for input, but from output into another temp vector.
I originally went with a more simple solution (when I wasn't using a structs and simply had 2 vectors, one input and one output) and had this in a function called copy:
for(int i = 0; i < input.size(); ++i){
if(input == in){
temp.push_back(output[i]);
}
}
Now this code did work exactly how I needed it, the only issue is it is simply too slow. It can handle 10 integer inputs, or 100 inputs but around 1000 it begins to slow down taking an extra 5 seconds or so, then at 10,000 it takes minutes, and you can forget about 100,000 or 1,000,000+ inputs.
So, I asked how to speed it up on here (just the function iterator) and somebody suggested sorting the input vector which I did, implemented their suggestion of using upper/lower bound, changing my iterator to this:
std::vector<int>::iterator it = input.begin();
auto lowerIt = std::lower_bound(input.begin(), input.end(), in);
auto upperIt = std::upper_bound(input.begin(), input.end(), in);
for (auto it = lowerIt; it != upperIt; ++it)
{
temp.push_back(output[it - input.begin()]);
}
And it worked, it made it much faster, I still would like it to be able to handle 1,000,000+ inputs in seconds but I'm not sure how to do that yet.
I then realized that I can't have the input vector sorted, what if the inputs are something like:
input.push_back(10);
input.push_back(-1);
output.push_back(1);
output.push_back(2);
Well then we have 10 in input corresponding to 1 in output, and -1 corresponding to 2. Obviously 10 doesn't come before -1 so sorting it smallest to largest doesn't really work here.
So I found a way to sort the input based on the output. So no matter how you organize input, the indexes match each other based on what order they were added.
My issue is, I have no clue how to iterate through just input with the same upper/lower bound iterator above. I can't seem to call upon just the input variable of myVector, I've tried something like:
std::vector<combo>::iterator it = myVector.input.begin();
But I get an error saying there is no member 'input'.
How can I iterate through just input so I can apply the upper/lower bound iterator to this new way with the structs?
Also I explained everything so everyone could get the best idea of what I have and what I'm trying to do, also maybe somebody could point me in a completely different direction that is fast enough to handle those millions of inputs. Keep in mind I'd prefer to stick with vectors because not doing so would involve me changing 2 other files to work with things that aren't vectors or lists.
Thank you!

I think that if you sort it in smallest to largest (x is an integer after all) that you should be able to use std::adjacent_find to find duplicates in the array, and process them properly. For the performance issues, you might consider using reserve to preallocate space for your large vector, so that your push back operations don't have to reallocate memory as often.

Creating indexes of highest value in struct for top 5

Lets say i have a struct below
struct info
{
string firstname;
string lastname;
double kids;
double income;
double cars;
int index;
};
Lets say i have 500 people in this struct, each containing the information first, last name, kids, income and cars.
I created a int called index so that i can sort who has the most income from highest to least.
What method would you use or how would you go about finding the top 5 people with the most income, and giving them an index as 1,2,3,4,5 etc. So that i can tell who the top 5 are if i wished to print their names out.
I am looking for a simple method as im still learning about trees and such.
Thanks!

vector of structs. Supply a specialized function for comparison that gets called during sort.
the specialized compare function shall compare based on income (descending order)
the first top 5 elements from sorted vector should give your answer

If you just want the top 5 (and don't need them in order) you can use std::nth_element to find them. This is normally faster than sorting.
If you do want the top 5 in order, you could use std::partial_sort to do the job, something like this:
std::partial_sort(x.begin(), x.begin() + 5, x.end(),
[](auto a, auto b) { return b.income < a.income; });
Note that I've swapped the two parameters when comparing them to get it to sort in descending order instead of ascending.
I don't see a very good way to use the index field you've put into the structure. To work very well, you'd want the index separate from the data you're sorting, and you'd do an indirect sort on the indexes (that is, you'd sort the indexes based on the income for the item at that index).

Sort vector of vectors with doubles according to a column in C++

I have a matrix consisting of a vector of which each element representing the rows is composed of a vector representing the columns of the matrix. I would like to sort the rows according to the 1st column.
Each element inside this matrix is a double, although the first column contains a number that serves as an identifier (but is not unique).
My goal is to have something like the aggregate functions available in SQL, such as count() and sum() when I group by the first column.
For instance, if I have:
ID VALUE
1 10
2 20
1 30
2 40
3 60
I would like to get:
ID COUNT MEAN
1 2 20
2 2 30
3 1 60
However, I am stuck in the very first step: how do I sort the rows according to the value of the first element of each row?
I found a clue on this topic, and changed adapted the comparator to:
bool compareFunction (double i,double j)
{
return (i<j);
}
But the compiler was not very happy about that (making a reference to the stl_algo.h file):
error: cannot convert 'std::vector<double>' to 'double' in argument passing
I was therefore wondering if there is a way to sort such a vector of vectors when it contains doubles.

Answer (imho): use a different datastructure. What you are trying to do is setup a multimap. Oh hey look:
http://www.cplusplus.com/reference/map/multimap/
stl::multimap - how do i get groups of data?
It'll be faster for large numbers of elements. And is actually a map rather than a vector of vector of double.
Either that, or skip the sorting all together, and count by key using std::map, std::unordered_map, or (if you know the number of keys and/or the keys are offset by 1 with no breaks) std::vector.
To expand, sorting your list to get means will be slow. Sorting (using std::sort) is O(nlogn), and will be O(nlogn) every time you compute this mean. And it is an unessisary step: your stuff is grouped by key reguardless of order. std::map and std::multimap will "sort as you go" which will be just a little faster than sorting every time, but you won't have to sort the whole thing to get the list. Then you can just iterate the multimap to get the mean, O(n) each mean calculation. (It is still O(nlg(n)) to add all the elements to the multimap)
But if you know the key output is going to be 1,2,3...n-1,n, than sorting is a complete waste of time. Just make a counter for each key (since you know what the keys can be) and add to the key while iterating the array.
BUT WAIT THERE IS MORE
If the keys are actually setup the way you are thinking, than the best way from the get go is to forget the table structure, and make build it like this:
Index VALUE
0 10,30
1 20,40
2 60
Count is now constant time for each row. Mean for each row is O(n). Getting a list is constant time for each row. EVERYBODY WINS.

You need to create a comparator function comparing vector<double>:
struct VecComp {
bool operator()(const vector<double>& _a, const vector<double>& _b) {
//compare first elements
}
}
Then you can use std::sort on your structure with the new comparator function:
std::sort(myMat.begin(), myMat.end(), VecComp());
If you are using c++11 features you can also utilize lambda functions here:
std::sort(myMat.begin(), myMat.end(), [](const vector<double>& a, const vector<double>& b) {
//compare the first elements
}
);

You need to write your own comparator functor to pass into your vector declaration:
struct comp {
bool operator() (const std::vector<double>& i,
const std::vector<double>& j) {
return i[0] < j[0];
}

Have you tried just this?:
std::sort(vecOfVecs.begin(), vecOfVecs.end());
That should work as std::vector has operator< which provides lexicographical sorting, which is (a little more specific than) what you want.

C++: Time complexity of using STL's sort in order to sort a 2d array of integers on different columns

let's say we have the following 2d array of integers:
1 3 3 1
1 0 2 2
2 0 3 1
1 1 1 0
2 1 1 3
I was trying to create an implementation where the user could give as input the array itself and a string. An example of a string in the above example would be 03 which would mean that the user wants to sort the array based on the first and the fourth column.
So in this case the result of the sorting would be the following:
1 1 1 0
1 3 3 1
1 0 2 2
2 0 3 1
2 1 1 3
I didn't know a lot about the compare functions that are being used inside the STL's sort function, however after searching I created the following simple implementation:
I created a class called Comparator.h
class Comparator{
private:
std::string attr;
public:
Comparator(std::string attr) { this->attr = attr; }
bool operator()(const int* first, const int* second){
std::vector<int> left;
std::vector<int> right;
size_t i;
for(i=0;i<attr.size();i++){
left.push_back(first[attr.at(i) - '0']);
right.push_back(second[attr.at(i) - '0']);
}
for(i=0;i<left.size();i++){
if(left[i] < right[i]) return true;
else if(left[i] > right[i]) return false;
}
return false;
}
};
I need to know the information inside the string so I need to have a class where this string is a private variable. Inside the operator I would have two parameters first and second, each of which will refer to a row. Now having this information I create a left and a right vector where in the left vector I have only the numbers of the first row that are important to the sorting and are specified by the string variable and in the right vector I have only the numbers of the second row that are important to the sorting and are specified by the string variable.
Then I do the needed comparisons and return true or false. The user can use this class by calling this function inside the Sorting.cpp class:
void Sorting::applySort(int **data, std::string attr, int amountOfRows){
std::sort(data, data+amountOfRows, Comparator(attr));
}
Here is an example use:
int main(void){
//create a data[][] variable and fill it with integers
Sorting sort;
sort.applySort(data, "03", number_of_rows);
}
I have two questions:
First question
Can my implementation get better? I use extra variables like the left and right vectors, and then I have some for loops which brings some extra costing to the sorting operation.
Second question
Due to the extra cost, how much worse does the time complexity of the sorting become? I know that STL's sort is O(n*logn) where n is the number of integers that you want to sort. Here n has a different meaning, n is the number of rows and each row can have up to m integers which in turn can be found inside the Comparator class by overriding the operator function and using extra variables(the vectors) and for loops.
Because I'm not sure how exactly is STL's sort implemented I can only make some estimates.
My initial estimate would be O(n*m*log(n)) where m is the number of columns that are important to the sorting however I'm not 100% certain about it.
Thank you in advance

You can certainly improve your comparator. There's no need to copy the columns and then compare them. Instead of the two push_back calls, just compare the values and either return true, return false, or continue the loop according to whether they're less, greater, or equal.
The relevant part of the complexity of sort is O(n * log n) comparisons (in C++11. C++03 doesn't give quite such a good guarantee), where n is the number of elements being sorted. So provided your comparator is O(m), your estimate is OK to sort the n rows. Since attr.size() <= m, you're right.

First question: you don't need left and rigth - you add elements one by one and then iterate over the vectors in the same order. So instead of pushing values to vectors and then iterating over them, simply use the values as you generate them in the first cycle like so:
for(i=0;i<attr.size();i++){
int left = first[attr.at(i) - '0'];
int right = second[attr.at(i) - '0'];
if(left < right) return true;
else if(left > right) return false;
}
Second question: can the time complexity be improved? Not with sorting algorithm that uses direct comparison. On the other had the problem you solve here is somewhat similar to radix sort. And so I believe you should be able to do the sorting in O(n*m) where m is the number of sorting criteria.

1) Firstly to start off you should convert the string into an integer array in the constructor. With validation of values being less than the number of columns.
(You could also have another constructor that takes an integer array as a parameter.
A slight enhancement is to allow negative values to indicate that the order of the sort is reversed for that column. In this case the values would be -N..-1 , 1..N)
2) There is no need for the intermediate left, right arrays.

Algorithm: A Better Way To Calculate Frequencies of a list of words

This question is actually quite simple yet I would like to hear some ideas before jumping into coding. Given a file with a word in each line, calculating most n frequent numbers.
The first and unfortunately only thing that pops up in my mind use to use a std::map. I know fellow C++'ers will say that unordered_map would be so much reasonable.
I would like to know if anything could be added to the algorithm side or this is just basically 'whoever picks the best data structure wins' type of question. I've searched it over the internet and read that hash table and a priority queue might provide an algorithm with O(n) running time however I assume it will be to complex to implement
Any ideas?

The best data structure to use for this task is a Trie:
http://en.wikipedia.org/wiki/Trie
It will outperform a hash table for counting strings.

There are many different approaches to this question. It would finally depend on the scenario and others factors such as the size of the file (If the file has a billion lines) then a HashMapwould not be an efficient way to do it. Here are some things which you can do depending on your problem:
If you know that the number of unique words are very limited, you can use a TreeMap or in your case std::map.
If the number of words are very large then you can build a trie and keep count of various words in another data structure. This could be a heap (min/max depends on what you want to do) of size n. So you don't need to store all the words, just the necessary ones.

I would not start with std::map (or unordered_map) if I had much choice (though I don't know what other constraints may apply).
You have two data items here, and you use one as the key part of the time, but the other as the key another part of the time. For that, you probably want something like a Boost Bimap or possibly Boost MultiIndex.
Here's the general idea using Bimap:
#include <boost/bimap.hpp>
#include <boost/bimap/list_of.hpp>
#include <iostream>
#define elements(array) ((sizeof(array)/sizeof(array[0])))
class uint_proxy {
unsigned value;
public:
uint_proxy() : value(0) {}
uint_proxy& operator++() { ++value; return *this; }
unsigned operator++(int) { return value++; }
operator unsigned() const { return value; }
};
int main() {
int b[]={2,4,3,5,2,6,6,3,6,4};
boost::bimap<int, boost::bimaps::list_of<uint_proxy> > a;
// walk through array, counting how often each number occurs:
for (int i=0; i<elements(b); i++)
++a.left[b[i]];
// print out the most frequent:
std::cout << a.right.rbegin()->second;
}
For the moment, I've only printed out the most frequent number, but iterating N times to print out the N most frequent is pretty trivial.

If you are just interested in the top N most frequent words, and you don't need it to be exact, then there is a very clever structure you can use. I heard of this by way of Udi Manber, it works as follows:
You create an array of N elements, each element tracks a value and a count, you also keep a counter that indexes into this array. Additionally, you have a map from value to index into that array.
Every time you update your structure with a value (like a word from a stream of text) you first check your map to see if that value is already in your array, if it is you increment the count for that value. If it is not then you decrement the count of whatever element your counter is pointing at and then increment the counter.
This sounds simple, and nothing about the algorithm makes it seem like it will yield anything useful, but for typical real data it tends to do very well. Normally if you wish to track the top N things you might want to make this structure with the capacity of 10*N, since there will be a lot of empty values in it. Using the King James Bible as input, here is what this structure lists as the most frequent words (in no particular order):
0 : in
1 : And
2 : shall
3 : of
4 : that
5 : to
6 : he
7 : and
8 : the
9 : I
And here are the top ten most frequent words (in order):
0 : the , 62600
1 : and , 37820
2 : of , 34513
3 : to , 13497
4 : And , 12703
5 : in , 12216
6 : that , 11699
7 : he , 9447
8 : shall , 9335
9 : unto , 8912
You see that it got 9 of the top 10 words correct, and it did so using space for only 50 elements. Depending on your use case the savings on space here may be very useful. It is also very fast.
Here is the implementation of topN that I used, written in Go:
type Event string
type TopN struct {
events []Event
counts []int
current int
mapped map[Event]int
}
func makeTopN(N int) *TopN {
return &TopN{
counts: make([]int, N),
events: make([]Event, N),
current: 0,
mapped: make(map[Event]int, N),
}
}
func (t *TopN) RegisterEvent(e Event) {
if index, ok := t.mapped[e]; ok {
t.counts[index]++
} else {
if t.counts[t.current] == 0 {
t.counts[t.current] = 1
t.events[t.current] = e
t.mapped[e] = t.current
} else {
t.counts[t.current]--
if t.counts[t.current] == 0 {
delete(t.mapped, t.events[t.current])
}
}
}
t.current = (t.current + 1) % len(t.counts)
}

Given a file with a word in each line, calculating most n frequent numbers.
...
I've searched it over the internet and read that hash table and a priority queue might provide an algorithm with O(n)
If you meant the *n*s arethe same then no, this is not possible. However, if you just meant time linear in terms of the size of the input file, then a trivial implementation with a hash table will do what you want.
There might be probabilistic approximate algorithms with sublinear memory.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js