c++ finding same record in vector

I have a vector that contains monthyear values:
Jan2013
Jan2013
Jan2013
Jan2014
Jan2014
Jan2014
Jan2014
Feb2014
Feb2014
Basically, what I want to do is search through the vector and group identical records together, e.g.
total count for Jan2013 = 3;
total count for Jan2014 = 4;
total count for Feb2014 = 2;
Of course, as we know, we could simply write multiple if statements to solve it:
if(monthyear == "Jan2013") {
//add count
}
if(monthyear == "Jan2014") {
//add count
}
if(monthyear == "Feb2014") {
//add count
}
but no programmer is going to code it that way.
What if there are additional monthyear values (march2014, april2014, may2014, all the way to dec2014, then jan2015-dec2015, and so on)?
I don't think I should adopt this kind of hard-coding in the long run, so I am looking for a more dynamic approach.
I'm not asking for code, just some steps and perhaps some hints on which C++ features I should be researching.
Thanks in advance

You can use std::map. For example
std::map<std::string, size_t> m;
for ( const std::string &s : v ) ++m[s];
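For readers who want to see the map idea end to end, here is a minimal sketch. The function name count_records is made up for illustration; the only library behaviour it relies on is that std::map::operator[] value-initializes a missing entry to 0 before the increment.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: count how many times each string occurs in v.
// std::map keeps its keys in sorted (lexicographic) order.
std::map<std::string, std::size_t> count_records(const std::vector<std::string>& v) {
    std::map<std::string, std::size_t> counts;
    for (const std::string& s : v) {
        ++counts[s];  // a missing key is created with value 0, then incremented
    }
    return counts;
}
```

Iterating over the returned map yields each distinct monthyear string once, together with its total count, so no monthyear value has to be hard-coded.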

I'd probably do a std::map<monthyear, int>. For each member of your vector, increment that member of the map.

Just for completeness: the solution by @VladfromMoscow is optimal for the general case in which you have little knowledge about your input. It has O(N log N) complexity for an input of length N.
Equivalently, you can first sort your input in O(N log N), and then iterate in O(N) over the sorted input and store the counts in a std::vector<std::pair<std::string, int>>.
However, if you have a priori information on the range of your input (say you know for sure it runs from Jan 2013 until Jan 2014), you can also directly run over your input and update a pre-allocated std::vector<std::pair<std::string, int>> in O(N) complexity.
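The sort-then-scan alternative could be sketched as follows. This is only an illustration: count_sorted is a hypothetical name, and the input is taken by value so the caller's vector is left untouched.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: sort a copy of the input, then count runs of equal strings.
std::vector<std::pair<std::string, int>> count_sorted(std::vector<std::string> v) {
    std::sort(v.begin(), v.end());  // O(N log N)
    std::vector<std::pair<std::string, int>> counts;
    for (const std::string& s : v) {  // O(N) scan over the sorted data
        if (counts.empty() || counts.back().first != s) {
            counts.emplace_back(s, 1);  // first occurrence of a new value
        } else {
            ++counts.back().second;     // same value as the previous element
        }
    }
    return counts;
}
```

Note that the result is ordered lexicographically (Feb2014 before Jan2013), not chronologically.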

Related

Improve searching through unsorted list

My code spends 40% of its time searching through unsorted vectors. More specifically, the searching function my_search repeatedly receives a single unsorted vector of length N, where N can take any values between 10 and 100,000. The weights associated with each element have relatively little variance (e.g. [ 0.8, 0.81, 0.85, 0.78, 0.8, 0.7, 0.84, 0.82, ...]).
The algorithm my_search starts by summing all the weights for each object and then samples, on average, N elements (as many as the length of the vector) with replacement. The algorithm is quite similar to
int sum_of_weight = 0;
for(int i=0; i<num_choices; i++) {
sum_of_weight += choice_weight[i];
}
int rnd = random(sum_of_weight);
for(int i=0; i<num_choices; i++) {
if(rnd < choice_weight[i])
return i;
rnd -= choice_weight[i];
}
from this post.
I could sort the vector before searching, but that takes O(N log N) time (depending on the sort algorithm used), and I doubt (though I might be wrong, as I haven't tried) that I would gain much, especially as the weights have little variance.
Another solution would be to store how much weight there is before a series of points. For example, while summing the vector, every N/10 elements I could record how much weight has been summed so far. Then I could first compare rnd to these 10 breakpoints and search in only a tenth of the total length of the vector.
Would this be a good solution?
Is there a name for the process I described?
How can I estimate what is the right number of breakpoints to store as a function of N?
Is there a better solution?

log(N) Solution
{
std::vector<double> sums;
double sum_of_weight = 0;
for(int i=0; i<num_choices; i++) {
sum_of_weight += choice_weight[i];
sums.push_back(sum_of_weight);
}
std::vector<double>::iterator high = std::upper_bound(sums.begin(), sums.end(), random(sum_of_weight));
return std::distance(sums.begin(), high);
}
Essentially the same idea you have for a better way to solve it, but rather than store only a 10th of the elements, store all of them and use binary search to find the index of the one closest to your value.
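A minimal sketch of the same idea, split into two steps so that the running totals can be built once per vector and reused for every draw on that vector. The names build_prefix_sums and pick_index are hypothetical, and the random draw r is passed in by the caller (assumed to lie in [0, total weight)) rather than generated here.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <vector>

// Hypothetical helper: prefix[i] = weights[0] + ... + weights[i]. Costs O(N).
std::vector<double> build_prefix_sums(const std::vector<double>& weights) {
    std::vector<double> prefix(weights.size());
    std::partial_sum(weights.begin(), weights.end(), prefix.begin());
    return prefix;
}

// Hypothetical helper: map a draw r in [0, prefix.back()) to the index of the
// element whose weight interval contains r. Binary search, O(log N).
std::size_t pick_index(const std::vector<double>& prefix, double r) {
    auto it = std::upper_bound(prefix.begin(), prefix.end(), r);
    return static_cast<std::size_t>(std::distance(prefix.begin(), it));
}
```

If only one element is drawn per vector, building the prefix sums costs as much as the original linear scan; the binary search pays off when many draws share the same prefix vector.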
Analysis
Even though this solution is O(logN), you really have to ask yourself if it's worth it. Is it worth it to have to create an extra vector, thus accumulating extra clock cycles to store things in the vector, the time it takes for vectors to resize, the time it takes to call a function to perform binary search, etc?
As I was writing the above, I realised you can use a deque instead and that will almost get rid of the performance hit from having to resize and copy contents of vectors without affecting the O(1) lookup of vectors.
So I guess the question remains, is it worth it to copy over the elements into another container and then only do an O(logN) search?
Conclusion
TBH, I don't think you've gained much from this optimization. In fact I think you gained an overhead of O(logN).

How do I print out vectors in different order every time

I'm trying to make two vectors, where vector1 (total1) contains some strings and vector2 (total2) contains random unique numbers (between 0 and total1.size() - 1).
I want to make a program that prints out total1's strings, but in a different order every turn. I don't want to use iterators or anything like that because I want to improve my problem-solving ability.
Here is the specific function that crashes the program.
for (unsigned i = 0; i < total1.size();)
{
v1 = rand() % total1.size();
for (unsigned s = 0; s < total1.size(); ++s)
{
if (v1 == total2[s])
;
else
{
total2.push_back(v1);
++i;
}
}
}
I'm very grateful for any help that I can get!

May I suggest a change of algorithm? Even if your current one were implemented correctly (in your code, "s" must run from 0 to total2.size(), not total1.size(), and when the element is found you must break and generate a new random number), it has the following drawback: assume vectors of 1,000,000 elements and you are drawing the last random number. You have a probability of 1 in 1,000,000 of drawing a number not previously used, which is very small. For the last-but-one number the probability is 2 in 1,000,000, also small. In conclusion, your program will loop a lot and burn CPU resources.
Your best alternative is to follow @NathanOliver's suggestion and look at the function std::shuffle. Its reference page shows the implementation algorithm, which is what you are looking for.
Another simple algorithm, with some pros and cons, is:
(1) Initialize total2 with the sequence 0,1,2,...,n, where n is the size of total1 - 1.
(2) Choose two random numbers, i1 and i2, in the range [0,n].
(3) Swap elements i1 and i2 in total2.
(4) Repeat from (2) a fixed number of times "R".
This method lets you know the number of steps in advance and control the level of "randomness" of the final vector (a bigger R is more random). However, its randomness quality is far from good.
Another method, better in its probabilistic distribution:
(1) Fill a list L with the numbers 0,1,2,...,size of total1 - 1.
(2) Choose a random number i between 0 and the size of list L - 1.
(3) Store in total2 the i-th element of list L.
(4) Remove this element from L.
(5) Repeat from (2) until L is empty.
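The second method (draw an index from L, store it, remove it) could be sketched like this. It is an illustration only: random_order is a hypothetical name, std::mt19937 with std::uniform_int_distribution is used here in place of rand(), and a std::vector stands in for the list L.

```cpp
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: returns the numbers 0..n-1 in a random order.
std::vector<std::size_t> random_order(std::size_t n, std::mt19937& rng) {
    std::vector<std::size_t> pool(n);  // plays the role of the list L
    std::iota(pool.begin(), pool.end(), std::size_t{0});  // fill with 0,1,...,n-1

    std::vector<std::size_t> order;    // plays the role of total2
    order.reserve(n);
    while (!pool.empty()) {
        // Pick a position among the elements that are still unused.
        std::uniform_int_distribution<std::size_t> pick(0, pool.size() - 1);
        std::size_t i = pick(rng);
        order.push_back(pool[i]);
        pool.erase(pool.begin() + static_cast<std::ptrdiff_t>(i));  // remove it from L
    }
    return order;
}
```

Every draw succeeds on the first try because only unused numbers remain in the pool. Erasing from the middle of a vector is linear, so the whole routine is O(n^2); the Fisher-Yates shuffle avoids that cost by swapping instead of erasing.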

If you just want to shuffle vector<string> total1, you can do this without using helping vector<int> total2. Here is an implementation based on Fisher–Yates shuffle.
for(int i=n-1; i>=1; i--) {
int j=rand()%(i+1);
swap(total1[j], total1[i]); // your prof might not allow use of swap:)
}
If you must use vector<int> total2 then shuffle it using above algorithm. Next you can use it to create a new vector<string> result from total1 where result[i]=total1[total2[i]].
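A sketch of that last variant, under a few assumptions: shuffled_copy is a hypothetical name, the index vector total2 is created inside the function, and std::mt19937 replaces rand() to avoid the modulo bias of rand() % (i + 1).

```cpp
#include <cstddef>
#include <numeric>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns the strings of total1 in a random order.
std::vector<std::string> shuffled_copy(const std::vector<std::string>& total1,
                                       std::mt19937& rng) {
    // total2 starts as the identity permutation 0,1,...,n-1.
    std::vector<std::size_t> total2(total1.size());
    std::iota(total2.begin(), total2.end(), std::size_t{0});

    // Fisher-Yates: walk from the back, swapping each slot with a random
    // slot at or before it.
    for (std::size_t i = total2.size(); i > 1; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(total2[pick(rng)], total2[i - 1]);
    }

    // result[i] = total1[total2[i]]
    std::vector<std::string> result;
    result.reserve(total1.size());
    for (std::size_t idx : total2) {
        result.push_back(total1[idx]);
    }
    return result;
}
```

Seeding the generator once (for example from std::random_device) and reusing it across calls gives a different order on each call.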

Find pair of elements in integer array such that abs(v[i]-v[j]) is minimized

Let's say we have an int array with 5 elements: 1, 2, 3, 4, 5
What I need to do is find the minimum absolute value of the differences between the array's elements.
We need to check like this:
1-2 2-3 3-4 4-5
1-3 2-4 3-5
1-4 2-5
1-5
And find the minimum absolute value of these differences. We can find it with two for loops. The question is: is there an algorithm that finds the value with a single for loop?

Sort the list and subtract each pair of adjacent elements; the smallest of those differences is the answer.

The provably best performing solution is asymptotically linear, O(n), up to constant factors.
This means that the time taken is proportional to the number of the elements in the array (which of course is the best we can do as we at least have to read every element of the array, which already takes O(n) time).
Here is one such O(n) solution (which also uses O(1) space if the list can be modified in-place):
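// Requires <vector>, <cstdlib> (abs) and <climits> (INT_MAX).
// IntRadixSort is assumed to be a linear-time, in-place integer sort (not shown here).
int mindiff(vector<int>& v) // non-const reference: the vector is sorted in place
{
    IntRadixSort(v.begin(), v.end());
    int best = INT_MAX;
    // "i + 1 < v.size()" avoids the unsigned underflow of "v.size() - 1" on an empty vector
    for (size_t i = 0; i + 1 < v.size(); i++)
    {
        int diff = abs(v[i]-v[i+1]);
        if (diff < best)
            best = diff;
    }
    return best;
}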
IntRadixSort is a linear time fixed-width integer sorting algorithm defined here:
http://en.wikipedia.org/wiki/Radix_sort
The concept is that you leverage the fixed-bitwidth nature of ints by partitioning them in a series of fixed passes on the bit positions, i.e. partition them on the high bit (32nd), then on the next highest (31st), then on the next (30th), and so on, which only takes linear time.

The problem is equivalent to sorting. Any sorting algorithm could be used, and at the end, return the difference between the nearest elements. A final pass over the data could be used to find that difference, or it could be maintained during the sort. Before the data is sorted the min difference between adjacent elements will be an upper bound.
So to do it without two loops, use a sorting algorithm that does not have two loops. In a way it feels like semantics, but recursive sorting algorithms will do it with only one loop. If the issue is the n(n+1)/2 subtractions required by the simple two-loop case, you can use an O(n log n) algorithm.
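The sort-then-scan approach can be sketched with std::sort, which gives O(n log n) overall. The function name min_abs_diff and the choice to return -1 for fewer than two elements are assumptions made for this example; differences are computed in long long so that extreme int values cannot overflow.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: smallest |v[i] - v[j]| over all pairs i != j,
// or -1 if there are fewer than two elements (a convention chosen here).
long long min_abs_diff(std::vector<int> v) {  // by value: we sort a copy
    if (v.size() < 2) return -1;
    std::sort(v.begin(), v.end());
    long long best = std::numeric_limits<long long>::max();
    // After sorting, the closest pair is always adjacent.
    for (std::size_t i = 0; i + 1 < v.size(); ++i) {
        long long diff = static_cast<long long>(v[i + 1]) - v[i];  // non-negative
        best = std::min(best, diff);
    }
    return best;
}
```

The user-written code contains a single for loop; the second level of iteration is hidden inside the sort.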

No. Unless you know the list is sorted, you need two loops.

It's simple: iterate in a single for loop.
Keep the variables "minpos" and "maxpos", and "minneg" and "maxneg".
Check the sign of each value you encounter; store the maximum positive number in maxpos and the minimum positive number in minpos, and do the same for numbers less than zero with maxneg and minneg. Now put the difference maxpos - minpos in one variable and maxneg - minneg in another, and print the larger of the two. You will get the desired result.
I believe you already know how to find the max and min in one for loop.
Correction: the above finds the maximum difference. For the minimum you need to take the max and the second max instead of the max and min :)

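This might help you:
// A single for loop that still visits every pair (m, i) with m < i.
// Assumes an int array a[] with n elements; needs <cstdlib> (abs) and <climits> (INT_MAX).
int n = 5;
int subtractmin = INT_MAX;
int m = 0;                  // first index of the current pair
for (int i = 1; m < n - 1; i++) {
    if (abs(a[m] - a[i]) < subtractmin)
        subtractmin = abs(a[m] - a[i]);
    if (i == n - 1) {       // row finished: advance m and restart i
        m = m + 1;
        i = m;              // the loop's i++ turns this into m + 1
    }
}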

C++: how to compare several vectors, then make a new sorted vector that contains ALL elements of all vectors

Update: I have a couple of what are probably silly questions about commenter 6502's answer (below). If anyone could help, I'd really appreciate it.
1) I understand that data 1 and data 2 are the maps, but I don't understand what allkeys is for. Can anyone explain?
2) I know that: data1[vector1[i].name] = vector1[i].value; means assign a value to the map of interest where the correct label is... But I don't understand this: vector1[i].name and vector1[i].value. Aren't "name" and "value" two separate vectors of labels and values? So what are they doing on vector1? Shouldn't this read name[i] and value[i] instead?
Thanks everyone.
I have written code for performing a calculation. The code uses data from elsewhere. The calculation code is fine, but I'm having trouble manipulating the data.
The data exist as sets of vectors. Each set has one vector of labels (names, these are strings) and a corresponding set of values (doubles or ints).
The problem is that I need each data set to have the same name/label in the same column as the other data sets. This problem is not the same as sorting the data in the vectors (which I know how to do) because sometimes names/labels can be missing from some vectors.
For example:
Data set 1:
vector names1 = Jim, Tom, Mary
vector values1 = 1 2 3
Data set 2:
vector names2 = Tom, Mary, Joan
vector values2 = 2 3 4
I want (pseudo-code) ONE name vector that has all possible names. I also want each corresponding numbers vector to be sorted the SAME way:
vector namesUniversal = Jim, Joan, Mary, Tom
vector valuesUniversal1 = 1 0 3 2
vector valuesUniversal2 = 0 4 3 2
What I want to do is come up with a universal vector that contains ALL the labels/names sorted alphabetically and all the corresponding numerical data sorted too.
Can anyone tell me whether there is an elegant way to do this in c++? I guess I could compare each element of each name vector with each element of each other name vector, but this seems quite clunky and I would not know how to get the data into the right columns in the corresponding data vectors. Thanks for any advice.

The algorithm you are looking for is usually named "merging". Basically you sort the two data sets and then look at data in pairs: if the keys are equal then you process and output the pair, otherwise you process and advance only the smallest one.
You must also handle the case where one of the two lists ends before the other (this can be avoided by using special flag values that are guaranteed to be higher than any value you need to process).
The following is pseudocode for merging
Sort vector1
Sort vector2
Set index1 = index2 = 0;
Loop until both index1 >= vector1.size() and index2 >= vector2.size() (in other words until both vectors are exhausted)
If index1 == vector1.size() (i.e. if vector1 has been processed) then output vector2[index2++]
Otherwise if index2 == vector2.size() (i.e. if vector2 has been processed) then output vector1[index1++]
Otherwise if vector1[index1] == vector2[index2] output merged data and increment both index1 and index2
Otherwise if vector1[index1] < vector2[index2] output vector1[index1++]
Otherwise output vector2[index2++]
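The merging pseudocode above might translate to C++ roughly as follows. The types Entry and MergedRow and the function merge_by_name are invented for this sketch; it assumes both inputs are already sorted by name, contain no duplicate names, and that a missing value should be reported as 0.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, int>;  // (name, value), hypothetical layout

struct MergedRow {
    std::string name;
    int value1;  // 0 when the name is missing from the first data set
    int value2;  // 0 when the name is missing from the second data set
};

// Precondition: a and b are sorted by name and contain no duplicate names.
std::vector<MergedRow> merge_by_name(const std::vector<Entry>& a,
                                     const std::vector<Entry>& b) {
    std::vector<MergedRow> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() || j < b.size()) {
        if (i == a.size()) {                       // a is exhausted
            out.push_back({b[j].first, 0, b[j].second}); ++j;
        } else if (j == b.size()) {                // b is exhausted
            out.push_back({a[i].first, a[i].second, 0}); ++i;
        } else if (a[i].first == b[j].first) {     // same name in both sets
            out.push_back({a[i].first, a[i].second, b[j].second}); ++i; ++j;
        } else if (a[i].first < b[j].first) {      // name only in a (so far)
            out.push_back({a[i].first, a[i].second, 0}); ++i;
        } else {                                   // name only in b (so far)
            out.push_back({b[j].first, 0, b[j].second}); ++j;
        }
    }
    return out;
}
```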
However in C++ you can implement a much easier to write solution that is probably still fast enough (warning: untested code!):
std::map<std::string, int> data1, data2;
std::set<std::string> allkeys;
for (int i=0,n=vector1.size(); i<n; i++)
{
allkeys.insert(vector1[i].name);
data1[vector1[i].name] = vector1[i].value;
}
for (int i=0,n=vector2.size(); i<n; i++)
{
allkeys.insert(vector2[i].name);
data2[vector2[i].name] = vector2[i].value;
}
for (std::set<std::string>::iterator i=allkeys.begin(), e=allkeys.end();
i!=e; ++i)
{
const std::string& key = *i;
std::cout << key << data1[key] << data2[key] << std::endl;
}
The idea is to just build two maps data1 and data2 from name to values, and at the same time collecting all keys that are appearing in a std::set of keys named allkeys (adding the same name to a set multiple times does nothing).
After the collection phase this set can then be iterated to find all the names that have been observed and for each name the value can be retrieved from data1 and data2 maps (std::map<std::string, int> will return 0 when looking for the value of a name that has not been added to the map).
Technically this is overkill (it uses three balanced trees to do processing that would have required just two sort operations), but it is less code and probably acceptable anyway.
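Since the question stores names and values in separate parallel vectors (names1/values1 and so on) rather than in one vector of name/value records, here is a sketch of the same map-and-set idea adapted to that layout. The helper names to_map, universal_names and aligned_values are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Turn parallel name/value vectors into a name -> value map.
std::map<std::string, int> to_map(const std::vector<std::string>& names,
                                  const std::vector<int>& values) {
    std::map<std::string, int> m;
    for (std::size_t i = 0; i < names.size() && i < values.size(); ++i) {
        m[names[i]] = values[i];
    }
    return m;
}

// Sorted union of all names (this is the role of allkeys in the answer).
std::vector<std::string> universal_names(const std::vector<std::string>& n1,
                                         const std::vector<std::string>& n2) {
    std::set<std::string> all(n1.begin(), n1.end());
    all.insert(n2.begin(), n2.end());  // duplicates are ignored by the set
    return std::vector<std::string>(all.begin(), all.end());
}

// Values lined up with the universal names; 0 where a name is missing.
std::vector<int> aligned_values(const std::vector<std::string>& universal,
                                const std::map<std::string, int>& data) {
    std::vector<int> out;
    out.reserve(universal.size());
    for (const std::string& name : universal) {
        auto it = data.find(name);  // find() does not insert missing keys
        out.push_back(it != data.end() ? it->second : 0);
    }
    return out;
}
```

With the data from the question, universal_names yields Jim, Joan, Mary, Tom, and the two aligned value vectors come out as 1 0 3 2 and 0 4 3 2.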

6502's solution looks fine at first glance. You should probably use std::merge for the merging part though.
EDIT:
I forgot to mention that there is now also a multiway_merge extension of the STL available in the GNU version of the STL. It is a part of the parallel mode, so it resides in the namespace __gnu_parallel. If you need to do multiway merging, it will be very hard to come up with something as fast or simple to use as this.

A quick way which comes to mind is to use a map<pair<string, int>, int> and for each value store it in the map with the right key. (For example (Tom, 2) in the first values set will be under the key (Tom, 1) with value 2)
Once the map is ready iterate over it and build whatever data structure you want (Assuming the map is not enough for you).

I think you need to alter how you store this data.
It looks like you're saying each number is logically associated with the name in the same position: Jim = 1, Mary = 3, etc.
If so, and you want to stick with a vector of some kind, you could redo your data structure like so:
typedef std::pair<std::string, int> NameNumberPair;
typedef std::vector<NameNumberPair> NameNumberVector;
NameNumberVector v1;
You'll need to write your own operator< which returns based on the sort order of the underlying names. However, as Nawaz points out, a map would be a better way to represent the associated nature of the data.

how to get median value from sorted map

I am using a std::map. Sometimes I will do an operation like: finding the median value of all items. e.g
if I add
1 "s"
2 "sdf"
3 "sdfb"
4 "njw"
5 "loo"
then the median is 3.
Is there some solution without iterating over half the items in the map?

I think you can solve the problem by using two std::maps: one for the smaller half of the items (mapL) and a second for the other half (mapU). An insert operation is one of two cases:
add the item to mapU and move its smallest element to mapL
add the item to mapL and move its greatest element to mapU
If the maps have different sizes and you insert the element into the one with fewer elements, you skip the move step.
The basic idea is that you keep your maps balanced so the maximum size difference is 1 element.
As far as I know, all of these STL operations run in O(log n) time. The smallest and greatest elements of a map can be accessed through iterators.
When you get a middle-position query, just check the map sizes and return the greatest element in mapL or the smallest element in mapU.
The scenario above covers insertion only. You can extend it to deletion as well, but you have to keep track of which map holds an item, or try to delete from both.
Here is my code with sample usage:
#include <iostream>
#include <string>
#include <map>
using namespace std;
typedef pair<int,string> pis;
typedef map<int,string>::iterator itis;
map<int,string>Left;
map<int,string>Right;
itis get_last(map<int,string> &m){
return (--m.end());
}
int add_element(int key, string val){
if (Left.empty()){
Left.insert(make_pair(key,val));
return 1;
}
pis maxl = *get_last(Left);
if (key <= maxl.first){
Left.insert(make_pair(key,val));
if (Left.size() > Right.size() + 1){
itis to_rem = get_last(Left);
pis cpy = *to_rem;
Left.erase(to_rem);
Right.insert(cpy);
}
return 1;
} else {
Right.insert(make_pair(key,val));
if (Right.size() > Left.size()){
itis to_rem = Right.begin();
pis cpy = *to_rem;
Right.erase(to_rem);
Left.insert(cpy); // to_rem is invalid after erase, so insert the saved copy
}
return 2;
}
}
pis get_mid(){
int size = Left.size() + Right.size();
if (Left.size() >= size / 2){
return *(get_last(Left));
}
return *(Right.begin());
}
int main(){
Left.clear();
Right.clear();
int key;
string val;
while (cin >> key >> val){ // stops cleanly at end of input or on a failed read
add_element(key,val);
pis mid = get_mid();
cout << "mid " << mid.first << " " << mid.second << endl;
}
}

I think the answer is no. You cannot just jump to the N / 2 item past the beginning because a std::map uses bidirectional iterators. You must iterate through half of the items in the map. If you had access to the underlying Red/Black tree implementation that is typically used for the std::map, you might be able to get close like in Dani's answer. However, you don't have access to that as it is encapsulated as an implementation detail.

Try:
typedef std::map<int,std::string> Data;
Data data;
Data::iterator median = data.begin();
std::advance(median, data.size() / 2); // std::advance returns void and moves the iterator in place
Works if the size() is odd. I'll let you work out how to do it when size() is even.

In a self-balancing binary tree (std::map is one, I think) a good approximation would be the root.
For the exact value, cache the median together with a balance indicator: each time an item is added below the median, decrease the indicator, and increase it when an item is added above. When the indicator reaches 2 or -2, move the median one step up or down and reset the indicator.

If you can switch data structures, store the items in a std::vector and sort it. That will enable accessing the middle item positionally without iterating. (It can be surprising, but a sorted vector often outperforms a map, due to locality. For lookups by the sort key you can use binary search and it will have much the same performance as a map anyway. See Scott Meyers's Effective STL.)

If you know the map will be sorted, then get the element at floor(length / 2). If you're in a bit twiddly mood, try (length >> 1).

I know no way to get the median from a pure STL map quickly for big maps. If your map is small or you need the median rarely you should use the linear advance to n/2 anyway I think - for the sake of simplicity and being standard.
You can use the map to build a new container that offers a median: Jethro suggested using two maps; based on this, perhaps better would be a single map and a continuously updated median iterator. These methods suffer from the drawback that you have to reimplement every modifying operation, and in Jethro's case even the reading operations.
A custom-written container will also do what you want, probably most efficiently, but at the price of custom code. You could try, as was suggested, to modify an existing STL map implementation. You can also look for existing implementations.
There is a super efficient C implementation that offers most map functionality and also random access called Judy Arrays. These work for integer, string and byte array keys.

Since it sounds like insert and find are your two common operations while median is rare, the simplest approach is to use the map and std::advance( m.begin(), m.size()/2 ); as originally suggested by David Rodríguez. This is linear time, but easy to understand so I'd only consider another approach if profiling shows that the median calls are too expensive relative to the work your app is doing.
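A minimal sketch of that linear-time approach, using std::next (C++11), which returns the advanced iterator instead of modifying one in place. The function name median_entry is hypothetical, and the map is assumed to be non-empty; for an even number of elements this picks the upper of the two middle entries.

```cpp
#include <cstddef>
#include <iterator>
#include <map>
#include <string>

// Hypothetical helper. Precondition: m is not empty.
// Walks size()/2 steps from the beginning, so the cost is linear in the map size.
std::map<int, std::string>::const_iterator
median_entry(const std::map<int, std::string>& m) {
    return std::next(m.begin(), static_cast<std::ptrdiff_t>(m.size() / 2));
}
```

For the five entries in the question (keys 1 to 5), the returned iterator points at key 3.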

The nth_element() method is there for you for this :) It implements the partition part of the quick sort and you don't need your vector (or array) to be sorted.
And also the time complexity is O(n) (while for sorting you need to pay O(nlogn)).
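As a rough illustration for plain int values held in a vector (not for the map itself), std::nth_element can be used like this. The name median_of is hypothetical, the vector is assumed to be non-empty, and it is taken by value because nth_element reorders its input.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper. Precondition: v is not empty.
// For an even number of elements this returns the upper of the two middle values.
int median_of(std::vector<int> v) {  // by value: nth_element reorders the data
    auto mid = v.begin() + static_cast<std::ptrdiff_t>(v.size() / 2);
    // After this call *mid is the element that would be at this position in a
    // fully sorted vector; the rest is only partitioned around it.
    std::nth_element(v.begin(), mid, v.end());
    return *mid;
}
```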

For a sorted list, here it is in Java code, but I assume it's very easy to port to C++:
if (input.length % 2 != 0) {
return input[((input.length + 1) / 2 - 1)];
} else {
return 0.5d * (input[(input.length / 2 - 1)] + input[(input.length / 2 + 1) - 1]);
}
