C++ Equivalent of VLOOKUP function - c++

I'm trying to create an equivalent of the excel VLOOKUP function for a two dimensional csv file I have. If given a number 5 I would like to be able to look at a column of a dynamic table I have and find the row with the highest number less than five in that column.
For example, if I looked up 5 in the following table:
2 6
3 7
4 11
6 2
9 4
it would return 11, the data paired with the highest entry below 5.
I have no idea how to go about doing this. If it helps, the entries in column one (the column I will be searching) will go from smallest to largest.
I am a beginner to C++ so I apologize if I'm missing some obvious method.

std::map can do this pretty easily:
You'd start by creating a map of the correct type, then populating it with your data:
std::map<int, int, std::greater<int> > data;
data[2] = 6;
data[3] = 7;
data[4] = 11;
data[6] = 2;
data[9] = 4;
Then you'd search for data with lower_bound or upper_bound:
std::cout << data.lower_bound(5)->second; // prints 11
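A couple of notes: First, note the use of std::greater<T> as the comparison function. This is necessary because, with the default ordering, lower_bound returns an iterator to the next larger item (instead of the previous one) if the key you're looking for isn't present in the map. Using std::greater<T> sorts the map in reverse, so the "next" item is the smaller one instead of the larger. Be aware that lower_bound(5) returns the entry for 5 itself if that key is present; use upper_bound(5) if you need a key strictly less than 5. In either case, compare the result against data.end() before dereferencing it, since there may be no smaller key at all.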
Second, note that this automatically sorts the data based on the keys, so it depends only on the data you insert, not the order of insertion.
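For comparison, here is a minimal sketch of the same lookup on a map with the default ordering, assuming a strictly smaller key is wanted. The function name lookup_below and its out-parameter are hypothetical, chosen only for illustration.

```cpp
#include <iterator>
#include <map>

// Finds the value paired with the largest key strictly less than `key`.
// Returns false (and leaves `out` untouched) when no such key exists.
bool lookup_below(const std::map<int, int>& table, int key, int& out) {
    // lower_bound gives the first entry whose key is >= `key`;
    // the entry just before it, if any, has the largest key < `key`.
    std::map<int, int>::const_iterator it = table.lower_bound(key);
    if (it == table.begin()) {
        return false;  // every key is >= `key`, or the table is empty
    }
    out = std::prev(it)->second;
    return true;
}
```

With the table from the question, looking up 5 stores 11 in out, and looking up 2 returns false because no key is smaller than 2.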

Related

Generate unique combinations from multiple arrays/vectors

I have 800 data files and each file contain 8 lines of integer eg
17,1,2,3,4,5,6,7,10,11,12,13,15,16,20,22,24,26,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16,1,2,3,4,5,6,7,8,9,10,11,12,16,17,21,26,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
23,4,5,6,7,8,9,10,12,13,14,15,16,17,18,19,20,23,25,26,28,29,35,36,,,,,,,,,,,,,,,,,,,,,,,,,,
27,8,9,11,12,13,14,15,17,19,20,21,22,23,24,26,27,28,29,30,31,32,34,37,39,40,41,42,,,,,,,,,,,,,,,,,,,,,,
27,14,16,17,18,19,20,22,23,24,25,26,27,28,29,30,31,32,33,35,36,37,38,39,40,42,43,44,,,,,,,,,,,,,,,,,,,,,,
24,20,24,26,27,28,29,30,31,32,33,34,35,36,37,39,40,41,42,43,44,45,46,47,48,,,,,,,,,,,,,,,,,,,,,,,,,
16,33,34,35,36,37,38,39,41,42,43,44,45,46,47,48,49,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
14,35,37,38,39,40,41,42,43,44,45,46,47,48,49,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Each line has 50 elements. The 1st element of each line is the number count, i.e. the 17 at the start of line 1 indicates that there are 17 numbers in this line: 1,2,3,4,5,6,7,10,11,12,13,15,16,20,22,24,26. The numbers in each line are unique, in ascending order, and within the range 1~49.
My task is to generate a list of unique 8-number combinations from these 8 lines
i.e. A,B,C,D,E,F,G,H
A from line 1, B from line 2 ... H from line 8
24,517,914,624 (17*16*23*27*27*24*16*14) entries will be generated:
1,1,4,8,14,20,33,35
...
1,1,4,8,14,20,33,49
....
1,2,4,8,14,20,33,35
...
2,1,4,8,14,20,33,35
...
And then process the 24,517,914,624 entries list as follow:
i) remove entries with duplicate numbers e.g. 1,1,4,8,14,20,33,35 and 1,1,4,8,14,20,33,49 will be removed
ii) sort number in each entry in ascending order e.g. 2,1,4,8,14,20,33,35 will become 1,2,4,8,14,20,33,35
iii) remove duplicated entries e.g. 2,1,4,8,14,20,33,35 is same as 1,2,4,8,14,20,33,35 after sorted, therefore only 1 entry of 1,2,4,8,14,20,33,35 will be kept
After the above process, maybe around 10 million entries are left (which is the result I want).
However, processing a 24,517,914,624-entry array is a nearly impossible task,
so I tried the following 2 approaches to tackle the problem (removing entries with duplicate numbers and sorting the numbers in each entry as early as possible).
1) Brute force approach: use 8 nested for loops to generate the combinations:
// Index 0 of each line holds the count, so the numbers sit at indexes 1..count.
for (int i = 1; i <= LineArr[0][0]; i++) {
for (int j = 1; j <= LineArr[1][0]; j++) {
for (int k = 1; k <= LineArr[2][0]; k++) {
for (int l = 1; l <= LineArr[3][0]; l++) {
for (int m = 1; m <= LineArr[4][0]; m++) {
for (int n = 1; n <= LineArr[5][0]; n++) {
for (int o = 1; o <= LineArr[6][0]; o++) {
for (int p = 1; p <= LineArr[7][0]; p++) {
MyRes[0] = LineArr[0][i];
MyRes[1] = LineArr[1][j];
MyRes[2] = LineArr[2][k];
MyRes[3] = LineArr[3][l];
MyRes[4] = LineArr[4][m];
MyRes[5] = LineArr[5][n];
MyRes[6] = LineArr[6][o];
MyRes[7] = LineArr[7][p];
// Sort the numbers in MyRes and discard it if it contains duplicate numbers
// store valid combination in a temp array/vector
}}}}}}}}
// remove duplicate entries in the temp array/vector ('unique' the temp array)
2) Stepwise approach
Instead of generating 8-number combinations at once, generate 2-number combinations from the first 2 lines, sort the numbers in each entry, remove entries with duplicate numbers, and deduplicate the list.
The output will be something like this:
1,2
1,3
1,4
Entries such as 1,1 and 2,2 are removed, 4,1 becomes 1,4, and duplicated entries are removed.
Then the above list is combined with line 3 to form 3-number combinations, which are again sorted, stripped of entries with duplicate numbers, and deduplicated.
Apply the same step to lines 4, 5, 6 ... 8 to form 4-, 5-, 6- ... 8-number combinations.
Since this is part of an automation project, AutoIt is used throughout the project (those 800 files are from another 3rd party software), so I first tried to implement the combination generation in AutoIt.
Technically, approach 1) generates 24,517,914,624 entries, sorts the numbers in each entry right after generation, and discards entries with duplicate numbers.
This approach takes forever to run since it involves billions of entries to test/sort, and its array size is much higher than AutoIt's array size limit (16 million). Therefore approach 1) can be discarded;
it is only suitable for (at most) 5-number combinations (e.g. 1,3,7,14,23).
For approach 2), I tried 2 variations:
i) Store the result of each step in a temp array and use AutoIt's _ArrayUnique function to remove duplicate entries. This also takes forever to run!
ii) Instead of storing the result in a temp array, I make use of SQLite, i.e. put the combinations generated in each step into a single-column table in SQLite that is created with PRIMARY KEY UNIQUE. Then I select the rows back into AutoIt for further processing.
Variation ii) eventually works, but it takes 1 hr 20 min to handle 1 file (and I have 800 such files).
Now I plan to implement the combination generation in VC++ (VS 2017) and I have the following questions:
1) Apart from "Brute force" and "Stepwise", is there any other approach/algorithm to generate unique combinations from multiple arrays/vectors that performs better?
2) To sort the numbers in each entry and check for repeated numbers in each entry, I think std::sort and std::search/std::find will do the job. However, since there will be millions of entries to check, are there any faster options?
3) To remove duplicate entries (i.e. get unique combinations), should I use std::unique or still rely on SQLite? The array may be as large as 30~40 million entries and shrink to 10 million after std::sort and std::unique or a SELECT from SQLite (I don't know which implementation performs better).
4) Is there any ready-made library that can ease the task?
Thanks a lot.
Regds
LAM Chi-fung
I just found out about std::set, and its sorted, unique-element behavior suits my need. I implemented the stepwise approach with it and the program runs very fast. The only problem is that it easily runs out of memory after row 6, so I combine it with SQLite: after working on 6 rows, I discard the std::set and store the combined result in an SQLite table (a single-column table with PRIMARY KEY UNIQUE). This may not be a perfect solution, but it is workable.
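A minimal sketch of the stepwise approach with std::set follows. It is not the poster's code, and it rests on one assumption: since every number lies in 1~49, a combination of distinct numbers can be stored as a 64-bit mask instead of a sorted list, which makes the per-entry sort unnecessary. The names combine_lines and mask_to_numbers are hypothetical, and each inner vector is assumed to hold only the numbers of a line, without the leading count.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Each combination is a 64-bit mask: bit v set means number v is present.
// A mask does not depend on the order the numbers were picked in, so the
// std::set removes reordered duplicates automatically.
std::set<std::uint64_t> combine_lines(const std::vector<std::vector<int>>& lines) {
    std::set<std::uint64_t> current{0};  // start from the empty combination
    for (const std::vector<int>& line : lines) {
        std::set<std::uint64_t> next;
        for (std::uint64_t mask : current) {
            for (int v : line) {
                assert(v >= 1 && v <= 49);  // range stated in the question
                const std::uint64_t bit = std::uint64_t{1} << v;
                if (mask & bit) continue;  // v is already used: repeated number
                next.insert(mask | bit);   // inserting an existing mask is a no-op
            }
        }
        current.swap(next);
    }
    return current;
}

// Expands a mask back into its numbers in ascending order.
std::vector<int> mask_to_numbers(std::uint64_t mask) {
    std::vector<int> out;
    for (int v = 1; v <= 49; ++v) {
        if (mask & (std::uint64_t{1} << v)) out.push_back(v);
    }
    return out;
}
```

Calling combine_lines on the lines {1,2} and {1,2,3} leaves the three combinations 1,2 and 1,3 and 2,3. Each entry costs 8 bytes of payload, though a std::set node adds its own overhead, so whether this avoids the out-of-memory problem for all 8 lines would need to be measured.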

Determine unique values across multiple sets

In this project, there are multiple sets that hold values from 1 to 9. I need to efficiently determine whether there are values that are unique to one set, i.e. that appear in no other set.
For Example:
std::set<int> s_1 = { 1, 2, 3, 4, 5 };
std::set<int> s_2 = { 2, 3, 4 };
std::set<int> s_3 = { 2, 3, 4, 6 };
Note: The number of sets is unknown until runtime.
As you can see, s_1 contains the unique value of 1 and 5 and s_3 contains the unique value of 6.
After determining the unique values, the aforementioned sets should then just contain the unique values like:
// s_1 { 1, 5 }
// s_2 { 2, 3, 4 }
// s_3 { 6 }
What I've tried so far is to loop through all the sets and record the count of the numbers that have appeared. However I wanted to know if there is a more efficient solution out there.
The C++ standard library has algorithms for intersection, difference, and union operations on two sorted ranges (std::set_intersection, std::set_difference, std::set_union).
If I understood your problem correctly, you could do this:
do an intersection on all sets (in a loop) to determine a base, and then apply a difference between each set and the base.
You could benchmark this against your current implementation. It should be faster.
Check out this answer.
Getting Union, Intersection, or Difference of Sets in C++
EDIT: cf Tony D. comment : You can basically do the same operation using a std::bitset<> and binary operators (& | etc..), which should be faster.
Depending on the actual size of your input, might be well worth a try.
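As a rough sketch of the std::bitset idea, the following tracks which values were seen once and which were seen more than once. This is a variation on the counting approach rather than the intersection/difference scheme described above, and it assumes all values are in 1..9. The name keep_unique_values is hypothetical. Sets without any unique value are left untouched, which is one reading of the s_2 line in the question.

```cpp
#include <bitset>
#include <cstddef>
#include <set>
#include <vector>

using Mask = std::bitset<10>;  // bit v represents value v, for v in 1..9

// Reduces each set to the values that appear in no other set.
// A set that has no such value keeps its original contents.
void keep_unique_values(std::vector<std::set<int>>& sets) {
    Mask seen;      // values found in at least one set
    Mask repeated;  // values found in two or more sets
    std::vector<Mask> masks(sets.size());
    for (std::size_t i = 0; i < sets.size(); ++i) {
        for (int v : sets[i]) masks[i].set(static_cast<std::size_t>(v));
        repeated |= seen & masks[i];
        seen |= masks[i];
    }
    const Mask unique = seen & ~repeated;
    for (std::size_t i = 0; i < sets.size(); ++i) {
        const Mask own = masks[i] & unique;
        if (own.none()) continue;  // nothing unique here: leave the set as it is
        std::set<int> reduced;
        for (int v = 1; v <= 9; ++v) {
            if (own.test(static_cast<std::size_t>(v))) reduced.insert(v);
        }
        sets[i].swap(reduced);
    }
}
```

For the three sets in the question this yields { 1, 5 }, { 2, 3, 4 } and { 6 }. The work is one pass to build the masks and one pass to rewrite the sets, regardless of how many sets there are.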
I would suggest something in C# like this:
Dictionary<int, int> result = new Dictionary<int, int>();
foreach (var set in sets) {
foreach (int i in set) {
if (!result.ContainsKey(i))
result.Add(i, 1);
else
result[i] = result[i] + 1;
}
}
Now the numbers with a count of exactly 1 are unique; then find the sets that contain these numbers.
I would suggest:
Start by inserting all the elements of all the sets into a multimap.
Here each element is a key and the set name will be the value.
Once your multimap is filled with all the elements of all the sets, loop through the multimap and count the entries for each key.
If the count is 1 for any key, this means it is unique, and its value will be the name of the set that contains it.

Which dataset should I use?

The title may have been a bit vague, but I will appreciate some ideas for the current problem I have.
Here is a dataset:
1 1/1/2013
2 1/1/2013
3 1/1/2013
1 1/2/2013
2 1/2/2013
1 1/3/2013
2 1/3/2013
3 1/3/2013
So, I begin with the first record, and see if there is another 1 in my list. If there is, I ignore it, and go back to the second record. If there is another 2 in my list, I ignore it, and go back to the 3rd record, and so on and so forth.
Now, the desired result of this list, that I am looking for is <1, 1/3/2013>, since no other record of 1 exists below it.
Similarly, in this dataset:
1 1/1/2013
2 1/1/2013
3 1/1/2013
1 1/2/2013
2 1/2/2013
3 1/2/2013
4 1/2/2013
1 1/3/2013
2 1/3/2013
3 1/3/2013
The desired result would be <4, 1/2/2013>, since there is no other occurrence of 4 down the list.
My question is, how would I go about doing this, and which standard container can I use? Furthermore, these are the results returned by a query.
I am sorry, I don't use Boost or any other libraries, and I am looking to get this done with std types.
You can use two maps - one map to store mapping from the key (your first column) to the value (your second column) and second map to store mapping from the key (your first column) to the record number:
std::map<int, std::string> m1;
std::map<int, int> m2;
int counter = 0;
while (...)
{
<...get record...>
m1[record.key] = record.value;
m2[record.key] = counter++;
}
Then you need to scan the second map m2 in order to find the key with minimal position:
int keyMin = <...big number...>, posMin = <...big number...>;
for (std::map<int, int>::const_iterator it = m2.begin(); it != m2.end(); ++it)
{
if (it->second < posMin)
{
keyMin = it->first;
posMin = it->second;
}
}
The result will be the first key, for which there are no records with this key down the road. Using this key and the first map m1 you'll be able to find its corresponding value.
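A runnable sketch of this idea is shown below. It assumes the query results have already been read into a vector of (key, date) pairs, and it keeps only the key-to-position map, because the record at the winning position already carries the value. The names Record and first_key_to_end are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Record = std::pair<int, std::string>;  // (key, date) as in the question

// Returns the record whose key stops appearing earliest in the list,
// i.e. the key whose last occurrence has the smallest position.
Record first_key_to_end(const std::vector<Record>& records) {
    assert(!records.empty());  // there is no answer for an empty list
    std::map<int, std::size_t> last_pos;  // key -> index of its last occurrence
    for (std::size_t i = 0; i < records.size(); ++i) {
        last_pos[records[i].first] = i;  // later occurrences overwrite earlier ones
    }
    std::size_t best = records.size();
    for (const auto& entry : last_pos) {
        if (entry.second < best) best = entry.second;
    }
    return records[best];
}
```

For the first dataset in the question this returns <1, 1/3/2013>, and for the second one <4, 1/2/2013>.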
You can scan from the bottom and remember the first appearance (the last when counting from the top) of each index. After you've done this (in O(n) time), you can take the last one you found.
What does the query return? You can choose std::vector<some-structure> if it returns a known structure, or std::vector<std::vector<std::string> > if it returns a string list.
Then, going from the bottom and remembering all unique ids that you see, you are able to get the last good value in O(n) time and O(n) memory.

Efficient method for finding the middle k-combination of n sorted elements

Suppose that I have a sorted array, N, consisting of n elements. Now, given k, I need a highly efficient method to generate the k-combination that would be the middle combination (if all the k-combinations were lexicographically sorted).
Example:
N = {a,b,c,d,e} , k = 3
1: a,b,c
2: a,b,d
3: a,b,e
4: a,c,d
5: a,c,e
6: a,d,e
7: b,c,d
8: b,c,e
9: b,d,e
10: c,d,e
I need the algorithm to generate combination number 5.
The Wikipedia page on the combinatorial number system explains how this can be obtained (in a greedy way). However, since n is very large and I need to find the middle combination for all k's less than n, I need something much more efficient than that.
I'm hoping that since the combination of interest always lies in the middle, there is some sort of a straightforward method for finding it. For example, the first k-combination in the above list is always given by the first k elements in N, and similarly the last combination is always given by the last k elements. Is there such a way to find the middle combination as well?
http://en.wikipedia.org/wiki/Combinatorial_number_system
If you are looking for a way to obtain the K-indexes from the lexicographic index or rank of a unique combination, then your problem falls under the binomial coefficient. The binomial coefficient handles problems of choosing unique combinations in groups of K with a total of N items.
I have written a class in C# to handle common functions for working with the binomial coefficient. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters.
Converts the K-indexes to the proper lexicographic index or rank of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle and is very efficient compared to iterating over the set.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes. The technique used is also much faster than older iterative solutions.
Uses Mark Dominus method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to use the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with several cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coeffieicent.
The following tested code will calculate the median lexicographic element for any N Choose K combination:
void TestMedianMethod()
{
// This test driver tests out the GetMedianNChooseK method.
GetMedianNChooseK(5, 3); // 5 choose 3 case.
GetMedianNChooseK(10, 3); // 10 choose 3 case.
GetMedianNChooseK(10, 5); // 10 choose 5 case.
}
private void GetMedianNChooseK(int N, int K)
{
// This method calculates the median lexicographic index and the k-indexes for that index.
String S;
// Create the bin coeff object required to get all
// the combos for this N choose K combination.
BinCoeff<int> BC = new BinCoeff<int>(N, K, false);
int NumCombos = BinCoeff<int>.GetBinCoeff(N, K);
// Calculate the median value, which in this case is the number of combos for this N
// choose K case divided by 2.
int MedianValue = NumCombos / 2;
// The Kindexes array holds the indexes for the specified lexicographic element.
int[] KIndexes = new int[K];
// Get the k-indexes for this combination.
BC.GetKIndexes(MedianValue, KIndexes);
StringBuilder SB = new StringBuilder();
for (int Loop = 0; Loop < K; Loop++)
{
SB.Append(KIndexes[Loop].ToString());
if (Loop < K - 1)
SB.Append(" ");
}
// Print out the information.
S = N.ToString() + " choose " + K.ToString() + " case:\n";
S += " Number of combos = " + NumCombos.ToString() + "\n";
S += " Median Value = " + MedianValue.ToString() + "\n";
S += " KIndexes = " + SB.ToString() + "\n\n";
Console.WriteLine(S);
}
Output:
5 choose 3 case:
Number of combos = 10
Median Value = 5
KIndexes = 4 2 0
10 choose 3 case:
Number of combos = 120
Median Value = 60
KIndexes = 8 3 1
10 choose 5 case:
Number of combos = 252
Median Value = 126
KIndexes = 9 3 2 1 0
You should be able to port this class over fairly easily to the language of your choice. You probably will not have to port over the generic part of the class to accomplish your goals. Depending on the number of combinations you are working with, you might need to use a bigger word size than 4 byte ints.
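For readers who want to see the index-to-combination step without the class, here is a minimal C++ sketch. It is not a port of the class above: it uses the straightforward greedy unranking over blocks of binomial coefficients, numbers elements from 0 in ascending order, and takes the middle to be 0-based position (C(n,k) - 1) / 2, which matches combination number 5 of 10 in the question. The names binom, unrank and middle_combination are hypothetical, and 64-bit arithmetic will overflow for large n.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Binomial coefficient C(n, k). After step i the running value equals
// C(n - k + i, i), so every division is exact.
std::uint64_t binom(unsigned n, unsigned k) {
    if (k > n) return 0;
    if (k > n - k) k = n - k;
    std::uint64_t result = 1;
    for (unsigned i = 1; i <= k; ++i) {
        result = result * (n - k + i) / i;
    }
    return result;
}

// Returns the 0-based element indexes of the combination at 0-based
// lexicographic position `rank` among all k-subsets of {0, ..., n-1}.
std::vector<unsigned> unrank(unsigned n, unsigned k, std::uint64_t rank) {
    assert(k <= n && rank < binom(n, k));
    std::vector<unsigned> combo;
    unsigned x = 0;  // smallest index that is still available
    for (unsigned i = 0; i < k; ++i) {
        // Skip each block of combinations whose next element is smaller.
        for (;; ++x) {
            const std::uint64_t block = binom(n - 1 - x, k - 1 - i);
            if (rank < block) break;
            rank -= block;
        }
        combo.push_back(x++);
    }
    return combo;
}

std::vector<unsigned> middle_combination(unsigned n, unsigned k) {
    return unrank(n, k, (binom(n, k) - 1) / 2);
}
```

For n = 5 and k = 3 this gives the indexes 0, 2, 4, that is a,c,e from the question's list. Each call evaluates on the order of n binomial coefficients, so it does not by itself address the question's concern about very large n.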

C++: how to compare several vectors, then make a new sorted vector that contains ALL elements of all vectors

Update: I have a couple of what are probably silly questions about commenter 6502's answer (below). If anyone could help, I'd really appreciate it.
1) I understand that data1 and data2 are the maps, but I don't understand what allkeys is for. Can anyone explain?
2) I know that data1[vector1[i].name] = vector1[i].value; means "assign a value to the map of interest under the correct label", but I don't understand vector1[i].name and vector1[i].value. Aren't "name" and "value" two separate vectors of labels and values? So what are they doing on vector1? Shouldn't this read name[i] and value[i] instead?
Thanks everyone.
I have written code for performing a calculation. The code uses data from elsewhere. The calculation code is fine, but I'm having trouble manipulating the data.
The data exist as sets of vectors. Each set has one vector of labels (names, these are strings) and a corresponding set of values (doubles or ints).
The problem is that I need each data set to have the same name/label in the same column as the other data sets. This problem is not the same as sorting the data in the vectors (which I know how to do) because sometimes names/labels can be missing from some vectors.
For example:
Data set 1:
vector names1 = Jim, Tom, Mary
vector values1 = 1 2 3
Data set 2:
vector names2 = Tom, Mary, Joan
vector values2 = 2 3 4
I want (pseudo-code) ONE name vector that has all possible names. I also want each corresponding numbers vector to be sorted the SAME way:
vector namesUniversal = Jim, Joan, Mary, Tom
vector valuesUniversal1 = 1 0 3 2
vector valuesUniversal2 = 0 4 3 2
What I want to do is come up with a universal vector that contains ALL the labels/names sorted alphabetically and all the corresponding numerical data sorted too.
Can anyone tell me whether there is an elegant way to do this in c++? I guess I could compare each element of each name vector with each element of each other name vector, but this seems quite clunky and I would not know how to get the data into the right columns in the corresponding data vectors. Thanks for any advice.
The algorithm you are looking for is usually named "merging". Basically you sort the two data sets and then look at data in pairs: if the keys are equal then you process and output the pair, otherwise you process and advance only the smallest one.
You must also handle the case where one of the two lists ends before the other (this can be avoided by using special flag values that are guaranteed to be higher than any value you need to process).
The following is pseudocode for merging
Sort vector1
Sort vector2
Set index1 = index2 = 0;
Loop until both index1 >= vector1.size() and index2 >= vector2.size() (in other words until both vectors are exhausted)
If index1 == vector1.size() (i.e. if vector1 has been processed) then output vector2[index2++]
Otherwise if index2 == vector2.size() (i.e. if vector2 has been processed) then output vector1[index1++]
Otherwise if vector1[index1] == vector2[index2] output merged data and increment both index1 and index2
Otherwise if vector1[index1] < vector2[index2] output vector1[index1++]
Otherwise output vector2[index2++]
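The steps above can be sketched in C++ roughly as follows, assuming each data set is held as a vector of (name, value) pairs with no name repeated inside one set, and that a missing value is reported as 0 as in the question. The names Entry, Row and merge_by_name are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, int>;  // (name, value)

struct Row {
    std::string name;
    int value1;  // 0 when the name is missing from the first data set
    int value2;  // 0 when the name is missing from the second data set
};

// Sorts both data sets by name and walks them in parallel.
std::vector<Row> merge_by_name(std::vector<Entry> a, std::vector<Entry> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    std::vector<Row> rows;
    std::size_t i = 0, j = 0;
    while (i < a.size() || j < b.size()) {
        if (i == a.size()) {                       // first set exhausted
            rows.push_back({b[j].first, 0, b[j].second}); ++j;
        } else if (j == b.size()) {                // second set exhausted
            rows.push_back({a[i].first, a[i].second, 0}); ++i;
        } else if (a[i].first == b[j].first) {     // same name in both sets
            rows.push_back({a[i].first, a[i].second, b[j].second}); ++i; ++j;
        } else if (a[i].first < b[j].first) {      // name only in the first set
            rows.push_back({a[i].first, a[i].second, 0}); ++i;
        } else {                                   // name only in the second set
            rows.push_back({b[j].first, 0, b[j].second}); ++j;
        }
    }
    return rows;
}
```

With the two data sets from the question, the rows come out as Jim (1, 0), Joan (0, 4), Mary (3, 3), Tom (2, 2).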
However in C++ you can implement a much easier to write solution that is probably still fast enough (warning: untested code!):
std::map<std::string, int> data1, data2;
std::set<std::string> allkeys;
for (int i=0,n=vector1.size(); i<n; i++)
{
allkeys.insert(vector1[i].name);
data1[vector1[i].name] = vector1[i].value;
}
for (int i=0,n=vector2.size(); i<n; i++)
{
allkeys.insert(vector2[i].name);
data2[vector2[i].name] = vector2[i].value;
}
for (std::set<std::string>::iterator i=allkeys.begin(), e=allkeys.end();
i!=e; ++i)
{
const std::string& key = *i;
std::cout << key << data1[key] << data2[key] << std::endl;
}
The idea is to just build two maps, data1 and data2, from names to values, and at the same time collect every key that appears into a std::set of keys named allkeys (adding the same name to a set multiple times does nothing).
After the collection phase this set can be iterated to find all the names that have been observed, and for each name the value can be retrieved from the data1 and data2 maps (operator[] on a std::map<std::string, int> inserts and returns 0 for a name that has not been added to the map).
Technically this is somewhat overkill (it uses three balanced trees to do the processing that would have required just two sort operations), but it is less code and probably acceptable anyway.
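Since the question stores names and values in separate vectors, here is a hedged sketch of the same map-and-set idea adapted to that layout. The helper names to_map and align are hypothetical, and find is used instead of operator[] so that the maps are not modified while reading.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Builds a name -> value map from two parallel vectors.
std::map<std::string, int> to_map(const std::vector<std::string>& names,
                                  const std::vector<int>& values) {
    assert(names.size() == values.size());  // one value per name
    std::map<std::string, int> m;
    for (std::size_t i = 0; i < names.size(); ++i) {
        m[names[i]] = values[i];
    }
    return m;
}

// Lays out the values of `data` in the (alphabetical) order of `all_names`,
// using 0 for names that `data` does not contain.
std::vector<int> align(const std::map<std::string, int>& data,
                       const std::set<std::string>& all_names) {
    std::vector<int> out;
    for (const std::string& name : all_names) {
        std::map<std::string, int>::const_iterator it = data.find(name);
        out.push_back(it == data.end() ? 0 : it->second);
    }
    return out;
}
```

To use it, insert every name from names1 and names2 into one std::set<std::string>, then call align once per data set. For the example in the question, the set iterates as Jim, Joan, Mary, Tom, and the aligned vectors are 1 0 3 2 and 0 4 3 2.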
6502's solution looks fine at first glance. You should probably use std::merge for the merging part though.
EDIT:
I forgot to mention that there is now also a multiway_merge extension of the STL available in the GNU version of the STL. It is a part of the parallel mode, so it resides in the namespace __gnu_parallel. If you need to do multiway merging, it will be very hard to come up with something as fast or simple to use as this.
A quick way which comes to mind is to use a map<pair<string, int>, int> and for each value store it in the map with the right key. (For example (Tom, 2) in the first values set will be under the key (Tom, 1) with value 2)
Once the map is ready iterate over it and build whatever data structure you want (Assuming the map is not enough for you).
I think you need to alter how you store this data.
It looks like you're saying each number is logically associated with the name in the same position: Jim = 1, Mary = 3, etc.
If so, and you want to stick with a vector of some kind, you could redo your data structure like so:
typedef std::pair<std::string, int> NameNumberPair;
typedef std::vector<NameNumberPair> NameNumberVector;
NameNumberVector v1;
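std::pair already provides an operator< that compares the names first and then the numbers, so std::sort will order this vector by name as it is; you only need to write your own comparison if you want a different order. However, as Nawaz points out, a map would be a better way to represent the associated nature of the data.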