I have a data file that is read into a vector. Example:
West Ham 38 12
Leicester City 38 13
In the actual file there are more doubles following the name. Anyway, previously I've used this kind of search:
vector<Team>newTeams; //vector of Team objects
string homeName;
cout << "Enter home team name: ";
cin >> homeName;
cout << endl;
Team ht;
for(Team team : newTeams)
{
if(team.getName() == homeName)
{
ht = team;
}
}
Basically I go through the vector and look for a specific team name. If I find the team name, I assign that team to ht. Then I use ht to get the needed data, i.e.:
ht.getHomeGamesPlayed();
ht.getPoints();
ht.getHomeGoalsScored();
So, my question is: is there a better way to search? (Use a map of names + a vector of doubles?):
map<name, vector<doubles>>;
Also, how do I make the search case-insensitive? I.e. if I type leicester instead of Leicester City, Leicester City would still get picked?
UPD:
Seems like I figured it out, here's the code if you're interested:
string homeName;
cout << "Name: " << endl;
cin >> homeName;
Team ht;
for (Team team : dataTable) {
if (strstr(team.getName().c_str(), homeName.c_str()))
{
ht = team;
}
}
So when I type Leic it picks Leicester City(when I type leic, it doesn't work though)
Yes, std::map would be a better fit for your problem. So would std::unordered_map.
To get case insensitive matches, you can use a string that has been converted to all upper case or all lower case as the map key. Then store the original name separately.
std::map<std::string, std::pair<std::string, std::vector<double>>> mymap;
If you need partial matches, e.g. finding Leicester City when you type leicester, the vector approach you're using now might be best. If you need to match only on the first part of the name, you can still use std::map and use map::lower_bound to find a starting place for your search.
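As a rough illustration of the lower-cased key plus map::lower_bound idea, here is a minimal sketch. The names TeamStats, TeamIndex, toLower, addTeam and findByPrefix are made up for this example, and the lower-casing only handles ASCII.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record: the original name plus the numeric columns from the file.
using TeamStats = std::pair<std::string, std::vector<double>>;
using TeamIndex = std::map<std::string, TeamStats>;

// Return a lower-case copy of s (ASCII only).
std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Store a team under its lower-cased name; the original spelling is kept in the value.
void addTeam(TeamIndex& index, const std::string& name, std::vector<double> stats) {
    index[toLower(name)] = TeamStats(name, std::move(stats));
}

// Return the original names of all teams whose name starts with prefix, ignoring case.
// lower_bound jumps to the first key that is not less than the lower-cased prefix.
std::vector<std::string> findByPrefix(const TeamIndex& index, const std::string& prefix) {
    const std::string key = toLower(prefix);
    std::vector<std::string> matches;
    for (auto it = index.lower_bound(key);
         it != index.end() && it->first.compare(0, key.size(), key) == 0; ++it) {
        matches.push_back(it->second.first);
    }
    return matches;
}
```

With this, findByPrefix(index, "leic") would find Leicester City, but a search for "city" would not, because only the start of the name is matched.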
There are a few solutions to this problem. The one I'd suggest is a radix tree with multiple input positions.
First, create a map or set or whatever to hold your objects. Then, you create a radix tree, indexing each partial match of some given width, e.g.
abcde fge
creates an entry for each of "abcde fge", "bcde fge", "cde fge", ... "e", each pointing to your (multi)map value. Radix trees make it easy to find all values with a given prefix, so you can find all matches for a given substring in around O(1) (or O(N) for a search term of length N), provided you have a fixed input size. Note, however, that construction of the tree scales as O(n^2) in the size of the searchable material. To remedy this, you can limit the length of the indexed search terms (e.g. 50 characters), or index in increments followed by multiple searches (e.g. index "abcdefg", "cdefg" and "efg"; then when searching for "def" you search for "def" and "ef", and "ef" results in a prefix match with "efg").
Note that the search string must be at least as long as the width you skip, otherwise you'd have to search entire trees.
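To make the suffix-indexing idea concrete, here is a small sketch. It is not a radix tree: a sorted std::multimap stands in for one, since it also supports "find everything with this prefix" via lower_bound. The class name SuffixIndex and its members are invented for the example, and every suffix is indexed (no increments or length limit).

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// A sorted multimap used as a stand-in for a radix tree: every suffix of
// every name maps to the index of the name it came from.
class SuffixIndex {
public:
    void add(const std::string& name) {
        const std::size_t id = names_.size();
        names_.push_back(name);
        for (std::size_t pos = 0; pos < name.size(); ++pos) {
            suffixes_.emplace(name.substr(pos), id);
        }
    }

    // All names containing term as a substring. Any occurrence of term is a
    // prefix of some indexed suffix, so a prefix scan finds every match.
    std::vector<std::string> search(const std::string& term) const {
        std::set<std::size_t> ids;  // de-duplicates names matched more than once
        for (auto it = suffixes_.lower_bound(term);
             it != suffixes_.end() && it->first.compare(0, term.size(), term) == 0; ++it) {
            ids.insert(it->second);
        }
        std::vector<std::string> result;
        for (std::size_t id : ids) {
            result.push_back(names_[id]);
        }
        return result;
    }

private:
    std::vector<std::string> names_;
    std::multimap<std::string, std::size_t> suffixes_;
};
```

Storing full suffix copies shows the quadratic growth mentioned above quite directly; a real radix tree would share common prefixes between entries.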
Please can anybody help me? I'm a beginner and I have a hard assignment.
I need to write a c++ program that does the following :
Ask the user to enter two text files: the first one contains the text, and the second one contains a list of words in one column, regardless of their number, like this:
//output
Enter the text file: text_file.txt
Enter the keywords file: keywords_file.txt
Search for the keywords from the keywords file in the text file
Find the appearance of each word like this :
system: 55 times
analysis: 21 times
Then output in new text file (ofstream)
This is my code. It is coded properly, but it asks the user to enter the words. I want it to take the words from the input text file and write the output to a text file (ofstream).
this is a part of keywords contents
//they are as a list in the original file
model,
management,
e-commerce,
system,
cluster
infrastructure,
computer,
knowledge,
metadata,
process,
alter,
simulate,
stock,
inventory,
strategy,
plan,
historical,
deices,
exact,
Analyst,
break even point,
SWOT,
tactic,
develop,
prototype,
feasible,
Inferences,
busy,
cloud compute,
schema,
enterprise,
custom,
expert system,
structure,
data mine,
data warehouse,
organism,
data mart,
operate,
quality assure,
forecast,
report,
this is a part of the book contents
the circuit is characterised by long straights and chicanes. this means the cars’ engines
are at full throttle for over 75% of the lap, a higher percentage than most other circuits.
the track requires heavier-than-average braking over a given lap, as the cars repeatedly
decelerate at the end of some of the world’s fastest straights for the slow chicanes.
the chicanes are lined by rugged kerbs. riding over these hard is crucial for fast laps.
the long straights require small wings for minimum drag. this means lower downforce,
resulting in lower grip on corners and under braking, and less stability over bumps.
the main high-speed corners lesmo 1, lesmo 2 and parabolica are all right turns.
parts of the circuit are surrounded by trees, which means leaves can be blown
onto the track.
#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;
int main() {
ofstream ofs("output.txt");
ofs << "Keywords " << endl;
string kw_file;
cout << "enter the file name :" << " ";
getline(cin, kw_file);
ifstream keywordFile(kw_file.c_str());
string text_file;
cout << "enter the file name :" << " ";
getline(cin, text_file);
ifstream textFile(text_file.c_str());
if (!keywordFile.is_open() || !textFile.is_open()) {
std::cout << "Error in opening files\n";
return 1;
}
{//vector method
size_t i = 0;
std::vector<std::string> keywordVector;
std::string keyword;
while (keywordFile >> std::ws >> keyword) {
keywordVector.push_back(keyword);
}
std::vector<int> countVector(keywordVector.size());
std::string textWord;
while (textFile >> std::ws >> textWord) {
for (size_t i = 0; i < keywordVector.size(); ++i) {
if (keywordVector[i] == textWord) {
countVector[i]++;
}
}
}
for (size_t i = 0; i < keywordVector.size(); ++i) {
ofs << "The number of times [" << keywordVector[i] << "] appears in textFile is " << countVector[i] << '\n';
}
}
keywordFile.clear(); textFile.clear();
keywordFile.seekg(0);
textFile.seekg(0);
}
Preliminary
Before even starting to solve this problem, let's think about what is actually required from this:
1. Read in the keywords from keywords_file.txt
2. Store these keywords in a data structure for later use.
3. Read in words from text_file.txt
4. Use the data structure from Step#2 to compare the words read in Step#3
Data structure simply refers to how your program stores the data that it works on. When you create an array such as
int arr[] = {1, 2, 3};
to later do some manipulations on, that too is a data structure.
Choosing the Data Structure for the Problem
From the above steps, reading in words from the file is a simple problem to solve. The most pressing problem is to figure out the data structure in Step#2.
Let's try to start with a basic one: vector. We are going to use vector instead of arrays like in the C-language primarily for 2 reasons:
Fixed arrays are simple to handle and are the preferred data structure when we have fixed data. But in this problem, we have dynamic data because the number of words in the keywords_file.txt could vary. It's much easier to use vectors as dynamic arrays rather than C-style dynamic arrays because it's much easier to manage memory with vector than with a plain array.
We are using C++ and not C, so let's try to use the data structures available in the Standard Library of C++.
If we choose a vector as the data structure to store the words from keywords_file.txt, then we also need a data structure to store the count of each word, i.e. the number of times it is found in text_file.txt. We can again use a vector for this. We could have some code like this (pseudo-code only):
1. Read in a keyword (using a std::string) from keywords_file.txt
2. Store it in a vector (using the push_back() function). Let's call this vector keywordVector.
3. Repeat Step#1 and Step#2 until all the words from keywords_file.txt are read in and added to keywordVector.
4. Create another vector (let's call it countVector) for storing the count of each word; it has the same size as keywordVector.
5. Initialize all the values in countVector with 0.
6. Read in a word (again a std::string) from text_file.txt
7. Search for this word in keywordVector.
8. If the word is found in keywordVector, increment that word's count by increasing the count in countVector for that word, else ignore the word.
9. Repeat Step#6 through Step#8 until all the words from text_file.txt are read in.
Note for Step#8: if the word from text_file.txt matches the word at keywordVector[i], then increment the value at the corresponding index in countVector, i.e. ++countVector[i]
This should work for small programs and should solve the problem. Here's how the code would look like:
std::vector<std::string> keywordVector;//empty vector of strings
std::string keyword; //string to store each word read in from keywords_file.txt
while(keywordFile>>std::ws>>keyword) {//std::ws just ignores any whitespace
keywordVector.push_back(keyword);
}
//create vector of same size as keywordVector to keep track of count
std::vector<int> countVector(keywordVector.size());
std::string textWord;
while(textFile>>std::ws>>textWord) {
//Search through the keywordVector to check if any word matches
for(size_t i = 0; i < keywordVector.size(); ++i) {
if(keywordVector[i] == textWord) {
countVector[i]++;//Found a match, increment the count
}
}
}
Is that all there is?
No! The above method should work fine for small programs but things become problematic when the number of keywords becomes large and/or the number of words in text_file.txt becomes large.
Imagine this: your keywords_file.txt becomes very large because it stores every word in the Oxford English Dictionary, around a million words. Reading in the keywords is fine, since we need to do that anyway. The next part is where the problems start. Imagine all the words in your text_file.txt were the word zebra. To search for zebra in the keyword list, you have to go through the whole inner for loop for every word you read. If there are a billion words in your text_file.txt, you end up doing a million iterations for each of them, for a total of 1 billion × 1 million = 1 quadrillion iterations (1 followed by 15 zeroes). That is too many. Assuming each iteration takes 1 nanosecond, the total time would be 10^15 × 10^-9 = 1 million seconds, which is ((1000000/60)/60)/24 = around 11.5 days! Surely that is not acceptable!
We realize that the basic problem is that we have to search the entire vector every time we read in a word from text_file.txt. If only there were a way to look up the word in keywordVector directly, without having to iterate over the entire vector each and every time.
This is exactly what a hash map helps with. It uses a hash function to look things up quickly in a collection. Maps store what are called key-value pairs, and you use the key to look up its value.
e.g. if we have something like
{ 101: "Alice", 202: "Bob", 303: "Charlie"}
Then 101 is the key and Alice is the value of that key, 202 is the key, Bob is the value etc.
In the C++ standard library there are 2 map types built on the key-value concept: std::unordered_map and std::map. As the names suggest, the first one stores the keys in no particular order (it is a hash table, with average constant-time lookup), while the second one keeps them sorted (typically a balanced tree, with logarithmic lookup).
Given this very basic intro to maps, we can see how this might be helpful for our case. We don't need to store the words from keyword_file.txt in any particular order. So we can use std::unordered_map, and use the keyword as the key for this map. We'll store the number of times that word appears as the value of this key. Here's some code:
/* Create a map to store key-value pairs of
string and int, where each string is a keyword
from the keywords_file.txt
*/
std::unordered_map<std::string, int> keywordMap;
std::string keyword;
while(keywordFile>>std::ws>>keyword) {
//Initialize each word's count to 0
keywordMap[keyword] = 0;
}
std::string textWord;
while(textFile>>std::ws>>textWord) {
/*We do a find for the textWord in the map.
This find isn't a linear loop like thing (unlike vector)
but uses hashing to quickly look up if textWord exists in the
map or not
*/
if(keywordMap.find(textWord) != end(keywordMap)) {
//If it exists, then we can just directly increment the count
keywordMap[textWord]++;
}
}
Using our time calculations, this time around the lookup of a billion words from the text_file.txt would each take up only in the order of a few nano-seconds since the unordered_map.find() has an average case constant time complexity, unlike the earlier approach's linear complexity.
So for a billion words, it takes an order of a billion nano-seconds which is just 1 second! Imagine the drastic difference in the times. The earlier method took days and this takes seconds. Hashing is a very powerful concept and finds applications in a lot of problems.
Followup
Here's the full code, if you want to use it. This is a basic solution to finding the frequency of words in a file. Since you're starting out in C++, I'd suggest you take your time to read in depth about all the data structures used here and use this example to build upon your understanding. Also, if complexity is new to you, please do acquaint yourself with the topic.
I've been working on a program that reads in a whole dictionary, and utilizes the WordNet from CMU that splits every word to its pronunciation.
The goal is to utilize the dictionary to find the best rhymes and alliterations of a given word, given the number of syllables in the word we need to find and its part of speech.
I've decided to use std::map<std::string, vector<Sound> > and std::multimap<int, std::string> where the map maps each word in the dictionary to its pronunciation in a vector, and the multimap is returned from a function that finds all the words that rhyme with a given word.
The int is the number of syllables of the corresponding word, and the string holds the word.
I've been working on the efficiency, but can't seem to get it to be more efficient than O(n). The way I'm finding all the words that rhyme with a given word is
vector<string> *rhymingWords = new vector<string>;
// soundMap is the map<std::string, vector<Sound> > described above
for (const auto &entry : soundMap) {
    if(rhymingSyllables(word, entry.first) >= 1 && entry.first != word) {
        rhymingWords->push_back(entry.first);
    }
}
return rhymingWords;
And when I find the best rhyme for a word (a word that rhymes the most syllables with the given word), I do
vector<string> rhymes = *getAllRhymes(rhymesWith);
int x = 0; // most rhyming syllables found so far
for (const string &s : rhymes) {
    if (countSyllables(s) == numberOfSyllables) {
        int a = rhymingSyllables(s, rhymesWith);
        if (a > x) {
            x = a; // remember the new maximum
            bestRhyme = s;
        }
    }
}
return bestRhyme;
The drawback is the O(n) access time in terms of the number of words in the dictionary. I'm thinking of ideas to drop this down to O(log n) , but seem to hit a dead end every time. I've considered using a tree structure, but can't work out the specifics.
Any suggestions? Thanks!
The rhymingSyllables function is implemented as such:
int syllableCount = 0;
if((soundMap.count(word1) == 0) || (soundMap.count(word2) == 0)) {
return 0;
}
vector<Sound> &firstSounds = soundMap.at(word1), &secondSounds = soundMap.at(word2);
for(int i = firstSounds.size() - 1, j = secondSounds.size() - 1; i >= 0 && j >= 0; --i, --j){
if(firstSounds[i] != secondSounds[j]) return syllableCount;
else if(firstSounds[i].isVowel()) ++syllableCount;
}
return syllableCount;
P.S.
The vector<Sound> is the pronunciation of the word, where Sound is a class that contains every different pronunciation of a morpheme in English: i.e,
AA vowel AE vowel AH vowel AO vowel AW vowel AY vowel B stop CH affricate D stop DH fricative EH vowel ER vowel EY vowel F fricative G stop HH aspirate IH vowel IY vowel JH affricate K stop L liquid M nasal N nasal NG nasal OW vowel OY vowel P stop R liquid S fricative SH fricative T stop TH fricative UH vowel UW vowel V fricative W semivowel Y semivowel Z fricative ZH fricative
Perhaps you could group the morphemes that will be matched during rhyming and compare not the vectors of morphemes, but vectors of associated groups. Then you can sort the dictionary once and get a logarithmic search time.
After looking at rhymingSyllables implementation, it seems that you convert words to sounds, and then match any vowels to each other, and match other sounds only if they are the same. So applying advice above, you could introduce an extra auxiliary sound 'anyVowel', and then during dictionary building convert each word to its sound, replace all vowels with 'anyVowel' and push that representation to dictionary. Once you're done sort the dictionary. When you want to search a rhyme for a word - convert it to the same representation and do a binary search on the dictionary, first by last sound as a key, then by previous and so on. This will give you m*log(n) worst case complexity, where n is dictionary size and m is word length, but typically it will terminate faster.
You could also exploit the fact that for best rhyme you consider words only with certain syllable numbers, and maintain a separate dictionary per each syllable count. Then you count number of syllables in word you look rhymes for, and search in appropriate dictionary. Asymptotically it doesn't give you any gain, but a speedup it gives may be useful in your application.
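The sort-once, binary-search-later idea can be sketched as follows. This is a heavily simplified illustration: plain spelling stands in for the sound sequence, so no vowels are collapsed into an 'anyVowel' symbol, and the names Entry, reversedKey, buildRhymeIndex and sharedEnding are invented for the example. Real code would build the key from the vector<Sound> instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Each entry pairs a reversed key with the word it belongs to.
using Entry = std::pair<std::string, std::string>;

// Reverse the symbols so that a shared ending becomes a shared prefix.
std::string reversedKey(const std::string& word) {
    return std::string(word.rbegin(), word.rend());
}

// Build the index once: O(n log n) for n words.
std::vector<Entry> buildRhymeIndex(const std::vector<std::string>& words) {
    std::vector<Entry> index;
    for (const auto& w : words) {
        index.emplace_back(reversedKey(w), w);
    }
    std::sort(index.begin(), index.end());
    return index;
}

// Words other than `word` that share at least its last suffixLen symbols.
// lower_bound finds the first candidate in O(log n); the scan then only
// visits actual matches.
std::vector<std::string> sharedEnding(const std::vector<Entry>& index,
                                      const std::string& word, std::size_t suffixLen) {
    std::vector<std::string> result;
    if (suffixLen > word.size()) {
        return result;
    }
    const std::string prefix = reversedKey(word).substr(0, suffixLen);
    auto it = std::lower_bound(index.begin(), index.end(), Entry(prefix, std::string()));
    for (; it != index.end() && it->first.compare(0, prefix.size(), prefix) == 0; ++it) {
        if (it->second != word) {
            result.push_back(it->second);
        }
    }
    return result;
}
```

Keeping one such index per syllable count, as suggested above, would only change which index is passed to sharedEnding.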
I've been thinking about this and I could probably suggest an approach to an algorithm.
I would first take the dictionary and divide it into multiple buckets (batches), where each bucket holds the words with a given number of syllables. Sorting the words into buckets is linear, since you traverse the large vector of strings once. The first bucket holds all one-syllable words, so there is nothing more to do for it; for bucket two and each bucket after it, you need to take each word and separate its syllables. Say you have 25 buckets: the first few and the last few will not hold many words, so their time shouldn't be significant and they should be done first. The buckets in the middle, with words of 3-5 or 3-6 syllables, will be the largest, so if their size is over a certain amount you could run each of them on a separate thread, in parallel. Once you are done, each bucket should return a std::vector<std::shared_ptr<Word>>, where your structure might look like this:
enum SpeechSound {
SS_AA,
SS_AE,
SS_...
SS_ZH
};
enum SpeechSoundType {
ASPIRATE,
...
VOWEL
};
struct SyllableMorpheme {
SpeechSound sound;
SpeechSoundType type;
};
class Word {
private:
    std::string m_strWord;
    // These Two Containers Should Match In Size! One String For Each
    // Syllable & One Matching Struct From Above Containing Two Enums.
    std::vector<std::string> m_vSyllables;
    std::vector<SyllableMorpheme> m_vMorphemes;
public:
    explicit Word( const std::string& word );
    std::string getWord() const;
    std::string getSyllable( unsigned index ) const;
    unsigned getSyllableCount() const;
    SyllableMorpheme getMorpheme( unsigned index ) const;
    bool operator==( const Word& other ) const;
    bool operator!=( const Word& other ) const;
private:
    Word( const Word& c ); // Not Implemented
    Word& operator=( const Word& other ); // Not Implemented
};
This time you will now have new buckets or vectors of shared pointers of these class objects. Then you can easily write a function to traverse through each bucket or even multiple buckets since the buckets will have the same signature only a different amount of syllables. Remember; each bucket should already be sorted alphabetically since we only added them in by the syllable count and never changed the order that was read in from the dictionary.
Then with this you can easily compare if two words are equal or not while checking For Matching Syllables and Morphemes. And these are contained in std::vector<std::shared_ptr<Word>>. So you don't have to worry about memory clean up as much either.
The idea is to use linear search, separation and comparison as much as possible; yet if your container gets too large, then create buckets and run multiple threads in parallel, or maybe use a hash table if it suits your needs.
Another possibility with this class structure is that you could add more to it later on if you wanted or needed to, such as another std::vector for its definitions, and another std::vector<string> for its part of speech {noun, verb, etc.}. You could even add other vector<string> members for things such as homonyms and homophones, and a vector<string> for a list of all words that rhyme with it.
Now for your specific task of finding the best matching rhyme you may find that some words may end up having a list of Words that would all be considered a Best Match or Fit! Due to this you wouldn't want to store or return a single string, but rather a vector of strings!
Case Example:
To Too Two Blue Blew Hue Hew Knew New,
Bare Bear Care Air Ayre Heir Fair Fare There Their They're
Plain, Plane, Rain, Reign, Main, Mane, Maine
Yes these are all single syllable rhyming words, but as you can see there are many cases where there are multiple valid answers, not just a single best case match. This is something that does need to be taken into consideration.
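A small sketch of returning every best match instead of a single string is shown below. The function bestMatches and its score parameter are made up for the example; score stands in for something like rhymingSyllables(candidate, target).

```cpp
#include <functional>
#include <string>
#include <vector>

// Return every candidate whose score equals the highest score seen.
// Candidates that score 0 are never reported.
std::vector<std::string> bestMatches(const std::vector<std::string>& candidates,
                                     const std::function<int(const std::string&)>& score) {
    std::vector<std::string> best;
    int bestScore = 0;
    for (const auto& candidate : candidates) {
        const int s = score(candidate);
        if (s > bestScore) {
            bestScore = s;  // new maximum: drop the earlier, weaker matches
            best.clear();
            best.push_back(candidate);
        } else if (s == bestScore && s > 0) {
            best.push_back(candidate);  // tie with the current maximum
        }
    }
    return best;
}
```

The result keeps the candidates in their original order, so an alphabetically sorted bucket yields an alphabetically sorted list of ties.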
I have a huge list (N = ~1million) of strings 100 characters long that I'm trying to find the overlaps between. For instance, one string might be
XXXXXXXXXXXXXXXXXXAACTGCXAACTGGAAXA (and so on)
I need to build an N by N matrix that contains the longest overlap value for every string with every other string. My current method is (pseudocode)
read in all strings to array
create empty NxN matrix
compare each string to every string with a higher array index (to avoid redoing comparisons)
Write longest overlap to matrix
There's a lot of other stuff going on, but I really need a much more efficient way to build the matrix. Even with the most powerful computing clusters I can get my hands on this method takes days.
In case you didn't guess, these are DNA fragments. X indicates "wild card" (probe gave below a threshold quality score) and all other options are a base (A, C, T, or G). I tried to write a quaternary tree algorithm, but this method was far too memory intensive.
I'd love any suggestions you can give for a more efficient method; I'm working in C++ but pseudocode/ideas or other language code would also be very helpful.
Edit: some code excerpts that illustrate my current method. Anything not particularly relevant to the concept has been removed
//part that compares them all to each other
for (int j=0; j<counter; j++) //counter holds # of DNA
for (int k=j+1; k<counter; k++)
int test = determineBestOverlap(DNArray[j],DNArray[k]);
//boring stuff
//part that compares strings. Definitely very inefficient,
//although I think the sheer number of comparisons is the main problem
int determineBestOverlap(const string &str1, const string &str2)
{
    int maxCounter = 0, bestOffset = 0;
    //basically just tries overlapping the strings every possible way
    for (size_t j = 0; j < str2.length(); j++)
    {
        int counter = 0;
        size_t offset = 0;
        //stop at the end of either string to avoid reading out of bounds
        while (offset < str1.length() && j + offset < str2.length()
               && str1[offset] == str2[j+offset] && str1[offset] != 'X')
        {
            counter++;
            offset++;
        }
        if (counter > maxCounter)
        {
            maxCounter = counter;
            bestOffset = j;
        }
    }
    return maxCounter;
} //this simplified version doesn't account for flipped strings
Do you really need to know the match between ALL string pairs? If yes, then you will have to compare every string with every other string, which means you will need n^2/2 comparisons, and you will need one half terabyte of memory even if you just store one byte per string pair.
However, I assume what you are really interested in is long matches, those with more than, say, 20 or 30 or even more than 80 characters in common, and you probably don't really want to know if two strings have 3 characters in common while 50 others are X and the remaining 47 don't match.
What i'd try if i were you - still without knowing if that fits your application - is:
1) From each string, extract the largest substring(s) that make(s) sense. I guess you want to ignore 'X'es at the start and end entirely, and if some "readable" parts are broken by a large number of 'X'es, it probably makes sense to treat the readable parts individually instead of using the longer string. A lot of this "which substrings are relevant?" depends on your data and application that i don't really know.
2) Make a list of these longest substrings, together with the number of occurrences of each substring. Order this list by string length. You may, but don't really have to, store the indexes of every original string together with the substring. You'll get something like (example)
AGCGCTXATCG 1
GAGXTGACCTG 2
.....
CGCXTATC 1
......
3) Now, from the top to the bottom of the list:
a) Set the "current string" to the string topmost on the list.
b) If the occurrence count next to the current string is > 1, you found a match. Search your original strings for the substring if you haven't remembered the indexes, and mark the match.
c) Compare the current string with all strings of the same length, to find matches where some characters are X.
d) Remove the 1st character from the current string. If the resulting string is already in your table, increase its occurrence counter by one, else enter it into the table.
e) Repeat 3d with the last, instead of the first, character removed from the current string.
f) Remove the current string from the list.
g) Repeat from 3a) until you run out of computing time, or your remaining strings become too short to be interesting.
Whether this is a better algorithm depends very much on your data and which comparisons you're really interested in. If your data is very random or you have very few matches, it will probably take longer than your original idea. But it might allow you to find the interesting parts first and skip the less interesting parts.
I don't see many ways to improve the fact that you need to compare each string with each other including shifting them, and that is by itself super long, a computation cluster seems the best approach.
The only thing I see how to improve is the string comparison by itself: replace A,C,T,G and X by binary patterns:
A = 0x01
C = 0x02
T = 0x04
G = 0x08
X = 0x0F
This way you can store one item in 4 bits, i.e. two per byte (this might not be a good idea, but it is still an option to investigate), and then compare them quickly with an AND operation, so that you 'just' have to count how many consecutive non-zero values you have. That's just a way to process the wildcard; sorry, I don't have a better idea to reduce the complexity of the overall comparison.
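The bit-pattern comparison could look roughly like the sketch below. For simplicity it stores one code per byte rather than packing two per byte, and it treats any unknown character as a wildcard; both are assumptions of this example, as are the names encodeBase, encode and longestRun.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Map a base to a one-hot code. The wildcard X sets all four bits, so it
// ANDs to a non-zero value with every base.
std::uint8_t encodeBase(char base) {
    switch (base) {
        case 'A': return 0x01;
        case 'C': return 0x02;
        case 'T': return 0x04;
        case 'G': return 0x08;
        default:  return 0x0F;  // 'X'; assumption: anything else is a wildcard too
    }
}

// Encode a whole fragment, one code per byte (not packed).
std::vector<std::uint8_t> encode(const std::string& dna) {
    std::vector<std::uint8_t> codes;
    codes.reserve(dna.size());
    for (char base : dna) {
        codes.push_back(encodeBase(base));
    }
    return codes;
}

// Longest run of consecutive compatible positions when a[i] is lined up
// with b[offset + i]. Two positions are compatible when their codes share a bit.
std::size_t longestRun(const std::vector<std::uint8_t>& a,
                       const std::vector<std::uint8_t>& b, std::size_t offset) {
    std::size_t best = 0;
    std::size_t run = 0;
    for (std::size_t i = 0; i < a.size() && offset + i < b.size(); ++i) {
        if (a[i] & b[offset + i]) {
            ++run;
            if (run > best) {
                best = run;
            }
        } else {
            run = 0;
        }
    }
    return best;
}
```

Note that here X matches any base, which differs from the question's determineBestOverlap, where an X ends the run.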
I am working on an algorithm for extracting the subject from articles, using C++.
First I have written code to remove words like articles, prepositions, etc.
Then the rest of the words get stored in one char array: char *excluded_string[50] = { 0 };
while ((NULL != word) && (50 > i)) {
ch[i] = strdup(word);
excluded_string[j]=strdup(word);
word = strtok(NULL, " ");
skp = BoyerMoore_skip(ch[i], strlen(ch[i]) );
if(skp != NULL)
{
i++;
continue;
}
j++;
skp is NULL when ch[i] is not an article or a word of a similar category.
This function checks whether a word is an article, preposition, etc.
At the end, excluded_string[] contains the set of required words. Now I want the number of occurrences of each word in this array, and after that the word with the maximum number of occurrences (all of them, if there is more than one).
What logic should I use?
What I thought is:
Take a two-dimensional array. The first column holds the word, and the second column stores its count.
Then, for each word, look the word up in the array, increment the count for each occurrence, and store that count for the word in the second column.
But this is costly and also complex.
Any other idea?
If you wish to count the occurrences of each word in an array then you can do no better than O(n) (i.e. one pass over the array). However, if you try to store the word counts in a two dimensional array then you must also do a lookup each time to see if the word is already there, and this can quickly become O(n^2).
The trick is to use a hash table to do your lookup. As you step through your word list you increment the right entry in the hash table. Each lookup should be O(1), so it ought to be efficient as long as there are sufficiently many words to offset the complexity of the hashing algorithm and memory usage (i.e. don't bother if you're dealing with less than 10 words, say).
Then, when you're done, you just iterate over the entries in the hash table to find the maximum. In fact, I would probably keep track of that while counting the words so there's no need to do it after ("if thisWordCount is greater than currentMaximumCount then currentMaximum = thisWord").
I believe the standard C++ unordered_map type should do what you need. There's an example here.
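A minimal sketch of the hash-table counting described above, assuming the words are already available as std::string values (the question stores them as char*, which would need converting). The function name mostFrequent is made up; it returns all words that tie for the maximum, since the question asks for all of them.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Count each word in one pass and return every word that has the highest count.
std::vector<std::string> mostFrequent(const std::vector<std::string>& words) {
    std::unordered_map<std::string, int> counts;
    int maxCount = 0;
    for (const auto& word : words) {
        const int count = ++counts[word];  // operator[] starts new entries at 0
        if (count > maxCount) {
            maxCount = count;  // track the maximum while counting
        }
    }

    std::vector<std::string> result;
    for (const auto& entry : counts) {
        if (entry.second == maxCount) {
            result.push_back(entry.first);
        }
    }
    std::sort(result.begin(), result.end());  // the map has no order; sort for stable output
    return result;
}
```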
Update: I have a couple of what are probably silly questions about commenter 6502's answer (below). If anyone could help, I'd really appreciate it.
1) I understand that data 1 and data 2 are the maps, but I don't understand what allkeys is for. Can anyone explain?
2) I know that: data1[vector1[i].name] = vector1[i].value; means assign a value to the map of interest where the correct label is... But I don't understand this: vector1[i].name and vector1[i].value. Aren't "name" and "value" two separate vectors of labels and values? So what are they doing on vector1? Shouldn't this read name[i] and value[i] instead?
Thanks everyone.
I have written code for performing a calculation. The code uses data from elsewhere. The calculation code is fine, but I'm having trouble manipulating the data.
The data exist as sets of vectors. Each set has one vector of labels (names, these are strings) and a corresponding set of values (doubles or ints).
The problem is that I need each data set to have the same name/label in the same column as the other data sets. This problem is not the same as sorting the data in the vectors (which I know how to do) because sometimes names/labels can be missing from some vectors.
For example:
Data set 1:
vector names1 = Jim, Tom, Mary
vector values1 = 1 2 3
Data set 2:
vector names2 = Tom, Mary, Joan
vector values2 = 2 3 4
I want (pseudo-code) ONE name vector that has all possible names. I also want each corresponding numbers vector to be sorted the SAME way:
vector namesUniversal = Jim, Joan, Mary, Tom
vector valuesUniversal1 = 1 0 3 2
vector valuesUniversal2 = 0 4 3 2
What I want to do is come up with a universal vector that contains ALL the labels/names sorted alphabetically and all the corresponding numerical data sorted too.
Can anyone tell me whether there is an elegant way to do this in c++? I guess I could compare each element of each name vector with each element of each other name vector, but this seems quite clunky and I would not know how to get the data into the right columns in the corresponding data vectors. Thanks for any advice.
The algorithm you are looking for is usually named "merging". Basically you sort the two data sets and then look at data in pairs: if the keys are equal then you process and output the pair, otherwise you process and advance only the smallest one.
You must also handle the case where one of the two lists ends before the other (this can be avoided by using special flag values that are guaranteed to be higher than any value you need to process).
The following is pseudocode for merging
Sort vector1
Sort vector2
Set index1 = index2 = 0;
Loop until both index1 >= vector1.size() and index2 >= vector2.size() (in other words until both vectors are exhausted)
If index1 == vector1.size() (i.e. if vector1 has been processed) then output vector2[index2++]
Otherwise if index2 == vector2.size() (i.e. if vector2 has been processed) then output vector1[index1++]
Otherwise if vector1[index1] == vector2[index2] output merged data and increment both index1 and index2
Otherwise if vector1[index1] < vector2[index2] output vector1[index1++]
Otherwise output vector2[index2++]
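The merge pseudocode above might translate to C++ roughly as follows. This sketch assumes each data set is a vector of (name, value) pairs with each name appearing at most once per set, and it reports 0 for a name missing from one set. NameValue, MergedRow and mergeByName are names invented for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using NameValue = std::pair<std::string, int>;

// One output row: a name and its value in each data set (0 when missing).
struct MergedRow {
    std::string name;
    int value1;
    int value2;
};

std::vector<MergedRow> mergeByName(std::vector<NameValue> v1, std::vector<NameValue> v2) {
    std::sort(v1.begin(), v1.end());  // pairs sort by name first
    std::sort(v2.begin(), v2.end());

    std::vector<MergedRow> rows;
    std::size_t i = 0;
    std::size_t j = 0;
    while (i < v1.size() || j < v2.size()) {
        if (i == v1.size()) {                        // v1 exhausted
            rows.push_back({v2[j].first, 0, v2[j].second});
            ++j;
        } else if (j == v2.size()) {                 // v2 exhausted
            rows.push_back({v1[i].first, v1[i].second, 0});
            ++i;
        } else if (v1[i].first == v2[j].first) {     // same name in both sets
            rows.push_back({v1[i].first, v1[i].second, v2[j].second});
            ++i;
            ++j;
        } else if (v1[i].first < v2[j].first) {      // name only in v1 (so far)
            rows.push_back({v1[i].first, v1[i].second, 0});
            ++i;
        } else {                                     // name only in v2 (so far)
            rows.push_back({v2[j].first, 0, v2[j].second});
            ++j;
        }
    }
    return rows;
}
```

For the example data in the question this yields the rows Jim 1 0, Joan 0 4, Mary 3 3 and Tom 2 2.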
However in C++ you can implement a much easier to write solution that is probably still fast enough (warning: untested code!):
std::map<std::string, int> data1, data2;
std::set<std::string> allkeys;
for (int i=0,n=vector1.size(); i<n; i++)
{
allkeys.insert(vector1[i].name);
data1[vector1[i].name] = vector1[i].value;
}
for (int i=0,n=vector2.size(); i<n; i++)
{
allkeys.insert(vector2[i].name);
data2[vector2[i].name] = vector2[i].value;
}
for (std::set<std::string>::iterator i=allkeys.begin(), e=allkeys.end();
i!=e; ++i)
{
const std::string& key = *i;
std::cout << key << ' ' << data1[key] << ' ' << data2[key] << std::endl;
}
The idea is to just build two maps, data1 and data2, from name to value, and at the same time collect all keys that appear in a std::set of keys named allkeys (adding the same name to a set multiple times does nothing).
After the collection phase this set can then be iterated to find all the names that have been observed and for each name the value can be retrieved from data1 and data2 maps (std::map<std::string, int> will return 0 when looking for the value of a name that has not been added to the map).
Technically this is somewhat overkill (it uses three balanced trees to do processing that would have required just two sort operations), but it is less code and probably acceptable anyway.
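A variation of the map approach that works directly on separate name and value vectors, as in the question, and fills the "universal" vectors could look like this. AlignedData, valueOrZero and alignDataSets are hypothetical names for this sketch. It uses find instead of operator[] so that looking up a missing name does not insert it into the map.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// All names in sorted order, plus the matching values from each data set.
struct AlignedData {
    std::vector<std::string> names;
    std::vector<int> values1;
    std::vector<int> values2;
};

// Look up key, returning 0 when it is absent, without modifying the map.
int valueOrZero(const std::map<std::string, int>& data, const std::string& key) {
    const auto it = data.find(key);
    return it == data.end() ? 0 : it->second;
}

AlignedData alignDataSets(const std::vector<std::string>& names1, const std::vector<int>& values1,
                          const std::vector<std::string>& names2, const std::vector<int>& values2) {
    std::map<std::string, int> data1, data2;
    std::set<std::string> allkeys;  // every name seen in either set, sorted and unique
    for (std::size_t i = 0; i < names1.size() && i < values1.size(); ++i) {
        allkeys.insert(names1[i]);
        data1[names1[i]] = values1[i];
    }
    for (std::size_t i = 0; i < names2.size() && i < values2.size(); ++i) {
        allkeys.insert(names2[i]);
        data2[names2[i]] = values2[i];
    }

    AlignedData out;
    for (const std::string& key : allkeys) {
        out.names.push_back(key);
        out.values1.push_back(valueOrZero(data1, key));
        out.values2.push_back(valueOrZero(data2, key));
    }
    return out;
}
```

Here allkeys is what provides the single sorted list of names; the two maps only answer "what value does this name have in data set 1 or 2?".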
6502's solution looks fine at first glance. You should probably use std::merge for the merging part though.
EDIT:
I forgot to mention that there is now also a multiway_merge extension of the STL available in the GNU version of the STL. It is a part of the parallel mode, so it resides in the namespace __gnu_parallel. If you need to do multiway merging, it will be very hard to come up with something as fast or simple to use as this.
A quick way which comes to mind is to use a map<pair<string, int>, int> and for each value store it in the map with the right key. (For example (Tom, 2) in the first values set will be under the key (Tom, 1) with value 2)
Once the map is ready iterate over it and build whatever data structure you want (Assuming the map is not enough for you).
I think you need to alter how you store this data.
It looks like you're saying each number is logically associated with the name in the same position: Jim = 1, Mary = 3, etc.
If so, and you want to stick with a vector of some kind, you could redo your data structure like so:
typedef std::pair<std::string, int> NameNumberPair;
typedef std::vector<NameNumberPair> NameNumberVector;
NameNumberVector v1;
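std::pair already provides an operator< that compares the names first (and the numbers only when the names are equal), so std::sort orders such a vector by name out of the box; you only need to write your own comparison if you want a different order, e.g. by name alone. However, as Nawaz points out, a map would be a better way to represent the associated nature of the data.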
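For illustration, sorting and searching such a vector by name alone could be sketched like this. The helpers byName, sortByName and findName are made up for the example; the typedefs are the ones shown above.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> NameNumberPair;
typedef std::vector<NameNumberPair> NameNumberVector;

// Order pairs by name only, ignoring the number.
bool byName(const NameNumberPair& a, const NameNumberPair& b) {
    return a.first < b.first;
}

// stable_sort keeps entries with equal names in their original order.
void sortByName(NameNumberVector& v) {
    std::stable_sort(v.begin(), v.end(), byName);
}

// Binary search in a vector already sorted by name.
// Returns a pointer to the entry, or nullptr when the name is absent.
const NameNumberPair* findName(const NameNumberVector& v, const std::string& name) {
    const auto it = std::lower_bound(
        v.begin(), v.end(), name,
        [](const NameNumberPair& entry, const std::string& key) { return entry.first < key; });
    return (it != v.end() && it->first == name) ? &*it : nullptr;
}
```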