Parse key, value pairs when key is not unique - c++

My input consists of multiple key, value pairs, e.g.:
A=1, B=2, C=3, ..., A=4
I want to parse the input into the following type:
std::map< char, std::vector< int > > m
Values for equal keys shall be appended to the vector. So the parsed output should be equal to:
m['A']={1,4};
m['B']={2};
m['C']={3};
What is the simplest solution using 'boost::spirit::qi' ?

Here is one way to do it:
#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/include/vector.hpp>
#include <boost/fusion/include/at_c.hpp>
#include <iostream>
#include <utility>
#include <string>
#include <vector>
#include <map>
namespace qi = boost::spirit::qi;
namespace fusion = boost::fusion;
int main()
{
std::string str = "A=1, B=2, C=3, A=4";
std::map< char, std::vector< int > > m;
auto inserter = [&m](fusion::vector< char, int > const& parsed,
qi::unused_type, qi::unused_type)
{
m[fusion::at_c< 0 >(parsed)].push_back(fusion::at_c< 1 >(parsed));
};
auto it = str.begin(), end = str.end();
bool res = qi::phrase_parse(it, end,
((qi::char_ >> '=' >> qi::int_)[inserter]) % ',',
qi::space);
if (res && it == end)
std::cout << "Parsing complete" << std::endl;
else
std::cout << "Parsing incomplete" << std::endl;
for (auto const& elem : m)
{
std::cout << "m['" << elem.first << "'] = {";
for (auto value : elem.second)
std::cout << " " << value;
std::cout << " }" << std::endl;
}
return 0;
}
A few comments about the implementation:
1. qi::phrase_parse is a Boost.Spirit algorithm that takes a pair of iterators, a parser, and a skip parser, and runs the parsers on the input denoted by the iterators. In the process, it updates the beginning iterator (it in this example) so that it points to the end of the consumed input upon return. The returned res value indicates whether the parsers have succeeded (i.e. the consumed input could be successfully parsed). There are other forms of qi::phrase_parse that allow extracting attributes (which is the parsed data, in Boost.Spirit terms), but we're not using attributes here because you have a peculiar requirement for the resulting container structure.
2. The skip parser is used to skip portions of the input between the elements of the main parser. In this case, qi::space means that any whitespace characters will be ignored in the input, so that e.g. "A = 1" and "A=1" are both parsed the same way. There is also the qi::parse family of algorithms, which do not have a skip parser and therefore require the main parser to handle all input without skips.
3. The (qi::char_ >> '=' >> qi::int_) part of the main parser matches a single character, followed by the equals sign character, followed by a signed integer. The equals sign is expressed as a literal (i.e. it is equivalent to the qi::lit('=') parser), which means it only matches the input but does not produce any parsed data. Therefore the result of this parser is an attribute that is a sequence of two elements: a character and an integer.
4. The % ',' part of the parser is a list parser, which parses one or more pieces of input described by the parser on the left (which is the parser described above), separated by the pieces described by the parser on the right (i.e. by comma characters in our case). As before, the comma character is a literal parser, so it doesn't produce output.
5. The [inserter] part is a semantic action, which is a function that is called by the parser every time it matches a portion of the input string. The parser passes all its parsed output as the first argument to this function. In our case the semantic action is attached to the parser described in item 3, which means a sequence of a character and an integer is passed. Boost.Spirit uses a fusion::vector to pass this data. The other two arguments of the semantic action are not used in this example and can be ignored.
6. The inserter function in this example is a lambda function, but it could be any other kind of function object, including a regular function, a function generated by std::bind, etc. The important part is that it has the specified signature and that the type of its first argument is compatible with the attribute of the parser to which it is attached as a semantic action. So, if we had a different parser in item 3, this argument would have to be changed accordingly.
7. fusion::at_c< N >() in the inserter obtains the element of the vector at index N. It is very similar to std::get< N >() when applied to std::tuple.
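The grouping step itself does not depend on Boost.Spirit. As a rough sketch of what the inserter achieves, here is the same "append to the vector stored under the key" logic with a plain std::istringstream standing in for the parser. The function name group_pairs is made up for illustration, and the hand-written parsing is a swapped-in substitute for the Qi grammar, not an equivalent of it (it does far less error checking).

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of Boost.Spirit): parses "K=V, K=V, ..." by hand
// and groups the values by key, mirroring what the inserter lambda does.
std::map<char, std::vector<int>> group_pairs(const std::string& input)
{
    std::map<char, std::vector<int>> grouped;
    std::istringstream in(input);
    char key = 0, eq = 0, comma = 0;
    int value = 0;
    // operator>> skips leading whitespace, much like the qi::space skipper.
    while (in >> key >> eq >> value && eq == '=')
    {
        // operator[] creates an empty vector the first time a key is seen,
        // so repeated keys simply keep appending to the same vector.
        grouped[key].push_back(value);
        if (!(in >> comma) || comma != ',')
            break; // no further separator: stop reading
    }
    return grouped;
}
```

Calling group_pairs("A=1, B=2, C=3, A=4") yields a map in which 'A' holds 1 and 4, while 'B' and 'C' hold one value each.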

Related

I can't understand the use of std::istream_iterator

I can't understand the code below.
(from https://www.boost.org/doc/libs/1_74_0/more/getting_started/unix-variants.html)
#include <boost/lambda/lambda.hpp>
#include <iostream>
#include <iterator>
#include <algorithm>
int main()
{
using namespace boost::lambda;
typedef std::istream_iterator<int> in;
std::for_each(
in(std::cin), in(), std::cout << (_1 * 3) << " " );
}
The web page doesn't explain anything for the code.
What I can't understand is the line with std::for_each function.
std::for_each is defined as below.
template <class InputIterator, class Function>
Function for_each(InputIterator first, InputIterator last, Function fn);
So first is in(std::cin), last is just in(), the function is the cout statement.
Could anyone explain to me the syntax of first and last syntax and meaning in the example code?
The first iterator seems to be constructed with initial value std::cin, but what's the use of in() for the last value?
I also can't understand the _1 part.
The program outputs 3 * any number of integer values I type in.
First a little explanation about the std::for_each function.
The function loops over a set of iterators, from the beginning to the end, calling a function for each element in the range.
If you have e.g. a vector of integers:
std::vector<int> v = { 1, 2, 3, 4 };
and want to print them, then you could do:
std::for_each(v.begin(), v.end(), [](int val) { std::cout << val; });
The above call to std::for_each is equivalent to:
for (auto i = v.begin(); i != v.end(); ++i)
{
std::cout << *i;
}
Now if we turn to the usage of std::istream_iterator in the question, it wraps the input operator >> in an iterator interface.
Rewriting the std::for_each call using standard C++ lambdas, it would look like this:
std::for_each(in(std::cin), in(), [](int value) { std::cout << (value * 3) << " "; });
If we translate it to the "normal" for iterator loop then it becomes:
for (auto i = in(std::cin); i != in(); ++i)
{
std::cout << (*i * 3) << " ";
}
It reads integers from std::cin (until end-of-file or an error) and outputs each value multiplied by 3, followed by a space.
If you're wondering about in(std::cin) and in(), you have to remember that in is an alias for the type std::istream_iterator<int>.
That means in(std::cin) is the same as std::istream_iterator<int>(std::cin). I.e. it creates a std::istream_iterator<int> object, and passes std::cin to the constructor. And in() constructs an end-iterator object.
Making it even clearer, the code is equivalent to:
std::istream_iterator<int> the_beginning(std::cin);
std::istream_iterator<int> the_end; // Default construct, becomes the "end" iterator
for (std::istream_iterator<int> i = the_beginning; i != the_end; ++i)
{
int value = *i; // Dereference iterator to get its value
// (effectively the same as std::cin >> value)
std::cout << (value * 3) << " ";
}
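To experiment with this without typing into std::cin, the same begin/end pair can be pointed at any input stream. The following is only a sketch: the function name tripled is invented here, and it collects the results in a vector instead of printing them so that the outcome is easy to inspect.

```cpp
#include <algorithm>
#include <istream>
#include <iterator>
#include <sstream> // for std::istringstream when trying it out
#include <vector>

// Hypothetical helper: reads ints from any input stream and returns each one times 3.
std::vector<int> tripled(std::istream& input)
{
    std::istream_iterator<int> first(input); // performs the first read right away
    std::istream_iterator<int> last;         // default-constructed: the end-of-stream iterator
    std::vector<int> result;
    // Each ++ on the iterator does another "input >> value" behind the scenes.
    // Once a read fails, the iterator compares equal to "last" and the loop stops.
    std::transform(first, last, std::back_inserter(result),
                   [](int value) { return value * 3; });
    return result;
}
```

With a std::istringstream holding "1 2 3" this returns 3, 6 and 9. With "1 2 x 4" it returns only 3 and 6, because reading stops at the first token that is not an integer.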
Could anyone explain to me the syntax of first and last syntax and meaning in the example code?
The first iterator seems to be constructed with initial value std::cin, but what's the use of in() for the last value?
If you look at the description of the constructor of std::istream_iterator you can see that in() constructs the end-of-stream iterator.
istream_iterator(); // until C++11
constexpr istream_iterator(); // since C++11
Constructs the end-of-stream iterator, value-initializes the stored value. This constructor is constexpr if the initializer in the definition auto x = T(); is a constant initializer (since C++11).
As for in(std::cin):
istream_iterator( istream_type& stream );
istream_iterator( const istream_iterator& other ); // until C++11
istream_iterator( const istream_iterator& other ) = default; // since C++11
Initializes the iterator, stores the address of stream in a data member, and performs the first read from the input stream to initialize the cached value data member.
source
And I also can't understand the _1 part.
What this does is to replace the placeholder _1 with every element in the iteration sequence and multiply it by 3, using the result in the output stream, as it should, given the unary function argument.
for_each(a.begin(), a.end(), std::cout << _1 << ' ');
The expression std::cout << _1 << ' ' defines a unary function object. The variable _1 is the parameter of this function, a placeholder for the actual argument. Within each iteration of for_each, the function is called with an element of a as the actual argument. This actual argument is substituted for the placeholder, and the “body” of the function is evaluated.
source
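If the placeholder still feels like magic, a toy version may help. The sketch below is not how Boost.Lambda is implemented; every name in it (placeholder, multiply_by, arg1, triple_each) is made up. It only shows the core idea that an expression involving a placeholder can build a function object instead of computing a value.

```cpp
#include <algorithm>
#include <vector>

// Toy stand-ins, NOT Boost.Lambda's real types: all names here are hypothetical.
struct placeholder {};

// Function object produced by "placeholder * factor".
struct multiply_by
{
    int factor;
    int operator()(int argument) const { return argument * factor; }
};

// Multiplying the placeholder by a number does no arithmetic yet;
// it only records the factor inside a callable object.
inline multiply_by operator*(placeholder, int factor) { return multiply_by{factor}; }

const placeholder arg1{};

// The object built by "arg1 * 3" is called once per element by the algorithm.
inline std::vector<int> triple_each(std::vector<int> values)
{
    std::transform(values.begin(), values.end(), values.begin(), arg1 * 3);
    return values;
}
```

Here (arg1 * 3)(5) evaluates to 15, and triple_each applied to 1, 2, 3 gives 3, 6, 9. Boost.Lambda generalizes this so that whole expressions such as std::cout << (_1 * 3) << " " become function objects.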

C++ count the number of words from standard input

I saw a piece of C++ code to count the number of words inputted from standard input:
#include <iostream>
#include <iterator>
#include <string>
using namespace std;
int main() {
auto ss {
istream_iterator<string>{cin}
};
cout << distance(ss, {}) << endl;
return 0;
}
I have several questions:
What's the type of auto ss?
What does distance(ss, {}) do? Why does it calculate the number of words?
My guess is:
istream_iterator<string>{cin} converts the standard input into the istream_iterator type, automatically separated by space (why?). Thus ss looks like a container with all words as its elements;
distance(ss, {}) calculates the distance between the 1st element and the empty (thus outside of the last, but why?) element.
Can someone help me to go through my guess on this fantastic short piece of code?
auto ss deduces ss to be std::istream_iterator<std::string>, because that is what the full statement is constructing and assigning to ss.
istream_iterator uses the specified istream's operator>> to read formatted input of the specified type, where operator>> reads input delimited by whitespace, including space characters, line breaks, etc. So, in this case, istream_iterator<string> is reading std::string values from std::cin one whitespace-delimited word at a time.
istream_iterator is an input iterator, and a default-constructed istream_iterator denotes an end-of-stream iterator. When istream_iterator stops reading (EOF is reached, bad input is entered, etc), its value becomes equal to the end-of-stream iterator.
std::distance() returns the number of elements in a range denoted by a pair of iterators. For a non-random input iterator, like istream_iterator, std::distance() iterates the range one element at a time via the iterator's operator++, counting the number of times it incremented the iterator, until the target iterator is reached. So, in this case, istream_iterator::operator++ internally reads a value from its istream, thus std::distance() effectively returns how many words are successfully read from std::cin until end-of-stream is reached.
So, the code you have shown is roughly just an algorithmic way of writing the following equivalent loop:
int main() {
string s;
size_t count = 0;
while (cin >> s) {
++count;
}
cout << count << endl;
return 0;
}
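The same counting can be tried on a string instead of standard input. This is a small sketch with an invented function name, count_words, that accepts any input stream.

```cpp
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream> // for std::istringstream when trying it out
#include <string>

// Hypothetical helper: counts whitespace-delimited words in any input stream.
std::size_t count_words(std::istream& input)
{
    std::istream_iterator<std::string> first(input); // reads the first word immediately
    std::istream_iterator<std::string> last;         // end-of-stream iterator
    // For input iterators std::distance just increments "first" until it equals
    // "last"; every increment extracts one more word with operator>>.
    return static_cast<std::size_t>(std::distance(first, last));
}
```

For a std::istringstream holding "one two\tthree\nfour" the result is 4, since spaces, tabs and newlines all act as separators. An empty stream gives 0.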
ss has type std::istream_iterator<std::string>.
std::distance(ss, {}) computes the number of items between the first whitespace-delimited token in std::cin to the end of cin, effectively returning the number of whitespace-delimited tokens in std::cin. This is due to the way std::istream::operator>>(std::istream&, std::string&) functions (the second parameter is not actually an std::string, but I'm trying to keep this short). The default constructor for a std::istream_iterator<std::string> returns the end of any std::istream_iterator<std::string>.
The cutting of the contents of std::cin is actually done lazily when computing the distance.
That is indeed an interesting piece of code.

Reading in from a .tsv file

I'm trying to read in information from a tab separated value file with the format:
<string> <int> <string>
Example:
Seaking 119 Azumao
Mr. Mime 122 Barrierd
Weedle 13 Beedle
This is currently how I'm doing it:
string americanName;
int pokedexNumber;
string japaneseName;
inFile >> americanName;
inFile >> pokedexNumber;
inFile >> japaneseName;
My issue stems from the space in the "Mr. Mime" as the strings can contain spaces.
I would like to know how to read the file in properly.
Standard library uses such things as locales to determine the categories of different symbols and other locale-dependent things depending on your system locale. Standard streams use that to determine what is a space because of various unicode issues.
You can use this fact to control the meaning of ' ' in your case:
#include <iostream>
#include <locale>
#include <algorithm>
#include <string>
struct tsv_ws : std::ctype<char>
{
mask t[table_size]; // classification table, stores category for each character
tsv_ws() : ctype(t) // ctype will use our table to check character type
{
// copy all default values to our table;
std::copy_n(classic_table(), table_size, t);
// here we tell, that ' ' is a punctuation, but not a space :)
t[' '] = punct;
}
};
int main() {
std::string s;
std::cin.imbue(std::locale(std::cin.getloc(), new tsv_ws)); // using our locale, will work for any stream
while (std::cin >> s) {
std::cout << "read: '" << s << "'\n";
}
}
Here we make ' ' a punctuation symbol, but not a space symbol, so streams don't consider it a separator anymore. The exact category isn't important, but it mustn't be space.
That's quite a powerful technique. For example, you could redefine ',' to be a space to read in CSV format.
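As a sketch of that CSV variation, under the assumption that fields contain no quoted commas: the facet below marks ',' as whitespace and ' ' as ordinary content. The names csv_ws and split_fields are invented for this example.

```cpp
#include <algorithm>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical facet in the style of tsv_ws: ',' separates fields, ' ' does not.
struct csv_ws : std::ctype<char>
{
    mask masks[table_size]; // our own classification table

    csv_ws() : std::ctype<char>(masks)
    {
        // Start from the default "C" classification.
        std::copy_n(classic_table(), table_size, masks);
        masks[','] = space; // a comma now ends a field
        masks[' '] = punct; // a blank is ordinary field content
    }
};

// Hypothetical helper: splits one line with operator>> under the csv_ws rules.
std::vector<std::string> split_fields(const std::string& line)
{
    std::istringstream in(line);
    in.imbue(std::locale(in.getloc(), new csv_ws)); // the locale takes ownership of the facet
    std::vector<std::string> fields;
    for (std::string field; in >> field; )
        fields.push_back(field);
    return fields;
}
```

Splitting "Mr. Mime,122,Barrierd" this way gives the three fields "Mr. Mime", "122" and "Barrierd". Note the limits of the trick: consecutive commas are skipped as one run of whitespace, so empty fields disappear, and quoting is not handled at all.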
You can use std::getline to extract strings with non-tab whitespace.
std::getline(inFile, americanName, '\t'); // read up to first tab
inFile >> pokedexNumber >> std::ws; // read number then second tab
std::getline(inFile, japaneseName); // read up to first newline
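Wrapped into a loop, those three reads can load the whole file. The sketch below assumes every line has exactly three tab-separated fields and that the last field is not empty (std::ws would otherwise swallow the line break). The Entry struct and the read_entries function are names invented for this example.

```cpp
#include <istream>
#include <sstream> // for std::istringstream when trying it out
#include <string>
#include <vector>

// Hypothetical record type for one line of the file.
struct Entry
{
    std::string americanName;
    int pokedexNumber = 0;
    std::string japaneseName;
};

// Hypothetical helper: reads "<string>\t<int>\t<string>" lines until input runs out.
std::vector<Entry> read_entries(std::istream& in)
{
    std::vector<Entry> entries;
    Entry e;
    while (std::getline(in, e.americanName, '\t')  // up to the first tab, spaces included
           && in >> e.pokedexNumber >> std::ws     // the number, then skip the second tab
           && std::getline(in, e.japaneseName))    // the rest of the line
    {
        entries.push_back(e);
    }
    return entries;
}
```

Any input stream works here, so the inFile from the question can be passed directly, and a std::istringstream is handy for testing.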
Seems like you want to read csv data or in your case tsv data. But let's stick to the common term "csv". This is a standard task and I will give you detailed explanations. In the end all the reading will be done in a one-liner.
I would recommend to use "modern" C++ approach.
After searching for "reading csv data", people are still linking to How can I read and parse CSV files in C++?. That question is from 2009 and now over 10 years old. Most answers are also old and very complicated. So maybe it's time for a change.
In modern C++ you have algorithms that iterate over ranges. You will often see something like "someAlgorithm(container.begin(), container.end(), someLambda)". The idea is that we iterate over some similar elements.
In your case we iterate over tokens in your input string, and create substrings. This is called tokenizing.
And for exactly that purpose, we have the std::sregex_token_iterator. And because we have something that has been defined for such purpose, we should use it.
This thing is an iterator. For iterating over a string, hence sregex. The begin part defines, on what range of input we shall operate, then there is a std::regex for what should be matched / or what should not be matched in the input string. The type of matching strategy is given with last parameter.
0 --> give me the stuff that I defined in the regex (the whole match) and
-1 --> give me what is NOT matched by the regex.
So, now that we understand the iterator, we can std::copy the tokens from the iterator to our target, a std::vector of std::string. And since we do not know how many columns we have, we will use the std::back_inserter as a target. This will take all tokens that we get from the std::sregex_token_iterator and append them to our std::vector<std::string>. It doesn't matter how many columns we have.
Good. Such a statement could look like
std::copy( // We want to copy something
std::sregex_token_iterator // The iterator begin, the sregex_token_iterator. Give back first token
(
line.begin(), // Evaluate the input string from the beginning
line.end(), // to the end
re, // Match the delimiter
-1 // But give me back not the delimiter but everything else
),
std::sregex_token_iterator(), // iterator end for sregex_token_iterator, last token + 1
std::back_inserter(cp.columns) // Append everything to the target container
);
Now we can understand, how this copy operation works.
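To see the two matching strategies side by side, here is a small sketch. The function name split_tokens is made up, and it uses the std::vector range constructor instead of std::copy, which is just a shorter way of doing the same thing.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the tokens selected by "submatch".
//   0  -> the parts of the line that match the regex
//  -1  -> the parts of the line between the matches
std::vector<std::string> split_tokens(const std::string& line, const std::regex& re, int submatch)
{
    std::sregex_token_iterator first(line.begin(), line.end(), re, submatch);
    std::sregex_token_iterator last; // default-constructed: the end iterator
    return std::vector<std::string>(first, last);
}
```

With a tab as the regex and -1 as the last argument, "a\tb c\td" is split into "a", "b c" and "d". With the regex \w+ and 0 as the last argument, the same call returns the words themselves.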
Next step. We want to read from a file. The file also contains repeated data of the same kind: the rows.
And as above, we can iterate over similar data, whether it comes from a file or anywhere else. For this purpose C++ has the std::istream_iterator. This is a template: as a template parameter it gets the type of data that it should read and, as a constructor parameter, it gets a reference to an input stream. It doesn't matter whether the input stream is std::cin, a std::ifstream or a std::istringstream. The behaviour is identical for all kinds of streams.
And since we do not have files on SO, I use (in the example below) a std::istringstream to store the input csv file. But of course you can open a file by defining a std::ifstream testCsv(filename). No problem.
And with std::istream_iterator, we iterate over the input and read similar data. In our case one problem is that we want to iterate over special data and not over some built-in data type.
To solve this, we define a Proxy class, which does the internal work for us (we do not want to know how; that should be encapsulated in the proxy). In the proxy we define a type cast operator, to convert the result to the type we expect from the std::istream_iterator.
And the last important step. A std::vector has a range constructor. It has also a lot of other constructors that we can use in the definition of a variable of type std::vector. But for our purposes this constructor fits best.
So we define a variable csv and use its range constructor and give it a begin of a range and an end of a range. And, in our specific example, we use the begin and end iterator of std::istream_iterator.
If we combine all the above, reading the complete CSV file is a one-liner, it is the definition of a variable with calling its constructor.
Please see the resulting code:
#include <iostream>
#include <sstream>
#include <fstream>
#include <string>
#include <vector>
#include <iterator>
#include <regex>
#include <algorithm>
std::istringstream testCsv{ "Seaking\t119\tAzumao\n"
"Mr. Mime\t122\tBarrierd\n"
"Weedle\t13\tBeedle" };
// Define Alias for easier Reading
using Columns = std::vector<std::string>;
using CSV = std::vector<Columns>;
// Proxy for the input Iterator
struct ColumnProxy {
// Overload extractor. Read a complete line
friend std::istream& operator>>(std::istream& is, ColumnProxy& cp) {
// Read a line
std::string line; cp.columns.clear();
if (std::getline(is, line)) {
// The delimiter
const std::regex re("\t");
// Split values and copy into resulting vector
std::copy(std::sregex_token_iterator(line.begin(), line.end(), re, -1),
std::sregex_token_iterator(),
std::back_inserter(cp.columns));
}
return is;
}
// Type cast operator overload. Cast the type 'Columns' to std::vector<std::string>
operator std::vector<std::string>() const { return columns; }
protected:
// Temporary to hold the read vector
Columns columns{};
};
int main()
{
// Define variable CSV with its range constructor. Read complete CSV in this statement, So, one liner
CSV csv{ std::istream_iterator<ColumnProxy>(testCsv), std::istream_iterator<ColumnProxy>() };
// Print result. Go through all lines and then copy line elements to std::cout
std::for_each(csv.begin(), csv.end(), [](Columns & c) {
std::copy(c.begin(), c.end(), std::ostream_iterator<std::string>(std::cout, " ")); std::cout << "\n"; });
}
I hope the explanation was detailed enough to give you an idea, what you can do with modern C++.
This example does basically not care how many rows and columns are in the source text file. It will eat everything.

Inserting characters into a string in C++

I need to insert a character into a string of letters that are in alphabetical order, and this character has to be placed where it belongs alphabetically.
For example I have the string string myString("afgjz"); and the input code
cout << "Input your character" << endl;
char ch;
cin >> ch;
but how can I make it so that after inputting the char(say b) it is then added to the string on the proper position resulting in the string becoming "abfgjz".
You can use std::lower_bound to find the position to insert.
myString.insert(std::lower_bound(myString.begin(), myString.end(), ch), ch);
A more generic solution would be having a function like
namespace sorted
{
template<class Container, class T>
void insert(Container & object, T const & value)
{
using std::begin;
using std::end;
object.insert(std::lower_bound(begin(object),
end(object), value), value);
}
}
And then use
sorted::insert(myString, ch);
Class std::string has the following insert method (among its other insert overloads):
iterator insert(const_iterator p, charT c);
So all you need is to find the position where the new character has to be inserted. If the string already contains the same character, then there are two approaches: either the new character is inserted before the existing character in the string, in which case you should use the standard algorithm std::lower_bound, or the new character is inserted after the existing character in the string, in which case you should use the standard algorithm std::upper_bound.
Here is a demonstrative program that shows how this can be done using the standard algorithm std::upper_bound. You may substitute std::lower_bound for it if you like. Though in my opinion it is better to insert the new character after the existing one, because in some situations you can avoid moving characters that follow the target position to make room for the new character.
#include <iostream>
#include <algorithm>
#include <string>
int main()
{
std::string myString( "afgjz" );
char c = 'b';
myString.insert( std::upper_bound( myString.begin(), myString.end(), c ), c );
std::cout << myString << std::endl;
return 0;
}
The program output is
abfgjz
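The difference between the two algorithms only shows up when the character is already present. The following sketch reports the insertion index each of them picks. The helper names lower_index, upper_index and insert_sorted are invented for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: index of the first character that is not less than c.
std::size_t lower_index(const std::string& sorted, char c)
{
    return static_cast<std::size_t>(
        std::lower_bound(sorted.begin(), sorted.end(), c) - sorted.begin());
}

// Hypothetical helper: index of the first character that is greater than c.
std::size_t upper_index(const std::string& sorted, char c)
{
    return static_cast<std::size_t>(
        std::upper_bound(sorted.begin(), sorted.end(), c) - sorted.begin());
}

// Hypothetical helper: returns a copy of the sorted string with c inserted in order.
std::string insert_sorted(std::string sorted, char c)
{
    sorted.insert(std::upper_bound(sorted.begin(), sorted.end(), c), c);
    return sorted;
}
```

For the string "abbbc" and the character 'b', lower_index gives 1 (before the run of b's) and upper_index gives 4 (after it). Either position keeps the string sorted, and for plain characters the resulting string is the same.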

Count word frequency using map

This is my first time implementing map in C++. So given a character array with text, I want to count the frequency of each word occurring in the text. I decided to implement map to store the words and compare following words and increment a counter.
Following is the code I have written so far.
const char *kInputText = "\
So given a character array with text, I want to count the frequency of\n\
each word occurring in the text.\n\
I decided to implement map to store the\n\
words and compare following words and increment a counter.\n";
typedef struct WordCounts
{
int wordcount;
}WordCounts;
typedef map<string, int> StoreMap;
//countWord function is to count the total number of words in the text.
void countWord( const char * text, WordCounts & outWordCounts )
{
outWordCounts.wordcount = 0;
size_t i;
if(isalpha(text[0]))
outWordCounts.wordcount++;
for(i=1;i<strlen(text);i++)
{
if((isalpha(text[i])) && (!isalpha(text[i-1])))
outWordCounts.wordcount++;
}
cout<<outWordCounts.wordcount;
}
//count_for_map() is to count the word frequency using map.
void count_for_map(const char *text, StoreMap & words)
{
string st;
while(text >> st)
words[st]++;
}
int main()
{
WordCounts wordCounts;
StoreMap w;
countWord( kInputText, wordCounts );
count_for_map(kInputText, w);
for(StoreMap::iterator p = w.begin();p != w.end();++p)
{
std::cout<<p->first<<"occurred" <<p->second<<"times. \n";
}
return 0;
}
Error: No match for 'operator >>' in 'text >> st'
I understand this is an operator overloading error, so I went ahead and
wrote the following lines of code.
//In the count_for_map()
/*istream & operator >> (istream & input,const char *text)
{
int i;
for(i=0;i<strlen(text);i++)
input >> text[i];
return input;
}*/
Am I implementing map in the wrong way?
There is no overload for >> with a const char* left hand side.
text is a const char*, not an istream, so your overload doesn't apply (and the overload 1: is wrong, and 2: already exists in the standard library).
You want to use the more suitable std::istringstream, like this:
std::istringstream textstream(text);
while(textstream >> st)
words[st]++;
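Put together as a complete function, the fix could look like the sketch below. The name word_frequencies is made up here, and the map is returned instead of being filled through a reference parameter, which is a design choice of this example rather than a requirement.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: counts whitespace-delimited words in a C string.
std::map<std::string, int> word_frequencies(const char* text)
{
    std::map<std::string, int> words;
    std::istringstream textstream(text); // wrap the C string in a stream
    std::string st;
    while (textstream >> st) // operator>> now has an istream on its left-hand side
        ++words[st];
    return words;
}
```

Keep in mind that operator>> splits on whitespace only, so punctuation stays attached to the word: "text," and "text." are counted as different keys.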
If you use modern C++, life gets far easier.
First. Usage of a std::map is the correct approach.
This is a more or less standard approach for counting something in a container.
We can use an associative container like a std::map or a std::unordered_map. And here we associate a "key", in this case the "word" to count, with a value, in this case the count of the specific word.
And luckily the maps have a very nice index operator[]. This will look for the given key and, if found, return a reference to the value. If not found, then it will create a new entry with the key and return a reference to the new entry. So, in both cases, we will get a reference to the value used for counting. And then we can simply write:
std::unordered_map<std::string, unsigned int> counter{};
counter[word]++;
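One consequence worth knowing: because operator[] creates missing entries, merely looking up a word with it adds that word to the map with a count of 0. A rough sketch of counting with operator[] and reading with find instead (tally and count_of are names invented for this example):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

using Counter = std::unordered_map<std::string, unsigned int>;

// Hypothetical helper: counts how often each word occurs.
Counter tally(const std::vector<std::string>& words)
{
    Counter counter;
    for (const std::string& word : words)
        ++counter[word]; // a new entry starts at 0, then gets incremented
    return counter;
}

// Hypothetical helper: reads a count without inserting the word when it is absent.
unsigned int count_of(const Counter& counter, const std::string& word)
{
    auto it = counter.find(word);
    return it == counter.end() ? 0u : it->second;
}
```

After tallying "a", "b", "a" the map has two entries, and asking count_of for a word that never occurred returns 0 while leaving the map at two entries.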
But how do we get words from a string? A string is like a container containing elements. And in C++ many containers have iterators. And especially for strings there is a dedicated iterator that allows us to iterate over patterns in a std::string. It is called std::sregex_token_iterator. The pattern is given as a std::regex, which gives you great flexibility.
And, because we have such a wonderful and dedicated iterator, we should use it!
Everything glued together gives a very compact solution, with a minimal number of code lines.
Please see:
#include <iostream>
#include <string>
#include <regex>
#include <map>
#include <iomanip>
const std::regex re{ "\\w+" };
const std::string text{ R"(So given a character array with text, I want to count the frequency of
each word occurring in the text.
I decided to implement map to store the
words and compare following words and increment a counter.")" };
int main() {
std::map<std::string, unsigned int> counter{};
for (auto word{ std::sregex_token_iterator(text.begin(),text.end(),re) }; word != std::sregex_token_iterator(); ++word)
counter[*word]++;
for (const auto& [word, count] : counter)
std::cout << std::setw(20) << word << "\toccurred\t" << count << " times\n";
}