C++11 regex_token_iterator - c++

Hmm... I thought I understood regexes, and I thought I understood iterators, but C++11's regex implementation has me puzzled...
One area I don't understand: Reading about regex token iterators, I came across the following sample code:
#include <fstream>
#include <iostream>
#include <algorithm>
#include <iterator>
#include <regex>
int main()
{
std::string text = "Quick brown fox.";
// tokenization (non-matched fragments)
// Note that regex is matched only two times: when the third value is obtained
// the iterator is a suffix iterator.
std::regex ws_re("\\s+"); // whitespace
std::copy( std::sregex_token_iterator(text.begin(), text.end(), ws_re, -1),
std::sregex_token_iterator(),
std::ostream_iterator<std::string>(std::cout, "\n"));
...
}
I don't understand how the following output:
Quick
brown
fox.
is being created by the std::copy() function above. I see no loop, so I am puzzled as to how the iteration is occurring. Put another way, how is more than one line of output being generated?

std::copy copies elements from an input range into an output range, and the loop you are looking for lives inside std::copy. In your program, the input range is the three tokens extracted using the regular expression as a delimiter. These are the three words that are printed. Each time std::copy increments the token iterator, the iterator searches for the next match; the default-constructed std::sregex_token_iterator() marks the end of the range. The output range is an ostream_iterator, which simply takes each element it is given and writes it to an output stream.
If you step through std::copy using your debugger, you will see that it loops over the elements of the input range.

Related

Reading in from a .tsv file

I'm trying to read in information from a tab separated value file with the format:
<string> <int> <string>
Example:
Seaking 119 Azumao
Mr. Mime 122 Barrierd
Weedle 13 Beedle
This is currently how I'm doing it:
string americanName;
int pokedexNumber;
string japaneseName;
inFile >> americanName;
inFile >> pokedexNumber;
inFile >> japaneseName;
My issue stems from the space in the "Mr. Mime" as the strings can contain spaces.
I would like to know how to read the file in properly.
The standard library uses locales to classify characters. A stream asks the ctype facet of its locale whether a character is whitespace, and operator>> stops at whatever that facet reports as a space.
You can use this fact to control the meaning of ' ' in your case:
#include <iostream>
#include <locale>
#include <algorithm>
#include <string>
struct tsv_ws : std::ctype<char>
{
mask t[table_size]; // classification table, stores category for each character
tsv_ws() : ctype(t) // ctype will use our table to check character type
{
// copy all default values to our table;
std::copy_n(classic_table(), table_size, t);
// here we tell, that ' ' is a punctuation, but not a space :)
t[' '] = punct;
}
};
int main() {
std::string s;
std::cin.imbue(std::locale(std::cin.getloc(), new tsv_ws)); // using our locale, will work for any stream
while (std::cin >> s) {
std::cout << "read: '" << s << "'\n";
}
}
Here we make ' ' a punctuation character instead of a space character, so streams no longer treat it as a separator. The exact category isn't important, as long as it isn't space.
That's quite a powerful technique. For example, you could redefine ',' to be a space to read in CSV format.
You can use std::getline with '\t' as the delimiter to extract strings that contain spaces but no tabs.
std::getline(inFile, americanName, '\t'); // read up to first tab
inFile >> pokedexNumber >> std::ws; // read number then second tab
std::getline(inFile, japaneseName); // read up to first newline
It seems that you want to read CSV data or, in your case, TSV data. Let's stick to the common term "csv". This is a standard task, and I will explain it in detail. In the end, all the reading will be done in a one-liner.
I would recommend a "modern" C++ approach.
People searching for "reading csv data" are still being pointed to How can I read and parse CSV files in C++?. That question is from 2009 and is now over 10 years old. Most of its answers are also old and very complicated. So maybe it's time for a change.
In modern C++ you have algorithms that iterate over ranges. You will often see something like "someAlgorithm(container.begin(), container.end(), someLambda)". The idea is that we iterate over a sequence of similar elements.
In your case we iterate over the tokens in your input string and create substrings. This is called tokenizing.
For exactly that purpose, we have the std::sregex_token_iterator. And because it has been defined for this purpose, we should use it.
This thing is an iterator for iterating over a std::string, hence the "s" in sregex. The first two parameters define the range of input to operate on. Then comes a std::regex that describes what should be matched (or what should not be matched) in the input string. The last parameter selects the matching strategy.
0 --> give me the text that the regex matched (1, 2, ... select the corresponding capture group) and
-1 --> give me everything that is NOT matched by the regex.
So, now that we understand the iterator, we can std::copy the tokens from the iterator to our target, a std::vector of std::string. And since we do not know how many columns we have, we will use std::back_inserter as the target. This appends every token that we get from the std::sregex_token_iterator to our std::vector<std::string>. It doesn't matter how many columns we have.
Good. Such a statement could look like
std::copy( // We want to copy something
std::sregex_token_iterator // The iterator begin, the sregex_token_iterator. Give back first token
(
line.begin(), // Evaluate the input string from the beginning
line.end(), // to the end
re, // Match the delimiter
-1 // But give me back not the delimiter but everything else
),
std::sregex_token_iterator(), // iterator end for sregex_token_iterator, last token + 1
std::back_inserter(cp.columns) // Append everything to the target container
);
Now we can understand, how this copy operation works.
Next step. We want to read from a file. The file also contains repeated data of the same kind: the rows.
As above, we can iterate over similar data, whether it comes from a file or from anywhere else. For this purpose C++ has the std::istream_iterator. It is a template: the template parameter is the type of data that it should read, and the constructor parameter is a reference to an input stream. It doesn't matter whether the input stream is std::cin, a std::ifstream or a std::istringstream. The behaviour is identical for all kinds of streams.
And since we do not have files on SO, I use (in the example below) a std::istringstream to hold the input csv data. But of course you can open a file by defining a std::ifstream testCsv(filename). No problem.
With std::istream_iterator, we iterate over the input and read similar data. In our case, one problem is that we want to iterate over special data and not over some built-in data type.
To solve this, we define a proxy class, which does the internal work for us (we do not want to know how; that should be encapsulated in the proxy). In the proxy we overload the conversion operator, so that the result converts to the type we expect from the std::istream_iterator.
And the last important step. A std::vector has a range constructor. It has also a lot of other constructors that we can use in the definition of a variable of type std::vector. But for our purposes this constructor fits best.
So we define a variable csv and use its range constructor and give it a begin of a range and an end of a range. And, in our specific example, we use the begin and end iterator of std::istream_iterator.
If we combine all the above, reading the complete CSV file is a one-liner, it is the definition of a variable with calling its constructor.
Please see the resulting code:
#include <iostream>
#include <sstream>
#include <fstream>
#include <string>
#include <vector>
#include <iterator>
#include <regex>
#include <algorithm>
// Tabs are written as \t so that the delimiters are unambiguous
std::istringstream testCsv{ "Seaking\t119\tAzumao\n"
"Mr. Mime\t122\tBarrierd\n"
"Weedle\t13\tBeedle" };
// Define Alias for easier Reading
using Columns = std::vector<std::string>;
using CSV = std::vector<Columns>;
// Proxy for the input Iterator
struct ColumnProxy {
// Overload extractor. Read a complete line
friend std::istream& operator>>(std::istream& is, ColumnProxy& cp) {
// Read a line
std::string line; cp.columns.clear();
if (std::getline(is, line)) {
// The delimiter
const std::regex re("\t");
// Split values and copy into resulting vector
std::copy(std::sregex_token_iterator(line.begin(), line.end(), re, -1),
std::sregex_token_iterator(),
std::back_inserter(cp.columns));
}
return is;
}
// Type cast operator overload. Cast the type 'Columns' to std::vector<std::string>
operator std::vector<std::string>() const { return columns; }
protected:
// Temporary to hold the read vector
Columns columns{};
};
int main()
{
// Define variable CSV with its range constructor. Read complete CSV in this statement, So, one liner
CSV csv{ std::istream_iterator<ColumnProxy>(testCsv), std::istream_iterator<ColumnProxy>() };
// Print result. Go through all lines and then copy line elements to std::cout
std::for_each(csv.begin(), csv.end(), [](Columns & c) {
std::copy(c.begin(), c.end(), std::ostream_iterator<std::string>(std::cout, " ")); std::cout << "\n"; });
}
I hope the explanation was detailed enough to give you an idea of what you can do with modern C++.
This example does not care how many rows and columns are in the source text file. It will eat everything.

C++ copy cin into cout directly but in reverse order

Is there any solution similar to this command:
using namespace std;
copy(istream_iterator<string>(cin), istream_iterator<string>(),ostream_iterator<string>(cout, "\n"));
This command copies everything into cout, but I would like to change it to copy the strings in reverse order, so I tried this:
using namespace std;
reverse_copy(istream_iterator<string>(cin), istream_iterator<string>(),ostream_iterator<string>(cout, "\n"));
But this did not even compile. Are there any solutions to this? Thank you
The first two arguments to std::reverse_copy must be bidirectional iterators, whereas std::istream_iterator is an input iterator, which cannot behave as a bidirectional iterator. That is why it doesn't even compile.
You have to write your own iterator, or do it manually in a loop, to solve this problem. It is also not clear what you mean by reverse: given foo bar as input, do you want bar foo, oof rab, or rab oof? Several commenters asked the same.
You can write a recursive function. For example
#include <iostream>
#include <string>
#include <sstream>
std::ostream & reverse_output( std::istream &is = std::cin,
std::ostream &os = std::cout )
{
std::string s;
if ( is >> s ) reverse_output( is, os ) << s << '\n';
return os;
}
int main()
{
std::istringstream is( "Hello Bobul Mentol" );
reverse_output( is );
}
The program output is
Mentol
Bobul
Hello
Of course, instead of the string stream that I used for demonstration purposes, you can use std::cin. In this case the call of the function will look just like
reverse_output();
Otherwise you need to store the input in some container and then output the stored data in reverse.
For example
std::vector<std::string> v( std::istream_iterator<std::string>( std::cin ),
std::istream_iterator<std::string>() );
std::reverse_copy( v.begin(), v.end(),
std::ostream_iterator<std::string>( std::cout, "\n" ) );
I don't know of any standard algorithm that can copy a range in reverse through an input iterator or even a forward iterator; std::reverse_copy requires at least bidirectional iterators.
So you can use a temporary collection to store the values read, like this:
vector<string> tmp;
copy(istream_iterator<string>(cin), istream_iterator<string>(), back_inserter(tmp));
copy(tmp.rbegin(), tmp.rend(), ostream_iterator<string>(cout, "\n"));
There is a general problem here: std::cin is a stream. When it is attached to a file, you could imagine finding the size first and thereby knowing where a reverse iterator should start. But when it is attached to a terminal, with an unpredictable human being able to type input at will, at what position should the reverse iterator start? I have no idea, and it looks like the designers of cin had none either. More seriously, cin does not offer a reverse iterator, and that is by design.
If you want to present what has been entered on cin in reverse order, you must first specify:
what is the piece of input to reverse: anything until the stream is closed, or anything until the first end of line, or [put your own definition here]. Once that is settled, you have the starting point for your reverse iterator
what is the unit to be reversed: one character or one word at a time. Once that is settled, you know what your reverse iterator should return.
The implementation could use a vector of strings. You keep accumulating words or single characters in it until you reach what you have defined as the end of the input. Once you hit the end, you hold a container with bidirectional iterators, so copying it in reverse order should be easy.

How to find repeated words in file with vector C++

My task: a file contains an unknown number of words, and some of the words are repeated an unknown number of times. I have to find those repeated words. I use classes and a vector to work with the words, and fstream to work with the files. But I cannot find a resource or an algorithm for finding repeated words, and I'm puzzled. I have a vector of strings and I pushed the words into it. That works; I tested it by printing v.size(). I have done everything except the algorithm for finding the repeated words, which turned out to be difficult for me.
My full code that I wrote:
#include <iostream>
#include <string>
#include <fstream>
#include <vector>
#include <algorithm>
#include <stdio.h>
#include <iterator>
using namespace std;
class Wording {
private:
string word;
vector <string> v;
public:
Wording(string Alternateword, vector <string> Alternatev) {
v = Alternatev;
word = Alternateword;
}
};
int main() {
ifstream ifs("words.txt");
ofstream ofs("wordresults.txt");
string word;
vector <string> v;
Wording obj(word,v);
while(ifs >> word) v.push_back(word);
for(int i=0; i<v.size(); i++) {
//waiting for algorithm
//ofs << v[i] << endl;
}
return 0;
}
Try using a hash map. In C++11, use std::unordered_map. Before C++11, GNU C++ offers the non-standard __gnu_cxx::hash_map (in <ext/hash_map>), which gives you the same capabilities; otherwise, boost::unordered_map is available from Boost.
The key concept here is hash_map<word, count>.
Are the unique words in the input file what you want? If so, you can do this with a set (or an unordered_set if you don't need them to be sorted), like so:
std::set<std::string> words; //can be changed to unordered_set
std::copy(std::istream_iterator<std::string>(ifs), std::istream_iterator<std::string>(), std::inserter(words, words.begin()));
std::copy(words.begin(), words.end(), std::ostream_iterator<std::string>(ofs, "\n"));
You can also use vector, but you'll have to sort it and then use unique on it.
I can't compile this code now, so there might be some errors in my code snippet.
If what you want is the number of occurrences of each different word in the file, then you'll have to use some kind of map, as was already suggested. Of course, using a vector, sorting it and then counting consecutive equal words is also a solution, but it wouldn't be as clear.

How to get a string of union set from a vector string?

I have a vector string filled with some file extensions as follows:
vector<string> vExt;
vExt.push_back("*.JPG;*.TGA;*.TIF");
vExt.push_back("*.PNG;*.RAW");
vExt.push_back("*.BMP;*.HDF");
vExt.push_back("*.GIF");
vExt.push_back("*.JPG");
vExt.push_back("*.BMP");
I now want to get a string of union set from the above-mentioned vector string, in which each file extension must be unique in the resulting string. As for my given example, the resulting string should take the form of "*.JPG;*.TGA;*.TIF;*.PNG;*.RAW;*.BMP;*.HDF;*.GIF".
I know that std::unique can remove consecutive duplicates in a range. It won't work in my case. Would you please show me how to do that? Thank you!
See it live here: http://ideone.com/0fmy0 (FIXED)
#include <iostream>
#include <sstream>
#include <iterator>
#include <algorithm>
#include <vector>
#include <set>
int main()
{
std::vector<std::string> vExt;
vExt.push_back("*.JPG;*.TGA;*.TIF");
vExt.push_back("*.PNG;*.RAW");
vExt.push_back("*.BMP;*.HDF");
vExt.push_back("*.GIF");
vExt.push_back("*.JPG");
vExt.push_back("*.BMP");
std::stringstream ss;
std::copy(vExt.begin(), vExt.end(),
std::ostream_iterator<std::string>(ss, ";"));
std::string element;
std::set<std::string> unique;
while (std::getline(ss, element, ';'))
unique.insert(unique.end(), element);
std::stringstream oss;
std::copy(unique.begin(), unique.end(),
std::ostream_iterator<std::string>(oss, ";"));
std::cout << oss.str() << std::endl;
return 0;
}
output:
*.BMP;*.GIF;*.HDF;*.JPG;*.PNG;*.RAW;*.TGA;*.TIF;
I'd tokenize each string into constituent parts (using semicolon as the separator), and stick the resulting tokens into a set. The resultant contents of that set is what you're looking for.
You need to parse the strings that contain multiple file extensions and then push the individual extensions into the vector. After that, sorting the vector and applying std::unique (followed by erase) will do what you want. Have a look at the Boost.Tokenizer class, which should make this trivial.

How do I alter this tokenization process to work on a text file with multiple lines?

I'm working with this source code:
#include <string>
#include <vector>
#include <iostream>
#include <istream>
#include <ostream>
#include <iterator>
#include <sstream>
#include <algorithm>
int main()
{
std::string str = "The quick brown fox";
// construct a stream from the string
std::stringstream strstr(str);
// use stream iterators to copy the stream to the vector as whitespace separated strings
std::istream_iterator<std::string> it(strstr);
std::istream_iterator<std::string> end;
std::vector<std::string> results(it, end);
// send the vector to stdout.
std::ostream_iterator<std::string> oit(std::cout);
std::copy(results.begin(), results.end(), oit);
}
I want to change it so that, instead of tokenizing a single line and putting it into the vector results, it tokenizes a group of lines taken from this text file and puts the resulting words into a single vector.
Text File:
Munroe states there is no particular meaning to the name and it is simply a four-letter word without a phonetic pronunciation, something he describes as "a treasured and carefully-guarded point in the space of four-character strings." The subjects of the comics themselves vary. Some are statements on life and love (some love strips are simply art with poetry), and some are mathematical or scientific in-jokes.
So far, I'm only clear that I need to use a
while (getline(streamOfText, readTextLine)){}
to get the loop running.
But I don't think this would work:
while (getline(streamOfText, readTextLine)) {
cout << readTextLine << endl;
// construct a stream from the string
std::stringstream strstr(readTextLine);
// use stream iterators to copy the stream to the vector as whitespace separated strings
std::istream_iterator<std::string> it(strstr);
std::istream_iterator<std::string> end;
std::vector<std::string> results(it, end);
/* HOW CAN I MAKE THIS WORK INSIDE THE LOOP WITHOUT RE-DECLARING AND USING THE CONSTRUCTORS FOR THE ITERATORS AND VECTOR? */
// send the vector to stdout.
std::ostream_iterator<std::string> oit(std::cout);
std::copy(results.begin(), results.end(), oit);
}
Yes, then you have one whole line in readTextLine. Is that what you wanted in that loop? If so, instead of constructing the vector from the istream iterators, define the vector outside the loop and copy into it:
std::vector<std::string> results;
while (getline(streamOfText, readTextLine)){
std::istringstream strstr(readTextLine);
std::istream_iterator<std::string> it(strstr), end;
std::copy(it, end, std::back_inserter(results));
}
You actually don't need to read a line into the string first, if all you need is all words from a stream, and no per-line processing. Just read from the other stream directly like you did in your code. It will not only read words from one line, but from the whole stream, until the end-of-file:
std::istream_iterator<std::string> it(streamOfText), end;
std::vector<std::string> results(it, end);
To do all that manually, as you ask for in the comments, do
std::istream_iterator<std::string> it(streamOfText), end;
while(it != end) results.push_back(*it++);
I recommend that you read a good book on this. It will show you many more useful techniques, I think. The C++ Standard Library by Josuttis is a good book.