Tokenizer for full-text - c++

This should be an ideal case of not re-inventing the wheel, but so far my search has been in vain.
Instead of writing one myself, I would like to use an existing C++ tokenizer. The tokens are to be used in an index for full-text searching. Performance is very important: I will parse many gigabytes of text.
Edit: Please note that the tokens are to be used in a search index. Creating such tokens is not an exact science (afaik) and requires some heuristics. This has been done a thousand times before, and probably in a thousand different ways, but I can't even find one of them :)
Any good pointers?
Thanks!

The C++ String Toolkit Library (StrTk) has the following solution to your problem:
#include <iostream>
#include <string>
#include <deque>
#include "strtk.hpp"
int main()
{
std::deque<std::string> word_list;
strtk::for_each_line("data.txt",
[&word_list](const std::string& line)
{
const std::string delimiters = "\t\r\n ,,.;:'\""
"!@#$%^&*_-=+`~/\\"
"()[]{}<>";
strtk::parse(line,delimiters,word_list);
});
std::cout << strtk::join(" ",word_list) << std::endl;
return 0;
}
More examples can be found Here

If performance is a main issue you should probably stick to good old strtok which is sure to be fast:
/* strtok example */
#include <stdio.h>
#include <string.h>
int main ()
{
char str[] ="- This, a sample string.";
char * pch;
printf ("Splitting string \"%s\" into tokens:\n",str);
pch = strtok (str," ,.-");
while (pch != NULL)
{
printf ("%s\n",pch);
pch = strtok (NULL, " ,.-");
}
return 0;
}

A regular expression library might work well if your tokens aren't too difficult to parse.
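To make the regular-expression suggestion concrete, here is a minimal sketch using std::regex from the standard library (C++11). The function name regex_tokens and the pattern are illustrative assumptions on my part, not taken from any particular indexer; a real search index would probably need a more careful definition of a "word". std::regex is often slower than a hand-written scanner, so benchmark it against your gigabytes of input before committing to it.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns every maximal run of ASCII letters, digits, or apostrophes.
// The pattern is an illustrative choice, not a recommendation for a real index.
std::vector<std::string> regex_tokens(const std::string& text)
{
    static const std::regex word_pattern("[A-Za-z0-9']+");

    std::vector<std::string> tokens;
    for (std::sregex_iterator it(text.begin(), text.end(), word_pattern), end;
         it != end; ++it)
    {
        tokens.push_back(it->str());
    }
    return tokens;
}
```

Calling regex_tokens("This is, a test") should give the four words with the comma and spaces dropped.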

I wrote my own tokenizer as part of the open-source
SWISH++ indexing and search engine.
There's also the ICU tokenizer
that handles Unicode.

I might look into std::stringstream from <sstream>. C-style strtok has a number of usability problems, and C-style strings are just troublesome.
Here's an ultra-trivial example of it tokenizing a sentence into words:
#include <sstream>
#include <iostream>
#include <string>
int main(void)
{
std::stringstream sentence("This is a sentence with a bunch of words");
std::string word;
// Test the extraction itself, so the loop stops as soon as there is nothing left to read
while (sentence >> word)
{
std::cout << "Got token: " << word << std::endl;
}
}
janks@phoenix:/tmp$ g++ tokenize.cc && ./a.out
Got token: This
Got token: is
Got token: a
Got token: sentence
Got token: with
Got token: a
Got token: bunch
Got token: of
Got token: words
The std::stringstream class is "bi-directional", in that it supports input and output. You'd probably want to do just one or the other, so you'd use std::istringstream or std::ostringstream.
The beauty of them is that they are also std::istream and std::ostreams respectively, so you can use them as you'd use std::cin or std::cout, which are hopefully familiar to you.
Some might argue these classes are expensive to use. std::strstream from <strstream> is mostly the same thing, but built on top of cheaper C-style 0-terminated strings, so it might be faster for you; be aware, though, that it has been deprecated since C++98. But anyway, I wouldn't worry about performance right away. Get something working, and then benchmark it. Chances are you can get enough speed simply by writing clean C++ that minimizes unnecessary object creation and destruction. If it's still not fast enough, then you can look elsewhere. These classes are probably fast enough, though: your CPU can burn through thousands of cycles in the time it takes to read a block of data from a hard disk or network.

You can use the Ragel State Machine Compiler to create a tokenizer (or a lexical analyzer).
The generated code has no external dependencies.
I suggest you look at the clang.rl sample for a relevant example of the syntax and usage.

Well, I would begin by searching Boost and... hop: Boost.Tokenizer
The nice thing? By default it breaks on whitespace and punctuation because it's meant for text, so you won't forget a symbol.
From the introduction:
#include<iostream>
#include<boost/tokenizer.hpp>
#include<string>
int main(){
using namespace std;
using namespace boost;
string s = "This is, a test";
tokenizer<> tok(s);
for(tokenizer<>::iterator beg=tok.begin(); beg!=tok.end();++beg){
cout << *beg << "\n";
}
}
// prints
This
is
a
test
// note how the ',' and ' ' were nicely removed
And there are additional features:
it can escape characters
it is compatible with Iterators so you can use it with an istream directly... and thus with an ifstream
and a few options (like keeping empty tokens etc...)
Check it out!

Related

C++ std::string::at()

I want to print the first letter of a string.
#include <iostream>
#include <string>
using namespace std;
int main() {
string str = "다람쥐 헌 쳇바퀴 돌고파.";
cout << str.at(0) << endl;
}
I want '다' to be printed like java, but '?' is printed.
How can I fix it?
That text you have in str -- how is it encoded?
Unfortunately, you need to know that to get the first "character". The std::string class only deals with bytes. How bytes turn into characters is a rather large topic.
The magic word you are probably looking for is UTF-8. See here for more information: How do I properly use std::string on UTF-8 in C++?
If you want to go down this road yourself, look here: Extract (first) UTF-8 character from a std::string
And if you're really interested, here's an hour-long video that is actually a great explanation of text encoding: https://www.youtube.com/watch?v=_mZBa3sqTrI
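If the string is known to be UTF-8, the length of the first encoded character can be read off its lead byte. The sketch below assumes well-formed UTF-8 input; the helper names utf8_length and first_utf8_char are made up for this illustration. It returns one code point, which is not always one user-perceived character (combining marks are separate code points), and whether the result prints correctly still depends on the console's encoding.

```cpp
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// Returns 1 for ASCII, and also for bytes that are not a valid lead byte,
// so a caller stepping through the string always makes progress.
std::size_t utf8_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 1;
}

// Returns the bytes that make up the first code point of a UTF-8 string.
std::string first_utf8_char(const std::string& s)
{
    if (s.empty()) return std::string();
    // substr clamps the count, so a truncated sequence cannot read out of bounds.
    return s.substr(0, utf8_length(static_cast<unsigned char>(s[0])));
}
```

With this, cout << first_utf8_char(str) writes the three bytes of '다' rather than only the first one, which is what str.at(0) gives you.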

How do I get all characters after a space in C++?

In C++, how do I get ALL of the text after a space? I am trying to make my own coding language, so I want the user to be able to enter (/print (text here)) and have the text they entered printed. I want this to be all on one line, without requiring the user to input the command and then separately input the thing they want to output. Thank you in advance to anyone who replies.
Try this way. It will give you all the characters after the first space in the string.
std::string x = "ABC CDEFG HIJKL";
std::string rest = x.substr(x.find(' ') + 1); // rest is "CDEFG HIJKL"
Leveraging <algorithm>
The following will work with C++11:
#include <string>
#include <algorithm>
#include <cctype>
#include <iostream>
#include <iterator>
bool is_blank(char ch)
{
return std::isblank(static_cast<unsigned char>(ch));
}
int main() {
std::string inp = "print foo";
auto it = std::find_if(inp.begin(), inp.end(), is_blank);
it = std::find_if_not(it, inp.end(), is_blank);
std::copy(it, inp.end(), std::ostream_iterator<char>(std::cout));
}
Run this code in Compiler Explorer.
Note that we're only iterating over the input string once. Also note that this solution leverages the algorithms which come with the C++ standard library - no raw loops required :-)
Using std::string's find functions
std::string has a ton of built-in functions. I'm pretty sure if C++ could be developed from scratch most of them wouldn't be there. But since we have them we put them to some use:
#include <string>
#include <algorithm>
#include <iostream>
#include <iterator>
int main() {
std::string inp = "print foo";
const std::string whitespace = " \t";
auto i = inp.find_first_of(whitespace);
i = inp.find_first_not_of(whitespace, i);
std::cout << inp.substr(i, inp.size() - i) << std::endl;
}
Run this code in Compiler Explorer.
I prefer the first solution since I find the last line a little more readable. std::copy might also be slightly more efficient. Here std::string::substr() returns a temporary string which gets destroyed once std::cout has printed it. Not ideal in terms of performance which might or might not matter here.
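One caveat about the find-based version: if the input contains no whitespace at all, i ends up as std::string::npos and substr throws std::out_of_range. A hedged sketch of a variant that returns an empty string in that case follows; the function name text_after_first_blank is invented for this example.

```cpp
#include <string>

// Returns everything after the first run of spaces/tabs,
// or an empty string if there is no blank or nothing follows it.
std::string text_after_first_blank(const std::string& input)
{
    const std::string blanks = " \t";

    const std::string::size_type first_blank = input.find_first_of(blanks);
    if (first_blank == std::string::npos) return std::string();

    const std::string::size_type rest = input.find_first_not_of(blanks, first_blank);
    if (rest == std::string::npos) return std::string();

    return input.substr(rest);
}
```

For a command interpreter this makes "print" on its own behave like a command with an empty argument instead of raising an exception.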

I have a programming project for an intro C++ class; one of the functions we need to create is a split function

I was hoping to get some feedback on whether I am doing this the "smart way", or whether I could be doing it faster. If I were splitting on whitespace, I would probably use getline(stringstream, word, delimiter), but I didn't know how to adapt the delimiter to a whole set of good characters. So I just loop through the whole string, building up a new word until I reach a bad character. As I am fairly new to programming, I'm not sure if it's the best way to do it.
Thanks for any feedback.
#include <iostream>
#include <string>
using std::string;
#include <vector>
using std::vector;
#include <sstream>
#include <algorithm>
#include <iterator> //delete l8r
using std::cout; using std::cin; using std::endl;
/*
void split(string line, vector<string>& words, string good_chars)
- Find words in the line that consist of good_chars.
  Any other character is considered a separator.
- Once you have a word, convert all the characters to lower case.
  You then push each word onto the reference vector words.
Important: split goes in its own file. This is both for your own benefit, you can reuse
split, and for grading purposes. We will provide a split.h for you.
*/
void split(string line, vector<string> & words, string good_chars){
string good_word;
for(auto c : line){
if(good_chars.find(c)!=string::npos){
good_word.push_back(c);
}
else{
if(good_word.size()){
std::transform(good_word.begin(), good_word.end(), good_word.begin(), ::tolower);
words.push_back(good_word);
}
good_word = "";
}
}
}
int main(){
vector<string> words;
string good_chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'";
// TEST split
split("This isn't a TEST.", words, good_chars);
// words should have: {"this", "isn't", "a", "test"}, no period in test
std::copy(words.begin(), words.end(), std::ostream_iterator<string>(cout, ","));
cout << endl;
return 0;
}
I'd say that this is a reasonable approach given the context of an intro to C++ class. I'd even say that it's fairly likely that this is the approach your instructor expects to see.
There are, of course, a few optimization tweaks that can be done. Like instantiating a 256-element bool array, using good_chars to set the corresponding values to true, and all others defaulting to false, then replacing the find() call with a quick array lookup.
But I'd predict that if you were to hand in such a thing, you'd be suspected of copying stuff you found on the intertubes, so leave that alone.
One thing you might consider doing is using tolower when you push_back each character, instead, and removing the extra std::transform pass over the word.
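For the curious, the two tweaks described above (the 256-entry lookup table and lowering each character as it is appended) could look roughly like this. It is a sketch only: the name split_fast is hypothetical, and it assumes single-byte characters. It also flushes a word that runs right up to the end of the line, which the loop in the question would drop because it only pushes a word when it meets a separator.

```cpp
#include <array>
#include <cctype>
#include <string>
#include <vector>

void split_fast(const std::string& line, std::vector<std::string>& words,
                const std::string& good_chars)
{
    // One flag per possible byte value; value-initialised to false.
    std::array<bool, 256> is_good{};
    for (unsigned char c : good_chars) is_good[c] = true;

    std::string word;
    for (unsigned char c : line) {
        if (is_good[c]) {
            // Lower-case while appending, so no second pass is needed.
            word.push_back(static_cast<char>(std::tolower(c)));
        } else if (!word.empty()) {
            words.push_back(word);
            word.clear();
        }
    }
    // Flush a word that ends exactly at the end of the line.
    if (!word.empty()) words.push_back(word);
}
```

Whether the table is measurably faster than find() on a 53-character string is something you would have to benchmark; for the assignment the original loop is fine.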

convert array string into char

I have read all the text from a file and placed it line by line into an array of strings. I am trying to split each string so that I can save it word by word in a separate array. Kindly tell me how I should convert the array of strings into char.
for example
string line[15]; // line[0] has : was there before
// line[1] has : then after
char * pch;
char *c = line.c_str(); // string to char (i am getting error here. Any body know?)
pch = strtok (c," ");
while (pch != NULL)
{
printf ("%s\n",pch);
pch = strtok (NULL, " ");
}
error: C2228: left of '.c_str' must have class/struct/union
string line[15]; is an array of strings. So when you have line.c_str(); line is a pointer to a string and not a string itself. A pointer doesn't have a .c_str() method and that's why the compiler is complaining. (Pointers don't have any methods and hence the compiler tells you that the left hand side of the expression must be a class/struct/union type). To fix this you want to index into the array to get a string. You can do this with something like: line[0].c_str();
Additionally, you can't write to anything returned by c_str(), as it returns a pointer to const characters. So if you are going to change the text in place, you'll need to copy the result of c_str() first and then operate on the copy.
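A minimal sketch of that copy-then-tokenize step, assuming you want to keep using strtok: the helper name strtok_words is made up for this example. Keep in mind that strtok holds hidden state between calls, so it is not safe to use from several threads or to nest.

```cpp
#include <cstring>
#include <string>
#include <vector>

std::vector<std::string> strtok_words(const std::string& line, const char* delimiters)
{
    // Writable copy including the terminating '\0';
    // strtok overwrites delimiters inside this buffer, not inside `line`.
    std::vector<char> buffer(line.c_str(), line.c_str() + line.size() + 1);

    std::vector<std::string> words;
    for (char* token = std::strtok(buffer.data(), delimiters);
         token != nullptr;
         token = std::strtok(nullptr, delimiters))
    {
        words.emplace_back(token);
    }
    return words;
}
```

In the question's terms you would call strtok_words(line[0], " ") for each element of the array.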
Also it might be worth mentioning that there's c++ ways of doing tokenizing, you might find some examples here Split a string in C++? . The last time I did this I was already using boost so I made use of the boost::tokenizer library.
There are simpler ways to accomplish this in C++. The strtok function is a C function and cannot be used on a std::string directly: c_str() only gives read-only access, and strtok needs a writable buffer because it overwrites delimiters with NUL characters. (Since C++11, &s[0] does point at writable contiguous storage, but letting strtok scribble over a std::string is rarely what you want.) You can use iostreams to get individual words separated by spaces from a file directly in C++.
Unfortunately the standard library lacks a simple, flexible, efficient method to split strings on arbitrary characters. It gets more complicated if you want to use iostreams to accomplish splitting on something other than whitespace. Using boost::split or the boost::tokenizer suggestion from shuttle87 is a good option if you need something more flexible (and it may well be more efficient too).
Here's a code example reading words from standard input, you can use pretty much the same code with an ifstream to read from a file or a stringstream to read from a string: http://ideone.com/fPpU4l
#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
using namespace std;
int main() {
vector<string> words;
copy(istream_iterator<string>{cin}, istream_iterator<string>{}, back_inserter(words));
copy(begin(words), end(words), ostream_iterator<string>{cout, ", "});
cout << endl;
}

conversion from stringstream to string removes '=' characters

I am reading an XML file into a stringstream buffer in order to parse it using RapidXML. RapidXML is only parsing the names of the XML nodes, but none of their attribute names or values. After some experimentation, I discovered that the problem is not likely to be with RapidXML, but with conversion of the stringstream buffer to a string using std::string content(buffer.str());. The '=' characters that are so important to XML parsing are converted to ' ' (space characters), prior to any RapidXML processing.
The character replacement is evident in the console window when the cout << calls are made in the code below, which is before RapidXML gets its hands on the string.
My code is as follows:
#include <iostream>
#include <fstream>
#include <stdio.h>
#include <conio.h>
#include <string>
#include <stdlib.h>
#include <rapidxml.hpp>
#include <vector>
#include <sstream>
using namespace std;
using namespace rapidxml;
//... main() and so forth, all works fine...
ifstream file(names.at(i)); // names.at(i) works fine...
//...
file.read(fileData, fileSize); // works fine...
//...
// Create XML document object using RapidXML:
xml_document<> doc;
//...
std::stringstream buffer;
buffer << file.rdbuf();
// This is where everything looks okay (i.e., '=' shows up properly):
cout << "\n" << buffer.str() << "\n\nPress a key to continue...";
getchar();
file.close();
std::string content(buffer.str());
// This is where the '=' are replaced by ' ' (space characters):
cout << "\n" << content << "\n\nPress a key to continue...";
getchar();
// Parse XML:
doc.parse<0>(&content[0]);
// Presumably the lack of '=' is preventing RapidXML from parsing attribute
// names and values, which always follow '='...
Thanks in advance for your help.
p.s. I followed advice on using this technique for reading an entire XML file into a stringstream, converting it to a string, and then feeding the string to RapidXML from the following links (thanks to contributors of these pieces of advice, sorry I can't make them work yet...):
Automation Software's RapidXML mini-tutorial
...this method was seen many other places, I won't list them here. Seems sensible enough. My errors seem to be unique. Could this be an ASCII vs. UNICODE issue?
I also tried code from here:
Thomas Whitton's example converting a string buffer to a dynamic cstring
code snippet from the above:
// string to dynamic cstring
std::vector<char> stringCopy(xml.length(), '\0');
std::copy(xml.begin(), xml.end(), stringCopy.begin());
char *cstr = &stringCopy[0];
rapidxml::xml_document<> parsedFromFile;
parsedFromFile.parse<0>(cstr);
...with similar RapidXML failure to parse node attribute names and values. Note that I didn't dump the character vector stringCopy to the console to inspect it, but I am getting the same problem, which for review is:
I am seeing correctly parsed names of XML tags after RapidXML parsing of the string fed to it for analysis.
There are no correctly parsed tag attribute names or values. These are dependent upon the '=' character showing up in the string to be parsed.
If you look closely, the = characters probably aren't being replaced by spaces but by zero bytes. If you look at the rapidxml documentation here:
http://rapidxml.sourceforge.net/manual.html#namespacerapidxml_1differences
It specifically states that it modifies the source text. This way it can avoid allocating any new strings; instead it uses pointers into the original source.
This part seems to work correctly, so maybe the problem is with the rest of your code that's trying to read the attributes?
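To illustrate what "modifies the source text" means, here is a toy in-place splitter. It is emphatically not RapidXML's code, and the names make_buffer and split_in_place are hypothetical; it only shows the general technique of writing a '\0' over a delimiter so that pointers into the original buffer can serve as C strings.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Copies a string into a writable buffer with a trailing '\0'.
std::vector<char> make_buffer(const std::string& text)
{
    std::vector<char> buffer(text.begin(), text.end());
    buffer.push_back('\0');
    return buffer;
}

// Splits "name=value" in place: overwrites the '=' with '\0' and returns
// pointers into the buffer (second is nullptr when there is no '=').
std::pair<const char*, const char*> split_in_place(std::vector<char>& buffer)
{
    for (std::size_t i = 0; i < buffer.size(); ++i) {
        if (buffer[i] == '=') {
            buffer[i] = '\0'; // this byte now terminates the name
            return {buffer.data(), buffer.data() + i + 1};
        }
    }
    return {buffer.data(), nullptr};
}
```

After such a pass, dumping the whole buffer to a console shows a gap where the '=' used to be, because the zero byte is typically rendered as a blank. Separately, the buffer handed to an in-place parser needs that trailing '\0'; the vector-copy snippet quoted in the question sizes its buffer to xml.length() and so appears to leave the terminator out.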