How to read file or stream until string found - c++

I am writing a dictionary program, the input is specified by a file and parsed as such:
std::string savedDictionary(std::istreambuf_iterator<char>(std::ifstream(DICTIONARY_SAVE_FILE)), {});
// entire file loaded into savedDictionary
for (size_t end = 0; ;)
{
size_t term = savedDictionary.find("|TERM|", end);
size_t definition = savedDictionary.find("|DEFINITION|", term);
if ((end = savedDictionary.find("|END|", definition)) == std::string::npos) break;
// store term and definition here...
}
This throws std::bad_alloc on some of my third world users' machines that don't have enough RAM to store the dictionary string + the dictionary as it's held inside my program.
If I could do this:
std::string term;
for (std::ifstream file(DICTIONARY_SAVE_FILE); file; std::getline(file, term, "|END|"))
{
// same as above
}
then it would be great, but std::getline doesn't support a string as delimiter.
So, what's the most idiomatic way to read the file until I find "|END|" without allocating a crap ton of memory up front?

We can achieve the requested functionality by using a very simple proxy class. With that it is easy to use all std::algorithms and all std::iterators as usual.
So, we define a small proxy class called LineUntilEnd. It can be used with any input stream, such as a std::ifstream. In particular, you can simply use the extractor operator to extract a value from the input stream and put it into the desired variable.
// Here we will store the lines until |END|
LineUntilEnd lue;
// Simply read the line until |END|
while (testInput >> lue) {
It works as expected.
And once we have such a string, we can parse it afterwards with simple regex operations.
I added a small example that puts the resulting values in a std::multimap to build a demo dictionary.
Please see the following code:
#include <iostream>
#include <string>
#include <iterator>
#include <regex>
#include <map>
#include <sstream>
#include <iterator>
// Ultra simple proxy class to read data until given word is found
struct LineUntilEnd
{
// Overload the extractor operator
friend std::istream& operator >>(std::istream& is, LineUntilEnd& lue);
// Intermediate storage for result
std::string data{};
};
// Read stream until "|END|" symbol has been found
std::istream& operator >>(std::istream& is, LineUntilEnd& lue)
{
// Clear destination string
lue.data.clear();
// We will count, how many bytes of the search string have been matched
size_t matchCounter{ 0U };
// Read characters from stream
char c{'\0'};
while (is.get(c))
{
// Add character to resulting string
lue.data += c;
// Check for a match. All characters must be matched
if (c == "|END|"[matchCounter]) {
// Check next matching character
++matchCounter;
// If there is a match for all characters in the search string
if (matchCounter >= (sizeof "|END|" -1)) {
// Then stop reading
break;
}
}
else {
// Not all characters could be matched. Start from the beginning, but
// keep a count of 1 if this character is itself the leading '|'
matchCounter = (c == '|') ? 1U : 0U;
}
}
return is;
}
// Input Test Data
std::istringstream testInput{ "|TERM|bonjour|TERM|hola|TERM|hi|DEFINITION|hello|END||TERM|Adios|TERM|Ciao|DEFINITION|bye|END|" };
// Regex definitions. Used to build up a dictionary
std::regex reTerm(R"(\|TERM\|(\w+))");
std::regex reDefinition(R"(\|DEFINITION\|(\w+)\|END\|)");
// Test code
int main()
{
// We will store the found values in a dictionary
std::multimap<std::string, std::string> dictionary{};
// Here we will store the lines until |END|
LineUntilEnd lue;
// Simply read the line until |END|
while (testInput >> lue) {
// Search for the definition string
std::smatch sm{};
if (std::regex_search(lue.data, sm, reDefinition)) {
// Definition string found
// Iterate over all terms
std::sregex_token_iterator tokenIter(lue.data.begin(), lue.data.end(), reTerm, 1);
while (tokenIter != std::sregex_token_iterator()) {
// Store values in dictionary
dictionary.insert({ sm[1],*tokenIter++ });
}
}
}
// And show some result to the user
for (const auto& d : dictionary) {
std::cout << d.first << " --> " << d.second << "\n";
}
return 0;
}

For those in the future, this is what I ended up writing:
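// Needs <istream>, <optional> and <string>.
// Reads one entry, up to and including "|END|"; empty optional at end of stream.
std::optional<std::string> ReadEntry(std::istream& stream)
{
    const std::string end = "|END|";
    std::string entry;
    for (char ch; stream.get(ch); )
    {
        entry.push_back(ch);
        // Done as soon as the entry ends with the terminator.
        if (entry.size() >= end.size() && entry.compare(entry.size() - end.size(), end.size(), end) == 0)
            return entry;
    }
    return {};
}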

Related

How to get a word vector from a string?

I want to store words separated by spaces into single string elements in a vector.
The input is a string that may end or may not end in a symbol( comma, period, etc.)
All symbols will be separated by spaces too.
I created this function but it doesn't return me a vector of words.
vector<string> single_words(string sentence)
{
vector<string> word_vector;
string result_word;
for (size_t character = 0; character < sentence.size(); ++character)
{
if (sentence[character] == ' ' && result_word.size() != 0)
{
word_vector.push_back(result_word);
result_word = "";
}
else
result_word += character;
}
return word_vector;
}
What did I do wrong?
Your problem has already been resolved by answers and comments.
I would like to give you the additional information that such functionality is already existing in C++.
You could take advantage of the fact that the extractor operator extracts space-separated tokens from a stream. Because a std::string is not a stream, we first put the string into a std::istringstream and then extract from this stream via the std::istream_iterator.
We could make life even easier.
For roughly 10 years we have had dedicated C++ functionality for splitting strings into tokens, explicitly designed for this purpose: the std::sregex_token_iterator. And because we have such a dedicated function, we should simply use it.
The idea behind it is the iterator concept. In C++ we have many containers, and always iterators to iterate over the similar elements in these containers. A string with similar elements (tokens), separated by a delimiter, can also be seen as such a container. And with the std::sregex_token_iterator, we can iterate over the elements/tokens/substrings of the string, splitting it up effectively.
This iterator is very powerful and you can do much fancier things with it, but that would go too far here. The important point is that splitting a string into tokens is a one-liner, for example a variable definition that uses a range constructor to iterate over the tokens.
See some examples below:
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <iterator>
#include <algorithm>
#include <regex>
const std::regex delimiter{ " " };
const std::regex reWord{ "(\\w+)" };
int main() {
// Some debug print function
auto print = [](const std::vector<std::string>& sv) -> void {
std::copy(sv.begin(), sv.end(), std::ostream_iterator<std::string>(std::cout, "\n")); std::cout << "\n"; };
// The test string
std::string test{ "word1 word2 word3 word4." };
//-----------------------------------------------------------------------------------------
// Solution 1: use istringstream and then extract from there
std::istringstream iss1(test);
// Define a vector (CTAD), use its range constructor and, the std::istream_iterator as iterator
std::vector words1(std::istream_iterator<std::string>(iss1), {});
print(words1); // Show debug output
//-----------------------------------------------------------------------------------------
// Solution 2: directly use dedicated function sregex_token iterator
std::vector<std::string> words2(std::sregex_token_iterator(test.begin(), test.end(), delimiter, -1), {});
print(words2); // Show debug output
//-----------------------------------------------------------------------------------------
// Solution 3: directly use dedicated function sregex_token iterator and look for words only
std::vector<std::string> words3(std::sregex_token_iterator(test.begin(), test.end(), reWord, 1), {});
print(words3); // Show debug output
//-----------------------------------------------------------------------------------------
// Solution 4: Use such iterator in an algorithm, to copy data to a vector
std::vector<std::string> words4{};
std::copy(std::sregex_token_iterator(test.begin(), test.end(), reWord, 1), {}, std::back_inserter(words4));
print(words4); // Show debug output
//-----------------------------------------------------------------------------------------
// Solution 5: Use such iterator in an algorithm for direct output
std::copy(std::sregex_token_iterator(test.begin(), test.end(), reWord, 1), {}, std::ostream_iterator<std::string>(std::cout,"\n"));
return 0;
}
You added the index instead of the character:
vector<string> single_words(string sentence)
{
vector<string> word_vector;
string result_word;
for (size_t i = 0; i < sentence.size(); ++i)
{
char character = sentence[i];
if (character == ' ' && result_word.size() != 0)
{
word_vector.push_back(result_word);
result_word = "";
}
else
result_word += character;
}
return word_vector;
}
Your mistake came from naming the loop variable character even though it is actually an index, not a character. I would suggest using a range-based for loop here, since it avoids this kind of confusion. The clean solution is obviously to do what @ArminMontigny said, but I assume you are not allowed to use stringstreams. The code would look like this:
#include <iostream>
#include <string>
#include <vector>
using namespace std;
vector<string> single_words(string sentence)
{
vector<string> word_vector;
string result_word;
for (char& character: sentence) // Now `character` is actually a character.
{
if (character==' ' && result_word.size() != 0)
{
word_vector.push_back(result_word);
result_word = "";
}
else
result_word += character;
}
word_vector.push_back(result_word); // In your solution, you forgot to push the last word into the vector.
return word_vector;
}
int main() {
string sentence="Maybe try range based loops";
vector<string> result= single_words(sentence);
for(string& word: result)
cout<<word<<" ";
return 0;
}

reading selected lines of file using mmap

I posted a question about reading a file faster by skipping specific lines, but that does not seem to work well with the standard C++ APIs.
I researched more and learned that memory mapped files could come in handy for these kinds of cases. Details about memory mapped files are here.
All in all,
Suppose, the file(file.txt) is like this:
A quick brown fox
// Blah blah
// Blah blah
jumps over the little lazy dog
Then, in code, I open the file, map it into memory, and iterate over the contents through the char* pointer, skipping the unwanted lines directly. I wanted to give it a try before reaching a conclusion on it. A skeleton of my code looks like this:
struct stat filestat;
FILE *file = fopen("file.txt", "r");
if (-1 == fstat(fileno(file), &filestat)) {
std::cout << "FAILED with fstat" << std::endl;
return FALSE;
} else {
char* data = (char*)mmap(0, filestat.st_size, PROT_READ, MAP_PRIVATE, fileno(file), 0);
if (data == MAP_FAILED) { // mmap reports failure with MAP_FAILED, not with a null pointer
std::cout << "FAILED " << std::endl;
return FALSE;
}
// Filter out 'data'
// for (unsigned int i = 0; i < filestat.st_size; ++i) {
// Do something here..
// }
munmap(data, filestat.st_size);
return TRUE;
}
In this case, I want to capture the lines which do not start with //. Since this file (file.txt) is already memory mapped, I could go over the data pointer and filter out the lines. Am I correct in doing so?
If so, what is the efficient way to parse the lines?
Reading selected lines from wherever and copy them to whatever can be done with the C++ algorithms.
You can use std::copy_if. This will copy data from any source to any destination, if the predicate is true.
Here is a simple example that copies data from a file and skips all lines starting with "//". The result is put in a vector.
This is a single statement calling a single function: a classic one-liner.
For debugging purposes, I print the result to the console.
#include <iostream>
#include <vector>
#include <iterator>
#include <algorithm>
#include <string>
#include <fstream>
using LineBasedTextFile = std::vector<std::string>;
class CompleteLine { // Proxy for the input Iterator
public:
// Overload extractor. Read a complete line
friend std::istream& operator>>(std::istream& is, CompleteLine& cl) { std::getline(is, cl.completeLine); return is; }
// Cast the type 'CompleteLine' to std::string
operator std::string() const { return completeLine; }
protected:
// Temporary to hold the read string
std::string completeLine{};
};
int main()
{
// Open the input file
std::ifstream inputFile("r:\\input.txt");
if (inputFile)
{
// This vector will hold all lines of the file
LineBasedTextFile lineBasedTextFile{};
// Read the file and copy all lines that fulfill the required condition into the vector of lines
std::copy_if(std::istream_iterator<CompleteLine>(inputFile), std::istream_iterator<CompleteLine>(), std::back_inserter(lineBasedTextFile), [](const std::string & s) {return s.find("//") != 0; });
// Print vector of lines
std::copy(lineBasedTextFile.begin(), lineBasedTextFile.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
}
return 0;
}
I hope this helps

Must Return a Value; Vector of Integers; Word Count

I'm having some problems with a program that I'm writing where you find the count of words in a file. The issue I'm having is with my function "get_word_counts" in my cpp. I'm continually getting a message in Visual Studio that states "error C4716: 'TextCounter::get_word_counts': must return a value", despite the fact that I do, in fact, return a value at the end of the function.
Can someone help me understand what the issue with this is? I have searched everywhere but I can't seem to figure out exactly what the problem is. Perhaps it's something simple, but I just can't see it.
I'll post below my cpp file as well as the header:
cpp:
#include"TextCounter.h"
#include <string>
#include <iostream>
#include <vector>
/*
Constructor takes in the filename and builds the map.
*/
TextCounter::TextCounter(std::string file):
filename(file) {
// Build the input list.
parse_file();
}
/*
Get the count of each word in the document.
If a word doesn't occur in the document, put 0.
*/
std::vector<int> TextCounter::get_word_counts(const std::vector<std::string>& words) {
// TODO: Finish this method.
std::vector<int> result;
std::unordered_map<std::string, int>::iterator iter;
for (auto const &i : words) {
iter = TextCounter::frequency.find(i);
if (iter == frequency.end()) {
result.push_back(0);
}
else {
result.push_back(iter->second);
}
}
return result;
}
// Add a word to the map.
// Check to see if the word exists, if so, increment
// otherwise create a new entry and set it to 1.
void TextCounter::add_word(std::string word) {
// COMP: finished this method.
//Look if it's already there.
if (frequency.find(word) == frequency.end()) // Then we've encountered the word for a first time.
frequency[word] = 1; // Initialize it to 1.
else // Then we've already seen it before..
frequency[word]++; // Increment it.
}
// Parse an input file.
// Return -1 if there is an error.
int TextCounter::parse_file() {
// Local variables.
std::ifstream inputfile; // ifstream for reading from input file.
// Open the filename specified for input.
inputfile.open (filename);
// Tokenize the input.
// Read one character at a time.
// If the character is not in a-z or A-Z, terminate current string.
char c;
char curr_str[MAX_STRING_LEN];
int str_i = 0; // Index into curr_str.
bool flush_it = false; // Whether we have a complete string to flush.
while (inputfile.good()) {
// Read one character, convert it to lowercase.
inputfile.get(c);
c = tolower(c);
if (c >= 'a' && c <= 'z') {
// c is a letter.
curr_str[str_i] = c;
str_i++;
// Check over-length string.
if (str_i >= MAX_STRING_LEN) {
flush_it = true;
}
} else {
// c is not a letter.
// Create a new string if curr_str is non-empty.
if (str_i>0) {
flush_it = true;
}
}
if (flush_it) {
// Create the new string from curr_str.
std::string the_str(curr_str,str_i);
// std::cout << the_str << std::endl;
// COMP: Insert code to handle new entries or increment an existing entry.
TextCounter::add_word(the_str);
// Reset state variables.
str_i = 0;
flush_it = false;
}
}
// Close input file.
inputfile.close();
return 0;
}
header:
#include <fstream>
#include <unordered_map>
#include <string>
#include <vector>
#define MAX_STRING_LEN 256
#define DICT_SIZE 20000
class TextCounter {
public:
explicit TextCounter(std::string file = "");
// Get the counts of a vector of words.
std::vector<int> get_word_counts(const std::vector<std::string>& words);
private:
// Name of the input file.
std::string filename;
// COMP: Implement a data structure to keep track of each word and the
// number of times that word occurs in the document.
std::unordered_map<std::string, int> frequency;
// Parse an input file.
int parse_file();
// Add a word to the map.
void add_word(std::string word);
};
Any help would be greatly appreciated.
Thank you!

C/C++ reading and writing long strings to files

I have a list of cities that I'm formatting like this:
{town, ...},
{...},
...
Reading and building each town and creating town1, town2,.... works
The problem is when I output it, 1st line works {town, ...}, but the second line crashes.
Any idea why?
I have [region] [town] (excel table).
So each region repeats by how many towns are in it.
Each file has 1 region/town per line.
judete contains each region repeated 1 time.
AB
SD
PC
....
orase contains the towns list.
town1
town2
....
orase-index contains the region of each town
AB
AB
AB
AB
SD
SD
SD
PC
PC
...
I want an output like this {"town1", "town2", ...} and each row (row 5) contains the town that belong to the region from judete at the same row (row 5).
Here's my code:
#include<stdio.h>
#include<string.h>
char judet[100][100];
char orase[50][900000];
char oras[100], ceva[100];
void main ()
{
int i=0, nr;
FILE *judete, *index, *ORASE, *output;
judete = fopen("judete.txt", "rt");
index = fopen("orase-index.txt", "rt");
ORASE = fopen("orase.txt", "rt");
output = fopen("output.txt", "wt");
while( !feof(judete) )
{
fgets(judet[i], 100, judete);
i++;
}
nr = i;
char tmp[100];
int where=0;
for(i=0;i<nr;i++)
strcpy(orase[i],"");
while( !feof(index) )
{
fgets(tmp, 100, index);
for(i=0;i<nr;i++)
{
if( strstr(judet[i], tmp) )
{
fgets(oras, 100, ORASE);
strcat(ceva, "\"");
oras[strlen(oras)-1]='\0';
strcat(ceva, oras);
strcat(ceva, "\", ");
strcat(orase[i], ceva);
break;
}
}
}
char out[900000];
for(i=0;i<nr;i++)
{
strcpy(out, "");
strcat(out, "{");
strcat(out, orase[i]); //fails here
fprintf(output, "%s},\n", out);
}
}
The result I get from running the code is:
Unhandled exception at 0x00D4F7A9 (msvcr110d.dll) in orase-judete.exe: 0xC0000005: Access violation writing location 0x00A90000.
You don't clear the orase array, because your loop
for(i-0;i<nr;i++)
strcpy(orase[i],"");
by mistake ('-' instead of '=') executes 0 times.
I think you need to start by making up your mind whether you're writing C or C++. You've tagged this with both, but the code looks like it's pure C. While a C++ compiler will accept most C, the result isn't what most would think of as ideal C++.
Since you have tagged it as C++, I'm going to assume you actually want (or are all right with) C++ code. Well-written C++ code is going to be different enough from your current C code that it's probably easier to start over than to try to rewrite the code line by line.
The immediate problem I see with doing that, however, is that you haven't really specified what you want as your output. For the moment I'm going to assume you want each line of output to be something like this: "{" <town> "," <town> "}".
If that's the case, I'd start by noting that the output doesn't seem to depend on your judete file at all. The orase and orase-index seem to be entirely adequate. For that, our code can look something like this:
#include <iostream>
#include <string>
#include <iterator>
#include <fstream>
#include <vector>
// a class that overloads `operator>>` to read a line at a time:
class line {
std::string data;
public:
friend std::istream &operator>>(std::istream &is, line &l) {
return std::getline(is, l.data);
}
operator std::string() const { return data; }
};
int main() {
// open the input files:
std::ifstream town_input("orase.txt");
std::ifstream region_input("orase-index.txt");
// create istream_iterator's to read from the input files. Note
// that these iterate over `line`s, (i.e., objects of the type
// above, so they use its `operator>>` to read each data item).
//
std::istream_iterator<line> regions(region_input),
towns(town_input),
end;
// read in the lists of towns and regions:
std::vector<std::string> town_list {towns, end};
std::vector<std::string> region_list {regions, end};
// write out the file of town-name, region-name:
std::ofstream result("output.txt");
for (int i=0; i<town_list.size(); i++)
result << "{" << town_list[i] << "," << region_list[i] << "}\n";
}
Note that since this is C++, you'll typically need to save the source as something.cpp instead of something.c for the compiler to recognize it correctly.
Edit: Based on the new requirements you've given in the comments, you apparently want something closer to this:
#include <iostream>
#include <string>
#include <iterator>
#include <fstream>
#include <vector>
#include <map>
// a class that overloads `operator>>` to read a line at a time:
class line {
std::string data;
public:
friend std::istream &operator>>(std::istream &is, line &l) {
return std::getline(is, l.data);
}
operator std::string() const { return data; }
};
int main() {
// open the input files:
std::ifstream town_input("orase.txt");
std::ifstream region_input("orase-index.txt");
// create istream_iterator's to read from the input files. Note
// that these iterate over `line`s, (i.e., objects of the type
// above, so they use its `operator>>` to read each data item).
//
std::istream_iterator<line> regions(region_input),
towns(town_input),
end;
// read in the lists of towns and regions:
std::vector<std::string> town_list (towns, end);
std::vector<std::string> region_list (regions, end);
// consolidate towns per region:
std::map<std::string, std::vector<std::string> > consolidated;
for (int i = 0; i < town_list.size(); i++)
consolidated[region_list[i]].push_back(town_list[i]);
// write out towns by region
std::ofstream output("output.txt");
for (auto pos = consolidated.begin(); pos != consolidated.end(); ++pos) {
// write the region name and its towns to the same stream, the output file
output << pos->first << ": ";
std::copy(pos->second.begin(), pos->second.end(),
std::ostream_iterator<std::string>(output, "\t"));
output << "\n";
}
}
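Notice that ceva is never reset. As a global it starts out empty, but every strcat(ceva, ...) appends to what is already there, so after a few towns it overflows its 100 bytes.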
Instead of using strcpy to initialize strings, I would recommend using static initialization:
char ceva[100]="";

Selective iterator

FYI: no boost, yes it has this, I want to reinvent the wheel ;)
Is there some form of a selective iterator (possible) in C++? What I want is to separate strings like this:
some:word{or other
to a form like this:
some : word { or other
I can do that with two loops and find_first_of(":") and ("{") but this seems (very) inefficient to me. I thought that maybe there would be a way to create/define/write an iterator that would iterate over all these values with for_each. I fear this will have me writing a full-fledged custom way-too-complex iterator class for a std::string.
So I thought maybe this would do:
std::vector<size_t> list;
size_t index = mystring.find(":");
while( index != std::string::npos )
{
list.push_back(index);
index = mystring.find(":", list.back() + 1); // continue after the last hit, otherwise the same position is found forever
}
std::for_each(list.begin(), list.end(), addSpaces(mystring));
This looks messy to me, and I'm quite sure a more elegant way of doing this exists. But I can't think of it. Anyone have a bright idea? Thanks
PS: I did not test the code posted, just a quick write-up of what I would try
UPDATE: after taking all your answers into account, I came up with this, and it works to my liking :). this does assume the last char is a newline or something, otherwise an ending {,}, or : won't get processed.
void tokenize( string &line )
{
char oneBack = ' ';
char twoBack = ' ';
char current = ' ';
size_t length = line.size();
for( size_t index = 0; index<length; ++index )
{
twoBack = oneBack;
oneBack = current;
current = line.at( index );
if( isSpecial(oneBack) )
{
if( !isspace(twoBack) ) // insert before
{
line.insert(index-1, " ");
++index;
++length;
}
if( !isspace(current) ) // insert after
{
line.insert(index, " ");
++index;
++length;
}
}
}
}
Comments are welcome as always :)
That's relatively easy using the std::istream_iterator.
What you need to do is define your own class (say Term). Then define how to read a single "word" (term) from the stream using the operator >>.
I don't know what your exact definition of a word is, so I am using the following definition:
Any consecutive sequence of alphanumeric characters is a term.
Any single non-whitespace character that is not alphanumeric is also a term.
Try this:
#include <string>
#include <sstream>
#include <iostream>
#include <iterator>
#include <algorithm>
class Term
{
public:
// This cast operator is not required but makes it easy to use
// a Term anywhere that a string can normally be used.
operator std::string const&() const {return value;}
private:
// A term is just a string
// And we friend the operator >> to make sure we can read it.
friend std::istream& operator>>(std::istream& inStr,Term& dst);
std::string value;
};
Now all we have to do is define an operator >> that reads a word according to the rules:
// This function could be a lot neater using some boost regular expressions.
// I just do it manually to show it can be done without boost (as requested)
std::istream& operator>>(std::istream& inStr,Term& dst)
{
// Note that the >> operator drops all preceding white space.
// So we get the first non white space character.
char first;
inStr >> first;
// If the stream is in any bad state then stop processing.
if (inStr)
{
if(std::isalnum(first))
{
// Alpha Numeric so read a sequence of characters
dst.value = first;
// This is ugly. And needs re-factoring.
while((first = inStr.get(), inStr) && std::isalnum(first))
{
dst.value += first;
}
// Take into account the special case of EOF.
// And bad stream states.
if (inStr)
{
// The last character read was not part of the word.
// So put it back for use by the next call to read from the stream.
inStr.putback(first);
}
else
{
// We hit EOF, but we know that we have a word, so clear any errors to make sure it
// is used. Let the next attempt to read a word (term) fail at the outer if.
inStr.clear();
}
}
else
{
// It was not alpha numeric so it is a one character word.
dst.value = first;
}
}
return inStr;
}
So now we can use it in standard algorithms by just employing the istream_iterator
int main()
{
std::string data = "some:word{or other";
std::stringstream dataStream(data);
std::copy( // Read the stream one Term at a time.
std::istream_iterator<Term>(dataStream),
std::istream_iterator<Term>(),
// Note the ostream_iterator is using a std::string
// This works because a Term can be converted into a string.
std::ostream_iterator<std::string>(std::cout, "\n")
);
}
The output:
> ./a.exe
some
:
word
{
or
other
std::string const str = "some:word{or other";
std::string result;
result.reserve(str.size());
for (std::string::const_iterator it = str.begin(), end = str.end();
it != end; ++it)
{
if (isalnum(*it))
{
result.push_back(*it);
}
else
{
result.push_back(' '); result.push_back(*it); result.push_back(' ');
}
}
Insert version for speed-up
std::string str = "some:word{or other";
for (std::string::iterator it = str.begin(), end = str.end(); it != end; ++it)
{
if (!isalnum(*it))
{
it = str.insert(it, ' ') + 2;
it = str.insert(it, ' ');
end = str.end();
}
}
Note that std::string::insert inserts BEFORE the iterator passed and returns an iterator to the newly inserted character. Assigning is important since the buffer may have been reallocated at another memory location (the iterators are invalidated by the insertion). Also note that you can't keep end for the whole loop, each time you insert you need to recompute it.
"a more elegant way of doing this exists."
I do not know how Boost implements that, but the traditional way is to feed the input string character by character into an FSM (finite state machine) which detects where tokens (words, symbols) start and end.
"I can do that with two loops and find_first_of(":") and ("{")"
One loop with std::find_first_of() should suffice.
Though I'm still a huge fan of FSMs for such parsing tasks.
P.S. Similar question
How about something like:
std::string::const_iterator it, end = mystring.end();
for(it = mystring.begin(); it != end; ++it) {
if ( !isalnum( *it ))
list.push_back(it);
}
This way, you'll only iterate once through the string, and isalnum from ctype.h seems to do what you want. Of course, the code above is very simplistic and incomplete and only suggests a solution.
Are you looking to tokenize the input string, ala strtok?
If so, here is a tokenizing function that you can use. It takes an input string and a string of delimiters (each char in the string is a possible delimiter), and it returns a vector of tokens. Each token is a pair with the delimited string and the delimiter used in that case:
#include <cstdlib>
#include <vector>
#include <string>
#include <functional>
#include <iostream>
#include <algorithm>
using namespace std;
// FUNCTION : stringtok(char const* Raw, string sToks)
// PARAMETERS : Raw Pointer to NULL-Terminated string containing a string to be tokenized.
// sToks string of individual token characters -- each character in the string is a token
// DESCRIPTION : Tokenizes a string, much in the same way as strtok does. The input string is not modified. The
// function is called once to tokenize a string, and all the tokens are returned at once.
// RETURNS : Returns a vector of strings. Each element in the vector is one token. The token character is
// not included in the string. The number of elements in the vector is N+1, where N is the number
// of times the Token character is found in the string. If one token is an empty string (as with the
// string "string1##string3", where the token character is '#'), then that element in the vector
// is an empty string.
// NOTES :
//
typedef pair<char,string> token; // first = delimiter, second = data
inline vector<token> tokenize(const string& str, const string& delims, bool bCaseSensitive=false) // tokenizes a string, returns a vector of tokens
{
bCaseSensitive;
// prologue
vector<token> vRet;
// tokenize input string
for( string::const_iterator itA = str.begin(), it=itA; it != str.end(); it = find_first_of(++it,str.end(),delims.begin(),delims.end()) )
{
// prologue
// find end of token
string::const_iterator itEnd = find_first_of(it+1,str.end(),delims.begin(),delims.end());
// add string to output
if( it == itA ) vRet.push_back(make_pair(0,string(it,itEnd)));
else vRet.push_back(make_pair(*it,string(it+1,itEnd)));
// epilogue
}
// epilogue
return vRet;
}
using namespace std;
int main()
{
string input = "some:word{or other";
typedef vector<token> tokens;
tokens toks = tokenize(input.c_str(), " :{");
cout << "Input: '" << input << "' # Tokens: " << toks.size() << "\n";
for( tokens::iterator it = toks.begin(); it != toks.end(); ++it )
{
cout << " Token : '" << it->second << "', Delimiter: '" << it->first << "'\n";
}
return 0;
}