How to tokenize (words) classifying punctuation as space

Based on this question which was closed rather quickly:
Trying to create a program to read a users input then break the array into seperate words are my pointers all valid?
Rather than closing I think some extra work could have gone into helping the OP to clarify the question.
The Question:
I want to tokenize user input and store the tokens into an array of words.
I want to use punctuation (.,-) as delimiters and thus remove it from the token stream.
In C I would use strtok() to break an array into tokens and then manually build an array.
Like this:
The main Function:
char **findwords(char *str);
int main()
{
int test;
char words[100]; //an array of chars to hold the string given by the user
char **word; //pointer to a list of words
int index = 0; //index of the current word we are printing
char c;
cout << "die monster !";
//a loop to place the charecters that the user put in into the array
do
{
c = getchar();
words[index] = c;
}
while (words[index] != '\n');
word = findwords(words);
while (word[index] != 0) //loop through the list of words until the end of the list
{
printf("%s\n", word[index]); // while the words are going through the list print them out
index ++; //move on to the next word
}
//free it from the list since it was dynamically allocated
free(word);
cin >> test;
return 0;
}
The line tokenizer:
char **findwords(char *str)
{
int size = 20; //original size of the list
char *newword; //pointer to the new word from strok
int index = 0; //our current location in words
char **words = (char **)malloc(sizeof(char *) * (size +1)); //this is the actual list of words
/* Get the initial word, and pass in the original string we want strtok() *
* to work on. Here, we are seperating words based on spaces, commas, *
* periods, and dashes. IE, if they are found, a new word is created. */
newword = strtok(str, " ,.-");
while (newword != 0) //create a loop that goes through the string until it gets to the end
{
if (index == size)
{
//if the string is larger than the array increase the maximum size of the array
size += 10;
//resize the array
char **words = (char **)malloc(sizeof(char *) * (size +1));
}
//asign words to its proper value
words[index] = newword;
//get the next word in the string
newword = strtok(0, " ,.-");
//increment the index to get to the next word
++index;
}
words[index] = 0;
return words;
}
Any comments on the above code would be appreciated.
But, additionally, what is the best technique for achieving this goal in C++?

Have a look at boost tokenizer for something that's much better in a C++ context than strtok().

How to tokenize a stream in C++ is already covered by a lot of questions.
Example: How to read a file and get words in C++
What is harder to find is how to get the same functionality as strtok():
Basically strtok() allows you to split the string on a whole set of user-defined characters, while the C++ stream only allows you to use white space as a separator. Fortunately the definition of white space comes from the locale, so we can modify the locale to treat other characters as space; this then lets us tokenize the stream in a more natural fashion.
#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>
#include <vector>
// This is my facet that will treat the ,.- as space characters and thus ignore them.
class WordSplitterFacet: public std::ctype<char>
{
public:
typedef std::ctype<char> base;
typedef base::char_type char_type;
WordSplitterFacet(std::locale const& l)
: base(table)
{
std::ctype<char> const& defaultCType = std::use_facet<std::ctype<char> >(l);
// Copy the default value from the provided locale
static char data[256];
for(int loop = 0;loop < 256;++loop) { data[loop] = loop;}
defaultCType.is(data, data+256, table);
// Modifications to default to include extra space types.
table[','] |= base::space;
table['.'] |= base::space;
table['-'] |= base::space;
}
private:
base::mask table[256];
};
We can then use this facet in a locale like this:
std::ctype<char>* wordSplitter(new WordSplitterFacet(std::locale()));
<stream>.imbue(std::locale(std::locale(), wordSplitter));
The next part of your question is how to store these words in an array. Well, in C++ you would not. You would delegate this functionality to std::vector and std::string. Reading your code, you will see that it does two major things in the same place:
It is managing memory.
It is tokenizing the data.
There is a basic principle, Separation of Concerns, which says that your code should only try to do one of two things. It should either do resource management (memory management in this case) or it should do business logic (tokenization of the data). By separating these into different parts of the code you make the code easier to use and easier to write. Fortunately, in this example all the resource management is already done by std::vector and std::string, which allows us to concentrate on the business logic.
As has been shown many times the easy way to tokenize a stream is using operator >> and a string. This will break the stream into words. You can then use iterators to automatically loop across the stream tokenizing the stream.
std::vector<std::string> data;
for(std::istream_iterator<std::string> loop(<stream>); loop != std::istream_iterator<std::string>(); ++loop)
{
// In here loop is an iterator that has tokenized the stream using
// operator >> (which for std::string reads one space separated word).
data.push_back(*loop);
}
We can combine this with a standard algorithm to simplify the code:
std::copy(std::istream_iterator<std::string>(<stream>), std::istream_iterator<std::string>(), std::back_inserter(data));
Now combining all the above into a single application
int main()
{
// Create the facet.
std::ctype<char>* wordSplitter(new WordSplitterFacet(std::locale()));
// Here I am using a string stream.
// But any stream can be used. Note you must imbue a stream before it is used.
// Otherwise the imbue() will silently fail.
std::stringstream teststr;
teststr.imbue(std::locale(std::locale(), wordSplitter));
// Now that it is imbued we can use it.
// If this was a file stream then you could open it here.
teststr << "This, stri,plop";
std::cout << "die monster !";
std::vector<std::string> data;
std::copy(std::istream_iterator<std::string>(teststr), std::istream_iterator<std::string>(), std::back_inserter(data));
// Copy the array to cout one word per line
std::copy(data.begin(), data.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
}
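For comparison, here is a minimal sketch that gets strtok()-like splitting without touching the locale at all, using std::string::find_first_of and find_first_not_of. This is a different technique from the facet above, and the name splitOnDelims and its default delimiter set are made up for this illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split `text` on any character found in `delims`,
// skipping empty tokens (the same way strtok() collapses runs of delimiters).
std::vector<std::string> splitOnDelims(const std::string& text,
                                       const std::string& delims = " ,.-")
{
    std::vector<std::string> tokens;
    // Position of the first character that is not a delimiter.
    std::string::size_type start = text.find_first_not_of(delims);
    while (start != std::string::npos) {
        // End of the current token (npos if it runs to the end of the text).
        std::string::size_type end = text.find_first_of(delims, start);
        // substr clamps the length, so end == npos simply means "to the end".
        tokens.push_back(text.substr(start, end - start));
        // Skip the delimiters that follow; yields npos when nothing is left.
        start = text.find_first_not_of(delims, end);
    }
    return tokens;
}
```

Unlike strtok(), this leaves the input string untouched and keeps no hidden state, at the cost of copying each token.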

Related

What is the reason behind the debugging getting stopped abruptly in the following code?

Here is code that counts how many times a string entered by the user occurs in the file temp.txt. If, for example, we want love to be counted, then matches like love, lovely and beloved should all be counted. We also want to count the total number of words in the temp.txt file.
I am doing a line by line reading here, not word by word.
Why does the debugging stop at totalwords += counting(line)?
/*this code is not working to count the words*/
#include<iostream>
#include<fstream>
#include<string>
using namespace std;
int totalwords{0};
int counting(string line){
int wordcount{0};
if(line.empty()){
return 1;
}
if(line.find(" ")==string::npos){wordcount++;}
else{
while(line.find(" ")!=string::npos){
int index=0;
index = line.find(" ");
line.erase(0,index);
wordcount++;
}
}
return wordcount;
}
int main() {
ifstream in_file;
in_file.open("temp.txt");
if(!in_file){
cerr<<"PROBLEM OPENING THE FILE"<<endl;
}
string line{};
int counter{0};
string word {};
cout<<"ENTER THE WORD YOU WANT TO COUNT IN THE FILE: ";
cin>>word;
int n {0};
n = ( word.length() - 1 );
while(getline(in_file>>ws,line)){
totalwords += counting(line);
while(line.find(word)!=string::npos){
counter++;
int index{0};
index = line.find(word);
line.erase(0,(index+n));
}
}
cout<<endl;
cout<<counter<<endl;
cout<<totalwords;
return 0;
}

line.erase(0, index); doesn't erase the space, you need
line.erase(0, index + 1);

Your code reveals a few problems...
First of all, counting a single word for an empty line doesn't appear correct to me. Second, erasing again and again from the string is pretty inefficient: with every such operation all of the subsequent characters are copied towards the front. If you indeed wanted to do so, you might rather search from the end of the string, avoiding that. But you can actually do this without ever modifying the string if you use the second parameter of std::string::find (which defaults to 0, so has been transparent to you...):
int index = line.find(' ' /*, 0*/); // first call; 0 is the default, thus implicit
index = line.find(' ', index + 1); // subsequent call
Note that the character overload is more efficient if you are searching for a single character anyway. However, this variant doesn't consider other whitespace such as tabs.
Additionally, the variant as posted in the question doesn't consider more than one subsequent whitespace! In your erasing variant – which erases one character too few, by the way – you would need to skip incrementing the word count if you find the space character at index 0.
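The same "search again from a later position" idea also covers the other half of the question, counting matches of the user's word. A minimal sketch, assuming overlapping matches should be counted; the name countOccurrences is invented for this example:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count occurrences of `word` in `line`
// without modifying `line`. Overlapping matches are counted.
std::size_t countOccurrences(const std::string& line, const std::string& word)
{
    if (word.empty()) return 0;            // an empty needle would match everywhere
    std::size_t count = 0;
    for (std::string::size_type pos = line.find(word);
         pos != std::string::npos;
         pos = line.find(word, pos + 1))   // resume one past the start of the last match
    {
        ++count;
    }
    return count;
}
```

With the input "love lovely beloved" and the word "love" this counts three matches, which is what the question asks for.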
However, I'd go with a totally new approach, looking at each character separately. You need a stateful loop in that case, though, i.e. you need to remember whether you are already within a word or not. It might look like this:
size_t wordCount = 0; // note: prefer an unsigned type, negative values
// are meaningless anyway
// size_t is especially fine as it is guaranteed to be
// large enough to hold any count of characters the
// string might ever contain
bool inWord = false;
for(char c : line)
{
if(isspace(static_cast<unsigned char>(c)))
// you can check for *any* white space that way...
// note the cast to unsigned, which is necessary as isspace accepts
// an int and a bare char *might* be signed, thus result in negative
// values
{
// no word any more...
inWord = false;
}
else if(inWord)
{
// well, nothing to do, we already discovered a word earlier!
//
// as we actually don't do anything here you might just skip
// this block and check for the opposite: if(!inWord)
}
else
{
// OK, this is the start of a word!
// so now we need to count a new one!
++wordCount;
inWord = true;
}
}
Now you might want to break words at punctuation characters as well, so you might actually want to check for:
if(isspace(static_cast<unsigned char>(c)) || ispunct(static_cast<unsigned char>(c)))
A bit shorter is the following variant:
if(/* space or punctuation */)
{
inWord = false;
}
else
{
wordCount += !inWord; // adds 1 only when a new word starts, 0 otherwise
inWord = true;
}
Finally: All code is written freely, thus unchecked – if you find a bug, please fix yourself...
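Putting the pieces of this answer together, a self-contained version might look like the sketch below. The function name countWords is made up here, and treating every punctuation character as a separator is an assumption (it means "don't" counts as two words).

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: count words in `line`, where a word is a maximal run
// of characters that are neither whitespace nor punctuation.
std::size_t countWords(const std::string& line)
{
    std::size_t wordCount = 0;
    bool inWord = false;                       // are we currently inside a word?
    for (char c : line) {
        // Cast first: passing a negative char to isspace/ispunct is undefined.
        const unsigned char uc = static_cast<unsigned char>(c);
        if (std::isspace(uc) || std::ispunct(uc)) {
            inWord = false;                    // separator ends any current word
        } else {
            wordCount += !inWord;              // count only the first character of a word
            inWord = true;
        }
    }
    return wordCount;
}
```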

debugging getting stopped abruptly
Does debugging indeed stop at the indicated line? I observed instead that the program hangs within the while loop in counting. You may make this visible by inserting an indicator output (marked by HERE in following code):
int counting(string line){
int wordcount{0};
if(line.empty()){
return 1;
}
if(line.find(" ")==string::npos){wordcount++;}
else{
while(line.find(" ")!=string::npos){
int index=0;
index = line.find(" ");
line.erase(0,index);
cout << '.'; // <--- HERE: indicator output
wordcount++;
}
}
return wordcount;
}
As Jarod42 pointed out, the erase call you are using misses the space itself. That's why you are finding spaces and “counting words” forever.
There is also an obvious misconception about words and separators of words visible in your code:
empty lines don't contain words
consecutive spaces don't indicate words
words may be separated by non-spaces (parentheses for example)
Finally, as already mentioned: if the problem is about counting total words, it's not necessary to discuss the other parts. And after the test (see HERE) above, it also appears to be independent of file input. So your code could be reduced to something like this:
#include <iostream>
#include <string>
int counting(std::string line) {
int wordcount = 0;
if (line.empty()) {
return 1;
}
if (line.find(" ") == std::string::npos) {
wordcount++;
} else {
while (line.find(" ") != std::string::npos) {
int index = 0;
index = line.find(" ");
line.erase(0, index);
wordcount++;
}
}
return wordcount;
}
int main() {
int totalwords = counting("bla bla");
std::cout << totalwords;
return 0;
}
And in this form, it's much easier to see if it works. We expect to see a 2 as output. To get there, it's possible to try correcting your erase call, but the result would then still be wrong (1) since you are actually counting spaces. So it's better to take the time and carefully read Aconcagua's insightful answer.
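If whitespace-separated words are all you need, one possible replacement lets a string stream do the splitting. This is only a sketch: the name countWithStream is invented, and it does not treat parentheses or other punctuation as separators.

```cpp
#include <sstream>
#include <string>

// Hypothetical replacement for counting(): operator>> skips any run of
// whitespace, so empty lines and consecutive spaces are handled for free.
int countWithStream(const std::string& line)
{
    std::istringstream words(line);
    std::string word;
    int wordcount = 0;
    while (words >> word) {   // extraction fails once only whitespace is left
        ++wordcount;
    }
    return wordcount;
}
```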

argument list for class template "std::vector" is missing

I need some help with this program. I am in my first ever programming class and have run into a wall trying to get my program to work. I have included what I have written so far, but it still doesn't compile. It gives the error: argument list for class template "std::vector" is missing.
Here is the question:
When you read a long document, there is a good chance that many words occur multiple times. Instead of storing each word, it may be beneficial to only store unique words, and to represent the document as a vector of pointers to the unique words. Write a program that implements this strategy. Read a word at a time from cin. Keep a vector <char *> of words. If the new word is not present in this vector, allocate memory, copy the word into it, and append a pointer to the new memory. If the word is already present, then append a pointer to the existing word.
Below is code snippet:
#include "stdafx.h"
#include <string>
#include <iostream>
using namespace std;
/* Create a vector of char pointers to hold the individual words.
Create a string input to hold the next input through cin. */
int main() {
vector words;
string input;
/* Keep the while loop running using cin as the condition to read an entire document.
This will end when a document has reached its end. */
while (cin >> input) {
/* For every word read as a string, convert the word into a c-string by allocating
a new character array with the proper size and using c_str and strcpy to copy
an identical c-string into the memory heap. */
char* temp = new char[input.length() + 1];
strcpy(temp, input.c_str());
/* Next, check if the word is already in the words array. Use a boolean variable
that updates if the word is found. Compare words by using the strcmp function;
when they are equal, strcmp equals 0. */
bool already_present = false;
for (int i = 0; i < words.size(); i++) {
if (strcmp(temp, words[i]) == 0) {
already_present = true;
}
}
/* If the word is already present, delete the allocated memory.
Otherwise, push the pointer into the words vector. */
if (already_present) {
delete temp;
} else {
words.push_back(temp);
}
}
}

I hope below code snippet could be helpful:
#include <string>
#include <iostream>
#include <string.h> // String.h for strcmp()
#include <vector> // Vector Header file is added
using namespace std;
int main() {
vector <char *> words; // vector of char *
string input;
while (cin >> input) {
char *temp = new char[input.length() + 1];
strcpy(temp, input.c_str());
bool already_present = false;
for (unsigned int i = 0; i < words.size(); i++) {
if (strcmp(temp, words[i]) == 0) {
already_present = true;
}
}
if (already_present) {
delete[] temp; // memory from new[] must be released with delete[]
} else {
words.push_back(temp);
}
}
/* Print the desired output */
for(unsigned int i=0; i<words.size(); i++) {
cout << words[i] << endl;
}
return 0;
}
Any doubt, comments most welcome.
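Note that the assignment also asks you to append a pointer to the existing word when a word repeats, so the document itself is a vector of pointers. A minimal sketch of that part, assuming two vectors (one owning the unique words, one describing the document); addWord and freeWords are names invented for this example:

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: `unique` owns one allocation per distinct word,
// `document` holds one (possibly shared) pointer per word that was read.
void addWord(std::vector<char*>& unique, std::vector<char*>& document,
             const std::string& input)
{
    for (char* known : unique) {
        if (std::strcmp(known, input.c_str()) == 0) {
            document.push_back(known);       // word seen before: reuse its storage
            return;
        }
    }
    char* copy = new char[input.length() + 1]; // +1 for the terminating '\0'
    std::strcpy(copy, input.c_str());
    unique.push_back(copy);
    document.push_back(copy);
}

// Release every allocation exactly once; `document` only aliases them.
void freeWords(std::vector<char*>& unique)
{
    for (char* word : unique) delete[] word;
    unique.clear();
}
```

In main you would call addWord(unique, document, input) inside the while (cin >> input) loop and freeWords(unique) at the end.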
EDIT: After reading your comments, I came to the conclusion that you use Microsoft Visual Studio. The reason you were getting the warning is that strcpy() is potentially unsafe, because it can lead to a buffer overflow if you try to copy a string to a buffer that is not large enough to contain it.
Consider a code snippet for a moment:
char foo[10]; /* a buffer able to hold 9 chars (plus the null) */
char bar[] = "A string longer than 9 chars";
strcpy( foo, bar ); /* compiles ok, but VERY BAD because you have a buffer overflow
and are corrupting memory. */
strcpy_s() is safer because you have to explicitly specify the size of the target buffer, so the function will not overflow it:
strcpy_s( foo, 10, bar ); /* strcpy_s will not write more than 10 characters */
The limitation of strcpy_s() is that it is only an optional part of the C standard (Annex K of C11) and in practice is mostly available with Microsoft's compiler. Therefore, if you write code that uses it, your code will be much less portable.
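If portability matters, one standard alternative is snprintf, which never writes more than the given number of bytes and always null-terminates when the size is non-zero. A small sketch; the wrapper name copyTruncated is made up, and silently truncating is an assumption about what you want to happen on overflow:

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: copy `src` into `dst` (capacity `dstSize` bytes),
// truncating if needed. Returns true when the whole string fitted.
bool copyTruncated(char* dst, std::size_t dstSize, const char* src)
{
    if (dstSize == 0) return false;          // nowhere to put even the '\0'
    // snprintf returns the length the full string would have had.
    const int needed = std::snprintf(dst, dstSize, "%s", src);
    return needed >= 0 && static_cast<std::size_t>(needed) < dstSize;
}
```

In C++ code that is not bound to char arrays, simply using std::string avoids the problem altogether.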

How do i store an (int)string into an int array

int ascii[1000] = {0};
string *data = (string*)malloc ( 1000*sizeof( string));
char *text = (char*)malloc ( 1000 *sizeof( char));
cout << "Enter the first arrangement of data." << endl;
cin.getline(text, 1000);
char *token = strtok(text, " ");
while ( token != NULL )
{
if ( strlen(token) > 0)
{
cout << "The tokens are: " << token << endl;
data[Tcount++] = *token;
}
token = strtok(NULL, " ");
for(i=0; i < (Tcount); i++)
{
ascii[i] = (int)data[i]; // error here
}
I'm using this code to build a parser and I want to store the ASCII values of the tokens, which are stored in 'data', in an array named 'ascii'.
When I run the program I get the error message "error: assigning to 'int' from incompatible type 'string' (aka 'basic_string<char, char_traits<char>, allocator<char> >')".
Any help would be appreciated.

One thing before the main event here. Obviously you're allowed to use std::string so, let's get the data in a more civilized fashion.
std::vector<std::string> data;
std::string line;
std::getline(std::cin, line); //read a whole line
std::stringstream tokenizer(line); // stuff the line into an object that's
// really good at tokenizing
std::string token;
while (tokenizer >> token) // one by one push a word out of the tokenizer
{
data.push_back(token); //and stuff it into a vector
}
We now have all of the individual words on the line packed into a nice resizable container, a vector. No messy dynamic memory to clean up.
Step 2: turn those strings into ints. 'Fraid you can't do that, ace. You could take a string that represents a number and turn it into an int. That's easy. Dozens of ways to do it. I like strtol.
But the ascii values are character by character. A string is a variable number of characters. You can pack one into an int, shift the int over by the width of one character and stuff in another, but you're going to run out of space after probably 4 or 8 characters.
Let's go with that, shall we? And we'll do it the old way without an iterator.
std::string data;
int ascii = 0;
if (data.length() > 0)
{
ascii = static_cast<unsigned char>(data[0]); // start with the first character
for(size_t index = 1; index < data.length(); index++)
{
ascii <<= 8; //we're talking ascii here so no unicode bit counting games
ascii |= static_cast<unsigned char>(data[index]);
}
}
Done. Not very useful unless all the strings are pretty short, but done.
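If what you actually want is one ASCII value per character, it is simpler to keep a list of ints per token instead of squeezing everything into a single int. A minimal sketch, assuming an ASCII-compatible character set; the name charCodes is invented here:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: return the numeric code of every character in `token`.
std::vector<int> charCodes(const std::string& token)
{
    std::vector<int> codes;
    codes.reserve(token.size());
    for (unsigned char c : token) {   // unsigned char keeps the values in 0..255
        codes.push_back(c);
    }
    return codes;
}
```

Calling this for each entry of the data vector gives you a std::vector<std::vector<int> > with no fixed size limits.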
Instead if you're going to do a parser why not go full geek and try this:
typedef void handlerfunc();
std::map<std::string, handlerfunc*> parser; // store pointers: a map cannot hold function types
parser["do something"] = somethingfunc;
parser["do something else"] = somethingelsefunc;
Where somethingfunc is a function that looks like void somethingfunc() that, obviously, does something. Ditto somethingelsefunc. Only it does something else.
Usage could be as simple as:
parser[token]();
But it's not. Sigh.
It's more like
auto found = parser.find(token);
if (found != parser.end())
{
found->second();
return CMD_OK;
}
else
{
return CMD_NOT_FOUND;
}
But seriously, look into some of the fun stuff a good container can do for you. Save a ton of time.
I crapped out all of the code without a compiler. Please let me know if I borked any of it.
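For reference, here is a compilable sketch of the lookup-then-call idea. It uses std::function instead of a plain function pointer so that lambdas work too; the names Handler, CommandResult and dispatch are invented for this example.

```cpp
#include <functional>
#include <map>
#include <string>

enum CommandResult { CMD_OK, CMD_NOT_FOUND };

using Handler = std::function<void()>;

// Hypothetical helper: run the handler registered for `token`, if any.
CommandResult dispatch(const std::map<std::string, Handler>& parser,
                       const std::string& token)
{
    // find() does not modify the map, unlike operator[], which would
    // insert an empty handler for every unknown token.
    auto found = parser.find(token);
    if (found == parser.end()) {
        return CMD_NOT_FOUND;
    }
    found->second();
    return CMD_OK;
}
```

That side effect of operator[] is the reason parser[token]() alone is not enough: for an unknown token it would insert an empty handler and then try to call it.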

Traversing a Fatsa file in C/C++

I'm looking to write a program in C/C++ to traverse a Fasta file formatted like:
>ID and header information
SEQUENCE1
>ID and header information
SEQUENCE2
and so on
in order to find all unique sequences (check if subset of any other sequence)
and write unique sequences (and all headers) to an output file.
My approach was:
Copy all sequences to an array/list at the beginning (more efficient way to do this?)
Grab header, append it to output file, compare sequence for that header to everything in the list/array. If unique, write it under the header, if not delete it.
However, I'm a little unsure as to how to approach reading the lines in properly. I need to read the top line for the header, and then "return?" to the next line to read the sequence. Sometimes the sequence spans more than two lines, so would I use > (from the example above) as a delimiter? If I use C++, I imagine I'd use iostreams to accomplish this?
If anybody could give me a nudge in the right direction as to how I would want to read the information I need to manipulate/how to carry out the comparison, it'd be greatly appreciated.

First, rather than write your own FASTA reading routine you probably want to use something that already exists, for example, see: http://lh3lh3.users.sourceforge.net/parsefastq.shtml
Internally you'll have the sequences without newlines and that is probably helpful. I think the simplest approach from a high level is
loop over fasta and write out sequences to a file
sort that file
with the sorted file it becomes easier to pick out subsequences so write a program to find the "unique ids"
Using the unique id's go back to the original fasta and get whatever additional information you need.
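If you do want to read the file yourself in C++, the first step (getting each sequence without newlines) could be sketched with std::getline as below. The FastaRecord type and readFasta name are hypothetical, and the sketch assumes a well-formed file where every header line starts with '>'.

```cpp
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this out in memory
#include <string>
#include <vector>

struct FastaRecord {          // hypothetical type for this sketch
    std::string header;       // the '>' line, without the leading '>'
    std::string sequence;     // all sequence lines joined, no newlines
};

std::vector<FastaRecord> readFasta(std::istream& in)
{
    std::vector<FastaRecord> records;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r')
            line.pop_back();                    // tolerate Windows line endings
        if (line.empty())
            continue;                           // skip blank lines
        if (line[0] == '>')
            records.push_back({line.substr(1), ""});
        else if (!records.empty())
            records.back().sequence += line;    // multi-line sequences are concatenated
    }
    return records;
}
```

The same function works on a file by passing a std::ifstream.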

Your approach is usable. Below is an implementation of it.
However, I'm a little unsure as to how to approach reading the lines
in properly. ... Sometimes the sequence spans more than two lines, so would I use > (from the example above) as a delimiter?
That's right; in addition, there's just the EOF which has to be checked.
I wrote the function getd() for that, which reads a single-line description or concatenated lines of sequence data and returns a pointer to the string it allocated.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char *getd()
{
char *s = NULL, *eol;
size_t i = 0, size = 1; // 1 for null character
int c;
#define MAXLINE 80 // recommended max. line length; longer lines are okay
do // read single-line description or concatenated lines of sequence data
{
do // read a line (until '\n')
{
s = realloc(s, size += MAXLINE+1); // +1 for newline character
if (!s) puts("out of memory"), exit(1);
if (!fgets(s+i, MAXLINE+2, stdin)) break; // at most MAXLINE+1 chars, so we never write past the buffer
eol = strchr(s+i, '\n');
i += MAXLINE+1;
} while (!eol);
if (!i) { free(s); return NULL; } // nothing read
if (*s == '>') return s; // single-line description
i = eol-s;
ungetc(c = getchar(), stdin); // peek at next character
} while (c != '>' && c != EOF);
return s;
}
int main()
{
char *s;
struct seq { char *head, *data; } *seq = NULL;
int n = 0, i, j;
while (s = getd())
if (*s == '>')
{ // new sequence: make room, store header
seq = realloc(seq, ++n * sizeof *seq);
if (!seq) puts("out of memory"), exit(1);
seq[n-1] = (struct seq){ s, "" };
}
else
if (n) // store sequence data if at least one header present
seq[n-1].data = s;
for (i = 0; i < n; ++i)
{
const int max = 70; // reformat output data to that line length max.
printf("%s", seq[i].head);
for (s = seq[i].data, j = 0; j < n; ++j)
if (j != i) // compare sequence to the others, delete if not unique
if (strstr(seq[j].data, s)) { s = seq[i].data = ""; break; }
for (; strlen(s) > max && s[max] != '\n'; s += max)
printf("%.*s\n", max, s);
printf("%s", s);
}
}
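The comparison step on its own, written in C++ for sequences that are already stored without newlines, might look like the following sketch. markUnique is a made-up name, and keeping only the first of several identical sequences is an assumption about what "unique" should mean.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: keep[i] is true when sequences[i] is not contained
// in any other sequence and should therefore be written to the output.
std::vector<bool> markUnique(const std::vector<std::string>& sequences)
{
    std::vector<bool> keep(sequences.size(), true);
    for (std::size_t i = 0; i < sequences.size(); ++i) {
        for (std::size_t j = 0; j < sequences.size(); ++j) {
            if (i == j) continue;
            const bool contained =
                sequences[j].find(sequences[i]) != std::string::npos;
            const bool identical = sequences[j] == sequences[i];
            // Assumption: of several identical sequences, only the first survives.
            if (contained && (!identical || j < i)) {
                keep[i] = false;
                break;
            }
        }
    }
    return keep;
}
```

This is quadratic in the number of sequences, the same as the nested loop in the C program above.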

Trying to create a program to read a users input then break the array into seperate words are my pointers all valid? [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 11 years ago.
char **findwords(char *str);
int main()
{
int test;
char words[100]; //an array of chars to hold the string given by the user
char **word; //pointer to a list of words
int index = 0; //index of the current word we are printing
char c;
cout << "die monster !";
//a loop to place the charecters that the user put in into the array
do {
c = getchar();
words[index] = c;
} while (words[index] != '\n');
word = findwords(words);
while (word[index] != 0) //loop through the list of words until the end of the list
{
printf("%s\n", word[index]); // while the words are going through the list print them out
index ++; //move on to the next word
}
//free it from the list since it was dynamically allocated
free(word);
cin >> test;
return 0;
}
char **findwords(char *str)
{
int size = 20; //original size of the list
char *newword; //pointer to the new word from strok
int index = 0; //our current location in words
char **words = (char **)malloc(sizeof(char *) * (size +1)); //this is the actual list of words
/* Get the initial word, and pass in the original string we want strtok() *
* to work on. Here, we are seperating words based on spaces, commas, *
* periods, and dashes. IE, if they are found, a new word is created. */
newword = strtok(str, " ,.-");
while (newword != 0) //create a loop that goes through the string until it gets to the end
{
if (index == size)
{
//if the string is larger than the array increase the maximum size of the array
size += 10;
//resize the array
char **words = (char **)malloc(sizeof(char *) * (size +1));
}
//asign words to its proper value
words[index] = newword;
//get the next word in the string
newword = strtok(0, " ,.-");
//increment the index to get to the next word
++index;
}
words[index] = 0;
return words;
}
break the array into the individual words then print them out th

do {
c = getchar();
words[index] = c;
} while (words[index] != '\n');
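you should also add '\0' at the end of your string (after the loop) in the "words" array
You are not incrementing index, so this way you only ever store the last c (in words[0])
Note that in the printing loop below, comparing word[index] against 0 is correct, because the list of word pointers ends with a null pointer ('\0' would be the terminator when walking the characters of a single string)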
while (word[index] != 0) //loop through the list of words until the end of the list
{
printf("%s\n", word[index]); // while the words are going through the list print them out
index ++; //move on to the next word
}

I think there is a memory leak, because you first allocate
char **words = (char **)malloc(sizeof(char *) * (size +1)); //when declaring
when declaring the variable, and after that you allocate a second, unrelated **words in the loop body:
char **words = (char **)malloc(sizeof(char *) * (size +1)); // in the while loop
The line in the while loop with which you allocate the space to store the string should be (1)
//in the while loop should be
words[index] = (char *)malloc(sizeof(char) * (strlen(newword) + 1));
strcpy (words[index], newword);
Or simply (2)
words[index] = newword;
Because newword already points to a valid memory location (inside str), which you assign to the array of pointers.
In method (1) above you allocate a block of chars large enough for the token and copy the string into words[index] with strcpy. For this you have to reserve memory for words[index] first and then perform the copy. In that case freeing the memory is not as simple as free (word); instead each of the allocated blocks has to be freed individually.
for (index = 0; words[index] != 0; index++)
{
free (words[index]);
}
free (words);
Solution (2) is in my opinion not a good one, because you pass in a pointer to a string and store pointers into that same string. So both str and words[index] point into the same memory. If anybody frees str after the function returns (if it was dynamically allocated), the words[index] references become invalid.
EDIT:
Also, in the main function you need to use
fgets (words, sizeof words, stdin); or, in C++, std::getline; or simply increment the index counter in your code and assign a null character at the end to terminate the string.
You do not increment the index counter, so all the characters are assigned to the same location.
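For the resize step itself, the usual tool is realloc(), which keeps the pointers already stored while growing the list. Here is a sketch of findwords along those lines; it keeps the question's sizes and delimiters, stores pointers into str as in (2), and the early returns on allocation failure are an assumption about the desired error handling.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Sketch: returns a null-terminated list of pointers into `str`
// (which strtok modifies). The caller releases the list with free().
char** findwords(char* str)
{
    std::size_t size = 20;                  // current capacity of the list
    std::size_t index = 0;                  // next free slot
    char** words = static_cast<char**>(std::malloc(sizeof(char*) * (size + 1)));
    if (!words) return nullptr;

    for (char* newword = std::strtok(str, " ,.-");
         newword != nullptr;
         newword = std::strtok(nullptr, " ,.-"))
    {
        if (index == size) {
            size += 10;
            // realloc preserves the existing entries; assign to a temporary
            // so the old block is not lost if the call fails.
            char** bigger = static_cast<char**>(
                std::realloc(words, sizeof(char*) * (size + 1)));
            if (!bigger) { std::free(words); return nullptr; }
            words = bigger;
        }
        words[index++] = newword;
    }
    words[index] = nullptr;                 // terminator for the caller's loop
    return words;
}
```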

I think everybody is trying to do it the hard way.
The std streams already break the input into words using the >> operator. We just need to be more careful about how we define a word. To do this you just need to define a ctype facet that defines space correctly (for the context) and then imbue the stream with it.
#include <locale>
#include <string>
#include <sstream>
#include <iostream>
// This is my facet that will treat the ,.- as space characters and thus ignore them.
class WordSplitterFacet: public std::ctype<char>
{
public:
typedef std::ctype<char> base;
typedef base::char_type char_type;
WordSplitterFacet(std::locale const& l)
: base(table)
{
std::ctype<char> const& defaultCType = std::use_facet<std::ctype<char> >(l);
// Copy the default value from the provided locale
static char data[256];
for(int loop = 0;loop < 256;++loop) { data[loop] = loop;}
defaultCType.is(data, data+256, table);
// Modifications to default to include extra space types.
table[','] |= base::space;
table['.'] |= base::space;
table['-'] |= base::space;
}
private:
base::mask table[256];
};
Now the code looks very simple:
int main()
{
// Create the facet.
std::ctype<char>* wordSplitter(new WordSplitterFacet(std::locale()));
// Here I am using a string stream.
// But any stream can be used. Note you must imbue a stream before it is used.
// Otherwise the imbue() will silently fail.
std::stringstream teststr;
teststr.imbue(std::locale(std::locale(), wordSplitter));
// Now that it is imbued we can use it.
// If this was a file stream then you could open it here.
teststr << "This, stri,plop";
// Now use the stream normally
std::string word;
while(teststr >> word)
{
std::cout << "W(" << word << ")\n";
}
}
Testing:
> ./a.out
W(This)
W(stri)
W(plop)
With a correctly imbued stream we can use the old trick of copying from a stream into a vector:
std::copy(std::istream_iterator<std::string>(teststr),
std::istream_iterator<std::string>(),
std::back_inserter(data)
);

Lots of issues:
In your first loop you are forgetting to increment index after each read character.
Also, if you have more than 100 characters, your program will likely crash.
getchar returns an "int". Not a char. Very important - especially if your input is redirected or piped in.
Try this instead:
int tmp;
tmp = getchar();
while ((index < 99) && (tmp != EOF) && (tmp != '\n'))
{
words[index] = (char)tmp;
tmp = getchar();
index++;
}
words[index] = 0; /* make life easier - null terminate your string */
Your "findwords" function scares the hell out of me. You don't have enough points on S.O. for me to elaborate on the issues here. In any case

I'm tempted to open with some lame crack about the '80s calling and wanting their obsolete "C++ as a better C" code back, but I'll try to restrain myself and just give at least some idea of how you might consider doing something like this:
std::string line;
// read a line of input from the user:
std::getline(std::cin, line);
// break it up into words:
std::istringstream buffer(line);
std::vector<std::string> words((std::istream_iterator<std::string>(buffer)),
std::istream_iterator<std::string>());
// print out the words, one per line:
std::copy(words.begin(), words.end(),
std::ostream_iterator<std::string>(std::cout, "\n"));
