I just wrote a program that tokenizes a char array using pointers. The assignment only required it to work with a space as the delimiter character. I turned it in and got full credit, but afterwards I realized that it works only if the delimiter character is a space.
My question is, how could I make this program work with an arbitrary delimiter character?
The function I've shown you below returns a pointer to the next word in the char array. This is what I believe I need to change for it to work with any delimiter character.
Thanks!
Code:
char* StringTokenizer::Next(void) {
pNextWord = pStart;
if (*pStart == '\0') { return NULL; }
while (*pStart != delim) {
pStart++;
}
if (*pStart == '\0') { return NULL; }
*pStart = '\0';
pStart++;
return pNextWord;
}
The printing loop in main():
while ((nextWord = tk.Next()) != NULL) {
cout << nextWord << endl;
}
The simplest way is to change your
while (*pStart != delim)
to something like
while (*pStart != '\0' && *pStart != ' ' && *pStart != '\n' && *pStart != '\t')
Or, you could make delim a string, and create a function that checks if a char is in the string:
bool isDelim(char c, const char *delim) {
while (*delim) {
if (*delim == c)
return true;
delim++;
}
return false;
}
while ( *pStart != '\0' && !isDelim(*pStart, " \n\t") )
Or, perhaps the best solution is to use one of the prebuilt functions for doing all this, such as strtok.
Just change the line
while (*pStart != delim)
as follows:
while (*pStart != '\0' && strchr(" \t\n", *pStart) == NULL)
The standard strchr function (declared in the string.h header)
looks for a character (given in the second argument) in a C-string
(given in the first argument) and returns a pointer to the position
where that character occurs for the first time. Hence, the expression
strchr(" \t\n", *pStart) == NULL is true if the current character
(*pStart) cannot be found in the string " \t\n" and, therefore,
is not a delimiter. (Modify the delimiter string to adapt it to your
needs, of course.)
This approach provides a short and simple way to test whether a given
character belongs to a (small) set of characters of interest. And it
uses a standard function.
By the way, you can do this using not only a C-string, but with
a std::string, too. All you need is to declare a const std::string
with " \t\n"-like value and then replace the call to the strchr
function with the find method of the declared delimiter string.
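A minimal sketch of that std::string variant. The names kDelimiters and isDelimiter are made up for illustration and are not part of the original tokenizer.

```cpp
#include <string>

// Hypothetical delimiter set; adjust it to your needs.
const std::string kDelimiters = " \t\n";

// Returns true when c is one of the characters in kDelimiters.
// std::string::find returns npos when the character is absent.
bool isDelimiter(char c) {
    return kDelimiters.find(c) != std::string::npos;
}
```

The loop condition would then read while (*pStart != '\0' && !isDelimiter(*pStart)). The explicit '\0' test is still needed, because the terminator is not in the delimiter string.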
Hmm...this doesn't look quite right:
if (*pStart = '\0')
The condition can never be true. I'm guessing you intended == instead of =? You also have a bit of a problem here:
while (*pStart != delim)
If the last word in the string isn't followed by a delimiter, this is going to run off the end of the string, which will cause serious problems.
Edit: Unless you really need to do this on your own, consider using a stringstream for the job. It already has all the right mechanism in place and quite heavily tested. It does add overhead, but it's quite acceptable in a lot of cases.
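For reference, a rough sketch of the stringstream approach with a single arbitrary delimiter character. The function name splitWithStream is an assumption for illustration only.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits text on a single delimiter character using std::getline.
// Empty tokens between adjacent delimiters are kept.
std::vector<std::string> splitWithStream(const std::string& text, char delim) {
    std::vector<std::string> tokens;
    std::istringstream in(text);
    std::string token;
    while (std::getline(in, token, delim)) {
        tokens.push_back(token);
    }
    return tokens;
}
```

Unlike the pointer-based tokenizer, this copies each token and leaves the input untouched.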
Not compiled, but I'd do something like this.
const int N = 8; // number of entries in delimList
char delimList[N] = {' ', ',', '.', ';', '|', '!', '$', '\n'}; // all delims here.
char* StringTokenizer::Next(void)
{
if (*pStart == '\0') { return NULL; }
pNextWord = pStart;
while (1){
for (int x = 0; x < N; x++){
if (*pStart == delimList[x]){ //this is it.
*pStart = '\0';
pStart++;
return pNextWord;
}
}
if ('\0' == *pStart){ //last word.. maybe.
return pNextWord;
}
pStart++;
}
}
// (!compiled).
I assume that we want to stick to C instead of C++. The functions strspn and strcspn are good for tokenizing by a set of delimiters. You can use strcspn to find where the next separator begins (i.e. where the current token ends) and then use strspn to find where the separator ends (i.e. where the next token begins). Loop until you reach the end.
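A minimal sketch of such a loop. It assumes the tokens are copied into a std::vector rather than cut out in place, and the function name tokenize is made up for illustration. strcspn measures a leading run of characters that are not in the set, strspn a leading run of characters that are.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Collects the tokens of text that are separated by any run of
// characters from delims. The input is not modified.
std::vector<std::string> tokenize(const char* text, const char* delims) {
    std::vector<std::string> tokens;
    const char* p = text;
    while (true) {
        p += std::strspn(p, delims);                // skip a run of separators
        if (*p == '\0') break;                      // nothing left
        std::size_t len = std::strcspn(p, delims);  // length of the token
        tokens.push_back(std::string(p, len));
        p += len;                                   // at a separator or the end
    }
    return tokens;
}
```

Both calls are plain C library functions, so the same loop works in C if the tokens are handled some other way than with a vector.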
I came across a problem in which I need to load data from a text file, and then save it into an array of string type. My approach is to consider the array as a 2D array, but of char type.
This is my code:
string *rollno;
rollno=new string[2];
string line;
ifstream in("file.txt",ios::app);
int i=0;
char single;
in.get(single);
while (single != '.') {
for (int j=0; single!=',' || single!='.'; j++) {
rollno[i][j]=single;\\saving in array character wise
}
in.get(single);\\getting the next line
i++;
}
cout<<rollno[0]<<endl<<rollno[1];\\checking
Could anyone help me figure out what I'm doing wrong?
Hmm... Not entirely sure what you're trying to do, but from the looks of it, wouldn't this be what you're looking for?
(Assuming i is the array index, and j is for std::string.operator[], as appears to be the case.)
// ...
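while (in && single != '.') {
while (in && single != ',' && single != '.') {
rollno[i] += single; // appending to string.
in.get(single); // read the next character of this field
}
if (single == ',') {
in.get(single); // skip the comma before the next field
}
i++;
}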
You can append a char to a std::string with operator+=.
Also, correct me if I'm wrong, but (single != ',' || single != '.') will always evaluate to true; if single == ',', then single != '.', and vice versa. If you want it to stop processing once it encounters either delimiter, then you use a logical AND to make sure it stops if either check fails.
(Apologies for any typos or mistakes I may have missed.)
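Putting the two points together (appending with operator+= and stopping on either delimiter), a sketch might look like this. It assumes the input has the form "12,34." and the function name readFields is hypothetical.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads characters until '.' (or end of input), starting a new field
// at every ','. Assumes input shaped like "12,34."
std::vector<std::string> readFields(std::istream& in) {
    std::vector<std::string> fields(1);       // start with one empty field
    char c;
    while (in.get(c) && c != '.') {
        if (c == ',') {
            fields.push_back(std::string());  // begin the next field
        } else {
            fields.back() += c;               // append to the current field
        }
    }
    return fields;
}
```

Using a vector instead of a fixed new string[2] avoids writing past the end of the array when the file holds more fields than expected.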
I was just talking with a friend about what would be the most efficient way to check if a std::string has only spaces. He needs to do this on an embedded project he is working on and apparently this kind of optimization matters to him.
I've come up with the following code; it uses strtok().
bool has_only_spaces(std::string& str)
{
char* token = strtok(const_cast<char*>(str.c_str()), " ");
while (token != NULL)
{
if (*token != ' ')
{
return true;
}
}
return false;
}
I'm looking for feedback on this code and more efficient ways to perform this task are also welcome.
if(str.find_first_not_of(' ') != std::string::npos)
{
// There's a non-space.
}
In C++11, the all_of algorithm can be employed:
// Check if s consists only of whitespaces
bool whiteSpacesOnly = std::all_of(s.begin(),s.end(),isspace);
Why so much work, so much typing?
bool has_only_spaces(const std::string& str) {
return str.find_first_not_of (' ') == str.npos;
}
Wouldn't it be easier to do:
bool has_only_spaces(const std::string &str)
{
for (std::string::const_iterator it = str.begin(); it != str.end(); ++it)
{
if (*it != ' ') return false;
}
return true;
}
This has the advantage of returning early as soon as a non-space character is found, so it will be marginally more efficient than solutions that examine the whole string.
To check if string has only whitespace in c++11:
bool is_whitespace(const std::string& s) {
return std::all_of(s.begin(), s.end(), isspace);
}
in pre-c++11:
bool is_whitespace(const std::string& s) {
for (std::string::const_iterator it = s.begin(); it != s.end(); ++it) {
if (!isspace(*it)) {
return false;
}
}
return true;
}
Here's one that only uses STL (Requires C++11)
inline bool isBlank(const std::string& s)
{
return std::all_of(s.cbegin(),s.cend(),[](char c) { return std::isspace(static_cast<unsigned char>(c)); });
}
It relies on the fact that std::all_of also returns true when the string is empty (begin == end).
Here is a small test program: http://cpp.sh/2tx6
Using strtok like that is bad style! strtok modifies the buffer it tokenizes (it replaces the delimiter chars with \0).
Here's a non modifying version.
const char* p = str.c_str();
while(*p == ' ') ++p;
return *p == 0; // true only if the end was reached, i.e. nothing but spaces
It can be optimized even further, if you iterate through it in machine word chunks. To be portable, you would also have to take alignment into consideration.
I do not approve of you const_casting above and using strtok.
A std::string can contain embedded nulls but let's assume it will be all ASCII 32 characters before you hit the NULL terminator.
One way you can approach this is with a simple loop, and I will assume const char *.
bool all_spaces( const char * v )
{
for ( ; *v; ++v )
{
if( *v != ' ' )
return false;
}
return true;
}
For larger strings, you can check word-at-a-time until you reach the last word, and then assume the 32-bit word (say) will be 0x20202020 which may be faster.
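A sketch of the word-at-a-time idea, assuming 32-bit chunks and an explicit length instead of a null terminator (the name allSpaces is made up here). Whether it actually beats the plain loop depends on the compiler and the data, so measure before relying on it.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Returns true when the first n bytes of data are all ' ' (0x20).
// Compares four bytes at a time, then finishes byte by byte.
bool allSpaces(const char* data, std::size_t n) {
    const std::uint32_t fourSpaces = 0x20202020u;  // same in any byte order
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        std::uint32_t word;
        std::memcpy(&word, data + i, sizeof word);  // avoids unaligned reads
        if (word != fourSpaces) return false;
    }
    for (; i < n; ++i) {                            // leftover tail bytes
        if (data[i] != ' ') return false;
    }
    return true;
}
```

For a std::string this would be called as allSpaces(str.data(), str.size()).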
Something like:
return std::find_if(
str.begin(), str.end(),
std::bind2nd( std::not_equal_to<char>(), ' ' ) )
== str.end();
If you're interested in white space, and not just the space character,
then the best thing to do is to define a predicate, and use it:
struct IsNotSpace
{
bool operator()( char ch ) const
{
return ! ::isspace( static_cast<unsigned char>( ch ) );
}
};
If you're doing any text processing at all, a collection of such simple
predicates will be invaluable (and they're easy to generate
automatically from the list of functions in <ctype.h>).
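As a sketch of how such a predicate is used, here it is combined with std::find_if. The function name isAllWhitespace is an assumption for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Predicate: true for any character that is not white space.
// The cast avoids undefined behaviour for negative char values.
struct IsNotSpace {
    bool operator()(char ch) const {
        return !std::isspace(static_cast<unsigned char>(ch));
    }
};

// True when s is empty or contains only white space.
bool isAllWhitespace(const std::string& s) {
    return std::find_if(s.begin(), s.end(), IsNotSpace()) == s.end();
}
```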
It's highly unlikely you'll beat a compiler-optimized naive algorithm for this, e.g.
string::iterator it(str.begin()), end(str.end());
for(; it != end && *it == ' '; ++it);
return it == end;
EDIT: Actually - there is a quicker way (depending on size of string and memory available)..
std::string ns(str.size(), ' ');
return ns == str;
EDIT: actually above is not quick.. it's daft... stick with the naive implementation, the optimizer will be all over that...
EDIT AGAIN: dammit, I guess it's better to look at the functions in std::string
return str.find_first_not_of(' ') == string::npos;
I had a similar problem in a programming assignment, and here is another solution I came up with after reviewing the others. Here I simply build a new sentence without the extra spaces: a run of consecutive whitespace is collapsed to a single character.
string sentence;
string newsent; //reconstruct new sentence
string dbl = " ";
getline(cin, sentence);
int len = sentence.length();
for(int i = 0; i < len; i++){
//if there are multiple whitespaces, this loop will iterate until there are none, then go back one.
if (isspace(sentence[i]) && isspace(sentence[i+1])) {
do {
i++;
} while (isspace(sentence[i]));
i--; // here, you have to dial back one to maintain at least one space.
}
newsent +=sentence[i];
}
cout << newsent << "\n";
Hm...I'd do this:
for (auto i = str.begin(); i != str.end(); ++i)
if (!isspace(static_cast<unsigned char>(*i)))
return false;
return true;
Pseudo-code, isspace is located in cctype for C++.
Edit: Thanks to James for pointing out that isspace has undefined behavior on signed chars.
If you are using CString, you can do
CString myString = " "; // All whitespace
if(myString.Trim().IsEmpty())
{
// string is all whitespace
}
This has the benefit of trimming all newline, space and tab characters.
I've been practicing C++ for a competition next week. The sample problem I've been working on requires splitting paragraphs into words. Of course, that's easy. But this problem is so weird that words like isn't should be separated as well: isn and t. I know it's weird, but I have to follow this.
I have a function split() that takes a constant char delimiter as one of the parameter. It's what I use to separate words from spaces. But I can't figure out this one. Even numbers like: phil67bs should be separated as phil and bs.
And no, I don't ask for full code. A pseudocode will do, or something that will help me understand what to do. Thanks!
PS: Please no recommendations for external libs. Just the STL. :)
Filter out numbers, spaces and anything else that isn't a letter by using a proper locale. See this SO thread about treating everything but numbers as a whitespace. So use a mask and do something similar to what Jerry Coffin suggests but only for letters:
struct alphabet_only: std::ctype<char>
{
alphabet_only(): std::ctype<char>(get_table()) {}
static std::ctype_base::mask const* get_table()
{
static std::vector<std::ctype_base::mask>
rc(std::ctype<char>::table_size,std::ctype_base::space);
std::fill(&rc['A'], &rc['['], std::ctype_base::upper);
std::fill(&rc['a'], &rc['{'], std::ctype_base::lower);
return &rc[0];
}
};
And, boom! You're golden.
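For completeness, a sketch of how a facet like this is typically installed on a stream. The facet is repeated so the example stands on its own, and the helper name alphaWords is made up for illustration.

```cpp
#include <algorithm>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Same idea as the facet above: every character counts as white space
// except the letters A-Z and a-z.
struct alphabet_only : std::ctype<char> {
    alphabet_only() : std::ctype<char>(get_table()) {}
    static std::ctype_base::mask const* get_table() {
        static std::vector<std::ctype_base::mask>
            rc(std::ctype<char>::table_size, std::ctype_base::space);
        std::fill(&rc['A'], &rc['['], std::ctype_base::upper);
        std::fill(&rc['a'], &rc['{'], std::ctype_base::lower);
        return &rc[0];
    }
};

// Hypothetical helper: extracts the alphabetic words from text.
std::vector<std::string> alphaWords(const std::string& text) {
    std::istringstream in(text);
    // The locale takes ownership of the facet and deletes it later.
    in.imbue(std::locale(std::locale(), new alphabet_only));
    std::vector<std::string> words;
    std::string word;
    while (in >> word) words.push_back(word);  // >> now skips non-letters
    return words;
}
```

With the facet in place, operator>> treats digits, apostrophes and punctuation as separators, so isn't yields isn and t.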
Or... you could just do a transform:
char changeToLetters(char input){ return isalpha(static_cast<unsigned char>(input)) ? input : ' '; }
vector<char> output;
output.reserve( myVector.size() );
transform( myVector.begin(), myVector.end(), back_inserter(output), changeToLetters );
Which, um, is much easier to grok, just not as efficient as Jerry's idea.
Edit:
Changed 'Z' to '[' so that the value 'Z' is filled. Likewise with 'z' to '{'.
This sounds like a perfect job for the find_first_of function which finds the first occurrence of a set of characters. You can use this to look for arbitrary stop characters and generate words from the spaces between such stop characters.
Roughly:
size_t previous = 0;
for (; ;) {
size_t next = str.find_first_of(" '1234567890", previous);
// Do processing
if (next == string::npos)
break;
previous = next + 1;
};
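Filling in the "Do processing" step, a possible sketch that collects the pieces between stop characters. The name splitAtAny is hypothetical, and empty pieces (from adjacent stop characters) are skipped here by choice.

```cpp
#include <string>
#include <vector>

// Splits str at any character found in stops, skipping empty pieces.
std::vector<std::string> splitAtAny(const std::string& str,
                                    const std::string& stops) {
    std::vector<std::string> words;
    std::string::size_type previous = 0;
    for (;;) {
        std::string::size_type next = str.find_first_of(stops, previous);
        // substr clamps the count, so passing npos - previous is safe.
        std::string piece = str.substr(previous, next - previous);
        if (!piece.empty()) words.push_back(piece);
        if (next == std::string::npos) break;
        previous = next + 1;
    }
    return words;
}
```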
Just change your function to delimit on anything that isn't an alphabetic character. Is there anything in particular that you are having trouble with?
Break down the problem: First, write a function that gets the first "word" from the sentence. This is easy; just look for the first non-alphabetic character. The next step is to remove all leading non-alphabetic characters from the remaining string. From there, just repeat.
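Those two steps could be sketched with the standard algorithms as follows. This needs C++11 for std::find_if_not, and the names isLetter and alphabeticWords are made up for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper so std::isalpha is never called with a negative char.
bool isLetter(char c) {
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}

std::vector<std::string> alphabeticWords(const std::string& text) {
    std::vector<std::string> words;
    std::string::const_iterator it = text.begin();
    while (true) {
        // Drop leading non-alphabetic characters.
        it = std::find_if(it, text.end(), isLetter);
        if (it == text.end()) break;
        // The word runs up to the first non-alphabetic character.
        std::string::const_iterator stop =
            std::find_if_not(it, text.end(), isLetter);
        words.push_back(std::string(it, stop));
        it = stop;  // repeat on the remainder
    }
    return words;
}
```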
You can do something like this:
vector<string> split(const string& str)
{
vector<string> splits;
string cur;
for(int i = 0; i < str.size(); ++i)
{
if(str[i] >= '0' && str[i] <= '9')
{
if(!cur.empty())
{
splits.push_back(cur);
}
cur="";
}
else
{
cur += str[i];
}
}
if(! cur.empty())
{
splits.push_back(cur);
}
return splits;
}
Let's assume that the input is in a std::string (use std::getline(cin, line), for example, to read a full line from cin).
std::vector<std::string> split(std::string const& input)
{
std::string::const_iterator it(input.begin()), end(input.end());
std::string current;
std::vector<std::string> words;
for(; it != end; ++it)
{
if (isalpha(static_cast<unsigned char>(*it)))
{
current.push_back(*it); // add this char to the current word
}
else if (!current.empty())
{
// push the current word in to the result list
words.push_back(current);
current.clear(); // next word
}
}
if (!current.empty())
{
words.push_back(current); // don't drop the last word
}
return words;
}
I've not tested it, but I guess it ought to work...
What's the most efficient way of removing a 'newline' from a std::string?
#include <algorithm>
#include <string>
std::string str;
str.erase(std::remove(str.begin(), str.end(), '\n'), str.cend());
The behavior of std::remove may not quite be what you'd expect.
A call to remove is typically followed by a call to a container's erase method, which erases the unspecified values and reduces the physical size of the container to match its new logical size.
See an explanation of it here.
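To make the two-step behavior concrete, here is a small sketch. The function name stripNewlines is an assumption for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Removes every '\n' from s and returns how many were removed.
std::size_t stripNewlines(std::string& s) {
    // std::remove only shifts the kept characters forward; the string
    // still has its old length until erase trims the tail.
    std::string::iterator newEnd = std::remove(s.begin(), s.end(), '\n');
    std::size_t removed = static_cast<std::size_t>(s.end() - newEnd);
    s.erase(newEnd, s.end());
    return removed;
}
```

The characters between newEnd and the old end have unspecified values, which is why the erase call is needed.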
If the newline is expected to be at the end of the string, then:
if (!s.empty() && s[s.length()-1] == '\n') {
s.erase(s.length()-1);
}
If the string can contain many newlines anywhere in the string:
std::string::size_type i = 0;
while (i < s.length()) {
i = s.find('\n', i);
if (i == std::string::npos) {
break;
}
s.erase(i, 1); // erase only the newline, not the rest of the string
}
You should use the erase-remove idiom, looking for '\n'. This will work for any standard sequence container; not just string.
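A sketch of the idiom written generically, assuming a container that offers erase(first, last) such as std::string, std::vector or std::deque. The name eraseAll is made up for illustration.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Erase-remove for any sequence container with erase(first, last).
template <typename Container, typename T>
void eraseAll(Container& c, const T& value) {
    c.erase(std::remove(c.begin(), c.end(), value), c.end());
}
```

For a string, eraseAll(s, '\n') removes every newline. std::list has its own remove member function, which is usually the better choice there.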
Here is one for DOS or Unix new line:
void chomp( string &s)
{
string::size_type pos;
if((pos=s.find('\n')) != string::npos)
s.erase(pos);
}
Slight modification of edW's solution to remove all existing endline chars
void chomp(string &s){
size_t pos;
while (((pos=s.find('\n')) != string::npos))
s.erase(pos,1);
}
Note that pos is declared as size_t because find returns an unsigned size type and string::npos is that type's largest value (written as -1 converted to unsigned). If pos has a narrower type such as unsigned int, the value is truncated on platforms where size_t is wider, so the comparison with string::npos can be false even though nothing was found.
s.erase(std::remove(s.begin(), s.end(), '\n'), s.end());
The code removes all newlines from the string s.
O(N) implementation best served without comments on SO and with comments in production.
unsigned shift=0;
for (unsigned i=0; i<str.length(); ++i){
if (str[i] == '\n') {
++shift;
}else{
str[i-shift] = str[i];
}
}
str.resize(str.length() - shift);
std::string some_str = SOME_VAL;
if ( some_str.size() > 0 && some_str[some_str.length()-1] == '\n' )
some_str.resize( some_str.length()-1 );
or (removes several newlines at the end)
some_str.resize( some_str.find_last_not_of("\n")+1 );
Another way to do it in the for loop
void rm_nl(string &s) {
for (int p = s.find("\n"); p != (int) string::npos; p = s.find("\n"))
s.erase(p,1);
}
Usage:
string data = "\naaa\nbbb\nccc\nddd\n";
rm_nl(data);
cout << data; // data = aaabbbcccddd
All these answers seem a bit heavy to me.
If you just flat out remove the '\n' and move everything else back a spot, you are liable to have some characters slammed together in a weird-looking way. So why not just do the simple (and most efficient) thing: Replace all '\n's with spaces?
for (int i = 0; i < str.length();i++) {
if (str[i] == '\n') {
str[i] = ' ';
}
}
There may be ways to improve the speed of this at the edges, but it will be way quicker than moving whole chunks of the string around in memory.
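The same replacement can also be written with std::replace from the standard library. A minimal sketch, with a function name chosen here for illustration:

```cpp
#include <algorithm>
#include <string>

// Replaces every newline in str with a space, in place.
void newlinesToSpaces(std::string& str) {
    std::replace(str.begin(), str.end(), '\n', ' ');
}
```

The string keeps its length, so nothing has to be shifted in memory.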
If it's anywhere in the string, then you can't do better than O(n).
The only way is to search for '\n' in the string and erase it.
for(int i=0;i<s.length();i++) if(s[i]=='\n') s.erase(s.begin()+i);
If there are several newlines, then:
int n=0;
for(int i=0;i<s.length();i++){
if(s[i]=='\n'){
n++;//we increase the number of newlines we have found so far
}else{
s[i-n]=s[i];
}
}
s.resize(s.length()-n);//drop the last n elements, which are now leftovers, in one go
It erases all the newlines in a single pass.
About answer 3 removing only the last \n off string code :
if (!s.empty() && s[s.length()-1] == '\n') {
s.erase(s.length()-1);
}
Will the if condition not fail if the string is really empty?
Is it not better to do:
if (!s.empty())
{
if (s[s.length()-1] == '\n')
s.erase(s.length()-1);
}
To extend @Greg Hewgill's answer for C++11:
If you just need to delete a newline at the very end of the string:
This in C++98:
if (!s.empty() && s[s.length()-1] == '\n') {
s.erase(s.length()-1);
}
...can now be done like this in C++11:
if (!s.empty() && s.back() == '\n') {
s.pop_back();
}
Optionally, wrap it up in a function. Note that I pass it by ptr here simply so that when you take its address as you pass it to the function, it reminds you that the string will be modified in place inside the function.
void remove_trailing_newline(std::string* str)
{
if (str->empty())
{
return;
}
if (str->back() == '\n')
{
str->pop_back();
}
}
// usage
std::string str = "some string\n";
remove_trailing_newline(&str);
What's the most efficient way of removing a 'newline' from a std::string?
As far as the most efficient way goes--that I'd have to speed test/profile and see. I'll see if I can get back to you on that and run some speed tests between the top two answers here, and a C-style way like I did here: Removing elements from array in C. I'll use my nanos() timestamp function for speed testing.
Other References:
See these "new" C++11 functions in this reference wiki here: https://en.cppreference.com/w/cpp/string/basic_string
https://en.cppreference.com/w/cpp/string/basic_string/empty
https://en.cppreference.com/w/cpp/string/basic_string/back
https://en.cppreference.com/w/cpp/string/basic_string/pop_back
I've written a simple string tokenizing program using pointers for a recent school project. However, I'm having trouble with my StringTokenizer::Next() method, which, when called, is supposed to return a pointer to the first letter of the next word in the char array. I get no compile-time errors, but I get a runtime error which states:
Unhandled exception at 0x012c240f in Project 5.exe: 0xC0000005: Access violation reading location 0x002b0000.
The program currently tokenizes the char array, but then stops and this error pops up. I have a feeling it has to do with the NULL checking I'm doing in my Next() method.
So how can I fix this?
Also, if you notice anything I could do more efficiently or with better practice, please let me know.
Thanks!!
StringTokenizer.h:
#pragma once
class StringTokenizer
{
public:
StringTokenizer(void);
StringTokenizer(char* const, char);
char* Next(void);
~StringTokenizer(void);
private:
char* pStart;
char* pNextWord;
char delim;
};
StringTokenizer.cpp:
#include "stringtokenizer.h"
#include <iostream>
using namespace std;
StringTokenizer::StringTokenizer(void)
{
pStart = NULL;
pNextWord = NULL;
delim = 'n';
}
StringTokenizer::StringTokenizer(char* const pArray, char d)
{
pStart = pArray;
delim = d;
}
char* StringTokenizer::Next(void)
{
pNextWord = pStart;
if (pStart == NULL) { return NULL; }
while (*pStart != delim) // access violation error here
{
pStart++;
}
if (pStart == NULL) { return NULL; }
*pStart = '\0'; // sometimes the access violation error occurs here
pStart++;
return pNextWord;
}
StringTokenizer::~StringTokenizer(void)
{
delete pStart;
delete pNextWord;
}
Main.cpp:
// The PrintHeader function prints out my
// student info in header form
// Parameters - none
// Pre-conditions - none
// Post-conditions - none
// Returns - void
void PrintHeader();
int main ( )
{
const int CHAR_ARRAY_CAPACITY = 128;
const int CHAR_ARRAY_CAPCITY_MINUS_ONE = 127;
// create a place to hold the user's input
// and a char pointer to use with the next( ) function
char words[CHAR_ARRAY_CAPACITY];
char* nextWord;
PrintHeader();
cout << "\nString Tokenizer Project";
cout << "\nyour name\n\n";
cout << "Enter in a short string of words:";
cin.getline ( words, CHAR_ARRAY_CAPCITY_MINUS_ONE );
// create a tokenizer object, pass in the char array
// and a space character for the delimiter
StringTokenizer tk( words, ' ' );
// this loop will display the tokens
while ( ( nextWord = tk.Next ( ) ) != NULL )
{
cout << nextWord << endl;
}
system("PAUSE");
return 0;
}
EDIT:
Okay, I've got the program working fine now, as long as the delimiter is a space. But if I pass it a `/' as a delim, it comes up with the access violation error again. Any ideas?
Function that works with spaces:
char* StringTokenizer::Next(void)
{
pNextWord = pStart;
if (*pStart == '\0') { return NULL; }
while (*pStart != delim)
{
pStart++;
}
if (*pStart = '\0') { return NULL; }
*pStart = '\0';
pStart++;
return pNextWord;
}
An access violation (or "segmentation fault" on some OSes) means you've attempted to read or write to a position in memory that you never allocated.
Consider the while loop in Next():
while (*pStart != delim) // access violation error here
{
pStart++;
}
Let's say the string is "blah\0". Note that I've included the terminating null. Now, ask yourself: how does that loop know to stop when it reaches the end of the string?
More importantly: what happens with *pStart if the loop fails to stop at the end of the string?
This answer is provided based on the edited question and various comments/observations in other answers...
First, what are the possible states for pStart when Next() is called?
1. pStart is NULL (default constructor or otherwise set to NULL)
2. *pStart is '\0' (empty string at end of string)
3. *pStart is delim (empty string at an adjacent delimiter)
4. *pStart is anything else (non-empty-string token)
At this point we only need to worry about the first option. Therefore, I would use the original "if" check here:
if (pStart == NULL) { return NULL; }
Why don't we need to worry about cases 2 or 3 yet? You probably want to treat adjacent delimiters as having an empty-string token between them, including at the start and end of the string. (If not, adjust to taste.) The while loop will handle that for us, provided you also add the '\0' check (needed regardless):
while (*pStart != delim && *pStart != '\0')
After the while loop is where you need to be careful. What are the possible states now?
1. *pStart is '\0' (token ends at end of string)
2. *pStart is delim (token ends at next delimiter)
Note that pStart itself cannot be NULL here.
You need to return pNextWord (current token) for both of these conditions so you don't drop the last token (i.e., when *pStart is '\0'). The code handles case 2 correctly but not case 1 (original code dangerously incremented pStart past '\0', the new code returned NULL). In addition, it is important to reset pStart for case 1 correctly, such that the next call to Next() returns NULL. I'll leave the exact code as an exercise to reader, since it is homework after all ;)
It's a good exercise to outline the possible states of data throughout a function in order to determine the correct action for each state, similar to formally defining base cases vs. recursive cases for recursive functions.
Finally, I noticed you have delete calls on both pStart and pNextWord in your destructor. First, to delete arrays, you need to use delete [] ptr; (i.e., array delete). Second, you wouldn't delete both pStart and pNextWord because pNextWord points into the pStart array. Third, by the end, pStart no longer points to the start of the memory, so you would need a separate member to store the original start for the delete [] call. Lastly, these arrays are allocated on the stack and not the heap (i.e., using char var[], not char* var = new char[]), and therefore they shouldn't be deleted. Therefore, you should simply use an empty destructor.
Another useful tip is to count the number of new and delete calls; there should be the same number of each. In this case, you have zero new calls, and two delete calls, indicating a serious issue. If it was the opposite, it would indicate a memory leak.
Inside ::Next you need to check for the delim character, but you also need to check for the end of the buffer, (which I'm guessing is indicated by a \0).
while (*pStart != '\0' && *pStart != delim) // access violation error here
{
pStart++;
}
And I think that these tests in ::Next
if (pStart == NULL) { return NULL; }
Should be this instead.
if (*pStart == '\0') { return NULL; }
That is, you should be checking for a NUL character, not a null pointer. It's not clear whether you intend for these tests to detect an uninitialized pStart pointer, or the end of the buffer.
An access violation usually means a bad pointer.
In this case, the most likely cause is running out of string before you find your delimiter.