Malformed output when converting string to char* in C++ - c++

I've got a function that splits up a string into various sections and then parses them, but when converting a string to char* I get a malformed output.
int parseJob(char * buffer)
{ // Parse raw data, should return individual jobs
const char* p;
int rows = 0;
for (p = strtok( buffer, "~" ); p; p = strtok( NULL, "~" )) {
string jobR(p);
char* job = &jobR[0];
parseJobParameters(job); // At this point, the data is still in good condition
}
return (1);
}
int parseJobParameters(char * buffer)
{ // Parse raw data, should return individual job parameters
const char* p;
int rows = 0;
for (p = strtok( buffer, "|" ); p; p = strtok( NULL, "|" )) { cout<<p; } // At this point, the data is malformed.
return (1);
}
I don't know what happens between the first function calling the second one, but it malforms the data.
As you can see from the code example given, the same method to convert string to char* is used and it works fine.
I'm using Visual Studio 2012/C++, any guidance and code examples will be greatly appreciated.

The "physical" reason your code does not work has nothing to do with std::string or C++. It wouldn't work in pure C as well. strtok is a function that stores its intermediate parsing state in some global variable. This immediately means that you cannot use strtok to parse more than one string at a time. Starting the second parse session before finishing the first would override the internal data stored by the first parse session, thus ruining it beyond repair. In other words, strtok parse sessions must not overlap. In your code they do overlap.
Also, in C++03 the idea of using std::string with strtok directly is doomed from the start. The internal sequence stored in std::string is not guaranteed to be null-terminated. This means that generally &jobR[0] is not a C-string. It can't be used with strtok. To convert a std::string to a C-string you have to use c_str(). But C-string returned by c_str() is non-modifiable.
In C++11 the null-termination is supposed to be visible through the [] operator, but still there seems to be no requirement to store the terminator object contiguously with the actual string, so &jobR[0] is still not a C-string even in C++11. C-string returned by c_str() or data() is non-modifiable.

You cannot use strtok() to parse multiple strings at the same time, like you are doing. The first call to parseJobParameters() in the first loop iteration of parseJob() will alter the internal buffer that strtok() points to, thus the second loop iteration of parseJob() will not be processing the original data anymore. You need to rewrite your code to not use nested calls to strtok() anymore, eg:
#include <vector>
#include <string>
void split(std::string s, const char *delims, std::vector &vec)
{
// alternatively, use s.find_first_of() and s.substr() instead...
for (const char* p = strtok(s.c_str(), delims); p != NULL; p = strtok(NULL, delims))
{
vec.push_back(p);
}
}
int parseJob(char * buffer)
{
std::vector<std::string> jobs;
split(buffer, "~", jobs);
for (std::vector<std::string>::iterator i = jobs.begin(); i != jobs.end(); ++i)
{
parseJobParameters(i->c_str());
}
return (1);
}
int parseJobParameters(char * buffer)
{
std::vector<std::string> params;
split(buffer, "|", params);
for (std::vector<std::string>::iterator i = params.begin(); i != params.end(); ++i)
{
std::cout << *i;
}
return (1);
}

Whilst this will give you the address of the first character in the string char* job = &jobR[0];, it does not give you a valid C-style string. YOu SHOULD use char* job = jobR.c_str();
I'm fairly sure that will solve your problem, but there could of course be something wrong with the way you read the buffer that is passed to parseJob in as well.
Edit: of course, you are also calling strtok from a function that uses strtok. Inside strtok looks a bit like this:
char *strtok(char *str, char *separators)
{
static char *last;
char *found = NULL;
if (!str) str = last;
... do searching for needle, set found to beginning of non-separators ...
if (found)
{
*str = 0; // mark end of string.
}
last = str;
return found;
}
Since "last" gets overwritten when you call parseParameters, you can't use strtok(NULL, ... ) when you get back to parseJobs

Related

Incrementing pointers for *char in a while loop

Here is what I have:
char* input = new char [input_max]
char* inputPtr = iput;
I want to use the inputPtr to traverse the input array. However I am not sure what will correctly check whether or not I have reached the end of the string:
while (*inputPtr++)
{
// Some code
}
or
while (*inputPtr != '\0')
{
inputPtr++;
// Some code
}
or a more elegant option?
Assuming input string is null-terminated:
for(char *inputPtr = input; *inputPtr; ++inputPtr)
{
// some code
}
Keep in mind that the example you posted may not give the results you want. In your while loop condition, you're always performing a post-increment. When you're inside the loop, you've already passed the first character. Take this example:
#include <iostream>
using namespace std;
int main()
{
const char *str = "apple\0";
const char *it = str;
while(*it++)
{
cout << *it << '_';
}
}
This outputs:
p_p_l_e__
Notice the missing first character and the extra _ underscore at the end. Check out this related question if you're confused about pre-increment and post-increment operators.
I would do:
inputPtr = input; // init inputPtr always at the last moment.
while (*inputPtr != '\0') { // Assume the string last with \0
// some code
inputPtr++; // After "some code" (instead of what you wrote).
}
Which is equivalent to the for-loop suggested by greatwolf. It's a personal choice.
Be careful, with both of your examples, you are testing the current position and then you increment. Therefore, you are using the next character!
Assuming input isn't null terminated:
char* input = new char [input_max];
for (char* inputPtr = input; inputPtr < input + input_max;
inputPtr++) {
inputPtr[0]++;
}
for the null terminated case:
for (char* inputPtr = input; inputPtr[0]; inputPtr++) {
inputPtr[0]++;
}
but generally this is as good as you can get. Using std::vector, or std::string may enable cleaner and more elegant options though.

How can I tokenize a c-style string from a file without making a copy?

Let's say I have a constant c-style string say
const char* msg = "fred,jim,345,7665";
I'd like to tokenize this and read out the individual fields but for performance reasons I don't want to make a copy. How can I do this?
Obviously strtok takes a non-constant pointer and boost::tokenizer is an option but I am unsure what is doing behind the scenes.
Inevitably you will require some copy of the string, even if it is a substring being copied.
If you have a strtok_r function, you can use that, but it will still require a mutable string to do its work. Beware, however, as not all systems provide the function (e.g. Windows), which is why I've provided an implementation here. It works by requiring an additional parameter: a pointer to a C string to save the address of the next match. This allows for it to be more reentrant (thread-safe) in theory. However, you'll still be mutating the value. You could modify it to suit your needs if you like, perhaps copying N bytes into a destination buffer and null-terminating that buffer to avoid the need to modify the source string.
/*
Usage:
char *tok;
char *savep;
tok = mystrtok_r (somestr, ",", &savep);
while (NULL != tok)
{
/* Do something with `tok'. */
tok = mystrtok_r (NULL, ",", &savep);
}
*/
char *
mystrtok_r (char *str, const char *delims, char **nextp)
{
if (str == NULL)
str = *nextp;
str += strspn (str, delims);
*nextp = str + strcspn (str, delims);
**nextp = 0;
if (*str == 0)
return NULL;
++*nextp;
return str;
}
It depends on how you're going to use it.
If you want to get the next token, and then the next (like an iteration over the string, then you only really need to copy the current token into memory.
long strtok2( char *strDest, const char *strSrc, const char cTok, long lOffset, long lMax)
{
if(lMax > 0)
{
strSrc += lOffset;
char * start = strDest;
while(--lMax && *strSrc != cTok && (*strDest++ = * strSrc++) );
*strDest = 0; //for when the token was found, not the null.
return strDest - start - 1; //the length of the token
}
return 0;
}
I snagged a simple strcpy from http://vijayinterviewquestions.blogspot.com.au/2007/07/implement-strcpy-function.html
const char* msg = "fred,jim,345,7665";
char * buffer[20];
long offset = 0;
while(length = strtok2(buffer, msg, ',', offset, 20))
{
cout << buffer;
offset += (length+1);
}
Well, without a little more detail it's hard to know exactly what you want. I'll guess you are parsing delimited items where consecutive delimiters should be treated as zero length tokens (which is usually correct for comma separated elements). I'm also assuming a blank line counts as a single zero length token. This is how I'd approach it:
const char *token_begin = msg;
int length;
for(;;)
{
length = 0;
while(!isDelimiter(token_begin[length])) //< must include \0 as delimiter
++length;
//..do something here with token. token is at: token_begin[0..length)
if ( token_begin[length] != 0 )
token_begin = &token_begin[length+1]; //skip beyond non-null delimiter
else
break; //token null terminated. exit
}
If you are going to store the tokens somewhere then a copy will be necessary in any case and strtok does this nicely by using the string a placing null terminating character inside it.
The only other option I see to avoid copying it is a lexer which reads the string and through a state machine produces tokens by scanning the string and storing the partial results in a buffer but every token should in any case be stored at least in a null terminated string to you are not really saving anything.
Here is my proposal, my code is structured and use a global variable pos(I know global variable are a bad practice but is only to give you the idea), you can replace it with a data member if you need OOP.
int position, messageLength;
char token[MAX]; // MAX = Value greater than the maximum length
// of the tokens(e.g. 1,000);
bool hasNext()
{
return position < messageLength;
}
char* next(const char* message)
{
int i = 0;
while (position < messageLength && message[position] != ',') {
token[i++] = message[position];
position++;
}
position++; // ',' found
token[i] = '\0';
return token;
}
int main(int argc, char **argv)
{
const char* msg = "fred,jim,345,7665";
position = 0;
messageLength = strlen(msg);
while (hasNext())
cout << next(msg) << endl;
return EXIT_SUCCESS;
}

Substring of char* to std::string

I have an array of chars and I need to extract subsets of this array and store them in std::strings. I am trying to split the array into lines, based on finding the \n character. What is the best way to approach this?
int size = 4096;
char* buffer = new char[size];
// ...Array gets filled
std::string line;
// Find the chars up to the next newline, and store them in "line"
ProcessLine(line);
Probably need some kind of interface like this:
std::string line = GetSubstring(char* src, int begin, int end);
I'd create the std::string as the first step, as splitting the result will be far easier.
int size = 4096;
char* buffer = new char[size];
// ... Array gets filled
// make sure it's null-terminated
std::string lines(buffer);
// Tokenize on '\n' and process individually
std::istringstream split(lines);
for (std::string line; std::getline(split, line, '\n'); ) {
ProcessLine(line);
}
You can use the std::string(const char *s, size_t n) constructor to build a std::string from the substring of a C string. The pointer you pass in can be to the middle of the C string; it doesn't need to be to the very first character.
If you need more than that, please update your question to detail exactly where your stumbling block is.
I didn't realize you only wanted to process each line one at a time, but just in case you need all the lines at once, you can also do this:
std::vector<std::string> lines;
char *s = buffer;
char *head = s;
while (*s) {
if (*s == '\n') { // Line break found
*s = '\0'; // Change it to a null character
lines.push_back(head); // Add this line to our vector
head = ++s;
} else s++; //
}
lines.push_back(head); // Add the last line
std::vector<std::string>::iterator it;
for (it = lines.begin(); it != lines.end(); it++) {
// You can process each line here if you want
ProcessLine(*it);
}
// Or you can process all the lines in a separate function:
ProcessLines(lines);
// Cleanup
lines.erase(lines.begin(), lines.end());
I've modified the buffer in place, and the vector.push_back() method generates std::string objects from each of the resulting C substrings automatically.
your best bet (best meaning easiest) is using strtok and convert the tokens to std::string via the constructor. (just note that pure strtok is not reentrant, for that you need to use the non standard strtok_r).
void ProcessTextBlock(char* str)
{
std::vector<std::string> v;
char* tok = strtok(str,"\n");
while(tok != NULL)
{
ProcessLine(std::string(tok));
tok = strtok(tok,"\n");
}
}
You can turn a substring of char* to std::string with a std::string's constructor:
template< class InputIterator >
basic_string( InputIterator first, InputIterator last, const Allocator& alloc = Allocator() );
Just do something like:
char *cstr = "abcd";
std::string str(cstr + 1, cstr + 3);
In that case str would be "bc".

Can getline() be used to get a char array from a fstream

I want to add a new (fstream) function in a program that already uses char arrays to process strings.
The problem is that the below code yields strings, and the only way i can think of getting this to work would be to have an intermediary function that would copy the strings, char by char, into a new char array, pass these on to the functions in the program, get back the results and then copy the results char by char back into the string.
Surely (hopefully) there must be a better way?
Thanks!
void translateStream(ifstream &input, ostream& cout) {
string inputStr;
string translated;
getline(input, inputStr, ' ');
while (!input.eof()) {
translateWord(inputStr, translated);
cout << translated;
getline(input, inputStr, ' ');
}
cout << inputStr;
the translateWord func:
void translateWord(char orig[], char pig[]) {
bool dropCap = false;
int len = strlen(orig)-1;
int firstVowel = findFirstVowel(orig);
char tempStr[len];
strcpy(pig, orig);
if (isdigit(orig[0])) return;
//remember if dropped cap
if (isupper(orig[0])) dropCap = true;
if (firstVowel == -1) {
strcat(pig, "ay");
// return;
}
if (isVowel(orig[0], 0, len)) {
strcat(pig, "way");
// return;
} else {
splitString(pig,tempStr,firstVowel);
strcat(tempStr, pig);
strcat(tempStr, "ay");
strcpy(pig,tempStr);
}
if (dropCap) {
pig[0] = toupper(pig[0]);
}
}
You can pass a string as the first parameter to translateWord by making the first parameter a const char *. Then you call the function with inputStr.c_str() as the first parameter. Do deal with the second (output) parameter though, you need to either completely re-write translateWord to use std::string (the best solution, IMHO), or pass a suitably sized array of char as the second parameter.
Also, what you have posted is not actually C++ - for example:
char tempStr[len];
is not supported by C++ - it is an extension of g++, taken from C99.
You can use the member function ifstream::getline. It takes a char* buffer as the first parameter, and a size argument as the second.

Using strtok with a std::string

I have a string that I would like to tokenize.
But the C strtok() function requires my string to be a char*.
How can I do this simply?
I tried:
token = strtok(str.c_str(), " ");
which fails because it turns it into a const char*, not a char*
#include <iostream>
#include <string>
#include <sstream>
int main(){
std::string myText("some-text-to-tokenize");
std::istringstream iss(myText);
std::string token;
while (std::getline(iss, token, '-'))
{
std::cout << token << std::endl;
}
return 0;
}
Or, as mentioned, use boost for more flexibility.
Duplicate the string, tokenize it, then free it.
char *dup = strdup(str.c_str());
token = strtok(dup, " ");
free(dup);
If boost is available on your system (I think it's standard on most Linux distros these days), it has a Tokenizer class you can use.
If not, then a quick Google turns up a hand-rolled tokenizer for std::string that you can probably just copy and paste. It's very short.
And, if you don't like either of those, then here's a split() function I wrote to make my life easier. It'll break a string into pieces using any of the chars in "delim" as separators. Pieces are appended to the "parts" vector:
void split(const string& str, const string& delim, vector<string>& parts) {
size_t start, end = 0;
while (end < str.size()) {
start = end;
while (start < str.size() && (delim.find(str[start]) != string::npos)) {
start++; // skip initial whitespace
}
end = start;
while (end < str.size() && (delim.find(str[end]) == string::npos)) {
end++; // skip to end of word
}
if (end-start != 0) { // just ignore zero-length strings.
parts.push_back(string(str, start, end-start));
}
}
}
There is a more elegant solution.
With std::string you can use resize() to allocate a suitably large buffer, and &s[0] to get a pointer to the internal buffer.
At this point many fine folks will jump and yell at the screen. But this is the fact. About 2 years ago
the library working group decided (meeting at Lillehammer) that just like for std::vector, std::string should also formally, not just in practice, have a guaranteed contiguous buffer.
The other concern is does strtok() increases the size of the string. The MSDN documentation says:
Each call to strtok modifies strToken by inserting a null character after the token returned by that call.
But this is not correct. Actually the function replaces the first occurrence of a separator character with \0. No change in the size of the string. If we have this string:
one-two---three--four
we will end up with
one\0two\0--three\0-four
So my solution is very simple:
std::string str("some-text-to-split");
char seps[] = "-";
char *token;
token = strtok( &str[0], seps );
while( token != NULL )
{
/* Do your thing */
token = strtok( NULL, seps );
}
Read the discussion on http://www.archivum.info/comp.lang.c++/2008-05/02889/does_std::string_have_something_like_CString::GetBuffer
With C++17 str::string receives data() overload that returns a pointer to modifieable buffer so string can be used in strtok directly without any hacks:
#include <string>
#include <iostream>
#include <cstring>
#include <cstdlib>
int main()
{
::std::string text{"pop dop rop"};
char const * const psz_delimiter{" "};
char * psz_token{::std::strtok(text.data(), psz_delimiter)};
while(nullptr != psz_token)
{
::std::cout << psz_token << ::std::endl;
psz_token = std::strtok(nullptr, psz_delimiter);
}
return EXIT_SUCCESS;
}
output
pop
dop
rop
EDIT: usage of const cast is only used to demonstrate the effect of strtok() when applied to a pointer returned by string::c_str().
You should not use
strtok() since it modifies the tokenized string which may lead to undesired, if not undefined, behaviour as the C string "belongs" to the string instance.
#include <string>
#include <iostream>
int main(int ac, char **av)
{
std::string theString("hello world");
std::cout << theString << " - " << theString.size() << std::endl;
//--- this cast *only* to illustrate the effect of strtok() on std::string
char *token = strtok(const_cast<char *>(theString.c_str()), " ");
std::cout << theString << " - " << theString.size() << std::endl;
return 0;
}
After the call to strtok(), the space was "removed" from the string, or turned down to a non-printable character, but the length remains unchanged.
>./a.out
hello world - 11
helloworld - 11
Therefore you have to resort to native mechanism, duplication of the string or an third party library as previously mentioned.
I suppose the language is C, or C++...
strtok, IIRC, replace separators with \0. That's what it cannot use a const string.
To workaround that "quickly", if the string isn't huge, you can just strdup() it. Which is wise if you need to keep the string unaltered (what the const suggest...).
On the other hand, you might want to use another tokenizer, perhaps hand rolled, less violent on the given argument.
Assuming that by "string" you're talking about std::string in C++, you might have a look at the Tokenizer package in Boost.
First off I would say use boost tokenizer.
Alternatively if your data is space separated then the string stream library is very useful.
But both the above have already been covered.
So as a third C-Like alternative I propose copying the std::string into a buffer for modification.
std::string data("The data I want to tokenize");
// Create a buffer of the correct length:
std::vector<char> buffer(data.size()+1);
// copy the string into the buffer
strcpy(&buffer[0],data.c_str());
// Tokenize
strtok(&buffer[0]," ");
If you don't mind open source, you could use the subbuffer and subparser classes from https://github.com/EdgeCast/json_parser. The original string is left intact, there is no allocation and no copying of data. I have not compiled the following so there may be errors.
std::string input_string("hello world");
subbuffer input(input_string);
subparser flds(input, ' ', subparser::SKIP_EMPTY);
while (!flds.empty())
{
subbuffer fld = flds.next();
// do something with fld
}
// or if you know it is only two fields
subbuffer fld1 = input.before(' ');
subbuffer fld2 = input.sub(fld1.length() + 1).ltrim(' ');
Typecasting to (char*) got it working for me!
token = strtok((char *)str.c_str(), " ");
Chris's answer is probably fine when using std::string; however in case you want to use std::basic_string<char16_t>, std::getline can't be used. Here is a possible other implementation:
template <class CharT> bool tokenizestring(const std::basic_string<CharT> &input, CharT separator, typename std::basic_string<CharT>::size_type &pos, std::basic_string<CharT> &token) {
if (pos >= input.length()) {
// if input is empty, or ends with a separator, return an empty token when the end has been reached (and return an out-of-bound position so subsequent call won't do it again)
if ((pos == 0) || ((pos > 0) && (pos == input.length()) && (input[pos-1] == separator))) {
token.clear();
pos=input.length()+1;
return true;
}
return false;
}
typename std::basic_string<CharT>::size_type separatorPos=input.find(separator, pos);
if (separatorPos == std::basic_string<CharT>::npos) {
token=input.substr(pos, input.length()-pos);
pos=input.length();
} else {
token=input.substr(pos, separatorPos-pos);
pos=separatorPos+1;
}
return true;
}
Then use it like this:
std::basic_string<char16_t> s;
std::basic_string<char16_t> token;
std::basic_string<char16_t>::size_type tokenPos=0;
while (tokenizestring(s, (char16_t)' ', tokenPos, token)) {
...
}
It fails because str.c_str() returns constant string but char * strtok (char * str, const char * delimiters ) requires volatile string. So you need to use *const_cast< char > inorder to make it voletile.
I am giving you a complete but small program to tokenize the string using C strtok() function.
#include <iostream>
#include <string>
#include <string.h>
using namespace std;
int main() {
string s="20#6 5, 3";
// strtok requires volatile string as it modifies the supplied string in order to tokenize it
char *str=const_cast< char *>(s.c_str());
char *tok;
tok=strtok(str, "#, " );
int arr[4], i=0;
while(tok!=NULL){
arr[i++]=stoi(tok);
tok=strtok(NULL, "#, " );
}
for(int i=0; i<4; i++) cout<<arr[i]<<endl;
return 0;
}
NOTE: strtok may not be suitable in all situation as the string passed to function gets modified by being broken into smaller strings. Pls., ref to get better understanding of strtok functionality.
How strtok works
Added few print statement to better understand the changes happning to string in each call to strtok and how it returns token.
#include <iostream>
#include <string>
#include <string.h>
using namespace std;
int main() {
string s="20#6 5, 3";
char *str=const_cast< char *>(s.c_str());
char *tok;
cout<<"string: "<<s<<endl;
tok=strtok(str, "#, " );
cout<<"String: "<<s<<"\tToken: "<<tok<<endl;
while(tok!=NULL){
tok=strtok(NULL, "#, " );
cout<<"String: "<<s<<"\t\tToken: "<<tok<<endl;
}
return 0;
}
Output:
string: 20#6 5, 3
String: 206 5, 3 Token: 20
String: 2065, 3 Token: 6
String: 2065 3 Token: 5
String: 2065 3 Token: 3
String: 2065 3 Token:
strtok iterate over the string first call find the non delemetor character (2 in this case) and marked it as token start then continues scan for a delimeter and replace it with null charater (# gets replaced in actual string) and return start which points to token start character( i.e., it return token 20 which is terminated by null). In subsequent call it start scaning from the next character and returns token if found else null. subsecuntly it returns token 6, 5, 3.