How do I tokenize a string only using <iostream>?

How do I tokenize a string only using <iostream>? - c++

I want to learn how to tokenize a string, like the strtok function only using <iostream>.
I made a program that deletes the spaces but I don't thinks its the same as strtok.
#include <iostream>
int main(){
int i = 0;
char s[100]="fix the car";
while(s[i] != '\0'){
if(s[i] == ' ')
s[i] = s[i-1];
else std::cout << s[i];
i++;
}
return 0;
}
prints: fixthecar
I want the whole strtok function, not just deleting delimiters, heard I have to use pointers, but I don't know how to code it.

The internal implementation of strtok has already been discussed here, you should check that before opening new questions.
The key to the operation of strtok() is preserving the location of the last seperator between seccessive calls (that's why strtok() continues to parse the very original string that is passed to it when it is invoked with a null pointer in successive calls)..
Have a look at this strtok() implementation which has a slightly different functionality than the one provided by strtok()
char *zStrtok(char *str, const char *delim) {
static char *static_str=0; /* var to store last address */
int index=0, strlength=0; /* integers for indexes */
int found = 0; /* check if delim is found */
/* delimiter cannot be NULL
* if no more char left, return NULL as well
*/
if (delim==0 || (str == 0 && static_str == 0))
return 0;
if (str == 0)
str = static_str;
/* get length of string */
while(str[strlength])
strlength++;
/* find the first occurance of delim */
for (index=0;index<strlength;index++)
if (str[index]==delim[0]) {
found=1;
break;
}
/* if delim is not contained in str, return str */
if (!found) {
static_str = 0;
return str;
}
/* check for consecutive delimiters
*if first char is delim, return delim
*/
if (str[0]==delim[0]) {
static_str = (str + 1);
return (char *)delim;
}
/* terminate the string
* this assignmetn requires char[], so str has to
* be char[] rather than *char
*/
str[index] = '\0';
/* save the rest of the string */
if ((str + index + 1)!=0)
static_str = (str + index + 1);
else
static_str = 0;
return str;
}

Related

Remove Duplicate words (only if followed) from char array

I am a little bit stuck and cant find out what is wrong here.
I have an assignment to enter a sentence into char array and if there are duplicate and followed words(example : same same , diff diff. but not : same word same.) they should be removed.
here is the function I wrote:
void Same(char arr[], char temp[]){
int i = 0, j = 0, f = 0, *p, k = 0, counter = 0;
for (i = 0; i < strlen(arr); i++){
while (arr[i] != ' ' && i < strlen(arr)){
temp[k] = arr[i];
i++;
k++;
counter++;
}
temp[k] = '\0';
k = 0;
p = strstr((arr + i), (temp + j));
if (p != NULL && (*p == arr[i])){
for (f = 0; f < strlen(p); f++){
*p = '*';
p++;
}
f = 0;
}
j = counter;
}
}

strtok is a handy function to grab the next word from a list (strsep is a better one, but is less likely to be available on your system). Using strtok, an approach like the following might work, at least for simple examples...
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define MAXPHRASELEN 1000
#define MAXTOKLEN 100
int main(int argc, char ** argv)
{
// Here is the sentence we are looking at
char * tmp = "This is a test and and another test";
// We will copy it to this variable
char phrase[MAXPHRASELEN+1];
strcpy(phrase, tmp);
// And will put the altered text in this variable
char new_phrase[MAXPHRASELEN+1];
// This will be the last word we looked at
char * lasttok = malloc(MAXTOKLEN+1);
// This will be the current word
char * tok = malloc(MAXTOKLEN+1);
// Both words are initially empty
new_phrase[0] = '\0';
lasttok[0] = '\0';
// Get the first word
lasttok = strtok(phrase, " ");
// If there is a word...
if (lasttok != NULL) {
// Put it in the altered text and add a space
strcat(new_phrase, lasttok);
strcat(new_phrase, " ");
// As long as there is a next word
while ( (tok = strtok(NULL, " ")) != NULL ) {
// See if it is the same as the last word
if (strcmp(tok,lasttok) != 0) {
// If it isn't, copy it to the altered text
strcat(new_phrase, tok);
// and add a space
strcat(new_phrase, " ");
// The current word becomes the last word
lasttok = tok;
}
}
}
// Print the lot
printf("%s\n", new_phrase);
}
If you really must write your own routine for grabbing the individual words, you could do worse than emulate strtok. It maintains a pointer to the beginning of current word in the string and puts a null character at the next separator (space character). When called again, it just moves the pointer to the character past the null, and puts another null after the next separator. Most string functions, when passed the pointer, will see the null as the end of the string and so just deal with the current word.
Minus comments, headers, and initialisation, it looks less threatening...
lasttok = strtok(phrase, " ");
if (lasttok != NULL) {
strcat(new_phrase, lasttok);
strcat(new_phrase, " ");
while ( (tok = strtok(NULL, " ")) != NULL ) {
if (strcmp(tok,lasttok) != 0) {
strcat(new_phrase, tok);
strcat(new_phrase, " ");
lasttok = tok;
}
}
}
printf("%s\n", new_phrase);

C++ read file until char found

I want to create a function that can read a file char by char continuously until some specific char encountered.
This is my method in a class FileHandler.
char* tysort::FileHandler::readUntilCharFound(size_t start, char seek)
{
char* text = new char;
if(this->inputFileStream != nullptr)
{
bool goOn = true;
size_t seekPos = start;
while (goOn)
{
this->inputFileStream->seekg(seekPos);
char* buffer = new char;
this->inputFileStream->read(buffer, 1);
if(strcmp(buffer, &seek) != 0)
{
strcat(text, buffer); // Execution stops here
seekPos++;
}
else
{
goOn = false;
}
}
}
//printf("%s\n", text);
return text;
}
I test this function and it actually works. This is an example to read a file content until new line character '\n' found.
size_t startPosition = 0;
char* text = this->fileHandler->readUntilCharFound(startPosition, '\n');
However, I am sure that something not right is exists somewhere in the code because if I use those method in a loop block, the app will just hangs. I guess the 'not right' things are about pointer but I don't know exactly where. Could you please point out for me?

C++ provides some easy-to-use solutions. For instance:
istream& getline (istream& is, string& str, char delim);
In your case, the parameter would be the equivalent of your text variable and delim would be the equivalent of your seek parameter. Also, the return value of getline would in some way be the equivalent of your goOn flag (there are good FAQs regarding the right patterns to check for EOF and IO errors using the return value of getline)

The lines
if(strcmp(buffer, &seek) != 0)
and
strcat(text, buffer); // Execution stops here
are causes for undefined behavior. strcmp and strcat expect null terminated strings.
Here's an updated version, with appropriate comments.
char* tysort::FileHandler::readUntilCharFound(size_t start, char seek)
{
// If you want to return a string containing
// one character, you have to allocate at least two characters.
// The first one contains the character you want to return.
// The second one contains the null character - '\0'
char* text = new char[2];
// Make it a null terminated string.
text[1] = '\0';
if(this->inputFileStream != nullptr)
{
bool goOn = true;
size_t seekPos = start;
while (goOn)
{
this->inputFileStream->seekg(seekPos);
// No need to allocate memory form the heap.
char buffer[2];
this->inputFileStream->read(buffer, 1);
if( buffer[0] == seek )
{
buffer[1] = '\0';
strcat(text, buffer);
seekPos++;
}
else
{
goOn = false;
}
}
}
return text;
}
You can further simplify the function to:
char* tysort::FileHandler::readUntilCharFound(size_t start, char seek)
{
// If you want to return a string containing
// one character, you have to allocate at least two characters.
// The first one contains the character you want to return.
// The second one contains the null character - '\0'
char* text = new char[2];
text[1] = '\0';
if(this->inputFileStream != nullptr)
{
this->inputFileStream->seekg(start);
// Keep reading from the stream until we find the character
// we are looking for or EOF is reached.
int c;
while ( (c = this->inputFileStream->get()) != EOF && c != seek )
{
}
if ( c != EOF )
{
text[0] = c;
}
}
return text;
}

this->inputFileStream->read(buffer, 1);
No error checking.
if(strcmp(buffer, &seek) != 0)
The strcmp function is used to compare strings. Here you just want to compare two characters.

How can I tokenize a c-style string from a file without making a copy?

Let's say I have a constant c-style string say
const char* msg = "fred,jim,345,7665";
I'd like to tokenize this and read out the individual fields but for performance reasons I don't want to make a copy. How can I do this?
Obviously strtok takes a non-constant pointer and boost::tokenizer is an option but I am unsure what is doing behind the scenes.

Inevitably you will require some copy of the string, even if it is a substring being copied.
If you have a strtok_r function, you can use that, but it will still require a mutable string to do its work. Beware, however, as not all systems provide the function (e.g. Windows), which is why I've provided an implementation here. It works by requiring an additional parameter: a pointer to a C string to save the address of the next match. This allows for it to be more reentrant (thread-safe) in theory. However, you'll still be mutating the value. You could modify it to suit your needs if you like, perhaps copying N bytes into a destination buffer and null-terminating that buffer to avoid the need to modify the source string.
/*
Usage:
char *tok;
char *savep;
tok = mystrtok_r (somestr, ",", &savep);
while (NULL != tok)
{
/* Do something with `tok'. */
tok = mystrtok_r (NULL, ",", &savep);
}
*/
char *
mystrtok_r (char *str, const char *delims, char **nextp)
{
if (str == NULL)
str = *nextp;
str += strspn (str, delims);
*nextp = str + strcspn (str, delims);
**nextp = 0;
if (*str == 0)
return NULL;
++*nextp;
return str;
}

It depends on how you're going to use it.
If you want to get the next token, and then the next (like an iteration over the string, then you only really need to copy the current token into memory.
long strtok2( char *strDest, const char *strSrc, const char cTok, long lOffset, long lMax)
{
if(lMax > 0)
{
strSrc += lOffset;
char * start = strDest;
while(--lMax && *strSrc != cTok && (*strDest++ = * strSrc++) );
*strDest = 0; //for when the token was found, not the null.
return strDest - start - 1; //the length of the token
}
return 0;
}
I snagged a simple strcpy from http://vijayinterviewquestions.blogspot.com.au/2007/07/implement-strcpy-function.html
const char* msg = "fred,jim,345,7665";
char * buffer[20];
long offset = 0;
while(length = strtok2(buffer, msg, ',', offset, 20))
{
cout << buffer;
offset += (length+1);
}

Well, without a little more detail it's hard to know exactly what you want. I'll guess you are parsing delimited items where consecutive delimiters should be treated as zero length tokens (which is usually correct for comma separated elements). I'm also assuming a blank line counts as a single zero length token. This is how I'd approach it:
const char *token_begin = msg;
int length;
for(;;)
{
length = 0;
while(!isDelimiter(token_begin[length])) //< must include \0 as delimiter
++length;
//..do something here with token. token is at: token_begin[0..length)
if ( token_begin[length] != 0 )
token_begin = &token_begin[length+1]; //skip beyond non-null delimiter
else
break; //token null terminated. exit
}

If you are going to store the tokens somewhere then a copy will be necessary in any case and strtok does this nicely by using the string a placing null terminating character inside it.
The only other option I see to avoid copying it is a lexer which reads the string and through a state machine produces tokens by scanning the string and storing the partial results in a buffer but every token should in any case be stored at least in a null terminated string to you are not really saving anything.

Here is my proposal, my code is structured and use a global variable pos(I know global variable are a bad practice but is only to give you the idea), you can replace it with a data member if you need OOP.
int position, messageLength;
char token[MAX]; // MAX = Value greater than the maximum length
// of the tokens(e.g. 1,000);
bool hasNext()
{
return position < messageLength;
}
char* next(const char* message)
{
int i = 0;
while (position < messageLength && message[position] != ',') {
token[i++] = message[position];
position++;
}
position++; // ',' found
token[i] = '\0';
return token;
}
int main(int argc, char **argv)
{
const char* msg = "fred,jim,345,7665";
position = 0;
messageLength = strlen(msg);
while (hasNext())
cout << next(msg) << endl;
return EXIT_SUCCESS;
}

Split array of chars into two arrays of chars

I would like to split one array of char containing two "strings "separated by '|' into two arays of char.
Here is my sample code.
void splitChar(const char *text, char *text1, char *text2)
{
for (;*text!='\0' && *text != '|';) *text1++ = *text++;
*text1 = '\0';
for (;*++text!='\0';) *text2++ = *text;
*text2 = '\0';
}
int main(int argc, char* argv[])
{
char *text = "monday|tuesday", text1[255], text2 [255];
splitChar (text, text1, text2);
return 0;
}
I have two questions:
How to further improve this code in C (for example rewrite it in 1 for cycle).
How to rewrite this code in C++?

If you wan to write it in C++, use the STL
string s = "monday|tuesday";
int pos = s.find('|');
if(pos == string::npos)
return 1;
string part1 = s.substr(0, pos);
string part2 = s.substr(pos+1, s.size() - pos);

For A, using internal libraries:
void splitChar(const char *text, char *text1, char *text2)
{
int len = (strchr(text,'|')-text)*sizeof(char);
strncpy(text1, text, len);
strcpy(text2, text+len+1);
}

I don't know about A), but for B), Here's a method from a utility library I use in various projects, showing how to split any number of words into a vector. It's coded to split on space and tab, but you could pass that in as an additional parameter if you wanted. It returns the number of words split:
unsigned util::split_line(const string &line, vector<string> &parts)
{
const string delimiters = " \t";
unsigned count = 0;
parts.clear();
// skip delimiters at beginning.
string::size_type lastPos = line.find_first_not_of(delimiters, 0);
// find first "non-delimiter".
string::size_type pos = line.find_first_of(delimiters, lastPos);
while (string::npos != pos || string::npos != lastPos)
{
// found a token, add it to the vector.
parts.push_back(line.substr(lastPos, pos - lastPos));
count++;
// skip delimiters. Note the "not_of"
lastPos = line.find_first_not_of(delimiters, pos);
// find next "non-delimiter"
pos = line.find_first_of(delimiters, lastPos);
}
return count;
}

Probably one of these solutions will work: Split a string in C++?

Take a look at the example given here: strtok, wcstok, _mbstok

I've found a destructive split is the best balance of performance and flexibility.
void split_destr(std::string &str, char split_by, std::vector<char*> &fields) {
fields.push_back(&str[0]);
for (size_t i = 0; i < str.size(); i++) {
if (str[i] == split_by) {
str[i] = '\0';
if (i+1 == str.size())
str.push_back('\0');
fields.push_back(&str[i+1]);
}
}
}
Then a non-destructive version for lazies.
template<typename C>
void split_copy(const std::string &str_, char split_by, C &container) {
std::string str = str_;
std::vector<char*> tokens;
parse::split_destr(str, split_by, tokens);
for (size_t i = 0 ; i < tokens.size(); i++)
container.push_back(std::string( tokens[i] ));
}
I arrived at this when things like boost::Tokenizer have fallen flat on their face dealing with gb+ size files.

I apologize advance for my answer :) No one should try this at home.
To answer the first part of your question.
A] How to further improve this code in C (for example rewrite it in 1 for cycle).
The complexity of this algorithm will depend on where the position of '|' is in the string but this example only works for 2 strings separated by a '|'. You can easily modify it later for more than that.
#include <stdio.h>
void splitChar(char *text, char **text1, char **text2)
{
char * temp = *text1 = text;
while (*temp != '\0' && *temp != '|') temp++;
if (*temp == '|')
{
*temp ='\0';
*text2 = temp + 1;
}
}
int main(int argc, char* argv[])
{
char text[] = "monday|tuesday", *text1,*text2;
splitChar (text, &text1, &text2);
printf("%s\n%s\n%s", text,text1,text2);
return 0;
}
This works because c-style arrays use the null character to terminate the string. Since initializing a character string with "" will add a null char to the end, all you would have to do is replace the occurrences of '|' with the null character and assign the other char pointers to the next byte past the '|'.
You have to make sure to initialize your original character string with [] because that tells the compiler to allocate storage for your character array where char * might initialize the string in a static area of memory that can't be changed.

strcmp segmentation fault

Here is a problem from spoj. nothing related to algorithms, but just c
Sample Input
2
a aa bb cc def ghi
a a a a a bb bb bb bb c c
Sample Output
3
5
it counts the longest sequence of same words
http://www.spoj.pl/problems/WORDCNT/
The word is less than 20 characters
But when i run it, it's giving segmentation fault. I debugged it using eclipse. Here's where it crashes
if (strcmp(previous, current) == 0)
currentLength++;
with the following message
No source available for "strcmp() at 0x2d0100"
What's the problem?
#include <iostream>
#include <cstring>
#include <string>
#include <cstdio>
using namespace std;
int main(int argc, const char *argv[])
{
int t;
cin >> t;
while (t--) {
char line[20000], previous[21], current[21], *p;
int currentLength = 1, maxLength = 1;
if (cin.peek() == '\n') cin.get();
cin.getline(line, 20000);
p = strtok(line, " '\t''\r'");
strcpy(previous, p);
while (p != NULL) {
p = strtok(NULL, " '\t''\r'");
strcpy(current, p);
if (strcmp(previous, current) == 0)
currentLength++;
else
currentLength = 1;
if (currentLength > maxLength)
maxLength = currentLength;
}
cout << maxLength << endl;
}
return 0;
}

The problem is probably here:
while (p != NULL) {
p = strtok(NULL, " '\t''\r'");
strcpy(current, p);
While p may not be NULL when the loop is entered.
It may be NULL when strcpy is used on it.
A more correct form of the loop would be:
while ((p != NULL) && ((p = strtok(NULL, " \t\r")) != NULL))
{
strcpy(current, p);
Note. Tokenizing a stream in C++ is a lot easier.
std::string token;
std::cin >> token; // Reads 1 white space seoporated word
If you want to tokenize a line
// Step 1: read a single line in a safe way.
std::string line;
std::getline(std::cin, line);
// Turn that line into a stream.
std::stringstream linestream(line);
// Get 1 word at a time from the stream.
std::string token;
while(linestream >> token)
{
// Do STUFF
}

Forgot to check for NULL on strtok, it will return NULL when done and you cannot use that NULL on strcpy, strcmp, etc.
Note that you do a strcpy right after the strtok, you should check for null before doing that using p as a source.

The strtok man page says:
Each call to strtok() returns a pointer to a null-terminated string containing the next
token. This string does not include the delimiting character. If no more tokens are found,
strtok() returns NULL.
And in your code,
while (p != NULL) {
p = strtok(NULL, " '\t''\r'");
strcpy(current, p);
you are not checking for NULL (for p) once the whole string has been parsed. After that you are trying to copy p (which is NULL now) in current and so getting the crash.

You will find that one of previous or current does not point to a null-terminated string at that point, so strcmp doesn't know when to stop.
Use proper C++ strings and string functions instead, rather than mixing C and C++. The Boost libraries can provide a much safer tokeniser than strtok.

You probably undersized current and previous. You should really use std::string for this kind of thing- that's what it's for.

You are doing nothing to check your string lengths before copying them into buffers of size 21. I bet that you are somehow overwriting the end of the buffer.

If you insist on using C strings, I'd suggest using strncmp instead of strcmp. That way, in case you are ending up with a non-null terminated string (which is what I suspect is the case), you can restrict your compare to the max length of the string (in this case 21).

Try this one...
#include <cstdio>
#include <cstring>
#define T(x) strtok(x, " \n\r\t")
char line[44444];
int main( )
{
int t; scanf("%d\n", &t);
while(t--)
{
fgets(line, 44444, stdin);
int cnt = 1, len, maxcnt = 0, plen = -1;
for(char *p = T(line); p != NULL; p = T(NULL))
{
len = strlen(p);
if(len == plen) ++cnt;
else cnt = 1;
if(cnt > maxcnt)
maxcnt = cnt;
plen = len;
}
printf("%d\n", maxcnt);
}
return 0;
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How do I tokenize a string only using <iostream>? - c++

Related

Remove Duplicate words (only if followed) from char array

C++ read file until char found

How can I tokenize a c-style string from a file without making a copy?

Split array of chars into two arrays of chars

strcmp segmentation fault

Categories

Resources