Tokenize a string based on quotes

Tokenize a string based on quotes - c++

I am trying to read data from a text file and split the read line based on quotes. For example
"Hi how" "are you" "thanks"
Expected output
Hi how
are you
thanks
My code:
getline(infile, line);
ch = strdup(line.c_str());
ch1 = strtok(ch, " ");
while (ch1 != NULL)
{
a3[i] = ch1;
ch1 = strtok(NULL, " ");
i++;
}
I don't know what to specify as delimiter string. I am using strtok() to split, but it failed. Can any one help me?

Please have a look at the example code here. You should provide "\"" as delimiter string to strtok.
For example,
ch1 = strtok (ch,"\"");
Probably your problem is related with representing escape sequences. Please have a look here for a list of escape sequences for characters.

Given your input: "Hi how" "are you" "thanks", if you use strtok with "\"" as the delimiter, it'll treat the spaces between the quoted strings as if they were also strings, so if (for example) you printed out the result strings, one per line, surrounded by square brackets, you'd get:
[Hi how]
[ ]
[are you]
[ ]
[thanks]
I.e., the blank character between each quoted string is, itself, being treated as a string. If the delimiter you supplied to strtok was " \"" (i.e., included both a quote and a space) that wouldn't happen, but then it would also break on the spaces inside the quoted strings.
Assuming you can depend on every item you care about being quoted, you want to skip anything until you get to a quote, ignore the quote, then read data into your input string until you get to another quote, then repeat the whole process.

Related

Str.global_replace in OCaml putting carats where they shouldn't be

I am working to convert multiline strings into a list of tokens that might be easier for me to work with.
In accordance with the specific needs of my project, I'm padding any carat symbol that appears in my input with spaces, so that "^" gets turned into " ^ ". I'm using something like the following function to do so:
let bad_function string = Str.global_replace (Str.regexp "^") " ^ " (string)
I then use something like the below function to then turn this multiline string into a list of tokens (ignoring whitespace).
let string_to_tokens string = (Str.split (Str.regexp "[ \n\r\x0c\t]+") (string));;
For some reason, bad_function adds carats to places where they shouldn't be. Take the following line of code:
(bad_function " This is some
multiline input
with newline characters
and tabs. When I convert this string
into a list of tokens I get ^s showing up where
they shouldn't. ")
The first line of the string turns into:
^ This is some \n ^
When I feed the output from bad_function into string_to_tokens I get the following list:
string_to_tokens (bad_function " This is some
multiline input
with newline characters
and tabs. When I convert this string
into a list of tokens I get ^s showing up where
they shouldn't. ")
["^"; "This"; "is"; "some"; "^"; "multiline"; "input"; "^"; "with";
"newline"; "characters"; "^"; "and"; "tabs."; "When"; "I"; "convert";
"this"; "string"; "^"; "into"; "a"; "list"; "of"; "tokens"; "I"; "get";
"^s"; "showing"; "up"; "where"; "^"; "they"; "shouldn't."]
Why is this happening, and how can I fix so these functions behave like I want them to?

As explained in the Str module.
^ Matches at beginning of line: either at the beginning of the
matched string, or just after a '\n' character.
So you have to quote the '^' character using the escape character "\".
However, note that (also from the doc)
any backslash character in the regular expression must be doubled to
make it past the OCaml string parser.
This means you have to put a double '\' to do what you want without getting a warning.
This should do the job:
let bad_function string = Str.global_replace (Str.regexp "\\^") " ^ " (string);;

Using one cout command to print multiple strings with each string placed on a different (text editor) line

Take a look at the following example:
cout << "option 1:
\n option 2:
\n option 3";
I know,it's not the best way to output a string,but the question is why does this cause an error saying that a " character is missing?There is a single string that must go to stdout but it just consists of a lot of whitespace charcters.
What about this:
string x="
string_test";
One may interpret that string as: "\nxxxxxxxxxxxxstring_test" where x is a whitespace character.
Is it a convention?

That's called multiline string literal.
You need to escape the embedded newline. Otherwise, it will not compile:
std::cout << "Hello world \
and stackoverflow";
Note: Backslashes must be immediately before the line ends as they need to escape the newline in the source.
Also you can use the fun fact "Adjacent string literals are concatenated by the compiler" for your advantage by this:
std::cout << "Hello World"
"Stack overflow";
See this for raw string literals. In C++11, we have raw string literals. They are kind of like here-text.
Syntax:
prefix(optional) R"delimiter( raw_characters )delimiter"
It allows any character sequence, except that it must not contain the
closing sequence )delimiter". It is used to avoid escaping of any
character. Anything between the delimiters becomes part of the string.
const char* s1 = R"foo(
Hello
World
)foo";
Example taken from cppreference.

How to delimit this text file? strtok

so there's a text file where I have 1. languages, a 2. text of a number written in the said language, 3. the base of the number and 4. the number written in digits. Here's a sample:
francais deux mille quatre cents 10 2400
How I went about it:
struct Nomen{
char langue[21], nomNombre [31], baseC[3], nombreC[21];
int base, nombre;
};
and in the main:
if(myfile.is_open()){
{
while(getline(myfile, line))
{
strcpy(Linguo[i].langue, strtok((char *)line.c_str(), " "));
strcpy(Linguo[i].nomNombre, strtok(NULL, " "));
strcpy(Linguo[i].baseC, strtok(NULL, " "));
strcpy(Linguo[i].nombreC, strtok(NULL, "\n"));
i++;
}
Difficulty: I'm trying to put two whitespaces as a delimiter, but it seems that strtok() counts it as if there were only one whitespace. The fact there are spaces in the text number, etc. is messing up the tokenization. How should I go about it?

strtok treats any single character in the provided string as a delimiter. It does not treat the string itself as a single delimiter. So " " (two spaces) is the same as " " (one space).
strtok will also treat multiple delimiters together as a single delimiter. So the input "t1 t2" will be tokenized as two tokens, "t1" and "t2".
As mentioned in comments, strtok is also writes the NUL character into the input to create the token strings. So, it is an error to pass the result of string::c_str() as input to the function. The fact that you need to cast the constant string should have been enough to dissuade you from this approach.
If you want to treat a double space as a delimiter, you will have to scan the string and search for them yourself. Given you are using C APIs, you can consider strstr. However, in C++, you can use string::find.
Here's an algorithm to parse your string manually:
Given an input string input:
language is the substring from the start of input to the first SPC character.
From where language ends, skip over all whitespace, changing input to begin at the first non-whitespace character.
text is the substring from the start of input to the first double SPC sequence.
From where text ends, skip over all whitespace, changing input to begin at the first non-whitespace character.
Parse base, and parse number.

C++ , How can I ignore comma (,) from csv char *?

I have searched a lot about it on SO and solutions like "" the part where comma is are giving errors. Moreover it is using C++ :)
char *msg = new char[40];
msg = "1,2, Hello , how are you ";
char msg2[30];
strcpy_s(msg2, msg);
char * pch;
pch = strtok(msg2, ",");
while (pch != NULL)
{
cout << pch << endl;
pch = strtok(NULL, ",");
}
Output I want :
1
2
Hello , how are you
Out put it is producing
1
2
Hello
how are you
I have tried putting "" around Hello , how are you. But it did not help.

The CSV files are comma separated values. If you want a comma inside the value, you have to surround it with quotes.
Your example in CSV, as you need your output, should be:
msg = "1,2, \"Hello , how are you \"";
so the value Hello , how are you is surrounded with quotes.
This is the standard CSV. This has nothing to do with the behaviour of the strtok function.
The strtok function just searches, without considering anything else, the tokens you have passed to it, in this case the ,, thus it ignores the ".
In order to make it work as you want, you would have to tokenize with both tokens, the , and the ", and consider the previous found token in order to decide if the , found is a new value or it is inside quotes.
NOTE also that if you want to be completely conforming with the CSV specification, you should consider that the quotes may also be escaped, in order to have a quote character inside the value term. See this answer for an example:
Properly escape a double quote in CSV
NOTE 2: Just for completeness, here is the CSV specification (RFC-4180): https://www.rfc-editor.org/rfc/rfc4180

Remove spaces from string before period and comma

I could have a string like:
During this time , Bond meets a stunning IRS agent , whom he seduces .
I need to remove the extra spaces before the comma and before the period in my whole string. I tried throwing this into a char vector and only not push_back if the current char was " " and the following char was a "." or "," but it did not work. I know there is a simple way to do it maybe using trim(), find(), or erase() or some kind of regex but I am not the most familiar with regex.

A solution could be (using regex library):
std::string fix_string(const std::string& str) {
static const std::regex rgx_pattern("\\s+(?=[\\.,])");
std::string rtn;
rtn.reserve(str.size());
std::regex_replace(std::back_insert_iterator<std::string>(rtn),
str.cbegin(),
str.cend(),
rgx_pattern,
"");
return rtn;
}
This function takes in input a string and "fixes the spaces problem".
Here a demo

On a loop search for string " ," and if you find one replace that to ",":
std::string str = "...";
while( true ) {
auto pos = str.find( " ," );
if( pos == std::string::npos )
break;
str.replace( pos, 2, "," );
}
Do the same for " .". If you need to process different space symbols like tab use regex and proper group.

I don't know how to use regex for C++, also not sure if C++ supports PCRE regex, anyway I post this answer for the regex (I could delete it if it doesn't work for C++).
You can use this regex:
\s+(?=[,.])
Regex demo

First, there is no need to use a vector of char: you could very well do the same by using an std::string.
Then, your approach can't work because your copy is independent of the position of the space. Unfortunately you have to remove only spaces around the punctuation, and not those between words.
Modifying your code slightly you could delay copy of spaces waiting to the value of the first non-space: if it's not a punctuation you'd copy a space before the character, otherwise you just copy the non-space char (thus getting rid of spaces.
Similarly, once you've copied a punctuation just loop and ignore the following spaces until the first non-space char.
I could have written code. It would have been shorter. But i prefer letting you finish your homework with full understanding of the approach.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Tokenize a string based on quotes - c++

Please have a look at the example code here. You should provide "\"" as delimiter string to strtok. For example, ch1 = strtok (ch,"\""); Probably your problem is related with representing escape sequences. Please have a look here for a list of escape sequences for characters.

Related

Str.global_replace in OCaml putting carats where they shouldn't be

Using one cout command to print multiple strings with each string placed on a different (text editor) line

How to delimit this text file? strtok

C++ , How can I ignore comma (,) from csv char *?

Remove spaces from string before period and comma

Categories

Resources