Extracting tokens from string in C++


Edit: I'm looking for a solution that doesn't use regex, since it seems buggy and untrustworthy.
I had the following function, which extracts tokens from a string whenever one of the following symbols is found: +, -, ^, *, !
bool extract_tokens(std::string expression, std::vector<std::string> &tokens) {
    static const std::regex reg(R"(\+|\^|-|\*|!|\(|\)|([\w|\s]+))");
    // iterate over the function parameter; right_token was never declared
    std::copy(std::sregex_token_iterator(expression.begin(), expression.end(), reg, 0),
              std::sregex_token_iterator(),
              std::back_inserter(tokens));
    return true;
}
I thought it worked perfectly until today, when I found an edge case.
The input !aaa + ! a is supposed to return !,aaa ,+,!, a but it returns !,aaa ,+,"",!, a. Notice the extra empty string between + and !.
How can I prevent this behaviour? I think this can be done within the regular expression.

In an attempt to salvage the regular expression-based solution, I came up with this:
[-+^*!()]|\s*[^-+^*!()\s][^-+^*!()]*
This reports delimiters and anything between delimiters, including leading and trailing whitespace, but drops tokens consisting of whitespace alone.
A similar expression that also strips leading and trailing whitespace:
[-+^*!()]|[^-+^*!()\s]+(\s+[^-+^*!()\s]+)*

Inspired by https://stackoverflow.com/a/9436872/4645334 you could solve the problem with:
#include <algorithm>
#include <string>
#include <vector>

bool extract_tokens(std::string expression, std::vector<std::string> &tokens) {
    // true when the token contains at least one non-space character
    auto has_content = [](const std::string &t) {
        return !t.empty() && !std::all_of(t.cbegin(), t.cend(), [](char ch) { return ch == ' '; });
    };
    std::string token;
    for (const auto& c: expression) {
        // delimiters: the question's symbols (+, -, ^, *, !) plus '/'
        if (c == '/' || c == '-' || c == '*' || c == '+' || c == '!' || c == '^') {
            if (has_content(token)) tokens.push_back(token);
            token.clear();
            tokens.emplace_back(1, c); // the delimiter itself is a one-character token
        } else {
            token += c;
        }
    }
    if (has_content(token)) tokens.push_back(token);
    return true;
}
Input:
"!aaa + ! a"
Output:
"!","aaa ","+","!"," a"

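The output of the answer above still carries the surrounding spaces ("aaa " and " a"). If trimmed operands are wanted, a variation along the following lines might work. This is only a sketch: the names trim and extract_trimmed_tokens and the default delimiter set "+-^*!" (taken from the question's list) are assumptions for illustration, not part of the original answer.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: strips leading and trailing whitespace from s.
static std::string trim(const std::string &s) {
    std::size_t first = 0;
    std::size_t last = s.size();
    while (first < last && std::isspace(static_cast<unsigned char>(s[first]))) ++first;
    while (last > first && std::isspace(static_cast<unsigned char>(s[last - 1]))) --last;
    return s.substr(first, last - first);
}

// Splits expression at any character found in delimiters (assumed set),
// emitting each delimiter as its own token and dropping blank operands.
std::vector<std::string> extract_trimmed_tokens(const std::string &expression,
                                                const std::string &delimiters = "+-^*!") {
    std::vector<std::string> tokens;
    std::string current;
    // Pushes the pending operand, if it is not blank, and resets it.
    auto flush = [&]() {
        std::string operand = trim(current);
        if (!operand.empty()) tokens.push_back(operand);
        current.clear();
    };
    for (char c : expression) {
        if (delimiters.find(c) != std::string::npos) {
            flush();
            tokens.emplace_back(1, c);
        } else {
            current += c;
        }
    }
    flush();
    return tokens;
}
```

Called with "!aaa + ! a", this yields the five tokens !, aaa, +, !, a with no whitespace attached and no empty entries.
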
Related

How to separate a string using separator

I want to split this string using _ as the separator:
{[Data1]_[Data2]_[Data3]}_{[Val1]_[Val2]}_{[ID1]_[ID2]_[ID3]}
where an underscore inside the { } brackets should not be treated as a separator.
So after splitting the string we should have three new data items:
{[Data1]_[Data2]_[Data3]}
{[Val1]_[Val2]}
{[ID1]_[ID2]_[ID3]}
Currently I am splitting it using Boost:
std::vector<std::string> commandSplitSemiColon;
boost::split(commandSplitSemiColon, stdstrCommand, boost::is_any_of("_"),
             boost::token_compress_on);
But how can we ignore the underscores inside the { } brackets and split only on the underscores that are outside the brackets?

A manual solution: iterate through the string and use a variable to check whether it's in the bracket or not:
std::vector<std::string> strings;
std::string s = "{[Data1]_[Data2]_[Data3]}_{[Val1]_[Val2]}_{[ID1]_[ID2]_[ID3]}";
std::string token = "";
bool inBracket = false;
for (char c : s) {
    if (c == '{') {
        inBracket = true;
        token += c;
    }
    else if (c == '}') {
        inBracket = false;
        token += c;
    }
    else if (c == '_' && !inBracket) {
        strings.push_back(token);
        token = "";
    }
    else {
        token += c;
    }
}
if (!token.empty()) {
    strings.push_back(token);
}
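The boolean flag above assumes the braces are never nested. If nesting could occur, one possible variation is to count open braces instead. The following is a sketch under that assumption; the function name split_outside_braces and the separator parameter are made up for illustration.

```cpp
#include <string>
#include <vector>

// Splits s at every separator that is not enclosed in braces.
// depth counts the currently open '{', so nested groups stay intact.
std::vector<std::string> split_outside_braces(const std::string &s, char separator = '_') {
    std::vector<std::string> parts;
    std::string token;
    int depth = 0;
    for (char c : s) {
        if (c == '{') {
            ++depth;
        } else if (c == '}' && depth > 0) {
            --depth;
        }
        if (c == separator && depth == 0) {
            parts.push_back(token);  // separator outside braces ends the token
            token.clear();
        } else {
            token += c;              // everything else, braces included, is kept
        }
    }
    if (!token.empty()) parts.push_back(token);
    return parts;
}
```

For the string from the question this returns the three brace groups. Unlike boost::token_compress_on, two adjacent separators outside braces produce an empty token here.
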

Given the similarity between your format and conventional CSV (comma-separated values), you can search for existing solutions.
For example, on this site: How can I read and parse CSV files in C++? Specifically, look at Boost Tokenizer with escaped_list_separator.

C++ regex bug! Square bracket expression does not work with icase flag

// regex_replace example
#include <iostream>
#include <string>
#include <regex>
#include <iterator>
int main ()
{
    std::string INPUT = "Replace_All_Characters_With_Anything";
    std::string OUTEXP = "0";
    std::regex expression("[A-Za-z]", std::regex_constants::icase);
    std::cout << std::regex_replace(INPUT, expression, OUTEXP);
    return 0;
}
This works here: http://cpp.sh/6gb5a
This works here: https://regexr.com/5bt9d
The problem seems to depend on whether the icase flag is used. The A in All, the C in Characters, the W in With, etc. do not get replaced because of the preceding underscore. The bug seems to be that matching with [] only works if the character does not come directly after a non-match.
There does seem to be a quick fix for this: if the bracket expression is followed by {1}, it works.
Example: [A-Za-z]{1}
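Compiler:
Microsoft Visual Studio Community 2019 / Version 16.7.3 / C++17
Also tested with C++14, same bad behavior.
Expected result: 0000000_000_0000000000_0000_00000000
My result: as described above, the first letter after each underscore is left unreplaced.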

Not sure if this is an appropriate use of an answer, but this is a known bug, and it looks like it has been known for a few months. No ETA on a fix as far as I can see.
https://github.com/microsoft/STL/issues/993
RE2 looks like a recommended alternative regex library.
https://github.com/google/re2/
Instead of using another library, I will create a function that intercepts and rewrites the regex string as a temporary fix. It should work whether or not the icase flag is used.
Test code: https://rextester.com/LSNW3495
// Add '{1}' after square bracket expressions unless a quantifier ('?', '*', '+' or '{...}') already follows.
// Requires <algorithm>, <regex>, <string> and <vector>.
std::string temporaryBugFix(std::string exp)
{
    enum State
    {
        start,
        skipNext,
        lookForEndBracket,
        foundEndBracket,
    };
    State state = start;
    State prevState = start;
    int p = -1;
    std::vector<int> positionsToFix;
    for (auto c : exp)
    {
        ++p;
        switch (state)
        {
        case start:
            if (c == '\\')
            {
                prevState = state;
                state = skipNext;
            }
            else if (c == '[')
                state = lookForEndBracket;
            continue;
        case skipNext:
            state = prevState;
            continue;
        case lookForEndBracket:
            if (c == '\\')
            {
                prevState = state;
                state = skipNext;
            }
            else if (c == ']')
            {
                state = foundEndBracket;
                // a bracket expression at the very end of the pattern has no following character
                if (static_cast<std::size_t>(p) + 1 == exp.length())
                    positionsToFix.push_back(p + 1);
            }
            continue;
        case foundEndBracket:
            if (c != '+' && c != '*' && c != '?')
                positionsToFix.push_back(p);
            // the character after ']' may itself start an escape or another bracket expression
            prevState = start;
            state = (c == '\\') ? skipNext : (c == '[') ? lookForEndBracket : start;
            continue;
        }
    }
    // check for valid curly brackets so we don't add an additional one
    std::string s = exp;
    std::smatch m;
    std::regex e("\\{\\d+,?\\d*?\\}");
    int offset = 0;
    std::vector<int> validCurlyBracketPositions;
    while (std::regex_search(s, m, e))
    {
        validCurlyBracketPositions.push_back(static_cast<int>(m.position(0)) + offset);
        offset += static_cast<int>(m.position(0) + m[0].length());
        s = m.suffix();
    }
    // remove valid curly bracket positions from the fix vector
    for (auto pos : validCurlyBracketPositions)
        positionsToFix.erase(std::remove(positionsToFix.begin(), positionsToFix.end(), pos), positionsToFix.end());
    // insert the fixes from the back so that earlier positions stay valid
    for (int i = positionsToFix.size(); i--; )
        exp.insert(positionsToFix[i], "{1}");
    return exp;
}

Retrieve each token from a file according to specific criteria

I'm trying to create a lexer for a functional language. One of its methods should return the next token of a file on each call.
For example:
func main() {
    var MyVar : integer = 3+2;
}
So every time the next method is called, I would like it to return the next token in the sequence; in this case, it would look like this:
func
main
(
)
{
var
MyVar
:
integer
=
3
+
2
;
}
Except that the result I get is not what I expected:
func
main(
)
{
var
MyVar
:
integer
=
3+
2
}
Here is my method:
token_t Lexer::next() {
    token_t ret;
    std::string token_tmp;
    bool IsSimpleQuote = false; // check char --> '...'
    bool IsDoubleQuote = false; // check string --> "..."
    bool IsComment = false; // check comments --> `...`
    bool IterWhile = true;
    while (IterWhile) {
        bool IsInStc = (IsDoubleQuote || IsSimpleQuote || IsComment);
        std::ifstream file_tmp(this->CurrentFilename);
        if (this->eof) break;
        char chr = this->File.get();
        char next = file_tmp.seekg(this->CurrentCharIndex + 1).get();
        ++this->CurrentCharInCurrentLineIndex;
        ++this->CurrentCharIndex;
        {
            if (!IsInStc && !IsComment && chr == '`') IsComment = true; else if (!IsInStc && IsComment && chr == '`') { IsComment = false; continue; }
            if (IsComment) continue;
            if (!IsInStc && chr == '"') IsDoubleQuote = true;
            else if (!IsInStc && chr == '\'') IsSimpleQuote = true;
            else if (IsDoubleQuote && chr == '"') IsDoubleQuote = false;
            else if (IsSimpleQuote && chr == '\'') IsSimpleQuote = false;
        }
        if (chr == '\n') {
            ++this->CurrentLineIndex;
            this->CurrentCharInCurrentLineIndex = -1;
        }
        token_tmp += chr;
        if (!IsInStc && IsLangDelim(chr)) IterWhile = false;
    }
    if (token_tmp.size() > 1 && System::Text::EndsWith(token_tmp, ";") || System::Text::EndsWith(token_tmp, " ")) token_tmp.pop_back();
    ++this->NbrOfTokens;
    location_t pos;
    pos.char_pos = this->CurrentCharInCurrentLineIndex;
    pos.filename = this->CurrentFilename;
    pos.line = this->CurrentLineIndex;
    SetToken_t(&ret, token_tmp, TokenList::ToToken(token_tmp), pos);
    return ret;
}
Here is the function IsLangDelim:
bool IsLangDelim(char chr) {
    return (chr == ' ' || chr == '\t' || TokenList::IsSymbol(CharToString(chr)));
}
TokenList is a namespace that contains the list of tokens, as well as some functions (like IsSymbol in this case).
I have already tried other versions of this method, but the result is almost always the same.
Do you have any idea how to improve this method?

One solution for your problem is to use std::regex. The syntax is a little difficult to understand at first, but once you understand it, you will use it all the time.
And it is well suited to finding tokens.
The specific criteria can be expressed in the regex string.
For your case I will use: std::regex re(R"#((\w+|\d+|[;:\(\)\{\}\+\-\*\/\%\=]))#");
This means:
Look for one or more word characters (that is a word; note that \w also matches digits and the underscore)
Look for one or more digits (that is an integer number)
Or look for any of the meaningful operators (like '+', '-', '{' and so on)
You can extend the regex for all the other things you are searching for. You can also apply a regex to a regex result.
Please see the example below. It produces the output you showed from the input you provided.
And the task you described takes only one statement in main.
#include <iostream>
#include <string>
#include <algorithm>
#include <iterator>
#include <regex>
// Our test data (raw string).
std::string testData(
R"#(func main() {
var MyVar : integer = 3+2;
}
)#");
std::regex re(R"#((\w+|\d+|[;:\(\)\{\}\+\-\*\/\%\=]))#");
int main(void)
{
    // copy every match of capture group 1 to std::cout, one token per line
    std::copy(
        std::sregex_token_iterator(testData.begin(), testData.end(), re, 1),
        std::sregex_token_iterator(),
        std::ostream_iterator<std::string>(std::cout, "\n")
    );
    return 0;
}

You are trying to parse using a single loop, which makes the code very complicated. Instead I suggest something like this:
struct token { ... };
struct lexer {
    vector<token> tokens;
    string source;
    unsigned int pos;
    bool parse_ident() {
        if (!is_alpha(source[pos])) return false;
        auto start = pos;
        while(pos < source.size() && is_alnum(source[pos])) ++pos;
        tokens.push_back({ token_type::ident, source.substr(start, pos - start) });
        return true;
    }
    bool parse_num() { ... }
    bool parse_comment() { ... }
    ...
    bool parse_whitespace() { ... }
    void parse() {
        while(pos < source.size()) {
            // each parse_* either consumes one token and returns true, or leaves pos untouched
            if (!parse_comment() && !parse_ident() && !parse_num() && ... && !parse_whitespace()) {
                throw error{ "unexpected character at position " + std::to_string(pos) };
            }
        }
    }
};
This is the standard structure I use when lexing files in any scripting language I've written. Lexing is usually greedy, so you don't need to bother with regex (which is effective, but slower, unless you use some crazy template-based implementation). Just define your parse_* functions, make sure they return false if they didn't parse a token, and make sure they are called in the correct order.
The order itself usually doesn't matter, but:
operators need to be checked from longest to shortest
a number in the style .123 might be incorrectly recognized as the . operator (so you need to make sure that no digit follows the .)
numbers and identifiers look very much alike, except that identifiers start with a non-digit.
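To make the structure above concrete, here is a minimal greedy lexer sketch for the sample input from the question. It is an illustration rather than the answerer's actual code: the function name lex, the set of single-character symbols, and the decision to return tokens as plain strings (without token types or comment handling) are all assumptions.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>
#include <vector>

// Minimal greedy lexer sketch. Token kinds are not tracked; each token is
// returned as plain text. The single-character symbol set is an assumption.
std::vector<std::string> lex(const std::string &source) {
    const std::string symbols = "(){}:;=+-*/%";
    std::vector<std::string> tokens;
    std::size_t pos = 0;
    // Reads a character as unsigned char so the <cctype> calls stay well-defined.
    auto at = [&](std::size_t i) { return static_cast<unsigned char>(source[i]); };
    while (pos < source.size()) {
        const std::size_t start = pos;
        if (std::isspace(at(pos))) {
            ++pos;  // whitespace separates tokens but is not a token itself
            continue;
        }
        if (std::isalpha(at(pos)) || source[pos] == '_') {
            // identifier or keyword: a letter or '_' followed by letters, digits or '_'
            while (pos < source.size() && (std::isalnum(at(pos)) || source[pos] == '_')) ++pos;
        } else if (std::isdigit(at(pos))) {
            // integer literal: a run of digits
            while (pos < source.size() && std::isdigit(at(pos))) ++pos;
        } else if (symbols.find(source[pos]) != std::string::npos) {
            ++pos;  // single-character symbol
        } else {
            throw std::runtime_error("unexpected character at position " + std::to_string(pos));
        }
        tokens.push_back(source.substr(start, pos - start));
    }
    return tokens;
}
```

Because identifiers are tried before numbers and every branch consumes as many characters as it can, main( becomes main and ( and 3+ becomes 3 and +, which is the splitting the question asks for. Multi-character operators would need an extra branch that is tried before the single-character symbols.
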

Determine if a string contains only alphanumeric characters (or a space)

I am writing a function that determines whether a string contains only alphanumeric characters and spaces. I am effectively testing whether it matches the regular expression ^[[:alnum:] ]+$ but without using regular expressions. This is what I have so far:
#include <algorithm>
#include <cctype>
#include <string>
static inline bool is_not_alnum_space(char c)
{
    return !(isalpha(c) || isdigit(c) || (c == ' '));
}
bool string_is_valid(const std::string &str)
{
    return find_if(str.begin(), str.end(), is_not_alnum_space) == str.end();
}
Is there a better solution, or a “more C++” way to do this?

Looks good to me, but you can use isalnum(c) instead of isalpha and isdigit.
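A possible shape of that suggestion, as a sketch: it uses std::all_of with a lambda (C++11) instead of find_if with a negated helper, and it takes each character as unsigned char because passing a negative char value to std::isalnum is undefined behavior. The explicit empty-string check is an assumption about intent: the regular expression in the question ends in + and so rejects an empty string, whereas the find_if version accepts it.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns true when str is non-empty and consists only of alphanumeric
// characters and spaces, like the regular expression ^[[:alnum:] ]+$.
bool string_is_valid(const std::string &str) {
    return !str.empty() &&
           std::all_of(str.begin(), str.end(), [](unsigned char c) {
               // taking the character as unsigned char keeps std::isalnum well-defined
               return std::isalnum(c) || c == ' ';
           });
}
```
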

And looking forward to C++0x, you'll be able to use lambda functions (you can try this out with gcc 4.5 or VS2010):
bool string_is_valid(const std::string &str)
{
    return find_if(str.begin(), str.end(),
        [](char c) { return !(isalnum(c) || (c == ' ')); }) == str.end();
}

You can also do this with binders, so you can drop the helper function. I'd recommend Boost binders, as they are much easier to use than the standard library binders:
bool string_is_valid(const std::string &str)
{
    // the predicate finds a character that is neither alphanumeric nor a space
    return find_if(str.begin(), str.end(),
        !boost::bind(isalnum, _1) && boost::bind(std::not_equal_to<char>(), _1, ' ')) == str.end();
}

A minor point, but if you want is_not_alnum_space() to be a helper function that is only visible in that particular compilation unit, you should put it in an anonymous namespace instead of making it static:
namespace {
    bool is_not_alnum_space(char c)
    {
        return !(isalpha(c) || isdigit(c) || (c == ' '));
    }
}
...etc

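In case you don't want to use STL functions, this can be used:
// function for checking whether a char is alphanumeric
bool checkAlphaNumeric(char s){
    if(((s - 'a' >= 0) && (s - 'a' < 26)) ||((s - 'A' >= 0) && (s - 'A' < 26)) || ((s- '0' >= 0) &&(s - '0' < 10)))
        return true;
    return false;
}
// usage, wrapped in a function so that the early return is valid
// e.g. string_is_valid("ab cd : 456") returns false because of the ':'
bool string_is_valid(const std::string &s){
    for(std::size_t i = 0; i < s.length(); i++){
        // a space is accepted as well, as the question requires
        if(s[i] != ' ' && !checkAlphaNumeric(s[i])) return false;
    }
    return true;
}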

C++: Looking for a concise solution to replace a set of characters in a std::string with a specific character

Suppose I have the following:
std::string some_string = "2009-06-27 17:44:59.027";
The question is: Give code that will replace all instances of "-" and ":" in some_string with a space i.e. " "
I'm looking for a simple one liner (if at all possible)
Boost can be used.

replace_if( some_string.begin(), some_string.end(), boost::bind( ispunct<char>, _1, locale() ), ' ' );
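One line, without n^2 running time or invoking a regex engine ;v), although it is a little sad that you need Boost for this. Note that ispunct matches every punctuation character, so the '.' before the milliseconds is replaced as well.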
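If only the two characters from the question should be replaced, and Boost is not available, a lambda can serve as the predicate (C++11 or later). The sketch below wraps that in a small function; the name replace_chars and its parameters are made up for illustration.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: returns a copy of text in which every character
// that occurs in targets is replaced by replacement.
std::string replace_chars(std::string text, const std::string &targets, char replacement) {
    std::replace_if(text.begin(), text.end(),
                    [&targets](char c) { return targets.find(c) != std::string::npos; },
                    replacement);
    return text;
}
```

With targets set to "-:" the date and time separators in "2009-06-27 17:44:59.027" become spaces, while the '.' before the milliseconds is left alone. The work is a single linear pass over the string, with a short search through targets for each character.
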

Boost has a string algorithm library that seems to fly under the radar:
String Algorithm Quick Reference
There is a regex based version of replace, similar to post 1, but I found find_format_all better performance wise. It's a one-liner to boot:
find_format_all(some_string,token_finder(is_any_of("-:")),const_formatter(" "));

You could use Boost.Regex to do it. Something like this:
boost::regex e("[-:]");
some_string = boost::regex_replace(some_string, e, " ");

I would just write it like this:
for (string::iterator p = some_string.begin(); p != some_string.end(); ++p) {
    if ((*p == '-') || (*p == ':')) {
        *p = ' ';
    }
}
Not a concise one-liner, but I'm pretty sure it will work right the first time, nobody will have any trouble understanding it, and the compiler will probably produce near-optimal object code.

From http://www.cppreference.com/wiki/string/find_first_of
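string some_string = "2009-06-27 17:44:59.027";
string::size_type found = 0;
// each hit is overwritten with a space, so searching again from the same index moves on to the next one
while ((found = some_string.find_first_of("-:", found)) != string::npos) {
    some_string[found] = ' ';
}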

replace_if(some_string.begin(), some_string.end(), [](char c) -> bool {
    return c == '-' || c == ':';
}, ' ');
This is a one-liner using a C++0x lambda. If lambdas are not available, use a function as the predicate:
bool CanReplace(char t)
{
    return t == '-' || t == ':';
}
replace_if(some_string.begin(), some_string.end(), CanReplace, ' ');
