Explaining a string trimming function

I came across the code below but need some help with understanding the code. Assume that the string s has spaces either side.
string trim(string const& s){
    auto front = find_if_not(begin(s), end(s), isspace);
    auto back = find_if_not(rbegin(s), rend(s), isspace);
    return string { front, back.base() };
}
The author stated that back points to the end of the last space whereas the front points to the first non-white space character. So back.base() was called but I don't understand why.
Also what do the curly braces, following string in the return statement, represent?

The braces are the new C++11 initialisation.
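To see what the braces do in isolation, here is a minimal sketch (not from the original code; the function names are invented for illustration). With an iterator pair, braces and parentheses end up calling the same range constructor. The difference only shows when the arguments are convertible to char, because braces then prefer the initializer_list<char> constructor.

```cpp
#include <cassert>
#include <string>

// Builds a string from an iterator pair using brace initialisation.
// Iterators are not convertible to char, so the initializer_list<char>
// constructor is not viable and the iterator-range constructor is chosen,
// exactly as with parentheses.
std::string copy_range(std::string const& s) {
    return std::string{ s.begin(), s.end() };
}

// With arguments that ARE convertible to char, braces prefer the
// initializer_list<char> constructor: this is the two-character string
// { char(3), 'x' }, not "xxx".
std::string braces_with_count() {
    return std::string{ 3, 'x' };
}

// Parentheses select the (count, char) constructor instead.
std::string parens_with_count() {
    return std::string(3, 'x');
}

// Small self-check of the behaviour described in the comments above.
void check_brace_behaviour() {
    assert(copy_range("abc") == "abc");
    assert(braces_with_count().size() == 2);
    assert(parens_with_count() == "xxx");
}
```

In the trim function both arguments are iterators, so string { front, back.base() } behaves like string(front, back.base()).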
.base() and reverse iterators
The .base() call gets back the underlying forward iterator (back is a reverse_iterator), so that the new string can be constructed from a valid range of matching iterator types.
A picture. Normal iterator positions of a string (it is a little more complex than this regarding how rend() works, but conceptually anyway...)
      begin                                 end
        v                                    v
      -------------------------------------
      | sp | sp | A | B | C | D | sp | sp |
      -------------------------------------
  ^                                    ^
 rend                                rbegin
Once your two find loops finish, the result of those iterators in this sequence will be positioned at:
                front
                  v
      -------------------------------------
      | sp | sp | A | B | C | D | sp | sp |
      -------------------------------------
                              ^
                             back
Were we to take just those iterators and construct a sequence from them (which we can't, as they are not matching types, but suppose we could), the result would be "copy starting at A, stopping at D", and it would not include D in the resulting data.
Enter the base() member of a reverse iterator. It returns the underlying forward (non-reverse) iterator, which is positioned one element past the one the reverse iterator refers to; i.e.
                front
                  v
      -------------------------------------
      | sp | sp | A | B | C | D | sp | sp |
      -------------------------------------
                                  ^
                             back.base()
Now when we copy our range { front, back.base() } we copy starting at A and stopping at the first space (but not including it), thereby including the D we would have missed.
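The relationship between a reverse iterator and its base() can be checked directly. The following is only a sketch with made-up helper names (base_is_one_past, drop_trailing); it shows that *r and *(r.base() - 1) are the same element, and uses that to build a half-open range after a backwards search.

```cpp
#include <cassert>
#include <string>

// For any reverse iterator r that can be dereferenced, *r refers to the
// element just before r.base(); that is, &*r == &*(r.base() - 1).
bool base_is_one_past(std::string const& s,
                      std::string::const_reverse_iterator r) {
    assert(r != s.rend());          // r must refer to a real element
    return &*r == &*(r.base() - 1);
}

// Drops trailing occurrences of `ch` by searching from the back and
// converting the result to a forward iterator with base().
std::string drop_trailing(std::string const& s, char ch) {
    auto r = s.rbegin();
    while (r != s.rend() && *r == ch)
        ++r;
    // r now refers to the last kept character (or equals rend());
    // r.base() is one past it, which is the end of a half-open range.
    return std::string(s.begin(), r.base());
}
```

If every character is dropped, r equals rend() and r.base() equals begin(), so the result is simply the empty range.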
It's actually a slick little piece of code, btw.
Some additional checking
Here are some basic checks added to the original code, for empty and white-space-only strings, while trying to keep with the spirit of the original (C++1y/C++14 usage):
string trim_check(string const& s)
{
    auto is_space = [](char c) { return isspace(c, locale()); };
    auto front = find_if_not(begin(s), end(s), is_space);
    auto back = find_if_not(rbegin(s), make_reverse_iterator(front), is_space);
    return string { front, back.base() };
}
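Why the bound matters: for a non-empty string that contains only white space, the first search returns end(s) and an unbounded backwards search returns rend(s), whose base() is begin(s). The range { end, begin } is then invalid. Stopping the backwards search at front avoids that. Below is a self-contained variant for experimenting; the name trim_bounded is invented here, and it uses the <cctype> isspace with an unsigned char cast rather than the locale overload.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <iterator>
#include <string>

// Hypothetical name; same idea as trim_check above.
std::string trim_bounded(std::string const& s) {
    auto is_space = [](unsigned char c) { return std::isspace(c) != 0; };
    auto front = std::find_if_not(s.begin(), s.end(), is_space);
    // Stop the backwards search at `front`, so that back.base() can
    // never end up before `front`, even if s is all white space.
    auto back = std::find_if_not(s.rbegin(),
                                 std::make_reverse_iterator(front), is_space);
    assert(front <= back.base());
    return std::string(front, back.base());
}
```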

Related

Convert string to portable filename with <filesystem> or Boost.Filesystem

Is there a simple way, with <filesystem> or <boost/filesystem.hpp> to convert a sequence of bytes, perhaps represented by std::vector<char> into a portable filename string such that the result can be converted back to the input sequence?
As an example, suppose a platform only permits filenames made up of the characters [a-f] and [0-9]. A conversion function that satisfies this constraint might simply output each byte as its two-digit hex equivalent, so {'a', 'b'} would become "6162", as 'a' -> 97 -> 0x61 and 'b' -> 98 -> 0x62.
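For reference, the hex scheme described in the example could look like the sketch below. The function names (bytes_to_hex, hex_to_bytes) are made up for illustration, and the decoder assumes an ASCII-compatible character set in which a to f are contiguous.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Encodes every byte as two lower-case hex digits, so the result only
// uses the characters 0-9 and a-f.
std::string bytes_to_hex(std::vector<char> const& bytes) {
    static char const digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (char c : bytes) {
        auto b = static_cast<unsigned char>(c);
        out += digits[b >> 4];
        out += digits[b & 0x0f];
    }
    assert(out.size() == bytes.size() * 2);
    return out;
}

// Value of one hex digit, or -1 if `c` is not in [0-9a-f].
int hex_value(char c) {
    if ('0' <= c && c <= '9') return c - '0';
    if ('a' <= c && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Reverses bytes_to_hex. Returns false (leaving `out` in an unspecified
// state) if `hex` has odd length or contains a non-hex character.
bool hex_to_bytes(std::string const& hex, std::vector<char>& out) {
    out.clear();
    if (hex.size() % 2 != 0)
        return false;
    for (std::size_t i = 0; i < hex.size(); i += 2) {
        int hi = hex_value(hex[i]);
        int lo = hex_value(hex[i + 1]);
        if (hi < 0 || lo < 0)
            return false;
        out.push_back(static_cast<char>(hi * 16 + lo));
    }
    return true;
}
```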

This is very simple to do with filesystem::path. The first step is to construct a path object from the sequence of characters. There are two-iterator constructors available, as well as constructors that take any of C++'s character types for encoding purposes.
Then, just call generic_u8string on that path object; you will get a std::u8string (in C++20; in C++17, you get a std::string) containing the path formatted in a platform-neutral generic format. This string can later be used to reconstitute the path object as well.
Now, full round-tripping from the platform-specific format through path back to the platform-specific format is not really permitted. You can get the native string version of the path (path::u8string returns this), but there's no guarantee of a byte-for-byte identical string. There is a guarantee that the two strings will identify the same filesystem resource. So the differences, if they exist, are unimportant.

It took me several days to write this :/.
My personal objectives here were the following:
Each unique input string must result in a unique filename. In other words, the conversion must be one-to-one, which implies that it is also reversible, as the OP requested.
As many characters as possible must be kept the same; at the very least, the filename should be mostly human readable and look much like the original string.
When nothing else works, characters should be escaped as is usual for URL encoding: a percent sign followed by the byte value in hexadecimal.
The escape character (%) itself is escaped by doubling it (%%), unless another translation is requested.
I wanted simple things, like the ability to have spaces replaced by underscores; and since underscores might also occur frequently, I don't want those escaped as %5F, but replaced with something neat, like a (multi-byte) Unicode character.
I therefore decided to support only UTF-8 strings, and to be able to treat any UTF-8 glyph as a translatable 'character'; that is, you can translate single glyphs into different glyphs (including all 1-byte ASCII values).
It is not possible to translate multi-glyph sequences with my implementation, but since it is based on a Dictionary class, most of the code should be reusable if anyone wants to add support for that.
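To make the escaping rule from the list above concrete, here is a heavily reduced sketch. It is not the implementation presented below: it works on single bytes only, has no translation table, and the names escape_bytes and unescape_bytes are invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Escapes every byte of `in` that occurs in `illegal` as %XX (upper-case
// hex) and writes a literal '%' as "%%".
std::string escape_bytes(std::string const& in, std::string const& illegal) {
    static char const digits[] = "0123456789ABCDEF";
    std::string out;
    for (char c : in) {
        if (c == '%') {
            out += "%%";
        } else if (illegal.find(c) != std::string::npos) {
            auto b = static_cast<unsigned char>(c);
            out += '%';
            out += digits[b >> 4];
            out += digits[b & 0x0f];
        } else {
            out += c;
        }
    }
    return out;
}

// Reverses escape_bytes. Assumes `in` was produced by escape_bytes, so
// every '%' is followed by either '%' or two upper-case hex digits.
std::string unescape_bytes(std::string const& in) {
    auto value = [](char d) {
        return ('0' <= d && d <= '9') ? d - '0' : d - 'A' + 10;
    };
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '%') {
            out += in[i];
        } else if (i + 1 < in.size() && in[i + 1] == '%') {
            out += '%';
            i += 1;                     // skip the second '%'
        } else {
            assert(i + 2 < in.size());  // two hex digits must follow
            out += static_cast<char>(value(in[i + 1]) * 16 + value(in[i + 2]));
            i += 2;                     // skip the two hex digits
        }
    }
    return out;
}
```

Because a hex digit is never '%', the decoder can always tell "%%" apart from "%XX", which is what keeps this small version reversible.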
Making sure that every string, under any possible translation is reversible turned out to be non-trivial to say the least.
[ASCII diagram: three overlapping boxes labelled From, To and Illegal, divided into numbered regions 1 to 4, with arrows from lower-case input glyphs (a, b, c, ...) to the upper-case glyphs they map to (A, B, C, ...).]
(and that doesn't even include escape characters)
But I think I succeeded. If anyone manages to find arguments to u8string_to_filename that do not convert back with filename_to_u8string, let me know!
First of all I needed a function that returns the number of bytes of a glyph:
// Returns the length of the UTF8 encoded glyph, which is highly
// recommended to be either guaranteed correct UTF8, or points
// inside a zero terminated string.
//
// If the pointer does not point to a legal UTF8 glyph then 1 is returned.
// The zero termination is necessary to detect the end of the string
// in the case that the apparent encoded glyph length goes beyond the string.
//
int utf8_glyph_length(char8_t const* glyph)
{
// The length of a glyph is determined by the first byte.
// This magic formula returns 1 for 110xxxxx, 2 for 1110xxxx,
// 3 for 11110xxx and 0 otherwise.
int extra = (0x3a55000000000000 >> ((*glyph >> 2) & 0x3e)) & 0x3;
// Detect if there are indeed `extra` bytes that follow the first
// one, each of which must begin with 10xxxxxx to be legal UTF8.
int i = 0;
while (++i <= extra)
if (glyph[i] >> 6 != 2)
return 1; // Not legal UTF8 encoding.
return 1 + extra;
}
You can find this file here.
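The shift-and-mask expression is compact but hard to read. As a cross-check (a sketch only; the helper names are invented here), the same classification can be written with explicit bit masks and compared against the formula for every possible lead byte.

```cpp
#include <cassert>
#include <cstdint>

// The "magic formula" from utf8_glyph_length above, applied to a lead byte.
int extra_bytes_magic(unsigned char lead) {
    std::uint64_t const table = 0x3a55000000000000ULL;
    return static_cast<int>((table >> ((lead >> 2) & 0x3e)) & 0x3);
}

// The same classification written out with explicit bit masks.
int extra_bytes_readable(unsigned char lead) {
    if ((lead & 0xE0) == 0xC0) return 1;   // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 2;   // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 3;   // 11110xxx
    return 0;                              // ASCII, continuation or invalid lead byte
}

// Returns true if both functions agree for all 256 possible byte values.
bool formulas_agree() {
    for (int b = 0; b < 256; ++b) {
        auto lead = static_cast<unsigned char>(b);
        if (extra_bytes_magic(lead) != extra_bytes_readable(lead))
            return false;
    }
    return true;
}
```

The formula works because the top five bits of the lead byte, doubled, select a two-bit field inside the 64-bit constant; only the fields for 110xx, 1110x and 11110 are non-zero.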
Next we need a simple Dictionary class:
class Dictionary
{
private:
std::vector<std::u8string_view> m_words;
public:
Dictionary(std::u8string const&);
size_t size() const { return m_words.size(); }
void add(std::u8string_view glyph);
int find(std::u8string_view glyph) const;
std::u8string_view operator[](int index) const { return m_words[index]; }
};
with its definition
Dictionary::Dictionary(std::u8string const& in)
{
// Run over each glyph in the input.
int glen; // The number of bytes of the current glyph.
for (char8_t const* glyph = in.data(); *glyph; glyph += glen)
{
glen = utf8_glyph_length(glyph);
m_words.emplace_back(glyph, glen);
}
}
void Dictionary::add(std::u8string_view glyph)
{
if (find(glyph) == -1)
m_words.push_back(glyph);
}
int Dictionary::find(std::u8string_view glyph) const
{
for (int index = 0; index < m_words.size(); ++index)
if (m_words[index] == glyph)
return index;
return -1;
}
I also used the following two helper functions
char8_t to_hex_digit(int d)
{
if (d < 10)
return '0' + d;
return 'A' + d - 10;
}
std::u8string to_hex_string(char8_t c)
{
std::u8string hex_string;
hex_string += to_hex_digit(c / 16);
hex_string += to_hex_digit(c % 16);
return hex_string;
}
Finally, here is the encoder function
// Copy str to the returned filename, replacing every occurrence of
// the utf8 glyphs in `from` with the corresponding one in `to`.
//
// All glyphs in `illegal` will be escaped with a percent sign (%)
// followed by two hexadecimal characters for each byte of
// the glyph.
//
// If `from` does not contain the escape character, then each '%' will
// be replaced with "%%".
//
// All glyphs in `to` that are not in `from` are considered illegal
// and will also be escaped.
//
std::filesystem::path u8string_to_filename(std::u8string const& str,
std::u8string const& illegal, std::u8string const& from, std::u8string const& to)
{
using namespace detail::us2f;
// All glyphs are found by their first byte.
// Build a dictionary for each of the three strings.
Dictionary from_dictionary(from);
Dictionary to_dictionary(to);
Dictionary illegal_dictionary(illegal);
// The escape character is always illegal (is not allowed to appear on its own
// in the output).
illegal_dictionary.add({ &escape, 1 });
// For each `from` entry there must exist one `to` entry.
ASSERT(from_dictionary.size() == to_dictionary.size());
std::filesystem::path filename;
// Run over all glyphs in the input string.
int glen; // The number of bytes of the current glyph.
for (char8_t const* gp = str.data(); *gp; gp += glen)
{
glen = utf8_glyph_length(gp);
std::u8string_view glyph(gp, glen);
// Perform translation.
int from_index = from_dictionary.find(glyph);
if (from_index != -1)
glyph = to_dictionary[from_index];
else if (*gp == escape)
{
filename += escape;
filename += escape;
continue;
}
// What is in illegal is *always* illegal - even when it is the result
// of a translation.
if (illegal_dictionary.find(glyph) != -1 ||
// If an input glyph is not in the from_dictionary (aka, it
// wasn't just translated) but it is in the to_dictionary -
// then also escape it. This is necessary to make sure that
// each unique input str results in a unique filename (and
// consequently is reversible).
(from_index == -1 && to_dictionary.find(glyph) != -1))
{
// Escape illegal glyphs.
// Always escape the original input (not a possible translation),
// otherwise we can't know if what the input was when decoding:
// the input could have been translated first or not.
for (int j = 0; j < glen; ++j)
{
filename += escape;
filename += to_hex_string(gp[j]);
}
continue;
}
// Append the glyph to the filename.
filename += glyph;
}
return filename;
}
And the decoder function
std::u8string filename_to_u8string(std::filesystem::path const& filename,
std::u8string const& from, std::u8string const& to)
{
using namespace detail::us2f;
std::u8string input = filename.u8string();
std::u8string result;
Dictionary from_dictionary(from);
Dictionary to_dictionary(to);
// First unescape all bytes in the filename.
int glen; // The number of bytes of the current glyph.
for (char8_t const* gp = input.c_str(); *gp; gp += glen)
{
glen = utf8_glyph_length(gp);
std::u8string_view glyph(gp, glen);
// First translate escape sequences back - those are then always
// original input.
if (*gp == escape)
{
if (gp[1] == escape)
{
glen = 2; // Skip the second escape character too.
result += escape;
}
else
{
char8_t val = 0;
for (int d = 1; d <= 2; ++d)
{
val <<= 4;
val |= ('0' <= gp[d] && gp[d] <= '9') ? gp[d] - '0'
: gp[d] - 'A' + 10;
}
result += val;
glen = 3; // Skip the two hex digits too.
}
continue;
}
else
{
// Otherwise - if the character is in the from dictionary, it must have
// been translated - otherwise it would have been escaped.
int from_index = from_dictionary.find(glyph);
if (from_index != -1)
glyph = to_dictionary[from_index];
}
result += glyph;
}
return result;
}
You can find all of this (and the latest version) on GitHub.

Order of precedence in C++: & or ()?

Provided that texts is an array of 3 strings, what's the difference between &texts[3] and (&texts)[3]?

The [] subscript operator has a higher precedence than the & address-of operator.
&texts[3] is the same as &(texts[3]), meaning the 4th element of the array is accessed and then the address of that element is taken. Assuming the array is like string texts[3], that will produce a string* pointer that is pointing at the 1-past-the-end element of the array, ie similar to an end iterator in a std::array or std::vector.
----------------------------
| string | string | string |
----------------------------
                            ^
                            &texts[3]
(&texts)[3], on the other hand, takes the address of the array itself, producing a string(*)[3] pointer, and then indexes that pointer, stepping over 3 whole string[3] arrays. So, again assuming string texts[3], the result refers to a string[3] array that is WAY beyond the end boundary of texts.
---------------------------- ---------------------------- ----------------------------
| string | string | string | | string | string | string | | string | string | string |
---------------------------- ---------------------------- ----------------------------
                            ^                                                         ^
                            &texts[3]                                                 (&texts)[3]
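The types involved can be checked at compile time without ever evaluating the out-of-bounds expressions, since decltype and sizeof do not evaluate their operands. This is just an illustrative sketch; the constant names element_step and array_step are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

std::string texts[3];

// &texts[3] parses as &(texts[3]): a pointer to a single string.
static_assert(std::is_same<decltype(&texts[3]), std::string*>::value,
              "&texts[3] is a string*");

// &texts is a pointer to the whole array...
static_assert(std::is_same<decltype(&texts), std::string (*)[3]>::value,
              "&texts is a string(*)[3]");

// ...so (&texts)[3] names a whole string[3] array, three arrays further on.
static_assert(std::is_same<decltype((&texts)[3]), std::string (&)[3]>::value,
              "(&texts)[3] is an lvalue of type string[3]");

// Pointer arithmetic steps by the size of the pointed-to type:
// one string for a string*, one whole array for a string(*)[3].
constexpr std::size_t element_step = sizeof(texts[0]);
constexpr std::size_t array_step   = sizeof(*&texts);
static_assert(array_step == 3 * element_step,
              "stepping a string(*)[3] skips three strings at a time");
```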

How to code nextToken() function for a descent recursive parser LL(1)

I'm writing a recursive descent LL(1) parser in C++, but I have a problem because I don't know exactly how to get the next token. I know I have to use regular expressions to recognize a terminal, but I don't know how to get the longest next token.
For example, take this lexer specification and this grammar (without left recursion, left factoring and without cycles):
//LEXICAL IN FLEX
TIME [0-9]+
DIRECTION UR|DR|DL|UL|U|D|L|R
ACTION A|J|M
%%
{TIME} {printf("TIME"); return (TIME);}
{DIRECTION} {printf("DIRECTION"); return (DIRECTION);}
{ACTION} {printf("ACTION"); return (ACTION);}
"~" {printf("RELEASED"); return (RELEASED);}
"+" {printf("PLUS_OP"); return (PLUS_OP);}
"*" {printf("COMB_OP"); return (COMB_OP);}
//GRAMMAR IN BISON
command : list_move PLUS_OP list_action
| list_move COMB_OP list_action
| list_move list_action
| list_move
| list_action
;
list_move: move list_move_prm
;
list_move_prm: move
| move list_move_prm
| ";"
;
list_action: ACTION list_action_prm
;
list_action_prm: PLUS_OP ACTION list_action_prm
| COMB_OP ACTION list_action_prm
| ACTION list_action_prm
| ";" //epsilon
;
move: TIME RELEASED DIRECTION
| RELEASED DIRECTION
| DIRECTION
;
I have a string that contains "D DR R + A", which should be accepted, but I have a problem when reading "DR": because "D" is a token too, I don't know how to get "DR" instead of "D".

There are a number of ways of hand-writing a tokenizer:
You can use a recursive descent LL(1) parser "all the way down": rewrite your grammar in terms of single characters rather than tokens, and left factor it. Then your nextToken() routine becomes just getchar(). You'll end up with additional rules like:
TIME: DIGIT more_digits ;
more_digits: /* epsilon */ | DIGIT more_digits ;
DIGIT: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
DIRECTION: 'U' dir_suffix | 'D' dir_suffix | 'L' | 'R' ;
dir_suffix: /* epsilon */ | 'L' | 'R' ;
You can use regexes. Generally this means keeping around a buffer and reading the input into it. nextToken() then runs a series of regexes on the buffer, figuring out which one returns the longest token and returns that, advancing the buffer as needed.
You can do what flex does -- this is the buffer approach above, combined with building a single DFA that evaluates all of the regexes simultaneously. Running this DFA on the buffer then returns the longest token (based on the last accepting state reached before getting an error).
Note that in all cases, you'll need to consider how to handle whitespace as well. You can just ignore whitespace everywhere (FORTRAN style) or you can allow whitespace between some tokens, but not others (eg, not between the digits of TIME or within a DIRECTION, but everywhere else in the grammar). This can make the grammar much more complex (and the process of hand-writing the recursive descent parser much more tedious).
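As a rough sketch of the "longest token wins" idea from the buffer approach, here is a hand-written tokenizer for the token set in the question. It tries fixed spellings instead of running real regexes, skips white space between tokens, and every name in it (Kind, Token, next_token) is invented for this example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

enum class Kind { End, Time, Direction, Action, Released, Plus, Comb, Error };

struct Token {
    Kind kind;
    std::string text;
};

// Returns the next token starting at `pos` (after skipping white space)
// and advances `pos` past it. All fixed spellings are tried against the
// input and the longest match wins, so "DR" is preferred over "D".
Token next_token(std::string const& in, std::size_t& pos) {
    while (pos < in.size() && std::isspace(static_cast<unsigned char>(in[pos])))
        ++pos;
    if (pos == in.size())
        return { Kind::End, "" };

    // TIME: one or more digits, consumed greedily.
    if (std::isdigit(static_cast<unsigned char>(in[pos]))) {
        std::size_t start = pos;
        while (pos < in.size() && std::isdigit(static_cast<unsigned char>(in[pos])))
            ++pos;
        return { Kind::Time, in.substr(start, pos - start) };
    }

    struct Spelling { char const* text; Kind kind; };
    static Spelling const spellings[] = {
        {"UR", Kind::Direction}, {"DR", Kind::Direction}, {"DL", Kind::Direction},
        {"UL", Kind::Direction}, {"U", Kind::Direction},  {"D", Kind::Direction},
        {"L", Kind::Direction},  {"R", Kind::Direction},
        {"A", Kind::Action},     {"J", Kind::Action},     {"M", Kind::Action},
        {"~", Kind::Released},   {"+", Kind::Plus},       {"*", Kind::Comb},
    };

    Token best{ Kind::Error, "" };
    for (Spelling const& s : spellings) {
        std::string text = s.text;
        // Keep a candidate only if it matches here and is longer than
        // the best match found so far.
        if (text.size() > best.text.size() &&
            in.compare(pos, text.size(), text) == 0)
            best = { s.kind, text };
    }
    if (best.kind == Kind::Error)
        best.text = in.substr(pos, 1);   // report the offending character
    pos += best.text.size();
    return best;
}
```

Calling next_token repeatedly on "D DR R + A" yields D, DR, R, +, A and then End; the parser only ever needs to look at the kind of the current token.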

“I don't know exactly how to get the next token”
Your input comes from a stream (std::istream). You must write a get_token(istream) function (or a tokenizer class). The function must first discard white space, then read a character (or more if necessary), analyze it, and return the associated token. The following functions will help you achieve your goal:
ws – discards white-space.
istream::get – reads a character.
istream::putback – puts back in the stream a character (think “undo get”).
"I don't know how to get "DR" instead "D""
Both "D" and "DR" are words. Just read them as you would read a word: is >> word. You will also need a keyword to token map (see std::map). If you read the "D" string, you can ask the map what the associated token is. If not found, throw an exception.
A starting point (run it):
#include <iostream>
#include <iomanip>
#include <map>
#include <string>
enum token_t
{
END,
PLUS,
NUMBER,
D,
DR,
R,
A,
// ...
};
// ...
using keyword_to_token_t = std::map < std::string, token_t >;
keyword_to_token_t kwtt =
{
{"A", A},
{"D", D},
{"R", R},
{"DR", DR}
// ...
};
// ...
std::string s;
int n;
// ...
token_t get_token( std::istream& is )
{
char c;
std::ws( is ); // discard white-space
if ( !is.get( c ) ) // read a character
return END; // failed to read or eof
// analyze the character
switch ( c )
{
case '+': // simple token
return PLUS;
case '0': case '1': // rest of digits
is.putback( c ); // it starts with a digit: it must be a number, so put it back
is >> n; // and let the library do the hard work
return NUMBER;
//...
default: // keyword
is.putback( c );
is >> s;
if ( kwtt.find( s ) == kwtt.end() )
throw "keyword not found";
return kwtt[ s ];
}
}
int main()
{
try
{
while ( get_token( std::cin ) )
;
std::cout << "valid tokens";
}
catch ( const char* e )
{
std::cout << e;
}
}

c++ find any string from a list in another string

What options do I have to find any string from a list in another string ?
With s being an std::string, I tried
s.find("CAT" || "DOG" || "COW" || "MOUSE", 0);
I want to find the first one of these strings and get its place in the string ; so if s was "My cat is sleeping\n" I'd get 3 as return value.
boost::to_upper(s);
was applied (for those wondering).

You can do this with a regex.
You do not need a second search to get the position: the match results record where the match starts, and m.position() returns that offset. Like this:
#include <iostream>
#include <string>
#include <regex>
using namespace std;
int main() {
    string s = "My cat is sleeping\n";
    smatch m;
    regex animal("cat|dog|cow|mouse");
    if (regex_search(s, m, animal)) {
        cout << "Match found: " << m.str() << endl;
        // Offset of the start of the match within s
        cout << "First animal found at: " << m.position() << endl;
    }
    return 0;
}
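If pulling in <regex> feels heavy for a handful of fixed words, a plain loop over std::string::find gives the same position. This is only a sketch, and find_first_of_any is a name made up for it.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the smallest position at which any of `words` occurs in `s`,
// or std::string::npos if none of them occurs.
std::string::size_type find_first_of_any(std::string const& s,
                                         std::vector<std::string> const& words) {
    std::string::size_type best = std::string::npos;
    for (std::string const& w : words) {
        std::string::size_type p = s.find(w);
        if (p < best)       // npos is the largest value, so any hit beats it
            best = p;
    }
    return best;
}
```

This scans the string once per word, which is fine for short lists; for many words, a single-pass approach scales better.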

You can convert your search terms into a DFA, which lets you look for all of them at once while reading the input one character at a time.
states:
nil, c, ca, cat., d, do, dog., co, cow., m, mo, mou, mous, mouse.
transition table:
state | on | goto
nil | c | c
nil | d | d
nil | m | m
c | a | ca
c | o | co
d | o | do
m | o | mo
ca | t | cat.
co | w | cow.
do | g | dog.
mo | u | mou
mou | s | mous
mous | e | mouse.
* | * | nil
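You can express this with a set of intermediate functions, with a lot of switch statements, or with an enum for the states and a mapping for the transitions.
Note that the catch-all row is a simplification: after a failed transition the current character should be examined again from nil, otherwise an input such as "ccat" would miss the match that starts at the second c.
If your list of search terms is dynamic or grows too big, then manually hardcoding the states will not suffice. However, as you can see, the rule for deriving the states and the transitions is very simple.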
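Since the states are just the prefixes of the search terms, they can be generated from the word list. The sketch below does that with invented names (State, build_states, find_any). Be aware that it restarts the walk from every input position instead of adding failure transitions, so it is a simplified stand-in for a true single-pass DFA.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One state per distinct prefix of the search words; state 0 is "nil".
struct State {
    std::map<char, int> next;   // transition table row: on char -> goto state
    bool accepting = false;     // true for states that complete a word
};

// Builds the states and transitions from the list of words.
std::vector<State> build_states(std::vector<std::string> const& words) {
    std::vector<State> states(1);
    for (std::string const& w : words) {
        int cur = 0;
        for (char c : w) {
            auto it = states[cur].next.find(c);
            if (it == states[cur].next.end()) {
                states.emplace_back();   // new state for this prefix
                int created = static_cast<int>(states.size()) - 1;
                states[cur].next[c] = created;
                cur = created;
            } else {
                cur = it->second;
            }
        }
        states[cur].accepting = true;
    }
    return states;
}

// Position of the first occurrence of any word in `s`, or npos.
std::size_t find_any(std::string const& s, std::vector<State> const& states) {
    for (std::size_t start = 0; start < s.size(); ++start) {
        int cur = 0;
        for (std::size_t i = start; i < s.size(); ++i) {
            auto it = states[cur].next.find(s[i]);
            if (it == states[cur].next.end())
                break;                   // no transition: back to nil
            cur = it->second;
            if (states[cur].accepting)
                return start;
        }
    }
    return std::string::npos;
}
```

For the four words in the question this produces the same states as the table above (plus nil), without anyone having to write them out by hand.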

c++ find text in QStringList that starts with "..." using .indexOf

I have a question concerning QStringList:
I have a .txt-File containing several 1000 lines of Data followed by this:
+-------------------------+-------------------+-----------------------|
| Conditions at | X1 | X2 |
+-------------------------+-------------------+-----------------------|
| Time [ms] | 0.10780E-02 | 0.27636E-02 |
| Travel [m] | 0.11366E+00 | 0.18796E+01 |
| Velocity [m/s] | 0.43980E+03 | 0.13920E+04 |
| Acceleration [g] | 0.11543E+06 | 0.20936E+05 |
…
Where the Header (Conditions at…) and the first column (Travel, Time,…) always stay the same but the values vary for each run. From this File I want to read the values (only!) into fields of a GUI.
First I write all data into a QStringList. (Each line of .txt copied to one Element of QStringList)
To get the values from the QStringList, I tried to find the corresponding lines with “.indexOf()”, which didn't work because I have to ask for the exact text of the whole line. Since the values vary, the lines are different for each run and my program is not able to find the corresponding lines.
Is there a command like “.indexOf-Starting with certain text” which would find me the lines starting with a certain text for example “| Time [ms]”
Thank you very much
itelly

Yes, there is a way to do “indexOf starting with certain text”. You can use a regular expression to match the beginning of a string:
int QStringList::indexOf (const QRegExp& rx, int from = 0) const
Use it in this way:
int timeLineIndex = stringList.indexOf(QRegExp("^\\| Time \\[ms\\].+"));
^ means that this text should be at the beginning of a string
\\ escapes special characters (the backslash is doubled because it appears inside a C++ string literal)
.+ means that any text can follow this
EDIT:
Here is a working example that show how it works:
QStringList stringList;
stringList << "abc 5234 hjd";
stringList << "bnd|gf dfs aaa";
stringList << "das gf dfs aaa";
int index = stringList.indexOf(QRegExp("^bnd\\|gf.+"));
qDebug() << index;
Output: 1
EDIT:
Here is a function that makes this easy to use:
int indexOfLineStartingWith(const QStringList& list, const QString& textToFind)
{
return list.indexOf(QRegExp("^" + QRegExp::escape(textToFind) + ".+"));
}
int index = indexOfLineStartingWith(stringList, "bnd|gf"); //it's not needed to escape characters here

First of all, your actual data starts at line 4 (after the header). Second, each data line has a specific layout that you can parse. Assuming that you read the whole file into the QStringList, where each item in the list represents one line, you can do the following:
QStringList data;
[..]
for (int i = 3; i < data.size(); i++) {
const QString &line = data.at(i);
// Parse the X1 and X2 columns' values
QString strX1 = line.section('|', 1, 1, QString::SectionSkipEmpty).trimmed();
QString strX2 = line.section('|', 2, 2, QString::SectionSkipEmpty).trimmed();
}
