Convert string to portable filename with <filesystem> or Boost.Filesystem - c++

Is there a simple way, with <filesystem> or <boost/filesystem.hpp> to convert a sequence of bytes, perhaps represented by std::vector<char> into a portable filename string such that the result can be converted back to the input sequence?
As an example, suppose a platform permits a filename to be composed only of characters from the ranges [a,f] and [0,9]. A conversion function that satisfies the above constraint might be one that simply outputs each character as its two-digit hex equivalent, so {'a', 'b'} would become "6162", as 'a' -> 97 -> 0x61 and 'b' -> 98 -> 0x62.
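To make the hex idea concrete, here is a minimal sketch of such a reversible conversion. The function names (bytes_to_hex, hex_value, hex_to_bytes) are made up for illustration and are not part of <filesystem> or Boost.Filesystem.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Encode every byte as two lowercase hex digits, so the result only
// uses the characters [0-9a-f].
std::string bytes_to_hex(std::vector<char> const& bytes)
{
  static char const digits[] = "0123456789abcdef";
  std::string out;
  out.reserve(bytes.size() * 2);
  for (char c : bytes)
  {
    unsigned char u = static_cast<unsigned char>(c);
    out += digits[u >> 4];
    out += digits[u & 0x0f];
  }
  return out;
}

// Value of one hex digit; throws on anything outside [0-9a-f].
int hex_value(char c)
{
  if ('0' <= c && c <= '9') return c - '0';
  if ('a' <= c && c <= 'f') return c - 'a' + 10;
  throw std::invalid_argument("not a lowercase hex digit");
}

// Inverse of bytes_to_hex.
std::vector<char> hex_to_bytes(std::string const& hex)
{
  if (hex.size() % 2 != 0)
    throw std::invalid_argument("odd number of hex digits");
  std::vector<char> out;
  out.reserve(hex.size() / 2);
  for (std::size_t i = 0; i < hex.size(); i += 2)
    out.push_back(static_cast<char>(hex_value(hex[i]) * 16 + hex_value(hex[i + 1])));
  return out;
}
```

With this, bytes_to_hex({'a', 'b'}) gives "6162" and hex_to_bytes("6162") gives the two bytes back. The downside is that the resulting filename is not human readable.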

This is very simple to do with filesystem::path. The first step is to construct a path object from the sequence of characters. There are two-iterator constructors available, as well as constructors that take any of C++'s character types for encoding purposes.
Then, just call generic_u8string on that path object; you will get a std::u8string (in C++20; in C++17, you get a std::string) containing the path formatted in a platform-neutral generic format. This string can later be used to reconstitute the path object as well.
Now, full round-tripping from the platform-specific format through path back to the platform-specific format is not really permitted. You can get the native string version of the path (path::u8string returns this), but there's no guarantee of a byte-for-byte identical string. There is a guarantee that the two strings will identify the same filesystem resource. So the differences, if they exist, are unimportant.
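A small sketch of those two steps, assuming a POSIX-like platform where the native narrow encoding is just bytes. The helper names path_from_bytes and through_generic are invented for this example. Note that this only normalises the path format; it does not escape bytes that the filesystem rejects.

```cpp
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Build a path from a byte sequence using the two-iterator constructor.
fs::path path_from_bytes(std::vector<char> const& bytes)
{
  return fs::path(bytes.begin(), bytes.end());
}

// Convert to the generic format and reconstitute a path from it.
// `auto` is used because generic_u8string() returns std::string in
// C++17 and std::u8string in C++20.
fs::path through_generic(fs::path const& p)
{
  auto generic = p.generic_u8string();
  // Assumption: in C++17 this treats the string as native narrow
  // encoding, which matches UTF-8 on typical POSIX systems.
  return fs::path(generic);
}
```

For a path such as "dir/f.txt", through_generic(p) compares equal to p.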

It took me several days to write this :/.
My personal objectives here were the following:
Each unique input string must result in a unique filename. In other words, the conversion must be "one-to-one" which implies that it is also reversible, as was requested by the OP.
As many characters as possible must be kept the same; at least, the filename should be mostly human readable and look much like the original string.
When nothing else works, characters should be escaped as is usual for URL-encoding: a percent sign followed by the byte value in hexadecimal.
The escape character (%) itself is escaped with two of them (%%), unless another translation is requested.
I wanted simple things, like the ability to have spaces replaced by underscores; and since underscores might also occur frequently, I don't want those escaped with %5F, but with something neat, like a (multi-byte) Unicode character.
I therefore decided to only support UTF-8 strings, and to be able to treat any UTF-8 glyph as a translatable 'character'; that is, you can translate single glyphs into different glyphs (including all 1-byte ASCII values).
It is not possible to translate multi-glyph sequences with my implementation, but since it is based on a Dictionary class, most of the code should be reusable if anyone wants to add support for that.
Making sure that every string, under any possible translation is reversible turned out to be non-trivial to say the least.
// From
// |
// v
// .----------------------------------------.
// | |
// | j--------.
// | | |
// | i k | |
// | .--------|-----|---+----|----------------.
// | |1 a v | b--->B v 2|<-- To
// | .--->M | I | | J |
// | | | E v | | |
// | | | ^ A d c | | |
// | .-------|--+--|-----|--|--|---+---------. |
// | | m |3 | g | v v | | |
// | | | e | | C K | | |
// | | p | v | | | |
// | l | | n o-->O G h | f------------->F |
// '----|-----+-|---|----+------|-|---------' | |
// | | | | |4 v v | |
// | | | | | H D | |
// | | | `----------------------------------->N |
// | | `--------->P | |
// `----------------->L | |
// | | | |
// | '----------------------------+-----------'
// | |
// | q | r
// | |
// | |<-- Illegal
// '---------------------------------------'
(and that doesn't even include escape characters)
But I think I succeeded. If anyone manages to find arguments to u8string_to_filename that do not convert back with filename_to_u8string, let me know!
First of all I needed a function that returns the number of bytes of a glyph:
// Returns the length of the UTF8 encoded glyph, which is highly
// recommended to be either guaranteed correct UTF8, or points
// inside a zero terminated string.
//
// If the pointer does not point to a legal UTF8 glyph then 1 is returned.
// The zero termination is necessary to detect the end of the string
// in the case that the apparent encoded glyph length goes beyond the string.
//
int utf8_glyph_length(char8_t const* glyph)
{
// The length of a glyph is determined by the first byte.
// This magic formula returns 1 for 110xxxxx, 2 for 1110xxxx,
// 3 for 11110xxx and 0 otherwise.
int extra = (0x3a55000000000000 >> ((*glyph >> 2) & 0x3e)) & 0x3;
// Detect if there are indeed `extra` bytes that follow the first
// one, each of which must begin with 10xxxxxx to be legal UTF8.
int i = 0;
while (++i <= extra)
if (glyph[i] >> 6 != 2)
return 1; // Not legal UTF8 encoding.
return 1 + extra;
}
You can find this file here.
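To see why the magic constant works, the following sketch compares it with the same rule written as plain bit tests. The names extra_bytes_plain, extra_bytes_magic and formulas_agree are invented here, and unsigned char is used instead of char8_t so that it also compiles before C++20.

```cpp
#include <cstdint>

// Number of continuation bytes announced by a UTF-8 lead byte, written
// out as plain comparisons on the bit patterns.
int extra_bytes_plain(unsigned char lead)
{
  if ((lead & 0xe0) == 0xc0) return 1;  // 110xxxxx
  if ((lead & 0xf0) == 0xe0) return 2;  // 1110xxxx
  if ((lead & 0xf8) == 0xf0) return 3;  // 11110xxx
  return 0;                             // ASCII, continuation byte, or invalid
}

// The same value via the lookup-table constant: the top five bits of the
// lead byte select one of 32 two-bit entries packed into a 64-bit word.
int extra_bytes_magic(unsigned char lead)
{
  std::uint64_t const table = 0x3a55000000000000ULL;
  return static_cast<int>((table >> ((lead >> 2) & 0x3e)) & 0x3);
}

// True if both functions agree for every possible lead byte.
bool formulas_agree()
{
  for (int b = 0; b < 256; ++b)
    if (extra_bytes_plain(static_cast<unsigned char>(b)) !=
        extra_bytes_magic(static_cast<unsigned char>(b)))
      return false;
  return true;
}
```

Only the top 16 bits of the constant are non-zero, which is why every lead byte below 0xC0 maps to 0 extra bytes.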
Next we need a simple Dictionary class:
class Dictionary
{
private:
std::vector<std::u8string_view> m_words;
public:
Dictionary(std::u8string const&);
size_t size() const { return m_words.size(); }
void add(std::u8string_view glyph);
int find(std::u8string_view glyph) const;
std::u8string_view operator[](int index) const { return m_words[index]; }
};
with its definition
Dictionary::Dictionary(std::u8string const& in)
{
// Run over each glyph in the input.
int glen; // The number of bytes of the current glyph.
for (char8_t const* glyph = in.data(); *glyph; glyph += glen)
{
glen = utf8_glyph_length(glyph);
m_words.emplace_back(glyph, glen);
}
}
void Dictionary::add(std::u8string_view glyph)
{
if (find(glyph) == -1)
m_words.push_back(glyph);
}
int Dictionary::find(std::u8string_view glyph) const
{
for (int index = 0; index < m_words.size(); ++index)
if (m_words[index] == glyph)
return index;
return -1;
}
I also used the following two helper functions
char8_t to_hex_digit(int d)
{
if (d < 10)
return '0' + d;
return 'A' + d - 10;
}
std::u8string to_hex_string(char8_t c)
{
std::u8string hex_string;
hex_string += to_hex_digit(c / 16);
hex_string += to_hex_digit(c % 16);
return hex_string;
}
Finally, here is the encoder function
// Copy str to the returned filename, replacing every occurrence of
// the utf8 glyphs in `from` with the corresponding one in `to`.
//
// All glyphs in `illegal` will be escaped with a percent sign (%)
// followed by two hexadecimal characters for each byte of
// the glyph.
//
// If `from` does not contain the escape character, then each '%' will
// be replaced with "%%".
//
// All glyphs in `to` that are not in `from` are considered illegal
// and will also be escaped.
//
std::filesystem::path u8string_to_filename(std::u8string const& str,
std::u8string const& illegal, std::u8string const& from, std::u8string const& to)
{
using namespace detail::us2f;
// All glyphs are found by their first byte.
// Build a dictionary for each of the three strings.
Dictionary from_dictionary(from);
Dictionary to_dictionary(to);
Dictionary illegal_dictionary(illegal);
// The escape character is always illegal (is not allowed to appear on its own
// in the output).
illegal_dictionary.add({ &escape, 1 });
// For each `from` entry there must exist one `to` entry.
ASSERT(from_dictionary.size() == to_dictionary.size());
std::filesystem::path filename;
// Run over all glyphs in the input string.
int glen; // The number of bytes of the current glyph.
for (char8_t const* gp = str.data(); *gp; gp += glen)
{
glen = utf8_glyph_length(gp);
std::u8string_view glyph(gp, glen);
// Perform translation.
int from_index = from_dictionary.find(glyph);
if (from_index != -1)
glyph = to_dictionary[from_index];
else if (*gp == escape)
{
filename += escape;
filename += escape;
continue;
}
// What is in illegal is *always* illegal - even when it is the result
// of a translation.
if (illegal_dictionary.find(glyph) != -1 ||
// If an input glyph is not in the from_dictionary (aka, it
// wasn't just translated) but it is in the to_dictionary -
// then also escape it. This is necessary to make sure that
// each unique input str results in a unique filename (and
// consequently is reversible).
(from_index == -1 && to_dictionary.find(glyph) != -1))
{
// Escape illegal glyphs.
// Always escape the original input (not a possible translation),
// otherwise we can't know what the input was when decoding:
// the input could have been translated first or not.
for (int j = 0; j < glen; ++j)
{
filename += escape;
filename += to_hex_string(gp[j]);
}
continue;
}
// Append the glyph to the filename.
filename += glyph;
}
return filename;
}
And the decoder function
std::u8string filename_to_u8string(std::filesystem::path const& filename,
std::u8string const& from, std::u8string const& to)
{
using namespace detail::us2f;
std::u8string input = filename.u8string();
std::u8string result;
Dictionary from_dictionary(from);
Dictionary to_dictionary(to);
// First unescape all bytes in the filename.
int glen; // The number of bytes of the current glyph.
for (char8_t const* gp = input.c_str(); *gp; gp += glen)
{
glen = utf8_glyph_length(gp);
std::u8string_view glyph(gp, glen);
// First translate escape sequences back - those are then always
// original input.
if (*gp == escape)
{
if (gp[1] == escape)
{
glen = 2; // Skip the second escape character too.
result += escape;
}
else
{
char8_t val = 0;
for (int d = 1; d <= 2; ++d)
{
val <<= 4;
val |= ('0' <= gp[d] && gp[d] <= '9') ? gp[d] - '0'
: gp[d] - 'A' + 10;
}
result += val;
glen = 3; // Skip the two hex digits too.
}
continue;
}
else
{
// Otherwise - if the character is in the from dictionary, it must have
// been translated - otherwise it would have been escaped.
int from_index = from_dictionary.find(glyph);
if (from_index != -1)
glyph = to_dictionary[from_index];
}
result += glyph;
}
return result;
}
You can find all of this (and the latest version) on GitHub.
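For readers who only need the escaping rule, here is a stripped-down sketch without the translation dictionaries. It works on single bytes in a std::string rather than on UTF-8 glyphs, and the names escape_char, escape_bytes and unescape_bytes are invented for this example; it is not the code from the repository.

```cpp
#include <cstddef>
#include <string>

char const escape_char = '%';

// Escape every byte found in `illegal` as %XX (uppercase hex) and
// the escape character itself as "%%".
std::string escape_bytes(std::string const& in, std::string const& illegal)
{
  static char const digits[] = "0123456789ABCDEF";
  std::string out;
  for (char c : in)
  {
    if (c == escape_char)
      out += "%%";
    else if (illegal.find(c) != std::string::npos)
    {
      unsigned char u = static_cast<unsigned char>(c);
      out += escape_char;
      out += digits[u >> 4];
      out += digits[u & 0x0f];
    }
    else
      out += c;
  }
  return out;
}

// Inverse of escape_bytes. Assumes well-formed input, that is, a
// string that escape_bytes produced.
std::string unescape_bytes(std::string const& in)
{
  auto hex_value = [](char c) { return ('0' <= c && c <= '9') ? c - '0' : c - 'A' + 10; };
  std::string out;
  for (std::size_t i = 0; i < in.size(); ++i)
  {
    if (in[i] != escape_char)
      out += in[i];
    else if (in[i + 1] == escape_char)
    {
      out += escape_char;  // "%%" stands for a literal '%'.
      i += 1;
    }
    else
    {
      out += static_cast<char>(hex_value(in[i + 1]) * 16 + hex_value(in[i + 2]));
      i += 2;              // Skip the two hex digits.
    }
  }
  return out;
}
```

Because '%' is always written as "%%" and never as "%25", the decoder can tell the two cases apart by looking at the single byte after an escape character.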

Related

How to code nextToken() function for a descent recursive parser LL(1)

I'm writing a recursive descent LL(1) parser in C++, but I have a problem because I don't know exactly how to get the next token. I know I have to use regular expressions to recognize a terminal, but I don't know how to get the longest next token.
For example, this lexical and this grammar (without left recursion, left factoring and without cycles):
//LEXICAL IN FLEX
TIME [0-9]+
DIRECTION UR|DR|DL|UL|U|D|L|R
ACTION A|J|M
%%
{TIME} {printf("TIME"); return (TIME);}
{DIRECTION} {printf("DIRECTION"); return (DIRECTION);}
{ACTION} {printf("ACTION"); return (ACTION);}
"~" {printf("RELEASED"); return (RELEASED);}
"+" {printf("PLUS_OP"); return (PLUS_OP);}
"*" {printf("COMB_OP"); return (COMB_OP);}
//GRAMMAR IN BISON
command : list_move PLUS_OP list_action
| list_move COMB_OP list_action
| list_move list_action
| list_move
| list_action
;
list_move: move list_move_prm
;
list_move_prm: move
| move list_move_prm
| ";"
;
list_action: ACTION list_action_prm
;
list_action_prm: PLUS_OP ACTION list_action_prm
| COMB_OP ACTION list_action_prm
| ACTION list_action_prm
| ";" //epsilon
;
move: TIME RELEASED DIRECTION
| RELEASED DIRECTION
| DIRECTION
;
I have a string that contains "D DR R + A". It should validate, but I have problems getting "DR", because "D" is a token too; I don't know how to get "DR" instead of "D".
There are a number of ways of hand-writing a tokenizer:
You can use a recursive descent LL(1) parser "all the way down" -- rewrite your grammar in terms of single characters rather than tokens, and left factor it. Then your nextToken() routine becomes just getchar(). You'll end up with additional rules like:
TIME: DIGIT more_digits ;
more_digits: /* epsilon */ | DIGIT more_digits ;
DIGIT: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
DIRECTION: 'U' dir_suffix | 'D' dir_suffix | 'L' | 'R' ;
dir_suffix: /* epsilon */ | 'L' | 'R' ;
You can use regexes. Generally this means keeping around a buffer and reading the input into it. nextToken() then runs a series of regexes on the buffer, figuring out which one returns the longest token and returns that, advancing the buffer as needed.
You can do what flex does -- this is the buffer approach above, combined with building a single DFA that evaluates all of the regexes simultaneously. Running this DFA on the buffer then returns the longest token (based on the last accepting state reached before getting an error).
Note that in all cases, you'll need to consider how to handle whitespace as well. You can just ignore whitespace everywhere (FORTRAN style) or you can allow whitespace between some tokens, but not others (eg, not between the digits of TIME or within a DIRECTION, but everywhere else in the grammar). This can make the grammar much more complex (and the process of hand-writing the recursive descent parser much more tedious).
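The regex approach above could be sketched roughly as follows, using std::regex and the token names from the question's flex file. The Lexeme struct and the next_token signature are assumptions made for this example, and whitespace is simply skipped between tokens.

```cpp
#include <cctype>
#include <cstddef>
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

enum class Token { TIME, DIRECTION, ACTION, RELEASED, PLUS_OP, COMB_OP, END };

struct Lexeme
{
  Token token;
  std::string text;
};

// Return the longest token that starts at `pos` (after skipping
// whitespace) and advance `pos` past it.
Lexeme next_token(std::string const& input, std::size_t& pos)
{
  static std::vector<std::pair<Token, std::regex>> const rules = {
    {Token::TIME,      std::regex("[0-9]+")},
    {Token::DIRECTION, std::regex("UR|DR|DL|UL|U|D|L|R")},
    {Token::ACTION,    std::regex("A|J|M")},
    {Token::RELEASED,  std::regex("~")},
    {Token::PLUS_OP,   std::regex("\\+")},
    {Token::COMB_OP,   std::regex("\\*")},
  };

  while (pos < input.size() && std::isspace(static_cast<unsigned char>(input[pos])))
    ++pos;
  if (pos == input.size())
    return {Token::END, ""};

  Lexeme best{Token::END, ""};
  for (auto const& rule : rules)
  {
    std::smatch m;
    // match_continuous anchors the match at the current position.
    if (std::regex_search(input.begin() + pos, input.end(), m, rule.second,
                          std::regex_constants::match_continuous) &&
        m.length(0) > static_cast<std::ptrdiff_t>(best.text.size()))
      best = {rule.first, m.str(0)};
  }
  if (best.text.empty())
    throw std::runtime_error("unexpected character in input");
  pos += best.text.size();
  return best;
}
```

Calling next_token repeatedly on "D DR R + A" yields DIRECTION "D", DIRECTION "DR", DIRECTION "R", PLUS_OP and ACTION, followed by END. Note that the default ECMAScript grammar of std::regex tries alternatives from left to right, so the two-letter directions have to be listed before the one-letter ones, as they are in the question's DIRECTION pattern.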
“I don't know exactly how to get the next token”
Your input comes from a stream (std::istream). You must write a get_token(istream) function (or a tokenizer class). The function must first discard white space, then read a character (or more if necessary), analyze it, and return the associated token. The following functions will help you achieve your goal:
ws – discards white-space.
istream::get – reads a character.
istream::putback – puts back in the stream a character (think “undo get”).
"I don't know how to get "DR" instead "D""
Both "D" and "DR" are words. Just read them as you would read a word: is >> word. You will also need a keyword to token map (see std::map). If you read the "D" string, you can ask the map what the associated token is. If not found, throw an exception.
A starting point (run it):
#include <iostream>
#include <iomanip>
#include <map>
#include <string>
enum token_t
{
END,
PLUS,
NUMBER,
D,
DR,
R,
A,
// ...
};
// ...
using keyword_to_token_t = std::map < std::string, token_t >;
keyword_to_token_t kwtt =
{
{"A", A},
{"D", D},
{"R", R},
{"DR", DR}
// ...
};
// ...
std::string s;
int n;
// ...
token_t get_token( std::istream& is )
{
char c;
std::ws( is ); // discard white-space
if ( !is.get( c ) ) // read a character
return END; // failed to read or eof
// analyze the character
switch ( c )
{
case '+': // simple token
return PLUS;
case '0': case '1': // rest of digits
is.putback( c ); // it starts with a digit: it must be a number, so put it back
is >> n; // and let the library do the hard work
return NUMBER;
//...
default: // keyword
is.putback( c );
is >> s;
if ( kwtt.find( s ) == kwtt.end() )
throw "keyword not found";
return kwtt[ s ];
}
}
int main()
{
try
{
while ( get_token( std::cin ) )
;
std::cout << "valid tokens";
}
catch ( const char* e )
{
std::cout << e;
}
}

How can I use languages (like arabic or chinese) in a QString?

I am creating a QString:
QString m = "سلام علیکم";
and then I am saving it into a file using:
void stWrite(QString Filename,QString stringtext){
QFile mFile(Filename);
if(!mFile.open(QIODevice::WriteOnly | QIODevice::Append |QIODevice::Text))
{
QMessageBox message_file_Write;
message_file_Write.warning(0,"Open Error"
,"could not to open file for Writing");
return;
}
QTextStream out(&mFile);
out << stringtext<<endl;
out.setCodec("UTF-8");
mFile.flush();
mFile.close();
}
But, when I open the result file I see:
???? ????
What is going wrong? How can I get my characters to be saved correctly in the file?
QString has unicode support. So, there is nothing wrong with having*:
QString m = "سلام علیکم";
Most modern compilers use UTF-8 to encode this ordinary string literal (You can enforce this in C++11 by using u8"سلام عليكم", see here). The string literal has the type of an array of chars. When QString is initialized from a const char*, it expects data to be encoded in UTF-8. And everything works as expected.
All input controls and text drawing methods in Qt can take such a string and display it without any problems. See here for a list of supported languages.
As for the problem you are having writing this string to a file, you just need to set the encoding of the data you are writing to a codec that can encode these international characters (such as UTF-8).
From the docs: when using QTextStream::operator<<(const QString& string), the string is encoded using the assigned codec before it is written to the stream.
The problem is that you are using operator<< before assigning the codec. You should call setCodec before writing. Your code should look something like this:
void stWrite(QString Filename,QString stringtext){
QFile mFile(Filename);
if(!mFile.open(QIODevice::WriteOnly | QIODevice::Append |QIODevice::Text))
{
QMessageBox message_file_Write;
message_file_Write.warning(0,"Open Error"
,"could not to open file for Writing");
return;
}
QTextStream out(&mFile);
out.setCodec("UTF-8");
out << stringtext << endl;
mFile.flush();
mFile.close();
}
* In translation phase 1, any source file character not in the basic source character set is replaced by the universal-character-name that designates the character. The basic source character set is defined as follows:
N4140 §2.3 [lex.charset]/1
The basic source character set consists of 96 characters: the space
character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
This means that a string like:
QString m = "سلام عليكم";
Will be translated to something like:
QString m = "\u0633\u0644\u0627\u0645 \u0639\u0644\u064a\u0643\u0645";
Assuming that the source file is encoded in an encoding that supports storing such characters such as UTF-8.

c++ find text in QStringList that starts with "..." using .indexOf

I have a question concerning QStringList:
I have a .txt-File containing several 1000 lines of Data followed by this:
+-------------------------+-------------------+-----------------------|
| Conditions at | X1 | X2 |
+-------------------------+-------------------+-----------------------|
| Time [ms] | 0.10780E-02 | 0.27636E-02 |
| Travel [m] | 0.11366E+00 | 0.18796E+01 |
| Velocity [m/s] | 0.43980E+03 | 0.13920E+04 |
| Acceleration [g] | 0.11543E+06 | 0.20936E+05 |
…
Where the Header (Conditions at…) and the first column (Travel, Time,…) always stay the same but the values vary for each run. From this File I want to read the values (only!) into fields of a GUI.
First I write all data into a QStringList. (Each line of .txt copied to one Element of QStringList)
To get the values from the QStringList, I tried to find the corresponding lines with ".indexOf()", which didn't work because I have to ask for the exact text of the whole line. Since the values vary, the lines are different for each run and my program is not able to find the corresponding lines.
Is there a command like “.indexOf-Starting with certain text” which would find me the lines starting with a certain text for example “| Time [ms]”
Thank you very much
itelly
Yes, there is a way to do ".indexOf-Starting with certain text". You can use regular expressions to match the beginning of a string:
int QStringList::indexOf (const QRegExp& rx, int from = 0) const
Use it in this way:
int timeLineIndex = stringList.indexOf(QRegExp("^\\| Time \\[ms\\].+"));
^ means that this text should be at the beginning of a string
\ escapes special characters (it has to be written as \\ inside a C++ string literal)
.+ means that any text can follow this
EDIT:
Here is a working example that show how it works:
QStringList stringList;
stringList << "abc 5234 hjd";
stringList << "bnd|gf dfs aaa";
stringList << "das gf dfs aaa";
int index = stringList.indexOf(QRegExp("^bnd\\|gf.+"));
qDebug() << index;
Output: 1
EDIT:
Here is a function that makes this easy to use:
int indexOfLineStartingWith(const QStringList& list, const QString& textToFind)
{
return list.indexOf(QRegExp("^" + QRegExp::escape(textToFind) + ".+"));
}
int index = indexOfLineStartingWith(stringList, "bnd|gf"); //it's not needed to escape characters here
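If a regular expression feels like overkill, the same lookup can be done with a plain prefix test. The sketch below uses std::vector<std::string> instead of QStringList so that it compiles without Qt, and the function name index_of_line_starting_with is made up; with Qt types, QString::startsWith would play the role of the compare call.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Index of the first line that begins with `prefix`, or -1 if none does.
int index_of_line_starting_with(std::vector<std::string> const& lines,
                                std::string const& prefix)
{
  auto it = std::find_if(lines.begin(), lines.end(),
      [&prefix](std::string const& line)
      { return line.compare(0, prefix.size(), prefix) == 0; });
  return it == lines.end() ? -1 : static_cast<int>(it - lines.begin());
}
```

Unlike the ".+" pattern above, this also matches a line that consists of the prefix alone.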
First of all, your actual data starts at line 4 (excluding the header). Second, each data line has a specific layout that you can parse. Assuming that you read the whole file into the QStringList, where each item in the list represents one line, you can do the following:
QStringList data;
[..]
for (int i = 3; i < data.size(); i++) {
const QString &line = data.at(i);
// Parse the X1 and X2 columns' values
QString strX1 = line.section('|', 1, 1, QString::SectionSkipEmpty).trimmed();
QString strX2 = line.section('|', 2, 2, QString::SectionSkipEmpty).trimmed();
}

Explaining a string trimming function

I came across the code below but need some help with understanding the code. Assume that the string s has spaces either side.
string trim(string const& s){
auto front = find_if_not(begin(s), end(s), isspace);
auto back = find_if_not(rbegin(s), rend(s), isspace);
return string { front, back.base() };
}
The author stated that back points to the end of the last space whereas the front points to the first non-white space character. So back.base() was called but I don't understand why.
Also what do the curly braces, following string in the return statement, represent?
The braces are the new C++11 initialisation.
.base() and reverse iterators
The .base() call gets back the underlying iterator (back is a reverse_iterator), to properly construct the new string from a valid range.
A picture. Normal iterator positions of a string (it is a little more complex than this regarding how rend() works, but conceptually anyway...):
    begin                               end
      v                                  v
    -------------------------------------
    | sp | sp | A | B | C | D | sp | sp |
    -------------------------------------
 ^                                   ^
rend                               rbegin
Once your two find loops finish, the result of those iterators in this sequence will be positioned at:
              front
                v
    -------------------------------------
    | sp | sp | A | B | C | D | sp | sp |
    -------------------------------------
                            ^
                          back
Were we to take just those iterators and construct a sequence from them (which we can't, as they're not matching types, but regardless, suppose we could), the result would be "copy starting at A, stopping at D", but it would not include D in the resulting data.
Enter the base() member of a reverse iterator. It returns a non-reverse iterator of the forward iterator class that is positioned at the element "next to" the back iterator, i.e.
              front
                v
    -------------------------------------
    | sp | sp | A | B | C | D | sp | sp |
    -------------------------------------
                                ^
                           back.base()
Now when we copy our range { front, back.base() } we copy starting at A and stopping at the first space (but not including it), thereby including the D we would have missed.
It's actually a slick little piece of code, btw.
Some additional checking
Added some basic checks to the original code.
In keeping with the spirit of the original code (C++1y/C++14 usage), this adds some basic checks for empty and whitespace-only strings:
string trim_check(string const& s)
{
auto is_space = [](char c) { return isspace(c, locale()); };
auto front = find_if_not(begin(s), end(s), is_space);
auto back = find_if_not(rbegin(s), make_reverse_iterator(front), is_space);
return string { front, back.base() };
}
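To try it out, here is the same function as a self-contained sketch with the headers it needs and explicit std:: qualification (the version above presumably relies on using-declarations that are not shown).

```cpp
#include <algorithm>
#include <iterator>
#include <locale>
#include <string>

std::string trim_check(std::string const& s)
{
  auto is_space = [](char c) { return std::isspace(c, std::locale()); };
  // First non-space character from the left (or end() if there is none).
  auto front = std::find_if_not(s.begin(), s.end(), is_space);
  // First non-space character from the right, never searching past front.
  auto back = std::find_if_not(s.rbegin(), std::make_reverse_iterator(front), is_space);
  // back.base() points one past the last non-space character.
  return std::string(front, back.base());
}
```

Stopping the reverse search at make_reverse_iterator(front) is what keeps the range valid for an all-whitespace string: both iterators then end up at end() and the result is empty.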

C++ Reading a text file backwards from the end of each line up until a space

Is it possible to read a text file backwards from the end of each line up until a space? I need to be able to output the numbers at the end of each line. My text file is formatted as follows:
1 | First Person | 123.45
2 | Second Person | 123.45
3 | Third Person | 123.45
So my output would be, 370.35.
Yes. But in your case, it's most likely more efficient to simply read the whole file and parse out the numbers.
You could do something like this (and I'm writing this in pseudocode so you have to actually write real code, since that's how you learn):
pos = position of the last char in the file
in_number = true    // the file ends at the end of a line
str = ""
while (pos >= 0)
{
    seek to pos
    read a char from file
    if (char == newline)
    {
        in_number = true    // end of the previous line: a new number starts
    }
    else if (in_number)
    {
        if (char == space)
        {
            in_number = false
            reverse str, convert it to a number and add it to sum
            clear str
        }
        else
        {
            add char to str
        }
    }
    pos--
}