Replace all non-ASCII characters in a string by their ASCII equivalent - c++

Using Qt/C++, I need to generate a string containing only a subset of ASCII characters: letters, digits, hyphen, underscore, period, or colon.
As input, I can have anything.
So I am trying to apply these rules:
every character for which QChar::isSpace is true is replaced with an underscore
every non-ASCII letter is replaced with an ASCII equivalent (for example, "é" is replaced with "e")
every other non-ASCII character is removed
Is there a simple way to apply the 2nd and 3rd rules with Qt/C++?
Thanks

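Yes, there is a way.
First, apply Unicode normalization to your string with
QString::normalized. Decomposition (form KD) separates diacritical marks from their base letters and replaces some compatibility symbols with plain equivalents. The normalization forms are described in the Qt documentation for QString::NormalizationForm.
Then keep only the characters that can be encoded in Latin-1, which can be tested with the
toLatin1 method of QChar. Note that Latin-1 is a superset of ASCII: letters that have no decomposition, such as "ß" or "Æ", also pass this test, so compare c.unicode() against 128 instead if the result must be strictly ASCII.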
char QChar::toLatin1() const
Returns the Latin-1 character equivalent to the QChar, or 0. This is mainly useful for non-internationalized software.
...
// Requires <algorithm>, <iterator>, <QString> and <QDebug>
QString testString = QString::fromUtf8("Ceñía-üÏÖ马克ñ");
QString normalized = testString.normalized(QString::NormalizationForm_KD);
QString result;
// Keep only the characters that have a Latin-1 representation
std::copy_if(normalized.begin(), normalized.end(), std::back_inserter(result), [](const QChar& c) {
    return c.toLatin1() != 0;
});
qDebug() << result; // Cenia-uIOn
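Rules 1 and 3 from the question can also be sketched in plain standard C++, without Qt. This is only an illustration under stated assumptions: the function name `keepAllowedAscii` is made up here, the input is taken to be a UTF-8 `std::string` that has already been decomposed (for instance by a normalization step like the one above), and only ASCII whitespace is mapped to an underscore, whereas QChar::isSpace also recognizes non-ASCII spaces.

```cpp
#include <string>

// Hypothetical helper (not part of Qt): keeps ASCII letters, digits,
// '-', '_', '.' and ':'; turns ASCII whitespace into '_'; drops the rest.
// Assumes the input was already decomposed, so "é" arrives as 'e' followed
// by the UTF-8 bytes of a combining accent, which are all >= 0x80.
std::string keepAllowedAscii(const std::string& input) {
    std::string result;
    for (unsigned char c : input) {
        const bool isLetter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
        const bool isDigit = (c >= '0' && c <= '9');
        // Space plus the control characters \t, \n, \v, \f, \r (codes 9-13)
        const bool isSpace = (c == ' ' || (c >= '\t' && c <= '\r'));
        if (isSpace) {
            result += '_';
        } else if (isLetter || isDigit || c == '-' || c == '_' || c == '.' || c == ':') {
            result += static_cast<char>(c);
        }
        // Anything else, including every non-ASCII byte, is skipped.
    }
    return result;
}
```

For example, keepAllowedAscii("my file.txt") gives "my_file.txt".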

How do I mimic a Unicode JS regular expression in Lucee

I am trying to write a regular expression in Lucee to mimic the JS on the front end. Since Lucee's regex doesn't seem to support Unicode, how do I do it?
This is the JS
function charTest(k){
var regexp = /^[\u00C0-\u00ff\s -\~]+$/;
return regexp.test(k)
}
if(!charTest(thisKey)){
alert("Please Use Latin Characters Only");
return false;
}
This is what I have tried in Lucee
regexp = '[\u00C0-\u00ff\s -\~]+/';
writeDump(reFind(regexp,"测"));
writeDump(reFind(regexp,"test"));
I have also tried
regexp = "[\\p{L}]";
but the dump is always 0
EDIT: Give me one second. I think I interpreted your initial JS regex incorrectly. Fixing it.
EDIT 2: It was more than a second. Your original JS regex was:
"/^[\u00C0-\u00ff\s -\~]+$/". This is:
Basic parts of regex:
"/..../" == signifies the start and stop of the regex literal.
"^" == anchors the match to the start of the string
"[...]" == a character class: matches any one character listed inside.
It only means "NOT in this group" when the ^ comes first
inside the brackets, as in "[^...]".
"+" == signifies at least one of the previous
"$" == signifies the end of the string
Identifiers in the regex:
"\u00c0-\u00ff" == Unicode character range from character 192 (À)
to character 255 (ÿ). This is the upper part of the
Latin-1 Supplement block (mostly accented letters).
"\s" == signifies any whitespace character (space, tab, line break, ...)
" -\~" == signifies the range from the space character to the
(escaped) tilde character (~). This is ASCII 32-126, which
covers the printable characters of ASCII (the DEL
character, 127, is excluded). This includes alphanumerics and most punctuation.
I missed the second half of your printable Latin basic character set. I've updated my regex and tests to include it. There are ways to shorthand some of these identifiers, but I wanted it to be explicit.
You can try this:
<cfscript>
//http://www.asciitable.com/
//https://en.wikipedia.org/wiki/List_of_Unicode_characters
//https://en.wikipedia.org/wiki/Latin_script_in_Unicode
function charTest(k) {
return
REfind("[^"
& chr(32) & "-" & chr(126)
& chr(192) & "-" & chr(255)
& "]",arguments.k)
? "Please Use Latin Characters Only"
: ""
;
}
// TESTS
writeDump(charTest("测")); // Not Latin
writeDump(charTest("test")); // All characters between 32 & 126
writeDump(charTest("À")); // Character 192 (in range)
writeDump(charTest("À ")); // Character 192 and Space
writeDump(charTest(" ")); // Space Characters
writeDump(charTest("12345")); // Digits ( character 48-57 )
writeDump(charTest("ð")); // Character 240 (in range)
writeDump(charTest("ℿ")); // Character 8511 (outside range)
writeDump(charTest(chr(199))); // CF Character (in range)
writeDump(charTest(chr(10))); // CF Line Feed Character (outside range)
writeDump(charTest(chr(1000))); // CF Character (outside range)
writeDump(charTest("
")); // CRLF (outside range)
writeDump(charTest(URLDecode("%00", "utf-8"))); // CF Null character (outside range)
//writeDump(asc("测"));
//writeDump(asc("test"));
//writeDump(asc("À"));
//writeDump(asc("ð"));
//writeDump(asc("ℿ"));
</cfscript>
https://trycf.com/gist/05d27baaed2b8fc269f90c7c80a1aa82/lucee5?theme=monokai
All the regex does is look through your input string for any character that falls outside both ranges, chr(32)-chr(126) and chr(192)-chr(255). If it finds one, the function returns your chosen string; otherwise it returns an empty string. Unlike the JS version, this character class leaves out \s, so tabs and line breaks count as invalid here.
I think you can access the Unicode characters below 255 directly. I'll have to test it.
Do you need this function to show an alert, like the JavaScript does? If not, you can just return 1 or 0 to indicate whether the function found an invalid character.
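For comparison, the same membership test can be written as an explicit loop over code points. This is a hedged C++ sketch rather than Lucee code: the function name simply mirrors charTest from the answer, the input is assumed to be already decoded into a `std::u32string`, and the two ranges are the ones built with chr() above.

```cpp
#include <string>

// Returns the warning text if any code point lies outside
// 32-126 (printable ASCII) and 192-255 (À to ÿ), else an empty string.
std::string charTest(const std::u32string& k) {
    for (char32_t c : k) {
        const bool printableAscii = (c >= 32 && c <= 126);
        const bool upperLatin1 = (c >= 192 && c <= 255);
        if (!printableAscii && !upperLatin1) {
            return "Please Use Latin Characters Only";
        }
    }
    return "";
}
```

As in the CFML version, whitespace other than the plain space (for example a line feed) is rejected, because \s is not part of the two ranges.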

Translate \n new line from Char to String in SML/NJ

I am trying to convert #"\n", a Char, to "\n", a String. I used
Char.toString(#"\n");
and it gives
val it = "\\n" : string
Why doesn't it return "\n"?
From the documentation for Char.toString:
returns a printable string representation of the character, using, if
necessary, SML escape sequences.
It also specifies that some control characters are converted to two-character escape sequences, and \n is one of them.
To get a string of size one, use String.str.
- String.str(#"\n");
val it = "\n" : string

How to find the character "\" in a string?

I am trying to manipulate a string by finding the \ character in the string Find\inHere. However, I can't put that as an input in test.find('\', 0). It won't work and gives me the error "missing terminating character." Is there a way to fix test.find('\', 0)?
string test = "Find\inHere";
int x = test.find('\', 0); // error on this line
cout << x; // x should equal 4
\ is the character used to introduce escape sequences, for example \n for a newline or \xDB for the character with hexadecimal code DB.
So, in order to search for this character, you have to escape it by adding another \:
test.find("\\", 0);
EDIT: Also, your first string does not actually contain "Find\inHere", because \i is not a valid escape sequence; compilers typically warn about it and the backslash is lost. For the same reason, write "Find\\inHere".
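A small self-contained version of the corrected search, as a sketch: the helper name `findBackslash` is made up for illustration, and the character literal '\\' is used instead of the string "\\" (both forms work with std::string::find).

```cpp
#include <string>

// Hypothetical helper: index of the first backslash, or std::string::npos.
// '\\' is a single character: the escape sequence for one backslash.
std::string::size_type findBackslash(const std::string& text) {
    return text.find('\\');
}
```

With std::string test = "Find\\inHere"; the call findBackslash(test) returns 4. Storing the result in std::string::size_type rather than int keeps a later comparison with std::string::npos reliable.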

How to save " in a string in C++?

So I have the following code, which doesn't work. I couldn't figure out how to fix it.
std::string str("Q850?51'18.23"");
First problem I face is " (quotation mark). I cannot save it as a string because at the end of the string I have two " characters and C++ doesn't let me save the whole string.
Second I want to split the string and save it in different variables.
E.g.;
double i = 850;
double j = 51;
double k = 18.23;
You will need to escape the quotation mark you require in the string:
std::string str("Q850?51'18.23\"");
// the \" just before the closing quote is the escaped quotation mark
The cppreference site has a list of these escape sequences.
Alternatively, you can use a raw string literal (C++11 and later):
std::string str = R"(Q850?51'18.23")";
The second part of the problem depends on the format and predictability of the data:
If it is fixed width, simple indexing can be used to extract the numbers and convert them to the doubles you require.
If it is delimited with the characters above, you can consume the string up to each delimiter, extracting the numbers in between (you should be able to find suitable libraries to assist with this).
If it is some further unknown composition, you may be limited to consuming the string one character at a time and extracting the numerical values between the non-numerical characters.
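The third approach (walking the string and pulling out each run of numeric characters) could look roughly like this. It is a sketch under assumptions: the helper name `extractNumbers` is invented here, the numbers are unsigned, and `std::strtod` does the conversion, so it accepts whatever strtod accepts from a leading digit onward and depends on the current C locale for the decimal point.

```cpp
#include <cctype>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: collects every number found in the text,
// skipping any non-digit characters between them.
std::vector<double> extractNumbers(const std::string& text) {
    std::vector<double> numbers;
    const char* p = text.c_str();
    while (*p != '\0') {
        if (std::isdigit(static_cast<unsigned char>(*p))) {
            char* end = nullptr;
            numbers.push_back(std::strtod(p, &end));  // parse one number
            p = end;                                  // continue after it
        } else {
            ++p;  // separator such as 'Q', '?', '\'' or '"'
        }
    }
    return numbers;
}
```

For the string from the question, extractNumbers("Q850?51'18.23\"") yields three values that can be assigned to i, j and k in order.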
You need to escape your quote mark by adding a backslash before the ":
std::string str("Q850?51'18.23\"");

How do I add a backslash after every character in a string?

I need to transform a literal filepath (C:/example.txt) into one that is compatible with the various WinAPI Registry functions (C://example.txt), and I have no idea how to go about doing it.
I've broken it down to having to add a backslash after a certain character (/ in this case), but I'm completely stuck after that.
Guidance and Code Examples will be greatly appreciated.
I'm using C++ and VS2012.
In C++, strings are made up of individual characters, like "foo". Strings can be composed of printable characters, such as the letters of the alphabet, or non-printable characters, such as a newline or other control characters.
You cannot type one of these non-printable characters in the normal way when populating a string. For example, if you want a string that contains "foo", then a tab, and then "bar", you can't reliably create this by typing:
fooTABbar
because many editors will simply insert spaces instead of an actual TAB character.
You can specify these non-printable characters by "escaping" them. This is done by inserting a backslash character (\) followed by the character's code. In the case of the string above, TAB is represented by the escape sequence \t, so you would write: "foo\tbar".
The character \ is not itself a non-printable character, but C++ (and C) recognize it as special: it always denotes the beginning of an escape sequence. To include the character "\" in a string, it has to be escaped itself, with \\.
So in C++ if you want a string that contains:
c:\windows\foo\bar
You code this using escape sequences:
string s = "c:\\windows\\foo\\bar";
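To see that each escape sequence really is a single character, here is a short sketch. The function names are made up for illustration, and the raw string literal form requires C++11.

```cpp
#include <string>

// "\t" is one character (TAB), so this string has 3 + 1 + 3 = 7 characters.
std::string tabbed() { return "foo\tbar"; }

// Each "\\" in the source becomes one backslash in the string.
std::string escapedPath() { return "c:\\windows\\foo\\bar"; }

// C++11 raw string literal: backslashes are taken literally, no escaping.
std::string rawPath() { return R"(c:\windows\foo\bar)"; }
```

Here escapedPath() and rawPath() produce the same 18-character string, and tabbed().size() is 7.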
Note that '\\' in source code is a single char, not two:
for(size_t i = 0, sz = sPath.size() ; i < sz ; i++)
if(sPath[i]=='/') sPath[i] = '\\';
But be aware that some APIs work with \ and some with /, so you need to check in which cases to use this replacement.
If replacing every occurrence of a forward slash with two backslashes is really what you want, then this should do the job:
size_t i = str.find('/');
while (i != string::npos)
{
string part1 = str.substr(0, i);
string part2 = str.substr(i + 1);
str = part1 + R"(\\)" + part2; // Use "\\\\" instead of R"(\\)" if your compiler doesn't support C++11's raw string literals
i = str.find('/', i + 1);
}
EDIT:
P.S. If I misunderstood the question and your intention is actually to replace every occurrence of a forward slash with just one backslash, then there is a simpler and more efficient solution (as @RemyLebeau points out in a comment):
size_t i = str.find('/');
while (i != string::npos)
{
str[i] = '\\';
i = str.find('/', i + 1);
}
Or, even better:
std::replace_if(str.begin(), str.end(), [] (char c) { return (c == '/'); }, '\\');
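Both variants can also be written without a lambda or repeated substr calls. The following is a sketch with made-up helper names: `toBackslashes` uses `std::replace` for the one-to-one swap, and `toDoubleBackslashes` builds a new string in a single pass for the one-to-two case.

```cpp
#include <algorithm>
#include <string>

// Replace each '/' with a single backslash (same length, done on a copy).
std::string toBackslashes(std::string path) {
    std::replace(path.begin(), path.end(), '/', '\\');
    return path;
}

// Replace each '/' with two backslashes by appending to a new string.
std::string toDoubleBackslashes(const std::string& path) {
    std::string result;
    result.reserve(path.size());
    for (char c : path) {
        if (c == '/') {
            result += "\\\\";  // two backslash characters
        } else {
            result += c;
        }
    }
    return result;
}
```

Building a fresh string avoids re-copying the whole path for every slash, which the substr-based loop does.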