How do I mimic a Unicode JS regular expression in Lucee - regex

I am trying to write a regular expression in Lucee to mimic the JS on the front end. Since Lucee's regex doesn't seem to support Unicode, how do I do it?
This is the JS
function charTest(k){
var regexp = /^[\u00C0-\u00ff\s -\~]+$/;
return regexp.test(k)
}
if(!charTest(thisKey)){
alert("Please Use Latin Characters Only");
return false;
}
This is what I have tried in Lucee
regexp = '[\u00C0-\u00ff\s -\~]+/';
writeDump(reFind(regexp,"测"));
writeDump(reFind(regexp,"test"));
I have also tried
regexp = "[\\p{L}]";
but the dump is always 0

EDIT: Give me one second. I think I interpreted your initial JS regex incorrectly. Fixing it.
EDIT 2: It was more than a second. Your original JS regex was:
"/^[\u00C0-\u00ff\s -\~]+$/". This is:
Basic parts of regex:
"/..../" == signifies the start and stop of the regex.
"^" == signifies the start of the string
"[...]" == signifies a character class: any one character from this group
"+" == signifies at least one of the previous
"$" == signifies the end of the string
Identifiers in the regex:
"\u00c0-\u00ff" == Unicode character range of Character 192 (À)
to Character 255 (ÿ). This is the upper part of the
Latin-1 Supplement block of the Unicode character set.
"\s" == signifies any whitespace character (space, tab, line break)
" -\~" == signifies the range from the space character to the
(escaped) tilde character (~). This is ASCII 32-126, which
covers the printable characters of ASCII (everything except the
control characters and DEL (127)). This includes alphanumerics and most punctuation.
I missed the second half of your printable basic Latin character set. I've updated my regex and tests to include it. There are ways to shorthand some of these identifiers, but I wanted it to be explicit. Note that the CFML version below leaves out \s, so tabs and line breaks are rejected there even though the JS regex accepts them.
You can try this:
<cfscript>
//http://www.asciitable.com/
//https://en.wikipedia.org/wiki/List_of_Unicode_characters
//https://en.wikipedia.org/wiki/Latin_script_in_Unicode
function charTest(k) {
return
REfind("[^"
& chr(32) & "-" & chr(126)
& chr(192) & "-" & chr(255)
& "]",arguments.k)
? "Please Use Latin Characters Only"
: ""
;
}
// TESTS
writeDump(charTest("测")); // Not Latin
writeDump(charTest("test")); // All characters between 32 & 126
writeDump(charTest("À")); // Character 192 (in range)
writeDump(charTest("À ")); // Character 192 and Space
writeDump(charTest(" ")); // Space Characters
writeDump(charTest("12345")); // Digits ( character 48-57 )
writeDump(charTest("ð")); // Character 240 (in range)
writeDump(charTest("ℿ")); // Character 8511 (outside range)
writeDump(charTest(chr(199))); // CF Character (in range)
writeDump(charTest(chr(10))); // CF Line Feed Character (outside range)
writeDump(charTest(chr(1000))); // CF Character (outside range)
writeDump(charTest("
")); // CRLF (outside range)
writeDump(charTest(URLDecode("%00", "utf-8"))); // CF Null character (outside range)
//writeDump(asc("测"));
//writeDump(asc("test"));
//writeDump(asc("À"));
//writeDump(asc("ð"));
//writeDump(asc("ℿ"));
</cfscript>
https://trycf.com/gist/05d27baaed2b8fc269f90c7c80a1aa82/lucee5?theme=monokai
All the regex does is look at your input string for any character that falls outside both chr(32)-chr(126) and chr(192)-chr(255). If it finds one, the function returns your chosen string; otherwise it returns an empty string.
I think you can access the Unicode characters below 255 directly. I'll have to test it.
Do you need this function to alert, like the JavaScript? If you need to, you can just output a 1 or 0 to indicate whether the function actually found a character outside the allowed ranges.
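As a rough illustration of why the negated class works, the sketch below compares the original anchored pattern with a "find any disallowed character" search in JavaScript. The names charTestAnchored and charTestNegated are made up for this example, and the negated pattern keeps \s so that it mirrors the front-end regex rather than the CFML version above.

```javascript
// Allowed set copied from the front-end pattern in the question.
const allowedOnly = /^[\u00C0-\u00ff\s -~]+$/;

// Negated form: matches any single character OUTSIDE the allowed set.
const disallowed = /[^\u00C0-\u00ff\s -~]/;

// Hypothetical helper: true when every character is in the allowed set.
function charTestAnchored(k) {
    return allowedOnly.test(k);
}

// Hypothetical helper: same answer, phrased as "no disallowed character found".
// The length check mirrors the "+" quantifier, which rejects the empty string.
function charTestNegated(k) {
    return k.length > 0 && !disallowed.test(k);
}

for (const sample of ["test", "À ", "测", "ℿ", ""]) {
    console.log(JSON.stringify(sample), charTestAnchored(sample), charTestNegated(sample));
}
```

Both helpers should agree on every input; the negated form is the one that translates most directly to reFind, which returns the position of the first offending character or 0.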

Related

How to detect if a string contains hindi (devnagri) in it with character and word count

Below is an example string:
$string = "abcde वायरस abcde"
I need to check whether this string contains any Hindi (Devanagari) content and, if so, count the characters and words. I guess a regex with a Unicode character class can work (http://www.regular-expressions.info/unicode.html), but I am not able to figure out the correct regex statement.
To find out whether a string contains a Hindi (Devanagari) character, you need to know which characters count as Devanagari. According to this website, they are the code points between 0x0900 and 0x097F (decimal 2304 to 2431).
The regular expression pattern needs to match, if any of those characters are in the set. Therefore, you can use a pattern (actually a set of characters) to match the string, which looks like this:
[\u0900\u0901\u0902 ... \u097D\u097E\u097F]
Because it is rather cumbersome to manually write this list of characters down, you can generate this string by iterating over the decimal characters from 2304 to 2431 or over the hexadecimal characters.
To count all words consisting only of Hindi characters, you can use the following pattern. It requires white-space (\s), the beginning (^), or the end ($) of the string on either side of the word, and uses the global flag (/g) to match every occurrence:
/(?:^|\s)[\u0900\u0901\u0902 ... \u097D\u097E\u097F]+?(?:\s|$)/g
Here is a live implementation in JavaScript:
var numberOfHindiCharacters = 128;
var unicodeShift = 0x0900;
var hindiAlphabet = [];
for(var i = 0; i < numberOfHindiCharacters; i++) {
hindiAlphabet.push("\\u0" + (unicodeShift + i).toString(16));
}
var regex = new RegExp("(?:^|\\s)["+hindiAlphabet.join("")+"]+?(?:\\s|$)", "g");
var string1 = "abcde वायरस abcde";
var string2 = "abcde abcde";
[ string1.match(regex), string2.match(regex) ].forEach(function(match) {
if(match) {
console.log("String contains " + match.length + " words with Hindi characters only.");
} else {
console.log("String does NOT contain any words with Hindi characters only.");
}
});
It should be a range. The list of all characters is not required.
The following will detect a Devanagari word
[\u0900-\u097F]+
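A minimal sketch of how the range could answer the original question (character and word counts), assuming that a "word" means a maximal run of Devanagari code points. The function name countDevanagari is hypothetical.

```javascript
// Counts Devanagari code points and runs of them in a string.
// Assumption: a "word" is any maximal run of characters in U+0900..U+097F,
// so a mixed token such as "abcवा" still contributes one run.
function countDevanagari(text) {
    const characters = text.match(/[\u0900-\u097F]/g) || [];
    const words = text.match(/[\u0900-\u097F]+/g) || [];
    return { characters: characters.length, words: words.length };
}

console.log(countDevanagari("abcde वायरस abcde")); // { characters: 5, words: 1 }
console.log(countDevanagari("abcde abcde"));       // { characters: 0, words: 0 }
```

Note that combining vowel signs such as ा are counted as separate characters here, because the count is per code point rather than per visible glyph.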

Replace all non-ASCII characters in a string by their ASCII equivalent

Using Qt/C++, I need to generate a string with only a subset of ASCII characters : letters, digits, hyphen, underscore, period, or colon.
As input, I can have anything.
So I try to apply some rules :
every QChar::isSpace will be replaced with an underscore
every non-ASCII letter will be replaced with an ASCII equivalent (example : "é" will be replaced with "e")
every other non-ASCII character will be removed
Is there any simple way with Qt/C++ to apply the 2nd and the 3rd rule ?
Thanks
Yes, there is a way.
First, apply Unicode normalization to your string with QString::normalized. Normalization is needed to separate diacritical marks from letters and to replace some fancy symbols with ASCII equivalents. Here you can read about normalization forms.
Then keep only the characters that can be encoded in Latin-1. This can be tested with the toLatin1 method of QChar. Note that Latin-1 is a superset of ASCII, so characters without a decomposition, such as ß or ø, pass this test; compare c.unicode() against 128 instead if strict ASCII is required.
char QChar::toLatin1() const
Returns the Latin-1 character equivalent to the QChar, or 0. This is mainly useful for non-internationalized software.
...
QString testString = QString::fromUtf8("Ceñía-üÏÖ马克ñ");
QString normalized = testString.normalized(QString::NormalizationForm_KD);
QString result;
std::copy_if(normalized.begin(), normalized.end(), std::back_inserter(result), [](const QChar& c) {
return c.toLatin1() != 0;
});
qDebug() << result; // Cenia-uIOn
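The same normalize-then-filter idea can be sketched outside Qt. The example below uses JavaScript purely for illustration, with the standard String.prototype.normalize("NFKD"); the function name toAsciiOnly is hypothetical, and it filters to strict ASCII (code points below 128) rather than Latin-1.

```javascript
// Sketch of the three rules from the question:
//   1. whitespace becomes an underscore
//   2. accented letters are reduced to their base letter via NFKD decomposition
//   3. anything still outside ASCII is dropped
function toAsciiOnly(input) {
    return input
        .replace(/\s/g, "_")            // rule 1
        .normalize("NFKD")              // split base letters from combining marks
        .replace(/[^\x00-\x7F]/g, "");  // rules 2 and 3: drop marks and other non-ASCII
}

console.log(toAsciiOnly("Ceñía-üÏÖ马克ñ")); // Cenia-uIOn
```

Letters that have no decomposition (for example ß or ø) are simply removed by this sketch; mapping them to "ss" or "o" would need an explicit lookup table.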

How to find the character "\" in a string?

I am trying to manipulate a string by finding the \ character in the string Find\inHere. However, I can't put that as an input in test.find('\', 0). It won't work and gives me the error "missing terminating character." Is there a way to fix test.find('\', 0)?
string test = "Find\inHere";
int x = test.find('\', 0); // error on this line
cout << x; // x should equal 4
\ is the character used to introduce escape sequences, for example \n for a newline or \xDB for the character with hexadecimal code DB.
So, in order to search for this special character, you have to escape it by adding another \:
test.find("\\",0);
EDIT: Also, your first string does not actually contain "Find\inHere": \i is not a valid escape sequence, so the compiler reports an error or a warning for it. For the same reason, write "Find\\inHere".
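The same escaping rule applies to string literals in many languages. As a small illustration in JavaScript (not the C++ from the question), the hypothetical helper findBackslash below shows that a doubled backslash in the source denotes a single character in the string.

```javascript
// Returns the index of the first backslash in s, or -1 if there is none.
function findBackslash(s) {
    return s.indexOf("\\"); // "\\" in source code is a one-character string
}

const test = "Find\\inHere";        // the string holds: Find\inHere
console.log(test.length);           // 11
console.log(findBackslash(test));   // 4
```

One difference to keep in mind: JavaScript silently treats an unknown escape such as \i as the plain letter, whereas a C++ compiler is expected to diagnose it.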

C++11 regex to tokenize Mathematical Expression

I have the following code to tokenize a string of the format: (1+2)/((8))-(100*34):
I'd like to throw an error to the user if they use an operator or character that isn't part of my regex.
e.g if user enters 3^4 or x-6
Is there a way to negate my regex, search for it and if it is true throw the error?
Can the regex expression be improved?
//Using c++11 regex to tokenize input string
//[0-9]+ = 1 or many digits
//Or [\\-\\+\\\\\(\\)\\/\\*] = "-" or "+" or "/" or "*" or "(" or ")"
std::regex e ( "[0-9]+|[\\-\\+\\\\\(\\)\\/\\*]");
std::sregex_iterator rend;
std::sregex_iterator a( infixExpression.begin(), infixExpression.end(), e );
queue<string> infixQueue;
while (a!=rend) {
infixQueue.push(a->str());
++a;
}
return infixQueue;
-Thanks
You can run a search on the string using the expression [^0-9()+\-*/], written as the C++ string "[^0-9()+\\-*/]". It finds any character which is NOT a digit, a round bracket, a plus or minus sign (actually a hyphen), an asterisk or a slash.
A search with this regular expression should not return anything; otherwise the string contains an unsupported character like ^ or x.
[...] is a positive character class, which means: find a character that is one of the characters in the square brackets.
[^...] is a negative character class, which means: find a character that is NOT one of the characters in the square brackets.
The only characters which must be escaped within square brackets to be interpreted literally are ], \, ^ and -. Here ^ needs escaping only as the first character, and - need not be escaped if it is the first or last character in the list. It is nevertheless better to always escape - within square brackets, as this makes it obvious that the hyphen is meant literally and not as a range "FROM x TO z".
Of course this expression does not check for missing closing round brackets. But formula parsers often do not require a closing parenthesis for every opening parenthesis, unlike a compiler or script interpreter, simply because it is not needed to calculate the value of the entered formula.
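The check-then-tokenize flow described above might look like the following sketch. It is written in JavaScript for brevity rather than with std::regex, the names UNSUPPORTED, TOKEN and tokenize are hypothetical, and the token pattern is a simplified version of the one in the question (no literal backslash in the operator class).

```javascript
// Any character that is not a digit, bracket, or one of + - * /
const UNSUPPORTED = /[^0-9()+\-*\/]/;
// One token: a run of digits, or a single operator/bracket
const TOKEN = /[0-9]+|[-+()*\/]/g;

// Throws on the first unsupported character, otherwise returns the token list.
function tokenize(expr) {
    const bad = expr.match(UNSUPPORTED);
    if (bad) {
        throw new Error(`Unsupported character "${bad[0]}" at index ${bad.index}`);
    }
    return expr.match(TOKEN) || [];
}

console.log(tokenize("(1+2)/((8))-(100*34)").join(" "));
// ( 1 + 2 ) / ( ( 8 ) ) - ( 100 * 34 )
```

Running the negated search first keeps the error reporting separate from the tokenizing, so the token pattern itself stays simple.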
An answer has been given already, but perhaps someone might need this:
[0-9]?([0-9]*[.])?[0-9]+|[\\-\\+\\\\\(\\)\\/\\*]
This regex separates floats, integers and arithmetic operators.
Here's the trick:
[0-9]?([0-9]*[.])?[0-9]+ -> if it's a number with a decimal point, grab the digits before the point, the point itself, and the digits that follow it; if not, just grab the digits.
Sorry if my answer isn't clear; I just learned regex and found this solution on my own by trial and error.
Here's the code (it takes a mathematical expression and splits all numbers and operators into a vector):
NOTE: I don't know if it accepts whitespace, as the mathematical expressions I worked with had none. For example, 4+2*(3+1) is separated nicely, but I haven't tried it with whitespace.
/* Separate every int or float or operator into a single string using regular expression and store it in untokenize vector */
string infix; // The string to be parsed (the arithmetic expression, if you will)
vector<string> untokenize;
std::regex words_regex("[0-9]?([0-9]*[.])?[0-9]+|[\\-\\+\\\\\(\\)\\/\\*]");
auto words_begin = std::sregex_iterator(infix.begin(), infix.end(), words_regex);
auto words_end = std::sregex_iterator();
for (std::sregex_iterator i = words_begin; i != words_end; ++i) {
cout << (*i).str() << endl;
untokenize.push_back((*i).str());
}
Output:
(
1
+
2
)
/
(
(
8
)
)
-
(
100
*
34
)
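For comparison, here is a sketch of a float-aware tokenizer with a somewhat simpler number pattern than the one in the answer. It is JavaScript rather than std::regex, the names NUMBER_OR_OPERATOR and tokenizeNumbers are hypothetical, and the pattern is my own simplification, not the answer's regex.

```javascript
// A number is: digits with an optional fractional part, or a bare fraction like ".5".
// Anything else that matters is a single operator or bracket.
const NUMBER_OR_OPERATOR = /[0-9]+(?:\.[0-9]+)?|\.[0-9]+|[-+()*\/]/g;

// Returns all numbers and operators in order of appearance.
function tokenizeNumbers(expr) {
    return expr.match(NUMBER_OR_OPERATOR) || [];
}

console.log(tokenizeNumbers("4+2.5*(3+.1)"));     // [ '4', '+', '2.5', '*', '(', '3', '+', '.1', ')' ]
console.log(tokenizeNumbers("4 + 2 * (3 + 1)"));  // [ '4', '+', '2', '*', '(', '3', '+', '1', ')' ]
```

In this sketch, characters that match no alternative, including spaces, are simply skipped, which is why the second call yields the same kind of list as an expression without whitespace.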

Is it possible to return "weird" characters in a char?

I would like to know whether it is possible to return "weird" characters, or rather ones that are significant to the language.
For example: \ ; '
I would like to know because I need to return them from a function that checks the Unicode value of the text key and returns the character by its number; I need these too.
I get a 356|error: missing terminating ' character
Line 356 looks as following
return '\';
Ideas?
The backslash is an escape for special characters. If you want a literal backslash you have to escape it with another backslash. Try:
return '\\';
The only problem here is that a backslash is used to escape characters in a literal. For example \n is a new line, \t is a horizontal tab. In your case, the compiler is seeing \' and thinking you mean a ' character (this is so you could have the ' character like so: '\''). You just need to escape your backslash:
return '\\';
Despite this looking like a character literal with two characters in it, it's not. \\ is an escape sequence which represents a single backslash.
Similarly, to return a ', you would do:
return '\'';
The list of available escape sequences is given by Table 7:
You can have a character literal containing any character from the execution character set and the resulting char will have the value of that character. However, if the value does not fit in a char, it will have implementation-defined value.
Any character can be returned.
Yet for some of them, you have to escape it using backslash: \.
So for returning backslash, you have to return:
return '\\';
To get a plain backslash use '\\'.
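To make the "two characters in source, one character in memory" point concrete, here is a small sketch in JavaScript (used only as an illustration; the question itself is about C or C++). The helper name charFromCode is hypothetical and stands in for the asker's "return the character by its number" function.

```javascript
// Returns the one-character string for a given UTF-16 code unit.
function charFromCode(code) {
    return String.fromCharCode(code);
}

console.log("\\".length);                 // 1: the escape denotes a single backslash
console.log(charFromCode(92) === "\\");   // true: 92 is the code of the backslash
console.log(charFromCode(39) === "\'");   // true: 39 is the apostrophe
console.log(charFromCode(59) === ";");    // true: the semicolon needs no escape
```

The characters themselves are ordinary values; only their spelling inside a literal needs the extra backslash.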
In C the following characters are represented using a backslash:
\a : A bell
\b : A backspace
\f : A formfeed
\n : A new line
\r : A carriage return
\t : A horizontal tab
\v : A vertical tab
\xhh : A hexadecimal bit pattern
\ooo : An octal bit pattern
\0 : A null character
\" : The " character
\' : The ' character
\? : The ? character
\\ : A backslash (\)
A plain backslash confuses the system because it expects a character to follow it. Thus, you need to "escape" it. The octal/hexadecimal bit patterns may not seem too useful at first, but they let you use ANSI escape codes.
If the character following the backslash does not specify a legal escape sequence, as shown above, the result is implementation defined, but often the character following the backslash is taken literally, as though the escape were not present.
If you have to return such characters (", ', \, {, ] etc.) more than once, you should write a function that escapes those characters. I wrote such a function once:
function EscapeSpecialChars(_data) {
    try {
        // Only strings can be escaped; return anything else unchanged.
        if (typeof _data !== "string") {
            return _data;
        }
        // Remove CRLF pairs (assumed to be what the original literal line break stood for).
        _data = _data.split("\r\n").join("");
        // Escape every occurrence of the remaining control characters.
        _data = _data.split("\n").join("\\n");
        _data = _data.split("\r").join("\\r");
        _data = _data.split("\t").join("\\t");
        _data = _data.split("\b").join("\\b");
        _data = _data.split("\f").join("\\f");
        return _data;
    } catch (err) {
        alert(err);
    }
}
then use it like this:
return EscapeSpecialChars("\\'\"{[}]");
You should improve the function. It was working for me, but it does not escape all special characters.