?testMSG=ÁáÉéÍíÑñÓóÚúÜü«»¿¡€
<cfset ascii = NOT REFind('[\u0080-\uFFFF]', arguments.textMSG)>
The variable ascii comes back as 1, which it shouldn't. REFind('[\u0080-\uFFFF]', arguments.textMSG) itself returns 0 even though textMSG contains characters above 127. The line is inside a remote cffunction.
As per the docs, ColdFusion's regex implementation doesn't support the \u escape sequence (and, indeed, I am fairly certain it's completely unaware of the concept of Unicode).
To do what you want here, you're going to have to use Java regexes.
Related
I am reading the Python 2.7 documentation, and I just don't understand the Raw-Unicode-Escape encoding. The original documentation is below:
For experts, there is also a raw mode just like the one for normal strings. You have to prefix the opening quote with ‘ur’ to have Python use the Raw-Unicode-Escape encoding. It will only apply the above \uXXXX conversion if there is an uneven number of backslashes in front of the small ‘u’.
And I wonder why the required number of backslashes is uneven. Is it just a rule or due to anything else?
\uXXXX escapes are handled specially in raw strings, as the text you quoted describes. ur'\\\\' is a string containing four backslashes, while ur'\\\u0020\\' is four backslashes and a space. If I had to guess why there has to be an uneven number of backslashes in front of the \u for it to be recognized, I'd guess it's because the non-raw string parser works that way too (I haven't looked at the source to be sure).
The question of why probably comes down to "because that's the way it was defined" for Python 2. Python 3 doesn't do that anymore: r'\\\u0020\\' is the same as '\\\\\\u0020\\\\'.
This question is an extension of Do C++11 regular expressions work with UTF-8 strings?
#include <regex>
if (std::regex_match("中", std::regex("中")))  // "\u4e2d" also works
std::cout << "matched\n";
The program is compiled on Mac Mountain Lion with clang++ with the following options:
clang++ -std=c++0x -stdlib=libc++
The code above works. This is a standard range regex, "[一-龠々〆ヵヶ]", for matching any Japanese kanji or Chinese character. It works in JavaScript and Ruby, but I can't seem to get ranges working in C++11, even when using the similar version [\u4E00-\u9fa0]. The code below does not match the string.
if (std::regex_match ("中", std::regex("[一-龠々〆ヵヶ]")))
std::cout << "range matched\n";
Changing locale hasn't helped either. Any ideas?
EDIT
So I have found that all ranges work if you add a + to the end, in this case [一-龠々〆ヵヶ]+, but if you add {1}, as in [一-龠々〆ヵヶ]{1}, it does not work. Moreover, it seems to overreach its boundaries: it won't match Latin characters, but it will match は, which is \u306f, and ぁ, which is \u3041. Both lie below \u4E00.
nhahtdh also suggested regex_search, which works without adding +, but it runs into the same problem of matching values outside the range. I played with the locales a bit as well. Mark Ransom suggests the regex treats the UTF-8 string as a dumb set of bytes; I think that is probably what is happening.
Further supporting the theory that the UTF-8 is getting jumbled somehow: [a-z]{1} and [a-z]+ both match a, but only [一-龠々〆ヵヶ]+ matches any of those characters; [一-龠々〆ヵヶ]{1} does not.
Encoded in UTF-8, the string "[一-龠々〆ヵヶ]" is equal to this one: "[\xe4\xb8\x80-\xe9\xbe\xa0\xe3\x80\x85\xe3\x80\x86\xe3\x83\xb5\xe3\x83\xb6]". And this is not the droid character class you are looking for.
The character class you are looking for is the one that includes:
any character in the range U+4E00..U+9FA0; or
any of the characters 々, 〆, ヵ, ヶ.
The character class you specified is the one that includes:
any of the "characters" \xe4 or \xb8; or
any "character" in the range \x80..\xe9; or
any of the "characters" \xbe, \xa0, \xe3, \x80, \x85, \xe3 (again), \x80 (again), \x86, \xe3 (again), \x83, \xb5, \xe3 (again), \x83 (again), \xb6.
Messy, isn't it? Do you see the problem?
This will not match "latin" characters (which I assume you mean things like a-z) because in UTF-8 those all use a single byte below 0x80, and none of those is in that messy character class.
It will not match "中" either because "中" has three "characters", and your regex matches only one "character" out of that weird long list. Try assert(std::regex_match("中", std::regex("..."))) and you will see.
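To make that concrete, here is a minimal check. It assumes the source file and execution charset are UTF-8, as on the clang++ setup in the question:

#include <cassert>
#include <cstring>
#include <regex>

int main() {
    // UTF-8 encodes U+4E2D as the three bytes E4 B8 AD,
    // so the narrow string "中" has length 3 ...
    assert(std::strlen("中") == 3);
    // ... and three dots (one per byte) match it.
    assert(std::regex_match("中", std::regex("...")));
}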
If you add a + it works because "中" has three of those "characters" in your weird long list, and now your regex matches one or more.
If you instead add {1} it does not match because we are back to matching three "characters" against one.
Incidentally "中" matches "中" because we are matching the three "characters" against the same three "characters" in the same order.
Note that the regex with + will actually match some undesired things, because it does not care about order. Any character that can be made from that list of bytes in UTF-8 will match. It will match "\xe3\x81\x81" (ぁ U+3041), and it will even match invalid UTF-8 input like "\xe3\xe3\xe3\xe3".
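You can reproduce the overreach from the question's edit; this sketch assumes the same clang++/libc++ setup, since how a narrow-char regex treats bytes above 0x7F is implementation-specific:

#include <cassert>
#include <regex>

int main() {
    // ぁ is U+3041, encoded as E3 81 81: each byte lands in the
    // accidental \x80-\xe9 byte range, so the "kanji" class accepts
    // all three and + lets them repeat.
    assert(std::regex_match("\xe3\x81\x81", std::regex("[一-龠々〆ヵヶ]+")));
}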
The bigger problem is that you are using a regex library that does not even have level 1 support for Unicode, the bare minimum required. It munges bytes and there isn't much your precious tiny regex can do about it.
And the even bigger problem is that you are using a hardcoded set of characters to specify "any Japanese Kanji or Chinese character". Why not use the Unicode Script property for that?
R"(\p{Script=Han})"
Oh right, this won't work with C++11 regexes. For a moment there I almost forgot those are annoyingly worse than useless with Unicode.
So what should you do?
You could decode your input into a std::u32string and use char32_t all over for the matching. That would not give you this mess, but you would still be hardcoding ranges and exceptions when you mean "a set of characters that share a certain property".
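Here is a minimal sketch of that approach. The standard library provides regex_traits only for char and wchar_t, not char32_t, so the sketch checks the hardcoded ranges with plain comparisons after decoding (std::codecvt_utf8 is available in C++11, though later deprecated):

#include <codecvt>
#include <locale>
#include <string>

// The same hardcoded set as the original class, but applied to
// whole code points instead of raw bytes.
bool is_kanji(char32_t c) {
    return (c >= U'\u4E00' && c <= U'\u9FA0') ||
           c == U'\u3005' || c == U'\u3006' ||  // 々 〆
           c == U'\u30F5' || c == U'\u30F6';    // ヵ ヶ
}

bool contains_kanji(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    for (char32_t c : conv.from_bytes(utf8))  // throws on invalid UTF-8
        if (is_kanji(c)) return true;
    return false;
}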
I recommend you forget about C++11 regexes and use some regular expression library that has the bare minimum level 1 Unicode support, like the one in ICU.
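For example, a sketch using ICU's RegexMatcher (assuming ICU is installed; link with -licuuc -licui18n), where \p{Script=Han} behaves as intended:

#include <iostream>
#include <string>
#include <unicode/regex.h>
#include <unicode/unistr.h>

// ICU regexes operate on UTF-16 UnicodeStrings and understand
// Unicode properties, so the script test is a single property escape.
bool contains_han(const std::string& utf8) {
    UErrorCode status = U_ZERO_ERROR;
    icu::RegexMatcher matcher(icu::UnicodeString("\\p{Script=Han}"), 0, status);
    if (U_FAILURE(status)) return false;
    icu::UnicodeString input = icu::UnicodeString::fromUTF8(utf8);
    matcher.reset(input);
    return matcher.find();
}

int main() {
    std::cout << (contains_han("中") ? "matched\n" : "no match\n");
}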
I need to escape all special characters and replace national characters to get "plain text" I can use as a table name.
string getTableName(string name)
My string could be "šárka65_%&." and I want to get a string I can use in my database as a table name.
Which DBMS?
In standard SQL, a name enclosed in double quotes is a delimited identifier and may contain any characters.
In MS SQL Server, a name enclosed in square brackets is a delimited identifier.
In MySQL, a name enclosed in back-ticks is a delimited identifier.
You could simply choose to enclose the name in the appropriate markers.
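If you take the delimited-identifier route, the only character that needs escaping is the delimiter itself. A sketch for the standard SQL form (the helper name quoteIdentifier is mine):

#include <string>

// Standard SQL: wrap the name in double quotes; an embedded double
// quote is escaped by doubling it.
std::string quoteIdentifier(const std::string& name) {
    std::string out = "\"";
    for (char c : name) {
        out += c;
        if (c == '"') out += c;  // "" stands for a literal quote
    }
    out += '"';
    return out;
}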
I had a feeling that wasn't what you wanted...
What codeset is your string in? It seems to be UTF-8 by the time it gets to my browser. Do you need to be able to invert the mapping unambiguously? That is harder.
You can use many schemes to map the information:
One simple-minded scheme is to hex-encode everything, using a marker (X) to protect against leading digits:
XC5A1C3A1726B6136355F25262E
One slightly less simple-minded scheme is to hex-encode anything that is not already an ASCII alphanumeric or underscore (a sketch of this follows the list below):
XC5A1C3A1rka65_25262E
Or, as a comment suggests, you can devise a mapping table for accented Latin letters; indeed, an appropriately initialized mapping table will be the fastest approach. The input is the character in the source string; the output is the desired mapped character or characters. If you use an 8-bit character set, this is entirely manageable. If you use full Unicode, it is a lot less manageable (not least: how do you map all the Han characters to ASCII?).
Or ...
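For concreteness, here is a sketch of the second scheme; with UTF-8 input it reproduces XC5A1C3A1rka65_25262E for "šárka65_%&.". The signature getTableName comes from the question:

#include <cctype>
#include <cstdio>
#include <string>

// Keep ASCII alphanumerics and underscores, hex-encode every other
// byte, and prefix with 'X' so the name never starts with a digit.
// The input is treated as raw bytes, so any encoding round-trips.
std::string getTableName(const std::string& name) {
    std::string out = "X";
    for (unsigned char c : name) {
        if ((c < 0x80 && std::isalnum(c)) || c == '_') {
            out += static_cast<char>(c);
        } else {
            char hex[3];
            std::snprintf(hex, sizeof hex, "%02X", c);
            out += hex;
        }
    }
    return out;
}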
I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.
For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.
I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).
Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".
The autosuggestion is sourced from a cached array built from all the pages and keywords. To try to locate Māori, I've been trying various regexes like:
preg_match('/\m(.{1})ori/i',$page_title)
That also returns page titles containing "Moorings" but not "Māori". How do preg_match/preg_replace see characters like "ā", and how should I construct the regex to pick them up?
Cheers
Tama
Use the /u modifier for UTF-8 mode in regexes.
You're better off on the whole doing an iconv('utf-8','ascii//TRANSLIT',$string) on both the name and the search term and comparing those.
One thing you need to remember is that UTF-8 gives you multi-byte characters for anything outside ASCII. I don't know whether PHP treats the string $page_title as a Unicode object or as a dumb byte string. If it's the byte-string option, you're going to have to use two dots there instead, or {1,4}, and even then you'd have to verify that the up-to-four bytes you grab between the m and the o form a single valid UTF-8 character. This is all moot if PHP does Unicode right; I haven't used it in years, so I can't vouch for it.
The other issue to consider is that ā can be constructed in two ways: as a single character (U+0101), or as two Unicode characters ('a' plus a combining diacritic from the U+0300 block). You're most likely only ever going to see the former, but be aware that the latter is also possible.
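The question is about PHP, but the two-representations issue is language-independent. As a quick illustration, this ICU/C++ sketch shows that the two forms of ā compare equal only after normalization:

#include <cassert>
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    assert(U_SUCCESS(status));

    icu::UnicodeString precomposed((UChar32)0x0101);             // ā as one code point
    icu::UnicodeString decomposed = icu::UnicodeString((UChar32)0x0061)
                                        .append((UChar32)0x0304); // 'a' + combining macron

    assert(precomposed != decomposed);              // raw code units differ
    assert(nfc->normalize(precomposed, status)
               == nfc->normalize(decomposed, status)); // NFC forms agree
}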
The only language I know of that does this stuff reliably well is Perl 6, which has all kinds of insane modifiers for internationalized text in regexes.
Is there an ASCII value I can put into a char in C++ that represents nothing? I tried 0, but it ends up screwing up my file so I can't read it.
ASCII 0 is null. Other than that, there are no "nothing" characters in traditional ASCII. If appropriate, you could use a control character like SOH (start of heading), STX (start of text), or ETX (end of text). Their ASCII values are 1, 2, and 3 respectively.
For the full list of ASCII codes that I used for this explanation, see this site.
Sure. Use any character value that won't appear in your regular data. This is commonly referred to as a delimited text file. Popular choices for delimiters include spaces, tabs, commas, semi-colons, vertical bars, and tildes.
In a C++ source file, '\0' represents a zero byte. However, C-style strings are null-terminated, which means that '\0' marks the end of the string; that may be what is messing up your file.
If you really want to store a 0 byte in a data file, you need to use some other encoding. A simplistic one would use some other character that doesn't appear in your data (0xFF, for example), or some length/data format, or something similar.
Whatever encoding you choose to use, the application writing the file and the one reading it need to agree on what the encoding is. And that is a whole new nightmare.
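As a sketch of the length/data idea mentioned above (the record functions are my invention; open both streams with std::ios::binary, and note the length is stored in host byte order, which writer and reader must share):

#include <cstdint>
#include <fstream>
#include <string>

// Prefix each record with its byte count, so the payload can
// contain any byte value, including '\0'.
void writeRecord(std::ofstream& out, const std::string& data) {
    std::uint32_t len = static_cast<std::uint32_t>(data.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof len);
    out.write(data.data(), len);
}

bool readRecord(std::ifstream& in, std::string& data) {
    std::uint32_t len = 0;
    if (!in.read(reinterpret_cast<char*>(&len), sizeof len)) return false;
    data.resize(len);
    return len == 0 || static_cast<bool>(in.read(&data[0], len));
}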
The null character '\0' still takes up a byte.
Does your software recognize the null character as an end-of-file character?
If your software is reading in this file, you can define a placeholder character (one that doesn't occur in your data), but you'll also need to handle that character. Say '*' is your placeholder: you read the character in but don't add it to the structure that stores your data. It will still take up space in your file, but it won't take up space in your data structure.
Am I answering your question or missing it?
Do you mean a value you can write which won't actually change the file? The answer is no.
Maybe post a little more about what you're trying to accomplish.
It would depend on what kind of file it is and who is parsing it.