Multi-language input validation with UTF-8 encoding - regex

To check that a user-entered English name is valid, I would usually match the input against a regular expression such as [A-Za-z]. But how can I do this if multi-language support (Chinese, Japanese, etc.) is required with UTF-8 encoding?

You can approximate the Unicode derived property \p{Alphabetic} pretty succinctly with [\pL\pM\p{Nl}] if your language doesn’t support a proper Alphabetic property directly.
Don’t use Java’s \p{Alpha}, because that’s ASCII-only.
But then you’ll notice that you’ve failed to account for dashes (\p{Pd} or DashPunctuation works, but that does not include most of the hyphens!), apostrophes (usually but not always one of U+27, U+2BC, U+2019, or U+FF07), comma, or full stop/period.
You probably had better include \p{Pc} ConnectorPunctuation, just in case.
If you have the Unicode derived property \p{Diacritic}, you should use that, too, because it includes things like the mid-dot needed for geminated L’s in Catalan and the non-combining forms of diacritic marks which people sometimes use.
But then you’ll find people who use ordinal numbers in their names in ways that \p{Nl} (LetterNumber) doesn’t accommodate, so you throw \p{Nd} (DecimalNumber) or even all of \pN (Number) into the mix.
Then you realize that Asian names often require the use of ZWJ or ZWNJ to be written correctly in their scripts, so then you have to add U+200D and U+200C to the mix, which are both \p{Cf} (Format) characters and indeed also JoinControl ones.
By the time you’re done looking up the various Unicode properties for the various and many exotic characters that keep cropping up — or when you think you’re done, rather — you’re almost certain to conclude that you would do a much better job at this if you simply allowed them to use whatever Unicode characters for their name that they wish, as the link Tim cites advises. Yes, you’ll get a few jokers putting in things like “əɯɐuʇƨɐ⅂ əɯɐuʇƨɹᴉℲ”, but that just goes with the territory, and you can’t preclude silly names in any reasonable way.
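For illustration, here is roughly what such a kitchen-sink check might look like using ICU's C++ regex engine, which understands the \p{...} properties discussed above. The function name and the exact set of punctuation allowed are only assumptions for the sketch, not a recommendation:
#include <unicode/regex.h>
#include <unicode/unistr.h>

// Hypothetical helper: letters, marks, letter-numbers, decimal numbers,
// dashes, connector punctuation, a few common apostrophes, ZWNJ/ZWJ,
// space, comma and full stop. Adjust to taste.
bool looksLikeName(const icu::UnicodeString& input) {
    UErrorCode status = U_ZERO_ERROR;
    icu::UnicodeString pattern(
        "^[\\p{L}\\p{M}\\p{Nl}\\p{Nd}\\p{Pd}\\p{Pc}"
        "\\u0027\\u2019\\u02BC\\uFF07\\u200C\\u200D .,]+$");
    icu::RegexMatcher matcher(pattern, 0, status);
    if (U_FAILURE(status)) return false;   // bad pattern
    matcher.reset(input);
    UBool ok = matcher.matches(status);
    return U_SUCCESS(status) && ok != 0;
}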

Think about whether you really need to validate the user's name. Maybe you should let users call themselves whatever they want.
You certainly should never use [A-Za-z], because some people have names with apostrophes or hyphens. It can be quite insulting to prevent someone from using their real name just because it doesn't follow your arbitrary rules for what a name should look like.

In PHP I use this nasty hack:
setlocale(LC_ALL, 'de_DE');
preg_match('/^[[:alpha:]]+$/', $name);
That includes "Umlauts" (i.e. 'ä','ö' and the like) plus accented vowels (è,í,etc.).
But it falls short to validate for Cyrillic (Russia, Bulgaria, ...) or Chinese characters...

Related

How to achieve unicode-agnostic case insensitive comparison in C++

I have a requirement wherein my C++ code needs to do case insensitive comparison without worrying about whether the string is encoded or not, or the type of encoding involved. The string could be ASCII or non-ASCII; I just need to store it as is and compare it with a second string without concerning myself with whether the right locale is set and so forth.
Use case: Suppose my application receives a string (let's say it's a file name) initially as "Zoë Saldaña.txt" and it stores it as is. Subsequently, it receives another string "zoë saLdañA.txt", and the comparison between this and the first string should result in a match, by using a few APIs. Same with file name "abc.txt" and "AbC.txt".
I read about IBM's ICU and how it uses UTF-16 encoding by default. I'm curious to know:
1. Does ICU provide a means of solving my requirement by seamlessly handling the strings regardless of their encoding type?
2. If the answer to 1 is no, then, using ICU's APIs, is it safe to normalize all strings (both ASCII and non-ASCII) to UTF-16 and then do the case-insensitive comparison and other operations?
3. Are there alternatives that facilitate this?
I read this post, but it doesn't quite meet my requirements.
Thanks!
The requirement is impossible. Computers don't work with characters, they work with numbers. But "case insensitive" comparisons are operations which work on characters. Locales determine which numbers correspond to which characters, and are therefore indispensable.
The above is true in every programming language, and it even applies to case-sensitive comparisons. The mapping from character to number isn't always unique, which means that comparing two numbers doesn't work. There could be a locale where character 42 is equivalent to character 43. In Unicode it's even worse: there are number sequences of different lengths which are still equivalent (precomposed and decomposed characters in particular).
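A two-line illustration of that last point in plain C++ (no locale or library involved): both literals render as "é", yet an ordinary comparison reports them as different.
#include <iostream>
#include <string>

int main() {
    std::u16string precomposed = u"\u00E9";   // U+00E9 LATIN SMALL LETTER E WITH ACUTE
    std::u16string decomposed  = u"e\u0301";  // 'e' followed by U+0301 COMBINING ACUTE ACCENT
    std::cout << (precomposed == decomposed) << '\n';  // prints 0: code-unit comparison sees two different strings
}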
Without knowing the encoding, you cannot do that. I will take one example using French accented characters and two different encodings: CP850, used as the OEM character set on Windows in the Western European zone, and the well-known ISO-8859-1 (also known as Latin-1, and not very different from the Windows-1252 ANSI character set).
in CP850, 0x96 is 'û', 0xCA is '╩', 0xEA is 'Û'
in Latin-1, 0x96 is non-printable(*), 0xCA is 'Ê', 0xEA is 'ê'
So if the string is CP850-encoded, 0xEA should compare equal (case-insensitively) to 0x96, and 0xCA is a different character.
But if the string is Latin-1-encoded, 0xEA should compare equal to 0xCA, 0x96 being a control character.
You could find similar examples with other ISO-8859-x encodings, but I only speak of languages I know.
(*) In Windows-1252, 0x96 is '–', Unicode character U+2013, not related to 'ê'.
For UTF-8 (or other Unicode) encodings, it is possible to perform a "locale neutral" case-insensitive string comparison. This type of comparison is useful in multi-locale applications, e.g. network protocols (e.g. CIFS), international database data, etc.
The operation is possible due to Unicode metadata which clearly identifies which characters may be "folded" to/from which upper/lower case characters.
As of 2007, when I last looked, there were fewer than 2000 upper/lower case character pairs. It was also possible to generate a perfect hash function to convert upper to lower case (and most likely vice versa as well, but I didn't try it).
At the time, I used Bob Burtle's perfect hash generator. It worked great in a CIFS implementation I was working on at the time.
There aren't many smallish, fixed sets of data out there you can point a perfect hash generator at. But this is one of 'em. :--)
Note: this is locale-neutral, so it will not support applications like German telephone books. There are a great many applications where you should definitely use locale-aware folding and collation. But there are also a large number where locale-neutral is actually preferable, especially now that folks are sharing data across so many time zones and, necessarily, cultures. The Unicode standard does a good job of defining a good set of shared rules.
If you're not using Unicode, the presumption is that you have a really good reason. As a practical matter, if you have to deal with other character encodings, you have a highly locale aware application. In which case, the OP's question doesn't apply.
See also:
The Unicode® Standard, Chapter 4, section 4.2, Case
The Unicode® Standard, Chapter 5, section 5.18, Case Mappings, subsection Caseless Matching.
UCD - CaseFolding.txt
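For what it's worth, ICU ships this folding data ready to use, so rolling your own perfect hash is optional. This is not the perfect-hash trick described above, just a minimal sketch of a locale-neutral caseless comparison with ICU's C++ API (assuming ICU is available; the helper name is made up):
#include <unicode/uchar.h>
#include <unicode/unistr.h>

// Locale-neutral caseless equality: uses Unicode default case folding
// (the data from CaseFolding.txt), not any particular locale's rules.
bool equalsCaseless(const icu::UnicodeString& a, const icu::UnicodeString& b) {
    return a.caseCompare(b, U_FOLD_CASE_DEFAULT) == 0;
}

// equalsCaseless(icu::UnicodeString::fromUTF8("Zoë Saldaña.txt"),
//                icu::UnicodeString::fromUTF8("zoë saLdañA.txt"))  // true,
// provided both strings are already in the same normal form.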
Well, first I must say that any programmer dealing with natural-language text has the utmost duty to know and understand Unicode well. Other ancient 20th-century encodings still exist, but things like EBCDIC and ASCII cannot encode even simple English text, which may contain words like façade, naïve or fiancée, or a geographical sign, a mathematical symbol or even emoji — conceptually, these are similar to ideograms. The majority of the world's population does not use Latin characters to write text. UTF-8 is now the prevalent encoding on the Internet, and UTF-16 is used internally by all present-day operating systems, including Windows, which unfortunately still does it wrong. (For example, NTFS has a decade-old reported bug that allows a directory to contain two files with names that look exactly the same but are encoded in different normal forms — I get this a lot when synchronising files via FTP between Windows and macOS or Linux; all my files with accented characters get duplicated because, unlike the other systems, Windows uses a different normal form and only normalises the file names at the GUI level, not at the file-system level. I reported this in 2001 for Windows 7 and the bug is still present today in Windows 10.)
If you still don't know what a normal form is, start here: https://en.wikipedia.org/wiki/Unicode_equivalence
Unicode has strict rules for lower- and uppercase conversion, and these should be followed to the letter in order for things to work nicely. First, make sure both strings use the same normal form (you should do this during input processing; the Unicode standard has the algorithm). Please do not reinvent the wheel; use ICU's normalisation and comparison facilities. They have been extensively tested and they work correctly. Use them; IBM has made them available gratis.
A note: if you plan on comparing strings for ordering, please remember that collation is locale-dependent and highly influenced by the language and the scenario. For example, in a dictionary these Portuguese words would have this exact order: sabia, sabiá, sábia, sábio. The same ordering rules would not work for an address list, which would use phonetic rules to place names like Peçanha and Pessanha adjacently. The same phenomenon happens in German with ß and ss. Yes, natural language is not logical; or, better said, its rules are not simple.
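To make that concrete, here is a small sketch (again assuming ICU; the strings and the pt_BR locale are just the examples from above) of normalising to NFC before a caseless comparison, and of locale-aware ordering with a collator:
#include <unicode/coll.h>
#include <unicode/locid.h>
#include <unicode/normalizer2.h>
#include <unicode/uchar.h>
#include <unicode/unistr.h>
#include <algorithm>
#include <memory>
#include <vector>

int main() {
    UErrorCode status = U_ZERO_ERROR;

    // 1. Bring both strings to the same normal form (NFC here) before comparing.
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return 2;
    icu::UnicodeString a = nfc->normalize(icu::UnicodeString::fromUTF8("Zoë Saldaña.txt"), status);
    icu::UnicodeString b = nfc->normalize(icu::UnicodeString::fromUTF8("zoë saLdañA.txt"), status);
    bool same = (a.caseCompare(b, U_FOLD_CASE_DEFAULT) == 0);   // true

    // 2. Locale-dependent ordering: a Portuguese collator puts
    //    sabia, sabiá, sábia, sábio in dictionary order.
    std::vector<icu::UnicodeString> words = {
        icu::UnicodeString::fromUTF8("sábio"), icu::UnicodeString::fromUTF8("sabiá"),
        icu::UnicodeString::fromUTF8("sábia"), icu::UnicodeString::fromUTF8("sabia")};
    std::unique_ptr<icu::Collator> coll(
        icu::Collator::createInstance(icu::Locale("pt", "BR"), status));
    std::sort(words.begin(), words.end(),
              [&](const icu::UnicodeString& x, const icu::UnicodeString& y) {
                  UErrorCode ec = U_ZERO_ERROR;
                  return coll->compare(x, y, ec) == UCOL_LESS;
              });
    return same ? 0 : 1;
}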
C'est la vie. これが私たちの世界です。 ("That's life. This is our world.")

Encode/decode certain text sequences in Qt

I have a QTextEdit where the user can insert arbitrary text. In this text, there may be some special sequences of characters which I wish to translate automatically. And from the translated version, I wish I could go back to the sequences.
Take for instance this:
QMessageBox::information(0, "Foo", MAGIC_TRANSLATE(myTextEdit->text()));
If the user wrote the sequence \n inside myTextEdit's text, I would like MAGIC_TRANSLATE to convert the string \n into an actual newline character.
In the same way, if I give it a text with a newline inside, MAGIC_UNTRANSLATE will convert the newline back into a \n string.
Now, of course I can implement these two functions myself, but what I am asking is whether there is something ready-made and easy to use in Qt which allows me to specify a dictionary and does the rest for me.
Note that sequences with a common prefix can create conflicts, for example converting:
\foo -> FOO
\foobar -> FOOBAR
can give rise to issues when translating the text asd \foobar lol, because if \foo is searched and replaced before \foobar, then the resulting text will be asd FOObar lol instead of the (more natural) asd FOOBAR lol.
I hope to have made clear my needs. I believe that this may be a common task, so I hope there is a Qt solution which takes into account this kind of issues when having conflicting prefixes.
I am sorry if this is a trivial topic (as I think it may be), but I am not familiar at all with encoding techniques and issues, and my knowledge of Qt's encoding covers only very simple Unicode-related issues.
EDIT:
Btw, in my case a data-oriented approach, based on resources or external files or anything that does not require recompilation, would be great.
It sounds like your question is, "I want to run a sequence of regular expression or simple string replacements to map between two encodings of some text".
First you need to work out your mapping, exactly. As you say, if your escape sequences like \foo and \foobar are fiddly, you might find that you don't have a bidirectional, lossless mapping. No library in the world can help you if your design or encoding is flawed.
When you end up with a precise design (which we can't help you on given the complete lack of information provided on the purpose of this function), you'll probably find that a sequence of string replacements is fine. If it really is more complicated, then some QRegExps should be enough.
It is always a bit ugly to self-answer questions, but... Maybe this solution is useful to someone.
As suggested by Nicholas in his answer, a good strategy is to use replacement. It is simple and effective in most cases, for example in the plain C/C++ escaping:
\n \r \t etc
This works because they are all different. Replacement will always work if the sequences are all distinct and, in particular, if no sequence is a prefix of another sequence.
For example, if your sequences are the ones above plus some Greek letters, you will not like the \nu sequence, which should be translated to ν.
If the replacing function tests for \n before \nu, the result is wrong.
Assuming that both sequences are to be translated into two completely different entities, there are two solutions: place a close-sequence character, for example \nu;, or simply replace from the longest string to the shortest. This ensures that any sequence which is a prefix of another is not replaced before it.
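A minimal Qt sketch of the longest-first idea (the function name and dictionary are made up for the example; note that a plain sequence of replace() calls can still misbehave if a replacement value happens to contain another key):
#include <QMap>
#include <QString>
#include <QList>
#include <algorithm>

// Hypothetical helper: apply the dictionary longest-key-first so that a key
// which is a prefix of another ("\foo" vs "\foobar") can never clobber it.
QString magicTranslate(QString text, const QMap<QString, QString>& dict)
{
    QList<QString> keys = dict.keys();
    std::sort(keys.begin(), keys.end(),
              [](const QString& a, const QString& b) { return a.size() > b.size(); });
    for (const QString& key : keys)
        text.replace(key, dict.value(key));
    return text;
}

// magicTranslate("asd \\foobar lol", {{"\\foo", "FOO"}, {"\\foobar", "FOOBAR"}})
//   -> "asd FOOBAR lol"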
For various reasons, I tried another way: using a trie, which is a tree of all the prefixes of a dictionary of words. Long story short: it works fairly well and probably works faster than (most) regexes and replacements.
Regexes are state machines, and it is not rare for them to re-process the input; with a trie you avoid matching characters twice, so it goes pretty fast.
Code for tries is pretty easy to find on the internet, and the modifications to do efficient matching are trivial, so I will not write the code here.

Create a safe, escaped path base/file name, check if safe

I wonder if there is a generic, portable way to produce filesystem-safe filenames. That is, I have a user-entered string and would like to produce a file with a name that resembles the name they have chosen as closely as possible. The resulting name must not include any path reference or other file-system-special name or tag.
Currently I just replace a bunch of known bad characters with other characters, or empty strings. For example, given the name ABC / DEF* : A Company? I'd produce the string ABC - DEF - A Company. My choice for replacement characters is totally arbitrary as I don't know of a generic escape symbol.
So my related questions are:
Is there a method (perhaps in boost filesystem) that can tell me if the name refers strictly to a file without a path?
Is there a function that tells me if the name is "safe" to use as a file (this may be an additional check from 1 for some filesystems)?
Is there a function to convert a string into a reasonable safe name?
Additional Notes
For #1 I thought of just comparing boost's path::filename() to the original object; if they are the same, then I have a file. However, this still allows things like '..' and '.', but that might be okay if there is a good solution for #2.
In theory I'd have to provide a directory in which the file would reside, since different file-systems may have different requirements. But a global solution for the OS would also be okay.
I already have a function that just replaces a bunch of commonly known unsafe characters.
Common file dialogs cannot be used to do the filtering since the interface may not always allow them and in some cases the user isn't directly aware of the relationship to the file (advanced users would however).
According to POSIX fully portable filenames, the only portable filenames are those that contain only A-Za-z0-9._- and are at most 14 characters long.
That said, a more practical approach is to assume that modern filesystems can cope with longer filenames and to simply replace all characters which are not explicitly marked as "safe" with _. Sometimes, instead of replacing with _, those characters are hex-encoded, like in URLs: sample%20file.txt. KDE applications use this, for example.
As for implementation, it's as simple as s/[^A-Za-z0-9._-]/_/g.
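A sketch of the same whitelist idea in C++ (the helper name is hypothetical; it deliberately ignores platform-reserved names such as ".", ".." or Windows' "CON"):
#include <cctype>
#include <string>

std::string sanitizeFilename(const std::string& input)
{
    std::string out;
    out.reserve(input.size());
    for (unsigned char c : input) {
        // Keep ASCII alphanumerics plus '.', '_' and '-'; everything else
        // (spaces, path separators, '*', '?', ':', and non-ASCII bytes under
        // the default "C" locale) becomes '_'.
        bool safe = std::isalnum(c) || c == '.' || c == '_' || c == '-';
        out.push_back(safe ? static_cast<char>(c) : '_');
    }
    return out;
}

// sanitizeFilename("ABC / DEF* : A Company?") -> "ABC___DEF____A_Company_"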
How portable is portable? Many systems had limits on length, and some probably still do. Is distinguishing between names an issue? Some systems distinguish case, and others don't. What about a final .xxx? For some systems it is significant, for others it's just text.
Neglecting length, the safest bet is to take the opposite approach: create a set of known safe characters, and convert everything outside of that to a specific character. ASCII alphanumerics and '_' seem pretty safe, and you're probably OK (today) with '-', but I doubt the list goes much further. And depending on what you're doing with these names, you might want to force them to a single case, either upper or lower.

How to find hard erroneous interface casts in Delphi (Win32)

I am trying to find some mysterious bugs in an application, and believe the cause may be some hard casts on interfaces. Such casts are unsafe in Delphi, for example
ISomeInterface(CurrentObj)
which should be
CurrentObj as ISomeInterface
In light of the lack of compiler warnings which, in my opinion, should be emitted for hard casts, my question is: how can I easily find these casts in a codebase? Some sort of regex grep search, perhaps? The codebase is large and it would take forever to search for them manually.
You don't say which flavor of regular expressions you're using. I'm going to assume PCRE (Perl-compatible regular expressions), which means these examples won't work with the goofball "regular expressions" option in the IDE's Find dialog. However, they'll work with any self-respecting grep tool, as well as with the built-in regexes in Perl, Ruby, .NET, and many other languages.
You could start with something like this:
\w+\s*\(
which would search for one or more word characters, followed by zero or more spaces, followed by an open parenthesis. This would match:
TObject (Foo)
but depending on your regex library, which options you use, and how you pass the input into it, it might or might not match if there's a line break before the open paren:
TObject
(Foo)
and definitely wouldn't work if there's a comment in between, like this pathological case:
X := ISomeInterface // come back and look at this cast, it's dangerous
(CurrentObj);
But for most well-formatted code, it will be good enough.
Now your problem is that it's giving you way more than just the typecasts -- it's also giving you just about every method call in your code. So some refinement is needed.
If your code follows the typical Delphi coding style, then this would probably work much better:
\b[TIE][A-Z]\w+\s*\(
and make sure you do a case-sensitive match. This will match anyplace where you have a word boundary, followed by a capital T (the traditional prefix for most classes and types) or capital I (the prefix for interfaces) or capital E (the prefix for Exception descendants), followed by another capital letter, then some number of upper- or lowercase letters or digits or underscores, followed by optional spaces and an open paren. There's a good chance this is what you need.
However, if you have any type names that don't follow the usual naming conventions (e.g. TfcTreeView that has a lowercase letter after the T), or if you ever rely on Delphi's case-insensitivity (e.g. if there's any chance you might ever have code like tobject(Foo) or even Tobject(Foo)), then it gets harder. If that's the case, post some details and I may be able to make suggestions.
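For reference, running such a pattern with a PCRE-capable grep (GNU grep built with -P support) over the sources might look like this, assuming .pas files:
grep -nrP --include='*.pas' '\b[TIE][A-Z]\w+\s*\(' .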
If you know the name of the interface, you could use the following regular expression in the Find in Files dialog.
ITest\([^)]+\)
Where ITest is the name of your interface

Converting wide char string to lowercase in C++

How do I convert a wchar_t string from upper case to lower case in C++?
The string contains a mixture of Japanese, Chinese, German and Greek characters.
I thought about using towlower...
http://msdn.microsoft.com/en-us/library/8h19t214%28VS.80%29.aspx
.. but the documentation says that:
The case conversion of towlower is locale-specific. Only the characters relevant to the current locale are changed in case.
Edit: Maybe I should describe what I'm doing. I receive a Unicode search query from a user. It's originally in UTF-8 encoding, but I'm converting it to a wide char string (I may be wrong on the wording). My debugger (VS2008) correctly shows the Japanese, German, etc. characters in the "variable quick watch". I need to go through another set of data in Unicode and find matches of the search string. While this is no problem for me to do when the search is case sensitive, it's more problematic to do it case-insensitively. My (maybe naive) approach to solving the problem would be to convert all input data and output data to lower case and then compare them.
If your string contains all those characters, the code set must be Unicode-based. Unicode (Chapter 4, 'Character Properties') defines character properties, including whether a character is upper case and what its lower-case mapping is, and so on, and a properly implemented library honours these.
Given that preamble, the towlower() function from <wctype.h> is the correct tool to use. If it doesn't do the job, you have a QoI (Quality of Implementation) problem to discuss with your vendor. If you find the vendor unresponsive, then look at alternative libraries. In this case, you might consider ICU (International Components for Unicode).
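If you do go the ICU route, a minimal sketch of the case-insensitive search described in the question's edit might look like this (assuming the query and the data arrive as UTF-8; foldCase() applies Unicode default case folding rather than any single locale's rules, which is usually what you want for search, though it does not normalise the strings):
#include <string>
#include <unicode/unistr.h>

// Hypothetical helper: true if 'needle' occurs in 'haystack', ignoring case.
bool containsCaseless(const std::string& haystackUtf8, const std::string& needleUtf8)
{
    icu::UnicodeString haystack = icu::UnicodeString::fromUTF8(haystackUtf8);
    icu::UnicodeString needle   = icu::UnicodeString::fromUTF8(needleUtf8);
    haystack.foldCase();   // Unicode default case folding, not locale-specific
    needle.foldCase();
    return haystack.indexOf(needle) >= 0;
}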
You have a nasty problem on hand. A Japanese locale will not help with converting German, and vice versa. There are also languages which do not have the concept of capitalization at all (toupper and friends would be a no-op there, I suppose). So, can you break up your string into individual chunks of words from the same language? If you can, then you can convert the pieces and string them back together.
This SO answer shows how to use facets to work with several locales. If this is on Windows, you can consider using Win32 API functions; if you can work with C++ .NET (managed C++), you can use the char.ToLower and string.ToLower functions, which are Unicode-compliant.
Have a look at _wcslwr_l in <wchar.h> (MSDN).
You should be able to run the function on the input for each of the locales.