Indexing strings and internationalization - c++

I have recently come up with an indexing algorithm for finding duplicate customer records. Long story short, this all works very well.
However, I'd like "Diviér" to match "Divier", and "Aether" to match "Æther".
No problem, because removing diacritics is possible with libicu or boost::locale, and the program uses wstring.
However, here is my problem: normalizing/Latinizing a word changes its meaning in a way that matching them may no longer make sense. I would like some input on whether this would be acceptable for names...
Also, what if someone has a Chinese name? This wouldn't be normalizable in this way, would it?
Do you have any recommendations on how to approach this?
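To make the matching idea concrete, here is a minimal sketch of a "match key": the stored name stays untouched, and only a folded copy is used for indexing. The function name match_key and the tiny fold table are hypothetical and purely illustrative; a real system would delegate the folding to ICU or boost::locale as mentioned above. Characters outside the table (a Chinese name, for instance) simply pass through unchanged.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper: builds a comparison key from a name.
// The fold table covers only four code points and is NOT a substitute
// for a real transliteration library.
std::u32string match_key(const std::u32string& name) {
    static const std::unordered_map<char32_t, std::u32string> folds = {
        {U'\u00E9', U"e"},  {U'\u00C9', U"e"},   // é, É
        {U'\u00E6', U"ae"}, {U'\u00C6', U"ae"},  // æ, Æ
    };
    std::u32string key;
    for (char32_t c : name) {
        auto it = folds.find(c);
        if (it != folds.end()) {
            key += it->second;                               // folded letter
        } else if (c >= U'A' && c <= U'Z') {
            key += static_cast<char32_t>(c - U'A' + U'a');   // ASCII lowercase
        } else {
            key += c;                                        // anything else is kept as-is
        }
    }
    return key;
}
```

With this, match_key(U"Divi\u00E9r") and match_key(U"Divier") compare equal, while the original spelling is still available for display.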

You should look much more at the addresses and not too much at the names. In the end, names can be very misleading. For example, depending on the country, the transcription of Chinese, Russian or Japanese characters may vary. Sometimes name fields are too short to capture the full name of a person (especially common with Indian names), which leads to all kinds of seemingly random abbreviations. Sometimes people will omit middle names, sometimes they will not. And sometimes there are misspellings that give proper but different names.
So in my opinion the name should be the least important criterion in finding duplicates.

Related

Practical user validation (sensitivity and specificity)?

When I was first learning how to use regular expressions we were taught how to parse things like phone numbers (obviously always 5 digits, an optional space and a further 6 digits) and email addresses (obviously always alphanumerics, then a single '@', then alphanumerics followed by a '.' and three letters), which we should always do to validate the data that the user enters.
Of course, as I've developed I've learned how silly the basic approach can be, but the more I look, the more I question the concept altogether. The most careful, correct validation of something like an email address through regexes ends up being hundreds if not thousands of characters long in order to both accept all the legal cases and correctly reject only the illegal ones. Even worse, all that effort does absolutely nothing for the actual validity: the user may have accidentally added an 'a', or may not use that email address at all, or may be using someone else's address, or may even use a '+' symbol which is flagged inappropriately.
Yet at the same time seemingly every site I come across still does this kind of technical checking, preventing me from putting more obscure characters in an email address or name, or objecting to the idea that someone would have more or less than a single title, then a single firstname and a single lastname, all made purely from latin characters yet without any form of check that it's my real name.
Is there a benefit to this? Once injection attacks are handled (which should be through methods other than sterilizing the input) is there any other point to these checks?
Or on the other hand, is there actually a sure fire way to actually validate user details other than to 'use' them in whatever way makes sense contextually and see if it falls over?
Overly validating things is indeed one of the banes of the internet. Especially if the person writing the validation code has no actual knowledge of the problem domain. No, you probably do not actually know what the valid syntax for email addresses is. Or real-world addresses, especially internationally. Or telephone numbers. Or people's names.
Looking at a few localised examples (my email address) and extrapolating to rules covering all possible values within the domain (all email addresses) is madness. Unless you have perfect domain knowledge, you should not come up with rules about the domain. In the case of email addresses this leads to only a very narrow subset of possible email addresses actually being usable in daily life. Gee, thanks, guys.
As for people's names, whatever a person tells you is their name is by definition their name. It's what you call them by. You cannot validate it automatically; they'd have to send in a copy of their birth certificate for actual official validation. And even then, is that really what you're interested in knowing? Or do you merely need a "handle" to greet and identify them on your forum page?
Facebook does (did?) strict name validation in order to force people to use their real names to register. Well, many people I know on Facebook still use some made up nonsense name. The filter obviously doesn't work. Having said this, perhaps it works well enough for Facebook so that most people use their actual name because they couldn't be bothered to figure out which particular pattern will pass the validation. In that sense, such a filter can serve some purpose.
In the end it's up to you to decide on reasons for validation and the specific limits you want to enforce. The issue is that people often do not think about the bigger picture before writing validation code and they have no good reason for their specific limits. Don't fall into that trap.
is there any other point to these checks?
Certainly. Knowing that your data is valid is very important. In the case of email addresses, for example, sending an email to an address you haven't validated will, at the very least, lead to bounces. Enough bounces and your mailhost might block you for spamming. Not validating a phone number could lead to unnecessary costs if your app tries to send SMS to them. The list goes on and on.
Or on the other hand, is there actually a sure fire way to actually validate user details other than to 'use' them in whatever way makes sense contextually and see if it falls over?
Yes, but regex is generally a bad way to validate data. If a phone number is supposed to be "5 digits a space then 6 digits", then your check is going to fail if I type "5 digits two spaces then 6 digits" or "5 digits a dash then 6 digits" or "11 digits". Use common sense, and expect any crazy format the user provides. Know what the absolute minimal requirement is. For example, if you need 11 digits total, then strip everything that's not a digit first. Then formatting doesn't matter.
Also, read the RFCs. I can't count the number of times my email address has been rejected because it has a plus sign in it. The number of those that were large tech-oriented companies with programmers who should know better was rather disappointing.
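As a rough sketch of the "strip everything that's not a digit first" idea: the function name normalize_phone and the default of 11 digits are assumptions taken from the example above, not a general rule for phone numbers.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: keeps only the digits of the input and returns them
// if exactly `required_digits` remain; returns an empty string otherwise.
std::string normalize_phone(const std::string& input,
                            std::size_t required_digits = 11) {
    std::string digits;
    for (unsigned char c : input) {
        if (std::isdigit(c)) {
            digits += static_cast<char>(c);
        }
    }
    return digits.size() == required_digits ? digits : std::string();
}
```

"01234 567890", "01234  567890" and "01234-567890" all normalize to the same 11 digits, so the formatting the user chose no longer matters.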

difference in regex logic

In my website, I am using the below regex to validate email.
^[a-zA-Z0-9]+[a-zA-Z0-9_.-]+[a-zA-Z0-9_-]+@[a-zA-Z0-9]+[a-zA-Z0-9.-]+[a-zA-Z0-9]+.[a-z]{2,4}$
My doubts are:
Can I use the below regex for the same functionality?
^[a-zA-Z0-9_.-]+@[a-zA-Z0-9.-]+.[a-z]{2,4}$
The reason I ask is that I tried to study the meaning, and I got confused about whether
[a-zA-Z0-9_.-] covers all the cases matched by [a-zA-Z0-9] and [a-zA-Z0-9_-]
I am not sure about this, as I am a beginner.
I got the regex from
http://regexlib.com/
I checked both regexes in http://regex101.com/#pcre, and I can't find any difference in the results. Maybe that is because of my limited knowledge.
Please give a clarification. Thanks to all in advance
Maybe it's not the answer you were looking for, but I have to say that I ended up with this kind of email validation: ^.+@.+\..{2,}$ after trimming it.
What does it check? Existence of some symbols, @ itself, some other symbols, a dot, and at least two symbols for the top-level domain. It says "Dude, there should be an email, not your hilarious username". And that's enough, I guess.
By the way, .[a-z]{2,4}$ is a huge mistake for checking the TLD, since there are a few popular domains that are longer than 4 symbols (e.g., .travel) and a lot of less popular ones.
Why do I think that you don't need detailed validation? First of all, there are a lot of requirements which you'll miss anyway. Do you know that Cyrillic symbols are allowed in an email address?
And, please, think about what you want from this validation. To avoid incorrect emails? You won't. Somebody will enter an email which meets all of your requirements, but it'll be incorrect anyway. Is email@gmial.com a good one? No. Will it be caught by the regexp? I'm afraid the answer is "no" once again.
So it's better to explain that the user should provide a valid email to get a confirmation mail, and to include an explanation such as "if you aren't the one who registered at mysite.com, please just ignore this letter" in the email text.
Because a regexp will never filter enough, but you can lose a couple of users with strange email addresses because of it.
And since this should be an answer for your question:
It won't be the same functionality, but both regexp's are far from being perfect.
The long regexp checks that the first symbol in the login must not be a dot, dash or underscore and that the last symbol must not be a dot, while the other symbols may be; but it overlooks the fact that a login might be shorter than 3 symbols. The short regexp is better (= simpler), but it doesn't enforce the requirements mentioned above.
So, if you want to use your variant, just remove the 4 from it. If you need the logic of the original one, you can't make it shorter.
Differences might be found using these examples: .mail@mail.com, a@gmail.com, gmail@a.com
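To see the difference without an online tester, here is a small sketch using std::regex, with '@' as the separator. The function names are hypothetical; the two patterns are the ones from the question, copied as written (including the unescaped dot before the TLD).

```cpp
#include <regex>
#include <string>

// The long pattern from the question.
bool matches_long(const std::string& s) {
    static const std::regex re(
        R"(^[a-zA-Z0-9]+[a-zA-Z0-9_.-]+[a-zA-Z0-9_-]+@[a-zA-Z0-9]+[a-zA-Z0-9.-]+[a-zA-Z0-9]+.[a-z]{2,4}$)");
    return std::regex_match(s, re);
}

// The proposed short pattern from the question.
bool matches_short(const std::string& s) {
    static const std::regex re(R"(^[a-zA-Z0-9_.-]+@[a-zA-Z0-9.-]+.[a-z]{2,4}$)");
    return std::regex_match(s, re);
}
```

For each of the three example addresses, matches_short returns true and matches_long returns false: the long pattern needs at least three characters on both sides of the '@' before the TLD and refuses a leading dot, while the short one accepts all of that.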

Regex - name validation

I am looking for a regular expression to check names. I have searched the net and also used the suggestions that were given to me by StackOverflow while posting this question.
I also know it's possible in stages, but I'm looking for a regex-1-liner to keep my code clean, simple and most important: fast.
What do I need:
A regular expression that checks names of people while they are registering to my site. I want to allow names as:
Name
Name surename
Name O'brian
Name surename secondarysurname
Name surename-surnametwo
N. Surename
But I don't want to allow names as:
Name (double spacebar)
Name -- surename (double minus)
Name--' (just bullshit)
Well, I think you understand what I mean and what I don't want to allow.
I only want to use a-zA-Z and - . '
I think that's the only thing I need to allow. The - . ' signs can only be used once between or after a word, since a name like 'name O''Brian' does not exist.
But a name like 'Name surename secondary-thirdsurname' should be allowed. So one spacebar and one minus sign.
I came up with several regex' using http://public.kvalley.com/regex/regex.asp and other regex programs. But I'm just a noob with regex'.
I hope somebody knows a lot about regular expressions and is willing to help me. Because at the moment.. I'm stuck :(
Thanks in advance,
Jelmer
ps. If you have any questions regarding my question. Please ask them because I'd really like to have your help!
A general rule of thumb that applies to many aspects of coding, but especially to regex design, is, your code can be:
Simple
Clean
Fast
Pick two.
In addition to that, I guarantee that a one-liner this complex will never be clean. Break it up into regex variables and comment it liberally. Later on, you'll be thankful that you did! While you're at it, turn it into a generic name validator class that you can reuse. While Perl can be quite munged by experts into something totally unholy, its beauty often comes out when we follow the same laws of politeness and cleanliness that we follow in other, more structured languages.
TL;DR: Don't make it a one-liner. Please.
This is not bulletproof but should give you a hint in the right direction:
([a-zA-ZáéíóúñÑ]+ ?'?-?)+
UPDATE: Here's a better approach according to @Tim's suggestion:
([a-zA-ZáíóúñÑö]+( |'|-)?)+
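For comparison, here is a stricter sketch in C++ that follows the rules stated in the question literally: only a-zA-Z words, joined by exactly one space, hyphen or apostrophe, or by a dot with an optional space. The pattern and the name plausible_name are my own assumptions, not something from the answers above, and by design it rejects accented or non-Latin names, which may well be too strict for real users.

```cpp
#include <regex>
#include <string>

// Hypothetical validator: ASCII-only words with single separators between them.
// A trailing dot (as in an initial) is allowed; doubled separators are not.
bool plausible_name(const std::string& name) {
    static const std::regex re(R"(^[A-Za-z]+(?:(?: |-|'|\. ?)[A-Za-z]+)*\.?$)");
    return std::regex_match(name, re);
}
```

This accepts "Name O'brian", "N. Surename" and "Name surename secondary-thirdsurname", and rejects "Name -- surename" and "Name--'".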

Word language detection in C++

After searching on Google, I haven't found any standard way or library for detecting which language a particular word belongs to.
Suppose I have any word, how could I find which language it is: English, Japanese, Italian, German etc.
Is there any library available for C++? Any suggestion in this regard will be greatly appreciated!
Simple language recognition from words is easy. You don't need to understand the semantics of the text. You don't need any computationally expensive algorithms, just a fast hash map. The problem is, you need a lot of data. Fortunately, you can probably find dictionaries of words in each language you care about. Define a bit mask for each language, that will allow you to mark words like "the" as recognized in multiple languages. Then, read each language dictionary into your hash map. If the word is already present from a different language, just mark the current language also.
Suppose a given word is in both English and French. Then when you look it up (e.g. "commercial"), it will map to ENGLISH|FRENCH. Suppose ENGLISH = 1, FRENCH = 2, ...; you'll find the value 3. If you want to know whether a word is in your language only, you would test:
int langs = dict["the"];
// Parentheses are required: == binds tighter than | in C++.
if ((langs | mylang) == mylang) {
    // no other language
}
Since there will be other languages, probably a more general approach is better.
For each bit set in the bit mask, add 1 to the corresponding language. Do this for n words. After about n=10 words, in a typical text, you'll have 10 for the dominant language, maybe 2 for a language that it is related to (like English/French), and you can determine with high probability that the text is English. Remember, even if you have a text that is in a language, it can still have a quote in another, so the mere presence of a foreign word doesn't mean the document is in that language. Pick a threshold; it will work quite well (and very, very fast).
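A minimal sketch of the bit mask dictionary and the per-language counting described above. The type and function names (Dict, add_word, count_votes) and the three-language enum are placeholders for illustration; loading real dictionaries is left out.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// One bit per language; the set of languages here is only an example.
enum Lang : std::uint32_t { ENGLISH = 1u << 0, FRENCH = 1u << 1, GERMAN = 1u << 2 };

using Dict = std::unordered_map<std::string, std::uint32_t>;

// Marks `word` as belonging to `lang`, keeping any languages already recorded.
void add_word(Dict& dict, const std::string& word, std::uint32_t lang) {
    dict[word] |= lang;
}

// Returns one counter per language bit: votes[i] is the number of words
// whose mask has bit i set. Unknown words contribute nothing.
std::vector<int> count_votes(const Dict& dict,
                             const std::vector<std::string>& words,
                             int num_langs) {
    std::vector<int> votes(num_langs, 0);
    for (const auto& w : words) {
        auto it = dict.find(w);
        if (it == dict.end()) continue;
        for (int bit = 0; bit < num_langs; ++bit) {
            if (it->second & (1u << bit)) ++votes[bit];
        }
    }
    return votes;
}
```

The caller then picks the language with the highest count, provided it clears whatever threshold was chosen.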
Obviously the hardest thing about this is reading in all the dictionaries. This isn't a code problem, it's a data collection problem. Fortunately, that's your problem, not mine.
To make this fast, you will need to preload the hash map, otherwise loading it up initially is going to hurt. If that's an issue, you will have to write store and load methods for the hash map that block load the entire thing in efficiently.
I have found Google's CLD very helpful. It's written in C++, and from their web site:
"CLD (Compact Language Detector) is the library embedded in Google's Chromium browser. The library detects the language from provided UTF8 text (plain text or HTML). It's implemented in C++, with very basic Python bindings."
Well,
Statistically trained language detectors work surprisingly well on single-word inputs, though there are obviously some cases where they can't possibly work, as observed by others here.
In Java, I'd send you to Apache Tika. It has an Open-source statistical language detector.
For C++, you could use JNI to call it. Now, time for a disclaimer warning. Since you specifically asked for C++, and since I'm unaware of a C++ free alternative, I will now point you at a product of my employer, which is a statistical language detector, natively in C++.
http://www.basistech.com, the product name is RLI.
This will not work well one word at a time, as many words are shared. For instance, in several languages "the" means "tea."
Language processing libraries tend to be more comprehensive than just this one feature, and as C++ is a "high-performance" language it might be hard to find one for free.
However, the problem might not be too hard to solve yourself. See the Wikipedia article on the problem for ideas. Also a small support vector machine might do the trick quite handily. Just train it with the most common words in the relevant languages, and you might have a very effective "database" in just a kilobyte or so.
I wouldn't hold my breath. It is difficult enough to determine the language of a text automatically. If all you have is a single word, without context, you would need a database of all the words of all the languages in the world... the size of which would be prohibitive.
Basically you need a huge database of all the major languages. To auto-detect the language of a piece of text, pick the language whose dictionary contains the most words from the text. This is not something you would want to have to implement on your laptop.
Spell check the first 3 words of your text in all languages (the more words you spell check, the better). The language with the fewest spelling errors "wins". With only 3 words it is technically possible to have the same spelling in a few languages, but with each additional word it becomes less probable. It is not a perfect method, but I figure it would work in most cases.
Otherwise if there is equal number of errors in all languages, use the default language. Or randomly pick another 3 words until you have more clear result. Or expand the number of spell checked words to more than 3, until you get a more clear result as well.
As for the spell checking libraries, there are many, I personally prefer Hunspell. Nuspell is probably also good. It is a matter of personal opinion and/or technical capabilities which one to use.
I assume that you are working with text not with speech.
If you are working with Unicode, it provides a block (slot) of code points for each script.
So you can check whether all characters of a particular word fall within one script's block.
For more help on Unicode blocks, you can find it here
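A small sketch of the block check, assuming the word has already been decoded to UTF-32. The enum and function names are hypothetical, and only a handful of blocks are listed. Keep in mind that a block identifies a script, not a language: Latin is shared by English, Italian, German and many others, so this can only narrow things down.

```cpp
#include <string>

enum class Script { Latin, Cyrillic, Hiragana, Katakana, Han, Unknown };

// Classifies one code point by a few well-known Unicode block ranges.
Script script_of(char32_t c) {
    if ((c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z')) return Script::Latin;
    if (c >= 0x00C0 && c <= 0x024F) return Script::Latin;     // Latin-1 letters, Extended-A/B (approximate)
    if (c >= 0x0400 && c <= 0x04FF) return Script::Cyrillic;  // Cyrillic
    if (c >= 0x3040 && c <= 0x309F) return Script::Hiragana;  // Hiragana
    if (c >= 0x30A0 && c <= 0x30FF) return Script::Katakana;  // Katakana
    if (c >= 0x4E00 && c <= 0x9FFF) return Script::Han;       // CJK Unified Ideographs
    return Script::Unknown;
}

// Returns the script shared by every character of the word,
// or Script::Unknown for an empty or mixed-script word.
Script word_script(const std::u32string& word) {
    if (word.empty()) return Script::Unknown;
    const Script first = script_of(word[0]);
    for (char32_t c : word) {
        if (script_of(c) != first) return Script::Unknown;
    }
    return first;
}
```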

Consecutive wild characters in email

I was looking at email validation. I read in the RFC specs that consecutive . (dot) characters are not allowed, as in mail..me@server.com.
But are different wild characters allowed to occur consecutively? Like mail.$me@server.com.
And if so, how do I make a regular expression which will accept only a single occurrence of each wild character in a row, as long as neighbouring ones are different? It shouldn't accept ones like .. && $$, but should accept ones like &$ .$ &.
And since there's a big list of wild characters allowed, I don't think a regex like \^(&&|$$|..)\ etc. is an option.
There are a few RFC compliant email validation regexes. They are not pretty; in fact they are pretty awful, spanning hundreds of characters. You really don't want to create one: either use an existing one or write regular code you can understand and maintain.
This is one of the RFC compliant regexes
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Check this link for expanded information and alternative (more practical) regexes http://www.regular-expressions.info/email.html
I finally used something like this:
/^([a-zA-Z0-9]+([\.\!\'\#\$\%\&\*\+\-\/\=\?\^\_\`\{\|\}\~]{0,1}))*[a-zA-Z0-9]+\@(([a-zA-Z0-9\-]+[\.]?[a-zA-Z0-9]+){0,2})[\.][a-zA-Z]{2,4}$/
Not pretty :)
but very much served my specifications.
Different characters like $ are allowed to occur multiple times in a row, yes. sam$$iam@example.com is a completely valid email address.
I would use a simple regex of email validation + another regex that checks double chars like /[.&$]{2}/
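One caveat with a character class plus {2}: it also matches two different characters from the set, such as &$. If the goal is to flag only the same character twice in a row, a backreference does that. Below is a sketch with std::regex; the function name and the three-character set are just the examples from the question, not the full list of allowed characters.

```cpp
#include <regex>
#include <string>

// Hypothetical check: true if the same special character appears twice in a row.
// ([.&$]) captures one character from the set and \1 requires that exact
// character again, so ".." matches but ".$" does not.
bool has_repeated_special(const std::string& text) {
    static const std::regex re(R"(([.&$])\1)");
    return std::regex_search(text, re);
}
```

This flags "mail..me" and "sam$$iam" but lets "mail.$me" through. Whether rejecting such addresses is a good idea is a separate question, since a repeated $ is valid.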
I suppose it depends on what you're doing with this email validation, but I've done this for years in online ASP.NET regex validators for form entry purposes.
For a few months I thought I had what was a pretty cool regular expression to take care of this. I found it online and it seemed to be a popular one. However, on several occasions I'd get a call from a customer trying to fill out the application where the form validation didn't like their email address. And who knows how many people had the same problem but didn't call.
I learned the lesson the hard way that it's better to err on the side of greediness than to try to be too strict. In other words, since there are soooooo many rules in defining what makes an email address valid (and invalid), I simply define a loose open-ended regex to cover all of my bases. It may match some invalid email addresses as well, but for my purposes that's not as big of a deal. Besides, quite honestly -- most of the time if the user is screwing up their email address it's going to be a misspelling which regex isn't going to catch anyways.
So here's what I use now:
^[^<>\s\@]+(\@[\w\-]+(\.[\w\-]+)+)$
And here's a working example to test this:
http://regexhero.net/tester/?id=b90d359f-0dda-4b2a-a9b7-286fc513cf40
This doesn't address your primary concern as this will still match consecutive dots, dashes, etc. And I still can't claim this will match every valid email address because I honestly don't know. But I can say that I've been using it for the past 3 years with over 25,000 users and not a single complaint.
See these answers:
stackoverflow.com/questions/997078/email-regular-expression
stackoverflow.com/questions/201323/what-is-the-best-regular-expression-for-validating-email-addresses
stackoverflow.com/questions/36261/test-expand-my-email-regex
Just remember, as stated before: the only way to tell if an email address is truly valid is to send email to it!