Best practices for localization - regex

Hi there,
I am localizing my application. I have some regular expressions (written for English) for client-side validation. If I want to localize for non-English languages, what is the best approach?
Should I have a regex for each language chosen for localization?
Comments/suggestions appreciated.
Thanks
DEE

Having separate regular expressions for different language inputs stored in external resource files would be the best route. If you are doing .Net development, you can use resource files. If you are doing Java, you can use property files.
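A minimal sketch of that idea, in Python purely for illustration. The locale codes, the postal-code patterns and the function name are assumptions made up for the example; in a real application the patterns would be loaded from the resource or property files mentioned above rather than hard-coded.

```python
import re

# Hypothetical per-locale patterns. In practice these would come from
# external resource/property files, one entry per supported locale.
POSTAL_CODE_PATTERNS = {
    "en-US": r"[0-9]{5}(-[0-9]{4})?",
    "en-GB": r"[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}",
    "de-DE": r"[0-9]{5}",
}


def is_valid_postal_code(value, locale):
    """Validate value against the pattern registered for the given locale."""
    pattern = POSTAL_CODE_PATTERNS.get(locale)
    if pattern is None:
        raise ValueError("no validation pattern for locale: " + locale)
    # fullmatch requires the whole input to match, not just a prefix.
    return re.fullmatch(pattern, value) is not None
```

The validation code stays the same for every language; only the lookup table changes when a locale is added.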

It depends on what you're validating with your regexps. I'll assume you're doing some very basic natural language processing (may the giant spaghetti monster help you if so).
Obviously, if some parts of your regexps are English-dependent (for example, full English words), you will need to write the corresponding regexps in the other language. Translating the English word by word might not be enough, since you might be capturing some grammar in your regexp.
An example of your regexps might help.

Can you post some examples? As Derek mentioned, separate regexes/files for each language are probably a must-have, but it's hard to tell without looking at what you are doing.

Related

Is it possible using NLP? Natural Language processing

I have a set of Project Names, a set of keywords and a set of paragraphs.
Now my task is to check whether keywords match any project names and whether keywords match any word in any paragraph.
If any paragraphs match a keyword and any project matches the same keyword, then I have to assign those paragraphs to that project.
I have been using string regexes for this. But can this be implemented using Natural Language Processing concepts?
If yes, please let me know how it can be implemented. It would be very helpful for me.
Thanks in advance.
There's no NLP involved in this as such. No matter what you do, you have to go through all the projects and all the paragraphs at least once. Yes, you can optimize the process by using hashmaps or dictionaries, but at the end of the day you will be searching and matching strings no matter what.
Dictionaries make the mapping from projects to paragraphs easy, and regex still does the actual matching.
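Here is a rough sketch of the dictionary approach, in Python for illustration. The function name and the sample data are made up; the question doesn't say how "match" is defined, so this assumes a case-insensitive whole-word match.

```python
import re
from collections import defaultdict


def assign_paragraphs(projects, keywords, paragraphs):
    """Map each project name to the paragraphs that share a keyword with it."""
    assignments = defaultdict(list)
    for keyword in keywords:
        # Assumption: a "match" is a case-insensitive whole-word match.
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        matching_projects = [p for p in projects if pattern.search(p)]
        if not matching_projects:
            continue
        matching_paragraphs = [t for t in paragraphs if pattern.search(t)]
        for project in matching_projects:
            for paragraph in matching_paragraphs:
                if paragraph not in assignments[project]:
                    assignments[project].append(paragraph)
    return dict(assignments)


projects = ["Apollo Billing", "Zeus Search"]
keywords = ["billing", "search", "unused"]
paragraphs = [
    "The billing module is late.",
    "Search indexing works.",
    "Nothing relevant here.",
]
result = assign_paragraphs(projects, keywords, paragraphs)
```

Every keyword is still checked against every project and paragraph, as the answer says; the dictionary only keeps the bookkeeping tidy.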

Try to figure out a good way to split English document into sentences in C#

Is there a good way to split an English document into sentences? English documents frequently include abbreviations such as Mr., Mrs., U.S.A., etc., which makes it difficult to tell where sentences end. Do we need a special natural language library to accomplish this? I suspect that we do.
Thank you.
Technically, you need a complete understanding of English to do the job.
As a decent "almost" solution, you could use a dictionary of "things that end in period" and split on periods which do not immediately follow one of those tokens.
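A minimal sketch of that "almost" solution, in Python for illustration (the question is about C#, but the logic carries over). The abbreviation list here is a tiny made-up sample, not a complete dictionary.

```python
# Hypothetical, deliberately incomplete list of tokens that end in a period
# without ending a sentence.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "u.s.a.", "etc."}


def split_sentences(text):
    """Split on periods, except after a known abbreviation."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Trailing words without a closing period form the last sentence.
        sentences.append(" ".join(current))
    return sentences
```

It fails in the obvious case where a sentence really does end with one of the abbreviations, which is why it is only an "almost" solution.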
If every sentence begins with a capital letter and ends with a period, then I would define a sentence as above, with the added conditions that it contains more than one word and does not end with a common abbreviation (checked against an abbreviation list or a regex such as [a-zA-Z].+).
You can use a sentence detector provided by one of the numerous NLP tools, such as OpenNLP or Stanford CoreNLP. They can handle cases like Mr., Mrs., U.S.A., etc.
Both OpenNLP and Stanford CoreNLP are written in Java.
SharpNLP is a C# port of OpenNLP.

How to verify regexp patterns?

What are the common ways to verify that a given regex pattern works well for a given scenario, and to check the results?
I would like to know in general, not for a particular programming language. Also, what is the best way to learn to write regular expressions?
Books: Mastering Regular Expressions is the definitive guide to regular expressions. The Regular Expressions Cookbook is said to be lighter and more easily applicable.
Sites: Friedl's companion site is a good start. Regexlib is a source of idioms and patterns.
Software: RegexBuddy is a good, paid regex verifier.
I've used this resource when learning: http://www.regular-expressions.info/ and found myself going back there whenever there was something I needed to remember. It's very useful for learning and covers the basics very well. They also have various links to programs which can be used to verify regular expressions.
This is not a "real" verification, but RegexBuddy allows you to verify that your regex does what you expect it to do on any sample data you provide. It also translates the regex into an English description that can help to figure out mistakes. Plus, it knows all major regex flavors and can translate regexes between them.
For testing regular expressions, you can use regex test tools like the one below:
http://www.regextester.com/
To learn more about how to learn regular expressions, please check the following SO threads:
Learning Regular Expressions
How to master Regular Expressions?
https://stackoverflow.com/questions/465119/how-do-i-learn-regular-expressions-closed
RAD Regexp Designer is a great tool.
Set up an automated test using your tools of choice (because regex implementations vary from language to language and library to library) which applies the regex to a variety of both matching and non-matching inputs to verify that you get the correct results.
While RegexBuddy and the like may be helpful for initially creating the regex (or may not; I've never used them), you will still need to maintain it, just like any other code. When that time comes, it's vastly preferable to have a test script that will run through all your old test inputs (plus the new ones which created the need for the change) in a matter of seconds rather than having to sit on a website for tens of minutes, if not hours, trying to remember all your test inputs and manually re-run them to make sure you didn't break anything.
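For what it's worth, such a test script can be very small. This is only a sketch using Python's unittest; the phone-number pattern and the sample inputs are made up for the example, and you would substitute your own regex and the test tools of your own language.

```python
import re
import unittest

# Hypothetical pattern under test: a phone number such as 555-555-5555.
PHONE = re.compile(r"[0-9]{3}-[0-9]{3}-[0-9]{4}")

SHOULD_MATCH = ["555-555-5555", "123-456-7890"]
SHOULD_NOT_MATCH = ["5555555555", "555-555-555", "555-555-55555", "abc-def-ghij", ""]


class PhoneRegexTest(unittest.TestCase):
    def test_matching_inputs(self):
        for text in SHOULD_MATCH:
            with self.subTest(text=text):
                self.assertIsNotNone(PHONE.fullmatch(text))

    def test_non_matching_inputs(self):
        for text in SHOULD_NOT_MATCH:
            with self.subTest(text=text):
                self.assertIsNone(PHONE.fullmatch(text))
```

Saved as a test module, this can be run with python -m unittest. Whenever the regex has to change, add the input that triggered the change to one of the two lists and re-run.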

Regex misspellings

I have a regex created from a list in a database to match names for types of buildings in a game. The problem is typos: sometimes those writing instructions for their team in the game will misspell a building name, and obviously the regex will then not pick it up (e.g. spelling "University" as "Unversity").
Are there any suggestions on making a regex match misspellings of 1 or 2 letters?
The regex is dynamically generated and run on a local machine that can handle a lot more load, so as a last resort I could algorithmically create versions of each word with a letter missing, and then others with letters added in.
I'm using PHP but I'd hope that any solution to this issue would not be PHP specific.
Allow me to introduce you to the Levenshtein distance, a measure of the difference between two strings, defined as the minimum number of single-character edits (insertions, deletions, or substitutions) needed to convert one string into the other.
It's also built into PHP.
So, I'd split the input file by non-word characters, and measure the distance between each word and your target list of buildings. If the distance is below some threshold, assume it was a misspelling.
I think you'd have more luck matching this way than trying to craft regexes for each special case.
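Since the question asked for something not PHP-specific, here is a sketch of that approach in Python with the distance written out by hand. The building list, the function names and the threshold of 2 are assumptions for the example.

```python
import re


def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


# Hypothetical building names; the real list would come from the database.
BUILDINGS = ["university", "barracks", "granary"]


def find_buildings(text, max_distance=2):
    """Return (word, building) pairs whose distance is within the threshold."""
    found = []
    for word in re.split(r"\W+", text.lower()):
        if not word:
            continue
        for building in BUILDINGS:
            if levenshtein(word, building) <= max_distance:
                found.append((word, building))
    return found
```

A fixed threshold can produce false positives on very short building names, so it may be worth scaling it with the length of the word.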
Google's implementation of "did you mean" by looking at previous results might also help:
How do you implement a "Did you mean"?
What is Soundex() ? – Teifion (28 mins ago)
Soundex, like the levenshtein function Triptych mentions, is a means of comparing strings; it encodes words by how they sound. See: http://us3.php.net/soundex
You could also look at metaphone and similar_text. I would have put this in a comment but I don't have enough rep yet to do that. :D
Back in the day, we sometimes used Soundex() for these problems.
You're in luck; the algorithms folks have done lots of work on approximate matching of regular expressions. The oldest of these tools is probably agrep originally developed at the University of Arizona and now available in a nice open-source version. You simply tell agrep how many mistakes you are willing to tolerate and it matches from there. It can also match other blocks of text besides lines. The link above has links to a newer, GPLed version of agrep and also a number of language-specific libraries for approximate matching of regular expressions.
This might be overkill, but Peter Norvig of Google has written an excellent article on writing a spell checker in Python. It's definitely worth a read and might apply to your case.
At the end of the article, he's also listed contributed implementations of the algorithm in various other languages.

Under what situations are regular expressions really the best way to solve the problem?

I'm not sure if Jeff coined it but it's the joke/saying that people who say "oh, I know I'll use regular expressions!" now have two problems. I've always taken this to mean that people use regular expressions in very inappropriate contexts.
However, under what circumstances are regular expressions really the best answer? What problems are they really the best or maybe only way to solve a situation?
RegExps are good for:
Text format validations (email, URL, numbers)
Text searches/substitutions.
Mappings (e.g. url pattern to function call)
Filtering some texts (related to substitution)
Lexical analysis during parsing.
They can be used to validate anything that has a pattern, like:
Social Security Number
Telephone Number ( 555-555-5555 )
Email Address (something@example.com)
IP Address (but it's more complex to make sure it's valid)
All those have patterns and are easily verifiable by RegEx.
They are harder to use for entries that follow a logic rather than a pattern, such as a credit card number, but they can still be used for some client-side validation.
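To illustrate the pattern-versus-logic distinction with a sketch (Python, names made up for the example): a regex can check the shape of a card number, while the checksum part needs ordinary code. The checksum assumed here is the Luhn algorithm, and the 13 to 19 digit range is an assumption, not a full card-format rule.

```python
import re

# Shape only: 13 to 19 digits (an assumed range for this sketch).
CARD_SHAPE = re.compile(r"[0-9]{13,19}")


def luhn_valid(number):
    """Luhn checksum: the 'logic' part a regex cannot express."""
    total = 0
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            # Double every second digit, counting from the right.
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def looks_like_card_number(number):
    """Regex for the pattern first, then the checksum for the logic."""
    return CARD_SHAPE.fullmatch(number) is not None and luhn_valid(number)
```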
So the best ways?
To sanitize data entry on the client side before sanitizing it on the server.
To make "Search and Replace" of some strings that contain a pattern.
I'm sure I am missing a lot of other cases.
Regular expressions are a great way to parse text that doesn't already have a parser (unlike, e.g., XML). I have used them to create a parser for the mod_rewrite syntax of the .htaccess file in my URL Rewriter project (http://www.codeplex.com/urlrewriter), for example.
They are really good when you want to be more specific than "*" or "?", like "3 letters, then 2 numbers, then a $ sign, then a period".
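That last description translates almost word for word into a regex. A small sketch in Python, assuming "letters" means ASCII letters and "numbers" means single digits:

```python
import re

# 3 letters, then 2 digits, then a literal $ sign, then a literal period.
# Both $ and . are regex metacharacters, so they are escaped.
PATTERN = re.compile(r"[A-Za-z]{3}[0-9]{2}\$\.")


def matches(text):
    """True only if the whole string has exactly that shape."""
    return PATTERN.fullmatch(text) is not None
```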
The quote is from an anti-Perl rant from Jamie Zawinski. I think Perl used to do regex really badly but now it seems to be a standard engine for a lot of programs.
But the same sentiment still applies. If you don't know how to use regex, you'd better not try something really fancy, otherwise you get one of these tags too (see bronze list) ;o)
https://stackoverflow.com/users/730/keng
They are good for matching or finding text that takes a very specific and simple format. By "simple" I mean not nested and smaller than the entire html spec, for example.
They are primarily of value for highly structured text parsing. If you use named groups (an option in most mature regex systems), you have a phenomenally powerful and crisp way to handle the strings.
Here's an example. Consider that netstat, in its various iterations on different Linux OSes and in its different versions, can return different results. Sometimes there is an extra column, sometimes there is a shift in the date/time format. Regexes give you a powerful way to handle that with a single expression. Couple that with named groups, and you can retrieve the data without hacks like:
1) split on spaces
2) ok, the netstat version is X so I need to add 1 to all array references past column 5.
3) ok, the netstat version is Y so I need to make sure that I use multiple array references for the date info.
YUCK. Simple to fix in a Regex :-)
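A rough sketch of what that can look like, in Python. The column layout and the sample lines are illustrative assumptions, not the output of any particular netstat version, and the group names are made up.

```python
import re

# Named groups for an assumed netstat-like layout. The state and the
# trailing extra column are both optional, so one expression copes with
# lines that have them and lines that don't.
NETSTAT_LINE = re.compile(
    r"^(?P<proto>tcp6?|udp6?)\s+"
    r"(?P<recv_q>[0-9]+)\s+"
    r"(?P<send_q>[0-9]+)\s+"
    r"(?P<local>\S+)\s+"
    r"(?P<remote>\S+)"
    r"(?:\s+(?P<state>[A-Z][A-Z_0-9]*))?"
    r"(?:\s+(?P<extra>\S.*))?$"
)


def parse_netstat_line(line):
    """Return a dict of named fields, or None for lines that don't match."""
    match = NETSTAT_LINE.match(line.strip())
    return match.groupdict() if match else None


# Illustrative sample lines (made up for this sketch).
plain = parse_netstat_line("tcp   0   0 127.0.0.1:5432   0.0.0.0:*   LISTEN")
with_extra = parse_netstat_line(
    "tcp   0   0 127.0.0.1:5432   0.0.0.0:*   LISTEN   1234/postgres"
)
```

The calling code asks for plain["local"] or with_extra["extra"] by name and never has to count columns.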