Regexp for Tokenizing English Text

What would be the best regular expression for tokenizing an English text?
By an English token, I mean an atom consisting of the maximum number of characters that can be meaningfully used for NLP purposes. An analogy is a "token" in any programming language (e.g. in C, '{', '[', 'hello', '&', etc. can be tokens). There is one restriction: though English punctuation characters can be "meaningful", let's ignore them for the sake of simplicity when they do not appear in the middle of \w+. So "Hello, world." yields 'hello' and 'world'; similarly, "You are good-looking." may yield either [you, are, good-looking] or [you, are, good, looking].

Treebank Tokenization
Penn Treebank (PTB) tokenization is a reasonably common tokenization scheme used for natural language processing (NLP) work.
You can find a sed script with the appropriate regular expressions to get this tokenization here.
Software Packages
However, most NLP packages provide ready-to-use tokenizers, so you don't really need to write your own. For example, if you're using Python you can just use the TreebankWordTokenizer provided with NLTK. If you're using the Java-based Stanford Parser, it will by default tokenize any sentence you give it using its edu.stanford.nlp.process.PTBTokenizer.
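For instance, a minimal sketch with NLTK (this assumes the nltk package is installed; the tokenizer itself needs no extra data downloads):

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
# PTB-style tokenization: punctuation is split off and contractions are separated.
print(tokenizer.tokenize("Hello, world."))
# roughly: ['Hello', ',', 'world', '.']
print(tokenizer.tokenize("You aren't good-looking."))
# roughly: ['You', 'are', "n't", 'good-looking', '.']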

You probably shouldn't try to use a regular expression for tokenizing English text. In English some tokens have several different meanings and you can only know which is right by understanding the context in which they are found, and that requires understanding the meaning of the text to some extent. Examples:
The character ' could be an apostrophe or it could be used as a single-quote to quote some text.
The period could be the end of a sentence or it could signify an abbreviation. Or in some cases it could fulfil both roles simultaneously.
Try a natural language parser instead. For example you could use the Stanford Parser. It is free to use and will do a much better job than any regular expression at tokenizing English text. That's just one example though - there are also many other NLP libraries you could use.

You can split on [^\p{L}]+. It will split on each group of characters that doesn't contain letters.
Resources :
regular-expressions.info - unicode
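For what it's worth, a quick sketch of that split in Python; this assumes the third-party regex module, since the standard library's re does not understand \p{L}:

import regex  # third-party module (pip install regex); re lacks \p{L}

text = "Hello, world. You are good-looking."
# Split on every run of characters that are not Unicode letters.
tokens = [t for t in regex.split(r'[^\p{L}]+', text) if t]
print(tokens)  # ['Hello', 'world', 'You', 'are', 'good', 'looking']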

There are some complexities.
A word will consist of [A-Za-z0-9\-]. But you may have other delimiters besides the word itself: a token can be preceded by [(\s] and followed by [),.-\s?:;!]

Related

regex to highlight sentences longer than n words

I am trying to write a regular expression that can be used to identify long sentences in a document, in my case a scientific manuscript. I aim to do that either in LibreOffice or in any text editor with regex search.
So far I got the following expression to work on most occasions:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*,*\:*\s+){24,}?(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*[\.|?|!|$])
Btw, I was inspired by this post.
It contains:
group1:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*,*\:*\s+)
a repetition element (stating the number of words minus one, n - 1):
{24,}?
group2:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*[\.|?|!|$])
The basic functioning is:
group1 matches one or more word characters OR other characters that are present in the text, followed by one or more spaces
group1 has to be repeated 24 times (or as many words as you want the sentences to be long)
group2 matches one or more word characters OR other characters that are present in the text, followed by a full stop, exclamation mark, question mark or paragraph break.
Any string that fulfills all the above would then be highlighted.
What I can't solve so far is to make it work when a dot appears in the text with another meaning than a full stop. Things like: i.e., e.g., et al., Fig., 1.89, etc....
Also I don't like that I had to manually adjust it to be able to handle sentences that contain non-word characters such as , [ ( % - # µ " ' and so on. I would have to extend the expression every time I come across some other uncommon character.
I'd be happy for any help or suggestions of other ways to solve this.
You can do a lot with the swiss-army-knife that is regular expressions, but the problem you've presented approaches regex's limits. Some of the things you want to detect can probably be handled with really small changes, while others are a bit harder. If your goal is to have some kind of tool that accurately measures sentence length for every possible mutation of characters, you'll probably need to move outside LibreOffice to a dedicated custom piece of software or a third-party tool.
But, that said, there are a few tricks you can worm into your existing regex to make it work better, if you want to avoid programming or another tool. Let's look at a few of the techniques that might be useful to you:
You can probably tweak your regex for a few special cases, like Fig. and Mr., by including them directly. Where you currently have [\w|\-|–|−|\/|≥|≤|’|“|”|μ]+, which is basically [\w]+ with a bunch of other "special" characters, you could use something like ([\w|...]+|Mr\.|Mrs\.|Miss\.|Fig\.) (substituting in all the special characters where I wrote ..., of course). Regexes are "greedy" algorithms, and will try to consume as much of the text as possible, so by including special "dot words" directly, you can make the regex "skip over" certain period characters that are problematic in your text. Make sure that when you want to add a "period to skip", you always precede it with a backslash, like in i\.e\., so that it doesn't get treated as the special "any" character.
A similar trick can capture numbers better by assuming that digits followed by a period followed by more digits are supposed to "eat" the period: ([\w|...]+|\d+\.\d+|...). That doesn't handle everything, and if your document authors are writing stuff like 0. in the middle of sentences then you have a tough problem, but it can at least handle pi and e properly.
Also, right now, your regex consumes characters until it reaches any terminating punctuation character: a ., or !, or ?, or the end of the document. That's a problem for things like i.e., and 3.14, since as far as your regex is concerned, the sentence stops at the .. You could instead require your regex to only stop the sentence once ". " is reached, that is, a period followed by a space. That wouldn't fix mismatches for words like Mr., but it would treat "words" like 3.14 as a word instead of as the end of a sentence, which is closer than you currently are. To do this, you'll have to include an odd sequence as part of the "word" regex, something like (\.[^ ]), which says "dot followed by not-a-space" is part of the word; and then you'll have to change the terminating sequence to (\. |!|\?|$). Repeat the changes similarly for ! and ?.
Another useful trick is to take advantage of character-code ranges, instead of encoding each special character directly. Right now, you're doing it the hard way, by spelling out every accented character and digraph and diacritic in the universe. Instead, you could just say that everything that's a "special character" is considered to be part of the "word": instead of [\w|\-|–|−|\/|≥|≤|’|“|”|μ]+, write [\w|\-|\/|\u0080-\uFFFF]+, which captures every character except emoji and a few from really obscure dead languages. LibreOffice seems to have Unicode support, so using \uXXXX patterns should work inside [ character ranges ].
This is probably enough to make your regex somewhat acceptable in LibreOffice, and might even be enough to answer your question. But if you're really intent on doing more complex document analysis like this, you may be better off exporting the document as plain text and then running a specialized tool on it.
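If you do go the export-and-use-a-tool route, something along these lines in Python is already enough to flag long sentences; the abbreviation list and the 25-word threshold below are just illustrative assumptions, not a complete solution:

import re

# Hypothetical abbreviation list; extend it for your own manuscript.
ABBREVIATIONS = r'(?:i\.e|e\.g|et al|Fig|Mr|Mrs|vs)\.'
DOT = '\x00'  # placeholder that will not occur in normal text

def long_sentences(text, max_words=25):
    # Protect abbreviation periods and decimal points so they don't end sentences.
    protected = re.sub(ABBREVIATIONS, lambda m: m.group().replace('.', DOT), text)
    protected = re.sub(r'(\d)\.(\d)', lambda m: m.group(1) + DOT + m.group(2), protected)
    for sentence in re.split(r'(?<=[.!?])\s+', protected):
        if len(sentence.split()) > max_words:
            yield sentence.replace(DOT, '.')

sample = "This is short. " + "word " * 30 + "ends here."
for s in long_sentences(sample):
    print(len(s.split()), "words:", s[:60], "...")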

PHP and XSD Regular expression for currency and location names

I am trying to find a suitable regular expression that allows letters, digits, spaces and other characters that can be used in names, such as . ' - _. I need to implement the expression as a PHP preg_match and as an XSD pattern. Currently I have the following for PHP
/^[a-zA-Z0-9 '-.]
Which allows the characters I want (unless there are any other special characters you could kindly recommend I use). The issue with this is that it allows special characters to be used one after the other, allowing values such as .-- . I need it so that this can't happen, only allowing a special character if a letter or digit comes before it.
I would also like the equivalent for an XSD pattern but everything I have tried so far has been inadequate. I am currently using
[\w\d '-.']*[\w\d][\w\d '-.]*
In addition, the length must be between 3-50 (which works in all cases currently).
Any guidance would be fantastic as I have searched high and low for an answer.
Valid names could be:
Netherlands Antillean guilder
Timor-Leste
Cote d'Ivoire
Including letters with accents.
Invalid names could be:
Tes''t
Test (space before)
_test (special character before)
-'.
...
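For illustration, here is a hedged sketch of one pattern that fits the examples above, tested with Python's re rather than PHP/XSD; the exact separator set and the 3-50 length rule are assumptions taken from the question:

import re

# One letter/digit block, optionally followed by more blocks, each preceded by a
# single separator ( space . ' - _ ); the lookahead enforces a total length of 3-50.
NAME = re.compile(r"^(?=.{3,50}$)[A-Za-z0-9]+(?:[ .'\-_][A-Za-z0-9]+)*$")

tests = ["Netherlands Antillean guilder", "Timor-Leste", "Cote d'Ivoire",
         "Tes''t", " Test", "_test", "-'."]
for name in tests:
    print(bool(NAME.match(name)), repr(name))
# The first three print True, the last four print False.

Accented letters are not covered by [A-Za-z0-9]; handling them would mean widening that class, for instance with a Unicode letter range or \w (minus the underscore).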

Regular Expression for first two characters

I need to match records starting with a certain character followed by one of a certain subset of characters. After the first two characters, any character or digit is allowed, e.g.
in the following dataset
man
mbn
mcn
mdn
aan
adn
I need to extract words starting with m and followed by a-c. So only the first 3 records should match.
Maybe this should work for you:
^m[a-c]\w+$
m[a-c] does what you want here.
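A quick check of that pattern against the sample data (Python, purely for illustration):

import re

records = ["man", "mbn", "mcn", "mdn", "aan", "adn"]
# ^m[a-c]\w+$ : an 'm', then one of a-c, then at least one more word character.
print([r for r in records if re.match(r"^m[a-c]\w+$", r)])
# ['man', 'mbn', 'mcn']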
What language? Perl, C#, Python? They're similar, but here's a regex for C#:
m[a-c]\w+
I'd also recommend that you take a look at Regulator if you're building C#-based regex strings. It works for other languages too, with the exception of .NET-specific features.

Regular expression for English-only special characters

I need a regular expression to match a-zA-Z0-9 as well as whitespace and special characters, but only including English whitespace/special characters, not those of other languages like French or Spanish.
Thanks.
It's not possible/practical to write a regular expression that matches English, but not French, Spanish and other languages.
If you really want to test if a word is from the English language, you can write some code to look it up in a English dictionary. That should be simple enough.
Depending on the regex engine, you may be able to use:
^\p{IsBasicLatin}*$
To allow only characters in the Basic Latin character set, which includes standard English language punctuation (i.e., the characters that can be directly entered on a U.S. keyboard).
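If your engine lacks block syntax (for example, Python's re module), the Basic Latin block is simply U+0000 through U+007F, so an explicit range works as a stand-in; a small sketch:

import re

# The Basic Latin block is exactly U+0000 through U+007F (plain ASCII), so this
# range check behaves like ^\p{IsBasicLatin}*$ on engines that lack block syntax.
def is_basic_latin(text):
    return re.fullmatch(r'[\x00-\x7F]*', text) is not None

print(is_basic_latin("Hello, world! ($100)"))  # True
print(is_basic_latin("café"))                  # False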
I was looking for a regular expression that would match regular English text (and maybe avoid HTML/XML/URLs etc.) and landed on this page. I think the questioner just wanted to avoid characters with phonetic information in them but allow for English punctuation characters. I ended up writing something myself by looking at my keyboard:
[A-Za-z\d,.?;:\'"!$%() ]*
I don't claim this will work for everyone but was good enough for me.

What are good regular expressions?

I have worked for 5 years, mainly on Java desktop applications accessing Oracle databases, and I have never used regular expressions. Now that I have joined Stack Overflow, I see a lot of questions about them; I feel like I missed something.
For what do you use regular expressions?
P.S. Sorry for my bad English.
Consider an example in Ruby:
puts "Matched!" unless /\d{3}-\d{4}/.match("555-1234").nil?
puts "Didn't match!" if /\d{3}-\d{4}/.match("Not phone number").nil?
The "/\d{3}-\d{4}/" is the regular expression, and as you can see it is a VERY concise way of finding a match in a string.
Furthermore, using groups you can extract information, as such:
match = /([^#]*)#(.*)/.match("myaddress#domain.com")
name = match[1]
domain = match[2]
Here, the parentheses in the regular expression mark capturing groups, so you can see exactly WHAT data you matched and do further processing on it.
This is just the tip of the iceberg... there are many many different things you can do in a regular expression that makes processing text REALLY easy.
Regular Expressions (or Regex) are used to pattern match in strings. You can thus pull out all email addresses from a piece of text because it follows a specific pattern.
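As a rough illustration of the email case, here is a sketch in Python; the pattern is deliberately simplified and is not a full RFC-compliant address matcher:

import re

text = "Contact alice@example.com or bob.smith@dept.example.org for details."
# Deliberately simplified pattern: local-part@domain.tld
emails = re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', text)
print(emails)  # ['alice@example.com', 'bob.smith@dept.example.org']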
In some cases regular expressions are enclosed in forward-slashes and after the second slash are placed options such as case-insensitivity. Here's a good one :)
/(bb|[^b]{2})/i
Spoken aloud, it can be read as "2 be or not 2 be".
The first part is the group in (parentheses); it is split by the pipe | character, which works as an "or", so (a|b) matches "a" or "b". The first half of the piped area matches "bb". The second half is a character class in square brackets; it matches anything that is not "b", which is what the roof symbol thingie (the caret, ^) does. The squiggly brackets match a count of the thing before them, in this case two characters that are not "b".
After the second / is an "i" which makes it case-insensitive. Whether you use the start and end slashes is environment-specific; sometimes you do and sometimes you do not.
Two links that I think you will find handy for this are
regular-expressions.info
Wikipedia - Regular expression
Coolest regular expression ever:
/^1?$|^(11+?)\1+$/
It tests if a number is prime. And it works!!
N.B.: to make it work, a bit of set-up is needed; the number that we want to test has to be converted into a string of “1”s first, then we can apply the expression to test if the string does not contain a prime number of “1”s:
def is_prime(n)
  str = "1" * n
  return str !~ /^1?$|^(11+?)\1+$/
end
There’s a detailed and very approachable explanation over at Avinash Meetoo’s blog.
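For comparison, the same unary trick transliterated into Python (a sketch, not taken from the linked post); the pattern matches composite lengths, so a non-match means the number is prime:

import re

def is_prime(n):
    # ^1?$ matches zero or one "1" (0 and 1 are not prime); ^(11+?)\1+$ matches
    # any run of "1"s whose length factors as a * b with a, b >= 2 (a composite).
    return re.match(r'^1?$|^(11+?)\1+$', '1' * n) is None

print([n for n in range(2, 30) if is_prime(n)])
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]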
If you want to learn about regular expressions, I recommend Mastering Regular Expressions. It goes from the very basic concepts all the way up to how the different engines work underneath. The last four chapters also give a dedicated chapter to each of PHP, .NET, Perl, and Java. I learned a lot from it, and still use it as a reference.
If you're just starting out with regular expressions, I heartily recommend a tool like The Regex Coach:
http://www.weitz.de/regex-coach/
I've also heard good things about RegexBuddy:
http://www.regexbuddy.com/
As you may know, Oracle now has regular expressions: http://www.oracle.com/technology/oramag/webcolumns/2003/techarticles/rischert_regexp_pt1.html. I have used the new functionality in a few queries, but it hasn't been as useful as in other contexts. The reason, I believe, is that regular expressions are best suited for finding structured data buried within unstructured data.
For instance, I might use a regex to find Oracle messages that are stuffed in a log file. It isn't possible to know where the messages are, only what they look like. So a regex is the best solution to that problem. When you work with a relational database, the data is usually pre-structured, so a regex doesn't shine in that context.
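As a concrete sketch of that log-file case in Python: the ORA-NNNNN code format is standard Oracle, but the sample log lines below are made up:

import re

log = """\
Mon Jan 01 12:00:01 2024  starting nightly batch job
ORA-00942: table or view does not exist
some unrelated noise
ORA-01555: snapshot too old: rollback segment number 3
"""
# Oracle error messages follow the "ORA-NNNNN: message" pattern.
for code, message in re.findall(r'(ORA-\d{5}):\s*(.+)', log):
    print(code, '->', message)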
A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is .*\.txt$.
A great resource for regular expressions: http://www.regular-expressions.info
These RE's are specific to Visual Studio and C++ but I've found them helpful at times:
Find all occurrences of "routineName" with non-default params passed:
routineName\(:a+\)
Conversely to find all occurrences of "routineName" with only defaults:
routineName\(\)
To find code enabled (or disabled) in a debug build:
\#if._DEBUG*
Note that this will catch all the variants: ifdef, if defined, ifndef, if !defined
Validating strong passwords:
This one will validate a password with a length of 5 to 10 alphanumerical characters, with at least one upper case, one lower case and one digit:
^(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])[a-zA-Z0-9]{5,10}$
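To illustrate how the lookaheads behave, here is a small Python check; each (?=...) asserts a requirement without consuming characters, and the final [a-zA-Z0-9]{5,10} enforces the allowed characters and the length:

import re

PASSWORD = re.compile(r'^(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])[a-zA-Z0-9]{5,10}$')

for candidate in ["Abc12", "abc12", "ABC12", "Abcde", "Abc123456789"]:
    print(bool(PASSWORD.match(candidate)), candidate)
# Only "Abc12" passes: the others lack an upper-case letter, a lower-case
# letter, a digit, or exceed the 10-character limit, respectively.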