Can I perform stemming using regular expressions? - regex

How can I get my regular expression to match against just one condition exactly?
For example I have the following regular expression:
(\w+)(?=ly|es|s|y)
Matching the expression against the word "glasses" returns:
glasse
The correct match should be:
glass (match should be on 'es' rather than 's' as in the match above)
The expression should cater for any kinds of words such as:
films
lovely
glasses
glass
Currently the regular expression is matching the above words as:
film - correct
lovel - incorrect
glasse - incorrect
glas - incorrect
The correct match for the words should be:
film
love
glass
glass
The problem I am having at the moment is I am not sure how to adjust my regular expression to cater for either 's' or 'es' exactly, as a word could contain both such as "glasses".
Update
Thank you for the answers so far. I appreciate the complexity of stemming and the requirement of language knowledge. However in my particular case the words are finite (films,lovely,glasses and glass) and so therefore I will only ever encounter these words and the suffixes in the expression above. I don't have a particular application for this. I was just curious to see if it was possible using regular expressions. I have come to the conclusion that it is not possible, however would the following be possible:
A match is either found or not found, for example match glasses but NOT glass but DO match films:
film (match) - (films)
glass (match) - (glasses)
glass (no match) - (glass)
What I'm thinking is if there is a way to match the suffix exactly against the string from the end. In the example above 'es' match glass(es) therefore the condition 's' is discarded. In the case of glass (no match) the condition 's' is discarded because another 's' precedes it, it does not match exactly. I must admit I'm not 100% about this so my logic may seem a little shakey, it's just an idea.

If you want to do stemming, use a library like Snowball. It's going to be impossible to do what you want to do with regular expressions. In particular, it will be impossible for your regex to know that the trailing 's' should be removed from 'films' but not 'glass' without some kind of knowledge of the language.
There's vast literature on stemming and lemmatization. Google is your friend.

The basic problem you're having here is that the plus in
(\w+)(?=ly|es|s|y)
is greedy, and will grab as much as possible while still allowing the whole regex to match. You've not said exactly which flavour of regex you're using but try
(\w+?)(?=ly|es|s|y)
+? means the same as + but is reluctant, matching as little as possible while still allowing the overall match to succeed.
However this would still have the problem that it splits glass into glas and s. To handle this you'd need something like
(\w+?)(?=ly|es|(?<!s)s|y)
using negative look behind to prevent the s alternative from matching when preceded by another s.

As a case for somebody looking for such kind of solution in/for python, there is a RegexpStemmer provided by the natural language tool kit, and it works very fast
# regex stemmer
from nltk.stem import RegexpStemmer
rs = RegexpStemmer('ing$|s$|ed$|y$', min=3)
t=time.clock()
train[col]=train[col].apply(lambda x: ' '.join([rs.stem(word) for word in x.split()]))
print(time.clock()-t)
http://www.nltk.org/api/nltk.stem.html
http://snowball.tartarus.org/algorithms/english/stemmer.html

Related

Regex not separating n't (not)

I am trying to write a complex regex for a large corpus. However, due to many ORs, I am not able to capture the "not" in weren't don't wasn't didn't shouln't doesn't
I would like it to match base verb and n't separately: E.g. were and n't
I have added it in the first line on: https://www.regexpal.com/?fam=106183 with the regex.
Any clue why it is not picking despite it being present in the expression on first order: [a-z]{1}'\w
Edit:
The regex is long because it is part of a large corpus. My problem is that the n't is not getting separated out, even though I placed in first order of preference for OR.
Thanks in advance
Trying to parse natural language perfectly with a regular expression is never going to be "perfect". Language contains too many quirks and exceptions.
However, with that said, trying to cover all scenarios explicitly like you have done ("a 2 letter lower case word", "a 4 letter capitalised word", "a word with a multiple of 3 letters" (??!), ... is a doomed approach.
Keep the pattern as simple as you possibly can, and only add exceptions if you really need to.
Here's a basic approach:
/n't|\b\w+(?!'t)/
This is matching "n't", or 'any word, excluding the last letter if it's proceeded by "'t"'.
You may wish to build upon that slightly, but it solved the use case you've provided:
Demo
In order to understand why your original pattern doesn't work, let's consider a Minimal, Complete, Verifiable Example:
Cutting your pattern down to:
/[a-z]?'[a-z]{1,}|[\w-]+/
Consider how it matches the string:
"weren't"
First, the characters weren are matched by the [\w-]+ portion of the pattern.
Then, the 't characters are matched by the [a-z]?'[a-z]{1,} portion of the pattern.
Fundamentally, having the greedy [\w-]+ section in this pattern will mean it cannot work. This will always match up-to-and-including the "n" in "n't", which means the overall match fails for non-3-letter words.

RegEx: Searching for numbers (int, float) that are NOT part of a word

I'm hoping we have some regular expression guru's here that might be able to help me - a regex newbie - solve a problem.
I know some people will want to know some background info on this issue:
Regex Flavor: Basic Regex, being used in a Vertica Database using the REGEXP_REPLACE function.
The regex I am using is working great with one exception.
I have a rule that I'm trying to implement, related to stripping the numbers from text, where any number that is part of a word, e.g. table5, go2market, 33monroe, room222, etc. is ignored and NOT filtered.
Here is what I started with for detecting numbers:
[-+]?[0-9]*\.?[0-9]
That seems to work pretty well, including handling directly adjacent commas and parentheses for example.
But all cases where there is a number that is part of alphabetic text is also being detected, which fails the rule that it cannot be a part of a word, and by word, I mean any alphabetic text.
So, in searching for solutions, I happened upon this regex that seems to work well detecting those specific cases where numbers appear next to, or in, any string of characters:
((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)
My thought was that maybe I could add this as an INVERTED match to my original regex, to allow it to still select standalone numbers while ignoring those that were a part of a word, like so:
[-+]?[0-9]^((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)*\.?[0-9]^((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)
Unfortunately however, it breaks the original detection of standalone numbers.
:(
I'm hoping there is someone here that can spot what I'm doing wrong, and help me identify the right solution?
Thanks in advance!
According to Vertica documentation, the regex flavour seems to follow the Perl syntax. In this case you can use negative lookarounds and in particular a negative lookbehind: (?<!\w) (not preceded with a word character.)
Lookarounds are only tests and don't consume characters.
You can also use a negative lookahead to test the right part, (?!\w) (not followed by a word character), but it's more simple to use a word boundary since the pattern ends with a digit (that is also a word character):
(?<!\w)[-+]?\d*\.?\d+\b
In the worst case, if you have something like v1.0 in your string and you want to avoid it, you can try to use the bactracking control verbs (*SKIP) and (*FAIL). (*FAIL) forces the pattern to fail and (*SKIP) skips all the already matched positions before it. I hope vertica supports these Perl regex features.
Something like:
\p{L}+[-+]?\d*\.?\d+(*SKIP)(*FAIL)|[-+]?\d*\.?\d+(*SKIP)(?!\p{L})

Look behinds: all the rage in regex?

Many regex questions lately have some kind of look-around element in the query that appears to me is not necessary to the success of the match. Is there some teaching resource that is promoting them? I am trying to figure out what kinds of cases you would be better off using a positive look ahead/behind. The main application I can see is when trying to not match an element. But, for example, this query from a recent question has a simple solution to capturing the .*, but why would you use a look behind?
(?<=<td><a href="\/xxx\.html\?n=[0-9]{0, 5}">).*(?=<\/a><span
And this one from another question:
$url = "www.example.com/id/1234";
preg_match("/\d+(?<=id\/[\d])/",$url,$matches);
When is it truly better to use a positive look-around? Can you give some examples?
I realize this is bordering on an opinion-based question, but I think the answers would be really instructive. Regex is confusing enough without making things more complicated... I have read this page and am more interested in some simple guidelines for when to use them rather than how they work.
Thanks for all the replies. In addition to those below, I recommend checking out m.buettner's great answer here.
You can capture overlapping matches, and you can find matches which could lie in the lookarounds of other matches.
You can express complex logical assertions about your match (because many engines let you use multiple lookbehind/lookahead assertions which all must match in order for the match to succeed).
Lookaround is a natural way to express the common constraint "matches X, if it is followed by/preceded by Y". It is (arguably) less natural to add extra "matching" parts that have to be thrown out by postprocessing.
Negative lookaround assertions, of course, are even more useful. Combined with #2, they can allow you do some pretty wizard tricks, which may even be hard to express in usual program logic.
Examples, by popular request:
Overlapping matches: suppose you want to find all candidate genes in a given genetic sequence. Genes generally start with ATG, and end with TAG, TAA or TGA. But, candidates could overlap: false starts may exist. So, you can use a regex like this:
ATG(?=((?:...)*(?:TAG|TAA|TGA)))
This simple regex looks for the ATG start-codon, followed by some number of codons, followed by a stop codon. It pulls out everything that looks like a gene (sans start codon), and properly outputs genes even if they overlap.
Zero-width matching: suppose you want to find every tr with a specific class in a computer-generated HTML page. You might do something like this:
<tr class="TableRow">.*?</tr>(?=<tr class="TableRow">|</table>)
This deals with the case in which a bare </tr> appears inside the row. (Of course, in general, an HTML parser is a better choice, but sometimes you just need something quick and dirty).
Multiple constraints: suppose you have a file with data like id:tag1,tag2,tag3,tag4, with tags in any order, and you want to find all rows with tags "green" and "egg". This can be done easily with two lookaheads:
(.*):(?=.*\bgreen\b)(?=.*\begg\b)
There are two great things about lookaround expressions:
They are zero-width assertions. They require to be matched, but they consume nothing of the input string. This allows to describe parts of the string which will not be contained in a match result. By using capturing groups in lookaround expressions, they are the only way to capture parts of the input multiple times.
They simplify a lot of things. While they do not extend regular languages, they easily allow to combine (intersect) multiple expressions to match the same part of a string.
Well one simple case where they are handy is when you are anchoring the pattern to the start or finish of a line, and just want to make sure that something is either right ahead or behind the pattern you are matching.
I try to address your points:
some kind of look-around element in the query that appears to me is not necessary to the success of the match
Of course they are necessary for the match. As soon as a lookaround assertions fails, there is no match. They can be used to ensure conditions around the pattern, that have additionally to be true. The whole regex does only match, if:
The pattern does fit and
The lookaround assertions are true.
==> But the returned match is only the pattern.
When is it truly better to use a positive look-around?
Simple answer: when you want stuff to be there, but you don't want to match it!
As Bergi mentioned in his answer, they are zero width assertions, this means they don't match a character sequence, they just ensure it is there. So the characters inside a lookaround expression are not "consumed", the regex engine continues after the last "consumed" character.
Regarding your first example:
(?<=<td><a href="\/xxx\.html\?n=[0-9]{0, 5}">).*(?=<\/a><span
I think there is a misunderstanding on your side, when you write "has a simple solution to capturing the .*". The .* is not "captured", it is the only thing that the expression does match. But only those characters are matched that have a "<td><a href="\/xxx\.html\?n=[0-9]{0, 5}">" before and a "<\/a><span" after (those two are not part of the match!).
"Captured" is only something that has been matched by a capturing group.
The second example
\d+(?<=id\/[\d])
Is interesting. It is matching a sequence of digits (\d+) and after the sequence, the lookbehind assertion checks if there is one digit with "id/" before it. Means it will fail if there is more than one digit or if the text "id/" before the digit is missing. Means this regex is matching only one digit, when there is fitting text before.
teaching resources
www.regular-expressions.info
perlretut on Looking ahead and looking behind
I'm assuming you understand the good uses of lookarounds, and ask why they are used with no apparent reason.
I think there are four main categories of how people use regular expressions:
Validation
Validation is usually done on the whole text. Lookarounds like you describe are not possible.
Match
Extracting a part of the text. Lookarounds are used mainly due to developer laziness: avoiding captures.
For example, if we have in a settings file with the line Index=5, we can match /^Index=(\d+)/ and take the first group, or match /(?<=^Index=)\d+/ and take everything.
As other answers said, sometimes you need overlapping between matches, but these are relatively rare.
Replace
This is similar to match with one difference: the whole match is removed and is being replaced with a new string (and some captured groups).
Example: we want to highlight the name in "Hi, my name is Bob!".
We can replace /(name is )(\w+)/ with $1<b>$2</b>,
but it is neater to replace /(?<=name is )\w+/ with <b>$&</b> - and no captures at all.
Split
split takes the text and breaks it to an array of tokens, with your pattern being the delimiter. This is done by:
Find a match. Everything before this match is token.
The content of the match is discarded, but:
In most flavors, each captured group in the match is also a token (notably not in Java).
When there are no more matches, the rest of the text is the last token.
Here, lookarounds are crucial. Matching a character means removing it from the result, or at least separating it from its token.
Example: We have a comma separated list of quoted string: "Hello","Hi, I'm Jim."
Splitting by comma /,/ is wrong: {"Hello", "Hi, I'm Jim."}
We can't add the quote mark, /",/: {"Hello, "Hi, I'm Jim."}
The only good option is lookbehind, /(?<="),/: {"Hello", "Hi, I'm Jim."}
Personally, I prefer to match the tokens rather than split by the delimiter, whenever that is possible.
Conclusion
To answer the main question - these lookarounds are used because:
Sometimes you can't match text that need.
Developers are shiftless.
Lookaround assertions can also be used to reduce backtracking which can be the main cause for a bad performance in regexes.
For example: The regex ^[0-9A-Z]([-.\w]*[0-9A-Z])*#(1) can also be written ^[0-9A-Z][-.\w]*(?<=[0-9A-Z])#(2) using a positive look behind (simple validation of the user name in an e-mail address).
Regex (1) can cause a lot of backtracking essentially because [0-9A-Z] is a subset of [-.\w] and the nested quantifiers. Regex (2) reduces the excessive backtracking, more information here Backtracking, section Controlling Backtracking > Lookbehind Assertions.
For more information about backtracking
Best Practices for Regular Expressions in the .NET Framework
Optimizing Regular Expression Performance, Part II: Taking Charge of Backtracking
Runaway Regular Expressions: Catastrophic Backtracking
I typed this a while back but got busy (still am, so I might take a while to reply back) and didn't get around to post it. If you're still open to answers...
Is there some teaching resource that is promoting them?
I don't think so, it's just a coincidence I believe.
But, for example, this query from a recent question has a simple solution to capturing the .*, but why would you use a look behind?
(?<=<td><a href="\/xxx\.html\?n=[0-9]{0, 5}">).*(?=<\/a><span
This is most probably a C# regex, since variable width lookbehinds are not supported my many regex engines. Well, the lookarounds could be certainly avoided here, because for this, I believe it's really simpler to have capture groups (and make the .* lazy as we're at it):
(<td><a href="\/xxx\.html\?n=[0-9]{0,5}">).*?(<\/a><span)
If it's for a replace, or
<td><a href="\/xxx\.html\?n=[0-9]{0,5}">(.*?)<\/a><span
for a match. Though an html parser would definitely be more advisable here.
Lookarounds in this case I believe are slower. See regex101 demo where the match is 64 steps for capture groups but 94+19 = 1-3 steps for the lookarounds.
When is it truly better to use a positive look-around? Can you give some examples?
Well, lookarounds have the property of being zero-width assertions, which mean they don't really comtribute to matches while they contribute onto deciding what to match and also allows overlapping matches.
Thinking a bit about it, I think, too, that negative lookarounds get used much more often, but that doesn't make positive lookarounds less useful!
Some 'exploits' I can find browsing some old answers of mine (links below will be demos from regex101) follow. When/If you see something you're not familiar about, I probably won't be explaining it here, since the question's focused on positive lookarounds, but you can always look at the demo links I provided where there's a description of the regex, and if you still want some explanation, let me know and I'll try to explain as much as I can.
To get matches between certain characters:
In some matches, positive lookahead make things easier, where a lookahead could do as well, or when it's not so practical to use no lookarounds:
Dog sighed. "I'm no super dog, nor special dog," said Dog, "I'm an ordinary dog, now leave me alone!" Dog pushed him away and made his way to the other dog.
We want to get all the dog (regardless of case) outside quotes. With a positive lookahead, we can do this:
\bdog\b(?=(?:[^"]*"[^"]*")*[^"]*$)
to ensure that there are even number of quotes ahead. With a negative lookahead, it would look like this:
\bdog\b(?!(?:[^"]*"[^"]*")*[^"]*"[^"]*$)
to ensure that there are no odd number of quotes ahead. Or use something like this if you don't want a lookahead, but you'll have to extract the group 1 matches:
(?:"[^"]+"[^"]+?)?(\bdog\b)
Okay, now say we want the opposite; find 'dog' inside the quotes. The regex with the lookarounds just need to have the sign inversed, first and second:
\bdog\b(?!(?:[^"]*"[^"]*")*[^"]*$)
\bdog\b(?=(?:[^"]*"[^"]*")*[^"]*"[^"]*$)
But without the lookaheads, it's not possible. the closest you can get is maybe this:
"[^"]*(\bdog\b)[^"]*"
But this doesn't get all the matches, or you can maybe use this:
"[^"]*?(\bdog\b)[^"]*?(?:(\bdog\b)[^"]*?)?"
But it's just not practical for more occurrences of dog and you get the results in variables with increasing numbers... And this is indeed easier with lookarounds, because they are zero width assertions, you don't have to worry about the expression inside the lookaround to match dog or not, or the regex wouldn't have obtained all the occurrences of dog in the quotes.
Of course now, this logic can be extended to groups of characters, such as getting specific patterns between words such as start and end.
Overlapping matches
If you have a string like:
abcdefghijkl
And want to extract all the consecutive 3 characters possible inside, you can use this:
(?=(...))
If you have something like:
1A Line1 Detail1 Detail2 Detail3 2A Line2 Detail 3A Line3 Detail Detail
And want to extract these, knowing that each line starts with #A Line# (where # is a number):
1A Line1 Detail1 Detail2 Detail3
2A Line2 Detail
3A Line3 Detail Detail
You might try this, which fails because of greediness...
[0-9]+A Line[0-9]+(?: \w+)+
Or this, which when made lazy no more works...
[0-9]+A Line[0-9]+(?: \w+)+?
But with a positive lookahead, you get this:
[0-9]+A Line[0-9]+(?: \w+)+?(?= [0-9]+A Line[0-9]+|$)
And appropriately extracts what's needed.
Another possible situation is one where you have something like this:
#ff00fffirstword#445533secondword##008877thi#rdword#
Which you want to convert to three pairs of variables (first of the pair being a # and some hex values (6) and whatever characters after them):
#ff00ff and firstword
#445533 and secondword#
#008877 and thi#rdword#
If there were no hashes inside the 'words', it would have been enough to use (#[0-9a-f]{6})([^#]+), but unfortunately, that's not the case and you have to resort to .*? instead of [^#]+, which doesn't quite yet solve the issue of stray hashes. Positive lookaheads however make this possible:
(#[0-9a-f]{6})(.+?)(?=#[0-9a-f]{6}|$)
Validation & Formatting
Not recommended, but you can use positive lookaheads for quick validations. The following regex for instance allow the entry of a string containing at least 1 digit and 1 lowercase letter.
^(?=[^0-9]*[0-9])(?=[^a-z]*[a-z])
This can be useful when you're checking for character length but have patterns of varying length in the a string, for example, a 4 character long string with valid formats where # indicates a digit and the hyphen/dash/minus - must be in the middle:
##-#
#-##
A regex like this does the trick:
^(?=.{4}$)\d+-\d+
Where otherwise, you'd do ^(?:[0-9]{2}-[0-9]|[0-9]-[0-9]{2})$ and imagine now that the max length was 15; the number of alterations you'd need.
If you want a quick and dirty way to rearrange some dates in the 'messed up' format mmm-yyyy and yyyy-mm to a more uniform format mmm-yyyy, you can use this:
(?=.*(\b\w{3}\b))(?=.*(\b\d{4}\b)).*
Input:
Oct-2013
2013-Oct
Output:
Oct-2013
Oct-2013
An alternative might be to use a regex (normal match) and process separately all the non-conforming formats separately.
Something else I came across on SO was the indian currency format, which was ##,##,###.### (3 digits to the left of the decimal and all other digits groupped in pair). If you have an input of 122123123456.764244, you expect 1,22,12,31,23,456.764244 and if you want to use a regex, this one does this:
\G\d{1,2}\K\B(?=(?:\d{2})*\d{3}(?!\d))
(The (?:\G|^) in the link is only used because \G matches only at the start of the string and after a match) and I don't think this could work without the positive lookahead, since it looks forward without moving the point of replacement.)
Trimming
Suppose you have:
this is a sentence
And want to trim all the spaces with a single regex. You might be tempted to do a general replace on spaces:
\s+
But this yields thisisasentence. Well, maybe replace with a single space? It now yields " this is a sentence " (double quotes used because backticks eats spaces). Something you can however do is this:
^\s*|\s$|\s+(?=\s)
Which makes sure to leave one space behind so that you can replace with nothing and get "this is a sentence".
Splitting
Well, somewhere else where positive lookarounds might be useful is where, say you have a string ABC12DE3456FGHI789 and want to get the letters+digits apart, that is you want to get ABC12, DE3456 and FGHI789. You can easily do use the regex:
(?<=[0-9])(?=[A-Z])
While if you use ([A-Z]+[0-9]+) (i.e. the captured groups are put back in the resulting list/array/etc, you will be getting empty elements as well.
Note that this could be done with a match as well, with [A-Z]+[0-9]+
If I had to mention negative lookarounds, this post would have been even longer :)
Keep in mind that a positive/negative lookaround is the same for a regex engine. The goal of lookarounds is to perform a check somewhere in your "regular expression".
One of the main interest is to capture something without using capturing parenthesis (capturing the whole pattern), example:
string: aaabbbccc
regex: (?<=aaa)bbb(?=ccc)
(you obtain the result with the whole pattern)
instead of: aaa(bbb)ccc
(you obtain the result with the capturing group.)

Confusion regarding the *? regular expression operator

So I want to search a string, using the below regular expression:
border-.*\.5pt
to find all border-top, border-bottom, etc CSS properties in a file with a border thickness of .5pt. It generally works great, but it's too greedy.
For example all of the below comes back as a single match:
border-top:solid #1F497D .5pt;border-bottom:solid #1F497D .5pt
I want those two CSS properties to be two separate matches.
So I tried to modify my regular expression to:
border-.*?\.5pt
Using ? to make it non-greedy. However, after that modification, nothing matches.
Can anyone explain why I see this behavior? What am I missing?
(If it's worth knowing, I'm using Microsoft Expression Web's 'find with regular expressions' when doing this search.)
There is no one "regular expression" language. While there are broad commonalities, details differ from implementation to implementation. Many regexes use - to be the non-greedy "0 or more", others use *?. Apparently Microsoft Expression Web uses #.
In short, regexes can differ, so you'll often need to RTM for the one you're using to find its range of capabilities and detailed syntax (i.e. support for alteration/backtracking/etc., grouping character, set shorthand, etc.)
.*? is the badest, so to say "antipattern" for Regular Expressions. It is commonly used as a "Match-something-until-the-string-i-want" Pattern - but it isn't.
Especially when combining multiple .*? within ONE pattern, it may lead to very wrong and unexpected results.
For your Case - as stated in the comments - It works. (Maybe you did something wrong?)
However, it is ALWAYS a good idea to be more specific, when generating a regex pattern.
ALWAYS KEEP IN MIND that .*? can be ANYTHING. Also Stuff you really don't want to match!
In your example, i would use something like this: border-(?:[^:]+):\s*(?:[^\s]+)\s+(?:\#[a-fA-F0-9]{6})\s+(?:\d*(?:\.\d+)?)pt;?
It is more specific, but matches the given Requirements, ignores all whitespaces that dont make sence, and even matches border widths, regardles if they are written as .2, 3 or 4.1. If you remove the ?: from the single match Groups you can also match every single attribute, if required. : Position, Border type, Color and thickness.
The pattern border-([^:]+):\s*([^\s]+)\s+(\#[a-fA-F0-9]{6})\s+(\d*(?:\.\d+)?)pt;? with your string border-top:solid #1F497D .5pt;border-bottom:solid #1F497D .5pt will match:
First Match:
1.top
2.solid
3.#1F497D
4..5
Second Match:
1.bottom
2.solid
3.#1F497D
4..5

What are good regular expressions?

I have worked for 5 years mainly in java desktop applications accessing Oracle databases and I have never used regular expressions. Now I enter Stack Overflow and I see a lot of questions about them; I feel like I missed something.
For what do you use regular expressions?
P.S. sorry for my bad english
Consider an example in Ruby:
puts "Matched!" unless /\d{3}-\d{4}/.match("555-1234").nil?
puts "Didn't match!" if /\d{3}-\d{4}/.match("Not phone number").nil?
The "/\d{3}-\d{4}/" is the regular expression, and as you can see it is a VERY concise way of finding a match in a string.
Furthermore, using groups you can extract information, as such:
match = /([^#]*)#(.*)/.match("myaddress#domain.com")
name = match[1]
domain = match[2]
Here, the parenthesis in the regular expression mark a capturing group, so you can see exactly WHAT the data is that you matched, so you can do further processing.
This is just the tip of the iceberg... there are many many different things you can do in a regular expression that makes processing text REALLY easy.
Regular Expressions (or Regex) are used to pattern match in strings. You can thus pull out all email addresses from a piece of text because it follows a specific pattern.
In some cases regular expressions are enclosed in forward-slashes and after the second slash are placed options such as case-insensitivity. Here's a good one :)
/(bb|[^b]{2})/i
Spoken it can read "2 be or not 2 be".
The first part are the (brackets), they are split by the pipe | character which equates to an or statement so (a|b) matches "a" or "b". The first half of the piped area matches "bb". The second half's name I don't know but it's the square brackets, they match anything that is not "b", that's why there is a roof symbol thingie (technical term) there. The squiggly brackets match a count of the things before them, in this case two characters that are not "b".
After the second / is an "i" which makes it case insensitive. Use of the start and end slashes is environment specific, sometimes you do and sometimes you do not.
Two links that I think you will find handy for this are
regular-expressions.info
Wikipedia - Regular expression
Coolest regular expression ever:
/^1?$|^(11+?)\1+$/
It tests if a number is prime. And it works!!
N.B.: to make it work, a bit of set-up is needed; the number that we want to test has to be converted into a string of “1”s first, then we can apply the expression to test if the string does not contain a prime number of “1”s:
def is_prime(n)
str = "1" * n
return str !~ /^1?$|^(11+?)\1+$/
end
There’s a detailled and very approachable explanation over at Avinash Meetoo’s blog.
If you want to learn about regular expressions, I recommend Mastering Regular Expressions. It goes all the way from the very basic concepts, all the way up to talking about how different engines work underneath. The last 4 chapters also gives a dedicated chapter to each of PHP, .Net, Perl, and Java. I learned a lot from it, and still use it as a reference.
If you're just starting out with regular expressions, I heartily recommend a tool like The Regex Coach:
http://www.weitz.de/regex-coach/
also heard good things about RegexBuddy:
http://www.regexbuddy.com/
As you may know, Oracle now has regular expressions: http://www.oracle.com/technology/oramag/webcolumns/2003/techarticles/rischert_regexp_pt1.html. I have used the new functionality in a few queries, but it hasn't been as useful as in other contexts. The reason, I believe, is that regular expressions are best suited for finding structured data buried within unstructured data.
For instance, I might use a regex to find Oracle messages that are stuffed in log file. It isn't possible to know where the messages are--only what they look like. So a regex is the best solution to that problem. When you work with a relational database, the data is usually pre-structured, so a regex doesn't shine in that context.
A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is .*\.txt$.
A great resource for regular expressions: http://www.regular-expressions.info
These RE's are specific to Visual Studio and C++ but I've found them helpful at times:
Find all occurrences of "routineName" with non-default params passed:
routineName\(:a+\)
Conversely to find all occurrences of "routineName" with only defaults:
routineName\(\)
To find code enabled (or disabled) in a debug build:
\#if._DEBUG*
Note that this will catch all the variants: ifdef, if defined, ifndef, if !defined
Validating strong passwords:
This one will validate a password with a length of 5 to 10 alphanumerical characters, with at least one upper case, one lower case and one digit:
^(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])[a-zA-Z0-9]{5,10}$