What are good regular expressions? - regex

I have worked for 5 years mainly on Java desktop applications accessing Oracle databases, and I have never used regular expressions. Now that I have joined Stack Overflow I see a lot of questions about them, and I feel like I missed something.
What do you use regular expressions for?
P.S. Sorry for my bad English.

Consider an example in Ruby:
puts "Matched!" unless /\d{3}-\d{4}/.match("555-1234").nil?
puts "Didn't match!" if /\d{3}-\d{4}/.match("Not phone number").nil?
The "/\d{3}-\d{4}/" is the regular expression, and as you can see it is a VERY concise way of finding a match in a string.
Furthermore, using groups you can extract information, as such:
match = /([^#]*)#(.*)/.match("myaddress#domain.com")
name = match[1]
domain = match[2]
Here, the parentheses in the regular expression mark capturing groups, so you can see exactly WHAT data you matched and process it further.
This is just the tip of the iceberg... there are many, many different things you can do in a regular expression that make processing text REALLY easy.
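For readers more at home in Python, the same two ideas (testing for a match and pulling out captured groups) can be sketched with the standard re module. This is only an illustrative translation of the Ruby above; the helper names looks_like_phone and split_address are made up for the example.

```python
import re

PHONE = re.compile(r"\d{3}-\d{4}")
ADDRESS = re.compile(r"([^#]*)#(.*)")


def looks_like_phone(text):
    """Return True if text contains something shaped like 555-1234."""
    return PHONE.search(text) is not None


def split_address(text):
    """Return (name, domain) split at the first '#', or None without a '#'."""
    m = ADDRESS.search(text)
    # group(1) and group(2) are the two parenthesised capturing groups.
    return (m.group(1), m.group(2)) if m else None


print(looks_like_phone("555-1234"))
print(looks_like_phone("Not phone number"))
print(split_address("myaddress#domain.com"))
```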

Regular expressions (or regex) are used to match patterns in strings. You can, for example, pull all email addresses out of a piece of text because they follow a specific pattern.
In some cases regular expressions are enclosed in forward-slashes and after the second slash are placed options such as case-insensitivity. Here's a good one :)
/(bb|[^b]{2})/i
Spoken it can read "2 be or not 2 be".
The first part is the (parentheses), which form a group. Inside the group, the pipe | character works as an "or", so (a|b) matches "a" or "b". The first half of the alternation matches "bb". The second half uses square brackets, which define a character class; the caret ^ at its start negates the class, so [^b] matches any character that is not "b". The curly braces give a repeat count for the thing before them, in this case two characters that are not "b".
After the second / is an "i" which makes it case insensitive. Use of the start and end slashes is environment specific, sometimes you do and sometimes you do not.
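To see the pieces in action, here is a small Python sketch. Python has no slash notation, so the trailing i becomes the re.IGNORECASE flag; the helper name first_hit is just for illustration.

```python
import re

# /(bb|[^b]{2})/i written for Python: the "i" option becomes a flag.
TWO_B = re.compile(r"(bb|[^b]{2})", re.IGNORECASE)


def first_hit(text):
    """Return the first substring matched by the pattern, or None."""
    m = TWO_B.search(text)
    return m.group(1) if m else None


print(first_hit("bb"))  # left alternative: two b's
print(first_hit("BB"))  # same, thanks to case-insensitivity
print(first_hit("xy"))  # right alternative: two characters that are not b
print(first_hit("ab"))  # no two-character window satisfies either half
```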
Two links that I think you will find handy for this are
regular-expressions.info
Wikipedia - Regular expression

Coolest regular expression ever:
/^1?$|^(11+?)\1+$/
It tests if a number is prime. And it works!!
N.B.: to make it work, a bit of set-up is needed; the number that we want to test has to be converted into a string of “1”s first, then we can apply the expression to test if the string does not contain a prime number of “1”s:
def is_prime(n)
str = "1" * n
return str !~ /^1?$|^(11+?)\1+$/
end
There’s a detailed and very approachable explanation over at Avinash Meetoo’s blog.
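The same trick carries over to Python more or less unchanged. The sketch below is a curiosity rather than a practical primality test: the backtracking gets slow quickly as n grows.

```python
import re

# Matches a string of 1s whose length is 0, 1, or a composite number:
#   ^1?$          -> zero or one "1"
#   ^(11+?)\1+$   -> a block of two or more 1s repeated at least twice
NOT_PRIME = re.compile(r"^1?$|^(11+?)\1+$")


def is_prime(n):
    """Return True if n is prime, using the unary-string regex trick."""
    return NOT_PRIME.match("1" * n) is None


print([n for n in range(20) if is_prime(n)])
```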

If you want to learn about regular expressions, I recommend Mastering Regular Expressions. It goes from the very basic concepts all the way up to how different engines work underneath. The last four chapters are each dedicated to one of PHP, .NET, Perl, and Java. I learned a lot from it, and still use it as a reference.

If you're just starting out with regular expressions, I heartily recommend a tool like The Regex Coach:
http://www.weitz.de/regex-coach/
I've also heard good things about RegexBuddy:
http://www.regexbuddy.com/

As you may know, Oracle now has regular expressions: http://www.oracle.com/technology/oramag/webcolumns/2003/techarticles/rischert_regexp_pt1.html. I have used the new functionality in a few queries, but it hasn't been as useful as in other contexts. The reason, I believe, is that regular expressions are best suited for finding structured data buried within unstructured data.
For instance, I might use a regex to find Oracle messages that are buried in a log file. It isn't possible to know where the messages are, only what they look like, so a regex is the best solution to that problem. When you work with a relational database, the data is usually pre-structured, so a regex doesn't shine in that context.
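As a rough illustration of the log-file case, here is a Python sketch. The log lines are invented for the example, and the pattern assumes messages of the form ORA- followed by five digits, a colon and the message text.

```python
import re

# Assumed message shape: "ORA-" + five digits + ": " + text to end of line.
ORA_MESSAGE = re.compile(r"ORA-\d{5}: [^\n]*")

# Invented sample log; the surrounding text has no fixed structure.
SAMPLE_LOG = """\
2009-01-12 10:02:11 INFO  starting nightly batch
2009-01-12 10:02:15 ERROR load failed: ORA-00942: table or view does not exist
2009-01-12 10:02:16 INFO  retrying
2009-01-12 10:02:19 ERROR ORA-01017: invalid username/password; logon denied
"""


def find_oracle_messages(log_text):
    """Return every ORA-nnnnn message found anywhere in the text."""
    return ORA_MESSAGE.findall(log_text)


for message in find_oracle_messages(SAMPLE_LOG):
    print(message)
```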

A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is .*\.txt$.
A great resource for regular expressions: http://www.regular-expressions.info
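A quick Python sketch of the wildcard comparison, using an invented list of file names. It uses fnmatchcase from the standard fnmatch module for the wildcard side so that the result does not depend on the operating system's case rules.

```python
import re
from fnmatch import fnmatchcase

# Invented file names for the comparison.
names = ["notes.txt", "report.txt.bak", "txt", "a.TXT", ".txt"]

# The regex from the text: anything, then a literal ".txt" at the end.
TXT = re.compile(r".*\.txt$")

by_regex = [n for n in names if TXT.search(n)]
by_wildcard = [n for n in names if fnmatchcase(n, "*.txt")]

print(by_regex)
print(by_wildcard)
```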

These RE's are specific to Visual Studio and C++ but I've found them helpful at times:
Find all occurrences of "routineName" with non-default params passed:
routineName\(:a+\)
Conversely to find all occurrences of "routineName" with only defaults:
routineName\(\)
To find code enabled (or disabled) in a debug build:
\#if.*_DEBUG
Note that this will catch all the variants: ifdef, if defined, ifndef, if !defined

Validating strong passwords:
This one will validate a password with a length of 5 to 10 alphanumerical characters, with at least one upper case, one lower case and one digit:
^(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])[a-zA-Z0-9]{5,10}$
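Here is how that pattern might be used from Python. Each (?=...) is a lookahead that checks a condition without consuming characters; the final character class then enforces the allowed characters and the length. The function name is_valid_password is made up for the sketch, and fullmatch is used so that a trailing newline is not accepted.

```python
import re

PASSWORD = re.compile(
    r"^(?=.*[A-Z])"      # at least one upper-case letter somewhere
    r"(?=.*[a-z])"       # at least one lower-case letter somewhere
    r"(?=.*[0-9])"       # at least one digit somewhere
    r"[a-zA-Z0-9]{5,10}$"  # only alphanumerics, 5 to 10 of them in total
)


def is_valid_password(candidate):
    """Return True if candidate satisfies all four rules."""
    return PASSWORD.fullmatch(candidate) is not None


for candidate in ["Abc12", "abc12", "ABC12", "Abcde", "Ab1", "Abc1!"]:
    print(candidate, is_valid_password(candidate))
```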

Related

EditPad: How to replace multiple search criteria with multiple values?

I did some searching and found tons of questions about multiple replacements with Regex, but I'm working in EditPadPro and so need a solution that works with the regex syntax of that environment. Hoping someone has some pointers as I haven't been able to work out the solution on my own.
Additional disclaimer: I suck with regex. I mean really... it's bad. Like I barely know wtf I'm doing. So that being said, here is what I need to do and how I'm currently approaching it...
I need to replace two possible values, with their corresponding replacements. My two searches are:
(.*)-sm
(.*)-rad
Currently I run these separately and replace each with simple strings:
sm
rad
Basically I need to lop off anything that comes prior to "sm" so I just detect everything up to and including sm, and then replace it all with that string (and likewise for "rad").
But it seems like there should be a way to do this in a single search/replace operation. I can do the search part fine with:
(.*)-sm|(.*)-rad
But then how do I replace each with its matching value? That's where I'm stuck. I tried:
sm|rad
but alas, that just becomes the literal complete string that is used for replacement.
Jonathan, first off let me congratulate you for using EditPad Pro for regex work on your text. It's my main text editor, and the main reason I chose it, as a regex lover, is that its support of regex syntax is vastly superior to competing editors. For instance, Notepad++ is known for its shoddy support of regular expressions. The reason, of course, is that EditPad Pro's author Jan Goyvaerts is also the author of the legendary RegexBuddy.
A picture is worth a thousand words... So here is how I would do your replacement. Just hit the "Replace All" button. The expression in the regex box assumes that anything before the dash that is not a whitespace character can be stripped, so if this is not what you want, we need to tune it.
Search for:
(.*)-(sm|rad)
Now, when you put something in parentheses in a regex, whatever it matches is stored in a temporary variable. So whatever matched (.*) is stored in \1 and whatever matched (sm|rad) is stored in \2. Therefore, you want to replace with:
\2
Note that the replacement variable may be different depending on what programming language you are using. In Perl, for example, I would have to use $2 instead.
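The same search and replace can be tried outside the editor. The following Python sketch uses re.sub, where the second group is also written \2 in the replacement; the sample strings are invented.

```python
import re

# Group 1 is everything before the last "-sm" or "-rad"; group 2 is the suffix.
SUFFIX = re.compile(r"(.*)-(sm|rad)")


def keep_suffix(text):
    """Replace each '<anything>-sm' or '<anything>-rad' by just the suffix."""
    return SUFFIX.sub(r"\2", text)


print(keep_suffix("button-sm"))
print(keep_suffix("border-top-rad"))
print(keep_suffix("a-sm\nb-rad"))
```

Because . does not match a newline by default, each line is handled separately, much like a line-by-line replace in an editor.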

Can I perform stemming using regular expressions?

How can I get my regular expression to match against just one condition exactly?
For example I have the following regular expression:
(\w+)(?=ly|es|s|y)
Matching the expression against the word "glasses" returns:
glasse
The correct match should be:
glass (match should be on 'es' rather than 's' as in the match above)
The expression should cater for any kinds of words such as:
films
lovely
glasses
glass
Currently the regular expression is matching the above words as:
film - correct
lovel - incorrect
glasse - incorrect
glas - incorrect
The correct match for the words should be:
film
love
glass
glass
The problem I am having at the moment is I am not sure how to adjust my regular expression to cater for either 's' or 'es' exactly, as a word could contain both such as "glasses".
Update
Thank you for the answers so far. I appreciate the complexity of stemming and the requirement of language knowledge. However in my particular case the words are finite (films,lovely,glasses and glass) and so therefore I will only ever encounter these words and the suffixes in the expression above. I don't have a particular application for this. I was just curious to see if it was possible using regular expressions. I have come to the conclusion that it is not possible, however would the following be possible:
A match is either found or not found, for example match glasses but NOT glass but DO match films:
film (match) - (films)
glass (match) - (glasses)
glass (no match) - (glass)
What I'm thinking is whether there is a way to match the suffix exactly against the string from the end. In the example above 'es' matches glass(es), therefore the condition 's' is discarded. In the case of glass (no match) the condition 's' is discarded because another 's' precedes it, so it does not match exactly. I must admit I'm not 100% sure about this so my logic may seem a little shaky, it's just an idea.
If you want to do stemming, use a library like Snowball. It's going to be impossible to do what you want to do with regular expressions. In particular, it will be impossible for your regex to know that the trailing 's' should be removed from 'films' but not 'glass' without some kind of knowledge of the language.
There's vast literature on stemming and lemmatization. Google is your friend.
The basic problem you're having here is that the plus in
(\w+)(?=ly|es|s|y)
is greedy, and will grab as much as possible while still allowing the whole regex to match. You've not said exactly which flavour of regex you're using but try
(\w+?)(?=ly|es|s|y)
+? means the same as + but is reluctant, matching as little as possible while still allowing the overall match to succeed.
However this would still have the problem that it splits glass into glas and s. To handle this you'd need something like
(\w+?)(?=ly|es|(?<!s)s|y)
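using negative lookbehind to prevent the s alternative from matching when preceded by another s. Note that the lookahead itself is not anchored, so the suffix can be found in the middle of a word; to stem whole words you also need to require the end of the word (for example $ or \b) right after the suffix alternatives.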
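A small Python sketch of this idea follows. It makes one assumption beyond the pattern above: the suffix alternatives are anchored to the end of the word with $, so that the reluctant \w+? cannot stop at an s in the middle of a word. The function name stem is hypothetical.

```python
import re

# Reluctant stem, followed by one of the suffixes at the very end of the word.
# (?<!s)s means "an s that is not preceded by another s".
STEM = re.compile(r"^(\w+?)(?=(?:ly|es|(?<!s)s|y)$)")


def stem(word):
    """Return the stem if the word ends in a known suffix, else None."""
    m = STEM.match(word)
    return m.group(1) if m else None


for word in ["films", "lovely", "glasses", "glass"]:
    print(word, "->", stem(word))
```

This only covers the four words from the question; it is not a general stemmer.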
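As an option for anyone looking for this kind of solution in Python, the Natural Language Toolkit (NLTK) provides a RegexpStemmer, and it works very fast:
# regex stemmer
import time
from nltk.stem import RegexpStemmer
# Strip these suffixes; min=3 leaves words shorter than 3 characters alone.
rs = RegexpStemmer('ing$|s$|ed$|y$', min=3)
# train is assumed to be a pandas DataFrame and col the name of a text column.
t = time.perf_counter()  # time.clock() was removed in Python 3.8
train[col] = train[col].apply(lambda x: ' '.join(rs.stem(word) for word in x.split()))
print(time.perf_counter() - t)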
http://www.nltk.org/api/nltk.stem.html
http://snowball.tartarus.org/algorithms/english/stemmer.html

Regular expression trouble

Hey guys - I'm tearing my hair out trying to create a regular expression to match something like:
{TextOrNumber{MoreTextOrNumber}}
Note the matching number of open/close {}. Is this even possible?
Many thanks.
Note the matching number of open/close {}. Is this even possible?
Historically, no. However, modern regular expressions aren’t actually regular and some allow such constructs:
\{TextOrNumber(?R)?\}
(?R) recursively inserts the pattern again. Notice that not many regex engines support that (yet).
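If you need to handle an arbitrary number of braces, you can use a parser generator, or apply a regex inside a recursive function. The following Ruby example peels off one brace level per call:
def parse(s)
  # Capture the label and, optionally, the nested {...} block that follows it.
  if s =~ /^\{([A-Za-z0-9]*)(\{.*\})?\}$/
    puts $1
    parse($2)  # $2 is nil at the innermost level, which ends the recursion
  end
end
parse("{foo{bar{baz}}}")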
This is not possible with 1 regex if you don't have a recursive extension available. You'll have to match a regex like the following one multiple times
/\{[a-z0-9]+([a-z0-9\{\}]+)?\}/i
capture the "MoreTextOrNumber" and let it match again until you are through or it fails.
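A sketch of that loop in Python might look like the following. It assumes a single chain of nesting such as {a{b{c}}}, with no sibling groups, and the function name labels is made up.

```python
import re

# One level: "{", a label, optionally a nested "{...}" block, then "}".
LEVEL = re.compile(r"\{([a-z0-9]+)(\{.*\})?\}", re.IGNORECASE)


def labels(text):
    """Return the labels from outermost to innermost, or raise ValueError."""
    found = []
    while text:
        m = LEVEL.fullmatch(text)
        if m is None:
            raise ValueError("unbalanced or malformed input: %r" % text)
        found.append(m.group(1))
        text = m.group(2)  # the captured inner block, or None when done
    return found


print(labels("{foo{bar{baz}}}"))
```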
Not easy but possible
Officially, regular expressions are not designed for parsing nested paired brackets, and if you try to do this, you run into all sorts of problems. There are other tools (like parser generators, e.g. yacc or bison) that are designed for such structures and can handle them well. But it can be done, and if you do it right it may even be simpler than a yacc grammar with all the support code needed to work around the problems of yacc.
Here are some hints:
First of all, my suggestions work best if you have some characters that will never appear in the input. Often, characters like \01 and \02 should never appear, so you can do
s/[\01\02]/ /g;
to make sure they are not there. Otherwise, you may want to escape them (e.g. convert them to text like %1 and %2) with an expression like
s/([\01\02%])/"%".ord($1)/ge;
Notice that I also escaped the escape character "%".
Now, I suggest parsing brackets from the inside out: replace any substring "{ text }" where "text" does not contain any brackets by a placeholder "\01$number\02" and store the enclosed text in $array[$number]:
$number=1;
while (s/\{([^{}]*)\}/"\01$number\02"/e) { $array[$number]=$1; $number++; }
$array[0]=$_; # $array[0] corresponds to your input
As a final step, you may want to process each element in @array to pull out and process the "\01$number\02" markers. This is easy because they are no longer nested.
I happily use this idea in a few parsers (including separating matching bracket types like "(){}[]" etc).
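For comparison, here is a rough Python rendering of the inside-out idea. It assumes, as above, that the control characters \x01 and \x02 never occur in the input; the function name parse_inside_out is made up.

```python
import re

# An innermost group: braces whose content contains no further braces.
INNERMOST = re.compile(r"\{([^{}]*)\}")


def parse_inside_out(text):
    """Replace innermost {...} groups by \\x01N\\x02 markers, one at a time.

    Returns a list where entry N holds the content of group N and
    entry 0 holds what is left of the input after all replacements.
    """
    pieces = [None]  # index 0 is filled in at the end
    while True:
        m = INNERMOST.search(text)
        if m is None:
            break
        pieces.append(m.group(1))
        marker = "\x01%d\x02" % (len(pieces) - 1)
        text = text[:m.start()] + marker + text[m.end():]
    pieces[0] = text
    return pieces


for index, piece in enumerate(parse_inside_out("{foo{bar{baz}}}")):
    print(index, repr(piece))
```

Any brace left over in entry 0 means the input was unbalanced.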
But before you go down this road, make sure to have used regular expressions in simpler applications: You will run into many small problems and you need experience to resolve them (rather than turning one small problem into two small problems etc.).

How to remove a small part of the string in the big string using RegExp

Hey guys, I don't know RegExp yet. I know a little about it but I'm not an experienced user.
Supposed that I run a RegExp match on a website, the matches are:
Data: Informations
Data: Liberty
Then I want to extract only Informations and Liberty, I don't want the Data: part.
Does Data: always appear at the beginning of a line?
Can there be multiple spaces between the : and the next word?
Do you know about groups?
What do you want: lazy matching vs greedy matching?
If so, you can use (with lazy matching):
^Data:\s+(.*?)$
With character classes:
^Data:\s+(\w+)$
if you know that it'll always be a word. Try this website.
Can't be absolutely sure without knowing more about the potential matches, but this should be at least a good starting point:
Data: (.*)$
That will return everything after "Data: " to the end of the line.
Search for a regular expression like
Data: (.*)
Then use the "first submatch", which is often referred to by "$1" or "\1", depending on the language you are using.
Regular expression engines support what are commonly called "capturing groups". If you surround a pattern or part of a pattern with (), the part of the string matched by that part of the regular expression will be captured.
The command(s) you use to do the matching will determine how to get these captured values. They may be stored in special variables (eg: $1, $2) or you may be able to specify the names of the variables either embedded within the regular expression or as arguments to the regular expression command. Exactly how depends on what language you are using.
So, read up on the regexp commands for the language of your choice and look for the term "capturing groups" or maybe just "groups".
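As one concrete instance, here is how the capturing group can be read in Python with the standard re module. The input text is the two lines from the question plus one invented line that should be ignored.

```python
import re

text = "Data: Informations\nData: Liberty\nOther: ignored"

# re.MULTILINE makes ^ and $ match at the start and end of every line.
# With a single capturing group, findall returns just the captured parts.
lazy = re.findall(r"^Data:\s+(.*?)$", text, re.MULTILINE)
word_only = re.findall(r"^Data:\s+(\w+)$", text, re.MULTILINE)

print(lazy)
print(word_only)
```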

Using an asterisk in a RegExp to extract data that is enclosed by a certain pattern

I have a text that consists of information enclosed by a certain pattern.
The only thing I know is the pattern: "${template.start}" and "${template.end}"
To keep it simple I will substitute ${template.start} and ${template.end} with "a" in the example.
So one entry in the text would be:
aINFORMATIONHEREa
I do not know how many of these entries are concatenated in the text. So the following is correct too:
aFOOOOOOaaASDADaaASDSDADa
I want to write a regular expression to extract the information enclosed by the "a"s.
My first attempt was to do:
a(.*)a
which works as long as there is only one entry in the text. As soon as there is more than one entry it fails, because the .* matches everything. So using a(.*)a on aFOOOOOOaaASDADaaASDSDADa results in only one capturing group, containing everything between the first and the last character of the text (both of which are "a"):
FOOOOOOaaASDADaaASDSDAD
What I want to get is something like
captureGroup(0): aFOOOOOOaaASDADaaASDSDADa
captureGroup(1): FOOOOOO
captureGroup(2): ASDAD
captureGroup(3): ASDSDAD
It would be great to be able to extract each entry from the text and, from each entry, the information that is enclosed between the "a"s. By the way, I am using the QRegExp class of Qt4.
Any hints? Thanks!
Markus
Multiple variations of this question have been seen before. Various related discussions:
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Using regular expressions how do I find a pattern surrounded by two other patterns without including the surrounding strings?
Use RegExp to match a parenthetical number then increment it
Regex for splitting a string using space when not surrounded by single or double quotes
What regex will match text excluding what lies within HTML tags?
and probably others...
Simply use non-greedy expressions, namely:
a(.*?)a
You need to match something like:
a[^a]*a
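The difference between the three patterns is easy to see in a short Python sketch (the question is about QRegExp, so take this only as an illustration of the matching behaviour). The extract helper and its arguments are hypothetical; it uses re.escape so that delimiters such as ${template.start} are treated literally.

```python
import re

text = "aFOOOOOOaaASDADaaASDSDADa"

greedy = re.findall(r"a(.*)a", text)      # .* runs to the last "a"
lazy = re.findall(r"a(.*?)a", text)       # .*? stops at the next "a"
negated = re.findall(r"a([^a]*)a", text)  # [^a]* cannot cross an "a"

print(greedy)
print(lazy)
print(negated)


def extract(text, start, end):
    """Return the pieces enclosed by the literal start and end delimiters."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    return re.findall(pattern, text, re.DOTALL)


sample = "${template.start}x${template.end}${template.start}y${template.end}"
print(extract(sample, "${template.start}", "${template.end}"))
```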
You have a couple of working answers already, but I'll add a little gratuitous advice:
Using regular expressions for parsing is a road fraught with danger
Edit: To be less cryptic: for all their power, flexibility and elegance, regular expressions are not sufficiently expressive to describe any but the simplest grammars. They are adequate for the problem asked here, but are not a suitable replacement for state machines or recursive descent parsers if the input language becomes more complicated.
So, choosing to use RE for parsing input streams is a decision that should be made with care and with an eye towards the future.