Matching Conditions in Regex - regex

Just a note upfront: I'm a bit of a regex newbie. Perhaps a good answer to this question would involve linking me to a resource that explains how these sorts of conditions work :)
Let's say that I have a street name, like 23rd St or 5th St. I'd like to get rid of the trailing "th", "rd", "nd", and "st". How can this be done?
Right now I have the expression: (st|nd|rd|th) . The problem with this is that it will also match street names that contain a "st", "nd", "rd", or "th". So what I really need is a conditional match that requires at least one digit immediately before the suffix (i.e., 1st and not street).
Thank you!

It sounds like you just want to match the ordinal suffix (st|nd|rd|th), yes?
If your regex engine supports it, you could use a lookbehind assertion.
/(?<=\d)(st|nd|rd|th)/
That matches (st|nd|rd|th) only if preceded by a digit \d, but the match does not include the digit itself, so replacing the match with an empty string leaves the number intact.
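In case a concrete run helps, here is a minimal sketch of the lookbehind approach in Python, assuming the `re` module (which supports fixed-width lookbehind). The function name `strip_ordinal_suffix` is made up for illustration.

```python
import re

# Match an ordinal suffix only when a digit sits immediately before it.
# The lookbehind (?<=\d) checks for the digit without consuming it.
ORDINAL_SUFFIX = re.compile(r"(?<=\d)(st|nd|rd|th)")

def strip_ordinal_suffix(street: str) -> str:
    """Remove 'st', 'nd', 'rd' or 'th' when it directly follows a digit."""
    return ORDINAL_SUFFIX.sub("", street)

print(strip_ordinal_suffix("23rd St"))     # 23 St
print(strip_ordinal_suffix("1st Street"))  # 1 Street
print(strip_ordinal_suffix("North St"))    # North St
```

Because the digit is never part of the match, the replacement string can simply be empty.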

What you really want are anchors.
Try and replace globally:
\b(\d+)(?:st|nd|rd|th)\b
with the first group.
Explanation:
\b --> matches a position where either a word character (digit, letter, underscore) is followed by a non word character (none of the previous group), or the reverse;
(\d+) --> matches one or more digits, and capture them in first group ($1);
(?:st|nd|rd|th) --> matches any of st, nd, rd or th without capturing it ((?:...) is a non-capturing group);
\b --> see above.
Demonstration using perl:
$ perl -pe 's/\b(\d+)(?:st|nd|rd|th)\b/$1/g' <<EOF
> Mark, 23rd street, New Hampshire
> I live on the 7th avenue
> No match here...
> azoiu32rdzeriuoiu
> EOF
Mark, 23 street, New Hampshire
I live on the 7 avenue
No match here...
azoiu32rdzeriuoiu
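For comparison, a rough Python equivalent of the Perl one-liner above, run on the same four lines. This is only a sketch; `strip_ordinals` is a hypothetical helper name, and `\1` is Python's spelling of the first group in the replacement.

```python
import re

# Same pattern as the Perl demonstration: digits captured in group 1,
# suffix in a non-capturing group, word boundaries on both sides.
ORDINAL = re.compile(r"\b(\d+)(?:st|nd|rd|th)\b")

def strip_ordinals(line: str) -> str:
    """Replace every '<digits><suffix>' with just the digits."""
    return ORDINAL.sub(r"\1", line)

samples = [
    "Mark, 23rd street, New Hampshire",
    "I live on the 7th avenue",
    "No match here...",
    "azoiu32rdzeriuoiu",
]
for sample in samples:
    print(strip_ordinals(sample))
```

The last line is left alone because there is no word boundary before `32` or after `rd` inside that run of letters and digits.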

Try using this regex:
(\d+)(?:st|nd|rd|th)
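I don't know Ruby. In PHP I would use something like:
preg_replace('/(\d+)(?:st|nd|rd|th)\b/', '$1', 'South 2nd Street');
to remove the suffix. (A literal trailing space in the pattern would be swallowed by the replacement and give "South 2Street", so a word boundary \b is used instead.)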

To remove the ordinal:
/(\d+)(?:st|nd|rd|th)\b/$1/
You must capture the number so you can replace the match with it. You can capture the ordinal or not, it doesn't matter unless you want to output it somewhere else.
http://www.regular-expressions.info/javascriptexample.html

Related

How to write regex that excludes the name after Mr. when finding words at the start of a sentence?

I need to write a regex that matches the first word of each sentence. So far it matches the first word of the paragraph, the first word after a quotation mark, and the first word after a period and a space. The problem is that 'Sherwood', which is obviously a name, shouldn't be matched, but it is, because it fits the rule I have written: 'capital starting letter, comes directly after . and a space'.
How can I change my regex to exclude the name that comes after Mr. or Dr.?
Current regex: ((^[A-Z]+[a-z]*[A-Z]*[a-z]*|(?<=\")[A-Z]+[a-z]*[A-Z]*[a-z]*)|(?<=\.\s)[A-Z]+[a-z]*[A-Z]*[a-z]*)
I've used regex101.com as a reference.
You could shorten the pattern by moving the alternation into a single non-capturing group containing the 3 contexts that are allowed before the start of a match.
As you are already using a lookbehind assertion, you can exclude Mr. or Dr. to the left using a negative lookbehind:
(?:^|(?<=")|(?<=\.\s))(?<![MD]r\. )[A-Z]+[a-z]*[A-Z]*[a-z]*
You could also match an uppercase character first and only then run the assertions, so that the alternation of lookbehind assertions is not evaluated at every position where no match is possible.
[A-Z](?<=".|\.\s.|^.)(?<![MD]r\. .)(?:[A-Z]*[a-z]*){2}
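To see the negative lookbehind at work, here is a small Python sketch using the first pattern (the one with separate lookbehinds). It assumes Python's `re` module, which requires each lookbehind to have a fixed width; for that reason the second pattern, whose single lookbehind mixes alternatives of different lengths, is not used here. The sample sentence and the name `sentence_starts` are invented for illustration.

```python
import re

# Allowed contexts: start of string, after a double quote, or after ". ".
# The negative lookbehind then rejects a word preceded by "Mr. " or "Dr. ".
FIRST_WORD = re.compile(
    r'(?:^|(?<=")|(?<=\.\s))(?<![MD]r\. )[A-Z]+[a-z]*[A-Z]*[a-z]*'
)

def sentence_starts(text: str) -> list[str]:
    """Return the words that start a sentence, skipping names after Mr./Dr."""
    return FIRST_WORD.findall(text)

text = "Hello there. Mr. Sherwood arrived. Everyone cheered."
print(sentence_starts(text))  # ['Hello', 'Mr', 'Everyone']
```

'Mr' itself is still reported, since it does follow a period and a space; only the name after it is excluded.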

Basic regex capture

I am trying to use regex to capture a string between 2 strings. I'm not experienced with regex. I want to capture the town name in the following string:
On April 6, 2016, the Town of Woodstock auctioned
±
14.00
I've tried the most basic capture attempt:
town of (.*) auctioned
that I can think of, but I'm not getting any match at all. What am I doing wrong?
First, regexes are case-sensitive by default. Use the (?i) inline modifier to change that (assuming a regex engine that doesn't expect its modifiers after the regex).
Second, whitespace is treated like any other character. If the text contains two spaces between words, then your regex won't match if it uses only one space.
Lastly, you probably should use a lazy quantifier:
(?i)town\s+of\s+(.*?)\s+auctioned
should work.
If the text you want always starts after "town of" and ends before "auctioned", then:
/town\s+of\s+(.*?)\s+auctioned/ig
Here the i flag makes the match case-insensitive in JavaScript (g makes it global), and capture group 1 contains your town name.

Extract with regex when the same special character is used

I've been trying to use Regex tools online, but none seem to be working. I am close but not sure what I'm missing.
Here is the Text:
Valencia, Los Angeles, California - Map
I want to extract the first 2 letters of the state (so between "," and "-"), in this case "CA"
What I've done so far is:
[,/](.*)[-/]
$1
The output is:
Los Angeles, California
If anything I thought I would at least just get the state.
,\s*(\w\w)[^,]*-
will capture Ca in group 1.
, a literal comma
\s* optional whitespace
(\w\w) capture the first two word characters
[^,]* make sure there's no comma up to the next dash
- a literal dash
,\s*(\S{2})[^,]*-
You're going to want to take just the first match.
I assume you use JavaScript.
Your regex fails this particular case because there are two commas in your input.
One possible fix is to modify the middle capture from . (any character) to [^,] (any character except comma). This will force the regex to match California only.
So, try [,/]([^,]*)[-/]. Note that group 1 will then include the spaces around the state name (" California "), so you may want to trim it.
You can use this regex:
.*?,\s(\w\w)[^,]*-
$1 is the first two letters you're looking for.

how to use regex to grab the eighth word

I'm new to regex.
The examples I've seen show searching for very specific things, e.g. specific letter combinations.
What I want is to grab the 8th word, no matter what comes before it and no matter what those words are.
So spaces are what delimit 'words'.
Sample line would be
Sep 20 11:13:18 10.50.3.100 Sep 20 11:13:15 DC1ASM1.dcl.greendotcorp.com Blah Blah Blah
I want to extract the host name, in this case "DC1ASM1.dcl.greendotcorp.com", which is always preceded by "Month, Day, Timestamp, IP, Month, Day, Timestamp" pattern.
Thanks
Rex
I'm not 100% sure what version or flavor of regex you're using, so I'll avoid the look-behind and use a non-capturing group instead:
^(?:\S+?\s){7}(\S+)
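That anchors to the beginning of the line and skips 7 consecutive repetitions of [one or more non-whitespace characters] followed by [a single whitespace character]; the eighth word is then captured in group 1.
You can be more specific about "words" by using \w instead of \S if you so choose, though \w would not match the ':' and '.' characters in the timestamps and IP address of your sample line.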
This expression will capture the host name in the named group HostName. It assumes there are always only single spaces.
^([^ ]+ ){7}(?<HostName>[^ ]+)
To handle multiple spaces use the following expression.
^([^ ]+ +){7}(?<HostName>[^ ]+)
To also support tabs use the following expression.
^([^ \t]+[ \t]+){7}(?<HostName>[^ \t]+)
Try something like:
regex = "([^\s]+\s+){7}(?<eighthword>[^\s]+)"

Matching on repeated substrings in a regex

Is it possible for a regex to match based on other parts of the same regex?
For example, how would I match lines that begin and end with the same sequence of 3 characters, regardless of what the characters are?
Matches:
abcabc
xyz abc xyz
Doesn't Match:
abc123
Undefined: (Can match or not, whichever is easiest)
ababa
a
Ideally, I'd like something in the perl regex flavor. If that's not possible, I'd be interested to know if there are any flavors that can do it.
Use capture groups and backreferences.
/^(.{3}).*\1$/
The \1 refers back to whatever is matched by the contents of the first capture group (the contents of the ()). Regexes in most languages allow something like this.
You need backreferences. The idea is to use a capturing group for the first bit, and then refer back to it when you're trying to match the last bit. Here's an example of matching a pair of HTML start and end tags (from the link given earlier):
<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>
This regex contains only one pair of parentheses, which capture the string matched by [A-Z][A-Z0-9]* into the first backreference. This backreference is reused with \1 (backslash one). The / before it is simply the forward slash in the closing HTML tag that we are trying to match.
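As an illustration only, the tag-matching regex can be run in Python like this. The sketch assumes the `re` module and uppercase tag names, since the pattern as written has no case-insensitive flag; `paired_tags` is a hypothetical name, and this is not meant as a real HTML parser.

```python
import re

# Group 1 captures the tag name; \1 requires the same name in the closing tag.
TAG_PAIR = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>")

def paired_tags(html: str) -> list[str]:
    """Return each '<TAG ...>...</TAG>' span whose open and close names agree."""
    return [m.group(0) for m in TAG_PAIR.finditer(html)]

print(paired_tags("Some <B>bold</B> and <I>italic</I> text"))
# ['<B>bold</B>', '<I>italic</I>']
print(paired_tags("Broken <B>bold</I> markup"))
# []
```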
Applying this to your case:
/^(.{3}).*\1$/
(Yes, that's the regex that Brian Carper posted. There just aren't that many ways to do this.)
A detailed explanation for posterity's sake (please don't be insulted if it's beneath you):
^ matches the start of the line.
(.{3}) grabs three characters of any type and saves them in a group for later reference.
.* matches anything for as long as possible. (You don't care what's in the middle of the line.)
\1 matches the group that was captured in step 2.
$ matches the end of the line.
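The five steps above can be checked against the examples from the question with a short Python sketch. It assumes the `re` module, which uses the same \1 backreference syntax; `same_ends` is an invented helper name.

```python
import re

# Capture the first three characters, allow anything in between,
# then require the captured text again right before the end of the line.
SAME_ENDS = re.compile(r"^(.{3}).*\1$")

def same_ends(line: str) -> bool:
    """True if the line begins and ends with the same three characters."""
    return SAME_ENDS.search(line) is not None

for candidate in ["abcabc", "xyz abc xyz", "abc123", "ababa", "a"]:
    print(candidate, same_ends(candidate))
```

With this pattern the two "undefined" cases, ababa and a, do not match, because the captured group and the backreference cannot overlap.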
For the same characters at the beginning and end:
/^(.{3}).*\1$/
This is a backreference.
This works:
my $test = 'abcabc';
print $test =~ m/^([a-z]{3}).*(\1)$/;
For matching the beginning and the end you should add ^ and $ anchors.