Basic regex capture - regex

I am trying to use regex to capture a string between two strings. I'm not experienced with regex. I want to capture the town name in the following string:
On April 6, 2016, the Town of Woodstock auctioned
±
14.00
I've tried the most basic capture attempt:
town of (.*) auctioned
that I can think of, but I'm not getting any match at all. What am I doing wrong?

First, regexes are case-sensitive by default. Use the (?i) inline modifier to change that (assuming a regex engine that doesn't expect its modifiers after the regex).
Second, whitespace is treated like any other character. If the text contains two spaces between words, then your regex won't match if it uses only one space.
Lastly, you probably should use a lazy quantifier:
(?i)town\s+of\s+(.*?)\s+auctioned
should work.
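As a quick check of the three points above, here is a minimal sketch using Python's re module (Python is only my choice for illustration; the question does not name an engine). The helper name town_name and the second test string are made up for the example.

```python
import re

# Sample text from the question (first line only).
text = "On April 6, 2016, the Town of Woodstock auctioned"

# (?i) makes the match case-insensitive, \s+ tolerates repeated whitespace,
# and the lazy (.*?) stops at the first "auctioned" that follows.
pattern = re.compile(r"(?i)town\s+of\s+(.*?)\s+auctioned")

def town_name(s):
    """Return the captured town name, or None if the pattern does not match."""
    m = pattern.search(s)
    return m.group(1) if m else None

print(town_name(text))                                   # Woodstock
print(town_name("the town  of  New Paltz   auctioned"))  # New Paltz

# The original attempt fails here because "Town" starts with a capital T.
print(re.search(r"town of (.*) auctioned", text))        # None
```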

If the match should start with town of and end with auctioned, then:
/town\s+of\s+(.*?)\s+auctioned/ig
Here the i flag makes the match case-insensitive in JavaScript.
Capture group 1 contains your town name.

Related

Make regex match only the capturing group

Due to the technology I'm currently working with (PySpark API), I need to adjust a regex so that the full match corresponds to the capturing group.
I want to use it as a delimiter pattern in a split function
This function splits an input string according to the matched substring, not the capturing group.
Hence I need to match the \s+ characters (which I currently only capture).
Here is my current regex: (\s)+(?:\d*\s*)(?=RUE|BOULEVARD|AVENUE)
I tried to extend the positive lookahead to cover the possibility that a \d+\s+ may be present before the street type, and therefore match a different \s. It hasn't worked so far.
The split function's output I wish to obtain is the following:
[7 BOULEVARD LAPIN BLANC,AVENUE MR LIEVRE,18 RUE PIERRE LAPIN]
I don't know PySpark, but I guess it supports these constructs: split on whitespace that is not preceded by a digit but is followed by an optional number and then the type of street.
(?<!\d)\s+(?=(?:\d+\s)?(?:RUE|BOULEVARD|AVENUE))
In the demo I use a substitution with \n to simulate the split.
Demo & explanation
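The same pattern can be tried with Python's re.split as a stand-in for the PySpark split function, which I have not tested. The input string below is an assumption, reconstructed from the desired output in the question.

```python
import re

# Hypothetical input, reconstructed from the desired output in the question.
address = "7 BOULEVARD LAPIN BLANC AVENUE MR LIEVRE 18 RUE PIERRE LAPIN"

# Split on whitespace that is NOT preceded by a digit and IS followed by
# an optional house number plus one of the street types.
delimiter = re.compile(r"(?<!\d)\s+(?=(?:\d+\s)?(?:RUE|BOULEVARD|AVENUE))")

parts = delimiter.split(address)
print(parts)
# ['7 BOULEVARD LAPIN BLANC', 'AVENUE MR LIEVRE', '18 RUE PIERRE LAPIN']
```

Because the lookbehind and lookahead consume nothing, the full match is only the whitespace, which is what a split function needs as its delimiter.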

Regex: Date Exact Match

I have a little problem on a date-time
april 15, 2014
I wrote this regex: \D+[a-z] \d{2}, \d{1,4}
The problem is when I have text before date, for example:
text text april 15, 2014 text
Well, in this case my regex also selects the text, not only the date. So I need to modify my regex a little to match strictly the date, and not the text before or after.
Can anyone help me?
Just use [a-zA-Z]+ to match the month names:
[a-zA-Z]+ \d{2}, \d{1,4}
Also, consider using word boundaries, \b[a-zA-Z]+ \d{2}, \d{1,4}\b, if you need to match these strings inside a larger string, or consider using anchors if you need a full string match: ^[a-zA-Z]+ \d{2}, \d{1,4}$.
Just a note: \d{1,4} matches 1 to 4 digits. If you plan to match 4-digit years, use \d{4}. If you plan to match 2- or 4-digit years, use (\d{4}|\d{2}), but this time you really need either word boundaries or anchors.
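To see the difference on the sample from the question, here is a small sketch in Python (my choice of language, the question does not name one). It uses the word-boundary variant with a 4-digit year; the variable names are made up for the example.

```python
import re

text = "text text april 15, 2014 text"

# The pattern from the question: \D+ swallows the leading words as well.
original = re.search(r"\D+[a-z] \d{2}, \d{1,4}", text).group()
print(original)   # text text april 15, 2014

# Letters only for the month, word boundaries around the date, 4-digit year.
date_re = re.compile(r"\b[a-zA-Z]+ \d{2}, \d{4}\b")
dates = date_re.findall(text)
print(dates)      # ['april 15, 2014']

# Anchored variant: matches only when the whole string is the date.
full_re = re.compile(r"^[a-zA-Z]+ \d{2}, \d{1,4}$")
print(bool(full_re.search("april 15, 2014")))  # True
print(bool(full_re.search(text)))              # False
```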
Be careful in your case, because the spacing in your date may vary, for example with several spaces between the month and the day or after the comma:
april     15,   2014
Then this regex is not safe, because a single extra space breaks the match:
[a-zA-Z]+ \d{2}, \d{1,4}
A regex that tolerates variable whitespace is:
\w+\s+\d+,\s+\d+
If the string contains nothing but the date, you can make it stricter still by anchoring it with ^ and $:
^\w+\s+\d+,\s+\d+$
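A short Python sketch of the whitespace point, with a made-up test string that has extra spaces after the month and after the comma:

```python
import re

# Hypothetical input with irregular spacing.
spaced = "april    15,  2014"

strict = re.compile(r"[a-zA-Z]+ \d{2}, \d{1,4}")   # exactly one space each
tolerant = re.compile(r"\w+\s+\d+,\s+\d+")         # one or more whitespace
anchored = re.compile(r"^\w+\s+\d+,\s+\d+$")       # whole string must be a date

print(strict.search(spaced))              # None
print(tolerant.search(spaced).group())    # april    15,  2014

# The anchored form rejects a date embedded in other text.
print(anchored.search("text text april 15, 2014 text"))  # None
```

Note that \w also matches digits and underscores, so the tolerant pattern is looser about the month than [a-zA-Z]+.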

Regex Greediness

I have a Perl regex that I'm fairly certain should work, but it is being too greedy:
regex:
(?:.*serial[^\d]+?(\d+).*)
Test string:
APPLICATIONSERIALNO123456Plnsn123456te20140728tdrnserialnun12hou
Desired group 1 match:
123456
Actual group 1 Match:
12
I've tried every permutation of lookahead and behind and laziness and I can't get the damn thing to work.
WHAT AM I MISSING.
Thanks!
The Problem is Not Greediness, but Case-Sensitivity
Currently your regex matches the 12 at the end of serialnun12, probably because it is case-sensitive. We have two options: using upper-case, or making the pattern case-insensitive.
Option 1: Use Upper-Case
If you only want 123456, you can use:
SERIALNO\K\d+
The \K tells the engine to drop what was matched so far from the final match it returns.
If you want to match the whole string and capture 123456 to Group 1, use:
.*?SERIAL\D+(\d+).*
Option 2: Turning Case-Insensitivity On using (?i) inline or the i flag
To only match 123456, you can use:
(?i)serial\D+\K\d+
Note that if you use the g flag, this would match both numbers.
If you want to match the whole string and capture 123456 to Group 1, use:
(?i).*?serial\D+(\d+).*
A few tips
You can turn on case-insensitivity either with the (?i) inline modifier or with the i flag at the end of the pattern: /serial\D+\K\d+/i
Instead of [^\d], use \D
There is no need for a lazy quantifier in something like \D+\d+ because the two tokens are mutually exclusive: there is no danger that the \D will run over the \d
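The behaviour can be reproduced outside Perl. The sketch below uses Python's re module, which does not support \K, so it relies on a capture group instead; the helper name first_serial is made up for the example.

```python
import re

s = "APPLICATIONSERIALNO123456Plnsn123456te20140728tdrnserialnun12hou"

def first_serial(text, flags=0):
    """Return the digits after the first 'serial', or None if nothing matches."""
    m = re.search(r"serial\D+(\d+)", text, flags)
    return m.group(1) if m else None

print(first_serial(s))                 # 12      (case-sensitive: only 'serialnun12')
print(first_serial(s, re.IGNORECASE))  # 123456  (also sees 'SERIALNO123456')

# Matching every occurrence, like the g flag: both numbers are returned.
all_serials = re.findall(r"serial\D+(\d+)", s, re.IGNORECASE)
print(all_serials)                     # ['123456', '12']

# Whole-string variants: the lazy .*? stops at the first 'serial',
# a greedy .* runs on to the last one.
lazy = re.match(r"(?i).*?serial\D+(\d+).*", s).group(1)
greedy = re.match(r"(?i).*serial\D+(\d+).*", s).group(1)
print(lazy, greedy)                    # 123456 12
```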
The problem is not greediness; it's case-sensitivity.
Currently your regex matches the 12 at the end of serialnun12 because those are the only digits following serial. The ones you want follow SERIAL. S and s are different characters.
There are two solutions.
Use the uppercase characters in the pattern.
my ($serial) = $string =~ /SERIAL\D*(\d+)/;
Use case-insensitive matching.
my ($serial) = $string =~ /serial\D*(\d+)/i;
There's probably no need for this, but I thought I'd mention it just in case.

How to use regex to grab the eighth word

New to Regex
Examples I've seen show searching for very specific exceptions, i.e. specific letter combos.
What I want is to grab the 8th word no matter what comes before, no matter what those words are.
So the spaces are what designate 'words'.
Sample line would be
Sep 20 11:13:18 10.50.3.100 Sep 20 11:13:15 DC1ASM1.dcl.greendotcorp.com Blah Blah Blah
I want to extract the host name, in this case "DC1ASM1.dcl.greendotcorp.com", which is always preceded by "Month, Day, Timestamp, IP, Month, Day, Timestamp" pattern.
Thanks
Rex
I'm not 100% sure what version or flavor of regex you're using, so I'll avoid the look-behind and use a non-capturing group instead:
^(?:\S+?\s){7}(\S+)
That anchors to the beginning of the line and skips 7 consecutive repetitions of [any character but whitespace, 1+ times] followed by [one single whitespace character]. The eighth word is then captured in group 1.
You can be more specific about "words" by using \w instead of \S if you so choose, though.
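Here is the expression applied to the sample line from the question, as a minimal Python sketch (the flavor is my assumption; the helper name eighth_word is made up).

```python
import re

line = ("Sep 20 11:13:18 10.50.3.100 Sep 20 11:13:15 "
        "DC1ASM1.dcl.greendotcorp.com Blah Blah Blah")

def eighth_word(s):
    """Return the eighth whitespace-separated word, or None if there is none."""
    # Skip seven "word + one whitespace character" units, then capture the next word.
    m = re.search(r"^(?:\S+?\s){7}(\S+)", s)
    return m.group(1) if m else None

print(eighth_word(line))             # DC1ASM1.dcl.greendotcorp.com
print(eighth_word("too few words"))  # None
```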
This expression will capture the host name in the named group HostName. It assumes there are always only single spaces.
^([^ ]+ ){7}(?<HostName>[^ ]+)
To handle multiple spaces, use the following expression.
^([^ ]+ +){7}(?<HostName>[^ ]+)
To also support tabs use the following expression.
^([^ \t]+[ \t]+){7}(?<HostName>[^ \t]+)
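If you happen to be using Python, note that its re module spells named groups as (?P<name>...) rather than (?<name>...). A sketch of the tab-tolerant expression under that assumption follows; the test line is made up, and the outer group is written as non-capturing since only the host name is needed.

```python
import re

# Python syntax for the named group; otherwise the same expression as above.
host_re = re.compile(r"^(?:[^ \t]+[ \t]+){7}(?P<HostName>[^ \t]+)")

# Hypothetical line mixing double spaces and a tab between the fields.
messy = "Sep  20\t11:13:18 10.50.3.100 Sep 20  11:13:15 host.example.com rest"

m = host_re.search(messy)
host = m.group("HostName") if m else None
print(host)  # host.example.com
```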
Try something like:
regex = "([^\s]+\s+){7}(?<eighthword>[^\s]+)"

Matching Conditions in Regex

Just a note upfront: I'm a bit of a regex newbie. Perhaps a good answer to this question would involve linking me to a resource that explains how these sorts of conditions work :)
Let's say that I have a street name, like 23rd St or 5th St. I'd like to get rid of the trailing "th", "rd", "nd", and "st". How can this be done?
Right now I have the expression: (st|nd|rd|th) . The problem with this is that it will also match street names that contain a "st", "nd", "rd", or "th". So what I really need is a conditional match that looks for a minimum of one number before itself (i.e., 1st and not street).
Thank you!
It sounds like you just want to match the ordinal suffix (st|nd|rd|th), yes?
If your regex engine supports it, you could use a lookbehind assertion.
/(?<=\d)(st|nd|rd|th)/
That matches (st|nd|rd|th) only if it is preceded by a digit \d, but the match does not include the digit itself.
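A small sketch of the lookbehind approach in Python (the question seems to be about another language, so treat this as an illustration only; the function name is made up). Since the digit is not part of the match, replacing the match with an empty string is enough.

```python
import re

def strip_ordinal_suffix(street):
    """Remove st/nd/rd/th when it directly follows a digit."""
    return re.sub(r"(?<=\d)(st|nd|rd|th)", "", street)

print(strip_ordinal_suffix("23rd St"))       # 23 St
print(strip_ordinal_suffix("1st street"))    # 1 street
print(strip_ordinal_suffix("Fifth Street"))  # Fifth Street
```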
What you really want are anchors.
Try and replace globally:
\b(\d+)(?:st|nd|rd|th)\b
with the first group.
Explanation:
\b --> matches a position where either a word character (digit, letter, underscore) is followed by a non-word character (anything else), or the reverse;
(\d+) --> matches one or more digits, and captures them in the first group ($1);
(?:st|nd|rd|th) --> matches any of st, etc... without capturing it ((?:...) is a non-capturing group);
\b --> see above.
Demonstration using perl:
$ perl -pe 's/\b(\d+)(?:st|nd|rd|th)\b/$1/g' <<EOF
> Mark, 23rd street, New Hampshire
> I live on the 7th avenue
> No match here...
> azoiu32rdzeriuoiu
> EOF
Mark, 23 street, New Hampshire
I live on the 7 avenue
No match here...
azoiu32rdzeriuoiu
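For comparison, roughly the same demonstration in Python, where the replacement refers to the first group as \1 instead of $1. This is only a sketch of the equivalent call, using the same four input lines.

```python
import re

ordinal = re.compile(r"\b(\d+)(?:st|nd|rd|th)\b")

lines = [
    "Mark, 23rd street, New Hampshire",
    "I live on the 7th avenue",
    "No match here...",
    "azoiu32rdzeriuoiu",
]

# Replace each match with the captured number only.
cleaned = [ordinal.sub(r"\1", line) for line in lines]
for line in cleaned:
    print(line)
```

The last line is left alone because there is no word boundary before 32 or after rd.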
Try using this regex:
(\d+)(?:st|nd|rd|th)
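I don't know Ruby. In PHP I would use something like:
preg_replace('/(\d+)(?:st|nd|rd|th)\b/', '$1', 'South 2nd Street');
to remove the suffix. (A trailing space in the pattern would be consumed by the match and dropped from the result, so a word boundary is used instead.)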
To remove the ordinal:
/(\d+)(?:st|nd|rd|th)\b/$1/
You must capture the number so you can replace the match with it. You can capture the ordinal or not, it doesn't matter unless you want to output it somewhere else.
http://www.regular-expressions.info/javascriptexample.html