Find a string after two other strings with something between them - regex

Let's go with an example:
"Blablabla. My name is John and I'm 21 years old. Blablabla"
Other example:
"Blablabla. My name is John and I'm 21 years old.
- Hi I'm Mary and I'm 22 years old."
Basically, I want to match the age of the first person (here 21, but it could be 23 or anything else). I know the sentence will begin with "My name is $name and I'm 21", but I can't know in advance what $name is.
The rough idea is to select the number after "My name is " + something + " and I'm ".
How would one do that with a regex, given that I can't use capture groups?
What I have so far:
(?<=<My name is )(.*)(?= years old)
Ideally I would like something like that to work:
(?<=<My name is .* and I'm )(.*)(?= years old)
... but it does not! Apparently .* can't be used inside a lookbehind (which makes some sense).
Thank you kindly.

/My name is (\w+) and I'm (\d+) years old./
Now the first matched group is the name, the second matched group is the age.
If for some reason you don't want to use groups, you can match:
/(?<=My name is )\w+(?= and I'm )/
for the name and:
/(?<= and I'm )\d+(?= years old.)/
for the age.
As you have noticed, lookbehinds with variable length are not allowed (at least in the regex engines that I know of, not that it is logically impossible). However, you can use \K as an alternative:
/My name is \w+ and I'm \K\d+(?= years old.)/
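As a rough illustration in Python (an assumption, since the question does not name a language), the group-based and fixed-width lookaround variants above could be tried as follows. Python's built-in re module does not support \K, so that variant is left out; the sample text comes from the question, and the trailing period is escaped here.

```python
import re

# Sample text from the question (two people, only the first age is wanted).
text = ("Blablabla. My name is John and I'm 21 years old.\n"
        "- Hi I'm Mary and I'm 22 years old.")

# With capture groups: group 1 is the name, group 2 is the age.
m = re.search(r"My name is (\w+) and I'm (\d+) years old\.", text)
name, age = m.group(1), m.group(2)

# Without groups: fixed-width lookbehind and lookahead around the digits.
# re.search returns only the first match, so Mary's age is not picked up.
age_only = re.search(r"(?<= and I'm )\d+(?= years old\.)", text).group()

print(name, age, age_only)
```

The lookbehind here has a fixed width, which is why it is accepted; it does not check for "My name is", so it relies on the first person's sentence coming first.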

@ndn's answer is basically correct, but I think it needs a couple of modifications:
The \w+ expression will not match spaces, such as in "My name is Mary Kate and I'm 47 years old."
If I'm interpreting your request correctly, and you need only the age to match, then I don't think the lookbehind and lookahead assertions that you and @ndn have set up are necessary.
I believe this regex will give you what you want:
My name is .+? and I'm (\d+) years old\.
(Note the \. at the end so it will match the literal period, rather than any character.)
See example at https://regex101.com/r/nJ7wS5/1
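A minimal sketch of this pattern in Python (the language is an assumption), checking that a name containing a space is still handled:

```python
import re

# Lazy .+? accepts names with spaces; the age is captured in group 1.
pattern = re.compile(r"My name is .+? and I'm (\d+) years old\.")

samples = [
    "My name is Mary Kate and I'm 47 years old.",
    "Blablabla. My name is John and I'm 21 years old. Blablabla",
]

ages = [pattern.search(s).group(1) for s in samples]
print(ages)
```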

Related

Match first and then all equal occurrences with regex

Lets say we have the string:
one day, when Anne, Lisa and Paul went to the store, then Anne said to Paul: "I love Lisa!". Then Lisa laughed and kissed Anne.
Is there a way with regex to match the first name, and then match all other occurrences of the same name in the string?
Given the name-matching regex /[A-Z][a-z]+/ (with /g maybe?), can the regex matcher be made to remember the first match, and then use that match EXACTLY for the rest of the string? Other subsequent matches to the name-matching regex should be ignored (except for Anne in the example).
The result would be (if matches are replaced with "Foo"):
one day, when Foo, Lisa and Paul went to the store, then Foo said to Paul: "I love Lisa!". Then Lisa laughed and kissed Foo.
Please ignore the fact that the sentence starts uncapitalized, or add an example that also handles this.
Using a script to get the first match and then using that as input for a second iteration works of course, but that's outside the scope of the question (which is limited to ONE regex expression).
The only way I could think of is with non-fixed-width lookbehinds, for example through PyPI's regex module, and maybe JavaScript too. Either way, assuming a name is captured through [A-Z][a-z]+ as per your question, try:
\b([A-Z][a-z]+)\b(?<=^[^A-Z]*\b\1\b.*)
See an online demo
\b([A-Z][a-z]+)\b - A 1st capture group capturing a name between two word-boundaries;
(?<=^[^A-Z]*\b\1\b.*) - A non-fixed width positive lookbehind to match start of line anchor followed by 0+ characters other than uppercase followed by the content of the 1st capture group and 0+ characters.
Here is an example using PyPI's regex module:
import regex as re
s= 'Anne, Lisa and Paul went to the store, then Anne said to Paul: "I love Lisa!". Then Lisa laughed and kissed Anne.'
s_new = re.sub(r'\b([A-Z][a-z]+)\b(?<=^[^A-Z]*\b\1\b.*)', 'Foo', s)
print(s_new)
Prints:
Foo, Lisa and Paul went to the store, then Foo said to Paul: "I love Lisa!". Then Lisa laughed and kissed Foo.

REGEX matching (1|2) while NOT containing (3|4) [duplicate]

I know I asked a similar question earlier today, but seeing how easily it appeared to be solved has made me think that something a little more complex might be achievable! I have strings with "regions" in brackets. I want to match all strings with, for example, Japan or Brazil, BUT not if they contain, for example, USA, Europe or UK. Because they are all preceded by either ( or a space and followed by either , or ), it makes it tricky!
At the moment I'm having to do two separate matches: one to match Japan or Brazil, and then another to make sure I'm not matching USA, Europe or UK.
Inputs:
MatchMe! (Japan)
MatchMe! (Japan, Brazil)
MatchMe! (Brazil, Japan)
MatchMe! (Other, Japan, Other)
Don'tMatchMe! (Japan, USA)
Don'tMatchMe! (USA, Japan)
Don'tMatchMe! (Brazil, USA, Japan)
Don'tMatchMe! (USA)
Regex I'm using now:
\(.*?, Japan\)|\(Japan, .*?\)|\(.*?, Japan, .*?\)|\(.*?Japan.*?\)
Demo:
https://regex101.com/r/h17uZ2/1
Combine positive lookahead for non-parentheses characters followed by Japan|Brazil with a negative lookahead for the same, but with USA|Europe|UK instead:
\((?=[^)]*(Japan|Brazil))(?![^)]*(USA|Europe|UK))[^)]+\)
https://regex101.com/r/h17uZ2/2
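A short Python sketch (the language is an assumption) that runs the combined-lookahead pattern against the inputs listed in the question:

```python
import re

# Positive lookahead: Japan or Brazil appears before the closing parenthesis.
# Negative lookahead: USA, Europe or UK does not appear before it.
pattern = re.compile(
    r"\((?=[^)]*(Japan|Brazil))(?![^)]*(USA|Europe|UK))[^)]+\)"
)

inputs = [
    "MatchMe! (Japan)",
    "MatchMe! (Japan, Brazil)",
    "MatchMe! (Brazil, Japan)",
    "MatchMe! (Other, Japan, Other)",
    "Don'tMatchMe! (Japan, USA)",
    "Don'tMatchMe! (USA, Japan)",
    "Don'tMatchMe! (Brazil, USA, Japan)",
    "Don'tMatchMe! (USA)",
]

matched = [s for s in inputs if pattern.search(s)]
print(matched)
```

Both lookaheads are evaluated at the same position, right after the opening parenthesis, so the order of the regions inside the brackets does not matter.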

Capture the latest in backreference

I have this regex
(\b(\S+\s+){1,10})\1.*MY
and I want group 1 to capture "The name" from
The name is is The name MY
I get "is" for now.
The name can be any random words of any length.
It need not be at the beginning.
It need not be only 2 or 3 words. It can be less than 10 words.
Only thing sure is that it will be the last set of repeating words.
Examples:
The name is Anthony is is The name is Anthony - "The name is Anthony".
India is my country All Indians are India is my country - "India is my country "
Times of India Alphabet Google is the company Alphabet Google canteen - "Alphabet Google"
You could try:
(\b\w+[\w\s]+\b)(?:.*?\b\1)
As demonstrated here
Explanation -
(\b\w+[\w\s]+\b) is the capture group 1 - which is the text that is repeated - separated by word boundaries.
(?:.*?\b\1) is a non-capturing group which tells the regex system to match the text in group 1, only if it is followed by zero-or-more characters, a word-boundary, and the repeated text.
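To make the explanation concrete, here is a hedged Python sketch of that pattern on two of the question's examples. The helper name repeated_phrase is hypothetical, and stripping the trailing whitespace from group 1 is an added convenience, not part of the regex.

```python
import re

pattern = re.compile(r"(\b\w+[\w\s]+\b)(?:.*?\b\1)")

def repeated_phrase(text):
    """Return group 1 with surrounding whitespace stripped, or None if nothing repeats."""
    m = pattern.search(text)
    return m.group(1).strip() if m else None

first = repeated_phrase("The name is is The name MY")
second = repeated_phrase(
    "Times of India Alphabet Google is the company Alphabet Google canteen"
)
print(first)
print(second)
```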
Regex generally captures the longest leftmost match. There are no examples in your question where this would not actually be the string you want, but that could just mean you have not found good examples to show us.
With that out of the way,
((\S+\s)+)(\S+\s){0,9}\1
would appear to match your requirements as currently stated. The "longest leftmost" behavior could still get in the way if there are e.g. straddling repetitions, like
this that more words this that more words
where in the general case regex alone cannot easily be made to always prefer the last possible match and tolerate arbitrary amounts of text after it.

Convert MS Outlook formatted email addresses to names of attendees using RegEx

I'm trying to use Notepad++'s regex find and replace to extract names from MS Outlook formatted meeting attendee details.
I copied and pasted the attendee details and got names like:
Fred Jones <Fred.Jones@example.org.au>; Bob Smith <Bob.Smith@example.org.au>; Jill Hartmann <Jill.Hartmann@example.org.au>;
I'm trying to wind up with
Fred Jones; Bob Smith; Jill Hartmann;
I've tried a number of permutations of
\B<.*>; \B
on Regex 101.
Regex is greedy: <.*> matches from the first < to the last > in one fell swoop. You want to say "any character which is neither of these" instead of just "any character".
 *<[^<>]*>
The single space and asterisk before the main expression (the pattern starts with a literal space) consume any spaces before the match. Replace these matches with nothing and you will be left with just the names, like in your example.
This is a very common FAQ.
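The same replacement can be sketched outside Notepad++, for example in Python (an assumption; the question only concerns Notepad++). The sample string is modeled on the one in the question:

```python
import re

attendees = (
    "Fred Jones <Fred.Jones@example.org.au>; "
    "Bob Smith <Bob.Smith@example.org.au>; "
    "Jill Hartmann <Jill.Hartmann@example.org.au>;"
)

# Optional spaces, then "<", any run of characters that are neither "<" nor ">", then ">".
names = re.sub(r" *<[^<>]*>", "", attendees)
print(names)
```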

Regular Expression Match to test for a valid year

Given a value, I want to check whether it is a valid year. My criterion is simple: the value should be an integer with 4 digits. I know this is not the best solution, as it will not allow years before 1000 and will allow years such as 5000. This criterion is adequate for my current scenario.
What I came up with is
\d{4}$
While this works it also allows negative values.
How do I ensure that only positive integers are allowed?
Years from 1000 to 2999
^[12][0-9]{3}$
For 1900-2099
^(19|20)\d{2}$
You need to add a start anchor ^ as:
^\d{4}$
Your regex \d{4}$ will match strings that end with 4 digits. So input like -1234 will be accepted.
By adding the start anchor you match only those strings that begin and end with 4 digits, which effectively means they must contain only 4 digits.
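A quick check of the difference the start anchor makes, sketched in Python (the language and the candidate values are assumptions for illustration):

```python
import re

unanchored = re.compile(r"\d{4}$")   # only requires the string to END with 4 digits
anchored = re.compile(r"^\d{4}$")    # requires the whole string to be 4 digits

candidates = ["1234", "-1234", "12345", "123"]

unanchored_hits = [c for c in candidates if unanchored.search(c)]
anchored_hits = [c for c in candidates if anchored.search(c)]

print(unanchored_hits)
print(anchored_hits)
```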
The "accepted" answer to this question is both incorrect and myopic.
It is incorrect in that it will match strings like 0001, which is not a valid year.
It is myopic in that it will not match any values above 9999. Have we already forgotten the lessons of Y2K? Instead, use the regular expression:
^[1-9]\d{3,}$
If you need to match years in the past, in addition to years in the future, you could use this regular expression to match any positive integer:
^[1-9]\d*$
Even if you don't expect dates from the past, you may want to use this regular expression anyway, just in case someone invents a time machine and wants to take your software back with them.
Note: This regular expression will match all years, including those before the year 1, since they are typically represented with a BC designation instead of a negative integer. Of course, this convention could change over the next few millennia, so your best option is to match any integer—positive or negative—with the following regular expression:
^-?[1-9]\d*$
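For comparison, a small Python sketch (language and sample values are assumptions) showing what the first and last patterns in this answer accept:

```python
import re

four_or_more = re.compile(r"^[1-9]\d{3,}$")  # at least 4 digits, no leading zero
any_integer = re.compile(r"^-?[1-9]\d*$")    # any non-zero integer, optional minus sign

candidates = ["0001", "1999", "10000", "989", "-44"]

four_or_more_hits = [c for c in candidates if four_or_more.search(c)]
any_integer_hits = [c for c in candidates if any_integer.search(c)]

print(four_or_more_hits)
print(any_integer_hits)
```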
This works for 1900 to 2099:
/(?:(?:19|20)[0-9]{2})/
Building on @r92's answer, for years 1970-2019:
(19[789]\d|20[01]\d)
To test a year in a string which contains other words along with the year you can use the following regex: \b\d{4}\b
In theory the 4 digit option is right. But in practice it might be better to have 1900-2099 range.
Additionally, it needs to be a non-capturing group. Many comments and answers propose a capturing group, which is not proper IMHO: it might work for matching, but when extracting matches it will also extract the two-digit numbers (19 and 20) alongside the 4-digit numbers because of the parentheses.
This will work for exact matching using non-capturing groups:
(?:19|20)\d{2}
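The extraction difference can be seen with Python's re.findall (an assumed choice of language and sample text), which returns the group contents when the pattern has a capturing group and the whole match otherwise:

```python
import re

text = "Released in 1999, remastered in 2020."

# Capturing group: findall returns only what the group captured.
with_capture = re.findall(r"(19|20)\d{2}", text)

# Non-capturing group: findall returns the full 4-digit matches.
without_capture = re.findall(r"(?:19|20)\d{2}", text)

print(with_capture)
print(without_capture)
```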
Use:
^(19|[2-9][0-9])\d{2}$
for years 1900 - 9999.
No need to worry for 9999 and onwards - A.I. will be doing all programming by then !!! Hehehehe
You can test your regex at https://regex101.com/
Also, more info about non-capturing groups (mentioned in one of the comments above) here: http://www.manifold.net/doc/radian/why_do_non-capture_groups_exist_.htm
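You can go with something like (?<!-)\d{4}$, which uses a negative lookbehind to reject a minus sign directly before the 4 digits. (A character class such as [^-]\d{4}$ also excludes the minus sign, but it requires some character before the digits, so a bare 2020 would not match.)
You can also use ^\d{4}$, with ^ anchoring the match to the beginning of the string. It depends on your scenario, actually...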
/^\d{4}$/
This will check if a string consists of exactly 4 digits. In this scenario, to input the year 989, you can give 0989 instead.
You could convert your integer into a string. As the minus sign will not match the digits, you will have no negative years.
I use this regex in Java ^(0[1-9]|1[012])[/](0[1-9]|[12][0-9]|3[01])[/](19|[2-9][0-9])[0-9]{2}$
Works from 1900 to 9999
If you need to match YYYY or YYYYMMDD you can use:
^((?:(?:(?:(?:(?:[1-9]\d)(?:0[48]|[2468][048]|[13579][26])|(?:(?:[2468][048]|[13579][26])00))(?:0?2(?:29)))|(?:(?:[1-9]\d{3})(?:(?:(?:0?[13578]|1[02])(?:31))|(?:(?:0?[13-9]|1[0-2])(?:29|30))|(?:(?:0?[1-9])|(?:1[0-2]))(?:0?[1-9]|1\d|2[0-8])))))|(?:19|20)\d{2})$
You can also use this one, which matches DD/MM/YYYY dates with years from 1970 to 2019:
(0[1-9]|[12][0-9]|3[01])\/(0[1-9]|1[0-2])\/(19[789]\d|20[01]\d)
In my case I wanted to match a string which ends with a year (4 digits) like this for example:
Oct 2020
Nov 2020
Dec 2020
Jan 2021
It'll return true with this one:
var sheetName = 'Jan 2021';
var yearRegex = /\b\d{4}$/; // regex literal; inside a string, "\b" and "\d" would need doubled backslashes
var isMonthSheet = yearRegex.test(sheetName);
Logger.log('isMonthSheet = ' + isMonthSheet);
The code above is used in Apps Script.
Here's the link to test the Regex above: https://regex101.com/r/SzYQLN/1
You can try the following to capture valid year from a string:
.*(19\d{2}|20\d{2}).*
Works from 1950 to 2099, where the value is an integer with 4 digits:
^(19[5-9]\d|20\d{2})$