Regex to get the last match of a pattern - regex

Here is a string similar to what I'm trying to match (with the exception of a couple of specific patterns, for the sake of simplicity).
Hello, tonight I'm in the town of Trenton in New Jersey and I will be staying in Hotel HomeStay [123] and I have no money.
I'm trying to match only the last in Hotel HomeStay [123].
I'm not very familiar with regex concepts like lookahead and lookbehind right now. Similar questions here don't seem to solve my issue. I've tried a bunch of regex (to the best of my understanding) and this is what I came up with (?= (?:in|\d+))([\w \[]*\s*\d*\]*)(?!.*in). The digits and special characters may be part of what I'm actually trying to match.
The lookahead and lookbehind patterns are not restricted to containing only in. They can have more common words as well such as and and is. I'm only looking for the last occurrence of any of these, followed by the main pattern, which is quite distinctive -- edit let's say the match should necessarily contain either HomeStay or LuxuryInn, for the sake of the example.
However, this matches the whole of in the town of Trenton in New Jersey and I will be staying in Hotel HomeStay [123].
Where am I going wrong? Also, could someone explain why the in is captured despite being placed in a non-capturing group?
Any help is greatly appreciated.

If you want to retrieve a text containing HomeStay prefixed by certain words and not containing those words, you can use a capture group using negative look-ahead inside. The regex below captures all occurrences (working fiddle).
\b(?:in|and|is)\s+((?:.(?!\b(?:in|and|is)\b))*HomeStay(?:.(?!\b(?:in|and|is)\b))*)
Here, the regexp looks for :
a given prefix (in, and or is as a whole word, surrounded by word breakers \b)
... followed by at least one blank character,
... then a sequence of 0 or more characters each one not followed by a prefix,
... followed by HomeStay,
... followed by another sequence of 0 or more characters, each one still not followed by a prefix
If you just want the last occurrence, you can add another negative look-ahead after (fiddle).
\b(?:in|and|is)\s+((?:.(?!\b(?:in|and|is)\b))*HomeStay(?:.(?!\b(?:in|and|is)\b))*)(?!.*HomeStay.*)
Same as above, except the matched text must not be followed by a text containing HomeStay.
Finally, if the matching text has to contain at least a word from a list, just replace both occurrences of HomeStay with a list of alternatives. Example for HomeStay and Luxury: (?:HomeStay|Luxury) (fiddle).
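As a rough Python sketch of the last-occurrence pattern above: the names PREFIX, KEYWORD, STEP, LAST_PLACE and last_place are hypothetical, and the sketch assumes Python's re module handles these lookaheads the same way as the engine used in the fiddle.

```python
import re

# Hypothetical names, for illustration only.
PREFIX = r"\b(?:in|and|is)\b"        # words that may introduce the match
KEYWORD = r"(?:HomeStay|LuxuryInn)"  # the match must contain one of these
STEP = rf"(?:.(?!{PREFIX}))*"        # characters, none of them followed by a prefix word

LAST_PLACE = re.compile(
    rf"\b(?:in|and|is)\s+({STEP}{KEYWORD}{STEP})(?!.*{KEYWORD})"
)

def last_place(text):
    """Return the last prefixed chunk that contains a keyword, or None."""
    m = LAST_PLACE.search(text)
    return m.group(1) if m else None

sentence = ("Hello, tonight I'm in the town of Trenton in New Jersey and I will be "
            "staying in Hotel HomeStay [123] and I have no money.")
print(last_place(sentence))  # Hotel HomeStay [123]
```

The trailing (?!.*KEYWORD) is what rejects earlier occurrences: any candidate that still has a keyword somewhere to its right fails, so only the final one survives.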

In java:
import java.util.regex.Pattern;

String s = "Hello, tonight I'm in the town of Trenton in New Jersey and I will be "
+ "staying in Hotel HomeStay [123] and I have no money.";
// Greedy .* consumes as much as possible, so group 1 is the word after the LAST " in "
Pattern p = Pattern.compile("^.*\\sin\\s+(\\S+).*$", Pattern.DOTALL);
// Yields group 1 if the pattern matches; otherwise s is returned unchanged
String last = p.matcher(s).replaceFirst("$1");
This will find the last "... in ...", because the greedy .* (as opposed to the reluctant .*?) matches the longest possible sequence.
The result above will be Hotel (the non-space characters after in), but it may be anything.
DOTALL makes . match line-break characters as well.
The pattern runs from the beginning ^ to the end $.
Any characters .* (as many as possible), followed by a whitespace char \s.
Then "in" and whitespace \s+, then a word (non-spaces \S+) in group 1 (...).
Then any chars till the end .*. For purity it could have been .*? for the shortest sequence, although with the $ anchor both match the same text.
The end $.
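For comparison, here is a minimal Python sketch of the same greedy idea, next to a "collect every hit and keep the final one" alternative. The variable names are made up for illustration, and the second approach is not part of the answer above.

```python
import re

s = ("Hello, tonight I'm in the town of Trenton in New Jersey and I will be "
     "staying in Hotel HomeStay [123] and I have no money.")

# Greedy .* first runs to the end of the text, then backtracks to the last " in ".
m = re.match(r"^.*\sin\s+(\S+).*$", s, re.DOTALL)
last_greedy = m.group(1) if m else None

# Alternative: list every word that follows "in" and keep the final one.
hits = re.findall(r"\bin\s+(\S+)", s)
last_listed = hits[-1] if hits else None

print(last_greedy, last_listed)
```

Both approaches agree on this sentence. The findall variant is easier to read, while the greedy variant stays a single pattern.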

Related

How do I match name, name here in Regex?

I've tried everything I can think of to get this to work, and after hours of trying, I've got to ask for help.
This is the string I'm scanning.
F:\Downloads\Downloads\500 Comics CCC English\Jack, Byrd - Art #01.cbr
This is the current Regex I'm using to try to match what I want.
(?i)English\\(?<Writer>.*(?= ))(?-i)
It matches English\Jack, Byrd - Art
All I want it to match is Jack, Byrd (with no space after it.)
For some reason, the only space I can get it to match is the space after Art.
No matter what I try, it will only match that space. It's like it doesn't consider the other spaces to be spaces.
The (?i)English\\(?<Writer>.*(?= ))(?-i) pattern contains .* that grabs as many chars as it can and then backtracking results in the English\Jack, Byrd - Art match because the space after Art is the last space (the space is required due to the positive lookahead (?= )).
There are several ways to fix it, depending on the contexts you have. If there is always a space-hyphen-space after the necessary value, add the hyphen and the second space to the lookahead:
(?i)English\\(?<Writer>.*(?= - ))(?-i)
If the value is always a string of non-whitespace chars, comma, space and again a string of non-whitespace chars use
(?i)English\\(?<Writer>\S+,\s*\S+)(?-i)
where \S+ matches one or more non-whitespace chars.
For those who wonder what a (?<Writer>...) is, the construct is called a named capturing group and is referred to as \k<Writer> when used inside the same regex pattern or as ${Writer} when used inside a string replacement argument.
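The three patterns can be compared side by side in a short Python sketch. Python's re module spells a named group (?P<Writer>...) rather than (?<Writer>...), and this sketch passes re.IGNORECASE instead of the inline (?i)...(?-i) toggles. The variable names are hypothetical.

```python
import re

path = r"F:\Downloads\Downloads\500 Comics CCC English\Jack, Byrd - Art #01.cbr"

# Original: greedy .* backtracks only as far as the LAST space in the path.
greedy = re.search(r"English\\(?P<Writer>.*(?= ))", path, re.IGNORECASE)
# Fix 1: require " - " right after the value.
dash = re.search(r"English\\(?P<Writer>.*(?= - ))", path, re.IGNORECASE)
# Fix 2: describe the shape of the value itself.
shape = re.search(r"English\\(?P<Writer>\S+,\s*\S+)", path, re.IGNORECASE)

print(greedy.group("Writer"))  # Jack, Byrd - Art
print(dash.group("Writer"))    # Jack, Byrd
print(shape.group("Writer"))   # Jack, Byrd
```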

RegEx more than multiple characters before number

I really don't use RegEx that much. You could say I am a RegEx n00b. I have been working on this issue for half a day.
I am trying to write a pattern that looks backward from a number character. For example:
1. bob1 => bob
2. cat3 => cat
3. Mary34 => Mary
So far I have this (?![A-Z][a-z]{1,})([A-Za-z_])
It only matches for individual characters, I want all the characters before the number character. I tried to add the ^ and $ into my pattern and using an online simulator. I am unsure where to put the ^ and $.
NOTE: I am using RegEx for the .NET Framework
You may use a regex like
[\p{L}_]+(?=\d)
or
[\w-[\d]]+(?=\d)
See the regex demo
Pattern details
[\p{L}_]+ - any 1 or more letters (both lower- and uppercase) and/or _
OR
[\w-[\d]]+ - 1 or more word chars except digits (the -[] inside a character class is a character class subtraction construct)
(?=\d) - a positive lookahead that requires a digit to appear immediately to the right of the current location
If we break down your RegEx, we see:
(?![A-Z][a-z]{1,}) which says "assert that what follows is NOT one uppercase letter followed by one or more lowercase letters" and ([A-Za-z_]) which says "match one letter or underscore". This ends up matching letters one at a time (skipping only an uppercase letter that is directly followed by a lowercase one), never the whole run before the digit.
If I understand what you want to achieve, then you want all of the letters before a number. I would write something like that as:
\b([a-zA-Z]+)[0-9]
This will start at a word boundary \b, match one or more letters, and require a digit right after the matched string.
(The syntax I used seems to match this document about .NET RegEx: https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expressions)
In light of Wiktor Stribizew's comment, here is a pure match RegEx:
\b[a-zA-Z_]+(?=[0-9])
This matches the pattern and then looks ahead for the digit. This is better than my first lookahead attempt. (Thank you Wiktor.)
http://www.rexegg.com/regex-lookarounds.html
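The question is about .NET, but the lookahead idea carries over to other engines. As a hedged Python sketch: Python's built-in re module has neither \p{L} nor character class subtraction, so [^\W\d] is used here as a stand-in for "word character that is not a digit". The constant names are made up.

```python
import re

# Stand-in for .NET's [\w-[\d]]: a word character that is not a digit.
LETTERS_BEFORE_DIGIT = re.compile(r"[^\W\d]+(?=\d)")
# The explicit ASCII version from the answer above.
EXPLICIT = re.compile(r"\b[a-zA-Z_]+(?=[0-9])")

samples = ["bob1", "cat3", "Mary34"]
prefixes = [LETTERS_BEFORE_DIGIT.search(s).group() for s in samples]
explicit = [EXPLICIT.search(s).group() for s in samples]

print(prefixes)  # ['bob', 'cat', 'Mary']
print(explicit)  # ['bob', 'cat', 'Mary']
```

Because the digit sits inside a lookahead, it is checked but never becomes part of the match, so no capture group is needed.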

Name validation - Adding a check to this regex to stop entering just identical characters

I'm trying to add another feature to a regex which is trying to validate names (first or last).
At the moment it looks like this:
/^(?!^mr$|^mrs$|^ms$|^miss$|^dr$|^mr-mrs$)([a-z][a-z'-]{1,})$/i
https://regex101.com/r/pQ1tP2/1
The idea is to do the following
Don't allow just adding a title like Mr, Mrs etc
Ensure the first character is a letter
Ensure subsequent characters are either letters, hyphens or apostrophes
Minimum of two characters
I have managed to get this far (shockingly I find regex so confusing lol).
It matches things like O'Brian or Anne-Marie etc and is doing a pretty good job.
My next additions I've struggled with though! trying to add additional features to the regex to not match on the following:
Just entering the same characters i.e. aaa bbbbb etc
Thanks :)
I'd add another negative lookahead alternative matching against ^(.)\1*$, that is, any character, repeated until the end of the string.
Included as is in your regex, it would make that :
/^(?!^mr$|^mrs$|^ms$|^miss$|^dr$|^mr-mrs$|^(.)\1*$)([a-z][a-z'-]{1,})$/i
However, I would probably simplify your negative lookahead as follows :
/^(?!(mr|mrs|ms|miss|dr|mr-mrs|(.)\2*)$)([a-z][a-z'-]{1,})$/i
The modifications are as follows:
We're evaluating the lookahead at the start of the string, as indicated by the ^ preceding it: no need to repeat that we match the start of the string in its clauses
Each alternative matches the end of the string. We can put the alternatives in a group, which will be followed by the end-of-string anchor
We have created a new group, which we have to take into account in our back-reference : to reference the same group, it now must address \2 rather than \1. An alternative in certain regex flavours would have been to use a non-capturing group (?:...)
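A quick way to check the combined rule is a small Python sketch like the one below. It uses the non-capturing variant mentioned in the last point, so the back-reference stays \1. NAME and is_valid_name are hypothetical names, and the title list is copied from the question.

```python
import re

# Reject bare titles and single-character repeats, then apply the name shape.
NAME = re.compile(
    r"^(?!(?:mr|mrs|ms|miss|dr|mr-mrs|(.)\1*)$)[a-z][a-z'-]{1,}$",
    re.IGNORECASE,
)

def is_valid_name(name):
    """True if the name passes the title, repetition and shape checks."""
    return NAME.match(name) is not None

for candidate in ["O'Brian", "Anne-Marie", "Mr", "mrs", "aaa", "bbbbb", "a"]:
    print(candidate, is_valid_name(candidate))
```

Note that with the case-insensitive flag the back-reference also ignores case, so an input such as Aaa is rejected as well.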

Regex with start and end match

I'm having trouble matching the start and end of a regex on Python.
Essentially I'm confused about when to use word boundaries \b and start/end anchors ^ $
My regex of
^[A-Z]{2}\d{2}
matches 4 characters (two uppercase letters, two digits) which is what I'm after
Matches AJ99, RD22, CP44 etc
However, I also noted that AJAJAJAJAJAJAJAJAJSJHS99 could be matched as well. I've tried using ^ and $ together to match the whole string. This doesn't work
^[A-Z]{2}\d{2}$ # this doesn't work
but
^[A-Z]{2}\d{2} # this is fine
[A-Z]{2}\d{2}$ # this is fine
The string I'm matching against is 4 characters long, but in the first two examples the regex could pick the start and end of a longer string respectively.
s = "NZ43" # 4 characters, match perfect! However....
s = "AM27272727" # matches the first example
s = "HAHSHSHSHDS57" # matches the second example
The position anchors ^ and $ place a restriction on the position of your matched chars:
Analyzing your complete regex:
^[A-Z]{2}\d{2}$
^ matches only at the beginning of the text
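[A-Z]{2} exactly 2 uppercase ASCII alphabetic characters
\d{2} exactly 2 digits (equivalent to [0-9]{2} for ASCII input; in Python 3, \d in a str pattern also matches other Unicode decimal digits unless re.ASCII is set)
$ matches only at the end of the text (in Python, also just before a single trailing newline)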
If you remove one or both of the 2 position anchors (^ or $) you can match a substring starting from the beginning or the end as you stated above.
If you want to match exactly a word without using the start/end of the string use the \b anchor, like this:
\b[A-Z]{2}\d{2}\b
\b matches at the start/end of the text (when the adjacent char is a word char) and between a word char (in regex, a word char \w is one of [a-zA-Z0-9_]) and a char that is not a word char (available as \W).
The regex above finds a match in all of the following strings (WS24 in the first four, NZ43 in the last one):
WS24 alone
before WS24
WS24 after
before WS24 after
NZ43
It doesn't match:
AM27272727 (it will match if the string is AM27 272727 or AM27"272727)
HAHSHSHSHDS57 (it will match if the string is HAHSHSHSH DS57 or... you get it)
A demo online (the site will be useful to you also to experiment with regex).
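Since the question is about Python, the difference between the anchors and \b can be checked with a minimal sketch like this one (WHOLE and WORD are made-up names; re.fullmatch is shown as an alternative to writing ^ and $ by hand):

```python
import re

WHOLE = re.compile(r"^[A-Z]{2}\d{2}$")   # the entire text must be the code
WORD = re.compile(r"\b[A-Z]{2}\d{2}\b")  # the code must stand alone as a word

for s in ["NZ43", "AM27272727", "HAHSHSHSHDS57"]:
    # search() with both anchors and fullmatch() without them agree here.
    print(s, bool(WHOLE.search(s)), bool(re.fullmatch(r"[A-Z]{2}\d{2}", s)))

# \b finds the code inside longer text, but not inside a longer word.
print(WORD.findall('before WS24 after, AM27272727, AM27"272727'))  # ['WS24', 'AM27']
```

Only NZ43 passes the anchored check, which suggests the ^...$ version in the question does behave as intended.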
The behaviour you show is exactly what is supposed to happen, so your question suggests that you may not have fully understood how regular expressions work.
As an addition to the very good and informative answer of GsusRecovery, here's a site that guides you through the concepts of regular expressions and teaches the basics with a lesson-based system. To be clear, I do not want to tout this website, as there are plenty of those, but I found this one really useful, so it's the one I'm suggesting.

Regex matching beginning AND end strings

This seems like it should be trivial, but I'm not so good with regular expressions, and this doesn't seem to be easy to Google.
I need a regex that starts with the string 'dbo.' and ends with the string '_fn'
So far as I am concerned, I don't care what characters are in between these two strings, so long as the beginning and end are correct.
This is to match functions in a SQL server database.
For example:
dbo.functionName_fn - Match
dbo._fn_functionName - No Match
dbo.functionName_fn_blah - No Match
If you're searching for hits within a larger text, you don't want to use ^ and $ as some other responders have said; those match the beginning and end of the text. Try this instead:
\bdbo\.\w+_fn\b
\b is a word boundary: it matches a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one. This regex will find what you're looking for in any of these strings:
dbo.functionName_fn
foo dbo.functionName_fn bar
(dbo.functionName_fn)
...but not in this one:
foodbo.functionName_fnbar
\w+ matches one or more "word characters" (letters, digits, or _). If you need something more inclusive, you can try \S+ (one or more non-whitespace characters) or .+? (one or more of any characters except linefeeds, non-greedily). The non-greedy +? prevents it from accidentally matching something like dbo.func1_fn dbo.func2_fn as if it were just one hit.
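The behaviour described above can be reproduced with a short Python sketch (the question does not name a language, so Python and the variable names here are assumptions):

```python
import re

FN = re.compile(r"\bdbo\.\w+_fn\b")

found = [bool(FN.search(t)) for t in [
    "dbo.functionName_fn",          # match
    "foo dbo.functionName_fn bar",  # match
    "(dbo.functionName_fn)",        # match
    "foodbo.functionName_fnbar",    # no match: no word boundaries
    "dbo._fn_functionName",         # no match
    "dbo.functionName_fn_blah",     # no match
]]
print(found)

two = "dbo.func1_fn dbo.func2_fn"
lazy = re.findall(r"\bdbo\..+?_fn\b", two)   # stops at the first possible _fn
greedy = re.findall(r"\bdbo\..+_fn\b", two)  # runs on to the last _fn
print(lazy, greedy)
```

The lazy version reports two separate hits, while the greedy one swallows both names as a single match.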
^dbo\..*_fn$
This should work for you.
Well, the simple regex is this:
/^dbo\..*_fn$/
It would be better, however, to use the string manipulation functionality of whatever programming language you're using to slice off the first four and the last three characters of the string and check whether they're what you want.
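In Python, for instance, the string-method route could look like the sketch below. The function names is_dbo_fn and is_dbo_fn_regex are hypothetical, and the regex version is included only for comparison.

```python
import re

def is_dbo_fn(name):
    """Plain string checks: begins with 'dbo.' and ends with '_fn'."""
    return name.startswith("dbo.") and name.endswith("_fn")

def is_dbo_fn_regex(name):
    """Same test as the regex ^dbo\\..*_fn$ applied to the whole name."""
    return re.fullmatch(r"dbo\..*_fn", name) is not None

names = ["dbo.functionName_fn", "dbo._fn_functionName", "dbo.functionName_fn_blah"]
print([is_dbo_fn(n) for n in names])
print([is_dbo_fn_regex(n) for n in names])
```

Both versions agree on the three examples from the question. The string methods avoid escaping issues entirely.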
\bdbo\..*fn
I was looking through a ton of java code for a specific library: car.csclh.server.isr.businesslogic.TypePlatform (although I only knew car and Platform at the time). Unfortunately, none of the other suggestions here worked for me, so I figured I'd post this.
Here's the regex I used to find it:
\bcar\..*Platform
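import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Scanner scanner = new Scanner(System.in);
String part = scanner.nextLine(); // the substring to look for
String line = scanner.nextLine(); // the text to search in
// Quote the substring so regex metacharacters in it are matched literally
String quoted = Pattern.quote(part.toLowerCase());
// A word that starts with the substring, or a word that ends with it
Pattern pattern = Pattern.compile("\\b" + quoted + "|" + quoted + "\\b");
Matcher matcher = pattern.matcher(line.toLowerCase());
System.out.println(matcher.find() ? "YES" : "NO");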
If you need to determine if any of the words of this text start or end with the sequence, you can use this regex: \bsubstring|substring\b:
anythingsubstring
substringanything
anythingsubstringanything
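The same check can be sketched in Python. The helper name word_starts_or_ends_with is made up, and re.escape is used on the assumption that the substring should be treated as literal text.

```python
import re

def word_starts_or_ends_with(part, line):
    """True if some word in line starts with part or ends with part."""
    quoted = re.escape(part)  # treat the substring literally
    pattern = rf"\b{quoted}|{quoted}\b"
    return re.search(pattern, line, re.IGNORECASE) is not None

print(word_starts_or_ends_with("sub", "anythingsub"))          # True: a word ends with it
print(word_starts_or_ends_with("sub", "subanything"))          # True: a word starts with it
print(word_starts_or_ends_with("sub", "anythingsubanything"))  # False: only in the middle
```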
The simplest thing that you can do is:
dbo.*_fn$
It looks for dbo, followed by any characters, and then _fn at the end of the string.
If you know which character follows the final n (for example a space), you can replace $ with that character.