Make regex match only the capturing group - regex

Due to the technology I'm currently working with (PySpark API), I need to adjust a regex so that the full match corresponds to the capturing group.
I want to use it as a delimiter pattern in a split function
This function splits an input string according to the matched substring, not the capturing group.
Hence I need the full match to be the \s+ characters (which I currently only capture).
Here is my current regex (also tested on regex101): (\s)+(?:\d*\s*)(?=RUE|BOULEVARD|AVENUE)
I tried to extend the positive lookahead to allow for an optional \d+\s+ before the street type, so that a different \s would be matched. It hasn't worked so far.
The split function's output I wish to obtain is the following:
[7 BOULEVARD LAPIN BLANC,AVENUE MR LIEVRE,18 RUE PIERRE LAPIN]

I don't know PySpark, but I guess it supports lookarounds. Split on whitespace that is not preceded by a digit and that is followed by an optional number and then the type of street:
(?<!\d)\s+(?=(?:\d+\s)?(?:RUE|BOULEVARD|AVENUE))
In the demo I use a substitution with \n that simulates the split.
Demo & explanation
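As a rough check of the pattern above, here is a minimal sketch using Python's built-in re module rather than PySpark. The input string is an assumption, reconstructed from the desired output in the question (the original input is not shown). PySpark's split function takes a Java regex, which also supports lookbehind and lookahead, so the same pattern should carry over, but that is not verified here.

```python
import re

# Assumed input, reconstructed from the desired output in the question.
address = "7 BOULEVARD LAPIN BLANC AVENUE MR LIEVRE 18 RUE PIERRE LAPIN"

# Split on whitespace that is not preceded by a digit and is followed by
# an optional street number plus one of the street types.
delimiter = r"(?<!\d)\s+(?=(?:\d+\s)?(?:RUE|BOULEVARD|AVENUE))"

parts = re.split(delimiter, address)
print(parts)
```

Because the pattern contains no capturing group, re.split returns only the pieces between delimiters, which is the behaviour the question asks for.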

Related

Extra groups in regex

I'm building a regex to be able to parse addresses and am running into some blocks. An example address I'm testing against is:
5173B 63rd Ave NE, Lake Forest Park WA 98155
I am looking to capture the house number, street name(s), city, state, and zip code as individual groups. I am new to regex and am using regex101.com to build and test against, and ended up with:
(^\d+\w?)\s((\w*\s?)+).\s(\w*\s?)+([A-Z]{2})\s(\d{5})
It matches all the groups I need and matches the whole string, but there are extra groups that are null value according to the match information (3 and 4). I've looked but can't find what is causing this issue. Can anyone help me understand?
Your regex expression was almost good:
(^\d+\w?)\s([\w*\s?]+).\s([\w*\s?]+)\s([A-Z]{2})\s(\d{5})
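What I changed are the second and third groups: in both you used a group inside a group, ((\w*\s?)+), whereas a character class inside a group, ([\w*\s?]+), matches the same text here and gives you the proper group content. Note that inside a character class * and ? are literal characters, so ([\w\s]+) would be the cleaner way to write it.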
With your previous syntax, the inner group would be able to match an empty substring, since both quantifiers allow for a zero-length match (* is 0 to unlimited matches and ? is zero or one match). Since this group was repeated one or more times with the +, the last occurrence would match an empty string and only keep that.
For this you'll need to use a non-capturing group, which is of the form (?:regex), where you currently see your "null results". This gives you the regex:
(^\d+\w?)\s((?:\w*\s?)+).\s(?:\w*\s?)+([A-Z]{2})\s(\d{5})
Here is a basic example of the difference between a capturing group and a non-capturing group: ([^s]+) (?:[^s]+):
See how the first group is captured into "Group 1" and the second one is not captured at all?
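To make the difference concrete, here is a small sketch in Python (an assumption, since the question only mentions regex101). It counts the capturing groups in the original pattern and in the non-capturing variant given above, then prints what the variant captures for the example address.

```python
import re

address = "5173B 63rd Ave NE, Lake Forest Park WA 98155"

# Pattern from the question: nested and repeated capturing groups.
original = re.compile(r"(^\d+\w?)\s((\w*\s?)+).\s(\w*\s?)+([A-Z]{2})\s(\d{5})")

# Variant from the answer: the inner groups are non-capturing.
non_capturing = re.compile(r"(^\d+\w?)\s((?:\w*\s?)+).\s(?:\w*\s?)+([A-Z]{2})\s(\d{5})")

# Pattern.groups is the number of capturing groups in the pattern.
print(original.groups)
print(non_capturing.groups)

match = non_capturing.match(address)
fields = match.groups() if match else ()
print(fields)
```

Note that in this variant the city part is matched but not captured, because its group was made non-capturing as well.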
Matching an address can be difficult due to the different formats.
If you can rely on the comma to be there, you can capture the part before it using a negated character class:
^(\d+[A-Z]?)\s+([^,]+?)\s*,\s*(.+?)\s+([A-Z]{2})\s(\d{5})$
Regex demo
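A minimal sketch of the comma-based pattern in Python, assuming the example address from the question; the variable names are only illustrative.

```python
import re

ADDRESS_RE = re.compile(
    r"^(\d+[A-Z]?)\s+([^,]+?)\s*,\s*(.+?)\s+([A-Z]{2})\s(\d{5})$"
)

address = "5173B 63rd Ave NE, Lake Forest Park WA 98155"

match = ADDRESS_RE.match(address)
if match:
    # One capturing group per field, in the order they appear.
    number, street, city, state, zip_code = match.groups()
    print(number, street, city, state, zip_code, sep=" | ")
```

The negated class [^,] keeps the street group from running past the comma, so no nested groups are needed.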
Or take the part before the comma that ends on 2 or more uppercase characters, and then match optional non word characters using \W* to get to the first word character after the comma:
^(\d+[A-Z]?)\s+(.*?\b[A-Z]{2,}\b)\W*(.+?)\s+([A-Z]{2})\s(\d{5})$
Regex demo

Adjustment to this code to stop after finding two words

In my haste to get this working, I failed to ask how to stop after the second word in my original post, "Grab first 4 characters of two words RegEx".
If I have Awesome Sauce Today I would like to have AwesSauc
The code in my first post will capture the first 4 characters of any word and combine them, so Awesome Sauce Today will become AwesSaucToda. I want it to stop capturing after the second word. So in my example Today would be ignored, but it will still capture 4 characters of the first two words it encounters to create the new word AwesSauc
You may still use the Replace Text action and use
Pattern: (?s)^\P{L}*(\p{L}{1,4})\p{L}*\P{L}+(\p{L}{1,4}).*
Replacement text: $1$2
See the regex demo.
The difference between this solution and the previous one is that the pattern is anchored at the start with ^; instead of \W (which matches any non-word char) I am using \P{L}, which matches any non-letter char (adjust as you see fit); and to match the beginnings of the first and second words I am now using 2 capturing groups ((\p{L}{1,4})...(\p{L}{1,4})), hence two backreferences in the replacement pattern. The (?s) modifier makes . match any char, including a newline. The .* at the end is necessary to remove the rest of the string after the necessary text is captured into the 2 capturing groups.
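The same idea can be tried outside the Replace Text action. The sketch below uses Python's re module, which does not support \p{L}, so it swaps in ASCII letter classes ([A-Za-z] and [^A-Za-z]) as a stand-in; that substitution is an assumption and narrows the pattern to unaccented letters. The function name is hypothetical.

```python
import re

# ASCII-only stand-in for the \p{L} / \P{L} classes used in the answer.
PATTERN = re.compile(
    r"(?s)^[^A-Za-z]*([A-Za-z]{1,4})[A-Za-z]*[^A-Za-z]+([A-Za-z]{1,4}).*"
)


def first_two_prefixes(text: str) -> str:
    """Join up to 4 leading letters of each of the first two words."""
    # \1 and \2 are the two capturing groups; the trailing .* in the
    # pattern consumes, and so discards, the rest of the string.
    return PATTERN.sub(r"\1\2", text)


print(first_two_prefixes("Awesome Sauce Today"))
```

If the input has fewer than two words, the pattern does not match and the text is returned unchanged.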

Capture number between two whitespaces (RegEx)

I have the following data:
SOMEDATA .test 01/45/12 2.50 THIS IS DATA
and I want to extract the number 2.50 out of this. I have managed to do this with the following RegEx:
(?<=\d{2}\/\d{2}\/\d{2} )\d+.\d+
However that doesn't work for input like this:
SOMEDATA .test 01/45/12 2500 THIS IS DATA
In this case, I want to extract the number 2500.
I can't seem to figure out a regex rule for that. Is there a way to extract something between two spaces, that is, the text/number after the date until the next whitespace? All I know is that the date will always have the same format, and there will always be a space after the date and then a space after the number I want to extract.
Can someone help me out with this?
Capture number between two whitespaces
A whitespace is matched with \s, and non-whitespace with \S.
So, what you can use is:
\d{2}\/\d{2}\/\d{2} +(\S+)
                      ^^^
See the regex demo
The 1+ non-whitespace symbols are captured into Group 1.
If - for some reason - you need to only get the value as a whole match, use your lookbehind approach:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Or - if you are using PCRE - you may leverage the match reset operator \K:
\d{2}\/\d{2}\/\d{2} +\K\S+
                     ^^
See another demo
NOTE: the \K and capture group approaches allow 1 or more spaces after the date and are thus more flexible.
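A minimal sketch of the capture group approach in Python, using the two sample lines from the question. Python's re module has no \K, so only the group variant is shown; the helper name is hypothetical, and / is left unescaped because Python patterns have no delimiter.

```python
import re

# Date, one or more spaces, then the run of non-whitespace in group 1.
DATE_THEN_VALUE = re.compile(r"\d{2}/\d{2}/\d{2} +(\S+)")


def value_after_date(line):
    """Return the token that follows the date, or None if there is no date."""
    match = DATE_THEN_VALUE.search(line)
    return match.group(1) if match else None


print(value_after_date("SOMEDATA .test 01/45/12 2.50 THIS IS DATA"))
print(value_after_date("SOMEDATA .test 01/45/12 2500 THIS IS DATA"))
```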
I see some people helped you already, but if you would want an alternative working one for some reason, here's what works too :)
.+ \d+\/\d+\/\d+ (\d+[\.\d]*)
The .+ matches anything up to the date, plus the space before it.
Then \d+\/\d+\/\d+ matches the date, followed by a space.
The capturing group is the number. The [\.\d]* part can match nothing, so both floating point values and whole numbers are matched. Hope this helped!
Proof: https://regex101.com/r/fY3nJ2/1
Just make the fractional part optional:
(?<=\d{2}\/\d{2}\/\d{2} )\d+(?:\.\d+)?
Demo: https://regex101.com/r/jH3pU7/1
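For illustration, the lookbehind with an optional fractional part can be tried in Python as follows. This is only a sketch on the two sample lines from the question; the lookbehind has a fixed width (date plus one space), which Python's re module requires.

```python
import re

# Whole match is the number; the date only appears inside the lookbehind.
NUMBER_AFTER_DATE = re.compile(r"(?<=\d{2}/\d{2}/\d{2} )\d+(?:\.\d+)?")

samples = [
    "SOMEDATA .test 01/45/12 2.50 THIS IS DATA",
    "SOMEDATA .test 01/45/12 2500 THIS IS DATA",
]

# With no capturing group in the pattern, findall returns the full matches.
numbers = [NUMBER_AFTER_DATE.findall(line) for line in samples]
print(numbers)
```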
Update following clarifications in comments:
To match anything (but space) surrounded by spaces and prepended by date use:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Demo: https://regex101.com/r/jH3pU7/3
Rather than capture, you can make your entire match be the target text by using a look behind:
(?<=\d\d(\/\d\d){2} )\S+
This matches the first series of non-whitespace that follows a "date like" part.
Note also the reduction in the length of the "date like" pattern. You may consider using this part of the regex in whatever solution you use.

Mixing Lookahead and Lookbehind in 1 Regexp

I'm trying to match first occurrence of window.location.replace("http://stackoverflow.com") in some HTML string.
Specifically, I want to capture the URL of the first window.location.replace entry in the whole HTML string.
So for capturing the URL I formulated these 2 rules:
it should be after this string: window.location.redirect("
it should be before this string ")
To achieve it I think I need to use lookbehind (for 1st rule) and lookahead (for 2nd rule).
I ended up with this regex:
.+(?<=window\.location\.redirect\(\"?=\"\))
It doesn't work. I'm not even sure that it is legal to mix both rules like I did.
Can you please help me with translating my rules to Regex? Other ways of doing this (without lookahead(behind)) also appreciated.
The pattern you wrote is really not the one you need as it matches something very different from what you expect: text window.location.redirect("=") in text window.location.redirect("=") something. And it will only work in PCRE/Python if you remove the ? from before \" (as lookbehinds should be fixed-width in PCRE). It will work with ? in .NET regex.
If it is JS, you cannot rely on a lookbehind: JavaScript only gained lookbehind support with ES2018, and older engines do not support it.
Instead, use a capturing group around the unknown part you want to get:
/window\.location\.redirect\("([^"]*)"\)/
or
/window\.location\.redirect\("(.*?)"\)/
See the regex demo
Omitting the /g modifier means only the first occurrence is matched. Access the value you need inside Group 1.
The ([^"]*) captures 0+ characters other than a double quote (URLs you need should not have it). If these URLs you have contain a ", you should use the second approach as (.*?) will match any 0+ characters other than a newline up to the first ").
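The capturing group approach is not specific to JavaScript. Below is a minimal sketch in Python, with a made-up HTML snippet (the second URL and the script tags are assumptions for illustration); re.search stops at the first occurrence, much like a JS regex without /g.

```python
import re

# Hypothetical HTML with two redirects; only the first one is wanted.
html = (
    '<script>window.location.redirect("http://stackoverflow.com");</script>\n'
    '<script>window.location.redirect("http://example.com");</script>'
)

# Group 1 holds everything between the quotes.
REDIRECT_RE = re.compile(r'window\.location\.redirect\("([^"]*)"\)')

first = REDIRECT_RE.search(html)
first_url = first.group(1) if first else None
print(first_url)
```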

Fetch one out of two Numbers out of String

I have a list of strings, such as: Ø20X400
I need to extract the first of the numbers - between Ø and X
I've come so far to match the numbers in general with \d+ - as simple as it is...
But I need an expression to get the first value separated, not both of them...
You can use lookarounds (?<=..) and (?=..):
(?<=Ø)\d+(?=X)
or in Java style:
(?<=Ø)\\d+(?=X)
A second way is to use a capture group:
Ø(\d+)X
or
Ø(\\d+)X
Then you can extract the content of the group.
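Both ways can be compared side by side. The sketch below uses Python (an assumption; the answer mentions Java-style escaping) on the sample string from the question.

```python
import re

value = "Ø20X400"

# Lookarounds: the whole match is just the digits.
whole = re.search(r"(?<=Ø)\d+(?=X)", value)

# Capture group: the whole match includes Ø and X, group 1 holds the digits.
grouped = re.search(r"Ø(\d+)X", value)

print(whole.group(0))
print(grouped.group(0), grouped.group(1))
```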
The regex engines I know parse \n as a newline. \d is used for numbers.
The following regex gives you the first number between a Ø and a X in a capture group:
^.*?Ø(\d+)X.*
Edit live on Debuggex
This Regex will do it for you, (\d+?)X, and here is a Rubular to prove it. See, you want to group digits together, but make it non-greedy, ending the evaluation on X.
Try this one:
\d+(?=\D)
This should find the first number that is followed by a non-digit character.
With normal regular expressions, I would say:
Ø(\d+)X
This finds the Ø character, followed by one or more numbers, followed by an X. Also, the numbers will be stored in the first capture group. Capture groups differ from one regex implementation to another, but this would typically be denoted by \1. Capture group zero, \0, is usually the matched string itself. In this version, \d denotes digits 0-9, but if your regex engine uses \n for that purpose, use:
Ø(\n+)X