Regex with optional word inside string - regex

I'm trying to match the following line with a regex (the words are separated by a single space):
Firstpartstring thisisoptional secondpartstring
I expect each word to be captured as its own group:
Group 1. Firstpartstring
Group 2. thisisoptional
Group 3. secondpartstring
This is what I tried so far:
(.*?)\s(thisisoptional)?\s(.*)
Only problem is, if "thisisoptional" does not exist inside the string, I don't get any results.
I expect:
Group 1. Firstpartstring
Group 2.
Group 3. secondpartstring
Please check this demo: https://regex101.com/r/YBlYXm/1
Can anyone point me in the right direction?
Thanks

Your problem is that your regex requires two whitespace characters (\s), so it cannot match when thisisoptional is missing and only one space remains. The easy fix is to move the second space into your 2nd capturing group:
(.*?)\s(thisisoptional\s)?(.*)
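This matches anything, followed by a space, then optionally thisisoptional with its trailing space, then anything. Note that group 2, when present, includes that trailing space.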
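A quick way to check this is a short Python sketch (Python's re module is my choice here, the question does not name a language). It runs the pattern from the question and the pattern from this answer against both variants of the sample line.

```python
import re

# Pattern from the question: two mandatory \s around the optional word.
original = re.compile(r"(.*?)\s(thisisoptional)?\s(.*)")
# Pattern from this answer: the second space lives inside the optional group.
fixed = re.compile(r"(.*?)\s(thisisoptional\s)?(.*)")

with_word = "Firstpartstring thisisoptional secondpartstring"
without_word = "Firstpartstring secondpartstring"

# The original pattern finds no second whitespace character to match.
no_match = original.match(without_word)

groups_with = fixed.match(with_word).groups()
groups_without = fixed.match(without_word).groups()

print(no_match)
print(groups_with)     # group 2 keeps its trailing space
print(groups_without)  # group 2 is None when the word is absent
```

If the trailing space in group 2 is a problem, it can be stripped afterwards or the space can be kept outside the capturing group.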

The space before the optional word should be made optional as well; otherwise it would require two spaces between the first and the last word to match:
(.*?)(?:\s(thisisoptional))?\s(.*)
https://regex101.com/r/YBlYXm/2
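As a rough check of the pattern above, here is a small Python sketch (Python is an assumption on my part). It shows that the optional word is captured without any surrounding space, and that group 2 is simply unset when the word is missing.

```python
import re

# The space and the optional word sit together in a non-capturing optional group.
pattern = re.compile(r"(.*?)(?:\s(thisisoptional))?\s(.*)")

groups_with = pattern.match(
    "Firstpartstring thisisoptional secondpartstring"
).groups()
groups_without = pattern.match("Firstpartstring secondpartstring").groups()

print(groups_with)
print(groups_without)  # group 2 is None here
```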

Can't you just group all non-whitespace characters with (\S+) and then remove the middle one if you get three matches?
Example of this regex running: https://regex101.com/r/IIyM5Z/1
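A minimal sketch of that idea in Python (the helper name split_parts is made up for illustration): collect all runs of non-whitespace characters and decide from the count whether the optional word is present.

```python
import re

def split_parts(line):
    """Return (first, optional, last); optional is None when the line has two words."""
    words = re.findall(r"\S+", line)
    if len(words) == 3:
        first, optional, last = words
    elif len(words) == 2:
        first, last = words
        optional = None
    else:
        raise ValueError(f"expected 2 or 3 words, got {len(words)}")
    return first, optional, last

print(split_parts("Firstpartstring thisisoptional secondpartstring"))
print(split_parts("Firstpartstring secondpartstring"))
```

This does not check that the middle word is literally thisisoptional; that comparison would have to be added if it matters.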

Related

Extra groups in regex

I'm building a regex to be able to parse addresses and am running into some blocks. An example address I'm testing against is:
5173B 63rd Ave NE, Lake Forest Park WA 98155
I am looking to capture the house number, street name(s), city, state, and zip code as individual groups. I am new to regex and am using regex101.com to build and test against, and ended up with:
(^\d+\w?)\s((\w*\s?)+).\s(\w*\s?)+([A-Z]{2})\s(\d{5})
It matches all the groups I need and matches the whole string, but there are extra groups that are null value according to the match information (3 and 4). I've looked but can't find what is causing this issue. Can anyone help me understand?
Your regular expression was almost right:
(^\d+\w?)\s([\w*\s?]+).\s([\w*\s?]+)\s([A-Z]{2})\s(\d{5})
What I changed are the second and third groups. In both you used a group inside a group, ((\w*\s?)+), whereas a character class inside a group, ([\w*\s?]+), matches the same text here and gives you the proper group content. (Inside a character class, * and ? are literal characters rather than quantifiers, so [\w\s]+ would be enough.)
With your previous syntax, the inner group can match an empty substring, since both quantifiers allow a zero-length match (* is zero to unlimited matches and ? is zero or one match). Because this group is repeated one or more times with +, its last repetition matches an empty string, and a repeated group keeps only its last repetition.
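To make the difference concrete, here is a small sketch in Python (an assumption; the question was tested on regex101). It compares how many capture groups each pattern defines and prints what they capture for the sample address. The contents reported for the extra inner groups can vary between regex engines, so they are printed without a stated expectation.

```python
import re

address = "5173B 63rd Ave NE, Lake Forest Park WA 98155"

# Pattern from the question: groups nested inside repeated groups.
nested = re.compile(r"(^\d+\w?)\s((\w*\s?)+).\s(\w*\s?)+([A-Z]{2})\s(\d{5})")
# Pattern from this answer: character classes instead of nested groups.
flat = re.compile(r"(^\d+\w?)\s([\w*\s?]+).\s([\w*\s?]+)\s([A-Z]{2})\s(\d{5})")

print(nested.groups)  # number of capture groups the pattern defines
print(flat.groups)

nested_match = nested.match(address)
flat_groups = flat.match(address).groups()

print(nested_match.groups())
print(flat_groups)
```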
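To get rid of the empty groups, use a non-capturing group, which has the form (?:regex), in the places where you currently see your "null results". Note that in the regex below the city part is also non-capturing; wrap it as ((?:\w*\s?)+) if you need the city as its own group, keeping in mind that the capture will then include the trailing space. This gives you the regex: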
(^\d+\w?)\s((?:\w*\s?)+).\s(?:\w*\s?)+([A-Z]{2})\s(\d{5})
Here is a basic example of the difference between a capturing group and a non-capturing group: ([^s]+) (?:[^s]+):
See how the first group is captured into "Group 1" and the second one is not captured at all?
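The same difference can be seen in a couple of lines of Python. This is only a sketch: the sample text "hello world" and the use of \S+ are my own stand-ins, not taken from the answer.

```python
import re

# First word in a capturing group, second word in a non-capturing group.
m = re.match(r"(\S+) (?:\S+)", "hello world")

whole = m.group(0)     # the entire match, both words
captured = m.groups()  # only the capturing group shows up here

print(whole)
print(captured)
```

Both words take part in the match, but only the first one is reported as a group.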
Matching an address can be difficult due to the different formats.
If you can rely on the comma to be there, you can capture the part before it using a negated character class:
^(\d+[A-Z]?)\s+([^,]+?)\s*,\s*(.+?)\s+([A-Z]{2})\s(\d{5})$
Or take the part before the comma that ends with two or more uppercase characters, and then match optional non-word characters using \W* to get to the first word character after the comma:
^(\d+[A-Z]?)\s+(.*?\b[A-Z]{2,}\b)\W*(.+?)\s+([A-Z]{2})\s(\d{5})$
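Both anchored patterns can be tried against the sample address with a short Python sketch (Python is my assumption; the variable names are illustrative only).

```python
import re

address = "5173B 63rd Ave NE, Lake Forest Park WA 98155"

# Variant 1: rely on the comma and use a negated character class for the street.
by_comma = re.compile(
    r"^(\d+[A-Z]?)\s+([^,]+?)\s*,\s*(.+?)\s+([A-Z]{2})\s(\d{5})$"
)
# Variant 2: street ends with a word of two or more uppercase letters.
by_uppercase = re.compile(
    r"^(\d+[A-Z]?)\s+(.*?\b[A-Z]{2,}\b)\W*(.+?)\s+([A-Z]{2})\s(\d{5})$"
)

comma_groups = by_comma.match(address).groups()
uppercase_groups = by_uppercase.match(address).groups()

print(comma_groups)
print(uppercase_groups)
```

For this one address both variants produce the same five groups; other address formats would need their own checks.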

RegEx to match all sets of items that have part of specific value

I'm trying to use RegEx to filter all sets of items that have part of a specific value in a capture group that I have defined.
I have to check if the fifth capture group contains at least part of a specific text.
My string:
First Item;Second Item;Third Item;Fourth Item;First Word;Sixth Item?First Item;Second Item;Third Item;Fourth Item;Second Word;Sixth Item?First Item;Second Item;Third Item;Fourth Item;Can't Capture This Set;Sixth Item
RegEx that works for exact word:
(?:^|\?)([^;]+);([^;]+);([^;]+);([^;]+);(Second Word);([^;\?$]+)
The problem is that I need this RegEx to match when the fifth group only contains part of the text, not the exact value.
Not Working:
(?:^|\?)([^;]+);([^;]+);([^;]+);([^;]+);(.*Word.*);([^;\?$]+)
Thanks!
Use [^;]* instead of .* because you have semi-colons as field delimiters:
(?:^|\?)([^;]+);([^;]+);([^;]+);([^;]+);([^;]*Word[^;]*);([^;?]+)
([^;]*Word[^;]*) matches zero or more characters other than semi-colons, then the literal text Word, then zero or more characters other than semi-colons, so the fifth group can never spill over into a neighbouring field.
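Here is a hedged Python sketch (the question does not say which language is used) that runs both the .* version and the [^;]* version over the sample string and collects the fifth group of every match.

```python
import re

text = (
    "First Item;Second Item;Third Item;Fourth Item;First Word;Sixth Item?"
    "First Item;Second Item;Third Item;Fourth Item;Second Word;Sixth Item?"
    "First Item;Second Item;Third Item;Fourth Item;Can't Capture This Set;Sixth Item"
)

# .* is allowed to run across semi-colons, so group 5 swallows several fields.
greedy = re.compile(
    r"(?:^|\?)([^;]+);([^;]+);([^;]+);([^;]+);(.*Word.*);([^;\?$]+)"
)
# [^;]* stops at the field delimiter, so group 5 stays inside one field.
bounded = re.compile(
    r"(?:^|\?)([^;]+);([^;]+);([^;]+);([^;]+);([^;]*Word[^;]*);([^;?]+)"
)

greedy_fifth = [m.group(5) for m in greedy.finditer(text)]
bounded_fifth = [m.group(5) for m in bounded.finditer(text)]

print(greedy_fifth)
print(bounded_fifth)
```

The third set is skipped by the bounded pattern because its fifth field does not contain Word.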

Clarification regarding nested groups in regex

I have the following string: "Jan 1987", which I want to split into two groups:
The first group should match the whole string
The second group should match only the year
The following expression: (.+(\d+)) creates the first group, but the second group only matches the last digit. If I add a space like this, (.+ (\d+)), the second group correctly matches the whole year.
Can someone explain why?
Thanks in advance.
Yes. The term .+ is greedy: it first consumes the whole string and then backtracks one character at a time until the rest of the pattern can match. Since \d+ needs only one digit, backtracking stops as soon as the last digit has been given up, so the second group holds just that digit.
Adding the space tells the engine that the second group must be preceded by a space. There is only one space in the sample, so .+ has to backtrack all the way to it, which leaves the month before the space and the whole year in the second group.
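The backtracking described above is easy to observe with a small Python sketch (Python is an assumption; any backtracking engine should behave the same way for these two patterns).

```python
import re

sample = "Jan 1987"

# Greedy .+ gives back only one character, so \d+ gets a single digit.
without_space = re.match(r"(.+(\d+))", sample).groups()
# The literal space forces .+ to back off to the only space in the string.
with_space = re.match(r"(.+ (\d+))", sample).groups()

print(without_space)
print(with_space)
```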
The actual regex you most likely want to be using here is something like this:
(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d{4})
The entire match is already the whole string; depending on the regex tool or language you are using, it is also available as the zeroth group. The four-digit year is available in the first capture group.
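In Python, for instance (my choice of language, the question does not name one), the whole match and the year can be read like this:

```python
import re

month_year = re.compile(
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d{4})"
)

m = month_year.search("Jan 1987")
whole = m.group(0)  # the entire match
year = m.group(1)   # the only capturing group

print(whole)
print(year)
```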

How can I match all instances of the first letter?

For example, for this string I want to match all A and a:
"All the apples make good cake."
Here's what I did: /(.)[^.]*\1*/ig
I started by getting the first character in the group, which can be any character: (.) Then I added [^.]* because I don't want to match any other character that isn't the first one. Finally I added \1* because I wanted to match the first character again. All other similar variations that I've tried don't seem to work.
The regex you are trying to build captures the very first character and then consumes everything up to the next occurrence of that same character, using a negative lookahead (a tempered dot):
(?i)(\w)(?:(?!\1).)*
Capturing group 1 holds the character you need.
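A short sketch in Python (an assumption, since the question uses JavaScript-style /.../ig syntax) shows how repeated matching with this pattern lands on every occurrence of the first letter in the sample sentence. Each match stops right before the next occurrence, so the following match starts on it.

```python
import re

text = "All the apples make good cake."

# (\w) grabs a character; the tempered dot consumes until that character recurs.
pattern = re.compile(r"(?i)(\w)(?:(?!\1).)*")

matches = list(pattern.finditer(text))
letters = [m.group(1) for m in matches]
positions = [m.start(1) for m in matches]

print(letters)
print(positions)
```

This relies on the string starting with the letter of interest and fitting on one line, because the dot does not match line breaks by default.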
If your regex engine supports the \K match-reset token, you can append it to the regex above to match only the desired part:
(?i)(\w)(?:(?!\1).)*\K

Regex: pattern repeated capture – delimit the matching at the end of the pattern - non capture group and lookaheads negative example

I wish to match the end of my text and for it I have to match all the characters and the line breaks.
But I must exclude the beginning of the next capture!
What I want is to delimit the end of the pattern where the next pattern begins.
I tried to replace
[^-]
by something like
(?!-{2}\\*{3})
It doesn't work!
So I want to capture the number, and I want to capture the whole paragraph (some text) between the (--*** x ***) markers.
Using this regex seems to work:
--\*{3}([\d]*)\*{3}(((?!-).*\n)*)
1st Capturing group: The digit inside the stars.
2nd Capturing group: The text between the "headers"
3rd Capturing group: The last line of the paragraph.
A link with the regex tested:
https://regex101.com/r/xJ0gC6/1
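To see what the three groups contain, here is a small Python sketch. The sample text is invented for illustration, since the original input is not shown in the question.

```python
import re

pattern = re.compile(r"--\*{3}([\d]*)\*{3}(((?!-).*\n)*)")

# Hypothetical input: two headers, each followed by a short paragraph.
text = "--***1***\nfirst line\nsecond line\n--***2***\nthird line\n"

results = [m.groups() for m in pattern.finditer(text)]

for number, paragraph, last_line in results:
    print(repr(number), repr(paragraph), repr(last_line))
```

Because the lookahead is (?!-), a paragraph line that begins with a single hyphen would also end the capture early.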
I found exactly what I wanted! :)
--\*{3}([^!*]*)\*{3}((?:(?!-{2}\*{3})(?:\n|.))*)
I must group what I want and what I don't want.
For that I must use a 'non-capture group' and a 'negative lookahead':
(?!nowant)(?:want)
Then I must use a 'non-capture group' to aggregate the matching:
(?:(?!nowant)(?:want))
Next, I add the quantifier '*':
(?:(?!nowant)(?:want))*
And finally, I add a 'capture group':
((?:(?!nowant)(?:want))*)
So here is the regex:
((?:(?!-{2}\*{3})(?:\n|.))*)
You can see the complete regex here:
https://regex101.com/r/xJ0gC6/2
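The complete pattern can be exercised with a short Python sketch. The sample text is made up, and it deliberately contains a line that starts with a single hyphen to show that only the full "--***" header ends a paragraph.

```python
import re

pattern = re.compile(r"--\*{3}([^!*]*)\*{3}((?:(?!-{2}\*{3})(?:\n|.))*)")

# Hypothetical input; the second paragraph line starts with a lone hyphen.
text = (
    "--***1***\n"
    "intro\n"
    "- a line starting with a hyphen\n"
    "--***2***\n"
    "more text\n"
)

sections = [(m.group(1), m.group(2)) for m in pattern.finditer(text)]

for number, paragraph in sections:
    print(repr(number), repr(paragraph))
```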