Specify end of regex group - regex

I am trying to create a regular expression which matches multiple groups, so the values between the groups can be extracted. Each group looks identical.
Let's consider the following example (note that the line breaks are intentional):
dog 1
wuff
wuff
cat
123
XYZ
dog 1
wuff
wuff
cat
456
ABC
dog 1
wuff
wuff
cat
789
Thus, with the right regular expression I want to get the output:
123
XYZ
456
ABC
789
On regex101.com I tried:
(?s)(?:dog.*cat)
which matches everything between the first occurrence of dog and the last occurrence of cat.
In addition I tried:
(?s)(?:dog.*(cat){1})
which, with my limited knowledge, should match the first occurrence of cat and then end the group, but it does not.
I appreciate any help.

You may use this regex in MULTILINE mode to capture the values that follow each dog ... cat block:
^dog\b(?:.*\n)+?cat\n(.*(?:\n.*)*?)(?=\ndog|\Z)
Your values are present in capture group #1
RegEx Demo
RegEx Details:
^: Match the start of a line
dog\b: Match word dog with a word boundary
(?:.*\n)+?: Match anything followed by a line break. Repeat this 1+ times (lazy)
cat\n: Match cat followed by a newline
(.*(?:\n.*)*?): Capture the (possibly multiline) values you're interested in into the first capture group.
(?=\ndog|\Z): Lookahead to assert that either a line break followed by dog or the end of input is ahead of the current position
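As a quick check, here is a minimal Python sketch of the pattern above. It assumes Python's re module with re.MULTILINE and an input without a trailing newline (in Python, \Z matches only at the very end of the string). The variable names are illustrative.

```python
import re

# Sample input from the question, joined without a trailing newline.
text = "\n".join([
    "dog 1", "wuff", "wuff", "cat", "123", "XYZ",
    "dog 1", "wuff", "wuff", "cat", "456", "ABC",
    "dog 1", "wuff", "wuff", "cat", "789",
])

pattern = re.compile(
    r"^dog\b(?:.*\n)+?cat\n(.*(?:\n.*)*?)(?=\ndog|\Z)",
    re.MULTILINE,
)

# With a single capture group, findall returns the group #1 text of each match.
values = pattern.findall(text)
for value in values:
    print(value)
```

Each element of values holds the lines between one cat and the next dog (or the end of the input), so printing them one after another reproduces the desired output.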

Related

POSIX regex expression with grouping not working as expected

I want a POSIX regex such that a digit at the end, if there is one, is not included in the first group. Example:
abc -> group1:"abc" group2:""
def0 -> group1:"def" group2:"0"
I tried this: (\S+)([0-9]+)?
However this one returns:
abc -> group1:"abc" group2:""
def0 -> group1:"def0" group2:""
How can I make second group more greedy than the first group?
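Well ... \s stands for “whitespace character”, and capitalizing it negates it, so \S means "any non-whitespace character". Digits are non-whitespace too, so the greedy \S+ consumes the trailing 0 and leaves nothing for the optional second group. If what you want is to match alpha strings only, and not numeric, you can use the class [[:alpha:]] (or a shortcut if your RE parser has one) just for that: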
$ printf 'abc123\ndef0\n' | sed -E 's/([[:alpha:]]*)(.*)/group1: \1 group2: \2/'
group1: abc group2: 123
group1: def group2: 0
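For comparison, the same idea can be sketched in Python. This is not POSIX: Python's re module has no [[:alpha:]] class, so [A-Za-z] is used as a stand-in here, and the helper name split_alpha_prefix is made up for illustration.

```python
import re

# [A-Za-z] stands in for POSIX [[:alpha:]], which Python's re does not support.
# DOTALL lets the second group absorb anything, so fullmatch always succeeds.
pattern = re.compile(r"([A-Za-z]*)(.*)", re.DOTALL)

def split_alpha_prefix(s):
    """Return (leading letters, remainder) for the given string."""
    match = pattern.fullmatch(s)
    return match.group(1), match.group(2)

for sample in ("abc", "def0", "abc123"):
    group1, group2 = split_alpha_prefix(sample)
    print(f'{sample} -> group1:"{group1}" group2:"{group2}"')
```

Because the first group can only match letters, it stops by itself at the first digit, and no greediness trade-off between the two groups is needed.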

Extract count of groups and groups within a word using regex

I am trying to use regex to determine how many and which groups are repeated.
Input String= $$$ 12345 aaa bbb ccc ddd eee 678 $$$ aaabbbbccc aaa-bbb-ddd aab aaaaaabbbbbbbbbbbbbc a000000009999999888888
Expected Output =
$$$
12345
aaa
bbb
ccc
ddd
eee
678
$$$
aaa
bbbb
ccc
aaa
bbb
ddd
aa
b
aaaaaa
bbbbbbbbbbbbb
c
a
00000000
9999999
888888
Please note that I have separated aaa from aaaaaa bbbbbbbbbbbbb and c for visual clarity. The actual output won't have any space or newline character between the words.
Rules:
1) There could be n number of words with characters among a-zA-Z0-9$. In above example, $$$ and 12345 are words.
2) A word could have n groups with repeated characters. E.g. aaa and a
3) What is the difference between a word and a group inside word? E.g. What is the difference between 12345 and aab.
Answer: 12345 doesn't have any repeated element. So, this stays as is without any further breakdown. However, aab has one repeated character a because of which it will be broken down into aa and b.
4) The output (consisting of groups) must not have any spaces or newline characters before or after the group.
I was able to separate words from each other. This was easy. I used r[$0-9a-zA-Z]+ However, I am unsure how to separate groups inside the word. i.e. how do I separate a000000009999999888888 into a 00000000 9999999 888888?
I'd appreciate any help. Thanks in advance.
Here's my regex101 sheet: REGEX101
If negative lookahead is supported, you might use an alternation and 2 capturing groups.
([a-z0-9$])\1+|(?:([a-z0-9$])(?!\2))+
Regex demo
([a-z0-9$])\1+ Match consecutive identical characters: capture one character from the character class in group 1, then repeat group 1 one or more times
| Or
(?: Non capturing group
([a-z0-9$]) Match what is in the character class and capture in group 2
(?!\2) Negative lookahead to assert that what follows is not group 2
)+ Close non capturing group and repeat one or more times
You did not specify a tool or language, but the full matches can be retrieved in, for example, PHP or Python.
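A minimal Python sketch of collecting the full matches could look like the following. The function name split_groups and the shortened input are illustrative. The pattern is used as written above, so it only covers lowercase letters, digits and $.

```python
import re

pattern = re.compile(r"([a-z0-9$])\1+|(?:([a-z0-9$])(?!\2))+")

def split_groups(text):
    # finditer + group(0) gives the full match; findall would return
    # tuples of the two capture groups instead.
    return [m.group(0) for m in pattern.finditer(text)]

print(split_groups("$$$ 12345 aab a0009"))
# ['$$$', '12345', 'aa', 'b', 'a', '000', '9']
```

The first alternative picks up runs of a repeated character, and the second collects stretches in which no character is followed by itself, which is why 12345 stays in one piece while aab is split into aa and b.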

Regex to match phone and fax numbers for WebHarvy

Sample text
5950 S Willow Dr Ste 304
Greenwood Village, CO 80111
P (123) 456-7890
F (123) 456-7890
Get Directions
Tried the following but it grabbed the first line of the address as well
(.*)(?=(\n.*){2}$)
Also tried
P\s(\(\d{3})\)\s\d+-\d+
but it doesn't work in WebHarvy even though it works on RegexStorm
Looking for an expression to match the phone and fax numbers from it. I would be using the expression in WebHarvy
https://www.webharvy.com/articles/regex.html
Thanks
Your second pattern is almost what you need. With P\s(\(\d{3})\)\s\d+-\d+, you captured only the (\(\d{3}) part into Group 1, whereas you need to capture the whole number.
I also suggest to restrict the context: either match P as a whole word, or as the first word on a line:
\bP\s*(\(\d{3}\)\s*\d+-\d+)
or
(?m)^\s*P\s*(\(\d{3}\)\s*\d+-\d+)
See the regex demo, and here is what you need to pay attention to there:
The \b part matches a word boundary, while (?m)^\s* matches the start of a line ((?m) makes ^ match at the start of each line) followed by 0+ whitespace characters. To match only horizontal whitespace, replace \s* with [\p{Zs}\t]*.
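The line-anchored variant can be tried out with a short Python sketch. This only illustrates the pattern and makes no claim about how WebHarvy's engine behaves. The F pattern for the fax number is an assumption built by analogy with the P pattern, and Python's re does not support \p{Zs}, so \s* is kept.

```python
import re

# Sample text from the question.
text = """5950 S Willow Dr Ste 304
Greenwood Village, CO 80111
P (123) 456-7890
F (123) 456-7890
Get Directions"""

# Shared sub-pattern for the number itself.
number = r"\(\d{3}\)\s*\d+-\d+"

# (?m) makes ^ match at the start of every line; group 1 holds the whole number.
phone = re.search(r"(?m)^\s*P\s*(" + number + r")", text)
fax = re.search(r"(?m)^\s*F\s*(" + number + r")", text)  # assumed analogue for fax

print("phone:", phone.group(1))
print("fax:", fax.group(1))
```

Moving the closing parenthesis of the capture group to the end of the number is the only change needed relative to the pattern in the question.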

Regex deleting all lines except last occurrence of a pattern

I want to delete all lines that match a pattern but the last occurrence.
So assume we have this text:
test a 043
test a 123
test a 987
test b 565
The result I'm aiming for is this:
test a 987
test b 565
Is it possible to compare strings like that with just regex in vim? This is also assuming the a and b in this example are dynamic (test\s\w\s(.*)).
You will need a lookahead regex in vim for this:
:g/\v(^test \w+)(\_.*\1)@=/d
RegEx Breakup:
\v # very magic to avoid escapes
( # capturing group #1 start
^test \w+ # match test followed by a word at the start of a line
) # capturing group #1 end
(\_.*\1)@= # positive lookahead to make sure there is at least one more occurrence of \1 further down
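Outside vim, the same lookahead idea can be sketched in Python. This is a rough equivalent rather than a translation: it anchors the repeated key to the start of a later line and adds word boundaries, which the vim pattern does not do. The function name keep_last_occurrence is illustrative.

```python
import re

# A line is removed when its "test <word>" prefix shows up again
# at the start of some later line (checked by the lookahead).
pattern = re.compile(r"^(test \w+)\b.*\n(?=[\s\S]*^\1\b)", re.MULTILINE)

def keep_last_occurrence(text):
    return pattern.sub("", text)

sample = "test a 043\ntest a 123\ntest a 987\ntest b 565\n"
print(keep_last_occurrence(sample), end="")
# test a 987
# test b 565
```

The lookahead consumes nothing, so every line is tested against the original text that follows it, and only lines with a later duplicate key are dropped.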

Regex Ignore Preceding Words

I am trying to create a regular expression that starts with a certain word and ignores any preceding occurrences of that same word.
For example, if my string starts with the word "dog" and ends with "fish", how do I ignore any preceding "dog" words and only match the last one?
dog cat fish
dog dog cat fish <- ignore first word "dog" and match second "dog" word.
dog dog dog cat fish <- ignore first and second "dog" words and match third "dog" word.
The following regex works:
(\b\w+\b |\b\w+\b$)(?!\1) with the m and g flags enabled
Demo: http://regex101.com/r/dW9fP5
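A small Python sketch of what this pattern returns, assuming re.MULTILINE for the m flag and findall in place of the g flag (the variable names are illustrative):

```python
import re

pattern = re.compile(r"(\b\w+\b |\b\w+\b$)(?!\1)", re.MULTILINE)

# A word (with its trailing space) is skipped when the same word
# plus space follows immediately; the last copy is kept.
words = pattern.findall("dog dog cat fish")
print(words)
# ['dog ', 'cat ', 'fish']
```

Note that every match except the last one on a line includes its trailing space, because the space is part of the capture group.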
As per your new request:
(\b\w+\b|\b\w+\b$)(?!\1) with the m and g flags enabled
To strip out space separated duplicates:
dog dog dog cat cat fish:
(?>(\w+) (?=\1\b))+
test at: regex101, eval.in (if php)
This uses a lookahead to check whether the text captured by the first parenthesized group appears again right after the space.
To match duplicates only at string start, add the ^ anchor at the beginning:
dog dog dog cat cat fish
^(?>(\w+) (?=\1\b))+
test at regex101
EDIT: Question has obviously changed to matching consecutive character sequences in one long string without spaces. Pattern modified a bit to strip out sequences of at least 3 characters at start:
dogdogdogcatcatfish
^(?>(\w{3,})(?=\1))+
test at regex101
Replace with empty string ""
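The three replacements above can be sketched in Python as follows. Atomic groups (?>...) are only available in Python's re from version 3.11, so this sketch swaps in plain non-capturing groups (?:...), which give the same results for these inputs. The function names are illustrative.

```python
import re

def strip_all_duplicates(s):
    # Remove every word that is directly followed by the same word.
    return re.sub(r"(?:(\w+) (?=\1\b))+", "", s)

def strip_leading_duplicates(s):
    # Same, but only at the start of the string.
    return re.sub(r"^(?:(\w+) (?=\1\b))+", "", s)

def strip_leading_repeats_no_spaces(s):
    # Variant for sequences of at least 3 characters without separators.
    return re.sub(r"^(?:(\w{3,})(?=\1))+", "", s)

print(strip_all_duplicates("dog dog dog cat cat fish"))        # dog cat fish
print(strip_leading_duplicates("dog dog dog cat cat fish"))    # dog cat cat fish
print(strip_leading_repeats_no_spaces("dogdogdogcatcatfish"))  # dogcatcatfish
```

In each case the lookahead leaves the last copy of the word untouched, so only the earlier duplicates are replaced with the empty string.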
Here's a simple (literal) pattern:
.*(dog)
Replace Pattern:
\1
Not the most exciting, but might as well show it. The target word in parentheses is captured as group \1, and the greedy .* in front of it swallows everything up to the last occurrence.
example: http://regex101.com/r/yU6xO8
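In Python, this literal approach could be sketched with re.sub (the variable name result is illustrative):

```python
import re

# Greedy .* runs to the last "dog"; the whole match is replaced by group 1.
result = re.sub(r".*(dog)", r"\1", "dog dog dog cat fish")
print(result)
# dog cat fish
```

This only works when the target word is known in advance, which is the trade-off compared with the backreference-based patterns.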