Regex deleting all lines except last occurrence of a pattern - regex

I want to delete all lines that match a pattern except the last occurrence.
So assume we have this text:
test a 043
test a 123
test a 987
test b 565
The result I'm aiming for is this:
test a 987
test b 565
Is it possible to compare strings like that with just regex in vim? This also assumes that the a and b in this example are dynamic (test\s\w\s(.*)).

You will need a lookahead regex in vim for this:
:g/\v(^test \w+)(\_.*\1)@=/d
RegEx Breakup:
\v # very magic to avoid escapes
( # capturing group #1 start
^test \w+ # match any line starting with test \w+
) # capturing group #1 end
(\_.*\1)@= # positive lookahead (@= in very magic mode) to make sure there is at least one more \1 below
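For comparison, here is a minimal sketch of the same lookahead idea outside Vim, using Python's re module. The function name keep_last_occurrence and the added \b word boundaries are my own assumptions, not part of the answer above; the sketch also assumes \n line endings.

```python
import re

# A line is deleted when the same "test <word>" prefix starts a later line.
# [^\n]*\n consumes the rest of the current line including its newline;
# the lookahead [\s\S]*^\1\b scans everything that follows for the prefix.
PATTERN = re.compile(r"^(test \w+)\b[^\n]*\n(?=[\s\S]*^\1\b)", re.MULTILINE)

def keep_last_occurrence(text: str) -> str:
    """Remove every line whose prefix occurs again further down."""
    return PATTERN.sub("", text)

sample = "test a 043\ntest a 123\ntest a 987\ntest b 565\n"
print(keep_last_occurrence(sample))
# test a 987
# test b 565
```

The \b after the backreference keeps "test a" from being treated as a duplicate of "test ab".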

Python Regex some name + US Address

I have these kind of strings:
WILLIAM SMITH 2345 GLENDALE DR RM 245 ATLANTA GA 30328-3474
LINDSAY SCARPITTA 655 W GRACE ST APT 418 CHICAGO IL 60613-4046
I want to make sure that the strings I receive have the same format as the examples above.
Here's my regular expression:
[A-Z]+ [A-Z]+ [0-9]{3,4} [A-Z]+ [A-Z]{2,4} [A-Z]{2,4} [0-9]+ [A-Z]+ [A-Z]{2} [0-9]{5}-[0-9]{4}$
But my regular expression only matches the first example and does not match the second one.
Here's dawg's regex with capturing groups:
^([A-Z]+[ \t]+[A-Z]+)[ \t]+(\d+)[ \t](.*)[ \t]+([A-Z]{2})[ \t]+(\d{5}(?:-\d{4}))$
Here's the url.
UPDATE
Sorry, I forgot to remove the non-capturing group at the end of dawg's regex.
Here's the new regex without the non-capturing group: regex101
Try this:
^[A-Z]+[ \t]+[A-Z]+[ \t]+\d+.*[ \t]+[A-Z]{2}[ \t]+\d{5}(?:-\d{4})?$
Demo
Explanation:
1. ^[A-Z]+[ \t]+[A-Z]+[ \t]+    Starting at the start of the line,
                                two blocks of A-Z for the name
                                (however, names are often more complicated...)
2. \d+.*[ \t]+[A-Z]{2}[ \t]+    Using the number at the start and the
                                two-letter state code at the end for the full address.
                                Cities can have spaces, such as 'Miami Beach'
3. \d{5}(?:-\d{4})?$            Zip code with optional -NNNN, with end anchor
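As a rough check, the pattern can be tried in Python against both strings from the question. This sketch combines the capturing groups from the earlier answer with an optional -NNNN suffix; the name ADDRESS and the third sample line (JOHN DOE) are made up for illustration.

```python
import re

ADDRESS = re.compile(
    r"^([A-Z]+[ \t]+[A-Z]+)[ \t]+"   # group 1: two-word name
    r"(\d+)[ \t](.*)[ \t]+"          # group 2: house number, group 3: street and city
    r"([A-Z]{2})[ \t]+"              # group 4: two-letter state code
    r"(\d{5}(?:-\d{4})?)$"           # group 5: ZIP code, "-NNNN" part optional
)

samples = [
    "WILLIAM SMITH 2345 GLENDALE DR RM 245 ATLANTA GA 30328-3474",
    "LINDSAY SCARPITTA 655 W GRACE ST APT 418 CHICAGO IL 60613-4046",
    "JOHN DOE 12 MAIN ST MIAMI BEACH FL 33139",  # hypothetical, 5-digit ZIP only
]

for line in samples:
    match = ADDRESS.match(line)
    print(match.groups() if match else None)
```

Because the middle of the address is matched by a greedy .*, the street and city end up in one group; splitting them reliably would need more than this pattern.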

How to transpose pieces of data using Regular expression in Notepad++

I am very new to the world of regular expressions. I am trying to use Notepad++ using Regex for the following:
Input file is something like this and there are multiple such files:
Code:
abc
17
015
0 7
4.3
5/1
***END***
abc
6
71
8/3
9 0
***END***
abc
10.1
11
9
***END***
I need to be able to edit the text in all of these files so that all the files look like this:
Code:
abc
1,2,3,4,5
***END***
abc
6,7,8,9
***END***
abc
10,11,12
***END***
Also:
In some files the number of * around the word END varies, is there a way to generalize the number of * so I don't have to worry about it?
There is some additional data before abcs which does not need to be transposed, how do I keep that data as it is along with transposing the data between abc and ***END***.
Kindly help me. Your help is much appreciated!
Try the following find and replace, in regex mode:
Find: ^(\d+)\R(?!\*{1,}END\*{1,})
Replace: $1,
Demo
Here is an explanation of the regex pattern:
^ from the start of the line
(\d+) match AND capture a number
\R followed by a platform independent newline, which
(?!\*{1,}END\*{1,}) is NOT followed by ***END***
Note carefully the negative lookahead at the end of the pattern, which makes sure that we don't do the replacement on the final number in each section. Without this, the last number would bring the END marker onto the same line.
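The same find and replace can be sketched in Python for batch processing. Python's re module has no \R, so this version assumes \n or \r\n line endings and spells them out as \r?\n; the function name join_numbers is an assumption of mine.

```python
import re

# A line of digits followed by a newline, unless the next line is the END marker.
JOIN_NUMBERS = re.compile(r"^(\d+)\r?\n(?!\*+END\*+)", re.MULTILINE)

def join_numbers(text: str) -> str:
    """Replace the newline after each number with a comma, except before END."""
    return JOIN_NUMBERS.sub(r"\1,", text)

text = "abc\n1\n2\n3\n***END***\nabc\n6\n7\n*END*\n"
print(join_numbers(text))
# abc
# 1,2,3
# ***END***
# abc
# 6,7
# *END*
```

Like the original pattern, this joins any run of digit-only lines, including ones that appear before the first abc.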
This will replace only between "abc" and "***END***", with any number of asterisks.
Ctrl+H
Find what: (?:(?<=^abc)\R|\G(?!^)).+\K\R(?!\*+END\*+)
Replace with: ,
CHECK Match case
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
(?: # non capture group
(?<=^abc) # positive look behind, make sure we have "abc" at the beginning of line before
\R # any kind of linebreak
| # OR
\G # restart from last match position
(?!^) # negative look ahead, make sure we are not at the beginning of line
) # end group
.+ # 1 or more any character but newline
\K # forget all we have seen until this position
\R # any kind of linebreak
(?!\*+END\*+) # negative lookahead, make sure we haven't ***END*** after
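Python's re module supports neither \G nor \K, so the pattern above cannot be ported directly. A different technique gives the same restriction to abc ... END blocks: match each whole block and join its lines in a replacement function. This is only a sketch under the assumption of \n line endings; BLOCK and transpose_blocks are hypothetical names.

```python
import re

# "abc" on its own line, then as few whole lines as possible (group 1),
# then an END marker with any number of asterisks (group 2).
BLOCK = re.compile(r"^abc\n((?:.*\n)*?)(\*+END\*+)$", re.MULTILINE)

def transpose_blocks(text: str) -> str:
    def join_lines(match: re.Match) -> str:
        lines = match.group(1).splitlines()
        if not lines:                      # empty block: leave it untouched
            return match.group(0)
        return "abc\n" + ",".join(lines) + "\n" + match.group(2)
    return BLOCK.sub(join_lines, text)

text = "12\n34\nabc\n1\n2\n3\n***END***\nabc\n6\n7\n*END*\n"
print(transpose_blocks(text))
# 12
# 34
# abc
# 1,2,3
# ***END***
# abc
# 6,7
# *END*
```

The lines 12 and 34 before the first abc are left alone, which is the behaviour the question asks for.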

How to use regular expression to use as few groups as possible to match as long string as possible

For example, this is the regular expression
([a]{2,3})
This is the string
aaaa // 1 match "(aaa)a" but I want "(aa)(aa)"
aaaaa // 2 match "(aaa)(aa)"
aaaaaa // 2 match "(aaa)(aaa)"
However, if I change the regular expression
([a]{2,3}?)
Then the results are
aaaa // 2 match "(aa)(aa)"
aaaaa // 2 match "(aa)(aa)a" but I want "(aaa)(aa)"
aaaaaa // 3 match "(aa)(aa)(aa)" but I want "(aaa)(aaa)"
My question: is it possible to use as few groups as possible while matching as much of the string as possible?
How about something like this:
(a{3}(?!a(?:[^a]|$))|a{2})
This looks for either the character a three times (not followed by a single a and a different character) or the character a two times.
Breakdown:
( # Start of the capturing group.
a{3} # Matches the character 'a' exactly three times.
(?! # Start of a negative Lookahead.
a # Matches the character 'a' literally.
(?: # Start of the non-capturing group.
[^a] # Matches any character except for 'a'.
| # Alternation (OR).
$ # Asserts position at the end of the line/string.
) # End of the non-capturing group.
) # End of the negative Lookahead.
| # Alternation (OR).
a{2} # Matches the character 'a' exactly two times.
) # End of the capturing group.
Here's a demo.
Note that if you don't need the capturing group, you can actually use the whole match instead by converting the capturing group into a non-capturing one:
(?:a{3}(?!a(?:[^a]|$))|a{2})
Which would look like this.
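A quick way to check the non-capturing version against the three strings from the question is a short Python loop (the name SPLIT is arbitrary):

```python
import re

# Three a's unless that would strand a single trailing a; otherwise two a's.
SPLIT = re.compile(r"(?:a{3}(?!a(?:[^a]|$))|a{2})")

for s in ("aaaa", "aaaaa", "aaaaaa"):
    print(s, SPLIT.findall(s))
# aaaa ['aa', 'aa']
# aaaaa ['aaa', 'aa']
# aaaaaa ['aaa', 'aaa']
```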
Try this Regex:
^(?:(a{3})*|(a{2,3})*)$
Click for Demo
Explanation:
^ - asserts the start of the line
(?:(a{3})*|(a{2,3})*) - a non-capturing group containing 2 sub-sequences separated by OR operator
(a{3})* - The first subsequence tries to match 3 occurrences of a. The * at the end allows this subsequence to match 0 or 3 or 6 or 9.... occurrences of a before the end of the line
| - OR
(a{2,3})* - matches 2 to 3 occurrences of a, as many as possible. The * at the end would repeat it 0+ times before the end of the line
$ - asserts the end of the line
Try this short regex:
a{2,3}(?!a([^a]|$))
Demo
How it's made:
I started with this simple regex: a{2}a?. It looks for 2 consecutive a's that may be followed by another a. If the 2 a's are followed by another a, it matches all three a's.
This worked for most cases, such as aaaaa and aaaaaa.
However, it failed in cases like aaaa, where it matches aaa and leaves the last a unmatched.
So now, I knew I had to modify my regex in such a way that it would match the third a only if the third a is not followed by a([^a]|$). So now, my regex looked like a{2}a?(?!a([^a]|$)), and it worked for all cases. Then I just simplified it to a{2,3}(?!a([^a]|$)).
That's it.
EDIT
If you want the capturing behavior, then add parentheses around the regex, like:
(a{2,3}(?!a([^a]|$)))
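One detail matters when trying the short regex in Python: the ([^a]|$) inside the lookahead is itself a capturing group, so re.findall would return the content of that group rather than the matched text. A minimal sketch that reads the whole match instead (SHORT and chunks are names I made up):

```python
import re

SHORT = re.compile(r"a{2,3}(?!a([^a]|$))")

def chunks(s: str) -> list[str]:
    # group(0) is the whole match; findall would report group 1,
    # which never participates because it sits in a negative lookahead.
    return [m.group(0) for m in SHORT.finditer(s)]

for s in ("aaaa", "aaaaa", "aaaaaa"):
    print(s, chunks(s))
# aaaa ['aa', 'aa']
# aaaaa ['aaa', 'aa']
# aaaaaa ['aaa', 'aaa']
```

Writing the inner group as (?:[^a]|$) avoids the issue altogether.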

RegEx: Get every word until last 4 words

I have strings like
wwww-wwww-wwww
wwww-www-ww-ww
Many w's separated by -
The grouping is not regular: it could be wwww-wwww, but it could be w-w-w-w as well
I am trying to find a regex that captures every word until the last 4 words.
So the result for example 1 would be the first 8w's (wwww-wwww)
For 2nd example the first 5w's (wwww-w)
Is it possible to do this in regex?
I have something like this right now:
^\w*(?=\w{4}$)
or maybe
[^-]*(?=\w{4}$)
I have 2 problems with my "solutions":
the last 4 words will not be captured for example 2. They are interrupted by the -
the words before the last 4 will not be captured. They are interrupted by the -.
Yes, it's possible with a slightly more sophisticated lookahead assertion:
/\w(?=(?:-*\w){4,}$)/x
Explanation:
/ # Start of regex
\w # Match a "word" character
(?= # only if the following can be matched afterwards:
(?: # (Start of non-capturing group)
-* # - zero or more separators
\w # - exactly one word character
){4,} # (End of non-capturing group), repeated 4 or more times.
$ # Then make sure we've reached the end of the string.
) # End of lookahead assertion
/x # End of regex; the x flag allows whitespace and comments in the pattern
Test it live on regex101.com.
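Here is a small Python sketch of the same lookahead. Distinct letters are used instead of w so the result is easier to read; the two helper functions are my own additions and not part of the answer.

```python
import re

# A word character that still has at least four word characters after it.
HEAD = re.compile(r"\w(?=(?:-*\w){4,}$)")

def all_but_last_four(s: str) -> str:
    """Matched characters only, separators dropped."""
    return "".join(HEAD.findall(s))

def head_with_separators(s: str) -> str:
    """Everything up to and including the last matched character."""
    matches = list(HEAD.finditer(s))
    return s[:matches[-1].end()] if matches else ""

print(all_but_last_four("abcd-efgh-ijkl"))     # abcdefgh
print(head_with_separators("abcd-efgh-ijkl"))  # abcd-efgh
print(head_with_separators("abcd-ef-gh-ij"))   # abcd-ef
```

The regex matches one character at a time, so a contiguous prefix has to be rebuilt from the match positions as in head_with_separators.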

Capturing repeated word sequence

In Perl, to match text patterns like a11a, g22g, x33x, the regex below works fine:
([a-z])(\d)\g2\g1
Now I want to match repeated groups similar to the above, but with spaces between the words, like
abcd 101 abcd 101 (catch this entire string with a single regex pattern, in a single line of text or in a paragraph)
How can I do this? I tried the pattern below, but it doesn't work:
([a-zA-Z]*\s)([0-9]*\s)\g1\g2
#logic is : words followed by space in 1 group and
#numbers followed by space in 2nd group
Regex101 Demo
Also, please explain why the above regex fails to capture the desired text pattern!!!
EDIT
One more complication :
assume that pattern is something like
[words][space][numbers][space][words][space][numbers]
#assume all [numbers] and [word] are same
....so in last [numbers] case, [space] doesn't follow, how to filter then...because regex group capture like:
([0-9]*\s) certainly fails to capture last part if it is repeated, and
([0-9]*) would fail to capture mid-part if it is repeated!! ??
Regex 101
Your problem is that your regex expects a space at the end, because you have included the space in the captures.
Try instead:
([a-zA-Z]+)\s([0-9]+)\s\g1\s\g2
([0-9]*\s) captures 101 together with the trailing space,
so \g2 doesn't match the final 101, because that one has no space after it.
Update: Working regex ([a-zA-Z]*\s)([0-9]*)\s\g1\g2 for input abcd 101 abcd 101
Online Demo
More example:
([a-zA-Z]*\s)   ([0-9]*)   \s      \g1          \g2
abcd+space      101        space   abcd+space   101
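The same behaviour can be reproduced in Python, where backreferences inside a pattern are written \1 and \2 rather than Perl's \g1 and \g2. This sketch contrasts the failing pattern from the question with the fixed one; the variable names are arbitrary.

```python
import re

text = "abcd 101 abcd 101"

# Spaces inside the groups: \2 then needs "101 " with a trailing space.
with_space_in_groups = re.compile(r"([a-zA-Z]*\s)([0-9]*\s)\1\2")
# Spaces outside the groups: \1 and \2 repeat only the word and the number.
with_space_outside = re.compile(r"([a-zA-Z]+)\s([0-9]+)\s\1\s\2")

print(with_space_in_groups.search(text))            # None
print(with_space_in_groups.search(text + " "))      # matches once a trailing space exists
print(with_space_outside.search(text).group(0))     # abcd 101 abcd 101

# The fixed pattern also finds the sequence inside a longer sentence.
print(with_space_outside.search("see abcd 101 abcd 101 here").groups())
# ('abcd', '101')
```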