Capturing repeated word sequence

In Perl, the regex below works fine for matching text patterns like a11a, g22g, x33x:
([a-z])(\d)\g2\g1
Now I want to match similar repeating groups, but with spaces between the words, like
abcd 101 abcd 101 (the entire string should be caught by a single regex, whether it sits in a one-line text or in a paragraph).
How can I do this? I tried the pattern below, but it doesn't work:
([a-zA-Z]*\s)([0-9]*\s)\g1\g2
#logic: letters followed by a space in group 1 and
#digits followed by a space in group 2
Also, please explain why the above regex fails to match the desired text.
EDIT
One more complication:
assume the pattern is something like
[words][space][numbers][space][words][space][numbers]
#assume all [numbers] and all [words] are the same
In the last [numbers] case no [space] follows, so how do I handle that? A capture group like
([0-9]*\s) certainly fails to match the last part when it is repeated, and
([0-9]*) would fail to match the middle part when it is repeated.

Your problem is that your regex expects a space at the end, because you have included the space in the captures.
Try instead:
([a-zA-Z]+)\s([0-9]+)\s\g1\s\g2
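A quick way to see the difference is to run both patterns against the sample input. The sketch below uses Python's re module as a stand-in for Perl (an assumption, not something the question uses); Python writes the backreferences as \1 and \2 rather than \g1 and \g2.

```python
import re

text = "abcd 101 abcd 101"

# Original attempt: the trailing space is inside both groups, so the
# second backreference needs a space after the final 101.
original = re.compile(r"([a-zA-Z]*\s)([0-9]*\s)\1\2")

# Suggested fix: the spaces sit outside the groups.
fixed = re.compile(r"([a-zA-Z]+)\s([0-9]+)\s\1\s\2")

print(original.search(text))        # None
print(original.search(text + " "))  # matches once a trailing space exists

# The fixed pattern also finds the sequence inside a longer line.
m = fixed.search("some text " + text + " more text")
print(m.group(0))                   # abcd 101 abcd 101
```

Because the groups now hold only the word and the number, the same backreferences work whether or not a space follows the last number.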

([0-9]*\s) captures 101 together with the trailing space,
so \g2 doesn't match the final 101, which has no space after it.
Update: a working regex for the input abcd 101 abcd 101 is ([a-zA-Z]*\s)([0-9]*)\s\g1\g2
Broken down against the input:
([a-zA-Z]*\s)   ([0-9]*)   \s      \g1          \g2
abcd+space      101        space   abcd+space   101

Related

I need to extract all words prior to 4th Space in a line

Good Day
I need to extract all words prior to the 4th space in a line.
Sample Data
Article Number Crt.DI No. Date
6ZZ 999 123 S 000000093 19.01.2021
Article description Replace DI No. Date
I have written an expression to extract what is between Date and Article:
(?<=Date)(.|\n)*(?=Article)
and the result is this:
6RU 999 123 S 000000093 19.01.2021
However, I need to retrieve only the characters before the 4th space:
6ZZ 999 123 S
This is a material number and this can be 13 or 14 characters before the 4th space.
Appreciate your support.
Sample Data
Article Number Crt.DI No. Date
6RU 999 123 S 000000093 19.01.2021
Article description Replace DI No. Date
(Please note: there are newlines in between. These are three consecutive lines, and each line ends with a line break.)
Regards,
Manjesh

You can use a capture group, and use \s to match a whitespace character, which includes a newline.
The capture group approach is more flexible if you want to match more than one whitespace character or newline after Date but your regex flavor does not support a quantifier in a lookbehind assertion.
\bDate\s+(\S+(?:\s+\S+){3})[\s\S]*?\bArticle\b
Or using lookarounds to get a match only.
(?<=\bDate\s)\S+(?:\s+\S+){3}(?=[\s\S]*?\bArticle\b)
The pattern matches:
(?<=\bDate\s) Positive lookbehind to assert Date to the left, followed by a whitespace char that can also match a newline
\S+ Match 1 or more non-whitespace chars
(?:\s+\S+){3} Repeat 3 times: 1 or more whitespace chars followed by 1 or more non-whitespace chars
(?= Positive lookahead to assert that what is to the right is
[\s\S]*? Match any character including newlines, as few times as possible
\bArticle\b Match the word Article
) Close the lookahead
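To check both patterns against the sample data, here is a minimal sketch using Python's re module (an assumption; the question does not say which engine is used). Both patterns happen to be valid there because the lookbehind has a fixed width.

```python
import re

sample = (
    "Article Number Crt.DI No. Date\n"
    "6ZZ 999 123 S 000000093 19.01.2021\n"
    "Article description Replace DI No. Date\n"
)

# Capture-group version: the material number ends up in group 1.
capture = re.compile(r"\bDate\s+(\S+(?:\s+\S+){3})[\s\S]*?\bArticle\b")

# Lookaround version: the material number is the whole match.
lookaround = re.compile(
    r"(?<=\bDate\s)\S+(?:\s+\S+){3}(?=[\s\S]*?\bArticle\b)"
)

print(capture.search(sample).group(1))     # 6ZZ 999 123 S
print(lookaround.search(sample).group(0))  # 6ZZ 999 123 S
```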

How to transpose pieces of data using Regular expression in Notepad++

I am very new to the world of regular expressions. I am trying to use Notepad++ using Regex for the following:
Input file is something like this and there are multiple such files:
Code:
abc
1
2
3
4
5
***END***
abc
6
7
8
9
***END***
abc
10
11
12
***END***
I need to be able to edit the text in all of these files so that all the files look like this:
Code:
abc
1,2,3,4,5
***END***
abc
6,7,8,9
***END***
abc
10,11,12
***END***
Also:
In some files the number of * characters around the word END varies. Is there a way to generalize the number of * so I don't have to worry about it?
There is some additional data before each abc that does not need to be transposed. How do I keep that data as it is while transposing the data between abc and ***END***?
Kindly help me. Your help is much appreciated!

Try the following find and replace, in regex mode:
Find: ^(\d+)\R(?!\*{1,}END\*{1,})
Replace: $1,
Here is an explanation of the regex pattern:
^ from the start of the line
(\d+) match AND capture a number
\R followed by a platform independent newline, which
(?!\*{1,}END\*{1,}) is NOT followed by END with one or more asterisks on each side
Note carefully the negative lookahead at the end of the pattern, which makes sure that we don't do the replacement on the final number in each section. Without this, the last number would bring the END marker onto the same line.
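The effect of the lookahead can be tried outside Notepad++ as well. This is a rough sketch in Python, which is an assumption on my part: Python's re module has no \R, so \r?\n stands in for it, and \*+ is used as an equivalent of \*{1,}.

```python
import re

sample = "abc\n1\n2\n3\n4\n5\n***END***\nabc\n6\n7\n8\n9\n*END*\n"

# Join a numeric line to the next one unless the next line is the END marker.
pattern = re.compile(r"^(\d+)\r?\n(?!\*+END\*+)", re.MULTILINE)

result = pattern.sub(r"\1,", sample)
print(result)
```

The last number of each section keeps its line break, so the END marker stays on its own line regardless of how many asterisks surround it.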

This will replace line breaks only between "abc" and "***END***", with any number of asterisks.
Ctrl+H
Find what: (?:(?<=^abc)\R|\G(?!^)).+\K\R(?!\*+END\*+)
Replace with: ,
CHECK Match case
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
(?: # non capture group
(?<=^abc) # positive look behind, make sure we have "abc" at the beginning of line before
\R # any kind of linebreak
| # OR
\G # restart from last match position
(?!^) # negative look ahead, make sure we are not at the beginning of line
) # end group
.+ # 1 or more any character but newline
\K # forget all we have seen until this position
\R # any kind of linebreak
(?!\*+END\*+) # negative lookahead, make sure we haven't ***END*** after
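The pattern above relies on \G and \K, which Python's built-in re module does not provide. As a rough analogue (a swapped-in technique, not this answer's method), the sketch below matches each whole block from abc to the END marker and joins its lines in a callback. The names BLOCK, join_block and transpose are made up for illustration, and the sketch assumes every block contains at least one data line.

```python
import re

# One block: an "abc" line, the data lines, then an END line with any
# number of asterisks on each side.
BLOCK = re.compile(r"^(abc\r?\n)(.*?)(\r?\n\*+END\*+)$", re.MULTILINE | re.DOTALL)

def join_block(match):
    # Join the data lines of one block with commas; keep both marker lines.
    lines = match.group(2).splitlines()
    return match.group(1) + ",".join(lines) + match.group(3)

def transpose(text):
    return BLOCK.sub(join_block, text)

sample = "header 12\n34\nabc\n1\n2\n3\n***END***\nabc\n6\n7\n*END*\n"
print(transpose(sample))
```

Lines before the first abc, including purely numeric ones, are left untouched because they are never part of a matched block.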

Regex to match phone and fax numbers for WebHarvy

Sample text
5950 S Willow Dr Ste 304
Greenwood Village, CO 80111
P (123) 456-7890
F (123) 456-7890
Get Directions
I tried the following, but it grabbed the first line of the address as well:
(.*)(?=(\n.*){2}$)
I also tried
P\s(\(\d{3})\)\s\d+-\d+
but it doesn't work in WebHarvy, even though it works on RegexStorm.
I'm looking for an expression that matches the phone and fax numbers in this text. I would be using the expression in WebHarvy:
https://www.webharvy.com/articles/regex.html
Thanks

Your second pattern is almost what you need. With P\s(\(\d{3})\)\s\d+-\d+, you captured only the (\(\d{3}) part into Group 1, while you need to capture the whole number.
I also suggest restricting the context: match P either as a whole word or as the first word on a line:
\bP\s*(\(\d{3}\)\s*\d+-\d+)
or
(?m)^\s*P\s*(\(\d{3}\)\s*\d+-\d+)
Here is what to pay attention to:
In the first pattern, \b matches a word boundary. In the second, (?m) makes ^ match the start of a line, and \s* then matches 0+ whitespaces. To match only horizontal whitespaces, replace \s* with [\p{Zs}\t]*.
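To see what Group 1 holds in each case, here is a small sketch in Python rather than WebHarvy's .NET engine (an assumption; these particular patterns behave the same in both). The fax pattern is a hypothetical variant that only swaps P for F; it is not part of the answer above.

```python
import re

sample = (
    "5950 S Willow Dr Ste 304\n"
    "Greenwood Village, CO 80111\n"
    "P (123) 456-7890\n"
    "F (123) 456-7890\n"
    "Get Directions"
)

# Pattern from the question: Group 1 closes before the literal ")".
attempt = re.search(r"P\s(\(\d{3})\)\s\d+-\d+", sample)
print(attempt.group(1))  # (123

# Line-anchored pattern: Group 1 wraps the whole number.
phone = re.search(r"(?m)^\s*P\s*(\(\d{3}\)\s*\d+-\d+)", sample)
print(phone.group(1))    # (123) 456-7890

# Hypothetical variant for the fax line: same pattern with F instead of P.
fax = re.search(r"(?m)^\s*F\s*(\(\d{3}\)\s*\d+-\d+)", sample)
print(fax.group(1))      # (123) 456-7890
```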

Regex to extract city names (.NET)

Looking for an expression to extract City Names from addresses. Trying to use this expression in WebHarvy which uses the .NET flavor of regex
Example address
1234 Savoy Dr Ste 123
New Houston, TX 77036-3320
or
1234 Savoy Dr Ste 510
Texas, TX 77036-3320
So the city name could be single or two words.
The expression I am trying is
(\w|\w\s\w)+(?=,\s\w{2})
When I try this on RegexStorm it seems to work fine, but when I use it in WebHarvy, it only captures the 'n' from the city name New Houston and the 'n' from Austin.
Where am I going wrong?

In WebHarvy, if a regex contains a capturing group, the group's contents are returned. Thus, you do not need a lookahead.
Another point is that you need to match 1 or more word chars, optionally followed by a chunk of whitespace and 1 or more word chars. Your regex contains a repeated capturing group whose contents are overwritten on each iteration, so once a match is found, Group 1 contains only the last character, n.
Use
(\w+(?:[^\S\r\n]+\w+)?),\s\w{2}
The [^\S\r\n]+ part matches any whitespace except CR and LF. You may use [\p{Zs}\t]+ to match any 1+ horizontal whitespaces.
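The overwritten group is easy to reproduce. The sketch below uses Python's re module instead of .NET (an assumption; like .NET's Group.Value, Python's group(1) reports the last iteration of a repeated group) and compares the original pattern with the suggested one.

```python
import re

# Original pattern: the group is repeated, so it keeps only its last iteration.
repeated = re.compile(r"(\w|\w\s\w)+(?=,\s\w{2})")

# Suggested pattern: one group around the whole city name.
fixed = re.compile(r"(\w+(?:[^\S\r\n]+\w+)?),\s\w{2}")

line = "New Houston, TX 77036-3320"
print(repeated.search(line).group(0))  # New Houston
print(repeated.search(line).group(1))  # n

addresses = (
    "1234 Savoy Dr Ste 123\nNew Houston, TX 77036-3320",
    "1234 Savoy Dr Ste 510\nTexas, TX 77036-3320",
)
cities = [fixed.search(address).group(1) for address in addresses]
print(cities)  # ['New Houston', 'Texas']
```

The whole match of the original pattern is correct, which is why it looks fine in a tester that highlights matches; only the group value is wrong.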

Regex deleting all lines except last occurrence of a pattern

I want to delete all lines that match a pattern except the last occurrence.
So assume we have this text:
test a 043
test a 123
test a 987
test b 565
The result I'm aiming for is this:
test a 987
test b 565
Is it possible to compare strings like that with just regex in vim? This also assumes the a and b in this example are dynamic, as in (test\s\w\s(.*)).

You will need a lookahead regex in vim for this:
:g/\v(^test \w+)(\_.*\1)@=/d
RegEx breakdown:
\v # very magic to avoid escapes
( # capturing group #1 start
^test \w+ # match any line starting with test \w+
) # capturing group #1 end
(\_.*\1)@= # positive lookahead to make sure there is at least one more occurrence of \1 below
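The same idea can be tried outside vim. The sketch below is a Python analogue, not the vim command itself: it deletes a line when its "test <word>" prefix appears at the start of a later line. The names DUPLICATE and keep_last are made up for illustration, and the added \b (so that test a does not count as a prefix of test ab) is my own assumption.

```python
import re

# A line is dropped when the same "test <word>" prefix starts a later line.
DUPLICATE = re.compile(r"^(test \w+)\b.*\r?\n(?=[\s\S]*?^\1\b)", re.MULTILINE)

def keep_last(text):
    # Remove every matching line except the last one for each prefix.
    return DUPLICATE.sub("", text)

sample = "test a 043\ntest a 123\ntest a 987\ntest b 565\n"
print(keep_last(sample), end="")
# test a 987
# test b 565
```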
