Matching percentages

Matching percentages - regex

I've been trying to enhance some code which determines whether a string is a valid percentage.
I decided that it was time to finally have a hundred problems, and learned regex.
I've been using this web regex tester to build my pattern.
I'm trying to do this rather loosely, such that valid percentages may be integer or decimal, positive or negative, include commas or not, and have any amount of whitespace at the beginning and end, as well as around the optional negative sign and the required percentage sign.
So far, I have \s*-?\s*\d+(,\d+)*(?:\.\d*)?\s*%\s*, which matches almost all of my test cases correctly:
0
0
0
% 0
- 0 %
20948.924780%
315%
2,456,875 %
2,104.86%
89fqyf0gp948y1-%ghghpq98fy92,.?><
, , , ,,,, 0,0,000,00,00,,,0
, , , ,,,, 0,0,000,00,00,,,0%
000000000,00000000000 %
000000000,00000000000,00000000000 %
000000000,00000000000,00000000000,00000000000.00000000000 %
These are not in any particular order, some pass and some fail, but only one is incorrect. In , , , ,,,, 0,0,000,00,00,,,0%, the last 0%\n is a match, but the whole line should be invalid. Start and end indicators do not seem to have the effect I had assumed, as a $ makes only the last example match, while a ^ at the beginning makes no matches register.
It may be something small, but as someone who only learned regex yesterday, it's far beyond my reach.
Thanks!

Start and end indicators do not seem to have the effect I had assumed, as a $ makes only the last example match, while a ^ at the beginning makes no matches register.
Those anchors should be working. However, whether they match at the beginning and end of each line or only at the beginning and end of the whole input depends on the regex engine and its options. On RegExr, you'd have to check the multiline option: http://regexr.com?380p9 - in code, use the m flag.
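A minimal sketch of the difference, using Python's re module purely for illustration (the question does not name a language; the simplified pattern ^\d+%$ and the sample text are assumptions, not the asker's data):

```python
import re

# Simplified stand-in for the percentage pattern, anchored at both ends.
PATTERN = r"^\d+%$"
text = "315%\nabc\n20%"

# Default mode: ^ and $ anchor to the whole input, so no line matches on its own.
without_flag = [m.group(0) for m in re.finditer(PATTERN, text)]

# Multiline mode: ^ and $ also match at the start and end of each line.
with_flag = [m.group(0) for m in re.finditer(PATTERN, text, re.MULTILINE)]

print(without_flag)
print(with_flag)
```

Without the flag nothing matches in this input; with it, the two percentage lines match individually.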

It could be done like this.
Edit: After realizing it's a per-line problem, this is the regex now.
Notes:
It uses multiline mode, like Bergi's answer.
Also, you CANNOT just use the \s whitespace class here.
Regardless of the mode, \s WILL match CR and LF if it can, which means
-
000,000000.22
%
will match because it satisfies all the conditions.
[^\S\r\n] means match whitespace except CRLF characters. It could be replaced with
[^\S\n] in the real world. The initial input on that tester used \r\n linebreaks.
Good Luck!!
# ^[^\S\r\n]*-?[^\S\r\n]*(?:(?:\.\d+)|(?:\d+(?:,\d+)*(?:\.\d*)?))[^\S\r\n]*%[^\S\r\n]*$
^ # BOL
[^\S\r\n]*
-? # optional -
[^\S\r\n]*
(?: # group
(?: \. \d+ ) # .number
| # or
(?: # group
\d+ # number
(?: , \d+ )* # optional many ,number
(?: \. \d* )? # optional . optional number
) # end group
) # end group
[^\S\r\n]*
% # %
[^\S\r\n]*
$ # EOL
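As a rough check of the pattern above, here is a sketch in Python (an assumption; the answer was written against an online tester). The names STRICT and LOOSE and the reduced sample list are illustrative only. LOOSE is the same shape written with plain \s, to show the cross-line match the notes warn about.

```python
import re

# The answer's pattern; [^\S\r\n] means "whitespace other than CR or LF".
STRICT = re.compile(
    r"^[^\S\r\n]*-?[^\S\r\n]*"
    r"(?:(?:\.\d+)|(?:\d+(?:,\d+)*(?:\.\d*)?))"
    r"[^\S\r\n]*%[^\S\r\n]*$",
    re.MULTILINE,
)
# Same idea with plain \s, which is allowed to cross line breaks.
LOOSE = re.compile(r"^\s*-?\s*\d+(?:,\d+)*(?:\.\d*)?\s*%\s*$", re.MULTILINE)

sample = "\n".join([
    "0",
    "% 0",
    "- 0 %",
    "20948.924780%",
    "2,456,875 %",
    ", , , ,,,, 0,0,000,00,00,,,0%",
    "000000000,00000000000 %",
])
strict_matches = [m.group(0) for m in STRICT.finditer(sample)]

# A number on one line and the percent sign on the next.
split_number = "000,000000.22\n%"
loose_hit = LOOSE.search(split_number)    # matches across the line break
strict_hit = STRICT.search(split_number)  # no match

print(strict_matches)
print(loose_hit is not None, strict_hit is not None)
```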

Related

Regex word can be optional but only if it matches the characters

The following pattern: (v[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2})(-[0-9]{1,2})?((-schema)?(-dev)?)((-schema)?(-dev)?) from http://regexr.com/ is meant to be used in a shell script with grep and matches the following strings (working example):
Hello I am a text and this is my v1.12.33-32 version
Hello I am a text and this is my v1.12.33-dev version
Hello I am a text and this is my v1.12.33-dev-schema version
Hello I am a text and this is my v1.12.33-schema version
Hello I am a text and this is my v1.12.33-3-schema version
and so forth
So I made the words schema and dev optional. They can be omitted or used in an arbitrary order. What I don't want is for this:
Hello I am a text and this is my v1.12.33-foo version
or Hello I am a text and this is my v1.12.33-asfs version
to match.
I want the option to be a bit more constrained. At the moment the Regex is still matching the stuff that...well actually matches.
This for example:
Hello I am a text and this is my v1.123.33
results in an empty string while this:
Hello I am a text and this is my v1.12.33-bla
still results in v1.12.33
Is this because of the grouping I made? So at least the fully matching groups will be taken for the returned match-string?

To match only the version string, disallow extra trailing tags, yet allow trailing unmatched text, you need a regex language that supports lookahead. Standard grep / egrep regexes do not support lookahead.
You have two options:
Since you seem to be relying on GNU grep anyway, you could use a Perl regex, such as
v[0-9]{1,2}(\.[0-9]{1,2}){2}(-[0-9]{1,2})?((-schema(-dev)?)?|(-dev(-schema)?)?)?(?!\S)
The negative lookahead at the end allows the match to appear at the end of the line, but also requires that if it does not end the line then the next character following the match must be whitespace (which is not itself included in the match).
You could give up on completely isolating the target text via -o, and instead allow the pattern to match the trailing context, too:
v[0-9]{1,2}(\.[0-9]{1,2}){2}(-[0-9]{1,2})?((-schema(-dev)?)?|(-dev(-schema)?)?)?(\s.*)?$
In this case, you could isolate the target text in a second step, by stripping off any tail beginning with whitespace.
Note that neither of these pays attention to text preceding the match. You have similar options for handling that portion as you do for handling the trailing portion.
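The lookahead variant can be tried outside grep as well. The following is a sketch in Python (an assumption; the answer targets GNU grep with Perl regexes), with find_version as a hypothetical helper name:

```python
import re

# The Perl-style pattern from the first option, unchanged.
VERSION = re.compile(
    r"v[0-9]{1,2}(\.[0-9]{1,2}){2}(-[0-9]{1,2})?"
    r"((-schema(-dev)?)?|(-dev(-schema)?)?)?(?!\S)"
)

def find_version(line):
    """Return the matched version string, or None if nothing qualifies."""
    match = VERSION.search(line)
    return match.group(0) if match else None

prefix = "Hello I am a text and this is my "
print(find_version(prefix + "v1.12.33-32 version"))
print(find_version(prefix + "v1.12.33-dev-schema version"))
print(find_version(prefix + "v1.12.33-foo version"))
```

The -foo line yields no match at all, because every shorter candidate the engine backtracks to is still followed by a non-whitespace character.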

The problem seems to be all the optional expressions lurking at the
edge (end).
You can solve that a few ways, but none are 100% because you'd need
more rules to control what matches.
It's not like you can say no - is allowed afterward; the engine will
backtrack to one of the range digits {1,2} to make a match.
What seems to work for now is passing on a whitespace end edge
or matching the dev/schema items.
(v[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2})(-[0-9]{1,2})?(?:(?!\S)|(-(schema|dev)(?:-(schema|dev))?))
Expanded
( # (1 start)
v [0-9]{1,2}
\. [0-9]{1,2}
\. [0-9]{1,2}
) # (1 end)
( - [0-9]{1,2} )? # (2)
(?:
(?! \S ) # Whitespace boundary
| # or,
( # (3 start)
-
( schema | dev ) # (4)
(?:
-
( schema | dev ) # (5)
)?
) # (3 end)
)
edit
If you want to avoid matching the same schema|dev word twice, just add
a negative assertion of group 4, before capture group 5 above.
(v[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2})(-[0-9]{1,2})?(?:(?!\S)|(-(schema|dev)(?:-(?!\4)(schema|dev))?))
Expanded
( # (1 start)
v [0-9]{1,2}
\. [0-9]{1,2}
\. [0-9]{1,2}
) # (1 end)
( - [0-9]{1,2} )? # (2)
(?:
(?! \S ) # Whitespace boundary
| # or,
( # (3 start)
-
( schema | dev ) # (4)
(?:
-
(?! \4 ) # Not same word twice
( schema | dev ) # (5)
)?
) # (3 end)
)
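A short sketch of how the negative assertion on group 4 behaves, written in Python as an assumption (the pattern is copied from above; tagged_version is a hypothetical helper):

```python
import re

# (?!\4) refuses a second tag that repeats whatever group 4 captured.
TAGGED = re.compile(
    r"(v[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2})(-[0-9]{1,2})?"
    r"(?:(?!\S)|(-(schema|dev)(?:-(?!\4)(schema|dev))?))"
)

def tagged_version(line):
    """Return the whole match, or None when the line does not qualify."""
    match = TAGGED.search(line)
    return match.group(0) if match else None

print(tagged_version("my v1.12.33-dev-schema version"))
print(tagged_version("my v1.12.33-dev-dev version"))
print(tagged_version("my v1.12.33-foo version"))
```

Note that a repeated tag does not make the whole match fail: the optional second tag is simply left out, so the -dev-dev line still yields v1.12.33-dev.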

Since regular expressions are open-ended, you need to specify with $ where you want the match to end, so you don't let the regex engine silently ignore trailing junk.
With only two tags in the optional set, I would just enumerate the 4 possibilities:
(v[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2})(-[0-9]{1,2})?(-schema|-dev|-dev-schema|-schema-dev)?$

My version:
grep --perl-regexp \
'\bv(?:\d{1,2}\.){2}\d{1,2}(?:\-\d{1,2})?(?:\-(?:schema|dev))?(?:\s|$)' \
path/to/file
Where
the first \b is a word boundary (you might want to make it stricter);
(?: ... ) expressions are non-capturing groups;
\s|$ is either a space character, or the end of line
The rest is just refactored for simplicity.
The expression allows only schema, or dev at the "end".

Selecting if no delimiter, and no selecting if it is

I have a string like "smth 2sg. smth", and sometimes "smth 2sg.| smth.".
What pattern should I use to select "2sg." if the string does not contain "|", and select nothing if the string does contain "|"?

I have 2 methods. They both use something called a Negative Lookahead, which is used like so:
(?!data)
When this is inserted into a RegEx, it means the RegEx will not match if data appears at that position.
More info on the Negative Lookahead can be found here
Method 1 (shorter)
Just capture 2sg.
Try this RegEx:
(\dsg\.)(?!\|)
Use (\d+... if the number could be longer than 1 digit
Live Demo on RegExr
How it works:
( # To capture (2sg.)
\d # Digit (2)
sg # (sg)
\. # . (Dot)
)
(?!\|) # Do not match if followed by |
Method 2 (longer but safer)
Match the whole string and capture 2sg.
Try this RegEx:
^\w+\s*(\dsg\.)(?!\|)\s*\w+\.?$
Use (\d+sg... if the number could be longer than 1 digit
Live Demo on RegExr
How it works:
^ # String starts with ...
\w+\s* # Letters then Optional Whitespace (smth )
( # To capture (2sg.)
\d # Digit (2)
sg # (sg)
\. # . (Dot)
)
(?!\|) # Do not match if followed by |
\s* # Optional Whitespace
\w+ # Letters (smth)
\.? # Optional . (Dot)
$ # ... String ends
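Both methods can be compared side by side. This is a sketch in Python (an assumption, since the question names no language); SHORT, FULL and capture are illustrative names, and the third sample string is made up to show where the two methods differ:

```python
import re

SHORT = re.compile(r"(\dsg\.)(?!\|)")                     # Method 1
FULL = re.compile(r"^\w+\s*(\dsg\.)(?!\|)\s*\w+\.?$")     # Method 2

def capture(pattern, text):
    """Return capture group 1, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None

for text in ("smth 2sg. smth", "smth 2sg.| smth.", "smth 2sg. smth | other"):
    print(repr(text), capture(SHORT, text), capture(FULL, text))
```

Method 1 only checks the character directly after the dot, so a | elsewhere in the string goes unnoticed; Method 2 rejects that string because it has to account for every character.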

Something like this might work for you:
(\d*sg\.)(?!\|)
It matches an optional number followed by sg. that is not followed by |.

^.*(\dsg\.)[^\|]*$
Explanation:
^ : starts from the beginning of the string
.* : accepts any number of initial characters (even nothing)
(\dsg\.) : looks for the group of digit + "sg."
[^\|]* : considers any number of following characters except for |
$ : stops at the end of the string
You can now select your string by getting the first group from your regex

Try:
(\d+sg\.(?!\|))
Depending on your programming environment, the syntax may be a little different, but this will get your result.
For more information see Negative Lookahead

How does this regex for FQDNs (excluding.arpa) work?

I am trying to understand how regex works. I understand it little by little. However, I don't understand this one completely. It's basically a regex for fully qualified domain names but a requirement is that the ending can't be .arpa.
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)
https://regex101.com/r/hU6tP0/3
This doesn't match google.uk. If I change it to:
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{1,63}[^.arpa]$)
It works again.
But this works as well
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}$)
Here is my thought process for
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)
I see it as this
(?=
Is a positive look ahead (Can someone explain to me what this actually means?) As I understand it now, it just means that the string needs to match the regex.
^.{4,253}$)
Match all characters but it needs to be between 4 and 253 characters long.
(^([a-zA-Z0-9]{1,63}\.)
Start a capture group and make another capture group within. This capture group says that every non special character can be written 1 to 63 times or till the . is written.
+
The previous capture group can be repeated indefinitely, but it should always end with a .. This way the next capture group is started.
[a-zA-Z]{2,63}
Then as many times as you want you can write a to z with upper, but it needs to be between 2 and 63.
[^.arpa]$)
The last characters can't be .arpa.
Can someone tell me where I am going wrong?

This doesn't do what you think it does:
[^.arpa]
All that says is 'ends with one character that isn't ., a, r or p' - it's a negated character class.
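A quick sketch of that behaviour in Python (an assumption; the question used regex101). ORIGINAL is the asker's pattern, and the names ONE_CHAR and accepted as well as the domain example.media are illustrative:

```python
import re

ONE_CHAR = re.compile(r"[^.arpa]")  # exactly one character, not ., a, r or p
ORIGINAL = re.compile(
    r"(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)"
)

def accepted(name):
    """True when the asker's original pattern matches the name."""
    return ORIGINAL.search(name) is not None

print(ONE_CHAR.fullmatch("x") is not None)  # one allowed character
print(ONE_CHAR.fullmatch("a") is not None)  # 'a' is excluded by the class
for name in ("google.uk", "google.com", "example.media", "domain.arpa"):
    print(name, accepted(name))
```

google.uk fails because {2,63} needs two letters and the class needs a third character; example.media fails only because its last letter is a.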
You might be thinking of a negative lookahead assertion:
(?!\.arpa)$
But if you're trying to compound multiple criteria in a regex, I'd suggest you're probably using the wrong tool for the job. It ends up complicated and hard to debug, thanks to greedy/non-greedy matching, etc.
Positive and negative lookaheads assert what does (or does not) follow the current position, without consuming it. That can have some unexpected outcomes if the rest of the pattern matches variable widths, because the regex engine will backtrack until it finds something that matches.
A simpler example:
([\w.]+)(?!arpa)$
Applied to:
www.test.arpa
Will it match? What's in the group?
... it will match, because [\w\.]+ will consume all of it, and then the lookahead won't "see" anything.
If you use:
([\w]+)\.(?!arpa)
Instead though - you'll capture www, but you won't match test (e.g. with the g flag), because the www doesn't have .arpa after it, but the test does.
https://regex101.com/r/hU6tP0/5
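The two patterns above can be checked with a few lines of Python (an assumption; re.findall plays the role of the g flag here):

```python
import re

text = "www.test.arpa"

# [\w.]+ swallows the whole string, so the lookahead runs at the very end
# and has nothing left to reject.
greedy = re.search(r"([\w.]+)(?!arpa)$", text)

# Each label must be followed by a dot that is not followed by "arpa".
labels = re.findall(r"(\w+)\.(?!arpa)", text)

print(greedy.group(1))
print(labels)
```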
It really does get complicated using negative assertions in a pattern as a result. I'd suggest simply not doing so, and applying two separate tests. It's hard for you to figure out, and it's hard for a future maintenance programmer too!

This is an analysis of your regex:
(?=^.{4,253}$) # force min length: 4 chars, max length: 253 chars
( # Capturing Group 1 (CG1) - not needed
^ # Match start of the string
( # CG2 (can be a non capturing group '(?:...)')
[a-zA-Z0-9]{1,63} # any sequence of letters and numbers with length between 1 and 63
\. # a literal dot
)+ # CLOSE CG2
[a-zA-Z]{1,63} # any letter sequence with length between 1 to 63
[^.arpa] # a negated char class: any char that is not a "literal" '.','a','r','p' (last 'a' is redundant)
$ # end of the string
) # CLOSE CG1
To prevent the string from ending in .arpa you need to use a negative lookahead (?!...), so modify it like this:
(?=^.{4,253}$)(?!.*\.arpa$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}$)
An online demo
Update:
I've reworked the regex to streamline it (I've also incorporated Sobrique's suggestion, with one important detail added):
/^(?=.{4,253}$)([a-z0-9]{1,63}[.])+(?!arpa$)[a-z]{2,63}$/i
Compact version online demo
Legend
/ # js regex delimiter
^ # start of the string
(?=.{4,253}$) # force min length: 4 chars, max length: 253 chars
(?: # Non capturing group 1 (NCG1)
[a-z0-9]{1,63} # any letter or digit in a sequence with length from 1 to 63 chars
[.] # a literal dot '.' (more readable than \.)
)+ # CLOSE NCG1 - repeat its content one or more time
(?!arpa$) # force that after the last literal dot '.' the string does not end with 'arpa' (I've added '$' to Sobrique's suggestion; otherwise it would also reject '.arpanet')
[a-z]{2,63} # a sequence of letters with length from 2 to 63
$ # end of the string
/i # Close the regex delimiter and add the case-insensitive flag, so [a-z] also matches [A-Z] and vice versa
var re = /^(?=.{4,253}$)([a-z0-9]{1,63}[.])+(?!arpa$)[a-z]{2,63}$/i;
var tests = ['google.uk','domain.arpa','domain.arpa2','another.domain.arpa.net','domain.arpanet'];
var t; // current test string
while(t = tests.pop()) {
document.getElementById("r").innerHTML += '"' + t + '"<br/>';
document.getElementById("r").innerHTML += 'Valid domain? ' + ( (t.match(re)) ? '<font color="green">YES</font>' : '<font color="red">NO</font>') + '<br/><br/>';
}
<div id="r"/>
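For readers without a browser at hand, the same test list can be run in Python. This is a sketch that assumes the pattern ports unchanged to Python's re module; FQDN and results are illustrative names:

```python
import re

# Same pattern as the JavaScript literal above, with the i flag.
FQDN = re.compile(
    r"^(?=.{4,253}$)([a-z0-9]{1,63}[.])+(?!arpa$)[a-z]{2,63}$",
    re.IGNORECASE,
)

tests = [
    "google.uk",
    "domain.arpa",
    "domain.arpa2",
    "another.domain.arpa.net",
    "domain.arpanet",
]
results = {name: FQDN.search(name) is not None for name in tests}

for name, valid in results.items():
    print(name, "YES" if valid else "NO")
```

domain.arpa2 is rejected by the final [a-z]{2,63}, not by the lookahead, since the last label may contain letters only.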

Match double hyphens in comments of malformed XML

I need to parse XML files that do not conform to the "no double hyphens in comments" rule, which makes MSXML complain. I am looking for a way of deleting the offending hyphens.
I am using StringRegExpReplace(). I attempted the following regular expressions:
<!--(.*)--> : correctly gets comments
<!--(-*)--> : fails to be a correct regex (also tried escaping and using \x2D)
Given the right pattern, I would call:
StringRegExpReplace($xml_string,$correct_pattern,"") ;replace with nothing
How to match remaining extra hyphens within an XML comment, while leaving the remaining text alone?

You can use this pattern:
(?|\G(?!\A)(?|-{2,}+([^->][^-]*)|(-[^-]+)|-+(?=-->)|-->[^<]*(*SKIP)(*FAIL))|[^<]*<+(?>[^<]+<+)*?(?:!--\K|[^<]*\z\K(*ACCEPT))(?|-*+([^->][^-]*)|-+(?=-->)|-?+([^-]+)|-->[^<]*(*SKIP)(*FAIL)()))
details:
(?|
\G(?!\A) # contiguous to the previous match (inside a comment)
(?|
-{2,}+([^->][^-]*) # duplicate hyphens, not part of the closing sequence
|
(-[^-]+) # preserve isolated hyphens
|
-+ (?=-->) # hyphens before closing sequence, break contiguity
|
-->[^<]* # closing sequence, go to next <
(*SKIP)(*FAIL) # break contiguity
)
|
[^<]*<+ # reach the next < (outside comment)
(?> [^<]+ <+ )*? # next < until !-- or the end of the string
(?: !-- \K | [^<]*\z\K (*ACCEPT) ) # new comment or end of the string
(?|
-*+ ([^->][^-]*) # possible hyphens not followed by >
|
-+ (?=-->) # hyphens before closing sequence, break contiguity
|
-?+ ([^-]+) # one hyphen followed by >
|
-->[^<]* # closing sequence, go to next <
(*SKIP)(*FAIL) () # break contiguity (note: "()" avoids a mysterious bug
) # in regex101, you can remove it)
)
With this replacement: \1
online demo
The \G feature ensures that matches are consecutive.
Two ways are used to break the contiguity:
a lookahead (?=-->)
the backtracking control verbs (*SKIP)(*FAIL), which force the pattern to fail and prevent all previously matched characters from being retried.
So when contiguity is broken, or at the beginning, the first main branch will fail (because of the \G anchor) and the second branch will be used.
\K removes everything on its left from the match result.
(*ACCEPT) makes the pattern succeed unconditionally.
This pattern makes heavy use of the branch reset feature (?|...(..)...|...(..)...|...), so all capturing groups have the same number (in other words there is only one group, group 1).
Note: even though this pattern is long, it needs few steps to obtain a match. The impact of non-greedy quantifiers is reduced as much as possible, and the alternatives are ordered to be as efficient as possible. One of the goals is to reduce the total number of matches needed to process a string.

(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)
matches -- (or ---- etc.) only between <!-- and -->. You need to set the /s parameter to allow the dot to match newlines.
Explanation:
(?<!<!) # Assert that we're not right at the start of a comment
--+ # Match two or more dashes --
(?= # only if the following can be matched further onwards:
(?!-?>) # First, make sure we're not at the end of the comment.
(?: # Then match the following group
(?!-->) # which must not contain -->
. # but may contain any character
)* # any number of times
--> # as long as --> follows.
) # End of lookahead assertion.
Test it live on regex101.com.
I suppose the correct AutoIt syntax would be
StringRegExpReplace($xml_string, "(?s)(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)", "")
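The same substitution can be sketched in Python (an assumption; the answer targets AutoIt), with re.DOTALL standing in for (?s). The name strip_comment_hyphens and the sample XML are illustrative, and only a simple comment is exercised here; hyphens directly adjacent to the comment delimiters are not tested.

```python
import re

EXTRA_HYPHENS = re.compile(r"(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)", re.DOTALL)

def strip_comment_hyphens(xml):
    """Delete runs of two or more hyphens that are followed later by -->."""
    return EXTRA_HYPHENS.sub("", xml)

xml = "<root><!-- first -- second\n---- third --></root>"
cleaned = strip_comment_hyphens(xml)
print(cleaned)
```

The opening <!-- and the closing --> are left alone; only the hyphen runs between them are removed.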

Can a Regex Return the Number of the Line where the Match is Found?

In a text editor, I want to replace a given word with the number of the line on which this word is found. Is this possible with Regex?

Recursion, Self-Referencing Group (Qtax trick), Reverse Qtax or Balancing Groups
Introduction
The idea of adding a list of integers to the bottom of the input is similar to a famous database hack (nothing to do with regex) where one joins to a table of integers. My original answer used the Qtax trick. The current answers use either Recursion, the Qtax trick (straight or in a reversed variation), or Balancing Groups.
Yes, it is possible... With some caveats and regex trickery.
The solutions in this answer are meant as a vehicle to demonstrate some regex syntax more than practical answers to be implemented.
At the end of your file, we will paste a list of numbers preceded with a unique delimiter. For this experiment, the appended string is :1:2:3:4:5:6:7 This is a similar technique to a famous database hack that uses a table of integers.
For the first two solutions, we need an editor that uses a regex flavor that allows recursion (solution 1) or self-referencing capture groups (solution 2 and its reversed variation). Two come to mind: Notepad++ and EditPad Pro. For the third solution, we need an editor that supports balancing groups. That probably limits us to EditPad Pro or Visual Studio 2013+.
Input file:
Let's say we are searching for pig and want to replace it with the line number.
We'll use this as input:
my cat
dog
my pig
my cow
my mouse
:1:2:3:4:5:6:7
First Solution: Recursion
Supported languages: Apart from the text editors mentioned above (Notepad++ and EditPad Pro), this solution should work in languages that use PCRE (PHP, R, Delphi), in Perl, and in Python using Matthew Barnett's regex module (untested).
The recursive structure lives in a lookahead, and is optional. Its job is to balance lines that don't contain pig, on the left, with numbers, on the right: think of it as balancing a nested construct like {{{ }}}... Except that on the left we have the no-match lines, and on the right we have the numbers. The point is that when we exit the lookahead, we know how many lines were skipped.
Search:
(?sm)(?=.*?pig)(?=((?:^(?:(?!pig)[^\r\n])*(?:\r?\n))(?:(?1)|[^:]+)(:\d+))?).*?\Kpig(?=.*?(?(2)\2):(\d+))
Free-Spacing Version with Comments:
(?xsm) # free-spacing mode, multi-line
(?=.*?pig) # fail right away if pig isn't there
(?= # The Recursive Structure Lives In This Lookahead
( # Group 1
(?: # skip one line
^
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
)
(?:(?1)|[^:]+) # recurse Group 1 OR match all chars that are not a :
(:\d+) # match digits
)? # End Group
) # End lookahead.
.*?\Kpig # get to pig
(?=.*?(?(2)\2):(\d+)) # Lookahead: capture the next digits
Replace: \3
In the demo, see the substitutions at the bottom. You can play with the letters on the first two lines (delete a space to make pig) to move the first occurrence of pig to a different line, and see how that affects the results.
Second Solution: Group that Refers to Itself ("Qtax Trick")
Supported languages: Apart from the text editors mentioned above (Notepad++ and EditPad Pro), this solution should work in languages that use PCRE (PHP, R, Delphi), in Perl, and in Python using Matthew Barnett's regex module (untested). The solution is easy to adapt to .NET by converting the \K to a lookahead and the possessive quantifier to an atomic group (see the .NET Version a few lines below.)
Search:
(?sm)(?=.*?pig)(?:(?:^(?:(?!pig)[^\r\n])*(?:\r?\n))(?=[^:]+((?(1)\1):\d+)))*+.*?\Kpig(?=[^:]+(?(1)\1):(\d+))
.NET version: Back to the Future
.NET does not have \K. In its place, we use a "back to the future" lookbehind (a lookbehind that contains a lookahead that skips ahead of the match). Also, we need to use an atomic group instead of a possessive quantifier.
(?sm)(?<=(?=.*?pig)(?=(?>(?:^(?:(?!pig)[^\r\n])*(?:\r?\n))(?=[^:]+((?(1)\1):\d+)))*).*)pig(?=[^:]+(?(1)\1):(\d+))
Free-Spacing Version with Comments (Perl / PCRE Version):
(?xsm) # free-spacing mode, multi-line
(?=.*?pig) # lookahead: if pig is not there, fail right away to save the effort
(?: # start counter-line-skipper (lines that don't include pig)
(?: # skip one line
^ #
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
)
# for each line skipped, let Group 1 match an ever increasing portion of the numbers string at the bottom
(?= # lookahead
[^:]+ # skip all chars that are not colons
( # start Group 1
(?(1)\1) # match Group 1 if set
:\d+ # match a colon and some digits
) # end Group 1
) # end lookahead
)*+ # end counter-line-skipper: zero or more times
.*? # match
\K # drop everything we've matched so far
pig # match pig (this is the match!)
(?=[^:]+(?(1)\1):(\d+)) # capture the next number to Group 2
Replace:
\2
Output:
my cat
dog
my 3
my cow
my mouse
:1:2:3:4:5:6:7
In the demo, see the substitutions at the bottom. You can play with the letters on the first two lines (delete a space to make pig) to move the first occurrence of pig to a different line, and see how that affects the results.
Choice of Delimiter for Digits
In our example, the delimiter : for the string of digits is rather common, and could happen elsewhere. We can invent a UNIQUE_DELIMITER and tweak the expression slightly. But the following optimization is even more efficient and lets us keep the :
Optimization on Second Solution: Reverse String of Digits
Instead of pasting our digits in order, it may be to our benefit to use them in the reverse order: :7:6:5:4:3:2:1
In our lookaheads, this allows us to get down to the bottom of the input with a simple .*, and to start backtracking from there. Since we know we're at the end of the string, we don't have to worry about the :digits being part of another section of the string. Here's how to do it.
Input:
my cat pi g
dog p ig
my pig
my cow
my mouse
:7:6:5:4:3:2:1
Search:
(?xsm) # free-spacing mode, multi-line
(?=.*?pig) # lookahead: if pig is not there, fail right away to save the effort
(?: # start counter-line-skipper (lines that don't include pig)
(?: # skip one line that doesn't have pig
^ #
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
)
# Group 1 matches increasing portion of the numbers string at the bottom
(?= # lookahead
.* # get to the end of the input
( # start Group 1
:\d+ # match a colon and some digits
(?(1)\1) # match Group 1 if set
) # end Group 1
) # end lookahead
)*+ # end counter-line-skipper: zero or more times
.*? # match
\K # drop match so far
pig # match pig (this is the match!)
(?=.*(\d+)(?(1)\1)) # capture the next number to Group 2
Replace: \2
See the substitutions in the demo.
Third Solution: Balancing Groups
This solution is specific to .NET.
Search:
(?m)(?<=\A(?<c>^(?:(?!pig)[^\r\n])*(?:\r?\n))*.*?)pig(?=[^:]+(?(c)(?<-c>:\d+)*):(\d+))
Free-Spacing Version with Comments:
(?xm) # free-spacing, multi-line
(?<= # lookbehind
\A #
(?<c> # skip one line that doesn't have pig
# The length of Group c Captures will serve as a counter
^ # beginning of line
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
) # end skipper
* # repeat skipper
.*? # we're on the pig line: lazily match chars before pig
) # end lookbehind
pig # match pig: this is the match
(?= # lookahead
[^:]+ # get to the digits
(?(c) # if Group c has been set
(?<-c>:\d+) # decrement c while we match a group of digits
* # repeat: this will only repeat as long as the length of Group c captures > 0
) # end if Group c has been set
:(\d+) # Match the next digit group, capture the digits
) # end lookahead
Replace: $1
Reference
Qtax trick
On Which Line Number Was the Regex Match Found?

Because you didn't specify which text editor, in vim it would be:
:%s/searched_word/\=printf('%-4d', line('.'))/g (read more)
But as somebody mentioned it's not a question for SO but rather Super User ;)

I don't know of an editor that does that short of extending an editor that allows arbitrary extensions.
You could easily use perl to do the task, though.
perl -i.bak -pe"s/word/$./eg" file
Or if you want to use wildcards,
perl -MFile::DosGlob=glob -i.bak -pe"BEGIN { @ARGV = map glob($_), @ARGV } s/word/$./eg" *.txt
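For comparison, a rough Python counterpart computes the line number inside a replacement callback. This is a sketch, not one of the methods from the answers; the function name replace_with_line_number and the sample text are illustrative assumptions.

```python
import re

def replace_with_line_number(text, word):
    """Replace every occurrence of word with its 1-based line number."""
    pattern = re.compile(re.escape(word))

    def line_of(match):
        # Count the newlines before the match to find its line.
        return str(text.count("\n", 0, match.start()) + 1)

    return pattern.sub(line_of, text)

sample = "my cat\ndog\nmy pig\nmy cow\nmy mouse"
result = replace_with_line_number(sample, "pig")
print(result)
```

Counting newlines on every match is quadratic in the worst case, which is acceptable for editor-sized files but worth knowing for large inputs.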
