Regex named capture group with multiple values - regex

I seem to be having a tough regex week. Anyone that can save me from throwing my laptop out the window gets a virtual beer. I have some data in the form of:
... f=something group="First Group,Group2" foo=val ...
where the number of groups can vary. I need to capture each group entry to a named capture. This builds on a previous post; the difference here is that I don't have a constant to key off of within the values (i.e. ID-1-1, ID-2-2 allowed me to say ID-\d+-\d+, whereas these values could be pretty much anything). I've been trying a ton of stuff, but I tend to get matches that are far too greedy, or I (often) get these 2 values:
First Group
First Group,Group2
What I need is:
First Group
Group2
...
I'm currently trying regex such as this where I'm trying to anchor to the group=" portion, and not exceed the ending ":
(?:(?:group=\")|(?:\"))(?<group>(?:(.+)+?)
Hopefully someone can make my day a lot better...

Here's the PHP solution. Once again, regex doesn't like capturing multiple values, so we need to break it into two searches: one extracts the group value, and the next extracts each value from the group.
$test = 'f=something group="First Group,Group2" foo=val';
// Step 1: capture everything between the quotes (\x22) after group=
$re = '/(?:group=)?\x22(?<group>(?:[^\x2C]+\x2C*)+)\x22/';
$_ = null;
if (preg_match($re, $test, $_)) {
    echo "Group Contents: ".$_['group']."\r\n";
    // Step 2: capture each comma-separated (\x2C) value inside the group
    $__ = null;
    $re = '/(?:^|\x2C)(?<value>(?:[^\x2C]+)+)/';
    if (preg_match_all($re, $_['group'], $__))
        echo "Group Values: ".print_r($__['value'], true);
}
Should be pretty easy to port to another language; just extract the regexes and manage them the way you normally would.
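For anyone porting this, here is a minimal Python sketch of the same two-step idea: one search isolates the quoted group value, and a second pass separates the entries. The function name extract_group_values is made up for illustration, and a plain comma split stands in for the second regex.

```python
import re

def extract_group_values(line):
    """Return the comma-separated entries inside group="..." (hypothetical helper)."""
    # Step 1: isolate everything between the quotes that follow group=
    match = re.search(r'group="(?P<group>[^"]*)"', line)
    if match is None:
        return []
    # Step 2: split the captured value on commas, dropping empty entries
    return [value for value in match.group("group").split(",") if value]

print(extract_group_values('f=something group="First Group,Group2" foo=val'))
```

The split assumes the values themselves never contain a comma or a double quote, which matches the sample data but may not hold for every input.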

Related

Regex Multiple rows [duplicate]

I'm trying to get the list of all digits preceding a hyphen in a given string (let's say in cell A1), using a Google Sheets regex formula:
=REGEXEXTRACT(A1, "\d-")
My problem is that it only returns the first match... how can I get all matches?
Example text:
"A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq"
My formula returns 1-, whereas I want to get 1-2-2-2-2-2-2-2-2-2-3-3- (either as an array or concatenated text).
I know I could use a script or another function (like SPLIT) to achieve the desired result, but what I really want to know is how I could get a re2 regular expression to return such multiple matches in a "REGEX.*" Google Sheets formula.
Something like the "global - Don't return after first match" option on regex101.com
I've also tried removing the undesired text with REGEXREPLACE, with no success either (I couldn't get rid of other digits not preceding a hyphen).
Any help appreciated!
Thanks :)
You can actually do this in a single formula using regexreplace to surround all the values with a capture group instead of replacing the text:
=join("",REGEXEXTRACT(A1,REGEXREPLACE(A1,"(\d-)","($1)")))
Basically, it surrounds every instance of \d- with a capture group, and REGEXEXTRACT then neatly returns all the captures. If you want the results in a single cell, JOIN packs them back into one string.
You may create your own custom function in the Script Editor:
function ExtractAllRegex(input, pattern, groupId) {
  return [Array.from(input.matchAll(new RegExp(pattern, 'g')), x => x[groupId])];
}
Or, if you need to return all matches in a single cell joined with some separator:
function ExtractAllRegex(input, pattern, groupId, separator) {
  return Array.from(input.matchAll(new RegExp(pattern, 'g')), x => x[groupId]).join(separator);
}
Then, just call it like =ExtractAllRegex(A1, "\d-", 0, ", ").
Description:
input - current cell value
pattern - regex pattern
groupId - Capturing group ID you want to extract
separator - text used to join the matched results.
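Outside Sheets, the same "return every match" behaviour can be sketched in Python. This is only an analogue of the custom function above, not Apps Script: the name extract_all_regex and its defaults are assumptions, and Python's re engine is not identical to the one the script uses.

```python
import re

def extract_all_regex(text, pattern, group_id=0, separator=None):
    """Collect one group from every match; join them if a separator is given."""
    matches = [m.group(group_id) for m in re.finditer(pattern, text)]
    return matches if separator is None else separator.join(matches)

print(extract_all_regex("A1-Nutrition;A2-ActPhysiq;H3-Endocrino", r"\d-", 0, ", "))
```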
Edit
I came up with more general solution:
=regexreplace(A1,"(.)?(\d-)|(.)","$2")
It replaces any text except the second group match (\d-) with just the second group $2.
"(.)?(\d-)|(.)"
1 2 3
Groups are in ()
---------------------------------------
"$2" -- means return the group number 2
Learn regular expressions: https://regexone.com
Try this formula:
=regexreplace(regexreplace(A1,"[^\-0-9]",""),"(\d-)|(.)","$1")
It will handle string like this:
"A1-Nutrition;A2-ActPhysiq;A2-BioM---eta;A2-PH3-Généti***566*9q"
with output:
1-2-2-2-3-
I wasn't able to get the accepted answer to work for my case. I'd like to do it that way, but needed a quick solution and went with the following:
Input:
1111 days, 123 hours 1234 minutes and 121 seconds
Expected output:
1111 123 1234 121
Formula:
=split(REGEXREPLACE(C26,"[a-z,]"," ")," ")
The shortest possible regex:
=regexreplace(A1,".?(\d-)|.", "$1")
Which returns 1-2-2-2-2-2-2-2-2-2-3-3- for "A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq".
Explanation of regex:
.? -- optional character
(\d-) -- capture group 1 with a digit followed by a dash (use (\d+-) to allow multiple digits)
| -- logical or
. -- any character
the replacement "$1" uses just the capture group 1, and discards anything else
Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex
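To try the "replace everything except the capture" idea outside Sheets, here is a rough Python equivalent. It is a sketch under the assumption that Python's re resolves this pattern the same way RE2 does in Sheets; the names SAMPLE and keep_only_matches are invented for the example.

```python
import re

SAMPLE = ("A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;"
          "H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;"
          "H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq")

def keep_only_matches(text):
    # Either "optional char + digit-dash" (keep group 1) or any single char (keep nothing)
    return re.sub(r".?(\d-)|.", lambda m: m.group(1) or "", text)

print(keep_only_matches(SAMPLE))
```

The callback returns an empty string whenever group 1 did not take part in the match, which is what "$1" does in the formula for the junk characters.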
This seems to work and I have tried to verify it.
The logic is
(1) Replace letter followed by hyphen with nothing
(2) Replace any digit not followed by a hyphen with nothing
(3) Replace everything which is not a digit or hyphen with nothing
=regexreplace(A1,"[a-zA-Z]-|[0-9][^-]|[a-zA-Z;/é]","")
Result
1-2-2-2-2-2-2-2-2-2-3-3-
Analysis
I had to step through these procedurally to convince myself that this was correct. According to this reference when there are alternatives separated by the pipe symbol, regex should match them in order left-to-right. The above formula doesn't work properly unless rule 1 comes first (otherwise it reduces all characters except a digit or hyphen to null before rule (1) can come into play and you get an extra hyphen from "Patho-jour").
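The ordering claim can be checked with a quick Python sketch. It assumes Python's re picks alternatives left to right in the same way as the engine behind REGEXREPLACE; the constant names are made up for the example.

```python
import re

SAMPLE = ("A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;"
          "H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;"
          "H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq")

# Rule 1 first: "letter followed by hyphen" gets a chance before single letters are removed
RULES_IN_ORDER = r"[a-zA-Z]-|[0-9][^-]|[a-zA-Z;/é]"
# Same rules with the single-character rule moved to the front
RULES_REORDERED = r"[a-zA-Z;/é]|[a-zA-Z]-|[0-9][^-]"

in_order = re.sub(RULES_IN_ORDER, "", SAMPLE)
reordered = re.sub(RULES_REORDERED, "", SAMPLE)
print(in_order)
print(reordered)  # the hyphen from "Patho-jour" survives here
```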
Here are some examples of how I think it must deal with the text
The solution that builds capture groups with REGEXREPLACE and then runs REGEXEXTRACT works here too, but there is a catch.
=join("",REGEXEXTRACT(A1,REGEXREPLACE(A1,"(\d-)","($1)")))
If the cell you are extracting from contains special regex characters such as a parenthesis "(" or a question mark "?", this solution won't work.
In my case, I was trying to list all the "variable texts" contained in the cell, which were written like this: "{example_name}". But the full content of the cell had special characters that broke the regex formula. Once I removed those special characters, I could list all the captured groups as the solution does.
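The catch is easy to reproduce outside Sheets. The Python sketch below mimics the formula (wrap each match in parentheses, then use the whole text as the pattern); the function name is hypothetical and Python's re stands in for RE2.

```python
import re

def extract_via_generated_pattern(text, inner=r"(\d-)"):
    """Mimic REGEXEXTRACT(A1, REGEXREPLACE(A1, inner, "($1)"))."""
    # The cell text itself becomes the pattern, with each match wrapped in a group
    generated = re.sub(inner, r"(\1)", text)
    match = re.search(generated, text)
    return list(match.groups()) if match else None

print(extract_via_generated_pattern("A1-Nutrition;A2-ActPhysiq"))        # works
print(extract_via_generated_pattern("A1-Nutrition (new);A2-ActPhysiq"))  # fails
```

In the second call the literal "(new)" is read as a capture group that matches "new" without the parentheses, so the generated pattern no longer matches the text it came from.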
There are two general ('Excel' / 'native' / non-Apps Script) solutions to return an array of regex matches in the style of REGEXEXTRACT:
Method 1)
insert a delimiter around matches, remove junk, and call SPLIT
Regexes work by iterating over the string from left to right, and 'consuming'. If we are careful to consume junk values, we can throw them away.
(This gets around the problem faced by the currently accepted solution, which is that as Carlos Eduardo Oliveira mentions, it will obviously fail if the corpus text contains special regex characters.)
First we pick a delimiter, which must not already exist in the text. The proper way to do this is to parse the text to temporarily replace our delimiter with a "temporary delimiter", like if we were going to use commas "," we'd first replace all existing commas with something like "<<QUOTED-COMMA>>" then un-replace them later. BUT, for simplicity's sake, we'll just grab a random character such as  from the private-use unicode blocks and use it as our special delimiter (note that it is 2 bytes... google spreadsheets might not count bytes in graphemes in a consistent way, but we'll be careful later).
=SPLIT(
LAMBDA(temp,
MID(temp, 1, LEN(temp)-LEN(""))
)(
REGEXREPLACE(
"xyzSixSpaces:[ ]123ThreeSpaces:[ ]aaaa 12345",".*?( |$)",
"$1"
)
),
""
)
We just use a lambda to define temp="match1match2match3", then use that to remove the last delimiter into "match1match2match3", then SPLIT it.
Taking COLUMNS of the result will prove that the correct result is returned, i.e. {" ", " ", " "}.
This is a particularly good function to turn into a Named Function, and call it something like REGEXGLOBALEXTRACT(text,regex) or REGEXALLEXTRACT(text,regex), e.g.:
=SPLIT(
LAMBDA(temp,
MID(temp, 1, LEN(temp)-LEN(""))
)(
REGEXREPLACE(
text,
".*?("&regex&"|$)",
"$1"
)
),
""
)
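For comparison, here is the same delimiter idea as a Python sketch. The delimiter U+E000 is an assumed private-use character, the function name is invented, and empty pieces are filtered out rather than trimming one trailing delimiter, because Python's re.sub may report an extra empty match at the end of the string.

```python
import re

DELIMITER = "\ue000"  # assumed private-use character that never occurs in the text

def all_matches_via_delimiter(text, regex):
    """Consume junk up to each match, keep the match plus a delimiter, then split."""
    marked = re.sub(
        r".*?(" + regex + r"|$)",            # lazily skip junk until a match or the end
        lambda m: m.group(1) + DELIMITER,    # keep only the captured match
        text,
        flags=re.DOTALL,
    )
    return [piece for piece in marked.split(DELIMITER) if piece]

print(all_matches_via_delimiter("A1-Nutrition;A2-ActPhysiq;H3-Endocrino", r"\d-"))
```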
Method 2)
use recursion
With LAMBDA (which lets you define a function, as in any other programming language), you can use some tricks from the well-studied lambda calculus and functional programming: you have access to recursion. Defining a recursive function is confusing because there's no easy way for it to refer to itself, so you have to use a trick/convention:
Trick for recursive functions: to define a function f that needs to refer to itself, instead define a function that takes itself as a parameter and returns the function you actually want; pass this to the Y-combinator to turn it into an actual recursive function.
The plumbing that makes such a function work is called the Y-combinator. Here is a good article to understand it if you have some programming background.
For example to get the result of 5! (5 factorial, i.e. implement our own FACT(5)), we could define:
Named Function Y(f)=LAMBDA(f, (LAMBDA(x,x(x)))( LAMBDA(x, f(LAMBDA(y, x(x)(y)))) ) ) (this is the Y-combinator and is magic; you don't have to understand it to use it)
Named Function MY_FACTORIAL(n)=
Y(LAMBDA(self,
LAMBDA(n,
IF(n=0, 1, n*self(n-1))
)
))
result of MY_FACTORIAL(5): 120
The Y-combinator makes writing recursive functions look relatively easy, like an introduction to programming class. I'm using Named Functions for clarity, but you could just dump it all together at the expense of sanity...
=LAMBDA(Y,
Y(LAMBDA(self, LAMBDA(n, IF(n=0,1,n*self(n-1))) ))(5)
)(
LAMBDA(f, (LAMBDA(x,x(x)))( LAMBDA(x, f(LAMBDA(y, x(x)(y)))) ) )
)
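If the spreadsheet syntax gets in the way, the same combinator can be written in Python, where it is easier to experiment with. This is a direct transliteration of the LAMBDA formula above; the name my_factorial is just for illustration.

```python
# Y-combinator in the same shape as the LAMBDA version:
# LAMBDA(f, (LAMBDA(x, x(x)))(LAMBDA(x, f(LAMBDA(y, x(x)(y))))))
Y = lambda f: (lambda x: x(x))(lambda x: f(lambda y: x(x)(y)))

# The "convention" function: it receives itself as `self` and returns the real function
my_factorial = Y(lambda self: lambda n: 1 if n == 0 else n * self(n - 1))

print(my_factorial(5))
```

The inner lambda y delays the self-application until the function is actually called, which is what keeps an eagerly evaluated language from looping forever.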
How does this apply to the problem at hand? Well a recursive solution is as follows:
in pseudocode below, I use 'function' instead of LAMBDA, but it's the same thing:
// code to get around the fact that you can't have 0-length arrays
function emptyList() {
  return {"ignore this value"}
}
function listToArray(myList) {
  return OFFSET(myList,0,1)
}
function allMatches(text, regex) {
  return allMatchesHelper(emptyList(), text, regex)
}
function allMatchesHelper(resultsToReturn, text, regex) {
  currentMatch = REGEXEXTRACT(...)
  if (currentMatch succeeds) {
    textWithoutMatch = SUBSTITUTE(text, currentMatch, "", 1)
    return allMatchesHelper(
      {resultsToReturn,currentMatch},
      textWithoutMatch,
      regex
    )
  } else {
    return listToArray(resultsToReturn)
  }
}
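The pseudocode translates fairly directly into a runnable Python sketch. Python has empty lists, so the placeholder-value workaround is not needed here; the function name and the stop-on-empty-match guard are assumptions added for the example.

```python
import re

def all_matches(text, regex, results=()):
    """Recursively take the first match, remove it from the text, and repeat."""
    match = re.search(regex, text)
    # Stop when nothing matches (or on an empty match, which would never shrink the text)
    if match is None or match.group(0) == "":
        return list(results)
    # Same role as SUBSTITUTE(text, currentMatch, "", 1)
    text_without_match = text.replace(match.group(0), "", 1)
    return all_matches(text_without_match, regex, results + (match.group(0),))

print(all_matches("A1-Nutrition;A2-ActPhysiq;H3-Endocrino", r"\d-"))
```

Note that deleting a match joins the text on either side of it, so in unlucky inputs this approach can create matches that were not in the original string.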
Unfortunately, the recursive approach is quadratic order of growth (because it's appending the results over and over to itself, while recreating the giant search string with smaller and smaller bites taken out of it, so 1+2+3+4+5+... = big^2, which can add up to a lot of time), so may be slow if you have many many matches. It's better to stay inside the regex engine for speed, since it's probably highly optimized.
You could of course avoid using Named Functions by doing temporary bindings with LAMBDA(varName, expr)(varValue) if you want to use varName in an expression. (You can define this pattern as a Named Function =cont(varValue) to invert the order of the parameters to keep code cleaner, or not.)
Whenever I use varName = varValue, write that instead.
to see if a match succeeds, use ISNA(...)
It would look something like:
Named Function allMatches(text, regex):
UNTESTED:
LAMBDA(helper,
OFFSET(
helper({"ignore"}, text),
0,1)
)(
Y(LAMBDA(helperItself,
LAMBDA(results, partialText,
LAMBDA(currentMatch,
IF(ISNA(currentMatch),
results,
LAMBDA(textWithoutMatch,
helperItself({results,currentMatch}, textWithoutMatch)
)(
SUBSTITUTE(partialText, currentMatch, "", 1)
)
)
)(
REGEXEXTRACT(partialText, regex)
)
)
))
)

Regular Expression to match groups that may not exist

I'm trying to capture some data from logs in an application. The logs look like so:
*junk* [{count=240.0, state=STATE1}, {count=1.0, state=STATE2}, {count=93.0, state=STATE3}, {count=1.0, state=STATE4}, {count=1147.0, state=STATE5}, etc. ] *junk*
If the count for a particular state is ever 0, it actually won't be in the log at all, so I can't guarantee the ordering of the objects in the log (The only ordering is that they are sorted alphabetically by state name)
So, this is also a potential log:
*junk* [{count=240.0, state=STATE1}, {count=1.0, state=STATE4}, {count=1147.0, state=STATE5}, etc. ] *junk*
I'm somewhat new to using regular expressions, and I think I'm overdoing it, but this is what I've tried.
^[^=\n]*=(?:(?P<STATE1>\d+)(?=\.0,\s+\w+=STATE1))*.*?=(?P<STATE2>\d+)(?=\.0,\s+\w+=STATE2)*.*?=(?P<STATE3>\d+)(?=\.0,\s+\w+=STATE3)
The idea is that I look for the '=' and then look ahead to see whether it belongs to the state that I want, which may or may not be there. Then I skip all the junk after the count until the next state that I'm interested in (this is the part I believe I'm having issues with). Sometimes it matches too far and skips the state I'm interested in, giving me a bad value. If I use the lazy operator (as above), sometimes it doesn't go far enough and gets the count for a state that comes before the one I want in the log.
See if this approach works for you:
Regex: (?<=count=)\d+(?:\.\d+)?(?=, state=(STATE\d+))
Group 1 will be your state, and the full match will be the count value.
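As a rough illustration of how that regex could be consumed, here is a Python sketch that walks over every match and builds a state-to-count mapping. The function name counts_by_state and the choice of a dict are assumptions, not part of the answer above.

```python
import re

# Full match = the count; group 1 = the state captured inside the lookahead
COUNT_BEFORE_STATE = re.compile(r"(?<=count=)\d+(?:\.\d+)?(?=, state=(STATE\d+))")

def counts_by_state(log_line):
    """Map each state found in the log line to its count."""
    return {m.group(1): float(m.group(0)) for m in COUNT_BEFORE_STATE.finditer(log_line)}

line = "*junk* [{count=240.0, state=STATE1}, {count=1.0, state=STATE4}, {count=1147.0, state=STATE5}] *junk*"
print(counts_by_state(line))
```

States that are missing from the log are simply absent from the dict, so the caller can treat them as zero.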
You might use 2 capturing groups to capture the count and the state.
To capture for example STATE1, STATE2, STATE3 and STATE5, you could specify the numbers using a character class with ranges and / or an alternation.
{count=(\d+(?:\.\d+)?), state=(STATE(?:[123]|5))}
Explanation
{count= Match literally
( Capture group 1
\d+(?:\.\d+)? Match 1+ digits with an optional decimal part
) Close group
, state= Match literally
( Capture group 2
STATE(?:[123]|5) Match STATE and specify the allowed numbers
)} Close group and match }
If you want to match all states and digits:
{count=(\d+(?:\.\d+)?), state=(STATE\d+)}
After some experimentation, this is what I've come up with:
The answers provided here, although good answers, don't quite work if your state names don't end with a number (mine don't, I just changed them to make the question easier to read and to remove business information from the question).
Here's a completely tile-able regex where you can add on as many matches as needed
count=(?P<GROUP_NAME_HERE>\d+(?=\.0, state=STATE_NAME_HERE))?
This can be copied and appended with the new state name and group name.
Additionally, if any of the states do not appear in the string, it will still match the following states. For example:
count=(?P<G1>\d+(?=\.0, state=STATE_ONE))?(?P<G2>\d+(?=\.0, state=STATE_TWO))?(?P<G3>\d+(?=\.0, state=STATE_THREE))?
will match states STATE_ONE and STATE_THREE with named groups G1 & G3 in the following string even though STATE_TWO is missing:
[{count=55.0, state=STATE_ONE}, {count=10.0, state=STATE_THREE}]
I'm sure this could be improved, but it's fast enough for me, and with 11 groups, regex101 shows 803 steps with a time of ~1ms
Here's a regex101 playground to mess with: https://regex101.com/r/3a3iQf/1
Notice how groups 1,2,3,4,5,6,7,9, & 11 match. 8 & 10 are missing and the following groups still match.
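Here is one way the tiled pattern might be assembled and read back in Python. It is a sketch: the STATES mapping and the function name are invented, and it assumes the state names contain no regex metacharacters.

```python
import re

# Hypothetical group-name -> state-name mapping; add entries to tile more groups
STATES = {"G1": "STATE_ONE", "G2": "STATE_TWO", "G3": "STATE_THREE"}

PATTERN = re.compile("count=" + "".join(
    rf"(?P<{group}>\d+(?=\.0, state={state}))?" for group, state in STATES.items()
))

def counts_by_group(log_line):
    """Merge the named groups that matched across all 'count=' occurrences."""
    found = {}
    for match in PATTERN.finditer(log_line):
        for group, value in match.groupdict().items():
            if value is not None:
                found[group] = int(value)
    return found

print(counts_by_group("[{count=55.0, state=STATE_ONE}, {count=10.0, state=STATE_THREE}]"))
```

Each "count=" produces its own match with at most one group filled in, so the results have to be merged across matches.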

How to Match in a Strict Order Data that comes in a Random Order?

I'm quite new to regular expressions and I have the following target string resource which can sometimes differ slightly. For example, the string might be:
<TITLE>SomeTitle</TITLE>
<ITEM1>Item 1 text</ITEM>
<ITEM2>Item 2 text</ITEM2>
<ITEM3>Item 3 text</ITEM3>
And the next time the resource is requested, its output might be:
<ITEM1>Item 1 text</ITEM>
<ITEM2>Item 2 text</ITEM2>
<ITEM3>Item 3 text</ITEM3>
<TITLE>SomeTitle</TITLE>
I want to capture the data between the two tags in order of the first example, so that the match would always match "SomeTitle" first, followed by the items. So if the search string was the second example, I need an expression that can first match "SomeTitle" and then somehow "reset" the position of the match to start from the beginning so I can then match the items.
I can achieve this with two different pattern searches, but was wondering if there is a way to do this in a single search pattern? Perhaps using lookaheads/lookbehinds and conditionals?
Capture Groups inside Lookaheads
Use this:
(?s)(?=.*<TITLE>(.*?)</)(?=.*<ITEM1>(.*?)</)(?=.*<ITEM2>(.*?)</)(?=.*<ITEM3>(.*?)</)
Even when the tokens are in a random order, you can see them in the right order by examining Capture Groups 1, 2, 3 and 4.
For instance, in the online regex demo, see how the input is in a random order, but the capture groups in the right pane are in the right order.
PCRE: How to use in a programming language
The PCRE library is used in several programming languages: for instance PHP, R, Delphi, and often C. Regardless of the language, the idea is the same: retrieve the capture groups.
As an example, here is how to do it in PHP:
$regex = '~(?s)(?=.*<TITLE>(.*?)</)(?=.*<ITEM1>(.*?)</)(?=.*<ITEM2>(.*?)</)(?=.*<ITEM3>(.*?)</)~';
if (preg_match($regex, $yourdata, $m)) {
    $title = $m[1];
    $item1 = $m[2];
    $item2 = $m[3];
    $item3 = $m[4];
}
else { // sorry, no match...
}
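The same pattern also works in engines other than PCRE that support lookaheads. As an illustration, here is a Python sketch; the function name ordered_fields is an assumption, and the sample input is the second (shuffled) example from the question.

```python
import re

# Each lookahead scans the whole input from the start, so group order is fixed
ORDERED = re.compile(
    r"(?s)(?=.*<TITLE>(.*?)</)(?=.*<ITEM1>(.*?)</)(?=.*<ITEM2>(.*?)</)(?=.*<ITEM3>(.*?)</)"
)

def ordered_fields(data):
    """Return [title, item1, item2, item3] regardless of their order in the input."""
    match = ORDERED.search(data)
    return list(match.groups()) if match else None

shuffled = ("<ITEM1>Item 1 text</ITEM>\n<ITEM2>Item 2 text</ITEM2>\n"
            "<ITEM3>Item 3 text</ITEM3>\n<TITLE>SomeTitle</TITLE>")
print(ordered_fields(shuffled))
```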

How do I group regular expressions past the 9th backreference?

Ok so I am trying to group past the 9th backreference in notepad++. The wiki says that I can use group naming to go past the 9th reference. However, I can't seem to get the syntax right to do the match. I am starting off with just two groups to make it simple.
Sample Data
1000,1000
Regex.
(?'a'[0-9]*),([0-9]*)
According to the docs I need to do the following.
(?<some name>...), (?'some name'...),(?(some name)...)
Names this group some name.
However, the result is that it can't find my text. Any suggestions?
You can simply reference groups > 9 in the same way as those < 10,
i.e. $10 is the tenth group.
For (naive) example:
String:
abcdefghijklmnopqrstuvwxyz
Regex find:
(?:a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)(m)(n)(o)(p)
Replace:
$10
Result:
kqrstuvwxyz
My test was performed in Notepad++ v6.1.2 and gave the result I expected.
Update: This still works as of v7.5.6
SarcasticSully resurrected this to ask the question:
"What if you want to replace with the 1st group followed by the character '0'?"
To do this change the replace to:
$1\x30
Which is replacing with group 1 and the hex character 30 - which is a 0 in ascii.
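Other regex flavours have the same ambiguity and solve it with their own syntax. As a point of comparison only (this is Python, not Notepad++), the sketch below uses \g<n> to separate the group number from a following digit; the variable names are made up.

```python
import re

FIND = r"(?:a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)(m)(n)(o)(p)"
TEXT = "abcdefghijklmnopqrstuvwxyz"

# Tenth group, written unambiguously
tenth = re.sub(FIND, r"\g<10>", TEXT)
# First group followed by a literal character '0'
first_then_zero = re.sub(FIND, r"\g<1>0", TEXT)

print(tenth)
print(first_then_zero)
```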
A very belated answer to help others who land here from Google (as I did). Named backreferences in notepad++ substitutions look like this: $+{name}. For whatever reason.
There's a deviation from standard regex gotcha here, though... named backreferences are also given numbers. In standard regex, if you have (.*)(?<name> & )(.*), you'd replace with $1${name}$2 to get the exact same line you started with. In notepad++, you would have to use $1$+{name}$3.
Example: I needed to clean up a Visual Studio .sln file for mismatched configurations. The text I needed to replace looked like this:
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|Any CPU.ActiveCfg = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|Any CPU.Build.0 = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x64.ActiveCfg = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x64.Build.0 = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x86.ActiveCfg = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x86.Build.0 = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|Any CPU.ActiveCfg = Release|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|Any CPU.Build.0 = Release|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x64.ActiveCfg = Release|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x64.Build.0 = Release|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x86.ActiveCfg = Release|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x86.Build.0 = Release|Any CPU
My search RegEx:
^(\s*\{[^}]*\}\.)(?<config>[a-zA-Z0-9]+\|[a-zA-Z0-9 ]+)*(\..+=\s*)(.*)$
My replacement RegEx:
$1$+{config}$3$+{config}
The result:
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|Any CPU.ActiveCfg = Dev|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|Any CPU.Build.0 = Dev|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x64.ActiveCfg = Dev|x64
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x64.Build.0 = Dev|x64
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x86.ActiveCfg = Dev|x86
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x86.Build.0 = Dev|x86
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|Any CPU.ActiveCfg = QA|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|Any CPU.Build.0 = QA|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x64.ActiveCfg = QA|x64
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x64.Build.0 = QA|x64
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x86.ActiveCfg = QA|x86
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x86.Build.0 = QA|x86
Hope this helps someone.
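Python's re module also gives named groups a number in sequence, so the same search can be tried there to see the numbering at work. This is a sketch in Python syntax ((?P<config>...) and \g<config>), not Notepad++ syntax, and the variable names are invented.

```python
import re

SEARCH = re.compile(
    r"^(\s*\{[^}]*\}\.)(?P<config>[a-zA-Z0-9]+\|[a-zA-Z0-9 ]+)*(\..+=\s*)(.*)$"
)

line = "{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x64.ActiveCfg = Debug|Any CPU"
# The named group sits in position 2, so the next unnamed group is number 3
fixed = SEARCH.sub(r"\1\g<config>\3\g<config>", line)
print(fixed)
print(SEARCH.groupindex["config"])
```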
The usual syntax of referencing groups with \x will interpret \10 as a reference to group 1 followed by a 0.
You need to use instead the alternative syntax of $x with $10.
Note: Some people seem to doubt there's ever any reason to have 10 groups.
I have a simple one, I wanted to rename a group of files named <name_start>DDMMYYYY_TIME_DDMMYYYY_TIME<name_end> as <name_start>YYYYMMDD_TIME_YYYYMMDD_TIME<name_end>, and ended with replacing my input matches with : rename "\1" "\2\5\4\3_\6_\9\8\7_$10" since name_start and name_end were not always constant.
OK, matching is no problem, your example matches for me in the current Notepad++. This is an important point. To use PCRE regex in Notepad++, you need a Version >= 6.0.
The other point is, where do you want to use the backreference? I can use named backreferences without problems within the regex, but not in the replacement string.
This means that
(?'a'[0-9]*),([0-9]*),\g{a}
will match
1000,1001,1000
But I don't know a way to use named groups or groups > 9 in the replacement string.
Do you really need more than 9 backreferences in the replacement string? If you just need more than 9 groups, but not all of them in the replacement, then make the groups you don't need to reuse non-capturing groups, by adding a ?: at the start of the group.
(?:[0-9]*),([0-9]*),(?:[0-9]*),([0-9]*)
           group 1             group 2

Notepad++ RegEx group capture syntax

I have a list of label names in a text file I'd like to manipulate using Find and Replace in Notepad++, they are listed as follows:
MyLabel_01
MyLabel_02
MyLabel_03
MyLabel_04
MyLabel_05
MyLabel_06
I want to rename them in Notepad++ to the following:
Label_A_One
Label_A_Two
Label_A_Three
Label_B_One
Label_B_Two
Label_B_Three
The Regex I'm using in the Notepad++'s replace dialog to capture the label name is the following:
((MyLabel_0)((1)|(2)|(3)|(4)|(5)|(6)))
I want to replace each capture group as follows:
\1 = Label_
\2 = A_One
\3 = A_Two
\4 = A_Three
\5 = B_One
\6 = B_Two
\7 = B_Three
My problem is that Notepad++ doesn't register the syntax of the regex above. When I hit Count in the Replace dialog, it returns 0 occurrences. I'm not sure what's missing in the syntax. And yes, I made sure the Regular Expression radio button is selected. Help is appreciated.
UPDATE:
Tried escaping the parenthesis, still didn't work:
\(\(MyLabel_0\)\((1\)|\(2\)|\(3\)|\(4\)|\(5\)|\(6\)\)\)
Ed's response has shown a working pattern since alternation isn't supported in Notepad++, however the rest of your problem can't be handled by regex alone. What you're trying to do isn't possible with a regex find/replace approach. Your desired result involves logical conditions which can't be expressed in regex. All you can do with the replace method is re-arrange items and refer to the captured items, but you can't tell it to use "A" for values 1-3, and "B" for 4-6. Furthermore, you can't assign placeholders like that. They are really capture groups that you are backreferencing.
To reach the results you've shown you would need to write a small program that would allow you to check the captured values and perform the appropriate replacements.
EDIT: here's an example of how to achieve this in C#
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

// Map each captured number to its replacement suffix
var numToWordMap = new Dictionary<int, string>();
numToWordMap[1] = "A_One";
numToWordMap[2] = "A_Two";
numToWordMap[3] = "A_Three";
numToWordMap[4] = "B_One";
numToWordMap[5] = "B_Two";
numToWordMap[6] = "B_Three";
string pattern = @"\bMyLabel_(\d+)\b";
string filePath = @"C:\temp.txt";
string[] contents = File.ReadAllLines(filePath);
for (int i = 0; i < contents.Length; i++)
{
    contents[i] = Regex.Replace(contents[i], pattern,
        m =>
        {
            int num = int.Parse(m.Groups[1].Value);
            if (numToWordMap.ContainsKey(num))
            {
                return "Label_" + numToWordMap[num];
            }
            // key not found, use original value
            return m.Value;
        });
}
File.WriteAllLines(filePath, contents);
You should be able to use this easily. Perhaps you can download LINQPad or Visual C# Express to do so.
If your files are too large this might be an inefficient approach, in which case you could use a StreamReader and StreamWriter to read from the original file and write it to another, respectively.
Also be aware that my sample code writes back to the original file. For testing purposes you can change that path to another file so it isn't overwritten.
Bar bar bar - Notepad++ thinks you're a barbarian.
(obsolete - see update below.) No vertical bars in Notepad++ regex - sorry. I forget every few months, too!
Use [123456] instead.
Update: Sorry, I didn't read carefully enough; on top of the barhopping problem, @Ahmad's spot-on - you can't do a mapping replacement like that.
Update: Version 6 of Notepad++ changed the regular expression engine to a Perl-compatible one, which supports "|". AFAICT, if you have a version 5.x, auto-update won't update to 6.x - you have to explicitly download it.
A regular expression search and replace for
MyLabel_((01)|(02)|(03)|(04)|(05)|(06))
with
Label_(?2A_One)(?3A_Two)(?4A_Three)(?5B_One)(?6B_Two)(?7B_Three)
works on Notepad++ 6.3.2
The outermost pair of brackets is for grouping, they limit the scope of the first alternation; not sure whether they could be omitted but including them makes the scope clear. The pattern searches for a fixed string followed by one of the two-digit pairs. (The leading zero could be factored out and placed in the fixed string.) Each digit pair is wrapped in round brackets so it is captured.
In the replacement expression, the clause (?4A_Three) says that if capture group 4 matched something then insert the text A_Three, otherwise insert nothing. Similarly for the other clauses. As the 6 alternatives are mutually exclusive only one will match. Thus only one of the (?...) clauses will have matched and so only one will insert text.
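For readers without Notepad++ at hand, the "insert text depending on which group matched" behaviour can be approximated in Python with a replacement callback. This is an emulation rather than the same syntax: the names are hypothetical, and the outer grouping is non-capturing here, so the alternatives are groups 1 to 6 instead of 2 to 7.

```python
import re

# Replacement text for alternatives 01..06, in order
REPLACEMENTS = ["A_One", "A_Two", "A_Three", "B_One", "B_Two", "B_Three"]
PATTERN = re.compile(r"MyLabel_(?:(01)|(02)|(03)|(04)|(05)|(06))")

def rename_label(match):
    # Only one alternative can match, so lastindex is the number of that group
    return "Label_" + REPLACEMENTS[match.lastindex - 1]

def rename_labels(text):
    return PATTERN.sub(rename_label, text)

print(rename_labels("MyLabel_01\nMyLabel_04\nMyLabel_06"))
```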
The easiest way to do this that I would recommend is to use AWK. If you're on Windows, look for the mingw32 precompiled binaries out there for free download (it'll be called gawk).
BEGIN {
    FS = "_0";
    a[1]="A_One";
    a[2]="A_Two";
    a[3]="A_Three";
    a[4]="B_One";
    a[5]="B_Two";
    a[6]="B_Three";
}
{
    printf("Label_%s\n", a[$2]);
}
Execute on Windows as follows:
C:\Users\Mydir>gawk -f test.awk awk.in
Label_A_One
Label_A_Two
Label_A_Three
Label_B_One
Label_B_Two
Label_B_Three