How do I group regular expressions past the 9th backreference? - regex

Ok so I am trying to group past the 9th backreference in notepad++. The wiki says that I can use group naming to go past the 9th reference. However, I can't seem to get the syntax right to do the match. I am starting off with just two groups to make it simple.
Sample Data
1000,1000
Regex.
(?'a'[0-9]*),([0-9]*)
According to the docs I need to do the following.
(?<some name>...), (?'some name'...),(?(some name)...)
Names this group some name.
However, the result is that it can't find my text. Any suggestions?

You can simply reference groups above 9 in the same way as groups 1 to 9,
i.e. $10 is the tenth group.
For (naive) example:
String:
abcdefghijklmnopqrstuvwxyz
Regex find:
(?:a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)(m)(n)(o)(p)
Replace:
$10
Result:
kqrstuvwxyz
My test was performed in Notepad++ v6.1.2 and gave the result I expected.
Update: This still works as of v7.5.6
SarcasticSully resurrected this to ask the question:
"What if you want to replace with the 1st group followed by the character '0'?"
To do this change the replace to:
$1\x30
This replaces with group 1 followed by the character with hex code 30, which is a 0 in ASCII.
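For comparison, the same "group 10" versus "group 1 then a 0" ambiguity exists in other engines. Here is a minimal sketch in Python rather than Notepad++ (that swap is my assumption, not part of the test above). Python's re module spells the two cases \g<10> and \g<1>0 in the replacement string. The pattern and sample string are the ones from the example above.

```python
import re

# Same pattern as the Notepad++ example: (?:a) is non-capturing,
# so (b) is group 1 and (k) is group 10.
PATTERN = r"(?:a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)(m)(n)(o)(p)"
TEXT = "abcdefghijklmnopqrstuvwxyz"

# \g<10> is Python's unambiguous spelling of "group 10".
tenth_group = re.sub(PATTERN, r"\g<10>", TEXT)

# \g<1>0 means "group 1 followed by a literal 0".
first_group_then_zero = re.sub(PATTERN, r"\g<1>0", TEXT)

print(tenth_group)            # kqrstuvwxyz
print(first_group_then_zero)  # b0qrstuvwxyz
```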

A very belated answer to help others who land here from Google (as I did). Named backreferences in notepad++ substitutions look like this: $+{name}. For whatever reason.
There's a gotcha here that deviates from some other regex flavors, though: named groups are also given numbers. In .NET, for example, if you have (.*)(?<name> & )(.*), you'd replace with $1${name}$2 to get the exact same line you started with. In Notepad++, you would have to use $1$+{name}$3.
Example: I needed to clean up a Visual Studio .sln file for mismatched configurations. The text I needed to replace looked like this:
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|Any CPU.ActiveCfg = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|Any CPU.Build.0 = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x64.ActiveCfg = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x64.Build.0 = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x86.ActiveCfg = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x86.Build.0 = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|Any CPU.ActiveCfg = Release|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|Any CPU.Build.0 = Release|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x64.ActiveCfg = Release|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x64.Build.0 = Release|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x86.ActiveCfg = Release|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x86.Build.0 = Release|Any CPU
My search RegEx:
^(\s*\{[^}]*\}\.)(?<config>[a-zA-Z0-9]+\|[a-zA-Z0-9 ]+)*(\..+=\s*)(.*)$
My replacement RegEx:
$1$+{config}$3$+{config}
The result:
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|Any CPU.ActiveCfg = Dev|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|Any CPU.Build.0 = Dev|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x64.ActiveCfg = Dev|x64
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x64.Build.0 = Dev|x64
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x86.ActiveCfg = Dev|x86
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x86.Build.0 = Dev|x86
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|Any CPU.ActiveCfg = QA|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|Any CPU.Build.0 = QA|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x64.ActiveCfg = QA|x64
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x64.Build.0 = QA|x64
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x86.ActiveCfg = QA|x86
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x86.Build.0 = QA|x86
Hope this helps someone.
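The same numbering behaviour can be checked outside Notepad++. This is a rough sketch in Python, which is my choice for illustration and uses different syntax: (?P<config>...) in the pattern and \g<config> in the replacement. I have also assumed the stray * after the named group in the search pattern is not needed and left it out. The sample line is one of the input lines above.

```python
import re

# Python spells named groups (?P<name>...) and references them in the
# replacement as \g<name>; Notepad++ uses (?<name>...) and $+{name}.
PATTERN = re.compile(
    r"^(\s*\{[^}]*\}\.)(?P<config>[a-zA-Z0-9]+\|[a-zA-Z0-9 ]+)(\..+=\s*)(.*)$"
)
LINE = "{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x64.ActiveCfg = Debug|Any CPU"

match = PATTERN.match(LINE)

# The named group also occupies slot 2, so the next unnamed group is number 3.
named_is_group_two = match.group(2) == match.group("config")

# Group 1, the config name, group 3, then the config name again.
fixed_line = PATTERN.sub(r"\1\g<config>\3\g<config>", LINE)

print(named_is_group_two)  # True
print(fixed_line)
```

Running it on the Dev|x64 line gives the same result as the Notepad++ replacement shown above for that line.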

The usual syntax of referencing groups with \x will interpret \10 as a reference to group 1 followed by a 0.
You need to use instead the alternative syntax of $x with $10.
Note: some people seem to doubt there's ever any reason to have 10 groups.
I have a simple one. I wanted to rename a group of files named <name_start>DDMMYYYY_TIME_DDMMYYYY_TIME<name_end> as <name_start>YYYYMMDD_TIME_YYYYMMDD_TIME<name_end>, and ended up replacing my input matches with: rename "\1" "\2\5\4\3_\6_\9\8\7_$10", since name_start and name_end were not always constant.
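To make the ten-group rename concrete, here is a sketch in Python. The search pattern and the file name are hypothetical: the original search expression is not shown above, so I assumed four-digit times and one outer group for the whole name. Python writes group references as \g<n>, which sidesteps the \10 ambiguity.

```python
import re

# Hypothetical pattern: group 1 is the whole name, group 2 is name_start,
# groups 3-5 and 7-9 are DD, MM, YYYY, group 6 is the first TIME,
# and group 10 is the second TIME plus name_end.
PATTERN = re.compile(
    r"^((.*?)(\d{2})(\d{2})(\d{4})_(\d{4})_(\d{2})(\d{2})(\d{4})_(\d{4}.*))$"
)
# Same group order as the replacement above, written with \g<n>.
TEMPLATE = r'rename "\g<1>" "\g<2>\g<5>\g<4>\g<3>_\g<6>_\g<9>\g<8>\g<7>_\g<10>"'

FILENAME = "report_01022018_0930_03042018_1745.log"  # made-up example name
command = PATTERN.sub(TEMPLATE, FILENAME)
print(command)
# rename "report_01022018_0930_03042018_1745.log" "report_20180201_0930_20180403_1745.log"
```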

OK, matching is no problem; your example matches for me in the current Notepad++. This is an important point: to use PCRE regex in Notepad++, you need version 6.0 or later.
The other point is, where do you want to use the backreference? I can use named backreferences without problems within the regex, but not in the replacement string.
This means that
(?'a'[0-9]*),([0-9]*),\g{a}
will match
1000,1001,1000
But I don't know a way to use named groups or groups > 9 in the replacement string.
Do you really need more than 9 backreferences in the replacement string? If you just need more than 9 groups, but not all of them in the replacement, then make the groups you don't need to reuse non-capturing groups, by adding a ?: at the start of the group.
(?:[0-9]*),([0-9]*),(?:[0-9]*),([0-9]*)
           group 1             group 2
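Both points can be tried quickly in Python. This is only an analogue, since Python's syntax differs: the named group is written (?P<a>...) and the in-pattern backreference (?P=a) instead of \g{a}.

```python
import re

# (?P=a) plays the role of \g{a}: it must repeat what group "a" matched.
with_backref = re.compile(r"(?P<a>[0-9]*),([0-9]*),(?P=a)")
repeats_first = with_backref.fullmatch("1000,1001,1000") is not None
differs_from_first = with_backref.fullmatch("1000,1001,1002") is not None

# Non-capturing groups (?:...) do not use up a group number,
# so only the second and fourth fields are numbered.
sparse = re.compile(r"(?:[0-9]*),([0-9]*),(?:[0-9]*),([0-9]*)")
captured = sparse.fullmatch("1,2,3,4").groups()

print(repeats_first)       # True
print(differs_from_first)  # False
print(captured)            # ('2', '4')
```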

Related

Regex to insert space with certain characters but avoid date and time

I made a regex which inserts a space wherever any of the characters
-:\*_/;, is present. For example, JET*AIRWAYS\INDIA/858701/IDBI 05/05/05;05:05:05 a/c should become JET* AIRWAYS\ INDIA/ 858701/ IDBI 05/05/05; 05:05:05 a/c
The regex I used is (?!a\/c|w\/d|m\/s|s\/w|m\/o)(\D-|\D:|\D\*|\D_|\D\\|\D\/|\D\;)
I have added some word exceptions such as a/c and w/d. The \D conditions are there to keep date/time values from being separated, but this created an issue: numbers followed by the above-mentioned characters never get split.
My requirement is
1. Insert a space after characters -:\*_/;,
2. but date and time should not get split which may have / :
3. need exception on words like a/c w/d
The following is the full code
Private Function formatColon(oldString As String) As String
    Dim reg As New RegExp: reg.Global = True: reg.Pattern = "(?!a\/c|w\/d|m\/s|s\/w|m\/o)(\D-|\D:|\D\*|\D_|\D\\|\D\/|\D\;)" '"(\D:|\D/|\D-|^w/d)"
    Dim newString As String: newString = reg.Replace(oldString, "$1 ")
    formatColon = XtraspaceKill(newString)
End Function
I would use 3 replacements.
Replace all date and time special characters with a special macro that should never be found in your text, e.g. for 05/15/2018 4:06 PM, something based on your name:
05MANUMOHANSLASH15MANUMOHANSLASH2018 4MANUMOHANCOLON06 PM
You can encode exceptions too, like this:
aMANUMOHANSLASHc
Now run your original regex to replace all special characters.
Finally, unreplace the macros MANUMOHANSLASH and MANUMOHANCOLON.
Meanwhile, let me tell you why this is complicated in a single regex.
If trying to do this in a single regex, you have to ask, for each / or :, "Am I a part of a date or time?"
To answer that, you need to use lookahead and lookbehind assertions, the latter of which Microsoft has finally added support for.
But given a /, you don't know if you're between the first and second, or second and third parts of the date. Similar for time.
The number of cases you need to consider will render your regex unmaintainably complex.
So please just use a few separate replacements :-)
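The three-replacement idea can be sketched like this. It is in Python rather than VBA, and the date and time patterns, the placeholder characters, and the function name are my own assumptions for illustration. Instead of a name-based macro it uses two private-use Unicode characters that should never appear in real text.

```python
import re

# Placeholders standing in for "/" and ":" inside protected spans (assumed
# never to occur in the input).
SLASH, COLON = "\uE000", "\uE001"

# Assumed shapes: dates like 05/05/05 or 05/15/2018, times like 4:06 or 05:05:05.
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")
TIME = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")
EXCEPTIONS = re.compile(r"\b(?:a/c|w/d|m/s|s/w|m/o)\b", re.IGNORECASE)

# Any separator that is not already followed by whitespace or end of text.
SEPARATORS = re.compile(r"([-:*_\\/;,])(?!\s|$)")


def protect(match):
    """Hide the slashes and colons of one date, time, or exception word."""
    return match.group(0).replace("/", SLASH).replace(":", COLON)


def format_separators(text):
    # Step 1: encode dates, times, and exception words.
    for pattern in (DATE, TIME, EXCEPTIONS):
        text = pattern.sub(protect, text)
    # Step 2: insert a space after every remaining separator.
    text = SEPARATORS.sub(r"\1 ", text)
    # Step 3: restore the protected characters.
    return text.replace(SLASH, "/").replace(COLON, ":")


sample = r"JET*AIRWAYS\INDIA/858701/IDBI 05/05/05;05:05:05 a/c"
print(format_separators(sample))
# JET* AIRWAYS\ INDIA/ 858701/ IDBI 05/05/05; 05:05:05 a/c
```

Each step stays simple enough to read and test on its own, which is the point of splitting the work.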

How to use the regular expression group In Lugaru Epsilon (Editor) when the total match exceeds 9 groups?

This is about regular expression replacement in the Epsilon editor. I have a csv file that I wanted to replace the texts with a certain pattern.
The pattern replacement works perfectly when I use #1, #2 etc., in the replacement group.
But when I enter #10, it's the first group that gets placed there. How do I use a matching group greater than 9?
(Well, a very late answer, I realize now; anyway...)
I'm not able to find the documentation, but I think that only 1 (0 for the whole pattern) to 9 are supported (at least in the interactive command).
I found this code, in src/searc.e:
...
char *with;
...
if (*with != '#')
    insert(*with);
else if (isdigit(*++with)) {
    bufnum = orig;
    group = *with - '0';
    buf_xfer(tmp, find_group(group, 1),
             find_group(group, 0));
    bufnum = tmp;
} else {
...
It seems to me that only the first character after # is considered.
You may try mailing support@lugaru.com for further clarification; I have always found Steven Doerfler very helpful. (Epsilon 14 is now in beta, so it could be an opportunity to improve the documentation.)
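To show what that reading of the excerpt implies, here is a toy model in Python. It is not Epsilon's code. The function name and the group values are invented, and it only imitates the "read exactly one digit after #" behaviour the excerpt suggests.

```python
def expand_single_digit(template, groups):
    """Expand '#<digit>' references, reading exactly one digit after '#'."""
    out = []
    i = 0
    while i < len(template):
        is_ref = (
            template[i] == "#"
            and i + 1 < len(template)
            and template[i + 1] in "0123456789"
        )
        if is_ref:
            # Only one digit is consumed; any following digit stays literal.
            out.append(groups[int(template[i + 1])])
            i += 2
        else:
            out.append(template[i])
            i += 1
    return "".join(out)


# groups[0] is the whole match; groups[1]..groups[10] are "A".."J".
groups = ["<whole match>"] + list("ABCDEFGHIJ")

print(expand_single_digit("#10", groups))    # A0  (group 1, then a literal 0)
print(expand_single_digit("#9-#1", groups))  # I-A
```

Under this model #10 can never reach the tenth group, which matches the behaviour described in the question.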

find files by number with leading zeros via regex

I have 22 files, file001 - file022, and I would like to use regex to grab only file005-file022.
I know that 00[5-9] grabs 005-009 and 0[12][0-9] grabs 010-022.
I am having problems putting them together into one regex.
The most readable way would be (00[5-9]|0[12][0-9]) but a more compact way is 0(0[5-9]|[12][0-9]). Or, depending on your regex engine, 0(0[5-9]|[12]\d).
If the engine supports it, a non-capturing group is preferred for the "either or" as 0(?:0[5-9]|[12]\d), assuming you do not need to separately capture the last two digits.
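A quick way to check the combined pattern is to run it over all 22 names. This is a sketch in Python, and the "file" prefix is taken from the names in the question. Note that [12]\d would also accept 023-029 if such files existed. Here only 001-022 exist, so that does not matter.

```python
import re

# Non-capturing alternation: 005-009, or 010-029.
FILE_RANGE = re.compile(r"file0(?:0[5-9]|[12]\d)")

names = [f"file{i:03d}" for i in range(1, 23)]  # file001 .. file022
selected = [name for name in names if FILE_RANGE.fullmatch(name)]

print(selected[0], selected[-1], len(selected))  # file005 file022 18
```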
Since the context is not clear so far and combining two regular expressions might hurt readability, I would propose an alternative:
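// Requires java.util.ArrayList and java.util.List.
List<String> filenames = new ArrayList<String>();
String filename = "file007";
// Parse the numeric suffix that follows the 4-character "file" prefix.
int fileIndex = Integer.parseInt(filename.substring(4));
if (fileIndex > 4 && fileIndex < 23) {
    filenames.add(filename);
}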
Hopefully the Java code is self-explanatory.

Notepad++ RegEx group capture syntax

I have a list of label names in a text file I'd like to manipulate using Find and Replace in Notepad++, they are listed as follows:
MyLabel_01
MyLabel_02
MyLabel_03
MyLabel_04
MyLabel_05
MyLabel_06
I want to rename them in Notepad++ to the following:
Label_A_One
Label_A_Two
Label_A_Three
Label_B_One
Label_B_Two
Label_B_Three
The Regex I'm using in the Notepad++'s replace dialog to capture the label name is the following:
((MyLabel_0)((1)|(2)|(3)|(4)|(5)|(6)))
I want to replace each capture group as follows:
\1 = Label_
\2 = A_One
\3 = A_Two
\4 = A_Three
\5 = B_One
\6 = B_Two
\7 = B_Three
My problem is that Notepad++ doesn't register the syntax of the regex above. When I hit Count in the Replace dialog, it returns 0 occurrences. I'm not sure what's missing in the syntax. And yes, I made sure the Regular Expression radio button is selected. Help is appreciated.
UPDATE:
Tried escaping the parenthesis, still didn't work:
\(\(MyLabel_0\)\((1\)|\(2\)|\(3\)|\(4\)|\(5\)|\(6\)\)\)
Ed's response has shown a working pattern since alternation isn't supported in Notepad++, however the rest of your problem can't be handled by regex alone. What you're trying to do isn't possible with a regex find/replace approach. Your desired result involves logical conditions which can't be expressed in regex. All you can do with the replace method is re-arrange items and refer to the captured items, but you can't tell it to use "A" for values 1-3, and "B" for 4-6. Furthermore, you can't assign placeholders like that. They are really capture groups that you are backreferencing.
To reach the results you've shown you would need to write a small program that would allow you to check the captured values and perform the appropriate replacements.
EDIT: here's an example of how to achieve this in C#
var numToWordMap = new Dictionary<int, string>();
numToWordMap[1] = "A_One";
numToWordMap[2] = "A_Two";
numToWordMap[3] = "A_Three";
numToWordMap[4] = "B_One";
numToWordMap[5] = "B_Two";
numToWordMap[6] = "B_Three";
string pattern = @"\bMyLabel_(\d+)\b";
string filePath = @"C:\temp.txt";
string[] contents = File.ReadAllLines(filePath);
for (int i = 0; i < contents.Length; i++)
{
    contents[i] = Regex.Replace(contents[i], pattern,
        m =>
        {
            int num = int.Parse(m.Groups[1].Value);
            if (numToWordMap.ContainsKey(num))
            {
                return "Label_" + numToWordMap[num];
            }
            // key not found, use original value
            return m.Value;
        });
}
File.WriteAllLines(filePath, contents);
You should be able to use this easily. Perhaps you can download LINQPad or Visual C# Express to do so.
If your files are too large this might be an inefficient approach, in which case you could use a StreamReader and StreamWriter to read from the original file and write it to another, respectively.
Also be aware that my sample code writes back to the original file. For testing purposes you can change that path to another file so it isn't overwritten.
Bar bar bar - Notepad++ thinks you're a barbarian.
(obsolete - see update below.) No vertical bars in Notepad++ regex - sorry. I forget every few months, too!
Use [123456] instead.
Update: Sorry, I didn't read carefully enough; on top of the barhopping problem, @Ahmad's spot-on - you can't do a mapping replacement like that.
Update: Version 6 of Notepad++ changed the regular expression engine to a Perl-compatible one, which supports "|". AFAICT, if you have a version 5.x, auto-update won't update to 6.x; you have to explicitly download it.
A regular expression search and replace for
MyLabel_((01)|(02)|(03)|(04)|(05)|(06))
with
Label_(?2A_One)(?3A_Two)(?4A_Three)(?5B_One)(?6B_Two)(?7B_Three)
works on Notepad++ 6.3.2
The outermost pair of brackets is for grouping; they limit the scope of the first alternation. I'm not sure whether they could be omitted, but including them makes the scope clear. The pattern searches for a fixed string followed by one of the two-digit pairs. (The leading zero could be factored out and placed in the fixed string.) Each digit pair is wrapped in round brackets so it is captured.
In the replacement expression, the clause (?4A_Three) says that if capture group 4 matched something then insert the text A_Three, otherwise insert nothing. Similarly for the other clauses. As the 6 alternatives are mutually exclusive only one will match. Thus only one of the (?...) clauses will have matched and so only one will insert text.
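The "insert this text only if that group matched" logic can be imitated with a replacement function. This is a sketch in Python, which is my choice for illustration. Python's re module has no (?N...) conditional in replacement strings, so the callback checks each group itself. The pattern and the group-to-text table are the ones from this answer.

```python
import re

# Group 1 is the outer pair; groups 2-7 are the six alternatives.
PATTERN = re.compile(r"MyLabel_((01)|(02)|(03)|(04)|(05)|(06))")

# Mirrors (?2A_One)(?3A_Two)(?4A_Three)(?5B_One)(?6B_Two)(?7B_Three).
TEXT_FOR_GROUP = {
    2: "A_One", 3: "A_Two", 4: "A_Three",
    5: "B_One", 6: "B_Two", 7: "B_Three",
}


def conditional_replace(match):
    # Emit the text of every group that took part in the match; because the
    # alternatives are mutually exclusive, that is exactly one entry.
    parts = [
        text for group, text in TEXT_FOR_GROUP.items()
        if match.group(group) is not None
    ]
    return "Label_" + "".join(parts)


labels = "MyLabel_01\nMyLabel_04\nMyLabel_06"
print(PATTERN.sub(conditional_replace, labels))
# Label_A_One
# Label_B_One
# Label_B_Three
```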
The easiest way to do this that I would recommend is to use AWK. If you're on Windows, look for the mingw32 precompiled binaries out there for free download (it'll be called gawk).
BEGIN {
    FS = "_0";
    a[1]="A_One";
    a[2]="A_Two";
    a[3]="A_Three";
    a[4]="B_One";
    a[5]="B_Two";
    a[6]="B_Three";
}
{
    printf("Label_%s\n", a[$2]);
}
Execute on Windows as follows:
C:\Users\Mydir>gawk -f test.awk awk.in
Label_A_One
Label_A_Two
Label_A_Three
Label_B_One
Label_B_Two
Label_B_Three

Regex named capture group with multiple values

I seem to be having a tough regex week. Anyone that can save me from throwing my laptop out the window gets a virtual beer. I have some data in the form of:
... f=something group="First Group,Group2" foo=val ...
where the number of groups can vary. I need to capture each group entry to a named capture. This follows on from a previous post; the difference here is that I don't have a constant to key off of within the values (i.e. ID-1-1, ID-2-2 allows me to say ID-\d+-\d+, whereas these values could be pretty much anything). I've been trying a ton of stuff, but I tend to get matches that are far too greedy, or I (often) get these 2 values:
First Group
First Group,Group2
What I need is:
First Group
Group2
...
I'm currently trying regex such as this where I'm trying to anchor to the group=" portion, and not exceed the ending ":
(?:(?:group=\")|(?:\"))(?<group>(?:(.+)+?)
Hopefully someone can make my day a lot better...
Here's the PHP solution. Once again, regex doesn't like capturing multiple values, so we need to break it into two searches. One extracts the group value; the next extracts each value from the group.
$test = 'f=something group="First Group,Group2" foo=val';
$re = '/(?:group=)?\x22(?<group>(?:[^\x2C]+\x2C*)+)\x22/';
$_ = null;
if (preg_match($re,$test,$_))
    echo "Group Contents: ".$_['group']."\r\n";
$__ = null;
$re = '/(?:^|\x2C)(?<value>(?:[^\x2C]+)+)/';
if (preg_match_all($re,$_['group'],$__))
    echo "Group Values: ".print_r($__['value'],true);
It should be pretty easy to port to another language: just extract the regexes and manage them the way you normally would.
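As an illustration of such a port, here is a sketch in Python. The function name is made up, and the second search is replaced by a plain split on commas, which is a simplification of the PHP regex rather than a translation of it. It assumes the values themselves never contain a comma or a double quote.

```python
import re

# Step 1 pattern: capture everything between the quotes after group=.
GROUP_ATTR = re.compile(r'group="(?P<group>[^"]*)"')


def extract_group_values(line):
    """Return the comma-separated entries of the group="..." attribute."""
    match = GROUP_ATTR.search(line)
    if match is None:
        return []
    # Step 2: split the captured text into individual values.
    return [value for value in match.group("group").split(",") if value]


test = 'f=something group="First Group,Group2" foo=val'
print(extract_group_values(test))  # ['First Group', 'Group2']
```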