find files by number with leading zero's via regex - regex

I have 22 files file001 - file022, I would like to use regex to find grab only file005-file022.
I know that 00[5-9] grabs 005-009 and 0[12][0-9] grabs 010-022.
I am having problems putting them together into one regex.

The most-readable way would be (00[5-9]|0[12][0-9]) but a more compact way is 0(0[5-9]|[12][0-9]). Or, depending on your regex engine, 0(0[5-9]|[12]\d).
If the engine supports it, a non-capturing group is preferred for the "either or" as 0(?:0[5-9]|[12]\d), assuming you do not need to separately capture the last two digits.

Since the context is not clear so far and combining two regular expressions might hurt readability, I would propose an alternative:
List<String> filenames = new ArrayList<String>();
String filename = "file007";
int fileIndex = Integer.parseInt(filename.substring(5, 7));
if (fileIndex > 4 && fileIndex < 23) {
filenames.add(filename);
}
Hopefully the Java code is self-explanatory.

Related

How do I group regular expressions past the 9th backreference?

Ok so I am trying to group past the 9th backreference in notepad++. The wiki says that I can use group naming to go past the 9th reference. However, I can't seem to get the syntax right to do the match. I am starting off with just two groups to make it simple.
Sample Data
1000,1000
Regex.
(?'a'[0-9]*),([0-9]*)
According to the docs I need to do the following.
(?<some name>...), (?'some name'...),(?(some name)...)
Names this group some name.
However, the result is that it can't find my text. Any suggestions?
You can simply reference groups > 9 in the same way as those < 10
i.e $10 is the tenth group.
For (naive) example:
String:
abcdefghijklmnopqrstuvwxyz
Regex find:
(?:a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)(m)(n)(o)(p)
Replace:
$10
Result:
kqrstuvwxyz
My test was performed in Notepad++ v6.1.2 and gave the result I expected.
Update: This still works as of v7.5.6
SarcasticSully resurrected this to ask the question:
"What if you want to replace with the 1st group followed by the character '0'?"
To do this change the replace to:
$1\x30
Which is replacing with group 1 and the hex character 30 - which is a 0 in ascii.
A very belated answer to help others who land here from Google (as I did). Named backreferences in notepad++ substitutions look like this: $+{name}. For whatever reason.
There's a deviation from standard regex gotcha here, though... named backreferences are also given numbers. In standard regex, if you have (.*)(?<name> & )(.*), you'd replace with $1${name}$2 to get the exact same line you started with. In notepad++, you would have to use $1$+{name}$3.
Example: I needed to clean up a Visual Studio .sln file for mismatched configurations. The text I needed to replace looked like this:
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|Any CPU.ActiveCfg = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|Any CPU.Build.0 = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x64.ActiveCfg = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x64.Build.0 = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x86.ActiveCfg = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x86.Build.0 = Debug|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|Any CPU.ActiveCfg = Release|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|Any CPU.Build.0 = Release|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x64.ActiveCfg = Release|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x64.Build.0 = Release|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x86.ActiveCfg = Release|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x86.Build.0 = Release|Any CPU
My search RegEx:
^(\s*\{[^}]*\}\.)(?<config>[a-zA-Z0-9]+\|[a-zA-Z0-9 ]+)*(\..+=\s*)(.*)$
My replacement RegEx:
$1$+{config}$3$+{config}
The result:
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|Any CPU.ActiveCfg = Dev|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|Any CPU.Build.0 = Dev|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x64.ActiveCfg = Dev|x64
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x64.Build.0 = Dev|x64
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x86.ActiveCfg = Dev|x86
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.Dev|x86.Build.0 = Dev|x86
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|Any CPU.ActiveCfg = QA|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|Any CPU.Build.0 = QA|Any CPU
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x64.ActiveCfg = QA|x64
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x64.Build.0 = QA|x64
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x86.ActiveCfg = QA|x86
{CDDB12FE-885F-4FB7-9724-1A4279573DE5}.QA|x86.Build.0 = QA|x86
Hope this helps someone.
The usual syntax of referencing groups with \x will interpret \10 as a reference to group 1 followed by a 0.
You need to use instead the alternative syntax of $x with $10.
Note : Some people seem to doubt there's ever any reason to have 10 groups.
I have a simple one, I wanted to rename a group of files named <name_start>DDMMYYYY_TIME_DDMMYYYY_TIME<name_end> as <name_start>YYYYMMDD_TIME_YYYYMMDD_TIME<name_end>, and ended with replacing my input matches with : rename "\1" "\2\5\4\3_\6_\9\8\7_$10" since name_start and name_end were not always constant.
OK, matching is no problem, your example matches for me in the current Notepad++. This is an important point. To use PCRE regex in Notepad++, you need a Version >= 6.0.
The other point is, where do you want to use the backreference? I can use named backreferences without problems within the regex, but not in the replacement string.
means
(?'a'[0-9]*),([0-9]*),\g{a}
will match
1000,1001,1000
But I don't know a way to use named groups or groups > 9 in the replacement string.
Do you really need more than 9 backreferences in the replacement string? If you just need more than 9 groups, but not all of them in the replacement, then make the groups you don't need to reuse non-capturing groups, by adding a ?: at the start of the group.
(?:[0-9]*),([0-9]*),(?:[0-9]*),([0-9]*)
group 1 group 2

Notepad++ RegeEx group capture syntax

I have a list of label names in a text file I'd like to manipulate using Find and Replace in Notepad++, they are listed as follows:
MyLabel_01
MyLabel_02
MyLabel_03
MyLabel_04
MyLabel_05
MyLabel_06
I want to rename them in Notepad++ to the following:
Label_A_One
Label_A_Two
Label_A_Three
Label_B_One
Label_B_Two
Label_B_Three
The Regex I'm using in the Notepad++'s replace dialog to capture the label name is the following:
((MyLabel_0)((1)|(2)|(3)|(4)|(5)|(6)))
I want to replace each capture group as follows:
\1 = Label_
\2 = A_One
\3 = A_Two
\4 = A_Three
\5 = B_One
\6 = B_Two
\7 = B_Three
My problem is that Notepad++ doesn't register the syntax of the regex above. When I hit Count in the Replace Dialog, it returns with 0 occurrences. Not sure what's misesing in the syntax. And yes I made sure the Regular Expression radio button is selected. Help is appreciated.
UPDATE:
Tried escaping the parenthesis, still didn't work:
\(\(MyLabel_0\)\((1\)|\(2\)|\(3\)|\(4\)|\(5\)|\(6\)\)\)
Ed's response has shown a working pattern since alternation isn't supported in Notepad++, however the rest of your problem can't be handled by regex alone. What you're trying to do isn't possible with a regex find/replace approach. Your desired result involves logical conditions which can't be expressed in regex. All you can do with the replace method is re-arrange items and refer to the captured items, but you can't tell it to use "A" for values 1-3, and "B" for 4-6. Furthermore, you can't assign placeholders like that. They are really capture groups that you are backreferencing.
To reach the results you've shown you would need to write a small program that would allow you to check the captured values and perform the appropriate replacements.
EDIT: here's an example of how to achieve this in C#
var numToWordMap = new Dictionary<int, string>();
numToWordMap[1] = "A_One";
numToWordMap[2] = "A_Two";
numToWordMap[3] = "A_Three";
numToWordMap[4] = "B_One";
numToWordMap[5] = "B_Two";
numToWordMap[6] = "B_Three";
string pattern = #"\bMyLabel_(\d+)\b";
string filePath = #"C:\temp.txt";
string[] contents = File.ReadAllLines(filePath);
for (int i = 0; i < contents.Length; i++)
{
contents[i] = Regex.Replace(contents[i], pattern,
m =>
{
int num = int.Parse(m.Groups[1].Value);
if (numToWordMap.ContainsKey(num))
{
return "Label_" + numToWordMap[num];
}
// key not found, use original value
return m.Value;
});
}
File.WriteAllLines(filePath, contents);
You should be able to use this easily. Perhaps you can download LINQPad or Visual C# Express to do so.
If your files are too large this might be an inefficient approach, in which case you could use a StreamReader and StreamWriter to read from the original file and write it to another, respectively.
Also be aware that my sample code writes back to the original file. For testing purposes you can change that path to another file so it isn't overwritten.
Bar bar bar - Notepad++ thinks you're a barbarian.
(obsolete - see update below.) No vertical bars in Notepad++ regex - sorry. I forget every few months, too!
Use [123456] instead.
Update: Sorry, I didn't read carefully enough; on top of the barhopping problem, #Ahmad's spot-on - you can't do a mapping replacement like that.
Update: Version 6 of Notepad++ changed the regular expression engine to a Perl-compatible one, which supports "|". AFAICT, if you have a version 5., auto-update won't update to 6. - you have to explicitly download it.
A regular expression search and replace for
MyLabel_((01)|(02)|(03)|(04)|(05)|(06))
with
Label_(?2A_One)(?3A_Two)(?4A_Three)(?5B_One)(?6B_Two)(?7B_Three)
works on Notepad 6.3.2
The outermost pair of brackets is for grouping, they limit the scope of the first alternation; not sure whether they could be omitted but including them makes the scope clear. The pattern searches for a fixed string followed by one of the two-digit pairs. (The leading zero could be factored out and placed in the fixed string.) Each digit pair is wrapped in round brackets so it is captured.
In the replacement expression, the clause (?4A_Three) says that if capture group 4 matched something then insert the text A_Three, otherwise insert nothing. Similarly for the other clauses. As the 6 alternatives are mutually exclusive only one will match. Thus only one of the (?...) clauses will have matched and so only one will insert text.
The easiest way to do this that I would recommend is to use AWK. If you're on Windows, look for the mingw32 precompiled binaries out there for free download (it'll be called gawk).
BEGIN {
FS = "_0";
a[1]="A_One";
a[2]="A_Two";
a[3]="A_Three";
a[4]="B_One";
a[5]="B_Two";
a[6]="B_Three";
}
{
printf("Label_%s\n", a[$2]);
}
Execute on Windows as follows:
C:\Users\Mydir>gawk -f test.awk awk.in
Label_A_One
Label_A_Two
Label_A_Three
Label_B_One
Label_B_Two
Label_B_Three

Regexp: Keyword followed by value to extract

I had this question a couple of times before, and I still couldn't find a good answer..
In my current problem, I have a console program output (string) that looks like this:
Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3
Now I want to extract those numbers and to check if there were failures. (That's a gacutil.exe output, btw.) In other words, I want to match any number [0-9]+ in the string that is preceded by 'failures = '.
How would I do that? I want to get the number only. Of course I can match the whole thing like /failures = [0-9]+/ .. and then trim the first characters with length("failures = ") or something like that. The point is, I don't want to do that, it's a lame workaround.
Because it's odd; if my pattern-to-match-but-not-into-output ("failures = ") comes after the thing i want to extract ([0-9]+), there is a way to do it:
pattern(?=expression)
To show the absurdity of this, if the whole file was processed backwards, I could use:
[0-9]+(?= = seruliaf)
... so, is there no forward-way? :T
pattern(?=expression) is a regex positive lookahead and what you are looking for is a regex positive lookbehind that goes like this (?<=expression)pattern but this feature is not supported by all flavors of regex. It depends which language you are using.
more infos at regular-expressions.info for comparison of Lookaround feature scroll down 2/3 on this page.
If your console output does actually look like that throughout, try splitting the string on "=" when the word "failure" is found, then get the last element (or the 2nd element). You did not say what your language is, but any decent language with string splitting capability would do the job. For example
gacutil.exe.... | ruby -F"=" -ane "print $F[-1] if /failure/"

Is there a RegEx that can parse out the longest list of digits from a string?

I have to parse various strings and determine a prefix, number, and suffix. The problem is the strings can come in a wide variety of formats. The best way for me to think about how to parse it is to find the longest number in the string, then take everything before that as a prefix and everything after that as a suffix.
Some examples:
0001 - No prefix, Number = 0001, No suffix
1-0001 - Prefix = 1-, Number = 0001, No suffix
AAA001 - Prefix = AAA, Number = 001, No suffix
AAA 001.01 - Prefix = AAA , Number = 001, Suffix = .01
1_00001-01 - Prefix = 1_, Number = 00001, Suffix = -01
123AAA 001_01 - Prefix = 123AAA , Number = 001, Suffix = _01
The strings can come with any mixture of prefixes and suffixes, but the key point is the Number portion is always the longest sequential list of digits.
I've tried a variety of RegEx's that work with most but not all of these examples. I might be missing something, or perhaps a RegEx isn't the right way to go in this case?
(The RegEx should be .NET compatible)
UPDATE: For those that are interested, here's the C# code I came up with:
var regex = new System.Text.RegularExpressions.Regex(#"(\d+)");
if (regex.IsMatch(m_Key)) {
string value = "";
int length;
var matches = regex.Matches(m_Key);
foreach (var match in matches) {
if (match.Length >= length) {
value = match.Value;
length = match.Length;
}
}
var split = m_Key.Split(new String[] {value}, System.StringSplitOptions.RemoveEmptyEntries);
m_KeyCounter = value;
if (split.Length >= 1) m_KeyPrefix = split(0);
if (split.Length >= 2) m_KeySuffix = split(1);
}
You're right, this problem can't be solved purely by regular expressions. You can use regexes to "tokenize" (lexically analyze) the input but after that you'll need further processing (parsing).
So in this case I would tokenize the input with (for example) a simple regular expression search (\d+) and then process the tokens (parse). That would involve seeing if the current token is longer than the tokens seen before it.
To gain more understanding of the class of problems regular expressions "solve" and when parsing is needed, you might want to check out general compiler theory, specifically when regexes are used in the construction of a compiler (e.g. http://en.wikipedia.org/wiki/Book:Compiler_construction).
You're input isn't regular so, a regex won't do. I would iterate over the all groups of digits via (\d+) and find the longest and then build a new regex in the form of (.*)<number>(.*) to find your prefix/suffix.
Or if you're comfortable with string operations you can probably just find the start and end of the target group and use substr to find the pre/suf fix.
I don't think you can do this with one regex. I would find all digit sequences within the string (probably with a regex) and then I would select the longest with .NET code, and call Split().
This depends entirely on your Regexp engine. Check your Regexp environment for capturing, there might be something in it like the automatic variables in Perl.
OK, let's talk about your question:
Keep in mind, that both, NFA and DFA, of almost every Regexp engine are greedy, this means, that a (\d+) will always find the longest match, when it "stumbles" over it.
Now, what I can get from your example, is you always need middle portion of a number, try this:
/^(.*\D)?(\d+)(\D.*)?$/ig
The now look at variables $1, $2, $3. Not all of them will exist: if there are all three of them, $2 will hold your number in question, the other vars, parts of the prefix. when one of the prefixes is missing, only variable $1 and $2 will be set, you have to see for yourself, which one is the integer. If both prefix and suffix are missing, $1 will hold the number.
The idea is to make the engine "stumble" over the first few characters and start matching a long number in the middle.
Since the modifier /gis present, you can loop through all available combinations, that the machine finds, you can then simply take the one you like most or something.
This example is in PCRE, but I'm sure .NET has a compatible mode.

Regex to replace string with another string in MS Word?

Can anyone help me with a regex to turn:
filename_author
to
author_filename
I am using MS Word 2003 and am trying to do this with Word's Find-and-Replace. I've tried the use wildcards feature but haven't had any luck.
Am I only going to be able to do it programmatically?
Here is the regex:
([^_]*)_(.*)
And here is a C# example:
using System;
using System.Text.RegularExpressions;
class Program
{
static void Main()
{
String test = "filename_author";
String result = Regex.Replace(test, #"([^_]*)_(.*)", "$2_$1");
}
}
Here is a Python example:
from re import sub
test = "filename_author";
result = sub('([^_]*)_(.*)', r'\2_\1', test)
Edit: In order to do this in Microsoft Word using wildcards use this as a search string:
(<*>)_(<*>)
and replace with this:
\2_\1
Also, please see Add power to Word searches with regular expressions for an explanation of the syntax I have used above:
The asterisk (*) returns all the text in the word.
The less than and greater than symbols (< >) mark the start and end
of each word, respectively. They
ensure that the search returns a
single word.
The parentheses and the space between them divide the words into
distinct groups: (first word) (second
word). The parentheses also indicate
the order in which you want search to
evaluate each expression.
Here you go:
s/^([a-zA-Z]+)_([a-zA-Z]+)$/\2_\1/
Depending on the context, that might be a little greedy.
Search pattern:
([^_]+)_(.+)
Replacement pattern:
$2_$1
In .NET you could use ([^_]+)_([^_]+) as the regex and then $2_$1 as the substitution pattern, for this very specific type of case. If you need more than 2 parts it gets a lot more complicated.
Since you're in MS Word, you might try a non-programming approach. Highlight all of the text, select Table -> Convert -> Text to Table. Set the number of columns at 2. Choose Separate Text At, select the Other radio, and enter an _. That will give you a table. Switch the two columns. Then convert the table back to text using the _ again.
Or you could copy the whole thing to Excel, construct a formula to split and rejoin the text and then copy and paste that back to Word. Either would work.
In C# you could also do something like this.
string[] parts = "filename_author".Split('_');
return parts[1] + "_" + parts[0];
You asked about regex of course, but this might be a good alternative.