Using regular expressions in findstr

I'm trying to implement a hook script in Subversion, using findstr with a regular expression. The intent is to enforce the inclusion of an entry in the log message that matches the format used by our issue tracking tool (Atlassian JIRA). Our issues each consist of 4 to 6 capital letters and 2 to 4 numerals, separated by a hyphen (e.g., "TEST-554" or "CMMGT-392"). Per instructions in the Subversion documentation, I've created a batch file to check the log message for a correctly-formatted entry, using the regex
findstr ([A-Z]{3,6}\-[0-9]{2,4}) > nul
I've tested the regex in a number of testing tools and it seems to work, but when I run it as part of the hook script, it fails to return a match. As a sort of "control", I tried using the regex
findstr ...... > nul
and was able to find a match. Anyone see where I'm going wrong?

findstr requires the /R option to use regular expressions, but it doesn't support extended regular expressions, so things like counts ({3,6}) don't work. Also, zero-or-one matches (?) don't work, so doing what you want will get pretty verbose. Also, English Windows collation means that [A-Z] matches 'A', 'b', 'B', 'z', and 'Z', but not 'a'. Here's something that might work:
findstr /R "[ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9] [ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9][0-9] [ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9][0-9][0-9] [ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9] [ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9][0-9] [ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9][0-9][0-9] [ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9] [ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9][0-9] [ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9][0-9][0-9] [ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9] [ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9][0-9] [ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9][0-9][0-9]"
This incredibly verbose command may exceed the shell's maximum command length (I haven't checked), but it basically does what you want by containing a separate match expression for each combination of letter and digit counts. That's another odd thing about findstr: unless you use the /C option, spaces in your match string separate it into individual match expressions.
If you have any option besides findstr such as PowerShell, Python, or even VBScript, I would suggest you use it. Good luck!
EDIT: Here's the Perl one-liner I used to generate the above command:
perl -le 'BEGIN{$\=" "}for $x (3..6){for $y (2..4){print join("","[",A..Z,"]") x $x, "-", "[0-9]" x $y}}'
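If Python is an option on the server, as suggested above, the check could look roughly like the sketch below. The names ISSUE_KEY, has_issue_key and findstr_patterns are made up for illustration, and the key format is taken from the regex in the question (3 to 6 capitals, 2 to 4 digits), not from JIRA itself. The second function just rebuilds the same twelve findstr expressions as the Perl one-liner.

```python
import re

# Assumed key format, copied from the question's regex: 3-6 capital letters,
# a hyphen, then 2-4 digits. \b stops a longer run of letters or digits from
# matching partially; re.ASCII keeps \b tied to ASCII word characters.
ISSUE_KEY = re.compile(r"\b[A-Z]{3,6}-[0-9]{2,4}\b", re.ASCII)


def has_issue_key(log_message):
    """Return True if the log message contains something shaped like an issue key."""
    return ISSUE_KEY.search(log_message) is not None


def findstr_patterns():
    """Rebuild the expressions produced by the Perl one-liner, in the same order."""
    letters = "[ABCDEFGHIJKLMNOPQRSTUVWXYZ]"
    return [
        letters * x + "-" + "[0-9]" * y
        for x in range(3, 7)   # 3..6 letters
        for y in range(2, 5)   # 2..4 digits
    ]
```

A hook script would read the log message, call has_issue_key on it, and exit non-zero when it returns False. " ".join(findstr_patterns()) gives the space-separated string to pass to findstr /R.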

Related

Command Line findstr with a regular expression

I need to search through all the files in a directory and sub directories to match any of the numbers in the reg exp. Basically in our code we have blocks of code based on certain project numbers. I need to find these blocks by project number. This regular expression does what I need but I cannot get it to work at the command line
([^0-9]|^)(56|14|2)([^0-9]|$)
I tested this on https://www.freeformatter.com/regex-tester.html against this string "If session.projid = 56 and then again 14 or something else"
I am trying this at the command line
findstr /s /R /C:"([^0-9]|^)(56|14|2)([^0-9]|$)" *.*
But no results and I know there should be. Thanks in advance for any help on this.
See these docs:
FINDSTR does not support alternation with the pipe character (|). Multiple regular expressions can be separated with spaces, just the same as separating multiple words (assuming you have not specified a literal search with /C), but this might not be useful if the regex itself contains spaces.
In your case, you may use \< / \> word boundaries with each number and you may specify all your alternatives after a space:
findstr /s /r "\<56\> \<14\> \<2\>" *.*
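For comparison, here is a small Python sketch of what the original regex was meant to do. It is only an illustration: project_numbers is a hypothetical helper, and I have swapped the ([^0-9]|^) and ([^0-9]|$) groups for lookarounds so that two numbers separated by a single character are both found. Note that this is slightly looser than the \< \> version above, because "proj2" counts as a hit here but is not a whole word for findstr.

```python
import re

# A project number that is not directly preceded or followed by another digit.
# Lookarounds are used instead of consuming the neighbouring characters.
PROJECT_NUMBER = re.compile(r"(?<![0-9])(56|14|2)(?![0-9])")


def project_numbers(line):
    """Return the project numbers (as strings) found in one line of text."""
    return PROJECT_NUMBER.findall(line)
```

Running it over the test string from the question returns both 56 and 14, while a value such as 256 is ignored.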

Windows Batch File Regular Expression

I have the following requirement that needs to be implemented in a .bat file. Can someone please help?
There is a string, ABCD-1234 TEST SENTENCE in a variable, say str. Now I want to check if the string starts with format [A-Z]*-[0-9] * or not.
How can I achieve this? I tried various regular expression using FINDSTR, but couldn't get the desired result.
Example:
set str=ABCD-1234 TEST SENTENCE
echo %str% | findstr /r "^[A-Z]*-[0-9] *"
I'm assuming you are looking for strings that begin with 1 or more upper case letters, followed by a dash, followed by 1 or more digits, followed by a space.
If the string might contain poison characters like &, <, > etc., then you really should use delayed expansion.
FINDSTR regex is totally non-standard. For example, [A-Z] does not properly represent uppercase letters to FINDSTR, it also includes most of the lowercase letters, as well as some non-English characters. You must explicitly list all uppercase letters. The same is true for the numbers.
A space is interpreted as a search string delimiter unless the /C:"search" option is used.
setlocal enableDelayedExpansion
echo(!str!|findstr /rc:"^[ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]*-[0123456789][0123456789]* "
You should have a look at What are the undocumented features and limitations of the Windows FINDSTR command?
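To make the intended match explicit, here is the same check as a Python sketch (has_ticket_prefix is a made-up name, and this assumes the interpretation given above: one or more capitals, a dash, one or more digits, then a space). Unlike FINDSTR, Python's [A-Z] is a plain code point range, so it covers exactly the 26 ASCII capitals.

```python
import re

# One or more ASCII capitals, a dash, one or more digits, then a space.
PREFIX = re.compile(r"[A-Z]+-[0-9]+ ")


def has_ticket_prefix(text):
    """Return True if text starts with the assumed LETTERS-DIGITS<space> prefix."""
    # re.match only matches at the start of the string, like ^ in the findstr pattern.
    return PREFIX.match(text) is not None
```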

findstr query including tab character

I'm trying to use findstr in place of grep on a barebones vanilla windows box (which is sadly a requirement). I have some relatively large files (1Gb+), and I would like to extract those lines which don't include MX, MXnn, BR, and BRnn delimited by tabs. If I were writing a 'real' regex, then
\t(MX|BR)(..)?\t
would cover it. I don't mind doing it in two stages, but I can't for the life of me seem to include the delimiter tabs.
So far I have:
findstr /V MX source.txt >> temp.txt
findstr /V BR temp.txt >> dest.txt
which due to the nature of the data does an ok-ish job, but I would really rather use something like:
findstr /R /V "\t(MX|BR)(..)?\t" source.txt >> dest.txt
I've tried double slashes, escape sequences etc. but seem to be running around in circles.
I'm loath to resort to VBScript if I can help it.
Any ideas, given limitations of vanilla windows?
EDIT
I've looked into generating an exclusion file using the /G option, but generating might start to become problematic, once the users cotton on to the possibilities - a regex would just be a lot easier.
A possible solution from the command line or in a batch file is using:
%SystemRoot%\System32\findstr.exe /V /R /C:"\<BR[0-9]*\>" /C:"\<MX[0-9]*\>" "source.txt"
Because of /V, FINDSTR outputs only the lines of source.txt that match neither search term. Because of /R, the two terms \<BR[0-9]*\> and \<MX[0-9]*\> are interpreted as regular expressions, and FINDSTR combines them with a logical OR. The search is case-sensitive, and because of \< and \> each term matches BR or MX followed by 0 or more digits only as an entire word.
This might already be enough to filter source.txt correctly. But it also filters out lines containing BR[0-9]* or MX[0-9]* surrounded by word-delimiting characters other than horizontal tabs.
It is possible to use in a batch file:
%SystemRoot%\System32\findstr.exe /V /R /C:"[ ]BR[0-9]*[ ]" /C:"[ ]MX[0-9]*[ ]" "source.txt"
ATTENTION: There must be 1 horizontal tab character in the batch file inside each of the 4 pairs of square brackets. Browsers display those 4 tab characters as 1 or more spaces according to the HTML specification.
Open a command prompt window and run findstr /? for more information about FINDSTR.
And perhaps read also the Stack Overflow article
What are the undocumented features and limitations of the Windows FINDSTR command?
As far as I can see, there is no syntax to specify a horizontal tab directly.
Findstr regex seems pretty basic; it doesn't have \s, \t, \d and the like :-).
However you can use an input file to specify your search pattern. Inside this file you can use tabs literally.
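The example from your original post "\t(MX|BR)(..)?\t" cannot be carried over as is, because findstr supports neither grouping with ( ), nor alternation with |, nor the ? quantifier. Instead, put one pattern per line in the file, with a literal tab typed and saved wherever <TAB> is shown:
<TAB>MX<TAB>
<TAB>BR<TAB>
<TAB>MX..<TAB>
<TAB>BR..<TAB>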
Then you would use findstr with something like:
findstr /R /G:patternFileWithTabs.txt sourceFile.txt
Also you can get the job done most of the time by specifying an exclusive pattern.
If you exclude all alphanumeric, common separator, other white spaces chars, likely the only thing left is a tab.
For example I've been searching for a sequence like in default regex:
"\t\tUnknown\t\t\t\t0\t"
In my use case I could grep it with findstr like:
findstr /R /C:"[^ a-z0-9][^ a-z0-9]Unknown[^ a-z0-9]*0[^ a-z0-9]" logfile.txt
Of course it depends on the actual data you have. In theory the pattern would match also other strings, but these other strings don't occur in my source file, so it works.
Most of the time you don't need a 100% bullet proof pattern.
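For reference, if a scripting language does turn out to be available on the box (the question says it is not, so treat this as an assumption), the original regex can be used unchanged. The Python sketch below streams the file line by line, so the 1Gb+ size should not be a memory problem. lines_without_tags and filter_file are hypothetical names.

```python
import re

# The 'real' regex from the question: a tab, MX or BR, an optional
# two-character suffix, and a closing tab.
TAGGED = re.compile(r"\t(MX|BR)(..)?\t")


def lines_without_tags(lines):
    """Yield only the lines that do NOT contain a tab-delimited MX/BR field."""
    for line in lines:
        if TAGGED.search(line) is None:
            yield line


def filter_file(src_path, dst_path):
    """Copy src_path to dst_path, dropping the tagged lines (like findstr /V)."""
    # Iterating over the file object reads one line at a time.
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        dst.writelines(lines_without_tags(src))
```

Calling filter_file("source.txt", "dest.txt") would then play the role of the findstr /V command.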

bash 2.0 string matching

I'm on GNU bash, version 2.05b.0(1)-release (2002). I'd like to determine whether the value of $1 is a path in one of those /path/*.log rules in, say, /etc/logrotate.conf. It's not my box so I can't upgrade it.
Edit: my real goal is given /path/actual.log answer whether it is already governed by logrotate or if all the current rules miss it. I wonder then if my script should just run logrotate -d /etc/logrotate.conf and see if /path/actual.log is in the output. This seems simpler and covers all the cases as opposed to this other approach.
But I still want to know how to approach string matching in Bash 2.0 in general...
the line itself can start with some white space or none
it's not a match if it is in a commented line (comments are lines where the first non white space char is #)
there can be one or more paths on the same line to the left of $1
like if $1 is /my/path/*.log and the line in question is
/other/path*.log /yet/another.log /my/path/*.log {
there can be one or more paths to the right as well
the line itself can end with { and even more white space or not
paths can be contained in double-quotes or not
it can be assumed that the file is a valid logrotate conf file.
I have something that seems to work in Bash 4 but not in Bash 2.05. Where can I go to read what Bash 2.0 supports? How would this matching be checked in Bash 2.0?
You can find a terse bash changelog here.
You'll see that =~, the regex-matching operator, didn't get introduced until version 3.0.
Thus, your best bet is to use a utility to perform the regex matching for you; e.g.:
if grep -Eq '<your-extended-regex>' <<<"$1"; then ...
grep -Eq '<your-extended-regex>' <<<"$1":
IS like [[ $1 =~ <your-extended-regex> ]] in Bash 3.0+ in that its exit code indicates whether the literal value of $1 matches the extended regex <your-extended-regex>
Note that Bash 3.2 changed the interpretation of the RHS to treat quoted (sub)strings as literals.
Also note that grep -E may support a slightly different regular-expression dialect.
is NOT like it in that the grep solution cannot return capture groups; by contrast, Bash 3.0+ provides the overall match and capture groups via special array variable ${BASH_REMATCH[@]}.
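If the box happens to have Python (not stated in the question, so this is an assumption), the matching rules listed in the question can also be written out directly. This is only a rough sketch: governing_patterns is a made-up name, it assumes the closing } of each rule sits on its own line, and fnmatch lets * match / as well, which is looser than shell globbing.

```python
import fnmatch
import shlex


def governing_patterns(conf_text, log_path):
    """Return the path patterns in a logrotate config that match log_path."""
    hits = []
    inside_block = False  # True while between '{' and '}' of a rule
    for line in conf_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comments (first non-blank character is '#').
        if not stripped or stripped.startswith("#"):
            continue
        if inside_block:
            # Directives and scripts inside a rule are never path lists.
            if stripped == "}":
                inside_block = False
            continue
        opens_block = stripped.endswith("{")
        if opens_block:
            stripped = stripped[:-1]
        # Global directives such as "weekly" do not start with a path.
        if stripped.startswith(("/", '"', "'")):
            # shlex.split handles paths wrapped in quotes.
            for pattern in shlex.split(stripped):
                if fnmatch.fnmatchcase(log_path, pattern):
                    hits.append(pattern)
        if opens_block:
            inside_block = True
    return hits
```

An empty result would mean that none of the rules cover the given log file. The logrotate -d idea from the question is probably still the more reliable check, since it lets logrotate do its own parsing.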

Case insensitive regex for non-English characters

I need to perform a regular expression match on text that includes non-English characters (Spanish, French, German, and Russian).
I want the match to ignore case, so with English characters I would just use the /i modifier, but that doesn't work with words like übermäßig.
What is the simplest way to write a regex that will match both, say, übermäßig and ÜBERMÄßig? And can the same approach be used to convert upper case non-English letters to their lowercase equivalents in Perl?
It works perfectly fine
$ perl -E'use utf8; say "ÜBERMÄẞIG" =~ /^übermäßig\z/i ? "match" : "no match"'
match
$ perl -E'use utf8; say "ÜBERMÄSSIG" =~ /^übermäßig\z/i ? "match" : "no match"'
match
(The use utf8; says the source code is encoded using UTF-8. It would be impossible to have those characters in the script any other way.)
I suspect an encoding problem, meaning you think you gave Perl "ß" when you didn't. It could also be that you're using an older version of Perl that can't handle multi-char folds correctly. Generally speaking, it could help to use /u, but it shouldn't make a difference for this example.
The /i modifier works nicely if the strings use Perl's internal encoding.
For example, this prints "yes":
perl -le 'use utf8; print "yes" if "ÜBERMäßig" =~ /überMÄßiG/i'
The "use utf8" tells Perl that my source code is encoded in UTF-8, and therefore Perl decodes all literal strings in my source code from UTF-8 into its internal encoding. This example will not work without use utf8.
If your strings come from somewhere else then you may need to apply Encode::decode -- or tell your source to generate properly decoded strings (e.g. possible with most DBI drivers).
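The same two points (decode first, then compare case-insensitively) can be illustrated outside Perl. The sketch below is a Python analogue, not Perl, and same_ignoring_case is a made-up helper. It compares whole strings with str.casefold(), which applies full case folding, so ß, ẞ and SS all fold to "ss". Python's re.IGNORECASE, by contrast, does not do multi-character folds.

```python
import re


def same_ignoring_case(a, b):
    """Compare two already-decoded strings using full Unicode case folding."""
    return a.casefold() == b.casefold()


def decode_utf8(raw):
    """Turn UTF-8 bytes from an external source into a proper text string."""
    return raw.decode("utf-8")


def regex_matches_ignoring_case(pattern, text):
    """Case-insensitive whole-string regex match; no multi-character folds."""
    return re.fullmatch(pattern, text, re.IGNORECASE) is not None
```

With these, same_ignoring_case("ÜBERMÄSSIG", "übermäßig") is true, while the regex variant with the same two strings is not, because a single ß in the pattern cannot match the two characters SS.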
It works for me. Do you need to use utf8;, maybe?
(Disclaimer: I don't know Perl.)
If you set the locale to the appropriate value in your Perl script, then the /i modifier will work on non-English characters--as will other features like regex matching of word boundaries and the uc and lc functions.
Note that if you need to handle multiple foreign character sets, the linked documentation shows you how to switch locales within your script as needed, using setlocale().
Edit: I should have mentioned that this method is deprecated in most cases. Things should just work with UTF-8. But it can still be useful sometimes.
use locale;
use POSIX qw(locale_h);
setlocale (LC_ALL, $locale{German}) or die "failed to load locale!";