Windows Batch File Regular Expression - regex

I have the following requirement that needs to be met in a .bat file. Can someone please help?
There is a string, ABCD-1234 TEST SENTENCE, in a variable, say str. I want to check whether the string starts with the format [A-Z]*-[0-9] * or not.
How can I achieve this? I tried various regular expressions using FINDSTR, but couldn't get the desired result.
Example:
set str=ABCD-1234 TEST SENTENCE
echo %str% | findstr /r "^[A-Z]*-[0-9] *"

I'm assuming you are looking for strings that begin with 1 or more upper case letters, followed by a dash, followed by 1 or more digits, followed by a space.
If the string might contain poison characters like &, <, > etc., then you really should use delayed expansion.
FINDSTR regex is totally non-standard. For example, [A-Z] does not properly represent uppercase letters to FINDSTR; it also includes most of the lowercase letters, as well as some non-English characters. You must explicitly list all uppercase letters. The same is true for the digits.
A space is interpreted as a search string delimiter unless the /C:"search" option is used.
setlocal enableDelayedExpansion
echo(!str!|findstr /rc:"^[ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]*-[0123456789][0123456789]* "
You should have a look at What are the undocumented features and limitations of the Windows FINDSTR command?
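For comparison, here is a minimal sketch of the same check in Python, assuming a scripting language is available alongside the batch file. The function name has_ticket_prefix is hypothetical and not part of the original answer. Python's character ranges are plain code point ranges, so [A-Z] really does mean only the uppercase ASCII letters here.

```python
import re

# One or more uppercase ASCII letters, a dash, one or more digits, then a space.
PREFIX = re.compile(r"[A-Z]+-[0-9]+ ")


def has_ticket_prefix(text: str) -> bool:
    """Return True if text starts with something like 'ABCD-1234 '."""
    # match() only succeeds at the start of the string, like ^ in FINDSTR.
    return PREFIX.match(text) is not None


if __name__ == "__main__":
    print(has_ticket_prefix("ABCD-1234 TEST SENTENCE"))  # True
    print(has_ticket_prefix("abcd-1234 TEST SENTENCE"))  # False
    print(has_ticket_prefix("ABCD-1234"))                # False, no trailing space
```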

Exactly two capitalized words on a line

I want to create a regular expression which can replace lines that contain exactly two words beginning with an uppercase with the character 'X'.
I'm currently using this:
sed -e '/\b[A-Z][a-z]*\b c X /home/Morgan/desktop/test
The problem is the following: it only changes lines which contain 1 or more words described by the regular expression in my test.txt.
I don't know how to say that I want an X only on lines with exactly 2 words beginning with an uppercase letter. Either word can occur anywhere within the line.
My test.txt contains:
Bonjour oui oui Bonjour -> this must be replaced by X
Bonjour Bonjour Bonjour -> this mustn't
Bonjour Oui bonjour oui -> this must be replaced by X
You seem to be attempting to use the Perl/PCRE word boundary \b but typical sed implementations do not understand this regular expression dialect. By your problem description, you are looking for beginning and end of line, anyway; this is a very basic regex anchor which was introduced already in the original grep: ^ matches beginning of line, and $ matches end of line.
Without anchors, a regular expression will match anywhere in the line. To say "only two" you really must check the entire line and make sure there are not three or more of what you're looking for.
"Find a line with exactly two words which begin with uppercase" needs to be rephrased or massaged a bit before you can attempt to write a regex. If we -- provisionally, for this discussion -- define w to mean "word which does not begin with uppercase" and W to mean one which does, you want ^w*Ww*Ww*$ -- exactly two uppercase words, and zero or more non-uppercase words in any position before, between, or after them.
A word which begins with uppercase is [A-Z][a-z]* (this requires all the subsequent characters to be lowercase) and a word which doesn't is [a-z][a-z]* (or [a-z]\+ if your sed supports that regex variation).
Because words need spaces between them, an optional word expression needs to be parenthesized so you can say "zero or more of this entire sequence". Typically, sed regex requires grouping parentheses to be backslashed as well, though this differs between versions.
So, try this:
sed 's/^\([a-z][a-z]* \)*[A-Z][a-z]*\( [a-z][a-z]*\)* [A-Z][a-z]*\( [a-z][a-z]*\)*$/X/' file
If indeed you have GNU sed, this can be simplified a bit:
sed -r 's/^([a-z]+ )*[A-Z][a-z]*( [a-z]+)* [A-Z][a-z]*( [a-z]+)*$/X/' file
This definition of "word" might not be sufficient; perhaps you can refine it to suit your circumstances. In particular, the spacing is assumed to be regular (exactly one space between words; no leading or trailing whitespace on the lines) and no text may contain characters outside of spaces and the alphabetics a-z in upper or lower case. (Whether accented characters like è and Á are also considered alphabetics in this range depends on your locale settings. Maybe set LC_ALL=fr_FR.utf-8 in your script if French locale settings are important.)
Notice also how the sed substitution command requires exactly three delimiter characters -- traditionally, we use a slash, but you can use any punctuation character. The form is s/regex/replacement/flags where the regex, the replacement, and the flags can all be empty, but the s and the delimiters are always required.
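The w/W construction can also be sketched in Python, which makes it easy to try the pattern against the sample lines. This is only an illustration under the same assumptions as the sed command (single spaces, plain a-z letters); the names LOWER, UPPER and replace_line are made up for this example.

```python
import re

LOWER = r"[a-z]+"        # w: a word that does not begin with uppercase
UPPER = r"[A-Z][a-z]*"   # W: a word that begins with uppercase

# (w )* W ( w)* W ( w)* with exactly one space between words
EXACTLY_TWO = re.compile(
    rf"(?:{LOWER} )*{UPPER}(?: {LOWER})* {UPPER}(?: {LOWER})*"
)


def replace_line(line: str) -> str:
    """Return 'X' if the whole line has exactly two capitalized words."""
    # fullmatch() plays the role of the ^ and $ anchors in the sed version.
    return "X" if EXACTLY_TWO.fullmatch(line) else line


if __name__ == "__main__":
    for line in (
        "Bonjour oui oui Bonjour",
        "Bonjour Bonjour Bonjour",
        "Bonjour Oui bonjour oui",
    ):
        print(replace_line(line))
```

Only the first and third sample lines are replaced; the line with three capitalized words is left alone because the third word cannot be consumed by the lowercase-only groups.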

Find results with grep and write to file

I would like to get all the results with grep or egrep from a file on my computer.
I just worked out that a string of the form
'+33. ... ... ..' should be matched by the following regex:
'\+33.[0-9].[0-9].[0-9].[0-9].' Or is this not correct?
My grep command is:
grep '\+31.[0-9].[0.9].[0.9].[0-9]' Samsung\ GT-i9400\ Galaxy\ S\ II.xry >> resultaten.txt
The output file is only giving me as following:
"Binary file Samsung GT-i9400 .xry matches"
..... and no results were given.
Can someone help me please with getting the results and writing to a file?
Firstly, the default behavior of grep is to print the line containing a match. Because binary files do not contain lines, it only prints a message when it finds a match in a binary file. However, this can be overridden with the -a flag.
But then, you end up with the problem that the "lines" it prints are not useful. You probably want to add the -o option to only print the substrings which actually matched.
Finally, your regex isn't correct at all. The lone dot . is a metacharacter which matches any character, including a control character or other non-text character. Given the length of your regex, you are unlikely to catch false positives, but you might want to explain what you want the dot to match. I have replaced it with [ ._-] which matches a space and some punctuation characters which are common in phone numbers. Maybe extend or change it, depending on what interpunction you expect in your phone numbers.
In regular grep, a plus simply matches itself. With grep -E the syntax changes, and you need to backslash the plus to match it literally; without that option, the backslash is superfluous. It is actually wrong in this context in some dialects, including GNU grep, where a backslashed plus selects the extended meaning. That is of course a syntax error at the beginning of the pattern, where there is no preceding expression to repeat one or more times, but GNU grep silently ignores it rather than reporting an error.
On the other hand, your number groups are also wrong. [0-9] matches a single digit, where apparently the intention is to match multiple digits. For convenience, I will use the grep -E extension which enables + to match one or more repetitions of the previous expression. Then we also get access to ? to mark the punctuation expressions as optional.
Wrapping up, try this:
grep -Eao '\+33[0-9]+([ ._-]?[0-9]+){3}' \
'Samsung GT-i9400 Galaxy S II.xry' >resultaten.txt
In human terms, this requires a literal +33 followed by required additional digits, then followed by three number groups of one or more digits, each optionally preceded by punctuation.
This will overwrite resultaten.txt which is usually what you want; the append operation you had also makes sense in many scenarios, so change it back if that's actually what you want.
If each dot in your template +33. ... ... .. represents a required number, and the spaces represent required punctuation, the following is closer to what you attempted to specify:
\+33[0-9]([ ._-][0-9]{3}){2}[ ._-][0-9]{2}
That is, there is one required digit after 33, then two groups of exactly three digits and one of two, each group preceded by one non-optional spacing or punctuation character.
(Your exposition has +33 while your actual example has +31. Use whichever is correct, or perhaps allow any sequence of numbers for the country code, too.)
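If grep is not convenient, the same idea (treat the file as bytes and print only the matching substrings) can be sketched in Python. The pattern below mirrors the description above with [ ._-] as the optional separator. The sample bytes and the function name find_numbers are invented for illustration; for the real file you would pass the result of reading it in binary mode.

```python
import re

# Bytes pattern: literal +33, one or more digits, then three more digit
# groups, each optionally preceded by a space, dot, underscore or hyphen.
PHONE = re.compile(rb"\+33[0-9]+(?:[ ._-]?[0-9]+){3}")


def find_numbers(data: bytes) -> list:
    """Return every matching substring, similar in spirit to grep -ao."""
    return PHONE.findall(data)


if __name__ == "__main__":
    # Made-up binary sample with two phone-like strings embedded in junk.
    sample = b"\x00\x01junk+33612-345-67\xff\x00more+336 12 34 56\x02"
    for number in find_numbers(sample):
        print(number.decode("ascii"))
```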
That message means grep found a match, but the file you're grepping isn't a text file; it's a binary containing non-printable bytes. If you really want to grep that file, try:
strings Samsung\ GT-i9400\ Galaxy\ S\ II.xry | grep '+31.[0-9].[0-9].[0-9].[0-9]' >> resultaten.txt

Select all files that do not have string between 2 other strings

I have a set of files that I need to loop through to find all the files that do not have a specific string between 2 other specific strings. How can I do that?
I tried this but it didn't work:
grep -lri "\(stringA\).*\(?<!stringB\).*\(stringC\)" ./*.sql
EDIT:
the file could have structure as following:
StringA
StringB
StringA
StringC
All I want is to know whether there are any occurrences where stringA and stringC have no stringB in between.
You can use the -L option of grep to print all files which don't match and look for the specific combination of strings:
grep -Lri "\(stringA\).*\(stringB\).*\(stringC\)" ./*.sql
The short answer is along the lines of:
grep "abc[^(?:def)]*ghi" ./testregex
That's based on a testregex file like so:
abcghiabc
abcdefghi
abcghi
The output will be:
$ grep "abc[^(?:def)]*ghi" ./testregex
abcghiabc
abcghi
Mapped to your use-case, I'd wager this translates roughly to:
grep -lri "stringA[^(?:stringB)]*stringC" ./*.sql
Note that I've removed the ".*" between each string, since that will match the very string that you're attempting to exclude.
Update: The original question now calls out line breaks, so use grep's -z flag:
-z
suppress newline at the end of line, substituting it for null character. That is, grep knows where end of line is, but sees the input as one big line.
Thus:
grep -lriz "stringA[^(?:stringB)]*stringC" ./*.sql
When I first had to use this approach myself, I wrote up the following explanation...
Specifically: I wanted to match "any character, any number of times,
non-greedy (so defer to subsequent explicit patterns), and NOT
MATCHING THE SEQUENCE />".
The last part is what I'm writing to share: "not matching the sequence
/>". This is the first time I've used character sequences combined
with "any character" logic.
My target string:
<img class="photo" src="http://d3gqasl9vmjfd8.cloudfront.net/49c7a10a-4a45-4530-9564-d058f70b9e5e.png" alt="Iron or Gold" />
My first attempt:
<img.*?class="photo".*?src=".*?".*?/>
This worked in online regex testers, but failed for some reason within
my actual Java code. Through trial and error, I found that replacing
every ".*?" with "[^<>]*?" was successful. That is, instead of
"non-greedy matching of any character", I could use "non-greedy
matching of any character except < or >".
But, I didn't want to use this, since I've seen alt text which
includes these characters. In my particular case, I wanted to use the
character sequence "/>" as the exclusion sequence -- once that
sequence was encountered, stop the "any character" matching.
This brings me to my lesson:
Part 1: Character sequences can be achieved using (?:regex). That is,
use the () parenthesis as normal for a character sequence, but prepend
with "?:" in order to prevent the sequence from being matched as a
target group. Ergo, "(?:/>)" would match "/>", while "(?:/>)*" would
match "/>/>/>/>".
Part 2: Such character sequences can be used in the same manner as
single characters. That is, "[^(?:/>)]*?" will match any character
EXCEPT the sequence "/>", any number of times, non-greedy.
That's pretty much it. The keywords for searching are "non-capturing
groups" and "negative lookahead|lookbehind", and the latter feature
goes much deeper than I've gone so far, with additional flags that I
don't yet grok. But the initial understanding gave me the tool I
needed for my immediate task, and it's a feature that I've wondered
about for awhile -- thus, I figured I'd share the basic introduction
in case any of you were curious about tucking it away in your toolset.
After playing around with the statement provided by DreadPirateShawn:
stringA[^(?:stringB)]*stringC
I figured out that it is not a truly valid regex. This statement was excluding every character in the given set and not the full string. So I continued digging.
After some googling and testing the pattern, I came up with the following statement, that seems to fit my needs:
stringA\s*\t*(?:(?!stringB).)*\s*\t*stringC
This pattern matches any text except the provided string between 2 specified strings. It also takes into consideration whitespace characters.
There is more testing to be done, but it seems that this pattern fits my requirements.
UPDATE: Here is a final version of the statement that seems to work for me:
grep -lriz "(set feedback on){0,}[ \t]*(?:(?!set feedback off).)*[ \t]*select sysdate from dual" ./*.sql

Using regular expressions in findstr

I'm trying to implement a hook script in Subversion, using findstr with a regular expression. The intent is to enforce the inclusion of an entry in the log message that matches the format used by our issue tracking tool (Atlassian JIRA). Our issues each consist of 4 to 6 capital letters and 2 to 4 numerals, separated by a hyphen (e.g., "TEST-554" or "CMMGT-392"). Per instructions in the Subversion documentation, I've created a batch file to check the log message for a correctly-formatted entry, using the regex
findstr ([A-Z]{3,6}\-[0-9]{2,4}) > nul
I've tested the regex in a number of testing tools and it seems to work, but when I run it as part of the hook script, it fails to return a match. As a sort of "control", I tried using the regex
findstr ...... > nul
and was able to find a match. Anyone see where I'm going wrong?
findstr requires the /R option to use regular expressions, but it doesn't support extended regular expressions, so things like counts ({3,6}) don't work. Also, zero-or-one matches (?) don't work, so doing what you want will get pretty verbose. Also, English Windows collation means that [A-Z] matches 'A', 'b', 'B', 'z', and 'Z', but not 'a'. Here's something that might work:
findstr /R "[ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9] [ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9][0-9] [ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9][0-9][0-9] [ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9] [ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9][0-9] [ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9][0-9][0-9] [ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9] [ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9][0-9] [ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9][0-9][0-9] [ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9] [ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9][0-9] [ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]-[0-9][0-9][0-9][0-9]"
This incredibly verbose command may exceed the maximum command length of the shell (haven't checked), but basically does what you want by containing a separate match for each of the permutations of letter and number counts. That's another odd thing about findstr: unless you use the /C option, spaces in your match string will be used to separate it into individual match expressions.
If you have any option besides findstr such as PowerShell, Python, or even VBScript, I would suggest you use it. Good luck!
EDIT: Here's the Perl one-liner I used to generate the above command:
perl -le 'BEGIN{$\=" "}for $x (3..6){for $y (2..4){print join("","[",A..Z,"]") x $x, "-", "[0-9]" x $y}}'
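If Python is one of the options available to the hook, the counted repetitions that findstr lacks can be written directly. The following is a minimal sketch, not a complete hook script: the name mentions_issue is hypothetical, the 3 to 6 letter range is taken from the question's regex, and the \b word boundaries are an added assumption so that longer runs of letters or digits are not accepted.

```python
import re

# 3 to 6 uppercase letters, a hyphen, 2 to 4 digits, as in the question's regex.
# The \b boundaries are an assumption added for this sketch.
ISSUE_KEY = re.compile(r"\b[A-Z]{3,6}-[0-9]{2,4}\b")


def mentions_issue(log_message: str) -> bool:
    """Return True if the log message contains something like TEST-554."""
    return ISSUE_KEY.search(log_message) is not None


if __name__ == "__main__":
    print(mentions_issue("Fixes TEST-554: handle empty input"))  # True
    print(mentions_issue("fixes test-554"))                      # False
    print(mentions_issue("no issue key here"))                   # False
```

A real hook would still need to obtain the log message and exit with a non-zero status when the check fails; those parts depend on the Subversion setup and are left out here.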

regex unicode character in vim

I'm being an idiot.
Someone cut and pasted some text from Microsoft Word into my lovely HTML files.
I now have these Unicode characters instead of regular quote symbols (i.e. quotes appear as <92> in the text).
I want to do a regex replace but I'm having trouble selecting them.
:%s/\u92/'/g
:%s/\u5C/'/g
:%s/\x92/'/g
:%s/\x5C/'/g
...all fail. My google-fu has failed me.
From :help regexp (lightly edited), you need to use some specific syntax to select unicode characters with a regular expression in Vim:
\%u match specified multibyte character (eg \%u20ac)
That is, to search for the unicode character with hex code 20AC, enter this into your search pattern:
\%u20ac
The full table of character search patterns includes some additional options:
\%d match specified decimal character (eg \%d123)
\%x match specified hex character (eg \%x2a)
\%o match specified octal character (eg \%o040)
\%u match specified multibyte character (eg \%u20ac)
\%U match specified large multibyte character (eg \%U12345678)
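To see how a given character maps onto the atoms in this table, here is a small Python sketch that builds the corresponding Vim pattern text from a character's code point. The helper name to_vim_atom is made up for illustration; it simply picks \%x, \%u or \%U depending on how many hex digits the code point needs.

```python
def to_vim_atom(ch: str) -> str:
    """Build a Vim pattern atom (\\%x, \\%u or \\%U) for one character."""
    cp = ord(ch)
    if cp <= 0xFF:
        return f"\\%x{cp:02x}"    # up to two hex digits
    if cp <= 0xFFFF:
        return f"\\%u{cp:04x}"    # up to four hex digits
    return f"\\%U{cp:08x}"        # up to eight hex digits


if __name__ == "__main__":
    print(to_vim_atom("*"))        # \%x2a
    print(to_vim_atom("\u20ac"))   # \%u20ac
    print(to_vim_atom("\x92"))     # \%x92
```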
This solution might not address the problem as originally stated, but it does address a different but very closely related one and I think it makes a lot of sense to place it here.
I don't know in which version of Vim it was implemented, but I was working on 7.4 when I tried it.
When in Insert mode, the sequence to output Unicode characters is: ctrl-v u xxxx where xxxx is the code point. For instance, outputting the euro sign would be ctrl-v u 20ac.
I tried it in command-line mode as well and it worked. That is, to replace all instances of "20 euro" in my document with "20 €", I'd do:
:%s/20 euro/20 <ctrl-v u 20ac>/gc
In the above <ctrl-v u 20ac> is not literal, it's the sequence of keys that will output the € character.