Match "war" but not "software" - regex

What regular expression would match an occurrence of the string "war" that doesn't have the string "soft" in front of it? In other words the "war" in "world war iii" would match, but the "war" in "where's my software" would not? Further, "stay out of my warehouse" would match, so would "aware". In other words, I just don't want the string "software" to match.

If your regex engine supports lookbehinds: (?<!soft)war.
You can try it using Perl:
$ perl -ne 'print if /(?<!soft)war/i' < my-text-file
That will work more or less like grep (plain grep does not support lookbehinds, although GNU grep does with the -P option).
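To check the lookbehind outside Perl, here is a minimal Python sketch. The sample list and the helper name has_war are illustrative assumptions, not part of the answer above.

```python
import re

# Negative lookbehind: "war" must not be immediately preceded by "soft".
WAR_NOT_SOFTWARE = re.compile(r"(?<!soft)war", re.IGNORECASE)

def has_war(text: str) -> bool:
    """Return True if text contains a "war" that is not the one in "software"."""
    return WAR_NOT_SOFTWARE.search(text) is not None

samples = [
    "world war iii",
    "where's my software",
    "stay out of my warehouse",
    "aware",
]
results = {s: has_war(s) for s in samples}
for s, ok in results.items():
    print(f"{s!r}: {ok}")
```

Only the "software" sample should be rejected; the other three contain a "war" with no "soft" directly in front of it.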

Try this: '[^soft].*war'. It is close but not perfect; put in your string combinations and you will see why. [^soft] is a character class that matches any single character other than s, o, f, or t, so the pattern requires at least one character before "war" and returns nothing if the input is only 'war'. It also does not actually reject a preceding 'soft'. Can you use conditions in your program and combine this logic with a regex that matches the word war when it is not part of a larger substring, as melwil pointed out?
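The weaknesses of the character-class attempt can be seen in a short Python sketch (the variable names are hypothetical and only for illustration):

```python
import re

# [^soft] is ONE character that is not s, o, f, or t; it is not "not the word soft".
CHAR_CLASS_ATTEMPT = re.compile(r"[^soft].*war")

# False positive: the "m" in "my" satisfies [^soft], and ".*" then skips over "soft".
matches_software = CHAR_CLASS_ATTEMPT.search("where's my software") is not None

# False negative: a bare "war" has no character in front of it for [^soft] to consume.
matches_bare_war = CHAR_CLASS_ATTEMPT.search("war") is not None

print(matches_software, matches_bare_war)
```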

Related

How to match string in between two words, but only the "closest" of the two words?

I am new to regex and am trying to capture a certain pattern. I want to capture everything between two words (name1 and host). The problem is that the text in between sometimes contains 'name1' itself; when it does, the match runs from the earlier name1 to the next 'host', so strings belonging to two different 'name1' occurrences are captured together.
This is the example I have:
name1{want-this-string}host,name1{want-this-string}host,name1{dont-want-this-string},name1{dont-want-this-either}name1{want-this-string}host
and this is the regex I'm using right now..
(?<=\bname1\b).*?(?=\bhost\b)
My expected output is that it matches the 3 {want-this-string}, and not the {dont-want-this} stuff. so basically:
{want-this-string}{want-this-string}{want-this-string}
But right now it's grabbing the first two {want-this-string} and then this whole section:
{dont-want-this-string},name1{dont-want-this-either}name1{want-this-string}
If you have a GNU grep, you may use
grep -oP '\bname1\{\K[^{}]*(?=}host\b)' file
With pcregrep (you may install it on MacOS if you are using that OS), you may use it like
pcregrep -oM '\bname1\{\K[^{}]*(?=}host\b)' file
Details
\bname1\{ - whole word name1 and a { after
\K - match reset operator that discards the text matched so far from the reported match
[^{}]* - 0 or more chars other than { and }
(?=}host\b) - there must be a }host as a whole word immediately to the right of the current location.
See the online grep demo:
s="name1{want-this-string}host,name1{want-this-string}host,name1{dont-want-this-string},name1{dont-want-this-either}name1{want-this-string}host"
grep -oP '\bname1\{\K[^{}]*(?=}host\b)' <<< "$s"
Output:
want-this-string
want-this-string
want-this-string
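Python's built-in re module has no \K, so a rough equivalent (a sketch, not part of the original answer) puts the wanted text in a capture group instead:

```python
import re

s = (
    "name1{want-this-string}host,name1{want-this-string}host,"
    "name1{dont-want-this-string},name1{dont-want-this-either}"
    "name1{want-this-string}host"
)

# Group 1 plays the role of "everything after \K": the text inside the braces.
# [^{}]* cannot cross a brace, so a match can never span two name1 blocks.
BETWEEN = re.compile(r"\bname1\{([^{}]*)\}host\b")

matches = BETWEEN.findall(s)
for m in matches:
    print(m)
```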
I'm not sure this pattern would handle all desired and potential inputs, but we could similarly design an expression around our cases with a left boundary and, if necessary, a right boundary, such as:
(^name1|}name1)({.+?})?|(host,name1)({.+?})(host,name1)
where this part can be much simplified:
(host,name1)({.+?})(host,name1)
It is included here only to illustrate a right boundary that captures just the first instance of the (host,name1) value.

Do I need negative lookahead to extract TEST strings in my regular expression?

My goal is to remove tokens from the string below that do not start with "TEST" with the help of regular expressions.
TESTA=abc; VAL2=def; TESTB=ghi; TESTC=jkl; VAL2=bla1; VAL3=bla2
Based on reading online it seems I would need to create a regular expression that will match what I want and then use negative lookahead for it. However, I am unable to come up with one.
Input string:
TESTA=abc; VAL2=def; TESTB=ghi; TESTC=jkl; VAL2=bla1; VAL3=bla2
Matching string:
TESTA=abc; TESTB=ghi; TESTC=jkl;
Is it even possible to do what I want in a single regular expression?
We need this to place in our Apache conf file. Some of the cookies sent to Apache are so big that it is failing our application. The approach we are trying to take is to filter all the cookies not set by our application. We can enforce some sort of restriction that all our cookies start with specific prefix (as used in the example above) and we will filter the rest.
In Apache, the syntax below replaces the cookie whose key is TESTC, together with its value, with an empty string. I can enhance the regex to match any key that starts with TEST_, so it can remove something like "; TEST_key:VALUE FOR Cookie". However, what I want is the exact opposite: leave what matched alone and replace everything else with an empty string.
RequestHeader edit Cookie "(^TESTC=[^;]*; |; TESTC=[^;]*)" ""
Something like
[^=;]+(?<!TEST.)=[^;]+(?:$|;)
What does it do?
[^=;]+ Matches up to the first = or ;
(?<!TEST.) Negative lookbehind. Asserts that the text immediately before the = is not TEST followed by one character.
=[^;]+ If the lookbehind succeeds, matches the = and everything up to the next ;
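As a sanity check of this pattern outside Apache, here is a minimal Python sketch that deletes every token the regex matches. It assumes, as in the question, that the wanted keys are TEST plus exactly one character; the variable names are illustrative. Python's re accepts this lookbehind because (?<!TEST.) has a fixed width.

```python
import re

cookies = "TESTA=abc; VAL2=def; TESTB=ghi; TESTC=jkl; VAL2=bla1; VAL3=bla2"

# Matches a key=value token unless the five characters right before "="
# are "TEST" plus any one character.
NON_TEST_TOKEN = re.compile(r"[^=;]+(?<!TEST.)=[^;]+(?:$|;)")

# Removing the matches leaves only the TEST* cookies.
kept = NON_TEST_TOKEN.sub("", cookies)
print(kept)
```

Whether the same substitution behaves identically inside a RequestHeader edit directive is not verified here.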
Use a Non-Greedy Quantifier
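You don't need anything as complicated as zero-width assertions such as negative lookahead or lookbehind. All you need is a non-greedy quantifier like *? in engines that support it. POSIX extended regular expressions do not define lazy quantifiers, so whether egrep honors *? depends on the implementation; with GNU grep, use grep -oP, or use the portable pattern 'TEST[^;]*;'. For example, at the Bash prompt with an egrep that supports it: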
$ echo 'TESTA=abc; VAL2=def; TESTB=ghi; TESTC=jkl; VAL2=bla1; VAL3=bla2' |
egrep -o 'TEST.*?;' | xargs
TESTA=abc; TESTB=ghi; TESTC=jkl;
You can do something similar in Ruby, Python, or Perl. For example, using Ruby:
str = 'TESTA=abc; VAL2=def; TESTB=ghi; TESTC=jkl; VAL2=bla1; VAL3=bla2'
str.scan(/TEST.*?;/).join " "
#=> "TESTA=abc; TESTB=ghi; TESTC=jkl;"
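The same idea in Python, as a small sketch: the lazy pattern next to a negated character class, which gives the same tokens here and does not depend on lazy-quantifier support.

```python
import re

s = "TESTA=abc; VAL2=def; TESTB=ghi; TESTC=jkl; VAL2=bla1; VAL3=bla2"

# Lazy quantifier: stop at the first ";" after each "TEST".
lazy = re.findall(r"TEST.*?;", s)

# Negated class: "anything but ;" up to the ";" itself, no laziness needed.
negated = re.findall(r"TEST[^;]*;", s)

joined = " ".join(lazy)
print(joined)
```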

regular expression to match strings with decimals

I'm trying to create a regex which will do the following:
Name description: "QUARTERLY PATCH FOR XAQE (JUL 2013 - 11.2.0.3.20) : (125546467)"
Val version : 11.2.0.3.4
In order to output:
"Name, 11.2.0.3.20"
"Val, 11.2.0.3.4"
I have created the following regex: /^([\w]+).*([\d\.\d]+).*/, but it is only matching the last number in the 2nd group, i.e. in 11.2.0.3.4 it will only match 4. Could anyone help?
Also, there could be more than the two lines given above, so it needs to account for arbitrary lines where the version number could be anywhere in the line.
You can use a one-liner for this as well:
perl -lne 'print "$1, $2" if /(\w+).*?(\d+(\.\d+)+)/' <filename>
__END__
Name, 11.2.0.3.20
Val, 11.2.0.3.4
If you are only planning for the output and not doing any processing over the captured groups, then this will do:
$str =~ s/([\n\r]|^)(Name|Val).*?(\d+(\.\d+)+).*/$1"$2, $3"/g;
Your problem is that .* is greedy and will consume as much as it can whilst the pattern still matches. One solution is to make it lazy: .*?
Also, [\d\.\d]+ is a character class: it matches one or more characters that are each a digit or a dot, so it's the same as [\d.]+, which isn't what you want since it would match "2013" in the first line. \d+(\.\d+)+ is more suitable.
After those 2 changes you have:
^([\w]+).*?(\d+(\.\d+)+).*
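For readers not using Perl, a minimal Python sketch of the corrected pattern applied to the two sample lines follows. The inner group is made non-capturing here, which is a small change of my own, not part of the answer.

```python
import re

lines = [
    'Name description: "QUARTERLY PATCH FOR XAQE (JUL 2013 - 11.2.0.3.20) : (125546467)"',
    "Val version : 11.2.0.3.4",
]

# First word, then lazily skip ahead to the first dotted number
# (digits followed by at least one ".digits" part, so "2013" is skipped).
VERSION_LINE = re.compile(r"^(\w+).*?(\d+(?:\.\d+)+)")

found = []
for line in lines:
    m = VERSION_LINE.search(line)
    if m:  # lines without a dotted version are ignored
        found.append(f"{m.group(1)}, {m.group(2)}")

for entry in found:
    print(entry)
```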

Regex expression to find file extension in a file with multiple periods

How would you write a regular expression to find the file extension of the following files, keeping in mind that what I am looking for is the ".pdf" or ".xls" portion of the string?
REPORTPDF.20130810.pdf.pgp
REPORTXLS.20130810.xls.pgp
EDIT:
The resulting filenames I want to end up with are the following:
REPORT20130810.PDF
REPORT20130810.XLS
I am on a Windows platform. I've played around with this a bit at http://regexpal.com/ but so far I can only figure out how to match the date:
([0-9]{4}[0-9]{2}[0-9]{2})
Using sed:
sed 's/^\(.*[^.]*\)\.[^.]*$/\1/' <<< "REPORTPDF.20130810.pdf.pgp"
REPORTPDF.20130810.pdf
Using grep -P (PCRE regex):
grep -oP '^.+[^.]+(?=\.[^.]+$)' <<< "REPORTPDF.20130810.pdf.pgp"
REPORTPDF.20130810.pdf
.+\.(\w+)\.\w+$ would deliver the last-but-one extension as group 1; how you access it then depends on the host language for the regex.
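In Python, for instance, group 1 could be read like this (a sketch; the variable names are illustrative):

```python
import re

# Group 1 is the extension just before the final one.
INNER_EXT = re.compile(r".+\.(\w+)\.\w+$")

names = ["REPORTPDF.20130810.pdf.pgp", "REPORTXLS.20130810.xls.pgp"]
inner = []
for name in names:
    m = INNER_EXT.search(name)
    inner.append(m.group(1) if m else None)

print(inner)
```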
If you don't need the file extension to be capitalized, this should work
([a-zA-Z]+)\.([0-9]{4}[0-9]{2}[0-9]{2})\.(xls|pdf)\.pgp
Matches:
REPORTXLS.20130810.xls.pgp
The groups to use in the replacement are two and three:
REPORT\2.\3
Result:
REPORT20130810.xls
Problem is that you don't provide much context for how you're going about changing these file names.
You don't say what language/library you're using, but this Perl one-liner does the trick:
perl -lpe "s/^([^.]*)(...)\.(\d+)(\.\2)\.pgp/\1\3\4/i; $_=uc"
I think this will work for you :)
^(([A-Z a-z]*)(?:XLS\.|PDF\.)(\d{8})(\.pdf|\.xls))
^ anchors the match at the beginning of the string
([A-Z a-z]*) any run of letters or spaces before the type tag
\d any digit 0-9
{8} exactly 8 repetitions of the preceding item (here, 8 digits)
?: makes a group non-capturing
I wrapped the capture groups in one large group, so the full matched name (up to the inner extension) is in the first capture group :).
This can probably be replaced
([A-Z a-z]*)
with
(REPORT)
This (.*?(?:\..*)?)(\..*) will hold things like:
'hello.1a.2bb.3' ---> group(1) == 'hello.1a.2bb', group(2) == '.3'
'yep.1' ---> group(1) == 'yep', group(2) == '.1'
If the format is pretty much fixed you could use
(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)
and cherry pick replacement based on what you want
I used Java here, but the regex would match the same way in other flavours that support possessive quantifiers (++).
String a = "REPORTPDF.20130810.pdf.pgp".replaceAll(
"(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)",
"$1--$2--$3--$4--$5");
String b = "REPORTXLS.20130810.xls.pgp".replaceAll(
"(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)",
"$1--$2--$3--$4--$5");
System.out.println(a);
System.out.println(b);
REPORT--PDF--20130810--pdf--pgp
REPORT--XLS--20130810--xls--pgp
In your case, use "$1$3.$2":
String b = "REPORTXLS.20130810.xls.pgp".replaceAll(
"(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)",
"$1$3.$2");
which produces the intended result:
REPORT20130810.XLS
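The same replacement can be sketched in Python. The function name decrypted_name is hypothetical, and plain + is used instead of ++ because possessive quantifiers are only available in newer versions of Python's re module.

```python
import re

# Groups: 1 = "REPORT", 2 = type tag (PDF/XLS), 3 = date, 4 = inner extension, 5 = "pgp"
PGP_NAME = re.compile(r"(REPORT)([^.]+)\.([^.]+)\.([^.]+)\.(pgp)")

def decrypted_name(filename: str) -> str:
    """Rebuild the name as REPORT + date + "." + upper-case type tag."""
    return PGP_NAME.sub(r"\1\3.\2", filename)

renamed = [
    decrypted_name("REPORTPDF.20130810.pdf.pgp"),
    decrypted_name("REPORTXLS.20130810.xls.pgp"),
]
for name in renamed:
    print(name)
```

The upper-case extension comes from the type tag embedded in the original name (group 2), so no case conversion is needed.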

Using regex to find any last occurrence of a word between two delimiters

Suppose I have the following test string:
Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop
where _ means any characters, eg: StartaGetbbGetcccGetddddStopeeeeeStart....
What I want to extract is the last occurrence of the word Get within each pair of Start and Stop delimiters. The result here would be three matches: the third, fifth, and sixth Get below.
Start__Get__Get__Get__Stop__Start__Get__Get__Stop__Start__Get__Stop
To be precise, I'd like to do this using only regex and, as far as possible, in a single pass.
Any suggestions are welcome.
Thanks
Get(?=(?:(?!Get|Start|Stop).)*Stop)
I'm assuming your Start and Stop delimiters will always be properly balanced and they can't be nested.
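A quick way to see which occurrences the lookahead selects is to mark them. This Python sketch is only an illustration; the square-bracket marking is my own convention.

```python
import re

text = "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop"

# A "Get" qualifies only if "Stop" can be reached without first passing
# another "Get", "Start", or "Stop" (the tempered dot enforces this).
LAST_GET = re.compile(r"Get(?=(?:(?!Get|Start|Stop).)*Stop)")

starts = [m.start() for m in LAST_GET.finditer(text)]
marked = LAST_GET.sub("[Get]", text)

print(starts)
print(marked)
```

Because the dot does not match newlines by default, input that spans several lines would need the re.DOTALL flag.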
I would have done it in two passes: the first pass finds the word "Get", and the second pass counts its occurrences.
$ echo "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get__Stop" | awk -vRS="Stop" -F"_*" '{print $(NF-1)}'
Get
Get
Get
Something like this, maybe:
(?<=Start(?:.Get)*)Get(?=.Stop)
That requires variable-length lookbehind support, which not all regex engines provide.
It could be made to have a maximum length, which a few more (but still not all) engines support, by changing the first * to {0,99} or similar.
Also, in the lookahead, the . should possibly be .+ or .{1,2}, depending on whether the double underscore is a typo.
With Perl, I'd do:
my $test = "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop";
$test =~ s#(?<=Start_)((Get_)*)(Get)(?=_Stop)#$1<FOUND>$3</FOUND>#g;
print $test;
output:
Start_Get_Get_<FOUND>Get</FOUND>_Stop_Start_Get_<FOUND>Get</FOUND>_Stop_Start_<FOUND>Get</FOUND>_Stop
You should adapt this to your regex flavour.