Regex - Positive Lookbehind - regex

I have a few files with a couple of million lines that look something like the following:
9/9/2015 2:50:39 PM: Export for https://portal.gaf.com/sites/RCNHistory/Lists/RCNs/Attachments/148/Ruberoid HW Plus SV.xls Complete.
9/9/2015 2:50:39 PM: Export for https://portal.gaf.com/sites/RCNHistory/Lists/RCNs/Attachments/148/Ruberoid Mop Granule SV.xls Complete.
9/9/2015 2:50:40 PM: Export for https://portal.gaf.com/sites/RCNHistory/Lists/RCNs/Attachments/148/Ruberoid Mop Smooth 1.5 SV.xls Complete.
I was hoping to capture the file name on each line using a lookbehind like the following:
$(?<=\/)
Of course I will also have to delete the "Complete." but I figured I would start slowly.
I have not mastered the art of regex. Can anyone let me know what I am doing wrong?
Thank you.

This could work - you would retrieve the filename from the capture group:
\/([^\/]*) Complete\.$
Here's an example on regexr: http://www.regexr.com/3bp2l

You don't need to complicate things with a lookbehind if the lines are all in this format.
You can just use greedy matching to get what you want.
.*\/(.*) Complete\.
Which is essentially:
Match everything (including /'s) up to a / followed by some text (in this case your filename) followed by a literal " Complete."
The matching group contains the filename.
So, for a Regex Find and Replace in N++ you should use:
Find
.*\/(.*) Complete\.
Replace
$1
This will leave you with just a filename on each line.
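If you would rather script this than use an editor, the same greedy pattern can be applied line by line. Here is a minimal Python sketch, assuming the log lines look exactly like the samples in the question. `extract_filename` is a hypothetical helper name, and the final dot is escaped so that only a literal "Complete." matches:

```python
import re

# Greedy .* runs to the last "/" on the line; group 1 is the filename.
LINE_RE = re.compile(r".*/(.*) Complete\.")

def extract_filename(line):
    """Return the filename from one log line, or None if the line does not match."""
    m = LINE_RE.match(line)
    return m.group(1) if m else None

sample = ("9/9/2015 2:50:39 PM: Export for https://portal.gaf.com/sites/"
          "RCNHistory/Lists/RCNs/Attachments/148/Ruberoid HW Plus SV.xls Complete.")
print(extract_filename(sample))  # Ruberoid HW Plus SV.xls
```

The slashes in the date do not get in the way, because the greedy `.*` always backs off to the last `/` on the line.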

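Lookbehind is a zero-width assertion at a position: it checks what comes before the current position without consuming any characters. It is not a way to tell the regex where to start matching. Your pattern $(?<=\/) asserts that the end of the line is immediately preceded by a /, which is never true for these lines. You could use a regex like .*/(.*) Complete to capture the filename instead.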
If you are working with the shell, the cut tool is great for this too.
# get everything after the last slash and before the last space (` Complete`)
rev "$INPUT_FILE" | cut -d'/' -f 1 | cut -d' ' -f2- | rev

You can use this regex with lookbehind:
/(?<=\/)[^\/]+$/
Make sure to use MULTILINE mode.
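To see how the lookbehind behaves in practice, here is a rough Python sketch using two of the sample lines from the question. The third pattern, with the extra lookahead, is my own assumed variant and not part of the answer above; it just shows one way to leave " Complete." out of the match as well:

```python
import re

PREFIX = ("9/9/2015 2:50:39 PM: Export for https://portal.gaf.com/sites/"
          "RCNHistory/Lists/RCNs/Attachments/148/")
names = ["Ruberoid HW Plus SV.xls", "Ruberoid Mop Granule SV.xls"]
text = "\n".join(PREFIX + name + " Complete." for name in names)

# The asker's pattern: "end of line, preceded by a slash". No line ends in "/".
asker = re.findall(r"$(?<=/)", text, flags=re.MULTILINE)
print(asker)  # []

# Lookbehind answer: everything after the last slash, up to the end of the line.
tails = re.findall(r"(?<=/)[^/]+$", text, flags=re.MULTILINE)
print(tails)

# Assumed variant: a lookahead also keeps " Complete." out of the match.
clean = re.findall(r"(?<=/)[^/\n]+(?= Complete\.$)", text, flags=re.MULTILINE)
print(clean)
```

Note that the lookbehind version still returns the trailing " Complete." as part of each match, so it needs a second step or a lookahead to get only the filename.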

Related

Is it possible to perform a "lookahead" regex match only if one of two matches are present? [duplicate]

I'm setting up some goals in Google Analytics and could use a little regex help.
Let's say I have 4 URLs:
http://www.anydotcom.com/test/search.cfm?metric=blah&selector=size&value=1
http://www.anydotcom.com/test/search.cfm?metric=blah2&selector=style&value=1
http://www.anydotcom.com/test/search.cfm?metric=blah3&selector=size&value=1
http://www.anydotcom.com/test/details.cfm?metric=blah&selector=size&value=1
I want to create an expression that will identify any URL that contains the string selector=size but does NOT contain details.cfm
I know that to find a string that does NOT contain another string I can use this expression:
(^((?!details.cfm).)*$)
But, I'm not sure how to add in the selector=size portion.
Any help would be greatly appreciated!
This should do it:
^(?!.*details\.cfm).*selector=size.*$
^.*selector=size.*$ should be clear enough. The first bit, (?!.*details\.cfm), is a negative lookahead: before matching the rest of the string, it checks that the string does not contain "details.cfm" (with any number of characters before it).
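A quick way to check this against the four URLs from the question is a small Python sketch like the one below. It only exercises the pattern in Python's `re` module; I am assuming the Google Analytics regex flavor treats the lookahead the same way, which you should verify there:

```python
import re

# Negative lookahead at the start rejects any URL containing details.cfm.
GOAL_RE = re.compile(r"^(?!.*details\.cfm).*selector=size.*$")

urls = [
    "http://www.anydotcom.com/test/search.cfm?metric=blah&selector=size&value=1",
    "http://www.anydotcom.com/test/search.cfm?metric=blah2&selector=style&value=1",
    "http://www.anydotcom.com/test/search.cfm?metric=blah3&selector=size&value=1",
    "http://www.anydotcom.com/test/details.cfm?metric=blah&selector=size&value=1",
]

matched = [u for u in urls if GOAL_RE.search(u)]
for u in matched:
    print(u)
```

Only the first and third URLs are printed: the second has selector=style, and the fourth is rejected by the lookahead.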
^(?=.*selector=size)(?:(?!details\.cfm).)+$
If your regex engine supports possessive quantifiers (though I suspect Google Analytics does not), then I would guess this performs better for large input sets:
^[^?]*+(?<!details\.cfm).*?selector=size.*$
regex could be (perl syntax):
`/^[(^(?!.*details\.cfm).*selector=size.*)|(selector=size.*^(?!.*details\.cfm).*)]$/`
There is a problem with the regex in the accepted answer. It also matches abcselector=size, selector=sizeabc etc.
A correct regex can be ^(?!.*\bdetails\.cfm\b).*\bselector=size\b.*$
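The difference between the two patterns can be sketched in Python as follows. The query strings are made-up test inputs, not URLs from the question:

```python
import re

LOOSE = re.compile(r"^(?!.*details\.cfm).*selector=size.*$")
STRICT = re.compile(r"^(?!.*\bdetails\.cfm\b).*\bselector=size\b.*$")

# Hypothetical inputs chosen to show where the word boundaries matter.
cases = [
    "search.cfm?metric=blah&selector=size&value=1",
    "search.cfm?abcselector=size&value=1",
    "search.cfm?selector=sizeabc",
]

results = {c: (bool(LOOSE.search(c)), bool(STRICT.search(c))) for c in cases}
for case, (loose_hit, strict_hit) in results.items():
    print(case, loose_hit, strict_hit)
```

The loose pattern accepts all three inputs, while the word-boundary version accepts only the first.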
I was looking for a way to avoid --line-buffered on a tail in a situation similar to the OP's, and Kobi's solution works great for me. In my case, I exclude lines containing either "bot" or "spider" while including lines containing ' / ' (for my root document).
My original command:
tail -f mylogfile | grep --line-buffered -v 'bot\|spider' | grep ' / '
Now becomes (with -P perl switch):
tail -f mylogfile | grep -P '^(?!.*(bot|spider)).*\s\/\s.*$'

Regex stop on certain character

I have this regex
#(.*)\((.*)\)
And I'm trying to get two matches from this string
#YouTube('dqrtLyzNnn8') #Vimeo('124719070')
I need it to stop after the closing ), so I get two matches instead of one.
See example on Regexr
Be lazy (?):
#(.*?)\((.*?)\)
You may use a negated character class:
#([^()]+)\(([^()]+)\)
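Here is a short Python sketch comparing the greedy original with the lazy and negated-class versions on the string from the question:

```python
import re

s = "#YouTube('dqrtLyzNnn8') #Vimeo('124719070')"

# Greedy: the first group swallows everything up to the last "(".
greedy = re.findall(r"#(.*)\((.*)\)", s)

# Lazy quantifiers stop at the first possible ")".
lazy = re.findall(r"#(.*?)\((.*?)\)", s)

# Negated character class: the groups cannot contain parentheses at all.
negated = re.findall(r"#([^()]+)\(([^()]+)\)", s)

print(greedy)
print(lazy)
print(negated)
```

The greedy pattern yields a single match, while both alternatives yield two (name, argument) pairs.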
Your regex is trying to eat as many characters as possible.
Look at the next regex:
echo "#YouTube('dqrtLyzNnn8') #Vimeo('124719070')" |
sed 's/#[a-zA-Z]*(\([^)]*\)[^#]*#[a-zA-Z]*(\([^)]*\).*/\1 \2/'
I do not know the exact requirements; it might be easier to start with another command:
echo "#YouTube('dqrtLyzNnn8') #Vimeo('124719070')" | cut -d\' -f2,4

Regex - match up to first literal

I have some lines of code from which I am trying to remove some leading text. They look like this:
Line 1: myApp.name;
Line 2: myApp.version
Line 3: myApp.defaults, myApp.numbers;
I am trying and trying to find a regex that will remove anything up to (but excluding) myApp.
I have tried various regular expressions, but they all seem to fail when it comes to line 3 (because myApp appears twice).
The closest I have come so far is:
.*?myApp
Pretty simple - but that matches both occurrences of myApp in Line 3 - whereas I'd like it to match only the first.
There are a few hundred lines - otherwise I'd have deleted them all manually by now.
Can somebody help me? Thanks.
You need to add an anchor ^, which matches the starting point of a line:
^.*?(myApp)
Use the above regex and replace the match with $1 or \1, so that the string myApp is kept in the final result after replacement.
Pattern explanation:
^ Start of a line.
.*?(myApp) Shortest possible match up to the first myApp. The string myApp is captured and stored in group 1.
All matched characters are replaced with the characters captured in group 1.
Your regular expression works in Perl if you add the ^ to ensure that you only match the beginnings of lines:
cat /tmp/test.txt | perl -pe 's/^.*?myApp/myApp/g'
myApp.name;
myApp.version
myApp.defaults, myApp.numbers;
If you wanted to get fancy, you could put "myApp" into a lookahead using the (?=...) syntax. A lookahead asserts that myApp follows but is not part of the match, so it doesn't have to be put back in by the replacement.
cat /tmp/test.txt | perl -pe 's/^.*?(?=myApp)//g'
myApp.name;
myApp.version
myApp.defaults, myApp.numbers;
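The same idea can be sketched in Python. I am assuming here that the "Line N: " prefixes shown in the question are the leading text to strip:

```python
import re

text = ("Line 1: myApp.name;\n"
        "Line 2: myApp.version\n"
        "Line 3: myApp.defaults, myApp.numbers;")

# Without the anchor, the pattern matches again later on line 3.
unanchored = re.sub(r".*?myApp", "myApp", text)

# Anchored with a capture group that is put back by the replacement.
with_group = re.sub(r"^.*?(myApp)", r"\1", text, flags=re.MULTILINE)

# Anchored with a lookahead, so nothing has to be put back.
with_lookahead = re.sub(r"^.*?(?=myApp)", "", text, flags=re.MULTILINE)

print(unanchored.splitlines()[2])  # myAppmyApp.numbers;
print(with_group)
print(with_lookahead)
```

Both anchored versions leave the second myApp on line 3 untouched, because ^ (with MULTILINE) can only match at the start of a line.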

regex, image src may have many different paths and I want to replace with a specific path

So I have a database from a wordpress multisite. I'm doing a search and replace on a table with regex and need to make all src's of a specific image name (image2.jpg) point to a single directory. Here's an example. I may have:
src="http://domain.com/path/weird/different/image2.jpg
and
src="http://domain2.com/path2/differentpath/helloworld/image2.jpg
I need to replace everything between src=" and /image2.jpg with a specific domain/filepath.
I'm not great with regex stuff, I try, but it's just not my strong suit. Any help appreciated.
Search: src="[^"]*image2\.jpg
Replace: src="http://mydomain.com/mypath/image2.jpg
The [^"]* eats up any characters that are not a double quote.
In PHP (should work with WordPress):
$replaced = preg_replace('/src="[^"]*image2\.jpg/',
'src="http://mydomain.com/mypath/image2.jpg',
$str);
Use this regex:
/(?<=src=")(.*?)(?=\Q/image2.jpg\E)/
This matches anything that goes in between "src="" and "/image2.jpg", so you are free to replace this with your specific domain/filepath.
Depending on your language/ tool, you may have to escape the leading / in /image2.jpg.
> Positive Lookbehind - assert that match must be following 'src="'
|
|     Anything  > Positive Lookahead - assert that match
|            |  |  is before '/image2.jpg' literally.
/(?<=src=")(.*?)(?=\Q/image2.jpg\E)/
Also try out an online regex tester.
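For what it's worth, the lookaround approach can be sketched in Python too. Python's `re` module does not support \Q...\E, so this sketch uses `re.escape` for the literal part, and it swaps (.*?) for [^"]*? so the match cannot run past the closing quote. The `<img>` tags and the `repoint` helper are assumptions for illustration, and the target path is the placeholder from the first answer:

```python
import re

NEW_BASE = "http://mydomain.com/mypath"  # placeholder target path

# Match only the part between src=" and /image2.jpg; the lookarounds
# themselves are not replaced.
SRC_PATH_RE = re.compile(r'(?<=src=")[^"]*?(?=' + re.escape("/image2.jpg") + ")")

def repoint(html):
    """Point every image2.jpg src at NEW_BASE (hypothetical helper)."""
    return SRC_PATH_RE.sub(NEW_BASE, html)

html = ('<img src="http://domain.com/path/weird/different/image2.jpg"> '
        '<img src="http://domain2.com/path2/differentpath/helloworld/image2.jpg">')
print(repoint(html))
```

Since only the path is matched, the replacement string is just the new domain/filepath, without src=" or the file name.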

Regex expression to find file extension in a file with multiple periods

How would you write a regular expression to find the file extension of the following files, keeping in mind that what I am looking for is the ".pdf" or ".xls" portion of the string?
REPORTPDF.20130810.pdf.pgp
REPORTXLS.20130810.xls.pgp
EDIT:
The resulting filenames I want to end up with are the following:
REPORT20130810.PDF
REPORT20130810.XLS
I am on a Windows platform. I've played around with this a bit at http://regexpal.com/ but so far I can only figure out how to match the date:
([0-9]{4}[0-9]{2}[0-9]{2})
Using sed:
sed 's/^\(.*[^.]*\)\.[^.]*$/\1/' <<< "REPORTPDF.20130810.pdf.pgp"
REPORTPDF.20130810.pdf
Using grep -P (PCRE regex):
grep -oP '^.+[^.]+(?=\.[^.]+$)' <<< "REPORTPDF.20130810.pdf.pgp"
REPORTPDF.20130810.pdf
.+\.(\w+)\.\w+$ would deliver the last-but-one extension as group 1. How you access that group depends on the host language you are using for the regex.
If you don't need the file extension to be capitalized, this should work
([a-zA-Z]+)\.([0-9]{4}[0-9]{2}[0-9]{2})\.(xls|pdf)\.pgp
Matches:
REPORTXLS.20130810.xls.pgp
The groups you'd then use in the replacement are two and three:
REPORT\2.\3
Result:
REPORT20130810.xls
The problem is that you don't provide much context for how you're going about changing these file names.
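If the extension does need to be upper-cased, that step is usually easier in the host language than in the regex itself. Here is a minimal Python sketch, assuming the names always start with REPORT followed by PDF or XLS. The pattern is a variant of the one above, and `target_name` is a hypothetical helper:

```python
import re

# Assumed fixed layout: REPORT<TYPE>.<yyyymmdd>.<ext>.pgp
NAME_RE = re.compile(r"^(REPORT)(?:PDF|XLS)\.(\d{8})\.(pdf|xls)\.pgp$")

def target_name(filename):
    """Build the new name, or return None if the filename does not fit the layout."""
    m = NAME_RE.match(filename)
    if m is None:
        return None
    prefix, date, ext = m.groups()
    return f"{prefix}{date}.{ext.upper()}"

print(target_name("REPORTPDF.20130810.pdf.pgp"))  # REPORT20130810.PDF
print(target_name("REPORTXLS.20130810.xls.pgp"))  # REPORT20130810.XLS
```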
You don't say what language/library you're using, but this Perl one-liner does the trick:
perl -lpe "s/^([^.]*)(...)\.(\d+)(\.\2)\.pgp/\1\3\4/i; $_=uc"
I think this will work for you :)
^(([A-Z a-z]*)(?:XLS.|PDF.)(\d{8})(.pdf|.xls))
^ anchors the match at the beginning of the string
[A-Z a-z]* any run of letters or spaces (the leading text)
\d any digit 0-9
{8} exactly 8 repetitions of the preceding item (in this case, 8 digits)
?: marks a non-capturing group
I wrapped the capture groups into one large one so the thing that you want will be in the first capture group :).
This can probably be replaced:
([A-Z a-z]*)
with
(REPORT)
This (.*?(?:\..*)?)(\..*) will hold things like:
'hello.1a.2bb.3' ---> group(1) == 'hello.1a.2bb', group(2) == '.3'
'yep.1' ---> group(1) == 'yep', group(2) == '.1'
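A small Python sketch to check this behaviour is below. `split_last_ext` is a hypothetical helper, and `fullmatch` is my choice here so that the pattern has to cover the whole name:

```python
import re

SPLIT_RE = re.compile(r"(.*?(?:\..*)?)(\..*)")

def split_last_ext(name):
    """Split a name into (everything before the last extension, last extension)."""
    m = SPLIT_RE.fullmatch(name)
    return (m.group(1), m.group(2)) if m else None

print(split_last_ext("hello.1a.2bb.3"))
print(split_last_ext("yep.1"))
print(split_last_ext("REPORTPDF.20130810.pdf.pgp"))
```

For the file names in the question, group 2 is the trailing .pgp, so you would apply the split a second time to group 1 to reach .pdf or .xls.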
If the format is pretty much fixed you could use
(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)
and cherry-pick the replacement based on what you want.
I used Java here, but the regex would behave the same in any engine that supports possessive quantifiers:
String a = "REPORTPDF.20130810.pdf.pgp".replaceAll(
"(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)",
"$1--$2--$3--$4--$5");
String b = "REPORTXLS.20130810.xls.pgp".replaceAll(
"(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)",
"$1--$2--$3--$4--$5");
System.out.println(a);
System.out.println(b);
REPORT--PDF--20130810--pdf--pgp
REPORT--XLS--20130810--xls--pgp
In your case, use "$1$3.$2" as the replacement:
String b = "REPORTXLS.20130810.xls.pgp".replaceAll(
"(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)",
"$1$3.$2");
which produces the intended result:
REPORT20130810.XLS