Regex stop on certain character - regex

I have this regex
#(.*)\((.*)\)
And I'm trying to get two matches from this string
#YouTube('dqrtLyzNnn8') #Vimeo('124719070')
I need it to stop after the closing ), so I get two matches instead of one.

Make the quantifiers lazy by adding ?:
#(.*?)\((.*?)\)

You may use a negated character class:
#([^()]+)\(([^()]+)\)
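To see the difference for yourself, here is a small sketch that runs the original greedy pattern next to the lazy and negated-class versions on the sample string. Python's re module is only my choice for illustration; the question does not say which engine is in use.

```python
import re

text = "#YouTube('dqrtLyzNnn8') #Vimeo('124719070')"

# Original pattern: the greedy .* runs on to the last "(" and ")" in the string.
greedy = re.findall(r"#(.*)\((.*)\)", text)

# Lazy quantifiers stop at the first "(" and ")" that let the match succeed.
lazy = re.findall(r"#(.*?)\((.*?)\)", text)

# Negated character class: the groups can never cross a parenthesis.
negated = re.findall(r"#([^()]+)\(([^()]+)\)", text)

print(len(greedy), greedy)
print(len(lazy), lazy)
print(len(negated), negated)
```

The greedy version finds one match spanning both tags, while the other two find one match per tag.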

Your regex is trying to eat as many characters as possible.
Look at the next regex:
echo "#YouTube('dqrtLyzNnn8') #Vimeo('124719070')" |
sed 's/#[a-zA-Z]*(\([^)]*\)[^#]*#[a-zA-Z]*(\([^)]*\).*/\1 \2/'
I do not know the exact requirements; it might be easier to start with another command:
echo "#YouTube('dqrtLyzNnn8') #Vimeo('124719070')" | cut -d\' -f2,4

Related

Regex - Positive Lookbehind

I have a few files with a couple of million lines, each something like the following:
9/9/2015 2:50:39 PM: Export for https://portal.gaf.com/sites/RCNHistory/Lists/RCNs/Attachments/148/Ruberoid HW Plus SV.xls Complete.
9/9/2015 2:50:39 PM: Export for https://portal.gaf.com/sites/RCNHistory/Lists/RCNs/Attachments/148/Ruberoid Mop Granule SV.xls Complete.
9/9/2015 2:50:40 PM: Export for https://portal.gaf.com/sites/RCNHistory/Lists/RCNs/Attachments/148/Ruberoid Mop Smooth 1.5 SV.xls Complete.
I was hoping to capture the file name on each line with a lookbehind with the following:
$(?<=\/)
Of course I will have to delete the "Complete." but I figured I'd start slowly.
I have not mastered the art of regex, though. Can anyone let me know what I am doing wrong?
Thank you.
This could work - you would retrieve the filename from the capture group:
\/([^\/]*) Complete.$
Here's an example on regexr: http://www.regexr.com/3bp2l
You don't need to complicate things with a lookbehind if the lines are all in this format.
You can just use greedy matching to get what you want.
.*\/(.*) Complete.
Which is essentially:
Match everything (including /'s) up to a / followed by some text (in this case your filename) followed by a literal " Complete."
The matching group contains the filename.
So, for a Regex Find and Replace in N++ you should use:
Find
.*\/(.*) Complete.
Replace
$1
This will leave you with just a filename on each line.
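If you would rather script it than use an editor, the same greedy pattern can be tried in a few lines. This is only a sketch: Python is my assumption, the two sample lines are copied from the question, and I escaped the final dot and anchored the end so it matches a literal "Complete." at the end of the line.

```python
import re

lines = [
    "9/9/2015 2:50:39 PM: Export for https://portal.gaf.com/sites/RCNHistory/Lists/RCNs/Attachments/148/Ruberoid HW Plus SV.xls Complete.",
    "9/9/2015 2:50:40 PM: Export for https://portal.gaf.com/sites/RCNHistory/Lists/RCNs/Attachments/148/Ruberoid Mop Smooth 1.5 SV.xls Complete.",
]

# Greedy .* consumes everything up to the LAST slash; group 1 is the filename.
pattern = re.compile(r".*/(.*) Complete\.$")

filenames = []
for line in lines:
    match = pattern.match(line)
    if match:  # lines in another format are skipped
        filenames.append(match.group(1))

print(filenames)
```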
Lookbehind is a zero-width assertion at a position. It's not a way to tell the regex where to start -- it must always start at the beginning. You could probably use a regex like .*/(.*) Complete to capture that.
If you are working with the shell, the cut tool is great for this too.
# get everything after the last slash and before the last space (` Complete`)
rev "$INPUT_FILE" | cut -d'/' -f 1 | cut -d' ' -f2- | rev
You can use this regex with lookbehind:
/(?<=\/)[^\/]+$/
Make sure to use MULTILINE mode.
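A quick way to check this is the sketch below. Python's re module is my assumption here, not something the question specifies. The lookbehind only asserts that a slash sits just before the match, so the result still carries the trailing " Complete.", which is stripped in a second step.

```python
import re

log = (
    "9/9/2015 2:50:39 PM: Export for https://portal.gaf.com/sites/RCNHistory/Lists/RCNs/Attachments/148/Ruberoid HW Plus SV.xls Complete.\n"
    "9/9/2015 2:50:39 PM: Export for https://portal.gaf.com/sites/RCNHistory/Lists/RCNs/Attachments/148/Ruberoid Mop Granule SV.xls Complete."
)

# re.MULTILINE lets $ match at the end of every line, not only at the end of the text.
tails = re.findall(r"(?<=/)[^/]+$", log, flags=re.MULTILINE)

# The lookbehind pattern keeps " Complete.", so remove it afterwards.
suffix = " Complete."
names = [t[: -len(suffix)] if t.endswith(suffix) else t for t in tails]

print(tails)
print(names)
```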

Extracting string before pattern with sed (bash)

I need some help with sed to remove everything after a matching pattern, and to remove the last "." if it exists.
Take this string as an example:
The.100.S02E05.720p.HDTV.x264-KILLERS.mkv
I want everything before the pattern "S[0-9][0-9]E[0-9[0-9]" except the last "."
What I want:
"The.100"
Does anyone have a great oneliner for this one?
It sounds like you can pretty much use exactly what you had in your question:
sed 's/\.*S[0-9][0-9]E[0-9][0-9].*//'
This matches zero or more . characters followed by the pattern you suggested (and anything after it), replacing the whole match with nothing. You were missing a ] in the question, which I have added.
Testing it out:
$ sed 's/\.*S[0-9][0-9]E[0-9][0-9].*//' <<<'The.100.S02E05.720p.HDTV.x264-KILLERS.mkv'
The.100
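The same substitution can be written outside sed as well. Here is a sketch in Python, which is my choice and not part of the question; the helper name title_before_episode is made up for the example.

```python
import re

def title_before_episode(filename):
    """Drop the SxxExx marker, any dots right before it, and everything after it."""
    return re.sub(r"\.*S[0-9][0-9]E[0-9][0-9].*", "", filename)

title = title_before_episode("The.100.S02E05.720p.HDTV.x264-KILLERS.mkv")
print(title)
```

A name with no SxxExx marker comes back unchanged, since nothing matches.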

Is there a regex engine that supports "for each captured group" in replacement strings?

Here's my example. If I want to use a regex to replace tabs in the code with spaces, but wanted to preserve tab characters in the middle or end of a line of code, I would use this as my search string to capture each tab character at the start of a line: ^(\t)+
Now, how could I write a search string that replaces each captured group with four spaces? I'm thinking there must be some way to do this with backreferences?
I've found I can work around this by running similar regex-replacements (like s/^\t/    /g, s/^    \t/        /g, ...) multiple times until no more matches are found, but I wonder if there's a quicker way to do all the necessary replacements at once.
Note: I used sed format in my example, but I'm not sure if this is possible with sed. I'm wondering if sed supports this, and if not, is there a platform that does? (e.g., there's a Python/Java/bash extended regex lib that supports this.)
With perl and other languages that support this feature (Java, PCRE(PHP, R, libboost), Ruby, Python(the new regex module), .NET), you can use the \G anchor that matches the position after the last match or the start of the string:
s/(?:\G|^)\t/    /gm
This works in Perl. Maybe sed too, I don't know sed.
It relies on doing an eval, basically a callback.
It takes the length of $1 then cats '    ' (four spaces) that many times.
Perl sample.
my $str = "
\t\t\tThree
\t\tTwo
\tOne
None";
$str =~ s/^(\t+)/ '    ' x length($1) /emg;
print "$str\n";
Output
            Three
        Two
    One
None
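Since the question also asks about other platforms: the same callback idea carries over to Python, where re.sub accepts a function as the replacement. This is a sketch only; the helper name expand_leading_tabs and the width parameter are made up for the example.

```python
import re

def expand_leading_tabs(text, width=4):
    """Replace each tab in a line's leading run of tabs with `width` spaces."""
    return re.sub(
        r"^\t+",                                        # leading tabs on each line
        lambda m: " " * (width * len(m.group())),       # one block of spaces per tab
        text,
        flags=re.MULTILINE,
    )

code = "\t\t\tThree\n\t\tTwo\n\tOne\nNone\tstays"
expanded = expand_leading_tabs(code)
print(expanded)
```

Tabs in the middle or at the end of a line are left alone, because the pattern is anchored at the start of each line.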
Just another idea that came to me, this could also be solved with positive lookbehind:
s/(?<=^[\t]*)\t/    /gm
It's ugly, but it works.
sed ':a
s/^\(\t*\)\t/\1    /
ta' YourFile
This loops a single substitution in sed until no leading tab is left. It's a workaround.

Using regex to find any last occurrence of a word between two delimiters

Suppose I have the following test string:
Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop
where _ means any characters, eg: StartaGetbbGetcccGetddddStopeeeeeStart....
What I want to extract is any last occurrence of the Get word within Start and Stop delimiters. The result here would be the three bolded Get below.
Start__Get__Get__Get__Stop__Start__Get__Get__Stop__Start__Get__Stop
To be precise, I'd like to do this using only regex and, as far as possible, in a single pass.
Any suggestions are welcome.
Thanks
Get(?=(?:(?!Get|Start|Stop).)*Stop)
I'm assuming your Start and Stop delimiters will always be properly balanced and they can't be nested.
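Here is a quick check of that pattern on the test string, as a sketch in Python (my choice of engine; the pattern only needs lookahead support). The brackets around the matched Get are just there to make the result visible.

```python
import re

text = "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop"

# A Get counts only if Stop follows with no other Get, Start or Stop in between.
last_get = re.compile(r"Get(?=(?:(?!Get|Start|Stop).)*Stop)")

positions = [m.start() for m in last_get.finditer(text)]
marked = last_get.sub("[Get]", text)

print(positions)
print(marked)
```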
I would have done it in two passes: the first pass finds the word "Get", and the second pass counts its occurrences.
$ echo "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get__Stop" | awk -vRS="Stop" -F"_*" '{print $(NF-1)}'
Get
Get
Get
Something like this, maybe:
(?<=Start(?:.Get)*)Get(?=.Stop)
That requires variable-length lookbehind support, which not all regex engines support.
It could be made to have a max length, which a few more (but still not all) support, by changing the first * to {0,99} or similar.
Also, in the lookahead, possibly the . should be a .+ or .{1,2} depending on if the double underscore is a typo or not.
With Perl, I'd do:
my $test = "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop";
$test =~ s#(?<=Start_)((Get_)*)(Get)(?=_Stop)#$1<FOUND>$3</FOUND>#g;
print $test;
output:
Start_Get_Get_<FOUND>Get</FOUND>_Stop_Start_Get_<FOUND>Get</FOUND>_Stop_Start_<FOUND>Get</FOUND>_Stop
You should adapt to your regex flavour.

Regex in sed to convert ##XXX## to ${XXX}

I need to use sed to convert all occurrences of ##XXX## to ${XXX}. X could be any alphabetic character or '_'. I know that I need to use something like:
's/##/\${/g'
But of course that won't work properly, as it will convert ##FOO## to ${FOO${
Here's a shot at a better replacement regex:
's/##\([a-zA-Z_]\+\)##/${\1}/g'
Or if you assume exactly three characters:
's/##\([a-zA-Z_]\{3\}\)##/${\1}/g'
Encapsulate the alpha and '_' within '\(' and '\)' and then in the right side reference that with '\1'.
'+' to match one or more alpha and '_' (in case you see ####).
Add the 'g' option to the end to replace all matches (which I'm guessing is what you want to do in this case).
's/##\([a-zA-Z_]\+\)##/${\1}/g'
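If you want to try the capture-and-backreference idea outside sed, here is the same substitution as a sketch in Python. That is my assumption, since the question asks for sed, and the function name convert_placeholders is made up for the example.

```python
import re

def convert_placeholders(text):
    """Turn every ##NAME## into ${NAME}; NAME is one or more letters or underscores."""
    # \1 in the replacement refers to the captured name; $, { and } are literal.
    return re.sub(r"##([A-Za-z_]+)##", r"${\1}", text)

converted = convert_placeholders("##FOO## and ##BAR_BAZ## but not ####")
print(converted)
```

As with the '+' in the sed version, a bare #### is left untouched because at least one name character is required.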
Use this:
s/##\([^#]*\)##/${\1}/
BTW, there is no need to escape $ in the right side of the "s" operator.
sed 's/##\([a-zA-Z_][a-zA-Z_][a-zA-Z_]\)##/${\1}/'
The \(...\) remembers...and is referenced as \1 in the expansion. Use single quotes to save your sanity.
As noted in the comments below this, this can also be contracted to:
sed 's/##\([a-zA-Z_]\{3\}\)##/${\1}/'
This answer assumes that the example wanted exactly three characters matched. There are multiple variations depending on what is in between the hash marks. The key part is remembering part of the matched string.
echo "##foo##" | sed 's/##/${/;s//}/'
s changes only the first occurrence by default.
An empty pattern (s//) reuses the last search pattern, so the second s also looks for ##, and by then only the second occurrence is left.
echo '##XXX##' | sed "s/^##\([^#]*\)/##$\{\1\}/g"
sed 's/\([^a-z]*[^A-Z]*[^0-9]*\)/(&)/pg