Extracting string before pattern with sed (bash) - regex

I need some help with sed to remove everything after a matching pattern, and to remove the last "." if it exists.
Take this string as an example:
The.100.S02E05.720p.HDTV.x264-KILLERS.mkv
I want everything before the pattern "S[0-9][0-9]E[0-9[0-9]" except the last "."
What I want:
"The.100"
Does anyone have a great oneliner for this one?

It sounds like you can pretty much use exactly what you had in your question:
sed 's/\.*S[0-9][0-9]E[0-9][0-9].*//'
This matches zero or more literal . characters followed by the pattern you suggested (and anything after it), and replaces the match with nothing. You were missing a ] in the question, which I have added.
Testing it out:
$ sed 's/\.*S[0-9][0-9]E[0-9][0-9].*//' <<<'The.100.S02E05.720p.HDTV.x264-KILLERS.mkv'
The.100
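A minimal sketch of how the command behaves on a few more inputs, wrapped in a shell function. The name show_title and the two extra file names are made up for illustration and are not part of the question:

```shell
#!/bin/sh
# show_title is a hypothetical helper, not part of the original answer.
# It prints whatever precedes the first SxxEyy tag, minus any dots
# immediately in front of that tag.
show_title() {
  printf '%s\n' "$1" | sed 's/\.*S[0-9][0-9]E[0-9][0-9].*//'
}

show_title 'The.100.S02E05.720p.HDTV.x264-KILLERS.mkv'  # The.100
show_title 'ArrowS03E01.mkv'                            # Arrow (no dot before the tag)
show_title 'notes.txt'                                  # notes.txt (no tag, so unchanged)
```

Because the dot is matched with \.* rather than \., names without a separator before the tag are handled too.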

Related

Is it possible to perform a "lookahead" regex match only if one of two matches are present? [duplicate]

I'm setting up some goals in Google Analytics and could use a little regex help.
Let's say I have 4 URLs:
http://www.anydotcom.com/test/search.cfm?metric=blah&selector=size&value=1
http://www.anydotcom.com/test/search.cfm?metric=blah2&selector=style&value=1
http://www.anydotcom.com/test/search.cfm?metric=blah3&selector=size&value=1
http://www.anydotcom.com/test/details.cfm?metric=blah&selector=size&value=1
I want to create an expression that will identify any URL that contains the string selector=size but does NOT contain details.cfm
I know that to find a string that does NOT contain another string I can use this expression:
(^((?!details.cfm).)*$)
But, I'm not sure how to add in the selector=size portion.
Any help would be greatly appreciated!
This should do it:
^(?!.*details\.cfm).*selector=size.*$
The ^.*selector=size.*$ part should be clear enough. The first bit, (?!.*details\.cfm), is a negative look-ahead: before matching the string, it checks that the string does not contain "details.cfm" (with any number of characters before it).
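As a cross-check on what the expression is meant to select, the same filter can be sketched with two plain grep passes, which need no look-ahead support. The function name goal_urls is made up for this sketch; it is meant for checking the expected result from a shell, not as a replacement for the goal expression:

```shell
#!/bin/sh
# goal_urls is a hypothetical name. It reads URLs on stdin and keeps those
# that contain "selector=size" but not "details.cfm".
# -F treats the pattern as a fixed string, so the "." needs no escaping.
goal_urls() {
  grep -F 'selector=size' | grep -vF 'details.cfm'
}

goal_urls <<'EOF'
http://www.anydotcom.com/test/search.cfm?metric=blah&selector=size&value=1
http://www.anydotcom.com/test/search.cfm?metric=blah2&selector=style&value=1
http://www.anydotcom.com/test/search.cfm?metric=blah3&selector=size&value=1
http://www.anydotcom.com/test/details.cfm?metric=blah&selector=size&value=1
EOF
```

With the four URLs from the question, this prints the first and the third.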
^(?=.*selector=size)(?:(?!details\.cfm).)+$
If your regex engine supports possessive quantifiers (though I suspect Google Analytics does not), then I guess this will perform better for large input sets:
^[^?]*+(?<!details\.cfm).*?selector=size.*$
The regex could be (Perl syntax):
/^(?!.*details\.cfm).*selector=size.*$/
There is a problem with the regex in the accepted answer. It also matches abcselector=size, selector=sizeabc etc.
A correct regex can be ^(?!.*\bdetails\.cfm\b).*\bselector=size\b.*$
I was looking for a way to avoid --line-buffered on a tail in a situation similar to the OP's, and Kobi's solution works great for me. In my case, I exclude lines containing either "bot" or "spider" while keeping lines containing ' / ' (my root document).
My original command:
tail -f mylogfile | grep --line-buffered -v 'bot\|spider' | grep ' / '
Now becomes (with -P perl switch):
tail -f mylogfile | grep -P '^(?!.*(bot|spider)).*\s\/\s.*$'

Regex stop on certain character

I have this regex
#(.*)\((.*)\)
And I'm trying to get two matches from this string
#YouTube('dqrtLyzNnn8') #Vimeo('124719070')
I need it to stop after the closing ), so I get two matches instead of one.
See example on Regexr
Be lazy (?):
#(.*?)\((.*?)\)
You may use a negated character class:
#([^()]+)\(([^()]+)\)
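A quick way to try the negated character class from a shell is sketched below. It assumes a grep that supports -o (GNU and BSD grep do; it is not required by POSIX), and the helper name find_tags is made up for this example:

```shell
#!/bin/sh
# The sample line from the question.
line="#YouTube('dqrtLyzNnn8') #Vimeo('124719070')"

# find_tags is a hypothetical helper. -o prints each match on its own line,
# -E selects extended regular expressions.
# [^()]+ cannot run past a parenthesis, so each match ends at its own ")".
find_tags() {
  printf '%s\n' "$1" | grep -oE '#[^()]+\([^()]+\)'
}

find_tags "$line"
# #YouTube('dqrtLyzNnn8')
# #Vimeo('124719070')
```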
Your regex is trying to eat as many characters as possible.
Look at the next regex:
echo "#YouTube('dqrtLyzNnn8') #Vimeo('124719070')" |
sed 's/#[a-zA-Z]*(\([^)]*\)[^#]*#[a-zA-Z]*(\([^)]*\).*/\1 \2/'
I do not know the exact requirements; it might be easier to start with a different command:
echo "#YouTube('dqrtLyzNnn8') #Vimeo('124719070')" | cut -d\' -f2,4

Turning off greed not working in this regex

I am trying to run the following search (with . made to match newlines either by adding the /s flag in perl or replacing it with \_. in vim):
/<output_channels>.*(?=Story).*?<\/output_channels>/
However the ? isn't turning off greed as it normally does - can anyone explain why? For example, it matches the entire contents of the following file rather than just the first element:
<output_channels>
<output_channel>RSS</output_channel>
<output_channel>Story</output_channel>
</output_channels>
<output_channels>
<output_channel>RSS</output_channel>
</output_channels>
Sorry if I'm missing something obvious.
I put your sample text into a vim buffer, and then executed the command
:%!perl -e '$text = join("", <STDIN>); $text =~ /<output_channels>.*(?=Story).*?<\/output_channels>/s; print $&;'
The result is just the first block of XML. I think this is what you want?
Note that I escaped the / within the regex. Other than this, it is the same one given in your question.
Also note that the equivalent vim RE would be (tested, works):
<output_channels>\_.*\(Story\)\@=\_.\{-}<\/output_channels>
See :help perl-patterns for a rundown of the differences between perl and vim REs.
Further note that parsing hierarchical markup with regexps has been known to reawaken ancient demons.
The first .* in your regex is still greedy. You only added ? after the second one.
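The effect of a greedy leading .* can be seen in isolation with sed, whose quantifiers are always greedy. This is only a sketch of the greedy half of the problem (sed has no lazy quantifier), and the helper name last_digit is made up for the illustration:

```shell
#!/bin/sh
# last_digit is a hypothetical helper used only for this illustration.
# The leading .* consumes as much as it can while still letting the rest
# of the pattern match, so the capture group ends up on the LAST digit.
last_digit() {
  printf '%s\n' "$1" | sed 's/.*\([0-9]\).*/\1/'
}

last_digit 'a1b2c3'   # 3
```

In the same way, a greedy .* after the opening tag runs forward to the last place where the rest of the pattern can still match, not the first.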

Using regex to find any last occurrence of a word between two delimiters

Suppose I have the following test string:
Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop
where _ means any characters, eg: StartaGetbbGetcccGetddddStopeeeeeStart....
What I want to extract is any last occurrence of the Get word within Start and Stop delimiters. The result here would be the three bolded Get below.
Start__Get__Get__Get__Stop__Start__Get__Get__Stop__Start__Get__Stop
To be precise, I'd like to do this using only regex and, as far as possible, in a single pass.
Any suggestions are welcome.
Thanks
Get(?=(?:(?!Get|Start|Stop).)*Stop)
I'm assuming your Start and Stop delimiters will always be properly balanced and they can't be nested.
I would have done it with two passes. The first pass finds the word "Get", and the second pass counts the number of occurrences of it.
$ echo "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get__Stop" | awk -vRS="Stop" -F"_*" '{print $(NF-1)}'
Get
Get
Get
Something like this, maybe:
(?<=Start(?:.Get)*)Get(?=.Stop)
That requires variable-length lookbehind support, which not all regex engines support.
It could be made to have a max length, which a few more (but still not all) support, by changing the first * to {0,99} or similar.
Also, in the lookahead, possibly the . should be a .+ or .{1,2} depending on if the double underscore is a typo or not.
With Perl, I'd do:
my $test = "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop";
$test =~ s#(?<=Start_)((Get_)*)(Get)(?=_Stop)#$1<FOUND>$3</FOUND>#g;
print $test;
output:
Start_Get_Get_<FOUND>Get</FOUND>_Stop_Start_Get_<FOUND>Get</FOUND>_Stop_Start_<FOUND>Get</FOUND>_Stop
You should adapt this to your regex flavour.

Regex in sed to convert ##XXX## to ${XXX}

I need to use sed to convert all occurrences of ##XXX## to ${XXX}. X could be any alphabetic character or '_'. I know that I need to use something like:
's/##/\${/g'
But of course that won't work properly, as it will convert ##FOO## to ${FOO${.
Here's a shot at a better replacement regex:
's/##\([a-zA-Z_]\+\)##/${\1}/g'
Or, if you assume exactly three characters:
's/##\([a-zA-Z_]\{3\}\)##/${\1}/g'
Encapsulate the alpha and '_' within '\(' and '\)' and then in the right side reference that with '\1'.
'+' to match one or more alpha and '_' (in case you see ####).
Add the 'g' option to the end to replace all matches (which I'm guessing is what you want to do in this case).
's/##\([a-zA-Z_]\+\)##/${\1}/g'
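A minimal sketch of the same substitution applied to a whole line, under one assumption: \+ is a GNU sed extension, so "one or more" is spelled here as a character class followed by the same class with *, which should also work in other sed implementations. The function name to_shell_vars and the sample input are made up for the example:

```shell
#!/bin/sh
# to_shell_vars is a hypothetical helper. It reads text on stdin and rewrites
# every ##NAME## placeholder (letters and underscores only) as ${NAME}.
# The single quotes keep the shell from expanding ${...} itself.
to_shell_vars() {
  sed 's/##\([A-Za-z_][A-Za-z_]*\)##/${\1}/g'
}

printf '%s\n' 'cp ##SRC_DIR##/file ##DEST##' | to_shell_vars
# cp ${SRC_DIR}/file ${DEST}

printf '%s\n' '####' | to_shell_vars
# #### (nothing between the hash marks, so left alone)
```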
Use this:
s/##\([^#]*\)##/${\1}/
BTW, there is no need to escape $ in the right side of the "s" operator.
sed 's/##\([a-zA-Z_][a-zA-Z_][a-zA-Z_]\)##/${\1}/'
The \(...\) remembers...and is referenced as \1 in the expansion. Use single quotes to save your sanity.
As noted in the comments below this, this can also be contracted to:
sed 's/##\([a-zA-Z_]\{3\}\)##/${\1}/'
This answer assumes that the example wanted exactly three characters matched. There are multiple variations depending on what is in between the hash marks. The key part is remembering part of the matched string.
echo "##foo##" | sed 's/##/${/;s//}/'
By default, s replaces only the first occurrence on a line.
An empty pattern (s//) reuses the last search pattern, so the second s also matches ##, and by then only the second occurrence remains.
echo '##XXX##' | sed 's/##\([^#]*\)##/${\1}/g'
sed 's/\([^a-z]*[^A-Z]*[^0-9]*\)/(&)/pg'