How to extract a value between two strings using regex? - regex

Here is the demo string:
beforeValueAfter
Assume I know that the value I want is between "before" and "After".
I want to extract "Value" using a regex.
pervious909078375639355544after
Assume I know that the value I want is between "pervious90907" and "55544after".
I want to extract "83756393" using a regex.
Thanks in advance.

The answer depends on two things:
Whether you know exactly what the value consists of. If you know it will be digits, for example, that makes it easier. If it could be anything, the answer is a little harder.
Whether your regex engine is greedy or ungreedy by default, which affects how you set up the expression. I will assume it is greedy by default.
If the value can be anything, the ? is needed to make .* ungreedy; a greedy ".*" would run on to the last "After" in the input rather than stopping at the first:
/before(.*?)After/
If you know it is digits:
/before(\d*)After/
If it could be any word characters (0-9, a-z, A-Z, _):
/before(\w*?)After/
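A minimal Python sketch of the greedy versus ungreedy behaviour described above (Python's re module is greedy by default, which matches the assumption in this answer). The input with two "After" markers and the digit sample are made-up strings for illustration only.

```python
import re

# Sample from the question: only one "After", so greedy and lazy agree.
simple = re.search(r"before(.*?)After", "beforeValueAfter").group(1)

# Hypothetical input with two "After" markers, to expose the difference.
tricky = "beforeValueAfter and one more After"
lazy = re.search(r"before(.*?)After", tricky).group(1)   # stops at the first "After"
greedy = re.search(r"before(.*)After", tricky).group(1)  # runs to the last "After"

# When the value is known to be digits, \d* cannot run past the delimiter.
digits = re.search(r"before(\d*)After", "before12345After").group(1)

print(repr(simple), repr(lazy), repr(greedy), repr(digits))
```

The lazy form yields "Value" for both inputs, while the greedy form captures everything up to the final "After" in the second one.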

Try this regular expression:
pervious90907(.*?)55544after
That will get you the shortest string (note the non-greedy *? quantifier) between pervious90907 and 55544after.
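The same idea can be wrapped in a small helper. This is only a sketch: the function name between is hypothetical, and re.escape is added so that delimiters containing regex metacharacters are still treated literally.

```python
import re

def between(text, left, right):
    """Return the shortest text found between left and right, or None."""
    # re.escape keeps the delimiters literal; (.*?) is the non-greedy capture.
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    match = re.search(pattern, text)
    return match.group(1) if match else None

print(between("pervious909078375639355544after", "pervious90907", "55544after"))
print(between("beforeValueAfter", "before", "After"))
```

Note that . does not match a newline by default, so a value spanning several lines would need the re.DOTALL flag.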

The regex should be like this (digits only, between the known delimiters):
pervious90907([0-9]*)55544after

bash-3.2$ echo pervious909078375639355544after | perl -ne 'print "$1\n" if /pervious90907(.*)55544after/'
83756393

Related

Pattern matching in Perl

I am pattern matching on names like these:
ABCD123_HH1
ABCD123_HH1_K
My code to match these names is:
($name, $kind) = $dirname =~ /ABCD(\d+)\w*_([\w\d]+)/;
The problem is that $dirname can hold either form, ABCD123_HH1 or ABCD123_HH1_K. The match works as expected for ABCD123_HH1, but for ABCD123_HH1_K my variable $kind does not get HH1_K.
Could you please tell me what can be done to capture the form with _K? I appreciate your time.

You need to add the _K part to the end of your regex and make it optional with ?:
/ABCD(\d+)_([\w\d]+(_K)?)/
I also removed the \w*, which is useless and keeps you from correctly getting the HH1_K.

You should check for zero or more occurrences of _K.
* in Perl's regexp means zero or more times.
+ means one or more times.
Hence, append (_K)* to your regexp. The \w* also has to become non-greedy (\w*?); left greedy, it swallows _HH1 and leaves only K for the second group.
Finally, your regexp should be this:
/ABCD(\d+)\w*?_([\w\d]+(_K)*)/

\w matches letters, digits, and underscores.
So you can use something as simple as this:
/ABCD\w+/
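To see why the original pattern loses the HH1 part, here is a small Python sketch; for these patterns Python's re behaves like Perl with respect to \d, \w and greedy quantifiers. The simplified pattern ABCD(\d+)_(\w+) is one possible fix, not the only one.

```python
import re

names = ["ABCD123_HH1", "ABCD123_HH1_K"]

# Pattern from the question: the greedy \w* can swallow "_HH1".
original = re.compile(r"ABCD(\d+)\w*_([\w\d]+)")
# Without \w*, the second group starts right after the first underscore.
fixed = re.compile(r"ABCD(\d+)_(\w+)")

results = {}
for name in names:
    results[name] = (original.search(name).groups(), fixed.search(name).groups())
    print(name, results[name])
```

For ABCD123_HH1_K the original pattern yields ('123', 'K') because \w* backtracks only as far as the last underscore, while the fixed pattern yields ('123', 'HH1_K').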

Regex string transformation/extraction

Code:
https://aaa.bbb.net/ccc/211099_589944494365122_1446403980_n.jpg
How can I get 589944494365122 out of that string using regex?
The best I can do so far is _(.*), which results in 589944494365122_1446403980_n.jpg

First, you should generalize your problem description, like this: How can I get the longest non-empty substring of digits after the first _ in the string? The regexp you literally asked for is (589944494365122), but that is presumably not what you want.
According to my guess about what you want, the answer could be _(\d+).

The extraction rule I can see in your input is:
211099_  589944494365122  _1446403980
[0-9]+_  part we want     _[0-9]+
so a regex with look-behind and look-ahead will help:
'(?<=\d_)\d+(?=_\d)'
test with grep:
kent$ echo " https://aaa.bbb.net/ccc/211099_589944494365122_1446403980_n.jpg"|grep -Po '(?<=\d_)\d+(?=_\d)'
589944494365122

This works:
var s = "https://aaa.bbb.net/ccc/211099_589944494365122_1446403980_n.jpg";
var m = /_([^_]*)/.exec(s);
console.log( m[1] ); // 589944494365122

I would go with \d+_(\d+)_\d+_n\.jpg, but depending on the exact specification of the URL this may need a little tweaking.
Depending on the language, it may also need to be altered slightly. The solution I suggest works, for instance, in Ruby (as well as in many other regex implementations). Here \d matches any digit and \d+ means one or more digits. I assume the letter before .jpg is always n, but you can change this by replacing n with either . (any character) or \w (any word character).
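The approaches suggested in the answers above can be compared side by side. A Python sketch using the URL from the question; the variable names are illustrative only.

```python
import re

url = "https://aaa.bbb.net/ccc/211099_589944494365122_1446403980_n.jpg"

# Capture group: the run of digits right after the first underscore.
by_group = re.search(r"_(\d+)", url).group(1)

# Lookarounds: digits preceded by "<digit>_" and followed by "_<digit>".
by_lookaround = re.search(r"(?<=\d_)\d+(?=_\d)", url).group(0)

# Stricter: match the whole file-name shape and capture the middle number.
by_shape = re.search(r"\d+_(\d+)_\d+_n\.jpg", url).group(1)

print(by_group, by_lookaround, by_shape)
```

All three give the same result on this URL; they differ in how much of the surrounding structure they insist on, which matters once the input varies.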

Little vim regex

I have a bunch of strings that look like this: '../DisplayPhotod6f6.jpg?t=before&tn=1&id=130', and I'd like to take out everything after the question mark, to look like '../DisplayPhotod6f6.jpg'.
s/\(.\.\.\/DisplayPhoto.\{4,}\.jpg\)*'/\1'/g
This regex is capturing some but not all occurrences. Can you see why?

\.\{4,} is trying to match 4 or more . characters. What it looks like you wanted is "match 4 or more of any character" (.\{4,}) but "match 4 or more non-. characters" ([^.]\{4,}) might be more accurate. You'll also need to change the lone * at the end of the pattern to .* since the * is currently applying to the entire \(\) group.

I think the easiest way to go for this is:
s/?.*$/'/g
This says: replace everything from the question mark to the end of the line with a single quote.

I would use a macro, which is sometimes simpler than a regexp (and interactive):
qa
/DisplayPhoto<Enter>
f?dt'
n
q
And then some @a, or 20000@a to go through all lines.

The following regexp: /(\.\.\/DisplayPhoto.*\.jpg)/gi
tested against the following examples:
../DisplayPhotocef3.jpg?t=before&tn=1&id=54
../DisplayPhotod6f6.jpg?t=before&tn=1&id=130
will result in:
../DisplayPhotocef3.jpg
../DisplayPhotod6f6.jpg

%s/\('\.\.\/DisplayPhoto\w\{4,}\.jpg\).*'/\1'/g
Some notes:
% will cause the swap to work on all lines.
\w instead of '.', in case there are some malformed file names.
Replace '.' at the start of your matching regex with ' which is exactly what it should be matching.
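For comparison outside Vim, the same substitution can be sketched in Python. The quoting of the second sample line and the use of [^']* instead of .* are my assumptions; [^']* keeps the match from running past the closing quote when a line holds several quoted paths.

```python
import re

lines = [
    "'../DisplayPhotod6f6.jpg?t=before&tn=1&id=130'",
    "'../DisplayPhotocef3.jpg?t=before&tn=1&id=54'",
]

# Group 1 keeps the opening quote and the path up to ".jpg";
# [^']*' consumes the query string and the closing quote.
pattern = re.compile(r"('\.\./DisplayPhoto\w{4,}\.jpg)[^']*'")

# Put back group 1 followed by a fresh closing quote.
cleaned = [pattern.sub(r"\1'", line) for line in lines]
print(cleaned)
```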

Using regex to find any last occurrence of a word between two delimiters

Suppose I have the following test string:
Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop
where _ stands for any characters, e.g.: StartaGetbbGetcccGetddddStopeeeeeStart....
What I want to extract is the last occurrence of the word Get within each pair of Start and Stop delimiters. The result here would be three matches, the last Get before each Stop:
Start__Get__Get__Get__Stop__Start__Get__Get__Stop__Start__Get__Stop
To be precise, I'd like to do this using only regex and, as far as possible, in a single pass.
Any suggestions are welcome.
Thanks

Get(?=(?:(?!Get|Start|Stop).)*Stop)
I'm assuming your Start and Stop delimiters will always be properly balanced and they can't be nested.
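The lookahead pattern above can be tried out in Python, whose re module supports lookaheads. This sketch marks each match with angle brackets so the result is easy to check; the second input is a made-up string using arbitrary filler characters instead of underscores.

```python
import re

text = "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop"

# A Get matches only if no other Get, Start or Stop sits between it
# and the next Stop, i.e. it is the last Get of its block.
last_get = re.compile(r"Get(?=(?:(?!Get|Start|Stop).)*Stop)")

marked = last_get.sub("<Get>", text)
count = len(last_get.findall(text))
print(marked)
print(count)

# Hypothetical input with arbitrary filler between the words.
filler_count = len(last_get.findall("StartaGetbbGetcccGetddddStopeeeee"))
print(filler_count)
```

The inner (?!Get|Start|Stop). step consumes one character at a time while refusing to cross another keyword, which is what restricts the match to the final Get.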
I would do it in two passes: the first pass finds the word "Get", and the second pass counts its occurrences.
$ echo "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get__Stop" | awk -vRS="Stop" -F"_*" '{print $(NF-1)}'
Get
Get
Get

Something like this, maybe:
(?<=Start(?:.Get)*)Get(?=.Stop)
That requires variable-length lookbehind support, which not all regex engines support.
It could be made to have a max length, which a few more (but still not all) support, by changing the first * to {0,99} or similar.
Also, in the lookahead, the . should possibly be a .+ or .{1,2}, depending on whether the double underscore is a typo or not.

With Perl, I'd do:
my $test = "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop";
$test =~ s#(?<=Start_)((Get_)*)(Get)(?=_Stop)#$1<FOUND>$3</FOUND>#g;
print $test;
output:
Start_Get_Get_<FOUND>Get</FOUND>_Stop_Start_Get_<FOUND>Get</FOUND>_Stop_Start_<FOUND>Get</FOUND>_Stop
You should adapt this to your regex flavour.

Regex in sed to convert ##XXX## to ${XXX}

I need to use sed to convert all occurrences of ##XXX## to ${XXX}. X could be any alphabetic character or '_'. I know that I need to use something like:
's/##/\${/g'
But of course that won't work properly, as it will convert ##FOO## to ${FOO${

Here's a shot at a better replacement regex:
's/##\([a-zA-Z_]\+\)##/${\1}/g'
Or, if you assume exactly three characters:
's/##\([a-zA-Z_]\{3\}\)##/${\1}/g'

Enclose the alphabetic characters and '_' within '\(' and '\)', then reference that group on the right side with '\1'.
Use '\+' to match one or more alphabetic characters or '_' (so that a bare #### is left alone).
Add the 'g' option at the end to replace all matches (which I'm guessing is what you want to do in this case).
's/##\([a-zA-Z_]\+\)##/${\1}/g'

Use this:
s/##\([^#]*\)##/${\1}/
BTW, there is no need to escape $ on the right side of the "s" operator.

sed 's/##\([a-zA-Z_][a-zA-Z_][a-zA-Z_]\)##/${\1}/'
The \(...\) remembers the matched text, which is referenced as \1 in the replacement. Use single quotes to save your sanity.
As noted in the comments, this can also be contracted to:
sed 's/##\([a-zA-Z_]\{3\}\)##/${\1}/'
This answer assumes that the example wanted exactly three characters matched. There are multiple variations depending on what is in between the hash marks. The key part is remembering part of the matched string.
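The "remember part of the match" idea carries over directly to other regex tools. A Python sketch, assuming one or more letters or underscores between the hash marks; the function name hashes_to_braces and the sample input are made up for illustration.

```python
import re

def hashes_to_braces(text):
    # ([A-Za-z_]+) remembers the name; \1 inserts it in the replacement.
    # "$", "{" and "}" have no special meaning in a Python replacement string.
    return re.sub(r"##([A-Za-z_]+)##", r"${\1}", text)

print(hashes_to_braces("path=##HOME_DIR##/bin:##FOO##"))
print(hashes_to_braces("####"))
```

Because the group requires at least one character, a bare #### is left untouched, and re.sub replaces every occurrence, like the g flag in sed.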
echo "##foo##" | sed 's/##/${/;s//}/'
s changes only the first occurrence on each line by default.
An empty pattern (s//) reuses the last search pattern, so the second s also matches ##; by then only the second occurrence is left.
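The two-step trick can be mirrored in Python with plain string replacement limited to one occurrence each. This is a sketch of the same idea, not a translation of sed itself, and convert_first is a hypothetical name. Like the sed command without g, it only converts the first placeholder on a line.

```python
def convert_first(line):
    # Step 1: the first "##" becomes "${".
    # Step 2: the first "##" still left (originally the second) becomes "}".
    return line.replace("##", "${", 1).replace("##", "}", 1)

print(convert_first("##foo##"))
print(convert_first("##a## ##b##"))
```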
echo '##XXX##' | sed 's/^##\([^#]*\)##/${\1}/g'

sed 's/\([^a-z]*[^A-Z]*[^0-9]*\)/(&)/pg'