Regular expression to find first match only - regex

I have this text :-
SOME text, .....
Number of successes: 3556
Number of failures: 22
Some text, .....
Number of successes: 2623
Number of failure: 0
My requirement is to find the first occurrence of the pattern "Number of successes: (\d+)", which here is Number of successes: 3556.
But the above expression returns the subsequent matches as well.
I want the regular expression itself to do this for me, rather than iterating over the matches in a loop as I could in Java.
Can anyone help me with a regular expression that finds the first occurrence only?

One solution that should work in most regex flavors that support lookahead, inline modifiers, and \A (for example PCRE, Perl, Java, and Python):
(?s)\A(?:(?!Number of successes:).)*Number of successes: (\d+)
Explanation:
(?s) # Turn on singleline mode
\A # Start of string
(?: # Non-capturing group:
(?!Number of successes:) # Unless this text intervenes:
. # Match any character.
)* # Repeat as needed.
Number of successes:[ ] # Then match this text
(\d+) # and capture the following number
See it live on regex101.com.
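For illustration, here is a minimal Python sketch of the pattern above. Python is an assumption here, since the question does not name a host language; the sample text is copied from the question. It compares the plain pattern with the anchored one when every match is requested.

```python
import re

text = """SOME text, .....
Number of successes: 3556
Number of failures: 22
Some text, .....
Number of successes: 2623
Number of failure: 0
"""

# Pattern from the answer: anchored at the start of the string with \A,
# so it cannot match a second time further down.
anchored = re.compile(
    r"(?s)\A(?:(?!Number of successes:).)*Number of successes: (\d+)"
)

# findall reports every non-overlapping match of a pattern.
all_plain = re.findall(r"Number of successes: (\d+)", text)
all_anchored = anchored.findall(text)

print(all_plain)     # ['3556', '2623']
print(all_anchored)  # ['3556']
```

In Python specifically, `re.search` already stops at the first match, so the anchored prefix matters mostly in tools that always report all matches.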

In case doing it purely via regexp is not a hard requirement, here are alternatives to the (nice) regexp-only approach by Tim:
awk ' $0~/Number of successes: [1-9][0-9]*/ { print $0 ; exit 0 ;}'
or the really simple
grep 'Number of successes: [1-9][0-9]*' | head -1
I much prefer the awk one, as it quits as soon as it sees the first match, whereas the 2nd one could process many lines after it (until it receives the SIGPIPE or end of file)
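The early-exit behaviour of the awk version can be sketched in Python as well. This is only an illustration; the function name `first_matching_line` is made up for this example.

```python
import re
from typing import Iterable, Optional

# Same pattern as in the awk and grep commands above.
PATTERN = re.compile(r"Number of successes: [1-9][0-9]*")

def first_matching_line(lines: Iterable[str]) -> Optional[str]:
    """Return the first line that matches, without reading any further lines."""
    for line in lines:
        if PATTERN.search(line):
            return line.rstrip("\n")
    return None

sample = [
    "SOME text, .....",
    "Number of successes: 3556",
    "Number of failures: 22",
    "Number of successes: 2623",
]

source = iter(sample)
found = first_matching_line(source)
print(found)         # Number of successes: 3556
print(next(source))  # Number of failures: 22 (the rest was never consumed)
```

Passing an open file object instead of the list works the same way, because the loop pulls one line at a time.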

Try using grep with -m option
grep -m 1 'Number of successes: [0-9]\+' file

Related

how to shell script regex perfect matching?

I have a Bash script file that matches a regex.
My regex script file:
if [[ "$image" =~ [0-9]+(\.[0-9]+){3}\-[0-9]+$ ]]; then
I need to pass cases that only match 0.0.0.0-0000
These are my inputs and results.
pass : 0.0.0.0-0000
pass : 0.0.0.0.0.0-0000 << Unwanted match
no : 0.0.0.0-word
no : 0.0.0.0
As I marked above 0.0.0.0.0.0-0000 gets a match with my regex.
My question is how can I modify my regex to only match the pattern 0.0.0.0-0000?
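Assuming that you are trying to match some sort of IP-address-like string, I came up with this regex (note that \d is PCRE-style shorthand; inside Bash's [[ =~ ]], which uses POSIX ERE, write [0-9] instead).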
^(\d+\.?){4}-\d+
Regex Demo
Note the \d+ in the first capturing group (\d+\.?), which matches any number before a dot. If the only starting pattern is 0.0.0.0, you can remove the + here to match a single digit only.
Explanation:
^ - Matches the start of the string
(\d+\.?){4} - Matches a number followed by an optional . character, 4 times in a row, covering 0.0.0.0
-\d+ - Matches a - character and a run of digits, covering -0000
This issue is solved.
Following the comment by @The fourth bird above: I missed the anchor (^).
To pin down both the start and the end of the match, the pattern should sit between '^' and '$'.
For reference, this was the comment (The fourth bird, Jul 11 at 8:43):
if [[ "$image" =~ ^[0-9]+(\.[0-9]+){3}\-[0-9]+$ ]];
Thanks to everyone who replied.
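To see the effect of the missing ^, here is a small Python sketch that runs the four inputs from the question through both versions of the pattern. Python is used only for illustration; the original script is Bash, and the escaped \- is written as a plain - here.

```python
import re

# Pattern from the question (no ^) and the fixed pattern (with ^).
unanchored = re.compile(r"[0-9]+(\.[0-9]+){3}-[0-9]+$")
anchored = re.compile(r"^[0-9]+(\.[0-9]+){3}-[0-9]+$")

inputs = ["0.0.0.0-0000", "0.0.0.0.0.0-0000", "0.0.0.0-word", "0.0.0.0"]

# search() may start matching anywhere in the string, so only the explicit ^
# rejects the input that merely ends with a valid 0.0.0.0-0000 suffix.
results = {
    s: (bool(unanchored.search(s)), bool(anchored.search(s))) for s in inputs
}

for s, (loose, strict) in results.items():
    print(f"{s!r}: unanchored={loose} anchored={strict}")
```

In Python, `re.fullmatch` would be another way to require that the whole string matches.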

last year occurrence from string

I have strings like this:
ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar
I'm trying to get the last occurrence of a single year (from 1900 to 2050), so I need to extract only 1934 from that string.
I'm trying with:
grep -P -o '\s(19|20)[0-9]{2}\s(?!\s(19|20)[0-9]{2}\s)'
or
grep -P -o '((19|20)[0-9]{2})(?!\s\1\s)'
But it matches: 1910 and 1934
Here's the Regex101 example:
https://regex101.com/r/UetMl0/3
https://regex101.com/r/UetMl0/4
Plus: how can I extract the year without the surrounding spaces without doing an extra grep to filter them?
Have you ever heard this saying:
Some people, when confronted with a problem, think
“I know, I'll use regular expressions.” Now they have two problems.
Keep it simple - you're interested in finding a number between 2 numbers so just use a numeric comparison, not a regexp:
$ awk -v min=1900 -v max=2050 '{yr=""; for (i=1;i<=NF;i++) if ( ($i ~ /^[0-9]{4}$/) && ($i >= min) && ($i <= max) ) yr=$i; print yr}' file
1934
You didn't say what to do if no date within your range is present so the above outputs a blank line if that happens but is easily tweaked to do anything else.
To change the above script to find the first instead of the last date is trivial (move the print inside the if), to use different start or end dates in your range is trivial (change the min and/or max values), etc., etc. which is a strong indication that this is the right approach. Try changing any of those requirements with a regexp-based solution.
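The same numeric-comparison idea, sketched in Python for comparison. The function name `last_year` and its parameters are made up for this example; the logic mirrors the awk script (four-digit field, range check, keep the last hit).

```python
import re

def last_year(line: str, lo: int = 1900, hi: int = 2050) -> str:
    """Return the last whitespace-separated 4-digit field within [lo, hi], or ''."""
    result = ""
    for field in line.split():
        # Only plain 4-digit fields qualify; 01900, X1911D and 1955-2011 do not.
        if re.fullmatch(r"[0-9]{4}", field) and lo <= int(field) <= hi:
            result = field
    return result

s = "ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"
print(last_year(s))  # 1934
```

Returning from inside the loop instead of overwriting `result` would give the first year rather than the last.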
I don't see a way to do this with grep because it doesn't let you output just one of the capture groups, only the whole match.
With perl I'd do something like
perl -lne 'if (/^.*\b(19\d\d|20(?:[0-4]\d|50))\b/) { print $1 }'
Idea: Use ^.* (greedy) to consume as much of the string up front as possible, thus finding the last possible match. Use \b (word boundary) around the matched number to prevent matching 01900 or X1911D. Only print the first capture group ($1).
I tried to implement your requirement of 1900-2050; if that's too complicated, ((?:19|20)\d\d) will do (but also match e.g. 2099).
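A quick Python sketch of the greedy-prefix idea, assuming the year range 1900-2050 from the question. Note that \b also treats the two halves of 1955-2011 as candidates, since - is not a word character; here they simply lose to the later 1934.

```python
import re

# Greedy ^.* pushes the capture group as far right as it can go.
LAST_YEAR = re.compile(r"^.*\b(19\d\d|20(?:[0-4]\d|50))\b")

s = "ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"
m = LAST_YEAR.search(s)
print(m.group(1))  # 1934

# With word boundaries, a hyphenated range still yields a year.
print(LAST_YEAR.search("foo 1955-2011 bar").group(1))  # 2011
```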
The regex to do your task using grep can be as follows:
\b(?:19\d{2}|20[0-4]\d|2050)\b(?!.*\b(?:19\d{2}|20[0-4]\d|2050)\b)
Details:
\b - Word boundary.
(?: - Start of a non-capturing group, needed as a container for
alternatives.
19\d{2}| - The first alternative (1900 - 1999).
20[0-4]\d| - The second alternative (2000 - 2049).
2050 - The third alternative, just 2050.
) - End of the non-capturing group.
\b - Word boundary.
(?! - Negative lookahead for:
.* - A sequence of any chars, meaning actually "what follows
can occur anywhere further".
\b(?:19\d{2}|20[0-4]\d|2050)\b - The same expression as before.
) - End of the negative lookahead.
The word boundary anchors ensure that you will not match numbers that are
parts of longer words, e.g. X1911D.
The negative lookahead ensures that you will match just the last
occurrence of the required year.
If you can use a tool other than grep that supports calling a previous
numbered group with (?n), where n is the number of the capturing
group, the regex can be a bit simpler:
(\b(?:19\d{2}|20[0-4]\d|2050)\b)(?!.*(?1))
Details:
(\b(?:19\d{2}|20[0-4]\d|2050)\b) - The regex like before, but
enclosed within a capturing group (it will be "called" later).
(?!.*(?1)) - Negative lookahead for capturing group No 1,
located anywhere further.
This way you avoid writing the same expression again.
For a working example in regex101 see https://regex101.com/r/fvVnZl/1
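Python's built-in re module has no (?1) subroutine call, so a sketch there would build the repeated part by string concatenation instead. The variable names below are made up for this example; the year sub-pattern is the one from this answer.

```python
import re

# The year sub-pattern from the answer, written once and reused.
YEAR = r"\b(?:19\d{2}|20[0-4]\d|2050)\b"

# A year that is not followed, anywhere later on the line, by another year.
LAST_YEAR = re.compile(YEAR + r"(?!.*" + YEAR + r")")

s = "ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"
matches = LAST_YEAR.findall(s)
print(matches)  # ['1934']
```

Because the lookahead rejects every year that has a later sibling, even `findall` returns a single match per line.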
You may use a PCRE regex without any groups to only return the last occurrence of a pattern you need if you prepend the pattern with ^.*\K, or, in your case, since you expect a whitespace boundary, ^(?:.*\s)?\K:
grep -Po '^(?:.*\s)?\K(?:19\d{2}|20(?:[0-4]\d|50))(?!\S)' file
See the regex demo.
Details
^ - start of line
(?:.*\s)? - an optional non-capturing group matching 1 or 0 occurrences of
.* - any 0+ chars other than line break chars, as many as possible
\s - a whitespace char
\K - match reset operator discarding the text matched so far
(?:19\d{2}|20(?:[0-4]\d|50)) - 19 and any two digits or 20 followed with either a digit from 0 to 4 and then any digit (00 to 49) or 50.
(?!\S) - a negative lookahead that requires a whitespace char or the end of string immediately to the right of the current location.
See an online demo:
s="ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"
grep -Po '^(?:.*\s)?\K(?:19\d{2}|20(?:[0-4]\d|50))(?!\S)' <<< "$s"
# => 1934
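Python's built-in re module does not support \K, so a rough equivalent there uses a capturing group for the part after the reset point. This is a sketch of the same whitespace-boundary logic, not a drop-in replacement for the grep command.

```python
import re

# Same pattern as above, with \K replaced by a capturing group around the year.
LAST_YEAR = re.compile(r"^(?:.*\s)?(19\d{2}|20(?:[0-4]\d|50))(?!\S)")

s = "ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"
m = LAST_YEAR.search(s)
print(m.group(1))  # 1934

# Whitespace boundaries mean the halves of 1955-2011 are not candidates.
print(LAST_YEAR.search("x 1955-2011 y"))  # None
```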

RegEx skip word

I would like to use regular expressions to extract the first couple of words and the second to last letter of a string.
For example, in the string
"CSC 101 Intro to Computing A R"
I would like to capture
"CSC 101 A"
Maybe something similar to this
grep -o -P '\w{3}\s\d{3}*thenIdon'tKnow*\s\w\s'
Any help would be greatly appreciated.
You could go for:
^((?:\w+\W+){2}).*(\w+)\W+\w+$
And use group 1 + 2, see it working on regex101.com.
Broken down, this says:
^ # match the start of the line/string
( # capture group 1
(?:\w+\W+){2} # repeated non-capturing group with words/non words
)
.* # anything else afterwards
(\w+)\W+\w+ # backtracking to the second last word character
$
Do:
^(\S+)\s+(\S+).*(\S+)\s+\S+$
The 3 captured groups capture the 3 desired portions
\S indicates any non-whitespace character
\s indicates any whitespace character
Demo
As you have used grep with PCRE in your example, i am assuming you have access to the GNU toolset. Using GNU sed:
% sed -E 's/^(\S+)\s+(\S+).*(\S+)\s+\S+$/\1 \2 \3/' <<<"CSC 101 Intro to Computing A R"
CSC 101 A
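The same three-group pattern can be tried in Python; this sketch joins the captured groups with spaces, as the sed replacement does. The second call shows a caveat of the greedy .* in the middle: if the second-to-last word has more than one character, only its last character is captured.

```python
import re

PATTERN = re.compile(r"^(\S+)\s+(\S+).*(\S+)\s+\S+$")

def extract(line: str) -> str:
    """Join the three captured groups with single spaces, or return '' if no match."""
    m = PATTERN.match(line)
    return " ".join(m.groups()) if m else ""

print(extract("CSC 101 Intro to Computing A R"))   # CSC 101 A
print(extract("CSC 101 Intro to Computing AB R"))  # CSC 101 B
```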
A single regex match is one contiguous stretch of text, so the overall match can't consist of disjoint pieces.
I suggest taking a look at capture groups: you capture the two disjoint parts in separate groups and then build the result by referring to those groups.
grep can't print out multiple capture groups, so here is an example with sed:
echo 'CSC 101 Intro to Computing A R' | sed -n 's/^\(\w\{3\}\s[[:digit:]]\{3\}\).*\?\(\w\)\s\+\w$/\1 \2/p'
which prints out CSC 101 A
Note that the pattern used here is ^(\w{3}\s\d{3}).*?(\w)\s+\w$

RegEx lookahead on .*

I have a pattern that needs to find the last occurrence of string1 unless string2 is found anywhere in the subject, then it needs the first occurrence of string1. In order to solve this I wrote this inefficient negative lookahead.
/(.(?!.*?string2))*string1/
It takes several seconds to run (prohibitively long on subjects lacking any occurrence of either string). Is there a more efficient way to accomplish this?
You should be able to use the following:
/string1(?!.*?string2)/
This will match string1 as long as string2 is not found later in the string, which I think meets your requirements.
Edit: After seeing your update, try the following:
/.*?string1(?=.*?string2)|.*string1/
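To see what the alternation actually selects, here is a Python sketch with capture groups added. As stand-ins it uses nutria\d. for string1 and RABBIT for string2, together with the sample subjects from another answer in this thread; the helper name `pick` is made up.

```python
import re

# First branch: the first string1 that still has string2 somewhere after it.
# Second branch: otherwise, the last string1 (greedy .*).
PATTERN = re.compile(r".*?(nutria\d.)(?=.*?RABBIT)|.*(nutria\d.)")

def pick(subject: str):
    """Return the selected string1 occurrence, or None if there is none."""
    m = PATTERN.search(subject)
    if m is None:
        return None
    return m.group(1) or m.group(2)

subjects = [
    "groundhog nutria1A beaver nutria1B",
    "polecat nutria2A badger RABBIT nutria2B",
    "weasel RABBIT nutria3A nutria3B nutria3C",
    "vole nutria4A marten nutria4B marmot nutria4C RABBIT",
]
picked = [pick(s) for s in subjects]
print(picked)  # ['nutria1B', 'nutria2A', 'nutria3C', 'nutria4A']
```

Note the third subject: when string2 appears only before every string1, the lookahead never succeeds and the second branch returns the last occurrence.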
You could also do if/else statements in your regex !
(?(?=.*string2).*(string1).*$|^.*?(string1))
Explanation:
(? # If
(?=.*string2) # Lookahead, if there is string2
.*(string1).*$ # Then match the last string1
| # Else
^.*?(string1) # Match the first string1
)
If string1 is found, you'll find it in group 1 (when the lookahead succeeds) or in group 2 (otherwise).
OK, now I understand what you want. The pattern is a bit long, but optimized to be fast:
nutria\d. -> string1
RABBIT -> string2
The pattern (example in PHP):
$pattern = <<<LOD
~(?J) # allow multiple capture groups with the same name
### capture the first nutria if RABBIT isn't found before ###
^ (?>[^Rn]++|R++(?!ABBIT)|n++(?!utria\d.))* (?<res>nutria\d.)
### try to capture the last nutria without RABBIT until the end ###
(?>
(?>
(?> [^Rn]++ | R++(?!ABBIT) | n++(?!utria\d.) )*
(?<res>nutria\d.)
)* # repeat as possible to catch the last nutria
(?> [^R]++ | R++(?!ABBIT) )* $ # the end without RABBIT
)? # /!\important/!\ this part is optional, then only the first captured
# nutria is in the result when RABBIT is found in this part
| # OR
### capture the first nutria when RABBIT is found before
^(?> [^n]++ | n++(?!utria\d.) )* (?<res>nutria\d.)
~x
LOD;
$subjects = array( 'groundhog nutria1A beaver nutria1B',
'polecat nutria2A badger RABBIT nutria2B',
'weasel RABBIT nutria3A nutria3B nutria3C',
'vole nutria4A marten nutria4B marmot nutria4C RABBIT');
foreach($subjects as $subject) {
if (preg_match($pattern, $subject, $match))
echo '<br/>'.$match['res'];
}
The pattern is designed to fail as fast as possible, using atomic groups and possessive quantifiers with alternations. It thus avoids catastrophic backtracking and uses as few lookaheads as possible (one is tried only when an n or an R is found, and it fails quickly).
Try this regex:
string1(?!.*?string1)|string1(?=.*?string2)
Live Demo: http://www.rubular.com/r/uAjOqaTkYH
Try using the possessive operator .*+, it uses less memory (it doesn't store the entire backtrace of matching cases). It may also run faster because of this.

Match a number in a string with letters and numbers

I need to write a Perl regex to match numbers in a word with both letters and numbers.
Example: test123. I want to write a regex that matches only the number part and capture it
I am trying this \S*(\d+)\S* and it captures only the 3 but not 123.
Quantified regex atoms are greedy by default: each matches as much as it can.
Initially, the first \S* matched "test123", but the regex engine had to backtrack to allow \d+ to match. The result is:
 +------------------- Matches "test12"
 |      +------------ Matches "3"
 |      |      +----- Matches ""
 |      |      |
---   -----   ---
\S*   (\d+)   \S*
All you need is:
my ($num) = "test123" =~ /(\d+)/;
It'll try to match at position 0, then position 1, ... until it finds a digit, then it will match as many digits as it can.
The * quantifiers in your regex are greedy; that's why they also "eat" the digits. Exactly as @Marc said, you don't need them.
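The backtracking described above can be observed directly. This sketch uses Python rather than Perl, purely for illustration, and wraps each \S* in its own group so that all three pieces are visible.

```python
import re

# Each part of the original pattern gets its own group for inspection.
greedy = re.search(r"(\S*)(\d+)(\S*)", "test123")
print(greedy.groups())  # ('test12', '3', '')

# Without the surrounding \S*, the engine scans forward to the first digit
# and \d+ then takes the whole run.
plain = re.search(r"(\d+)", "test123")
print(plain.group(1))  # 123
```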
perl -e '$_ = "qwe123qwe"; s/(\d+)/$numbers=$1/e; print $numbers . "\n";'
"something122320" =~ /(\d+)/ will return 122320; this is probably what you're trying to do ;)
\S matches any non-whitespace characters, including digits. You want \d+:
my ($number) = 'test123' =~ /(\d+)/;
Were it a case where a non-digit was required (say before, per your example), you could use the following non-greedy expressions:
/\w+?(\d+)/ or /\S+?(\d+)/
(The second one is more in tune with your \S* specification.)
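A short Python sketch of the non-greedy variants (Python is used only for illustration). The lazy +? still needs at least one leading character, which shows up when the input starts with the digits.

```python
import re

# Lazy prefix: gives up characters as early as possible, so \d+ gets all digits.
lazy_w = re.search(r"\w+?(\d+)", "test123").group(1)
lazy_s = re.search(r"\S+?(\d+)", "test123").group(1)
print(lazy_w, lazy_s)  # 123 123

# With no non-digit prefix, the mandatory first character is taken from the number.
no_prefix = re.search(r"\S+?(\d+)", "123").group(1)
print(no_prefix)  # 23
```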
Your expression satisfies any condition with one or more digits, and that may be what you want. It could be a string of digits surrounded by spaces (" 123 "), because the border between the last space and the first digit satisfies zero-or-more non-space, same thing is true about the border between the '3' and the following space.
Chances are that you don't need any specification and capturing the first digits in the string is enough. But when it's not, it's good to know how to specify expected patterns.
I think parentheses signify capture groups, which is exactly what you don't want. Remove them. You're looking for /\d+/ or /[0-9]+/