How to extract a complex version number using sed?

I use sed on CentOS to extract a version number, and it works fine:
echo "var/opt/test/war/test-webapp-4.1.56.war" | sed -nre 's/^[^0-9]*(([0-9]+\.)*[0-9]+).*/\1/p'
My problem is that I cannot extract the version when it looks like this:
var/opt/test/war/test-webapp-4.1.56-RC1.war
I want to extract 4.1.56-RC1 when the suffix is present.
Any ideas?
EDIT 2
OK, to be clear, take this example with a path:
Sometimes the path contains only a dotted number, like var/opt/test/war/test-webapp-4.1.56.war, and sometimes it contains numbers followed by letters, like var/opt/test/war/test-webapp-4.1.56-RC1.war.
I need to recover either 4.1.56 or 4.1.56-RC1, depending on the version present in the path. sed or grep, no preference.
This seems to work, but the .war is still shown at the end:
echo "var/opt/test/war/test-webapp-4.1.56.war" | egrep -o '[0-9]\S*'

It is a little unclear what you are after, but this seems to be in the general direction.
Given:
$ echo "$e"
/var/opt/test/war/test-webapp-4.1.56-RC1.war
/var/opt/test/war/test-webapp-RC1.war
Version 4.2.4 (test version)
Try:
$ echo "$e" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+-?[[:alnum:]_]*'
4.1.56-RC1
4.2.4

The following matches a first number of one or two digits ({1,2}), a second of one or two digits, and a last of one to four digits. The dots are escaped so that they match only a literal period.
grep -o '[0-9]\{1,2\}\.[0-9]\{1,2\}\.[0-9]\{1,4\}'

Just add an optional (-[a-zA-Z]+[0-9]+)? group to your regex, so that versions without a suffix still match:
echo "var/opt/test/war/test-webapp-4.1.56-RC1.war" | sed -nre 's/^[^0-9]*(([0-9]+\.)*[0-9]+(-[a-zA-Z]+[0-9]+)?).*/\1/p'

What about just using whitespace as the delimiter like
echo "Version 4.2.4-RC1 (test version)" | grep -Po "Version\s+\K\S+"
Here -P tells grep to use Perl-style regexes, -o prints only the matching part, and \K drops everything matched before it from the reported match.

This passes both tests
egrep -o '[0-9]\S*'
Unfortunately, not all greps support -o, but grep in Linux does.

echo "Version 4.2.4 (test version)" | sed 's/Version[[:space:]]*\([^[:space:](]*\).*/\1/'
But as with every extraction, you need to define exactly what you want to extract, rather than everything that could possibly appear (or change your request).
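Following that advice, here is a minimal sketch that states the wanted format explicitly: dot-separated numbers, optionally followed by a hyphen and an alphanumeric tag such as RC1. The function name extract_version is hypothetical, and the definition of "version" is an assumption drawn from the two sample paths in the question.

```shell
# Hypothetical helper: print the first version-like token found in a string.
# Assumed format: N.N[.N...] optionally followed by -TAG (letters/digits).
extract_version() {
  printf '%s\n' "$1" |
    grep -oE '[0-9]+(\.[0-9]+)+(-[[:alnum:]]+)?' |
    head -n 1
}

extract_version "var/opt/test/war/test-webapp-4.1.56.war"      # prints 4.1.56
extract_version "var/opt/test/war/test-webapp-4.1.56-RC1.war"  # prints 4.1.56-RC1
```

Because the tag may not contain a dot, the trailing .war is never part of the match.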

Related

grep regex for return numerical values in between string

I'm running the Linux command
ls /etc/systemd/system | grep -o -E "[0-9]+"
which should return just numerical values. The only problem is that it also returns unwanted numbers from other parts of the results. I want only the digits between - and .service, so for test-blah4-1321.service it should return just 1321. What am I missing here?
Example:
$ ls /etc/systemd/system
test.service test-blah4-1321.service test-blah2.service test-blah5-1387.service test-blah3-1521.service
GNU grep has the -P option for perl-style regexes, and the -o option to print only what matches the pattern. These can be combined using look-around assertions (described under Extended Patterns in the perlre manpage) to remove part of the grep pattern from what is determined to have matched for the purposes of -o.
Source
Applied to your example this would be:
echo test-blah4-1321.service | grep -oP '(?<=-)\d+(?=\.service)'
When I need lookahead or lookbehind tests, I usually switch to a Perl one-liner.
This should do the trick:
echo test-blah4-1321.service | perl -ne 'print "$1\n" if /(?<=-)(\d+)(?=\.service)/'
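Where neither grep -P nor Perl is available, the same extraction can be sketched with sed and a capture group instead of look-around assertions. The function name service_numbers is hypothetical; the sketch assumes the number always sits directly between the last hyphen and .service at the end of the name.

```shell
# Hypothetical helper: read names on stdin and print only the digits that sit
# between a "-" and a trailing ".service"; names without them print nothing.
service_numbers() {
  sed -nE 's/.*-([0-9]+)\.service$/\1/p'
}

printf '%s\n' test.service test-blah4-1321.service | service_numbers   # prints 1321
```

In practice the input would come from a pipeline such as ls /etc/systemd/system | service_numbers.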

Extract version using grep/regex in bash

I have a file that has a line stating
version = "12.0.08-SNAPSHOT"
The word version and quoted strings can occur on multiple lines in that file.
I am looking for a single line bash statement that can output the following string:
12.0.08-SNAPSHOT
The version can have RELEASE tag too instead of SNAPSHOT.
So to summarize, given
version = "12.0.08-SNAPSHOT"
expected output: 12.0.08-SNAPSHOT
And given
version = "12.0.08-RELEASE"
expected output: 12.0.08-RELEASE
The following command prints strings enquoted in version = "...":
grep -Po '\bversion\s*=\s*"\K.*?(?=")' yourFile
-P enables perl regexes, which allow us to use features like \K and so on.
-o only prints matched parts instead of the whole lines.
\b ensures that version starts at a word boundary and we do not match things like abcversion.
\s stands for any kind of whitespace.
\K lets grep forget that it matched the part before \K. The forgotten part will not be printed.
.*? matches as few characters as possible (the matching part will be printed) ...
(?=") ... until we see a ", which won't be included in the match either (this is called a lookahead).
Not all grep implementations support the -P option. Alternatively, you can use perl, as described in this answer:
perl -nle 'print $& if m{\bversion\s*=\s*"\K.*?(?=")}' yourFile
Seems like a job for cut:
$ echo 'version = "12.0.08-SNAPSHOT"' | cut -d'"' -f2
12.0.08-SNAPSHOT
$ echo 'version = "12.0.08-RELEASE"' | cut -d'"' -f2
12.0.08-RELEASE
Portable solution:
$ echo 'version = "12.0.08-RELEASE"' |sed -E 's/.*"(.*)"/\1/g'
12.0.08-RELEASE
or even:
$ perl -pe 's/.*"(.*)"/$1/'
$ awk -F"\"" '{print $2}'
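The cut, sed and awk one-liners above print the quoted part of every line, whether or not the line mentions version. A minimal sketch that keeps only the version lines without relying on grep -P could look like this. The function name version_value is hypothetical, and it assumes version appears at the start of the line, after optional indentation.

```shell
# Hypothetical helper: print the quoted value of each `version = "..."` line.
# Assumption: "version" starts the line (leading whitespace allowed).
version_value() {
  sed -nE 's/^[[:space:]]*version[[:space:]]*=[[:space:]]*"([^"]*)".*/\1/p' "$@"
}

printf '%s\n' 'name = "demo"' 'version = "12.0.08-SNAPSHOT"' | version_value
# prints 12.0.08-SNAPSHOT
```

Passing a file name (version_value yourFile) works as well, since the arguments are handed to sed.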

ERE - adding quantifier to group with inner group and back-reference

I was trying to get words with consecutive repeated letters occurring twice or thrice, but could not find a way to combine a quantifier and a capture group using ERE.
$ grep --version | head -n1
grep (GNU grep) 2.25
$ # consecutive repeated letters occurring twice
$ grep -m5 -xiE '[a-z]*([a-z])\1[a-z]*[a-z]*([a-z])\2[a-z]*' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed
$ # no output for this, why?
$ grep -m5 -xiE '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words
Works with -P though
$ grep -m5 -xiP '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed
$ grep -m5 -xiP '([a-z]*([a-z])\2[a-z]*){3}' /usr/share/dict/words
Chattahoochee
McConnell
Mississippi
Mississippian
Mississippians
Thanks Casimir et Hippolyte for coming up with simpler input and regex to test this behavior
$ echo 'aazbb' | grep -E '(([a-z])\2[a-z]*){2}' || echo 'No match'
aazbb
$ echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]*){2}([a-z])\3[a-z]*' || echo 'No match'
aazbbycc
$ echo 'aazbbycc' | grep -P '(([a-z])\2[a-z]*){3}' || echo 'No match'
aazbbycc
$ # failing case
$ echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]*){3}' || echo 'No match'
No match
Same behavior seen with sed as well
$ sed --version | head -n1
sed (GNU sed) 4.2.2
$ echo 'aazbb' | sed -E '/(([a-z])\2[a-z]*){2}/! s/.*/No match/'
aazbb
$ echo 'aazbbycc' | sed -E '/(([a-z])\2[a-z]*){2}([a-z])\3[a-z]*/! s/.*/No match/'
aazbbycc
$ # failing case
$ echo 'aazbbycc' | sed -E '/(([a-z])\2[a-z]*){3}/! s/.*/No match/'
No match
Related search links; I checked some of them but didn't find anything close to this question:
https://savannah.gnu.org/bugs/?group=grep
http://lists.gnu.org/archive/html/bug-sed/
If this is solved in a newer version of grep or sed, let me know. Also let me know if the issue is seen in non-GNU implementations.
I suppose -E mishandles a quantifier applied to a group that contains a back-reference, which is why this works only with -P.
to match 2 or more consecutive groups of repeated letters:
grep -P '(?:([a-z])\1*([a-z])\2){1}' /usr/share/dict/words
to match 3 or more consecutive groups of repeated letters:
grep -P '(?:([a-z])\1*([a-z])\2){2}' /usr/share/dict/words
Options:
-P, --perl-regexp PATTERN is a Perl regular expression
Update
After searching around, I installed gnugrep32 on my windows box, then ran
some tests:
I read this from an old SO post:
Non-greedy matching is not part of the Extended Regular Expression syntax supported by grep
So we use [a-z]{0,20} as a test instead of [a-z]* or [a-z]*?, where the ? is ignored (wtf?).
Below are incremental tests using the overall (){n} to see how far it will go before it STOPS BACKTRACKING into frames.
Min to work
(([a-z])\2[a-z]{0,20}){1} len = 2 rr
(([a-z])\2[a-z]{0,20}){2} len = 4 rrrr
(([a-z])\2[a-z]{0,20}){3} len = 25 rrrrrrrrrrrrrrrrrrrrrrrrr
(([a-z])\2[a-z]{0,20}){4} len = 47 rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
(([a-z])\2[a-z]{0,20}){5} len = 69 rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
(([a-z])\2[a-z]{0,20}){6} len = 91 rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
From {3} to {6} the delta lengths are equal to 22.
This happens to be the full length of the capture frame expression ([a-z])\2[a-z]{0,20}
when it does not backtrack into previous frames.
The conclusion is that it automatically stops backtracking after 2 frames.
It makes sense: say that out of 20 frames it gets to 16 and finds it cannot match.
Should it go back to frame 1, adjust there, and try it all over again?
Why, yes, it should.
However, it has now consumed so much memory that the bloated pig has to unwind it all.
This could take forever with this archaic utility.
Hey, better cap it at 2 frames.
Of course, there is no test case for (([a-z])\2[a-z]*){3} since the greedy quantifier *
will consume the entire line on the second frame if they are all [a-z] and never even
start a third frame.
$ # no output for this, why?
$ grep -m5 -xiE '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words
Because you are searching for a doubled group (the same text twice) that has at least one double letter inside, something like abbcabbc [(...) = "abbc", 2 times], and not two (possibly similar) groups that each have a double letter inside, like abbcdeef.
with 2 back ref:
$ grep -iE '[a-z]*([a-z])\1{1,}[a-z]*([a-z])\2{1,}[a-z]*'
I filed an issue https://debbugs.gnu.org/cgi/bugreport.cgi?bug=26864 and the manual is now updated to reflect such issues.
From https://www.gnu.org/software/grep/manual/grep.html#Known-Bugs:
Back-references can greatly slow down matching, as they can generate exponentially many matching possibilities that can consume both time and memory to explore. Also, the POSIX specification for back-references is at times unclear. Furthermore, many regular expression implementations have back-reference bugs that can cause programs to return incorrect answers or even crash, and fixing these bugs has often been low-priority: for example, as of 2020 the GNU C library bug database contained back-reference bugs 52, 10844, 11053, 24269 and 25322, with little sign of forthcoming fixes. Luckily, back-references are rarely useful and it should be little trouble to avoid them in practical applications.
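One way to sidestep a quantified group that contains a back-reference is to let grep -o list each double letter and count the results. This is only a sketch of that workaround, not the fix referred to in the bug report; the function name has_doubles is hypothetical, and it handles lowercase input only.

```shell
# Hypothetical helper: succeed if WORD contains at least N double letters.
# The back-reference is used once, outside any repeated group.
has_doubles() {
  word=$1 n=$2
  # grep exits non-zero when nothing matches, so guard it with || true.
  count=$(printf '%s\n' "$word" | { grep -oE '([a-z])\1' || true; } | wc -l)
  [ "$count" -ge "$n" ]
}

has_doubles aazbbycc 3 && echo match   # prints match
```

Since grep -o reports non-overlapping matches from left to right, aaaa counts as two doubles and aaa as one.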

How to use sed to grab regular expression

I'd like to grab the digits in a string like so :
"sample_2341-43-11.txt" to 2341-43-11
And so I tried the following command:
echo "sample_2341-43-11.txt" | sed -n -r 's|[0-9]{4}\-[0-9]{2}\-[0-9]{2}|\1|p'
I saw this answer, which is where I got the idea.
Use sed to grab a string, but it doesn't work on my machine:
it gives an error "illegal option -r".
it doesn't like the \1, either.
I'm using sed on MacOSX yosemite.
Is this the easiest way to extract that information from the file name?
You need to capture the part you want in a group and match the rest of the line so that the substitution removes it. Also, the - does not need to be escaped. With -n, sed prints nothing unless the s command carries the p flag, so drop -n here. On macOS, use -E rather than -r:
echo "sample_2341-43-11.txt" | sed -E 's/^.*([0-9]{4}-[0-9]{2}-[0-9]{2}).*$/\1/'
The -r option is not supported by the Mac (BSD) version of sed, which uses -E for extended regular expressions.
You can also use grep instead:
echo "sample_2341-43-11.txt" | grep -Eo "[0-9]+(-[0-9]+)+"
OUTPUT
2341-43-11
echo "one1sample_2341-43-11.txt" \
| sed 's/[^[:digit:]-]\{1,\}/ /g;s/ \{1,\}/ /g;s/^ //;s/ $//'
1 2341-43-11
This extracts all runs of digits and hyphens (so something like --12 is also allowed, but that can easily be handled).
It is POSIX compliant.
All numbers found on a line are printed on the same line, separated by a space (this could be changed to a newline if wanted).
You can also try it this way:
sed 's/[^_]\+_\([^.]\+\).*/\1/' <<< sample_2341-43-11.txt
Output:
2341-43-11
Explanation:
[^_]\+ - Match everything up to the _ (sample)
\([^.]\+\) - Match everything up to the . and capture it (2341-43-11)
.* - Discard the remaining characters (.txt)
You can go with what the poster above said. Alternatively, the
pattern "\d+-\d+-\d+" would match what you are looking for. See the demo here:
https://regex101.com/r/kO2cZ1/3
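For the original macOS question, a minimal sketch that should behave the same with both BSD and GNU sed uses -E together with -n and the p flag, so that lines without a match print nothing. The function name date_like is hypothetical.

```shell
# Hypothetical helper: for each input line, print a NNNN-NN-NN run if present.
# -E enables extended regexes on both BSD (macOS) and GNU sed.
date_like() {
  sed -nE 's/^.*([0-9]{4}-[0-9]{2}-[0-9]{2}).*$/\1/p'
}

echo "sample_2341-43-11.txt" | date_like   # prints 2341-43-11
```

Because the leading .* is greedy, a line containing several such runs yields the last one.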

Matching arbitrary number of digits using grep regex

I've got a file with lines that look like the following:
data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later
What I am looking to do is use regex to match any line that starts with data and ends with later AND has numbers in between. Here is what I've concocted so far:
^[D,d]ata[0-9]*later$
However the output includes all datalater lines. I suppose I could pipe the output and grep -v datalater, but I feel like a single expression should do the trick.
Use + instead of *.
+ matches one or more of the preceding.
* matches zero or more.
^[Dd]ata[0-9]+later$
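In grep's default (basic) syntax you need to escape the +. We can also write \d, a shorthand that matches a single digit; note that it is a Perl-style escape, so it is only reliable with grep -P, and [0-9] is the portable spelling.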
^[Dd]ata\d\+later$
In your example file you also have the line:
datafhj893724897290384later
This currently will not be matched due to there being letters in-between data and the numbers. We can fix this by adding a [^0-9]* to match anything after data until the digits.
Our final command will be:
grep '^[Dd]ata[^0-9]*\d\+later$' filename
Using Cygwin, the above commands didn't work. I had to modify the commands given above to get the desired results.
$ cat > file.txt <<EOL
> data
> datalater
> 983290842
> Data387428later
> datafhj893724897290384later
> 4329804928later
> EOL
I always like to make sure my file has what I expect it to have:
$ cat file.txt
data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later
$
I needed to run Perl-style expressions with the -P flag. This meant I couldn't use the [^0-9]* as written, whose necessity @Tom_Cammann aptly pointed out. Instead, I used .*, which matches any sequence of characters and backtracks as needed so that the rest of the pattern can still match. Here are my command and output.
$ grep -P '^[Dd]ata.*\d+later$' file.txt
Data387428later
datafhj893724897290384later
$
I wish I could give a better explanation of WHY Perl expressions are needed, but I just know that Cygwin's grep works a bit differently.
System Info
$ uname -a
CYGWIN_NT-10.0 A-1052207 2.5.2(0.297/5/3) 2016-06-23 14:29 x86_64 Cygwin
My Results from the previous answers
$ grep '^[Dd]ata[^0-9]*\d\+later$' file2.txt
$ grep '^[Dd]ata\d+later$' file2.txt
$ grep -P '^[Dd]ata[^0-9]*\d\+later$' file2.txt
$ grep -P '^[Dd]ata\d+later$' file2.txt
Data387428later
$
You're matching zero or more digits with the * quantifier. Try
^[Dd]ata\d+later$
instead. You were also matching a comma at the beginning of the string (e.g. ",ata1234later"). And \d is a shortcut for any digit character. So I changed those as well.
You should put a "+" (which means one or several) instead of "*" (which means zero, one or several).
The "+" syntax only works for extended-regexp, not standard grep.
At least, that's my experience on RHEL.
To use extended-regexp, run egrep or pass "-E" / "--extended-regexp"
Examples...
Standard grep
echo abc123n1 | grep "abc[0-9]+n1"
<no output>
egrep
echo abc123n1 | egrep "abc[0-9]+n1"
abc123n1
grep with -E
echo abc123n1 | grep -E "abc[0-9]+n1"
abc123n1
HTH
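To make the difference concrete, here is a small sketch showing the same "one or more digits" test in both syntaxes. In POSIX basic syntax the portable spelling is the interval \{1,\}; the escaped \+ is a GNU extension and is therefore left out here.

```shell
# Basic regex (plain grep): "+" is a literal plus sign, so spell
# "one or more" as the interval \{1,\}.
echo abc123n1 | grep 'abc[0-9]\{1,\}n1'    # prints abc123n1

# Extended regex (grep -E): "+" is the "one or more" quantifier.
echo abc123n1 | grep -E 'abc[0-9]+n1'      # prints abc123n1
```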
grep -Eio "^(data)[0-9]+(later)$"
^[dD]ata=^d later$=r$
🎯 MOTIVATION
The other answers don't work on all systems.
🗒️ REQUISITES
grep
The option: --extended-regexp
Character groups, aka: [:group:]
Matching one or more of the preceding, aka: +
Optionally setting as starting or ending: ^whatever$
📟 COMMAND
grep --extended-regexp "[[:group:]]+"
🗂️ GROUPS
alnum
alpha
blank
cntrl
digit
graph
lower
print
punct
space
upper
xdigit
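Filling in the template above for the question's data, a minimal sketch with the digit group could look like this. The function name data_later_lines is hypothetical, and the [^[:digit:]]* part is borrowed from an earlier answer so that letters between data and the number are tolerated.

```shell
# Hypothetical wrapper: keep lines that start with data/Data, end with later,
# and have at least one digit immediately before "later".
data_later_lines() {
  grep --extended-regexp '^[Dd]ata[^[:digit:]]*[[:digit:]]+later$' "$@"
}

printf '%s\n' data datalater 983290842 Data387428later \
  datafhj893724897290384later 4329804928later | data_later_lines
# prints:
# Data387428later
# datafhj893724897290384later
```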