Capture group from regex in bash - regex

I have the following string /path/to/my-jar-1.0.jar for which I am trying to write a bash regex to pull out my-jar.
Now I believe the following regex would work: ([^\/]*?)-\d but I don't know how to get bash to run it.
The following: echo '/path/to/my-jar-1.0.jar' | grep -Po '([^\/]*?)-\d' captures my-jar-1

In BASH you can do:
s='/path/to/my-jar-1.0.jar'
[[ $s =~ .*/([^/[:digit:]]+)-[[:digit:]] ]] && echo "${BASH_REMATCH[1]}"
my-jar
Here "${BASH_REMATCH[1]}" prints capture group #1, i.e. whatever the expression inside the first (...) matched.
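A minimal sketch of the same match wrapped in a function, assuming bash. The name jar_name is made up for illustration. The pattern is kept in a variable, which avoids quoting questions on the right-hand side of =~, and the function returns non-zero when nothing matches:

```shell
#!/usr/bin/env bash
# jar_name is a hypothetical helper, not part of the answer above.
jar_name() {
    # Same pattern as above: last path component, up to "-<digit>".
    local re='.*/([^/[:digit:]]+)-[[:digit:]]'
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        return 1    # no match: print nothing and signal failure
    fi
}

jar_name '/path/to/my-jar-1.0.jar'              # prints: my-jar
jar_name 'no-version-here' || echo 'no match'   # prints: no match
```

Because the group excludes digits, a name that itself contains a digit before the version would not match with this pattern.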

You can do this as well with shell prefix and suffix removal:
$ path=/path/to/my-jar-1.0.jar
# Remove the longest prefix ending with a slash
$ base="${path##*/}"
# Remove the longest suffix starting with a dash followed by a digit
$ base="${base%%-[0-9]*}"
$ echo "$base"
my-jar
Although it's a little annoying to have to do the transform in two steps, it has the advantage of using only POSIX features, so it will work with any compliant shell.
Note: The order is important, because the basename cannot contain a slash, but a path component could contain a dash. So you need to remove the path components first.
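To see why the order matters, here is a small sketch with a made-up path whose directory name also contains a dash followed by a digit. It uses only the two POSIX expansions from above:

```shell
# Hypothetical path: the directory "app-2" also contains "-<digit>".
path='/opt/app-2/my-jar-1.0.jar'

# Correct order: strip the directories first, then the version suffix.
base="${path##*/}"
right="${base%%-[0-9]*}"

# Reversed order: the suffix removal already cuts at "-2" in the directory name.
tmp="${path%%-[0-9]*}"
wrong="${tmp##*/}"

echo "$right"   # prints: my-jar
echo "$wrong"   # prints: app
```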

grep -o doesn't recognize "capture groups" I think, just the entire match. That said, with Perl regexps (-P) you have the "lookahead" option to exclude the -\d from the match:
echo '/path/to/my-jar-1.0.jar' | grep -Po '[^/]*(?=-\d)'
Some reference material on lookahead/lookbehind:
http://www.perlmonks.org/?node_id=518444

Related

Linux shell extracting substring between matching patterns

Let's say I have a string poskek|gfgfd|XLSE|a1768|d234|uijjk and I want to extract just the LSE part.
I only know that there will be |X directly before LSE, and | directly after the part I am interested in LSE.
The other answer using sed should work, but I always find sed to be a bit awkward for regex selection, as it's really intended for replacement (hence why either side of the pattern needs to be flanked with .* and the part you actually want needs to be in parentheses). Here's a solution using grep:
grep -Po '\|X\K[^|]+'
-P signals grep to use Perl's regex engine which is more advanced
-o only prints the matching part of the line
\|X match a literal vertical bar and a capital X
\K forget what has currently been matched (do not include it in the final output)
[^|]+ one or more characters other than vertical bars
As a pure bash solution, please try:
str='poskek|gfgfd|XLSE|a1768|d234|uijjk'
ext=${str#*|X}
ext=${ext%%|*}
echo "$ext"
If regex matching is available (bash 3.0 or later), the following also works:
if [[ $str =~ .*\|X([^|]+) ]]; then
echo "${BASH_REMATCH[1]}"
fi
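One difference between the two variants is worth knowing. The sketch below uses a made-up input with two |X markers and assumes bash: the parameter expansion picks the field after the first |X, while the regex, because of its greedy leading .*, picks the field after the last one.

```shell
#!/usr/bin/env bash
str='a|Xone|b|Xtwo|c'   # hypothetical input with two "|X" markers

# Parameter expansion: "#" removes the shortest prefix, so the first "|X" wins.
first=${str#*|X}
first=${first%%|*}

# Regex: the leading ".*" is greedy, so the last "|X" wins.
re='.*\|X([^|]+)'
last=''
if [[ $str =~ $re ]]; then
    last=${BASH_REMATCH[1]}
fi

echo "$first"   # prints: one
echo "$last"    # prints: two
```

With a single |X in the input, as in the question, both give the same result.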
echo 'poskek|gfgfd|XLSE|a1768|d234|uijjk' | sed -n 's/.*|X\([^|]\+\).*/\1/p'
That ought to do the trick.
Explained:
sed -n will not print anything unless specified
s/ - search and replace
.*|X - match everything up to and including |X
\([^|]\+\) - capture multiple (at least one) character that isn't a |
.* - match the rest of the text (just to "eat it up")
/\1/p - Replace all matched text with the first capture, and print
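The \+ in that command is a GNU extension to basic regular expressions. A sketch of a more portable spelling, using the POSIX interval \{1,\} instead; the function name extract_after_x is made up for illustration:

```shell
# extract_after_x is a hypothetical helper name.
extract_after_x() {
    # \{1,\} is the POSIX BRE spelling of "one or more" (GNU sed also accepts \+).
    printf '%s\n' "$1" | sed -n 's/.*|X\([^|]\{1,\}\).*/\1/p'
}

extract_after_x 'poskek|gfgfd|XLSE|a1768|d234|uijjk'   # prints: LSE
extract_after_x 'no marker here'                        # prints nothing (-n, and no substitution happened)
```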
For this particular case, you could do the rather unconventional:
awk '$1=="X"{$1="";print}' FS= OFS= RS=\|
try this
echo 'poskek|gfgfd|XLSE|a1768|d234|uijjk' |
awk -F "|" '{for(i=1;i<=NF;++i) printf "%s", (substr($i,1,1)=="X"?substr($i,2):"")}'
where
-F is the field separator => '|'
NF is number of fields

Excluding the first 3 characters of a string using regex

Given any string in bash, e.g flaccid, I want to match all characters in the string but the first 3 (in this case I want to exclude "fla" and match only "ccid"). The regex also needs to work in sed.
I have tried positive look behind and the following regex expressions (as well as various other unsuccessful ones):
^.{3}+([a-z,A-Z]+)
sed -r 's/(?<=^....)(.[A-Z]*)/,/g'
Google hasn't been very helpful as it only produces results like "get first 3 characters .."
Thanks in advance!
If you want to get all characters but the first 3 from a string, you can use cut:
str="flaccid"
cut -c 4- <<< "$str"
or bash substring expansion:
str="flaccid"
echo "${str:3}"
That will strip the first 3 characters out of your string.
You may just use a capturing group within an expression like ^.{3}(.*) / ^.{3}([a-zA-Z]+) and grab the ${BASH_REMATCH[1]} contents:
#!/bin/bash
text="flaccid"
rx="^.{3}(.*)"
if [[ $text =~ $rx ]]; then
echo ${BASH_REMATCH[1]};
fi
See online Bash demo
In sed, you can also use capturing groups / backreferences to get what you need. To just remove the first 3 chars, you may use a simple:
echo "flaccid" | sed 's/.\{3\}//'
See this regex demo. The .\{3\} matches exactly 3 characters of any kind and removes them from the beginning only, since the g modifier is not used.
Now, both the solutions above will output ccid, dropping only the first 3 chars.
Using sed, just remove them
echo string | sed 's/^...//g'
How is it that no-one has named the most simple and portable solution:
shell "Parameter expansions":
str="flaccid"
echo "${str#???}"
For a regex (bash):
$ str="flaccid"
$ regex='^.{3}(.*)$'
$ [[ $str =~ $regex ]] && echo "${BASH_REMATCH[1]}"
ccid
Same regex in sed:
$ echo "flaccid" | sed -E "s/$regex/\1/"
ccid
Or sed (Basic Regex):
$ echo "flaccid" | sed 's/^.\{3\}\(.*\)$/\1/'
ccid
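The approaches above agree on flaccid, but they behave differently on strings shorter than 3 characters. A short sketch, assuming bash (the ${var:offset} form is not POSIX); the function names are made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper names, used only for this comparison.
drop3_substring() { printf '%s\n' "${1:3}"; }    # substring expansion
drop3_pattern()   { printf '%s\n' "${1#???}"; }  # prefix pattern removal

drop3_substring flaccid   # prints: ccid
drop3_pattern   flaccid   # prints: ccid

# On a 2-character string the two differ:
drop3_substring ab        # prints an empty line (offset is past the end)
drop3_pattern   ab        # prints: ab (??? needs 3 characters, so nothing is removed)
```

Which behaviour is right depends on what should happen to short input, so it is worth picking deliberately.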

Retrieve value of attribute in bash

I have a list of lines:
<some_random_text="someval" my_val_="0.4" some_random_text_1="someval_">
<some_random_text="someval" my_val_="0.8" some_random_text_1="someval_">
<some_random_text="someval" my_val_="1.2" some_random_text_1="someval_">
and so on.
From each line, I want to return the numeric value given after my_val_. How can I do this in bash?
Within this very rigid structure, what you want to do is quite easy using sed:
sed 's/.*my_val_="\([0-9.]\{1,\}\)".*/\1/' file
or using extended regular expressions:
sed -r 's/.*my_val_="([0-9.]+)".*/\1/' file
This captures the part you're interested in (the digits and dots between the quotes) and uses them to replace the contents of the line.
As mentioned in the comments (thanks), the switch to enable extended regular expressions differs between versions of sed. Out of habit, I tend to use -r but some implementations (such as BSD sed on OSX) work with -E instead. Others work with either -r or -E but neither option is defined by the standard.
This could also be done in native bash (although I wouldn't recommend it...):
re='my_val_="([0-9.]+)"'
while read -r line; do
[[ $line =~ $re ]] && echo "${BASH_REMATCH[1]}"
done < file
=~ is the regex match operator. The captured digits and dots are stored in element 1 of the special array BASH_REMATCH.
The sed and bash approaches are subtly different, as the sed version will print all lines in the file, even if they don't match the pattern. If this is a problem, you can add the -n switch and a p at the end of the command to print only the matching lines:
sed -nr 's/.*my_val_="([0-9.]+)".*/\1/p' file
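A small sketch of that difference, using made-up input in which the middle line has no my_val_ attribute. It uses the basic-regex form so that neither -r nor -E is needed:

```shell
# Sample input; the middle line is a made-up non-matching line.
input='<a="x" my_val_="0.4" b="y">
<no value on this line>
<a="x" my_val_="1.2" b="y">'

# Without -n/p, the non-matching line passes through unchanged.
all=$(printf '%s\n' "$input" | sed 's/.*my_val_="\([0-9.]\{1,\}\)".*/\1/')

# With -n and the p flag, only lines where the substitution happened are printed.
only=$(printf '%s\n' "$input" | sed -n 's/.*my_val_="\([0-9.]\{1,\}\)".*/\1/p')

printf '%s\n' "$all"    # prints three lines: 0.4, <no value on this line>, 1.2
printf '%s\n' "$only"   # prints two lines: 0.4, 1.2
```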
With grep:
grep -oP 'my_val_="\K[^"]*' filename
-o so that grep prints only the match, -P so that Perl-compatible regexes are used.
The \K in the regex removes from the match everything that was matched by the part of the regex that came before it; this has the effect of a lookbehind: only non-quote characters that come directly after my_val_=" are matched.

regular expression extract string after a colon in bash

I need to extract the string after the : in an example below:
package:project.abc.def
Where i would get project.abc.def as a result.
I am attempting this in bash and i believe i have a regular expression that will work :([^:]*)$.
In my bash script i have package:project.abc.def as a variable called apk. Now how do i assign the same variable the substring found with the regular expression?
Where the result from package:project.abc.def would be in the apk variable. And package:project.abc.def is initially in the apk variable?
Thanks!
There is no need for a regex here, just a simple prefix substitution:
$ apk="package:project.abc.def"
$ apk=${apk##package:}
$ echo "$apk"
project.abc.def
The ## syntax is one of bash's parameter expansions: it removes the longest matching prefix (a single # removes the shortest). Instead of #, % can be used to trim the end. See this section of the bash man page for the details.
Some alternatives:
$ apk=$(echo $apk | awk -F'package:' '{print $2}')
$ apk=$(echo $apk | sed 's/^package://')
$ apk=$(echo $apk | cut -d':' -f2)
$ string="package:project.abc.def"
$ apk=$(echo $string | sed 's/.*\://')
".*:" matches everything up to and including the last ':', which is then removed from the string.
Capture groups from regular expressions can be found in the BASH_REMATCH array.
[[ $str =~ :([^:]*)$ ]]
# 0 is the substring that matches the entire regex
# n >= 1: the nth parenthesized group
apk=${BASH_REMATCH[1]}
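Put together for the variable from the question, this could look like the following sketch (assuming bash). The variable is only reassigned when the match succeeds, so a value without a colon is left untouched:

```shell
#!/usr/bin/env bash
apk='package:project.abc.def'
re=':([^:]*)$'            # pattern from the question, kept in a variable

if [[ $apk =~ $re ]]; then
    whole=${BASH_REMATCH[0]}   # ":project.abc.def" (entire match)
    apk=${BASH_REMATCH[1]}     # "project.abc.def"  (first group)
fi

echo "$apk"   # prints: project.abc.def
```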

grep on unix / linux: how to replace or capture text?

So I'm pretty good with regular expressions, but I'm having some trouble with them on unix. Here are two things I'd love to know how to do:
1) Replace all text except letters, numbers, and underscore
In PHP I'd do this: (works great)
preg_replace('#[^a-zA-Z0-9_]#','',$text).
In bash I tried this (with limited success); it seems like it doesn't allow you to use the full set of regex:
text="my #1 example!"
${text/[^a-zA-Z0-9_]/''}
I tried it with sed but it still seems to have problems with the full regex set:
echo "my #1 example!" | sed s/[^a-zA-Z0-9\_]//
I'm sure there is a way to do it with grep, too, but it was breaking it into multiple lines when i tried:
echo abc\!\#\#\$\%\^\&\*\(222 | grep -Eos '[a-zA-Z0-9\_]+'
And finally I also tried using expr but it seemed like that had really limited support for extended regex...
2) Capture (multiple) parts of text
In PHP I could just do something like this:
preg_match('#(word1).*(word2)#',$text,$matches);
I'm not sure how that would be possible in *nix...
Part 1
You are almost there with sed: just add the g modifier so that the replacement happens globally. Without the g, the replacement happens only once.
$ echo "my #1 example!" | sed s/[^a-zA-Z0-9\_]//g
my1example
$
You made the same mistake with your bash pattern replacement: the replacement is not global:
$ text="my #1 example!"
# non-global replacement. Only the first matching character (the space) is deleted.
$ echo ${text/[^a-zA-Z0-9_]/''}
my#1 example!
# global replacement by adding an additional /
$ echo ${text//[^a-zA-Z0-9_]/''}
my1example
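A minimal sketch of the global form wrapped in a function, assuming bash. The name sanitize is made up for illustration. Note that the pattern in ${var//pattern/} is a shell glob, not a regex: the bracket expression matches one character at a time and there are no quantifiers such as +, which is fine here because // already repeats the replacement.

```shell
#!/usr/bin/env bash
# sanitize is a hypothetical helper name.
sanitize() {
    # Delete every character that is not an ASCII letter, digit or underscore.
    printf '%s\n' "${1//[^a-zA-Z0-9_]/}"
}

sanitize 'my #1 example!'   # prints: my1example
sanitize 'a-b c_d!'         # prints: abc_d
```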
Part 2
Capturing works the same in sed as it does in PHP's regex: enclosing part of the pattern in parentheses triggers capturing:
# swap foo and bar's number using capturing and back reference.
$ echo 'foo1 bar2' | sed -r 's/foo([0-9]+) bar([0-9]+)/foo\2 bar\1/'
foo2 bar1
$
As an alternative to codaddict's nice answer using sed, you could also use tr for the first part of your question.
echo "my #1 _ example!" | tr -d -C '[:alnum:]_'
I've also made use of the [:alnum:] character class, just to show another option.
what do you mean you can't use the regex syntax for bash?
$ text="my #1 example!"
$ echo ${text//[^a-zA-Z0-9_]/}
my1example
you have to use // for more than 1 replacement.
for your 2nd question, with bash 3.2 or later (leave the pattern unquoted: from 3.2 on, a quoted pattern is matched as a literal string)
$ [[ $text =~ (my).*(example) ]]
$ echo ${BASH_REMATCH[1]}
my
$ echo ${BASH_REMATCH[2]}
example
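A sketch of the same two-group capture with the pattern stored in a variable, assuming bash. This is a common way to sidestep quoting questions on the right-hand side of =~. The variable names first and second are made up for illustration:

```shell
#!/usr/bin/env bash
text='my #1 example!'
re='(my).*(example)'    # kept in a variable and expanded unquoted below

first='' second=''
if [[ $text =~ $re ]]; then
    first=${BASH_REMATCH[1]}
    second=${BASH_REMATCH[2]}
fi

echo "$first"    # prints: my
echo "$second"   # prints: example
```

This roughly mirrors preg_match filling $matches: BASH_REMATCH[0] holds the whole match and the numbered elements hold the groups.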