bash - Extract part of string - regex

I have a string something like this
xsd:import schemaLocation="AppointmentManagementService.xsd6.xsd" namespace=
I want to extract the following from it :
AppointmentManagementService.xsd6.xsd
I have tried using regex, bash and sed with no success. Can someone please help me out with this?
The regex that I used was this :
/AppointmentManagementService.xsd\d{1,2}.xsd/g

Your string is:
nampt@nampt-desktop:$ cat 1
xsd:import schemaLocation="AppointmentManagementService.xsd6.xsd" namespace=
Try with awk:
cat 1 | awk -F "\"" '{print $2}'
Output:
AppointmentManagementService.xsd6.xsd

sed doesn't recognize \d, use [0-9] or [[:digit:]] instead:
sed 's/^.*schemaLocation="\([^"]\+[[:digit:]]\{1,2\}\.xsd\)".*$/\1/g'
## or
sed 's/^.*schemaLocation="\([^"]\+[0-9]\{1,2\}\.xsd\)".*$/\1/g'

You can use bash native regex matching:
$ in='xsd:import schemaLocation="AppointmentManagementService.xsd6.xsd" namespace='
$ if [[ $in =~ \"(.+)\" ]]; then echo "${BASH_REMATCH[1]}"; fi
Output:
AppointmentManagementService.xsd6.xsd
Based on your example, if you want to require at least one and at most two digits in the .xsd... component, you can tighten the regex (the dots are escaped so that they match only a literal dot):
$ if [[ $in =~ \"(AppointmentManagementService\.xsd[0-9]{1,2}\.xsd)\" ]]; then echo "${BASH_REMATCH[1]}"; fi
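The greedy \"(.+)\" pattern above captures up to the last double quote on the line, which matters once the namespace attribute has a quoted value of its own. A minimal sketch of a stricter variant, assuming the value never contains a double quote; the function name extract_schema_location and the namespace value urn:example are made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of the schemaLocation attribute.
extract_schema_location() {
  local line=$1
  # Keeping the pattern in a variable avoids quoting surprises inside [[ ]].
  # [^"]+ stops at the first closing quote instead of the last one.
  local re='schemaLocation="([^"]+)"'
  [[ $line =~ $re ]] || return 1
  printf '%s\n' "${BASH_REMATCH[1]}"
}

# Assumed input: the namespace value is invented for this example.
in='xsd:import schemaLocation="AppointmentManagementService.xsd6.xsd" namespace="urn:example"'
extract_schema_location "$in"
```

The function returns a non-zero status when the attribute is missing, so it can be used directly in an if condition.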

Using PCRE in GNU grep:
grep -oP 'schemaLocation="\K.*?(?=")'
This prints the text matched between schemaLocation=" and the very next occurrence of ".
Reference:
https://unix.stackexchange.com/a/13472/109046

We can also use the cut command for this purpose:
[root@code]# echo "xsd:import schemaLocation=\"AppointmentManagementService.xsd6.xsd\" namespace=" | cut -d\" -f 2
AppointmentManagementService.xsd6.xsd

s='xsd:import schemaLocation="AppointmentManagementService.xsd6.xsd" namespace='
echo $s | sed 's/.*schemaLocation="\(.*\)" namespace=.*/\1/'

Related

How to use regex capturing group in bash correctly?

I've loaded some strings into variable "result". The strings look like this:
school/proj_1/file1.txt
school/proj_1/file2.txt
school/proj_1/file3.txt
I try to get only the name after the last slash, so file1.txt, file2.txt and file3.txt is the desirable result for me. I use this piece of code
for i in $result
do
grep "school/proj_1/(.*)" $i
done
but it doesn't work. I feel that the regex would work in Python with the capturing group I created, but I can't really wrap my head around how to use capturing groups in bash, or whether it is possible at all.
I'm sorry if it's a dumb question, I'm very new to scripting in bash.
You may use a simple approach with a string manipulation operation:
echo "${i##*/}"
${string##substring}
Deletes longest match of $substring from front of $string.
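For reference, a short sketch of the four prefix/suffix removal operators applied to one of the paths from the question. The variable names are arbitrary:

```shell
#!/usr/bin/env bash
path='school/proj_1/file1.txt'

base=${path##*/}   # longest prefix matching */ removed  -> file1.txt
rest=${path#*/}    # shortest prefix matching */ removed -> proj_1/file1.txt
dir=${path%/*}     # shortest suffix matching /* removed -> school/proj_1
top=${path%%/*}    # longest suffix matching /* removed  -> school

printf '%s\n' "$base" "$rest" "$dir" "$top"
```

The patterns here are shell globs, not regular expressions, so * matches any run of characters.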
Or using a regex in Bash, you may get the capturing groups like
result=("school/proj_1/file1.txt" "school/proj_1/file2.txt" "school/proj_1/file3.txt")
rx='school/proj_1/(.*)'
for i in "${result[@]}"; do
if [[ "$i" =~ $rx ]]; then
echo "${BASH_REMATCH[1]}"
fi
done
Here, ${BASH_REMATCH[1]} holds the contents of capturing group #1.
Try this :
variable declaration :
$ result="school/proj_1/file1.txt
school/proj_1/file2.txt
school/proj_1/file3.txt"
Commands :
(all as one-liners)
$ grep -oP "school/proj_1/\K.*" <<< "$result"
or
$ awk -F'/' '{print $NF}' <<< "$result"
or
$ sed 's|.*/||' <<< "$result"
or if number of sub dirs are fixed :
$ cut -d'/' -f3 <<< "$result"
Output :
file1.txt
file2.txt
file3.txt

Why \d\+ or \d+ is not equal to \d* here?

Bash on Debian.
I want to match the port number at the end of the line.
s="2017-04-17 08:16:14 INFO connecting lh3.googleusercontent.com:443 from 111.111.111.111:26215"
echo $s | sed 's/\(.*\):\(\d*\)/\2/'
26215
Let's match it with \d\+ or \d+ in sed.
echo $s | sed 's/\(.*\):\(\d\+\)/\2/'
echo $s | sed 's/\(.*\):\(\d+\)/\2/'
All of them get the whole string as output.
2017-04-17 08:16:14 INFO connecting lh3.googleusercontent.com:443 from 111.111.111.111:26215
None of them can match the port number at the end. Why?
There is an easier sed pattern to use:
$ echo "$s" | sed -nE 's/.*:([^:]*)/\1/p'
26215
As stated in comments, regular sed does not have Perl metacharacters. You need to use the POSIX character class [[:digit:]].
Explanation:
sed -nE 's/.*:([^:]*)/\1/p'
     ^                       only print if there is a match (-n suppresses default output)
      ^                      use ERE, so the parens need no escaping
           ^                 match greedily up to the rightmost :
              ^     ^        capture group (unescaped thanks to -E)
               ^             all characters except :
                         ^   print if the substitution matched
Or, if you want to be more specific you want only digits:
$ echo "$s" | sed -nE 's/.*:([[:digit:]]+$)/\1/p'
26215
Note + to make sure there is at least one digit and $ to match only at the end of the line.
With -E, sed uses ERE, the same syntax as egrep.
\d is a PCRE extension not present in BRE or ERE syntax (as used by standard UNIX tools).
In this particular case, there's no need to use any tools not built into bash for this purpose at all:
s="2017-04-17 08:16:14 INFO connecting lh3.googleusercontent.com:443 from 111.111.111.111:26215"
echo "Port is ${s##*:}"
This is a parameter expansion; when dealing with small amounts of data, such built-in capabilities are much more efficient than running external tools.
There's also native ERE support built into the shell, as follows:
re=':([[:digit:]]+)$'
[[ $s =~ $re ]] && echo "Port is ${BASH_REMATCH[1]}"
BashFAQ #100 also goes into detail on bash string manipulation.
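The two built-in approaches differ in one respect worth seeing side by side: the regex match can fail, while the expansion never does. A minimal sketch, assuming the port is always the last colon-separated field; port_of is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the trailing port of a log line, or fail.
port_of() {
  local re=':([[:digit:]]+)$'   # ERE: a colon, then digits up to end of line
  [[ $1 =~ $re ]] || return 1
  printf '%s\n' "${BASH_REMATCH[1]}"
}

s="2017-04-17 08:16:14 INFO connecting lh3.googleusercontent.com:443 from 111.111.111.111:26215"
port_of "$s"                                  # prints 26215
port_of "no port here" || echo "no match"     # the regex rejects this line

# The expansion alone performs no validation: without a colon,
# nothing is stripped and the whole string comes back unchanged.
t="no port here"
echo "${t##*:}"
```

So ${s##*:} is the cheaper choice when the input format is known, and the regex form is safer when lines without a port may appear.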
All you need is this:
echo ${s##*:}
Learn your shell string operators.
s="2017-04-17 08:16:14 INFO connecting lh3.googleusercontent.com:443 from 111.111.111.111:26215"
1. grep
echo "$s" | grep -Po '\d+$'
2. ack
echo "$s" | ack -o '\d+$'
3. sed
echo "$s" | sed 's/.*://'
4. awk
echo "$s" | awk -F: '{print $NF}'
Self-answer by OP moved from question to community wiki answer, per consensus on meta:
There is no expression \d to stand for numbers in sed.
To get it with awk, simply use:
echo $s |awk -F: '{print $NF}'
26215

Regex to get number after last underscore

I am having trouble coming up with the regex command that will get me Y in the following string: X_X_X_Y. BTW, Y is an integer, but I can validate that afterwards.
You could use shell parameter expansion:
$ s="X_X_X_Y"
$ echo "${s##*_}"
Y
Using sed:
$ sed 's/.*_//' <<< "$s"
Y
Using grep:
$ grep -oP '.*_\K.*' <<< "$s"
Y
This regex will work as long as the part you're matching is an integer:
[^_]+_[^_]+_[^_]+_(\d+)
As an alternative, if you are always tokenizing on the _ character, you can skip the regex and use awk:
echo 'X_X_X_Y' | awk -F_ '{print $NF}'
Using BASH regex:
s='X_X_X_10'
[[ "$s" =~ [^_]+$ ]] && echo "${BASH_REMATCH[0]}"
10
This will print an integer at the end of the string after an underscore.
perl -e '"0_0_0_1" =~ /_([0-9]+)$/; print $1,"\n" if defined $1'
1
This might work for you:
sed 's/.*_\([0-9][0-9]*\)/\1/' file
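Since the question mentions validating that Y is an integer, extraction and validation can be combined in one bash regex. A minimal sketch, assuming "integer" means one or more decimal digits with no sign; last_int is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the digits after the last underscore,
# or return non-zero if the final field is not all digits.
last_int() {
  local re='_([0-9]+)$'
  [[ $1 =~ $re ]] || return 1
  printf '%s\n' "${BASH_REMATCH[1]}"
}

last_int 'X_X_X_10'                              # prints 10
last_int 'X_X_X_Y' || echo 'no trailing integer'
```

Anchoring with $ is what ties the match to the last field: digits followed by any other character do not qualify.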

How to extract a number from a string using grep and regex

I cat a file and pipe it to grep with a regular expression like this:
cat /tmp/tmp_file | grep "toto.titi\[[0-9]\+\].tata=55"
The command displays the following output:
toto.titi[12].tata=55
Is it possible to modify my grep command so that it outputs only the number 12?
You can grab this in pure BASH using its regex capabilities:
s='toto.titi[12].tata=55'
[[ "$s" =~ ^toto.titi\[([0-9]+)\]\.tata=[0-9]+$ ]] && echo "${BASH_REMATCH[1]}"
12
You can also use sed:
sed 's/toto.titi\[\([0-9]*\)\].tata=55/\1/' <<< "$s"
12
OR using awk:
awk -F '[\\[\\]]' '{print $2}' <<<"$s"
12
Use a lookbehind:
echo 'toto.titi[12].tata=55' | grep -oP '(?<=\[)\d+'
12
Without Perl regex, use sed to strip the "[":
echo 'toto.titi[12].tata=55' | grep -o "\[[0-9]\+" | sed 's/\[//g'
12
Pipe it to sed and use a back reference:
cat /tmp/tmp_file | grep "toto.titi\[[0-9]\+\].tata=55" | sed 's/.*\[\([0-9]*\)\].*/\1/'
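The grep and sed stages can also be folded into a single sed call, since sed -n with the p flag prints only the lines where the substitution succeeded. A minimal sketch using basic regular expressions only; extract_index is a hypothetical name and the sample lines are invented to show the filtering:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read lines on stdin and print the bracketed number
# of every line that has the exact form toto.titi[N].tata=55.
extract_index() {
  # \[ is a literal bracket, \( \) is the BRE capture group, and a ]
  # outside a bracket expression is literal without escaping.
  sed -n 's/^toto\.titi\[\([0-9][0-9]*\)]\.tata=55$/\1/p'
}

# Assumed sample input: only the first line should produce output.
printf '%s\n' 'toto.titi[12].tata=55' 'toto.titi[7].tata=1' 'other line' | extract_index
```

To run it against the file from the question, redirect the file into the function: extract_index < /tmp/tmp_file.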

bash regex path match

I have a path such as this:
/Users/me/bla/dev/trunk/source/java/com/mecorp/sub/misc/filename.java
I'd like to be able to use bash to create the package structure in another dir somewhere e.g.
com/mecorp/sub/misc/
I tried the following, but it won't work. I was able to get a match if I changed my regex to .*, which implies my bash is OK, so there must be something wrong with the way I'm quoting the regex, or maybe with the regex itself. I do see it working here:
http://regexr.com?3439m
So I'm confused.
regex="(?<=/java)(.*)(?=/)"
[[ $fullfile =~ $regex ]]
echo "pkg name " ${BASH_REMATCH[0]}
Thanks for your time.
EDIT - I'm using OSX so it doesn't have all those nice spiffy GNU extensions.
Try this :
using GNU grep :
$ echo '/Users/me/bla/dev/trunk/source/java/com/mecorp/sub/misc/filename.java' |
grep -oP 'java/\K.*/'
com/mecorp/sub/misc/
See http://regexr.com?3439p
Or using bash :
x="/Users/me/bla/dev/trunk/source/java/com/mecorp/sub/misc/filename.java"
[[ $x =~ java/(.*/) ]] && echo ${BASH_REMATCH[1]}
Or with GNU awk (gensub is a gawk extension; its third argument selects which match to replace):
echo "$x" | awk '{print gensub(".*/java/(.*/).*", "\\1", 1)}'
Or with sed :
echo "$x" | sed -e 's#.*/java/\(.*/\).*#\1#'
If you try to extract the path after /java/ you can do it with this:
path=/Users/me/bla/dev/trunk/source/java/com/mecorp/sub/misc/filename.java
package=$(echo "$path" | sed -E 's,^.*/java/(.*/).*$,\1,')
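Because the question mentions OS X without GNU extensions, the same result is also reachable with parameter expansion alone, which needs no external tool. A minimal sketch, assuming the package root is the first /java/ component of the path (the regex versions above match the last one instead); package_dir is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the package directory that follows /java/.
package_dir() {
  local path=$1
  local rest=${path#*/java/}            # drop everything through the first /java/
  [[ $rest == "$path" ]] && return 1    # no /java/ component in the path
  [[ $rest == */* ]] || return 1        # file sits directly under java/
  printf '%s/\n' "${rest%/*}"           # strip the file name, keep a trailing slash
}

fullfile=/Users/me/bla/dev/trunk/source/java/com/mecorp/sub/misc/filename.java
package_dir "$fullfile"
```

The original attempt fails for a different reason: bash's =~ uses POSIX ERE, which has no lookbehind (?<=...) or lookahead (?=...), so a capture group or an expansion like this one is needed instead.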