Why \d\+ or \d+ is not equal to \d* here? - regex

Bash +debian.
To match port number at the end of info.
s="2017-04-17 08:16:14 INFO connecting lh3.googleusercontent.com:443 from 111.111.111.111:26215"
echo $s | sed 's/\(.*\):\(\d*\)/\2/'
26215
Let's match it with \d+ or \d+ in sed.
echo $s | sed 's/\(.*\):\(\d\+\)/\2/'
echo $s | sed 's/\(.*\):\(\d+\)/\2/'
All of them get the whole string as output.
2017-04-17 08:16:14 INFO connecting lh3.googleusercontent.com:443 from 111.111.111.111:26215
None of them can match port number at the end,why?

There is an easier sed pattern to use:
$ echo "$s" | sed -nE 's/.*:([^:])/\1/p'
26215
As stated in comments, regular sed does not have perl meta characters. You need to use the POSIX character class of [[:digit:]]
Explanation:
sed -nE 's/.*:([^:])/\1/p'
^ only print if there is a match
^ use ERE and you don't need to escape the parens
^ capture up to the rightmost :
^ ^ -E means you don't need to escape parens
^ all characters except :
^ print if there is a match
Or, if you want to be more specific you want only digits:
$ echo "$s" | sed -nE 's/.*:([[:digit:]]+$)/\1/p'
26215
Note + to make sure there is at least one digit and $ to match only at the end of the line.
There is a summary of different regex flavors HERE. With -E sed is using ERE the same as egrep.

\d is a PCRE extension not present in BRE or ERE syntax (as used by standard UNIX tools).
In this particular case, there's no need to use any tools not built into bash for this purpose at all:
s="2017-04-17 08:16:14 INFO connecting lh3.googleusercontent.com:443 from 111.111.111.111:26215"
echo "Port is ${s##*:}"
This is a parameter expansion; when dealing with small amounts of data, such built-in capabilities are much more efficient than running external tools.
There's also native ERE support built into the shell, as follows:
re=':([[:digit:]]+)$'
[[ $s =~ $re ]] && echo "Port is ${BASH_REMATCH[1]}"
BashFAQ #100 also goes into detail on bash string manipulation.

All you need is this:
echo ${s##*:}
Learn your shell string operators.

s="2017-04-17 08:16:14 INFO connecting lh3.googleusercontent.com:443 from 111.111.111.111:26215"
1.grep
echo $s |grep -Po '\d+$'
2.ack
echo $s |ack -o '\d+$'
3.sed
echo $s |sed 's/.*\://'
4.awk
echo $s |awk -F: '{print $NF}'

Self-answer by OP moved from question to community wiki answer, per consensus on meta:
There is no expression \d to stand for numbers in sed.
To get with awk simply with :
echo $s |awk -F: '{print $NF}'
26215

Related

Bash Regex extract all text from 2nd occurence of specific character until end of line

I have the following strings:
text/:some_random_text:text_i_w4nt_to:k33p.until_th3_end_1
text/:some_random_text:text_i_w4nt_to::k33p.until_th3_end_1
with a regular expression, I want to extract:
text_i_w4nt_to:k33p.until_th3_end_1
text_i_w4nt_to::k33p.until_th3_end_1
I have tried with regex101.com the following expression: ([^:]+)(?::[^:]+){1}$
and it worked (only for the first string)
But if I try in bash, it does not
echo "text/:some_random_text:text_i_w4nt_to::k33p.until_th3_end_1" | sed -n "/([^:]+)(?::[^:]+){1}$/p"
It would be much easier with cut without any regex:
cut -d: -f3- file
text_i_w4nt_to:k33p.until_th3_end_1
text_i_w4nt_to::k33p.until_th3_end_1
Non capture groups (?: are not supported in sed and you have to escape the \( \) \{ \} and \+
You can repeat 2 occurrences of : from the start of the string and replace that with an empty string.
sed 's/^\([^:]\+:\)\{2\}//' file
Or using sed -E for extended regexp:
sed -E 's/^([^:]+:){2}//' file
Output
text_i_w4nt_to:k33p.until_th3_end_1
text_i_w4nt_to::k33p.until_th3_end_1
Using sed
$ sed s'|\([^:]*:\)\{2\}\(.*\)$|\2|' input_file
text_i_w4nt_to:k33p.until_th3_end_1
text_i_w4nt_to::k33p.until_th3_end_1
or
$ sed s'|\([^:]*:\)\{2\}||' input_file
text_i_w4nt_to:k33p.until_th3_end_1
text_i_w4nt_to::k33p.until_th3_end_1
There's no reason to drag sed or other external programs into this; just use bash's built in regular expression matching:
#!/usr/bin/env bash
strings=(text/:some_random_text:text_i_w4nt_to:k33p.until_th3_end_1
text/:some_random_text:text_i_w4nt_to::k33p.until_th3_end_1)
for s in "${strings[#]}"; do
[[ $s =~ ^([^:]*:){2}(.*) ]] && printf "%s\n" "${BASH_REMATCH[2]}"
done
Heck, you don't need regular expressions in bash:
printf "%s\n" "${s#*:*:}"
awk
string='ext/:some_random_text:text_i_w4nt_to:k33p.until_th3_end_1
text/:some_random_text:text_i_w4nt_to::k33p.until_th3_end_1'
awk -vFS=: -vOFS=: '{$1=$2="";gsub(/^::/,"")}1' <<<"$string"
text_i_w4nt_to:k33p.until_th3_end_1
text_i_w4nt_to::k33p.until_th3_end_1
absolutely no need to use anything that requires regex-backreferences, since the regex anchoring is right at the line head anyway :
mawk ++NF OFS= FS='^[^:]*:[^:]*:'
text_i_w4nt_to:k33p.until_th3_end_1
text_i_w4nt_to::k33p.until_th3_end_1

Get substring using either perl or sed

I can't seem to get a substring correctly.
declare BRANCH_NAME="bugfix/US3280841-something-duh";
# Trim it down to "US3280841"
TRIMMED=$(echo $BRANCH_NAME | sed -e 's/\(^.*\)\/[a-z0-9]\|[A-Z0-9]\+/\1/g')
That still returns bugfix/US3280841-something-duh.
If I try an use perl instead:
declare BRANCH_NAME="bugfix/US3280841-something-duh";
# Trim it down to "US3280841"
TRIMMED=$(echo $BRANCH_NAME | perl -nle 'm/^.*\/([a-z0-9]|[A-Z0-9])+/; print $1');
That outputs nothing.
What am I doing wrong?
Using bash parameter expansion only:
$: # don't use caps; see below.
$: declare branch="bugfix/US3280841-something-duh"
$: tmp="${branch##*/}"
$: echo "$tmp"
US3280841-something-duh
$: trimmed="${tmp%%-*}"
$: echo "$trimmed"
US3280841
Which means:
$: tmp="${branch_name##*/}"
$: trimmed="${tmp%%-*}"
does the job in two steps without spawning extra processes.
In sed,
$: sed -E 's#^.*/([^/-]+)-.*$#\1#' <<< "$branch"
This says "after any or no characters followed by a slash, remember one or more that are not slashes or dashes, followed by a not-remembered dash and then any or no characters, then replace the whole input with the remembered part."
Your original pattern was
's/\(^.*\)\/[a-z0-9]\|[A-Z0-9]\+/\1/g'
This says "remember any number of anything followed by a slash, then a lowercase letter or a digit, then a pipe character (because those only work with -E), then a capital letter or digit, then a literal plus sign, and then replace it all with what you remembered."
GNU's manual is your friend. I look stuff up all the time to make sure I'm doing it right. Sometimes it still takes me a few tries, lol.
An aside - try not to use all-capital variable names. That is a convention that indicates it's special to the OS, like RANDOM or IFS.
You may use this sed:
sed -E 's~^.*/|-.*$~~g' <<< "$BRANCH_NAME"
US3280841
Ot this awk:
awk -F '[/-]' '{print $2}' <<< "$BRANCH_NAME"
US3280841
sed 's:[^/]*/\([^-]*\)-.*:\1:'<<<"bugfix/US3280841-something-duh"
Perl version just has + in wrong place. It should be inside the capture brackets:
TRIMMED=$(echo $BRANCH_NAME | perl -nle 'm/^.*\/([a-z0-9A-Z]+)/; print $1');
Just use a ^ before A-Z0-9
TRIMMED=$(echo $BRANCH_NAME | sed -e 's/\(^.*\)\/[a-z0-9]\|[^A-Z0-9]\+/\1/g')
in your sed case.
Alternatively and briefly, you can use
TRIMMED=$(echo $BRANCH_NAME | sed "s/[a-z\/\-]//g" )
too.
type on shell terminal
$ BRANCH_NAME="bugfix/US3280841-something-duh"
$ echo $BRANCH_NAME| perl -pe 's/.*\/(\w\w[0-9]+).+/\1/'
use s (substitute) command instead of m (match)
perl is a superset of sed so it'd be identical 'sed -E' instead of 'perl -pe'
Another variant using Perl Regular Expression Character Classes (see perldoc perlrecharclass).
echo $BRANCH_NAME | perl -nE 'say m/^.*\/([[:alnum:]]+)/;'

How to use regex capturing group in bash correctly?

I've loaded some strings into variable "result". The strings look like this:
school/proj_1/file1.txt
school/proj_1/file2.txt
school/proj_1/file3.txt
I try to get only the name after the last slash, so file1.txt, file2.txt and file3.txt is the desirable result for me. I use this piece of code
for i in $result
do
grep "school/proj_1/(.*)" $i
done
but it doesn't work. I feel that the regex would work for Python with the caputuring group I created, but I can't really wrap my head around how to use capturing groups in bash or if it is even possible at all.
I'm sorry if it's a dumb question, I'm very new to scripting in bash.
You may use a simple approach with a string manipulation operation:
echo "${i##*/}"
${string##substring}
Deletes longest match of $substring from front of $string.
Or using a regex in Bash, you may get the capturing groups like
result=("school/proj_1/file1.txt" "school/proj_1/file2.txt" "school/proj_1/file3.txt")
rx='school/proj_1/(.*)'
for i in "${result[#]}"; do
if [[ "$i" =~ $rx ]]; then
echo "${BASH_REMATCH[1]}"
fi
done
See the online demo. Here, ${BASH_REMATCH[1]} is the contents inside capturing group #1.
Try this :
variable declaration :
$ result="school/proj_1/file1.txt
school/proj_1/file2.txt
school/proj_1/file3.txt"
Commands :
(all as one-liners)
$ grep -oP "school/proj_1/\K.*" "$i" <<< "$result"
or
$ awk -F'/' '{print $NF}' <<< "$result
or
$ sed 's|.*/||' <<< "$result"
or if number of sub dirs are fixed :
$ cut -d'/' -f3 <<< "$result"
Output :
file1.txt
file2.txt
file3.txt

Excluding the first 3 characters of a string using regex

Given any string in bash, e.g flaccid, I want to match all characters in the string but the first 3 (in this case I want to exclude "fla" and match only "ccid"). The regex also needs to work in sed.
I have tried positive look behind and the following regex expressions (as well as various other unsuccessful ones):
^.{3}+([a-z,A-Z]+)
sed -r 's/(?<=^....)(.[A-Z]*)/,/g'
Google hasn't been very helpful as it only produce results like "get first 3 characters .."
Thanks in advance!
If you want to get all characters but the first 3 from a string, you can use cut:
str="flaccid"
cut -c 4- <<< "$str"
or bash variable subsitution:
str="flaccid"
echo "${str:3}"
That will strip the first 3 characters out of your string.
You may just use a capturing group within an expression like ^.{3}(.*) / ^.{3}([a-zA-Z]+) and grab the ${BASH_REMATCH[1]} contents:
#!/bin/bash
text="flaccid"
rx="^.{3}(.*)"
if [[ $text =~ $rx ]]; then
echo ${BASH_REMATCH[1]};
fi
See online Bash demo
In sed, you should also be using capturing groups / backreferences to get what you need. To just keep the first 3 chars, you may use a simple:
echo "flaccid" | sed 's/.\{3\}//'
See this regex demo. The .\{3\} matches exactly any 3 chars and will remove them from the beginning only, since g modifier is not used.
Now, both the solutions above will output ccid, returning the first 3 chars only.
Using sed, just remove them
echo string | sed 's/^...//g'
How is it that no-one has named the most simple and portable solution:
shell "Parameter expansions":
str="flacid"
echo "${str#???}
For a regex (bash):
$ str="flaccid"
$ regex='^.{3}(.*)$'
$ [[ $str =~ $regex ]] && echo "${BASH_REMATCH[1]}"
ccid
Same regex in sed:
$ echo "flaccid" | sed -E "s/$regex/\1/"
ccid
Or sed (Basic Regex):
$ echo "flaccid" | sed 's/^.\{3\}\(.*\)$/\1/'
ccid

bash - Extract part of string

I have a string something like this
xsd:import schemaLocation="AppointmentManagementService.xsd6.xsd" namespace=
I want to extract the following from it :
AppointmentManagementService.xsd6.xsd
I have tried using regex, bash and sed with no success. Can someone please help me out with this?
The regex that I used was this :
/AppointmentManagementService.xsd\d{1,2}.xsd/g
Your string is:
nampt#nampt-desktop:$ cat 1
xsd:import schemaLocation="AppointmentManagementService.xsd6.xsd" namespace=
Try with awk:
cat 1 | awk -F "\"" '{print $2}'
Output:
AppointmentManagementService.xsd6.xsd
sed doesn't recognize \d, use [0-9] or [[:digit:]] instead:
sed 's/^.*schemaLocation="\([^"]\+[[:digit:]]\{1,2\}\.xsd\)".*$/\1/g'
## or
sed 's/^.*schemaLocation="\([^"]\+[0-9]\{1,2\}\.xsd\)".*$/\1/g'
You can use bash native regex matching:
$ in='xsd:import schemaLocation="AppointmentManagementService.xsd6.xsd" namespace='
$ if [[ $in =~ \"(.+)\" ]]; then echo "${BASH_REMATCH[1]}"; fi
Output:
AppointmentManagementService.xsd6.xsd
Based on your example, if you want to grant, at least, 1 or, at most, 2 digits in the .xsd... component, you can fine tune the regex with:
$ if [[ $in =~ \"(AppointmentManagementService.xsd[0-9]{1,2}.xsd)\" ]]; then echo "${BASH_REMATCH[1]}"; fi
using PCRE in GNU grep
grep -oP 'schemaLocation="\K.*?(?=")'
this will output pattern matched between schemaLocation=" and very next occurrence of "
Reference:
https://unix.stackexchange.com/a/13472/109046
Also we can use 'cut' command for this purpose,
[root#code]# echo "xsd:import schemaLocation=\"AppointmentManagementService.xsd6.xsd\" namespace=" | cut -d\" -f 2
AppointmentManagementService.xsd6.xsd
s='xsd:import schemaLocation="AppointmentManagementService.xsd6.xsd" namespace='
echo $s | sed 's/.*schemaLocation="\(.*\)" namespace=.*/\1/'