Bash regex: replace string with any number of characters - regex

I'm trying to remove colouring codes from a string; e.g. from: \033[36;1mDISK\033[0m to: DISK
my regex looks like this: \033.*?m so match '\033' followed by any number of chars, terminated by 'm'
when I search for the pattern, it finds a match; [[ "$var" =~ $regex ]] evaluates to true
however when I try to replace matches, nothing happens and the same string is returned.
Here's my complete script:
regex="\033.*?m"
var="\033[36;1mDISK\033[0m"
if [[ "$var" =~ $regex ]]
then
echo "matches"
echo ${var//$regex}
else
echo "doesn't match!"
fi
The problem appears to be with the match any number of any character part of the regex. I can successfully replace DISK but if I change that to D.*K or D.*?K it fails.
Note in all above cases the pattern claims to match the string but fails when replacing. Not too sure where to go with this now, any help appreciated.
Thanks

The following should do it:
$ var="\033[36;1mDISK\033[0m"
$ newvar=$(printf ${var} | sed -r "s/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[m|K]//g")
$ echo ${newvar}
returns:
DISK
Now verify!
$ echo $var | od
0000000 030134 031463 031533 035466 066461 044504 045523 030134
0000020 031463 030133 005155
0000026
$ echo $newvar | od
0000000 044504 045523 000012
0000005

To use the parameter expansion substitution operator, you need to use an extended glob.
shopt -s extglob
newvar=${var//\\033\[*([0-9;])m}
To break it down:
\\033\[ - match the encoded escape character and [.
*([0-9;]) - match zero or more digits or semicolons. You could use +([0-9;]) to (more correctly?) match one or more digits or semicolons
m - the trailing m.

Related

compare regex with grep output - bash script

I'm trying to find the word "PASS_MAX_DAYS" in a file using grep
grep "^PASS_MAX_DAYS" /etc/login.defs
then I save it in a variable and compare it to a regular expression that has the value 90 or less.
regex = "PASS_MAX_DAYS\s*([0-9]|[1-8][0-9]|90)"
grep output is: PASS_MAX_DAYS 120
so my function should print a fail, however it matches:
function audit_Control () {
if [[ $cmd =~ $regex ]]; then
echo match
else
echo fail
fi
}
cmd=`grep "^PASS_MAX_DAYS" /etc/login.defs`
regex="PASS_MAX_DAYS\s*([0-9]|[1-8][0-9]|90)"
audit_Control "$cmd" "$regex"
The problem is that the bash test [[ using the regex match operator =~ does not support the common escapes such as \s for whitespace or \W for non-word-characters.
It does support posix predefined character classes, so you can use [[:space:]] in place of \s
Your regex would then be:
regex="PASS_MAX_DAYS[[:space:]]*([0-9]|[1-8][0-9]|90)"
You may want to add anchors ^ and $ to ensure a whole-line match, then the regex is
regex="^PASS_MAX_DAYS[[:space:]]*([0-9]|[1-8][0-9]|90)$"
Without the end-of-line anchor you could match lines that have trailing numbers after the match, so PASS_MAX_DAYS 9077 would match PASS_MAX_DAYS 90 and the trailing "77" would not prevent the match.
This answer also has some very useful information about bash's [[ ]] construction with the =~ operator.
I believe you have a problem with your regex, please try that version:
PASS_MAX_DAYS\s*([0-9]|[1-8][0-9]|90)$
[lucas#lucasmachine ~]$ cat test.sh
#!/bin/bash
function audit_Control () {
if [[ $cmd =~ $regex ]]; then
echo match
else
echo fail
fi
}
regex="PASS_MAX_DAYS\s*([0-9]|[1-8][0-9]|90)$"
audit_Control "$cmd" "$regex"
[lucas#lucasmachine ~]$ export cmd="PASS_MAX_DAYS 123"
[lucas#lucasmachine ~]$ ./test.sh
fail
[lucas#lucasmachine ~]$ export cmd="PASS_MAX_DAYS 1"
[lucas#lucasmachine ~]$ ./test.sh
match
I can explain the problem was the other regex was not checking the end of line, so, your were matching "PASS_MAX_DAYS 1"23 so 23 were not being "counted" to your regex. Your regex was really matching part of the text.. Now with the end of line it should match exactly 1 digit find a end of line, or [1-8][0-9] end of line or 90 end of line.

Difference between grep -E regex and Bash regex in conditional expression

For the same regex applied to the same string, why does grep -E match, but the Bash =~ operator in [[ ]] does not?
$ D=Dw4EWRwer
$ echo $D|grep -qE '^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_-\ ]{1,22}$' || echo wrong pattern
$ [[ "${D}" =~ ^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_-\ ]{1,22}$ ]] || echo wrong pattern
wrong pattern
Update: I confirm this worked:
[[ "${D}" =~ ^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]\ _-]{1,22}$ ]] || echo wrong pattern
The problem (for both versions of the code) is on this character class:
[[:alnum:]_-\ ]
In the grep version, because the regex is enclosed in single quotes, the backslash doesn't escape anything and the character range received by grep is exactly how it is represented above.
In the bash version, the backslash (\) escapes the space that follows it and the actual character class used by [[ ]] to test is [[:alnum:]_- ].
Because in ASCII table the underscore (_) comes after both space () and backslash (\), neither of these character classes is correct.
For the bash version you can use:
[[ "${D}" =~ ^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_-\ ]{1,22}$ ]]; echo $?
to verify its outcome. If the regex is incorrect, the exit code is 2.
If you want to put a dash (-) into a character class you have to put it either as the first character in the class (just after [ or [^ if it is a negating class) or as the last character in the class (right before the closing]`).
The grep version of the code should be (there is no need to escape anything inside a string enclosed in single quotes):
$ echo $D | grep -qE '^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_ -]{1,22}$' || echo wrong pattern
The bash version of your code should be:
[[ "${D}" =~ ^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_\ -]{1,22}$ ]] || echo wrong pattern
Based on your comment, you want the bracket expression to contain alphanumeric characters, spaces, underscores and dashes, so the dash is not supposed to indicate a range. To add a hyphen to a bracket expression, it has to be the first or last character in it. Additionally, you don't have to escape things in bracket expressions, so you can drop the backslash. Your grep regex includes a literal \ in the bracket expression:
$ grep -q '[\]' <<< '\' && echo "Match"
Match
In the Bash regex, the space has to be escaped because the string is first read by the shell, but see below how to avoid that.
First, fixing your regex:
^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_ -]{1,22}$
The backslash is gone, and the hyphen is moved to the end. Using this with grep works fine:
$ D=Dw4EWRwer
$ grep -E '^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_ -]{1,22}$' <<< "$D"
Dw4EWRwer
To use the regex within [[ ]] directly, the space has to be escaped:
$ [[ $D =~ ^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_\ -]{1,22}$ ]] && echo "Match"
Match
I would make the following changes:
Use character classes where possible: [A-Z] is [[:upper:]], [A-Za-z0-9] is [[:alnum:]]
Store the regex in a variable for usage in [[ ]]; this has two advantages: no escaping characters special to the shell, and compatibility with older Bash versions, as the quoting requirements changed between 3.1 and 3.2 (see the Patterns article in the BashGuide).
The regex would then become this for grep:
$ grep -E '^[[:upper:]][[:alnum:]][[:alnum:]_ -]{1,22}$' <<< "$D"
Dw4EWRwer
and this in Bash:
$ re='^[[:upper:]][[:alnum:]][[:alnum:]_ -]{1,22}$'
$ [[ $D =~ $re ]] && echo "Match"
Match

Excluding the first 3 characters of a string using regex

Given any string in bash, e.g flaccid, I want to match all characters in the string but the first 3 (in this case I want to exclude "fla" and match only "ccid"). The regex also needs to work in sed.
I have tried positive look behind and the following regex expressions (as well as various other unsuccessful ones):
^.{3}+([a-z,A-Z]+)
sed -r 's/(?<=^....)(.[A-Z]*)/,/g'
Google hasn't been very helpful as it only produce results like "get first 3 characters .."
Thanks in advance!
If you want to get all characters but the first 3 from a string, you can use cut:
str="flaccid"
cut -c 4- <<< "$str"
or bash variable subsitution:
str="flaccid"
echo "${str:3}"
That will strip the first 3 characters out of your string.
You may just use a capturing group within an expression like ^.{3}(.*) / ^.{3}([a-zA-Z]+) and grab the ${BASH_REMATCH[1]} contents:
#!/bin/bash
text="flaccid"
rx="^.{3}(.*)"
if [[ $text =~ $rx ]]; then
echo ${BASH_REMATCH[1]};
fi
See online Bash demo
In sed, you should also be using capturing groups / backreferences to get what you need. To just keep the first 3 chars, you may use a simple:
echo "flaccid" | sed 's/.\{3\}//'
See this regex demo. The .\{3\} matches exactly any 3 chars and will remove them from the beginning only, since g modifier is not used.
Now, both the solutions above will output ccid, returning the first 3 chars only.
Using sed, just remove them
echo string | sed 's/^...//g'
How is it that no-one has named the most simple and portable solution:
shell "Parameter expansions":
str="flacid"
echo "${str#???}
For a regex (bash):
$ str="flaccid"
$ regex='^.{3}(.*)$'
$ [[ $str =~ $regex ]] && echo "${BASH_REMATCH[1]}"
ccid
Same regex in sed:
$ echo "flaccid" | sed -E "s/$regex/\1/"
ccid
Or sed (Basic Regex):
$ echo "flaccid" | sed 's/^.\{3\}\(.*\)$/\1/'
ccid

What does this match : bash regex

if [[ "$len" -lt "$MINLEN" && "$line" =~ \[*\.\] ]]
This is from Advanced bash scripting guide "Example 10-1. Inserting a blank line between paragraphs in a text file"
As I understand this matches "any string or a dot character". Right ?
It matches zero or more open bracket characters (\[*), followed by a period and a close square bracket (\.\]). Note that it only requires that a match exist somewhere in "$line", not that the whole string match. Here's a demo:
$ showmatch() { [[ "$1" =~ \[*\.\] ]] && echo "matched: '${BASH_REMATCH[0]}'" || echo "no match"; }
$ showmatch "abc[.]def"
matched: '[.]'
$ showmatch "abc.]def"
matched: '.]'
$ showmatch "abc[[[[[[[.]def"
matched: '[[[[[[[.]'
$ showmatch "abc[[[[[[[xyz.]def"
matched: '.]'
$ showmatch "abc[[[[[[[.xyz]def"
no match
...and I'm pretty sure that's not what it's supposed to be doing in that example script.
It means any string ended with dot inside bracers, for example: [.]
[abc.]
Update: +1 to Gordon Davisson, who has summed it up pretty well... so I've redacted my original post
In brief: You can test the result of a bash regex match like this:
[[ "[*.]" =~ \[*\.\] ]] ; echo ${BASH_REMATCH[0]}

How can I ensure a Bash string is alphanumeric, without an underscore?

I'm adding a feature to an existing script that will allow the user to configure the hostname of a Linux system. The rules I'm enforcing are as follows:
Must be between 2 and 63 characters long
Must not start or end with a hyphen
Can only contain alphanumeric characters, and hyphens; all other characters are not allowed (including an underscore, which means I can't use the \W regex symbol)
I've solved the first two in the list, but I'm having trouble figuring out how to check if a bash string contains only letters, digits, and hyphens. I think I can do this with a regex, but I'm having trouble figuring out how (I've spent the past hour searching the web and reading man pages).
I'm open to using sed, grep, or any of the other standard tools, but not Perl or Python.
Seems like this ought to do it:
^[a-zA-Z0-9][-a-zA-Z0-9]{0,61}[a-zA-Z0-9]$
Match any one alphanumeric character, then match up to 61 alphanumeric characters (including hyphens), then match any one alphanumeric character. The minimum string length is 2, the maximum is 63. It doesn't work with Unicode. If you need it to work with Unicode you'll need to add different character classes in place of a-zA-Z0-9 but the principle will be the same.
I believe the correct grep expression that will work with Unicode is:
^[[:alnum:]][-[:alnum:]]{0,61}[[:alnum:]]$
Example Usage:
echo 123-abc-098-xyz | grep -E '^[[:alnum:]][-[:alnum:]]{0,61}[[:alnum:]]$'
result=$(grep -E '^[[:alnum:]][-[:alnum:]]{0,61}[[:alnum:]]$' <<< "this-will-work"); echo $result;
echo "***_this_will_not_match_***" | grep -E '^[[:alnum:]][-[:alnum:]]{0,61}[[:alnum:]]$'
This is a bash script testing the first parameter whether it contains only alphanumerics or hyphens. It "pipes" the contents of $1 into grep:
#!/bin/bash
if grep '^[-0-9a-zA-Z]*$' <<<$1 ;
then echo ok;
else echo ko;
fi
This is for the last one you need: sed -e 's/[^[:alnum:]|-]//g'
you can do it with just bash
string="-sdfsf"
length=${#string}
if [ $length -lt 2 -o $length -gt 63 ] ;then
echo "length invalid"
exit
fi
case $string in
-* ) echo "not ok : start with hyphen";exit ;;
*- ) echo "not ok : end with hyphen";exit ;;
*[^a-zA-Z0-9-]* ) echo "not ok : special character";exit;;
esac