Bash script with regex and capturing group - regex

I'm working on a bash script to rename automatically files on my Synology NAS.
I have a loop for the statement of the files and everything is ok until I want to make my script more efficient with regex.
I have several bits of code which are working like as expected:
filename="${filename//[-_.,\']/ }"
filename="${filename//[éèēěëê]/e}"
But I have this:
filename="${filename//t0/0}"
filename="${filename//t1/1}"
filename="${filename//t2/2}"
filename="${filename//t3/3}"
filename="${filename//t4/4}"
filename="${filename//t5/5}"
filename="${filename//t6/6}"
filename="${filename//t7/7}"
filename="${filename//t8/8}"
filename="${filename//t9/9}"
And, I would like to use captured group to have something like this:
filename="${filename//t([0-9]{1,2})/\1}"
filename="${filename//t([0-9]{1,2})/${BASH_REMATCH[1]}}"
I've been looking for a working syntax without success...

The shell's parameter expansion facility does not support regular expressions. But you can approximate it with something like
filename=$(sed 's/t\([0-9]\)/\1/g' <<<"$filename")
This will work regardless of whether the first digit is followed by additional digits or not, so dropping that requirement simplifies the code.

If you want the last or all t[0-9]{1,2}s replaced:
$ filename='abt1cdt2eft3gh'; [[ "$filename" =~ (.*)t([0-9]{1,2}.*) ]] && filename="${BASH_REMATCH[1]}${BASH_REMATCH[2]}"; echo "$filename"
abt1cdt2ef3gh
$ filename='abt1cdt2eft3gh'; while [[ "$filename" =~ (.*)t([0-9]{1,2}.*) ]]; do filename="${BASH_REMATCH[1]}${BASH_REMATCH[2]}"; done; echo "$filename"
ab1cd2ef3gh
Note that the "replace all" case above would keep iterating until all t[0-9]{1,2}s are changed, even ones that didn't exist in the original input but were being created by the loop, e.g.:
$ filename='abtt123de'; while [[ "$filename" =~ (.*)t([0-9]{1,2}.*) ]]; do filename="${BASH_REMATCH[1]}${BASH_REMATCH[2]}"; echo "$filename"; done
abt123de
ab123de
whereas the sed script in #tripleee's answer would not do that:
$ filename='abtt123de'; filename=$(sed 's/t\([0-9]\)/\1/g' <<<"$filename"); echo "$filename"
abt123de

Related

bash regular expression format

My code have problem with compare var with regular expression.
The main problem is problem is here
if [[ “$alarm” =~ ^[0-2][0-9]\:[0-5][0-9]$ ]]
This "if" is never true i dont know why even if i pass to "$alarm" value like 13:00 or 08:19 its always false and write "invalid clock format".
When i try this ^[0-2][0-9]:[0-5][0-9]$ on site to test regular expressions its work for example i compered with 12:20.
I start my script whith command ./alarm 11:12
below is whole code
#!/bin/bash
masa="`date +%k:%M`"
mp3="$HOME/Desktop/alarm.mp3" #change this
echo "lol";
if [ $# != 1 ]; then
echo "please insert alarm time [24hours format]"
echo "example ./alarm 13:00 [will ring alarm at 1:00pm]"
exit;
fi
alarm=$1
echo "$alarm"
#fix me with better regex >_<
if [[ “$alarm” =~ ^[0-2][0-9]\:[0-5][0-9]$ ]]
then
echo "time now $masa"
echo "alarm set to $alarm"
echo "will play $mp3"
else
echo "invalid clock format"
exit;
fi
while [ $masa != $alarm ];do
masa="`date +%k:%M`" #update time
sleep 1 #dont overload the cpu cycle
done
echo $masa
if [ $masa = $alarm ];then
echo ringggggggg
play $mp3 > /dev/null 2> /dev/null &
fi
exit
I can see a couple of issues with your test.
Firstly, it looks like you may be using the wrong kind of double quotes around your variable (“ ”, rather than "). These "fancy quotes" are being concatenated with your variable, which I assume is what causes your pattern to fail to match. You could change them but within bash's extended tests (i.e. [[ instead of [), there's no need to quote your variables anyway, so I would suggest removing them entirely.
Secondly, your regular expression allows some invalid dates at the moment. I would suggest using something like this:
re='^([01][0-9]|2[0-3]):[0-5][0-9]$'
if [[ $alarm =~ $re ]]
I have deliberately chosen to use a separate variable to store the pattern, as this is the most widely compatible way of working with bash regexes.

Shell: Checking if argument exists and matches expression

I'm new to shell scripting and trying to write the ability to check if an argument exists and if it matches an expression. I'm not sure how to write expressions, so this is what I have so far:
#!/bin/bash
if [[ -n "$1"] && [${1#*.} -eq "tar.gz"]]; then
echo "Passed";
else
echo "Missing valid argument"
fi
To run the script, I would type this command:
# script.sh YYYY-MM.tar.gz
I believe what I have is
if the YYYY-MM.tar.gz is not after script.sh it will echo "Missing valid argument" and
if the file does not end in .tar.gz it echo's the same error.
However, I want to also check if the full file name is in YYYY-MM.tar.gz format.
if [[ -n "$1" ]] && [[ "${1#*.}" == "tar.gz" ]]; then
-eq: (equal) for arithmetic tests
==: to compare strings
See: help test
You can also use:
case "$1" in
*.tar.gz) ;; #passed
*) echo "wrong/missing argument $1"; exit 1;;
esac
echo "ok arg: $1"
As long as the file is in the correct YYYY-MM.tar.gz format, it obviously is non-empty and ends in .tar.gz as well. Check with a regular expression:
if ! [[ $1 =~ [0-9]{4}-[0-9]{1,2}.tar.gz ]]; then
echo "Argument 1 not in correct YYYY-MM.tar.gz format"
exit 1
fi
Obviously, the regular expression above is too general, allowing names like 0193-67.tar.gz. You can adjust it to be as specific as you need it to be for your application, though. I might recommend
[1-9][0-9]{3}-([1-9]|10|11|12).tar.gz
to allow only 4-digit years starting with 1000 (support for the first millennium ACE seems unnecessary) and only months 1-12 (no leading zero).

How to check an input string in bash it's in version format (n1.n2.n3)

I've written an script that updates a version on a certain file. I need to check that the input for the user is in version format so I don't finish adding number that are not needed in those important files. The way I have done it is by adding a new value version_check which where I delete my regex pattern and then an if check.
version=$1
version_checked=$(echo $version | sed -e '/[0-9]\+\.[0-9]\+\.[0-9]/d')
if [[ -z $version_checked ]]; then
echo "$version is the right format"
else
echo "$version_checked is not in the right format, please use XX.XX.XX format (ie: 4.15.3)"
exit
fi
That works fine for XX.XX and XX.XX.XX but it also allows XX.XX.XX.XX and XX.XX.XX.XX.XX etc.. so if user makes a mistake it will input wrong data on the file. How can I get the sed regex to ONLY allow 3 pairs of numbers separated by a dot?
Change your regex from:
/[0-9]\+\.[0-9]\+\.[0-9]/
to this:
/^[0-9]*\.[0-9]*\.[0-9]*$/
You can do this with bash pattern matching:
$ for version in 1.2 1.2.3 1.2.3.4; do
printf "%s\t" $version
[[ $version == +([0-9]).+([0-9]).+([0-9]) ]] && echo y || echo n
done
1.2 n
1.2.3 y
1.2.3.4 n
If you need each group of digits to be exactly 2 digits:
[[ $version == [0-9][0-9].[0-9][0-9].[0-9][0-9] ]]

Checking a string to see if it contains numeric character in UNIX

I'm new to UNIX, having only started it at work today, but experienced with Java, and have the following code:
#/bin/bash
echo "Please enter a word:"
read word
grep -i $word $1 | cut -d',' -f1,2 | tr "," "-"> output
This works fine, but what I now need to do is to check when word is read, that it contains nothing but letters and if it has numeric characters in print "Invalid input!" message and ask them to enter it again. I assumed regular expressions with an if statement would be the easy way to do this but I cannot get my head around how to use them in UNIX as I am used to the Java application of them. Any help with this would be greatly appreciated, as I couldn't find help when searching as all the solutions with regular expressions in linux I found only dealt with if it was either all numeric or not.
Yet another approach. Grep exits with 0 if a match is found, so you can test the exit code:
echo "${word}" | grep -q '[0-9]'
if [ $? = 0 ]; then
echo 'Invalid input'
fi
This is /bin/sh compatible.
Incorporating Daenyth and John's suggestions, this becomes
if echo "${word}" | grep '[0-9]' >/dev/null; then
echo 'Invalid input'
fi
The double bracket operator is an extended version of the test command which supports regexes via the =~ operator:
#!/bin/bash
while true; do
read -p "Please enter a word: " word
if [[ $word =~ [0-9] ]]; then
echo 'Invalid input!' >&2
else
break
fi
done
This is a bash-specific feature. Bash is a newer shell that is not available on all flavors of UNIX--though by "newer" I mean "only recently developed in the post-vacuum tube era" and by "not all flavors of UNIX" I mean relics like old versions of Solaris and HP-UX.
In my opinion this is the simplest option and bash is plenty portable these days, but if being portable to old UNIXes is in fact important then you'll need to use the other posters' sh-compatible answers. sh is the most common and most widely supported shell, but the price you pay for portability is losing things like =~.
If you're trying to write portable shell code, your options for string manipulation are limited. You can use shell globbing patterns (which are a lot less expressive than regexps) in the case construct:
export LC_COLLATE=C
read word
while
case "$word" in
*[!A-Za-z]*) echo >&2 "Invalid input, please enter letters only"; true;;
*) false;;
esac
do
read word
done
EDIT: setting LC_COLLATE is necessary because in most non-C locales, character ranges like A-Z don't have the “obvious” meaning. I assume you want only ASCII letters; if you also want letters with diacritics, don't change LC_COLLATE, and replace A-Za-z by [:alpha:] (so the whole pattern becomes *[![:alpha:]]*).
For full regexps, see the expr command. EDIT: Note that expr, like several other basic shell tools, has pitfalls with some special strings; the z characters below prevent $word from being interpreted as reserved words by expr.
export LC_COLLATE=C
read word
while expr "z$word" : 'z[A-Za-z]*$' >/dev/null; then
echo >&2 "Invalid input, please enter letters only"
read word
fi
If you only target recent enough versions of bash, there are other options, such as the =~ operator of [[ ... ]] conditional commands.
Note that your last line has a bug, the first command should be
grep -i "$word" "$1"
The quotes are because somewhat counter-intuitively, "$foo" means “the value of the variable called foo” whereas plain $foo means “take the value of foo, split it into separate words where it contains whitespace, and treat each word as a globbing pattern and try to expand it”. (In fact if you've already checked that $word contains only letters, leaving the quotes won't do any harm, but it takes more time to think of these special cases than to just put the quotes every times.)
Yet another (quite) portable way to do it ...
if test "$word" != "`printf "%s" "$word" | tr -dc '[[:alpha:]]'`"; then
echo invalid
fi
One portable (assuming bash >= 3) way to do this is to remove all numbers and test for length:
#!/bin/bash
read -p "Enter a number" var
if [[ -n ${var//[0-9]} ]]; then
echo "Contains non-numbers!"
else
echo "ok!"
fi
Coming from Java, it's important to note that bash has no real concept of objects or data types. Everything is a string, and complex data structures are painful at best.
For more info on what I did, and other related functions, google for bash string manipulation.
Playing around with Bash parameter expansion and character classes:
# cf. http://wiki.bash-hackers.org/syntax/pe
word="abc1def"
word="abc,def"
word=$'abc\177def'
# cf. http://mywiki.wooledge.org/BashFAQ/058 (no NUL byte in Bash variable)
word=$'abc\000def'
word="abcdef"
(
set -xv
[[ "${word}" != "${word/[[:digit:]]/}" ]] && echo invalid || echo valid
[[ -n "${word//[[:alpha:]]/}" ]] && echo invalid || echo valid
)
Everyone's answers seem to be based on the fact that the only invalid characters are numbers. The initial questions states that they need to check that the string contains "nothing but letters".
I think the best way to do it is
nonalpha=$(echo "$word" | sed 's/[[:alpha:]]//g')
if [[ ${#nonalpha} -gt 0 ]]; then
echo "Invalid character(s): $nonalpha"
fi
If you found this page looking for a way to detect non-numeric characters in your string (like I did!) replace [[:alpha:]] with [[:digit:]].

Getting the index of the substring on solaris

How can I find the index of a substring which matches a regular expression on solaris10?
Assuming that what you want is to find the location of the first match of a wildcard in a string using bash, the following bash function returns just that, or empty if the wildcard doesn't match:
function match_index()
{
local pattern=$1
local string=$2
local result=${string/${pattern}*/}
[ ${#result} = ${#string} ] || echo ${#result}
}
For example:
$ echo $(match_index "a[0-9][0-9]" "This is a a123 test")
10
If you want to allow full-blown regular expressions instead of just wildcards, replace the "local result=" line with
local result=$(echo "$string" | sed 's/'"$pattern"'.*$//')
but then you're exposed to the usual shell quoting issues.
The goto options for me are bash, awk and perl. I'm not sure what you're trying to do, but any of the three would likely work well. For example:
f=somestring
string=$(expr match "$f" '.*\(expression\).*')
echo $string
You tagged the question as bash, so I'm going to assume you're asking how to do this in a bash script. Unfortunately, the built-in regular expression matching doesn't save string indices. However, if you're asking this in order to extract the match substring, you're in luck:
if [[ "$var" =~ "$regex" ]]; then
n=${#BASH_REMATCH[*]}
while [[ $i -lt $n ]]
do
echo "capture[$i]: ${BASH_REMATCH[$i]}"
let i++
done
fi
This snippet will output in turn all of the submatches. The first one (index 0) will be the entire match.
You might like your awk options better, though. There's a function match which gives you the index you want. Documentation can be found here. It'll also store the length of the match in RLENGTH, if you need that. To implement this in a bash script, you could do something like:
match_index=$(echo "$var_to_search" | \
awk '{
where = match($0, '"$regex_to_find"')
if (where)
print where
else
print -1
}')
There are a lot of ways to deal with passing the variables in to awk. This combination of piping output and directly embedding one into the awk one-liner is fairly common. You can also give awk variable values with the -v option (see man awk).
Obviously you can modify this to get the length, the match string, whatever it is you need. You can capture multiple things into an array variable if necessary:
match_data=($( ... awk '{ ... print where,RLENGTH,match_string ... }'))
If you use bash 4.x you can source the oobash. A string lib written in bash with oo-style:
http://sourceforge.net/projects/oobash/
String is the constructor function:
String a abcda
a.indexOf a
0
a.lastIndexOf a
4
a.indexOf da
3
There are many "methods" more to work with strings in your scripts:
-base64Decode -base64Encode -capitalize -center
-charAt -concat -contains -count
-endsWith -equals -equalsIgnoreCase -reverse
-hashCode -indexOf -isAlnum -isAlpha
-isAscii -isDigit -isEmpty -isHexDigit
-isLowerCase -isSpace -isPrintable -isUpperCase
-isVisible -lastIndexOf -length -matches
-replaceAll -replaceFirst -startsWith -substring
-swapCase -toLowerCase -toString -toUpperCase
-trim -zfill