Checking a string to see if it contains numeric character in UNIX - regex

I'm new to UNIX, having only started it at work today, but experienced with Java, and have the following code:
#/bin/bash
echo "Please enter a word:"
read word
grep -i $word $1 | cut -d',' -f1,2 | tr "," "-"> output
This works fine, but what I now need to do is to check when word is read, that it contains nothing but letters and if it has numeric characters in print "Invalid input!" message and ask them to enter it again. I assumed regular expressions with an if statement would be the easy way to do this but I cannot get my head around how to use them in UNIX as I am used to the Java application of them. Any help with this would be greatly appreciated, as I couldn't find help when searching as all the solutions with regular expressions in linux I found only dealt with if it was either all numeric or not.

Yet another approach. Grep exits with 0 if a match is found, so you can test the exit code:
echo "${word}" | grep -q '[0-9]'
if [ $? = 0 ]; then
echo 'Invalid input'
fi
This is /bin/sh compatible.
Incorporating Daenyth and John's suggestions, this becomes
if echo "${word}" | grep '[0-9]' >/dev/null; then
echo 'Invalid input'
fi

The double bracket operator is an extended version of the test command which supports regexes via the =~ operator:
#!/bin/bash
while true; do
read -p "Please enter a word: " word
if [[ $word =~ [0-9] ]]; then
echo 'Invalid input!' >&2
else
break
fi
done
This is a bash-specific feature. Bash is a newer shell that is not available on all flavors of UNIX--though by "newer" I mean "only recently developed in the post-vacuum tube era" and by "not all flavors of UNIX" I mean relics like old versions of Solaris and HP-UX.
In my opinion this is the simplest option and bash is plenty portable these days, but if being portable to old UNIXes is in fact important then you'll need to use the other posters' sh-compatible answers. sh is the most common and most widely supported shell, but the price you pay for portability is losing things like =~.

If you're trying to write portable shell code, your options for string manipulation are limited. You can use shell globbing patterns (which are a lot less expressive than regexps) in the case construct:
export LC_COLLATE=C
read word
while
case "$word" in
*[!A-Za-z]*) echo >&2 "Invalid input, please enter letters only"; true;;
*) false;;
esac
do
read word
done
EDIT: setting LC_COLLATE is necessary because in most non-C locales, character ranges like A-Z don't have the “obvious” meaning. I assume you want only ASCII letters; if you also want letters with diacritics, don't change LC_COLLATE, and replace A-Za-z by [:alpha:] (so the whole pattern becomes *[![:alpha:]]*).
For full regexps, see the expr command. EDIT: Note that expr, like several other basic shell tools, has pitfalls with some special strings; the z characters below prevent $word from being interpreted as reserved words by expr.
export LC_COLLATE=C
read word
while expr "z$word" : 'z[A-Za-z]*$' >/dev/null; then
echo >&2 "Invalid input, please enter letters only"
read word
fi
If you only target recent enough versions of bash, there are other options, such as the =~ operator of [[ ... ]] conditional commands.
Note that your last line has a bug, the first command should be
grep -i "$word" "$1"
The quotes are because somewhat counter-intuitively, "$foo" means “the value of the variable called foo” whereas plain $foo means “take the value of foo, split it into separate words where it contains whitespace, and treat each word as a globbing pattern and try to expand it”. (In fact if you've already checked that $word contains only letters, leaving the quotes won't do any harm, but it takes more time to think of these special cases than to just put the quotes every times.)

Yet another (quite) portable way to do it ...
if test "$word" != "`printf "%s" "$word" | tr -dc '[[:alpha:]]'`"; then
echo invalid
fi

One portable (assuming bash >= 3) way to do this is to remove all numbers and test for length:
#!/bin/bash
read -p "Enter a number" var
if [[ -n ${var//[0-9]} ]]; then
echo "Contains non-numbers!"
else
echo "ok!"
fi
Coming from Java, it's important to note that bash has no real concept of objects or data types. Everything is a string, and complex data structures are painful at best.
For more info on what I did, and other related functions, google for bash string manipulation.

Playing around with Bash parameter expansion and character classes:
# cf. http://wiki.bash-hackers.org/syntax/pe
word="abc1def"
word="abc,def"
word=$'abc\177def'
# cf. http://mywiki.wooledge.org/BashFAQ/058 (no NUL byte in Bash variable)
word=$'abc\000def'
word="abcdef"
(
set -xv
[[ "${word}" != "${word/[[:digit:]]/}" ]] && echo invalid || echo valid
[[ -n "${word//[[:alpha:]]/}" ]] && echo invalid || echo valid
)

Everyone's answers seem to be based on the fact that the only invalid characters are numbers. The initial questions states that they need to check that the string contains "nothing but letters".
I think the best way to do it is
nonalpha=$(echo "$word" | sed 's/[[:alpha:]]//g')
if [[ ${#nonalpha} -gt 0 ]]; then
echo "Invalid character(s): $nonalpha"
fi
If you found this page looking for a way to detect non-numeric characters in your string (like I did!) replace [[:alpha:]] with [[:digit:]].

Related

Regular expressions don't work as expected in bash if-else block's condition

My pattern defined to match in if-else block is :
pat="17[0-1][0-9][0-9][0-9].AUG"
nln=""
In my script, I'm taking user input which needs to be matched against the pattern, which if doesn't match, appropriate error messages are to be shown. Pretty simple, but giving me a hard time though. My code block from the script is this:
echo "How many days' AUDIT Logs need to be searched?"
read days
echo "Enter file name(s)[For multiple files, one file per line]: "
for(( c = 0 ; c < $days ; c++))
do
read elements
if [[ $elements =~ $pat ]];
then
array[$c]="$elements"
elif [[ $elements =~ $nln ]];
then
echo "No file entered.Run script again. Exiting"
exit;
else
echo "Invalid filename entered: $elements.Run script again. Exiting"
exit;
fi
done
The format I want from the user for filenames to be entered is this:
170402.AUG
So basically yymmdd.AUG (where y-year,m-month,d-day), with trailing or leading spaces is fine. Anything other than that should throw "Invalid filename entered: $elements.Run script again. Exiting" message. Also I want to check if if it is a blank line with a "Enter" hit, it should give an error saying "No file entered.Run script again. Exiting"
However my code, even if I enter something like "xxx" as filename, which should be throwing "Invalid filename entered: $elements.Run script again. Exiting", is actually checking true against a blank line, and throwing "No file entered.Run script again. Exiting"
Need some help with handling the regular expressions' check with user input, as otherwise rest of my script works just fine.
I think as discussed in the comments you are confusing with the glob match and a regEx match, what you have defined as pat is a glob match which needs to be equated with the == operator as,
pat="17[0-1][0-9][0-9][0-9].AUG"
string="170402.AUG"
[[ $string == $pat ]] && printf "Match success\n"
The equivalent ~ match would be to something as
pat="17[[:digit:]]{4}\.AUG"
[[ $string =~ $pat ]] && printf "Match success\n"
As you can see the . in the regex syntax has been escaped to deprive of its special meaning ( to match any character) but just to use as a literal dot. The POSIX character class [[:digit:]] with a character count {4} allows you to match 4 digits followed by .AUG
And for the string empty check do as suggested by the comments from Cyrus, or by Benjamin.W
[[ $elements == "" ]]
(or)
[[ -z $elements ]]
I would not bug the user with how many days (who want count 15 days or like)? Also, why only one file per line? You should help the users, not bug them like microsoft...
For the start:
show_help() { cat <<'EOF'
bla bla....
EOF
}
show_files() { echo "${#files[#]} valid files entered: ${files[#]}"; }
while read -r -p 'files? (h-help)> ' line
do
case "$line" in
q) echo "quitting..." ; exit 0 ;;
h) show_help ; continue;;
'') (( ${#files} )) && show_files; continue ;;
l) show_files ; continue ;;
p) (( ${#files} )) && break || { echo "No files enterd.. quitting" ; exit 1; } ;; # go to processing
esac
# select (grep) the valid patterns from the entered line
# and append them into the array
# using the -P (if your grep know it) you can construct very complex regexes
files+=( $(grep -oP '17\d{4}.\w{3}' <<< "$line") )
done
echo "processing files ${files[#]}"
Using such logic you can build really powerful and user-friendly app. Also, you can use -e for the read enable the readline functions (cursor keys and like)...
But :) Consider just create a simple script, which accepts arguments. Without any dialogs and such. example:
myscript -h
same as above, or some longer help text
myscript 170402.AUG 170403.AUG 170404.AUG 170405.AUG
will do whatever it should do with the files. Main benefit, you could use globbing in the filenames, like
myscript 1704*
and so on...
And if you really want the dialog, it could show it when someone runs the script without any argument, e.g.:
myscript
will run in interactive mode...

Regex "start of string" anchor not working

The regex I'm working with at the moment is as follows:
^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$
I'm trying to match all floating point numbers, but ONLY the number. For example, the following should match:
6.0
1.22E3
-2
2.99999e-12
However, the following should NOT match:
somestring///////6.0
I've tested the above regex on this validation site and it works as expected. When running it in my bash script, however, nothing at all matches.
This is my bash code:
if [[ "$VAL" =~ ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$ ]]
then
echo $VAL, "is a number"
else
echo $VAL, "is not a number"
fi
I've tried removing the anchors, and it matches any strings that contain floating points. However, strings like "//////6.00007" match. The $ anchor works as expected; however, the ^ does not.
Please let me know if you have any suggestions for troubleshooting this issue.
Thanks!
Edit 1
Removed bad examples
Edit 2
I ran the regex in its own foo.sh as suggested by #lurker and the code ran as expected with my test cases. So I looked at what was being compared with the regex. When I echoed what was being compared, everything looked fine, so it made zero sense as to why the regex wasn't matching.
Then I began to suspect that echo wasn't displaying what was actually in $VAL for some reason.
So I ran this: NEWVAL=(echo $VAL) as a temporary workaround until I can figure out what's going on.
Your regex does not allow decimals in the exponent. Exponents may have decimals, so your either need to change your definition of what a 'number' is or you need to change your regex.
Assuming the later, here is a correction (Bash 4.4):
echo "6.0
1.22E3.7
-2
2.99999e-0.0001
somestring///////6.0" >/tmp/f1.txt
while IFS= read -r line || [[ -n $line ]]; do
if [[ "$line" =~ ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[-+]?[0-9]*\.?[0-9]+)?$ ]]
then
echo $line, "is a number"
else
echo $line, "is not a number"
fi
done < /tmp/f1.txt
Prints:
6.0, is a number
1.22E3.7, is a number
-2, is a number
2.99999e-0.0001, is a number
somestring///////6.0, is not a number
BUT you should know, that most consider the only two legit numbers in your list to be 6.0 and -2. Easy way to test is with awk:
$ awk '$1+0==$1{print $0 " is a number"; next} {print $0 " not a number"}' /tmp/f1.txt
6.0 is a number
1.22E3.7 not a number
-2 is a number
2.99999e-0.0001 not a number
somestring///////6.0 not a number
The same C language function that awk uses to convert a string to a float is used by many other languages (Ruby, Perl, Python, C, C++, Swift, etc etc) so if you consider your format valid, you are probably going to be writing your own conversion routine as well.
For example, in most languages your can enter 10**1.5 as legit float literal. No language I know of accepts decimal numbers after the e in a string of the form 'xx.zzEyy.y'
As it turns out, the variables that I was comparing with my regex had leading newlines on them (e.g. "\n2.3333") that were being stripped away with echo. So, when I displayed the values to the screen with echo I would see the stripped version of my variable, which was not what was being compared with the regex.
Lesson learned: echo isn't always trustworthy. Per #CharlesDuffy's comment, use one of the following to see what's actually in your variables: declare -p varname or printf '%q\n' "$varname" but do not use echo $varname.

how to enforce a date format

I want to use the date command to output a day of week from user input.
I want to force the input to be of the format MM/DD/YYYY.
For example, at the command line I give
./programname MM/DD/YYYY MM/DD/YYYY
Snippets from the script itself
#!/bin/bash
DATE_FORMAT="^[0-9][0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]$" #MM/DD/YYYY
DATE1="$1"
DATE2="$2"
... followed by
if [ "$DATE1" != "$DATE_FORMAT" ] || [ "$DATE2" != "$DATE_FORMAT" ]; then
echo -e "Please follow the valid format MM/DD/YYYY.\n" 1>&2
exit 1
Now the problem is even when I enter correct date formats,
./programname 11/22/2014 11/23/2014
I still get that error message that I set up, which means that condition for if is evaluated true even when I input valid format... any suggestions why this is happening?
This script seems to work:
#!/bin/bash
DATE_FORMAT="^[01][0-9][/][0-3][0-9][/][0-9][0-9][0-9][0-9]$" #MM/DD/YYYY
DATE1="$1"
DATE2="$2"
if [[ "$DATE1" =~ $DATE_FORMAT ]] && [[ "$DATE2" =~ $DATE_FORMAT ]]
then echo "Both dates ($DATE1 and $DATE2) are OK"
else echo "Please follow the valid format MM/DD/YYYY ($DATE1 or $DATE2 is wrong)."
fi
It uses the =~ operator for a positive regex match inside Bash's [[ test command. The documents don't mention a !~ for negative matching (though that's what Awk and Perl use). With the single-bracket [ test command, there is no regex matching. Note that the regex expression must not be enclosed in double quotes:
Any part of the pattern may be quoted to force the quoted portion to be matched as a string. Bracket expressions in regular expressions must be treated carefully, since normal quoting characters lose their meanings between brackets. If the pattern is stored in a shell variable, quoting the variable expansion forces the entire pattern to be matched as a string.
The test is also more stringent, rejecting 23/45/2091, amongst other invalid date strings.
$ bash dt19.sh 11/22/2014 11/23/2014
Both dates (11/22/2014 and 11/23/2014) are OK
$ bash dt19.sh 31/22/2014 11/43/2014
Please follow the valid format MM/DD/YYYY (31/22/2014 or 11/43/2014 is wrong).
$
Corrected code:
#!/bin/bash
DATE1="$1"
DATE2="$2"
if echo "$DATE1" | grep -q -E '[0-9][0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]'
then
echo "Do whatever you want here"
exit 1
else
echo "Invalid date"
fi

Validating that a string contains only ASCII characters and digits

I'm doing this to validate a username:
if [[ "$username" =~ ^[a-z][_a-z0-9]{2,17}$ ]]; then
But actually, a username containing unicode characters like é, ç, à etc... is accepted.
What regex class should I use to limit strings to only ascii letters (a, b, c, d ... z) ?
The bullet-proof way is to simply spell out [a-z] as [abcdefghijklmnopqrstuvwxyz]. There! No messing with locales or funny character classes and supported on any shell since Jan 1 1970 00:00:00. Future-proof, no matter what your OS vendor, shell vendor, Unix standardization process or BOFH thinks is cool.
With an extra variable lc like
lc=abcdefghijklmnopqrstuvwxyz
the regex even becomes readable:
[$lc][_0-9$lc]{2,17}
This is what highly robust and portable configure scripts do.
You should be able to do this by first setting LC_ALL=C (possibly temporarily so as to not affect anything else). The more modern regex engines allow for locales which can fold accented characters onto their base character (or at least sequence them so they come between a and z).
Since the C locale only knows ASCII, this should fix the problem.
For example, see the following script:
#!/bin/bash
username=amélie_314159
for locale in '' 'C' ; do
export LC_ALL="${locale}"
printf "LC_ALL set to %-3s: '%s' is " "'$LC_ALL'" "${username}"
if [[ "${username}" =~ ^[a-z][_a-z0-9]{2,17}$ ]] ; then
echo valid
else
echo invalid
fi
done
which outputs:
LC_ALL set to '' : 'amélie_314159' is valid
LC_ALL set to 'C': 'amélie_314159' is invalid
Use the following:
if [[ "$username" =~ ^[\x00-\x7f]{2,17}$ ]]; then
Before you check with a regex if the username has the right length etc. you should sanitize the input string. That means instead of blacklisting what is not allowed you should whitelist what is actually allowed. In this case we would use e.g. tr before the actual regex to remove all unwanted characters beforehand. When the regex check follows it becomes much simpler.
echo "abc[日本語][ひらがな][カタカナ][äüéçà]abc" | tr -dc "a-zA-Z"
This would only leave abcabc as remainder.

Getting the index of the substring on solaris

How can I find the index of a substring which matches a regular expression on solaris10?
Assuming that what you want is to find the location of the first match of a wildcard in a string using bash, the following bash function returns just that, or empty if the wildcard doesn't match:
function match_index()
{
local pattern=$1
local string=$2
local result=${string/${pattern}*/}
[ ${#result} = ${#string} ] || echo ${#result}
}
For example:
$ echo $(match_index "a[0-9][0-9]" "This is a a123 test")
10
If you want to allow full-blown regular expressions instead of just wildcards, replace the "local result=" line with
local result=$(echo "$string" | sed 's/'"$pattern"'.*$//')
but then you're exposed to the usual shell quoting issues.
The goto options for me are bash, awk and perl. I'm not sure what you're trying to do, but any of the three would likely work well. For example:
f=somestring
string=$(expr match "$f" '.*\(expression\).*')
echo $string
You tagged the question as bash, so I'm going to assume you're asking how to do this in a bash script. Unfortunately, the built-in regular expression matching doesn't save string indices. However, if you're asking this in order to extract the match substring, you're in luck:
if [[ "$var" =~ "$regex" ]]; then
n=${#BASH_REMATCH[*]}
while [[ $i -lt $n ]]
do
echo "capture[$i]: ${BASH_REMATCH[$i]}"
let i++
done
fi
This snippet will output in turn all of the submatches. The first one (index 0) will be the entire match.
You might like your awk options better, though. There's a function match which gives you the index you want. Documentation can be found here. It'll also store the length of the match in RLENGTH, if you need that. To implement this in a bash script, you could do something like:
match_index=$(echo "$var_to_search" | \
awk '{
where = match($0, '"$regex_to_find"')
if (where)
print where
else
print -1
}')
There are a lot of ways to deal with passing the variables in to awk. This combination of piping output and directly embedding one into the awk one-liner is fairly common. You can also give awk variable values with the -v option (see man awk).
Obviously you can modify this to get the length, the match string, whatever it is you need. You can capture multiple things into an array variable if necessary:
match_data=($( ... awk '{ ... print where,RLENGTH,match_string ... }'))
If you use bash 4.x you can source the oobash. A string lib written in bash with oo-style:
http://sourceforge.net/projects/oobash/
String is the constructor function:
String a abcda
a.indexOf a
0
a.lastIndexOf a
4
a.indexOf da
3
There are many "methods" more to work with strings in your scripts:
-base64Decode -base64Encode -capitalize -center
-charAt -concat -contains -count
-endsWith -equals -equalsIgnoreCase -reverse
-hashCode -indexOf -isAlnum -isAlpha
-isAscii -isDigit -isEmpty -isHexDigit
-isLowerCase -isSpace -isPrintable -isUpperCase
-isVisible -lastIndexOf -length -matches
-replaceAll -replaceFirst -startsWith -substring
-swapCase -toLowerCase -toString -toUpperCase
-trim -zfill