Regex "start of string" anchor not working - regex

The regex I'm working with at the moment is as follows:
^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$
I'm trying to match all floating point numbers, but ONLY the number. For example, the following should match:
6.0
1.22E3
-2
2.99999e-12
However, the following should NOT match:
somestring///////6.0
I've tested the above regex on this validation site and it works as expected. When running it in my bash script, however, nothing at all matches.
This is my bash code:
if [[ "$VAL" =~ ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$ ]]
then
echo $VAL, "is a number"
else
echo $VAL, "is not a number"
fi
I've tried removing the anchors, and it matches any strings that contain floating points. However, strings like "//////6.00007" match. The $ anchor works as expected; however, the ^ does not.
Please let me know if you have any suggestions for troubleshooting this issue.
Thanks!
Edit 1
Removed bad examples
Edit 2
I ran the regex in its own foo.sh as suggested by @lurker and the code ran as expected with my test cases. So I looked at what was being compared with the regex. When I echoed what was being compared, everything looked fine, so it made zero sense as to why the regex wasn't matching.
Then I began to suspect that echo wasn't displaying what was actually in $VAL for some reason.
So I ran this: NEWVAL=$(echo $VAL) as a temporary workaround until I can figure out what's going on.

Your regex does not allow decimals in the exponent. Exponents may have decimals, so you either need to change your definition of what a 'number' is or you need to change your regex.
Assuming the latter, here is a correction (Bash 4.4):
echo "6.0
1.22E3.7
-2
2.99999e-0.0001
somestring///////6.0" >/tmp/f1.txt
while IFS= read -r line || [[ -n $line ]]; do
if [[ "$line" =~ ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[-+]?[0-9]*\.?[0-9]+)?$ ]]
then
echo $line, "is a number"
else
echo $line, "is not a number"
fi
done < /tmp/f1.txt
Prints:
6.0, is a number
1.22E3.7, is a number
-2, is a number
2.99999e-0.0001, is a number
somestring///////6.0, is not a number
Be aware, though, that most tools consider the only two legitimate numbers in your list to be 6.0 and -2. An easy way to test this is with awk:
$ awk '$1+0==$1{print $0 " is a number"; next} {print $0 " not a number"}' /tmp/f1.txt
6.0 is a number
1.22E3.7 not a number
-2 is a number
2.99999e-0.0001 not a number
somestring///////6.0 not a number
The same C language function that awk uses to convert a string to a float is used by many other languages (Ruby, Perl, Python, C, C++, Swift, etc.), so if you consider your format valid, you are probably going to be writing your own conversion routine as well.
For example, in most languages you can write 10**1.5 as a legitimate floating-point expression, but no language I know of accepts decimal numbers after the e in a string of the form 'xx.zzEyy.y'.

As it turns out, the variables that I was comparing with my regex had leading newlines on them (e.g. "\n2.3333") that were being stripped away when echoed unquoted (echo $VAL word-splits the value, which discards leading whitespace). So, when I displayed the values to the screen with echo I would see the stripped version of my variable, which was not what was being compared with the regex.
Lesson learned: echo isn't always trustworthy. Per @CharlesDuffy's comment, use one of the following to see what's actually in your variables: declare -p varname or printf '%q\n' "$varname", but do not use echo $varname.
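As a minimal sketch of the lesson above (the trim and is_float helper names are hypothetical, not from the original posts), the following shows how printf '%q' exposes the hidden newline, and how stripping surrounding whitespace before matching lets the anchored regex succeed:

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip leading and trailing whitespace (including newlines).
trim() {
    local s=$1
    s="${s#"${s%%[![:space:]]*}"}"   # drop leading whitespace
    s="${s%"${s##*[![:space:]]}"}"   # drop trailing whitespace
    printf '%s' "$s"
}

# Hypothetical helper: the regex from the question, kept in a variable
# so that the shell does not reinterpret any of its characters.
is_float() {
    local re='^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$'
    [[ $1 =~ $re ]]
}

VAL=$'\n2.3333'                      # value with a hidden leading newline
printf '%q\n' "$VAL"                 # prints an escaped form that reveals the newline
is_float "$VAL" || echo "raw value does not match"
is_float "$(trim "$VAL")" && echo "trimmed value matches"
```

Whether trimming is appropriate depends on where the values come from; it may be better to fix whatever produces the stray newline.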

Related

Bash regex for certain number length

So I'm trying to identify between 2 sets of numbers being entered as parameters into my script. One being a phone number and one being the limit for printing mongodb logs. The most simple way I can think of doing this is that if the number is more than 7 digits, it's most likely going to be a phone number as printing mongo results that lengthy is very unlikely.
I'm trying to put this into a case statement
case "$1" in
regex here ) echo "not msisdn";;
regex here ) echo "msisdn";;
*) echo "not found";;
esac
I was thinking converting this to a string and checking the length would probably be easiest, as I can't seem to find any decent information on how to find the number of digits entered.
Edit: I should note that I was thinking of putting all this in an if statement, but these are not the only cases I will be checking, and I think the if statement will get very lengthy. The script I'm creating is already mostly ifs, so trying to change it up a bit. Let me know if I'm wrong though.
case doesn't support regular expressions; the wildcard syntax is a less capable formalism known as glob patterns 1
case $1 in
*[!0-9]*)
echo "Not a number";;
? | ?? | ??? | ???? | ????? | ??????)
echo "Six digits or less";;
'')
echo "Empty string";;
*)
echo "Seven digits or more";;
esac
It's not clear if non-numeric or empty inputs should be treated separately, but this should hopefully at least get you started.
1 Some will argue that globs are a severely restricted form of regexes. I certainly disagree, because this is more confusing than helpful IMHO - several metacharacters have entirely different semantics, and what's left as similar is by and large the static parts and not the actually useful patterns.
${#1} contains the length of $1. So maybe just
if (( ${#1} > 7 )) ; then
echo Phone number
else
echo Too short for a phone number
fi
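If a regex is still preferred over globs or a length check, bash's [[ ... =~ ... ]] can test the digit count directly. This is only a sketch: the classify_arg function name is hypothetical, and the "more than 7 digits" threshold is taken from the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify an argument by its number of digits.
classify_arg() {
    local long_re='^[0-9]{8,}$'     # more than 7 digits: treat as a phone number
    local short_re='^[0-9]{1,7}$'   # 1 to 7 digits: treat as a log limit
    if [[ $1 =~ $long_re ]]; then
        echo "msisdn"
    elif [[ $1 =~ $short_re ]]; then
        echo "not msisdn"
    else
        echo "not found"            # empty or contains non-digits
    fi
}

classify_arg 27831234567   # msisdn
classify_arg 500           # not msisdn
classify_arg abc           # not found
```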

Extract a numeric substring and another line as variables from multi-line text file

I'm writing a bash shell script that I hope to ultimately use to automate the naming and 'attachment' of scanned documents to our db. The script OCR's a section of the first page of the pdf and outputs a text file containing three lines; a name, unique id, and a datetime string:
Smith, John
Case #: 234567 ( )
09/04/2013 11:34 AM
What I'd like to do is end up with two separate strings as variables, "Smith, John" and "234567". I'm looking for help using regex with sed/awk/etc. to extract this number. One issue is that the OCR will occasionally output strings like:
"Case #2 234567 ( )"
or
"Ca$e # 2234567 ( 7"
So I'm thinking of taking only the last 6 digits in the string, since only maybe 1 in 10,000+ of these ever gets the last 6 digits read incorrectly. This unique ID is always 6 digits, and is always between 200000 and 999999. I'm learning regex, but it's slow going. Any help is greatly appreciated.
Edit:
For now I am using:
casename="$(cat test.txt | sed '1!d')"
casenum="$(cat test.txt | sed -n -r 's/.*([0-9]{6}).*/\1/p')"
echo ${casenum} ${casename}
234567 Smith, John
Any input for why this might not be a good way to do it, or what could be improved is (very) welcomed.
You might try this regex (BRE):
[2-9][0-9]\{5\}\>
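A range-restricted pattern like this can also be applied in bash itself, without sed. The sketch below uses the ERE form of the pattern with bash's =~ operator; the last_case_number function name is hypothetical. Because the leading .* is greedy, the capture group ends up holding the last 6-digit run that starts with 2-9, which is what the question asks for.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the last 6-digit number in the range 200000-999999.
last_case_number() {
    local re='.*([2-9][0-9]{5})'    # greedy .* pushes the capture as far right as possible
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        return 1                    # no suitable number found
    fi
}

last_case_number 'Case #: 234567 ( )'
last_case_number 'Ca$e # 2234567 ( 7'
```

Both calls should print 234567 for the sample OCR lines given in the question.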
You could use the following regex for the second line:
^.*(\d{6})[^\d].*$
Here, the first capture group holds the digits of interest.
For example, using Notepad++:
[Screenshots: original text, replace options, and resulting text]
The regex should remain more or less the same across environments. You might simply need to change the way the captured subexpression ($1 here) is referenced.
You could probably use something like this untested but somewhat-syntactically-valid snippet:
shopt -s extglob
declare -a cases
for casefile in casefiles/*
do
name=""
while read l
do
if [[ -z "$name" ]]
then
[[ "$l" == @(*, *) ]] && name=$l
elif [[ "$l" == *[0-9]* ]]
then
after=${l#*[2-9][0-9][0-9][0-9][0-9][0-9]}
l=${l%$after}
l=${l#${l%[2-9][0-9][0-9][0-9][0-9][0-9]}}
if [[ "$l" == @([2-9][0-9][0-9][0-9][0-9][0-9]) ]]
then
cases[$l]=$name
fi
name=""
fi
done < $casefile
done
The "hard part" prunes the first 6-digit number in your range and everything after it, then removes what's left (the stuff before the number) from the line. Then it removes the number from the beginning of the string, and removes what's left (the part after the number) from the end. If what's left is a 6-digit number in your range, it uses that as the index and the case name as the value in an array which you can later iterate over.
The rest should be pretty straightforward. :) If this doesn't quite work as expected, I blame the fact that I mostly use ksh, not bash. ;)
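To make the "hard part" easier to follow, here is the same chain of parameter expansions applied step by step to a single line. This is a sketch only; the first_case_number function name and its intermediate variable names are hypothetical. Note that this technique picks the first 6-digit number in range, not the last.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the first 6-digit number (200000-999999) from a line
# using only parameter expansion.
first_case_number() {
    local line=$1 after head before
    # Remove the shortest prefix that ends in a matching number; what remains is the tail.
    after=${line#*[2-9][0-9][0-9][0-9][0-9][0-9]}
    [[ $after == "$line" ]] && return 1      # nothing removed: no number present
    head=${line%"$after"}                    # line up to and including the number
    before=${head%[2-9][0-9][0-9][0-9][0-9][0-9]}   # text in front of the number
    printf '%s\n' "${head#"$before"}"        # what is left is the number itself
}

first_case_number 'Case #: 234567 ( )'   # prints 234567
```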

Checking a string to see if it contains numeric character in UNIX

I'm new to UNIX, having only started it at work today, but experienced with Java, and have the following code:
#!/bin/bash
echo "Please enter a word:"
read word
grep -i $word $1 | cut -d',' -f1,2 | tr "," "-"> output
This works fine, but I now need to check, when word is read, that it contains nothing but letters. If it contains numeric characters, the script should print an "Invalid input!" message and ask the user to enter it again. I assumed regular expressions with an if statement would be the easy way to do this, but I cannot get my head around how to use them in UNIX, as I am used to how Java applies them. Any help would be greatly appreciated; the solutions I found when searching for regular expressions in Linux only dealt with whether a string was entirely numeric or not.
Yet another approach. Grep exits with 0 if a match is found, so you can test the exit code:
echo "${word}" | grep -q '[0-9]'
if [ $? = 0 ]; then
echo 'Invalid input'
fi
This is /bin/sh compatible.
Incorporating Daenyth and John's suggestions, this becomes
if echo "${word}" | grep '[0-9]' >/dev/null; then
echo 'Invalid input'
fi
The double bracket operator is an extended version of the test command which supports regexes via the =~ operator:
#!/bin/bash
while true; do
read -p "Please enter a word: " word
if [[ $word =~ [0-9] ]]; then
echo 'Invalid input!' >&2
else
break
fi
done
This is a bash-specific feature. Bash is a newer shell that is not available on all flavors of UNIX--though by "newer" I mean "only recently developed in the post-vacuum tube era" and by "not all flavors of UNIX" I mean relics like old versions of Solaris and HP-UX.
In my opinion this is the simplest option and bash is plenty portable these days, but if being portable to old UNIXes is in fact important then you'll need to use the other posters' sh-compatible answers. sh is the most common and most widely supported shell, but the price you pay for portability is losing things like =~.
If you're trying to write portable shell code, your options for string manipulation are limited. You can use shell globbing patterns (which are a lot less expressive than regexps) in the case construct:
export LC_COLLATE=C
read word
while
case "$word" in
*[!A-Za-z]*) echo >&2 "Invalid input, please enter letters only"; true;;
*) false;;
esac
do
read word
done
EDIT: setting LC_COLLATE is necessary because in most non-C locales, character ranges like A-Z don't have the “obvious” meaning. I assume you want only ASCII letters; if you also want letters with diacritics, don't change LC_COLLATE, and replace A-Za-z by [:alpha:] (so the whole pattern becomes *[![:alpha:]]*).
For full regexps, see the expr command. EDIT: Note that expr, like several other basic shell tools, has pitfalls with some special strings; the z characters below prevent $word from being interpreted as reserved words by expr.
export LC_COLLATE=C
read word
until expr "z$word" : 'z[A-Za-z]*$' >/dev/null; do
echo >&2 "Invalid input, please enter letters only"
read word
done
If you only target recent enough versions of bash, there are other options, such as the =~ operator of [[ ... ]] conditional commands.
Note that your last line has a bug, the first command should be
grep -i "$word" "$1"
The quotes are needed because, somewhat counter-intuitively, "$foo" means “the value of the variable called foo” whereas plain $foo means “take the value of foo, split it into separate words where it contains whitespace, and treat each word as a globbing pattern and try to expand it”. (In fact, if you've already checked that $word contains only letters, leaving out the quotes won't do any harm, but it takes more time to think of these special cases than to just put the quotes in every time.)
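A small illustration of the word splitting described above (the count_args helper is hypothetical and exists only to show how many arguments the shell passes):

```shell
# Hypothetical helper: report how many arguments were received.
count_args() {
    echo "$#"
}

word="two words"
count_args $word      # unquoted: split on whitespace, prints 2
count_args "$word"    # quoted: passed as a single argument, prints 1
```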
Yet another (quite) portable way to do it ...
if test "$word" != "`printf "%s" "$word" | tr -dc '[:alpha:]'`"; then
echo invalid
fi
One portable (assuming bash >= 3) way to do this is to remove all digits and test whether anything is left:
#!/bin/bash
read -p "Enter a number" var
if [[ -n ${var//[0-9]} ]]; then
echo "Contains non-numbers!"
else
echo "ok!"
fi
Coming from Java, it's important to note that bash has no real concept of objects or data types. Everything is a string, and complex data structures are painful at best.
For more info on what I did, and other related functions, google for bash string manipulation.
Playing around with Bash parameter expansion and character classes:
# cf. http://wiki.bash-hackers.org/syntax/pe
word="abc1def"
word="abc,def"
word=$'abc\177def'
# cf. http://mywiki.wooledge.org/BashFAQ/058 (no NUL byte in Bash variable)
word=$'abc\000def'
word="abcdef"
(
set -xv
[[ "${word}" != "${word/[[:digit:]]/}" ]] && echo invalid || echo valid
[[ -n "${word//[[:alpha:]]/}" ]] && echo invalid || echo valid
)
Everyone's answers seem to be based on the assumption that the only invalid characters are numbers. The initial question states that they need to check that the string contains "nothing but letters".
I think the best way to do it is
nonalpha=$(echo "$word" | sed 's/[[:alpha:]]//g')
if [[ ${#nonalpha} -gt 0 ]]; then
echo "Invalid character(s): $nonalpha"
fi
If you found this page looking for a way to detect non-numeric characters in your string (like I did!) replace [[:alpha:]] with [[:digit:]].
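The same "nothing but letters" check can also be written as a single anchored match with bash's =~ operator. This is a sketch, not any poster's method; the is_letters_only function name is hypothetical, and [[:alpha:]] follows the current locale.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the argument is one or more letters.
is_letters_only() {
    local re='^[[:alpha:]]+$'
    [[ $1 =~ $re ]]
}

for w in hello abc1 'two words'; do
    if is_letters_only "$w"; then
        echo "$w: ok"
    else
        echo "$w: Invalid input!"
    fi
done
```

An empty string is rejected here because the pattern requires at least one letter; change + to * if empty input should pass.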

Bash and regex problem : check for tokens entered into a Coke vending machine

Here is a "challenge question" I've got from Linux system programming lecture.
Any of the following strings will give you a Coke if you kick:
L = { aaaa, aab, aba, baa, bb, aaaa"a", aaaa"b", aab"a", … ab"b"a, ba"b"a, ab"bbbbbb"a, ... }
The letters shown in wrapped double quotes indicate coins that would have fallen through (but those strings are still part of the language in this example).
Exercise (a bit hard) show this is the language of a regular expression
And this is what I've got so far :
#!/usr/bin/bash
echo "A bottle of Coke costs you 40 cents"
echo -e "Please enter tokens (a = 10 cents, b = 20 cents) in a
sequence like 'abba' :\c"
read tokens
#if [ $tokens = aaaa ]||[ $tokens = aab ]||[ $tokens = bb ]
#then
# echo "Good! now a coke is yours!"
#else echo "Thanks for your money, byebye!"
if [[ $tokens =~ 'aaaa|aab|bb' ]]
then
echo "Good! now a coke is yours!"
else echo "Thanks for your money, byebye!"
fi
Sadly it doesn't work... it always outputs "Thanks for your money, byebye!" I believe something is wrong with the syntax... We weren't provided with any good reference book and the only instruction from the professor was to consult "anything you find useful online" and "research the problem yourself" :(
I know how I could do it in any programming language such as Java, but getting it done with a bash script + regex seems not "a bit hard" but in fact "too hard" for anyone with little knowledge of something as advanced as "lookahead" (is this the terminology?)
I don't know if there is a way to express the following concept in the language of regex:
Valid entry would consist of exactly one of the three components : aaaa, aab and bb, regardless of the order in a component, followed by an arbitrary sequence of a or b's
So this is what it should be like :
(a{4}Ua{2}bUb{2})(aUb)*
where the content in first braces is order irrelevant.
Thanks a lot in advance for any hints and/or tips :)
Oh, and forget about returning any money back to the user, they are all mine :)
Edit: now the code works, thanks to Stephen, who pointed out my careless typo.
You don't need a regular expression.
case $tokens in
aaaa|aab|bb) echo "coke!";;
*) echo "no coke";;
esac
Note that the above checks for exactly "aaaa", exactly "aab", or exactly "bb". If you don't care how many characters come after that, use a wildcard:
case $tokens in
aaaa*|aab*|bb*) echo "coke!";;
*) echo "no coke";;
esac
You have:
if [[ $token =~ 'aaaa|aab|bb' ]]
$token should be $tokens
and remove the quotes from the regex.
if [[ $tokens =~ aaaa|aab|bb ]]
But, now you'll have to work on removing the quotes.
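For the exercise itself, one possible anchored regex is sketched below. It rests on my reading of the problem statement: a = 10 cents, b = 20 cents, a coin that would push the total past 40 falls through, and every coin after 40 is reached also falls through. The gives_coke function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the token sequence pays for a Coke.
# Ways to reach exactly 40 cents:
#   (aa|b)b             reach 20 with "aa" or "b", then a "b"
#   (aaa|ab|ba)b*a      reach 30, any number of "b" fall through, then an "a"
# After that, any further coins ([ab]*) fall through.
gives_coke() {
    local re='^((aa|b)b|(aaa|ab|ba)b*a)[ab]*$'
    [[ $1 =~ $re ]]
}

for t in aaaa aab aba baa bb abba ab abb; do
    if gives_coke "$t"; then
        echo "$t: coke"
    else
        echo "$t: no coke"
    fi
done
```

Under these assumptions the first six sequences are accepted, while ab and abb are rejected because the total never gets past 30 cents.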

Getting the index of the substring on solaris

How can I find the index of a substring which matches a regular expression on solaris10?
Assuming that what you want is to find the location of the first match of a wildcard in a string using bash, the following bash function returns just that, or empty if the wildcard doesn't match:
function match_index()
{
local pattern=$1
local string=$2
local result=${string/${pattern}*/}
[ ${#result} = ${#string} ] || echo ${#result}
}
For example:
$ echo $(match_index "a[0-9][0-9]" "This is a a123 test")
10
If you want to allow full-blown regular expressions instead of just wildcards, replace the "local result=" line with
local result=$(echo "$string" | sed 's/'"$pattern"'.*$//')
but then you're exposed to the usual shell quoting issues.
The go-to options for me are bash, awk and perl. I'm not sure what you're trying to do, but any of the three would likely work well. For example:
f=somestring
string=$(expr match "$f" '.*\(expression\).*')
echo $string
You tagged the question as bash, so I'm going to assume you're asking how to do this in a bash script. Unfortunately, the built-in regular expression matching doesn't save string indices. However, if you're asking this in order to extract the match substring, you're in luck:
# The regex must be unquoted; a quoted right-hand side is matched as a literal string.
if [[ "$var" =~ $regex ]]; then
i=0
n=${#BASH_REMATCH[*]}
while [[ $i -lt $n ]]
do
echo "capture[$i]: ${BASH_REMATCH[$i]}"
let i++
done
fi
This snippet will output in turn all of the submatches. The first one (index 0) will be the entire match.
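Although bash does not record match positions, an index can usually be recovered from BASH_REMATCH[0] by measuring the text in front of its first occurrence. This is a sketch with a hypothetical regex_index function; it assumes the regex contains no anchors or other context-dependent constructs, since otherwise the matched text might also occur earlier in the string.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the 0-based index of the first regex match.
regex_index() {
    local string=$1 regex=$2 prefix
    [[ $string =~ $regex ]] || return 1
    # Remove the matched text and everything after it; the remainder is the prefix.
    prefix=${string%%"${BASH_REMATCH[0]}"*}
    echo "${#prefix}"
}

regex_index "This is a a123 test" 'a[0-9][0-9]'   # prints 10
```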
You might like your awk options better, though. There's a function match which gives you the index you want. Documentation can be found here. It'll also store the length of the match in RLENGTH, if you need that. To implement this in a bash script, you could do something like:
match_index=$(echo "$var_to_search" | \
awk '{
where = match($0, '"$regex_to_find"')
if (where)
print where
else
print -1
}')
There are a lot of ways to deal with passing the variables in to awk. This combination of piping output and directly embedding one into the awk one-liner is fairly common. You can also give awk variable values with the -v option (see man awk).
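As a sketch of the -v approach mentioned above (the awk_match_index function name is hypothetical), the regex is handed to awk as a variable rather than spliced into the program text, which avoids most quoting trouble. Note that awk interprets backslash escapes in -v values, and that its positions are 1-based.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the 1-based position of the first match, or -1.
# Intended for single-line input; awk would print one result per input line.
awk_match_index() {
    local string=$1 regex=$2
    printf '%s\n' "$string" |
        awk -v re="$regex" '{ print (match($0, re) ? RSTART : -1) }'
}

awk_match_index "This is a a123 test" 'a[0-9][0-9]'   # prints 11
awk_match_index "no digits here" '[0-9]'              # prints -1
```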
Obviously you can modify this to get the length, the match string, whatever it is you need. You can capture multiple things into an array variable if necessary:
match_data=($( ... awk '{ ... print where,RLENGTH,match_string ... }'))
If you use bash 4.x you can source oobash, a string library written in bash in an OO style:
http://sourceforge.net/projects/oobash/
String is the constructor function:
String a abcda
a.indexOf a
0
a.lastIndexOf a
4
a.indexOf da
3
There are many more "methods" for working with strings in your scripts:
-base64Decode -base64Encode -capitalize -center
-charAt -concat -contains -count
-endsWith -equals -equalsIgnoreCase -reverse
-hashCode -indexOf -isAlnum -isAlpha
-isAscii -isDigit -isEmpty -isHexDigit
-isLowerCase -isSpace -isPrintable -isUpperCase
-isVisible -lastIndexOf -length -matches
-replaceAll -replaceFirst -startsWith -substring
-swapCase -toLowerCase -toString -toUpperCase
-trim -zfill