Bash regex for certain number length - regex

So I'm trying to distinguish between two kinds of numbers being passed as parameters to my script: one is a phone number and the other is the limit for printing mongodb logs. The simplest way I can think of doing this is that if the number is more than 7 digits, it's most likely going to be a phone number, as printing that many mongo results is very unlikely.
I'm trying to put this into a case statement
case "$1" in
regex here ) echo "not msisdn";;
regex here ) echo "msisdn";;
*) echo "not found";;
esac
I was thinking converting this to a string and checking the length would probably be easiest, as I can't seem to find any decent information on how to find the number of digits entered.
Edit: I should note that I was thinking of putting all this in an if statement, but these are not the only cases I will be checking, and I think the if statement will get very lengthy. The script I'm creating is already mostly ifs, so trying to change it up a bit. Let me know if I'm wrong though.

case doesn't support regular expressions; the wildcard syntax is a less capable formalism known as glob patterns 1
case $1 in
*[!0-9]*)
echo "Not a number";;
? | ?? | ??? | ???? | ????? | ??????)
echo "Six digits or less";;
'')
echo "Empty string";;
*)
echo "Seven digits or more";;
esac
It's not clear if non-numeric or empty inputs should be treated separately, but this should hopefully at least get you started.
1 Some will argue that globs are a severely restricted form of regex. I certainly disagree, because this is more confusing than helpful IMHO - several metacharacters have entirely different semantics, and what's left as similar is by and large the static parts and not the actually useful patterns.
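To make the footnote concrete, here is a small sketch of how the same metacharacters behave differently in a glob and in a regex. The helper names glob_match and regex_match are made up for this illustration; the regex side assumes grep -E is available.

```shell
# Hypothetical helper: succeed if $1 matches the glob pattern in $2.
glob_match() {
    case $1 in
        $2) return 0 ;;
        *)  return 1 ;;
    esac
}

# Hypothetical helper: succeed if $1 matches the extended regex in $2.
regex_match() {
    printf '%s\n' "$1" | grep -Eq -- "$2"
}

# In a glob, * means "any string"; in a regex, it repeats the previous atom.
glob_match  abc 'a*'   && echo 'glob a* matches abc'
regex_match abc '^a*$' || echo 'regex ^a*$ does not match abc'
regex_match aaa '^a*$' && echo 'regex ^a*$ matches aaa'

# In a glob, ? means "exactly one character"; in a regex, it makes the
# previous atom optional.
glob_match  ab 'a?'    && echo 'glob a? matches ab'
regex_match a  '^ab?$' && echo 'regex ^ab?$ matches a'
```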

${#1} contains the length of $1. So maybe just
if (( ${#1} > 7 )) ; then
echo Phone number
else
echo Too short for a phone number
fi
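The two answers can be combined: a glob in case rejects anything that is not purely digits, and ${#1} decides by length. This is only a sketch; the function name classify_arg is made up, the labels are borrowed from the question, and the 7-digit threshold is the asker's own rule of thumb.

```shell
# Hypothetical helper: classify one argument using a glob check plus length.
classify_arg() {
    case $1 in
        '' | *[!0-9]*)
            # Empty, or contains at least one non-digit character.
            echo "not found" ;;
        *)
            # Digits only: more than 7 digits is treated as a phone number.
            if [ "${#1}" -gt 7 ]; then
                echo "msisdn"
            else
                echo "not msisdn"
            fi ;;
    esac
}

classify_arg 447700900123
classify_arg 50
classify_arg 12ab
```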

Related

Regex "start of string" anchor not working

The regex I'm working with at the moment is as follows:
^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$
I'm trying to match all floating point numbers, but ONLY the number. For example, the following should match:
6.0
1.22E3
-2
2.99999e-12
However, the following should NOT match:
somestring///////6.0
I've tested the above regex on this validation site and it works as expected. When running it in my bash script, however, nothing at all matches.
This is my bash code:
if [[ "$VAL" =~ ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$ ]]
then
echo $VAL, "is a number"
else
echo $VAL, "is not a number"
fi
I've tried removing the anchors, and it matches any strings that contain floating points. However, strings like "//////6.00007" match. The $ anchor works as expected; however, the ^ does not.
Please let me know if you have any suggestions for troubleshooting this issue.
Thanks!
Edit 1
Removed bad examples
Edit 2
I ran the regex in its own foo.sh as suggested by @lurker and the code ran as expected with my test cases. So I looked at what was being compared with the regex. When I echoed what was being compared, everything looked fine, so it made zero sense as to why the regex wasn't matching.
Then I began to suspect that echo wasn't displaying what was actually in $VAL for some reason.
So I ran this: NEWVAL=$(echo $VAL) as a temporary workaround until I can figure out what's going on.
Your regex does not allow decimals in the exponent. Exponents may have decimals, so you either need to change your definition of what a 'number' is or you need to change your regex.
Assuming the latter, here is a correction (Bash 4.4):
echo "6.0
1.22E3.7
-2
2.99999e-0.0001
somestring///////6.0" >/tmp/f1.txt
while IFS= read -r line || [[ -n $line ]]; do
if [[ "$line" =~ ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[-+]?[0-9]*\.?[0-9]+)?$ ]]
then
echo $line, "is a number"
else
echo $line, "is not a number"
fi
done < /tmp/f1.txt
Prints:
6.0, is a number
1.22E3.7, is a number
-2, is a number
2.99999e-0.0001, is a number
somestring///////6.0, is not a number
BUT you should know, that most consider the only two legit numbers in your list to be 6.0 and -2. Easy way to test is with awk:
$ awk '$1+0==$1{print $0 " is a number"; next} {print $0 " not a number"}' /tmp/f1.txt
6.0 is a number
1.22E3.7 not a number
-2 is a number
2.99999e-0.0001 not a number
somestring///////6.0 not a number
The same C language function that awk uses to convert a string to a float is used by many other languages (Ruby, Perl, Python, C, C++, Swift, etc etc) so if you consider your format valid, you are probably going to be writing your own conversion routine as well.
For example, in most languages you can enter 10**1.5 as a legitimate float expression. No language I know of accepts decimal numbers after the e in a string of the form 'xx.zzEyy.y'.
As it turns out, the variables that I was comparing with my regex had leading newlines on them (e.g. "\n2.3333") that were being stripped away with echo. So, when I displayed the values to the screen with echo I would see the stripped version of my variable, which was not what was being compared with the regex.
Lesson learned: echo isn't always trustworthy. Per @CharlesDuffy's comment, use one of the following to see what's actually in your variables: declare -p varname or printf '%q\n' "$varname" but do not use echo $varname.
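A minimal sketch of the problem, assuming bash: the value below carries a leading newline, unquoted echo hides it, and the anchored regex from the question only matches once the whitespace is stripped. The helpers is_number and trim are names invented for this example.

```shell
#!/usr/bin/env bash
# Regex from the question, kept in a variable so [[ =~ ]] uses it as a pattern.
re='^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$'

# Hypothetical helper: succeed if $1 is a float in the question's format.
is_number() { [[ $1 =~ $re ]]; }

# Hypothetical helper: strip leading and trailing whitespace, newlines included.
trim() {
    local s=$1
    s=${s#"${s%%[![:space:]]*}"}   # remove leading whitespace
    s=${s%"${s##*[![:space:]]}"}   # remove trailing whitespace
    printf '%s' "$s"
}

VAL=$'\n2.3333'          # value with a hidden leading newline
echo $VAL                # unquoted: word splitting hides the newline
printf '%q\n' "$VAL"     # shell-quoted form makes the newline visible

is_number "$VAL" || echo "raw value does not match"
is_number "$(trim "$VAL")" && echo "trimmed value matches"
```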

How do I make Grep less Greedy with a shell variable?

I have been polishing up my grep skills with a particular problem I have found. Basically it goes like this. I have a local file with words from a dictionary. The user will pass in a word and the script will find all words that can be made with letters from that word. The catch is, the words must be at least 4 characters long and you can only use as many letters as the user passes in. So if I passed in a word like "College" clee and cell would be acceptable words but not words like cocco because yes it contains letters from the word but college only has 1 o and 1 c. Here is my regular expression thus far.
egrep -i "^[("$text")]{4,}$" /usr/dict/words
This will find strings that contain these letters that are at least four characters long however grep is being greedy and grabbing more characters than those in the variable. How would I specify to only use the amount of characters in the variable? I've been stuck on this for a few days now to no avail. Thank you for your help and time community!
To expand on what @chepner said in the comments, regular expressions won't test for the exact number of characters that is in a range. In other words, [ee] will not match 2 e's; it matches a single e, so [ee] is redundant with [e]. A bracket expression matches exactly one character unless it is quantified: [e]+ would match one or more e's, as many as the string contains. To match a specific number of e's you'd have to know that beforehand and write something like [e]{2,5}, which would match at least 2 but no more than 5 consecutive e's.
Even if you set up a pre-processor to count how often each letter is repeated in the input, you'd have a hard time making the regular expression match the way you think it matches. To go with your example of "college", the preprocessed counts would look like c=1, o=1, l=2, e=2, g=1. If you were to put that in a regular expression such as ^c?o?l{0,2}e{0,2}g?$ (note that a "?" in this context is shorthand for {0,1}), it would not even match "college", because the match is positional: it would match "colleg", "colleeg", "coleg", etc.
As for the length of the string, what you have only verifies that there are at least four letters from the range []. You may want to change it to grep -E "^.{4,}$" to check whether the entire line is at least 4 characters long.
If you aren't limited to only using grep, but are limited to bash, you may be able to use the below script to solve your problem:
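read -r input
while IFS= read -r line
do
    # Only consider words that are at least four characters long.
    if printf '%s\n' "$line" | grep -q '^.\{4,\}$'
    then
        testVal=$line
        # For each letter of the input, blank out one matching letter.
        for i in $(printf '%s\n' "$input" | sed -e 's/\(.\)/\1 /g')
        do
            testVal=$(printf '%s\n' "$testVal" | sed -e "s/$i/_/i")
        done
        # If every letter was blanked out, the word fits the input.
        if printf '%s\n' "$testVal" | grep -q '^_\+$'
        then
            printf '%s\n' "$line"
        fi
    fi
done < /usr/dict/words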
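The same letter-budget idea can be sketched without sed, assuming bash: walk the candidate word one character at a time and remove each letter from a pool built from the user's word. The function name can_make is made up for this sketch, and the four-character minimum comes from the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if $1 can be spelled using the letters of $2,
# with each letter used at most as often as it occurs in $2.
can_make() {
    local candidate pool ch i
    candidate=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    pool=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    (( ${#candidate} >= 4 )) || return 1      # words must have 4+ letters
    for (( i = 0; i < ${#candidate}; i++ )); do
        ch=${candidate:i:1}
        case $pool in
            *"$ch"*) pool=${pool/"$ch"/} ;;   # use up one occurrence
            *)       return 1 ;;              # letter no longer available
        esac
    done
    return 0
}

# Prints only the candidates that fit within the letters of "College".
for w in clee cell cocco; do
    can_make "$w" College && echo "$w"
done
```

To run it over a dictionary, the loop at the bottom could read lines from the word list instead of a fixed set of candidates.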

Extract a numeric substring and another line as variables from multi-line text file

I'm writing a bash shell script that I hope to ultimately use to automate the naming and 'attachment' of scanned documents to our db. The script OCR's a section of the first page of the pdf and outputs a text file containing three lines; a name, unique id, and a datetime string:
Smith, John
Case #: 234567 ( )
09/04/2013 11:34 AM
What I'd like to do is end up with two separate strings as variables, "Smith, John" and "234567". I'm looking for help using regex with sed/awk/etc to extract this number. One issue is that the OCR will occasionally output strings like:
"Case #2 234567 ( )"
or
"Ca$e # 2234567 ( 7"
So I'm thinking of taking only the last 6 digits in the string, since only maybe 1 in 10,000+ of these ever get the last 6 digits read incorrectly. This unique ID is only 6 digits, and is always between 200000-999999. I'm learning regex, but it's slow going. Any help is greatly appreciated.
Edit:
For now I am using:
casename="$(cat test.txt | sed '1!d')"
casenum="$(cat test.txt | sed -n -r 's/.*([0-9]{6}).*/\1/p')"
echo ${casenum} ${casename}
234567 Smith, John
Any input for why this might not be a good way to do it, or what could be improved is (very) welcomed.
you might try this Regex (BRE):
[2-9][0-9]\{5\}\>
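As a rough sketch of the "last six digits in the 200000-999999 range" idea in bash itself: a greedy .* in front of the capture group pushes the match as far right as possible. The function name extract_case_number is made up here, and the sample lines are the ones from the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the last run of six digits in $1 whose first
# digit is 2-9; return non-zero if there is none.
extract_case_number() {
    local re='.*([2-9][0-9]{5})'
    [[ $1 =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}

extract_case_number 'Case #: 234567 ( )'
extract_case_number 'Case #2 234567 ( )'
extract_case_number 'Ca$e # 2234567 ( 7'
```

Because the pattern is unanchored on the right, a stray digit glued to the front of the ID (as in the third sample) is skipped rather than included.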
You could use the following regex for the second line:
^.*(\d{6})[^\d].*$
Here, the first capture group holds the digits of interest.
For example, using Notepad++:
[Screenshots: original text, replace options, resulting text]
The regex should remain more or less the same across environments, although sed and grep -E need [0-9] in place of \d. You might also need to change the way the captured subexpression ($1 here) is referenced.
You could probably use something like this untested but somewhat-syntactically-valid snippet:
shopt -s extglob
declare -a cases
for casefile in casefiles/*
do
    name=""
    while IFS= read -r l
    do
        if [[ -z "$name" ]]
        then
            # A name line looks like "Last, First".
            [[ "$l" == @(*, *) ]] && name=$l
        elif [[ "$l" == *[0-9]* ]]
        then
            # Everything after the first 6-digit number in range.
            after=${l#*[2-9][0-9][0-9][0-9][0-9][0-9]}
            l=${l%"$after"}
            # Strip whatever precedes the number.
            l=${l#"${l%[2-9][0-9][0-9][0-9][0-9][0-9]}"}
            if [[ "$l" == @([2-9][0-9][0-9][0-9][0-9][0-9]) ]]
            then
                cases[$l]=$name
            fi
            name=""
        fi
    done < "$casefile"
done
The "hard part" first captures everything that follows the first 6-digit number in your range and removes it from the end of the line. Then it works out what precedes the number and removes that from the beginning of the string. If what's left is a 6-digit number in your range, it uses that as the index and the case name as the value in an array which you can later iterate over.
The rest should be pretty straightforward. :) If this doesn't quite work as expected, I blame the fact that I mostly use ksh, not bash. ;)

Matching A File Name Using Grep

The overarching problem:
So I have a file name that comes in the form of
JohnSmith14_120325_A10_6.raw
and I want to match it using regex. I have a couple of issues in building a working example but unfortunately my issues won't be solved unless I get the basics.
So I have just recently learned about piping and one of the cool things I learned was that I can do the following.
X=ll_paprika.sc (don't ask)
VAR=`echo $X | cut -d _ -f 2`
echo $VAR
which gives me paprika.sc
Now when I try to execute the pipe idea in grep, nothing happens.
x=ll_paprika.sc
VAR=`echo $X | grep *.sc`
echo $VAR
Can anyone explain what I am doing wrong?
Second question:
How does one match a single underscore using regex?
Here's what I am ultimately trying to do;
VAR=`echo $X | grep -e "^[a-bA-Z][a-bA-Z0-9]*(_){1}[0-9]*(_){1}[a-bA-Z0-9]*(_){1}[0-9](\.){1}(raw)"
So the basic idea of my pattern here is that the file name must start with a letter
and then it can have any number of letters and numbers following it and it must have an _ delimit a series of numbers and another _ to delimit the next set of numbers and characters and another _ to delimit the next set of numbers and then it must have a single period following by raw. This looks grossly wrong and ugly (because I am not sure about the syntax). So how does one match a file extension? Can someone put up a simple example for something ll_parpika.sc so that I can figure out how to do my own regex?
Thanks.
x=ll_paprika.sc
VAR=`echo $X | grep *.sc`
echo $VAR
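The reason this isn't doing what you want is twofold. First, grep matches a line and returns the whole line, not just the part that matched. Second, *.sc is a glob, not a regex: left unquoted, the shell may expand it against file names in the current directory, and in a basic regular expression a leading * is a literal asterisk, so it does not match ll_paprika.sc. Note also that the snippet assigns x but echoes $X.
If you want to just get a part of it, the cut line is probably better. There is a grep -o option that returns only the matching portion, but for this you'd basically have to put in the thing you were looking for, at which point why bother?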
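For pulling pieces out of a single file name, shell parameter expansion is another option that avoids both cut and grep. A short sketch using the ll_paprika.sc example from the question (the variable names are made up):

```shell
X=ll_paprika.sc

after_underscore=${X#*_}   # drop the shortest prefix ending in "_"
extension=${X##*.}         # drop the longest prefix ending in "."
stem=${X%.*}               # drop the shortest suffix starting with "."

echo "$after_underscore"   # paprika.sc
echo "$extension"          # sc
echo "$stem"               # ll_paprika
```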
the file name must start with a letter
grep -E "^[a-zA-Z]
and then it can have any number
of letters and numbers following it
[a-zA-Z0-9]*
and it must have an _ delimit a
series of numbers and another _ to delimit the next set of numbers and
characters and another _ to delimit the next set of numbers
_[0-9]+_[a-zA-Z0-9]+_[0-9]+
and then it must have a single period following by raw.
\.raw$"
For the first, use:
VAR=`echo $X | egrep '\.sc$'`
For the second, you can try this alternative instead:
VAR=`echo $X | egrep '^[[:alpha:]][[:alnum:]]*_[[:digit:]]+_[[:alnum:]]+_[[:digit:]]+\.raw'`
Note that your character classes from your expression differ from the description that follows in that they seem to only be permissive of a-b for lower case characters in some places. This example is permissive of all alphanumeric characters in those places.
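If the goal is only to validate a name rather than filter lines, the same pattern can be tried with bash's =~ operator and no external process. This is a sketch assuming bash; the function name is_raw_name is invented, and a trailing $ is added so nothing may follow .raw.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if $1 looks like JohnSmith14_120325_A10_6.raw
is_raw_name() {
    local re='^[[:alpha:]][[:alnum:]]*_[[:digit:]]+_[[:alnum:]]+_[[:digit:]]+\.raw$'
    [[ $1 =~ $re ]]
}

for f in JohnSmith14_120325_A10_6.raw ll_paprika.sc 9John_1_A_2.raw; do
    if is_raw_name "$f"; then
        echo "$f: ok"
    else
        echo "$f: rejected"
    fi
done
```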

Checking a string to see if it contains numeric character in UNIX

I'm new to UNIX, having only started it at work today, but experienced with Java, and have the following code:
#!/bin/bash
echo "Please enter a word:"
read word
grep -i $word $1 | cut -d',' -f1,2 | tr "," "-"> output
This works fine, but what I now need to do is to check when word is read, that it contains nothing but letters and if it has numeric characters in print "Invalid input!" message and ask them to enter it again. I assumed regular expressions with an if statement would be the easy way to do this but I cannot get my head around how to use them in UNIX as I am used to the Java application of them. Any help with this would be greatly appreciated, as I couldn't find help when searching as all the solutions with regular expressions in linux I found only dealt with if it was either all numeric or not.
Yet another approach. Grep exits with 0 if a match is found, so you can test the exit code:
echo "${word}" | grep -q '[0-9]'
if [ $? = 0 ]; then
echo 'Invalid input'
fi
This is /bin/sh compatible.
Incorporating Daenyth and John's suggestions, this becomes
if echo "${word}" | grep '[0-9]' >/dev/null; then
echo 'Invalid input'
fi
The double bracket operator is an extended version of the test command which supports regexes via the =~ operator:
#!/bin/bash
while true; do
read -p "Please enter a word: " word
if [[ $word =~ [0-9] ]]; then
echo 'Invalid input!' >&2
else
break
fi
done
This is a bash-specific feature. Bash is a newer shell that is not available on all flavors of UNIX--though by "newer" I mean "only recently developed in the post-vacuum tube era" and by "not all flavors of UNIX" I mean relics like old versions of Solaris and HP-UX.
In my opinion this is the simplest option and bash is plenty portable these days, but if being portable to old UNIXes is in fact important then you'll need to use the other posters' sh-compatible answers. sh is the most common and most widely supported shell, but the price you pay for portability is losing things like =~.
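Since the question asks for "nothing but letters" rather than "no digits", the =~ test can be anchored on a letters-only pattern instead. A small sketch, assuming bash; the function name is_letters_only is made up, and which characters count as letters under [[:alpha:]] depends on the current locale.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if $1 is non-empty and all letters.
is_letters_only() {
    local re='^[[:alpha:]]+$'
    [[ $1 =~ $re ]]
}

for w in hello abc1def 'abc,def' ''; do
    if is_letters_only "$w"; then
        echo "valid: '$w'"
    else
        echo "invalid: '$w'"
    fi
done
```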
If you're trying to write portable shell code, your options for string manipulation are limited. You can use shell globbing patterns (which are a lot less expressive than regexps) in the case construct:
export LC_COLLATE=C
read word
while
case "$word" in
*[!A-Za-z]*) echo >&2 "Invalid input, please enter letters only"; true;;
*) false;;
esac
do
read word
done
EDIT: setting LC_COLLATE is necessary because in most non-C locales, character ranges like A-Z don't have the “obvious” meaning. I assume you want only ASCII letters; if you also want letters with diacritics, don't change LC_COLLATE, and replace A-Za-z by [:alpha:] (so the whole pattern becomes *[![:alpha:]]*).
For full regexps, see the expr command. EDIT: Note that expr, like several other basic shell tools, has pitfalls with some special strings; the z characters below prevent $word from being interpreted as reserved words by expr.
export LC_COLLATE=C
read word
while ! expr "z$word" : 'z[A-Za-z]*$' >/dev/null; do
echo >&2 "Invalid input, please enter letters only"
read word
done
If you only target recent enough versions of bash, there are other options, such as the =~ operator of [[ ... ]] conditional commands.
Note that your last line has a bug, the first command should be
grep -i "$word" "$1"
The quotes are because, somewhat counter-intuitively, "$foo" means “the value of the variable called foo” whereas plain $foo means “take the value of foo, split it into separate words where it contains whitespace, and treat each word as a globbing pattern and try to expand it”. (In fact if you've already checked that $word contains only letters, leaving out the quotes won't do any harm, but it takes more time to think of these special cases than to just put the quotes every time.)
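The word-splitting half of that rule is easy to see by counting arguments. A tiny sketch (the function count_args and the variable value are made up for the demonstration):

```shell
# Hypothetical helper: print how many arguments it received.
count_args() { echo "$#"; }

value='one two  three'

count_args $value     # unquoted: split on whitespace into 3 words
count_args "$value"   # quoted: passed through as 1 word
```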
Yet another (quite) portable way to do it ...
if test "$word" != "`printf "%s" "$word" | tr -dc '[:alpha:]'`"; then
echo invalid
fi
One portable (assuming bash >= 3) way to do this is to remove all numbers and test for length:
#!/bin/bash
read -p "Enter a number" var
if [[ -n ${var//[0-9]} ]]; then
echo "Contains non-numbers!"
else
echo "ok!"
fi
Coming from Java, it's important to note that bash has no real concept of objects or data types. Everything is a string, and complex data structures are painful at best.
For more info on what I did, and other related functions, google for bash string manipulation.
Playing around with Bash parameter expansion and character classes:
# cf. http://wiki.bash-hackers.org/syntax/pe
word="abc1def"
word="abc,def"
word=$'abc\177def'
# cf. http://mywiki.wooledge.org/BashFAQ/058 (no NUL byte in Bash variable)
word=$'abc\000def'
word="abcdef"
(
set -xv
[[ "${word}" != "${word/[[:digit:]]/}" ]] && echo invalid || echo valid
[[ -n "${word//[[:alpha:]]/}" ]] && echo invalid || echo valid
)
Everyone's answers seem to be based on the assumption that the only invalid characters are numbers. The initial question states that they need to check that the string contains "nothing but letters".
I think the best way to do it is
nonalpha=$(echo "$word" | sed 's/[[:alpha:]]//g')
if [[ ${#nonalpha} -gt 0 ]]; then
echo "Invalid character(s): $nonalpha"
fi
If you found this page looking for a way to detect non-numeric characters in your string (like I did!) replace [[:alpha:]] with [[:digit:]].