Bash and regex problem: check for tokens entered into a Coke vending machine

Here is a "challenge question" I got from a Linux system programming lecture.
Any of the following strings will give you a Coke if you kick:
L = { aaaa, aab, aba, baa, bb, aaaa"a", aaaa"b", aab"a", … ab"b"a, ba"b"a, ab"bbbbbb"a, ... }
The letters shown in wrapped double quotes indicate coins that would have fallen through (but those strings are still part of the language in this example).
Exercise (a bit hard): show that this is the language of a regular expression.
And this is what I've got so far :
#!/usr/bin/bash
echo "A bottle of Coke costs you 40 cents"
echo -e "Please enter tokens (a = 10 cents, b = 20 cents) in a
sequence like 'abba' :\c"
read tokens
#if [ $tokens = aaaa ]||[ $tokens = aab ]||[ $tokens = bb ]
#then
# echo "Good! now a coke is yours!"
#else echo "Thanks for your money, byebye!"
if [[ $tokens =~ 'aaaa|aab|bb' ]]
then
echo "Good! now a coke is yours!"
else echo "Thanks for your money, byebye!"
fi
Sadly it doesn't work... it always outputs "Thanks for your money, byebye!" I believe something is wrong with the syntax... We weren't given any good reference book, and the only instruction from the professor was to consult "anything you find useful online" and "research the problem yourself" :(
I know how I could do it in a programming language such as Java, but getting it done with a bash script + regex seems not "a bit hard" but in fact "too hard" for anyone with little knowledge of something as advanced as "lookahead" (is this the right terminology?)
I don't know if there is a way to express the following concept in the language of regex:
A valid entry would consist of exactly one of the three components aaaa, aab and bb, regardless of the order within a component, followed by an arbitrary sequence of a's or b's.
So this is what it should be like :
(a{4}Ua{2}bUb{2})(aUb)*
where the order of the content in the first parentheses is irrelevant.
Thanks a lot in advance for any hints and/or tips :)
Oh, and forget about returning any money back to the user, they are all mine :)
Edit: now the code works, thanks to Stephen, who pointed out my careless typo.

you don't need a regular expression.
case $tokens in
aaaa|aab|bb) echo "coke!";;
*) echo "no coke";;
esac
Note that the above checks for exactly "aaaa", "aab" or "bb". If you don't care how many characters come after that, use a wildcard:
case $tokens in
aaaa*|aab*|bb*) echo "coke!";;
*) echo "no coke";;
esac

You have:
if [[ $token =~ 'aaaa|aab|bb' ]]
$token should be $tokens
and remove the quotes from the regex.
if [[ $tokens =~ aaaa|aab|bb ]]
But, now you'll have to work on removing the quotes.
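For the exercise itself, one possible regular expression can be read off a small state machine that tracks the amount paid so far. This is only a sketch under an assumption inferred from the examples in the question (ab"b"a, ba"b"a): a b inserted at 30 cents falls through, and every coin inserted after 40 cents has been reached falls through as well. The names coke_re and gets_coke are made up for illustration.

```shell
#!/usr/bin/env bash
# Assumed reading of the exercise:
#   bb, aab          reach 40 cents without passing through 30 cents
#   aaa, ab, ba      reach 30 cents; any b's there fall through (b*),
#                    and the next a completes the 40 cents
#   [ab]*            coins inserted after 40 cents fall through
coke_re='^(bb|aab|(aaa|ab|ba)b*a)[ab]*$'

# Hypothetical helper: succeeds if the coin sequence buys a Coke.
gets_coke() {
  [[ $1 =~ $coke_re ]]
}

for tokens in aaaa aab aba baa bb abba abbbbbbba a ab abb; do
  if gets_coke "$tokens"; then
    echo "$tokens: Good! now a coke is yours!"
  else
    echo "$tokens: Thanks for your money, byebye!"
  fi
done
```

Keeping the pattern in a variable and expanding it unquoted on the right of =~ sidesteps the quoting problem discussed above.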

Related

Bash regex for certain number length

So I'm trying to distinguish between 2 kinds of numbers being entered as parameters into my script: one is a phone number and the other is the limit for printing mongodb logs. The simplest way I can think of doing this is that if the number is more than 7 digits, it's most likely going to be a phone number, as printing that many mongo results is very unlikely.
I'm trying to put this into a case statement
case "$1" in
regex here ) echo "not msisdn";;
regex here ) echo "msisdn";;
*) echo "not found";;
esac
I was thinking converting this to a string and checking the length would probably be easiest, as I can't seem to find any decent information on how to find the number of digits entered.
Edit: I should note that I was thinking of putting all this in an if statement, but these are not the only cases I will be checking, and I think the if statement will get very lengthy. The script I'm creating is already mostly ifs, so trying to change it up a bit. Let me know if I'm wrong though.
case doesn't support regular expressions; the wildcard syntax is a less capable formalism known as glob patterns 1
case $1 in
*[!0-9]*)
echo "Not a number";;
? | ?? | ??? | ???? | ????? | ??????)
echo "Six digits or less";;
'')
echo "Empty string";;
*)
echo "Seven digits or more";;
esac
It's not clear if non-numeric or empty inputs should be treated separately, but this should hopefully at least get you started.
1 Some will argue that globs are a severely restricted form of regexes. I certainly disagree, because this is more confusing than helpful IMHO - several metacharacters have entirely different semantics, and what's left as similar is by and large the static parts and not the actually useful patterns.
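To make the footnote's point about differing metacharacter semantics concrete, here is a small sketch that runs the same kind of pattern through both matchers. The helper names glob_match, regex_match and show are invented for this example.

```shell
#!/usr/bin/env bash
# In [[ ]], an unquoted right-hand side of == is a glob pattern,
# while the right-hand side of =~ is an extended regular expression.
glob_match()  { [[ $1 == $2 ]]; }
regex_match() { [[ $1 =~ $2 ]]; }

# show <matcher> <string> <pattern>: report whether the matcher accepts it.
show() {
  if "$1" "$2" "$3"; then
    echo "$1 '$2' '$3': match"
  else
    echo "$1 '$2' '$3': no match"
  fi
}

show glob_match  abc 'a?c'     # glob ?: exactly one arbitrary character
show regex_match abc '^a?c$'   # regex ?: the preceding "a" is optional
show glob_match  abc 'a*'      # glob *: any string after the "a"
show regex_match abc '^a*$'    # regex *: zero or more "a" characters
```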
${#1} contains the length of $1. So maybe just
if (( ${#1} > 7 )) ; then
echo Phone number
else
echo Too short for a phone number
fi
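The two answers can be combined into one case statement: a glob rejects anything that is not purely digits, and ${#1} decides between the two kinds of number. This is a sketch only; the function name classify_arg is hypothetical, the labels are borrowed from the question's skeleton, and the cutoff ("more than 7 digits") is the question's own heuristic.

```shell
# Hypothetical helper: classify a script argument by digit count.
classify_arg() {
  case $1 in
    '' | *[!0-9]*)
      # empty, or contains at least one non-digit character
      echo "not found" ;;
    *)
      if [ "${#1}" -gt 7 ]; then
        echo "msisdn"
      else
        echo "not msisdn"
      fi ;;
  esac
}

classify_arg 50
classify_arg 27821234567
classify_arg abc
```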

Regex "start of string" anchor not working

The regex I'm working with at the moment is as follows:
^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$
I'm trying to match all floating point numbers, but ONLY the number. For example, the following should match:
6.0
1.22E3
-2
2.99999e-12
However, the following should NOT match:
somestring///////6.0
I've tested the above regex on this validation site and it works as expected. When running it in my bash script, however, nothing at all matches.
This is my bash code:
if [[ "$VAL" =~ ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$ ]]
then
echo $VAL, "is a number"
else
echo $VAL, "is not a number"
fi
I've tried removing the anchors, and it matches any strings that contain floating points. However, strings like "//////6.00007" match. The $ anchor works as expected; however, the ^ does not.
Please let me know if you have any suggestions for troubleshooting this issue.
Thanks!
Edit 1
Removed bad examples
Edit 2
I ran the regex in its own foo.sh as suggested by @lurker and the code ran as expected with my test cases. So I looked at what was being compared with the regex. When I echoed what was being compared, everything looked fine, so it made zero sense as to why the regex wasn't matching.
Then I began to suspect that echo wasn't displaying what was actually in $VAL for some reason.
So I ran this: NEWVAL=$(echo $VAL) as a temporary workaround until I can figure out what's going on.
Your regex does not allow decimals in the exponent. Exponents may have decimals, so you either need to change your definition of what a 'number' is or you need to change your regex.
Assuming the latter, here is a correction (Bash 4.4):
echo "6.0
1.22E3.7
-2
2.99999e-0.0001
somestring///////6.0" >/tmp/f1.txt
while IFS= read -r line || [[ -n $line ]]; do
if [[ "$line" =~ ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]*\.?[0-9]+)?$ ]]
then
echo $line, "is a number"
else
echo $line, "is not a number"
fi
done < /tmp/f1.txt
Prints:
6.0, is a number
1.22E3.7, is a number
-2, is a number
2.99999e-0.0001, is a number
somestring///////6.0, is not a number
But you should know that most would consider the only two legitimate numbers in your list to be 6.0 and -2. An easy way to test is with awk:
$ awk '$1+0==$1{print $0 " is a number"; next} {print $0 " not a number"}' /tmp/f1.txt
6.0 is a number
1.22E3.7 not a number
-2 is a number
2.99999e-0.0001 not a number
somestring///////6.0 not a number
The same C language function that awk uses to convert a string to a float is used by many other languages (Ruby, Perl, Python, C, C++, Swift, etc etc) so if you consider your format valid, you are probably going to be writing your own conversion routine as well.
For example, in many languages you can write 10**1.5 as a legitimate float expression. No language I know of accepts decimal numbers after the e in a string of the form 'xx.zzEyy.y'.
As it turns out, the variables that I was comparing with my regex had leading newlines on them (e.g. "\n2.3333") that were being stripped away with echo. So, when I displayed the values to the screen with echo I would see the stripped version of my variable, which was not what was being compared with the regex.
Lesson learned: echo isn't always trustworthy. Per @CharlesDuffy's comment, use one of the following to see what's actually in your variables: declare -p varname or printf '%q\n' "$varname" but do not use echo $varname.
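A short sketch of the failure and one possible fix, using the regex from the question. The names float_re, is_float and ltrim are made up here, and trimming leading whitespace is only one way to handle the stray newline.

```shell
#!/usr/bin/env bash
float_re='^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$'

# Succeeds if the whole argument is a float by the question's definition.
is_float() { [[ $1 =~ $float_re ]]; }

# Strip leading whitespace (spaces, tabs, newlines) using parameter expansion:
# the inner expansion isolates the leading whitespace, the outer one removes it.
ltrim() { printf '%s' "${1#"${1%%[![:space:]]*}"}"; }

VAL=$'\n2.3333'                 # value with a hidden leading newline
printf 'actual contents: %q\n' "$VAL"

if is_float "$VAL"; then echo "raw: is a number"; else echo "raw: is not a number"; fi

clean=$(ltrim "$VAL")
if is_float "$clean"; then echo "trimmed: is a number"; else echo "trimmed: is not a number"; fi
```

The ^ anchor was working all along; the string really did start with a character that is not part of a number.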

Extract a numeric substring and another line as variables from multi-line text file

I'm writing a bash shell script that I hope to ultimately use to automate the naming and 'attachment' of scanned documents to our db. The script OCR's a section of the first page of the pdf and outputs a text file containing three lines; a name, unique id, and a datetime string:
Smith, John
Case #: 234567 ( )
09/04/2013 11:34 AM
What I'd like to do is end up with two separate strings as variables, "Smith, John" and "234567". I'm looking for help using regex with sed/awk/etc to extract this number. One issue is that the OCR will occasionally output strings like:
"Case #2 234567 ( )"
or
"Ca$e # 2234567 ( 7"
So I'm thinking of taking only the last 6 digits in the string, since only maybe 1 in 10,000+ of these ever get the last 6 digits read incorrectly. This unique ID is only 6 digits, and is always between 200000-999999. I'm learning regex, but it's slow going. Any help is greatly appreciated.
Edit:
For now I am using:
casename="$(cat test.txt | sed '1!d')"
casenum="$(cat test.txt | sed -n -r 's/.*([0-9]{6}).*/\1/p')"
echo ${casenum} ${casename}
234567 Smith, John
Any input for why this might not be a good way to do it, or what could be improved is (very) welcomed.
you might try this Regex (BRE):
[2-9][0-9]\{5\}\>
You could use the following regex for the second line:
^.*(\d{6})[^\d].*$
Here, the first capture group would denote the digits of interest.
For example, using Notepad++ (screenshots omitted: original text, replace options, resulting text).
The regex should remain more or less the same across environments. You might need to simply change the way the captured subexpression ($1 here) is referenced.
You could probably use something like this untested but somewhat-syntactically-valid snippet:
shopt -s extglob
declare -a cases
for casefile in casefiles/*
do
name=""
while read l
do
if [[ -z "$name" ]]
then
[[ "$l" == @(*, *) ]] && name=$l
elif [[ "$l" == +([0-9]) ]]
then
after=${l#*[2-9][0-9][0-9][0-9][0-9][0-9]}
l=${l%$after}
l=${l#${l%[2-9][0-9][0-9][0-9][0-9][0-9]}}
if [[ "$l" == @([2-9][0-9][0-9][0-9][0-9][0-9]) ]]
then
cases[$l]=$name
fi
name=""
fi
done < $casefile
done
The "hard part" prunes the first 6-digit number in your range and everything after it, then removes what's left (the stuff before the number) from the line. Then it removes the number from the beginning of the string, and removes what's left (the part after the number) from the end. If what's left is a 6-digit number in your range, it uses that as the index and the case name as the value in an array which you can later iterate over.
The rest should be pretty straightforward. :) If this doesn't quite work as expected, I blame the fact that I mostly use ksh, not bash. ;)
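The "take the last six digits" idea from the question can also be done in bash alone with =~ and BASH_REMATCH. This is a sketch, not a tested replacement for the sed version: the function name last_case_number and the inline sample text are invented, and it relies on the leading .* pushing the match as far right as possible.

```shell
#!/usr/bin/env bash
# Print the last run of six digits whose first digit is 2-9 (200000-999999).
last_case_number() {
  local re='.*([2-9][0-9]{5})'
  [[ $1 =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}

# Sample OCR output; in the real script this would come from the text file.
ocr_text=$'Smith, John\nCase #: 234567 ( )\n09/04/2013 11:34 AM'

# The first line is the name, the second line holds the case number.
{
  IFS= read -r casename
  IFS= read -r caseline
} <<< "$ocr_text"

casenum=$(last_case_number "$caseline")
echo "$casenum $casename"
```

Reading the lines with read avoids running cat and sed twice over the same file.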

Perl Pattern Matching Question

I am trying to match patterns in perl and need some help.
I need to delete from a string anything that matches [xxxx] i.e. opening bracket-things inside it-first closing bracket that occurs.
So I am trying to substitute with space the opening bracket, things inside, first closing bracket with the following code :
if($_ =~ /[/)
{
print "In here!\n";
$_ =~ s/[(.*?)]/ /ig;
}
Similarly I need to match <xxxx> i.e. opening angle bracket-things inside it-first closing angle bracket.
I am doing that using the following code :
if($_ =~ /</)
{
print "In here!\n";
$_ =~ s/<(.*?)>/ /ig;
}
This somehow does not seem to work. My sample data is as below:
'Joanne' <!--Her name does NOT contain "Kathleen"; see the section "Name"--> "'Jo'" 'Rowling', OBE [http://news bbc co uk/1/hi/uk/793844 stm Caine heads birthday honours list] BBC News 17 June 2000 Retrieved 25 October 2000 , [http://content scholastic com/browse/contributor jsp?id=3578 JK Rowling Biography] Scholastic com Retrieved 20 October 2007 better known as 'J K Rowling' ,<ref name=telegraph>[http://www telegraph co uk/news/uknews/1531779/BBCs-secret-guide-to-avoid-tripping-over-your-tongue html Daily Telegraph, BBC's secret guide to avoid tripping over your tongue, 19 October 2006] is a British <!--do not change to "English" or "Scottish" until issue is resolved --> author best known as the creator of the [[Harry Potter]] fantasy series, the idea for which was conceived whilst on a train trip from Manchester to London in 1990 The Potter books have gained worldwide attention, won multiple awards, sold more than 400 million copies and been the basis for a popular series of films, in which Rowling had creative control serving as a producer in two of the seven installments [http://www businesswire com/news/home/20100920005538/en/Warner-Bros -Pictures-Worldwide-Satellite-Trailer-Debut%C2%A0Harry Business Wire - Warner Bros Pictures mentions J K Rowling as producer ]
Any help would be appreciated. Thanks!
You need to use this:
1 while s/\[[^\[\]]*\]//g;
Demo:
% echo "i have [some [square] brackets] in [here] and [here] today."| perl -pe '1 while s/\[[^\[\]]*\]/NADA/g'
i have NADA in NADA and NADA today.
Versus the failing:
% echo "i have [some [square] brackets] in [here] and [here] today." | perl -pe 's/\[.*?\]/NADA/g'
i have NADA brackets] in NADA and NADA today.
The recursive regular expression I leave as an exercise for the reader. :)
EDIT: Eric Strom kindly provided a recursive solution, so you don't have to use 1 while:
% echo "i have [some [square] brackets] in [here] and [here] today." | perl -pe 's/\[(?:[^\[\]]*|(?R))*\]/NADA/g'
i have NADA in NADA and NADA today.
$_ =~ /someregex/ will not modify $_
Just a note, $_ =~ /someregex/ and /someregex/ do the same thing.
Also, you don't need to check for the existence of [ or < or the grouping parenthesis:
s/\[.*?\]/ /g;
s/<.*?>/ /g;
will do the job you want.
Edit: changed code to match the fact you're modifying $_
Square brackets have special meaning in the regex syntax, so escape them: /\[.*?\]/. (You also don't need the parentheses here, and doing case-insensitive matching is pointless.)
It's been a long time since I had to wrestle with Perl, but note that testing $_ with a plain match does not modify $_; only s/// does. You don't need the test anyway; just run the replacement, and if the pattern doesn't match anywhere, then it won't do anything.
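For comparison with the bash threads on this page, the "1 while" idea (keep removing innermost bracket pairs until none are left) can be transliterated into bash. This is a sketch rather than a substitute for the Perl one-liners; the function name strip_brackets is made up, and the replacement text must not itself contain square brackets or the loop would never finish.

```shell
#!/usr/bin/env bash
# strip_brackets <text> <replacement>
# Repeatedly replace an innermost [...] pair until no brackets pairs remain.
strip_brackets() {
  local s=$1 rep=$2
  local re='\[[^][]*\]'   # "[", then any characters except "[" or "]", then "]"
  while [[ $s =~ $re ]]; do
    # Quoting the matched text makes the substitution treat it literally.
    s=${s/"${BASH_REMATCH[0]}"/$rep}
  done
  printf '%s\n' "$s"
}

strip_brackets "i have [some [square] brackets] in [here] and [here] today." NADA
```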

Checking a string to see if it contains numeric character in UNIX

I'm new to UNIX, having only started it at work today, but experienced with Java, and have the following code:
#!/bin/bash
echo "Please enter a word:"
read word
grep -i $word $1 | cut -d',' -f1,2 | tr "," "-"> output
This works fine, but what I now need to do is check, when word is read, that it contains nothing but letters; if it has numeric characters in it, print an "Invalid input!" message and ask the user to enter it again. I assumed regular expressions with an if statement would be the easy way to do this, but I cannot get my head around how to use them in UNIX, as I am used to the Java way of applying them. Any help would be greatly appreciated; the regex solutions for Linux that I found when searching only dealt with whether a string was entirely numeric or not.
Yet another approach. Grep exits with 0 if a match is found, so you can test the exit code:
echo "${word}" | grep -q '[0-9]'
if [ $? = 0 ]; then
echo 'Invalid input'
fi
This is /bin/sh compatible.
Incorporating Daenyth and John's suggestions, this becomes
if echo "${word}" | grep '[0-9]' >/dev/null; then
echo 'Invalid input'
fi
The double bracket operator is an extended version of the test command which supports regexes via the =~ operator:
#!/bin/bash
while true; do
read -p "Please enter a word: " word
if [[ $word =~ [0-9] ]]; then
echo 'Invalid input!' >&2
else
break
fi
done
This is a bash-specific feature. Bash is a newer shell that is not available on all flavors of UNIX--though by "newer" I mean "only recently developed in the post-vacuum tube era" and by "not all flavors of UNIX" I mean relics like old versions of Solaris and HP-UX.
In my opinion this is the simplest option and bash is plenty portable these days, but if being portable to old UNIXes is in fact important then you'll need to use the other posters' sh-compatible answers. sh is the most common and most widely supported shell, but the price you pay for portability is losing things like =~.
If you're trying to write portable shell code, your options for string manipulation are limited. You can use shell globbing patterns (which are a lot less expressive than regexps) in the case construct:
export LC_COLLATE=C
read word
while
case "$word" in
*[!A-Za-z]*) echo >&2 "Invalid input, please enter letters only"; true;;
*) false;;
esac
do
read word
done
EDIT: setting LC_COLLATE is necessary because in most non-C locales, character ranges like A-Z don't have the “obvious” meaning. I assume you want only ASCII letters; if you also want letters with diacritics, don't change LC_COLLATE, and replace A-Za-z by [:alpha:] (so the whole pattern becomes *[![:alpha:]]*).
For full regexps, see the expr command. EDIT: Note that expr, like several other basic shell tools, has pitfalls with some special strings; the z characters below prevent $word from being interpreted as reserved words by expr.
export LC_COLLATE=C
read word
until expr "z$word" : 'z[A-Za-z]*$' >/dev/null; do
echo >&2 "Invalid input, please enter letters only"
read word
done
If you only target recent enough versions of bash, there are other options, such as the =~ operator of [[ ... ]] conditional commands.
Note that your last line has a bug, the first command should be
grep -i "$word" "$1"
The quotes are because, somewhat counter-intuitively, "$foo" means “the value of the variable called foo” whereas plain $foo means “take the value of foo, split it into separate words where it contains whitespace, and treat each word as a globbing pattern and try to expand it”. (In fact if you've already checked that $word contains only letters, leaving out the quotes won't do any harm, but it takes more time to think of these special cases than to just put the quotes every time.)
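The splitting half of that rule is easy to see by counting the arguments a command actually receives. A minimal sketch; count_args is a made-up helper and the sample value is arbitrary.

```shell
# Print how many arguments were passed.
count_args() { echo "$#"; }

word='two words'

count_args $word     # unquoted: the value is split on whitespace
count_args "$word"   # quoted: the value is passed as a single argument
```

With grep -i $word $1, the second half of a split value would be treated as a file name rather than as part of the search term.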
Yet another (quite) portable way to do it ...
if test "$word" != "`printf "%s" "$word" | tr -dc '[:alpha:]'`"; then
echo invalid
fi
One portable (assuming bash >= 3) way to do this is to remove all numbers and test for length:
#!/bin/bash
read -p "Enter a number" var
if [[ -n ${var//[0-9]} ]]; then
echo "Contains non-numbers!"
else
echo "ok!"
fi
Coming from Java, it's important to note that bash has no real concept of objects or data types. Everything is a string, and complex data structures are painful at best.
For more info on what I did, and other related functions, google for bash string manipulation.
Playing around with Bash parameter expansion and character classes:
# cf. http://wiki.bash-hackers.org/syntax/pe
word="abc1def"
word="abc,def"
word=$'abc\177def'
# cf. http://mywiki.wooledge.org/BashFAQ/058 (no NUL byte in Bash variable)
word=$'abc\000def'
word="abcdef"
(
set -xv
[[ "${word}" != "${word/[[:digit:]]/}" ]] && echo invalid || echo valid
[[ -n "${word//[[:alpha:]]/}" ]] && echo invalid || echo valid
)
Everyone's answers seem to be based on the fact that the only invalid characters are numbers. The initial question states that they need to check that the string contains "nothing but letters".
I think the best way to do it is
nonalpha=$(echo "$word" | sed 's/[[:alpha:]]//g')
if [[ ${#nonalpha} -gt 0 ]]; then
echo "Invalid character(s): $nonalpha"
fi
If you found this page looking for a way to detect non-numeric characters in your string (like I did!) replace [[:alpha:]] with [[:digit:]].
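Putting the "nothing but letters" requirement together with the =~ operator shown earlier gives a single anchored test. This is a sketch assuming bash; the function name is_letters_only is invented, the empty string is treated as invalid here, and what counts as [[:alpha:]] depends on the current locale.

```shell
#!/usr/bin/env bash
# Succeeds only if the argument is non-empty and consists solely of letters.
is_letters_only() {
  [[ $1 =~ ^[[:alpha:]]+$ ]]
}

for word in hello abc1def "abc,def" ""; do
  if is_letters_only "$word"; then
    echo "'$word': ok"
  else
    echo "'$word': Invalid input!"
  fi
done
```

In the original script this test would sit inside the read loop, repeating the prompt until it succeeds.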