Validating that a string contains only ASCII characters and digits - regex

I'm doing this to validate a username:
if [[ "$username" =~ ^[a-z][_a-z0-9]{2,17}$ ]]; then
But a username containing non-ASCII characters such as é, ç or à is still accepted.
Which regex class should I use to limit the string to ASCII letters only (a, b, c, d ... z)?

The bullet-proof way is to simply spell out [a-z] as [abcdefghijklmnopqrstuvwxyz]. There! No messing with locales or funny character classes and supported on any shell since Jan 1 1970 00:00:00. Future-proof, no matter what your OS vendor, shell vendor, Unix standardization process or BOFH thinks is cool.
With an extra variable lc like
lc=abcdefghijklmnopqrstuvwxyz
the regex even becomes readable:
[$lc][_0-9$lc]{2,17}
This is what highly robust and portable configure scripts do.
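A minimal sketch of how the spelled-out alphabet could be wrapped in a reusable check. The function name is_valid_username is only an illustration and does not come from the question; the pattern itself is the one shown above, kept in a variable.

```shell
#!/usr/bin/env bash
# Spelled-out lowercase ASCII alphabet, as suggested above.
lc=abcdefghijklmnopqrstuvwxyz

# is_valid_username is a hypothetical helper name (not from the original post).
# Returns 0 when $1 is one ASCII lowercase letter followed by 2 to 17
# characters drawn from "_", digits and ASCII lowercase letters.
is_valid_username() {
    local re="^[$lc][_0-9$lc]{2,17}\$"
    [[ $1 =~ $re ]]
}

for name in amelie_314159 'amélie_314159'; do
    if is_valid_username "$name"; then
        echo "'$name' is valid"
    else
        echo "'$name' is invalid"
    fi
done
```

Because no range is involved for the letters, the result should not depend on the collation order of the current locale.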

You should be able to do this by first setting LC_ALL=C (possibly temporarily so as to not affect anything else). The more modern regex engines allow for locales which can fold accented characters onto their base character (or at least sequence them so they come between a and z).
Since the C locale only knows ASCII, this should fix the problem.
For example, see the following script:
#!/bin/bash
username=amélie_314159
for locale in '' 'C' ; do
export LC_ALL="${locale}"
printf "LC_ALL set to %-3s: '%s' is " "'$LC_ALL'" "${username}"
if [[ "${username}" =~ ^[a-z][_a-z0-9]{2,17}$ ]] ; then
echo valid
else
echo invalid
fi
done
which outputs:
LC_ALL set to '' : 'amélie_314159' is valid
LC_ALL set to 'C': 'amélie_314159' is invalid
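If the locale change should stay temporary, one option is to confine it to a subshell. This is only a sketch: is_ascii_username is a made-up name, and the pattern is the one from the question.

```shell
#!/usr/bin/env bash
# is_ascii_username is a hypothetical helper name used for illustration.
# The function body is a subshell ( ... ), so the LC_ALL assignment
# cannot leak into the calling shell.
is_ascii_username() (
    LC_ALL=C
    re='^[a-z][_a-z0-9]{2,17}$'
    [[ $1 =~ $re ]]
)

is_ascii_username 'amélie_314159' || echo "rejected under the C locale"
```

The caller's LC_ALL is left exactly as it was, at the cost of one extra process per check.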

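Use the following. The POSIX extended regexes used by bash do not interpret \x escapes inside a bracket expression, so the byte values are produced with $'...' quoting and the pattern is kept in a variable; LC_ALL=C makes the range compare by byte value. The range starts at \x01 because a bash string cannot contain a NUL byte. Note that this accepts any ASCII character (including punctuation and control characters), not only letters and digits:
LC_ALL=C
re=$'^[\x01-\x7f]{2,17}$'
if [[ "$username" =~ $re ]]; then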

Before you check with a regex if the username has the right length etc. you should sanitize the input string. That means instead of blacklisting what is not allowed you should whitelist what is actually allowed. In this case we would use e.g. tr before the actual regex to remove all unwanted characters beforehand. When the regex check follows it becomes much simpler.
echo "abc[日本語][ひらがな][カタカナ][äüéçà]abc" | tr -dc "a-zA-Z"
This would only leave abcabc as remainder.
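A rough sketch of the sanitize-first idea applied to the username case. The name sanitize_username and the allowed set a-z0-9_ are assumptions made for illustration; LC_ALL=C is added so that tr works on bytes.

```shell
# sanitize_username is a hypothetical helper name (not from the original post).
# LC_ALL=C makes tr operate on bytes, so only the listed ASCII
# characters survive; everything else is deleted.
sanitize_username() {
    printf '%s' "$1" | LC_ALL=C tr -dc 'a-z0-9_'
}

clean=$(sanitize_username 'amélie_314159')
echo "$clean"
```

Sanitizing silently changes the input, so a script that should reject bad names rather than repair them could compare $clean with the original string and complain when they differ.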

Related

BASH regexp matching - including brackets in a bracketed list of characters to match against?

I'm trying to do a tiny bash script that'll clean up the file and folder names of downloaded episodes of some tv shows I like. They often look like "[ www.Speed.Cd ] - Some.Show.S07E14.720p.HDTV.X264-SOMEONE", and I basically just want to strip out that speedcd advertising bit.
It's easy enough to remove www.Speed.Cd, spaces, and dashes using regexp matching in BASH, but for the life of me, I cannot figure out how to include the brackets in a list of characters to be matched against. [- [] doesn't work, neither does [- \[], [- \\[], [- \\\[], or any number of escape characters preceding the bracket I want to remove.
Here's what I've got so far:
[[ "$newfile" =~ ^(.*)([- \[]*(www\.torrenting\.com|spastikustv|www\.speed\.cd|moviesp2p\.com)[- \]]*)(.*)$ ]] &&
newfile="${BASH_REMATCH[1]}${BASH_REMATCH[4]}"
But it breaks on the brackets.
Any ideas?
TIA,
Daniel :)
EDIT: I should probably note that I'm using "shopt -s nocasematch" to ensure case insensitive matching, just in case you're wondering :)
EDIT 2: Thanks to all who contributed. I'm not 100% sure which answer was to be the "correct" one, as I had several problems with my statement. Actually, the most accurate answer was just a comment to my question posted by jw013, but I didn't get it at the time because I hadn't understood yet that spaces should be escaped. I've opted for aefxx's as that one basically says the same, but with explanations :) Would've liked to put a correct answer mark on ormaaj's answer, too, as he spotted more grave issues with my expression.
Anyway, the approach I was using above, trying to match and extract the parts to keep and leave behind the unwanted ones is really not very elegant, and won't catch all cases, not even something really simple like "Some.Show.S07E14.720p.HDTV.X264-SOMEONE - [ www.Speed.Cd ]". I've instead rewritten it to match and extract just the unwanted parts and then do string replacement of those on the original string, like so (loop is in case there's multiple brandings):
# Remove common torrent site brandings, including surrounding spaces, brackets, etc.:
while [[ "$newfile" =~ ([[\ {\(-]*(www\.)?(torrentday\.com|torrenting\.com|spastikustv|speed\.cd|moviesp2p\.com|publichd\.org|publichd|scenetime\.com|kingdom-release)[]\ }\)-]*) ]]; do
newfile=${newfile//"${BASH_REMATCH[1]}"/}
done
Ok, this is the first time I've heard of the =~ operator but nevertheless here's what I found by trial and error:
if [[ $newfile =~ ^(.*)([-[:space:][]*(what|ever)[][:space:]-]*)(.*)$ ]]
                        ^^^^^^^^^^^^^            ^^^^^^^^^^^^^
Looks strange but actually does work (just tested it).
EDIT
Quote from the Linux man pages regex(7):
To include a literal ] in the list, make it the first character (following a possible ^). To include a literal -, make it the first or last character, or the second endpoint of a range. To use a literal '-' as the first endpoint of a range, enclose it in "[." and ".]" to make it a collating element (see below). With the exception of these and some combinations using '[' (see next paragraphs), all other special characters, including '\', lose their special significance within a bracket expression.
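To make the placement rules concrete, here is a small sketch that keeps the character list in a variable. The names d and strip_decor are made up for this example; the list puts "]" first and "-" last, as the man page describes.

```shell
#!/usr/bin/env bash
# d is the bracket-expression content: "]" first, then "[", then the
# [:space:] class, with "-" last. [$d] matches one such character and
# [^$d] matches any other character.
d='][[:space:]-'

# strip_decor is a hypothetical helper name used for illustration.
# It prints $1 without leading or trailing brackets, spaces and dashes.
strip_decor() {
    local re="^[$d]*([^$d](.*[^$d])?)[$d]*\$"
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        printf '%s\n' "$1"   # nothing but decoration: leave unchanged
    fi
}

strip_decor '[ - Some.Show.S07E14 - ]'
```

No backslashes are needed anywhere inside the bracket expressions; only the position of "]" and "-" matters.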
Whenever you're doing a regex it's most compatible between Bash versions to put regexes in a variable even if you do manage to dodge all the pitfalls of putting them directly in a test expression. http://mywiki.wooledge.org/BashPitfalls#if_.5B.5B_.24foo_.3D.2BAH4_.27some_RE.27_.5D.5D
Your current regex looks like you're trying to optionally match anything preceding the opening bracket. I'd guess you're actually trying to save for example 3 and 4 from something like this:
$ shopt -s nocasematch
$ newfile='[ www.Speed.Cd ] - Some.Show.S07E14.720p.HDTV.X264-SOMEONE'
$ re='^.*[-[:space:][]*(www\.torrenting\.com|spastikustv|www\.speed\.cd|moviesp2p\.com)[][:space:]-]*(.*)$'
$ [[ $newfile =~ $re ]]
$ declare -p BASH_REMATCH
declare -ar BASH_REMATCH='([0]="[ www.Speed.Cd ] - Some.Show.S07E14.720p.HDTV.X264-SOMEONE" [1]="www.Speed.Cd" [2]="Some.Show.S07E14.720p.HDTV.X264-SOMEONE")'
The basic issue is quite simple, if not obvious.
A BASH REGEX is totally unprotected (from the shell), and cannot be protected by "double quotes". This means that every literal space (and tab, etc.) must be protected by a backslash \ ... end of story. The rest is just a case of getting your regex to suit your needs.
One other thing; use [\ [] and []\ ] to match [ and ] respectively, within the range square-bracket construct (in this case along with a space).
example:
newfile="[ ]"
[[ "$newfile" =~ ^[\ []\ []\ ]$ ]] &&
echo YES ||
echo NO
You can try something like this (though you weren't 100% clear on what cases you are trying to filter):
newfile="[ www.Speed.Cd ] - Some.Show.S07E14.720p.HDTV.X264-SOMEONE"
if [[ $newfile =~ ^(.*)([^a-zA-Z0-9.]*\[.*\][^a-zA-Z0-9.]*)(.*)$ ]]; then
newfile="${BASH_REMATCH[1]}${BASH_REMATCH[3]}"
fi
echo $newfile
# Some.Show.S07E14.720p.HDTV.X264-SOMEONE
It's just stripping any characters other than alphanumerics and dots around the [], plus everything within the [].

In bash/sed, how do you match on a lowercase letter followed by the SAME letter in uppercase?

I want to delete all instances of "aA", "bB" ... "zZ" from an input string.
e.g.
echo "foObar" |
sed -Ee 's/([a-z])\U\1//'
should output "fbar"
But the \U syntax works in the latter half (replacement part) of the sed expression - it fails to resolve in the matching clause.
I'm having difficulty converting the matched character to upper case to reuse in the matching clause.
If anyone could suggest a working regex which can be used in sed (or awk) that would be great.
Scripting solutions in pure shell are ok too (I'm trying to think of solving the problem this way).
Working PCRE (Perl-compatible regular expressions) are ok too but I have no idea how they work so it might be nice if you could provide an explanation to go with your answer.
Unfortunately, I don't have perl or python installed on the machine that I am working with.
You may use the following perl solution:
echo "foObar" | perl -pe 's/([a-z])(?!\1)(?i:\1)//g'
Details
([a-z]) - Group 1: a lowercase ASCII letter
(?!\1) - a negative lookahead that fails the match if the next char is the same as captured with Group 1
(?i:\1) - the same char as captured with Group 1 but in the different case (due to the lookahead before it).
The -e option lets you pass the Perl code on the command line, and the -p option wraps that code in a loop over the input lines and prints the contents of $_ after each iteration.
This might work for you (GNU sed):
sed -r 's/aA|bB|cC|dD|eE|fF|gG|hH|iI|jJ|kK|lL|mM|nN|oO|pP|qQ|rR|sS|tT|uU|vV|wW|xX|yY|zZ//g' file
A programmatic solution:
sed 's/[[:lower:]][[:upper:]]/\n&/g;s/\n\(.\)\1//ig;s/\n//g' file
This marks all pairs of lower-case characters followed by an upper-case character with a preceding newline. Then remove altogether such marker and pairs that match by a back reference irrespective of case. Any other newlines are removed thus leaving pairs untouched that are not the same.
Here is a verbose awk solution as OP doesn't have perl or python available:
echo "foObar" |
awk -v ORS= -v FS='' '{
for (i=2; i<=NF; i++) {
if ($(i-1) == tolower($i) && $i ~ /[A-Z]/ && $(i-1) ~ /[a-z]/) {
i++
continue
}
print $(i-1)
}
print $(i-1)
}'
fbar
There's an easy lex for this,
%option main 8bit
#include <ctype.h>
%%
[[:lower:]][[:upper:]] if ( toupper(yytext[0]) != yytext[1] ) ECHO;
(that's a tab before the #include, markdown loses those). Just put that in e.g. that.l and then make that. Easy-peasy lex's are a nice addition to your toolkit.
Note: This solution is (unsurprisingly) slow, based on OP's feedback:
"Unfortunately, due to the multiple passes - it makes it rather slow. "
If there is a character sequence¹ that you know won't ever appear in the input, you could use a 3-stage replacement to accomplish this with sed:
echo 'foObar foobAr' | sed -E -e 's/([a-z])([A-Z])/KEYWORD\1\l\2/g' -e 's/KEYWORD(.)\1//g' -e 's/KEYWORD(.)(.)/\1\u\2/g'
gives you: fbar foobAr
Replacement stages explained:
1. Look for lowercase letters followed by ANY uppercase letter and replace them with both letters in lowercase, with the KEYWORD in front of them: foObar foobAr -> fKEYWORDoobar fooKEYWORDbar
2. Remove KEYWORD followed by two identical characters (both are lowercase now, so the back-reference works): fKEYWORDoobar fooKEYWORDbar -> fbar fooKEYWORDbar
3. Strip the remaining² KEYWORD from the output and convert the second character after it back to its original, uppercase version: fbar fooKEYWORDbar -> fbar foobAr
¹ In this example I used KEYWORD for demonstration purposes. A single character or at least shorter character sequence would be better/faster. Just make sure to pick something that cannot possibly ever be in the input.
² The remaining occurrences are those where the lowercase versions of the letters were not identical, so we have to revert them to their original state.
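Since the question also allows pure shell, here is a rough bash-only sketch that needs neither sed nor awk. The name remove_case_pairs is made up, the ${a^^} expansion assumes bash 4 or newer, and the lowercase alphabet is spelled out to stay independent of the locale.

```shell
#!/usr/bin/env bash
# remove_case_pairs is a hypothetical helper name used for illustration.
# Requires bash 4+ for the ${a^^} (uppercase) expansion.
remove_case_pairs() {
    local s=$1 out='' a b i
    for (( i = 0; i < ${#s}; i++ )); do
        a=${s:i:1}       # current character
        b=${s:i+1:1}     # next character (empty at the end of the string)
        if [[ $a == [abcdefghijklmnopqrstuvwxyz] && $b == "${a^^}" ]]; then
            i=$(( i + 1 ))   # skip the uppercase partner as well
        else
            out+=$a
        fi
    done
    printf '%s\n' "$out"
}

remove_case_pairs 'foObar foobAr'
```

Like the sed and awk answers, this makes a single left-to-right pass, so pairs that only become adjacent after a removal are left alone.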

Regex "start of string" anchor not working

The regex I'm working with at the moment is as follows:
^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$
I'm trying to match all floating point numbers, but ONLY the number. For example, the following should match:
6.0
1.22E3
-2
2.99999e-12
However, the following should NOT match:
somestring///////6.0
I've tested the above regex on this validation site and it works as expected. When running it in my bash script, however, nothing at all matches.
This is my bash code:
if [[ "$VAL" =~ ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$ ]]
then
echo $VAL, "is a number"
else
echo $VAL, "is not a number"
fi
I've tried removing the anchors, and it matches any strings that contain floating points. However, strings like "//////6.00007" match. The $ anchor works as expected; however, the ^ does not.
Please let me know if you have any suggestions for troubleshooting this issue.
Thanks!
Edit 1
Removed bad examples
Edit 2
I ran the regex in its own foo.sh as suggested by @lurker and the code ran as expected with my test cases. So I looked at what was being compared with the regex. When I echoed what was being compared, everything looked fine, so it made zero sense as to why the regex wasn't matching.
Then I began to suspect that echo wasn't displaying what was actually in $VAL for some reason.
So I ran this: NEWVAL=$(echo $VAL) as a temporary workaround until I can figure out what's going on.
Your regex does not allow decimals in the exponent. Exponents may have decimals, so you either need to change your definition of what a 'number' is or you need to change your regex.
Assuming the latter, here is a correction (Bash 4.4):
echo "6.0
1.22E3.7
-2
2.99999e-0.0001
somestring///////6.0" >/tmp/f1.txt
while IFS= read -r line || [[ -n $line ]]; do
if [[ "$line" =~ ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]*\.?[0-9]+)?$ ]]
then
echo $line, "is a number"
else
echo $line, "is not a number"
fi
done < /tmp/f1.txt
Prints:
6.0, is a number
1.22E3.7, is a number
-2, is a number
2.99999e-0.0001, is a number
somestring///////6.0, is not a number
BUT you should know, that most consider the only two legit numbers in your list to be 6.0 and -2. Easy way to test is with awk:
$ awk '$1+0==$1{print $0 " is a number"; next} {print $0 " not a number"}' /tmp/f1.txt
6.0 is a number
1.22E3.7 not a number
-2 is a number
2.99999e-0.0001 not a number
somestring///////6.0 not a number
The same C language function that awk uses to convert a string to a float is used by many other languages (Ruby, Perl, Python, C, C++, Swift, etc etc) so if you consider your format valid, you are probably going to be writing your own conversion routine as well.
For example, in most languages you can enter 10**1.5 as a legitimate float expression. No language I know of accepts a decimal number after the e in a string of the form 'xx.zzEyy.y'.
As it turns out, the variables that I was comparing with my regex had leading newlines on them (e.g. "\n2.3333") that were being stripped away with echo. So, when I displayed the values to the screen with echo I would see the stripped version of my variable, which was not what was being compared with the regex.
Lesson learned: echo isn't always trustworthy. Per @CharlesDuffy's comment, use one of the following to see what's actually in your variables: declare -p varname or printf '%q\n' "$varname", but do not use echo $varname.
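A short sketch of how the hidden newline shows up and how it could be stripped before matching. The helper names is_float and ltrim are invented for this example; the regex is the one from the question.

```shell
#!/usr/bin/env bash
re='^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$'

# is_float and ltrim are hypothetical helper names used for illustration.
is_float() { [[ $1 =~ $re ]]; }

# Remove leading whitespace (including newlines) with parameter expansion.
ltrim() {
    local lead=${1%%[![:space:]]*}   # the leading run of whitespace
    printf '%s' "${1#"$lead"}"
}

VAL=$'\n2.3333'                      # value with a hidden leading newline
is_float "$VAL" || echo "raw value does not match"
printf '%q\n' "$VAL"                 # shell-quoted form makes the newline visible
is_float "$(ltrim "$VAL")" && echo "trimmed value matches"
```

Unlike echo $VAL, the trimming here is explicit, so it is clear which characters are being discarded before the comparison.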

Checking a string to see if it contains numeric character in UNIX

I'm new to UNIX, having only started it at work today, but experienced with Java, and have the following code:
#!/bin/bash
echo "Please enter a word:"
read word
grep -i $word $1 | cut -d',' -f1,2 | tr "," "-"> output
This works fine, but what I now need to do is to check when word is read, that it contains nothing but letters and if it has numeric characters in print "Invalid input!" message and ask them to enter it again. I assumed regular expressions with an if statement would be the easy way to do this but I cannot get my head around how to use them in UNIX as I am used to the Java application of them. Any help with this would be greatly appreciated, as I couldn't find help when searching as all the solutions with regular expressions in linux I found only dealt with if it was either all numeric or not.
Yet another approach. Grep exits with 0 if a match is found, so you can test the exit code:
echo "${word}" | grep -q '[0-9]'
if [ $? = 0 ]; then
echo 'Invalid input'
fi
This is /bin/sh compatible.
Incorporating Daenyth and John's suggestions, this becomes
if echo "${word}" | grep '[0-9]' >/dev/null; then
echo 'Invalid input'
fi
The double bracket operator is an extended version of the test command which supports regexes via the =~ operator:
#!/bin/bash
while true; do
read -p "Please enter a word: " word
if [[ $word =~ [0-9] ]]; then
echo 'Invalid input!' >&2
else
break
fi
done
This is a bash-specific feature. Bash is a newer shell that is not available on all flavors of UNIX--though by "newer" I mean "only recently developed in the post-vacuum tube era" and by "not all flavors of UNIX" I mean relics like old versions of Solaris and HP-UX.
In my opinion this is the simplest option and bash is plenty portable these days, but if being portable to old UNIXes is in fact important then you'll need to use the other posters' sh-compatible answers. sh is the most common and most widely supported shell, but the price you pay for portability is losing things like =~.
If you're trying to write portable shell code, your options for string manipulation are limited. You can use shell globbing patterns (which are a lot less expressive than regexps) in the case construct:
export LC_COLLATE=C
read word
while
case "$word" in
*[!A-Za-z]*) echo >&2 "Invalid input, please enter letters only"; true;;
*) false;;
esac
do
read word
done
EDIT: setting LC_COLLATE is necessary because in most non-C locales, character ranges like A-Z don't have the “obvious” meaning. I assume you want only ASCII letters; if you also want letters with diacritics, don't change LC_COLLATE, and replace A-Za-z by [:alpha:] (so the whole pattern becomes *[![:alpha:]]*).
For full regexps, see the expr command. EDIT: Note that expr, like several other basic shell tools, has pitfalls with some special strings; the z characters below prevent $word from being interpreted as reserved words by expr.
export LC_COLLATE=C
read word
while ! expr "z$word" : 'z[A-Za-z]*$' >/dev/null; do
echo >&2 "Invalid input, please enter letters only"
read word
done
If you only target recent enough versions of bash, there are other options, such as the =~ operator of [[ ... ]] conditional commands.
Note that your last line has a bug, the first command should be
grep -i "$word" "$1"
The quotes are because, somewhat counter-intuitively, "$foo" means “the value of the variable called foo” whereas plain $foo means “take the value of foo, split it into separate words where it contains whitespace, and treat each word as a globbing pattern and try to expand it”. (In fact if you've already checked that $word contains only letters, leaving out the quotes won't do any harm, but it takes more time to think of these special cases than to just put the quotes every time.)
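The word-splitting half of that difference can be seen with a tiny sketch (count_args is a made-up helper, and the default IFS is assumed):

```shell
# count_args is a hypothetical helper that prints how many arguments it got.
count_args() { echo "$#"; }

word='two words'
count_args $word      # unquoted: split on whitespace into 2 arguments
count_args "$word"    # quoted: passed through as 1 argument
```

With grep, the unquoted form would turn the second word into a file name to search instead of part of the pattern.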
Yet another (quite) portable way to do it ...
if test "$word" != "$(printf '%s' "$word" | tr -dc '[:alpha:]')"; then
echo invalid
fi
One portable (assuming bash >= 3) way to do this is to remove all numbers and test for length:
#!/bin/bash
read -p "Enter a number" var
if [[ -n ${var//[0-9]} ]]; then
echo "Contains non-numbers!"
else
echo "ok!"
fi
Coming from Java, it's important to note that bash has no real concept of objects or data types. Everything is a string, and complex data structures are painful at best.
For more info on what I did, and other related functions, google for bash string manipulation.
Playing around with Bash parameter expansion and character classes:
# cf. http://wiki.bash-hackers.org/syntax/pe
word="abc1def"
word="abc,def"
word=$'abc\177def'
# cf. http://mywiki.wooledge.org/BashFAQ/058 (no NUL byte in Bash variable)
word=$'abc\000def'
word="abcdef"
(
set -xv
[[ "${word}" != "${word/[[:digit:]]/}" ]] && echo invalid || echo valid
[[ -n "${word//[[:alpha:]]/}" ]] && echo invalid || echo valid
)
Everyone's answers seem to be based on the assumption that the only invalid characters are numbers. The initial question states that they need to check that the string contains "nothing but letters".
I think the best way to do it is
nonalpha=$(echo "$word" | sed 's/[[:alpha:]]//g')
if [[ ${#nonalpha} -gt 0 ]]; then
echo "Invalid character(s): $nonalpha"
fi
If you found this page looking for a way to detect non-numeric characters in your string (like I did!) replace [[:alpha:]] with [[:digit:]].