Grep regular expression for digits in character string of variable length - regex

I need some way to find words that contain any combination of characters and digits but exactly 4 digits only, and at least one character.
EXAMPLE:
a1a1a1a1 // Match
1234 // NO match (no characters)
a1a1a1a1a1 // NO match
ab2b2 // NO match
cd12 // NO match
z9989 // Match
1ab26a9 // Match
1ab1c1 // NO match
12345 // NO match
24 // NO match
a2b2c2d2 // Match
ab11cd22dd33 // NO match

To match a digit in grep you can use [0-9]. To match anything but a digit, you can use [^0-9]. Since there can be any number of non-digit characters, or none at all, you add a "*" (any number of the preceding, including zero). So what you'll want is, logically:
(anything not a digit or nothing)* (any single digit) (anything not a digit or nothing)* ....
until you have 4 "any single digit" groups. i.e. [^0-9]*[0-9]...
I find that with long grep patterns, especially ones with long strings of special characters that need to be escaped, it's best to build up slowly so you're sure you understand what's going on. For example,
#this will highlight your matches, and make it easier to understand
alias grep='grep --color=auto'
echo 'a1b2' | grep '[0-9]'
will show you how it's matching. You can then extend the pattern once you understand each part.
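Following that build-up to the end, a minimal sketch of the finished pattern might look like this. It assumes a grep that supports -E, anchors the pattern with ^ and $ so that a fifth digit cannot hide outside the match, and uses a second grep to require at least one non-digit. The helper name has_four_digits is made up for illustration.

```shell
# has_four_digits is a hypothetical helper name, not a standard command.
# It succeeds when the word has exactly four digits and at least one non-digit.
has_four_digits() {
    # First grep: four groups of (optional non-digits, one digit), then optional non-digits.
    # Second grep: at least one non-digit somewhere in the word.
    printf '%s\n' "$1" |
        grep -E '^([^0-9]*[0-9]){4}[^0-9]*$' |
        grep -q '[^0-9]'
}

for word in a1a1a1a1 1234 a1a1a1a1a1 z9989 1ab26a9 cd12; do
    if has_four_digits "$word"; then
        echo "$word: match"
    else
        echo "$word: no match"
    fi
done
```

Note that [^0-9] accepts any non-digit, not only letters; swap in [a-zA-Z] if punctuation should be rejected.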

I'm not sure about all the other input you might take (i.e. is ax12ax12ax12ax12 valid?), but this will work for strictly alternating character/digit input such as a1a1a1a1 (it will not match z9989 or 1ab26a9):
%> grep -P "^(?:\w\d){4}$" fileWithInput

With grep:
grep -iE '^([a-z]*[0-9]){4}[a-z]*$' | grep -vE '^[0-9]{4}$'
Do it in one pattern with Perl:
perl -ne 'print if /^(?!\d{4}$)([^\W\d_]*\d){4}[^\W\d_]*$/'
The funky [^\W\d_] character class is a cosmopolitan way to spell [A-Za-z]: it catches all letters rather than only the English ones.

If you don't mind using a little shell as well, you could do something like this:
echo "a1a1a1a1" |grep -o '[0-9]'|wc -l
which would display the number of digits found in the string. If you like, you could then test for a given number of matches:
max_match=4
[ "$(echo "a1da4a3aaa4a4" | grep -o '[0-9]'|wc -l)" -le $max_match ] || echo "too many digits."
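The same counting idea can be turned into an "exactly 4" check. This is only a sketch: count_digits is a hypothetical helper name, and it relies on grep -o, which GNU and BSD grep provide but POSIX does not require.

```shell
# count_digits is a hypothetical helper name, not part of grep.
count_digits() {
    # grep -o prints every matched digit on its own line; wc -l counts those lines.
    # tr strips the padding that some wc implementations add.
    printf '%s\n' "$1" | grep -o '[0-9]' | wc -l | tr -d ' '
}

word=z9989
if [ "$(count_digits "$word")" -eq 4 ]; then
    echo "$word has exactly 4 digits"
fi
```

This only counts digits; the "at least one character" requirement still needs a separate test.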

Assuming you only need ASCII, and you can only access the (fairly primitive) regexp constructs of grep, the following should be pretty close:
grep '^[a-zA-Z]*[0-9][a-zA-Z]*[0-9][a-zA-Z]*[0-9][a-zA-Z]*[0-9][a-zA-Z]*$' | grep '[a-zA-Z]'

You might try
[^0-9]*[0-9][^0-9]*[0-9][^0-9]*[0-9][^0-9]*[0-9][^0-9]*
But this will match 1234. Why doesn't that match your criteria?

The regex for that is:
([A-Za-z]\d){4}
[A-Za-z] - for character class
\d - for number
You wrap them in () to group them, indicating the format: a character followed by a number
{4} - indicating that it must be 4 repetitions

You can use a plain shell script; there is no need for a complicated regex.
var=a1a1a1a1
alldigits=${var//[^0-9]/}
allletters=${var//[0-9]/}
case "${#alldigits}" in
4)
if [ "${#allletters}" -gt 0 ];then
echo "ok: 4 digits and letters: $var"
else
echo "Invalid: all numbers and exactly 4: $var"
fi
;;
*) echo "Invalid: $var";;
esac

Thanks for your answers.
Finally I wrote a script, and it works perfectly:
./P ab2b2 cd12 z9989 1ab26a9 1ab1c1 1234 24 a2b2c2d2
#!/bin/bash
echo "$@" | tr -s " " "\n" >> sorting
cat sorting | while read tostr
do
l=$(echo $tostr|tr -d "\n"|wc -c)
temp=$(echo $tostr|tr -d a-z|tr -d "\n" | wc -c)
if [ $temp -eq 4 ]; then
if [ $l -gt 4 ]; then
printf "%s " "$tostr"
fi
fi
done
echo

Related

Print 1 Occurence for Each Pattern Match

I have a file that contains a pattern at the beginning of each newline:
./bob/some/text/path/index.html
./bob/some/other/path/index.html
./bob/some/text/path/index1.html
./sue/some/text/path/index.html
./sue/some/text/path/index2.html
./sue/some/other/path/index.html
./john/some/text/path/index.html
./john/some/other/path/index.html
./john/some/more/text/index1.html
... etc.
I came up with the following code to match the ./{name}/ pattern and would like to print 1 occurrence of each name, but it either prints out every line matching that pattern, or prints just 1 and stops when using the -m 1 flag.
I've tried it as a simple grep line (below) and also put it in a for loop:
name=$(grep -iEoha -m 1 '\.\/([^/]*)\/' ./without_localnamespace.txt)
echo $name
My expected results are:
./bob/
./sue/
./john/
Actual Results are:
./bob/
awk -F'/' '!a[$2]++{print $1 FS $2 FS}' input
./bob/
./sue/
./john/
You can do
cut -d "/" -f2 ./without_localnamespace.txt | sort -u
You seem to want unique occurrences, use
grep -Eoha '\./[^/]*/' ./without_localnamespace.txt | uniq
Regarding the pattern, you do not need to escape forward slashes, they are not special regex metacharacters. The -i flag is redundant here, too.
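One caveat worth sketching: uniq only collapses adjacent duplicate lines, so it works here because the names are grouped. If the names could be interleaved, an order-preserving filter is one option. The helper name unique_prefixes below is hypothetical, and the sample paths are shortened stand-ins for the real file.

```shell
# unique_prefixes is a hypothetical helper name; it reads paths on stdin.
unique_prefixes() {
    # grep -Eo prints only the leading ./name/ part of each line;
    # the awk filter keeps a line only the first time it is seen.
    grep -Eo '^\./[^/]*/' | awk '!seen[$0]++'
}

printf '%s\n' \
    ./bob/some/text/path/index.html \
    ./sue/some/text/path/index.html \
    ./bob/some/other/path/index.html \
    ./john/some/text/path/index.html |
    unique_prefixes
```

Unlike sort -u, this keeps the names in the order they first appear.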

regex to catch substring in a line and change output with sed

I am trying to put together a regex that will find a substring that has the following format:
[a-z].* ('x' number of lowercase letters)
A '-' sign
[a-z].* ('x' number of lowercase letters)
[0-9].* ('x' number of digits from 0 to 9)
Anything else in the line that follows this substring (including space or ',') would not be caught by the regex and then I would add a new line to the results so they are in a list.
If this regex works the way I would like then from the following string
file.txt: hostname abcd-efg123, zfdh-eif23 , reox-bmo552, 'coor-dto201',
I would receive this output
abcd-efg123
zfdh-eif23
reox-bmo552
coor-dto201
This is what I have so far. I'm trying to use the regex and then store the result as two variables which I can then put back into sed. I'm not getting the results I expected.
The regex/sed I am using is
sed 's/\([a-z].*\)-\([a-z].*[0-9].*\)/\2 \1 \n/g'
Here is the command straight from the prompt
macbook:~ user$ echo "file.txt: hostname abcd-efg123, zfdh-eif23 , reox-bmo552, 'coor-dto201'," | sed 's/\([a-z].*\)-\([a-z].*[0-9].*\)/\2 \1 \n/g'
dto201', file.txt: hostname abcd-efg123, zfdh-eif23 , reox-bmo552, 'coor n
Here's the regex I would use to match :
[a-z]*-[a-z]*[0-9]*
The main problem in your regex was the use of .* where you obviously meant *. As I commented, * is the quantifier that means "any number of times (including 0)", while . is the "any character" wildcard. You want to apply the quantifier to your previous character class rather than to ., the . has no reason to be here.
Note that using * includes 0 repetition, so the regex would match a single dash, which might not be to your taste.
Maybe you could be more specific, with a regex along those lines :
[a-z]{4}-[a-z]{3}[0-9]{2,3}
Here, instead of using * as a quantifier, we use numbers between curly brackets: they give us the possibility to specify an exact number of repetitions (e.g. .{4} means "any 4 characters") or a range of repetitions (e.g. [0-9]{2,6} means "2 to 6 digits"). You could also use +, a quantifier that means "at least one time", as mentioned by Kenavoz.
And here's how I would use it in a linux command :
grep -o '[a-z]*-[a-z]*[0-9]*'
or
grep -Eo '[a-z]{4}-[a-z]{3}[0-9]{2,3}'
Here it is in action :
$ echo "file.txt: hostname abcd-efg123, zfdh-eif23 , reox-bmo552, 'coor-dto201'," | grep -o '[a-z]*-[a-z]*[0-9]*'
abcd-efg123
zfdh-eif23
reox-bmo552
coor-dto201
Or with the more specific regex :
$ echo "file.txt: hostname abcd-efg123, zfdh-eif23 , reox-bmo552, 'coor-dto201'," | grep -Eo "[a-z]{4}-[a-z]{3}[0-9]{2,3}"
abcd-efg123
zfdh-eif23
reox-bmo552
coor-dto201
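For completeness, the + quantifier mentioned above could be used like this. It is only a sketch: extract_names is a hypothetical helper name, and + needs grep -E (in basic mode it would have to be written \+ with GNU grep).

```shell
# extract_names is a hypothetical helper name; it reads text on stdin.
extract_names() {
    # + means "one or more", so a lone dash no longer matches.
    grep -Eo '[a-z]+-[a-z]+[0-9]+'
}

line="file.txt: hostname abcd-efg123, zfdh-eif23 , reox-bmo552, 'coor-dto201',"
printf '%s\n' "$line" | extract_names
```

This sits between the two patterns above: stricter than *, but without hard-coding the lengths.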

shell regex not found (number, length and K letter)

My code or regex does not work. I tried with:
'^([0-9]{7,8})+([K|0-9]{1})'
'#[0-9]{7,8}[K 0-9]{1}'
"^([0-9]{7,8})+([K|0-9]{1})$"
*I need it to accept 7 or 8 digits (1234567 or 12345678) followed by K or a digit (\d is not recognized)
ex: 123456789 12345678K 12345678
*With this regex '^[0-9]{7,8}[K|0-9]{1}' it returns:
ok 184587939
ok 17678977K
ok 186613074
ok 18661307Z (the last character is not checked)
invalido 18R613074
ok 1845879398888888 (the length is not checked)
ok 18458793U
invalido 18661G074
invalido 18661G07T
ok 18458793
invalido 1845E793
#!/bin/bash
var='^([0-9]{7,8})+([K|0-9]{1})$'
for LINEA in `cat rut.txt ` # LINEA holds each entry read from the file rut.txt
do
rut=`echo $LINEA | cut -d ":" -f1` # Extract the rut
rut=$(echo $rut | tr 'a-z' 'A-Z')
while :
do
if [[ $rut =~ $var ]];then
#$rut >> rutok.txt
echo $rut | cat >> rutok.txt
echo $rut' ok'
break
else
#$rut >> rutinv.txt
echo $rut | cat >> rutinv.txt
echo $rut' inv'
break
fi
done
done
exit 0
If I understand you correctly, I believe this is what you're after.
^([0-9]{8}|[0-9]{7}K)$
Your question is extremely unclear, but in addition, your code is incredibly overcomplicated. You seem to be looking for simply grep and grep -v. Looping over the lines in a file with for is definitely wrong (see http://mywiki.wooledge.org/DontReadLinesWithFor) but a while read -r loop is also usually an antipattern. In the general case, these are usually examples of an Awk script begging to be born.
cut -d : -f1 rut.txt |
tr a-z A-Z |
awk '/^([0-9]{7,8})([K0-9])$/ { print >>"rutok.txt"; next }
{ print >> "rutinv.txt" }'
The {1} repetition is just superfluous (otherwise we'd have to write h{1}t{1}t{1}p{1} etc to match a simple string!) and inside a character class, you just enumerate the character ranges; if you don't want to match a literal | character, don't put it in the character class.
([0-9]{7,8})+ means one or more repetitions of 7 or 8 digits; so 7, 8, 14, 15, 16, 21, ... digits. According to the exposition, this is not what you want.
I'm guessing the regex above covers your case: seven or eight digits and a K or one more digit.
Moreover, returning to your shell script, you should not use uppercase for your own variables (these are reserved for system use); and you need to quote your strings unless you specifically want the shell to perform token splitting and wildcard expansion on the values (though that's probably likely to be harmless here). Finally, the while loop does not seem to serve any useful purpose, as you break out of it on the first iteration in every case.
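If you would rather stay with grep, a minimal sketch of the same check could look like the following. The helper name is_valid_rut is hypothetical, and it assumes the rule is "seven or eight digits followed by K or one more digit", with input already upper-cased.

```shell
# is_valid_rut is a hypothetical helper name used only for this sketch.
is_valid_rut() {
    # Anchored at both ends, so extra leading or trailing characters fail.
    # No + after the digit group and no | inside the character class.
    printf '%s\n' "$1" | grep -Eq '^[0-9]{7,8}[K0-9]$'
}

for rut in 184587939 17678977K 18661307Z 1845879398888888 1845E793; do
    if is_valid_rut "$rut"; then
        echo "$rut ok"
    else
        echo "$rut inv"
    fi
done
```

For a whole file, grep -E with the pattern and grep -vE with the same pattern would split the valid and invalid lines without any loop.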
You can try this:
^([0-9]{7,8})([K0-9])$
for
184587939
17678977K
186613074
and ^([0-9]{7,8})([A-Z0-9])$ for
184587939
17678977K
186613074
17678977A
18661307B
17678977U
18661307L

Shell script linux, validating integer

This code checks whether a value is an integer or not (I think). I'm trying to understand what each part of that line means by checking the grep man pages, but it's really difficult for me. I found it on the internet. Could anyone explain the grep part, and what each piece of it means?
echo $character | grep -Eq '^(\+|-)?[0-9]+$'
Thanks people!!!
Analyse this regex:
'^(\+|-)?[0-9]+$'
^ - Line Start
(\+|-)? - Optional + or - sign at start
[0-9]+ - One or more digits
$ - Line End
Overall it matches strings like +123 or -98765 or just 9
Here -E is for extended regex support and -q is for quiet in grep command.
PS: btw you don't need grep for this check and can do this directly in pure bash:
re='^(\+|-)?[0-9]+$'
[[ "$character" =~ $re ]] && echo "its an integer"
I like this cheat sheet for regex:
http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
It is very useful; with it you could easily analyze
'^(\+|-)?[0-9]+$'
as
^: Line must begin with...
(): grouping
\: ESC character (because + means something ... see below)
+|-: plus OR minus signs
?: 0 or 1 repetition
[0-9]: range of numbers from 0-9
+: one or more repetitions
$: end of line (no more characters allowed)
so it accepts strings like: -312353243 or +1243 or 5678
but does not accept: 3 456 or 6.789 or 56$ (with a dollar sign).
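To see the breakdown in action, here is a small sketch that wraps the check in a function and runs it over the examples above. The name is_integer is hypothetical; the regex is the one from the question.

```shell
# is_integer is a hypothetical helper name used only for this sketch.
is_integer() {
    # -E enables extended regex; -q prints nothing, so only the exit status matters.
    printf '%s\n' "$1" | grep -Eq '^(\+|-)?[0-9]+$'
}

for value in +123 -98765 9 '3 456' 6.789 '56$'; do
    if is_integer "$value"; then
        echo "'$value' is an integer"
    else
        echo "'$value' is not an integer"
    fi
done
```

Quoting "$1" matters here: unquoted, a value like "3 456" would be split into two words before grep ever saw it.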

Bash script grep for pattern in variable of text

I have a variable which contains text; I can echo it to stdout so I think the variable is fine. My problem is trying to grep for a pattern in that variable of text. Here is what I am trying:
ERR_COUNT=`echo $VAR_WITH_TEXT | grep "ERROR total: (\d+)"`
When I echo $ERR_COUNT the variable appears to be empty, so I must be doing something wrong.
How to do this properly? Thanks.
EDIT - Just wanted to mention that testing that pattern on the example text I have in the variable does give me something (I tested with: http://rubular.com)
However the regex could still be wrong.
EDIT2 - Not getting any results yet, so here's the string I'm working with:
ALERT line125: Alert: Cannot locate any description for 'asdf' in the qwer.xml hierarchy. (due to (?i-xsm:\balert?\b) ALERT in ../hgfd.controls) ALERT line126: Alert: Cannot locate any description for 'zxcv' in the qwer.xml hierarchy. (due to (?i-xsm:\balert?\b) ALERT in ../dfhg.controls) ALERT line127: Alert: Cannot locate any description for 'rtyu' in the qwer.xml hierarchy. (due to (?i-xsm:\balert?\b) ALERT in ../kjgh.controls) [1] 22280 IGNORE total: 0 WARN total: 0 ALERT total: 3 ERROR total: 23 [1] + Done /tool/pandora/bin/gvim -u NONE -U NONE -nRN -c runtime! plugin/**/*.vim -bg ...
That's the string, so hopefully there should be no ambiguity anymore... I want to extract the number "23" (after "ERROR total: ") into a variable and I'm having a hard time haha.
Cheers
You can use bash's =~ operator to extract the value.
[[ $VAR_WITH_TEXT =~ ERROR\ total:\ ([0-9]+) ]]
Note that you have to escape the spaces, or quote only
the fixed parts of the regular expression:
[[ $VAR_WITH_TEXT =~ "ERROR total: "([0-9]+) ]]
since quoting any of the metacharacters causes them to be treated
literally.
You can also save the regex in a variable:
regex="ERROR total: ([0-9]+)"
[[ $VAR_WITH_TEXT =~ $regex ]]
In any case, once the expression matches, the parenthesized expression
can be found in BASH_REMATCH array.
ERR_COUNT=${BASH_REMATCH[1]}
(The zeroth element contains the entire matched regular expression; the parenthesized subexpressions are found in the remaining elements in the order they appear in the full regex.)
If you want to use grep, you'll need a version that can accept Perl-style regexes.
ERR_COUNT=$( echo "$VAR_WITH_TEXT" | grep -Po "(?<=ERROR total: )\d+" )
As long as you need to use Perl-style regexes to enable the look-behind assertion, you can replace [0-9] with \d.
Your error is in the pattern: (\d+) matches:
'('
a digit
'+'
')'
According to your comment, what you want is \(\d\+\), which:
defines a sub-pattern by \( ... \)
Inside it matches at least one (\+) digit (\d).
In this case, if you don't need a sub-pattern, you can just drop the \( and \).
Note: if your grep doesn't understand \d, you can replace it by [0-9]. Easiest way is to write grep '\d' and test it by writing a couple test lines.
# setting example data
test="adfa\nfasetrfaqwe\ndsfa ERROR total: 32514235dsfaewrf"
one solution:
echo $(sed -n 's/^.*ERROR total: \([0-9]*\).*$/\1/p' < <(echo $test))
32514235
other solution:
# throw away everything up to "ERROR total: "
test=${test##*ERROR total: }
# cut everything from the first non-digit character onward, so the
# number does not have to be followed by a space
test=${test%%[!0-9]*}
echo $test
32514235
The \d is probably only recognized as a digit in perl regex mode, you probably want to use grep -P.
If you only want the number you could try:
ERR_COUNT=$(echo $VAR_WITH_TEXT | perl -pe "s/.*ERROR total: (\d+).*/\1/g")
or:
ERR_COUNT=$(echo "$VAR_WITH_TEXT" | sed -En "s/.*ERROR total: ([0-9]+).*/\1/p")
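If your sed has no -E option, the same extraction can be sketched with basic regular expressions only. The function name extract_error_total is hypothetical, and the sample text is a shortened version of the string from the question.

```shell
# extract_error_total is a hypothetical helper name used only for this sketch.
extract_error_total() {
    # Basic regex: \( \) capture, and [0-9][0-9]* stands in for "one or more digits".
    # -n plus the p flag prints only when the substitution succeeded.
    printf '%s\n' "$1" | sed -n 's/.*ERROR total: \([0-9][0-9]*\).*/\1/p'
}

text='IGNORE total: 0 WARN total: 0 ALERT total: 3 ERROR total: 23 [1] + Done'
err_count=$(extract_error_total "$text")
echo "$err_count"
```

When the text contains no "ERROR total:" the function prints nothing, so err_count ends up empty rather than holding the whole input.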