Regular expression operator {} in linux bash - regex

I'm having some problems with the {} operator. In the following examples, I'm trying to find the rows with 1, 2, and 2 or more occurrences of the word mint, but I get a response only if I search for 1 occurrence of mint, even though there are more than 1 per row.
The input I am processing is a listing like this, obtained with the ls -l command:
-rw-r--r-- 1 mint mint 26 Dec 20 21:11 example.txt
-rw-r--r-- 1 mint mint 26 Dec 20 21:11 another.example
-rw-r--r-- 1 mint mint 19 Dec 20 15:11 something.else
-rw-r--r-- 1 mint mint 1 Dec 20 01:23 filemint
-rw-r--r-- 1 mint mint 26 Dec 20 21:11 mint
With ls -l | grep -E 'mint{1}' I find all the rows above, and I expected to find nothing (should be all the rows with 1 occurrence of mint).
With ls -l | grep -E 'mint{2}' I find nothing, and I expected to find the first 3 rows above (should be all the rows with 2 occurrences of mint).
With ls -l | grep -E 'mint{2,}' I expected to find all the rows above, and again I found nothing (should be all the rows with at least 2 occurrences of mint).
Am I missing something on how {} works?

Firstly, a "quantifier" in a regular expression refers to the "token" immediately before it, which by default is a single character. So mint{2} is looking for the character t twice - it is equivalent to m{1}i{1}n{1}t{2}, or mintt.
To search for a sequence of characters a number of times, you need to group that sequence, using parentheses. So (mint){2} would search for the sequence mint twice in a row, as in mintmint.
Secondly, in your input, there are additional characters in between the occurrences of mint; the regular expression needs to specify that those are allowed.
The simplest way to do that is using the pattern .*, which means "anything, zero or more times". That gives you (mint.*){2} which will match "mint followed by anything, twice".
Finally, given the input "mint mint", the pattern (mint.*){1} will match - it doesn't care that some of the "extra" characters also spell "mint", it just knows that the required parts are there. In fact, {1} is always redundant, and (mint.*){1} matches exactly the same things that just mint matches. In general, regular expressions are good at asserting what is there, and not at asserting what is not there.
Some regular expression flavours have "lookahead assertions" which can process negative assertions like "not followed by mint", but grep -E does not. What it does have is a switch, -v, which inverts the whole command - it shows all lines except the ones matched by the regular expression. A simple approach to say "no more than 1 instance of mint" is therefore to run grep twice - once normally, and once with -v:
# At least once, but not twice -> exactly once
ls -l | grep -E 'mint' | grep -v -E '(mint.*){2}'
# At least twice, but not three times -> exactly twice
ls -l | grep -E '(mint.*){2}' | grep -v -E '(mint.*){3}'
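As a minimal sketch of the two-pass approach, the listing from the question can be fed to grep from a variable instead of ls -l. The variable names listing, exactly_two and three_or_more are only for illustration:

```shell
# Sample listing copied from the question (stands in for the output of ls -l)
listing='-rw-r--r-- 1 mint mint 26 Dec 20 21:11 example.txt
-rw-r--r-- 1 mint mint 26 Dec 20 21:11 another.example
-rw-r--r-- 1 mint mint 19 Dec 20 15:11 something.else
-rw-r--r-- 1 mint mint 1 Dec 20 01:23 filemint
-rw-r--r-- 1 mint mint 26 Dec 20 21:11 mint'

# At least twice, but not three times -> exactly twice
exactly_two=$(printf '%s\n' "$listing" | grep -E '(mint.*){2}' | grep -v -E '(mint.*){3}')

# Count the lines where mint occurs three or more times
three_or_more=$(printf '%s\n' "$listing" | grep -c -E '(mint.*){3}')

printf '%s\n' "$exactly_two"
printf 'lines with 3+ occurrences: %s\n' "$three_or_more"
```

On this sample, the first three lines contain mint exactly twice; the last two have a third occurrence in the file name.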

Related

How to get number in range 10-20 in grep

I need to extract from this file the lines that start with a number in the range 10-20. I tried grep "[10-20]" tmp_file.txt, but on a file that has this format
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.aa
12.bbb
13.cccc
14.ddddd
15.eeeeee
16.fffffff
17.gggggggg
18.hhhhhhhhh
19.iiiiiiiiii
20.jjjjjjjjjjj
21.
it returned everything and marked every number that contains either 1, 0, 10, 2, 0, 20 or 21 :/
With an extended regular expression (-E):
grep -E '^(1[0-9]|20)\.' file
Output:
10.aa
12.bbb
13.cccc
14.ddddd
15.eeeeee
16.fffffff
17.gggggggg
18.hhhhhhhhh
19.iiiiiiiiii
20.jjjjjjjjjjj
See: The Stack Overflow Regular Expressions FAQ
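To see why the original attempt matched so much: [10-20] is a bracket expression, which matches a single character out of the set 1, 0-2, 0 (that is, 0, 1 or 2), not a number between 10 and 20. A small sketch on a shortened sample (the sample lines and variable names are made up for illustration):

```shell
# Shortened, illustrative sample in the same format as the question's file
sample='9.
10.aa
15.eeeeee
20.jjjjjjjjjjj
21.
100.x'

# Bracket expression: counts every line containing a 0, 1 or 2 anywhere
bracket_hits=$(printf '%s\n' "$sample" | grep -c '[10-20]')

# Anchored alternation: only lines starting with 10-19 or 20, then a literal dot
range_hits=$(printf '%s\n' "$sample" | grep -E '^(1[0-9]|20)\.')

printf 'bracket expression matched %s lines\n' "$bracket_hits"
printf '%s\n' "$range_hits"
```

Only the line 9. escapes the bracket expression here, while the anchored pattern keeps just the lines for 10, 15 and 20.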
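Another one, with awk, using a range pattern from the line starting with 10. through the line starting with 20.:
awk '/^10\./,/^20\./' tmp_file.txt
For example, with a narrower range:
awk '/^10\./,/^13\./' tmp_file.txt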
10.aa
12.bbb
13.cccc
Try
grep -w -e '^1[[:digit:]]' -e '^20' tmp_file.txt
-w forces matches of whole words. That prevents matching lines like 100.... It's not POSIX, but it's supported by every grep that most people will encounter these days. Use grep -e '^1[[:digit:]]\.' -e '^20\.' ... if you are concerned about portability.
The -e option can be used multiple times to specify multiple patterns.
[[:digit:]] may be more reliable than [0-9]. See In grep command, can I change [:digit:] to [0-9]?.
Assuming the file might not be sorted and using numeric comparison
awk -F. '$1 >= 10 && $1 <= 20' < file.txt
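A short sketch of the difference that ordering makes, assuming a deliberately shuffled sample (the data and variable names are illustrative). The numeric comparison tests each line on its own, while an awk range pattern such as /^10\./,/^20\./ starts printing at the line beginning with 10. and keeps going until a later line begins with 20.:

```shell
# Illustrative, deliberately unsorted sample
unsorted='20.jjj
3.
10.aa
21.
15.eee
100.x'

# Numeric comparison on the first dot-separated field: order does not matter
by_value=$(printf '%s\n' "$unsorted" | awk -F. '$1 >= 10 && $1 <= 20')

# Range pattern: starts at 10.aa and never sees a later 20. line, so it runs to the end
by_range=$(printf '%s\n' "$unsorted" | awk '/^10\./,/^20\./')

printf 'numeric comparison:\n%s\n' "$by_value"
printf 'range pattern:\n%s\n' "$by_range"
```

On this input the range pattern also prints 21. and 100.x, and misses 20.jjj, which is why it only suits files that are already in order.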
grep is not the tool for this, because grep finds text patterns, but does not understand numeric values. Making patterns that match the 11 values from 10-20 is like stirring a can of paint with a screwdriver. You can do it, but it's not the right tool for the job.
A much clearer way to do this is with Perl:
$ perl -n -e'print if /^(\d+)/ && $1 >= 10 && $1 <= 20' foo.txt
This says to print a line of the file if the beginning of the line ^ matches one or more digits \d+ and if the numeric value of what was matched $1 is between the values of 10 and 20.

How to print portion of linux filenames that match a regex

I would like to list all files in a Linux directory, apply a regular expression to each file name to reformat it, and print the reformatted names.
Example:
ls -lthrh
.
.
-rwxrwxrwx. 1 root root 633 Oct 31 2016 Oracle_Schedule_ARC-Oracle_ARCH-1477938600005-1002-Oracleorcl-rman1.txt
-rwxrwxrwx. 1 root root 610 Nov 7 2016 MOD-1478512353102-1002-Oracleorcl-rman1.txt
After applying my regex '.+?(?=-)' I would have everything before the first '-' to be:
Oracle_Schedule_ARC
MOD
I've tried using awk, but I couldn't pass a regex to it. Later I will pipe the result through | sort | uniq to get unique values.
In any POSIX shell (bash, pdksh, ksh93, zsh, dash):
for name in *; do
    printf '%s\n' "${name%%-*}"
done
This would go through all the names in the current directory and output the bit before the first - character. It does this by removing the longest suffix string matching -* from the filename using a standard parameter substitution.
Note that -* is a shell globbing pattern, not a regular expression. Regular expressions are useful for working on text, but globbing patterns are fast and efficient for working with filenames and pathnames in general, as you don't have to start another process with a regex engine, such as awk or sed.
In bash, you could also get away from using a loop at all:
set -- *
printf '%s\n' "${@%%-*}"
This first sets the positional parameters to the names in the current directory. printf is then invoked on the set of names, each individually transformed with the same parameter substitution as in the first part of this answer.
The same thing, but using an array variable other than the array of positional parameters:
names=( * )
printf '%s\n' "${names[@]%%-*}"
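A quick way to try the substitution without touching a real directory is to run it on the two file names from the question. In this sketch, prefix_of is a hypothetical helper name; the last part contrasts %% (longest matching suffix) with % (shortest matching suffix):

```shell
# Hypothetical helper: print everything before the first "-" in a name
prefix_of() {
    printf '%s\n' "${1%%-*}"
}

prefix_of 'Oracle_Schedule_ARC-Oracle_ARCH-1477938600005-1002-Oracleorcl-rman1.txt'
prefix_of 'MOD-1478512353102-1002-Oracleorcl-rman1.txt'

# For contrast: a single % removes only the shortest suffix matching -*
name='MOD-1478512353102-1002-Oracleorcl-rman1.txt'
shortest="${name%-*}"
printf '%s\n' "$shortest"
```

The two prefix_of calls print Oracle_Schedule_ARC and MOD, while the single-% form only strips the final -rman1.txt.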

get number value between two strings using regex

I have a string with multiple value outputs that looks like this:
SD performance read=1450kB/s write=872kB/s no error (0 0), ManufactorerID 27 Date 2014/2 CardType 2 Blocksize 512 Erase 0 MaxtransferRate 25000000 RWfactor 2 ReadSpeed 22222222Hz WriteSpeed 22222222Hz MaxReadCurrentVDDmin 3 MaxReadCurrentVDDmax 5 MaxWriteCurrentVDDmin 3 MaxWriteCurrentVDDmax 1
I would like to output only the read value (1450kB/s) using bash and sed.
I tried
sed 's/read=\(.*\)kB/\1/'
but that outputs read=1450kB but I only want the number.
Thanks for any help.
Sample input shortened for demo:
$ echo 'SD performance read=1450kB/s write=872kB/s no error' | sed 's/read=\(.*\)kB/\1/'
SD performance 1450kB/s write=872/s no error
$ echo 'SD performance read=1450kB/s write=872kB/s no error' | sed 's/.*read=\(.*\)kB.*/\1/'
1450kB/s write=872
$ echo 'SD performance read=1450kB/s write=872kB/s no error' | sed 's/.*read=\([0-9]*\)kB.*/\1/'
1450
Since the entire line has to be replaced, add .* before and after the search pattern.
* is greedy and will try to match as much as possible, so in the 2nd example it matched all the way up to the value of write.
Since only the digits after read= are needed, use [0-9] instead of .
Running
sed 's/read=\(.*\)kB/\1/'
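will replace only the part of the line that the pattern matches (meant to be read=[digits]kB) with the captured text, leaving the rest of the line in place. If you want to replace the whole string, use
sed 's/.*read=\([0-9]*\)kB.*/\1/'
instead.
As Sundeep noticed, sed doesn't support non-greedy patterns, so this answer was updated to use [0-9]* instead.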
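A small sketch that wraps the final command, with -n and the p flag added so that a line without the field prints nothing instead of passing through unchanged. The shortened sample line and the variable names are illustrative:

```shell
# Shortened sample line from the question
line='SD performance read=1450kB/s write=872kB/s no error'

# -n suppresses default output; the p flag prints only lines where the substitution happened
read_speed=$(printf '%s\n' "$line" | sed -n 's/.*read=\([0-9]*\)kB.*/\1/p')
write_speed=$(printf '%s\n' "$line" | sed -n 's/.*write=\([0-9]*\)kB.*/\1/p')

# A line without a read= field yields an empty result rather than the whole line
missing=$(printf '%s\n' 'no speeds reported' | sed -n 's/.*read=\([0-9]*\)kB.*/\1/p')

printf 'read=%s write=%s missing=[%s]\n' "$read_speed" "$write_speed" "$missing"
```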

Grep ascending order of cards. Why does it work?

The collection of cards I need to grep is defined as:
{h ∈ H | h contains only cards in ascending order regardless of their suit}
Example:
h = Ah2c2d3s5h6d8s8d9h9cTdTcKh
h != 3d4dQc3sKcAh2sAc7hKdKsKh4h62 (Q is followed by lower rank 3)
The ascending ranks of cards are:
A(ace) 2 3 4 5 6 7 8 9 T(ten) J Q K
The suits are defined as such:
c(clover) s(spade) h(heart) d(diamond)
I have tried the following grep and it is correct, but I still don't understand why it works.
Edit: added the -P flag (I had forgotten it); as tripleee pointed out, plain grep -v with this pattern is indeed invalid.
grep -Pv "[KQJT].*[2-9A].* |[KQ].*[JT].* |[6-9].*[2-5A].* "
What baffles me is how K followed by Q got matched with this pattern or even 5 followed by [A2-4]
The solution has a total of 31027 lines
The text file provided for the exercise can be found here:
http://computergebruik.ugent.be/oefeningenreeks1/kaarten1.txt
Your regex is not at all valid, so I don't understand why you say it works.
Plain grep does not understand | to mean alternation. You can add an -E option to specify ERE (traditionally, egrep) regex semantics, or, with GNU grep, backslash the | (a GNU extension to basic regular expressions); or you can specify multiple -e options. (See e.g. https://en.wikipedia.org/wiki/Regular_expression#Standards for some background about the various regex dialects in common use.)
grep -Ev "[KQJT].*[2-9A].* |[KQ].*[JT].* |[6-9].*[2-5A].* "
grep -v "[KQJT].*[2-9A].* \|[KQ].*[JT].* \|[6-9].*[2-5A].* "
grep -ve "[KQJT].*[2-9A].* " -e "[KQ].*[JT].* " -e "[6-9].*[2-5A].* "
Even with this fix, the regex is obviously insufficient for removing matches where e.g. 3 is followed by 2. The only way to make it cover all cases is to enumerate every possibility. (Disallow 2 followed by any lower rank, 3 followed by any lower rank, 4 followed by any lower rank, etc.) An altogether better approach would be to use a scripting language of some sort, and basically just map the symbols to ones with the desired sort order, then check if the input is sorted.
If that is not an option, maybe try
grep -E '^(A.)*(2.)*(3.)*(4.)*(5.)*(6.)*(7.)*(8.)*(9.)*(T.)*(J.)*(Q.)*(K.)* '
which looks for zero or more aces, followed by zero or more twos, followed by zero or more threes, etc.
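The positive pattern can be tried on the two hands quoted in the question. This sketch swaps the trailing space in the pattern above for a $ anchor, on the assumption that each hand is tested on its own with nothing after the last card; is_ascending is a hypothetical helper name:

```shell
# Hypothetical helper: succeeds if the hand's ranks never decrease.
# Each (X.) group is one rank character followed by any suit character.
is_ascending() {
    printf '%s\n' "$1" |
        grep -qE '^(A.)*(2.)*(3.)*(4.)*(5.)*(6.)*(7.)*(8.)*(9.)*(T.)*(J.)*(Q.)*(K.)*$'
}

for hand in 'Ah2c2d3s5h6d8s8d9h9cTdTcKh' '3d4dQc3sKcAh2sAc7hKdKsKh4h62'; do
    if is_ascending "$hand"; then
        printf '%s: ascending\n' "$hand"
    else
        printf '%s: not ascending\n' "$hand"
    fi
done
```

Because the groups appear in rank order and each can only repeat its own rank, a lower rank after a higher one leaves characters that no remaining group can consume, so the match fails.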

How can this regex let a line like "0.0083" pass? grep -ioE '([0-9]{1,3}.){3}[0-9]{1,3}'

I am trying to make a bash script for an active scan of a network. It seems I don't have the hang of regex. The code looks like this:
#!/bin/bash
cd /home/pi/int_lib
for word in $(nmap -sn 192.168.1.0/24 | grep -ioE '([0-9]{1,3}.){3}[0-9]{1,3}' |
    grep -v -)
do
    mac=$(arp $word | grep -ioE '([A-Fa-f0-9]{2}:){5}[A-Fa-f0-9]{2}')
    echo $word: $mac
done
I just want to know how it is possible that a line like "0.0083" can pass the first regex. nmap gives the response time for each host, and in exactly one case the mentioned line passes the filter. Why?
The regex
([0-9]{1,3}.){3}[0-9]{1,3}
matches 1-3 digits followed by any character, 3 times, followed by 1-3 digits. That sums up to at least 7 characters/digits. Illustrated with n as digits, it can look like this
n.n.n.n
where . is any character, up to its longest form
nnn.nnn.nnn.nnn
Since 0.0083 is only 6 characters long, it can never match that regex.
But... simply adding a digit, e.g. 0.00831 makes it match.
Finally, I believe what you're after is the same, but with the . escaped, thus only matching dot.
([0-9]{1,3}\.){3}[0-9]{1,3}
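The effect of escaping the dot can be checked directly. In this sketch, loose and strict are illustrative variable names for the two patterns, and matches is a hypothetical helper that reports whether grep -E finds the pattern anywhere in a string:

```shell
loose='([0-9]{1,3}.){3}[0-9]{1,3}'     # unescaped dot: any character
strict='([0-9]{1,3}\.){3}[0-9]{1,3}'   # escaped dot: a literal "."

# Hypothetical helper: matches PATTERN STRING
matches() {
    printf '%s\n' "$2" | grep -qE "$1"
}

for candidate in '0.0083' '0.00831' '192.168.1.10'; do
    if matches "$loose" "$candidate"; then l=yes; else l=no; fi
    if matches "$strict" "$candidate"; then s=yes; else s=no; fi
    printf '%-14s loose=%s strict=%s\n' "$candidate" "$l" "$s"
done
```

The 7-character value 0.00831 gets through the loose pattern because the unescaped dot is free to match a digit, while the strict pattern accepts only the dotted address.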