I'm writing a bash script that analyses an HTML file, and
I want to get the content of each <tr>...</tr> element. My command looks like:
$ tr -d \\012 < price.html | grep -oE '<tr>.*?</tr>'
But grep seems to give me the same result as:
$ tr -d \\012 < price.html | grep -oE '<tr>.*</tr>'
How can I make .* non-greedy?
If you have GNU Grep you can use -P to make the match non-greedy:
$ tr -d \\012 < price.html | grep -Po '<tr>.*?</tr>'
The -P option enables Perl Compatible Regular Expressions (PCRE), which are needed for non-greedy matching with ?, because Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE) do not support it.
If you are using -P you could also use lookarounds to avoid printing the tags in the match, like so:
$ tr -d \\012 < price.html | grep -Po '(?<=<tr>).*?(?=</tr>)'
If you don't have GNU grep and the HTML is well formed you could just do:
$ tr -d \\012 < price.html | grep -o '<tr>[^<]*</tr>'
Note: The above example won't work with nested tags within <tr>.
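To see the difference on a small input, here is a minimal sketch. It assumes GNU grep built with PCRE support, and the sample string is made up for illustration rather than taken from price.html:

```shell
# Hypothetical sample: two table rows on a single line.
html='<tr>a</tr><tr>b</tr>'

# ERE has no lazy quantifier, so .* runs to the last </tr> on the line.
greedy=$(printf '%s\n' "$html" | grep -oE '<tr>.*</tr>')

# With -P, .*? stops at the first </tr>, giving one match per row.
lazy=$(printf '%s\n' "$html" | grep -oP '<tr>.*?</tr>')

printf 'greedy:\n%s\nlazy:\n%s\n' "$greedy" "$lazy"
```

The greedy form returns the whole line as one match, while the lazy form returns each row on its own line.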
Non-greedy matching is not part of the Extended Regular Expression syntax supported by grep -E. Use grep -P instead if you have that, or switch to Perl / Python / Ruby / what have you. (Oh, and pcregrep.)
Of course, if you really mean
<tr>[^<>]*</tr>
you should say that instead; then plain old grep will work fine.
You could (tediously) extend the regex to accept nested tags which are not <tr> but of course, it's better to use a proper HTML parser than spend a lot of time rediscovering why regular expressions are not the right tool for this.
.*? is Perl regular expression syntax. Change your grep to
grep -oP '<tr>.*?</tr>'
Try a Perl-style regexp:
$ grep -Po '<tr>.*?</tr>' input
<tr>stuff</tr>
<tr>more stuff</tr>
I'm trying to figure out how to grep for lines that are made up of A-Z and a-z exclusively, that is, the "American" alphabet of letters. I would expect this to work, but it does not:
$ echo -e "Jutland\nJastrząb" | grep -x '[A-Za-z]*'
Jutland
Jastrząb
I want this to only print "Jutland", because ą is not a letter in the American alphabet. How can I achieve this?
You need to add LC_ALL=C before grep:
printf '%b\n' "Jutland\nJastrząb" | LC_ALL=C grep -x '[A-Za-z]*'
Jutland
You can also use the -i switch to ignore case and shorten the regex:
printf '%b\n' "Jutland\nJastrząb" | LC_ALL=C grep -ix '[a-z]*'
LC_ALL=C avoids locale-dependent effects; otherwise, your current locale may treat ą as part of the range [a-zA-Z].
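If you need this in several places, the locale override can be wrapped in a small function. This is only a sketch, and only_ascii_letters is a made-up name:

```shell
# Keep only lines made entirely of ASCII letters, whatever the caller's
# locale is. only_ascii_letters is a hypothetical helper name.
only_ascii_letters() {
  LC_ALL=C grep -x '[A-Za-z]*'
}

result=$(printf '%s\n' 'Jutland' 'Jastrząb' "L'Egyptienne" | only_ascii_letters)
printf '%s\n' "$result"
```

Because the pattern uses *, an empty line would also pass; use '[A-Za-z][A-Za-z]*' if empty lines should be rejected.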
You can use perl regex:
$ echo -e "Jutland\nJastrząb" | grep -P '^[[:ascii:]]+$'
Jutland
It's experimental though:
-P, --perl-regexp
Interpret the pattern as a Perl-compatible regular expression (PCRE). This is experimental and
grep -P may warn of unimplemented features.
EDIT
For letters only, use [A-Za-z]:
$ echo -e "L'Egyptienne\nJutland\nJastrząb" | grep -P '^[A-Za-z]+$'
Jutland
I was trying to write a regex in grep to match square brackets, i.e. in [ad] it should match [ and ]. But I get different results with capturing groups and with character classes. The result also changes when I put ' at the beginning and end of the regex string.
These are the different results I am getting.
Using capturing groups works fine
echo "[ad]" | grep -E '(\[|\])'
[ad]
Using capturing groups without ' gives syntax error
echo "[ad]" | grep -E (\[|\])
bash: syntax error near unexpected token `('
Using character class with [ followed by ] gives no output
echo "[ad]" | grep -E [\[\]]
Using character class with ] followed by [ works correctly
echo "[ad]" | grep -E [\]\[]
[ad]
Using character class with ] followed by [ and using ' does not work
echo "[ad]" | grep -E '[\]\[]'
It'd be great if someone could explain the difference between them.
You should know about:
BRE ( = Basic Regular Expression )
ERE ( = Extended Regular Expression )
In BRE, several metacharacters (such as the grouping parentheses and interval braces) require a backslash to give them their special meaning, and grep uses BRE by default.
The ERE flavor standardizes a syntax similar to the one used by the UNIX egrep command.
Pay attention to -E and -G
grep --help
Usage: grep [OPTION]... PATTERN [FILE]...
Search for PATTERN in each FILE or standard input.
PATTERN is, by default, a basic regular expression (BRE).
Example: grep -i 'hello world' menu.h main.c
Regexp selection and interpretation:
-E, --extended-regexp PATTERN is an extended regular expression (ERE)
-F, --fixed-strings PATTERN is a set of newline-separated strings
-G, --basic-regexp PATTERN is a basic regular expression (BRE)
-P, --perl-regexp PATTERN is a Perl regular expression
...
...
POSIX Basic Regular Expressions
POSIX Extended Regular Expressions
POSIX Bracket Expressions
You should also know about bash, since some of your results come from the bash interpreter, not from grep or anything else:
echo "[ad]" | grep -E (\[|\])
Here the unquoted ( is a shell metacharacter, so bash tries to parse it as its own syntax, similar to:
echo $(( 10 * 10 ))
By using single quotes ' you tell bash not to treat those characters as special operators. So
echo "[ad]" | grep -E '(\[|\])'
is correct.
Firstly, always quote the regex pattern so that the shell does not interpret it first:
$ echo "[ad]" | grep -E '(\[|\])'
[ad]
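Secondly, inside a quoted bracket expression [] you don't need to escape [ and ], and a backslash there is taken literally anyway. Just write them as-is within the outer [], with ] first so that it is not read as the closing bracket: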
$ echo "[ad]" | grep -E '[][]'
[ad]
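The different results in the question come down to two rules: bash strips backslashes from unquoted words before grep sees them, and inside a bracket expression a backslash is just a literal character. A small sketch, with every pattern quoted so that grep receives exactly what is written (matches is a made-up helper name):

```shell
# matches PATTERN INPUT: prints yes or no (hypothetical helper).
matches() {
  if printf '%s\n' "$2" | grep -qE "$1"; then echo yes; else echo no; fi
}

r1=$(matches '[][]' '[ad]')     # ] first, then [: one class holding both brackets
r2=$(matches '[[]]' '[ad]')     # what unquoted [\[\]] becomes: class "[", then a literal ]
r3=$(matches '[[]]' '[]')       # so it needs "[" directly followed by "]"
r4=$(matches '[\]\[]' '[ad]')   # quoted: class holding a backslash, then literal [ and ]
r5=$(matches '[\]\[]' '\[]')    # so it matches a backslash followed by []

printf '%s %s %s %s %s\n' "$r1" "$r2" "$r3" "$r4" "$r5"
```

This should print yes no yes no yes, which lines up with the outputs reported in the question.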
Maybe you provided such a simple example on purpose (after all, it is minimal), but in case all you really want is to check for existence of square brackets (a fixed string, not regex pattern), you can use grep with -F/--fixed-strings and multiple -e options:
$ echo "[ad]" | grep -F -e '[' -e ']'
[ad]
Or, a little bit shorter with fgrep (which recent GNU grep releases flag as obsolescent in favor of grep -F):
$ echo "[ad]" | fgrep -e '[' -e ']'
[ad]
Or, even:
$ echo "[ad]" | fgrep -e[ -e]
[ad]
I'm looking for a regular expression pattern that matches two identical adjacent characters (to be used with grep on Linux).
For example:
Catt
Puppet
Worry
Fool
Something like this with GNU grep?
echo 'Catt Puppet Worry Fool' | grep -E '(.)\1'
or
echo 'Catt Puppet Worry Fool' | grep -oE '(.)\1'
Update:
Try this to get the complete words:
echo 'Catt Puppet Worry Fool' | grep -Po '[^ ]*(.)\1[^ ]*'
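If the words are one per line instead, the back-reference alone works as a filter. A minimal sketch, assuming a grep that accepts back-references with -E (GNU grep does; it is an extension to POSIX ERE), with Cat and Dog added as made-up non-matching words:

```shell
# One word per line; Cat and Dog have no doubled letter.
words=$(printf '%s\n' Catt Puppet Worry Fool Cat Dog)

# (.) captures any character and \1 requires the same character again.
doubled=$(printf '%s\n' "$words" | grep -E '(.)\1')

printf '%s\n' "$doubled"
```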
I am trying to use the following syntax with grep
cat Notes.rtf | grep -i "\(D*\)ters"
The result is fine. However, when I attempt to use
cat Notes.rtf | grep -i "\(D+\)ters"
There is no result.
I came across this page, and it seems that regex does not support +.
Is that correct? Is there an equivalent to + in grep? Is there a better alternative to grep for the Mac OS X terminal?
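In a basic regular expression (grep's default), + is an ordinary character; GNU grep accepts \+ as an extension, and \{1,\} is the portable spelling. With the -E flag (extended regex) or the -P flag (PCRE, a GNU grep feature that may be missing from the stock OS X grep) you can write + unescaped. Note that \D (matching a non-digit) is a PCRE escape, so it only has that meaning with -P:
grep -Pi "\D+ters" Notes.rtf
To match one or more literal D characters before ters, use -E with a plain D:
grep -Ei "(D+)ters" Notes.rtf
Either way, + need not be escaped when using the P or E flags.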
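For a grep without -P, "one or more" can still be written in both POSIX flavors. A minimal sketch on a made-up sample line, matching a literal D rather than \D:

```shell
# Hypothetical sample text.
line='Dters DDters ters'

# BRE (grep's default): the portable interval \{1,\} means "one or more".
bre=$(printf '%s\n' "$line" | grep -o 'D\{1,\}ters')

# ERE (-E): + is a quantifier and needs no backslash.
ere=$(printf '%s\n' "$line" | grep -oE 'D+ters')

printf 'BRE:\n%s\nERE:\n%s\n' "$bre" "$ere"
```

Both forms find Dters and DDters and skip the bare ters.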
I'm writing a bash script in which I need to extract IPv4 and IPv6 Address Ranges from multiple strings and then format it as per the requirements before saving to the file.
I've got the regex working fine: http://regexr.com?38jsb (Not optimized, roughly added)
However, in bash it throws an error when I use it with egrep: egrep: repetition-operator operand invalid
Here's my bash script:
#!/bin/bash
regex="(?>(?>([a-f\d]{1,4})(?>:(?1)){3}|(?!(?:.*[a-f\d](?>:|$)){})((?1)(?>:(?1)){0,6})?::(?2)?)|(?>(?>(?1)(?>:(?1)){5}:|(?!(?:.*[a-f\d]:){6,})(?3)?::(?>((?1)(?>:(?1)){0,4}):)?)?(25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)(?>\.(?4)){3}))\/\d{1,2}"
echo "v=abc ip4:127.0.0.1/19 ip4:192.168.1.1/32 ip4:192.168.2.50/20 ip6:2001:4860:4000::/36 ip6:2404:6800:4000::/36 ip6:2607:f8b0:4000::/36 ip6:2800:3f0:4000::/36 ip6:2a00:1450:4000::/36 ip6:2c0f:fb50:4000::/36 ~all" | egrep -o $regex
How can I extract both types of IP ranges in bash? Is there a better solution?
Note: I'm using sample data for testing purposes.
First, single-quote the regex variable assignment (regex='...').
Then use grep -Po (and double-quote $regex), as @BroSlow suggests. -P activates support for PCREs (Perl-Compatible Regular Expressions), which your regex requires; note that -P is not available on all platforms (e.g., OSX).
To put it all together:
regex='(?>(?>([a-f\d]{1,4})(?>:(?1)){3}|(?!(?:.*[a-f\d](?>:|$)){})((?1)(?>:(?1)){0,6})?::(?2)?)|(?>(?>(?1)(?>:(?1)){5}:|(?!(?:.*[a-f\d]:){6,})(?3)?::(?>((?1)(?>:(?1)){0,4}):)?)?(25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)(?>\.(?4)){3}))\/\d{1,2}'
txt="v=abc ip4:127.0.0.1/19 ip4:192.168.1.1/32 ip4:192.168.2.50/20 ip6:2001:4860:4000::/36 ip6:2404:6800:4000::/36 ip6:2607:f8b0:4000::/36 ip6:2800:3f0:4000::/36 ip6:2a00:1450:4000::/36 ip6:2c0f:fb50:4000::/36 ~all"
echo "$txt" | grep -Po "$regex"
Alternative: Following @l'L'l's example, here's a greatly simplified solution that works with the sample data (again relies on -P):
echo "$txt" | grep -Po '\bip[46]:\K[^ ]+'
Variant for OSX, where grep doesn't support -P:
echo "$txt" | egrep -o '\<ip[46]:[^ ]+' | cut -c 5-
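Since the question also mentions formatting the ranges before saving them, here is one possible way to split the matches by address family without -P. It is only a sketch on shortened sample data; the variable names are made up, and it does not validate the addresses:

```shell
# Shortened sample data in the same shape as the question's input.
txt='v=abc ip4:127.0.0.1/19 ip4:192.168.1.1/32 ip6:2001:4860:4000::/36 ~all'

# Extract every ip4:/ip6: token, one per line.
tokens=$(printf '%s\n' "$txt" | grep -oE 'ip[46]:[^ ]+')

# Strip the prefix and keep only the lines where it was present.
v4=$(printf '%s\n' "$tokens" | sed -n 's/^ip4://p')
v6=$(printf '%s\n' "$tokens" | sed -n 's/^ip6://p')

printf 'IPv4:\n%s\nIPv6:\n%s\n' "$v4" "$v6"
```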
This pattern should work in combination with sed:
str="v=abc ip4:127.0.0.1/19 ip4:192.168.1.1/32 ip4:192.168.2.50/20 ip6:2001:4860:4000::/36 ip6:2404:6800:4000::/36 ip6:2607:f8b0:4000::/36 ip6:2800:3f0:4000::/36 ip6:2a00:1450:4000::/36 ip6:2c0f:fb50:4000::/36 ~all"
echo "$str" | grep -s -i -o 'ip[0-9]:[a-z0-9.:/]*' --color=always | sed 's/ip[0-9]://g'
output:
127.0.0.1/19
192.168.1.1/32
192.168.2.50/20
2001:4860:4000::/36
2404:6800:4000::/36
2607:f8b0:4000::/36
2800:3f0:4000::/36
2a00:1450:4000::/36
2c0f:fb50:4000::/36
Omit the --color=always option if you do not want colored output.