regular expression for adjacent same character - regex

I'm looking for a regular expression pattern that can match adjacent 2 same characters, (used for grepping in Linux).
For example:
Catt
Puppet
Worry
Fool

Something like this with GNU grep?
echo 'Catt Puppet Worry Fool' | grep -E '(.)\1'
or
echo 'Catt Puppet Worry Fool' | grep -oE '(.)\1'
Update:
Try this to get the complete words:
echo 'Catt Puppet Worry Fool' | grep -Po '[^ ]*(.)\1[^ ]*'

Related

Using grep with a negative lookahead assertion

I have exactly the same question as in this post, however the regex isn't working for me, in bash. RegExp exclusion, looking for a word not followed by another
I want to include all lines of a csv file that include the word "Tom", except when it's followed by "Thumb".
Include: Tom sat by the seashore.
Don't include: Tom Thumb sat by the seashore.
Include: Tom and Tom Thumb sat by the seashore.
The regex Tom(?!\s+Thumb) works when I try it out on regex101.com.
But I've tried all these variations and none of them work. What am I missing and how can I work around this? I'm on a Mac.
cat inputfile.csv | grep Tom(?!\s+Thumb) > Tom.csv
cat inputfile.csv | egrep Tom(?!\s+Thumb) > Tom.csv
cat inputfile.csv | egrep “Tom(?!\s+Thumb)” > Tom.csv
cat inputfile.csv | grep -E Tom(?!\s+Thumb) > Tom.csv
cat inputfile.csv | grep -E “Tom(?!\s+Thumb)” > Tom.csv
You can't do this with POSIX ERE.
There is no negative lookahead assertion in POSIX extended regular expressions, which is the syntax grep -E activates.
The closest you can get is to combine two separate regexes, one positive match and one negative:
grep -we 'Tom' inputfile.csv | grep -wvEe 'Tom[[:space:]]Thumb'
grep -v excludes any line that matches the given expression; so here, we're first searching for Tom, and then removing Tom Thumb.
However, the intent to match Tom and Tom Thumb sat by the seashore makes this unworkable. In short: You can't do what you're asking for with standard grep, unless it has grep -P to make your original syntax valid. In that case you could use:
grep -Pwe 'Tom(?!\s+Thumb)' <inputfile.csv >Tom.csv
One hack might be a temporary substitution
Assuming you have uuidgen available (it appears to be present in Big Sur) to generate a temporary, unpredictable sigil:
uuid=$(uuidgen)
sed -e "s/Tom Thumb/$uuid/g" <inputfile.csv \
| grep -we 'Tom' \
| sed -e "s/$uuid/Tom Thumb/g" >tom.csv
How about a Perl solution:
perl -ne 'print if /Tom(?!\s+Thumb)/' inputfile.csv > Tom.csv
Perl obviously supports PCRE and pre-installed on Mac.
The -n option is mostly equivalent to that of sed.
It suppresses the automatic printing.
The -e option enables a one-liner by putting the immediate code.
The code print if /pattern/ is an idiom to print the matched line, which
may substitute grep command.
Keep it simple and just use awk, e.g. using any awk in any shell on every Unix box:
$ awk '{orig=$0; gsub(/Tom Thumb/,"")} /Tom/{print orig}' file
Include: Tom sat by the seashore.
Include: Tom and Tom Thumb sat by the seashore.
Grep can use Perl regular expressions (PCRE). From man grep:
-P, --perl-regexp
Interpret PATTERNS as Perl-compatible regular expressions (PCREs). This option is experimental when combined with the -z (--null-data) option, and grep -P may warn of unimplemented features.

Grep regex not working with square brackets

So I was trying to write a regex in grep to match square brackets, i.e [ad] should match [ and ]. But I was getting different results on using capturing groups and character classes. Also the result is different on putting ' in the beginning and end of regex string.
So these are the different result that I am getting.
Using capturing groups works fine
echo "[ad]" | grep -E '(\[|\])'
[ad]
Using capturing groups without ' gives syntax error
echo "[ad]" | grep -E (\[|\])
bash: syntax error near unexpected token `('
using character class with [ followed by ] gives no output
echo "[ad]" | grep -E [\[\]]
Using character class with ] followed by [ works correctly
echo "[ad]" | grep -E [\]\[]
[ad]
Using character class with ] followed by [ and using ' does not work
echo "[ad]" | grep -E '[\]\[]'
It'd be great if someone could explain the difference between them.
You should know about:
BRE ( = Basic Regular Expression )
ERE ( = Extended Regular Expression )
BRE metacharacters require a backslash to give them their special meaning and grep is based on
The ERE flavor standardizes a flavor similar to the one used by the UNIX egrep command.
Pay attention to -E and -G
grep --help
Usage: grep [OPTION]... PATTERN [FILE]...
Search for PATTERN in each FILE or standard input.
PATTERN is, by default, a basic regular expression (BRE).
Example: grep -i 'hello world' menu.h main.c
Regexp selection and interpretation:
-E, --extended-regexp PATTERN is an extended regular expression (ERE)
-F, --fixed-strings PATTERN is a set of newline-separated strings
-G, --basic-regexp PATTERN is a basic regular expression (BRE)
-P, --perl-regexp PATTERN is a Perl regular expression
...
...
POSIX Basic Regular Expressions
POSIX Extended Regular Expressions
POSIX Bracket Expressions
And you should also know about bash, since some of your input is related to bash interpreter not grep or anything else
echo "[ad]" | grep -E (\[|\])
Here bash assumes you try to use () something like:
echo $(( 10 * 10 ))
and by using single quote ' you tell the bash that you do not want it treats as a special operator for it. So
echo "[ad]" | grep -E '(\[|\])'
is correct.
Firstly, always quote Regex pattern to prevent shell interpretation beforehand:
$ echo "[ad]" | grep -E '(\[|\])'
[ad]
Secondly, within [] surrounded by quotes, you don't need to escape the [] inside, just write them as is within the outer []:
$ echo "[ad]" | grep -E '[][]'
[ad]
Maybe you provided such a simple example on purpose (after all, it is minimal), but in case all you really want is to check for existence of square brackets (a fixed string, not regex pattern), you can use grep with -F/--fixed-strings and multiple -e options:
$ echo "[ad]" | grep -F -e '[' -e ']'
[ad]
Or, a little bit shorter with fgrep:
$ echo "[ad]" | fgrep -e '[' -e ']'
[ad]
Or, even:
$ echo "[ad]" | fgrep -e[ -e]
[ad]

Why does grep return nothing?

I'm somewhat at ease with regex, but not with grep particularly, and can't figure out why the following regex returns nothing:
wget -qO- 'http://www.acme.com/index.html' | grep -iPo '(?s)(^<div class="titlebar">.+?<div class="colleft">)'
I prepended (?s) because the catch-all ".+?" includes carriage-returns (either CRLF, CR, or LF, depending on how the text was saved).
Any idea why it doesn't work as expected?
Thank you.
grep is line-oriented, so if there are newlines between the tags, grep can't find it. You'll want:
wget -qO- 'http://website.invalid/index.html' |
perl -0777 -nE 'say for /(^<div class="titlebar">.+?<div class="colleft">)/msg'

Extract IPv4 and IPv6 Address Ranges in Bash?

I'm writing a bash script in which I need to extract IPv4 and IPv6 Address Ranges from multiple strings and then format it as per the requirements before saving to the file.
I've got the regex working fine: http://regexr.com?38jsb (Not optimized, roughly added)
However, with bash it throws an error if i use with egrep which states egrep: repetition-operator operand invalid
Here's my bash script:
#!/bin/bash
regex="(?>(?>([a-f\d]{1,4})(?>:(?1)){3}|(?!(?:.*[a-f\d](?>:|$)){})((?1)(?>:(?1)){0,6})?::(?2)?)|(?>(?>(?1)(?>:(?1)){5}:|(?!(?:.*[a-f\d]:){6,})(?3)?::(?>((?1)(?>:(?1)){0,4}):)?)?(25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)(?>\.(?4)){3}))\/\d{1,2}"
echo "v=abc ip4:127.0.0.1/19 ip4:192.168.1.1/32 ip4:192.168.2.50/20 ip6:2001:4860:4000::/36 ip6:2404:6800:4000::/36 ip6:2607:f8b0:4000::/36 ip6:2800:3f0:4000::/36 ip6:2a00:1450:4000::/36 ip6:2c0f:fb50:4000::/36 ~all" | egrep -o $regex
How can i extract both type of IP ranges in bash? What's a better solution?
Note: I'm using sample data for testing purpose
First, single-quote the regex variable assignment (regex='...').
Then, use grep -Po (and double-quote $regex), as #BroSlow suggests (note that -P is not available on all platforms (e.g., OSX)) -- -P activates support for PCREs (Perl-Compatible Regular Expressions), which is required for your regex.
To put it all together:
regex='(?>(?>([a-f\d]{1,4})(?>:(?1)){3}|(?!(?:.*[a-f\d](?>:|$)){})((?1)(?>:(?1)){0,6})?::(?2)?)|(?>(?>(?1)(?>:(?1)){5}:|(?!(?:.*[a-f\d]:){6,})(?3)?::(?>((?1)(?>:(?1)){0,4}):)?)?(25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)(?>\.(?4)){3}))\/\d{1,2}'
txt="v=abc ip4:127.0.0.1/19 ip4:192.168.1.1/32 ip4:192.168.2.50/20 ip6:2001:4860:4000::/36 ip6:2404:6800:4000::/36 ip6:2607:f8b0:4000::/36 ip6:2800:3f0:4000::/36 ip6:2a00:1450:4000::/36 ip6:2c0f:fb50:4000::/36 ~all"
echo "$txt" | grep -Po "$regex"
Alternative: Following #l'L'l's example, here's a greatly simplified solution that works with the sample data (again relies on -P):
echo "$txt" | grep -Po '\bip[46]:\K[^ ]+'
Variant for OSX, where grep doesn't support -P:
echo "$txt" | egrep -o '\<ip[46]:[^ ]+' | cut -c 5-
This pattern should work in combination with sed:
str="v=abc ip4:127.0.0.1/19 ip4:192.168.1.1/32 ip4:192.168.2.50/20 ip6:2001:4860:4000::/36 ip6:2404:6800:4000::/36 ip6:2607:f8b0:4000::/36 ip6:2800:3f0:4000::/36 ip6:2a00:1450:4000::/36 ip6:2c0f:fb50:4000::/36 ~all"
echo $str | grep -s -i -o "ip[0-9]\:[a-z0-9\.:/]*" --color=always | sed 's/ip[0-9]\://g'
output:
127.0.0.1/19
192.168.1.1/32
192.168.2.50/20
2001:4860:4000::/36
2404:6800:4000::/36
2607:f8b0:4000::/36
2800:3f0:4000::/36
2a00:1450:4000::/36
2c0f:fb50:4000::/36
omit the --color=always to exclude color output if desired.

grep: group capturing

I have following string:
{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}
and I need to get value of "scheme version", which is 1234 in this example.
I have tried
grep -Eo "\"scheme_version\":(\w*)"
however it returns
"scheme_version":1234
How can I make it? I know I can add sed call, but I would prefer to do it with single grep.
You'll need to use a look behind assertion so that it isn't included in the match:
grep -Po '(?<=scheme_version":)[0-9]+'
This might work for you:
echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' |
sed -n 's/.*"scheme_version":\([^}]*\)}/\1/p'
1234
Sorry it's not grep, so disregard this solution if you like.
Or stick with grep and add:
grep -Eo "\"scheme_version\":(\w*)"| cut -d: -f2
I would recommend that you use jq for the job. jq is a command-line JSON processor.
$ cat tmp
{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}
$ cat tmp | jq .scheme_version
1234
As an alternative to the positive lookbehind method suggested by SiegeX, you can reset the match starting point to directly after scheme_version": with the \K escape sequence. E.g.,
$ grep -Po 'scheme_version":\K[0-9]+'
This restarts the matching process after having matched scheme_version":, and tends to have far better performance than the positive lookbehind. Comparing the two on regexp101 demonstrates that the reset match start method takes 37 steps and 1ms, while the positive lookbehind method takes 194 steps and 21ms.
You can compare the performance yourself on regex101 and you can read more about resetting the match starting point in the PCRE documentation.
To avoid using greps PCRE feature which is available in GNU grep, but not in BSD version, another method is to use ripgrep, e.g.
$ rg -o 'scheme_version.?:(\d+)' -r '$1' <file.json
1234
-r Capture group indices (e.g., $5) and names (e.g., $foo).
Another example with Python and json.tool module which can validate and pretty-print:
$ python -mjson.tool file.json | rg -o 'scheme_version[^\d]+(\d+)' -r '$1'
1234
Related: Can grep output only specified groupings that match?
You can do this:
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | awk -F ':' '{print $4}' | tr -d '}'
Improving #potong's answer that works only to get "scheme_version", you can use this expression :
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"_id":["]*\([^(",})]*\)[",}].*/\1/p'
scheme_version
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"_rev":["]*\([^(",})]*\)[",}].*/\1/p'
4-cad1842a7646b4497066e09c3788e724
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"scheme_version":["]*\([^(",})]*\)[",}].*/\1/p'
1234