Grepping for overlapping pattern matches

This is what I'm running
grep -o ',[tcb],' <<< "r,t,c,q,c b,b,"
The output is
,t,
,b,
But I want to get
,t,
,c,
,b,
(I do not want the b without a preceding , or the c without a trailing , to be matched.)
That is because ,[tcb], should be found three times: in 'r",t,"c,q,c b,b,', 'r,t",c,"q,c b,b,' and 'r,t,c,q,c b",b,"'.
But it seems that once a , has been consumed as the end of one match, grep does not consider it again as the start of the next match.
Is there a way around this, or is grep not meant to do this?

You can use awk instead of grep for this with record separator as comma:
awk -v RS=, '/^[tcb]$/{print RS $0 RS}' <<< "r,t,c,q,c b,b,"
,t,
,c,
,b,

You can use grep with a Perl RE (-P), which supports zero-width look-behind and look-ahead assertions, to extract the letters surrounded by commas. Because the assertions do not consume the commas, neighbouring matches can share one. You can then restore the separators as needed:
grep -o -P '(?<=,)[tcb](?=,)' <<< "r,t,c,q,c b,b," | while read -r c; do echo ",$c,"; done

The awk solution is nice. I have another with sed+grep:
echo "r,t,c,q,c b,b," | sed "s/,/,,/g" | grep -o ',[tcb],'
,t,
,c,
,b,
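
The reason ,c, goes missing in the original command is that grep -o resumes scanning after the end of the previous match, so the comma that closes ,t, is no longer available to open ,c,. Below is a minimal sketch of the comma-doubling workaround wrapped in a function. The name overlapping_matches and its two parameters are made up for illustration, and it assumes a grep that supports -o:

```shell
# Hypothetical helper, not part of grep: print every ,X, where X is one
# of the characters in $2, even when neighbouring matches share a comma.
overlapping_matches() {
    # Doubling each comma gives every match its own leading and trailing
    # comma, so grep -o no longer needs to reuse one.
    printf '%s\n' "$1" | sed 's/,/,,/g' | grep -o ",[$2],"
}

overlapping_matches "r,t,c,q,c b,b," tcb
```

As with plain grep, the function returns a non-zero status when nothing matches.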

Related

Extract version using grep/regex in bash

I have a file that has a line stating
version = "12.0.08-SNAPSHOT"
The word version and quoted strings can occur on multiple lines in that file.
I am looking for a single line bash statement that can output the following string:
12.0.08-SNAPSHOT
The version can have RELEASE tag too instead of SNAPSHOT.
So to summarize, given
version = "12.0.08-SNAPSHOT"
expected output: 12.0.08-SNAPSHOT
And given
version = "12.0.08-RELEASE"
expected output: 12.0.08-RELEASE

The following command prints strings enquoted in version = "...":
grep -Po '\bversion\s*=\s*"\K.*?(?=")' yourFile
-P enables perl regexes, which allow us to use features like \K and so on.
-o only prints matched parts instead of the whole lines.
\b ensures that version starts at a word boundary and we do not match things like abcversion.
\s stands for any kind of whitespace.
\K lets grep forget that it matched the part before \K. The forgotten part will not be printed.
.*? matches as few characters as possible (the matching part will be printed) ...
(?=") ... until we see a ", which won't be included in the match either (this is called a lookahead).
Not all grep implementations support the -P option. Alternatively, you can use perl, as described in this answer:
perl -nle 'print $& if m{\bversion\s*=\s*"\K.*?(?=")}' yourFile
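
Where neither grep -P nor perl is available, the same idea can be approximated with sed. This is only a sketch, under the assumption that version starts its line (optionally indented), which is stricter than the \b check above. The function name extract_version is illustrative:

```shell
extract_version() {
    # -n suppresses the default output; the p flag prints only lines on
    # which the substitution succeeded, i.e. lines like version = "..."
    sed -n 's/^[[:space:]]*version[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p'
}

printf '%s\n' 'name = "demo"' 'version = "12.0.08-SNAPSHOT"' | extract_version
```

Other quoted strings in the file are skipped because their lines do not start with version.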

Seems like a job for cut:
$ echo 'version = "12.0.08-SNAPSHOT"' | cut -d'"' -f2
12.0.08-SNAPSHOT
$ echo 'version = "12.0.08-RELEASE"' | cut -d'"' -f2
12.0.08-RELEASE

Portable solution:
$ echo 'version = "12.0.08-RELEASE"' |sed -E 's/.*"(.*)"/\1/g'
12.0.08-RELEASE
or even:
$ perl -pe 's/.*"(.*)"/$1/g'
$ awk -F"\"" '{print $2}'

Ignore all letters except for capitals

I have an output like Johny-Smith, Juarez-Hugo, etc. and I need instead S, H, etc. Basically, I need the last uppercase letter in a string and that's it. If this is possible in any built in Linux tools (ex awk, sed, grep, etc.) it would be greatly appreciated.

Do you need something like this?
echo "Johny-Smith" | sed 's/^.*\([A-Z]\)[^A-Z]*$/\1/g'
Test:
$ echo "Johny-Smith-Hello Johny-Smith" | sed 's/.*\([A-Z]\)[^A-Z]*/\1/g'
S
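
One caveat with the sed approach: a line without any capital letter does not match and is printed unchanged. Here is a sketch that prints only lines containing a capital, and that uses [[:upper:]] instead of A-Z so the result does not depend on the locale's collation order. The name last_upper is made up for this example:

```shell
last_upper() {
    # The greedy .* pushes the captured letter as far right as possible,
    # and [^[:upper:]]*$ guarantees that no uppercase letter follows it.
    sed -n 's/.*\([[:upper:]]\)[^[:upper:]]*$/\1/p'
}

printf '%s\n' 'Johny-Smith' 'Juarez-Hugo' 'nocaps' | last_upper
```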

With GNU grep, if the PCRE option (-P) is available:
$ echo 'Johny-Smith' | grep -oP '.*\K[A-Z]'
S
$ echo 'Juarez-Hugo' | grep -oP '.*\K[A-Z]'
H
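-o prints only the matched portion
-P enables Perl-compatible regular expressions
.*\K greedy match up to the last uppercase letter; \K then resets the start of the match, so this text is not part of the output (\K is a match reset, not a lookbehind)
[A-Z] any uppercase character
With perl (see perldoc perlrun for an explanation of the command-line options):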
$ # prints the string within captured group
$ echo 'Johny-Smith' | perl -lne 'print /.*([A-Z])/'
S
$ echo 'Juarez-Hugo' | perl -lne 'print /.*([A-Z])/'
H

In Bash:
$ var="Johny-Smith-Hello Johny-Smith"; var="${var//[^[:upper:]]/}";echo "${var: -1}"
S
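${var//[^[:upper:]]/} removes every character that is not an uppercase letter
echo "${var: -1}" outputs the last remaining character (the space before -1 is required, otherwise it is parsed as the ${var:-default} expansion)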

Parsing log file

I am trying to parse a text like this from a log file:
[2016-01-29 11:31:33,809: WARNING/Worker-1283]
1030140:::DEAL_OF_DAY:::29:::1:::11 [2016-01-29 11:31:34,103:
WARNING/Worker-1197] 1025311:::DEAL_OF_DAY:::29:::1:::11 [2016-01-29
11:31:34,291: WARNING/Worker-1197] 1025158:::DEAL_OF_DAY:::29:::1:::11
I want to extract these numbers 1030140, 1025311, 1025158 and so on.
I have tried the following
cat deals29.txt | egrep -o '[0-9]+'
But this gives other digits as well
I tried
cat deals29.txt | egrep -o ' [0-9]+:::'
but now it gives the colons in the output as well and there is no way to capture the group in the command line version of grep.
Any suggestions? grep solution would be preferred but I can go with sed/awk as well if grep cannot do the job.

Using grep -oP and match reset \K:
grep -oP '^\[.*?\] \K\d+' file.log
1030140
1025311
1025158
If your grep doesn't support -P (PCRE) then use awk:
awk -F '\\] |:::' '{print $2}' file.log
1030140
1025311
1025158
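
Both commands above assume that each log entry sits on a single line. If the entries are wrapped as in the excerpt in the question, one alternative that needs neither -P nor a line-start anchor is to match the number together with the ::: and name that follow it, then cut the number off. This is a sketch that assumes every wanted ID is immediately followed by ::: and an uppercase name such as DEAL_OF_DAY. The name extract_ids is hypothetical:

```shell
extract_ids() {
    # grep -oE prints e.g. 1030140:::DEAL_OF_DAY; cut keeps the text
    # before the first colon.
    grep -oE '[0-9]+:::[A-Z_]+' | cut -d: -f1
}

extract_ids <<'EOF'
[2016-01-29 11:31:33,809: WARNING/Worker-1283]
1030140:::DEAL_OF_DAY:::29:::1:::11 [2016-01-29 11:31:34,103:
WARNING/Worker-1197] 1025311:::DEAL_OF_DAY:::29:::1:::11 [2016-01-29
11:31:34,291: WARNING/Worker-1197] 1025158:::DEAL_OF_DAY:::29:::1:::11
EOF
```

The trailing fields such as 29:::1:::11 are not reported, because the pattern requires a letter or underscore after the three colons.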

You can practice regexes here: https://regex101.com/
I came up with
] [0-9]*
and then you have to delete the first 2 characters of each match.

You could use a solution like:
(\d{3,})::
# looks for at least 3 digits (or more) followed by two colons
# puts the matched numbers in group 1
See a demo for this approach here.

How to extract value from the string in bash?

I have an input string in the following format:
bugfix/ABC-12345-1-00
I want to extract "ABC-12345". Regex for that format in C# looks like this:
.*\/([A-Z]+-[0-9]+).*
How can I do that in a bash script? I've tried sed and awk but had no success because I need to extract value from the capturing group and skip the rest.

If your grep supports -P then you could use the below grep commands.
$ echo 'bugfix/ABC-12345-1-00' | grep -oP '/\K[A-Z]+-\d+'
ABC-12345
\K keeps the text matched so far out of the overall regex match.
$ echo 'bugfix/ABC-12345-1-00' | grep -oP '(?<=/)[A-Z]+-\d+'
ABC-12345
(?<=/) Positive lookbehind which asserts that the match must be preceded by a / symbol.
Through sed,
$ echo 'bugfix/ABC-12345-1-00' | sed 's~.*/\([A-Z]\+-[0-9]\+\).*~\1~'
ABC-12345
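
Since the question asks specifically for the value of a capturing group, the POSIX expr utility is also worth a mention: its : operator matches a basic regular expression anchored at the start of the string and prints the first \( \) group. A minimal sketch, with extract_ticket as a made-up name:

```shell
extract_ticket() {
    # Prints the \( \) group on a match; when nothing matches, expr
    # prints an empty line and returns a non-zero status.
    expr "$1" : '.*/\([A-Z][A-Z]*-[0-9][0-9]*\)'
}

extract_ticket "bugfix/ABC-12345-1-00"
```

Inputs that themselves look like expr operators (for example a lone ( or +) may need extra care, so this is best suited to strings with a known shape such as branch names.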

echo "bugfix/ABC-12345-1-00"| perl -ane '/.*?([A-Z]+\-[0-9]+).*/;print $1."\n"'

You could try something like:
echo "bugfix/ABC-12345-1-00" | egrep -o '[A-Z]+-[0-9]+'
OUTPUT:
ABC-12345

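If you do not want to use a regex, you can use this awk. Note that it prints everything after the last /, that is the whole ABC-12345-1-00 rather than just ABC-12345: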
echo "bugfix/ABC-12345-1-00" | awk -F\/ '{print $NF}'
ABC-12345-1-00
Or just this:
awk -F\/ '$0=$NF'

Remove everything after 2nd occurrence in a string in unix

I would like to remove everything after the 2nd occurrence of a particular
pattern in a string. What is the best way to do it in Unix? What is most elegant and simple method to achieve this; sed, awk or just unix commands like cut?
My input would be
After-u-math-how-however
Output should be
After-u
Everything after the 2nd - should be stripped out. The regex should also match
zero occurrences of the pattern, so zero or one occurrence should be ignored and
from the 2nd occurrence everything should be removed.
So if the input is as follows
After
Output should be
After

Something like this would do it.
echo "After-u-math-how-however" | cut -f1,2 -d'-'
This will split up (cut) the string into fields, using a dash (-) as the delimiter. Once the string has been split into fields, cut will print the 1st and 2nd fields.
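
The zero-or-one-occurrence cases from the question also work with cut, because cut prints a line that contains no delimiter unchanged (unless -s is given). A short sketch, where first_two_fields is an illustrative name:

```shell
first_two_fields() {
    # Keep fields 1 and 2 of each input line, using - as the delimiter.
    cut -d- -f1,2
}

printf '%s\n' 'After-u-math-how-however' 'After-u' 'After' | first_two_fields
```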

This might work for you (GNU sed):
sed 's/-[^-]*//2g' file

You could use the following regex to select what you want:
^[^-]*-\?[^-]*
For example:
echo "After-u-math-how-however" | grep -o "^[^-]*-\?[^-]*"
Results:
After-u

@EvanPurkisher's cut -f1,2 -d'-' solution is IMHO the best one, but since you asked about sed and awk:
With GNU sed for -r:
$ echo "After-u-math-how-however" | sed -r 's/([^-]+-[^-]*).*/\1/'
After-u
With GNU awk for gensub():
$ echo "After-u-math-how-however" | awk '{$0=gensub(/([^-]+-[^-]*).*/,"\\1","")}1'
After-u
Can be done with non-GNU sed using \( and *, and with non-GNU awk using match() and substr() if necessary.
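
A sketch of what those portable variants might look like, assuming only POSIX sed and awk. The function names are invented for this example:

```shell
strip_after_second_sed() {
    # BRE has no +, so [^-][^-]* stands in for [^-]+.
    sed 's/\([^-][^-]*-[^-]*\).*/\1/'
}

strip_after_second_awk() {
    # match() sets RSTART and RLENGTH; keep everything up to the end of
    # the match, or the whole line when there is no match.
    awk '{
        if (match($0, /[^-]+-[^-]*/))
            print substr($0, 1, RSTART + RLENGTH - 1)
        else
            print
    }'
}

printf '%s\n' 'After-u-math-how-however' 'After' | strip_after_second_sed
printf '%s\n' 'After-u-math-how-however' 'After' | strip_after_second_awk
```

In both variants a line without any - is passed through unchanged.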

awk -F - '{print $1 (NF>1? FS $2 : "")}' <<<'After-u-math-how-however'
Split the line into fields based on field separator - (option spec. -F -) - accessible as special variable FS inside the awk program.
Always print the 1st field (print $1), followed by:
If there's more than 1 field (NF>1), append FS (i.e., -) and the 2nd field ($2)
Otherwise: append "", i.e.: effectively only print the 1st field (which in itself may be empty, if the input is empty).

This can be done in pure bash (which means no fork, no external process). Read into an array split on '-', then slice the array:
$ IFS=-
$ read -ra val <<< After-u-math-how-however
$ echo "${val[*]}"
After-u-math-how-however
$ echo "${val[*]:0:2}"
After-u

awk '$0 = $2 ? $1 FS $2 : $1' FS=-
Result
After-u
After

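This will do it in awk (the extra i<=NF bound avoids printing a trailing - when there is no second field, and print "" ends the line):
echo "After" | awk -F "-" '{printf "%s",$1; for (i=2; i<=2 && i<=NF; i++) printf "-%s",$i; print ""}'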
