Combine two regex together - regex

I have this expression from a ps ax list that I want to parse:
183838 ? myprocess -uuid 0f6309e3-bee2-4747-b76d-7aaf4d0f074e serial=802e7fd9-a2ab-e411-8000-001e67ca95b2
I want to match the process id (183838) AND the uuid expression (0f6309e3-bee2-4747-b76d-7aaf4d0f074e).
I have the two regexes that match each of them:
# PID
([0-9]*)
# UUID
(?<=uuid).([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})
But I can't find how to combine them together to have this as result with sed:
183838 0f6309e3-bee2-4747-b76d-7aaf4d0f074e
awk is not an option since it must be column number independent.

You can use the | or operator in regex in between your two regex expressions to combine them.

sed uses POSIX regular expressions (BRE by default, ERE with -E), which have no lookbehind, and your UUID pattern is a PCRE with a lookbehind. If you need PCRE, grep -P is an option, combined with -o, which prints only the matched parts of each matching line:
$ ps ax | grep -oP '(^[0-9]+)|(?<=uuid )([-0-9a-f]{36})' | paste -sd' '
183838 0f6309e3-bee2-4747-b76d-7aaf4d0f074e
(We combine multiple lines here with paste.)

You can do this sort of matching with capturing groups. These are enclosed by \( and \) in sed. In the replacement, \1 is replaced by whatever matched the content of the first capturing group, and so on.
So to translate your input string:
$ ps ax | grep -- '-uuid' | sed 's/\([0-9]*\).* -uuid \([0-9a-f-]*\).*/\1 \2/'
183838 0f6309e3-bee2-4747-b76d-7aaf4d0f074e
I've used the "-uuid" as an anchor to locate the right part of the string, allowing a shorter and more relaxed pattern for the uuid itself. But you can adapt this to your own requirements.
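As a minimal sketch of the capture-group approach, the snippet below runs on the sample line from the question instead of live ps output. It assumes a sed that accepts -E (GNU and BSD sed both do); the variable names are only illustrative. Compared with the command above, it tolerates leading padding before the PID and, thanks to -n and the p flag, prints nothing for lines that do not match.

```shell
# Sample line standing in for real `ps ax` output (copied from the question).
line='183838 ? myprocess -uuid 0f6309e3-bee2-4747-b76d-7aaf4d0f074e serial=802e7fd9-a2ab-e411-8000-001e67ca95b2'

# -n together with the trailing p prints only lines where the substitution succeeded.
#   ^ *               optional padding before the PID
#   ([0-9]+)          group 1: the PID
#   ([0-9a-f-]{36})   group 2: the 36-character UUID that follows " -uuid "
result=$(printf '%s\n' "$line" |
  sed -nE 's/^ *([0-9]+) .* -uuid ([0-9a-f-]{36}).*/\1 \2/p')

echo "$result"
```

Because the pattern keys on the literal " -uuid " text rather than on a column position, it stays independent of how many fields precede it.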

Related

Regex to match exact version phrase

I have versions like:
v1.0.3-preview2
v1.0.3-sometext
v1.0.3
v1.0.2
v1.0.1
I am trying to get the latest version that is not a preview (has no text after the version number), so the result should be:
v1.0.3
I used this grep: grep -m1 "[v\d+\.\d+.\d+$]"
but it still outputs: v1.0.3-preview2
What could I be missing here?
To return first match for pattern v<num>.<num>.<num>, use:
grep -m1 -E '^v[0-9]+(\.[0-9]+){2}$' file
v1.0.3
If your input file is unsorted, pipe grep through sort -rV and head:
grep -E '^v[0-9]+(\.[0-9]+){2}$' file | sort -rV | head -1
When you use ^ or $ inside [...] they lose their anchor meaning: $ is treated as a literal character, and ^ either negates the set (when it comes first) or is a literal character.
RegEx Details:
^: Start
v: Match v
[0-9]+: Match 1+ digits
(\.[0-9]+){2}: Match a dot followed by 1+ digits. Repeat this group 2 times
$: End
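The difference between the original bracket expression and the anchored pattern can be checked on the sample data from the question. This is only a sketch: the variable names are made up for illustration, and the unsorted list in the last step is hypothetical input that assumes a sort with -V support (GNU coreutils, for example).

```shell
versions='v1.0.3-preview2
v1.0.3-sometext
v1.0.3
v1.0.2
v1.0.1'

# Broken: [...] is a bracket expression, so this matches any line that
# contains one of the characters v, \, d, +, . or $ (every line has a "v").
broken_count=$(printf '%s\n' "$versions" | grep -c '[v\d+\.\d+.\d+$]')

# Fixed: anchors outside any brackets, digits spelled as [0-9].
latest=$(printf '%s\n' "$versions" | grep -m1 -E '^v[0-9]+(\.[0-9]+){2}$')

# Unsorted (hypothetical) input: filter first, then version-sort in reverse.
unsorted_latest=$(printf '%s\n' v1.0.2 v1.0.10 v1.0.3-preview2 v1.0.9 |
  grep -E '^v[0-9]+(\.[0-9]+){2}$' | sort -rV | head -n 1)

echo "$broken_count $latest $unsorted_latest"
```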
To match the digits with grep, you can use
grep -m1 "v[[:digit:]]\+\.[[:digit:]]\+\.[[:digit:]]\+$" file
Note that you don't need the [ and ] in your pattern, and that you need to escape the dot to match it literally.
With awk you could try following awk code.
awk 'match($0,/^v[0-9]+(\.[0-9]+){2}$/){print;exit}' Input_file
Explanation of the awk code: the match function tests each line against the version regex; on the first line that matches, that line is printed and the program exits.
Regular expressions match substrings, not whole strings. You need to explicitly match the start (^) and end ($) of the pattern.
Keep in mind that $ can trigger expansion inside double-quoted strings in shell scripts, so escape it there or use single quotes.
The anchors need to be outside of any bracket expression ([]).
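A small sketch of the substring-versus-whole-line point, using a made-up three-line list (the v1.0.30 entry is hypothetical, added only to show a trailing-character false positive):

```shell
versions='v1.0.3-preview2
v1.0.3
v1.0.30'

# Without anchors the pattern matches anywhere in the line: all 3 lines count.
unanchored=$(printf '%s\n' "$versions" | grep -c 'v1\.0\.3')

# With ^ and $ only the exact line matches.
anchored=$(printf '%s\n' "$versions" | grep -c '^v1\.0\.3$')

# Same pattern in double quotes: the $ is escaped so the shell leaves it alone.
anchored_dq=$(printf '%s\n' "$versions" | grep -c "^v1\.0\.3\$")

echo "$unanchored $anchored $anchored_dq"
```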

How to match the last occurrence of a pattern on a single line string

I am using this command line to get a particular line from an html file which contains various other tags, links etc.:
cat index.html | grep -m1 -oE '<a href="(.*?)" rel="sample"[\S\s]*.*</dd>'
It outputs the line which I want:
<a href="http://example.com/something/one/" rel="sample" >Foo</a> <a href="http://example.com/something/two/" rel="sample" >Bar</a></dd>
But I want to capture only something/two (the path of the last URL) considering that:
the URLs are not known beforehand (it's a script processing multiple html files)
the line can sometimes contain only 1 URL, e.g.
<a href="http://example.com/something/one/" rel="sample" >Foo</a></dd>
in which case I would want to get only something/one as in this case it is the last one.
How can I do that?
Just add
| grep -o 'href="[^"]*' | tail -n1
The first part only extracts the hrefs, the second part keeps only the last line.
If you want to extract only the path, you can use cut with delimiter set to / and extract everything starting from the fourth column:
| grep -o 'href="[^"]*' | tail -n1 | cut -f4- -d/
because
href="http://example.com/something/two/
fields: 1 = href="http:, 2 = (empty), 3 = example.com, 4 = something, 5 = two
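To see how cut counts the fields, the sketch below splits the same string on / and numbers each piece, then applies the cut from above. The awk loop is only there for inspection, and the final parameter expansion (an addition, not part of the answer) drops the trailing slash that cut leaves in place.

```shell
url='href="http://example.com/something/two/'

# Print every /-separated field with its index (the trailing / yields an empty field 6).
printf '%s\n' "$url" | awk -F/ '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }'

# Everything from field 4 onward, rejoined with the delimiter.
path=$(printf '%s\n' "$url" | cut -d/ -f4-)

# Optional: strip one trailing slash.
trimmed=${path%/}

echo "$path -> $trimmed"
```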
If you can use perl, then capturing within a regex makes this a lot easier.
perl -ne 'm(.*<a href="[^:]+://[^/]*/(.*?)" rel="sample".*</dd>) and print "$1\n";'
The regex is basically the same as would also work with grep. I've used m() instead of // to avoid escaping the / inside the regex.
The initial .* greedily consumes as much of the beginning of the line as it can. If you have multiple links on a line, it consumes all but the last. This works with grep too, but it causes grep -o to output the beginning of the line, since that part is now included in the match.
This doesn't matter with the capturing parentheses, as only the part inside the (.*?) is captured and printed.
It would be used the same way as grep.
cat index.html | perl -ne 'm(.*<a href="[^:]+://[^/]*/(.*?)" rel="sample".*</dd>) and print "$1\n";'
or
perl -ne 'm(.*<a href="[^:]+://[^/]*/(.*?)" rel="sample".*</dd>) and print "$1\n";' index.html
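The same greedy-prefix idea can also be expressed with sed instead of perl. The sketch below is an alternative, not the answer's method: last_sample_path is a hypothetical helper name, it assumes a sed with -E support, and unlike the perl version it leaves the trailing slash out of the captured path.

```shell
# Hypothetical helper: print the path of the last rel="sample" link on each line.
last_sample_path() {
  # The leading .* is greedy, so the <a href=...> that follows it is the last one.
  sed -nE 's|.*<a href="[^:]+://[^/]*/([^"]*)/" rel="sample".*</dd>.*|\1|p'
}

two='<a href="http://example.com/something/one/" rel="sample" >Foo</a> <a href="http://example.com/something/two/" rel="sample" >Bar</a></dd>'
one='<a href="http://example.com/something/one/" rel="sample" >Foo</a></dd>'

path_two=$(printf '%s\n' "$two" | last_sample_path)
path_one=$(printf '%s\n' "$one" | last_sample_path)

echo "$path_two"
echo "$path_one"
```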
On Linux, GNU grep's -P option enables a concise solution:
$ grep -oP '.*<a href="http://.+?/\K[^"]+(?=/"\s*rel="sample".*</dd>$)' index.html
something/two
-o only outputs the matching part(s) of each line that matches.
-P activates support for PCREs (Perl-compatible Regular Expressions), which support advanced regex constructs such as non-greedy matching (*?), dropping everything matched so far (\K), and look-ahead assertions ((?=...)).
The combination of \K and (?=...) allows constraining the matching part of the regex to the subexpression of interest.
Note that no grep implementation can output just the value of a capture group, but the above, thanks to the features enabled by -P, emulates extracting a single capture-group value.
As for what you tried:
-m1 limits the number of matching lines to 1, but with -o also present, multiple matches on that 1 line are still all printed.
Additionally, while you can use (...) for precedence, that doesn't constitute a capture group in grep, because there's no support for extracting capture-group values in grep.
Even with -E for extended regex support, advanced constructs such as non-greedy matching (.*?) are not supported.
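The interaction of -m1 and -o is easy to check on the sample line. In this sketch the simplified href pattern and the variable names are illustrative assumptions; it only uses -E, so it does not depend on PCRE support.

```shell
line='<a href="http://example.com/something/one/" rel="sample" >Foo</a> <a href="http://example.com/something/two/" rel="sample" >Bar</a></dd>'

# -m1 stops after one matching LINE; -o still prints every match on that line.
all=$(printf '%s\n' "$line" | grep -m1 -oE 'href="[^"]*"')
count=$(printf '%s\n' "$all" | wc -l | tr -d ' ')

# Picking the last match therefore needs an extra step such as tail.
last=$(printf '%s\n' "$all" | tail -n 1)

echo "$count matches, last: $last"
```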

modify `sed` to remove exact tag from within a string

I am trying to remove a variable tag that I have on my data with grep and sed.
The data that I have looks like this:
Please_VB make_VB it_PRP in_IN a_DT range_NN of_IN colored_JJ and_CC precise_JJR Skin_NN tone_NN shades_VBZ
My goal is to extract only those words that have a tag of _NNS, _NNP, _NN, _JJ and _JJR. For a desired result of:
range
colored
precise
skin
tone
The grep and sed that I am using right now is the following:
grep -oh "\w*_\(JJ\|NN\)\w*" test_file.txt | sed 's/[_JJ\|_NN\|_JJR\|_NNP\|_NNS]//g'
The result of that command line, however, is:
range
colored
precise
kin
tone
It correctly extracts the correct words with the grep, but the sed is removing all corresponding letters, rather than just the exact tag of _NX or _JX.
Is there any way that I can make the sed more precise to remove ONLY the exact tag as specified rather than any letter that is also within the tag?
If your grep doesn't support the -P option, you can combine grep -E with cut:
grep -Eo '\w*_(NN[PS]?|JJR?)' file | cut -d_ -f1
range
colored
precise
Skin
tone
cut strips everything from the first underscore onward.
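If you would rather keep the grep | sed shape from the question, the sed side can use an alternation instead of a bracket expression. A bracket expression such as [_JJ\|_NN...] is a set of single characters, which is why the S of Skin was deleted. The sketch below assumes grep and sed with -E support; the capital S in Skin is kept because nothing here changes case.

```shell
data='Please_VB make_VB it_PRP in_IN a_DT range_NN of_IN colored_JJ and_CC precise_JJR Skin_NN tone_NN shades_VBZ'

# grep extracts word_TAG for the wanted tags; sed removes exactly the tag,
# anchored at the end of each extracted token.
words=$(printf '%s\n' "$data" |
  grep -oE '[[:alnum:]]+_(NN[PS]?|JJR?)' |
  sed -E 's/_(NN[PS]?|JJR?)$//')

echo "$words"
```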
You may extract those values with grep and a PCRE regex with a lookahead:
grep -oP "\w+(?=_(JJR?|NN[PS]?))"
             ^^^^^^^^^^^^^^^^^^^
Details:
\w+ - 1 or more word characters (letters, digits or an underscore)...
(?=_(JJR?|NN[PS]?)) - that are followed with
_ - an underscore and...
(JJR?|NN[PS]?) - JJ, JJR, NN, NNP or NNS substrings.
The P option in -oP enables the PCRE engine, and o prints only the matches.

grep part of text from ps output with regex

The output of ps -ef contains -Dorg.xxx.yyy=/home/user/aaa/server.log.
I'd like to extract the file path /home/user/aaa/server.log (can be any name.file).
Now, I'm using command:
ps -ef | grep -Po '(?<=-Dorg.xxx.yyy=)[^\s]*'
It will display two matched results:
/home/user/aaa/server.log
)[^\s]*
It looks like it also matches the grep command itself as the 2nd result. How can I remove it? Or are there other suggestions? (I cannot use -m1.)
If you just need the file name, use the \K operator:
org\.xxx\.yyy=\K[^\s]*
ps -ef | grep -Po 'org\.xxx\.yyy=\K[^\s]*'
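It will match the whole string, but will only print the file name matched with [^\s]*. Because the dots are now escaped, the pattern also no longer matches its own text in the grep command line.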
From perlre:
There is a special form of this construct, called \K (available since
Perl 5.10.0), which causes the regex engine to "keep" everything it
had matched prior to the \K and not include it in $& . This
effectively provides variable-length look-behind.
Use this:
grep -Po '(?<=-[D]org.xxx.yyy=)[^\s]*'
Just put one of the characters in square brackets ([D]). The meaning of the regex hasn't changed and the pattern doesn't match itself anymore.
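The effect of the bracket trick can be reproduced without a live process list. In the sketch below, fake_ps is a hypothetical stand-in that prints two made-up ps-style lines: a Java process and the grep command itself. It swaps the PCRE lookbehind for plain grep -E plus a parameter expansion, so it is an illustration of the idea rather than the exact command above.

```shell
# Hypothetical stand-in for `ps -ef`: $1 is the pattern the grep process was given.
fake_ps() {
  printf '%s\n' \
    'user 101 1 java -Dorg.xxx.yyy=/home/user/aaa/server.log -jar app.jar' \
    "user 202 101 grep -oE -e $1"
}

naive='-Dorg.xxx.yyy=[^[:space:]]*'
safe='-[D]org\.xxx\.yyy=[^[:space:]]*'

# The naive pattern also matches its own text on the grep line: 2 matches.
naive_count=$(fake_ps "$naive" | grep -oE -e "$naive" | wc -l | tr -d ' ')

# With [D] the pattern text contains "-[D]org", which the regex does not match.
match=$(fake_ps "$safe" | grep -oE -e "$safe")

# Drop everything up to and including the first "=".
path=${match#*=}

echo "$naive_count $path"
```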

Regex to match unique substrings

Here's a basic regex technique that I've never managed to remember. Let's say I'm using a fairly generic regex implementation (e.g., grep or grep -E). If I were to do a list of files and match any that end in either .sty or .cls, how would I do that?
ls | grep -E "\.(sty|cls)$"
\. matches literally a "." - an unescaped . matches any character
(sty|cls) - match "sty" or "cls" - the | means "or", and the parentheses limit the scope of the alternation.
$ forces the match to be at the end of the line
Note, you want grep -E or egrep, not grep -e as that's a different option for lists of patterns.
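A quick check of the pattern on a made-up file list (the names are hypothetical and fed through printf so the result does not depend on the current directory):

```shell
files='article.tex
mystyle.sty
thesis.cls
notes.sty.bak
clsnotes.txt'

# Only names that END in .sty or .cls survive; the escaped dot and the $
# anchor reject notes.sty.bak and clsnotes.txt.
matched=$(printf '%s\n' "$files" | grep -E '\.(sty|cls)$')

echo "$matched"
```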
egrep "\.sty$|\.cls$"
This regex:
\.(sty|cls)\z
will match any string that ends with .sty or .cls
EDIT:
for grep \z should be replaced with $ i.e.
\.(sty|cls)$
as jelovirt suggested.