Regex to match unique substrings - regex

Here's a basic regex technique that I've never managed to remember. Let's say I'm using a fairly generic regex implementation (e.g., grep or grep -E). If I were to do a list of files and match any that end in either .sty or .cls, how would I do that?

ls | grep -E "\.(sty|cls)$"
\. matches literally a "." - an unescaped . matches any character
(sty|cls) - match "sty" or "cls" - the | is an "or" and the parentheses group the alternatives so the alternation stops there.
$ forces the match to be at the end of the line
Note, you want grep -E or egrep, not grep -e: lowercase -e is a different option that introduces a pattern (useful for giving several patterns, or one that starts with a dash).
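To try the pattern without depending on what is in a real directory, here is a minimal sketch that feeds a few made-up file names to the same grep. The file names and the function name list_tex_support_files are assumptions for illustration only.

```shell
# Hypothetical file names, printed one per line just like `ls` would.
list_tex_support_files() {
  printf '%s\n' article.cls mystyle.sty notes.tex sty.txt xsty |
    grep -E '\.(sty|cls)$'   # literal dot, then sty or cls, at end of line
}

list_tex_support_files
```

Only article.cls and mystyle.sty get through: sty.txt does not end in .sty, and xsty has no dot before the extension.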

egrep "\.sty$|\.cls$"

This regex:
\.(sty|cls)\z
will match any string that ends with .sty or .cls
EDIT:
for grep \z should be replaced with $ i.e.
\.(sty|cls)$
as jelovirt suggested.

Related

Regex to match exact version phrase

I have versions like:
v1.0.3-preview2
v1.0.3-sometext
v1.0.3
v1.0.2
v1.0.1
I am trying to get the latest version that is not preview (doesn't have text after version number) , so result should be:
v1.0.3
I used this grep: grep -m1 "[v\d+\.\d+.\d+$]"
but it still outputs: v1.0.3-preview2
what I could be missing here?
To return first match for pattern v<num>.<num>.<num>, use:
grep -m1 -E '^v[0-9]+(\.[0-9]+){2}$' file
v1.0.3
If your input file is unsorted, then use grep | sort -rV | head as:
grep -E '^v[0-9]+(\.[0-9]+){2}$' f | sort -rV | head -1
When you use ^ or $ inside [...] they are not treated as anchors: $ is a literal character there, and ^ is either literal or, in first position, negates the set.
RegEx Details:
^: Start
v: Match v
[0-9]+: Match 1+ digits
(\.[0-9]+){2}: Match a dot followed by 1+ digits. Repeat this group 2 times
$: End
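A small sketch contrasting the pattern from the question with the anchored one, using the version list from the question as inline sample data. The function names are made up for this illustration.

```shell
# Sample data copied from the question.
versions='v1.0.3-preview2
v1.0.3-sometext
v1.0.3
v1.0.2
v1.0.1'

# Original attempt: everything sits inside [...], so it is one bracket
# expression that matches any single listed character. Every line
# contains a "v", so every line matches.
count_bracket_matches() {
  printf '%s\n' "$versions" | grep -c '[v\d+\.\d+.\d+$]'
}

# Anchored ERE: the whole line must be v<num>.<num>.<num>.
first_release() {
  printf '%s\n' "$versions" | grep -m1 -E '^v[0-9]+(\.[0-9]+){2}$'
}

count_bracket_matches   # prints 5
first_release           # prints v1.0.3
```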
To match the digits with grep, you can use
grep -m1 "v[[:digit:]]\+\.[[:digit:]]\+\.[[:digit:]]\+$" file
Note that you don't need the [ and ] around your pattern, and you need to escape the dot to match it literally.
With awk you could try following awk code.
awk 'match($0,/^v[0-9]+(\.[0-9]+){2}$/){print;exit}' Input_file
Explanation of the awk code: match() tests each line against the version regex; on the first line that matches, awk prints that line and exits.
Regular expressions match substrings, not whole strings. You need to explicitly match the start (^) and end ($) of the pattern.
Keep in mind that $ has special meaning in double-quoted strings in shell scripts, so either escape it or put the pattern in single quotes.
The anchors need to be outside of any bracket expression ([]).
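To make the substring point concrete, here is a short sketch that counts matching lines with and without anchors. The two sample lines come from the question; the function names are hypothetical. The patterns are single-quoted so the shell leaves $ alone.

```shell
sample='v1.0.3-preview2
v1.0.3'

# Unanchored: the pattern may match anywhere in the line.
count_unanchored() {
  printf '%s\n' "$sample" | grep -c 'v1\.0\.3'
}

# Anchored: the pattern must cover the whole line.
count_anchored() {
  printf '%s\n' "$sample" | grep -c '^v1\.0\.3$'
}

count_unanchored   # prints 2
count_anchored     # prints 1
```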

How to use square brackets in grep for MINGW64?

Currently, I have a following regex. It should match a string that I am echoing:
echo "TBGFSGFI22800_D_REP_D_RISIKOEINHEIT" | grep -E 'TBGFSGFI\d\d\d\d\d[A-Za-z_]{1,100}'
It works as expected in OsX on my Mac and in Notepad++, but in Bash for windows (MINGW64) I get an empty string. How can I use the grep with flags, or how should I rewrite the regex to match the pattern?
My grep version is 3.1. Bash: 4.4.23(1)
Thanks for help in advance!
You are using a POSIX ERE regex with the -E option, and that flavor does not support the \d construct. You also need the -o option to actually extract the matches.
Note you do not need to repeat \d five times, you can use a range quantifier, \d{5}.
You can use
echo "TBGFSGFI22800_D_REP_D_RISIKOEINHEIT" | grep -Po "TBGFSGFI\d{5}[A-Za-z_]{1,100}"
Where
-P means the regex is of a PCRE flavor
-o extracts matches only
TBGFSGFI\d{5}[A-Za-z_]{1,100} - a regex that matches TBGFSGFI, then any five digits and then 1-100 ASCII letters or _.
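If grep -P is not available in a given build, a possible alternative is to stay with -E and spell the digit class out as [0-9]. This is only a sketch; the function name extract_table_name is made up for the example.

```shell
# Same pattern in POSIX ERE: [0-9] replaces the PCRE-only \d.
extract_table_name() {
  printf '%s\n' "$1" | grep -Eo 'TBGFSGFI[0-9]{5}[A-Za-z_]{1,100}'
}

extract_table_name "TBGFSGFI22800_D_REP_D_RISIKOEINHEIT"
# prints TBGFSGFI22800_D_REP_D_RISIKOEINHEIT
```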

Regexp or Grep in Bash

Can you please tell me how to get the token value correctly? At the moment I am getting: "1jdq_dnkjKJNdo829n4-xnkwe",258],["FbtResult
echo '{"facebookdotcom":true,"messengerdotcom":false,"workplacedotcom":false},827],["DTSGInitialData",[],{"token":"aaaaaaa"},258],["FbtResult' | sed -n 's/.*"token":\([^}]*\)\}/\1/p'
You need to match the full string, and to get rid of double quotes, you need to match a " before the token and use a negated bracket expression [^"] instead of [^}]:
sed -n 's/.*"token":"\([^"]*\).*/\1/p'
Details:
.* - any zero or more chars
"token":" - a literal "token":" string
\([^"]*\) - Group 1 (\1 refers to this value): any zero or more chars other than "
.* - any zero or more chars.
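The details above can be tried on a shortened copy of the input. In this sketch the function name extract_token and the trimmed sample string are assumptions; the sed command is the one given above.

```shell
# Reads text on stdin and prints the value of the first "token" key
# found on each line that has one.
extract_token() {
  sed -n 's/.*"token":"\([^"]*\).*/\1/p'
}

printf '%s\n' '["DTSGInitialData",[],{"token":"aaaaaaa"},258],["FbtResult' |
  extract_token
# prints aaaaaaa
```

Because [^"]* stops at the first closing quote, the trailing .* is what discards the rest of the line.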
This replacement works:
echo '{"facebookdotcom":true,"messengerdotcom":false,"workplacedotcom":false},827],["DTSGInitialData",[],{"token":"aaaaaaa"},258],["FbtResult' | sed -n 's/.*"token":"\([a-z]*\)"\}.*/\1/p'
The token is captured between the quotes by \([a-z]*\), followed by the closing quote and brace "\} and then the rest of the line as .* (you were missing this last part, which is why the replacement kept the text after the token).
Output:
aaaaaaa
A grep solution:
echo '{"facebookdotcom":true,"messengerdotcom":false,"workplacedotcom":false},827],["DTSGInitialData",[],{"token":"aaaaaaa"},258],["FbtResult' | grep -Po '(?<="token":")[^"]+'
yields
aaaaaaa
The -P option to grep enables the Perl-compatible regex (PCRE).
The -o option tells grep to print only the matched substring, not the entire line.
The regex (?<="token":") is a PCRE-specific feature called a zero-width positive lookbehind assertion. The expression (?<=pattern) matches a pattern without including it in the matched result.
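Where grep lacks -P, one possible workaround is to let grep -o match the key together with the value and then cut the value out. This is a sketch, not part of the answer above; the function name token_without_pcre is hypothetical.

```shell
# grep -o prints  "token":"aaaaaaa  (key plus value, no closing quote);
# splitting that on double quotes leaves the value in field 4.
token_without_pcre() {
  printf '%s\n' "$1" | grep -o '"token":"[^"]*' | cut -d '"' -f 4
}

token_without_pcre '["DTSGInitialData",[],{"token":"aaaaaaa"},258],["FbtResult'
# prints aaaaaaa
```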

Combine two regex together

I have this expression from a ps ax list that I want to parse:
183838 ? myprocess -uuid 0f6309e3-bee2-4747-b76d-7aaf4d0f074e serial=802e7fd9-a2ab-e411-8000-001e67ca95b2
I want to match the process id (183838) AND the uuid expression (0f6309e3-bee2-4747-b76d-7aaf4d0f074e).
I have the two regexes that match each of them:
# PID
([0-9]*)
# UUID
(?<=uuid).([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})
But I can't find how to combine them together to have this as result with sed:
183838 0f6309e3-bee2-4747-b76d-7aaf4d0f074e
awk is not an option since it must be column number independent.
You can use the | or operator in regex in between your two regex expressions to combine them.
Bash uses POSIX ERE, and you have a PCRE with a lookbehind. If you need PCRE, grep -P is an option, combined with -o, an option to print only matched parts of the matched line:
$ ps ax | grep -oP '(^[0-9]+)|(?<=uuid )([-0-9a-f]{36})' | paste -sd' '
183838 0f6309e3-bee2-4747-b76d-7aaf4d0f074e
(We combine multiple lines here with paste.)
You can do this sort of matching with capturing groups. These are enclosed by \( and \) in sed. In the replacement, \1 is replaced by whatever matched the content of the first capturing group, and so on.
So to translate your input string:
$ ps ax | grep -- '-uuid' | sed 's/\([0-9]*\).* -uuid \([0-9a-f-]*\).*/\1 \2/'
183838 0f6309e3-bee2-4747-b76d-7aaf4d0f074e
I've used the "-uuid" as an anchor to locate the right part of the string, allowing a shorter and more relaxed pattern for the uuid itself. But you can adapt this to your own requirements.
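For testing without a live process list, the same idea can be run against the sample line from the question. This sketch is a slightly stricter variant that uses sed -E and -n so that non-matching lines print nothing; the function name pid_and_uuid is made up.

```shell
# Group 1: the leading PID. Group 2: the 36 characters after " -uuid ".
pid_and_uuid() {
  sed -nE 's/^ *([0-9]+) .* -uuid ([0-9a-f-]{36}).*/\1 \2/p'
}

printf '%s\n' '183838 ? myprocess -uuid 0f6309e3-bee2-4747-b76d-7aaf4d0f074e serial=802e7fd9-a2ab-e411-8000-001e67ca95b2' |
  pid_and_uuid
# prints 183838 0f6309e3-bee2-4747-b76d-7aaf4d0f074e
```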

Regex for uppercase matches with exclusions

I'm trying to come up with a regex for the following case: I need to find any matching paths using grep for the following paths:
Include all uppercase matching paths.
Example:
com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar
Notice the capital B in Bar.
Exclude all uppercase matching paths that only contain SNAPSHOT and have no other uppercase letters.
Example:
com/foo/bar/1.2.3-SNAPSHOT/bar-1.2.3-SNAPSHOT.jar
Is this possible with grep?
Something like this might do:
grep -vE '^([^[:upper:]]*(SNAPSHOT)?)*$'
Breakdown:
-v will reverse the match (show all non-matched lines). -E enables Extended Regular Expressions.
^ # Start of line
( )* # Capturing group repeated zero or more times
[^[:upper:]]* # Match all but uppercase zero or more times
(SNAPSHOT)? # Followed by literal SNAPSHOT zero or one time
$ # End of line
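As a quick check of the breakdown, this sketch runs the command on the two example paths from the question. The function name paths_with_extra_uppercase is an assumption.

```shell
# A line made only of non-uppercase runs and SNAPSHOT matches the regex,
# so -v drops it; any other uppercase letter prevents the full-line match.
paths_with_extra_uppercase() {
  grep -vE '^([^[:upper:]]*(SNAPSHOT)?)*$'
}

printf '%s\n' \
  com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar \
  com/foo/bar/1.2.3-SNAPSHOT/bar-1.2.3-SNAPSHOT.jar |
  paths_with_extra_uppercase
# prints com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar
```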
Just use awk:
$ cat file
com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar
com/foo/bar/1.2.3-SNAPSHOT/bar-1.2.3-SNAPSHOT.jar
With GNU awk for gensub():
$ awk 'gensub(/SNAPSHOT/,"","g")~/[[:upper:]]/' file
com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar
With other awks:
$ awk '{r=$0; gsub(/SNAPSHOT/,"",r)} r~/[[:upper:]]/' file
com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar
Well, you need find to list all paths. Then you can do it with two grep runs: the first keeps paths that contain a capital letter, and the second excludes those that contain no capitals except SNAPSHOT:
find . | grep '[A-Z]' | grep -v '.*\/[^A-Z]*SNAPSHOT[^A-Z]*$'
I think only the last grep needs some explanation:
grep -v excludes the matching lines
.*\/ greedily matches everything up to a slash, preferring the last slash that lets the rest of the pattern match. There'll always be a slash due to find .
[^A-Z]* finds all characters that are non-capital letters. So we apply it before and after the SNAPSHOT literal, up to the end of the string.
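The two-pass filter can be exercised on a fixed list instead of find output. This is a sketch under assumptions: the function name filter_paths and the third path are made up, and LC_ALL=C is added so that [A-Z] means exactly the ASCII capitals.

```shell
# Pass 1 keeps lines with any capital letter; pass 2 drops lines where some
# slash is followed only by non-capitals, SNAPSHOT, and more non-capitals.
filter_paths() {
  LC_ALL=C grep '[A-Z]' | LC_ALL=C grep -v '.*/[^A-Z]*SNAPSHOT[^A-Z]*$'
}

printf '%s\n' \
  ./com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar \
  ./com/foo/bar/1.2.3-SNAPSHOT/bar-1.2.3-SNAPSHOT.jar \
  ./com/Foo/bar-1.0-SNAPSHOT.jar |
  filter_paths
# prints ./com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar
```

Note that the third path is dropped even though Foo has a capital letter, because the second pattern only looks at the text after one slash. If capitals can appear in a parent directory only, the approaches that strip SNAPSHOT before testing are safer.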
If you only want the matching files, I'd do it like this:
find . -type f -regex '.*[A-Z].*' | while read -r line; do echo "$line" | sed 's/SNAPSHOT//g' | grep -q '.*[A-Z].*' && echo "$line"; done