Grep first group regexp - regex

Is there a way to specify what regexp group I want to append to my file?
In the example below I only want to store (\d{8}) in my file:
grep -P1 -o kamilla(\d{8}) >> whatever.txt

You'll need to use a Positive Lookbehind assertion or alternative so that it isn't included in the match.
Positive Lookbehind:
grep -Poi '(?<=kamilla)\d{8}'
The look-behind asserts that at the current position in the string, what precedes is "kamilla". If the assertion succeeds, the regular expression engine matches eight digits.
Alternative \K escape sequence:
grep -Poi 'kamilla\K\d{8}'
The \K escape sequence resets the starting point of the reported match. Any previously matched characters are not included in the final matched sequence.
-o option shows only the matching part that matches the pattern.

You can use the -o switch and a \K, which removes the preceding part of the match:
$ grep -Poi 'kamilla\K\d{8}' <<<"kamilla83222237"
83222237
As you're using Perl-style regular expressions, you could also just use Perl:
$ perl -nE 'say $1 if /kamilla(\d{8})/' <<<"kamilla83222237"
83222237

Another way:
$ grep -P -o '(?<=kamilla)\d{8}' <<< kamilla12345678
12345678

You can use sed instead:
sed -E "s/.*kamilla(\d{8}).*/\1/g" input.txt >> output.txt
This is replacing input line with first matching group \1 and printing it.
This also allows you to manipulate input file is some non-trivial ways. For example, you can match two groups and output them in non-default order, like \2\1 and so on.

Related

Regex to match exact version phrase

I have versions like:
v1.0.3-preview2
v1.0.3-sometext
v1.0.3
v1.0.2
v1.0.1
I am trying to get the latest version that is not preview (doesn't have text after version number) , so result should be:
v1.0.3
I used this grep: grep -m1 "[v\d+\.\d+.\d+$]"
but it still outputs: v1.0.3-preview2
what I could be missing here?
To return first match for pattern v<num>.<num>.<num>, use:
grep -m1 -E '^v[0-9]+(\.[0-9]+){2}$' file
v1.0.3
If you input file is unsorted then use grep | sort -V | head as:
grep -E '^v[0-9]+(\.[0-9]+){2}$' f | sort -rV | head -1
When you use ^ or $ inside [...] they are treated a literal character not the anchors.
RegEx Details:
^: Start
v: Match v
[0-9]+: Match 1+ digits
(\.[0-9]+){2}: Match a dot followed by 1+ dots. Repeat this group 2 times
$: End
To match the digits with grep, you can use
grep -m1 "v[[:digit:]]\+\.[[:digit:]]\+\.[[:digit:]]\+$" file
Note that you don't need the [ and ] in your pattern, and to escape the dot to match it literally.
With awk you could try following awk code.
awk 'match($0,/^v[0-9]+(\.[0-9]+){2}$/){print;exit}' Input_file
Explanation of awk code: Simple explanation of awk program would be, using match function of awk to match regex to match version, once match is found print the matched value and exit from program.
Regular expressions match substrings, not whole strings. You need to explicitly match the start (^) and end ($) of the pattern.
Keep in mind that $ has special meaning in double quoted strings in shell scripts and needs to be escaped.
The boundary characters need to be outside of any group ([]).

Regexp or Grep in Bash

Can you please tell me how to get the token value correctly? At the moment I am getting: "1jdq_dnkjKJNdo829n4-xnkwe",258],["FbtResult
echo '{"facebookdotcom":true,"messengerdotcom":false,"workplacedotcom":false},827],["DTSGInitialData",[],{"token":"aaaaaaa"},258],["FbtResult' | sed -n 's/.*"token":\([^}]*\)\}/\1/p'
You need to match the full string, and to get rid of double quotes, you need to match a " before the token and use a negated bracket expression [^"] instead of [^}]:
sed -n 's/.*"token":"\([^"]*\).*/\1/p'
Details:
.* - any zero or more chars
"token":" - a literal "token":" string
\([^"]*\) - Group 1 (\1 refers to this value): any zero or more chars other than "
.* - any zero or more chars.
This replacement works:
echo '{"facebookdotcom":true,"messengerdotcom":false,"workplacedotcom":false},827],["DTSGInitialData",[],{"token":"aaaaaaa"},258],["FbtResult'
| sed -n 's/.*"token":"\([a-z]*\)"\}.*/\1/p'
Key capture after "token" found between quotes via \([a-z]*\), followed by a closing brace \} and remaining characters after that as .* (you were missing this part before, which caused the replacement to include the text after keyword as well).
Output:
aaaaaaa
A grep solution:
echo '{"facebookdotcom":true,"messengerdotcom":false,"workplacedotcom":false},827],["DTSGInitialData",[],{"token":"aaaaaaa"},258],["FbtResult' | grep -Po '(?<="token":")[^"]+'
yields
aaaaaaa
The -P option to grep enables the Perl-compatible regex (PCRE).
The -o option tells grep to print only the matched substring, not the entire line.
The regex (?<="token":") is a PCRE-specific feature called a zero-width positive lookbehind assertion. The expression (?<=pattern) matches a pattern without including it in the matched result.

Sed removing after ip

I have a simple sed question.
I have data like this:
boo:moo:127.0.0.1--¹óÖÝÊ¡µçÐÅ
foo:joo:127.0.0.1 ÁÉÄþÊ¡ÉòÑôÊвʺçÍø°É
How do I make it like this:
boo:moo:127.0.0.1
foo:joo:127.0.0.1
My sed code
sed -e 's/\.[^\.]*$//' test.txt
Thanks!
For the given sample, you could capture everything from start of line till last digit in the line
$ sed 's/\(.*[0-9]\).*/\1/' ip.txt
boo:moo:127.0.0.1
foo:joo:127.0.0.1
$ grep -o '.*[0-9]' ip.txt
boo:moo:127.0.0.1
foo:joo:127.0.0.1
Or, you could delete all non-digit characters at end of line
$ sed 's/[^0-9]*$//' ip.txt
boo:moo:127.0.0.1
foo:joo:127.0.0.1
You may find an IP like substring and remove all after it:
sed -E 's/([0-9]{1,3}(\.[0-9]{1,3}){3}).*/\1/' # POSIX ERE version
sed 's/\([0-9]\{1,3\}\(\.[0-9]\{1,3\}\)\{3\}\).*/\1/' # BRE POSIX version
The ([0-9]{1,3}(\.[0-9]{1,3}){3}) pattern is a simplified IP address regex pattern that matches and captures 1 to 3 digits and then 3 occurrences of a dot and again 1 to 3 digits, and then .* matches and consumes the rest of the line. The \1 placeholder in the replacement pattern inserts the captured value back into the result.
Note that in the BRE POSIX pattern, you have to escape ( and ) to make them a capturing group construct and you need to escape {...} to make it a range/interval/limiting quantifier (it has lots of names in the regex literature).
See an online demo.

Extract QueryString value using sed

I have the following lines in an apache access log
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229656&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229657&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229658&blah
and i want to extract the MSISDN value only, so expected output would be
647930229655
647930229656
647930229657
647930229658
I'm using the following sed command but i can't get it to stop capturing at &
sed 's/.*MSISDN=\(.*\)/\1/'
sed solution:
sed -E 's/.*&MSISDN=([^&]+).*/\1/' file
& - is key/value pair separator in URL syntax, so you should rely on it
([^&]+) - 1st captured group containing any character sequence except &
\1 - backreference to the 1st captured group
The output:
647930229655
647930229656
647930229657
647930229658
-o : means print only matching string not the whole line.
-P: To enable pcre regex.
\K: means ignore everything on the left. But should be part of actual input string.
\d: means digit, + means one or more digit.
grep -oP 'MSISDN=\K\d+' input
647930229655
647930229656
647930229657
647930229658
Following simple sed may help you on same.
sed 's/.*MSISDN=//;s/&.*//' Input_file
Explanation:
s/.*MSISDN=//: s means substitute .*MSISDN= string with // NULL in current line.
; semi colon tells sed that there is 1 more statement to be executed.
s/&.*//g': s/&.*// means substitute &.* from & to everything with NULL.
$ grep -oP '(?<=&MSISDN=)\d+' file
647930229655
647930229656
647930229657
647930229658
-o option is meant to show only matched output
-P option is meant to enable PCRE (Perl Compatible Regex)
(?<=regex) this is to enforce positive look behind assertion. You can read more about them over here. Lookarounds dont consume any characters while matching unlike normal regex. Hence the only matched output you get it \d+ which is 1 or more digits.
or using sed:
$ sed -r 's/^.*MSISDN=([0-9]+).*$/\1/' file
647930229655
647930229656
647930229657
647930229658
you can also pipe cut to cut
cut -d '&' -f3 Input_file |cut -d '=' -f2

Regex to match unique substrings

Here's a basic regex technique that I've never managed to remember. Let's say I'm using a fairly generic regex implementation (e.g., grep or grep -E). If I were to do a list of files and match any that end in either .sty or .cls, how would I do that?
ls | grep -E "\.(sty|cls)$"
\. matches literally a "." - an unescaped . matches any character
(sty|cls) - match "sty" or "cls" - the | is an or and the brackets limit the expression.
$ forces the match to be at the end of the line
Note, you want grep -E or egrep, not grep -e as that's a different option for lists of patterns.
egrep "\.sty$|\.cls$"
This regex:
\.(sty|cls)\z
will match any string ends with .sty or .cls
EDIT:
for grep \z should be replaced with $ i.e.
\.(sty|cls)$
as jelovirt suggested.