How to match sequence ../../../ in sed? - regex

How can I produce the following output with sed (regex)?
I need to extract the leading '../' sequences.
input
$ ../some/path
$ ../../some/path
$ ../../../any/path/to/folder
$ ../../../../path/to/some/folder
output
$ ../
$ ../../
$ ../../../
$ ../../../../

I would just do:
sed 's#/[^.].*#/#'
or (depending on the input and desired behavior):
sed -E 's#/[^.]+#/#'

Match all repeated ../ prefixes from the beginning and replace the rest with nothing:
s#^((\.\./)*).*$#\1#g
with extended regular expressions, or with basic regular expressions:
s#^\(\(\.\./\)*\).*$#\1#g
^ matches beginning of line
(\.\./) matches ../
* repeats 0 or more times
.*$ matches the rest until the end of line
\1 references the first capturing group match
However, this is probably easier with grep than with sed, since no replacement is needed (so it is likely a bit faster too):
grep -Eo '^(\.\./)*'
-E enables extended regular expressions (egrep is the older, deprecated spelling of grep -E), and -o prints only the matching portion of the line.
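As a quick check of the approaches above, here is a minimal sketch that runs the sed substitution over two of the sample paths. The function name leading_parents is a hypothetical helper introduced for illustration, not something from the answers.

```shell
# Hypothetical helper: keep only the leading run of "../" on each input line.
leading_parents() {
  sed -E 's#^((\.\./)*).*$#\1#'
}

printf '%s\n' '../some/path' '../../../any/path/to/folder' | leading_parents
# ../
# ../../../
```

One difference worth knowing: a line with no leading ../ becomes an empty line with this sed command, whereas grep -o prints nothing at all for it.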

Related

Sed removing after ip

I have a simple sed question.
I have data like this:
boo:moo:127.0.0.1--¹óÖÝÊ¡µçÐÅ
foo:joo:127.0.0.1 ÁÉÄþÊ¡ÉòÑôÊвʺçÍø°É
How do I make it like this:
boo:moo:127.0.0.1
foo:joo:127.0.0.1
My sed code
sed -e 's/\.[^\.]*$//' test.txt
Thanks!
For the given sample, you could capture everything from start of line till last digit in the line
$ sed 's/\(.*[0-9]\).*/\1/' ip.txt
boo:moo:127.0.0.1
foo:joo:127.0.0.1
$ grep -o '.*[0-9]' ip.txt
boo:moo:127.0.0.1
foo:joo:127.0.0.1
Or, you could delete all non-digit characters at end of line
$ sed 's/[^0-9]*$//' ip.txt
boo:moo:127.0.0.1
foo:joo:127.0.0.1
You may find an IP like substring and remove all after it:
sed -E 's/([0-9]{1,3}(\.[0-9]{1,3}){3}).*/\1/' # POSIX ERE version
sed 's/\([0-9]\{1,3\}\(\.[0-9]\{1,3\}\)\{3\}\).*/\1/' # BRE POSIX version
The ([0-9]{1,3}(\.[0-9]{1,3}){3}) pattern is a simplified IP address regex pattern that matches and captures 1 to 3 digits and then 3 occurrences of a dot and again 1 to 3 digits, and then .* matches and consumes the rest of the line. The \1 placeholder in the replacement pattern inserts the captured value back into the result.
Note that in the BRE POSIX pattern, you have to escape ( and ) to make them a capturing group construct and you need to escape {...} to make it a range/interval/limiting quantifier (it has lots of names in the regex literature).
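To see why anchoring on an IP-like substring can be more robust than "keep everything up to the last digit", here is a small sketch. The helper name strip_after_ip and the ASCII sample lines are assumptions made for illustration; the second line deliberately has digits in the trailing text.

```shell
# Hypothetical helper: drop everything after the first IPv4-looking substring.
strip_after_ip() {
  sed -E 's/([0-9]{1,3}(\.[0-9]{1,3}){3}).*/\1/'
}

printf '%s\n' 'boo:moo:127.0.0.1--trailing text' \
              'foo:joo:127.0.0.1 more text 42' | strip_after_ip
# boo:moo:127.0.0.1
# foo:joo:127.0.0.1
```

With the simpler s/\(.*[0-9]\).*/\1/ form, the second line would be kept up to the 42, so which variant fits depends on what can appear after the address.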

Extract QueryString value using sed

I have the following lines in an apache access log
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229656&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229657&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229658&blah
and i want to extract the MSISDN value only, so expected output would be
647930229655
647930229656
647930229657
647930229658
I'm using the following sed command but i can't get it to stop capturing at &
sed 's/.*MSISDN=\(.*\)/\1/'
sed solution:
sed -E 's/.*&MSISDN=([^&]+).*/\1/' file
& - is key/value pair separator in URL syntax, so you should rely on it
([^&]+) - 1st captured group containing any character sequence except &
\1 - backreference to the 1st captured group
The output:
647930229655
647930229656
647930229657
647930229658
-o : means print only matching string not the whole line.
-P: To enable pcre regex.
\K: discards everything matched so far from the reported match; the text to its left must still be present in the input.
\d: means digit, + means one or more digit.
grep -oP 'MSISDN=\K\d+' input
647930229655
647930229656
647930229657
647930229658
The following simple sed may also help:
sed 's/.*MSISDN=//;s/&.*//' Input_file
Explanation:
s/.*MSISDN=//: substitutes everything up to and including MSISDN= with nothing (the replacement between the last two slashes is empty).
;: the semicolon separates this command from the next one.
s/&.*//: substitutes everything from the first remaining & to the end of the line with nothing.
$ grep -oP '(?<=&MSISDN=)\d+' file
647930229655
647930229656
647930229657
647930229658
-o option is meant to show only matched output
-P option is meant to enable PCRE (Perl Compatible Regex)
(?<=regex) is a positive lookbehind assertion. Lookarounds do not consume any characters while matching, unlike the rest of the pattern. Hence the only matched output you get is \d+, which is one or more digits.
or using sed:
$ sed -r 's/^.*MSISDN=([0-9]+).*$/\1/' file
647930229655
647930229656
647930229657
647930229658
You can also pipe cut into cut, provided MSISDN is always the third parameter:
cut -d '&' -f3 Input_file |cut -d '=' -f2
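The sed answers above can be combined into a variant that does not care where MSISDN sits in the query string. This is only a sketch: the helper name extract_msisdn is hypothetical, and it assumes the parameter is introduced by either ? or &.

```shell
# Hypothetical helper: print the MSISDN value, whether it is the first
# parameter (after "?") or a later one (after "&"). Lines without the
# parameter are skipped because of -n and the p flag.
extract_msisdn() {
  sed -nE 's/.*[?&]MSISDN=([^&]*).*/\1/p'
}

printf '%s\n' \
  '/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah' \
  '/sms/receiveHLRLookup?MSISDN=647930229656' \
  '/sms/other?Ported=No' | extract_msisdn
# 647930229655
# 647930229656
```

Unlike the cut pipeline, this does not depend on the parameter order, and unlike a plain substitution without -n it does not echo lines that have no MSISDN at all.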

Grep first group regexp

Is there a way to specify what regexp group I want to append to my file?
In the example below I only want to store (\d{8}) in my file:
grep -P1 -o kamilla(\d{8}) >> whatever.txt
You'll need to use a Positive Lookbehind assertion or alternative so that it isn't included in the match.
Positive Lookbehind:
grep -Poi '(?<=kamilla)\d{8}'
The look-behind asserts that at the current position in the string, what precedes is "kamilla". If the assertion succeeds, the regular expression engine matches eight digits.
Alternative \K escape sequence:
grep -Poi 'kamilla\K\d{8}'
The \K escape sequence resets the starting point of the reported match. Any previously matched characters are not included in the final matched sequence.
-o option shows only the matching part that matches the pattern.
You can use the -o switch and a \K, which removes the preceding part of the match:
$ grep -Poi 'kamilla\K\d{8}' <<<"kamilla83222237"
83222237
As you're using Perl-style regular expressions, you could also just use Perl:
$ perl -nE 'say $1 if /kamilla(\d{8})/' <<<"kamilla83222237"
83222237
Another way:
$ grep -P -o '(?<=kamilla)\d{8}' <<< kamilla12345678
12345678
You can use sed instead:
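sed -nE 's/.*kamilla([0-9]{8}).*/\1/p' input.txt >> output.txt
This replaces each matching input line with the first capture group \1 and prints it; -n together with the p flag suppresses lines that do not match. Note that sed does not understand the Perl shorthand \d, so the digits are written as [0-9].
This also allows you to manipulate the input in some non-trivial ways. For example, you can match two groups and output them in a different order, like \2\1, and so on.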
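As an illustration of reordering groups, here is a minimal sketch. The helper name digits_then_name and the sample lines are made up for the example.

```shell
# Hypothetical helper: for lines containing "kamilla" followed by 8 digits,
# print the digits first and the name second. Other lines are dropped.
digits_then_name() {
  sed -nE 's/.*(kamilla)([0-9]{8}).*/\2 \1/p'
}

printf '%s\n' 'id kamilla83222237 end' 'no match here' | digits_then_name
# 83222237 kamilla
```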

grep not returning expected result with regex on xml

I'm running a grep command on some xml, and it appears to be misinterpreting the regular expression I'm trying to use.
Here's the command
grep '<ernm:NewReleaseMessage.*?>' ./075679942012_ORIGNAL.xml
What appears to be happening is that the ?> part of the regex causes no match at all, rather than matching up to the first occurrence of >.
Any ideas?
If you want to get the text up to the first occurrence of the > character, then try the command below:
grep -o '<ernm:NewReleaseMessage[^>]*>' file
If you want the whole line then remove -o parameter.
Example:
$ cat aa1.txt
<ernm:NewReleaseMessage blah> foo bar>
$ grep -o '<ernm:NewReleaseMessage[^>]*>' aa1.txt
<ernm:NewReleaseMessage blah>
grep with -o prints only the matched text.
[^>]* - zero or more characters that are not >, so it matches up to the first occurrence of the > character.
By default, grep uses basic regular expressions and treats ? as a literal question mark, so your pattern only matches lines that contain a literal ?>. With GNU grep, you can escape the character to make it a quantifier:
grep '<ernm:NewReleaseMessage.*\?>' ./075679942012_ORIGNAL.xml
You can use the -E option which interprets the pattern as an extended regular expression.
grep -E '<ernm:NewReleaseMessage.*?>' ./075679942012_ORIGNAL.xml
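Note: The commands above return the whole line that matches your pattern. Also be aware that POSIX regular expressions have no lazy quantifiers: .*\? (or .*? with -E) is simply an optional .*, which is still greedy, so the match runs to the last > on the line. If you only want the matched text, use the -o option, which prints only the matched parts of matching lines; to stop at the first >, use [^>]* or a PCRE pattern with grep -P.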
grep -o '<ernm:NewReleaseMessage.*\?>' ./075679942012_ORIGNAL.xml
OR
grep -Eo '<ernm:NewReleaseMessage.*?>' ./075679942012_ORIGNAL.xml
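The difference between the two suggestions is easiest to see side by side. The sketch below assumes GNU grep and reuses the sample line from the earlier answer; first_tag is a hypothetical helper name.

```shell
line='<ernm:NewReleaseMessage blah> foo bar>'

# ERE has no lazy quantifier: with GNU grep, ".*?" is just an optional ".*",
# so the match runs to the LAST ">" and the whole line is printed.
printf '%s\n' "$line" | grep -Eo '<ernm:NewReleaseMessage.*?>'

# Hypothetical helper: a negated bracket expression stops at the FIRST ">".
first_tag() {
  grep -o '<ernm:NewReleaseMessage[^>]*>'
}

printf '%s\n' "$line" | first_tag
# <ernm:NewReleaseMessage blah>
```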

UNIX grep with $

I have a quick question:
Suppose I have a file contains:
abc$
$
$abc
and then I use grep "c\$" filename, I get $abc only. But if I use grep "c\\$" filename, I get abc$.
I am pretty confused: doesn't the backslash already turn off the special meaning of $? So shouldn't grep "c\$" filename return the line abc$?
I really hope someone can kindly give me some suggestions.
Many thanks in advance.
The double quotes are throwing you off. They allow the shell to process meta-characters (including the backslash before $) before grep ever sees the pattern. On my Linux box, using single quotes only:
$ grep 'abc$' <<<'abc$'
$ grep 'abc\$' <<<'abc$'
abc$
$ grep 'abc\$' <<<"abc$"
abc$
$ grep 'abc$' <<<'abc$'
$ grep 'abc\\$' <<<'abc$'
$
Note that the only pattern in the five commands above that matched (and printed the line) was abc\$. When I didn't escape the $, grep assumed I was looking for the string abc anchored to the end of the line. When I put a single backslash before the $, it recognized the $ as a literal character and not as an end-of-line anchor.
Note that the $ as an end-of-line anchor has some intelligence in basic regular expressions (grep's default). If I put the $ in the middle of a regular expression, it's a regular character:
$ grep 'a$bc' <<<'a$bc'
a$bc
$ grep 'a\$bc' <<<'a$bc'
a$bc
Here, it found the literal string a$bc whether or not I escaped the $.
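This leniency is specific to basic regular expressions. As a rough comparison, the sketch below runs the same pattern with and without -E; in extended regular expressions the $ is an anchor wherever it appears, so the unescaped form is expected not to match.

```shell
# BRE (default): "$" in the middle of the pattern is an ordinary character.
printf '%s\n' 'a$bc' | grep 'a$bc'
# a$bc

# ERE (-E): "$" is always an anchor, so nothing can follow it on the line.
printf '%s\n' 'a$bc' | grep -E 'a$bc' || echo 'no match'
# no match

# Escaping the "$" works in both flavours.
printf '%s\n' 'a$bc' | grep -E 'a\$bc'
# a$bc
```

Escaping the $ whenever a literal dollar sign is meant avoids having to remember which flavour is in use.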
Tried things with double quotes:
$ grep "abc\$" <<<'abc$'
$ grep "abc\\$" <<<'abc$'
abc$
With the single \, the shell consumed the backslash and grep received abc$, where $ is the end-of-line anchor. With two \\, the shell reduced them to one backslash, so grep received abc\$ and treated the $ as a literal character.
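Don't be tempted to think that $ never needs to be escaped: at the end of a pattern it is an anchor, and matching it literally there requires \$.
The GNU grep manual lists the characters that behave the other way round in basic regular expressions:
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).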
I would suggest using fgrep (or its modern spelling, grep -F) if you want to search for a literal $ without escaping it (unescaped, a trailing $ means end of line):
fgrep 'abc$' <<< 'abc$'
gives this output:
abc$
PS: fgrep is the same as grep -F; from man grep:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched.
The $ sign has a special meaning in regexp patterns: end of line. So when you use double quotes
grep "c\$"
the shell expands the string to the two characters c and $, and grep treats that as the regexp "find all lines with c at the end".
With single quotes, every character is passed through literally, i.e. the
grep 'c\$'
command passes three characters: c, \ and $. grep receives all of them, so it sees the escaped special symbol \$ and does what you expected.
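One way to take the guesswork out of this is to print the pattern exactly as the shell hands it over. The sketch below does that with a hypothetical helper, show_arg, and then runs two of the patterns against the three sample lines from the question.

```shell
# Hypothetical helper: print the argument exactly as the command receives it.
show_arg() {
  printf '%s\n' "$1"
}

show_arg "c\$"    # c$   (the shell removed the backslash)
show_arg "c\\$"   # c\$  (two backslashes collapse to one)
show_arg 'c\$'    # c\$  (single quotes pass everything through)

# The same patterns against the sample file contents:
sample=$(printf '%s\n' 'abc$' '$' '$abc')
printf '%s\n' "$sample" | grep "c\$"   # $abc
printf '%s\n' "$sample" | grep 'c\$'   # abc$
```

Single quotes are usually the simpler choice for regular expressions, since what you type is what grep gets.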