Sed removing after ip - regex

I have a simple sed question.
I have data like this:
boo:moo:127.0.0.1--¹óÖÝÊ¡µçÐÅ
foo:joo:127.0.0.1 ÁÉÄþÊ¡ÉòÑôÊвʺçÍø°É
How do I make it like this:
boo:moo:127.0.0.1
foo:joo:127.0.0.1
My sed code
sed -e 's/\.[^\.]*$//' test.txt
Thanks!

For the given sample, you could capture everything from the start of the line up to the last digit in the line:
$ sed 's/\(.*[0-9]\).*/\1/' ip.txt
boo:moo:127.0.0.1
foo:joo:127.0.0.1
$ grep -o '.*[0-9]' ip.txt
boo:moo:127.0.0.1
foo:joo:127.0.0.1
Or, you could delete all non-digit characters at end of line
$ sed 's/[^0-9]*$//' ip.txt
boo:moo:127.0.0.1
foo:joo:127.0.0.1

You can match an IP-like substring and remove everything after it:
sed -E 's/([0-9]{1,3}(\.[0-9]{1,3}){3}).*/\1/' # POSIX ERE version
sed 's/\([0-9]\{1,3\}\(\.[0-9]\{1,3\}\)\{3\}\).*/\1/' # BRE POSIX version
The ([0-9]{1,3}(\.[0-9]{1,3}){3}) pattern is a simplified IP address regex pattern that matches and captures 1 to 3 digits and then 3 occurrences of a dot and again 1 to 3 digits, and then .* matches and consumes the rest of the line. The \1 placeholder in the replacement pattern inserts the captured value back into the result.
Note that in the BRE POSIX pattern, you have to escape ( and ) to make them a capturing group construct and you need to escape {...} to make it a range/interval/limiting quantifier (it has lots of names in the regex literature).
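As a rough sketch of the IP-based approach, the ERE version can be wrapped in a small helper. The function name strip_after_ip and the ASCII sample lines are made up for illustration. LC_ALL=C is an assumption added so that sed treats the trailing text as raw bytes, in case it is not valid in the current locale. Unlike the last-digit approach, this should not be thrown off by digits in the trailing text.

```shell
# Hypothetical helper: keep everything up to and including the first
# IPv4-like substring and drop the rest of each line.
strip_after_ip() {
  # LC_ALL=C is an assumption: it makes sed work on raw bytes, so '.*'
  # can also consume trailing text in an unknown encoding.
  LC_ALL=C sed -E 's/([0-9]{1,3}(\.[0-9]{1,3}){3}).*/\1/'
}

printf '%s\n' \
  'boo:moo:127.0.0.1--trailing text 42' \
  'foo:joo:127.0.0.1 more text' |
  strip_after_ip
# boo:moo:127.0.0.1
# foo:joo:127.0.0.1
```

The pattern is deliberately loose (it would also accept 999.999.999.999), which is usually fine when the goal is trimming rather than validating.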

Related

Extract string between underscores and dot

I have strings like these:
/my/directory/file1_AAA_123_k.txt
/my/directory/file2_CCC.txt
/my/directory/file2_KK_45.txt
So basically, the number of underscores is not fixed. I would like to extract the string between the first underscore and the dot. So the output should be something like this:
AAA_123_k
CCC
KK_45
I found this solution that works:
string='/my/directory/file1_AAA_123_k.txt'
tmp="${string%.*}"
echo $tmp | sed 's/^[^_:]*[_:]//'
But I am wondering if there is a more 'elegant' solution (e.g. 1 line code).
With bash version >= 3.0 and a regex:
[[ "$string" =~ _(.+)\. ]] && echo "${BASH_REMATCH[1]}"
You can use a single sed command like
sed -n 's~^.*/[^_/]*_\([^/]*\)\.[^./]*$~\1~p' <<< "$string"
sed -nE 's~^.*/[^_/]*_([^/]*)\.[^./]*$~\1~p' <<< "$string"
Details:
^ - start of string
.* - any text
/ - a / char
[^_/]* - zero or more chars other than / and _
_ - a _ char
\([^/]*\) (POSIX BRE) / ([^/]*) (POSIX ERE, enabled with the -E option) - Group 1: zero or more chars other than /
\. - a dot
[^./]* - zero or more chars other than . and /
$ - end of string.
With -n, default line output is suppressed and p only prints the result of successful substitution.
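To see the effect of -n together with the p flag, here is a small sketch that feeds the ERE version a name without an underscore. The helper name middle_part and the third sample path are hypothetical and not part of the question.

```shell
# Hypothetical helper: print the part of each file name between the first
# underscore and the last dot; lines that do not match print nothing.
middle_part() {
  sed -nE 's~^.*/[^_/]*_([^/]*)\.[^./]*$~\1~p'
}

printf '%s\n' \
  '/my/directory/file1_AAA_123_k.txt' \
  '/my/directory/file2_CCC.txt' \
  '/my/directory/no-underscore.txt' |
  middle_part
# AAA_123_k
# CCC
```

Without -n and p, the third line would be echoed unchanged, which is rarely what you want when extracting values.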
With your shown samples, you could try the following with GNU grep:
grep -oP '.*?_\K([^.]*)' Input_file
Explanation: GNU grep's -o option prints only the matched part, and -P enables PCRE. The regex .*?_\K([^.]*) returns the value between the first _ and the first . that follows it:
.*?_ ## Lazily match from the start of the line up to the first _
\K ## Discard everything matched so far, so only the part that follows is printed
([^.]*) ## Match everything up to the first dot
A simpler sed solution without any capturing group:
sed -E 's/^[^_]*_|\.[^.]*$//g' file
AAA_123_k
CCC
KK_45
If you need to process the file names one at a time (e.g., within a while read loop), you can perform two parameter expansions:
$ string='/my/directory/file1_AAA_123_k.txt.2'
$ tmp="${string#*_}"
$ tmp="${tmp%%.*}"
$ echo "${tmp}"
AAA_123_k
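The choice between % and %% matters for names with more than one dot. The following sketch uses the same sample string; the variable names short and long are only for illustration.

```shell
string='/my/directory/file1_AAA_123_k.txt.2'

tmp=${string#*_}     # drop the shortest prefix ending in '_'
short=${tmp%.*}      # drop the shortest suffix starting with '.'
long=${tmp%%.*}      # drop the longest suffix starting with '.'

printf '%s\n' "$short" "$long"
# AAA_123_k.txt
# AAA_123_k
```

Note that ${string#*_} cuts at the first underscore anywhere in the path, so this assumes the directory part contains no underscore.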
One idea to parse a list of file names at the same time:
$ cat file.list
/my/directory/file1_AAA_123_k.txt.2
/my/directory/file2_CCC.txt
/my/directory/file2_KK_45.txt
$ sed -En 's/[^_]*_([^.]+).*/\1/p' file.list
AAA_123_k
CCC
KK_45
Using sed
$ sed 's/[^_]*_//;s/\..*//' input_file
AAA_123_k
CCC
KK_45
This is easy, except that it includes the initial underscore:
ls | grep -o "_[^.]*"
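One possible way to drop that leading underscore is to pipe the result through cut. This is only a sketch: the helper name after_underscore is made up, and printf stands in for ls so the example does not depend on the files in the current directory.

```shell
# Hypothetical helper: match from the first '_' up to the first dot,
# then remove the leading underscore with cut.
after_underscore() {
  grep -o '_[^.]*' | cut -c2-
}

printf '%s\n' file1_AAA_123_k.txt file2_CCC.txt file2_KK_45.txt | after_underscore
# AAA_123_k
# CCC
# KK_45
```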

Regex to match exact version phrase

I have versions like:
v1.0.3-preview2
v1.0.3-sometext
v1.0.3
v1.0.2
v1.0.1
I am trying to get the latest version that is not a preview (doesn't have text after the version number), so the result should be:
v1.0.3
I used this grep: grep -m1 "[v\d+\.\d+.\d+$]"
but it still outputs: v1.0.3-preview2
What could I be missing here?
To return first match for pattern v<num>.<num>.<num>, use:
grep -m1 -E '^v[0-9]+(\.[0-9]+){2}$' file
v1.0.3
If your input file is unsorted, pipe grep into sort -rV and head:
grep -E '^v[0-9]+(\.[0-9]+){2}$' f | sort -rV | head -1
When you use ^ or $ inside [...], they are treated as literal characters, not as anchors (a leading ^ negates the bracket expression instead).
RegEx Details:
^: Start
v: Match v
[0-9]+: Match 1+ digits
(\.[0-9]+){2}: Match a dot followed by 1+ digits. Repeat this group 2 times
$: End
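A short side-by-side sketch may make the difference clearer. The function names bracket_match and anchored_match are hypothetical; the version list is the one from the question, assumed to be sorted newest first.

```shell
versions='v1.0.3-preview2
v1.0.3-sometext
v1.0.3
v1.0.2
v1.0.1'

# The original pattern is one bracket expression: it matches any single
# character out of v, \, d, +, . and $, so the first line already matches.
bracket_match() {
  grep -m1 '[v\d+\.\d+.\d+$]'
}

# Anchored ERE: the whole line must be v<num>.<num>.<num>.
anchored_match() {
  grep -m1 -E '^v[0-9]+(\.[0-9]+){2}$'
}

printf '%s\n' "$versions" | bracket_match    # v1.0.3-preview2
printf '%s\n' "$versions" | anchored_match   # v1.0.3
```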
To match the digits with grep, you can use
grep -m1 "v[[:digit:]]\+\.[[:digit:]]\+\.[[:digit:]]\+$" file
Note that you don't need the [ and ] in your pattern, and you need to escape the dot to match it literally.
With awk, you could try the following:
awk 'match($0,/^v[0-9]+(\.[0-9]+){2}$/){print;exit}' Input_file
Explanation: awk's match function tests each line against the version regex; on the first line that matches, the program prints that line and exits.
Regular expressions match substrings, not whole strings. You need to explicitly match the start (^) and end ($) of the pattern.
Keep in mind that $ has special meaning in double-quoted strings in shell scripts, so use single quotes or escape it.
The anchors need to be outside of any bracket expression ([]).

Extract QueryString value using sed

I have the following lines in an apache access log
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229656&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229657&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229658&blah
and I want to extract the MSISDN value only, so the expected output would be
647930229655
647930229656
647930229657
647930229658
I'm using the following sed command but I can't get it to stop capturing at &
sed 's/.*MSISDN=\(.*\)/\1/'
sed solution:
sed -E 's/.*&MSISDN=([^&]+).*/\1/' file
& - is key/value pair separator in URL syntax, so you should rely on it
([^&]+) - 1st captured group containing any character sequence except &
\1 - backreference to the 1st captured group
The output:
647930229655
647930229656
647930229657
647930229658
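The same idea can be generalized to any parameter name. The sketch below is only an illustration: query_param is a made-up helper, it assumes the name contains no regex metacharacters, and it also stops at whitespace because an access log line usually continues after the URL.

```shell
# Hypothetical helper: print the value of query parameter $1 for each
# input line that contains it.
query_param() {
  # [?&] anchors the name at a parameter boundary; the value ends at the
  # next '&' or at whitespace. -n with p skips lines without the parameter.
  sed -nE "s/.*[?&]$1=([^&[:space:]]*).*/\1/p"
}

line='/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah'
printf '%s\n' "$line" | query_param MSISDN   # 647930229655
printf '%s\n' "$line" | query_param Ported   # No
```

Matching [?&] instead of just & also covers the case where the parameter is the first one after the question mark.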
-o: print only the matching part, not the whole line.
-P: enable PCRE.
\K: drop everything matched to its left from the reported match; that text must still be present in the input.
\d: a digit; + means one or more.
grep -oP 'MSISDN=\K\d+' input
647930229655
647930229656
647930229657
647930229658
The following simple sed may also help:
sed 's/.*MSISDN=//;s/&.*//' Input_file
Explanation:
s/.*MSISDN=//: substitute everything up to and including MSISDN= with an empty string.
;: the semicolon separates this command from the next one.
s/&.*//: substitute everything from the first remaining & to the end of the line with an empty string.
$ grep -oP '(?<=&MSISDN=)\d+' file
647930229655
647930229656
647930229657
647930229658
-o option is meant to show only matched output
-P option is meant to enable PCRE (Perl Compatible Regex)
(?<=regex) is a positive lookbehind assertion. Lookarounds don't consume any characters while matching, unlike the rest of the regex. Hence the only matched output you get is \d+, which is 1 or more digits.
or using sed:
$ sed -r 's/^.*MSISDN=([0-9]+).*$/\1/' file
647930229655
647930229656
647930229657
647930229658
You can also pipe cut into cut:
cut -d '&' -f3 Input_file |cut -d '=' -f2

Grep first group regexp

Is there a way to specify what regexp group I want to append to my file?
In the example below I only want to store (\d{8}) in my file:
grep -P1 -o kamilla(\d{8}) >> whatever.txt
You'll need to use a Positive Lookbehind assertion or alternative so that it isn't included in the match.
Positive Lookbehind:
grep -Poi '(?<=kamilla)\d{8}'
The look-behind asserts that at the current position in the string, what precedes is "kamilla". If the assertion succeeds, the regular expression engine matches eight digits.
Alternative \K escape sequence:
grep -Poi 'kamilla\K\d{8}'
The \K escape sequence resets the starting point of the reported match. Any previously matched characters are not included in the final matched sequence.
The -o option shows only the part of the line that matches the pattern.
You can use the -o switch and a \K, which removes the preceding part of the match:
$ grep -Poi 'kamilla\K\d{8}' <<<"kamilla83222237"
83222237
As you're using Perl-style regular expressions, you could also just use Perl:
$ perl -nE 'say $1 if /kamilla(\d{8})/' <<<"kamilla83222237"
83222237
Another way:
$ grep -P -o '(?<=kamilla)\d{8}' <<< kamilla12345678
12345678
You can use sed instead:
sed -nE 's/.*kamilla([0-9]{8}).*/\1/p' input.txt >> output.txt
This replaces each matching input line with the first capture group \1 and prints it; -n together with the p flag keeps non-matching lines out of the output. Note that sed's ERE has no \d, so [0-9] is used instead.
This also allows you to manipulate the input in some non-trivial ways. For example, you can match two groups and output them in a non-default order, like \2\1.
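As a small sketch of reordering groups, the following captures the keyword and the digits separately and prints them swapped. The helper name swap_groups and the sample lines are made up for illustration.

```shell
# Hypothetical helper: capture "kamilla" and the eight digits as two
# groups, then emit them in reverse order separated by a space.
swap_groups() {
  sed -nE 's/.*(kamilla)([0-9]{8}).*/\2 \1/p'
}

printf '%s\n' 'id kamilla83222237 end' 'no match here' | swap_groups
# 83222237 kamilla
```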

Return last [0-9]\{6\} from a string with sed

I want to pass a long list of filenames in the form
something_0230232_long_5160mK.csv
something_0230232_long-025160mK.csv
simething_0230342_lingk425460mK.csv
to sed (or similar Linux shell tools) and always get the last run of digits before mK on each line.
The following works if there are exactly 6 digits. How can I extend it to n digits?
echo "something_0230232_long_025160mK.csv" | sed -e "s/S.*\([0-9]\{6\}\)mK\.csv/\1/p"
Solution using GNU grep:
$ grep -Po '[0-9]+(?=mK)' file
5160
025160
425460
Explanation:
-o show only the part of the line that matches.
-P use perl regexp.
[0-9]+ # Match a string of digits (at least one)
(?=mK) # Followed by mK (positive lookahead)
And with sed (since you asked):
sed -E 's/.*[^0-9]([0-9]+)mK.*/\1/' file
-E use extended regexp (alias for -r, but more portable).
s/ # Substitution
.* # Match everything, up to
[^0-9] # a character that's not a digit
([0-9]+) # Capture the last digit string
mK # Followed by the string mK
.* # Match everything left
/ # Replace with
\1 # The captured digit string only
/ #
You're on the right track with your sed command:
echo "something_0230232_long_025160mK.csv" |
sed -e 's/^.*[^0-9]\([0-9]\{1,\}\)mK\.csv/\1/'
Differences:
Replace S with ^. This matches at the start (there is no S in the data, so the original would never match).
Replace 6 with 1,. This means 'one or more digits' given the context (strictly, one or more repeats of the previous regex, but the previous regex was [0-9]).
Insert the [^0-9] to stop the .* from being too greedy. When the number of digits matched was fixed (\{6\}), the rigidity prevented the .* from being too greedy. When you have two flexible ranges, the first will be the longest possible. Without the [^0-9], you get a 0 printed for the sample string.
Drop the 'p' so the value is printed once. Alternatively, keep the p and add -n as an option.
Reminder to self: test before (or shortly after) you post.
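The greediness point can be checked directly. This sketch runs both variants on the sample name from the question; the variable names greedy and guarded are only for illustration.

```shell
name='something_0230232_long_025160mK.csv'

# Two flexible parts back to back: the greedy .* swallows all but one
# digit, so the group captures only the final 0.
greedy=$(printf '%s\n' "$name" | sed 's/^.*\([0-9]\{1,\}\)mK\.csv/\1/')

# Requiring a non-digit before the group pins it to the whole digit run.
guarded=$(printf '%s\n' "$name" | sed 's/^.*[^0-9]\([0-9]\{1,\}\)mK\.csv/\1/')

printf '%s\n' "$greedy" "$guarded"
# 0
# 025160
```

The guarded form assumes at least one non-digit character precedes the digits, which holds for all the file names shown in the question.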
echo "something_0230232_long_025160mK.csv" | sed 's/^.*_//' | sed 's/mK.csv//'