find recurring pattern with `sed`

find recurring pattern with `sed` - regex

I am using GNU bash 4.3.48
I expected that
echo "23S62M1I19M2D" | sed 's/.*\([0-9]*M\).*/\1/g'
would output 62M19M... But it doesn't.
sed 's/\([0-9]*M\)//g' deletes ALL [0-9]*M and retrieves 23S1I2D. but the group \1 is not working as I thought it would.
sed 's/.*\([0-9]*M\).*/ \1 /g', retrieves M...
What am I doing wrong?
Thank you!

With your shown samples and with awk you could try following program.
echo "23S62M1I19M2D" |
awk '
{
val=""
while(match($0,/[0-9]+M/)){
val=val substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
print val
}
'
Explanation: Simple explanation would be, using echo to print values and sending it as a standard input to awk program. In awk program using its match function to match regex mentioned in it(/[0-9]+M) running loop to find all matches in each line and printing the collected matched values at last of each line.

This might work for you (GNU sed):
sed -nE '/[0-9]*M/{s//\n&\n/g;s/(^|\n)[^\n]*\n?//gp}' file
Surround the match by newlines and then remove non-matching parts.
Alternative, using grep and tr:
grep -o '[0-9]*M' file | tr -d '\n'
N.B. tr removes all newlines (including the last one) to restore the last newline, use:
grep -o '[0-9]*M' file | tr -d '\n' | paste
The alternate solution will concatenate all results into a single line. To achieve the same result with the first solution use:
sed -nE '/[0-9]*M/{s//\n&\n/g;s/(^|\n)[^\n]*\n?//g;H};${x;s/\n//gp}' file

The problem is that the .* is greedy. Since only M is obligatory, when the engine finds last M, it satisfies the regex, so all string is matched, M is captured and thus kept after replacing with \1 backreference.
That means, you can't easily do this with sed. You can do that with Perl much easier since it supports matching and skipping pattern:
#!/bin/bash
perl -pe 's/\d+M(*SKIP)(*F)|.//g' <<< "23S62M1I19M2D"
See the online demo. The pattern matches
\d+M(*SKIP)(*F) - one or more digits, M, and then the match is omitted and the next match is searched for from the failure position
|. - or matches any char other than a line break char.
Or simply match all occurrences and concatenate them:
perl -lane 'BEGIN{$a="";} while (/\d+M/g) {$a .= $&} END{print $a;}' <<< "23S62M1I19M2D"
All \d+M matches are appended to the $a variable which is printed at the end of processing the string.

Your substitution is probably working, but not substituting what you think it is.
In the substitution s/\(foo...\)/\1/, the \1 matches whatever \(...\) matches and captures, so your substitution is replacing foo... by foo...!
% echo "1234ABC" | sed 's/\([A-Z]\)/-\1-/'g
1234-A--B--C-
So you'll need to match more, but capture only a portion of the match. For example:
echo "23S62M1I19M2D" | sed 's/[0-9]*[A-LN-Z]*\([0-9]*M\)/\1/g'
62M19M2D
In the case of sed 's/.*\([0-9]*M\).*/\1/g' (did that appear in an edit to the question, or did I just miss it?), the .* matches ‘greedily’ – it matches as much as it possibly can, thus including the digits before the M. In the example above, the [A-LN-Z] is required to be at the end of the uncaptured part, so the digits are forced to be matched by the [0-9] inside the capture.
Getting a clear idea of what ‘greedy’ means is a really important idea when writing or interpreting regexps.

If you know you will only encounter the suffixes S, M, I and D, an alternative approach would be explicitly deleting the combinations you don't want:
echo "23S62M1I19M2D" | sed 's/[0-9]\+[SID]//g'
This gives the expected:
62M19M
Update: This variant produces the same output, but rejects all non-numeric, non-M suffixes:
echo "23S62M1I19M2D" | sed 's/[0-9]\+[^0-9M]//g'

Related

sed and Perl regexp replaces once, with multiple replacements flag

I have the string:
lopy,lopy1,sym,lopy,lopy1,sym"
I want the line to be:
lopy,lopy1,sym,lady,lady1,sym
Which means that all "lad" after the string sym should be replaced. So I ran:
echo "lopy,lopy1,sym,lopy,lopy1,sym" | sed -r 's/(.*sym.*?)lopy/\1lad/g'
I get:
lopy,lopy1,sym,lopy,lad1,sym
Using Perl is not really better:
echo "lopy,lopy1,sym,lopy,lopy1,sym" | perl -pe 's/(.*sym.+?)lopy/${1}lad/g'
yields
lopy,lopy1,sym,lad,lopy1,sym
Not all "lopy" are replaced. What am I doing wrong?

The (.*sym.*?)lopy / (.*sym.+?)lopy patterns are almost the same, .+? matches one or more chars other than line break chars, but as few as possible, and .*? matches zero or more such chars. Mind that sed does not support lazy quantifiers, *? is the same as * in sed. However, the main problem with the regexps you used is that they match sym, then any text after it and then lopy, so when you added g, it just means you want to find more cases of lopy after sym....lopy. And there is only one such occurrence in your string.
You want to replace all lopy after sym, so you can use
perl -pe 's/(?:\G(?!^)|sym).*?\Klopy/lad/g'
See the regex demo. Details:
(?:\G(?!^)|sym) - sym or end of the previous match (\G(?!^))
.*? - any zero or more chars other than line break chars, as few as possible
\K - match reset operator that discards all text matched so far
lopy - a lopy string.
See the online demo:
#!/bin/bash
echo "lopy,lopy1,sym,lopy,lopy1,sym" | perl -pe 's/(?:\G(?!^)|sym).*?\Klopy/lad/g'
# => lopy,lopy1,sym,lad,lad1,sym
If the values are always comma separated, you may replace .*? with ,: (?:\G(?!^)|sym),\Klopy (see this regex demo).

Since OP has mentioned sed so I am adding awk program here. Which could be better choice in comparison to sed. With shown samples, please try following awk program.
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
awk -F',sym,' '
{
first=$1
$1=""
sub(/^[[:space:]]+/,"")
gsub(/lop/,"lad")
$0=first FS $0
}
1
'
Explanation: Adding detailed explanation for above.
echo "lopy,lopy1,sym,lopy,lopy1,sym" | ##Printing values and sending as standard output to awk program as an input.
awk -F',sym,' ' ##Making ,sym, as a field separator here.
{
first=$1 ##Creating first which has $1 of current line in it.
$1="" ##Nullifying $1 here.
sub(/^[[:space:]]+/,"") ##Substituting initial space in current line here.
gsub(/lop/,"lad") ##Globally substituting lop with lad in rest of line.
$0=first FS $0 ##Adding first FS to rest of edited line here.
}
1 ##Printing edited/non-edited line value here.
'

The problem is that the lopy(s) to replace are after sym, with a pattern like sym.*?lopy, so a global replacement looks for yet more of the whole sym+lopy-after-sym (not just for all lopys after that one sym).†
To replace all lopys (after the first sym, followed by another sym) we can capture the substring between syms and in the replacement side run code, in which a regex replaces all lopys
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe's{ sym,\K (.+?) (?=sym) }{ $1 =~ s/lop/lad/gr }ex'
To isolate the substring between syms I use \K after the first sym, which drops matches prior to it, and a positive lookahead for the sym after the substring, which doesn't consume anything. The /e modifier makes the replacement side be evaluated as code. In the replacement side's regex we need /r since $1 can't change, and we want the regex to return anyway. See perlretut.
† To match all of abbbb we can't say /ab/g, nor /(a)b/g nor /a(b)/g, because that would look for all repetitions of the whole ab in the string (and find only ab in the beginning).

sed does not support non-greedy wildcards at all. But your Perl script also fails for other reasons; you are saying "match all occurrences of this" but then you specify a regex which can only match once.
A common simple solution is to split the string, and then replace only after the match:
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe 'if (#x = /^(.*?sym,)(.*)/) { $x[1] =~ s/lop/lad/g; s/.*/$x[0]$x[1]/ }'
If you want to be fancy, you can use a lookbehind to only replace the lop occurrences after the first sym.
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe 's/(?<=sym.{0,200})lop/lad/'
The variable-length lookbehind generates a warning and is only supported in Perl 5.30+ (you can turn it off with no warnings qw(experimental::vlb));.)

Since you have shown an attempted sed command and used sed tag, here is a sed loop based solution:
sed -E -e ':a' -e 's~(sym,.*)lopy~\1lady~g; ta' file
lopy,lopy1,sym,lady,lady1,sym"
Explanation:
:a sets a label a before matching sym,.* pattern
ta jumps pattern matching back to label a after making a substitution
This looping stop when s command has nothing to match i.e. no lopy substring after sym,

How to use grep/sed/awk, to remove a pattern from beginning of a text file

I have a text file with the following pattern written to it:
TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"
I would like to discard the first part of each line containing
TIME[32.468ms] -(3)-.............
To test the regular expression I've tried the following:
cat myfile.txt | egrep "^TIME\[.*\]\s\s\-\(3\)\-\.+"
This identifies correctly the lines I want. Now, to delete the pattern I've tried:
cat myfile.txt | sed s/"^TIME\[.*\]\s\s\-\(3\)\-\.+"//
but it just seems to be doing the cat, since it shows the content of the complete file and no substitution happens.
What am I doing wrong?
OS: CentOS 7

With your shown samples, please try following grep command. Written and tested with GNU grep.
grep -oP '^TIME\[\d+\.\d+ms\]\s+-\(\d+\)-\.+\K.*' Input_file
Explanation: Adding detailed explanation for above code.
^TIME\[ ##Matching string TIME from starting of value here.
\d+\.\d+ms\] ##Matching digits(1 or more occurrences) followed by dot digits(1 or more occurrences) followed by ms ] here.
\s+-\(\d+\)-\.+ ##Matching spaces91 or more occurrences) followed by - digits(1 or more occurrences) - and 1 or more dots.
\K ##Using \K option of GNU grep to make sure previous match is found in line but don't consider it in printing, print next matched regex part only.
.* ##to match till end of the value.
2nd solution: Adding awk program here.
awk 'match($0,/^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+/){print substr($0,RSTART+RLENGTH)}' Input_file
Explanation: using match function of awk, to match regex ^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+ which will catch text which we actually want to remove from lines. Then printing rest of the text apart from matched one which is actually required by OP.

This awk using its sub() function:
awk 'sub(/^TIME[[][^]]*].*\.+/,"")' file
"TEXT I WANT TO KEEP"
If there is replacement, sub() returns true.

$ cut -d'"' -f2 file
TEXT I WANT TO KEEP

You may use:
s='TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"'
sed -E 's/^TIME\[[^]]*].*\.+//'
"TEXT I WANT TO KEEP"

The \s regex extension may not be supported by your sed.
In BRE syntax (which is what sed speaks out of the box) you do not backslash round parentheses - doing that turns them into regex metacharacters which do not match themselves, somewhat unintuitively. Also, + is just a regular character in BRE, not a repetition operator (though you can turn it into one by similarly backslashing it: \+).
You can try adding an -E option to switch from BRE syntax to the perhaps more familiar ERE syntax, but that still won't enable Perl regex extensions, which are not part of ERE syntax, either.
sed 's/^TIME\[[^][]*\][[:space:]][[:space:]]-(3)-\.*//' myfile.txt
should work on any reasonably POSIX sed. (Notice also how the minus character does not need to be backslash-escaped, though doing so is harmless per se. Furthermore, I tightened up the regex for the square brackets, to prevent the "match anything" regex you had .* from "escaping" past the closing square bracket. In some more detail, [^][] is a negated character class which matches any character which isn't (a newline or) ] or [; they have to be specified exactly in this order to avoid ambiguity in the character class definition. Finally, notice also how the entire sed script should normally be quoted in single quotes, unless you have specific reasons to use different quoting.)
If you have sed -E or sed -r you can use + instead of * but then this complicates the overall regex, so I won't suggest that here.

A simpler one for sed:
sed 's/^[^"]*//' myfile.txt

If the "text you want to keep" always surrounded by the quote like this and only them having the quote in the line starting with "TIME...", then:
sed -n '/^TIME/p' file | awk -F'"' '{print $2}'
should get the line starting with "TIME..." and print the text within the quotes.

Thanks all, for your help.
By the end, I've found a way to make it work:
echo 'TIME[32.468ms] -(3)-.............TEXT I WANT TO KEEP' | grep TIME | sed -r 's/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//'
More generally,
grep TIME myfile.txt | sed -r ‘s/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//’
Cheers,
Pedro

How can I get "grep -zoP" to display every match separately?

I have a file on this form:
X/this is the first match/blabla
X-this is
the second match-
and here we have some fluff.
And I want to extract everything that appears after "X" and between the same markers. So if I have "X+match+", I want to get "match", because it appears after "X" and between the marker "+".
So for the given sample file I would like to have this output:
this is the first match
and then
this is
the second match
I managed to get all the content between X followed by a marker by using:
grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file
That is:
grep -Po '(?<=X(.))(.|\n)+(?=\1)' to match X followed by (something) that gets captured and matched at the end with (?=\1) (I based the code on my answer here).
Note I use (.|\n) to match anything, including a new line, and that I also use -z in grep to match new lines as well.
So this works well, the only problem comes from the display of the output:
$ grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file
this is the first matchthis is
the second match
As you can see, all the matches appear together, with "this is the first match" being followed by "this is the second match" with no separator at all. I know this comes from the usage of "-z", that treats all the file as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline (quoting "man grep").
So: is there a way to get all these results separately?
I tried also in GNU Awk:
awk 'match($0, /X(.)(\n|.*)\1/, a) {print a[1]}' file
but not even the (\n|.*) worked.

awk doesn't support backreferences within regexp definition.
Workarounds:
$ grep -zPo '(?s)(?<=X(.)).+(?=\1)' ip.txt | tr '\0' '\n'
this is the first match
this is
the second match
# with ripgrep, which supports multiline matching
$ rg -NoUP '(?s)(?<=X(.)).+(?=\1)' ip.txt
this is the first match
this is
the second match
Can also use (?s)X(.)\K.+(?=\1) instead of (?s)(?<=X(.)).+(?=\1). Also, you might want to use non-greedy quantifier here to avoid matching match+xyz+foobaz for an input X+match+xyz+foobaz+
With perl
$ perl -0777 -nE 'say $& while(/X(.)\K.+(?=\1)/sg)' ip.txt
this is the first match
this is
the second match

Here is another gnu-awk solution making use of RS and RT:
awk -v RS='X.' 'ch != "" && n=index($0, ch) {
print substr($0, 1, n-1)
}
RT {
ch = substr(RT, 2, 1)
}' file
this is the first match
this is
the second match

With GNU awk for multi-char RS, RT, and gensub() and without having to read the whole file into memory:
$ awk -v RS='X.' 'NR>1{print "<" gensub(end".*","",1) ">"} {end=substr(RT,2,1)}' file
<this is the first match>
<this is
the second match>
Obviously I added the "<" and ">" so you could see where each output record starts/ends.
The above assumes that the character after X isn't a non-repetition regexp metachar (e.g. ., ^, [, etc.) so YMMV

The use case is kind of problematic, because as soon as you print the matches, you lose the information about where exactly the separator was. But if that's acceptable, try piping to xargs -r0.
grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file | xargs -r0
These options are GNU extensions, but then so is grep -z and (mostly) grep -P, so perhaps that's acceptable.

GNU grep -z terminates input/output records with null characters (useful in conjunction with other tools such as sort -z). pcregrep will not do that:
pcregrep -Mo2 '(?s)X(.)(.+?)\1' file
-onumber used instead of lookarounds. ? lazy quantifier added (in case \1 occurs later).

How to process a regular expression after being evaluated (sed)

I need to replace each character of a regular expression, once evaluated, with each character plus the # symbol.
For example:
If the regular expression is: POS[AB]
and the input text is: POSA_____POSB
I want to get this result: P#O#S#A#_____P#O#S#B#
Please, using sed or awk.
I have tried this:
$ echo "POSA_____POSB" | sed "s/POS[AB]/&#/g"
POSA#_____POSB#
$ echo "POSA_____POSB" | sed "s/./&#/g"
P#O#S#A#_#_#_#_#_#P#O#S#B#
But what I need is:
P#O#S#A#_____P#O#S#B#
Thank you in advance.
Best regards,
Octavio

Perl to the resuce!
perl -pe 's/(POS[AB])/$1 =~ s:(.):$1#:gr/ge'
The /e interprets the replacement as code, and it contains another substitution which replaces each character with itself plus #.
In ancient Perls before 5.14 (i.e. without the /r modifier), you need to use a bit more complex
perl -pe 's/(POS[AB])/$x = $1; $x =~ s:(.):$1#:g; $x/ge'

echo "POSA_____POSB" | sed "s/[^_]/&#/g"
or
echo "POSA_____POSB" | sed "s/[POSAB]/&#/g"

Try this regex:
echo "POSA_____POSB" | sed "s/[A-Z]/&#/g"
Output:
P#O#S#A#_____P#O#S#B#

You may replace regex pattern using awk with sub (first matching substring, sed "s///") or gsub (substitute matching substrings globally, sed "s///g") commands. The regex themselves will not differ between sed and awk. In your case you want:
Solution 1
EDIT: edited to match the comments
The following awk will limit substitution to a given substring (e.g.'POSA_____POSB'):
echo "OOPS POSA_____POSB" | awk '{str="POSA_____POSB"}; {gsub(/[POSAB]/,"&#",str)}; {gsub(/'POSA_____POSB'/, str); print $0} '
If your input consist only of matched string, try this:
echo "POSA_____POSB" | awk '{gsub(/[POSAB]/,"&#");}1'
Explanation:
Separate '{}' for each action and explicit print are for clarity sake.
The gsub accepts 3 arguments gsub(pattern, substitution [, target]) where target must be variable (gsub will change it inplace and store result there).
We use var named 'str' and initialize it with value (your string) before doing any substitutions.
The second gsub is there to put modified str into $0 (matches the whole record/line).
The expressions are greedy by default --- they will match the longest string possible.
[] introduces set of characters to be matched: every occurence of any char will be matched. The expression above says awk to match each occurence of any of "POSAB".
Your first regexp does not work as expected for you told sed to match POS ending in any of [AB] (the whole string at once).
In the other expression you told it to match any single character (including "_") when you used: '.' (dot).
If you want to generalize this solution you may use: [\w] expression which will match any of [a-zA-Z0-9_] or [a-z], [A-Z], [0-9] to match lowercase, uppercase letters and numbers respectively.
Solution 2
Note that you might negate character sets with [^] so: [^_] would also work in this particular case.
Explanation:
Negation means: match anything but the character between '[]'. The '^' character must come as first char, right after opening '['.
Sidenotes:
Also it may be good idea to directly indicate you want to match one character at a time with [POSAB]? or [POSAB]{1}.
Also note that some implementations of sed might need -r switch to use extended (more complicated) regexps.

With the given example you can use
echo "POSA_____POSB" | sed -r 's/POS([AB])/P#O#S#\1#/g'
This will fail for more complicated expressions.
When your input is without \v and \r, you can use
echo "POSA_____POSB" |
sed -r 's/POS([AB])/\v&\r/g; :loop;s/\v([^\r])/\1#\v/;t loop; s/[\v\r]//g'

Extract multiple occurrences on the same line using sed/regex

I am trying to loop through each line in a file and find and extract letters that start with ${ and end with }. So as the final output I am expecting only SOLDIR and TEMP(from inputfile.sh).
I have tried using the following script but it seems it matches and extracts only the second occurrence of the pattern TEMP. I also tried adding g at the end but it doesn't help. Could anybody please let me know how to match and extract both/multiple occurrences on the same line ?
inputfile.sh:
.
.
SOLPORT=\`grep -A 4 '\[LocalDB\]' \${SOLDIR}/solidhac.ini | grep \${TEMP} | awk '{print $2}'\`
.
.
script.sh:
infile='inputfile.sh'
while read line ; do
echo $line | sed 's%.*${\([^}]*\)}.*%\1%g'
done < "$infile"

May I propose a grep solution?
grep -oP '(?<=\${).*?(?=})'
It uses Perl-style lookaround assertions and lazily matches anything between '${' and '}'.
Feeding your line to it, I get
$ echo "SOLPORT=\`grep -A 4 '[LocalDB]' \${SOLDIR}/solidhac.ini | grep \${TEMP} | awk '{print $2}'\`" | grep -oP '(?<=\${).*?(?=})'
SOLDIR
TEMP

This might work for you (but maybe only for your specific input line):
sed 's/[^$]*\(${[^}]\+}\)[^$]*/\1\t/g;s/$[^{$]\+//g'

Extracting multiple matches from a single line using sed isn't as bad as I thought it'd be, but it's still fairly esoteric and difficult to read:
$ echo 'Hello ${var1}, how is your ${var2}' | sed -En '
# Replace ${PREFIX}${TARGET}${SUFFIX} with ${PREFIX}\a${TARGET}\n${SUFFIX}
s#\$\{([^}]+)\}#\a\1\n#
# Continue to next line if no matches.
/\n/!b
# Remove the prefix.
s#.*\a##
# Print up to the first newline.
P
# Delete up to the first newline and reprocess what's left of the line.
D
'
var1
var2
And all on one line:
sed -En 's#\$\{([^}]+)\}#\a\1\n#;/\n/!b;s#.*\a##;P;D'
Since POSIX extended regexes don't support non-greedy quantifiers or putting a newline escape in a bracket expression I've used a BEL character (\a) as a sentinel at the end of the prefix instead of a newline. A newline could be used, but then the second substitution would have to be the questionable s#.*\n(.*\n.*)##, which might involve a pathological amount of backtracking by the regex engine.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

find recurring pattern with `sed` - regex

Related

sed and Perl regexp replaces once, with multiple replacements flag

How to use grep/sed/awk, to remove a pattern from beginning of a text file

How can I get "grep -zoP" to display every match separately?

How to process a regular expression after being evaluated (sed)

Extract multiple occurrences on the same line using sed/regex

Categories

Resources