How to extract the second pattern from a line of text? - regex

Let me preface my question with the fact that I am doing this on an AS/400, and IBM really sucks at keeping their utilities up to date. I want to extract a pattern like /[a-zA-Z0-9]*.LIB/, but only the second match found on each line. Look at how the two paths below differ:
/QSYS.LIB/KDBDFC1_5.LIB/AUTNOTMAIN.PGM
/DATADEV/QSYS.LIB/FPSENGDEV.LIB/AUTNOTMAIN.PGM
So, in this case I want KDBDFC1_5.LIB and FPSENGDEV.LIB, not QSYS.LIB.
I've tried to use gawk with the match() function and store my matches in an array, but it seems I cannot have a third parameter with match() "match() cannot have 3 arguments". Our version of gawk is 3.0.3. Yeah. I'm fooling around with perl, trying to make this work in a command line setting. Our version of perl is 5.8.7. Should your answer include some fancy new option in grep, you may also consider the QSH version of grep equally old, although there are the PASE utilities, if you know what those are.
I'm still banging on this one, but would appreciate any suggestions as I'm likely to develop a headache soon. :-)

You probably need the second-to-last segment. The following awk should work:
awk -F/ '{print $(NF-1)}' file
KDBDFC1_5.LIB
FPSENGDEV.LIB
Or this awk might work: it splits on .LIB and prints the second field:
awk -F'.LIB' '{print substr($2,2) FS}' file
KDBDFC1_5.LIB
FPSENGDEV.LIB

How about
perl -lne '@matches = /(\w+\.LIB)/g; print $matches[1] if @matches > 1' file

If match does not support array output, you could run the match twice, discarding the first match and printing the second:
$ awk '{p="[a-zA-Z0-9_]*.LIB"; sub(p,""); match($0,p); print substr($0,RSTART,RLENGTH)}' file
KDBDFC1_5.LIB
FPSENGDEV.LIB

Return the second occurrence of <word>.LIB:
perl -pe 's/^(?:.*?\.LIB).*?([\w_.]*.LIB).*$/\1/g' file
Return the last occurrence of <word>.LIB:
perl -pe 's/^(?:.*\.LIB).*?([\w_.]*.LIB).*$/\1/g' file
^ start of line
(?:.*\.LIB) non-capturing group ending in .LIB
.*? anything, non-greedy
([\w_.]*.LIB) first capturing group, <word>.LIB
.* anything, greedy
$ end of line

So ... after adding an underscore to the search regex, the following worked for me:
sed 's/.*\/\([[:alnum:]_]*\.LIB\).*/\1/' file
Of course, you could also do this with grep -o instead of complex regex rewrites:
grep -o '[[:alnum:]_]*\.LIB' file | awk 'NR%2==0'
The sed command uses only POSIX-compatible functionality, so it should be fine in OS/400; grep -o is a common extension rather than part of POSIX, so an old grep may not have it. That said, you're looking for this in awk, so:
awk '{sub(/.*QSYS\.LIB\//,""); sub(/\/.*/,"")}1' file
If you know that QSYS.LIB is the thing you're trying to avoid which may exist earlier on the line, then this might do. And if it really is the second of two .LIB files you want, this might do:
awk '{match($0,/[[:alnum:]_]+\.LIB/); s=substr($0,RSTART+RLENGTH); match(s,/[[:alnum:]_]+\.LIB/); print substr(s,RSTART,RLENGTH)}' file
Or, broken out for easier reading:
awk '{
match($0,/[[:alnum:]_]+\.LIB/);
s=substr($0,RSTART+RLENGTH);
match(s,/[[:alnum:]_]+\.LIB/);
print substr(s,RSTART,RLENGTH)
}' file
This uses only the plain-old-awk functions match() and substr() to (1) strip off everything up to and including the first .LIB and store the remainder of the line in a temporary variable, and (2) find the next .LIB inside that variable.
It has the advantage of not depending on any particular position of things -- i.e. it doesn't assume that the "interesting" file is immediately after the first one, or is the second last one on the line, etc.
That said, this is cumbersome, and anubhava's second solution is much more elegant. :-)
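The two-step match()/substr() idea above generalizes to "the Nth match" with a small loop. Here is a minimal sketch, assuming only two-argument match(), substr() and -v, which even old awks should have. The function name nth_lib is made up for illustration, and the character class is the one from the question plus an underscore.

```shell
# nth_lib N: print the Nth "<name>.LIB" component of each input line.
# (nth_lib is a hypothetical helper name, not an existing utility.)
nth_lib() {
  awk -v n="$1" '{
    s = $0
    for (i = 1; i <= n; i++) {
      # Fewer than n matches on this line: print nothing and move on.
      if (!match(s, /[A-Za-z0-9_]+\.LIB/)) next
      hit = substr(s, RSTART, RLENGTH)   # remember the current match
      s = substr(s, RSTART + RLENGTH)    # keep searching after it
    }
    print hit
  }'
}
```

Usage would be something like nth_lib 2 < file. Lines with fewer than N matches are skipped silently, which may or may not be what you want.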

Related

Deleting the un-matched portion using sed

I'm having a text file containing data in the following format:
2020-01-01 00:00:00 #gibberish - key1:{value1}, unwanted key2:{value2}, unwanted key3:{value3}
I wanted to collect the timestamp in the beginning and key-value pairs alone. Like the following
2020-01-01 00:00:00,key1:{value1},key2:{value2},key3:{value3}
I'm able to write a regex script that can select the required values (works in visual studio code)
^([0-9 :-]+)|([0-9A-z,_-]+):\{(.*?)\}
(first pattern selects the timestamp and second part selects the key-value pattern)
Now, how can I select the un-matched part and delete it using sed ?
Note: I tried using egrep to match the required pattern and writing it to a new file. But every matched string is written on a new line instead of maintaining on the same line. That is not useful to me.
egrep -o '^([0-9 :-]+)|([0-9A-z,_-]+):\{(.*?)\}' source.txt > target.txt
Going from last to first, I can comment that:
egrep: yes, that is the designed behavior - egrep is probably not what you want to use.
sed: it is important to note that sed uses POSIX regular expressions, which are simpler and much more limited than what people expect from regular expressions these days. Most of the new-style (enhanced, Perl-compatible, etc.) regular expression work in the last few decades was done in Perl, which is readily available on UNIX systems and is probably what you want to use (but also note that on macOS, like all Apple-distributed UNIX programs, the perl binary is pretty outdated. It will probably still do what you want, but be warned).
Your regular expression uses a range [A-z], which is weird and doesn't work in my egrep or sed. I understand what you want to do, but it shouldn't work on systems that actually use character sets (I'm not sure what Visual Studio is doing with this range, but it seems bonkers to me). You probably meant to use [A-Za-z].
I would have written this thing, using Perl, like so:
perl -nle '@res = (); while(m/^([0-9 :-]+\d)|([0-9A-Za-z,_-]+:\{[^}]+\})/g) {
push @res, "$1$2";
};
print join ",",@res' < source.txt > target.txt
With your shown samples, could you please try following. Written and tested in GNU awk in case you are ok with it.
awk '
match($0,/[0-9]{4}-[0-9]{2}-[0-9]{2}[[:space:]]+([0-9]{2}:){2}[0-9]{2}/){
val=""
printf("%s ",substr($0,RSTART,RLENGTH))
while(match($0,/key[0-9]+:{value[0-9]+}(,|$)/)){
val=(val?val OFS:"")substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
print val
}
' Input_file
This might work for you (GNU sed):
sed -E 's/\S+/\n&/3g;s#.*#echo "&"|sed "1b;/:{.*}/!d;s/, *$//"#e;s/ *\n/,/g' file
Split each line into lines of tokens (keeping the date and time as the first of these lines).
Remove any line (apart from the first) that does not contain the pattern :{...}.
Flatten the lines by replacing the introduced newlines by , separator.
sed -rn 's/([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}[[:space:]]([[:digit:]]{2}:){2}[[:digit:]]{2})(.*)(key1.*,)(.*)(key2.*,)(.*)(key3.*$)/\1,\4\6\8/p' <<< "2020-01-01 00:00:00 #gibberish - key1:{value1}, unwanted key2:{value2}, unwanted key3:{value3}"
Enable extended regular expressions with sed -r or -E, then split the string into 8 sections using parentheses. Replace the line with the 1st, 4th, 6th and 8th sections and print it.
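If GNU sed or GNU awk is not available, the same "keep only the matches, on one line" idea can be sketched with plain awk. This assumes the timestamp always occupies the first 19 characters (YYYY-MM-DD HH:MM:SS) and that values never contain a closing brace. The function name extract_pairs is made up for illustration.

```shell
# extract_pairs: print "timestamp,key:{value},key:{value},..." per input line.
# (extract_pairs is a hypothetical helper name; the fixed-width timestamp
# is an assumption based on the sample line in the question.)
extract_pairs() {
  awk '{
    out  = substr($0, 1, 19)    # assumed fixed-width timestamp
    rest = substr($0, 20)
    # Collect every key:{value} pair, discarding the text in between.
    while (match(rest, /[A-Za-z0-9_-]+:\{[^}]*\}/)) {
      out  = out "," substr(rest, RSTART, RLENGTH)
      rest = substr(rest, RSTART + RLENGTH)
    }
    print out
  }'
}
```

Run it as extract_pairs < source.txt > target.txt. Unlike the sed -rn version, it does not hard-code the key names or how many pairs a line has.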

bash regex for word with some suffixes but not one specific

I need (case-insensitive) all matches of several variations on a word--except one--including unknowns.
I want
accept
acceptance
acceptable
accepting
...but not "acception." A coworker used it when he meant "exception." A lot.
Since I can't anticipate the variations (or typos), I need to allow things like "acceptjunk" and "acceptMacarena"
I thought I could accomplish this with a negative lookahead, but this didn't give the results I needed
grep -iE '(?!acception)(accept[a-zA-Z]*)[[:space:]]' file
The trick is that I can accept (har) lines that contain "acception," provided that the other words match. For example this line is okay to match:
The acceptance of the inevitable is the acception.
...otherwise by now I'd have piped grep through grep -v and been done with it:
grep -iE '(accept)[a-zA-Z]*[[:space:]]' | grep -vi 'acception'
I've found some questions that are similar and many that are not quite so. Using a-zA-Z is likely unnecessary in grep -i but I'm flailing. I'm probably missing something small or basic...but I'm missing it nonetheless. What is it?
Thanks for reading.
PS: I'm not married to grep--but I am operating in bash--so if there's a magic awk command that would do this I'm all ears (eyes).
PPS: forgot to mention that on https://regex101.com/ the above lookahead seemed to work, but it doesn't with my full grep command.
To use lookarounds, you need GNU grep with PCRE available
grep -iP '(?!acception)(accept[a-z]*)[[:space:]]'
With awk, this might work
awk '{ip=$0; gsub(/acception/, ""); if(/accept[a-zA-Z]*[[:space:]]/) print ip}'
ip=$0 save input line
gsub(/acception/, "") remove every occurrence of the unwanted word; other unwanted words can be added with alternation
if(/accept[a-zA-Z]*[[:space:]]/) print ip then print the line if it still contains words being searched
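The awk approach above is case-sensitive, while the question asked for case-insensitive matching. One way to handle that is to test a lowercased copy of the line and print the original. This is a rough sketch: the function name accept_lines is made up, and it deliberately loosens the "followed by whitespace" rule to "followed by a non-letter or end of line", which is an assumption about what is wanted.

```shell
# accept_lines: print lines containing an "accept..." word other than "acception",
# ignoring case. (accept_lines is a hypothetical helper name.)
accept_lines() {
  awk '{
    line = tolower($0)                  # work on a lowercased copy
    gsub(/acception/, "", line)         # drop every unwanted word first
    # Anything starting with "accept" that survives counts as a hit.
    if (line ~ /accept[a-z]*([^a-z]|$)/) print   # prints the original line
  }'
}
```

For example, accept_lines < file keeps "The acceptance of the inevitable is the acception." but drops a line whose only hit is "Acception".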

Sed dynamic backreference replacement

I am trying to use sed for transforming wikitext into latex code. I am almost done, but I would like to automate the generation of the labels of the figures like this:
[[Image(mypicture.png)]]
... into:
\includegraphics{mypicture.png}\label{img-1}
For what I would like to keep using sed. The current regex and bash code I am using is the following:
__tex_includegraphics="\\\\includegraphics[width=0.95\\\\textwidth]{$__images_dir\/"
__tex_figure_pre="\\\\begin{figure}[H]\\\\centering$__tex_includegraphics"
__tex_figure_post="}\\\\label{img-$__images_counter}\\\\end{figure}"
sed -e "s/\[\[Image(\([^)]*\))\]\].*/$__tex_figure_pre\1$__tex_figure_post/g"\
... but I cannot make that counter to be increased. Any ideas?
Within a more general perspective, my question would be the following: can I use a backreference in sed for creating a replacement that is different for each of the matches of sed? This is, each time sed matches the pattern, can I use \1 as the input of a function and use the result of this function as the replacement?
I know it is a tricky question and I might have to use AWK for this. However, if somebody has a solution, I would appreciate his or her help.
This might work for you (GNU sed):
sed -r ':a;/'"$PATTERN"'/{x;/./s/.*/echo $((&+1))/e;/./!s/^/1/;x;G;s/'"$PATTERN"'(.*)\n(.*)/'"$PRE"'\2'"$POST"'\1/;ba}' file
This looks for a PATTERN contained in a shell variable and, if it is not present, prints the current line. If the pattern is present, it increments or primes the counter in the hold space and then appends that counter to the current line. The pattern is then replaced using the shell variables PRE and POST and the counter. Lastly, the current line is checked for further cases of the pattern and the procedure is repeated if necessary.
You could read the file line-by-line using shell features, and use a separate sed command for each line. Something like
exec 0<input_file
while read line; do
echo $line | sed -e "s/\[\[Image(\([^)]*\))\]\].*/$__tex_figure_pre\1$__tex_figure_post/g"
__images_counter=$(expr $__images_counter + 1)
done
(This won't work if there are multiple matches in a line, though.)
For the second part, my best idea is to run sed or grep to find what is being matched, and then run sed again with the value of the function of the matched text substituted into the command.
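Since the question mentions AWK as a fallback, here is a minimal sketch of a per-match counter in plain awk, which also copes with several images on one line. It emits the simplified \includegraphics{...}\label{img-N} form from the question rather than the full figure environment, and it keeps the text after the match instead of dropping it, both of which are assumptions. The function name label_images is made up.

```shell
# label_images: replace each [[Image(name)]] with
# \includegraphics{name}\label{img-N}, where N counts matches across the input.
# (label_images is a hypothetical helper name.)
label_images() {
  awk '{
    out  = ""
    rest = $0
    while (match(rest, /\[\[Image\([^)]*\)\]\]/)) {
      # "[[Image(" is 8 characters and ")]]" is 3, so the name sits in between.
      name = substr(rest, RSTART + 8, RLENGTH - 11)
      out  = out substr(rest, 1, RSTART - 1) "\\includegraphics{" name "}\\label{img-" (++n) "}"
      rest = substr(rest, RSTART + RLENGTH)
    }
    print out rest
  }'
}
```

The counter n lives for the whole awk run, so the numbering continues from line to line, which is the part that is awkward in sed.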

How to print only matches with sed?

Okay, this is an easy one, but I can't figure it out.
Basically I want to extract all links ([^<>]*) from a big html file.
I tried to do this with sed, but I get all kinds of results, just not what I want. I know that my regexp is correct, because I can replace all the links in a file:
sed 's_[^<>]*_TEST_g'
If I run that on something like
<div>A google link</div>
<div>A google link</div>
I get
<div>TEST</div>
<div>TEST</div>
How can I get rid of everything else and just print the matches instead? My preferred end result would be:
A google link
A google link
PS. I know that my regexp is not the most flexible one, but it's enough for my intentions.
Match the whole line, put the interesting part in a group, replace by the content of the group. Use the -n option to suppress non-matching lines, and add the p modifier to print the result of the s command.
sed -n -e 's!^.*\(<[Aa] [^<>]*>.*</[Aa]>\).*$!\1!p'
Note that if there are multiple links on the line, this only prints the last link. You can improve on that, but it goes beyond simple sed usage. The simplest method is to use two steps: first insert a newline between any two links, then extract the links.
sed -n -e 's!</a>!&\n!gp' | sed -n -e 's!^.*\(<[Aa] [^<>]*>.*</[Aa]>\).*$!\1!p'
This still doesn't handle HTML comments, <pre>, links that are spread over several lines, etc. When parsing HTML, use an HTML parser.
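For the simple case, multiple links per line can also be printed one per line with an awk loop instead of the two-step sed. This is only a sketch, assuming each link sits on a single line and the link text contains no nested tags. The function name print_links is made up.

```shell
# print_links: print every <a ...>text</a> on its own line.
# (print_links is a hypothetical helper name; nested tags inside the link
# text and links spanning several lines are not handled.)
print_links() {
  awk '{
    rest = $0
    while (match(rest, /<[Aa] [^<>]*>[^<>]*<\/[Aa]>/)) {
      print substr(rest, RSTART, RLENGTH)
      rest = substr(rest, RSTART + RLENGTH)   # continue after this link
    }
  }'
}
```

The same caveat applies as above: this is pattern matching, not HTML parsing.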
If you don't mind using perl like sed, it can cope with very diverse input:
perl -n -e 's+(<a href=.*?</a>)+ print $1, "\n" +eg;'
Assuming that there is only one hyperlink per line the following may work...
sed -e 's_.*<a href=_<a href=_' -e 's_>.*_>_'
This might work for you (GNU sed):
sed '/<a href\>/!d;s//\n&/;s/[^\n]*\n//;:a;$!{/>/!{N;ba}};y/\n/ /;s//&\n/;P;D' file

Regular expression with sed

I'm having a hard time selecting from a file using a regular expression. I'm trying to replace a specific piece of text in a file which is full of lines like this:
/home/user/test2/data/train/train38.wav /home/user/test2/data/train/train38.mfc
I'm trying to replace the bolded text. The problem is that I don't know how to select only the bolded text, since I need to use .wav in my regexp, and the filename and the location of the file are also going to be different.
Hope you can help
Best regards,
Jökull
This assumes that what you want to replace is the string between the last two slashes in the first path.
sed 's|\([^/]*/\)[^/]*\(/[^/]* .*\)|\1FOO\2|' filename
produces:
/home/user/test2/data/FOO/train38.wav /home/user/test2/data/train/train38.mfc
sed processes lines one at a time, so you can omit the global option and it will only change the first 'train' on each line
sed 's/train/FOO/' testdat
vs
sed 's/train/FOO/g' testdat
which is a global replace
This is quite a bit more readable and less error-prone than some of the other possibilities, but of course there are applications which will not simplify quite as readily.
sed 's;\(\(/[^/]\+\)*\)/train\(\(/[^/]\+\)*\)\.wav;\1/FOO\3.wav;'
You can do it like this
sed -e 's/\<train\>/plane/g'
The \< tells sed to match the beginning of the word and the \> tells it to match the end of the word. Note that \< and \> are GNU extensions and may not work in other seds.
The g at the end means global so it performs the match and replace on the entire line and does not stop after the first successful match as it would normally do without g.
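If the goal is "replace whatever directory directly contains the .wav file", without hard-coding the name train, a sketch that anchors on the .wav extension could look like this. It assumes paths contain no spaces and that the replacement text contains no |, & or backslash. The function name replace_wav_dir is made up.

```shell
# replace_wav_dir NEW: replace the directory that directly contains the
# .wav file with NEW, leaving the rest of the line alone.
# (replace_wav_dir is a hypothetical helper name.)
replace_wav_dir() {
  # /[^/ ]*              the directory component to replace
  # \(/[^/ ]*\.wav\)     the file name, captured and put back as \1
  sed "s|/[^/ ]*\(/[^/ ]*\.wav\)|/$1\1|"
}
```

Called as replace_wav_dir FOO < filename, it only uses basic regular expressions, so it should not depend on GNU sed.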