Deleting the un-matched portion using sed - regex

I have a text file containing data in the following format:
2020-01-01 00:00:00 #gibberish - key1:{value1}, unwanted key2:{value2}, unwanted key3:{value3}
I want to collect only the timestamp at the beginning and the key-value pairs, like this:
2020-01-01 00:00:00,key1:{value1},key2:{value2},key3:{value3}
I was able to write a regex that selects the required values (it works in Visual Studio Code):
^([0-9 :-]+)|([0-9A-z,_-]+):\{(.*?)\}
(first pattern selects the timestamp and second part selects the key-value pattern)
Now, how can I select the un-matched part and delete it using sed ?
Note: I tried using egrep to match the required pattern and writing it to a new file. But every matched string is written on a new line instead of maintaining on the same line. That is not useful to me.
egrep -o '^([0-9 :-]+)|([0-9A-z,_-]+):\{(.*?)\}' source.txt > target.txt

Going from last to first, I can comment that:
egrep: yes, that is the designed behavior - egrep is probably not what you want to use.
sed: it is important to note that sed uses POSIX regular expressions, which are simpler and much more limited than what people expect from regular expressions these days. Most of the new-style (enhanced, Perl-compatible, etc.) regular expression work of the last few decades was done in Perl, which is readily available on UNIX systems and is probably what you want to use (but note that on macOS, as with all UNIX programs distributed by Apple, the perl binary is pretty outdated. It will probably still do what you want, but be warned).
Your regular expression uses the range [A-z], which is weird and doesn't work in my egrep or sed. I understand what you want to do, but it shouldn't work in systems that actually use character sets (I'm not sure what Visual Studio is doing with this range, but it seems bonkers to me). You probably meant to use [A-Za-z].
I would have written this thing, using Perl, like so:
perl -nle '@res = (); while(m/^([0-9 :-]+\d)|([0-9A-Za-z,_-]+:\{[^}]+\})/g) {
push @res, "$1$2";
};
print join ",",@res' < source.txt > target.txt
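To illustrate the [A-z] point: in the C locale, where ranges follow ASCII order, A-z also spans the punctuation characters that sit between Z and a (such as [, ^ and _). A small check, assuming a grep that accepts the range under LC_ALL=C; the names samples, loose and strict are only for illustration:

```shell
# Count how many of these one-character lines each bracket expression matches.
# LC_ALL=C forces ASCII ordering; other locales may reject or reinterpret A-z.
samples() { printf '%s\n' '_' '^' '[' '5'; }

# grep -c exits non-zero when the count is 0, hence the "|| true".
loose=$(samples | LC_ALL=C grep -c '[A-z]' || true)      # matches _ ^ [
strict=$(samples | LC_ALL=C grep -c '[A-Za-z]' || true)  # matches none of them

echo "A-z: $loose, A-Za-z: $strict"
```

So a key-name class written with A-z would quietly accept characters such as ^ and [, which is rarely what is intended.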

With your shown samples, could you please try following. Written and tested in GNU awk in case you are ok with it.
awk '
match($0,/[0-9]{4}-[0-9]{2}-[0-9]{2}[[:space:]]+([0-9]{2}:){2}[0-9]{2}/){
val=""
printf("%s ",substr($0,RSTART,RLENGTH))
while(match($0,/key[0-9]+:{value[0-9]+}(,|$)/)){
val=(val?val OFS:"")substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
print val
}
' Input_file

This might work for you (GNU sed):
sed -E 's/\S+/\n&/3g;s#.*#echo "&"|sed "1b;/:{.*}/!d;s/, *$//"#e;s/ *\n/,/g' file
Split each line into lines of tokens (keeping the date and time together as the first of these lines).
Remove any line (apart from the first) that does not contain the pattern :{...}.
Flatten the lines by replacing the introduced newlines by , separator.

sed -rn 's/([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}[[:space:]]([[:digit:]]{2}:){2}[[:digit:]]{2})(.*)(key1.*,)(.*)(key2.*,)(.*)(key3.*$)/\1,\4\6\8/p' <<< "2020-01-01 00:00:00 #gibberish - key1:{value1}, unwanted key2:{value2}, unwanted key3:{value3}"
Enable extended regular expressions with sed -r or -E, then split the string into 8 sections using parentheses. Replace the line with the 1st, 4th, 6th and 8th sections and print it.
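For the literal question of deleting the un-matched part with sed, one possible sketch is to mark every wanted pair with a newline (which cannot otherwise occur inside a line) and then delete everything that is not the timestamp or a marked pair. This assumes GNU sed, and keep_pairs is a hypothetical helper name, not something from the answers above:

```shell
# Hypothetical helper: keep the leading timestamp and every key:{value} pair.
# Assumes GNU sed (-E, \n in the replacement, and \n inside bracket expressions).
keep_pairs() {
    sed -E '
        # 1. Put a newline marker in front of every key:{value} pair.
        s/[[:alnum:]_-]+:\{[^}]*\}/\n&/g
        # 2. Drop whatever sits between the timestamp and the first marker.
        s/^([0-9-]+ [0-9:]+)[^\n]*/\1/
        # 3. Drop whatever follows each pair, up to the next marker.
        s/(\n[^{]*\{[^}]*\})[^\n]*/\1/g
        # 4. Turn the markers into comma separators.
        s/\n/,/g
    ' "$@"
}

echo '2020-01-01 00:00:00 #gibberish - key1:{value1}, unwanted key2:{value2}, unwanted key3:{value3}' |
    keep_pairs
```

Unlike the fixed key1/key2/key3 expression, this does not depend on the key names or on how many pairs a line has, but it does assume values never contain a closing brace.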

Related

Get list of strings between certain strings in bash

Given a text file (.tex) which may contain strings of the form "\cite{alice}", "\cite{bob}", and so on, I would like to write a bash script that stores the content within brackets of each such string ("alice" and "bob") in a new text file (say, .txt).
In the output file I would like to have one line for each such content, and I would also like to avoid repetitions.
Attempts:
I thought about combining grep and cut.
From other questions and answers that I have seen on Stack Exchange I think that (modulo reading up on cut a bit more) I could manage to get at least one such content per line, but I do not know how to get all occurrences on a single line if there are several such strings in it, and I have not seen any question or answer giving hints in this direction.
I have tried using sed as well. Yesterday I read this guide to see if I was missing some basic sed command, but I did not see any straightforward way to do what I want (the guide did mention that sed is Turing complete, so I am sure there is a way to do this only with sed, but I do not see how).
What about:
grep -oP '(?<=\\cite{)[^}]+(?=})' sample.tex | sort -u > cites.txt
-P with GNU grep interprets the regexp as a Perl-compatible one (for lookbehind and lookahead groups)
-o "prints only the matched (non-empty) parts of a matching line, with each such part on a separate output line" (see manual)
The regexp matches a curly-brace-free text preceded by \cite{ (positive lookbehind group (?<=\\cite{)) and followed by a right curly brace (positive lookahead group (?=})).
sort -u sorts and removes duplicates
For more details about lookahead and lookbehind groups, see Regular-Expressions.info dedicated page.
You can use grep -o and postprocess its output:
grep -o '\\cite{[^{}]*}' file.tex |
sed 's/\\cite{\([^{}]*\)}/\1/'
If there can only ever be a single \cite on an input line, just a sed script suffices.
sed -n 's/.*\\cite{\([^{}]*\)}.*/\1/p' file.tex
(It's by no means impossible to refactor this into a script which extracts multiple occurrences per line; but good luck understanding your code six weeks from now.)
As usual, add sort -u to remove any repetitions.
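Put together, the grep -o, sed and sort -u steps could look like the sketch below. The function name extract_cites and the sample text are made up for illustration:

```shell
# Hypothetical helper: list each distinct \cite{...} key once, one per line.
# With no file arguments it reads standard input.
extract_cites() {
    grep -o '\\cite{[^{}]*}' "$@" |      # one \cite{...} per output line
        sed 's/\\cite{\([^{}]*\)}/\1/' | # keep only the text inside the braces
        sort -u                          # sort and drop repetitions
}

printf '%s\n' 'See \cite{alice} and \cite{bob}.' 'Again \cite{alice}.' |
    extract_cites
```

Because grep -o already emits every match on its own line, several citations on one input line need no special handling in the sed step.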
Here's a brief Awk attempt:
awk -v RS='\\' '/^cite\{/ {
split($0, g, /[{}]/)
cite[g[2]]++ }
END { for (cit in cite) print cit }' file.tex
This conveniently does not print any duplicates, and trivially handles multiple citations per line.

SED not updating with complex regex

I'm trying to automate updating the version number in a file as part of build process. I can get the following to work, but only for version numbers with single digits in each of the Major/minor/fix positions.
sed -i 's/version="[0-9]\.[0-9]\.[0-9]"/version="2.4.567"/g' projectConfig.xml
I've tried a more complex regex pattern and it works in the MS Regular Xpression Tool, but won't match when running sed.
sed -i 's/version="\b\d{1,3}\.\d{1,3}\.\d{1,3}\b"/version="2.4.567"/g' projectConfig.xml
Example Input:
This is a file at version="2.1.245" and it consists of much more text.
Desired output
This is a file at version="2.4.567" and it consists of much more text.
I feel that there is something that I'm missing.
There are 3 problems:
To enable quantifiers ({}) in sed you need the -E / --regexp-extended switch (or use \{\}, see http://www.gnu.org/software/sed/manual/html_node/Regular-Expressions.html#Regular-Expressions)
sed does not support the \d shorthand; use [[:digit:]] (or [0-9]) instead.
Your input does not quote the version in ".
sed 's/version=\b[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\b/version="2.4.567"/g' \
<<< "This is a file at version=2.1.245 and it consists of much more text."
To stay more portable, you might want to use the --posix switch (which requires removing \b):
sed --posix 's/version=[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}/version="2.4.567"/g' \
<<< "This is a file at version=2.1.245 and it consists of much more text."
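For input that does carry the quotes, as in the example line of the question, a variant with -E might look like this. It is only a sketch: set_version is a hypothetical helper, and it assumes the new version string contains nothing special to sed (no /, & or backslash):

```shell
# Hypothetical helper: set every version="x.y.z" attribute to the given value.
# -E enables the {1,3} quantifiers; [0-9] stands in for the unsupported \d.
set_version() {
    sed -E 's/version="[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"/version="'"$1"'"/g'
}

echo 'This is a file at version="2.1.245" and it consists of much more text.' |
    set_version 2.4.567
```

With GNU sed, the same expression can be used with -i and a file name to edit projectConfig.xml in place, as in the original command.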

How to extract the second pattern from a line of text?

Let me preface my question with the fact that I am doing this on an AS/400, and IBM really sucks at keeping their utilities up to date. I want to extract a pattern like /[a-zA-Z0-9]*.LIB/ but the second match that is found. Look at how the two paths below differ:
/QSYS.LIB/KDBDFC1_5.LIB/AUTNOTMAIN.PGM
/DATADEV/QSYS.LIB/FPSENGDEV.LIB/AUTNOTMAIN.PGM
So, in this case I want KDBDFC1_5.LIB and FPSENGDEV.LIB, not QSYS.LIB.
I've tried to use gawk with the match() function and store my matches in an array, but it seems I cannot have a third parameter with match() "match() cannot have 3 arguments". Our version of gawk is 3.0.3. Yeah. I'm fooling around with perl, trying to make this work in a command line setting. Our version of perl is 5.8.7. Should your answer include some fancy new option in grep, you may also consider the QSH version of grep equally old, although there are the PASE utilities, if you know what those are.
I'm still banging on this one, but would appreciate any suggestions as I'm likely to develop a headache soon. :-)
You probably need the second-to-last segment. The following awk should work:
awk -F/ '{print $(NF-1)}' file
KDBDFC1_5.LIB
FPSENGDEV.LIB
Or probably this awk would work by searching for .LIB and print 2nd field:
awk -F'.LIB' '{print substr($2,2) FS}' file
KDBDFC1_5.LIB
FPSENGDEV.LIB
How about
perl -lne '@matches = /(\w+\.LIB)/g; print $matches[1] if @matches > 1' file
If match does not support array output, you could run matching twice, discarding the first match, and printing the second:
$ awk '{p="[a-zA-Z0-9_]*.LIB"; sub(p,""); match($0,p); print substr($0,RSTART,RLENGTH)}' file
KDBDFC1_5.LIB
FPSENGDEV.LIB
Return the second occurrence of <word>.LIB:
perl -pe 's/^(?:.*?\.LIB).*?([\w.]*\.LIB).*$/\1/g' file
Return the last occurrence of <word>.LIB:
perl -pe 's/^(?:.*\.LIB).*?([\w.]*\.LIB).*$/\1/g' file
^ start of line
(?:.*\.LIB) non-capturing group that ends in .LIB
.*? anything, non-greedy
([\w.]*\.LIB) first capturing group, <word>.LIB
.* anything, greedy
$ end of line
So ... after adding an underscore to the search regex, the following worked for me:
sed 's/.*\/\([[:alnum:]_]*\.LIB\).*/\1/' file
Of course, you could also do this with grep -o instead of complex regex rewrites:
grep -o '[[:alnum:]_]*\.LIB' file | awk 'NR%2==0'
These use only POSIX-compatible functionality, so they should be fine in OS/400. That said, you're looking for this in awk, so:
awk '{sub(/.*QSYS\.LIB\//,""); sub(/\/.*/,"")}1' file
If you know that QSYS.LIB is the thing you're trying to avoid which may exist earlier on the line, then this might do. And if it really is the second of two .LIB files you want, this might do:
awk '{match($0,/[[:alnum:]_]+\.LIB/); s=substr($0,RSTART+RLENGTH); match(s,/[[:alnum:]_]+\.LIB/); print substr(s,RSTART,RLENGTH)}' file
Or, broken out for easier reading:
awk '{
match($0,/[[:alnum:]_]+\.LIB/);
s=substr($0,RSTART+RLENGTH);
match(s,/[[:alnum:]_]+\.LIB/);
print substr(s,RSTART,RLENGTH)
}' file
This uses only the plain-old-awk functions match() and substr() to (1) strip everything up to and including the first .LIB from the line and store the remainder in a temporary variable, and (2) find the next .LIB inside that variable.
It has the advantage of not depending on any particular position of things -- i.e. it doesn't assume that the "interesting" file is immediately after the first one, or is the second last one on the line, etc.
That said, this is cumbersome, and anubhava's second solution is much more elegant. :-)
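The match()/substr() approach extends naturally to "the n-th occurrence" by repeating the strip-and-search step in a loop. A minimal sketch, where nth_lib is a hypothetical helper and the plain bracket list [A-Za-z0-9_] is used in case an old awk lacks POSIX character classes:

```shell
# Hypothetical helper: print the n-th <word>.LIB on each input line ($1 = n).
# Uses only match(), substr(), RSTART and RLENGTH.
nth_lib() {
    awk -v n="$1" '{
        s = $0; hit = ""
        for (i = 1; i <= n; i++) {
            # Stop early if the line has fewer than n occurrences.
            if (!match(s, /[A-Za-z0-9_]+\.LIB/)) { hit = ""; break }
            hit = substr(s, RSTART, RLENGTH)   # remember this occurrence
            s = substr(s, RSTART + RLENGTH)    # keep searching after it
        }
        if (hit != "") print hit
    }'
}

printf '%s\n' '/QSYS.LIB/KDBDFC1_5.LIB/AUTNOTMAIN.PGM' \
              '/DATADEV/QSYS.LIB/FPSENGDEV.LIB/AUTNOTMAIN.PGM' |
    nth_lib 2
```

Lines with fewer than n matches print nothing, which may or may not be the behavior you want.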

Is there an alternative to negative look ahead in sed

In sed I would like to be able to match /js/ but not /js/m. I cannot do /js/[^m] because that would match /js/ plus whatever character comes after. Negative lookahead does not work in sed, or I would have done /js/(?!m) and called it a day. Is there a way to achieve this with sed that would work for most similar situations where you want a section of text that does not end in another section of text?
Is there a better tool for what I am trying to do than sed? Possibly one that allows look ahead. awk seems a bit too much with its own language.
Well you could just do this:
$ echo 'I would like to be able to match /js/ but not /js/m' |
sed 's:#:#A:g; s:/js/m:#B:g; s:/js/:<&>:g; s:#B:/js/m:g; s:#A:#:g'
I would like to be able to match </js/> but not /js/m
You didn't say what you wanted to do with /js/ when you found it so I just put <> around it. That will work on all UNIX systems, unlike a perl solution since perl isn't guaranteed to be available and you're not guaranteed to be allowed to install it.
The approach I use above is a common idiom in sed, awk, etc. to create strings that can't be present in the input. It doesn't matter what character you use for # as long as it's not present in the string or regexp you're really interested in, which in the above is /js/. s/#/#A/g ensures that every occurrence of # in the input is followed by A. So now when I do s/foobar/#B/g I have replaced every occurrence of foobar with #B and I KNOW that every #B represents foobar because all other #s are followed by A. So now I can do s/foo/whatever/ without tripping over foo appearing within foobar. Then I just unwind the initial substitutions with s/#B/foobar/g; s/#A/#/g.
In this case though since you aren't using multi-line hold-spaces you can do it more simply with:
sed 's:/js/m:\n:g; s:/js/:<&>:g; s:\n:/js/m:g'
since there can't be newlines in a newline-separated string. The above will only work in seds that support use of \n to represent a newline (e.g. GNU sed) but for portability to all seds it should be:
sed 's:/js/m:\
:g; s:/js/:<&>:g; s:\
:/js/m:g'
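To see why the initial escaping step of the placeholder idiom matters, compare the full sequence with a shortcut that skips it, on an input that already happens to contain the placeholder text. The sample input and the variable names are made up for illustration:

```shell
# Input that already contains the two characters used as a placeholder.
input='tag #B and /js/ and /js/m'

# Full idiom: escape every # first, so the literal #B in the input survives.
safe=$(printf '%s\n' "$input" |
    sed 's:#:#A:g; s:/js/m:#B:g; s:/js/:<&>:g; s:#B:/js/m:g; s:#A:#:g')

# Shortcut without the escaping: the literal #B is mistaken for a placeholder.
unsafe=$(printf '%s\n' "$input" |
    sed 's:/js/m:#B:g; s:/js/:<&>:g; s:#B:/js/m:g')

printf '%s\n' "$safe" "$unsafe"
```

The first result keeps the original #B intact, while the second turns it into /js/m, which is exactly the corruption the extra substitution is there to prevent.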

Sed dynamic backreference replacement

I am trying to use sed for transforming wikitext into latex code. I am almost done, but I would like to automate the generation of the labels of the figures like this:
[[Image(mypicture.png)]]
... into:
\includegraphics{mypicture.png}\label{img-1}
I would like to keep using sed for this. The current regex and bash code I am using is the following:
__tex_includegraphics="\\\\includegraphics[width=0.95\\\\textwidth]{$__images_dir\/"
__tex_figure_pre="\\\\begin{figure}[H]\\\\centering$__tex_includegraphics"
__tex_figure_post="}\\\\label{img-$__images_counter}\\\\end{figure}"
sed -e "s/\[\[Image(\([^)]*\))\]\].*/$__tex_figure_pre\1$__tex_figure_post/g"\
... but I cannot get that counter to increase. Any ideas?
From a more general perspective, my question would be the following: can I use a backreference in sed to create a replacement that is different for each of the matches? That is, each time sed matches the pattern, can I use \1 as the input of a function and use the result of this function as the replacement?
I know it is a tricky question and I might have to use AWK for this. However, if somebody has a solution, I would appreciate his or her help.
This might work for you (GNU sed):
sed -r ':a;/'"$PATTERN"'/{x;/./s/.*/echo $((&+1))/e;/./!s/^/1/;x;G;s/'"$PATTERN"'(.*)\n(.*)/'"$PRE"'\2'"$POST"'\1/;ba}' file
This looks for a PATTERN contained in a shell variable and, if it is not present, prints the current line unchanged. If the pattern is present, it increments or primes the counter in the hold space and then appends said counter to the current line. The pattern is then replaced using the shell variables PRE and POST and the counter. Lastly, the current line is checked for further occurrences of the pattern and the procedure is repeated if necessary.
You could read the file line-by-line using shell features, and use a separate sed command for each line. Something like
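exec 0<input_file
while IFS= read -r line; do
# Rebuild the suffix on every pass so it picks up the current counter value.
__tex_figure_post="}\\\\label{img-$__images_counter}\\\\end{figure}"
printf '%s\n' "$line" | sed -e "s/\[\[Image(\([^)]*\))\]\].*/$__tex_figure_pre\1$__tex_figure_post/g"
# Advance the counter only for lines that actually contained an image.
case $line in *'[[Image('*) __images_counter=$((__images_counter + 1)) ;; esac
done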
(This won't work if there are multiple matches in a line, though.)
For the second part, my best idea is to run sed or grep to find what is being matched, and then run sed again with the value of the function of the matched text substituted into the command.
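Since the question itself mentions awk as a fallback, here is a minimal sketch of how a per-match counter might look there. It is not the asker's full figure markup: number_images is a hypothetical helper, the replacement is reduced to the \includegraphics{...}\label{img-N} form from the question, and, unlike the sed expression, text after the image tag is kept:

```shell
# Hypothetical helper: replace each [[Image(name)]] with
# \includegraphics{name}\label{img-N}, where N counts matches across the input.
number_images() {
    awk '{
        out = ""
        while (match($0, /\[\[Image\([^)]*\)\]\]/)) {
            # "[[Image(" is 8 characters and ")]]" is 3, so the name sits between.
            name = substr($0, RSTART + 8, RLENGTH - 11)
            out = out substr($0, 1, RSTART - 1) \
                  "\\includegraphics{" name "}\\label{img-" (++n) "}"
            $0 = substr($0, RSTART + RLENGTH)   # continue after this match
        }
        print out $0
    }' "$@"
}

printf '%s\n' 'a [[Image(x.png)]] b [[Image(y.png)]]' '[[Image(mypicture.png)]]' |
    number_images
```

The counter n lives for the whole run, so multiple images on one line and images on later lines each get their own number.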