Match inner pattern. Multiline - regex

I have:
%{ lorem ipsum dolor
sit %{hello
world}%
amet}%
I want:
hello
world
That is, I want to keep the inner %{...}% of any number of nesting %{...}%s that may or may not span multiple lines.
Is there a sed or awk way?

This sed command:
sed -n -r 'H; ${g; s/([^}]|\}[^%])*%\{//; s/\}%([^%]|%[^{])*//; p}'
will gather the entirety of the input into the pattern space, then remove ...%{ (taking care to ensure that the ... doesn't contain }%) and }%... (taking care to ensure that the ... doesn't contain %{), and then print the result. So it's suitable for the case where you need just one block. The case with multiple blocks is trickier, but I'll think about it further, and update this answer if I get that working well.
Note that -r (to support Extended Regular Expressions, instead of Basic ones) is a GNU extension to sed, so if you're using a non-GNU sed that doesn't support it, let me know.
Edited to add: O.K., here's a version that supports multiple blocks:
sed -n -r 'H; ${g; s/^([^}]|\}[^%])*%\{//; s/\}%([^%]|%[^{])*$//; s/\}%([^%]|%[^{])*([^}]|\}[^%])*%\{/\n/g; p}'
It uses essentially the same approach as the previous, except that it only removes ...%{ at start-of-input and }%... at end-of-input, and that after it's done that, it proceeds to remove all instances of }%...%{ that do not contain %{...}%, replacing them with a newline.

The AWK way:
gawk '
/%{/ {
match($0,/%{.*/)
text=substr($0,RSTART+2,RLENGTH-2)
}
!/%{/ && !/}%/ {
text=text "\n" $0
}
/}%/ {
match($0,/}%/)
text=text "\n" substr($0,1,RSTART-1)
print text
exit
}'
This won't work if there is more than one %{ or }% on the same line. In that case a minor modification is needed: pass an array as the third argument of match (a gawk extension) to capture each occurrence.
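One way around the one-marker-per-line limitation is to slurp the whole input, cut at each }%, and keep the text after the last %{ that precedes it. The following is only a sketch in plain awk: the name inner_blocks is made up for illustration, and it assumes the markers are never escaped or quoted.

```shell
# inner_blocks (hypothetical helper): print the text of every innermost %{...}% block
inner_blocks() {
  awk '
    { buf = buf $0 "\n" }                       # gather all input lines into one string
    END {
      while ((end_pos = index(buf, "}%")) > 0) {
        head = substr(buf, 1, end_pos - 1)      # everything before the first }%
        start_pos = 0; offset = 0; rest = head
        while ((i = index(rest, "%{")) > 0) {   # locate the last %{ inside head
          start_pos = offset + i
          offset += i + 1
          rest = substr(head, offset + 1)
        }
        if (start_pos > 0)                      # a }% without an opener closes an outer block: skip it
          print substr(head, start_pos + 2)
        buf = substr(buf, end_pos + 2)          # continue scanning after this }%
      }
    }'
}

# The example from the question prints "hello" and "world" on two lines
printf '%%{ lorem ipsum dolor\nsit %%{hello\nworld}%%\namet}%%\n' | inner_blocks
```

Because index does plain string searches, no regex escaping of %, { or } is needed, and several blocks on one line are handled the same way as blocks spanning lines.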

One possible TXR way:
Simply scan the input freeform (as one big line) collecting matches for a regular expression into the variable wanted which gets implicitly collected into a list called wanted.
Then spit out the pieces, chopping two characters from the head and tail of each.
$ txr -c '@(freeform)
@(coll)@{wanted /\%{(~(.*(\%{|}\%).*))}\%/}@(end)
@(output)
@(rep)@{wanted [2..-2]}@(end)
@(end)' -
asdf asdf %{
%{ asdf
asdf
}% %{boo}% }%
[Ctrl-D][Enter]
asdf
asdf
boo
The regex ~ operator means complement. The variable wanted captures text which consists of %{ followed by the longest matching string which does not contain %{ or }% as a substring, followed by }%. TXR regex supports complement, intersection and difference. We have to write \% because % is the non-greedy zero-or-more operator.
The output for the example given in the question is:
hello
world
rather than
hello
world
The author didn't clarify whether that is really needed. It complicates the problem because %{hello occurs somewhere in the middle of the line, so we must know the column position of the h in hello in order to know that the w in world is two spaces over.

Related

Bash Script for Concatenating Broken Dashed Words

I've scraped a large amount (10GB) of PDFs and converted them to text files, but due to the format of the original PDFs, there is an issue:
Many of the words which break across lines have a dash in them that artificially breaks up the word, like this:
You can see that this happened because the original PDFs files have breaks:
What would be the cleanest and fastest way to "join" every word instance that matches this pattern inside of a .txt file?
Perhaps some sort of Regex search, like for a [a-z]\-\s \w of some kind (word character followed by dash followed by space) would work?
Or would some sort of sed replacement work better?
Currently, I'm trying to get a sed regex to work, but I'm not sure how to translate this to use capture groups to replace the selected text:
sed -n '\%\w\- [a-z]%p' Filename.txt
My input text would look like this:
The dog rolled down the st- eep hill and pl- ayed outside.
And the output would be:
The dog rolled down the steep hill and played outside.
Ideally, the expression would also work for words split up by a newline, like this:
The rule which provided for the consid-
eration of the resolution, was agreed to earlier by a
To this:
The rule which provided for the consideration
of the resolution, was agreed to earlier by a
It's straightforward in sed:
sed -e ':a' -e '/-$/{N;s/-\n//;ba
}' -e 's/- //g' filename
This translates roughly as: "If the line ends with a dash, read in the next line as well (so that the pattern space holds two lines with a newline between them), then excise the dash and the newline, and loop back to the beginning in case the joined line also ends with a dash. Then remove every remaining dash that is followed by a space."
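The loop described above can be tried on the question's samples. This is just a sketch: join_hyphens is an illustrative name, and the program is split into one command per -e so that it does not rely on how a particular sed parses braces and labels on one line.

```shell
# join_hyphens (illustrative name): the same sed program, one command per -e
join_hyphens() {
  sed -e ':a' \
      -e '/-$/{' \
      -e 'N' \
      -e 's/-\n//' \
      -e 'ba' \
      -e '}' \
      -e 's/- //g'
}

printf '%s\n' 'The rule which provided for the consid-' \
              'eration of the resolution, was agreed to earlier by a' | join_hyphens
echo 'The dog rolled down the st- eep hill and pl- ayed outside.' | join_hyphens
```

Note that the two input lines come out as one long line: the rest of the second line is not pushed back onto its own line, which differs slightly from the output shown in the question.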
You may use this gnu-awk code:
cat file
The dog rolled down the st- eep hill and pl- ayed outside.
The rule which provided for the consid-
eration of the resolution, was agreed to earlier by a
Then use awk like this:
awk 'p != "" {
w = $1
$1 = ""
sub(/^[[:blank:]]+/, ORS)
$0 = p w $0
p = ""
}
{
$0 = gensub(/([_[:alnum:]])-[[:blank:]]+([_[:alnum:]])/, "\\1\\2", "g")
}
/-$/ {
p = $0
sub(/-$/, "", p)
}
p == ""' file
The dog rolled down the steep hill and played outside.
The rule which provided for the consideration
of the resolution, was agreed to earlier by a
If you can consider perl, then this may also work for you:
perl -0777 -pe 's/(\w)-\h+(\w)/$1$2/g; s/(\w)-\R(\w+)\s+/$1$2\n/g' file
You simply add backslash-parentheses (or use the -r or -E option if available to do away with the requirement to put backslashes before capturing parentheses) and recall the matched text with \1 for the first capturing parenthesis, \2 for the second, etc.
sed 's/\(\w\)\- \([a-z]\)/\1\2/g' Filename.txt
The \w escape is not standard sed but if it works for you, feel free to use it. Otherwise, it is easy to replace with [A-Za-z0-9_#] or whatever else you want to call "word characters".
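As a sketch of the portable variant, the \w can be swapped for a POSIX bracket expression. The function name join_split_words is invented here, and treating letters, digits and underscore as "word characters" is an assumption you may want to adjust.

```shell
# join_split_words (illustrative name): capture the character on each side of "- "
# and put the two captures back together without the dash and space
join_split_words() {
  sed 's/\([[:alnum:]_]\)- \([[:alnum:]_]\)/\1\2/g'
}

echo 'The dog rolled down the st- eep hill and pl- ayed outside.' | join_split_words
```

For the sample sentence this prints "The dog rolled down the steep hill and played outside."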
I'm guessing not all of the matches will be hyphenated words so perhaps run the result through a spelling checker or something to verify whether the result is an English word. (I would probably switch to a more capable scripting language like Python for that, though.)

bash grep regexp - excluding subpattern

I have a script written in bash, with one particular grep command I need to modify.
Generally I have two patterns: A & B. There is a textfile that can contain lines with all possible combinations of those patterns, that is:
"xxxAxxx", "xxxBxxx", "xxxAxxxBxxx", "xxxxxx", where "x" are any characters.
I need to match ALL lines APART FROM the ones containing ONLY "A".
At the moment, it is done with "grep -v (A)", but this is a false track, as this would exclude also lines with "xxxAxxxBxxx" - which are OK for me. This is why it needs modification. :)
The tricky part is that this one grep lies in the middle of a 'multiply-piped' command with many other greps, seds and awks inside. Thus forming a smarter pattern would be the best solution. Others would cause much additional work on changing other commands there, and even would impact another parts of the code.
Therefore, the question is: is there a possibility to match pattern and exclude a subpattern in one grep, but allow them to appear both in one line?
Example:
A file contains those lines:
fooTHISfoo
fooTHISfooTHATfoo
fooTHATfoo
foofoo
and I need to match
fooTHISfooTHATfoo
fooTHATfoo
foofoo
A line containing only "THIS" (without "THAT") is not allowed.
You can use this awk command:
awk '!(/THIS/ && !/THAT/)' file
fooTHISfooTHATfoo
fooTHATfoo
foofoo
Or by reversing the boolean expression:
awk '!/THIS/ || /THAT/' file
fooTHISfooTHATfoo
fooTHATfoo
foofoo
You want to match lines that contain B, or don't contain A. Equivalently, to delete lines containing A and not B. You could do this in sed:
sed -e '/A/{;/B/!d}'
Or in this particular case:
sed '/THIS/{/THAT/!d}' file
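To check that the awk and sed forms really select the same lines, they can be run side by side on the sample from the question. A small sketch; the helper names sample, keep_awk and keep_sed are made up for this illustration.

```shell
# sample: the four test lines from the question
sample() { printf '%s\n' fooTHISfoo fooTHISfooTHATfoo fooTHATfoo foofoo; }

# keep lines that do not contain THIS, or that do contain THAT
keep_awk() { awk '!/THIS/ || /THAT/'; }

# same filter in sed: on lines with THIS, delete unless THAT is also present
keep_sed() { sed -e '/THIS/{' -e '/THAT/!d' -e '}'; }

sample | keep_awk
sample | keep_sed
```

Both pipelines print fooTHISfooTHATfoo, fooTHATfoo and foofoo, so either can replace the single grep -v in the pipeline.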
Tricky for grep alone. However, replace that with an awk call: Filter out lines with "A" unless there is a "B"
echo "xxxAxxx
xxxBxxx
xxxAxxxBxxx
xxxBxxxAxxx
xxxxxx" | awk '!/A/ || /B/'
xxxBxxx
xxxAxxxBxxx
xxxBxxxAxxx
xxxxxx
A grep solution. It uses Perl-compatible regular expressions (-P) for negative lookaheads, which assert that THAT does not appear at a given position:
grep -Pv '^((?!THAT).)*THIS((?!THAT).)*$' file

Execute command defined by backreference in sed

I am creating a primitive experimental templating engine completely based on sed (merely for my private enjoyment). One thing I have been trying to achieve for several hours now is to replace certain text patterns with the output of a command they contain.
To clarify, if an input line looks like this
Lorem {{echo ipsum}}
I would like the sed output to look like this:
Lorem ipsum
The closest I have come is this:
echo 'Lorem {{echo ipsum}}' | sed 's/{{\(.*\)}}/'"$(\\1)"'/g'
which does not work.
However,
echo 'Lorem {{echo ipsum}}' | sed 's/{{\(.*\)}}/'"$(echo \\1)"'/g'
gives me
Lorem echo ipsum
I don't quite understand what is happening here. Why can I give the backreference to the echo command, but cannot evaluate the entire backreference in $()? When is \\1 getting evaluated? Is the thing I am trying to achieve even possible with pure sed?
Keep in mind that it is entirely clear to me that what I am trying to achieve is easily possible with other tools. However, I am highly interested in whether this is possible with pure sed.
Thanks!
The reason your attempt doesn't work is that $() is expanded by the shell before sed is even called. For this reason it can't use the backreferences sed is eventually going to capture.
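The order of evaluation can be made visible by storing the result of the command substitution before sed ever runs. This is only a sketch; the variable names replacement and result are made up for illustration.

```shell
# The shell runs the command substitution first; sed only sees its output.
replacement="$(echo \\1)"     # echo receives the two characters \1 and prints them
printf 'replacement text seen by sed: %s\n' "$replacement"

# sed therefore gets the script s/{{\(.*\)}}/\1/g, which merely strips the braces
result="$(echo 'Lorem {{echo ipsum}}' | sed 's/{{\(.*\)}}/'"$replacement"'/g')"
printf '%s\n' "$result"
```

This prints "Lorem echo ipsum", matching the behaviour described in the question: the echo ran in the shell, not on the captured text.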
It is possible to do this sort of thing with GNU sed (not with POSIX sed). The main trick is that GNU sed has an e flag to the s command that makes it replace the pattern space (the whole space) with the result of the pattern space executed as a shell command. What this means is that
echo 'echo foo' | sed 's/f/g/e'
prints goo.
This can be used for your use case as follows:
echo 'Lorem {{echo ipsum}}' | sed ':a /\(.*\){{\(.*\)}}\(.*\)/ { h; s//\1\n\3/; x; s//\2/e; G; s/\(.*\)\n\(.*\)\n\(.*\)/\2\1\3/; ba }'
The sed code works as follows:
:a # jump label for looping, in case there are
# several {{}} expressions in a line
/\(.*\){{\(.*\)}}\(.*\)/ { # if there is a {{}} expression,
h # make a copy of the line
s//\1\n\3/ # isolate the surrounding parts
x # swap the original back in
s//\2/e # isolate the command, execute, get output
G # get the outer parts we put into the hold
# buffer
s/\(.*\)\n\(.*\)\n\(.*\)/\2\1\3/ # rearrange the parts to put the command
# output into the right place
ba # rinse, repeat until all {{}} are covered
}
This makes use of sed's greedy matching in the regexes to always capture the last {{}} expression in a line. Note that it will have difficulties if there are several commands in a line and one of the later ones has multi-line output. Handling this case will require the definition of a marker that the commands embedded in the data are not allowed to have as part of their output and that the templates are not allowed to contain. I would suggest something like {{{}}}, which would lead to
sed ':a /\(.*\){{\(.*\)}}\(.*\)/ { h; s//{{{}}}\1{{{}}}\3/; x; s//\2/e; G; s/\(.*\)\n{{{}}}\(.*\){{{}}}\(.*\)/\2\1\3/; ba }'
The reasoning behind this is that the template engine would run into trouble anyway if the embedded commands printed further {{}} terms. This convention is impossible to enforce, but then any code you pass into this template engine had better come from a trusted source, anyway.
Mind you, I am not sure that this whole thing is a sane idea [1]. You're not planning to use it in any sort of production code, are you?
[1] I am, however, quite sure whether it is a sane idea.

How to swap text based on patterns at once with sed?

Suppose I have 'abbc' string and I want to replace:
ab -> bc
bc -> ab
If I try two replaces the result is not what I want:
echo 'abbc' | sed 's/ab/bc/g;s/bc/ab/g'
abab
So what sed command can I use to replace like below?
echo abbc | sed SED_COMMAND
bcab
EDIT:
Actually the text could have more than 2 patterns and I don't know how many replacements I will need. Since there was an answer saying that sed is a stream editor and that its replacements are greedy, I think that I will need to use some scripting language for that.
Maybe something like this:
sed 's/ab/~~/g; s/bc/ab/g; s/~~/bc/g'
Replace ~ with a character that you know won't be in the string.
I always use multiple statements with "-e"
$ sed -e 's:AND:\n&:g' -e 's:GROUP BY:\n&:g' -e 's:UNION:\n&:g' -e 's:FROM:\n&:g' file > readable.sql
This will insert a newline before every AND, GROUP BY, UNION and FROM. Here '&' stands for the matched string, so '\n&' replaces the match with a newline followed by the match itself.
sed is a stream editor. It searches and replaces greedily. The only way to do what you asked for is using an intermediate substitution pattern and changing it back in the end.
echo 'abcd' | sed -e 's/ab/xy/;s/cd/ab/;s/xy/cd/'
Here is a variation on ooga's answer that works for multiple search and replace pairs without having to check how values might be reused:
sed -i '
s/\bAB\b/________BC________/g
s/\bBC\b/________CD________/g
s/________//g
' path_to_your_files/*.txt
Here is an example:
before:
some text AB some more text "BC" and more text.
after:
some text BC some more text "CD" and more text.
Note that \b denotes word boundaries, which is what prevents the ________ from interfering with the search (I'm using GNU sed 4.2.2 on Ubuntu). If you are not using a word boundary search, then this technique may not work.
Also note that this gives the same results as removing the s/________//g and appending && sed -i 's/________//g' path_to_your_files/*.txt to the end of the command, but doesn't require specifying the path twice.
A general variation on this would be to use \x0 or _\x0_ in place of ________ if you know that no nulls appear in your files, as jthill suggested.
Here is an excerpt from the SED manual:
-e script
--expression=script
Add the commands in script to the set of commands to be run while processing the input.
Prepend each substitution with -e option and collect them together. The example that works for me follows:
sed < ../.env-turret.dist \
-e "s/{{ name }}/turret$TURRETS_COUNT_INIT/g" \
-e "s/{{ account }}/$CFW_ACCOUNT_ID/g" > ./.env.dist
This example also shows how to use environment variables in your substitutions.
This might work for you (GNU sed):
sed -r '1{x;s/^/:abbc:bcab/;x};G;s/^/\n/;:a;/\n\n/{P;d};s/\n(ab|bc)(.*\n.*:(\1)([^:]*))/\4\n\2/;ta;s/\n(.)/\1\n/;ta' file
This uses a lookup table which is prepared and held in the hold space (HS) and then appended to each line. A unique marker (in this case \n) is prepended to the start of the line and used to bump the search along the length of the line. Once the marker reaches the end of the line, the process is finished and the line is printed, with the lookup table and markers discarded.
N.B. The lookup table is prepped at the very start and a second unique marker (in this case :) chosen so as not to clash with the substitution strings.
With some comments:
sed -r '
# initialize hold with :abbc:bcab
1 {
x
s/^/:abbc:bcab/
x
}
G # append hold to patt (after a \n)
s/^/\n/ # prepend a \n
:a
/\n\n/ {
P # print patt up to first \n
d # delete patt & start next cycle
}
s/\n(ab|bc)(.*\n.*:(\1)([^:]*))/\4\n\2/
ta # goto a if sub occurred
s/\n(.)/\1\n/ # move one char past the first \n
ta # goto a if sub occurred
'
The table works like this:
   **   **   replacement
:abbc:bcab
 **   **     pattern
Tcl has a builtin for this
$ tclsh
% string map {ab bc bc ab} abbc
bcab
This works by walking the string a character at a time doing string comparisons starting at the current position.
In perl:
perl -E '
    sub string_map {
        my ($str, %map) = @_;
        my $i = 0;
        # walk the string; at each position try every key
        while ($i < length $str) {
          KEYS:
            for my $key (keys %map) {
                if (substr($str, $i, length $key) eq $key) {
                    substr($str, $i, length $key) = $map{$key};
                    # skip over the replacement so it is not rescanned
                    $i += length($map{$key}) - 1;
                    last KEYS;
                }
            }
            $i++;
        }
        return $str;
    }
    say string_map("abbc", "ab"=>"bc", "bc"=>"ab");
'
bcab
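The same single-pass idea can be sketched in plain awk, which keeps everything in a shell pipeline. The function name string_map and its 'from=to' argument format are inventions for this example. It assumes the search strings contain no spaces or = signs, and because awk -v interprets backslash escapes, backslashes in the pairs would need extra care.

```shell
# string_map (hypothetical helper): single left-to-right pass, so replaced text is never rescanned
# usage: string_map 'ab=bc bc=ab' < input
string_map() {
  awk -v pairs="$1" '
    BEGIN {
      n = split(pairs, p, " ")
      m = 0
      for (k = 1; k <= n; k++) {
        eq = index(p[k], "=")
        if (eq < 2) continue                  # skip malformed pairs and empty search strings
        m++
        from[m] = substr(p[k], 1, eq - 1)
        to[m]   = substr(p[k], eq + 1)
      }
    }
    {
      out = ""
      i = 1
      while (i <= length($0)) {
        hit = 0
        for (k = 1; k <= m; k++) {
          if (substr($0, i, length(from[k])) == from[k]) {
            out = out to[k]                   # emit the replacement
            i += length(from[k])              # and jump past the matched text
            hit = 1
            break
          }
        }
        if (!hit) { out = out substr($0, i, 1); i++ }   # copy one unmatched character
      }
      print out
    }'
}

echo 'abbc' | string_map 'ab=bc bc=ab'
```

For abbc this prints bcab. Pairs are tried in the order given, so when one search string is a prefix of another, the earlier pair wins.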
May be a simpler approach for single pattern occurrence you can try as below:
echo 'abbc' | sed 's/ab/bc/;s/bc/ab/2'
My output:
~# echo 'abbc' | sed 's/ab/bc/;s/bc/ab/2'
bcab
For multiple occurrences of pattern:
sed 's/\(ab\)\(bc\)/\2\1/g'
Example
~# cat try.txt
abbc abbc abbc
bcab abbc bcab
abbc abbc bcab
~# sed 's/\(ab\)\(bc\)/\2\1/g' try.txt
bcab bcab bcab
bcab bcab bcab
bcab bcab bcab
Hope this helps !!
echo "C:\Users\San.Tan\My Folder\project1" | sed -e 's/C:\\/mnt\/c\//;s/\\/\//g'
replaces
C:\Users\San.Tan\My Folder\project1
to
mnt/c/Users/San.Tan/My Folder/project1
in case someone needs to convert Windows paths to Windows Subsystem for Linux (WSL) paths.
If the replacement string comes from a shell variable, the solution above doesn't work as written.
The sed command needs to be in double quotes instead of single quotes so that the shell expands the variable:
#sed -e "s/#replacevarServiceName#/$varServiceName/g" -e "s/#replacevarImageTag#/$varImageTag/g" deployment.yaml
Here is an awk version based on ooga's sed:
echo 'abbc' | awk '{gsub(/ab/,"xy");gsub(/bc/,"ab");gsub(/xy/,"bc")}1'
bcab
I believe this should solve your problem. I may be missing a few edge cases, please comment if you notice one.
You need a way to exclude previous substitutions from future patterns, which really means making outputs distinguishable, as well as excluding these outputs from your searches, and finally making outputs indistinguishable again. This is very similar to the quoting/escaping process, so I'll draw from it.
s/\\/\\\\/g escapes all existing backslashes
s/ab/\\b\\c/g substitutes raw ab for escaped bc
s/bc/\\a\\b/g substitutes raw bc for escaped ab
s/\\\(.\)/\1/g substitutes all escaped X for raw X
I have not accounted for backslashes in ab or bc, but intuitively, I would escape the search and replace terms the same way - \ now matches \\, and substituted \\ will appear as \.
Until now I have been using backslashes as the escape character, but it's not necessarily the best choice. Almost any character should work, but be careful with the characters that need escaping in your environment, sed, etc. depending on how you intend to use the results.
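Put together, the four substitutions above form one sed invocation. A sketch only; the function name swap_escaped is made up, and the search and replacement terms are assumed to contain no backslashes.

```shell
# swap_escaped (illustrative name): swap ab and bc by marking replaced text with backslashes
swap_escaped() {
  sed -e 's/\\/\\\\/g' \
      -e 's/ab/\\b\\c/g' \
      -e 's/bc/\\a\\b/g' \
      -e 's/\\\(.\)/\1/g'
}
# 1st command: escape backslashes already present in the input
# 2nd command: raw ab becomes escaped bc, which the next command cannot match
# 3rd command: raw bc becomes escaped ab
# 4th command: drop one level of escaping from everything

echo 'abbc' | swap_escaped
```

This prints bcab. One edge case to be aware of: the last escaped character of a replacement sits directly before the following raw text, so with other pattern sets (for example a to b, then bc to something else, on the input ac) a later search could still match across that boundary.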
Every answer posted thus far seems to agree with the statement by kuriouscoder made in his above post:
The only way to do what you asked for is using an intermediate
substitution pattern and changing it back in the end
If you are going to do this, however, and your usage might involve more than some trivial string (maybe you are filtering data, etc.), the best character to use with sed is a newline. This is because sed is 100% line-based, so a newline is the one-and-only character you are guaranteed to never receive when a new line is fetched (forget about GNU multi-line extensions for this discussion).
To start with, here is a very simple approach to solving your problem using newlines as an intermediate delimiter:
echo "abbc" | sed -E $'s/ab|bc/\\\n&/g; s/\\nab/bc/g; s/\\nbc/ab/g'
With simplicity comes some trade-offs... if you had more than a couple variables, like in your original post, you have to type them all twice. Performance might be able to be improved a little bit, too.
It gets pretty nasty to do much beyond this using sed. Even with some of the more advanced features like branching control and the hold buffer (which is really weak IMO), your options are pretty limited.
Just for fun, I came up with this one alternative, but I don't think I would have any particular reason to recommend it over the one from earlier in this post... You have to essentially make your own "convention" for delimiters if you really want to do anything fancy in sed. This is way-overkill for your original post, but it might spark some ideas for people who come across this post and have more complicated situations.
My convention below was: use multiple newlines to "protect" or "unprotect" the part of the line you're working on. One newline denotes a word boundary. Two newlines denote alternatives for a candidate replacement. I don't replace right away, but rather list the candidate replacement on the next line. Three newlines mean that a value is "locked in", like your original post was trying to do with ab and bc. After that point, further replacements will be undone, because they are protected by the newlines. A little complicated, if I do say so myself! sed isn't really meant for much more than the basics.
# Newlines
NL=$'\\\n'
NOT_NL=$'[\x01-\x09\x0B-\x7F]'
# Delimiters
PRE="${NL}${NL}&${NL}"
POST="${NL}${NL}"
# Un-doer (if a request was made to modify a locked-in value)
tidy="s/(\\n\\n\\n${NOT_NL}*)\\n\\n(${NOT_NL}*)\\n(${NOT_NL}*)\\n\\n/\\1\\2/g; "
# Locker-inner (three newlines means "do not touch")
tidy+="s/(\\n\\n)${NOT_NL}*\\n(${NOT_NL}*\\n\\n)/\\1${NL}\\2/g;"
# Finalizer (remove newlines)
final="s/\\n//g"
# Input/Commands
input="abbc"
cmd1="s/(ab)/${PRE}bc${POST}/g"
cmd2="s/(bc)/${PRE}ab${POST}/g"
# Execute
echo ${input} | sed -E "${cmd1}; ${tidy}; ${cmd2}; ${tidy}; ${final}"

using sed to copy lines and delete characters from the duplicates

I have a file that looks like this:
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
I want it to look like this
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
I thought I could use sed to do this but I can't figure out how to store something in a buffer and then modify it.
Am I even using the right tool?
Thanks
You don't have to get tricky with regular expressions and replacement strings: use sed's p command to print the line intact, then modify the line and let it print implicitly
sed 'p; s/\.png//'
Glenn Jackman's response is OK, but it also doubles the rows which do not match the expression.
This one, instead, doubles only the rows which matched the expression:
sed -n 'p; s/\.png//p'
Here, -n stands for "print nothing unless explicitly printed", and the p flag in s/\.png//p forces a print if the substitution was done, but does not force it otherwise.
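The difference between the two commands only shows up on lines that do not contain .png. A quick sketch with one matching and one non-matching line; the helper names sample_lines, dup_all and dup_matching are made up for illustration.

```shell
# sample_lines: one line that matches and one that does not
sample_lines() { printf '%s\n' '#"Afghanistan.png",' '// not an image'; }

# dup_all: print every line, then print it again after the (possibly no-op) substitution
dup_all() { sed 'p; s/\.png//'; }

# dup_matching: print every line once, and a second time only if the substitution happened
dup_matching() { sed -n 'p; s/\.png//p'; }

sample_lines | dup_all        # 4 lines: the comment line appears twice
sample_lines | dup_matching   # 3 lines: the comment line appears once
```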
That is pretty easy to do with sed, and you do not even need to use the hold space (sed's auxiliary buffer). Given the input file below:
$ cat input
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
you should use this command:
sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
The result:
$ sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
This command is just a substitution command (s///). It matches anything starting with #" followed by non-period characters ([^.]*) and then by .png",. It also captures the non-period characters before .png", using the group brackets \( and \), so we can reuse what was matched by this group. So, this is the to-be-replaced regular expression:
#"\([^.]*\)\.png",
Next comes the replacement part of the command. The & character inserts everything that was matched by #"\([^.]*\)\.png", into the changed content. If it were the only element of the replacement part, nothing would be changed in the output. However, following the & there is a newline character, represented by the backslash \ followed by an actual newline, and on the new line we add the #" string followed by the content of the first group (\1) and then the string ",.
This is just a brief explanation of the command. Hope this helps. Also, note that you can use the \n string to represent newlines in some versions of sed (such as GNU sed). It would render a more concise and readable command:
sed 's/#"\([^.]*\)\.png",/&\n#"\1",/' input
I prefer this over Carles Sala and Glenn Jackman's:
sed '/\.png/p;s/\.png//'
Could just say it's personal preference.
or one can combine both versions and apply the duplication only on lines matching the required pattern
sed -e '/^#".*\.png",/{p;s/\.png//;}' input