What Does an Empty Regular Expression in a Sed Address Range Do? - regex

The following command sed -E '/# Section [134]/, // s/foo/bar/' <input_file> accomplishes the following
Input
# Section 1
- foo
- Unimportant Item
# Section 2
- foo
- Unimportant Item
# Section 3
- foo
- Unimportant Item
# Section 4
- foo
- Unimportant Item
# Section 5
- foo
- Unimportant Item
Output
# Section 1
- bar
- Unimportant Item
# Section 2
- foo
- Unimportant Item
# Section 3
- bar
- Unimportant Item
# Section 4
- bar
- Unimportant Item
# Section 5
- foo
- Unimportant Item
I am unsure of how this command works, specifically the empty regular expression in the address range. What I understand so far is Sed will first look for a portion of the document that matches the following regex /# Section [134]/, if it matches, it will begin the substitution looking for matches for foo and replacing them with bar. The second portion of the address range, as far as I am aware, is the stopping point, but in this case it is empty. I read here that an empty regular expression "repeats the last regular expression match", but I don't exactly know what this means, or how it affects this specific Sed command. How does the address range know that the stopping point is after each section? What regex is // repeating?

Let me simplify the input file as:
Line1
Line2
Line3
Line4
Line5
and the test script as:
sed -n "/[134]/,//p"
which will print all the lines.
The empty regex repeats the last regex that sed used at runtime. This script contains
only one regex, so the sed command above is equivalent to:
sed -n "/[134]/,/[134]/p"
Incidentally, the address range operator of sed works as follows:
if the left address matches, it returns true without evaluating
the right address on the same line (unlike the range operator of awk, which evaluates the right
condition immediately on the same line).
Let's see how the operator works line by line.
On the first line Line1, the left start address matches and
returns true.
On the second line, the right stop address is evaluated and does not
match, so the operator stays true.
On the third line, the right stop address matches, so the
status changes to false (after printing the line).
On the fourth line, the left start address matches and
returns true again.
On the fifth line, the right stop address does not match, so the operator stays true.
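If you change the regex to /[135]/, you will see a different result.
(Line1, 2, 3, 5 will be printed, skipping Line4.)
Bear in mind that your original command also runs s/foo/bar/ inside the range, so there the last regex used when // is tested is foo, and each range ends at the next line containing foo.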
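The walk-through above can be reproduced with a short shell sketch. The five-line input is the simplified file from this answer; the variable names are only for illustration.

```shell
# Five-line sample input, as in the simplified example.
input=$(printf 'Line%s\n' 1 2 3 4 5)

# Empty end address: reuses the only regex in the script, /[134]/.
all=$(printf '%s\n' "$input" | sed -n '/[134]/,//p')

# Same script with /[135]/: the range closes on Line3 and reopens on Line5.
some=$(printf '%s\n' "$input" | sed -n '/[135]/,//p')

printf '%s\n---\n%s\n' "$all" "$some"
```

The first run should print every line and the second should skip Line4, matching the line-by-line description.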

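Using an empty regex is shorthand for repeating the last regex that sed used at runtime. That is not necessarily the regex written just before it in the script: in your command, s/foo/bar/ runs on the line that opens the range, so by the time the end address // is tested on the following lines, the last regex used is foo. Your script therefore behaves like the longhand
sed -E '/# Section [134]/,/foo/ s/foo/bar/' <input_file>
which says to perform the substitution s/foo/bar/ in the range of lines starting with a matching section header, through to the next line containing foo (and starting again at the next matching header). This is why Section 2 keeps its foo in your output: had the end address been /# Section [134]/, the first range would have run through to the Section 3 header and rewritten the foo in Section 2 as well.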
This is also available in the s/// command, so a fairly common idiom is
sed '/foo/ s//bar/'
which says to search for foo and then replace foo with bar. (This particular example is not particularly useful, but there are situations where this saves a lot of typing.)
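One way to check which regex the empty end address picks up is to run the question's command next to two spelled-out variants. This is a sketch for a sed that follows the "last regex used at runtime" rule (GNU sed does); the helper function changed and the variable names are made up for illustration.

```shell
# Rebuild the question's input: five sections, each with a foo line.
input=$(for n in 1 2 3 4 5; do
  printf '# Section %s\n- foo\n- Unimportant Item\n' "$n"
done)

# The question's command, with the empty end address.
empty=$(printf '%s\n' "$input" | sed -E '/# Section [134]/, // s/foo/bar/')

# End address spelled out as foo.
as_foo=$(printf '%s\n' "$input" | sed -E '/# Section [134]/,/foo/ s/foo/bar/')

# End address spelled out as the section regex.
as_section=$(printf '%s\n' "$input" | sed -E '/# Section [134]/,/# Section [134]/ s/foo/bar/')

# Hypothetical helper: list the section numbers whose item became bar.
changed() {
  printf '%s\n' "$1" | awk '/^# Section/ {n = $3} /^- bar$/ {printf "%s ", n}'
}

echo "empty:   $(changed "$empty")"
echo "foo:     $(changed "$as_foo")"
echo "section: $(changed "$as_section")"
```

The first two lines should list sections 1, 3 and 4, as in the question's output, while the section-regex variant should list 1, 2, 4 and 5.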

Related

Perl In place edit: Find and replace in X12850 formatted file

I am new to Perl and cannot figure this out. I have a file called Test:
ISA^00^ ^00^ ^01^SupplyScan ^01^NOVA ^180815^0719^U^00204^000000255^0^P^^
GS^PO^SupplyScan^NOVA^20180815^0719^00000255^X^002004
ST^850^00000255
BEG^00^SA^0000000059^^20180815
DTM^097^20180815^0719
N1^BY^^92^
N1^SE^^92^1
N1^ST^^92^
PO1^1^4^BX^40.000^^^^^^^^IN^131470^^^1^
PID^F^^^^CATH 6FR .070 MPA 1 100CM
REF^
PO1^2^4^BX^40.000^^^^^^^^IN^131295^^^1^
PID^F^^^^CATHETER 6FR XB 3.5
REF^
PO1^3^2^EA^48.000^^^^^^^^IN^132288^^^1^
PID^F^^^^CATH 6FR AL-1 SH
REF^
PO1^4^2^BX^48.000^^^^^^^^IN^131297^^^1^
PID^F^^^^CATHETER 6FR .070 JL4SH 100CM
REF^
CTT^4^12
SE^20^00000255
GE^1^00000255
IEA^1^00000255
What I am trying to do is an in-place edit, dropping any value in the N1^SE segment after the 92^. I tried this but I can't seem to make it work:
perl -i -pe 's/^N1\^SE\^\^92\^\d+$/N1^SE^^92^/g' Test
The final result should include the N1^SE segment looking like this:
N1^SE^^92^
It worked when I just had the one line in the file: N1^SE^^92^1. But when I try to globally substitute in the entire file, it doesn't work
Thanks.
Your file may contain hidden characters or spaces that were lost when you copied it here. Those may well be at the end of the line, so try
perl -i -pe 's/^N1\^SE\^\^92\^\K.*//' Test
The \K is a special form of the "positive lookbehind": it drops everything matched before it from the match, so only the .* after it (the rest of the line) is removed by the substitution. †
This takes seriously the requirement "dropping any value ... after", as it matches lines with things other than the sole \d from the question's example.
Or use \Q...\E sequence to escape special characters (see quotemeta)
perl -i -pe 's/^\QN1^SE^^92^\E\K.*//' Test
per Borodin's comment.
Another take is to specifically match \d as in the question
s/^N1\^SE\^\^92\^\K\d+//
per ikegami's comment. This stays true to your patterns and it also doesn't remove whatever may be hiding at the end of the line.
† The term "lookbehind" for \K is from documentation but, while \K clearly "looks behind," it has marked differences from how the normal lookbehind assertions behave.
Here is a striking example from ikegami. Compare
perl -le'print for "abcde" =~ /(?<=\w)\w/g' # prints lines: b c d e
and
perl -le'print for "abcde" =~ /\w\K\w/g' # prints lines: b d

RegEx Confusion in linux shell script

Can someone explain what this does in a linux shell.....
port=$((${devpath##*[-.]} - 1))
I have a variable named $devpath, and one possible value is /sys/bus/usb/devices/usb2/2-1.
I'm assuming that ${devpath##*[-.]} performs some sort of regex on $devpath, but it makes no sense to me. Nor does *[-.] which I understand to mean "one or more of any one of the character '-' or any other character except newline"
When running through a script (this is from usb-devices.sh), it seems that the value of port is always the first numeric digit. Something else that confuses me is the '-1' at the end, shouldn't that reduce whatever ${devpath##*[-.]} does by one?
I tried looking up regex in shell expressions but nothing made any sense, and nowhere could I find an explanation for ##.
Given the variable:
r="/sys/bus/usb/devices/usb2/2-123.45"
echo ${r##*-} returns 123.45 and echo ${r##*[-.]} returns 45. Do you see the pattern here?
Let's go a bit further: the expression ${string##substring} strips the longest match of $substring from the front of $string.
So with ${r##*[-.]} we are stripping everything in $r until the last - or . is found.
Then, $(( )) is used for arithmetic expressions. Thus, with $(( $var - 1 )) you are subtracting 1 from the value coming from ${r##*[-.]}.
All together, port=$((${devpath##*[-.]} - 1)) means: store in $port the number that follows the last - or . in $devpath, minus one.
Following the example above, echo $((${r##*[-.]} - 1)) returns 44 (45 - 1).
There is no regex here. ${var##pattern} returns the value of var with any match on pattern removed from the prefix (but this is a glob pattern, not a regex); $((value - 1)) subtracts one from value. So the expression takes the number after the last dash or dot and reduces it by one.
See Shell Parameter Expansion and Arithmetic Expansion in the Bash manual.
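A small sketch that puts the pieces together, using the devpath value from the question and the r value from the first answer. The names short, long and port are just illustrative variables.

```shell
devpath=/sys/bus/usb/devices/usb2/2-1
r=/sys/bus/usb/devices/usb2/2-123.45

# '#' strips the shortest matching prefix, '##' the longest.
# *[-.] is a glob: any run of characters followed by one '-' or '.'.
short=${r#*[-.]}     # text after the first - or .
long=${r##*[-.]}     # text after the last - or .

# Arithmetic expansion then subtracts one from the remaining number.
port=$(( ${devpath##*[-.]} - 1 ))

echo "short=$short long=$long port=$port"
```

This prints short=123.45 long=45 port=0: the path 2-1 leaves 1 after the last dash, and 1 - 1 is 0.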

vi: :s how to replace only the second occurence on a line?

:s/u/X/2 - this replaces the first u to X on the current and next line...
or to replace the second character on a line with X???? IDK.
or perhaps its something other than :s?
I suspect I have to use grouping of some kind (\2?) but I don't know to write that.
I heard that sed and the :s command in vi are alike, and on a help page for sed I found:
3.1.3. Substitution switches:
Standard versions of sed support 4 main flags or switches which may be added to
the end of an "s///" command. They are:
N - Replace the Nth match of the pattern on the LHS, where
N is an integer between 1 and 512. If N is omitted,
the default is to replace the first match only.
g - Global replace of all matches to the pattern.
p - Print the results to stdout, even if -n switch is used.
w file - Write the pattern space to 'file' if a replacement was
done. If the file already exists when the script is
executed, it is overwritten. During script execution,
w appends to the file for each match.
http://sed.sourceforge.net/sedfaq3.html#s3.1.3
so: :r! sed 's/u/X/2' would work, although I think there is a specifically vi way of doing this?
IDK if its relevant but I'm using the tcsh shell.
also,
:version:
Version 1.79 (10/23/96) The CSRG, University of California, Berkeley.
This is brittle, but may be enough to do what you want. This substitute command with a regex (Vim syntax):
:%s/first\(.\{-}\)first/first\1second/g
converts this:
first and first then first again
first and first then first again
first and first then first again
first and first then first again
to this:
first and second then first again
first and second then first again
first and second then first again
first and second then first again
The regexp looks for the first "first", followed by a match of any characters using pattern .\{-}, which is the non-greedy version of .* (type :help non-greedy in vim for more info.) This non-greedy match is followed with the second "first".
The characters between the first and second "first" are captured by surrounding the .\{-} with parentheses, which, with escaping, results in \(.\{-}\); that captured group is then dereferenced with \1 (1 means first captured group) in the replacement.
In order to substitute the second occurrence on a line, you can say:
:call feedkeys('nyq') | s/u/X/gc
In order to invoke it over a range of lines or the entire file, use it in a function:
:function Mysub()
: call feedkeys('nyq') | s/u/X/gc
:endfunction
For example, the following would replace the second occurrence of u with X in every line in the file:
:1,$ call Mysub()
Here's a dumber but easier to understand way: first find a string that doesn't exist in the file - for the sake of argument assume it's zzz. then simply:
:%s/first/zzz
:%s/first/second
:%s/zzz/first
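Outside the editor, both ideas can be tried with sed from the shell. This is only a sketch: the sample line is borrowed from the earlier answer, and zzz is assumed not to occur in the input. From inside vi, a buffer can usually be filtered through such a command with :%!sed 's/u/X/2'.

```shell
line='first and first then first again'

# A numeric flag on s replaces only that occurrence on each line.
nth=$(printf '%s\n' "$line" | sed 's/first/second/2')

# The placeholder trick, written as three substitutions in a row.
swap=$(printf '%s\n' "$line" |
  sed -e 's/first/zzz/' -e 's/first/second/' -e 's/zzz/first/')

printf '%s\n%s\n' "$nth" "$swap"
```

Both commands print "first and second then first again".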

sed - behaviour of holdspace

I have (from the sed website http://sed.sourceforge.net/sed1line.txt) this one-liner:
sed -e '/./{H;$!d;}' -e 'x;/AAA/!d;/BBB/!d;/CCC/!d'
Its purpose is to print a paragraph only if it contains AAA, BBB and CCC (in any order).
My understanding of the script:
'/./' matches every line which is not empty
'{}' all commands within the brackets handle the matched lines
'H' appends the holdspace with the matched lines
'$!d' delete from patternspace everything but the last line
'x' swaps the pattern- and holdspace
'/AAA/!d' search for AAA paragraph and print it
What is not clear to me:
In the holdspace should be several separate lines (for each paragraph), why am I able to search the whole paragraph? Are the lines in the holdspace merged to one line?
And how does sed know when one paragraph ends and the other begins in the holdspace?
Why do I have to append '$!d', why is not '$d' sufficient? Why am I not able to omit the '-n' and use '$p' instead of '$!d' in this case?
Thank you very much for every comment!
My test data (match every paragraph with XX in it):
YYaaaa
aaa1
aaa2
aXX3
aaa4
YYbbbb
bbb1
bbb2
YYcccc
ccc1
ccc2
ccc3
cXX4
ccc5
YYdddd
ddd1
dXX2
Following command is used:
sed -ne '/./{H;$!d};x;/XX/p' test2
Versions:
$ sed --version
GNU sed-Version 4.2.1
$ bash --version
GNU bash, Version 4.2.10(1)-release (x86_64-pc-linux-gnu)
It collects a paragraph as individual lines into the hold space (H), then when you hit an empty line, /./ fails and it falls through to the x, which swaps the collected paragraph into the pattern space and leaves only the empty line in the hold space, effectively resetting it for the next paragraph.
In order to correctly handle the final paragraph, it needs to cope with a paragraph which is not followed by an empty line, therefore it falls through from the last line as if it were followed by an empty line. This is a common idiom for scripts which collect something up through a particular pattern (or, to put it differently, it's a common error for such scripts to fail to handle the last collected data at end of file).
So in other words, if we are looking at a non-empty line, add it to the hold space, and unless it's the last line in the file, delete it and start over from the beginning of the script with the next input line. (Perhaps your understanding of d was not complete? This is what $!d means.)
Otherwise, we have an empty line, or end of file, and the hold space contains zero or more lines of text (one paragraph, possibly empty). Exchange them into the pattern space (the current, empty, line conveniently moves to the hold space) and examine the pattern space. If it fails to match one of our expressions, delete it. Otherwise, the default action is to print the entire pattern space.
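The behaviour can be observed with a cut-down version of the question's test data. The blank lines between paragraphs are an assumption, since they are not visible in the data as shown. H appends each line to the hold space after a newline, so a collected paragraph is one multi-line string that /XX/ can search as a whole.

```shell
# Three paragraphs separated by blank lines; the middle one has no XX.
data=$(printf '%s\n' YYaaaa aaa1 aXX3 '' YYbbbb bbb1 '' YYdddd ddd1 dXX2)

# Collect non-empty lines in the hold space. At a blank line, or at the
# last line of input, swap the paragraph in and print it if it has XX.
out=$(printf '%s\n' "$data" | sed -n -e '/./{H;$!d;}' -e 'x;/XX/p')

printf '%s\n' "$out"
```

Only the YYaaaa and YYdddd paragraphs are printed, each preceded by the empty line that H puts in front of the first collected line.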

How to replace all the blanks within square brackets with an underscore using sed?

I figured out that in order to turn [some name] into [some_name] I need to use the following expression:
s/\(\[[^ ]*\) /\1_/
i.e. create a backreference capture for anything that starts with a literal '[' that contains any number of non space characters, followed by a space, to be replaced with the non space characters followed by an underscore. What I don't know yet though is how to alter this expression so it works for ALL spaces within the brackets e.g. [a few words] into [a_few_words].
I sense that I'm close, but am just missing a chunk of knowledge that will unlock the key to making this thing work an infinite number of times within the constraints of the first set of []s contained in a line (of SQL Server DDL in this case).
Any suggestions gratefully received....
There are two parts to the trickery needed:
Stop replacing when you reach a close square bracket (but do it repeatedly on the line):
s/\(\[[^] ]*\) /\1_/g
This matches an open square bracket, followed by zero or more characters that are neither a blank nor a close square bracket, followed by a blank. The global suffix means that the pattern is applied to all sequences starting with an open square bracket followed eventually by a blank or close square bracket on the line. Note, too, that this regex does not alter '[single-word] and context' whereas the original would translate that to '[single-word]_and context', which is not the object of the exercise.
Get sed to repeat the search from where this one started. Unfortunately, there isn't a truly good way to do that. Sed always resumes searching after the text that was substituted; and this is one occasion when we don't want that. Sometimes, you can get away with simply repeating the substitute operation. In this case, you have to repeat it every time the substitution succeeds, stopping when there are no more substitutions.
Two of the less well known operations in sed are the ':label' and the 't' commands. They were present in the 7th Edition of Unix (circa 1978), though, so they are not new features. The first simply identifies a position in the script which can be jumped to with 'b' (not wanted here) or 't':
[2addr]t [label]
Branch to the ':' function bearing the label if any substitutions have been made since the most recent reading of an input line or execution of a 't' function. If no label is specified, branch to the end of the script.
Marvellous: we need:
sed -e ':redo; s/\(\[[^] ]*\) /\1_/g; t redo' data.file
Except - it doesn't work all on one line like that (at least, not on MacOS X). This did work admirably, though:
sed -e ':redo
s/\(\[[^] ]*\) /\1_/g
t redo' data.file
Or, as noted in the comments, you could write three separate '-e' options (which works on MacOS X):
sed -e ':redo' -e 's/\(\[[^] ]*\) /\1_/g' -e 't redo' data.file
Given the data file:
a line with [one blank] word inside square brackets.
a line with [two blank] or [three blank] words inside square brackets.
a line with [no-blank] word inside square brackets.
a line with [multiple words in a single bracket] inside square brackets.
a line with [multiple words in a single bracket] [several times on one line]
the output from the sed script shown is:
a line with [one_blank] word inside square brackets.
a line with [two_blank] or [three_blank] words inside square brackets.
a line with [no-blank] word inside square brackets.
a line with [multiple_words_in_a_single_bracket] inside square brackets.
a line with [multiple_words_in_a_single_bracket] [several_times_on_one_line]
And, finally, reading the fine print in the question, if you need this done only in the first square-bracketed field on each line, then we need to ensure that there are no open square brackets before the one that starts the match. This variant works:
sed -e ':redo' -e 's/^\([^]]*\[[^] ]*\) /\1_/' -e 't redo' data.file
(The 'g' qualifier is gone - it probably isn't needed in the other variants either given the loop; its presence might make the process marginally more efficient, but it would most likely be essentially impossible to detect that. The pattern is now anchored to the start of the line (the caret) and allows only characters that are not close square brackets before the open square bracket, so the match can never extend past the first bracketed field.)
Sample output:
a line with [one_blank] word inside square brackets.
a line with [two_blank] or [three blank] words inside square brackets.
a line with [no-blank] word inside square brackets.
a line with [multiple_words_in_a_single_bracket] inside square brackets.
a line with [multiple_words_in_a_single_bracket] [several times on one line]
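To see why the loop is needed at all, compare a single pass of the substitution with the looped version on the question's own example. A minimal sketch; the variable names are illustrative.

```shell
# One pass with g: after the first replacement sed resumes after the
# substituted text, where there is no '[' left to match.
once=$(printf '[a few words]\n' | sed -e 's/\(\[[^] ]*\) /\1_/g')

# With the label and t, the substitution is retried until it fails.
looped=$(printf '[a few words]\n' |
  sed -e ':redo' -e 's/\(\[[^] ]*\) /\1_/g' -e 't redo')

printf '%s\n%s\n' "$once" "$looped"
```

The single pass gives [a_few words]; the loop gives [a_few_words].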
This is easier in a language like perl which has "executable" substitutions:
perl -wne 's/(\[.*?])/ do { my $x = $1; $x =~ y, ,_,; $x } /ge; print'
Or to split it up more clearly:
sub replace_with_underscores {
my $s = shift;
$s =~ y/ /_/;
$s
}
s/(\[.*?])/ replace_with_underscores($1) /ge;
The .*? is the non-greedy match (to avoid slurring together two adjacent bracketed phrases) and the e flag to the substitution causes it to be evaluated, so you can call a function to do the inner work.