sed - behaviour of holdspace - regex

I have (from the sed website http://sed.sourceforge.net/sed1line.txt) this one-liner:
sed -e '/./{H;$!d;}' -e 'x;/AAA/!d;/BBB/!d;/CCC/!d'
Its purpose is to print a paragraph only if it contains AAA, BBB and CCC (in any order).
My understanding of the script:
'/./' matches every line which is not empty
'{}' all commands within the brackets handle the matched lines
'H' appends the holdspace with the matched lines
'$!d' delete from patternspace everything but the last line
'x' swaps the pattern- and holdspace
'/AAA/!d' search for AAA paragraph and print it
What is not clear to me:
In the holdspace should be several separate lines (for each paragraph), why am I able to search the whole paragraph? Are the lines in the holdspace merged to one line?
And how does sed know when one paragraph ends and the other begins in the holdspace?
Why do I have to append '$!d', why is not '$d' sufficient? Why am I not able to omit the '-n' and use '$p' instead of '$!d' in this case?
Thank you very much for every comment!
My test data (match every paragraph with XX in it):
YYaaaa
aaa1
aaa2
aXX3
aaa4
YYbbbb
bbb1
bbb2
YYcccc
ccc1
ccc2
ccc3
cXX4
ccc5
YYdddd
ddd1
dXX2
Following command is used:
sed -ne '/./{H;$!d};x;/XX/p' test2
Versions:
$ sed --version
GNU sed-Version 4.2.1
$ bash --version
GNU bash, Version 4.2.10(1)-release (x86_64-pc-linux-gnu)

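It collects a paragraph into the hold space one line at a time: each H appends a newline plus the current line, so the hold space ends up holding the whole paragraph as a single string with embedded newlines, which is why a regex can later search all of it at once. When you hit an empty line, /./ fails and it falls through to the x, which swaps the collected paragraph into the pattern space and leaves the empty line in the hold space, effectively resetting it for the next paragraph.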
In order to correctly handle the final paragraph, it needs to cope with a paragraph which is not followed by an empty line, therefore it falls through from the last line as if it were followed by an empty line. This is a common idiom for scripts which collect something up through a particular pattern (or, to put it differently, it's a common error for such scripts to fail to handle the last collected data at end of file).
So in other words, if we are looking at a non-empty line, add it to the hold space, and unless it's the last line in the file, delete it and start over from the beginning of the script with the next input line. (Perhaps your understanding of d was not complete? This is what $!d means.)
Otherwise, we have an empty line, or end of file, and the hold space contains zero or more lines of text (one paragraph, possibly empty). Exchange them into the pattern space (the current, empty, line conveniently moves to the hold space) and examine the pattern space. If it fails to match one of our expressions, delete it. Otherwise, the default action is to print the entire pattern space.
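As a rough illustration of the walkthrough above, here is a minimal sketch that runs the same script on a small, made-up input (the lines a1, aXX2 and so on are invented for this example and are not the asker's test file). The -n/p combination from the question is replaced by the d form of the original one-liner.

```shell
# Three paragraphs separated by empty lines; only the first and the last contain XX.
input=$(printf '%s\n' 'a1' 'aXX2' '' 'b1' 'b2' '' 'c1' 'cXX2')

# H appends a newline plus the current line to the hold space, so each
# paragraph accumulates there as one multi-line string. On an empty line,
# or on the last line of input, x swaps that string into the pattern space
# and /XX/!d deletes it unless XX occurs somewhere in the paragraph.
out=$(printf '%s\n' "$input" | sed -e '/./{H;$!d;}' -e 'x;/XX/!d')

printf '%s\n' "$out"
```

Each printed paragraph starts with an empty line, because H always puts a newline in front of the line it appends.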

Related

What Does an Empty Regular Expression in a Sed Address Range Do?

The following command sed -E '/# Section [134]/, // s/foo/bar/' <input_file> accomplishes the following
Input
# Section 1
- foo
- Unimportant Item
# Section 2
- foo
- Unimportant Item
# Section 3
- foo
- Unimportant Item
# Section 4
- foo
- Unimportant Item
# Section 5
- foo
- Unimportant Item
Output
# Section 1
- bar
- Unimportant Item
# Section 2
- foo
- Unimportant Item
# Section 3
- bar
- Unimportant Item
# Section 4
- bar
- Unimportant Item
# Section 5
- foo
- Unimportant Item
I am unsure of how this command works, specifically the empty regular expression in the address range. What I understand so far is Sed will first look for a portion of the document that matches the following regex /# Section [134]/, if it matches, it will begin the substitution looking for matches for foo and replacing them with bar. The second portion of the address range, as far as I am aware, is the stopping point, but in this case it is empty. I read here that an empty regular expression "repeats the last regular expression match", but I don't exactly know what this means, or how it affects this specific Sed command. How does the address range know that the stopping point is after each section? What regex is // repeating?
Let me simplify the input file as:
Line1
Line2
Line3
Line4
Line5
and the test script as:
sed -n "/[134]/,//p"
which will print all five lines.
As noted, the empty regex repeats the previously used regex. No other regex is used in this script, so the sed command
above is equivalent to:
sed -n "/[134]/,/[134]/p"
BTW, the address range operator of sed works as follows:
If the left address matches, the range becomes active without evaluating the right address on the same line (unlike the range operator of awk, which evaluates the right condition immediately on the same line).
Let's see how the operator works line by line.
On the first line, Line1, the left (start) address matches and the range becomes active.
On the second line, the right (stop) address is evaluated and does not match, so the range stays active.
On the third line, the right (stop) address matches, so the range ends (after printing the line).
On the fourth line, the left (start) address matches and the range becomes active again.
On the fifth line, the right (stop) address does not match and the range stays active.
If you change the regex to /[135]/, you will see a different result.
(Line1, 2, 3, 5 will be printed skipping Line4.)
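The two cases can be checked side by side with a small sketch like the following. The variable names are only for illustration; the input is the simplified five-line file from above.

```shell
# The simplified five-line input: Line1 ... Line5.
input=$(printf 'Line%s\n' 1 2 3 4 5)

# Start address /[134]/; the empty end address reuses the same regex here,
# because the script contains no other regex.
all=$(printf '%s\n' "$input" | sed -n '/[134]/,//p')

# Same idea with /[135]/: the first range ends at Line3, Line4 does not
# start a new range, and Line5 starts one that runs to the end of input.
some=$(printf '%s\n' "$input" | sed -n '/[135]/,//p')

printf 'with [134]:\n%s\n' "$all"
printf 'with [135]:\n%s\n' "$some"
```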
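Using an empty regex is a shorthand for repeating the most recently used regex, and "most recently used" is decided at run time, not by position in the script. In your script the s/foo/bar/ command runs on the line that opens the range, so by the time the end address // is checked on the following lines, the last regex used is foo, not # Section [134]. For your input the command therefore behaves like
sed -E '/# Section [134]/,/foo/ s/foo/bar/' <input_file>
which says to perform the substitution s/foo/bar/ from a line matching # Section [134] through the next line containing foo (and to start again at the next line matching # Section [134]). That is why only the foo under Sections 1, 3 and 4 is replaced. The longhand /# Section [134]/,/# Section [134]/ is not the same thing: it would run from the Section 1 heading through the Section 3 heading and so also replace the foo under Section 2.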
This is also available in the s/// command, so a fairly common idiom is
sed '/foo/ s//bar/'
which says to search for foo and then replace foo with bar. (This particular example is not particularly useful, but there are situations where this saves a lot of typing.)
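To see what // resolves to in the question's command, one can compare it against a spelled-out end address. This is only a sketch, assuming GNU sed and GNU or BSD grep (for -B1); the helper function changed_sections is made up for this example and simply lists the headings whose item was rewritten.

```shell
# Rebuild the question's input: five sections, each with a foo item.
input=$(for n in 1 2 3 4 5; do
  printf '# Section %s\n- foo\n- Unimportant Item\n' "$n"
done)

# Hypothetical helper: print the heading directly above every "- bar" line.
changed_sections() {
  grep -B1 -e '- bar' | grep '^#'
}

# Empty end address, as in the question.
short=$(printf '%s\n' "$input" |
  sed -E '/# Section [134]/,// s/foo/bar/' | changed_sections)

# End address spelled out as the start regex.
long=$(printf '%s\n' "$input" |
  sed -E '/# Section [134]/,/# Section [134]/ s/foo/bar/' | changed_sections)

printf 'empty end address:\n%s\n' "$short"
printf 'repeated start regex:\n%s\n' "$long"
```

The two lists differ because the empty regex picks up whatever regex sed used last while running, which by then is the foo of the s command.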

Use sed to remove string results in empty file

I have large text files, in which sometimes long lines are broken into multiple lines by writing a = and then a newline character. (Enron email data from Kaggle). Since even words are broken this way and I want to do some machine learning with the data, I'd like to remove those breaks. As far as I can see the combination =\n is only used for these breaks, so if I remove those, I have the same information without the breaks and nothing gets lost.
I cannot use tr because it only replaces 1 character, but I have two characters to replace.
The sed command I am using so far to no avail is:
sed --in-place --quiet --regexp-extended 's/=\n//g' email_aa_edit
where email_aa_edit is a part of the enron mail data (used split to split it) and is my input file. However this only produces an empty file and I am not sure why. Afaik = is not a special character on itself and the newline should be \n.
What is the correct way of removing those =\n occurrences?
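Your file ends up empty because --quiet suppresses the automatic printing of each line and the script never prints anything explicitly, so --in-place writes back no output at all. The substitution itself also never matches: sed reads one line at a time and strips the trailing newline, so the pattern space contains no \n to remove. It becomes possible if you append the next line to the pattern space: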
sed ':a;/=$/{N;s/=\n//;ta}' file
details:
:a; # defines a label "a"
/=$/ { # if the line ends with =
N; # append the next line to the pattern space
s/=\n//; # replace the =\n
ta # jump to label "a" when something is replaced (that's always the case
# except if the last line ends with =)
}
Note: if your file uses the Windows newline sequence, each line ends with \r before the newline, so change /=$/ to /=\r$/ and \n to \r\n.
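A quick way to try the loop without touching the real data is to feed it a few made-up lines. In this sketch the sample text is invented, and the script is the same program split into separate -e pieces, which should also be accepted by sed implementations that are picky about labels followed by a semicolon.

```shell
# Invented sample: two soft-wrapped lines, one of them wrapped twice.
input=$(printf '%s\n' 'long li=' 'ne one' 'short' 'bro=' 'ken=' ' up')

# :a     label
# /=$/   only act on lines that end with =
# N      append the next input line (pattern space now contains "=\n")
# s      remove the "=\n" pair
# ta     if something was replaced, go back and test the joined line again
out=$(printf '%s\n' "$input" |
  sed -e ':a' -e '/=$/{' -e 'N' -e 's/=\n//' -e 'ta' -e '}')

printf '%s\n' "$out"
```

Once the output looks right, the same script can be combined with --in-place, but without --quiet.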

vi: :s how to replace only the second occurence on a line?

:s/u/X/2 - this replaces the first u to X on the current and next line...
or to replace the second character on a line with X???? IDK.
or perhaps its something other than :s?
I suspect I have to use grouping of some kind (\2?) but I don't know to write that.
I heard that sed and :s option in sed are alike, and on a help page for sed I found:
3.1.3. Substitution switches:
Standard versions of sed support 4 main flags or switches which may be added to
the end of an "s///" command. They are:
N - Replace the Nth match of the pattern on the LHS, where
N is an integer between 1 and 512. If N is omitted,
the default is to replace the first match only.
g - Global replace of all matches to the pattern.
p - Print the results to stdout, even if -n switch is used.
w file - Write the pattern space to 'file' if a replacement was
done. If the file already exists when the script is
executed, it is overwritten. During script execution,
w appends to the file for each match.
http://sed.sourceforge.net/sedfaq3.html#s3.1.3
so: :r! sed 's/u/X/2' would work, although I think there is a specifically vi way of doing this?
IDK if its relevant but I'm using the tcsh shell.
also,
:version:
Version 1.79 (10/23/96) The CSRG, University of California, Berkeley.
This is brittle, but may be enough to do what you want. This switch command with regex:
:%s/first\(.\{-}\)first/first\1second/g
converts this:
first and first then first again
first and first then first again
first and first then first again
first and first then first again
to this:
first and second then first again
first and second then first again
first and second then first again
first and second then first again
The regexp looks for the first "first", followed by a match of any characters using pattern .\{-}, which is the non-greedy version of .* (type :help non-greedy in vim for more info.) This non-greedy match is followed with the second "first".
The characters between the first and second "first" are captured by surrounding the .\{-} with parentheses, which, with escaping, results in \(.\{-}\); that captured group is then dereferenced with the \1 (1 means first captured group) in the replacement.
In order to substitute the second occurrence on a line, you can say:
:call feedkeys('nyq') | s/u/X/gc
In order to invoke it over a range of lines or the entire file, use it in a function:
:function Mysub()
: call feedkeys('nyq') | s/u/X/gc
:endfunction
For example, the following would substitute the second occurrence of u for X in every line in the file:
:1,$ call Mysub()
Here's a dumber but easier to understand way: first find a string that doesn't exist in the file - for the sake of argument assume it's zzz. Then simply:
:%s/first/zzz
:%s/first/second
:%s/zzz/first
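For comparison outside the editor, the same placeholder trick can be sketched with sed, next to the numeric flag from the sed FAQ quoted in the question. The sample line is the one used above; zzz is again assumed not to occur in the text.

```shell
line='first and first then first again'

# Numeric flag: replace only the 2nd match on the line.
by_flag=$(printf '%s\n' "$line" | sed 's/first/second/2')

# Placeholder trick: hide the 1st match, replace what is now the first
# remaining match, then restore the hidden one.
by_placeholder=$(printf '%s\n' "$line" |
  sed -e 's/first/zzz/' -e 's/first/second/' -e 's/zzz/first/')

printf '%s\n%s\n' "$by_flag" "$by_placeholder"
```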

Regex: Match any character (including whitespace) except a comma

I would like to match any character and any whitespace except comma with regex. Only matching any character except comma gives me:
[^,]*
but I also want to match any whitespace characters, tabs, space, newline, etc. anywhere in the string.
EDIT:
This is using sed in vim via :%s/foo/bar/gc.
I want to find starting from func up until the comma, in the following example:
func("bla bla bla"
"asdfasdfasdfasdfasdfasdf"
"asdfasdfasdf", "more strings")
To match a regex across multiple lines in sed, you need its multiline commands.
EDIT:
In sed, working with newlines is a bit different. sed provides three commands for multiline operations: N, P and D. The "Working with Multiple Lines" explanation discusses all three.
My guess is that the N command is the piece that is missing here. Adding N lets the pattern see a \n in the pattern space.
A quoted example:
Occasionally one wishes to use a new line character in a sed script.
Well, this has some subtle issues here. If one wants to search for a
new line, one has to use "\n." Here is an example where you search for
a phrase, and delete the new line character after that phrase -
joining two lines together.
(echo a;echo x;echo y) | sed '/x$/ {
N
s:x\n:x:
}'
which generates
a
xy
However, if you are inserting a new line, don't use "\n" - instead
insert a literal new line character:
(echo a;echo x;echo y) | sed 's:x:X\
:'
generates
a
X

y
So basically you're trying to match a pattern over multiple lines.
Here's one way to do it in sed (pretty sure these are not useable within vim though, and I don't know how to replicate this within vim)
sed '
/func/{
:loop
/,/! {N; b loop}
s/[^,]*/func("ok"/
}
' inputfile
Let's say inputfile contains these lines
func("bla bla bla"
"asdfasdfasdfasdfasdfasdf"
"asdfasdfasdf", "more strings")
The output is
func("ok", "more strings")
Details:
If a line contains func, enter the braces.
:loop is a label named loop
If the line does not contain , (that's what /,/! means)
append the next line to pattern space (N)
branch to / go to loop label (b loop)
So it will keep on appending lines and looping until , is found, upon which the s command is run which matches all characters before the first comma against the (multi-line) pattern space, and performs a replacement.
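The steps above can be tried on the three sample lines without creating a file. This is a sketch of the same script with each command on its own line, which avoids relying on how a particular sed parses a label followed by a closing brace.

```shell
# The sample call from the question, spread over three lines.
input=$(printf '%s\n' \
  'func("bla bla bla"' \
  '"asdfasdfasdfasdfasdfasdf"' \
  '"asdfasdfasdf", "more strings")')

# Keep appending lines (N) until the pattern space contains a comma,
# then replace everything before the first comma in one go.
out=$(printf '%s\n' "$input" | sed '
/func/{
:loop
/,/!{
N
b loop
}
s/[^,]*/func("ok"/
}
')

printf '%s\n' "$out"
```

The bracket expression [^,] also matches the newlines that N put into the pattern space, which is what lets a single s command span all three lines.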

How to replace all the blanks within square brackets with an underscore using sed?

I figured out that in order to turn [some name] into [some_name] I need to use the following expression:
s/\(\[[^ ]*\) /\1_/
i.e. create a backreference capture for anything that starts with a literal '[' that contains any number of non-space characters, followed by a space, to be replaced with the non-space characters followed by an underscore. What I don't know yet though is how to alter this expression so it works for ALL blanks within the brackets, e.g. [a few words] into [a_few_words].
I sense that I'm close, but am just missing a chunk of knowledge that will unlock the key to making this thing work an infinite number of times within the constraints of the first set of []s contained in a line (of SQL Server DDL in this case).
Any suggestions gratefully received....
There are two parts to the trickery needed:
Stop replacing when you reach a close square bracket (but do it repeatedly on the line):
s/\(\[[^] ]*\) /\1_/g
This matches an open square bracket, followed by zero or more characters that are neither a blank nor a close square bracket, followed by a blank. The global suffix means that the substitution is applied to every such sequence on the line, not just the first, but within each bracketed name it still only replaces the first blank. Note, too, that this regex does not alter '[single-word] and context' whereas the original would translate that to '[single-word]_and context', which is not the object of the exercise.
Get sed to repeat the search from where this one started. Unfortunately, there isn't a truly good way to do that. Sed always resumes searching after the text that was substituted; and this is one occasion when we don't want that. Sometimes, you can get away with simply repeating the substitute operation. In this case, you have to repeat it every time the substitution succeeds, stopping when there are no more substitutions.
Two of the less well known operations in sed are the ':label' and the 't' commands. They were present in the 7th Edition of Unix (circa 1978), though, so they are not new features. The first simply identifies a position in the script which can be jumped to with 'b' (not wanted here) or 't':
[2addr]t [label]
Branch to the ':' function bearing the label if any substitutions have been made since the most recent reading of an input line or execution of a 't' function. If no label is specified, branch to the end of the script.
Marvellous: we need:
sed -e ':redo; s/\(\[[^] ]*\) /\1_/g; t redo' data.file
Except - it doesn't work all on one line like that (at least, not on MacOS X). This did work admirably, though:
sed -e ':redo
s/\(\[[^] ]*\) /\1_/g
t redo' data.file
Or, as noted in the comments, you could write three separate '-e' options (which works on MacOS X):
sed -e ':redo' -e 's/\(\[[^] ]*\) /\1_/g' -e 't redo' data.file
Given the data file:
a line with [one blank] word inside square brackets.
a line with [two blank] or [three blank] words inside square brackets.
a line with [no-blank] word inside square brackets.
a line with [multiple words in a single bracket] inside square brackets.
a line with [multiple words in a single bracket] [several times on one line]
the output from the sed script shown is:
a line with [one_blank] word inside square brackets.
a line with [two_blank] or [three_blank] words inside square brackets.
a line with [no-blank] word inside square brackets.
a line with [multiple_words_in_a_single_bracket] inside square brackets.
a line with [multiple_words_in_a_single_bracket] [several_times_on_one_line]
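A one-line check of the loop, using the three '-e' form and an invented sample line, might look like this:

```shell
# Invented sample: two bracketed names plus blanks outside any brackets.
line='a [multiple words here] and [two blank] x y'

# Each pass replaces the first remaining blank inside every bracketed name;
# "t redo" repeats the pass for as long as a substitution was made.
out=$(printf '%s\n' "$line" |
  sed -e ':redo' -e 's/\(\[[^] ]*\) /\1_/g' -e 't redo')

printf '%s\n' "$out"
```

The blanks outside the brackets are left alone because the match has to begin at an open square bracket and cannot cross a close square bracket.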
And, finally, reading the fine print in the question, if you need this done only in the first square-bracketed field on each line, then we need to ensure that the match cannot start at any bracket other than the first one. This variant works:
sed -e ':redo' -e 's/^\([^]]*\[[^] ]*\) /\1_/' -e 't redo' data.file
(The 'g' qualifier is gone - it probably isn't needed in the other variants either given the loop; its presence might make the process marginally more efficient, but it would most likely be essentially impossible to detect that. The pattern is now anchored to the start of the line (the caret) and allows zero or more characters that are not close square brackets before the open square bracket, so the match can never reach past the end of the first bracketed field.)
Sample output:
a line with [one_blank] word inside square brackets.
a line with [two_blank] or [three blank] words inside square brackets.
a line with [no-blank] word inside square brackets.
a line with [multiple_words_in_a_single_bracket] inside square brackets.
a line with [multiple_words_in_a_single_bracket] [several times on one line]
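The first-field-only variant can be checked the same way. The sample line here is made up; it has two bracketed names so that the difference from the global version is visible.

```shell
# Invented sample: only the first bracketed name should be changed.
line='x [a b c] [d e]'

# Anchored at the start of the line: [^]]* cannot cross a close square
# bracket, so only blanks inside the first bracketed name can match.
out=$(printf '%s\n' "$line" |
  sed -e ':redo' -e 's/^\([^]]*\[[^] ]*\) /\1_/' -e 't redo')

printf '%s\n' "$out"
```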
This is easier in a language like perl which has "executable" substitutions:
perl -wne 's/(\[.*?])/ do { my $x = $1; $x =~ y, ,_,; $x } /ge; print'
Or to split it up more clearly:
sub replace_with_underscores {
my $s = shift;
$s =~ y/ /_/;
$s
}
s/(\[.*?])/ replace_with_underscores($1) /ge;
The .*? is the non-greedy match (to avoid slurring together two adjacent bracketed phrases) and the e flag to the substitution causes it to be evaluated, so you can call a function to do the inner work.