What does the regex '/^$/d' mean?

I was trying to remove blank lines from a file using a bash script. While searching the Internet, I came across two variations. One modifies the source file directly, and the other stores the output in another file. Here are the code snippets:
sed -i '/^$/d' fileName.txt
sed '/^$/d' fileName.txt > newFileName.txt
What I could not understand is how the regex '/^$/d' can be interpreted as blank lines. I am afraid I am not good at regex. Can someone explain this one to me?
Also, is there some other way to do it?

/^$/d
/ - start of regex
^ - start of line
$ - end of line
/ - end of regex
d - delete lines which match
So basically, find any line which is empty (start and end points are the same, i.e. no chars), and delete it.
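A quick way to see this in action, as a minimal sketch (the sample input is made up for illustration):

```shell
# Feed five lines, two of them empty, through the filter.
out=$(printf 'a\n\nb\n\nc\n' | sed '/^$/d')
printf '%s\n' "$out"
# prints:
# a
# b
# c
```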

Let's start with the regex explanation:
/^$/d
^ matches the beginning of the line and $ matches the end of the line, so ^$ will match empty lines.
The d that follows is a sed command (not a regex flag). It deletes the matched lines.
The -i switch in sed -i '/^$/d' fileName.txt makes sed edit the file in place. If you omit it, sed writes the result to standard output and leaves the file unchanged.
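To see the difference between the two forms, here is a small sketch using a throwaway file created with mktemp. The -i syntax shown is the GNU sed one; BSD/macOS sed expects a backup suffix argument after -i (for example -i ''), so treat that line as an assumption about your platform.

```shell
tmp=$(mktemp)
printf 'a\n\nb\n' > "$tmp"

# Without -i: the result goes to stdout and the file keeps its 3 lines.
sed '/^$/d' "$tmp" > /dev/null
before=$(grep -c '' "$tmp")

# With -i (GNU sed): the file itself is rewritten and now has 2 lines.
sed -i '/^$/d' "$tmp"
after=$(grep -c '' "$tmp")

echo "lines before: $before, after: $after"
rm -f "$tmp"
```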

/^$/d is a sed command that removes empty lines. It's actually two things stuck together: a regular expression /^$/ and a sed instruction d.
The /^$/ component is a regular expression that matches the empty string. More specifically, it looks for the beginning of a line (^) followed directly by the end of a line ($), which is to say an empty line. If there's anything in the line -- whitespace or otherwise -- that pattern won't match since the end of the line won't directly follow the beginning of the line.
The d component is a sed instruction that means "delete". In this usage, the d applies to any line that matches the given regular expression (/^$/), so it will delete any empty line.
Because sed is running in autoprint mode (without the -n switch), it will print all lines that aren't deleted -- so, in this case, lines that don't match /^$/ -- so that command ends up being a filter that removes all empty lines from the input.
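The point about whitespace is easy to check. In this sketch (sample input invented for illustration), the lines holding only a space or a tab survive /^$/d; a variant using the POSIX class [[:space:]] removes them as well:

```shell
# Five lines: "a", a single space, a single tab, a truly empty line, "b".
sample() { printf 'a\n \n\t\n\nb\n'; }

# /^$/ deletes only the truly empty line, so 4 lines remain.
kept=$(sample | sed '/^$/d' | grep -c '')

# Allowing any amount of whitespace between ^ and $ leaves 2 lines.
stripped=$(sample | sed '/^[[:space:]]*$/d' | grep -c '')

echo "after /^\$/d: $kept lines; after /^[[:space:]]*\$/d: $stripped lines"
```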

/^$/: select lines that are empty (^ matches the start of line, $ matches the end of line, and so this matches lines that start and immediately end with no intervening content).
d: delete matched line.

^ - start of a line
$ - end of line
so
/^$/
It matches lines where the line beginning is followed immediately by the end of line, which means empty lines.
The sed command d means delete matched lines, that is, remove empty lines.
So basically:
sed '/regex/d(elete)' -- this is not a real command line, just for explanation.
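The same address-plus-d shape works with any regex. As an illustrative sketch (the input and the ^# pattern are made up here, not taken from the question), this drops lines that start with #:

```shell
# Delete every line whose first character is '#'.
out=$(printf 'keep\n# comment\nkeep too\n' | sed '/^#/d')
printf '%s\n' "$out"
# prints:
# keep
# keep too
```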

^$ represents an empty line because ^ is a zero-width anchor meaning the start of the line and $ is a zero-width anchor meaning the end of the line. Thus ^$ must be zero width (i.e. have no characters at all) to match. There also cannot be any characters before ^ on the line or after $ on the line.
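Since ^$ is an ordinary regex, other tools can use it too, which also covers the "some other way" part of the question. A minimal sketch with grep, on invented input:

```shell
# Count empty lines: ^$ matches only where start and end of line coincide.
empty=$(printf 'a\n\nb\n\n' | grep -c '^$')

# Print everything except empty lines (-v inverts the match).
rest=$(printf 'a\n\nb\n\n' | grep -v '^$')

echo "empty lines: $empty"
printf '%s\n' "$rest"
```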

Related

How to use Perl to replace multiple lines containing character '/' and new line?

I'm trying to modify block of several lines in several files. Initially, I tried sed but read that Perl might be a better choice. However, my Perl is very basic and I'm not sure how to deal with an empty (new) line and the special character '/'. To sum things up, I'd like to have a one-liner, something like ($perl -i -pe ...), to convert
(new line)
#include <item_b/item_bC.h>
into
#include <item_a/item_aC.h>
#include <item_b/item_bC.h>
Thanks.
One way -- slurp the file into a string, then match a line with only possibly spaces followed by a line starting with #include..., and replace what's matched with that #include line twice
perl -0777 -wpe's{ ^\s*\n ( \#include.*\n ) }{$1$1}mxg' file.c
With -0777 it slurps the whole file into $_, and with -p it prints $_ after processing each line (only once under -0777, since the whole file is in $_ and so there is only one "line"); see switches in perlrun. The /m modifier makes ^ (and $) also match at line boundaries inside a (multiline) string.
Or, with the same general approach (slurp the file) but use a lookahead
perl -0777 -wpe's{ ^\s*\n (?= (\#include.*\n) ) }{$1}mxg' file.c
Matches an empty line after which a lookahead finds a line starting with #include, which is also captured so to replace the empty line with it. Since lookarounds don't consume anything there is no need to replace that line (with itself).
Note, the .* is greedy and matches as much as possible up to the pattern that follows it, and here we have the whole file ahead of it so it may appear that .*\n will match all the way to the very last \n in the file! However, . doesn't match a line-feed (with /s modifier it does) so .*\n here stops at the first newline, so it matches the rest of the line.
If a more specific include statement needs to be matched, add details following the #include pattern.†
Otherwise, one can process line by line, by copying the current line and printing it when on the next line, depending on what's on the saved and next line. There are some picky details to straighten there, not super amenable to one-liners.
Both tested with input file.c (Note: it does start with an empty line)
(empty line)
#include<item_b/item_bC.h>
#include<item_a/item_aC.h>
(empty line)
#include<item_c/item_cC.h>
int main() {
return 1;
}
where we end up with two item_b, one item_a, and two item_c includes and no empty lines, and the rest of the file is unaffected.
† Special characters are mentioned so I'll comment. But please consult more complete resources, like tutorial perlretut and reference perlre. See also perlrebackslash
Characters special for regex can mostly be matched as literal characters in a pattern when escaped with \. But in this case that's not needed: the role of / in a regex is only to delimit the pattern, commonly given as /.../, but here I use {}{} as delimiters; so / isn't special here and can be used freely. For example
perl -0777 -wpe's{ ^\s*\n (?= (\#include<item_./.*\n) ) }{$1}mxg' file.c
matches lines from the input file I used, shown above.
There is clearly a more general pattern instead of item in the actual problem, and it's a filename. Most characters that are allowed in a filename can be used literally in a regex. Exceptions such as . can be escaped: \. matches a literal dot.
For example, a string item_bC.h, where bC characters vary but item and .h are always the same, can be matched with the pattern /item_..\.h/.

What is the sed expression for replacing all occurrences of ":{any character}{end of line}" in a file?

I have a file that looks like the following (simplified) and I'm trying to replace from one string to the end of the line. In this case, I'm trying to replace everything between the very last colon of the line and the end of the line.
BEFORE
GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0|1:746:196,213:0,0:903:0/1:.:113535
GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0|1:746:196,213:0,0:903:0/1:.:PATMAT
AFTER
GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0|1:746:196,213:0,0:903:0/1:.:1
GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0|1:746:196,213:0,0:903:0/1:.:1
Based on a similar question posted, if there were only one colon in the lines, I know the sed expression would be the following, but I'm not sure how to specify the very last colon of the line:
sed 's/:.*/1/' file.txt
You can do it like this:
sed 's/:[^:]*$/:1/'
[^:] matches any non-: character and $ anchors the match to the end of the line.
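For comparison, a sketch running both expressions on one of the sample lines from the question. The first-colon version eats almost the whole line, while the anchored version touches only the last field:

```shell
line='GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0|1:746:196,213:0,0:903:0/1:.:PATMAT'

# Matches from the FIRST colon to the end of the line.
first=$(printf '%s\n' "$line" | sed 's/:.*/1/')

# Matches the LAST colon plus the non-colon characters up to end of line.
last=$(printf '%s\n' "$line" | sed 's/:[^:]*$/:1/')

printf '%s\n%s\n' "$first" "$last"
# prints:
# GT1
# GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0|1:746:196,213:0,0:903:0/1:.:1
```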

What does the following SED pattern exactly do?

I am working on a CGI script and the developer who worked on this before me has used a SED Pattern.
COMMAND=`echo "$QUERY_STRING" | sed -n 's/^.*com_tex=\([^&]*\).*$/\1/p' | sed "s/%20/ /g"`
Here com_tex is the name of the text box in HTML.
What this line does is take a value from the HTML text box and assign it to a shell variable. The sed pattern is apparently (not sure) necessary to extract the value from the HTML without the other unnecessary accompanying stuff.
I will also mention the issue that prompted this question. The same pattern is used for a text area where I enter a command, and I need it retrieved exactly as entered. However, it's getting jumbled up. E.g., if I enter the following command in the text box:
/usr/bin/free -m >> /home/admin/memlog.txt
The value that gets stored in the variable is:
%2Fusr%2Fbin%2Ffree+-m+%3E%3E+%2Fhome%2Fadmin%2Fmemlog.txt
All of us can get that / is being substituted by %2F, a space by + and the > sign by %3E.
But I just can not figure how this is specified in the above pattern! Will someone please tell me how that pattern works or what pattern should I substitute there so that I would get my entered command instead of the output I am getting?
sed -n
-n switch means "don't print automatically": sed suppresses its default output of every line
's/
s is for substitutions, / is a delimiter so the command looks like
s/Thing to sub/substitution/optional flags
^.*com_tex=
^ means the start of the line
.* means match 0 or more of any character
So it will match the longest string from the start of the line up to com_tex=
\(\)
This is a capture group, whatever is matched inside these brackets is saved and can be used later
[^&]*
[^] When the hat is the first character inside square brackets, it means match any character except the ones listed in the brackets
* The same as before, means 0 or more matches
The capture group combined with this means capture every character up to (but not including) the next &.
.*$
The same as the first bit except $ means the end of the line, so this matches everything until the end
/\1/p'
After the second / is the substitution. \1 is the capture group from before, so this will substitute everything we matched in the first part(the whole line) with the capture group.
p means print, this must be explicitly stated as the -n switch was used and will prevent other lines from being printed.
|
PIPE
s/%20/ /g
Sub %20 for a space, g means global so do it for every match on the line
HTH :)
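Putting the pieces together, a sketch with a made-up QUERY_STRING (the user and submit fields are invented; only com_tex comes from the question):

```shell
# Hypothetical query string as a CGI script might receive it.
QUERY_STRING='user=bob&com_tex=ls%20-l&submit=Go'

# Same pipeline as in the question: pull out the com_tex value,
# then turn every %20 into a space.
COMMAND=$(printf '%s\n' "$QUERY_STRING" | sed -n 's/^.*com_tex=\([^&]*\).*$/\1/p' | sed "s/%20/ /g")

echo "$COMMAND"
# prints: ls -l
```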
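None of the patterns performs this substitution. It is URL encoding (percent-encoding), which the browser applies to form data when the form is submitted: / becomes %2F, > becomes %3E, and a space becomes +. The sed commands only undo the %20 form of a space, so everything else arrives still encoded.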
I will try to explain the patterns a little at a time
sed -n
-n specifies that sed should not print each input line automatically; only lines printed explicitly (here by the p flag) appear in the output.
The command following is of the form 's/regexp/replacement/flags'
^.*com_tex=\([^&]*\).*$
^ matches the beginning of the line
.* matches zero to many of any character
com_tex= matches the characters literally
\([^&]*\) '\(' specifies the beginning of a group that can later be backreferenced via its index. '[^&]*' matches zero to many characters which are not '&'. '\)' specifies the end of the group.
.* See above
$ matches the end of the line
\1
The above replacement is a backreference to the first (and only) group in the regexp i.e. '[^&]*'. So the replacement replaces the entire line with all characters immediately following 'com_tex=' till the first '&'.
The p flag specifies that if a substitution took place, the current line post substitution should be printed.
sed "s/%20/ /g"
The above is much simpler: it replaces all (not just the first) occurrences of '%20' with a space ' '.

How is the expression '^$' in sed arrived at?

I know that $ means the last character, and the ^ means the first character.
I have seen the example in SED to delete all the blank lines using this macro.
sed '/^$/d' <file_name> to delete all the blank lines. I was trying to understand this expression ^$, how some one has arrived to this expression? Does it mean that delete all the lines whose first and last characters are same? What is meant by that combination ^$?
^ is not a first character, it is "before first character". $ is not the last character as well, it is "end of line". ^$ means there's nothing in between those two so it's just a blank line.
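One way to see that ^ and $ are positions rather than characters is to substitute at them, which inserts text without consuming anything. A small sketch on invented input:

```shell
# Insert '>' at the start position and '<' at the end position.
marked=$(printf 'abc\n' | sed 's/^/>/; s/$/</')

# On an empty line the two positions coincide, so the markers touch.
marked_empty=$(printf '\n' | sed 's/^/>/; s/$/</')

printf '%s\n%s\n' "$marked" "$marked_empty"
# prints:
# >abc<
# ><
```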
The question is really about regular expressions in general and is not specific to sed.
The sed utility (a stream editor) uses regular expressions to select and edit lines.
The ^ (circumflex or caret) anchors the match to the beginning of the line.
The $ (dollar) anchors the match to the end of the line.
So, /^$/ means a line with nothing in between the beginning and the end of the line.

Substitution till the end of the line in bash

I have a huge text file with lots of lines like:
asdasdasdaasdasd_DATA_3424223423423423
gsgsdgsgs_DATA_6846343636
.....
For each line, I would like to remove everything after DATA_ up to the end of the line, so I would get:
asdasdasdaasdasd_DATA_
gsgsdgsgs_DATA_
.....
I know that you can do something similar with:
sed -e "s/^DATA_*$/DATA_/g" filename.txt
but it does not work.
Do you know how?
Thanks
You have two problems: you're unnecessarily matching beginning and end of line with ^ and $, and you're looking for _* (zero or more underscores) instead of .* (zero or more of any character). Here's what you want:
sed -e 's/_DATA_.*/_DATA_/'
The g on the end (global) won't do anything, because you're already going to remove everything from the first instance of "DATA" onward - there can't be another match.
P.S. The -e isn't strictly necessary if you only have one expression, but if you think you might tack more on, it's a convenient habit.
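A sketch comparing the original attempt with the corrected expression on the two sample lines from the question:

```shell
input=$(printf 'asdasdasdaasdasd_DATA_3424223423423423\ngsgsdgsgs_DATA_6846343636\n')

# Original attempt: only matches a line that is exactly "DATA" followed
# by underscores, so nothing changes here.
broken=$(printf '%s\n' "$input" | sed -e 's/^DATA_*$/DATA_/g')

# Corrected: replace _DATA_ and everything after it with just _DATA_.
fixed=$(printf '%s\n' "$input" | sed -e 's/_DATA_.*/_DATA_/')

printf '%s\n' "$fixed"
# prints:
# asdasdasdaasdasd_DATA_
# gsgsdgsgs_DATA_
```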
With regular expressions, * means the previous character, any number of times (including zero). To match any single character, use . (a dot).
So what you really want is .* which means any character, any number of times, like this:
sed 's/DATA_.*/DATA_/' filename.txt
Also, I removed the ^ which means start of line, since you want to match "DATA_" even if it's not in the beginning of a line.
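To make the difference between _* and .* visible, this sketch wraps whatever each pattern matched in brackets (in a sed replacement, & stands for the matched text; the input string is invented):

```shell
# _* repeats only the underscore, so the match stops before 'x'.
star=$(printf 'DATA___x\n' | sed 's/DATA_*/[&]/')

# .* repeats "any character", so the match runs to the end of the line.
dotstar=$(printf 'DATA___x\n' | sed 's/DATA_.*/[&]/')

printf '%s\n%s\n' "$star" "$dotstar"
# prints:
# [DATA___]x
# [DATA___x]
```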
Using awk: set the field delimiter to "_DATA_", then print field 1 ($1) followed by the delimiter. No regular expression needed.
$ awk -F"_DATA_" '{print $1"_DATA_"}' file
asdasdasdaasdasd_DATA_
gsgsdgsgs_DATA_