How is the expression '^$' in sed arrived at? - regex

I know that $ means the last character and ^ means the first character.
I have seen an example that uses sed to delete all the blank lines with this command:
sed '/^$/d' <file_name>
I was trying to understand the expression ^$. How did someone arrive at it? Does it mean: delete all the lines whose first and last characters are the same? What does the combination ^$ mean?

^ is not the first character; it is the position before the first character. $ is not the last character either; it is the end of the line. ^$ means there is nothing between those two, so it matches only an empty line.

The question is really about regular expressions in general, not about sed specifically.
The sed utility is a stream (line) editor that uses regular expressions.
The ^ (circumflex or caret) means look only at the beginning of the target string.
The $ (dollar) means look only at the end of the target string.
So, /^$/ means a line with nothing in between the beginning and the end of the line.
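To see the difference between an empty line and a line that merely looks blank, here is a minimal sketch. The sample input is made up for illustration: it contains one truly empty line and one line holding a single space.

```shell
# Sample input (made up for illustration): two non-empty lines, one empty
# line, and one line that holds a single space.
input=$(printf 'first\n\n \nlast\n')

# /^$/ matches only where start-of-line is immediately followed by
# end-of-line, so the empty line is deleted and the space-only line is kept.
result=$(printf '%s\n' "$input" | sed '/^$/d')

printf '%s\n' "$result"
```

Three lines should remain, because the line with the space has a character between ^ and $.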

Related

What does the following SED pattern exactly do?

I am working on a CGI script and the developer who worked on this before me has used a SED Pattern.
COMMAND=`echo "$QUERY_STRING" | sed -n 's/^.*com_tex=\([^&]*\).*$/\1/p' | sed "s/%20/ /g"`
Here com_tex is the name of the text box in HTML.
What this line does is take a value from the HTML text box and assign it to a shell variable. The sed pattern is apparently (I am not sure) needed to extract the value from HTML without the other unnecessary accompanying stuff.
I will also mention the issue that prompted this question. The same pattern is used for a text area where I enter a command, and I need it retrieved exactly as entered. However, it is getting jumbled up. E.g., if I enter the following command in the text box:
/usr/bin/free -m >> /home/admin/memlog.txt
The value that gets stored in the variable is:
%2Fusr%2Fbin%2Ffree+-m+%3E%3E+%2Fhome%2Fadmin%2Fmemlog.txt
It is easy to see that / is being replaced by %2F, a space by +, and the > sign by %3E.
But I just cannot figure out where this is specified in the above pattern! Will someone please tell me how that pattern works, or what pattern I should substitute there so that I get my entered command instead of the output I am getting?
sed -n
The -n switch means "don't print automatically"
's/
s is for substitution, / is a delimiter, so the command looks like
s/thing to match/substitution/optional flags
^.*com_tex=
^ means the start of the line
.* means match 0 or more of any character
So it will match the longest string from the start of the line up to com_tex=
\(\)
This is a capture group, whatever is matched inside these brackets is saved and can be used later
[^&]*
[^] When the hat is the first character inside square brackets, it means match any character that is not listed inside the brackets
* The same as before, means 0 or more matches
The capture group combined with this means: capture the run of characters up to, but not including, the next &.
.*$
The same as the first bit except $ means the end of the line, so this matches everything until the end
/\1/p'
After the second / is the substitution. \1 is the capture group from before, so this will substitute everything we matched in the first part (the whole line) with the capture group.
p means print; it must be stated explicitly because the -n switch suppresses the automatic printing of every line.
|
Pipe: passes the output of the first sed to the second sed
s/%20/ /g
Sub %20 for a space, g means global so do it for every match on the line
HTH :)
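The walk-through above can be tried on a small input. This is only a sketch: the query string below is hypothetical (the field names user and submit and the value are invented), and only com_tex comes from the question.

```shell
# Hypothetical query string, as a browser might send it for a form with
# a text box named com_tex and two other made-up fields.
QUERY_STRING='user=admin&com_tex=ls%20-l&submit=Run'

# -n suppresses automatic printing; the p flag prints only lines where the
# substitution succeeded. \([^&]*\) captures everything after com_tex= up to
# the next & (or the end of the line), and \1 replaces the whole line with it.
raw=$(printf '%s\n' "$QUERY_STRING" | sed -n 's/^.*com_tex=\([^&]*\).*$/\1/p')

# Second stage from the original command: turn every %20 into a space.
COMMAND=$(printf '%s\n' "$raw" | sed 's/%20/ /g')

printf '%s\n' "$raw"      # ls%20-l
printf '%s\n' "$COMMAND"  # ls -l
```

Note that nothing in either sed command produces the %XX codes; they are already present in the input.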
This is not performed by any of the patterns. It is URL encoding (percent-encoding), which the browser applies to the form data before sending it, so the script receives the value already encoded.
I will try to explain the patterns a little at a time
sed -n
-n specifies that sed should not automatically print each input line after applying the commands.
The command following is of the form 's/regexp/replacement/flags'
^.*com_tex=\([^&]*\).*$
^ matches the beginning of the line
.* matches zero to many of any character
com_tex= matches the characters literally
\([^&]*\) '\(' specifies the beginning of a group that can later be backreferenced via its index. '[^&]*' matches zero to many characters which are not '&'. '\)' specifies the end of the group.
.* See above
$ matches the end of the line
\1
The above replacement is a backreference to the first (and only) group in the regexp i.e. '[^&]*'. So the replacement replaces the entire line with all characters immediately following 'com_tex=' till the first '&'.
The p flag specifies that if a substitution took place, the current line post substitution should be printed.
sed "s/%20/ /g"
The above is much simpler: it replaces all (not just the first) occurrences of '%20' with a space ' '.
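To get the entered command back, the same approach as the %20 substitution can be extended. The sketch below handles only the three encodings that appear in the value reported in the question (+, %2F and %3E); it is not a general URL decoder, and any other %XX code would pass through unchanged.

```shell
# The encoded value exactly as reported in the question.
encoded='%2Fusr%2Fbin%2Ffree+-m+%3E%3E+%2Fhome%2Fadmin%2Fmemlog.txt'

# Undo only the three encodings seen in that value:
#   +   -> space
#   %2F -> /   (| is used as the s delimiter so / needs no escaping)
#   %3E -> >
decoded=$(printf '%s\n' "$encoded" | sed 's|+| |g; s|%2F|/|g; s|%3E|>|g')

printf '%s\n' "$decoded"
```

The + is converted first on purpose: a literal plus sign typed by the user arrives as %2B, so translating + before any percent codes keeps the two apart.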

What does the regex '/^$/d' mean?

I was trying to remove blank lines from a file using a bash script. When I searched the Internet, I came across two variations. In one, we modify the source file directly; in the other, we store the output in another file. Here are the code snippets:
sed -i '/^$/d' fileName.txt
sed '/^$/d' fileName.txt > newFileName.txt
What I could not understand is how '/^$/d' can be interpreted as blank lines. I am afraid I am not good at regex. Can someone explain this one to me?
Also, is there some other way to do it?
/^$/d
/ - start of regex
^ - start of line
$ - end of line
/ - end of regex
d - delete lines which match
So basically, find any line which is empty (the start and end points are the same, i.e. no chars) and delete it.
Let's start with the regex explanation:
/^$/d
^ matches the beginning of the line and $ matches the end of the line, so ^$ will match empty lines.
You're also using the d command with sed. This deletes the matched lines.
The -i switch in sed -i '/^$/d' fileName.txt makes sed edit the file in place. If you omit it, sed writes the result to standard output.
/^$/d is a sed command that removes empty lines. It's actually two things stuck together: a regular expression /^$/ and a sed instruction d.
The /^$/ component is a regular expression that matches the empty string. More specifically, it looks for the beginning of a line (^) followed directly by the end of a line ($), which is to say an empty line. If there's anything in the line -- whitespace or otherwise -- that pattern won't match since the end of the line won't directly follow the beginning of the line.
The d component is a sed instruction that means "delete". In this usage, the d applies to any line that matches the given regular expression (/^$/), so it will delete any empty line.
Because sed is running in autoprint mode (without the -n switch), it will print all lines that aren't deleted (in this case, those that don't match /^$/), so that command ends up being a filter that removes all empty lines from the input.
/^$/: select lines that are empty (^ matches the start of line, $ matches the end of line, and so this matches lines that start and immediately end with no intervening content).
d: delete matched line.
^ - start of a line
$ - end of line
so
/^$/
It matches lines where the line beginning is followed immediately by the end of line, which means empty lines.
The sed command with d means delete matched lines, that is, remove empty lines.
So basically:
sed '/regex/d(elete)' --this is not a real command line, just for explanation.
^$ represents an empty line because ^ is a zero-width anchor meaning the start of the line and $ is a zero-width anchor meaning the end of the line. Thus ^$ must be zero width (i.e. have no characters at all) to match. There also cannot be any characters before ^ on the line or after $ on the line.
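On the "is there some other way" part of the question, here is a short sketch of two alternatives. The sample input is made up; it includes one empty line and one line containing only a tab and a space, to show where the variants differ.

```shell
# Made-up sample: an empty line and a line containing only a tab and a space.
input=$(printf 'alpha\n\n\t \nbeta\n')

# Same effect as sed '/^$/d': grep -v prints the lines that do NOT match.
with_grep=$(printf '%s\n' "$input" | grep -v '^$')

# Broader variant: also delete lines made up solely of whitespace.
no_blank=$(printf '%s\n' "$input" | sed '/^[[:space:]]*$/d')

printf '%s\n' "$with_grep"
printf '%s\n' "$no_blank"
```

The grep form keeps the whitespace-only line, just as /^$/d does; the [[:space:]]* form removes it as well, which may or may not be what you want.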

Regular expression to match beginning and end of a line?

Could anyone tell me a regex that matches the beginning or end of a line? E.g. if I used sed 's/[regex]/"/g' file here, the output would be each line in quotes? I tried [\^$] and [\^\n] but neither of them seemed to work. I'm probably missing something obvious; I'm new to these.
Try:
sed -e 's/^/"/' -e 's/$/"/' file
To add quotes to the start and end of every line is simply:
sed 's/.*/"&"/g'
The RE you were trying to come up with to match the start or end of each line, though, is:
sed -r 's/^|$/"/g'
It's an ERE (enabled by "-r"), so it will work with GNU sed but not with older seds.
matthias's response is perfectly adequate, but you could also use a backreference to do this. If you're learning regular expressions, they are a handy thing to know.
Here's how that would be done using a backreference:
sed 's/\(^.*$\)/"\1"/g' file
At the heart of that regex is ^.*$, which means match anything (.*) surrounded by the start of the line (^) and the end of the line ($), which effectively means that it will match the whole line every time.
Putting that term inside parentheses creates a group that we can refer back to later on (in the replace pattern). But for sed to realize that you mean to create a group instead of matching literal parentheses, you have to escape them with backslashes. Thus, we end up with \(^.*$\) as our search pattern.
The replace pattern is simply a double quote followed by \1, which is our backreference (it refers back to the first pattern match enclosed in parentheses, hence the 1). Then add your last double quote to end up with "\1".
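The approaches in this thread can be compared side by side. A minimal sketch on made-up input follows; the backreference variant is written here with the anchors outside the group, an assumption made for portability rather than the exact form given above.

```shell
# Made-up two-line sample.
input=$(printf 'one\ntwo words\n')

# Two substitutions: a quote at the start, then a quote at the end.
a=$(printf '%s\n' "$input" | sed -e 's/^/"/' -e 's/$/"/')

# One substitution: & stands for the whole matched text.
b=$(printf '%s\n' "$input" | sed 's/.*/"&"/')

# Backreference form: \1 is whatever \(.*\) captured.
c=$(printf '%s\n' "$input" | sed 's/^\(.*\)$/"\1"/')

printf '%s\n' "$a"
```

All three variables should hold the same text, with each line wrapped in double quotes.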

Substitution till the end of the line in bash

I have a huge text file with lots of lines like:
asdasdasdaasdasd_DATA_3424223423423423
gsgsdgsgs_DATA_6846343636
.....
I would like to do, for each line, to substitute from DATA_ .. to the end, with just empty space so I would get:
asdasdasdaasdasd_DATA_
gsgsdgsgs_DATA_
.....
I know that you can do something similar with:
sed -e "s/^DATA_*$/DATA_/g" filename.txt
but it does not work.
Do you know how?
Thanks
You have two problems: you're unnecessarily matching the beginning and end of the line with ^ and $, and you're looking for _* (zero or more underscores) instead of .* (zero or more of any character). Here's what you want:
sed -e 's/_DATA_.*/_DATA_/'
The g on the end (global) won't do anything, because you're already going to remove everything from the first instance of "DATA" onward - there can't be another match.
P.S. The -e isn't strictly necessary if you only have one expression, but if you think you might tack more on, it's a convenient habit.
With regular expressions, * means the previous character, any number of times. To match any character, use .
So what you really want is .* which means any character, any number of times, like this:
sed 's/DATA_.*/DATA_/' filename.txt
Also, I removed the ^ which means start of line, since you want to match "DATA_" even if it's not in the beginning of a line.
Using awk: set the field delimiter to "_DATA_", then print field 1 ($1) followed by "_DATA_". No need to write a regular expression.
$ awk -F"_DATA_" '{print $1"_DATA_"}' file
asdasdasdaasdasd_DATA_
gsgsdgsgs_DATA_
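For a quick check, the sed fix can be run against the two sample lines from the question, next to the original attempt. This is just a sketch using shell variables in place of filename.txt.

```shell
# The two sample lines from the question.
input=$(printf 'asdasdasdaasdasd_DATA_3424223423423423\ngsgsdgsgs_DATA_6846343636\n')

# .* consumes everything after _DATA_, and the replacement puts _DATA_ back.
fixed=$(printf '%s\n' "$input" | sed 's/_DATA_.*/_DATA_/')

# The attempt from the question, for comparison: ^DATA_*$ only matches a line
# consisting of DATA followed by zero or more underscores, so nothing changes.
unchanged=$(printf '%s\n' "$input" | sed -e 's/^DATA_*$/DATA_/g')

printf '%s\n' "$fixed"
```

The second command leaves the input untouched, which is why the original attempt appeared to do nothing.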

How can I match at the beginning of any line, including the first, with a Perl regex?

According to the Perl documentation on regexes:
By default, the "^" character is guaranteed to match only the beginning of the string ... Embedded newlines will not be matched by "^" ... You may, however, wish to treat a string as a multi-line buffer, such that the "^" will match after any newline within the string ... you can do this by using the /m modifier on the pattern match operator.
The "after any newline" part means that it will only match at the beginning of the 2nd and subsequent lines. What if I want to match at the beginning of any line (1st, 2nd, etc.)?
EDIT: OK, it seems that the file has BOM information (3 chars) at the beginning and that's what's messing me up. Any way to get ^ to match anyway?
EDIT: So in the end it works (as long as there's no BOM), but now it seems that the Perl documentation is wrong, since it says "after any newline"
The ^ does match the 1st line with the /m flag:
~:1932$ perl -e '$a="12\n23\n34";$a=~s/^/:/gm;print $a'
:12
:23
:34
To match with BOM you need to include it in the match.
~:1939$ perl -e '$a="\xEF\xBB\xBF12\n23\n34";$a=~s/^(\d)/<\1>:/mg;print $a'
12
<2>:3
<3>:4
~:1940$ perl -e '$a="\xEF\xBB\xBF12\n23\n34";$a=~s/^(?:\xEF\xBB\xBF)?(\d)/<\1>:/mg;print $a'
<1>:2
<2>:3
<3>:4
You can use the /^(?:\xEF\xBB\xBF)?/mg regex to match at the beginning of the line anyway, if you want to preserve the BOM.
Conceptually, there's assumed to be a newline before the beginning of the string. Consequently, /^a/ will find a letter 'a' at the beginning of a string.
Put an empty line at the beginning of the file. This calms things down and avoids making the regex hard to read.
Yes, the BOM. It may appear at the beginning of the file, so put an empty line at the beginning of the file. The BOM is not matched by \s and cannot be seen with the naked eye. I have lost hours to a BOM making my regex fail.