Substitution till the end of the line in bash

I have a huge text file with lots of lines like:
asdasdasdaasdasd_DATA_3424223423423423
gsgsdgsgs_DATA_6846343636
.....
For each line, I would like to replace everything after DATA_ up to the end of the line with nothing, so that I get:
asdasdasdaasdasd_DATA_
gsgsdgsgs_DATA_
.....
I know that you can do something similar with:
sed -e "s/^DATA_*$/DATA_/g" filename.txt
but it does not work.
Do you know how?
Thanks

You have two problems: you're unnecessarily matching the beginning and end of the line with ^ and $, and you're looking for _* (zero or more underscores) instead of .* (zero or more of any character). Here's what you want:
sed -e 's/_DATA_.*/_DATA_/'
The g on the end (global) won't do anything, because you're already going to remove everything from the first instance of "DATA" onward - there can't be another match.
P.S. The -e isn't strictly necessary if you only have one expression, but if you think you might tack more on, it's a convenient habit.

With regular expressions, * means the previous character, any number of times. To match any character, use .
So what you really want is .* which means any character, any number of times, like this:
sed 's/DATA_.*/DATA_/' filename.txt
Also, I removed the ^ which means start of line, since you want to match "DATA_" even if it's not in the beginning of a line.
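To see the difference between the original pattern and the corrected one, here is a minimal sketch run against the two sample lines from the question. The variable names are only illustrative, and the input is piped in with printf rather than read from filename.txt.

```shell
# Two sample lines from the question (illustrative input, not a real file).
input='asdasdasdaasdasd_DATA_3424223423423423
gsgsdgsgs_DATA_6846343636'

# Original attempt: ^DATA_*$ only matches a line that is exactly "DATA"
# followed by zero or more underscores, so nothing is changed here.
unchanged=$(printf '%s\n' "$input" | sed 's/^DATA_*$/DATA_/')

# Corrected pattern: .* consumes everything after DATA_ up to the end of the line.
trimmed=$(printf '%s\n' "$input" | sed 's/DATA_.*/DATA_/')

printf '%s\n' "$trimmed"
```

If the result looks right, the same expression can be pointed at the real file.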

Using awk: set the field delimiter to "_DATA_", then print field 1 ($1) followed by the delimiter. There is no need to write a substitution regular expression.
$ awk -F"_DATA_" '{print $1"_DATA_"}' file
asdasdasdaasdasd_DATA_
gsgsdgsgs_DATA_

Related

What does the following SED pattern exactly do?

I am working on a CGI script and the developer who worked on this before me has used a SED Pattern.
COMMAND=`echo "$QUERY_STRING" | sed -n 's/^.*com_tex=\([^&]*\).*$/\1/p' | sed "s/%20/ /g"`
Here com_tex is the name of the text box in HTML.
What this line does is take a value from the HTML text box and assign it to a shell variable. The sed pattern is apparently (I am not sure) necessary to extract the value without the other accompanying data.
Here is the issue that prompted this question. The same pattern is used for a text area where I enter a command, and I need it retrieved exactly as entered. However, it comes out garbled. For example, if I enter the following command in the text box:
/usr/bin/free -m >> /home/admin/memlog.txt
The value that gets stored in the variable is:
%2Fusr%2Fbin%2Ffree+-m+%3E%3E+%2Fhome%2Fadmin%2Fmemlog.txt
It is clear that / is being replaced by %2F, a space by +, and the > sign by %3E.
But I just can not figure how this is specified in the above pattern! Will someone please tell me how that pattern works or what pattern should I substitute there so that I would get my entered command instead of the output I am getting?

sed -n
The -n switch means "don't print" (sed's default printing of every line is suppressed)
's/
s is for substitutions, / is a delimiter so the command looks like
s/Thing to sub/substitution/optional extra command
^.*com_tex=
^ means the start of the line
.* means match 0 or more of any character
So it will match the longest string from the start of the line up to com_tex=
\(\)
This is a capture group, whatever is matched inside these brackets is saved and can be used later
[^&]*
[^] When the hat is used inside square brackets it means do not match any characters inside the brackets
* The same as before means 0 or more matches
The capture group combined with this means capture any character except &.
.*$
The same as the first bit except $ means the end of the line, so this matches everything until the end
/\1/p'
After the second / is the substitution. \1 is the capture group from before, so this will substitute everything we matched in the first part(the whole line) with the capture group.
p means print, this must be explicitly stated as the -n switch was used and will prevent other lines from being printed.
|
PIPE
s/%20/ /g
Sub %20 for a space, g means global so do it for every match on the line
HTH :)
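As a rough illustration of the walkthrough above, the following sketch runs the same two sed stages on a made-up query string. The value of QUERY_STRING and the extra fields user and submit are assumptions for the example; only com_tex comes from the question.

```shell
# Hypothetical query string; com_tex is the field name from the question.
QUERY_STRING='user=admin&com_tex=free%20-m&submit=Run'

# -n plus the p flag prints the line only if the substitution succeeded.
# \([^&]*\) captures everything after com_tex= up to the next & (or end of line).
raw=$(printf '%s\n' "$QUERY_STRING" | sed -n 's/^.*com_tex=\([^&]*\).*$/\1/p')

# Second stage: turn every %20 into a space.
COMMAND=$(printf '%s\n' "$raw" | sed 's/%20/ /g')

printf '%s\n' "$COMMAND"   # prints: free -m
```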

This is not performed by any of the patterns. It is URL encoding (percent-encoding), which the browser applies to form data when it submits the form, so the value already arrives encoded in QUERY_STRING.
I will try to explain the patterns a little at a time
sed -n
-n specifies that sed should not automatically print each input line (here, the query string) after applying the commands.
The command following is of the form 's/regexp/replacement/flags'
^.*com_tex=\([^&]*\).*$
^ matches the beginning of the line
.* matches zero to many of any character
com_tex= matches the characters literally
\([^&]*\) '\(' specifies the beginning of a group that can later be backreferenced via its index. '[^&]*' matches zero to many characters which are not '&'. '\)' specifies the end of the group.
.* See above
$ matches the end of the line
\1
The above replacement is a backreference to the first (and only) group in the regexp i.e. '[^&]*'. So the replacement replaces the entire line with all characters immediately following 'com_tex=' till the first '&'.
The p flag specifies that if a substitution took place, the current line post substitution should be printed.
sed "s/%20/ /g"
The above is much simpler: it replaces all (not just the first) occurrences of '%20' with a space ' '.
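Since the extraction pattern does not produce the + and %XX sequences, they have to be undone in a separate step. The sketch below is deliberately partial: it only reverses the three encodings that appear in the question's example (+, %2F and %3E), in the same style as the existing %20 replacement. A complete decoder would have to handle every %XX sequence, which this one does not attempt.

```shell
# Encoded value exactly as reported in the question.
encoded='%2Fusr%2Fbin%2Ffree+-m+%3E%3E+%2Fhome%2Fadmin%2Fmemlog.txt'

# Partial decoding only: + becomes a space, %2F becomes /, %3E becomes >.
# Any other %XX sequence would be left untouched by this sketch.
decoded=$(printf '%s\n' "$encoded" | sed -e 's/+/ /g' -e 's|%2F|/|g' -e 's/%3E/>/g')

printf '%s\n' "$decoded"   # prints: /usr/bin/free -m >> /home/admin/memlog.txt
```

The + replacement is done first so that it only affects literal + signs from the encoded input.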

How is the expression '^$' in sed arrived at?

I know that $ means the last character and ^ means the first character.
I have seen an example that uses this expression in sed to delete all the blank lines:
sed '/^$/d' <file_name>
I was trying to understand the expression ^$. How does one arrive at it? Does it mean delete all the lines whose first and last characters are the same? What is meant by the combination ^$?

^ is not a first character, it is "before first character". $ is not the last character as well, it is "end of line". ^$ means there's nothing in between those two so it's just a blank line.

The question is really about regular expressions in general and is not specific to sed.
The sed utility is a stream/line editor that uses regular expressions.
The ^ (circumflex or caret) means look only at the beginning of the target string.
The $ (dollar) means look only at the end of the target string.
So, /^$/ means a line with nothing in between the beginning and the end of the line.
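A small sketch makes the point concrete. The input here is made up for the example: four lines, of which one is empty and one holds a single space.

```shell
# Four lines: "first", an empty line, a line holding one space, "second".
result=$(printf 'first\n\n \nsecond\n' | sed '/^$/d')

# Only the truly empty line is removed; the line with a space does not match ^$
# because there is a character between the start and the end of the line.
printf '%s\n' "$result"
```

Three lines remain, including the one that contains only a space.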

How do I write a SED regex to extract a string delimited by another string?

I am using GNU sed version 4.2.1 and I am trying to write a non-greedy sed regex to extract a string that is delimited by two other strings. This is easy when the delimiting strings are single characters:
s:{\([^}]*\)}:\1:g
In that example the string is delimited by '{' on the left and '}' on the right.
If the delimiting strings are multiple characters, say '{{{' and '}}}' I can adjust the above expression like this:
s:{{{\([^}}}]*\)}}}:\1:g
so the centre expression matches anything not containing the '}}}' closing string. But this only works if the match string does not contain '}' at all. Something like:
{{{cannot match {this broken} example}}}
will not work but
{{{can match this example}}}
does work. Of course
s:{{{\(.*\)}}}:\1:g
always works but is greedy so isn't suitable where multiple patterns occur on the same line.
I understand [^a] to mean anything except a and [^ab] to mean anything except a or b so, despite it appearing to work, I don't think [^}}}] is the correct way to exclude that sequence of 3 consecutive characters.
So how do I write a regex for sed that matches a string that is delimited by two other strings?

You are correct that [^}}}] doesn't work. A negated character class matches anything that is not one of the characters inside it. Repeating characters doesn't change the logic. So what you wrote is the same as [^}]. (It is easy to see why this works when there are no braces inside the expression).
In Perl and compatible regular expressions, you can use ? to make a * or + non-greedy:
s:{{{(.*?)}}}:$1:g
This will always match the first }}} after the opening {{{.
However, this is not possible in Sed. In fact, I don't think there is any way in Sed of doing this match. The only other way to do this is use advanced features like look-ahead, which Sed also does not have.
You can easily use Perl in a sed-like fashion with the -pe options, which cause it to take a single line of code from the command line (-e) and automatically loop over each line and print the result (-p).
perl -pe 's:{{{(.*?)}}}:$1:g'
The -i option for in-place editing of files is also useful, but make sure your regex is correct first!
For more information see perlrun.

With sed you could do something like:
sed -e :a -e 's/\(.*\){{{\(.*\)}}}/\1\2/ ; ta'
With:
{{{can match this example}}} {{{can match this 2nd example}}}
This gives:
can match this example can match this 2nd example
It is not lazy matching, but by replacing from right to left we can make use of sed's greediness.
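Here is the same loop as a runnable sketch, tried on a line that also contains the nested-brace case from the question. The ta command is given as its own -e expression here, which is an equivalent spelling of the one-liner above; the variable names are illustrative.

```shell
line='{{{can match this example}}} {{{cannot match {this broken} example}}}'

# :a defines a label; ta jumps back to it whenever the substitution succeeded.
# Each pass strips the rightmost {{{...}}} pair, because the leading \(.*\)
# greedily skips ahead to the last {{{ on the line.
stripped=$(printf '%s\n' "$line" | sed -e :a -e 's/\(.*\){{{\(.*\)}}}/\1\2/' -e ta)

printf '%s\n' "$stripped"
# prints: can match this example cannot match {this broken} example
```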

Unable to figure out regex bash or sed or awk

I want to reduce the string jdk-1.6.0_30-fcs.x86_64 to just jdk-1.6.0_30. I tried sed 's/\([a-z][^fcs]*\).*/\1/' but I end up with jdk-1.6.0_30-. I think I am approaching it the wrong way. Is there a way to start from the end of the word and traverse backwards until I encounter a -?

Not exactly, but you can anchor the pattern to the end of the string with $. Then you just need to make sure that the characters you repeat may not include hyphens:
echo jdk-1.6.0_30-fcs.x86_64 | sed 's/-[^-]*$//'
This will match from a - to the end of the string, but all characters in between must be different from - (so that it does not match for the first hyphen already).
A slightly more detailed explanation. The engine tries to match the literal - first. That will first work at the first - in the string (obviously). Then [^-]* matches as many non-hyphen characters as possible, so it will consume 1.6.0_30 (because the next character is in fact a hyphen). Now the engine will try to match $, but that does not work because we are not at the end of the string. Some backtracking occurs, but we can ignore that here. In the end the engine will abandon matching the first - and continue through the string. Then the engine will match the literal - with the second -. Now [^-]* will consume fcs.x86_64. Now we are actually at the end of the string and $ will match, so the full match (which will be removed) is -fcs.x86_64.

Use cut:
echo 'jdk-1.6.0_30-fcs.x86_64' | cut -d- -f-2

Try doing this :
echo 'jdk-1.6.0_30-fcs.x86_64' | sed 's/-fcs.*//'
If you are using bash, sh or ash, you can do:
var=jdk-1.6.0_30-fcs.x86_64
echo "${var%%-fcs*}"
jdk-1.6.0_30
The latter solution uses parameter expansion; tested on Linux and Minix3.
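Parameter expansion can also answer the original "start from the end" question without naming fcs at all. A minimal sketch, assuming a POSIX-style shell; the variable names short and long are only for illustration:

```shell
var='jdk-1.6.0_30-fcs.x86_64'

# % removes the shortest suffix matching the glob -*, i.e. from the last hyphen on.
short=${var%-*}

# %% removes the longest such suffix, i.e. from the first hyphen on.
long=${var%%-*}

printf '%s\n' "$short"   # prints: jdk-1.6.0_30
printf '%s\n' "$long"    # prints: jdk
```

The single % form is the one that behaves like scanning backwards to the nearest -.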

Replace only the first occurrence matching a regex with sed

I have a string
test:growTest:ret
With sed I would like to delete only test: to get:
growTest:ret
I tried with
sed '0,/RE/s/^.*://'
But it only gives me
ret
Any ideas ?
Thanks

Modify your regexp ^.*: to ^[^:]*:
All you need is that the .* construction won't consume your delimiter — the colon. To do this, replace matching-any-char . with negated brackets: [^abc], that match any char except specified.
Also, don't confuse the two circumflexes ^, as they have different meanings: first one matches beginning of string, second one means negated brackets.
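The two patterns can be compared side by side on the string from the question. This is just a quick sketch with illustrative variable names:

```shell
s='test:growTest:ret'

greedy=$(printf '%s\n' "$s" | sed 's/^.*://')       # .* runs to the last colon
bounded=$(printf '%s\n' "$s" | sed 's/^[^:]*://')   # [^:]* stops at the first colon

printf '%s\n' "$greedy"    # prints: ret
printf '%s\n' "$bounded"   # prints: growTest:ret
```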

If I understand your question, you want strings like test:growTest:ret to become growTest:ret.
You can use:
sed -i 's/test:\(.*\)$/\1/' file
-i means edit the file in place (drop it when reading from a pipe).
s/one/two/ replaces occurrences of one with two.
So this replaces "test:\(.*\)$" with "\1", where \1 is the contents of the first group, that is, whatever the regex matched inside the parentheses. In sed's default (basic) syntax the parentheses must be escaped as \( and \) to form a group.
"test:\(.*\)$" matches the first occurrence of "test:" and then captures everything else up to the end of the line in the group. Only the contents of the group remain after the sed command.

sed uses greedy matching, so ^.*: will match test:growTest: rather than just test:.
By default, sed replaces only the first match on each line, so you do not need to do anything special for that.
