Why does this regex work in grep but not sed? - regex

I have two regular expressions:
$ grep -E '\-\- .*$' *.sql
$ sed -E '\-\- .*$' *.sql
(I am trying to grep lines in sql files that have comments and remove lines in sql files that have comments)
The grep command works using this regex; however, the sed returns the following error:
sed: -e expression #1, char 7: unterminated address regex
What am I doing incorrectly with sed?
(The space after the two hyphens is required for sql comments if you are unfamiliar with MySql comments of this type)

You're trying to use:
sed -E '\-\- .*$' *.sql
Here the sed command is incorrect because you never tell sed what to do with the matching lines.
It should be:
sed -n '/-- /p' *.sql
and equivalent grep would be:
grep -- '-- ' *.sql
or even better with a fixed string search:
grep -F -- '-- ' *.sql
The bare -- marks the end of grep's options, so the pattern '-- ' that follows is not parsed as an option.
There is no need to escape - in a regex if it is outside a bracket expression (or character class), i.e. [...].
Based on the comments below, it seems the OP's intent is to remove the comment text, starting at the 2 hyphens, in all *.sql files.
You may use this sed for that:
sed -i 's/-- .*//g' *.sql
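To compare the commands side by side, here is a minimal sketch run against a throwaway file. The file name demo.sql and its contents are made up for illustration, and -i is left out so nothing is modified in place.

```shell
#!/bin/sh
# Hypothetical sample file; name and contents are assumptions for this demo.
tmpdir=$(mktemp -d)
cat > "$tmpdir/demo.sql" <<'EOF'
SELECT 1; -- trailing comment
-- whole-line comment
SELECT 2;
EOF

# Select lines containing "-- ": sed needs -n plus the p command.
sed_out=$(sed -n '/-- /p' "$tmpdir/demo.sql")

# Same selection with grep; the bare -- ends option parsing so that
# the pattern '-- ' is not mistaken for an option.
grep_out=$(grep -F -- '-- ' "$tmpdir/demo.sql")

# Strip the comment text instead of selecting lines.
stripped=$(sed 's/-- .*//' "$tmpdir/demo.sql")

printf 'selected by sed:\n%s\n\nselected by grep:\n%s\n\nstripped:\n%s\n' \
  "$sed_out" "$grep_out" "$stripped"
rm -rf "$tmpdir"
```

Note that stripping leaves an empty line where a whole-line comment used to be; deleting such lines needs the d command instead.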

The problem here is not the regex, the problem is that sed requires a command. The equivalent of your grep would be:
sed -n '/\-\- .*$/p'
Here -n suppresses sed's default output, the slashes wrap the regex you are searching for, and the p after the last slash prints the matching lines.
P.S.: As Anub pointed out, escaping the hyphens - inside the regex is unnecessary.

You are unintentionally using sed's \cregexpc syntax: the leading \- tells sed that the delimiter character is a dash -, but the regex is never terminated with a closing -, hence the "unterminated address regex" error. Terminate it, and add the d command to delete those lines:
sed '\-\-\-.*$-d' infile
See man sed:
\cregexpc
Match lines matching the regular expression regexp. The c may be any character.
With the default / delimiter none of this is needed:
sed '/--.*$/d' infile
or simply:
sed '/^--/d' infile
and more accurately:
sed '/^[[:blank:]]*--/d' infile
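The variants above do not delete the same lines, which a small sketch makes visible. The input below is made up for illustration, and | is used as the custom delimiter only to show the \cregexpc form.

```shell
#!/bin/sh
# Made-up sample input for illustration.
input='SELECT 1; -- trailing comment
  -- indented comment
-- comment at column 1
SELECT 2;'

# Default / delimiter: deletes every line containing "--" anywhere.
any=$(printf '%s\n' "$input" | sed '/--.*$/d')

# Custom delimiter via \cregexpc, with | as c: equivalent to /^--/d.
bol=$(printf '%s\n' "$input" | sed '\|^--|d')

# Also treat comments preceded only by blanks as whole-line comments.
blank=$(printf '%s\n' "$input" | sed '/^[[:blank:]]*--/d')

printf 'any:\n%s\n\nbol:\n%s\n\nblank:\n%s\n' "$any" "$bol" "$blank"
```

Only the last two forms keep a line of SQL that merely has a trailing comment.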

Related

“sed” command to remove a line that matches an exact string on first word

I've found an answer to my question here: "sed" command to remove a line that match an exact string on first word
...but only partially because that solution only works if I query pretty much exactly like the answer person answered.
They answered:
sed -i "/^maria\b/Id" file.txt
...to chop out only a line starting with the word "maria" in it and not maria if it's not the first word for example.
I want to chop out a specific url in a file, example: "cnn.com" - but, I also have a bunch of local host addresses, 0.0.0.0 and both have some with a single space in front. I also don't want to chop out sub domains like ads.cnn.com so that code "should" work but doesn't when I string in more commands with the -e option. My code below seems to clean things up well except that I can't get it to whack out the cnn.com! My file is called raw.txt
sed -r -e 's/^127.0.0.1//' -e 's/^ 127.0.0.1//' -e 's/^0.0.0.0//' -e 's/^ 0.0.0.0//' -e '/#/d' -e '/^cnn.com\b/d' -e '/::/d' raw.txt | sort | tr -d "[:blank:]" | awk '!seen[$0]++' | grep cnn.com
When I grep for cnn.com I see all the cnn's INCLUDING the one I don't want which is actually "cnn.com".
ads.cnn.com
cl.cnn.com
cnn.com <-- the one I don't want
cnn.dyn.cnn.com
customad.cnn.com
gdyn.cnn.com
jfcnn.com
kermit.macnn.com
metrics.cnn.com
projectcnn.com
smetrics.cnn.com
tiads.sportsillustrated.cnn.com
trumpincnn.com
victory.cnn.com
xcnn.com
If I just use that one piece of code with the cnn.com chop out it seems to work.
sed -r '/^cnn.com\b/d' raw.txt | grep cnn.com
* I'm not using the "-e" option
Result:
ads.cnn.com
cl.cnn.com
cnn.dyn.cnn.com
customad.cnn.com
gdyn.cnn.com
jfcnn.com
kermit.macnn.com
metrics.cnn.com
projectcnn.com
smetrics.cnn.com
tiads.sportsillustrated.cnn.com
trumpincnn.com
victory.cnn.com
xcnn.com
Nothing I do seems to work when I string commands together with the "-e" option. I need some help on getting my multiple option command kicking with SED.
Any advice?
Ubuntu 12 LTS & 16 LTS.
sed (GNU sed) 4.2.2
The . is a metacharacter in regex which means "match any one character". So you accidentally created a regex that will also catch cnnPcom or cnn com or cnn\com. While it probably works for your needs, it would be better to be more explicit:
sed -r '/^cnn\.com\b/d' raw.txt
The difference here is the \ backslash before the . period. That escapes the period metacharacter so it's treated as a literal period.
As for your lines that start with a space, you can catch those in a single regex (Again escaping the period metacharacter):
sed -r '/(^[ ]*|^)127\.0\.0\.1\b/d' raw.txt
This (^[ ]*|^) says a line that starts with any number of repeating spaces ^[ ]* OR | starts with ^ which is then followed by your match for 127.0.0.1.
And then for stringing these together you can use the | OR operator inside of parentheses to catch all of your matches:
sed -r '/(^[ ]*|^)(127\.0\.0\.1|cnn\.com|0\.0\.0\.0)\b/d' raw.txt
Alternatively you can use a ; semicolon to separate out the different regexes:
sed -r '/(^[ ]*|^)127\.0\.0\.1\b/d; /(^[ ]*|^)cnn\.com\b/d; /(^[ ]*|^)0\.0\.0\.0\b/d;' raw.txt
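A small sketch of the difference the escaped dot makes, on a made-up host list. It assumes GNU sed (for \b) and uses -E, which GNU sed accepts as a synonym for -r. The prefix is written as ^[ ]* here, which already covers the no-space case because * allows zero repetitions.

```shell
#!/bin/sh
# Made-up host list for illustration; \b as used here assumes GNU sed.
input='cnn.com
 cnn.com
cnnXcom
ads.cnn.com
127.0.0.1 localhost
 0.0.0.0 example.test'

# Unescaped dot: also deletes cnnXcom, because . matches any character.
loose=$(printf '%s\n' "$input" | sed -E '/^cnn.com\b/d')

# Escaped dots plus alternation: deletes only the intended entries,
# with or without leading spaces.
strict=$(printf '%s\n' "$input" |
  sed -E '/^[ ]*(127\.0\.0\.1|cnn\.com|0\.0\.0\.0)\b/d')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

With this input, the strict form should leave only cnnXcom and ads.cnn.com.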
sed doesn't understand matching on strings, only regular expressions, and it's ridiculously difficult to try to get sed to act as if it does, see Is it possible to escape regex metacharacters reliably with sed. To remove a line whose first space-separated word is "foo" is just:
awk '$1 != "foo"' file
To remove lines that start with any of "foo" or "bar" is just:
awk '($1 != "foo") && ($1 != "bar")' file
If you have more than just a couple of words then the approach is to list them all and create a hash table indexed by them then test for the first word of your line being an index of the hash table:
awk 'BEGIN{split("foo bar other word",tmp); for (i in tmp) badWords[tmp[i]]} !($1 in badWords)' file
If that's not what you want then edit your question to clarify your requirements and include concise, testable sample input and the expected output given that input.
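Applied to the host-list case, the string comparison might look like the sketch below. The input lines are made up for illustration. Because $1 is compared as a string, the dot needs no escaping, and awk's default field splitting skips leading blanks.

```shell
#!/bin/sh
# Made-up host list for illustration.
input='cnn.com
cnnXcom
ads.cnn.com
 cnn.com'

# String comparison on the first field: the dot is literal, and
# " cnn.com" is removed too because leading blanks are ignored.
kept=$(printf '%s\n' "$input" | awk '$1 != "cnn.com"')

# Lookup-table variant: split() stores the words as values tmp[1..n],
# so they are copied into the indices of bad before testing with "in".
kept2=$(printf '%s\n' "$input" |
  awk 'BEGIN { n = split("cnn.com cnnXcom", tmp)
               for (i = 1; i <= n; i++) bad[tmp[i]] }
       !($1 in bad)')

printf 'kept:\n%s\n\nkept2:\n%s\n' "$kept" "$kept2"
```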

sed find and replace fastq regex

I have a file such as
head testSed.fastq
#M01551:51:000000000-BCB7H:1:1101:15800:1330 1:N:0:NGTCACTN+TATCCTCTCTTGAAGA
NGTCACTN
+
#>AAAAF#
#M01551:51:000000000-BCB7H:1:1101:15605:1331 1:N:0:NATCAGCN+TAGATCGCCAAGTTAA
NATCAGCN
+
#>>AA?C#
#M01551:51:000000000-BCB7H:1:1101:15557:1332 1:N:0:NCAGCAGN+TATCTTCTATAAATAT
NCAGCAGN
And I am attempting to replace the string after the final colon with 0 (in this example on lines 1,5,9 - but globally) using a regular expression.
I have checked my regex using egrep: egrep '[ATGCN]{8}\+[ATGCN]{16}$' testSed.fastq, which returns all the lines I would expect.
However when I try to use sed -i 's/[ATGCN]{8}\+[ATGCN]{16}$/0/g' testSed.fastq the original file is unchanged and no replacement occurs.
How can I fix this? Is my regex not specific enough?
Do you need a regex for this?
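awk -F: -v OFS=: '/^#/ && NF > 1 {$NF = "0"} 1' testfile
The NF > 1 test skips the quality lines in your sample (such as #>AAAAF#), which also start with # but contain no colon.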
That won't save in-place. If you have GNU awk you can
gawk -F: -v OFS=: -i inplace '...' file
ref: https://www.gnu.org/software/gawk/manual/html_node/Extension-Sample-Inplace.html
Your regex is structured as an ERE rather than a BRE, which is sed's default interpretation. Not all sed implementations support ERE, but you can check man sed in your environment to determine whether it's possible for you. Look for -r or -E options. You can alternately use bounds by preceding the curly braces with backslashes.
That said, rather than matching the precise text in the last field, why not just look for the string that starts with a colon, and is followed by no-more-colons? The following RE is both BRE and ERE compatible.
$ sed '/^#/s/:[^:]*$/:0/' testq
#M01551:51:000000000-BCB7H:1:1101:15800:1330 1:N:0:0
NGTCACTN
+
#>AAAAF#
#M01551:51:000000000-BCB7H:1:1101:15605:1331 1:N:0:0
NATCAGCN
+
#>>AA?C#
#M01551:51:000000000-BCB7H:1:1101:15557:1332 1:N:0:0
NCAGCAGN
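For completeness, here is a sketch of how the question's original pattern could be written in each syntax, tested on one header line from the sample. In ERE the bounds are bare and \+ is a literal plus; in BRE the bounds take backslashes and a plain + is literal.

```shell
#!/bin/sh
# One header line copied from the question's sample.
line='#M01551:51:000000000-BCB7H:1:1101:15800:1330 1:N:0:NGTCACTN+TATCCTCTCTTGAAGA'

# ERE (-E): {8} is a bound and \+ is a literal plus sign.
ere=$(printf '%s\n' "$line" | sed -E 's/[ATGCN]{8}\+[ATGCN]{16}$/0/')

# BRE (default): bounds need backslashes, and a plain + is literal.
bre=$(printf '%s\n' "$line" | sed 's/[ATGCN]\{8\}+[ATGCN]\{16\}$/0/')

# The question's pattern in BRE mode: the braces are literal characters,
# so nothing matches and the line comes back unchanged.
orig=$(printf '%s\n' "$line" | sed 's/[ATGCN]{8}\+[ATGCN]{16}$/0/')

printf '%s\n' "$ere" "$bre" "$orig"
```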

Extract few matching strings from matching lines in file using sed

I have a file with strings similar to this:
abcd u'current_count': u'2', u'total_count': u'3', u'order_id': u'90'
I have to find current_count and total_count for each line of the file. I am trying the command below but it's not working. Please help.
grep current_count file | sed "s/.*\('current_count': u'\d+'\).*/\1/"
It is outputting the whole line but I want something like this:
'current_count': u'3', 'total_count': u'3'
It's printing the whole line because the pattern in the s command doesn't match, so no substitution happens.
sed regexes don't support \d for digits, or x+ for xx*. GNU sed has a -r option to enable extended-regex support so + will be a meta-character, but \d still doesn't work. GNU sed also allows \+ as a meta-character in basic regex mode, but that's not POSIX standard.
So anyway, this will work:
echo -e "foo\nabcd u'current_count': u'2', u'total_count': u'3', u'order_id': u'90'" |
sed -nr "s/.*('current_count': u'[0-9]+').*/\1/p"
# output: 'current_count': u'2'
Notice that I skip the grep by using sed -n s///p. I could also have used /current_count/ as an address:
sed -r -e '/current_count/!d' -e "s/.*('current_count': u'[0-9]+').*/\1/"
Or with just grep printing only the matching part of the pattern, instead of the whole line:
grep -E -o "'current_count': u'[[:digit:]]+'"
(or egrep instead of grep -E). Note that grep -o is not required by POSIX, although GNU and BSD grep both support it.
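To get both fields on one line, as the question asks, one option is a second capture group. This is only a sketch: it assumes current_count always appears before total_count on a line, and it uses -E, the more portable spelling of -r.

```shell
#!/bin/sh
# Sample line copied from the question.
line="abcd u'current_count': u'2', u'total_count': u'3', u'order_id': u'90'"

# One capture group per field; -n plus the p flag prints only lines
# where the substitution succeeded.
both=$(printf '%s\n' "$line" |
  sed -nE "s/.*('current_count': u'[0-9]+').*('total_count': u'[0-9]+').*/\1, \2/p")

printf '%s\n' "$both"
# prints: 'current_count': u'2', 'total_count': u'3'
```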
For me this looks like some sort of serialized Python data. Basically I would try to find out the origin of that data and parse it properly.
However, while being hackish, sed can also be used here:
sed "s/.*current_count': [a-z]'\([0-9]\+\).*/\1/" input.txt
sed "s/.*total_count': [a-z]'\([0-9]\+\).*/\1/" input.txt

How to remove quotation marks and spaces around numbers in a CSV file using sed?

I have some numbers in a CSV file, and I'm trying to remove the quotation marks and spaces around them.
Input:
1," 23","45","67 ",89
Expected output: 1,23,45,67,89
I'm trying to remove with:
sed -r -e 's#\"[ ]*\([0-9]+\)[ ]*\"#\1#g' file.csv
But I'm getting the error "sed: -e expression #1, char 38: invalid reference \1 on `s' command's RHS". If I remove the -r option, I don't get the error, but it does not work either.
Tom Fenech provided the crucial pointer in a comment:
The only problem with the OP's command is a minor syntax problem:
Since sed is used with -r in order to activate extended regular expressions, ( and ) - for defining capture groups - must NOT be \-escaped.
(By contrast, when sed is used without -r, basic regular expressions must be used, where such escaping is needed.)
The correct form is therefore (\ before ( and ) removed):
sed -r 's#\"[ ]*([0-9]+)[ ]*\"#\1#g' file.csv
If you want the command to work on OSX also, use -E instead of -r.
Alternatively, for maximum portability (POSIX compliance) you could just use \{1,\} instead of + and do away with the -r switch entirely:
sed 's#\"[ ]*\([0-9]\{1,\}\)[ ]*\"#\1#g' file.csv
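Both forms can be checked against the question's input line with a short sketch. The quotes are left unescaped here, since " is not special inside a single-quoted sed script.

```shell
#!/bin/sh
# Input line copied from the question.
line='1," 23","45","67 ",89'

# ERE: bare parentheses form the capture group, + means one or more.
ere=$(printf '%s\n' "$line" | sed -E 's#"[ ]*([0-9]+)[ ]*"#\1#g')

# BRE: the group needs \( \), and \{1,\} stands in for +.
bre=$(printf '%s\n' "$line" | sed 's#"[ ]*\([0-9]\{1,\}\)[ ]*"#\1#g')

printf '%s\n' "$ere" "$bre"
# prints 1,23,45,67,89 twice
```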
You could try the below perl command,
$ echo '1," 23","45","67 ",89, "foo" , "bar" ' | perl -pe 's/[" ]+(\d+)[ "]+/\1/g'
1,23,45,67,89, "foo" , "bar"

Finding a repeated string with grep and "bounds"

I have a file test-matching.txt that looks like this:
ba
bababa
baba
babadooba
According to the grep man page, I should be able to get all but the first line using the expression
grep "ba{2,}" test-matching.txt
This should match all the lines containing instances of a string with 2 or more "ba's". However, when I run it, I get no output.
First I tried grep "ba" test-matching.txt just to make sure it was working at all, and it gave me all four lines as output.
I've also tried the following, each with no output:
With the -e option: grep -e "ba{2,}" test-matching.txt
With the -e option and single quotes: grep -e 'ba{2,}' test-matching.txt
With the -e option and escaped braces: grep -e "ba\{2,\}" test-matching.txt
Without the -e option and single quotes: grep 'ba{2,}' test-matching.txt
Without the -e option and escaped braces: grep "ba\{2,\}" test-matching.txt
With {2} instead of {2,}: grep -e 'ba{2}' test-matching.txt
With {2} instead of {2,} and the -e option: grep -e 'ba{2}' test-matching.txt
etc.
What is the correct way to match all the lines of "ba" concatenated 2 or more times?
Use egrep or grep -E (not grep -e) if you want to use Extended regular expression syntax. If you want to use basic regular expression syntax, you need to backslash-escape the braces. Finally, if you want to repeat ba, you need to group: egrep '(ba){2,}', or grep '\(ba\)\{2,\}' if you prefer using basic regular expressions.
ba{2,} applies the bound only to the a, so it matches:
baa
baaa
baaaa
etc
You need (ba){2,} to make the bound apply to the whole group.
Try:
egrep "(ba){2,}" file
or
grep "\(ba\)\{2,\}" file
bababa
baba
babadooba
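The effect of grouping can be seen in a short sketch. The input is the question's four lines plus one extra line, baa, added here as an assumption to show what the ungrouped bound matches.

```shell
#!/bin/sh
# The question's lines plus "baa" (added for this demo).
input='ba
bababa
baba
babadooba
baa'

# Ungrouped: the bound applies to the single character a,
# so this matches b followed by two or more a's.
only_a=$(printf '%s\n' "$input" | grep -E 'ba{2,}')

# Grouped: the bound applies to the whole string "ba".
ere=$(printf '%s\n' "$input" | grep -E '(ba){2,}')
bre=$(printf '%s\n' "$input" | grep '\(ba\)\{2,\}')

printf 'ungrouped:\n%s\n\ngrouped (ERE):\n%s\n\ngrouped (BRE):\n%s\n' \
  "$only_a" "$ere" "$bre"
```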