sed misbehaving? - regex

I have the following command:
$ xlscat -i $file
and I get:
Excel File Name.xlsx - 01: [ Sheet #1 ] 34 Cols, 433 Rows
Excel File Name.xlsx - 02: [ Sheet Number2 ] 23 Cols, 32 Rows
Excel File Name.xlsx - 03: [ Foo Factor! ] 14 Cols, 123 Rows
I want just the sheet name, so I do this:
$ xlscat -i $file 2>&1 | sed -e 's/.*\[ *\(.*\) *\].*/\1/' | while read file
> do
> echo "File: '$file'"
> done
And get this:
File: 'Sheet #1'
File: 'Sheet Number2'
File: 'Foo Factor!'
Great! Everything works beautifully. As you can see from the single quotes, I've removed the extra spaces at the end of the name. Now to convert all remaining spaces to underscores:
$ xlscat -i $file 2>&1 | sed -e 's/.*\[ *\(.*\) *\].*/\1/' | sed -e 's/ /_/g' | while read file
> do
> echo "File: '$file'"
> done
Now I get this:
File: 'Sheet_#1_____'
File: 'Sheet_Number2'
File: 'Foo_Factor!__'
Huh? The first run didn't show any trailing blanks, but the second one seems to be appending underscores to the end of the name. What am I not seeing?

The first sed command is not stripping the trailing whitespace, read is. Check your expression:
sed -e 's/.*\[ *\(.*\) *\].*/\1/'
It matches:
anything
a left bracket
zero or more spaces
anything, captured
zero or more spaces
a right bracket
anything
Regular expressions are greedy, meaning that they match as much as possible, and earlier parts of the expression get first claim on the input. So, for example, the regular expression (.*)(.*) matches anything in two capturing groups, but there are any number of ways the data could be split between the two groups. The regex implementation has to choose, and it will put as much as possible in the first group and nothing in the second. That is exactly what happens here: the captured \(.*\) swallows the trailing spaces, leaving nothing for the following ' *' to match, so the spaces end up inside the capture.
Since you need to match filenames with spaces in them, you can't match "anything except a space"; your best bet is to trim the trailing whitespace as a separate step. Try this sed command instead:
sed -e 's/.*\[ *\(.*\) *\].*/\1/' -e 's/ *$//'
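Putting it together with the underscore conversion, here's a sketch you can test without xlscat (the sample line just mimics its output, with a few trailing spaces assumed before the closing bracket):
$ printf '%s\n' 'Excel File Name.xlsx - 01: [ Sheet #1     ] 34 Cols, 433 Rows' | sed -e 's/.*\[ *\(.*\) *\].*/\1/' -e 's/ *$//' -e 's/ /_/g'
Sheet_#1
The trailing spaces are gone before the space-to-underscore substitution runs, so no stray underscores appear.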

I think read file is trimming the trailing whitespace for you (read strips leading and trailing IFS whitespace when it splits the line). Try putting the
sed -e 's/ /_/g'
inside the while loop ... like:
echo "File: $(echo $file | sed -e 's/ /_/g')"

Could it be echo that's stripping the trailing spaces? It does seem like they should show up inside the quotes, though. Anyway, try this:
sed -e 's/.*\[ *\([^] ]\+\( \+[^] ]\+\)*\).*/\1/'
Each word of the sheet name is matched by [^] ]\+ (i.e., one or more of any characters other than space or ]). When the final word of the name has been matched, the second .* consumes the rest of the line. There's no need to match the closing ], so the trailing spaces don't have to be included in the match.
I'm not a sed user, but this regex works correctly in RegexBuddy when I specify the GNU-BRE flavor, so it should work in sed.
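For what it's worth, a quick way to try it without the spreadsheet (the padded sample line is only an assumption about what xlscat prints):
$ printf '%s\n' 'Excel File Name.xlsx - 03: [ Foo Factor!   ] 14 Cols, 123 Rows' | sed -e 's/.*\[ *\([^] ]\+\( \+[^] ]\+\)*\).*/\1/' -e 's/ /_/g'
Foo_Factor!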

Related

Why does this regex work in grep but not sed?

I have two regular expressions:
$ grep -E '\-\- .*$' *.sql
$ sed -E '\-\- .*$' *.sql
(I am trying to grep lines in SQL files that have comments with the first command, and to remove those comment lines with sed in the second)
The grep command works using this regex; however, the sed returns the following error:
sed: -e expression #1, char 7: unterminated address regex
What am I doing incorrectly with sed?
(The space after the two hyphens is required for SQL comments, in case you are unfamiliar with MySQL comments of this type.)
You're trying to use:
sed -E '\-\- .*$' *.sql
Here the sed command is not correct because you're not actually telling sed to do anything with the matched lines.
It should be:
sed -n '/-- /p' *.sql
and equivalent grep would be:
grep -- '-- ' *.sql
or even better with a fixed string search:
grep -F -- '-- ' *.sql
The -- separates the options from the pattern in the grep command, so a pattern beginning with - is not mistaken for an option.
There is no need to escape - in a regex if it is outside a bracket expression (or character class), i.e. [...].
Based on the comments below, it seems the OP's intent is to remove the commented sections (the part starting with two hyphens) in all *.sql files.
You may use this sed for that:
sed -i 's/-- .*//g' *.sql
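For example, with a hypothetical comments.sql (shown without -i so nothing is modified in place):
$ cat comments.sql
SELECT id FROM users; -- trailing comment
-- whole-line comment
SELECT name FROM users;
$ sed 's/-- .*//g' comments.sql
SELECT id FROM users; 

SELECT name FROM users;
The comment text is removed but the rest of each line stays; add -e '/^[[:blank:]]*$/d' if you also want to drop the lines that end up empty.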
The problem here is not the regex, the problem is that sed requires a command. The equivalent of your grep would be:
sed -n '/\-\- .*$/p'
You suppress output for non-matching lines with -n, you search by wrapping your regex in slashes, and you print the matching lines with p (after the last slash).
P.S.: As Anub pointed out, escaping the hyphens - inside the regex is unnecessary.
You are trying to use sed's \cregexpc syntax: with \-<...> you are telling sed that the delimiter character you want to use is a dash -, but you didn't terminate it where it should end, \-<...>-, and you also need to add the d command to delete those lines.
sed '\-\-\-.*$-d' infile
see man sed about that:
\cregexpc
Match lines matching the regular expression regexp. The c may be any character.
If the default / delimiter is used, none of that is required, so:
sed '/--.*$/d' infile
or simply:
sed '/^--/d' infile
or, to also catch comment lines with leading whitespace:
sed '/^[[:blank:]]*--/d' infile
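As a side note, the \cregexpc form works with any delimiter character; picking one that doesn't appear in the pattern (a comma here, purely as an illustration) avoids the escaping altogether:
sed '\,^--,d' infile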

“sed” command to remove a line that matches an exact string on first word

I've found an answer to my question here: "sed" command to remove a line that match an exact string on first word
...but only partially, because that solution only works if my case is pretty much exactly like the one the person answered.
They answered:
sed -i "/^maria\b/Id" file.txt
...to chop out only lines starting with the word "maria", and not lines where "maria" appears but isn't the first word, for example.
I want to chop out a specific URL in a file, for example "cnn.com". But I also have a bunch of localhost addresses (127.0.0.1) and 0.0.0.0 entries, and both have some lines with a single space in front. I also don't want to chop out subdomains like ads.cnn.com, so that code "should" work, but it doesn't when I string in more commands with the -e option. My code below seems to clean things up well, except that I can't get it to whack out the cnn.com! My file is called raw.txt
sed -r -e 's/^127.0.0.1//' -e 's/^ 127.0.0.1//' -e 's/^0.0.0.0//' -e 's/^ 0.0.0.0//' -e '/#/d' -e '/^cnn.com\b/d' -e '/::/d' raw.txt | sort | tr -d "[:blank:]" | awk '!seen[$0]++' | grep cnn.com
When I grep for cnn.com I see all the cnn's INCLUDING the one I don't want which is actually "cnn.com".
ads.cnn.com
cl.cnn.com
cnn.com <-- the one I don't want
cnn.dyn.cnn.com
customad.cnn.com
gdyn.cnn.com
jfcnn.com
kermit.macnn.com
metrics.cnn.com
projectcnn.com
smetrics.cnn.com
tiads.sportsillustrated.cnn.com
trumpincnn.com
victory.cnn.com
xcnn.com
If I just use that one piece of code with the cnn.com chop-out, it seems to work.
sed -r '/^cnn.com\b/d' raw.txt | grep cnn.com
* I'm not using the "-e" option
Result:
ads.cnn.com
cl.cnn.com
cnn.dyn.cnn.com
customad.cnn.com
gdyn.cnn.com
jfcnn.com
kermit.macnn.com
metrics.cnn.com
projectcnn.com
smetrics.cnn.com
tiads.sportsillustrated.cnn.com
trumpincnn.com
victory.cnn.com
xcnn.com
Nothing I do seems to work when I string commands together with the "-e" option. I need some help getting my multiple-option sed command working.
Any advice?
Ubuntu 12 LTS & 16 LTS.
sed (GNU sed) 4.2.2
The . is a metacharacter in regex which means "match any one character". So you accidentally created a regex that will also catch cnnPcom or cnn com or cnn\com. While it probably works for your needs, it would be better to be more explicit:
sed -r '/^cnn\.com\b/d' raw.txt
The difference here is the \ backslash before the . period. That escapes the period metacharacter so it's treated as a literal period.
As for your lines that start with a space, you can catch those in a single regex (Again escaping the period metacharacter):
sed -r '/(^[ ]*|^)127\.0\.0\.1\b/d' raw.txt
This (^[ ]*|^) says the line either starts with any number of spaces (^[ ]*) OR (|) simply starts at the beginning (^), and that is then followed by your match for 127.0.0.1.
And then for stringing these together you can use the | OR operator inside of parentheses to catch all of your matches:
sed -r '/(^[ ]*|^)(127\.0\.0\.1|cnn\.com|0\.0\.0\.0)\b/d' raw.txt
Alternatively you can use a ; semicolon to separate out the different regexes:
sed -r '/(^[ ]*|^)127\.0\.0\.1\b/d; /(^[ ]*|^)cnn\.com\b/d; /(^[ ]*|^)0\.0\.0\.0\b/d;' raw.txt
sed doesn't understand matching on strings, only regular expressions, and it's ridiculously difficult to try to get sed to act as if it does; see Is it possible to escape regex metacharacters reliably with sed. Removing a line whose first space-separated word is "foo" is just:
awk '$1 != "foo"' file
To remove lines that start with any of "foo" or "bar" is just:
awk '($1 != "foo") && ($1 != "bar")' file
If you have more than just a couple of words, the approach is to list them all, create a hash table indexed by them, and then test whether the first word of your line is an index of the hash table:
awk 'BEGIN{split("foo bar other word",badWords)} !($1 in badWords)' file
If that's not what you want then edit your question to clarify your requirements and include concise, testable sample input and the expected output given that input.
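For the file in the question specifically, assuming raw.txt is a hosts-style list where the domain is the second field (e.g. "0.0.0.0 cnn.com", possibly with a leading space), the same idea would be roughly:
awk '$2 != "cnn.com"' raw.txt
awk's default field splitting ignores leading whitespace, so the entries with a single space in front need no special handling.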

Regex EOL replace with perl is giving unexpected results

Why is there a dollar sign at the start of lines 2 and 3?
➜ echo -e "hello\nworld" | perl -pe 's/$/\$/g'
hello$
$world$
$%
Above, I am trying to add a dollar sign at the end of each line, but somehow it's appending a dollar sign at the beginning too. It does that when the global flag is enabled. But when I remove the global flag, it works fine:
➜ echo -e "hello\nworld" | perl -pe 's/$/\$/'
hello$
world$
Can anyone explain what's happening? Maybe it has something to do with '\r\n' characters?
EDIT : Adding the lookbehind case
It's not just breaking in this case, but in other cases as well. Consider the following:
➜ echo -e "A\nB\nC\nD" | perl -pe 's/(?<!A)$/\$/'
A
$B$
C$
D$
Above, I want to mark rows which don't end in "A" with $.
The extra dollar sign in line 2 shouldn't be there. I'm not even using global flag.
SOLUTION: Okay, got it now. The solution for the second one is like this (for an explanation, refer to Wiktor Stribiżew's answer):
➜ echo -e "A\nB\nC\nD" | perl -pe 's/(?<!A|\n)$/\$/'
A
B$
C$
D$
But beware: if the alternatives inside the lookbehind do not all have the same length, it will throw
Variable length lookbehind not implemented in regex. For example:
➜ echo -e "AA\nBB\nCC\nDD" | perl -pe 's/(?<!AA|\n)$/\$/'
Variable length lookbehind not implemented in regex m/(?<!AA|\n)$/ at -e line 1.
To solve this, add the appropriate number of . before the newline so that all the alternatives are the same length.
➜ echo -e "AA\nBB\nCC\nDD" | perl -pe 's/(?<!AA|.\n)$/\$/'
AA
BB$
CC$
DD$
The point is that $ is a zero-width assertion and it can match before a final newline. Perl reads each line with its trailing \n, so $ matches twice: before and after that newline.
Your string basically goes to Perl as two lines:
hello\n
world\n
And the $ can match both before a final newline and at the very end of the string. Thus, there are two matches in each line ("string" in this context).
If you want to match the very end of string, use \z:
perl -pe 's/\z/\$/g'
since \z only matches the very end of the string. It is not likely anyone would want that here, though: with -p each string still ends in \n, so this effectively inserts a $ at the start of the second and subsequent output lines, and adds one as a final line as well.
To only insert $ before the last \n and stop, use your perl -pe 's/$/\$/', with no g modifier.
If you really want to use it with the global replace, you can use the following command:
echo -e "hello\nworld" | perl -pe 's/^(.*)$/\1\$/g'
hello$
world$
or without back-references you can use:
echo -e "hello\nworld" | perl -pe 's/\n$/\$\n/g'
hello$
world$
you might need to replace \n with \r\n if you are manipulating a file that came from Windows, or just use dos2unix first to remove the Windows EOL character \r.
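For instance, a sketch for input that still carries Windows line endings (drop the \r and append the literal $ in one pass):
➜ echo -e "hello\r\nworld" | perl -pe 's/\r?\n/\$\n/'
hello$
world$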

Extracting Substring from String with Multiple Special Characters Using Sed

I have a text file with a line that reads:
<div id="page_footer"><div><? print('Any phrase's characters can go here!'); ?></div></div>
And I'm wanting to use sed or awk to extract the substring above between the single quotes so it just prints ...
Any phrase's characters can go here!
I want the phrase to be delimited as I have above, starting after the single quote and ending at the single quote immediately followed by a closing parenthesis and then a semicolon. The following sed command with a capture group doesn't seem to be working for me. Suggestions?
sed '/^<div id="page_footer"><div><? print(\'\(.\+\)\');/ s//\1/p' /home/foobar/testfile.txt
An incorrect approach would be to use cut, like this:
grep "page_footer" /home/foobar/testfile.txt | cut -d "'" -f2
It will go wrong with single quotes inside the string, and counting the number of single quotes first would turn this from a simple solution into an over-complicated one.
A solution with sed is better: remove everything up to the first single quote and everything after the last one. Matching a single quote becomes messy when the sed expression is itself single-quoted: you first close the quoted sed argument, add an escaped single quote, and then open a new quoted string again:
grep page_footer /home/foobar/testfile.txt | sed -e 's/[^'\'']*//' -e 's/[^'\'']*$//'
And this is not the full solution, you want to remove the first/last quotes as well:
grep page_footer /home/foobar/testfile.txt | sed -e 's/[^'\'']*'\''//' -e 's/'\''[^'\'']*$//'
Writing the sed parameters in double-quoted strings and using the . wildcard for matching the single quote will make the line shorter:
grep page_footer /home/foobar/testfile.txt | sed -e "s/^[^\']*.//" -e "s/.[^\']*$//"
Using advanced grep (such as GNU grep on Linux, which supports -P for PCRE), this might be what you are looking for:
grep -Po "(?<=').*?(?='\);)"

Change CSV Delimiter with sed

I've got a CSV file that looks like:
1,3,"3,5",4,"5,5"
Now I want to change all the "," not within quotes to ";" with sed, so it looks like this:
1;3;"3,5";5;"5,5"
But I can't find a pattern that works.
If you are expecting only numbers then the following expression will work
sed -e 's/,/;/g' -e 's/\("[0-9][0-9]*\);\([0-9][0-9]*"\)/\1,\2/g'
e.g.
$ echo '1,3,"3,5",4,"5,5"' | sed -e 's/,/;/g' -e 's/\("[0-9][0-9]*\);\([0-9][0-9]*"\)/\1,\2/g'
1;3;"3,5";4;"5,5"
You can't just replace the [0-9][0-9]* with .* to retain any , that is delimited by quotes; .* is too greedy and matches too much. So you have to use [a-z0-9]*
$ echo '1,3,"3,5",4,"5,5",",6","4,",7,"a,b",c' | sed -e 's/,/;/g' -e 's/\("[a-z0-9]*\);\([a-z0-9]*"\)/\1,\2/g'
1;3;"3,5";4;"5,5";",6";"4,";7;"a,b";c
It also has the advantage over the first solution of being simple to understand: we just replace every , with ; and then correct every ; inside quotes back to a ,.
You could try something like this:
echo '1,3,"3,5",4,"5,5"' | sed -r 's|("[^"]*),([^"]*")|\1\x1\2|g;s|,|;|g;s|\x1|,|g'
which replaces all commas within quotes with the \x1 char, then replaces all remaining commas with semicolons, and then replaces the \x1 chars back to commas. This might work, provided the file is correctly formed, there are initially no \x1 chars in it, and there are no situations where a double quote appears inside double quotes, like "a\"b".
Using gawk
gawk '{$1=$1}1' FPAT="([^,]+)|(\"[^\"]+\")" OFS=';' filename
Test:
[jaypal:~/Temp] cat filename
1,3,"3,5",4,"5,5"
[jaypal:~/Temp] gawk '{$1=$1}1' FPAT='([^,]+)|(\"[^\"]+\")' OFS=';' filename
1;3;"3,5";4;"5,5"
This might work for you:
echo '1,3,"3,5",4,"5,5"' |
sed 's/\("[^",]*\),\([^"]*"\)/\1\n\2/g;y/,/;/;s/\n/,/g'
1;3;"3,5";4;"5,5"
Here's an alternative solution which is longer but more flexible:
echo '1,3,"3,5",4,"5,5"' |
sed 's/^/\n/;:a;s/\n\([^,"]\|"[^"]*"\)/\1\n/;ta;s/\n,/;\n/;ta;s/\n//'
1;3;"3,5";4;"5,5"