How to use sed to remove parenthesis but not all of them - regex

I have a file full of lines like the one below:
("012345", "File City (Spur) NE", "10.10.10.00", "b.file.file.cluster1")
I'd like to remove the parentheses around Spur but not the beginning and ending (). The two commands below work, but I'm looking for one simple sed command.
sed -i 's/) //g' myfile.txt
sed -i 's/ (//g' myfile.txt
Not sure if it's possible but would appreciate any help.

If you want to remove all ( after a space, and ) before a space, you can use
sed -i 's/) / /g;s/ (/ /g' myfile.txt
See the online demo:
s='("012345", "File City (Spur) NE", "10.10.10.00", "b.file.file.cluster1")'
sed 's/) / /g;s/ (/ /g' <<< "$s"
# => ("012345", "File City Spur NE", "10.10.10.00", "b.file.file.cluster1")
Note that in POSIX BRE, unescaped ( and ) chars in the regex pattern match the literal parentheses.
A more precise solution can be written with Perl:
perl -i -pe 's/(^\(|\)$)|[()]/$1/g' myfile.txt
See an online demo. Here, ( at the start and ) at the end are captured into Group 1 (see (^\(|\)$)) and then [()] matches any ( and ) in other contexts, and replacing the matches with the backreference to Group 1 value restores the parentheses at the start and end only.
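If Perl is not available, the same "keep only the outer pair" idea can be approximated with sed alone. This is a minimal sketch, not taken from the answer above: the function name strip_inner_parens is made up for illustration, and it assumes a line is only worth touching when it starts with ( and ends with ). It strips the outer pair, deletes every parenthesis that remains, and then wraps the line again.

```shell
#!/bin/sh
# Sample line from the question.
line='("012345", "File City (Spur) NE", "10.10.10.00", "b.file.file.cluster1")'

# Hypothetical helper: only lines wrapped in ( ... ) are edited.
strip_inner_parens() {
  sed -E '/^\(.*\)$/{
    s/^\((.*)\)$/\1/
    s/[()]//g
    s/.*/(&)/
  }'
}

result=$(printf '%s\n' "$line" | strip_inner_parens)
printf '%s\n' "$result"
```

Unlike the space-based sed command, this does not care whether a space sits next to the inner parentheses, and lines that are not wrapped in ( ... ) pass through unchanged.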

Using sed grouping and back referencing
sed -Ei 's/(\([^(]*).([^\)]*).(.*)/\1\2\3/' input_file
("012345", "File City Spur NE", "10.10.10.00", "b.file.file.cluster1")
This matches the entire line. Each inner parenthesis is matched by a . that sits outside the capture groups, so it is dropped when the groups are written back.

Related

bash tool to search and replace text (while leaving text in the middle the same)

I have text files that look like this:
foo(bar(some_id))
I want to replace that with
bleh(some_id)
I can come up with the regex to find the instances, which is: foo\(bar\([a-zA-Z0-9_]+\)\). But I don't know how to express that I want to keep the text in the middle the same.
Any suggestion? (I'm thinking of using sed or awk or any standard bash tool, whichever is easier )
You can use
sed -E 's/foo\(bar\(([^()]*).*/bleh(\1)/'
sed 's/foo(bar(\([^()]*\).*/bleh(\1)/'
The first pattern is POSIX ERE compliant, hence the -E option.
The foo\(bar\(([^()]*).* POSIX ERE pattern matches foo(bar(, then captures any zero or more chars other than ( and ) into Group 1 (\1 refers to this group value from the replacement pattern), and then matches the rest of string. After the replacement, the Group 1 value remains. You may add .* at the start if there is text before foo(bar(.
The second sed command is POSIX BRE equivalent of the above command.
See an online demo:
s='foo(bar(some_id))'
sed -E 's/foo\(bar\(([^()]*).*/bleh(\1)/' <<< "$s"
# => bleh(some_id)
sed 's/foo(bar(\([^()]*\).*/bleh(\1)/' <<< "$s"
# => bleh(some_id)
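Both commands above end in .*, so anything after the call is dropped from the line. When the call sits inside a longer line, or appears more than once, a variant that matches the two closing parentheses explicitly may be a better fit. A small sketch, where the sample input is invented for illustration:

```shell
#!/bin/sh
# Invented sample: two calls embedded in a longer line.
s='x = foo(bar(first_id)) + foo(bar(second_id));'

# Match the closing "))" instead of ".*" so surrounding text survives,
# and add the g flag so every call on the line is rewritten.
result=$(printf '%s\n' "$s" | sed -E 's/foo\(bar\(([^()]*)\)\)/bleh(\1)/g')
printf '%s\n' "$result"
```

Here the text before, between and after the two calls is left alone, and each call keeps its own id.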
Using sed
$ sed 's/.*\(([^)]*)\).*/bleh\1/' input_file
bleh(some_id)

find recurring pattern with `sed`

I am using GNU bash 4.3.48
I expected that
echo "23S62M1I19M2D" | sed 's/.*\([0-9]*M\).*/\1/g'
would output 62M19M... But it doesn't.
sed 's/\([0-9]*M\)//g' deletes ALL [0-9]*M and retrieves 23S1I2D. but the group \1 is not working as I thought it would.
sed 's/.*\([0-9]*M\).*/ \1 /g', retrieves M...
What am I doing wrong?
Thank you!
With your shown samples and with awk you could try following program.
echo "23S62M1I19M2D" |
awk '
{
val=""
while(match($0,/[0-9]+M/)){
val=val substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
print val
}
'
Explanation: echo prints the value and sends it as standard input to the awk program. In awk, the match function looks for the regex /[0-9]+M/ in a loop to find all matches in each line; each match is appended to val, and the collected matches are printed at the end of each line.
This might work for you (GNU sed):
sed -nE '/[0-9]*M/{s//\n&\n/g;s/(^|\n)[^\n]*\n?//gp}' file
Surround the match by newlines and then remove non-matching parts.
Alternative, using grep and tr:
grep -o '[0-9]*M' file | tr -d '\n'
N.B. tr removes all newlines, including the last one. To restore the last newline, use:
grep -o '[0-9]*M' file | tr -d '\n' | paste
The alternate solution will concatenate all results into a single line. To achieve the same result with the first solution use:
sed -nE '/[0-9]*M/{s//\n&\n/g;s/(^|\n)[^\n]*\n?//g;H};${x;s/\n//gp}' file
The problem is that the .* is greedy. Since only M is obligatory ([0-9]* can match an empty string), the first .* runs up to the last M and the second .* takes the rest, so the whole string is matched, only M is captured, and only M is kept after replacing with the \1 backreference.
That means you can't easily do this with sed. It is much easier with Perl, since it supports matching and skipping a pattern:
#!/bin/bash
perl -pe 's/\d+M(*SKIP)(*F)|.//g' <<< "23S62M1I19M2D"
See the online demo. The pattern matches
\d+M(*SKIP)(*F) - one or more digits, M, and then the match is omitted and the next match is searched for from the failure position
|. - or matches any char other than a line break char.
Or simply match all occurrences and concatenate them:
perl -lane 'BEGIN{$a="";} while (/\d+M/g) {$a .= $&} END{print $a;}' <<< "23S62M1I19M2D"
All \d+M matches are appended to the $a variable which is printed at the end of processing the string.
Your substitution is probably working, but not substituting what you think it is.
In the substitution s/\(foo...\)/\1/, the \1 stands for whatever \(...\) matched and captured, so your substitution is replacing foo... with foo...!
% echo "1234ABC" | sed 's/\([A-Z]\)/-\1-/'g
1234-A--B--C-
So you'll need to match more, but capture only a portion of the match. For example:
echo "23S62M1I19M2D" | sed 's/[0-9]*[A-LN-Z]*\([0-9]*M\)/\1/g'
62M19M2D
In the case of sed 's/.*\([0-9]*M\).*/\1/g' (did that appear in an edit to the question, or did I just miss it?), the .* matches ‘greedily’: it matches as much as it possibly can, thus including the digits before the M. In the example above, the [A-LN-Z]* consumes the non-M letter that ends the uncaptured part, so the digits after it can only be matched by the [0-9]* inside the capture. (This relies on a non-M letter coming before each run of digits; if the input starts with 62M, the leading [0-9]* takes the digits and only M is captured.)
Getting a clear idea of what ‘greedy’ means is really important when writing or interpreting regexps.
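One way to see the greedy behaviour for yourself is to capture all three parts of the original pattern and print them in brackets. This is just a diagnostic sketch; the extra capture groups and the brackets are added for illustration and are not part of the question's command.

```shell
#!/bin/sh
s='23S62M1I19M2D'

# Same pattern as in the question, but with the two .* parts captured too,
# so each piece of the match can be inspected.
parts=$(printf '%s\n' "$s" | sed 's/\(.*\)\([0-9]*M\)\(.*\)/[\1][\2][\3]/')
printf '%s\n' "$parts"
# prints: [23S62M1I19][M][2D]
```

The first group has swallowed everything up to the last M, including the digits 19, which is why \1 in the original command ends up holding only M.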
If you know you will only encounter the suffixes S, M, I and D, an alternative approach would be explicitly deleting the combinations you don't want:
echo "23S62M1I19M2D" | sed 's/[0-9]\+[SID]//g'
This gives the expected:
62M19M
Update: This variant produces the same output, but rejects all non-numeric, non-M suffixes:
echo "23S62M1I19M2D" | sed 's/[0-9]\+[^0-9M]//g'

sed and Perl regexp replaces once, with multiple replacements flag

I have the string:
lopy,lopy1,sym,lopy,lopy1,sym
I want the line to be:
lopy,lopy1,sym,lady,lady1,sym
Which means that all "lopy" after the string sym should be replaced. So I ran:
echo "lopy,lopy1,sym,lopy,lopy1,sym" | sed -r 's/(.*sym.*?)lopy/\1lad/g'
I get:
lopy,lopy1,sym,lopy,lad1,sym
Using Perl is not really better:
echo "lopy,lopy1,sym,lopy,lopy1,sym" | perl -pe 's/(.*sym.+?)lopy/${1}lad/g'
yields
lopy,lopy1,sym,lad,lopy1,sym
Not all "lopy" are replaced. What am I doing wrong?
The (.*sym.*?)lopy / (.*sym.+?)lopy patterns are almost the same, .+? matches one or more chars other than line break chars, but as few as possible, and .*? matches zero or more such chars. Mind that sed does not support lazy quantifiers, *? is the same as * in sed. However, the main problem with the regexps you used is that they match sym, then any text after it and then lopy, so when you added g, it just means you want to find more cases of lopy after sym....lopy. And there is only one such occurrence in your string.
You want to replace all lopy after sym, so you can use
perl -pe 's/(?:\G(?!^)|sym).*?\Klopy/lad/g'
See the regex demo. Details:
(?:\G(?!^)|sym) - sym or end of the previous match (\G(?!^))
.*? - any zero or more chars other than line break chars, as few as possible
\K - match reset operator that discards all text matched so far
lopy - a lopy string.
See the online demo:
#!/bin/bash
echo "lopy,lopy1,sym,lopy,lopy1,sym" | perl -pe 's/(?:\G(?!^)|sym).*?\Klopy/lad/g'
# => lopy,lopy1,sym,lad,lad1,sym
If the values are always comma separated, you may replace .*? with ,: (?:\G(?!^)|sym),\Klopy (see this regex demo).
Since the OP mentioned sed, I am adding an awk program here, which could be a better choice than sed. With the shown samples, please try the following awk program.
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
awk -F',sym,' '
{
first=$1
$1=""
sub(/^[[:space:]]+/,"")
gsub(/lop/,"lad")
$0=first FS $0
}
1
'
Explanation: Adding detailed explanation for above.
echo "lopy,lopy1,sym,lopy,lopy1,sym" | ##Printing values and sending as standard output to awk program as an input.
awk -F',sym,' ' ##Making ,sym, as a field separator here.
{
first=$1 ##Creating first which has $1 of current line in it.
$1="" ##Nullifying $1 here.
sub(/^[[:space:]]+/,"") ##Substituting initial space in current line here.
gsub(/lop/,"lad") ##Globally substituting lop with lad in rest of line.
$0=first FS $0 ##Adding first FS to rest of edited line here.
}
1 ##Printing edited/non-edited line value here.
'
The problem is that the lopy(s) to replace are after sym, with a pattern like sym.*?lopy, so a global replacement looks for yet more of the whole sym+lopy-after-sym (not just for all lopys after that one sym).†
To replace all lopys (after the first sym, followed by another sym) we can capture the substring between syms and in the replacement side run code, in which a regex replaces all lopys
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe's{ sym,\K (.+?) (?=sym) }{ $1 =~ s/lop/lad/gr }ex'
To isolate the substring between syms I use \K after the first sym, which drops matches prior to it, and a positive lookahead for the sym after the substring, which doesn't consume anything. The /e modifier makes the replacement side be evaluated as code. In the replacement side's regex we need /r since $1 is read-only and can't be changed, and we want the substitution to return the changed string anyway. See perlretut.
† To match all of abbbb we can't say /ab/g, nor /(a)b/g nor /a(b)/g, because that would look for all repetitions of the whole ab in the string (and find only ab in the beginning).
sed does not support non-greedy wildcards at all. But your Perl script also fails for other reasons; you are saying "match all occurrences of this" but then you specify a regex which can only match once.
A common simple solution is to split the string, and then replace only after the match:
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe 'if (@x = /^(.*?sym,)(.*)/) { $x[1] =~ s/lop/lad/g; s/.*/$x[0]$x[1]/ }'
If you want to be fancy, you can use a lookbehind to only replace the lop occurrences after the first sym.
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe 's/(?<=sym.{0,200})lop/lad/g'
The variable-length lookbehind generates a warning and is only supported in Perl 5.30+ (you can turn the warning off with no warnings qw(experimental::vlb);).
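The split-then-replace idea from this answer can also be sketched without Perl, using POSIX parameter expansion to cut the string at the first "sym," and sed for the part after it. The function name replace_after_first_sym is made up for this example, and the sketch assumes the marker is always followed by a comma, as in the sample.

```shell
#!/bin/sh
# Hypothetical helper: replace "lop" with "lad" only after the first "sym,".
replace_after_first_sym() {
  s=$1
  case $s in
    *sym,*)
      head=${s%%sym,*}sym,   # everything up to and including the first "sym,"
      tail=${s#*sym,}        # everything after the first "sym,"
      printf '%s%s\n' "$head" "$(printf '%s\n' "$tail" | sed 's/lop/lad/g')"
      ;;
    *)
      # No "sym," at all: leave the line untouched.
      printf '%s\n' "$s"
      ;;
  esac
}

result=$(replace_after_first_sym 'lopy,lopy1,sym,lopy,lopy1,sym')
printf '%s\n' "$result"
```

Because the head is never passed through sed, the lopy entries before the first sym cannot be touched, however the replacement pattern is written.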
Since you have shown an attempted sed command and used sed tag, here is a sed loop based solution:
sed -E -e ':a' -e 's~(sym,.*)lopy~\1lady~g; ta' file
lopy,lopy1,sym,lady,lady1,sym
Explanation:
:a sets a label a before matching sym,.* pattern
ta jumps pattern matching back to label a after making a substitution
This looping stops when the s command has nothing to match, i.e. no lopy substring after sym,
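To watch the loop work, the pattern space can be printed after every successful substitution. The sketch below is a diagnostic variant of the command above, not a replacement for it: the show label and the extra t/d/p/b commands are added only to expose the intermediate states.

```shell
#!/bin/sh
s='lopy,lopy1,sym,lopy,lopy1,sym'

# After each successful substitution jump to :show, print, and loop again.
# When the substitution fails, "d" ends the cycle without printing anything more.
steps=$(printf '%s\n' "$s" |
  sed -E -e ':a' \
         -e 's~(sym,.*)lopy~\1lady~' \
         -e 't show' \
         -e 'd' \
         -e ':show' \
         -e 'p' \
         -e 'b a')
printf '%s\n' "$steps"
```

Each printed line is one pass of the loop. Because .* is greedy, the lopy furthest to the right is replaced first, and the next pass picks up the one before it.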

Printing only text from group

I have working example of substitution in online regex tester https://regex101.com/r/3FKdLL/1 and I want to use it as a substitution in sed editor.
echo "repo-2019-12-31-14-30-11.gz" | sed -r 's/^([\w-]+)-\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2}.gz$.*/\1/p'
It always prints whole string: repo-2019-12-31-14-30-11.gz, but not matched group [\w-]+.
I expect to get only text from group which is repo string in this example.
Try this:
echo "repo-2019-12-31-14-30-11.gz" |
sed -rn 's/^([A-Za-z]+)-[[:alnum:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}-[[:digit:]]{2}-[[:digit:]]{2}-[[:digit:]]{2}.gz.*$/\1/p'
Explanations:
\w will work (not [\w], which matches either a backslash or w), but you should use [[:alnum:]], which is POSIX
For sed, \d isn't a digit class; GNU sed reads it as the start of a decimal character escape (\dNNN), typically used for non-printable characters
Add -n to mute sed, with /p to explicitly print matched lines
Additionally, you could refactor your regex by removing duplication:
echo "repo-2019-12-31-14-30-11.gz" |
sed -rn 's/^([[:alnum:]]+)-[[:digit:]]{4}(-[[:digit:]]{2}){5}.gz.*$/\1/p'
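The effect of -n together with the p flag is easiest to see on input that contains both a matching and a non-matching line. A small sketch, assuming GNU or BSD sed with -E; the second input line (notes.txt) and the escaped dot in \.gz are additions for illustration.

```shell
#!/bin/sh
# One line that matches the pattern and one that does not.
input=$(printf '%s\n' 'repo-2019-12-31-14-30-11.gz' 'notes.txt')
script='s/^([[:alnum:]]+)-[[:digit:]]{4}(-[[:digit:]]{2}){5}\.gz$/\1/p'

# With -n only lines where the substitution succeeded are printed.
with_n=$(printf '%s\n' "$input" | sed -En "$script")

# Without -n every line is auto-printed, and matching lines are printed twice.
without_n=$(printf '%s\n' "$input" | sed -E "$script")

printf 'with -n:\n%s\n\nwithout -n:\n%s\n' "$with_n" "$without_n"
```

With -n the output is just repo; without it, repo appears twice and notes.txt is passed through as well.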
Looks like a job for GNU grep :
echo "repo-2019-12-31-14-30-11.gz" | grep -oP '^\K[[:alpha:]-]+'
Displays :
repo-
On this example :
echo "repo-repo-2019-12-31-14-30-11.gz" | grep -oP '^\K[[:alpha:]-]+'
Displays :
repo-repo-
Which I think is what you want because you tried with [\w-]+ on your regex.
If I'm wrong, just replace the grep command with : grep -oP '^\K\w+'

sed misbehaving?

I have the following command:
$ xlscat -i $file
and I get:
Excel File Name.xlsx - 01: [ Sheet #1 ] 34 Cols, 433 Rows
Excel File Name.xlsx - 02: [ Sheet Number2 ] 23 Cols, 32 Rows
Excel File Name.xlsx - 03: [ Foo Factor! ] 14 Cols, 123 Rows
I want just the sheet name, so i do this:
$ xlscat -i $file 2>&1 | sed -e 's/.*\[ *\(.*\) *\].*/\1/' | while read file
> do
> echo "File: '$file'"
> done
And get this:
File: 'Sheet #1'
File: 'Sheet Number2'
File: 'Foo Factor!'
Great! Everything works beautifully. As you can see with the single quotes, I've removed the extra spaces at the end of the file name. Now convert all remaining spaces to underscores:
$ xlscat -i $file 2>&1 | sed -e 's/.*\[ *\(.*\) *\].*/\1/' | sed -e 's/ /_/g' | while read file
> do
> echo "File: '$file'"
> done
Now I get this:
File: 'Sheet_#1_____'
File: 'Sheet_Number2'
File: 'Foo_Factor!__'
Huh? The first one didn't show any trailing blanks, but the second one seems to be appending underscores on the end of the file. What am I not seeing?
The first sed command is not stripping the trailing whitespace, read is. Check your expression:
sed -e 's/.*\[ *\(.*\) *\].*/\1/'
It matches:
anything
a bracket
zero or more spaces
anything, captured
zero or more spaces
a right bracket
anything
The regular expressions are greedy, meaning that they match as much as possible, and the earlier expressions will match before later ones do. So for example, the regular expression (.*)(.*) matches anything in two capturing groups, but there are any number of ways the data could be split between the two groups. So the regex implementation has to choose, and it will put as much as possible in the first, and nothing in the second. Here, the captured .* swallows the trailing spaces as well, leaving the " *" after it to match nothing.
Since you need to match filenames with spaces in them, you can't match "anything except a space"; your best bet is to trim the trailing whitespace as a separate step. Try this sed command instead:
sed -e 's/.*\[ *\(.*\) *\].*/\1/' -e 's/ *$//'
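To check that the capture really does include the padding, the result can be wrapped in angle brackets before anything else gets a chance to trim it. A minimal sketch: the sample line is rebuilt with printf so that exactly five spaces follow the sheet name, which is an assumption about the real xlscat output.

```shell
#!/bin/sh
# Rebuild one xlscat-style line with five padding spaces before the "]".
line=$(printf 'Excel File Name.xlsx - 01: [ Sheet #1%5s] 34 Cols, 433 Rows' '')

# Original expression: the greedy capture keeps the padding.
raw=$(printf '%s\n' "$line" | sed -e 's/.*\[ *\(.*\) *\].*/\1/')

# With the extra trimming step suggested above.
trimmed=$(printf '%s\n' "$line" | sed -e 's/.*\[ *\(.*\) *\].*/\1/' -e 's/ *$//')

printf '<%s>\n<%s>\n' "$raw" "$trimmed"
```

The first line of output shows the five trailing spaces inside the brackets, the second shows them gone.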
I think the read file is trimming the trailing whitespace for you. Try putting the
sed -e 's/ /_/g'
inside the while loop ... like:
echo "File: $(echo "$file" | sed -e 's/ /_/g')"
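The trimming done by read can be confirmed directly. In this sketch the padded value is invented for illustration; a plain read, as used in the question, is compared with IFS= read -r, which keeps leading and trailing whitespace.

```shell
#!/bin/sh
padded='Sheet #1     '

# Plain read: default IFS splitting strips leading and trailing whitespace.
default_read=$(printf '%s\n' "$padded" | { read name; printf '%s' "$name"; })

# IFS= read -r: the line is stored exactly as it was read.
raw_read=$(printf '%s\n' "$padded" | { IFS= read -r name; printf '%s' "$name"; })

printf '<%s>\n<%s>\n' "$default_read" "$raw_read"
```

This is why the padding only became visible once the spaces had been turned into underscores before the loop: by then there was no whitespace left for read to strip.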
Could it be echo that's stripping the trailing spaces? Although it does seem like they should show up inside the quotes. Anyway, try this:
sed -e 's/.*\[ *\([^] ]\+\( \+[^] ]\+\)*\).*/\1/'
Each word of the sheet name is matched by [^] ]\+ (i.e., one or more of any characters other than space or ]). When the final word of the name has been matched, the second .* consumes the rest of the line. There's no need to match the closing ], so the trailing spaces don't have to be included in the match.
I'm not a sed user, but this regex works correctly in RegexBuddy when I specify the GNU-BRE flavor, so it should work in sed.