Sed subexpressions not working as expected - regex

I am trying to make a simple wikitext parser using sed/bash. When I run
echo "London has [[public transport]]" | sed s/\\[\\[[A-Za-z0-9\ ]*\\]\\]/link/
it gives me London has link
but when I try to use marked subexpressions to get the contents of the brackets using
sed s/\\[\\[\([A-Za-z0-9\ ]*\)\\]\\]/\1/
it just gives me London has [[public transport]]

That's because the regex doesn't match.
Since your sed expression is not quoted, every backslash has to be escaped for the shell, which is why you have \\[ instead of \[.
In sed's default regex flavor (basic regular expressions), capturing groups are written \( and \). Because the expression is unquoted, you need to escape that backslash for the shell, and because bash also interprets parentheses, you have to escape them too:
echo "London has [[public transport]]" | sed s/\\[\\[\\\([A-Za-z0-9\ ]*\\\)\\]\\]/\\1/
I strongly recommend you just enclose your sed expression in single quotes for ease of writing:
echo "London has [[public transport]]" | sed 's/\[\[\([A-Za-z0-9\ ]*\)\]\]/\1/'
Much easier right?
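As a quick check of the quoting advice above, here is a minimal sketch that runs the single-quoted expression in both BRE and ERE form. The variable names and the second sample line with two links are my own additions for illustration, and sed -E is assumed to be available (it is in GNU and BSD sed).

```shell
# Sample inputs (the second one is made up to show the g flag).
one='London has [[public transport]]'
two='[[London]] has [[public transport]]'

# BRE (sed default): groups are \( \), literal brackets are \[ \].
bre=$(printf '%s\n' "$one" | sed 's/\[\[\([A-Za-z0-9 ]*\)\]\]/\1/')

# ERE (-E): groups are plain ( ), literal brackets are still escaped.
ere=$(printf '%s\n' "$one" | sed -E 's/\[\[([A-Za-z0-9 ]*)\]\]/\1/')

# The g flag replaces every link on the line, not just the first.
all=$(printf '%s\n' "$two" | sed 's/\[\[\([A-Za-z0-9 ]*\)\]\]/\1/g')

printf '%s\n' "$bre" "$ere" "$all"
```

All three commands should print London has public transport. Because the script is in single quotes, the shell passes it to sed untouched, so no extra layer of escaping is needed.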

echo "London has [[public transport]]" | sed 's#[[][[]\([A-Za-z0-9\ ]*\)[]][]]#\1#'
output
London has public transport
works on my machine.
I hope this helps.

Related

How to use grep/sed/awk, to remove a pattern from beginning of a text file

I have a text file with the following pattern written to it:
TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"
I would like to discard the first part of each line containing
TIME[32.468ms] -(3)-.............
To test the regular expression I've tried the following:
cat myfile.txt | egrep "^TIME\[.*\]\s\s\-\(3\)\-\.+"
This identifies correctly the lines I want. Now, to delete the pattern I've tried:
cat myfile.txt | sed s/"^TIME\[.*\]\s\s\-\(3\)\-\.+"//
but it just seems to be doing the cat, since it shows the content of the complete file and no substitution happens.
What am I doing wrong?
OS: CentOS 7
With your shown samples, please try the following grep command. Written and tested with GNU grep.
grep -oP '^TIME\[\d+\.\d+ms\]\s+-\(\d+\)-\.+\K.*' Input_file
Explanation: Adding detailed explanation for above code.
^TIME\[ ##Matching string TIME from starting of value here.
\d+\.\d+ms\] ##Matching digits(1 or more occurrences) followed by dot digits(1 or more occurrences) followed by ms ] here.
\s+-\(\d+\)-\.+ ##Matching spaces (1 or more occurrences) followed by - digits (1 or more occurrences) - and 1 or more dots.
\K ##Using \K option of GNU grep to make sure previous match is found in line but don't consider it in printing, print next matched regex part only.
.* ##to match till end of the value.
2nd solution: Adding awk program here.
awk 'match($0,/^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+/){print substr($0,RSTART+RLENGTH)}' Input_file
Explanation: using match function of awk, to match regex ^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+ which will catch text which we actually want to remove from lines. Then printing rest of the text apart from matched one which is actually required by OP.
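To make the match()/substr() mechanics concrete, here is a small sketch on the sample line. It is only an illustration: the variable names are mine, and I wrote the whitespace as a plain " +" instead of [[:space:]]+ on the assumption that only spaces occur there.

```shell
line='TIME[32.468ms]  -(3)-............."TEXT I WANT TO KEEP"'

# match() sets RSTART (1-based start) and RLENGTH (length) of the match;
# substr() starting at RSTART+RLENGTH is everything after the matched prefix.
kept=$(printf '%s\n' "$line" |
  awk 'match($0, /^TIME\[[0-9]+\.[0-9]+ms\] +-\([0-9]+\)-\.+/) {
         print substr($0, RSTART + RLENGTH)
       }')

printf '%s\n' "$kept"
```

Lines that do not match the prefix are not printed at all, because the print only runs when match() returns a non-zero position.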
This awk solution uses the sub() function:
awk 'sub(/^TIME[[][^]]*].*\.+/,"")' file
"TEXT I WANT TO KEEP"
sub() returns the number of substitutions made (0 or 1), so the condition is true, and the line is printed, only when a replacement happened.
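A short sketch of that filtering effect, assuming a made-up second line that does not start with TIME. The brackets are spelled \[ and \] here, which should be equivalent to the [[] form above.

```shell
# Two sample lines; the second is hypothetical and should be dropped.
out=$(printf '%s\n' \
  'TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"' \
  'some unrelated line' |
  awk 'sub(/^TIME\[[^]]*\].*\.+/, "")')

printf '%s\n' "$out"
```

Only the first line survives, with its prefix removed. Note that .*\.+ is greedy and runs to the last dot on the line, so this relies on the kept text containing no dots.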
$ cut -d'"' -f2 file
TEXT I WANT TO KEEP
You may use:
s='TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"'
sed -E 's/^TIME\[[^]]*].*\.+//' <<< "$s"
"TEXT I WANT TO KEEP"
The \s regex extension may not be supported by your sed.
In BRE syntax (which is what sed speaks out of the box) you do not backslash round parentheses - doing that turns them into regex metacharacters which do not match themselves, somewhat unintuitively. Also, + is just a regular character in BRE, not a repetition operator (though you can turn it into one by similarly backslashing it: \+).
You can try adding an -E option to switch from BRE syntax to the perhaps more familiar ERE syntax, but that still won't enable Perl regex extensions, which are not part of ERE syntax, either.
sed 's/^TIME\[[^][]*\][[:space:]][[:space:]]-(3)-\.*//' myfile.txt
should work on any reasonably POSIX sed. (Notice also how the minus character does not need to be backslash-escaped, though doing so is harmless per se. Furthermore, I tightened up the regex for the square brackets, to prevent the "match anything" regex you had .* from "escaping" past the closing square bracket. In some more detail, [^][] is a negated character class which matches any character which isn't (a newline or) ] or [; they have to be specified exactly in this order to avoid ambiguity in the character class definition. Finally, notice also how the entire sed script should normally be quoted in single quotes, unless you have specific reasons to use different quoting.)
If you have sed -E or sed -r you can use + instead of * but then this complicates the overall regex, so I won't suggest that here.
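The BRE/ERE difference described above can be seen side by side in the following sketch. The variable names are mine, the sample line uses two spaces as in the question's regex, and sed -E is assumed to be supported.

```shell
line='TIME[32.468ms]  -(3)-............."TEXT I WANT TO KEEP"'

# BRE: \(3\) is a capture group matching "3", and + is a literal plus,
# so this pattern never matches and the line passes through unchanged.
unchanged=$(printf '%s\n' "$line" | sed 's/^TIME\[.*\]  -\(3\)-\.+//')

# BRE: bare ( ) are literal, and * is the repetition operator.
bre=$(printf '%s\n' "$line" | sed 's/^TIME\[[^][]*\]  *-(3)-\.*//')

# ERE (-E): ( ) must now be escaped to be literal, and + repeats.
ere=$(printf '%s\n' "$line" | sed -E 's/^TIME\[[^][]*\] +-\(3\)-\.+//')

printf '%s\n' "$unchanged" "$bre" "$ere"
```

The first command echoes the input line as is, which is the "it just seems to be doing the cat" symptom from the question; the other two print only the quoted text.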
A simpler one for sed:
sed 's/^[^"]*//' myfile.txt
If the text you want to keep is always surrounded by double quotes like this, and those are the only quotes on the lines starting with "TIME...", then:
sed -n '/^TIME/p' file | awk -F'"' '{print $2}'
should select the lines starting with "TIME..." and print the text within the quotes.
Thanks all, for your help.
By the end, I've found a way to make it work:
echo 'TIME[32.468ms] -(3)-.............TEXT I WANT TO KEEP' | grep TIME | sed -r 's/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//'
More generally,
grep TIME myfile.txt | sed -r 's/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//'
Cheers,
Pedro

Bash script to enclose words in single quotes

I'm trying to write a bash script to enclose words contained in a file with single quotes.
Word - Hello089
Result - 'Hello089',
I tried the following regex but it doesn't work. This works in Notepad++ with find and replace. I'm not sure how to tweak it to make it work in bash scripting.
sed "s/(.+)/'$1',/g" file.txt > result.txt
Replacement backreferences (also called placeholders) are defined with \n syntax, not $n (this is perl-like backreference syntax).
Note that you do not need groups here, since you want to wrap whole lines in single quotation marks. For the same reason you do not need the g flag, which is only necessary when you expect multiple matches on the same line.
You can use the following POSIX BRE and ERE (the one with -E) solutions:
sed "s/..*/'&',/" file.txt > result.txt
sed -E "s/.+/'&',/" file.txt > result.txt
In the POSIX BRE (first) solution, ..* matches any char and then any 0 or more chars (thus emulating .+ common PCRE pattern). The POSIX ERE (second) solution uses .+ pattern to do the same. The & in the right-hand side is used to insert the whole match (aka \0). Certainly, you may enclose the whole match with capturing parentheses and then use \1, but that is redundant:
sed "s/\(..*\)/'\1',/" file.txt > result.txt
sed -E "s/(.+)/'\1',/" file.txt > result.txt
Note the escaping: capturing parentheses must be backslash-escaped in POSIX BRE.
See the online sed demo.
s="Hello089";
sed "s/..*/'&',/" <<< "$s"
# => 'Hello089',
sed -E "s/.+/'&',/" <<< "$s"
# => 'Hello089',
$1 is expanded by the shell before sed sees it, but it's the wrong back reference anyway. You need \1. You also need to escape the parentheses that define the capture group. Because the sed script is enclosed in double quotes, a backslash can be doubled to make sure it reaches sed; this is only strictly required before $, `, ", or another backslash, so \1 below works as written.
$ echo "Hello089" | sed "s/\\(.*\\)/'\1',/g"
'Hello089',
(I don't recall if there is a way to specify single quotes using an ASCII code instead of a literal ', which would allow you to use single quotes around the script.)
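To illustrate the shell-expansion point, here is a small sketch. The value given to the shell's own $1 is hypothetical and only there to make the expansion visible; the last command shows the common '\'' idiom for putting a literal single quote into a single-quoted script.

```shell
word='Hello089'

# Give the shell its own $1 so the effect is visible (hypothetical value).
set -- oops

# Double quotes: the shell replaces $1 before sed ever runs.
broken=$(printf '%s\n' "$word" | sed -E "s/(.+)/'$1',/")

# \1 is left alone by the shell inside double quotes and reaches sed intact.
fixed=$(printf '%s\n' "$word" | sed -E "s/(.+)/'\1',/")

# Single-quoted script: '\'' closes the quote, adds a literal ', reopens it.
single=$(printf '%s\n' "$word" | sed 's/..*/'\''&'\'',/')

printf '%s\n' "$broken" "$fixed" "$single"
```

The first command prints 'oops', because sed never sees a back reference at all; the other two print 'Hello089',.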

Regex for string with double quotes and environment variable

Using sed, I need to replace a string that contains double quotes with an environment variable:
BUCKET_FOLDER=\"dev\"
(or any derivative of 'dev') needs to convert to:
BUCKET_FOLDER=bucket1/$ID
where $ID = abcde, ie
BUCKET_FOLDER=bucket1/abcde
To expand the $ID environment variable, I need to put double quotes around the sed substitution expression:
sed -e "s/BUCKET_FOLDER=\\"(.*?)\\"/BUCKET_FOLDER=bucket1\/$ID/g" $string
but this is then preventing a match on the double quotes in the source string.
Would appreciate any advice. I can make it work with 2 steps, but would prefer 1.
ID=abcde
echo 'A=\"x\" BUCKET_FOLDER=\"dev\" B=\"y\"' |sed -r "s|(.*)(BUCKET_FOLDER=)([^ ]+)(.*)|\1\2bucket1/$ID\4|g"
A=\"x\" BUCKET_FOLDER=bucket1/abcde B=\"y\"
Using | as the separator in sed. As you mentioned, double quotes are used so that $ID expands, and BUCKET_FOLDER= is captured as the second group.
I believe you escaped the quote correctly in the sed command. On the other hand, the way you specified the rest of the regex isn't how it's supposed to be. Here is my take on the story:
echo BUCKET_FOLDER='"'dev'"' | \
sed -e "s/BUCKET_FOLDER=\"\(.*\?\)\"/BUCKET_FOLDER=bucket1\/$ID/g"
Explanation:
I made the assumption that the \" parts in your $string variable are just escape sequences. Thus I used the echo BUCKET_FOLDER='"'dev'"' command. Its output is BUCKET_FOLDER="dev".
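Your .*? is written as .*\? here, i.e. with the question mark escaped. Be aware that sed has no non-greedy quantifiers: in GNU sed's basic syntax \? means "zero or one of the preceding item", so .*\? is still greedy. It works here only because the line contains a single quoted value.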
You need to escape the parentheses too: \(...\). This should work without the group too, because you don't use backreferences like \1.
Alternatives:
If you want to eliminate the capturing group, then the sed expression becomes this:
sed -e "s/BUCKET_FOLDER=\".*\?\"/BUCKET_FOLDER=bucket1\/$ID/g"
If the backslash is really part of your string, you can match them with the character class [\] :
echo BUCKET_FOLDER=\\'"'dev\\'"' | \
sed -e "s/BUCKET_FOLDER=[\]\".*\?[\]\"/BUCKET_FOLDER=bucket1\/$ID/g"
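When a line can hold more than one quoted value, a bounded class such as [^"]* is a common way to stop at the next quote. The sketch below is one possible variant, not the only way to do it: it assumes the quotes in the data are plain " characters without backslashes, that $ID contains no | or & characters, and it uses | as the delimiter so the / in the replacement needs no escaping.

```shell
ID=abcde
line='A="x" BUCKET_FOLDER="dev" B="y"'

# Greedy .* runs to the LAST quote on the line and swallows B="y".
greedy=$(printf '%s\n' "$line" |
  sed "s|BUCKET_FOLDER=\".*\"|BUCKET_FOLDER=bucket1/$ID|")

# [^"]* cannot cross a quote, so the match ends at the closing quote of "dev".
bounded=$(printf '%s\n' "$line" |
  sed "s|BUCKET_FOLDER=\"[^\"]*\"|BUCKET_FOLDER=bucket1/$ID|")

printf '%s\n' "$greedy" "$bounded"
```

The greedy version prints A="x" BUCKET_FOLDER=bucket1/abcde, while the bounded one keeps the trailing B="y".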

Simple replacement with sed inside bash not working

Why is this simple replacement with sed inside bash not working?
echo '[a](!)' | sed 's/[a](!)/[a]/'
It returns [a](!) instead of [a]. But why, given that only three characters need to be escaped in a sed replacement string?
If I account for the case that additional characters need to be replaced in the regex string and try
echo '[a](!)' | sed 's/\[a\]\(!\)/[a]/'
it is still not working.
The point is that [a] in the regex pattern is a bracket expression that matches the single character a, not a literal [a] with its square brackets. Escape the first [ so that it is parsed as a literal [ symbol, and your replacement will work:
echo '[a](!)' | sed 's/\[a](!)/[a]/'
                       ^^
See this demo
sed uses BREs by default, and EREs can be enabled by escaping individual ERE metacharacters or by using the -E argument. [ and ] are BRE metacharacters, ( and ) are ERE metacharacters. When you wrote:
echo '[a](!)' | sed 's/\[a\]\(!\)/[a]/'
you were turning the [ and ] BRE metacharacters into literals, which is good, but you were turning the literal ( and ) into ERE metacharacters, which is bad. This is what you were trying to do:
echo '[a](!)' | sed 's/\[a\](!)/[a]/'
which you'd probably really want to write using a capture group:
echo '[a](!)' | sed 's/\(\[a\]\)(!)/\1/'
to avoid duplicating [a] on both sides of the substitution. With EREs enabled using the -E argument that last would be:
echo '[a](!)' | sed -E 's/(\[a\])\(!\)/\1/'
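For comparison, the variants discussed above can be run back to back. This is just a sketch with variable names of my own choosing; the first two commands are the ones from the question and should leave the input untouched.

```shell
s='[a](!)'

# [a] is a bracket expression matching "a", so the pattern looks for "a(!)".
plain=$(printf '%s\n' "$s" | sed 's/[a](!)/[a]/')

# Brackets are literal now, but \( \) became a group: pattern is "[a]!".
grouped=$(printf '%s\n' "$s" | sed 's/\[a\]\(!\)/[a]/')

# BRE with literal brackets and literal (unescaped) parentheses.
bre=$(printf '%s\n' "$s" | sed 's/\(\[a\]\)(!)/\1/')

# ERE: the escaping of the parentheses flips.
ere=$(printf '%s\n' "$s" | sed -E 's/(\[a\])\(!\)/\1/')

printf '%s\n' "$plain" "$grouped" "$bre" "$ere"
```

The first two print [a](!) unchanged and the last two print [a].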
Read the sed man page and a regexp tutorial.
man echo says that the echo command displays a line of text, so [ and ( with their closing counterparts are just text.
If you read man grep and search there for /^\ *Character Classes and Bracket Expressions and /^\ *Basic vs Extended Regular Expressions, you can read about the difference. sed and other tools that use regex interpret these characters as character classes and bracket expressions.
You can try this
$ echo '[a](!)' | sed 's/(!)//'

Printing a matched regexp with sed

So I'm trying to match a regexp with any string in the middle of it and then print out just that string. The syntax is sort of like this...
sed -n 's/<title>.*<\/title>/"what do I put here"/p' input.file
and I just want to print out whatever .* is where I typed "what do I put here". I'm not very comfortable with sed at this point so this is likely a very simple answer and I'm having trouble finding one in any of the other questions. Thanks in advance!
Capture the pattern you want to extract within \(...\), and then you can refer to it as \1 in the replacement string:
sed -n 's/<title>\(.*\)<\/title>/\1/p' input.file
You can have multiple \(...\) expressions, and refer to them with \1, \2, \3, and so on.
If you have the GNU version of sed, or gsed, then you could simplify a bit:
sed -rn 's/<title>(.*)<\/title>/\1/p' input.file
With the -r flag, sed can use "extended regular expressions", which in practice lets you write (...) instead of \(...\), + instead of \+, and other goodies.
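One detail worth trying out: s/// only replaces the matched part, so anything else on the line is printed too, and a / inside the pattern has to be escaped unless another delimiter is chosen. The sketch below uses a made-up HTML line and | as the delimiter; the surrounding .* is my own addition to drop the rest of the line, and it assumes one title per line.

```shell
html='<head><title>My Page</title></head>'

# Escaped slash with the usual / delimiter; only the matched part is replaced.
partial=$(printf '%s\n' "$html" | sed -n 's/<title>\(.*\)<\/title>/\1/p')

# | as delimiter avoids escaping /; leading and trailing .* drop the rest.
title=$(printf '%s\n' "$html" | sed -n 's|.*<title>\(.*\)</title>.*|\1|p')

# Same idea with extended regular expressions (-E, or -r on GNU sed).
title_ere=$(printf '%s\n' "$html" | sed -En 's|.*<title>(.*)</title>.*|\1|p')

printf '%s\n' "$partial" "$title" "$title_ere"
```

The first command prints <head>My Page</head>, while the other two print just My Page.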