Change text in first brackets (parentheses) that are preceded by third/square brackets - regex

I want to append some text (.html) at the end of line in format [some_text](some/other/text); basically links in markdown syntax.
Example
[Test](test/link) would be [Test](test/link.html)
[Test](test/link1/link) would be [Test](test/link1/link.html)
[Test] would be [Test]
(test) would be (test)
I tried Unix sed with the syntax sed -i 's/\[*\](*)/.html)/g' filename.md, but it is wrong and does not work. Can someone help? I'm open to using other tools like awk or perl if they are appropriate for this scenario.
Solution: sed -i 's/\(\[[^][]*]([^()]*\))/\1.html)/g' filename.md
Suggested by #WiktorStribiżew

Based on given samples:
$ cat ip.txt
[Test](test/link)
[Test](test/link1/link)
[Test]
(test)
# if closing ) should also be matched: sed -E 's/(\[[^]]+]\([^)]+)\)/\1.html)/'
$ sed -E 's/\[[^]]+]\([^)]+/&.html/' ip.txt
[Test](test/link.html)
[Test](test/link1/link.html)
[Test]
(test)
\[ match [
[^]]+ match one or more non ] characters
]\( match ](
[^)]+ match one or more non ) characters
& backreferences entire matched portion
\1 backreferences portion matched by first capture group
With perl:
perl -pe 's/\[[^]]+]\([^)]+\K(?=\))/.html/'
\K helps to avoid capturing the text matched until that point
(?=\)) is a lookahead assertion to match ) character, this also is not part of the matched portion
Add the -i option to either solution once it is working as expected.

You can use
sed -i 's/\(\[[^][]*]([^()]*\))/\1.html)/g' filename.md
The regex is a POSIX BRE expression that matches
\(\[[^][]*]([^()]*\) - Group 1:
\[ - a [ char
[^][]* - zero or more chars other than [ and ]
] - a ] char
( - a ( char
[^()]* - zero or more chars other than ( and )
) - a ) char.
The -i option makes GNU sed perform the replacements in place, in the same file that is given as input. The g flag replaces all matches on each line.
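As a minimal sketch of the g flag at work, the same pattern can be wrapped in a small filter and tried on a line with more than one link before touching any file. The function name append_html is hypothetical, and the sketch uses the -E (ERE) spelling of the pattern instead of the BRE one above, reading from standard input rather than editing in place.

```shell
#!/bin/sh
# Hypothetical helper: append .html to the target of every [text](target)
# link read from stdin. ERE syntax: \( and \) are literal parentheses here,
# and the unescaped ( ... ) pair forms capture group 1.
append_html() {
    sed -E 's/(\[[^][]*]\([^()]*)\)/\1.html)/g'
}

# Two links on one line; a bare [C] and a bare (e) are left alone.
printf '%s\n' 'See [A](a/b) and [B](c/d), but not [C] or (e).' | append_html
# => See [A](a/b.html) and [B](c/d.html), but not [C] or (e).
```

Once the output looks right, the same expression can be passed to sed -i on the real file.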

Related

Regex to match exact version phrase

I have versions like:
v1.0.3-preview2
v1.0.3-sometext
v1.0.3
v1.0.2
v1.0.1
I am trying to get the latest version that is not a preview (doesn't have text after the version number), so the result should be:
v1.0.3
I used this grep: grep -m1 "[v\d+\.\d+.\d+$]"
but it still outputs: v1.0.3-preview2
What could I be missing here?
To return first match for pattern v<num>.<num>.<num>, use:
grep -m1 -E '^v[0-9]+(\.[0-9]+){2}$' file
v1.0.3
If your input file is unsorted, use grep | sort -rV | head:
grep -E '^v[0-9]+(\.[0-9]+){2}$' file | sort -rV | head -1
When you use ^ or $ inside [...], they are not anchors: $ is a literal character there, and ^ is either literal or, as the first character, negates the set.
RegEx Details:
^: Start
v: Match v
[0-9]+: Match 1+ digits
(\.[0-9]+){2}: Match a dot followed by 1+ digits. Repeat this group 2 times
$: End
To match the digits with grep, you can use
grep -m1 "v[[:digit:]]\+\.[[:digit:]]\+\.[[:digit:]]\+$" file
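Note that you don't need the [ and ] in your pattern, and that you need to escape the dot to match it literally. \d is a Perl-style shorthand that grep's default syntax does not support, hence [[:digit:]].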
With awk, you could try the following:
awk 'match($0,/^v[0-9]+(\.[0-9]+){2}$/){print;exit}' Input_file
Explanation: the match function tests each line against the version regex; on the first match, awk prints that line and exits.
Regular expressions match substrings, not whole strings. You need to explicitly match the start (^) and end ($) of the pattern.
Keep in mind that $ has special meaning in double quoted strings in shell scripts and needs to be escaped.
The anchors need to be outside of any bracket expression ([]).
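Putting the anchored pattern and the version sort together, a minimal sketch could look like the following. The function name latest_release is hypothetical, and the sketch assumes GNU sort, since the -V (version sort) option is not part of POSIX.

```shell
#!/bin/sh
# Hypothetical helper: print the highest plain vX.Y.Z version read from stdin.
# Assumes GNU sort for -V; lines with a suffix such as -preview2 are dropped
# by the anchored grep before sorting.
latest_release() {
    grep -E '^v[0-9]+(\.[0-9]+){2}$' | sort -rV | head -n 1
}

# Unsorted input: version sort ranks v1.0.10 above v1.0.3.
printf '%s\n' v1.0.3-preview2 v1.0.2 v1.0.10 v1.0.3 | latest_release
# => v1.0.10
```

Sorting after filtering means the result does not depend on the order of the input lines, unlike grep -m1.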

Regex: select each occurrence of a character up until another character

I have a couple of lines in a document which looks something like that:
foo-bar-foo[Foo - Bar]
I'd like to select every - character up until the first [ bracket on every line. Thus the - in the square brackets shouldn't be selected.
How can I achieve that with a Regex?
I already have this regex /.+?(?=\[)/g, which selects every character until the first [ but I only want the -.
Edit: I want to replace these selected characters with the sed command (GNU).
You can use
sed -E ':a; s/^([^[-]+)-/\1/; ta'
See an online demo:
#!/bin/bash
s='foo-bar-foo[Foo - Bar]'
sed -E ':a; s/^([^[-]+)-/\1/; ta' <<< "$s"
# => foobarfoo[Foo - Bar]
Details:
-E - enabling POSIX ERE syntax (so that there is no need to escape capturing parentheses and the + quantifier)
:a - an a label
s/^([^[-]+)-/\1/ - finds one or more chars other than [ and - from the start of string capturing this substring into Group 1 (\1) and then matches a - char
ta - jumps back to the a label upon a successful replacement, so the substitution repeats until no - is left before the first [
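To see the loop behave on more than one line, here is a small sketch of the same command wrapped in a filter. The function name remove_hyphens_before_bracket is hypothetical, and the script is split into separate -e expressions, which is simply a more portable way of writing the label and branch than the single ;-separated string.

```shell
#!/bin/sh
# Hypothetical helper: delete every "-" that occurs before the first "["
# on each line read from stdin.
remove_hyphens_before_bracket() {
    # :a defines the label; ta branches back to it after each successful
    # substitution, so one hyphen is removed per pass until none match.
    sed -E -e ':a' -e 's/^([^[-]+)-/\1/' -e 'ta'
}

printf '%s\n' 'foo-bar-foo[Foo - Bar]' 'a-b[c-d]-e' | remove_hyphens_before_bracket
# => foobarfoo[Foo - Bar]
# => ab[c-d]-e
```

Hyphens inside or after the brackets survive because [^[-]+ can never cross the first [.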

Regexp or Grep in Bash

Can you please tell me how to get the token value correctly? At the moment I am getting: "1jdq_dnkjKJNdo829n4-xnkwe",258],["FbtResult
echo '{"facebookdotcom":true,"messengerdotcom":false,"workplacedotcom":false},827],["DTSGInitialData",[],{"token":"aaaaaaa"},258],["FbtResult' | sed -n 's/.*"token":\([^}]*\)\}/\1/p'
You need to match the full string, and to get rid of double quotes, you need to match a " before the token and use a negated bracket expression [^"] instead of [^}]:
sed -n 's/.*"token":"\([^"]*\).*/\1/p'
Details:
.* - any zero or more chars
"token":" - a literal "token":" string
\([^"]*\) - Group 1 (\1 refers to this value): any zero or more chars other than "
.* - any zero or more chars.
This replacement works:
echo '{"facebookdotcom":true,"messengerdotcom":false,"workplacedotcom":false},827],["DTSGInitialData",[],{"token":"aaaaaaa"},258],["FbtResult' | \
sed -n 's/.*"token":"\([a-z]*\)"\}.*/\1/p'
The token value between the quotes is captured via \([a-z]*\), followed by the closing quote and brace "\} and then the remaining characters as .* (this last part was missing from your command, which is why the output included the text after the token).
Output:
aaaaaaa
A grep solution:
echo '{"facebookdotcom":true,"messengerdotcom":false,"workplacedotcom":false},827],["DTSGInitialData",[],{"token":"aaaaaaa"},258],["FbtResult' | grep -Po '(?<="token":")[^"]+'
yields
aaaaaaa
The -P option to grep enables the Perl-compatible regex (PCRE).
The -o option tells grep to print only the matched substring, not the entire line.
The regex (?<="token":") is a PCRE-specific feature called a zero-width positive lookbehind assertion. The expression (?<=pattern) matches a pattern without including it in the matched result.
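For systems where grep -P is not available, the sed approach can be wrapped in a small filter. This is a sketch only: the function name extract_token is hypothetical, and it uses [^"]* for the value, which also accepts the digits, underscores and hyphens seen in the token from the question (a class such as [a-z]* would not match those).

```shell
#!/bin/sh
# Hypothetical helper: print the value of the "token" key from each input
# line that has one. Because the leading .* is greedy, the last "token"
# on a line wins if there are several.
extract_token() {
    sed -n 's/.*"token":"\([^"]*\)".*/\1/p'
}

printf '%s\n' '["DTSGInitialData",[],{"token":"1jdq_dnkjKJNdo829n4-xnkwe"},258],["FbtResult' | extract_token
# => 1jdq_dnkjKJNdo829n4-xnkwe
```

Lines without a "token" key print nothing, because -n suppresses default output and p only fires on a successful substitution.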

sed -> replace fixed text and parenthesis from string

How to bring this expression
echo "ObjectId(5e257e424ed10b0015e3e780),'qwe',ObjectId(5e257e424ed10b0015e3e780),()"
to this
5e257e424ed10b0015e3e780,'qwe',5e257e424ed10b0015e3e780,()
using sed?
I use this:
echo "ObjectId(5e257e424ed10b0015e3e780),'qwe',ObjectId(5e257e424ed10b0015e3e780),()" | \
sed 's/ObjectId(\([a-z0-9]\)/\1/'
You may use
sed 's/ObjectId(\([[:alnum:]]*\))/\1/g'
The POSIX BRE pattern means:
ObjectId( - matches a literal string
\([[:alnum:]]*\) - Group 1: zero or more alphanumeric chars
) - a literal ).
The \1 replacement will keep the Group 1 value only.
The g flag will replace all occurrences.
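A quick way to check the command against the sample from the question is to wrap it in a filter, as sketched below. The function name strip_objectid is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: replace every ObjectId(<alnum>) with just <alnum>.
# In BRE, plain ( and ) are literal; \( ... \) is the capture group.
strip_objectid() {
    sed 's/ObjectId(\([[:alnum:]]*\))/\1/g'
}

printf '%s\n' "ObjectId(5e257e424ed10b0015e3e780),'qwe',ObjectId(5e257e424ed10b0015e3e780),()" | strip_objectid
# => 5e257e424ed10b0015e3e780,'qwe',5e257e424ed10b0015e3e780,()
```

The trailing () is untouched because it is not preceded by the literal text ObjectId.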

Sed removing after ip

I have a simple sed question.
I have data like this:
boo:moo:127.0.0.1--¹óÖÝÊ¡µçÐÅ
foo:joo:127.0.0.1 ÁÉÄþÊ¡ÉòÑôÊвʺçÍø°É
How do I make it like this:
boo:moo:127.0.0.1
foo:joo:127.0.0.1
My sed code
sed -e 's/\.[^\.]*$//' test.txt
Thanks!
For the given sample, you could capture everything from start of line till last digit in the line
$ sed 's/\(.*[0-9]\).*/\1/' ip.txt
boo:moo:127.0.0.1
foo:joo:127.0.0.1
$ grep -o '.*[0-9]' ip.txt
boo:moo:127.0.0.1
foo:joo:127.0.0.1
Or, you could delete all non-digit characters at end of line
$ sed 's/[^0-9]*$//' ip.txt
boo:moo:127.0.0.1
foo:joo:127.0.0.1
You may find an IP like substring and remove all after it:
sed -E 's/([0-9]{1,3}(\.[0-9]{1,3}){3}).*/\1/' # POSIX ERE version
sed 's/\([0-9]\{1,3\}\(\.[0-9]\{1,3\}\)\{3\}\).*/\1/' # BRE POSIX version
The ([0-9]{1,3}(\.[0-9]{1,3}){3}) pattern is a simplified IP address regex pattern that matches and captures 1 to 3 digits and then 3 occurrences of a dot and again 1 to 3 digits, and then .* matches and consumes the rest of the line. The \1 placeholder in the replacement pattern inserts the captured value back into the result.
Note that in the BRE POSIX pattern, you have to escape ( and ) to make them a capturing group construct and you need to escape {...} to make it a range/interval/limiting quantifier (it has lots of names in the regex literature).
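As a sketch of how the IP-based pattern differs from the simpler "last digit" approach, the ERE version can be tried on lines whose trailing text itself contains digits. The function name trim_after_ip is hypothetical, and the sample lines use plain ASCII stand-ins for the trailing characters in the question.

```shell
#!/bin/sh
# Hypothetical helper: keep each line up to and including the first
# IPv4-like substring and drop everything after it.
trim_after_ip() {
    sed -E 's/([0-9]{1,3}(\.[0-9]{1,3}){3}).*/\1/'
}

# The second line ends in digits; a pattern such as .*[0-9] would keep
# that tail, while the IP-based pattern stops right after the address.
printf '%s\n' 'boo:moo:127.0.0.1--junk' 'foo:joo:127.0.0.1 junk 42' | trim_after_ip
# => boo:moo:127.0.0.1
# => foo:joo:127.0.0.1
```

The text before the address is preserved because the substitution only replaces the matched part of the line.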