Regex: select each occurrence of a character up until another character

I have a couple of lines in a document that look something like this:
foo-bar-foo[Foo - Bar]
I'd like to select every - character up until the first [ bracket on every line. Thus the - in the square brackets shouldn't be selected.
How can I achieve that with a Regex?
I already have this regex /.+?(?=\[)/g, which selects every character until the first [ but I only want the -.
Edit: I want to replace these selected characters with the sed command (GNU).

You can use
sed -E ':a; s/^([^[-]+)-/\1/; ta'
See an online demo:
#!/bin/bash
s='foo-bar-foo[Foo - Bar]'
sed -E ':a; s/^([^[-]+)-/\1/; ta' <<< "$s"
# => foobarfoo[Foo - Bar]
Details:
-E - enables POSIX ERE syntax (so there is no need to escape the capturing parentheses and the + quantifier)
:a - defines a label named a
s/^([^[-]+)-/\1/ - from the start of the string, matches one or more chars other than [ and -, capturing them into Group 1 (\1), and then matches a - char; the replacement keeps only Group 1, so that - is removed
ta - jumps back to label a if the last substitution succeeded, so the s command repeats until no - is left before the first [
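The loop can also be wrapped in a small function and run over several lines at once. This is only a minimal sketch: the function name strip_dashes_before_bracket and the second sample line are made up for illustration, and the label, substitution and jump are passed as separate -e expressions rather than as one string.

```shell
#!/bin/bash
# Hypothetical helper: remove every "-" that appears before the first "[" on each line of stdin.
strip_dashes_before_bracket() {
  # :a defines a label; each pass removes the leftmost "-" that precedes any "[";
  # ta jumps back to the label for as long as the substitution succeeded.
  sed -E -e ':a' -e 's/^([^[-]+)-/\1/' -e 'ta'
}

printf '%s\n' 'foo-bar-foo[Foo - Bar]' 'a-b[c-d]-e' | strip_dashes_before_bracket
# => foobarfoo[Foo - Bar]
# => ab[c-d]-e
```

Because of the + quantifier, a dash at the very start of a line is left in place; changing + to * in the bracket group would cover that case as well.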

Related

Extract string between underscores and dot

I have strings like these:
/my/directory/file1_AAA_123_k.txt
/my/directory/file2_CCC.txt
/my/directory/file2_KK_45.txt
So basically, the number of underscores is not fixed. I would like to extract the string between the first underscore and the dot. So the output should be something like this:
AAA_123_k
CCC
KK_45
I found this solution that works:
string='/my/directory/file1_AAA_123_k.txt'
tmp="${string%.*}"
echo $tmp | sed 's/^[^_:]*[_:]//'
But I am wondering if there is a more 'elegant' solution (e.g. 1 line code).

With bash version >= 3.0 and a regex:
[[ "$string" =~ _(.+)\. ]] && echo "${BASH_REMATCH[1]}"
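As a rough sketch, the same test can be put in a function and run over the three sample paths. The function name between_underscore_and_dot is hypothetical, and the regex is stored in a variable first, which is a common way to keep the pattern unquoted on the right-hand side of =~.

```shell
#!/bin/bash
# Hypothetical helper: print the text between the first "_" and the last "." of its argument.
between_underscore_and_dot() {
  local re='_(.+)\.'   # expanded unquoted below, so bash treats it as a regex, not a literal
  [[ $1 =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}

for path in /my/directory/file1_AAA_123_k.txt /my/directory/file2_CCC.txt /my/directory/file2_KK_45.txt; do
  between_underscore_and_dot "$path"
done
# => AAA_123_k
# => CCC
# => KK_45
```

Since .+ is greedy, the capture runs up to the last dot, so a name such as file1_AAA_123_k.txt.2 would yield AAA_123_k.txt; an underscore in a directory name would also move the start of the capture.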

You can use a single sed command like
sed -n 's~^.*/[^_/]*_\([^/]*\)\.[^./]*$~\1~p' <<< "$string"
sed -nE 's~^.*/[^_/]*_([^/]*)\.[^./]*$~\1~p' <<< "$string"
See the online demo. Details:
^ - start of string
.* - any text
/ - a / char
[^_/]* - zero or more chars other than / and _
_ - a _ char
\([^/]*\) (POSIX BRE) / ([^/]*) (POSIX ERE, enabled with the -E option) - Group 1: any zero or more chars other than /
\. - a dot
[^./]* - zero or more chars other than . and /
$ - end of string.
With -n, default line output is suppressed and p only prints the result of successful substitution.

With your shown samples and GNU grep, you could try the following code.
grep -oP '.*?_\K([^.]*)' Input_file
Explanation: GNU grep's -o option prints only the matched part and -P enables PCRE regex. The regex .*?_\K([^.]*) gets the value between the first _ and the first following dot:
.*?_ ##Matches from the start of the line up to the first occurrence of _ by using the lazy match .*?
\K ##\K forgets everything matched so far, so that only the needed value is printed.
([^.]*) ##Matches everything up to the first occurrence of a dot.

A simpler sed solution without any capturing group:
sed -E 's/^[^_]*_|\.[^.]*$//g' file
AAA_123_k
CCC
KK_45

If you need to process the file names one at a time (e.g., within a while read loop) you can perform two parameter expansions, e.g.:
$ string='/my/directory/file1_AAA_123_k.txt.2'
$ tmp="${string#*_}"
$ tmp="${tmp%%.*}"
$ echo "${tmp}"
AAA_123_k
One idea to parse a list of file names at the same time:
$ cat file.list
/my/directory/file1_AAA_123_k.txt.2
/my/directory/file2_CCC.txt
/my/directory/file2_KK_45.txt
$ sed -En 's/[^_]*_([^.]+).*/\1/p' file.list
AAA_123_k
CCC
KK_45

Using sed
$ sed 's/[^_]*_//;s/\..*//' input_file
AAA_123_k
CCC
KK_45

This is easy, except that it includes the initial underscore:
ls | grep -o "_[^.]*"
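One possible way to drop that leading underscore is to pipe the result through cut. This is a small sketch, not part of the original answer: the function name names_after_underscore is hypothetical, and the sample names are fed in with printf instead of ls so the example does not depend on the current directory.

```shell
# Hypothetical helper: print the part after the first "_" and before the first "." for each name on stdin.
names_after_underscore() {
  grep -o '_[^.]*' |   # prints _AAA_123_k, _CCC, _KK_45 for the samples
    cut -c2-           # keeps everything from the second character on
}

printf '%s\n' file1_AAA_123_k.txt file2_CCC.txt file2_KK_45.txt | names_after_underscore
# => AAA_123_k
# => CCC
# => KK_45
```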

sed and Perl regexp replaces once, with multiple replacements flag

I have the string:
lopy,lopy1,sym,lopy,lopy1,sym
I want the line to be:
lopy,lopy1,sym,lady,lady1,sym
Which means that all "lopy" after the string sym should be replaced. So I ran:
echo "lopy,lopy1,sym,lopy,lopy1,sym" | sed -r 's/(.*sym.*?)lopy/\1lad/g'
I get:
lopy,lopy1,sym,lopy,lad1,sym
Using Perl is not really better:
echo "lopy,lopy1,sym,lopy,lopy1,sym" | perl -pe 's/(.*sym.+?)lopy/${1}lad/g'
yields
lopy,lopy1,sym,lad,lopy1,sym
Not all "lopy" are replaced. What am I doing wrong?

The (.*sym.*?)lopy and (.*sym.+?)lopy patterns are almost the same: .+? matches one or more chars other than line break chars, as few as possible, while .*? matches zero or more such chars. Mind that sed does not support lazy quantifiers, so *? is the same as * in sed. However, the main problem with both regexps is that each match must contain sym, then any text after it, and then lopy. Adding g therefore only asks for further sym...lopy sequences after the previous match, and there is only one such occurrence in your string.
You want to replace all lopy after sym, so you can use
perl -pe 's/(?:\G(?!^)|sym).*?\Klopy/lad/g'
See the regex demo. Details:
(?:\G(?!^)|sym) - sym or end of the previous match (\G(?!^))
.*? - any zero or more chars other than line break chars, as few as possible
\K - match reset operator that discards all text matched so far
lopy - a lopy string.
See the online demo:
#!/bin/bash
echo "lopy,lopy1,sym,lopy,lopy1,sym" | perl -pe 's/(?:\G(?!^)|sym).*?\Klopy/lad/g'
# => lopy,lopy1,sym,lad,lad1,sym
If the values are always comma separated, you may replace .*? with ,: (?:\G(?!^)|sym),\Klopy (see this regex demo).

Since the OP mentioned sed, I am adding an awk program here, which could be a better choice than sed. With the shown samples, please try the following awk program.
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
awk -F',sym,' '
{
first=$1
$1=""
sub(/^[[:space:]]+/,"")
gsub(/lop/,"lad")
$0=first FS $0
}
1
'
Explanation: a detailed explanation of the above.
echo "lopy,lopy1,sym,lopy,lopy1,sym" | ##Print the sample line and pipe it to awk as input.
awk -F',sym,' ' ##Set ,sym, as the field separator.
{
first=$1 ##Save the first field (the text before the first ,sym,) in first.
$1="" ##Empty $1, which rebuilds $0 with OFS (a space) in front of the remaining fields.
sub(/^[[:space:]]+/,"") ##Remove that leading space from the current line.
gsub(/lop/,"lad") ##Globally substitute lop with lad in the rest of the line.
$0=first FS $0 ##Prepend first and the separator to the edited rest of the line.
}
1 ##Print the edited/non-edited line.
'

The problem is that the lopy(s) to replace are after sym, with a pattern like sym.*?lopy, so a global replacement looks for yet more of the whole sym+lopy-after-sym (not just for all lopys after that one sym).†
To replace all lopys (after the first sym, followed by another sym) we can capture the substring between the syms and, in the replacement side, run code in which a regex replaces all lopys:
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe's{ sym,\K (.+?) (?=sym) }{ $1 =~ s/lop/lad/gr }ex'
To isolate the substring between the syms I use \K after the first sym, which drops everything matched before it from the match, and a positive lookahead for the sym after the substring, which doesn't consume anything. The /e modifier makes the replacement side be evaluated as code. In the replacement side's regex we need /r since $1 is read-only, and /r also makes the substitution return the changed string instead of modifying its target. See perlretut.
† To match all of abbbb we can't say /ab/g, nor /(a)b/g nor /a(b)/g, because that would look for all repetitions of the whole ab in the string (and find only ab in the beginning).

sed does not support non-greedy wildcards at all. But your Perl script also fails for other reasons; you are saying "match all occurrences of this" but then you specify a regex which can only match once.
A common simple solution is to split the string, and then replace only after the match:
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe 'if (@x = /^(.*?sym,)(.*)/) { $x[1] =~ s/lop/lad/g; s/.*/$x[0]$x[1]/ }'
If you want to be fancy, you can use a lookbehind to only replace the lop occurrences after the first sym.
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe 's/(?<=sym.{0,200})lop/lad/g'
The variable-length lookbehind generates a warning and is only supported in Perl 5.30+ (you can turn the warning off with no warnings qw(experimental::vlb);).

Since you have shown an attempted sed command and used sed tag, here is a sed loop based solution:
sed -E -e ':a' -e 's~(sym,.*)lopy~\1lady~g; ta' file
lopy,lopy1,sym,lady,lady1,sym
Explanation:
:a sets a label a before the s command that matches the sym,.* pattern
ta jumps back to label a after a successful substitution
The loop stops when the s command has nothing left to match, i.e. there is no lopy substring after sym,
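To see what the loop does, the substitution can be applied one pass at a time. The following is just a sketch: the helper name one_pass is hypothetical, and it runs the s command once without the label and the jump.

```shell
# Hypothetical helper: apply the substitution a single time to the string given as argument.
one_pass() {
  printf '%s\n' "$1" | sed -E 's~(sym,.*)lopy~\1lady~'
}

s='lopy,lopy1,sym,lopy,lopy1,sym'
s1=$(one_pass "$s");  echo "$s1"   # => lopy,lopy1,sym,lopy,lady1,sym
s2=$(one_pass "$s1"); echo "$s2"   # => lopy,lopy1,sym,lady,lady1,sym
s3=$(one_pass "$s2"); echo "$s3"   # => lopy,lopy1,sym,lady,lady1,sym (nothing left to match)
```

Because .* is greedy, each pass replaces the last remaining lopy after sym, and the third pass leaves the line unchanged, which is the point where ta would no longer jump.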

Change text in first bracket that are succeeded by third/square brackets

I want to append some text (.html) at the end of line in format [some_text](some/other/text); basically links in markdown syntax.
Example
[Test](test/link) would be [Test](test/link.html)
[Test](test/link1/link) would be [Test](test/link1/link.html)
[Test] would be [Test]
(test) would be (test)
So I was trying out unix sed with syntax: sed -i 's/\[*\](*)/.html)/g' filename.md. This sed syntax is wrong and does not work; can someone help? I'm open to using other tools like awk or perl, if appropriate for this scenario.
Solution: sed -i 's/\(\[[^][]*]([^()]*\))/\1.html)/g' filename.md
Suggested by @WiktorStribiżew

Based on given samples:
$ cat ip.txt
[Test](test/link)
[Test](test/link1/link)
[Test]
(test)
# if closing ) should also be matched: sed -E 's/(\[[^]]+]\([^)]+)\)/\1.html)/'
$ sed -E 's/\[[^]]+]\([^)]+/&.html/' ip.txt
[Test](test/link.html)
[Test](test/link1/link.html)
[Test]
(test)
\[ match [
[^]]+ match one or more non ] characters
]\( match ](
[^)]+ match one or more non ) characters
& backreferences entire matched portion
\1 backreferences portion matched by first capture group
With perl:
perl -pe 's/\[[^]]+]\([^)]+\K(?=\))/.html/'
\K helps to avoid capturing the text matched until that point
(?=\)) is a lookahead assertion to match ) character, this also is not part of the matched portion
Add the -i option to either solution once it is working as expected.

You can use
sed -i 's/\(\[[^][]*]([^()]*\))/\1.html)/g' filename.md
See the online demo.
The regex is a POSIX BRE expression that matches
\(\[[^][]*]([^()]*\) - Group 1:
\[ - a [ char
[^][]* - zero or more chars other than [ and ]
] - a ] char
( - a ( char
[^()]* - zero or more chars other than ( and )
) - a ) char.
The -i option makes GNU sed write the replacements back to the input file. The g flag replaces all matches on each line.
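The effect of the g flag shows up on a line that holds more than one link. Below is a minimal sketch using the POSIX ERE form of the same pattern (an assumption on my side, the answer above uses BRE); the function name add_html_to_links and the first sample line are made up.

```shell
# Hypothetical helper: append .html to the target of every [text](target) link on stdin.
add_html_to_links() {
  # Group 1 holds "[text](target"; the closing ")" is matched outside the group
  # and written back after the added ".html".
  sed -E 's/(\[[^][]*]\([^()]*)\)/\1.html)/g'
}

printf '%s\n' '[A](a/b) and [C](c/d)' '[Test]' '(test)' | add_html_to_links
# => [A](a/b.html) and [C](c/d.html)
# => [Test]
# => (test)
```

Note that a target that already ends in .html would get the suffix a second time, so this is best run once on unconverted files.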

Regexp or Grep in Bash

Can you please tell me how to get the token value correctly? At the moment I am getting: "1jdq_dnkjKJNdo829n4-xnkwe",258],["FbtResult
echo '{"facebookdotcom":true,"messengerdotcom":false,"workplacedotcom":false},827],["DTSGInitialData",[],{"token":"aaaaaaa"},258],["FbtResult' | sed -n 's/.*"token":\([^}]*\)\}/\1/p'

You need to match the full string, and to get rid of double quotes, you need to match a " before the token and use a negated bracket expression [^"] instead of [^}]:
sed -n 's/.*"token":"\([^"]*\).*/\1/p'
Details:
.* - any zero or more chars
"token":" - a literal "token":" string
\([^"]*\) - Group 1 (\1 refers to this value): any zero or more chars other than "
.* - any zero or more chars.
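As a quick check, the command can be wrapped in a function and tried on both the sample token and the token from the question. This is only a sketch: the function name extract_token is hypothetical and the input strings are shortened versions of the one in the question.

```shell
# Hypothetical helper: print the value that follows "token":" on each line of stdin.
extract_token() {
  sed -n 's/.*"token":"\([^"]*\).*/\1/p'
}

printf '%s\n' '{"token":"aaaaaaa"},258],["FbtResult' | extract_token
# => aaaaaaa
printf '%s\n' '{"token":"1jdq_dnkjKJNdo829n4-xnkwe"},258],["FbtResult' | extract_token
# => 1jdq_dnkjKJNdo829n4-xnkwe
```

Since [^"]* accepts any char except a double quote, digits, underscores and hyphens in the token are fine. The leading .* is greedy, so if a line held several "token":" keys, the last one would be printed.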

This replacement works:
echo '{"facebookdotcom":true,"messengerdotcom":false,"workplacedotcom":false},827],["DTSGInitialData",[],{"token":"aaaaaaa"},258],["FbtResult' |
sed -n 's/.*"token":"\([a-z]*\)"\}.*/\1/p'
The value after "token": is captured between the quotes with \([a-z]*\), followed by a closing brace \} and then .* for the remaining characters (you were missing this last part, so the text after the token was kept in the output as well). Note that [a-z]* only matches lowercase letters, as in this sample.
Output:
aaaaaaa

A grep solution:
echo '{"facebookdotcom":true,"messengerdotcom":false,"workplacedotcom":false},827],["DTSGInitialData",[],{"token":"aaaaaaa"},258],["FbtResult' | grep -Po '(?<="token":")[^"]+'
yields
aaaaaaa
The -P option to grep enables the Perl-compatible regex (PCRE).
The -o option tells grep to print only the matched substring, not the entire line.
The regex (?<="token":") is a zero-width positive lookbehind assertion, a PCRE feature that POSIX regexes lack. The expression (?<=pattern) requires pattern to match immediately before the current position without including it in the matched result.

remove certain prefix in every word separate by delimiter

How can I remove certain prefixes from every word in a delimiter-separated list? I want to remove the abc and def prefixes from the beginning of each word. I have a sed statement that does this, but it is very long. I don't know if it can be made shorter and simpler.
Sed: sed -e 's/, /,/g' -e 's/'.yaml$'//g' -e 's/^abc_//g' -e 's/^def_//g' -e 's/,abc_/,/g' -e 's/,def_/,/g'
Input: abc_mscp_def.yaml_v1, def_mscp_abc.yaml_v2, abc_mscp_abc.yaml_v2, def_mscp_def.yaml_v2
Output: mscp_def_v1,mscp_abc_v2,mscp_abc_v2,mscp_def_v2

You may use
sed -E 's/(^|,) ?(abc|def)_|(,) |\.yaml/\1\3/g'
See the online demo:
s="abc_mscp_def.yaml_v1, def_mscp_abc.yaml_v2, abc_mscp_abc.yaml_v2, def_mscp_def.yaml_v2"
sed -E 's/(^|,) ?(abc|def)_|(,) |\.yaml/\1\3/g' <<< "$s"
# => mscp_def_v1,mscp_abc_v2,mscp_abc_v2,mscp_def_v2
Details
-E option enables POSIX ERE syntax and alternation
(^|,) ?(abc|def)_|(,) |\.yaml - matches:
(^|,) ?(abc|def)_ - Group 1: start of string or comma, then an optional space, then Group 2: either abc or def, and then a _ char
| - or
(,) - Group 3: a comma, and then a space
| - or
\.yaml - .yaml substring.
The replacement is \1\3, i.e. the values of Group 1 and 3 concatenated.
