Regex AND search inside a block enclosed by a delimiter - regex

I want to do a regex AND search, (?=)(?=), inside a block enclosed by a delimiter such as #.
In the following sample, I expected the regex to match from cat to ugly inside the block that runs from # cat B up to just before # cat C.
But the regex matches nothing.
regex
^#(?=[\s\S]*(cat))(?=[\s\S]*(ugly))^#
text
# cat A
the cat is
very cute.
# cat B
the cat is
very ugly.
# cat C
the cat is
very good.
#
You can test the regex on https://regexr.com/

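Your pattern ^#(?=[\s\S]*(cat))(?=[\s\S]*(ugly))^# matches a # at the start of the string (or of a line, with the multiline flag) using ^#, followed by 2 positive lookaheads, and then tries to match ^# again. Lookaheads do not consume any characters, so the second ^ is tested right after the first #, which is never the start of a line. That is why you don't get a match.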
To get a more exact match, you could start the pattern with ^# cat B
If you want to use lookaheads, you might use 2 capturing groups in the positive lookahead. If you want to search for cat and ugly as whole words you might use word boundaries \b.
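(?s) is an inline modifier that lets the dot match a newline; you might also use the /s flag instead. Because ^ has to match at the start of a line in the middle of the text, the multiline flag /m is needed as well.
(?s)(?=^# cat B.*?(cat).*?(ugly).*?^# cat C)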
But it might be easier to not use the lookahead and match instead:
(?s)^# cat B.*?(cat).*?(ugly).*?^# cat C$
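For anyone who wants to try the non-lookahead pattern outside a demo site, here is a minimal Python sketch. It assumes Python's re module, where re.S plays the role of (?s) and re.M makes ^ and $ match at line boundaries; the variable names are only illustrative.

```python
import re

text = """# cat A
the cat is
very cute.
# cat B
the cat is
very ugly.
# cat C
the cat is
very good.
#"""

# re.S: the dot also matches newlines.
# re.M: ^ and $ match at the start and end of every line.
pattern = re.compile(r"^# cat B.*?(cat).*?(ugly).*?^# cat C$", re.S | re.M)

match = pattern.search(text)
if match:
    # Group 1 is the first "cat" after the "# cat B" header, group 2 is "ugly".
    print(match.group(1), match.group(2))
```

Without re.M the search finds nothing on this text, because "# cat B" is not at the very start of the string.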

This RegEx might help you to design/match your target words by bounding them using \n.
((.+)(cat)(.+))\n((.+)(ugly)(.+))
To keep it simple, it creates four groups for each target keyword, cat and ugly, so the keywords themselves are available as $3 and $7.
You could additionally bound it with start ^ and end $ anchors, if you wish.
This expression only works when each target keyword is in the middle of its line, with at least one character on either side of it.
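As a rough check of the group numbering, here is a small Python sketch of the same expression. The sample text is a shortened version of the question's input, and the variable names are made up for the example.

```python
import re

text = "# cat B\nthe cat is\nvery ugly.\n# cat C"

# Groups 1 and 5 are the two whole lines, groups 3 and 7 are the keywords,
# and groups 2, 4, 6 and 8 are the text around each keyword.
pattern = re.compile(r"((.+)(cat)(.+))\n((.+)(ugly)(.+))")

m = pattern.search(text)
if m:
    print(m.group(3), m.group(7))
```

The header line "# cat B" is skipped because the line after it contains no "ugly", so the match starts at "the cat is".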

Related

How can I get the count of a capture group then replace the characters with a specific character?

How can I get the count of a capture group and replace it with the same number of characters that I specify?
For example here is a string...
123456789ABCD00001DDD
My regex with capture groups is as follows...
^([123456789]{9})([ABCD]{1,4})([0]{1,5})([0-9]{1,5})([D]*)$
When I use something like Notepad++ I want to find the above and replace it with something like...
\1\2 \4\5
Making the end results look like...
123456789ABCD 1DDD
Example located at https://regex101.com/r/fykEnn/1
You may use this regex with \G and a positive lookahead:
(?:^([1-9]{9}[ABCD]{1,4})(?=0{1,5}\d{1,5}D*$)|\G)0
\G asserts position at the end of the previous match or the start of the string for the first match.
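Python's built-in re module has no \G, so the same "one replacement character per zero" idea can be sketched with a replacement callback instead. This is only an illustration under that assumption: the function name blank_zero_run is made up, and the pattern is a simplified variant of the one in the question that captures the run of zeros as its own group.

```python
import re

def blank_zero_run(s):
    """Replace the run of zeros after the letters with the same number of spaces."""
    pattern = re.compile(r"^([1-9]{9}[ABCD]{1,4})(0{1,5})(\d{1,5}D*)$")
    # len(m.group(2)) is the count of captured zeros; emit that many spaces.
    return pattern.sub(
        lambda m: m.group(1) + " " * len(m.group(2)) + m.group(3), s
    )

print(repr(blank_zero_run("123456789ABCD00001DDD")))
```

A string that does not fit the pattern is returned unchanged.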
In bash, using sed:
$ echo 123456789ABCD00001DDD | sed -re 's/([123456789]{9})([ABCD]{1,4})([0]{1,5})([0-9]{1,5})([D]*)$/\1\2 \4\5 /g'
123456789ABCD 1DDD
I think you should use both a lookbehind and a lookahead assertion, with a single digit in the lookahead.
Example:
Find what: (?<=[A-Z])\d{4}(?=\d[A-Z])
Replace with: a space

capturing each word containing pattern regex

I'm trying to write a sed script that finds every word that contains a certain pattern and then prepends all words that contain that pattern. For example:
foobarbaz barfoobaz barbazfoo barbaz
might turn into:
quxfoobarbaz quxbarfoobaz quxbarbazfoo barbaz
I understand the basics of capture groups and backreferences, but I'm still having trouble. Specifically, I can't get it to capture each whole word separately.
s/\(.*\)men\(.*\)/ not just the \1men\2, but the \1women\2 and \1children\2 too /
I tried using \s, for whitespace as many sites recommend, but sed treats \s as the separate characters \ and s
You could use the non-space character \S as follows:
sed 's/\S*foo\S*/qux&/g' <<< "foobarbaz barfoobaz barbazfoo barbaz"
This will match words containing foo. The replacement string qux& prepends qux to every match. Output:
quxfoobarbaz quxbarfoobaz quxbarbazfoo barbaz
This works fine as long as there are no spaces within a word:
echo "foobarbaz barfoobaz barbazfoo barbaz" | sed 's/\([^ ]*foo[^ ]*\)/qux\1/g'
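For comparison, the same whole-word idea can be sketched in Python, where \S is supported by the re module. The function name and its default arguments are just placeholders for this example.

```python
import re

def prefix_words(line, needle="foo", prefix="qux"):
    # \S* on both sides grows the match to the whole whitespace-delimited word.
    pattern = re.compile(r"\S*" + re.escape(needle) + r"\S*")
    # m.group(0) is the matched word, the counterpart of & in sed.
    return pattern.sub(lambda m: prefix + m.group(0), line)

print(prefix_words("foobarbaz barfoobaz barbazfoo barbaz"))
```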

How do I add characters to RegEx results?

I am not a programmer but I am having to use RegEx for a particular purpose. How do I add specific characters to what is being returned from RegEx?
For example, if I have a list as follows:
XYZ ABC 123
How do I use RegEx to add something specific to the end of each? For example, if I want all three to end with .com for example?
You can try this script, replacing XYZ abc 123 with your full list:
echo "XYZ abc 123" | sed -E 's/([a-zA-Z0-9]+)/\1.com/g'
Explanation:
s/ Starts a substitution regex
([a-zA-Z0-9]+) Capture at least one alphanumeric
/ End regex
\1.com replaces the capture with itself plus adds .com
/g Global modifier (for all matches)
Without knowing which regex engine you want to use, there are many other ways to do this. In the future, please give more information.
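If the engine happens to be Python, a minimal sketch of the same substitution could look like this. The function name add_suffix is invented for the example, and the character class is the same alphanumeric one used in the sed command.

```python
import re

def add_suffix(items, suffix=".com"):
    # Each run of letters and digits is kept as-is and extended with the suffix.
    return re.sub(r"[A-Za-z0-9]+", lambda m: m.group(0) + suffix, items)

print(add_suffix("XYZ abc 123"))
```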

regex, search and replace until a certain point

The Problem
I have a file full of lines like
convert.these.dots.to.forward.slashes/but.leave.these.alone/i.mean.it
I want to search and replace such that I get
convert/these/dots/to/forward/slashes/but.leave.these.alone/i.mean.it
The . are converted to / up until the first forward slash
The Question
How do I write a regex search and replace to solve my problem?
Attempted solution
I tried using a lookbehind with perl, but variable-length lookbehinds are not implemented:
$ echo "convert.these.dots.to.forward.slashes/but.leave.these.alone/i.mean.it" | perl -pe 's/(?<=[^\/]*)\./\//g'
Variable length lookbehind not implemented in regex m/(?<=[^/]*)\./ at -e line 1.
Workaround
Variable-length lookaheads are implemented, so you can use this dirty trick:
$ echo "convert.these.dots.to.forward.slashes/but.leave.these.alone/i.mean.it" | rev | perl -pe 's/\.(?=[^\/]*$)/\//g' | rev
convert/these/dots/to/forward/slashes/but.leave.these.alone/i.mean.it
Is there a more direct solution to this problem?
s/\G([^\/.]*)\./\1\//g
\G is an assertion that matches the point at the end of the previous match. This ensures that each successive match immediately follows the last.
Matches:
\G # start matching where the last match ended
([^\/.]*) # capture until you encounter a "/" or a "."
\. # the dot
Replaces with:
\1 # that interstitial text you captured
\/ # a slash
Usage:
echo "convert.these.dots.to.forward.slashes/but.leave.these.alone/i.mean.it" | perl -pe 's/\G([^\/.]*)\./\1\//g'
# yields: convert/these/dots/to/forward/slashes/but.leave.these.alone/i.mean.it
Alternatively, if you're a purist and don't want to add the captured subpattern back in (avoiding that may be more efficient, but I'm not certain), you could use \K to restrict the "real" match solely to the ., then simply replace it with a /. \K essentially "forgets" what has been matched up to that point, so the final match ultimately returned is only what comes after the \K.
s/\G[^\/.]*\K\./\//g
Matches:
\G # start matching where the last match ended
[^\/.]* # consume chars until you encounter a "/" or a "."
\K # "forget" what has been consumed so far
\. # the dot
Thus, the entirety of the text matched for replacement is simply ".".
Replaces with:
\/ # a slash
Result is the same.
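Neither \G nor \K exists in Python's built-in re module, so a port would need a different route. One possible sketch, which is not a translation of the Perl above, matches everything before the first slash once and rewrites the dots inside a callback; the function name is hypothetical.

```python
import re

def dots_to_slashes(line):
    # ^[^/]* matches everything before the first slash (or the whole line
    # if there is no slash); only that part has its dots rewritten.
    return re.sub(r"^[^/]*", lambda m: m.group(0).replace(".", "/"), line)

print(dots_to_slashes(
    "convert.these.dots.to.forward.slashes/but.leave.these.alone/i.mean.it"
))
```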
You can use substr as an lvalue and perform the substitution on it. Or transliteration, like I did below.
$ perl -pe 'substr($_,0,index($_,"/")) =~ tr#.#/#'
convert.these.dots.to.forward.slashes/but.leave.these.alone/i.mean.it
convert/these/dots/to/forward/slashes/but.leave.these.alone/i.mean.it
This finds the first instance of a slash, extracts the part of the string before it, and performs a transliteration on that part.
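The same "split at the first slash, transliterate the front part" idea can be sketched in Python without any regex. The function name is made up, and str.partition stands in for the index/substr pair.

```python
def dots_to_slashes_split(line):
    # partition splits at the first "/" only; sep is "" if there is none.
    head, sep, tail = line.partition("/")
    return head.replace(".", "/") + sep + tail

print(dots_to_slashes_split(
    "convert.these.dots.to.forward.slashes/but.leave.these.alone/i.mean.it"
))
```

In this sketch a line with no slash at all has every dot converted, since the whole line then counts as the part before the first slash.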

Replace specific capture group instead of entire regex in Perl

I've got a regular expression with capture groups that matches what I want in a broader context. I then take capture group $1 and use it for my needs. That's easy.
But how to use capture groups with s/// when I just want to replace the content of $1, not the entire regex, with my replacement?
For instance, if I do:
$str =~ s/prefix (something) suffix/42/
prefix and suffix are removed. Instead, I would like something to be replaced by 42, while keeping prefix and suffix intact.
As I understand, you can use look-ahead or look-behind that don't consume characters. Or save data in groups and only remove what you are looking for. Examples:
With look-ahead:
s/your_text(?=ahead_text)//;
Grouping data:
s/(your_text)(ahead_text)/$2/;
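To make the two approaches concrete with the question's own strings, here is a small Python sketch (Python rather than Perl, purely for illustration). Both variants replace something with 42 and keep the prefix and suffix.

```python
import re

s = "prefix something suffix"

# Lookarounds: only "something" is part of the match, so only it is replaced.
with_lookaround = re.sub(r"(?<=prefix )something(?= suffix)", "42", s)

# Groups: match everything, then put the captured neighbours back.
# \g<1> is used because \142 would not be read as "group 1, then 42".
with_groups = re.sub(r"(prefix )something( suffix)", r"\g<1>42\2", s)

print(with_lookaround)
print(with_groups)
```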
If you only need to replace one capture, then using @LAST_MATCH_START and @LAST_MATCH_END (with use English; see perldoc perlvar) together with substr might be a viable choice:
use English qw(-no_match_vars);
$your_string =~ m/aaa (bbb) ccc/;
substr $your_string, $LAST_MATCH_START[1], $LAST_MATCH_END[1] - $LAST_MATCH_START[1], "new content";
# replaces "bbb" with "new content"
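The offset-based idea carries over to other languages as well. As a sketch in Python, a match object's start() and end() for a group play roughly the role of $LAST_MATCH_START[1] and $LAST_MATCH_END[1]; the helper name replace_group is invented here.

```python
import re

def replace_group(s, pattern, group, new):
    m = re.search(pattern, s)
    # No match, or the group did not take part in the match: leave s alone.
    if m is None or m.start(group) == -1:
        return s
    # Splice the new text in between the group's start and end offsets.
    return s[:m.start(group)] + new + s[m.end(group):]

print(replace_group("aaa bbb ccc", r"aaa (bbb) ccc", 1, "new content"))
```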
This is an old question, but I found the approach below easier for replacing lines that start with >something with >something_else. It is good for changing the headers of FASTA sequences:
while ($filelines =~ />(.*)\s/g) {
    unless ($1 =~ /else/i) {
        $filelines =~ s/($1)/$1\_else/;
    }
}
I use something like this:
s/(?<=prefix)(group)(?=suffix)/$1 =~ s|text|rep|gr/e;
Example:
In the following text I want to normalize the whitespace but only after ::=:
some text := a b c d e ;
Which can be achieved with:
s/(?<=::=)(.*)/$1 =~ s|\s+| |gr/e
Results with:
some text := a b c d e ;
Explanation:
(?<=::=): Look-behind assertion to match ::=
(.*): Everything after ::=
$1 =~ s|\s+| |gr: With the captured group normalize whitespace. Note the r modifier which makes sure not to attempt to modify $1 which is read-only. Use a different sub delimiter (|) to not terminate the replacement expression.
/e: Treat the replacement text as a perl expression.
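The "substitution inside a substitution" pattern can also be sketched in Python, where a replacement callback takes the place of /e and the inner re.sub returns a new string just like the r modifier does. The function name and the sample input with irregular spacing are assumptions made for this example.

```python
import re

def normalize_after(text, marker="::="):
    # Fixed-width lookbehind: start matching right after the marker.
    pattern = re.compile(r"(?<=" + re.escape(marker) + r")(.*)")
    # The callback collapses whitespace runs in the captured tail only.
    return pattern.sub(lambda m: re.sub(r"\s+", " ", m.group(1)), text)

print(normalize_after("some  text ::=  a   b c    d e ;"))
```

Whitespace before the marker is left untouched, because it is never part of the match.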
Use lookaround assertions. Quoting the documentation:
Lookaround assertions are zero-width patterns which match a specific pattern without including it in $&. Positive assertions match when their subpattern matches, negative assertions match when their subpattern fails. Lookbehind matches text up to the current match position, lookahead matches text following the current match position.
If the beginning of the string has a fixed length, you can thus do:
s/(?<=prefix)(your capture)(?=suffix)/$1/
However, ?<= does not work for variable length patterns (starting from Perl 5.30, it accepts variable length patterns whose length is smaller than 255 characters, which enables the use of |, but still prevents the use of *). The work-around is to use \K instead of (?<=):
s/.*prefix\K(your capture)(?=suffix)/$1/