Term with no alphanumeric characters before or after - regex

I am trying to write a regular expression that matches all occurrences of a specified word, but must not have any alphanumeric characters prefixed or suffixed.
For example, searching for the term "cat" should not return terms like "catalyst".
Here is what I have so far:
"?<!([a-Z0-9])*?TERMPLACEHOLDER?!([a-Z0-9])*?"
This should return the word "TERMPLACEHOLDER" on its own.
Any ideas?
Thanks.

How about:
\bTERMPLACEHOLDER\b

You could use word boundaries: \bTERMPLACEHOLDER\b
A quick test in JavaScript:
var a = "this cat is not a catalyst";
console.log(a.match(/\bcat\b/));
Returns just "cat".
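For comparison, here is a minimal Python sketch of the same word-boundary idea. The function name find_whole_word is made up for illustration, and re.escape is added on the assumption that the search term should be treated literally. Note that \b counts the underscore as a word character, so "cat_1" is not reported either.

```python
import re

def find_whole_word(term, text):
    """Return every standalone occurrence of term in text (hypothetical helper)."""
    # re.escape keeps characters such as "." in the term literal.
    # \b only behaves as expected when the term starts and ends with a word character.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b")
    return pattern.findall(text)

print(find_whole_word("cat", "this cat is not a catalyst"))
```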

You may be looking for word boundaries. From there, you can use wildcards like \w*? on either side of the word if you want it to match partial words:
Search for any word containing "MYWORD"
\b\w*?MYWORD\w*?\b
Search for any word ending in "ING"
\b\w*?ING\b
Search for any word starting with "TH"
\bTH\w*?\b
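The three patterns above can be tried out quickly in Python. This is only a sketch: the sample sentence and the lowercase search fragments ("in", "ing", "th") are assumptions chosen for the demonstration.

```python
import re

text = "the king was thinking of a thin ring"

# Any word containing "in"
containing = re.findall(r"\b\w*?in\w*?\b", text)
# Any word ending in "ing"
ending = re.findall(r"\b\w*?ing\b", text)
# Any word starting with "th"
starting = re.findall(r"\bth\w*?\b", text)

print(containing)
print(ending)
print(starting)
```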

Be careful when you say "word" to refer to the substring you want to find. On the regular expression side, "word" has a different meaning: it is a character class.
Define the 'literal' string you would like to find (not word). This can be anything, sentences, punctuation, newline combinations. Example "find this \exact phrase <> !abc".
Since this is going to be part of a regular expression (not the whole regex), you can escape the special regular expression metacharacters that might be embedded.
$string = 'foo.bar'; # the string you want to find
$string =~ s/[.*+?|()\[\]{}^\$\\]/\\$&/g; # escape metacharacters
Now the 'literal' string is ready to be inserted into the regular expression. Note that if you want to individually allow classes or want metachars in the string, you would have to escape this yourself.
$sample =~ /(?<![^\W_])$string(?![^\W_])/ig; # find the string globally
(expanded)
/
(?<![^\W_]) # assertion: No alphanumeric character behind us
$string # the 'string' we want to find
(?![^\W_]) # assertion: No alphanumeric character in front of us
/ig
Perl sample -
use strict;
use warnings;
my $string = 'foo.bar';
my $sample = 'foo.bar and !fooAbar and afoo.bar.foo.bar';
# Quote string metacharacters
$string =~ s/[.*+?|()\[\]{}^\$\\]/\\$&/g;
# Globally find the string in the sample target
while ( $sample =~ /(?<![^\W_])$string(?![^\W_])/ig )
{
print substr($sample, 0, $-[0]), "-->'",
substr($sample, $-[0], $+[0] - $-[0]), "'\n";
}
Output -
-->'foo.bar'
foo.bar and !fooAbar and afoo.bar.-->'foo.bar'
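A rough Python counterpart of the Perl sample, in case it helps to compare. It assumes re.escape is an acceptable stand-in for the hand-written escaping substitution, and find_literal is a name invented for this sketch. The lookarounds are the same ones used above.

```python
import re

def find_literal(needle, haystack):
    """Return (offset, text) for each hit not touching an alphanumeric character."""
    pattern = re.compile(
        r"(?<![^\W_])"        # no letter or digit directly before the match
        + re.escape(needle)   # the literal string, metacharacters escaped
        + r"(?![^\W_])",      # no letter or digit directly after the match
        re.IGNORECASE,
    )
    return [(m.start(), m.group()) for m in pattern.finditer(haystack)]

sample = "foo.bar and !fooAbar and afoo.bar.foo.bar"
for offset, text in find_literal("foo.bar", sample):
    print(sample[:offset] + "-->'" + text + "'")
```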

Related

Problems with perl regex

I need a Perl regex to match A.CC3 on a line beginning with something, followed by anything, then my 'A.CC3', and then anything...
I am surprised this (text =~ /^\W+\CC.*\A\.CC\[3].*/) is not working
Thanks
\A is an escape sequence that anchors the match at the beginning of the string, much like the ^ at the start of your regex. Remove the backslash to make it match a literal A.
Edit: You also seem to have \C in there. You should only use backslash to escape meta characters such as period ., or to create escape sequences, such as \Q .. \E.
At its simplest, a regex to match A.CC3 would be
$text =~ /A\.CC3/
That's all you need. This will match any string with A.CC3 in it. In the comments you mention the string you are matching is this:
my $text = "//%CC Unused Static Globals, A.CC3, Halstead Progam Volume";
You might want to avoid partial matches, in which case you can use word boundary \b
$text =~ /\bA\.CC3\b/
You might require that a line begins with //%
$text =~ m#^//%.*\bA\.CC3\b#
Of course, only you know which parts of the string should be matched and in what way. "Something followed by anything followed by A.CC3 followed by anything" really just needs the first simple regex.
It doesn't seem like you're trying to capture anything. If that's the case, and all you need to do is find lines that contain A.CC3 then you can simply do
if ( index( $str, 'A.CC3' ) >= 0 ) # Found it...
No need for a regex.
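To make the difference between the plain substring test and the regex variants concrete, here is a small Python sketch. The "in" operator plays the role of Perl's index(), and the extra test strings "AxCC3" and "BA.CC30" are invented for the demonstration.

```python
import re

text = "//%CC Unused Static Globals, A.CC3, Halstead Progam Volume"

# Plain substring test, the counterpart of Perl's index().
found_literal = "A.CC3" in text

# Regex with the dot escaped and a word boundary on each side.
found_bounded = re.search(r"\bA\.CC3\b", text) is not None

# An unescaped dot matches any character, so this also accepts "AxCC3".
loose_hit = re.search(r"A.CC3", "AxCC3") is not None

# The word boundaries reject a partial match inside a longer token.
partial_hit = re.search(r"\bA\.CC3\b", "BA.CC30") is not None

print(found_literal, found_bounded, loose_hit, partial_hit)
```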
Try to give this a shot:
^.*?A\.CC.*$
That will match anything until it reaches A, then a literal ., followed by CC, then anything until end of string.
It depends what you want to match. If you want to pull back the whole line in which the A.CC3 pattern occurs then something like this should work:
^.*A\.CC3.*$

TCL regexp pattern search

I am trying to find a pattern match as below
abc(xxxx):efg(xxxx):xyz(xxxx) where xxxx - [0-9] digits
I used
set string "my string is abc(xxxx):efg(xxxx):xyz(xxxx)"
regexp abc(....):efg(....):xyz(....) $string result_str
it returns 0. Can anyone help?
The problem you've got is that ( and ) have special meaning to regular expressions in Tcl (and many other RE engines besides) in that they denote a capturing sub-RE. To make the characters “normal”, they have to be escaped with a backslash, and that means that it's best to put the regular expression in braces (because backslashes are general Tcl metacharacters).
Thus:
% set string "my string is abc(xxxx):efg(xxxx):xyz(xxxx)"
% regexp {abc\(....\):efg\(....\):xyz\(....\)} $string
1
If you want to also capture the contents of those parentheses, you need a slightly more complex RE:
regexp {abc\((....)\):efg\((....)\):xyz\((....)\)} $string \
all abc_bit efg_bit xyz_bit
Note that those .... sequences always match exactly four characters, but it's better to be more specific. To match any number of digits in each case:
regexp {abc\((\d+)\):efg\((\d+)\):xyz\((\d+)\)} $string -> abc efg xyz
When using regexp to extract bits of a string, it's pretty common to use -> as a (rather strange) variable name for the whole string match; it looks mnemonically like it's saying “send the pieces extracted to these variables”.
I haven't worked with Tcl, but it seems you need to escape the ( and ). Also, if you are sure the x's are digits, use \d{4} instead of the four dots. Based on this, the updated regex you could try is
abc\(\d{4}\):efg\(\d{4}\):xyz\(\d{4}\)
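The same escaping rule applies in most regex engines. As an illustration only, here is the capturing version in Python; the digits in the sample string are made up, since the question only shows xxxx placeholders.

```python
import re

# Sample input with invented digits in place of the xxxx placeholders.
line = "my string is abc(1234):efg(5678):xyz(9012)"

# \( and \) match literal parentheses; the unescaped pairs capture the digits.
pattern = re.compile(r"abc\((\d+)\):efg\((\d+)\):xyz\((\d+)\)")

match = pattern.search(line)
if match:
    abc, efg, xyz = match.groups()
    print(abc, efg, xyz)
```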

Replace specific capture group instead of entire regex in Perl

I've got a regular expression with capture groups that matches what I want in a broader context. I then take capture group $1 and use it for my needs. That's easy.
But how to use capture groups with s/// when I just want to replace the content of $1, not the entire regex, with my replacement?
For instance, if I do:
$str =~ s/prefix (something) suffix/42/
prefix and suffix are removed. Instead, I would like something to be replaced by 42, while keeping prefix and suffix intact.
As I understand, you can use look-ahead or look-behind that don't consume characters. Or save data in groups and only remove what you are looking for. Examples:
With look-ahead:
s/your_text(?=ahead_text)//;
Grouping data:
s/(your_text)(ahead_text)/$2/;
If you only need to replace one capture then using @LAST_MATCH_START and @LAST_MATCH_END (with use English; see perldoc perlvar) together with substr might be a viable choice:
use English qw(-no_match_vars);
$your_string =~ m/aaa (bbb) ccc/;
substr $your_string, $LAST_MATCH_START[1], $LAST_MATCH_END[1] - $LAST_MATCH_START[1], "new content";
# replaces "bbb" with "new content"
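The offset-based approach translates to other languages as well. Below is a minimal Python sketch that uses the match object's span() for the group, which is roughly what $LAST_MATCH_START[1] and $LAST_MATCH_END[1] provide in Perl. The helper name replace_group is an invention for this example.

```python
import re

def replace_group(pattern, text, replacement, group=1):
    """Replace only the text matched by one capture group (first match only)."""
    match = re.search(pattern, text)
    if match is None:
        return text
    # span(group) gives the start and end offsets of that group in text.
    start, end = match.span(group)
    return text[:start] + replacement + text[end:]

print(replace_group(r"prefix (something) suffix", "a prefix something suffix b", "42"))
```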
This is an old question but I found the below easier for replacing lines that start with >something to >something_else. Good for changing the headers for fasta sequences
while ($filelines=~ />(.*)\s/g){
unless ($1 =~ /else/i){
$filelines =~ s/($1)/$1\_else/;
}
}
I use something like this:
s/(?<=prefix)(group)(?=suffix)/$1 =~ s|text|rep|gr/e;
Example:
In the following text I want to normalize the whitespace but only after ::=:
some text := a b c d e ;
Which can be achieved with:
s/(?<=::=)(.*)/$1 =~ s|\s+| |gr/e
Results with:
some text := a b c d e ;
Explanation:
(?<=::=): Look-behind assertion to match ::=
(.*): Everything after ::=
$1 =~ s|\s+| |gr: With the captured group normalize whitespace. Note the r modifier which makes sure not to attempt to modify $1 which is read-only. Use a different sub delimiter (|) to not terminate the replacement expression.
/e: Treat the replacement text as a perl expression.
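The /e idea, running code on the captured text to build the replacement, corresponds to passing a function as the replacement in Python. A sketch under assumptions: the marker "::=" is taken from the regex above, the input line is invented (with visible runs of spaces), and normalize_after_marker is a hypothetical name.

```python
import re

def normalize_after_marker(line, marker="::="):
    """Collapse runs of whitespace, but only in the text after the marker."""
    # The lookbehind keeps the marker itself out of the match.
    pattern = re.compile("(?<=" + re.escape(marker) + ")(.*)")
    # The callable receives the match and returns the replacement text.
    return pattern.sub(lambda m: re.sub(r"\s+", " ", m.group(1)), line)

print(normalize_after_marker("some   text ::=  a   b  c ;"))
```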
Use lookaround assertions. Quoting the documentation:
Lookaround assertions are zero-width patterns which match a specific pattern without including it in $&. Positive assertions match when their subpattern matches, negative assertions match when their subpattern fails. Lookbehind matches text up to the current match position, lookahead matches text following the current match position.
If the beginning of the string has a fixed length, you can thus do:
s/(?<=prefix)(your capture)(?=suffix)/$1/
However, ?<= does not work for variable length patterns (starting from Perl 5.30, it accepts variable length patterns whose length is smaller than 255 characters, which enables the use of |, but still prevents the use of *). The work-around is to use \K instead of (?<=):
s/.*prefix\K(your capture)(?=suffix)/$1/
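Not every engine has \K. Python's built-in re module, for instance, lacks it and only accepts fixed-width lookbehind, so there the usual fallback is the grouping approach: capture the variable-length prefix and put it back. A sketch with an invented sample string:

```python
import re

text = "id=7 prefix    something suffix"

# The prefix (with its variable-length whitespace) is captured and re-emitted.
# \g<1> is used instead of \1 so the following "42" is not read as part of
# the group number.
result = re.sub(r"(prefix\s+)something(?=\s+suffix)", r"\g<1>42", text)

print(result)
```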

How to Use Regular Expression to Look for the something NOT of Certain pattern

Using Perl style regexp, is it possible to look for something not of certain pattern?
For example, [^abc] looks for a SINGLE CHARACTER not a nor b nor c.
But can I specify something longer than a single character?
For example, in the following string, I want to search for the first word which is not a top level domain name and contains no uppercase letter, or perhaps some more complicated rules like having 3-10 characters. In my example, this should be "abcd":
net com org edu ABCE abcdefghijklmnoparacbasd abcd
You can do it using negative look-ahead assertions as:
^(?!(?:net|com|org|edu)$)(?!.*[A-Z])[a-z]{3,10}$
Explanation:
^ - Start anchor
$ - End anchor
(?:net|com|org|edu) - Alternation, matches net or com or org or edu
(?!regex) - Negative lookahead.
Matches only if the string does not match the regex.
So the part (?!(?:net|com|org|edu)$) ensures that input is not one of the top level domains.
The part (?!.*[A-Z]) ensures that the input does not contain an uppercase letter.
The part [a-z]{3,10}$ ensures that the input is at least 3 and at most 10 characters long.
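Because the pattern is anchored with ^ and $, it tests one word at a time, so the input line has to be split first. A short Python sketch of that, assuming whitespace-separated words; first_matching_word is a name made up for the example and the domain list is just the four from the question.

```python
import re

TLDS = ("net", "com", "org", "edu")  # the four from the question, not a full list

word_re = re.compile(r"^(?!(?:%s)$)(?!.*[A-Z])[a-z]{3,10}$" % "|".join(TLDS))

def first_matching_word(text):
    """Return the first whitespace-separated word accepted by word_re, or None."""
    for word in text.split():
        if word_re.match(word):
            return word
    return None

print(first_matching_word("net com org edu ABCE abcdefghijklmnoparacbasd abcd"))
```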
Just use the "not match" operator: !~
So just create your expression and then see that a variable does not match it:
if ($var !~ /abc/) {
...
}
IMHO it's easier to do some matching with regexp and some checks with Perl.
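#!/usr/bin/env perl
use strict;
use warnings;
my $s = "net com org edu ABCE abcdefghijklmnoparacbasd abcd";
# illustrative list only; the real set of top-level domains is much larger
my %tld = map { $_ => 1 } qw(net com org edu);
sub is_tpl { return exists $tld{ $_[0] } }
# loop short words (a-z might not be what you want though)
foreach ( $s =~ /(\b[a-z]{3,10}\b)/g ) {
print $_, "\n" unless is_tpl($_);
}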
BTW, there are a lot of top level domains ..

How can I preserve whitespace when I match and replace several words in Perl?

Let's say I have some original text:
here is some text that has a substring that I'm interested in embedded in it.
I need the text to match a part of it, say: "has a substring".
However, the original text and the matching string may have whitespace differences. For example the match text might be:
has a
substring
or
has a substring
and/or the original text might be:
here is some
text that has
a substring that I'm interested in embedded in it.
What I need my program to output is:
here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.
I also need to preserve the whitespace pattern in the original and just add the start and end markers to it.
Any ideas about a way of using Perl regexes to get this to happen? I tried, but ended up getting horribly confused.
Been some time since I've used perl regular expressions, but what about:
$match =~ s/(has\s+a\s+substring)/[$1]/ig
This matches one or more whitespace characters (newlines included) between the words. It wraps the entire match in brackets while maintaining the original separation. It ain't automatic, but it does work.
You could play games with this, like taking the string "has a substring" and doing a transform on it to make it "has\s+a\s+substring" to make this a little less painful.
EDIT: Incorporated ysth's comments that the \s metacharacter matches newlines and hobbs corrections to my \s usage.
This pattern will match the string that you're looking to find:
(has\s+a\s+substring)
So, when the user enters a search string, replace any whitespace in the search string with \s+ and you have your pattern. Then, just replace every match with [match starts here]$1[match ends here] where $1 is the matched text.
In regexes, you can use + to mean "one or more." So something like this
/has\s+a\s+substring/
matches has followed by one or more whitespace chars, followed by a followed by one or more whitespace chars, followed by substring.
Putting it together with a substitution operator, you can say:
my $str = "here is some text that has a substring that I'm interested in embedded in it.";
$str =~ s/(has\s+a\s+substring)/\[match starts here]$1\[match ends here]/gs;
print $str;
And the output is:
here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.
As many have suggested, use \s+ to match whitespace. Here is how you do it automatically:
my $original = "here is some text that has a substring that I'm interested in embedded in it.";
my $search = "has a\nsubstring";
my $re = $search;
$re =~ s/\s+/\\s+/g;
$original =~ s/\b$re\b/[match starts here]$&\[match ends here]/g;
print $original;
Output:
here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.
You might want to escape any meta-characters in the string. If someone is interested, I could add it.
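For what it's worth, here is one way the escaping step could look, sketched in Python rather than Perl. Each word of the search string is escaped on its own and the words are then joined with \s+, so metacharacters in the search text stay literal while any run of whitespace (newlines included) is accepted between words. The function name mark_match is invented for this sketch.

```python
import re

def mark_match(original, search):
    """Wrap whitespace-insensitive matches of search in start/end markers."""
    words = [re.escape(word) for word in search.split()]
    if not words:
        return original  # nothing to search for
    pattern = re.compile(r"\s+".join(words))
    # m.group(0) is the text exactly as it appears in original, so the
    # original whitespace is preserved inside the markers.
    return pattern.sub(
        lambda m: "[match starts here]" + m.group(0) + "[match ends here]",
        original,
    )

original = "here is some\ntext that has\na substring that I'm interested in embedded in it."
print(mark_match(original, "has a\nsubstring"))
```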
This is an example of how you could do that.
#! /opt/perl/bin/perl
use strict;
use warnings;
my $submatch = "has a\nsubstring";
my $str = "
here is some
text that has
a substring that I'm interested in, embedded in it.
";
print substr_match($str, $submatch), "\n";
sub substr_match{
my($string,$match) = @_;
$match =~ s/\s+/\\s+/g;
# This isn't safe the way it is now, you will need to sanitize $match
$string =~ /\b$match\b/;
}
This currently does not do anything to check the $match variable for unsafe characters.