I'm trying to read some Perl code and found the following regular expressions:
$str =~ s/(<.+?>)|(&\w+;)/ /gis;
$str =~ /(\w+)/gis
I wonder what these codes represent.
Can anyone help me?
The first one, $str =~ s/(<.+?>)|(&\w+;)/ /gis;, does a substitution:
$str : the variable to work on
=~ : binding operator; applies the substitution to $str and stores the result back in it
s : substitution operator
/ : beginning of the regex
( : beginning of capture group 1
< : <
.+? : one or more of any char NOT greedy
> : >
) : end of capture group 1
| : alternation
( : beginning of capture group 2
& : &
\w+ : one or more word char ie: [a-zA-Z0-9_]
; : ;
) : end of group 2
/ : end of search part
: a space
/ : end of replace part
gis; : global, case insensitive, single-line (. also matches newlines)
This will replace every tag and every encoded entity such as &amp; or &lt; with a space.
The second one expects at least one word to be left.
One way to help decipher regular expressions is to use the YAPE::Regex::Explain module from CPAN:
#!/usr/bin/env perl
use YAPE::Regex::Explain;
#...may need to single quote $ARGV[0] for the shell...
print YAPE::Regex::Explain->new( $ARGV[0] )->explain;
Assuming this snippet is named 'rexplain' you would do:
$ ./rexplain 's/(<.+?>)|(&\w+;)/ /gis'
The first strips out every XML/HTML tag and every character entity, replacing each one with a space. The second finds every substring consisting entirely of word characters.
In detail:
The first part of the first expression matches a <, then any character with the . (newlines included, thanks to the /s flag at the end). The + quantifier on its own would match one or more characters up to the last > found in $str, but the ? after it makes it non-greedy, so it only matches up to the first > encountered. The second part matches & followed by word characters until a ; is found. Since ; is not a word character, the ? modifier is not needed there. The s/ up front means a substitution, and the bit after the second / is what each match is replaced with. The /gis at the end means *g*lobal, case *i*nsensitive, and *s*ingle line.
The second expression finds the first substring of word characters and puts it in $1. Called repeatedly in scalar context, the /g at the end makes it continue from where the previous match ended, so it steps through every such substring in $str.
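To make the two behaviours of /g concrete, here is a minimal sketch. The sample string and the variable names (@one_by_one, @all) are made up for illustration; only the two patterns come from the question.

```perl
use strict;
use warnings;

# Hypothetical sample input; only the two regexes are from the question.
my $str = "foo <b>bar</b> &amp; baz";

# Tags and entities are each replaced by a single space.
$str =~ s/(<.+?>)|(&\w+;)/ /gis;

# Scalar context: every evaluation resumes after the previous match,
# and $1 holds the word found on that pass.
my @one_by_one;
while ($str =~ /(\w+)/g) {
    push @one_by_one, $1;
}

# List context: /g returns all captures in one go.
my @all = $str =~ /(\w+)/g;

print "@one_by_one\n";
print "@all\n";
```

Both lists should contain the same three words, foo, bar and baz.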
The first one takes a string and replaces html tags or html character codes with a space
The second one makes sure there is still a word left when done.
These "codes" are regular expressions. Type this to learn more:
perldoc perlre
The code above replaces some HTML/XML tags and some HTML character entities such as &nbsp; in $str with spaces. But there are better ways to do this using CPAN modules. The code then tries to match the first word in $str and capture it into $1.
Ex:
perl -le '$str = "foo<br> bar<another\ntag>baz"; print $str; $str =~ s/(<.+?>)|(&\w+;)/ /gis; $str =~ /(\w+)/gis; print $str; print $1;'
It prints:
foo<br> bar<another
tag>baz
foo bar baz
foo
I have a (probably very basic) question about how to construct a (perl) regex, perl -pe 's///g;', that would find/replace multiple instances of a given character/set of characters in a specified string. Initially, I thought the g "global" flag would do this, but I'm clearly misunderstanding something very central here. :/
For example, I want to eliminate any non-alphanumeric characters in a specific string (within a larger text corpus). Just by way of example, the string is identified by starting with [ followed by #, possibly with some characters in between.
[abc#def"ghi"jkl'123]
The following regex
s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1$2/g;
will find the first " and if I run it three times I have all three.
Similarly, what if I want to replace the non-alphanumeric characters with something else, let's say an X.
s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1X$2/g;
does the trick for one instance. But how can I find all of them in one go?
The reason your code doesn't work is that /g doesn't rescan the string after a substitution. It finds all non-overlapping matches of the given regex and then substitutes the replacement part in.
In [abc#def"ghi"jkl'123], there is only a single match (which is the [abc#def" part of the string, with $1 = '[abc#def' and $2 = ''), so only the first " is removed.
After the first match, Perl scans the remaining string (ghi"jkl'123]) for another match, but it doesn't find another [ (or #).
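A quick way to check this is to look at the return value of s///g, which is the number of substitutions made. This is only a sketch; the variable names $s and $count are mine, and the pattern is the one from the question, unchanged.

```perl
use strict;
use warnings;

my $s = q([abc#def"ghi"jkl'123]);

# The question's pattern. s///g returns how many substitutions it made.
my $count = ($s =~ s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1$2/g);

print "$count\n";   # 1
print "$s\n";       # [abc#defghi"jkl'123]
```

Only one substitution happens, because every further match would again have to start with a [ that is no longer ahead of the current position.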
I think the most straightforward solution is to use a nested search/replace operation. The outer match identifies the string within which to substitute, and the inner match does the actual replacement.
In code:
s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9//cdr }xe;
Or to replace each match by X:
s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9/X/cr }xe;
We match a prefix of [, followed by 0 or more characters that are not [ or ] or #, followed by #.
\K is used to mark the virtual beginning of the match (i.e. everything matched so far is not included in the matched string, which simplifies the substitution).
We match and capture 0 or more characters that are not [ or ].
Finally we match a suffix of ] in a look-ahead (so it's not part of the matched string either).
The replacement part is executed as a piece of code, not a string (as indicated by the /e flag). Here we could have used $1 =~ s/[^a-zA-Z0-9]//gr or $1 =~ s/[^a-zA-Z0-9]/X/gr, respectively, but since each inner match is just a single character, it's also possible to use a transliteration.
We return the modified string (as indicated by the /r flag) and use it as the replacement in the outer s operation.
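As a rough sketch of how this behaves on a whole line, the same two substitutions can be run with /g added so that every bracketed group is processed. The sample text and the names $deleted and $marked are assumptions of mine. Perl 5.14 or later is needed for /r on tr///.

```perl
use strict;
use warnings;

# Hypothetical input: one group from the question, one group without '#',
# and one more group that should also be cleaned.
my $text = q([abc#def"ghi"jkl'123] [no"hash] [x#y-z]);

# Delete non-alphanumerics after the '#' in every bracketed group.
(my $deleted = $text) =~
    s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9//cdr }xeg;

# Replace them with X instead.
(my $marked = $text) =~
    s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9/X/cr }xeg;

print "$deleted\n";   # [abc#defghijkl123] [no"hash] [x#yz]
print "$marked\n";    # [abc#defXghiXjklX123] [no"hash] [x#yXz]
```

The group without a # is left alone, since the outer pattern never matches it.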
So...I'm going to suggest a marvelously computationally inefficient approach to this. Marvelously inefficient, but possibly still faster than a variable-length lookbehind would be...and also easy (for you):
The \K causes everything before it to be dropped....so only the character after it is actually replaced.
perl -pe 'while (s/\[[^]]*#[^]]*\K[^]a-zA-Z0-9]//){}' file
Basically we just have an empty loop that executes until the search and replace replaces nothing.
Slightly improved version:
perl -pe 'while (s/\[[^]]*?#[^]]*?\K[^]a-zA-Z0-9](?=[^]]*?])//){}' file
The (?=) verifies that its content exists after the match without being part of the match. This is a variable-length lookahead (what we're missing going the other direction). I also made the *s lazy with the ? so we get the shortest match possible.
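For anyone who wants to watch the loop work outside a one-liner, here is a small sketch using the improved pattern as-is. The sample line and the $passes counter are my own additions for illustration.

```perl
use strict;
use warnings;

# Hypothetical line with quotes both inside and outside the brackets.
my $line = q(x"y [abc#def"ghi"jkl'123] z'w);

# Each pass deletes one offending character; the loop ends when
# s/// finds nothing left to replace and returns false.
my $passes = 0;
$passes++ while $line =~ s/\[[^]]*?#[^]]*?\K[^]a-zA-Z0-9](?=[^]]*?])//;

print "$passes\n";   # 3
print "$line\n";     # x"y [abc#defghijkl123] z'w
```

Three passes are needed for the three characters, and the quotes outside the brackets are untouched.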
Here is another approach. Capture precisely the substring that needs work, and in the replacement part run a regex on it that cleans it of non-alphanumeric characters
use warnings;
use strict;
use feature 'say';
my $var = q(ah [abc#def"ghi"jkl'123] oh); #'
say $var;
$var =~ s{ \[ [^\[\]]*? \#\K ([^\]]+) }{
(my $v = $1) =~ s{[^0-9a-zA-Z]}{}g;
$v
}ex;
say $var;
Here the lone $v at the end is needed so that the block returns the modified string rather than the number of substitutions, which is what the s/// operator itself returns. This can be improved by using the /r modifier, which returns the changed string and doesn't change the original (so it doesn't attempt to modify $1, which isn't allowed):
$var =~ s{ \[ [^\[\]]*? \#\K ([^\]]+) }{
$1 =~ s/[^0-9a-zA-Z]//gr;
}ex;
The \K is there so that everything matched before it is "dropped" -- it is kept out of the matched text, so we don't need to capture it in order to put it back. The /e modifier makes the replacement part be evaluated as code.
The code in the question doesn't work because everything matched is consumed, and (under /g) the search continues from the position after the last match, attempting to find that whole pattern again further down the string. That fails and only that first occurrence is replaced.
The problem with matched text that we want to leave in the string can often be remedied by \K (used in all current answers), which excludes everything matched before it from the text that gets replaced.
$str="!bypass";
I need to return the string only if it starts with "!".
How can I return bypass?
To match strings that start with a ! you need this pattern. The ^ is the anchor at the beginning of the string.
/^!/
If you want to capture the stuff after the !, you need this pattern. The parentheses () are a capture group. They tell Perl to grab everything between them and keep it. The . means any character, and the + is a quantifier for as many as possible, at least one. So .+ means grab everything.
/^!(.+)/
To apply it, do this.
$str =~ m/^!(.+)/;
And to get the "bypass" out of that pattern, use the $1 match variable that was assigned automatically by Perl with the m// operation.
print $1; # will print bypass
To make that conditional, it would be:
print $1 if $str =~ m/^!(.+)/;
The if here is in post-fix notation, which lets you omit the block and the parentheses. It's the same as the following, but shorter and easier to read for single statements.
if ( $str =~ m/^!(.+)/ ) {
print $1;
}
If you want to permanently change $str to not have an exclamation mark at the beginning, you need to use a substitution instead.
$str =~ s/^!//;
The s/// is the substitution operator. It changes $str in place. The original value including the ! will be lost.
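To see the difference between capturing and substituting side by side, here is a small sketch. The variables $captured and $stripped are names I made up, and the substitution is done on a copy only so that the original stays available for comparison.

```perl
use strict;
use warnings;

my $str = "!bypass";

# Capture: $str itself is left untouched.
my $captured;
$captured = $1 if $str =~ m/^!(.+)/;

# Substitution on a copy, so the original survives for comparison.
(my $stripped = $str) =~ s/^!//;

print "$captured\n";   # bypass
print "$stripped\n";   # bypass
print "$str\n";        # !bypass
```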
Use ^!\K.+.
It works this way:
^! - Match initial ! (but this will soon change, see below).
\K - Keep - "forget" about what you have matched so far and set the starting point of the match here (after the !).
.+ - Match non-empty sequence of chars.
Due to \K, only the last part (.+) is actually matched.
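A minimal sketch of that, assuming you read the matched text from $& (the variable Perl sets to the text of the last successful match). The names $rest and $none are mine, and \K needs Perl 5.10 or later.

```perl
use strict;
use warnings;

my $str = "!bypass";

# Because of \K, the leading "!" is not part of the matched text in $&.
my $rest;
$rest = $& if $str =~ /^!\K.+/;

# A string without the leading "!" does not match at all.
my $none;
$none = $& if "bypass" =~ /^!\K.+/;

print "$rest\n";                                   # bypass
print defined $none ? "matched\n" : "no match\n";  # no match
```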
How would I match any number of any characters between two specific words... I have a document with a block of text enclosed between 'begin parameters' and 'end parameters'. These two phrases are separated by a number of lines of text. So my text looks like this:
begin parameters
<lines of text here \n.
end parameters
My current regular expression looks like this:
my $regex = "begin parameters[.*\n*]end parameters";
However this is not matching. Does anybody have any suggestions?
Use the /s switch so that the any character . will match new lines.
I also suggest that you use non greedy matching by adding ? to your quantifier.
use strict;
use warnings;
my $data = do {local $/; <DATA>};
if ($data =~ /begin parameters(.*?)end parameters/s) {
print "'$1'";
}
__DATA__
begin parameters
<lines of text here.
end parameters
Outputs:
'
<lines of text here.
'
Your current regular expression does not do what you may think: because those characters are placed inside a character class, it matches a single character out of ( ., *, \n, * ) instead of actually matching what you want.
You can use the s modifier forcing the dot . to match newline sequences. By placing a capturing group around what you want to extract, you can access that by using $1
my $regex = qr/begin parameters(.*?)end parameters/s;
my $string = do {local $/; <DATA>};
print $1 if $string =~ /$regex/;
Please try this:
begin parameters([\S\s]+?)end parameters
Translation: this will look for any character that is whitespace, or any character that is not whitespace (so in effect, any character at all), until it finds "end parameters".
I hope this is what you expect.
The meta-character . loses its special properties inside of a character class.
So [.*\n*] actually matches exactly one character: a literal period, a literal asterisk, or a newline.
What you actually want is to match zero or more of any character, newlines included. You can represent that with a non-capturing group:
begin parameters(?:.|\n)*?end parameters
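The three variants discussed in this thread can be compared directly. This is only a sketch: the sample text and the variable names are invented, and the patterns are the ones quoted in the question and answers (with a capture group added to the last one so its result can be printed).

```perl
use strict;
use warnings;

# Hypothetical document body.
my $text = "begin parameters\nfoo = 1\nbar = 2\nend parameters\n";

# The character class matches exactly ONE character out of . * or newline,
# so this fails as soon as more than one character sits between the phrases.
my $class_matches = $text =~ /begin parameters[.*\n*]end parameters/ ? "yes" : "no";

# /s lets . match newlines; the lazy .*? stops at the first "end parameters".
my ($with_s) = $text =~ /begin parameters(.*?)end parameters/s;

# Same result without /s, spelling out "any character or a newline".
my ($with_alt) = $text =~ /begin parameters((?:.|\n)*?)end parameters/;

print "character class matched: $class_matches\n";   # no
print "with /s:   [$with_s]\n";
print "with (?:): [$with_alt]\n";
```

Both working variants capture the same text, including the newline right after "begin parameters" and the one right before "end parameters".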
What does the following snippet do?
if ($str =~ /^:(\w+)/) {
$hash{$1} = 1;
}
It uses the first successful capture as key in the hash. And the $str has to contain one or more words but I am not sure what the ^: means
^ start at beginning of string
: match a literal colon
( capture the following string
\w+ matching one or more word characters (letters, digits, underscore)
) end capture
The capture is stored in $1, which then becomes a key in the hash %hash below.
So if you have the string :foo, you will match foo, and get $hash{foo} = 1. The purpose of this code is no doubt to extract certain strings and dedupe them using a hash.
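As a small sketch of that dedupe effect (the list of input strings is invented for the example):

```perl
use strict;
use warnings;

my %hash;

# Hypothetical inputs: a duplicate, a string without the leading colon,
# and one where the colon is not at the very start.
for my $str (':foo', ':bar', 'baz', ':foo', ' :qux') {
    if ($str =~ /^:(\w+)/) {
        $hash{$1} = 1;
    }
}

print join(",", sort keys %hash), "\n";   # bar,foo
```

The duplicate :foo collapses into one key, and the strings that do not begin with a colon are ignored.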
^: means a ":" symbol at the start of the line. Also, it will capture only a single "word" after the :.
It means: at the beginning of the line (^), a :
e.g:
$string = q~:Thats~;
$hash{Thats} = 1;
$string2 = q~Thats~;
The if statement succeeds for $string, but it fails for $string2 because that doesn't start with a :.
You said:
'And the $str has to contain one or more words ...'
I'm not sure if this is simply a typo or if your intention is different from your small example. Now (according to your post), your regex would match a string like: :Hello. In Perl, this could be written also like
my %hash = ();
my $str = ':Hello';
$hash{ $1 }++ if $str =~ /^:(\w+)/;
Now, if you changed ^: in your regex to (?:^|:), which means that the word in your string should be preceded by either 'start of string' ^ OR a colon :, your regex could match every field of a line like 'Hello:World:Perl:Script' (maybe this was the real intention). Note that a character class such as [:^] would not do this, because inside brackets ^ is a literal caret rather than an anchor.
Such a string could then be dissected in a while loop:
$hash{ $1 }++ while $str =~ /(?:^|:)(\w+)/g;
If you print the captured keys: print "@{[keys %hash]}";
the result would be: Perl Script Hello World (the order of the keys is undefined).
These kinds of strings are widespread in the Unix world; e.g. the environment variables PATH and LD_LIBRARY_PATH, and also the file /etc/passwd, look like that.
BTW, this is only an idea - IF your typo wasn't really one ;-)
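A short sketch of the loop on a colon-separated string, comparing an alternation (?:^|:) with a character class [:^]. Inside a character class the caret is a literal character, not an anchor, so the two behave differently. The hash name %class is mine, added for the comparison.

```perl
use strict;
use warnings;

my $str = 'Hello:World:Perl:Script';

# Alternation: "start of string OR a colon" in front of each word.
my %hash;
$hash{$1}++ while $str =~ /(?:^|:)(\w+)/g;

# Character class: here ^ is a literal caret, so the first field,
# which has no colon in front of it, is skipped.
my %class;
$class{$1}++ while $str =~ /[:^](\w+)/g;

print join(" ", sort keys %hash), "\n";    # Hello Perl Script World
print join(" ", sort keys %class), "\n";   # Perl Script World
```

The keys are sorted here only to make the output predictable; keys on its own returns them in no particular order.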
I have a file which looks like:
uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768
uniprotkb:P91682|intact:EBI-142245 uniprotkb:Q24117|intact:EBI-156442
uniprotkb:P92177-3|intact:EBI-204491 uniprotkb:Q9VDK2|intact:EBI-87444
and I wish to extract strings between : and | separators, the output should be:
Q9VNB0 A1ZBG6
P91682 Q24117
P92177-3 Q9VDK2
tab delimited between the two columns.
I wrote in unix a perl command:
perl -l -ne '/:([^|]*)?[^:]*:([^|]*)/ and print($1,"\t",$2)' <file>
the output that I got is:
Q9VNB0 EBI-102551 uniprotkb:A1ZBG6
P91682 EBI-142245 uniprotkb:Q24117
P92177-3 EBI-204491 uniprotkb:Q9VDK2
I wish to know what am I doing wrong and how can I fix the problem.
I don't wish to use split function.
Thanks,
Tom.
The expression you give is too greedy and thus consumes more characters than you wanted. The following expression works on your sample data set:
perl -l -ne '/:([^|]*)\|.*:([^|]*)\|/ and print($1,"\t",$2)'
It anchors the search with explicit matches for something between a ":" and "|" pair. If your data doesn't match exactly, it should ignore the input line, but I have not tested this. I.e., this regex assumes exactly two entries between ":" and "|" will exist per line.
Try m/: ( [^:|]+ ) \| .+ : ( [^:|]+ ) \| /x instead.
A fix could be to use a greedy expression between the first string and the second one. With .* the match runs to the end of the line and then backtracks, searching for the last colon that is followed by non-pipe characters and a pipe.
perl -l -ne '/:([^|]*).*:([^|]*)\|/ and print($1,"\t",$2)' <file>
Output:
Q9VNB0 A1ZBG6
P91682 Q24117
P92177-3 Q9VDK2
See it in action:
:([\w\-]*?)\|
Another method:
:(\S*?)\|
The way you've specified it, it has to match that way. You want a single colon
followed by any number of non-pipe, followed by any number of non-colon.
single colon -> :
non-pipe -> Q9VNB0
non-colon -> |intact
colon -> :
non-pipe -> EBI-102551 uniprotkb:A1ZBG6
Instead I make a space the end-of-contract, and require all my patterns to begin
with a colon, end with a pipe and consist of non-space/non-pipe characters.
perl -M5.010 -lne 'say join( "\t", m/[:]([^\s|]+)[|]/g )';
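Here is the same comparison as a small script, in case it helps to see both patterns run on one line of the sample data. This is a sketch; the variable names are mine.

```perl
use strict;
use warnings;

my $line = 'uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768';

# The pattern from the question: the second capture runs up to the next
# pipe, so it swallows the space and the start of the second column.
my ($first, $second) = $line =~ /:([^|]*)?[^:]*:([^|]*)/;

# List-context /g match: every run of non-space, non-pipe characters
# that sits between a colon and a pipe.
my @ids = $line =~ m/[:]([^\s|]+)[|]/g;

print "question's pattern: [$first] [$second]\n";
print join("\t", @ids), "\n";
```

The first print shows Q9VNB0 and then "EBI-102551 uniprotkb:A1ZBG6", which is the unwanted output from the question; the second prints the two wanted identifiers separated by a tab.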
perl -nle'print "$1\t$2" if /:([^|]*)\S*\s[^:]*:([^|]*)/'
Or with 5.10+:
perl -nE'say "$1\t$2" if /:([^|]*)\S*\s[^:]*:([^|]*)/'
Explanation:
: Matches the start of the first "word".
([^|]*) Matches the desired part of the first "word".
\S* Matches the end of the first "word".
\s Matches the "word" separator.
[^:]*: Matches the start of the second "word".
([^|]*) Matches the desired part of the second "word".
This isn't the shortest answer (although it's close) because each part is quite independent of the others. This makes it more robust, less error-prone, and easier to maintain.
Why do you not want to use the split function? On the face of it this would be easily solved by writing
my @fields = map /:([^|]+)/, split;
I am not sure how your regex is supposed to work. Using the /x modifier to allow non-significant whitespace it looks like this
/ : ([^|]*)? [^:]* : ([^|]*) /x
which finds a colon and optionally captures as many non-pipe characters as possible. It then skips over as many non-colon characters as possible to the next colon, and then captures as many non-pipe characters as possible (possibly none). Because all of your quantifiers are greedy, any one of them is allowed to consume all of the rest of the string as long as the characters match the character class. Note that a ? that indicates an optional sequence will first of all match all that it can, and the option to skip the sequence will be taken only if the rest of the pattern cannot then be made to match.
It is hard to judge from your examples the precise criteria for a field, but this code should do the trick. It finds sequences of characters that are neither a colon nor a pipe that are preceded by a colon and terminated by a pipe
use strict;
use warnings;
while (<DATA>) {
my @fields = /:([^:|]+)\|/g;
print join("\t", @fields), "\n";
}
__DATA__
uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768
uniprotkb:P91682|intact:EBI-142245 uniprotkb:Q24117|intact:EBI-156442
uniprotkb:P92177-3|intact:EBI-204491 uniprotkb:Q9VDK2|intact:EBI-87444
output
Q9VNB0 A1ZBG6
P91682 Q24117
P92177-3 Q9VDK2