There are numerous questions on how to do a multiline regex in Perl. Most of them mention the s switch that makes a dot match a newline. However, I want to match an exact phrase (so, not a pattern) and I don't know where the newlines will be. So the question is: can you ignore newlines, instead of matching them with .?
MWE:
$pattern = "Match this exact phrase across newlines";
$text1 = "Match\nthis exact\nphrase across newlines";
$text2 = "Match this\nexact phra\nse across\nnewlines";
$text3 = "Keep any newlines\nMatch this exact\nphrase across newlines\noutside\nof the match";
$text1 =~ s/$pattern/replacement text/s;
$text2 =~ s/$pattern/replacement text/s;
$text3 =~ s/$pattern/replacement text/s;
print "$text1\n---\n$text2\n---\n$text3\n";
I can put dots in the pattern instead of spaces ("Match.this.exact.phrase") but that does not work for the second example. I can delete all newlines as preprocessing but I would like to keep newlines that are not part of the match (as in the third example).
Desired output:
replacement text
---
replacement text
---
Keep any newlines
replacement text
outside
of the match
Just replace the literal spaces with a character class that matches a space or a newline:
$pattern = "Match[ \n]this[ \n]exact[ \n]phrase[ \n]across[ \n]newlines";
Or, if you want to be more lenient, use \s or \s+ instead, since \s also matches newlines.
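As a minimal sketch of the more lenient variant, the pattern can be built from the phrase rather than typed by hand. The helper name `phrase_to_loose_regex` is made up for this illustration. It assumes line breaks only ever fall between words, so it covers the first and third sample strings but not the second one, where a word is split.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): turn a literal
# phrase into a regex in which the gap between two words may be any
# run of whitespace, including newlines.
sub phrase_to_loose_regex {
    my ($phrase) = @_;
    my $source = join '\s+', map { quotemeta($_) } split ' ', $phrase;
    return qr/$source/;
}

my $re = phrase_to_loose_regex("Match this exact phrase across newlines");

my $text1 = "Match\nthis exact\nphrase across newlines";
my $text3 = "Keep any newlines\nMatch this exact\nphrase across newlines\noutside\nof the match";

# The loop variable aliases each string, so both are modified in place.
s/$re/replacement text/ for $text1, $text3;
print "$text1\n---\n$text3\n";
```

Running each word through `quotemeta` keeps the phrase literal even if it contains regex metacharacters.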
Most of the time, you are treating newlines as spaces. If that's all you wanted to do, all you'd need is
$text =~ s/\n/ /g;
$text =~ /\Q$text_to_find/ # or $text =~ /$regex_pattern_to_match/
Then there's the one time you want to ignore it. If that's all you wanted to do, all you'd need is
$text =~ s/\n//g;
$text =~ /\Q$text_to_find/ # or $text =~ /$regex_pattern_to_match/
Doing both is next to impossible if you have a regex pattern to match. But you seem to want to match literal text, so that opens up some possibilities.
( my $pattern = $text_to_find )
=~ s/(.)/ $1 eq " " ? "[ \\n]" : "\\n?" . quotemeta($1) /seg;
$pattern =~ s/^\\n\?//;
$text =~ /$pattern/
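To see the generated pattern at work, here is a small sketch that runs it against the three strings from the question. The variable names follow the snippet above; the `@texts` array is only there for the demonstration.

```perl
use strict;
use warnings;

my $text_to_find = "Match this exact phrase across newlines";

# A space becomes "space or newline"; every other character becomes
# "optional newline, then that character (escaped)".
( my $pattern = $text_to_find )
    =~ s/(.)/ $1 eq " " ? "[ \\n]" : "\\n?" . quotemeta($1) /seg;
$pattern =~ s/^\\n\?//;    # drop the optional newline before the first character

my @texts = (
    "Match\nthis exact\nphrase across newlines",
    "Match this\nexact phra\nse across\nnewlines",
    "Keep any newlines\nMatch this exact\nphrase across newlines\noutside\nof the match",
);

s/$pattern/replacement text/ for @texts;
print join("\n---\n", @texts), "\n";
```

Newlines outside the matched phrase are left alone, which is the behaviour the third string asks for.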
It sounds like you want to change your "exact" pattern to match newlines anywhere, and also to allow newlines instead of spaces. So change your pattern to do so:
$pattern = "Match this exact phrase across newlines";
$pattern =~ s/\S\K\B/\n?/g;
$pattern =~ s/ /[ \n]/g;
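A short sketch of these two substitutions applied to the second sample string, assuming Perl 5.10 or later for `\K`. The `$punct` check at the end is an addition for illustration: it shows that `\B` does not hold between a letter and punctuation, so this approach appears best suited to phrases made of plain words.

```perl
use strict;
use warnings;

my $pattern = "Match this exact phrase across newlines";
$pattern =~ s/\S\K\B/\n?/g;    # optional newline after a non-space character wherever \B holds
$pattern =~ s/ /[ \n]/g;       # a space may also be a newline

my $text2 = "Match this\nexact phra\nse across\nnewlines";
( my $replaced = $text2 ) =~ s/$pattern/replacement text/;
print "$replaced\n";

# Caveat: between a word character and punctuation there is a word
# boundary, so no optional newline gets inserted there.
my $punct = "a,b";
$punct =~ s/\S\K\B/\n?/g;
print $punct eq "a,b" ? "punctuation pattern unchanged\n" : "punctuation pattern changed\n";
```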
It certainly is ugly, but it works:
M\n?a\n?t\n?c\n?h\st\n?h\n?i\n?s\se\n?x\n?a\n?c\n?t\sp\n?h\n?r\n?a\n?s\n?e\sa\n?c\n?r\n?o\n?s\n?s\sn\n?e\n?w\n?l\n?i\n?n\n?e\n?s
For every pair of letters inside a word, allow a newline between them with \n?. And replace each space in your regex with \s.
May not be usable, but it gets the job done ;)
Check it out at regex101.
I'm using this code:
$text =~ s/\s(\w)/\u$1/g;
But "This is an example"
becomes "ThisIsAnExample"
instead of "This Is An Example".
How to preserve blank spaces?
Use lookbehind.
$text =~ s/(?<!\S)(\w)/\u$1/g;
Or use the more efficient \K (Perl 5.10+).
$text =~ s/(?:^|\s)\K(\w)/\u$1/g;
Both of the solutions will make sure the first word is capitalized too. If that's not an issue, the second solution can be simplified to the following:
$text =~ s/\s\K(\w)/\u$1/g;
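The three variants can be compared side by side. This is a small sketch using the sample sentence from the question; the variable names are chosen for the illustration only.

```perl
use strict;
use warnings;

my $text = "this is an example";

# Lookbehind: capitalize any word character not preceded by non-whitespace.
( my $lookbehind = $text ) =~ s/(?<!\S)(\w)/\u$1/g;

# \K: match start-of-string or whitespace, then keep it out of the replaced part.
( my $keep_all = $text ) =~ s/(?:^|\s)\K(\w)/\u$1/g;

# Simplified \K: requires preceding whitespace, so the first word is skipped.
( my $skip_first = $text ) =~ s/\s\K(\w)/\u$1/g;

print "$lookbehind\n$keep_all\n$skip_first\n";
```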
The match includes the whitespace, but the replacement doesn't put it back.
$text =~ s/(\s)(\w)/$1\u$2/g;
Since \s contains different types of whitespace characters, if you want to keep it in your replacement, you need to capture it and put it back.
An alternative is to use word boundaries and full "words".
$text =~ s/\b(\w+)\b/\u$1/g;
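One difference between the capture-and-restore version and the word-boundary version shows up with apostrophes, because `\b` also holds on either side of an apostrophe inside a word. A minimal sketch, with sample strings and variable names made up for the illustration:

```perl
use strict;
use warnings;

my $plain = "this is an example";
my $apos  = "don't stop";

# Word-boundary version: every run of word characters is capitalized.
( my $plain_boundary = $plain ) =~ s/\b(\w+)\b/\u$1/g;
( my $apos_boundary  = $apos )  =~ s/\b(\w+)\b/\u$1/g;

# Capture-and-restore version: only characters right after whitespace.
( my $apos_capture = $apos ) =~ s/(\s)(\w)/$1\u$2/g;

print "$plain_boundary\n$apos_boundary\n$apos_capture\n";
```

With the word-boundary pattern, the "t" after the apostrophe counts as its own word and is capitalized too.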
I am trying to remove all words that contain two keys (in Perl).
For example, the string
garble variable10 variable1 vssx vddx xi_21_vssx vddx_garble_21 xi_blahvssx_grbl_2
Should become
garble variable10 variable1
To just remove the normal, unappended/prepended keys is easy:
$var =~ s/(vssx|vddx)/ /g;
However I cannot figure out how to get it to remove the entire xi_21_vssx part. I tried:
$var =~ s/\s.*(vssx|vddx).*\s/ /g
Which does not work correctly. I do not understand why... it seems like \s should match the space, then .* matches anything up to one of the patterns, then the pattern, then .* matches anything following the pattern until the next space.
I also tried replacing \s (whitespace) with \b (word boundary) but that did not work either. Another attempt:
$var =~ s/ .*(vssx|vddx).* / /g
$var =~ s/(\s.*vssx.*\s|\s.*vddx.*\s)/ /g
As well as a few other mungings.
Any pointers/help would be greatly appreciated.
-John
I think the regex will just be
$var =~ s/\S*(vssx|vddx)\S*/ /g;
You can use
\s*\S*(?:vssx|vddx)\S*\s*
The problems with your regex were:
The .* should have been non-greedy.
The .* in front of (vssx|vddx) mustn't match whitespace characters, so you have to use \S*.
Note that there's no way to properly preserve the space between words - i.e. a vssx b will become ab.
regex101 demo.
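One possible way around the spacing problem mentioned above is to replace each unwanted word with a single space and normalize the whitespace afterwards. This is a sketch under that assumption, not part of the original answers; it uses the simpler `\S*` pattern, and the variable names are illustrative.

```perl
use strict;
use warnings;

my $var = 'garble variable10 variable1 vssx vddx xi_21_vssx vddx_garble_21 xi_blahvssx_grbl_2';

( my $cleaned = $var ) =~ s/\S*(?:vssx|vddx)\S*/ /g;   # blank out every word containing a key
$cleaned =~ s/\s+/ /g;                                 # squeeze runs of whitespace
$cleaned =~ s/\A\s+|\s+\z//g;                          # trim both ends
print "[$cleaned]\n";

# The short example from the note above keeps its separating space.
( my $short = 'a vssx b' ) =~ s/\S*(?:vssx|vddx)\S*/ /g;
$short =~ s/\s+/ /g;
print "[$short]\n";
```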
I am trying to remove all words that [...]
This type of problem lends itself well to grep, which can be used to find the elements in a list that match a condition. You can use split to convert your string to a list of words and then filter it like this:
use strict;
use warnings;
use 5.010;
my $string = 'garble variable10 variable1 vssx vddx xi_21_vssx vddx_garble_21 xi_blahvssx_grbl_2';
my @words = split ' ', $string;
my @filtered = grep { $_ !~ /(?:vssx|vddx)/ } @words;
say "@filtered";
Output:
garble variable10 variable1
Try this as the regex:
\b[\w]*(vssx|vddx)[\w]*\b
I have a string $text and want to modify it with a regex. The string contains multiple sections like <NAME>John</NAME>.
I want to search for those sections, which I would normally do with something like
$text =~ m/<NAME>(.*?)<\/NAME>/g
but then make sure that there are no leading and trailing blanks and no leading non-word characters, which I would normally ensure with something like
$temp =~ s/^\s+|\s+$//g; # trim leading and trailing whitespaces
$temp =~ s/^\W*//g; # remove all leading non-word chars
Now my question is: How do I actually make this happen? Is it possible to use a s/// regex instead of the m//?
This is possible in a single substitution, but it's unnecessarily complex. I suggest you do a two-tier substitution using an executable replacement.
my $text = '<NAME> %^John^%
</NAME>';
$text =~ s{ (?<=<NAME>) ([^<>]*) (?=</NAME>) }{
(my $new = $1) =~ s/\A\s+|\s+\z//g;
$new =~ s/\A\W+//;
$new;
}egx;
print $text;
output
<NAME>John^%</NAME>
This is even simpler if you have Perl 5.14 or later and want to use the non-destructive substitution mode (the /r modifier).
$text =~ s{ (?<=<NAME>) ([^<>]*) (?=</NAME>) }{ $1 =~ s/\A\s+|\s+\z//gr =~ s/\A\W+//r }exg;
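For a quick check of the `/r` version on text with more than one section, here is a small sketch. The sample input and the `$cleaned` variable are made up for the illustration, and Perl 5.14 or later is assumed.

```perl
use 5.014;      # needed for the /r modifier; also enables strict
use warnings;

my $text = "<NAME>  John </NAME> and <NAME>--Jane\n</NAME>";

# Outer substitution: find the text between the tags (the tags themselves
# are only looked at, not replaced). Inner substitutions: trim whitespace,
# then strip leading non-word characters, each returning a modified copy.
( my $cleaned = $text ) =~ s{ (?<=<NAME>) ([^<>]*) (?=</NAME>) }
                            { $1 =~ s/\A\s+|\s+\z//gr =~ s/\A\W+//r }egx;

print "$cleaned\n";
```

The `/x` flag matters here: without it, the spaces inside the outer pattern would be matched literally.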
If I understand correctly, what you want to do is merely "clean up" the text inside the tag (insofar as it's possible to "parse" XML using regular expressions). This should do the trick:
$text =~ s/(<NAME>)\s*\W*(.*?)\s*(<\/NAME>)/$1$2$3/sgi;
my @matches = ($result =~ m/INFO\n(.*?)\n/);
So in Perl I want to store all matches to that regular expression. I'm looking to store the value between INFO\n and \n each time it occurs.
But I'm only getting the last occurrence stored. Is my regex wrong?
Use the /g modifier for global matching.
my @matches = ($result =~ m/INFO\n(.*?)\n/g);
Lazy quantification is unnecessary in this case as . doesn't match newlines. The following would give better performance:
my @matches = ($result =~ m/INFO\n(.*)\n/g);
/s can be used if you do want periods to match newlines. For more info about these modifiers, see perlre.
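A minimal sketch of the difference the `/g` modifier makes in list context. The sample `$result` string is invented for the illustration.

```perl
use strict;
use warnings;

my $result = "INFO\nfirst\nnoise\nINFO\nsecond\n";

# With /g in list context, every capture from every match is returned.
my @matches = $result =~ m/INFO\n(.*)\n/g;

# Without /g, only the captures of a single match are returned.
my ($single) = $result =~ m/INFO\n(.*)\n/;

print "all: @matches\n";
print "single: $single\n";
```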
How can I find the first substring until I find the first digit?
Example:
my $string = 'AAAA_BBBB_12_13_14' ;
Result expected: 'AAAA_BBBB_'
Judging from the tags you want to use a regular expression. So let's build this up.
We want to match from the beginning of the string so we anchor with a ^ metacharacter at the beginning
We want to match anything but digits so we look at the character classes and find out this is \D
We want 1 or more of these so we use the + quantifier which means 1 or more of the previous part of the pattern.
This gives us the following regular expression:
^\D+
Which we can use in code like so:
my $string = 'AAAA_BBBB_12_13_14';
$string =~ /^\D+/;
my $result = $&;
Most people got half of the answer right, but they missed several key points.
You can only trust the match variables after a successful match. Don't use them unless you know you had a successful match.
The $&, $`, and $' variables have well-known performance penalties across all regexes in your program.
You need to anchor the match to the beginning of the string. Since Perl now has user-settable default match flags, you want to stay away from the ^ beginning of line anchor. The \A beginning of string anchor won't change what it does even with default flags.
This would work:
my $substring = $string =~ m/\A(\D+)/ ? $1 : undef;
If you really wanted to use something like $&, use Perl 5.10's per-match version instead. The /p switch provides non-global-performance-sucking versions:
my $substring = $string =~ m/\A\D+/p ? ${^MATCH} : undef;
If you're worried about what might be in \D, you can specify the character class yourself instead of using the shortcut:
my $substring = $string =~ m/\A[^0-9]+/p ? ${^MATCH} : undef;
I don't particularly like the conditional operator here, so I would probably use the match in list context:
my( $substring ) = $string =~ m/\A([^0-9]+)/;
If there must be a number in the string (so that you don't match an entire string that has no digits), you can throw in a lookahead, which won't be part of the capture:
my( $substring ) = $string =~ m/\A([^0-9]+)(?=[0-9])/;
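Wrapped in a small function, the list-context match makes the failure cases easy to handle. This is a sketch; the function name `leading_non_digits` and the extra test strings are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the non-digit prefix that is followed by
# a digit, or undef if the string has no such prefix.
sub leading_non_digits {
    my ($string) = @_;
    my ($prefix) = $string =~ m/\A([^0-9]+)(?=[0-9])/;
    return $prefix;
}

for my $string ( 'AAAA_BBBB_12_13_14', '123', 'no digits here' ) {
    my $prefix = leading_non_digits($string);
    print "$string => ", ( defined $prefix ? "'$prefix'" : 'no match' ), "\n";
}
```

Because the capture is assigned in list context, a failed match leaves the variable undefined instead of holding a stale value from an earlier match.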
$str =~ /(\d)/; print $`;
This code prints the part of the string that precedes the match.
perl -le '$string=q(AAAA_BBBB_12_13_14);$string=~m{(\D+)} and print $1'
AAAA_BBBB_