Perl regular expression that only keeps characters until first newline

When I try with a regular character instead of a newline such as X, the following works:
my $str = 'fooXbarXbaz';
$str =~ s/X.*//;
print $str;
→ foo
However, when the character to look for is \n, the code above fails:
my $str = "foo\nbar\nbaz";
$str =~ s/\n.*//;
print $str;
→ foo
baz
Why is this happening? It looks like . only matches b, a, and r, but not the second \n and the rest of the string.

The . metacharacter does not match a newline unless you add the /s switch. This way it matches everything after the newline even if there is another newline.
$str =~ s/\n.*//s;
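To see the difference side by side, here is a minimal sketch that runs both substitutions on the string from the question. The variable names are only illustrative.

```perl
use strict;
use warnings;

my $original = "foo\nbar\nbaz";

# Without /s, "." stops at the next newline, so only "\nbar" is removed.
(my $without_s = $original) =~ s/\n.*//;

# With /s, "." also matches newlines, so everything from the first
# newline to the end of the string is removed.
(my $with_s = $original) =~ s/\n.*//s;

for my $result ($without_s, $with_s) {
    (my $visible = $result) =~ s/\n/\\n/g;   # show newlines as a literal \n
    print "$visible\n";
}
```

This prints foo\nbaz for the first substitution and foo for the second.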
Another way to do this is to match a character class that matches everything, such as digits and non-digits:
$str =~ s/\n[\d\D]*//;
Or, an alternation of a newline and not-a-newline (\N needs Perl 5.12 or later and is not allowed inside a bracketed character class):
$str =~ s/\n(?:\n|\N)*//;
I'd be tempted to do this with simpler string operations. Split into lines but only save the first one:
$str = ( split /\n/, $str, 2 )[0];
Or, a substring up to the first newline:
$str = substr $str, 0, index($str, "\n");
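One thing to watch with the substr/index version: if the string has no newline at all, index returns -1, and substr with a length of -1 drops the last character (so "foo" would become "fo"). A small guarded sketch, where first_line is a hypothetical helper name rather than anything built in:

```perl
use strict;
use warnings;

# Hypothetical helper: return everything before the first newline,
# or the whole string when it contains no newline.
sub first_line {
    my ($str) = @_;
    my $pos = index($str, "\n");          # -1 when there is no newline
    return $pos < 0 ? $str : substr($str, 0, $pos);
}

my @results = map { first_line($_) } ("foo\nbar\nbaz", "foo", "\nbar");
print "[$_]\n" for @results;
```

The brackets in the output make the empty result for the third input visible.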
And, I'm not recommending this, but sometimes I'll open a file handle on a reference to a scalar so I can read its contents line by line:
open my $string_fh, '<', \ $str;
my $line = <$string_fh>;
close $string_fh;
say $line;

Related

Perl regex multiline match without dot

There are numerous questions on how to do a multiline regex in Perl. Most of them mention the s switch that makes a dot match a newline. However, I want to match an exact phrase (so, not a pattern) and I don't know where the newlines will be. So the question is: can you ignore newlines, instead of matching them with .?
MWE:
$pattern = "Match this exact phrase across newlines";
$text1 = "Match\nthis exact\nphrase across newlines";
$text2 = "Match this\nexact phra\nse across\nnewlines";
$text3 = "Keep any newlines\nMatch this exact\nphrase across newlines\noutside\nof the match";
$text1 =~ s/$pattern/replacement text/s;
$text2 =~ s/$pattern/replacement text/s;
$text3 =~ s/$pattern/replacement text/s;
print "$text1\n---\n$text2\n---\n$text3\n";
I can put dots in the pattern instead of spaces ("Match.this.exact.phrase") but that does not work for the second example. I can delete all newlines as preprocessing but I would like to keep newlines that are not part of the match (as in the third example).
Desired output:
replacement text
---
replacement text
---
Keep any newlines
replacement text
outside
of the match
Just replace the literal spaces with a character class that matches a space or a newline:
$pattern = "Match[ \n]this[ \n]exact[ \n]phrase[ \n]across[ \n]newlines";
Or, if you want to be more lenient, use \s or \s+ instead, since \s also matches newlines.
Most of the time, you are treating newlines as spaces. If that's all you wanted to do, all you'd need is
$text =~ s/\n/ /g;
$text =~ /\Q$text_to_find/ # or $text =~ /$regex_pattern_to_match/
Then there's the one time you want to ignore it. If that's all you wanted to do, all you'd need is
$text =~ s/\n//g;
$text =~ /\Q$text_to_find/ # or $text =~ /$regex_pattern_to_match/
Doing both is next to impossible if you have a regex pattern to match. But you seem to want to match literal text, so that opens up some possibilities.
( my $pattern = $text_to_find )
=~ s/(.)/ $1 eq " " ? "[ \\n]" : "\\n?" . quotemeta($1) /seg;
$pattern =~ s/^\\n\?//;
$text =~ /$pattern/
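As a quick check, the pattern-building code above can be run against the three sample texts from the question. This is only a sketch: the variable names ($phrase, @texts, @results) are mine, and the pattern is used in a substitution instead of a plain match.

```perl
use strict;
use warnings;

my $phrase = "Match this exact phrase across newlines";

# Build the pattern one character at a time: a space may be a space or
# a newline, and any other character may be preceded by a stray newline.
(my $pattern = $phrase)
    =~ s/(.)/ $1 eq " " ? "[ \\n]" : "\\n?" . quotemeta($1) /seg;
$pattern =~ s/^\\n\?//;   # no optional newline before the first character

my @texts = (
    "Match\nthis exact\nphrase across newlines",
    "Match this\nexact phra\nse across\nnewlines",
    "Keep any newlines\nMatch this exact\nphrase across newlines\noutside\nof the match",
);

my @results;
for my $text (@texts) {
    (my $copy = $text) =~ s/$pattern/replacement text/;
    push @results, $copy;
}

print join("\n---\n", @results), "\n";
```

The output matches the desired output in the question, with the newlines outside the match in the third text left alone.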
It sounds like you want to change your "exact" pattern to match newlines anywhere, and also to allow newlines instead of spaces. So change your pattern to do so:
$pattern = "Match this exact phrase across newlines";
$pattern =~ s/\S\K\B/\n?/g;
$pattern =~ s/ /[ \n]/g;
It certainly is ugly, but it works:
M\n?a\n?t\n?c\n?h\st\n?h\n?i\n?s\se\n?x\n?a\n?c\n?t\sp\n?h\n?r\n?a\n?s\n?e\sa\n?c\n?r\n?o\n?s\n?s\sn\n?e\n?w\n?l\n?i\n?n\n?e\n?s
For every pair of letters inside a word, allow a newline between them with \n?. And replace each space in your regex with \s.
May not be usable, but it gets the job done ;)
Check it out at regex101.

How to replace stuff in a Perl regex

I have a string $text and want to modify it with a regex. The string contains multiple sections like <NAME>John</NAME>.
I want to search for those sections, which I would normally do with something like
$text =~ m/<NAME>(.*?)<\/NAME>/g
but then make sure that there are no leading and trailing blanks and no leading non-word characters, which I would normally ensure with something like
$temp =~ s/^\s+|\s+$//g; # trim leading and trailing whitespaces
$temp =~ s/^\W*//g; # remove all leading non-word chars
Now my question is: How do I actually make this happen? Is it possible to use a s/// regex instead of the m//?
This is possible in a single substitution, but it's unnecessarily complex. I suggest you do a two-tier substitution using an executable replacement.
my $text = '<NAME> %^John^%
</NAME>';
$text =~ s{ (?<=<NAME>) ([^<>]*) (?=</NAME>) }{
(my $new = $1) =~ s/\A\s+|\s+\z//g;
$new =~ s/\A\W+//;
$new;
}egx;
print $text;
output
<NAME>John^%</NAME>
This is even simpler if you have version 14 or later of Perl 5, and want to use the non-destructive ( /r modifier) substitution mode.
$text =~ s{ (?<=<NAME>) ([^<>]*) (?=</NAME>) }{ $1 =~ s/\A\s+|\s+\z//gr =~ s/\A\W+//r }exg;
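Since the question mentions several sections in one string, here is a small sketch of the two-step version applied to a string with two of them. The sample input is made up for illustration. The /x flag is what allows the spaces inside the pattern, and /g repeats the substitution for every section.

```perl
use strict;
use warnings;

# Made-up sample input with two sections that need cleaning.
my $text = "<NAME>  --John </NAME> and <NAME>\t*Mary\n</NAME>";

$text =~ s{ (?<=<NAME>) ([^<>]*) (?=</NAME>) }{
    my $name = $1;
    $name =~ s/\A\s+|\s+\z//g;   # trim leading and trailing whitespace
    $name =~ s/\A\W+//;          # drop leading non-word characters
    $name;
}egx;

print "$text\n";
```

This prints <NAME>John</NAME> and <NAME>Mary</NAME>.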
If I understand correctly, what you want to do is merely "clean up" the text inside the tag (insofar as it's possible to "parse" XML using regular expressions). This should do the trick:
$text =~ s/(<NAME>)\s*\W*(.*?)\s*(<\/NAME>)/$1$2$3/sgi;

Perl simple regex uppercase words separated by underscore

Suppose I have a string like print_this_text_in_camel_case and I want to uppercase the first letter of the first word and of every word after an underscore, so the result will be Print_This_Text_In_Camel_Case. The test below does not work on the first word.
#!/usr/bin/perl
my $str = "print_this_text_in_camel_case";
$str =~ s/(_.)/uc($1)/ge;
print $str, "\n";
Just modify the regex to match the first char as well:
#!/usr/bin/perl
my $str = "print_this_text_in_camel_case";
$str =~ s/(_.|^.)/uc($1)/ge;
print $str, "\n";
will print out:
Print_This_Text_In_Camel_Case
You need to add a beginning-of-string anchor as an alternative to the underscore.
For Perl 5.10+, I'd use a \K (keep) escape to emulate variable-width look-behind and only uppercase the letter. I'd also use \U to perform the uppercase in the replacement text instead of uc and the /e (eval) modifier.
$str =~ s/(?:^|_)\K(.)/\U$1/g;
If you're using an older version of Perl (without \K) you could do it this way:
$str =~ s/(^|_)(.)/$1\U$2/g;
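A short sketch that runs both variants on the string from the question, to confirm they agree. The variable names are only illustrative.

```perl
use strict;
use warnings;

my $input = "print_this_text_in_camel_case";

# Perl 5.10+: \K discards the "^ or _" part from the match, so only the
# following character is replaced by its uppercase form.
(my $with_keep = $input) =~ s/(?:^|_)\K(.)/\U$1/g;

# Older Perls: capture the prefix and put it back unchanged.
(my $with_groups = $input) =~ s/(^|_)(.)/$1\U$2/g;

print "$with_keep\n$with_groups\n";
```

Both lines of output read Print_This_Text_In_Camel_Case.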
Another alternative is using split and join instead of a regex:
$str = join '_', map { ucfirst } split /_/, $str;
It is tidiest to use a negative look-behind. This code fragment upper-cases all letters that aren't preceded by a letter.
my $str = "print_this_text_in_camel_case";
$str =~ s/ (?<!\p{alpha}) (\p{alpha}) /uc $1/xgei;
print $str, "\n";
output
Print_This_Text_In_Camel_Case
If you prefer, or if you have a very old copy of Perl that doesn't support Unicode properties, you can use [a-z] instead of \p{alpha}, like this
$str =~ s/ (?<![a-z]) ([a-z]) /uc $1/xige;
which produces the same result.
You could also use ucfirst
use feature 'say';
my $str = "print_this_text_in_camel_case";
my @split = map(ucfirst, (split/(_)/, $str));
say @split;

Perl substitute for white space substitutes at newline also

I have a text file, its content as follows:
a b
c
and I use the Perl code below to substitute an underscore '_' char wherever a space char appears in the input line:
while (<>) {
$_ =~ s/\s/_/;
print $_;
}
and I get output like this:
a_b
c_
So my question is: why does Perl substitute an underscore for the newline '\n' char too, as the output for the input line containing 'c' shows?
When I use chomp in the code it works as expected.
\s matches any whitespace char, including the newline at the end of each line ([ \t\r\n\f]), so use a literal space if you want to replace only plain spaces
$_ =~ s/ /_/g;
# or just
s/ /_/g;
Translation could also be used for such simple substitutions, eg. tr/ /_/;
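A minimal sketch contrasting the two behaviours on the lines from the question. The lines are hard-coded here instead of being read from a file, and the array names are only illustrative.

```perl
use strict;
use warnings;

my (@any_whitespace, @spaces_only);

for my $line ("a b\n", "c\n") {
    # \s also matches the trailing newline, so it is replaced as well.
    (my $with_s = $line) =~ s/\s/_/g;
    push @any_whitespace, $with_s;

    # tr/ /_/ touches plain spaces only and leaves the newline alone.
    (my $with_tr = $line) =~ tr/ /_/;
    push @spaces_only, $with_tr;
}

print @spaces_only;
```

With \s and /g, the two lines become a_b_ and c_ with no newlines left. The tr version prints a_b and c on separate lines.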

Perl ignore whitespace on replacement side of regular expression substitution

Suppose I have $str = "onetwo".
I would like to write a regex substitution command that ignores whitespace (which makes it more readable):
$str =~ s/
one
two
/
three
four
/x
Instead of "threefour", this produces "\nthree\nfour\n" (where \n is a newline). Basically the /x option ignores whitespace for the matching side of the substitution but not the replacement side. How can I ignore whitespace on the replacement side as well?
s{...}{...} is basically s{...}{qq{...}}e. If you don't want qq{...}, you'll need to replace it with something else.
s/
one
two
/
'three' .
'four'
/ex
Or even:
s/
one
two
/
clean('
three
four
')
/ex
A possible implementation of clean:
sub clean {
my ($s) = @_;
$s =~ s/^[ \t]+//mg;
$s =~ s/^\s+//;
$s =~ s/\s+\z//;
return $s;
}
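To see what this clean actually returns, here is a small sketch that runs it inside the substitution. The indentation in the replacement is my own layout choice. Note that clean strips the indentation and the outer whitespace but keeps the newline between the two words.

```perl
use strict;
use warnings;

# clean() as given above, with comments added.
sub clean {
    my ($s) = @_;
    $s =~ s/^[ \t]+//mg;   # strip indentation at the start of every line
    $s =~ s/^\s+//;        # strip leading whitespace, including the first newline
    $s =~ s/\s+\z//;       # strip trailing whitespace
    return $s;
}

my $str = "onetwo";
$str =~ s/
    one
    two
/
    clean('
        three
        four
    ')
/ex;

(my $visible = $str) =~ s/\n/\\n/g;   # show newlines as a literal \n
print "$visible\n";
```

This prints three\nfour. If the goal is exactly "threefour", the concatenation form shown earlier gives that, or clean could delete every whitespace run with s/\s+//g (an assumption about the intent, not part of the original answer).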