Regex pattern matching for ; and / on multiple lines

I just started learning Perl today, and I am working with regular expressions to match text from within a file.
I am checking to see whether my file contains the following:
/
;
This is what I have attempted so far:
if ($Text =~ /;\n//)
{
//dostuff
}
Is this syntax correct? Do I need to use the \n or is there a character for end of line? Also, can I search for / or do I need some sort of escape character?

To search for / you have to escape it with \.
To check your example however, you need to turn it around because you want to match the / before the ;.
In the end it should look like this: if ($Text =~ /\/\n;/)
For more information, see perlretut to get an introduction
to regular expressions in Perl.
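A minimal sketch of that check, assuming the file contents are already in a scalar. The sub name has_slash_then_semicolon is made up for illustration. Using m{...} as the delimiter is an alternative to escaping, since the slash is then an ordinary character:

```perl
use strict;
use warnings;

# Hypothetical helper: true if a "/" at the end of one line is
# immediately followed by ";" at the start of the next line.
sub has_slash_then_semicolon {
    my ($text) = @_;
    # With braces as delimiters, "/" needs no backslash.
    return $text =~ m{/\n;} ? 1 : 0;
}

print has_slash_then_semicolon("foo /\n; bar") ? "match\n" : "no match\n";
print has_slash_then_semicolon("foo ;\n/ bar") ? "match\n" : "no match\n";
```

The first call matches and the second does not, because the pattern is order-sensitive.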

Use a backslash to escape your forward slash. Also why are you trying to match it in reverse order? Try $Text =~ /\/\n;/.

Related

How to escape a string that looks like a regular expression in Perl

I have a script that, among other things, searches a list of text files to replace a Windows path (text string) with another path.
The problem is that some of the folder names begin with a number and a dash. Perl seems to think that I am trying to invoke a regular expression here. I get the message, "Reference to nonexistent group in regex".
the string looks like this:
\\\BAGlobal\6-Engineering\3-Tech
I have quoted it like this:
my $find = "\\\\\\\BAGlobal\\\6-Engineering\\\3-Tech"
How do I escape the 6- and 3- ?
The problem is not the dash in 6- but all the backslashes \.
It thinks that \3 and \6 are back-references to previously matched groups, like /foo(bar) foo\1/ would match the string foobar foobar.
If you use this in a pattern match you need to either include \Q and \E to add quoting, or apply the quotemeta built-in to your $find.
my $find = '\\\\\\\BAGlobal\\\6-Engineering\\\3-Tech';
$string =~ m/\Q$find\E/;
Or with quotemeta.
my $find = quotemeta '\\\\\\\BAGlobal\\\6-Engineering\\\3-Tech';
$string =~ m/$find/;
Also see perlre.
Note that your example code is probably wrong. You have an odd number of backslashes there, and double quotes "" interpolate, so each pair of backslashes \\ turns into one actual backslash in the string. Because you have 7 of them, the last one is seen as the escape for B, turning that into \B, which is not a valid escape sequence. I used single quotes '' in my code above.
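Here is a small, self-contained sketch of the \Q...\E approach. The path, the replacement, and the sub name replace_path are assumptions chosen for illustration, not the asker's real data:

```perl
use strict;
use warnings;

# Hypothetical helper: replace every literal occurrence of $find in $line.
sub replace_path {
    my ($line, $find, $replace) = @_;
    # \Q...\E quotes every metacharacter in $find, so \6 and \3 are
    # treated as a backslash plus a digit, not as backreferences.
    $line =~ s/\Q$find\E/$replace/g;
    return $line;
}

# In single quotes only \\ and \' are special, so '\\\\' is two backslashes
# and '\6' stays as a backslash followed by 6.
my $find = '\\\\BAGlobal\6-Engineering\3-Tech';
my $line = 'copy \\\\BAGlobal\6-Engineering\3-Tech\file.txt';

print replace_path($line, $find, 'D:\new'), "\n";
```

The replacement side of s/// is not re-scanned for regex syntax, so backslashes inside $replace need no extra quoting.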

Problems with perl regex

I need a Perl regex to match A.CC3 on a line beginning with something, followed by anything, then my 'A.CC3', and then anything...
I am surprised this (text =~ /^\W+\CC.*\A\.CC\[3].*/) is not working
Thanks
\A is an escape sequence that anchors the match at the beginning of the string, much like the ^ at the start of your regex. Remove the backslash to make it match a literal A.
Edit: You also seem to have \C in there. You should only use backslash to escape meta characters such as period ., or to create escape sequences, such as \Q .. \E.
At its simplest, a regex to match A.CC3 would be
$text =~ /A\.CC3/
That's all you need. This will match any string with A.CC3 in it. In the comments you mention the string you are matching is this:
my $text = "//%CC Unused Static Globals, A.CC3, Halstead Progam Volume";
You might want to avoid partial matches, in which case you can use word boundary \b
$text =~ /\bA\.CC3\b/
You might require that a line begins with //%
$text =~ m#^//%.*\bA\.CC3\b#
Of course, only you know which parts of the string should be matched and in what way. "Something followed by anything followed by A.CC3 followed by anything" really just needs the first simple regex.
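To see how the variants above differ, here is a quick sketch that runs them against the sample string. The sub names are invented for the example:

```perl
use strict;
use warnings;

# Hypothetical helper: A.CC3 as a whole token, anywhere in the string.
sub contains_metric {
    my ($s) = @_;
    return $s =~ /\bA\.CC3\b/ ? 1 : 0;
}

# Hypothetical helper: same token, but only on lines starting with //%.
sub is_tagged_line {
    my ($s) = @_;
    return $s =~ m#^//%.*\bA\.CC3\b# ? 1 : 0;
}

my $text = "//%CC Unused Static Globals, A.CC3, Halstead Progam Volume";

print "token found\n"  if contains_metric($text);
print "tagged line\n"  if is_tagged_line($text);
print "no partial\n"   if !contains_metric("A.CC31");
```

The word boundaries are what reject A.CC31, and the escaped period is what rejects something like AxCC3.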
It doesn't seem like you're trying to capture anything. If that's the case, and all you need to do is find lines that contain A.CC3 then you can simply do
if ( index( $str, 'A.CC3' ) >= 0 ) # Found it...
No need for a regex.
Try to give this a shot:
^.*?A\.CC.*$
That will match anything until it reaches A, then a literal ., followed by CC, then anything until end of string.
It depends what you want to match. If you want to pull back the whole line in which the A.CC3 pattern occurs then something like this should work:
^.*A\.CC3.*$

Perl: remove a part of string after pattern

I have strings like this:
trn_425374_1_94_-
trn_12_1_200_+
trn_2003_2_198_+
And I want to strip everything after the first number, like this:
trn_425374
trn_12
trn_2003
I tried the following code:
$string =~ s/(?<=trn_\d)\d+//gi;
But it returns the same as the input. I have been following examples from similar questions but I don't know what I'm doing wrong. Any suggestions?
If you are running Perl 5 version 10 or later then you have access to the \K ("keep") regular expression escape. Everything before the \K is excluded from the substitution, so this removes everything after the first sequence of digits (except newlines)
s/\d+\K.+//;
With earlier versions of Perl, you will have to capture the part of the string you want to keep and put it back in the substitution:
s/(\D*\d+).+/$1/;
Note that neither of these will remove any trailing newline characters. If you want to strip those as well, then either chomp the string first, or add the /s modifier to the substitution, like this
s/\d+\K.+//s;
or
s/(\D*\d+).+/$1/s;
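As a quick check, both forms can be wrapped in small subs and run over the sample strings from the question. The sub names are made up for this sketch:

```perl
use strict;
use warnings;

# Hypothetical helper using \K (requires Perl 5.10 or later).
sub trim_with_keep {
    my ($s) = @_;
    $s =~ s/\d+\K.+//;
    return $s;
}

# Hypothetical helper using a capture group (works on older Perls too).
sub trim_with_capture {
    my ($s) = @_;
    $s =~ s/(\D*\d+).+/$1/;
    return $s;
}

for my $s (qw( trn_425374_1_94_- trn_12_1_200_+ trn_2003_2_198_+ )) {
    print trim_with_keep($s), " ", trim_with_capture($s), "\n";
}
```

Both subs leave a trailing newline in place, because . does not match a newline without the /s modifier.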
Use a capture group to save the first run of digits, and use .* to delete from there to the end of the line:
#!/usr/bin/env perl
use warnings;
use strict;
while ( <DATA> ) {
s/(\d+).*$/$1/ && print;
}
__DATA__
trn_425374_1_94_-
trn_12_1_200_+
trn_2003_2_198_+
It yields:
trn_425374
trn_12
trn_2003
Your regex should be:
$string =~ s/(trn_\d+).*/$1/g;
It replaces the whole match with the text captured in $1 (the part of the string you want to preserve).
Use \K to preserve the part of the string you want to keep:
$string =~ s/trn_\d+\K.*//;
To quote the link above:
\K
This appeared in perl 5.10.0. Anything matched left of \K is not
included in $& , and will not be replaced if the pattern is used in a
substitution.

How can I extract a substring enclosed in double quotes in Perl?

I'm new to Perl and regular expressions and I am having a hard time extracting a string enclosed by double quotes. Like for example,
"Stackoverflow is
awesome"
Before I extract the string, I want to check whether the closing quote is at the very end of the whole text held in the variable:
if($wholeText =~ /\"$/) #check the last character if " which is the end of the string
{
$wholeText =~ s/\"(.*)\"/$1/; #extract the string, removed the quotes
}
My code didn't work; it is not getting inside of the if condition.
You need to do:
if($wholeText =~ /"$/)
{
$wholeText =~ s/"(.*?)"/$1/s;
}
. doesn't match newlines unless you apply the /s modifier.
There's no need to escape the quotes like you're doing.
The above poster who recommended using the "m" flag in the regular expression is correct, however the regex provided won't quite work. When you say:
$wholeText =~ s/\"(.*)\"/$1/m; #extract the string, removed the quotes
...the regular expression is too "greedy", which means the (.*) part will gobble up too much of the text. If you have a sample like this:
"The quick brown fox," he said, "jumped over the lazy dog."
...then the above regex will capture everything from "The" through "dog.", which is probably not what you intend. There are two ways to make the regex less greedy. Which one is better has everything to do with how you choose to handle extra " marks inside your string.
One:
$wholeText =~ s/\"([^"]*)\"/$1/m;
Two:
$wholeText =~ s/\"(.*?)\"/$1/m;
In One, the regex says "start with quote, then find everything that is not a quote and remember it, until you see another quote." In Two, the regex says "Start with quote, then find everything until you find another quote." The extra ? inside the ( ) tells the regex processor to not be greedy. Without considering quote escaping within the string, both regular expressions should behave the same.
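The difference is easy to see by capturing instead of substituting. This is only an illustrative sketch; the three sub names are invented here:

```perl
use strict;
use warnings;

# Greedy: .* runs to the LAST quote in the string.
sub first_quoted_greedy {
    my ($s) = @_;
    return $s =~ /"(.*)"/ ? $1 : undef;
}

# Non-greedy: .*? stops at the first closing quote it can.
sub first_quoted_lazy {
    my ($s) = @_;
    return $s =~ /"(.*?)"/ ? $1 : undef;
}

# Negated class: [^"]* can never cross a quote at all.
sub first_quoted_negated {
    my ($s) = @_;
    return $s =~ /"([^"]*)"/ ? $1 : undef;
}

my $sample = '"The quick brown fox," he said, "jumped over the lazy dog."';
print first_quoted_greedy($sample),  "\n";
print first_quoted_lazy($sample),    "\n";
print first_quoted_negated($sample), "\n";
```

One practical difference: [^"] also matches newlines, while . only does so under /s.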
By the way, this is a classic problem when parsing a CSV ("Comma Separated Values") file, so looking up some references on that may help you out.
If you want to anchor a match to the very end of the string (not line, entire string), use the \z anchor:
if( $wholeText =~ /"\z/ ) { ... }
You don't need a guard condition for this. Just use the right regex in the substitution. If it doesn't match the regex, nothing happens:
$wholeText =~ s/"(.*?)"\z/$1/s;
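The practical difference between $ and \z shows up when the string ends in a newline: $ can match just before a trailing newline, while \z only matches at the true end. A small sketch, with sub names invented for the example:

```perl
use strict;
use warnings;

# Hypothetical helper: quote at end of string, or just before a final newline.
sub ends_with_quote_dollar {
    my ($s) = @_;
    return $s =~ /"$/ ? 1 : 0;
}

# Hypothetical helper: quote must be the very last character.
sub ends_with_quote_z {
    my ($s) = @_;
    return $s =~ /"\z/ ? 1 : 0;
}

for my $s ("say \"hi\"", "say \"hi\"\n") {
    print ends_with_quote_dollar($s), " ", ends_with_quote_z($s), "\n";
}
```

If the text was read from a file and not chomped, this is often the reason a \z test fails while a $ test passes.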
I think you really have a different question though. Why are you trying to anchor it to the end of the string? What problems are you trying to avoid?
For multi-line strings, you need to include the 'm' modifier with the search pattern.
if ($wholeText =~ m/\"$/m) # First m for match operator; second multi-line modifier
{
$wholeText =~ s/\"(.*?)\"/$1/s; #extract the string, removed the quotes
}
You will also need to consider whether you allow double quotes inside the string and if so, which convention to use. The primary ones are backslash and double quote (also backslash backslash), or double quote double quote in the string. These slightly complicate your regex.
The answer by @chaos uses 's' as a multi-line modifier. There's a small difference between the two:
m
Treat string as multiple lines. That is, change "^" and "$" from matching the start or end of the string to matching the start or end of any line anywhere within the string.
s
Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.
Used together, as /ms, they let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string.
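A short sketch of the two modifiers in isolation, using a made-up two-line string and invented sub names:

```perl
use strict;
use warnings;

# /m changes what ^ and $ mean; it does not affect "."
sub starts_a_line_plain { return $_[0] =~ /^second/   ? 1 : 0 }
sub starts_a_line_m     { return $_[0] =~ /^second/m  ? 1 : 0 }

# /s changes what "." means; it does not affect ^ and $
sub dot_crosses_plain   { return $_[0] =~ /line.second/  ? 1 : 0 }
sub dot_crosses_s       { return $_[0] =~ /line.second/s ? 1 : 0 }

my $text = "first line\nsecond line\n";

print starts_a_line_plain($text), starts_a_line_m($text), "\n";
print dot_crosses_plain($text),   dot_crosses_s($text),   "\n";
```

Without /m, ^ only matches at the start of the string, so "second" is not found there. Without /s, the . refuses to match the newline between the two lines.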
Assuming you have a single substring in quotes, this will extract it:
s/.*"(.*?)".*/$1/
And the answer above (s/"(.*?)"/$1/s) will just remove quotes.
Test code:
my $text = "no \"need this\" again, no\n";
my $text2 = $text;
print $text;
$text2 =~ s/.*\"(.*?)\".*/$1/;
print $text2;
$text =~ s/"(.*?)"/$1/s;
print $text;
Output:
no "need this" again, no
need this
no need this again, no

How can I preserve whitespace when I match and replace several words in Perl?

Let's say I have some original text:
here is some text that has a substring that I'm interested in embedded in it.
I need the text to match a part of it, say: "has a substring".
However, the original text and the matching string may have whitespace differences. For example the match text might be:
has a
substring
or
has a substring
and/or the original text might be:
here is some
text that has
a substring that I'm interested in embedded in it.
What I need my program to output is:
here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.
I also need to preserve the whitespace pattern in the original and just add the start and end markers to it.
Any ideas about a way of using Perl regexes to get this to happen? I tried, but ended up getting horribly confused.
Been some time since I've used perl regular expressions, but what about:
$match =~ s/(has\s+a\s+substring)/[$1]/ig
This would match one or more whitespace characters, including newlines, between the words. It will wrap the entire match with brackets while maintaining the original separation. It ain't automatic, but it does work.
You could play games with this, like taking the string "has a substring" and doing a transform on it to make it "has\s+a\s+substring" to make this a little less painful.
EDIT: Incorporated ysth's comments that the \s metacharacter matches newlines and hobbs corrections to my \s usage.
This pattern will match the string that you're looking to find:
(has\s+a\s+substring)
So, when the user enters a search string, replace any whitespace in the search string with \s+ and you have your pattern. Then, just replace every match with [match starts here]$1[match ends here] where $1 is the matched text.
In regexes, you can use + to mean "one or more." So something like this
/has\s+a\s+substring/
matches has followed by one or more whitespace chars, followed by a followed by one or more whitespace chars, followed by substring.
Putting it together with a substitution operator, you can say:
my $str = "here is some text that has a substring that I'm interested in embedded in it.";
$str =~ s/(has\s+a\s+substring)/\[match starts here]$1\[match ends here]/gs;
print $str;
And the output is:
here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.
As many have suggested, use \s+ to match whitespace. Here is how you do it automatically:
my $original = "here is some text that has a substring that I'm interested in embedded in it.";
my $search = "has a\nsubstring";
my $re = $search;
$re =~ s/\s+/\\s+/g;
$original =~ s/\b($re)\b/[match starts here]$1\[match ends here]/g;
print $original;
Output:
here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.
You might want to escape any meta-characters in the string. If someone is interested, I could add it.
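For what it's worth, one way to do that escaping is to split the search string on whitespace, run quotemeta on each word, and join the pieces with \s+. This is only a sketch; the sub name mark_match is made up, and the \b anchors are left out because they misbehave when the search text starts or ends with punctuation:

```perl
use strict;
use warnings;

# Hypothetical helper: wrap every whitespace-insensitive occurrence of
# $search in $original with start/end markers, keeping original spacing.
sub mark_match {
    my ($original, $search) = @_;
    # Escape each word on its own, then allow any run of whitespace between.
    my $re = join '\s+', map { quotemeta } split ' ', $search;
    # \[ keeps Perl from reading "$1[" as an array element.
    $original =~ s/($re)/[match starts here]$1\[match ends here]/g;
    return $original;
}

my $original = "here is some\ntext that has\na substring that I'm interested in.";
print mark_match($original, "has a\nsubstring"), "\n";
```

An empty or all-whitespace search string yields an empty pattern, which matches everywhere, so a real script would want to reject that case first.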
This is an example of how you could do that.
#! /opt/perl/bin/perl
use strict;
use warnings;
my $submatch = "has a\nsubstring";
my $str = "
here is some
text that has
a substring that I'm interested in, embedded in it.
";
print substr_match($str, $submatch), "\n";
sub substr_match{
my($string,$match) = @_;
$match =~ s/\s+/\\s+/g;
# This isn't safe the way it is now, you will need to sanitize $match
$string =~ /\b$match\b/;
}
This currently does nothing to check the $match variable for unsafe characters.