How can I extract a substring enclosed in double quotes in Perl? - regex

I'm new to Perl and regular expressions and I am having a hard time extracting a string enclosed by double quotes. Like for example,
"Stackoverflow is
awesome"
Before I extract the string, I want to check whether the closing quote is at the very end of the text, which is stored in a variable:
if($wholeText =~ /\"$/) #check the last character if " which is the end of the string
{
$wholeText =~ s/\"(.*)\"/$1/; #extract the string, removed the quotes
}
My code didn't work; it is not getting inside of the if condition.

You need to do:
if($wholeText =~ /"$/)
{
$wholeText =~ s/"(.*?)"/$1/s;
}
. doesn't match newlines unless you apply the /s modifier.
There's no need to escape the quotes like you're doing.
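As a minimal sketch of the fix above, assuming the whole text is exactly the two-line quoted string from the question (the sample value is an assumption):

```perl
use strict;
use warnings;

# Hypothetical sample: a quoted string that spans two lines.
my $wholeText = qq{"Stackoverflow is\nawesome"};

# /"$/ checks that the text ends with a quote;
# the /s modifier lets "." match the embedded newline.
if ($wholeText =~ /"$/) {
    $wholeText =~ s/"(.*?)"/$1/s;
}

print "$wholeText\n";
```

Without the /s modifier the substitution would leave this particular text unchanged, because the only pair of quotes is separated by a newline.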

The above poster who recommended using the "m" flag in the regular expression is correct, however the regex provided won't quite work. When you say:
$wholeText =~ s/\"(.*)\"/$1/m; #extract the string, removed the quotes
...the regular expression is too "greedy", which means the (.*) part will gobble up too much of the text. If you have a sample like this:
"The quick brown fox," he said, "jumped over the lazy dog."
...then the above regex will capture everything from "The" through "dog.", which is probably not what you intend. There are two ways to make the regex less greedy. Which one is better has everything to do with how you choose to handle extra " marks inside your string.
One:
$wholeText =~ s/\"([^"]*)\"/$1/m;
Two:
$wholeText =~ s/\"(.*?)\"/$1/m;
In One, the regex says "start with quote, then find everything that is not a quote and remember it, until you see another quote." In Two, the regex says "Start with quote, then find everything until you find another quote." The extra ? inside the ( ) tells the regex processor to not be greedy. Without considering quote escaping within the string, both regular expressions behave the same on single-line text. They differ when the quoted text contains a newline: [^"] matches a newline, but . does not unless you add the /s modifier (the /m modifier only changes what ^ and $ match).
By the way, this is a classic problem when parsing a CSV ("Comma Separated Values") file, so looking up some references on that may help you out.
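To see the difference on the sample sentence, here is a small sketch that captures with each of the three patterns instead of substituting (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $sample = q{"The quick brown fox," he said, "jumped over the lazy dog."};

my ($greedy)  = $sample =~ /"(.*)"/;      # runs on to the last quote
my ($negated) = $sample =~ /"([^"]*)"/;   # stops at the next quote
my ($lazy)    = $sample =~ /"(.*?)"/;     # also stops at the next quote

print "greedy:  $greedy\n";
print "negated: $negated\n";
print "lazy:    $lazy\n";
```

The greedy capture spans both quotations, while the other two return only the first one.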

If you want to anchor a match to the very end of the string (not line, entire string), use the \z anchor:
if( $wholeText =~ /"\z/ ) { ... }
You don't need a guard condition for this. Just use the right regex in the substitution. If it doesn't match the regex, nothing happens:
$wholeText =~ s/"(.*?)"\z/$1/s;
I think you really have a different question though. Why are you trying to anchor it to the end of the string? What problems are you trying to avoid?
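A short sketch of how $ and \z differ, and of the substitution acting as its own guard. The two input strings are assumptions chosen for the demo:

```perl
use strict;
use warnings;

# Hypothetical inputs: one ends in a quote, the other has a trailing newline.
my $plain        = q{say "hello"};
my $with_newline = qq{say "hello"\n};

# "$" also matches just before a final newline; "\z" matches only at the very end.
my $dollar_hit = $with_newline =~ /"$/  ? 1 : 0;
my $z_hit      = $with_newline =~ /"\z/ ? 1 : 0;
print "dollar=$dollar_hit z=$z_hit\n";

# No guard needed: the substitution does nothing when the pattern does not match.
(my $stripped  = $plain)        =~ s/"(.*?)"\z/$1/s;
(my $untouched = $with_newline) =~ s/"(.*?)"\z/$1/s;
print "$stripped\n";
```

If the text may end with a newline that should be tolerated, \Z (capital) or plain $ would be the anchors to consider instead.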

For multi-line strings, you need to include the 'm' modifier with the search pattern.
if ($wholeText =~ m/\"$/m) # First m for match operator; second multi-line modifier
{
$wholeText =~ s/\"(.*?)\"/$1/s; #extract the string, removed the quotes
}
You will also need to consider whether you allow double quotes inside the string and if so, which convention to use. The primary ones are backslash and double quote (also backslash backslash), or double quote double quote in the string. These slightly complicate your regex.
The answer by @chaos uses 's' as a multi-line modifier. There's a small difference between the two:
m
Treat string as multiple lines. That is, change "^" and "$" from matching the start or end of the string to matching the start or end of any line anywhere within the string.
s
Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.
Used together, as /ms, they let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string.
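The two modifiers can be checked separately with a small sketch like the following (both sample strings are made up for the demo):

```perl
use strict;
use warnings;

# /m changes where "$" may match.
my $lines = qq{first "line"\nsecond line};
my $no_m   = $lines =~ /"$/  ? 1 : 0;   # the quote is not at the end of the string
my $with_m = $lines =~ /"$/m ? 1 : 0;   # the quote is at the end of a line

# /s changes what "." may match.
my $quoted = qq{"two\nlines"};
my $no_s   = $quoted =~ /"(.*)"/  ? 1 : 0;   # "." stops at the newline
my $with_s = $quoted =~ /"(.*)"/s ? 1 : 0;   # "." crosses the newline

print "no_m=$no_m with_m=$with_m no_s=$no_s with_s=$with_s\n";
```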

Assuming you have a single substring in quotes, this will extract it:
s/.*"(.*?)".*/$1/
And the answer above (s/"(.*?)"/$1/s) will just remove quotes.
Test code:
my $text = "no \"need this\" again, no\n";
my $text2 = $text;
print $text;
$text2 =~ s/.*\"(.*?)\".*/$1/;
print $text2;
$text =~ s/"(.*?)"/$1/s;
print $text;
Output:
no "need this" again, no
need this
no need this again, no

Related

How to reconstruct regex matched part

I have to simplify some LaTeX math formulas within text, for example
This is ${\text{BaFe}}_{2}{\text{As}}_{2}$ crystal
I want to transform this into
This is BaFe2As2 crystal
That is, to concatenate only the content within the innermost brackets.
I figured out that I can use the regex pattern
\{[^\{\}]*\}
to match those innermost brackets. But the problem is how to concatenate them together.
I don't know if this can be done with a Notepad++ regex replacement. If Notepad++ is not capable of it, I can also accept a Perl one-liner solution.
There may clearly be multiple such equations (the markup between two $s) in the document. So while you need to assemble text between all {}, this also needs to be constrained within a $ pair. Then all such equations need to be processed.
Matching that in a single pattern results in a complex regex. Instead, we can first extract everything within a pair of $s and then gather text within {}s from that, simplifying the regex a lot. This makes two passes over each equation, but a LaTeX document is small for computational purposes and the loss of efficiency can't be noticed.
use warnings;
use strict;
use feature 'say';
my $text = q(This is ${\text{BaFe}}_{2}{\text{As}}_{2}$ crystal,)
. q( and ${\text{Some}}{\mathbf{More}}$ text);
my @results;
while ($text =~ /\$(.*?)\$/g) {
    my $eq = $1;
    push @results, join('', $eq =~ /\{([^{}]+)\}/g);
}
say for @results;
This prints lines BaFe2As2 and SomeMore.
The regex in the while condition captures all chars between two $s. After the body of the loop executes and the condition is checked again, the regex continues searching the string from the position of the previous match. This is due to the "global" modifier /g in scalar context, imposed on regex since it is in the loop condition. Once there are no more matches the loop terminates.
In the body we match between {}, and again due to /g this is done for all {}s in the equation. Here, however, the regex is in the list context (as it is assigned to an array) and then /g makes it return all matches. They are joined into a string, which is added to the array.
In order to replace the processed equation, use this in a substitution instead
$text =~ s{ \$(.*?)\$ }{ join('', $1 =~ /\{([^{}]+)\}/g) }egx;
where the modifier e makes it so that the replacement part is evaluated as Perl code, and the result of that used to replace the matched part. Then in it we can run our regex to match content of all {} and join it into the string, as explained above. I use s{}{} delimiters, and x modifier so to be able to space things in the matching part as well.
Since the whole substitution has the g modifier the regex keeps going through $text, as long as there are equations to match, replacing them with what's evaluated in the replacement part.
I use a hard-coded string (extended) from the question, for an easy demo. In reality you'd read a file into a scalar variable ("slurp" it) and process that.
This relies on the question's premise that text of interest in an equation is cleanly between {}.
Missed the part that a one-liner is sought
perl -0777 -wnE'say join("", $1=~/\{([^{}]+)\}/g) while /\$(.*?)\$/g' file.tex
With -0777 the file is read whole ("slurped"), and as -n provides a loop over input lines it is in the $_ variable; the regex in the while condition works by default on $_. In each iteration of while, the contents of the captured equation, in $1, are directly matched for {}s.
Then to replace each equation and print out the whole processed file
perl -0777 -wne's{\$(.*?)\$}{join "", $1=~/\{([^{}]+)\}/g}eg; print' file.tex
where I've removed extra spaces and (unnecessary) parens on join.
Use this regex in Notepad++. I have tried to match everything which is NOT present between the innermost curly brackets and then replaced the match with a blank string.
[^{}]*\{|\}[^{}]*
Explanation:
[^{}]*\{ - matches 0+ occurrences of any character that is neither { nor } followed by {
| - OR
\}[^{}]* - matches } followed by 0+ occurrences of any character that is neither { nor }
UPDATE:
Try this updated regex:
\$?(?=[^$]*\$[^$]*$)(?:[^{}]*{|}[^{}]*)(?=[^$]*\$[^$]*$)\$?

How to replace single quote to two single quotes, but do nothing if two single quotes are next to each other in perl

I need to change all the single quote found in the string to two single quotes, if more than one single quotes are found successively, they should remain as it is.
e.g. str = abc'def''sdf'''asdf
output should be : str = abc''def''sdf'''asdf
I think the cleanest way is to search for the following pattern:
(?<!')'(?!')
and then replace that with two single quotes. The pattern searches for a single quote, but it has negative lookbehind and lookahead assertions which check that neither the preceding nor the following character is another single quote.
my $var = "abc'def''sdf'";
print "$var\n";
$var =~ s/(?<!')'(?!')/''/g;
print "$var\n";
Note that you could have also just written a straight pattern to match, e.g.
(^|[^'])'($|[^'])
But then the replacement becomes tricky because you would have consumed the characters surrounding the single quote. I don't like to do extra work if I don't have to.
Output:
abc'def''sdf'
abc''def''sdf''
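For completeness, a sketch that runs the same substitution on the exact string from the question, which also contains a run of three quotes:

```perl
use strict;
use warnings;

my $str = "abc'def''sdf'''asdf";

# Copy first, then substitute on the copy so the input stays intact.
(my $out = $str) =~ s/(?<!')'(?!')/''/g;

print "$str\n";
print "$out\n";
```

Only the lone quote between abc and def is doubled; the runs of two and three quotes are left as they are.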

Can regex match all the words outside quotation marks?

I recently typed an essay for my lit class, and my teacher specifically stated a word limit that does not include quotations from the piece. And I thought, why not make a script that calculates that for you? I could, of course, do this the boring way by going through the whole text and ignoring the words inside quotation marks, but I have a feeling that there's a neater way using Regex and Array.count. As I know next to nothing about Regex, can someone help me/tell me that it's impossible with Regex?
Tl;dr: use Regex to match all words (or spaces, doesn't matter) that are outside quotation marks from a text, and count the items in the resulting array.
Depending on the requirements, could use The Greatest Regex Trick Ever
"[^"]*"|(\w+)
And count the matches of the first capture group.
\w+ matches one or more word characters.
Also skip single quoted strings:
"[^"]*"|'[^']*'|(\w+)
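The question does not say which language the script is in, so here is a hedged sketch of the capture-group trick in Perl. The essay text is a made-up sample:

```perl
use strict;
use warnings;

# Hypothetical essay text with two quotations.
my $essay = q{He said "to be or not to be" and left. She replied "never mind" quietly.};

my $count = 0;
while ($essay =~ /"[^"]*"|(\w+)/g) {
    # A quoted passage matches the first alternative and leaves $1 undefined;
    # only a word outside quotes sets the capture group.
    $count++ if defined $1;
}

print "$count words outside quotes\n";
```

The quoted passages are still matched, which is what keeps the words inside them from being seen by \w+, but they are not counted.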
This is easy enough using PCRE (or Perl of course):
".*?"(*SKIP)(?!)|(?<!\w)'.*?'(?!\w)(*SKIP)(?!)|[\w']+
Use the g modifier, and s if you want to handle multiline quotes.
Here's the x version for readability:
".*?" (*SKIP)(?!)
| (?<!\w)'.*?'(?!\w) (*SKIP)(?!)
| [\w']+
The first two alternatives match everything inside " or ' quotes and discard it ((*SKIP)(?!)). The last alternative matches all words (I've included ' as being part of a word in this example). The ' character is counted as a quote boundary only at the start or end of a word, so that words like isn't still work.
Possible modifications:
To count the text isn't as two words, replace [\w']+ with \w+.
To count text like mother-in-law as one word instead of 3, replace [\w']+ with [-\w']+.
You get the point ;)
And here's a full Perl script that uses this regex:
#!/usr/bin/env perl
use strict;
use warnings;
$_ = do { local $/; <> };
print scalar (() = /".*?"(*SKIP)(?!)|(?<!\w)'.*?'(?!\w)(*SKIP)(?!)|[\w']+/gs), "\n";
Execute it passing in a file or STDIN containing the text you want to count the words in, and it will output the word count on STDOUT.
A general solution would be pretty tough, since some works will have multi-paragraph quotes, where the first paragraph doesn't close the quote, but the second paragraph opens with a quotation mark. So matching quote marks document-wide would be hard.
On the other hand, you could maybe go paragraph-by-paragraph, and accumulate a non-quote word count for each paragraph. There would still be pathological cases that could break this (like a paragraph which includes a list of punctuation symbols, including a quotation mark), of course.
In Perl, assuming a getWordCount sub exists somewhere, and assuming you've somehow split your document into an array of paragraphs called @paragraphs, this might look like:
my $wordCount = 0;
foreach my $paragraph (@paragraphs) {
    $paragraph =~ s/".*?"//gs; # remove each quoted passage that has a closing quotation mark
    $paragraph =~ s/".*$//s; # remove an unclosed quotation that runs to the end of the paragraph
    $wordCount += getWordCount($paragraph);
}
print "There are $wordCount words outside of quotations, maybe!";
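A self-contained sketch of the paragraph-by-paragraph idea follows. The getWordCount body, the blank-line paragraph split, and the sample document are all assumptions made for the demo, not part of the original suggestion:

```perl
use strict;
use warnings;

# Hypothetical helper: counts runs of word characters.
sub getWordCount {
    my ($text) = @_;
    my @words = $text =~ /\w+/g;
    return scalar @words;
}

# Made-up document: the second paragraph opens a quote and never closes it.
my $document = qq{He said "stop" and left.\n\n"This quote runs on\n\nShe nodded.};

# Assumption: paragraphs are separated by blank lines.
my @paragraphs = split /\n\s*\n/, $document;

my $wordCount = 0;
for my $paragraph (@paragraphs) {
    (my $copy = $paragraph) =~ s/".*?"//gs;   # drop closed quotations
    $copy =~ s/".*\z//s;                      # drop a quotation left open
    $wordCount += getWordCount($copy);
}

print "There are $wordCount words outside of quotations, maybe!\n";
```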
It would work better this way:
Total Number of characters - Sum(characters inside quotes)
You can use this regex to find all "Quoted" strings: \"[^"]*\"

Problems with perl regex

I need a Perl regex to match A.CC3 on a line beginning with something, followed by anything, then my 'A.CC3', and then anything...
I am surprised this (text =~ /^\W+\CC.*\A\.CC\[3].*/) is not working
Thanks
\A is an escape sequence that anchors the match to the beginning of the string, much like the ^ at the start of your regex. Remove the backslash to make it match a literal A.
Edit: You also seem to have \C in there. You should only use backslash to escape meta characters such as period ., or to create escape sequences, such as \Q .. \E.
At its simplest, a regex to match A.CC3 would be
$text =~ /A\.CC3/
That's all you need. This will match any string with A.CC3 in it. In the comments you mention the string you are matching is this:
my $text = "//%CC Unused Static Globals, A.CC3, Halstead Progam Volume";
You might want to avoid partial matches, in which case you can use word boundary \b
$text =~ /\bA\.CC3\b/
You might require that a line begins with //%
$text =~ m#^//%.*\bA\.CC3\b#
Of course, only you know which parts of the string should be matched and in what way. "Something followed by anything followed by A.CC3 followed by anything" really just needs the first simple regex.
It doesn't seem like you're trying to capture anything. If that's the case, and all you need to do is find lines that contain A.CC3 then you can simply do
if ( index( $str, 'A.CC3' ) >= 0 ) # Found it...
No need for a regex.
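The suggestions above can be compared side by side on the string from the comments. The two extra test strings at the end are assumptions added to show what the escaping and the word boundaries buy you:

```perl
use strict;
use warnings;

my $text = "//%CC Unused Static Globals, A.CC3, Halstead Progam Volume";

my $plain    = $text =~ /A\.CC3/            ? 1 : 0;   # simplest pattern
my $bounded  = $text =~ /\bA\.CC3\b/        ? 1 : 0;   # no partial matches
my $prefixed = $text =~ m{^//%.*\bA\.CC3\b} ? 1 : 0;   # line must start with //%
my $found    = index($text, 'A.CC3') >= 0   ? 1 : 0;   # no regex at all

# Hypothetical near-misses:
my $partial   = "BA.CC30" =~ /\bA\.CC3\b/ ? 1 : 0;   # rejected by the word boundaries
my $unescaped = "AxCC3"   =~ /A.CC3/      ? 1 : 0;   # an unescaped "." matches any character

print "plain=$plain bounded=$bounded prefixed=$prefixed found=$found\n";
print "partial=$partial unescaped=$unescaped\n";
```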
Try to give this a shot:
^.*?A\.CC.*$
That will match anything until it reaches A, then a literal ., followed by CC, then anything until end of string.
It depends what you want to match. If you want to pull back the whole line in which the A.CC3 pattern occurs then something like this should work:
^.*A\.CC3.*$

How can I preserve whitespace when I match and replace several words in Perl?

Let's say I have some original text:
here is some text that has a substring that I'm interested in embedded in it.
I need the text to match a part of it, say: "has a substring".
However, the original text and the matching string may have whitespace differences. For example the match text might be:
has a
substring
or
has a substring
and/or the original text might be:
here is some
text that has
a substring that I'm interested in embedded in it.
What I need my program to output is:
here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.
I also need to preserve the whitespace pattern in the original and just add the start and end markers to it.
Any ideas about a way of using Perl regexes to get this to happen? I tried, but ended up getting horribly confused.
Been some time since I've used perl regular expressions, but what about:
$match =~ s/(has\s+a\s+substring)/[$1]/ig;
This would match one or more whitespace characters, including newlines, between the words. It will wrap the entire match with brackets while maintaining the original separation. It ain't automatic, but it does work.
You could play games with this, like taking the string "has a substring" and doing a transform on it to make it "has\s+a\s+substring" to make this a little less painful.
EDIT: Incorporated ysth's comments that the \s metacharacter matches newlines and hobbs corrections to my \s usage.
This pattern will match the string that you're looking to find:
(has\s+a\s+substring)
So, when the user enters a search string, replace any whitespace in the search string with \s+ and you have your pattern. Then, just replace every match with [match starts here]$1[match ends here] where $1 is the matched text.
In regexes, you can use + to mean "one or more." So something like this
/has\s+a\s+substring/
matches has followed by one or more whitespace chars, followed by a followed by one or more whitespace chars, followed by substring.
Putting it together with a substitution operator, you can say:
my $str = "here is some text that has a substring that I'm interested in embedded in it.";
$str =~ s/(has\s+a\s+substring)/\[match starts here]$1\[match ends here]/gs;
print $str;
And the output is:
here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.
As many have suggested, use \s+ to match whitespace. Here is how you do it automatically:
my $original = "here is some text that has a substring that I'm interested in embedded in it.";
my $search = "has a\nsubstring";
my $re = $search;
$re =~ s/\s+/\\s+/g;
$original =~ s/\b$re\b/[match starts here]$&\[match ends here]/g; # \[ keeps $& from being read as an array subscript
print $original;
Output:
here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.
You might want to escape any meta-characters in the string. If someone is interested, I could add it.
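One possible way to add that escaping is sketched below. The mark_match name and the sample inputs are hypothetical; quotemeta is Perl's built-in for escaping regex metacharacters. The \b anchors are dropped here on purpose, since they would stop a match when the search text starts or ends with punctuation.

```perl
use strict;
use warnings;

# Hypothetical helper: wraps every whitespace-insensitive occurrence
# of $search in $original with start/end markers.
sub mark_match {
    my ($original, $search) = @_;

    # Escape each word separately, then allow any whitespace run between words.
    my @words = grep { length } split /\s+/, $search;
    return $original unless @words;
    my $re = join '\s+', map { quotemeta($_) } @words;

    (my $marked = $original) =~ s/($re)/[match starts here]$1\[match ends here]/g;
    return $marked;
}

print mark_match("text that has\na substring that I'm interested in", "has a substring"), "\n";
print mark_match("cost is 1.50 (approx)", "1.50  (approx)"), "\n";
```

The whitespace inside the match is kept exactly as it appears in the original text, since $1 holds the matched text rather than the search string.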
This is an example of how you could do that.
#! /opt/perl/bin/perl
use strict;
use warnings;
my $submatch = "has a\nsubstring";
my $str = "
here is some
text that has
a substring that I'm interested in, embedded in it.
";
print substr_match($str, $submatch), "\n";
sub substr_match{
my($string,$match) = @_;
$match =~ s/\s+/\\s+/g;
# This isn't safe the way it is now, you will need to sanitize $match
$string =~ /\b$match\b/;
}
This currently doesn't do anything to check the $match variable for unsafe characters.