Can regex match all the words outside quotation marks? - regex

I recently typed an essay for my lit class, and my teacher specifically stated a word limit that does not include quotations from the piece. And I thought, why not make a script that calculates that for you? I could, of course, do this the boring way by going though the whole text and ignoring the words inside quotation marks, but I have a feeling that there's a neater way using Regex and Array.count. As I know next to nothing about Regex, can someone help me/tell me that it's impossible with Regex?
Tl;dr: use Regex to match all words (or spaces, doesn't matter) that are outside quotation marks from a text, and count the items in the resulting array.

Depending on the requirements, could use The Greatest Regex Trick Ever
"[^"]*"|(\w+)
And count the matches of the first capture group.
\w+ matches one or more word characters.
See test at regex101.com
Also skip single quoted strings:
"[^"]*"|'[^']*'|(\w+)
test at regex101

This is easy enough using PCRE (or Perl of course):
".*?"(*SKIP)(?!)|(?<!\w)'.*?'(?!\w)(*SKIP)(?!)|[\w']+
Use the g modifier, and s if you want to handle multiline quotes.
Demo
Here's the x version for readability:
".*?" (*SKIP)(?!)
| (?<!\w)'.*?'(?!\w) (*SKIP)(?!)
| [\w]+
The first part will match everything inside " or ' quotes and will discard it ((*SKIP)(?!)). The second part will match all words (I've included ' as being part of a word in this example). The ' character will be counted as a quote boundary only at start/end of words, to let you use things like isn't for instance.
Possible modifications:
To count the text isn't as two words, replace [\w']+ with \w+.
To count text like mother-in-law as one word instead of 3, replace [\w']+ with [-\w']+.
You get the point ;)
And here's a full Perl script that uses this regex:
#!/usr/bin/env perl
use strict;
use warnings;
$_ = do { local $/; <> };
print scalar (() = /".*?"(*SKIP)(?!)|(?<!\w)'.*?'(?!\w)(*SKIP)(?!)|[\w']+/gs), "\n";
Execute it passing in a file or STDIN containing the text you want to count the words in, and it will output the word count on STDOUT.

A general solution would be pretty tough, since some works will have multi-paragraph quotes, where the first paragraph doesn't close the quote, but the second paragraph opens with a quotation mark. So matching quote marks document-wide would be hard.
On the other hand, you could maybe go paragraph-by-paragraph, and accumulate a non-quote word count for each paragraph. There would still be pathalogical cases that could break this (like a paragraph which includes a list of punctuation symbols, including a quotation mark), of course.
In Perl, assuming a getWordCount sub exists somewhere, and assuming you've somehow split your document into an array of paragraphs called #paragraphs, this might look like:
my $wordCount = 0;
foreach my $paragraph (#paragraphs) {
$paragraph =~ s/\".*?\"/g; # remove all quotation marks which have a matching quotation mark
$paragraph =~ s/\".*$/g; # remove quotation marks which go to the end of the paragraph
$wordCount += getWordCount($paragraph);
}
print "There are $wordCount words outside of quotations, maybe!";

It would work better this way:
Total Number of characters - Sum(characters inside quotes)
You can use this regex to find all "Quoted" strings: \"[^"]*\"

Related

Eliminate whitespace around single letters

I frequently receive PDFs that contain (when converted with pdftotext) whitespaces between the letters of some arbitrary words:
This i s a n example t e x t that c o n t a i n s strange spaces.
For further automated processing (looking for specific words) I would like to remove all whitespace between "standalone" letters (single-letter words), so the result would look like this:
This isan example text that contains strange spaces.
I tried to achieve this with a simple perl regex:
s/ (\w) (\w) / $1$2 /g
Which of course does not work, as after the first and second standalone letters have been moved together, the second one no longer is a standalone, so the space to the third will not match:
This is a n example te x t that co n ta i ns strange spaces.
So I tried lockahead assertions, but failed to achieve anything (also because I did not find any example that uses them in a substitution).
As usual with PRE, my feeling is, that there must be a very simple and elegant solution for this...
Just match a continuous series of single letters separated by spaces, then delete all spaces from that using a nested substitution (the /e eval modifier).
s{\b ((\w\s)+\w) \b}{ my $s = $1; $s =~ s/ //g; $s }xge;
Excess whitespace can be removed with a regex, but Perl by itself cannot know what is correct English. With that caveat, this seems to work:
$ perl -pe's/(?<!\S)(\S) (?=\S )/$1/g' spaces.txt
This isan example text that contains strange spaces.
Note that i s a n cannot be distinguished from a normal 4 letter word, that requires human correction, or some language module.
Explanation:
(?<!\S) negative look-behind assertion checks that the character behind is not a non-whitespace.
(\S) next must follow a non-whitespace, which we capture with parens, followed by a whitespace, which we will remove (or not put back, as it were).
(?=\S ) next we check with a look-ahead assertion that what follows is a non-whitespace followed by a whitespace. We do not change the string there.
Then put back the character we captured with $1
It might be more correct to use [^ ] instead of \S. Since you only seem to have a problem with spaces being inserted, there is no need to match tabs, newlines or other whitespace. Feel free to do that change if you feel it is appropriate.

What regular expression can remove duplicate items from a string?

Given a string of identifiers separated by :, is it possible to construct a regular expression to extract the unique identifiers into another string, also separated by :?
How is it possible to achieve this using a regular expression? I have tried s/(:[^:])(.*)\1/$1$2/g with no luck, because the (.*) is greedy and skips to the last match of $1.
Example: a:b:c:d:c:c:x:c:c:e:e:f should give a:b:c:d:x:e:f
Note: I am coding in perl, but I would very much appreciate using a regex for this.
In .NET which supports infinite repetition inside lookbehind, you could search for
(?<=\b\1:.*)\b(\w+):?
and replace all matches with the empty string.
Perl (at least Perl 5) only supports fixed-length lookbehinds, so you can try the following (using lookahead, with a subtly different result):
\b(\w+):(?=.*\b\1:?)
If you replace that with the empty string, all previous repetitions of a duplicate entry will be removed; the last one will remain. So instead of
a:b:c:d:x:e:f
you would get
a:b:d:x:c:e:f
If that is OK, you can use
$subject =~ s/\b(\w+):(?=.*\b\1:?)//g;
Explanation:
First regex:
(?<=\b\1:.*): Check if you can match the contents of backreference no. 1, followed by a colon, somewhere before in the string.
\b(\w+):?: Match an identifier (from a word boundary to the next :), optionally followed by a colon.
Second regex:
\b(\w+):: Match an identifier and a colon.
(?=.*\b\1:?): Then check whether you can match the same identifier, optionally followed by a colon, somewhere ahead in the string.
Check out: http://www.regular-expressions.info/duplicatelines.html
Always a useful site when thinking about any regular expression.
$str = q!a:b:c:d:c:c:x:c:c:e:e:f!;
1 while($str =~ s/(:[^:]+)(.*?)\1/$1$2/g);
say $str
output :
a:b:c:d:x:e:f
here's an awk version, no need regex.
$ echo "a:b:c:d:c:c:x:c:c:e:e:f" | awk -F":" '{for(i=1;i<=NF;i++)if($i in a){continue}else{a[$i];printf $i}}'
abcdxef
split the fields on ":", go through the splitted fields, store the elements in an array. check for existence and if exists, skip. Else print them out. you can translate this easily into Perl code.
If the identifiers are sorted, you may be able to do it using lookahead/lookbehind. If they aren't, then this is beyond the computational power of a regex. Now, just because it's impossible with formal regex doesn't mean it's impossible if you use some perl specific regex feature, but if you want to keep your regexes portable you need to describe this string in a language that supports variables.

How can I extract a substring enclosed in double quotes in Perl?

I'm new to Perl and regular expressions and I am having a hard time extracting a string enclosed by double quotes. Like for example,
"Stackoverflow is
awesome"
Before I extract the strings, I want to check if it is the end of the line of the whole text was in the variable:
if($wholeText =~ /\"$/) #check the last character if " which is the end of the string
{
$wholeText =~ s/\"(.*)\"/$1/; #extract the string, removed the quotes
}
My code didn't work; it is not getting inside of the if condition.
You need to do:
if($wholeText =~ /"$/)
{
$wholeText =~ s/"(.*?)"/$1/s;
}
. doesn't match newlines unless you apply the /s modifier.
There's no need to escape the quotes like you're doing.
The above poster who recommended using the "m" flag in the regular expression is correct, however the regex provided won't quite work. When you say:
$wholeText =~ s/\"(.*)\"/$1/m; #extract the string, removed the quotes
...the regular expression is too "greedy", which means the (.*) part will gobble up too much of the text. If you have a sample like this:
"The quick brown fox," he said, "jumped over the lazy dog."
...then the above regex will capture everything from "The" through "dog.", which is probably not what you intend. There are two ways to make the regex less greedy. Which one is better has everything to do with how you choose to handle extra " marks inside your string.
One:
$wholeText =~ s/\"([^"]*)\"/$1/m;
Two:
$wholeText =~ s/\"(.*?)\"/$1/m;
In One, the regex says "start with quote, then find everything that is not a quote and remember it, until you see another quote." In Two, the regex says "Start with quote, then find everything until you find another quote." The extra ? inside the ( ) tells the regex processor to not be greedy. Without considering quote escaping within the string, both regular expressions should behave the same.
By the way, this is a classic problem when parsing a CSV ("Comma Separated Values") file, by the way, so looking up some references on that may help you out.
If you want to anchor a match to the very end of the string (not line, entire string), use the \z anchor:
if( $wholeText =~ /"\z/ ) { ... }
You don't need a guard condition for this. Just use the right regex in the substitution. If it doesn't match the regex, nothing happens:
$wholeText =~ s/"(.*?)"\z/$1/s;
I think you really have a different question though. Why are you trying to anchor it to the end of the string? What problems are you trying to avoid?
For multi-line strings, you need to include the 'm' modifier with the search pattern.
if ($wholeText =~ m/\"$/m) # First m for match operator; second multi-line modifier
{
$wholeText =~ s/\"(.*?)\"/$1/s; #extract the string, removed the quotes
}
You will also need to consider whether you allow double quotes inside the string and if so, which convention to use. The primary ones are backslash and double quote (also backslash backslash), or double quote double quote in the string. These slightly complicate your regex.
The answer by #chaos uses 's' as a multi-line modifier. There's a small difference between the two:
m
Treat string as multiple lines. That is, change "^" and "$" from matching the start or end of the string to matching the start or end of any line anywhere within the string.
s
Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.
Used together, as /ms, they let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string.
Assuming you have a single substring in quotes, this will extract it:
s/."(.?)".*/$1/
And the answer above (s/"(.*?)"/$1/s) will just remove quotes.
Test code:
my $text = "no \"need this\" again, no\n";
my $text2 = $text;
print $text;
$text2 =~ s/.*\"(.*?)\".*/$1/;
print $text2;
$text =~ s/"(.*?)"/$1/s;
print $text;
Output:
no "need this" again, no
need this
no need this again, no

How can I preserve whitespace when I match and replace several words in Perl?

Let's say I have some original text:
here is some text that has a substring that I'm interested in embedded in it.
I need the text to match a part of it, say: "has a substring".
However, the original text and the matching string may have whitespace differences. For example the match text might be:
has a
substring
or
has a substring
and/or the original text might be:
here is some
text that has
a substring that I'm interested in embedded in it.
What I need my program to output is:
here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.
I also need to preserve the whitespace pattern in the original and just add the start and end markers to it.
Any ideas about a way of using Perl regexes to get this to happen? I tried, but ended up getting horribly confused.
Been some time since I've used perl regular expressions, but what about:
$match = s/(has\s+a\s+substring)/[$1]/ig
This would capture zero or more whitespace and newline characters between the words. It will wrap the entire match with brackets while maintaining the original separation. It ain't automatic, but it does work.
You could play games with this, like taking the string "has a substring" and doing a transform on it to make it "has\s*a\s*substring" to make this a little less painful.
EDIT: Incorporated ysth's comments that the \s metacharacter matches newlines and hobbs corrections to my \s usage.
This pattern will match the string that you're looking to find:
(has\s+a\s+substring)
So, when the user enters a search string, replace any whitespace in the search string with \s+ and you have your pattern. The, just replace every match with [match starts here]$1[match ends here] where $1 is the matched text.
In regexes, you can use + to mean "one or more." So something like this
/has\s+a\s+substring/
matches has followed by one or more whitespace chars, followed by a followed by one or more whitespace chars, followed by substring.
Putting it together with a substitution operator, you can say:
my $str = "here is some text that has a substring that I'm interested in embedded in it.";
$str =~ s/(has\s+a\s+substring)/\[match starts here]$1\[match ends here]/gs;
print $str;
And the output is:
here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.
A many has suggested, use \s+ to match whitespace. Here is how you do it automaticly:
my $original = "here is some text that has a substring that I'm interested in embedded in it.";
my $search = "has a\nsubstring";
my $re = $search;
$re =~ s/\s+/\\s+/g;
$original =~ s/\b$re\b/[match starts here]$&[match ends here]/g;
print $original;
Output:
here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.
You might want to escape any meta-characters in the string. If someone is interested, I could add it.
This is an example of how you could do that.
#! /opt/perl/bin/perl
use strict;
use warnings;
my $submatch = "has a\nsubstring";
my $str = "
here is some
text that has
a substring that I'm interested in, embedded in it.
";
print substr_match($str, $submatch), "\n";
sub substr_match{
my($string,$match) = #_;
$match =~ s/\s+/\\s+/g;
# This isn't safe the way it is now, you will need to sanitize $match
$string =~ /\b$match\b/;
}
This currently does anything to check the $match variable for unsafe characters.

RegEx: Grabbing values between quotation marks

I have a value like this:
"Foo Bar" "Another Value" something else
What regex will return the values enclosed in the quotation marks (e.g. Foo Bar and Another Value)?
In general, the following regular expression fragment is what you are looking for:
"(.*?)"
This uses the non-greedy *? operator to capture everything up to but not including the next double quote. Then, you use a language-specific mechanism to extract the matched text.
In Python, you could do:
>>> import re
>>> string = '"Foo Bar" "Another Value"'
>>> print re.findall(r'"(.*?)"', string)
['Foo Bar', 'Another Value']
I've been using the following with great success:
(["'])(?:(?=(\\?))\2.)*?\1
It supports nested quotes as well.
For those who want a deeper explanation of how this works, here's an explanation from user ephemient:
([""']) match a quote; ((?=(\\?))\2.) if backslash exists, gobble it, and whether or not that happens, match a character; *? match many times (non-greedily, as to not eat the closing quote); \1 match the same quote that was use for opening.
I would go for:
"([^"]*)"
The [^"] is regex for any character except '"'
The reason I use this over the non greedy many operator is that I have to keep looking that up just to make sure I get it correct.
Lets see two efficient ways that deal with escaped quotes. These patterns are not designed to be concise nor aesthetic, but to be efficient.
These ways use the first character discrimination to quickly find quotes in the string without the cost of an alternation. (The idea is to discard quickly characters that are not quotes without to test the two branches of the alternation.)
Content between quotes is described with an unrolled loop (instead of a repeated alternation) to be more efficient too: [^"\\]*(?:\\.[^"\\]*)*
Obviously to deal with strings that haven't balanced quotes, you can use possessive quantifiers instead: [^"\\]*+(?:\\.[^"\\]*)*+ or a workaround to emulate them, to prevent too much backtracking. You can choose too that a quoted part can be an opening quote until the next (non-escaped) quote or the end of the string. In this case there is no need to use possessive quantifiers, you only need to make the last quote optional.
Notice: sometimes quotes are not escaped with a backslash but by repeating the quote. In this case the content subpattern looks like this: [^"]*(?:""[^"]*)*
The patterns avoid the use of a capture group and a backreference (I mean something like (["']).....\1) and use a simple alternation but with ["'] at the beginning, in factor.
Perl like:
["'](?:(?<=")[^"\\]*(?s:\\.[^"\\]*)*"|(?<=')[^'\\]*(?s:\\.[^'\\]*)*')
(note that (?s:...) is a syntactic sugar to switch on the dotall/singleline mode inside the non-capturing group. If this syntax is not supported you can easily switch this mode on for all the pattern or replace the dot with [\s\S])
(The way this pattern is written is totally "hand-driven" and doesn't take account of eventual engine internal optimizations)
ECMA script:
(?=["'])(?:"[^"\\]*(?:\\[\s\S][^"\\]*)*"|'[^'\\]*(?:\\[\s\S][^'\\]*)*')
POSIX extended:
"[^"\\]*(\\(.|\n)[^"\\]*)*"|'[^'\\]*(\\(.|\n)[^'\\]*)*'
or simply:
"([^"\\]|\\.|\\\n)*"|'([^'\\]|\\.|\\\n)*'
Peculiarly, none of these answers produce a regex where the returned match is the text inside the quotes, which is what is asked for. MA-Madden tries but only gets the inside match as a captured group rather than the whole match. One way to actually do it would be :
(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)
Examples for this can be seen in this demo https://regex101.com/r/Hbj8aP/1
The key here is the the positive lookbehind at the start (the ?<= ) and the positive lookahead at the end (the ?=). The lookbehind is looking behind the current character to check for a quote, if found then start from there and then the lookahead is checking the character ahead for a quote and if found stop on that character. The lookbehind group (the ["']) is wrapped in brackets to create a group for whichever quote was found at the start, this is then used at the end lookahead (?=\1) to make sure it only stops when it finds the corresponding quote.
The only other complication is that because the lookahead doesn't actually consume the end quote, it will be found again by the starting lookbehind which causes text between ending and starting quotes on the same line to be matched. Putting a word boundary on the opening quote (["']\b) helps with this, though ideally I'd like to move past the lookahead but I don't think that is possible. The bit allowing escaped characters in the middle I've taken directly from Adam's answer.
The RegEx of accepted answer returns the values including their sourrounding quotation marks: "Foo Bar" and "Another Value" as matches.
Here are RegEx which return only the values between quotation marks (as the questioner was asking for):
Double quotes only (use value of capture group #1):
"(.*?[^\\])"
Single quotes only (use value of capture group #1):
'(.*?[^\\])'
Both (use value of capture group #2):
(["'])(.*?[^\\])\1
-
All support escaped and nested quotes.
I liked Eugen Mihailescu's solution to match the content between quotes whilst allowing to escape quotes. However, I discovered some problems with escaping and came up with the following regex to fix them:
(['"])(?:(?!\1|\\).|\\.)*\1
It does the trick and is still pretty simple and easy to maintain.
Demo (with some more test-cases; feel free to use it and expand on it).
PS: If you just want the content between quotes in the full match ($0), and are not afraid of the performance penalty use:
(?<=(['"])\b)(?:(?!\1|\\).|\\.)*(?=\1)
Unfortunately, without the quotes as anchors, I had to add a boundary \b which does not play well with spaces and non-word boundary characters after the starting quote.
Alternatively, modify the initial version by simply adding a group and extract the string form $2:
(['"])((?:(?!\1|\\).|\\.)*)\1
PPS: If your focus is solely on efficiency, go with Casimir et Hippolyte's solution; it's a good one.
A very late answer, but like to answer
(\"[\w\s]+\")
http://regex101.com/r/cB0kB8/1
The pattern (["'])(?:(?=(\\?))\2.)*?\1 above does the job but I am concerned of its performances (it's not bad but could be better). Mine below it's ~20% faster.
The pattern "(.*?)" is just incomplete. My advice for everyone reading this is just DON'T USE IT!!!
For instance it cannot capture many strings (if needed I can provide an exhaustive test-case) like the one below:
$string = 'How are you? I\'m fine, thank you';
The rest of them are just as "good" as the one above.
If you really care both about performance and precision then start with the one below:
/(['"])((\\\1|.)*?)\1/gm
In my tests it covered every string I met but if you find something that doesn't work I would gladly update it for you.
Check my pattern in an online regex tester.
This version
accounts for escaped quotes
controls backtracking
/(["'])((?:(?!\1)[^\\]|(?:\\\\)*\\[^\\])*)\1/
MORE ANSWERS! Here is the solution i used
\"([^\"]*?icon[^\"]*?)\"
TLDR;
replace the word icon with what your looking for in said quotes and voila!
The way this works is it looks for the keyword and doesn't care what else in between the quotes.
EG:
id="fb-icon"
id="icon-close"
id="large-icon-close"
the regex looks for a quote mark "
then it looks for any possible group of letters thats not "
until it finds icon
and any possible group of letters that is not "
it then looks for a closing "
I liked Axeman's more expansive version, but had some trouble with it (it didn't match for example
foo "string \\ string" bar
or
foo "string1" bar "string2"
correctly, so I tried to fix it:
# opening quote
(["'])
(
# repeat (non-greedy, so we don't span multiple strings)
(?:
# anything, except not the opening quote, and not
# a backslash, which are handled separately.
(?!\1)[^\\]
|
# consume any double backslash (unnecessary?)
(?:\\\\)*
|
# Allow backslash to escape characters
\\.
)*?
)
# same character as opening quote
\1
string = "\" foo bar\" \"loloo\""
print re.findall(r'"(.*?)"',string)
just try this out , works like a charm !!!
\ indicates skip character
My solution to this is below
(["']).*\1(?![^\s])
Demo link : https://regex101.com/r/jlhQhV/1
Explanation:
(["'])-> Matches to either ' or " and store it in the backreference \1 once the match found
.* -> Greedy approach to continue matching everything zero or more times until it encounters ' or " at end of the string. After encountering such state, regex engine backtrack to previous matching character and here regex is over and will move to next regex.
\1 -> Matches to the character or string that have been matched earlier with the first capture group.
(?![^\s]) -> Negative lookahead to ensure there should not any non space character after the previous match
Unlike Adam's answer, I have a simple but worked one:
(["'])(?:\\\1|.)*?\1
And just add parenthesis if you want to get content in quotes like this:
(["'])((?:\\\1|.)*?)\1
Then $1 matches quote char and $2 matches content string.
All the answer above are good.... except they DOES NOT support all the unicode characters! at ECMA Script (Javascript)
If you are a Node users, you might want the the modified version of accepted answer that support all unicode characters :
/(?<=((?<=[\s,.:;"']|^)["']))(?:(?=(\\?))\2.)*?(?=\1)/gmu
Try here.
echo 'junk "Foo Bar" not empty one "" this "but this" and this neither' | sed 's/[^\"]*\"\([^\"]*\)\"[^\"]*/>\1</g'
This will result in: >Foo Bar<><>but this<
Here I showed the result string between ><'s for clarity, also using the non-greedy version with this sed command we first throw out the junk before and after that ""'s and then replace this with the part between the ""'s and surround this by ><'s.
From Greg H. I was able to create this regex to suit my needs.
I needed to match a specific value that was qualified by being inside quotes. It must be a full match, no partial matching could should trigger a hit
e.g. "test" could not match for "test2".
reg = r"""(['"])(%s)\1"""
if re.search(reg%(needle), haystack, re.IGNORECASE):
print "winning..."
Hunter
If you're trying to find strings that only have a certain suffix, such as dot syntax, you can try this:
\"([^\"]*?[^\"]*?)\".localized
Where .localized is the suffix.
Example:
print("this is something I need to return".localized + "so is this".localized + "but this is not")
It will capture "this is something I need to return".localized and "so is this".localized but not "but this is not".
A supplementary answer for the subset of Microsoft VBA coders only one uses the library Microsoft VBScript Regular Expressions 5.5 and this gives the following code
Sub TestRegularExpression()
Dim oRE As VBScript_RegExp_55.RegExp '* Tools->References: Microsoft VBScript Regular Expressions 5.5
Set oRE = New VBScript_RegExp_55.RegExp
oRE.Pattern = """([^""]*)"""
oRE.Global = True
Dim sTest As String
sTest = """Foo Bar"" ""Another Value"" something else"
Debug.Assert oRE.test(sTest)
Dim oMatchCol As VBScript_RegExp_55.MatchCollection
Set oMatchCol = oRE.Execute(sTest)
Debug.Assert oMatchCol.Count = 2
Dim oMatch As Match
For Each oMatch In oMatchCol
Debug.Print oMatch.SubMatches(0)
Next oMatch
End Sub