Regex matching spaces, but not in "strings"

I am looking for a regular expression that matches spaces only if those spaces are not enclosed in double quotes ("). For example, in
Mary had "a little lamb"
it should match the first and the second space, but not the others.
I want to split the string only at the spaces that are not inside double quotes, and not at the quotes themselves.
I am using C++ with the Qt toolkit and wanted to use QString::split(QRegExp). QString is very similar to std::string, and QRegExp is basically a POSIX regex encapsulated in a class. If such a regex exists, the split would be trivial.
Examples:
Mary had "a little lamb" => Mary,had,"a little lamb"
1" 2 "3 => 1" 2 "3 (no splitting at ")
abc def="g h i" "j k" = 12 => abc,def="g h i","j k",=,12
Sorry for the edits, I was very imprecise when I asked the question first. Hope it is somewhat more clear now.

(I know you just posted almost exactly the same answer yourself, but I can't bear to just throw all this away. :-/)
If it's possible to solve your problem with a regex split operation, the regex will have to match even numbers of quotation marks, as MSalters said. However, a split regex should match only the spaces you're splitting on, so the rest of the work has to be done in a lookahead. Here's what I would use:
" +(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"
If the text is well formed, a lookahead for an even number of quotes is sufficient to determine that the just-matched space is not inside a quoted sequence. That is, lookbehinds aren't necessary, which is good because QRegExp doesn't seem to support them. Escaped quotes can be accommodated too, but the regex becomes quite a bit larger and uglier. But if you can't be sure the text is well formed, it's extremely unlikely you'll be able to solve your problem with split().
By the way, QRegExp does not implement POSIX regular expressions--if it did, it wouldn't support lookaheads OR lookbehinds. Instead, it falls into the loosely-defined category of Perl-compatible regex flavors.
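For readers who want to try the lookahead outside Qt, here is a minimal sketch in Python (my assumption; the question itself uses QRegExp). It applies the same pattern through re.split to the three examples from the question. The helper name split_unquoted is hypothetical, and the sketch assumes well-formed (paired) quotes, as discussed above.

```python
import re

# Same pattern as above, written as a Python raw string: one or more spaces
# followed, up to the end of the string, by an even number of double quotes.
SPLIT_RE = re.compile(r' +(?=(?:[^"]*"[^"]*")*[^"]*$)')

def split_unquoted(text):
    """Split on runs of spaces that are not inside a double-quoted section."""
    return SPLIT_RE.split(text)

for sample in ('Mary had "a little lamb"',
               '1" 2 "3',
               'abc def="g h i" "j k" = 12'):
    print(split_unquoted(sample))
```

Because the lookahead rescans the rest of the string for every candidate space, this is fine for short inputs but does repeated work on long ones.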

What should happen to "a" b "c" ?
Note that in the substring " b " the spaces are between quotes.
-- edit --
I assume a space is "between quotes" if it is preceded and followed by an odd number of standard quotation marks (i.e. U+0022, I'll ignore those funny Unicode “quotes”).
That means you need the following regex: ^[^"]*("[^"]*"[^"]*)*"[^"]* [^"]*"[^"]*("[^"]*"[^"]*)*$
("[^"]*"[^"]*) represents a pair of quotes. ("[^"]*"[^"]*)* is an even number of quotes, ("[^"]*"[^"]*)*" an odd number. Then there's the actual quoted string part, followed by another odd number of quotes. The ^ and $ anchors are needed because you need to count every quote from the beginning of the string. This answers the " b " substring problem above by never looking at substrings. The price is that every character in your input must be matched against the entire string, which turns this into an O(N*N) split operation.
The reason why you can do this in a regex is because there is a finite amount of memory needed. Effectively just one bit; "have I seen an odd or even number of quotes so far?". You don't actually have to match up individual "" pairs.
This is not the only interpretation possible, though. If you do include “funny Unicode quotes” which should be paired, you also need to deal with ““double quoted”” strings. This in turn means you need a count of open “, which means you need infinite storage, which in turn means it's no longer a regular language, which means you can't use a regex. QED.
Anyway, even if it was possible, you still would want a proper parser. The O(N*N) behavior to count the number of quotes preceding each character just isn't funny. If you already know there are X quotes preceding Str[N], it should be an O(1) operation to determine how many quotes precede Str[N+1], not O(N). The possible answers are after all just X or X+1 !
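The "one bit of state" idea can be sketched as a single left-to-right pass. This is a minimal Python illustration, not Qt code; the function name split_outside_quotes is hypothetical, and collapsing runs of spaces (dropping empty fields) is my own assumption about the desired behavior.

```python
def split_outside_quotes(text):
    """Split on spaces that are outside double quotes, in one O(N) pass."""
    fields = []
    current = []
    in_quotes = False          # the single bit: odd or even number of quotes so far

    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
            current.append(ch)             # quotes stay part of the field
        elif ch == ' ' and not in_quotes:
            if current:                    # assumption: skip empty fields
                fields.append(''.join(current))
                current = []
        else:
            current.append(ch)

    if current:
        fields.append(''.join(current))
    return fields

print(split_outside_quotes('abc def="g h i" "j k" = 12'))
```

Each character is examined once, and knowing the quote parity at position N gives the parity at N+1 in constant time.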

MSalters pushed me onto the right track. The problem with his answer is that the regex he gives always matches the whole string and so is unsuitable for split(), but this can be partly remedied by a lookahead match. Assuming that the quotes are always paired (they are indeed), I can split on every space which is followed by an even number of quotes.
The regex without C escapes and in single quotes looks like
' (?=[^"]*("[^"]*"[^"]*)*$)'
In the source it finally looked like (using Qt and C++)
QString buf("Mary had \"a little lamb\""); // string we want to split
QStringList splitted = buf.split( QRegExp(" (?=[^\"]*(\"[^\"]*\"[^\"]*)*$)") );
Simple, eh?
As for performance: the strings are parsed once at the start of the program, there are a few dozen of them, and each is shorter than a hundred characters. I will test the runtime with long strings, just to be sure nothing bad happens ;-)

If the quoting in the strings is simple (like your examples), you can use alternation. This regex first hunts for a simple quoted string; failing that it finds spaces.
/(\"[^\"]*\"| +)/
In Perl, if you use grouping in the regex when calling split(), the function returns not only the elements but also the captured groups (in this case, our delimiter). If you then filter out the blank and spaces-only delimiters, you will get the desired list of elements. I don't know whether a similar strategy would work in C++, but the following Perl code does work:
use strict;
use warnings;
while (<DATA>){
chomp;
my @elements = split /(\"[^\"]*\"| +)/, $_;
@elements = grep {length and /[^ ]/} @elements;
# Do stuff with @elements
}
__DATA__
Mary had "a little lamb"
1" 2 "3
abc def="g h i" "j k" = 12
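As a rough check of how this alternation strategy behaves, here is a Python port (an assumption on my part; the answer above is Perl). Python's re.split also returns captured delimiters, so the same filter applies. The name split_keep_quoted is hypothetical.

```python
import re

# A quoted string or a run of spaces; the capture group makes re.split
# return the delimiters alongside the text between them.
TOKEN_RE = re.compile(r'("[^"]*"| +)')

def split_keep_quoted(text):
    pieces = TOKEN_RE.split(text)
    # Drop empty pieces and pieces made only of spaces.
    return [p for p in pieces if p.strip(' ')]

print(split_keep_quoted('Mary had "a little lamb"'))
print(split_keep_quoted('abc def="g h i" "j k" = 12'))
```

In this port, at least, a quoted string attached to other text is returned as its own element: def="g h i" comes back as def= and "g h i". That differs from the output the question asks for, so the approach fits best when quoted strings always stand alone between spaces.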

Simplest regex-solution: match whole spaces AND quotes. Filter quotes later
"[^"]*"|\s

Related

How to multiline regex but stop after first match?

I need to match any string that has certain characteristics, but I think enabling the /m flag is breaking the functionality.
What I know:
The string will start and end with quotation marks.
The string will have the following words. "the", "fox", and "lazy".
The string may have a line break in the middle.
The string will never have an at sign (used in the regex statement)
My problem is, if I have the string twice in a single block of text, it returns once, matching everything between the first quote mark and last quote mark with the required words in-between.
Here is my regex:
/^"the[^@]*fox[^@]*lazy[^@]*"$/gim
And a Regex101 example.
Here is my understanding of the statement. Match where the string starts with "the and there is the word fox and lazy (in that order) somewhere before the string ends with ". Also ignore newlines and case-sensitivity.
The most common answer for limiting a match is (.*?), but it doesn't work across newlines. And putting [^@?]* doesn't work because it adds the ? to the list of characters to exclude.
So how can I keep the "match everything until ___" from skipping until the last instance while still being able to ignore newlines?
This is not a duplicate of anything else I can find because this deals with multi-line matching, and those don't.

In your case, all your quantifiers need to be non-greedy, so you can just use the ungreedy flag, U.
/^"the[^@]*fox[^@]*lazy[^@]*"$/gimU
Example on Regex101.

The answer, which was figured out while typing up this question, may seem ridiculously obvious.
Put the ? after the *, not inside the brackets. Parentheses and brackets are not analogous, and the ? should be relative to the *.
Corrected regex:
/^"the[^@]*?fox[^@]*?lazy[^@]*?"$/gim
Example from Regex101.
The long and the short of this is:
Non-greedy, multi-line matching can be achieved with [^@]*?
(replacing @ with any character that you know will not appear in the text)
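To see the difference between the greedy and the lazy quantifier, here is a small Python sketch (Python is my assumption; the question uses a PCRE-style literal on Regex101). The excluded character is the at sign that the question says never appears in the text, and the sample text is made up for illustration.

```python
import re

# Hypothetical sample: two quoted strings, the first spanning a line break.
text = (
    '"The quick brown fox\n'
    'jumps over the lazy dog"\n'
    '"the red fox is lazy"\n'
)

FLAGS = re.IGNORECASE | re.MULTILINE   # the i and m flags; findall covers g

# Greedy: each [^@]* runs as far as it can, so both strings merge into one match.
greedy = re.findall(r'^"the[^@]*fox[^@]*lazy[^@]*"$', text, FLAGS)

# Lazy: each [^@]*? stops at the first place the rest of the pattern fits.
lazy = re.findall(r'^"the[^@]*?fox[^@]*?lazy[^@]*?"$', text, FLAGS)

print(len(greedy), len(lazy))
```

A negated character class matches newlines, which is why it works here where a plain dot would need an extra flag.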

REGEX in R: extracting words from a string

i guess this is a common problem, and i found quite a lot of webpages, including some from SO, but i failed to understand how to implement it.
I am new to REGEX, and I'd like to use it in R to extract the first few words from a sentence.
for example, if my sentence is
z = "I love stack overflow it is such a cool site"
id like to have my output as being (if i need the first four words)
[1] "I love stack overflow"
or (if i need the last four words)
[1] "such a cool site"
of course, the following works
paste(strsplit(z," ")[[1]][1:4],collapse=" ")
paste(strsplit(z," ")[[1]][7:10],collapse=" ")
but i'd like to try a regex solution for performance issues as i need to deal with very huge files (and also for the sake of knowing about it)
I looked at several links, including
Regex to extract first 3 words from a string and
http://osherove.com/blog/2005/1/7/using-regex-to-return-the-first-n-words-in-a-string.html
so i tried things like
gsub("^((?:\S+\s+){2}\S+).*",z,perl=TRUE)
Error: '\S' is an unrecognized escape in character string starting ""^((?:\S"
i tried other stuff but it usually returned me either the whole string, or the empty string.
another problem with strsplit is that it returns a list. maybe the [[]] operator is slowing things a bit (??) when dealing with large files and doing apply stuff.
it looks like the Syntax used in R is somewhat different ?
thanks !

You've already accepted an answer, but I'm going to share this as a means of helping you understand a little more about regex in R, since you were actually very close to getting the answer on your own.
There are two problems with your gsub approach:
You used single backslashes (\). R requires you to escape those since they are special characters. You escape them by adding another backslash (\\). If you do nchar("\\"), you'll see that it returns "1".
You didn't specify what the replacement should be. Here, we don't want to replace anything, but we want to capture a specific part of the string. You capture groups in parentheses (...), and then you can refer to them by the number of the group. Here, we have just one group, so we refer to it as "\\1".
You should have tried something like:
sub("^((?:\\S+\\s+){2}\\S+).*", "\\1", z, perl = TRUE)
# [1] "I love stack"
This is essentially saying:
Work from the start of the contents of "z".
Start creating group 1.
Find non-whitespace (like a word) followed by whitespace (\S+\s+) two times {2} and then the next set of non-whitespaces (\S+). This will get us 3 words, without also getting the whitespace after the third word. Thus, if you wanted a different number of words, change the {2} to be one less than the number you are actually after.
End group 1 there.
Then, just return the contents of group 1 (\1) from "z".
To get the last three words, just switch the position of the capturing group and put it at the end of the pattern to match.
sub("^.*\\s+((?:\\S+\\s+){2}\\S+)$", "\\1", z, perl = TRUE)
# [1] "a cool site"
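The "one less than the number of words" rule can be parameterized. Below is a hedged sketch in Python rather than R (my choice, for a self-contained example); the function names first_words and last_words are hypothetical, and the patterns are the same ones used above with the repeat count filled in.

```python
import re

def first_words(sentence, n):
    """Return the first n whitespace-separated words, or None if there are fewer."""
    m = re.match(r"\s*((?:\S+\s+){%d}\S+)" % (n - 1), sentence)
    return m.group(1) if m else None

def last_words(sentence, n):
    """Return the last n whitespace-separated words, or None if there are fewer."""
    m = re.search(r"((?:\S+\s+){%d}\S+)\s*$" % (n - 1), sentence)
    return m.group(1) if m else None

z = "I love stack overflow it is such a cool site"
print(first_words(z, 4))
print(last_words(z, 4))
```

Whether this is faster than splitting and pasting depends on the data, so it is worth timing both on a sample of the real files.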

For getting the first four words.
library(stringr)
str_extract(x, "^\\s*(?:\\S+\\s+){3}\\S+")
For getting the last four.
str_extract(x, "(?:\\S+\\s+){3}\\S+(?=\\s*$)")

Regular Expression to match most explicit string

I have some experience with regular expressions but I am far from expert level and need a way to match the record with the most explicit string in a file where each record begins with a unique 1-5 digit integer and is padded with various other characters when it is shorter than 5 digits. For example, my file has records that begin with:
32000
3201X
32014
320xy
In this example, the non-numeric characters represent wildcards. I thought the following regex examples would work but rather than match the record with the MOST explicit number, they always match the record with the LEAST explicit number. Remember, I do not know what is in the file so I need to test all possibilities to locate the MOST explicit match.
If I need to search for 32000, the regex looks something like:
/^3\D{4}|^32\D{3}|^320\D{2}|^3200\D|^32000/
It should match 32000 but it matches 320xy
If I need to search for 32014, the regex looks something like:
/^3\D{4}|^32\D{3}|^320\D{2}|^3201\D|^32014/
It should match 32014 but it matches 320xy
If I need to search for 32015, the regex looks something like:
/^3\D{4}|^32\D{3}|^320\D{2}|^3201\D|^32015/
It should match 3201x but it matches 320xy
In each case, the matched result is the LEAST specific numeric value. I also tried reversing the regex as follows but still get the same results:
/^32014|^3201\D|^320\D{2}|^32\D{3}|^3\D{4}/
Any help is much appreciated.

Okay, if you want to match a string literally then use anchors. Then specify the string you want matched. For instance, to match '123456xyz' where the xyz can be anything except numeric, use:
'^123456[^0-9]{3}$'
If you prefer specific letters to match at the end, and they will always be x, y or z, then use:
'^123456[xyz]{3}$'
Note that the ^ and $ anchor the string to start with 123456 and end with three letters that are x, y or z.
Good luck!

Ok, I did quite some tinkering here. I am 99 percent sure that this is pretty much impossible (if we don't cheat and interpolate code into the regex). The reason is you will need a negative lookbehind with variable length at some point.
However, I came up with two alternatives. One is if you want just to find the "most exact match", the second one is if you want to replace it with something. Here we go:
/(32000)|\A(?!.*32000).*(3200\D)|\A(?!.*3200[0\D]).*(320\D\D)|\A(?!.*320[0\D][0\D]).*(32\D\D\D)|\A(?!.*32[0\D][0\D][0\D]).*(3\D\D\D\D)/m
Question:
So what is my "most exact match" here?
Answer:
The concatenation of the 5 matched groups - \1\2\3\4\5. In fact always only one of them will match, the other 4 will be empty.
/(32000)|\A(?!.*32000)(.*)(3200\D)|\A(?!.*3200[0\D])(.*)(320\D\D)|\A(?!.*320[0\D][0\D])(.*)(32\D\D\D)|\A(?!.*32[0\D][0\D][0\D])(.*)(3\D\D\D\D)/m
Question:
How can I use this to replace my "most exact match"?
Answer:
In this case your "most exact match" will be the concatenation of \1\3\5\7\9, but we will have also matched some other things before that, namely \2\4\6\8 (again, only one of these can be non empty). Therefore if you want to replace your "most exact match" with fubar you can match with the above regex and replace with \2\4\6\8fubar
Another way you can think about it (and might be helpful) is that your "most exact match" will be the last matched line of either of the two regexes.
Two things to note here:
I used Ruby-style regexes: \A means the beginning of the string (not the beginning of a line, which is ^), and the /m flag means multi-line mode. You should be able to find syntax for the same things in your language/technology as long as it uses some flavor of PCRE.
This can be slow. If we don't find exact match we might possibly have to match and replace the entire string (if the non exact match can be found at the end of the string).
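If a single regex is not a hard requirement, a plainly different technique is to try one small pattern per specificity level, from most to least explicit, and stop at the first hit. The Python sketch below is that swapped-in approach, not the regex above; the function name most_explicit is hypothetical, and it assumes the records are already available as a list of strings.

```python
import re

def most_explicit(records, key):
    """Return the record whose leading digits match `key` most explicitly."""
    # Try all digits first, then one fewer digit padded with non-digits, etc.
    for digits in range(len(key), 0, -1):
        pattern = re.compile(
            "^" + re.escape(key[:digits]) + r"\D{%d}" % (len(key) - digits))
        for record in records:
            if pattern.match(record):
                return record
    return None

records = ["32000", "3201X", "32014", "320xy"]
for key in ("32000", "32014", "32015", "32099"):
    print(key, "->", most_explicit(records, key))
```

This scans the records up to five times for a five-digit key, which is usually acceptable and avoids the long lookahead alternation.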

preg_match / php style regex to find repeating alphanumeric characters, comma delimited?

I'm trying to figure out a preg_match / php style regex to find repeating groups of alphanumeric characters(of any length), separated by commas?
so if I have string "c,b,a,xz,x,b,a,c,xz,x,x,b,a"
would return the first series of letters that repeat more than two values. I think I need to do a recursive backreference, maybe something like
<?php
// lines removed for simplicity
// test string = "c,b,a,xz,x,b,a,c,xz,x,x,b,a"
$haystack = "c,b,a,xz,x,b,a,c,xz,x,x,b,a";
$answer = preg_match('/([A-z]{2,*}[\s]{1})([A-z \s]*)[\1]*/', $haystack );
echo $answer; // print the first occurrence of the repeating series of two or more
?>
I just need to find and echo out the first occurrence of a repeating series of two or more values. Is there a way to use a backreference recursively, or some better method?
edit: code vomit removed.

'/\b(\w+,\w+),(?:.*,)?\1\b/' should work. It'd match any sequence of two items, any amount of other stuff, and then the same sequence again.
Catch is, it will likely find the first sequence that has a duplicate, not the sequence that has the first duplicate, due to how regexes work. (The match that starts earliest, wins.) For example, if you have 'a,b,c,d,c,d,a,b,c', $matches[1] would probably be 'a,b', even though 'c,d' would match earlier.
To find the first duplicate, you'd have to be able to match that and have a backreference to it in a lookbehind assertion. If that's even legal (which i doubt it is), it'd have to be fixed width before PHP would let it happen.
Edit:
Although, now that i think about it...if you reversed the string and then used '/.*\b(\w+,\w+),(?:.*?,)??\1\b/' on that, it might work. That dances around the constraint i'd mentioned; with the string reversed, the duplicate comes before the original, so now we can match the duplicate and then refer to it "later".
The .* at the beginning of the expression grabs as much as it can, so the match will start as close to the end of the reversed string (and therefore, as close to the beginning of the original string) as possible. And the extra ?s make their corresponding bits lazy, so they match as little as necessary. Of course, once you find the match in the reversed string, you'll need to reverse it in order to get the match in the original string.
And of course, this could break all to hell in the presence of UTF-8. (Then again, most regexes would.) If you're just dealing with ASCII, though, it should work.
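The "earliest start wins" behavior of the first pattern can be checked with a short sketch. It is written in Python rather than PHP (my assumption, to keep the example self-contained); the pattern is the one from the start of this answer, and the helper name first_repeated_pair is hypothetical.

```python
import re

# Two comma-separated items, optionally other items, then the same two again.
REPEAT_RE = re.compile(r"\b(\w+,\w+),(?:.*,)?\1\b")

def first_repeated_pair(text):
    """Return the pair captured by the leftmost match, or None."""
    m = REPEAT_RE.search(text)
    return m.group(1) if m else None

print(first_repeated_pair("c,b,a,xz,x,b,a,c,xz,x,x,b,a"))
print(first_repeated_pair("a,b,c,d,c,d,a,b,c"))
```

On the second string the result is the pair that starts earliest, a,b, even though the repetition of c,d completes sooner, which is the catch described above.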

Not a PHP expert, but I would think you could use this regex
~\b([a-zA-Z0-9]{2,})\b(?=.*\b\1\b)~ in a while loop.
In the body, you could track the results in a hash array (if php has that),
to print out unique series and positions. Capture buffer 1 has the series.

What regex will capitalize any letters following whitespace?

I'm looking for a Perl regex that will capitalize any character which is preceded by whitespace (or the first char in the string).
I'm pretty sure there is a simple way to do this, but I don't have my Perl book handy and I don't do this often enough that I've memorized it...

s/(\s\w)/\U$1\E/g;
I originally suggested:
s/\s\w/\U$&\E/g;
but alarm bells were going off at the use of '$&' (even before I read @Manni's comment). It turns out that they're fully justified: using the $&, $` and $' variables causes an overall inefficiency in regexes.
The \E is not critical for this regex; it turns off the 'case-setting' switch \U in this case or \L for lower-case.
As noted in the comments, matching the first character of the string requires:
s/((?:^|\s)\w)/\U$1\E/g;
Corrected position of second close parenthesis - thanks, Blixtor.

Depending on your exact problem, this could be more complicated than you think and a simple regex might not work. Have you thought about capitalization inside the word? What if the word starts with punctuation like '...Word'? Are there any exceptions? What about international characters?
It might be better to use a CPAN module like Text::Autoformat or Text::Capitalize where these problems have already been solved.
use Text::Capitalize 0.2;
print capitalize_title($t), "\n";
use Text::Autoformat;
print autoformat{case => "highlight", right=>length($t)}, $t;
It sounds like Text::Autoformat might be more "standard" and I would try that first. It's written by Damian. But Text::Capitalize does a few things that Text::Autoformat doesn't. Here is a comparison.
You can also check out the Perl Cookbook for recipe 1.14 (page 31) on how to use regexps to properly capitalize a title or headline.

Something like this should do the trick -
s!(^|\s)(\w)!$1\U$2!g
This simply splits up the scanned expression into two matches - $1 for the blank/start of string and $2 for the first character of word. We then substitute both $1 and $2 after making the start of the word upper-case.
I would change the \s to \b which makes more sense since we are checking for word-boundaries here.

This isn't something I'd normally use a regex for, but my solution isn't exactly what you would call "beautiful":
$string = join("", map(ucfirst, split(/(\s+)/, $string)));
That split()s the string by whitespace and captures all the whitespace, then goes through each element of the list and does ucfirst on them (making the first character uppercase), then join()s them back together as a single string. Not awful, but perhaps you'll like a regex more. I personally just don't like \Q or \U or other semi-awkward regex constructs.
EDIT: Someone else mentioned that punctuation might be a potential issue. If, say, you want this:
...string
changed to this:
...String
i.e. you want words capitalized even if there is punctuation before them, try something more like this:
$string = join("", map(ucfirst, split(/(\w+)/, $string)));
Same thing, but it split()s on words (\w+) so that the captured elements of the list are word-only. Same overall effect, but will capitalize words that may not start with a word character. Change \w to [a-zA-Z] to eliminate trying to capitalize numbers. And just generally tweak it however you like.
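For comparison, the same split / ucfirst / join idea can be sketched in Python (an assumption; the answer is Perl). Python has no ucfirst, so a small hypothetical helper stands in for it, and re.split with a capture group keeps the separators just as Perl's split does.

```python
import re

def ucfirst(piece):
    """Uppercase only the first character, like Perl's ucfirst."""
    return piece[:1].upper() + piece[1:]

def capitalize_after_whitespace(text):
    # Split on whitespace, keeping the whitespace runs as list elements.
    return "".join(ucfirst(p) for p in re.split(r"(\s+)", text))

def capitalize_words(text):
    # Split on words instead, so leading punctuation is its own element.
    return "".join(ucfirst(p) for p in re.split(r"(\w+)", text))

print(capitalize_after_whitespace("mary had  a lamb"))
print(capitalize_words("...string theory"))
```

The first variant leaves ...string untouched because the element starts with a dot; the second capitalizes it because the word is split out on its own.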

If you mean character after space, use regular expressions using \s. If you really mean first character in word you should use \b instead of all above attempts with \s which is error prone.
s/\b(\w)/\U$1/g;

You want to match letters behind whitespace, or at the start of a string.
Perl can't do variable length lookbehind. If it did, you could have used this:
s/(?<=\s|^)(\w)/\u$1/g; # this does not work!
Perl complains:
Variable length lookbehind not implemented in regex;
You can use double negative lookbehind to get around that: the thing on the left of it must not be anything that is not whitespace. That means it'll match at the start of the string, but if there is anything in front of it, it must be whitespace.
s/(?<!\S)(\w)/\u$1/g;
The simpler approach in this exact case will probably be to just match the whitespace; the variable length restriction falls away, then, and include that in the replacement.
s/(\s|^)(\w)/$1\u$2/g;
Occasionally you can't use this approach in repeated substitutions because that what precedes the actual match has already been eaten by the regex, and it's good to have a way around that.
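The double-negative lookbehind carries over to other engines that require fixed-width lookbehind. A minimal Python sketch (my assumption; the answer is Perl) uses the same (?<!\S) assertion with a replacement function in place of \u. The function name capitalize_after_space is hypothetical.

```python
import re

def capitalize_after_space(text):
    """Uppercase each word character that is not preceded by a non-space."""
    # (?<!\S) succeeds at the start of the string and after any whitespace.
    return re.sub(r"(?<!\S)\w", lambda m: m.group(0).upper(), text)

print(capitalize_after_space("mary had\ta little lamb"))
print(capitalize_after_space("...word stays"))
```

Since the whitespace is never consumed, consecutive one-letter words separated by single spaces are all handled in one pass.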

Capitalize ANY character preceded by whitespace or at beginning of string:
s/((?:^|\s).)/\U$1/g
Maybe a very sloppy way of doing it because it's also uppercasing the whitespace now. :P
The advantage is that it works with letters with all possible accents (and also with special Danish/Swedish/Norwegian letters), which are problematic when you use \w and \b in your regex. Can I expect that all non-letters are untouched by the uppercase modifier?
