regular expression to split up searchphrase - c++

I was hoping someone could help me write a regex for C++ that matches words in a search phrase, and explain it bit by bit for learning purposes.
What I need is a regex that matches strings within " " like "Hello you all", and single words that start/end with * like *ack / overfl*.
For the quote part I have \"[^\\s][^\"]*\" but I can't figure out the wildcard (*) part, and how I should combine it with the quote regex.

Try this regular expression:
(?:\*?\w+\*?|"(?:[^\x5C"]+|\x5C(?:\x5C\x5C)*")*")+
For readability I replaced the backslash characters by \x5C.
The expression "(?:[^\x5C"]+|\x5C(?:\x5C\x5C)*")*" will also match "foo \"bar\"" and other proper escaped quote sequences (but only the " might be escaped).
So foo* bar *baz *quux* "foo \"bar\"" should be split into:
foo*
bar
*baz
*quux*
"foo \"bar\""
If you don’t want to match bar in the example above, use this:
(?:\*\w+|\w+\*|"(?:[^\x5C"]+|\x5C(?:\x5C\x5C)*")*")+
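A minimal sketch of the first expression in use, written in JavaScript only because its regex syntax accepts the pattern unchanged (backslashes are written literally instead of as \x5C). The names tokenRe and phrase are made up for the illustration.

```javascript
// First expression from the answer above, with literal backslashes.
const tokenRe = /(?:\*?\w+\*?|"(?:[^\\"]+|\\(?:\\\\)*")*")+/g;

// Sample search phrase from the answer; \\" in the JS literal is \" in the text.
const phrase = 'foo* bar *baz *quux* "foo \\"bar\\""';

// With the g flag, match() returns every full match (or null if there is none).
const tokens = phrase.match(tokenRe) || [];

for (const token of tokens) {
  console.log(token);
}
```

Each token comes out on its own line, with the quoted part kept as one token including its enclosing quotes.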

As long as there is no quote nesting (nesting in general is something regex is bad at):
"(?:(?<=\\)"|[^"])*"|\*[^\s]+|[^\s]+\*
This regex allows for escaped double quotes ('\"'), though, if you need that. And the match includes the enclosing double quotes.
This regex matches:
"A string in quotes, possibly containing \"escaped quotes\""
*a_search_word_beginning_with_a_star
a_search_word_ending_with_a_star*
*a_search_word_enclosed_in_stars*
Be aware that it will break at strings like this:
A broken \"string "with the quotes all \"mangled up\""
If you expect (read: can't entirely rule out the possibility) to get these, please don't use regex, but write a small quote-aware parser. For a one-shot search and replace activity or input in a guaranteed format, the regex is okay to use.
For validating/parsing user input, it is not okay to use. That's where I would recommend a parser. Knowing the difference is the key.
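For what "a small quote-aware parser" could look like, here is a rough JavaScript sketch. The function name parseSearchPhrase, the token shape ({type, value}) and the choice to throw on an unterminated quote are all assumptions for the example, not something the answer prescribes.

```javascript
// Hand-written scanner: splits a search phrase into words and quoted phrases.
// Inside quotes, a backslash escapes the next character.
function parseSearchPhrase(input) {
  const tokens = [];
  let i = 0;
  while (i < input.length) {
    const ch = input[i];
    if (/\s/.test(ch)) {        // skip whitespace between tokens
      i++;
      continue;
    }
    if (ch === '"') {           // quoted phrase
      let value = '';
      let closed = false;
      i++;                      // skip the opening quote
      while (i < input.length) {
        if (input[i] === '\\' && i + 1 < input.length) {
          value += input[i + 1]; // keep the escaped character itself
          i += 2;
        } else if (input[i] === '"') {
          closed = true;
          i++;                  // skip the closing quote
          break;
        } else {
          value += input[i++];
        }
      }
      if (!closed) throw new Error('unterminated quote');
      tokens.push({ type: 'phrase', value });
    } else {                    // bare word, possibly with leading/trailing *
      const start = i;
      while (i < input.length && !/\s/.test(input[i]) && input[i] !== '"') i++;
      tokens.push({ type: 'word', value: input.slice(start, i) });
    }
  }
  return tokens;
}

console.log(parseSearchPhrase('*ack overfl* "Hello \\"you\\" all"'));
```

Unlike the regex, the scanner can report malformed input (here: an unterminated quote) instead of silently matching something else.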

Related

Regex: Exact match string ending with specific character

I'm using Java. So I have a comma separated list of strings in this form:
aa,aab,aac
aab,aa,aac
aab,aac,aa
I want to use regex to remove aa and the trailing ',' if it is not the last string in the list. I need to end up with the following result in all 3 cases:
aab,aac
Currently I am using the following pattern:
"aa[,]?"
However it is returning:
b,c
If lookarounds are available, you can write:
,aa(?![^,])|(?<![^,])aa,
with an empty string as replacement.
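A quick way to check the lookaround version against the three sample lines, sketched in JavaScript (this assumes a runtime with lookbehind support, i.e. ES2018 or later). The pattern itself should carry over to Java's String.replaceAll, though that is not tested here. The variable names are made up.

```javascript
const inputs = ['aa,aab,aac', 'aab,aa,aac', 'aab,aac,aa'];

// Pattern from the question: removes every "aa", even inside "aab" and "aac".
const naive = /aa[,]?/g;

// Lookaround pattern: "aa" must be bounded by a comma or a string edge on both sides.
const exact = /,aa(?![^,])|(?<![^,])aa,/g;

const naiveResults = inputs.map((s) => s.replace(naive, ''));
const exactResults = inputs.map((s) => s.replace(exact, ''));

console.log(naiveResults);
console.log(exactResults);
```

The negated classes inside the lookarounds are what make the string edges work: (?![^,]) succeeds both before a comma and at the end of the string.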
Otherwise, with a POSIX ERE syntax you can do it with a capture:
^(aa(,|$))+|(,aa)+(,|$)
with the 4th group as replacement (so $4 or \4)
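The capture-based variant can be tried the same way. This is only a sketch in JavaScript, where the 4th group is written $4 in the replacement string and an unmatched group expands to an empty string; the variable names are assumptions.

```javascript
// Groups: 1 = (aa(,|$)), 2 = (,|$), 3 = (,aa), 4 = (,|$)
const captureRe = /^(aa(,|$))+|(,aa)+(,|$)/;

const lines = ['aa,aab,aac', 'aab,aa,aac', 'aab,aac,aa', 'aab,aa,aa,aac'];

// Replacing with group 4 keeps the one separator that is still needed
// when "aa" is removed from the middle of the list.
const cleaned = lines.map((s) => s.replace(captureRe, '$4'));

console.log(cleaned);
```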
Without knowing your flavor, I propose this solution in case it supports \b.
I use perl as demo environment and do a replace with "_" for demonstration.
perl -pe "s/\baa,|,aa\b/_/"
\b is the "word boundary" anchor, i.e. any start or end of something looking like a word. It lets you handle line end, line start, blank and comma.
Using it, two alternatives suffice to cover all the cases in your sample input.
Output (with interleaved input, with both, line ending in newline and line ending in blank):
aa,aab,aac
_aab,aac
aab,aa,aac
aab_,aac
aab,aac,aa
aab,aac_
aa,aab,aac
_aab,aac
aab,aa,aac
aab_,aac
aab,aac,aa
aab,aac_
If the \b is unknown in your regex engine, then please state which one you are using, i.e. which tool (e.g. perl, awk, notepad++, sed, ...). Also in that case it might be necessary to do replacing instead of deleting, i.e. to fine tune a "," or "" as replacement. For supporting that, please show the context of your regex, i.e. the replacing mechanism you are using. If you are deleting, then please switch to replacing beforehand.
(I picked up an input from a comment by gisek, that the capturing groups are not needed. I usually use () generously, including in other syntaxes. In my opinion, not having to think about or look up evaluation orders is a benefit in total time and risks taken. But after testing, I use this terser, more elegant way.)
If your regex engine supports positive lookaheads and positive lookbehinds, this should work:
,aa(?=,)|(?<=,)aa,|(,|^)aa(,|$)
You could probably use the following and replace it by nothing :
(aa,|,aa$)
It matches either aa, when it's at the beginning or in the middle of the string,
or ,aa$ when it's at the end of the string.
As you want to delete aa followed by a comma or the end of the line, this should do the trick: ,aa(?=,|$)|^aa,
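To compare the last three suggestions side by side, here is a small JavaScript sketch (lookbehind needs ES2018 or later). The labels lookaround, simple and lookahead, and the extra input 'xaa,aab', are made up for the comparison.

```javascript
const samples = ['aa,aab,aac', 'aab,aa,aac', 'aab,aac,aa'];

const candidates = {
  lookaround: /,aa(?=,)|(?<=,)aa,|(,|^)aa(,|$)/g,
  simple: /(aa,|,aa$)/g,
  lookahead: /,aa(?=,|$)|^aa,/g,
};

// Apply every candidate to every sample line, replacing matches by nothing.
const results = {};
for (const [name, re] of Object.entries(candidates)) {
  results[name] = samples.map((s) => s.replace(re, ''));
}
console.log(results);

// An entry that merely ends in "aa" shows the difference in strictness:
// the unanchored "aa," alternative also matches inside "xaa,".
const tricky = 'xaa,aab';
console.log(tricky.replace(candidates.simple, ''));
console.log(tricky.replace(candidates.lookahead, ''));
```

All three give aab,aac on the sample lines; only the simple pattern touches the extra input.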

What do pipes between brackets mean in a regex?

Reading this vim plugin I see this line:
syntax match tweeDelimiter "[<<|>>|\]\]|\[\[]"
To me, that regex doesn't make much sense when it's surrounded by []. According to this, "POSIX bracket expressions match one character out of a set of characters".
So isn't this matching < or > or [ or ]? I know from context that it's trying to match << or >> or [[ or ]].
That indeed looks like a bug in the plugin. If it wants to match pairs of those characters, it has to use plain regexp branches (\|), not a collection:
<<\|>>\|\]\]\|\[\[
If there were additional stuff to match, above would have to be enclosed in \%(...\) to group it. However, using [...] will match any of the contained characters; Vim just ignores the duplicate ones. As others have commented already, such could be written in shorter form, for example [][<>|].
So, if the plugin indeed mistakenly matches stuff like <> and <[ instead of just << and [[, please inform its author about the bug.
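The difference between a collection and branches is easy to see outside Vim too. A sketch in JavaScript, whose syntax differs from Vim's (alternation is a plain | instead of \|), so treat it as an illustration of the idea and not as Vim code:

```javascript
// Bracket expression: matches ONE character out of < > | [ ]
const asClass = /[<<|>>|\]\]|\[\[]/;

// Alternation: matches one of the four two-character delimiters.
const asBranches = /<<|>>|\]\]|\[\[/;

console.log(asClass.exec('<>')[0]);               // a single character from the set
console.log(asClass.test('|'));                   // the pipe is just another member
console.log(asBranches.test('<>'));               // no doubled delimiter here
console.log(asBranches.exec('a [[link]] b')[0]);  // the whole two-character delimiter
```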

Complex regex single quote replace

I have a set of strings in which I would like to replace single quotes with double quotes. But sometimes the single quote to replace is at the end of the line, and sometimes a single quote should not be replaced because it follows an s as a possessive.
Example :
The song 'Miss you' is featured in The Rolling Stones' album 'Voodoo Lounge'
should be
The song "Miss you" is featured in The Rolling Stones' album "Voodoo Lounge"
Thanks for your help :)
Regular expressions can only deal with raw text. They can't tell context or grammar. So it is pretty much impossible to build a regular expression that will correctly identify which quotes after an s are non-possessive.
However, if you'd like to ignore such cases, and match rest of them, you can use the following regex with lookaround assertions:
(?<!s)'(?!s\b)
Note that this will not match the closing quote in valid cases where the quoted text ends in s, like 'Blurred Lines', 'Dangerous' etc.
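A short JavaScript sketch of that regex applied to the example, plus the limitation just mentioned (lookbehind needs ES2018 or later; the names quoteRe and toDouble are made up):

```javascript
// A quote is replaced unless it directly follows an s (possessive like Stones')
// or is directly followed by a standalone s (as in it's).
const quoteRe = /(?<!s)'(?!s\b)/g;

const toDouble = (text) => text.replace(quoteRe, '"');

const line = "The song 'Miss you' is featured in The Rolling Stones' album 'Voodoo Lounge'";
console.log(toDouble(line));

// Limitation: a quoted title ending in s keeps its closing single quote.
console.log(toDouble("'Blurred Lines'"));
```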

Perl regex with exclamation marks

How do you define/explain this Perl regex:
$para =~ s!//!/!g;
I know the s means search, and g means global (search), but not sure how the exclamation marks ! and extra slashes / fit in (as I thought the pattern would look more like s/abc/def/g).
Perl's regex operators s, m and tr (though the last is not really a regex operator) allow you to use any symbol as your delimiter.
What this means is that you don't have to use /; you could use ! instead, as in your question.
# the regex
s!//!/!g
means search and replace all instances of '//' with '/'
you could write the same thing as
s/\/\//\//g
or
s#//#/#g
or
s{//}{/}g
if you really wanted but as you can see the first one, with all the backslashes, is very hard to understand and much more cumbersome.
More information can be found in the perldoc's perlre
The substitution regex (and other regex operators, like m///) can take any punctuation character as delimiter. This saves you the trouble of escaping meta characters inside the regex.
If you want to replace slashes, it would be awkward to write:
s/\/\//\//g;
Which is why you can write
s!//!/!g;
...instead. See http://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators
And no, s/// is the substitution. m/// is the search, though I do believe the intended mnemonic is "match".
The exclamation marks are the delimiter; perl lets you choose any character you want, within reason. The statement is equivalent to the (much uglier) s/\/\//\//g — that is, it replaces // with /.
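To see the effect of the substitution itself, here is a rough JavaScript equivalent. JavaScript has no alternative delimiters for regex literals, so the slashes have to be escaped there; building the regex from a string avoids that. The sample path is made up.

```javascript
const path = '/usr//local//bin';

// Regex literal: every slash in the pattern has to be escaped.
const withLiteral = path.replace(/\/\//g, '/');

// RegExp constructor: the pattern is a plain string, so no escaping is needed.
const withConstructor = path.replace(new RegExp('//', 'g'), '/');

console.log(withLiteral);
console.log(withConstructor);
```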

replace word using Regex, but not in Quotes in C# [duplicate]

From this q/a, I deduced that matching all instances of a given regex not inside quotes, is impossible. That is, it can't match escaped quotes (ex: "this whole \"match\" should be taken"). If there is a way to do it that I don't know about, that would solve my problem.
If not, however, I'd like to know if there is any efficient alternative that could be used in JavaScript. I've thought about it a bit, but can't come up with any elegant solutions that would work in most, if not all, cases.
Specifically, I just need the alternative to work with .split() and .replace() methods, but if it could be more generalized, that would be the best.
For Example:
An input string of: +bar+baz"not+or\"+or+\"this+"foo+bar+
replacing + with #, not inside quotes, would return: #bar#baz"not+or\"+or+\"this+"foo#bar#
Actually, you can match all instances of a regex not inside quotes for any string where each opening quote is closed again. Say, as in your example above, you want to match \+.
The key observation here is, that a word is outside quotes if there are an even number of quotes following it. This can be modeled as a look-ahead assertion:
\+(?=([^"]*"[^"]*")*[^"]*$)
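A minimal sketch of this first version, which ignores escaped quotes for now. The input is a simplified variant of the one in the question, and the names are made up.

```javascript
// A + is outside quotes if an even number of quotes follows it up to the end.
const outsideQuotes = /\+(?=([^"]*"[^"]*")*[^"]*$)/g;

const simpleInput = '+bar+baz"not+these+"foo+bar+';
const simpleOutput = simpleInput.replace(outsideQuotes, '#');

console.log(simpleOutput);
```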
Now, you'd like to not count escaped quotes. This gets a little more complicated. Instead of [^"]*, which advances to the next quote, you need to consider backslashes as well and use [^"\\]*. After you arrive at either a backslash or a quote, you need to ignore the next character if you encounter a backslash, or else advance to the next unescaped quote. That looks like (\\.|"([^"\\]*\\.)*[^"\\]*"). Combined, you arrive at
\+(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)
I admit it is a little cryptic. =)
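Applied to the input from the question, the combined expression can be checked like this (a sketch; note that \\" in the JavaScript string literal stands for \" in the actual text):

```javascript
// Lookahead that skips escaped characters and whole quoted strings.
const outsideEscaped = /\+(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)/g;

const escapedInput = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';
const escapedOutput = escapedInput.replace(outsideEscaped, '#');

console.log(escapedOutput);
```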
Azmisov, resurrecting this question because you said you were looking for any efficient alternative that could be used in JavaScript and any elegant solutions that would work in most, if not all, cases.
There happens to be a simple, general solution that wasn't mentioned.
Compared with alternatives, the regex for this solution is amazingly simple:
"[^"]+"|(\+)
The idea is that we match but ignore anything within quotes to neutralize that content (on the left side of the alternation). On the right side, we capture all the + that were not neutralized into Group 1, and the replace function examines Group 1. Here is full working code:
<script>
var subject = '+bar+baz"not+these+"foo+bar+';
var regex = /"[^"]+"|(\+)/g;
// Quoted strings match without setting Group 1, so they are returned unchanged.
var replaced = subject.replace(regex, function(m, group1) {
    if (!group1) return m;
    else return "#";
});
document.write(replaced);
</script>
You can use the same principle to match or split. See the question and article in the reference, which will also point you to code samples.
Hope this gives you a different idea of a very general way to do this. :)
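For splitting, one possible way to apply the same principle is sketched below: walk over all matches, skip the ones where Group 1 is unset, and cut the string at the others. The function name and the use of matchAll are choices made for this example, not part of the answer.

```javascript
// Split on + signs that are not inside double quotes.
function splitOnPlusOutsideQuotes(subject) {
  const regex = /"[^"]*"|(\+)/g;
  const parts = [];
  let last = 0;
  for (const m of subject.matchAll(regex)) {
    if (m[1] === undefined) continue;      // a quoted string: ignore it
    parts.push(subject.slice(last, m.index));
    last = m.index + m[0].length;          // continue after the delimiter
  }
  parts.push(subject.slice(last));         // the remainder after the last delimiter
  return parts;
}

console.log(splitOnPlusOutsideQuotes('a+b"c+d"+e'));
```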
What about Empty Strings?
The above is a general answer to showcase the technique. It can be tweaked depending on your exact needs. If you worry that your text might contain empty strings, just change the quantifier inside the string-capture expression from + to *:
"[^"]*"|(\+)
What about Escaped Quotes?
Again, the above is a general answer to showcase the technique. Not only can the "ignore this match" regex be refined to your needs, you can also add multiple expressions to ignore. For instance, if you want to make sure escaped quotes are adequately ignored, you can start by adding an alternation \\"| in front of the other two in order to match (and ignore) straggling escaped double quotes.
Next, within the section "[^"]*" that captures the content of double-quoted strings, you can add an alternation to ensure escaped double quotes are matched before their " has a chance to turn into a closing sentinel, turning it into "(?:\\"|[^"])*"
The resulting expression has three branches:
\\" to match and ignore
"(?:\\"|[^"])*" to match and ignore
(\+) to match, capture and handle
Note that in other regex flavors, we could do this job more easily with lookbehind, but JS didn't support it when this was written (lookbehind was only added to the language in ES2018).
The full regex becomes:
\\"|"(?:\\"|[^"])*"|(\+)
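Plugged into the replace callback, the three-branch regex can be tried on the input from the question. This is only a sketch; the variable names are made up, and \\" in the string literal stands for \" in the text.

```javascript
// Branches 1 and 2 match and ignore; branch 3 captures the + to replace.
const threeBranch = /\\"|"(?:\\"|[^"])*"|(\+)/g;

const questionInput = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';

const questionOutput = questionInput.replace(threeBranch, (m, group1) =>
  group1 === undefined ? m : '#'
);

console.log(questionOutput);
```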
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
You can do it in three steps.
Use a regex global replace to extract all string body contents into a side-table.
Do your comma translation
Use a regex global replace to swap the string bodies back
Code below
// Step 1
var sideTable = [];
myString = myString.replace(
    /"(?:[^"\\]|\\.)*"/g,
    function (_) {
      var index = sideTable.length;
      sideTable[index] = _;
      return '"' + index + '"';
    });
// Step 2, replace commas with newlines
myString = myString.replace(/,/g, "\n");
// Step 3, swap the string bodies back
myString = myString.replace(/"(\d+)"/g,
    function (_, index) {
      return sideTable[index];
    });
If you run that after setting
myString = '{:a "ab,cd, efg", :b "ab,def, egf,", :c "Conjecture"}';
you should get
{:a "ab,cd, efg"
:b "ab,def, egf,"
:c "Conjecture"}
It works, because after step 1,
myString = '{:a "0", :b "1", :c "2"}'
sideTable = ['"ab,cd, efg"', '"ab,def, egf,"', '"Conjecture"'];
so the only commas in myString are outside strings. Step 2, then turns commas into newlines:
myString = '{:a "0"\n :b "1"\n :c "2"}'
Finally we replace the strings that only contain numbers with their original content.
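The three steps can also be wrapped into one helper, sketched below. The name replaceOutsideStrings and its parameters are made up. It assumes the replacement never produces text of the form "123" itself, since that would collide with the placeholders.

```javascript
// Replace `pattern` with `replacement`, but only outside double-quoted strings.
function replaceOutsideStrings(text, pattern, replacement) {
  const sideTable = [];
  // Step 1: stash every quoted string (quotes included) and leave "index" behind.
  const stashed = text.replace(/"(?:[^"\\]|\\.)*"/g, (literal) => {
    sideTable.push(literal);
    return '"' + (sideTable.length - 1) + '"';
  });
  // Step 2: do the actual replacement; only unquoted text is left to touch.
  const replaced = stashed.replace(pattern, replacement);
  // Step 3: swap the stashed strings back in.
  return replaced.replace(/"(\d+)"/g, (_, index) => sideTable[Number(index)]);
}

console.log(replaceOutsideStrings('+bar+baz"not+or\\"+or+\\"this+"foo+bar+', /\+/g, '#'));
```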
Although the answer by zx81 seems to be the best performing and cleanest one, it needs these fixes to correctly catch the escaped quotes:
var subject = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';
and
var regex = /"(?:[^"\\]|\\.)*"|(\+)/g;
Also use the already mentioned "group1 === undefined" or "!group1" check.
The second fix (the regex) seems especially important to actually take everything asked in the original question into account.
It should be mentioned though that this method implicitly requires the string to not have escaped quotes outside of unescaped quote pairs.
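A small sketch of that caveat, using a made-up input that has an escaped quote outside of any quote pair. It compares the two-branch regex from this answer with the three-branch one that also ignores straggling \" sequences.

```javascript
const twoBranch = /"(?:[^"\\]|\\.)*"|(\+)/g;
const threeBranches = /\\"|"(?:\\"|[^"])*"|(\+)/g;

// Replace only what landed in Group 1; everything else is returned as matched.
const swap = (m, group1) => (group1 === undefined ? m : '#');

// In the actual text: a\"+b"c+d"  (the first quote is escaped and opens nothing)
const stray = 'a\\"+b"c+d"';

const twoBranchOutput = stray.replace(twoBranch, swap);
const threeBranchOutput = stray.replace(threeBranches, swap);

console.log(twoBranchOutput);   // the escaped quote is mistaken for an opening quote
console.log(threeBranchOutput); // the escaped quote is skipped first
```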