I am working on a C++ code base that was recently moved from X/Motif to Qt. I am trying to write a Perl script that will replace all occurrences of Boolean (from X) with bool. The script just does a simple replacement.
s/\bBoolean\b/bool/g
There are a few conditions.
1) We have CORBA in our code and \b matches CORBA::Boolean which should not be changed.
2) It should not match if it was found as a string (i.e. "Boolean")
Updated:
For #1, I used lookbehind
s/(?<!:)\bBoolean\b/bool/g;
For #2, I used lookahead.
s/(?<!:)\bBoolean\b(?!")/bool/g
This will most likely work for my situation but how about the following improvements?
3) Do not match if in the middle of a string (thanks nohat).
4) Do not match if in a comment. (// or /**/)
s/[^:]\bBoolean\b(?!")/bool/g
This does not match when Boolean is at the beginning of the line, because [^:] means "match a character that is not :", so some character has to precede Boolean. That character is also consumed, so the substitution replaces it along with Boolean.
Watch out with that quote-matching lookahead assertion. It only rejects a match when Boolean is the last part of a string, not when it is in the middle of one. To be sure you're not inside a string, you need to check that an even number of quote marks precedes the match (assuming no multi-line strings and no escaped embedded quote marks).
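The even-quote check described above can be sketched in Python. This is only an illustration under the same assumptions (one line at a time, no escaped quotes, no multi-line strings); the function name `replace_outside_strings` is made up for this example.

```python
import re

# Boolean as a whole word, not preceded by ':' (so CORBA::Boolean is skipped).
BOOLEAN = re.compile(r'(?<!:)\bBoolean\b')

def replace_outside_strings(line):
    """Replace Boolean with bool unless an odd number of quotes precedes it."""
    def substitute(match):
        # An odd count of '"' before the match means we are inside a string.
        quotes_before = line.count('"', 0, match.start())
        return match.group(0) if quotes_before % 2 else 'bool'
    return BOOLEAN.sub(substitute, line)
```

Counting quotes in a callback sidesteps the need for a variable-length lookbehind, which many regex engines do not support.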
s/[^:]\bBoolean\b[^"]/bool/g
Edit: Rats, beaten again. +1 for beating me, good sir.
#define Boolean bool
Let the preprocessor take care of this. Every time you see a Boolean you can either manually fix it or hope a regex doesn't make a mistake. Depending on how many macros you use, you could dump the output of cpp.
To fix condition 1 try:
s/[^:]\bBoolean\b(?!")/bool/g
The [^:] says to match any character other than ":".
3) Do not match if in the middle of a string (thanks nohat).
You could perhaps write a regex to check for ".*Boolean.*". But what if you have an escaped quote (\") inside the string? Then you have more work to do to handle the \" pattern.
4) Do not match if in a comment. (// or /* */)
For '//', you can have a regex that excludes //.* But a better option could be to first match the whole line with a regex that splits off the // comment, ((.*)(//.*)), and then apply the replacement only to $1 (the first captured group).
For /* */, it is more complex, as this is a multi-line pattern. One approach is to first run your whole code through a regex that matches multi-line comments and then keep only the parts that do not match, something like (.*)(/\*.*\*/)(.*). But the actual regex would be even more complex, as you would have not one but many multi-line comments.
Now, what if you have /* or */ inside a // comment? (I don't know why you would have it, but Murphy's law says that you can.) There is obviously some way out, but my point is to emphasize how bad-looking the regex will become.
My suggestion here would be to use some lexical tool for C++ and replace the token Boolean with bool. Your thoughts?
In order to avoid writing a full C parser in Perl, you're trying to strike a balance. Depending on how much needs changing, I would be inclined to do something like a very restrictive s/// and then anything that still matches /Boolean/ gets written to an exception file for human decision making. That way you're not trying to parse C strings, multi-line comments, conditionally compiled-out text, etc. that could be present.
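A rough Python sketch of that workflow follows. The "very restrictive" rule chosen here (Boolean as the first token on a line, followed by an identifier) and the name `convert` are assumptions for illustration only; the answer does not say which restrictive pattern to use.

```python
import re

# Hypothetical restrictive rule: "Boolean" starting a declaration at the
# beginning of a line, followed by whitespace and an identifier character.
DECLARATION = re.compile(r'^(\s*)Boolean(\s+\w)')

def convert(lines):
    """Return (converted_lines, exceptions) for a list of source lines.

    exceptions holds (line_number, original_line) for every line that still
    contains "Boolean" after the restrictive substitution, for human review.
    """
    converted, exceptions = [], []
    for number, line in enumerate(lines, start=1):
        new_line = DECLARATION.sub(r'\1bool\2', line)
        if 'Boolean' in new_line:
            exceptions.append((number, line))
        converted.append(new_line)
    return converted, exceptions
```

The caller could write the `exceptions` list to a file and work through it by hand.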
…
…
Do not match if in the middle of a string (thanks nohat).
Do not match if in a comment. (// or /**/)
No can do with a simple regex. For that, you need to actually look at every single character left-to-right and decide what kind of thing it is, at least well enough to tell apart comments from multi-line comments from strings from other stuff, and then you need to see if the “other stuff” part contains things you want to change.
Now, I don’t know the exact syntactical rules for comments and strings in C++ so the following is going to be imprecise and completely undebugged, but it’ll give you an idea of the complexity you’re up against.
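my $line_comment = qr! (?> // .* \n? ) !x;
# Unrolled loop: "/*", then runs of non-stars and stars until a star run is followed by "/".
my $multiline_comment = qr! (?> /\* [^*]* \*+ (?: [^/*] [^*]* \*+ )* / ) !x;
my $string = qr! (?> " [^"\\]* (?: \\ . [^"\\]* )* " ) !x;
# Braces as delimiters here, because the pattern itself contains a "!".
my $boolean_type = qr{ (?<!:) \b Boolean \b }x;
$code =~ s{ \G (
    $line_comment
  | $multiline_comment
  | $string
  | ( $boolean_type )
  | .
) }{
    defined $2 ? 'bool' : $1
}gexs; # /s lets the final "." consume newlines; the qr// objects keep their own flags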
Please don’t ask me to explain this in all its intricacies, it would take me a day and another. Just buy and read Jeff Friedl’s Mastering Regular Expressions if you want to understand exactly what is going on here.
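For comparison, the same left-to-right idea can be sketched in Python. This is an illustrative approximation, not a C++ lexer: it assumes plain `//` and `/* */` comments and double-quoted strings with backslash escapes, and it ignores character literals, raw strings and preprocessor tricks. The names `TOKEN` and `replace_boolean` are invented for the example.

```python
import re

# One alternation that recognises comments and strings first, so that a
# Boolean inside them is swallowed as part of the larger token.
TOKEN = re.compile(r'''
      //[^\n]*                          # line comment
    | /\*.*?\*/                         # block comment (non-greedy, spans lines)
    | "[^"\\\n]*(?:\\.[^"\\\n]*)*"      # string literal with backslash escapes
    | (?P<boolean>(?<!:)\bBoolean\b)    # the token to change
''', re.VERBOSE | re.DOTALL)

def replace_boolean(code):
    """Replace Boolean with bool outside comments and string literals."""
    def substitute(match):
        # Only the last alternative sets the "boolean" group.
        return 'bool' if match.group('boolean') else match.group(0)
    return TOKEN.sub(substitute, code)
```

Because `re.sub` always takes the leftmost match, a comment or string that starts before a Boolean is consumed whole, and the Boolean inside it is never considered on its own.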
The "'Boolean' in the middle of a string" part sounds a bit unlikely, I'd check first if there is any occurrence of it in the code with something like
m/"[^"]*Boolean[^"]*"/
And if there is none or a few, just ignore that case.
Related
Problem:
I have thousands of documents which contain a specific character I don't want, e.g. the character a. These documents contain a variety of characters, but the a's I want to replace are inside double quotes or single quotes.
I would like to find and replace them, and I thought using Regex would be needed. I am using VSCode, but I'm open to any suggestions.
My attempt:
I was able to find the following regex to match for a specific string containing the values inside the ().
".*?(r).*?"
However, this only highlights the entire quote. I want to highlight the character only.
Any solution, perhaps outside of regex, is welcome.
Example outcomes:
Given, the character is a, find replace to b
Somebody once told me "apples" are good for you => Somebody once told me "bpples" are good for you
"Aardvarks" make good kebabs => "Abrdvbrks" make good kebabs
The boy said "aaah!" when his mom told him he was eating aardvark => The boy said "bbbh!" when his mom told him he was eating aardvark
Visual Studio Code
VS Code uses the JavaScript regex engine for its find / replace functionality. This means you are quite limited compared with other flavors such as .NET or PCRE.
Luckily, this flavor supports lookaheads, which let you look for characters without consuming them. So one way to ensure that we are within a quoted string is to require that the number of quotes between the matched a and the end of the file / subject string is odd:
a(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$)
This looks for a's in a double-quoted string; to handle single-quoted strings, substitute all "s with '. You can't do both at the same time.
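The odd-number-of-quotes lookahead is not specific to JavaScript. A minimal Python sketch, using the examples from the question, might look like this (the function name and the `b` replacement are just for illustration; it assumes quotes are balanced and unescaped):

```python
import re

# An 'a' followed by an odd number of double quotes before the end of the text.
A_INSIDE_QUOTES = re.compile(r'a(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$)')

def replace_in_double_quotes(text, replacement='b'):
    """Replace every 'a' that sits inside a double-quoted string."""
    return A_INSIDE_QUOTES.sub(replacement, text)
```

Each candidate a rescans the rest of the text, so this is quadratic in the worst case, which is one reason it is a poor fit for very large files.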
There is a problem with the regex above, however: it breaks on escaped double quotes within double-quoted strings. To handle them too, if it matters, you have a long way to go:
a(?=[^"\\]*(?:\\.[^"\\]*)*"[^"\\]*(?:\\.[^"\\]*)*(?:"[^"\\]*(?:\\.[^"\\]*)*"[^"\\]*(?:\\.[^"\\]*)*)*$)
Applying these approaches to large files will probably result in a stack overflow, so let's see a better approach.
I am using VSCode, but I'm open to any suggestions.
That's great. Then I'd suggest to use awk or sed or something more programmatic in order to achieve what you are after or if you are able to use Sublime Text a chance exists to work around this problem in a more elegant way.
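As one example of the "more programmatic" route, here is a small Python sketch (Python is an assumption here; the answer only mentions awk and sed). It lets a regex find each quoted string and does the character replacement with ordinary string code. It assumes quotes are balanced and not escaped, and it will be confused by apostrophes in running text.

```python
import re

# A double-quoted or a single-quoted run without embedded quotes of the same kind.
QUOTED = re.compile(r'"[^"]*"|\'[^\']*\'')

def replace_in_quotes(text, old='a', new='b'):
    """Replace old with new, but only inside quoted strings."""
    return QUOTED.sub(lambda m: m.group(0).replace(old, new), text)
```

Since each quoted string is visited exactly once, this runs in linear time regardless of how many matching characters a quote contains.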
Sublime Text
This is supposed to work on large files with hundreds of thousands of lines, but note that it works for a single character (here a); with some modifications it may work for a word or substring too:
Search for:
(?:"|\G(?<!")(?!\A))(?<r>[^a"\\]*+(?>\\.[^a"\\]*)*+)\K(a|"(*SKIP)(*F))(?(?=((?&r)"))\3)
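(The three a's in the pattern, two inside the character classes and one literal, are the character being searched for.)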
Replace it with: WHATEVER\3
RegEx Breakdown:
(?: # Beginning of non-capturing group #1
" # Match a `"`
| # Or
\G(?<!")(?!\A) # Continue matching from last successful match
# It shouldn't start right after a `"`
) # End of NCG #1
(?<r> # Start of capturing group `r`
[^a"\\]*+ # Match anything except `a`, `"` or a backslash (possessively)
(?>\\.[^a"\\]*)*+ # Match an escaped character or
# repeat last pattern as much as possible
)\K # End of CG `r`, reset all consumed characters
( # Start of CG #2
a # Match literal `a`
| # Or
"(*SKIP)(*F) # Match a `"` and skip over current match
)
(?(?= # Start a conditional cluster, assuming a positive lookahead
((?&r)") # Start of CG #3, recurs CG `r` and match `"`
) # End of condition
\3 # If conditional passed match CG #3
) # End of conditional
Three-step approach
Last but not least...
Matching a character inside quotation marks is tricky since delimiters are exactly the same so opening and closing marks can not be distinguished from each other without taking a look at adjacent strings. What you can do is change a delimiter to something else so that you can look for it later.
Step 1:
Search for: "[^"\\]*(?:\\.[^"\\]*)*"
Replace with: $0Я
Step 2:
Search for: a(?=[^"\\]*(?:\\.[^"\\]*)*"Я)
Replace with whatever you expect.
Step 3:
Search for: "Я
Replace with: " (this removes the Я marker and restores the closing quote, reverting everything else).
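The three steps can be chained in a few lines of Python. This is a sketch under the assumption that the marker character Я does not occur in the text; the function name `three_step_replace` is invented for the example.

```python
import re

MARK = 'Я'  # assumed not to appear anywhere in the input

def three_step_replace(text, old='a', new='b'):
    """Mark closing quotes, replace inside marked strings, remove the marks."""
    # Step 1: append the marker after every complete double-quoted string.
    quoted = r'"[^"\\]*(?:\\.[^"\\]*)*"'
    marked = re.sub(quoted, lambda m: m.group(0) + MARK, text)
    # Step 2: replace old only when a marked closing quote lies ahead
    # with no other quote in between.
    inside = re.escape(old) + r'(?=[^"\\]*(?:\\.[^"\\]*)*"' + MARK + ')'
    swapped = re.sub(inside, lambda m: new, marked)
    # Step 3: drop the marker, keeping the closing quote.
    return swapped.replace('"' + MARK, '"')
```

The marker makes opening and closing quotes distinguishable, so step 2 needs only a short lookahead instead of scanning to the end of the file.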
/(["'])(.*?)(a)(.*?\1)/g
With the replace pattern:
$1$2$4
As far as I'm aware, VS Code uses the same regex engine as JavaScript, which is why I've written my example in JS.
The problem with this is that if you have multiple a's in one set of quotes, a single pass removes only one of them. So there needs to be some code behind it (or you, hammering the replace button until no more matches are found) to repeat the pattern and get rid of all the a's between quotes.
let regex = /(["'])(.*?)(a)(.*?\1)/g,
subst = `$1$2$4`,
str = `"a"
"helapke"
Not matched - aaaaaaa
"This is the way the world ends"
"Not with fire"
"ABBA"
"abba",
'I can haz cheezburger'
"This is not a match'
`;
// Loop to get rid of multiple a's in quotes
while(str.match(regex)){
str = str.replace(regex, subst);
}
const result = str;
console.log(result);
First, a few considerations:
1) There could be multiple a characters within a single quote.
2) Each quote (using single or double quotation marks) consists of an opening quote character, some text and the same closing quote character. A simple approach is to assume that when the quote characters are counted sequentially, the odd ones are opening quotes and the even ones are closing quotes.
3) Following point 2, it could be worth some further thought on whether single-quoted strings should be allowed. See the following example: It's a shame 'this quoted text' isn't quoted. Here, the simple approach would think there were two quoted strings: s a shame and isn. Another: This isn't a quote ...'this is' and 'it's unclear where this quote ends'. I've avoided attempting to tackle these complexities and gone with the simple approach below.
The bad news is that point 1 presents a bit of a problem, as a capturing group with a wildcard repeat character after it (e.g. (.*)*) will only capture the last captured "thing". But the good news is there's a way of getting around this within certain limits. Many regex engines will allow up to 99 capturing groups (*). So if we can make the assumption that there will be no more than 99 as in each quote (UPDATE ...or even if we can't - see step 3), we can do the following...
(*) Unfortunately my first port of call, Notepad++ doesn't - it only allows up to 9. Not sure about VS Code. But regex101 (used for the online demos below) does.
TL;DR - What to do?
Search for: "([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*"
Replace with: "\1\2\3\4\5\6\7\8\9\10\11\12\13\14\15\16\17\18\19\20\21\22\23\24\25\26\27\28\29\30\31\32\33\34\35\36\37\38\39\40\41\42\43\44\45\46\47\48\49\50\51\52\53\54\55\56\57\58\59\60\61\62\63\64\65\66\67\68\69\70\71\72\73\74\75\76\77\78\79\80\81\82\83\84\85\86\87\88\89\90\91\92\93\94\95\96\97\98\99"
(Optionally keep repeating the previous two steps if there's a possibility of more than 99 such characters in a single quote, until they've all been replaced.)
Repeat step 1 but replacing all " with ' in the regular expression, i.e: '([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*'
Repeat steps 2-3.
Online demos
Please see the following regex101 demos, which could actually be used to perform the replacements if you're able to copy the whole text into the contents of "TEST STRING":
Demo for double quotes
Demo for single quotes.
If you can use Visual Studio (instead of Visual Studio Code), it is written in C++ and C# and uses the .NET Framework regular expressions, which means you can use variable length lookbehinds to accomplish this.
(?<="[^"\n]*)a(?=[^"\n]*")
Adding some more logic to the above regular expression, we can tell it to ignore any locations where there are an even amount of " preceding it. This prevents matches for a outside of quotes. Take, for example, the string "a" a "a". Only the first and last a in this string will be matched, but the one in the middle will be ignored.
(?<!^[^"\n]*(?:(?:"[^"\n]*){2})+)(?<="[^"\n]*)a(?=[^"\n]*")
Now the only problem is this will break if we have escaped " within two double quotes such as "a\"" a "a". We need to add more logic to prevent this behaviour. Luckily, this beautiful answer exists for properly matching escaped ". Adding this logic to the regex above, we get the following:
(?<!^[^"\n]*(?:(?:"(?:[^"\\\n]|\\.)*){2})+)(?<="[^"\n]*)a(?=[^"\n]*")
I'm not sure which method works best with your strings, but I'll explain this last regex in detail as it also explains the two previous ones.
(?<!^[^"\n]*(?:(?:"(?:[^"\\\n]|\\.)*){2})+) Negative lookbehind ensuring what precedes doesn't match the following
^ Assert position at the start of the line
[^"\n]* Match anything except " or \n any number of times
(?:(?:"(?:[^"\\\n]|\\.)*){2})+ Match the following one or more times. This ensures if there are any " preceding the match that they are balanced in the sense that there is an opening and closing double quote.
(?:"(?:[^"\\\n]|\\.)*){2} Match the following exactly twice
" Match this literally
(?:[^"\\\n]|\\.)* Match either of the following any number of times
[^"\\\n] Match anything except ", \ and \n
\\. Matches \ followed by any character
(?<="[^"\n]*) Positive lookbehind ensuring what precedes matches the following
" Match this literally
[^"\n]* Match anything except " or \n any number of times
a Match this literally
(?=[^"\n]*") Positive lookahead ensuring what follows matches the following
[^"\n]* Match anything except " or \n any number of times
" Match this literally
You can drop the \n from the above pattern as the following suggests. I added it just in case there's some sort of special cases I'm not considering (i.e. comments) that could break this regex within your text. The \A also forces the regex to match from the start of the string (or file) instead of the start of the line.
(?<!\A[^"]*(?:(?:"(?:[^"\\]|\\.)*){2})+)(?<="[^"]*)a(?=[^"]*")
I am using VSCode, but I'm open to any suggestions.
If you want to stay in an Editor environment, you could use
Visual Studio (>= 2012) or even notepad++ for quick fixup.
This avoids having to use a spurious script environment.
Both of these engines (.NET and Boost, respectively) support the \G construct,
which starts the next match at the position where the last one left off.
Again, this is just a suggestion.
This regex doesn't check the validity of balanced quotes within the entire
string ahead of time (but it could with the addition of a single line).
It is all about knowing where the inside and outside of quotes are.
I've commented the regex, but if you need more info let me know.
Again this is just a suggestion (I know your editor uses ECMAScript).
Find (?s)(?:^([^"]*(?:"[^"a]*(?=")"[^"]*(?="))*"[^"a]*)|(?!^)\G)a([^"a]*(?:(?=a.*?")|(?:"[^"]*$|"[^"]*(?=")(?:"[^"a]*(?=")"[^"]*(?="))*"[^"a]*)))
Replace $1b$2
That's all there is to it.
https://regex101.com/r/loLFYH/1
Comments
(?s) # Dot-all inline modifier
(?:
^ # BOS
( # (1 start), Find first quote from BOS (written back)
[^"]*
(?: # --- Cluster
" [^"a]* # Inside quotes with no 'a'
(?= " )
" [^"]* # Between quotes, get up to next quote
(?= " )
)* # --- End cluster, 0 to many times
" [^"a]* # Inside quotes, will be an 'a' ahead of here
# to be sucked up by this match
) # (1 end)
| # OR,
(?! ^ ) # Not-BOS
\G # Continue where left off from last match.
# Must be an 'a' at this point
)
a # The 'a' to be replaced
( # (2 start), Up to the next 'a' (to be written back)
[^"a]*
(?: # --------------------
(?= a .*? " ) # If stopped before 'a', must be a quote ahead
| # or,
(?: # --------------------
" [^"]* $ # If stopped at a quote, check for EOS
| # or,
" [^"]* # Between quotes, get up to next quote
(?= " )
(?: # --- Cluster
" [^"a]* # Inside quotes with no 'a'
(?= " )
" [^"]* # Between quotes
(?= " )
)* # --- End cluster, 0 to many times
" [^"a]* # Inside quotes, will be an 'a' ahead of here
# to be sucked up on the next match
) # --------------------
) # --------------------
) # (2 end)
"Inside double quotes" is rather tricky, because there are may complicating scenarios to consider to fully automate this.
What are your precise rules for "enclosed by quotes"? Do you need to consider multi-line quotes? Do you have quoted strings containing escaped quotes or quotes used other than starting/ending string quotation?
However there may be a fairly simple expression to do much of what you want.
Search expression: ("[^a"]*)a
Replacement expression: $1b
This doesn't consider inside or outside of quotes; you have to do that visually. But it highlights text from the quote to the matching character, so you can quickly decide if this is inside or not.
If you can live with the visual inspection, then we can build up this pattern to include different quote types and upper and lower case.
I wrote a Perl function to replace job name in JCL script. Zero-width match was used here.
sub modify_jcl_jobname ()
{
my ($jcl, $old, $new) = @_;
$jcl =~ s/
# The name must begin in column 3.
^(?<=\/\/)
# The first character must be alphabetic or national.
($old)
# The name must be followed by at least one blank.
# Append JCL keyword JOB
(?=\s+JOB)
/$new/xmig; # Multi-lines, ignore case.
return $jcl;
}
But this function didn't work until I did a simple modification that just deleted the leading sign "^".
#before ^(?<=\/\/)
#after (?<=\/\/)
So I'd like to understand the cause of the problem. Any reply would be appreciated. Thanks.
The problem lies with
^(?<=\/\/)
That pattern will only match if the position where ^ matched is preceded by the two characters //. That can never happen, since /^/m matches only at the start of the string and right after a newline.
But you don't want to start matching at the start of the line. You want to start matching 2 characters in. What you want is actually:
(?<=^\/\/)
After doing some improvements, the code looks like:
sub modify_jcl_jobname {
my ($jcl, $old, $new) = @_;
$jcl =~ s{
(?<= ^// )
\Q$old\E
(?= \s+ JOB )
}{$new}xmig;
return $jcl;
}
Improvements:
Removed the incorrect prototype (()). It forced the caller to tell Perl to ignore the prototype (by using &).
Added code (\Q...\E) to convert the contents of $old into a regex pattern before using it as such.
Removed the needless capture ((...)).
Switched the delimiters of the substitution (from s/// to s{}{}) to require less escaping.
Removed highly redundant comments. (Good comments explain why something is being done rather than what is being done.)
The optimiser might handle this version better:
$jcl =~ s{
^// \K
\Q$old\E
(?= \s+ JOB )
}{$new}xmig;
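The difference between the two placements of ^ can also be checked outside Perl. The following Python sketch mirrors the lookbehind version; `re.escape` plays the role of \Q...\E, and the sample JCL in the usage note is made up for illustration.

```python
import re

def modify_jcl_jobname(jcl, old, new):
    """Rename a job: old must follow '//' at line start and precede ' JOB'."""
    pattern = re.compile(
        r'(?<=^//)' + re.escape(old) + r'(?=\s+JOB)',
        re.MULTILINE | re.IGNORECASE,
    )
    # A callable replacement keeps any backslashes in new literal.
    return pattern.sub(lambda m: new, jcl)

# With ^ outside the lookbehind, the pattern asks for "//" before the
# start of a line, which cannot exist, so it never matches.
BROKEN = re.compile(r'^(?<=//)MYJOB', re.MULTILINE)
```

For example, `modify_jcl_jobname("//MYJOB   JOB (ACCT)\n", "MYJOB", "NEWJOB")` renames the job, while `BROKEN.search("//MYJOB   JOB (ACCT)\n")` finds nothing.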
The ^ sign matches the beginning of the line. You then want something preceded by two slashes - where should these slashes go if the next character is the very first character of the line?
s{^//
($old)
...
}{//$new}xmig
should work: you need no look behind.
Update: Thanks to ikegami, I now see why you used it. You want to keep the // in the string: well, you can repeat them in the substitution, or move the ^ character into the look-behind.
I've seen examples of finding the absence of characters in a regular expression, I'm trying to find the absence of words in a regular expression (likely using a negative lookbehind).
I have lines of code like this:
Example One:
protected static readonly string BACKGROUND_MUSIC_NAME = "Music_Mission_Complete_Loop_audio";
And here's another one:
mainWindow.Id = "MainWindow";
Final one:
mainStoLabel.Text = "#stb_entry_clah";
I want to capture only the middle one by finding all strings like these that a.) aren't preceded by a "#" in the actual string between the quotes, and b.) aren't preceded at all by the word "readonly".
My current Regular Expression is this:
.*\W\=\W"[^#].*"
It captures the top two examples. Now I just want to exclude the top example. How do I capture the absence of whole words (not characters)?
Thanks.
The bug in your negation lookahead assertion is that you didn’t put it together right to suit the general case. You need to make its assertion apply to every character position as you crawl ahead. It only applies to one possible dot the way you’ve written it, whereas you need it to apply to all of them. See below for how you must do this to do it correctly.
Here is a working demo that shows two different approaches:
The first uses a negative lookahead to ensure that the left-hand portion not contain readonly and the right-hand portion not start with a number sign.
The second does a simpler parser, then separately inspects the left- and right-hand sides for the individual constraints that apply to each.
The demo language is Perl, but the same patterns and logic should work virtually everywhere.
#!/usr/bin/perl
while (<DATA>) {
    chomp;
    #
    # First demo: use a complicated regex to get desired part only
    #
    my($label) = m{
        ^                           # start at the beginning
        (?:                         # noncapture group:
            (?! \b readonly \b )    # no "readonly" here
            .                       # now advance one character
        ) +                         # repeated 1 or more times
        \s* = \s*                   # skip an equals sign w/optional spaces
        " ( [^#"] [^"]* ) "         # capture #1: quote-delimited text
                                    # BUT whose first char isn't a "#"
    }x;
    if (defined $label) {
        print "Demo One: found label <$label> at line $.\n";
    }
    #
    # Second demo: This time use simpler patterns, several
    #
    my($lhs, $rhs) = m{
        ^                           # from the start of line
        ( [^=]+ )                   # capture #1: 1 or more non-equals chars
        \s* = \s*                   # skip an equals sign w/optional spaces
        " ( [^"]+ ) "               # capture #2: all quote-delimited text
    }x;
    unless ($lhs =~ /\b readonly \b/x || $rhs =~ /^#/) {
        print "Demo Two: found label <$rhs> at line $.\n";
    }
}
__END__
protected static readonly string BACKGROUND_MUSIC_NAME = "Music_Mission_Complete_Loop_audio";
mainWindow.Id = "MainWindow";
mainStoLabel.Text = "#stb_entry_clah";
I have two bits of advice. The first is to make very sure you ALWAYS use /x mode so you can produce documented and maintainable regexes. The second is that it is much cleaner doing things a bit at a time as in the second solution rather than all at once as in the first.
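Since the patterns are meant to be portable, here is a hedged Python rendering of both demos. The function names are invented for this sketch, and each function returns the label or None instead of printing.

```python
import re

# Demo one: a single regex in which every consumed character is first
# checked to make sure "readonly" does not start there.
TEMPERED = re.compile(r'^(?:(?!\breadonly\b).)+\s*=\s*"([^#"][^"]*)"')

# Demo two: a simple split into left- and right-hand side, checked separately.
ASSIGNMENT = re.compile(r'^([^=]+)\s*=\s*"([^"]+)"')

def label_one_regex(line):
    match = TEMPERED.match(line)
    return match.group(1) if match else None

def label_two_steps(line):
    match = ASSIGNMENT.match(line)
    if match is None:
        return None
    lhs, rhs = match.groups()
    if re.search(r'\breadonly\b', lhs) or rhs.startswith('#'):
        return None
    return rhs
```

On the three sample lines from the question, both functions return a label only for the `mainWindow.Id` assignment.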
I don't understand your question completely, but a negative lookahead would look like this:
(?!.*readonly)(?:.*\s\=\s"[^#].*")
The first part will match if there is not the word "readonly" in the string.
Which language are you using?
What do you want to match? Only the second example, did I understand this correctly?
^[^"=]*(?<!(^|\s)readonly\s.*)\s*=\s*"[^#].*" seems to fit your needs:
everything before the first equal sign should not contain readonly or quotes
readonly is recognized not with word boundaries but with whitespace (except at beginning of line)
the equal sign can be surrounded by arbitrary whitespace
the equal sign must be followed by a quoted string
the quoted string should not start with #
You can work with lookarounds or capture groups if you only want the strings or quoted strings.
Note: as per your own regex, this discards anything after the last quote (not matching the semi-colon in your examples)
You absolutely need to specify the language. The negative lookahead/lookbehind is the thing you need.
Look at this site for an inventory of how to do that in Delphi, GNU (Linux), Groovy, Java, JavaScript, .NET, PCRE (C/C++), Perl, PHP, POSIX, PowerShell, Python, R, REALbasic, Ruby, Tcl, VBScript, Visual Basic 6, wxWidgets, XML Schema, XQuery & XPath
Currently I use this reg ex:
"\bI([ ]{1,2})([a-zA-Z]|\d){2,13}\b"
It was just brought to my attention that the text that I use this against could contain a "\" (backslash). How do I add this to the expression?
Add |\\ inside the group, after the \d for instance.
This expression could be simplified if you're also allowing the underscore character in the second capture register, and you are willing to use metacharacters. That changes this:
([a-zA-Z]|\d){2,13}
into this ...
([\w]{2,13})
and you can also add a test for the backslash character with this ...
([\w\x5c]{2,13})
which makes the regex just a tad easier to eyeball, depending on your personal preference.
"\bI([\x20]{1,2})([\w\x5c]{2,13})\b"
See also:
WP Metacharacter
Metacharacters
Shorthand character class
Both @slavy13 and @dreftymac give you the basic solution with pointers, but...
You can use \d inside a character class to mean a digit.
You don't need to put blank into a character class to match it (except, perhaps, for clarity, though that is debatable).
You can use [:alpha:] inside a character class to mean an alpha character, [:digit:] to mean a digit, and [:alnum:] to mean an alphanumeric (specifically not including underscore, unlike \w). Note that these character classes might mean more characters than you expect; think of accented characters and non-arabic digits, especially in Unicode.
If you want to capture the whole of the information after the space, you need the repetition inside the capturing parentheses.
Contrast the behaviour of these two one-liners:
perl -n -e 'print "$2\n" if m/\bI( {1,2})([a-zA-Z\d\\]){2,13}\b/'
perl -n -e 'print "$2\n" if m/\bI( {1,2})([a-zA-Z\d\\]{2,13})\b/'
Given the input line "I a123", the first prints "3" and the second prints "a123". Obviously, if all you wanted was the last character of the second part of the string, then the original expression is fine. However, that is unlikely to be the requirement. (Obviously, if you're only interested in the whole lot, then using '$&' gives you the matched text, but it has negative efficiency implications.)
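The same contrast can be reproduced in Python, whose `re` module also keeps only the last iteration of a repeated capturing group. The helper name `second_group` is made up for this sketch.

```python
import re

# Quantifier outside the parentheses: group 2 holds only the last character.
LAST_ONLY = re.compile(r'\bI( {1,2})([a-zA-Z\d\\]){2,13}\b')
# Quantifier inside the parentheses: group 2 holds the whole run.
WHOLE_RUN = re.compile(r'\bI( {1,2})([a-zA-Z\d\\]{2,13})\b')

def second_group(pattern, text):
    """Return capture group 2 of the first match, or None if no match."""
    match = pattern.search(text)
    return match.group(2) if match else None
```

With the input `I a123`, `second_group(LAST_ONLY, ...)` gives `3` and `second_group(WHOLE_RUN, ...)` gives `a123`, matching the Perl one-liners.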
I'd probably use this regex as it seems clearest to me:
m/\bI( {1,2})([[:alnum:]\\]{2,13})\b/
Time for the obligatory plug: read Jeff Friedl's "Mastering Regular Expressions".
As I pointed out in my comment on slavy's post, once \\ is allowed the trailing \b no longer works, because a backslash is not a word character. So my suggestion is
/\bI([ ]{1,2})([\p{IsAlnum}\\]{2,13})(?:[^\w\\]|$)/
I assumed that you wanted to capture the whole 2-13 characters, not just the first one that applies, so I adjusted my RE.
You can make the last group a lookahead if the engine supports it and you don't want to consume that character. That would look like:
/\bI([ ]{1,2})([\p{IsAlnum}\\]{2,13})(?=[^\w\\]|$)/
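A small Python experiment shows why the trailing \b is a problem once backslashes are allowed. Python's `re` has no \p{IsAlnum}, so this sketch substitutes the ASCII class [a-zA-Z\d]; that substitution and the helper name `captured` are assumptions of the example.

```python
import re

# Trailing \b: cannot match between a backslash and a space, so the
# engine backtracks and drops the final backslash from the capture.
WORD_BOUNDARY = re.compile(r'\bI( {1,2})([a-zA-Z\d\\]{2,13})\b')
# Lookahead for "not a word character or backslash, or end of input".
LOOKAHEAD = re.compile(r'\bI( {1,2})([a-zA-Z\d\\]{2,13})(?=[^\w\\]|$)')

def captured(pattern, text):
    """Return capture group 2 of the first match, or None if no match."""
    match = pattern.search(text)
    return match.group(2) if match else None
```

For the text `I ab\ next`, the \b version captures `ab` while the lookahead version captures `ab\`.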
I have a value like this:
"Foo Bar" "Another Value" something else
What regex will return the values enclosed in the quotation marks (e.g. Foo Bar and Another Value)?
In general, the following regular expression fragment is what you are looking for:
"(.*?)"
This uses the non-greedy *? operator to capture everything up to but not including the next double quote. Then, you use a language-specific mechanism to extract the matched text.
In Python, you could do:
>>> import re
>>> string = '"Foo Bar" "Another Value"'
>>> print(re.findall(r'"(.*?)"', string))
['Foo Bar', 'Another Value']
I've been using the following with great success:
(["'])(?:(?=(\\?))\2.)*?\1
It supports nested quotes as well.
For those who want a deeper explanation of how this works, here's an explanation from user ephemient:
([""']) match a quote; ((?=(\\?))\2.) if backslash exists, gobble it, and whether or not that happens, match a character; *? match many times (non-greedily, as to not eat the closing quote); \1 match the same quote that was used for opening.
I would go for:
"([^"]*)"
The [^"] is regex for any character except '"'
The reason I use this over the non greedy many operator is that I have to keep looking that up just to make sure I get it correct.
Let's see two efficient ways that deal with escaped quotes. These patterns are designed to be efficient rather than concise or pretty.
These ways use first-character discrimination to quickly find quotes in the string without the cost of an alternation. (The idea is to quickly discard characters that are not quotes without testing both branches of the alternation.)
Content between quotes is described with an unrolled loop (instead of a repeated alternation) to be more efficient too: [^"\\]*(?:\\.[^"\\]*)*
Obviously, to deal with strings that don't have balanced quotes, you can use possessive quantifiers instead: [^"\\]*+(?:\\.[^"\\]*)*+ or a workaround to emulate them, to prevent too much backtracking. You can also decide that a quoted part runs from an opening quote to the next (non-escaped) quote or the end of the string. In this case there is no need to use possessive quantifiers; you only need to make the last quote optional.
Notice: sometimes quotes are not escaped with a backslash but by repeating the quote. In this case the content subpattern looks like this: [^"]*(?:""[^"]*)*
The patterns avoid the use of a capture group and a backreference (I mean something like (["']).....\1) and use a simple alternation, but with ["'] factored out at the beginning.
Perl like:
["'](?:(?<=")[^"\\]*(?s:\\.[^"\\]*)*"|(?<=')[^'\\]*(?s:\\.[^'\\]*)*')
(note that (?s:...) is a syntactic sugar to switch on the dotall/singleline mode inside the non-capturing group. If this syntax is not supported you can easily switch this mode on for all the pattern or replace the dot with [\s\S])
(The way this pattern is written is totally "hand-driven" and doesn't take account of eventual engine internal optimizations)
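In Python, the same unrolled-loop content subpattern could be used as sketched below. Unlike the patterns in this answer, the sketch wraps the content in a capture group purely so that `findall` returns the text without its quotes, and it handles double quotes only; both are choices made for this example.

```python
import re

# Opening quote, unrolled loop over the content, closing quote.
# re.DOTALL lets "\\." also match a backslash followed by a newline.
QUOTED = re.compile(r'"([^"\\]*(?:\\.[^"\\]*)*)"', re.DOTALL)

def quoted_values(text):
    """Return the contents of every double-quoted string in text."""
    return QUOTED.findall(text)
```

Applied to the question's sample value, this yields the two strings Foo Bar and Another Value; escaped quotes stay inside the captured content.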
ECMA script:
(?=["'])(?:"[^"\\]*(?:\\[\s\S][^"\\]*)*"|'[^'\\]*(?:\\[\s\S][^'\\]*)*')
POSIX extended:
"[^"\\]*(\\(.|\n)[^"\\]*)*"|'[^'\\]*(\\(.|\n)[^'\\]*)*'
or simply:
"([^"\\]|\\.|\\\n)*"|'([^'\\]|\\.|\\\n)*'
Peculiarly, none of these answers produce a regex where the returned match is the text inside the quotes, which is what is asked for. MA-Madden tries but only gets the inside match as a captured group rather than the whole match. One way to actually do it would be :
(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)
Examples for this can be seen in this demo https://regex101.com/r/Hbj8aP/1
The key here is the positive lookbehind at the start (the ?<= ) and the positive lookahead at the end (the ?=). The lookbehind checks the character before the current position for a quote and, if one is found, the match starts from there; the lookahead checks the character ahead for a quote and, if one is found, the match stops just before it. The lookbehind group (the ["']) is wrapped in parentheses to capture whichever quote was found at the start; this is then used in the final lookahead (?=\1) to make sure the match only stops at the corresponding quote.
The only other complication is that, because the lookahead doesn't actually consume the end quote, that quote is found again by the starting lookbehind, which causes the text between an ending quote and the next starting quote on the same line to be matched. Putting a word boundary on the opening quote (["']\b) helps with this, though ideally I'd like to move past the lookahead, and I don't think that is possible. The bit allowing escaped characters in the middle I've taken directly from Adam's answer.
The RegEx of the accepted answer returns the values including their surrounding quotation marks: "Foo Bar" and "Another Value" as matches.
Here are RegExes which return only the values between the quotation marks (as the questioner was asking for):
Double quotes only (use value of capture group #1):
"(.*?[^\\])"
Single quotes only (use value of capture group #1):
'(.*?[^\\])'
Both (use value of capture group #2):
(["'])(.*?[^\\])\1
-
All support escaped and nested quotes.
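A quick way to check these three patterns is the following Python sketch. The sample text and variable names are assumptions for the demo. Note that [^\\] needs at least one character between the quotes, so an empty "" is not handled cleanly.

```python
import re

# Hypothetical sample with a plain string, a string containing escaped
# quotes, and a single-quoted string.
text = r'''a "Foo Bar" b "say \"hi\"" c 'Another Value' d'''

# Double quotes only: the value is capture group #1
double_only = re.findall(r'"(.*?[^\\])"', text)

# Single quotes only: the value is capture group #1
single_only = re.findall(r"'(.*?[^\\])'", text)

# Both quote types: the value is capture group #2
both = [m.group(2) for m in re.finditer(r'''(["'])(.*?[^\\])\1''', text)]

print(double_only)
print(single_only)
print(both)
```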
I liked Eugen Mihailescu's solution to match the content between quotes whilst allowing to escape quotes. However, I discovered some problems with escaping and came up with the following regex to fix them:
(['"])(?:(?!\1|\\).|\\.)*\1
It does the trick and is still pretty simple and easy to maintain.
Demo (with some more test-cases; feel free to use it and expand on it).
PS: If you just want the content between quotes in the full match ($0), and are not afraid of the performance penalty, use:
(?<=(['"])\b)(?:(?!\1|\\).|\\.)*(?=\1)
Unfortunately, without the quotes as anchors, I had to add a boundary \b which does not play well with spaces and non-word boundary characters after the starting quote.
Alternatively, modify the initial version by simply adding a group and extract the string from $2:
(['"])((?:(?!\1|\\).|\\.)*)\1
PPS: If your focus is solely on efficiency, go with Casimir et Hippolyte's solution; it's a good one.
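Here is a minimal Python sketch of the grouped variant, where the content is read from group 2. The sample text is invented; it includes an escaped double quote, an escaped apostrophe and an empty string.

```python
import re

# Group 1 is the opening quote, group 2 is the content between the quotes.
PATTERN = re.compile(r'''(['"])((?:(?!\1|\\).|\\.)*)\1''')

# Hypothetical sample input
text = r'''a "x \" y" b 'it\'s' c ""'''

contents = [m.group(2) for m in PATTERN.finditer(text)]
print(contents)
```

Unlike the patterns that require a trailing non-backslash character, this one also returns an empty string for "".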
A very late answer, but I'd like to add one:
(\"[\w\s]+\")
http://regex101.com/r/cB0kB8/1
The pattern (["'])(?:(?=(\\?))\2.)*?\1 above does the job, but I am concerned about its performance (it's not bad but could be better). Mine below is ~20% faster.
The pattern "(.*?)" is just incomplete. My advice for everyone reading this is just DON'T USE IT!!!
For instance, it cannot capture many strings (I can provide an exhaustive test case if needed), like the one below:
$string = 'How are you? I\'m fine, thank you';
The rest of them are just as "good" as the one above.
If you really care both about performance and precision then start with the one below:
/(['"])((\\\1|.)*?)\1/gm
In my tests it covered every string I met but if you find something that doesn't work I would gladly update it for you.
Check my pattern in an online regex tester.
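To make the difference concrete, here is a small Python sketch that runs both patterns on the sample string from above. The /gm flags are dropped because re.search only looks for the first match, and the variable names are just for the demo.

```python
import re

# The sample line from above, kept as raw text (the backslash is literal).
sample = r"$string = 'How are you? I\'m fine, thank you';"

# Naive pattern, single-quote flavour: stops at the escaped quote.
naive = re.search(r"'(.*?)'", sample).group(1)

# Pattern from this answer: group 2 holds the content.
robust = re.search(r"""(['"])((\\\1|.)*?)\1""", sample).group(2)

print(naive)
print(robust)
```

The naive pattern cuts the string off right after the backslash, while the second one keeps going until the real closing quote.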
This version accounts for escaped quotes and controls backtracking:
/(["'])((?:(?!\1)[^\\]|(?:\\\\)*\\[^\\])*)\1/
MORE ANSWERS! Here is the solution I used:
\"([^\"]*?icon[^\"]*?)\"
TL;DR:
Replace the word icon with what you're looking for inside the quotes, and voilà!
The way this works is that it looks for the keyword and doesn't care what else is between the quotes.
EG:
id="fb-icon"
id="icon-close"
id="large-icon-close"
The regex looks for a quote mark "
then it looks for any run of characters that are not "
until it finds icon
then any run of characters that are not "
and finally it looks for a closing "
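The steps above can be tried with a short Python sketch. KEYWORD and the sample markup are assumptions for the demo, and re.escape is only there in case the keyword contains regex metacharacters.

```python
import re

KEYWORD = "icon"  # hypothetical keyword to look for inside the quotes

# Quote, non-quote characters, the keyword, non-quote characters, quote
pattern = re.compile(r'"([^"]*?' + re.escape(KEYWORD) + r'[^"]*?)"')

# Hypothetical markup; the "button" value should be skipped
html = 'id="fb-icon" class="button" id="icon-close" id="large-icon-close"'

hits = pattern.findall(html)
print(hits)  # ['fb-icon', 'icon-close', 'large-icon-close']
```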
I liked Axeman's more expansive version, but had some trouble with it. For example, it didn't match
foo "string \\ string" bar
or
foo "string1" bar "string2"
correctly, so I tried to fix it:
# opening quote
(["'])
(
# repeat (non-greedy, so we don't span multiple strings)
(?:
# anything, except not the opening quote, and not
# a backslash, which are handled separately.
(?!\1)[^\\]
|
# consume any double backslash (unnecessary?)
(?:\\\\)*
|
# Allow backslash to escape characters
\\.
)*?
)
# same character as opening quote
\1
import re

string = "\" foo bar\" \"loloo\""
print(re.findall(r'"(.*?)"', string))
Just try this out; it works like a charm!
\ is the escape character, so \" puts a literal quote inside the Python string.
My solution to this is below
(["']).*\1(?![^\s])
Demo link : https://regex101.com/r/jlhQhV/1
Explanation:
(["']) -> Matches either ' or " and stores it in capture group 1, so it can be referenced later as \1.
.* -> Greedily matches as many characters as possible; the regex engine then backtracks from the end of the line until the rest of the pattern (a closing ' or ") can match.
\1 -> Matches the same quote character that was captured by the first group.
(?![^\s]) -> Negative lookahead that ensures there is no non-space character right after the closing quote.
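To see what the greedy .* means in practice, here is a small Python sketch with two made-up input lines. With a single quoted value the match is that value; with two quoted values on the same line, the match runs from the first opening quote to the last closing quote.

```python
import re

PATTERN = re.compile(r'''(["']).*\1(?![^\s])''')

# One quoted value on the line: the match is exactly that value.
one_value = PATTERN.search('x "Foo Bar" y').group(0)

# Two quoted values on the line: greedy .* spans both of them.
two_values = PATTERN.search('say "Foo Bar" and "Another Value"').group(0)

print(one_value)
print(two_values)
```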
Unlike Adam's answer, I have a simple one that works:
(["'])(?:\\\1|.)*?\1
And just add parentheses if you want to get the content inside the quotes, like this:
(["'])((?:\\\1|.)*?)\1
Then $1 matches quote char and $2 matches content string.
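In Python the two groups come back as tuples from re.findall. A minimal sketch, with an invented sample sentence:

```python
import re

# Group 1 is the quote character, group 2 is the content string.
PATTERN = re.compile(r'''(["'])((?:\\\1|.)*?)\1''')

# Hypothetical sample with escaped quotes of both kinds
text = r'''He said "a \"quoted\" word" and 'it\'s' fine'''

pairs = PATTERN.findall(text)
for quote_char, content in pairs:
    print(quote_char, content)
```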
All the answers above are good... except that they DO NOT support all Unicode characters in ECMAScript (JavaScript)!
If you are a Node user, you might want this modified version of the accepted answer, which supports all Unicode characters:
/(?<=((?<=[\s,.:;"']|^)["']))(?:(?=(\\?))\2.)*?(?=\1)/gmu
Try here.
echo 'junk "Foo Bar" not empty one "" this "but this" and this neither' | sed 's/[^\"]*\"\([^\"]*\)\"[^\"]*/>\1</g'
This will result in: >Foo Bar<><>but this<
Here I wrapped each extracted string in > and < for clarity. Because [^"]* cannot cross a quote, the pattern behaves like a non-greedy match: the sed command first throws out the junk before and after each pair of quotes, then replaces the whole match with the part between the quotes, surrounded by > and <.
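If sed is not at hand, the same substitution can be sketched in Python with re.sub. This is only a translation of the command above: the basic-regex escapes \( and \) become plain parentheses.

```python
import re

line = 'junk "Foo Bar" not empty one "" this "but this" and this neither'

# Junk before the quotes, the quoted part (group 1), junk after the quotes
result = re.sub(r'[^"]*"([^"]*)"[^"]*', r'>\1<', line)

print(result)  # >Foo Bar<><>but this<
```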
From Greg H. I was able to create this regex to suit my needs.
I needed to match a specific value that was qualified by being inside quotes. It must be a full match; no partial match should trigger a hit,
e.g. "test" must not match "test2".
import re

reg = r"""(['"])(%s)\1"""
# needle and haystack are assumed to be defined by the caller
if re.search(reg % needle, haystack, re.IGNORECASE):
    print("winning...")
Hunter
If you're trying to find strings that only have a certain suffix, such as dot syntax, you can try this:
\"([^\"]*?[^\"]*?)\".localized
Where .localized is the suffix.
Example:
print("this is something I need to return".localized + "so is this".localized + "but this is not")
It will capture "this is something I need to return".localized and "so is this".localized but not "but this is not".
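A minimal Python sketch of the same idea, using the example line from above. Two small liberties are taken here: the dot before localized is escaped so it only matches a literal dot, and the doubled [^"]*? is collapsed into a single [^"]*.

```python
import re

source = ('print("this is something I need to return".localized + '
          '"so is this".localized + "but this is not")')

# Quoted content that is immediately followed by the .localized suffix
localized = re.findall(r'"([^"]*)"\.localized', source)

print(localized)
```

Only the group is returned here, so the results are the strings without their quotes or the suffix.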
A supplementary answer for Microsoft VBA coders only: one can use the library Microsoft VBScript Regular Expressions 5.5, which gives the following code:
Sub TestRegularExpression()
    Dim oRE As VBScript_RegExp_55.RegExp    '* Tools->References: Microsoft VBScript Regular Expressions 5.5
    Set oRE = New VBScript_RegExp_55.RegExp
    oRE.Pattern = """([^""]*)"""
    oRE.Global = True

    Dim sTest As String
    sTest = """Foo Bar"" ""Another Value"" something else"
    Debug.Assert oRE.test(sTest)

    Dim oMatchCol As VBScript_RegExp_55.MatchCollection
    Set oMatchCol = oRE.Execute(sTest)
    Debug.Assert oMatchCol.Count = 2

    Dim oMatch As Match
    For Each oMatch In oMatchCol
        Debug.Print oMatch.SubMatches(0)
    Next oMatch
End Sub