Regex: Match a pattern within quoted texts - regex

Text is
lemma A:
"
abx K() bc
"
// comment lemma B
lemma B:
"
abx bc sdsf
"
lemma C:
"
abfdfx K() bc
"
lemma D:
"
abxsf bc
"
I want to find the lemmas which contain K() inside its following quoted text. I have tried Perl regex (?s)^[ ]*lemma.*?"(?!").*?K\( but it overlaps two lemmas. The output should be: lemma A: "..." and lemma C: "...".

If the double quotes are at the start of the string, you can match a newline and then the double quote.
Then match any char except the double quote until you match K(
^[ ]*lemma\b.*\R"[^"]*K\(
^ Start of string
[ ]*lemma\b Match optional spaces and lemma
.*\R Match the rest of the line and a newline
"[^"]* Match " followed by optional chars other than "
K\( Match K(
Regex demo

You could use:
(?s)^[ ]*lemma[^"]*"[^"]*?K\(
[^"] means "any character but ""
See a demo here

Related

Regex to match a pattern within quotes

I'm trying to match and substitute a pattern.
Test String: {1-Emp Name: "John", "2-Emp pat" : 1123,"3-Emp lwd" : 20}, "4-Emp Pat" : 1234}
I'm trying to match the pattern with the word "pat" from the test string and substitute
Expected Result: {1-Emp Name: "John", "matched Pattern" : 1123,"3-Emp lwd" : 20}, "matched Pattern" : 1234}
My regex: ".+?(?i)Pat.+?(?=:)
You can use
Regex pattern: (?i)"[^"]* Pat\b[^"]*("\s*:)
Replacement pattern: "matched pattern$1
See the regex demo. Details:
(?i) - case insensitive inline modifier
" - a " char
[^"]* - zero or more chars other than "
Pat - space + Pat word
\b - word boundary
[^"]* - zero or more chars other than "
("\s*:) - Group 1 ($1): ", zero or more whitespaces, :.

Regex for matching C++ string constant

I'm currently working on a C++ preprocessor and I need to match string constants with more than 0 letters like this "hey I'm a string.
I'm currently working with this one here \"([^\\\"]+|\\.)+\" but it fails on one of my test cases.
Test cases:
std::cout << "hello" << " world";
std::cout << "He said: \"bananas\"" << "...";
std::cout << "";
std::cout << "\x12\23\x34";
Expected output:
std::cout << String("hello") << String(" world");
std::cout << String("He said: \"bananas\"") << String("...");
std::cout << "";
std::cout << String("\x12\23\x34");
On the second one I instead get
std::cout << String("He said: \")bananas\"String(" << ")...";
Short repro code (using the regex by AR.3):
std::string in_line = "std::cout << \"He said: \\\"bananas\\\"\" << \"...\";";
std::regex r("\"([^\"]+|\\.|(?<=\\\\)\")+\"");
in_line = std::regex_replace(in_line, r, "String($&)");
Lexing a source file is a good job for regexes. But for such a task, let's use a better regex engine than std::regex. Let's use PCRE (or boost::regex) at first. At the end of this post, I'll show what you can do with a less feature-packed engine.
We only need to do partial lexing, ignoring all unrecognized tokens that won't affect string literals. What we need to handle is:
Singleline comments
Multiline comments
Character literals
String literals
We'll be using the extended (x) option, which ignores whitespace in the pattern.
Comments
Here's what [lex.comment] says:
The characters /* start a comment, which terminates with the characters */. These comments do not nest.
The characters // start a comment, which terminates immediately before the next new-line character. If
there is a form-feed or a vertical-tab character in such a comment, only white-space characters shall appear
between it and the new-line that terminates the comment; no diagnostic is required. [ Note: The comment
characters //, /*, and */ have no special meaning within a // comment and are treated just like other
characters. Similarly, the comment characters // and /* have no special meaning within a /* comment.
— end note ]
# singleline comment
// .* (*SKIP)(*FAIL)
# multiline comment
| /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)
Easy peasy. If you match anything there, just (*SKIP)(*FAIL) - meaning that you throw away the match. The (?s: .*? ) applies the s (singleline) modifier to the . metacharacter, meaning it's allowed to match newlines.
Character literals
Here's the grammar from [lex.ccon]:
character-literal:
encoding-prefix(opt) ’ c-char-sequence ’
encoding-prefix:
one of u8 u U L
c-char-sequence:
c-char
c-char-sequence c-char
c-char:
any member of the source character set except the single-quote ’, backslash \, or new-line character
escape-sequence
universal-character-name
escape-sequence:
simple-escape-sequence
octal-escape-sequence
hexadecimal-escape-sequence
simple-escape-sequence: one of \’ \" \? \\ \a \b \f \n \r \t \v
octal-escape-sequence:
\ octal-digit
\ octal-digit octal-digit
\ octal-digit octal-digit octal-digit
hexadecimal-escape-sequence:
\x hexadecimal-digit
hexadecimal-escape-sequence hexadecimal-digit
Let's define a few things first, which we'll need later on:
(?(DEFINE)
(?<prefix> (?:u8?|U|L)? )
(?<escape> \\ (?:
['"?\\abfnrtv] # simple escape
| [0-7]{1,3} # octal escape
| x [0-9a-fA-F]{1,2} # hex escape
| u [0-9a-fA-F]{4} # universal character name
| U [0-9a-fA-F]{8} # universal character name
))
)
prefix is defined as an optional u8, u, U or L
escape is defined as per the standard, except that I've merged universal-character-name into it for the sake of simplicity
Once we have these, a character literal is pretty simple:
(?&prefix) ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)
We throw it away with (*SKIP)(*FAIL)
Simple strings
They're defined in almost the same way as character literals. Here's a part of [lex.string]:
string-literal:
encoding-prefix(opt) " s-char-sequence(opt) "
encoding-prefix(opt) R raw-string
s-char-sequence:
s-char
s-char-sequence s-char
s-char:
any member of the source character set except the double-quote ", backslash \, or new-line character
escape-sequence
universal-character-name
This will mirror the character literals:
(?&prefix) " (?> (?&escape) | [^"\\\r\n]+ )* "
The differences are:
The character sequence is optional this time (* instead of +)
The double quote is disallowed when unescaped instead of the single quote
We actually don't throw it away :)
Raw strings
Here's the raw string part:
raw-string:
" d-char-sequence(opt) ( r-char-sequence(opt) ) d-char-sequence(opt) "
r-char-sequence:
r-char
r-char-sequence r-char
r-char:
any member of the source character set, except a right parenthesis )
followed by the initial d-char-sequence (which may be empty) followed by a double quote ".
d-char-sequence:
d-char
d-char-sequence d-char
d-char:
any member of the basic source character set except:
space, the left parenthesis (, the right parenthesis ), the backslash \,
and the control characters representing horizontal tab,
vertical tab, form feed, and newline.
The regex for this is:
(?&prefix) R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
[^ ()\\\t\x0B\r\n]* is the set of characters that are allowed in delimiters (d-char)
\k<delimiter> refers to the previously matched delimiter
The full pattern
The full pattern is:
(?(DEFINE)
(?<prefix> (?:u8?|U|L)? )
(?<escape> \\ (?:
['"?\\abfnrtv] # simple escape
| [0-7]{1,3} # octal escape
| x [0-9a-fA-F]{1,2} # hex escape
| u [0-9a-fA-F]{4} # universal character name
| U [0-9a-fA-F]{8} # universal character name
))
)
# singleline comment
// .* (*SKIP)(*FAIL)
# multiline comment
| /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)
# character literal
| (?&prefix) ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)
# standard string
| (?&prefix) " (?> (?&escape) | [^"\\\r\n]+ )* "
# raw string
| (?&prefix) R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
See the demo here.
boost::regex
Here's a simple demo program using boost::regex:
#include <string>
#include <iostream>
#include <boost/regex.hpp>
static void test()
{
boost::regex re(R"regex(
(?(DEFINE)
(?<prefix> (?:u8?|U|L) )
(?<escape> \\ (?:
['"?\\abfnrtv] # simple escape
| [0-7]{1,3} # octal escape
| x [0-9a-fA-F]{1,2} # hex escape
| u [0-9a-fA-F]{4} # universal character name
| U [0-9a-fA-F]{8} # universal character name
))
)
# singleline comment
// .* (*SKIP)(*FAIL)
# multiline comment
| /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)
# character literal
| (?&prefix)? ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)
# standard string
| (?&prefix)? " (?> (?&escape) | [^"\\\r\n]+ )* "
# raw string
| (?&prefix)? R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
)regex", boost::regex::perl | boost::regex::no_mod_s | boost::regex::mod_x | boost::regex::optimize);
std::string subject(R"subject(
std::cout << L"hello" << " world";
std::cout << "He said: \"bananas\"" << "...";
std::cout << "";
std::cout << "\x12\23\x34";
std::cout << u8R"hello(this"is\a\""""single\\(valid)"
raw string literal)hello";
"" // empty string
'"' // character literal
// this is "a string literal" in a comment
/* this is
"also inside"
//a comment */
// and this /*
"is not in a comment"
// */
"this is a /* string */ with nested // comments"
)subject");
std::cout << boost::regex_replace(subject, re, "String\\($&\\)", boost::format_all) << std::endl;
}
int main(int argc, char **argv)
{
try
{
test();
}
catch(std::exception ex)
{
std::cerr << ex.what() << std::endl;
}
return 0;
}
(I left syntax highlighting disabled because it goes nuts on this code)
For some reason, I had to take the ? quantifier out of prefix (change (?<prefix> (?:u8?|U|L)? ) to (?<prefix> (?:u8?|U|L) ) and (?&prefix) to (?&prefix)?) to make the pattern work. I believe it's a bug in boost::regex, as both PCRE and Perl work just fine with the original pattern.
What if we don't have a fancy regex engine at hand?
Note that while this pattern technically uses recursion, it never nests recursive calls. Recursion could be avoided by inlining the relevant reusable parts into the main pattern.
A couple of other constructs can be avoided at the price of reduced performance. We can safely replace the atomic groups (?>...) with normal groups (?:...) if we don't nest quantifiers in order to avoid catastrophic backtracking.
We can also avoid (*SKIP)(*FAIL) if we add one line of logic into the replacement function: All the alternatives to skip are grouped in a capturing group. If the capturing group matched, just ignore the match. If not, then it's a string literal.
All of this means we can implement this in JavaScript, which has one of the simplest regex engines you can find, at the price of breaking the DRY rule and making the pattern illegible. The regex becomes this monstrosity once converted:
(\/\/.*|\/\*[\s\S]*?\*\/|(?:u8?|U|L)?'(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^'\\\r\n])+')|(?:u8?|U|L)?"(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^"\\\r\n])*"|(?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\2"
And here's an interactive demo you can play with:
function run() {
var re = /(\/\/.*|\/\*[\s\S]*?\*\/|(?:u8?|U|L)?'(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^'\\\r\n])+')|(?:u8?|U|L)?"(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^"\\\r\n])*"|(?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\2"/g;
var input = document.getElementById("input").value;
var output = input.replace(re, function(m, ignore) {
return ignore ? m : "String(" + m + ")";
});
document.getElementById("output").innerText = output;
}
document.getElementById("input").addEventListener("input", run);
run();
<h2>Input:</h2>
<textarea id="input" style="width: 100%; height: 50px;">
std::cout << L"hello" << " world";
std::cout << "He said: \"bananas\"" << "...";
std::cout << "";
std::cout << "\x12\23\x34";
std::cout << u8R"hello(this"is\a\""""single\\(valid)"
raw string literal)hello";
"" // empty string
'"' // character literal
// this is "a string literal" in a comment
/* this is
"also inside"
//a comment */
// and this /*
"is not in a comment"
// */
"this is a /* string */ with nested // comments"
</textarea>
<h2>Output:</h2>
<pre id="output"></pre>
Regular expressions can be tricky for beginners but once you understand it's basics and well tested divide and conquer strategy, it will be your goto tool.
What you need to search for quote (") not starting with () back slash and read all characters upto next quote.
The regex I came up is (".*?[^\\]"). See a code snippet below.
std::string in_line = "std::cout << \"He said: \\\"bananas\\\"\" << \"...\";";
std::regex re(R"((".*?[^\\]"))");
in_line = std::regex_replace(in_line, re, "String($1)");
std::cout << in_line << endl;
Output:
std::cout << String("He said: \"bananas\"") << String("...");
Regex Explanation:
(".*?[^\\]")
Options: Case sensitive; Numbered capture; Allow zero-length matches; Regex syntax only
Match the regex below and capture its match into backreference number 1 (".*?[^\\]")
Match the character “"” literally "
Match any single character that is NOT a line break character (line feed, carriage return) .*?
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) *?
Match any character that is NOT the backslash character [^\\]
Match the character “"” literally "
String($1)
Insert the character string “String” literally String
Insert an opening parenthesis (
Insert the text that was last matched by capturing group number 1 $1
Insert a closing parenthesis )
Read the relevant sections from the C++ standard, they are called lex.ccon and lex.string.
Then convert each rule you find there into a regular expression (if you really want to use regular expressions; it might turn out that they are not capable of doing this job).
Then, build more complicated regular expressions out of them. Be sure to name your regular expressions exactly as the rules from the C++ standard, so that you can recheck them later.
If, instead of using regular expressions, you want to use an existing tool, here is one: http://clang.llvm.org/doxygen/Lexer_8cpp_source.html. Have a look at the LexStringLiteral function.

Regex to reject only nonalphanumeric characters

If the keyword to be checked is other. It should not be preceded or followed by alphanumeric character.
spaces are allowed, \n allowed, Special characters allowed.
Not allowed - "AOther9", "noTHERX"
Allowed - "other", "\nother" , " other ", "$other/"
grepl(paste("[^a-zA-Z0-9]","other","[^a-zA-Z0-9]",sep=""),String1 , ignore.case = TRUE)
The above regex works well for all cases other than “check” - when check is preceded and followed by nothing.
You need to use a PCRE regex with lookarounds:
grepl(paste("(?<![a-zA-Z0-9])","other","(?![a-zA-Z0-9])",sep=""), String1, ignore.case = TRUE, perl=TRUE)
^^^^ ^ ^^^ ^ ^^^^^^^^^
The negative lookarounds will not consume the non-alphanumeric characters, they do not require those characters to actually be present in the string.
You can read more about lookarounds here.
Add a * quantifier to the inverted ranges, and start ^ and end $ of line anchors:
String1 <- c("AOther9", "noTHERX", "other", "\nother", " other ", "$other/")
grep('^[^a-z0-9]*other[^a-z0-9]*$', String1, ignore.case = TRUE, value = TRUE)
# [1] "other" "\nother" " other " "$other/"

R Regex number followed by punctuation followed by space

Suppose I had a string like so:
x <- "i2: 32390. 2093.32: "
How would I return a vector that would give me the positions of where a number is followed by a : or a . followed by a space?
So for this string it would be
"2: ","0. ","2: "
The regex you need is just '\\d[\\.:]\\s'. Using stringr's str_extract_all to quickly extract matches:
library(stringr)
str_extract_all("i2: 32390. 2093.32: ", '\\d[\\.:]\\s')
produces
[[1]]
[1] "2: " "0. " "2: "
You can use it with R's built-in functions, and it should work fine, as well.
What it matches:
\\d matches a digit, i.e. number
[ ... ] sets up a range of characters to match
\\. matches a period
: matches a colon
\\s matches a space.

Replace nth occurence of a character by another

I hope this isn't a duplicated, I didn't find an answer and I need help from regexp wizards.
I have a string and I would like to replace the second space found in it by a \n, but I don't know how to use indices (this way) in a regular expression :
For example :
# I have :
"a b c d e f"
# I want :
> "a b/nc d e f"
Also I would like to know how I can "repeat" this replacement: each two occurences of space replace by \n.
For example :
"a b c d e f"
> "a b\nc d\ne f"
(\\S+\\s+\\S+)\\s+
You can use this and replace by \1\n or $1\n.See demo.
https://regex101.com/r/yG7zB9/29