Using a regex capture directly in expression in C++

I'm trying to use a captured group directly in the regex. However, when I try to do this the program hangs indefinitely.
For example:
string input = "<Tag>blahblah</Tag>";
regex r1("<([a-zA-Z]+)>[a-z]+</\1>");
string result = regex_replace(input, r1, "");
If I add another backslash to the back-reference, "<([a-zA-Z]+)>[a-z]</\\1>", the program compiles but throws a "regex_error(regex_constants::error_backref)" exception.
Notes:
Compiler: Apple LLVM 5.1
I am using this as part of the process to clean junk from blocks of text. The document is not necessarily HTML/XML and desired text is not always within tags. So if possible, I would like to be able to do this with regular expressions, not a parser.

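The backslash character in string literals is an escape character, so "\1" is compiled as the single character with value 1, not as a back-reference.
Either escape the backslash, "<([a-zA-Z]+)>[a-z]+</\\1>", or use a raw string literal, R"(<([a-zA-Z]+)>[a-z]+</\1>)".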
With that, your program works as you would expect:
#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::string input = "Hello<Tag>blahblah</Tag> World";
    // "\\1" reaches the regex engine as \1, a back-reference to group 1
    std::regex r1("<([a-zA-Z]+)>[a-z]+</\\1>");
    std::string result = std::regex_replace(input, r1, "");
    std::cout << "The result is '" << result << "'\n";
}
demo: http://coliru.stacked-crooked.com/a/ae20b09d46f975e9
The error_backref exception you're getting with \\1 suggests that your compiler is configured to use GNU libstdc++, where <regex> was not yet implemented (it only became usable with GCC 4.9). Look up how to set it up to use LLVM libc++, or use Boost.Regex.
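To make the three spellings concrete, here is a minimal sketch of what the compiler stores for each one. The helper names are made up for this illustration and are not part of the original answer.

```cpp
#include <string>

// Hypothetical helpers, named only for this illustration.

// "\1" is an octal escape: the literal holds '<', '/', the byte 0x01 and '>'.
inline std::string unescaped_pattern() { return "</\1>"; }

// "\\1" is a backslash followed by the digit 1, which the regex engine
// reads as a back-reference to group 1.
inline std::string escaped_pattern() { return "</\\1>"; }

// A raw string literal stores the characters between R"( and )" verbatim.
inline std::string raw_pattern() { return R"(</\1>)"; }
```

escaped_pattern() and raw_pattern() return the same five characters, while unescaped_pattern() returns four, the third being the control character with value 1.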

Related

C++ regex capture group confusion

I'm implementing the nand2tetris Assembler in C++ (I'm pretty new to C++), and I'm having a lot of trouble parsing a C-instruction using regex. Mainly I really don't understand the return value of regex_search and how to use it.
Setting aside the various permutations of a C instruction, the current example I'm having trouble with is D=D-M. The result should have dest = "D"; comp = "D-M".
With the current code below, the regex appears to find the results correctly (confirmed with regex101.com), but I don't know how to get at them. See the debugger screenshot. matches[n].second (which appears to contain the correct comp value) is not a string but an iterator.
Note that the 3rd capture group is correctly empty for this example.
auto regex_str = regex("([AMD]{1,3}=)?([01\-AMD!|+&><]{1,3})?(;[A-Z]{3})?");
regex_search(assemblyCode, matches, regex_str);
string dest = matches[1]; // this automatically casts some object (submatch) into a string?
string comp = matches[2];
string jump = matches[3];
I will note, though, that D=D+M works, but not D=D-M!

GCC warns about an unknown escape sequence, '\-'.
You have to escape the backslash:
std::regex("([AMD]{1,3}=)?([01\\-AMD!|+&><]{1,3})?(;[A-Z]{3})?");
or use a raw string literal:
std::regex(R"(([AMD]{1,3}=)?([01\-AMD!|+&><]{1,3})?(;[A-Z]{3})?)");
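On the other half of the question, how to read the result of regex_search: each matches[n] is a std::sub_match, which holds a pair of iterators (first, second) into the searched string and converts to std::string on assignment. A minimal sketch, assuming the raw-string pattern from the answer; the CInstruction struct and the parse_c_instruction function are hypothetical names introduced here:

```cpp
#include <regex>
#include <string>

// Hypothetical container for the three parts of a C-instruction.
struct CInstruction {
    std::string dest;
    std::string comp;
    std::string jump;
};

// Hypothetical helper: runs the pattern from the answer and copies each
// capture group out as a string.
CInstruction parse_c_instruction(const std::string& assemblyCode) {
    static const std::regex pattern(
        R"(([AMD]{1,3}=)?([01\-AMD!|+&><]{1,3})?(;[A-Z]{3})?)");

    CInstruction result;
    std::smatch matches;
    if (std::regex_search(assemblyCode, matches, pattern)) {
        // str() returns an empty string for a group that did not participate.
        result.dest = matches[1].str();  // still includes the trailing '='
        result.comp = matches[2].str();
        result.jump = matches[3].str();  // still includes the leading ';'
    }
    return result;
}
```

For "D=D-M" this gives dest "D=" and comp "D-M", so the '=' still has to be dropped (or moved outside the capture group) to get the "D" the question asks for.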

How to build a Raw string for regex from string variable

How can I build a regex from a string variable and have it interpreted as a raw string?
std::regex re{R"pattern"};
For the above code, is there a way to replace the fixed string "pattern" with a std::string pattern; variable that is built either at compile time or at run time?
I tried this, but it didn't work:
std::string pattern = "key";
std::string pattern = std::string("R(\"") + pattern + ")\"";
std::regex re(pattern); // does not behave the way re(R"(key)") does
Specifically, when I use re(R"(key)") the result is found as expected. But when I build the regex with re(pattern), where pattern has exactly the same value ("key"), the result is not found.
This is probably what I need, but it was for Java, not sure if there is anything similar in C++:
How do you use a variable in a regular expression?

std::string pattern = std::string("R(\"") + pattern + ")\"";
should be built from raw string literals as follows:
pattern = std::string(R"(\")") + pattern + std::string(R"(\")");
This results in a string value like
\"key\"
In case you want to have escaped parenthesis, you can write
pattern = std::string(R"(\(")") + pattern + std::string(R"("\))");
This results in a string value like
\("key"\)
Side note: You can't define the pattern variable twice. Omit the std::string type in follow up uses.
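Note that the R"(...)" prefix only changes how the compiler reads a literal in source code; it is not part of the string's value. If the aim is simply to build a regex from a variable, a sketch like the following may be all that is needed. The function name contains_pattern is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: builds the regex from a runtime string and searches
// `text` for it. No R"..." wrapper is needed (or possible) for a variable.
bool contains_pattern(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    return std::regex_search(text, re);
}
```

Calling contains_pattern("my key here", "key") behaves the same as searching with std::regex re(R"(key)"), because both literals hold the three characters k, e, y. Escaping only matters for backslashes typed in source code, for example "\\d+" versus R"(\d+)".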

Regex to handle malformed delimited files

I am trying to find a regular expression that will not match a delimiter if it is wrapped in double quotes. But it must also be able to handle values that have a single double quote. I have the first part down with the below expression where DELIMITER could be just about anything but is mainly commas, pipes, and double pipes:
DELIMITER(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)
This handles a properly formed CSV row like apple, "banana, and orange", grape. I can split on the delimiter and get the values:
['apple', 'banana, and orange', 'grape']
My problem is that I may encounter a line like apple, "banana, and orange, grape. In this case I would want to get the values:
['apple', '"banana', 'and orange', 'grape']
However, I get:
['apple, "banana', 'and orange', 'grape']
It basically ignores all of the commas up to the double quote.
The logic that I have in my head is that I want to ignore a comma if it is preceded by a double quote, but only if it has a double quote after it as well. My first thought was to play around with a look-behind, but I can't get that to work because look-behinds cannot handle quantifiers (correct me if this is wrong).
I am using Qt QRegExp which I understand is more or less similar to the Perl regex engine. Please let me know if there is more information that I can provide. I know regular expressions can be finicky based on your setup, and I hope I have explained what I'm looking for well enough!

This isn't Qt, but boost::tokenizer, which is header-only, supports escaped delimited text formats.
From the example usage in the Boost docs: http://www.boost.org/doc/libs/1_60_0/libs/tokenizer/escaped_list_separator.htm
// simple_example_2.cpp
#include <iostream>
#include <boost/tokenizer.hpp>
#include <string>

int main() {
    using namespace std;
    using namespace boost;
    string s = "Field 1,\"putting quotes around fields, allows commas\",Field 3";
    tokenizer<escaped_list_separator<char> > tok(s);
    for (tokenizer<escaped_list_separator<char> >::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
        cout << *beg << "\n";
    }
}
In the malformed case tok returns a single token, which isn't what you're looking for. Since you need non-standard[1] parsing, consider writing a small state machine instead of a regular expression.
[1] As much as there is a standard for delimited text.
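One way such a state machine could look is sketched below. It is only an illustration under a few assumptions not stated in the question: the delimiter is a single character (so "||" would need extra handling), surrounding spaces are trimmed, and a double quote counts as an opening quote only if another double quote appears later on the line. The names trim and split_lenient are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: strips leading and trailing spaces from a field.
static std::string trim(const std::string& s) {
    const std::size_t first = s.find_first_not_of(' ');
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(' ');
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: splits `line` on `delim`. Delimiters inside a balanced
// pair of double quotes are kept; a quote with no partner is treated as an
// ordinary character.
std::vector<std::string> split_lenient(const std::string& line, char delim = ',') {
    std::vector<std::string> fields;
    std::string current;
    bool in_quotes = false;

    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (c == '"') {
            if (in_quotes) {
                in_quotes = false;  // closing quote, dropped from the field
            } else if (line.find('"', i + 1) != std::string::npos) {
                in_quotes = true;   // opening quote with a partner later on
            } else {
                current += c;       // lone quote: keep it literally
            }
        } else if (c == delim && !in_quotes) {
            fields.push_back(trim(current));
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(trim(current));
    return fields;
}
```

With these rules the well-formed row from the question splits into apple, banana, and orange, grape (three fields), and the malformed row splits into apple, "banana, and orange, grape (four fields).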

Regex Extractor

Hi, I am implementing regex matching in C++.
Background:
I have a std::string and a std::regex, and I need to compare the string against the regex.
The regex is not used for validation. A typical pattern would be something like a[bc]{2}, and nothing beyond that scope.
I have to pass this regex as a char pointer argument to a function.
Problem:
I am unable to assign char pointer to std::regex. If I do so I am getting the following error.
terminate called after throwing an instance of std::regex_error what(): regex_error
My function body will be
std::string s((char*)a); // The main string
std::regex e((char*)b); // Regex comparing the main string. a and b are the parameters to the function
if (std::regex_match(s, e))
{
// returns the matched portion of the string
// for instance "abcdeef" , "e{2}" would return ee
}
else
{
// return "Mismatch"
}
Any suggestions..? Or is there a way to extract the string from regex like "a{2}b" -> "aab"
Thanks in advance

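The error is probably not caused by the pattern itself: the default ECMAScript grammar does support bracket expressions such as [bc]. A regex_error on a pattern this simple more often points to a standard library with an incomplete <regex> implementation (libstdc++ before GCC 4.9, for example). You can still try another grammar by passing a syntax flag, though in the basic grammar the repetition would have to be written \{2\}: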
std::regex e((char*)b, std::regex_constants::basic);
They are discussing more about this here.
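Independently of the regex_error, std::regex_match only succeeds when the whole string matches, so it cannot return "ee" for "abcdeef". To pull out the matching portion, std::regex_search with a std::smatch is the usual tool. A minimal sketch, with extract_match as a hypothetical name and the "Mismatch" return value taken from the question:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first portion of `text` that matches
// `pattern`, or "Mismatch" when nothing matches.
std::string extract_match(const char* text, const char* pattern) {
    const std::string s(text);    // the main string
    const std::regex e(pattern);  // ECMAScript grammar by default
    std::smatch m;
    if (std::regex_search(s, m, e)) {
        return m.str();           // the whole matched substring
    }
    return "Mismatch";
}
```

For example, extract_match("abcdeef", "e{2}") returns "ee". Going the other way, from a pattern such as "a{2}b" to a string it would match, is not something std::regex offers.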

Groovy Regex illegal Characters

I have a Groovy script that converts some very poorly formatted data into XML. This part works fine, but it's also happily passing some characters along that aren't legal in XML. So I'm adding some code to strip these out, and this is where the problem is coming from.
The code that isn't compiling is this:
def illegalChars = ~/[\u0000-\u0008]|[\u000B-\u000C]|[\u000E-\u001F]|[\u007F-\u009F]/
What I'm wondering is, why? What am I doing wrong here? I tested this regex in http://regexpal.com/ and it works as expected, but I'm getting an error compiling it in Groovy:
[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] line 23:26: unexpected char: 0x0
The line above is line 23. The surrounding lines are just variable declarations that I haven't changed while working on the regex.
Thanks!
Update:
The code compiles, but it's not filtering as I'd expected it to.
In regexpal I put the regex:
[\u0000-\u0008\u000B-\u000C\u000E-\u001F\u007F-\u009F]
and the test data:
name='lang'>E</field><field name='title'>CHEMICAL IMMUNOLOGY AND ALLERGY</field></doc>
<doc><field name='page'>72-88</field><field name='shm'>3146.757500</field><field
name='pubc'>47</field><field name='cs'>1</field><field name='issue'>NUMBER</field>
<field name='auth'>Dvorak, A.</field><field name='pub'>KARGER</field><field
name='rr'>GBP013.51</field><field name='issn'>1660-2242</field><field
name='class1'>TS</field><field name='freq'>S</field><field
name='class2'>616.079</field><field name='text'>Subcellular Localization of the
Cytokines, Basic Fibroblast Growth Factor and Tumor Necrosis Factor- in Mast
Cells</field><field name='id'>RN170369808</field><field name='volume'>VOL 85</field>
<field name='year'>2005</field><field name='lang'>E</field><field
name='title'>CHEMICAL IMMUNOLOGY AND ALLERGY</field></doc><doc><field
name='page'>89-97</field><field name='shm'>3146.757500</field><field
name='pubc'>47</field><field name='cs'>1</field><field
It's a grab from a file with one of the illegal characters, so it's a little random. Regexpal highlights only the illegal character, but Groovy replaces even the '<' and '>' characters with empty strings, so it's basically annihilating the entire document.
The code snippet:
def List parseFile(File file){
    println "reading File name: ${file.name}"
    def lineCount = 0
    List data = new ArrayList()
    file.eachLine {
        String input ->
        lineCount ++
        String line = input
        if(input =~ illegalChars){
            line = input.replaceAll(illegalChars, " ")
        }
        Map document = new HashMap()
        elementNames.each(){
            token ->
            def val = getValue(line, token)
            if(val != null){
                if(token.equals("ISSUE")){
                    List entries = val.split(";")
                    document.putAt("year",entries.getAt(0).trim())
                    if(entries.size() > 1){
                        document.putAt("volume", entries.getAt(1).trim())
                    }
                    if(entries.size() > 2){
                        document.putAt("issue", entries.getAt(2).trim())
                    }
                } else {
                    document.putAt(token, val)
                }
            }
        }
        data.add(document)
    }
    println "done"
    return data
}
I don't see any reason that the two should behave differently; am I missing something?
Again, thanks!

line 23:26: unexpected char: 0x0
This error message points to this part of the code:
def illegalChars = ~/[\u0000-...
12345678901234567890123
It looks like, for some reason, the compiler doesn't like having the Unicode character 0 in the source code. That said, you should be able to fix this by doubling the backslash. This prevents the Unicode escapes from being processed at the source-code level and lets the regex engine handle them instead:
def illegals = ~/[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]/
Note that I've also combined the character classes into one instead of using alternation. I've also removed the range definitions where they aren't necessary.
References
regular-expressions.info/Character Classes
On doubling the slash
Here's the relevant quote from java.util.regex.Pattern
Unicode escape sequences such as \u2014 in Java source code are processed as described in JLS 3.3. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.
To illustrate, in Java:
System.out.println("\n".matches("\\u000A")); // prints "true"
However:
System.out.println("\n".matches("\u000A"));
// DOES NOT COMPILE!
// "String literal is not properly closed by a double-quote"
This is because \u000A, which is the newline character, is escaped in the second snippet at the source code level. The source code essentially becomes:
System.out.println("\n".matches("
"));
// DOES NOT COMPILE!
// "String literal is not properly closed by a double-quote"
This is not legal Java source code.

Try this Regular Expression to remove unicode char from the string :
/*\\u([0-9]|[a-fA-F])([0-9]|[a-fA-F])([0-9]|[a-fA-F])([0-9]|[a-fA-F])/

OK here's my finding:
>>> print "XYZ".replaceAll(
/[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]/,
"-"
)
---
>>> print "X\0YZ".replaceAll(
/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]/,
"-"
)
X-YZ
>>> print "X\0YZ".replaceAll(
"[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]",
"-"
)
X-YZ
In other words, my \\uNNNN answer within /pattern/ is WRONG. What happens is that 0-\ becomes part of the range, and this includes <, > and all capital letters.
The \\uNNNN only works in "pattern", not in /pattern/.
I will edit my official answer based on comments to this "answer".
Related questions
How to escape Unicode escapes in Groovy’s /pattern/ syntax
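The "0-\" effect found above is a property of regex character classes rather than of Groovy, so it can be reproduced elsewhere. A small C++ sketch follows (C++ only to match the rest of this page; std::regex stands in for Java's engine, and the helper name is hypothetical). It shows that a class containing the range from '0' to a backslash also covers '<', '>', digits and capital letters:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: removes every character in the range '0'..'\'.
std::string strip_zero_to_backslash(const std::string& text) {
    // The regex engine sees [0-\\]: a range from '0' (0x30) to an escaped
    // backslash (0x5C). '<' (0x3C), '>' (0x3E) and 'A'..'Z' fall inside it.
    static const std::regex range(R"([0-\\])");
    return std::regex_replace(text, range, "");
}
```

Applied to "<doc>X1", this leaves only "doc": the angle brackets, the capital X and the digit are all inside the range, while lowercase letters (0x61 and up) are not.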

Try:
def illegalChars = ~/[\u0001-\u0008]|[\u000B-\u000C]|[\u000E-\u001F]|[\u007F-\u009F]/
