Porting from C# to Delphi, Regex incompatibility - regex

I am new to HTTP and regex. I have ported a piece of code to Delphi, and it works only partially. The exception 'lookbehind not of fixed length' is raised on one particular statement:
'(?<=image\\?c=)[^\"]+'
The statement is there to extract an image link from an HTML form. After some research here and on the web, I have come to understand that the '+' at the end causes this in some regex implementations. What I couldn't find was how to change it to work in Delphi's implementation. Since the code works in C#, can somebody help and explain?

The lookbehind section doesn't have fixed length. That has nothing to do with the + at the end. The lookbehind portion is (?<=image\\?c=). You copied that from C#. In C#, the regex wants to look for a literal question mark. That's a special character in regex, so it needs a backslash in front of it. Backslash is special in C# strings, though, so that backslash needs another backslash, all just to represent a single question mark.
In Delphi strings, backslashes aren't special, so the two of them are treated as a literal backslash to search for in the regex. The question mark isn't escaped, so the Delphi regex treats it as an instruction to make the literal backslash optional. The optional character makes the lookbehind have variable length.
To solve this, simply remove one backslash.
You can also remove the one before the quotation mark, but it should have no effect since quotation marks aren't special in regex.
Even if you use an HTML parser to identify the HTML element that contains this URL fragment, you may still need the right regex to recognize which HTML element is your target.
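As a rough illustration of the difference between the two spellings, here is a small Python sketch. Python is only a stand-in: its re module also rejects variable-length lookbehind, though the error text differs from Delphi's. The sample HTML string and the helper name compiles are made up for the demonstration.

```python
import re

# Hypothetical sample input; the real form markup is not shown in the question.
html = '<img src="image?c=abc123">'

# Pattern as the engine sees it when copied unchanged from C#: "\\" is a
# literal backslash and "?" makes it optional, so the lookbehind has
# variable length.
copied_from_csharp = r'(?<=image\\?c=)[^"]+'

# Pattern with one backslash removed: "\?" is a literal question mark.
fixed = r'(?<=image\?c=)[^"]+'

def compiles(pattern):
    """Return True if the regex engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(copied_from_csharp))      # False: lookbehind is not fixed width
print(re.search(fixed, html).group(0))   # abc123
```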

Related

Regex c++ crashing while initialization

I'm currently working on matching registry paths using regex.
I have initialized the regex as
regex regx("HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Uninstall\\\\{0398BFBC-913B-3275-9463-D2BF91B3C80B\\}")
and the program throws a std::tr1::regex_error exception.
I tried to escape the curly braces using "\\\\" but it still didn't work.
Any idea on how to fix it?
I'm on Windows 10, Visual Studio 2010.
Let's look at a C++ string literal (a slightly shorter one that we can read):
"A\\B\\C"
This, taking account of the literal escaping, is really the string:
A\B\C
Now you're passing this string to the regex engine. But regex has its own escaping, so it reads each backslash plus the following character as an escape sequence such as \B or \C. (Those two may happen to exist, but the sequences formed from your actual characters don't all exist.)
Hence the regex is invalid and trying to instantiate it throws an exception.
You will need an extra layer of escaping:
"A\\\\B\\\\C"
Or use a raw string literal (a C++11 feature, so your compiler has to support it):
R"(A\\B\\C)"
In other words:
regex regx(R"(HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Uninstall\\\{0398BFBC-913B-3275-9463-D2BF91B3C80B\})")
(Yuck!)
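The same two layers of escaping exist in other languages. The following is a minimal Python sketch of the idea, not a translation of the C++ code: it assumes the goal is to match the registry path literally, and it uses Python's re.escape as one way to avoid counting backslashes by hand.

```python
import re

# The text we want to match (a raw string, so backslashes are literal).
path = (r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion"
        r"\Uninstall\{0398BFBC-913B-3275-9463-D2BF91B3C80B}")

# One layer of escaping only: the regex engine receives
# HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft and tries to read \S and \M
# as regex escapes.
one_layer = "HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft"
try:
    re.compile(one_layer)
    one_layer_ok = True
except re.error:
    one_layer_ok = False

# Raw string: only the regex layer of escaping remains. Each \\ is one
# literal backslash; \{ and \} are literal braces.
pattern = re.compile(
    r"HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion"
    r"\\Uninstall\\\{0398BFBC-913B-3275-9463-D2BF91B3C80B\}"
)

# Letting the library escape the literal text for us.
escaped = re.compile(re.escape(path))

print(one_layer_ok)                         # False
print(pattern.fullmatch(path) is not None)  # True
print(escaped.fullmatch(path) is not None)  # True
```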

What's wrong in regex format for filenames in <regex> VS10?

I'm trying to parse file paths with the regex library in Visual Studio 2010.
But the program crashes with
Microsoft C++ exception: std::tr1::regex_error at memory location 0x001ef120..
on
regex myRegEx("^([a-zA-Z]\\:)(\\\\[^\\\\/:*?<>\"|]*(?<![ ]))*(\\.[a-zA-Z]\\{2,6\\})$");
Regular expression is ^([a-zA-Z]\:)(\\[^\\/:*?<>"|]*(?<![ ]))*(\.[a-zA-Z]{2,6})$
What's wrong with the regex format?
You can perhaps narrow things down by slicing it into chunks. Evaluate the atoms separately, and see where the error turns up:
"^([a-zA-Z]\\:).*$"
"^([a-zA-Z]\\:)(\\\\[^\\\\/:*?<>\"|]*(?<![ ]))*[.]*$"
"^([a-zA-Z]\\:)(\\\\[^\\\\/:*?<>\"|]*(?<![ ]))*(\\.[a-zA-Z]+)$"
"^([a-zA-Z]\\:)(\\\\[^\\\\/:*?<>\"|]*(?<![ ]))*(\\.[a-zA-Z]\\{2,6\\})$"
One possible gotcha is the quantifier, "\{2,6\}". If you really want "two to six letters", then you probably don't want backslashes in front of the braces. The real answer depends on the regex grammar your library uses.
Also, if there's confusion as to what's being escaped with backslashes, remember that you can often escape a special character by putting it into a character class. For example, \\ may be equivalent to [\], and \. is certainly equivalent to [.].
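The chunk-by-chunk approach can be automated. Below is a minimal sketch in Python, used here only as a convenient test bed; the helper name first_failing is made up. Python's re module accepts lookbehind, so all four chunks compile there. In an engine without lookbehind, the second chunk would presumably be the first to fail.

```python
import re

def first_failing(patterns):
    """Return the index of the first pattern that fails to compile, or None."""
    for index, pattern in enumerate(patterns):
        try:
            re.compile(pattern)
        except re.error:
            return index
    return None

# The four chunks from above, written as raw strings so that only the
# regex layer of escaping is left.
chunks = [
    r'^([a-zA-Z]\:).*$',
    r'^([a-zA-Z]\:)(\\[^\\/:*?<>"|]*(?<![ ]))*[.]*$',
    r'^([a-zA-Z]\:)(\\[^\\/:*?<>"|]*(?<![ ]))*(\.[a-zA-Z]+)$',
    r'^([a-zA-Z]\:)(\\[^\\/:*?<>"|]*(?<![ ]))*(\.[a-zA-Z]\{2,6\})$',
]

print(first_failing(chunks))               # None under Python's re
# A deliberately broken, hypothetical pattern to show a failure being located.
print(first_failing([r'^a', r'([a-z]']))   # 1
```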
First of all, I don't know C++ regex syntax well, but it seems to me that \\[ escapes the [ character.
I guess you should write just [ if you want a negated character class [^\\/:*?<>"|]:
^([a-zA-Z]\:)([^\\/:*?<>"|]*(?<![ ]))*(\.[a-zA-Z]{2,6})$
tr1::regex doesn't support lookbehind, so it's choking on "(?<![ ])".
Unfortunately, I'm not enough of a regex user to give you guidance on what you might use instead.
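One possible way to avoid the lookbehind, sketched in Python purely for illustration: instead of asserting afterwards that a path component does not end in a space, require the last character of each component to come from a character class that also excludes the space. This is an assumption about what the pattern is meant to accept, not a tested drop-in for tr1::regex, and the constant names are made up. Note that the second group here captures all components rather than only the last one.

```python
import re

ALLOWED = r'[^\\/:*?<>"|]'             # any character legal inside a component
ALLOWED_NOT_SPACE = r'[^\\/:*?<>"| ]'  # same, but also excluding the space

# A backslash followed by an optional name whose last character is not a space.
COMPONENT = r'\\(?:' + ALLOWED + r'*' + ALLOWED_NOT_SPACE + r')?'

PATH_RE = re.compile(
    r'^([a-zA-Z]:)((?:' + COMPONENT + r')*)(\.[a-zA-Z]{2,6})$'
)

print(PATH_RE.match(r'C:\Program Files\app\readme.txt') is not None)  # True
print(PATH_RE.match(r'C:\bad \file.txt') is None)                     # True
print(PATH_RE.match(r'C:\dir\file .txt') is None)                     # True
```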

Regular expression for parsing string inside ""

<A "SystemTemperatureOutOfSpec" >
What should be the regular expression for parsing the string inside "". In the above sample it is 'SystemTemperatureOutOfSpec'
In JavaScript, this regexp:
/"([^"]*)"/
For example:
> /"([^"]*)"/.exec('<A "SystemTemperatureOutOfSpec" >')[1]
"SystemTemperatureOutOfSpec"
Similar patterns should work in a bunch of other programming languages.
Try this:
string Exp = "\"!\"";
I am not sure I understand your question well but if you need to match everything between double quotes, here it is: /(?<=").*?(?=")/s
(?<=<A\s")(?<content>.*)(?="\s>)
Regular expressions don't get much easier than this, so you should be able to solve it by yourself. Here's how you go about doing that:
The first step is to try to define as precisely as possible what you want to find. Let's start with this: you want to find a quote, followed by some number of characters other than a quote, followed by a quote. Is that correct? If so, our pattern has three parts: "a quote", "some characters other than a quote", and "a quote".
Now all we need to do is figure out what the regular expressions for those patterns are.
A quote
For "a quote", the pattern is literally ". Regular expressions have special characters that you have to be aware of (*, ., etc.). Anything that's not a special character matches itself, and " is not special, so it matches itself. For a complete list of special characters for your language, see the documentation.
Characters other than a quote
So now the question is, how do we match "characters other than a quote"? That sounds like a character class. A character class is a pair of square brackets around a list of allowable characters. If the list begins with ^, it is instead a list of not-allowed characters. We want any character other than a quote, so that means [^"].
"Some"
That character class matches just one character, but we want "some". "Some" usually means either zero-or-more or one-or-more. You can place * after a part of an expression to mean zero-or-more of that part. Likewise, use + to mean one-or-more (and ? means zero-or-one). There are a few other variations, but that's enough for this problem.
So, "some characters other than a quote" is the character class [^"] (any character other than a quote) followed by * (zero-or-more). Thus, [^"]*
Putting it all together
This is the easy part: just combine all the pieces. A quote, followed by some characters other than a quote, followed by a quote, is "[^"]*".
Capturing the interesting part
The pattern we have will now match your string. What you want, however, is just the part inside the quotes. For that you need a "capturing group", which is denoted by parentheses. To capture a part of a regular expression, put it in parentheses. So, if we want to capture everything but the beginning and ending quote, the pattern becomes "([^"]*)".
And that's how you learn regular expressions. Break your problem down into a precise statement composed of short sequences of characters, figure out the regular expression for each sequence, then put it all together.
The pattern in this answer may not actually be the perfect answer for you. There are some edge cases to worry about. For example, you may only want to match a quote following a non-word character, or only quotes at the beginning or end of a word. That's all possible, but is highly dependent on your exact problem. Figuring out how to do that is just as easy though -- decide what you want, then look at the documentation to see how to accomplish that.
Spend one day practicing on regular expressions and you'll never have to ask anyone for help with regular expressions for the rest of your career. They aren't hard, but they do require concentrated study.
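The step-by-step construction above can be written out literally. This is a small Python sketch (any language with a regex library would do); the variable names are only labels for the three parts described in the answer.

```python
import re

QUOTE = '"'          # a quote matches itself
NOT_QUOTE = '[^"]'   # any single character other than a quote
SOME = '*'           # zero or more of the preceding part

# A quote, a captured run of non-quote characters, and a closing quote.
pattern = re.compile(QUOTE + '(' + NOT_QUOTE + SOME + ')' + QUOTE)

match = pattern.search('<A "SystemTemperatureOutOfSpec" >')
print(pattern.pattern)   # "([^"]*)"
print(match.group(1))    # SystemTemperatureOutOfSpec
```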
Are you sure you need regular expression matching here? Looking at your "string", you might be better off using an XML parser.

Seeking quoted string generator

Does anyone know of a free website or program which will take a string and render it quoted with the appropriate escape characters?
Let's say, for instance, that I want to quote
It's often said that "devs don't know how to quote nested "quoted strings"".
And I would like to specify whether that gets enclosed in single or double quotes. I don't personally care for any escape character other than backslash, but others might.
If none of the double quotes of the string is already escaped, you can simply do:
str = str.replace(/"/g, "\\\"");
Otherwise, you should check whether each quote is already escaped and replace it only if it isn't. You can use lookbehind for that. The following is what came to my mind first, but it fails for strings such as an escaped backslash followed by a quote, \\" :(
str = str.replace(/(?<!\\)"/g, "\\\"");
The following treats a quote as already escaped only when the backslash before it is not itself preceded by another backslash.
str = str.replace(/(?<!(^|[^\\])\\)"/g, "\\\"");
Update: Just remembered that JavaScript (before ES2018) doesn't support look-behind; you can use the same regex on an engine that supports look-behind, such as Perl, PHP or .NET.
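Where look-behind is unavailable or too restricted, one alternative is to capture the run of backslashes in front of each quote and check whether its length is odd or even. The sketch below uses Python and a different technique from the look-behind patterns above; the function name is hypothetical. Unlike those patterns, it counts the whole run of backslashes, so it also handles three or more in a row.

```python
import re

def escape_unescaped_quotes(text):
    """Add a backslash before every double quote that is not already escaped."""
    def fix(match):
        backslashes = match.group(1)
        if len(backslashes) % 2 == 1:
            # Odd run: the last backslash already escapes the quote.
            return match.group(0)
        # Even run (including zero): the quote itself is unescaped.
        return backslashes + '\\"'
    return re.sub(r'(\\*)"', fix, text)

print(escape_unescaped_quotes(r'say "hi"'))   # say \"hi\"
print(escape_unescaped_quotes(r'a\"b'))       # a\"b
print(escape_unescaped_quotes(r'a\\"b'))      # a\\\"b
```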
Any decent regex library in any decent programming language will have a function to do this - not that it's hard to write one yourself (as the other answers have indicated). So having a separate website or program to do it would be mostly useless.
Perl has the quotemeta function
PCRE's C++ wrapper has a function RE::QuoteMeta which does the same thing
PHP has preg_quote if you're using Perl-compatible regexes
Python's re module has an escape function
In Java, the java.util.regex.Pattern class has a quote method
Perl and most of the other regular expression engines based on Perl have metacharacters \Q...\E, meaning that whatever comes between \Q and \E is interpreted literally
Most tools that use POSIX regular expressions (e.g. grep) have an option that makes them interpret their input as a literal string (e.g. grep -F)
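As a quick illustration of one entry in this list, here is how Python's re.escape behaves. The sample text is made up; the point is only that the escaped pattern matches the original text literally, while the unescaped text used as a pattern does not.

```python
import re

literal = 'price: $5.00 (approx.)'   # hypothetical text full of metacharacters
pattern = re.escape(literal)

# Used unescaped, "$", "." and the parentheses keep their special meaning.
print(re.fullmatch(literal, literal))                              # None
# Escaped, the pattern matches exactly the original text.
print(re.fullmatch(pattern, literal) is not None)                  # True
print(re.fullmatch(pattern, 'price: $5x00 (approx.)') is None)     # True
```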
In Python, for enclosing in single quotes:
import re
mystr = """It's often said that "devs don't know how to quote nested "quoted strings""."""
print("""'%s'""" % re.sub("'", r"\'", mystr))
Output:
'It\'s often said that "devs don\'t know how to quote nested "quoted strings"".'
You could easily adapt this into a more general form, and/or wrap it in a script for command-line invocation.
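One possible more general form is sketched below. It assumes that the escape character itself must also be doubled, which the snippet above does not do; the function name and parameters are hypothetical.

```python
def quote(text, quote_char='"', escape_char='\\'):
    """Wrap text in quote_char, escaping the escape character and the quote."""
    # Escape the escape character first so that the backslashes added in
    # the next step are not doubled again.
    escaped = text.replace(escape_char, escape_char * 2)
    escaped = escaped.replace(quote_char, escape_char + quote_char)
    return quote_char + escaped + quote_char

sample = 'It\'s often said that "devs don\'t know how to quote nested "quoted strings"".'
print(quote(sample, "'"))
print(quote(sample, '"'))
```

The first call reproduces the single-quoted output shown above; the second escapes the double quotes instead.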
So, I guess the answer is "no". Sorry, guys, but I didn't learn anything that I didn't already know. Probably my fault for not phrasing the question correctly.
+1 for everyone who posted

Regex for matching a character, but not when it's enclosed in quotes

I need to match a colon (':') in a string, but not when it's enclosed by quotes - either a " or ' character.
So the following should have 2 matches
something:'firstValue':'secondValue'
something:"firstValue":'secondValue'
but this should only have 1 match
something:'no:match'
If the regular expression implementation supports look-around assertions, try this:
:(?:(?<=["']:)|(?=["']))
This will match any colon that is either preceded or followed by a double or single quote. So it only handles constructs like the ones you mentioned; something:firstValue would not be matched.
It would be better if you build a little parser that reads the input byte-by-byte and remembers when quotation is open.
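To see what the look-around pattern does on the sample strings, here is a short Python check. Python's re accepts this pattern because the lookbehind is fixed width. The helper name count and the last input are made up, and the last input shows why a parser is the safer choice.

```python
import re

COLON = re.compile(r""":(?:(?<=["']:)|(?=["']))""")

def count(text):
    """Count colons that touch a quote on either side."""
    return len(COLON.findall(text))

print(count("something:'firstValue':'secondValue'"))    # 2
print(count("something:\"firstValue\":'secondValue'"))  # 2
print(count("something:'no:match'"))                    # 1
print(count("something:firstValue"))                    # 0
# Limitation: a colon just before a closing quote is counted as well.
print(count("something:'trailing:'"))                   # 2
```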
Regular expressions are stateless. Tracking whether you are inside of quotes or not is state information. It is, therefore, impossible to handle this correctly using only a single regular expression. (Note that some "regular expression" implementations add extensions which may make this possible; I'm talking solely about "true" regular expressions here.)
Doing it with two regular expressions is possible, though, provided that you're willing to modify the original string or to work with a copy of it. In Perl:
$string =~ s/['"][^'"]*['"]//g;
my $match_count = () = $string =~ /:/g;
The first will find every sequence consisting of a quote, followed by any number of non-quote characters, and terminated by a second quote, and remove all such sequences from the string. This will eliminate any colons which are within quotes. (something:"firstValue":'secondValue' becomes something:: and something:'no:match' becomes something:)
The second counts the remaining colons, which will be those that weren't within quotes to start with. The empty-list assignment forces list context so that every match is counted; in scalar context the match would only report success or failure.
Just counting the non-quoted colons doesn't seem like a particularly useful thing to do in most cases, though, so I suspect that your real goal is to split the string up into fields with colons as the field delimiter, in which case this regex-based solution is unsuitable, as it will destroy any data in quoted fields. In that case, you need to use a real parser (most CSV parsers allow you to specify the delimiter and would be ideal for this) or, in the worst case, walk through the string character-by-character and split it manually.
If you tell us the language you're using, I'm sure somebody could suggest a good parser library for that language.
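Both ideas from this answer can be sketched in a few lines of Python, used here only as an example language. The first function mirrors the strip-then-count approach; the second is a hand-written scan that remembers which quote character, if any, is currently open. The function names are hypothetical, and neither function handles escaped quotes.

```python
import re

def count_by_stripping(text):
    """Two-pass approach: delete quoted runs, then count remaining colons."""
    stripped = re.sub(r"""['"][^'"]*['"]""", '', text)
    return stripped.count(':')

def unquoted_colon_positions(text):
    """Single pass that tracks the currently open quote character."""
    positions = []
    open_quote = None
    for index, char in enumerate(text):
        if open_quote is not None:
            if char == open_quote:
                open_quote = None      # closing quote
        elif char in '\'"':
            open_quote = char          # opening quote
        elif char == ':':
            positions.append(index)    # colon outside any quotes
    return positions

print(count_by_stripping("something:'firstValue':'secondValue'"))  # 2
print(count_by_stripping("something:'no:match'"))                  # 1
print(unquoted_colon_positions("something:'no:match'"))            # [9]
```

Because the scan records positions rather than deleting text, the same loop could be extended to split the string into fields without destroying the quoted data.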
Oops ... missed the point. Forget the rest. It's quite hard to do this because regex is not good at counting balanced characters (the .NET implementation, for example, has an extension that can do it, but it's a bit complicated).
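You can use negated character classes to do this. Note that this pattern also consumes the character on each side of the colon:
[^'"]:[^'"]
You can further wrap the character classes in non-capturing groups.
(?:[^'"]):(?:[^'"])
Or you can use assertions, which leave the neighbouring characters out of the match:
(?<!['"]):(?!['"])
All three variants match only colons that have no quote directly before or after them.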
I've come up with the following slightly worrying construction:
(?<=^('[^']*')*("[^"]*")*[^'"]*):
It uses a lookbehind assertion to make sure you match an even number of quotes from the beginning of the line to the current colon. It allows for embedding a single quote inside double quotes and vice versa. As in:
'a":b':c::"':" (matches at positions 6, 8 and 9)
EDIT
Gumbo is right: using * within a lookbehind assertion is not allowed in most regex engines (.NET is one that does allow it).
You can try to capture the strings within the quotes:
/(?<q>'|")([\w ]+)(\k<q>)/m
The first group defines the allowed quote types; the second group takes all word characters and spaces.
The nice thing about this solution is that it takes ONLY strings whose opening and closing quotes match.
Try it at regex101.com
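For readers who want to try the named-group idea outside regex101, here is an approximate Python equivalent. Python spells the named group (?P<q>...) and the backreference (?P=q), so the pattern is not character-for-character the same as the one above; the function name is made up.

```python
import re

# Opening quote captured as "q", then word characters and spaces, then
# the same quote character again.
QUOTED = re.compile(r"""(?P<q>'|")([\w ]+)(?P=q)""")

def quoted_strings(text):
    """Return the contents of every properly closed quoted string."""
    return [m.group(2) for m in QUOTED.finditer(text)]

print(quoted_strings("something:\"firstValue\":'secondValue'"))  # ['firstValue', 'secondValue']
print(quoted_strings("something:'mismatch\""))                    # []
```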