Regular expression for parsing string inside ""

<A "SystemTemperatureOutOfSpec" >
What should the regular expression be for parsing the string inside the double quotes? In the sample above, it is 'SystemTemperatureOutOfSpec'.

In JavaScript, this regexp:
/"([^"]*)"/
For example:
> /"([^"]*)"/.exec('<A "SystemTemperatureOutOfSpec" >')[1]
"SystemTemperatureOutOfSpec"
Similar patterns should work in a bunch of other programming languages.

Try this:
string Exp = "\"!\"";

I am not sure I understand your question well but if you need to match everything between double quotes, here it is: /(?<=").*?(?=")/s

(?<=<A\s")(?<content>.*)(?="\s>)
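A minimal Python sketch of the two lookaround answers above, offered as an illustration rather than a definitive solution. It assumes Python's `re` module, where a named group is written `(?P<content>...)` instead of `(?<content>...)`; the second test string is made up for the demonstration. It also shows a side effect of the lookaround-only pattern: because the quotes are never consumed, the text between two quoted strings is reported as well.

```python
import re

sample = '<A "SystemTemperatureOutOfSpec" >'

# Python spells a named group (?P<name>...). Its lookbehind must be
# fixed-width, which '<A\s"' is (four characters).
anchored = re.compile(r'(?<=<A\s")(?P<content>.*)(?="\s>)')
content = anchored.search(sample).group("content")
print(content)  # SystemTemperatureOutOfSpec

# Hypothetical input with two quoted strings.
text = 'a "b" c "d"'

# Lookaround-only: the quotes are asserted but not consumed, so the gap
# between the two quoted strings also satisfies the pattern.
lookaround_hits = re.findall(r'(?<=").*?(?=")', text, flags=re.DOTALL)

# Capturing group: the quotes are consumed, so only the insides are returned.
capture_hits = re.findall(r'"([^"]*)"', text)

print(lookaround_hits)  # ['b', ' c ', 'd']
print(capture_hits)     # ['b', 'd']
```

For input that may contain more than one quoted string, the capturing-group form is usually the less surprising choice.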

Regular expressions don't get much easier than this, so you should be able to solve it by yourself. Here's how you go about doing that:
The first step is to try to define as precisely as possible what you want to find. Let's start with this: you want to find a quote, followed by some number of characters other than a quote, followed by a quote. Is that correct? If so, our pattern has three parts: "a quote", "some characters other than a quote", and "a quote".
Now all we need to do is figure out what the regular expressions for those patterns are.
A quote
For "a quote", the pattern is literally ". Regular expressions have special characters which you have to be aware of (*, ., etc). Anything that's not a special character matches itself, and " is one of those characters. For a complete list of special characters for your language, see the documentation.
Characters other than a quote
So now the question is, how do we match "characters other than a quote"? That sounds like a range (more formally, a character class). A range is square brackets with a list of allowable characters. If the list begins with ^ it means it is a list of not-allowed characters. We want any characters other than a quote, so that means [^"].
"Some"
That range just means any one of the characters in the range, but we want "some". "Some" usually means either zero-or-more, or one-or-more. You can place * after a part of an expression to mean zero-or-more of that part. Likewise, use + to mean one-or-more (and ? means zero-or-one). There are a few other variations, but that's enough for this problem.
So, "some characters other than a quote" is the range [^"] (any character other than a quote) followed by * (zero-or-more). Thus, [^"]*
Putting it all together
This is the easy part: just combine all the pieces. A quote, followed by some characters other than a quote, followed by a quote, is "[^"]*".
Capturing the interesting part
The pattern we have will now match your string. What you want, however, is just the part inside the quotes. For that you need a "capturing group", which is denoted by parentheses. To capture a part of a regular expression, put it in parentheses. So, if we want to capture everything but the beginning and ending quote, the pattern becomes "([^"]*)".
And that's how you learn regular expressions. Break your problem down into a precise statement composed of short sequences of characters, figure out the regular expression for each sequence, then put it all together.
The pattern in this answer may not actually be the perfect answer for you. There are some edge cases to worry about. For example, you may only want to match a quote following a non-word character, or only quotes at the beginning or end of a word. That's all possible, but is highly dependent on your exact problem. Figuring out how to do that is just as easy though -- decide what you want, then look at the documentation to see how to accomplish that.
Spend one day practicing on regular expressions and you'll never have to ask anyone for help with regular expressions for the rest of your career. They aren't hard, but they do require concentrated study.
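The step-by-step construction in this answer can be sketched in Python as follows. The variable names are illustrative only; the point is that each named piece corresponds to one phrase of the plain-language description.

```python
import re

quote = '"'                        # "a quote": a literal quote matches itself
not_quote = '[^"]'                 # any one character other than a quote
some_not_quote = not_quote + "*"   # "some": zero or more of them

# Parentheses around the middle part capture just the text inside the quotes.
pattern = quote + "(" + some_not_quote + ")" + quote

match = re.search(pattern, '<A "SystemTemperatureOutOfSpec" >')
inside = match.group(1) if match else None
print(pattern)  # "([^"]*)"
print(inside)   # SystemTemperatureOutOfSpec
```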

Are you sure you need regular expression matching here? Looking at your "string", you might be better off using an XML parser.

Related

E-mail address validation using Regular Expressions

I'm writing a simple, small app that allows me to share information. I have a question about using regex to validate an email address.
I'm learning on my own, but when it comes to real-world examples, such as strings that can be validated with regular expressions, I'm kind of stuck.
Exercise:
Untangle the following regular expression that validates an email address:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
It looks like a jumble of characters.
Can someone please explain to me how this works?
I have tried using the online resources by Jan Goyvaerts.
I would appreciate any help.
First of all, there is a good thread about exactly the same thing:
Using a regular expression to validate an email address
Below is an explanation of your regular expression:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+
- The square brackets represent a character class ("symbol class"), which matches any one of the symbols listed inside the square brackets. The plus sign ('+') is a quantifier, which means that the sequence of symbols matched by this class must be at least one character long.
Also, the '+' is greedy, and, therefore, this part of the pattern will match the symbol sequence of the maximal possible length.
Talking about the square brackets contents, 'a-z' means any symbol in a range, which could be described mathematically as [a, z], and '0-9' is similar. All the other symbols are just symbols in this case.
(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
- In regular expressions, parentheses represent grouping, and the asterisk ('*') is a greedy quantifier, which means "occurs zero or more times". So here we are not sure whether we are going to find the group's content, but we do not rule out the possibility.
Then, inside the parentheses, we see the ?: character combination, which, placed right after the opening parenthesis, tells us that the group should not be captured as a sub-string for later reference.
Going further, \. means just a literal dot (see Escape sequence), since an unescaped dot is a meta-symbol in regex.
After the dot we again see the character class explained above.
@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
- Here we see the at symbol ('@'), which is just a literal symbol here. Then there is a non-capturing group, which will occur one or more times (because of the + after it). It contains a single symbol of the [a-z0-9] class and another non-capturing group, whose contents you can fully describe using my explanations above, except for the question mark ('?'), which means "either once or not at all" in this context (i.e. when it is used as a quantifier).
[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
- This last part is similar to what is found in a symbol group, explained above, so I believe you have now enough information to understand it.
More on quantifier types here: Greedy vs. Reluctant vs. Possessive Quantifiers.
A good Regular Expressions reference: Regular Expression Language - Quick Reference
Some information on capturing in Regular Expressions: Regex Tutorial - Parentheses for Grouping and Capturing
About special characters: Regex Tutorial - Literal Characters and Special Characters
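One way to make the structure described above easier to see is to assemble the same pattern from named pieces. The following Python sketch does that; the piece names (`atom`, `local_part`, `label`, `domain`) and the helper `looks_like_email` are illustrative choices, not part of the original pattern, and the literal at sign is assumed between the local part and the domain.

```python
import re

# One run of characters allowed before the at sign.
atom = r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+"
# Runs separated by single dots: no leading, trailing, or doubled dots.
local_part = atom + r"(?:\." + atom + r")*"
# A domain label starts and ends with a letter or digit; hyphens only inside.
label = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
# One or more "label." groups, then a final label.
domain = r"(?:" + label + r"\.)+" + label

email_re = re.compile(local_part + "@" + domain)


def looks_like_email(text: str) -> bool:
    """Hypothetical helper: True if the whole string fits the pattern."""
    return email_re.fullmatch(text) is not None


print(looks_like_email("john.doe@example.com"))   # True
print(looks_like_email("john..doe@example.com"))  # False: doubled dot
print(looks_like_email("john@example"))           # False: no dot in the domain
print(looks_like_email("john@-example.com"))      # False: label starts with a hyphen
```

As written, the pattern only covers lowercase letters, so an address with uppercase letters is rejected unless the pattern is compiled with a case-insensitive flag.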
Regex statements can be fun yet tricky to follow. There are 5 parts to this statement.
One or more valid characters for a username
[a-z0-9!#$%&'*+/=?^_`{|}~-]+
Zero or more groups of a single '.' followed by additional valid characters
(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
The '@' symbol
Valid second / lower level domain
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
A valid top level domain
[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
I recommend http://www.ultrapico.com/expresso.htm. It will break the statement down for you.
I've found a remarkable tool for visualizing regular expressions here: http://regexper.com
It shows me that your regular expression breaks down like this. Hopefully this helps explain it.
[a-z0-9!#$%&'*+/=?^_`{|}~-]+
This looks for at least one of the characters given here (a-z, 0-9, and those special characters).
(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)
This looks for the same as above, but only when it stands after a dot. This part is optional and can be repeated indefinitely. It prevents dots at the end of the name.
@
Matches the @ symbol
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
This matches a label that starts and ends with a-z or 0-9, with optional - characters in the middle, followed by a dot. This has to be matched at least once.
[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
This looks for a-z or 0-9, optionally followed by more a-z, 0-9, or - characters, but it can't end with a -.
I have two suggestions for you.
1. Escaping special characters is messy. 2. Email addresses are complicated. If you are really interested, I recommend studying this post. Please also check out these other posts: Validation in Regex and Regex Help.
See this answer. The problem is probably too difficult to solve completely. You have two problems here: 1. Regexes are not easy. 2. Escaping special characters is messy. Finally, email addresses are complicated.

Regex to check if a string contains at least A-Za-z0-9 but not an &

I am trying to check if a string contains at least A-Za-z0-9 but not an &.
My experience with regexes is limited, so I started with the easy part and got:
.*[a-zA-Z0-9].*
However, I am having trouble combining this with the "does not contain an &" portion.
I was thinking along the lines of ^(?=.*[a-zA-Z0-9].*)(?![&()]).* but that does not seem to do the trick.
Any help would be appreciated.
I'm not sure if this is what you meant, but here is a regular expression that will match any string that:
contains at least one alpha-numeric character
does not contain a &
This expression ensures that the entire string is always matched (the ^ and $ at beginning and end), and that none of the characters matched are a "&" sign (the [^&]* sections):
^[^&]*[a-zA-Z0-9][^&]*$
However, it might be clearer in code to simply perform two checks, if you are not limited to a single expression.
Also, check out the \w class in regular expressions (it might be the better solution for catching alphanumeric chars if you want to allow non-ASCII characters, though note that \w also matches the underscore).
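The single-expression and two-check approaches from this answer can be compared side by side. The Python sketch below is only an illustration: the sample strings and the function name `ok_two_checks` are made up, and the lookahead variant is one possible repair of the asker's attempt, not something taken from the answer.

```python
import re

# Single expression from the answer above.
single_re = re.compile(r"^[^&]*[a-zA-Z0-9][^&]*$")
# Assumed lookahead variant: require an alphanumeric somewhere, then forbid '&'.
lookahead_re = re.compile(r"^(?=.*[a-zA-Z0-9])[^&]*$")


def ok_two_checks(s: str) -> bool:
    """Two plain checks: has an ASCII alphanumeric, and has no ampersand."""
    return re.search(r"[a-zA-Z0-9]", s) is not None and "&" not in s


samples = ["abc", "a&b", "---", "x y-1", "&", ""]
results = [
    (s, bool(single_re.match(s)), bool(lookahead_re.match(s)), ok_two_checks(s))
    for s in samples
]
for row in results:
    print(row)
```

On these samples all three approaches agree: only "abc" and "x y-1" are accepted.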

Regexp Question - Negating a captured character

I'm looking for a regular expression that allows for either single-quoted or double-quoted strings, and allows the opposite quote character within the string. For example, the following would both be legal strings:
"hello 'there' world"
'hello "there" world'
The regexp I'm using uses negative lookahead and is as follows:
(['"])(?:(?!\1).)*\1
I think this would work, but what if the language didn't support negative lookahead? Is there any other way to do this, without alternation?
EDIT:
I know I can use alternation. This was more of just a hypothetical question. Say I had 20 different characters in the initial character class. I wouldn't want to write out 20 different alternations. I'm trying to actually negate the captured character, without using lookahead, lookbehind, or alternation.
This is actually much simpler than you may have realized. You don't really need the negative look-ahead. What you want to do is a non-greedy (or lazy) match like this:
(['"]).*?\1
The ? character after the .* is the important part. It says, consume the minimum possible characters before hitting the next part of the regex. So, you get either kind of quote, and then you go after 0-M characters until you encounter a character matching whichever quote you first ran into. You can learn more about greedy matching vs. non-greedy here and here.
Sure:
'([^']*)'|"([^"]*)"
On a successful match, Perl's $+ variable will hold the contents of whichever alternate matched.
In the general case, regexps are not really the answer. You might be interested in something like Text::ParseWords, which tokenizes text, accounting for nested quotes, backslashed quotes, backslashed spaces, and other oddities.
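To see how the three patterns from this thread behave on the question's examples, here is a small Python sketch. The test sentence and the helper name `matches` are assumptions made for the demonstration.

```python
import re

text = '''say "hello 'there' world" and 'hello "there" world' now'''

lookahead = r'''(['"])(?:(?!\1).)*\1'''       # negative lookahead from the question
lazy = r'''(['"]).*?\1'''                      # lazy quantifier, no lookahead
alternation = "'[^']*'" + "|" + '"[^"]*"'      # one alternative per quote type


def matches(pattern: str) -> list:
    """Return the full text of every non-overlapping match in the sample."""
    return [m.group(0) for m in re.finditer(pattern, text)]


lookahead_hits = matches(lookahead)
lazy_hits = matches(lazy)
alternation_hits = matches(alternation)
print(lookahead_hits)
print(lazy_hits)
print(alternation_hits)
```

All three return the same two strings here. Note that `.` does not match a newline by default, so the first two patterns would need a single-line (DOTALL) flag to handle quoted text that spans lines.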

Regex for matching a character, but not when it's enclosed in quotes

I need to match a colon (':') in a string, but not when it's enclosed by quotes - either a " or ' character.
So the following should have 2 matches
something:'firstValue':'secondValue'
something:"firstValue":'secondValue'
but this should only have 1 match
something:'no:match'
If the regular expression implementation supports look-around assertions, try this:
:(?:(?<=["']:)|(?=["']))
This will match any colon that is either preceded or followed by a double or single quote. So it only considers constructs like the ones you mentioned; something:firstValue would not be matched.
It would be better if you build a little parser that reads the input byte-by-byte and remembers when quotation is open.
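A little parser of the kind suggested here might look like the following Python sketch. The function name `unquoted_colon_positions` is made up, and the sketch assumes that a quoted section is closed only by the same quote character that opened it and that there are no escape sequences.

```python
def unquoted_colon_positions(text: str) -> list:
    """Return the zero-based indexes of colons that are outside quotes."""
    positions = []
    open_quote = None  # the quote character currently open, if any
    for index, char in enumerate(text):
        if open_quote is None:
            if char in "'\"":
                open_quote = char        # a quoted section starts here
            elif char == ":":
                positions.append(index)  # colon outside any quotes
        elif char == open_quote:
            open_quote = None            # the matching quote closes the section
    return positions


print(unquoted_colon_positions("something:'firstValue':'secondValue'"))  # [9, 22]
print(unquoted_colon_positions("something:'no:match'"))                  # [9]
```

The single variable `open_quote` is exactly the piece of state that a plain regular expression has no good way to carry.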
Regular expressions are stateless. Tracking whether you are inside of quotes or not is state information. It is, therefore, impossible to handle this correctly using only a single regular expression. (Note that some "regular expression" implementations add extensions which may make this possible; I'm talking solely about "true" regular expressions here.)
Doing it with two regular expressions is possible, though, provided that you're willing to modify the original string or to work with a copy of it. In Perl:
$string =~ s/['"][^'"]*['"]//g;
my $match_count = () = $string =~ /:/g;
The first will find every sequence consisting of a quote, followed by any number of non-quote characters, and terminated by a second quote, and remove all such sequences from the string. This will eliminate any colons which are within quotes. (something:"firstValue":'secondValue' becomes something:: and something:'no:match' becomes something:)
The second does a simple count of the remaining colons, which will be those that weren't within quotes to start with.
Just counting the non-quoted colons doesn't seem like a particularly useful thing to do in most cases, though, so I suspect that your real goal is to split the string up into fields with colons as the field delimiter, in which case this regex-based solution is unsuitable, as it will destroy any data in quoted fields. In that case, you need to use a real parser (most CSV parsers allow you to specify the delimiter and would be ideal for this) or, in the worst case, walk through the string character-by-character and split it manually.
If you tell us the language you're using, I'm sure somebody could suggest a good parser library for that language.
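For comparison, the same two-step idea (remove the quoted sections, then count what is left) can be sketched in Python. The function name `count_unquoted_colons` is hypothetical; the substitution pattern mirrors the Perl one above, including its limitation that it does not distinguish between the two quote types.

```python
import re


def count_unquoted_colons(text: str) -> int:
    # Step 1: remove every quote, non-quote run, quote sequence.
    stripped = re.sub(r"""['"][^'"]*['"]""", "", text)
    # Step 2: count the colons that remain.
    return stripped.count(":")


print(count_unquoted_colons("something:\"firstValue\":'secondValue'"))  # 2
print(count_unquoted_colons("something:'no:match'"))                    # 1
```

As the answer points out, this works on a modified copy of the string, so it is suitable for counting but not for splitting the original into fields.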
Oops ... I missed the point. Forget the rest. It's quite hard to do this because regex is not good at counting balanced characters (the .NET implementation, for example, has an extension that can do it, but it's a bit complicated).
You can use negated character classes to do this.
[^'"]:[^'"]
You can further wrap the negated classes in non-capturing groups.
(?:[^'"]):(?:[^'"])
Or you can use lookaround assertions.
(?<!['"]):(?!['"])
I've come up with the following slightly worrying construction:
(?<=^('[^']*')*("[^"]*")*[^'"]*):
It uses a lookbehind assertion to make sure you match an even number of quotes from the beginning of the line to the current colon. It allows for embedding a single quote inside double quotes and vice versa. As in:
'a":b':c::"':" (matches at positions 6, 8 and 9)
EDIT
Gumbo is right: using * within a lookbehind assertion is not allowed in most regex flavors.
You can try to capture the strings within the quotes:
/(?<q>'|")([\w ]+)(\k<q>)/m
The first group defines the allowed quote types; the second group takes all word characters, digits, and spaces.
The nice thing about this solution is that it matches ONLY strings whose opening and closing quotes are the same.
Try it at regex101.com
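A rough Python equivalent of the named-backreference pattern above, as a sketch only. Python writes the named group as `(?P<q>...)` and the backreference as `(?P=q)`; the sample sentence is invented for the demonstration.

```python
import re

# (?P<q>...) names the opening quote; (?P=q) requires the same character again.
quoted_re = re.compile(r"""(?P<q>'|")([\w ]+)(?P=q)""")

text = """say "hello world" or 'bye' but not "mixed'"""
contents = [m.group(2) for m in quoted_re.finditer(text)]
print(contents)  # ['hello world', 'bye']
```

The last fragment opens with a double quote and closes with a single quote, so it is skipped.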

How can I match a quote-delimited string with a regex?

If I'm trying to match a quote-delimited string with a regex, which of the following is "better" (where "better" means both more efficient and less likely to do something unexpected):
/"[^"]+"/ # match quote, then everything that's not a quote, then a quote
or
/".+?"/ # match quote, then *anything* (non-greedy), then a quote
Assume for this question that empty strings (i.e. "") are not an issue. It seems to me (no regex newbie, but certainly no expert) that these will be equivalent.
Update: Upon reflection, I think changing the + characters to * will handle empty strings correctly anyway.
You should use number one, because number two is bad practice. Consider that the developer who comes after you wants to match strings that are followed by an exclamation point. Should he use:
"[^"]*"!
or:
".*?"!
The difference appears when you have the subject:
"one" "two"!
The first regex matches:
"two"!
while the second regex matches:
"one" "two"!
Always be as specific as you can. Use the negated character class when you can.
Another difference is that [^"]* can span across lines, while .* doesn't unless you use single line mode. [^"\n]* excludes the line breaks too.
As for backtracking, the second regex backtracks for each and every character in every string that it matches. If the closing quote is missing, both regexes will backtrack through the entire file. Only the order in which they backtrack is different. Thus, in theory, the first regex is faster. In practice, you won't notice the difference.
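The two differences described in this answer are easy to reproduce. A short Python sketch, using the answer's own subject string plus a made-up two-line string:

```python
import re

subject = '"one" "two"!'

# Negated class: the match cannot run past the closing quote of "one".
negated = re.search(r'"[^"]*"!', subject).group(0)
# Lazy dot: the match may expand across quotes until a quote followed by '!' appears.
lazy = re.search(r'".*?"!', subject).group(0)
print(negated)  # "two"!
print(lazy)     # "one" "two"!

# Line breaks: [^"] matches a newline, while . does not unless DOTALL is set.
multiline = '"first\nsecond"'
spans_lines = re.search(r'"[^"]*"', multiline) is not None
dot_default = re.search(r'".*?"', multiline) is not None
dot_dotall = re.search(r'".*?"', multiline, flags=re.DOTALL) is not None
print(spans_lines, dot_default, dot_dotall)  # True False True
```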
More complicated, but it handles escaped quotes and also escaped backslashes (an escaped backslash followed by a quote is not a problem):
/(["'])((\\{2})*|(.*?[^\\](\\{2})*))\1/
Examples:
"hello\"world" matches "hello\"world"
"hello\\"world" matches "hello\\"
I would suggest:
([\"'])(?:\\\1|.)*?\1
But only because it handles escaped quote chars and allows both the ' and " to be the quote char. I would also suggest looking at this article that goes into this problem in depth:
http://blog.stevenlevithan.com/archives/match-quoted-string
However, unless you have a serious performance issue or cannot be sure of embedded quotes, go with the simpler and more readable:
/".*?"/
I must admit that non-greedy patterns are not the basic Unix-style 'ed' regular expression, but they are getting pretty common. I still am not used to group operators like (?:stuff).
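The escape-handling patterns from the two answers above can be tried on the two example strings. The Python sketch below also includes a third pattern, `"(?:[^"\\]|\\.)*"`, which is a commonly used alternative for double quotes only and is not taken from this thread; the variable names are illustrative.

```python
import re

# Pattern from the "more complicated" answer (either quote type).
pairs_re = re.compile(r'''(["'])((\\{2})*|(.*?[^\\](\\{2})*))\1''')
# Pattern from the backreference answer (escaped quote, or any character).
backref_re = re.compile(r'''(["'])(?:\\\1|.)*?\1''')
# Assumed alternative: a non-quote, non-backslash character, or a backslash
# followed by any one character.
escape_aware_re = re.compile(r'"(?:[^"\\]|\\.)*"')

subjects = [r'"hello\"world"', r'"hello\\"world"']

pairs_hits = [pairs_re.search(s).group(0) for s in subjects]
backref_hits = [backref_re.search(s).group(0) for s in subjects]
escape_aware_hits = [escape_aware_re.search(s).group(0) for s in subjects]

print(pairs_hits)
print(backref_hits)
print(escape_aware_hits)
```

On the second subject, the backreference pattern returns the whole string, because it reads the second backslash as escaping the quote. The other two patterns stop at `"hello\\"`, treating the two backslashes as one escaped backslash.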
I'd say the second one is better, because it fails faster when the terminating " is missing. The first one will backtrack over the string, a potentially expensive operation. An alternative regexp, if you are using Perl 5.10 or later, would be /"[^"]++"/. It conveys the same meaning as version 1 does, but is as fast as version two.
I'd go for number two since it's much easier to read. But I'd still like to match empty strings so I would use:
/".*?"/
From a performance perspective (extremely heavy, long-running loop over long strings), I could imagine that
"[^"]*"
is faster than
".*?"
because the latter would do an additional check for each step: peeking at the next character. The former would be able to mindlessly roll over the string.
As I said, in real-world scenarios this would hardly be noticeable. Therefore I would go with number two (if my current regex flavor supports it, that is) because it is much more readable. Otherwise with number one, of course.
Using the negated character class prevents matching when the boundary character (doublequotes, in your example) is present elsewhere in the input.
Your example #1:
/"[^"]+"/ # match quote, then everything that's not a quote, then a quote
matches only the smallest pair of matched quotes -- excellent, and most of the time that's all you'll need. However, if you have nested quotes, and you're interested in the largest pair of matched quotes (or in all the matched quotes), you're in a much more complicated situation.
Luckily, Damian Conway comes to the rescue: Text::Balanced is there for you if you find that there are multiple matched quote marks. It also has the virtue of matching other paired punctuation, e.g. parentheses.
I prefer the first regex, but it's certainly a matter of taste.
The first one might be more efficient?
Search for double-quote
add double-quote to group
for each char:
    if double-quote:
        break
    add to group
add double-quote to group
Vs something a bit more complicated involving back-tracking?
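The pseudocode above translates almost line for line into Python. This is only a sketch of the idea, not of how any particular regex engine works; the function name `match_quoted` is made up, and returning `None` when the closing quote is missing is an added assumption that the pseudocode does not spell out.

```python
def match_quoted(text: str):
    """Return the first "..." run in text, or None if there is none."""
    open_at = text.find('"')              # search for double-quote
    if open_at == -1:
        return None
    group = ['"']                         # add double-quote to group
    for char in text[open_at + 1:]:       # for each char
        if char == '"':
            group.append('"')             # add closing double-quote to group
            return "".join(group)
        group.append(char)                # add to group
    return None                           # assumption: no closing quote, no match


print(match_quoted('say "hi" now'))   # "hi"
print(match_quoted('"unterminated'))  # None
```

The loop never looks back at a character it has already accepted, which is the intuition behind expecting the negated character class to be cheap.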
Considering that I didn't even know about the "*?" thing until today, and I've been using regular expressions for 20+ years, I'd vote in favour of the first. It certainly makes it clear what you're trying to do - you're trying to match a string that doesn't include quotes.