Regular Expression groups ignoring comma inside parenthesis - regex

I know that are plenty of regular expressions around here similar to what I am going to ask, but couldn't find one that actually helps me.
This one got close, but it uses Java split method, but I need to capture the values using only regular expressions:
Java: splitting a comma-separated string but ignoring commas in quotes
So, what I need to do is, given the below input:
string,string([a-zA-Z]{0,9}),integer
I would like to capture 3 matches:
string
string([a-zA-Z]{0,9})
integer
Note that inside the parenthesis we can have a regular expression, which means almost any chars, even comma.
I can't use split here, because I am not using Java, but an internal declarative programming that uses ICU regular expressions and has an API for capturing groups, but not a regex based split method.
Any help would be appreciated. And I am really sorry if there exists other posts that could be duplicated as this one, but I have spent a few hours looking around, and even played with the post I mentioned, but couldn't get to a solution.
Thanks
EDIT
The input I provided is just an example, but other inputs are also possible.
Besides, after #sin comments, I have reviewed the input, and we can actually assume we'll have quotes inside the parenthesis, like that:
string("[\w]{0,9}"),integer,string

Related

Regex Replacement Syntax for number of replacent group occurences

Take the sample string:
__________Hello
I want to replace lines starting with 10 x _ with 20 x _
Desired output:
____________________Hello
I can do this a number of ways, i.e:
/^(_{10})/\1\1/
/^_{10}/____________________/
/^(__________)/\1\1/
etc...
Question:
Is there a way within the regex specification/expression itself - say PCRE (or any regex library/engine for that matter) - to specify the replacement occurence of a character ?
For example:
/_{10}/_{20}/
I don't know if I'm having a mind blank or if I've just never done this, but I cannot seem to find any such thing in the regex specification docs.
It can't be done within the Regex itself.
If I have the input "39572a4872" and I want to replace it with "39572aaaaa4872", there are many simple ways to achieve that, which can include Regular expressions, but as Wiktor explained in the comment thread, the actual quantifier of the replacement is not something itself that is achieved through regex.
It may seem unimportant, since in this example I could simply just apply the replacement 5 times manually or programatically, but one of the benefits of standardized technologies is applying the same concepts in different environments, languages, even within programs.
I as well as many others have had a lot of success with the portability of my regex because of this.
This question was to see if specifying quantifiers for replacement strings was possible within the syntax of a regex itself. Which it is surely not.

Remove text between two characters (parenthesis) in a string

I'm working on a project and I want to remove text between two parentheses in a string.
Example:
std::string str = "I want to remove (this)."
How would I go about doing that?
I've searched google and stackoverflow an haven't found anything.
I'd use a regular expression for that. Check out the link I provided. As for the expression to use the following expression
(\()(?:[^\)\\]*(?:\\.)?)*\)
That guy worked for me.
Conditionally replace regex matches in string
Do not get regular and common expressions confused. This is not like the more common expression of :-) or :-O or >:( All-though effective These expressions are mutually exclusive expressions that not many languages understand but are more commonly used.

PCRE Regex Syntax

I guess this is more or less a two-part question, but here's the basics first: I am writing some PHP to use preg_match_all to look in a variable for strings book-ended by {}. It then iterates through each string returned, replaces the strings it found with data from a MySQL query.
The first question is this: Any good sites out there to really learn the ins and outs of PCRE expressions? I've done a lot of searching on Google, but the best one I've been able to find so far is http://www.regular-expressions.info/. In my opinion, the information there is not well-organized and since I'd rather not get hung up having to ask for help whenever I need to write a complex regex, please point me at a couple sites (or a couple books!) that will help me not have to bother you folks in the future.
The second question is this: I have this regex
"/{.*(_){1}(.*(_){1}[a-z]{1}|.*)}/"
and I need it to catch instances such as {first_name}, {last_name}, {email}, etc. I have three problems with this regex.
The first is that it sees "{first_name} {last_name}" as one string, when it should see it as two. I've been able to solve this by checking for the existence of the space, then exploding on the space. Messy, but it works.
The second problem is that it includes punctuation as part of the captured string. So, if you have "{first_name} {last_name},", then it returns the comma as part of the string. I've been able to partially solve this by simply using preg_replace to delete periods, commas, and semi-colons. While it works for those punctuation items, my logic is unable to handle exclamation points, question marks, and everything else.
The third problem I have with this regex is that it is not seeing instances of {email} at all.
Now, if you can, are willing, and have time to simply hand me the solution to this problem, thank you as that will solve my immediate problem. However, even if you can do this, please please provide an lmgfty that provides good web sites as references and/or a book or two that would provide a good education on this subject. Sites would be preferable as money is tight, but if a book is the solution, I'll find the money (assuming my local library system is unable to procure said volume).
Back then I found PHP's own PCRE syntax reference quite good: http://uk.php.net/manual/en/reference.pcre.pattern.syntax.php
Let's talk about your expression. It's quite a bit more verbose than necessary; I'm going to simplify it while we go through this.
A rather simpler way of looking at what you're trying to match: "find a {, then any number of letters or underscores, then a }". A regular expression for that is (in PHP's string-y syntax): '/\{[a-z_]+\}/'
This will match all of your examples but also some wilder ones like {__a_b}. If that's not an option, we can go with a somewhat more complex description: "find a {, then a bunch of letters, then (as often as possible) an underscore followed by a bunch of letters, then a }". In a regular expression: /\{([a-z]+(_[a-z]+)*\}/
This second one maybe needs a bit more explanation. Since we want to repeat the thing that matches _foo segments, we need to put it in parentheses. Then we say: try finding this as often as possible, but it's also okay if you don't find it at all (that's the meaning of *).
So now that we have something to compare your attempt to, let's have a look at what caused your problems:
Your expression matches any characters inside the {}, including } and { and a whole bunch of other things. In other words, {abcde{_fgh} would be accepted by your regex, as would {abcde} fg_h {ijkl}.
You've got a mandatory _ in there, right after the first .*. The (_){1} (which means exactly the same as _) says: whatever happens, explode if this ain't here! Clearly you don't actually want that, because it'll never match {email}.
Here's a complete description in plain language of what your regex matches:
Match a {.
Match a _.
Match absolutely anything as long as you can match all the remaining rules right after that anything.
Match a _.
Match a single letter.
Instead of that _ and the single letter, absolutely anything is okay, too.
Match a }.
This is probably pretty far from what you wanted. Don't worry, though. Regular expressions take a while to get used to. I think it's very helpful if you think of it in terms of instructions, i.e. when building a regular expression, try to build it in your head as a "find this, then find that", etc. Then figure out the right syntax to achieve exactly that.
This is hard mainly because not all instructions you might come up with in your head easily translate into a piece of a regular expression... but that's where experience comes in. I promise you that you'll have it down in no time at all... if you are fairly methodical about making your regular expressions at first.
Good luck! :)
For PCRE, I simply digested the PCRE manpages, but then my brain works that way anyway...
As for matching delimited stuff, you generally have 2 approaches:
Match the first delimiter, match anything that is not the closing delimiter, match the closing delimiter.
Match the first delimiter, match anything ungreedily, match the closing delimiter.
E.g. for your case:
\{([^}]+)\}
\{(.+?)\} - Note the ? after the +
I added a group around the content you'd likely want to extract too.
Note also that in the case of #1 in particular but also for #2 if "dot matches anything" is in effect (dotall, singleline or whatever your favourite regex flavour calls it), that they would also match linebreaks within - you'd need to manually exclude that and anything else you don't want if that would be a problem; see the above answer for if you want something more like a whitelist approach.
Here's a good regex site.
Here's a PCRE regex that will work: \{\w+\}
Here's how it works:
It's basically looking for { followed by one ore more word characters followed by }. The interesting part is that the word character class actually includes an underscore as well. \w is essentially shorthand for [A-Za-z0-9_]
So it will basically match any combination of those characters within braces and because of the plus sign will only match braces that are not empty.

Regular expression for finding swear words unless they are football teams

In AS3, I have created a nice swear filter routine that imports a list of regular expressions for swear words and combines them into a single regular expression. However, one bit I'm having problem over are football teams, namely ARSENAL and SCUNTHORPE.
Is there a way in a regular expression to block the swearwords unless they complete the words to be the above? I tried the following with ARSENAL but it didn't work properly:
/arse[^(nal)]/gi
The problem is that I cannot parenthesise the letters "nal" because it sees the parentheses as characters rather than a block. It appears to expect at least one extra character after "arse" in order to work. Can I make it so that it will allow one but not the other? How can I group letters together and say "not"?
EDIT: I found elsewhere on Stack some talk of "negative lookahead"s but didn't quite get how I could do that for these two use cases... Any ideas?
Just use the word anchor \b: \bswearwordhere\b.
Of course, you'd have to do with whatever s---ty workaround those ba**ar*s will invent to circumvent your f-*"-ng rules, heh.
I don't know about Actionscript specifically but in most Regex engines you can use
negative lookahead: ?!
negative lookbehind: ?<!
So for Arsenal:
/arse(?!nal)/gi
And Scunthorp or sHAPPYhorp:
/HAPPY(?<!sHAPPY)(?!horp)/gi
And Scunthorp will be similar to sHAPPYhorp, left as an assignment for the reader.

Lua string.match uses irregular regular expressions?

I'm curious why this doesn't work, and need to know why/how to work around it; I'm trying to detect whether some input is a question, I'm pretty sure string.match is what I need, but:
print(string.match("how much wood?", "(how|who|what|where|why|when).*\\?"))
returns nil. I'm pretty sure Lua's string.match uses regular expressions to find matches in a string, as I've used wildcards (.) before with success, but maybe I don't understand all the mechanics? Does Lua require special delimiters in its string functions? I've tested my regular expression here, so if Lua used regular regular expressions, it seems like the above code would return "how much wood?".
Can any of you tell me what I'm doing wrong, what I mean to do, or point me to a good reference where I can get comprehensive information about how Lua's string manipulation functions utilize regular expressions?
Lua doesn't use regex. Lua uses Patterns, which look similar but match different input.
.* will also consume the last ? of the input, so it fails on \\?. The question mark should be excluded. Special characters are escaped with %.
"how[^?]*%?"
As Omri Barel said, there's no alternation operator. You probably need to use multiple patterns, one for each alternative word at the beginning of the sentence. Or you could use a library that supports regex like expressions.
According to the manual, patterns don't support alternation.
So while "how.*" works, "(how|what).*" doesnt.
And kapep is right about the question mark being swallowed by the .*.
There's a related question: Lua pattern matching vs. regular expressions.
As they have already answered before, it is because the patterns in lua are different from the Regex in other languages, but if you have not yet managed to get a good pattern that does all the work, you can try this simple function:
local function capture_answer(text)
local text = text:lower()
local pattern = '([how]?[who]?[what]?[where]?[why]?[when]?[would]?.+%?)'
for capture in string.gmatch(text, pattern) do
return capture
end
end
print(capture_answer("how much wood?"))
Output: how much wood?
That function will also help you if you want to find a question in a larger text string
Ex.
print(capture_answer("Who is the best football player in the world?\nWho are your best friends?\nWho is that strange guy over there?\nWhy do we need a nanny?\nWhy are they always late?\nWhy does he complain all the time?\nHow do you cook lasagna?\nHow does he know the answer?\nHow can I learn English quickly?"))
Output:
who is the best football player in the world?
who are your best friends?
who is that strange guy over there?
why do we need a nanny?
why are they always late?
why does he complain all the time?
how do you cook lasagna?
how does he know the answer?
how can i learn english quickly?