Regular expressions with an alternative if the first one doesn't match - regex

I need a regular expression that takes a function signature as input and returns the name of the function, i.e. I may have the following input:
FUNCTION(A,B,C)
and after applying the following regular expression:
^(.*?)(?=\()
I correctly obtain the word "FUNCTION" as expected.
However, sometimes I can get the name of the function WITHOUT parentheses (and therefore without parameters), like this:
FUNCTION
In this case, the previous regex fails and doesn't capture the name. Is there any way to define a regex that, if the first expression cannot match, tries another one? (In this case, that would mean taking the whole input.)

From what I see, you want to match the first n characters other than (, ) and space.
Thus, it is much more efficient to use
^[^()\s]+
See demo
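A quick check of this pattern in Python (a minimal sketch; the question does not say which regex engine is in use, and the helper name function_name is made up for illustration):

```python
import re

# Leading run of characters that are not "(", ")" or whitespace.
NAME_PATTERN = re.compile(r'^[^()\s]+')

def function_name(signature):
    """Hypothetical helper: return the name, or None if there is none."""
    match = NAME_PATTERN.match(signature)
    return match.group(0) if match else None

print(function_name('FUNCTION(A,B,C)'))  # FUNCTION
print(function_name('FUNCTION'))         # FUNCTION
```

Because the character class simply stops at the first parenthesis or space, no alternative pattern is needed for the case without parameters.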

^(.*?)(?=\(|\s*$|\s)
This should do it for you. You need to use the | (or) operator.
\s*$ === stop if you have 0 or more spaces and then string ends
\s ==== stop at the first instance of space
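The alternation inside the lookahead can be tried out like this (a sketch in Python, which is an assumption, since the question names no engine; the third input with a space is an extra test case, not from the question):

```python
import re

# Lazy capture that stops at "(", at whitespace, or at the end of the string.
pattern = re.compile(r'^(.*?)(?=\(|\s*$|\s)')

for text in ('FUNCTION(A,B,C)', 'FUNCTION', 'FUNCTION (A)'):
    # Each line prints 'FUNCTION'
    print(repr(pattern.match(text).group(1)))
```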

^([^(]+)\s*\(?
Could do what you want.
Explanation :
([^(]+) : one or more character that is not (
\s* : maybe some blank spaces
\(? : optional parenthesis
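A small Python sketch of this pattern, assuming the character class excludes ( as the explanation describes. The input with a space before the parenthesis is an extra test case, not from the question:

```python
import re

# Assumption: the class is [^(] ("not an opening parenthesis").
pattern = re.compile(r'^([^(]+)\s*\(?')

print(repr(pattern.match('FUNCTION(A,B,C)').group(1)))  # 'FUNCTION'
print(repr(pattern.match('FUNCTION').group(1)))         # 'FUNCTION'

# [^(]+ is greedy and also accepts spaces, so blanks before "("
# end up inside the group; strip them if that matters.
print(repr(pattern.match('FUNCTION (A)').group(1)))     # 'FUNCTION '
```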

Related

How is ? used in regular expression in python?

I have this snippet
print(re.sub(r'(-script\.pyw|\.exe)?', '','.exe1.exe.exe'))
The output is 1
If I remove ? from the above snippet and run it as
print(re.sub(r'(-script\.pyw|\.exe)', '','.exe1.exe.exe'))
The output is again the same.
Although I am using ?, it is being greedy and replacing every '.exe' with an empty string.
Is there any workaround to replace only first occurrence?
re.sub(pattern, repl, string, count=0, flags=0)
This is the signature for the re.sub function. Notice the count parameter. If you just want the first occurrence to be replaced, use count=1.
? is a non-greedy modifier only when it follows a repetition operator; when it stands next to anything else, it makes the previous element optional. Thus, your top expression replaces either -script.pyw or .exe or nothing with nothing. Since replacing nothing with nothing doesn't change the string, the top and the bottom version (where the empty string cannot be matched) give the same result.
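The point about empty matches and count can be checked directly (a minimal sketch; the variable names optional and required are only for illustration):

```python
import re

text = '.exe1.exe.exe'
optional = r'(-script\.pyw|\.exe)?'  # may also match the empty string
required = r'(-script\.pyw|\.exe)'   # must consume one of the alternatives

# Both remove every ".exe"; the empty matches of the optional version
# replace nothing with nothing, so the result is identical.
print(re.sub(optional, '', text))  # 1
print(re.sub(required, '', text))  # 1

# count=1 stops after the first replacement.
print(re.sub(required, '', text, count=1))  # 1.exe.exe
```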
The question mark makes the preceding token in the regular expression optional.
Use
print(re.sub(r'(-script\.pyw|\.exe)', '','.exe1.exe.exe', 1))
if you want to remove only the first match.
? is greedy, so if it can match, it will.
For example: aaab? will match aaab instead of aaa
In order to make ? non greedy, you must add an extra ? (this is the same way you make * and + non greedy, by the way)
So aaab?? will just match aaa. Yet, at the same time, aaab??c will match aaabc
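These three cases can be verified in Python, using the same patterns and inputs as in the text above:

```python
import re

# Greedy ?: takes the optional "b" when it is there.
print(re.match(r'aaab?', 'aaab').group())     # aaab

# Lazy ??: skips the optional "b" if the rest of the pattern allows it.
print(re.match(r'aaab??', 'aaab').group())    # aaa

# Here "c" can only match after the "b", so the lazy ?? has to take it.
print(re.match(r'aaab??c', 'aaabc').group())  # aaabc
```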

Regex - Match first occurrence within string, do not return anything before it

I'm doing a search and replace in Notepad++ and am looking for a regex that will literally give me the first ( in a given string, so I can replace it.
I am not interested in any preceding or succeeding characters, literally just the first (.
An example string is:
"starLan(11), -- Deprecated via RFC3635 ethernetCsmacd (6) should be used instead
I'd like to find the first ( (near starLan(11) in this case) so I can replace that character with something else.
It should not match any other ( in the same line, so in this case it should not match the second ( near (6).
All of the examples I've come across seem to be returning everything up to and including the given character, which is not what I'm after in this case.
I would match the following pattern :
^([^(]*)\((.*)$
And replace it with this :
\1X\2
Where X is the text you want to replace your ( with.
It uses back-references to refer to the parts before and after the first (.
Edit : as mentioned by OP, matching ^([^(]*)\( and replacing with \1X is enough.
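The same replacement can be sketched outside Notepad++. The following uses Python's re module as an assumption for illustration, with X standing in for the replacement text as above:

```python
import re

line = ('starLan(11), -- Deprecated via RFC3635 '
        'ethernetCsmacd (6) should be used instead')

# Keep everything before the first "(" (group 1) and swap the "(" for "X".
via_group = re.sub(r'^([^(]*)\(', r'\g<1>X', line)

# Equivalent without a capture group: replace only the first occurrence.
via_count = re.sub(r'\(', 'X', line, count=1)

print(via_group)
print(via_group == via_count)  # True
```

In both variants only the ( after starLan is replaced; the one before 6 is left alone.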
You can use this:
^(.*?)\(
The text captured inside () is available as back-reference $1, so you can replace it with:
$1someText
where someText is the text you want to put in place of the removed (.
The text after the ( is not part of the match, so it stays intact. If you prefer to capture it explicitly as well, you can use:
^(.*?)\((.*)
and replacement as:
$1someText$2

how to define a regular expression in boost?

I have a section in file:
[Source]
[Source.Ia32]
[Source.Ia64]
I have created the expression as:
const boost::regex source_line_pattern ("(Sources)(.*?)");
Now, I am trying to match the string, but I am not able to match; it is always returning 0.
if (boost::regex_match ( sToken, source_line_pattern ) )
return TRUE;
Please note that the sToken value is [Source], [Source.Ia32], and so on.
Thanks,
There are at least two problems with your code. First, the
regular expression you give contains the literal string
"Sources", and not "Source", which is what you seem to be
trying to match. The second is that boost::regex_match is
bound: it must match the entire string. What you seem to want
is boost::regex_search. Depending on what you are doing,
however, it might be better to try to match the entire string:
"\\[Source(?:\\.(\\w+))?\\]\\s*". Which provides for capture of
the trailing part, if present (but not the leading
"Source"---no point, in general, in capturing something that is
a constant).
Note too that the sequence ".*?" is a non-greedy quantifier
(valid in Boost's default Perl syntax). At the end of a pattern
it matches as little as it can, so with boost::regex_search it
will always match the empty string and adds nothing.
The issue is that boost::regex_match only returns true if the entire input string is matched by the regex. So the '[' and ']' are not matched by your current regex, and it will fail.
Your options are either to use boost::regex_search, which will search for a substring of the input that matches your regex, or modify your regex to accept the entire string being passed in.
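The difference between matching the whole string and searching inside it can be illustrated with Python's re module, where re.fullmatch and re.search play roughly the roles of boost::regex_match and boost::regex_search. This is an analogy for illustration, not Boost code, and the first pattern uses "Source" rather than the misspelled "Sources":

```python
import re

tokens = ['[Source]', '[Source.Ia32]', '[Source.Ia64]']

# fullmatch must cover the entire string, so the brackets make it fail;
# search accepts a match anywhere inside the string.
loose = re.compile(r'(Source)(.*?)')
print([bool(loose.fullmatch(t)) for t in tokens])  # [False, False, False]
print([bool(loose.search(t)) for t in tokens])     # [True, True, True]

# Whole-string pattern from the answer above, capturing the optional suffix.
strict = re.compile(r'\[Source(?:\.(\w+))?\]\s*')
print([strict.fullmatch(t).group(1) for t in tokens])  # [None, 'Ia32', 'Ia64']
```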

regex with 3 backreferences but one optional

I have a regular expression that captures three backreferences though one (the 2nd) may be null.
Given the following string:
http://www.google.co.uk/url?sa=t&rct=j&q=site%3Ajonathonoat.es&source=web&cd=1&ved=0CC8QFjAA&url=http%3A%2F%2Fjonathonoat.es%2Fbritish-mozcast%2F&ei=MQj9UKejDYeS0QWruIHgDA&usg=AFQjCNHy1cDoWlIAwyj76wjiM6f2Rpd74w&bvm=bv.41248874,d.d2k,.co.uk,site%3Ajonathonoat.es&source=web,1
I wish to capture the TLD (in this case .co.uk), q param and cd param.
I'm using the following RegEx:
/.*\.google([a-z\.]*).*q=(.*[^&])?.*cd=(\d*).*/i
Which works, except that the 2nd backreference includes the other parameters up to the cd param. I currently get this:
["http://www.google.co.uk/url?sa=t&rct=j&q=site%3Ajo…,d.d2k,.co.uk,site%3Ajonathonoat.es&source=web,1 ", ".co.uk", "site%3Ajonathonoat.es&source=web", "1", index: 0, input: "http://www.google.co.uk/url?sa=t&rct=j&q=site%3Ajo…,d.d2k,.co.uk,site%3Ajonathonoat.es&source=web,1"]
The 1st backreference is correct, it's .co.uk and so is the 3rd; it's 1. I want the 2nd backreference to be either null (or undefined or whatever) or just the q param, in this example site%3Ajonathonoat.es. It currently includes the source param too (site%3Ajonathonoat.es&source=web).
Any help would be much appreciated, thanks!
I've added a JSFiddle of the code, look in your browser console for the output, thanks!
When negating a character class, I always put the quantifier on the class itself:
/.*\.google([a-z\.]*).*q=([^&]*).*cd=(\d*).*/i
I also recommend being careful with bare * and +, as they are "greedy"; prefer *? or +? when you need to stop at a delimiter inside your string. Here the negated class already stops at the & delimiter, so it can stay greedy. For more on greediness, see Jeffrey Friedl's Mastering Regular Expressions.
You want the middle group to be:
q=([^&]*)
This will capture characters other than ampersand. This also allows zero characters, so you can remove the optional group (?).
Working example: http://rubular.com/r/AJkXxgeX5K
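A sketch of the corrected middle group in Python (the question uses JavaScript, so this is only an illustration; the URL is a shortened version of the one in the question):

```python
import re

# Shortened version of the URL from the question.
url = ('http://www.google.co.uk/url?sa=t&rct=j'
       '&q=site%3Ajonathonoat.es&source=web&cd=1&ved=0CC8QFjAA')

# [^&]* stops at the next "&", so the q group no longer swallows "source".
pattern = re.compile(r'.*\.google([a-z.]*).*q=([^&]*).*cd=(\d*).*',
                     re.IGNORECASE)

tld, query, cd = pattern.match(url).groups()
print(tld, query, cd)  # .co.uk site%3Ajonathonoat.es 1
```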

Regular expression - what is my mistake?

I would like to match either any sequence of digits, or the literal: na .
I am using:
"^\d*|na$"
Numbers are being matched, but not na.
What's my mistake?
More info: I'm using this in a regular expression validator for a textbox in ASP.NET (C#).
A blank entry is ok.
It's because the expression is being read (assuming PCRE):
"^\d*" OR "na$"
Some parentheses would take care of that in a jiffy. Choose from (depending on your needs):
"^(\d+|na)$" // this will capture the number or na
"^(?:\d+|na)$" // this one won't capture
Cheers!
The | operator has lower precedence than concatenation, so the anchors ^ and $ each bind to only one alternative. The expression ^\d*|na$ therefore means: match ^\d* or na$. So try this:
^(\d*|na)$
Or:
^\d*$|^na$
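The effect of the grouping can be seen in a small Python sketch. The question concerns an ASP.NET validator, so Python is only an assumption for illustration, and the test values are made up:

```python
import re

ungrouped = re.compile(r'^\d*|na$')
grouped = re.compile(r'^(?:\d*|na)$')

# "^\d*" can match the empty string at position 0, so the ungrouped
# pattern finds a zero-length match in any input at all.
print(repr(ungrouped.search('abc').group()))  # ''

# With the group, both anchors apply to both alternatives.
for value in ('123', 'na', '', '12na', 'nab'):
    print(repr(value), bool(grouped.search(value)))
```

Only '123', 'na' and the empty string are accepted by the grouped pattern.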
Perhaps ^(?:\d*|na)$ would be better. What language/engine? Also, please show the input and, if possible, the snippet of the code.
Also, it is possible that you aren't matching "na" because there is a new line after it. The digits wouldn't be affected because you did not specify a $ anchor for them.
So, depending on the language and how the input is acquired, there might be new-line between "na" and the end of the string, and $ won't match it unless you turn on multi-line match (or strip the string of the new line).
This may not be the best or most elegant way to fix it, but try this:
"^(\d*|[n][a])$"