What is the difference between "a{1}" and "a" in regex? - regex

Some string was matched with the following regex:
([0-9]\s+){1}
Why did the author use {1} at the end of the regex?
Can I safely remove it?

Yes, there is no difference at all. Possibly it was left over from tweaks made while the regex was being built and tested.

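{1} means the preceding group must match exactly once: in your example, one digit followed by one or more whitespace characters. That is already what an unquantified group does.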

It is probably a leftover from debugging/writing the query when the author experimented with {1,2} or so.
Yes, you can remove it.

If it is the output of interpreted code (a log or debug line coming from a script, for example), the 1 could be the value of a variable.
If it is written directly in a script, {1} is the default behavior, so the result is the same (though the parser may do slightly more work to interpret it).
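A quick way to check the equivalence is to run both forms over the same inputs. The sketch below uses Python's re module; the sample strings and the helper name same_matches are made up for illustration and are not from the question.

```python
import re

# The pattern from the question, with and without the {1} quantifier.
WITH_QUANTIFIER = re.compile(r"([0-9]\s+){1}")
WITHOUT_QUANTIFIER = re.compile(r"([0-9]\s+)")


def same_matches(text: str) -> bool:
    """Return True if both patterns find identical spans and groups in text."""
    a = [(m.span(), m.groups()) for m in WITH_QUANTIFIER.finditer(text)]
    b = [(m.span(), m.groups()) for m in WITHOUT_QUANTIFIER.finditer(text)]
    return a == b


# Hypothetical sample inputs, chosen only to exercise a few cases.
samples = ["1 ", "7   x", "12 3 ", "abc", "5"]
results = [same_matches(s) for s in samples]
print(results)
```

If every entry is True, the two patterns behave identically on those inputs, which is what the answers above predict.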

Related

Matching within matches by extending an existing Regex

I'm trying to see if it's possible to extend an existing arbitrary regex by prepending or appending another regex to match within matches.
Take the following example:
The original regex is cat|car|bat so matching output is
cat
car
bat
I want to add to this regex and output only matches that start with 'ca',
cat
car
I specifically don't want to interpret a whole regex, which could be quite a long operation, and then change its internal content to produce the output, as in:
^ca[tr]
or run the original regex and then the second one over the results. I'm taking the original regex as an argument in python but want to 'prefilter' the matches by adding the additional code.
This is probably a slight abuse of regex, but I'm still interested if it's possible. I have tried what I know of subgroups and the following examples but they're not giving me what I need.
Things I've tried:
^ca(cat|car|bat)
(?<=ca(cat|car|bat))
(?<=^ca(cat|car|bat))
It may not be possible but I'm interested in what any regex gurus think. I'm also interested if there is some way of doing this positionally if the length of the initial output is known.
A slightly more realistic example of the initial query might be [a-z]{4}, but if I create (?<=^ca([a-z]{4})) it matches against 6-letter strings starting with ca, not 4-letter ones.
Thanks for any solutions and/or opinions on it.
EDIT: See solution including @Nick's contribution below. The tool I was testing this with (exrex) seems to have a slight bug that, following the examples given, would create matches 6 characters long.
You were not far off with what you tried, only you don't need a lookbehind, but rather a lookahead assertion, and a parenthesis was misplaced. The right thing is: Put the original pattern in parentheses, and prepend (?=ca):
(?=ca)(cat|car|bat)
(?=ca)([a-z]{4})
In the second example (without | alternative), the parentheses around the original pattern wouldn't be required.
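Since the question takes the original regex as an argument in Python, the lookahead approach can be sketched as a small wrapper. The function name prefilter and the word lists are assumptions made for this illustration; a non-capturing group is used here instead of plain parentheses so the original pattern's group numbering is not shifted.

```python
import re


def prefilter(original: str, prefix: str) -> re.Pattern:
    """Wrap an arbitrary pattern so it only matches where `prefix` starts.

    The lookahead checks for the prefix without consuming it, so the
    original pattern still has to match the same characters itself.
    """
    return re.compile(f"(?={re.escape(prefix)})(?:{original})")


words = ["cat", "car", "bat"]
alt = prefilter("cat|car|bat", "ca")
alt_matches = [w for w in words if alt.fullmatch(w)]

four = prefilter("[a-z]{4}", "ca")
four_matches = [w for w in ["cart", "care", "bark", "cartel"] if four.fullmatch(w)]

print(alt_matches, four_matches)
```

Because the lookahead is zero-width, the [a-z]{4} case still matches exactly four letters, not six.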
Ok, thanks to @Armali I've come to the conclusion that (?=ca)(^[a-z]{4}$) works (see https://regexr.com/3f4vo). However, I'm trying this with the great exrex tool to attempt to produce matching strings, and it's producing matches that are 6 characters long rather than 4. This may be a limitation of exrex rather than the regex, which seems to work in other cases.
See @Nick's comment.
I've also raised an issue on the exrex GitHub for this.

Get only first match in Regex

Given this string: hello"C07","73" (quotes included) I want to get "C07". I'm using (?:hello)|(?<=")(?<screen>[a-zA-Z0-9]+)?(?=") to try to do this. However, it consistently matches "73" as well. I've tried ...0-9]+){1}..., but that doesn't work either. I must be misunderstanding how this is supposed to work, but I can't figure out any other way.
How can I get just the first set of characters between quotes?
EDIT: Here's a link to show my problem.
EDIT: Ok, here's exactly what I'm trying to do:
Basically, what I'm trying to get is this: 1) a positive match on "hello", 2) a group named "screen" with, in this case, "C07" in it and 3) a group named "format" with, in this case, "73" in it.
Both the "C07" and "73" will vary. "hello" will always be the same. There may or may not be an extra comma between "hello" and the first double-quote.
For your initial question of how to stop after the first match: either removing the global flag or anchoring the search at the start of the string would accomplish that.
For the latter question you can name your groups and just keep extending the pattern throughout the line(s).
hello"(?<screen>[^"]+)","(?<format>[^"]+)"
Demo: http://regex101.com/r/PBXe8l/1
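For reference, here is a rough Python version of the named-group pattern. Two things differ from the pattern above and are assumptions of this sketch: Python spells named groups as (?P<name>...), and an optional ,? has been added to cover the extra comma the question says may follow "hello". The helper name parse_line is hypothetical.

```python
import re

# Named groups in Python use (?P<name>...); ",?" allows the optional comma.
LINE = re.compile(r'hello,?"(?P<screen>[^"]+)","(?P<format>[^"]+)"')


def parse_line(text: str):
    """Return (screen, format) from the first match, or None if no match."""
    m = LINE.search(text)  # search() stops at the first match
    if m is None:
        return None
    return m.group("screen"), m.group("format")


print(parse_line('hello"C07","73"'))
print(parse_line('hello,"C07","73"'))
```

Using search() rather than findall() is the Python counterpart of dropping the global flag: only the first match is returned.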
Based on your regex example, why not:
^(?:hello)"([a-zA-Z\d]+)"

Regular expression that allows square-brackets

This works perfectly:
result2 = Regex.Replace(result2, "[^A-Za-z0-9/.,>#:\s]", "", RegexOptions.Compiled)
But I need to allow square brackets ([ and ]).
Does this look correct to allow Brackets without changing what is allowed and not allowed from the above?
result2 = Regex.Replace(result2, "[^A-Za-z0-9\[\]/.,>#:\s]", "", RegexOptions.Compiled)
The reason I need a second opinion is that I think, if this is correct, something else that is out of my control is blocking it.
I can't say any one person did or did not answer the question or try to help; I would split the solution among everyone if I could, because it made me think. The key was to escape each bracket with a single \. Thanks, everyone, for your help.
result = Regex.Replace(result, "[^A-Za-z0-9/\[\].,>#\s]", "", RegexOptions.Compiled)
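The same idea can be tried outside .NET. This is a minimal Python sketch, not the original VB.NET code; it uses the character class from the question (including the colon) with both brackets escaped, and the function name strip_disallowed is made up for the example.

```python
import re

# Negated class: anything NOT listed here is removed.
# \[ and \] are escaped so they are treated as literal brackets.
DISALLOWED = re.compile(r"[^A-Za-z0-9\[\]/.,>#:\s]")


def strip_disallowed(text: str) -> str:
    """Remove every character that is not in the allowed set."""
    return DISALLOWED.sub("", text)


cleaned = strip_disallowed("a[1]{2}!")
print(cleaned)
```

Escaping both brackets is the form most likely to carry over unchanged between regex flavors.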
The gimmick @tripleee mentioned does indeed work in .NET. Just make sure ] is the first character (or in this case, the first after the ^).
result2 = Regex.Replace(result2, "[^][A-Za-z0-9/.,>#:\s]", "");
But be careful about porting the regex to other flavors. Some will treat it as a syntax error, and some will treat it as two atoms: [^] and [A-Za-z0-9/.,>#:\s], the first of which matches literally anything but nothing--i.e., any character including newlines.
On a side note, why are you using the RegexOptions.Compiled option? That's something you should use only when you know you need it. The increased performance will almost never be significant, and it comes with a pretty high price tag, as explained here.
http://msdn.microsoft.com/en-us/library/8zbs0h2f.aspx

Help to compose regular expression

I have the following string: user1 fam <user#example.com>, user2 fam <user2#example.com>, ...
How can I get the mail addresses from this string with a regular expression? I need a list of mail addresses as output:
user#example.com
user2#example.com
I tried:
<.*>
But the output includes the < >:
<user#example.com>
<user2#example.com>
Thank you.
P.S. Thank you @xanatos for the comment; I use Erlang.
As the others have said, but to make it faster:
<([^>]*)>
This way the regex won't have to backtrack (with the other regexes suggested, the engine will match the whole string and then begin to backtrack to find a >).
I'll add that, for historical reasons, there are small differences between . and, for example, [\s\S]. Both match any character, except that . does not match \n while [\s\S] does. So by using [^>] you are also matching \n, but this shouldn't be a problem for what you are doing. http://www.regular-expressions.info/dot.html
Just to be complete, because it's a problem that often happens, there is another variant:
<((?:(?!>).)*)>
(You can substitute the . with [\s\S] if you want, or use the SingleLine option if your language supports it, to make the . match newlines as well.) The point here is that the "stop" expression can be longer than one character: instead of (?!>) you could insert (?!%%) and it would stop at %%. BUT I'm not sure this variant works with Erlang (I hadn't noticed the new tag. It wasn't there when I originally read the question, and I'm not an Erlang programmer. And it seems at least two Erlang programmers have different opinions on the matter :-) )
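To illustrate the multi-character "stop" expression, here is a small Python sketch (Python rather than Erlang, since the answer above is unsure about Erlang support). The input string and the %% terminator are invented for the example.

```python
import re

# Match from "<" up to, but not including, the two-character terminator "%%".
# (?:(?!%%).)* consumes one character at a time, but only while the next
# two characters are not "%%".
UNTIL_PERCENTS = re.compile(r"<((?:(?!%%).)*)%%")

text = "<alpha>beta%% <gamma%%"
found = UNTIL_PERCENTS.findall(text)
print(found)
```

A plain negated class such as [^%] could not express this, because a single % inside the content would end the match too early.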
You need to use the ungreedy option so that it only matches the individual bracket pairs,
global so that you can get all the matches,
and {capture, all_but_first, list} so that you get the actual values (list can also be binary if you prefer binary results). all_but_first tells re not to return the whole match (which would include <>), just the group.
Result:
1> S.
"user1 fam <user#example.com>, user2 fam <user2#example.com>, "
2> re:run(S, "<(.+)>", [ungreedy, global, {capture, all_but_first, list}]).
{match,[["user#example.com"],["user2#example.com"]]}
Use groups. See your regex engine's documentation for more details.
>>> re.findall('<(.*?)>', 'user1 fam <user#example.com>, user2 fam <user2#example.com>, ...')
['user#example.com', 'user2#example.com']
Keep it simple and use <([^>]*)> which is about as fast as it can get and works for most versions of regular expressions. This is faster as it never has to backtrack while using <(.*?)> will cause backtracking.

Capture string until first caret sign hit in regex?

I am working with legacy systems at the moment, and a lot of work involves breaking up delimited strings and testing against certain rules.
With this string, how could I return "Active" in a back reference and search terms, stopping when it hits the first caret (^)?:
Active^20080505^900^LT^100
Can it be done with an inclusion in the regex of this "(.+)" ? The reason I ask is that the actual regex "(.+)" is defined in a database as cutting up these messages and their associated rules can be set from a front-end system. The content could be anything ('Active' in this case), that's why ".+" has been used in this case.
Rule: The caret sign cannot appear between the brackets, as that would result in it being stored in the database field too, and it is defined elsewhere in another system field.
If you have a better suggestion than "(.+)", I will be happy to hear it.
Thanks in advance.
(.+?)\^
Should grab up to the first ^
If you have to include (.+) w/o modifications you could use this:
(.+?)\^(.+)
The first backreference will still be the correct one and you can ignore the second.
A regex is really overkill here.
Just take the first n characters of the string where n is the position of the first caret.
Pseudo code:
InputString.Left(InputString.IndexOf("^"))
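The pseudocode above translates to Python roughly as follows. This is a sketch using str.partition; the function name first_field is an assumption. One difference from the pseudocode: if the string contains no caret, partition returns the whole string instead of relying on a -1 index.

```python
def first_field(record: str, delimiter: str = "^") -> str:
    """Return everything before the first delimiter (the whole string if absent)."""
    head, _sep, _rest = record.partition(delimiter)
    return head


status = first_field("Active^20080505^900^LT^100")
print(status)
```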
^([^\^]+)
That should work if your RE library doesn't support non-greediness.
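The two regex suggestions in this thread can be compared side by side. This is a Python sketch, assuming the sample record from the question; the variable names are illustrative only.

```python
import re

record = "Active^20080505^900^LT^100"

# Lazy quantifier: take as little as possible, then require a literal caret.
lazy = re.match(r"(.+?)\^", record)

# Negated class: take every character that is not a caret.
negated = re.match(r"^([^\^]+)", record)

print(lazy.group(1), negated.group(1))

# The two differ when the input has no caret at all:
# the lazy form needs a caret to match, the negated class does not.
no_caret_lazy = re.match(r"(.+?)\^", "Active")
no_caret_negated = re.match(r"^([^\^]+)", "Active")
```

Which behavior is preferable depends on whether a record without any caret should count as valid input.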