Lua pattern parentheses and 0 or 1 occurrence - regex

I'm trying to match a string against a pattern, but there's one thing I haven't managed to figure out. In a regex I'd do this:
Strings:
en
eng
engl
engli
englis
english
Pattern:
^en(g(l(i(s(h?)?)?)?)?)?$
I want all strings to be a match.
In Lua pattern matching I can't get this to work.
Even a simpler example like this won't work:
Strings:
fly
flying
Pattern:
^fly(ing)?$
Does anybody know how to do this?

You can't make match-groups optional (or repeat them) using Lua's quantifiers ?, *, + and -.
In the pattern (%d+)?, the question mark loses its special meaning and simply matches a literal ?, as you can see by executing the following lines of code:
text = "a?"
first_match = text:match("((%w+)?)")
print(first_match)
which will print:
a?
AFAIK, the closest you can come in Lua would be to use the pattern:
^eng?l?i?s?h?$
which (of course) matches strings like "enh", "enls", ... as well.
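For comparison, here is a small Python sketch (Python's re module rather than Lua, purely for illustration) showing how the flattened pattern over-matches compared with the nested optional groups from the question. The variable names are made up for this example.

```python
import re

# Regex with nested optional groups, as written in the question.
NESTED = re.compile(r"^en(g(l(i(s(h?)?)?)?)?)?$")
# Flattened form, which is what a Lua pattern is limited to.
FLAT = re.compile(r"^eng?l?i?s?h?$")

prefixes = ["en", "eng", "engl", "engli", "englis", "english"]
nested_ok = all(NESTED.match(s) for s in prefixes)
flat_ok = all(FLAT.match(s) for s in prefixes)

# Strings accepted by the flattened pattern but rejected by the nested one.
candidates = ["enh", "enls", "engh"]
false_positives = [s for s in candidates if FLAT.match(s) and not NESTED.match(s)]

print(nested_ok, flat_ok, false_positives)
```

Both patterns accept every intended prefix; only the flattened one also accepts the three unwanted strings.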

In Lua, the parentheses are only used for capturing. They don't create atoms.
The closest you can get to the patterns you want is:
'^flyi?n?g?$'
'^eng?l?i?s?h?$'
If you need the full power of a regular expression engine, there are bindings to common engines available for Lua. There's also LPeg, a library for creating PEGs, which comes with a regular expression engine as an example (not sure how powerful it is).

Related

Reg expression to get a string starting from particular string

I'm trying to write a regular expression which returns a string after a particular string.
For example:
The string is
"<https://meraki/api/v1/sm/devices?fields%5B%5D=imei%2Ciccid%2ClastConnected%2CownerEmail%2C+ownerUsername%2CphoneNumber&perPage=1000&startingAfter=0>; rel=first"
The result I'm expecting is: first.
Here is the expression I'm using:
(?<=rel=\s").*(?=\)
Okay so this should work:
(?<=rel[=])[^"]*
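I would advise looking over the regex syntax again: your pattern requires whitespace and a double quote right after rel=, which the input does not contain, and the trailing (?=\) is not a valid group. A lookbehind (?<=pattern) asserts that pattern occurs immediately before the text you want to capture. Likewise, a lookahead (?=pattern) asserts that pattern occurs immediately after it.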
You can test your regex online here (or many other sites). They will show you the matching groups and errors, but will also explain what certain parts of the pattern do.
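As a quick check, the suggested lookbehind can be tried in Python. This is only a sketch: the URL is shortened from the one in the question, and the variable names are illustrative.

```python
import re

# Shortened version of the header value from the question (assumption: the
# query string is irrelevant to the match, so most of it is omitted here).
link_header = "<https://meraki/api/v1/sm/devices?perPage=1000&startingAfter=0>; rel=first"

# (?<=rel=) asserts that "rel=" sits immediately before the match;
# [^"]* then consumes everything up to a double quote or the end of input.
match = re.search(r'(?<=rel=)[^"]*', link_header)
rel = match.group(0) if match else None

print(rel)
```

Python's re module only accepts fixed-width lookbehinds, which is fine here because rel= has a fixed length.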

Is there way to repeat specific regex previous used [duplicate]

Say I have a regex matching a hexadecimal 32 bit number:
([0-9a-fA-F]{1,8})
When I construct a regex where I need to match this multiple times, e.g.
(?<from>[0-9a-fA-F]{1,8})\s*:\s*(?<to>[0-9a-fA-F]{1,8})
Do I have to repeat the subexpression definition every time, or is there a way to "name and reuse" it?
I'd imagine something like (warning, invented syntax!)
(?<from>{hexnum=[0-9a-fA-F]{1,8}})\s*:\s*(?<to>{=hexnum})
where hexnum= would define the subexpression "hexnum", and {=hexnum} would reuse it.
Since I already learnt it matters: I'm using .NET's System.Text.RegularExpressions.Regex, but a general answer would be interesting, too.
RegEx Subroutines
When you want to use a sub-expression multiple times without rewriting it, you can group it then call it as a subroutine. Subroutines may be called by name, index, or relative position.
Subroutines are supported by PCRE, Perl, Ruby, PHP, Delphi, R, and others. Unfortunately, the .NET Framework is lacking, but there are some PCRE libraries for .NET that you can use instead (such as https://github.com/ltrzesniewski/pcre-net).
Syntax
Here's how subroutines work: let's say you have a sub-expression [abc] that you want to repeat three times in a row.
Standard RegEx
Any: [abc][abc][abc]
Subroutine, by Name
Perl:     (?'name'[abc])(?&name)(?&name)
PCRE: (?P<name>[abc])(?P>name)(?P>name)
Ruby:   (?<name>[abc])\g<name>\g<name>
Subroutine, by Index
Perl/PCRE: ([abc])(?1)(?1)
Ruby:          ([abc])\g<1>\g<1>
Subroutine, by Relative Position
Perl:     ([abc])(?-1)(?-1)
PCRE: ([abc])(?-1)(?-1)
Ruby:   ([abc])\g<-1>\g<-1>
Subroutine, Predefined
This defines a subroutine without executing it.
Perl/PCRE: (?(DEFINE)(?'name'[abc]))(?P>name)(?P>name)(?P>name)
Examples
Matches a valid IPv4 address string, from 0.0.0.0 to 255.255.255.255:
((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.(?1)\.(?1)\.(?1)
Without subroutines:
((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))
And to solve the original posted problem:
(?<from>(?P<hexnum>[0-9a-fA-F]{1,8}))\s*:\s*(?<to>(?P>hexnum))
More Info
http://regular-expressions.info/subroutine.html
http://regex101.com/
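Where subroutines are not available (Python's built-in re module, like .NET, does not support (?1) or (?&name)), a similar effect can be had by building the pattern from a named piece. The sketch below is one way to do that for the IPv4 example; OCTET, IPV4 and is_ipv4 are names invented for this illustration.

```python
import re

# One octet, 0-255: the same alternatives as in the pattern above,
# wrapped in a single non-capturing group.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])"

# Four octets joined by literal dots, instead of (?1) subroutine calls.
IPV4 = re.compile(r"\.".join([OCTET] * 4))

def is_ipv4(s: str) -> bool:
    """Return True if the whole string is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(s) is not None

print(is_ipv4("255.255.255.255"), is_ipv4("256.1.1.1"))
```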
Why not do something like this, not really shorter but a bit more maintainable.
String.Format(@"(?<from>{0})\s*:\s*(?<to>{0})", "[0-9a-fA-F]{1,8}");
If you want more self-documenting code, I would assign the number regex string to a properly named constant.
.NET regex does not support pattern recursion (subroutine calls). In Ruby and PHP/PCRE you can use (?<from>(?<hex>[0-9a-fA-F]{1,8}))\s*:\s*(?<to>(\g<hex>)) (where hex is a "technical" named capturing group whose name should not occur elsewhere in the main pattern), but in .NET you can simply define the block(s) as separate variables and then use them to build a dynamic pattern.
Starting with C# 6, you may use an interpolated string literal, which looks very much like a PCRE/Onigmo subpattern recursion but is actually cleaner and has no potential bottleneck when the group is named identically to the "technical" capturing group:
C# demo:
using System;
using System.Text.RegularExpressions;

public class Test
{
    public static void Main()
    {
        // Reusable sub-pattern: a hexadecimal number of 1 to 8 digits.
        var block = "[0-9a-fA-F]{1,8}";
        var pattern = $@"(?<from>{block})\s*:\s*(?<to>{block})";
        Console.WriteLine(Regex.IsMatch("12345678 :87654321", pattern));
    }
}
The $@"..." is a verbatim interpolated string literal, where a backslash is treated as a literal backslash rather than the start of an escape sequence. Make sure to write a literal { as {{ and a literal } as }} (e.g. $@"(?:{block}){{5}}" to repeat a block 5 times).
For older C# versions, use string.Format:
var pattern = string.Format(@"(?<from>{0})\s*:\s*(?<to>{0})", block);
as is suggested in Mattias's answer.
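The same build-the-pattern-from-a-variable idea carries over to other languages. As a rough Python sketch (not part of the original answer), an f-string plays the role of the interpolated literal. The group names start and end are chosen here for illustration in place of from and to, and HEXNUM and RANGE are likewise invented names.

```python
import re

# Reusable sub-pattern: a hexadecimal number of 1 to 8 digits.
HEXNUM = r"[0-9a-fA-F]{1,8}"

# rf"..." is a raw f-string: backslashes stay literal, {HEXNUM} is substituted.
# Braces that belong to the regex itself would have to be doubled ({{ and }}).
RANGE = re.compile(rf"(?P<start>{HEXNUM})\s*:\s*(?P<end>{HEXNUM})")

m = RANGE.fullmatch("12345678 :87654321")
start, end = (m.group("start"), m.group("end")) if m else (None, None)

print(start, end)
```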
If I am understanding your question correctly, you want to reuse certain patterns to construct a bigger pattern?
string f = @"fc\d+/";
string e = @"\d+";
Regex regexObj = new Regex(f + e);
Other than this, using backreferences will only help if you are trying to match the exact same string that you have previously matched somewhere in your regex.
e.g.
/\b([a-z])\w+\1\b/
It will only match "text" and "spaces" in the following text:
This is a sample text which is not the title since it does not end with 2 spaces.
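The backreference behaviour can be checked with a short Python sketch (the variable names are illustrative). \1 has to match the same letter that group 1 captured, so only words that start and end with the same lowercase letter are found.

```python
import re

text = "This is a sample text which is not the title since it does not end with 2 spaces."

# ([a-z]) captures the first letter; \1 requires that same letter again
# at the end of the word.
matches = [m.group(0) for m in re.finditer(r"\b([a-z])\w+\1\b", text)]

print(matches)
```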
There is no such predefined class. I think you can simplify it using the ignore-case option, e.g.:
(?i)(?<from>[0-9a-f]{1,8})\s*:\s*(?<to>[0-9a-f]{1,8})
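To refer back to a named capture group, use this syntax: \k<name> or \k'name'. Note that this is a backreference: it matches the exact text the group captured, not the group's pattern again.
So the following only matches when both numbers are written identically: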
(?<from>[0-9a-fA-F]{1,8})\s*:\s*\k<from>
More info: http://www.regular-expressions.info/named.html
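A small Python sketch illustrates what a named backreference does. Python spells it (?P=name) rather than \k<name>, and the group name first is invented for this example.

```python
import re

# (?P=first) must repeat the exact text captured by the group "first".
SAME_TWICE = re.compile(r"(?P<first>[0-9a-fA-F]{1,8})\s*:\s*(?P=first)")

identical = SAME_TWICE.fullmatch("1a2b : 1a2b") is not None
different = SAME_TWICE.fullmatch("1a2b : 3c4d") is not None

print(identical, different)
```

The second input consists of two valid hexadecimal numbers, yet it is rejected because the texts differ.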

Regular expression dilemma

I've been trying for a few hours to write a pattern for a matching algorithm, and I can't find a solution to the following issue: given the example "my_name_is", I need to extract all the words individually, as well as the whole expression. Consider that the input may be a list of n examples, some of which can be matched and some of which cannot.
"my_name_is" => ["my", "name", "is", "my_name_is"]
How can I do this, and what should the regexp look like? Looking forward to your answers, thank you!
Regular Expressions are patterns used to match a string of characters. We usually use them to validate a string of characters, or to find and replace a specific pattern within text.
Here, it seems the outcome you're looking for is an array of strings that have been split using an underscore. Regex isn't what you're looking for.
Implementation would change based on language, but consider the following code:
function stringToArray($myStr)
{
    // Split on underscores, then append the whole string.
    $words = explode('_', $myStr);
    return array_merge($words, [$myStr]);
}
Use re.findall with the following as your regex:
[^_]+
This should match all sets of consecutive characters that don't contain the underscore.
As for the whole thing? You already have it, so there's no reason to regex the whole string.
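Put together, a minimal Python sketch of this suggestion could look as follows. words_and_whole is a hypothetical helper name, and the pattern used is the simple [^_]+ form.

```python
import re

def words_and_whole(s: str) -> list:
    """Return the underscore-separated words followed by the whole string."""
    # [^_]+ matches each maximal run of non-underscore characters.
    return re.findall(r"[^_]+", s) + [s]

result = words_and_whole("my_name_is")
print(result)
```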

Regex match sequence more than once

How come I can't find an answer to something this simple after an hour of searching the internet?
I have this sentence:
HeLLo woRLd HOw are YoU
I want to capture all groups that consist of two consecutive capital letters:
[A-Z]{2}
The regex above works but captures only LL (the first two capital letters), while I want LL in one group and RL and HO in the other groups.
Most regular expression engines expose some way to make your expression global, meaning that the expression is applied repeatedly across the input. The global flag is usually denoted with the /g marker at the end of your expression. Without the /g flag, matching stops after the first match (LL); with it, every match is returned (LL, RL and HO).
Different languages expose this functionality differently. In C#, for instance, it is done through Regex.Matches. In Java, you use while (matcher.find()), which keeps returning substrings that match the pattern.
EDIT: I am not a Python person, but judging from the example available here, you could do something like so:
import re

it = re.finditer(r"[A-Z]{2}", "HeLLo woRLd HOw are YoU")
for match in it:
    # match.span() is the (start, end) index pair of this match
    print("'{g}' was found between the indices {s}".format(g=match.group(), s=match.span()))
You cannot have multiple groups in this case, but you can have multiple matches. Add the global flag to your regex and use a method that returns all matches.
For JavaScript, it would be /[A-Z]{2}/g.
The method most probably returns an Array of matches, and you can use index to access them.

Regex: Does not have/include pattern

I have a regex pattern to match an HTML script tag. How can I change this script tag pattern so that the pattern means "input string DOES NOT MATCH" the script tag pattern?
In other words, given a pattern, what is the alteration needed to change the meaning of the pattern to "does not match this pattern"?
For example, if I have a pattern: \d{3}-\d{3}-\d{4}, what is the equivalent pattern for this that means "does not match \d{3}-\d{3}-\d{4}"?
You can negate a regex pattern by using a negative lookahead. This is slightly different than simply negating the regex though. Negative lookahead would look like the following in Java (and many other languages):
(?!\d{3}-\d{3}-\d{4})
It should be noted that this doesn't exactly answer the question. Finding the inverse of a regular language is not an easy task using a regular expression (I don't think). A much easier way to solve the problem would be to invert the program logic:
Instead of:
if (string.matches(yourRegex))
Do:
if (!string.matches(yourRegex))
That is not easily achievable for arbitrary patterns. In practice, it's almost always easier to do what you want in the surrounding code than in the pattern itself. For instance, instead of
grep -P '\d{3}-\d{3}-\d{4}' file
you could use
grep -vP '\d{3}-\d{3}-\d{4}' file
Or in a program you could change something like
if (pattern.matches()) {
foo();
}
into something like
if (!pattern.matches()) {
foo();
}
In a more tedious approach, you would have to enumerate all possible values that should match instead of what should not match. So, say you want to match everything but the string <html>, you could write a regex like so:
([^<]|<([^h]|h([^t]|t([^m]|m([^l]|l[^>])))))
Reading that regex is like saying: "Okay, you can match any character but '<', or you could match '<' but then you can't match an 'h' after that... or you do match an 'h' after that but then you can't match a 't' after that... and so on."
It's butt ugly, but then again, for simple string matches, you can easily write a recursive function that transforms any given term into a pattern like the above.
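The recursive transformation mentioned above might be sketched like this in Python. exclude_term is a hypothetical helper written for this illustration. It reproduces the hand-written pattern for <html> and so inherits its limitations (for instance, overlapping prefixes such as "<<html>" are not handled).

```python
import re

def exclude_term(term: str) -> str:
    """Build a pattern for one step that cannot spell out `term`."""
    if not term:
        raise ValueError("term must be non-empty")
    head = re.escape(term[0])
    if len(term) == 1:
        # Last character: anything but this character.
        return f"[^{head}]"
    # Either a different character, or this character followed by
    # something that fails to complete the rest of the term.
    return f"([^{head}]|{head}{exclude_term(term[1:])})"

pattern = exclude_term("<html>")
print(pattern)
```

Repeating the generated group with * and matching the whole input accepts a string such as <body> and rejects <html> itself.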
easier to just negate the test surely? eg...
if (!regex.test(str)) ...
(javascript example)
Negating a character class is easy with ^ but a whole regex will get much more convoluted.
What language are you using? The easiest solution to the specific problem you stated is to simply prepend a negation operator (usually "!") to the match.
I definitely agree with the other answers saying you should negate testing for a match, but this should do what you want using just a regex:
^(?!.*\d{3}-\d{3}-\d{4}).*$
This uses a negative lookahead anchored at the start of the string. The regex basically means "fail on any string that contains, after any number of characters (.*), something matching \d{3}-\d{3}-\d{4}"; otherwise .*$ matches the whole string. Without the ^ anchor, a search could still succeed at a later position in the string.
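To compare the two approaches from this thread, here is a minimal Python sketch: one function negates the test in code, the other relies on a negative lookahead applied to the whole string. The function and constant names are invented for this example.

```python
import re

PHONE = r"\d{3}-\d{3}-\d{4}"

def lacks_phone_by_negation(s: str) -> bool:
    # Negate the result of an ordinary search in the surrounding code.
    return re.search(PHONE, s) is None

# The lookahead runs at the start of the string; fullmatch makes sure
# it cannot be bypassed by starting the match at a later position.
NO_PHONE = re.compile(rf"(?!.*{PHONE}).*", re.DOTALL)

def lacks_phone_by_lookahead(s: str) -> bool:
    return NO_PHONE.fullmatch(s) is not None

samples = ["call 555-123-4567 now", "no number here", "123-456-7890"]
results = [(lacks_phone_by_negation(s), lacks_phone_by_lookahead(s)) for s in samples]
print(results)
```

Both functions agree on these inputs; the negated test is usually the easier one to read and maintain.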