How to write this regular expression in Lua? - regex

I'm new to Lua's regex-like pattern matching features. I need to write the following regular expression, which should match numbers with decimals:
\b[0-9]*.\b[0-9]*(?!])
Basically, it matches numbers in decimal format (e.g. 1, 1.1, 0.1, 0.11) which do not end with ']'. I've been trying to write a pattern like this in Lua using string.gmatch, but I'm quite inexperienced with Lua matching expressions...
Thanks!

Lua does not have regular expressions, mainly because a full regular expression library would be bigger than Lua itself.
What Lua has instead are matching patterns, which are way less powerful (but still sufficient for many use cases):
There is no "word boundary" matcher,
no alternatives,
and also no lookahead or similar.
I think there is no Lua pattern which would match every possible occurrence of your string and nothing else, which means that you must somehow work around this.
The pattern proposed by Stuart, %d*%.?%d*, matches all decimal numbers (with or without a dot), but it also matches the empty string, which is not very useful. %d+%.?%d* matches all decimal numbers with at least one digit before the dot (or without a dot), %d*%.?%d+ matches all decimal numbers with at least one digit after the dot (or without a dot). %.%d+ matches decimal numbers without a digit before the dot.
A simple solution would be to search for more than one of these patterns (for example, both %d+%.?%d* and %.%d+), and combine the results. Then look at the places where you found them and check whether a ']' follows them.
I experimented a bit with the frontier pattern.
The pattern %f[%.%d]%d*%.?%d*%f[^%.%d%]] matches all decimal numbers which are preceded by something that is neither digit nor dot (or by nothing), and followed by something that is neither ] nor digit nor dot (or by nothing). It also matches the single dot, though.

"%d*%.?%d+" will match all such numbers in decimal format (note that that's going to miss any signed numbers such as -1.1 or +3.14). You'll need to come up with another solution to avoid instances that end with ], such as removing them from the string before looking for the numbers:
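local pattern = "%d*%.?%d+"
-- remove every number that is directly followed by "]" (orig is the input string)
local clean = string.gsub(orig, pattern .. "%]", "")
-- iterate over the numbers that remain
return string.gmatch(clean, pattern)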

Related

Can I write a regular expression that checks two lengths are equal?

I want to match strings with two numbers of equal length, like: 42-42, 0-2, 12345-54321.
I don't want to match strings where the two numbers have different lengths, like: 42-1, 000-0000.
The two parts (separated by the hyphen) must have the same length.
I wonder if it is possible to do a regexp like [0-9]{n}-[0-9]{n} with n variable but equal?
If there is no clean way to do that in one pattern (I must put that in the pattern attribute of an HTML form input), I will do something like /\d-\d|\d{2}-\d{2}|\d{3}-\d{3}|<etc>/ up to the maximum length (16 in my case).
This is not possible with a true regular expression, because the language is not a type-3 (regular) grammar: matching it requires counting the digits on the left and checking that the same number follows on the right, which a finite automaton cannot do. It is a type-2 (context-free) grammar, which can only be matched by regular expression engines that support recursion, and the JavaScript engine behind the HTML pattern attribute does not.
The higher grammar levels (type-1 and type-0) need still more powerful machines, up to a Turing machine (or something equivalent, like your programming language).
More about this can be found here:
https://en.wikipedia.org/wiki/Chomsky_hierarchy#The_hierarchy
Using a programming language, you need to count the first sequence of digits, check for the minus and then check that the same number of digits follows.
In an engine with recursion support (PCRE, for example), it could be solved with a recursive regular expression like this: ^(\d(?1)\d|-)$
So you need to write your own, non-regular-expression check code.
You should probably split the string around the separator and compare the lengths of both parts.
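The split-and-compare check described above can be sketched as follows. This is only an illustration in Python, not the asker's form-validation code; the function name has_equal_length_parts and the restriction to ASCII digits are assumptions.

```python
def has_equal_length_parts(value: str, separator: str = "-") -> bool:
    """Return True if value looks like <digits><separator><digits>
    with the same number of digits on both sides (hypothetical helper)."""
    left, sep, right = value.partition(separator)
    if not sep:
        # The separator is missing entirely.
        return False
    # Both parts must be non-empty runs of ASCII digits.
    for part in (left, right):
        if not (part.isascii() and part.isdigit()):
            return False
    return len(left) == len(right)


if __name__ == "__main__":
    for sample in ["42-42", "0-2", "12345-54321", "42-1", "000-0000"]:
        print(sample, has_equal_length_parts(sample))
```

Because the check counts instead of enumerating lengths, it needs no upper bound on the number of digits.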
The tool of choice in regex for specifying "the same thing as before" is the back-reference. However, back-references refer to the matched value rather than the matching pattern: there is no way of using a back-reference to .{3} to match any 3 characters.
However, if you only need to validate a finite number of lengths, it can be (painfully) done with alternation:
\d-\d will match exactly 1 digit on both sides of the separator
\d-\d|\d{2}-\d{2} will match either 1 or 2 digits on both sides of the separator, with the same count on each side
...
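Writing the alternation by hand up to 16 digits is tedious, so it can be generated. The sketch below is an assumption-laden illustration in Python: the helper name equal_length_pattern is made up, and [0-9] is used instead of \d so that only ASCII digits match. The resulting string could be pasted into the pattern attribute, which anchors the whole pattern implicitly; re.fullmatch plays that role here.

```python
import re


def equal_length_pattern(max_len: int) -> str:
    # Build "[0-9]{1}-[0-9]{1}|[0-9]{2}-[0-9]{2}|..." up to max_len digits.
    return "|".join(
        rf"[0-9]{{{n}}}-[0-9]{{{n}}}" for n in range(1, max_len + 1)
    )


# 16 is the maximum length mentioned in the question.
PATTERN = equal_length_pattern(16)


def is_valid(value: str) -> bool:
    # fullmatch requires the whole string to match one of the alternatives.
    return re.fullmatch(PATTERN, value) is not None


if __name__ == "__main__":
    print(equal_length_pattern(2))
    for sample in ["42-42", "12345-54321", "42-1", "000-0000"]:
        print(sample, is_valid(sample))
```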

How to define a regex that matches whole words treating "." like a normal letter

I am trying to read several numbers from a string in Matlab. The aim is to do what str2num does, but without using eval (and much less advanced).
I already have a regex for matching a valid double number:
'([-+]?([0-9]*\.[0-9]+|[0-9]+\.|[0-9]+)([eE][-+]?([0-9]*\.[0-9]+|[0-9]+\.|[0-9]+))?)'
Which works fine for valid substrings such as "1.15e2.4". My problem is that I want to avoid matching invalid substrings such as "1.15.e2.4" (which splits to "1.15" and "2.4").
When I match only whole words (using \< and \>), the invalid string is split into "1.15" and "4", because the decimal point is considered a word boundary.
For now I am using look-around expressions:
'((?<=^|[ :,])[-+]?([0-9]*\.[0-9]+|[0-9]+\.|[0-9]+)([eE][-+]?([0-9]*\.[0-9]+|[0-9]+\.|[0-9]+))?(?=$|[ :,]))'
but I wonder if there is an easier and more general way.
Is it possible to redefine which characters are considered word boundaries?
You cannot redefine what a word boundary means. But you can achieve the same effect by using negative lookarounds:
(?<!\.)\< your first regex here \>(?!\.)
Not dramatically simpler than your second regex, but more robust since it does exactly what it says: disallows . as a word boundary.
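The same idea can be sketched in Python, with some adaptation: Python's re module has no \< and \> anchors, so this version assumes it is acceptable to fold the word-boundary test into the lookarounds themselves, rejecting a match that touches a word character or a dot on either side. The names NUMBER, WHOLE_NUMBER and extract_numbers are illustrative only.

```python
import re

# The number regex from the question, with non-capturing groups.
NUMBER = (
    r"[-+]?(?:[0-9]*\.[0-9]+|[0-9]+\.|[0-9]+)"
    r"(?:[eE][-+]?(?:[0-9]*\.[0-9]+|[0-9]+\.|[0-9]+))?"
)

# A match may not be preceded or followed by a word character or a dot,
# so "." behaves like a normal letter at the boundaries.
WHOLE_NUMBER = re.compile(r"(?<![\w.])" + NUMBER + r"(?![\w.])")


def extract_numbers(text: str) -> list[str]:
    return WHOLE_NUMBER.findall(text)


if __name__ == "__main__":
    print(extract_numbers("1.15e2.4 1.15.e2.4 -3 7."))
```

With this input the invalid token 1.15.e2.4 is skipped as a whole instead of being split into pieces.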

how to find integer with comma and zeros after that (regex)?

I am trying to create regexes to extract all integers. An integer can be 6 or -12, but also +6.000 or -5,0. I also need another one to extract real numbers which are not integers, for example 3.14 or -6,26, but not 5.0.
For finding integers I tried "^[+-]?([0-9]+)(\\[.,]0{1,})?$" but it doesn't work on -6.00. And I have no idea how to create the second regex (how to exclude integers followed by a comma or dot and then zeros). Any help appreciated.
The problem with your integer regex appears to be the backslash(es). I don't know any regex engine in which you would need to escape the opening bracket of a character class, and you certainly don't want to match a literal backslash. Also, to a regex engine that understands it at all, the quantifier {1,} is an uglier, more complex way of saying +.
This should do your integer matching:
"^[+-]?[0-9]+([.,]0+)?$"
And this variation should do your non-integer matching:
"^[+-]?[0-9]+[.,]0*[1-9][0-9]*$"
In both cases I omitted parentheses not needed for expressing a correct pattern, but if you need to capture parts of the match then you will want to add some back in. You might also want to convert the grouping parentheses into non-capturing form if you are using a regex engine that supports it.
Also, the real number pattern requires at least one digit before the fraction separator character, per your examples. It would be easy to convert the pattern to also match strings of the form .1 or -.17. Similarly, the integer pattern requires at least one zero in the fraction part if there is a fraction separator, and that restriction could be removed, too.
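To see how the two patterns divide the inputs from the question, here is a small Python sketch. The question does not say which regex engine is in use, so Python is an assumption, and the classify helper is purely illustrative.

```python
import re

# The two patterns from the answer above, unchanged.
INTEGER = re.compile(r"^[+-]?[0-9]+([.,]0+)?$")
NON_INTEGER = re.compile(r"^[+-]?[0-9]+[.,]0*[1-9][0-9]*$")


def classify(text: str):
    """Return "integer", "non-integer", or None if neither pattern matches."""
    if INTEGER.match(text):
        return "integer"
    if NON_INTEGER.match(text):
        return "non-integer"
    return None


if __name__ == "__main__":
    for sample in ["6", "-12", "+6.000", "-5,0", "-6.00", "3.14", "-6,26", "5.0"]:
        print(sample, classify(sample))
```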

RegEx - Match Numbers of Variable Length

I'm trying to parse a document that has reference numbers littered throughout it.
Text text text {4:2} more incredible text {4:3} much later on
{222:115} and yet some more text.
The references will always be wrapped in brackets, and there will always be a colon between the two. I wrote an expression to find them.
{[0-9]:[0-9]}
However, this obviously fails the moment you come across a two or three digit number, and I'm having trouble figuring out what that should be. There won't ever be more than 3 digits; {999:999} is the maximum size to deal with.
Anybody have an idea of a proper expression for handling this?
{[0-9]+:[0-9]+}
try adding plus(es)
What regex engine are you using? Most of them will support the following expression:
\{\d+:\d+\}
The \d is actually shorthand for [0-9], but the important part is the addition of + which means "one or more".
Try this:
{[0-9]{1,3}:[0-9]{1,3}}
The {1,3} means "match between 1 and 3 of the preceding characters".
You can specify how many times you want the previous item to match by using {min,max}.
{[0-9]{1,3}:[0-9]{1,3}}
Also, you can use \d for digits instead of [0-9] for most regex flavors:
{\d{1,3}:\d{1,3}}
You may also want to consider escaping the outer { and }, just to make it clear that they are not part of a repetition definition.
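As a sketch of the escaped, bounded form applied to the sample text, assuming Python's re module (the question does not name an engine) and a made-up helper name find_references:

```python
import re

# Escaped outer braces, one to three digits on each side of the colon.
REFERENCE = re.compile(r"\{(\d{1,3}):(\d{1,3})\}")


def find_references(text: str) -> list[tuple[int, int]]:
    # Each match yields the two captured numbers, converted to integers.
    return [(int(a), int(b)) for a, b in REFERENCE.findall(text)]


if __name__ == "__main__":
    sample = (
        "Text text text {4:2} more incredible text {4:3} much later on\n"
        "{222:115} and yet some more text."
    )
    print(find_references(sample))
```

The capture groups are an addition for convenience; dropping the parentheses gives the plain pattern from the answers.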

How can I check if at least one of two subexpressions in a regular expression match?

I am trying to match floating-point decimal numbers with a regular expression. There may or may not be a number before the decimal, and the decimal may or may not be present, and if it is present it may or may not have digits after it. (For this application, a leading +/- or a trailing "E123" is not allowed). I have written this regex:
/^([\d]*)(\.([\d]*))?$/
Which correctly matches the following:
1
1.
1.23
.23
However, this also matches empty string or a string of just a decimal point, which I do not want.
Currently I am checking after running the regex that $1 or $3 has length greater than 0. If not, it is not valid. Is there a way I can do this directly in the regex?
I think this will do what you want. It either starts with a digit, in which case the decimal point and digits after it are optional, or it starts with a decimal point, in which case at least one digit is mandatory after it.
/^(\d+(\.\d*)?|\.\d+)$/
Create a regular expression for each case and OR them. Then you only need test if the expression matches.
/^((\d+(\.\d*)?)|(\d*\.\d+))$/
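One detail matters when ORing the cases: alternation binds more loosely than the anchors, so the alternatives have to be grouped for ^ and $ to apply to both of them. A minimal Python sketch of the "digits first, or dot first" approach follows; the original question uses Perl-style syntax, so Python and the name is_decimal are assumptions.

```python
import re

# Either digits with an optional fractional part, or a dot followed by digits.
# The non-capturing group keeps the alternation inside the full-match check.
DECIMAL = re.compile(r"(?:\d+(?:\.\d*)?|\.\d+)")


def is_decimal(text: str) -> bool:
    # fullmatch anchors the pattern at both ends of the string.
    return DECIMAL.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["1", "1.", "1.23", ".23", "", ".", "1abc"]:
        print(repr(sample), is_decimal(sample))
```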
A very late answer, but I'd still like to add one, taken from regular-expressions.info:
[-+]?[\d]*\.?[\d]+?
Update: [\d]*\.?[\d]+?|[\d]+\. will also help you match "1."
http://regex101.com/r/lJ7fF4/7