Nginx Lua regex match first word

I am trying to convert a regex into a Lua pattern, from
([a-zA-Z0-9._-/]+)
to
^%w+?([_-]%w+)
I want to match the first word, including any '-' and '_':
mar_paci (toto totot)
toi-re/3.9
pouri marc (sensor)
Phoenix; SAGEM
The expected result:
mar_paci
toi-re
pouri marc
Phoenix
The code used:
value = string.match(ngx.var.args, "^%w+?([_-]%w+)")
In the ^%w+?([_-]%w+) pattern, I added the ? character to make the string optional.

You can use
^[%w%s_-]*%w
It matches
^ - start of string
[%w%s_-]* - zero or more alphanumerics, whitespaces, _ or hyphens
%w - an alphanumeric char.
See the Lua demo:
local function extract(text)
    return string.match(text, "^[%w%s_-]*%w")
end
print(extract("mar_paci (toto totot)"))
-- => mar_paci
print(extract("toi-re/3.9"))
-- => toi-re
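For readers who want to check the pattern against all four sample inputs outside of Lua, here is a rough Python translation. It is only a sketch: `FIRST_TOKEN` and `extract` are illustrative names, and `[A-Za-z0-9]` is assumed to be an acceptable ASCII stand-in for Lua's `%w` (Python's `\w` would also match `_`).

```python
import re

# Python counterpart of the Lua pattern ^[%w%s_-]*%w.
# Assumption: [A-Za-z0-9] stands in for %w (ASCII only).
FIRST_TOKEN = re.compile(r'^[A-Za-z0-9\s_-]*[A-Za-z0-9]')


def extract(text):
    """Return the leading run of alphanumerics, whitespace, '_' and '-'
    that ends on an alphanumeric character, or None if there is none."""
    match = FIRST_TOKEN.match(text)
    return match.group(0) if match else None


samples = [
    "mar_paci (toto totot)",
    "toi-re/3.9",
    "pouri marc (sensor)",
    "Phoenix; SAGEM",
]

for sample in samples:
    print(repr(sample), "->", repr(extract(sample)))
# 'mar_paci (toto totot)' -> 'mar_paci'
# 'toi-re/3.9' -> 'toi-re'
# 'pouri marc (sensor)' -> 'pouri marc'
# 'Phoenix; SAGEM' -> 'Phoenix'
```

The trailing alphanumeric in the pattern is what trims the space before "(" in the first and third samples: the greedy class backtracks until the match ends on a letter or digit.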

Related

Regex absolute beginner: filter alphanumeric

I'm playing Codewars in Ruby and I'm stuck on a kata. The goal is to validate whether a user input string is alphanumeric. (yes, this is quite advanced Regex)
The instructions:
At least one character ("" is not valid)
Allowed characters are uppercase / lowercase latin letters and digits from 0 to 9
No whitespaces/underscore
What I've tried:
^[a-zA-Z0-9]+$
^(?! !)[a-zA-Z0-9]+$
^((?! !)[a-zA-Z0-9]+)$
It passes all the tests except one. Here's the error message:
Value is not what was expected
I thought the regex I'm using would satisfy all the conditions. What am I missing?
SOLUTION: \A[a-zA-Z0-9]+\z (and better Ruby :^) )
$ => end of a line
\z => end of a string
(same for beginning: ^ (line) and \A (string), but wasn't needed for the test)
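Favourite answer from another player:
/\A[A-z\d]+\z/
Note that the [A-z] range also covers the six ASCII characters between Z and a ([, \, ], ^, _ and `), so this pattern accepts underscores; [A-Za-z\d] is the safer class.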
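The difference between line anchors and string anchors can be reproduced outside Ruby. The sketch below uses Python, where `re.MULTILINE` is passed only to imitate Ruby's always-line-based `^` and `$`, and where `\Z` plays the role of Ruby's `\z`. The helper names are illustrative, not part of the kata.

```python
import re

# Line-anchored: with re.MULTILINE, ^ and $ match at line boundaries,
# which mimics how ^ and $ behave in Ruby.
LINE_ANCHORED = re.compile(r'^[a-zA-Z0-9]+$', re.MULTILINE)

# String-anchored: \A and \Z only match at the very start and end of the string.
STRING_ANCHORED = re.compile(r'\A[a-zA-Z0-9]+\Z')


def passes_line_check(text):
    return LINE_ANCHORED.search(text) is not None


def passes_string_check(text):
    return STRING_ANCHORED.search(text) is not None


tricky = "abc\n!!!"  # one valid line followed by an invalid one

print(passes_line_check(tricky))      # True  (the first line alone satisfies the pattern)
print(passes_string_check(tricky))    # False (the whole string must be alphanumeric)
print(passes_string_check("abc123"))  # True
print(passes_string_check(""))        # False
```

A multi-line input like `tricky` is the kind of case that a line-anchored pattern lets through.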

One option is to start with an expression similar to:
^(?=[A-Za-z0-9])[A-Za-z0-9]+$
and test whether it covers the desired rules.
The expression is explained in this demo.
Test
re = /^(?=[A-Za-z0-9])[A-Za-z0-9]+$/m
str = '
ab
c
def
abc*
def^
'
# Print the match result
str.scan(re) do |match|
  puts match.to_s
end

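str !~ /[^A-Za-z\d]/
A string contains only alphanumeric characters if and only if it contains no character other than an alphanumeric one. Note that this check on its own also accepts the empty string, so the "at least one character" rule needs a separate test.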
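The same negated-class idea, sketched in Python with an explicit non-empty check added so that the "at least one character" rule is also covered. `is_alphanumeric` is an illustrative name, and `[^A-Za-z0-9]` is used instead of `\d` because Python's `\d` also matches non-ASCII digits in `str` patterns.

```python
import re

# Matches any single character that is NOT an ASCII letter or digit.
NON_ALNUM = re.compile(r'[^A-Za-z0-9]')


def is_alphanumeric(text):
    """True when text is non-empty and contains no character outside A-Z, a-z, 0-9."""
    return bool(text) and NON_ALNUM.search(text) is None


print(is_alphanumeric("abc123"))  # True
print(is_alphanumeric(""))        # False
print(is_alphanumeric("ab c"))    # False
print(is_alphanumeric("a_b"))     # False
print(is_alphanumeric("abc\n"))   # False
```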

Regex conditional lookout

My input text file is like
A={5,6},B={2},C={3}
B={2,4}
A={5},B={1},C={3}
A={5},B={2},C={3,4,QWERT},D={TXT}
I would like to match all the lines where A=5,B=2 and C=3. The catch is, if variable is not mentioned, then that variable can take any value and hence that line also needs to be matched.
The regex should match lines 1, 2 and 4.
I tried
.*?(?:(?=A)A\{.*?5).*?(?:(?=B)B\{.*?2).*?(?:(?=C)C\{.*?3)
https://regex101.com/r/NN9qk5/1
But it is not working.
I will be using this regex in Python 3.6 code.

If you want to solve it with a regex, you may use
^
(?!.*\bA={(?![^{}]*\b5\b))
(?!.*\bB={(?![^{}]*\b2\b))
(?!.*\bC={(?![^{}]*\b3\b))
.*
See the regex demo
The point is to fail a match if there is a key that contains no given number value inside braces.
E.g. (?!.*\bA={(?![^{}]*\b5\b)) is a negative lookahead that fails the match if, immediately to the right of the current location, there is the following sequence:
- .* - any 0+ chars other than line break chars
- \bA - a whole word A
- ={ - ={ substring
- (?![^{}]*\b5\b) - that is not followed with any 0+ chars other than { and } and then followed with 5 as a whole word.
Sample usage in Python 3.6:
import re
s = """A={5,6},B={2},C={3}
B={2,4}
A={5},B={1},C={3}
A={5},B={2},C={3,4,QWERT},D={TXT}"""
given = { 'A': '5', 'B': '2', 'C': '3'}
reg_pattern = ''
for key,val in given.items():
    reg_pattern += r"(?!.*\b{}={{(?![^{{}}]*\b{}\b))".format(key,val)
reg = re.compile(reg_pattern)
for line in s.splitlines():
    if reg.match(line):
        print(line)
Output:
A={5,6},B={2},C={3}
B={2,4}
A={5},B={2},C={3,4,QWERT},D={TXT}
Note the use of re.match: this method only looks for a match at the start of the string, so there is no need to add the ^ anchor (which matches the string start).
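If a regex is not a hard requirement, the same rule ("a key that is present must list the wanted value; a missing key matches anything") can be expressed by parsing each line first. This is an alternative sketch, not the approach of the answer above; `PAIR`, `parse_line` and `line_matches` are hypothetical names, and the input is assumed to always follow the KEY={v1,v2,...} layout shown in the question.

```python
import re

# One KEY={v1,v2,...} entry; assumes values never contain braces.
PAIR = re.compile(r'(\w+)=\{([^{}]*)\}')


def parse_line(line):
    """Map each key on the line to the set of values listed inside its braces."""
    return {key: set(values.split(',')) for key, values in PAIR.findall(line)}


def line_matches(line, given):
    """A line matches when every key it mentions from `given` lists the wanted value."""
    parsed = parse_line(line)
    return all(key not in parsed or val in parsed[key] for key, val in given.items())


text = """A={5,6},B={2},C={3}
B={2,4}
A={5},B={1},C={3}
A={5},B={2},C={3,4,QWERT},D={TXT}"""
given = {'A': '5', 'B': '2', 'C': '3'}

matching = [line for line in text.splitlines() if line_matches(line, given)]
for line in matching:
    print(line)
# A={5,6},B={2},C={3}
# B={2,4}
# A={5},B={2},C={3,4,QWERT},D={TXT}
```

Comparing whole values in a set avoids the need for \b word boundaries, at the cost of an extra parsing step.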

Lua gsub - How to set max character limit in regex pattern

From strings that are similar to this string:
|cff00ccffkey:|r value
I need to remove |cff00ccff and |r to get:
key: value
The problem is that |cff00ccff is a color code. I know it always starts with |c but the next 8 characters could be anything. So I need a gsub pattern to get the next 8 characters (alpha-numeric only) after |c.
How can I do this in Lua? I have tried:
local newString = string.gsub("|cff00ccffkey:|r value", "|c%w*", "")
newString = string.gsub(newString, "|r", "")
but that will remove everything up to the first white-space and I don't know how to specify the max characters to select to avoid this.
Thank you.

Lua patterns do not support range/interval/limiting quantifiers.
You may repeat %w alphanumeric pattern eight times:
local newString = string.gsub("|cff00ccffkey:|r value", "|c%w%w%w%w%w%w%w%w", "")
newString = string.gsub(newString, "|r", "")
print(newString)
-- => key: value
See the Lua demo online.
You may also make it a bit more dynamic by building the pattern with ('%w'):rep(8):
local newString = string.gsub("|cff00ccffkey:|r value", "|c" ..('%w'):rep(8), "")
See another Lua demo.
If your strings always follow this pattern - |c<8 alphanumeric chars><text>|r<value> - you may also use a pattern like
local newString = string.gsub("|cff00ccffkey:|r value", "^|c" ..('%w'):rep(8) .. "(.-)|r(.*)", "%1%2")
See this Lua demo
Here, the pattern matches:
^ - start of string
|c - a literal |c
" ..('%w'):rep(8) .. " - 8 alphanumeric chars
(.-) - Group 1: any 0+ chars, as few as possible
|r - a |r substring
(.*) - Group 2: the rest of the string.
The %1 and %2 refer to the values captured into corresponding groups.
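For contrast, regex flavors that do support interval quantifiers can state the "exactly 8" limit directly. The sketch below uses Python's `re` purely as an illustration of that difference; it is not usable from Lua, and `strip_color_codes` is a hypothetical helper name.

```python
import re

# |c followed by exactly 8 ASCII alphanumerics, or the |r reset marker.
COLOR_CODES = re.compile(r'\|c[0-9A-Za-z]{8}|\|r')


def strip_color_codes(text):
    """Remove |cXXXXXXXX color prefixes and |r reset markers from text."""
    return COLOR_CODES.sub('', text)


print(strip_color_codes("|cff00ccffkey:|r value"))
# key: value
```

The `{8}` here does the job that `('%w'):rep(8)` does when building the Lua pattern.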

RegEx not recognized although it should be

I'm trying to split texts like these:
§1Hello§fman, §0this §8is §2a §blittle §dtest :)
by delimiter "§[a-z|A-Z
My first approach was the following:
^[§]{1}[a-fA-F]|[0-9]$
But pythex.org won't find any occurrences in my example text by using this regex.
Do you know why?

The ^[§]{1}[a-fA-F]|[0-9]$ pattern matches a string starting with § and then having a letter from a-f and A-F ranges, or a digit at the end of the string.
Note the ^ matches the start of the string, and $ matches the end of the string positions.
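The effect of the anchors and of the top-level alternation can be checked directly. In this small sketch the two extra test strings ("§aHello" and "Hello9") are made up for illustration; the first string is the one from the question.

```python
import re

# The pattern from the question: the | splits it into two alternatives,
# "^§[a-fA-F]" (start of string) and "[0-9]$" (end of string).
original = re.compile(r'^[§]{1}[a-fA-F]|[0-9]$')

s = "§1Hello§fman, §0this §8is §2a §blittle §dtest :)"

print(original.findall(s))          # []
print(original.findall("§aHello"))  # ['§a']
print(original.findall("Hello9"))   # ['9']

# Without anchors, and with digits inside the same character class:
unanchored = re.compile(r'§[a-fA-F0-9]')
print(unanchored.findall(s))
# ['§1', '§f', '§0', '§8', '§2', '§b', '§d']
```

The sample text starts with §1 (a digit, not a letter from a-f) and ends with ")", so neither alternative of the original pattern can match it.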
To extract the words that follow § and a hex char, you may use
re.findall(r'§[A-Fa-f0-9]([^\W\d_]+)', s)
# => ['Hello', 'man', 'this', 'is', 'a', 'little', 'test']
To remove the delimiters, you may use re.sub:
re.sub(r'\s*§[A-Fa-f0-9]', ' ', s).strip()
# => Hello man, this is a little test :)
To just get a string of those delimiters, you may use
"".join(re.findall(r'§[A-Fa-f0-9]', s))
# => §1§f§0§8§2§b§d
See this Python demo.
Details
§ - a § symbol
[A-Fa-f0-9] - 1 digit or ASCII letter from a-f and A-F ranges (hex char)
([^\W\d_]+) - Group 1 (this value will be extracted by re.findall): one or more letters (to include digits, remove \d)

Your regex uses the anchors ^ and $, which assert the start and the end of the string.
You could update your regex to §[a-fA-F0-9]
Example using split:
import re
s = "§1Hello§fman, §0this §8is §2a §blittle §dtest :)"
result = [r.strip() for r in re.split('[§]+[a-fA-F0-9]', s) if r.strip()]
print(result)
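If the code in front of each piece of text is needed as well, a capturing variant can return both together. This is only a sketch under the assumption that every segment starts with § plus one hex character; `CODE_AND_TEXT` and `split_colored` are illustrative names.

```python
import re

# Group 1: the hex character after §; group 2: the text up to the next §.
CODE_AND_TEXT = re.compile(r'§([0-9A-Fa-f])([^§]*)')


def split_colored(text):
    """Return (code, text) pairs, with surrounding whitespace stripped from the text."""
    return [(code, chunk.strip()) for code, chunk in CODE_AND_TEXT.findall(text)]


s = "§1Hello§fman, §0this §8is §2a §blittle §dtest :)"
for code, chunk in split_colored(s):
    print(code, repr(chunk))
# 1 'Hello'
# f 'man,'
# 0 'this'
# 8 'is'
# 2 'a'
# b 'little'
# d 'test :)'
```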

How to create "blocks" with Regex

For a project of mine, I want to create 'blocks' with Regex.
\xyz\yzx //wrong format
x\12 //wrong format
12\x //wrong format
\x12\x13\x14\x00\xff\xff //correct format
When using Regex101 to test my regular expressions, I came to this result:
([\\x(0-9A-Fa-f)])/gm
This leads to an incorrect output, because
12\x
Still gets detected as a correct string, though the order is wrong, it needs to be in the order specified below, and in no other order.
backslash x 0-9A-Fa-f 0-9A-Fa-f
Can anyone explain how that works and why it works in that way? Thanks in advance!

To match a \ followed by x, followed by 2 hex chars, anywhere in the string, you need to use
\\x[0-9A-Fa-f]{2}
See the regex demo
To make it match all non-overlapping occurrences, use the specific modifiers (like /g in JavaScript/Perl) or specific functions in your programming language (Regex.Matches in .NET, or preg_match_all in PHP, etc.).
The ^(?:\\x[0-9A-Fa-f]{2})+$ regex validates a whole string that consists of the patterns like above. It happens due to the ^ (start of string) and $ (end of string) anchors. Note the (?:...)+ is a non-capturing group that can repeat in the string 1 or more times (due to + quantifier).
Some Java demo:
String s = "\\x12\\x13\\x14\\x00\\xff\\xff";
// Extract valid blocks
Pattern pattern = Pattern.compile("\\\\x[0-9A-Fa-f]{2}");
Matcher matcher = pattern.matcher(s);
List<String> res = new ArrayList<>();
while (matcher.find()) {
    res.add(matcher.group(0));
}
System.out.println(res); // => [\x12, \x13, \x14, \x00, \xff, \xff]
// Check if a string consists of valid "blocks" only
boolean isValid = s.matches("(?i)(?:\\\\x[a-f0-9]{2})+");
System.out.println(isValid); // => true
Note that we may shorten [a-fA-F0-9] to [a-f0-9] if we add the case insensitive modifier (?i) to the start of the pattern, or just use \p{XDigit}, which matches any hexadecimal digit in a Java regex.
The String#matches method requires the whole string to match, so we do not need the leading ^ and trailing $ anchors when using the pattern inside it.
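The same two operations (extracting blocks and validating a whole string) can be tried against all four sample strings from the question. The sketch below uses Python rather than Java; `extract_blocks` and `is_valid` are illustrative names, and `fullmatch` plays the role that `String#matches` plays in the Java demo.

```python
import re

# A single block: backslash, x, two hex digits.
BLOCK = re.compile(r'\\x[0-9A-Fa-f]{2}')
# One or more blocks and nothing else (used with fullmatch, so no ^ or $ needed).
ONLY_BLOCKS = re.compile(r'(?:\\x[0-9A-Fa-f]{2})+')


def extract_blocks(text):
    """Return every \\xHH block found anywhere in text."""
    return BLOCK.findall(text)


def is_valid(text):
    """True when text consists solely of \\xHH blocks."""
    return ONLY_BLOCKS.fullmatch(text) is not None


samples = [r'\xyz\yzx', r'x\12', r'12\x', r'\x12\x13\x14\x00\xff\xff']
for sample in samples:
    print(sample, '->', is_valid(sample))
# \xyz\yzx -> False
# x\12 -> False
# 12\x -> False
# \x12\x13\x14\x00\xff\xff -> True
```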
