Using set operators with python regex module

I'm having trouble getting set operators to work in the regex module (regex 2013-11-29) in python-3.x. For example, to match ASCII characters minus punctuation I have tried:
import regex as rx
data = '(foo)'
for m in rx.finditer(r'[\p{ASCII}--\p{P}]+', data):
    print(m.group(0))  # expect 'foo', getting '(foo)'
The documentation gives this example:
[\p{N}--[0-9]] # Set containing all numbers except '0' .. '9'
Am I missing something here?

It sounds like you need to explicitly opt into Version 1 behavior so that the -- is interpreted as a set operator and not as characters to include in the class.
From the module web page:
Version 1 behaviour (new behaviour, different from the current re module):
Indicated by the VERSION1 or V1 flag, or (?V1) in the pattern.
.split will split a string at a zero-width match.
Inline flags apply to the end of the group or pattern, and they can be turned off.
Nested sets and set operations are supported.
Case-insensitive matches in Unicode use full case-folding by default.
If no version is specified, the regex module will default to regex.DEFAULT_VERSION. In the short term this will be VERSION0, but in the longer term it will be VERSION1.
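A minimal sketch of the opt-in, assuming the third-party regex module is installed. The guarded import and the helper name ascii_non_punct are illustrative only and are not part of the module:

```python
try:
    import regex as rx  # third-party module: pip install regex
except ImportError:     # keeps the sketch importable when regex is missing
    rx = None


def ascii_non_punct(data):
    """Return runs of ASCII characters that are not Unicode punctuation."""
    # (?V1) switches on Version 1 behaviour, so "--" is parsed as set
    # difference rather than as literal characters inside the class.
    return [m.group(0) for m in rx.finditer(r'(?V1)[\p{ASCII}--\p{P}]+', data)]


if rx is not None:
    print(ascii_non_punct('(foo)'))  # the question expects 'foo' here
```

Passing flags=rx.V1 (or rx.VERSION1) to finditer should have the same effect as the inline (?V1).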

What is the regular expression for all pages except "/"?

I am using NextAuth for Next.js for session management. In addition, I am using the middleware.js to protect my routes from unauthenticated users.
According to https://nextjs.org/docs/advanced-features/middleware#matcher,
if we want to exclude a path, we do something like
export const config = {
  matcher: [
    /*
     * Match all request paths except for the ones starting with:
     * - api (API routes)
     * - static (static files)
     * - favicon.ico (favicon file)
     */
    '/((?!api|static|favicon.ico).*)',
  ],
}
In this example, we exclude /api, /static, and /favicon.ico. However, I want to match every path except the home page, "/" (that is, exclude only the home page). What is the regular expression for that? I tried '/(*)', but it doesn't seem to work.
The regular expression which matches everything but a specific one-character string / is constructed as follows:
we need to match the empty string: empty regex.
we need to match all strings two characters long or longer: ..+
we need to match one-character strings which are not that character: [^/].
Combining these three together with the | branching operator: "|..+|[^/]".
If we are using a regular expression tool that performs substring searching rather than a full match, we need to use its anchoring features; perhaps it supports the ^ and $ notation for that: "^(|..+|[^/])$".
I'm guessing that you might not want to match empty strings; in which case, revise your requirement and drop that branch from the expression.
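A small Python sketch of the three-branch expression, using the standard re module. The names NOT_SLASH, ANCHORED and is_not_slash are made up for illustration:

```python
import re

# The three branches: empty string, two or more characters, or a single
# character other than "/". fullmatch() anchors both ends for us.
NOT_SLASH = re.compile(r"|..+|[^/]")

# With a searching API the explicit anchors from the answer are needed.
ANCHORED = re.compile(r"^(|..+|[^/])$")


def is_not_slash(s):
    """True for every string except the one-character string "/"."""
    return NOT_SLASH.fullmatch(s) is not None


for s in ["/", "", "a", "/a", "//"]:
    print(repr(s), is_not_slash(s), ANCHORED.search(s) is not None)
```

Only "/" is rejected by both forms. Dropping the leading empty branch would make the empty string fail as well.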
Suppose we wanted to match all strings, except for a specific fixed word like abc. Without negation support in the regex language, we can use a generalization of the above trick.
Match the empty string, like before, if desired.
Match all one-character strings: .
Match all two-character strings: ..
Match all strings longer than three characters: ....+
Those simple cases taken care of, we focus on matching just those three-symbol strings that are not abc. How can we do that?
Match all three-character strings that don't start with a: [^a]..
Match all three-character strings that don't have a b in the middle: .[^b].
Match all three-character strings that don't end in c: ..[^c]
Combine it all together: "|.|..|....+|[^a]..|.[^b].|..[^c]".
For longer words, we might want to take advantage of the {m,n} notation, if available, to express "match from zero to nine characters" and "match eleven or more characters".
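The generalization can be sketched as a small pattern builder in Python. The function name everything_but is hypothetical, and the {m,n} notation is used for the "shorter" and "longer" branches as suggested above:

```python
import re


def everything_but(word):
    """Build a negation-free pattern matching every string except `word`."""
    n = len(word)
    branches = []
    if n > 0:
        branches.append(".{0,%d}" % (n - 1))  # strictly shorter strings
    branches.append(".{%d,}" % (n + 1))       # strictly longer strings
    for i, ch in enumerate(word):
        # Same length, but differing from `word` at position i.
        branches.append("." * i + "[^" + re.escape(ch) + "]" + "." * (n - i - 1))
    # DOTALL lets "." cover newlines too; drop it to mirror the plain "." above.
    return re.compile("|".join(branches), re.DOTALL)


not_abc = everything_but("abc")
for s in ["abc", "", "ab", "abd", "xbc", "abcd"]:
    print(repr(s), not_abc.fullmatch(s) is not None)
```

The pattern has to be applied as a full match (or wrapped in ^(...)$), otherwise a shorter branch can succeed on a substring of the excluded word.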
I also need to exclude the sign-in and register pages. If the sign-in page is not excluded, it causes an infinite redirect loop and an error. If the register page is not excluded, you can't register, because you are redirected to the sign-in page.
So "/", "/auth/signin", and "/auth/register" are excluded. Here is what I needed:
export const config = {
  matcher: [
    '/((?!auth).*)(.+)'
  ]
}
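To see which paths this expression accepts, here is a rough check using Python's re module. This is only an approximation: Next.js compiles matcher strings with its own path-matching rules, so the result there may differ, and the name is_protected is made up for this sketch:

```python
import re

# The matcher string from above, applied as a full match on the path.
MATCHER = re.compile(r"/((?!auth).*)(.+)")


def is_protected(path):
    """True if the middleware matcher would (approximately) match `path`."""
    return MATCHER.fullmatch(path) is not None


for path in ["/", "/auth/signin", "/auth/register", "/dashboard", "/authors"]:
    print(path, is_protected(path))
```

The trailing (.+) is what rejects "/", since it demands at least one character after the slash. Note that (?!auth) is a prefix test, so a path such as /authors is excluded too.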

How to find "complicated" URLs in a text file

I'm using the following regex to find URLs in a text file:
/http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+/
It outputs the following:
http://rda.ucar.edu/datasets/ds117.0/.
http://rda.ucar.edu/datasets/ds111.1/.
http://www.discover-earth.org/index.html).
http://community.eosdis.nasa.gov/measures/).
Ideally they would print out this:
http://rda.ucar.edu/datasets/ds117.0/
http://rda.ucar.edu/datasets/ds111.1/
http://www.discover-earth.org/index.html
http://community.eosdis.nasa.gov/measures/
Any ideas on how I should tweak my regex?
Thank you in advance!
UPDATE - Example of the text would be:
this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/).
This will trim trailing characters such as ) and . from your output.
import re
regx = re.compile(r'(?m)[\.\)]+$')
print(regx.sub('', your_output))
And this regex seems to work for extracting the URLs from your original sample text.
https?:[\S]*\/(?:\w+(?:\.\w+)?)?
(edited from https?:[\S]*\/)
A Python script might look something like this:
import re
ss = """ this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/). """
regx = re.compile(r'https?:[\S]*\/(?:\w+(?:\.\w+)?)?')
for m in regx.findall(ss):
    print(m)
So for the urls you have here:
https://regex101.com/r/uSlkcQ/4
Pattern explanation:
Protocols (e.g. https://)
^[A-Za-z]{3,9}:(?://)
Look for recurring .[-;:&=+\$,\w]+-class (www.sub.domain.com)
(?:[\-;:&=\+\$,\w]+\.?)+
Look for recurring /[\-;:&=\+\$,\w\.]+ (/some.path/to/somewhere)
(?:\/[\-;:&=\+\$,\w\.]+)+
Now, for your special case: ensure that the last character is not a dot or a parenthesis, using negative lookahead
(?!\.|\)).
The full pattern is then
^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).
There are a few things to improve or change in your existing regex to allow this to work:
http[s]? can be changed to https?. They're identical. No use putting s in its own character class
[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),] You can shorten this entire thing and combine character classes instead of using | between them. This not only improves performance, but also allows you to combine certain ranges into existing character class tokens. Simplifying this, we get [a-zA-Z0-9$-_#.&+!*\(\),]
We can go one step further: a-zA-Z0-9_ is the same as \w. So we can replace those in the character class to get [\w$-#.&+!*\(\),] (the unescaped - is still a problem at this point; it is dealt with next)
In the original regex we have $-_. This creates a range, so it actually includes everything between $ and _ in the ASCII table. This will cause unwanted characters to be matched: $%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_. There are a few options to fix this:
[-\w$#.&+!*\(\),] Place - at the start of the character class
[\w$#.&+!*\(\),-] Place - at the end of the character class
[\w$\-#.&+!*\(\),] Escape - such that you have \- instead
You don't need to escape ( and ) in the character class: [\w$#.&+!*(),-]
[0-9a-fA-F][0-9a-fA-F] You don't need to specify [0-9a-fA-F] twice. Just use a quantifier like so: [0-9a-fA-F]{2}
(?:%[0-9a-fA-F][0-9a-fA-F]) The non-capture group isn't actually needed here, so we can drop it (it adds another step that the regex engine needs to perform, which is unnecessary)
So the result of just simplifying your existing regex is the following:
https?://(?:[$\w#.&+!*(),-]|%[0-9a-fA-F]{2})+
Now you'll notice it doesn't match / so we need to add that to the character class. Your regex was matching this originally because it has an improper range $-_.
https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+
Unfortunately, even with this change, it'll still match ). at the end. That's because your regex isn't told to stop matching after /. Even implementing this will now cause it to not match file names like index.html. So a better solution is needed. If you give me a couple of days, I'm working on a fully functional RFC-compliant regex that matches URLs. I figured, in the meantime, I would at least explain why your regex isn't working as you'd expect it to.
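The accidental $-_ range described above can be checked directly in Python. This is just an illustrative sketch; the variable names are made up:

```python
import re

printable = [chr(c) for c in range(0x21, 0x7F)]

# Characters admitted by the accidental range $-_ on its own.
in_range = [ch for ch in printable if re.fullmatch(r"[$-_]", ch)]

# The corrected class from the answer, with - moved to the end.
fixed = re.compile(r"[\w$#.&+!*(),-]")

print("".join(in_range))
print([ch for ch in "/:)" if fixed.fullmatch(ch)])
```

The first line printed includes / and :, which is why the original expression happened to match URL paths. The corrected class no longer matches either of them, so / has to be added back explicitly.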
Thanks all for the responses. A coworker ended up helping me with it. Here is the solution:
des_links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', des)
for i in des_links:
    tmps = "/".join(i.split('/')[0:-1])
    print(tmps)
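Splitting on / and discarding the last piece also discards a trailing file name such as index.html. A possible alternative, sketched here under the assumption that URLs in the text are delimited by whitespace and never legitimately end in ., , or ), is to grab each run and strip the trailing punctuation. The function name extract_urls is hypothetical:

```python
import re

text = ("this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this "
        "to be copied over http://rda.ucar.edu/datasets/ds111.1/. "
        "http://www.discover-earth.org/index.html). "
        "http://community.eosdis.nasa.gov/measures/).")


def extract_urls(s):
    # Grab every whitespace-delimited run starting with http:// or https://,
    # then strip sentence punctuation that got glued onto the end.
    return [u.rstrip(".,)") for u in re.findall(r"https?://\S+", s)]


for url in extract_urls(text):
    print(url)
```

On the sample text this prints the four URLs listed in the question as the desired output.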

Regex for multiline .ini file

I am trying to create a regular expression that can parse an INI file.
But I want the INI values to be able to span multiple lines!
Like that:
Wert1=Hallo
dsadasd
Wert2=Hi
Wert3=Heinirch Volland
I tried this regex, but it doesn't work:
/.*=(.*)^.*=/gsm
You could use this PCRE regex:
/^.*=.*[^=]*$/gm
This relies on the absence of the single-line flag, be careful not to set it. The multiline flag is also necessary, and global can be used if appropriate.
This matches from the start of a line containing an equal sign (^.*=.*), then will match as many whole lines that do not contain an equal sign as it can ([^=]*$, where [^=] will match linefeeds).
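The same expression can be tried in Python. Python's re is not PCRE, but for this particular pattern the flags map across directly (re.MULTILINE for m, findall for g). The names ENTRY and parse are illustrative:

```python
import re

INI = "Wert1=Hallo\ndsadasd\nWert2=Hi\nWert3=Heinirch Volland"

# MULTILINE is set so ^ and $ match at line boundaries; DOTALL is left off so
# that . stops at a newline while [^=] is free to cross it.
ENTRY = re.compile(r"^.*=.*[^=]*$", re.MULTILINE)


def parse(text):
    """Map each key to its (possibly multi-line) value."""
    values = {}
    for entry in ENTRY.findall(text):
        key, _, value = entry.partition("=")  # split on the first "="
        values[key] = value
    return values


print(parse(INI))
```

As in the explanation above, a continuation line is only recognised when it contains no equal sign, so values whose later lines contain = would be split incorrectly.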
You appear to be using Perl. Have you considered using Config::IniFiles? That module will handle parsing INI-type files for you, and has support for multiline parameters using heredoc syntax:
Parameter=<<EOT
value/line 1
value/line 2
EOT
Or, if you enable it with Config::IniFiles->new(..., -allowcontinue => 1);, continuation lines:
[Section]
Parameter=this parameter \
spreads across \
a few lines
I guess you are trying to get all ini values, and to do that you can use this regex pattern:
/^(.*)=(.*)/gm
and you can access your values using groups: the first group holds the key and the second holds the value.

Postgres regex issue

I need to find all records stored in Postgres that match the following regexp:
^((8|\+7)[\- ]?)?(\(?\d{3}\)?[\- ]?)?[\d\- ]{7,10}$
Something like this:
SELECT * FROM users WHERE users.phone ~ '^((8|\+7)[\- ]?)?(\(?\d{3}\)?[\- ]?)?[\d\- ]{7,10}$'
But this one fails with the error:
invalid regular expression: quantifier operand invalid
Why won't Postgres work with this regex?
Using the same one in plain Ruby works just fine.
UPDATE
The problem only occurs with WHERE. When I try:
SELECT '+79637434199' ~ '^((8|\+7)[\- ]?)(\(?\d{3}\)?[\- ]?)[\d\- ]{7,10}'
Postgres returns true. But when I try:
SELECT * FROM users WHERE users.phone ~ '^((8|\+7)[\- ]?)(\(?\d{3}\)?[\- ]?)[\d\- ]{7,10}'
Result: "invalid regular expression: quantifier operand invalid".
You don't need to escape - inside a character class when you put it at the first or last position, because it cannot be misread as range that way:
[\- ] → [- ]
[\d\- ] → [\d -]
The way you have it the upper bound 10 at the end is futile.
Add $ at the end to disallow trailing characters.
Or \D to disallow trailing digits (but require a non-digit).
Or ($|\D) to either end the string there or have a non-digit follow.
Put together:
SELECT '+79637434199' ~ '^(8|\+7)[ -]?(\(?\d{3}\)?[ -]?)[\d -]{7,10}($|\D)'
Otherwise your expression is just fine and it works for me on PostgreSQL 9.1.4. It should not make any difference whatsoever whether you use it in a WHERE clause or in a SELECT list - unless you are running into a bug with some old version (like @kgrittn commented).
If I prepend the string literal with E, I can provoke the error message that you get. This cannot explain your problem, because you stated that the expression works fine as SELECT item.
But, as Sherlock Holmes is quoted, "when you have excluded the impossible, whatever remains, however improbable, must be the truth."
Maybe you ran one test with standard_conforming_strings = on and the other one with standard_conforming_strings = off - this was the default interpretation of string literals in older versions before 9.1. Maybe with two different clients (that have a different setting as to that).
Read more in the chapter String Constants with C-style Escapes in the manual.
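To illustrate why backslash handling can produce this kind of error, here is a rough Python simulation. It is not Postgres: strip_backslashes is a crude, hypothetical stand-in for escape-string processing, and Python's re is a different engine, but it shows what the pattern looks like once the backslashes have been consumed by the string literal:

```python
import re

PATTERN = r"^((8|\+7)[\- ]?)?(\(?\d{3}\)?[\- ]?)?[\d\- ]{7,10}$"


def strip_backslashes(s):
    # Crude stand-in for escape-string processing: drop each backslash and
    # keep the character after it (real processing also handles \n, \t, ...).
    return re.sub(r"\\(.)", r"\1", s)


mangled = strip_backslashes(PATTERN)
print(mangled)

try:
    re.compile(mangled)
except re.error as exc:
    print("invalid pattern:", exc)
```

Without the backslash, the + in (8|+7) becomes a quantifier with nothing in front of it, which Python also rejects. That is consistent with Postgres reporting an invalid quantifier operand.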

Is there a neat way to invert an embedded pattern-match in a regular expression in Perl?

All, I have the following code:
Readonly my $CTRL_CHARS => qr{[|!?*]}xsm;
my ($dbix_object, $alias, $cardinality) = ($object =~ qr{
    ^              # Start of the line
    ([^|*?!]*)     # Anything that isn't a relationship control character
                   # i.e. | (indicating an alias)
                   #      * (indicating many_to_many)
                   #      ? (indicating might_have)
                   #      ! (indicating has_one)
    \|?            # Possible |, indicating an alias follows
    ([^|!?*]*?)    # Possible alias (excludes all the control characters above)
    ([|!?*]?)$     # Possible control character
}oxsm);
I'd like to replace the punctuation vomit within the regex with the pattern defined as $CTRL_CHARS. However, when I put something like: [^$CTRL_CHARS], Perl complains, because this is expanded out as [^(?msx-i:[|!?*])]. Understandably, Perl pitches a fit at the invalid character range x-i.
One solution would be to use the following:
Readonly my $CTRL_CHARS => qr{[|!?*]}xsm;
Readonly my $NON_CTRL_CHARS => qr{[^|!?*]}xsm;
There's repetition there, which I don't like... but they're close together, so maybe that's not such a bad thing.
What I'd like to know is if there's a simple way to invert the meaning of $CTRL_CHARS, either for the definition of $NON_CTRL_CHARS or for direct use within the regex.
Another approach would be to define a character class, but I don't know how to do that and can't find any simple one liners to do it (would have to be a simple one liner, I think, to justify it)
If $CTRL_CHARS is guaranteed to be a char class, then you can use
(?: (?! $CTRL_CHARS ) . )
But why not just define
Readonly my $CTRL_CHARS => '|!?*';
[$CTRL_CHARS]
[^$CTRL_CHARS]
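The same idea, defining the characters once as a plain string and building both classes from it, can be sketched in Python (a translation for illustration, not the original Perl; the names and the sample inputs are hypothetical):

```python
import re

CTRL_CHARS = "|!?*"                             # the characters, defined once
CTRL = "[" + re.escape(CTRL_CHARS) + "]"        # any control character
NON_CTRL = "[^" + re.escape(CTRL_CHARS) + "]"   # anything else

OBJECT_RE = re.compile(
    rf"^({NON_CTRL}*)"   # name: anything that isn't a control character
    rf"\|?"              # possible |, indicating an alias follows
    rf"({NON_CTRL}*?)"   # possible alias
    rf"({CTRL}?)$"       # possible trailing control character
)


def parse_object(spec):
    """Return (name, alias, cardinality), or None if spec doesn't match."""
    m = OBJECT_RE.match(spec)
    return m.groups() if m else None


print(parse_object("Artist|performer*"))
print(parse_object("Owner!"))
```

Because the string is escaped at interpolation time, the same definition serves both the positive and the negated class without repeating the punctuation.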