Writing a proper regular expression that handles multiple spaces - regex

I'm struggling to write a proper regexp to use with PHP7.4 to extract required information from a string.
Here are the sample strings:
Numer właściciela: NOWAKOWSKA 01-234 Warsaw
Numer właściciela: NOWAK_S6_2
Numer właściciela: KOWALSKA_S6_ 01-234 Warsaw
Numer właściciela: NOWACKI S6_ 01-234 Warsaw
What I want to extract is accordingly:
NOWAKOWSKA
NOWAK_S6_2
KOWALSKA_S6_
NOWACKI S6_
So far I have been using %^Numer właściciela:[[:space:]](?<owner_id>.+)$%imu, which worked fine for the example in row #2. However, the other cases (#1, #3, #4) appeared during the roll-out phase, and our text extraction is no longer accurate enough.
The problem is with spaces: the source text may contain a single space inside the value, and this space must be included in the result. However, if there are repeated spaces, they must not be included.
I tried playing around with some conditionals and negative lookaheads to exclude multiple spaces, but failed to do so.
Would really appreciate any help here.

In a general case, when you want to match sequences of chars separated with a single whitespace, you can use
/^Numer właściciela:\h*(?<owner_id>\S+(?:\h\S+)*)/imu
\h is preferred to \s here since you are extracting data from lines in a longer text, not from standalone strings: \s also matches line breaks, while \h matches only horizontal whitespace.
If the strings you extract are all short, you may also use
/^Numer właściciela:\h*(?<owner_id>.*?)(?:\h{2}|$)/imu
This should be even more efficient, but only if the values are as short as in the question. On strings of arbitrary length, .*? is usually as expensive as .*.
Pattern details:
^ - start of a line (due to m flag)
Numer właściciela: - a literal string (replace the space with \h to match any horizontal whitespace)
\h* - zero or more horizontal whitespaces
(?<owner_id>\S+(?:\h\S+)*) - Group "owner_id": one or more non-whitespace chars followed with zero or more sequences of a single horizontal whitespace followed with one or more non-whitespace chars.
(?<owner_id>.*?)(?:\h{2}|$) - Group "owner_id" that captures any zero or more chars other than line break chars as few as possible, and then either two horizontal whitespaces or end of a line.
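A minimal Python sketch of the first pattern, for readers who want to try it outside PHP. Two assumptions: Python's built-in re module has no \h, so [^\S\r\n] (any whitespace except CR/LF) stands in for it as an approximation, and the sample lines contain runs of several spaces before the postal code, as the question describes. The name extract_owner_ids is hypothetical.

```python
import re

# Approximation of PCRE's \h: any whitespace character except CR and LF.
H = r"[^\S\r\n]"

OWNER_RE = re.compile(
    rf"^Numer właściciela:{H}*(?P<owner_id>\S+(?:{H}\S+)*)",
    re.IGNORECASE | re.MULTILINE,
)

# Sample lines with repeated spaces before the postal code (assumed layout).
SAMPLE = "\n".join([
    "Numer właściciela: NOWAKOWSKA      01-234 Warsaw",
    "Numer właściciela: NOWAK_S6_2",
    "Numer właściciela: KOWALSKA_S6_     01-234 Warsaw",
    "Numer właściciela: NOWACKI S6_     01-234 Warsaw",
])


def extract_owner_ids(text):
    """Return every owner_id found in a multi-line text."""
    return [m.group("owner_id") for m in OWNER_RE.finditer(text)]


owner_ids = extract_owner_ids(SAMPLE)
```

The group stops at the first run of two or more spaces because each repetition of (?:H\S+) consumes exactly one whitespace character and must then see a non-whitespace character.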

This regex:
/^Numer właściciela:\s+(?<owner_id>.*?)(?=\s{20,}|$)/imu

What is the regex to find lines WITHOUT a line break

I'm using SubtitleEdit and I'd like to locate all the lines that do not contain a line break.
Lines containing a line break are bilingual, which is what I want.
But those that do not have line breaks are mono-lingual, and I'd like to quickly locate them all and delete them. TIA!
Alternatively, if there is a regex expression that can find lines which do not contain any English characters, that would also work.
The confusion here was caused by 2 facts:
What SubtitleEdit calls a line is actually a multiline entry that can contain newlines.
The newline displayed is not the one used internally (so it would never match <br>).
Solution 1:
Now that we have found out it uses either \r\n or just \n, we can write a regex:
(?-m)^(?!.*\r?\n)[\s\S]*$
Explanation:
(?-m) - turn off the multiline option (which is otherwise enabled).
^ - match from start of text
(?!.*\r?\n) - negative look ahead for zero or more of any characters followed by newline character(s) - (=Contains)
[\s\S]*$ - match zero or more of ANY character (including newline) - will match the rest of text.
In short: If we don't find newline characters, match everything.
Now replace with an empty string.
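The same check can be sketched in Python, assuming each subtitle entry is available as a separate string (the entries list below is made up for illustration). Without re.MULTILINE, ^ only matches at the start of the string, so no (?-m) is needed.

```python
import re

# Matches the whole entry only when it contains no \n or \r\n.
NO_LINE_BREAK = re.compile(r"^(?!.*\r?\n)[\s\S]*$")

# Hypothetical subtitle entries: two bilingual, one monolingual.
entries = ["Hello\r\nBonjour", "Only one language", "Hi\nSalut"]

monolingual = [e for e in entries if NO_LINE_BREAK.match(e)]
bilingual = [e for e in entries if not NO_LINE_BREAK.match(e)]
```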
Solution 2:
If you want to match lines that don't have any English characters, you can use this:
(?-m)^(?![\s\S]*[a-zA-Z])[\s\S]*$
Explanation:
(?-m) - turn off the multiline option (which is otherwise enabled).
^ - match from start of text
(?![\s\S]*[a-zA-Z]) - negative look ahead for ANY characters followed by an English character.
[\s\S]*$ - match zero or more of ANY character (including newline) - will match the rest of text.
In short: If we don't find an English character, match everything.
Now replace with an empty string.
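A short Python sketch of this second check, again assuming one string per subtitle entry (the sample entries are invented). "English character" is taken to mean an ASCII letter, as in the pattern.

```python
import re

# Matches the whole entry only when it contains no ASCII letter anywhere.
NO_ENGLISH = re.compile(r"^(?![\s\S]*[a-zA-Z])[\s\S]*$")

# Hypothetical entries: the second one contains English text.
entries = ["你好\n世界", "Hello\n你好", "12:30"]

without_english = [e for e in entries if NO_ENGLISH.match(e)]
```

Note that [\s\S] is used inside the lookahead so that the search for a letter also crosses line breaks within an entry.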
You should use a regex lookahead assertion. Given these test lines:
something_1
some<br>thing_2
something_3<br>
<br>something_4
something_5
This is an expression that will match lines 1 and 5
^(?!.*<br>).*$
In this regular expression, the negative lookahead assertion (?!.*<br>) defines which lines are acceptable: only those that do not contain <br> anywhere.
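For illustration, the expression can be run against the five test lines in Python. This sketch assumes the lines are joined with \n and that re.MULTILINE is enabled so ^ and $ work per line.

```python
import re

text = "\n".join([
    "something_1",
    "some<br>thing_2",
    "something_3<br>",
    "<br>something_4",
    "something_5",
])

# Keep only the lines that do not contain "<br>" anywhere.
lines_without_br = re.findall(r"^(?!.*<br>).*$", text, flags=re.MULTILINE)
```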

Regex to match lines starting with a \t or - but only capture - on

I cannot figure out this regex for the life of me
I have example input such as:
- Line 1
- Line 2
- Line 3
- Line 4
I am trying to match each line starting at the - and going through the end of the line. I am using the Workflow app on iOS which uses ICU regex parsing
The pattern I am using is
(?m)^\t*(-.*)
This pattern will match all the lines, but it captures the tabs. What am I doing wrong?
You ask why your regex captures the tabs. It does not: your regex matches the tabs, and captures the - after those tabs together with the rest of the line. The point is that you are using a consuming pattern, one whose matched text becomes part of the returned match.
Non-consuming patterns (lookarounds) can be used to check for the presence or absence of some text without putting that text into the returned match.
In the ICU regex flavor, lookbehinds must be of constrained width, so a limiting quantifier is fine inside one. (The length of possible strings matched by the look-behind pattern must not be unbounded: no * or + operators.)
Thus, this will work as long as there are 100 or fewer tabs at the line start:
(?m)(?<=^\t{0,100})-.*
Here,
(?m) - makes ^ match the start of a line
(?<=^\t{0,100}) - a positive lookbehind requiring 0 to 100 tabs after the beginning of the line to appear before a
-.* - hyphen and the rest of the line.
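The difference between what is matched and what is captured can be sketched in Python. Python's built-in re module only accepts fixed-width lookbehinds, so the ICU pattern above does not carry over; the sketch uses the question's capturing pattern instead. The leading tabs in the sample text are an assumption about the original input.

```python
import re

# Assumed input: some lines are indented with tabs.
text = "- Line 1\n\t- Line 2\n\t\t- Line 3\n- Line 4"

pattern = re.compile(r"^\t*(-.*)", re.MULTILINE)

# group(0) is the whole match and includes the consumed tabs.
whole_matches = [m.group(0) for m in pattern.finditer(text)]

# group(1) is the capture and starts at the hyphen.
captures = [m.group(1) for m in pattern.finditer(text)]
```

If the tool only exposes the whole match, a lookbehind is the way to keep the tabs out of it; if it exposes groups, reading group 1 is enough.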
Try this:
(?m)^[ \t]*(-.*)
First, it appears that you have some spaces at the beginning of some of those lines, so \t will not match spaces. Replacing \t with [ \t] (or just \s) will fix this. Also, (-*) is going to match and capture any number of -, not including what's following. Put a . before your * to match any number of characters following the -, like this: (-.*)
If you don't require leading spaces, you can use
(?m)(-.*)
If you don't care about capturing the match, you don't need the parentheses, giving you
(?m)-.*

Regex to prevent trailing spaces and extra spaces

Right now I have a regex that prevents the user from typing any special characters. The only allowed characters are A through Z, 0 through 9 or spaces.
I want to improve this regex to prevent the following:
No leading/trailing spaces - If the user types one or more spaces before or after the entry, do not allow.
No double-spaces - If the user types the space key more than once, do not allow.
The Regex I have right now to prevent special characters is as follows and appears to work just fine, which is:
^[a-zA-Z0-9 ]+$
Following some other ideas, I tried all these options but they did not work:
^\A\s+[a-zA-Z0-9 ]+$\A\s+
/s*^[a-zA-Z0-9 ]+$/s*
Could I get a helping hand with this code? Again, I just want letters A-Z, numbers 0-9, and no leading or trailing spaces.
Thanks.
You can use the following regex:
^[a-zA-Z0-9]+(?: [a-zA-Z0-9]+)*$
The regex will match alphanumerics at the start (1 or more) and then zero or more chunks of a single space followed with one or more alphanumerics.
As an alternative, here is a regex based on lookaheads (which makes it less efficient):
^(?!.* {2})(?=\S)(?=.*\S$)[a-zA-Z0-9 ]+$
The (?!.* {2}) disallows consecutive spaces and (?=.*\S$) requires a non-whitespace to be at the end of the string and (?=\S) requires it at the start.
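A small Python sketch that runs both patterns against a few inputs to show they agree. The sample values and the helper name check are made up for illustration.

```python
import re

# Alphanumeric chunks separated by exactly one space.
CHUNKED = re.compile(r"^[a-zA-Z0-9]+(?: [a-zA-Z0-9]+)*$")

# Lookahead-based alternative from the answer.
LOOKAHEAD = re.compile(r"^(?!.* {2})(?=\S)(?=.*\S$)[a-zA-Z0-9 ]+$")


def check(value):
    """Return (chunked_ok, lookahead_ok) for one input string."""
    return (
        CHUNKED.fullmatch(value) is not None,
        LOOKAHEAD.fullmatch(value) is not None,
    )


samples = ["Hello World 42", " Hello", "Hello ", "Hello  World", "Hello-World", ""]
results = {s: check(s) for s in samples}
```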

Extracting text between two keywords or a keyword and \n

I have a set of lines where most of them follow this format
STARTKEYWORD some text I want to extract ENDKEYWORD\n
I want to find these lines and extract information from them.
Note, that the text between keywords can contain a wide range of characters (latin and non-latin letters, numbers, spaces, special characters) except \n.
ENDKEYWORD is optional and sometimes can be omitted.
My attempts are revolving around this regex
STARTKEYWORD (.+)(?:\n| ENDKEYWORD)
However capturing group (.+) consumes as many characters as possible and takes ENDKEYWORD which I do not need.
Is there a way to get some text I want to extract solely with regular expressions?
You can make (.+) non-greedy (it is greedy by default and consumes as much as it can) by adding ?, and use $ instead of \n to make it more efficient (with the multiline flag, $ matches at the end of each line):
STARTKEYWORD (.+?)(?:$| ENDKEYWORD$)
If you specifically want \n you can use:
STARTKEYWORD (.+?)(?:\n| ENDKEYWORD\n)
You could use a lookahead-based regex. It is always better to use the $ end-of-line anchor, since the last line may not end with a newline character.
STARTKEYWORD (.+?)(?= ENDKEYWORD|$)
OR
STARTKEYWORD (.+?)(?: ENDKEYWORD|$)
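A minimal Python sketch of the lookahead variant, assuming the input is one multi-line string and re.MULTILINE is enabled so that $ matches at the end of every line. The sample text is invented.

```python
import re

text = "\n".join([
    "STARTKEYWORD some text I want to extract ENDKEYWORD",
    "STARTKEYWORD no end keyword here",
    "an unrelated line",
])

# Lazy group: stop at the first " ENDKEYWORD" or at the end of the line.
pattern = re.compile(r"STARTKEYWORD (.+?)(?= ENDKEYWORD|$)", re.MULTILINE)

extracted = pattern.findall(text)
```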

Regular expression to allow spaces between words

I want a regular expression that prevents symbols and only allows letters and numbers. The regex below works great, but it doesn't allow for spaces between words.
^[a-zA-Z0-9_]*$
For example, when using this regular expression "HelloWorld" is fine, but "Hello World" does not match.
How can I tweak it to allow spaces?
tl;dr
Just add a space in your character class.
^[a-zA-Z0-9_ ]*$
Now, if you want to be strict...
The above isn't exactly correct. Due to the fact that * means zero or more, it would match all of the following cases that one would not usually mean to match:
An empty string, "".
A string comprised entirely of spaces, "      ".
A string that leads and / or trails with spaces, "   Hello World  ".
A string that contains multiple spaces in between words, "Hello   World".
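The four cases listed above can be checked quickly in Python; this sketch only confirms that the loose pattern accepts all of them. The exact number of spaces in each sample string is arbitrary.

```python
import re

LOOSE = re.compile(r"^[a-zA-Z0-9_ ]*$")

edge_cases = [
    "",                  # empty string
    "      ",            # only spaces
    "   Hello World  ",  # leading and trailing spaces
    "Hello   World",     # several spaces between words
]

accepted = [s for s in edge_cases if LOOSE.match(s)]
```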
Originally I didn't think such details were worth going into, as OP was asking such a basic question that it seemed strictness wasn't a concern. Now that the question's gained some popularity however, I want to say...
...use @stema's answer.
Which, in my flavor (without using \w) translates to:
^[a-zA-Z0-9_]+( [a-zA-Z0-9_]+)*$
(Please upvote @stema regardless.)
Some things to note about this (and @stema's) answer:
If you want to allow multiple spaces between words (say, if you'd like to allow accidental double-spaces, or if you're working with copy-pasted text from a PDF), then add a + after the space:
^\w+( +\w+)*$
If you want to allow tabs and newlines (whitespace characters), then replace the space with a \s+:
^\w+(\s+\w+)*$
Here I suggest the + by default because, for example, Windows linebreaks consist of two whitespace characters in sequence, \r\n, so you'll need the + to catch both.
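The Windows line break point can be illustrated with a short Python sketch comparing \s against \s+ as the separator. The sample string is made up.

```python
import re

SINGLE_WS = re.compile(r"^\w+(\s\w+)*$")   # exactly one whitespace char between words
MULTI_WS = re.compile(r"^\w+(\s+\w+)*$")   # one or more whitespace chars between words

windows_text = "Hello\r\nWorld"  # \r\n is two whitespace characters in a row

single_ok = SINGLE_WS.fullmatch(windows_text) is not None
multi_ok = MULTI_WS.fullmatch(windows_text) is not None
```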
Still not working?
Check what dialect of regular expressions you're using.* In languages like Java you'll have to escape your backslashes, i.e. \\w and \\s. In older or more basic languages and utilities, like sed, \w and \s aren't defined, so write them out with character classes, e.g. [a-zA-Z0-9_] and [ \f\n\r\t], respectively.
* I know this question is tagged vb.net, but based on 25,000+ views, I'm guessing it's not only those folks who are coming across this question. Currently it's the first hit on google for the search phrase, regular expression space word.
One possibility would be to just add the space to your character class, as acheong87 suggested. Whether that is enough depends on how strict you want the pattern to be, because it would also allow a string starting with 5 spaces, or strings consisting only of spaces.
The other possibility is to define a pattern:
I will use \w, which in most regex flavours is the same as [a-zA-Z0-9_] (in some it is Unicode based):
^\w+( \w+)*$
This will allow a series of at least one word and the words are divided by spaces.
^ Match the start of the string
\w+ Match a series of at least one word character
( \w+)* is a group that is repeated 0 or more times. In the group it expects a space followed by a series of at least one word character
$ matches the end of the string
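A minimal Python sketch of this pattern. One assumption to flag: Python's \w is Unicode-aware by default, so re.ASCII is passed to keep it equivalent to [a-zA-Z0-9_]. The function name is_valid is hypothetical.

```python
import re

# Words of [a-zA-Z0-9_] separated by exactly one space.
STRICT = re.compile(r"^\w+( \w+)*$", re.ASCII)


def is_valid(value):
    """True when value is one or more words separated by single spaces."""
    return STRICT.fullmatch(value) is not None


checks = {
    "HelloWorld": is_valid("HelloWorld"),
    "Hello World": is_valid("Hello World"),
    "": is_valid(""),
    " Hello": is_valid(" Hello"),
    "Hello ": is_valid("Hello "),
    "Hello  World": is_valid("Hello  World"),
    "Hello, World": is_valid("Hello, World"),
}
```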
This one worked for me
([\w ]+)
Try with:
^(\w+ ?)*$
Explanation:
\w - alias for [a-zA-Z_0-9]
" ?" - an optional single space after each word (note that this also permits one trailing space)
I assume you don't want leading/trailing space. This means you have to split the regex into "first character", "stuff in the middle" and "last character":
^[a-zA-Z0-9_][a-zA-Z0-9_ ]*[a-zA-Z0-9_]$
or if you use a perl-like syntax:
^\w[\w ]*\w$
Also: If you intentionally worded your regex that it also allows empty Strings, you have to make the entire thing optional:
^(\w[\w ]*\w)?$
If you want to only allow single space chars, it looks a bit different:
^((\w+ )*\w+)?$
This matches 0..n words followed by a single space, plus one word without space. And makes the entire thing optional to allow empty strings.
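The variants above behave differently on short and empty inputs, which a small Python sketch can make visible. The sample inputs are invented, and re.ASCII is an assumption made so that \w means [a-zA-Z0-9_].

```python
import re

# First char, middle, last char: needs at least two characters.
EDGES = re.compile(r"^\w[\w ]*\w$", re.ASCII)

# Words separated by single spaces, whole thing optional.
SINGLE_SPACED = re.compile(r"^((\w+ )*\w+)?$", re.ASCII)

inputs = ["", "a", "a b", "a  b", " a", "a "]

edges_ok = {s: EDGES.fullmatch(s) is not None for s in inputs}
single_ok = {s: SINGLE_SPACED.fullmatch(s) is not None for s in inputs}
```

In this sketch the first pattern rejects the one-character input "a" and accepts the double space in "a  b", while the second accepts the empty string and "a" but rejects the double space.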
This regular expression
^\w+(\s\w+)*$
will only allow a single space between words and no leading or trailing spaces.
Below is the explanation of the regular expression:
^ Assert position at start of the string
\w+ Match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
1st Capturing group (\s\w+)*
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\s Match any white space character [\r\n\t\f ]
\w+ Match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
$ Assert position at end of the string
Just add a space to end of your regex pattern as follows:
[a-zA-Z0-9_ ]
This does not allow a space at the beginning (provided the pattern is anchored with ^), but allows spaces between words. It also allows special characters between words. A good regex for FirstName and LastName fields.
^\w+.*$
For letters only (two words separated by whitespace):
^([a-zA-Z])+(\s)+[a-zA-Z]+$
For alphanumeric value and _:
^(\w)+(\s)+\w+$
If you are using JavaScript then you can use this regex:
/^[a-z0-9_.-\s]+$/i
For example:
/^[a-z0-9_.-\s]+$/i.test("") //false
/^[a-z0-9_.-\s]+$/i.test("helloworld") //true
/^[a-z0-9_.-\s]+$/i.test("hello world") //true
/^[a-z0-9_.-\s]+$/i.test("none alpha: ɹqɯ") //false
The only drawback with this regex is a string comprised entirely of spaces. "       " will also show as true.
This was my regex: @"^(?=.{3,15}$)(?:(?:\p{L}|\p{N})[._()\[\]-]?)*$"
I just added ([\w ]+) at the end of my regex, before the *:
@"^(?=.{3,15}$)(?:(?:\p{L}|\p{N})[._()\[\]-]?)([\w ]+)*$"
Now the string is allowed to have spaces.
This regex allows only letters and spaces:
^[a-zA-Z ]*$
Try with this one:
result = re.search(r"\w+( )\w+", text)