Noob regex poser (match MAY contain and MUST have) - regex

Probably really simple for you Regex masters :) I'm a noob at regex, just having picked up some PHP, but wanting to learn (once this project is complete, I'll knuckle down and crack regular expressions).
I'd like to understand how to compose a regex in which some parts are optional but other parts are required.
My example being, the match MAY begin with numbers but doesn't have to, however if it does, I need the number and the following 2 words. If it doesn't begin with a number, just the first 2 words. The data will be at the beginning of the string.
The following would match:
123 Fore Street, Fiveways (123 Fore Street returned(no comma))
Our House Village (Our House returned)
7 Eightnine (7 Eightnine returned)
Thanks

Something like this should work:
^((?:\d+\s)?\w+(?:\s\w+)?)
You can test it out somewhere like http://rubular.com/ before coding it; that's usually easier.
What it means:
^ -> beginning of the line
(?:\d+\s)? -> a non-capturing group (marked by ?:) consisting of one or more digits and a whitespace character; since it is followed by ?, it's optional.
\w+(?:\s\w+)? -> one or more word characters (\w matches letters, digits and underscore), optionally followed by a whitespace character and another "word", again in a non-capturing group.
The whole thing is encapsulated in a capturing group, so group 1 will contain your match.
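To see the capturing group in action, here is a minimal Python sketch that runs the pattern against the three sample strings from the question. The helper name `leading_part` is only illustrative and is not part of the original answer.

```python
import re

# Optional leading number, then one word, then an optional second word.
PATTERN = re.compile(r"^((?:\d+\s)?\w+(?:\s\w+)?)")

def leading_part(text):
    """Return group 1 of the match, or None if the string does not match."""
    m = PATTERN.match(text)
    return m.group(1) if m else None

samples = ["123 Fore Street, Fiveways", "Our House Village", "7 Eightnine"]
results = [leading_part(s) for s in samples]
print(results)
```

Because \w does not match a comma, the match for the first sample stops before ", Fiveways".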

Use this regex with the multiline option:
^(\d+(\s*\b[a-zA-Z]+\b){1,2}|(\s*\b[a-zA-Z]+\b){1,2})
Group 1 contains the required data.
\d+ matches a digit (\d) one or more times (+).
\s* matches whitespace (\s) zero or more times (*).
(\s*\b[a-zA-Z]+\b){1,2} matches 1 to 2 words.

Related

How to select with regex this character?

For example, I have these four IP addresses:
10.100.0.11; wrong
10.100.1.12; good
10.100.11.4; good
10.100.44.1; wrong
The task has simple rules. The 3rd octet can't be 0, and the 4th octet can't be a lone 1.
I need to select them from an IP table on different routers, and I know only these rules.
My solution:
^(10.100.[1-9]{1,3}.[023456789]{1,3})$
but in this case every number containing a 1, like 10, 100 etc., is missed, so this solution is wrong.
^(10.100.[1-9]{1,3}.[1-9]{2,3})$
This solves the problem of the single 1, but creates another one.
From the rules you have given, this regex should work:
10\.100\.([123456789]\d*|\d{2,})\.([^1]$|\d{2,})
It also matches a 3rd-position number that contains a 0, as long as the number is not the single digit 0.
So 10.100.10.4 will match, and so will 10.100.02.4 (through the \d{2,} alternative).
I don't know if that's the intended behavior since I'm not familiar with IP addresses.
The last part \.([^1]$|\d{2,}) reads like this:
"after the 3rd dot is either
a character which is not 1 followed by the end of the line
or two or more digits"
If you want to prevent malformed strings containing non-digit characters, like 10.100.12.a, from matching, you should replace [^1] by [023456789] or, lazier (and therefore better ;), by [02-9].
I use https://regex101.com to debug regex. It's just awesome.
You might use
^10\.100\.[1-9]{1,3}\.(?:[02-9]|\d{2,3})$
The pattern matches
^ Start of string
10\.100\. Match 10.100. (note to escape the dot to match it literally)
[1-9]{1,3} Match a digit 1-9, one to three times
\. Match a dot
(?: Non capture group
[02-9] Match a digit 0 or 2-9
| Or
\d{2,3} Match 2 or 3 digits 0-9
) Close the group
$ End of string
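As a rough comparison, the following Python sketch runs both patterns from the answers above against the four addresses from the question, plus 10.100.10.4, which the first answer mentions. The variable names are illustrative only.

```python
import re

# Pattern from the first answer (only the single-digit branch is end-anchored).
first = re.compile(r"10\.100\.([123456789]\d*|\d{2,})\.([^1]$|\d{2,})")
# Fully anchored pattern from the second answer.
second = re.compile(r"^10\.100\.[1-9]{1,3}\.(?:[02-9]|\d{2,3})$")

addresses = ["10.100.0.11", "10.100.1.12", "10.100.11.4",
             "10.100.44.1", "10.100.10.4"]

# Map each address to (matches first pattern, matches second pattern).
verdicts = {ip: (bool(first.match(ip)), bool(second.match(ip)))
            for ip in addresses}

for ip, (a, b) in verdicts.items():
    print(ip, a, b)
```

Both patterns agree on the four addresses from the question. They differ on 10.100.10.4: the first accepts it, while the second rejects it because [1-9]{1,3} does not allow a 0 anywhere in the third number.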

generate search words from text with CamelCase by using regEx

I want to generate search words from this text, which contains CamelCase.
I do not know if that's possible with regex alone, but I'm already close.
I use it in the scripting language AutoHotkey (https://autohotkey.com/docs/misc/RegEx-QuickRef.htm) .
data: reCommended for future AutoHotkeyReleases.
regEx: (((\b[^A-Z\s]*)?([A-Z][a-z]+)|([\W_-]?[a-z]+))) ( https://regex101.com/r/NgRmXZ/2 )
expected Groups:
reCommended
re
Commended
for
future
Auto
Hotkey
AutoHotkey
HotkeyReleases
Releases
AutoHotkeyReleases.
I also tried the following, but they don't work for me:
(?=\p{Lu}\p{Ll})|(?<=\p{Ll})(?=\p{Lu}) from Splitting CamelCase with regex
(([a-z]*)(?<=[a-z])((?:[A-Z])[a-z]+)) https://regex101.com/r/NgRmXZ/3
(?<=[a-z])([A-Z])|(?<=[A-Z])([A-Z][a-z]) https://regex101.com/r/NgRmXZ/4
((?<!^)([A-Z][a-z]+|(?<=[a-z])[A-Z][a-z]+)) https://regex101.com/r/B5vXaZ/1
I have already started to implement my prototype here:
https://gist.github.com/sl5net/ba5aef19f44fe68204ccb6c96e7c96e0
I have made a regex that almost satisfies your need. However, I'm missing one combination. I don't think it's possible, because it would require parentheses to overlap: 'Hotkey' would have to be part of two different overlapping groups.
Well, here's the regex:
/\b((\w+?(?=[A-Z]|\b))([A-Z][a-z]*)?)([A-Z][a-z]*)?/g
It starts with a word boundary, then creates 2 groups. Group 2 matches any word character one or more times (ungreedy) until a lookahead for a capital letter OR a word boundary succeeds.
Group 3 matches a capital letter followed by zero or more lowercase letters. It is optional.
Group 1 combines group 2 and group 3.
Finally, group 4 matches a capital letter followed by zero or more lowercase letters. It is also optional.
As mentioned, I don't think it's possible to create a group that combines group 3 and group 4, since they overlap. Other than that, this should work as you want.
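The question targets AutoHotkey, but the group layout can be inspected with a short Python sketch (Python's re module accepts the same pattern without the /.../g delimiters). The variable names are illustrative.

```python
import re

# Same pattern as above, without the /.../g delimiters.
pattern = re.compile(r"\b((\w+?(?=[A-Z]|\b))([A-Z][a-z]*)?)([A-Z][a-z]*)?")
text = "reCommended for future AutoHotkeyReleases."

# One row per match: (whole match, group 1, group 2, group 3, group 4).
rows = [(m.group(0),) + m.groups() for m in pattern.finditer(text)]

for row in rows:
    print(row)
```

For AutoHotkeyReleases this yields Auto, Hotkey, AutoHotkey and Releases plus the whole word, but not HotkeyReleases, which is the missing overlapping combination.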

Validation of international telephone numbers with REGEXMATCH

I'm trying to apply a data validation formula to a column, checking if the content is a valid international telephone number. The problem is I can't have +1 or +some dial code because it's interpreted as an operator. So I'm looking for a regex that accepts all these, with the dial code in parentheses:
(+1)-234-567-8901
(+61)-234-567-89-01
(+46)-234 5678901
(+1) (234) 56 89 901
(+1) (234) 56-89 901
(+46).234.567.8901
(+1)/234/567/8901
A starting regex can be this one (where I also took the examples).
This regex matches all the examples you gave us (tested with https://fr.functions-online.com/preg_match_all.html):
/^\(\+\d+\)[\/\. \-]\(?\d{3}\)?[\/\. \-][\d\- \.\/]{7,11}$/m
^ Matches the beginning of the string or of a line.
To match (+1) and (+61): \(\+\d+\): The plus sign and the parentheses have to be escaped since they have special meaning in a regex. \d+ stands for any digit (\d), and the plus means one or more (the plus could be replaced by {1,3}, since country calling codes have one to three digits).
[\/\. \-] Matches exactly one dot, space, slash or hyphen.
\(?\d{3}\)?: The question marks make the parentheses optional (? = 0 or 1 time). It expects three digits.
[\/\. \-] Same as the third item above.
[\d\- \.\/]{7,11}: Expects between 7 and 11 digits, hyphens, spaces, dots or slashes.
$ Matches the end of the line or the end of the string.
The m modifier lets the caret (^) and dollar sign ($) match at line breaks. Remove it if you want those symbols to match only the beginning and the end of the string.
Slashes are used as delimiters for this regex (other characters can be used as well).
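The question is about REGEXMATCH in a spreadsheet, but the pattern can be sanity-checked with a small Python sketch. It uses the pattern as given, minus the PHP delimiters and the m modifier; the two rejected inputs at the end are examples I made up.

```python
import re

# The first pattern from this answer, without the /.../m wrapper.
PHONE = re.compile(r"^\(\+\d+\)[\/\. \-]\(?\d{3}\)?[\/\. \-][\d\- \.\/]{7,11}$")

examples = [
    "(+1)-234-567-8901",
    "(+61)-234-567-89-01",
    "(+46)-234 5678901",
    "(+1) (234) 56 89 901",
    "(+1) (234) 56-89 901",
    "(+46).234.567.8901",
    "(+1)/234/567/8901",
]
accepted = [bool(PHONE.match(s)) for s in examples]

# Made-up negative cases: missing parentheses, and a last part that is too short.
rejected = [bool(PHONE.match(s)) for s in ["+1-234-567-8901", "(+1)-234-567-89"]]

print(accepted, rejected)
```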
I must admit I don't like the last part of the regex, as it does not ensure that you have at least 7 digits.
It would probably be better to remove all the separators (for example with the PHP function str_replace) and deal only with parentheses and digits, using this regex:
/(\(\+\d+\))(\(?\d{3}\)?)(\d{3})(\d{4})/m
Notice that in this last regex I used 4 capturing groups to match the four digit sections of the phone number. This regex keeps the parentheses and the plus sign in the first group and the optional parentheses in the second group. To capture only the digits, you can use this regex:
/\(\+(\d+)\)\(?(\d{3})\)?(\d{3})(\d{4})/m
Note: The groups are for formatting the phone number after validating it. It is probably better for you to keep all the phone numbers in your database in the same format.
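A minimal sketch of that strip-then-capture idea, in Python rather than PHP. The function name `normalize` and the output layout are hypothetical, and using fullmatch (so that extra trailing characters are rejected) is my assumption; the regex in the answer is unanchored.

```python
import re

# Separators to drop before matching: slash, dot, space, hyphen.
SEPARATORS = re.compile(r"[/. \-]")
# Digits-only capture pattern from the answer, without the /.../m wrapper.
DIGIT_GROUPS = re.compile(r"\(\+(\d+)\)\(?(\d{3})\)?(\d{3})(\d{4})")

def normalize(number):
    """Return the number in one uniform layout, or None if it does not fit."""
    stripped = SEPARATORS.sub("", number)
    m = DIGIT_GROUPS.fullmatch(stripped)
    if m is None:
        return None
    # Hypothetical storage format: +<dial code> <3 digits> <3 digits> <4 digits>
    return "+{} {} {} {}".format(*m.groups())

print(normalize("(+61)-234-567-89-01"))
print(normalize("(+1) (234) 56 89 901"))
print(normalize("(+1)-234-567"))
```

Unlike the first pattern, this rejects inputs that do not contain exactly ten digits after the dial code.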
So those are the different possibilities you can use.
Note: These regexes should be compatible with all regex engines, but it is good practice to specify which language you work with, because regex engines don't all handle advanced features the same way.
For example, JavaScript did not support lookbehind before ES2018, and .NET allows more flexible lookbehind than PHP.
Let me know if you need more information.

Regular expression for crossword solution

This is a crossword problem. Example:
the solution is a 6-letter word which starts with "r" and ends with "r"
thus the pattern is "r....r"
the unknown 4 letters must be drawn from the pool of letters "a", "e", "i" and "p"
each letter must be used exactly once
we have a large list of candidate 6-letter words
Solutions: "rapier" or "repair".
Filtering for the pattern "r....r" is trivial, but finding words which also have [aeip] in the "unknown" slots is beyond me.
Is this problem amenable to a regex, or must it be done by exhaustive methods?
Try this:
r(?:(?!\1)a()|(?!\2)e()|(?!\3)i()|(?!\4)p()){4}r
...or more readably:
r
(?:
(?!\1) a () |
(?!\2) e () |
(?!\3) i () |
(?!\4) p ()
){4}
r
The empty groups serve as check marks, ticking off each letter as it's consumed. For example, if the word to be matched is repair, the e will be the first letter matched by this construct. If the regex tries to match another e later on, that alternative won't match it. The negative lookahead (?!\2) will fail because group #2 has participated in the match, and never mind that it didn't consume anything.
What's really cool is that it works just as well on strings that contain duplicate letters. Take your redeem example:
r
(?:
(?!\1) e () |
(?!\2) e () |
(?!\3) e () |
(?!\4) d ()
){4}
m
After the first e is consumed, the first alternative is effectively disabled, so the second alternative takes it instead. And so on...
Unfortunately, this technique doesn't work in all regex flavors. For one thing, they don't all treat empty/failed group captures the same. The ECMAScript spec explicitly states that references to non-participating groups should always succeed.
The regex flavor also has to support forward references, that is, backreferences that appear before the groups they refer to in the regex. It should work in .NET, Java, Perl, PCRE and Ruby, that I know of.
Assuming that you meant that the unknown letters must be among [aeip], then a suitable regex could be:
/r[aeip]{4,4}r/
Which language is being used to compare the strings: Java, .NET ...?
Here is an example using Java:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String mandateLetters = "aeip"; // pool of allowed letters
// For a specific length, use {4} instead of *: "\\br[" + mandateLetters + "]{4}r\\b"
String regPattern = "\\br[" + mandateLetters + "]*r\\b";
Pattern pattern = Pattern.compile(regPattern);
Matcher matcher = pattern.matcher("is this repair ");
boolean found = matcher.find(); // true: "repair" is found
Why not replace each '.' in your original pattern with '[aeip]'?
You'd wind up with a regex string r[aeip][aeip][aeip][aeip]r.
This could of course be shortened to r[aeip]{4,4}r, but that would be a pain to implement in the general case and probably wouldn't improve the code any.
This doesn't address the issue of duplicate letter use. If I were coding it, I'd handle that in code outside the regexp - mostly because the regexp would get more complicated than I'd care to handle.
So the "only once" part is the critical thing. Listing all permutations is obviously not feasible. If your language/environment supports lookaheads and backreferences you can make it a bit easier for yourself:
r(?=[aeip]{4,4})(.)(?!\1)(.)(?!\1|\2)(.)(?!\1|\2|\3).r
Still quite ugly, but here is how it works:
r # match an r
(?= # positive lookahead (doesn't advance position of "cursor" in input string)
[aeip]{4,4}
) # make sure that there are the four desired character ahead
(.) # match any character and capture it in group 1
(?!\1)# make sure that the next character is NOT the same as the previous one
(.) # match any character and capture it in group 2
(?!\1|\2)
# make sure that the next character is neither the first nor the second
(.) # match any character and capture it in group 3
(?!\1|\2|\3)
# same thing again for all three characters
. # match another arbitrary character
r # match an r
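Python's re module supports lookaheads and backreferences, so the pattern can be tried there directly. A minimal sketch, where the word list is my own choice of test data:

```python
import re

# Four slots, each lookahead forbids repeating an earlier captured letter.
pattern = re.compile(r"r(?=[aeip]{4,4})(.)(?!\1)(.)(?!\1|\2)(.)(?!\1|\2|\3).r")

words = ["rapier", "repair", "rapper", "reaper", "ripper"]
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)
```

"rapper" and "ripper" fail on the doubled p, and "reaper" fails on the second e.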
This is neither really elegant nor scalable. So you might just want to use r([aiep]{4,4})r (capturing the four critical letters) and ensure the additional condition without regex.
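The "capture, then check without regex" route could look like this in Python. This is only a sketch: the function name and the sorted() comparison are my own choice of how to ensure the additional condition.

```python
import re

POOL = "aeip"
# Capture the four middle letters; the regex only checks membership in the pool.
pattern = re.compile(r"r([aiep]{4,4})r")

def uses_pool_exactly_once(word):
    """True if word is r....r and the middle letters are a permutation of POOL."""
    m = pattern.fullmatch(word)
    # Comparing sorted letters checks "each letter exactly once" outside the regex.
    return m is not None and sorted(m.group(1)) == sorted(POOL)

checks = {w: uses_pool_exactly_once(w)
          for w in ["rapier", "repair", "rapper", "roamer"]}
print(checks)
```

Since sorted() compares multisets, the same check also works for pools with repeated letters, such as "deee".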
EDIT: In fact, the above pattern is only really useful and necessary if you just want to ensure that you have 4 non-identical characters. For your specific case, again using lookaheads, there is a simpler (though longer) solution:
r(?=[^a]*a[^a]*r)(?=[^e]*e[^e]*r)(?=[^i]*i[^i]*r)(?=[^p]*p[^p]*r)[aeip]{4,4}r
Explained:
r # match an r
(?= # lookahead: ensure that there is exactly one a until the next r
[^a]* # match an arbitrary amount of non-a characters
a # match one a
[^a]* # match an arbitrary amount of non-a characters
r # match the final r
) # end of lookahead
(?=[^e]*e[^e]*r) # ensure that there is exactly one e until the next r
(?=[^i]*i[^i]*r) # ensure that there is exactly one i until the next r
(?=[^p]*p[^p]*r) # ensure that there is exactly one p until the next r
[aeip]{4,4}r # actually match the rest to include it in the result
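A quick Python check of the lookahead-per-letter pattern, assuming the same hand-picked word list as test data:

```python
import re

# One lookahead per pool letter: exactly one a, e, i and p before the final r.
pattern = re.compile(
    r"r(?=[^a]*a[^a]*r)(?=[^e]*e[^e]*r)(?=[^i]*i[^i]*r)(?=[^p]*p[^p]*r)[aeip]{4,4}r"
)

words = ["rapier", "repair", "rapper", "reaper"]
results = {w: bool(pattern.fullmatch(w)) for w in words}
print(results)
```

"rapper" and "reaper" are both rejected because the lookahead for i finds no i before the final r.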
For r....m with a pool of deee, this could be adjusted as:
r(?=[^d]*d[^d]*m)(?=[^e]*(?:e[^e]*){3,3}m)[de]{4,4}m
This ensures that there is exactly one d and exactly 3 es.
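A small Python sketch of the r....m variant, with the e-counting lookahead written as (?:e[^e]*){3,3}, i.e. three times "an e followed by any number of non-e characters". The second test word is made up to show a rejection.

```python
import re

# Exactly one d and exactly three e's between the leading r and the final m.
pattern = re.compile(r"r(?=[^d]*d[^d]*m)(?=[^e]*(?:e[^e]*){3,3}m)[de]{4,4}m")

results = {w: bool(pattern.fullmatch(w)) for w in ["redeem", "reddem"]}
print(results)
```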
Not purely regex, since sed applies two regexes in sequence:
sed -n -e '/^r[aiep]\{4,\}r$/{/\([aiep]\).*\1/!p;}' YourFile
Select the lines made of at least 4 letters from the group aeip surrounded by r, then print only the lines in which no letter of that group occurs twice.
A more scalable solution (no need to write \1, \2, \3 and so on for each letter or position) is to use negative lookahead to assert that each character is not occurring later:
^r(?:([aeip])(?!.*\1)){4}r$
more readable as:
^r
(?:
([aeip])
(?!.*\1)
){4}
r$
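To illustrate why this form scales, here is a Python sketch that builds the pattern from a pool string. The helper `build_pattern` is hypothetical; it assumes the pool consists of distinct plain letters and shares no letter with the closing letter (see the improvements below for that case).

```python
import re

def build_pattern(first, pool, last):
    """Build ^first(?:([pool])(?!.*\\1)){n}last$ for a pool of n distinct letters."""
    n = len(pool)
    # Each captured letter must not occur again later in the string.
    return re.compile(rf"^{first}(?:([{pool}])(?!.*\1)){{{n}}}{last}$")

pattern = build_pattern("r", "aeip", "r")
words = ["rapier", "repair", "rapper", "reaper"]
solutions = [w for w in words if pattern.match(w)]
print(pattern.pattern, solutions)
```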
Improvements
This was a quick solution which works in the situation you gave us, but here are some additional constraints for a more robust version:
If the "pool of letters" may share some letters with the end of string, include the end of pattern in the lookahead:
^r(?:([aeip])(?!.*\1.*\2)){4}(r$)
(may not work as intended in all regex flavors, in which case copy-paste the end of pattern instead of using \2)
If some letters must be present not only once but a different fixed number of times, add a separate lookahead for all letters sharing this number of times. For instance, "r....r" with one "a" and one "p" but two "e" would be matched by this regex (but "rapper" and "repeer" wouldn't):
^r(?:([ap])(?!.*\1.*\3)|([e])(?!.*\2.*\2.*\3)){4}(r$)
The non-capturing group now has 2 alternatives: ([ap])(?!.*\1.*\3), which matches "a" or "p" not followed by another one anywhere before the ending, and ([e])(?!.*\2.*\2.*\3), which matches "e" not followed by 2 other ones anywhere before the ending (so it fails on the first one if there are 3 in total).
BTW this solution includes the one above, but the end of the pattern is shifted to \3 here (see also the note about flavors).

How to extract internal words using regex

I am trying to match only the street name from a series of addresses. The addresses might look like:
23 Barrel Rd.
14 Old Mill Dr.
65-345 Howard's Bluff
I want to use a regex to match "Barrel", "Old Mill", and "Howard's". I need to figure out how to exclude the last word. So far I have a lookbehind to exclude the digits, and I can include the words and spaces and "'" by using this:
(?<=\d\s)(\w|\s|\')+
How can I exclude the final word (which may or may not end in a period)? I figure I should be using a lookahead, but I can't figure out how to formulate it.
You don't need a look-behind for this:
/^[-\d]+ ([\w ']+) \w+\.?$/
Match one or more digits and hyphens
space
match letters, digits, spaces, apostrophes into capture group 1
space
match a final word and an optional period
An example Ruby implementation:
regex = /^[-\d]+ ([\w ']+) \w+\.?$/
tests = [ "23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff" ]
tests.each do |test|
  p test.match(regex)[1]
end
Output:
"Barrel"
"Old Mill"
"Howard's"
I believe the lookahead you want is (?=\s\w+\.?$).
\s: you don't want to include the last space
\w+: at least one word character (A-Z, a-z, 0-9, or '_')
\.?: optional period (for abbreviations such as "St.")
$: make sure this is the last word
If there's a possibility that there might be additional whitespace before the newline, just change this to (?=\s\w+\.?\s*$).
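Combined with the lookbehind from the question, the lookahead can be tried in Python like this. This is a sketch: the non-capturing group is my substitution for the question's capturing (\w|\s|\')+, so that the whole match is the street name.

```python
import re

# Lookbehind from the question, then the lookahead that excludes the last word.
pattern = re.compile(r"(?<=\d\s)(?:\w|\s|')+(?=\s\w+\.?$)")

addresses = ["23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff"]
streets = [pattern.search(a).group(0) for a in addresses]
print(streets)
```

The greedy group first runs to the end of the line and then backtracks until the lookahead sees a space, one final word and an optional period.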
Why not just match what you want? If I have understood correctly, you need all the words after the number, excluding the last word. Words are separated by spaces, so just take everything between the number and the last space.
Example:
\d+(?:-\d+)? ((?:.)+)  Note: the pattern ends with a space.
That leaves what you want in \1.
If you want the match itself to be exactly that text, you can use \K (not supported by every regex engine):
\d+(?:-\d+)? \K.+(?= )
Another option is to use the split() function provided in most scripting languages. Here's the Python version of what you want:
stname = " ".join(address.split()[1:-1])
(Here address is the original address line, and stname is the name of the street, i.e., what you're trying to extract.)