How to extract internal words using regex

How to extract internal words using regex - regex

I am trying to match only the street name from a series of addresses. The addresses might look like:
23 Barrel Rd.
14 Old Mill Dr.
65-345 Howard's Bluff
I want to use a regex to match "Barrel", "Old Mill", and "Howard's". I need to figure out how to exclude the last word. So far I have a lookbehind to exclude the digits, and I can include the words and spaces and "'" by using this:
(?<=\d\s)(\w|\s|\')+
How can I exclude the final word (which may or may not end in a period)? I figure I should be using a lookahead, but I can't figure out how to formulate it.

You don't need a look-behind for this:
/^[-\d]+ ([\w ']+) \w+\.?$/
Match one or more digits and hyphens
space
match letters, digits, spaces, apostrophes into capture group 1
space
match a final word and an optional period
An example Ruby implementation:
regex = /^[-\d]+ ([\w ']+) \w+\.?$/
tests = [ "23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff" ]
tests.each do |test|
p test.match(regex)[1]
end
Output:
"Barrel"
"Old Mill"
"Howard's"

I believe the lookahead you want is (?=\s\w+\.?$).
\s: you don't want to include the last space
\w: at least one word-character (A-Z, a-z, 0-9, or '_')
\.?: optional period (for abbreviations such as "St.")
$: make sure this is the last word
If there's a possibility that there might be additional whitespace before the newline, just change this to (?=\s\w+\.?\s*$).

Why not just match what you want? If I have understood well you need to get all the words after the numbers excluding the last word. Words are separated by space so just get everything between numbers and the last space.
Example
\d+(?:-\d+)? ((?:.)+)  Note: there's a space at the end.
Tha will end up with what you want in \1 N times.
If you just want to match the exact text you may use \K (not supported by every regex engine) but: Example
With the regex \d+(?:-\d+)? \K.+(?= )

Another option is to use the split() function provided in most scripting languages. Here's the Python version of what you want:
stname = address.split()[1:-1]
(Here address is the original address line, and stname is the name of the street, i.e., what you're trying to extract.)

Related

How to find digits in String by regular expression? [duplicate]

I would like to match positive and negative numbers (no decimal or thousand separators) inside a string using .NET, but I want to match whole words only.
So if a string looks like
redeem: -1234
paid: 234432
then I'd like to match -1234 and 234432
But if text is
LS022-1234-5678
FA123245
then I want no match returned. I tried
\b\-?\d+\b
but it will only match 1234 in the first scenario, not returning the "-" sign.
Any help is appreciated. Thank you.

Well, I'm sure this is far from perfect, but it works for your examples:
(?<=\W)-?(?<!\w-)\d+
If you want to allow underscores just before the number, then I'd use this modification:
(?i)(?<=[^a-z0-9])-?(?<![a-z0-9]-)\d+
Let me know of any issues and I'll try and help. If you'd like me to explain either of them, let me know that too.
EDIT
To only match if there is a space or tab just before the number / negative sign (as noted in the comment below), this could be used:
(?<=[ \t])-?\d+
Note that it will match e.g. on the first number series of a telephone number, time or date value, and will not match if the number is at the beginning of the line (after a newline) - make sure this is what you intend :D

There is no word boundary between a space and -, thus you can't use \b there.
You could use:
(?<!\S)-?\d+\b
or
(?<![\w-])-?\d+\b
depending on your requirements (which aren't fully specified).
Both will work for your examples tho.

The \b-?\d+\b pattern is wrong because \b before an optional -? pattern will require a word char to appear immediately to the left of the hyphen. In general, do not use word boundaries next to optional patterns (unless you know what you are doing of course).
You might use -?\b\d+\b to match 123 or -123 like numbers as whole words. However, here, you are looking for something a bit different, because the 1234 and 5678 are whole words inside LS022-1234-5678 since they are enclosed with non-word chars (namely, a hyphen).
In this case, you need to extend whole word matching \b with extra lookbehind check on the left:
-?\b(?<!\d-)\d+\b
See the regex demo. Details:
-? - an optional hyphen
\b - a word boundary
(?<!\d-) - a negative lookbehind that fails the match if there is a digit + - immediately to the left of the current location.
\d+ - one or more digits
\b - a word boundary.
See the C# demo:
using System;
using System.Linq;
using System.Text.RegularExpressions;
public class Test
{
public static void Main()
{
var text = "LS022-1234-5678, FA123245, redeem: -1234, paid: 234432";
var matches = Regex.Matches(text, #"-?\b(?<!\d-)\d+\b").Cast<Match>().Select(x => x.Value).ToList();
foreach (var s in matches)
Console.WriteLine(s);
}
}
Output:
-1234
234432

Regex that gets a sentence containing a name

I am creating regexes that get the whole sentence if a piece of specific information exists. Right now I am working on my name regex, so if there is any composed name (example: "Jorge Martel", "Jorge Martel del Arnold Albuquerque") the regex should get the whole sentence that has the name.
If I have these two sentences:
(1) - "A hardworking guy is working at the supermarket. They call him Jorge Horizon, but that's not his real name."
(2) - "He has an identity document that contains the name, Jorge Martel Arnold."
The regex should return these two results from the sentences above:
(1) - "They call him Jorge Horizon, but that's not his real name."
(2) - "He has an identity document that contains the name, Jorge Martel Arnold."
This is my regex:
(?:(?(?<=[\.!?]\s([A-Z]))(.+?[^.])|))?((?:(?:[A-Z][A-zÀ-ÿ']+\s(?:(?:(?:[A-zÀ-ÿ']{1,3}\s)?(?:[A-ZÀ-Ÿ][A-zÀ-ÿ']*\s?))+))\b)(.+?[\.!?](?:\s|\n|\Z)))
Basically, it verifies if there is a dot, exclamation, or interrogation symbol with a blank space and an upper case character and tells the regex that everything must be select, else it should get all the sentence.
My else case (|) right now is empty, because using (.+?) avoids my first condition...
Regex without the else case:
Validates until the dot, but doesn't get the second sentence.
Regex with the else case:
Validates the second sentence, but overrides the first condition that appears in the first sentence.
I expect my regex to return correctly the sentences:
"They call him Jorge Horizon, but that's not his real name."
"He has an identity document that contains the name, Jorge Martel Arnold."
I have also created a text to validate the regex operations as I will be using it a lot in texts. I added a lot of conditions in this text, which will probably appear in my daily work.
Check my regex, sentence, and text here:
Does anyone know what should I change in my regex? I have tried many variations and still cannot find the solution.
P.S.: I intend to use it in my python code, but I need to fix it with the regex and not with the python code.

you can try this.
[\w\ \,\']+\.\ ?([\w\ \,\']+\.)|^([\w\ \,\']+\.)$
prints $1$2. I.e if group one is empty it prints blank since there is no match, then will print group 2. Visa versa, it prints group 1 when group 2 is not there.
[\w\ ,']+.\ ?([\w\ ,']+.) - as matching anything with XXX. XXX.
then
^([\w\ ,']+.)$ - must start end with only 1 sentence.
Though honestly this can easily be done with a Tokenizer of (.) that check length of 1 or 2. It' really like using a sledgehammer to hammer a nail.

Matching names can be a very hard job using a regex, but if you want to match at least 2 consecutive uppercase words using the specified ranges.
Assuming the names start with an uppercase char A-Z (else you can extend that character class as well with the allowed chars or if supported use \p{Lu} to match an uppercase char that has a lowercase variant):
(?<!\S)[A-Z][A-Za-zÀ-ÿ]*(?:\s+[a-zÀ-ÿ,]+)*\s+[A-Z][a-zÀ-ÿ]*\s+[A-Z][a-zÀ-ÿ,]*.*?[.!?](?!\S)
(?<!\S) Assert a whitespace boundary to the left
[A-Z][A-Za-zÀ-ÿ]* Match an uppercase char A-Z optionally followed by matching the defined ranges
(?:\s+[a-zÀ-ÿ,]*)* Optionally repeat matching 1+ whitespace chars and 1 or more of the ranges
\s+[A-Z][a-zÀ-ÿ]*\s+[A-Z][a-zÀ-ÿ,]* Match 2 times whitespace chars followed by an uppercase A-Z and optional chars defined in the character class
.*?[.!?] Match as least as possible chars followed by one of . ! or ?
(?!\S) Assert a whitspace boundary to the right
Regex demo

Try this:
((?:^|(?:[^\.!?]*))[^\.!?\n]*(?:(?:[A-ZÀ-Ÿ][A-zÀ-ÿ']+\s?){2,}[^\.!?]*[\.!?]))
It will capture sentences where name has at least two words, e.g. His name is John Smith.
It won't capture sentences like: John went to a concert.

Regular expression to match n consecutive capitalized words

I am trying to capture n consecutive capitalized words. My current code is
n=5
a='This is a Five Gram With Five Caps and it also contains a Two Gram'
re.findall(' ([A-Z]+[a-z|A-Z]* ){n}',a)
Which returns the following:
['Caps ']
It's identifying the fifth consecutive capitalized word, but I would like it to return the entire string of capitalized words. In other words:
[' Five Gram With Five Caps ']

Note that | doesn't act as an OR inside a character class. It'll match | literally. The other issue here is that findall's behaviour is to return the match unless a group exists (although python's documentation doesn't really make this clear):
The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups
So this is why you're getting the result of the first capture group, which is the last uppercase-starting word of Caps.
The simple solution is to change your capturing group to a non-capturing group. I've also changed the space at the start to \b so as to not match an additional whitespace (which I presume you were planning on trimming anyway).
See code in use here
import re
r = re.compile(r"\b(?:[A-Z][a-zA-Z]* ){5}")
s = "This is a Five Gram With Five Caps and it also contains a Two Gram"
print(r.findall(s))
See regex in use here
\b(?:[A-Z][a-zA-Z]* ){5}
\b Assert position as a word boundary
(?:[A-Z][a-zA-Z]* ?){5} Match the following exactly 5 times
[A-Z] Match an uppercase ASCII letter once
[a-zA-Z]* Match any ASCII letter any number of times
Match a space
Result: ['Five Gram With Five Caps ']
Additionally, you may use the regex \b\[A-Z\]\[a-zA-Z\]*(?: \[A-Z\]\[a-zA-Z\]*){4}\b instead. This will allow matches at the start/end of the string as well as anywhere in the middle without grabbing extra whitespace. Another alternative may include (?:^|(?<= ))\[A-Z\]\[a-zA-Z\]*(?: \[A-Z\]\[a-zA-Z\]*){4}(?= |$)

Wrap the whole pattern in a capturing group:
(([A-Z]+[a-z|A-Z]* ){5})
Demo

Validation of international telephone numbers with REGEXMATCH

I'm trying to apply a data validation formula to a column, checking if the content is a valid international telephone number. The problem is I can't have +1 or +some dial code because it's interpreted as an operator. So I'm looking for a regex that accepts all these, with the dial code in parentheses:
(+1)-234-567-8901
(+61)-234-567-89-01
(+46)-234 5678901
(+1) (234) 56 89 901
(+1) (234) 56-89 901
(+46).234.567.8901
(+1)/234/567/8901
A starting regex can be this one (where I also took the examples).

This regex match all the example you gave us (tested with https://fr.functions-online.com/preg_match_all.html)
/^\(\+\d+\)[\/\. \-]\(?\d{3}\)?[\/\. \-][\d\- \.\/]{7,11}$/m
^ Match the beginning of the string or new line.
To match (+1) and (+61): \(\+\d+\): The plus sign and the parentheses have to be escaped since they have special meaning in the regex. \d+ Stand for any digit (\d) character and the plus means one or more (the plus could be replaced by {1,2})
[\/\. \-] This match dot, space, slash and hyphen exactly one time.
\(?\d{3}\)?: The question mark is for optional parenthesis (? = 0 or 1 time). It expect three digits.
[\/\. \-] Same as step 3
[\d\- \.\/]{7,11}: Expect digits, hyphen, space, dot or slash between 7 and 11 time.
$ Match the end of the line or the end of the string
The m modifier allow the caret (^) and dollar sign ($) combination to match line break. Remove that if you want those symbol to match only the begining and the end of the string.
Slashes are use are delimiter for this regex (there are other character that you can use).
I must admit I don't like the last part of the regex as do not ensure that you have at least 7 digits.
It would be probably better to remove all the separator (by example with PHP function str_replace) and deal only with parenthesis and number with this regex
/(\(\+\d+\))(\(?\d{3}\)?)(\d{3})(\d{4})/m
Notice that in this last regex I used 4 capturing group to match the four digit section of the phone number. This regex keep the parenthesis and the plus sign of the first group and the optional parenthesis of the second group. To keep only the digits group, you can use this regex:
/\(\+(\d+)\)\(?(\d{3})\)?(\d{3})(\d{4})/m
Note: The groups are for formatting the phone number after validating it. It is probably better for you to keep all your phone number in your database in the same format.
Well, here are different possibility you can use.
Note: Those regex should be compatible with all regex engine, but it is good practice to specify with which language you works because regex engine don't deal the same way with advanced/fancy function.
By example, the look behind is not supported by javascript and .Net allow a more powerful control on lookbehind than PHP.
Keep me in touch if you need more information

regex: find one-digit number

I need to find the text of all the one-digit number.
My code:
$string = 'text 4 78 text 558 my.name#gmail.com 5 text 78998 text';
$pattern = '/ [\d]{1} /';
(result: 4 and 5)
Everything works perfectly, just wanted to ask it is correct to use spaces?
Maybe there is some other way to distinguish one-digit number.
Thanks

First of all, [\d]{1} is equivalent to \d.
As for your question, it would be better to use a zero width assertion like a lookbehind/lookahead or word boundary (\b). Otherwise you will not match consecutive single digits because the leading space of the second digit will be matched as the trailing space of the first digit (and overlapping matches won't be found).
Here is how I would write this:
(?<!\S)\d(?!\S)
This means "match a digit only if there is not a non-whitespace character before it, and there is not a non-whitespace character after it".
I used the double negative like (?!\S) instead of (?=\s) so that you will also match single digits that are at the beginning or end of the string.
I prefer this over \b\d\b for your example because it looks like you really only want to match when the digit is surrounded by spaces, and \b\d\b would match the 4 and the 5 in a string like 192.168.4.5
To allow punctuation at the end, you could use the following:
(?<!\S)\d(?![^\s.,?!])
Add any additional punctuation characters that you want to allow after the digit to the character class (inside of the square brackets, but make sure it is after the ^).

Use word boundaries. Note that the range quantifier {1} (a single \d will only match one digit) and the character class [] is redundant because it only consists of one character.
\b\d\b

Search around word boundaries:
\b\d\b
As explained by the others, this will extract single digits meaning that some special characters might not be respected like "." in an ip address. To address that, see F.J and Mike Brant's answer(s).

It really depends on where the numbers can appear and whether you care if they are adjacent to other characters (like . at the end of a sentence). At the very least, I would use word boundaries so that you can get numbers at the beginning and end of the input string:
$pattern = '/\b\d\b/';
But you might consider punctuation at the end like:
$pattern = '/\b\d(\b|\.|\?|\!)/';

If one-digit numbers can be preceded or followed by characters other than digits (e.g., "a1 cat" or "Call agent 7, pronto!") use
(?<!\d)\d(?!\d)
Demo
The regular expression reads, match a digit (\d) that is neither preceded nor followed by digit, (?<!\d) being a negative lookbehind and (?!\d) being a negative lookahead.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to extract internal words using regex - regex

Another option is to use the split() function provided in most scripting languages. Here's the Python version of what you want: stname = address.split()[1:-1] (Here address is the original address line, and stname is the name of the street, i.e., what you're trying to extract.)

Related

How to find digits in String by regular expression? [duplicate]

Regex that gets a sentence containing a name

Regular expression to match n consecutive capitalized words

Validation of international telephone numbers with REGEXMATCH

regex: find one-digit number

Categories

Resources