One of the columns has the data as below and I only need the suburb name, not the state or postcode.
I'm using Alteryx and tried the regex (\<\w+\>)\s\<\w+\>, but only a few records end up in the new column.
Input:
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta NSW 2150
Claymore 2559
CASULA
Output
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta
Claymore
CASULA
This regex matches all letter-words up to but not including an Australian state abbreviation (since the addresses are clearly Australian):
( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+
See demo
The negative lookahead includes a word boundary so that suburbs which merely start with a state abbreviation are still matched (see demo).
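As a rough illustration outside Alteryx, the same pattern can be tried in Python. This is only a sketch: the helper name suburb_only and the sample list are made up for the example, and anchoring the match at the start of the value is an assumption about the data.

```python
import re

# The pattern from the answer: a run of letter-only words that stops
# before an Australian state abbreviation (or before any digits).
SUBURB = re.compile(r"( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+")

def suburb_only(value: str) -> str:
    """Return the leading suburb words, or the value unchanged if nothing matches."""
    m = SUBURB.match(value)  # match() anchors at the start of the string
    return m.group(0) if m else value

rows = [
    "CABRAMATTA",
    "CANLEY HEIGHTS",
    "ST JOHNS PARK",
    "Parramatta NSW 2150",
    "Claymore 2559",
    "CASULA",
]
cleaned = [suburb_only(r) for r in rows]
print(cleaned)
```

A suburb such as "SANDY BAY" is kept whole even though it begins with "SA", because the word boundary inside the lookahead only rejects a complete state abbreviation.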
Expanding on Bohemian's answer, you can use capture groups with REGEX_Replace in Alteryx. So:
REGEX_Replace([Field1], "(.*)(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)+(\s*\d+)" , "\1")
This keeps whatever matches the first group (so just the suburb). The second and third groups match the state and the postcode. Not a perfect regex (for example, records without a state are left unchanged), but it should get you most of the way there.
I think this workflow will help you :
I am trying to apply a regex filter to news headlines. I want the filter to match only if at least one word from each of two wordlists is present in the headline. Furthermore, it should generate only one match (not multiple matches if several tokens apply).
These are my wordlists (and my regex, which currently doesn't work):
(Threat actor|attack|skimm|malware|exploit|fraud|inject|trojan|ransom|\bRCE\b)+(\bATM\b|bank|\bAustria\b)
The regex should match only if "ATM", "bank" or "Austria" AND a word from the first list (in the first parentheses) are present in the news headline, not if only "ATM", ... is present.
Example: a match should appear only if both "exploit" AND "ATM" are found in the headline.
Given the four headlines below, only headline 2 should return a match.
1. An APT Group Exploiting a 0-day in FatPipe WARP, MPVPN, and IPVPN Software
2. Ares Malware: The Grandson of the Kronos Banking Trojan that targets German Banks.
3. In human-operated ransomware attacks, threat actors use predictable methods to enter a device but eventually rely on hands-on-keyboard activities.
4. Kotak Mahindra Bank launches new transactions across India
Example 1 has only a word from the first list. Example 4 has only a word from the second list.
Only example 2 has occurrences of words from BOTH lists.
Example 3 has two occurrences of words from the first list, but none from the second list, therefore NO MATCH.
I would be very grateful if you could provide a working regex filter for this case.
Regards, Michael
You could match the two groups in either order:
(Threat actor|attack|skimm|malware|exploit|fraud|inject|trojan|ransom|\bRCE\b).*(\bATM\b|bank|\bAustria\b)|(\bATM\b|bank|\bAustria\b).*(Threat actor|attack|skimm|malware|exploit|fraud|inject|trojan|ransom|\bRCE\b)
Regex demo
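A quick way to check the either-order alternation is to run it over the four headlines from the question. The sketch below uses Python; the names FIRST, SECOND, BOTH and is_relevant are invented for the example, and the case-insensitive flag is an assumption (without it "Malware" and "Banking" in headline 2 would not match the lowercase list entries).

```python
import re

# Word lists copied from the question, wrapped in non-capturing groups.
FIRST = r"(?:Threat actor|attack|skimm|malware|exploit|fraud|inject|trojan|ransom|\bRCE\b)"
SECOND = r"(?:\bATM\b|bank|\bAustria\b)"

# Either order: FIRST ... SECOND, or SECOND ... FIRST.
# IGNORECASE is an assumption made for this sketch.
BOTH = re.compile(rf"{FIRST}.*{SECOND}|{SECOND}.*{FIRST}", re.IGNORECASE)

def is_relevant(headline: str) -> bool:
    """True if the headline contains a term from each list (one search, one match)."""
    return BOTH.search(headline) is not None

headlines = [
    "An APT Group Exploiting a 0-day in FatPipe WARP, MPVPN, and IPVPN Software",
    "Ares Malware: The Grandson of the Kronos Banking Trojan that targets German Banks.",
    "In human-operated ransomware attacks, threat actors use predictable methods "
    "to enter a device but eventually rely on hands-on-keyboard activities.",
    "Kotak Mahindra Bank launches new transactions across India",
]
results = [is_relevant(h) for h in headlines]
print(results)
```

Because `.` does not cross line breaks by default, a headline that spans several lines would need to be joined into one line first (or the pattern compiled with re.DOTALL).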
Generally an address comes with comma separation and can be split using a simple regex, e.g.
123 Main St, Los Angeles, CA, 90210
We can apply a regex here and split on the commas. But in my database, addresses are stored without commas, e.g.
A Better Property Management<br/> 6621 E PACIFIC COAST HWY<br/> STE 255<br/> LONG BEACH CA 90803-4241
And I want to put a comma after the city, before the state. Something like this:
A Better Property Management<br/> 6621 E PACIFIC COAST HWY<br/> STE 255<br/> LONG BEACH ,CA 90803-4241
I was thinking about finding the last two-letter word from the end and inserting a comma using regex. But I also need to account for situations where the address is incomplete or the city or postal code is missing. Is there a way this can be done? I only found solutions that split on commas, not the reverse.
I was thinking we could select the last 2 letters before the numbers with something like [A-Za-z]{2} (I don't know if this is correct), and at the same time check that this is done only if the string ends with numbers.
I tried
(\b(AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY|Alabama|Alaska|Arizona|Arkansas|California|Colorado|Connecticut|Delaware|District of Columbia|Florida|Georgia|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Mississippi|Missouri|Montana|Nebraska|Nevada|New Hampshire|New Jersey|New Mexico|New York|North Carolina|North Dakota|Ohio|Oklahoma|Oregon|Pennsylvania|Rhode Island|South Carolina|South Dakota|Tennessee|Texas|Utah|Vermont|Virginia|Washington|West Virginia|Wisconsin|Wyoming)\b)
https://regex101.com/r/75fqO6/1
You can use
[a-zA-Z]+\s+\d(?:[\d-]*\d)?$
Replace with ,$0.
See the regex demo. Details:
[a-zA-Z]+ - one or more letters
\s+ - one or more whitespaces
\d - a digit
(?:[\d-]*\d)? - an optional substring of zero or more digits/hyphens and then a digit
$ - end of string.
The $0 in the replacement is a backreference to the whole match value, all text matched by the regex is put back where it was found with a prepended comma.
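For reference, here is a minimal Python sketch of the same replacement. Python's re module writes the whole-match backreference as \g<0> rather than $0; the function name add_comma is made up for the example.

```python
import re

# Letters, whitespace, then a digit run (digits/hyphens, ending in a digit) at end of string.
TAIL = re.compile(r"[a-zA-Z]+\s+\d(?:[\d-]*\d)?$")

def add_comma(address: str) -> str:
    """Prepend a comma to the trailing 'STATE ZIP' part; leave other strings untouched."""
    return TAIL.sub(r",\g<0>", address)

full = ("A Better Property Management<br/> 6621 E PACIFIC COAST HWY<br/> "
        "STE 255<br/> LONG BEACH CA 90803-4241")
print(add_comma(full))
print(add_comma("LONG BEACH"))  # no trailing digits, so nothing changes
```

Strings that do not end in digits are returned as they are, which covers the incomplete-address case mentioned in the question.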
I have this regex
(\b(\S+\s+){1,10})\1.*MY
and I want group 1 to capture "The name" from
The name is is The name MY
I get "is" for now.
The name can be any random words of any length.
It need not be at the beginning.
It need not be only 2 or 3 words. It can be anything less than 10 words.
Only thing sure is that it will be the last set of repeating words.
Examples:
The name is Anthony is is The name is Anthony - "The name is Anthony".
India is my country All Indians are India is my country - "India is my country "
Times of India Alphabet Google is the company Alphabet Google canteen - "Alphabet Google"
You could try:
(\b\w+[\w\s]+\b)(?:.*?\b\1)
As demonstrated here
Explanation -
(\b\w+[\w\s]+\b) is the capture group 1 - which is the text that is repeated - separated by word boundaries.
(?:.*?\b\1) is a non-capturing group which tells the regex system to match the text in group 1, only if it is followed by zero-or-more characters, a word-boundary, and the repeated text.
Regex generally captures the longest leftmost match. There are no examples in your question where this would not actually be the string you want, but that could just mean you have not found good examples to show us.
With that out of the way,
((\S+\s)+)(\S+\s){0,9}\1
would appear to match your requirements as currently stated. The "longest leftmost" behavior could still get in the way if there are e.g. straddling repetitions, like
this that more words this that more words
where in the general case regex alone cannot easily be made to always prefer the last possible match and tolerate arbitrary amounts of text after it.
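To see how the pattern ((\S+\s)+)(\S+\s){0,9}\1 behaves, here is a small Python sketch run on two inputs from the question. The helper name repeated_phrase is invented, and stripping the trailing whitespace from group 1 is an addition for readability. Note that, as written, the pattern needs whitespace after the second occurrence, so a repeat sitting at the very end of the string is not picked up in full.

```python
import re

# Group 1: one or more whitespace-terminated words; then up to 9 other words;
# then the same text as group 1 again.
REPEAT = re.compile(r"((\S+\s)+)(\S+\s){0,9}\1")

def repeated_phrase(text: str):
    """Return the repeated word sequence (without trailing whitespace), or None."""
    m = REPEAT.search(text)
    return m.group(1).strip() if m else None

first = repeated_phrase("The name is is The name MY")
third = repeated_phrase(
    "Times of India Alphabet Google is the company Alphabet Google canteen"
)
print(first)
print(third)
```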
I'm doing a perl program (script?) that reads through a text file and identifies all names and categorizes them as either person, location, organization, or miscellaneous. I'm having trouble with things like New York or Pacific First Financial Corp. where there are multiple capitalized words in a row. I've been using:
/([A-Z][a-z]+)+/
to capture as many capitalized words in a row as there are on a given line. From what I understand, the + will match 1 or more instances of such a pattern, but it's only matching one (i.e. New in New York). For New York, I can just repeat the [A-Z][a-z]+ twice, but then it doesn't find patterns with more than 2 capitalized words in a row. What am I doing wrong?
PS Sorry if my use of vocabulary is off I'm always so bad with that.
You were just missing the spacing between words.
The following matches whitespace before each word, except the first, so covers the cases you've described:
use strict;
use warnings;

while (<DATA>) {
    while (/(?=\w)((?:\s*[A-Z][a-z]+)+)/g) {
        print "$1\n";
    }
}

__DATA__
I'm doing a perl program (script?) that reads through a text file and identifies all names and categorizes them as either person, location, organization, or miscellaneous. I'm having trouble with things like New York or Pacific First Financial Corp. where there are multiple capitalized words in a row. I've been using:
to capture as many capitalized words in a row as there are on a given line. From what I understand the + will match 1 or more instances of such pattern, but it's only matching one (i.e. New in New York). For New York, I can just repeate the [A-Z][a-z]+ twice but it doesn't find patterns with more than 2 capitalized words in a row. What am I doing wrong?
PS Sorry if my use of vocabulary is off I'm always so bad with that.
Outputs:
New York
Pacific First Financial Corp
From
New
New York
For New York
What
Sorry
There's a CPAN module called Lingua::EN::NamedEntity which seems to do what you want. Might be worth taking a quick look at it.
The How
The pattern you provide in your question, /([A-Z][a-z]+)+/, matches one or more capitalised words given consecutively with nothing between them, like this
This
ThisAndThat
but it won't match this
Not This
It actually matches each of these individually
Not
This
So let's modify the regex to /(?:[A-Z][a-z]+)(?:\s*[A-Z][a-z]+)*/. That is a bit of a mouthful, so let's break it down a bit at a time
(?: ... ) Groups like this don't capture, which is more efficient
[A-Z][a-z]+ Matches a capitalised word
\s*[A-Z][a-z]+ Matches a subsequent capitalised word, optionally preceded by whitespace
The What - TL;DR
Put this all together and we now have a regex that matches a capitalised word, then any subsequent ones with or without whitespace separation. So it matches
This
ThisAndThat
Not This
We can now abstract this regex a bit to avoid repetition and use it in code like so
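use feature 'say';    # needed for say

my $CAPS_WORD = qr/[A-Z][a-z]+/;                      # one capitalised word
my $FULL_RE = qr/(?:$CAPS_WORD)(?:\s*$CAPS_WORD)*/;   # a run of capitalised words
# $string holds the text to search; print the match only if there is one
say $& if $string =~ /$FULL_RE/;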
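For comparison, the same two-step construction can be sketched in Python, where findall returns every run of capitalised words rather than just the first. The names CAPS_WORD and FULL_RE mirror the Perl variables above; the sample sentence is made up for the example.

```python
import re

CAPS_WORD = r"[A-Z][a-z]+"  # one capitalised word
# A capitalised word followed by any further ones, with or without whitespace between.
FULL_RE = re.compile(rf"{CAPS_WORD}(?:\s*{CAPS_WORD})*")

text = "I live in New York near Pacific First Financial Corp."
names = FULL_RE.findall(text)
print(names)
```

A lone "I" is skipped because [a-z]+ requires at least one lowercase letter after the capital.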
The Why
This answer gives an alternative to the already great one given by @Miller. Both will work fine, but this solution is quite a bit faster since it doesn't use a lookahead: in the benchmark below it is faster by a factor of about 7.
$ time ./bench-simple.pl
Running 100000 runs
800000 matches
real 0m2.869s
user 0m2.860s
sys 0m0.008s
$ time ./bench-lookahead.pl
Running 100000 runs
800000 matches
real 0m19.845s
user 0m19.831s
sys 0m0.012s
I've found several questions that touch on this, but none that seem to answer it. I am trying to build a Regex that will allow me to identify Proper Nouns in a group of text.
I am defining a Proper Noun as follows: a word or group of words that begin with a capital letter, are longer than 1 character (to exclude things like I, A, etc.), and are NOT the first word of a new sentence.
So, in the following text
"Susan Dow stayed at the Holiday Inn on Thursday. She met Tom and Shirley Temple at the bar where they ordered Green Eggs and Ham"
I would want the following returned
Holiday Inn
Thursday
Tom
Shirley Temple
Green Eggs
Ham
Right now, [A-Z]{1,1}[a-z]*([\s][A-Z]{1,1}[a-z]*)* is what I have, but it's returning Susan Dow and She in addition to the ones listed above. How can I get my . look-up to work?
You can use:
(?<!^|\. |\.  )[A-Z][a-z]+
per this rubular
Update: Integrated the two negative lookbehinds using alternation. Also added a check for two spaces between sentences. Note that repetition operators cannot be used in negative lookbehinds, per the notes at http://www.regular-expressions.info/lookaround.html
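The lookbehind idea can be tried in Python with one adjustment: Python's re module only accepts fixed-width lookbehinds, so the alternation is split into three separate negative lookbehinds (start of string, period plus one space, period plus two spaces). This is a sketch of the answer's approach, not a full solution: it returns single words, so multi-word names come back in pieces and "Dow" is still reported.

```python
import re

# Three fixed-width negative lookbehinds replace the single alternation,
# because Python rejects lookbehind alternatives of different widths.
PROPER = re.compile(r"(?<!^)(?<!\. )(?<!\.  )[A-Z][a-z]+")

text = ("Susan Dow stayed at the Holiday Inn on Thursday. "
        "She met Tom and Shirley Temple at the bar where they ordered Green Eggs and Ham")
words = PROPER.findall(text)
print(words)
```

"Susan" and "She" are skipped as sentence starters; grouping adjacent words such as "Holiday Inn" would need an additional step.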