I want to capture Alta, Utah, USA from asd Alta, Utah, USA qwe. Basically I'm trying to capture places from a text. It won't be a perfect method, but the places must start with a capital and use a comma, followed by another word with a capital.
So far, I have written:
\s[A-Z][a-z]+[,]?
I want to do multiple words, not just the first word, Alta. This is my attempt to use square brackets inside other square brackets.
[\s[A-Z][a-z]+[,]?]+
But that doesn't work, so it must be syntactically incorrect.
Updated as per OP's comment:
(?:\s*[A-Z][A-Za-z]+[,\s])+
Original Answer:
\b([A-Z][a-zA-Z]+),?
You will get the name of each place in group 1 of each match.
I think this is what you need:
([A-Z][a-zA-Z]+)(,\s*([A-Z][a-zA-Z]+))*
Though the requirement pointed out by @Rizwan (in his comment) still needs to be clarified.
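As a rough illustration in Python (the variable names are mine): a repeated capturing group only keeps its last repetition, so this sketch uses a non-capturing group and reads the whole run from the full match. It also assumes at least one comma is required, as the question describes, so the trailing * becomes +.

```python
import re

# Variation on the pattern above: a capitalised word followed by one or more
# ", Word" repetitions. The non-capturing group and the "+" quantifier are
# assumptions made for this sketch, not part of the original answer.
place = re.compile(r"[A-Z][a-zA-Z]+(?:,\s*[A-Z][a-zA-Z]+)+")

text = "asd Alta, Utah, USA qwe"
match = place.search(text)
result = match.group(0) if match else None
print(result)  # Alta, Utah, USA
```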
Just joining the party:
import re
dirty = "asd Alta, Utah, USA qwe"
p = re.compile("([A-Z][a-zA-Z]+)")
re.findall(p,dirty)
output:
['Alta', 'Utah', 'USA']
Let's say we have the string:
one day, when Anne, Lisa and Paul went to the store, then Anne said to Paul: "I love Lisa!". Then Lisa laughed and kissed Anne.
Is there a way with regex to match the first name, and then match all other occurrences of the same name in the string?
Given the name-matching regex /[A-Z][a-z]+/ (with /g maybe?), can the regex matcher be made to remember the first match, and then use that match EXACTLY for the rest of the string? Other subsequent matches to the name-matching regex should be ignored (except for Anne in the example).
The result would be (if matches are replaced with "Foo"):
one day, when Foo, Lisa and Paul went to the store, then Foo said to Paul: "I love Lisa!". Then Lisa laughed and kissed Foo.
Please ignore the fact that the sentence starts uncapitalized, or add an example that also handles this.
Using a script to get the first match and then using that as input for a second iteration works of course, but that's outside the scope of the question (which is limited to ONE regex expression).
The only way I could think of is with non-fixed-width lookbehinds, for example through PyPI's regex module, and maybe JavaScript too. Either way, assuming a name is captured through [A-Z][a-z]+ as per your question, try:
\b([A-Z][a-z]+)\b(?<=^[^A-Z]*\b\1\b.*)
\b([A-Z][a-z]+)\b - A 1st capture group capturing a name between two word-boundaries;
(?<=^[^A-Z]*\b\1\b.*) - A non-fixed width positive lookbehind to match start of line anchor followed by 0+ characters other than uppercase followed by the content of the 1st capture group and 0+ characters.
Here is an example using PyPI's regex module:
import regex as re
s= 'Anne, Lisa and Paul went to the store, then Anne said to Paul: "I love Lisa!". Then Lisa laughed and kissed Anne.'
s_new = re.sub(r'\b([A-Z][a-z]+)\b(?<=^[^A-Z]*\b\1\b.*)', 'Foo', s)
print(s_new)
Prints:
Foo, Lisa and Paul went to the store, then Foo said to Paul: "I love Lisa!". Then Lisa laughed and kissed Foo.
I am working in Python and have a list of countries that I would like to clean. Most countries are already written the way I want them to be. However, some country names have a one- or two-digit number attached, or have text in brackets appended. Here's a sample of that list:
Argentina
Australia1
Bolivia (Plurinational State of)
China, Hong Kong Special Administrative Region
Côte d'Ivoire
Curaçao
Guinea-Bissau
Indonesia8
The part that I want to capture would look like this:
Argentina
Australia
Bolivia
China, Hong Kong Special Administrative Region
Côte d'Ivoire
Curaçao
Guinea-Bissau
Indonesia
The best solution that I was able to come up with is ^[a-zA-Z\s,ô'ç-]+. However, this leaves country names that are followed by a text in parentheses with a trailing white space.
This means I would like to match the entire country name unless there is a digit or a white space followed by an open bracket, then I would like it to stop before the digit or the (
I know that I could probably solve this in two steps but I am also reasonably sure that it should be possible to define a pattern that can do it in one step. Since I am anyway in the process of getting familiar with regex, I thought this would be a nice thing to know.
The pattern can be written as matching any character except digits, parentheses or whitespace characters. That part by itself can then be optionally repeated, preceded by a space.
^[^\d\s()]+(?: [^\d\s()]+)*
^ Start of string
[^\d\s()]+ Match 1+ times any char except a digit, whitespace char or parenthesis using a negated character class
(?: Non capture group to repeat as a whole part
[^\d\s()]+ Same match as above
)* Close the non capture group and optionally repeat it
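A minimal sketch of applying this pattern in Python to the sample list from the question (the variable names are illustrative only):

```python
import re

# The pattern from the answer: runs of characters that are not digits,
# whitespace or parentheses, joined by single spaces.
name = re.compile(r"^[^\d\s()]+(?: [^\d\s()]+)*")

countries = [
    "Argentina",
    "Australia1",
    "Bolivia (Plurinational State of)",
    "China, Hong Kong Special Administrative Region",
    "Côte d'Ivoire",
    "Curaçao",
    "Guinea-Bissau",
    "Indonesia8",
]

# Every sample starts with a letter, so match() always succeeds here.
cleaned = [name.match(country).group(0) for country in countries]
print(cleaned)
```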
I suggest you simply replace the parts you don't want with empty strings, using the regular expression
\d+$| +\(.*\)
with the multiline flag set, causing ^ and $ to respectively match the beginning and end of a line, rather than the beginning and end of the string.
The expression matches one or more digits at the end of a line or one or more spaces followed by a string that is enclosed in matching parentheses.
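In Python, this replacement could look roughly like the following, assuming the whole list is held in one newline-separated string (the names raw and cleaned are illustrative):

```python
import re

raw = "\n".join([
    "Argentina",
    "Australia1",
    "Bolivia (Plurinational State of)",
    "China, Hong Kong Special Administrative Region",
    "Côte d'Ivoire",
    "Curaçao",
    "Guinea-Bissau",
    "Indonesia8",
])

# re.MULTILINE lets $ match at the end of every line, so trailing digits
# are removed line by line; the second alternative removes " (...)".
cleaned = re.sub(r"\d+$| +\(.*\)", "", raw, flags=re.MULTILINE).splitlines()
print(cleaned)
```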
I think you can try ^([^\d \n]| +[^\d (\n])+ or, if you can guarantee your input doesn't contain double-spaces, the slightly simpler ^([^\d \n]| [^\d(\n])+
(The ^ character inside [] excludes the following characters, see https://regexone.com/lesson/excluding_characters)
Technically, the regex I've given omits trailing spaces, but for your application it doesn't sound like that would be a bad thing.
You can test the regex here https://regex101.com/r/dupn18/1
This should do the trick
In [1]: import re
In [2]: pattern = re.compile(r'(.+(?=\d| \()|.+)')
In [3]: data = """Argentina
...: Australia1
...: Bolivia (Plurinational State of)
...: China, Hong Kong Special Administrative Region
...: Côte d'Ivoire
...: Curaçao
...: Guinea-Bissau
...: Indonesia8""".splitlines()
In [4]: [pattern.search(country).group() for country in data]
Out[4]:
['Argentina',
'Australia',
'Bolivia',
'China, Hong Kong Special Administrative Region',
"Côte d'Ivoire",
'Curaçao',
'Guinea-Bissau',
'Indonesia']
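One caveat, offered tentatively: because .+ is greedy, a two-digit suffix (which the question mentions) leaves the first digit in the match. A lazy quantifier is one possible adjustment. The lazy pattern and the sample "Australia12" below are my own illustration, not part of the answer above.

```python
import re

greedy = re.compile(r"(.+(?=\d| \()|.+)")  # pattern from the answer
lazy = re.compile(r".+?(?=\d| \(|$)")      # hypothetical alternative

sample = "Australia12"  # made-up input with a two-digit suffix

# Greedy backtracks only as far as the last position followed by a digit.
greedy_result = greedy.search(sample).group()
# Lazy stops at the first position followed by a digit, " (" or the end.
lazy_result = lazy.search(sample).group()

print(greedy_result)  # Australia1
print(lazy_result)    # Australia
```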
One of the columns has the data as below and I only need the suburb name, not the state or postcode.
I'm using Alteryx and tried regex (\<\w+\>)\s\<\w+\> but only get a few records to the new column.
Input:
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta NSW 2150
Claymore 2559
CASULA
Output
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta
Claymore
CASULA
This regex matches all letter-words up to but not including an Australian state abbreviation (since the addresses are clearly Australian):
( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+
The negative look ahead includes a word boundary to allow suburbs that start with a state abbreviation (see demo).
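The question is about Alteryx, but the pattern's behaviour can be checked with a short Python sketch (Python is used here only for illustration, and the names are mine):

```python
import re

# Pattern from the answer: letter-only words, stopping before a state
# abbreviation that stands as a whole word.
suburb = re.compile(r"( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+")

rows = [
    "CABRAMATTA",
    "CANLEY HEIGHTS",
    "ST JOHNS PARK",
    "Parramatta NSW 2150",
    "Claymore 2559",
    "CASULA",
]

# group(0) is the full match, i.e. the suburb name without state or postcode.
suburbs = [suburb.match(row).group(0) for row in rows]
print(suburbs)
```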
Expanding on Bohemian's answer, you can use groupings to do a REGEX_Replace in Alteryx. So:
REGEX_Replace([Field1], "(.*)(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)+(\s*\d+)" , "\1")
This will grab anything that matches in the first group (so just the suburb). The second and third groups match the state and the zip. Not a perfect regex, but should get you most of the way there.
I think this workflow will help you:
Sample 1 String:
Aquaman Figure, XL DC Comics
Sample 2 String:
Rocket Raccoon, Mini Marvel
Regex:
/(DC Comics|Marvel)/
Match Sample 1:
DC Comics
Match Sample 2:
Marvel
Works perfectly in Regex101
How do I reverse this?
I want to match Aquaman Figure, XL and Rocket Raccoon, Mini only.
Edit:
/(.+)(?=Marvel)/ seems to do the job. It excludes Marvel from Rocket Raccoon! How do I make this also work with DC Comics?
/(.+)(?=Marvel)/ (or /(.+)(?=DC Comics|Marvel)/ for both) isn't going to work for something like:
John Marvel Bob
For which I presume you want the result to be:
John Bob
You'll only get John in the first match, and you'll get Marvel Bob in the second match (since look-ahead doesn't consume the looked-ahead characters).
Or something that doesn't contain either of the strings (since you require that the next characters match some given characters to get a match).
The easiest solution is probably just replacing the two desired sub-strings with empty strings. Replace:
DC Comics|Marvel
with:
(empty string)
Or you can repeatedly search for:
/(.*?)(DC Comics|Marvel|$)/
And just extract the first group (which will correspond to what matches .*?, which is everything starting from the end of the last match up to just before "DC Comics", "Marvel" or the end of the string).
The reluctant quantifier ? is needed to prevent the .* from matching John Marvel Bob, rather than just John, in John Marvel Bob Marvel.
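Both suggestions can be sketched in Python as follows. The helper name parts, the .strip() calls and the filtering of empty pieces are my additions for tidiness, not part of the answer.

```python
import re

samples = ["Aquaman Figure, XL DC Comics", "Rocket Raccoon, Mini Marvel"]

# Option 1: replace the brand names with an empty string, then trim.
stripped = [re.sub(r"DC Comics|Marvel", "", s).strip() for s in samples]
print(stripped)  # ['Aquaman Figure, XL', 'Rocket Raccoon, Mini']


def parts(text):
    """Option 2: repeatedly take the lazy first group up to a brand or the end."""
    pieces = (m.group(1).strip()
              for m in re.finditer(r"(.*?)(DC Comics|Marvel|$)", text))
    # Drop the empty pieces produced at the end of the string.
    return [piece for piece in pieces if piece]


print(parts("John Marvel Bob Marvel"))  # ['John', 'Bob']
```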
re.findall(r"(.*)(?=DC Comics|Marvel)", input)
This does what you are looking for: the first element of the result is the text before the brand name. It's in Python; input will be your string.
So, I've built a regex which follows this:
4!a2!a2!c[3!c]
which is translated to
4 alpha characters followed by
2 alpha characters followed by
2 characters followed by
3 optional characters
This is the standard format for a SWIFT BIC code, e.g. HSBCGB2LXXX
my regex to pull this out of string is:
(?<=:32[^:]:)(([a-zA-Z]{4}[a-zA-Z]{2})[0-9][a-zA-Z]{1}[X]{3})
Now this targets a specific tag (32) and works; however, I'm not sure if it's the cleanest, and it fails if there are any characters before the H.
The string being matched against is:
:32B:HsBfGB4LXXXHELLO
This returns HsBfGB4LXXX, but this:
:32B:2HsBfGB4LXXXHELLO
returns nothing.
EDIT
For clarity: I have a string which contains multiple lines, each starting with a colon, a two-digit number, an optional letter and a colon (e.g. :58A:). I want to specify a line to start matching in and return a BIC from anywhere in that line.
EDIT
Some more example data to help:
:20:ABCDERF Z
:23B:CRED
:32A:140310AUD2120,
:33B:AUD2120,
:50K:/111222333
Mr Bank of Dad
Dads house
England
:52D:/DBEL02010987654321
address 1
address 2
:53B:/HSBCGB2LXXX
:57A://AU124040
AREFERENCE
:59:/44556677
A line which HSBCGB2LXXX contains a BIC
:70:Another line of data
:71A:Even more
OK, so I need to pass in the tag (53 or 59) as a variable and return only the BIC HSBCGB2LXXX!
Your regex can be simplified, and corrected to allow a character before the H, to:
:32[^:]:.?([a-zA-Z]{6}\d[a-zA-Z]XXX)
The changes made were:
Dropped the lookbehind - just make it part of the match
Inserted .? meaning "optional character"
([a-zA-Z]{4}[a-zA-Z]{2}) ==> [a-zA-Z]{6} (4+2=6)
[0-9] ==> \d (\d means "any digit")
[X]{3} ==> XXX (just easier to read and fewer characters)
Group 1 of the match contains your target
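A quick check of the simplified pattern in Python, using the two strings from the question (the variable names are illustrative):

```python
import re

# Simplified pattern from the answer; group 1 holds the code itself.
bic = re.compile(r":32[^:]:.?([a-zA-Z]{6}\d[a-zA-Z]XXX)")

samples = [":32B:HsBfGB4LXXXHELLO", ":32B:2HsBfGB4LXXXHELLO"]

# .? absorbs the stray leading "2" in the second sample.
results = [bic.search(s).group(1) for s in samples]
print(results)  # ['HsBfGB4LXXX', 'HsBfGB4LXXX']
```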
I'm not quite sure if I understand your question completely, as your regular expression does not completely match what you have described above it. For example, you mentioned 3 optional characters, but in the regexp you use 3 mandatory X-es.
However, the actual regular expression can be further cleaned:
instead of [a-zA-Z]{4}[a-zA-Z]{2}, you can simply use [a-zA-Z]{6}, and the grouping parentheses around this might be unnecessary;
the {1} can be left out without any change in the result;
the X does not need surrounding brackets.
All in all
(?<=:32[^:]:)([a-zA-Z]{6}[0-9][a-zA-Z]X{3})
is shorter and matches in the very same cases.
If you give a better description of the domain, further improvements are probably possible.
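For the requirement added in the question's edits (pass in a tag such as 53 or 59 and return the BIC from anywhere in that field), one possible two-step sketch in Python follows. The function name find_bic and the BIC pattern are assumptions of mine: the pattern reads 4!a2!a2!c[3!c] as six uppercase letters, two uppercase letters or digits, and an optional three-character suffix, and it treats every line up to the next one starting with a colon as part of the field.

```python
import re

# Assumed reading of 4!a2!a2!c[3!c]; a heuristic, not a full BIC validator.
BIC = re.compile(r"\b[A-Z]{6}[A-Z0-9]{2}(?:[A-Z0-9]{3})?\b")


def find_bic(message, tag):
    """Return the first BIC-like token in the field for `tag`, or None."""
    # Capture the tag's line plus any following lines not starting with ":".
    field = re.search(
        rf"^:{re.escape(tag)}[A-Z]?:(.*(?:\n(?!:).*)*)",
        message,
        flags=re.MULTILINE,
    )
    if field is None:
        return None
    bic = BIC.search(field.group(1))
    return bic.group(0) if bic else None


# Shortened copy of the example data from the question.
message = """:53B:/HSBCGB2LXXX
:57A://AU124040
AREFERENCE
:59:/44556677
A line which HSBCGB2LXXX contains a BIC
:70:Another line of data"""

print(find_bic(message, "53"))  # HSBCGB2LXXX
print(find_bic(message, "59"))  # HSBCGB2LXXX
```

Splitting the work into "find the field" and "find the code" keeps each pattern short, at the cost of not being a single expression.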