Regex extract multiple values - regex

I have a string that looks like this: "60% Bob Peterson, 35% Jake Peter Sullivan, 5% Maria Teresa".
I want to write a regex to grab the first word after %:
Desired output: "Bob, Jake, Maria"
So far I came up with this: %\W*(\w+)
But it is only grabbing the first instance. I need to grab all instances and print them separated with a comma.

You can use Python inside Tableau using TabPy:
import re
string = "60% Bob Peterson, 35% Jake Peter Sullivan, 5% Maria Teresa"
result = re.findall(r'\d+%\s(\w+)', string)
print(', '.join(result))
In Tableau itself, you can't loop the regex check over every match, hence the use of Python.


Regular Expression String Search Returns Group Not Match

I'm a bit stuck with this regex, they make my head hurt lol
I'm trying to find a person's name between two known strings.
The beginning and end of the string are known but will have a few variations.
Here's the regex I'm working on. I need it to return only the first value (the name), but it's also returning a group.
So I know I'm on the right track.
(?<=Sales Call - [AB]: ).*?(?= and (Steve Test$|Randy Robbins$|Peter Pan$))
Here are some possible values I'm testing against:
String: Sales Call - A: Sally Warren and Steve Test
Result: Sally Warren
Sales Call - B: Ted Wilson and Randy Robbins
Return: Ted Wilson
Sales Call - A: Alicia Alton and Peter Pan
Return: Alicia Alton
So for clarification, I just need the middle string (the person's name) and nothing else but think my regex needs to be slightly tweaked.
Thanks!
Steve
The reason you are capturing something is because of the alternation at the end of the pattern:
(Steve Test$|Randy Robbins$|Peter Pan$)
You could make it a non capturing group:
(?:Steve Test$|Randy Robbins$|Peter Pan$)
However, I suggest using the following pattern instead:
(?<=Sales Call - [AB]: ).*?(?=\s+and \w+(?: \w+)*)
Here is a working demo.
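To make the difference between the capturing and non-capturing versions concrete, here is a minimal Python sketch. The question does not say which regex engine is in use, so Python's `re` module is an assumption, and the names `CAPTURING`, `NON_CAPTURING`, and `extract_name` are made up for illustration.

```python
import re

# Original pattern: the alternation is a capturing group.
CAPTURING = r"(?<=Sales Call - [AB]: ).*?(?= and (Steve Test$|Randy Robbins$|Peter Pan$))"
# Same pattern with the alternation made non-capturing via (?:...).
NON_CAPTURING = r"(?<=Sales Call - [AB]: ).*?(?= and (?:Steve Test$|Randy Robbins$|Peter Pan$))"

samples = [
    "Sales Call - A: Sally Warren and Steve Test",
    "Sales Call - B: Ted Wilson and Randy Robbins",
    "Sales Call - A: Alicia Alton and Peter Pan",
]

def extract_name(text):
    """Return the name between the known prefix and suffix, or None."""
    match = re.search(NON_CAPTURING, text)
    return match.group(0) if match else None

# With a capturing group present, re.findall reports the group, not the match.
print(re.findall(CAPTURING, samples[0]))
# Without it, only the overall match (the name) comes back.
print([extract_name(s) for s in samples])
```

In both cases the overall match (group 0) is the name; the capturing group only adds a second value that some tools report instead of, or alongside, the match.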

Extract data from dataset

I need to extract the title from the name but cannot understand how this is working. I have provided the code below:
combine = [traindata, testdata]
for dataset in combine:
    dataset["title"] = dataset["Name"].str.extract(r' ([A-Za-z]+)\.', expand=False)
There is no error, but I need to understand how the above code works.
Name
Braund, Mr. Owen Harris
Cumings, Mrs. John Bradley (Florence Briggs Thayer)
Heikkinen, Miss. Laina
Futrelle, Mrs. Jacques Heath (Lily May Peel)
Allen, Mr. William Henry
Moran, Mr. James
Above is the Name feature from the csv file, and dataset["title"] stores the title of each name, that is Mr, Miss, Master, etc.
Your code extracts the title from the name using the pandas.Series.str.extract function, which uses regex.
pandas.Series.str.extract - Extract capture groups in the regex pat as columns in a DataFrame.
' ([A-Za-z]+)\.' is the regex pattern in your code. In each Name string it matches a space, then one or more letters, then a literal . and the letters inside the parentheses are the capture group that gets returned.
[A-Za-z] - this part of the pattern matches any letter in the ranges a-z and A-Z
+ - states that there must be one or more such characters
\. - matches the literal . that follows the letters
An example is provided at the link above, where it extracts parts of a string and puts them in separate columns.
I found this specific response with the link very helpful for learning how to use the str.extract method and how changing expand from True to False switches the result between columns and a Series.
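As a rough illustration of what the pattern does to each name, here is a sketch using only the standard `re` module rather than pandas. It assumes that taking the first capture group of the first match per string approximates what `str.extract` does row by row for this pattern; the helper name `extract_title` is hypothetical.

```python
import re

# Same pattern as in the question: space, one or more letters (captured), literal dot.
TITLE_PATTERN = re.compile(r" ([A-Za-z]+)\.")

names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Allen, Mr. William Henry",
    "Moran, Mr. James",
]

def extract_title(name):
    """Return the first capture group of the first match, or None if no match."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

titles = [extract_title(n) for n in names]
print(titles)
```

The space and the dot are part of the match but sit outside the parentheses, so only the letters between them end up in the result.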

Multiple filter regex

Sample Data:
ID Name User
12 Test Same
14 Xyz Joe
15 Abc John
16 Def Bill
17 Ghi Donald
If a user searches for Abc or Joe, he should get those rows.
Regex:
'Abc|Joe'
Output:
14 Xyz Joe
15 Abc John
Now, if the user further searches for e, it should filter based on the previous output(2 rows retrieved), so I will just get 14 Xyz Joe . Is this possible using regex?
I am trying to have all this in one regex.
`'Abc|Joe and the second filter goes here (All in one regex)'`
Use case: The user selects checkboxes to set the filters he wants to apply on the data (All the data in the columns Name and User are available). He may then search again on the filtered result using a search textbox.
((firstRegex)(?:.*(secondRegex)))|((secondRegex)(?:.*(firstRegex)))
((Abc|Xyz)(?:.*(Jo)))|((Jo)(?:.*(Abc|Xyz)))
See Demo
We don't know which regex will match first, so there are two cases, and we combine them with |. If you have more searches than that, I suggest you write some code.
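If writing some code is an option, a minimal Python sketch of the idea could look like the following. It assumes each row is available as a single string. The function names `apply_filters` and `combine_patterns` are hypothetical, and the lookahead-based combination is a different technique from the alternation shown above, included only as one possible way to get a single regex.

```python
import re

rows = ["12 Test Same", "14 Xyz Joe", "15 Abc John", "16 Def Bill", "17 Ghi Donald"]

def apply_filters(rows, patterns):
    """Keep only rows that match every pattern (order of filters does not matter)."""
    compiled = [re.compile(p) for p in patterns]
    return [row for row in rows if all(c.search(row) for c in compiled)]

def combine_patterns(patterns):
    """Build one regex that requires all patterns, using one lookahead per filter."""
    return "^" + "".join(f"(?=.*(?:{p}))" for p in patterns)

first_pass = apply_filters(rows, ["Abc|Joe"])
second_pass = apply_filters(rows, ["Abc|Joe", "e"])
single_regex = combine_patterns(["Abc|Joe", "e"])
single_pass = [row for row in rows if re.search(single_regex, row)]

print(first_pass)
print(second_pass)
print(single_regex, single_pass)
```

Keeping the filters as a list avoids having to enumerate every ordering by hand, which is what makes the alternation approach grow quickly with more searches.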
For the 2 filters:
/^\d+\s+(?:Abc|Xyz|Def)\s+\S*(?:Jo|ill).*/mg;
If the user doesn't specify the second filter, you could just leave it empty as (?:).
I'm positive you could create these kinds of expressions if you spent a couple of minutes reading about regex syntax, so allow me to recommend:
Regular Expressions Tutorial (regular-expressions.info). A quite comprehensive tutorial to learn regex.
regex101.com. Allows you to test different expressions and understand the way a pattern matches the subject string.

Regex parse with alteryx

One of the columns has the data as below and I only need the suburb name, not the state or postcode.
I'm using Alteryx and tried regex (\<\w+\>)\s\<\w+\> but only get a few records to the new column.
Input:
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta NSW 2150
Claymore 2559
CASULA
Output
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta
Claymore
CASULA
This regex matches all letter-words up to but not including an Australian state abbreviation (since the addresses are clearly Australian):
( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+
See demo
The negative look ahead includes a word boundary to allow suburbs that start with a state abbreviation (see demo).
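To check the pattern against the sample input outside Alteryx, here is a small Python sketch. Python's `re` module is an assumption here (Alteryx uses its own regex engine), and `suburb_only` plus the extra "WATTLE GROVE" row are made up for illustration.

```python
import re

# Pattern from the answer: repeat "optional space + letter-word", where the
# word must not be a whole state abbreviation.
SUBURB = re.compile(r"( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+")

def suburb_only(text):
    """Return the leading run of non-state words, or the text unchanged if none."""
    match = SUBURB.match(text)
    return match.group(0) if match else text

inputs = [
    "CABRAMATTA",
    "CANLEY HEIGHTS",
    "ST JOHNS PARK",
    "Parramatta NSW 2150",
    "Claymore 2559",
    "CASULA",
    "WATTLE GROVE NSW 2173",  # hypothetical row: suburb starting with "WA"
]

results = [suburb_only(s) for s in inputs]
print(results)
```

The last row shows the effect of the word boundary inside the lookahead: "WATTLE" starts with "WA" but is not rejected, because "WA" is not followed by a word boundary there.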
Expanding on Bohemian's answer, you can use groupings to do a REGEX_Replace in Alteryx. So:
REGEX_Replace([Field1], "(.*)(\VIC|NSW|QLD|TAS|SA|WA|ACT|NT)+(\s*\d+)" , "\1")
This will grab anything that matches in the first group (so just the suburb). The second and third groups match the state and the zip. Not a perfect regex, but should get you most of the way there.

Extract a portion of text using RegEx

I would like to extract a portion of a text using a regular expression. So for example, I have an address and want to return just the number and streets and exclude the rest:
2222 Main at King Edward Vancouver BC CA
But the addresses vary in format most of the time. I tried using a lookahead regex and came up with this expression:
.*?(?=\w* \w* \w{2}$)
The above expression handles the above example nicely, but it gets way too messy as soon as commas come into the text, or postal codes, which can be a 6 character string or two 3 character strings with a space in the middle, etc...
Is there any more elegant way of extracting a portion of text other than a lookahead regex?
Any suggestion or a point in another direction is greatly appreciated.
Thanks!
Regular expressions are for data that is REGULAR, that follows a pattern. So if your data is completely random, no, there's no elegant way to do this with regex.
On the other hand, if you know what values you want, you can probably write a few simple regexes, and then just test them all on each string.
Ex.
regex1 = address number grabber, regex2 = street type grabber, regex3 = name grabber.
Attempt a match on string1 with regex1, regex2, and finally regex3. Move on to the next string.
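A rough Python sketch of that "several small regexes" idea follows. The two patterns and the list of street types are purely hypothetical placeholders, not something the answer specifies, and `grab_parts` is a made-up helper name.

```python
import re

# Hypothetical "grabbers"; real data would need patterns tuned to it.
GRABBERS = {
    "number": re.compile(r"^\d+"),
    "street_type": re.compile(r"\b(?:Street|St|Avenue|Ave|Road|Rd)\b", re.IGNORECASE),
}

def grab_parts(address):
    """Try each small regex on the string and record what it found (or None)."""
    parts = {}
    for name, pattern in GRABBERS.items():
        match = pattern.search(address)
        parts[name] = match.group(0) if match else None
    return parts

print(grab_parts("555 road and street place CA US 95000"))
print(grab_parts("2222 Main at King Edward Vancouver BC CA"))
```

Each pattern stays simple and can fail independently, so one odd address does not break the extraction of the other fields.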
Well, I thought I'd throw my hat into the ring:
.*(?=,? ([a-zA-Z]+,?\s){3}([\d-]*\s)?)
You might want ^ or \d+ at the front for good measure.
I didn't bother specifying lengths for the postal codes... just any number of digits and hyphens in this one.
It works for these inputs so far, and for variations on commas within the city/state/country area:
2222 Main at King Edward Vancouver, BC, CA, 333-333
555 road and street place CA US 95000
2222 Main at King Edward Vancouver BC CA 333
555 road and street place CA US
It is counting on there being three words at the end for the city, state and country, but other than that it's like ryansstack said: if it's random it won't work. If the city is two words like New York it won't work. Yeah... regex isn't the tool for this one.
btw: tested on regexhero.net
I can think of 2 ways you can do this:
1) If you know that "the rest" of your data after the address is exactly 2 fields, i.e. BC and CA, you can split your string using a space as the delimiter and remove the last 2 items.
2) Split on the delimiter /[A-Z][A-Z]/ and store the result in an array, then print out the array (this assumes the address itself doesn't contain 2 or more consecutive capital letters).
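Both ideas can be sketched in a few lines of Python. The answer does not name a language, so Python is an assumption, and the function names `drop_trailing_fields` and `split_on_capitals` are hypothetical.

```python
import re

address = "2222 Main at King Edward Vancouver BC CA"

def drop_trailing_fields(text, count=2):
    """Way 1: split on whitespace and discard the last `count` items."""
    fields = text.split()
    return " ".join(fields[:-count]) if count else " ".join(fields)

def split_on_capitals(text):
    """Way 2: split wherever two consecutive capital letters occur."""
    return re.split(r"[A-Z][A-Z]", text)

print(drop_trailing_fields(address))
print(split_on_capitals(address))
# The first piece of the split is the address part, plus a trailing space.
print(split_on_capitals(address)[0].strip())
```

Note that both approaches leave the city in the result for this example; dropping three trailing fields instead of two would remove it, provided the city is always a single word.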