RegEx for extracting multiple words in a passage using Tableau - regex

I have a passage and I need to extract a couple of words from it in Tableau. The passage is given below:
This looks like a suspicious account. Please look at the details
below. Name: John Mathew Email:john.mathew#abc.com Phone:+1
111-111-1111 Department: abc
For more enquiries contact: ----
Name, email, phone and the department are in the same line separated by blank spaces. I used the below regex and it works well for the department alone:
regexp_extract([CASE DESCRIPTION],'Department : (.+)')
When I apply the same pattern to the name, I get:
Name: John Mathew Email:john.mathew#abc.com Phone:+1 111-111-1111
Department: abc
instead of just the name. The same happens with email.
How do I solve this problem?

It looks to me like the issue is that your regex just has '(.+)' as its capture group, which basically means "everything" (after the specified string). Since the fields are all on one line, everything after "name" includes the email, phone, and department. (The regex works with department because it's the last thing on the line.)
So, to make it work right, you need to give your regex something other than the end of the line to stop on. To capture just the name, you need to stop before the Email tag, and so on down the list. Your sample has no space after "Email:" and "Phone:", so it is safer to allow optional whitespace after each colon with \s* and to make the capture lazy with (.+?). Something like
Name = regexp_extract([CASE_DESCRIPTION],'Name:\s*(.+?)\s+Email:')
email = regexp_extract([CASE_DESCRIPTION],'Email:\s*(.+?)\s+Phone:')
phone = regexp_extract([CASE_DESCRIPTION],'Phone:\s*(.+?)\s+Department:')
department = regexp_extract([CASE_DESCRIPTION],'Department:\s*(.+)')
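The same idea can be checked outside Tableau. Below is a minimal Python sketch, assuming the passage sits on one line as described in the question; the names `PASSAGE`, `PATTERNS` and `extract_fields` are made up for illustration and are not part of Tableau.

```python
import re

# Sample text from the question, joined onto one line (an assumption).
PASSAGE = (
    "This looks like a suspicious account. Please look at the details below. "
    "Name: John Mathew Email:john.mathew#abc.com Phone:+1 111-111-1111 Department: abc"
)

# Each capture stops at the next field label instead of the end of the line.
PATTERNS = {
    "name": r"Name:\s*(.+?)\s+Email:",
    "email": r"Email:\s*(.+?)\s+Phone:",
    "phone": r"Phone:\s*(.+?)\s+Department:",
    "department": r"Department:\s*(.+)",
}

def extract_fields(passage):
    """Return a dict of field name -> captured text (None when not found)."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = re.search(pattern, passage)
        result[field] = match.group(1) if match else None
    return result

fields = extract_fields(PASSAGE)
print(fields)
```

Only the last field can safely run to the end of the line; every other field needs the following label as its stop marker.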

Related

How to match an optional group after a compulsory one that could include it

I have a string input I need to parse, that has 2 different possible formats. It may look like either of the following:
2900 Sétubal (Portugal)
2900 Sétubal
I need a regex that will correctly split out the postal code, city, and country (if provided) in both formats.
This is the regex I've come up with so far.
(?P<postal_code>\d*) (?P<city>.*)( \((?P<country>.*)\))?
The problem is that, because regexes are matched from left to right and .* is greedy, the city group swallows the country part of the string if it is provided, and I end up with something like:
postal_code = 2900
city = Sétubal (Portugal)
The output is right when I make the country group compulsory:
(?P<postal_code>\d*) (?P<city>.*)( \((?P<country>.*)\))
postal_code = 2900
city = Sétubal
country = Portugal
However, this regex does NOT match the 2nd possible format:
2900 Sétubal
I have tried using lookarounds, but I haven't succeeded. Any piece of advice will most definitely be welcome.
The following regex extracts your data:
(\d+)\s+([^()]*[^()\s])\s*(\(([^()]+)\))?
Test here.
Based on your regex:
(?P<postal_code>\d+) +(?P<city>[^()]+)(?> +|$)(\((?P<country>[^()]+)\))?
Test here.
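For comparison, here is a small Python sketch of the same approach. It is a variation rather than the exact pattern above: it replaces the atomic group (which Python's `re` module only accepts from version 3.11) with a lazy city group and an end anchor. The names `ADDRESS` and `parse_address` are hypothetical.

```python
import re

# Lazy city group plus an optional, non-capturing " (country)" suffix.
# The trailing $ forces the lazy group to grow until the whole line fits.
ADDRESS = re.compile(
    r"(?P<postal_code>\d+) +(?P<city>[^()]+?)(?: +\((?P<country>[^()]+)\))?$"
)

def parse_address(line):
    """Return the named groups as a dict, or None if the line does not match."""
    match = ADDRESS.match(line)
    return match.groupdict() if match else None

with_country = parse_address("2900 Sétubal (Portugal)")
without_country = parse_address("2900 Sétubal")
print(with_country)
print(without_country)
```

Excluding parentheses from the city group is what keeps it from consuming the country, and the optional group simply stays unset when no country is given.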

Regex for extracting text from .eml file

I need to write a regex to extract the following data from an email. The data to be parsed is first name, last name, phone number, email ID, PIN code, message, etc. I am a newbie and don't know much about regex; can anyone help me with it?
Contact Us
Title
Mr.
Last Name
S
First Name
Nitesh
Contact Us
By phone on:
0344 892 8979
E-Mail Address
niteshdsingh#gmail.com<mailto:niteshdsingh#gmail.com>
Phone Number
123456789
Postcode
421202
City
test
Message
test
Best Regards,
I don't think this regex can be regarded as a generic email parser... rather it will only work for the format that you have provided:
Last\s+Name(?:\n)+((?: *\w+)+)|First\s+Name(?:\n)+((?: *\w+)+)|By phone on:((?: *\d+)+)|(?:E-Mail\s+Address(?:\n)+((?:(?: *\w+)+)#[^\.]+\.[^<]+))|(?:Phone Number(?:\n)+((?: *\w+)+))|(?:Postcode(?:\n)+((?: *\w+)+))|(?:Message(?:\n)+((?: *\w+)+))
Regex 101 Demo
Your desired data ends up in the following groups:
Group 1. Last Name
Group 2. First Name
Group 3. By phone on
Group 4. email
Group 5. Phone Number
Group 6. Postcode
Group 7. Message
UPDATED AS PER THE OP COMMENT:
(?:E-Mail\s+Address(?:\n)+((?:(?: *\w+)+)#[^\.]+\.[^<]+))|(?:Phone Number(?:\n)+((?: *\w+)+))|(?:Postcode(?:\n)+((?: *\w+)+))|(?:Message(?:\n)+((?: *\w+)+))|(?:City(?:\n)+((?: *\w+)+))
Demo Two
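If the label always sits on its own line with the value on the next line, a single label/value pattern may be easier to maintain than one alternative per field. The Python sketch below is a different technique from the regexes above, shown only as an illustration. It assumes the layout from the question, and `SAMPLE`, `FIELD` and `parse_fields` are made-up names.

```python
import re

# Sample body copied from the question as displayed there.
SAMPLE = "\n".join([
    "Contact Us", "Title", "Mr.", "Last Name", "S", "First Name", "Nitesh",
    "Contact Us", "By phone on:", "0344 892 8979", "E-Mail Address",
    "niteshdsingh#gmail.com<mailto:niteshdsingh#gmail.com>",
    "Phone Number", "123456789", "Postcode", "421202",
    "City", "test", "Message", "test", "Best Regards,",
])

# Group 1 is a known label at the start of a line; group 2 is the text on the
# next non-empty line, cut off at "<" so the mailto part is dropped.
FIELD = re.compile(
    r"^(Title|Last Name|First Name|E-Mail Address|Phone Number|Postcode|City|Message)"
    r"\n+([^\n<]+)",
    re.MULTILINE,
)

def parse_fields(text):
    """Map each recognised label to the value found on the following line."""
    return {label: value.strip() for label, value in FIELD.findall(text)}

fields = parse_fields(SAMPLE)
print(fields)
```

Like the answer above, this only works for the exact format shown and is not a general email parser.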

Extract email and name with regex

What would be the regular expressions to extract the name and email from strings like these?
johndoe#example.com
John <johndoe#example.com>
John Doe <johndoe#example.com>
"John Doe" <johndoe#example.com>
It can be assumed that the email is valid. The name will be separated from the email by a single space, and might be quoted.
The expected results are:
johndoe#example.com
Name: nil
Email: johndoe#example.com
John <johndoe#example.com>
Name: John
Email: johndoe#example.com
John Doe <johndoe#example.com>
Name: John Doe
Email: johndoe#example.com
"John Doe" <johndoe#example.com>
Name: John Doe
Email: johndoe#example.com
This is my progress so far:
(("?(.*)"?)\s)?(<?(.*#.*)>?)
(which can be tested here: http://regexr.com/?337i5)
The following regex appears to work on all inputs and uses only two capturing groups:
(?:"?([^"]*)"?\s)?(?:<?(.+#[^>]+)>?)
http://regex101.com/r/dR8hL3
Thanks to @RohitJain and @burning_LEGION for introducing the idea of non-capturing groups and character exclusion respectively.
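As a quick check, the two-group pattern can be run against the four sample strings. This is a minimal Python sketch: the pattern is copied as shown above (including the separator character exactly as it appears in the samples), while `NAME_EMAIL` and `split_name_email` are hypothetical names.

```python
import re

# Group 1: optional name (quotes excluded); group 2: the email address.
NAME_EMAIL = re.compile(r'(?:"?([^"]*)"?\s)?(?:<?(.+#[^>]+)>?)')

def split_name_email(value):
    """Return (name, email); name is None when the string has no name part."""
    match = NAME_EMAIL.match(value)
    if match is None:
        return None
    return match.group(1), match.group(2)

samples = [
    'johndoe#example.com',
    'John <johndoe#example.com>',
    'John Doe <johndoe#example.com>',
    '"John Doe" <johndoe#example.com>',
]
results = [split_name_email(s) for s in samples]
for sample, result in zip(samples, results):
    print(sample, "->", result)
```

In Python the missing name comes back as `None`, which corresponds to the `nil` in the expected results.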
Use this regex: "?([^"<]*?)"?\s*<?([^\s<>]+#[^\s<>]+)>?
group 1 contains name
group 2 contains email
(([^<>()\[\]\\.,;:\s#"]+(\.[^<>()\[\]\\.,;:\s#"]+)*)|(".+"))#((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))
https://regex101.com/r/pVV5TI/1
You can try this (the same idea as yours, but improved). You need to check the returned groups after matching, because the email is returned in either group 2 or group 3, depending on whether a name is given.
(?:"?([^"]*)"?\s)?<(.*#.*)>|(.*#.*)
This way you get the email with or without a name, and the quotes are stripped from the name.
\"*?(([\p{L}0-9-_ ]+)\"?)*?\b\ *<?([a-z0-9-_\.]+#[a-z0-9-_\.]+\.[a-z]+)>?
Although @hpique has a good answer, that solution only works when the name/email string is the only thing being analyzed by the regex. It will not work when you have a longer message that contains other items, such as an email. Also, many of the other solutions fail to match when the person has included a middle name (e.g. James Herbert Bond <jbond#example.com>).
Here is a more robust Regex solution I wrote that can pick up the first names, last names, and emails like you wanted, even if there are many other things in the string:
/(?:"?)(\b[A-Z][a-z]+\b ?)(\b[A-Z][a-z]+\b ?)*(?:"?) ?<([a-zA-Z0-9._-]+#[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)>|([a-zA-Z0-9._-]+#[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)/g
Check out the above syntax here: Example on Regexr

How to use regex to extract an address with optional street2

I need to extract name, street1, street2, city, state, zip
I have data in this form
JOHN m SMITH [1111 WEST OAK ROAD, SUITE 101, CITY, ST 55555]
GEORGE m JONES [222 MAIN STREET, CITY, ST 55555]
My results for JOHN should be
name="JOHN m SMITH"
street1="1111 WEST OAK ROAD"
street2="SUITE 101"
city = "CITY"
state = "ST"
zip = "55555"
This works with GEORGE's data
Regex r = new Regex(@"^(?<name>.*)\[(?<street>.*)[,]\s(?<city>.*)[,]\s(?<state>.*)\s(?<zip>\d{5})\]$");
var match = r.Match(fullNameAndAddress);
name = match.Groups["name"].Value;
street = match.Groups["street"].Value;
city = match.Groups["city"].Value;
state = match.Groups["state"].Value;
zip = match.Groups["zip"].Value;
How do I add the optional street2?
I want 1 and only 1 "street" group. I thought it should have this: (....){1}?
street2 is optional zero or 1 times. I thought it should have this (...)?
but it doesn't work with JOHN's data, both street1 & street2 are going into the street group:
^(?<name>.*)\[((?<street>.*)[,]\s){1}?((?<street2>.*)[,]\s)?(?<city>.*)[,]\s(?<state>.*)\s(?<zip>\d{5})\]$
Could you clarify what you want stored in street?
Do you want John's to look like '1111 WEST OAK ROAD, SUITE 101'?
Or do you want to stuff it into some variable you won't be using, so that street looks like '1111 WEST OAK ROAD'?
Edit: With clarification, check out this link
http://rubular.com/r/S4HaTMVFZl
What happens here, I believe, is that the * is greedy, grabbing as much as it can before finding the final occurrence of [,]\s
Adding a ? after the .* makes it lazy, grabbing the least information possible.
The amended regex looks like this
^(?<name>.*)\[((?<street>.*?)[,]\s)((?<street2>.*)[,]\s)?(?<city>.*)[,]\s(?<state>.{2})\s(?<zip>\d{5})\]$
You'll notice I changed the Regex for state from .* to .{2}, forcing a 2-character state. Feel free to revert that if you don't want it :)
I made a couple of changes to your regex in rubular.com, and it seemed to be working on both the example strings:
^(?<name>.+)\s\[(?<street>[^,]+),\s((?<street2>[^,]+),\s+)?(?<city>[^,]+),\s(?<state>.+)\s(?<zip>\d{5})\]$
street2 = match.Groups["street2"].Value;
One trick I've learned with regexes is to use the negation of the divider (e.g. [^,]* for anything but a comma) instead of .*, so it's impossible to capture multiple fields with one expression. Also, the + operator, which requires at least one match, is useful in most of the groups.
Also, the additional comma is only there if there's a street2 component of the address, which indicates that the comma should be in the same group as the street2 part. I added an extra capture group around the street2 capture group to account for this. You can make groups non-capturing in most languages, but it didn't seem necessary.
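The negated-divider trick is not specific to .NET. Here is a minimal Python sketch of the same pattern, assuming Python's `(?P<name>...)` syntax for named groups and a non-capturing wrapper around street2. `ADDRESS` and `parse_address` are names invented for this example.

```python
import re

# [^,]+ can never cross a comma, so each field stays inside its own slot.
# The optional wrapper holds street2 together with its trailing comma.
ADDRESS = re.compile(
    r"^(?P<name>.+)\s\[(?P<street>[^,]+),\s"
    r"(?:(?P<street2>[^,]+),\s+)?"
    r"(?P<city>[^,]+),\s(?P<state>.+)\s(?P<zip>\d{5})\]$"
)

def parse_address(line):
    """Return the named groups as a dict, or None if the line does not match."""
    match = ADDRESS.match(line)
    return match.groupdict() if match else None

john = parse_address("JOHN m SMITH [1111 WEST OAK ROAD, SUITE 101, CITY, ST 55555]")
george = parse_address("GEORGE m JONES [222 MAIN STREET, CITY, ST 55555]")
print(john)
print(george)
```

For GEORGE the optional group is first tried and then abandoned, because the remaining text has too few commas, so street2 is left unset.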

Avoid literal repetition

Suppose I have this string:
Address XXXXX city XXXXX
And this regEX:
Address (.*?) city (.*?)
What will happen if the Address is "The city of London" ?
It depends on whether your regex engine is in greedy mode or not.
If it's in greedy mode, it will work as expected, since it will look for the longest match.
Whether your particular regex engine runs in greedy mode by default, or whether it even has a greedy mode, is not something we can tell you based on the information provided in the question.
If you're using .NET, this page has a description on greedy versus lazy matching.
Basically, given the string XYZZY, the regex X.*Y will match XYZZY (greedy) while X.*?Y will match XY (lazy).
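The difference is easy to see in a short experiment. The sketch below uses Python, where quantifiers are greedy by default and a trailing `?` makes an individual quantifier lazy. The sample sentence with "London" as the city value is an assumption added for illustration.

```python
import re

# Greedy: .* takes as much as possible, so the match ends at the last Y.
greedy = re.search(r"X.*Y", "XYZZY").group()
# Lazy: .*? takes as little as possible, so the match ends at the first Y.
lazy = re.search(r"X.*?Y", "XYZZY").group()

# Hypothetical input where the address itself contains the word "city".
text = "Address The city of London city London"
lazy_groups = re.search(r"Address (.*?) city (.*?)", text).groups()
greedy_groups = re.search(r"Address (.*) city (.*)", text).groups()

print(greedy, lazy)
print(lazy_groups)
print(greedy_groups)
```

With the lazy pattern the first group stops at the first " city " and the final group matches nothing at all, which is why ambiguous delimiters cause trouble either way.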
What you need is a way to ensure you can differentiate between the delimiters and the elements of your string, otherwise you'll be in trouble no matter what, such as with:
Address The city baths city Manchester city, England
Perhaps you could look into something like:
Address "put address here" city "put city here"
and try to make sure you never get a city name with quotes in it. However, be careful. I once worked on a project where we managed to get some decent compression on city names (it was embedded so every byte counted) by only having to store alpha characters.
Shortly thereafter, we rolled out nationally and the residents of A1 mining settlement were rather miffed at our short-sightedness :-) One town in the whole of Oz with a digit in the name, who'd have thought?
Alternatively, put the address and city on separate lines thus:
Address: The city baths
City: Manchester city, England
Then you can look for things like:
^Address:\s*(.*)$
^City:\s*(.*)$
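As a rough illustration, the two line-anchored patterns could be applied like this in Python. The `re.MULTILINE` flag makes `^` and `$` match at every line boundary, and `RECORD` is a made-up variable holding the two example lines.

```python
import re

RECORD = "Address: The city baths\nCity: Manchester city, England"

# Each pattern is tied to the start of its own line, so the word "city"
# inside a value can no longer be mistaken for a delimiter.
address = re.search(r"^Address:\s*(.*)$", RECORD, re.MULTILINE).group(1)
city = re.search(r"^City:\s*(.*)$", RECORD, re.MULTILINE).group(1)

print(address)
print(city)
```

The label match is case-sensitive here, so the lowercase "city" in the address line is not picked up by the second pattern.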