Email extraction from CSV using regex

I have the following regex:
/(.+?)((?:(?:[^<>()\[\]\\.,;:\s#"]+(?:\.[^<>()\[\]\\.,;:\s#"]+)*)|(?:".+"))#(?:(?:\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}])|(?:(?:[a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,})))/gi
It is used to extract the name and email address from the following formats while avoiding duplicates:
"FName LName" <fname.lname#gmail.com>, "Eg Name" <egname#gmail.com>,
Closed Call<close_call#gmail.co.um>
toys#urs.com
serima<serima#google.com>
One <one#one.com>;Two <two#two.com>; "New <new#new.com>"
I have a couple of problems with it:
First, in test case #2 the leading t gets trimmed, leaving only oys#urs.com; this happens only on the first email address.
Second, the capturing group returns the name together with the < (if present), so I had to strip out the < separately.
Is there any way to extract the above as follows, in a more elegant/efficient way?
[{'name':'FName LName', 'email':'fname.lname#gmail.com'},
{'name':'Eg Name', 'email':'egname#gmail.com'},
{'name':'Closed Call', 'email':'close_call#gmail.co.um'}]
[{'name':'', 'email':'toys#urs.com'}]
[{'name':'serima', 'email':'serima#google.com'}]
[{'name':'One', 'email':'one#one.com'},
{'name':'Two', 'email':'two#two.com'},
{'name':'New', 'email':'new#new.com'}]
Note: The name may or may not be enclosed in double quotes, and there may or may not be a space between the name and the <.

Problem #1 is solved by letting the first capturing group match the empty string (.*? instead of .+?), so it no longer has to consume the first character of a bare address:
/(.*?)((?:(?:[^<>()\[\]\\.,;:\s#"]+(?:\.[^<>()\[\]\\.,;:\s#"]+)*)|(?:".+"))#(?:(?:\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}])|(?:(?:[a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,})))/gi
Problem #2 I will save for tonight's dream-time ;-)
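For comparison, here is a minimal Python sketch of one way to get the name and address in a single pass without a stray <. It is not the pattern above: it assumes the name never contains ", <, >, a comma, a semicolon, or #, and that addresses use # as the separator, as in the sample data. The names CONTACT and extract_contacts are made up for this example.

```python
import re

# Optional name (quoted or not) that ends at "<", followed by an address
# that uses "#" as its separator, as in the sample data.
CONTACT = re.compile(
    r'(?:"?(?P<name>[^"<>,;#]*?)"?\s*<)?'    # optional name part, up to "<"
    r'(?P<email>[^\s<>",;#]+#[^\s<>",;#]+)'  # local#domain
)


def extract_contacts(text):
    """Return [{'name': ..., 'email': ...}], skipping repeated addresses."""
    seen = set()
    contacts = []
    for match in CONTACT.finditer(text):
        email = match.group("email")
        if email.lower() in seen:
            continue  # duplicate address, keep the first occurrence only
        seen.add(email.lower())
        # The name group is None when the address is not wrapped in <...>.
        name = (match.group("name") or "").strip()
        contacts.append({"name": name, "email": email})
    return contacts


print(extract_contacts('One <one#one.com>;Two <two#two.com>; "New <new#new.com>"'))
```

Because the < is matched outside the name group, no separate clean-up step is needed, and a bare address such as toys#urs.com simply yields an empty name.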

Related

Extract digits in between 2 different parameters

I have all data being imported into one cell as:
"<blank space><email address><blank space><CustomerId><blank space><(email address)><line break for next entry>"
Example:
email1#provider.com 12345678 (email1#provider.com)
email224#provider.com 23902490 (email224#provider.com)
I need to extract only the customer IDs, separated by commas, so I tried the following: regexreplace(A2,"([^[:digit:]])",","). However, this also keeps the digits that appear in the email addresses, so it returns:
,,,,,1,,,,,,,,,,,,,,12345678,,,,,,,1,,,,,,,,,,,,,,
,,,,,224,,,,,,,,,,,,,,23902490,,,,,,,224,,,,,,,,,,,,,,
Since the email address is set by the user, I don't have control how many digits or if only digits are used in it. I can't seem to understand how to isolate the CustomerIds alone.
Please help!
Edit1:
CustomerID: 64-bit int field, randomly assigned to a client, therefore checking by the length of the string would not work.
Edit2:
For now, I am using the formula below, but I would still be interested in a solution using Regex.
filter(transpose(split($B$4," ")),isnumber(transpose(split($B$4," "))))
If the values are separated by a space, you should be able to set the space as your delimiter and extract from there.
https://zapier.com/blog/split-text-excel-zapier/
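Use the following (note that \d{8} assumes every customer ID is exactly 8 digits long, as in the example):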
=ARRAYFORMULA(TEXTJOIN(", ", 1, IFERROR(REGEXEXTRACT(A1:A2, "(?s)(\d{8})"))))
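If the ID length cannot be relied on, the position of the ID can be used instead. The Python sketch below (not a Sheets formula) assumes the layout from the question: the ID is a run of digits that follows whitespace and is itself followed by whitespace and an opening parenthesis. The names CUSTOMER_ID and customer_ids are made up for this example.

```python
import re

# A run of digits that starts after whitespace (or at the start of the cell)
# and is followed by whitespace and an opening parenthesis.
CUSTOMER_ID = re.compile(r"(?<!\S)\d+(?=\s+\()")


def customer_ids(cell):
    """Return the customer IDs found in one cell, joined with ', '."""
    return ", ".join(CUSTOMER_ID.findall(cell))


cell = (
    " email1#provider.com 12345678 (email1#provider.com)\n"
    " email224#provider.com 23902490 (email224#provider.com)"
)
print(customer_ids(cell))
```

Digits inside an email address are skipped because they are either preceded by a non-space character or not followed by whitespace and a parenthesis.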

Can I make my Alteryx RegEx parse conditional?

I receive messages with the fields below, and I want to group and extract the user inputs. The majority of submissions contain all fields, and the regex works great. The problem comes when someone removes the trailing lines, for example because they only need to fill in down to Amount 1.
Name:
Number:
Amount:
Old Code:
Code 1:
Amount 1:
Code 2:
Amount 2:
Code 3:
Amount 3:
Code 4:
Amount 4:
I'm using Alteryx to parse the message contents and have had success with my current regex, but I want to be ready for the unavoidable inconsistency in user submissions:
Name:(.+)\sNumber:(.+)\sAmount:(.+)\sOld Code:(.+)\sCode 1:(.+)\sAmount 1:(.+)\sCode 2:(.*?)\sAmount 2:(.*?)\sCode 3:(.*?)\sAmount 3:(.*?)\sCode 4:(.*?)\sAmount 4:(.*?[^-]*)
Is it possible to have Alteryx return parsed results from a message even if a listed field is deleted?
Alteryx issue with new cascading regex
Anyway, you can always use cascading nested optional groups around the lines, so that the regex matches whatever is valid up to a point.
This expects the form lines to be in order. If they are not, a different type of regex is needed: an out-of-order regex (see the bottom regex).
Both of these regexes are for Perl 5.10.
(?-ms)Name:(.*)(?:\s+Number:(.*)(?:\s+Amount:(.*)(?:\s+Old[ ]+Code:(.*)(?:\s+Code[ ]+1:(.*)(?:\s+Amount[ ]+1:(.*)(?:\s+Code[ ]+2:(.*)(?:\s+Amount[ ]+2:(.*)(?:\s+Code[ ]+3:(.*)(?:\s+Amount[ ]+3:(.*)(?:\s+Code[ ]+4:(.*)(?:\s+Amount[ ]+4:(.*?[^-]*))?)?)?)?)?)?)?)?)?)?)?
https://regex101.com/r/9oKXEE/1
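The same cascading idea can be sketched in Python by building the nested optional groups from a field list. This is only an illustration under assumptions: the field list is shortened to six entries, the fields must appear in order, and build_cascading_pattern and parse_form are hypothetical helper names, not part of Alteryx.

```python
import re

# Shortened field list for illustration; the order must match the form.
FIELDS = ["Name", "Number", "Amount", "Old Code", "Code 1", "Amount 1"]


def build_cascading_pattern(fields):
    """Nest every later field inside an optional group of the previous one."""
    tail = ""
    for field in reversed(fields[1:]):
        label = "[ ]+".join(field.split())  # allow several spaces in a label
        tail = rf"(?:\s+{label}:(.*){tail})?"
    first = "[ ]+".join(fields[0].split())
    return re.compile(rf"{first}:(.*){tail}")


def parse_form(message, fields=FIELDS):
    """Map each field to its value, or to None if the line was removed."""
    match = build_cascading_pattern(fields).search(message)
    if match is None:
        return {}
    return {
        field: value.strip() if value is not None else None
        for field, value in zip(fields, match.groups())
    }


print(parse_form("Name: Bob\nNumber: 42\nAmount: 10\nOld Code: A1"))
```

Fields that were cut off the end of the message come back as None instead of making the whole match fail.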
For out-of-order matching, use this
(?m-s)\A(?:[\S\s]*?(?:(?(1)(?!))^\h*Name\h*:\h*(.*)|(?(2)(?!))^\h*Number\h*:\h*(.*)|(?(3)(?!))^\h*Amount\h*:\h*(.*)|(?(4)(?!))^\h*Old\h*Code\h*:\h*(.*)|(?(5)(?!))^\h*Code\h*1\h*:\h*(.*)|(?(6)(?!))^\h*Amount\h*1\h*:\h*(.*)|(?(7)(?!))^\h*Code\h*2\h*:\h*(.*)|(?(8)(?!))^\h*Amount\h*2\h*:\h*(.*)|(?(9)(?!))^\h*Code\h*3\h*:\h*(.*)|(?(10)(?!))^\h*Amount\h*3\h*:\h*(.*)|(?(11)(?!))^\h*Code\h*4\h*:\h*(.*)|(?(12)(?!))^\h*Amount\h*4\h*:\h*(.*?))){1,12}
https://regex101.com/r/f2rG1v/1
In this situation, you don't need to reach for regex straight off the bat, and given the inconsistent data it could take a while to perfect a single regex.
You can do it this way instead:
- Add a RecordID first.
- Then use a Text to Columns tool with a new-line (\n) delimiter, configured to "Split to Rows".
- Then use another Text to Columns tool to split on the delimiter ":".
That will handle additional rows entered etc. At that stage, you can figure out how to clean up the results (filter to remove null lines, multi-row to tag records, cross-tab to create a table etc...). If you want to flag any unknown rows, you can have a Text Input with the required rows and use Find/Replace or Join to separate the data.
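Outside Alteryx, the split-to-rows approach above looks roughly like this Python sketch: split on new lines, split each line on the first ":", and set aside any line whose label is not expected. The EXPECTED set is shortened and parse_lines is a hypothetical name used only here.

```python
# Labels the form is expected to contain (shortened for illustration).
EXPECTED = {"Name", "Number", "Amount", "Old Code", "Code 1", "Amount 1"}


def parse_lines(message, expected=EXPECTED):
    """Split a message into known fields and a list of unrecognised lines."""
    fields = {}
    unknown = []
    for line in message.splitlines():
        if not line.strip():
            continue  # drop empty rows, like filtering out null lines
        label, sep, value = line.partition(":")  # split on the first ":"
        label = label.strip()
        if sep and label in expected:
            fields[label] = value.strip()
        else:
            unknown.append(line)  # flag rows that are not part of the form
    return fields, unknown


fields, unknown = parse_lines("Name: Bob\n\nAmount: 10\nNumber: 42\nNotes: hello")
print(fields, unknown)
```

Since each line is handled on its own, missing, extra, or reordered lines do not break the parse.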

variable number of capturing groups

I have an XPath expression that I want to use to extract the city and date from a td containing a string like this:
City(may contain spaces and may be missing, but the following space is always present) on 2013/07/20
So far, I got to the following solution for extracting the date, which works partially:
//path/to/my/td/text()/replace(.,'(.*) on (.*)','$2')
This works when City is present, but when City is missing I get "on 2013/07/20" as a result.
I think this is because the first capturing group fails and so the number of groups is different.
How can I get this expression to work?
I did not fully check your regex, but it looks fine at first sight. Anyway, you can also go an easier way if you only want to get the date by extracting the text after "on ":
//path/to/my/td/text()/substring-after(.,'on ')
edit: or you may go the substring-way and select the last 10 characters of the content:
//path/to/my/td/text()/substring(., string-length(.) - 9)
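As a side note on the "number of groups" worry: an optional group that does not take part in a match does not shift the numbering of the other groups. The Python sketch below shows this with a different engine than XPath's replace(). It assumes the date always has the form YYYY/MM/DD; CITY_DATE and split_city_date are names made up for this example.

```python
import re

# The city part is optional; the date group keeps its name and position
# whether or not the city group takes part in the match.
CITY_DATE = re.compile(
    r"^\s*(?:(?P<city>.*?)\s+)?on\s+(?P<date>\d{4}/\d{2}/\d{2})\s*$"
)


def split_city_date(text):
    """Return (city, date); city is '' when it is missing."""
    match = CITY_DATE.match(text)
    if match is None:
        return None
    return (match.group("city") or "", match.group("date"))


print(split_city_date("New York on 2013/07/20"))
print(split_city_date(" on 2013/07/20"))
```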

Please help clarify my regex pattern

I have the following string:
<script>m('02:29:1467301/>Sender1*>some text message?<<02:29:13625N1/>Sender2*>Recipient2: another message??<>A<<02:29:1393100=>User1*|0User2*|%></B><<','');</script>
N.B. messages are separated by <<
I need to extract the following parts from each message:
1. Time
2. Sender
3. Recipient
4. Text
The recipient may or may not be defined; this field is optional.
I do this with the following pattern:
(?<message>(?<time>\d{1,2}:\d{1,2}:[0-9a-z]+)/>(?<messageData>(?<sender>.+?)\*>(.+?)))<<
However, I cannot extract the recipient separately from the message text. This is my attempt:
(?<message>(?<time>\d{1,2}:\d{1,2}:[0-9a-z]+)/>(?<messageData>(?<sender>.+?)\*>(((?<recipient>.+?):){0,1}(?<messageText>.+?))))<<
N.B. The first message has no recipient.
Please help me correct my pattern.
The <recipient> group pattern needs to exclude < and :, or else, when the recipient is omitted (as in the first message of your example), it will match the text between *> and the first colon of the next message's timestamp.
A simple tweak to that group pattern should fix it:
(?<message>(?<time>\d{1,2}:\d{1,2}:[0-9a-z]+)/>(?<messageData>(?<sender>.+?)\*>(((?<recipient>[^<:]+):)?(?<messageText>.+?))))<<
Note I replaced {0,1} with the optional quantifier (?). It's just shorthand to improve readability (a little goes a long way). :-)
Speaking of readability, here it is in multi-line form:
(?<message>
(?<time>\d{1,2}:\d{1,2}:[0-9a-z]+)/>
(?<messageData>
(?<sender>.+?)\*>
(
((?<recipient>[^<:]+):)?
(?<messageText>.+?)
)
)
)<<
I don't know if the unnamed group containing <recipient> and <messageText> was intentional, but it's unnecessary. You can break it down to this:
(?<message>
(?<time>\d{1,2}:\d{1,2}:[0-9a-z]+)/>
(?<messageData>
(?<sender>.+?)\*>
((?<recipient>[^<:]+):)?
(?<messageText>.+?)
)
)<<
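For readers who want to try the idea outside a regex tester, here is a minimal Python sketch of the simplified pattern. It makes a few assumptions: Python spells named groups (?P<name>...), the outer message and messageData groups are dropped for brevity, and the time group uses [0-9A-Za-z]+ so that the uppercase N in the second timestamp matches without a case-insensitive flag. MESSAGE and parse_messages are names made up for this example.

```python
import re

MESSAGE = re.compile(
    r"(?P<time>\d{1,2}:\d{1,2}:[0-9A-Za-z]+)/>"
    r"(?P<sender>.+?)\*>"
    r"(?:(?P<recipient>[^<:]+):)?"  # optional; cannot run past "<" or ":"
    r"(?P<text>.+?)<<"
)


def parse_messages(raw):
    """Return one dict per message; recipient is None when it is omitted."""
    return [
        {
            "time": m.group("time"),
            "sender": m.group("sender"),
            "recipient": m.group("recipient"),
            "text": m.group("text").strip(),
        }
        for m in MESSAGE.finditer(raw)
    ]


raw = ("m('02:29:1467301/>Sender1*>some text message?<<"
       "02:29:13625N1/>Sender2*>Recipient2: another message??<>A<<"
       "02:29:1393100=>User1*|0User2*|%></B><<','');")
for message in parse_messages(raw):
    print(message)
```

The third entry in the sample string has => instead of /> after the timestamp, so this pattern skips it.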
Check this out; it may fit a little better:
(?<message>(?<time>\d{1,2}:\d{1,2}:[0-9a-z]*).+?>(?<messageData>(?<sender>.*?)>(((?<recipient>[^<:]+):)?(?<messageText>.*?))))<<
P.S. Hi there ;)

Regex named capture group with multiple values

I seem to be having a tough regex week. Anyone that can save me from throwing my laptop out the window gets a virtual beer. I have some data in the form of:
... f=something group="First Group,Group2" foo=val ...
where the number of groups can vary. I need to capture each group entry in a named capture. Unlike in a previous post, I don't have a constant to key off within the values (there, values such as ID-1-1 and ID-2-2 allowed me to say ID-\d+-\d+, whereas these values could be pretty much anything). I've been trying a ton of stuff, but I tend to get matches that are far too greedy, or I (often) get these 2 values:
First Group
First Group,Group2
What I need is:
First Group
Group2
...
I'm currently trying regex such as this where I'm trying to anchor to the group=" portion, and not exceed the ending ":
(?:(?:group=\")|(?:\"))(?<group>(?:(.+)+?)
Hopefully someone can make my day a lot better...
Here's the PHP solution. Once again, regex doesn't like capturing multiple values in one group, so we need to break it into two searches: the first extracts the contents of the group attribute, and the second extracts each value from those contents.
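$test = 'f=something group="First Group,Group2" foo=val';
// Step 1: capture everything between the double quotes (\x22).
$re = '/(?:group=)?\x22(?<group>(?:[^\x2C]+\x2C*)+)\x22/';
$groupMatch = null;
if (preg_match($re, $test, $groupMatch)) {
    echo "Group Contents: ".$groupMatch['group']."\r\n";
    // Step 2: pull out each comma-separated (\x2C) value.
    $re = '/(?:^|\x2C)(?<value>(?:[^\x2C]+)+)/';
    $valueMatches = null;
    if (preg_match_all($re, $groupMatch['group'], $valueMatches)) {
        echo "Group Values: ".print_r($valueMatches['value'], true);
    }
}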
Should be pretty easy to port in to another language, just extract the regexes out and manage them the way you normally would.
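As an illustration of such a port, here is a minimal Python sketch of the same two-step idea. It assumes the values never contain a double quote or a comma, and it swaps the second regex for a plain split, which is a simplification rather than the PHP code above. group_values is a name made up for this example.

```python
import re


def group_values(line):
    """Step 1: grab the quoted contents of group="..."; step 2: split on commas."""
    match = re.search(r'group="([^"]*)"', line)
    if match is None:
        return []
    # Drop empty pieces so that group="" yields an empty list.
    return [value for value in match.group(1).split(",") if value]


print(group_values('f=something group="First Group,Group2" foo=val'))
```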