Extracting email addresses from messy text in OpenRefine

I am trying to extract just the emails from a text column in OpenRefine. Some cells have just the email, but others have the name and email in john doe <john#doe.com> format. I have been using the following GREL/regex, but it does not return the entire email address. For the above example I'm getting ["n#doe.com"]
value.match(
/.*([a-zA-Z0-9_\-\+]+#[\._a-zA-Z0-9-]+).*/
)
Any help is much appreciated.

The n is captured because you are using .* before the capturing group, and since it can match any 0+ chars other than line break chars greedily the only char that can land in Group 1 during backtracking is the char right before #.
If partial matches are acceptable, get rid of the .* and use
/[^<\s]+#[^\s>]+/
See the regex demo
Details
[^<\s]+ - 1 or more chars other than < and whitespace
# - a # char
[^\s>]+ - 1 or more chars other than whitespace and >.
Python/Jython implementation:
import re
res = ''
m = re.search(r'[^<\s]+#[^\s>]+', value)
if m:
    res = m.group(0)
return res
There are other ways to match these strings. In case you need a full string match, use .*<([^<]+#[^>]+)>.*, where .* will not gobble the name, since it has to stop before the obligatory <.
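The backtracking behaviour described in this answer can be reproduced outside OpenRefine. The following is a minimal sketch in plain Python, assuming the sample cell from the question and keeping the # separator exactly as it appears in the examples here; the variable names are only illustrative.

```python
import re

cell = "john doe <john#doe.com>"

# Greedy .* grabs as much as it can, so Group 1 is left with a single
# character before '#'.
greedy = re.match(r".*([a-zA-Z0-9_\-\+]+#[\._a-zA-Z0-9-]+).*", cell).group(1)

# An unanchored search with the negated character classes returns the
# whole address.
partial = re.search(r"[^<\s]+#[^\s>]+", cell).group(0)

print(greedy)   # n#doe.com
print(partial)  # john#doe.com
```

The second pattern also works for cells that contain only the address, because it does not depend on the angle brackets being present.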

If some cells contain just the email, it's probably better to use @wiktor-stribiżew's partial match. In the development version of OpenRefine, there is now a value.find() function that can do this, but it will only be officially released in the next version (2.9). In the meantime, you can reproduce it using Python/Jython instead of GREL:
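import re
# Return the first match, or None for cells that contain no address
matches = re.findall(r"[^<\s]+#[^\s>]+", value)
return matches[0] if matches else None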

Related

Regex for validating account names for NEAR protocol

I want to have accurate form field validation for NEAR protocol account addresses.
I see at https://docs.near.org/docs/concepts/account#account-id-rules that the minimum length is 2, maximum length is 64, and the string must either be a 64-character hex representation of a public key (in the case of an implicit account) or must consist of "Account ID parts" separated by . and ending in .near, where an "Account ID part" consists of lowercase alphanumeric symbols separated by either _ or -.
Here are some examples.
The final 4 cases here should be marked as invalid (and there might be more cases that I don't know about):
example.near
sub.ex.near
something.near
98793cd91a3f870fb126f66285808c7e094afcfc4eda8a970f6648cdf0dbd6de
wrong.near.suffix (INVALID)
shouldnotendwithperiod.near. (INVALID)
space should fail.near (INVALID)
touchingDotsShouldfail..near (INVALID)
I'm wondering if there is a well-tested regex that I should be using in my validation.
Thanks.
P.S. Originally my question pointed to what I was starting with at https://regex101.com/r/jZHtDA/1 but starting from scratch like that feels unwise given that there must already be official validation rules somewhere that I could copy.
I have looked at code that I would have expected to use some kind of validation, such as these links, but I haven't found it yet:
https://github.com/near/near-wallet/blob/40512df4d14366e1b8e05152fbf5a898812ebd2b/packages/frontend/src/utils/account.js#L8
https://github.com/near/near-wallet/blob/40512df4d14366e1b8e05152fbf5a898812ebd2b/packages/frontend/src/components/accounts/AccountFormAccountId.js#L95
https://github.com/near/near-cli/blob/cdc571b1625a26bcc39b3d8db68a2f82b91f06ea/commands/create-account.js#L75

The pre-release (v0.6.0-0) version of the JS SDK comes with a built-in accountId validation function:
const ACCOUNT_ID_REGEX =
  /^(([a-z\d]+[-_])*[a-z\d]+\.)*([a-z\d]+[-_])*[a-z\d]+$/;

/**
 * Validates the Account ID according to the NEAR protocol
 * [Account ID rules](https://nomicon.io/DataStructures/Account#account-id-rules).
 *
 * @param accountId - The Account ID string you want to validate.
 */
export function validateAccountId(accountId: string): boolean {
  return (
    accountId.length >= 2 &&
    accountId.length <= 64 &&
    ACCOUNT_ID_REGEX.test(accountId)
  );
}
https://github.com/near/near-sdk-js/blob/dc6f07bd30064da96efb7f90a6ecd8c4d9cc9b06/lib/utils.js#L113
Feel free to implement this in your program too.
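For form validation outside JavaScript, the same check is easy to port. Below is a rough Python sketch, assuming the regex above is copied unchanged; validate_account_id and ACCOUNT_ID_RE are hypothetical names, not part of any NEAR library.

```python
import re

# Same pattern as ACCOUNT_ID_REGEX above. re.ASCII keeps \d limited to 0-9,
# and fullmatch() replaces the explicit ^...$ anchors.
ACCOUNT_ID_RE = re.compile(
    r"(([a-z\d]+[-_])*[a-z\d]+\.)*([a-z\d]+[-_])*[a-z\d]+", re.ASCII
)

def validate_account_id(account_id: str) -> bool:
    """Hypothetical Python counterpart of validateAccountId."""
    return (
        2 <= len(account_id) <= 64
        and ACCOUNT_ID_RE.fullmatch(account_id) is not None
    )

samples = [
    "example.near",
    "sub.ex.near",
    "98793cd91a3f870fb126f66285808c7e094afcfc4eda8a970f6648cdf0dbd6de",
    "wrong.near.suffix",
    "shouldnotendwithperiod.near.",
    "space should fail.near",
    "touchingDotsShouldfail..near",
]
for s in samples:
    print(s, validate_account_id(s))
```

Note that this pattern knows nothing about a required suffix, so wrong.near.suffix passes it. If the form must insist on a particular ending, that needs a separate check.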

Something like this should do: /^(\w|(?<!\.)\.)+(?<!\.)\.(testnet|near)$/gm
Breakdown
^ # start of line
(
\w # match word characters (letters, digits, underscore)
| # OR
(?<!\.)\. # dots can't be preceded by dots
)+
(?<!\.) # "." should not precede:
\. # "."
(testnet|near) # match "testnet" or "near"
$ # end of line
Try the Regex out: https://regex101.com/r/vctRlo/1
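To see how this pattern treats the examples from the question, here is a small Python sketch. It assumes the pattern is used as written, without the /gm flags, on one name at a time; the helper name is made up for illustration.

```python
import re

# The pattern from this answer, unchanged. Python's re module supports the
# fixed-width negative lookbehind used here.
PATTERN = re.compile(r"^(\w|(?<!\.)\.)+(?<!\.)\.(testnet|near)$")

def ends_in_near_or_testnet(name: str) -> bool:
    """Hypothetical helper: True if the whole name matches PATTERN."""
    return PATTERN.search(name) is not None

for name in [
    "example.near",
    "sub.ex.near",
    "wrong.near.suffix",
    "shouldnotendwithperiod.near.",
    "space should fail.near",
    "touchingDotsShouldfail..near",
]:
    print(name, ends_in_near_or_testnet(name))
```

Two things to keep in mind: \w also accepts uppercase letters and underscores, and a 64-character hex account without a suffix is rejected by this pattern.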

If you want to match word characters only, separated by a dot:
^\w+(?:\.\w+)*\.(?:testnet|near)$
Explanation
^ Start of string
\w+ Match 1+ word characters
(?:\.\w+)* Optionally repeat . and 1+ word characters
\. Match .
(?:testnet|near) Match either testnet or near
$ End of string
Regex demo
A somewhat broader variant that matches any non-whitespace characters except the dot:
^[^\s.]+(?:\.[^\s.]+)*\.(?:testnet|near)$
Regex demo
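The difference between the two patterns shows up with names that contain a hyphen. A quick Python sketch, assuming both patterns are used exactly as written (the constant names are illustrative):

```python
import re

WORDS_ONLY = re.compile(r"^\w+(?:\.\w+)*\.(?:testnet|near)$")
BROADER = re.compile(r"^[^\s.]+(?:\.[^\s.]+)*\.(?:testnet|near)$")

def check(name: str) -> tuple:
    """Return (words_only_result, broader_result) for one account name."""
    return (
        WORDS_ONLY.fullmatch(name) is not None,
        BROADER.fullmatch(name) is not None,
    )

print(check("sub.ex.near"))             # (True, True)
print(check("my-account.near"))         # (False, True)
print(check("a..near"))                 # (False, False)
print(check("space should fail.near"))  # (False, False)
```

The broader variant accepts any non-whitespace, non-dot character, so it is considerably more permissive than the account rules quoted in the question.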

Sanitize url path with regex

I'm trying to sanitize a url path from the following elements
ids (1, 14223423, 24fb3bdc-8006-47f0-a608-108f66d20af4)
filenames (things.xml, doc.v2.final.csv)
domains (covered under filenames)
emails (foo#bar.com)
Sample:
/v1/upload/dxxp-sSy449dk_rm_1debit_A_03MAY21.final.csv/email/foo#bar.com?who=knows
Desired outcome:
/upload/email
I have something that works... but I'm not proud (written in Ruby)
# Remove params from the path (everything after the ?)
route = req.path&.split('?')&.first
# Remove filenames with singular extensions, domains, and emails
route = route&.gsub(/\b[\w-]*#?[\w-]+\.[\w-]+\b/, '')
# Remove ids from the path (any string that contains a number)
route = "/#{route&.scan(/\b[a-z_]+\b/i)&.join('/')}".chomp('/')
I can't help but think this can be done simply with something like \/([a-z_]+)\/?, but the \/? is too loose, and \/ is too restrictive.

Perhaps you can remove the parts starting with a / and that contain at least a dot or a digit.
Replace the match with an empty string.
/[^/\d.]*[.\d][^/]*
Rubular regex demo
/ Match a forward slash
[^/\d.]* Match 0+ times any char except / or . or a digit
[.\d] Match either a . or a digit
[^/]* Match 0+ times any char except /
Output
/upload/email
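The same substitution can be tried in Python, where the pattern works unchanged. This is only a sketch on the sample path from the question; sanitize is a hypothetical helper name.

```python
import re

# A slash, then a segment that contains at least one dot or digit.
DYNAMIC_SEGMENT = re.compile(r"/[^/\d.]*[.\d][^/]*")

def sanitize(path: str) -> str:
    """Drop every path segment that contains a dot or a digit."""
    return DYNAMIC_SEGMENT.sub("", path)

sample = (
    "/v1/upload/dxxp-sSy449dk_rm_1debit_A_03MAY21.final.csv"
    "/email/foo#bar.com?who=knows"
)
print(sanitize(sample))  # /upload/email
```

The query string disappears here only because it is attached to a segment that contains a dot. A path whose last segment is plain letters would keep its ?... part, so splitting on ? first is safer.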

In Ruby, you can use a bit of code to simplify your checks, in a way similar to what you did:
text = text.split('?').first.split('/').select{ |x| not x.match?(/\A[^#]*#\S+\z|\d/) }.join("/")
See the Ruby demo. Note how much this approach simplifies the email and digit checking.
Details
text.split('?').first - split the string with ? and grab the first part
.split('/') - splits with / into subparts
.select{ |x| not x.match?(/\A[^#]*#\S+\z|\d/) } - only keep the items that do not match the \A[^#]*#\S+\z|\d regex, which matches either \A[^#]*#\S+\z (start of string, zero or more chars other than #, a # char, then one or more non-whitespace chars and end of string) or \d (a digit)
.join("/") - join the resulting items with /.
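For comparison, the split-and-filter steps listed above translate almost one to one into Python. A sketch under the same assumptions as the Ruby version (# as the email separator in the sample; the function name is invented here):

```python
import re

# Email-like segment (anything, '#', then non-whitespace) or any digit.
# Python spells Ruby's \z as \Z.
REJECT = re.compile(r"\A[^#]*#\S+\Z|\d")

def strip_dynamic_segments(path: str) -> str:
    """Keep only the segments that are neither email-like nor contain digits."""
    segments = path.split("?")[0].split("/")
    kept = [s for s in segments if not REJECT.search(s)]
    return "/".join(kept)

sample = (
    "/v1/upload/dxxp-sSy449dk_rm_1debit_A_03MAY21.final.csv"
    "/email/foo#bar.com?who=knows"
)
print(strip_dynamic_segments(sample))  # /upload/email
```

The leading empty string produced by split("/") survives the filter, which is what puts the leading slash back after the join.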

So, I think it's better to go with the allow list here, rather than a block list. Seems like it's more predictable to say "we only keep words with letters and underscores".
# Keep path w/o params
route = req.path.to_s.split('?').first
# Keep words that only contain letters or _
route = route.split('/').keep_if { |chunk| chunk[/^[a-z_]+$/i] }
# Put the path back together
route = "/#{route.join('/')}".chomp('/')
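The allow-list idea carries over to other languages with little change. Here is a minimal Python sketch of it; allowed_route is a made-up name, and fullmatch is used instead of ^...$, which makes it slightly stricter than the Ruby line anchors.

```python
import re

# Allow list: a chunk is kept only if it consists of letters and underscores.
ALLOWED_CHUNK = re.compile(r"[a-z_]+", re.IGNORECASE)

def allowed_route(path: str) -> str:
    """Rebuild the path from the chunks that pass the allow list."""
    chunks = path.split("?")[0].split("/")
    kept = [c for c in chunks if ALLOWED_CHUNK.fullmatch(c)]
    return ("/" + "/".join(kept)).rstrip("/")

sample = (
    "/v1/upload/dxxp-sSy449dk_rm_1debit_A_03MAY21.final.csv"
    "/email/foo#bar.com?who=knows"
)
print(allowed_route(sample))  # /upload/email
```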

Regex doesn't rule out cases (Python)

I'm trying to develop a regex that accepts input in the following formats:
date: 1:10 #7 (correct)
date: 1:10 (correct)
1:10 #7 (correct)
13.01.06 (incorrect)
Here is my regex developed on pythex:
(date)? ?\D? ?(1|4) ?(:|-|\.) ?[-+]?[0-9]+( ?(#) ?[a-zA-Z0-9]?)?
I'm working on a Python project that uses OCR, so sometimes the ":" between 1 and 10 is not recognized correctly. Is there a better way to tackle this regex problem?

Do you specifically need regex? If not, you could solve your problem by parsing the string to datetime.
from datetime import datetime

def test_date(date_string):
    try:
        datetime.strptime(date_string, '%d.%m.%y')
        return  # or do something else to skip the further processing
    except ValueError:
        pass
    # Process valid date string
    print('Valid date: {}'.format(date_string))

test_date('13.01.06')  # Does not print anything
test_date('1:10 #7')  # Works!

You need to do three things: make the non-digit pattern obligatory by removing the ? after \D; wrap the whole part before the (1|4) pattern in an optional non-capturing group (so that date, the separator and the spaces are matched optionally); and add a word boundary before the (1|4) pattern so that it can only match when the digit is not preceded by a digit, letter or _.
(?:(date)? ?\D ?)?\b([14]) ?([-:.]) ?[-+]?([0-9]+)( ?# ?([a-zA-Z0-9]*))?
See the regex demo.
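A quick way to check the adjusted pattern against the four samples from the question is to run it through re.search in Python. This is only a sketch; the samples list is taken from the question and the variable names are arbitrary.

```python
import re

PATTERN = re.compile(
    r"(?:(date)? ?\D ?)?\b([14]) ?([-:.]) ?[-+]?([0-9]+)( ?# ?([a-zA-Z0-9]*))?"
)

samples = ["date: 1:10 #7", "date: 1:10", "1:10 #7", "13.01.06"]

# Map each sample to whether the pattern is found anywhere in it.
results = {s: PATTERN.search(s) is not None for s in samples}
for sample, found in results.items():
    print(repr(sample), found)

# Groups 2, 4 and 6 hold the leading digit, the number and the suffix.
m = PATTERN.search("date: 1:10 #7")
print(m.group(2), m.group(4), m.group(6))  # 1 10 7
```

The first three samples match and 13.01.06 does not, because no 1 or 4 in it starts at a word boundary and is followed by a separator.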

Don't get number if preceded by month (Python)

I'm pulling some data off the web using Python in a Jupyter notebook. I have pulled down the data, parsed it, and created the data frame. I have extracted a number out of a string that I have in a variable in the data frame. I'm using this regex to do it:
import re

number = []
for note in df["person_notes"]:
    match = re.search(r'\d+', note)
    if match:
        number.append(note[match.start(): match.end()])
    else:
        number.append("")
df["number"] = number
Some strings are missing the number I'm looking for. For those cases, I would like to number.append(""). Those strings instead contain a full date such as "September 20, 2016", and my re.search() pulls the number 20 out of it. If the string has a date like this, I want to ignore the 20 and number.append("") instead.
How can I modify the re.search() to ignore the number if the number is preceded by a month?

I suggest using the old JS regex trick: wrap the pattern you would otherwise put in a negative lookbehind in an optional capturing group, and if that group participated in the match, discard the match (here, append a ""). Otherwise, grab the contents of the other capturing group (here, the digits).
See the Python demo:
import re
number = []
p = re.compile(r'((?:Jan|Febr)(?:uary)?|Ma(?:y|r(?:ch)?)|A(?:ug(?:ust)?|pr(?:il)?)|Ju(?:ne?|ly?)|Oct(?:ober)?|(?:Sept|Nov|Dec)(?:ember)?)? *(\d+)')
match = p.search('September 20, 2016')
if match and not match.group(1):  # Did the string match and did Group 1 fail?
    number.append(match.group(2))  # Yes, then add digits
else:
    number.append("")  # Else, add an empty value
print(number)
If you do not care about the shortened month names and want to keep it readable, you may use a simpler regex:
p = re.compile(r'(January|February|March|April|May|June|July|August|September|October|November|December)? *(\d+)')
The regex matches:
((?:Jan|Febr)(?:uary)?|Ma(?:y|r(?:ch)?)|A(?:ug(?:ust)?|pr(?:il)?)|Ju(?:ne?|ly?)|Oct(?:ober)?|(?:Sept|Nov|Dec)(?:ember)?)? - Group 1 (optional): months (full or short names)
" *" - zero or more spaces
(\d+) - Group 2: one or more digits.
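Applied to a whole column, the optional-group trick can be wrapped in a small function. The sketch below uses a plain list in place of df["person_notes"] and the simpler full-month pattern; extract_number and the sample notes are made up for illustration.

```python
import re

MONTH_OR_NUMBER = re.compile(
    r"(January|February|March|April|May|June|July|August|September"
    r"|October|November|December)? *(\d+)"
)

def extract_number(note: str) -> str:
    """Return the first number, or "" if it is preceded by a month name."""
    m = MONTH_OR_NUMBER.search(note)
    if m and not m.group(1):  # matched, and Group 1 did not participate
        return m.group(2)
    return ""

notes = ["September 20, 2016", "Case 4521 opened", "no digits here"]
numbers = [extract_number(n) for n in notes]
print(numbers)  # ['', '4521', '']
```

With a data frame, the same function could be passed to df["person_notes"].apply(...), assuming every entry in that column is a string.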

Regular expression help - comma delimited string

I don't write many regular expressions, so I'm going to need some help on this one.
I need a regular expression that can validate that a string is an alphanumeric comma delimited string.
Examples:
123, 4A67, GGG, 767 would be valid.
12333, 78787&*, GH778 would be invalid
fghkjhfdg8797< would be invalid
This is what I have so far, but isn't quite right: ^(?=.*[a-zA-Z0-9][,]).*$
Any suggestions?

Sounds like you need an expression like this:
^[0-9a-zA-Z]+(,[0-9a-zA-Z]+)*$
Posix allows for the more self-descriptive version:
^[[:alnum:]]+(,[[:alnum:]]+)*$
^[[:alnum:]]+([[:space:]]*,[[:space:]]*[[:alnum:]]+)*$ // allow whitespace
If you're willing to admit underscores, too, search for entire words (\w+):
^\w+(,\w+)*$
^\w+(\s*,\s*\w+)*$ // allow whitespaces around the comma
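The effect of the optional whitespace is easy to check against the examples from the question. Here is a short Python sketch, using explicit character ranges because Python's re module does not support POSIX classes such as [[:alnum:]]; the constant names are arbitrary.

```python
import re

STRICT = re.compile(r"^[0-9a-zA-Z]+(,[0-9a-zA-Z]+)*$")
SPACED = re.compile(r"^[0-9a-zA-Z]+(\s*,\s*[0-9a-zA-Z]+)*$")

def check(text: str) -> tuple:
    """Return (strict_result, spaced_result) for one input string."""
    return (STRICT.match(text) is not None, SPACED.match(text) is not None)

print(check("123,4A67,GGG,767"))       # (True, True)
print(check("123, 4A67, GGG, 767"))    # (False, True)
print(check("12333, 78787&*, GH778"))  # (False, False)
print(check("fghkjhfdg8797<"))         # (False, False)
```

The valid example from the question has a space after each comma, so only the whitespace-tolerant variant accepts it as written.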

Try this pattern: ^([a-zA-Z0-9]+,?\s*)+$
I tested it with your cases, as well as just a single number "123". I don't know if you will always have a comma or not.
The [a-zA-Z0-9]+ means match 1 or more of these symbols
The ,? means match 0 or 1 commas (basically, the comma is optional)
The \s* handles 0 or more whitespace characters after the comma
and finally the outer + says match 1 or more of the pattern.
This will also match
123 123 abc (no commas) which might be a problem
This will also match 123, (ends with a comma) which might be a problem.
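The two caveats above can be confirmed with a few lines of Python. This is just a sketch that runs the pattern from this answer, unchanged, over a handful of inputs:

```python
import re

LOOSE = re.compile(r"^([a-zA-Z0-9]+,?\s*)+$")

def accepted(text: str) -> bool:
    """True if the loose pattern matches the whole string."""
    return LOOSE.match(text) is not None

print(accepted("123, 4A67, GGG, 767"))    # True
print(accepted("123"))                    # True
print(accepted("123 123 abc"))            # True (no commas at all)
print(accepted("123,"))                   # True (trailing comma)
print(accepted("12333, 78787&*, GH778"))  # False
```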

Try the following expression:
/^([a-z0-9\s]+,)*([a-z0-9\s]+){1}$/i
This will work for:
test
test, test
test123,Test 123,test
I would strongly suggest trimming the whitespace at the beginning and end of each item in the comma-separated list.

You seem to be lacking repetition. How about:
^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$
I'm not sure how you'd express that in VB.Net, but in Python:
>>> import re
>>> x = ["123, 4A67, GGG, 767", "12333, 78787&*, GH778"]
>>> r = '^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$'
>>> for s in x:
...     print(re.match(r, s))
...
<_sre.SRE_Match object at 0xb75c8218>
None
>>>
You can use shortcuts instead of listing the [a-zA-Z0-9 ] part, but this is probably easier to understand.
Analyzing the highlights:
[a-zA-Z0-9 ]+ : match one or more (but not zero) characters from the listed ranges, or a space.
(?:[...]+,)* : In non-capturing parentheses, match one or more of the characters, plus a comma at the end. Match such sequences zero or more times. Matching zero times allows for input with no comma.
[...]+ : match at least one of these. This does not include a comma, which ensures that a trailing comma is not accepted. If a trailing comma is acceptable, then the expression is easier: ^[a-zA-Z0-9 ,]+$

Yes, when you want to catch comma separated things where a comma at the end is not legal, and the things match to $LONGSTUFF, you have to repeat $LONGSTUFF:
$LONGSTUFF(,$LONGSTUFF)*
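For the alphanumeric list from the question, this could look as follows in Python. It is a minimal sketch of the build-it-from-pieces idea; ITEM, SEP and LIST_RE are illustrative names, and allowing whitespace around the comma is an assumption based on the question's examples.

```python
import re

ITEM = r"[a-zA-Z0-9]+"   # plays the role of $LONGSTUFF
SEP = r"\s*,\s*"         # comma with optional surrounding whitespace

# ITEM(SEP ITEM)* assembled by string formatting, so ITEM is written once.
LIST_RE = re.compile(r"^%s(?:%s%s)*$" % (ITEM, SEP, ITEM))

def is_valid_list(text: str) -> bool:
    """True if text is a comma-delimited list of alphanumeric items."""
    return LIST_RE.match(text) is not None

print(is_valid_list("123, 4A67, GGG, 767"))    # True
print(is_valid_list("12333, 78787&*, GH778"))  # False
print(is_valid_list("123,"))                   # False
```

Changing what counts as an item then means editing ITEM in one place only.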
If $LONGSTUFF is really long and contains comma repeated items itself etc., it might be a good idea to not build the regexp by hand and instead rely on a computer for doing that for you, even if it's just through string concatenation. For example, I just wanted to build a regular expression to validate the CPUID parameter of a XEN configuration file, of the ['1:a=b,c=d','2:e=f,g=h'] type. I... believe this mostly fits the bill: (whitespace notwithstanding!)
import re

xend_fudge_item_re = r"""
    e[a-d]x=            #register of the call return value to fudge
    (
        0x[0-9A-F]+ |   #either hardcode the reply
        [10xks]{32}     #or edit the bitfield directly
    )
"""
xend_string_item_re = r"""
    (0x)?[0-9A-F]+:     #leafnum (the contents of EAX before the call)
    %s                  #one fudge
    (,%s)*              #repeated multiple times
""" % (xend_fudge_item_re, xend_fudge_item_re)
xend_syntax = re.compile(r"""
    \[                  #a list of
    '%s'                #string elements
    (,'%s')*            #repeated multiple times
    \]
    $                   #and nothing else
""" % (xend_string_item_re, xend_string_item_re), re.VERBOSE | re.MULTILINE)

Try ^(?!,)((, *)?([a-zA-Z0-9]+)\b)*$
Step by step description:
Don't match a beginning comma (good for the upcoming "loop").
Match optional comma and spaces.
Match characters you like.
The word boundary makes sure that a comma is required before each additional item in the string.

Please use: ^((([a-zA-Z0-9\s]){1,45},)+([a-zA-Z0-9\s]){1,45})$
Here, I have set the maximum word size to 45, as the longest word in English is 45 characters; this can be changed as required.
