Inside OpenRefine I want to run the regex below on a website's source to find email addresses in mailto links. My trouble is that when running value.match, I get this error:
Parsing error at offset 12: Bad regular expression (Unclosed character class near index 10 .*mailto:[^ ^)
I have tested the expression in another environment without value.match and it works.
value.match(/.*mailto:[^/"/']*.com.*/)
So if you have text like:
Blah blah mail me
To extract the email address using the match function in OpenRefine you need to use:
value.match(/.*mailto:([^\"\']*.com).*/)
This will give an array containing the email address, which is captured using a capture group. To extract the email address from the array (which is necessary if you want to store the mail address in an OpenRefine cell) you need to get the string value from the array. e.g.:
value.match(/.*mailto:([^\"\']*.com).*/)[0]
The difference between your original expression and this one is that the characters are escaped correctly and there is a capture group - basically implementing the advice from @LukStorms in the comments above.
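As a rough cross-check outside OpenRefine, the same idea (escape the quotes, wrap the address in a capture group) can be sketched in Python. The sample HTML and the address in it are made up for illustration, and Python's re module only stands in for GREL's match(), which returns the capture groups as an array. The dot before com is also escaped here, which is an extra tightening that the GREL expression does not have.

```python
import re

# Hypothetical page source; the real markup from the question is not shown.
html = '<p>Blah blah <a href="mailto:someone@example.com">mail me</a></p>'

# Same shape as the GREL pattern: one capture group around the address.
pattern = re.compile(r'.*mailto:([^"\']*\.com).*')

m = pattern.match(html)
groups = list(m.groups()) if m else []   # analogous to the array match() returns
email = groups[0] if groups else None    # analogous to taking element [0] in GREL
print(email)
```

If the pattern does not match at all, groups stays empty and email is None, which mirrors the need to guard against a null result before indexing.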
isNotNull(value.match(/.*mailto:[^\"\']*.com.*/))
As described on our Reference page for the match() function, it returns an array of the capture groups in your regex pattern, and isNotNull() then outputs true or false depending on whether your value matches that pattern:
https://github.com/OpenRefine/OpenRefine/wiki/GREL-String-Functions#matchstring-s-regexp-p
Also described here: https://github.com/OpenRefine/OpenRefine/wiki/Understanding-Regular-Expressions#basic-examples
You can also use get(), as described in the Recipes on our wiki, but it only works well if you have a single email address per cell. (That is because get(), when called without a from or to number, makes assumptions: it uses the length of the array to determine the last element and returns only that last element, not the first, the third, etc.):
https://github.com/OpenRefine/OpenRefine/wiki/Recipes#find-a-sub-pattern-that-exists-at-the-end-of-a-string
For example:
get(value.match(/.*(mailto:[^\"\']*.com).*/),0)
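A small Python sketch of why a single match works best with one address per cell. It assumes a made-up cell value holding two links and uses Python's re as a stand-in for OpenRefine's regex engine. The effect shown is the greedy leading .*, which is separate from the get() behaviour described above.

```python
import re

# Hypothetical cell value containing two mailto links.
cell = '<a href="mailto:a@x.com">A</a> <a href="mailto:b@y.com">B</a>'

# The greedy leading .* consumes as much as it can, so the capture group
# ends up holding the last address in the cell, not the first.
last_only = re.match(r'.*(mailto:[^"\']*\.com).*', cell).group(1)

# Scanning without the surrounding .* collects every address instead.
all_addresses = re.findall(r'mailto:([^"\']*\.com)', cell)

print(last_only)
print(all_addresses)
```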
I am using a data analysis package that exposes a Regex function for string parsing. I am trying to parse a response from a website that is in the format...
key1=val1&key2=val2&key3=val3 ...
[There is the possibility that the keys and values may be percent encoded, but the current return values are not, the current return values are tokens and other info that are alphanumeric].
I understand this data to be www-form-urlencoded, or alternatively it might be known as query string format.
The object is to extract the value for a given key, if the order of the keys cannot be relied upon. For example, I might know that one of the keys I should receive is "token", so what regex pattern can I use to extract the value for the key "token"? I have searched for this but cannot find anything that does what I need, but if there is a duplicate question, apologies in advance.
In Alteryx, you may use Tokenize with a regex containing a capturing group around the part you need to extract:
The Tokenize Method allows you to specify a regular expression to match on and that part of the string is parsed into separate columns (or rows). When using the Tokenize method, you want to match to the whole token, and if you have a marked group, only that part is returned.
The last part of the method description ("if you have a marked group, only that part is returned") shows that if there is a capturing group, only this part will be returned rather than the whole match.
Thus, you may use
(?:^|[?&])token=([^&]*)
where token can be replaced with any key whose value you want to extract.
See the regex demo.
Details
(?:^|[?&]) - the start of a string, ? or & (if the string is just a plain key-value pair string, you may omit ? and use (?:^|&) or (?<![^&]))
token - the key
= - an equal sign
([^&]*) - Group 1 (this will get extracted): 0 or more chars other than & (if you do not want to extract empty values, replace * with + quantifier).
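The pattern can be tried outside Alteryx with a short Python sketch. The helper name value_for_key and the sample response string are hypothetical. The key is passed through re.escape so that keys containing regex metacharacters are treated literally. The standard-library parse_qs call is shown only as a non-regex alternative for environments where a parser is available.

```python
import re
from urllib.parse import parse_qs

def value_for_key(query, key):
    """Return the value stored under `key`, or None if the key is absent."""
    pattern = r"(?:^|[?&])" + re.escape(key) + r"=([^&]*)"
    m = re.search(pattern, query)
    return m.group(1) if m else None

# Hypothetical response; "mytoken" checks that the anchor rejects partial key names.
response = "key1=val1&mytoken=nope&token=abc123&key3=val3"

print(value_for_key(response, "token"))
print(parse_qs(response)["token"][0])  # parser-based alternative
```

The (?:^|[?&]) prefix is what keeps mytoken from being mistaken for token, regardless of where the key sits in the string.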
I have a text field which stores a list of email addresses, e.g. x@demo.com; a.x@demo.com. I have another text field which stores the exact value matched from the list of emails, i.e. if /x@demo.com/i is in x@demo.com;a.x@demo.com then it should return x@demo.com.
The issue I am having is that if I have /a.x@demo.com/i, I will get x@demo.com instead of a.x@demo.com
I know of the regex expression /^x@demo.com$/i, but this means I can only have one email in my list of email addresses, which won't help.
I have tried a couple of other regex expressions with no luck.
Any ideas on how I can achieve this?
You can use this slightly changed regex:
/(^|;)x@demo.com($|;)/i
It will match starting either at the beginning of the string or after a semicolon, and ending either at the end of the string or at a semicolon.
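A minimal Python sketch of the same delimiter-anchored idea. The function name find_exact is hypothetical. The address is wrapped in its own capture group so that the delimiters are not part of the returned value, the address is passed through re.escape so its dots are literal, and optional whitespace is tolerated because the sample list in the question has a space after the semicolon.

```python
import re

def find_exact(address, address_list):
    """Return the list entry equal to `address` (case-insensitive), or None."""
    pattern = r"(?:^|;)\s*(" + re.escape(address) + r")\s*(?=;|$)"
    m = re.search(pattern, address_list, flags=re.IGNORECASE)
    return m.group(1) if m else None

emails = "x@demo.com; a.x@demo.com"
print(find_exact("x@demo.com", emails))
print(find_exact("a.x@demo.com", emails))
print(find_exact("x@demo.com", "a.x@demo.com"))  # no entry equals the shorter address
```

Python's re module only accepts fixed-width lookbehind, so the leading delimiter is matched with a non-capturing group here instead of a lookbehind.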
Edit:
A small change: this version uses lookbehind and lookahead, so you will get only the match you want:
(?<=^|;)x@demo.com(?=$|;)
Edit2:
To allow spaces around the semicolon and at the start and end, use this (@-quoted):
@"(?<=^\s*|;\s*)x@demo.com(?=\s*$|\s*;)"
or use double escaping:
"(?<=^\\s*|;\\s*)x@demo.com(?=\\s*$|\\s*;)"
My regular expression is like this:
.*(kgrj4e|\*)[^:]*:([^;]*);?
The 'kgrj4e' part is a userid and is dynamic. The PR.... parts are printers. If the userid is not found I want the default printer (PR12346).
For the first test string below I want the result to be PR12345, but I get PR12346
snljoe,snlaks,kgrj4e,snlbla:PR12345;*:PR12346
Note: the users snljoe, snlaks and snlbla are just examples and can be totally different. In fact the list of users can be longer or shorter.
For the second test string below I want the result to be PR12346
snljoe,snlaks,snlbla:PR12345;*:PR12346
How to fix the regular expression so both test strings give the expected result?
You can get the number with a search and replace:
Search for: ^(?:(?!.*,kgrj4e(?:[;,])).*\*:(\w+)|.*?(PR\d+).*)
Replace with: $1$2
See this demo
I assume that the kgrj4e is a user-defined value that should be missing in the string to match the last printer value. If it is present, the first printer value is returned.
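The search and replace can be checked against both test strings with a short Python sketch. The function name printer_for is hypothetical, and Python writes the replacement as \1\2 where the answer uses $1$2. As in the original pattern, the user ID is only recognised when it is preceded by a comma and followed by a comma or semicolon.

```python
import re

PATTERN = re.compile(r"^(?:(?!.*,kgrj4e(?:[;,])).*\*:(\w+)|.*?(PR\d+).*)")

def printer_for(line):
    # Only one of the two groups takes part in any match; the other one
    # is substituted as an empty string (Python 3.5 and later).
    return PATTERN.sub(r"\1\2", line)

with_user = "snljoe,snlaks,kgrj4e,snlbla:PR12345;*:PR12346"
without_user = "snljoe,snlaks,snlbla:PR12345;*:PR12346"

print(printer_for(with_user))
print(printer_for(without_user))
```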
I have this file where I only want to extract the email address and first name from our client list.
So a sample from the file:
a@abc.com,www.abc.com,2011-11-15 00:00:00,8.8.8.8,John,Doe,209 Park Rd,See,FL,33870,,,
b@abc.com,cde.com,2011-11-07 00:00:00,4.4.4.4,Erickson,Crast,136 Kua St # 1367,Pearl,HI,96782,,8084568190,
I would like to get back
a@abc.com,John
b@abc.com,Erickson
So basically email address and First Name
I know I can do this in PowerShell, but maybe a find and replace in UltraEdit will be faster
Note: you will notice some fields are not provided, so the line shows ",," where those fields were left empty when the user signed up, but the number of commas in each line is the same: 12.
So basically there are fields separated by ",". Without validating the content (email, timestamp etc. each have a certain format which could also be checked), let's just extract the values of the first and fifth fields.
So I'd suggest a replace operation where you search for
^([^,]*),[^,]*,[^,]*,[^,]*,([^,]*),.*$
and replace it with
\1 # \2
Options: "Regular Expressions: Unix".
(Just inserted the # to have a separator, although the first whitespace would be sufficient. But you'll get the idea, I assume...)
Result:
a@abc.com # John
b@abc.com # Erickson
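For anyone who wants to verify the pattern before running it in the editor, here is a Python sketch using the two sample lines from the question. Python's re.sub accepts the same \1 and \2 backreferences in the replacement. The split-based variant at the end is an alternative that is only safe under the assumption that no field contains a quoted comma.

```python
import re

lines = [
    "a@abc.com,www.abc.com,2011-11-15 00:00:00,8.8.8.8,John,Doe,209 Park Rd,See,FL,33870,,,",
    "b@abc.com,cde.com,2011-11-07 00:00:00,4.4.4.4,Erickson,Crast,136 Kua St # 1367,Pearl,HI,96782,,8084568190,",
]

# Group 1 is the first field, group 2 the fifth; the rest of the line is dropped.
pattern = re.compile(r"^([^,]*),[^,]*,[^,]*,[^,]*,([^,]*),.*$")
replaced = [pattern.sub(r"\1 # \2", line) for line in lines]

# Alternative without regex: split on commas and pick fields by index.
picked = [",".join((f[0], f[4])) for f in (line.split(",") for line in lines)]

print(replaced)
print(picked)
```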
I have a JMeter Regular Expression Extractor which searches for the following regular expression:
myId=[0-9]{10}
This retrieves the 10-digit numeric ID from my website's form. I then set a "Reference Name" of myId for the ID number. My template value is $0$ and Match No. is left blank.
In my HTTP Request, I then pass a parameter value of:
${myId}
When I run my JMeter test, it inserts text in the form of:
myId=myId=1234567890
How do I get rid of the duplicate myId=?
Not sure about JMeter's implementation of RegEx but normally
myId=[0-9]{10}
would match everything, including myId=. What you need is to define capture groups that you want extracted using () and then you will reference the capture group array and get the item you want. E.g.
myId=([0-9]{10})
Group 0 would still be the whole match, but group 1 would be just the numeric portion delimited by (), i.e. without myId=. In JMeter's Regular Expression Extractor that means changing the template from $0$ to $1$. Hope this helps.
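The difference between the two groups can be seen in a few lines of Python. The page fragment is made up, and Python's re is used only to illustrate group numbering, not JMeter's own extractor.

```python
import re

# Hypothetical fragment of the form's HTML.
page = '<a href="/edit?myId=1234567890">edit</a>'

m = re.search(r"myId=([0-9]{10})", page)
whole = m.group(0)   # the entire match, as a $0$ template would return
digits = m.group(1)  # only the parenthesised part, as a $1$ template would return

print(whole)
print(digits)
```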