Ultraedit, regular expression help, extracting 2 values, comma separated - regex

I have this file where I only want to extract the email address and first name from our client list.
So a sample from the file:
a#abc.com,www.abc.com,2011-11-15 00:00:00,8.8.8.8,John,Doe,209 Park Rd,See,FL,33870,,,
b#abc.com,cde.com,2011-11-07 00:00:00,4.4.4.4,Erickson,Crast,136 Kua St # 1367,Pearl,HI,96782,,8084568190,
I would like to get back
a#abc.com,John
b#abc.com,Erickson
So basically email address and First Name
I know I can do this in powershell but maybe a find and replace in ultraedit will be faster
Note: you will notice some fields are not provided, so the line shows ",,", meaning those fields were left empty when the user signed up. The number of commas in each line is always the same, 12 being the count.

So basically there are fields separated by ",". Without validating the content (the email, timestamp etc. each have a certain format which could also be checked), let's just extract the values of the first and fifth fields.
so I'd suggest
a Replace-Operation where you search for
^([^,]*),[^,]*,[^,]*,[^,]*,([^,]*),.*$
and replace it with
\1 # \2
Options: "Regular Expressions: Unix".
(Just inserted the # to have a separator, although the first whitespace would be sufficient. But you'll get the idea, I assume...)
Result:
a#abc.com # John
b#abc.com # Erickson
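If a script turns out to be easier than the editor, the same pattern can be sketched in Python. This is only an illustration: the function name email_and_first_name is made up, and it joins the two fields with a comma (as asked for in the question) rather than " # ".

```python
import re

# Same pattern as above: capture field 1, skip fields 2-4, capture field 5.
PATTERN = re.compile(r"^([^,]*),[^,]*,[^,]*,[^,]*,([^,]*),.*$")

def email_and_first_name(line):
    """Return 'email,first_name' for one record, or None if the line does not match."""
    m = PATTERN.match(line)
    return f"{m.group(1)},{m.group(2)}" if m else None

# The two sample records from the question.
sample = [
    "a#abc.com,www.abc.com,2011-11-15 00:00:00,8.8.8.8,John,Doe,209 Park Rd,See,FL,33870,,,",
    "b#abc.com,cde.com,2011-11-07 00:00:00,4.4.4.4,Erickson,Crast,136 Kua St # 1367,Pearl,HI,96782,,8084568190,",
]
results = [email_and_first_name(line) for line in sample]
```

Like the search-and-replace above, this assumes no field contains a comma of its own (no quoted CSV values).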

Related

How to use Postgres Regex Replace with a capture group

As the title says, I am trying to reference a capture group in a regex replace in a Postgres query. I have read that regexp_replace does not support using regex capture groups. The regex I am using is
r"(?:[\s\(\)\=\)\,])(username)(?:[\s\(\)\=\)\,])?"gm
The above regex almost does what I need it to but I need to find out how to only allow a match if the capture groups also capture something. There is no situation where a "username" should be matched if it just so happens to be a substring of a word. By ensuring its surrounded by one of the above I can much more confidently ensure its a username.
An example application of the regex would be something like this in postgres (of course I would be doing an update vs a select):
select *, REGEXP_REPLACE(reqcontent,'(?:[\s\(\)\=\)\,])(username)(?:[\s\(\)\=\)\,])?' ,'NEW-VALUE', 'gm') from table where column like '%username%' limit 100;
If there is any more context that can be provided please let me know. I have also found similar posts (postgresql regexp_replace: how to replace captured group with evaluated expression (adding an integer value to capture group)) but that talks more about splicing in values back in and I don't think quite answers my question.
More context and example values for the regex to work against. The text below may look familiar: these are JQL filters in Jira. We are looking to update our usernames and all their occurrences in the table that contains the filters. Below are a few examples of filters. We originally were just doing a find and replace, but that doesn't work because some of our usernames are only two characters long and it was matching on non-usernames (e.g. for the username je, the new value would be placed where the word project is found, which completely malforms the JQL string, resulting in something like proNEW-VALUEct = blah blah).
type = bug AND status not in (Closed, Executed) AND assignee in (test, username)
assignee=username
assignee = username
Definition of Answered:
Regex that will only match on a 'username' if its surrounded by one of the specials
A way to regex/replace that username in a postgres query.
Capturing groups are used to keep the important bits of information matched with a regex.
Either put capturing groups around the parts of the match you want to keep in the result and reference them with placeholders in the replacement:
REGEXP_REPLACE(reqcontent,'([\s\(\)\=\)\,])username([\s\(\)\=\)\,])?' ,'\1NEW-VALUE\2', 'gm')
Or use lookarounds:
REGEXP_REPLACE(reqcontent,'(?<=[\s\(\)\=\)\,])(username)(?=[\s\(\)\=\)\,])?' ,'NEW-VALUE', 'gm')
Or, in this case, use word boundaries to ensure you only replace a word when inside special characters:
REGEXP_REPLACE(reqcontent,'\yusername\y' ,'NEW-VALUE', 'g')
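To check the word-boundary idea outside the database, here is a minimal Python sketch. It assumes Python's \b behaves closely enough to PostgreSQL's \y for these inputs; replace_username is a hypothetical helper, not part of any library.

```python
import re

def replace_username(text, old, new):
    """Replace `old` only where it stands as a whole word."""
    # \b is Python's word-boundary assertion (the role \y plays in PostgreSQL).
    pattern = r"\b" + re.escape(old) + r"\b"
    # A function replacement keeps backslashes in `new` from being
    # interpreted as group references.
    return re.sub(pattern, lambda m: new, text)

filters = [
    "type = bug AND status not in (Closed, Executed) AND assignee in (test, username)",
    "assignee=username",
    "assignee = username",
]
updated = [replace_username(f, "username", "NEW-VALUE") for f in filters]

# The two-character case from the question: "je" inside "project" is left alone.
short = replace_username("project = X AND assignee = je", "je", "NEW-VALUE")
```

One caveat: a word boundary also sits next to characters such as "-" or ".", so a username that is part of a longer hyphenated or dotted name would still be replaced.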

Get an exact regex match of an email value from a list of email addresses

I have a text field which stores a list of email addresses e.g: x#demo.com; a.x#demo.com. I have another text field which stores the exact value matched from the list of emails i.e. if /x#demo.com/i is in x#demo.com;a.x#demo.com then it should return x#demo.com.
The issue I am having is that if I have /a.x#demo.com/i, I will get x#demo.com instead of a.x#demo.com
I know of the regex expression /^x#demo.com$/i, but this means I can only have one email in my list of email addresses which won't help.
I have tried a couple of other regex expressions with no luck.
Any ideas on how I can achieve this?
You can use this slightly changed regex:
/(^|;)x#demo.com($|;)/i
It will match from either beginning of string or start after a semi colon and end either at end of string or at a semi colon.
Edit:
Small change: this uses lookbehind and lookahead, so the match contains only the email address you want:
(?<=^|;)x#demo.com(?=$|;)
Edit2:
To allow spaces around the semicolon and at the start and end, use this (as an @-quoted verbatim string):
@"(?<=^\s*|;\s*)x#demo.com(?=\s*$|\s*;)"
or use double escaping:
"(?<=^\\s*|;\\s*)x#demo.com(?=\\s*$|\\s*;)"

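For comparison, the same idea can be sketched in Python. Python's re module only accepts fixed-width lookbehind, so this sketch falls back to the (^|;) form with a capture group instead of the variable-width lookbehind shown above. The helper find_exact_email is a made-up name, and re.escape is used so the dots in the address match literally.

```python
import re

def find_exact_email(email_list, wanted):
    """Return the list entry equal to `wanted` (case-insensitive), or None."""
    # Entry must start at the beginning of the string or after a semicolon,
    # and end at a semicolon or the end of the string; spaces are tolerated.
    pattern = r"(?:^|;)\s*(" + re.escape(wanted) + r")\s*(?=;|$)"
    m = re.search(pattern, email_list, flags=re.IGNORECASE)
    return m.group(1) if m else None

emails = "x#demo.com; a.x#demo.com"
first = find_exact_email(emails, "x#demo.com")
second = find_exact_email(emails, "a.x#demo.com")
missing = find_exact_email("a.x#demo.com;b#demo.com", "x#demo.com")
```

Here first is x#demo.com, second is a.x#demo.com, and missing is None because x#demo.com only occurs as the tail of a longer address.
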
Powershell: Using regex to match patterns and its variations (which has special characters)

I have several thousand text files containing form information (one text file for each form), including the unique id of each form.
I have been trying to extract just the form id using regex (which I am not too familiar with) to match the string of characters found before and after the form id and extract only the form ID number in between them. Usually the text looks like this: "... 12 ID 12345678 INDEPENDENT BOARD..."
The bolded 8-digit number is the form ID that I need to extract.
The code I used can be seen below:
$id= ([regex]::Match($text_file, "12 ID (.+) INDEPENDENT").Groups[1].Value)
This works pretty well, but I soon noticed that there were some files for which this script did not work. After investigation, I found that there was another variation to the text containing the form ID used by some of the text files. This variation looks like this: "... 12 ID 12345678 (a.12(3)(b),45)..."
So my first challenge is to figure out how to change the script so that it will match the first or the second pattern. My second challenge is to escape all the special characters in "(a.12(3)(b),45)".
I know that the pipe | is used as an "or" in regex and two backslashes are used to escape special characters, however the code below gives me errors:
$id= ([regex]::Match($text_one_line, "34 PR (.+) INDEPENDENT"|"34 PR (.+) //(a//.12//(3//)//(b//)//,45//)").Groups[1].Value)
Where have I gone wrong here and how I can fix my code?
Thank you!
When you approach a regex pattern always look for fixed vs. variable parts.
In your case the ID seems to be fixed, and it is, therefore, useful as a reference point.
The following pattern applies this suggestion: (?:ID\s+)(\d{8})
$str = "... 12 ID 12345678 INDEPENDENT BOARD..."
$ret = [Regex]::Matches($str, "(?:ID\s+)(\d{8})")
for($i = 0; $i -lt $ret.Count; $i++) {
    # Index with $i so every match is printed, not just the first one
    $ret[$i].Groups[1].Value
}
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. It contains a treasure trove of useful information.
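The same anchor-on-the-fixed-part pattern can be tried against both variations of the text. The sketch below is in Python rather than PowerShell, and extract_form_id is a hypothetical helper name.

```python
import re

# Anchor on the fixed "ID" marker and capture the 8 digits that follow it.
ID_PATTERN = re.compile(r"ID\s+(\d{8})")

def extract_form_id(text):
    """Return the first 8-digit form ID following 'ID', or None if absent."""
    m = ID_PATTERN.search(text)
    return m.group(1) if m else None

# Both variations quoted in the question.
samples = [
    "... 12 ID 12345678 INDEPENDENT BOARD...",
    "... 12 ID 12345678 (a.12(3)(b),45)...",
]
ids = [extract_form_id(s) for s in samples]
```

Because the pattern never looks at what comes after the digits, the special characters in "(a.12(3)(b),45)" do not need to be escaped at all.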

OpenRefine Regex and GREL match error

Inside openRefine I want to run the below regex on a website's source that finds email addresses with a mailto link. My trouble is when running value.match, I get this error:
Parsing error at offset 12: Bad regular expression (Unclosed character class near index 10 .*mailto:[^ ^)
I have tested the expression in another environment without value.match and it works.
value.match(/.*mailto:[^/"/']*.com.*/)
So if you have text like:
Blah blah mail me
To extract the email address using the match function in OpenRefine you need to use:
value.match(/.*mailto:([^\"\']*.com).*/)
This will give an array containing the email address, which is captured using a capture group. To extract the email address from the array (which is necessary if you want to store the mail address in an OpenRefine cell) you need to get the string value from the array. e.g.:
value.match(/.*mailto:([^\"\']*.com).*/)[0]
The difference between your original expression and this one is that the characters are escaped correctly and there is a capture group - basically implementing the advice from @LukStorms in the comments above.
isNotNull(value.match(/.*mailto:[^\"\']*.com.*/))
As described on our reference page for the match() function, it returns an array of the capture groups in your regex pattern, and isNotNull() then outputs true or false depending on whether your value matches that pattern:
https://github.com/OpenRefine/OpenRefine/wiki/GREL-String-Functions#matchstring-s-regexp-p
also described here: https://github.com/OpenRefine/OpenRefine/wiki/Understanding-Regular-Expressions#basic-examples
You can also use get() as described in the Recipes on our wiki, BUT it will only work well if you have only 1 email address per cell (it's because the get() function, without a from or to number, makes assumptions and uses the length of the array to determine the last element and pushes out only the last element, not the first, or third, etc.):
https://github.com/OpenRefine/OpenRefine/wiki/Recipes#find-a-sub-pattern-that-exists-at-the-end-of-a-string
For example:
get(value.match(/.*(mailto:[^\"\']*.com).*/),0)
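The capture-group behaviour is easy to check outside OpenRefine. This is a rough Python sketch, not GREL: the sample HTML and the address in it are invented for illustration, extract_mailto is a hypothetical helper, and the dot before "com" is escaped here, which the expressions above do not do.

```python
import re

# Capture everything after "mailto:" up to ".com", stopping at any quote.
MAILTO = re.compile(r"mailto:([^\"']*\.com)")

def extract_mailto(html):
    """Return the first mailto address ending in .com, or None."""
    m = MAILTO.search(html)
    # group(1) is the first capture group, like index 0 of the GREL array.
    return m.group(1) if m else None

# Hypothetical page source; the original sample lost its markup.
sample_html = 'Blah blah <a href="mailto:someone@example.com">mail me</a>'
address = extract_mailto(sample_html)
```

With this input, address holds someone@example.com; a source without a mailto link gives None.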

Django custom template filter to highlight a column in a block of text

I'm rendering a list in an HTML template using {{ my_list|join:"<br>" }}, and it appears as...
$GPGGA,062511,2816.8178,S,15322.3185,E,6,04,2.6,72.6,M,37.5,M,,*68
$GPGGA,062512,2816.8177,S,15322.3184,E,1,04,2.6,72.6,M,37.5,M,,*62
$GPGGA,062513,2816.8176,S,15322.3181,E,1,04,2.6,72.6,M,37.5,M,,*67
$GPGGA,062514,2816.8176,S,15322.3180,E,1,03,2.6,72.6,M,37.5,M,,*66
$GPGGA,062515,2816.8176,S,15322.3180,E,6,03,2.6,72.6,M,37.5,M,,*60
I am attempting to use regular expressions to insert the CSS at the 4th and 5th commas so I can highlight the text in this column, however I'm not able to figure out the expression to do this. Other methods to achieve this also appreciated.
Other info:
1) each line ends with a '\n'. Although this can be removed and the HTML display is unchanged, I've left it in for the regular expression to use if required.
2) The string will not always have a nice header such as '$GPGGA' in this example, although I could add one to help ID the start of the line if required by the regex.
3) The columns may not be a uniform number of characters as indicated in this example.
The filters I'm working on are as follows
import re
from django import template
register = template.Library()

@register.filter(is_safe=True)
def highight_start(text):
    return re.sub('regex to find 4th comma in each line', ",<span class='my_highlight'>", text, flags=re.MULTILINE)

@register.filter(is_safe=True)
def highight_end(text):
    return re.sub('regex to find 5th comma in each line', "</span>,", text, flags=re.MULTILINE)
Regards
You can achieve that by replacing the 5th value with the value itself wrapped in your <span> tags.
RegEx: ^((?:[\w\d\.\$]+,){4})([\d\.]+)
Replacement: \1<span class='my_highlight'>\2</span>
Explained demo here: http://regex101.com/r/cX5iA0
Note: I assumed the 5th value will be digits and dots
Thanks @ka, who got me on track with this solution. My working filter uses:
expression = '^((?:[^,]+,){4})([^,]+)'
replace = r'\g<1><span class="my_highlight">\g<2></span>'
#[^,] also allows matching of hidden HTML tags in the text
#To get the groups to insert back into the text and not be overwritten, they need to be referenced as indicated in 'replace'.
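Put together outside Django, the working expression can be tried like this. It is a minimal sketch using only the re module; highlight_fifth_column is a made-up name, and the sample is two of the lines from the question.

```python
import re

# Group 1: the first four comma-terminated fields; group 2: the fifth field.
EXPRESSION = r"^((?:[^,]+,){4})([^,]+)"
REPLACE = r'\g<1><span class="my_highlight">\g<2></span>'

def highlight_fifth_column(text):
    """Wrap the fifth comma-separated value of every line in a span."""
    # re.MULTILINE makes ^ match at the start of each line, not just the text.
    return re.sub(EXPRESSION, REPLACE, text, flags=re.MULTILINE)

sample = (
    "$GPGGA,062511,2816.8178,S,15322.3185,E,6,04,2.6,72.6,M,37.5,M,,*68\n"
    "$GPGGA,062512,2816.8177,S,15322.3184,E,1,04,2.6,72.6,M,37.5,M,,*62\n"
)
highlighted = highlight_fifth_column(sample)
```

Note that [^,] also matches a newline, so a line with fewer than five fields could let the match run on into the next line; the sketch assumes every line has at least five non-empty fields.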