Regex match multiple capture groups in any order

Given the sample string below I'm trying to capture the 'to', 'from', 'subject' and 'type' fields and spit them back out in a different format. The issue is that these fields (to, from, etc) can be in any order.
SAMPLE STRING TO REGEX ON
<cfmail to="#toAddr#" from="#fromAddress#"
subject="#subject#" type="html">
#emailMsg#
</cfmail>
OUTPUT I'M LOOKING FOR
to:toAddr, from:fromAddress, subject:subject
If I knew that the order of the fields I'm interested in was always the same, this would be pretty easy, but I'm stumped on how to do the matching if, for instance, 'from' comes before 'to'.
The Perl one-liner I have right now is (just testing with 'to' and 'subject'):
s/<cfmail.*?((to)="(.*?)")|((subject)="(.*?)").*<\/cfmail>/\1:\2, \3:\4/g
This ends up matching the 'to' value but stops there, and I don't get anything for the 'subject' value. I've tried several variations where I change the capture group setup, but have had no luck.

Do you need to allow for missing fields (e.g. no type field)? What about other fields in addition to those four? If you answered no to both questions, this regex should do the trick:
s!<cfmail(?:\s+to="(?<to>[^"]+)"|\s+from="(?<from>[^"]+)"|\s+subject="(?<subject>[^"]+)"|\s+type="(?<type>[^"]+)")+>.*?</cfmail>!to:$+{to}, from:$+{from}, subject:$+{subject}!gs
Here's the regex alone in more readable form:
<cfmail
(?:
\s+to="(?<to>[^"]+)"
|
\s+from="(?<from>[^"]+)"
|
\s+subject="(?<subject>[^"]+)"
|
\s+type="(?<type>[^"]+)"
)+
>
.*?</cfmail>
You were actually pretty close; alternation was the key. You just needed to add a quantifier.
Notice that I removed the capturing groups from the field names. You already know the names, you just need to pair them with the correct values. The named groups make that much easier.
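For readers who want to try the same idea outside Perl, here is a minimal Python sketch of the alternation-plus-quantifier approach with named groups. The function name summarize_cfmail and the group name sender are made up for illustration, and stripping the # markers is an assumption based on the output the question asks for; the Perl substitution above keeps them.

```python
import re

# One alternative per attribute; the trailing "+" lets them appear in any order.
CFMAIL = re.compile(
    r'<cfmail(?:'
    r'\s+to="(?P<to>[^"]+)"'
    r'|\s+from="(?P<sender>[^"]+)"'
    r'|\s+subject="(?P<subject>[^"]+)"'
    r'|\s+type="(?P<type>[^"]+)"'
    r')+>.*?</cfmail>',
    re.DOTALL,  # let ".*?" span the multi-line tag body
)


def summarize_cfmail(text):
    """Replace each <cfmail> block with a 'to:..., from:..., subject:...' line."""
    def render(match):
        # A group that never took part in the match is None, so fall back to "".
        # strip("#") removes the ColdFusion #...# markers (assumption, see above).
        fields = {k: (v or "").strip("#") for k, v in match.groupdict().items()}
        return "to:{to}, from:{sender}, subject:{subject}".format(**fields)

    return CFMAIL.sub(render, text)
```

Because every field has its own named group, the replacement does not care which alternative matched first.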

Related

Regex Match and exclude if contains word before match

I have the following case.
In a log there are multiple hashes that can be extracted with the following regex:
\b[a-fA-F\d]{32}\b
In this example there are 3 hashes that match the previous regex, but I want to exclude the ones whose field is named 'link' or 'value':
u'closed_by': {u'link': u'https://test.test.com/api/now/table/sys_user/175f7cc0d7989d87bc43e322c42c8da8', u'value': u'175f7cc0d7989d87bc43e322c42c8da8'}, u'sensor_name': u'175f7cc0d7989d87bc43e322c42c8da8'
I tried the following regex but it didn't work; it should match only the last hash, in 'sensor_name':
(\b[a-fA-F\d]{32}\b)((.?!'link':\s\S+\')\,|(.?!'value':\s\S+\')\},)
Note: this is only an extract of the original log. The match should be any hash except those in the 'link' field and in the 'value' field that follows it; there could be multiple fields named 'value'.
Can someone help me understand what I'm doing wrong, please?
Use this pattern; any keys you want to exclude can be added to the negative lookbehind:
(?<!link|value)': u'([\da-zA-Z]{32})
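A hedged Python sketch of the same lookbehind idea follows. Python's re module only accepts fixed-width lookbehinds, so the link|value alternation is split into two separate assertions here. The character class is also narrowed to hex digits to match the question's definition of a hash, which is my adjustment, and find_hashes is a hypothetical helper name.

```python
import re

# Python's re requires fixed-width lookbehinds, so each excluded key
# gets its own negative lookbehind instead of one (?<!link|value).
HASH_AFTER_KEY = re.compile(r"(?<!link)(?<!value)': u'([0-9a-fA-F]{32})\b")


def find_hashes(log):
    """Return 32-character hex hashes whose key is neither 'link' nor 'value'."""
    return HASH_AFTER_KEY.findall(log)
```

Adding another excluded key means adding another (?<!...) assertion in front of the quote.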

How to use Postgres Regex Replace with a capture group

As the title says, I am trying to reference capture groups in a regex replace in a Postgres query. I have read that regexp_replace does not support using regex capture groups. The regex I am using is:
r"(?:[\s\(\)\=\)\,])(username)(?:[\s\(\)\=\)\,])?"gm
The above regex almost does what I need, but I need to find out how to only allow a match if the surrounding groups also match something. There is no situation where "username" should be matched if it just happens to be a substring of a word. By ensuring it's surrounded by one of the above characters I can be much more confident that it's a username.
An example application of the regex would be something like this in postgres (of course I would be doing an update vs a select):
select *, REGEXP_REPLACE(reqcontent,'(?:[\s\(\)\=\)\,])(username)(?:[\s\(\)\=\)\,])?' ,'NEW-VALUE', 'gm') from table where column like '%username%' limit 100;
If there is any more context I can provide, please let me know. I have also found similar posts (postgresql regexp_replace: how to replace captured group with evaluated expression (adding an integer value to capture group)), but that one is more about splicing values back in and I don't think it quite answers my question.
More context and example values for the regex to work against: the text below may look familiar, as these are JQL filters in Jira. We are looking to update our usernames and all their occurrences in the table that contains the filters. Below are a few examples of filters. We originally were just doing a find and replace, but that doesn't work because some of our usernames are only two characters long and it matched non-usernames (e.g. for the username je, the new value would be placed where the word project is found, which completely malforms the JQL string, resulting in something like proNEW-VALUEct = blah blah).
type = bug AND status not in (Closed, Executed) AND assignee in (test, username)
assignee=username
assignee = username
Definition of Answered:
Regex that will only match on a 'username' if it's surrounded by one of the specials
A way to regex/replace that username in a postgres query.
Capturing groups are used to keep the important bits of information matched with a regex.
Either put capturing groups around the parts of the match you want to keep and reference them with placeholders in the replacement:
REGEXP_REPLACE(reqcontent,'([\s\(\)\=\)\,])username([\s\(\)\=\)\,])?' ,'\1NEW-VALUE\2', 'gm')
Or use lookarounds:
REGEXP_REPLACE(reqcontent,'(?<=[\s\(\)\=\)\,])(username)(?=[\s\(\)\=\)\,]|$)' ,'NEW-VALUE', 'gm')
Or, in this case, use word boundaries so that username is only replaced when it appears as a whole word:
REGEXP_REPLACE(reqcontent,'\yusername\y' ,'NEW-VALUE', 'g')
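To see why the word-boundary variant avoids the proNEW-VALUEct problem, here is a small Python analogue (not PostgreSQL): Python's \b plays the role that \y plays in PostgreSQL. The helper name replace_username is hypothetical, and the sketch assumes usernames start and end with a word character.

```python
import re


def replace_username(jql, old, new):
    """Replace `old` with `new` only where `old` stands as a whole word."""
    # re.escape guards against regex metacharacters inside the username.
    pattern = re.compile(r"\b" + re.escape(old) + r"\b")
    # A callable replacement keeps backslashes in `new` from being
    # interpreted as group references.
    return pattern.sub(lambda _match: new, jql)
```

With the two-character username je, the word project is left alone because there is no word boundary between o and j.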

Multiple Email validation in a single input field separated by ;

Currently I am writing software where a user can enter more than one email address in an input field, separated by ";".
I have a regex that validates a single email address, but it doesn't work when the field contains several addresses joined by the separator.
Has anyone ever created such a regex, or is anyone able to help me?
Thanks in advance; I'm looking forward to a response.
Here is my regex:
[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]{2,4}+(\;|)
Wrap the pattern for each subsequent address in a non-capturing group preceded by ; and let that group repeat zero or more times:
^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]{2,4}+(?:;[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]{2,4}+)*$
One more thing: the dot before the top-level domain needs to be escaped (\.), otherwise it matches any character.
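The "first address, then zero or more ;address" structure can be sketched in Python by building the list pattern from the single-address pattern, which avoids typing the long expression twice. The names below are illustrative, the possessive {2,4}+ is written as a plain {2,4}, and the address pattern itself is taken from the question rather than being a complete email validator.

```python
import re

# Single-address pattern from the question (hyphen moved to the end of the class).
EMAIL = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]{2,4}"

# One address, followed by zero or more ";address" repetitions.
EMAIL_LIST = re.compile(rf"{EMAIL}(?:;{EMAIL})*")


def is_valid_email_list(value):
    """True if `value` is one or more addresses separated by ';'."""
    # fullmatch plays the role of the ^...$ anchors in the answer.
    return EMAIL_LIST.fullmatch(value) is not None
```

A trailing separator is rejected because every ; must be followed by another address.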

Regular expression for syntax highlighting attributes in HTML tag

I'm working on regular expressions for some syntax highlighting in a Sublime/TextMate language file, and it requires that I "begin" on a non-self closing html tag, and end on the respective closing tag:
begin: (<)([a-zA-Z0-9:.]+)[^/>]*(>)
end: (</)(\2)([^>]*>)
So far, so good, I'm able to capture the tag name, and it matches to be able to apply the appropriate patterns for the area between the tags.
jsx-tag-area:
begin: (<)([a-zA-Z0-9:.]+)[^/>]*>
beginCaptures:
'1': {name: punctuation.definition.tag.begin.jsx}
'2': {name: entity.name.tag.jsx}
end: (</)(\2)([^>]*>)
endCaptures:
'1': {name: punctuation.definition.tag.begin.jsx}
'2': {name: entity.name.tag.jsx}
'3': {name: punctuation.definition.tag.end.jsx}
name: jsx.tag-area.jsx
patterns:
- {include: '#jsx'}
- {include: '#jsx-evaluated-code'}
Now I'm also looking to capture zero or more of the HTML attributes in the opening tag so that I can highlight them.
So if the tag were <div attr="Something" data-attr="test" data-foo>
It would be able to match on attr, data-attr, and data-foo, as well as the < and div
Something like (this is very rough):
(<)([a-zA-Z0-9:.]+)(?:\s(?:([0-9a-zA-Z_-]*=?))\s?)*)[^/>]*(>)
It doesn't need to be perfect, it's just for some syntax highlighting, but I was having a hard time figuring out how to achieve multiple capture groups within the tag, whether I should be using look-around, etc, or whether this is even possible with a single expression.
Edit: here are more details about the specific case / question - https://github.com/reactjs/sublime-react/issues/18
I may have found a possible solution.
It is not perfect because, as @skamazin said in the comments, if you are trying to capture an arbitrary number of attributes you will have to repeat the pattern that matches an attribute once for each attribute you want to allow, up to whatever maximum you choose.
The regex is pretty scary but it may work for your goal. Maybe it would be possible to simplify it a bit, or maybe you will have to adjust some things.
For only one attribute it looks like this:
(<)([a-zA-Z0-9:.]+)(?:(?: ((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)(?: |>))
For more attributes you will need to add this as many times as you want:
(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)(?: |>))?
So for example if you want to allow maximum 3 attributes your regex will be like this:
(<)([a-zA-Z0-9:.]+)(?:(?: ((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)(?: |>))(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)?(?: |>))?(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)?(?: |>))?
Tell me if it suits you and if you need further details.
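The reason the attribute pattern has to be repeated by hand can be seen in a short Python sketch: when a capturing group sits inside a quantified group, Python's re module keeps only the text from the last iteration. The simplified tag pattern and the name last_attribute are mine, and Sublime's own engine may differ in details.

```python
import re

# Group 2 is inside a repeated (?:...)* group, so it is overwritten on
# every iteration and ends up holding only the final attribute name.
TAG = re.compile(r'<(\w+)(?:\s+([\w-]+)(?:="[^"]*")?)*\s*>')


def last_attribute(tag_text):
    """Return what group 2 holds after matching, or None if there is no match."""
    match = TAG.match(tag_text)
    return match.group(2) if match else None
```

For the tag from the question, the earlier attribute names attr and data-attr are matched but not retained, which is why a single repeated group cannot hand each attribute to the highlighter.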
I'm unfamiliar with sublimetext or react-jsx but this to me sounds like a case of "Regex is your tool, not your solution."
A solution that uses regex as a tool for this would be something like this JsFiddle (note that the regex there is slightly obfuscated because of HTML entities like &gt; for > etc.)
Code that does the actual replacing:
blabla.replace(/(<!--(?:[^-]|-(?!->))*-->)|(<(?:(?!>).)+>)|(\{[^\}]+\})/g, function(m, c, t, a) {
    // c = comment, t = tag, a = brace block; only one of them is defined per match
    if (c !== undefined)
        return '<span class="comment">' + c + '</span>';
    if (t !== undefined)
        // second pass over the tag text wraps each attribute name
        return '<span class="tag">' + t.replace(/ [a-z_-]+=?/ig, '<span class="attr">$&</span>') + '</span>';
    if (a !== undefined)
        return a.replace(/'[^']+'/g, '<span class="quoted">$&</span>');
});
So here I'm first capturing the separate types of groups following this general pattern, adapted for this use case of HTML with curly-brace blocks. Those captures are fed to a function that determines what type of capture we're dealing with and further replaces subgroups within the capture using its own .replace() calls.
There's really no other reliable way to do this. I can't tell you how this translates to your environment but maybe this is of help.
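For comparison, the same "classify the capture, then run a second replace inside it" structure can be sketched in Python with a callable replacement. The three patterns are simplified stand-ins for the ones in the JavaScript above, all names are illustrative, and the output is not HTML-escaped.

```python
import re

# Simplified stand-ins: comment | tag | curly-brace block.
TOKEN = re.compile(r"(<!--.*?-->)|(<[^>]+>)|(\{[^}]+\})", re.DOTALL)
ATTR = re.compile(r" [a-z_-]+=?", re.IGNORECASE)
QUOTED = re.compile(r"'[^']+'")


def wrap(css_class, text):
    return '<span class="{}">{}</span>'.format(css_class, text)


def highlight(source):
    """Wrap comments, tags, attribute names and quoted strings in spans."""
    def render(match):
        comment, tag, block = match.groups()
        if comment is not None:
            return wrap("comment", comment)
        if tag is not None:
            # Second pass: mark attribute names inside the tag text.
            return wrap("tag", ATTR.sub(lambda a: wrap("attr", a.group(0)), tag))
        # Otherwise the brace block matched; mark quoted strings inside it.
        return QUOTED.sub(lambda q: wrap("quoted", q.group(0)), block)

    return TOKEN.sub(render, source)
```

The outer pattern only decides what kind of token was found; the inner substitutions do the fine-grained work.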
Regex alone doesn't seem to be good enough, but since you're working with sublime's scripting here, there's a way to simplify both the code and the process. Keep in mind, I'm a vim user and not familiar with sublime's internals - also, I usually work with javascript regexes, not PCREs (which seems to be the format used by sublime, or closest thereof).
The idea is as follows:
use a regex to get the tag, attributes (in a string) and contents of the tag
use capture groups to do further processing and matching if necessary
In this case, I made this regex:
<([a-z]+)\ ?([a-z]+=\".*?\"\ ?)?>([.\n\sa-z]*)(<\/\1>)?
It starts by finding an opening tag and creates a capture group for the tag name. If it finds a space it proceeds and matches the bulk of the attributes. Inside the \"...\" pattern I could have used \"[^\"]*?\" to match only non-quote characters, but I purposefully let it match any character up to the closing quote that precedes the >; this captures the bulk of the attributes, which we can process later. It then matches any text in between the tags and finally matches the closing tag.
It creates 4 capture groups:
tag name
attribute string
tag contents
closing tag
As you can see in this demo, if there is no closing tag we get no capture group for it, and the same goes for attributes, but we always get a capture group for the contents of the tag. In general this can be a problem (we can't assume that a captured feature will be in the same group), but it isn't here: in the conflict case where we get no attributes and no content, the 2nd capture group is empty, which we can take to mean no attributes, and the lack of a 3rd group speaks for itself. If there's nothing to parse, nothing can be parsed wrongly.
Now to parse the attributes, we can simply do it with:
([a-z]+=\"[^\"]*?\")
This gives us the attributes exactly. If sublime's scripting lets you get this far, it certainly would allow you further processing if necessary. You can of course always use something like this:
(([a-z]+)=\"([^\"]*?)\")
which will provide capture groups for the attribute as a whole and its name and value separately.
Using this approach, you should be able to parse the tags well enough for highlighting in 2-3 passes and send off the contents for highlighting to whatever highlighter you want (or just highlight it as plaintext in whatever fancy way you want).
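The multi-pass idea can be sketched in Python as follows: one regex isolates the tag name and the raw attribute string, and a second one is run over that string with findall. Both patterns are simplified versions of the ones above (I allow hyphens in attribute names so data-attr is matched), and parse_open_tag is a hypothetical helper, not something Sublime provides.

```python
import re

# Pass 1: tag name plus the whole attribute string as one capture.
OPEN_TAG = re.compile(r'<([a-z]+)((?:\s+[a-z-]+="[^"]*")*)\s*>')
# Pass 2: individual name/value pairs inside that attribute string.
ATTRIBUTE = re.compile(r'([a-z-]+)="([^"]*)"')


def parse_open_tag(text):
    """Return (tag_name, [(attr_name, attr_value), ...]) or None."""
    match = OPEN_TAG.search(text)
    if match is None:
        return None
    return match.group(1), ATTRIBUTE.findall(match.group(2))
```

Since findall returns every match, the number of attributes does not need to be known in advance.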
Your own regex was quite helpful in answering your question.
This seems to work well for me:
/(:?<|<\/)([a-zA-Z0-9:.]+)(?:\s(?:([0-9a-zA-Z_-]*=?))\s?)*[^/>]*(:?>|\/>)/g
The / at the beginning and end are just the delimiters of the regex literal. The g at the end stands for global, so it finds every tag in the input rather than only the first.
A good tool I use to figure out what I am doing wrong with my regex is: http://regexr.com/
Hope this helps!

Removing commas and empty tags from a string using regex

I am trying to filter out spam before it is posted, using a few routines and external services (Akismet), but they all seem to fail when given a comma-delimited word or a word broken up with empty tags. E.g.:
b[u][/u]u[u][/u]y[i][/i]m[b][/b] e <-> buyme
b,u,y,m,e <-> buyme
Does anyone know of a good ColdFusion regex to strip out this sort of thing before I post it to Akismet for processing?
Firstly: have you checked whether Akismet is already doing this?
I would very much suspect it already does all this processing (and more), so you don't actually need to.
Anyway, assuming this is bbcode, and thus the relevant tags will be for bold/italic/underline, you can replace them with:
TextForAkismet = rereplace( TextForAkismet , '\[([biu])\]\[/\1\]' , '' , 'all' )
If there are other empty tags you want to remove, simply update the captured group (the bit in parentheses) as appropriate. To also cater for tags that carry attributes (but are still empty), a quick and dirty way is to use [^\]]* after the tag name (outside the captured group):
'\[([biu]|img|url)[^\]]*\]\[/\1\]'
Depending on the dialect of bbcode you're working with, you may need to handle quoted brackets which would need a more complex expression.
To remove commas that appear between letters, use:
TextForAkismet = rereplace( TextForAkismet , '\b,\b' , '' , 'all' )
(Where \b matches any position between alphanumeric and non-alphanumeric.)
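The two replacements can be tried outside ColdFusion with a small Python sketch that uses the same two patterns. The function name normalize_for_spam_check is made up, and only the b/i/u tags from the first pattern are handled.

```python
import re

# An opening b/i/u tag immediately followed by its own closing tag.
EMPTY_TAG = re.compile(r"\[([biu])\]\[/\1\]")
# A comma with a word character directly on both sides.
COMMA_IN_WORD = re.compile(r"\b,\b")


def normalize_for_spam_check(text):
    """Remove empty bbcode tags, then commas wedged between letters."""
    text = EMPTY_TAG.sub("", text)
    return COMMA_IN_WORD.sub("", text)
```

Ordinary punctuation such as "one, two" is untouched, because the space after the comma means there is no word boundary on that side.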