Regex Optional Conditional Exact Match?

I have a Regex that looks like this:
(?<Number>\d{3})-?(?<Hand>R?L?)[-\s]?(?<Description>.*?)?(?<ShnOpp>SHN|OPP)?$
With some sample data:
104-RL-BLAH BLA SHN
104-RL FOO OPP
102-RL-BAR WL74
102-BAR WL74
102-R-BAR WL74 SHN
102-R-BAR WL74 OPP
So, the named group Hand can either contain RL|R|L|{Blank}.
But I want to match ShnOpp with SHN|OPP if and only if Hand="RL"; otherwise it should just stay part of the description. So, can I use a literal IF condition within my regex?
Either my Googling skills failed me or maybe you just can't do it, but I'd love to be proved wrong.
Here's a link to a working sample: https://regex101.com/r/wGghbV/2

You can't use a conditional to check that a certain group captured one exact text. However, you can still use a conditional here by adding a new group that only matches RL, like:
(?<Number>\d{3})-?(?<Hand>(?<RL>RL)|[RL]?)[ \-]?(?<Description>.*?)[ \-]?(?(RL)(?<ShnOpp>SHN|OPP)?)$
Your updated sample: https://regex101.com/r/wGghbV/3
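As a rough illustration of the same idea outside regex101, here is a minimal Python sketch. It assumes Python's re module, which spells named groups (?P<name>...) instead of (?<name>...) but accepts the same (?(RL)...) conditional. The parse helper and its name are made up for this example.

```python
import re

# Same pattern as above, rewritten with Python's (?P<name>...) group syntax.
# The extra RL group only participates when Hand is exactly "RL", and the
# conditional (?(RL)...) tries SHN|OPP only in that case.
PATTERN = re.compile(
    r"(?P<Number>\d{3})-?(?P<Hand>(?P<RL>RL)|[RL]?)[ \-]?"
    r"(?P<Description>.*?)[ \-]?(?(RL)(?P<ShnOpp>SHN|OPP)?)$"
)


def parse(line):
    """Hypothetical helper: return the named groups for one line, or None."""
    m = PATTERN.match(line)
    return m.groupdict() if m else None


for sample in ["104-RL-BLAH BLA SHN", "102-R-BAR WL74 SHN", "102-BAR WL74"]:
    print(sample, "->", parse(sample))
```

When Hand is R, L or blank, the RL group is unset, so the conditional matches nothing and a trailing SHN or OPP stays in Description.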

Related

How to find "complicated" URLs in a text file

I'm using the following regex to find URLs in a text file:
/http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+/
It outputs the following:
http://rda.ucar.edu/datasets/ds117.0/.
http://rda.ucar.edu/datasets/ds111.1/.
http://www.discover-earth.org/index.html).
http://community.eosdis.nasa.gov/measures/).
Ideally they would print out this:
http://rda.ucar.edu/datasets/ds117.0/
http://rda.ucar.edu/datasets/ds111.1/
http://www.discover-earth.org/index.html
http://community.eosdis.nasa.gov/measures/
Any ideas on how I should tweak my regex?
Thank you in advance!
UPDATE - Example of the text would be:
this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/).
This trims the trailing ) and . characters from your output:
import re
# your_output holds the URLs found so far, one per line
regx = re.compile(r'(?m)[\.\)]+$')
print(regx.sub('', your_output))
This regex also seems to work for extracting the URLs from your original sample text:
https?:[\S]*\/(?:\w+(?:\.\w+)?)?
Demo (edited from https?:[\S]*\/)
A Python script might look something like this:
import re
ss = """ this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/). """
regx = re.compile(r'https?:[\S]*\/(?:\w+(?:\.\w+)?)?')
for m in regx.findall(ss):
    print(m)
So for the urls you have here:
https://regex101.com/r/uSlkcQ/4
Pattern explanation:
Protocols (e.g. https://)
^[A-Za-z]{3,9}:(?://)
Look for recurring .[-;:&=+\$,\w]+-class (www.sub.domain.com)
(?:[\-;:&=\+\$,\w]+\.?)+
Look for recurring /[\-;:&=\+\$,\w\.]+ (/some.path/to/somewhere)
(?:\/[\-;:&=\+\$,\w\.]+)+
Now, for your special case: ensure that the last character is not a dot or a parenthesis, using negative lookahead
(?!\.|\)).
The full pattern is then
^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).
There are a few things to improve or change in your existing regex to allow this to work:
http[s]? can be changed to https?. They're identical. No use putting s in its own character class
[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),] You can shorten this entire thing and combine character classes instead of using | between them. This not only improves performance, but also allows you to combine certain ranges into existing character class tokens. Simplifying this, we get [a-zA-Z0-9$-_#.&+!*\(\),]
We can go one step further: a-zA-Z0-9_ is the same as \w (for ASCII input). So we can replace those in the character class to get [\w$-#.&+!*\(\),] (the leftover - is dealt with in the next step)
In the original regex we have $-_. This creates a range, so it actually includes everything between $ and _ in the ASCII table. This will cause unwanted characters to be matched: $%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_. There are a few options to fix this:
[-\w$#.&+!*\(\),] Place - at the start of the character class
[\w$#.&+!*\(\),-] Place - at the end of the character class
[\w$\-#.&+!*\(\),] Escape - such that you have \- instead
You don't need to escape ( and ) in the character class: [\w$#.&+!*(),-]
[0-9a-fA-F][0-9a-fA-F] You don't need to specify [0-9a-fA-F] twice. Just use a quantifier like so: [0-9a-fA-F]{2}
(?:%[0-9a-fA-F][0-9a-fA-F]) The non-capture group isn't actually needed here, so we can drop it (it adds another step that the regex engine needs to perform, which is unnecessary)
So the result of just simplifying your existing regex is the following:
https?://(?:[$\w#.&+!*(),-]|%[0-9a-fA-F]{2})+
Now you'll notice it doesn't match / so we need to add that to the character class. Your regex was matching this originally because it has an improper range $-_.
https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+
Unfortunately, even with this change, it'll still match ). at the end. That's because your regex isn't told to stop matching after /. Even implementing this will now cause it to not match file names like index.html. So a better solution is needed. If you give me a couple of days, I'm working on a fully functional RFC-compliant regex that matches URLs. I figured, in the meantime, I would at least explain why your regex isn't working as you'd expect it to.
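The effect of the improper range $-_ described above can be checked directly. The following is a small sketch, assuming Python's re module; the matched_chars helper is made up for illustration and simply tests every printable ASCII character against a character class.

```python
import re


def matched_chars(char_class):
    """Return every printable ASCII character that the given class matches."""
    pattern = re.compile(char_class)
    return "".join(
        chr(code) for code in range(32, 127) if pattern.fullmatch(chr(code))
    )


# "$-_" in the middle of a class is a range from "$" to "_".
as_range = matched_chars(r"[$-_]")
# With "-" moved to the end, the class holds three literal characters.
as_literals = matched_chars(r"[$_-]")

print(as_range)
print(as_literals)
print("/" in as_range, "/" in as_literals)
```

This is why the original regex matched / without listing it, and why / has to be added explicitly once the range is removed.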
Thanks all for the responses. A coworker ended up helping me with it. Here is the solution:
des_links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', des)
for i in des_links:
    # keep only the part of each match before its last slash
    tmps = "/".join(i.split('/')[0:-1])
    print(tmps)

Google sheets REGEXREPLACE() replace text with itself

In Google Sheets, I'd like to replace the following text:
The ANS consists of fibers that stimulate ((1smooth muscle)), ((2cardiac muscle)), and ((3glandular cells)).
with this text below:
The ANS consists of fibers that stimulate {{c1::smooth muscle}}, {{c2::cardiac muscle}}, and {{c3::glandular cells}}.
I know if I use =REGEXREPLACE(E3, "\(\([0-9]*", "{{c::") I can get here:
The ANS consists of fibers that stimulate {{c::smooth muscle)), {{c::cardiac muscle)), and {{c::glandular cells)).
BUT I don't know how to keep the original numbers
Nvm, figured it out.
Putting parentheses around the term allows you to reference it again in your replacement string.
For example, my solution for my problem was this:
=REGEXREPLACE(E3, "\(\(([0-9]*)", "{{c$1::")
This works because putting [0-9]* in parentheses like so: ([0-9]*) allowed it to be referenced as $1 in my substitution string.
I assume that if I had another phrase enclosed in parentheses after that it would be able to be referenced with $2.
Hope this helps someone in the future.
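The same capture-group idea can be tried outside Google Sheets. The sketch below uses Python's re module purely for illustration (it is not Sheets, and it writes backreferences as \1 and \2 rather than $1 and $2). It also uses a second group for the text, which is an addition to the formula above, so that the closing parentheses are handled in the same pass.

```python
import re

text = (
    "The ANS consists of fibers that stimulate ((1smooth muscle)), "
    "((2cardiac muscle)), and ((3glandular cells))."
)

# Group 1 captures the number, group 2 the text up to the closing "))".
# Both groups are reused in the replacement string.
result = re.sub(r"\(\(([0-9]+)(.+?)\)\)", r"{{c\1::\2}}", text)
print(result)
```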
Pass 1:
Search Pattern: \(\((\d)(.+)
Replacement: {{C$1::$2
Pass 2:
Search Pattern: \)\)
Replacement: }}
I played around with it some more and this will do it in one pass:
Search Pattern: \(\((\d)(.+?)\b\)\)
Replacement: {{C$1::$2}}

Regex that matches a pattern only if string does not begin with 'N'

I need to put together a regex that matches a pattern only if the string does not begin with 'N'.
Here is my pattern so far [A-E]+[-+]?.
Now I want to make sure that it does not match something like:
N\A
NA
NB+
NB-
NCAB
This is for REGEXP_SUBSTR command in Oracle SQL DB
UPDATE
It looks like I should have been more specific, sorry
I want to extract from a string [A-E]+[-+]? but if the string also matches ^(N|n) then I want my regex to return nothing.
See examples below:
String    Returns
N/A       (nothing)
F1/AAA    AAA
NABC      (nothing)
FABC      ABC
To match a character between A and E not preceded by N, you can use:
([^N]|^)[A-E]+
If you want to avoid fields that contain N[A-E], add a negated condition to your query using the pattern N[A-E] (in other words, use two predicates: this one to exclude NA and the first to find A)
To be more clear:
WHERE NOT REGEXP_LIKE(coln, 'N[A-E]') AND REGEXP_LIKE(coln, '[A-E]')
OK, I figured it out. I broadened the scope of the problem a little: I realized that I can also use the other parameters of REGEXP_SUBSTR, so that only the second subexpression is returned.
REGEXP_SUBSTR(field1, '^([^NA-D][^A-D]*)?([A-D]+[-+]?)',1,1,'i',2)
I still have to give you guys the credit; a lot of good ideas led me here.
Just throw a [^N]? in front. That should do it.
OOPS...
That actually needs to include an " OR ^ "...
It should look like this:
([^N]|^)[A-E]+[-+]?
Sorry about that...It looks like the right answer already got posted anyway.
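For comparison, the behaviour asked for in the update (return nothing when the string starts with N or n, otherwise the first [A-E]+[-+]? run) can be sketched in Python as two separate checks. This is only an illustration of the logic, not Oracle syntax, and the extract_grade name is invented for the example.

```python
import re


def extract_grade(value):
    """Return the first [A-E]+[-+]? run, or None if value starts with N/n."""
    # First predicate: reject anything that begins with N or n.
    if re.match(r"[Nn]", value):
        return None
    # Second predicate: take the first run of A-E with an optional sign.
    m = re.search(r"[A-E]+[-+]?", value)
    return m.group(0) if m else None


for sample in ["N/A", "F1/AAA", "NABC", "FABC"]:
    print(repr(sample), "->", extract_grade(sample))
```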

Regular expression for syntax highlighting attributes in HTML tag

I'm working on regular expressions for some syntax highlighting in a Sublime/TextMate language file, and it requires that I "begin" on a non-self closing html tag, and end on the respective closing tag:
begin: (<)([a-zA-Z0-9:.]+)[^/>]*(>)
end: (</)(\2)([^>]*>)
So far, so good, I'm able to capture the tag name, and it matches to be able to apply the appropriate patterns for the area between the tags.
jsx-tag-area:
  begin: (<)([a-zA-Z0-9:.]+)[^/>]*>
  beginCaptures:
    '1': {name: punctuation.definition.tag.begin.jsx}
    '2': {name: entity.name.tag.jsx}
  end: (</)(\2)([^>]*>)
  endCaptures:
    '1': {name: punctuation.definition.tag.begin.jsx}
    '2': {name: entity.name.tag.jsx}
    '3': {name: punctuation.definition.tag.end.jsx}
  name: jsx.tag-area.jsx
  patterns:
  - {include: '#jsx'}
  - {include: '#jsx-evaluated-code'}
Now I'm also looking to capture zero or more of the html attributes in the opening tag, to be able to highlight them.
So if the tag were <div attr="Something" data-attr="test" data-foo>
It would be able to match on attr, data-attr, and data-foo, as well as the < and div
Something like (this is very rough):
(<)([a-zA-Z0-9:.]+)(?:\s(?:([0-9a-zA-Z_-]*=?))\s?)*)[^/>]*(>)
It doesn't need to be perfect, it's just for some syntax highlighting, but I was having a hard time figuring out how to achieve multiple capture groups within the tag, whether I should be using look-around, etc, or whether this is even possible with a single expression.
Edit: here are more details about the specific case / question - https://github.com/reactjs/sublime-react/issues/18
I may have found a possible solution.
It is not perfect because, as @skamazin said in the comments, if you are trying to capture an arbitrary number of attributes you will have to repeat the pattern that matches an attribute once for every attribute you want to allow, up to whatever maximum you choose.
The regex is pretty scary but it may work for your goal. Maybe it would be possible to simplify it a bit or maybe you will have to adjust some things
For only one attribute it will be as this:
(<)([a-zA-Z0-9:.]+)(?:(?: ((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)(?: |>))
For more attributes you will need to add this as many times as you want:
(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)(?: |>))?
So for example if you want to allow maximum 3 attributes your regex will be like this:
(<)([a-zA-Z0-9:.]+)(?:(?: ((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)(?: |>))(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)?(?: |>))?(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)?(?: |>))?
Tell me if it suits you and if you need further details.
I'm unfamiliar with sublimetext or react-jsx but this to me sounds like a case of "Regex is your tool, not your solution."
A solution that uses regex as a tool for this would be something like this JsFiddle (note that the regex is slightly obfuscated because of html-entities like &gt; for > etc.)
Code that does the actual replacing:
blabla.replace(/(<!--(?:[^-]|-(?!->))*-->)|(<(?:(?!>).)+>)|(\{[^\}]+\})/g, function(m, c, t, a) {
    if (c != undefined)
        return '<span class="comment">' + c + '</span>';
    if (t != undefined)
        return '<span class="tag">' + t.replace(/ [a-z_-]+=?/ig, '<span class="attr">$&</span>') + '</span>';
    if (a != undefined)
        return a.replace(/'[^']+'/g, '<span class="quoted">$&</span>');
});
So here I'm first capturing the separate types of groups following this general pattern, adapted for this use case of HTML with curly-brace blocks. Those captures are fed to a function that determines what type of capture we're dealing with and further replaces subgroups within the capture using its own .replace() calls.
There's really no other reliable way to do this. I can't tell you how this translates to your environment but maybe this is of help.
Regex alone doesn't seem to be good enough, but since you're working with sublime's scripting here, there's a way to simplify both the code and the process. Keep in mind, I'm a vim user and not familiar with sublime's internals - also, I usually work with javascript regexes, not PCREs (which seems to be the format used by sublime, or closest thereof).
The idea is as follows:
use a regex to get the tag, attributes (in a string) and contents of the tag
use capture groups to do further processing and matching if necessary
In this case, I made this regex:
<([a-z]+)\ ?([a-z]+=\".*?\"\ ?)?>([.\n\sa-z]*)(<\/\1>)?
It starts by finding an opening tag and creates a capture group for the tag name. If it finds a space it proceeds and matches the bulk of the attributes. Inside the \"...\" pattern I could have used \"[^\"]*?\" to match only non-quote characters, but I purposefully let .*? run across quotes until it reaches a closing quote that is followed by >; this is to match the bulk of the attributes, which we can process later. It then matches any text in between the tags and finally matches the closing tag.
It creates 4 capture groups:
tag name
attribute string
tag contents
closing tag
As you can see in this demo, if there is no closing tag we get no capture group for it, and the same goes for attributes, but we always get a capture group for the contents of the tag. This can be a problem in general (since we can't assume that a captured feature will be in the same group), but it isn't here: in the conflicting case where we get no attributes and no content, the 2nd capture group is empty, so we can just assume it means no attributes, and the lack of a 3rd group speaks for itself. If there's nothing to parse, nothing can be parsed wrongly.
Now to parse the attributes, we can simply do it with:
([a-z]+=\"[^\"]*?\")
demo here. This gives us the attributes exactly. If sublime's scripting lets you get this far, it certainly would allow you further processing if necessary. You can of course always use something like this:
(([a-z]+)=\"([^\"]*?)\")
which will provide capture groups for the attribute as a whole and its name and value separately.
Using this approach, you should be able to parse the tags well enough for highlighting in 2-3 passes and send off the contents for highlighting to whatever highlighter you want (or just highlight it as plaintext in whatever fancy way you want).
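The two-pass idea described above can be sketched in Python, used here only as a neutral way to try the regexes; Sublime's grammar files work differently. The TAG and ATTR patterns and the parse_tag helper are simplified assumptions for illustration, not the exact expressions from this answer: the first pass grabs the tag name and the raw attribute string, the second walks the attributes one by one.

```python
import re

# Pass 1: tag name plus everything up to the closing ">" as one raw string.
TAG = re.compile(r"<([a-zA-Z0-9:.]+)([^>]*)>")
# Pass 2: an attribute name, optionally followed by ="value".
ATTR = re.compile(r'([0-9a-zA-Z_-]+)(?:="([^"]*)")?')


def parse_tag(tag):
    """Hypothetical helper: return (tag name, [(attr name, value or None)])."""
    m = TAG.match(tag)
    if m is None:
        return None
    name, attr_string = m.groups()
    attrs = [(a.group(1), a.group(2)) for a in ATTR.finditer(attr_string)]
    return name, attrs


print(parse_tag('<div attr="Something" data-attr="test" data-foo>'))
```

Because the second pass runs over a plain string, the number of attributes no longer has to be fixed in the pattern. Attribute values containing > would break the first pass in this simplified form.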
Your own regex was quite helpful in answering your question.
This seems to work well for me:
/(:?<|<\/)([a-zA-Z0-9:.]+)(?:\s(?:([0-9a-zA-Z_-]*=?))\s?)*[^/>]*(:?>|\/>)/g
The / at the beginning and end are just the wrappers regex usually requires. In addition, the g at the end stands for global, so it works for repetitions as well.
A good tool I use to figure out what I am doing wrong with my regex is: http://regexr.com/
Hope this helps!

Regexp for finding tags without nested tags

I'm trying to write a regexp which will help to find non-translated texts in html code.
Translated texts are those that go through a special tag, <fmt:message>, or through the construction ${...}
Ex. non-translated:
<h1>Hello</h1>
Translated texts are:
<h1><fmt:message key="hello" /></h1>
<button>${expression}</button>
I've written the following expression:
\<(\w+[^>])(?:.*)\>([^\s]+?)\</\1\>
It finds correct strings like:
<p>text</p>
Correctly skips
<a><fmt:message key="common.delete" /></a>
But also catches:
<li><p><fmt:message key="common.delete" /></p></li>
And I can't figure out how to add exception for ${...} strings in this expression
Can anybody help me?
If I understand you properly, you want to ensure the data inside the "tag" doesn't contain fmt:message or ${....}
You might be able to use a negative lookahead in conjunction with a . to assert that the characters captured by the . are not one of those cases:
/<(\w+)[^>]*>(?:(?!<fmt:message|\$\{|<\/\1>).)*<\/\1>/i
If you want to avoid capturing any "tags" inside the tag, you can ignore the <fmt:message portion, and just use [^<] instead of a . - to match only non <
/<(\w+)[^>]*>(?:(?!\$\{)[^<])*<\/\1>/i
Added from comment: if you also want to exclude "empty" tags, add another negative lookahead, this time (?!\s*<), to ensure that the stuff inside the tag is not empty and does not contain only whitespace:
/<(\w+)[^>]*>(?!\s*<)(?:(?!\$\{)[^<])*<\/\1>/i
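To see how this last pattern behaves on the examples from the question, here is a small Python sketch. Python's re module and the find_untranslated helper are assumptions made for the illustration; the pattern itself is the one above without the /.../i delimiters, with re.IGNORECASE standing in for the i flag.

```python
import re

UNTRANSLATED = re.compile(
    r"<(\w+)[^>]*>(?!\s*<)(?:(?!\$\{)[^<])*</\1>", re.IGNORECASE
)


def find_untranslated(html):
    """Hypothetical helper: return elements holding plain, untranslated text."""
    return [m.group(0) for m in UNTRANSLATED.finditer(html)]


html = "\n".join([
    "<h1>Hello</h1>",
    '<h1><fmt:message key="hello" /></h1>',
    "<button>${expression}</button>",
    '<li><p><fmt:message key="common.delete" /></p></li>',
])
print(find_untranslated(html))
```

Only the first line is reported: the nested tags are rejected because [^<] cannot cross a <, and the ${...} case is rejected by the (?!\$\{) lookahead.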
If the format is simple as in your examples you can try this:
<(\w+)>(?:(?!<fmt:message).)+</\1>
Rewritten into a more formal question:
Can you match
aba
but not
aca
without catching
abcba ?
Yes.
FSM:
Start->A->B->A->Terminate
Insert abcba and run it
Start is ready for input.
a -> MATCH, transition to A
b -> MATCH, transition to B
c -> FAIL, return fail.
I've used a simple one like this with success,
<([^>]+)[^>]*>([^<]*)</\1>
Of course, if there is any CDATA with '<' in it, this is not going to work so well, but it should do fine for simple XML.
Also see
https://blog.codinghorror.com/parsing-html-the-cthulhu-way/
for a discussion of using regex to parse HTML.
Executive summary: don't.