Thanks for the help with my previous homework question, "Regex to match tags like <A>, <BB>, <CCC> but not <ABC>". Now I have another homework question.
I need to match tags like <LOL>, <LOLOLOL> (3 uppercase letters, with repeatable last two letters), but not <lol> (it needs to be uppercase).
Using the technique from the previous homework, I tried <[A-Z]([A-Z][A-Z])\1*>. This works, except there's an additional catch: the repeating part can be in mixed case!
So I also need to match <LOLolol> and <LOLOLOlol>, because each is 3 uppercase letters followed by repetitions of the last two letters in mixed case. I know you can make a pattern case-insensitive with /i, and that will let me match <LOLolol> with the regex I have, but it will also now match <lololol>, because the check for the first 3 letters is also case-insensitive.
So how do I do this? How can I check the first 3 letters case sensitively, and then the rest of the letters case-insensitively? Is this possible with regex?
Yes! You can in fact do this in some flavors, using what is called an embedded modifier. This puts the modifier in the pattern itself, so you can select which parts of the pattern the modifier applies to.
The embedded modifier for case insensitivity is (?i), so the pattern you want in this case is:
<[A-Z]([A-Z]{2})(?i:\1*)>
References
regular-expressions.info/Modifiers
Specifying Modes Inside The Regular Expression
Instead of /regex/i, you can also do /(?i)regex/
Turning Modes On and Off for Only Part of The Regular Expression
You can also do /first(?i)second(?-i)third/
Modifier Spans
You can also do /first(?i:second)third/
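In a flavor that rejects the (?i:...) span, the same check can be approximated in two steps. The JavaScript sketch below is a workaround under stated assumptions, not the embedded-modifier technique itself: the helper name matchesTag is hypothetical, the ^ and $ anchors are added so that the whole string is validated, and the case-insensitive part is done by a manual comparison instead of a backreference.

```javascript
// Hypothetical helper: emulates <[A-Z]([A-Z]{2})(?i:\1*)> without modifier spans.
function matchesTag(s) {
  // Step 1 (case-sensitive): one uppercase letter, an uppercase pair,
  // then any number of letter pairs in either case.
  const m = /^<[A-Z]([A-Z]{2})((?:[A-Za-z]{2})*)>$/.exec(s);
  if (m === null) return false;
  const pair = m[1];
  const rest = m[2];
  // Step 2 (case-insensitive): every following pair must repeat the captured pair.
  for (let i = 0; i < rest.length; i += 2) {
    if (rest.slice(i, i + 2).toUpperCase() !== pair) return false;
  }
  return true;
}

console.log(matchesTag("<LOLolol>"));   // true
console.log(matchesTag("<LOLOLOlol>")); // true
console.log(matchesTag("<lololol>"));   // false
```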
I am trying to write a complex regex for a large corpus. However, due to many ORs, I am not able to capture the "not" in weren't, don't, wasn't, didn't, shouldn't, doesn't.
I would like it to match base verb and n't separately: E.g. were and n't
I have added it in the first line on: https://www.regexpal.com/?fam=106183 with the regex.
Any clue why it is not picked up, even though it comes first in the expression: [a-z]{1}'\w
Edit:
The regex is long because it is part of a large corpus. My problem is that the n't is not getting separated out, even though I placed it first in the order of preference for the OR.
Thanks in advance
Trying to parse natural language perfectly with a regular expression is never going to be "perfect". Language contains too many quirks and exceptions.
That said, trying to cover all scenarios explicitly like you have done ("a 2 letter lower case word", "a 4 letter capitalised word", "a word with a multiple of 3 letters" (??!), ...) is a doomed approach.
Keep the pattern as simple as you possibly can, and only add exceptions if you really need to.
Here's a basic approach:
/n't|\b\w+(?!'t)/
This matches "n't", or any word, excluding its last letter if that letter is followed by "'t".
You may wish to build upon that slightly, but it solves the use case you've provided:
Demo
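As a rough illustration of how that pattern splits contractions, here is a minimal JavaScript sketch. The helper name splitContractions and the sample inputs are assumptions made for the example, and the g flag is added so that every token is returned.

```javascript
// Pattern from the answer above, with the global flag added to collect all tokens.
const contractionSplitter = /n't|\b\w+(?!'t)/g;

// Hypothetical helper: returns the tokens found in `text` (empty array if none).
function splitContractions(text) {
  return text.match(contractionSplitter) || [];
}

console.log(splitContractions("weren't"));        // [ 'were', "n't" ]
console.log(splitContractions("They didn't go")); // [ 'They', 'did', "n't", 'go' ]
```

The greedy \w+ first takes "weren", fails the lookahead because "'t" follows, and backs off by one letter, which leaves "n't" for the first alternative.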
In order to understand why your original pattern doesn't work, let's consider a Minimal, Complete, Verifiable Example:
Cutting your pattern down to:
/[a-z]?'[a-z]{1,}|[\w-]+/
Consider how it matches the string:
"weren't"
First, the characters weren are matched by the [\w-]+ portion of the pattern.
Then, the 't characters are matched by the [a-z]?'[a-z]{1,} portion of the pattern.
Fundamentally, having the greedy [\w-]+ section in this pattern will mean it cannot work. This will always match up-to-and-including the "n" in "n't", which means the overall match fails for non-3-letter words.
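The failure is easy to reproduce. A small JavaScript check, using the cut-down pattern from above with the g flag added (an assumption, so that all matches are listed):

```javascript
// Cut-down pattern from the answer. At position 0 the first alternative cannot
// match (no apostrophe yet), so the greedy second alternative consumes "weren".
const cutDown = /[a-z]?'[a-z]{1,}|[\w-]+/g;
const pieces = "weren't".match(cutDown);

console.log(pieces); // [ 'weren', "'t" ]
```

By the time the engine reaches the apostrophe, the "n" has already been consumed, so only "'t" is left for the first alternative.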
I wanted a regular expression for a pattern like the following,
1. F639-180C
2. 245A-14F0
3. 319A-15E4
4. A45C-15E5
As I have observed, there will be 8 alphanumeric characters and a hyphen in between. The pattern that I thought was "[A-Za-z]|[0-9]{4}"-"[A-Za-z]|[0-9]{4}", I am not sure if this will work fine.
Your regexp was almost correct. If you want to use it with an explicit enumeration of characters, you can write it like this:
/[A-Za-z0-9]{4}\-[A-Za-z0-9]{4}/g
Or, to make it even simpler, it can become
/\w{4}\-\w{4}/g
where \w matches any word character, [a-zA-Z0-9_] (so it also accepts underscores). I was not sure whether you need the global flag; you can remove it, depending on the task.
Keep in mind that, depending on which regexp flavor you are using, there may be an alternate way to match alphanumeric characters:
[[:alnum:]]{4}-[[:alnum:]]{4}
Improvement: as was pointed out in the comments, you probably need to match only hex codes, so the character class should not cover the whole range from A to Z, but only hex digits: [A-Fa-f0-9]
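Putting the hex-only class to work might look like the JavaScript sketch below. The ^ and $ anchors are an added assumption (validating a whole string rather than searching inside longer text), and the variable names are illustrative only.

```javascript
// Hex-only variant of the pattern; anchors added to validate the whole string.
const hexCode = /^[A-Fa-f0-9]{4}-[A-Fa-f0-9]{4}$/;

// The four samples from the question.
const hexSamples = ["F639-180C", "245A-14F0", "319A-15E4", "A45C-15E5"];
console.log(hexSamples.every((s) => hexCode.test(s))); // true

// \w also admits "_" and letters beyond F, so the looser pattern accepts this:
console.log(/^\w{4}-\w{4}$/.test("ZZ_9-__00")); // true
console.log(hexCode.test("ZZ_9-__00"));         // false
```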
I am trying to find the appropriate regex pattern that allows me to pick out whole words either starting with or ending with a comma, but leave out numbers. I've come up with ([\w]+,) which matches the first word followed by a comma, so in something like:
red,1,yellow,4
red, will match, but I am trying to find a solution that will match input like the following:
red, 1 ,yellow, 4
I haven't been able to find anything that can break strings up like this, but hopefully you'll be able to help!
This regex
,?[a-zA-Z][a-zA-Z0-9]*,?
matches 'words' optionally enclosed by commas. No spaces are permitted between the commas and the 'word', and the word must start with a letter.
See here for a demo.
To ensure that at least one comma is matched, use the alternation syntax:
(,[a-zA-Z][a-zA-Z0-9]*|[a-zA-Z][a-zA-Z0-9]*,)
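A quick JavaScript check of that alternation against the strings from the question (the variable name and the g flag are additions for the example):

```javascript
// Alternation from the answer: a word with a leading OR a trailing comma.
const commaWord = /(,[a-zA-Z][a-zA-Z0-9]*|[a-zA-Z][a-zA-Z0-9]*,)/g;

console.log("red,1,yellow,4".match(commaWord));    // [ 'red,', ',yellow' ]
console.log("red, 1 ,yellow, 4".match(commaWord)); // [ 'red,', ',yellow' ]
console.log("red 1 yellow 4".match(commaWord));    // null
```

Because the leading-comma alternative is tried first, ",yellow," is reported as ",yellow" and the trailing comma is left unmatched.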
Unfortunately, no regex engine that I am aware of supports cascaded matching. However, since you usually work with regexes in the context of a programming environment, you can repeatedly match against a regex and use the matched substring for further matches. This can be achieved by chaining or iterating function calls, using special delimiter characters (which must be guaranteed not to occur in the test strings).
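Example (Javascript):
"red, 1 ,yellow, 4, red1, 1yellow yellow"
.replace(/(,?[a-zA-Z][a-zA-Z0-9]*,?)/g, "<$1>") // wrap every candidate word in angle brackets
.replace(/<[^,>]+>/g, "") // drop candidates that contain no comma
.replace(/>[^>]+(<|$)/g, "> $1") // collapse the text between surviving matches to one space
.replace(/^[^<]+</g, "<") // drop any text before the first match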
In this example, the (simple) regex is tested for first. The call returns a sequence of preliminary matches delimited by angle brackets. Matches that do not contain the required substring (, in this case) are eliminated, as is all intervening material.
This technique might produce code that is easier to maintain than a complicated regex.
However, as a rule of thumb, if your regex gets too complicated to maintain easily, it probably was not the right tool in the first place. (Many engines provide the x matching modifier, which lets you intersperse whitespace, namely line breaks and spaces, and comments at will.)
The issue with your expression is that:
- \w resolves to this: [a-zA-Z0-9_]. This includes numeric data which you do not want.
- You have the comma at the end; this will match foo, but not ,foo.
To fix this, you can do something like so: (,\s*[a-z]+)|([a-z]+\s*,). An example is available here.
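To see how the \s* part tolerates spaces around the comma, here is a short JavaScript sketch. The input string and variable names are made up for the example, and stripping commas and whitespace afterwards is just one possible way to get at the bare words.

```javascript
// Pattern from the answer above, with the g flag added to list every match.
const spacedCommaWord = /(,\s*[a-z]+)|([a-z]+\s*,)/g;

const spacedMatches = "red , 1, yellow ,4".match(spacedCommaWord);
console.log(spacedMatches); // [ 'red ,', ', yellow' ]

// Optional clean-up: remove the commas and whitespace to keep only the words.
const bareWords = spacedMatches.map((m) => m.replace(/[,\s]/g, ""));
console.log(bareWords); // [ 'red', 'yellow' ]
```

The digits are never matched because [a-z]+ requires at least one lowercase letter next to the comma.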
I would need one or more regular expressions to match some invalid urls of a website, that have uppercase letters before OR after a certain pattern.
These are the structure rules to match the invalid URLs:
a defined website
zero, or more uppercase letters if zero uppercase letters after the pattern
a pattern
zero, or more uppercase letters if zero uppercase letters before the pattern
To be explicit with examples:
http://website/uppeRcase/pattern/upperCase // match it, uppercase before and after pattern
http://otherweb/WhatevercAse/pattern/whatevercase // do not match, no website
http://website/lowercase/pattern/lowercase // do not match, no uppercase before or after pattern
http://website/lowercase/pattern/uppercasE // match it, uppercase after pattern
http://website/Uppercase/pattern/lowercase // match it, uppercase before pattern
http://website/WhatevercAse/asdasd/whatEveRcase // do not match it, no pattern
Thanks in advance for your help!
Mario
I'd advise against doing the two things you are describing with a regular expression in one step. Use a URL parsing library to extract the path and hostname components separately. You want to do this for a couple of reasons. There can be some surprising stuff in the host portion of the URL that can throw you off. For instance, the hostname of
http://website@otherweb/uppeRcase/pattern/upperCase
is actually otherweb, and should be excluded, even though it begins with website. Similarly:
http://website/actual/path/component?uppeRcase/pattern/upperCase
should be excluded, even though the url has the pattern, surrounded by upper case path components, because the matching region is not part of the path.
http://website/uppe%52case/%70attern/upper%43ase
is actually the same resource as your first example, but contains escapes that might prevent a regex from noticing it.
Once you've extracted and converted the escape sequences of just the path component, though, a regex is probably a great tool to use.
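A minimal JavaScript sketch of that parse-first approach follows. It assumes the standard WHATWG URL class (global in browsers and Node.js); the helper name isInvalidUrl is hypothetical, and "website" and "pattern" are the literal placeholders from the question. Decoding the whole path with decodeURIComponent is a simplification, since it also turns %2F into a slash.

```javascript
// Hypothetical helper: true when the URL is on "website" and has an uppercase
// letter before or after the /pattern/ segment of its path.
function isInvalidUrl(raw) {
  let url;
  try {
    url = new URL(raw); // parse first, so host, path and query are separated
  } catch {
    return false; // not a parseable absolute URL
  }
  if (url.hostname !== "website") return false;

  let path;
  try {
    path = decodeURIComponent(url.pathname); // resolve escapes such as %52
  } catch {
    return false; // malformed percent-escape
  }
  // Only now apply a regex, and only to the decoded path.
  const m = /^\/(.*)\/pattern\/(.*)$/.exec(path);
  return m !== null && (/[A-Z]/.test(m[1]) || /[A-Z]/.test(m[2]));
}

console.log(isInvalidUrl("http://website/uppe%52case/%70attern/upper%43ase"));                // true
console.log(isInvalidUrl("http://website/actual/path/component?uppeRcase/pattern/upperCase")); // false
```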
To match uppercase letters you simply need [A-Z]. Then build the rest of your rules around that. Without knowing exactly what you mean by "website" and "pattern", it is difficult to give better guidance.
This expression will match if uppercase characters are both between "website" and "pattern" as well as after "pattern"
^http://website/.*[A-Z]+.*/pattern/.*[A-Z]+.*$
This expression will match when uppercase characters appear on either side of "pattern" (or on both)
^http://website/(.*[A-Z]+.*/pattern/.*[A-Z]+.*|.*[A-Z]+.*/pattern/.*|.*/pattern/.*[A-Z]+.*)$
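The either-side expression can be checked against the six examples from the question with a few lines of JavaScript. This is only a sanity check under the assumption that "website" and "pattern" are literal strings; the pattern is built with new RegExp so the slashes need no escaping.

```javascript
// The three alternatives: uppercase on both sides, only before, only after.
const eitherSide = new RegExp(
  "^http://website/(" +
    ".*[A-Z]+.*/pattern/.*[A-Z]+.*|" +
    ".*[A-Z]+.*/pattern/.*|" +
    ".*/pattern/.*[A-Z]+.*" +
  ")$"
);

// [url, expected result] pairs taken from the question.
const urlCases = [
  ["http://website/uppeRcase/pattern/upperCase", true],
  ["http://otherweb/WhatevercAse/pattern/whatevercase", false],
  ["http://website/lowercase/pattern/lowercase", false],
  ["http://website/lowercase/pattern/uppercasE", true],
  ["http://website/Uppercase/pattern/lowercase", true],
  ["http://website/WhatevercAse/asdasd/whatEveRcase", false],
];

for (const [url, expected] of urlCases) {
  console.log(eitherSide.test(url) === expected ? "ok  " : "FAIL", url);
}
```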
UPDATE:
To @TokenMacGuy's point, RegEx parsing of URLs can be very tricky. If you want to break a URL into parts and then validate them, you can start with this expression, which should match and group most* URLs.
(?<protocol>(http|ftp|https|ftps):\/\/)?(?<site>[\w\-_\.]+\.(?<tld>([0-9]{1,3})|([a-zA-Z]{2,3})|(aero|arpa|asia|coop|info|jobs|mobi|museum|name|travel))+(?<port>:[0-9]+)?\/?)((?<resource>[\w\-\.,#^%:/~\+#]*[\w\-\#^%/~\+#])(?<queryString>(\?[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*=[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*)+(&[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*=[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*)*)?)?
*it worked in all my tests, but I can't claim I was exhaustive.
I need a regular expression to allow the user to enter an alphanumeric string that starts with a letter (not a digit).
This should work in any of the Regular Expression (RE) engines. There is a nicer syntax in the PCRE world but I prefer mine to be able to run anywhere:
^[A-Za-z][A-Za-z0-9]*$
Basically, the first character must be alpha, followed by zero or more alpha-numerics. The start and end tags are there to ensure that the whole line is matched. Without those, you may match the AB12 of the "###AB12!!!" string.
Full explanation:
^ start tag.
[A-Za-z] any one of the upper/lower case letters.
[A-Za-z0-9] any one of the upper/lower case letters or digits,
* repeated zero or more times.
$ end tag
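The effect of the ^ and $ anchors can be seen in a short JavaScript sketch (variable names are illustrative only):

```javascript
const anchoredIdent = /^[A-Za-z][A-Za-z0-9]*$/;
const unanchoredIdent = /[A-Za-z][A-Za-z0-9]*/;

console.log(anchoredIdent.test("AB12"));       // true
console.log(anchoredIdent.test("1AB2"));       // false: starts with a digit
console.log(anchoredIdent.test("###AB12!!!")); // false: whole string must match

// Without the anchors, the pattern finds the AB12 inside the longer string.
console.log(unanchoredIdent.exec("###AB12!!!")[0]); // AB12
```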
Update:
As Richard Szalay rightly points out, this is ASCII only (or, more correctly, any encoding scheme where the A-Z, a-z and 0-9 groups are contiguous) and only for the "English" letters.
If you want true internationalized REs (only you know whether that is a requirement), you'll need to use one of the more appropriate RE engines, such as the PCRE mentioned above, and ensure it's compiled for Unicode mode. Then you can use "characters" such as \p{L} and \p{N} for letters and numerics respectively. I think the RE in that case would be:
^\p{L}[\pL\pN]*$
but I'm not certain. I've never used REs for our internationalized software. See here for more than you ever wanted to know about PCRE.
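For comparison, JavaScript (not PCRE) accepts the same \p{L} and \p{N} property escapes when the u flag is set. The sketch below assumes that engine and uses made-up sample strings; whether another engine's Unicode categories behave identically is not verified here.

```javascript
// Unicode-aware variant: a letter, then any letters or numbers (needs the u flag).
const unicodeIdent = /^\p{L}[\p{L}\p{N}]*$/u;
const asciiIdent = /^[A-Za-z][A-Za-z0-9]*$/;

const sample = "\u00DCber42"; // "Über42"
console.log(unicodeIdent.test(sample));         // true
console.log(asciiIdent.test(sample));           // false: Ü is outside A-Z
console.log(unicodeIdent.test("42\u00DCber"));  // false: starts with a digit
```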
I think this should do the work:
^[A-Za-z][A-Za-z0-9]*$
You're looking for a pattern like this:
^[a-zA-Z][a-zA-Z0-9]*$
That one requires one letter and any number of letters/numbers after that. You may want to adjust the allowed lengths.