I would need one or more regular expressions to match some invalid urls of a website, that have uppercase letters before OR after a certain pattern.
These are the structure rules to match the invalid URLs:
a defined website
zero, or more uppercase letters if zero uppercase letters after the pattern
a pattern
zero, or more uppercase letters if zero uppercase letters before the pattern
To be explicit with examples:
http://website/uppeRcase/pattern/upperCase // match it, uppercase before and after pattern
http://otherweb/WhatevercAse/pattern/whatevercase // do not match, no website
http://website/lowercase/pattern/lowercase // do not match, no uppercase before or after pattern
http://website/lowercase/pattern/uppercasE // match it, uppercase after pattern
http://website/Uppercase/pattern/lowercase // match it, uppercase before pattern
http://website/WhatevercAse/asdasd/whatEveRcase // do not match it, no pattern
Thanks in advance for your help!
Mario
I'd advise against doing the two things you are describing with a regular expression in one step. Use a url parsing library to extract the path and hostname components separately. You want to do this for a couple of reasons, There can be some surprising stuff in the host portion of the url that can throw you off, for instance, the hostname of
http://website#otherweb/uppeRcase/pattern/upperCase
is actually otherweb, and should be excluded, even though it begins with website. similarly:
http://website/actual/path/component?uppeRcase/pattern/upperCase
should be excluded, even though the url has the pattern, surrounded by upper case path components, because the matching region is not part of the path.
http://website/uppe%52case/%70attern/upper%43ase
is actually the same resource as your first example, but contains escapes that might prevent a regex from noticing it.
Once you've extracted and converted the escape sequences of just the path component, though, a regex is probably a great tool to use.
To match uppercase letters you simply need [A-Z]. Then build around that the rest of your rules. Without knowing the exactly what you mean by "website" and "pattern" it is difficult to give better guidance.
This expression will match if uppercase characters are both between "website" and "pattern" as well as after "pattern"
^http://website/.*[A-Z]+.*/pattern/.*[A-Z]+.*$
This expression will bath on either uppercase-case
^http://website/(.*[A-Z]+.*/pattern/.*[A-Z]+.*|.*[A-Z]+.*/pattern/.*|.*/pattern/.*[A-Z]+.*)$
UPDATE:
To #TokenMacGuy's point, RegEx parsing of URLs can be very tricky. If you want to break into parts and then validate, you can start with this expression which should match and group most* URLs.
(?<protocol>(http|ftp|https|ftps):\/\/)?(?<site>[\w\-_\.]+\.(?<tld>([0-9]{1,3})|([a-zA-Z]{2,3})|(aero|arpa|asia|coop|info|jobs|mobi|museum|name|travel))+(?<port>:[0-9]+)?\/?)((?<resource>[\w\-\.,#^%:/~\+#]*[\w\-\#^%/~\+#])(?<queryString>(\?[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*=[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*)+(&[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*=[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*)*)?)?
*it worked in all my tests, but I can't claim I was exhaustive.
Related
I have a very large file containing thousands of sentences. In all of them, the first word of each sentence begins with lowercase, but I need them to begin with uppercase.
I looked through the site trying to find a regex to do this but I was unable to. I learned a lot about regex in the process, which is always a plus for my job, but I was unable to find specifically what I am looking for.
I tried to find a way of compiling the code from several answers, including the following:
Convert first lowercase to uppercase and uppercase to lowercase (regex?)
how to change first two uppercase character to lowercase character on each line in vim
Regex, two uppercase characters in a string
Convert a char to upper case using regular expressions (EditPad Pro)
But for different reasons none of them served my purpose.
I am working with a translation-specific application which accepts regex.
Do you think this is possible at all? It would save me hours of tedious work.
You can use this regex to search for the first letters of sentences:
(?<=[\.!?]\s)([a-z])
It matches a lowercase letter [a-z], following the end of a previous sentence (which might end with one of the following: [\.!?]) and a space character \s.
Then make a substitution with \U$1.
It doesn't work only for the very first sentence. I intentionally kept the regex simple, because it's easy to capitalize the very first letter manually.
Working example: https://regex101.com/r/hqwK26/1
UPD: If your software doesn't support \U, you might want to copy your text to Notepad++ and make a replacement there. The \U is fully supported, just checked.
UPD2: According to the comments, the task is slightly different, and just the first letters of each line should be capitalized.
There is a simple regex for that: ^([a-z]), with the same substitution pattern.
Here is a working example: https://regex101.com/r/hqwK26/2
Taking Ildar's answer and combining both of his patterns should work with no compromises.
(?<=[\.!?]\s)([a-z])|^([a-z])
This is basically saying, if first pattern OR second pattern. But because you're now technically extracting 2 groups instead of one, you'll have to refer to group 2 as $2. Which should be fine because only one of the patterns should be matched.
So your substitution pattern would then be as follows...
\U$1$2
Here's a working example, again based on Ildar's answer...
https://regex101.com/r/hqwK26/13
I have creating a script using VBA to go through a Word document to find all word that could possibly be an acronym but I found that my regEx pattern is not find all of them.
The regEx pattern I am using is "([A-Z]{2,})(-([A-Z]{2,})[A-Za-z0-9])"
With this pattern I am able to find
AA
AAA
AA-BB
AA-BBB
AAA-BB
AAA-BBB
AAA-1234
AAA-BBB-1234
but it does not find these words
B2B
B2B-1234
B2B-A1A-1234
The expectation of the word match should be that the first character is a letter and must contains at least two uppercase letters and at least one number. In addition, if there are dashes in the the word then the characters before the dash must match the expectation of the word match.
Is there is a way to use the regEx pattern above to also include the letter-digit-letter acronyms too?
Milco, welcome to StackOverflow. I think that the following regex will work for you:
([A-Z][A-Z0-9]+)(-[A-Z0-9]{2,})*
This regex accommodates digits and an optional number of hyphenated terms and matches each of your cases above. I tested it out at regextesteronline.com - I'm assuming that VB.net regexes are the same as VBA, which they should be, at least for basic regexes.
I have this regular expression: ^[a-zA-Z]\s{3,16}$
What I want is to match any name with any spaces, for example, John Smith and that contains 3 to 16 characters long..
What am I doing wrong?
Background
There are a couple of things to note here. First, a quantifier (in this case, {3,16}) only applies to the last regex token. So what your current regex really is saying is to "Match any string that has a single alphabetical character (case-insensitive) followed by 3 to 16 whitespace characters (e.g. spaces, tabs, etc.)."
Second, a name can have more than 2 parts (a middle name, certain ethnic names like "De La Cruz") or include special characters such as accented vowels. You should consider if this is something you need to account for in your program. These things are important and should be considered for any real application.
Assumptions and Answer
Now, let's just assume you only want a certain format for names that consists of a first name, a last name, and a space. Let's also assume you only want simple ASCII characters (i.e. no special characters or accented characters). Furthermore, both the first and last names should start with a capital character followed by only lower-case characters. Other than that, there are no restrictions on the length of the individual parts of the name. In this case, the following regex would do the trick:
^(?=.{3,16}$)[A-Z][a-z]+ [A-Z][a-z]+$
Notes
The first token after the ^ character is what is called a positive lookahead. Basically a positive look ahead will match the regex between the opening (?= and closing ) without actually moving the position of the cursor that is matching the string.
Notice I removed the \s token, since you usually want only a (space). The space can be replaced with the \s token, if tabs and other whitespace is desired there.
I also added a restriction that a name must start with a capital letter followed by only lower-case letters.
Crude English Translation
To help your understanding, here is a simple English translation of what the regex is really doing. The part in italics is just copied from the first part of the English translation of the regex.
"Match any string that has 3-16 characters and starts with a capital alphabetical character followed by one or more (+) alphabetical characters followed by a single space followed by a capital alphabetical character followed by one or more (+) alphabetical characters and ends with any lowercase letter."
Tools
There are a couple of tools I like to use when I am trying to tackle a challenging regex. They are listed below in no particular order:
https://regex101.com/ - Allows you to test regex expressions in real time. It also has a nifty little library to help you along.
http://www.regular-expressions.info/ - Basically a repository of knowledge on regex.
Edit/Update
You mentioned in your comments that you are using your regex in JavaScript. JavaScript uses a forward slash surrounding the regex to determine what is a regex. For this simple case, there are 2 options for using a regex to match a string.
First, use String's match method as follows
"John Smith".match(/^(?=.{3,16}$)[A-Z][a-z]+ [A-Z][a-z]+$/);
Second, create a regex and use its test() method. For example,
/^(?=.{3,16}$)[A-Z][a-z]+ [A-Z][a-z]+$/.test("John Smith");
The latter is probably what you want as it simply returns true or false depending on whether the regex actually matches the string or not.
I'm looking to validate English names. How could I improve this RegEx in order to ensure that the hyphen, comma or apostrophe does not come first or last in the string? Further, a hyphen or apostrophe must always be directly preceded and succeeded with a letter.
I have the RegEx: [a-zA-Z-', ].
My understanding is that this will allow for all letters, in all cases, hyphens, commas and apostrophes along with whitespace at any location in the string.
Quite new to the language so just getting my head around it.
Thanks.
Ok, since it's only given a presumed name, here's what you might want to try:
^[A-Z][A-Za-z'-]+[a-z](,? [A-Z][A-Za-z'-]+[a-z])*$
This will work with names like O'Harry Jefferson-Wayne, but will reject words not ending with a small English letter.
The gist of it is this [A-Z] start of name, [A-Za-z'-]+ continuation, [a-z] end of name. then a repeatable group of optional names, with an optional comma delimiter, but required space.
I took more time writing the examples to test at http://www.regexplanet.com/. That's where I practice my regular expressions when I'm developing. You should try it.
I'm new to this site and don't know if this is the place to ask this question here?
I was wondering if someone can explain the 3 regex code examples below in detail?
Thanks.
Example 1
`&([a-z]{1,2})(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig);`i
Example 2
\\1
Example 3
`[^a-z0-9]`i','`[-]+`
The first regex looks like it'll match the HTML entities for accented characters (e.g., é is é; ø is ø; æ is æ; and  is Â).
To break it down, & will match an ampersand (the start of the entity), ([a-z]{1,2}) will match any lowercase letter one or two times, (acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig) will match one of the terms in the pipe-delimited list (e.g., circ, grave, cedil, etc.), and ; will match a semicolon (the end of the entity). I'm not sure what the i character means at the end of the line; it's not part of the regex.
All told, it will match the HTML entities for accented/diacritic/ligatures. Compared, though, to this page, it doesn't seem that it matches all of the valid entities (although ti does catch many of them). Unless you run in case-insensitive mode, the [a-z] will only match lowercase letters. It will also never match the entities ð or þ (ð, þ, respectively) or their capital versions (Ð, Þ, also respectively).
The second regex is simpler. \1 in a regex (or in regex find-replace) simply looks for the contents of the first capturing group (denoted by parentheses ()) and (in a regex) matches them or (in the replace of a find) inserts them. What you have there \\1 is the \1, but it's probably written in a string in some other programming language, so the coder had to escape the backslash with another backslash.
For your third example, I'm less certain what it does, but I can explain the regexes. [^a-z0-9] will match any character that's not a lowercase letter or number (or, if running in case-insensitive mode, anything that's not a letter or a number). The caret (^) at the beginning of the character class (that's anything inside square brackets []) means to negate the class (i.e., find anything that is not specified, instead of the usual find anything that is specified). [-]+ will match one or more hyphens (-). I don't know what the i',' between the regexes means, but, then, you didn't say what language this is written in, and I'm not familiar with it.