What is the following token pattern matching: [A-Za-z0-9_]+(?=\\s+) - regex

I understand the meaning of [A-Za-z0-9_]+ corresponding to a repeated sequence of one or more characters containing upper case letters, lower case letters, digits and underscores, but what does the whole expression corresponds to?

I'm going to assume that your regex is /[A-Za-z0-9_]+(?=\s+)/ and that your programming language requires you to escape the \ as \\.
Like you said, [A-Za-z0-9_]+ matches one or more alpha-numeric characters.
The (?=) pattern indicates a positive look ahead expression. We are checking if after the alpha-numeric characters, we have one or more(+) whitespace(\s) characters. However, the difference between /[A-Za-z0-9_]+\s+/ and /[A-Za-z0-9_]+(?=\s+)/ is that the former would include the whitespace in the match while the latter will not.
If you run your regex on this_is_followed_by_whitespace␠␠␠ where "␠" indicates spaces, only this_is_followed_by_whitespace will be matched. The expression is just looking ahead to check whether there is whitespace. Running /[A-Za-z0-9_]+\s+/ on the same string would match this_is_followed_by_whitespace␠␠␠.
Play around with your regex on this RegExr demo.

Related

Using regex to find abbreviations

I am trying to create a regular expression that will identify possible abbreviations within a given string in Python. I am kind of new to RegEx and I am having difficulties creating an expression though I beleive it should be somewhat simple. The expression should pick up words that have two or more capitalised letter. The expression should also be able to pick up words where a dash have been used in-between and report the whole word (both before and after the dash). If numbers are also present they should also be reported with the word.
As such, it should pick up:
ABC, AbC, ABc, A-ABC, a-ABC, ABC-a, ABC123, ABC-123, 123-ABC.
I have already made the following expression: r'\b(?:[a-z]*[A-Z\-][a-z\d[^\]*]*){2,}'.
However this does also pick up these wrong words:
A-bc, a-b-c
I believe the problem is that it looks for either multiple capitalised letters or dashes. I wish for it to only give me words that have atleast two or more capitalised letters. I understand that it will also "mistakenly" take words as "Abc-Abc" but I don't believe there is a way to avoid these.
If a lookahead is supported and you don't want to match double -- you might use:
\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b
Explanation
\b A word boundary
(?= Positive lookahead, assert that from the current location to the right is
(?:[a-z\d-]*[A-Z]){2} Match 2 times the optionally the allowed characters and an uppercase char A-Z
) Close the lookahead
[A-Za-z\d]+ match 1+ times the allowed characters without the hyphen
(?:-[A-Za-z\d]+)* Optionally repeat - and 1+ times the allowed characters
\b A word boundary
See a regex101 demo.
To also not not match when there are hyphens surrounding the characters you can use negative lookarounds asserting not a hyphen to the left or right.
\b(?<!-)(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b(?!-)
See another regex demo.

Regex to match 4 letters in a string

I am trying to write some regex that will match a string that contains 4 or more letters in it that are not necessarily in sequence.
The input string can have a mix of upper and lowercase letters, numbers, non-alpha chars etc, but I only want it to pass the regex test if it contains at least 4 upper or lowercase letters.
An example of what I would like to be a valid input can be seen below:
a124Gh0st
I have currently written this piece of regex:
(?(?=[a-zA-Z])([a-zA-Z])| )
Which returns 5 matches successfully but it will currently always pass as long as I have greater than 1 letter in the input string. if I add {4,} to the end of it then it works, but only in situations where there are 4 letters in a row.
I am using the following website to test what I have been doing: regex101
Any help on this would be greatly appreciated.
You may use
(?s)^([^a-zA-Z]*[A-Za-z]){4}.*
or
^([^a-zA-Z]*[A-Za-z]){4}[\s\S]*
See the regex demo.
Details:
^ - start of string
([^a-zA-Z]*[A-Za-z]){4} - exactly 4 sequences of:
[^a-zA-Z]* - 0+ chars other than ASCII letters
[A-Za-z] - an ASCII letter
[\S\s]* - any 0+ chars (same as .* if the DOTALL modifier is enabled).
Why don't you just match the zero or more characters between each letter? For example,
(?:[A-Za-z].*){4}
You'll recognize the [A-Za-z]. The . matches any character, so .* is a run of any number (including zero) of any character. The group of a letter followed by any number of any characters is repeated four times, so this pattern matches if and only if at least four letters appear in the string. (Note that the trailing .* of the fourth repeat of the pattern is mostly inconsequential, since it can match zero characters).
If you are using a regex language that supports reluctant quantifiers, then using them will make this pattern considerably more efficient. For example, in Java or Perl, one might prefer to use
(?:[A-Za-z].*?){4}
The .*? still matches any number of any character, but the matching algorithm will match as few characters as possible with each such run. This will reduce the amount of backtracking it needs to perform. For this particular pattern, it will reduce the needed backtracking to zero.
If you do not have reluctant quantifiers in your regex dialect, then you can achieve the same desirable effect a bit more verbosely:
(?:[A-Za-z][^A-Za-z]*?){4}
There, only non-letters are matched for the runs between letters.
Even with this, the pattern uses some regex features not present in all regex flavors -- non-capturing groups, enumerated quantifiers -- but these are present in your original regex. For a maximally-compatible form, you might write
[A-Za-z][^A-Za-z]*[A-Za-z][^A-Za-z]*[A-Za-z][^A-Za-z]*[A-Za-z]

What is the regular expression to allow uppercase/lowercase (alphabetical characters), periods, spaces and dashes only?

I am having problems creating a regex validator that checks to make sure the input has uppercase or lowercase alphabetical characters, spaces, periods, underscores, and dashes only. Couldn't find this example online via searches. For example:
These are ok:
Dr. Marshall
sam smith
.george con-stanza .great
peter.
josh_stinson
smith _.gorne
Anything containing other characters is not okay. That is numbers, or any other symbols.
The regex you're looking for is ^[A-Za-z.\s_-]+$
^ asserts that the regular expression must match at the beginning of the subject
[] is a character class - any character that matches inside this expression is allowed
A-Z allows a range of uppercase characters
a-z allows a range of lowercase characters
. matches a period
rather than a range of characters
\s matches whitespace (spaces and tabs)
_ matches an underscore
- matches a dash (hyphen); we have it as the last character in the character class so it doesn't get interpreted as being part of a character range. We could also escape it (\-) instead and put it anywhere in the character class, but that's less clear
+ asserts that the preceding expression (in our case, the character class) must match one or more times
$ Finally, this asserts that we're now at the end of the subject
When you're testing regular expressions, you'll likely find a tool like regexpal helpful. This allows you to see your regular expression match (or fail to match) your sample data in real time as you write it.
Check out the basics of regular expressions in a tutorial. All it requires is two anchors and a repeated character class:
^[a-zA-Z ._-]*$
If you use the case-insensitive modifier, you can shorten this to
^[a-z ._-]*$
Note that the space is significant (it is just a character like any other).

Regex help with matching

Hello I need coming up with a valid regular expression It could be any identifier name that starts with a letter or underscore but may contain any number of letters, underscores, and/or digits (all letters may be upper or lower case).
For example, your regular expression should match the following text strings: “_”, “x2”, and “This_is_valid” It should not match these text strings: “2days”, or “invalid_variable%”.
So far this is what I have came up with but I don't think it is right
/^[_\w][^\W]+/
The following will work:
/^[_a-zA-Z]\w*$/
Starts with (^) a letter (upper or lowercase) or underscore ([_a-zA-Z]), followed by any amount of letter, digit, or underscore (\w) to the end ($)
Read more about Regular Expressions in Perl
Maybe the below regex:
^[a-zA-Z_]\w*$
If the identify is at the start of a string, then it's easy
/^(_|[a-zA-Z]).*/
If it's embedded in a longer string, I guess it's not much worse, assuming it's the start of a word...
/\s(_|[a-zA-Z]).*/

Regex for alphanumeric, but at least one letter

In my ASP.NET page, I have an input box that has to have the following validation on it:
Must be alphanumeric, with at least one letter (i.e. can't be ALL
numbers).
^\d*[a-zA-Z][a-zA-Z0-9]*$
Basically this means:
Zero or more ASCII digits;
One alphabetic ASCII character;
Zero or more alphanumeric ASCII characters.
Try a few tests and you'll see this'll pass any alphanumeric ASCII string where at least one non-numeric ASCII character is required.
The key to this is the \d* at the front. Without it the regex gets much more awkward to do.
Most answers to this question are correct, but there's an alternative, that (in some cases) offers more flexibility if you want to change the rules later on:
^(?=.*[a-zA-Z].*)([a-zA-Z0-9]+)$
This will match any sequence of alphanumerical characters, but only if the first group also matches the whole sequence. It's a little-known trick in regular expressions that allows you to handle some very difficult validation problems.
For example, say you need to add another constraint: the string should be between 6 and 12 characters long. The obvious solutions posted here wouldn't work, but using the look-ahead trick, the regex simply becomes:
^(?=.*[a-zA-Z].*)([a-zA-Z0-9]{6,12})$
^[\p{L}\p{N}]*\p{L}[\p{L}\p{N}]*$
Explanation:
[\p{L}\p{N}]* matches zero or more Unicode letters or numbers
\p{L} matches one letter
[\p{L}\p{N}]* matches zero or more Unicode letters or numbers
^ and $ anchor the string, ensuring the regex matches the entire string. You may be able to omit these, depending on which regex matching function you call.
Result: you can have any alphanumeric string except there's got to be a letter in there somewhere.
\p{L} is similar to [A-Za-z] except it will include all letters from all alphabets, with or without accents and diacritical marks. It is much more inclusive, using a larger set of Unicode characters. If you don't want that flexibility substitute [A-Za-z]. A similar remark applies to \p{N} which could be replaced by [0-9] if you want to keep it simple. See the MSDN page on character classes for more information.
The less fancy non-Unicode version would be
^[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*$
^[0-9]*[A-Za-z][0-9A-Za-z]*$
is the regex that will do what you're after. The ^ and $ match the start and end of the word to prevent other characters. You could replace the [0-9A-z] block with \w, but i prefer to more verbose form because it's easier to extend with other characters if you want.
Add a regular expression validator to your asp.net page as per the tutorial on MSDN: http://msdn.microsoft.com/en-us/library/ms998267.aspx.
^\w*[\p{L}]\w*$
This one's not that hard. The regular expression reads: match a line starting with any number of word characters (letters, numbers, punctuation (which you might not want)), that contains one letter character (that's the [\p{L}] part in the middle), followed by any number of word characters again.
If you want to exclude punctuation, you'll need a heftier expression:
^[\p{L}\p{N}]*[\p{L}][\p{L}\p{N}]*$
And if you don't care about Unicode you can use a boring expression:
^[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*$
^[0-9]*[a-zA-Z][a-zA-Z0-9]*$
Can be
any number ended with a character,
or an alphanumeric expression started with a character
or an alphanumeric expression started with a number, followed by a character and ended with an alphanumeric subexpression