negation classes regex [duplicate] - regex

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 5 years ago.
I wrote this regex to tokenize a text: "\b\w+\b"
but someone suggested that I convert it to \b[^\W\d_]+\b
Can anyone explain to me why this second way (using negation) is better?
Thanks

The first one matches all letters, numbers and the underscore. Depending on the regex engine, this may include Unicode letters and numbers. (The word boundaries are superfluous in this case, by the way.)
The second regex matches only letters (the class excludes non-word characters, digits and the underscore). Because of the word boundaries, it matches a run of letters only if that run is surrounded by non-word characters or the start/end of the string.
If your regex engine supports this, you might want to use [[:alpha:]] or \p{L} (or [A-Za-z] in case of non-unicode) instead to make your intent clearer.
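To make the difference concrete, here is a minimal Java sketch that tokenizes the same text with both patterns. The class name TokenizerDemo, the helper tokens and the sample text are made up for illustration, and the results assume Java's default (ASCII) meaning of \w.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original question or answer.
class TokenizerDemo {
    // Collects every match of the given regex in the text, in order.
    static List<String> tokens(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "foo_bar 123 abc9 hello";
        // \w accepts letters, digits and the underscore.
        System.out.println(tokens("\\b\\w+\\b", text));        // [foo_bar, 123, abc9, hello]
        // [^\W\d_] accepts letters only; the boundaries reject partial words.
        System.out.println(tokens("\\b[^\\W\\d_]+\\b", text)); // [hello]
    }
}
```

Note that the second pattern does not return "foo" or "abc": the character after those letters is still a word character, so there is no word boundary at that point.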

Related

The use of ".*" in regex for password validation [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 6 years ago.
I came across this regex used for password validation:
(?=.*[a-z])(?=.*[A-Z])(?=.*[\d])(?=.*[^a-zA-Z\d])(?=\S+$).{8,}
There are only two things that are unclear to me about this regex:
What is .* used for, and why doesn't this regex work without it?
What is the difference/benefit of using [\d] instead of \d, given that the regex works just fine in both cases?
.* matches any sequence of characters: . matches any character (other than newline, which is not relevant here) and * matches zero or more of the preceding pattern. It is used in the lookaheads to search for a match anywhere in the password. A lookahead consumes nothing, so all of them are tested from the same position, the start of the password. Without .*, each one could only inspect the first character, which would have to be a lowercase letter, an uppercase letter, a digit and a non-alphanumeric character all at once; that is impossible, so the regex would never match. With .*, the password must contain at least one of each of them, but they can be anywhere in the password.
There's no difference between \d and [\d]. Whoever wrote this might just have used the brackets out of habit, or perhaps to make it easier to modify the pattern later by putting other characters into the character class.
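A small Java sketch to try this out. The class name PasswordCheck and the sample passwords are assumptions for illustration; the first pattern is the one from the question, and the second is a hypothetical variant with the .* removed, included only to show what happens when each lookahead can see nothing but the character at the current position.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the sample passwords are invented.
class PasswordCheck {
    // The pattern quoted in the question, unchanged.
    static final Pattern WITH_DOT_STAR = Pattern.compile(
        "(?=.*[a-z])(?=.*[A-Z])(?=.*[\\d])(?=.*[^a-zA-Z\\d])(?=\\S+$).{8,}");

    // Variant with .* removed from the first four lookaheads (illustration only).
    static final Pattern WITHOUT_DOT_STAR = Pattern.compile(
        "(?=[a-z])(?=[A-Z])(?=[\\d])(?=[^a-zA-Z\\d])(?=\\S+$).{8,}");

    static boolean isValid(String password) {
        return WITH_DOT_STAR.matcher(password).matches();
    }

    static boolean isValidWithoutDotStar(String password) {
        return WITHOUT_DOT_STAR.matcher(password).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Abcdef1!"));               // true
        System.out.println(isValid("abcdef1!"));               // false: no uppercase letter
        System.out.println(isValid("Abc def1!"));              // false: contains whitespace
        System.out.println(isValidWithoutDotStar("Abcdef1!")); // false: one character cannot satisfy every lookahead
    }
}
```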

Matching a string that does not contain certain word... without lookaround [duplicate]

This question already has answers here:
Regular expression to match a line that doesn't contain a word
(34 answers)
Closed 6 years ago.
I would like to use (pure) regex to match strings that do not contain the word word.
However, I would not like to use lookaround, balancing groups, or that kind of stuff.
If it is impossible, then can we match strings that do not start with word instead?
Examples
word should not match.
wor should match.
wore should match.
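Use this pattern:
^.*\bword\b.*$|(.+)
The first alternative matches (without capturing) any string that contains word; the second alternative then matches and captures the strings that don't, so the results you want are in group 1.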
Depending on your engine (the (*SKIP)(*F) verbs need PCRE or Perl), you could use this pattern instead:
^.*\bword\b.*$(*SKIP)(*F)|(.+)
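Here is a rough Java sketch of the match-and-capture idea, using the first pattern (java.util.regex does not support the (*SKIP)(*F) verbs). The class name SkipWordLines, the helper linesWithoutWord and the sample input are made up; MULTILINE is assumed so that ^ and $ apply to each line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class SkipWordLines {
    // Alternative 1 swallows lines containing "word"; alternative 2 captures the rest.
    private static final Pattern P =
        Pattern.compile("^.*\\bword\\b.*$|(.+)", Pattern.MULTILINE);

    static List<String> linesWithoutWord(String text) {
        List<String> kept = new ArrayList<>();
        Matcher m = P.matcher(text);
        while (m.find()) {
            // Group 1 is null whenever the first alternative matched.
            if (m.group(1) != null) {
                kept.add(m.group(1));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(linesWithoutWord("word\nwor\nwore\na word here")); // [wor, wore]
    }
}
```

The regex itself still matches the unwanted lines; the filtering happens when the caller ignores matches whose group 1 is empty.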

Regular expression for 10 numeric digits [duplicate]

This question already has an answer here:
Java Regex for telephone number - Must Include only 8 digits with not more than 2 dash [duplicate]
(1 answer)
Closed 7 years ago.
10 numeric digits, may be in the following formats: 123-4-567890, 1234-567890 or 1234567890
What is the regular expression for the above formats?
Any help is appreciated.
Thanks.
Assuming you mean that any digits, 0-9, should be accepted if (and only if) they are in one of the three formats you presented, one regex that would work is
(([0-9]{3}-[0-9]{1}-[0-9]{6})|([0-9]{4}-[0-9]{6})|([0-9]{10}))
The above breaks down to three separate patterns, one for each case you presented, separated by regex's equivalent of "or", the | character. Each of the statements above contains [0-9], a character class which will match any digit. Following each character class is a {n} statement, which means "repeat the previous item n times".
Disclaimer: there is probably a cleverer way to do this with a shorter pattern, but my regex-fu isn't quite that advanced yet.
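A quick way to check the pattern against the three formats, sketched in Java. The class name TenDigits and the rejected samples are assumptions for illustration; matches() is used so that the whole input has to fit one of the alternatives.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class TenDigits {
    // The pattern from the answer, unchanged.
    static final Pattern P = Pattern.compile(
        "(([0-9]{3}-[0-9]{1}-[0-9]{6})|([0-9]{4}-[0-9]{6})|([0-9]{10}))");

    static boolean isValid(String s) {
        // matches() requires the entire string to match, not just a part of it.
        return P.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("123-4-567890")); // true
        System.out.println(isValid("1234-567890"));  // true
        System.out.println(isValid("1234567890"));   // true
        System.out.println(isValid("12345-67890"));  // false: dash in the wrong place
    }
}
```

If the pattern is used with find() instead, it would also hit a valid-looking run inside a longer string, so anchors or matches() are worth considering.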

Tricky Regular Expression with a Alphanumeric pattern in uppercase [duplicate]

This question already has answers here:
Can you make just part of a regex case-insensitive?
(5 answers)
Closed 3 years ago.
Okay, this might not be tricky at all for some, but at the moment it is really messing with my head.
First of all, I don't know what engine I am dealing with, but it doesn't seem to distinguish uppercase from lowercase.
I have a string for example
Circuit Ref
Service Type
A End Address
Z End Address
52GD J32SD41 O2AE EVC001
Evolve Internet
And I am only trying to extract the string "52GD J32SD41 O2AE EVC001". I have already tried quite a few combinations like
[0-9A-Z]{4}\s[0-9A-Z]+\s[0-9A-Z]+\s[0-9A-Z]+
[A-Z0-9]{4}\s\W+\s\W+\s\W+
[A-Z0-9]{4}\s[A-Z0-9\s]*[A-Z0-9\s]*[A-Z0-9\s]*
Nothing seems to work... I want to keep the expression fairly flexible, as the order of the letters and digits can change, but the pattern is mostly the same. Any nudge in the right direction will be greatly appreciated.
Thanks
This is a wild guess, but please try the following:
in front of the regex add (?-i) (Related question, regular-expressions.info, net page about regex)
enclose the regex in (?-i: ... )
enclose the regex in (?I: ... )
By the way, regarding the 2nd pattern that you tried: [A-Z0-9]{4}\s\W+\s\W+\s\W+.
It seems that you tried to use \W as "upper case word character", but that is not what it means.
\W means anything that is not \w, that is, any non-word character.
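Since the engine in the question is unknown, the following is only a Java sketch of the first suggestion. It assumes the engine behaves as if a case-insensitive flag were switched on (simulated here with Pattern.CASE_INSENSITIVE) and that it honours the inline (?-i) modifier as Java does. The class name CircuitRef and its helpers are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the real engine in the question is unknown.
class CircuitRef {
    // First pattern from the question, unchanged.
    static final String REGEX = "[0-9A-Z]{4}\\s[0-9A-Z]+\\s[0-9A-Z]+\\s[0-9A-Z]+";

    // Returns the first match of the pattern in the text, or null if there is none.
    static String firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    // Simulates an engine that ignores case by default.
    static String caseInsensitive(String text) {
        return firstMatch(Pattern.compile(REGEX, Pattern.CASE_INSENSITIVE), text);
    }

    // Same engine setting, but (?-i) turns case-insensitivity off inside the pattern.
    static String extract(String text) {
        return firstMatch(Pattern.compile("(?-i)" + REGEX, Pattern.CASE_INSENSITIVE), text);
    }

    public static void main(String[] args) {
        String text = "Circuit Ref\nService Type\nA End Address\nZ End Address\n"
                    + "52GD J32SD41 O2AE EVC001\nEvolve Internet";
        System.out.println(extract(text));               // 52GD J32SD41 O2AE EVC001
        System.out.println("J32SD41".matches("\\W+"));   // false: \W is "non-word", not "uppercase word"
    }
}
```

With case-insensitivity left on, the same pattern latches onto the header lines first, which matches the symptom described in the question.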

Regex, Difference between ^[a-zA-Z]+$ vs [a-zA-Z]* [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 8 years ago.
I'm very new to programming and I've been told to avoid regex for now, but I find it extremely helpful.
When writing a program to check if a string only contains letters, I found on Stack Overflow that both ^[a-zA-Z]+$ and [a-zA-Z]* yield the same results. I understand how [a-zA-Z] works and I understand how [A-z] is different from both of those as well, but I do not understand ^[]+$ vs []* or why they yield the same result, and I'm having trouble finding anything to explain it.
Here's the example I used it in:
String student = input.next();
// Keep prompting until the token consists only of letters.
while (!student.matches("[a-zA-Z]*")) {
    System.out.print("Invalid input. Enter name: ");
    student = input.next();
}
This is my first question here so sorry if this kind of question is frowned upon.
As you know,
[a-zA-Z]
matches a single upper- or lower-case letter.
[a-zA-Z]*
matches zero or more upper- or lower-case letters in a row.
^[a-zA-Z]+$
matches a string that STARTS with one or more upper- or lower-case letters and also ENDS with them. In other words, the only thing in your string is upper- or lower-case letters.
^ and $ play more of a role when you're dealing with streams of data, using regular expressions to sift out stuff you want while ignoring the stuff you don't. That last pattern could be used to find a stream consisting of only upper and lower-case letters.
* is zero or more, + is one or more.
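However, the larger difference is the ^ and $. The anchored pattern ^[a-zA-Z]+$ says that the string MUST contain only [a-zA-Z], so the string 123abc123 is not valid.
When ^ and $ are omitted and the regex is used to search within a string, 123abc123 is valid, because a match can be found inside it. In your Java code the two behave alike because String.matches() only succeeds when the whole string matches, so the anchors are implied; the one remaining difference is that [a-zA-Z]* also accepts the empty string.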
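The difference between searching and whole-string matching can be checked with a short Java sketch. The class name AnchorDemo and the two helper names are made up for illustration; String.matches and Matcher.find are the standard library calls.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answers.
class AnchorDemo {
    // Whole-string match: this is what String.matches() does.
    static boolean wholeMatch(String regex, String s) {
        return s.matches(regex);
    }

    // Search: succeeds if the regex matches anywhere inside the string.
    static boolean foundAnywhere(String regex, String s) {
        return Pattern.compile(regex).matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("[a-zA-Z]*", "123abc123"));      // false: anchors are implied
        System.out.println(foundAnywhere("[a-zA-Z]*", "123abc123"));   // true: an (empty) match exists
        System.out.println(foundAnywhere("^[a-zA-Z]+$", "123abc123")); // false: explicit anchors
        System.out.println(wholeMatch("[a-zA-Z]*", ""));               // true: * allows zero letters
        System.out.println(wholeMatch("^[a-zA-Z]+$", ""));             // false: + needs at least one
    }
}
```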