Regex to match all of a set except certain ones

I'm sure this has been asked before, but I can't find it (or don't know the proper wording to search for).
Basically, I want a regex that matches all non-alphanumeric characters except hyphens. In other words, match \W+ but exclude '-'. I'm not sure how to exclude specific characters from a predefined set.

\W is a shorthand for [^\w]. So:
[^\w-]+
A bit of background:
[…] defines a set
[^…] negates a set
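Generally, every lowercase shorthand set \v is negated by its uppercase counterpart \V, where v is any letter that defines a set (for example \w and \W, \d and \D, \s and \S).
For international characters, you may want to look into [[:alpha:]] and [[:alnum:]], if your regex flavor supports POSIX classes.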

[^\w-]+
will do just that: match any character that is neither in the \w set nor a hyphen.
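As a minimal sketch of the class above, here it is in Python (an assumption, since the question does not name a language). The function name `find_separator_runs` and the sample string are made up for illustration.

```python
import re

# Negated class: anything that is neither a word character (\w) nor a hyphen.
NON_WORD_EXCEPT_HYPHEN = re.compile(r"[^\w-]+")


def find_separator_runs(text):
    """Return each run of characters that are neither word characters nor hyphens."""
    return NON_WORD_EXCEPT_HYPHEN.findall(text)


if __name__ == "__main__":
    # The hyphen and the underscore are both left alone.
    print(find_separator_runs("foo-bar, baz_qux! 42"))  # [', ', '! ']
```

Note that the underscore is part of \w, so it is not matched either.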

You can use:
[^a-zA-Z0-9_-]
or
[^\w-]
to match a single character that is neither a hyphen nor alphanumeric (both classes also leave out the underscore). To match one or more of them, append a +.

In Java 7 or above, you need to prepend (?U) so that \w also covers non-ASCII (Unicode) word characters, e.g.
(?U)[^\w-]
In a Java string literal (where you need to escape each \ with another one):
(?U)[^\\w-]
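For comparison, here is a rough Python analogue of the ASCII versus Unicode distinction (an assumption: this is Python's `re` module, not Java, and the defaults are the other way around). In Python 3, \w is Unicode-aware for str patterns unless `re.ASCII` is set. The sample word is made up.

```python
import re

# "naïve-café!" written with escapes so the accented letters are unambiguous.
text = "na\u00efve-caf\u00e9!"

# Default for str patterns in Python 3: \w is Unicode-aware.
unicode_hits = re.findall(r"[^\w-]", text)

# re.ASCII restricts \w to [a-zA-Z0-9_], so accented letters count as non-word.
ascii_hits = re.findall(r"[^\w-]", text, flags=re.ASCII)

if __name__ == "__main__":
    print(unicode_hits)  # ['!']
    print(ascii_hits)    # ['ï', 'é', '!']
```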

Related

Regex not extracting all matching words

I am trying to extract words that have at least one character from a special character set. It picks up some words and not others. Here is a link to regex101 to test it. This is the regex: \b(\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+\w*)\b, and this is the sample sentence I am using:
His full name is Abu ʿĪsa Muḥammad ibn ʿĪsa ibn Sawrah ibn Mūsa ibn
Al-Daḥāk Al-Sulamī Al-Tirmidhī.
It should match the following words:
ʿĪsa Muḥammad ʿĪsa Mūsa Al-Daḥāk Al-Sulamī Al-Tirmidhī
I am not too experienced with regex, so I have no idea what I am doing wrong. If someone knows any tool to find out why a specific word doesn't match a regex pattern, please let me know as well.
You can use
[\w-]*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ][\wāīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ-]*
After matching the one required special character, use another character set to match more occurrences of those characters or normal word characters.
https://regex101.com/r/ovJoLt/2
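A sketch of the same pattern in Python, assuming Python's Unicode-aware \w (the question itself was tested on regex101, so this is a swapped-in flavor). The helper name `extract_special_words` is hypothetical, and the NFC normalization step is an addition of this sketch so that letters typed as base letter plus combining mark still match.

```python
import re
import unicodedata

# The special characters from the question, normalized to precomposed (NFC) form.
SPECIAL = unicodedata.normalize("NFC", "āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ")

# Optional word/hyphen prefix, one required special character, then any mix of
# word characters, special characters, and hyphens.
WORD_WITH_SPECIAL = re.compile(rf"[\w-]*[{SPECIAL}][\w{SPECIAL}-]*")


def extract_special_words(text):
    """Return the words that contain at least one of the special characters."""
    return WORD_WITH_SPECIAL.findall(unicodedata.normalize("NFC", text))


if __name__ == "__main__":
    sentence = "Abu ʿĪsa Muḥammad ibn Sawrah ibn Al-Daḥāk."
    print(extract_special_words(sentence))
```

Words without any special character (such as "Abu" or "Sawrah") are skipped because the middle character class is mandatory.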
You can make this work by enabling the Unicode flag /u (so that the word boundary \b assertions support Unicode characters) and adding hyphens to the surrounding character groups:
/\b[\w-]*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+[\w-]*\b/gu
Plus, you don't need the capturing group, since the only characters being matched form the desired output anyway (\b is a zero-width assertion).
You are not doing anything wrong, except that to match Unicode word boundaries you have to enable the u modifier, or use whitespace lookarounds instead: (?<!\S)\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+\w*(?!\S)
If you want to match hyphens as well, add the hyphen to your character class: (?<!\S)\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ-]+\w*(?!\S)

Regex - special characters and numbers - PHP and Javascript

As I have hard time creating regex that would match letters only including accented characters (ie. Czech characters), I would like to go the other way around for my name validation - detect special characters and numbers.
What would be regex that matches special characters and numbers?
To elaborate on @anubhava's answer: \w stands for [a-zA-Z0-9_], and capitalizing it (\W) negates the character class. If you want to match _ too, you'll have to write your own character class, such as [^a-zA-Z0-9] (everything but ASCII alphanumerics). This can be shortened to [^a-z\d] if you use the i modifier. Note that this also matches accented characters, since they are not in a-zA-Z0-9.
However, I always advise against trying to use a "regular" expression to match a name (since names are not regular). See this blog post.
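To illustrate the point about accented characters, here is a small Python sketch (an assumption: the question is about PHP and JavaScript, where \w is ASCII-only by default). It compares the ASCII-only class from the answer with a class built on Python's Unicode-aware \w. The function name `has_non_letter` is hypothetical.

```python
import re

# Python's \w is Unicode-aware for str patterns, so [\W\d_] flags digits,
# underscores, and anything that is not a letter (including spaces and hyphens).
NOT_A_LETTER = re.compile(r"[\W\d_]")

# ASCII-only class from the answer: it also flags accented letters.
NOT_ASCII_ALNUM = re.compile(r"[^a-z\d]", re.IGNORECASE)


def has_non_letter(name):
    """Return True if the name contains a digit, underscore, or other non-letter."""
    return NOT_A_LETTER.search(name) is not None


if __name__ == "__main__":
    czech_name = "Dvo\u0159\u00e1k"  # "Dvořák"
    print(has_non_letter(czech_name))                       # False
    print(has_non_letter("R2D2"))                           # True
    print(NOT_ASCII_ALNUM.search(czech_name) is not None)   # True
```

Hyphens, spaces, and apostrophes are flagged too, which is one reason name validation with a regex tends to reject real names.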

How to match until get specific pattern in Regex

I have a scenario where I want to match a specific word and then match everything until I reach another pattern. For example:
ABC=145865865
Then anything comes in ways
and then
Date=11/11/2001
I have tried (.*?), but it only matches within a single line; in my scenario there are multiple lines of data in between.
How can I do this?
Closest guess to what I think you're looking for:
ABC=(\d+)[\s\S]*?Date=(\d\d/\d\d/\d{4})
This uses [\s\S] which means "either a whitespace character or not a whitespace character", which is equivalent to "any character". The . can also be set to match any character, but I tend to prefer [\s\S] because it does just that without having to set flags. You haven't specified the language you are using so I can't tell you how to set such a flag anyway (it's re.DOTALL in Python).
Multiple lines? If you mean that you have newline characters (\n) in between, then you need to set the DOTALL flag, as follows (in Java):
Pattern p = Pattern.compile(<your-regex-here>, Pattern.DOTALL);
The above will match new line characters between the two strings.
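Both approaches from this thread can be sketched in Python (an assumption, since the question does not name a language), using the sample data from the question. The variable names are made up.

```python
import re

text = "ABC=145865865\nThen anything comes in ways\nand then\nDate=11/11/2001"

# [\s\S] matches any character, including newlines, without any flag.
any_char = re.compile(r"ABC=(\d+)[\s\S]*?Date=(\d\d/\d\d/\d{4})")

# With re.DOTALL, the dot also matches newlines.
dotall = re.compile(r"ABC=(\d+).*?Date=(\d\d/\d\d/\d{4})", re.DOTALL)

# Without the flag, the dot stops at the first newline, so nothing matches.
plain_dot = re.compile(r"ABC=(\d+).*?Date=(\d\d/\d\d/\d{4})")

if __name__ == "__main__":
    print(any_char.search(text).groups())  # ('145865865', '11/11/2001')
    print(dotall.search(text).groups())    # ('145865865', '11/11/2001')
    print(plain_dot.search(text))          # None
```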

Regex help NOT a-z or 0-9

I need a regex to find all chars that are NOT a-z or 0-9
I don't know the syntax for the NOT operator in regex.
I want the regex to be NOT [a-z, A-Z, 0-9].
Thanks in advance!
It's ^. Your regex should use [^a-zA-Z0-9]. Beware: this character class may behave unexpectedly with non-ASCII text. For instance, it would match é.
Edited
If the regexes are Perl-compatible (PCRE), you can use \s to match all whitespace; it covers spaces and other whitespace characters. If they're POSIX-compatible, use the [:space:] character class (like so: [^a-zA-Z0-9[:space:]]). I would recommend using [:alnum:] instead of a-zA-Z0-9.
If you want to match at the end of a line, you should include a $ at the end. Multiline mode only changes where ^ and $ match (at every line break rather than only at the ends of the input), so turn it on only when you need that; depending on the tool, it may reduce performance for larger files since more must be read into memory.
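As a small illustration of how $ interacts with multiline mode, here is a sketch in Python (an assumption; the thread does not name a language, and flag names differ between flavors). The sample input is made up.

```python
import re

lines = "a1\nb22\nc333"

# Without re.MULTILINE, $ only matches at the very end of the input
# (or just before a final trailing newline).
end_of_input = re.findall(r"\d+$", lines)

# With re.MULTILINE, $ also matches before each newline.
end_of_each_line = re.findall(r"\d+$", lines, flags=re.MULTILINE)

if __name__ == "__main__":
    print(end_of_input)      # ['333']
    print(end_of_each_line)  # ['1', '22', '333']
```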
Why don't you include a copy of sample input, the text you want to match, and the program you are using to do so?
It's pretty simple; you just add ^ at the beginning of a character set to negate that character set.
For example, the following pattern will match every character that's not in that character set, i.e., not a lowercase ASCII letter or a digit:
[^a-z0-9]
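A quick Python sketch (an assumption, since the question does not name a language) comparing the negated class with and without the uppercase range. The sample string is made up and includes é to show the non-ASCII caveat.

```python
import re

sample = "Ab1-\u00e9"  # "Ab1-é"

# Lowercase letters and digits excluded: the uppercase A still matches.
not_lower_or_digit = re.findall(r"[^a-z0-9]", sample)

# Both cases excluded: only the hyphen and the accented letter match.
not_alnum = re.findall(r"[^a-zA-Z0-9]", sample)

if __name__ == "__main__":
    print(not_lower_or_digit)  # ['A', '-', 'é']
    print(not_alnum)           # ['-', 'é']
```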
As a side note, some of the more helpful Regular Expression resources I've found have been this site and this cheat sheet (C# specific).
Put a ^ at the beginning of your character class expression: [^a-z0-9]
Put ^ at the start: [^a-zA-Z0-9]. For the condition, try it with PHP's preg_match() or preg_replace() (eregi() was removed in PHP 7).
You can also use \W; it's a shorthand for a non-word character (equal to [^a-zA-Z0-9_]).

Regular Expression to test an entire word

I have this expression, ([a-zA-Z]|ñ|Ñ)*, which I want to use to block all characters except letters and Ñ from being entered in a textbox.
The problem is that it returns a match for A9023 but also for 32""". How can I make it return a match for A9023 but not for 32"""?
Thanks.
You need to add assertions for the start and the end of the string:
^([a-zA-Z]|ñ|Ñ)*$
Otherwise the regular expression matches at any position. Additionally, you can also write ([a-zA-Z]|ñ|Ñ)* as the character class [a-zA-ZñÑ]*:
^[a-zA-ZñÑ]*$
Are you sure you don't mean ^([a-zA-Z]|ñ|Ñ)*$? You might be finding the characters you want but not excluding the ones you don't. The expression I mentioned pins the match to the beginning (^) and the end ($) of the string, so that nothing else will pass. Otherwise:
123ABC456
...will pass your match, because it found 0-or-more letters, even though there were also other characters.
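The difference the anchors make can be sketched in Python (an assumption, since the question does not name a language). The function names are hypothetical.

```python
import re

UNANCHORED = re.compile(r"[a-zA-ZñÑ]*")
ANCHORED = re.compile(r"^[a-zA-ZñÑ]*$")


def matches_somewhere(text):
    """True if the unanchored pattern matches anywhere (even an empty match)."""
    return UNANCHORED.search(text) is not None


def matches_whole_string(text):
    """True only if the whole string consists of the allowed letters."""
    return ANCHORED.search(text) is not None


if __name__ == "__main__":
    print(matches_somewhere("123ABC456"))     # True: zero letters match at position 0
    print(matches_whole_string("123ABC456"))  # False
    print(matches_whole_string("ABC"))        # True
```

Because of the *, the anchored pattern still accepts an empty string; use + instead if at least one letter is required.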
You didn't say which regex flavor (which programming language) you're using, but you might want to consider either
^\p{L}*$
if your regex flavor supports Unicode properties or
^[^\W\d_]*$
if it doesn't.
Reason: Your regex will allow only unaccented letters and Ñ - is there a real language that uses the latter without also having accented letters?
\p{L} means "any letter in any 'language'",
[^\W\d_] means "any character that is not a non-alphanumeric, not a digit, and not an underscore", which is just a fancy but necessary way to say "any letter" (\w is a shorthand for "letter, digit or underscore", \W is the inverse of that).
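A sketch of the [^\W\d_] variant in Python, whose standard `re` module does not support \p{L} but has a Unicode-aware \w for str patterns. The function name `is_all_letters` is hypothetical, and this is only an approximation of "any letter", since Python's \w also covers a few numeric characters that \d does not.

```python
import re

# Word characters minus digits and underscore: roughly "any letter".
ONLY_LETTERS = re.compile(r"[^\W\d_]*")


def is_all_letters(text):
    """True if every character in text passes the [^\\W\\d_] class."""
    return ONLY_LETTERS.fullmatch(text) is not None


if __name__ == "__main__":
    print(is_all_letters("A\u00f1ejo"))   # True ("Añejo")
    print(is_all_letters("A9023"))        # False
    print(is_all_letters("snake_case"))   # False
```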