Alternative for VBA regex unicode characters groups support - regex

VBA Regular Expressions do not support Unicode character classes (e.g. \p{L}). Also, \w matches only Latin alphanumerics. So my problem was how to replace non-alphanumeric characters in my Unicode string without typing the whole list of characters into the pattern field.
For example, trying to replace with underscore every non word character in "abc (for αβψ̌) and de (for δε)", with pattern \W results in "abc__for_______and_de__for____" instead of abc__for_αβψ___and_de__for_δε_
Finally I think there is at least one quick solution...

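One approach is to find the first and last Unicode characters of the script you need and use them as a character range. With the pattern [^\w,\u0370-\u03FF\u1F00-\u1FFF] I can get rid of any character that is not a Latin or Greek alphanumeric. Note that the comma inside the brackets is a literal comma, so commas are kept as well; drop it if they should be replaced too.
We can also use this pattern in the Excel function RegExReplace.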
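For readers who want to try the range idea outside VBA, here is a minimal Python sketch. It is only an illustration, not the VBA code itself: the function name replace_non_word is made up for this example, re.ASCII is used so that \w behaves like the Latin-only \w described above, and the literal comma from the pattern is left out.

```python
import re

# With re.ASCII, \w matches only [a-zA-Z0-9_], which mimics the Latin-only
# behaviour described for VBA. The two ranges are the ones from the pattern
# above (Greek and Coptic, Greek Extended).
ASCII_ONLY = re.compile(r"\W", re.ASCII)
LATIN_OR_GREEK = re.compile(r"[^\w\u0370-\u03FF\u1F00-\u1FFF]", re.ASCII)


def replace_non_word(text: str, pattern: re.Pattern = LATIN_OR_GREEK) -> str:
    """Replace every character matched by `pattern` with an underscore."""
    return pattern.sub("_", text)


sample = "de (for \u03b4\u03b5)"  # "de (for δε)"
print(replace_non_word(sample, ASCII_ONLY))  # Greek letters are replaced too
print(replace_non_word(sample))              # Greek letters survive
```

Other scripts can be covered the same way by adding their code point ranges to the character class.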

Related

How to build regex to search for strings that has non-alphanumeric characters?

I want to search for strings that contain special alphanumeric characters such as á. /[^a-zA-Z0-9]/ finds strings that contain a non-alphanumeric character, but I don't want that. I want to keep only the strings that contain alphanumeric characters like á, so that the pattern matches álgebra but doesn't match algebra. How can I build that?
You seem to want to match alphanumeric strings that contain at least one letter other than an ASCII letter.
You can use
^(?=.*(?![A-Za-z])\p{L})[\p{L}0-9]+$
See the regex demo.
Details
^ - start of string
(?=.*(?![A-Za-z])\p{L}) - there must be at least one letter that is not an ASCII letter
[\p{L}0-9]+ - one or more any Unicode letters or ASCII digits
$ - end of string.
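The same lookahead idea can be sketched in Python. This is an approximation under a stated assumption: the standard re module has no \p{L}, so [^\W\d_] (a word character that is neither a digit nor an underscore) stands in for it. The function name is hypothetical.

```python
import re

# [^\W\d_] approximates \p{L} in Python's built-in re module.
# (?![A-Za-z]) rejects ASCII letters, so the lookahead demands at least one
# non-ASCII letter somewhere in the string.
NON_ASCII_LETTER_REQUIRED = re.compile(
    r"^(?=.*(?![A-Za-z])[^\W\d_])(?:[^\W\d_]|[0-9])+$"
)


def has_non_ascii_letter(s: str) -> bool:
    """True for letter/digit strings containing at least one non-ASCII letter."""
    return NON_ASCII_LETTER_REQUIRED.match(s) is not None


print(has_non_ascii_letter("\u00e1lgebra"))  # "álgebra"
print(has_non_ascii_letter("algebra"))
```

If exact \p{L} semantics matter, the third-party regex package supports Unicode property classes and could be used instead.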
If you already know which characters you need to match, I think you could use a character class: [list of characters].
I will give an example. I have some texts.
álgebara
algebara
algebrà
I use regex: ^.*?[áà].*?$
Result:
álgebara
algebrà
Details:
^.*?[áà].*?$: matches any string that contains the specified character á or à
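As a quick check of the character-list approach, the following Python sketch filters the three sample words. The variable names are illustrative, and the accented letters are written as escapes for the precomposed characters, so text that stores the accent as a separate combining mark would not match.

```python
import re

# Same pattern as above: any string containing á (U+00E1) or à (U+00E0).
CONTAINS_ACCENTED_A = re.compile("^.*?[\u00e1\u00e0].*?$")

words = ["\u00e1lgebara", "algebara", "algebr\u00e0"]
matches = [w for w in words if CONTAINS_ACCENTED_A.match(w)]
print(matches)
```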

Notepad++ remove non alphanumeric characters

What is the best way to remove non alphanumeric characters from a text file using notepad++?
I only want to keep numbers and letters. Is there a built-in feature to help, or should I go the regex route?
I am trying to use this to keep them as well as spaces [a-zA-Z0-9 ]. It is working but I need to do the opposite!
In a Replace dialog window (Ctrl+H), use a negated character class in the Find What field:
[^a-zA-Z0-9\s]+
Here, [^ starts a negated character class that matches any character other than the one that belongs to the character set(s)/range(s) defined in it. So, the whole matches 1 or more chars other than ASCII letters, digits, and any whitespace.
Or, to make the expression Unicode-aware,
[^[:alnum:][:space:]]+
Here, [:alnum:] matches all alphanumeric chars and [:space:] matches all whitespace.
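Outside Notepad++, the same two replacements can be sketched in Python. This is an illustration rather than what Notepad++ runs internally: Python's re module has no [:alnum:] class, so the Unicode-aware variant below uses \w and adds the underscore back to the set of characters to remove. The function names are hypothetical.

```python
import re

# ASCII variant: remove runs of anything except ASCII letters, digits, whitespace.
ASCII_JUNK = re.compile(r"[^a-zA-Z0-9\s]+")

# Unicode-aware variant: \w covers letters and digits of any script but also
# "_", so the underscore is listed explicitly as something to remove.
UNICODE_JUNK = re.compile(r"(?:[^\w\s]|_)+")


def strip_ascii(text: str) -> str:
    return ASCII_JUNK.sub("", text)


def strip_unicode(text: str) -> str:
    return UNICODE_JUNK.sub("", text)


print(strip_ascii("Hello, World! 123_abc"))
print(strip_unicode("caf\u00e9, na\u00efve_1!"))  # "café, naïve_1!"
```

The difference shows up on accented text: the ASCII variant also deletes é and ï, while the Unicode-aware one keeps them.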

Regex - special characters and numbers - PHP and Javascript

As I have hard time creating regex that would match letters only including accented characters (ie. Czech characters), I would like to go the other way around for my name validation - detect special characters and numbers.
What would be regex that matches special characters and numbers?
To elaborate on @anubhava's answer: \w stands for [a-zA-Z0-9_], and capitalizing it (\W) negates the character class. If you want to match _ too, you'll have to make your own character class like [^a-zA-Z0-9] (everything but alphanumeric). This can also be shortened to [^a-z\d] if you use the i modifier. Note that this would also match accented characters, since they are not in a-zA-Z0-9.
However, I always advise against trying to use a "regular" expression to match a name (since names are not regular). See this blog post.
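To see the accented-character caveat concretely, here is a small Python sketch of the shortened class [^a-z\d] with a case-insensitive flag. The helper name flagged_characters is made up for this example, and re.ASCII is an assumption added so that \d means 0-9 only.

```python
import re

# Everything that is not an ASCII letter or ASCII digit.
NOT_ASCII_ALNUM = re.compile(r"[^a-z\d]", re.IGNORECASE | re.ASCII)


def flagged_characters(name: str) -> list:
    """Return the characters the pattern treats as 'special'."""
    return NOT_ASCII_ALNUM.findall(name)


print(flagged_characters("John_Smith 3"))
print(flagged_characters("Dvo\u0159\u00e1k"))  # "Dvořák"
```

The second call reports ř and á as special characters, which is why this class is a poor fit for validating Czech names.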

Regex for this particular pattern

I have three different things
xxx
xxx>xxx
xxx>xxx>xxx
Where xxx can be any combination of letters and numbers.
I need a regex that can match the first two but NOT the third.
To match ASCII letters and digits try the following:
^[a-zA-Z0-9]{3}(>[a-zA-Z0-9]{3})?$
If letters and digits outside of the ASCII character set are required then the following should suffice:
^[^\W_]{3}(>[^\W_]{3})?$
^\w+(?:>\w+)?$
matches an entire string.
\w+(?:>\w+)?\b(?!>)
matches strings like this in a larger substring.
If you want to exclude the underscore from matching, you can use [\p{L}\p{N}] instead (if your regex engine knows Unicode), or [^\W_] if it doesn't, as a substitute for \w.
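A short Python sketch of the whole-string variant, using [^\W_] in place of \w so the underscore is excluded. The function name is hypothetical, and in Python 3 [^\W_] is Unicode-aware for str input, so non-ASCII letters and digits are accepted as well.

```python
import re

# One segment of letters/digits, optionally followed by ">" and one more segment.
ONE_OR_TWO_SEGMENTS = re.compile(r"^[^\W_]+(?:>[^\W_]+)?$")


def is_one_or_two_segments(s: str) -> bool:
    return ONE_OR_TWO_SEGMENTS.match(s) is not None


for candidate in ["xxx", "xxx>xxx", "xxx>xxx>xxx"]:
    print(candidate, is_one_or_two_segments(candidate))
```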

Regex for alphanumeric, but at least one letter

In my ASP.NET page, I have an input box that has to have the following validation on it:
Must be alphanumeric, with at least one letter (i.e. can't be ALL
numbers).
^\d*[a-zA-Z][a-zA-Z0-9]*$
Basically this means:
Zero or more ASCII digits;
One alphabetic ASCII character;
Zero or more alphanumeric ASCII characters.
Try a few tests and you'll see this passes any alphanumeric ASCII string that contains at least one ASCII letter.
The key to this is the \d* at the front. Without it the regex gets much more awkward to do.
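The question is about an ASP.NET validator, but the pattern is easy to try in Python. In this sketch the function name is illustrative, and re.ASCII is an assumption added so that \d matches only 0-9, as the description above states.

```python
import re

# Zero or more digits, one letter, then any mix of letters and digits.
ALNUM_WITH_LETTER = re.compile(r"^\d*[a-zA-Z][a-zA-Z0-9]*$", re.ASCII)


def is_valid(value: str) -> bool:
    return ALNUM_WITH_LETTER.match(value) is not None


for value in ["123a", "a123", "12a34", "123", "ab-1", ""]:
    print(repr(value), is_valid(value))
```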
Most answers to this question are correct, but there's an alternative, that (in some cases) offers more flexibility if you want to change the rules later on:
^(?=.*[a-zA-Z].*)([a-zA-Z0-9]+)$
This will match any sequence of alphanumeric characters, but only if the look-ahead at the start also succeeds, i.e. the string contains at least one letter. It's a little-known trick in regular expressions that allows you to handle some very difficult validation problems.
For example, say you need to add another constraint: the string should be between 6 and 12 characters long. The obvious solutions posted here wouldn't work, but using the look-ahead trick, the regex simply becomes:
^(?=.*[a-zA-Z].*)([a-zA-Z0-9]{6,12})$
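The look-ahead version with the length constraint can be checked with a few inputs in Python. This is only a sketch: the function name is made up, and the trailing .* inside the look-ahead is dropped because it does not change what matches.

```python
import re

# Look-ahead: at least one ASCII letter somewhere.
# Body: 6 to 12 ASCII letters or digits, nothing else.
LETTER_AND_LENGTH = re.compile(r"^(?=.*[a-zA-Z])[a-zA-Z0-9]{6,12}$")


def is_valid_with_length(value: str) -> bool:
    return LETTER_AND_LENGTH.match(value) is not None


for value in ["abc123", "12345a", "123456", "abc12", "a" * 13]:
    print(value, is_valid_with_length(value))
```

Keeping the "contains a letter" rule in the look-ahead means the body can carry the length rule on its own, which is the flexibility the answer describes.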
^[\p{L}\p{N}]*\p{L}[\p{L}\p{N}]*$
Explanation:
[\p{L}\p{N}]* matches zero or more Unicode letters or numbers
\p{L} matches one letter
[\p{L}\p{N}]* matches zero or more Unicode letters or numbers
^ and $ anchor the string, ensuring the regex matches the entire string. You may be able to omit these, depending on which regex matching function you call.
Result: you can have any alphanumeric string except there's got to be a letter in there somewhere.
\p{L} is similar to [A-Za-z] except it will include all letters from all alphabets, with or without accents and diacritical marks. It is much more inclusive, using a larger set of Unicode characters. If you don't want that flexibility substitute [A-Za-z]. A similar remark applies to \p{N} which could be replaced by [0-9] if you want to keep it simple. See the MSDN page on character classes for more information.
The less fancy non-Unicode version would be
^[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*$
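For comparison, a Unicode-aware version can be approximated in Python, whose built-in re module lacks \p{L} and \p{N}. The sketch below assumes [^\W\d_] is an acceptable stand-in for \p{L} and [^\W_] for [\p{L}\p{N}]; the two are close but not identical to the Unicode categories, and the function name is hypothetical.

```python
import re

# [^\W_]   ~ letters or digits of any script (approximates [\p{L}\p{N}])
# [^\W\d_] ~ letters of any script           (approximates \p{L})
UNICODE_ALNUM_WITH_LETTER = re.compile(r"^[^\W_]*[^\W\d_][^\W_]*$")


def is_alnum_with_letter(value: str) -> bool:
    return UNICODE_ALNUM_WITH_LETTER.match(value) is not None


for value in ["\u00e1lgebra1", "\u03b4\u03b57", "12345", "a b"]:
    print(value, is_alnum_with_letter(value))
```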
^[0-9]*[A-Za-z][0-9A-Za-z]*$
is the regex that will do what you're after. The ^ and $ match the start and end of the string to prevent other characters. You could replace the [0-9A-Za-z] block with \w (which also admits the underscore), but I prefer the more verbose form because it's easier to extend with other characters if you want.
Add a regular expression validator to your asp.net page as per the tutorial on MSDN: http://msdn.microsoft.com/en-us/library/ms998267.aspx.
^\w*[\p{L}]\w*$
This one's not that hard. The regular expression reads: match a line starting with any number of word characters (letters, digits, and connector punctuation such as the underscore, which you might not want), that contains one letter character (that's the [\p{L}] part in the middle), followed by any number of word characters again.
If you want to exclude punctuation, you'll need a heftier expression:
^[\p{L}\p{N}]*[\p{L}][\p{L}\p{N}]*$
And if you don't care about Unicode you can use a boring expression:
^[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*$
^[0-9]*[a-zA-Z][a-zA-Z0-9]*$
This can be:
a number followed by a letter,
or an alphanumeric expression that starts with a letter,
or an alphanumeric expression that starts with a number, followed by a letter, and ends with an alphanumeric subexpression