Pattern regex inclusion special characters - regex

I have this line
pattern = "\S*\w+(\s?$|\s{1,}\w+)+"
It all works fine as it allows me to block the initial white space, and allow at those between the words, but I can not include special characters (for example: '+' &%) without changing this property. Can someone help me out ? Thank you

If all you want is a space split you should replace \w with \S.
And anyway having \S*\w+ is sort of redundant, you could simplify with \S*\w.
But if you want finer control why not write out the whole range and replace \w with [a-zA-Z0-9_+&%]?
Check out regular expressions for javascript

Only \S matches special characters, \w only matches [a-zA-Z0-9_].
So you could simply replace them to
pattern="\S*\S+(\s?$|\s{1,}\S+)+"
but there is so much redundancy then. Simplify it to
pattern="\S+(\s+\S+)*\s?"
or if really the only thing you care about is starting with \S then just do
pattern="\S[\s\S]*" <!-- or -->
pattern="\S.*" <!-- not allowing linebreaks -->

From my understanding if you want it to not find the whitespaces or the special characters simply remove the \S* this matches anything OTHER then whitespace which includes special characters.
\w+(\s?$|\s{1,}\w+)+
This means it would block the whitespace and the special characters at the beginning of the regex however special characters inbetween words would be ignored. for that i would replace the \s with \W for non-word characters. This would allow spaces and special characters in between the words.
\w+(\W?$|\W{1,}\w+)+
A great site to test out regex and where I was able to confirm this was regex101.com it's a place you can test out the regex as you type it with detailed information that displays what your regex will do as you type it. You can also include sample text to see what your regex will find in the text. the above regex when given: " ! Test" only captured the Test and ignored both the ! and the spaces prior to Test.

Related

Regex allowing a-zA-Z0-9, spaces, dots, commas, minus and not allowing newline

I want to make regex that works for online sentence and can include:
a-zA-Z
0-9
spaces
dots
minus sign
commas
I tried formula like this:
^[A-Za-z0-9\s,.\r\n-]+$
But it does not work for input with newlines "dsa\nsad"
Can anybody explain me what is wrong and how to do it correctly?
Your current regex is matching any whitespace by including \s in the character class. Therefore, it is superfluous to also include \r and \n, and you might as well just use:
^[A-Za-z0-9\s,.-]+$
That aside, your pattern is working if you test it with an input containing an actual linefeed instead of \n.
Demo

Regex not extracting all matching words

I am trying to extract words that have at least one character from a special character set. It picks up some words and not others. Here is a link to regex101 to test it. This it the regex \b(\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+\w*)\b, and this is the sample sentence I am using
His full name is Abu ʿĪsa Muḥammad ibn ʿĪsa ibn Sawrah ibn Mūsa ibn
Al-Daḥāk Al-Sulamī Al-Tirmidhī.
It should match the following words:
ʿĪsa Muḥammad ʿĪsa Mūsa Al-Daḥāk Al-Sulamī Al-Tirmidhī
I am not too experienced with regex, so I have no idea what I am doing wrong. If someone knows any tool to find out why a specific word doesn't match a regex pattern, please let me know as well.
You can use
[\w-]*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ][\wāīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ-]*
After matching the one required special character, use another character set to match more occurrences of those characters or normal word characters.
https://regex101.com/r/ovJoLt/2
You can make this work by enabling the Unicode flag /u (so that the word boundary \b assertions support Unicode characters) and adding hyphens to the surrounding character groups:
/\b[\w-]*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+[\w-]*\b/gu
Plus, you don't need the capturing group, since the only characters being matched form the desired output anyway (\b is a zero-width assertion).
Demo
You are not doing anything wrong except that to match unicode boundaries you have to enable u modifier or use (?<!\S)\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+\w*(?!\S)
If you want to match hyphen add it to your character class (?<!\S)\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ-]+\w*(?!\S)

regex npp - search string must be followed by specific chars, but not include those chars

In the line below, I need to these two lines into one single line by replacing the newline and empty space with nothing.
Provisioned Links : 2/14, 2/24, 7/10, 7/12,
7/25, 7/31, 7/32
Therefore I have this regex (in Notepad++):
(\r\n|\n)\s+[0-9]\/[0-9]*
Problem: the match includes the 7/25 - I need it to look for the #/## but not include it.
If I use this lookaround pattern:
(\r\n|\n)\s+(q=[0-9]\/[0-9])*
all lines beginning with newline + spaces are matched, whether or not they end with #/##.
What am I doing wrong?
regex101 fiddle to play with
Be careful:
You should correct the way you constructed the lookahead: (?=....)
Lookarounds are not quantifiable.
so what you need really is [\r\n]\s+(?=[0-9]\/[0-9]*).
Live demo
To normalize whitespace, why not simply replace "comma with additional space after it" with "comma plus one tab character" ?
You don't need that complicated pattern at all, because \s matches spaces, newlines, and tabs all at the same time:
Pattern: ,\s*
Replacement string: ,\t
https://regex101.com/r/T0QJnq/1

Regular expression to allow spaces between words

I want a regular expression that prevents symbols and only allows letters and numbers. The regex below works great, but it doesn't allow for spaces between words.
^[a-zA-Z0-9_]*$
For example, when using this regular expression "HelloWorld" is fine, but "Hello World" does not match.
How can I tweak it to allow spaces?
tl;dr
Just add a space in your character class.
^[a-zA-Z0-9_ ]*$
Now, if you want to be strict...
The above isn't exactly correct. Due to the fact that * means zero or more, it would match all of the following cases that one would not usually mean to match:
An empty string, "".
A string comprised entirely of spaces, " ".
A string that leads and / or trails with spaces, " Hello World ".
A string that contains multiple spaces in between words, "Hello World".
Originally I didn't think such details were worth going into, as OP was asking such a basic question that it seemed strictness wasn't a concern. Now that the question's gained some popularity however, I want to say...
...use #stema's answer.
Which, in my flavor (without using \w) translates to:
^[a-zA-Z0-9_]+( [a-zA-Z0-9_]+)*$
(Please upvote #stema regardless.)
Some things to note about this (and #stema's) answer:
If you want to allow multiple spaces between words (say, if you'd like to allow accidental double-spaces, or if you're working with copy-pasted text from a PDF), then add a + after the space:
^\w+( +\w+)*$
If you want to allow tabs and newlines (whitespace characters), then replace the space with a \s+:
^\w+(\s+\w+)*$
Here I suggest the + by default because, for example, Windows linebreaks consist of two whitespace characters in sequence, \r\n, so you'll need the + to catch both.
Still not working?
Check what dialect of regular expressions you're using.* In languages like Java you'll have to escape your backslashes, i.e. \\w and \\s. In older or more basic languages and utilities, like sed, \w and \s aren't defined, so write them out with character classes, e.g. [a-zA-Z0-9_] and [\f\n\p\r\t], respectively.
* I know this question is tagged vb.net, but based on 25,000+ views, I'm guessing it's not only those folks who are coming across this question. Currently it's the first hit on google for the search phrase, regular expression space word.
One possibility would be to just add the space into you character class, like acheong87 suggested, this depends on how strict you are on your pattern, because this would also allow a string starting with 5 spaces, or strings consisting only of spaces.
The other possibility is to define a pattern:
I will use \w this is in most regex flavours the same than [a-zA-Z0-9_] (in some it is Unicode based)
^\w+( \w+)*$
This will allow a series of at least one word and the words are divided by spaces.
^ Match the start of the string
\w+ Match a series of at least one word character
( \w+)* is a group that is repeated 0 or more times. In the group it expects a space followed by a series of at least one word character
$ matches the end of the string
This one worked for me
([\w ]+)
Try with:
^(\w+ ?)*$
Explanation:
\w - alias for [a-zA-Z_0-9]
"whitespace"? - allow whitespace after word, set is as optional
I assume you don't want leading/trailing space. This means you have to split the regex into "first character", "stuff in the middle" and "last character":
^[a-zA-Z0-9_][a-zA-Z0-9_ ]*[a-zA-Z0-9_]$
or if you use a perl-like syntax:
^\w[\w ]*\w$
Also: If you intentionally worded your regex that it also allows empty Strings, you have to make the entire thing optional:
^(\w[\w ]*\w)?$
If you want to only allow single space chars, it looks a bit different:
^((\w+ )*\w+)?$
This matches 0..n words followed by a single space, plus one word without space. And makes the entire thing optional to allow empty strings.
This regular expression
^\w+(\s\w+)*$
will only allow a single space between words and no leading or trailing spaces.
Below is the explanation of the regular expression:
^ Assert position at start of the string
\w+ Match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
1st Capturing group (\s\w+)*
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\s Match any white space character [\r\n\t\f ]
\w+ Match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
$ Assert position at end of the string
Just add a space to end of your regex pattern as follows:
[a-zA-Z0-9_ ]
This does not allow space in the beginning. But allowes spaces in between words. Also allows for special characters between words. A good regex for FirstName and LastName fields.
\w+.*$
For alphabets only:
^([a-zA-Z])+(\s)+[a-zA-Z]+$
For alphanumeric value and _:
^(\w)+(\s)+\w+$
If you are using JavaScript then you can use this regex:
/^[a-z0-9_.-\s]+$/i
For example:
/^[a-z0-9_.-\s]+$/i.test("") //false
/^[a-z0-9_.-\s]+$/i.test("helloworld") //true
/^[a-z0-9_.-\s]+$/i.test("hello world") //true
/^[a-z0-9_.-\s]+$/i.test("none alpha: ɹqɯ") //false
The only drawback with this regex is a string comprised entirely of spaces. "       " will also show as true.
It was my regex: #"^(?=.{3,15}$)(?:(?:\p{L}|\p{N})[._()\[\]-]?)*$"
I just added ([\w ]+) at the end of my regex before *
#"^(?=.{3,15}$)(?:(?:\p{L}|\p{N})[._()\[\]-]?)([\w ]+)*$"
Now string is allowed to have spaces.
This regex allow only alphabet and spaces:
^[a-zA-Z ]*$
Try with this one:
result = re.search(r"\w+( )\w+", text)

Regex help NOT a-z or 0-9

I need a regex to find all chars that are NOT a-z or 0-9
I don't know the syntax for the NOT operator in regex.
I want the regex to be NOT [a-z, A-Z, 0-9].
Thanks in advance!
It's ^. Your regex should use [^a-zA-Z0-9]. Beware: this character class may have unexpected behavior with non-ascii locales. For instance, this would match é.
Edited
If the regexes are perl-compatible (PCRE), you can use \s to match all whitespace. This expands to include spaces and other whitespace characters. If they're posix-compatible, use [:space:] character class (like so: [^a-zA-Z0-9[:space:]]). I would recommend using [:alnum:] instead of a-zA-Z0-9.
If you want to match the end of a line, you should include a $ at the end. Turning on multiline mode is only when your match should extend across multiple lines, and it reduces performance for larger files since more must be read into memory.
Why don't you include a copy of sample input, the text you want to match, and the program you are using to do so?
It's pretty simple; you just add ^ at the beginning of a character set to negate that character set.
For example, the following pattern will match everything that's not in that character set -- i.e., not a lowercase ASCII character or a digit:
[^a-z0-9]
As a side note, some of the more helpful Regular Expression resources I've found have been this site and this cheat sheet (C# specific).
Put at ^ at the begining of your character class expression: [^a-z0-9]
At start [^a-zA-Z0-9]
for condition;
pre_match();
pre_replace();
ergi();
try this
You can also use \W it's a shorthand for non-word character (equal to [^a-zA-Z0-9_])