CQ5 textfield validation with regex - regex

I have a simple CQ dialog with a textfield. The authors somehow managed to paste illegal characters into it; the last two times it was a vertical tab (VT) copied from a PowerPoint file.
I played around with some regex and came up with the following to exclude anything below SPACE and DEL:
/^[^\0-\x1F\x7F]*$/
Sadly I can't really test the vertical tab as I am not able to enter this character on regex101. So I tried it with TAB and this seems to be working: https://regex101.com/r/yH0lN5/1
But if I use this in my regex property of the textfield, no matter what I enter the validation fails. Any idea what I am doing wrong?
Whitelisting isn't an option, as I need to support Unicode characters such as Chinese in the future.

You should double the backslashes to make sure they are treated as literal backslashes by the regex engine.
Also, I suggest using consistent notation, and replace \0 with \x00:
regex="/^[^\\x00-\\x1F\\x7F]*$/"
This regex matches entire strings that contain zero or more characters (due to *) other than (due to the negated character class [^...]) those from NUL to US ([\x00-\x1F]) and the DEL character (\x7F).
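Since a vertical tab is hard to type into an online tester, the character class can also be checked in a small script, where VT is simply written as \x0b. The following is only a sketch in Python: the helper name is_clean is made up for illustration, and Python's re module is assumed to treat this character class the same way as the engine behind the dialog.

```python
import re

# Same class as in the answer: anything except NUL..US (0x00-0x1F) and DEL (0x7F).
CLEAN = re.compile(r"[^\x00-\x1F\x7F]*")

def is_clean(text: str) -> bool:
    """Hypothetical helper: True if text contains no ASCII control characters."""
    return CLEAN.fullmatch(text) is not None

print(is_clean("plain text"))             # True
print(is_clean("pasted\x0bfrom slides"))  # False: contains a vertical tab
print(is_clean("\u4e2d\u6587"))           # True: Chinese characters are allowed
```

Because the class is a blacklist, any character outside the two excluded ranges passes, including non-ASCII text.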

Related

Ignoring invisible characters in RegEx

I've run into a bit of a conundrum.
I am currently trying to build a regex to filter out some particularly nasty scam emails. I'm sure you've seen them before, using a data dump from a compromised website to threaten to reveal intimate videos.
That's all well and good, except I noticed while testing the regex that some of these messages insert special invisible characters in the middle of words. Like you might see here (I've found it especially hard to find a place that keeps these special characters):
Regexr link
I find myself looking for a way to create a regex that might ignore these characters all together, as some emails have them and some don't. In the end, I'm trying to create a match with something like
/all (.*)your contacts
If there's a particular string you're trying to flag, you could do something like this:
Detect "email" with optional invis characters: /e[^\w]?m[^\w]?a[^\w]?i[^\w]?l/
[^\w]? matches at most one character that is not a word character (letter, digit, or underscore). You could also use [^\w]* if you're seeing more than one invisible character between letters.
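Writing the filler between every letter by hand gets tedious, so the pattern can be generated. A minimal sketch in Python, assuming the [^\w]* variant; the function name tolerant_pattern and the sample strings are my own, not from the question:

```python
import re

def tolerant_pattern(word: str) -> re.Pattern:
    """Build a pattern that allows non-word characters between the letters of word."""
    # [^\w]* is the filler suggested above; re.escape protects letters
    # that happen to be regex metacharacters.
    return re.compile(r"[^\w]*".join(re.escape(ch) for ch in word))

pattern = tolerant_pattern("email")
print(bool(pattern.search("check your e\u200bma\u200bil now")))  # True: zero-width spaces inside
print(bool(pattern.search("check your email now")))              # True: plain spelling
```

Note the trade-off: the same pattern also matches spellings such as "e-mail" or "e mail", which may or may not be what you want.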
Most invisible characters are just whitespace. Whichever character set they are rendered in, they are probably invisible.
If you are using a Unicode-aware regex engine, you could probably just put the whitespace class between the characters you're looking for.
If not, you could try using the equivalent explicit character class.
\s =
[\x{9}-\x{D}\x{1C}-\x{20}\x{85}\x{A0}\x{1680}\x{2000}-\x{200A}\x{2028}-\x{2029}\x{202F}\x{205F}\x{3000}]
The same, but without CR and LF:
[^\S\r\n] =
[\x{9}\x{B}-\x{C}\x{1C}-\x{20}\x{85}\x{A0}\x{1680}\x{2000}-\x{200A}\x{2028}-\x{2029}\x{202F}\x{205F}\x{3000}]
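Whether a particular invisible character falls into \s varies between engines, so it may be worth testing the code points that actually show up in the messages. A small sketch with Python's re module; the sample code points and the helper name is_whitespace are my own picks, not taken from the question:

```python
import re

# A few invisible code points; which of them \s covers depends on the engine.
samples = {
    "NO-BREAK SPACE": "\u00a0",
    "EM SPACE": "\u2003",
    "ZERO WIDTH SPACE": "\u200b",
    "ZERO WIDTH JOINER": "\u200d",
}

def is_whitespace(ch: str) -> bool:
    """True if the single character ch is matched by \\s in Python's re module."""
    return re.fullmatch(r"\s", ch) is not None

for name, ch in samples.items():
    print(f"U+{ord(ch):04X} {name}: matched by \\s = {is_whitespace(ch)}")
```

In Python, the two zero-width characters are not matched by \s, so for those a broader filler such as [^\w] may be needed.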

Notepad ++: selecting text up to matched characters

In Notepad++, I want to select text up to a certain text match, including the match.
The txt file I am working with contains a lot of text, including whitespace characters, line breaks and some special characters. In this text, there are characters that mark an end. Let's call these stop characters "ZZ." for now.
Using RegEx, I tried to create an expression that finds the next "ZZ." and selects everything before it. This is what it looks like:
+., \c ZZ.\n
But I seem to have gotten something wrong. As it is similar to this problem, I tried to use their RegEx with a slight modification. Here is a picture so you can see what I'd like to accomplish:
Find the next stop marker, select the marker and everything before it.
In the actual file, the stop marker is "გვ."
If I want to use those, maybe I need to change the RegEx even more, as those are not ASCII characters? Like so, as stated in the RegEx Wiki?
\c+ (\x{nnnn}\x{nnnn}.)\n
Not quite sure if the \c works that way. I have seen expressions that use something like (A-Za-z)(0-9) but this is a different alphabet.
To match any text up to and including some pattern, use .*? (which matches any zero or more characters, as few as possible) with the ". matches newline" option ON, and add the გვ after it.
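The lazy-match idea can be tried outside Notepad++ as well. Below is a sketch in Python, where re.DOTALL plays the role of the ". matches newline" option. The sample text is invented, and the escaped dot after გვ is my assumption, based on the stop marker being written as "გვ." in the question.

```python
import re

# Sample text is made up; "გვ." is the stop marker from the question.
text = "first line\nsecond line გვ.\nthird line გვ.\n"

# .*? is lazy, so the match stops at the first marker it can reach.
pattern = re.compile(r".*?გვ\.", re.DOTALL)

match = pattern.search(text)
print(repr(match.group(0)))  # 'first line\nsecond line გვ.'
```

With a greedy .* instead of .*?, the match would run to the last marker in the file rather than the next one.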

Regex forces star to give at least one character, while it should be none

I'm trying to create a Regex for a custom syntax file to use in Sublime Text 2, made with YAML. My syntax has commands in this form, with a maximum of 6 arguments:
#MY_COMMAND.argument01.argument02 with spaces and characters.arg03#
I want to color the command name, dots, and arguments all in different colors, so I want a Regex that selects all the contents in different groups, so I can use captures to color them in the YAML file.
I came up with this one:
/([^.]*)(.)([^.]*)(.)([^.]*)(.)([^.]*)(.)([^.]*)(.)([^.]*)(?=#)/
It does almost exactly what I want. It works great as long as the command has the maximum number of arguments, which is 6, matching the number of times I wrote ([^.]*).
So this works fine. But when I use fewer arguments, something weird (and to me, unexpected) happens. The last few groups, which should just return nothing at all, each grab a single character at the end of the string, which leaves the last argument a few characters shorter than intended.
Apparently I can't share images yet, because I just made this account, but you can check out the problem here. In this example, you can hover over the text to see the groups. In this case, I would like group 7 to contain foo and I would like group 8 and up to contain nothing.
Any help would be greatly appreciated.
You should be careful when matching a literal dot with a regex: either escape it outside a character class (\.), or use it inside a character class ([.]).
To make some parts of the regex optional, use non-capturing groups with ? quantifier.
Thus, you can use the following regex:
/^([^.]*)(?:\.([^.]*))?(?:\.([^.]*))?(?:\.([^.]*))?(?:\.([^.]*))?(?:\.([^.]*))?(?=#)/m
See demo
Note that [^.] also matches the newline symbol, so it can "overmatch" across lines. The multiline mode only makes ^ match at the beginning of each line. Perhaps you do not need the multiline mode at all, so adjust as appropriate.
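To see how the optional groups behave with fewer arguments, here is a sketch of the same regex in Python (Sublime Text uses a different engine, so treat this only as an illustration). The names COMMAND and parse are hypothetical, and the /.../m delimiters and flag are left out.

```python
import re

# Command name plus up to five optional ".argument" parts, then a lookahead for "#".
COMMAND = re.compile(r"^([^.]*)" + r"(?:\.([^.]*))?" * 5 + r"(?=#)")

def parse(command: str):
    """Return the six captured groups, or None if the command does not match."""
    m = COMMAND.match(command)
    return m.groups() if m else None

print(parse("#MY_COMMAND.foo#"))
# ('#MY_COMMAND', 'foo', None, None, None, None)
```

Groups that do not take part in the match come back as None instead of stealing a character from the last argument. As in the regex above, the leading # stays inside the first group.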

Regular expression for alphanumeric and underscore c#

I am working with ASP.NET MVC 5 application in which I want to add dataannotation validation for Name field.
That should accept any combination of numbers, letters and underscores only.
I tried this, but it is not working:
[RegularExpression("([a-zA-Z0-9_ .&'-]+)", ErrorMessage = "Invalid.")]
Try this regex, which I put together on the regexr.com site.
Criteria: alphanumeric, underscore and space.
http://regexr.com/3agii
([a-zA-Z0-9_\s]+)
You are using a character class, that is, the part between the square brackets ([a-zA-Z0-9_ .&'-]). Within those square brackets you define all characters that the class should match. So the problem is easy to see: your class allows characters you don't want to match (the space, ., &, ' and -).
Based on your "try" you could change this to
[a-zA-Z0-9_]
which seem to be the characters you want to match. But is it really what you need? Are those really the only characters that are possible for that field?
If yes then you are done.
If no, you probably want to add all characters of all languages. Luckily there is a Unicode property for that:
\p{L} All letter characters
There is another predefined group that could be useful for you:
\w matches any word character (The definition can also be found in the first link, includes the Unicode categories Ll,Lu,Lt,Lo,Lm,Nd,Pc, that is basically [a-zA-Z0-9_] but Unicode style with all letters and more connecting characters)
But still, if you want to match real names this will not cover all possible names. I have another answer on this topic here
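The difference between the ASCII-only class and the Unicode-aware \w is easy to see with a quick test. The sketch below uses Python rather than C#, purely as an illustration: Python's standard re module has no \p{L}, so \w stands in for the Unicode-aware variant, and fullmatch is used on the assumption that the whole value must match. The names ASCII_NAME, UNICODE_NAME and is_valid are made up.

```python
import re

ASCII_NAME = re.compile(r"[a-zA-Z0-9_]+")  # letters, digits, underscore (ASCII only)
UNICODE_NAME = re.compile(r"\w+")          # in Python 3, \w is Unicode-aware for str

def is_valid(pattern: re.Pattern, value: str) -> bool:
    """True if the entire value consists of characters allowed by pattern."""
    return pattern.fullmatch(value) is not None

print(is_valid(ASCII_NAME, "user_01"))        # True
print(is_valid(ASCII_NAME, "user 01"))        # False: space is not allowed
print(is_valid(ASCII_NAME, "Jos\u00e9"))      # False: accented letter is not ASCII
print(is_valid(UNICODE_NAME, "Jos\u00e9"))    # True
```

Neither pattern accepts names containing apostrophes or hyphens, which is the limitation mentioned above for real names.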

Regex without \r\n and not entirely space

I'm checking for valid user input for an executable; however it does include things like del/rm, dir/ls. The input is collected through XML and is validated using XSD. I will not check for file existence, since my program submits to a server, which may or may not have access to the same files.
The only requirements then, are that it not have a new line \r or \n and it cannot be entirely white space. I think it would be valid to assume that tab \t would not be allowed either, but I am more concerned with newlines.
Thanks
Does this mean you have the limitations mentioned here:
http://www.regular-expressions.info/xml.html
If so, then you probably want something like this:
[^\r\n\t]*[^\r\n\t\s][^\r\n\t]*
The middle part means there has to be one character that is not a newline, tab, or whitespace. The rest of it means zero or more characters around that character that aren't a newline or tab (but it can be whitespace). I think you might be able to remove the \r\n\t from the middle group because they all might be encompassed in \s but I haven't tested any of this.
Remove the three occurrences of \t if you want tabs.
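Since the pattern above is untested, a few sample inputs can help. This is a sketch in Python, not XSD: fullmatch is used because XSD patterns always apply to the entire value, but Python's \s covers more Unicode whitespace than the XSD one, so edge cases may differ. The helper name is_valid_command is hypothetical.

```python
import re

# At least one non-whitespace character, and no CR, LF or tab anywhere.
NO_NEWLINE_NOT_BLANK = re.compile(r"[^\r\n\t]*[^\r\n\t\s][^\r\n\t]*")

def is_valid_command(value: str) -> bool:
    """True if value has no CR/LF/tab and is not made up of whitespace only."""
    return NO_NEWLINE_NOT_BLANK.fullmatch(value) is not None

print(is_valid_command("ls -la"))  # True
print(is_valid_command("   "))     # False: whitespace only
print(is_valid_command("dir\n"))   # False: contains a newline
```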
I am not entirely sure what you want to do, but a regular expression for "no newline and not just whitespace" would be
[ \t]*\S[^\r\n]*
This matches zero or more spaces or tabs followed by a non-whitespace character and an arbitrary number of characters that are not \r or \n (including spaces and tabs). It cannot match a string consisting of only whitespace (as there would be no non-whitespace character matching \S).
To prohibit tabs also, you can change this to read
[ ]*\S[^\r\n\t]*
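The two variants can be compared side by side. The following Python sketch is only an approximation of what an XSD validator would do; the names ALLOW_TABS, NO_TABS and check are invented, and fullmatch stands in for the implicit anchoring of XSD patterns.

```python
import re

ALLOW_TABS = re.compile(r"[ \t]*\S[^\r\n]*")  # tabs permitted, newlines rejected
NO_TABS = re.compile(r"[ ]*\S[^\r\n\t]*")     # tabs rejected as well

def check(pattern: re.Pattern, value: str) -> bool:
    """True if the entire value matches pattern."""
    return pattern.fullmatch(value) is not None

print(check(ALLOW_TABS, "\tls -la"))  # True: a leading tab is accepted
print(check(NO_TABS, "\tls -la"))     # False: tab is not allowed
print(check(ALLOW_TABS, "   "))       # False: whitespace only
```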