Regex: How to leave out Webdings font characters?

I have a free-text field on my form where users can type anything. Some users paste text into this field from Word documents, and it contains odd characters that I don't want in my DB (e.g. Webdings font characters). I'm trying to write a regular expression that keeps only the alphanumeric and punctuation characters.
But when I try the following, the output still contains all the characters. How can I leave them out?
<html><body>
<script type="text/javascript">
var str = "";
document.write(str.replace(/[^a-zA-Z 0-9 [:punct]]+/g, " "));
</script>
</body></html>

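If you want only printable ASCII, use /[^ -~]+/g as the regex. The problem is your [:punct] expression: JavaScript does not support POSIX character classes such as [:punct:]. In your pattern the character class ends at the first ], so the regex only matches a disallowed character that is followed by one or more literal ] characters, and nothing gets replaced.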
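A minimal sketch of that suggestion, assuming "printable ASCII" means the range from space to tilde. The function name stripNonPrintableAscii is made up for illustration. Note that tabs and line breaks fall outside that range, so they are replaced as well.

```javascript
// Hypothetical helper: replaces each run of characters outside
// printable ASCII (space through tilde) with a single space.
function stripNonPrintableAscii(text) {
  return text.replace(/[^ -~]+/g, " ");
}

// \uF04A is a private-use code point of the kind symbol fonts often map to.
console.log(stripNonPrintableAscii("abc\uF04Adef"));
```

If the pasted symbols should disappear without leaving a gap, pass an empty string as the replacement instead of a space.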

Related

REGEX - Suppress Non-Printable characters in Spark SQL

I have a column which contains free-form text, i.e. letters, digits, certain special characters, and non-printable non-ASCII control characters. How can I clean this text string by suppressing the non-printable characters using a regex in Spark SQL 2.4?
Just to clarify further, besides ascii alphabets and digits, I also need to retain characters like %-()|,<;:">?/[]#+=#!&.. etc. Only the non-printable non-ascii characters need to be removed using regex.
Example - something similar to:
select regexp_replace(col, "[^:print:][^:ctrl:]", '')
OR
select regexp_replace(col, "[^:alphanum:]", "")
But I can't get it to work in Spark SQL (with the SQL API). Can anyone please advise with a working example?
Any help is appreciated.
Thanks

What is the regex for every Unicode character (excluding all whitespace except spaces)?

I am basically trying to find a regex against which I could test a nickname/pseudonym during form validation.
For example, I want this to be allowed:
"ཀུ༇ༀ"
or this
"0X-_my perfect name_-X0"
but not this
"my\tperfectname"
so no \n \t \r etc... while still keeping spaces.
Just use the pattern:
[^\n\r\t]+
Or if you wanna be more thorough:
[^\n\r\t\f\v]+
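A minimal sketch of how that pattern might be used for validation. The ^ and $ anchors are an addition (an assumption that the whole nickname must match, not just part of it), and the name isValidNickname is hypothetical.

```javascript
// Anchored version of the pattern above: the entire string must consist
// of characters other than newline, carriage return, tab, form feed
// and vertical tab. Ordinary spaces are still allowed.
const nicknamePattern = /^[^\n\r\t\f\v]+$/;

// Hypothetical helper for form validation.
function isValidNickname(name) {
  return nicknamePattern.test(name);
}

console.log(isValidNickname("0X-_my perfect name_-X0")); // true
console.log(isValidNickname("my\tperfectname"));         // false
```

Without the anchors, test() would succeed as soon as any single allowed character is found, so a name containing a tab would still pass.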

How do I remove all non-ASCII characters with regex and Notepad++?

I searched a lot, but nowhere is it written how to remove non-ASCII characters from Notepad++.
I need to know what command to write in find and replace (with picture it would be great).
Ideally I would like to white-list and bookmark all the ASCII words/lines, so that the non-ASCII lines are left unmarked.
If the file is quite large and I can't select all the ASCII lines, I just want to select the lines containing non-ASCII characters.
This expression will search for non-ASCII values:
[^\x00-\x7F]+
Select 'Search Mode = Regular expression', and click Find Next.
Source: Regex any ASCII character
In Notepad++, if you go to menu Search → Find characters in range → Non-ASCII Characters (128-255) you can then step through the document to each non-ASCII character.
Be sure to tick "Wrap around" if you want to loop through the document for all non-ASCII characters.
In addition to the answer by ProGM, in case you see characters in boxes like NUL or ACK and want to get rid of them, those are ASCII control characters (0 to 31), you can find them with the following expression and remove them:
[\x00-\x1F]+
In order to remove all non-ASCII AND ASCII control characters, you should remove all characters matching this regex (note that it also matches tabs and line breaks):
[^\x20-\x7E]+
To remove all non-ASCII characters, you can search for the following and replace it with nothing: [^\x00-\x7F]+
To highlight characters, I recommend using the Mark function in the search window: this highlights non-ASCII characters and puts a bookmark on the lines containing one of them.
If you want to highlight and put a bookmark on the ASCII characters instead, you can use the regex [\x00-\x7F] to do so.
Cheers
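For comparison outside Notepad++, here is a small sketch of what the different ranges match, written in JavaScript. The sample string is made up, and the third pattern assumes "printable ASCII" means \x20 through \x7E.

```javascript
// Three character ranges discussed above.
const nonAscii = /[^\x00-\x7F]+/g;          // anything above code point 127
const asciiControl = /[\x00-\x1F]+/g;       // NUL through US (0 to 31)
const notPrintableAscii = /[^\x20-\x7E]+/g; // everything except space..tilde

// Made-up sample: one non-ASCII letter plus SOH and ACK control characters.
const sample = "na\u00EFve\x01 text\x06";

console.log(JSON.stringify(sample.replace(nonAscii, "")));
console.log(JSON.stringify(sample.replace(asciiControl, "")));
console.log(JSON.stringify(sample.replace(notPrintableAscii, "")));
```

JSON.stringify is used only so that any remaining control characters show up as visible escapes in the output.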
To keep new lines:
First select a character for new line... I used #.
Select replace option, extended.
input \n replace with #
Hit Replace All
Next:
Select Replace option Regular Expression.
Input this : [^\x20-\x7E]+
Keep Replace With Empty
Hit Replace All
Now, Select Replace option Extended and Replace # with \n
:) now, you have a clean ASCII file ;)
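The same result can be sketched in a single pass by adding the line-break characters to the excluded set instead of swapping them for a placeholder. This is a JavaScript illustration, not a Notepad++ setting, and the function name cleanKeepingNewlines is hypothetical.

```javascript
// Removes everything that is not printable ASCII (space..tilde),
// except carriage returns and line feeds, which are kept.
function cleanKeepingNewlines(text) {
  return text.replace(/[^\x20-\x7E\r\n]+/g, "");
}

const dirty = "a\u00E9\tb\r\nc\x00d";
console.log(JSON.stringify(cleanKeepingNewlines(dirty)));
```

One thing to watch with the placeholder steps above: if the file already contains # characters, the last step turns those into line breaks too, so pick a placeholder that does not occur in the text.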
Another good trick is to go into UTF8 mode in your editor so that you can actually see these funny characters and delete them yourself.
Another way...
Install the Text FX plugin if you don't have it already
Go to the TextFX menu option -> zap all non printable characters to #. It will replace all invalid chars with 3 # symbols
Go to Find/Replace and look for ###. Replace it with a space.
This is nice if you can't remember the regex or don't care to look it up. But the regex mentioned by others is a nice solution as well.
Click on View/Show Symbol/Show All Characters to show the [SOH] characters in the file
Click on the [SOH] symbol in the file
Press Ctrl+H to bring up the Replace dialog
Leave the 'Find What:' as is
Change the 'Replace with:' to the character of your choosing (comma,semicolon, other...)
Click 'Replace All'
Done and done!
In addition to Steffen Winkler:
[\x00-\x08\x0B-\x0C\x0E-\x1F]+
Ignores \r \n AND \t (carriage return, linefeed, tab)
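A short sketch of that last expression in JavaScript, showing that tab, line feed and carriage return survive while the other control characters are removed. The helper name removeControlChars is made up.

```javascript
// Control characters 0-31, skipping \x09 (tab), \x0A (line feed)
// and \x0D (carriage return).
const controlExceptWhitespace = /[\x00-\x08\x0B-\x0C\x0E-\x1F]+/g;

// Hypothetical helper.
function removeControlChars(text) {
  return text.replace(controlExceptWhitespace, "");
}

console.log(JSON.stringify(removeControlChars("a\x01\tb\x0B\r\nc\x1F")));
```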

HTML5 - input, using placeholder attribute for an "example", possible to use pattern attribute to ensure that "example" is not submitted?

I am creating a registration form.
I want to use the placeholder attribute on a password input to explain, in part, what type of regex is used for validation, using the pattern attribute.
This is the regex I found at www.html5pattern.com:
(?=^.{6,}$)((?=.*\d)|(?=.*\W+))(?![.\n])(?=.*[A-Z])(?=.*[a-z]).*$
The explanation for this regex was as follows:
Password (UpperCase, LowerCase, Number/SpecialChar and min 6 Chars)
The example I have used in the placeholder attribute, along with the title attribute, is this: Examp1e.
I would like to ensure that a user does not specifically enter "Examp1e" as their password.
Does anyone have any advice, suggestions, or input as to how I should go about this task?
That regex you started with is bad. For one thing, JavaScript's \W is equivalent to [^A-Za-z0-9_]: any character that isn't an ASCII word character. That includes all of the ASCII punctuation, whitespace and control characters, plus all non-ASCII characters. There's no official definition for "special characters" that I know of, but I'm pretty sure this is not what the author meant.
To move this along, I'll assume only ASCII characters are allowed in the password, and that "special characters" refers to punctuation characters:
[!"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]
That would make the regex
^(?=[!-~]{6,20}$)(?=.*[A-Z])(?=.*[a-z])(?=.*[\d!"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]).*$
Notes:
Notice how I pulled the ^ out of the first group; it's there to anchor the whole regex, not just that one lookahead.
I also merged the "digits" and "specials" lookaheads into one. It's not a big deal in this case, but one of my rules of thumb is that you should never use an alternation if a character class will do the job.
[!-~] is an old Perl idiom for any "visible" ASCII character (i.e., anything but whitespace or control characters).
I haven't the slightest idea what the original author was trying to do with that (?![.\n]).
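A rough sketch of how this regex could be checked in script, together with one possible way to reject the placeholder example from the question. The punctuation class below includes @, assuming the full ASCII punctuation set is intended. The plain string comparison against "Examp1e" is an assumption about how to handle that requirement, not part of the regex above, and the names are hypothetical.

```javascript
// The regex from above as a JavaScript literal (the / inside the class is escaped).
const passwordPattern =
  /^(?=[!-~]{6,20}$)(?=.*[A-Z])(?=.*[a-z])(?=.*[\d!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]).*$/;

// Assumption: the placeholder text shown in the form must not be accepted.
const PLACEHOLDER_EXAMPLE = "Examp1e";

// Hypothetical helper: pattern check plus an explicit placeholder check.
function isAcceptablePassword(password) {
  return passwordPattern.test(password) && password !== PLACEHOLDER_EXAMPLE;
}

console.log(isAcceptablePassword("Secret!"));  // true
console.log(isAcceptablePassword("Examp1e"));  // false: matches the pattern but equals the placeholder
console.log(isAcceptablePassword("example"));  // false: no uppercase letter, digit or punctuation
```

Keeping the placeholder check outside the regex keeps the pattern readable. A check like this in the browser is only a convenience, so the server would still need to validate the submitted value.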
This regex works very well for me when validating an input via the HTML5 pattern attribute, as I have not been able to get a valid value rejected:
(?=^.{6,20}$)((?=.*\d)|(?=.*\W+))(?![.\n])(?=.*[A-Z])(?=.*[a-z]).*$

Regex with Tab delimited text containing \x09

I've got a tough one.
I've got tab-delimited text to match with a regex.
My regex looks like:
^([\w ]+)\t(\d*)\t(\d+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)$
and an example source text is (tabs converted to \t for clarity):
JJ\t345\t0\tTest\tSome test text\tmore text: pcre:"/\x20\x62\x3b\x0a\x09\x61\x2e\x53\x74\x61\x72/"\tNone
However, the problem is that in my source text, the 6th field contains a regex string. Therefore, it can contain \x09, which naturally blows up the regex since it's seen as a tab as well.
Is there any way to tell the regex engine, "Match on \t but not on the text \x09." My guess is no, since they're the same thing.
If not, is there any character that could be safely used for delimiting text that contains a regex string?
I would recommend encoding all of the characters in the pcre string prior to running the regular expression against it.
Seems like a problem with the test case. A regex might have tabs in it, but your sample above doesn't. Your string in Java would look like:
String testString = "JJ\t345\t0\tTest\tSome test text\tmore text: pcre:\"/\\x20\\x62\\x3b\\x0a\\x09\\x61\\x2e\\x53\\x74\\x61\\x72/\"\tNone";
If you look at this string in the debugger you'll have \x09 as 4 characters instead of as 1 (the tab).
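The same point can be sketched in JavaScript, where the escaping rules for \t and \\x09 in string literals work the same way. The sample line and the regex are taken from the question; the variable names are made up.

```javascript
// In the source text, \x09 is four literal characters (backslash, x, 0, 9),
// so it is written as \\x09 here. Only \t is a real tab.
const line = 'JJ\t345\t0\tTest\tSome test text\tmore text: pcre:"/\\x20\\x62\\x3b\\x0a\\x09\\x61\\x2e\\x53\\x74\\x61\\x72/"\tNone';

// The regex from the question, unchanged.
const fieldPattern = /^([\w ]+)\t(\d*)\t(\d+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)$/;

const match = fieldPattern.exec(line);
console.log(match !== null);               // true: the line matches
console.log(match[6].includes("\\x09"));   // true: field 6 holds the literal text \x09
console.log(match[6].includes("\t"));      // false: no real tab inside field 6
console.log("\\x09".length, "\x09".length); // 4 1
```

The regex only breaks if the file itself contains a real tab character inside the sixth field, which is a property of the data rather than of the pattern.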