The problem is that I need the content of a PDF in Lithuanian (that is, Unicode text) in editable form, but when I convert the PDF to an editable format (I decided to use the Google Docs converter), all the Lithuanian characters (that is, a certain set of Unicode characters) need to be fixed or replaced.
I was unable to figure out how to find and replace a Unicode character. For example, if I need to replace Á with Į (with match case on), it selects every A, which is wrong.
It's not the first time Google products have turned out to be poorly adapted to real life, because we live in an i18n world, and America with its ASCII is not the center of the universe. That really sucks.
So... how can I achieve this?
Use Notepad++. It has no issues with this kind of task. You can copy and paste the results back into a Google Doc if necessary.
'Find and Replace' (Ctrl+H) -->
Uncheck 'Ignore Latin diacritics (e. g. ä = a, E = É)' -->
Type your letter in the 'Find' field.
If for some reason that doesn't work, check 'Match using regular expressions' -->
In the 'Find' field, type \uXXXX, where XXXX is the Unicode value of the character. For instance, for Á, the Unicode value is 00C1, so type \u00C1 in the 'Find' field.
Type the letter you want in the 'Replace' field.
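If the text can be copied out of the document, the same exact-character replacement can be sketched in a few lines of Python. This is only an illustration and not a feature of either editor: the name fix_lithuanian and the mapping table are assumptions, and only the Á to Į pair comes from the question.

```python
# Hypothetical mapping from wrongly converted characters to the intended ones.
# Only the first pair comes from the question; extend it as needed.
REPLACEMENTS = {
    "\u00C1": "\u012E",  # Á -> Į
}

# str.translate works on exact code points, so a plain "A" is never touched.
TABLE = str.maketrans(REPLACEMENTS)


def fix_lithuanian(text: str) -> str:
    """Replace each mapped character with its intended counterpart."""
    return text.translate(TABLE)


print(fix_lithuanian("\u00C1A \u00E1a"))  # only the capital Á is replaced
```

Because the lookup is by code point, there is no diacritic folding to switch off, and the match is always case-sensitive.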
I need to replace all characters with an accent in a text file, that is:
á é í ó ú ñ
for their non-accent equivalents:
a e i o u n
Can this be achieved via some regex command for the entire file at once?
Update (Feb 1st, 2017)
I took the great answer by Keith Hall and turned it into a Sublime package. You can find it here: RemoveNonAsciiChars.
You can use a regex like:
(?=\p{L})[^a-zA-Z]
to find the characters with diacritics.
(?=\p{L}) positive lookahead to ensure the next character is a Unicode letter
[^a-zA-Z] negative character class to exclude letters without diacritics.
This is necessary because Sublime Text (or, more specifically, the Boost regex engine it uses for Find and Replace) doesn't support \p{M}. See http://www.regular-expressions.info/unicode.html for more information on what the \p meta character does.
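Python's built-in re module does not support \p{L} either, so outside the editor the same search has to be written differently. The sketch below is one possible equivalent, not the answer's method: find_non_ascii_letters is an assumed name, str.isalpha() stands in for the "is a Unicode letter" lookahead, and the code point check stands in for [^a-zA-Z].

```python
def find_non_ascii_letters(text: str) -> list:
    """Return the letters in text that fall outside the ASCII range."""
    # isalpha() plays the role of (?=\p{L}); ord(ch) > 127 excludes a-z and A-Z.
    return [ch for ch in text if ch.isalpha() and ord(ch) > 127]


print(find_non_ascii_letters("a\u00F1o 2017: caf\u00E9"))
```

Digits, punctuation and plain ASCII letters are skipped; only precomposed letters such as ñ or é are reported.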
For replacing, unfortunately you will need to specify the characters to replace manually. To make it harder, ST doesn't seem to support the POSIX character equivalents, nor does it support conditionals in the replacement, which would allow you to do the find and replace in one pass, using capture groups.
Therefore, you would need to use multiple find expressions like:
[ÀÁÂÃÄÅ]
replace with
A
and
[àáâãäå]
replace with
a
etc.
which is a lot of manual work.
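For comparison, the multi-pass find and replace described above could be scripted roughly like this in Python. It is a sketch under assumptions: the name replace_listed and the (deliberately incomplete) list of classes are illustrative, and every letter still has to be listed by hand, which is exactly the manual work being complained about.

```python
import re

# One (character class, replacement) pair per pass, as in the manual steps.
# The list is incomplete on purpose; add further pairs as needed.
PASSES = [
    ("[ÀÁÂÃÄÅ]", "A"),
    ("[àáâãäå]", "a"),
    ("[èéêë]", "e"),
    ("[ìíîï]", "i"),
    ("[òóôõö]", "o"),
    ("[ùúûü]", "u"),
    ("ñ", "n"),
]


def replace_listed(text: str) -> str:
    """Run every find/replace pass over the text in turn."""
    for pattern, plain in PASSES:
        text = re.sub(pattern, plain, text)
    return text


print(replace_listed("á é í ó ú ñ"))
```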
A much easier/quicker/less-manual-work approach would be to use the Python API instead of regex:
Tools menu -> Developer -> New Plugin
Paste in the following:
import sublime
import sublime_plugin
import unicodedata


class RemoveNonAsciiCharsCommand(sublime_plugin.TextCommand):
    def run(self, edit):
        # Region covering the whole buffer
        entire_view = sublime.Region(0, self.view.size())
        # NFKD splits accented letters into base letter + combining mark;
        # encoding to ASCII with 'ignore' then drops everything that is not ASCII
        ascii_only = unicodedata.normalize('NFKD', self.view.substr(entire_view)).encode('ascii', 'ignore').decode('utf-8')
        self.view.replace(edit, entire_view, ascii_only)
Save it in the folder ST recommends (which will be your Packages/User folder), as something like remove_non_ascii_chars.py (file extension is important, base name isn't)
View menu -> Show Console
Type/paste in view.run_command('remove_non_ascii_chars') and press Enter
The diacritics will have been removed (the characters with an accent will have been converted to their non-accented equivalents).
Note: the above also removes every other non-ASCII character (anything that has no ASCII decomposition), not just the diacritics...
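If the other non-ASCII characters should survive, one common variation is to drop only the combining marks after decomposition instead of encoding to ASCII. The sketch below is plain Python rather than a Sublime plugin, and strip_accents is an assumed name; it is an alternative to the plugin above, not a description of it.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks (accents) but keep every other character."""
    # NFD splits e.g. "é" into "e" + U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFD", text)
    # combining() is non-zero only for combining marks, so base letters stay.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left so unrelated characters keep their usual form.
    return unicodedata.normalize("NFC", stripped)


print(strip_accents("\u00e1 \u00e9 \u00ed \u00f3 \u00fa \u00f1"))
```

Characters without a decomposition, such as € or ß, pass through unchanged with this approach.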
Further reading:
http://fabzter.com/blog/remove-nonspacing-characters-text-python
What is the best way to remove accents in a Python unicode string?
This is not one of these "help me build my regex" questions. I have an HTML form input field where a user can provide geographical position data in various formats. The following regex works fine in regexr.com as well as in my application. However, I want to use the "pattern" parameter of HTML5 to additionally validate a user's input before submitting it.
((([E|W|N|S](\s)?)?([\-]?[0-1]?[(0-9)]{1,2})[°][ ]?([(0-5)]?[(0-9)]{1})([\.|,][0-9]{1,5})?['][ ]?([0-5]{0,1}[0-9]?(([\.|\,])[0-9]{0,3})?)([\"]|[']{2}){0,1}((\s)?[E|W|N|S])?)|([-]?[1]?[0-9]{1,2}[\.|,][0-9]{1,9}))
The point is that this regex contains a quote character ("). Now, I put this regex in my input like this:
<input type="text" pattern = "regex..."...." />
Browsers do not recognize this regex and don't do any validation at all, so obviously I need to escape that quote. What I tried so far:
PHP's addslashes() function escapes too many characters.
I escaped the quote with a single backslash
That did not change anything. I tested with Chrome, which works fine with simple regular expressions. The one above obviously is a bit too complicated.
I know the regular expression above is not perfect for matching coordinates, however, this is not to be discussed here. I just would like to know how to correctly escape a pattern in HTML5 as Chrome does not do anything with that regex.
Use HTML entities:
instead of ' use &#39;
instead of " use &quot;
If you're creating the regexp using PHP, you can use htmlentities() to encode a string using HTML entities. By default (before PHP 8.1), this will just encode double quotes, but you can use the ENT_QUOTES option to encode both types of quotes; from PHP 8.1 on, ENT_QUOTES is part of the default flags.
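The same idea, sketched in Python for anyone generating the markup from a different backend: html.escape() from the standard library plays the role that htmlentities() with ENT_QUOTES plays in PHP. The helper name pattern_attribute and the sample regex are assumptions for illustration only.

```python
import html


def pattern_attribute(regex: str) -> str:
    """Return a pattern="..." attribute with the regex safely escaped."""
    # quote=True also escapes " and ', so the attribute value cannot end early.
    return 'pattern="{}"'.format(html.escape(regex, quote=True))


# A short sample regex that contains both kinds of quotes.
print(pattern_attribute(r"""[0-9]+["']"""))
```

The browser decodes the entities before compiling the pattern, so the regex itself is unchanged.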
I searched a lot, but nowhere is it written how to remove non-ASCII characters from Notepad++.
I need to know what command to write in find and replace (with picture it would be great).
If I want to make a white-list and bookmark all the ASCII words/lines so non-ASCII lines would be unmarked
If the file is quite large and can't select all the ASCII lines and just want to select the lines containing non-ASCII characters...
This expression will search for non-ASCII values:
[^\x00-\x7F]+
Select 'Search Mode = Regular expression', and click Find Next.
Source: Regex any ASCII character
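The same expression can be tried outside Notepad++. The following Python sketch (find_non_ascii is an assumed name) lists each run of non-ASCII characters together with its offset, which roughly mirrors stepping through the file with Find Next.

```python
import re

# Same character class as above: anything outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def find_non_ascii(text: str) -> list:
    """Return (offset, run) pairs for every run of non-ASCII characters."""
    return [(m.start(), m.group()) for m in NON_ASCII.finditer(text)]


print(find_non_ascii("ab\u00e9cd\u00f1\u00f3"))
```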
In Notepad++, if you go to menu Search → Find characters in range → Non-ASCII Characters (128-255) you can then step through the document to each non-ASCII character.
Be sure to tick "Wrap around" if you want to loop through the document for all non-ASCII characters.
In addition to the answer by ProGM, in case you see characters in boxes like NUL or ACK and want to get rid of them: those are ASCII control characters (0 to 31). You can find them with the following expression and remove them:
[\x00-\x1F]+
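To remove all non-ASCII characters AND the ASCII control characters, remove everything matching this regex. The range has to start at \x20 (space), because \x1F is itself a control character, and end at \x7E, because \x7F is DEL:
[^\x20-\x7E]+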
To remove all non-ASCII characters, search for the following pattern and replace it with nothing: [^\x00-\x7F]+
To highlight the characters, I recommend using the Mark function in the search window: it highlights the non-ASCII characters and puts a bookmark on each line that contains one of them.
If you want to highlight and put a bookmark on the ASCII characters instead, you can use the regex [\x00-\x7F] to do so.
Cheers
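For very large files, the "bookmark the lines" idea can also be approximated with a short script. This Python sketch is an illustration only (lines_with_non_ascii is an assumed name); it reports the 1-based numbers of the lines that contain at least one non-ASCII character.

```python
import re

NON_ASCII = re.compile(r"[^\x00-\x7F]")


def lines_with_non_ascii(text: str) -> list:
    """Return 1-based numbers of the lines containing a non-ASCII character."""
    return [
        number
        for number, line in enumerate(text.split("\n"), start=1)
        if NON_ASCII.search(line)
    ]


print(lines_with_non_ascii("abc\nd\u00e9f\nghi\n\u00f1"))
```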
To keep new lines:
First select a character for new line... I used #.
Select replace option, extended.
input \n replace with #
Hit Replace All
Next:
Select Replace option Regular Expression.
Input this : [^\x20-\x7E]+
Keep Replace With Empty
Hit Replace All
Now, Select Replace option Extended and Replace # with \n
:) now, you have a clean ASCII file ;)
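Outside the editor, the placeholder trick is not needed, because the character class can simply exempt the newline. A minimal Python sketch, assuming Unix line endings (a \r would be removed) and using the assumed name to_clean_ascii:

```python
import re

# Printable ASCII (\x20-\x7E) plus the newline are kept; everything else goes.
NOT_PRINTABLE = re.compile(r"[^\x20-\x7E\n]+")


def to_clean_ascii(text: str) -> str:
    """Strip control and non-ASCII characters while keeping line breaks."""
    return NOT_PRINTABLE.sub("", text)


print(to_clean_ascii("caf\u00e9\n\x00ok\t!"))
```

Note that tabs are removed as well, just as in the steps above.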
Another good trick is to go into UTF8 mode in your editor so that you can actually see these funny characters and delete them yourself.
Another way...
Install the Text FX plugin if you don't have it already
Go to the TextFX menu option -> zap all non printable characters to #. It will replace all invalid chars with 3 # symbols
Go to Find/Replace and look for ###. Replace it with a space.
This is nice if you can't remember the regex or don't care to look it up. But the regex mentioned by others is a nice solution as well.
Click on View/Show Symbol/Show All Characters - to show the [SOH] characters in the file
Click on the [SOH] symbol in the file
Ctrl+H to bring up the replace
Leave the 'Find What:' as is
Change the 'Replace with:' to the character of your choosing (comma,semicolon, other...)
Click 'Replace All'
Done and done!
In addition to Steffen Winkler:
[\x00-\x08\x0B-\x0C\x0E-\x1F]+
Ignores \r \n AND \t (carriage return, linefeed, tab)
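To see what this class does and does not touch, it can be tried in Python with the same expression. The function name strip_control_chars is an assumption for the sake of the example.

```python
import re

# Control characters 0-31, minus tab (\x09), line feed (\x0A) and
# carriage return (\x0D), exactly as in the expression above.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0B-\x0C\x0E-\x1F]+")


def strip_control_chars(text: str) -> str:
    """Remove ASCII control characters but keep tabs and line endings."""
    return CONTROL_CHARS.sub("", text)


print(repr(strip_control_chars("a\x00b\tc\r\nd\x01\x1f")))
```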
I am creating a registration form.
I want to use the placeholder attribute on a password input to explain, in part, what type of regex is used for validation, using the pattern attribute.
This is the regex I found at www.html5pattern.com:
(?=^.{6,}$)((?=.*\d)|(?=.*\W+))(?![.\n])(?=.*[A-Z])(?=.*[a-z]).*$
The explanation for this regex was as follows:
Password (UpperCase, LowerCase, Number/SpecialChar and min 6 Chars)
The example I have used in the placeholder attribute, along with the title attribute, is this: Examp1e.
I would like to ensure that a user does not specifically enter "Examp1e" as their password.
Does anyone have any advice, suggestions, or input as to how I should go about this task?
That regex you started with is bad. For one thing, JavaScript's \W is equivalent to [^A-Za-z0-9_]: any character that isn't an ASCII word character. That includes all of the ASCII punctuation, whitespace and control characters, plus all non-ASCII characters. There's no official definition for "special characters" that I know of, but I'm pretty sure this is not what the author meant.
To move this along, I'll assume only ASCII characters are allowed in the password, and that "special characters" refers to punctuation characters:
[!"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]
That would make the regex
^(?=[!-~]{6,20}$)(?=.*[A-Z])(?=.*[a-z])(?=.*[\d!"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]).*$
Notes:
Notice how I pulled the ^ out of the first group; it's there to anchor the whole regex, not just that one lookahead.
I also merged the "digits" and "specials" lookaheads into one. It's not a big deal in this case, but one of my rules of thumb is that you should never use an alternation if a character class will do the job.
[!-~] is an old Perl idiom for any "visible" ASCII character (i.e., anything but whitespace or control characters).
I haven't the slightest idea what the original author was trying to do with that (?![.\n]).
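The rules above can be checked quickly in Python before putting the expression into a pattern attribute. This is a sketch under assumptions, not part of the answer: the punctuation class is built from string.punctuation instead of being typed out, and the leading (?!Examp1e$) lookahead is an added guess at how the placeholder example from the question could be rejected. The names PASSWORD_RE and is_valid_password are hypothetical.

```python
import re
import string

# Digits plus ASCII punctuation, escaped so they are safe inside [...].
DIGIT_OR_SPECIAL = r"[\d" + re.escape(string.punctuation) + "]"

PASSWORD_RE = re.compile(
    r"^(?!Examp1e$)"       # assumption: refuse the placeholder example itself
    r"(?=[!-~]{6,20}$)"    # 6 to 20 visible ASCII characters
    r"(?=.*[A-Z])"         # at least one uppercase letter
    r"(?=.*[a-z])"         # at least one lowercase letter
    r"(?=.*" + DIGIT_OR_SPECIAL + r")"  # at least one digit or special
    r".*$"
)


def is_valid_password(password: str) -> bool:
    """Return True if the password satisfies all the lookaheads."""
    return PASSWORD_RE.fullmatch(password) is not None


for candidate in ("Abcdef1", "abcdef1", "Examp1e", "Examp1e2"):
    print(candidate, is_valid_password(candidate))
```

Whether the extra lookahead is worth having is a judgment call; a server-side check against a list of disallowed passwords would cover the same case.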
This regex works very well for me when validating an input via the HTML5 pattern attribute; I have not been able to get a valid input rejected:
(?=^.{6,20}$)((?=.*\d)|(?=.*\W+))(?![.\n])(?=.*[A-Z])(?=.*[a-z]).*$
I've a free text field on my form where the users can type in anything. Some users are pasting text into this field from Word documents with some weird characters that I don't want to go in my DB. (e.g. webding font characters) I'm trying to get a regular expression that would give me only the alphanum and the punctuation characters.
But when I try the following, the output is still all the characters. How can I leave them out?
<html><body><script type="text/javascript">var str="";document.write(str.replace(/[^a-zA-Z 0-9 [:punct]]+/g, " "));</script></body></html>
If you want only ASCII, use /[^ -~]+/ as the regex. The problem is your [:punct:] expression: JavaScript regular expressions do not support POSIX character classes such as [:punct:].
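Python's re module has no [:punct:] class either, so the whitelist from the question has to be spelled out there too. The sketch below builds it from string.punctuation; the name keep_alnum_and_punct is an assumption, and the replacement with a single space follows the question's code.

```python
import re
import string

# Letters, digits, the space and ASCII punctuation are allowed.
ALLOWED = "a-zA-Z0-9 " + re.escape(string.punctuation)
DISALLOWED = re.compile("[^" + ALLOWED + "]+")


def keep_alnum_and_punct(text: str) -> str:
    """Replace every run of disallowed characters with a single space."""
    return DISALLOWED.sub(" ", text)


print(keep_alnum_and_punct("Hi\u2022there!"))
```

The allowed set works out to the same characters as the [ -~] range, just written explicitly.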