I need to replace all characters with an accent in a text file, that is:
á é í ó ú ñ
for their non-accent equivalents:
a e i o u n
Can this be achieved via some regex command for the entire file at once?
Update (Feb 1st, 2017)
I took the great answer by Keith Hall and turned it into a Sublime package. You can find it here: RemoveNonAsciiChars.
You can use a regex like:
(?=\p{L})[^a-zA-Z]
to find the characters with diacritics.
(?=\p{L}) positive lookahead to ensure the next character is a Unicode letter
[^a-zA-Z] negated character class to exclude the plain ASCII letters, so only letters with diacritics (or other non-ASCII letters) remain
This is necessary because Sublime Text (or, more specifically, the Boost regex engine it uses for Find and Replace) doesn't support \p{M}. See http://www.regular-expressions.info/unicode.html for more information on what the \p meta character does.
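If you want to sanity-check what that pattern matches, here is a minimal sketch in plain Python (outside Sublime), using the third-party regex module, which, unlike the built-in re module, supports \p{L}:
import regex  # third-party module: pip install regex

# The same Find pattern: a Unicode letter that is not a plain ASCII letter
pattern = regex.compile(r'(?=\p{L})[^a-zA-Z]')

print(pattern.findall("canción más café"))  # ['ó', 'á', 'é']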
For replacing, unfortunately, you will need to specify the characters to replace manually. To make matters harder, ST doesn't seem to support POSIX character equivalence classes, nor conditionals in the replacement, either of which would let you do the find and replace in one pass using capture groups.
Therefore, you would need to use multiple find expressions like:
[ÀÁÂÃÄÅ]
replace with
A
and
[àáâãäå]
replace with
a
etc.
which is a lot of manual work.
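For reference, the per-character mapping those passes implement can be written down in one place; this is only a sketch in plain Python (outside Sublime), and the table below is deliberately incomplete, you would extend it with every character you care about:
table = str.maketrans('ÀÁÂÃÄÅàáâãäåÉé', 'AAAAAAaaaaaaEe')
print('École à Ávila'.translate(table))  # Ecole a Avila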
A much easier/quicker/less-manual-work approach would be to use the Python API instead of regex:
Tools menu -> Developer -> New Plugin
Paste in the following:
import sublime
import sublime_plugin
import unicodedata


class RemoveNonAsciiCharsCommand(sublime_plugin.TextCommand):
    def run(self, edit):
        # Cover the entire buffer
        entire_view = sublime.Region(0, self.view.size())
        # NFKD splits accented letters into base letter + combining mark,
        # then encoding with 'ignore' drops everything that isn't ASCII
        ascii_only = unicodedata.normalize('NFKD', self.view.substr(entire_view)).encode('ascii', 'ignore').decode('utf-8')
        self.view.replace(edit, entire_view, ascii_only)
Save it in the folder ST recommends (which will be your Packages/User folder), as something like remove_non_ascii_chars.py (file extension is important, base name isn't)
View menu -> Show Console
Type/paste in view.run_command('remove_non_ascii_chars') and press Enter
The diacritics will have been removed (the characters with an accent will have been converted to their non-accented equivalents).
Note: the above will also remove all other non-ASCII characters, not just accented letters...
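A quick standalone demo of the same normalize-then-encode trick (plain Python, outside Sublime) makes that behaviour visible: characters that decompose into an ASCII letter plus a combining accent are kept, while characters with no ASCII decomposition are dropped entirely.
import unicodedata

def strip_to_ascii(text):
    # NFKD splits accented letters into base letter + combining mark,
    # then encode(..., 'ignore') throws away everything non-ASCII
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

print(strip_to_ascii('canción'))  # 'cancion'
print(strip_to_ascii('2 € ß'))    # '2  '  (euro sign and sharp s disappear)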
Further reading:
http://fabzter.com/blog/remove-nonspacing-characters-text-python
What is the best way to remove accents in a Python unicode string?
Related
The whole problem is that I need the editable content of a PDF in Lithuanian (let's say, in Unicode), but when I convert the PDF to an editable form (I decided to use the Google Docs converter), all the Lithuanian characters (let's say, some subset of Unicode) need to be fixed / replaced.
I was unable to figure out how to find and replace a Unicode character. For example, if I need to replace (with match case) Á with Į, it selects every A, which is wrong.
It's not the first time Google products aren't adapted to real life; we live in an i18n world, and ASCII-only America is not the center of the universe. That really sucks.
So... how is it possible to achieve this?
Use Notepad++. It has no issues doing such a task. You can just copy paste the results back onto a google doc if necessary.
'Find and Replace' (Ctrl+H) -->
Uncheck 'Ignore Latin diacritics (e. g. ä = a, E = É)' -->
Type your letter in the 'Find' field.
If for some reason that doesn't work, check 'Match using regular expressions' -->
In the 'Find' field, type \uXXXX, where XXXX is the Unicode value of the character. For instance, for Á, the Unicode value is 00C1, so type \u00C1 in the 'Find' field.
Type the letter you want in the 'Replace' field.
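If you are not sure of the code point for a given letter, a one-liner in Python (or any language that exposes code points) will tell you the XXXX value to plug into \uXXXX; the characters below are only examples:
for ch in 'ÁĮą':
    print(ch, format(ord(ch), '04X'))  # Á 00C1, Į 012E, ą 0105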
I searched a lot, but nowhere is it written how to remove non-ASCII characters in Notepad++.
I need to know what to type into Find and Replace (a picture would be great).
How can I make a whitelist, i.e. bookmark all the ASCII words/lines so that the non-ASCII lines are left unmarked?
And if the file is quite large, so I can't select all the ASCII lines by hand, how do I select just the lines containing non-ASCII characters?
This expression will search for non-ASCII values:
[^\x00-\x7F]+
Tick off 'Search Mode = Regular expression', and click Find Next.
Source: Regex any ASCII character
In Notepad++, if you go to menu Search → Find characters in range → Non-ASCII Characters (128-255) you can then step through the document to each non-ASCII character.
Be sure to tick off "Wrap around" if you want to loop in the document for all non-ASCII characters.
In addition to the answer by ProGM: if you see characters in boxes like NUL or ACK and want to get rid of them, those are ASCII control characters (0 to 31). You can find them with the following expression and remove them:
[\x00-\x1F]+
In order to remove all non-ASCII AND ASCII control characters at once, you should remove all characters matching this regex:
[^\x20-\x7E]+
(add \r, \n and \t inside the class, i.e. [^\x20-\x7E\r\n\t]+, if you want to keep line breaks and tabs)
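As a quick sanity check (plain Python here, but the character classes behave the same way in Notepad++'s regex engine), you can see the difference between the two variants:
import re

text = 'caf\u00e9\x01 ok\x07\nnext\tline'
# strips accents, control characters, newline and tab alike
print(repr(re.sub(r'[^\x20-\x7E]+', '', text)))        # 'caf oknextline'
# same, but the newline and tab survive
print(repr(re.sub(r'[^\x20-\x7E\r\n\t]+', '', text)))  # 'caf ok\nnext\tline'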
To remove all non-ASCII characters, you can use the following find pattern (replacing with nothing): [^\x00-\x7F]+
To highlight the characters, I recommend using the Mark function in the search window: it highlights non-ASCII characters and puts a bookmark on the lines containing one of them.
If you want to highlight and put a bookmark on the ASCII characters instead, you can use the regex [\x00-\x7F] to do so.
Cheers
To keep newlines:
First pick a placeholder character for the newline... I used #.
Select the Extended search mode.
Find \n and replace it with #.
Hit Replace All.
Next:
Select the Regular expression search mode.
Find: [^\x20-\x7E]+
Leave 'Replace with' empty.
Hit Replace All.
Now select the Extended search mode again and replace # with \n.
:) Now you have a clean ASCII file ;)
Another good trick is to go into UTF8 mode in your editor so that you can actually see these funny characters and delete them yourself.
Another way...
Install the Text FX plugin if you don't have it already
Go to the TextFX menu option -> zap all non printable characters to #. It will replace all invalid chars with 3 # symbols
Go to Find/Replace and look for ###. Replace it with a space.
This is nice if you can't remember the regex or don't care to look it up. But the regex mentioned by others is a nice solution as well.
Click on View/Show Symbol/Show All Characters - to show the [SOH] characters in the file
Click on the [SOH] symbol in the file
Ctrl+H to bring up the Replace dialog
Leave the 'Find What:' as is
Change the 'Replace with:' to the character of your choosing (comma, semicolon, other...)
Click 'Replace All'
Done and done!
In addition to Steffen Winkler:
[\x00-\x08\x0B-\x0C\x0E-\x1F]+
This ignores \r, \n and \t (carriage return, line feed, tab).
I'm trying to get Vim to highlight non-ASCII characters. Is there an available setting, regex search pattern, or plugin to do so?
Using a range in a [] character class in your search, you ought to be able to exclude the ASCII range (given in hexadecimal), thereby highlighting (assuming you have hlsearch enabled) all characters lying outside it:
/[^\x00-\x7F]
This will do a negative match (via [^]) for characters between ASCII 0x00 and ASCII 0x7F (0-127), and appears to work in my simple test. For extended ASCII, of course, extend the range up to \xFF instead of \x7F using /[^\x00-\xFF].
You may also express it in decimal via \d:
/[^\d0-\d127]
If you need something more specific, like exclusion of non-printable characters, you will need to add those ranges into the character class [].
Yes, there is a native feature to do highlighting for any matched strings.
Inside Vim, do:
:help highlight
:help syn-match
syn-match defines which strings, matched by a pattern, fall into a group.
highlight defines the color used by that group.
Just think of how syntax highlighting works for your vimrc files.
So you can use below commands in your .vimrc file:
syntax match nonascii "[^\x00-\x7F]"
highlight nonascii guibg=Red ctermbg=2
For other (from now on less unlucky) folks who end up here via a search engine and can't get the highlighting of non-ASCII characters to work, try this (put it into your .vimrc):
highlight nonascii guibg=Red ctermbg=1 term=standout
au BufReadPost * syntax match nonascii "[^\u0000-\u007F]"
This has the added benefit of not colliding with regular (filetype [file extension] based) syntax definitions.
This regex works for highlighting as well. It was the first Google hit for "vim remove non-ascii characters" (from briceolion.com), and with :set hlsearch it will highlight:
/[^[:alnum:][:punct:][:space:]]/
If you are also interested in the non-printable characters, use this one: /[^\x00-\xff]/
I use it in a function:
function! NonPrintable()
  setlocal enc=utf8
  if search('[^\x00-\xff]') != 0
    call matchadd('Error', '[^\x00-\xff]')
    echo 'Non printable characters in text'
  else
    setlocal enc=latin1
    echo 'All characters are printable'
  endif
endfunction
Based on the other answers on this topic and the answer I got here, I've added this to my .vimrc so that I can toggle the non-ASCII highlighting by typing <C-w>1. It also shows inside comments, although you will need to add the comment group for each file syntax you use. That is, if you edit a zsh file, you will need to add zshComment to the line
au BufReadPost * syntax match nonascii "[^\x00-\x7F]" containedin=cComment,vimLineComment,pythonComment
otherwise it won't show the non-ASCII characters (you can also set containedin=ALL if you want to be sure non-ASCII characters are shown in all groups). To find out what the comment group is called for a different file type, open a file of that type, enter :sy in Vim, and search the listed syntax items for the comment group.
function HighlightNonAsciiOff()
  echom "Setting non-ascii highlight off"
  syn clear nonascii
  let g:is_non_ascii_on=0
  augroup HighlightUnicode
    autocmd!
  augroup end
endfunction

function HighlightNonAsciiOn()
  echom "Setting non-ascii highlight on"
  augroup HighlightUnicode
    autocmd!
    autocmd ColorScheme *
          \ syntax match nonascii "[^\x00-\x7F]" containedin=cComment,vimLineComment,pythonComment |
          \ highlight nonascii cterm=underline ctermfg=red ctermbg=none term=underline
  augroup end
  silent doautocmd HighlightUnicode ColorScheme
  let g:is_non_ascii_on=1
endfunction

function ToggleHighlightNonascii()
  if g:is_non_ascii_on == 1
    call HighlightNonAsciiOff()
  else
    call HighlightNonAsciiOn()
  endif
endfunction
silent! call HighlightNonAsciiOn()
nnoremap <C-w>1 :call ToggleHighlightNonascii()<CR>
Somehow none of the above answers worked for me.
So I used :1,$ s/[^0-9a-zA-Z,-_\.]//g
It keeps most of the characters I am interested in.
Someone has already answered the question. However, for others who are still having problems, here is another solution to highlight non-ASCII characters in comments (or any syntax group, for that matter). It's not the best, but it's a temporary fix.
One may try:
:syntax match nonascii "[^\u0000-\u007F]" containedin=ALL contained |
\ highlight nonascii ctermfg=yellow guifg=yellow
This mixes parts from other solutions. You may remove contained, but, per the documentation, there may be a potential problem with the match recursing into itself (as I understand it). The syn-contains section covers the other containment patterns:
:help syn-containedin
:help syn-contains
Replicated issue from: Set item to higher highlight priority on vim
I have an Apple Address Book exported as .vcf where the contacts images are stored as base64.
I'm trying to use Emacs to strip the photos out of the file.
An image in the file looks like this (the ^M are added by the exporter):
...
PHOTO;BASE64:^M
/9j/4AAQSkZJRgABAFEAAQABAAD/4imoSUNDX1BST0ZJTE95AQEAACmYYXBwbAIAAABtbnRyUkdC
IFhZWiAH2QAIAB0AZFARAARRY3NwQVBQTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA9tYAAQAA
...
G8VVxuGjKs7uxniIKnO0SCOAeXn+InJo8sacff7woor3jEfujQH5e9FFAAH/2===^M
...
And I'm trying to query-replace on the following (I use Ctrl-q to insert the ^M and ^J):
PHOTO;BASE64:^M^J*^M^J
But that doesn't work. What am I missing here?
Try this one:
PHOTO;BASE64:^M[^^M]*?^M^J
Inside the brackets there are two characters: the leading ^, which negates the class, and a literal ^M (the carriage return, inserted with Ctrl-q). So [^^M] matches everything except ^M.
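The same idea, sketched in Python for anyone who wants to strip the photo blocks outside Emacs (the file content below is made up, but the pattern mirrors the one above: everything from the header's ^M up to the next ^M^J):
import re

vcf = ('FN:Jane\n'
       'PHOTO;BASE64:\r\n'
       '/9j/4AAQSkZJRg\n'
       'ABCD==\r\n'
       'TEL:123\n')

# \r is the ^M and \n the ^J; [^\r]*? is the lazy "anything but ^M" part
cleaned = re.sub(r'PHOTO;BASE64:\r[^\r]*?\r\n', '', vcf)
print(cleaned)  # only FN:Jane and TEL:123 remain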
I don't see what the ^J is good for. Is it something Emacs-specific? I also don't know whether Emacs has a dotall modifier, but based on my regex experience with other engines you can try this:
PHOTO;BASE64:^M(\s|[^\s])*?^M
Emacs regex is explained here: www.emacswiki.org
\s is a whitespace character
[^\s] is anything but whitespace
So the regex means: match anything (whitespace or not) between the first ^M and the next ^M.
I've a free text field on my form where the users can type in anything. Some users are pasting text into this field from Word documents with some weird characters that I don't want to go in my DB. (e.g. webding font characters) I'm trying to get a regular expression that would give me only the alphanum and the punctuation characters.
But when I try the following, the output is still all the characters. How can I leave them out?
<html><body><script type="text/javascript">var str="";document.write(str.replace(/[^a-zA-Z 0-9 [:punct]]+/g, " "));</script></body></html>
If you want only ASCII, use /[^ -~]+/ as the regex (the range from the space character to ~ covers all printable ASCII). The problem is your [:punct:] statement: JavaScript regular expressions don't support POSIX character classes, so inside your character class the characters [ : p u n c t are treated as literals, and the class is closed by the first ], leaving a stray literal ] that the pattern then has to match.
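As a quick check of what that range covers (shown in Python for brevity; the " -~" class works the same way in JavaScript, since it is just the code-point range 0x20 to 0x7E):
import re
print(re.sub(r'[^ -~]+', ' ', 'naïve café'))  # 'na ve caf ' (accented letters become spaces)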