word boundaries of UTF-8 text in perl - regex

My perl script is provided with a string of characters in UTF-8 which could be in any language. I need to capitalize the first character of each word, and the remaining characters of the word converted to lower case. This must be done while leaving the text in UTF-8 format.
The following seems to work well enough when the text contains only Latin characters:
$my_string =~ s/([\w']+)/\u\L$1/g;
How can I get this to work in a UTF-8 string?

See perlunicode for an overview of the facilities you need to be familiar with. Basically, you are looking for something like \p{LC}.
Your problem space is not well-defined, though; not all scripts have a concept of character case. The LC property will only match characters in scripts that do, so it should get you there.
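For example, a minimal sketch (assuming the input has already been decoded from UTF-8 bytes into Perl's internal character form, e.g. via the open pragma below or Encode::decode; the sample string is made up):
use utf8;                             # this source file contains UTF-8 literals
use open qw(:std :encoding(UTF-8));   # decode/encode UTF-8 on the standard streams
my $my_string = "école ДОМ şehir";
# \p{LC} matches cased letters; \p{M} keeps combining marks inside the word
$my_string =~ s/([\p{LC}\p{M}']+)/\u\L$1/g;
print "$my_string\n";                 # École Дом Şehir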

Related

Regex: Replace "something" by a Unicode character

I am trying to figure out how to find a certain character and replace it with a Unicode character. In my example, I want to find all spaces (\s) and replace them with a narrow or thin space (e.g. Unicode U+2006).
Sample Text
8. 3. 2014
Search Pattern
(\d{1,2}\.)(\s?)(\d{1,2}\.)(\s?)(\d{2,4})
Replacement Pattern
$1{UNICODE}$3{UNICODE}$5
For some reason I cannot replace by(!) a Unicode character, I can only search for one.
I am working with a RegEx App called »RegExRX 3« to test my strings. In the end, I want to be able to use it with Adobe’s InDesign GREP functionality.
I know I could just copy and paste the correct whitespace into place but I am interested in how to do it with a Unicode character.
Thanks in advance!
InDesign uses Perl-compatible regular expressions (PCRE). Getting a Unicode character into the replacement string is done with \x{XXXX}, where XXXX is the hexadecimal character code:
$1\x{2009}$3\x{2009}$5
But in general you can replace with any character you can type. Just put actual thin spaces into your search-and-replace dialog:
$1 $3 $5
You can use your OS's utilities to grab the thin space from the list of available characters. On Windows it's the "Character Map" tool, where the thin space can be found in the "General Punctuation" Unicode sub-range (searching for "thin space" works as well). macOS has the "Character Viewer", which can do the same thing.
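The same substitution can be tried outside InDesign, since Perl's s/// understands the same \x{XXXX} escape in the replacement. A minimal sketch (the optional spaces are matched but not captured here, which simplifies the group numbering):
use open qw(:std :encoding(UTF-8));   # emit UTF-8 on the standard streams
my $date = "8. 3. 2014";
# \x{2009} in the replacement inserts U+2009 THIN SPACE
$date =~ s/(\d{1,2}\.)\s?(\d{1,2}\.)\s?(\d{2,4})/$1\x{2009}$2\x{2009}$3/;
print "$date\n";                      # 8. 3. 2014, now joined by thin spaces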

How to check for and replace non UTF-8 characters in tcl?

What's the best way to search if a given string contains non UTF-8 characters in tcl? Is regexp'ing "^[\x00-\x7f]+$" the only way forward?
I'm trying to write a tcl proc to check if a given variable contains non-UTF-8 characters and, if it does, replace it with "Not supported".
All Tcl's characters are Unicode characters.
OK, that's not helpful. You actually appear to be asking about non-ASCII characters. Supposing you wanted to replace each non-ASCII character with a ?, you might use a regular expression substitution, like this:
regsub -all {[\u0080-\uffff]} $inputString "?" outputString
The key here is that the RE is in braces (virtually always strongly recommended) and that we're using \uXXXX escape sequences (which the RE engine also understands). That can potentially produce many ?s, but I'm sure you can adjust.
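Wrapped into the proc the question asks for, a sketch (the proc name is made up here; it returns "Not supported" whenever the value contains anything outside ASCII):
proc asciiOrNotSupported {value} {
    # any character above U+007F disqualifies the whole value
    if {[regexp {[\u0080-\uffff]} $value]} {
        return "Not supported"
    }
    return $value
}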

Is there a regex way to detect whether a character can be part of a word or not?

The "tricky" part of this question is that what I mean by alphabeth is not just the 26 characters. It should also include anything alphabeth like, including accented characters and hebrew's alibeth, etc.etc.
Why I need them?
I want to split texts into words.
Alphabetic scripts like the Latin alphabet, the Hebrew alef-bet, and the Arabic abjad separate words with spaces.
Chinese characters are separated by nothing.
So I think I should split texts on anything that's not alphabetic.
In other words, a, b, c, d, é are fine.
駅, 南, 口, 第, 自, 転, 車, 3, 5, 6 are not, and each such character should be its own word. Or something like that.
In short, I want to detect whether a character may be a word by itself or can be part of a word.
What have I tried?
Well, you can check the question I asked a long time ago:
How can we separate utf-8 characters into words if some of the characters are chinese?
I implemented the only answer there, but then found out that the Chinese characters aren't split. Why not split on nothing? Well, that would split the alphabetic words too.
If all those alphabetic characters "stick" together so that I can separate them based on Unicode ranges, that would be fine too.
I will just use the answer at How can we separate utf-8 characters into words if some of the characters are chinese?
and "pull out" all non alphabeth characters.
Not a perfect solution, but good enough for me because western characters and chinese characters rarely show up on the same text anyway.
Maybe you shouldn't do this with regular expressions but with good old string index scanning instead.
The Hebrew, Chinese, Korean, etc. alphabets all occupy consecutive ranges of Unicode code points. So you could easily detect the alphabet by reading the Unicode value of the character and then checking which Unicode block it belongs to.
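In engines with Unicode property support, such as Perl, you don't even need to hard-code the code point ranges yourself; the Script property does the block lookup. A minimal sketch (the sample characters are arbitrary):
use utf8;
use open qw(:std :encoding(UTF-8));
for my $ch ('a', 'é', 'ג', '駅') {
    if    ($ch =~ /\p{Script=Han}/) { print "$ch: Han, a word by itself\n"; }
    elsif ($ch =~ /\p{L}/)          { print "$ch: a letter, can be part of a word\n"; }
    else                            { print "$ch: not a letter, a separator\n"; }
}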
Jan Goyvaerts (of PowerGrep fame) once showed me this very useful syntax to do just this:
(?<![\p{M}\p{L}])word(?![\p{M}\p{L}])
This expression uses a regex lookbehind and a regex lookahead to ensure that the boundaries of the word are such that there is no letter or diacritic mark on either side.
Why is this regex better than simply using \b? The strength of this regex is the incorporation of \p{M} to include diacritics. When the normal word boundary marker (\b) is used, many regex engines find word breaks at the positions of diacritics, even though the diacritics are actually part of the word. This is the case, for instance, with Hebrew diacritics: take the Hebrew word גְּבוּלוֹת and run the regex "\b." on it, and you'll see how it breaks the word into different parts at each diacritic point. The regex above fixes this by using Unicode character classes to ensure that diacritics are always considered part of the word and not breaks within the word.
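A minimal Perl sketch of the word extraction this enables (note that Perl's own \w and \b are already mark-aware, so the \b failure described above shows up mostly in engines with an ASCII-only \w; the character classes below are portable either way, and the sample string is made up):
use utf8;
use open qw(:std :encoding(UTF-8));
my $text = "naïve and גְּבוּלוֹת";
# runs of letters plus combining marks, so diacritics stay inside the word
my @words = $text =~ /([\p{L}\p{M}]+)/g;
print join(" | ", @words), "\n";   # naïve | and | גְּבוּלוֹת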

regex unicode character in vim

I'm being an idiot.
Someone cut and pasted some text from microsoft word into my lovely html files.
I now have these Unicode characters instead of regular quote symbols (i.e., quotes appear as <92> in the text).
I want to do a regex replace but I'm having trouble selecting them.
:%s/\u92/'/g
:%s/\u5C/'/g
:%s/\x92/'/g
:%s/\x5C/'/g
...all fail. My google-fu has failed me.
From :help regexp (lightly edited), you need to use some specific syntax to select unicode characters with a regular expression in Vim:
\%u match specified multibyte character (eg \%u20ac)
That is, to search for the unicode character with hex code 20AC, enter this into your search pattern:
\%u20ac
The full table of character search patterns includes some additional options:
\%d match specified decimal character (eg \%d123)
\%x match specified hex character (eg \%x2a)
\%o match specified octal character (eg \%o040)
\%u match specified multibyte character (eg \%u20ac)
\%U match specified large multibyte character (eg \%U12345678)
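Applied to the original problem: the <92> shown in the buffer is the single character 0x92 (the Windows-1252 right single quote), so a substitution along these lines should do it (a sketch; the exact code depends on how the file was read):
:%s/\%x92/'/g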
This solution might not address the problem as originally stated, but it does address a different but very closely related one and I think it makes a lot of sense to place it here.
I don't know in which version of Vim it was implemented, but I was working on 7.4 when I tried it.
When in Insert mode, the sequence to output Unicode characters is ctrl-v u xxxx, where xxxx is the code point. For instance, outputting the euro sign would be ctrl-v u 20ac.
I tried it in Command-line mode as well and it worked. That is, to replace all instances of "20 euro" in my document with "20 €", I'd do:
:%s/20 euro/20 <ctrl-v u 20ac>/gc
In the above <ctrl-v u 20ac> is not literal, it's the sequence of keys that will output the € character.

(grep) Regex to match non-ASCII characters?

On Linux, I have a directory with lots of files. Some of them have non-ASCII characters, but they are all valid UTF-8. One program has a bug that prevents it working with non-ASCII filenames, and I have to find out how many are affected. I was going to do this with find and then do a grep to print the non-ASCII characters, and then do a wc -l to find the number. It doesn't have to be grep; I can use any standard Unix regular expression, like Perl, sed, AWK, etc.
However, is there a regular expression for 'any character that's not an ASCII character'?
This will match a single non-ASCII character:
[^\x00-\x7F]
This is a valid PCRE (Perl-Compatible Regular Expression).
You can also use the POSIX shorthands:
[[:ascii:]] - matches a single ASCII char
[^[:ascii:]] - matches a single non-ASCII char
[^[:print:]] will probably suffice for you.
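Back at the original task of counting affected filenames, a sketch assuming GNU grep built with PCRE support (the -P flag; BSD grep may lack it, and filenames containing newlines would throw the count off):
find . | grep -P '[^\x00-\x7F]' | wc -l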
No, [^\x20-\x7E] does not cover all of ASCII.
This covers the real ASCII range:
[^\x00-\x7F]
Otherwise, it will also flag newlines and other control characters that are part of the ASCII table!
You could also check this page, Unicode Regular Expressions, as it contains some useful Unicode character classes, like:
\p{Control}: an ASCII 0x00..0x1F or Latin-1 0x80..0x9F control character.
You can use this regex:
[^\w \xC0-\xFF]
If needed, enable the Multiline option.
[^\x00-\x7F] and [^[:ascii:]] miss some control bytes, so strings can be the better option sometimes. For example, cat test.torrent | perl -pe 's/[^[:ascii:]]+/\n/g' will do odd things to your terminal, whereas strings test.torrent will behave.
To validate that a text box accepts ASCII only, use this pattern:
[\x00-\x7F]+
I use [^\t\r\n\x20-\x7E]+ and that seems to be working fine.
You don't really need a regex.
printf "%s\n" *[!\ -~]*
This will show file names with control characters in their names, too, but I consider that a feature.
If you don't have any matching files, the glob will expand to just itself, unless you have nullglob set. (The expression does not match itself, so technically, this output is unambiguous.)
This turned out to be very flexible and extensible:
$field =~ s/[^\x00-\x7F]//g;   # strip all non-ASCII characters
All non-ASCII (or otherwise unwanted) characters can be cleaned this way. Very nice either in selection or in pre-processing of items that will eventually become hash keys.