I am trying to match all Latin characters in UTF-16 encoded text. I have been using [A-Za-z], which has been working great. As I've been parsing Chinese and Japanese text, I've been coming across bizarre versions of A-Z that the regex isn't picking up.
https://gist.github.com/kyleect/1c66fd388d362653969d
On the left are the characters I can't identify; on the right are the same letters typed on my keyboard. I copied and pasted them into Chrome's find-in-page input, Google search, and the find input in my text editor. All agree: Left == Right, but Right != Left.
What are these characters, and how do I target them in a regex?
You can take a look at their character codes in your browser’s console:
> 'B'.charCodeAt(0).toString(16)
ff22
It’s a fullwidth letter! You can probably match the whole uppercase set with [\uff21-\uff3a] (lowercase is [\uff41-\uff5a]) in a decent regex engine. Or A-Z in an even more decent one.
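A quick sketch of this in Python (the fullwidth Latin letters sit in the "Halfwidth and Fullwidth Forms" block):

```python
import re

# Fullwidth Latin letters: U+FF21–U+FF3A (Ａ–Ｚ) and U+FF41–U+FF5A (ａ–ｚ).
fullwidth = re.compile(r'[\uFF21-\uFF3A\uFF41-\uFF5A]')

text = 'ＡＢＣ abc'
print(fullwidth.findall(text))  # ['Ａ', 'Ｂ', 'Ｃ'] — the ASCII letters don't match
```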
Related
In Notepad++, I want to select text up to a certain text match, including the match.
The .txt file I am working with contains a lot of text, including whitespace characters, line returns, and some special characters. In this text, there are characters that mark an end. Let's call these stop characters "ZZ." for now.
Using RegEx, I tried to create an expression that finds the next "ZZ." and selects everything before it. This is what it looks like:
+., \c ZZ.\n
But I seem to have gotten something wrong. As it is similar to this problem, I tried to use their RegEx with a slight modification. Here is a picture so you can see what I'd like to accomplish:
Find the next stop marker, select the marker and everything before it.
In the actual file, the stop marker is "გვ."
If I want to use those, maybe I need to change the RegEx even more, as those are not ASCII characters? Like so, as stated in the RegEx wiki?
\c+ (\x{nnnn}\x{nnnn}.)\n
Not quite sure if \c works that way. I have seen expressions that use something like (A-Za-z)(0-9), but this is a different alphabet.
To match any text up to and including some pattern, use .*? (which matches any zero or more characters, as few as possible) with the ". matches newline" option ON, and add გვ\. (with the dot escaped) after it:
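The same idea sketched in Python for illustration (here the re.DOTALL flag plays the role of Notepad++'s ". matches newline" checkbox; the sample text is made up):

```python
import re

text = "first block\nmore text გვ. second block გვ."

# Lazy .*? stops at the first stop marker; DOTALL lets '.' cross newlines.
# The marker itself (გვ.) is included in the match; the dot is escaped.
m = re.search(r'.*?გვ\.', text, re.DOTALL)
print(m.group())  # "first block\nmore text გვ."
```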
In VBA (Word specifically), I'm trying to use the RegExp object to search through a long document. Some of the patterns I search for include Unicode characters (such as a non-breaking hyphen or a non-breaking space). When I access the text via
ActiveDocument.Range.Text
I get the text, but stripped of Unicode characters (or at least some of them, including ones that I need). For example, if the text is ABC-123, where the hyphen is a non-breaking (or hard) hyphen, U+2011, then when I access the text using ActiveDocument.Range.Text, it displays as ABC123.
I thought perhaps it was just displayed incorrectly and the character was really there, but all the searching and replacing I've done doesn't show it. Plus, when I regex for the Unicode character using \u2011, it isn't found.
Is there another way to access the document's full content, but intact with all the unicode characters?
UPDATE: I inspected the output of ABC123, and it appears that the character is hidden. That is, Len(str) = 7 instead of the 6 you'd expect. The following shows what is happening:
Print Asc(Mid(str, 4, 1))
=> 30
ASCII character 30, or \u001e, is a record separator. When I search for this, it finds this zero-length character. I tested a wider range of Unicode characters (\u2000-\u201f), and interestingly they are all detected with the \u escape sequence in the regex, except for \u2011, which changes to \u001e. Even the en space (\u2002) and em space (\u2003) are recognized. I haven't done this for all Unicode characters, but it seems odd that I have stumbled upon one of the few that doesn't register.
This isn't an answer but a workaround. When using RegExp to search for Unicode characters, most will be recognized in the ActiveDocument.Range.Text string using a \uxxxx code. If not, open a new Word document. In the body, add some text that contains the Unicode character (e.g. a non-breaking hyphen). Then in VBA, use the Immediate window to find the character code Word actually stores for it:
Print Asc(Mid(ActiveDocument.Range.Text, <char_position>, 1))
This will tell you if it is actually there (in case the character doesn't show up in strings). The code you get won't work for every Unicode character, since some of them are converted to ASCII characters (e.g. en quad, \u2000, returns ASCII 32, a space, when you use the Asc() function on it; luckily, you can regex \u2000 and it will find it).
For the non-breaking hyphen, the code that works with regex is \u001e.
I have a simple CQ dialog with a textfield. The authors somehow managed to paste illegal characters into it; the last two times it was a vertical tab (VT) copied from a PowerPoint file.
I played around with some regex and came up with the following to exclude anything below SPACE, plus DEL:
/^[^\0-\x1F\x7F]*$/
Sadly I can't really test the vertical tab, as I am not able to enter this character on regex101. So I tried it with TAB, and this seems to work: https://regex101.com/r/yH0lN5/1
But if I use this in my regex property of the textfield, no matter what I enter the validation fails. Any idea what I am doing wrong?
Whitelisting isn't an option, as I need to support Unicode characters like Chinese in the future.
You should double the backslashes to make sure they are treated as literal backslashes by the regex engine.
Also, I suggest using consistent notation, and replace \0 with \x00:
regex="/^[^\\x00-\\x1F\\x7F]*$/"
And this regex simply matches entire strings that contain zero or more characters (due to *) other than (due to the negated character class [^...]) the ones from NUL to US ([\x00-\x1F]) and the DEL character (\x7F).
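The class itself can be sketched in Python (the doubled backslashes only matter when the pattern travels through the dialog's string property; a raw Python string doesn't need them):

```python
import re

# Reject C0 control characters (U+0000–U+001F) and DEL (U+007F);
# allow everything else, including CJK text.
valid = re.compile(r'^[^\x00-\x1F\x7F]*$')

assert valid.match('hello 中文')          # Unicode letters pass
assert not valid.match('bad\x0bvalue')   # \x0b is the vertical tab (VT)
```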
I have written a lot of text in Greek, with the language set to "Greek", in LyX 2.1. The problem is that there are English words and characters in there, and when I output to PDF with pdflatex they are transformed into the matching Greek characters. I suppose I need to change the language of every English word in there to English, but that sounds very repetitive... I am wondering if I can use regex to match English characters and change their language with find & replace somehow. I have successfully managed to match English characters with this capture group \regexp{([\backslash x00-\backslash x7F]+)\endregexp{}}, but I don't know what to replace it with. Any ideas?
PS: does LyX regex even have backreferences, or are capture groups just there for no reason?
The "tricky" part of this question is that what I mean by alphabeth is not just the 26 characters. It should also include anything alphabeth like, including accented characters and hebrew's alibeth, etc.etc.
Why I need them?
I want to split texts into words.
Alphabets like the Latin alphabet, the Hebrew alef-bet, and Arabic abjads are separated by spaces.
Chinese characters are separated by nothing.
So I think I should separate texts by anything that's not an alphabet character.
In other words, a, b, c, d, é is fine.
駅,南,口,第,自,転,車.,3,5,6 is not, and each such separator should be its own word. Or stuff like that.
In short I want to detect whether a character may be a word by itself, or can be part of a word.
What have I tried?
Well you can check the question here I asked a long time ago:
How can we separate utf-8 characters into words if some of the characters are chinese?
I implemented the only answer there, but then I found out that the Chinese characters aren't split. Why not split based on nothing? Well, that means the alphabets are split too.
If all those alphabets "stick" together so that I can separate them based on Unicode ranges, that would be fine too.
I will just use the answer at How can we separate utf-8 characters into words if some of the characters are chinese?
and "pull out" all non alphabeth characters.
Not a perfect solution, but good enough for me because western characters and chinese characters rarely show up on the same text anyway.
Maybe you shouldn't do this with regular expressions but with good old string index scanning instead.
The Hebrew, Chinese, Korean, etc. alphabets all occupy consecutive ranges of Unicode code points. So you can easily detect the alphabet by reading the Unicode value of a character and then checking which Unicode block it belongs to.
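One way to sketch that scan in Python is to test the Unicode general category instead of hard-coding block ranges (a hypothetical helper, not from the original answer):

```python
import unicodedata

def is_letter(ch: str) -> bool:
    # General categories starting with 'L' (Lu, Ll, Lt, Lm, Lo) cover
    # Latin, Hebrew, Arabic, CJK ideographs, and so on; digits and
    # punctuation fall under other categories (N*, P*).
    return unicodedata.category(ch).startswith('L')

print([ch for ch in 'aé駅3.' if is_letter(ch)])  # ['a', 'é', '駅']
```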
Jan Goyvaerts (of PowerGrep fame) once showed me this very useful syntax to do just this:
(?<![\p{M}\p{L}])word(?![\p{M}\p{L}])
This expression uses a regex lookbehind and a regex lookahead to ensure that the boundaries of the word are such that there is no letter or diacritic mark on either side.
Why is this regex better than simply using \b? Its strength is the incorporation of \p{M} to include diacritics. When the normal word-boundary marker (\b) is used, regex engines find word breaks at the places of many diacritics, even though the diacritics are actually part of the word (this is the case, for instance, with Hebrew diacritics: take the Hebrew word גְּבוּלוֹת and run the regex \b. on it, and you'll see how it actually breaks the word into different parts, at each diacritic point). The regex above fixes this by using a Unicode character class to ensure that diacritics are always considered part of the word and not breaks within it.
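Python's standard re module doesn't support \p{...}, but the third-party regex module does; a minimal sketch of the same lookaround technique (the whole_word helper is my own wrapper, not from the original answer):

```python
import regex  # third-party: pip install regex

def whole_word(word):
    # No letter (\p{L}) or combining mark (\p{M}) may touch either side,
    # so diacritics attached to the word prevent a false "whole word" hit.
    return regex.compile(r'(?<![\p{M}\p{L}])' + regex.escape(word) + r'(?![\p{M}\p{L}])')

p = whole_word('cat')
assert p.search('the cat sat')            # standalone word: matches
assert not p.search('concatenate')        # letter on both sides: no match
assert not p.search('cat\u0301 nap')      # trailing combining mark: still part of the word
```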