Character Classes in Vim - regex

I have a question about using regexes in Vim.
When using character classes, if I search using the pattern [a-y] the search is case insensitive.
But the pattern a-z seems to make the search case sensitive.
I think it's because of the z. But I don't know why.
And I am using gVim 7.4 on Win 8.1.
And the character z of [a-z] is a lowercase z
Changing the pattern to a-Z makes the search case sensitive too.
With the pattern a-Y, strangely a 'wrong scope' error occurs.
The following images are descriptions about encoding and config.
Thanks, everyone. :)

The most straightforward answer is that in one search, case-sensitivity is on, and in the other, it is not. See :help 'ignorecase.
If that is not the case, then the only way I can reproduce this is to use a character that looks like an ASCII z but is in reality a completely different character. Among characters that resemble an ASCII z, the only one I can find that reproduces this behavior is U+0396 Greek Capital Letter Zeta: Ζ.
Even this theory is a little shaky, as this character looks like an uppercase Z, not a lowercase z - at least on my screen.
It is difficult to be certain that this is the issue given only the screenshots above and your description. More information in your question about exactly how you are entering the search characters, what encoding you are using, what your keyboard layout is, etc., might help someone write a better answer than this one.

Related

Ignoring invisible characters in RegEx

I've run into a bit of a conundrum.
I am currently trying to build a regex to filter out some particularly nasty scam emails. I'm sure you've seen them before, using a data dump from a compromised website to threaten to reveal intimate videos.
That's all well and good, except I noticed while testing the regex that some of these messages insert special invisible characters in the middle of words. Like you might see here (I've found it especially hard to find a place that keeps these special characters):
Regexr link
I find myself looking for a way to create a regex that might ignore these characters all together, as some emails have them and some don't. In the end, I'm trying to create a match with something like
/all (.*)your contacts
If there's a particular string you're trying to flag, you could do something like this:
Detect "email" with optional invis characters: /e[^\w]?m[^\w]?a[^\w]?i[^\w]?l/
[^\w]? will detect anything that's not a letter or digit. You could also use [^\w]* if you're seeing more than one invisible character being used between letters.
Most invisible characters are just whitespace.
These don't matter which character set they're rendered in,
it's probably invisible.
If using a Unicode aware regex engine, you could probably just stick
in the whitespace class between the characters you're looking for.
If not, you could try using the class equivalent [ ].
\s =
[\x{9}-\x{D}\x{1C}-\x{20}\x{85}\x{A0}\x{1680}\x{2000}-\x{200A}\x{2028}-\x{2029}\x{202F}\x{205F}\x{3000}]
Same, but without CRLF's
[^\S\r\n] =
[\x{9}\x{B}-\x{C}\x{1C}-\x{20}\x{85}\x{A0}\x{1680}\x{2000}-\x{200A}\x{2028}-\x{2029}\x{202F}\x{205F}\x{3000}]

regex to highlight sentences longer than n words

I am trying to write a regex expression that can be used to identify long sentences in a document. I my case a scientific manuscript. I aim to be doing that either in libre office or any text editor with regex search.
So far I got the following expression to work on most occasions:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*,*\:*\s+){24,}?(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*[\.|?|!|$])
btw, I got inspired from this post
It contains:
group1:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*,*\:*\s+)
a repetition element (stating how many words n - 1):
{24,}?
group2:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*[\.|?|!|$])
The basic functioning is:
group1 matches any number of word characters OR other characters that are present in the text followed by one or more spaces
group1 has to be repeated 24 times (or as many as you want the sentences to be long)
group2 matches any number of word characters OR other characters that are present in the text followed by a full stop, exclamation mark, question mark or paragraph break.
Any string that fulfills all the above would then be highlighted.
What I can't solve so far is to make it work when a dot appears in the text with another meaning than a full stop. Things like: i.e., e.g., et al., Fig., 1.89, etc....
Also I don't like that I had to manually adjust it to be able to handle sentences that contain non-word characters such as , [ ( % - # µ " ' and so on. I would have to extend the expression every time I come across some other uncommon character.
I'd be happy for any help or suggestions of other ways to solve this.
You can do a lot with the swiss-army-knife that is regular expressions, but the problem you've presented approaches regex's limits. Some of the things you want to detect can probably be handled with really small changes, while others are a bit harder. If your goal is to have some kind of tool that accurately measures sentence length for every possible mutation of characters, you'll probably need to move outside LibreOffice to a dedicated custom piece of software or a third-party tool.
But, that said, there are a few tricks you can worm into your existing regex to make it work better, if you want to avoid programming or another tool. Let's look at a few of the techniques that might be useful to you:
You can probably tweak your regex for a few special cases, like Fig. and Mr., by including them directly. Where you currently have [\w|\-|–|−|\/|≥|≤|’|“|”|μ]+, which is basically [\w]+ with a bunch of other "special" characters, you could use something like ([\w|...]+|Mr\.|Mrs\.|Miss\.|Fig\.) (substituting in all the special characters where I wrote ..., of course). Regexes are "greedy" algorithms, and will try to consume as much of the text as possible, so by including special "dot words" directly, you can make the regex "skip over" certain period characters that are problematic in your text. Make sure that when you want to add a "period to skip" that you always precede it with a backslash, like in i\.e\., so that it doesn't get treated as the special "any" character.
A similar trick can capture numbers better by assuming that digits followed by a period followed by more digits are supposed to "eat" the period: ([\w|...]+|\d+\.\d+|...) That doesn't handle everything, and if your document authors are writing stuff like 0. in the middle of sentences then you have a tough problem, but it can at least handle pi and e properly.
Also, right now, your regex consumes characters until it reaches any terminating punctuation character — a ., or !, or ?, or the end of the document. That's a problem for things like i.e., and 3.14, since as far as your regex is concerned, the sentence stops at the .. You could require your regex to only stop the sentence once ._ is reached — a period followed by a space. That wouldn't fix mismatches for words like Mr., but it would treat "words" like 3.14 as a word instead of as the end of a sentence, which is closer than you currently are. To do this, you'll have to include an odd sequence as part of the "word" regex, something like (\.[^ ]), which says "dot followed by not-a-space" is part of the word; and then you'll have to change the terminating sequence to (\. |!|?|$). Repeat the changes similarly for ! and ?.
Another useful trick is to take advantage of character-code ranges, instead of encoding each special character directly. Right now, you're doing it the hard way, by spelling out every accented character and digraph and diacritic in the universe. Instead, you could just say that everything that's a "special character" is considered to be part of the "word": Instead of [\w|\-|–|−|\/|≥|≤|’|“|”|μ]+, write [\w|\-|\/|\u0080-\uFFFF], which captures every character except emoji and a few from really obscure dead languages. LibreOffice seems to have Unicode support, so using \uXXXX patterns should work inside [ character ranges ].
This is probably enough to make your regex somewhat acceptable in LibreOffice, and might even be enough to answer your question. But if you're really intent on doing more complex document analysis like this, you may be better off exporting the document as plain text and then running a specialized tool on it.

RegEx: Searching for numbers (int, float) that are NOT part of a word

I'm hoping we have some regular expression guru's here that might be able to help me - a regex newbie - solve a problem.
I know some people will want to know some background info on this issue:
Regex Flavor: Basic Regex, being used in a Vertica Database using the REGEXP_REPLACE function.
The regex I am using is working great with one exception.
I have a rule that I'm trying to implement, related to stripping the numbers from text, where any number that is part of a word, e.g. table5, go2market, 33monroe, room222, etc. is ignored and NOT filtered.
Here is what I started with for detecting numbers:
[-+]?[0-9]*\.?[0-9]
That seems to work pretty well, including handling directly adjacent commas and parentheses for example.
But all cases where there is a number that is part of alphabetic text is also being detected, which fails the rule that it cannot be a part of a word, and by word, I mean any alphabetic text.
So, in searching for solutions, I happened upon this regex that seems to work well detecting those specific cases where numbers appear next to, or in, any string of characters:
((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)
My thought was that maybe I could add this as an INVERTED match to my original regex, to allow it to still select standalone numbers while ignoring those that were a part of a word, like so:
[-+]?[0-9]^((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)*\.?[0-9]^((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)
Unfortunately however, it breaks the original detection of standalone numbers.
:(
I'm hoping there is someone here that can spot what I'm doing wrong, and help me identify the right solution?
Thanks in advance!
According to Vertica documentation, the regex flavour seems to follow the Perl syntax. In this case you can use negative lookarounds and in particular a negative lookbehind: (?<!\w) (not preceded with a word character.)
Lookarounds are only tests and don't consume characters.
You can also use a negative lookahead to test the right part, (?!\w) (not followed by a word character), but it's more simple to use a word boundary since the pattern ends with a digit (that is also a word character):
(?<!\w)[-+]?\d*\.?\d+\b
In the worst case, if you have something like v1.0 in your string and you want to avoid it, you can try to use the bactracking control verbs (*SKIP) and (*FAIL). (*FAIL) forces the pattern to fail and (*SKIP) skips all the already matched positions before it. I hope vertica supports these Perl regex features.
Something like:
\p{L}+[-+]?\d*\.?\d+(*SKIP)(*FAIL)|[-+]?\d*\.?\d+(*SKIP)(?!\p{L})

REGEX: Special Characters detected for [[a-zA-Z0-9]]

I built a filter with the rule: [[a-zA-Z0-9]]
So the intention was that content with at least a number or a letter and should remove content with special characters for example "?" and ":)" or any other emojis.
So far it worked great, however i noticed that the rule does not worked for symbols that starts with the unicode "U+1" and seems to recognize as a letter or a number. Other special letters/symbols starting with "U+0" for example ◌́ seems to work as intended.
Can somebody please explain the reason for this?
Thanks

Regex word-break with unicode diacritics

I am working on an application that searches text using regular expressions based on input from a user. One option the user has is to include a "Match 0 or more characters" wildcard using the asterisk. I need this to only match between word boundaries. My first attempt was to convert all asterisks to (?:(?=\B).)*, which works fine for most cases. Where it fails is that apparently .Net considers the position between a unicode character with a diacritic and another character a word-break. I consider this a bug, and have submitted it to the Microsoft feedback site.
In the meantime, however, I need to get the functionality implemented and product shipped. I am considering using [\p{L}\p{M}\p{N}\p{Pc}]* as the replacement text, but, frankly, am in "I don't really understand what this is going to do" land. I mean, I can read the specifications, but am not confident that I could sufficiently test this to make sure it is doing what I expect. I simply wouldn't know all the boundary conditions to test. The application is used by cross-cultural workers, many of whom are in tribal locations, so any and all writing systems need to be supported, including some that use zero-width word breaks.
Does anyone have a more elegant solution, or could confirm/correct the code above, or offer some pointers?
Thanks for your help.
The equivalent of /(?:(?=\B).)*/ in a unicode context would be:
/
(?:
(?: (?<=[\p{L}\p{M}\p{N}\p{Pc}]) (?=[\p{L}\p{M}\p{N}\p{Pc}])
| (?<![\p{L}\p{M}\p{N}\p{Pc}]) (?![\p{L}\p{M}\p{N}\p{Pc}])
)
.
)*
/
...or somewhat simplified:
/(?:[\p{L}\p{M}\p{N}\p{Pc}]+|[^\p{L}\p{M}\p{N}\p{Pc}]+)?/
This would match either a word or a non-word (spacing, punctuation etc.) sequence, possibly an empty one.
A normal or negated word-boundary (\b or \B) is basically a double look-around. One looking behind, making sure of the type of character that precedes the current position. Similarly one looking ahead.
In the second regex, I removed the look-arounds and used simple character classes instead.