Ignoring invisible characters in RegEx - regex

I've run into a bit of a conundrum.
I am currently trying to build a regex to filter out some particularly nasty scam emails. I'm sure you've seen them before, using a data dump from a compromised website to threaten to reveal intimate videos.
That's all well and good, except I noticed while testing the regex that some of these messages insert special invisible characters in the middle of words. Like you might see here (I've found it especially hard to find a place that keeps these special characters):
Regexr link
I find myself looking for a way to create a regex that ignores these characters altogether, as some emails have them and some don't. In the end, I'm trying to create a match with something like
/all (.*)your contacts

If there's a particular string you're trying to flag, you could do something like this:
Detect "email" with optional invisible characters: /e[^\w]?m[^\w]?a[^\w]?i[^\w]?l/
[^\w]? matches one optional character that is not a word character (letter, digit, or underscore); it is equivalent to \W?. You could also use [^\w]* if you're seeing more than one invisible character being used between letters.
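If you have several strings to flag, writing the filler between every letter by hand gets tedious. A minimal Python sketch of generating such a pattern follows; the helper name loosen and its filler parameter are made up for illustration, and the [^\w]* form from above is assumed.

```python
import re

def loosen(word, filler=r"[^\w]*"):
    """Build a pattern that tolerates filler characters between the letters of word."""
    return filler.join(re.escape(ch) for ch in word)

pattern = re.compile(loosen("email"), re.IGNORECASE)

# U+200B (zero width space) is not a word character, so the filler absorbs it.
print(bool(pattern.search("your e\u200bma\u200bil account")))  # True
print(bool(pattern.search("your email account")))              # True
```

Keep in mind that the filler also absorbs ordinary spaces and punctuation, so this pattern matches the "e mail" inside "the mail" as well.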

Most invisible characters are just whitespace.
Whichever character set they are rendered in,
they are probably invisible.
If you are using a Unicode-aware regex engine, you could probably just put
the whitespace class \s between the characters you're looking for.
If not, you could try the equivalent explicit character class:
\s =
[\x{9}-\x{D}\x{1C}-\x{20}\x{85}\x{A0}\x{1680}\x{2000}-\x{200A}\x{2028}-\x{2029}\x{202F}\x{205F}\x{3000}]
Same, but without CRLF's
[^\S\r\n] =
[\x{9}\x{B}-\x{C}\x{1C}-\x{20}\x{85}\x{A0}\x{1680}\x{2000}-\x{200A}\x{2028}-\x{2029}\x{202F}\x{205F}\x{3000}]
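Note that the classes above stop at U+200A, so zero-width characters such as U+200B are not covered by them. An alternative to widening the pattern is to strip those characters first and then run the ordinary regex. A rough Python sketch, assuming the inserted characters are among the zero-width code points listed in ZERO_WIDTH (that list is a guess; inspect a real message to see which ones actually occur):

```python
import re

# Assumption: these zero-width code points are the ones being inserted.
ZERO_WIDTH = "\u200b\u200c\u200d\u2060\ufeff"

def strip_zero_width(text):
    """Remove zero-width characters so an ordinary regex can be applied afterwards."""
    return re.sub(f"[{ZERO_WIDTH}]", "", text)

message = "I will send it to all of yo\u200bur con\u200btacts"
cleaned = strip_zero_width(message)
match = re.search(r"all\s(.*)your\scontacts", cleaned)
print(repr(match.group(1)))  # 'of '
```

The same regex finds nothing in the raw message, because "your" and "contacts" are each interrupted by a U+200B.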

Related

regex to highlight sentences longer than n words

I am trying to write a regular expression that can be used to identify long sentences in a document, in my case a scientific manuscript. I aim to do that either in LibreOffice or in any text editor with regex search.
So far I got the following expression to work on most occasions:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*,*\:*\s+){24,}?(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*[\.|?|!|$])
btw, I got inspired from this post
It contains:
group1:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*,*\:*\s+)
a repetition element (stating how many words n - 1):
{24,}?
group2:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*[\.|?|!|$])
The basic functioning is:
group1 matches any number of word characters OR other characters that are present in the text followed by one or more spaces
group1 has to be repeated 24 times (or as many as you want the sentences to be long)
group2 matches any number of word characters OR other characters that are present in the text followed by a full stop, exclamation mark, question mark or paragraph break.
Any string that fulfills all the above would then be highlighted.
What I can't solve so far is to make it work when a dot appears in the text with another meaning than a full stop. Things like: i.e., e.g., et al., Fig., 1.89, etc....
Also I don't like that I had to manually adjust it to be able to handle sentences that contain non-word characters such as , [ ( % - # µ " ' and so on. I would have to extend the expression every time I come across some other uncommon character.
I'd be happy for any help or suggestions of other ways to solve this.
You can do a lot with the swiss-army-knife that is regular expressions, but the problem you've presented approaches regex's limits. Some of the things you want to detect can probably be handled with really small changes, while others are a bit harder. If your goal is to have some kind of tool that accurately measures sentence length for every possible mutation of characters, you'll probably need to move outside LibreOffice to a dedicated custom piece of software or a third-party tool.
But, that said, there are a few tricks you can worm into your existing regex to make it work better, if you want to avoid programming or another tool. Let's look at a few of the techniques that might be useful to you:
You can probably tweak your regex for a few special cases, like Fig. and Mr., by including them directly. Where you currently have [\w|\-|–|−|\/|≥|≤|’|“|”|μ]+, which is basically [\w]+ with a bunch of other "special" characters, you could use something like ([\w|...]+|Mr\.|Mrs\.|Miss\.|Fig\.) (substituting in all the special characters where I wrote ..., of course). Regexes are "greedy" algorithms, and will try to consume as much of the text as possible, so by including special "dot words" directly, you can make the regex "skip over" certain period characters that are problematic in your text. Make sure that when you want to add a "period to skip" that you always precede it with a backslash, like in i\.e\., so that it doesn't get treated as the special "any" character.
A similar trick can capture numbers better by assuming that digits followed by a period followed by more digits are supposed to "eat" the period: ([\w|...]+|\d+\.\d+|...) That doesn't handle everything, and if your document authors are writing stuff like 0. in the middle of sentences then you have a tough problem, but it can at least handle pi and e properly.
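To see the two tricks above working together outside LibreOffice, here is a small Python sketch. The abbreviation list, the simplified word class [\w\-/]+ and the function name are all made up for illustration. The "dot words" are listed before the generic class, which is just a choice of this sketch.

```python
import re

# Hypothetical abbreviation list; extend it to suit your manuscript.
ABBREVIATIONS = [r"Mr\.", r"Mrs\.", r"Fig\.", r"i\.e\.", r"e\.g\.", r"al\."]

# "Dot words" and decimals are listed before the generic word class.
WORD = "(?:" + "|".join(ABBREVIATIONS) + r"|\d+\.\d+|[\w\-/]+)"

def long_sentence_pattern(min_words):
    """Match min_words or more words followed by ., ? or !."""
    return re.compile(rf"(?:{WORD}[,:]*\s+){{{min_words - 1},}}{WORD}[.?!]")

text = "As shown in Fig. 2, the value 3.14 is close to pi."
match = long_sentence_pattern(5).search(text)
print(match.group(0) if match else None)
```

With the abbreviation and decimal alternatives in place, the whole sentence is matched as one unit instead of being cut at "Fig." or "3.14".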
Also, right now, your regex consumes characters until it reaches any terminating punctuation character: a ., a !, a ?, or the end of the document. That's a problem for things like i.e. and 3.14, since as far as your regex is concerned, the sentence stops at the first dot. You could require your regex to only stop the sentence once ._ is reached, that is, a period followed by a space. That wouldn't fix mismatches for words like Mr., but it would treat "words" like 3.14 as a word instead of as the end of a sentence, which is closer than you currently are. To do this, you'll have to include an odd sequence as part of the "word" regex, something like (\.[^ ]), which says "dot followed by not-a-space" is part of the word; and then you'll have to change the terminating sequence to (\. |!|\?|$). Note that the ? must be escaped there, otherwise it is read as a quantifier. Repeat the changes similarly for ! and ?.
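A rough Python sketch of the "dot followed by not-a-space" idea, under a few assumptions of mine: the word class is simplified to [\w\-/], \S is used instead of [^ ] so that a line break after a dot also ends the sentence, and a dot at the very end of the text counts as a terminator too.

```python
import re

# A dot followed by a non-whitespace character stays inside the word.
WORD = r"(?:[\w\-/]|\.\S)+"
# A sentence ends at a dot followed by whitespace or end of text, or at ! or ?.
END = r"(?:\.(?:\s|$)|[!?])"

def long_sentence_pattern(min_words):
    """Match min_words or more words followed by a sentence terminator."""
    return re.compile(rf"(?:{WORD}[,:]*\s+){{{min_words - 1},}}{WORD}{END}")

text = "The ratio 3.14 is close to pi. It works."
match = long_sentence_pattern(5).search(text)
print(match.group(0).strip() if match else None)
```

Here 3.14 is kept as a single word, while the dot after "pi" still closes the sentence because it is followed by a space.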
Another useful trick is to take advantage of character-code ranges instead of listing each special character. Right now, you're doing it the hard way, by spelling out every special character you come across. Instead, you could just say that everything that's a "special character" is considered to be part of the "word": instead of [\w|\-|–|−|\/|≥|≤|’|“|”|μ]+, write [\w\-\/\u0080-\uFFFF]+ (the | separators are not needed inside a character class, where | is just a literal character). The range \u0080-\uFFFF covers every non-ASCII character in the Basic Multilingual Plane, that is, everything except characters above U+FFFF, such as most emoji. LibreOffice seems to have Unicode support, so using \uXXXX patterns should work inside [ character ranges ].
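As a quick check of the range idea, here is a short Python sketch (Python's re module accepts \uXXXX escapes inside character classes; the sample string is made up):

```python
import re

# \w, hyphen, slash, plus every non-ASCII character in the Basic Multilingual Plane.
WORD = re.compile(r"[\w\-/\u0080-\uFFFF]+")

sample = "value \u2265 5 \u03bcm \u2013 done"  # "value ≥ 5 μm – done"
tokens = WORD.findall(sample)
print(tokens)
```

Every symbol is picked up as a "word" without being listed explicitly. The trade-off is that the range also takes in non-ASCII spaces and punctuation, such as the no-break space U+00A0.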
This is probably enough to make your regex somewhat acceptable in LibreOffice, and might even be enough to answer your question. But if you're really intent on doing more complex document analysis like this, you may be better off exporting the document as plain text and then running a specialized tool on it.

RegEx: Searching for numbers (int, float) that are NOT part of a word

I'm hoping we have some regular expression gurus here who might be able to help me, a regex newbie, solve a problem.
I know some people will want to know some background info on this issue:
Regex Flavor: Basic Regex, being used in a Vertica Database using the REGEXP_REPLACE function.
The regex I am using is working great with one exception.
I have a rule that I'm trying to implement, related to stripping the numbers from text, where any number that is part of a word, e.g. table5, go2market, 33monroe, room222, etc. is ignored and NOT filtered.
Here is what I started with for detecting numbers:
[-+]?[0-9]*\.?[0-9]
That seems to work pretty well, including handling directly adjacent commas and parentheses for example.
But all cases where a number is part of alphabetic text are also being detected, which fails the rule that it cannot be a part of a word, and by word, I mean any alphabetic text.
So, in searching for solutions, I happened upon this regex that seems to work well detecting those specific cases where numbers appear next to, or in, any string of characters:
((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)
My thought was that maybe I could add this as an INVERTED match to my original regex, to allow it to still select standalone numbers while ignoring those that were a part of a word, like so:
[-+]?[0-9]^((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)*\.?[0-9]^((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)
Unfortunately however, it breaks the original detection of standalone numbers.
:(
I'm hoping there is someone here that can spot what I'm doing wrong, and help me identify the right solution?
Thanks in advance!
According to the Vertica documentation, the regex flavour seems to follow the Perl syntax. In this case you can use negative lookarounds, and in particular a negative lookbehind: (?<!\w) (not preceded by a word character).
Lookarounds are only tests and don't consume characters.
You can also use a negative lookahead to test the right-hand side, (?!\w) (not followed by a word character), but it's simpler to use a word boundary, since the pattern ends with a digit (which is also a word character):
(?<!\w)[-+]?\d*\.?\d+\b
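This pattern uses nothing engine-specific, so it can be tried outside Vertica. A small Python sketch on the examples from the question (the test string itself is made up):

```python
import re

NUMBER = re.compile(r"(?<!\w)[-+]?\d*\.?\d+\b")

text = "table5 go2market 33monroe room222 42 -3.5 (7), 0.25"
print(NUMBER.findall(text))    # ['42', '-3.5', '7', '0.25']

# The dot is not a word character, so the digit after it still matches.
print(NUMBER.findall("v1.0"))  # ['0']
```

In Vertica the same pattern would go into REGEXP_REPLACE; whether every construct behaves identically there is something to verify against its documentation.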
In the worst case, if you have something like v1.0 in your string and you want to avoid it, you can try to use the backtracking control verbs (*SKIP) and (*FAIL). (*FAIL) forces the pattern to fail and (*SKIP) skips all the already matched positions before it. I hope Vertica supports these Perl regex features.
Something like:
\p{L}+[-+]?\d*\.?\d+(*SKIP)(*FAIL)|[-+]?\d*\.?\d+(*SKIP)(?!\p{L})

CQ5 textfield validation with regex

I have a simple CQ dialog with a textfield. The authors somehow managed to paste illegal characters into it, the last two times it was a vertical tab (VT) copied from a PowerPoint file.
I played around with some regex and came up with the following to exclude anything below SPACE and DEL:
/^[^\0-\x1F\x7F]*$/
Sadly I can't really test the vertical tab as I am not able to enter this character on regex101. So I tried it with TAB and this seems to be working: https://regex101.com/r/yH0lN5/1
But if I use this in my regex property of the textfield, no matter what I enter the validation fails. Any idea what I am doing wrong?
Whitelisting isn't an option, as I need to support Unicode characters such as Chinese in the future.
You should double the backslashes to make sure they are treated as literal backslashes by the regex engine.
Also, I suggest using consistent notation, and replace \0 with \x00:
regex="/^[^\\x00-\\x1F\\x7F]*$/"
This regex matches entire strings that contain zero or more characters (due to *) other than (due to the negated character class [^...]) the ones from NUL to US ([\x00-\x1F]) and the DEL character (\x7F).
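Since a vertical tab is hard to type into an online tester, one way to check the character class itself is a few lines of Python. This only illustrates the class, not the CQ dialog; note that a Python raw string takes single backslashes, whereas the dialog property needs them doubled as described above. The names NO_CONTROL_CHARS and is_valid are made up.

```python
import re

# Rejects C0 control characters (NUL through US) and DEL; everything else is allowed.
NO_CONTROL_CHARS = re.compile(r"[^\x00-\x1F\x7F]*")

def is_valid(value):
    """True if value contains no control characters."""
    return NO_CONTROL_CHARS.fullmatch(value) is not None

print(is_valid("Hello \u4e16\u754c"))  # True, Chinese characters pass
print(is_valid("slide\x0btitle"))      # False, \x0b is the vertical tab
```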

Select capitalized & all-caps words using RegEx

I'm trying to find names of people and companies (everything that is capitalized but not in the beginning of a sentence) in a large body of text. The purpose is to find as many instances as possible so that they can be XML-tagged properly.
This is what I've come up with so far:
[^\W](\s\b[\p{Lu}][\p{Lu}|\p{Ll}]+\b)+
It has two problems:
It selects two characters too many in front of the hit.
In the sentence "Is this Beetle ugly?" it finds s Beetle which complicates the subsequent tagging.
When a capitalized word is preceded with an apostrophe or a colon, it isn't found. If possible I'd like to limit what characters are used for determining a sentence to just !?.
Here's the sample text I'm using to test it out:
John Adams is my hero. There's just no limits to his imagination! Is
this Beetle ugly? It sings at the: La Scala opera house. I have a
dream that I will find work at' Frame Store but not in the USA! This
way ILM could do whatever they pleased. ILM was very sweet. Visual
Effects did a good job... Neither did Animatronix?
I'm using jEdit (http://jedit.org) since I need something that works on both Windows and OS X.
Update: this now avoids matching at the start of the string.
(?<!(?:[!?\.]\s|^))(\b[\p{Lu}][\p{Lu}\p{Ll}]+\b)+
(?<!(?:[!?\.]\s|^)) is a negative lookbehind that ensures the match is not preceded by one of !?. plus a space, OR by the start of a line.
I tested it with jEdit.
Update to cover Names consisting of multiple words
(?<!(?:[!?\.]\s|^))(\b[\p{Lu}][\p{Lu}\p{Ll}]*\b(?:\s\b[\p{Lu}][\p{Lu}\p{Ll}]*\b)*)+
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (added)
                                            ^ (changed)
I added the group (?:\s\b[\p{Lu}][\p{Lu}\p{Ll}]*\b)* to match optional following words starting with uppercase letters. And I changed the + to a * to match the A in your example My company's called A Few Good Men. But this change now causes the regex to match I as a name.
See tchrist's comment. Names are not a simple thing, and it gets really difficult if you want to cover the more complex cases.
This also works:
(?<!\p{P}\s)(\b[\p{Lu}][\p{Lu}|\p{Ll}]+\b)+
But \p{P} covers all punctuation, which, as I understand it, is not what you want. Maybe you can find a property that fits your needs on regular-expressions.info/unicode.html.
Another mistake in your expression is the | in the character class. It's not needed: you are just adding this character to your class, and with it the regex will match words like U|S|A. So just remove it:
(?<![!?\.]\s)(\b[\p{Lu}][\p{Lu}\p{Ll}]+\b)+
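For a quick experiment outside jEdit, here is a Python sketch of the last pattern. Python's built-in re module has no \p{Lu} or \p{Ll}, so [A-Z] and [a-z] are used as ASCII stand-ins (an assumption that drops non-ASCII names), and the sample is a shortened version of the test text.

```python
import re

# ASCII stand-ins: [A-Z] for \p{Lu} and [a-z] for \p{Ll}.
NAME = re.compile(r"(?<![!?.]\s)\b[A-Z][A-Za-z]+\b")

text = "John Adams is my hero. Is this Beetle ugly? It sings at the: La Scala opera house."
print(NAME.findall(text))  # ['John', 'Adams', 'Beetle', 'La', 'Scala']

# A | inside a character class is a literal pipe, not alternation.
print(bool(re.fullmatch(r"[A-Z][A-Z|a-z]+", "U|S|A")))  # True
```

Because this variant has no ^ in the lookbehind, a capitalized word at the very start of the text (John here) is still matched.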

Regex word-break with unicode diacritics

I am working on an application that searches text using regular expressions based on input from a user. One option the user has is to include a "Match 0 or more characters" wildcard using the asterisk. I need this to only match between word boundaries. My first attempt was to convert all asterisks to (?:(?=\B).)*, which works fine for most cases. Where it fails is that .NET apparently considers the position between a Unicode character with a diacritic and the next character to be a word break. I consider this a bug, and have submitted it to the Microsoft feedback site.
In the meantime, however, I need to get the functionality implemented and product shipped. I am considering using [\p{L}\p{M}\p{N}\p{Pc}]* as the replacement text, but, frankly, am in "I don't really understand what this is going to do" land. I mean, I can read the specifications, but am not confident that I could sufficiently test this to make sure it is doing what I expect. I simply wouldn't know all the boundary conditions to test. The application is used by cross-cultural workers, many of whom are in tribal locations, so any and all writing systems need to be supported, including some that use zero-width word breaks.
Does anyone have a more elegant solution, or could confirm/correct the code above, or offer some pointers?
Thanks for your help.
The equivalent of /(?:(?=\B).)*/ in a unicode context would be:
/
  (?:
    (?: (?<=[\p{L}\p{M}\p{N}\p{Pc}]) (?=[\p{L}\p{M}\p{N}\p{Pc}])
      | (?<![\p{L}\p{M}\p{N}\p{Pc}]) (?![\p{L}\p{M}\p{N}\p{Pc}])
    )
    .
  )*
/
...or somewhat simplified:
/(?:[\p{L}\p{M}\p{N}\p{Pc}]+|[^\p{L}\p{M}\p{N}\p{Pc}]+)?/
This would match either a word or a non-word (spacing, punctuation etc.) sequence, possibly an empty one.
A normal or negated word boundary (\b or \B) is basically a pair of look-arounds: one looks behind to check the type of character that precedes the current position, and the other looks ahead to check the character that follows it.
In the second regex, I removed the look-arounds and used simple character classes instead.
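If you want to sanity-check which characters the class [\p{L}\p{M}\p{N}\p{Pc}] treats as part of a word, one option is to reproduce the classification outside the regex engine. Below is a minimal Python sketch using the standard unicodedata module (Python's built-in re has no \p{...} classes); the function names are made up for illustration.

```python
import unicodedata
from itertools import groupby

def is_word_char(ch):
    """True for letters (L*), marks (M*), numbers (N*) and connector punctuation (Pc)."""
    category = unicodedata.category(ch)
    return category[0] in "LMN" or category == "Pc"

def word_runs(text):
    """Split text into alternating runs of word and non-word characters."""
    return ["".join(group) for _, group in groupby(text, key=is_word_char)]

# "e" followed by U+0301 (combining acute accent) stays inside one word run.
print(word_runs("cafe\u0301 au_lait"))
```

This mirrors the simplified regex: each run is either a word or a non-word sequence, and a combining mark never splits a word.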