Regex word-break with unicode diacritics - regex

I am working on an application that searches text using regular expressions based on input from a user. One option the user has is to include a "Match 0 or more characters" wildcard using the asterisk. I need this to only match between word boundaries. My first attempt was to convert all asterisks to (?:(?=\B).)*, which works fine for most cases. Where it fails is that apparently .Net considers the position between a unicode character with a diacritic and another character a word-break. I consider this a bug, and have submitted it to the Microsoft feedback site.
In the meantime, however, I need to get the functionality implemented and product shipped. I am considering using [\p{L}\p{M}\p{N}\p{Pc}]* as the replacement text, but, frankly, am in "I don't really understand what this is going to do" land. I mean, I can read the specifications, but am not confident that I could sufficiently test this to make sure it is doing what I expect. I simply wouldn't know all the boundary conditions to test. The application is used by cross-cultural workers, many of whom are in tribal locations, so any and all writing systems need to be supported, including some that use zero-width word breaks.
Does anyone have a more elegant solution, or could confirm/correct the code above, or offer some pointers?
Thanks for your help.

The equivalent of /(?:(?=\B).)*/ in a unicode context would be:
/
(?:
(?: (?<=[\p{L}\p{M}\p{N}\p{Pc}]) (?=[\p{L}\p{M}\p{N}\p{Pc}])
| (?<![\p{L}\p{M}\p{N}\p{Pc}]) (?![\p{L}\p{M}\p{N}\p{Pc}])
)
.
)*
/
...or somewhat simplified:
/(?:[\p{L}\p{M}\p{N}\p{Pc}]+|[^\p{L}\p{M}\p{N}\p{Pc}]+)?/
This would match either a word or a non-word (spacing, punctuation etc.) sequence, possibly an empty one.
A normal or negated word-boundary (\b or \B) is basically a double look-around. One looking behind, making sure of the type of character that precedes the current position. Similarly one looking ahead.
In the second regex, I removed the look-arounds and used simple character classes instead.

Related

regex to highlight sentences longer than n words

I am trying to write a regex expression that can be used to identify long sentences in a document. I my case a scientific manuscript. I aim to be doing that either in libre office or any text editor with regex search.
So far I got the following expression to work on most occasions:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*,*\:*\s+){24,}?(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*[\.|?|!|$])
btw, I got inspired from this post
It contains:
group1:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*,*\:*\s+)
a repetition element (stating how many words n - 1):
{24,}?
group2:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*[\.|?|!|$])
The basic functioning is:
group1 matches any number of word characters OR other characters that are present in the text followed by one or more spaces
group1 has to be repeated 24 times (or as many as you want the sentences to be long)
group2 matches any number of word characters OR other characters that are present in the text followed by a full stop, exclamation mark, question mark or paragraph break.
Any string that fulfills all the above would then be highlighted.
What I can't solve so far is to make it work when a dot appears in the text with another meaning than a full stop. Things like: i.e., e.g., et al., Fig., 1.89, etc....
Also I don't like that I had to manually adjust it to be able to handle sentences that contain non-word characters such as , [ ( % - # µ " ' and so on. I would have to extend the expression every time I come across some other uncommon character.
I'd be happy for any help or suggestions of other ways to solve this.
You can do a lot with the swiss-army-knife that is regular expressions, but the problem you've presented approaches regex's limits. Some of the things you want to detect can probably be handled with really small changes, while others are a bit harder. If your goal is to have some kind of tool that accurately measures sentence length for every possible mutation of characters, you'll probably need to move outside LibreOffice to a dedicated custom piece of software or a third-party tool.
But, that said, there are a few tricks you can worm into your existing regex to make it work better, if you want to avoid programming or another tool. Let's look at a few of the techniques that might be useful to you:
You can probably tweak your regex for a few special cases, like Fig. and Mr., by including them directly. Where you currently have [\w|\-|–|−|\/|≥|≤|’|“|”|μ]+, which is basically [\w]+ with a bunch of other "special" characters, you could use something like ([\w|...]+|Mr\.|Mrs\.|Miss\.|Fig\.) (substituting in all the special characters where I wrote ..., of course). Regexes are "greedy" algorithms, and will try to consume as much of the text as possible, so by including special "dot words" directly, you can make the regex "skip over" certain period characters that are problematic in your text. Make sure that when you want to add a "period to skip" that you always precede it with a backslash, like in i\.e\., so that it doesn't get treated as the special "any" character.
A similar trick can capture numbers better by assuming that digits followed by a period followed by more digits are supposed to "eat" the period: ([\w|...]+|\d+\.\d+|...) That doesn't handle everything, and if your document authors are writing stuff like 0. in the middle of sentences then you have a tough problem, but it can at least handle pi and e properly.
Also, right now, your regex consumes characters until it reaches any terminating punctuation character — a ., or !, or ?, or the end of the document. That's a problem for things like i.e., and 3.14, since as far as your regex is concerned, the sentence stops at the .. You could require your regex to only stop the sentence once ._ is reached — a period followed by a space. That wouldn't fix mismatches for words like Mr., but it would treat "words" like 3.14 as a word instead of as the end of a sentence, which is closer than you currently are. To do this, you'll have to include an odd sequence as part of the "word" regex, something like (\.[^ ]), which says "dot followed by not-a-space" is part of the word; and then you'll have to change the terminating sequence to (\. |!|?|$). Repeat the changes similarly for ! and ?.
Another useful trick is to take advantage of character-code ranges, instead of encoding each special character directly. Right now, you're doing it the hard way, by spelling out every accented character and digraph and diacritic in the universe. Instead, you could just say that everything that's a "special character" is considered to be part of the "word": Instead of [\w|\-|–|−|\/|≥|≤|’|“|”|μ]+, write [\w|\-|\/|\u0080-\uFFFF], which captures every character except emoji and a few from really obscure dead languages. LibreOffice seems to have Unicode support, so using \uXXXX patterns should work inside [ character ranges ].
This is probably enough to make your regex somewhat acceptable in LibreOffice, and might even be enough to answer your question. But if you're really intent on doing more complex document analysis like this, you may be better off exporting the document as plain text and then running a specialized tool on it.

RegEx: Searching for numbers (int, float) that are NOT part of a word

I'm hoping we have some regular expression guru's here that might be able to help me - a regex newbie - solve a problem.
I know some people will want to know some background info on this issue:
Regex Flavor: Basic Regex, being used in a Vertica Database using the REGEXP_REPLACE function.
The regex I am using is working great with one exception.
I have a rule that I'm trying to implement, related to stripping the numbers from text, where any number that is part of a word, e.g. table5, go2market, 33monroe, room222, etc. is ignored and NOT filtered.
Here is what I started with for detecting numbers:
[-+]?[0-9]*\.?[0-9]
That seems to work pretty well, including handling directly adjacent commas and parentheses for example.
But all cases where there is a number that is part of alphabetic text is also being detected, which fails the rule that it cannot be a part of a word, and by word, I mean any alphabetic text.
So, in searching for solutions, I happened upon this regex that seems to work well detecting those specific cases where numbers appear next to, or in, any string of characters:
((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)
My thought was that maybe I could add this as an INVERTED match to my original regex, to allow it to still select standalone numbers while ignoring those that were a part of a word, like so:
[-+]?[0-9]^((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)*\.?[0-9]^((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)
Unfortunately however, it breaks the original detection of standalone numbers.
:(
I'm hoping there is someone here that can spot what I'm doing wrong, and help me identify the right solution?
Thanks in advance!
According to Vertica documentation, the regex flavour seems to follow the Perl syntax. In this case you can use negative lookarounds and in particular a negative lookbehind: (?<!\w) (not preceded with a word character.)
Lookarounds are only tests and don't consume characters.
You can also use a negative lookahead to test the right part, (?!\w) (not followed by a word character), but it's more simple to use a word boundary since the pattern ends with a digit (that is also a word character):
(?<!\w)[-+]?\d*\.?\d+\b
In the worst case, if you have something like v1.0 in your string and you want to avoid it, you can try to use the bactracking control verbs (*SKIP) and (*FAIL). (*FAIL) forces the pattern to fail and (*SKIP) skips all the already matched positions before it. I hope vertica supports these Perl regex features.
Something like:
\p{L}+[-+]?\d*\.?\d+(*SKIP)(*FAIL)|[-+]?\d*\.?\d+(*SKIP)(?!\p{L})

Multiple spaces, multiple commas and multiple hypens in alphanumeric regex

I am very new to regex and regular expressions, and I am stuck in a situation where I want to apply a regex on an JSF input field.
Where
alphanumeric
multiple spaces
multiple dot(.)
multiple hyphen (‐)
are allowed, and Minimum limit is 1 and Maximum limit is 5.
And for multiple values - they must be separated by comma (,)
So a Single value can be:
3kd-R
or
k3
or
-4
And multiple values (must be comma separated):
kdk30,3.K-4,ER--U,2,.I3,
By the help of stackoverflow, so far I am able to achieve only this:
(^[a-zA-Z0-9 ]{5}(,[a-zA-Z0-9 ]{5})*$)
Something like
^[-.a-zA-Z0-9 ]{1,5}(,[-.a-zA-Z0-9 ]{1,5})*$
Changes made
[-.a-zA-Z0-9 ] Added - and . to the character class so that those are matched as well.
{1,5} Quantifier, ensures that it is matched minimum 1 and maximum 5 characters
Regex demo
You've done pretty good. You need to add hyphen and dot to that first character class. Note: With the hyphen, since it delegates ranges within a character class, you need to position it where contextually it cannot be specifying a range--not to say put it where it seems like it would be an invalid range, e.g., 7-., but positionally cannot be a range, i.e., first or last. So your first character class would look something like this:
[a-zA-Z 0-9.-]{1,5} or [-a-zA-Z0-9 .]{1,5}
So, we've just defined what one segment looks like. That pattern can reoccur zero or more times. Of course, there are many ways to do that, but I would favor a regex subroutine because this allows code reuse. Now if the specs change or you're testing and realize you have to tweak that segment pattern, you only need to change it in one place.
Subroutines are not supported in BRE or ERE, but most widely-used modern regex engines support them (Perl, PCRE, Ruby, Delphi, R, PHP). They are very simple to use and understand. Basically, you just need to be able to refer to it (sound familiar? refer-back? back-reference?), so this means we need to capture the regex we wish to repeat. Then it's as simple as referring back to it, but instead of \1 which refers to the captured value (data), we want to refer to it as (?1), the capturing expression. In doing so, we've logically defined a subroutine:
([a-zA-Z 0-9.-]{1,5})(,(?1))*
So, the first group basically defines our subroutine and the second group consists of a comma followed by the same segment-definition expression we used for the first group, and that is optional ('*' is the zero-or-more quantifier).
If you operate on large quantities of data where efficiency is a consideration, don't capture when you don't have to. If your sole purpose for using parenthesis is to alternate (e.g., \b[bB](asset|eagle)\b hound) or to quantify, as in our second group, use the (?: ... ) notation, which signifies to the regex engine that this is a non-capturing group. Without going into great detail, there is a lot of overhead in maintaining the match locations--not that it's complex, per se, just potentially highly repetitive. Regex engines will match, store the information, then when the match fails, they "give up" the match and try again starting with the next matching substring. Each time they match your capture group, they're storing that information again. Okay, I'm off the soapbox now. :-)
So, we're almost there. I say "almost" because I don't have all the information. But if this should be the sole occupant of the "subject" (line, field, etc.--the data sample you're evaluating), you should anchor it to "assert" that requirement. The caret '^' is beginning of subject, and the dollar '$' is end of subject, so by encapsulating our expression in ^ ... $ we are asserting that the subject matches in it's entirety, front-to-back. These assertions have zero-length; they consume no data, only assert a relative position. You can operate on them, e.g., s/^/ / would indent your entire document two spaces. You haven't really substituted the beginning of line with two spaces, but you're able to operate on that imaginary, zero-length location. (Do some research on zero-length assertions [aka zero-width assertions, or look-arounds] to uncover a powerful feature of modern regex. For example, in the previous regex if I wanted to make sure I did not insert two spaces on blank lines: s/^(?!$)/ /)
Also, you didn't say if you need to capture the results to do something with it. My impression was it's validation only, so that's not necessary. However, if it is needed, you can wrap the entire expression in capturing parenthesis: ^( ... )$.
I'm going to provide a final solution that does not assume you need to capture but does assume the entire subject should consist of this value:
^([a-zA-Z 0-9. -]{1,5})(?:,(?1))*$
I know I went on a bit, but you said you were new to regex, so wanted to provide some detail. I hope it wasn't too much detail.
By the way, an excellent resource with tutorials is regular-expressions dot info, and a wonderful regex development and testing tool is regex101 dot com. And I can never say enough about stack overflow!

lookahead in kate for patterns

I'm working on compiling a table of cases for a legal book. I've converted it to HTML so I can use the tags for search and replace operations, and I'm currently working in Kate. The text refers to the names of cases and the citations for the cases are in the footnotes, e.g.
<i>Smith v Jones</i>127 ......... [other stuff including newline characters].......</br>127 (1937) 173 ER 406;
I've been able to get lookahead working in Kate, using:
<i>.*</i>([0-9]{1,4}) .+<br/>\1 .*<br/>
...but I've run into greediness problems.
The text is a mess, so I really need to find matches step by step rather than relying on a batch process.
Is there a Linux (or Windows) text editor that supports both lookahead AND non-greedy operators, or am I going to have to try grep or sed?
I'm not familiar with Kate, but it seems to use QRegExp, which is incompatible with other Perl-like regex flavors in many important ways. For example, most flavors allow you make individual quantifiers non-greedy by appending a question mark (e.g. .* => .+?), but in QRegExp you can only make them all greedy or all non-greedy. What's worse, it looks like Kate doesn't even let you do that--via a Non-Greedy checkbox, for example.
But it's best not to rely on non-greedy quantifiers all time anyway. For one thing, they don't guarantee the shortest possible match, as many people say. You should get in the habit of being more specific about what should and should not be matched, when that's not too difficult. For example, if the section you want to match doesn't contain any tags other than the ones in your sample string, you can do this:
<i>[^<]*</i>(\d+)\b[^<]+<br/>\1\b[^<]*<br/>
The advantage of using [^<]* instead of .* is that it will never try to match anything after the next <. .* will always grab the rest of the document at first, only to backtrack almost all the way to the starting point. The non-greedy version, .*?, will initially match only to the next <, but if the match attempt fails later on it will go ahead and consume the < and beyond, eventually to consume the whole document.
If there can be other tags, you can use [^<]*(<(?!br/>)[^<]*)* instead. It will consume any characters that are not <, or < if it's not the beginning of a <br/> tag.
<i>[^<]*</i>(\d+)\b[^<]*(<(?!br/>)[^<]*)*<br/>\1\b[^<]*(<(?!br/>)[^<]*)*<br/>
By the way, what you're calling a lookahead (I'm assuming you mean \1) is really a backreference. The (?!br/>) in my regex is an example of lookaheads--in this case a negative lookahead. The Kate/QRegExp docs claim that lookaheads are supported but non-capturing groups-- e.g. (?:...)--aren't, which is why used all capturing groups in that last regex.
If you have the option of switching to a different editor, I strongly recommend that you do so. My favorite is EditPad Pro; it has the best regex support I've ever seen in an editor.

making a small regular expression a bit more readable

I've got a working regular expression, but I'd like to make it a tad more readable, and I'm far from a regex guru, so I was humbly hoping for some tips.
This is designed to scrape the output of several different compilers, linkers, and other build tools, and is used to build a nice little summery report. It does it's job great, but I'm left feeling like I wrote it in a clunky fashion, and I'd sooner learn than keep it the wrong way.
(.*?)\s?:?\s?(informational|warning|error|fatal error)?\s([A-Z]+[0-9][0-9][0-9][0-9]):\s(.*)$
Which, broken down simply, is as follows:
(.*?) # non-greedily match up until...
\s?:?\s? # we come across a possible " : "
(informational|warning|error|fatal error)? # possibly followed by one of these
\s([A-Z]+[0-9][0-9][0-9][0-9]):\s # but 100% followed by this alphanum
(.*)$ # and then capture the rest
I'm mostly interested in making the 2nd and 4th entry above more... beautiful. For some reason, the regex tester I was using (The Regulator) didn't match plain spaces, so I had to use the \s... but it is not meant to match any other whitespace.
Any schooling will be greatly appreciated.
The easiest way to make a long regex more readable is to use the "free-spacing" (or \x) modifier, which would let you write your regex just like you did in the second block of code -- it makes whitespace ignored. This isn't supported by all engines, however (according to the page linked above, .NET, Java, Perl, PCRE, Python, Ruby and XPath support it).
Note also that in free-spacing mode, you can use [ ] instead of \s if you want to only match a space character (unless you're using Java, in which case you have to use \ , which is an escaped space).
There's not really anything you can do for the second line, if you want each element to be optional independently of the other elements, but the fourth can be shortened:
\s([A-Z]+\d{4}):\s
\d is a shorthand class equivalent to [0-9], and {4} specifies that it should appear exactly four times.
The third line can be slightly shortened as well ((?:…) specifies a non-capturing group):
(informational|warning|(?:fatal )? error)?
From an efficiency standpoint, unless you actually need to capture subpatterns each time you use brackets, you can remove all of them, except for on the third line, where the group is needed for the alternation) -- but that one can be made non-capturing. Putting this all together you'd get:
.*?
\s?:?\s?
(?:informational|warning|(?:fatal )?error)?
\s[A-Z]+\d{4}:\s
.*$
Line 2
I think your regular expression doesn't match with the comment. You probably want this instead:
(\s:\s)?
To make it non-capturing:
(?:\s:\s)?
You should be able to use a literal space instead of \s. This must be a restriction in the tool you are using.
Line 4
[0-9][0-9][0-9][0-9] can be replaced with [0-9]{4}.
In some languages [0-9] is equivalent to \d.
Perhaps you can build the RE from sub-expressions, so that your end RE would look something like this:
/$preamble$possible_colon$keyword$alphanum$trailer/