regex to highlight sentences longer than n words - regex

I am trying to write a regex expression that can be used to identify long sentences in a document. I my case a scientific manuscript. I aim to be doing that either in libre office or any text editor with regex search.
So far I got the following expression to work on most occasions:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*,*\:*\s+){24,}?(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*[\.|?|!|$])
btw, I got inspired from this post
It contains:
group1:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*,*\:*\s+)
a repetition element (stating how many words n - 1):
{24,}?
group2:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*[\.|?|!|$])
The basic functioning is:
group1 matches any number of word characters OR other characters that are present in the text followed by one or more spaces
group1 has to be repeated 24 times (or as many as you want the sentences to be long)
group2 matches any number of word characters OR other characters that are present in the text followed by a full stop, exclamation mark, question mark or paragraph break.
Any string that fulfills all the above would then be highlighted.
What I can't solve so far is to make it work when a dot appears in the text with another meaning than a full stop. Things like: i.e., e.g., et al., Fig., 1.89, etc....
Also I don't like that I had to manually adjust it to be able to handle sentences that contain non-word characters such as , [ ( % - # µ " ' and so on. I would have to extend the expression every time I come across some other uncommon character.
I'd be happy for any help or suggestions of other ways to solve this.

You can do a lot with the swiss-army-knife that is regular expressions, but the problem you've presented approaches regex's limits. Some of the things you want to detect can probably be handled with really small changes, while others are a bit harder. If your goal is to have some kind of tool that accurately measures sentence length for every possible mutation of characters, you'll probably need to move outside LibreOffice to a dedicated custom piece of software or a third-party tool.
But, that said, there are a few tricks you can worm into your existing regex to make it work better, if you want to avoid programming or another tool. Let's look at a few of the techniques that might be useful to you:
You can probably tweak your regex for a few special cases, like Fig. and Mr., by including them directly. Where you currently have [\w|\-|–|−|\/|≥|≤|’|“|”|μ]+, which is basically [\w]+ with a bunch of other "special" characters, you could use something like ([\w|...]+|Mr\.|Mrs\.|Miss\.|Fig\.) (substituting in all the special characters where I wrote ..., of course). Regexes are "greedy" algorithms, and will try to consume as much of the text as possible, so by including special "dot words" directly, you can make the regex "skip over" certain period characters that are problematic in your text. Make sure that when you want to add a "period to skip" that you always precede it with a backslash, like in i\.e\., so that it doesn't get treated as the special "any" character.
A similar trick can capture numbers better by assuming that digits followed by a period followed by more digits are supposed to "eat" the period: ([\w|...]+|\d+\.\d+|...) That doesn't handle everything, and if your document authors are writing stuff like 0. in the middle of sentences then you have a tough problem, but it can at least handle pi and e properly.
Also, right now, your regex consumes characters until it reaches any terminating punctuation character — a ., or !, or ?, or the end of the document. That's a problem for things like i.e., and 3.14, since as far as your regex is concerned, the sentence stops at the .. You could require your regex to only stop the sentence once ._ is reached — a period followed by a space. That wouldn't fix mismatches for words like Mr., but it would treat "words" like 3.14 as a word instead of as the end of a sentence, which is closer than you currently are. To do this, you'll have to include an odd sequence as part of the "word" regex, something like (\.[^ ]), which says "dot followed by not-a-space" is part of the word; and then you'll have to change the terminating sequence to (\. |!|?|$). Repeat the changes similarly for ! and ?.
Another useful trick is to take advantage of character-code ranges, instead of encoding each special character directly. Right now, you're doing it the hard way, by spelling out every accented character and digraph and diacritic in the universe. Instead, you could just say that everything that's a "special character" is considered to be part of the "word": Instead of [\w|\-|–|−|\/|≥|≤|’|“|”|μ]+, write [\w|\-|\/|\u0080-\uFFFF], which captures every character except emoji and a few from really obscure dead languages. LibreOffice seems to have Unicode support, so using \uXXXX patterns should work inside [ character ranges ].
This is probably enough to make your regex somewhat acceptable in LibreOffice, and might even be enough to answer your question. But if you're really intent on doing more complex document analysis like this, you may be better off exporting the document as plain text and then running a specialized tool on it.

Related

Regex taking too many characters

I need some help with building up my regex.
What I am trying to do is match a specific part of text with unpredictable parts in between the fixed words. An example is the sentence one gets when replying to an email:
On date at time person name has written:
The cursive parts are variable, might contains spaces or a new line might start from this point.
To get this, I built up my regex as such: On[\s\S]+?at[\s\S]+?person[\s\S]+?has written:
Basically, the [\s\S]+? is supposed to fill in any letter, number, space or break/new line as I am unable to predict what could be between the fixed words tha I am sure will always be there.
Now comes the hard part, when I would add the word "On" somewhere in the text above the sentence that I want to match, the regex now matches a much bigger text than I want. This is due to the use of [\s\S]+.
How am I able to make my regex match as less characters as possible? Using "?" before the "+" to make it lazy does not help.
Example is here with words "From - This - Point - Everything:". Cases are ignored.
Correct: https://regexr.com/3jdek.
Wrong because of added "From": https://regexr.com/3jdfc
The regex is to be used in VB.NET
A more real life, with html tags, can be found here. Here, I avoided using [\s\S]+? or (.+)?(\r)?(\n)?(.+?)
Correct: https://regexr.com/3jdd1
Wrong: https://regexr.com/3jdfu after adding certain parts of the regex in the text above. Although, in html, barely possible to occur as the user would never write the matching tag himself, I do want to make sure my regex is correctjust in case
These things are certain: I know with what the part of text starts, no matter where in respect to the entire text, I know with what the part of text ends, and there are specific fixed words that might make the regex more reliable, but they can be ommitted. Any text below the searched part is also allowed to be matched, but no text above may be matched at all
Another example where it goes wrong: https://regexr.com/3jdli. Basically, I have less to go with in this text, so the regex has less tokens to work with. Adding just the first < already makes the regex take too much.
From my own experience, most problems are avoided when making sure I do not use any [\s\S]+? before I did a (\r)?(\n)? first
[\s\S] matches all character because of union of two complementary sets, it is like . with special option /s (dot matches newlines). and regex are greedy by default so the largest match will be returned.
Following correct link, the token just after the shortest match must be geschreven, so another way to write without using lazy expansion, which is more flexible is to prepend the repeated chracter set by a negative lookahead inside loop,
so
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft(.+?(?=geschreven))geschreven:
becomes
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft((?:(?!geschreven).)+)geschreven:
(?: ) is for non capturing the group which just encapsulates the negative lookahead and the . (which can be replaced by [\s\S])
(?! ) inside is the negative lookahead which ensures current position before next character is not the beginning of end token.
Following comments it can be explicitly mentioned what should not appear in repeating sequence :
From(?:(?!this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!this|point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
to understand what the technic (?:(?!tokens)[\s\S])+ does.
in the first this can't appear between From and this
in the second From or this can't appear between From and this
in the third this or point can't appear between this and point
etc.

Multiple spaces, multiple commas and multiple hypens in alphanumeric regex

I am very new to regex and regular expressions, and I am stuck in a situation where I want to apply a regex on an JSF input field.
Where
alphanumeric
multiple spaces
multiple dot(.)
multiple hyphen (‐)
are allowed, and Minimum limit is 1 and Maximum limit is 5.
And for multiple values - they must be separated by comma (,)
So a Single value can be:
3kd-R
or
k3
or
-4
And multiple values (must be comma separated):
kdk30,3.K-4,ER--U,2,.I3,
By the help of stackoverflow, so far I am able to achieve only this:
(^[a-zA-Z0-9 ]{5}(,[a-zA-Z0-9 ]{5})*$)
Something like
^[-.a-zA-Z0-9 ]{1,5}(,[-.a-zA-Z0-9 ]{1,5})*$
Changes made
[-.a-zA-Z0-9 ] Added - and . to the character class so that those are matched as well.
{1,5} Quantifier, ensures that it is matched minimum 1 and maximum 5 characters
Regex demo
You've done pretty good. You need to add hyphen and dot to that first character class. Note: With the hyphen, since it delegates ranges within a character class, you need to position it where contextually it cannot be specifying a range--not to say put it where it seems like it would be an invalid range, e.g., 7-., but positionally cannot be a range, i.e., first or last. So your first character class would look something like this:
[a-zA-Z 0-9.-]{1,5} or [-a-zA-Z0-9 .]{1,5}
So, we've just defined what one segment looks like. That pattern can reoccur zero or more times. Of course, there are many ways to do that, but I would favor a regex subroutine because this allows code reuse. Now if the specs change or you're testing and realize you have to tweak that segment pattern, you only need to change it in one place.
Subroutines are not supported in BRE or ERE, but most widely-used modern regex engines support them (Perl, PCRE, Ruby, Delphi, R, PHP). They are very simple to use and understand. Basically, you just need to be able to refer to it (sound familiar? refer-back? back-reference?), so this means we need to capture the regex we wish to repeat. Then it's as simple as referring back to it, but instead of \1 which refers to the captured value (data), we want to refer to it as (?1), the capturing expression. In doing so, we've logically defined a subroutine:
([a-zA-Z 0-9.-]{1,5})(,(?1))*
So, the first group basically defines our subroutine and the second group consists of a comma followed by the same segment-definition expression we used for the first group, and that is optional ('*' is the zero-or-more quantifier).
If you operate on large quantities of data where efficiency is a consideration, don't capture when you don't have to. If your sole purpose for using parenthesis is to alternate (e.g., \b[bB](asset|eagle)\b hound) or to quantify, as in our second group, use the (?: ... ) notation, which signifies to the regex engine that this is a non-capturing group. Without going into great detail, there is a lot of overhead in maintaining the match locations--not that it's complex, per se, just potentially highly repetitive. Regex engines will match, store the information, then when the match fails, they "give up" the match and try again starting with the next matching substring. Each time they match your capture group, they're storing that information again. Okay, I'm off the soapbox now. :-)
So, we're almost there. I say "almost" because I don't have all the information. But if this should be the sole occupant of the "subject" (line, field, etc.--the data sample you're evaluating), you should anchor it to "assert" that requirement. The caret '^' is beginning of subject, and the dollar '$' is end of subject, so by encapsulating our expression in ^ ... $ we are asserting that the subject matches in it's entirety, front-to-back. These assertions have zero-length; they consume no data, only assert a relative position. You can operate on them, e.g., s/^/ / would indent your entire document two spaces. You haven't really substituted the beginning of line with two spaces, but you're able to operate on that imaginary, zero-length location. (Do some research on zero-length assertions [aka zero-width assertions, or look-arounds] to uncover a powerful feature of modern regex. For example, in the previous regex if I wanted to make sure I did not insert two spaces on blank lines: s/^(?!$)/ /)
Also, you didn't say if you need to capture the results to do something with it. My impression was it's validation only, so that's not necessary. However, if it is needed, you can wrap the entire expression in capturing parenthesis: ^( ... )$.
I'm going to provide a final solution that does not assume you need to capture but does assume the entire subject should consist of this value:
^([a-zA-Z 0-9. -]{1,5})(?:,(?1))*$
I know I went on a bit, but you said you were new to regex, so wanted to provide some detail. I hope it wasn't too much detail.
By the way, an excellent resource with tutorials is regular-expressions dot info, and a wonderful regex development and testing tool is regex101 dot com. And I can never say enough about stack overflow!

emacs syntax highlight numbers not part of words (with regex?)

I've moved to emacs recently and I am used to/like numbers being highlighted. A quick hack I took from here puts the following in my .emacs:
(add-hook 'after-change-major-mode-hook
'(lambda () (font-lock-add-keywords
nil
'(("\\([0-9]+\\)"
1 font-lock-warning-face prepend)))))
Which gives a good start, i.e. any digit is highlighted. However, I am a complete beginner with regex and would ideally like the following behaviour:
Also highlight the decimal point if it's part of a float, e.g. 12.34
Do not highlight any part of the number if it is next/part of a word. e.g. in these cases: foo11 ba11r 11spam, none of the '1's should be highlighted
Allow 'e' within two number integers to allow scientific notation (not required, bonus credit)
Unfortunately this looks very much like a 'do this for me' question which I am loathe to post, but I have failed thus far to make any decent progress myself.
About as far as I have got is discovering [^a-zA-Z][0-9]+[^a-zA-Z] to match anything but a letter either side (e.g. an equals sign), but all this does is include the adjacent symbol in the highlighting. I am not sure how to tell it 'only highlight the numbers if there isn't a letter on either side'.
Of course, I can't imagine regex is the way to go with complicated syntax highlighting, so any good number highlighting in emacs ideas are also welcome,
Any help very much appreciated. (In case it makes any difference, this is for use when Python coding.)
Start by going to your scratch buffers and typing in a some test text. put some numbers in there, some identifiers that contain numbers, some numbers with missing parts (like .e12), etc. These will be our testcases and will let us experiment rapidly. Now run M-x re-builder to enter the regex builder mode, which will let you try out any regex against the text of the current buffer to see what it matches. This is a very handy mode; you'll be able to use it all the time. Just note that because Emacs lisp requires you to put regexes into strings, you must double up on all of your backslashes. You're already doing that correctly, but I'm not going to double them up in here.
So, limiting the match to numbers that are not part of identifiers is pretty easy. \b will match word boundaries, so putting one at either end of your regex will make it match a whole word
You can match floats just by adding a period to the character class you started with, so that it becomes [0-9.]. Unfortunately, that can match a period all on it's own; what we really want is [0-9]*\.?[0-9]+, which will match 0 or more digits followed by an optional period followed by one or more digits.
A leading sign can be matched with [-+]?, so that gets us negative numbers.
To match exponents we need an optional group: \(...\)?, and since we are only using this for highlighting, and don't actually need to separate out the content of the group, we can do \(?:...\), which will save the regex matcher a little time. Inside the group we will need to match an "e" ([eE]), an optional sign ([-+]?), and one or more digits ([0-9]+).
Putting it all together: [-+]?\b[0-9]*\.?[0-9]+\(?:[eE][-+]?[0-9]+\)?\b. Note that I've put the optional sign before the first word boundary, because the "+" and "-" characters create a word boundary.
First of all, lose the add-hook and the lambda. The font-lock-add-keywords call doesn't need either. If you want this only for python-mode, pass the mode symbol as the first argument instead of nil.
Second, there are two main ways to do that.
Add a grouping construct around the digits. The numbers in the font-lock-keywords forms correspond to the groups, so this would be '(("\\([^a-zA-Z]\\([0-9]+\\)[^a-zA-Z]\\)" 2 font-lock-warning-face prepend). The outer grouping is rather useless here, though, so this can be simplified to '(("[^a-zA-Z]\\([0-9]+\\)[^a-zA-Z]" 1 font-lock-warning-face prepend).
Just use the beginning and end of symbol backslash constructs. Then the regexp looks like this: \_<[0-9]+\_>. We can highlight the whole match here, so there's no need for the group number: '(("\\_<[0-9]+\\_>" . font-lock-warning-face prepend). As a variation, you could use the beginning-of-word and end-of-word constructs, but you probably don't want to highlight numbers adjacent to underscores or whatever other characters, if any, python-mode has in the syntax class symbol.
And lastly, there's probably no need for prepend. The numbers are likely all unhighlighted before this, and if you consider possible interaction with other minor modes like whitespace, you'd better choose append, or just omit this element entirely.
End result:
(font-lock-add-keywords nil '(("\\_<[0-9]+\\_>" . font-lock-warning-face)))

Select capitalized & all-caps words using RegEx

I'm trying to find names of people and companies (everything that is capitalized but not in the beginning of a sentence) in a large body of text. The purpose is to find as many instances as possible so that they can be XML-tagged properly.
This is what I've come up with so far:
[^\W](\s\b[\p{Lu}][\p{Lu}|\p{Ll}]+\b)+
It has two problems:
It selects two characters too many in front of the hit.
In the sentence "Is this Beetle ugly?" it finds s Beetle which complicates the subsequent tagging.
When a capitalized word is preceded with an apostrophe or a colon, it isn't found. If possible I'd like to limit what characters are used for determining a sentence to just !?.
Here's the sample text I'm using to test it out:
John Adams is my hero. There's just no limits to his imagination! Is
this Beetle ugly? It sings at the: La Scala opera house. I have a
dream that I will find work at' Frame Store but not in the USA! This
way ILM could do whatever they pleased. ILM was very sweet. Visual
Effects did a good job... Neither did Animatronix?
I'm using jEdit http.//jedit.org since I need something that works on both Windows and OS X.
Update, this avoids now the matching at the start of the string.
(?<!(?:[!?\.]\s|^))(\b[\p{Lu}][\p{Lu}\p{Ll}]+\b)+
(?<!(?:[!?\.]\s|^)) is a negative lookbehind that ensures it is not preceded by one of the !?. and a space OR by the start of a new row.
I tested it with jEdit.
Update to cover Names consisting of multiple words
(?<!(?:[!?\.]\s|^))(\b[\p{Lu}][\p{Lu}\p{Ll}]*\b(?:\s\b[\p{Lu}][\p{Lu}\p{Ll}]*\b)*)+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (added)
^ (changed)
I added the group (?:\s\b[\p{Lu}][\p{Lu}\p{Ll}]*\b)* to match optional following words starting with uppercase letters. And I changed the + to a * to match the A in your example My company's called A Few Good Men. But this change causes now the regex to match I as a name.
See tchrists comment. Names are not a simple thing and it gets really difficult if you want to cover the more complex cases.
This is also working
(?<!\p{P}\s)(\b[\p{Lu}][\p{Lu}|\p{Ll}]+\b)+
But \p{P} covers all punctuation, I understood this is not what you want. But maybe you can find here on regular-expressions.info/unicode.html a property that fits your needs.
Another mistake in your expression is the | in the character class. Its not needed, you are just adding this character to your class and with it it will match words like U|S|A, so just remove it:
(?<![!?\.]\s)(\b[\p{Lu}][\p{Lu}\p{Ll}]+\b)+

Regex for username that allows numbers, letters and spaces

I'm looking for some regex code that I can use to check for a valid username.
I would like for the username to have letters (both upper case and lower case), numbers, spaces, underscores, dashes and dots, but the username must start and end with either a letter or number.
Ideally, it should also not allow for any of the special characters listed above to be repeated more than once in succession, i.e. they can have as many spaces/dots/dashes/underscores as they want, but there must be at least one number or letter between them.
I'm also interested to find out if you think this is a good system for a username? I've had a look for some regex that could do this, but none of them seem to allow spaces, and I would like for the usernames to have some spaces in them.
Thank you :)
So it looks like you want your username to have a "word" part (sequence of letters or numbers), interspersed with some "separator" part.
The regex will look something like this:
^[a-z0-9]+(?:[ _.-][a-z0-9]+)*$
Here's a schematic breakdown:
_____sep-word…____
/ \
^[a-z0-9]+(?:[ _.-][a-z0-9]+)*$ i.e. "word ( sep word )*"
|\_______/ \____/\_______/ |
| "word" "sep" "word" |
| |
from beginning of string... till the end of string
So essentially we want to match things like word, word-sep-word, word-sep-word-sep-word, etc.
There will be no consecutive sep without a word in between
The first and last char will always be part of a word (i.e. not a sep char)
Note that for [ _.-], - is last so that it's not a range definition metacharacter. The (?:…) is what is called a non-capturing group. We need the brackets for grouping for the repetition (i.e. (…)*), but since we don't need the capture, we can use (?:…)* instead.
To allow uppercase/various Unicode letters etc, just expand the character class/use more flags as necessary.
References
regular-expressions.info/Anchors, Character Class, Repetition, Grouping
Although I'm sure someone will shortly post a 1 million lines regex to do exactly what you want, I don't think in this case a regex is a good solution.
Why don't you write a good old fashioned parser? It will take about as long as writing the regex that does everything you mentioned, but it's going to be much easier to maintain and read.
In particular, this is the tricky part:
it should also not allow for any of
the special characters listed above to
be repeated more than once in
succession
Alternatively you can always do a hybrid of the two. A regex for the other checks ([a-zA-Z0-9][a-zA-Z0-9 _-\.]*[a-zA-Z0-9]) and a non-regex method for the no-repeat requirement.
You don't have to use a regex for everything. I find that requirements like the "no two consecutive characters" usually make the regexes so ugly that it's better to do that bit with a simple procedural loop.
I'd just use something like ^[A-Za-z0-9][A-Za-z0-9 \.\-_]*[A-Za-z0-9]$ (or the equivalents like ::alnum:: if your regex engine is more advanced) and then just check every character in a loop to make sure the next character isn't the same.
By doing it procedurally, you can check all the other rules you're likely to want at some point without resorting to what I call "regex gymnastics", things like:
not allowed to contain your first or last name.
no more than two consecutive digits.
and so forth.