Generic regex umlaut solution? - c++

Is there a generic (non-)word regex that covers all mutations of characters on this globe? I am developing an application that should handle all languages.
Technically I want to split sentences by words. Splitting them by nonword characters (\W) splits by 'ä' too. A static workaround is not an option since and explicitely covering all mutations on this world (éçḮñ and thousands more) is impossible.

I can't give you something that will work on all languages because I don't know enough languages to judge whether there will be edge cases.
My suggestion:
Split on whitespace (\s+).
Trim punctuation characters from start/end of each "word" you got in step 1 (replace ^\p{P}+|\p{P}+$ with nothing - the QRegularExpression docs say that it supports Unicode fully, so there's hope this will work)
Unless you care about preserving punctuation in examples like This is Charles' car, this should go a long way without removing punctuation within words like it's or Marne-sur-Seine.

Related

regex to highlight sentences longer than n words

I am trying to write a regex expression that can be used to identify long sentences in a document. I my case a scientific manuscript. I aim to be doing that either in libre office or any text editor with regex search.
So far I got the following expression to work on most occasions:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*,*\:*\s+){24,}?(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*[\.|?|!|$])
btw, I got inspired from this post
It contains:
group1:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*,*\:*\s+)
a repetition element (stating how many words n - 1):
{24,}?
group2:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*[\.|?|!|$])
The basic functioning is:
group1 matches any number of word characters OR other characters that are present in the text followed by one or more spaces
group1 has to be repeated 24 times (or as many as you want the sentences to be long)
group2 matches any number of word characters OR other characters that are present in the text followed by a full stop, exclamation mark, question mark or paragraph break.
Any string that fulfills all the above would then be highlighted.
What I can't solve so far is to make it work when a dot appears in the text with another meaning than a full stop. Things like: i.e., e.g., et al., Fig., 1.89, etc....
Also I don't like that I had to manually adjust it to be able to handle sentences that contain non-word characters such as , [ ( % - # µ " ' and so on. I would have to extend the expression every time I come across some other uncommon character.
I'd be happy for any help or suggestions of other ways to solve this.
You can do a lot with the swiss-army-knife that is regular expressions, but the problem you've presented approaches regex's limits. Some of the things you want to detect can probably be handled with really small changes, while others are a bit harder. If your goal is to have some kind of tool that accurately measures sentence length for every possible mutation of characters, you'll probably need to move outside LibreOffice to a dedicated custom piece of software or a third-party tool.
But, that said, there are a few tricks you can worm into your existing regex to make it work better, if you want to avoid programming or another tool. Let's look at a few of the techniques that might be useful to you:
You can probably tweak your regex for a few special cases, like Fig. and Mr., by including them directly. Where you currently have [\w|\-|–|−|\/|≥|≤|’|“|”|μ]+, which is basically [\w]+ with a bunch of other "special" characters, you could use something like ([\w|...]+|Mr\.|Mrs\.|Miss\.|Fig\.) (substituting in all the special characters where I wrote ..., of course). Regexes are "greedy" algorithms, and will try to consume as much of the text as possible, so by including special "dot words" directly, you can make the regex "skip over" certain period characters that are problematic in your text. Make sure that when you want to add a "period to skip" that you always precede it with a backslash, like in i\.e\., so that it doesn't get treated as the special "any" character.
A similar trick can capture numbers better by assuming that digits followed by a period followed by more digits are supposed to "eat" the period: ([\w|...]+|\d+\.\d+|...) That doesn't handle everything, and if your document authors are writing stuff like 0. in the middle of sentences then you have a tough problem, but it can at least handle pi and e properly.
Also, right now, your regex consumes characters until it reaches any terminating punctuation character — a ., or !, or ?, or the end of the document. That's a problem for things like i.e., and 3.14, since as far as your regex is concerned, the sentence stops at the .. You could require your regex to only stop the sentence once ._ is reached — a period followed by a space. That wouldn't fix mismatches for words like Mr., but it would treat "words" like 3.14 as a word instead of as the end of a sentence, which is closer than you currently are. To do this, you'll have to include an odd sequence as part of the "word" regex, something like (\.[^ ]), which says "dot followed by not-a-space" is part of the word; and then you'll have to change the terminating sequence to (\. |!|?|$). Repeat the changes similarly for ! and ?.
Another useful trick is to take advantage of character-code ranges, instead of encoding each special character directly. Right now, you're doing it the hard way, by spelling out every accented character and digraph and diacritic in the universe. Instead, you could just say that everything that's a "special character" is considered to be part of the "word": Instead of [\w|\-|–|−|\/|≥|≤|’|“|”|μ]+, write [\w|\-|\/|\u0080-\uFFFF], which captures every character except emoji and a few from really obscure dead languages. LibreOffice seems to have Unicode support, so using \uXXXX patterns should work inside [ character ranges ].
This is probably enough to make your regex somewhat acceptable in LibreOffice, and might even be enough to answer your question. But if you're really intent on doing more complex document analysis like this, you may be better off exporting the document as plain text and then running a specialized tool on it.

Enhanced regex for FirstName LastName

I've reviewed many regex recommendations but unfortunately they were either too restrictive (offending some persons by preventing enter their real name correctly) or too liberal, I mean allowing enter garbage like '---'.
Unfortunately I am not a regex guru, I can understand medium complex regex, and maybe can modify simple changes, but now I am stuck
Here is my starter (with all accented letters what I will not copy here because the string would be longer than my screen)
[a-zA-Z ,.'-]{3,30}
The space, and the ,.'- punctuation will do most cases, but in case anyone have suggestion what character to add please comment it.
Now my question: This regex allows having more than 1 punctuation in a row, what I would not like to allow. It is legal to have a space then one punctuation, but not legal to have two or more, Like "---"
I would also disallow starting with space ,.-, with other words I would like to enable only letters and ' as the first char. I can manage this (I hope):
[a-zA-Z']{1}[a-zA-Z ,.'-]{2,29}
...but the question how to prevent to enter a bunch of punctuation still remains
In .NET, you can use named Unicode classes to allow Unicode letters, and much more.
Here is a regex you can use:
^(?!.*\p{P}\p{P}+)[\p{L}']{1}[\p{L}\p{Zs},.'-]{2,29}$
(?!.*\p{P}\p{P}+) will disallow several punctuation symbols at a stretch.
Alternatively, you can disallow 2 or more consecutive non-letter symbols with
^(?!.*\P{L}\P{L}+)[\p{L}']{1}[\p{L}\p{Zs},.'-]{2,29}$
BUT this won't allow comma + space.
Here is a demo (it is advisable to test each string individually on regexstorm.net, or see alternative demo).

Is there a regex way to detect whether a character can be part of a word or not?

The "tricky" part of this question is that what I mean by alphabeth is not just the 26 characters. It should also include anything alphabeth like, including accented characters and hebrew's alibeth, etc.etc.
Why I need them?
I want to split texts into words.
Alphabeths like latin alphabeth, hebrew's alibeth, arab abjads, are separated by space.
Chinese characters are separated by nothing.
So I think I should separate texts by anything that's not alphabeth.
In other word, a, b, c, d, é is fine.
駅,南,口,第,自,転,車.,3,5,6 is not and all such separator should be it's own words. Or stuff like that.
In short I want to detect whether a character may be a word by itself, or can be part of a word.
What have I tried?
Well you can check the question here I asked a long time ago:
How can we separate utf-8 characters into words if some of the characters are chinese?
I implement the only answer there but then I found out that the chinese characters aren't split. Why not split based on nothing? Well, that means the alphabeths are splitted too.
If all those alphabeths "stick" together that I can separate them based on UTF, that would be fine too.
I will just use the answer at How can we separate utf-8 characters into words if some of the characters are chinese?
and "pull out" all non alphabeth characters.
Not a perfect solution, but good enough for me because western characters and chinese characters rarely show up on the same text anyway.
Maybe you shouldn't do this with regular expressions but with good old string index scanning instead.
The Hebrew, Chinese, Korean etc. alphabets are all in consecutive ranges of unicode code-points. So you could easily detect the alphabet by reading the unicode value of the character and then checking which unicode block it belongs to.
Jan Goyvaerts (of PowerGrep fame) once showed me this very useful syntax to do just this:
(?<![\p{M}\p{L}])word(?![\p{M}\p{L}])
This expression uses a regex lookbehind and a regex lookahead to ensure that the boundaries of the word are such that there is no letter or diacritic mark on either side.
Why is this regex better than simply using "\b"? The strength of this regex is the incorporation of \p{M} to include diacritics. When the normal word boundary marker (\b) is used, regex engines will find word breaks at the places of many diacritics, even though the diacritics are actually part of the word (this is the case, for instance, with Hebrew diacritics. For an example, take the Hebrew word גְּבוּלוֹת, and run a regex of "\b." on it - you'll see how it actually breaks the word into word different parts, at each diacritic point). The regex above fixes this by using a Unicode Character Class to ensure that diacritics are always considered part of the word and not breaks within the word.

Regex for username that allows numbers, letters and spaces

I'm looking for some regex code that I can use to check for a valid username.
I would like for the username to have letters (both upper case and lower case), numbers, spaces, underscores, dashes and dots, but the username must start and end with either a letter or number.
Ideally, it should also not allow for any of the special characters listed above to be repeated more than once in succession, i.e. they can have as many spaces/dots/dashes/underscores as they want, but there must be at least one number or letter between them.
I'm also interested to find out if you think this is a good system for a username? I've had a look for some regex that could do this, but none of them seem to allow spaces, and I would like for the usernames to have some spaces in them.
Thank you :)
So it looks like you want your username to have a "word" part (sequence of letters or numbers), interspersed with some "separator" part.
The regex will look something like this:
^[a-z0-9]+(?:[ _.-][a-z0-9]+)*$
Here's a schematic breakdown:
_____sep-word…____
/ \
^[a-z0-9]+(?:[ _.-][a-z0-9]+)*$ i.e. "word ( sep word )*"
|\_______/ \____/\_______/ |
| "word" "sep" "word" |
| |
from beginning of string... till the end of string
So essentially we want to match things like word, word-sep-word, word-sep-word-sep-word, etc.
There will be no consecutive sep without a word in between
The first and last char will always be part of a word (i.e. not a sep char)
Note that for [ _.-], - is last so that it's not a range definition metacharacter. The (?:…) is what is called a non-capturing group. We need the brackets for grouping for the repetition (i.e. (…)*), but since we don't need the capture, we can use (?:…)* instead.
To allow uppercase/various Unicode letters etc, just expand the character class/use more flags as necessary.
References
regular-expressions.info/Anchors, Character Class, Repetition, Grouping
Although I'm sure someone will shortly post a 1 million lines regex to do exactly what you want, I don't think in this case a regex is a good solution.
Why don't you write a good old fashioned parser? It will take about as long as writing the regex that does everything you mentioned, but it's going to be much easier to maintain and read.
In particular, this is the tricky part:
it should also not allow for any of
the special characters listed above to
be repeated more than once in
succession
Alternatively you can always do a hybrid of the two. A regex for the other checks ([a-zA-Z0-9][a-zA-Z0-9 _-\.]*[a-zA-Z0-9]) and a non-regex method for the no-repeat requirement.
You don't have to use a regex for everything. I find that requirements like the "no two consecutive characters" usually make the regexes so ugly that it's better to do that bit with a simple procedural loop.
I'd just use something like ^[A-Za-z0-9][A-Za-z0-9 \.\-_]*[A-Za-z0-9]$ (or the equivalents like ::alnum:: if your regex engine is more advanced) and then just check every character in a loop to make sure the next character isn't the same.
By doing it procedurally, you can check all the other rules you're likely to want at some point without resorting to what I call "regex gymnastics", things like:
not allowed to contain your first or last name.
no more than two consecutive digits.
and so forth.

Regex word-break with unicode diacritics

I am working on an application that searches text using regular expressions based on input from a user. One option the user has is to include a "Match 0 or more characters" wildcard using the asterisk. I need this to only match between word boundaries. My first attempt was to convert all asterisks to (?:(?=\B).)*, which works fine for most cases. Where it fails is that apparently .Net considers the position between a unicode character with a diacritic and another character a word-break. I consider this a bug, and have submitted it to the Microsoft feedback site.
In the meantime, however, I need to get the functionality implemented and product shipped. I am considering using [\p{L}\p{M}\p{N}\p{Pc}]* as the replacement text, but, frankly, am in "I don't really understand what this is going to do" land. I mean, I can read the specifications, but am not confident that I could sufficiently test this to make sure it is doing what I expect. I simply wouldn't know all the boundary conditions to test. The application is used by cross-cultural workers, many of whom are in tribal locations, so any and all writing systems need to be supported, including some that use zero-width word breaks.
Does anyone have a more elegant solution, or could confirm/correct the code above, or offer some pointers?
Thanks for your help.
The equivalent of /(?:(?=\B).)*/ in a unicode context would be:
/
(?:
(?: (?<=[\p{L}\p{M}\p{N}\p{Pc}]) (?=[\p{L}\p{M}\p{N}\p{Pc}])
| (?<![\p{L}\p{M}\p{N}\p{Pc}]) (?![\p{L}\p{M}\p{N}\p{Pc}])
)
.
)*
/
...or somewhat simplified:
/(?:[\p{L}\p{M}\p{N}\p{Pc}]+|[^\p{L}\p{M}\p{N}\p{Pc}]+)?/
This would match either a word or a non-word (spacing, punctuation etc.) sequence, possibly an empty one.
A normal or negated word-boundary (\b or \B) is basically a double look-around. One looking behind, making sure of the type of character that precedes the current position. Similarly one looking ahead.
In the second regex, I removed the look-arounds and used simple character classes instead.