Ruby: How can I capture all characters, ignoring whitespace? - regex

I want to capture all word characters, ignoring whitespace in a given string.
str = "Hello there how are you?"
I want the result to be:
"Hellotherehowareyou"
I have tried:
str[/(\w*)*/]
# => "Hello"
…but it returns the first word only. How do I capture all the word characters?

What's Wrong
str[/(\w*)*/] returns only the first substring that matches the pattern, rather than scanning the whole string for matches or removing undesirable characters. You'd be better off using one of the other String methods like #gsub, #tr, #delete, #scan, or #match, depending on what your real intent is.
Use Character Properties or Classes
If you're looking for a robust solution, Ruby character properties or POSIX character classes are probably the way to go. To get the results you provided in your original post, you could use the Unicode-aware \p{Alpha} property. For example:
str.scan(/\p{Alpha}/).join
#=> "Hellotherehowareyou"
Alternatively, if you just want to delete spaces and the question mark, and you don't care about other types of characters, then String#delete may suffice for your specific corpus.
str.delete ' ?'
#=> "Hellotherehowareyou"
If you need a more complex way to select or reject elements from a stream of characters, you could even do something like:
str.chars.select { _1 =~ /\p{Alpha}/ }.join
#=> "Hellotherehowareyou"
There are certainly other approaches, too. The KISS and YAGNI principles probably apply. Meanwhile, choose a solution based on readability and the semantic intent of your code, since most solutions will yield very similar results for your specific example.
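For comparison, here is a minimal sketch of a few of the other String methods mentioned above, applied to the same string. The helper names are made up for illustration. Note that \w (documented in Ruby as [a-zA-Z0-9_]) also keeps digits and underscores, so these are not exact equivalents of \p{Alpha}.

```ruby
# Hypothetical helpers; each returns the word characters of +str+ in order.

# Remove every non-word character (\W is the complement of \w).
def strip_with_gsub(str)
  str.gsub(/\W/, "")
end

# Collect every run of word characters and glue the runs together.
def strip_with_scan(str)
  str.scan(/\w+/).join
end

# String#delete accepts a negated character set, written with a leading ^.
def strip_with_delete(str)
  str.delete("^A-Za-z0-9_")
end

str = "Hello there how are you?"
puts str[/\w+/]               # String#[] stops at the first match
puts strip_with_gsub(str)
puts strip_with_scan(str)
puts strip_with_delete(str)
```

For the example string all three helpers give the same result, so the choice mostly comes down to which one states your intent most clearly.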

Related

How to do a negative lookbehind within a %r<…>-delimited regexp in Ruby?

I like the %r<…> delimiters because it makes it really easy to spot the beginning and end of the regex, and I don't have to escape any /. But it seems that they have an insurmountable limitation that other delimiters don't have?
Every other delimiter imaginable works fine:
/(?<!foo)/
%r{(?<!foo)}
%r[(?<!foo)]
%r|(?<!foo)|
%r/(?<!foo)/
But when I try to do this:
%r<(?<!foo)>
it gives this syntax error:
unterminated regexp meets end of file
Okay, it probably doesn't like that it's not a balanced pair, but how do you escape it such that it does like it?
Does something need to be escaped?
According to wikibooks.org:
Any single non-alpha-numeric character can be used as the delimiter,
%[including these], %?or these?, %~or even these things~.
By using this notation, the usual string delimiters " and ' can appear
in the string unescaped, but of course the new delimiter you've chosen
does need to be escaped.
Indeed, escaping is needed in these examples:
%r!(?<\!foo)!
%r?(\?<!foo)?
But if that were the only problem, then I should be able to escape it like this and have it work:
%r<(?\<!foo)>
But that yields this error:
undefined group option: /(?\<!foo)/
So maybe escaping is not needed/allowed? wikibooks.org does list %<pointy brackets> as one of the exceptions:
However, if you use
%(parentheses), %[square brackets], %{curly brackets} or
%<pointy brackets> as delimiters then those same delimiters
can appear unescaped in the string as long as they are in balanced
pairs
Is it a problem with balanced pairs?
Balanced pairs are no problem as long as you are doing something in the Regexp that requires them, like...
%r{(?<!foo{1})} # repetition quantifier
%r[(?<![foo])] # character class
%r<(?<name>foo)> # named capture group
But what if you need to insert a left-side delimiter ({, [, or <) inside the regex? Just escape it, right? Ruby seems to have no problem with escaped unbalanced delimiters most of the time...
%r{(?<!foo\{)}
%r[(?<!\[foo)]
%r<\<foo>
It's just when you try to do it in the middle of the "group options" (which I guess is what the <! characters are classified as here) following a (? that it doesn't like it:
%r<(?\<!foo)>
# undefined group option: /(?\<!foo)/
So how do you do that then and make Ruby happy? (without changing the delimiters)
Conclusion
The workaround is easy. I'll just change this particular regex to just use something else instead like %r{…} instead.
But the questions remain...
Is there really no way to escape the < here?
Are there really some regular expressions that are simply impossible to write using certain delimiters like %r<…>?
Is %r<…> the only regular expression delimiter pair that has this problem (where some regular expressions are impossible to write when using it)? If you know of a similar example with %r{…}/%r[…], do share!
Version info
Not that it probably matters since this syntax probably hasn't changed, but I'm using:
⟫ ruby -v
ruby 2.6.0p0 (2018-12-25 revision 66547) [x86_64-linux]
Reference:
https://ruby-doc.org/core-2.6.3/Regexp.html
% Notation
As others have mentioned, seems like an oversight based on how this character differs from other paired boundaries.
As far as "Is there really no way to escape the < here?" there is a way... but you're not going to like it:
%r<(?#{'<'}!foo)> == %r((?<!foo))
Using interpolation to insert the < character seems to work. But given that there are much better options, I would avoid it unless you were planning on splitting the regex into sections anyway...
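As a rough sketch of the options, the snippet below builds the same lookbehind three ways: with the interpolation trick, with %r{…}, and with Regexp.new on a plain string (which sidesteps literal delimiters entirely). The constant names and the trailing "bar" are only there to make the pattern testable; they are not from the question.

```ruby
# Three spellings of the same negative lookbehind: "bar" not preceded by "foo".
INTERPOLATED = %r<(?#{'<'}!foo)bar>          # the < is smuggled in via #{...}
BRACES       = %r{(?<!foo)bar}               # different delimiters, no trick
FROM_STRING  = Regexp.new('(?<!foo)bar')     # no literal delimiters at all

[INTERPOLATED, BRACES, FROM_STRING].each do |re|
  # Print the pattern source and whether each sample string matches.
  puts "#{re.source}  bazbar=#{re.match?('bazbar')}  foobar=#{re.match?('foobar')}"
end
```

All three should report the same source and the same match results, which suggests the limitation is in how the literal is lexed rather than in the regex engine.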

regex to highlight sentences longer than n words

I am trying to write a regular expression that can be used to identify long sentences in a document, in my case a scientific manuscript. I aim to do that either in LibreOffice or in any text editor with regex search.
So far I got the following expression to work on most occasions:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*,*\:*\s+){24,}?(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*[\.|?|!|$])
btw, I got inspired from this post
It contains:
group1:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*,*\:*\s+)
a repetition element (stating how many words n - 1):
{24,}?
group2:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*[\.|?|!|$])
The basic functioning is:
group1 matches any number of word characters OR other characters that are present in the text followed by one or more spaces
group1 has to be repeated 24 times (or as many as you want the sentences to be long)
group2 matches any number of word characters OR other characters that are present in the text followed by a full stop, exclamation mark, question mark or paragraph break.
Any string that fulfills all the above would then be highlighted.
What I can't solve so far is to make it work when a dot appears in the text with another meaning than a full stop. Things like: i.e., e.g., et al., Fig., 1.89, etc....
Also I don't like that I had to manually adjust it to be able to handle sentences that contain non-word characters such as , [ ( % - # µ " ' and so on. I would have to extend the expression every time I come across some other uncommon character.
I'd be happy for any help or suggestions of other ways to solve this.
You can do a lot with the swiss-army-knife that is regular expressions, but the problem you've presented approaches regex's limits. Some of the things you want to detect can probably be handled with really small changes, while others are a bit harder. If your goal is to have some kind of tool that accurately measures sentence length for every possible mutation of characters, you'll probably need to move outside LibreOffice to a dedicated custom piece of software or a third-party tool.
But, that said, there are a few tricks you can worm into your existing regex to make it work better, if you want to avoid programming or another tool. Let's look at a few of the techniques that might be useful to you:
You can probably tweak your regex for a few special cases, like Fig. and Mr., by including them directly. Where you currently have [\w|\-|–|−|\/|≥|≤|’|“|”|μ]+, which is basically [\w]+ with a bunch of other "special" characters, you could use something like ([\w|...]+|Mr\.|Mrs\.|Miss\.|Fig\.) (substituting in all the special characters where I wrote ..., of course). Regexes are "greedy" algorithms, and will try to consume as much of the text as possible, so by including special "dot words" directly, you can make the regex "skip over" certain period characters that are problematic in your text. Make sure that when you want to add a "period to skip" that you always precede it with a backslash, like in i\.e\., so that it doesn't get treated as the special "any" character.
A similar trick can capture numbers better by assuming that digits followed by a period followed by more digits are supposed to "eat" the period: ([\w|...]+|\d+\.\d+|...) That doesn't handle everything, and if your document authors are writing stuff like 0. in the middle of sentences then you have a tough problem, but it can at least handle pi and e properly.
Also, right now, your regex consumes characters until it reaches any terminating punctuation character: a ., a !, a ?, or the end of the document. That's a problem for things like i.e. and 3.14, since as far as your regex is concerned, the sentence stops at the first period. You could require your regex to stop the sentence only once a period followed by a space is reached. That wouldn't fix mismatches for words like Mr., but it would treat "words" like 3.14 as a word instead of as the end of a sentence, which is closer than you currently are. To do this, you'll have to include an odd sequence as part of the "word" regex, something like (\.[^ ]), which says "dot followed by not-a-space" is part of the word; and then you'll have to change the terminating sequence to (\. |!|\?|$), with the ? escaped so that it isn't read as a quantifier. Repeat the changes similarly for ! and ?.
Another useful trick is to take advantage of character-code ranges, instead of encoding each special character directly. Right now, you're doing it the hard way, by spelling out every accented character and digraph and diacritic in the universe. Instead, you could just say that everything that's a "special character" is considered to be part of the "word": Instead of [\w|\-|–|−|\/|≥|≤|’|“|”|μ]+, write [\w|\-|\/|\u0080-\uFFFF]+, which captures every character except emoji and a few from really obscure dead languages. LibreOffice seems to have Unicode support, so using \uXXXX patterns should work inside [ character ranges ].
This is probably enough to make your regex somewhat acceptable in LibreOffice, and might even be enough to answer your question. But if you're really intent on doing more complex document analysis like this, you may be better off exporting the document as plain text and then running a specialized tool on it.
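If you do go the "export as plain text and run a small tool" route, the idea can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: the abbreviation list, the method name long_sentences, and the default threshold of 25 words are all made up here, and a token such as "etc." that genuinely ends a sentence would be missed.

```ruby
# Hypothetical list of tokens whose trailing period does not end a sentence.
ABBREVIATIONS = %w[i.e. e.g. al. Fig. etc. Mr. Mrs.].freeze

# Returns the sentences of +text+ that contain at least +min_words+ words.
def long_sentences(text, min_words: 25)
  sentences = []
  current = []
  text.split.each do |token|
    current << token
    # A sentence ends on a token whose last character is . ! or ?,
    # unless that token is a known abbreviation. Tokens such as 3.14 or
    # "i.e.," do not end in one of those characters, so they pass through.
    next unless token.match?(/[.!?]\z/) && !ABBREVIATIONS.include?(token)
    sentences << current.join(" ")
    current = []
  end
  sentences << current.join(" ") unless current.empty?
  sentences.select { |sentence| sentence.split.size >= min_words }
end

sample = "See Fig. 2 for pi, i.e., 3.14 exactly. Short one."
puts long_sentences(sample, min_words: 5)
```

Splitting on whitespace first means the word definition no longer has to enumerate special characters; anything between two spaces counts as one word.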

Simple regex for matching up to an optional character?

I'm sure this is a simple question for someone at ease with regular expressions:
I need to match everything up until the character #
I don't want the string following the # character, just the stuff before it, and the character itself should not be matched. This is the most important part, and what I'm mainly asking. As a second question, I would also like to know how to match the rest, after the # character. But not in the same expression, because I will need that in another context.
Here's an example string:
topics/install.xml#id_install
I want only topics/install.xml. And for the second question (separate expression) I want id_install
First expression:
^([^#]*)
Second expression:
#(.*)$
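A quick way to try both expressions, sketched here in Ruby rather than the asker's language; the helper names are hypothetical. The second argument to String#[] selects capture group 1.

```ruby
# Everything before the first '#'; the whole string if there is no '#'.
def before_hash(s)
  s[/^([^#]*)/, 1]
end

# Everything after the first '#'; nil if there is no '#'.
def after_hash(s)
  s[/#(.*)$/, 1]
end

path = "topics/install.xml#id_install"
puts before_hash(path)
puts after_hash(path)
```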
[a-zA-Z0-9]*[\#]
If your string contains any other special characters you need to add them into the first square bracket escaped.
I don't use C#, but I will assume that it uses PCRE... if so,
"([^#]*)#.*"
with a call to 'match'. A call to 'search' does not need the trailing ".*"
The parens define the 'keep group'; the [^#] means any character that is not a '#'
You probably tried something like
"(.*)#.*"
and found that it fails when multiple '#' signs are present (keeping the leading '#'s)?
That is because ".*" is greedy, and will match as much as it can.
Your matcher should have a method that looks something like 'group(...)'. Most matchers
return the entire matched sequence as group(0), the first paren-matched group as group(1),
and so forth.
PCRE is so important that I strongly encourage you to search for it on Google, learn it, and always have it in your programming toolkit.
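The difference between the greedy and the negated-class version is easy to see on a string with more than one '#'. A small Ruby sketch (helper names are made up; group 1 is selected with the second argument to String#[]):

```ruby
# Greedy: .* grabs as much as it can, so group 1 runs up to the LAST '#'.
def greedy_prefix(s)
  s[/(.*)#.*/, 1]
end

# Negated class: [^#]* cannot cross a '#', so group 1 stops at the FIRST one.
def first_prefix(s)
  s[/([^#]*)#.*/, 1]
end

puts greedy_prefix("a#b#c")
puts first_prefix("a#b#c")
```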
Use look ahead and look behind:
To get all characters up to, but not including the pound (#): .*?(?=\#)
To get all characters following, but not including the pound (#): (?<=\#).*
If you don't mind using groups, you can do it all in one shot:
(.*?)\#(.*) Your answers will be in group(1) and group(2). Notice the non-greedy construct, *?, which will attempt to match as little as possible instead of as much as possible.
If you want to allow for missing # section, use ([^\#]*)(?:\#(.*))?. It uses a non-collecting group to test the second half, and if it finds it, returns everything after the pound.
Honestly though, for your situation, it is probably easier to use the Split method provided in String.
More on lookahead and lookbehind
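The optional-fragment pattern can be exercised like this. It is a Ruby sketch, not C#; split_fragment is a hypothetical name, and the /m flag is added here so that . also matches newlines and the pattern always matches.

```ruby
# Returns [before, after]; after is nil when the string has no '#'.
def split_fragment(s)
  m = s.match(/\A([^#]*)(?:#(.*))?\z/m)
  [m[1], m[2]]
end

p split_fragment("topics/install.xml#id_install")
p split_fragment("topics/install.xml")
p split_fragment("a#b#c")   # only the first '#' separates the two parts
```

The non-capturing group (?:…)? makes the whole "#fragment" part optional while still exposing the text after the '#' as group 2.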
first:
/[^\#]*(?=\#)/ edit: is faster than /.*?(?=\#)/
second:
/(?<=\#).*/
For something like this in C# I would usually skip the regular expressions stuff altogether and do something like:
string[] split = exampleString.Split('#');
string firstString = split[0];
string secondString = split[1];

Regex for username that allows numbers, letters and spaces

I'm looking for some regex code that I can use to check for a valid username.
I would like for the username to have letters (both upper case and lower case), numbers, spaces, underscores, dashes and dots, but the username must start and end with either a letter or number.
Ideally, it should also not allow for any of the special characters listed above to be repeated more than once in succession, i.e. they can have as many spaces/dots/dashes/underscores as they want, but there must be at least one number or letter between them.
I'm also interested to find out if you think this is a good system for a username? I've had a look for some regex that could do this, but none of them seem to allow spaces, and I would like for the usernames to have some spaces in them.
Thank you :)
So it looks like you want your username to have a "word" part (sequence of letters or numbers), interspersed with some "separator" part.
The regex will look something like this:
^[a-z0-9]+(?:[ _.-][a-z0-9]+)*$
Here's a schematic breakdown:
           _____sep-word…____
          /                  \
^[a-z0-9]+(?:[ _.-][a-z0-9]+)*$   i.e. "word ( sep word )*"
|\_______/   \____/\_______/  |
| "word"     "sep" "word"     |
|                             |
from beginning of string...   till the end of string
So essentially we want to match things like word, word-sep-word, word-sep-word-sep-word, etc.
There will be no consecutive sep without a word in between
The first and last char will always be part of a word (i.e. not a sep char)
Note that for [ _.-], - is last so that it's not a range definition metacharacter. The (?:…) is what is called a non-capturing group. We need the brackets for grouping for the repetition (i.e. (…)*), but since we don't need the capture, we can use (?:…)* instead.
To allow uppercase/various Unicode letters etc, just expand the character class/use more flags as necessary.
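Here is a small Ruby sketch of the pattern in use. The constant and method names are hypothetical; the /i flag is one way to allow uppercase letters, and \A and \z replace ^ and $ because in Ruby those two anchor at line boundaries rather than at the ends of the string.

```ruby
# "word ( sep word )*", case-insensitive, anchored to the whole string.
USERNAME = /\A[a-z0-9]+(?:[ _.-][a-z0-9]+)*\z/i

def username_ok?(name)
  USERNAME.match?(name)
end

["John Doe", "a.b-c_d e", "john  doe", "-john", "john."].each do |name|
  puts "#{name.inspect}: #{username_ok?(name)}"
end
```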
References
regular-expressions.info/Anchors, Character Class, Repetition, Grouping
Although I'm sure someone will shortly post a 1 million lines regex to do exactly what you want, I don't think in this case a regex is a good solution.
Why don't you write a good old fashioned parser? It will take about as long as writing the regex that does everything you mentioned, but it's going to be much easier to maintain and read.
In particular, this is the tricky part:
it should also not allow for any of
the special characters listed above to
be repeated more than once in
succession
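Alternatively you can always do a hybrid of the two. A regex for the other checks ([a-zA-Z0-9][a-zA-Z0-9 _.\-]*[a-zA-Z0-9]) and a non-regex method for the no-repeat requirement. Note that the hyphen is escaped and placed last in the character class, so that it is not read as a range operator.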
You don't have to use a regex for everything. I find that requirements like the "no two consecutive characters" usually make the regexes so ugly that it's better to do that bit with a simple procedural loop.
I'd just use something like ^[A-Za-z0-9][A-Za-z0-9 \.\-_]*[A-Za-z0-9]$ (or the equivalents like [:alnum:] if your regex engine is more advanced) and then just check every character in a loop to make sure the next character isn't the same.
By doing it procedurally, you can check all the other rules you're likely to want at some point without resorting to what I call "regex gymnastics", things like:
not allowed to contain your first or last name.
no more than two consecutive digits.
and so forth.
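A rough Ruby sketch of the hybrid approach: a simple regex for the allowed characters and the first/last rule, then a procedural check for the separators. It takes the no-repeat rule to mean "no two separator characters in a row", as the question describes; the names are hypothetical. As with the regex above, a name needs at least two characters to pass.

```ruby
SEPARATORS = " _.-"

# Simple shape check: allowed characters, alphanumeric first and last.
SHAPE = /\A[A-Za-z0-9][A-Za-z0-9 _.\-]*[A-Za-z0-9]\z/

def valid_username_hybrid?(name)
  return false unless SHAPE.match?(name)

  # Procedural part: walk over adjacent pairs of characters and reject
  # the name if both characters of any pair are separators.
  name.each_char.each_cons(2).none? do |a, b|
    SEPARATORS.include?(a) && SEPARATORS.include?(b)
  end
end

puts valid_username_hybrid?("John Doe")
puts valid_username_hybrid?("john._doe")
```

Further rules (such as a limit on consecutive digits) could be added as more plain Ruby checks without touching the regex.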

Regex matching spaces, but not in "strings"

I am looking for a regular expression matching spaces only if those spaces are not enclosed in double quotes ("). For example, in
Mary had "a little lamb"
it should match the first and the second space, but not the others.
I want to split the string only at the spaces which are not in the double quotes, and not at the quotes.
I am using C++ with the Qt toolkit and wanted to use QString::split(QRegExp). QString is very similar to std::string and QRegExp are basically POSIX regex encapsulated in a class. If such a regex exists, the split would be trivial.
Examples:
Mary had "a little lamb" => Mary,had,"a little lamb"
1" 2 "3 => 1" 2 "3 (no splitting at ")
abc def="g h i" "j k" = 12 => abc,def="g h i","j k",=,12
Sorry for the edits, I was very imprecise when I asked the question first. Hope it is somewhat more clear now.
(I know you just posted almost exactly the same answer yourself, but I can't bear to just throw all this away. :-/)
If it's possible to solve your problem with a regex split operation, the regex will have to match even numbers of quotation marks, as MSalters said. However, a split regex should match only the spaces you're splitting on, so the rest of the work has to be done in a lookahead. Here's what I would use:
" +(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"
If the text is well formed, a lookahead for an even number of quotes is sufficient to determine that the just-matched space is not inside a quoted sequence. That is, lookbehinds aren't necessary, which is good because QRegExp doesn't seem to support them. Escaped quotes can be accommodated too, but the regex becomes quite a bit larger and uglier. But if you can't be sure the text is well formed, it's extremely unlikely you'll be able to solve your problem with split().
By the way, QRegExp does not implement POSIX regular expressions--if it did, it wouldn't support lookaheads OR lookbehinds. Instead, it falls into the loosely-defined category of Perl-compatible regex flavors.
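The lookahead regex is not specific to Qt. As an illustration, here it is in Ruby (not the asker's C++), with \z in place of $ because $ in Ruby anchors at the end of a line; the constant and method names are made up.

```ruby
# One or more spaces that are followed by an even number of quotes
# up to the end of the string, i.e. spaces outside a quoted section.
OUTSIDE_QUOTES = / +(?=(?:[^"]*"[^"]*")*[^"]*\z)/

def split_unquoted(line)
  line.split(OUTSIDE_QUOTES)
end

p split_unquoted('Mary had "a little lamb"')
p split_unquoted('1" 2 "3')
p split_unquoted('abc def="g h i" "j k" = 12')
```

Because the lookahead rescans the rest of the string for every candidate space, this does more work than a single pass, which matches the performance concern raised elsewhere in this thread.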
What should happen to "a" b "c" ?
Note that in the substring " b " the spaces are between quotes.
-- edit --
I assume a space is "between quotes" if it is preceded and followed by an odd number of standard quotation marks (i.e. U+0022, I'll ignore those funny Unicode “quotes”).
That means you need the following regex: ^[^"]*("[^"]*"[^"]*)*"[^"]* [^"]*"[^"]*("[^"]*"[^"]*)*$
("[^"]*"[^"]*) represents a pair of quotes. ("[^"]*"[^"]*)* is an even amount of quotes, ("[^"]*"[^"]*)*" an odd amount. Then there's the actual quoted string part, followed by another odd number of quotes. ^$ anchors are needed because you need to count every quote from the beginning of the string. This answers the " b " substring problem above by never looking at substrings. The price is that every character in your input must be matched against the entire string, which turns this into an O(N*N) split operation.
The reason why you can do this in a regex is because there is a finite amount of memory needed. Effectively just one bit; "have I seen an odd or even number of quotes so far?". You don't actually have to match up individual "" pairs.
This is not the only interpretation possible, though. If you do include “funny Unicode quotes” which should be paired, you also need to deal with ““double quoted”” strings. This in turn means you need a count of open “, which means you need infinite storage, which in turn means it's no longer a regular language, which means you can't use a regex. QED.
Anyway, even if it was possible, you still would want a proper parser. The O(N*N) behavior to count the number of quotes preceding each character just isn't funny. If you already know there are X quotes preceding Str[N], it should be an O(1) operation to determine how many quotes precede Str[N+1], not O(N). The possible answers are after all just X or X+1 !
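The "one bit of state" idea can be sketched as a single pass over the string. This is a minimal Ruby illustration (not Qt/C++), with a hypothetical method name; it keeps a flag that flips on every quote and splits on spaces only while the flag is off.

```ruby
def split_outside_quotes(line)
  fields = []
  current = +""        # unfrozen, mutable buffer for the field being built
  in_quotes = false    # the single bit: odd or even number of quotes so far

  line.each_char do |ch|
    if ch == '"'
      in_quotes = !in_quotes
      current << ch
    elsif ch == " " && !in_quotes
      # A space outside quotes ends the current field (runs of spaces
      # produce no empty fields).
      fields << current unless current.empty?
      current = +""
    else
      current << ch
    end
  end

  fields << current unless current.empty?
  fields
end

p split_outside_quotes('abc def="g h i" "j k" = 12')
```

Each character is examined once, so the cost grows linearly with the length of the input.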
MSalters pushed me on the right track. The problem with his answer is that the regex he gives always matches the whole string and so is unsuitable for split(), but this can be partly redeemed by a lookahead match. Assuming that the quotes are always paired (they are indeed), I can split on every space which is followed by an even number of quotes.
The regex without C escapes and in single quotes looks like
' (?=[^"]*("[^"]*"[^"]*)*$)'
In the source it finally looked like (using Qt and C++)
QString buf("Mary had \"a little lamb\""); // string we want to split
QStringList splitted = buf.split( QRegExp(" (?=[^\"]*(\"[^\"]*\"[^\"]*)*$)") );
Simple, eh?
As for performance, the strings are parsed once at the start of the program; there are a few dozen of them and each is less than a hundred characters long. I will test its runtime with long strings, just to be sure nothing bad happens ;-)
If the quoting in the strings is simple (like your examples), you can use alternation. This regex first hunts for a simple quoted string; failing that it finds spaces.
/(\"[^\"]*\"| +)/
In Perl, if you use grouping in the regex when calling split(), the function returns not only the elements but also the captured groups (in this case, our delimiter). If you then filter out the blank and spaces-only delimiters, you will get the desired list of elements. I don't know whether a similar strategy would work in C++, but the following Perl code does work:
use strict;
use warnings;
while (<DATA>){
chomp;
my @elements = split /(\"[^\"]*\"| +)/, $_;
@elements = grep {length and /[^ ]/} @elements;
# Do stuff with @elements
}
__DATA__
Mary had "a little lamb"
1" 2 "3
abc def="g h i" "j k" = 12
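The same strategy carries over to Ruby, whose String#split also returns captured groups alongside the fields. A minimal sketch, with a made-up method name:

```ruby
# Split on "quoted string OR run of spaces", keeping the delimiters because
# they are captured, then drop the pieces that are empty or only spaces.
def split_keep_quotes(line)
  line.split(/("[^"]*"| +)/).select { |piece| piece.match?(/[^ ]/) }
end

p split_keep_quotes('Mary had "a little lamb"')
p split_keep_quotes('abc def="g h i"')
```

As the answer says, this only suits simple quoting: a quoted string is always returned as its own element, so def="g h i" comes back as two pieces rather than one.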
Simplest regex-solution: match whole spaces AND quotes. Filter quotes later
"[^"]*"|\s