Whitespace in a regular expression - regex

I am using this software, dk-brics-automaton, to get the number of states
of regular expressions. For example, I have this type of RE:
^SEARCH\s+[^\n]{10}
When I insert it as a string, as below, the compiler reports an invalid escape sequence:
RegExp r = new RegExp("^SEARCH\s+[^\n]{10}", ALL);
where ALL is a certain flag.
When I use a double backslash before the small s, the compiler accepts it
as a string. But here \s is supposed to mean whitespace, and I am confused: with a
double backslash, won't it be treated as just a backslash followed by "s", whereas I meant whitespace?
I have thousands of such regular expressions for which I want to compute the finite automaton
states. Does that mean I have to add the backslashes manually in every RE?
Here is a link where something related is explained, but I do not understand it:
http://www.brics.dk/automaton/doc/index.html
Please help if you have past experience with this software or any idea how to solve this issue.

I had another look at that documentation. "automaton" is a Java package, therefore I think you have to treat its patterns like Java regexes. So just double every backslash inside a regex.
The thing is, Java does not know "raw" strings, so you have to escape for two levels. The first level that evaluates escape sequences is the string level.
The string level does not know an escape sequence \s; that is the error. \n is fine: the string evaluates it and, instead of the two characters \ (0x5C) and n (0x6E), stores the single character 0x0A.
Then the string is handed over to the regex constructor, where the next round of escape sequence evaluation happens.
So if you want an escape to reach the regex level, you have to double the backslash. The string level evaluates \\ to \, and so the regex level gets the correct escape sequence.
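The two levels can be seen in any language whose ordinary string literals decode backslashes. Here is a minimal sketch in Python, used only for illustration; it is not the dk-brics-automaton API. The helper to_java_literal is hypothetical: it shows one way to generate the doubled form automatically if the thousands of patterns have to be pasted into Java source. Escape processing only applies to literals in source code, so patterns read from a file at run time need no doubling at all.

```python
import re

# Same literal as the corrected Java source: "\\s" is decoded by the string
# level to backslash + s, while "\n" is decoded to one newline character.
pattern_text = "^SEARCH\\s+[^\n]{10}"
assert len("\\s") == 2 and len("\n") == 1

# The regex level then reads backslash + s as "whitespace".
matches = re.match(pattern_text, "SEARCH   abcdefghij") is not None


def to_java_literal(raw_regex: str) -> str:
    """Hypothetical helper: wrap a raw regex as Java string-literal source."""
    escaped = raw_regex.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


print(matches)
# Emits the pattern with every backslash doubled, ready to paste into Java.
print(to_java_literal(r"^SEARCH\s+[^\n]{10}"))
```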

Related

How to do a negative lookbehind within a %r<…>-delimited regexp in Ruby?

I like the %r<…> delimiters because it makes it really easy to spot the beginning and end of the regex, and I don't have to escape any /. But it seems that they have an insurmountable limitation that other delimiters don't have?
Every other delimiter imaginable works fine:
/(?<!foo)/
%r{(?<!foo)}
%r[(?<!foo)]
%r|(?<!foo)|
%r/(?<!foo)/
But when I try to do this:
%r<(?<!foo)>
it gives this syntax error:
unterminated regexp meets end of file
Okay, it probably doesn't like that it's not a balanced pair, but how do you escape it such that it does like it?
Does something need to be escaped?
According to wikibooks.org:
Any single non-alpha-numeric character can be used as the delimiter,
%[including these], %?or these?, %~or even these things~.
By using this notation, the usual string delimiters " and ' can appear
in the string unescaped, but of course the new delimiter you've chosen
does need to be escaped.
Indeed, escaping is needed in these examples:
%r!(?<\!foo)!
%r?(\?<!foo)?
But if that were the only problem, then I should be able to escape it like this and have it work:
%r<(?\<!foo)>
But that yields this error:
undefined group option: /(?\<!foo)/
So maybe escaping is not needed/allowed? wikibooks.org does list %<pointy brackets> as one of the exceptions:
However, if you use
%(parentheses), %[square brackets], %{curly brackets} or
%<pointy brackets> as delimiters then those same delimiters
can appear unescaped in the string as long as they are in balanced
pairs
Is it a problem with balanced pairs?
Balanced pairs are no problem as long as you are doing something in the Regexp that requires them, like...
%r{(?<!foo{1})} # repetition quantifier
%r[(?<![foo])] # character class
%r<(?<name>foo)> # named capture group
But what if you need to insert a left-side delimiter ({, [, or <) inside the regex? Just escape it, right? Ruby seems to have no problem with escaped unbalanced delimiters most of the time...
%r{(?<!foo\{)}
%r[(?<!\[foo)]
%r<\<foo>
It's just when you try to do it in the middle of the "group options" (which I guess is what the <! characters are classified as here) following a (? that it doesn't like it:
%r<(?\<!foo)>
# undefined group option: /(?\<!foo)/
So how do you do that then and make Ruby happy? (without changing the delimiters)
Conclusion
The workaround is easy. I'll just change this particular regex to use something else, like %r{…}, instead.
But the questions remain...
Is there really no way to escape the < here?
Are there really some regular expressions that are simply impossible to write using certain delimiters like %r<…>?
Is %r<…> the only regular expression delimiter pair that has this problem (where some regular expressions are impossible to write when using it)? If you know of a similar example with %r{…}/%r[…], do share!
Version info
Not that it probably matters since this syntax probably hasn't changed, but I'm using:
⟫ ruby -v
ruby 2.6.0p0 (2018-12-25 revision 66547) [x86_64-linux]
Reference:
https://ruby-doc.org/core-2.6.3/Regexp.html
% Notation
As others have mentioned, this seems like an oversight based on how this character differs from other paired boundaries.
As far as "Is there really no way to escape the < here?" there is a way... but you're not going to like it:
%r<(?#{'<'}!foo)> == %r((?<!foo))
Using interpolation to insert the < character seems to work. But given that there are much better options, I would avoid it unless you were planning on splitting the regex into sections anyway...

regex to highlight sentences longer than n words

I am trying to write a regular expression that can be used to identify long sentences in a document, in my case a scientific manuscript. I aim to do that either in LibreOffice or in any text editor with regex search.
So far I got the following expression to work on most occasions:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*,*\:*\s+){24,}?(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*[\.|?|!|$])
btw, I got inspired from this post
It contains:
group1:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*,*\:*\s+)
a repetition element (stating how many words n - 1):
{24,}?
group2:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*[\.|?|!|$])
The basic functioning is:
group1 matches any number of word characters OR other characters that are present in the text followed by one or more spaces
group1 has to be repeated 24 times (or as many as you want the sentences to be long)
group2 matches any number of word characters OR other characters that are present in the text followed by a full stop, exclamation mark, question mark or paragraph break.
Any string that fulfills all the above would then be highlighted.
What I can't solve so far is to make it work when a dot appears in the text with another meaning than a full stop. Things like: i.e., e.g., et al., Fig., 1.89, etc....
Also I don't like that I had to manually adjust it to be able to handle sentences that contain non-word characters such as , [ ( % - # µ " ' and so on. I would have to extend the expression every time I come across some other uncommon character.
I'd be happy for any help or suggestions of other ways to solve this.
You can do a lot with the swiss-army-knife that is regular expressions, but the problem you've presented approaches regex's limits. Some of the things you want to detect can probably be handled with really small changes, while others are a bit harder. If your goal is to have some kind of tool that accurately measures sentence length for every possible mutation of characters, you'll probably need to move outside LibreOffice to a dedicated custom piece of software or a third-party tool.
But, that said, there are a few tricks you can worm into your existing regex to make it work better, if you want to avoid programming or another tool. Let's look at a few of the techniques that might be useful to you:
You can probably tweak your regex for a few special cases, like Fig. and Mr., by including them directly. Where you currently have [\w|\-|–|−|\/|≥|≤|’|“|”|μ]+, which is basically [\w]+ with a bunch of other "special" characters, you could use something like ([\w|...]+|Mr\.|Mrs\.|Miss\.|Fig\.) (substituting in all the special characters where I wrote ..., of course). Regexes are "greedy" algorithms, and will try to consume as much of the text as possible, so by including special "dot words" directly, you can make the regex "skip over" certain period characters that are problematic in your text. Make sure that when you want to add a "period to skip" that you always precede it with a backslash, like in i\.e\., so that it doesn't get treated as the special "any" character.
A similar trick can capture numbers better by assuming that digits followed by a period followed by more digits are supposed to "eat" the period: ([\w|...]+|\d+\.\d+|...) That doesn't handle everything, and if your document authors are writing stuff like 0. in the middle of sentences then you have a tough problem, but it can at least handle pi and e properly.
Also, right now, your regex consumes characters until it reaches any terminating punctuation character: a ., or !, or ?, or the end of the document. That's a problem for things like i.e., and 3.14, since as far as your regex is concerned, the sentence stops at the first period. You could require your regex to only stop the sentence once ._ is reached, that is, a period followed by a space (written here as _). That wouldn't fix mismatches for words like Mr., but it would treat "words" like 3.14 as a word instead of as the end of a sentence, which is closer than you currently are. To do this, you'll have to include an odd sequence as part of the "word" regex, something like (\.[^ ]), which says "dot followed by not-a-space" is part of the word; and then you'll have to change the terminating sequence to (\. |!|\?|$), with the ? escaped so that it is a literal question mark rather than a quantifier. Repeat the changes similarly for ! and ?.
Another useful trick is to take advantage of character-code ranges, instead of encoding each special character directly. Right now, you're doing it the hard way, by spelling out every accented character and digraph and diacritic in the universe. Instead, you could just say that everything that's a "special character" is considered to be part of the "word": Instead of [\w|\-|–|−|\/|≥|≤|’|“|”|μ]+, write [\w|\-|\/|\u0080-\uFFFF]+, which captures every character except emoji and a few from really obscure dead languages. (Inside [ ] the | characters are literal pipes, not alternation, so they can be dropped.) LibreOffice seems to have Unicode support, so using \uXXXX patterns should work inside [ character ranges ].
This is probably enough to make your regex somewhat acceptable in LibreOffice, and might even be enough to answer your question. But if you're really intent on doing more complex document analysis like this, you may be better off exporting the document as plain text and then running a specialized tool on it.
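If you do go the plain-text route, the tricks above carry over almost directly. Below is a minimal sketch in Python, assuming the manuscript has been exported as plain text. The abbreviation list and the names ABBREVIATIONS and long_sentence_pattern are made up for illustration and would need extending for a real document.

```python
import re

# Hypothetical list of "dot words" whose periods must not end a sentence.
ABBREVIATIONS = r"(?:i\.e\.|e\.g\.|al\.|Fig\.)"


def long_sentence_pattern(min_words: int) -> re.Pattern:
    """Build a regex matching sentences with at least min_words words."""
    # One piece of a word: an abbreviation, a decimal number such as 1.89,
    # or any single character that is not whitespace or . ! ?
    piece = rf"(?:{ABBREVIATIONS}|\d+\.\d+|[^\s.!?])"
    word = rf"{piece}+"
    # (min_words - 1) words followed by whitespace, then a final word and
    # the terminating punctuation.
    return re.compile(rf"(?:{word}\s+){{{min_words - 1},}}{word}[.!?]")


text = "We saw this, i.e. the 1.89 fold change in Fig. 2, twice. Short one."
long_sentences = [m.group() for m in long_sentence_pattern(8).finditer(text)]
print(long_sentences)
```

Because a word is defined as "anything that is not whitespace or sentence punctuation", there is no need to list μ, ≥, brackets and so on one by one. Only the abbreviations still have to be maintained by hand.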

Match specific numbers with | before and after (regex escaping)

With /\escape/ I can escape special regex characters, right? So why isn't it working?
I'm trying to find specific numbers that start with |, have something in the middle made up of digits only ([0-9]), and end with | again.
There are also other strings to the left and to the right, like so: left|something[0-9]|right
This is what I've tried, but it is not working:
/\|/234123[0-9]/\|/
\ only escapes the next character, so the second forward slash is ending the regular expression. Instead, you want this:
/\|something[0-9]\|/
You have to make sure that something is escaped correctly.
Note that if you need to match any number not just a digit, you need [0-9]+.
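As a quick check of the corrected pattern, here is a small sketch in Python (the sample string and variable names are made up for illustration). Python has no /…/ regex delimiters, so only the pipes need escaping; re.escape shows what the escaped form of a character looks like.

```python
import re

text = "left|2341237|right"

# \| matches a literal pipe; [0-9]+ allows one or more trailing digits.
number_between_pipes = re.compile(r"\|(234123[0-9]+)\|")
found = number_between_pipes.search(text)
print(found.group(1))  # prints 2341237

# re.escape produces the escaped form of a special character.
escaped_pipe = re.escape("|")
```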
What would probably help you the most would be the right tool for the job:
https://regex101.com/r/ZdjhCE/2
You'll still have to set your language, as regex flavors are similar between languages but unfortunately not 100% identical.

What is the meaning of the \\+ in this regex?

I am trying to parse this example regex.
I know that backslashes can be used as escape characters. So if you wanted to search for ) without implying a grouping, you would write \ and then ) (spelled out here so that Stack Overflow does not mangle the regex).
I also know that a plus sign can indicate one or more of the preceding item.
But in the example below, is the plus sign or the slash getting escaped? It seems like the first slash allows you to "escape" the second slash, and then the plus sign indicates that there is at least one of the preceding slash, but the example says there are at least two + in the string...
What does this regex mean? There are too many new things going on for me to parse it.
But in the example below, is the plus sign or the slash getting escaped?
Both!
The \ is escaped because the query language you are using probably uses it as an escape character itself (e.g. to escape quotes). So \\ is understood as a single \ in the regex, which is then used to escape the +. The regex means a single + followed by zero or more +.
It could probably be rewritten as \\++ where the second + is actually the regex quantifier.
That regexp can actually mean two different things, depending on the PostgreSQL version and the value of standard_conforming_strings.
Old versions (before standard_conforming_strings or those that defaulted to off) would interpret the string as a backslash-escaped string. So PostgreSQL would turn \\+\\+* into \+\+*, i.e. it'd consume a level of escaping. Then the regular expression would consume the remaining level to escape the pluses, so they're interpreted as literal +s, not quantifiers. That regexp says a literal + followed by zero or more further +s, so it matches any value containing at least one +.
Newer versions with standard_conforming_strings defaulting to on will, per the SQL standard, not decode the backslashes as escapes. So you'll run the regexp \\+\\+*, which is one or more backslashes, followed by one or more backslashes, followed by ... oops, an asterisk applied to something that is already quantified, which is an error.
So we know you must have standard_conforming_strings off, because with it on the query would fail to compile the regexp:
regress=> SELECT 'blah' ~ '\\+\\+*';
ERROR: invalid regular expression: quantifier operand invalid
postgres=> SHOW standard_conforming_strings;
standard_conforming_strings
-----------------------------
on
(1 row)
You'll have this problem later on, so I suggest dealing with it before you upgrade.
Assuming that the x_spam_level field always starts with the pluses, which the regexp doesn't check, that code might be better written as:
x_spam_level LIKE '++%'
If it doesn't start with the pluses use:
x_spam_level LIKE '%++%'
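which is closest to the apparent intent (at least two pluses somewhere in the value); strictly, the current regexp accepts any value containing at least one +. PostgreSQL will turn the LIKE pattern into a regular expression internally, but you don't have to worry about the escaping.
If you want to use a regular expression and have it behave consistently across all versions, use: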
x_spam_level ~ E'\\+\\+*'
The E'' syntax tells PostgreSQL to decode backslash escapes, irrespective of the standard_conforming_strings setting.
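The two readings described above can be reproduced outside PostgreSQL. The sketch below uses Python's re module purely as a stand-in: its regex flavor is not PostgreSQL's, but for these two patterns it behaves the same way, and the variable names are made up for illustration.

```python
import re

# What the regex engine receives after the string level has consumed one
# level of backslashes (standard_conforming_strings off, or E'...').
decoded = re.compile(r"\+\+*")

# What it receives when the backslashes are passed through untouched
# (standard_conforming_strings on, plain '...').
undecoded_source = r"\\+\\+*"

has_plus = decoded.search("spam level: +++") is not None
no_plus = decoded.search("spam level: none") is not None

try:
    re.compile(undecoded_source)
    undecoded_is_valid = True
except re.error:
    # The trailing * follows an item that already has a + quantifier.
    undecoded_is_valid = False

print(has_plus, no_plus, undecoded_is_valid)
```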

exact meaning of tag filtering regex

The following regular expressions filter the style/src attributes of some HTML tags.
[(?i:s\\*c\\*r\\*i\\*p\\*t)]
[(?i:e\\*x\\*p\\*r\\*e\\*s\\*s\\*i\\*o\\*n)]
Apart from the 'modifier span',
what is "\\*"?
Does it mean s*c*r*i*p*t? If so, does it have any effect on the filtering?
In regex, \\* means 0 or more literal \ characters. So the regexes are looking for the words script and expression, possibly with any number of backslashes between the letters, and possibly with no backslashes at all.
Some examples that would match:
s\c\r\\ipt
sc\\\\\ript
s\\\c\r\\\ip\\\t
script
As Qtax points out, the language is going to be important here. I don't recognize that regex syntax, but some require backslashes to be double-escaped: once for the primary language, and once for the regex engine. That's a hard thing to explain, but basically it means that the patterns might only match the following two inputs, depending on the programming language:
s*c*r*i*p*t
e*x*p*r*e*s*s*i*o*n
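Both readings can be tried side by side. This is a minimal sketch in Python, where raw strings make it explicit what the regex engine receives; the names single_level and double_level are made up, and the outer [ ] from the question is left out since its meaning in the original tool is unknown.

```python
import re

# Reading 1: the engine sees \\* = zero or more literal backslashes.
single_level = re.compile(r"(?i:s\\*c\\*r\\*i\\*p\\*t)")

# Reading 2: the host language consumed one backslash first, so the
# engine sees \* = one literal asterisk.
double_level = re.compile(r"(?i:s\*c\*r\*i\*p\*t)")

print(bool(single_level.fullmatch(r"s\c\r\\ipt")))   # True
print(bool(single_level.fullmatch("SCRIPT")))        # True
print(bool(double_level.fullmatch("s*c*r*i*p*t")))   # True
print(bool(double_level.fullmatch("script")))        # False
```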
Generally, a \ character in regex escapes special characters to suppress their special meaning, i.e. \\n would match the two literal characters \ and n instead of a newline.
Simple as that!
Just to add to the answer, the characters in question would resolve to s\*c\*r\*i\*p\*t