Lucene Regex for alphanumeric match but not all numeric - regex

I want to find the alphanumeric words in lucene automata regex but not entirely numeric and even not entirely alphabets.
I have tried
(([a-zA-Z0-9]{1,10})&(.*[0-9].*))
but this returns all numeric words also
So i tried to negate all numeric like below but it does not work
(^[0-9])(([a-zA-Z0-9]{1,10})&(.*[0-9].*))
Input String:
DL200, dal2 , 700091
Expected output:
DL200 and dal2
but it should not return 700091

With the help of the JvdV answer and with the help of https://stackoverflow.com/a/38665819/9758194, I was able to get the desired output
(([a-zA-Z0-9]{1,10})&(.*[0-9].*))&~([0-9]*)

Didn't know much about lucene regex flavor, but a little research tought me that it does not support PCRE library, however some standard operators are supported. I found that it does not include lookarounds nor word boundaries. Have a look at the docs.
Either way, to overcome the lack of support on lookarounds I had a look at this older SO post to use ~ instead. Furthermore, I see you can use the & operator to check if the string matches multiple patterns.
This makes for the assumption the following pattern might work for you:
~[0-9]+&~[^0-9]+&[A-Za-z0-9]{2,10}
~[0-9]+ - Negate a string made of numbers only.
&
~[^0-9]+ - Negate a string made of non-numbers only.
&
[A-Za-z0-9]{2,10} - Matches a string that is made out of 2 to 10 alphanumeric characters.

Related

RegEx: Searching for numbers (int, float) that are NOT part of a word

I'm hoping we have some regular expression guru's here that might be able to help me - a regex newbie - solve a problem.
I know some people will want to know some background info on this issue:
Regex Flavor: Basic Regex, being used in a Vertica Database using the REGEXP_REPLACE function.
The regex I am using is working great with one exception.
I have a rule that I'm trying to implement, related to stripping the numbers from text, where any number that is part of a word, e.g. table5, go2market, 33monroe, room222, etc. is ignored and NOT filtered.
Here is what I started with for detecting numbers:
[-+]?[0-9]*\.?[0-9]
That seems to work pretty well, including handling directly adjacent commas and parentheses for example.
But all cases where there is a number that is part of alphabetic text is also being detected, which fails the rule that it cannot be a part of a word, and by word, I mean any alphabetic text.
So, in searching for solutions, I happened upon this regex that seems to work well detecting those specific cases where numbers appear next to, or in, any string of characters:
((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)
My thought was that maybe I could add this as an INVERTED match to my original regex, to allow it to still select standalone numbers while ignoring those that were a part of a word, like so:
[-+]?[0-9]^((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)*\.?[0-9]^((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)
Unfortunately however, it breaks the original detection of standalone numbers.
:(
I'm hoping there is someone here that can spot what I'm doing wrong, and help me identify the right solution?
Thanks in advance!
According to Vertica documentation, the regex flavour seems to follow the Perl syntax. In this case you can use negative lookarounds and in particular a negative lookbehind: (?<!\w) (not preceded with a word character.)
Lookarounds are only tests and don't consume characters.
You can also use a negative lookahead to test the right part, (?!\w) (not followed by a word character), but it's more simple to use a word boundary since the pattern ends with a digit (that is also a word character):
(?<!\w)[-+]?\d*\.?\d+\b
In the worst case, if you have something like v1.0 in your string and you want to avoid it, you can try to use the bactracking control verbs (*SKIP) and (*FAIL). (*FAIL) forces the pattern to fail and (*SKIP) skips all the already matched positions before it. I hope vertica supports these Perl regex features.
Something like:
\p{L}+[-+]?\d*\.?\d+(*SKIP)(*FAIL)|[-+]?\d*\.?\d+(*SKIP)(?!\p{L})

How to evaluate a RegExp in an array with match groups?

I need to parse an array-like text with regular expression and get the match groups.
One example of then text I want to parse is this:
['red','green', 'blue']
I want to use match groups, because I want to extract them.
I am using this regular expression, but the groups found by it are not like what I expected:
\[ *('.+?')( *, *('.+?'))* *\]
The idea is to parse in this order:
A square bracket
Any number of spaces
A group with:
Single quote
Any character
Single quote
Zero or more groups of:
Any number of spaces
A comma
Any number of spaces
A group with
Single quote
Any character
Single quote
Any number of spaces
A square bracket
And get one group with each parsed array element.
Can you help me?
Hint: a easy way to test regexp is the site http://rubular.com
This isn't going to be a totalitarian answer, but I'm fairly certain you can't whitespace check by doing " *", at least it may depend on the language you're using.
Here's a C# regex example that shows some of the language requirements to check for whitespace: regex check for white space in middle of string
Edit: I see you added Ruby as your language, unfortunately I'm not verbose in Ruby so specifics I cannot help you with, sorry.
Edit2: Seeing as you're forcing yourself into Ruby to debug your regex statement, might I suggest: http://www.debuggex.com/ which tries to stay language independent?
Try this regex: '([^']+)', it should give you the following match groups red, green, blue according to rubular.com
You can match an arbitrary number of groups with one regex:
^\[\s*|(?:\G'([^']+)'\s*(?:,\s*|]$))+
or like this (should be more performant):
^\[\s*+|(?>\G'([^']++)'\s*+(?>,\s*+|]$))++
This work in ruby like asked before, in delphi I don't know.

What is wrong with my simple regex that accepts empty strings and apartment numbers?

So I wanted to limit a textbox which contains an apartment number which is optional.
Here is the regex in question:
([0-9]{1,4}[A-Z]?)|([A-Z])|(^$)
Simple enough eh?
I'm using these tools to test my regex:
Regex Analyzer
Regex Validator
Here are the expected results:
Valid
"1234A"
"Z"
"(Empty string)"
Invalid
"A1234"
"fhfdsahds527523832dvhsfdg"
Obviously if I'm here, the invalid ones are accepted by the regex. The goal of this regex is accept either 1 to 4 numbers with an optional letter, or a single letter or an empty string.
I just can't seem to figure out what's not working, I mean it is a simple enough regex we have here. I'm probably missing something as I'm not very good with regexes, but this syntax seems ok to my eyes. Hopefully someone here can point to my error.
Thanks for all help, it is greatly appreciated.
You need to use the ^ and $ anchors for your first two options as well. Also you can include the second option into the first one (which immediately matches the third variant as well):
^[0-9]{0,4}[A-Z]?$
Without the anchors your regular expression matches because it will just pick a single letter from anywhere within your string.
Depending on the language, you can also use a negative look ahead.
^[0-9]{0,4}[A-Za-z](?!.*[0-9])
Breakdown:
^[0-9]{0,4} = This look for any number 0 through 4 times at the beginning of the string
[A-Za-z] = This look for any characters (Both cases)
(?!.*[0-9]) = This will only allow the letters if there are no numbers anywhere after the letter.
I haven't quite figured out how to validate against a null character, but that might be easier done using tools from whatever language you are using. Something along this logic:
if String Doesn't equal $null Then check the Rexex
Something along those lines, just adjusted for however you would do it in your language.
I used RegEx Skinner to validate the answers.
Edit: Fixed error from comments

regex to match strings not ending with a pattern?

I am trying to form a regular expression that will match strings that do NOT end a with a DOT FOLLOWED BY NUMBER.
eg.
abcd1
abcdf12
abcdf124
abcd1.0
abcd1.134
abcdf12.13
abcdf124.2
abcdf124.21
I want to match first three.
I tried modifying this post but it didn't work for me as the number may have variable length.
Can someone help?
You can use something like this:
^((?!\.[\d]+)[\w.])+$
It anchors at the start and end of a line. It basically says:
Anchor at the start of the line
DO NOT match the pattern .NUMBERS
Take every letter, digit, etc, unless we hit the pattern above
Anchor at the end of the line
So, this pattern matches this (no dot then number):
This.Is.Your.Pattern or This.Is.Your.Pattern2012
However it won't match this (dot before the number):
This.Is.Your.Pattern.2012
EDIT: In response to Wiseguy's comment, you can use this:
^((?!\.[\d]+$)[\w.])+$ - which provides an anchor after the number. Therefore, it must be a dot, then only a number at the end... not that you specified that in your question..
If you can relax your restrictions a bit, you may try using this (extended) regular expression:
^[^.]*.?[^0-9]*$
You may omit anchoring metasymbols ^ and $ if you're using function/tool that matches against whole string.
Explanation: This regex allows any symbols except dot until (optional) dot is found, after which all non-numerical symbols are allowed. It won't work for numbers in improper format, like in string: abcd1...3 or abcd1.fdfd2. It also won't work correctly for some string with multiple dots, like abcd.ab123cd.a (the problem description is a bit ambigous).
Philosophical explanation: When using regular expressions, often you don't need to do exactly what your task seems to be, etc. So even simple regex will do the job. An abstract example: you have a file with lines are either numbers, or some complicated names(without digits), and say, you want to filter out all numbers, then simple filtering by [^0-9] - grep '^[0-9]' will do the job.
But if your task is more complex and requires validation of format and doing other fancy stuff on data, why not use a simple script(say, in awk, python, perl or other language)? Or a short "hand-written" function, if you're implementing stand-alone application. Regexes are cool, but they are often not the right tool to use.
I would just use a simple negative look-behind anchored at the end:
.*(?<!\\.\\d+)$

Regex to match any strings containing Cyrillic symbols, except comments marked with //, ///, ///, etc

I want to find all strings containing at least 1 Cyrillic character (basically /.*[А-я].*/) but with exception of comments.
Comment is a string or part of a string which starts with 2 or more / characters.
Currently I get this regex which do some part of the trick:
^(?=^.*?[А-я]+).*?((?=[\/]{2,})|(^(?:(?![\/]{2,}).)*$))
But I'd like to get less bloated and faster expression.
And as additional question: could anyone explain why this one is working? I combined it by trial-and-error but I'm not sure I completely understood how it works, because when I try to change it in any part - it stops working.
The following regex will match any cyrllic character that is not preceded by a double forward slash
(?<!/{2}.*)[А-я]
It specifies that it should not be preceded by a double slash by using a negative lookbehind.
You haven't specified what flavour of regex your using, but be aware some flavours don't support lookarounds. For example PCRE (javascript) doesn't. You are using 3 of them in your regex, so i presume its ok.