I am looking at regexes to validate and parse well-known text, which is a format used to transfer spatial data and looks like:
POLYGON((51.124 -3.973, 51.1 -3.012, ....))
or
MULTIPOLYGON(((POLYGON((51.124 -3.973, 51.1 -3.012, ....)),POLYGON((50.14 -13.973, 51.1 -13.012, ....))
among other variations.
There is a good answer here: Parsing a WKT-file which uses the regex:
\d+(?:\.\d*)?
From other places I have also seen
\d*\.\d+|\d+
and
(\d*\.)?\d+
These all seem to do the same thing, but it got me wondering about the relative workings of these 3 regexes, and if there are any performance issues or subtleties under the hood to be aware of.
To be clear, I am aware that there are libraries for parsing WKT in various languages. My question is purely about the relative behavior of number extracting regexes.
It depends what number formats you need to allow, example:
format 1: 22
format 2: 22.2
format 3: .2
format 4: 2.
The 1st pattern \d+(?:\.\d*)? matches formats 1, 2, 4
The 2nd pattern \d*\.\d+|\d+ matches formats 1, 2, 3
The 3rd pattern (\d*\.)?\d+ matches formats 1, 2, 3 (and has an unneeded capturing group)
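The format coverage listed above can be checked with a small sketch. It uses Python's re module purely for illustration (the question does not name an engine), and the names PATTERNS, SAMPLES and accepted_formats are made up for this example:

```python
import re

# The three patterns from the question, in the order discussed above.
PATTERNS = {
    "first": r"\d+(?:\.\d*)?",
    "second": r"\d*\.\d+|\d+",
    "third": r"(\d*\.)?\d+",
}

# One sample per number format: integer, decimal, leading dot, trailing dot.
SAMPLES = ["22", "22.2", ".2", "2."]


def accepted_formats(pattern: str) -> list[str]:
    """Return the samples that the pattern matches in full."""
    return [s for s in SAMPLES if re.fullmatch(pattern, s)]


for name, pattern in PATTERNS.items():
    print(name, accepted_formats(pattern))
```

re.fullmatch is used so that a partial match (for example "2" inside "2.") does not count as accepting the format.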
Note: patterns 2 and 3 are slower than the first to succeed when the number is an integer, because they must match all the digits up to the point where a dot is expected, backtrack to the start, and then match the same digits a second time (see the schema below).
str | pattern | state
-----+----------------+-----------------------------
123 | \d*\.\d+|\d+ | START
123 | \d*\.\d+|\d+ | OK
123 | \d*\.\d+|\d+ | OK
123 | \d*\.\d+|\d+ | OK
123 | \d*\.\d+|\d+ | FAIL => backtrack
123 | \d*\.\d+|\d+ | FAIL => backtrack
123 | \d*\.\d+|\d+ | FAIL => backtrack
123 | \d*\.\d+|\d+ | go to the next alternative
123 | \d*\.\d+|\d+ | OK
123 | \d*\.\d+|\d+ | OK
123 | \d*\.\d+|\d+ | OK => SUCCESS
if you want to match the four cases, you can use:
\.\d+|\d+(?:\.\d*)?
(+) If the number doesn't begin with a dot, the first alternative fails immediately and the second alternative matches all other cases. Backtracking is kept to a minimum.
(-) If only a few numbers start with a dot, the first alternative is still tested, and fails, every time, although it fails quickly (for the same reason as above). In this case, it is better to use \d+(?:\.\d*)?|\.\d+
Obviously, if you want to support negative values you need to add -?:
-?(?:\.\d+|\d+(?:\.\d*)?)
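As a rough illustration of the final pattern, the sketch below pulls the numbers out of a WKT-like string. It assumes Python's re module, and both the sample string and the extract_numbers helper are invented for this example rather than taken from the question:

```python
import re

# The signed pattern from the answer: optional minus, then either ".digits"
# or "digits" with an optional fractional part.
NUMBER = re.compile(r"-?(?:\.\d+|\d+(?:\.\d*)?)")


def extract_numbers(text: str) -> list[float]:
    """Return every number found in the text, converted to float."""
    return [float(token) for token in NUMBER.findall(text)]


# Hypothetical, complete sample (the question's examples are elided).
print(extract_numbers("POLYGON((51.124 -3.973, 51.1 -3.012))"))
```

Because the only group in the pattern is non-capturing, findall returns the full text of each match, which float() can parse directly, including forms such as "2." and ".2".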
I have data as follows - please see the fiddle here for all data and code below:
INSERT INTO t VALUES
('|0|34| first zero'),
('|45|0| second zero'),
('|0|0| both zeroes');
I want to SELECT from the start of the line
1st character in the line is a pipe (|)
next characters are a valid (possibly negative - one minus sign) INTEGER
after the valid INT, another pipe
then another valid INT
then a pipe
The rest of the line can be anything at all - including sequences with pipe, INT, pipe, INT - but these are not to be SELECTed!
and I'm using a regex to try and SELECT the valid INTEGERs. A single ZERO is also a valid reading - one ZERO and one ZERO only!
The valid integers must be from between the first 3 pipe (|) characters and not elsewhere in the line - i.e.
^|3|3|adfasfadf |555|6666| -- tuple (3, 3) is valid
but
^|--567|-765| adfasdf -- tuple (--567, -765) is invalid - two minus signs!
and
^|This is stuff.... |34|56| -- tuple (34, 56) is invalid - doesn't start pipe, int, pipe, int!
Now, my regexes (so far) are as follows:
SELECT
SUBSTRING(a, '^\|(0{1}|[-+]?[1-9]{1}\d*)\|') AS n1,
SUBSTRING(a, '^\|[-+]?[1-9]{1}\d*\|(0{1}|[-+]?[1-9]{1}\d*)\|') AS n2,
a
FROM t;
and the results I'm getting for my 3 records of interest are:
n1 n2 a
0 NULL |0|34| first zero -- don't want NULL, want 34
45 0 |45|0| second zero -- OK!
0 NULL |0|0| both zeroes -- don't want NULL, want 0
3 3 |3|3| some stuff here
...
... other data snipped - but working OK!
...
Now, the reason why it works for the middle one is that I have (0{1}|.... other parts of the regex in both the upper and lower one!
So, that means take 1 and only 1 zero OR... the other parts of the regex. Fine, I've got that much!
However, and this is the crux of my problem, when I try to change:
'^\|[-+]?[1-9]{1}\d*\|(0{1}|[-+]?[1-9]{1}\d*)\|'
to
'^\|0{1}|[-+]?[1-9]{1}\d*\|(0{1}|[-+]?[1-9]{1}\d*)\|'
Notice the 0{1}| bit I've added near the beginning of my regex - so, this should allow one and only one ZERO at the beginning of the second string (preceded by a pipe literal (|)) OR the rest... the pipe at the end of my 5 character snippet above in this case being part of the regex.
But the result I get is unchanged for the first 3 records - shown above, but it now messes up many records further down - one example a record like this:
|--567|-765|A test of bad negatives...
which obviously fails (NULL, NULL) in the first SELECT now returns (NULL,-765) for the second. If the first fails, I want the second to fail!
I'm at a loss to understand why adding 0{1}|... should have this effect, and I'm also at a loss to understand why my (0, NULL), (45, 0) and (0, NULL) don't give me (0, 0), (45, 0) and (0, 0) as I would expect?
The 0{1}| snippet appears to work fine in the capturing groups, but not outside - is this the problem? Is there a problem with PostgreSQL's regex implementation?
All I did was add a bit to the regex which said as well as what you've accepted before, please accept one and only one leading ZERO!
I have a feeling there's something about regexes I'm missing - so my question is as follows:
could I please receive an explanation as to what's going on with my regex at the moment?
could I please get a corrected regex that will work for INTEGERs as I've indicated. I know there are alternatives, but I'd like to get to the bottom of the mistake I'm making here and, finally
is there an optimum/best method to achieve what I want using regexes? This one was sort of cobbled together and then added to as further necessary conditions became clearer.
I would want any answer(s) to work with the fiddle I've supplied.
Should you require any further information, please don't hesitate to ask! This is not a simple "please give me a regex for INTs" question - my primary interest is in fixing this one to gain understanding!
Some simplifications can be made to the patterns:
SELECT
SUBSTRING(a, '^\|(0|[+-]?[1-9][0-9]*)\|[+-]?[0-9]+\|') AS n1,
SUBSTRING(a, '^\|[+-]?[0-9]+\|(0|[+-]?[1-9][0-9]*)\|') AS n2,
a
FROM t;
n1 | n2 | a
:--- | :--- | :--------------------------------------------------------------
0 | 34 | |0|34| first zero
45 | 0 | |45|0| second zero
0 | 0 | |0|0| both zeroes
3 | 3 | |3|3| some stuff here
null | null | |SE + 18.5D some other stuff
-567 | -765 | |-567|-765|A test of negatives...
null | null | |--567|-765|A test of bad negatives...
null | null | |000|00|A test of zeroes...
54 | 45 | |54|45| yet more stuff
32 | 23 | |32|23| yet more |78|78| stuff
null | null | |This is more text |11|111|22222||| and stuff |||||||
null | null | |1 1|1 1 1|22222|
null | null | |71253412|ahgsdfhgasfghasf
null | null | |aadfsd|34|Fails if first fails - deliberate - Unix philosophy!
db<>fiddle here
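On the "what is going on" part of the question: alternation (|) has the lowest precedence in a regex, so the 0{1}| added outside any group splits the whole expression into two alternatives, and only the first one is still anchored by ^. The sketch below shows this with Python's re module rather than PostgreSQL (the precedence rule is the same in both, but other details of the two engines may differ). GROUPED is a hypothetical variant that keeps the alternation inside a non-capturing group; it is not the pattern used in the fiddle:

```python
import re

# The modified pattern from the question. The top-level | makes it read as
#   ^\|0{1}   OR   [-+]?[1-9]{1}\d*\|(0{1}|[-+]?[1-9]{1}\d*)\|
# so the second alternative may match anywhere in the line.
BROKEN = re.compile(r"^\|0{1}|[-+]?[1-9]{1}\d*\|(0{1}|[-+]?[1-9]{1}\d*)\|")

# Hypothetical fix: the "0 or signed integer" choice for the first field is
# wrapped in (?:...) so the ^ anchor applies to the whole pattern.
GROUPED = re.compile(r"^\|(?:0|[-+]?[1-9]\d*)\|(0|[-+]?[1-9]\d*)\|")


def second_int(pattern: re.Pattern, line: str):
    """Return the text captured for the second integer, or None."""
    match = pattern.search(line)
    return match.group(1) if match else None


for line in ("|0|34| first zero", "|--567|-765|A test of bad negatives..."):
    print(line, second_int(BROKEN, line), second_int(GROUPED, line))
```

With BROKEN, the first line matches only "|0" through the first alternative, so the capture group never takes part, and the second line matches "-567|-765|" starting in the middle of the string.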
I'm trying to make a regex that fits this need:
"a" < Match group 1
"b" < Match group 3
"a-b" < Match group 1, 2 and 3
"-" < No match
"ab" < No match
I was trying to make something like (a?)(-b?) but obviously this doesn't work like I want.
Edit:
Using a real example to explain better:
Regex I tried to use: /remind (me|him)? about (this|that)?/gm
Text | Should match?
"remind me" | Yes
"remind me about this" | Yes
"remind me about that" | Yes
"remind me about error" | No
"remind him about this" | Yes
"remind about" | NO
"remind this" | Yes
"remind error" | No
"remind me" | Yes
Edit explaining the reason:
I need this regex to split the data in fields, like "Who will be reminded?" "What is the reminder text?"
remind me about this
Person: me
Thing: this
remind me
Person: me
Thing: missing
remind that
Person: missing
Thing: that
remind me this
Error
remind about this
Error
Not totally sure, but maybe this helps. At least it satisfies your requirements: matches "a", "b", "a-b", and doesn't match ab and ba.
((a)-(b))|(?:\b(a)(?:[^b]|\b))|(?:(?:[^a]|\b))(b)\b
https://regex101.com/r/YOa83X/1/
A systematic approach is to use a branch reset group, (?|...), with one branch per sentence structure.
Each branch contains a different set of elements.
Because group numbering restarts in every branch, the Person is always in group 1 and the Thing in group 2.
If either group is unset or empty, that element was not present.
remind[ ](?|(him)[ ]about[ ](this)|(me)(?:[ ]about[ ](th(?:at|is)))?|()(th(?:at|is)))
https://regex101.com/r/DPfvs0/1
If there is no branch reset available, the capture groups can be paired in
increments of 2, i.e. Person / Thing:
1 & 2
3 & 4
5 & 6
Just see which pair matched.
remind[ ](?:(him)[ ]about[ ](this)|(me)(?:[ ]about[ ](th(?:at|is)))?|()(th(?:at|is)))
https://regex101.com/r/xJUFbQ/1
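Python's re module has no branch reset, so the paired-group variant can be sketched there. The parse_reminder helper and the use of fullmatch (to reject trailing text such as "remind me this") are assumptions of this example, not part of the answer above:

```python
import re
from typing import Optional, Tuple

# The pattern without branch reset: groups (1,2), (3,4), (5,6) are the
# Person / Thing pairs of the three branches.
REMIND = re.compile(
    r"remind[ ](?:(him)[ ]about[ ](this)"
    r"|(me)(?:[ ]about[ ](th(?:at|is)))?"
    r"|()(th(?:at|is)))"
)


def parse_reminder(text: str) -> Optional[Tuple[Optional[str], Optional[str]]]:
    """Return (person, thing), with None for a missing part, or None on error."""
    match = REMIND.fullmatch(text)
    if match is None:
        return None
    # Find the pair that took part in the match.
    for person_idx in (1, 3, 5):
        person = match.group(person_idx)
        thing = match.group(person_idx + 1)
        if person is not None or thing is not None:
            # The empty () group of the third branch is reported as missing.
            return (person or None, thing or None)
    return None


for text in ("remind me about this", "remind me", "remind that", "remind me this"):
    print(text, "->", parse_reminder(text))
```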
I'm trying to write a regex that checks if string contains 6 or more signs including 1 or more special sign [^0-9a-zA-Z\s] and 1 or more [0-9a-zA-Z].
Spent like 2h and not getting any closer :/
Maybe this is of some help:
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?!.*\s).{6,13}$
Password expression that requires one lowercase letter, one uppercase letter, one digit, a length of 6-13, and no spaces.
Matches:
1agdH*$# | 1agdC*$# | 1agdB*$#
Non-Matches:
wyrn%#*&$# f | mbndkfh782 | BNfhjdhfjd&*)%#$)
This is based on the Regex Lib entry here
Taking the style of Hasson's answer . . .
grep -P '^(?=.*[^a-zA-Z0-9\s])(?=.*[a-zA-Z0-9])(?!.*\s).{6}'
6 or more chars (regexp not ended with $)
1 or more special char (?=.*[^0-9a-zA-Z\s])
1 or more (?=.*[0-9a-zA-Z])
no whitespace (?!.*\s)
Some test data, NO match:
password
pa5sword
pa5sWord
pa5sWord
password
test
1agdA
1agd
wyrn%#*&$# f
mbndkfh782
t1*$
Some test data, YES match:
pa5*Word
pa5*Word
pa5*Word1
pa5*Wor
1agdA*
1agdA*$
1agdA*$#
1agdA*$#1
1agdA*$#12
1agdA*$#123
1agdA*$#a
1agdA*$#ab
1agdA*$#abc
1agdA*$#abcd
BNfhjdhfjd&*)%#$)
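The same pattern can be tried outside grep. This is a minimal sketch assuming Python's re module, whose lookahead syntax matches the PCRE syntax used by grep -P for this pattern; the is_valid helper is a made-up name:

```python
import re

# Same pattern as the grep -P command above: one special character, one
# alphanumeric character, no whitespace, and at least 6 characters.
CHECK = re.compile(r"^(?=.*[^a-zA-Z0-9\s])(?=.*[a-zA-Z0-9])(?!.*\s).{6}")


def is_valid(candidate: str) -> bool:
    """True if the candidate satisfies all four conditions."""
    return CHECK.search(candidate) is not None


for candidate in ("password", "pa5*Word", "t1*$", "wyrn%#*&$# f"):
    print(candidate, is_valid(candidate))
```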
I googled a lot, but I'm stuck.
There is a cool thing in HTML5, required patterns. It's great for emails / phones / dates validation. I use it in my small project for checking numbers. What I need is a pattern for:
YYYY.ordernumber
Order number may be any number from 1 to 1000000.
I tried to modify some YYYY.MM patterns for my case, but with no luck. Whatever I type in does not pass the validation.
Can anyone please help?
UPDATE: Added a lookahead to ensure 'ordernumber' is > 0 (thanks to M42's remark in comments).
You can use those two attributes with your <input>:
pattern="^[0-9]{4}\.(?!0+$)([0-9]{1,6}|1000000)$"
required
E.g.
<input type="text" placeHolder="YYYY.ordernumber" title="YYYY.ordernumber"
pattern="^[0-9]{4}\.(?!0+$)([0-9]{1,6}|1000000)$" required />
See, also, this short demo.
Short explanation of the regex:
^[0-9]{4}\.(?!0+$)([0-9]{1,6}|1000000)$

^            match the beginning of the string
[0-9]{4}     match exactly 4 digits
\.           match a dot (*1)
(?!0+$)      negative lookahead (*2)
[0-9]{1,6}   match 1 to 6 digits
|1000000     or match 1000000
$            match the end of the string
(*1): '.' is a special character in regex, so it has to be escaped ('\.').
(*2): This is a negative lookahead, which does not consume any characters, but looks ahead and makes sure that the rest of the string does not consist of zeros only.
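To try the pattern outside a browser, here is a minimal sketch in Python. It assumes re.fullmatch as a stand-in for the whole-value matching that the pattern attribute performs, and is_valid_order is a hypothetical helper name:

```python
import re

# Same pattern as in the pattern attribute, without ^ and $ because
# fullmatch already requires the whole value to match.
ORDER = re.compile(r"[0-9]{4}\.(?!0+$)([0-9]{1,6}|1000000)")


def is_valid_order(value: str) -> bool:
    """True for YYYY.ordernumber with an order number from 1 to 1000000."""
    return ORDER.fullmatch(value) is not None


for value in ("2013.1", "2013.1000000", "2013.0", "2013.1000001"):
    print(value, is_valid_order(value))
```

Note that the pattern accepts leading zeros in the order number, so a value such as 2013.000001 passes.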
Just for the sake of completeness:
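I must point out that [0-9] matches only the digits 0-9. In regex flavors where \d is Unicode-aware (for example .NET, or Python 3 on str patterns), \d also matches other digit characters, such as Eastern Arabic numerals (٠١٢٣٤٥٦٧٨٩). In JavaScript, which the pattern attribute uses, \d is equivalent to [0-9]; to match other digits there, \p{Nd} can be used instead, since current browsers compile the pattern with Unicode support.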
I am working on a Rails 3 application that needs to validate the password based on the following criteria: must be at least 6 characters and include one number and one letter.
Here is my Regex:
validates :password, :format => {:with => /^[([a-z]|[A-Z])0-9_-]{6,40}$/, message: "must be at least 6 characters and include one number and one letter."}
Right now if I put in a password of (for ex: dogfood) it will pass. But what I need it to do is to pass the criteria above.
I am not all that great at regex, so any and all help is greatly appreciated!
Use lookahead assertions:
/^(?=.*[a-zA-Z])(?=.*[0-9]).{6,}$/
(?=.*[a-zA-Z])   Look ahead for an arbitrary string followed by a letter.
(?=.*[0-9])      Look ahead for an arbitrary string followed by a number.
.{6,}            Ensure there are at least 6 characters.
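Technically you don't need the anchors in this case, but it's a good habit to use them. In Ruby, ^ and $ match at line boundaries, so \A and \z are the safer choice when the whole string must match.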
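The lookahead approach can be tried quickly outside Rails. The sketch below uses Python's re module instead of Ruby, with fullmatch standing in for whole-string anchors; is_acceptable is a made-up helper name:

```python
import re

# Two lookaheads (a letter somewhere, a digit somewhere), then at least
# 6 characters. fullmatch requires the whole password to match.
PASSWORD = re.compile(r"(?=.*[a-zA-Z])(?=.*[0-9]).{6,}")


def is_acceptable(password: str) -> bool:
    """True if the password has 6+ characters, a letter and a digit."""
    return PASSWORD.fullmatch(password) is not None


for password in ("dogfood", "dogf00d", "a1b2", "123456"):
    print(password, is_acceptable(password))
```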