PostgreSQL - tricky regular expression - what am I missing?

I have data as follows - please see the fiddle here for all data and code below:
INSERT INTO t VALUES
('|0|34| first zero'),
('|45|0| second zero'),
('|0|0| both zeroes');
I want to SELECT from the start of the line
1st character in the line is a pipe (|)
next characters are a valid (possibly negative - one minus sign) INTEGER
after the valid INT, another pipe
then another valid INT
then a pipe
The rest of the line can be anything at all - including sequences with pipe, INT, pipe, INT - but these are not to be SELECTed!
and I'm using a regex to try and SELECT the valid INTEGERs. A single ZERO is also a valid reading - one ZERO and one ZERO only!
The valid integers must be from between the first 3 pipe (|) characters and not elsewhere in the line - i.e.
^|3|3|adfasfadf |555|6666| -- tuple (3, 3) is valid
but
^|--567|-765| adfasdf -- tuple (--567, -765) is invalid - two minus signs!
and
^|This is stuff.... |34|56| -- tuple (34, 56) is invalid - doesn't start pipe, int, pipe, int!
Now, my regexes (so far) are as follows:
SELECT
SUBSTRING(a, '^\|(0{1}|[-+]?[1-9]{1}\d*)\|') AS n1,
SUBSTRING(a, '^\|[-+]?[1-9]{1}\d*\|(0{1}|[-+]?[1-9]{1}\d*)\|') AS n2,
a
FROM t;
and the results I'm getting for my 3 records of interest are:
n1   | n2   | a
0    | NULL | |0|34| first zero      -- don't want NULL, want 34
45   | 0    | |45|0| second zero     -- OK!
0    | NULL | |0|0| both zeroes      -- don't want NULL, want 0
3    | 3    | |3|3| some stuff here
...
... other data snipped - but working OK!
...
Now, the reason why it works for the middle one is that I have (0{1}|.... other parts of the regex in both the upper and lower one!
So, that means take 1 and only 1 zero OR... the other parts of the regex. Fine, I've got that much!
However, and this is the crux of my problem, when I try to change:
'^\|[-+]?[1-9]{1}\d*\|(0{1}|[-+]?[1-9]{1}\d*)\|'
to
'^\|0{1}|[-+]?[1-9]{1}\d*\|(0{1}|[-+]?[1-9]{1}\d*)\|'
Notice the 0{1}| bit I've added near the beginning of my regex - so, this should allow one and only one ZERO at the beginning of the second string (preceded by a pipe literal (|)) OR the rest... the pipe at the end of my 5 character snippet above in this case being part of the regex.
But the result I get is unchanged for the first 3 records - shown above, but it now messes up many records further down - one example a record like this:
|--567|-765|A test of bad negatives...
which obviously fails (NULL, NULL) in the first SELECT now returns (NULL,-765) for the second. If the first fails, I want the second to fail!
I'm at a loss to understand why adding 0{1}|... should have this effect, and I'm also at a loss to understand why my (0, NULL), (45, 0) and (0, NULL) don't give me (0, 0), (45, 0) and (0, 0) as I would expect?
The 0{1}| snippet appears to work fine in the capturing groups, but not outside - is this the problem? Is there a problem with PostgreSQL's regex implementation?
All I did was add a bit to the regex which said as well as what you've accepted before, please accept one and only one leading ZERO!
I have a feeling there's something about regexes I'm missing - so my question is as follows:
could I please receive an explanation as to what's going on with my regex at the moment?
could I please get a corrected regex that will work for INTEGERs as I've indicated. I know there are alternatives, but I'd like to get to the bottom of the mistake I'm making here and, finally
is there an optimum/best method to achieve what I want using regexes? This one was sort of cobbled together and then added to as further necessary conditions became clearer.
I would want any answer(s) to work with the fiddle I've supplied.
Should you require any further information, please don't hesitate to ask! This is not a simple "please give me a regex for INTs" question - my primary interest is in fixing this one to gain understanding!

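The patterns can be simplified: the {1} quantifiers are redundant, and each pattern only needs a loose [+-]?[0-9]+ check on the field it does not capture.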
SELECT
SUBSTRING(a, '^\|(0|[+-]?[1-9][0-9]*)\|[+-]?[0-9]+\|') AS n1,
SUBSTRING(a, '^\|[+-]?[0-9]+\|(0|[+-]?[1-9][0-9]*)\|') AS n2,
a
FROM t;
n1 | n2 | a
:--- | :--- | :--------------------------------------------------------------
0 | 34 | |0|34| first zero
45 | 0 | |45|0| second zero
0 | 0 | |0|0| both zeroes
3 | 3 | |3|3| some stuff here
null | null | |SE + 18.5D some other stuff
-567 | -765 | |-567|-765|A test of negatives...
null | null | |--567|-765|A test of bad negatives...
null | null | |000|00|A test of zeroes...
54 | 45 | |54|45| yet more stuff
32 | 23 | |32|23| yet more |78|78| stuff
null | null | |This is more text |11|111|22222||| and stuff |||||||
null | null | |1 1|1 1 1|22222|
null | null | |71253412|ahgsdfhgasfghasf
null | null | |aadfsd|34|Fails if first fails - deliberate - Unix philosophy!
db<>fiddle here
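On the "what is going on" part of the question: | has the lowest precedence of any regex operator, so the ungrouped 0{1}| near the start splits the whole pattern into two top-level alternatives, ^\|0 and the unanchored remainder. The first alternative matches |0 without ever reaching the capture group (hence NULL), and the second can match anywhere in the line (hence -765 for the bad-negatives row). Below is a minimal sketch of that behaviour. It uses Python's re module as a stand-in for PostgreSQL's regex engine, which is an assumption: the two differ in details, but not in alternation precedence. The helper second_int is hypothetical and only mimics SUBSTRING returning the first parenthesized group.

```python
import re

# The pattern from the question: the top-level | splits it into
#   alternative 1:  ^\|0{1}
#   alternative 2:  [-+]?[1-9]{1}\d*\|(0{1}|[-+]?[1-9]{1}\d*)\|   (not anchored!)
BROKEN = r'^\|0{1}|[-+]?[1-9]{1}\d*\|(0{1}|[-+]?[1-9]{1}\d*)\|'

# Wrapping the first field in a non-capturing group keeps the alternation local.
FIXED = r'^\|(?:0|[-+]?[1-9]\d*)\|(0|[-+]?[1-9]\d*)\|'


def second_int(pattern, line):
    """Return capture group 1 of the first match, or None (standing in for NULL)."""
    m = re.search(pattern, line)
    return m.group(1) if m else None


rows = [
    '|0|34| first zero',
    '|45|0| second zero',
    '|0|0| both zeroes',
    '|--567|-765|A test of bad negatives...',
]
for row in rows:
    print(repr(row), second_int(BROKEN, row), second_int(FIXED, row))
```

PostgreSQL's advanced regular expressions also accept (?:...), so the grouped form should carry over to SUBSTRING, but check it against the fiddle rather than taking this sketch as proof.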

converting CFG to regular expression

Here's a CFG that generates strings of 0s, 1s, or 0s and 1s arranged like this (001, 011) where one of the characters must have a bigger count than the other like in 00011111 or 00000111 for example.
S → 0S1 | 0A | 0 | 1B | 1
A → 0A | 0
B → 1B | 1
I tried converting it to regular expression using this guide but I got stuck here since I have trouble converting 0S1 given that anything similar to it can't be found in that guide.
S → 0S1 | 0+ | 0 | 1+ | 1
A → 0A | 0 = 0+
B → 1B | 1 = 1+
One of my previous attempts is 0+0+1|0+1+1|1+|0+ but it doesn't accept strings I mentioned above like 00011111 and 00000111.
Plug and Play
^(?!01$)(?!0011$)(?!000111$)(?!00001111$)(?=[01]{1,8}$)0*1*$
You cannot translate this grammar exactly into a regular expression, but you can get close by ensuring that the input does not contain an equal number of 0s and 1s. The pattern above handles inputs of up to 8 digits.
How it works
^ first you start from the beginning of a line
(?!01$) ensure that the characters are not 01
(?!0011$) ensure that the characters are not 0011
the same for 000111 and 00001111
then ensure that there are from 1 to 8 zeroes and ones (this is needed, to ensure that the input is not made of more digits like 000000111111, because their symmetry is not verified)
then match these zeroes and ones till the end of the line
for longer inputs you need to add more text, for up to 10 digits it is this: ^(?!01$)(?!0011$)(?!000111$)(?!00001111$)(?!0000011111$)(?=[01]{1,10}$)0*1*$ (you jump by 2 by adding one more symmetry validation)
it is not possible by other means with regular expressions alone, see the explanation.
Explanation
A and B are easy: as you saw, they are 0+ and 1+. The alternatives in S after the first one are also easy: 00+, 0, 11+ and 1, which together reduce to (0+|1+). The problem is the first alternative, 0S1.
So the problem can be reduced to S = 0S1. This grammar is recursive, but neither left-linear nor right-linear. To recognize an input for this grammar you need to "remember" how many 0s you found so that you can match the same number of 1s, but the finite-state machines that are built from regular grammars (and often from regular expressions) have no computation history. They consist only of states and transitions; the machine "jumps" from one state to another and does not remember the "path" it traveled over the transitions.
For this reason you need more powerful machinery (like the push-down automaton) that can be constructed from a context-free grammar (as yours).
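To make the difference concrete, here is a small sketch comparing the bounded regex from above with a membership test that simply counts. It assumes my reading of the grammar's language, namely strings 0^i 1^j with i different from j and at least one character; the function name in_language is made up for the illustration.

```python
import re
from itertools import product

# The length-limited pattern from the answer (inputs of 1 to 8 digits).
BOUNDED = re.compile(
    r'^(?!01$)(?!0011$)(?!000111$)(?!00001111$)(?=[01]{1,8}$)0*1*$'
)


def in_language(s):
    """Counting test: some 0s followed by some 1s, with unequal counts."""
    zeros = len(s) - len(s.lstrip('0'))   # number of leading 0s
    rest = s[zeros:]                      # must consist of 1s only
    return s != '' and set(rest) <= {'1'} and zeros != len(rest)


# Compare the two on every binary string of length 0 to 8.
mismatches = [
    ''.join(bits)
    for n in range(9)
    for bits in product('01', repeat=n)
    if bool(BOUNDED.match(''.join(bits))) != in_language(''.join(bits))
]
print(mismatches)

# Beyond the bound the regex gives up, while counting still works.
print(in_language('000000111'), bool(BOUNDED.match('000000111')))
```

The counting version needs one unbounded counter, which is exactly the piece of memory a finite-state machine does not have.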

How to remove repeated words or phrases within the same string

I am working with a string variable response in Stata. This variable stores complete sentences, and many of these sentences have repeated phrases.
For example:
how do you know how do you know what it is?
it was during the during the past thirty days
well well I would hope I would hope that they're doing that
I want to clean these strings by removing all repeated phrases.
In other words, I want to transform this sentence:
how do you know how do you know what it is?
to the one below:
how do you know what it is?
So far, I have tried to fix each case individually, but this is incredibly time-consuming as there are thousands of repeated words/phrases.
I would like to run code that can identify when a phrase is repeated within the same observation / string, and then remove one instance of that phrase (or word).
I imagine regular expressions would help, but I cannot figure out much more than this.
The following works for me:
clear
input str80 string
"Pearly Spencer how do you know how do you know what it is?"
"it was during the during the past thirty days"
"well well I would hope I would hope that they're doing that"
"well well they're doing that I would hope I would hope "
"well well I would hope I would hope that they're doing that but but they don't"
end
clonevar wanted = string
local stop = 0
while `stop' == 0 {
generate dup = ustrregexs(2) if ustrregexm(wanted, "(\W|^)(.+)\s\2")
replace wanted = subinstr(wanted, dup, "", 1)
capture assert dup == ""
if _rc == 0 local stop = 1
else drop dup
}
replace wanted = strtrim(stritrim(wanted))
list wanted
+----------------------------------------------------------+
| wanted |
|----------------------------------------------------------|
1. | Pearly Spencer how do you know what it is? |
2. | it was during the past thirty days |
3. | well I would hope that they're doing that |
4. | well they're doing that I would hope |
5. | well I would hope that they're doing that but they don't |
+----------------------------------------------------------+
The above solution uses a regular expression to first identify a repeated word or phrase. It then removes the first occurrence of that phrase from the string by replacing it with an empty string.
Because this particular regular expression does not find all sets in one pass (for example in the last observation there are three sets - well, I would hope and but), the process is repeated using a while loop until no repeated elements remain in the string.
In the final step, all unnecessary spaces are deleted to bring the string back to shape.
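For readers without Stata, the same loop can be sketched in Python. This is only an illustration under assumptions: it reuses the pattern from the Stata code with Python's re module (whose matching may differ in corner cases from the engine behind ustrregexm), and remove_repeats is a made-up name.

```python
import re

# Same pattern as in the Stata code: group 2 is a phrase that is
# immediately repeated after a single whitespace character.
DUP = re.compile(r'(\W|^)(.+)\s\2')


def remove_repeats(text):
    """Drop one copy of a repeated phrase per pass until none remain."""
    while True:
        m = DUP.search(text)
        if m is None:
            break
        # Like subinstr(wanted, dup, "", 1): remove the first occurrence only.
        text = text.replace(m.group(2), '', 1)
    # Like strtrim(stritrim(...)): collapse runs of spaces and trim the ends.
    return ' '.join(text.split())


print(remove_repeats("it was during the during the past thirty days"))
print(remove_repeats("well well I would hope I would hope that they're doing that"))
```

As in the Stata version, the removal targets the first occurrence of the phrase anywhere in the string, so a phrase that also appears inside an earlier word could be removed from there instead.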

Regex for well-known text

I am looking at regexes to validate and parse well-known text, which is a format used to transfer spatial data and looks like:
POLYGON((51.124 -3.973, 51.1 -3.012, ....))
or
MULTIPOLYGON(((POLYGON((51.124 -3.973, 51.1 -3.012, ....)),POLYGON((50.14 -13.973, 51.1 -13.012, ....))
among other variations.
There is a good answer here: Parsing a WKT-file which uses the regex:
\d+(?:\.\d*)?
From other places I have also seen
\d*\.\d+|\d+
and
(\d*\.)?\d+
These all seem to do the same thing, but it got me wondering about the relative workings of these 3 regexes, and if there are any performance issues or subtleties under the hood to be aware of.
To be clear, I am aware that there are libraries for parsing WKT in various languages. My question is purely about the relative behavior of number extracting regexes.
It depends on which number formats you need to allow. For example:
format 1: 22
format 2: 22.2
format 3: .2
format 4: 2.
the 1st pattern \d+(?:\.\d*)? matches formats 1, 2 and 4
the 2nd pattern \d*\.\d+|\d+ matches formats 1, 2 and 3
the 3rd pattern (\d*\.)?\d+ matches formats 1, 2 and 3 (and has an unneeded capturing group)
Note: patterns 2 and 3 are slower to succeed than the first when the number is an integer, because they must match all the digits while looking for a dot, backtrack to the start, and then match the same digits a second time (see the trace below).
str | pattern | state
-----+----------------+-----------------------------
123 | \d*\.\d+|\d+ | START
123 | \d*\.\d+|\d+ | OK
123 | \d*\.\d+|\d+ | OK
123 | \d*\.\d+|\d+ | OK
123 | \d*\.\d+|\d+ | FAIL => backtrack
123 | \d*\.\d+|\d+ | FAIL => backtrack
123 | \d*\.\d+|\d+ | FAIL => backtrack
123 | \d*\.\d+|\d+ | go to the next alternative
123 | \d*\.\d+|\d+ | OK
123 | \d*\.\d+|\d+ | OK
123 | \d*\.\d+|\d+ | OK => SUCCESS
if you want to match the four cases, you can use:
\.\d+|\d+(?:\.\d*)?
(+) If the number doesn't begin with a dot, the first alternative fails immediately and the second alternative matches all the other cases. Backtracking is kept to a minimum.
(-) If only a few numbers start with a dot, the first alternative is still tested first and fails each time, although it fails quickly (on the first character). In that case, it is better to use \d+(?:\.\d*)?|\.\d+
Obviously, if you want to support negative values you need to add -?:
-?(?:\.\d+|\d+(?:\.\d*)?)
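The coverage claims above are easy to check mechanically. The sketch below uses Python's re.fullmatch on the four sample formats; the dictionary keys and the accepted helper are just labels for this illustration, and it says nothing about performance.

```python
import re

PATTERNS = {
    'pattern 1': r'\d+(?:\.\d*)?',
    'pattern 2': r'\d*\.\d+|\d+',
    'pattern 3': r'(\d*\.)?\d+',
    'combined': r'\.\d+|\d+(?:\.\d*)?',
}
SAMPLES = ['22', '22.2', '.2', '2.']   # formats 1 to 4


def accepted(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in SAMPLES if re.fullmatch(pattern, s)]


for name, pattern in PATTERNS.items():
    print(name, accepted(pattern))

# The signed variant also accepts a leading minus.
SIGNED = r'-?(?:\.\d+|\d+(?:\.\d*)?)'
print(bool(re.fullmatch(SIGNED, '-3.973')))
```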

Constructing finite state automata corresponding to regular expressions. Are my solutions correct?

I have drawn my answers in paint, are they correct?
(4c) For the alphabet {0, 1} construct finite state automata corresponding to each of the following regular expressions:
(i) 0
(ii) 1 | 0
(iii) 0 * (1 | 0)
The first two are correct, although the first one might be able to be written as (depending on your convention)
(0) -- 0 --> ((1))
The last one is also correct, but can be simplified to (whenever you have ε appearing, there is likely to be a way to compress the edges and states together to remove it)
+- 0 -+
| |
v |
(0) ---+
/ \
1 0
\ /
v
((1))
(Excuse my ascii diagrams. I'm using (..) for each state, and ((..)) for final states.)
Notice that the 0* is basically a loop from a state to itself, since after reading a 0 the remaining regex to match is the same (as long as we aren't at the end of a string).
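If you want to check an automaton like this without drawing it, you can simulate it. The sketch below encodes my reading of the simplified diagram above (start state 0 with a loop on 0, and edges on 0 and on 1 into final state 1) as a nondeterministic transition table and compares it with the regex on all short inputs. The names DELTA and accepts are made up for the illustration.

```python
import re
from itertools import product

# Transition table: (state, symbol) -> set of next states.
DELTA = {
    (0, '0'): {0, 1},   # stay in the loop, or treat this 0 as the last symbol
    (0, '1'): {1},      # a 1 must be the last symbol
}
START, FINAL = 0, {1}


def accepts(word):
    """Track every state the automaton could be in after each symbol."""
    states = {START}
    for symbol in word:
        states = set().union(*(DELTA.get((q, symbol), set()) for q in states))
    return bool(states & FINAL)


# Compare with the regular expression on all words of length 0 to 6.
disagreements = [
    ''.join(w)
    for n in range(7)
    for w in product('01', repeat=n)
    if accepts(''.join(w)) != bool(re.fullmatch(r'0*(1|0)', ''.join(w)))
]
print(disagreements)
```

An empty list only shows agreement on the words tried, not equivalence in general, but it catches most drawing mistakes.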

Regex for: 6 digits or 0-6 signs (digits or stars) with at least one star

How to write regex to validate this pattern?
123456 - correct
*1 - correct
1* - correct
124** - correct
*1*2 - correct
* - correct
123456* - incorrect (size 7)
12345 - incorrect (size 5 without stars)
tried:
^[0-9]{6}$|^(([0-9]){1,6}([*]){1,5}){1,6}+$
But it allows more than 6 digits and does not allow a star to come before a digit.
There is no minimum/maximum count of "*" sign (but max count for all signs is 6).
Here you go:
^(?:\d{6}|(?=.*\*)[\d*]{1,6}|)$
Here is what it does:
^ <-- Start of the string (we don't want to capture more than that)
(?: <-- Start a non captured group (it will be used to do the "or" part)
\d{6} <-- 6 digits, nothing more
| <-- OR
(?=.*\*) <-- Look ahead for a '*' (you could replace the first * with {0,5})
[\d*] <-- digits or '*'
{1,6} <-- repeated one to six times (we know from the lookahead that there will be at least one '*')
| <-- OR (nothing)
) <-- End the non capturing group
$ <-- End of the string
I'm not quite sure if you want the empty case (but you said 0 to 6), if you actually want 1 to 6 just remove the last |
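A quick way to convince yourself is to run the pattern against the examples from the question. This sketch uses Python's re module, which is an assumption since the question does not name a regex flavour; is_valid is a made-up helper name.

```python
import re

PATTERN = re.compile(r'^(?:\d{6}|(?=.*\*)[\d*]{1,6}|)$')


def is_valid(text):
    """True if the whole input satisfies the pattern."""
    return PATTERN.match(text) is not None


# Inputs from the question, with the verdicts the asker expects.
expected = {
    '123456': True,
    '*1': True,
    '1*': True,
    '124**': True,
    '*1*2': True,
    '*': True,
    '123456*': False,   # 7 characters
    '12345': False,     # 5 characters and no star
}
for text, verdict in expected.items():
    print(repr(text), is_valid(text), verdict)
```

With the trailing | in place, the empty string is accepted as well.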
/ ([0-9] {6} ) | ( ( [0-9]{0-5} & [*]{1-5} ) {0-6})/
something like this?
[1-6]{6}|([1-6]|\*){1,6}[^123456]
this works for the inputs you gave...
If you want something else then update me...
You can't do this with just a regex. You also need a length check. However, here is a regex that will help.
([\d*]*\*[\d*]*)|(\d{6})
To validate the input, try something like this:
validate(input)
{
regex = "([\d*]*\*[\d*]*)|(\d{6})";
digitregex = ".*\d.*"; // this makes sure they aren't all stars
return (input.length < 7 and regex.matches(input) and digitregex.matches(input))
}
I am afraid that you will have to try for each position that the * might have, like this:
/([0-9]{6}|\*[0-9][0-9\*]{0,4}|[0-9]\*[0-9\*]{0,4}|[0-9]{2}\*[0-9\*]{0,3}|[0-9]{3}\*[0-9\*]{0,2}|[0-9]{4}\*[0-9\*]?|[0-9]{5}\*)/
Edit:
The above solution will however not allow **2
And I was wrong. You can do it with a lookahead, as Colin did. That is the way to go.
Try this : (updated)
([0-6]{6})|([0-6\*]{1,6})
It should work...
if any digits 0..9 are allowed try this regexp [0-9*]{2,6}
if only digits 1..6 as in your example [1-6*]{2,6}
it's a bit tricky cause also 12345 will be validated as correct
example here
You'll actually need a solution with lookaround, as already suggested by @Colin