I'm trying to build a predicate that, given a list (e.g. ['1', '2', '3', '.', ... , '2']), recognises it as an IP string in the format NNN.NNN.NNN.NNN, where N is a digit (0 to 9), and returns true if it is correctly formatted. It must also accept shorter blocks, such as N.N.N.N or NNN.NNN.N.N and other combinations.
This is what I've done so far
test([X,Y,Z,D | String]) :- digit(X), digit(Y), digit(Z), dot(D), test(String).
Here digit is a list of facts like digit('0'). from 0 to 9, and dot is a fact defined as dot('.'). My code lacks modularity (it can't recognise a single N). Can you help me achieve this?
Thank you!
Welcome to SO. I would do it like this:
Separate the input list into 4 "blocks", with the dot as the delimiter, and accept only blocks of 1-3 digit elements. Doing this in Prolog can be a bit tricky because of the constraints on the length of the numbers (1-3 digits) and on the dots (exactly 3 dots). So my approach would look like this (basic but a bit twisted):
test(In):-
digits123(In,['.'|R1]),
digits123(R1,['.'|R2]),
digits123(R2,['.'|R3]),
digits123(R3,[]).
digits123([A,B,C|R],R):-
digit(A),
digit(B),
digit(C).
digits123([A,B|R],R):-
digit(A),
digit(B).
digits123([A|R],R):-
digit(A).
What does it do? digits123/2 takes a list and checks whether its first three, two, or one elements are digits. The rest of the list is unified with the second argument, so the predicate is a test for the digits and a generator/test for the remaining list. On its own, digits123/2 would be ambiguous, since three digits could also be read as two digits or one digit followed by more input. The way it is called resolves this: first digits123/2 is called on the input list In and returns the remaining elements, but this remainder has to start with a dot (['.'|R1]). Written like this, R1 is the remaining list without the dot delimiter. The same step is repeated twice more to check the second and third blocks. The last block is different: it has one to three digits but no dot (or anything else) after it. In other words, no elements may remain (empty list, []).
Tested with SWISH:
?- test(['1', '2', '3', '.', '2', '3', '.', '2', '.' , '2']).
true;
false.
?- test(['1', '2', '3', '.', '2', '3', '.', '2', '.' ]).
false.
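For comparison, the same block-by-block idea can be sketched outside Prolog. The Python function below is only an illustration (the name looks_like_ip is made up here, it is not part of the answer above): it splits the character list on dots and accepts exactly four blocks of one to three digits.

```python
def looks_like_ip(chars):
    """Return True if chars (a list of single characters) has the shape
    N.N.N.N with one to three decimal digits per block."""
    blocks = "".join(chars).split(".")
    return len(blocks) == 4 and all(
        1 <= len(block) <= 3 and all(c in "0123456789" for c in block)
        for block in blocks
    )

print(looks_like_ip(list("123.23.2.2")))  # True
print(looks_like_ip(list("123.23.2.")))   # False
```

Like the Prolog version, this checks only the shape of the input, not that each block is at most 255.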
The easiest way is to use a DCG grammar to specify what the dotted quad looks like. An IPv4 address is 4 octets, expressed in decimal, and separated by '.'. We can express that as a DCG rule like this:
dotted_quad --> octet, dot, octet, dot, octet, dot, octet.
The rule for a . is trivial:
dot --> ['.'].
The rule(s) for an octet are more complicated. An octet is constrained to be in the domain 0–255. Represented in decimal it can be
0-9:
A single decimal digit 0–9
10-99:
Two decimal digits, the first in the domain 1–9, the second in the domain 0–9.
100-199:
Three decimal digits. The 1st is 1 and the second and third are in the domain 0–9.
200-249:
Three decimal digits. The first is 2, the second has the domain 0–4, and the third, 0–9.
250–255:
Three decimal digits. The first is 2, the second is 5, and the third is constrained to 0–5.
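As a cross-check of the five cases listed above, here is a rough Python sketch that encodes each case as one regular-expression alternative. The names OCTET, DOTTED_QUAD and is_dotted_quad are chosen for this illustration only; it works on a plain string rather than a list of characters.

```python
import re

# One alternative per case in the list above:
# 250-255, 200-249, 100-199, 10-99, 0-9.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])"
DOTTED_QUAD = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")

def is_dotted_quad(text):
    """True if the whole string is four octets (0-255) separated by dots."""
    return DOTTED_QUAD.fullmatch(text) is not None

print(is_dotted_quad("123.234.78.9"))  # True
print(is_dotted_quad("256.1.1.1"))     # False
```

As in the case list, leading zeros such as "01" are rejected.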
We can either write a lot of special rules to cover this, or add some context to the rule for a decimal digit. Let's add some context to that rule (because why type more than you need to):
digit(D) --> [D].
digit(Min,Max) --> [D], { D @>= Min, D @=< Max }.  % compare chars by standard order; #>= only works on integers
These let us say things like digit('0','5') to indicate that a digit in the domain 0-5 is allowed, or digit('2') to say that only the digit 2 is allowed.
Now that we have that we can define the rules for an octet:
octet --> digit('0','9') .
octet --> digit('1','9') , digit('0','9').
octet --> digit('1') , digit('0','9') , digit('0','9').
octet --> digit('2') , digit('0','4') , digit('0','9').
octet --> digit('2') , digit('5') , digit('0','5').
Now we can define our test predicate. All it needs to do is invoke the DCG we defined:
is_dotted_quad(Cs) :- phrase(dotted_quad, Cs).
But since typing something like
[ '1','2','3', '.', '2','3','4', '.', '7','8', '.', '9' ]
is tedious and error-prone, let's add some conveniences to let us use strings or atoms:
is_dotted_quad(X) :- atom(X) , atom_chars(X,Cs) , is_dotted_quad(Cs) .
is_dotted_quad(X) :- string(X) , string_chars(X,Cs) , is_dotted_quad(Cs) .
is_dotted_quad(Cs) :- phrase(dotted_quad, Cs).
Putting it all together, you get this:
is_dotted_quad(X) :- atom(X) , atom_chars(X,Cs) , is_dotted_quad(Cs).
is_dotted_quad(X) :- string(X) , string_chars(X,Cs) , is_dotted_quad(Cs).
is_dotted_quad(Cs) :- phrase(dotted_quad, Cs).
dotted_quad --> octet, dot, octet, dot, octet, dot, octet.
dot --> ['.'].
octet --> digit('0','9') .
octet --> digit('1','9') , digit('0','9') .
octet --> digit('1') , digit('0','9') , digit('0','9').
octet --> digit('2') , digit('0','4') , digit('0','9').
octet --> digit('2') , digit('5') , digit('0','5').
digit(V) --> [V].
digit(Min,Max) --> [D], { D @>= Min, D @=< Max }.  % compare chars by standard order; #>= only works on integers
You can fiddle with it at https://swish.swi-prolog.org/p/dotted-quad-dcg.pl
Alternatively, if you don't want to use a DCG, then this riff off #Dudas's answer will work as well — but one should be aware that this is essentially what the above DCG is (DCG being an essentially simple bit of syntactic sugar on top of vanilla prolog), just a bit less declarative of intent:
is_dotted_quad(X) :- atom(X) , atom_chars(X,Cs) , is_dotted_quad(Cs).
is_dotted_quad(X) :- string(X) , string_chars(X,Cs) , is_dotted_quad(Cs).
is_dotted_quad(Cs) :- octet(Cs,T1),
dot(T1,T2), octet(T2,T3),
dot(T3,T4), octet(T4,T5),
dot(T5,T6), octet(T6,[]).
chars( [] , [] ) .
chars( [C|Cs] , [C|Cs] ) .
chars( X , Cs ) :- atom(X) , atom_chars(X,Cs) .
chars( X , Cs ) :- string(X) , string_chars(X,Cs) .
dot(['.'|T],T).
octet( [ Z | T ], T ) :- range(Z,'0','9') .
octet( [ Y , Z | T ], T ) :- range(Y,'1','9') , range(Z,'0','9') .
octet( [ '1' , Y , Z | T ], T ) :- range(Y,'0','9') , range(Z,'0','9') .
octet( [ '2' , Y , Z | T ], T ) :- range(Y,'0','4') , range(Z,'0','9') .
octet( [ '2' , '5' , Z | T ], T ) :- range(Z,'0','5') .
range(X,Min,Max) :- X @>= Min, X @=< Max .  % standard-order comparison, since the elements are chars
I am trying to write a regex to parse out seven match objects: four numbers and three operands:
Individual lines in the file look like this:
[ 9] -21 - ( 12) - ( -5) + ( -26) = ______
The number in brackets is the line number which will be ignored. I want the four integer values, (including the '-' if it is a negative integer), which in this case are -21, 12, -5 and -26. I also want the operands, which are -, - and +.
I will then take those values (match objects) and actually compute the answer:
-21 - 12 - -5 + -26 = -54
I have this:
[\s+0-9](-?[0-9]+)
In Pythex it grabs the [ 9] but it also then grabs every integer in separate match objects (four additional match objects). I don't know why it does that.
If I add a ? to the end: [\s+0-9](-?[0-9]+)? thinking it will only grab the first integer, it doesn't. I get seventeen matches?
I am trying to say, via the regex: Grab the line number and its brackets (that part works), then grab the first integer including sign, then the operand, then the next integer including sign, then the next operand, etc.
It appears that I have failed to explain myself clearly.
The file has hundreds of lines. Here is a five line sample:
[ 1] 19 - ( 1) - ( 4) + ( 28) = ______
[ 2] -18 + ( 8) - ( 16) - ( 2) = ______
[ 3] -8 + ( 17) - ( 15) + ( -29) = ______
[ 4] -31 - ( -12) - ( -5) + ( -26) = ______
[ 5] -15 - ( 12) - ( 14) - ( 31) = ______
The operands are only '-' or '+', but any combination of those three may appear in a line. The integers will all be from -99 to 99, but that shouldn't matter if the regex works. The goal (as I see it) is to extract seven match objects: four integers and three operands, then add the numbers exactly as they appear. The number in brackets is just the line number and plays no role in the computation.
Much luck with regex, if you just need the result:
import re
s = "[ 9] -21 - ( 12) - ( -5) + ( -26) = ______"
s = s[s.find("]")+1:s.find("=")]  # cut away line nr and = ...
# weak attempt to prevent Python code injection; "-" comes first in the
# class so it is a literal, not a range such as "+-0" (which lets , . / through)
if not re.sub(r"[-+0-9() ]*", "", s):
    print(eval(s))
else:
    print("wonky chars inside, only numbers, +, - , space and () allowed.")
Output:
-54
Make sure to read the eval() documentation and have a look at:
https://opensourcehacker.com/2014/10/29/safe-evaluation-of-math-expressions-in-pure-python/
https://softwareengineering.stackexchange.com/questions/311507/why-are-eval-like-features-considered-evil-in-contrast-to-other-possibly-harmfu/311510
https://www.kevinlondon.com/2015/07/26/dangerous-python-functions.html
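If eval() is a concern, one alternative is to parse the expression with the standard ast module and evaluate only the node types that are expected. The following is a minimal sketch under that assumption; the function name safe_sum is made up for this example.

```python
import ast
import operator

# Only these operations are allowed; anything else raises ValueError.
_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def safe_sum(expression):
    """Evaluate an expression made of integers, +, - and parentheses."""
    def walk(node):
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    # strip() because ast.parse does not accept leading whitespace here
    return walk(ast.parse(expression.strip(), mode="eval").body)

line = "[ 9] -21 - ( 12) - ( -5) + ( -26) = ______"
print(safe_sum(line[line.find("]") + 1:line.find("=")]))  # -54
```

Function calls, names, and attribute access never reach an evaluator this way, because they are rejected while walking the tree.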
Example for hundreds of lines:
import re
import random

s = "[ 9] -21 - ( 12) - ( -5) + ( -26) = ______"

def calcIt(line):
    s = line[line.find("]")+1:line.find("=")]
    # "-" comes first in the class so it is a literal, not a range
    if not re.sub(r"[-+0-9() ]*", "", s):
        return eval(s)
    else:
        print(line + " has wonky chars inside, only numbers, +, - , space and () allowed.")
        return None

# generate 1000 test lines in the same format
random.seed(42)
pattern = "[ {}] -{} - ( {}) - ( -{}) + ( -{}) = "
for n in range(1000):
    nums = [n]
    nums.extend([random.randint(0,100), random.randint(-100,100), random.randint(-100,100),
                 random.randint(-100,100)])
    c = pattern.format(*nums)
    print(c, calcIt(c))
Ahh... I had a cup of coffee and sat down in front of Pythex again.
I figured out the correct regex:
[\s+0-9]\s+(-?[0-9]+)\s+([-|+])\s+\(\s+(-?[0-9]+)\)\s+([-|+])\s+\(\s+(-?[0-9]+)\)\s+([-|+])\s+\(\s+(-?[0-9]+)\)
Yields:
-21
-
12
-
-5
+
-26
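For completeness, here is how the seven captured groups could be turned into the answer in Python. This is a sketch, not the pattern above: it escapes the brackets around the line number and uses \s* so that the amount of spacing does not matter. The names LINE and compute are invented for the example.

```python
import re

# Line number in brackets, then four signed integers and three operators.
LINE = re.compile(
    r"\[\s*\d+\]\s*(-?\d+)"
    r"\s*([-+])\s*\(\s*(-?\d+)\s*\)"
    r"\s*([-+])\s*\(\s*(-?\d+)\s*\)"
    r"\s*([-+])\s*\(\s*(-?\d+)\s*\)"
)

def compute(line):
    """Return the value of one worksheet line, or None if it does not match."""
    m = LINE.search(line)
    if m is None:
        return None
    total = int(m.group(1))
    # groups 2, 4, 6 are the operators; groups 3, 5, 7 are the numbers
    for op, num in zip(m.group(2, 4, 6), m.group(3, 5, 7)):
        total += int(num) if op == "+" else -int(num)
    return total

print(compute("[ 9] -21 - ( 12) - ( -5) + ( -26) = ______"))  # -54
```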
I have a string s where "substrings" are divided by a pipe. Substrings might or might not contain numbers. And I have a test character string n that contains a number and might or might not contain letters. See example below. Note that spacing can be any
I'm trying to drop all substrings where n is not in a range or is not an exact match. I understand that I need to split by -, convert to numbers, and compare low/high to n converted to numeric. Here's my starting point, but then I got stuck with getting the final good string out of unl_new.
s = "liquid & bar soap 1.0 - 2.0oz | bar 2- 5.0 oz | liquid soap 1-2oz | dish 1.5oz"
n = "1.5oz"
unl = unlist(strsplit(s,"\\|"))
unl_new = (strsplit(unl,"-"))
unl_new = unlist(gsub("[a-zA-Z]","",unl_new))
Desired output:
"liquid & bar soap 1.0 - 2.0oz | liquid soap 1-2oz | dish 1.5oz"
Am I completely on the wrong path? Thanks!
Here is an option using base R:
## extract the n numeric
nn <- as.numeric(gsub("[^0-9|. ]", "", n))
## keep only numeric and -( for interval)
## and split by |
## for each interval test the condition to create a boolean vector
contains_n <- sapply(strsplit(gsub("[^0-9|. |-]", "", s),'[|]')[[1]],
function(x){
yy <- strsplit(x, "-")[[1]]
yy <- as.numeric(yy[nzchar(yy)])
## the condition
(length(yy)==1 && yy==nn) || length(yy)==2 && nn >= yy[1] && nn <= yy[2]
})
## split again and use the boolean factor to remove the parts
## that don't respect the condition
## paste the result using collapse to get a single character again
paste(strsplit(s,'[|]')[[1]][contains_n],collapse='|')
## [1] "liquid & bar soap 1.0 - 2.0oz | liquid soap 1-2oz | dish 1.5oz"
Don't know if it is general enough, but you might try:
require(stringr)
splitted<-strsplit(s,"\\|")[[1]]
ranges<-lapply(strsplit(
str_extract(splitted,"[0-9\\.]+(\\s*-\\s*[0-9\\.]+|)"),"\\s*-\\s*"),
as.numeric)
tomatch<-as.numeric(str_extract(n,"[0-9\\.]+"))
paste(splitted[
vapply(ranges, function(x) (length(x)==1 && x==tomatch) || (length(x)==2 && findInterval(tomatch,x)==1),TRUE)],
collapse="|")
#[1] "liquid & bar soap 1.0 - 2.0oz | liquid soap 1-2oz | dish 1.5oz"
Here's a method starting from your unl step using stringr:
library(stringr)
unl = unlist(strsplit(s,"\\|"))
n2 <- as.numeric(gsub("[[:alpha:]]*", "", n))
num_lst <- str_extract_all(unl, "\\d\\.?\\d*")
indx <- lapply(num_lst, function(x) {
if(length(x) == 1) {isTRUE(all.equal(n2, as.numeric(x)))
} else {n2 >= as.numeric(x[1]) & n2 <= as.numeric(x[2])}})
paste(unl[unlist(indx)], collapse=" | ")
[1] "liquid & bar soap 1.0 - 2.0oz | liquid soap 1-2oz | dish 1.5oz"
I also tested it with other amounts like "2.3oz". With n2 we coerce n to numeric for comparison. The variable num_lst isolates the numbers from the character string.
With indx we apply our comparisons over the string numbers. If there is one number, we check whether it equals n2. I chose not to use the basic == operator to avoid any rounding issues. Instead isTRUE(all.equal(x, y)) is used. If there are two numbers, we check whether n2 lies between them.
Finally, the logical index variable indx is used to subset the character string to extract the matches and paste them together with a pipe "|".
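The same split, extract, compare, and paste steps can be written compactly in other languages too. Below is a rough Python sketch of the logic (keep_matching, NUMBER and RANGE are names invented for this illustration). It compares floats directly rather than with a tolerance, which is an assumption that holds for short decimal values like these.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"
# A single number, optionally followed by "- number" to form a range.
RANGE = re.compile(rf"({NUMBER})(?:\s*-\s*({NUMBER}))?")

def keep_matching(s, n):
    """Keep the pipe-separated parts of s whose number or range contains n."""
    target = float(re.search(NUMBER, n).group())
    kept = []
    for part in s.split("|"):
        m = RANGE.search(part)
        if m is None:
            continue  # no number in this part: drop it
        low = float(m.group(1))
        high = float(m.group(2)) if m.group(2) else low
        if low <= target <= high:
            kept.append(part.strip())
    return " | ".join(kept)

s = "liquid & bar soap 1.0 - 2.0oz | bar 2- 5.0 oz | liquid soap 1-2oz | dish 1.5oz"
print(keep_matching(s, "1.5oz"))
```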
I am attempting to write elements from a nested list to individual lines in a file, with each element separated by tab characters. Each of the nested lists is of the following form:
('A', 'B', 'C', 'D')
The final output should be of the form:
A B C D
E F G H
. . . .
. . . .
However, my output seems to have reproducible inconsistencies such that the output is of the general form:
A B C D
E F G H
I J K L
M N O P
. . . .
. . . .
I've inspected the lists before writing and they seem identical in form. The code I'm using to write is:
with open("letters.txt", 'w') as outfile:
    outfile.writelines('\t'.join(line) + '\n' for line in letter_list)
Importantly, if I replace '\t' with, for example, '|', the file is created without such inconsistencies. I know whitespace parsing can become an issue for certain file I/O operations, but I don't know how to troubleshoot it here.
Thanks for the time.
EDIT: Here is some actual input data (in nested-list form) and output:
IN
[('5', '+', '5752624-5752673', 'alt_region_8161'), ('1', '+', '621461-622139', 'alt_region_67'), ('1', '+', '453907-454063', 'alt_region_60'), ('1', '+', '539611-539815', 'alt_region_61'), ('4', '+', '14610049-14610103', 'alt_region_6893'), ('4', '+', '14610049-14610144', 'alt_region_6895'), ('4', '+', '14610049-14610144', 'alt_region_6897'), ('4', '+', '14610049-14610144', 'alt_region_6896')]
OUT
4 + 12816011-12816087 alt_region_6808
1 + 21214720-21214747 alt_region_2377
4 + 9489968-9490833 alt_region_7382
1 + 12121545-12126263 alt_region_650
4 + 9489968-9490811 alt_region_7381
4 + 12816011-12816087 alt_region_6807
1 + 2032338-2032740 alt_region_157
5 + 4695084-4695628 alt_region_9316
1 + 22294677-22295134 alt_region_2424
1 + 22294677-22295139 alt_region_2425
1 + 22294677-22295139 alt_region_2426
1 + 22294677-22295139 alt_region_2427
1 + 22294677-22295134 alt_region_2422
1 + 22294677-22295134 alt_region_2423
1 + 22294384-22295198 alt_region_2428
1 + 22294384-22295198 alt_region_2429
5 + 20845105-20845211 alt_region_9784
5 + 20845105-20845206 alt_region_9783
3 + 2651447-2651889 alt_region_5562
EDIT: Thanks to everyone who commented. Sorry if the question was poorly phrased. I appreciate the help in clarifying the issue (or, apparently, non-issue).
There are no spaces (' ') in your output, only tabs ('\t').
>>> print(repr('1 + 21214720-21214747 alt_region_2377'))
'1\t+\t21214720-21214747\talt_region_2377'
  ^^ ^^                  ^^
Tabs are not equivalent to a fixed number of spaces (in most editors). Rather, they move the character following the tab to the next available multiple of x characters from the left margin, where x varies - x is most commonly 8, though it is 4 here on SO.
>>> for i in range(7):
...     print('x'*i+'\tx')
        x
x       x
xx      x
xxx     x
xxxx    x
xxxxx   x
xxxxxx  x
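The tab-stop rule can also be made visible without relying on how an editor or terminal renders tabs. The sketch below uses str.expandtabs, which replaces each tab with the number of spaces needed to reach the next multiple of the tab size; the helper column_of_second_x is a made-up name for this example.

```python
# Show the expanded strings with repr so the padding is visible.
for i in range(7):
    print(repr(("x" * i + "\tx").expandtabs(8)))

def column_of_second_x(prefix_len, tabsize):
    """Column at which the character after the tab lands."""
    return ("x" * prefix_len + "\tx").expandtabs(tabsize).rindex("x")

print(column_of_second_x(3, 8))  # 8
print(column_of_second_x(8, 8))  # 16
```

With a tab size of 8, every prefix of 0 to 7 characters pushes the next character to column 8, which is why columns line up for short fields and jump for longer ones.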
If you want your output to appear aligned to the naked eye, you should use string formatting:
>>> for line in data:
...     print('{:4} {:4} {:20} {:20}'.format(*line))
5    +    5752624-5752673      alt_region_8161
1    +    621461-622139        alt_region_67
1    +    453907-454063        alt_region_60
1    +    539611-539815        alt_region_61
4    +    14610049-14610103    alt_region_6893
4    +    14610049-14610144    alt_region_6895
4    +    14610049-14610144    alt_region_6897
4    +    14610049-14610144    alt_region_6896
Note, however, that this will not necessarily be readable by code that expects a tab-separated value file.
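If the file is meant for other programs rather than for reading by eye, keeping the tabs is the safer choice. As a small sketch (using the standard csv module, with io.StringIO standing in for a real file), writing and reading back tab-separated rows looks like this:

```python
import csv
import io

rows = [
    ("5", "+", "5752624-5752673", "alt_region_8161"),
    ("1", "+", "621461-622139", "alt_region_67"),
]

# Write tab-separated text; io.StringIO stands in for a real file here.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)

# Reading it back recovers the original fields exactly.
buffer.seek(0)
parsed = [tuple(row) for row in csv.reader(buffer, delimiter="\t")]
print(parsed == rows)  # True
```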
In some text editors, tabs are displayed like that. The contents of the file are correct, it's just a matter of how the file is displayed on screen. It happens with tabs but not with | which is why you don't see it happening when you use |.
I'm currently working on data migration in PostgreSQL. Since I'm new to posix regular expressions, I'm having some trouble with a simple pattern and would appreciate your help.
I want a regular expression that splits my table on each alphanumeric character in a column, e.g. when a column contains the string 'abc' I'd like to split it into 3 rows: ['a', 'b', 'c']. I need a regexp for that.
The second case is a little more complicated: I'd like to split an expression such as '105AB' into ['105A', '105B']. That is, I'd like to copy the numbers at the beginning of the string and split on the uppercase letters, in the end joining the number with exactly 1 uppercase letter.
The function I'll be using is probably regexp_split_to_table(string, regexp).
I'm intentionally providing very little data not to confuse anyone, since what I posted is the essence of the problem. If you need more information please comment.
The first was already solved by you:
select regexp_split_to_table(s, ''), i
from (values
('abc', 1),
('def', 2)
) s(s, i);
regexp_split_to_table | i
-----------------------+---
a | 1
b | 1
c | 1
d | 2
e | 2
f | 2
In the second case you don't say whether the numerics are always the first three characters. If they are:
select
left(s, 3) || regexp_split_to_table(substring(s from 4), ''), i
from (values
('105AB', 1),
('106CD', 2)
) s(s, i);
?column? | i
----------+---
105A | 1
105B | 1
106C | 2
106D | 2
For a variable number of numerics:
select n || a, i
from (
select
substring(s, '^\d{1,3}') n,
regexp_split_to_table(substring(s, '[A-Z]+'), '') a,
i
from (values
('105AB', 1),
('106CD', 2)
) s(s, i)
) s;
?column? | i
----------+---
105A | 1
105B | 1
106C | 2
106D | 2
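To sanity-check migrated rows outside the database, the same rule (leading digits copied onto each uppercase letter) can be sketched in a few lines of Python. The function name expand_code is invented here, and the sketch assumes the value consists of digits followed only by uppercase letters.

```python
import re

def expand_code(code):
    """Split e.g. '105AB' into ['105A', '105B']; return [] if the shape differs."""
    m = re.fullmatch(r"(\d+)([A-Z]+)", code)
    if m is None:
        return []
    digits, letters = m.groups()
    return [digits + letter for letter in letters]

print(expand_code("105AB"))  # ['105A', '105B']
print(expand_code("106CD"))  # ['106C', '106D']
```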
Let's imagine you have a string:
strLine <- "The transactions (on your account) were as follows: 0 3,000 (500) 0 2.25 (1,200)"
Is there a function that strips out the numbers into an array/vector producing the following required solution:
result <- c(0, 3000, -500, 0, 2.25, -1200)?
i.e.
result[3] = -500
Notice that the numbers are presented in accounting form, so negative numbers appear between (). Also, you can assume that only numbers appear to the right of the first occurrence of a number. I am not that good with regexp, so I would appreciate your help if one is required. Also, I don't want to assume the string is always the same, so I am looking to strip out all words (and any special characters) before the location of the first number.
library(stringr)
x <- str_extract_all(strLine,"\\(?[0-9,.]+\\)?")[[1]]
> x
[1] "0" "3,000" "(500)" "0" "2.25" "(1,200)"
Change the parens to negatives:
x <- gsub("\\((.+)\\)","-\\1",x)
x
[1] "0" "3,000" "-500" "0" "2.25" "-1,200"
And then as.numeric() or taRifx::destring to finish up (the next version of destring will support negatives by default so the keep option won't be necessary):
library(taRifx)
destring( x, keep="0-9.-")
[1] 0 3000 -500 0 2.25 -1200
OR:
as.numeric(gsub(",","",x))
[1] 0 3000 -500 0 2.25 -1200
Here's the base R way, for the sake of completeness...
x <- unlist(regmatches(strLine, gregexpr('\\(?[0-9,.]+', strLine)))
x <- as.numeric(gsub('\\(', '-', gsub(',', '', x)))
[1] 0.00 3000.00 -500.00 0.00 2.25 -1200.00
What worked perfectly for me when working on single strings in a data frame (one string per row in the same column) was the following:
library(taRifx)
DataFrame$Numbers<-as.character(destring(DataFrame$Strings, keep="0-9.-"))
The results are in a new column from the same data frame.
Since this came up in another question, this is an uncrutched stringi solution (vs the stringr crutch):
as.numeric(
stringi::stri_replace_first_fixed(
stringi::stri_replace_all_regex(
unlist(stringi::stri_match_all_regex(
"The transactions (on your account) were as follows: 0 3,000 (500) 0 2.25 (1,200)",
"\\(?[0-9,.]+\\)?"
)), "\\)$|,", ""
),
"(", "-"
)
)
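The extract, strip, and negate steps used in the answers above carry over directly to other languages. Here is a minimal Python sketch of the same idea (the name accounting_numbers is made up for this example); it assumes thousands separators are commas and that parentheses mark negative values.

```python
import re

# Optional "(", digits with optional comma groups and decimals, optional ")".
TOKEN = re.compile(r"\(?\d+(?:,\d{3})*(?:\.\d+)?\)?")

def accounting_numbers(text):
    """Extract numbers; values wrapped in parentheses are treated as negative."""
    values = []
    for token in TOKEN.findall(text):
        number = float(token.strip("()").replace(",", ""))
        values.append(-number if token.startswith("(") else number)
    return values

line = "The transactions (on your account) were as follows: 0 3,000 (500) 0 2.25 (1,200)"
print(accounting_numbers(line))  # [0.0, 3000.0, -500.0, 0.0, 2.25, -1200.0]
```

The parenthesised words "(on your account)" are skipped because the pattern requires a digit right after the optional opening parenthesis.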