How to combine the Output of Regex Findall in Pandas - regex

I'm exploring regex with pandas in a Jupyter notebook.
My goal is to extract house number additions from an address line, using a set of regex patterns.
I'm building upon this post: https://gist.github.com/christiaanwesterbeek/c574beaf73adcfd74997
and I use this for input from a .csv:
Afleveradres
Dorpstraat 2
Dorpstr. 2
Dorpstraat 2
Laan 1933 2
18 Septemberplein 12
Kerkstraat 42-f3
Kerk straat 2b
42nd street, 1337a
1e Constantijn Huigensstraat 9b
Maas-Waalweg 15
De Dompelaar 1 B
Kümmersbrucker Straße 2
Friedrichstädter Straße 42-46
Höhenstraße 5A
Saturnusstraat 60-75
Saturnusstraat 60 - 75
Plein \'40-\'45 10
Plein 1945 1
Steenkade t/o 56
Steenkade a/b Twee Gezusters
1, rue de l\'eglise
Herestraat 49 BOX1043
Maas-Waalweg 15 15
My goal is to extract the streetnames, housenumbers & housenumberadditions.
So far I basically use:
import pandas as pd
# get data
file_base_name = 'examples'
dfa = pd.read_csv(file_base_name + '.csv', sep=';')
# get number
dfa['num'] = dfa['Afleveradres'].str.extract(r"([,\s]+\d+)\s*", expand=False)
dfa['num'] = dfa['num'].str.strip()
# split leftover values into street & addition
# (regex=True is required in recent pandas versions, where str.replace is literal by default)
dfa['tmp'] = dfa['Afleveradres'].str.replace(r"([,\s]+\d+)\s*", ';', regex=True)
# new data frame with split value columns
new = dfa['tmp'].str.split(';', n=1, expand=True)
# street name: the part before the first separator
dfa['str'] = new[0]
# addition: the part after the first separator
dfa['add'] = new[1]
dfa.drop(['tmp'], axis=1, inplace=True)
which results in the following listing of street names, numbers & additions:
;Afleveradres;str;add;num
0;Dorpstraat 2;Dorpstraat;;2
1;Dorpstr. 2;Dorpstr.;;2
2;Dorpstraat 2;Dorpstraat;;2
3;Laan 1933 2;Laan;2;1933
4;18 Septemberplein 12;18 Septemberplein;;12
5;Kerkstraat 42-f3;Kerkstraat;-f3;42
6;Kerk straat 2b;Kerk straat;b;2
7;42nd street, 1337a;42nd street;a;, 1337
8;1e Constantijn Huigensstraat 9b;1e Constantijn Huigensstraat;b;9
9;Maas-Waalweg 15;Maas-Waalweg;;15
10;De Dompelaar 1 B;De Dompelaar;B;1
So far so good, for now.
Next, I'd like to correct for housenumber ranges, like '42-46' and '60 - 65'.
A re.findall returns expected values:
import re
def rem(str):
    pattern = r'[,#\'?\.$%_]'
    if re.match(pattern, str):
        tmp = 'Y'
    else:
        tmp = 'N'
    return tmp
def extract_numrange(row):
    r = ''+row['Afleveradres']
    num_range1 = re.findall(r'([,\s]+\d+\-+\d+)\s*|([,\s]+\d+\s+\-+\s+\d+)\s*',r)
    return num_range1
    # return rem(num_range1)
dfa['excep'] = dfa.apply(extract_numrange, axis=1)
dfa
output re.findall
15 Friedrichstädter Straße 42-46 Friedrichstädter Straße -46 42 [( 42-46, )]
16 Höhenstraße 5A Höhenstraße A 5 []
17 Saturnusstraat 60-75 Saturnusstraat -75 60 [( 60-75, )]
18 Saturnusstraat 60 - 75 Saturnusstraat -; 60 [(, 60 - 75)]
But how do I clean this output, from [( 42-46, )] and [(, 60 - 75)] into something like 42-46 and 60 - 75 in a new column?
Or are there better approaches for my question?

The problem comes from the fact there are two capturing groups. You need to re-vamp the pattern to use only a single capturing group, or get rid of the group altogether.
Your pattern is of the (Group1)\s*|(Group2)\s* type. As you see, all you need is to re-group the parts into (Group1|Group2)\s*.
So, the quickest fix is
([,\s]+\d+\-+\d+|[,\s]+\d+\s+\-+\s+\d+)\s*
See the regex demo.
However, I think you do not need the whitespaces on both ends. Then, move those patterns you do not want to capture out of the grouping:
[,\s]+(\d+\-+\d+|\d+\s+\-+\s+\d+)\s*
^^^^^^
See this regex demo.
Probably, you may reduce this even further to
[,\s](\d+(?:-+|\s+-+\s+)\d+)
See this regex demo, the (?:-+|\s+-+\s+) is a non-capturing group that won't result in additional tuple item.
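To illustrate the effect of a single capturing group, here is a minimal sketch using plain re on four of the sample addresses. The helper name first_range and the choice of an empty string for "no range found" are assumptions made for this example, not part of the original code.

```python
import re

# Four of the sample addresses from the question.
addresses = [
    "Friedrichstädter Straße 42-46",
    "Höhenstraße 5A",
    "Saturnusstraat 60-75",
    "Saturnusstraat 60 - 75",
]

# One capturing group only; (?:...) groups without capturing,
# so findall returns plain strings instead of tuples.
range_re = re.compile(r"[,\s](\d+(?:-+|\s+-+\s+)\d+)")

def first_range(address):
    """Return the first house number range, or '' if there is none (hypothetical helper)."""
    found = range_re.findall(address)
    return found[0] if found else ""

ranges = [first_range(a) for a in addresses]
print(ranges)  # ['42-46', '', '60-75', '60 - 75']
```

In pandas, the same pattern can be passed to Series.str.extract with expand=False, which should give one string (or NaN) per row for a pattern with a single capturing group.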

Extracting Multiple Blocks of Similar Text

I am trying to parse a report. The following is a sample of the text that I need to parse:
7605625112 DELIVERED N 1 GORDON CONTRACTORS I SIPLAST INC Freight Priority 2000037933 $216.67 1,131 ROOFING MATERIALS
04/23/2021 02:57 PM K WRISHT N 4 CAPITOL HEIGHTS, MD ARKADELPHIA, AR Prepaid 2000037933 -$124.23 170160-00
04/27/2021 12:41 PM 2 40 20743-3706 71923 $.00 055 $.00
2 WBA HOT $62.00 0
$12.92 $92.44
$167.36
7605625123 DELIVERED N 1 SECHRIST HALL CO SIPLAST INC Freight Priority 2000037919 $476.75 871 PAIL,UN1263,PAINT,3,
04/23/2021 02:57 PM S CHAVEZ N 39 HARLINGEN, TX ARKADELPHIA, AR Prepaid 2000037919 -$378.54
04/27/2021 01:09 PM 2 479 78550 71923 $.00 085 $95.35
2 HRL HOT $62.00 21
$13.55 $98.21
$173.76
The report consists of two or more blocks, each starting with "[0-9]{10}\sDELIVERED" and ending with the last currency string before the next block.
If I test with "(?s)([0-9]{10}\sDELIVERED)(.*)(?<=\$167.36\n)" I successfully get the first block, but if I use "(?s)([0-9]{10}\sDELIVERED)(.*)(?<=\$\d\d\d.\d\d\n)" it grabs everything.
If someone can show me the changes that I need to make to return two or more blocks I would greatly appreciate it.
* is a greedy operator, so it will try to match as many characters as possible. See also Repetition with Star and Plus.
For fixing it, you can use this regex:
(?s)(\d{10}\sDELIVERED)((.(?!\d{10}\sDELIVERED))*)(?<=\$\d\d\d.\d\d)
in which I basically replaced .* with (.(?!\d{10}\sDELIVERED))* so that for every character it checks if it is followed or not by \d{10}\sDELIVERED.
See a demo here
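As a rough check of the idea, the sketch below applies the same pattern in Python to a shortened version of the sample report. The shortened text is an assumption made to keep the example small, the dot before the cents is escaped here, and the groups are made non-capturing so that each match is the whole block.

```python
import re

# Shortened stand-in for the report; only the block headers and
# the trailing currency lines matter for this demonstration.
report = (
    "7605625112 DELIVERED N 1 GORDON CONTRACTORS $216.67 1,131 ROOFING MATERIALS\n"
    "04/23/2021 02:57 PM K WRISHT -$124.23 170160-00\n"
    "$12.92 $92.44\n"
    "$167.36\n"
    "7605625123 DELIVERED N 1 SECHRIST HALL CO $476.75 871 PAIL\n"
    "04/23/2021 02:57 PM S CHAVEZ -$378.54\n"
    "$13.55 $98.21\n"
    "$173.76\n"
)

block_re = re.compile(
    r"\d{10}\sDELIVERED"            # block header
    r"(?:.(?!\d{10}\sDELIVERED))*"  # any character not followed by the next header
    r"(?<=\$\d\d\d\.\d\d)",         # the block must end right after a currency amount
    re.DOTALL,
)

blocks = [m.group(0) for m in block_re.finditer(report)]
print(len(blocks))  # 2
```

Each element of blocks then starts at a "DELIVERED" header and ends at the last amount of that block.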

extracting a location from a list with regex

I have a list like so
x=['hello#thepowerhouse.group', 'ThePowerHouse\xa0 is a part of the House of ElektroCouture', 'Our Studio is located at Bikini Berlin Terrace Level, 2nd floor Budapester Str. 46 10787 Berlin', '\xa0', 'Office:\xa0+49 30 20837551', '\xa0', '\xa0']
I want to extract this element: 'Our Studio is located at Bikini Berlin Terrace Level, 2nd floor Budapester Str. 46 10787 Berlin'
Since I am doing this for several sites, I want to extract the element with regular expressions so it can work with others. I thought that I could grab the element by checking whether it has lowercase and uppercase letters, numbers, commas, and sometimes a period. This is what I attempted, but it didn't work.
import re
for element in x:
    if re.findall("([A-Za-z0-9,])",element)==True:
        print("match")
You can split up your rule into several simple regexes and test them in sequence instead of making some monster-expression.
import re
def is_location(text):
    """Returns True if text contains digits, uppercase and lowercase characters."""
    patterns = r'[0-9]', r'[a-z]', r'[A-Z]'
    return all(re.search(pattern, text) for pattern in patterns)
x = [
    'hello#thepowerhouse.group',
    'ThePowerHouse\xa0 is a part of the House of ElektroCouture',
    'Our Studio is located at Bikini Berlin Terrace Level, 2nd floor Budapester Str. 46 10787 Berlin',
    '\xa0', 'Office:\xa0+49 30 20837551', '\xa0', '\xa0'
]
print(next(filter(is_location, x)))
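As a side note on why the attempt in the question never prints anything: re.findall returns a list, and a list never compares equal to True. A small sketch (the sample string is taken from the question, the variable names are only illustrative):

```python
import re

text = "Budapester Str. 46 10787 Berlin"
matches = re.findall("([A-Za-z0-9,])", text)

# A list is never equal to True, even when it is non-empty.
compares_equal = (matches == True)

# Testing truthiness (or using re.search) is the usual way to ask "did it match?".
has_match = bool(matches)

print(compares_equal, has_match)  # False True
```

Note that this character class matches a single character, so it would succeed on nearly every element anyway, which is why testing several simple patterns together is the more selective check.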

Codeeval Challenge 230: Football, Answer Only Partially Correct

I am working on a relatively new challenge in CodeEval called 'Football.' The description is listed in the following link:
https://www.codeeval.com/open_challenges/230/
Inputs are lines of a file read by Python, and within each line there are lists separated by '|', with each list representing a country: the first being country "1", second being country "2", and so on.
1 2 3 4 | 3 1 | 4 1
19 11 | 19 21 23 | 31 39 29
Outputs are also lines in response to each line read from the file.
1:1,2,3; 2:1; 3:1,2; 4:1,3;
11:1; 19:1,2; 21:2; 23:2; 29:3; 31:3; 39:3;
so country 1 supports team 1, 2, and 3 as shown in the first line of output: 1:1,2,3.
Below is my solution. I have no clue why it only works for the two sample cases listed in the description link, so I'd like to ask for comments and hints on how to correct my code. Thank you very much in advance for your time and assistance.
import sys
def football(string):
    countries = map(str.split, string.split('|'))
    teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])))
    results = []
    for i in range(len(teams)):
        results.append([teams[i]+':'])
        for j in range(len(countries)):
            if teams[i] in countries[j]:
                results[i].append(str(j+1))
    for i in range(len(results)):
        results[i] = results[i][0]+','.join(results[i][1:])
    return '; '.join(results) + '; '
if __name__ == '__main__':
    lines = [line.rstrip() for line in open(sys.argv[1])]
    for line in lines:
        print football(line)
After deliberately failing an attempt in order to check out the complete test input and my output, I found the problem. The line:
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])))
will make the output problematic in terms of sorting. For example here's a sample input:
10 20 | 43 23 | 27 | 25 | 11 1 12 43 | 33 18 3 43 41 | 31 3 45 4 36 | 25 29 | 1 19 39 | 39 12 16 28 30 37 | 32 | 11 10 7
and it produces the output:
1:5,9; 10:1,12; 11:5,12; 12:5,10; 16:10; 18:6; 19:9; 20:1; 23:2; 25:4,8; 27:3; 28:10; 29:8; 3:6,7; 30:10; 31:7; 32:11; 33:6; 36:7; 37:10; 39:9,10; 4:7; 41:6; 43:2,5,6; 45:7; 7:12;
But the challenge expects the output teams to be sorted by numbers in ascending order, which is not achieved by the above-mentioned code as the numbers are in string format, not integer format. Therefore the solution is simply adding a key to sort the teams list by ascending order for integer:
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])), key=lambda x:int(x))
With a small change in this line, the code passes through the tests. A sample output looks like:
1:5,9; 3:6,7; 4:7; 7:12; 10:1,12; 11:5,12; 12:5,10; 16:10; 18:6; 19:9; 20:1; 23:2; 25:4,8; 27:3; 28:10; 29:8; 30:10; 31:7; 32:11; 33:6; 36:7; 37:10; 39:9,10; 41:6; 43:2,5,6; 45:7;
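The difference between the two orderings can be seen in isolation with a few team numbers (the short list below is made up for illustration):

```python
teams = ["10", "3", "1", "25", "4"]

# Strings sort character by character, so "10" comes before "3".
as_text = sorted(teams)

# Converting each item to int for the comparison gives numeric order.
as_numbers = sorted(teams, key=int)

print(as_text)     # ['1', '10', '25', '3', '4']
print(as_numbers)  # ['1', '3', '4', '10', '25']
```

Passing key=int directly is equivalent to key=lambda x: int(x).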
Please let me know if you have a better and more efficient solution to the challenge. I'd love to read better code or suggestions for improving my programming skills.
Here's how I solved it:
import sys
with open(sys.argv[1]) as test_cases:
    for test in test_cases:
        if test:
            team_supporters = {}
            for nation, nation_teams in enumerate(test.strip().split("|"), start=1):
                for team in map(int, nation_teams.split()):
                    team_supporters.setdefault(team, []).append(nation)
            print(*("{}:{};".format(team, ",".join(map(str, sorted(nations))))
                    for team, nations in sorted(team_supporters.items())))
The problem is not very complicated. We're given a mapping from nation (implicitly numbered by their order in the input) to a list of teams. We need to reverse that to create an output that maps from a team to a list of nations.
It seems natural to use a dictionary that maps in the same way as the desired output. We can use enumerate to give numbers to the nations as we iterate over them. The setdefault method of the dict adds empty lists to the dictionary as they are needed (using a collections.defaultdict instead of a regular dictionary would be another way to deal with this). We don't need to care about the order of the input, nor the order things are stored in the dictionary's inner lists.
We build the output using str.format calls and the default space separator of the print function. If the final semicolon weren't desired, I'd have used print("; ".join("{}:{}".format(...))) instead. Since the output needs to be sorted by team at the top level, and by nation in the inner lists, we make some sorted calls where necessary.
Sorting the inner lists is probably not even necessary, since the nations were processed in order, with their numbers derived from the order they had in the input line. Fortunately, Python's Timsort algorithm is very fast on already-sorted input, so even with a bit of unnecessary sorting, our code is still fast enough.
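For comparison, here is a sketch of the collections.defaultdict variant mentioned above, wrapped in a function so it can be tried on the sample line without a file. The function name football and its return-a-string design are assumptions made for this example; the inner lists are left unsorted because nations are appended in input order.

```python
from collections import defaultdict

def football(line):
    """Map each team to the nations that support it, in input order."""
    supporters = defaultdict(list)
    for nation, teams in enumerate(line.strip().split("|"), start=1):
        for team in map(int, teams.split()):
            # Missing keys get an empty list automatically.
            supporters[team].append(nation)
    return " ".join(
        "{}:{};".format(team, ",".join(map(str, nations)))
        for team, nations in sorted(supporters.items())
    )

result = football("1 2 3 4 | 3 1 | 4 1")
print(result)  # 1:1,2,3; 2:1; 3:1,2; 4:1,3;
```

Because the keys are integers, sorted(supporters.items()) orders the teams numerically, which avoids the string-sorting issue described earlier.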

substring characters from a column in a data.table in R

Is there a more "r" way to substring two meaningful characters out of a longer string from a column in a data.table?
I have a data.table that has a column with "degree strings"... shorthand code for the degree someone got and the year they graduated.
> srcDT <- data.table(
    alum=c("Paul Lennon","Stevadora Nicks","Fred Murcury"),
    degree=c("W72","WG95","W88")
  )
> srcDT
alum degree
1: Paul Lennon W72
2: Stevadora Nicks WG95
3: Fred Murcury W88
I need to extract the digits of the year from the degree, and put it in a new column called "degree_year"
No problem:
> srcDT[,degree_year:=substr(degree,nchar(degree)-1,nchar(degree))]
> srcDT
alum degree degree_year
1: Paul Lennon W72 72
2: Stevadora Nicks WG95 95
3: Fred Murcury W88 88
If only it were always that simple.
The problem is, the degree strings only sometimes look like the above. More often, they look like this:
srcDT <- data.table(
  alum=c("Ringo Harrison","Brian Wilson","Mike Jackson"),
  degree=c("W72 C73","WG95 L95","W88 WG90")
)
I am only interested in the 2 numbers next to the characters I care about: W & WG (and if both W and WG are there, I only care about WG)
Here's how I solved it:
x <- srcDT$degree ## grab just the degree column
z <- character() ## create an empty character vector
degree.grep.pattern <- c("WG[0-9][0-9]","W[0-9][0-9]")
## define a vector of regexes, in the order
## I want them
for(i in 1:length(x)){ ## loop thru all elements in degree column
  matched=F ## at the start of the loop, reset flag to F
  for(j in 1:length(degree.grep.pattern)){
    ## loop thru all elements of the pattern vector
    if(length(grep(degree.grep.pattern[j],x[i]))>0){
      ## see if you get a match
      m <- regexpr(degree.grep.pattern[j],x[i])
      ## if you do, great! grab the index of the match
      y <- regmatches(x[i],m) ## then subset down. y will equal "WG95"
      matched=T ## set the flag to T
      break ## stop looping
    }
    ## if no match, go on to next element in pattern vector
  }
  if(matched){ ## after finishing the loop, check if you got a match
    yr <- substr(y,nchar(y)-1,nchar(y))
    ## if yes, then grab the last 2 characters of it
  }else{
    # if you run thru the whole list and don't match any pattern at all, just
    # take the last two characters from the affiliation
    yr <- substr(x[i],nchar(as.character(x[i]))-1,nchar(as.character(x[i])))
  }
  z <- c(z,yr) ## add this result (95) to the character vector
}
srcDT$degree_year <- z ## set the column to the results.
> srcDT
alum degree degree_year
1: Ringo Harrison W72 C73 72
2: Brian Wilson WG95 L95 95
3: Mike Jackson W88 WG90 90
This works. 100% of the time. No errors, no mis-matches.
The problem is: it doesn't scale. Given a data table with 10k rows, or 100k rows, it really slows down.
Is there a smarter, better way to do this? This solution is very "C" to me. Not very "R."
Thoughts on improvement?
Note: I gave a simplified example. In the actual data, there are about 30 different possible combinations of degrees, and combined with different years, there are something like 540 unique combinations of degree strings.
Also, I gave the degree.grep.pattern with only 2 patterns to match. In the actual work I'm doing, there are 7 or 8 patterns to match.
Since it seems (per the OP's comments) that there is no "WG W" situation, a simple regex solution should do the job:
srcDT[ , degree_year := gsub(".*WG?(\\d+).*", "\\1", degree)]
srcDT
# alum degree degree_year
# 1: Ringo Harrison W72 C73 72
# 2: Brian Wilson WG95 L95 95
# 3: Mike Jackson W88 WG90 90
Here's a solution based on the assumption that you want the most recent degree with W in it:
regex <- "(?<=W|(?<=W)G)[0-9]{2}"
srcDT[ , degree_year :=
sapply(regmatches(degree,
gregexpr(regex, degree, perl = TRUE)),
function(x) max(as.integer(x)))]
> srcDT
alum degree degree_year
1: Ringo Harrison W72 C73 72
2: Brian Wilson WG95 L95 95
3: Mike Jackson W88 WG90 90
You said:
I gave the degree.grep.pattern with only 2 patterns to match. In the actual work I'm doing, there are 7 or 8 patterns to match.
But I'm not sure what this means. There are more options besides W and WG?
Here is one quick hack:
# split all words from degree and order so that WG is before W
words <- lapply(strsplit(srcDT$degree, " "), sort, decreasing=TRUE)
# obtain tags for each row (getting only first. But works since ordered)
tags <- mapply(Find, list(function(x) grepl("^WG|^W", x)), words)
# simple gsub to remove WG and W
(result <- gsub("^WG|^W", "", tags))
[1] "72" "95" "90"
Fast with 100k rows.
Here is a solution without regular expressions. It's quite slow because it creates a sparse table, but it's clean and flexible, so I'll leave it here.
First I split the degree-year strings by space, then loop through them and build a clean, structured table with one column per degree, which I fill with the years.
degreeyear_split <- sapply(srcDT$degree,strsplit," ")
for(i in 1:nrow(srcDT)){
  for (degree_year in degreeyear_split[[i]]){
    n <- nchar(degree_year)
    degree <- substr(degree_year,1,n-2)
    year <- substr(degree_year,n-1,n)
    srcDT[i,degree] <- year
  }
}
Now I have my structured table. I copy the W year into the column I'm interested in, then overwrite it with the WG year where one exists.
srcDT$year <- srcDT$W
srcDT$year[srcDT$WG!=""]<-srcDT$WG[srcDT$WG!=""]
Then here's your result:
srcDT
             alum   degree  W  C WG  L year
1: Ringo Harrison  W72 C73 72 73         72
2:   Brian Wilson WG95 L95       95 95   95
3:   Mike Jackson W88 WG90 88    90      90

Random Text generator based on regex [duplicate]

I would like to know if there is software that, given a regex and of course some other constraints like length, produces random text that always matches the given regex.
Thanks
Yes, software that can generate a random match to a regex:
Exrex, Python
Pxeger, Javascript
regex-genex, Haskell
Xeger, Java
Xeger, Python
Generex, Java
rxrdg, C#
String::Random, Perl
regldg, C
paggern, PHP
ReverseRegex, PHP
randexp.js, Javascript
EGRET, Python/C++
MutRex, Java
Fare, C#
rstr, Python
randexp, Ruby
goregen, Go
bfgex, Java
regexgen, Javascript
strgen, Python
random-string, Java
regexp-unfolder, Clojure
string-random, Haskell
Regexp::Genex, Perl
StringGenerator, Python
strrand, Go
regen, Go
Rex, C#
regexp-examples, Ruby
genex.js, JavaScript
genex, Go
Xeger is capable of doing it:
String regex = "[ab]{4,6}c";
Xeger generator = new Xeger(regex);
String result = generator.generate();
assert result.matches(regex);
All regular expressions can be expressed as context free grammars. And there is a nice algorithm already worked out for producing random sentences, from any CFG, of a given length. So upconvert the regex to a cfg, apply the algorithm, and wham, you're done.
If you want a Javascript solution, try randexp.js.
Check out the RandExp Ruby gem. It does what you want, though only in a limited fashion. (It won't work with every possible regexp, only regexps which meet some restrictions.)
It may be late, but this could help newcomers: Generex is a useful Java library that provides many features for using a regex to generate strings (random generation, generating a string based on its index, generating all strings, ...). Check it out here.
Example :
Generex generex = new Generex("[0-3]([a-c]|[e-g]{1,2})");
// generate the second String in lexicographical order that match the given Regex.
String secondString = generex.getMatchedString(2);
System.out.println(secondString);// it print '0b'
// Generate all String that matches the given Regex.
List<String> matchedStrs = generex.getAllMatchedStrings();
// Using Generex iterator
Iterator iterator = generex.iterator();
while (iterator.hasNext()) {
    System.out.print(iterator.next() + " ");
}
// it print 0a 0b 0c 0e 0ee 0e 0e 0f 0fe 0f 0f 0g 0ge 0g 0g 1a 1b 1c 1e
// 1ee 1e 1e 1f 1fe 1f 1f 1g 1ge 1g 1g 2a 2b 2c 2e 2ee 2e 2e 2f 2fe 2f 2f 2g
// 2ge 2g 2g 3a 3b 3c 3e 3ee 3e 3e 3f 3fe 3f 3f 3g 3ge 3g 3g 1ee
// Generate random String
String randomStr = generex.random();
System.out.println(randomStr);// a random value from the previous String list
We did something similar in Python not too long ago for a RegEx game that we wrote. We had the constraint that the regex had to be randomly generated, and the selected words had to be real words. You can download the completed game EXE here, and the Python source code here.
Here is a snippet:
def generate_problem(level):
    keep_trying = True
    while(keep_trying):
        regex = gen_regex(level)
        # print 'regex = ' + regex
        counter = 0
        match = 0
        notmatch = 0
        goodwords = []
        badwords = []
        num_words = 2 + level * 3
        if num_words > 18:
            num_words = 18
        max_word_length = level + 4
        while (counter < 10000) and ((match < num_words) or (notmatch < num_words)):
            counter += 1
            rand_word = words[random.randint(0,max_word)]
            if len(rand_word) > max_word_length:
                continue
            mo = re.search(regex, rand_word)
            if mo:
                match += 1
                if len(goodwords) < num_words:
                    goodwords.append(rand_word)
            else:
                notmatch += 1
                if len(badwords) < num_words:
                    badwords.append(rand_word)
        if counter < 10000:
            new_prob = problem.problem()
            new_prob.title = 'Level ' + str(level)
            new_prob.explanation = 'This is a level %d puzzle. ' % level
            new_prob.goodwords = goodwords
            new_prob.badwords = badwords
            new_prob.regex = regex
            keep_trying = False
    return new_prob
Instead of starting from a regexp, you should look into writing a small context-free grammar, which will allow you to easily generate such random text. Unfortunately, I know of no tool that will do it directly for you, so you would need to write a bit of code yourself to actually generate the text. If you have not worked with grammars before, I suggest you read a bit about BNF format and "compiler compilers" before proceeding...
I'm not aware of any, although it should be possible. The usual approach is to write a grammar instead of a regular expression, and then create functions for each non-terminal that randomly decide which production to expand. If you could post a description of the kinds of strings that you want to generate, and what language you are using, we may be able to get you started.
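The grammar approach described in the last two answers can be sketched in a few lines of Python. The toy grammar below is an assumption made for illustration (it is hand-written to be equivalent to the regex [ab]{1,3}c, not derived from one automatically), and the names GRAMMAR and expand are hypothetical.

```python
import random
import re

# Hypothetical toy grammar, hand-written to match the regex [ab]{1,3}c.
# Keys are non-terminals; each value lists the possible productions.
GRAMMAR = {
    "S": [["AB", "c"], ["AB", "AB", "c"], ["AB", "AB", "AB", "c"]],
    "AB": [["a"], ["b"]],
}

def expand(symbol, rng):
    """Expand a symbol by randomly choosing one production for each non-terminal."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit as-is
    production = rng.choice(GRAMMAR[symbol])
    return "".join(expand(part, rng) for part in production)

rng = random.Random(0)  # seeded only to make runs repeatable
samples = [expand("S", rng) for _ in range(5)]

# Every generated string should match the regex the grammar was written for.
assert all(re.fullmatch(r"[ab]{1,3}c", s) for s in samples)
```

Choosing productions uniformly does not make every matching string equally likely; weighting the choices would be needed for that.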