Extracting ZIP code from the address line - regex

I have a data frame which has address as one of the column, the address can sometimes contain ZIP/PIN code in it and sometimes not.
Data Frame:
BANK ADDRESS
ABU DHABI COMMERCIAL BANK REHMAT MANZIL, V. N. ROAD,CURCHGATE, MUMBAI - 400020
VIJAYA BANK BOKARO CITY JHARKHAND,15/D1 HOTEL BLUE-,DIAMOND COMPLEX,BOKARO CITY,JHARKHAND,JHARKHAND
ALLAHABAD BANK DANKIN GANJ DIST. MIRZAPUR - 231 001 UTTAR PRADESH
How can i extract only ZIP/PIN code with the following information:
1. ZIP/PIN code are 6 digits (INDIAN ZIP/PIN CODE)
2. ZIP are sometimes split by 3 digits, 560 015
3. ZIP are sometimes separated by -, eg: 560-015
Below is my present code:
df$zip <- stri_extract_all_regex(df$ADDRESS, "(?<!\\d)\\d{6}(?!\\d)")
But the above code does not account point 2 and 3 of my logic, that is handle the ZIP split by "" or "-"

But the above code does not account point 2 and 3 of my logic, that is
handle the ZIP split by "" or "-"
m = regexpr("\\<\\d{3}[- ]?\\d{3}\\>", df$ADDRESS)
df$zip = substr(df$ADDRESS, m, m + attr(m, "match.length") - 1)

Related

Grouping Similar words/phrases

I have a frequency table of words which looks like below
> head(freqWords)
employees work bose people company
1879 1804 1405 971 959
employee
100
> tail(freqWords)
youll younggood yoyo ytd yuorself zeal
1 1 1 1 1 1
I want to create another frequency table which will combine similar words and add their frequencies
In above example, my new table should contain both employee and employees as one element with a frequency of 1979. For example
> head(newTable)
employee,employees work bose people
1979 1804 1405 971
company
959
I know how to find out similar words (using adist, stringdist) but I am unable to create the frequency table. For instance I can use following to get a list of similar words
words <- names(freqWords)
lapply(words, function(x) words[stringdist(x, words) < 3])
and following to get a list of similar phrases of two words
lapply(words, function(x) words[stringdist2(x, words) < 3])
where stringdist2 is follwoing
stringdist2 <- function(word1, word2){
min(stringdist(word1, word2),
stringdist(word1, gsub(word2,
pattern = "(.*) (.*)",
repl="\\2,\\1")))
}
I do not have any punctuation/special symbols in my words/phrases. (I do not know a lot of R; I created stringdist2 by tweaking an implementation of adist2 I found here but I do not understand everything about how pattern and repl works)
So I need help to create new frequency table.

Split one column into two columns and retaining the seperator

I have a very large data array:
'data.frame': 40525992 obs. of 14 variables:
$ INSTNM : Factor w/ 7050 levels "A W Healthcare Educators"
$ Total : Factor w/ 3212 levels "1","10","100",
$ Crime_Type : Factor w/ 72 levels "MURD11","NEG_M11",
$ Count : num 0 0 0 0 0 0 0 0 0 0 ...
The Crime_Type column contains the type of Crime and the Year, so "MURD11" is Murder in 2011. These are college campus crime statistics my kid is analyzing for her school project, I am helping when she is stuck. I am currently stuck at creating a clean data file she can analyze
Once i converted the wide file (all crime types '9' in columns) to a long file using 'gather' the file size is going from 300MB to 8 GB. The file I am working on is 8GB. do you that is the problem. How do i convert it to a data.table for faster processing?
What I want to do is to split this 'Crime_Type' column into two columns 'Crime_Type' and 'Year'. The data contains alphanumeric and numbers. There are also some special characters like NEG_M which is 'Negligent Manslaughter'.
We will replace the full names later but can some one suggest on how I separate
MURD11 --> MURD and 11 (in two columns)
NEG_M10 --> NEG_M and 10 (in two columns)
etc...
I have tried using,
df <- separate(totallong, Crime_Type, into = c("Crime", "Year"), sep = "[:digit:]", extra = "merge")
df <- separate(totallong, Crime_Type, into = c("Year", "Temp"), sep = "[:alpha:]", extra = "merge")
The first one separates the Crime as it looks for numbers. The second one does not work at all.
I also tried
df$Crime_Type<- apply (strsplit(as.character(df$Crime_Type), split="[:digit:]"))
That does not work at all. I have gone through many posts on stack-overflow and thats where I got these commands but I am now truly stuck and would appreciate your help.
Since you're using tidyr already (as evidenced by separate), try the extract function, which, given a regex, puts each captured group into a new column. The 'Crime_Type' is all the non-numeric stuff, and the 'Year' is the numeric stuff. Adjust the regex accordingly.
library(tidyr)
extract(df, 'Crime_Type', into=c('Crime', 'Year'), regex='^([^0-9]+)([0-9]+)$')
In base R, one option would be to create a unique delimiter between the non-numeric and numeric part. We can capture as a group the non-numeric ([^0-9]+) and numeric ([0-9]+) characters by wrapping it inside the parentheses ((..)) and in the replacement we use \\1 for the first capture group, followed by a , and the second group (\\2). This can be used as input vector to read.table with sep=',' to read as two columns.
df1 <- read.table(text=gsub('([^0-9]+)([0-9]+)', '\\1,\\2',
totallong$Crime_Type),sep=",", col.names=c('Crime', 'Year'))
df1
# Crime Year
#1 MURD 11
#2 NEG_M 11
If we need, we can cbind with the original dataset
cbind(totallong, df1)
Or in base R, we can use strsplit with split specifying the boundary between non-number ((?<=[^0-9])) and a number ((?=[0-9])). Here we use lookarounds to match the boundary. The output will be a list, we can rbind the list elements with do.call(rbind and convert it to data.frame
as.data.frame(do.call(rbind, strsplit(as.character(totallong$Crime_Type),
split="(?<=[^0-9])(?=[0-9])", perl=TRUE)))
# V1 V2
#1 MURD 11
#2 NEG_M 11
Or another option is tstrsplit from the devel version of data.table ie. v1.9.5. Here also, we use the same regex. In addition, there is option to convert the output columns into different class.
library(data.table)#v1.9.5+
setDT(totallong)[, c('Crime', 'Year') := tstrsplit(Crime_Type,
"(?<=[^0-9])(?=[0-9])", perl=TRUE, type.convert=TRUE)]
# Crime_Type Crime Year
#1: MURD11 MURD 11
#2: NEG_M11 NEG_M 11
If we don't need the 'Crime_Type' column in the output, it can be assigned to NULL
totallong[, Crime_Type:= NULL]
NOTE: Instructions to install the devel version are here
Or a faster option would be stri_extract_all from library(stringi) after collapsing the rows to a single string ('v2'). The alternate elements in 'v3' can be extracted by indexing with seq to create new data.frame
library(stringi)
v2 <- paste(totallong$Crime_Type, collapse='')
v3 <- stri_extract_all(v2, regex='\\d+|\\D+')[[1]]
ind1 <- seq(1, length(v3), by=2)
ind2 <- seq(2, length(v3), by=2)
d1 <- data.frame(Crime=v3[ind1], Year= v3[ind2])
Benchmarks
v1 <- do.call(paste, c(expand.grid(c('MURD', 'NEG_M'), 11:15), sep=''))
set.seed(24)
test <- data.frame(v1= sample(v1, 40525992, replace=TRUE ))
system.time({
v2 <- paste(test$v1, collapse='')
v3 <- stri_extract_all(v2, regex='\\d+|\\D+')[[1]]
ind1 <- seq(1, length(v3), by=2)
ind2 <- seq(2, length(v3), by=2)
d1 <- data.frame(Crime=v3[ind1], Year= v3[ind2])
})
#user system elapsed
#56.019 1.709 57.838
data
totallong <- data.frame(Crime_Type= c('MURD11', 'NEG_M11'))

In R, use regular expression to match multiple patterns and add new column to list

I've found numerous examples of how to match and update an entire list with one pattern and one replacement, but what I am looking for now is a way to do this for multiple patterns and multiple replacements in a single statement or loop.
Example:
> print(recs)
phonenumber amount
1 5345091 200
2 5386052 200
3 5413949 600
4 7420155 700
5 7992284 600
I would like to insert a new column called 'service_provider' with /^5/ as Company1 and /^7/ as Company2.
I can do this with the following two lines of R:
recs$service_provider[grepl("^5", recs$phonenumber)]<-"Company1"
recs$service_provider[grepl("^7", recs$phonenumber)]<-"Company2"
Then I get:
phonenumber amount service_provider
1 5345091 200 Company1
2 5386052 200 Company1
3 5413949 600 Company1
4 7420155 700 Company2
5 7992284 600 Company2
I'd like to provide a list, rather than discrete set of grepl's so it is easier to keep country specific information in one place, and all the programming logic in another.
thisPhoneCompanies<-list(c('^5','Company1'),c('^7','Company2'))
In other languages I would use a for loop on on the Phone Company list
For every row in thisPhoneCompanies
Add service provider to matched entries in recs (such as the grepl statement)
end loop
But I understand that isn't the way to do it in R.
Using stringi :
library(stringi)
recs$service_provider <- stri_replace_all_regex(str = recs$phonenumber,
pattern = c('^5.*','^7.*'),
replacement = c('Company1', 'Company2'),
vectorize_all = FALSE)
recs
# phonenumber amount service_provider
# 1 5345091 200 Company1
# 2 5386052 200 Company1
# 3 5413949 600 Company1
# 4 7420155 700 Company2
# 5 7992284 600 Company2
Thanks to #thelatemail
Looks like if I use a dataframe instead of a list for the phone companies:
phcomp <- data.frame(ph=c(5,7),comp=c("Company1","Company2"))
I can match and add a new column to my list of phone numbers in a single command (using the match function).
recs$service_provider <- phcomp$comp[match(substr(recs$phonenumber,1,1), phcomp$ph)]
Looks like I lose the ability to use regular expressions, but the matching here is very simple, just the first digit of the phone number.

Sum list of characters using R

I'm playing around with some box score data that I downloaded from retrosheet.org. Instead of providing a run total for the home and away team, the data provides a line score in the following format: "10030(11)02x"
where each digit represents an inning. A number in () indicates more than 9 runs scored in an inning and x represents a half inning in which the team did not bat (the home team was ahead at the bottom of the 9th inning).
I'm trying to figure out a way to systematically sum up the total runs using a function. Ideally I could run something like this:
f("10030(11)02x") = 17
I'm using sum(sapply(strsplit("10001000x", ""), as.numeric), na.rm=T) to compute a sum for all observations that don't contain an inning with double digits, but I'm struggling figuring out how to deal with the double digit innings and parenthesis.
How about this
runcount<-function(x) {
# find double digits
m <- gregexpr("\\(\\d+\\)",x)
dd <- regmatches(x,m)
# remove double digits
regmatches(x,m)<-""
# remove x's
x <- gsub("x","",x)
# sum numbers
# add back in double digit values (remove parens)
sapply(strsplit(x,""), function(x) sum(as.numeric(x))) +
sapply(dd, function(x) sum(as.numeric(substr(x,2,nchar(x)-1))))
}
runcount("10030(11)02x")
# [1] 17
runcount("10030(11)(12)2x")
# [1] 29
runcount("100301020")
# [1] 7
runcount(c("10030(11)02x","10030(11)(12)2x","100301020"))
# [1] 17 29 7

read table with spaces in one column

I am attempting to extract tables from very large text files (computer logs). Dickoa provided very helpful advice to an earlier question on this topic here: extracting table from text file
I modified his suggestion to fit my specific problem and posted my code at the link above.
Unfortunately I have encountered a complication. One column in the table contains spaces. These spaces are generating an error when I try to run the code at the link above. Is there a way to modify that code, or specifically the read.table function to recognize the second column below as a column?
Here is a dummy table in a dummy log:
> collect.models(, adjust = FALSE)
model npar AICc DeltaAICc weight Deviance
5 AA(~region + state + county + city)BB(~region + state + county + city)CC(~1) 17 11111.11 0.0000000 5.621299e-01 22222.22
4 AA(~region + state + county)BB(~region + state + county)CC(~1) 14 22222.22 0.0000000 5.621299e-01 77777.77
12 AA(~region + state)BB(~region + state)CC(~1) 13 33333.33 0.0000000 5.621299e-01 44444.44
12 AA(~region)BB(~region)CC(~1) 6 44444.44 0.0000000 5.621299e-01 55555.55
>
> # the three lines below count the number of errors in the code above
Here is the R code I am trying to use. This code works if there are no spaces in the second column, the model column:
my.data <- readLines('c:/users/mmiller21/simple R programs/dummy.log')
top <- '> collect.models\\(, adjust = FALSE)'
bottom <- '> # the three lines below count the number of errors in the code above'
my.data <- my.data[grep(top, my.data):grep(bottom, my.data)]
x <- read.table(text=my.data, comment.char = ">")
I believe I must use the variables top and bottom to locate the table in the log because the log is huge, variable and complex. Also, not every table contains the same number of models.
Perhaps a regex expression could be used somehow taking advantage of the AA and the CC(~1) present in every model name, but I do not know how to begin. Thank you for any help and sorry for the follow-up question. I should have used a more realistic example table in my initial question. I have a large number of logs. Otherwise I could just extract and edit the tables by hand. The table itself is an odd object which I have only ever been able to export directly with capture.output, which would probably still leave me with the same problem as above.
EDIT:
All spaces seem to come right before and right after a plus sign. Perhaps that information can be used here to fill the spaces or remove them.
try inserting my.data$model <- gsub(" *\\+ *", "+", my.data$model) before read.table
my.data <- my.data[grep(top, my.data):grep(bottom, my.data)]
my.data$model <- gsub(" *\\+ *", "+", my.data$model)
x <- read.table(text=my.data, comment.char = ">")