OK, so I have a data frame of web forum comments. Each row contains the full permalink to the comment (the ID is the varying part of the URL), plus a cell holding the ID of that comment's parent comment.
I'd like to add a column that shows the user name attached to that parent comment. I'm assuming I'll need some regular expression function, which I find mystifying at this point.
In workflow terms: find the row whose URL contains the parent comment ID, then grab the user name from that row. Here's a toy example:
toy <- rbind(c("yes?", "john", "www.website.com/4908", "3214", NA), c("don't think so", "mary", "www.website.com/3958", "4908", NA))
toy <- as.data.frame(toy)
colnames(toy) <- c("comment", "user", "URL", "parent", "parent_user")
comment user URL parent parent_user
1 yes? john www.website.com/4908 3214 <NA>
2 don't think so mary www.website.com/3958 4908 <NA>
which needs to become:
comment user URL parent parent_user
1 yes? john www.website.com/4908 3214 <NA>
2 don't think so mary www.website.com/3958 4908 john
Some values in this column will be NA, since they're top level comments. So something like,
dataframe$parent_user <- dataframe['the row where parent
ID i is found in the URL column', 'the user name column in that row']
Thanks!!
Another option, using the basename function from base R, which "removes all of the path up to and including the last path separator (if any)"
toy$user[match(toy$parent, basename(as.character(toy$URL)))]
#[1] <NA> john
#Levels: john mary
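To actually fill the column, the same lookup can be assigned directly (a sketch; wrapping user in as.character() avoids carrying factor levels into the new column):
toy$parent_user <- as.character(toy$user)[match(toy$parent, basename(as.character(toy$URL)))]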
Here is a vectorized option with stri_extract and match
library(stringi)
toy$parent_user <- toy$user[match(toy$parent,stri_extract(toy$URL,
regex=paste(toy$parent, collapse="|")))]
toy
# comment user URL parent parent_user
#1 yes? john www.website.com/4908 3214 <NA>
#2 don't think so mary www.website.com/3958 4908 john
Or, as @jazzurro mentioned, a faster option would be using stri_extract with data.table and fmatch:
library(data.table)
library(fastmatch)
setDT(toy)[, parent_user := user[fmatch(parent,
stri_extract_last_regex(str=URL, pattern = "\\d+"))]]
Or a base R option would be
with(toy, user[match(parent, sub("\\D+", "", URL))])
#[1] <NA> john
#Levels: john mary
nchar('with(toy, user[match(parent, sub("\\D+", "", URL))])')
#[1] 51
nchar('toy$user[match(toy$parent, basename(as.character(toy$URL)))]')
#[1] 60
Perhaps not the prettiest way to do it, but an option:
toy$parent_user <- sapply(toy$parent, function(x) {
  p <- toy[x == sub('[^0-9]*', '', toy$URL), 'user']
  ifelse(length(p) > 0, as.character(p), NA)
})
toy
# comment user URL parent parent_user
# 1 yes? john www.website.com/4908 3214 <NA>
# 2 don't think so mary www.website.com/3958 4908 john
The ifelse() is really just there to deal with cases lacking matches.
Related
Imagine I have a dataframe like this (the prefixes could just as well be the names of all months):
set.seed(1)
mydata <- data.frame()
mydata <- rbind(mydata,c(1,round(runif(20),3)))
mydata <- rbind(mydata,c(2,round(runif(20),3)))
mydata <- rbind(mydata,c(3,round(runif(20),3)))
colnames(mydata) <- c("id", paste0(rep(c('Mary', 'Bob', 'Dylan', 'Tom', 'Jane', 'Sam', 'Tony', 'Luke', 'John', "Pam"), each=2), 1:2))
id Mary1 Mary2 Bob1 Bob2 Dylan1 Dylan2 Tom1 Tom2 Jane1 Jane2 Sam1 Sam2 Tony1 Tony2 Luke1 Luke2 John1 John2 Pam1 Pam2
1 0.266 0.372 0.573 0.908 0.202 0.898 0.945 0.661 0.629 0.062 0.206 0.177 0.687 0.384 0.770 0.498 0.718 0.992 0.380 0.777
2 0.935 0.212 0.652 0.126 0.267 0.386 0.013 0.382 0.870 0.340 0.482 0.600 0.494 0.186 0.827 0.668 0.794 0.108 0.724 0.411
3 0.821 0.647 0.783 0.553 0.530 0.789 0.023 0.477 0.732 0.693 0.478 0.861 0.438 0.245 0.071 0.099 0.316 0.519 0.662 0.407
Usually with many more columns.
And I want to add columns (it's up to you whether to add them to the right, or to create a new dataframe with these new columns) subtracting every pair (*):
id, Mary1-Mary2, Bob1-Bob2, Dylan1-Dylan2, Tom1-Tom2, Jane1-Jane2,...
This operation is quite common.
I'd like to do it by name, not by position, to prevent problems if they are not consecutive.
It could even happen that some columns don't have their "twin" column; just leave them as they are, or ignore this complication for now.
(*) The names of the columns have a prefix and a number.
Instead of just subtracting two columns, I could have groups of 5 and I may want to do something such as adding all the numbers. A generic solution would be great.
I first tried converting to long format, operating with aggregate, and converting back to wide format, but maybe it's much easier to do it directly in wide format. I know the problem is mainly about using regular expressions efficiently.
I don't care about speed; I'd like the simplest solution.
Any package is welcome.
PS: All your solutions fail if I add a lone column.
set.seed(1)
mydata <- data.frame()
mydata <- rbind(mydata,c(1,round(runif(21),3)))
mydata <- rbind(mydata,c(2,round(runif(21),3)))
mydata <- rbind(mydata,c(3,round(runif(21),3)))
colnames(mydata) <- c(c("id", paste0(rep(c('Mary', 'Bob', 'Dylan', 'Tom', 'Jane', 'Sam', 'Tony', 'Luke', 'John', "Pam"), each=2), 1:2)),"Lola" )
I know I could filter it out manually, but it would be better if the result were the difference (*) of every pair, leaving the lone column as it is (in the case where the groups are pairs and the operation is the difference).
The best option would be not to remove the lone column manually, but to split all columns automatically into singles and groups.
How about using base R:
cn <- unique(gsub("\\d", "", colnames(mydata)))[-1]
sapply(cn, function(x) mydata[[paste0(x, 1)]] - mydata[[paste0(x, 2)]] )
You can use this approach for any arbitrary number of groups. For example, this would return the row sums across the names with the suffix 1 or 2:
sapply(cn, function(x) rowSums(mydata[, paste0(x, 1:2)]))
This paste approach could be replaced by regular expressions for more general applications.
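For instance, a minimal sketch of that regex variant (my illustration, not part of the original answer): strip the trailing digits from each column name, split the columns by the resulting prefix, and reduce each group. Note that split() returns the groups in alphabetical order.
grp <- sub("\\d+$", "", names(mydata)[-1])   # "Mary", "Mary", "Bob", ...
sapply(split(names(mydata)[-1], grp),
       function(cols) Reduce(`-`, mydata[cols]))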
You can do something like,
sapply(unique(sub('\\d', '', names(mydata[,-1]))),
function(i) Reduce('-', mydata[,-1][,grepl(i, sub('\\d', '', names(mydata[,-1])))]))
# Mary Bob Dylan Tom Jane Sam Tony Luke John Pam
#[1,] -0.106 -0.335 -0.696 0.284 0.567 0.029 0.303 0.272 -0.274 -0.397
#[2,] 0.723 0.526 -0.119 -0.369 0.530 -0.118 0.308 0.159 0.686 0.313
#[3,] 0.174 0.230 -0.259 -0.454 0.039 -0.383 0.193 -0.028 -0.203 0.255
As per your comment, we can easily sort the columns and then apply the formula above,
sorted.names <- names(mydata)[order(nchar(names(mydata)), names(mydata))]
mydata <- mydata[,sorted.names]
This solution handles an arbitrary number of twins.
## return data frame
twin.vars <- function(prefix, df) {
df[grep(paste0(prefix, '[0-9]+$'), names(df))]
}
pfx <- unique(sub('[0-9]*$', '', names(mydata[-1])))
tmp <- lapply(pfx, function(x) Reduce(`-`, twin.vars(x, mydata)))
cbind(id=mydata$id, as.data.frame(setNames(tmp, pfx)))
OK, I've chosen @NBATrends' solution because it works well in almost every case and he was first.
Anyway, I add my little contribution, just in case anybody is interested:
runs <- rle(sort(sub('\\d$', '', names(mydata))))
sapply(runs[[2]][runs[[1]]>1], function(x) mydata[[paste0(x, 1)]] - mydata[[paste0(x, 2)]] )
The only "problem" is that it changes the final order, but you don't need to manually remove isolated columns, and works for disordered columns too.
I'm perplexed because nobody posted a solution with dplyr or data.table :)
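Since that was invited: here is a long-format sketch with tidyr/dplyr (my addition; it assumes tidyr >= 1.0 for pivot_longer()). The \d* suffix pattern lets a lone column like Lola map back onto itself, since Reduce() over a single value returns it unchanged:
library(dplyr)
library(tidyr)

mydata %>%
  pivot_longer(-id, names_to = c("who", "rep"),
               names_pattern = "([A-Za-z]+)(\\d*)") %>%
  group_by(id, who) %>%
  summarise(value = Reduce(`-`, value), .groups = "drop") %>%
  pivot_wider(names_from = who, values_from = value)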
Is there a more "r" way to substring two meaningful characters out of a longer string from a column in a data.table?
I have a data.table that has a column with "degree strings"... shorthand code for the degree someone got and the year they graduated.
> library(data.table)
> srcDT <- data.table(
  alum=c("Paul Lennon","Stevadora Nicks","Fred Murcury"),
  degree=c("W72","WG95","W88")
)
> srcDT
alum degree
1: Paul Lennon W72
2: Stevadora Nicks WG95
3: Fred Murcury W88
I need to extract the digits of the year from the degree, and put it in a new column called "degree_year"
No problem:
> srcDT[,degree_year:=substr(degree,nchar(degree)-1,nchar(degree))]
> srcDT
alum degree degree_year
1: Paul Lennon W72 72
2: Stevadora Nicks WG95 95
3: Fred Murcury W88 88
If only it were always that simple.
The problem is, the degree strings only sometimes look like the above. More often, they look like this:
srcDT<- data.table(
alum=c("Ringo Harrison","Brian Wilson","Mike Jackson"),
degree=c("W72 C73","WG95 L95","W88 WG90")
)
I am only interested in the 2 numbers next to the characters I care about: W & WG (and if both W and WG are there, I only care about WG)
Here's how I solved it:
x <-srcDT$degree ##grab just the degree column
z <-character() ## create an empty character vector
degree.grep.pattern <-c("WG[0-9][0-9]","W[0-9][0-9]")
## define a vector of regex's, in the order
## I want them
for(i in 1:length(x)){ ## loop thru all elements in degree column
matched=F ## at the start of the loop, reset flag to F
for(j in 1:length(degree.grep.pattern)){
## loop thru all elements of the pattern vector
if(length(grep(degree.grep.pattern[j],x[i]))>0){
## see if you get a match
m <- regexpr(degree.grep.pattern[j],x[i])
## if you do, great! grab the index of the match
y<-regmatches(x[i],m) ## then subset down. y will equal "WG95"
matched=T ## set the flag to T
break ## stop looping
}
## if no match, go on to next element in pattern vector
}
if(matched){ ## after finishing the loop, check if you got a match
yr <- substr(y,nchar(y)-1,nchar(y))
## if yes, then grab the last 2 characters of it
}else{
#if you run thru the whole list and don't match any pattern at all, just
# take the last two characters of the degree string
yr <- substr(x[i],nchar(as.character(x[i]))-1,nchar(as.character(x[i])))
}
z<-c(z,yr) ## add this result (95) to the character vector
}
srcDT$degree_year<-z ## set the column to the results.
> srcDT
alum degree degree_year
1: Ringo Harrison W72 C73 72
2: Brian Wilson WG95 L95 95
3: Mike Jackson W88 WG90 90
This works. 100% of the time. No errors, no mis-matches.
The problem is: it doesn't scale. Given a data table with 10k rows, or 100k rows, it really slows down.
Is there a smarter, better way to do this? This solution is very "C" to me. Not very "R."
Thoughts on improvement?
Note: I gave a simplified example. In the actual data, there are about 30 different possible combinations of degrees, and combined with different years, there are something like 540 unique combinations of degree strings.
Also, I gave the degree.grep.pattern with only 2 patterns to match. In the actual work I'm doing, there are 7 or 8 patterns to match.
As it seems (per the OP's comments) there is no "WG W" situation, so a simple regex solution should do the job:
srcDT[ , degree_year := gsub(".*WG?(\\d+).*", "\\1", degree)]
srcDT
# alum degree degree_year
# 1: Ringo Harrison W72 C73 72
# 2: Brian Wilson WG95 L95 95
# 3: Mike Jackson W88 WG90 90
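A quick sanity check that greedy matching does pick the WG year when both W and WG appear (hypothetical inputs, my addition):
gsub(".*WG?(\\d+).*", "\\1", c("W72 C73", "W88 WG90", "L95 W72 WG95"))
# [1] "72" "90" "95"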
Here's a solution based on the assumption that you want the most recent degree with W in it:
regex <- "(?<=W|(?<=W)G)[0-9]{2}"
srcDT[ , degree_year :=
sapply(regmatches(degree,
gregexpr(regex, degree, perl = TRUE)),
function(x) max(as.integer(x)))]
> srcDT
alum degree degree_year
1: Ringo Harrison W72 C73 72
2: Brian Wilson WG95 L95 95
3: Mike Jackson W88 WG90 90
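One caveat worth flagging (my note, not part of the original answer): for a row with no W/WG degree at all, regmatches() yields character(0), and max() of a zero-length integer vector falls back to -Inf with a warning:
max(as.integer(character(0)))
# [1] -Inf   (plus a warning)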
You said:
I gave the degree.grep.pattern with only 2 patterns to match. In the actual work I'm doing, there are 7 or 8 patterns to match.
But I'm not sure what this means. There are more options besides W and WG?
Here is one quick hack:
# split all words from degree and order so that WG is before W
words <- lapply(strsplit(srcDT$degree, " "), sort, decreasing=TRUE)
# obtain tags for each row (getting only first. But works since ordered)
tags <- mapply(Find, list(function(x) grepl("^WG|^W", x)), words)
# simple gsub to remove WG and W
(result <- gsub("^WG|^W", "", tags))
[1] "72" "95" "90"
Fast with 100k rows.
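A rough timing sketch to back that up (my addition; it recycles the example degrees into a hypothetical 100k-element vector):
deg <- sample(srcDT$degree, 1e5, replace = TRUE)
system.time({
  words <- lapply(strsplit(deg, " "), sort, decreasing = TRUE)
  tags  <- mapply(Find, list(function(x) grepl("^WG|^W", x)), words)
  gsub("^WG|^W", "", tags)
})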
A solution without regular expressions. It's quite slow, as it creates a sparse table, but it's clean and flexible, so I leave it here.
First I split the degree-year strings on spaces, then I loop through them and build a structured table with one column per degree, which I fill with years.
degreeyear_split <- sapply(srcDT$degree, strsplit, " ")
for (i in 1:nrow(srcDT)) {
  for (degree_year in degreeyear_split[[i]]) {
    n <- nchar(degree_year)
    degree <- substr(degree_year, 1, n - 2)
    year <- substr(degree_year, n - 1, n)
    srcDT[i, degree] <- year
  }
}
Now I have my structured table. I take the W year as the base, then overwrite it with the WG year where one exists.
srcDT$year <- srcDT$W
srcDT$year[srcDT$WG!=""]<-srcDT$WG[srcDT$WG!=""]
Then here's your result:
srcDT
alum degree W C WG L year
1: Ringo Harrison W72 C73 72 73 72
2: Brian Wilson WG95 L95 95 95 95
3: Mike Jackson W88 WG90 88 90 90
I am trying to use a regular expression in R to extract everything after the first number and the first word of an entry in a data frame.
For example:
Header <- c("2006 Volvo XC70",
            "2012 Ford Econoline Cargo Van E-250 Commercial",
            "2012 Nissan Frontier",
            "2012 Kia Soul 5dr Wagon Automatic")
I want to write a pattern that will grab Volvo XC70, or Econoline Cargo Van E-250 Commercial (everything after the year and make), from an entry in my "header" column so that I can run the function on my data frame and create a new "model" column. I can't figure out a pattern that will skip the first string of integers, then a space, then the first string of characters, then another space, and then grab everything that follows.
Any help would be appreciated. Thanks!
Just use sub.
sub("^\\d+\\s+\\w+\\s+", "", df$x)
Example:
x <- "2012 Ford Econoline Cargo Van E-250 Commercial"
sub("^\\d+\\s+\\w+\\s+", "", x)
# [1] "Econoline Cargo Van E-250 Commercial"
For this task, I would fetch a basic list using the XML package:
library(XML)
doc <- xmlParse('http://www.fueleconomy.gov/ws/rest/ympg/shared/menu/make')
Now that we fetched the XML data we can create a vector with the car makes:
mk <- xpathSApply(doc, '//value', xmlValue)
Finally, I'll compile the pattern and play around with sprintf and sub:
df$Makes <- sub(sprintf('\\d+ (?:%s) ', paste(mk, collapse='|')), '', df$Header)
Output:
## Header
# 1 2006 Volvo XC70
# 2 2012 Ford Econoline Cargo Van E-250 Commercial
# 3 2012 Nissan Frontier
# 4 2012 Kia Soul 5dr Wagon Automatic
## Makes
# 1 XC70
# 2 Econoline Cargo Van E-250 Commercial
# 3 Frontier
# 4 Soul 5dr Wagon Automatic
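For a sense of what the sprintf() call composes (illustrative makes only, not the full fetched list):
sprintf('\\d+ (?:%s) ', paste(c('Volvo', 'Ford', 'Kia'), collapse='|'))
# [1] "\\d+ (?:Volvo|Ford|Kia) "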
I'm having some difficulty extracting an ID of the form:
27da12ce-85fe-3f28-92f9-e5235a5cf6ac
from a data frame:
a<-c("NAME_27da12ce-85fe-3f28-92f9-e5235a5cf6ac_THOMAS_MYR",
"NAME_94773a8c-b71d-3be6-b57e-db9d8740bb98_THIMO",
"NAME_1ed571b4-1aef-3fe2-8f85-b757da2436ee_ALEX",
"NAME_9fbeda37-0e4f-37aa-86ef-11f907812397_JOHN_TYA",
"NAME_83ef784f-3128-35a1-8ff9-daab1c5f944b_BISHOP",
"NAME_39de28ca-5eca-3e6c-b5ea-5b82784cc6f4_DUE_TO",
"NAME_0a52a024-9305-3bf1-a0a6-84b009cc5af4_WIS_MICHAL",
"NAME_2520ebbb-7900-32c9-9f2d-178cf04f7efc_Sarah_Lu_Van_Gar/Thomas")
Basically it's the thing between the first and the second underscore.
Usually I approach that by:
library(tidyr)
df$a<-as.character(df$a)
df<-df[grep("_", df$a), ]
df<- separate(df, a, c("ID","Name") , sep = "_")
df$a<-as.numeric(df$ID)
However, this time there are too many underscores... and my approach fails. Is there a way to extract that ID?
I think you should use extract instead of separate. You need to specify the patterns you want to capture. I'm assuming here that the ID always starts with a number, so I'm capturing everything from the first number up to the next _, and then everything after it:
df <- data.frame(a)
df <- df[grep("_", df$a),, drop = FALSE]
extract(df, a, c("ID", "NAME"), "[A-Za-z].*?(\\d.*?)_(.*)")
# ID NAME
# 1 27da12ce-85fe-3f28-92f9-e5235a5cf6ac THOMAS_MYR
# 2 94773a8c-b71d-3be6-b57e-db9d8740bb98 THIMO
# 3 1ed571b4-1aef-3fe2-8f85-b757da2436ee ALEX
# 4 9fbeda37-0e4f-37aa-86ef-11f907812397 JOHN_TYA
# 5 83ef784f-3128-35a1-8ff9-daab1c5f944b BISHOP
# 6 39de28ca-5eca-3e6c-b5ea-5b82784cc6f4 DUE_TO
# 7 0a52a024-9305-3bf1-a0a6-84b009cc5af4 WIS_MICHAL
# 8 2520ebbb-7900-32c9-9f2d-178cf04f7efc Sarah_Lu_Van_Gar/Thomas
try this (which assumes that the ID is always the part after the first underscore):
sapply(strsplit(a, "_"), function(x) x[[2]])
which gives you "the middle part" which is your ID:
[1] "27da12ce-85fe-3f28-92f9-e5235a5cf6ac" "94773a8c-b71d-3be6-b57e-db9d8740bb98"
[3] "1ed571b4-1aef-3fe2-8f85-b757da2436ee" "9fbeda37-0e4f-37aa-86ef-11f907812397"
[5] "83ef784f-3128-35a1-8ff9-daab1c5f944b" "39de28ca-5eca-3e6c-b5ea-5b82784cc6f4"
[7] "0a52a024-9305-3bf1-a0a6-84b009cc5af4" "2520ebbb-7900-32c9-9f2d-178cf04f7efc"
if you want to get the Name as well, a simple solution would be (which assumes that the Name is always everything after the second underscore):
Names <- sapply(strsplit(a, "_"), function(x) Reduce(paste, x[-c(1,2)]))
which gives you this:
[1] "THOMAS MYR" "THIMO" "ALEX" "JOHN TYA"
[5] "BISHOP" "DUE TO" "WIS MICHAL" "Sarah Lu Van Gar/Thomas"
Below is an example of a text file I need to parse.
Lead Attorney: John Doe
Staff Attorneys: John Doe Jr. Paralegal: John Doe III
Geographic Area: Wisconsin
Affiliated Offices: None
E-mail: blah@blah.com
I need to parse all the key/value pairs and import it into a database. For example, I will insert 'John Doe' into the [Lead Attorney] column. I started a regex but I'm running into problems when parsing line 2:
Staff Attorneys: John Doe Jr. Paralegal: John Doe III
I started with the following regex:
(\w*.?\w+):\s*(.)(?!(\w.?\w+:.*))
But that does not parse out 'Staff Attorneys: John Doe Jr.' and 'Paralegal: John Doe III'. How can I ensure that my regex returns two groups for every key/value pair even if the key/value pairs are on the same line? Thanks!
Does any other key appear as a second key on a line? The text above can be fixed by doing a data.replace('Paralegal:', '\nParalegal:') first. Then there is only one key/value pair per line, and it becomes trivial:
>>> data = """Lead Attorney: John Doe
... Staff Attorneys: John Doe Jr. Paralegal: John Doe III
... Geographic Area: Wisconsin
... Affiliated Offices: None
... E-mail: blah@blah.com"""
>>>
>>> result = {}
>>> data = data.replace('Paralegal:', '\nParalegal:')
>>> for line in data.splitlines():
... key, val = line.split(':', 1)
... result[key.strip()] = val.strip()
...
>>> print(result)
{'Staff Attorneys': 'John Doe Jr.', 'Lead Attorney': 'John Doe', 'Paralegal': 'John Doe III', 'Affiliated Offices': 'None', 'Geographic Area': 'Wisconsin', 'E-mail': 'blah@blah.com'}
If "Paralegal:" also appears first you can make a regexp to do the replacement only when it's not first, or make a .find and check that the character before is not a newline. If there are several keywords that can appear like this, you can make a list of keywords, etc.
If the keywords can be anything, but only one word, you can look for ':' and parse backwards for space, which can be done with regexps.
If the keywords can be anything and include spaces, it's impossible to do automatically.