In regex, mystery Error: assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634 - regex

Assume 900+ company names pasted together with the pipe separator to form a regex pattern, "firm.pat":
firm.pat <- str_c(firms$firm, collapse = "|")
With a data frame called "bio" that has a large character variable (250 rows each with 100+ words) named "comment", I would like to replace all the company names with blanks. Both a gsub call and a str_replace_all call return the same mysterious error.
bio$comment <- gsub(pattern = firm.pat, x = bio$comment, replacement = "")
Error in gsub(pattern = firm.pat, x = bio$comment, replacement = "") :
assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634
library(stringr)
bio$comment <- str_replace_all(bio$comment, firm.pat, "")
Error: assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634
traceback() did not enlighten me.
> traceback()
4: gsub("aaronson rappaport|adams reese|adelson testan|adler pollock|ahlers cooney|ahmuty demers|akerman|akin gump|allen kopet|allen matkins|alston bird|alston hunt|alvarado smith|anderson kill|andrews kurth|archer
# hundreds of lines of company names omitted here
lties in all 50 states and washington, dc. results are compiled through a peer-review survey in which thousands of lawyers in the u.s. confidentially evaluate their professional peers."
), fixed = FALSE, ignore.case = FALSE, perl = FALSE)
3: do.call(f, compact(args))
2: re_call("gsub", string, pattern, replacement)
1: str_replace_all(bio$comment, firm.pat, "")
Three other posts on SO mention this cryptic error: one is a passing reference that cites two other oblique references, but none offers any discussion.
I know this question lacks reproducible code, but even so, how do I find out what the error means? Better yet, how do I avoid triggering it? The error does not seem to occur with smaller numbers of companies, but I can't detect a pattern or threshold. I am running Windows 8, RStudio, and updated versions of every package.
Thank you.

I had the same problem with a pattern consisting of hundreds of manufacturer names. My guess is that the pattern is too long, so I split it into two or more patterns and it works well.
ml <- length(firms$firm)
xyz <- gsub(sprintf("(*UCP)\\b(%s)\\b", paste(head(firms$firm, n = ml/2), collapse = "|")), "", bio$comment, perl = TRUE)
xyz <- gsub(sprintf("(*UCP)\\b(%s)\\b", paste(tail(firms$firm, n = ml/2), collapse = "|")), "", xyz, perl = TRUE)
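The same idea generalizes to more than two chunks, roughly like this (a sketch only; the chunk size of 100 is arbitrary and untested against the original data):
# split the firm names into chunks of about 100 and run one gsub per chunk
chunks <- split(firms$firm, ceiling(seq_along(firms$firm) / 100))
for (chunk in chunks) {
  pat <- sprintf("(*UCP)\\b(%s)\\b", paste(chunk, collapse = "|"))
  bio$comment <- gsub(pat, "", bio$comment, perl = TRUE)
}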

You can use mgsub in the qdap package, which is an extension to gsub that handles vectors of patterns and replacements.
Please refer to this Answer
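For example, a rough sketch (I'm assuming qdap's mgsub takes the patterns, then the replacements, then the text vector; check ?mgsub before relying on the exact arguments):
library(qdap)
# fixed = TRUE matches the firm names literally, so no large regex is compiled;
# one empty replacement is supplied per pattern in case a single "" is not recycled
bio$comment <- mgsub(firms$firm, rep("", length(firms$firm)), bio$comment, fixed = TRUE)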

Related

Problem for line breaks (\n) with gtsummary functions

I have a problem trying to include line breaks in arguments of gtsummary functions, such as the statistic argument of tbl_summary() or update for modify_header(). It's a bit strange because it has always worked until now, and the package documentation indicates that this is the way to do so...
Here is a reproducible example:
## loading packages ##
library(dplyr)
library(gtsummary)
## gtsummary table ##
trial %>%
  tbl_summary(include = c("trt", "stage", "grade"),
              by = "trt",
              statistic = all_categorical() ~ "{p}% \n ({n})", # \n does not pass "({n})" to next line...
              missing = "no") %>%
  modify_header(update = list(all_stat_cols() ~ "**{level}** \n ({p}%, \n N = {n})"), # and here as well...
                text_interpret = "md")
[screenshot: gtsummary cross table]
Does the problem come only from my computer? Could it be due to a recent package update?

Split parts of string defined by multiple delimiters into multiple variables in R

I have a large list of file names that I need to extract information from using R. The info is delimited by multiple dashes and underscores. I am having trouble figuring out a method that will accommodate the fact that the number of characters between delimiters is not consistent (the order of the information will remain constant, as will the delimiters used (hopefully)).
For example:
f <- data.frame(c("EI-SM4-AMW11_20160614_082800.wav", "PA-RF-A50_20160614_082800.wav"), stringsAsFactors = FALSE)
colnames(f) <- "filename"
f$area <- str_sub(f$filename, 1, 2)
f$rec <- str_sub(f$filename, 4, 6)
f$site <- str_sub(f$filename, 8, 12)
This produces correct results for the first file, but incorrect results for the second.
I've tried using the "stringr" and "stringi" packages, and know that hard coding the values in doesn't work, so I've come up with awkward solutions using both packages such as:
f$site <- str_sub(f$filename,
                  stri_locate_last(f$filename, fixed = "-")[, 1] + 1,
                  stri_locate_first(f$filename, fixed = "_")[, 1] - 1)
I feel like there must be a more elegant (and robust) method, perhaps involving regex (which I am painfully new to).
I've looked at other examples (Extract part of string (till the first semicolon) in R, R: Find the last dot in a string, Split string using regular expressions and store it into data frame).
Any suggestions/pointers would be very much appreciated.
Try this, from the tidyr package:
library(tidyr)
f %>% separate(filename, c('area', 'rec', 'site'), sep = '-')
You can also split along multiple different delimiters, like so:
f %>% separate(filename, c('area', 'rec', 'site', 'date', 'don_know_what_this_is', 'file_extension'), sep = '-|_|\\.')
and then keep only the columns you want using dplyr's select function:
library(dplyr)
library(tidyr)
f %>%
  separate(filename,
           c('area', 'rec', 'site', 'date',
             'don_know_what_this_is', 'file_extension'),
           sep = '-|_|\\.') %>%
  select(area, rec, site)
Something like this:
library(stringr)
library(dplyr)
f$area <- word(f$filename, 1, sep = "-")
f$rec <- word(f$filename, 2, sep = "-")
f$site <- word(f$filename, 3, sep = "-") %>%
  word(1, sep = "_")
dplyr is not strictly necessary, but its pipe makes chaining the calls cleaner.
The function word belongs to stringr.

R- Subset a corpus by meta data (id) matching partial strings

I'm using the R (3.2.3) tm-package (0.6-2) and would like to subset my corpus according to partial string matches contained with the metadatum "id".
For example, I would like to filter all documents that contain the string "US" within the "id" column. The string "US" would be preceded and followed by various characters and numbers.
I have found a similar example here. It is recommended to download the quanteda package but I think this should also be possible with the tm package.
Another more relevant answer to a similar problem is found here. I have tried to adapt that sample code to my context. However, I don't manage to incorporate the partial string matching.
I imagine there might be multiple things wrong with my code so far.
What I have so far looks like this:
US <- tm_filter(corpus, FUN = function(corpus, filter) any(meta(corpus)["id"] == filter), grep(".*US.*", corpus))
And I receive the following error message:
Error in structure(as.character(x), names = names(x)) :
'names' attribute [3811] must be the same length as the vector [3]
I'm also not sure how to come up with a reproducible example simulating my problem for this post.
It could work like this:
library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
(corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain)))
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 20
(idx <- grep("0", sapply(meta(corp, "id"), paste0), value=TRUE))
# 502 704 708
# "502" "704" "708"
(corpsubset <- corp[idx] )
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 3
You are looking for "US" instead of "0". Have a look at ?grep for details (e.g. fixed=TRUE).
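For the original problem, the same pattern with "US" might look like this (an untested sketch against the asker's own corpus):
# fixed = TRUE matches "US" literally rather than as a regex
idx <- grep("US", sapply(meta(corp, "id"), paste0), fixed = TRUE, value = TRUE)
US <- corp[idx]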

Search and replace multiple strings in list of strings: improve R code

I am looking for a simplified solution to the following problem in R: I have a list of names that are separated by commas; however, some of the names also contain commas. In order to separate the names, I would like to first replace the comma-containing names with comma-free versions and then split by comma. My problem is that I have around 26,000 strings with several names in each, and a list of around 130 names with commas. I have written a nested foreach loop (to use multiple cores and speed things up) and it works, but it's horribly slow. Is there a quicker way to search the strings and replace the relevant names? Here is my sample code:
List_of_names <- as.data.frame(c("Fred, Heiko, Franz, Jr., Nice, LLC, Meike", "Digital, Mike, John, Sr", "Svenja, Sven"))
Comma_names <- as.data.frame(c("Franz, Jr.", "Nice, LLC", "John, Sr"))
colnames(Comma_names) <- "name"
Comma_names$replace_names <- gsub(",", "", Comma_names[, "name"])
library(doParallel)
library(foreach)
cl<-makeCluster(4) # Create cluster with desired number of cores
registerDoParallel(cl) # Register cluster
names_new <- foreach(i = 1:nrow(List_of_names), .errorhandling = "pass", .packages = c("foreach")) %dopar% {
  name_2 <- List_of_names[i, ]
  foreach(j = 1:nrow(Comma_names), .combine = rbind, .errorhandling = "pass") %do% {
    if (length(grep(Comma_names[j, 1], name_2)) > 0) {
      name_2 <- gsub(Comma_names[j, 1], Comma_names[j, 2], name_2)
    }
  }
  name_2
}
In addition, the result of the foreach loop is a list but if I try to save the list or replace the column in my original dataframe it takes forever. How can I change my code to make it faster?
Thank you to everyone who reads this and is able to help!
Principle
You can use a combination of Reduce and stri_replace_all from the stringi package.
Code
library(stringi)
Comma_names <- structure(list(name = c("Franz, Jr.", "Nice, LLC", "John, Sr"),
                              replace_names = c("Franz Jr.", "Nice LLC", "John Sr")),
                         .Names = c("name", "replace_names"),
                         row.names = c(NA, -3L), class = "data.frame")
List_of_names <- structure(list(name = c("Fred, Heiko, Franz, Jr., Nice, LLC, Meike",
                                         "Digital, Mike, John, Sr", "Svenja, Sven")),
                           .Names = "name",
                           row.names = c(NA, -3L), class = "data.frame")
wrapper <- function(str, ind) stri_replace_all(str, Comma_names$replace_names[ind],
                                               fixed = Comma_names$name[ind])
ind <- 1:NROW(Comma_names)
Reduce(wrapper, ind, init = List_of_names$name)
# [1] "Fred, Heiko, Franz Jr., Nice LLC, Meike"
# [2] "Digital, Mike, John Sr"
# [3] "Svenja, Sven"
Explanation
stri_replace_all is a fast function which replaces all occurrences in a string. With Reduce you apply a function to the result of the previous function call. So we apply wrapper to the column with all the names and replace the string given in the first row of Comma_names. That result is fed to wrapper again, now replacing all occurrences of the second row, and so on. This code should run reasonably fast and you do not need to parallelize. I would be curious to hear your feedback on the execution time.
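If Reduce feels opaque, the same replacement can be spelled out as a plain loop (equivalent to the Reduce call above):
out <- List_of_names$name
for (i in seq_len(NROW(Comma_names))) {
  # replace the i-th comma-containing name with its comma-free version
  out <- stri_replace_all(out, Comma_names$replace_names[i],
                          fixed = Comma_names$name[i])
}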
Benchmark
Just a little benchmark with 3 millions lines:
List_of_names <- List_of_names[rep(1:NROW(List_of_names), 1e6), , drop = FALSE]
system.time(invisible(Reduce(wrapper, ind, init = List_of_names$name)))
# user system elapsed
# 1.95 0.00 1.96

read table with spaces in one column

I am attempting to extract tables from very large text files (computer logs). Dickoa provided very helpful advice to an earlier question on this topic here: extracting table from text file
I modified his suggestion to fit my specific problem and posted my code at the link above.
Unfortunately I have encountered a complication. One column in the table contains spaces. These spaces are generating an error when I try to run the code at the link above. Is there a way to modify that code, or specifically the read.table function to recognize the second column below as a column?
Here is a dummy table in a dummy log:
> collect.models(, adjust = FALSE)
model npar AICc DeltaAICc weight Deviance
5 AA(~region + state + county + city)BB(~region + state + county + city)CC(~1) 17 11111.11 0.0000000 5.621299e-01 22222.22
4 AA(~region + state + county)BB(~region + state + county)CC(~1) 14 22222.22 0.0000000 5.621299e-01 77777.77
12 AA(~region + state)BB(~region + state)CC(~1) 13 33333.33 0.0000000 5.621299e-01 44444.44
12 AA(~region)BB(~region)CC(~1) 6 44444.44 0.0000000 5.621299e-01 55555.55
>
> # the three lines below count the number of errors in the code above
Here is the R code I am trying to use. This code works if there are no spaces in the second column, the model column:
my.data <- readLines('c:/users/mmiller21/simple R programs/dummy.log')
top <- '> collect.models\\(, adjust = FALSE)'
bottom <- '> # the three lines below count the number of errors in the code above'
my.data <- my.data[grep(top, my.data):grep(bottom, my.data)]
x <- read.table(text=my.data, comment.char = ">")
I believe I must use the variables top and bottom to locate the table in the log because the log is huge, variable and complex. Also, not every table contains the same number of models.
Perhaps a regex expression could be used somehow taking advantage of the AA and the CC(~1) present in every model name, but I do not know how to begin. Thank you for any help and sorry for the follow-up question. I should have used a more realistic example table in my initial question. I have a large number of logs. Otherwise I could just extract and edit the tables by hand. The table itself is an odd object which I have only ever been able to export directly with capture.output, which would probably still leave me with the same problem as above.
EDIT:
All spaces seem to come right before and right after a plus sign. Perhaps that information can be used here to fill the spaces or remove them.
Try inserting my.data <- gsub(" *\\+ *", "+", my.data) before read.table (at that point my.data is still the character vector of raw lines, so the substitution collapses the spaces around the plus signs before parsing):
my.data <- my.data[grep(top, my.data):grep(bottom, my.data)]
my.data <- gsub(" *\\+ *", "+", my.data)
x <- read.table(text = my.data, comment.char = ">")
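A minimal illustration of why this works, using one row from the dummy table above:
line <- "12 AA(~region + state)BB(~region + state)CC(~1) 13 33333.33 0.0000000 5.621299e-01 44444.44"
line <- gsub(" *\\+ *", "+", line)
read.table(text = line)
# after the substitution the model string contains no spaces,
# so read.table parses the row into seven whitespace-separated fields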