data.table setnames combined with regex

I would like to rename each column in a data table based on a regex in an appropriate way.
library(data.table)
DT <- data.table("a_foo" = 1:2, "bar_b" = 1:2)
a_foo bar_b
1: 1 1
2: 2 2
I would like to cut the "_foo" and "bar_" from the names. This classic line does the trick, but it also copies the whole table.
names(DT) <- gsub("_foo|bar_", "", names(DT))
How can I do the same using setnames()? I have a lot of variables, so writing out all of the names by hand is not an option.

You could try
setnames(DT, names(DT), gsub("_foo|bar_", "", names(DT)))
based on the usage in ?setnames, i.e. setnames(x, old, new).
Or, as @eddi commented, pass only the new names:
setnames(DT, gsub("_foo|bar_", "", names(DT)))
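One way to check that the rename happens in place is to compare the table's memory address before and after the call. This is only a minimal sketch: address() is exported by data.table, and the variable addr_before is introduced here purely for illustration.

```r
library(data.table)

DT <- data.table(a_foo = 1:2, bar_b = 1:2)

# Remember where the table lives before renaming (addr_before is illustrative)
addr_before <- address(DT)

# Rename by reference: with a single character vector, setnames() replaces all names
setnames(DT, gsub("_foo|bar_", "", names(DT)))

names(DT)                             # the cleaned column names
identical(addr_before, address(DT))   # TRUE if no copy was made
```

If the second result is TRUE, the table was modified by reference rather than copied.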

Related

Search and replace multiple strings in list of strings: improve R code

I am looking for a simplified solution to the following problem in R: I have a list of names that are separated by commas – however, some of the names also have commas in them. In order to separate the names, I would like to replace all names with commas first and then split by comma. My problem is that I have around 26 000 strings with several names in each and I have a list of around 130 names with commas. I have written a nested foreach loop (in order to use multiple cores to speed things up) and it works but it’s horribly slow. Is there a quicker way to search in the strings and replace the relevant names? Here is my sample code:
List_of_names<-as.data.frame(c("Fred, Heiko, Franz, Jr., Nice, LLC, Meike","Digital, Mike, John, Sr","Svenja, Sven"))
Comma_names<-as.data.frame(c("Franz, Jr.","Nice, LLC","John, Sr"))
colnames(Comma_names)<-"name"
Comma_names$replace_names<-gsub(",", "",Comma_names[,"name"])
library(doParallel)
library(foreach)
cl<-makeCluster(4) # Create cluster with desired number of cores
registerDoParallel(cl) # Register cluster
names_new<-foreach (i=1:nrow(List_of_names),.errorhandling="pass",.packages=c("foreach")) %dopar% {
name_2<-List_of_names[i,]
foreach (j=1:nrow(Comma_names),.combine=rbind,.errorhandling="pass") %do% {
if(length(grep(Comma_names[j,1],name_2))>0){
name_2<-gsub(Comma_names[j,1], Comma_names[j,2],name_2)
}
}
name_2
}
In addition, the result of the foreach loop is a list but if I try to save the list or replace the column in my original dataframe it takes forever. How can I change my code to make it faster?
Thank you to everyone who reads this and is able to help!
Principle
You can use a combination of Reduce and stri_replace_all from the stringi package.
Code
library(stringi)
Comma_names <- structure(list(name = c("Franz, Jr.", "Nice, LLC", "John, Sr"),
replace_names = c("Franz Jr.", "Nice LLC", "John Sr")),
.Names = c("name", "replace_names"),
row.names = c(NA, -3L), class = "data.frame")
List_of_names <- structure(list(name = c("Fred, Heiko, Franz, Jr., Nice, LLC, Meike",
"Digital, Mike, John, Sr", "Svenja, Sven")),
.Names = "name",
row.names = c(NA, -3L), class = "data.frame")
wrapper <- function(str, ind) stri_replace_all(str, Comma_names$replace_names[ind],
fixed = Comma_names$name[ind])
ind <- 1:NROW(Comma_names)
Reduce(wrapper, ind, init = List_of_names$name)
# [1] "Fred, Heiko, Franz Jr., Nice LLC, Meike"
# [2] "Digital, Mike, John Sr"
# [3] "Svenja, Sven"
Explanation
stri_replace_all is a fast function that replaces all occurrences of a pattern in a string. With Reduce, you apply a function to the result of the previous function call. So we apply wrapper to the column with all the names and replace the string from the first row of Comma_names. We then feed the result to wrapper again, this time replacing all occurrences of the string from the second row, and so on. This code should run reasonably fast, so you should not need to parallelize. I would be curious to hear your feedback on the execution time.
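The same folding idea can be sketched in base R, with gsub(fixed = TRUE) swapped in for stri_replace_all, followed by the split on commas that the question was ultimately after. The names old, new, x, replace_one, cleaned and parts are illustrative and not part of the original code.

```r
# Names that contain commas, and their comma-free replacements
old <- c("Franz, Jr.", "Nice, LLC", "John, Sr")
new <- gsub(",", "", old, fixed = TRUE)

x <- c("Fred, Heiko, Franz, Jr., Nice, LLC, Meike",
       "Digital, Mike, John, Sr",
       "Svenja, Sven")

# One replacement step: substitute the i-th name in every string
replace_one <- function(strings, i) gsub(old[i], new[i], strings, fixed = TRUE)

# Fold the steps: the output of step i is the input of step i + 1
cleaned <- Reduce(replace_one, seq_along(old), x)

# Now the remaining commas are real separators
parts <- strsplit(cleaned, ", ", fixed = TRUE)
parts[[1]]
```

Each call to replace_one is vectorised over all strings, so the loop runs once per comma name (about 130 times in the question) rather than once per string.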
Benchmark
Just a little benchmark with 3 million rows:
List_of_names <- List_of_names[rep(1:NROW(List_of_names), 1e6), , drop = FALSE]
system.time(invisible(Reduce(wrapper, ind, init = List_of_names$name)))
# user system elapsed
# 1.95 0.00 1.96

How to use regular expressions properly on a SQL files?

I have a lot of undocumented and uncommented SQL queries. I would like to extract some information within the SQL-statements. Particularly, I'm interested in DB-names, table names and if possible column names. The queries have usually the following syntax.
SELECT *
FROM mydb.table1 m
LEFT JOIN mydb.sometable o ON m.id = o.id
LEFT JOIN mydb.sometable t ON p.id=t.id
LEFT JOIN otherdb.sometable s ON s.column='test'
Usually, the statements involve several DBs and tables. I would like to extract only the DBs and tables, without any other information. I wondered whether it is possible to first extract whatever follows FROM, JOIN and LEFT JOIN, which is usually db.table. The letters that follow, such as o, t and s, are aliases for the referenced tables; I suppose those are difficult to capture. What I tried, without any success, is something like:
gsub(".*FROM \\s*|WHERE|ORDER|GROUP.*", "", vec)
This assumes that each statement ends with WHERE/where, ORDER/order or GROUP..., but it doesn't work as expected.
You haven't indicated which database system you are using but virtually all such systems have introspection facilities that would allow you to get this information a lot more easily and reliably than attempting to parse SQL statements. The following code which supposes SQLite can likely be adapted to your situation by getting a list of your databases and then looping over the databases and using dbConnect to connect to each one in turn running code such as this:
library(gsubfn)
library(RSQLite)
con <- dbConnect(SQLite()) # use in memory database for testing
# create two tables for purposes of this test
dbWriteTable(con, "BOD", BOD, row.names = FALSE)
dbWriteTable(con, "iris", iris, row.names = FALSE)
# get all table names and columns
tabinfo <- Map(function(tab) names(fn$dbGetQuery(con, "select * from $tab limit 0")),
dbListTables(con))
dbDisconnect(con)
giving an R list whose names are the table names and whose entries are the column names:
> tabinfo
$BOD
[1] "Time" "demand"
$iris
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
or perhaps long form output is preferred:
setNames(stack(tabinfo), c("column", "table"))
giving:
column table
1 Time BOD
2 demand BOD
3 Sepal.Length iris
4 Sepal.Width iris
5 Petal.Length iris
6 Petal.Width iris
7 Species iris
You could use the stringi package for this.
library(stringi)
# Your string vector
myString <- "SELECT *
FROM mydb.table1 m
LEFT JOIN mydb.sometable o ON m.id = o.id
LEFT JOIN mydb.sometable t ON p.id=t.id
LEFT JOIN otherdb.sometable s ON s.column='test'"
# Three stringi functions are used:
# stri_extract_all_regex extracts every FROM or JOIN together with the text that follows, up to the next whitespace
# stri_replace_all_regex removes the leading FROM or JOIN (and the space after it) from each match
# stri_unique keeps only the unique strings
t <- stri_unique(stri_replace_all_regex(stri_extract_all_regex(myString, "((FROM|JOIN) [^\\s]+)", simplify = TRUE),
"(FROM|JOIN) ", ""))
> t
[1] "mydb.table1" "mydb.sometable" "otherdb.sometable"
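If stringi is not available, roughly the same extraction could be done in base R with gregexpr and regmatches, and the result split into a database column and a table column. This is a sketch under the assumption that every table reference is written as db.table; unqualified table names would be missed. The names sql, hits, qualified, parts and refs are illustrative.

```r
sql <- paste(
  "SELECT *",
  "FROM mydb.table1 m",
  "LEFT JOIN mydb.sometable o ON m.id = o.id",
  "LEFT JOIN mydb.sometable t ON p.id=t.id",
  "LEFT JOIN otherdb.sometable s ON s.column='test'",
  sep = "\n"
)

# Find FROM/JOIN followed by a qualified name of the form db.table
hits <- regmatches(
  sql,
  gregexpr("(FROM|JOIN)[[:space:]]+[[:alnum:]_]+\\.[[:alnum:]_]+", sql,
           ignore.case = TRUE)
)[[1]]

# Drop the keyword and keep each db.table pair once
qualified <- unique(sub("^(FROM|JOIN)[[:space:]]+", "", hits, ignore.case = TRUE))

# Split "db.table" into its two parts
parts <- do.call(rbind, strsplit(qualified, ".", fixed = TRUE))
refs <- data.frame(db = parts[, 1], table = parts[, 2], stringsAsFactors = FALSE)
refs
```

Setting ignore.case = TRUE lets the same pattern handle lower-case keywords such as from and join.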

Split one column into two columns and retaining the separator

I have a very large data array:
'data.frame': 40525992 obs. of 14 variables:
$ INSTNM : Factor w/ 7050 levels "A W Healthcare Educators"
$ Total : Factor w/ 3212 levels "1","10","100",
$ Crime_Type : Factor w/ 72 levels "MURD11","NEG_M11",
$ Count : num 0 0 0 0 0 0 0 0 0 0 ...
The Crime_Type column contains the type of crime and the year, so "MURD11" is Murder in 2011. These are college campus crime statistics my kid is analyzing for her school project; I help when she is stuck. I am currently stuck at creating a clean data file she can analyze.
Once I converted the wide file (all 9 crime types in columns) to a long file using gather, the file size went from 300 MB to 8 GB, so the file I am working on is now 8 GB. Do you think that is the problem? How do I convert it to a data.table for faster processing?
What I want to do is split this 'Crime_Type' column into two columns, 'Crime_Type' and 'Year'. The values contain letters and digits. Some codes also contain special characters, such as the underscore in NEG_M ('Negligent Manslaughter').
We will replace the codes with the full names later, but can someone suggest how I can separate
MURD11 --> MURD and 11 (in two columns)
NEG_M10 --> NEG_M and 10 (in two columns)
etc...
I have tried using,
df <- separate(totallong, Crime_Type, into = c("Crime", "Year"), sep = "[:digit:]", extra = "merge")
df <- separate(totallong, Crime_Type, into = c("Year", "Temp"), sep = "[:alpha:]", extra = "merge")
The first one separates the Crime as it looks for numbers. The second one does not work at all.
I also tried
df$Crime_Type<- apply (strsplit(as.character(df$Crime_Type), split="[:digit:]"))
That does not work at all. I have gone through many posts on Stack Overflow, and that's where I got these commands, but I am now truly stuck and would appreciate your help.
Since you're using tidyr already (as evidenced by separate), try the extract function, which, given a regex, puts each captured group into a new column. The 'Crime_Type' is all the non-numeric stuff, and the 'Year' is the numeric stuff. Adjust the regex accordingly.
library(tidyr)
extract(df, 'Crime_Type', into=c('Crime', 'Year'), regex='^([^0-9]+)([0-9]+)$')
In base R, one option would be to create a unique delimiter between the non-numeric and numeric part. We can capture as a group the non-numeric ([^0-9]+) and numeric ([0-9]+) characters by wrapping it inside the parentheses ((..)) and in the replacement we use \\1 for the first capture group, followed by a , and the second group (\\2). This can be used as input vector to read.table with sep=',' to read as two columns.
df1 <- read.table(text=gsub('([^0-9]+)([0-9]+)', '\\1,\\2',
totallong$Crime_Type),sep=",", col.names=c('Crime', 'Year'))
df1
# Crime Year
#1 MURD 11
#2 NEG_M 11
If we need, we can cbind with the original dataset
cbind(totallong, df1)
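Since Crime_Type is a factor with only 72 levels, the capture-group regex could also be applied to the levels alone and the result mapped back to the rows through the integer codes, which avoids running the regex on all 40 million values. This is a hedged sketch on a small made-up sample; it assumes every level consists of non-digits followed by digits, and the names crime_type, lev, crime_lev, year_lev, idx and split_cols are illustrative.

```r
# Hypothetical sample standing in for totallong$Crime_Type
crime_type <- factor(c("MURD11", "NEG_M10", "MURD11", "NEG_M11"))

# Work on the distinct levels only
lev <- levels(crime_type)
pattern <- "^([^0-9]+)([0-9]+)$"
crime_lev <- sub(pattern, "\\1", lev)              # non-digit prefix
year_lev  <- as.integer(sub(pattern, "\\2", lev))  # trailing digits

# Map the per-level results back to every row via the factor's integer codes
idx <- as.integer(crime_type)
split_cols <- data.frame(Crime = crime_lev[idx],
                         Year  = year_lev[idx],
                         stringsAsFactors = FALSE)
split_cols
```

The regex runs once per level, and the per-row work is reduced to integer indexing.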
Or in base R, we can use strsplit with split specifying the boundary between a non-digit ((?<=[^0-9])) and a digit ((?=[0-9])). Here we use lookarounds to match the boundary. The output is a list; we can rbind its elements with do.call(rbind, ...) and convert the result to a data.frame.
as.data.frame(do.call(rbind, strsplit(as.character(totallong$Crime_Type),
split="(?<=[^0-9])(?=[0-9])", perl=TRUE)))
# V1 V2
#1 MURD 11
#2 NEG_M 11
Or another option is tstrsplit from the devel version of data.table ie. v1.9.5. Here also, we use the same regex. In addition, there is option to convert the output columns into different class.
library(data.table)#v1.9.5+
setDT(totallong)[, c('Crime', 'Year') := tstrsplit(Crime_Type,
"(?<=[^0-9])(?=[0-9])", perl=TRUE, type.convert=TRUE)]
# Crime_Type Crime Year
#1: MURD11 MURD 11
#2: NEG_M11 NEG_M 11
If we don't need the 'Crime_Type' column in the output, it can be assigned to NULL
totallong[, Crime_Type:= NULL]
NOTE: Instructions to install the devel version are here
Or a faster option would be stri_extract_all from library(stringi) after collapsing the rows to a single string ('v2'). The alternate elements in 'v3' can be extracted by indexing with seq to create new data.frame
library(stringi)
v2 <- paste(totallong$Crime_Type, collapse='')
v3 <- stri_extract_all(v2, regex='\\d+|\\D+')[[1]]
ind1 <- seq(1, length(v3), by=2)
ind2 <- seq(2, length(v3), by=2)
d1 <- data.frame(Crime=v3[ind1], Year= v3[ind2])
Benchmarks
v1 <- do.call(paste, c(expand.grid(c('MURD', 'NEG_M'), 11:15), sep=''))
set.seed(24)
test <- data.frame(v1= sample(v1, 40525992, replace=TRUE ))
system.time({
v2 <- paste(test$v1, collapse='')
v3 <- stri_extract_all(v2, regex='\\d+|\\D+')[[1]]
ind1 <- seq(1, length(v3), by=2)
ind2 <- seq(2, length(v3), by=2)
d1 <- data.frame(Crime=v3[ind1], Year= v3[ind2])
})
#user system elapsed
#56.019 1.709 57.838
data
totallong <- data.frame(Crime_Type= c('MURD11', 'NEG_M11'))

Pattern matching in dataset

I have been struggling with this for a while.
I have a dataset with two columns: a Description column and a Pattern column that I am trying to match against the Description column. If the corresponding pattern exists in the Description column, it needs to be replaced by an asterisk.
For instance, if the Description is ABCDEisthedescription and the Pattern is ABCDE, then the new description should be *isthedescription
I tried the following
data$NewDescription <- gsub(data$pattern,"\\*",Data$Description )
Since there is more than one row in the dataset, it throws an error (a warning, rather):
"argument 'pattern' has length > 1 and only the first element will be used"
Any help will be hugely appreciated.
You can use mapply here to apply the function to each row.
#sample data
data<-data.frame(
pattern=c("ABCDE","XYZ"),
Description=c("ABCDEisthedescription", "sillyXYZvalue")
)
Now use mapply
# with fixed=TRUE the replacement is taken literally, so a plain "*" is enough
mapply(function(p,d) gsub(p, "*", d, fixed=TRUE), data$pattern, data$Description, USE.NAMES=FALSE)
# [1] "*isthedescription" "silly*value"
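To see why fixed matching matters for row-wise patterns, here is a small sketch that adds a pattern containing a regex metacharacter. The third row ("a.b") and the NewDescription column name are made up for illustration.

```r
data <- data.frame(
  pattern     = c("ABCDE", "XYZ", "a.b"),
  Description = c("ABCDEisthedescription", "sillyXYZvalue", "a.b and axb"),
  stringsAsFactors = FALSE
)

# Pair each pattern with the description in the same row;
# fixed = TRUE treats the pattern as a literal string, not a regex
data$NewDescription <- mapply(
  function(p, d) gsub(p, "*", d, fixed = TRUE),
  data$pattern, data$Description,
  USE.NAMES = FALSE
)
data$NewDescription
```

Without fixed = TRUE, the dot in "a.b" would match any character, so "axb" in the third row would be replaced as well.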
Additionally,
Patterns <- paste0(
sample(LETTERS[1:4],500,replace=TRUE),
sample(LETTERS[1:4],500,replace=TRUE),
sample(LETTERS[1:4],500,replace=TRUE),
sample(LETTERS[1:4],500,replace=TRUE))
##
Desc <- paste0(Patterns,"isthedescription")
Ptrn <- sample(Patterns,500)
##
Data <- data.frame(
Description=Desc,
Pattern=Ptrn,
stringsAsFactors=FALSE)
##
newDesc <- sapply(1:nrow(Data), function(X){
if(substr(Data$Description[X],1,4)==Data$Pattern[X]){
gsub(Data$Pattern[X],"*",Data$Description[X])
} else {
Data$Description[X]
}
})
@MrFlick's approach seems more concise though.

read table with spaces in one column

I am attempting to extract tables from very large text files (computer logs). Dickoa provided very helpful advice to an earlier question on this topic here: extracting table from text file
I modified his suggestion to fit my specific problem and posted my code at the link above.
Unfortunately I have encountered a complication. One column in the table contains spaces. These spaces are generating an error when I try to run the code at the link above. Is there a way to modify that code, or specifically the read.table function to recognize the second column below as a column?
Here is a dummy table in a dummy log:
> collect.models(, adjust = FALSE)
model npar AICc DeltaAICc weight Deviance
5 AA(~region + state + county + city)BB(~region + state + county + city)CC(~1) 17 11111.11 0.0000000 5.621299e-01 22222.22
4 AA(~region + state + county)BB(~region + state + county)CC(~1) 14 22222.22 0.0000000 5.621299e-01 77777.77
12 AA(~region + state)BB(~region + state)CC(~1) 13 33333.33 0.0000000 5.621299e-01 44444.44
12 AA(~region)BB(~region)CC(~1) 6 44444.44 0.0000000 5.621299e-01 55555.55
>
> # the three lines below count the number of errors in the code above
Here is the R code I am trying to use. This code works if there are no spaces in the second column, the model column:
my.data <- readLines('c:/users/mmiller21/simple R programs/dummy.log')
top <- '> collect.models\\(, adjust = FALSE)'
bottom <- '> # the three lines below count the number of errors in the code above'
my.data <- my.data[grep(top, my.data):grep(bottom, my.data)]
x <- read.table(text=my.data, comment.char = ">")
I believe I must use the variables top and bottom to locate the table in the log because the log is huge, variable and complex. Also, not every table contains the same number of models.
Perhaps a regular expression could be used somehow, taking advantage of the AA and the CC(~1) present in every model name, but I do not know how to begin. Thank you for any help and sorry for the follow-up question. I should have used a more realistic example table in my initial question. I have a large number of logs. Otherwise I could just extract and edit the tables by hand. The table itself is an odd object which I have only ever been able to export directly with capture.output, which would probably still leave me with the same problem as above.
EDIT:
All spaces seem to come right before and right after a plus sign. Perhaps that information can be used here to fill the spaces or remove them.
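Try removing the spaces around each plus sign before calling read.table. Because my.data is a character vector returned by readLines (not a data frame, so my.data$model does not exist), apply gsub to the whole vector:
my.data <- my.data[grep(top, my.data):grep(bottom, my.data)]
my.data <- gsub(" *\\+ *", "+", my.data)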
x <- read.table(text=my.data, comment.char = ">")
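The whole pipeline can be tried without a log file by feeding it a few lines of text. This is a sketch only: log_lines is a shortened, made-up stand-in for the real log, the markers are matched with fixed = TRUE rather than as regular expressions, and the names start, end and block are illustrative.

```r
# Made-up miniature log with spaces inside the model column
log_lines <- c(
  "some earlier output",
  "> collect.models(, adjust = FALSE)",
  "                                        model npar     AICc DeltaAICc",
  "5 AA(~region + state)BB(~region + state)CC(~1)   13 33333.33       0.0",
  "4 AA(~region)BB(~region)CC(~1)                    6 44444.44       1.5",
  ">",
  "> # the three lines below count the number of errors in the code above"
)

top    <- "> collect.models(, adjust = FALSE)"
bottom <- "> # the three lines below count the number of errors in the code above"

# Locate the table; fixed = TRUE avoids having to escape the parentheses
start <- grep(top, log_lines, fixed = TRUE)
end   <- grep(bottom, log_lines, fixed = TRUE)
block <- log_lines[start:end]

# Close up the spaces around each plus sign so every model name is one field
block <- gsub(" *\\+ *", "+", block)

# Lines starting with ">" are treated as comments and skipped
x <- read.table(text = block, comment.char = ">", stringsAsFactors = FALSE)
x
```

Because the header line has one field fewer than the data lines, read.table uses the first column as row names, so those labels need to be unique within each table.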