Replace certain spaces to tabs - delimiters - regex

I have one column data.frame where some spaces should be delimiters some just a space.
#input data
dat <- data.frame(x=c("A 2 2 textA1 textA2 Z1",
"B 4 1 textX1 textX2 textX3 Z2",
"C 3 5 textA1 Z3"))
# x
# 1 A 2 2 textA1 textA2 Z1
# 2 B 4 1 textX1 textX2 textX3 Z2
# 3 C 3 5 textA1 Z3
Need to convert it to 5 column data.frame:
#expected output
output <- read.table(text="
A 2 2 textA1 textA2 Z1
B 4 1 textX1 textX2 textX3 Z2
C 3 5 textA1 Z3",sep="\t")
# V1 V2 V3 V4 V5
# 1 A 2 2 textA1 textA2 Z1
# 2 B 4 1 textX1 textX2 textX3 Z2
# 3 C 3 5 textA1 Z3
Essentially, need to change 1st, 2nd, 3rd, and the last space to a tab (or any other delimiter if it makes it easier to code).
Playing with regex is not giving anything useful yet...
Note1: In real data I have to replace 1st, 2nd, 3rd,...,19th and the last spaces to tabs.
Note2: There is no pattern in V4, text can be anything.
Note3: Last column is one word text with variable length.

Try
v1 <- gsub("^([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+", '\\1,\\2,\\3,', dat$x)
read.table(text=sub(' +(?=[^ ]+$)', ',', v1, perl=TRUE), sep=",")
# V1 V2 V3 V4 V5
#1 A 2 2 textA1 textA2 Z1
#2 B 4 1 textX1 textX2 textX3 Z2
#3 C 3 5 textA1 Z3
Or an option inspired from #Tensibai's post
n <- 3
fpat <- function(n){
paste0('^((?:\\w+ ){', n,'})([\\w ]+)\\s+(\\w+)$')
}
read.table(text=gsub(fpat(n), "\\1'\\2' \\3", dat$x, perl=TRUE))
# V1 V2 V3 V4 V5
#1 A 2 2 textA1 textA2 Z1
#2 B 4 1 textX1 textX2 textX3 Z2
#3 C 3 5 textA1 Z3
For more columns,
n <- 19
v1 <- "A 24 34343 212 zea4 2323 12343 111 dsds 134d 153xd 153xe 153de 153dd dd dees eese tees3 zee2 2353 23335 23353 ddfe 3133"
read.table(text=gsub(fpat(n), "\\1'\\2' \\3", v1, perl=TRUE), sep='')
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15
#1 A 24 34343 212 zea4 2323 12343 111 dsds 134d 153xd 153xe 153de 153dd dd
# V16 V17 V18 V19 V20 V21
#1 dees eese tees3 zee2 2353 23335 23353 ddfe 3133

With a variable number of columns:
library(stringr)
cols <- 3
m <- str_match(dat$x, paste0("((?:\\w+ ){" , cols , "})([\\w ]+) (\\w+)"))
t <- paste0(gsub(" ", "\t", m[,2]), m[,3], "\t", m[,4])
> read.table(text=t,sep="\t")
V1 V2 V3 V4 V5
1 A 2 2 textA1 textA2 Z1
2 B 4 1 textX1 textX2 textX3 Z2
3 C 3 5 textA1 Z3
Change the number of columns to tell how many you wish before.
For the regex:
((?:\\w+ ){3}) Capture the 3 repetitions {3} of the non capturing group (?:\w+ ) which matche at least one alphanumeric character w+ followed by a space
([\\w ]+) (\w+) capture the free text from alphanumeric char or space [\w ]+ followed by a space and the capture the last word with \w+
Once that done, paste the 3 parts returned by str_match taking care of replacing the spaces in the first group m[,2] by tabs.
m[,1] is the whole match so it's unused here.
Old answer:
A basic one matching based on a fixed number of fields:
> read.table(text=gsub("(\\w+) (\\w+) (\\w+) ([\\w ]+) (\\w+)$","\\1\t\\2\t\\3\t\\4\t\\5",dat$x,perl=TRUE),sep="\t")
V1 V2 V3 V4 V5
1 A 2 2 textA1 textA2 Z1
2 B 4 1 textX1 textX2 textX3 Z2
3 C 3 5 textA1 Z3
Add as many (\w+) you wish before, and increase the number of \1 (back references)

Here can be one twisted way to go that will work whatever the number of "words" you have (and that works on your data); it's based on the number of alphanum characters in your "words" compared to the number of alphanum characters in the other fields:
res <- gsub("\\w{3,}\\K\\t(?=\\w{3,})", " ", gsub(" ", "\t", dat$x), perl=T)
res
# [1] "A\t2\t2\ttextA1 textA2\tZ1" "B\t4\t1\ttextX1 textX2 textX3\tZ2" "C\t3\t5\ttextA1\tZ3"
read.table(text=res, sep="\t")
# V1 V2 V3 V4 V5
#1 A 2 2 textA1 textA2 Z1
#2 B 4 1 textX1 textX2 textX3 Z2
#3 C 3 5 textA1 Z3
EDIT: A completely different way to go, only based on the number of the spaces k you need to replace before the last one:
k <- 3 # in your example
res <- sapply(as.character(dat$x),
function(x, k){
pos_sp <- gregexpr(" ", x)[[1]]
x <- strsplit(x, "")[[1]]
if (length(pos_sp) > k+1) pos_sp <- pos_sp[c(1:k, length(pos_sp))]
x[pos_sp] <- "\t"
x <- paste(x, collapse="")
}, k=k)
read.table(text=res, sep="\t")
# V1 V2 V3 V4 V5
# 1 A 2 2 textA1 textA2 Z1
# 2 B 4 1 textX1 textX2 textX3 Z2
# 3 C 3 5 textA1 Z3

Related

sort dataframe columns in R

Is there a way to sort dataframe columns in R. I tried with below, but the result is returning as character instead of dataframe
> asd <- data.frame(a = c("fsd","sdfsd"))
> asd <- with(asd, asd[order(a) , ])
> asd
[1] "fsd" "sdfsd"
Can we get in dataframe only?
Try this
a <- data.frame(x=LETTERS[1:5],y=c(5:1))
a[order(a$x),]
a[order(a$y),]
> a[order(a$x),]
x y
1 A 5
2 B 4
3 C 3
4 D 2
5 E 1
> a[order(a$y),]
x y
5 E 1
4 D 2
3 C 3
2 B 4
1 A 5

Assigning groups using grepl with multiple inputs

I have a dataframe:
df <- data.frame(name=c("john", "david", "callum", "joanna", "allison", "slocum", "lisa"), id=1:7)
df
name id
1 john 1
2 david 2
3 callum 3
4 joanna 4
5 allison 5
6 slocum 6
7 lisa 7
I have a vector containing regex that I wish to find in the df$name variable:
vec <- c("lis", "^jo", "um$")
The output I want to get is as follows:
name id group
1 john 1 2
2 david 2 NA
3 callum 3 3
4 joanna 4 2
5 allison 5 1
6 slocum 6 3
7 lisa 7 1
I could do this doing the following:
df$group <- ifelse(grepl("lis", df$name), 1,
ifelse(grepl("^jo", df$name), 2,
ifelse(grepl("um$", df$name), 3,
NA)
However, I want to do this directly from 'vec'. I am generating different values into vec reactively in a shiny app. Can I assign groups based on index in vec?
Further, if something like the below happens, the group should be the first appearing. e.g. 'Callum' is TRUE for 'all' and "um$" but should get a group 1 here.
vec <- c("all", "^jo", "um$")
Here are several options:
df$group <- apply(Vectorize(grepl, "pattern")(vec, df$name),
1,
function(ii) which(ii)[1])
# name id group
#1 john 1 2
#2 david 2 NA
#3 callum 3 3
#4 joanna 4 2
#5 allison 5 1
#6 slocum 6 3
#7 lisa 7 1
Use a named vector and merge on it:
names(vec) <- seq_along(vec)
df <- merge(df, stack(Vectorize(grep, "pattern", SIMPLIFY=FALSE)(vec, df$name)),
by.x="id", by.y="values", all.x = TRUE)
df[!duplicated(df$id),] # to keep only the first match
# id name ind
#1 1 john 2
#2 2 david <NA>
#3 3 callum 3
#4 4 joanna 2
#5 5 allison 1
#6 6 slocum 3
#7 7 lisa 1
A for loop:
df$group <- NA
for ( i in rev(seq_along(vec))) {
TFvec <- grepl(vec[i], df$name)
df$group[TFvec] <- i
}
df
# name id group
#1 john 1 2
#2 david 2 NA
#3 callum 3 3
#4 joanna 4 2
#5 allison 5 1
#6 slocum 6 3
#7 lisa 7 1
Or you can use outer with stri_match_first_regex from stringi
library(stringi)
match.mat <- outer(df$name, vec, stri_match_first_regex)
df$group <- apply(match.mat, 1, function(ii) which(!is.na(ii))[1])
# [1] for first match in `vec`
# name id group
#1 john 1 2
#2 david 2 NA
#3 callum 3 3
#4 joanna 4 2
#5 allison 5 1
#6 slocum 6 3
#7 lisa 7 1
A vectorised solution, using rebus and stringi.
library(rebus)
library(stringi)
Create a regular expression that captures any of the values in vec.
vec <- c("lis", "^jo", "um$")
(rx <- or1(vec, capture = TRUE))
## <regex> (lis|^jo|um$)
Match the regex, then convert to factor and integer.
matches <- stri_match_first_regex(df$name, rx)[, 2]
df$group <- as.integer(factor(matches, levels = c("lis", "jo", "um")))
df now looks like this:
name id group
1 john 1 2
2 david 2 NA
3 callum 3 3
4 joanna 4 2
5 allison 5 1
6 slocum 6 3
7 lisa 7 1

How to split R data.frame column based regular expression condition

I have a data.frame and I want to split one of its columns to two based on a regular expression. More specifically the strings have a suffix in parentheses that needs to be extracted to a column of its own.
So e.g. I want to get from here:
dfInit <- data.frame(VAR = paste0(c(1:10),"(",c("A","B"),")"))
to here:
dfFinal <- data.frame(VAR1 = c(1:10), VAR2 = c("A","B"))
1) gsubfn::read.pattern read.pattern in the gsubfn package can do that. The matches to the parenthesized portions of the regular rexpression are regarded as the fields:
library(gsubfn)
read.pattern(text = as.character(dfInit$VAR), pattern = "(.*)[(](.*)[)]$")
giving:
V1 V2
1 1 A
2 2 B
3 3 A
4 4 B
5 5 A
6 6 B
7 7 A
8 8 B
9 9 A
10 10 B
2) sub Another way is to use sub:
data.frame(V1=sub("\\(.*", "", dfInit$VAR), V2=sub(".*\\((.)\\)$", "\\1", dfInit$VAR))
giving the same result.
3) read.table This solution does not use a regular expression:
read.table(text = as.character(dfInit$VAR), sep = "(", comment = ")")
giving the same result.
You could also use extract from tidyr
library(tidyr)
extract(dfInit, VAR, c("VAR1", "VAR2"), "(\\d+).([[:alpha:]]+).", convert=TRUE) # edited and added `convert=TRUE` as per #aosmith's comments.
# VAR1 VAR2
#1 1 A
#2 2 B
#3 3 A
#4 4 B
#5 5 A
#6 6 B
#7 7 A
#8 8 B
#9 9 A
#10 10 B
See Split column at delimiter in data frame
dfFinal <- within(dfInit, VAR<-data.frame(do.call('rbind', strsplit(as.character(VAR), '[[:punct:]]'))))
> dfFinal
VAR.X1 VAR.X2
1 1 A
2 2 B
3 3 A
4 4 B
5 5 A
6 6 B
7 7 A
8 8 B
9 9 A
10 10 B
An approach with regmatches and gregexpr:
as.data.frame(do.call(rbind, regmatches(dfInit$VAR, gregexpr("\\w+", dfInit$VAR))))
You can also use cSplit from splitstackshape.
library(splitstackshape)
cSplit(dfInit, "VAR", "[()]", fixed=FALSE)
# VAR_1 VAR_2
# 1: 1 A
# 2: 2 B
# 3: 3 A
# 4: 4 B
# 5: 5 A
# 6: 6 B
# 7: 7 A
# 8: 8 B
# 9: 9 A
#10: 10 B

Removing brackets from a dataset

I have imported a dataset in R of 10 columns and 100 row. But in few columns there are brackets([]) and commas along with the values. How can i get rid of them?
As an instance, consider one of 4 columns and 2 rows.
V1 V2 V3 V4
3( [4 ([5 8
(1 5 9 [10,
And what i want is
V1 V2 V3 V4
3 4 5 8
1 5 9 10
Just use gsub:
mydf[] <- lapply(mydf, function(x) gsub("[][(),]", "", x))
mydf
# V1 V2 V3 V4
# 1 3 4 5 8
# 2 1 5 9 10
Instead of lapply, you can also use as.matrix:
mydf[] <- gsub("[][(),]", "", as.matrix(mydf))
mydf
# V1 V2 V3 V4
# 1 3 4 5 8
# 2 1 5 9 10

R: fragment a list

I am kind of tired of working with lists..and my limited R capabilities ... I could not solve this from long time...
My list with multiple dataframe looks like the following:
set.seed(456)
sn1 = paste( "X", c(1:4), sep= "")
onelist <- list (df1 <- data.frame(sn = sn1, var1 = runif(4)),
df2 <- data.frame(sn = sn1, var1 = runif(4)),
df3 <- data.frame(sn = sn1,var1 = runif(4)))
[[1]]
sn var1
1 X1 0.3852362
2 X2 0.3729459
3 X3 0.2179086
4 X4 0.7551050
[[2]]
sn var1
1 X1 0.8216811
2 X2 0.5989182
3 X3 0.6510336
4 X4 0.8431172
[[3]]
sn var1
1 X1 0.4532381
2 X2 0.7167571
3 X3 0.2912222
4 X4 0.1798831
I want make a subset list in which the row 2 and 3 are only present.
srow <- c(2:3) # just I have many rows in real data
newlist <- lapply(onelist, function(y) subset(y, row(y) == srow))
The newlist is empty....
> newlist
[[1]]
[1] sn var1
<0 rows> (or 0-length row.names)
[[2]]
[1] sn var1
<0 rows> (or 0-length row.names)
[[3]]
[1] sn var1
<0 rows> (or 0-length row.names)
Help please ....
Does this do it?
Note the comma after the rows which implicitly is interpreted as NULL and results in the extraction all of the columns:
> lapply(onelist, "[", c(2,3),)
[[1]]
sn var1
2 X2 0.2105123
3 X3 0.7329553
[[2]]
sn var1
2 X2 0.33195997
3 X3 0.08243274
[[3]]
sn var1
2 X2 0.3852362
3 X3 0.3729459
You could have gotten your subset strategy to work with:
lapply(onelist, function(y) subset(y, rownames(y) %in% srow ))
Note that many time people use "==" when they really should be using %in%
?match
I don't think the row function does what you think it does:
Returns a matrix of integers indicating their row number in a matrix-like object, or a factor indicating the row labels.
Looking at what it returns on the list you have
> row(onelist[[1]])
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 4
> row(onelist[[1]])==srow
[,1] [,2]
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] FALSE FALSE
You are doing a simple subset of the data.frames, so you can just use
newlist <- lapply(onelist, function(y) y[srow,])
which gives
> newlist
[[1]]
sn var1
2 X2 0.2105123
3 X3 0.7329553
[[2]]
sn var1
2 X2 0.33195997
3 X3 0.08243274
[[3]]
sn var1
2 X2 0.3852362
3 X3 0.3729459