Related
I have some text strings that I would like to extract certain bits of information from. In particular I would like to extract the rating out of 10 from.
I would like help in constructing a functionfunc_to_extract_rating that does the following...
text_string_vec <- c('blah$2.94 blah blah 3/10 blah blah.',
'foo foo 8/10.',
'10/10 bar bar21/09/2010 bar bar',
'jdsfs1/10djflks5/10.')
func_to_extract_rating <- function(){}
output <- lapply(text_string_vec,func_to_extract_rating)
output
[[1]]
[1] 3 10
[[2]]
[1] 8 10
[[3]]
[1] 10 10
[[4]]
[[4]][[1]]
[1] 1 10
[[4]][[2]]
[1] 5 10
Something like this maybe:
library(stringr)
result = str_extract_all(text_string_vec, "[0-9]{1,2}/10")
result = lapply(result, function(x) gsub("/"," ", x))
[[1]]
[1] "3 10"
[[2]]
[1] "8 10"
[[3]]
[1] "10 10"
[[4]]
[1] "1 10" "5 10"
But since it's always out of 10, if you just want the numeric rating, you can do:
result = str_extract_all(text_string_vec, "[0-9]{1,2}/10")
result = lapply(result, function(x) as.numeric(gsub("/10","", x)))
Here is a base R option
lapply(strsplit(str1, "([0-9]{1,2}\\/10)(*SKIP)(*FAIL)|.", perl = TRUE),
function(x) {
lst <- lapply(strsplit(x[nzchar(x)], "/"), as.numeric)
if(length(lst)==1) unlist(lst) else lst})
#[[1]]
#[1] 3 10
#[[2]]
#[1] 8 10
#[[3]]
#[1] 10 10
#[[4]]
#[[4]][[1]]
#[1] 1 10
#[[4]][[2]]
#[1] 5 10
I was hoping somebody would be able to show me a way to extract data from a character vector.
The dataframe is as below
structure(list(Sensitivity = structure(c(1L, 5L, 4L, 4L, 4L,
4L, 3L, 5L, 2L), .Label = c(" 1.01 [ 0.21, 2.91]", " 89.60 [ 85.56, 92.82]",
" 92.95 [ 89.43, 95.59]", " 99.66 [ 98.14, 99.99]", " 100.00 [ 98.77, 100.00]"
), class = "factor"), Specificity = structure(c(8L, 1L, 3L, 4L,
2L, 5L, 6L, 1L, 7L), .Label = c(" 27.17 [ 25.15, 29.26]", " 44.96 [ 42.67, 47.26]",
" 53.31 [ 51.00, 55.61]", " 69.90 [ 67.75, 71.99]", " 70.23 [ 68.08, 72.31]",
" 90.18 [ 88.73, 91.50]", " 91.70 [ 90.35, 92.92]", " 100.00 [ 99.80, 100.00]"
), class = "factor")), .Names = c("Sensitivity", "Specificity"
), class = "data.frame", row.names = c(NA, -9L))
As an example for the first column element of the first column i would ideally get three columns of data of 1.01, 0.21 and 2.91.
The first and second numerical value is separated by a "[" and the second and third by a ",". I am not au fait with grep but have tried using and am going wrong somewhere!
Here is a regular expression solution you can try with using the str_extract_all from stringr package, where we use \\d+\\.\\d+ to match decimal numbers which start from one or more digits followed by . and another one or more digits pattern.
library(stringr)
lapply(df, function(col) do.call(rbind, str_extract_all(col, "\\d+\\.\\d+")))
$Sensitivity
[,1] [,2] [,3]
[1,] "1.01" "0.21" "2.91"
[2,] "100.00" "98.77" "100.00"
[3,] "99.66" "98.14" "99.99"
[4,] "99.66" "98.14" "99.99"
[5,] "99.66" "98.14" "99.99"
[6,] "99.66" "98.14" "99.99"
[7,] "92.95" "89.43" "95.59"
[8,] "100.00" "98.77" "100.00"
[9,] "89.60" "85.56" "92.82"
$Specificity
[,1] [,2] [,3]
[1,] "100.00" "99.80" "100.00"
[2,] "27.17" "25.15" "29.26"
[3,] "53.31" "51.00" "55.61"
[4,] "69.90" "67.75" "71.99"
[5,] "44.96" "42.67" "47.26"
[6,] "70.23" "68.08" "72.31"
[7,] "90.18" "88.73" "91.50"
[8,] "27.17" "25.15" "29.26"
[9,] "91.70" "90.35" "92.92"
Try this:
cbind(
matrix(as.numeric(unlist(strsplit(unlist(strsplit(gsub("]","",
dat$Sensitivity), ",")),"\\["))),ncol=3,byrow = T)
,
matrix(as.numeric(unlist(strsplit(unlist(strsplit(gsub("]","",
dat$Specificity), ",")),"\\["))),ncol=3,byrow = T)
)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1.01 0.21 2.91 100.00 99.80 100.00
[2,] 100.00 98.77 100.00 27.17 25.15 29.26
[3,] 99.66 98.14 99.99 53.31 51.00 55.61
[4,] 99.66 98.14 99.99 69.90 67.75 71.99
[5,] 99.66 98.14 99.99 44.96 42.67 47.26
[6,] 99.66 98.14 99.99 70.23 68.08 72.31
[7,] 92.95 89.43 95.59 90.18 88.73 91.50
[8,] 100.00 98.77 100.00 27.17 25.15 29.26
[9,] 89.60 85.56 92.82 91.70 90.35 92.92
Here is an option using base R to extract the numeric part with the type as numeric
lst <- lapply(d1, function(x) read.csv(text=gsub("[][]", ", ", x), header=FALSE)[-4])
lst
#$Sensitivity
# V1 V2 V3
#1 1.01 0.21 2.91
#2 100.00 98.77 100.00
#3 99.66 98.14 99.99
#4 99.66 98.14 99.99
#5 99.66 98.14 99.99
#6 99.66 98.14 99.99
#7 92.95 89.43 95.59
#8 100.00 98.77 100.00
#9 89.60 85.56 92.82
#$Specificity
# V1 V2 V3
#1 100.00 99.80 100.00
#2 27.17 25.15 29.26
#3 53.31 51.00 55.61
#4 69.90 67.75 71.99
#5 44.96 42.67 47.26
#6 70.23 68.08 72.31
#7 90.18 88.73 91.50
#8 27.17 25.15 29.26
#9 91.70 90.35 92.92
If needed, the list of data.frames can be converted to a single data.frame by cbinding
do.call(cbind, lst)
I would appreciate your help with this a lot!
I have ~4.5k txt files which look like this:
Simple statistics using MSPA parameters: 8_3_1_1 on input file: 20130815 104359 875 000000 0528 0548_result.tif
MSPA-class [color]: Foreground/data pixels [%] Frequency
============================================================
CORE(s) [green]: -- 0
CORE(m) [green]: 48.43/13.45 1
CORE(l) [green]: -- 0
ISLET [brown]: 3.70/ 1.03 20
PERFORATION [blue]: 0.00/ 0.00 0
EDGE [black]: 30.93/ 8.59 11
LOOP [yellow]: 9.66/ 2.68 6
BRIDGE [red]: 0.00/ 0.00 0
BRANCH [orange]: 7.28/ 2.02 40
Background [grey]: --- /72.22 11
Missing [white]: 0.00 0
I want to read all txt files from a directory into R and then perform a rearranging task on them before merging them together.
The values in the txt files can change, so in places where there is a 0.00 now, could be a relevant number in some files (so we need those). For the fields where there are -- now, it would be good if the script could test if there are -- , or a number. If there are the --, then it should turn them into NAs. On the other hand, real 0.00 values are of value and I need them. There is only one value for the Missing white column (or row here), that value should then be copied into both columns, foreground% and data pixels%.
The general rearranging which I need is to make all the data available as columns with only 1 row per txt file. For every row of data in the txt file here, there should be 3 columns in the output file (foreground%, data pixel% and frequency for every color). The name of the row should be the image name which is mentioned in the beginning of the file, here: 20130815 104359 875 000000 0528 0548
The rest can be omitted.
The output should look something like this:
I am working on this simultaneously but am not sure which direction to take. So any help is more than welcome!
Best,
Moritz
This puts it in the format you want, I think, but the example doesn't match your image so I can't be sure:
(lf <- list.files('~/desktop', pattern = '^image\\d+.txt', full.names = TRUE))
# [1] "/Users/rawr/desktop/image001.txt" "/Users/rawr/desktop/image002.txt"
# [3] "/Users/rawr/desktop/image003.txt"
lapply(lf, function(xx) {
rl <- readLines(con <- file(xx), warn = FALSE)
close(con)
## assuming the file name is after "file: " until the end of the string
## and ends in .tif
img_name <- gsub('.*file:\\s+(.*).tif', '\\1', rl[1])
## removes each string up to and including the ===== string
rl <- rl[-(1:grep('==', rl))]
## remove leading whitespace
rl <- gsub('^\\s+', '', rl)
## split the remaining lines by larger chunks of whitespace
mat <- do.call('rbind', strsplit(rl, '\\s{2, }'))
## more cleaning, setting attributes, etc
mat[mat == '--'] <- NA
mat <- cbind(image_name = img_name, `colnames<-`(t(mat[, 2]), mat[, 1]))
as.data.frame(mat)
})
I created three files using your example and made each one slightly different to show how this would work on a directory with several files:
# [[1]]
# image_name CORE(s) [green]: CORE(m) [green]: CORE(l) [green]: ISLET [brown]: PERFORATION [blue]: EDGE [black]: LOOP [yellow]: BRIDGE [red]: BRANCH [orange]: Background [grey]: Missing [white]:
# 1 20130815 104359 875 000000 0528 0548_result <NA> 48.43/13.45 <NA> 3.70/ 1.03 0.00/ 0.00 30.93/ 8.59 9.66/ 2.68 0.00/ 0.00 7.28/ 2.02 --- /72.22 0.00
#
# [[2]]
# image_name CORE(s) [green]: CORE(m) [green]: CORE(l) [green]: ISLET [brown]: PERFORATION [blue]: EDGE [black]: LOOP [yellow]: BRIDGE [red]: BRANCH [orange]: Background [grey]: Missing [white]:
# 1 20139341 104359 875 000000 0528 0548_result 23 48.43/13.45 23 <NA> 0.00/ 0.00 30.93/ 8.59 9.66/ 2.68 0.00/ 0.00 7.28/ 2.02 --- /72.22 0.00
#
# [[3]]
# image_name CORE(s) [green]: CORE(m) [green]: CORE(l) [green]: ISLET [brown]: PERFORATION [blue]: EDGE [black]: LOOP [yellow]: BRIDGE [red]: BRANCH [orange]: Background [grey]: Missing [white]:
# 1 20132343 104359 875 000000 0528 0548_result <NA> <NA> <NA> <NA> <NA> 30.93/ 8.59 9.66/ 2.68 0.00/ 0.00 7.28/ 2.02 <NA> 0.00
EDIT
made a few changes to extract all the info:
(lf <- list.files('~/desktop', pattern = '^image\\d+.txt', full.names = TRUE))
# [1] "/Users/rawr/desktop/image001.txt" "/Users/rawr/desktop/image002.txt"
# [3] "/Users/rawr/desktop/image003.txt"
res <- lapply(lf, function(xx) {
rl <- readLines(con <- file(xx), warn = FALSE)
close(con)
img_name <- gsub('.*file:\\s+(.*).tif', '\\1', rl[1])
rl <- rl[-(1:grep('==', rl))]
rl <- gsub('^\\s+', '', rl)
mat <- do.call('rbind', strsplit(rl, '\\s{2, }'))
dat <- as.data.frame(mat, stringsAsFactors = FALSE)
tmp <- `colnames<-`(do.call('rbind', strsplit(dat$V2, '[-\\/\\s]+', perl = TRUE)),
c('Foreground','Data pixels'))
dat <- cbind(dat[, -2], tmp, image_name = img_name)
dat[] <- lapply(dat, as.character)
dat[dat == ''] <- NA
names(dat)[1:2] <- c('MSPA-class','Frequency')
zzz <- reshape(dat, direction = 'wide', idvar = 'image_name', timevar = 'MSPA-class')
names(zzz)[-1] <- gsub('(.*)\\.(.*) (?:.*)', '\\2_\\1', names(zzz)[-1], perl = TRUE)
zzz
})
here is the result (I just transformed into a long matrix so it would be easier to read. the real results are in a very wide data frame, one for each file):
`rownames<-`(matrix(res[[1]]), names(res[[1]]))
# [,1]
# image_name "20130815 104359 875 000000 0528 0548_result"
# CORE(s)_Frequency "0"
# CORE(s)_Foreground "NA"
# CORE(s)_Data pixels "NA"
# CORE(m)_Frequency "1"
# CORE(m)_Foreground "48.43"
# CORE(m)_Data pixels "13.45"
# CORE(l)_Frequency "0"
# CORE(l)_Foreground "NA"
# CORE(l)_Data pixels "NA"
# ISLET_Frequency "20"
# ISLET_Foreground "3.70"
# ISLET_Data pixels "1.03"
# PERFORATION_Frequency "0"
# PERFORATION_Foreground "0.00"
# PERFORATION_Data pixels "0.00"
# EDGE_Frequency "11"
# EDGE_Foreground "30.93"
# EDGE_Data pixels "8.59"
# LOOP_Frequency "6"
# LOOP_Foreground "9.66"
# LOOP_Data pixels "2.68"
# BRIDGE_Frequency "0"
# BRIDGE_Foreground "0.00"
# BRIDGE_Data pixels "0.00"
# BRANCH_Frequency "40"
# BRANCH_Foreground "7.28"
# BRANCH_Data pixels "2.02"
# Background_Frequency "11"
# Background_Foreground "NA"
# Background_Data pixels "72.22"
# Missing_Frequency "0"
# Missing_Foreground "0.00"
# Missing_Data pixels "0.00"
with your sample data:
lf <- list.files('~/desktop/data', pattern = '.txt', full.names = TRUE)
`rownames<-`(matrix(res[[1]]), names(res[[1]]))
# [,1]
# image_name "20130815 103704 780 000000 0372 0616"
# CORE(s)_Frequency "0"
# CORE(s)_Foreground "NA"
# CORE(s)_Data pixels "NA"
# CORE(m)_Frequency "1"
# CORE(m)_Foreground "54.18"
# CORE(m)_Data pixels "15.16"
# CORE(l)_Frequency "0"
# CORE(l)_Foreground "NA"
# CORE(l)_Data pixels "NA"
# ISLET_Frequency "11"
# ISLET_Foreground "3.14"
# ISLET_Data pixels "0.88"
# PERFORATION_Frequency "0"
# PERFORATION_Foreground "0.00"
# PERFORATION_Data pixels "0.00"
# EDGE_Frequency "1"
# EDGE_Foreground "34.82"
# EDGE_Data pixels "9.75"
# LOOP_Frequency "1"
# LOOP_Foreground "4.96"
# LOOP_Data pixels "1.39"
# BRIDGE_Frequency "0"
# BRIDGE_Foreground "0.00"
# BRIDGE_Data pixels "0.00"
# BRANCH_Frequency "20"
# BRANCH_Foreground "2.89"
# BRANCH_Data pixels "0.81"
# Background_Frequency "1"
# Background_Foreground "NA"
# Background_Data pixels "72.01"
# Missing_Frequency "0"
# Missing_Foreground "0.00"
# Missing_Data pixels "0.00"
I copied and pasted your data on a text file and adjusted the space in order to have consistency between them. You might want to do it or if you can attach a text file, it would be easy to work with. You may use pastebin - http://en.wikipedia.org/wiki/Pastebin
First set your working directory as follows:
setwd("path of your file")
#EDIT: Create a single data frame of all files
split.row.data <- function(x){
a1 = sub("( )+(.*)", '\\2', x)
b1 = unlist(strsplit(sub("( )+(.*)", '\\2', (strsplit(a1, ":"))[[1]][2]), " "))
c1 = unlist(strsplit(b1[1], "/"))
if(length(c1) == 1){
if(which(b1[1:2] %in% "") == 1){
c1 = c(NA, c1)
}else if(which(b1[1:2] %in% "") == 2){
c1 = c(c1, NA)
}
}
c1[which(c1 %in% c("--", "--- "))] <- NA
return(c(unlist(strsplit(strsplit(a1, ":")[[1]][1], " ")),
c1,
b1[length(b1)]))
}
df2 <- data.frame(matrix(nrow = 1, ncol = 6), stringsAsFactors = FALSE)
file_list = list.files('~/desktop', pattern = '^image\\d+.txt', full.names = TRUE)
for (infile in file_list){
file_data <- readLines(con <- file(infile))
close(con)
filename = sub("(.*)(input file:)(.*)(.tif)", "\\3", file_data[3])
a2 <- file_data[7:length(file_data)]
d1 = lapply(a2, function(x) split.row.data(x))
df1 <- data.frame(matrix(nrow= length(d1), ncol = 5), stringsAsFactors = FALSE)
for(i in 1:length(d1)){df1[i, ] <- d1[[i]]}
df1 <- cbind(data.frame(rep(filename, nrow(df1)), stringsAsFactors = FALSE),
df1)
colnames(df1) <- colnames(df2)
df2 <- rbind(df2, df1)
}
df2 <- df2[2:nrow(df2), ]
df2[,4] <- as.numeric(df2[,4])
df2[,5] <- as.numeric(df2[,5])
df2[,6] <- as.numeric(df2[,6])
e1 = unlist(lapply(df2[,3], function(x) gsub(']', '', x)))
df2[,3] = unlist(lapply(e1, function(x) gsub("[[]", '', x)))
header_names <- unlist(lapply(strsplit(file_data[5], "/"), function(x) strsplit(x, " ")))
colnames(df2) <- c("filename",
strsplit(header_names[1], " ")[[1]][2],
"color",
header_names[2:length(header_names)])
row.names(df2) <- 1:nrow(df2)
output:
print(head(df2))
filename MSPA-class color Foreground data pixels [%] Frequency
1 20130815 103739 599 000000 0944 0788 CORE(s) green NA NA 0
2 20130815 103739 599 000000 0944 0788 CORE(m) green 63.46 17.41 1
3 20130815 103739 599 000000 0944 0788 CORE(l) green NA NA 0
4 20130815 103739 599 000000 0944 0788 ISLET brown 0.00 0.00 0
5 20130815 103739 599 000000 0944 0788 PERFORATION blue 0.00 0.00 0
6 20130815 103739 599 000000 0944 0788 EDGE black 35.00 9.60 1
#get data for only "background" from "MSPA-class" column
df2_background <- df2[which(df2[, "MSPA-class"] %in% "Background"), ]
print(df2_background)
filename MSPA-class color Foreground data pixels [%] Frequency
11 20130815 103739 599 000000 0944 0788 Background grey NA 72.57 1
22 20130815 143233 712 000000 1048 0520 Background grey NA 77.51 1
33 20130902 163929 019 000000 0394 0290 Background grey NA 54.55 6
I have
rownames(results.summary)
[1] "2 - 1" "3 - 1" "4 - 1"
What I want is to return a matrix of
2 1
3 1
4 1
The way Ive done it as:
for(i in 1:length(rownames(results.summary)){
current.split <- unlist(strsplit(rownames(results.summary)[i], "-"))
matrix.results$comparison.group[i] <- trim(current.split[1])
matrix.results$control.group[i] <- trim(current.split[2])
}
The trim function basically removes any whitespace on either end.
I've been learning regex and was wondering if there's perhaps a more elegant vectorized solution?
No need to use strsplit, just read it using read.table:
read.table(text=vec,sep='-',strip.white = TRUE) ## see #flodel comment
V1 V2
1 2 1
2 3 1
3 4 1
where vec is :
vec <- c("2 - 1", "3 - 1", "4 - 1")
This should work:
vv <- c("2 - 1", "3 - 1", "4 - 1")
matrix(as.numeric(unlist(strsplit(vv, " - "))), ncol = 2, byrow = TRUE)
# [,1] [,2]
# [1,] 2 1
# [2,] 3 1
# [3,] 4 1
You can also try scan
vec <- c("2 - 1", "3 - 1", "4 - 1")
s <- scan(text = vec, what = integer(), sep = "-", quiet = TRUE)
matrix(s, length(s)/2, byrow = TRUE)
# [,1] [,2]
# [1,] 2 1
# [2,] 3 1
# [3,] 4 1
Another option is cSplit.
library(splitstackshape)
cSplit(data.frame(vec), "vec", sep = " - ", fixed=TRUE)
# vec_1 vec_2
# 1: 2 1
# 2: 3 1
# 3: 4 1
You can use str_match from the package stringr for this:
library(stringr)
##
x <- c("2 - 1","3 - 1","4 - 1")
##
cmat <- str_match(x, "(\\d).+(\\d)")[,-1]
> apply(cmat,2,as.numeric)
[,1] [,2]
[1,] 2 1
[2,] 3 1
[3,] 4 1
Using reshape2 colsplit
library(reshape2)
colsplit(x, " - ", c("A", "B"))
# A B
# 1 2 1
# 2 3 1
# 3 4 1
Or using tidyrs separate
library(tidyr)
separate(data.frame(x), x, c("A", "B"), sep = " - ")
# A B
# 1 2 1
# 2 3 1
# 3 4 1
Shouldn't this code work for repeating number detection in R?
> grep(pattern = "\\d{2}", x = 1223)
[1] 1
> grep(pattern = "\\d{3}", x = 1223)
[1] 1
If we have 988 we should get true and if 123 we should get false.
Sounds like it isn't.
> grep(pattern = "\\d{2}", x = "1223")
[1] 1
> grep(pattern = "\\d{2}", x = "13")
[1] 1
You need to use backreferences:
> grep(pattern = "(\\d)\\1", x = "1224")
[1] 1
> grep(pattern = "(\\d)\\1{1,}", x = "1224")
[1] 1
> grep(pattern = "(\\d)\\1", x = "1234")
integer(0)
EDIT: Seems like you need to figure how it works: (\\d) creates a capture group for the \\d, which can be referred to using a backreference \\1. For example, if you have numbers like x2y and you want to find those where x is the same as y, then:
> grep(pattern = "(\\d)2\\1", x = "121")
[1] 1
> grep(pattern = "(\\d)2\\1", x = "124")
integer(0)
I'd strongly recommend that you read a basic tutorial on regular expressions.
I know the question explicitly says "using regex" in the title, but here is a non-regex method that could work, depending on what you want to do.
strings <- c("1223","1233","1234","113")
# detect consecutive repeat digits, or characters
(strings.rle <- lapply(strings, function(x)rle(unlist(strsplit(x,"")))))
[[1]]
Run Length Encoding
lengths: int [1:3] 1 2 1
values : chr [1:3] "1" "2" "3"
[[2]]
Run Length Encoding
lengths: int [1:3] 1 1 2
values : chr [1:3] "1" "2" "3"
[[3]]
Run Length Encoding
lengths: int [1:4] 1 1 1 1
values : chr [1:4] "1" "2" "3" "4"
[[4]]
Run Length Encoding
lengths: int [1:2] 2 1
values : chr [1:2] "1" "3"
Now you can work with strings.rle to do what you want
# which entries have consecutive repeat digits, or characters
strings[sapply(strings.rle, function(x) any(x$lengths > 1))]
[1] "1223" "1233" "113"
or
# which digits or characters are consecutively repeated?
lapply(strings.rle, function(x) x$values[which(x$lengths > 1)])
[[1]]
[1] "2"
[[2]]
[1] "3"
[[3]]
character(0)
[[4]]
[1] "1"