I have this line:
system<-c("System configuration: type=Shared mode=Uncapped smt=4 lcpu=96 mem=393216MB psize=64 ent=16.00")
I need to parse this and pick out smt, lcpu, mem, psize and ent into different objects.
For example, I am doing this to pick out smt, but it captures the whole line. Any ideas what I am doing wrong here?
smt<-sub('^.* smt=([[:digit:]])', '\\1', system)
smt needs to have a number 4 in this case.
I would use strsplit a couple of times, plus type.convert:
parse.config <- function(x) {
  clean <- sub("System configuration: ", "", x)
  pairs <- strsplit(clean, " ")[[1]]
  items <- strsplit(pairs, "=")
  keys <- sapply(items, `[`, 1)
  values <- sapply(items, `[`, 2)
  values <- lapply(values, type.convert, as.is = TRUE)
  setNames(values, keys)
}
config <- parse.config(system)
# $type
# [1] "Shared"
#
# $mode
# [1] "Uncapped"
#
# $smt
# [1] 4
#
# $lcpu
# [1] 96
#
# $mem
# [1] "393216MB"
#
# $psize
# [1] 64
#
# $ent
# [1] 16
The output is a list so you can access any of the parsed items, for example:
config$smt
# [1] 4
Using strapplyc in the gsubfn package, the following creates a list L whose names are the left-hand sides (such as smt) and whose values are the right-hand sides.
library(gsubfn)
LHS <- strapplyc( system, "(\\w+)=" )[[1]]
RHS <- strapplyc( system, "=(\\w+)" )[[1]]
L <- setNames( as.list(RHS), LHS )
For example, we can now get smt like this (and similarly for the other left-hand sides):
> L$smt
[1] "4"
UPDATE: Simplified.
add .* to the end of your matching expression and you'll get "4".
sub('^.* smt=([[:digit:]]+).*', '\\1', system)
Note the + I included, which handles values of more than a single digit.
You could also approach this by splitting on spaces and then finding the matches:
splits <- unlist(strsplit(system, ' '))
sub('smt=', '', grep('smt=', splits, value=TRUE))
# [1] "4"
or wrapping it in a function:
matchfun <- function(string, to_match, splitter = ' ') {
  splits <- unlist(strsplit(string, splitter))
  sub(to_match, '', grep(to_match, splits, value = TRUE))
}
matchfun(system, 'smt=')
# [1] "4"
Well, I'm voting for @GaborGrothendieck's, but am offering this as a more pedestrian alternative:
inp <- c("System configuration: type=Shared mode=Uncapped smt=4 lcpu=96 mem=393216MB psize=64 ent=16.00")
inparsed <- read.table(text=inp, stringsAsFactors=FALSE)
vals <- unlist(inparsed)[grep("\\=", unlist(inparsed))]
vals
# V3 V4 V5 V6 V7 V8 V9
# type=Shared mode=Uncapped smt=4 lcpu=96 mem=393216MB psize=64 ent=16.00
vals[grep("smt|lcpu|mem|psize|ent", vals)]
#       V5         V6             V7         V8          V9
#  "smt=4"  "lcpu=96" "mem=393216MB" "psize=64" "ent=16.00"
I would note that choosing the name 'system' for a variable seems most unwise in light of the system function's existence.
I have following sorted list (lst) of time periods and I want to split the periods into specific dates and then extract maximum time period without altering order of the list.
$`1`
[1] "01.12.2015 - 21.12.2015"
$`2`
[1] "22.12.2015 - 05.01.2016"
$`3`
[1] "14.09.2015 - 12.10.2015" "29.09.2015 - 26.10.2015"
Therefore, after adjustment list should look like this:
$`1`
[1] "01.12.2015" "21.12.2015"
$`2`
[1] "22.12.2015" "05.01.2016"
$`3`
[1] "14.09.2015" "12.10.2015" "29.09.2015" "26.10.2015"
In order to do so, I began with splitting the list:
lst_split <- str_split(lst, pattern = " - ")  # str_split() is from stringr
which leads to the following:
[[1]]
[1] "01.12.2015" "21.12.2015"
[[2]]
[1] "22.12.2015" "05.01.2016"
[[3]]
[1] "c(\"14.09.2015" "12.10.2015\", \"29.09.2015" "26.10.2015\")"
Then, I tried to extract the pattern:
lapply(lst_split, function(x) str_extract(pattern = c("\\d+\\.\\d+\\.\\d+"),x))
but my output is missing one date (29.09.2015)
[[1]]
[1] "01.12.2015" "21.12.2015"
[[2]]
[1] "22.12.2015" "05.01.2016"
[[3]]
[1] "14.09.2015" "12.10.2015" "26.10.2015"
Does anyone have an idea how I could make it work and maybe propose more efficient solution? Thank you in advance.
Thanks to the comments of @WiktorStribiżew and @akrun, it is enough to use str_extract_all.
In this example:
> str_extract_all(lst,"\\d+\\.\\d+\\.\\d+")
[[1]]
[1] "01.12.2015" "21.12.2015"
[[2]]
[1] "22.12.2015" "05.01.2016"
[[3]]
[1] "14.09.2015" "12.10.2015" "29.09.2015" "26.10.2015"
1) Use strsplit, flatten each component using unlist, convert the dates to "Date" class and then use range to get the maximum time span. No packages are used.
> lapply(lst, function(x) range(as.Date(unlist(strsplit(x, " - ")), "%d.%m.%Y")))
$`1`
[1] "2015-12-01" "2015-12-21"
$`2`
[1] "2015-12-22" "2016-01-05"
$`3`
[1] "2015-09-14" "2015-10-26"
2) This variation using a magrittr pipeline also works:
library(magrittr)
lapply(lst, function(x)
  x %>%
    strsplit(" - ") %>%
    unlist %>%
    as.Date("%d.%m.%Y") %>%
    range
)
Note: The input lst in reproducible form is:
lst <- structure(list(`1` = "01.12.2015 - 21.12.2015", `2` = "22.12.2015 - 05.01.2016",
`3` = c("14.09.2015 - 12.10.2015", "29.09.2015 - 26.10.2015"
)), .Names = c("1", "2", "3"))
I have a patent data set, and when I import the IPC-class information to R I get a string containing a variable amount of whitespace and a set of numbers I don't need. The following are the IPC codes corresponding to a patent file:
b <- "F24J 2/05 20060101AFI20150224BHEP F24J 2/46 20060101ALI20150224BHEP "
I would like to remove all whitespaces and that long alphanumeric string and just get the data I am interested in, obtaining a data frame like this, in this case:
m <- data.frame(matrix(c("F24J 2/05", "F24J 2/46"), byrow = TRUE, nrow = 1, ncol = 2))
m
I am trying with gsub, since I know that the long string will always be considerably longer than the data I am interested in:
x <- gsub("\\b[a-zA-Z0-9]{8,}\\b", "", b)
x
But I get stuck when I try to further clean this object in order to get the data frame I want. I am really stuck on this, and I would really appreciate if someone could help me.
Thank you very much in advance.
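One way to finish the gsub approach the asker started is to collapse the leftover whitespace and then glue the remaining tokens back together in pairs. This is just a sketch, and it assumes each IPC code is always a class/subclass token pair:

```r
b <- "F24J 2/05 20060101AFI20150224BHEP F24J 2/46 20060101ALI20150224BHEP "

# drop the long alphanumeric codes, as in the question
x <- gsub("\\b[a-zA-Z0-9]{8,}\\b", "", b)
# collapse runs of whitespace and trim the ends
x <- trimws(gsub("\\s+", " ", x))
# split into tokens and rejoin them pairwise: class + subclass
toks  <- strsplit(x, " ")[[1]]
codes <- paste(toks[c(TRUE, FALSE)], toks[c(FALSE, TRUE)])
m <- data.frame(matrix(codes, nrow = 1))
```

This yields the desired one-row data frame with "F24J 2/05" and "F24J 2/46".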
You can use str_extract_all from stringr package, provided you know the pattern you look for:
library(stringr)
str_extract_all(b, "[A-Z]\\d{2}[A-Z] *\\d/\\d{2}")[[1]]
#[1] "F24J 2/05" "F24J 2/46"
Option 1: select all the noise data and remove it using a substitution:
/\s+|\w{5,}/g
(Spaces and 'long' words)
https://regex101.com/r/lG4dC4/1
Option 2: select all the short words (length at most 4):
/\b\S{4}\b/g
https://regex101.com/r/fZ8mH5/1
or…
library(stringi)
library(readr)
read_fwf(paste0(stri_match_all_regex(b, "[[:alnum:][:punct:][:blank:]]{50}")[[1]][,1], collapse="\n"),
fwf_widths(c(7, 12, 31)))[,1:2]
## X1 X2
## 1 F24J 2/05
## 2 F24J 2/46
(this makes the assumption - from only seeing 2 'records' - that each 'record' is 50 characters long)
Here's an approach to make the matrix using qdapRegex (I maintain this package) + magrittr's pipeline:
library(qdapRegex); library(magrittr)
b %>%
  rm_white_multiple() %>%
  rm_default(pattern = "F[0-9A-Z]+\\s\\d{1,2}/\\d{1,2}", extract = TRUE) %>%
  unlist() %>%
  strsplit("\\s") %>%
  do.call(rbind, .)
## [,1] [,2]
## [1,] "F24J" "2/05"
## [2,] "F24J" "2/46"
I asked a similar question earlier but asked it confusingly. So now I'm trying to do it in a more orderly fashion.
I'm running a loop that imports up to 6 dataframes based on 650 ID variables. I want to append these 6 dataframes for every one of the 650 cases. I import the data like this:
for (i in 1:650) {
  try(part1 <- read.csv(file = paste0("Twitter Scrapes/searchTwitter/09July/", MP.ID[i], ".csv")))
  try(part2 <- read.csv(file = paste0("Twitter Scrapes/userTimeline/08July/", MP.ID[i], ".csv")))
  try(part3 <- read.csv(file = paste0("Twitter Scrapes/userTimeline/16July/", MP.ID[i], ".csv")))
  try(part4 <- read.csv(file = paste0("Twitter Scrapes/searchTwitter/17July/", MP.ID[i], ".csv")))
  try(part5 <- read.csv(file = paste0("Twitter Scrapes/userTimeline/24July/", MP.ID[i], ".csv")))
  try(part6 <- read.csv(file = paste0("Twitter Scrapes/searchTwitter/24July/", MP.ID[i], ".csv")))
This all works fine. If any part doesn't exist, the try() calls make sure that the loop continues to execute.
So, for some cases, not all 6 datasets exist. This means I can't simply have the next line read
combinedData <- rbind(part1, part2, part3, part4, part5, part6)
as one of these objects may not exist, in which case the appended dataset can't be produced. This is why I thought it would be good to have the rbind command run over any data frame whose name matches a regular expression, i.e. partX. That way, even if, say, part5 doesn't exist, it can simply append the other existing data frames and then move on to the next ID in the loop.
However, I have no idea how to do this. It would be amazing if you could help me with this, and I'm really sorry for posting the confusing question earlier.
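The partX idea itself can be sketched in base R with ls()/mget(), which collects whichever part1…part6 data frames actually exist before rbinding (the toy objects below just stand in for the imported data frames):

```r
# pretend the try() calls only managed to import some of the parts
part1 <- data.frame(x = 1)
part3 <- data.frame(x = 2:3)

# collect whatever partX data frames exist in the current environment
existing <- mget(ls(pattern = "^part[1-6]$"))
combinedData <- do.call(rbind, existing)
```

Because ls(pattern = ...) only returns names that exist, a missing part5 is simply skipped rather than causing an error.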
I might use the recursive argument in list.files instead and use lists:
(lf <- list.files('~/desktop/test', recursive = TRUE, full.names = TRUE))
# [1] "/Users/rawr/desktop/test/feb/three.csv"
# [2] "/Users/rawr/desktop/test/jan/one.csv"
# [3] "/Users/rawr/desktop/test/jan/three.csv"
# [4] "/Users/rawr/desktop/test/jan/two.csv"
# [5] "/Users/rawr/desktop/test/jul/one.csv"
# [6] "/Users/rawr/desktop/test/jul/two.csv"
you can group the ids by grep'ing:
id <- c('one','two','three')
for (ii in id) {
  print(lf[grepl(ii, lf)])
  cat('\n')
}
# [1] "/Users/rawr/desktop/test/jan/one.csv" "/Users/rawr/desktop/test/jul/one.csv"
#
# [1] "/Users/rawr/desktop/test/jan/two.csv" "/Users/rawr/desktop/test/jul/two.csv"
#
# [1] "/Users/rawr/desktop/test/feb/three.csv" "/Users/rawr/desktop/test/jan/three.csv"
so using this idea, you can use lapply to read them in, resulting in one object with all the data frames:
ll <- lapply(id, function(ii) {
  files <- lf[grepl(ii, lf)]
  setNames(lapply(files, function(x)
    read.csv(x, header = FALSE)), files)
})
setNames(ll, id)
# $one
# $one$`/Users/rawr/desktop/test/jan/one.csv`
# V1
# 1 one
#
# $one$`/Users/rawr/desktop/test/jul/one.csv`
# V1
# 1 one
# 2 one
# 3 one
#
#
# $two
# $two$`/Users/rawr/desktop/test/jan/two.csv`
# V1
# 1 two
#
# $two$`/Users/rawr/desktop/test/jul/two.csv`
# V1
# 1 two
#
#
# $three
# $three$`/Users/rawr/desktop/test/feb/three.csv`
# V1
# 1 three
#
# $three$`/Users/rawr/desktop/test/jan/three.csv`
# V1
# 1 three
And then rbind them:
lapply(ll, function(x) `rownames<-`(do.call('rbind', x), NULL))
# [[1]]
# V1
# 1 one
# 2 one
# 3 one
# 4 one
#
# [[2]]
# V1
# 1 two
# 2 two
#
# [[3]]
# V1
# 1 three
# 2 three
or you can rbind them in the step before
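That last variant could look like this, folding the rbind into the same lapply (a sketch; the file setup below merely simulates the directory layout above):

```r
# simulate a small directory of csv files
d <- file.path(tempdir(), "rbind_demo")
dir.create(d, showWarnings = FALSE)
writeLines("one",           file.path(d, "jan_one.csv"))
writeLines(c("one", "one"), file.path(d, "jul_one.csv"))

lf <- list.files(d, full.names = TRUE)
id <- "one"

# read and rbind each id's files in one pass
ll2 <- lapply(id, function(ii) {
  files <- lf[grepl(ii, lf)]
  do.call(rbind, lapply(files, read.csv, header = FALSE))
})
```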
I have some difficulties to extract an ID in the form:
27da12ce-85fe-3f28-92f9-e5235a5cf6ac
from a data frame:
a<-c("NAME_27da12ce-85fe-3f28-92f9-e5235a5cf6ac_THOMAS_MYR",
"NAME_94773a8c-b71d-3be6-b57e-db9d8740bb98_THIMO",
"NAME_1ed571b4-1aef-3fe2-8f85-b757da2436ee_ALEX",
"NAME_9fbeda37-0e4f-37aa-86ef-11f907812397_JOHN_TYA",
"NAME_83ef784f-3128-35a1-8ff9-daab1c5f944b_BISHOP",
"NAME_39de28ca-5eca-3e6c-b5ea-5b82784cc6f4_DUE_TO",
"NAME_0a52a024-9305-3bf1-a0a6-84b009cc5af4_WIS_MICHAL",
"NAME_2520ebbb-7900-32c9-9f2d-178cf04f7efc_Sarah_Lu_Van_Gar/Thomas")
Basically it's the part between the first and the second underscore.
Usually I approach that by:
library(tidyr)
df$a<-as.character(df$a)
df<-df[grep("_", df$a), ]
df<- separate(df, a, c("ID","Name") , sep = "_")
df$a<-as.numeric(df$ID)
However this time there a to many underscores...and my approach fails. Is there a way to extract that ID?
I think you should use extract instead of separate. You need to specify the patterns you want to capture. I'm assuming here that the ID always starts with a number, so I'm capturing everything from the first digit up to the next _, and then everything after it.
df <- data.frame(a)
df <- df[grep("_", df$a),, drop = FALSE]
extract(df, a, c("ID", "NAME"), "[A-Za-z].*?(\\d.*?)_(.*)")
# ID NAME
# 1 27da12ce-85fe-3f28-92f9-e5235a5cf6ac THOMAS_MYR
# 2 94773a8c-b71d-3be6-b57e-db9d8740bb98 THIMO
# 3 1ed571b4-1aef-3fe2-8f85-b757da2436ee ALEX
# 4 9fbeda37-0e4f-37aa-86ef-11f907812397 JOHN_TYA
# 5 83ef784f-3128-35a1-8ff9-daab1c5f944b BISHOP
# 6 39de28ca-5eca-3e6c-b5ea-5b82784cc6f4 DUE_TO
# 7 0a52a024-9305-3bf1-a0a6-84b009cc5af4 WIS_MICHAL
# 8 2520ebbb-7900-32c9-9f2d-178cf04f7efc Sarah_Lu_Van_Gar/Thomas
try this (which assumes that the ID is always the part after the first underscore):
sapply(strsplit(a, "_"), function(x) x[[2]])
which gives you "the middle part" which is your ID:
[1] "27da12ce-85fe-3f28-92f9-e5235a5cf6ac" "94773a8c-b71d-3be6-b57e-db9d8740bb98"
[3] "1ed571b4-1aef-3fe2-8f85-b757da2436ee" "9fbeda37-0e4f-37aa-86ef-11f907812397"
[5] "83ef784f-3128-35a1-8ff9-daab1c5f944b" "39de28ca-5eca-3e6c-b5ea-5b82784cc6f4"
[7] "0a52a024-9305-3bf1-a0a6-84b009cc5af4" "2520ebbb-7900-32c9-9f2d-178cf04f7efc"
if you want to get the Name as well, a simple solution would be (assuming that the Name is always everything after the second underscore):
Names <- sapply(strsplit(a, "_"), function(x) Reduce(paste, x[-c(1,2)]))
which gives you this:
[1] "THOMAS MYR" "THIMO" "ALEX" "JOHN TYA"
[5] "BISHOP" "DUE TO" "WIS MICHAL" "Sarah Lu Van Gar/Thomas"
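For completeness, the same extraction can be done with a single base-R sub(), capturing the run of non-underscore characters between the first and second underscore (my own sketch, not from the answers above):

```r
a <- c("NAME_27da12ce-85fe-3f28-92f9-e5235a5cf6ac_THOMAS_MYR",
       "NAME_2520ebbb-7900-32c9-9f2d-178cf04f7efc_Sarah_Lu_Van_Gar/Thomas")

# ^[^_]*_   skip everything up to and including the first underscore
# ([^_]+)   capture the ID (UUIDs contain hyphens but no underscores)
# _.*$      discard the rest
ids <- sub("^[^_]*_([^_]+)_.*$", "\\1", a)
```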
I have a field in a data frame called plugins_Apache_module
it contains strings like:
c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52",
"mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23",
"mod_ssl/2.2.9")
I need a frequency table on the modules, and also their versions.
What is the best way to do this in R? Being rather new to R, I've seen strsplit and gsub, and some chatrooms also suggested I use the qdap package.
Ideally I would want the string transformed into a dataframe with a column for every mod, if the module is there, then the version goes in that particular field. How would I accomplish such a transform?
What dataframe format would be suggested if I want top-level frequencies - say mod_ssl (all versions) as well as relational options (mod_perl is very often used with mod_ssl).
I'm not too sure how to handle such variable length data when pushing into a dataframe for processing. Any advice is welcome.
I consider the right answer to look like:
mod_perl mod_python mod_ssl mod_auth_passthrough mod_bwlimited
1.99_16 3.1.3 2.0.52
2.2.23 2.1 1.4
2.2.9
So basically the first bit becomes a column and the version(s) that follows become a row entry
st <- c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52", "mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23", "mod_ssl/2.2.9")
scan(text=st, what="", sep=",")
Read 7 items
[1] "mod_perl/1.99_16" "mod_python/3.1.3" "mod_ssl/2.0.52"
[4] "mod_auth_passthrough/2.1" "mod_bwlimited/1.4" "mod_ssl/2.2.23"
[7] "mod_ssl/2.2.9"
strsplit( scan(text=st, what="", sep=","), "/")
Read 7 items
[[1]]
[1] "mod_perl" "1.99_16"
[[2]]
[1] "mod_python" "3.1.3"
[[3]]
[1] "mod_ssl" "2.0.52"
[[4]]
[1] "mod_auth_passthrough" "2.1"
[[5]]
[1] "mod_bwlimited" "1.4"
[[6]]
[1] "mod_ssl" "2.2.23"
[[7]]
[1] "mod_ssl" "2.2.9"
table( sapply(strsplit( scan(text=st, what="", sep=","), "/"), "[",1) )
#----------------
Read 7 items
mod_auth_passthrough mod_bwlimited mod_perl mod_python
1 1 1 1
mod_ssl
3
table( scan(text=st, what="", sep=",") )
#-----------
Read 7 items
mod_auth_passthrough/2.1 mod_bwlimited/1.4 mod_perl/1.99_16
1 1 1
mod_python/3.1.3 mod_ssl/2.0.52 mod_ssl/2.2.23
1 1 1
mod_ssl/2.2.9
1
You ask for at minimum two different things. Adding the desired output greatly helped. I'm not sure if what you ask for is what you really want, but you asked, and it seemed like a fun problem. OK, here's how I would approach this using qdap (this requires qdap version 1.1.0 or later):
## load qdap
library(qdap)
## your data
x <- c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52",
"mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23",
"mod_ssl/2.2.9")
## strsplit on commas and slashes
dat <- unlist(lapply(x, strsplit, ",|/"), recursive=FALSE)
## make just a list of mods per row
mods <- lapply(dat, "[", c(TRUE, FALSE))
## make a string of versions
ver <- unlist(lapply(dat, "[", c(FALSE, TRUE)))
## make a lookup key and split it into lists
key <- data.frame(mod = unlist(mods), ver,
                  row = rep(seq_along(mods), sapply(mods, length)))
key2 <- split(key[, 1:2], key$row)
## make it into freq. counts
freqs <- mtabulate(mods)
## copy the freq table to vers (keeping freqs in case you want the counts) and replace 0 with NA
vers <- freqs
vers[vers==0] <- NA
## loop through and fill the ones in each row using an env. lookup (%l%)
for (i in seq_len(nrow(vers))) {
  x <- vers[i, !is.na(vers[i, ]), drop = FALSE]
  vers[i, !is.na(vers[i, ])] <- colnames(x) %l% key2[[i]]
}
## Don't print the NAs
print(vers, na.print = "")
## mod_auth_passthrough mod_bwlimited mod_perl mod_python mod_ssl
## 1 1.99_16 3.1.3 2.0.52
## 2 2.1 1.4 2.2.23
## 3 2.2.9
## the frequency counts per mods
freqs
## mod_auth_passthrough mod_bwlimited mod_perl mod_python mod_ssl
## 1 0 0 1 1 1
## 2 1 1 0 0 1
## 3 0 0 0 0 1
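If you'd rather avoid qdap, the same wide mod-by-version layout (plus the frequency table) can be sketched in base R with strsplit() and reshape(); this is illustrative only, and the wide column names come out prefixed as ver.&lt;mod&gt;:

```r
st <- c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52",
        "mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23",
        "mod_ssl/2.2.9")

# split rows on commas, then each mod/version token on the slash
by_row <- strsplit(st, ",")
pieces <- strsplit(unlist(by_row), "/")
long   <- data.frame(row = rep(seq_along(st), lengths(by_row)),
                     mod = sapply(pieces, `[`, 1),
                     ver = sapply(pieces, `[`, 2),
                     stringsAsFactors = FALSE)

# top-level frequency table across all rows
freq <- table(long$mod)

# pivot to one column per mod, one row per original string
wide <- reshape(long, idvar = "row", timevar = "mod", direction = "wide")
```

The long data frame is also a convenient shape for the relational questions (e.g. which mods co-occur), since it keeps the row index alongside each mod.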