All, I have searched around and can't find the answer on how to do this. I am relatively new to R and have not used regular expressions before, but basically I have some data put into a field like this:
"#Route - 6 #Category - PARKING #Details - Parking issues#Result - MOVED ON #Vehicle Type - Mercedes "
I basically want to be able to split the string up into different elements, so each category after the # has its own column.
I tried using the tidyr package and initially tried:
string %>% separate(Description, into = c("Route", "Details", "Result", "License No",
                                           "Vehicle Description"),
                    sep = "\n#", remove = F, extra = "drop")
But I realised I only wanted the data after the "-". I tried inserting a "-" in the code but it didn't work. Does anyone know how I can split the string, ideally between the "-" and the "#"?
Many thanks
In one line:
> gsub("^\\s+|\\s+$","",gsub(".*?[-]","",unlist(strsplit(str,"#"))))
[1] "" "6" "PARKING" "Parking issues" "MOVED ON" "Mercedes"
Or, in separate steps for better understanding:
Break string by "#":
a = unlist(strsplit(str,"#"))
Remove what is before the "-":
b = gsub(".*?[-]","",a)
Remove leading and trailing spaces:
gsub("^\\s+|\\s+$","",b)
You could do the following:
strsplit(x, ' *#[^-]+- *')[[1]][2:6]
# [1] "6" "PARKING" "Parking issues" "MOVED ON" "Mercedes"
To supply the column names you desire, I suppose you could do something like:
mat <- matrix(strsplit(x, ' *#[^-]+- *')[[1]][2:6], ncol=5, byrow=T)
colnames(mat) <- c('Route', 'CAT', 'Details', 'Result', 'Vehicle Description')
# Route CAT Details Result Vehicle Description
# [1,] "6" "PARKING" "Parking issues" "MOVED ON" "Mercedes"
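If you would rather stay inside the tidyr workflow from the question, here is a minimal sketch (my own variation, not part of the answer above; it assumes string is a one-column data frame holding the text in a Description column, as in the question):

library(tidyr)

# Recreate the input as a one-column data frame (assumption for illustration)
string <- data.frame(
  Description = "#Route - 6 #Category - PARKING #Details - Parking issues#Result - MOVED ON #Vehicle Type - Mercedes ",
  stringsAsFactors = FALSE
)

# Split on the whole "#Label - " prefix so only the values survive; the piece
# before the leading "#" is empty, hence the throwaway "Empty" column.
out <- separate(string, Description,
                into = c("Empty", "Route", "Category", "Details", "Result", "VehicleType"),
                sep = "\\s*#[^-]+-\\s*", remove = FALSE, extra = "drop")
out$Empty <- NULL
out$VehicleType <- trimws(out$VehicleType)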
Using str_extract_all from stringr
library(stringr)
str_extract_all(str1, '(?<=-\\s)\\w+(?:\\s*\\w+){0,}')[[1]]
#[1] "6" "PARKING" "Parking issues" "MOVED ON"
#[5] "Mercedes"
str_extract_all(str2, '(?<=-\\s)\\w+(?:\\s*\\w+){0,}')[[1]]
#[1] "6" "PARKING"
#[3] "Parking issues" "MOVED ON"
#[5] "Mercedes" "Parking issues are present"
#[7] "MOVED ON" "Mercedes"
data
str1 <- "#Route - 6 #Category - PARKING #Details - Parking issues#Result - MOVED ON #Vehicle Type - Mercedes "
str2 <- "#Route - 6 #Category - PARKING #Details - Parking issues#Result - MOVED ON #Vehicle Type - Mercedes #Details - Parking issues are present#Result - MOVED ON #Vehicle Type - Mercedes "
I have a file of log entries that I want to parse through. All the lines look like this:
F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz" "" 3322771022 (0,0) "1499.61 seconds (17.7 megabits/sec)
Each part has a specific designation which I'll put below.
F -- identifier of the line
20160525 -- date (yyyymmdd)
17:52:38.791 -- timestamp (HH:MM:SS.sss)
F798259D -- transfer identifier
156.145.15.85:46634 -- IP address and related port
xqixh8sl -- username
AES -- encryption level (could be - (dash))
"/pcgc...fastq.gz" -- transferred file (in ")
"" -- additional string (should be empty "")
2951144113 -- transferred bytes
(0,0) -- error
"2289.47 seconds (10.3 megabits/sec)" -- data about the transfer
I have imported the data file and am using the read.pattern() function to parse it and separate it into its fields. I only want the pieces of information that correspond to fields 2, 3, 4, 5, 6, 7, 8, 10, and 12 above. However, I cannot get the pattern right. Earlier, I managed to get two of the fields that I needed by using this pattern:
pattern <- "^F ([0-9]+) [^ ]* .* \\(0,0\\) (.*)$"
This gave me a data frame that looked like this:
date speed of data transfer
1 20160525 "1.62 seconds (1.30 kilobits/sec)"
2 20160525 "0.29 seconds (1.93 kilobits/sec)"
3 20160525 "0.01 seconds (34.0 kilobits/sec)"
4 20160525 "0.01 seconds (102 kilobits/sec)"
5 20160525 "38.05 seconds (214 megabits/sec)"
These are only two of the fields I need, but whenever I try to add more that's where I mess up the syntax. For example:
pattern <- "^F\\s([0-9]+)\\s[0-9:.]+\\s([:alnum:])\\s[A-Z]\\s([0-9.:]+)\\s([:alnum:])\\s([•])\\s[:punct:][A-z][:punct:]\\s[:punct:]\\s.* \\(0,0\\) (.*)$"
This did not work. Could someone please help with writing this pattern? It's been driving me crazy. Thanks!
Here's my solution:
library(stringr)
con <- readLines("dataSet.txt")
pattern <- "^F (\\d+) ([:graph:]+) ([:graph:]+) [A-Z]+ ([:graph:]+) ([:graph:]+) ([:graph:]+) ([:graph:]+) [:graph:]+ (\\d+) [:graph:]+ (.+)$"
matches <- str_match(con,pattern)
df <- data.frame(na.omit(matches[,-1]))
colnames(df) <- c("date", "timestamp", "transfer ID", "IP address", "username", "encryption level", "transferred file", "transferred bytes", "speed of data transfer")
This was the result:
1 20160525 08:22:06.838 F798256B 10.199.194.38:57708 wei2dt - "" 264 "1.62 seconds (1.30 kilobits/sec)"
2 20160525 08:28:26.920 F798256C 10.19.105.15:57708 wei2dt - "isi_audit_log.dmp-sv.tmp" 69 "0.29 seconds (1.93 kilobits/sec)"
If all of your lines follow a similar structure, you may be able to get away with simply splitting each row on the spaces.
x <- "F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES \"/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz\" \"\" 3322771022 (0,0) \"1499.61 seconds (17.7 megabits/sec)"
library(dplyr)
library(magrittr)
strsplit(x, " ") %>%
  unlist() %>%
  t() %>%
  as.data.frame(stringsAsFactors = FALSE) %>%
  setNames(c("id", "date", "timestamp", "transfer_id",
             "curl_method", "ip_address", "username", "encryption",
             "transferred_file", "additional_string",
             "transferred_bytes", "error",
             "rate1", "rate2", "rate3", "rate4")) %>%
  mutate(rate = paste(rate1, rate2, rate3, rate4)) %>%
  select(-rate1:-rate4)
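If the whole log file needs the same treatment, here is a sketch of one way to do it (assuming the file is dataSet.txt, as in the earlier answer, and that every well-formed line splits into exactly 16 space-separated pieces like the example above):

lines <- readLines("dataSet.txt")
pieces <- strsplit(lines, " ")
keep <- lengths(pieces) == 16   # drop any malformed lines
logdf <- as.data.frame(do.call(rbind, pieces[keep]), stringsAsFactors = FALSE)
names(logdf) <- c("id", "date", "timestamp", "transfer_id",
                  "curl_method", "ip_address", "username", "encryption",
                  "transferred_file", "additional_string",
                  "transferred_bytes", "error",
                  "rate1", "rate2", "rate3", "rate4")
# Re-join the quoted transfer-rate text that the space split broke apart
logdf$rate <- paste(logdf$rate1, logdf$rate2, logdf$rate3, logdf$rate4)
logdf <- logdf[, setdiff(names(logdf), c("rate1", "rate2", "rate3", "rate4"))]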
I have the following sorted list (lst) of time periods, and I want to split the periods into specific dates and then extract the maximum time period without altering the order of the list.
$`1`
[1] "01.12.2015 - 21.12.2015"
$`2`
[1] "22.12.2015 - 05.01.2016"
$`3`
[1] "14.09.2015 - 12.10.2015" "29.09.2015 - 26.10.2015"
Therefore, after adjustment list should look like this:
$`1`
[1] "01.12.2015" "21.12.2015"
$`2`
[1] "22.12.2015" "05.01.2016"
$`3`
[1] "14.09.2015" "12.10.2015" "29.09.2015" "26.10.2015"
In order to do so, I began by splitting the list:
lst_split <- str_split(lst, pattern = " - ")
which leads to the following:
[[1]]
[1] "01.12.2015" "21.12.2015"
[[2]]
[1] "22.12.2015" "05.01.2016"
[[3]]
[1] "c(\"14.09.2015" "12.10.2015\", \"29.09.2015" "26.10.2015\")"
Then, I tried to extract the pattern:
lapply(lst_split, function(x) str_extract(pattern = c("\\d+\\.\\d+\\.\\d+"),x))
but my output is missing one date (29.09.2015)
[[1]]
[1] "01.12.2015" "21.12.2015"
[[2]]
[1] "22.12.2015" "05.01.2016"
[[3]]
[1] "14.09.2015" "12.10.2015" "26.10.2015"
Does anyone have an idea how I could make it work, or maybe propose a more efficient solution? Thank you in advance.
Thanks to the comments of @WiktorStribiżew and @akrun, it is enough to use str_extract_all.
In this example:
> str_extract_all(lst,"\\d+\\.\\d+\\.\\d+")
[[1]]
[1] "01.12.2015" "21.12.2015"
[[2]]
[1] "22.12.2015" "05.01.2016"
[[3]]
[1] "14.09.2015" "12.10.2015" "29.09.2015" "26.10.2015"
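To also cover the second part of the question, the maximum time period per element, the extracted strings can be converted to Date and passed to range; a small sketch building on the call above (the next answer reaches the same result starting from strsplit):

library(stringr)
# Convert each element's extracted dates to Date and take the min/max span
lapply(str_extract_all(lst, "\\d+\\.\\d+\\.\\d+"),
       function(x) range(as.Date(x, "%d.%m.%Y")))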
1) Use strsplit, flatten each component using unlist, convert the dates to "Date" class and then use range to get the maximum time span. No packages are used.
> lapply(lst, function(x) range(as.Date(unlist(strsplit(x, " - ")), "%d.%m.%Y")))
$`1`
[1] "2015-12-01" "2015-12-21"
$`2`
[1] "2015-12-22" "2016-01-05"
$`3`
[1] "2015-09-14" "2015-10-26"
2) This variation using a magrittr pipeline also works:
library(magrittr)
lapply(lst, function(x)
  x %>%
    strsplit(" - ") %>%
    unlist %>%
    as.Date("%d.%m.%Y") %>%
    range
)
Note: The input lst in reproducible form is:
lst <- structure(list(`1` = "01.12.2015 - 21.12.2015", `2` = "22.12.2015 - 05.01.2016",
`3` = c("14.09.2015 - 12.10.2015", "29.09.2015 - 26.10.2015"
)), .Names = c("1", "2", "3"))
I have some text with defined labels and need to split the text according to the labels.
For example, given the text with label set {A, B, C, ...}:
text <- c("A: how are you B: hello sir C: bye bye")
text2 <- c("USER COMMENTS: TEST PROC: Refer manual. SOLUTION: fix BIAS32 user:param", "TEST PROC: install spare unit. USER COMMENTS: hello sir SOLUTION: tighten bolt 12","TEST PROC: bye bye.")
I need to extract the text "how are you", "hello sir", etc., corresponding to labels A, B, etc.
There is no specific order to the labels, certain labels could be missing, and labels can be phrases (not just single characters).
This is what I have so far to extract text corresponding to label A:
gsub("(.*A.*:)(.*)(B.*|C.*)","\\2",text,perl=TRUE)
But this does not work in so many cases!
I am looking for a solution where I can define a vector of labels such as
labels <- c("USER COMMENTS", "TEST PROC", "SOLUTION") # this is a big list!
and extract the text corresponding to these labels as below
USER COMMENTS are "", "hello sir"
TEST PROC are "Refer manual.", "install spare unit.","bye bye."
SOLUTION are "fix BIAS32 user:param", "tighten bolt 12"
etc..
I think I might have a solution based on Sharath's comment.
First, there's strsplit(), which can split a vector based on regex. In your case you could use:
labels2<-paste(labels,collapse="|")
[1] "USER COMMENTS|TEST PROC|SOLUTION"
If you apply strsplit on that:
splittedtext<-strsplit(text2,labels2)
[[1]]
[1] "" ": "
[3] ": Refer manual. " ": fix BIAS32 user:param"
[[2]]
[1] "" ": install spare unit. " ": hello sir "
[4] ": tighten bolt 12"
[[3]]
[1] "" ": bye bye."
Pretty much what you want, right? You could do some refining, for instance by appending ": " to each label in the split pattern, and the first element of every split is an empty string. Taking care of the latter:
splittedtext<-lapply(splittedtext,"[",-1)
That leaves the problem of figuring out which label each piece of text belongs to. For that you could use the regexpr() function in R.
pos=sapply(labels,regexpr,text2)
     USER COMMENTS TEST PROC SOLUTION
[1,]             1        16       41
[2,]            32         1       57
[3,]            -1         1       -1
Each cell gives the position at which the label [column] appears in the string [row]; -1 denotes that it does not appear in that string.
Now switch -1 for NA and rank the remaining numbers. That tells you which piece of the split string belongs to which label.
pos=ifelse(pos==-1,NA,pos) #switch -1 for NA
pos=t(apply(pos,1,rank,na.last="keep"))
     USER COMMENTS TEST PROC SOLUTION
[1,]             1         2        3
[2,]             2         1        3
[3,]            NA         1       NA
Now it's just matching.
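Here is a sketch of that final matching step (my own completion of the answer, assuming the labels vector from the question and the splittedtext and pos objects built above):

# Strip the leading ": " and surrounding spaces from every piece
cleaned <- lapply(splittedtext, function(x) trimws(sub("^:\\s*", "", x)))

# For each label, pick the piece whose rank matches that label's position
result <- lapply(labels, function(lab) {
  idx <- pos[, lab]   # rank of this label within each string (NA if absent)
  mapply(function(snips, i) if (is.na(i)) NA_character_ else snips[i],
         cleaned, idx)
})
names(result) <- labels
result
# $`USER COMMENTS` gives "", "hello sir", NA; $`TEST PROC` gives
# "Refer manual.", "install spare unit.", "bye bye."; and so on.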
I am using R to extract sentences containing specific person names from texts and here is a sample paragraph:
Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin. Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine. He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments. Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium.
In this short paragraph, there are several person names such as:
Johann Reuchlin, Melanchthon, Johann Eck. With the help of the openNLP package, three person names (Martin Luther, Paul and Melanchthon) can be correctly extracted and recognized. I then have two questions:
How could I extract sentences containing these names?
As the output of the named entity recognizer is not so promising, if I add "[[ ]]" around each name, such as [[Johann Reuchlin]], [[Melanchthon]], how could I extract sentences containing these name expressions [[A]], [[B]], ...?
Using `strsplit` and `grep`: first I made an object `para` containing your paragraph.
toMatch <- c("Martin Luther", "Paul", "Melanchthon")
unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[4] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
Or a little cleaner:
sentences<-unlist(strsplit(para,split="\\."))
sentences[grep(paste(toMatch, collapse="|"),sentences)]
If you are looking for the sentences that each person appears in, as separate returns, then:
toMatch <- c("Martin Luther", "Paul", "Melanchthon")
sentences<-unlist(strsplit(para,split="\\."))
foo<-function(Match){sentences[grep(Match,sentences)]}
lapply(toMatch,foo)
[[1]]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[[2]]
[1] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[[3]]
[1] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
Edit 3: To add each person's name, do something simple such as:
foo<-function(Match){c(Match,sentences[grep(Match,sentences)])}
EDIT 4:
And if you wanted to find sentences that contain multiple people/places/things (words), then just add a combined lookahead pattern for those terms, such as:
toMatch <- c("Martin Luther", "Paul", "Melanchthon","(?=.*Melanchthon)(?=.*Scripture)")
and change perl to TRUE:
foo<-function(Match){c(Match,sentences[grep(Match,sentences,perl = T)])}
> lapply(toMatch,foo)
[[1]]
[1] "Martin Luther"
[2] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[[2]]
[1] "Paul"
[2] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[[3]]
[1] "Melanchthon"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[3] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
[[4]]
[1] "(?=.*Melanchthon)(?=.*Scripture)"
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
EDIT 5: Answering your other question:
Given:
sentenceR<-"Opposed as a reformer at [[Tübingen]], he accepted a call to the University of [[Wittenberg]] by [[Martin Luther]], recommended by his great-uncle [[Johann Reuchlin]]"
gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
Will give you the words inside the double brackets.
> gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
[1] "Tübingen" "Wittenberg" "Martin Luther" "Johann Reuchlin"
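And for the reverse direction, pulling out the sentences that contain a given bracketed name, literal matching with fixed = TRUE avoids having to escape the brackets. A small sketch, where paraR stands for a hypothetical paragraph already annotated with [[ ]] around the names:

bracketed <- c("[[Johann Reuchlin]]", "[[Melanchthon]]")
sentsR <- unlist(strsplit(paraR, split = "\\."))   # split the annotated paragraph into sentences
lapply(bracketed, function(m) grep(m, sentsR, fixed = TRUE, value = TRUE))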
Here's a considerably simpler method using two packages, quanteda and stringi:
sents <- unlist(quanteda::tokenize(txt, what = "sentence"))
namesToExtract <- c("Martin Luther", "Paul", "Melanchthon")
namesFound <- unlist(stringi::stri_extract_all_regex(sents, paste(namesToExtract, collapse = "|")))
sentList <- split(sents, list(namesFound))
sentList[["Melanchthon"]]
## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."
## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
sentList
## $`Martin Luther`
## [1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin."
##
## $Melanchthon
## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."
## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
##
## $Paul
## [1] "He studied the Scripture, especially of Paul, and Evangelical doctrine."
I have some difficulties extracting an ID of the form:
27da12ce-85fe-3f28-92f9-e5235a5cf6ac
from a data frame:
a<-c("NAME_27da12ce-85fe-3f28-92f9-e5235a5cf6ac_THOMAS_MYR",
"NAME_94773a8c-b71d-3be6-b57e-db9d8740bb98_THIMO",
"NAME_1ed571b4-1aef-3fe2-8f85-b757da2436ee_ALEX",
"NAME_9fbeda37-0e4f-37aa-86ef-11f907812397_JOHN_TYA",
"NAME_83ef784f-3128-35a1-8ff9-daab1c5f944b_BISHOP",
"NAME_39de28ca-5eca-3e6c-b5ea-5b82784cc6f4_DUE_TO",
"NAME_0a52a024-9305-3bf1-a0a6-84b009cc5af4_WIS_MICHAL",
"NAME_2520ebbb-7900-32c9-9f2d-178cf04f7efc_Sarah_Lu_Van_Gar/Thomas")
Basically it's the part between the first and the second underscore.
Usually I approach that by:
library(tidyr)
df$a<-as.character(df$a)
df<-df[grep("_", df$a), ]
df<- separate(df, a, c("ID","Name") , sep = "_")
df$a<-as.numeric(df$ID)
However, this time there are too many underscores... and my approach fails. Is there a way to extract that ID?
I think you should use extract instead of separate. You need to specify the patterns you want to capture. I'm assuming here that the ID always starts with a number, so I'm capturing everything from the first number up to the next _, and then everything after that.
df <- data.frame(a)
df <- df[grep("_", df$a),, drop = FALSE]
extract(df, a, c("ID", "NAME"), "[A-Za-z].*?(\\d.*?)_(.*)")
# ID NAME
# 1 27da12ce-85fe-3f28-92f9-e5235a5cf6ac THOMAS_MYR
# 2 94773a8c-b71d-3be6-b57e-db9d8740bb98 THIMO
# 3 1ed571b4-1aef-3fe2-8f85-b757da2436ee ALEX
# 4 9fbeda37-0e4f-37aa-86ef-11f907812397 JOHN_TYA
# 5 83ef784f-3128-35a1-8ff9-daab1c5f944b BISHOP
# 6 39de28ca-5eca-3e6c-b5ea-5b82784cc6f4 DUE_TO
# 7 0a52a024-9305-3bf1-a0a6-84b009cc5af4 WIS_MICHAL
# 8 2520ebbb-7900-32c9-9f2d-178cf04f7efc Sarah_Lu_Van_Gar/Thomas
try this (which assumes that the ID is always the part after the first underscore):
sapply(strsplit(a, "_"), function(x) x[[2]])
which gives you "the middle part" which is your ID:
[1] "27da12ce-85fe-3f28-92f9-e5235a5cf6ac" "94773a8c-b71d-3be6-b57e-db9d8740bb98"
[3] "1ed571b4-1aef-3fe2-8f85-b757da2436ee" "9fbeda37-0e4f-37aa-86ef-11f907812397"
[5] "83ef784f-3128-35a1-8ff9-daab1c5f944b" "39de28ca-5eca-3e6c-b5ea-5b82784cc6f4"
[7] "0a52a024-9305-3bf1-a0a6-84b009cc5af4" "2520ebbb-7900-32c9-9f2d-178cf04f7efc"
If you want to get the Name as well, a simple solution would be (which assumes that the Name is always everything after the second underscore):
Names <- sapply(strsplit(a, "_"), function(x) Reduce(paste, x[-c(1,2)]))
which gives you this:
[1] "THOMAS MYR" "THIMO" "ALEX" "JOHN TYA"
[5] "BISHOP" "DUE TO" "WIS MICHAL" "Sarah Lu Van Gar/Thomas"
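An alternative sketch (not from the answers above): two sub() calls that capture the ID and the name directly, keeping the underscores inside the name intact:

ids <- sub("^[^_]*_([^_]+)_.*$", "\\1", a)   # everything between the 1st and 2nd "_"
nms <- sub("^[^_]*_[^_]+_(.*)$", "\\1", a)   # everything after the 2nd "_"

Unlike the Reduce(paste, ...) step above, this leaves names such as "THOMAS_MYR" unchanged instead of turning the underscores into spaces.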