How to Write Regular Expression Syntax in R

I have a file of log entries that I want to parse through. All the lines look like this:
F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz" "" 3322771022 (0,0) "1499.61 seconds (17.7 megabits/sec)
Each part has a specific designation which I'll put below.
F -- identifier of the line
20160525 -- date (yyyymmdd)
17:52:38.791 -- timestamp (HH:MM:SS.sss)
F798259D -- transfer identifier
156.145.15.85:46634 -- IP address and related port
xqixh8sl -- username
AES -- encryption level (could be - (dash))
"/pcgc...fastq.gz" -- transferred file (in ")
"" -- additional string (should be empty "")
2951144113 -- transferred bytes
(0,0) -- error
"2289.47 seconds (10.3 megabits/sec)" -- data about the transfer
I have imported the data file and am using the read.pattern() function (from the gsubfn package) to parse it and separate it into its fields. I only want the pieces of information that correspond to fields 2, 3, 4, 5, 6, 7, 8, 10, and 12. However, I cannot get the pattern right. Earlier, I managed to extract two of the fields I needed with this pattern:
pattern <- "^F ([0-9]+) [^ ]* .* \\(0,0\\) (.*)$"
This gave me a data frame that looked like this:
date speed of data transfer
1 20160525 "1.62 seconds (1.30 kilobits/sec)"
2 20160525 "0.29 seconds (1.93 kilobits/sec)"
3 20160525 "0.01 seconds (34.0 kilobits/sec)"
4 20160525 "0.01 seconds (102 kilobits/sec)"
5 20160525 "38.05 seconds (214 megabits/sec)"
These are only two of the fields I need; whenever I try to add more, I mess up the syntax. For example:
pattern <- "^F\\s([0-9]+)\\s[0-9:.]+\\s([:alnum:])\\s[A-Z]\\s([0-9.:]+)\\s([:alnum:])\\s([•])\\s[:punct:][A-z][:punct:]\\s[:punct:]\\s.* \\(0,0\\) (.*)$"
This did not work. Could someone please help me write this? It's been driving me crazy. Thanks!

Here's my solution:
library(stringr)
con <- readLines("dataSet.txt")
pattern <- "^F (\\d+) ([:graph:]+) ([:graph:]+) [A-Z]+ ([:graph:]+) ([:graph:]+) ([:graph:]+) ([:graph:]+) [:graph:]+ (\\d+) [:graph:]+ (.+)$"
matches <- str_match(con, pattern)        # one row per log line; capture groups in columns 2+
df <- data.frame(na.omit(matches[, -1]))  # drop the full-match column and non-matching lines
colnames(df) <- c("date", "timestamp", "transfer ID", "IP address", "username", "encryption level", "transferred file", "transferred bytes", "speed of data transfer")
This was the result:
1 20160525 08:22:06.838 F798256B 10.199.194.38:57708 wei2dt - "" 264 "1.62 seconds (1.30 kilobits/sec)"
2 20160525 08:28:26.920 F798256C 10.19.105.15:57708 wei2dt - "isi_audit_log.dmp-sv.tmp" 69 "0.29 seconds (1.93 kilobits/sec)"
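A small follow-up worth noting: every column captured this way comes back as character (in R >= 4.0, where data.frame() no longer creates factors by default). A hedged one-liner to let R guess which columns are numeric:
df[] <- lapply(df, type.convert, as.is = TRUE)  # coerce columns to numeric where possible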

If all of your lines follow a similar structure, you may be able to get away with simply splitting each row on the spaces.
x <- "F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES \"/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz\" \"\" 3322771022 (0,0) \"1499.61 seconds (17.7 megabits/sec)"
library(dplyr)
library(magrittr)
strsplit(x, " ") %>%
  unlist() %>%
  t() %>%
  as.data.frame(stringsAsFactors = FALSE) %>%
  setNames(c("id", "date", "timestamp", "transfer_id",
             "curl_method", "ip_address", "username", "encryption",
             "transferred_file", "additional_string",
             "transferred_bytes", "error",
             "rate1", "rate2", "rate3", "rate4")) %>%
  mutate(rate = paste(rate1, rate2, rate3, rate4)) %>%
  select(-rate1:-rate4)
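If you'd rather keep a single regex but avoid extra packages, base R's strcapture() (R >= 3.4) applies a pattern with capture groups and returns a typed data frame in one step. This is only a sketch against the sample line x above; it assumes space-separated fields and that the quoted filename itself contains no spaces:
# capture the 9 wanted fields; the unparenthesized \\S+ skip the method,
# the additional string, and the error field
pat <- paste0('^F (\\d+) ([0-9:.]+) (\\S+) \\S+ (\\S+) (\\S+) (\\S+) ',
              '("[^"]*") "[^"]*" (\\d+) \\S+ (.+)$')
proto <- data.frame(date = character(), timestamp = character(),
                    transfer_id = character(), ip_address = character(),
                    username = character(), encryption = character(),
                    transferred_file = character(),
                    transferred_bytes = numeric(), rate = character())
strcapture(pat, x, proto)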

Related

Subtracting every two columns

Imagine I have a data frame like this (the column names could just as well be the names of all the months):
set.seed(1)
mydata <- data.frame()
mydata <- rbind(mydata,c(1,round(runif(20),3)))
mydata <- rbind(mydata,c(2,round(runif(20),3)))
mydata <- rbind(mydata,c(3,round(runif(20),3)))
colnames(mydata) <- c("id", paste0(rep(c('Mary', 'Bob', 'Dylan', 'Tom', 'Jane', 'Sam', 'Tony', 'Luke', 'John', "Pam"), each=2), 1:2))
id Mary1 Mary2 Bob1 Bob2 Dylan1 Dylan2 Tom1 Tom2 Jane1 Jane2 Sam1 Sam2 Tony1 Tony2 Luke1 Luke2 John1 John2 Pam1 Pam2
1 0.266 0.372 0.573 0.908 0.202 0.898 0.945 0.661 0.629 0.062 0.206 0.177 0.687 0.384 0.770 0.498 0.718 0.992 0.380 0.777
2 0.935 0.212 0.652 0.126 0.267 0.386 0.013 0.382 0.870 0.340 0.482 0.600 0.494 0.186 0.827 0.668 0.794 0.108 0.724 0.411
3 0.821 0.647 0.783 0.553 0.530 0.789 0.023 0.477 0.732 0.693 0.478 0.861 0.438 0.245 0.071 0.099 0.316 0.519 0.662 0.407
Usually with many more columns.
And I want to add columns (it's up to you whether to add them on the right or to create a new data frame with these new columns), subtracting every two (*):
id, Mary1-Mary2, Bob1-Bob2, Dylan1-Dylan2, Tom1-Tom2, Jane1-Jane2,...
This operation is quite common.
I'd like to do it by name, not by position, to prevent problems if they are not consecutive.
It could even happen that some columns don't have their "twin" column; just leave those as-is, or ignore this complication for now.
(*) The names of the columns have a prefix and a number.
Instead of just subtracting two columns, I could have groups of 5, and I may want to do something such as adding all the numbers. A generic solution would be great.
I first tried converting it to long format, operating on it with aggregate, and converting it back to wide format, but maybe it's much easier to do this directly in wide format. I know the problem is mainly about using regular expressions efficiently.
A base R, data.table, or dplyr solution (going to long format and splitting the column names) would all be fine.
I don't mind about speed; I just want the simplest solution. Any package is welcome.
PS: All of your solutions fail if I add a lone column:
set.seed(1)
mydata <- data.frame()
mydata <- rbind(mydata,c(1,round(runif(21),3)))
mydata <- rbind(mydata,c(2,round(runif(21),3)))
mydata <- rbind(mydata,c(3,round(runif(21),3)))
colnames(mydata) <- c(c("id", paste0(rep(c('Mary', 'Bob', 'Dylan', 'Tom', 'Jane', 'Sam', 'Tony', 'Luke', 'John', "Pam"), each=2), 1:2)),"Lola" )
I know I could filter it out manually, but it would be better if the result were the difference (*) of every pair, leaving the lone column alone (in the case of differences over groups of size two).
The best option would be not to remove the lone column manually but to split the columns into singles and groups automatically.
How about using base R:
# strip the digits from the column names; drop "id" to get the unique prefixes
cn <- unique(gsub("\\d", "", colnames(mydata)))[-1]
# for each prefix, subtract column <prefix>2 from column <prefix>1
sapply(cn, function(x) mydata[[paste0(x, 1)]] - mydata[[paste0(x, 2)]])
You can use this approach for an arbitrary number of groups. For example, this returns the row sums across the names with suffix 1 or 2:
sapply(cn, function(x) rowSums(mydata[, paste0(x, 1:2)]))
This paste approach could be replaced by regular expressions for more general applications.
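As a sketch of that regex variant (reusing cn from above), grep() can select each group's columns by anchored prefix, so any number of "twin" columns per prefix is handled:
sapply(cn, function(x) {
  cols <- grep(paste0("^", x, "[0-9]+$"), colnames(mydata))  # all columns of this group
  Reduce(`-`, mydata[cols])                                  # left-to-right difference
})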
You can do something like,
sapply(unique(sub('\\d', '', names(mydata[,-1]))),
       function(i) Reduce('-', mydata[,-1][, grepl(i, sub('\\d', '', names(mydata[,-1])))]))
# Mary Bob Dylan Tom Jane Sam Tony Luke John Pam
#[1,] -0.106 -0.335 -0.696 0.284 0.567 0.029 0.303 0.272 -0.274 -0.397
#[2,] 0.723 0.526 -0.119 -0.369 0.530 -0.118 0.308 0.159 0.686 0.313
#[3,] 0.174 0.230 -0.259 -0.454 0.039 -0.383 0.193 -0.028 -0.203 0.255
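One caveat with grepl(i, ...) above: a prefix that is a substring of another (say "Tom" and "Tommy") would match both groups. Anchoring the pattern avoids this; a hedged variant:
sapply(unique(sub('\\d', '', names(mydata[,-1]))),
       function(i) Reduce('-', mydata[,-1][, grepl(paste0('^', i, '$'),
                                                   sub('\\d', '', names(mydata[,-1])))]))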
As per your comment, we can easily sort the columns and then apply the formula above,
sorted.names <- names(mydata)[order(nchar(names(mydata)), names(mydata))]
mydata <- mydata[,sorted.names]
This solution handles an arbitrary number of twins.
## return the columns belonging to one prefix as a data frame
twin.vars <- function(prefix, df) {
  df[grep(paste0(prefix, '[0-9]+$'), names(df))]
}
pfx <- unique(sub('[0-9]*$', '', names(mydata[-1])))
tmp <- lapply(pfx, function(x) Reduce(`-`, twin.vars(x, mydata)))
cbind(id = mydata$id, as.data.frame(setNames(tmp, pfx)))
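To address the PS about the lone column, a hedged tweak of the above: keep only the prefixes that match more than one column, so "Lola" is simply skipped:
paired <- pfx[sapply(pfx, function(x) ncol(twin.vars(x, mydata)) > 1)]
tmp <- lapply(paired, function(x) Reduce(`-`, twin.vars(x, mydata)))
cbind(id = mydata$id, as.data.frame(setNames(tmp, paired)))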
OK, I've accepted NBATrends' solution because it almost always works well and he was first.
Anyway, I add my little contribution, just in case anybody is interested:
# run-length encode the sorted, digit-stripped names; runs longer than 1 are the pairs
runs <- rle(sort(sub('\\d$', '', names(mydata))))
sapply(runs[[2]][runs[[1]] > 1], function(x) mydata[[paste0(x, 1)]] - mydata[[paste0(x, 2)]])
The only "problem" is that it changes the final column order, but you don't need to manually remove isolated columns, and it works for disordered columns too.
I'm perplexed because nobody posted a solution with dplyr or data.table :)
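For completeness, then, a minimal dplyr/tidyr sketch of the long-format route (assuming tidyr >= 1.0 for pivot_longer(), dplyr >= 1.0 for .groups, and that every non-id column belongs to a group):
library(dplyr)
library(tidyr)
mydata %>%
  pivot_longer(-id, names_to = c("prefix", "twin"),
               names_pattern = "^([A-Za-z]+)([0-9]+)$") %>%  # split each name into prefix + number
  group_by(id, prefix) %>%
  summarise(diff = Reduce(`-`, value), .groups = "drop") %>% # left-to-right difference per group
  pivot_wider(names_from = prefix, values_from = diff)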

Split one column into two columns while retaining the separator

I have a very large data frame:
'data.frame': 40525992 obs. of 14 variables:
$ INSTNM : Factor w/ 7050 levels "A W Healthcare Educators"
$ Total : Factor w/ 3212 levels "1","10","100",
$ Crime_Type : Factor w/ 72 levels "MURD11","NEG_M11",
$ Count : num 0 0 0 0 0 0 0 0 0 0 ...
The Crime_Type column contains the type of crime and the year, so "MURD11" is murder in 2011. These are college campus crime statistics my kid is analyzing for her school project; I help when she is stuck. I am currently stuck at creating a clean data file she can analyze.
Once I converted the wide file (all crime types in columns) to a long file using gather, the file size went from 300 MB to 8 GB, so the file I am working on is 8 GB. Do you think that is the problem? How do I convert it to a data.table for faster processing?
What I want to do is split this 'Crime_Type' column into two columns, 'Crime_Type' and 'Year'. The data contains letters and numbers, and some entries also contain special characters, like NEG_M, which is 'Negligent Manslaughter'.
We will replace the full names later, but can someone suggest how I can separate
MURD11 --> MURD and 11 (in two columns)
NEG_M10 --> NEG_M and 10 (in two columns)
etc...
I have tried using,
df <- separate(totallong, Crime_Type, into = c("Crime", "Year"), sep = "[:digit:]", extra = "merge")
df <- separate(totallong, Crime_Type, into = c("Year", "Temp"), sep = "[:alpha:]", extra = "merge")
The first one separates the Crime as it looks for numbers. The second one does not work at all.
I also tried
df$Crime_Type<- apply (strsplit(as.character(df$Crime_Type), split="[:digit:]"))
That does not work at all. I have gone through many posts on Stack Overflow, and that's where I got these commands, but I am now truly stuck and would appreciate your help.
Since you're using tidyr already (as evidenced by separate), try the extract function, which, given a regex, puts each captured group into a new column. The 'Crime_Type' is all the non-numeric stuff, and the 'Year' is the numeric stuff. Adjust the regex accordingly.
library(tidyr)
extract(df, 'Crime_Type', into=c('Crime', 'Year'), regex='^([^0-9]+)([0-9]+)$')
In base R, one option would be to create a delimiter between the non-numeric and numeric parts. We capture the non-numeric ([^0-9]+) and numeric ([0-9]+) characters as groups by wrapping each in parentheses ((..)); in the replacement, \\1 refers to the first capture group, followed by a comma, and then the second group (\\2). The result can be passed to read.table with sep=',' to read it as two columns.
df1 <- read.table(text = gsub('([^0-9]+)([0-9]+)', '\\1,\\2', totallong$Crime_Type),
                  sep = ",", col.names = c('Crime', 'Year'))
df1
# Crime Year
#1 MURD 11
#2 NEG_M 11
If we need, we can cbind with the original dataset
cbind(totallong, df1)
Or, in base R, we can use strsplit with split specifying the boundary between a non-digit ((?<=[^0-9])) and a digit ((?=[0-9])); lookarounds match the boundary without consuming characters. The output is a list, so we rbind the list elements with do.call(rbind, ...) and convert the result to a data.frame.
as.data.frame(do.call(rbind, strsplit(as.character(totallong$Crime_Type),
split="(?<=[^0-9])(?=[0-9])", perl=TRUE)))
# V1 V2
#1 MURD 11
#2 NEG_M 11
Or another option is tstrsplit from the devel version of data.table, i.e. v1.9.5. Here, too, we use the same regex; in addition, there is an option to convert the output columns to different classes.
library(data.table)#v1.9.5+
setDT(totallong)[, c('Crime', 'Year') := tstrsplit(Crime_Type,
"(?<=[^0-9])(?=[0-9])", perl=TRUE, type.convert=TRUE)]
# Crime_Type Crime Year
#1: MURD11 MURD 11
#2: NEG_M11 NEG_M 11
If we don't need the 'Crime_Type' column in the output, it can be assigned to NULL
totallong[, Crime_Type:= NULL]
NOTE: Instructions to install the devel version are here
Or a faster option would be stri_extract_all from library(stringi) after collapsing the rows into a single string ('v2'). The alternating elements of 'v3' can then be picked out by indexing with seq to create a new data.frame.
library(stringi)
v2 <- paste(totallong$Crime_Type, collapse='')
v3 <- stri_extract_all(v2, regex='\\d+|\\D+')[[1]]
ind1 <- seq(1, length(v3), by=2)
ind2 <- seq(2, length(v3), by=2)
d1 <- data.frame(Crime=v3[ind1], Year= v3[ind2])
Benchmarks
v1 <- do.call(paste, c(expand.grid(c('MURD', 'NEG_M'), 11:15), sep=''))
set.seed(24)
test <- data.frame(v1= sample(v1, 40525992, replace=TRUE ))
system.time({
v2 <- paste(test$v1, collapse='')
v3 <- stri_extract_all(v2, regex='\\d+|\\D+')[[1]]
ind1 <- seq(1, length(v3), by=2)
ind2 <- seq(2, length(v3), by=2)
d1 <- data.frame(Crime=v3[ind1], Year= v3[ind2])
})
#user system elapsed
#56.019 1.709 57.838
data
totallong <- data.frame(Crime_Type= c('MURD11', 'NEG_M11'))

How to use separate() properly?

I am having some difficulty extracting an ID of the form:
27da12ce-85fe-3f28-92f9-e5235a5cf6ac
from a data frame:
a<-c("NAME_27da12ce-85fe-3f28-92f9-e5235a5cf6ac_THOMAS_MYR",
"NAME_94773a8c-b71d-3be6-b57e-db9d8740bb98_THIMO",
"NAME_1ed571b4-1aef-3fe2-8f85-b757da2436ee_ALEX",
"NAME_9fbeda37-0e4f-37aa-86ef-11f907812397_JOHN_TYA",
"NAME_83ef784f-3128-35a1-8ff9-daab1c5f944b_BISHOP",
"NAME_39de28ca-5eca-3e6c-b5ea-5b82784cc6f4_DUE_TO",
"NAME_0a52a024-9305-3bf1-a0a6-84b009cc5af4_WIS_MICHAL",
"NAME_2520ebbb-7900-32c9-9f2d-178cf04f7efc_Sarah_Lu_Van_Gar/Thomas")
Basically it's the part between the first and the second underscore.
Usually I approach this with:
library(tidyr)
df$a<-as.character(df$a)
df<-df[grep("_", df$a), ]
df<- separate(df, a, c("ID","Name") , sep = "_")
df$a<-as.numeric(df$ID)
However, this time there are too many underscores... and my approach fails. Is there a way to extract that ID?
I think you should use extract instead of separate. You need to specify the patterns you want to capture. I'm assuming here that the ID always starts with a number, so I'm capturing everything from the first digit up to the next _, and then everything after that:
df <- data.frame(a)
df <- df[grep("_", df$a),, drop = FALSE]
extract(df, a, c("ID", "NAME"), "[A-Za-z].*?(\\d.*?)_(.*)")
# ID NAME
# 1 27da12ce-85fe-3f28-92f9-e5235a5cf6ac THOMAS_MYR
# 2 94773a8c-b71d-3be6-b57e-db9d8740bb98 THIMO
# 3 1ed571b4-1aef-3fe2-8f85-b757da2436ee ALEX
# 4 9fbeda37-0e4f-37aa-86ef-11f907812397 JOHN_TYA
# 5 83ef784f-3128-35a1-8ff9-daab1c5f944b BISHOP
# 6 39de28ca-5eca-3e6c-b5ea-5b82784cc6f4 DUE_TO
# 7 0a52a024-9305-3bf1-a0a6-84b009cc5af4 WIS_MICHAL
# 8 2520ebbb-7900-32c9-9f2d-178cf04f7efc Sarah_Lu_Van_Gar/Thomas
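If you'd rather not rely on the ID starting with a digit, a hedged alternative regex keys on the underscores instead (everything between the first and second underscore):
extract(df, a, c("ID", "NAME"), "^[^_]+_([^_]+)_(.*)$")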
try this (which assumes that the ID is always the part after the first underscore):
sapply(strsplit(a, "_"), function(x) x[[2]])
which gives you "the middle part" which is your ID:
[1] "27da12ce-85fe-3f28-92f9-e5235a5cf6ac" "94773a8c-b71d-3be6-b57e-db9d8740bb98"
[3] "1ed571b4-1aef-3fe2-8f85-b757da2436ee" "9fbeda37-0e4f-37aa-86ef-11f907812397"
[5] "83ef784f-3128-35a1-8ff9-daab1c5f944b" "39de28ca-5eca-3e6c-b5ea-5b82784cc6f4"
[7] "0a52a024-9305-3bf1-a0a6-84b009cc5af4" "2520ebbb-7900-32c9-9f2d-178cf04f7efc"
if you want to get the Name as well, a simple solution would be this (which assumes that the Name is always everything after the second underscore):
Names <- sapply(strsplit(a, "_"), function(x) Reduce(paste, x[-c(1,2)]))
which gives you this:
[1] "THOMAS MYR" "THIMO" "ALEX" "JOHN TYA"
[5] "BISHOP" "DUE TO" "WIS MICHAL" "Sarah Lu Van Gar/Thomas"

R - grab the exact 8-digit number in a string and transform it

I have 2 problems in extracting and transforming data using R.
Here's the dataset:
messageID | msg
1111111111 | hey id 18271801, fix it asap
2222222222 | please fix it soon id12901991 and 91222911. dissapointed
3333333333 | wow $300 expensive man, come on
4444444444 | number 2837169119 test
The problem is:
I want to grab only the numbers that are exactly 8 digits long. In the dataset above, messages 3333... (300: 3 digits) and 4444... (2837169119: 10 digits) should not be included. Here's my best shot so far:
as.matrix(unlist(apply(df[2],1,function(x){regmatches(x,gregexpr('([0-9]){8}', x))})))
However, with this line of code, message 4444... is included because it contains a number with more than 8 digits.
I also want to transform the data into another form, like this:
message_id | customer_ID
1111111111 | 18271801
2222222222 | 12901991
2222222222 | 91222911
I don't know how to efficiently transform the data.
The output of dput(df):
structure(list(id = c(1111111111, 2222222222, 3333333333, 4444444444
), msg = c("hey id 18271801, fix it asap", "please fix it soon id12901991 and 91222911. dissapointed",
"wow $300 expensive man, come on", "number 2837169119 test")), .Names = c("id",
"msg"), row.names = c(NA, 4L), class = "data.frame")
Thanks
Use rebus to create your regular expression, and stringr to extract the matches.
You may need to play with the exact form of the regular expression. This code works on your examples, but you'll probably need to adapt it for your dataset.
library(rebus)
library(stringr)
# Create regex
rx <- negative_lookbehind(DGT) %R%
  dgt(8) %R%
  negative_lookahead(DGT)
rx
## <regex> (?<!\d)[\d]{8}(?!\d)
# Extract the IDs
extracted_ids <- str_extract_all(df$msg, perl(rx))
# Stuff the IDs into a data frame.
data.frame(
  messageID = rep(
    df$id,
    vapply(extracted_ids, length, integer(1))
  ),
  extractedID = unlist(extracted_ids, use.names = FALSE)
)
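The same idea works in base R if you'd rather avoid extra dependencies; a sketch using PCRE lookarounds with gregexpr()/regmatches():
# digits not preceded or followed by another digit, exactly 8 long
m <- gregexpr("(?<!\\d)\\d{8}(?!\\d)", df$msg, perl = TRUE)
ids <- regmatches(df$msg, m)
data.frame(
  messageID   = rep(df$id, lengths(ids)),  # repeat id once per match; 0 matches drop the row
  extractedID = unlist(ids, use.names = FALSE)
)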

Remove defined strings from sentences in dataframe

I need to remove defined strings from sentences in a data frame:
sent1 = data.frame(Sentences=c("bad printer for the money wireless setup was surprisingly easy",
"love my samsung galaxy tabinch gb whitethis is the first"), user = c(1,2))
Sentences User
bad printer for the money wireless setup was surprisingly easy 1
love my samsung galaxy tabinch gb whitethis is the first 2
The defined strings to exclude, e.g.:
stop_words <- c("bad", "money", "love", "is", "the")
I was wondering about something like this:
library(stringr)
words1 <- (str_split(unlist(sent1$Sentences)," "))
ddd = which(words1[[1]] %in% stop_words)
words1[[1]][-ddd]
But I need this for all items in the list, and the output table should have the same structure as the input table sent1, just without the defined strings.
I would very much appreciate any help or advice.
You can combine the stop words into a single regex pattern; then you only need a single gsub command.
# create regex pattern
pattern <- paste0("\\b(?:", paste(stop_words, collapse = "|"), ")\\b ?")
# [1] "\\b(?:bad|money|love|is|the)\\b ?"
# remove stop words
res <- gsub(pattern, "", sent1$Sentences)
# [1] "printer for wireless setup was surprisingly easy"
# [2] "my samsung galaxy tabinch gb whitethis first"
# store result in a data frame
data.frame(Sentences = res)
# Sentences
# 1 printer for wireless setup was surprisingly easy
# 2 my samsung galaxy tabinch gb whitethis first
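Since the question already loads stringr, the same pattern works there too; a small sketch (assuming a recent stringr with str_remove_all()):
library(stringr)
sent1$Sentences <- str_remove_all(sent1$Sentences, pattern)
Modifying the column in place keeps the user column, so the output has the same structure as sent1.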