parsing access.log to data.frame - regex

I want to parse an access.log in R and get it into a data.frame. It has the following form:
TIME="2013-07-25T06:28:38+0200" MOBILE_AGENT="0" HTTP_REFERER="-" REQUEST_HOST="www.example.com" APP_ENV="envvar" APP_COUNTRY="US" APP_DEFAULT_LOCATION="New York" REMOTE_ADDR="11.222.33.444" SESSION_ID="rstg35tsdf56tdg3" REQUEST_URI="/get/me/something" HTTP_USER_AGENT="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" REQUEST_METHOD="GET" REWRITTEN_REQUEST_URI="/index.php?url=/get/me/something" STATUS="200" RESPONSE_TIME="155,860ms" PEAK_MEMORY="18965" CPU="99,99"
The logs are 400 MB per file and I currently have about 4 GB of logs, so size matters.
One more thing: there are two different log structures (different columns are included), so you cannot assume the columns are always the same, but you can assume that only one kind of structure is parsed at a time.
What I have so far is a regex for this structure:
(\\w+)[=][\"](.*?)[\"][ ]{0,1}
I can read the data in and more or less fit it into a data.frame using readLines, gsub and read.table, but it is slow and messy.
Any ideas? Thanks!

You can do this, for example:
text <- readLines("access.log")  ## or readLines(textConnection(text)) if the log is already in a string
## since we can't use = as the splitter (it also appears inside URLs),
## turn the =" after each key into a new splitter, |
dd <- read.table(text = gsub('="', '|"', text), sep = ' ')
## use data.table since it is faster to apply an operation by column
## and bind the results again
library(data.table)
DT <- as.data.table(dd)
DT.split <- DT[, lapply(.SD, function(x)
  unlist(strsplit(as.character(x), "|", fixed = TRUE)))]
## splitting stacks key and value rows alternately;
## keep every second row, i.e. the values
DT.split[c(F, T)]
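If you would rather stay closer to your own regex, here is a minimal base-R sketch; it assumes, as you state, that all lines in a file share one structure (the file name access.log is a placeholder):
lines <- readLines("access.log")
## extract every KEY="value" pair per line (perl = TRUE for the lazy .*?)
m <- regmatches(lines, gregexpr('(\\w+)="(.*?)"', lines, perl = TRUE))
parsed <- lapply(m, function(pairs) {
  keys <- sub('^(\\w+)=.*$', '\\1', pairs)
  vals <- sub('^\\w+="(.*)"$', '\\1', pairs)
  setNames(vals, keys)
})
df <- as.data.frame(do.call(rbind, parsed), stringsAsFactors = FALSE)
For 400 MB files this will be slower than the split-based approach above, but it copes with either of your two structures without changes.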

Related

How to use regular expressions properly on SQL files?

I have a lot of undocumented and uncommented SQL queries. I would like to extract some information from the SQL statements; in particular, I'm interested in DB names, table names and, if possible, column names. The queries usually have the following syntax:
SELECT *
FROM mydb.table1 m
LEFT JOIN mydb.sometable o ON m.id = o.id
LEFT JOIN mydb.sometable t ON p.id=t.id
LEFT JOIN otherdb.sometable s ON s.column='test'
Usually the statements involve several DBs and tables, and I would like to extract only the DBs and tables, without any other information. I thought it might be possible to first extract whatever begins after FROM, JOIN and LEFT JOIN; that is usually of the form db.table. The single letters such as o, t and s are aliases for already referenced tables, and I suppose they are difficult to capture. What I tried, without any success, is something like:
gsub(".*FROM \\s*|WHERE|ORDER|GROUP.*", "", vec)
This assumes that each statement ends with WHERE/where, ORDER/order or GROUP..., but it doesn't work out as expected.
You haven't indicated which database system you are using, but virtually all such systems have introspection facilities that let you get this information far more easily and reliably than attempting to parse SQL statements. The following code assumes SQLite; it can likely be adapted to your situation by getting a list of your databases and then looping over them, using dbConnect to connect to each one in turn and running code such as this:
library(gsubfn)
library(RSQLite)
con <- dbConnect(SQLite()) # use in memory database for testing
# create two tables for purposes of this test
dbWriteTable(con, "BOD", BOD, row.names = FALSE)
dbWriteTable(con, "iris", iris, row.names = FALSE)
# get all table names and columns
tabinfo <- Map(function(tab) names(fn$dbGetQuery(con, "select * from $tab limit 0")),
               dbListTables(con))
dbDisconnect(con)
giving an R list whose names are the table names and whose entries are the column names:
> tabinfo
$BOD
[1] "Time"   "demand"

$iris
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
or perhaps long form output is preferred:
setNames(stack(tabinfo), c("column", "table"))
giving:
        column table
1         Time   BOD
2       demand   BOD
3 Sepal.Length  iris
4  Sepal.Width  iris
5 Petal.Length  iris
6  Petal.Width  iris
7      Species  iris
You could use the stringi package for this.
library(stringi)
# Your string vector
myString <- "SELECT *
FROM mydb.table1 m
LEFT JOIN mydb.sometable o ON m.id = o.id
LEFT JOIN mydb.sometable t ON p.id=t.id
LEFT JOIN otherdb.sometable s ON s.column='test'"
# Three stringi functions are used:
# stri_extract_all_regex extracts the substrings where FROM or JOIN is followed by text up to the next space
# stri_replace_all_regex then strips the leading "FROM " or "JOIN "
# stri_unique keeps only the unique strings
t <- stri_unique(stri_replace_all_regex(
  stri_extract_all_regex(myString, "((FROM|JOIN) [^\\s]+)", simplify = TRUE),
  "(FROM|JOIN) ", ""))
> t
[1] "mydb.table1" "mydb.sometable" "otherdb.sometable"

Qgis or Python: converting a CSV file of simple locations to raster?

I have a CSV file as follows:
Diversity,Longitude,Latitude
7,114.99638889,-33.85333333
6,114.99790583,-33.85214594
10,115,-33.85416667
2,115.0252075,-33.84447519
I would like to convert it to a raster file with a set 'no data' value over most of the area and the values in cells at the long/lat locations.
Is there an easy way to do that in QGIS or Python?
Cheers,
Steve
Not what you asked for, but here is how you can approach it in R.
get the data:
d <- read.csv('file.csv')
d <- cbind(d[,2:3], d[,1])  # reorder to longitude, latitude, value
load the raster package:
library(raster)
If your data are regularly spaced:
r <- rasterFromXYZ(d)
writeRaster(r, 'file.tif')
else create an empty raster and rasterize:
r <- raster(extent(d[,1:2]))
res(r) <- 1  # adjust the resolution and other parameters as you see fit
r <- rasterize(d[,1:2], r, d[,3], fun=mean)
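To sanity-check the result, a quick sketch using raster::extract, which looks up cell values at the original points:
writeRaster(r, 'file.tif')
r2 <- raster('file.tif')
extract(r2, d[,1:2])  # should return the (mean-aggregated) Diversity values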

data.table setnames combined with regex

I would like to rename each column in a data table based on a regex in an appropriate way.
library(data.table)
DT <- data.table("a_foo" = 1:2, "bar_b" = 1:2)
   a_foo bar_b
1:     1     1
2:     2     2
I would like to cut the "_foo" and "bar_" from the names. This classic line does the trick, but it also copies the whole table.
names(DT) <- gsub("_foo|bar_", "", names(DT))
How can I do the same using setnames()? I have lots of variables, so just writing out all of the names is not an option.
You could try
setnames(DT, names(DT), gsub("_foo|bar_", "", names(DT)))
based on the usage in ?setnames, i.e. setnames(x, old, new).
Or as #eddi commented
setnames(DT, gsub("_foo|bar_", "", names(DT)))
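If you want to convince yourself that setnames() really modifies in place, a quick sketch using data.table's address() helper:
library(data.table)
DT <- data.table("a_foo" = 1:2, "bar_b" = 1:2)
address(DT)                                     # note the memory address
setnames(DT, gsub("_foo|bar_", "", names(DT)))
address(DT)                                     # unchanged: no copy was made
names(DT)                                       # "a" "b"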

How to rename a column of a data frame with part of the data frame identifier in R?

I've got a number of files that contain gene expression data. In each file, the gene name is kept in a column "Gene_symbol" and the expression measure (a real number) is kept in a column "RPKM". The file name consists of an identifier followed by _ and the rest of the name (it ends with "expression.txt"). I would like to load all of these files into R as data frames, rename the "RPKM" column of each data frame with the identifier of the original file, and then join the data frames by "Gene_symbol" into one large data frame: one column "Gene_symbol" followed by all the columns with the expression measures from the individual files, each labeled with the original identifier.
I've managed to transfer the identifier of the original files to the names of the individual data frames as follows.
files <- list.files(pattern = "expression.txt$")
for (i in files) {
  var_name <- paste("Data", strsplit(i, "_")[[1]][1], sep = "_")
  assign(var_name, read.table(i, header = TRUE)[, c("Gene_symbol", "RPKM")])
}
So now I'm at a stage where I have dataframes as follows:
Data_id0001 <- data.frame(Gene_symbol=c("geneA","geneB","geneC"),RPKM=c(2.43,5.24,6.53))
Data_id0002 <- data.frame(Gene_symbol=c("geneA","geneB","geneC"),RPKM=c(4.53,1.07,2.44))
But then I don't seem to be able to rename the RPKM column with the id000x bit. (That is in a fully automated way of course, looping through all the data frames I will generate in the real scenario.)
I've tried to store the identifier bit as a comment with the data frames but seem to be unable to assign the comment from within a loop.
Any help would be appreciated,
mce
You should never work this way in R. Always try to keep all your data frames in a list and operate over them using functions such as lapply. So, instead of using assign, just create an empty list the length of your files list and fill it within the for loop.
For your current situation, we can fix it using a combination of ls and mget to pull these data frames from the global environment into a list, and then change the columns of interest.
temp <- mget(ls(pattern = "Data_id\\d+$"))
## <<- reaches the enclosing environment, renaming column 2 of each element of temp
lapply(names(temp), function(x) names(temp[[x]])[2] <<- gsub("Data_", "", x))
temp
#$Data_id0001
#  Gene_symbol id0001
#1       geneA   2.43
#2       geneB   5.24
#3       geneC   6.53
#
#$Data_id0002
#  Gene_symbol id0002
#1       geneA   4.53
#2       geneB   1.07
#3       geneC   2.44
You could eventually use list2env in order to get them back into the global environment, but you should use it with caution.
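For completeness, a one-line sketch of that last step:
list2env(temp, envir = .GlobalEnv)  # pushes the renamed data frames back as individual objects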
Thanks a lot for your suggestions! I think I get the point. The way I'm doing it now (see below) is hopefully a lot more R-like, and it works fine!
Cheers,
Maik
library(plyr)
files <- list.files(pattern = "expression.txt$")
temp <- list()
for (i in 1:length(files)) {
  temp[[i]] <- read.table(files[i], header = TRUE)[, c("Gene_symbol", "RPKM")]
}
for (i in 1:length(temp)) {
  temp[[i]] <- rename(temp[[i]], c("RPKM" = strsplit(files[i], "_")[[1]][1]))
}
combined_expression <- join_all(temp, by="Gene_symbol", type="full")
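The same idea works without plyr; a base-R sketch using the same files vector:
temp <- lapply(files, function(f) {
  d <- read.table(f, header = TRUE)[, c("Gene_symbol", "RPKM")]
  names(d)[2] <- strsplit(f, "_")[[1]][1]  # label the column with the file identifier
  d
})
combined_expression <- Reduce(function(x, y) merge(x, y, by = "Gene_symbol", all = TRUE), temp)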

Write a list of lists to a table, with the names of each list as a column?

I have a fairly basic question about how to write a list to a file.
I have a list generated by Mfuzz acore function, that lists the names of all the probes I have in each of 20 clusters in the following format:
[[1]]
NAME MEM.SHIP
ILMN_X ILMN_X 0.9993195
.
.
.
[[20]]
NAME MEM.SHIP
ILMN_Y ILMN_Y 0.9982345
I want to convert it to a data frame, and eventually to an output file, where the list number is included as a column like this:
CLUSTER NAME MEM.SHIP
1 ILMN_X 0.9993196
.
.
.
20 ILMN_Y 0.9982345
Here the CLUSTER column indicates which sub-list the probe belongs to. Each probe name can belong to multiple sub-lists.
I have tried different things, like the suggestions in other posts to use plyr, but I always end up with a single list of all the variables, without an indication of which sub-list they belonged to.
Thanks!
If your original list is called clstrs, I believe this is one solution:
do.call(rbind, lapply(seq_along(clstrs), function(i) {
  data.frame(CLUSTER = i, clstrs[[i]])
}))
Here's another way to skin a cat.
# make some sample data
my.df <- data.frame(num = 1:10, val = runif(10))
my.list <- list(my.df, my.df, my.df, my.df, my.df, my.df)
# build index - count the number of rows in each list element that will be
# used to designate the rows based on their previous list affiliation
index <- sapply(my.list, nrow)  # sapply, so that rep() gets an integer vector
index <- rep(1:length(index), times = index)
# from here on it's basically what Nick did. rbind everything together and
# put some lipstick on and voila
my.out <- do.call("rbind", my.list)
my.out$index <- index
#or
my.out <- cbind(my.out, index)
I have a few minutes to spare so I did a quick benchmark using 10e5 rows for each data frame.
My solution with $index:
   user  system elapsed
   0.81    0.27    1.08
Solution with cbind:
   user  system elapsed
  19.92    0.42   20.38
Nick's solution:
   user  system elapsed
   1.04    0.26    1.31
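For what it's worth, data.table offers a one-liner for the whole problem (a sketch, assuming the elements of clstrs are data frames): rbindlist() with its idcol argument adds the sub-list index as a column.
library(data.table)
out <- rbindlist(clstrs, idcol = "CLUSTER")  # CLUSTER runs 1, 2, ..., 20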