I have a column of values that is a little messy:
Col1
----------------------------------------
B-Lipotropin(S)...............874 BTETLS
IgE-Dandelion(S).............4578 BTETLS
Beta Gamma-Globulin..........2807 BTETLS
Lactate, P
Phospholipid .........8296 BTETLS
How do I split these values into three columns like this?
Col1 Col2 Col3
-----------------------------------------------
B-Lipotropin(S) 874 BTETLS
IgE-Dandelion(S) 4578 BTETLS
Beta Gamma-Globulin 2807 BTETLS
Lactate, P
Phospholipid 8296 BTETLS
Appreciate any help.
You can also use tidyr for this:
library(tidyr)
dat <- read.table(text="B-Lipotropin(S)...............874 BTETLS
IgE-Dandelion(S).............4578 BTETLS
Beta Gamma-Globulin..........2807 BTETLS
Lactate, P
Phospholipid .........8296 BTETLS",
sep=";", stringsAsFactors=F, col.names = 'Col1')
dat %>%
separate(Col1, c('Col1', 'Col2'), '\\.+', extra = 'drop') %>%
separate(Col2, c('Col2', 'Col3'), ' ', extra = 'drop')
# Col1 Col2 Col3
# 1 B-Lipotropin(S) 874 BTETLS
# 2 IgE-Dandelion(S) 4578 BTETLS
# 3 Beta Gamma-Globulin 2807 BTETLS
# 4 Lactate, P <NA> <NA>
# 5 Phospholipid 8296 BTETLS
edit: you can also do it in one step with separate(Col1, paste0('Col', 1:3), '([^,] )|(\\.+)', extra = 'drop')
Without the actual data, it is difficult to give a general solution. However, below is one using regular expressions.
Here I assumed that the first two columns are always separated by at least one ., possibly with spaces before or after; the second and third column are presumably separated by spaces.
dat <- read.table(text="B-Lipotropin(S)...............874 BTETLS
IgE-Dandelion(S).............4578 BTETLS
Beta Gamma-Globulin..........2807 BTETLS
Lactate, P
Phospholipid .........8296 BTETLS",
sep=";", stringsAsFactors=F)
# separate first column
l <- strsplit(dat[,1], split="[[:space:]]*\\.+[[:space:]]*")
l <- lapply(l, function(x) c(x,rep("",2-length(x))))
l <- do.call(rbind,l)
dat <- cbind(dat, l[,1])
# separate last two columns
l <- strsplit(l[,2], split="[[:space:]]+")
l <- lapply(l, function(x) c(x,rep("",2-length(x))))
l <- do.call(rbind,l)
dat <- cbind(dat, l)
colnames(dat) <- c("original","col1","col2","col3")
The separated columns look like this:
> dat[,-1]
col1 col2 col3
1 B-Lipotropin(S) 874 BTETLS
2 IgE-Dandelion(S) 4578 BTETLS
3 Beta Gamma-Globulin 2807 BTETLS
4 Lactate, P
5 Phospholipid 8296 BTETLS
Using base R with a regex to split the string in the right places.
setNames(as.data.frame( # coerce to data.frame
do.call(rbind, # bind list
lapply(
strsplit(dat$Col1, "\\s*\\.+\\s*|(?<=[0-9])\\s+", perl=T), # split on dot runs or on the space after the number
`length<-`, 3) # normalize lengths of lists
)
), paste0("Col", 1:3)) # add column names
# Col1 Col2 Col3
# 1 B-Lipotropin(S) 874 BTETLS
# 2 IgE-Dandelion(S) 4578 BTETLS
# 3 Beta Gamma-Globulin 2807 BTETLS
# 4 Lactate, P <NA> <NA>
# 5 Phospholipid 8296 BTETLS
Related
I have a dataframe (df1) in which only one column (col1) has identical values, while the other columns have missing values, for example:
df1
--------------------------------------------------------------------
col1 col2 col3 col4 col5 col6
--------------------------------------------------------------------
0| 1234 NaT 120 NaN 115 XYZ
1| 1234 2015/01/12 120 Abc 115 NaN
2| 1234 2015/01/12 NaN NaN NaN NaN
I would like to merge the three rows with identical col1 values into a single row, filling each missing value with the corresponding value from a row where it is present. The resulting df would look like this:
result_df
--------------------------------------------------------------------
col1 col2 col3 col4 col5 col6
--------------------------------------------------------------------
0| 1234 2015/01/12 120 Abc 115 XYZ
Can anyone help me with this issue? Thanks in advance!
First, make the duplicated column names (col3 and col4) unique:
# Number the repeated column names: the first occurrence keeps its name,
# later occurrences get a '.1', '.2', ... suffix.
s = df.columns.to_series()
df.columns = (s + '.' + s.groupby(s).cumcount().replace({0:''}).astype(str)).str.strip('.')
print (df)
col1 col2 col3 col4 col3.1 col4.1
0 1234 NaT 120.0 NaN 115.0 XYZ
1 1234 2015-01-12 120.0 Abc 115.0 NaN
2 1234 2015-01-12 NaN NaN NaN NaN
Then aggregate with first, which takes the first non-null value in each column per group:
df = df.groupby('col1', as_index=False).first()
print (df)
col1 col2 col3 col4 col3.1 col4.1
0 1234 2015-01-12 120.0 Abc 115.0 XYZ
I would like to count the number of occurrences of each of the following strings:
"AAA", "BBB", "CCC","DDD","EEE","FFF"
in a data frame like this:
Id Var1 Var2 Var3 Var4
1 xtAAA bBBB fCCC ::hFF
2 xtAAA ZEEE ::FFF
3 ooCCC bBBB CkCC
4 BBBh fCCC :-LLL
5 xtAAA lBBB eCCC ::FFF
6 BBBC
7 xtAAA CvCC fCCC BBBlF
and then obtain a new data frame with count columns added:
Id Var1 Var2 Var3 Var4 number.of.AAA number.of.BBB number.of.CCC
1 xtAAA bBBB fCCC ::hFF
2 xtAAA ZEEE ::FFF
3 ooCCC bBBB CkCC
4 BBBh fCCC :-LLL
5 xtAAA lBBB eCCC ::FFF
6 BBBC
7 xtAAA CvCC fCCC BBBlF
I have seen many scripts but none of them does what I am aiming to do.
The following should do what you want:
# smaller subset of the data
temp <- data.frame(matrix(c("xtAAA", "bBBB", "fCCC", "::hFF",
                            "xtAAA", "ZEEE", "::FFF", NA),  # pad the second row with NA so the matrix is 2 x 4
                          nrow = 2, byrow = TRUE), stringsAsFactors = FALSE)
# build a little counter function
counter <- function(strings, input) {
return(sapply(strings, function(i) sum(grepl(i, input))))
}
# get the counts
myCounts <- t(sapply(1:nrow(temp), function(i) counter(strings=c("AAA", "BBB", "CCC"), temp[i,])))
You can add this to your data.frame using cbind:
allDone <- cbind(temp, myCounts)
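If you want the column names from the question (number.of.AAA and so on), you can rename the count matrix before binding; a small sketch, assuming the same temp and myCounts as above:
# give the count columns the names used in the question
colnames(myCounts) <- paste0("number.of.", c("AAA", "BBB", "CCC"))
allDone <- cbind(temp, myCounts)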
I have a column in my dataframe:
Colname
20151102
19920311
20130204
>=70
60-69
20-29
I wish to split this column into two columns like:
Col1 Col2
20151102
19920311
20130204
>=70
60-69
20-29
How can I achieve this result?
Without the need for any package:
df[,c("Col1", "Col2")] <- ""
isnum <- suppressWarnings(!is.na(as.numeric(df$colname)))
df$Col1[isnum] <- df$colname[isnum]
df$Col2[!isnum] <- df$colname[!isnum]
df <- df[,!(names(df) %in% "colname")]
Data:
df = data.frame(colname=c("20151102","19920311","20130204",">=70","60-69","20-29"), stringsAsFactors=FALSE)
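If you prefer a plain regex test over the as.numeric() coercion, and the numeric rows really are digit-only strings as in the example, an equivalent sketch:
# TRUE for rows that consist only of digits
isnum <- grepl("^[0-9]+$", df$colname)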
One possible solution is to use extract from tidyr. Note that the delimiter I chose (the dot) must not appear in your initial data.frame.
library(magrittr)
library(tidyr)
df$colname = df$colname %>%
grepl("[>=|-]+", .) %>%
ifelse(paste0(".", df$colname), paste0(df$colname, "."))
extract(df, colname, c("col1","col2"), "(.*)\\.(.*)")
# col1 col2
#1 222222
#2 1111111
#3 >=70
#4 60-69
#5 20-29
Data:
df = data.frame(colname=c("222222","1111111",">=70","60-69","20-29"))
Here is a single-statement solution. read.pattern captures the two field types separately in the parts of the regular expression surrounded by parentheses. (format can be omitted if the Colname column is already of class "character". Also, if you want the first column to be numeric, omit the colClasses argument.)
library(gsubfn)
read.pattern(text = format(DF$Colname), pattern = "(^\\d+$)|(.*)",
col.names = c("Col1", "Col2"), colClasses = "character")
giving:
Col1 Col2
1 20151102
2 19920311
3 20130204
4 >=70
5 60-69
6 20-29
Note: in the regular expression (^\d+$)|(.*), the first group captures lines consisting entirely of digits and the second group captures everything else.
I have a vector of irregularly-structured character data, and I want to find and extract particular numbers from it. For example, take this piece of a much larger dataset:
x <- c("2001 Tax # $25.19/Widget, 2002 Est Tax # $10.68/Widget; 2000 Est Int # $55.67/Widget",
"1999 Tax # $81.16/Widget",
"1998 Tax # $52.72/Widget; 2001 Est Int # $62.49/Widget",
"1994 Combined Tax/Int # $68.33/widget; 1993 Est Int # $159.67/Widget",
"1993 Combined Tax/Int # $38.33/widget; 1992 Est Int # $159.67/Widget",
"2006 Tax # $129.21/Widget, 1991 Est Tax # $58.19/Widget; 1991 Est Int # $30.95/Widget")
and so on. Reading the table for a larger vector shows that most of the entries are separated by semi-colons or commas, and that there are only a limited number of terms used -- the year, Tax, Int, Combined, Est -- with occasional variations in entries (like ";" versus ",", or "Widget" versus "widget").
I'd like to extract each of the numbers related to the terms above into a more structured data table, such as:
[id] [year] [number] [cat] [est]
row1 2001 25.19 Tax
row1 2002 10.68 Tax Est
row1 2000 55.67 Int Est
row2 1999 81.16 Tax
row3 1998 52.72 Tax
row3 2001 62.49 Int Est
....
or else maybe a more compact / sparse representation like:
[id] [1999tax] [2001tax] [2002esttax] [2000estint]
row1 0 25.19 10.68 55.67
row2 81.16 0 0 0
If that makes sense -- I ultimately need to put this into a regression model.
My first approach has been to write the following pseudocode:
split strings into list using strsplit() on ";" or ","
extract all years
operate on list elements using function that extracts numbers between "$" and "/"
return structured table columns
So far, I've only gotten this far:
pieces.of.x <- strsplit(x, "[;,]"); head(pieces.of.x)
which gives:
[[1]]
[1] "2001 Tax # $25.19/Widget" " 2002 Est Tax # $10.68/Widget" " 2000 Est Int # $55.67/Widget"
[[2]]
[1] "1999 Tax # $81.16/Widget"
[[3]]
[1] "1998 Tax # $52.72/Widget" " 2001 Est Int # $62.49/Widget"
[[4]]
[1] "1994 Combined Tax/Int # $68.33/widget" " 1993 Est Int # $159.67/Widget"
[[5]]
[1] "1993 Combined Tax/Int # $38.33/widget" " 1992 Est Int # $159.67/Widget"
[[6]]
[1] "2006 Tax # $129.21/Widget" " 1991 Est Tax # $58.19/Widget" " 1991 Est Int # $30.95/Widget"
Unfortunately, I don't know enough about lapply() and regular expressions ("regex") in R to build a procedure robust enough to extract the years, operate on each sub-vector of elements, and return them.
Thanks in advance for reading.
The stringr package is pretty useful when dealing with strings, and I bet someone could even write a single matcher with named capture groups to get a similar solution...
[edit: missed the combined entries]
library(stringr)
library(data.table)
# Split the row entries
x <- strsplit(x, "[,;]")
# Generate the entry identifiers.
i <- 0
id <- unlist( sapply( x, function(r) rep(i<<-i+1, length(r) ) ) )
# Extract the desired values
x <- unlist( x, recursive = FALSE )
year.re <- "(^\\s?([[:digit:]]{4})\\s)"
value.re <- "[$]([[:digit:]]+[.][[:digit:]]{2})[/]"
object.re <- "[/]([[:alnum:]]+)$"
Cats<- c("Tax","Int","Combination")
x <- lapply( x, function(str) {
c( Year=str_extract( str, year.re),
Category=Cats[ grepl( "Tax", str)*1 + grepl( "Int", str)*2 ],
Estimate=grepl( "Est", str),
Value=str_match( str, value.re)[2],
Object=str_match( str, object.re)[2] )
})
# Create a data object.
data.table( ID=id, do.call(rbind,x), key=c("Year") )
## ID Year Category Estimate Value Object
## 1: 6 1991 Tax TRUE 58.19 Widget
## 2: 6 1991 Int TRUE 30.95 Widget
## 3: 5 1992 Int TRUE 159.67 Widget
## 4: 4 1993 Int TRUE 159.67 Widget
## 5: 5 1993 Combination FALSE 38.33 widget
## 6: 4 1994 Combination FALSE 68.33 widget
## 7: 3 1998 Tax FALSE 52.72 Widget
## 8: 2 1999 Tax FALSE 81.16 Widget
## 9: 1 2000 Int TRUE 55.67 Widget
## 10: 3 2001 Int TRUE 62.49 Widget
## 11: 1 2001 Tax FALSE 25.19 Widget
## 12: 1 2002 Tax TRUE 10.68 Widget
## 13: 6 2006 Tax FALSE 129.21 Widget
This is similar to one of the other answers, and it distinguishes between line numbers (your [id] column).
matches <- regmatches(x,gregexpr("[0-9]{4} [^#]+# \\$[0-9.]+",x))
lengths <- sapply(matches,length)
z <- unlist(matches)
z <- regmatches(z,regexec("([0-9]{4}) ([^#]+) # \\$([0-9.]+)",z))
df <- t(sapply(z,function(x)c(year=x[2], number=x[4], cat=x[3])))
df <- data.frame(id=rep(1:length(x),times=lengths),df, stringsAsFactors=F)
df$est <- ifelse(grepl("Est",df$cat),"Est","")
df$cat <- regmatches(df$cat,regexpr("[^ /]+$",df$cat))
df
# id year number cat est
# 1 1 2001 25.19 Tax
# 2 1 2002 10.68 Tax Est
# 3 1 2000 55.67 Int Est
# 4 2 1999 81.16 Tax
# 5 3 1998 52.72 Tax
# 6 3 2001 62.49 Int Est
# 7 4 1994 68.33 Int
# 8 4 1993 159.67 Int Est
# 9 5 1993 38.33 Int
# 10 5 1992 159.67 Int Est
# 11 6 2006 129.21 Tax
# 12 6 1991 58.19 Tax Est
# 13 6 1991 30.95 Int Est
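If you also want the compact / sparse layout mentioned in the question, here is a hedged sketch building on this df; it uses reshape2's dcast, and the key format (e.g. "2002EstTax") is just illustrative:
library(reshape2)
df$key <- paste0(df$year, df$est, df$cat)                      # e.g. "2002EstTax"
wide <- dcast(df, id ~ key, value.var = "number", fill = "0")  # one row per id, one column per year/cat combination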
To create exactly the dataframe you are asking for, you can use a few tricks like strsplit, regular expressions, and rbind.
x <- unlist(strsplit(x, ',|;'))
bits <- regmatches(x,gregexpr('(\\d|\\.)+|(Tax|Int|Est)', x))
df <- do.call(rbind, lapply(bits, function(info) {
data.frame(year = info[[1]], number = tail(info, 1)[[1]],
cat = if ('Tax' %in% info) 'Tax' else 'Int',
est = if ('Est' %in% info) 'Est' else '')
}))
df$cat <- factor(df$cat); df$est <- factor(df$est);
which gives us
year number cat est
1 2001 25.19 Tax
2 2002 10.68 Tax Est
3 2000 55.67 Int Est
4 1999 81.16 Tax
5 1998 52.72 Tax
You can extract the numbers out using:
regmatches(x, gregexpr('(\\d|\\.)+', x))
which yields
[[1]]
[1] "2001" "25.19" "2002" "10.68" "2000" "55.67"
[[2]]
[1] "1999" "81.16"
[[3]]
[1] "1998" "52.72" "2001" "62.49"
[[4]]
[1] "1994" "68.33" "1993" "159.67"
[[5]]
[1] "1993" "38.33" "1992" "159.67"
[[6]]
[1] "2006" "129.21" "1991" "58.19" "1991" "30.95"
However, if you can assume every year's info is separated by a , or ;, try this:
x <- unlist(strsplit(x, ',|;'))
nums <- regmatches(x,gregexpr('(\\d|\\.)+', x))
df <- data.frame(matrix(as.numeric(unlist(nums)), ncol = 2, byrow = TRUE))
colnames(df) <- c('Year', 'Number')
which looks like
Year Number
1 2001 25.19
2 2002 10.68
3 2000 55.67
4 1999 81.16
5 1998 52.72
I have a list of dataframes which I eventually want to merge while maintaining a record of their original dataframe name or list index. This will allow me to subset etc across all the rows. To accomplish this I would like to add a new variable 'id' to every dataframe, which contains the name/index of the dataframe it belongs to.
Edit: In my real code the dataframe variables are created by reading multiple files with the code below, so I don't have actual names, only those in the 'files.to.read' list, and I'm unsure whether they will align with the dataframe order:
mylist <- llply(files.to.read, read.csv)
A few methods have been highlighted in several posts:
Working-with-dataframes-in-a-list-drop-variables-add-new-ones and
Using-lapply-with-changing-arguments
I have tried two similar methods, the first using the list index:
df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1,df2)
# Adds a new column 'id' with a value of 5 to every row in every dataframe.
# I WANT to change the value based on the list index.
mylist1 <- lapply(mylist,
function(x){
x$id <- 5
return (x)
}
)
#Example of what I WANT, instead of '5'.
#> mylist1
#[[1]]
#x y id
#1 1 11 1
#2 2 12 1
#3 3 13 1
#4 4 14 1
#5 5 15 1
#
#[[2]]
#x y id
#1 1 11 2
#2 2 12 2
#3 3 13 2
#4 4 14 2
#5 5 15 2
The second attempts to pass the names() of the list.
# I WANT it to add a new column 'id' with the name of the respective dataframe
# to every row in every dataframe.
mylist2 <- lapply(names(mylist),
function(x){
portfolio.results[[x]]$id <- "dataframe name here"
return (portfolio.results[[x]])
}
)
#Example of what I WANT, instead of 'dataframe name here'.
# mylist2
#[[1]]
#x y id
#1 1 11 df1
#2 2 12 df1
#3 3 13 df1
#4 4 14 df1
#5 5 15 df1
#
#[[2]]
#x y id
#1 1 11 df2
#2 2 12 df2
#3 3 13 df2
#4 4 14 df2
#5 5 15 df2
But the names() function doesn't work on a list of dataframes; it returns NULL.
Could I use seq_along(mylist) in the first example?
Any ideas, or a better way to handle the whole "merge with source id" problem?
Edit - Added solution below: I've implemented a solution using Hadley's suggestion and Tommy's nudge, which looks something like this:
files.to.read <- list.files(datafolder, pattern="\\_D.csv$", full.names=FALSE)
mylist <- llply(files.to.read, read.csv)
all <- do.call("rbind", mylist)
all$id <- rep(files.to.read, sapply(mylist, nrow))
I used the files.to.read vector as the id for each dataframe.
I also stopped using merge_recurse(), which was very slow for some reason:
all <- merge_recurse(mylist)
Thanks everyone.
Personally, I think it's easier to add the names after collapse:
df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1 = df1, df2 = df2)
all <- do.call("rbind", mylist)
all$id <- rep(names(mylist), sapply(mylist, nrow))
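For reference, dplyr's bind_rows can do the collapse and the id column in one step; a sketch assuming the same named mylist:
library(dplyr)
all <- bind_rows(mylist, .id = "id")  # .id fills the new column with the list names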
Your first attempt was very close; using indices instead of values makes it work. Your second attempt failed because you didn't name the elements in your list.
Both solutions below use the fact that lapply can pass extra parameters (mylist) to the function.
df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1=df1,df2=df2) # Name each data.frame!
# names(mylist) <- c("df1", "df2") # Alternative way of naming...
# Use indices - and pass in mylist
mylist1 <- lapply(seq_along(mylist),
function(i, x){
x[[i]]$id <- i
return (x[[i]])
}, mylist
)
# Now the names work - but I pass in mylist instead of using portfolio.results.
mylist2 <- lapply(names(mylist),
function(n, x){
x[[n]]$id <- n
return (x[[n]])
}, mylist
)
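An alternative sketch using Map, which walks the data frames and their names in parallel (same named mylist as above):
# each call gets one data frame and its matching name
mylist3 <- Map(function(df, nm) { df$id <- nm; df }, mylist, names(mylist))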
names() could work if it had names, but you didn't give it any; it's an unnamed list. You will need to use numeric indices:
> for(i in 1:length(mylist) ){ mylist[[i]] <- cbind(mylist[[i]], id=rep(i, nrow(mylist[[i]]) ) ) }
> mylist
[[1]]
x y id
1 1 11 1
2 2 12 1
3 3 13 1
4 4 14 1
5 5 15 1
[[2]]
x y id
1 1 11 2
2 2 12 2
3 3 13 2
4 4 14 2
5 5 15 2
The ldply function from the plyr package could be an answer:
library('plyr')
df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1 = df1, df2 = df2)
all <- ldply(mylist)
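If I remember correctly, ldply puts the list names into a column called .id by default; a small sketch renaming it (adjust if your plyr version labels it differently):
all <- ldply(mylist)
names(all)[names(all) == ".id"] <- "id"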
You could also use the tidyverse: lst (instead of list) automatically names the list for you, and imap then adds the id column:
library(tidyverse)
mylist <- dplyr::lst(df1, df2)
purrr::imap(mylist, ~mutate(.x, id = .y))
# $df1
# x y id
# 1 1 11 df1
# 2 2 12 df1
# 3 3 13 df1
# 4 4 14 df1
# 5 5 15 df1
# $df2
# x y id
# 1 1 11 df2
# 2 2 12 df2
# 3 3 13 df2
# 4 4 14 df2
# 5 5 15 df2
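If the goal is a single merged data frame rather than a list, a short follow-up sketch using the same objects:
mylist %>%
  purrr::imap(~mutate(.x, id = .y)) %>%  # tag each data frame with its list name
  dplyr::bind_rows()                     # then stack them into one data frame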