regex to split on anything not a digit - regex

I would like to split strings on anything not a digit. In this particular case the strings were dates and times read in from an external .csv file and are not currently in as.POSIXct format.
Ideally I would like to split the strings using regex, but if there is a simpler way to convert them to six columns of numbers using a date / time function that would be of interest as well.
I have already succeeded in creating a regex that splits the strings into six columns, but this regex is not general.
Here are the data:
my.data <- read.csv(text = '
Date_Time
18/05/2011 07:32:40
19/05/2011 13:26:02
19/05/2011 13:32:47
19/05/2011 13:45:24
19/05/2011 14:57:27
19/05/2011 15:03:18
', header=TRUE, stringsAsFactors = FALSE, na.strings = 'NA', strip.white = TRUE)
Here is a regex statement that splits the strings into six columns:
my.date.time <- data.frame(do.call(rbind, strsplit(my.data$Date_Time,"[/|:|[:space:]]+") ))
The above statement is not general. Here is an unsuccessful attempt at making the regex general by specifying a split on anything that is not a digit:
data.frame(do.call(rbind, strsplit(my.data$Date_Time,"[^\\d]+") ))
After I split the strings into six columns I still need what seems like an excessive number of statements to convert the columns into numeric format:
colnames(my.date.time) <- c('my.day', 'my.month', 'my.year', 'my.hour', 'my.minute', 'my.second')
revised.data <- data.frame(my.data, my.date.time, stringsAsFactors = FALSE)
revised.data$my.day <- as.numeric(as.character(revised.data$my.day))
revised.data$my.month <- as.numeric(as.character(revised.data$my.month))
revised.data$my.year <- as.numeric(as.character(revised.data$my.year))
revised.data$my.hour <- as.numeric(as.character(revised.data$my.hour))
revised.data$my.minute <- as.numeric(as.character(revised.data$my.minute))
revised.data$my.second <- as.numeric(as.character(revised.data$my.second))
revised.data
str(revised.data)
Thank you for any assistance in generalizing the above regex (or streamlining the procedure using date / time functions). The apply function probably can eliminate most of the as.numeric(as.character) statements, although that is a relatively minor issue.

Try \\D+, which matches one or more non-digit characters:
> x <- "18/05/2011 07:32:40"
> strsplit(x, "\\D+")
[[1]]
[1] "18" "05" "2011" "07" "32" "40"
or
> strsplit(x, "[^0-9]+")
[[1]]
[1] "18" "05" "2011" "07" "32" "40"
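To apply the same split to the whole column and get numbers rather than strings, something along these lines may work. This is a minimal base R sketch: the vector date_time is an illustrative stand-in for my.data$Date_Time, and the column names are the ones from the question. The last line also suggests why the original "[^\\d]+" attempt can be rescued: with perl = TRUE the PCRE engine accepts \d inside a bracket expression.

```r
# Illustrative stand-in for my.data$Date_Time
date_time <- c("18/05/2011 07:32:40", "19/05/2011 13:26:02")

# Split every string on runs of non-digit characters
parts <- strsplit(date_time, "\\D+")

# Convert each piece to numeric and stack the rows into a matrix
num <- do.call(rbind, lapply(parts, as.numeric))
colnames(num) <- c("my.day", "my.month", "my.year",
                   "my.hour", "my.minute", "my.second")

# Bind the numeric columns to the original strings
revised.data <- data.frame(Date_Time = date_time, num,
                           stringsAsFactors = FALSE)

# The pattern from the question, this time with the PCRE engine
parts_pcre <- strsplit(date_time, "[^\\d]+", perl = TRUE)
```

Because the conversion happens before the data frame is built, no per-column as.numeric(as.character()) calls are needed.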

Maybe I missed something but here is my solution:
lisda <- apply(my.data, 1, strsplit, "[^[:digit:]]")
my.data2 <- t(data.frame(lisda))
my.data2
[,1] [,2] [,3] [,4] [,5] [,6]
Date_Time "18" "05" "2011" "07" "32" "40"
Date_Time.1 "19" "05" "2011" "13" "26" "02"
Date_Time.2 "19" "05" "2011" "13" "32" "47"
Date_Time.3 "19" "05" "2011" "13" "45" "24"
Date_Time.4 "19" "05" "2011" "14" "57" "27"
Date_Time.5 "19" "05" "2011" "15" "03" "18"
In case you want to convert them all to numeric:
apply(my.data2, 2, function(x) as.numeric(as.character(x)))
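A shorter alternative for the numeric conversion, sketched here on a small hand-typed matrix (m is a hypothetical stand-in for my.data2), is to change the storage mode of the character matrix in place. The dimensions are kept as they are.

```r
# Hypothetical stand-in for the character matrix my.data2
m <- rbind(c("18", "05", "2011", "07", "32", "40"),
           c("19", "05", "2011", "13", "26", "02"))

# Reinterpret the strings as doubles; dim is preserved
storage.mode(m) <- "double"
```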

Using cSplit
library(splitstackshape)
tmp = cSplit(my.data, "Date_Time", "/")
out = cSplit(tmp, "Date_Time_3", ":")
If you read your data like this:
my.data <- read.csv(text = 'Date Time
18/05/2011 07:32:40
19/05/2011 13:26:02
19/05/2011 13:32:47
19/05/2011 13:45:24
19/05/2011 14:57:27
19/05/2011 15:03:18', header=TRUE, sep =' ' ,stringsAsFactors = FALSE, na.strings = 'NA', strip.white = TRUE)
you could do:
library(splitstackshape)
out = cSplit(my.data, splitCols = c("Date", "Time"), sep = c("/", ":"))
#> out
# Date_1 Date_2 Date_3 Time_1 Time_2 Time_3
#1: 18 5 2011 7 32 40
#2: 19 5 2011 13 26 2
#3: 19 5 2011 13 32 47
#4: 19 5 2011 13 45 24
#5: 19 5 2011 14 57 27
#6: 19 5 2011 15 3 18

You might consider using read.pattern from the gsubfn package for this:
library(gsubfn)
read.pattern(text = my.data$Date_Time, pattern = "\\d+")
# V1 V2 V3 V4 V5 V6
# 1 18 5 2011 7 32 40
# 2 19 5 2011 13 26 2
# 3 19 5 2011 13 32 47
# 4 19 5 2011 13 45 24
# 5 19 5 2011 14 57 27
# 6 19 5 2011 15 3 18
Then you can simply assign the column names as you desire.
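Since the question also asks about a date / time route, here is a minimal base R sketch of that option. It assumes every string follows the day/month/year hour:minute:second layout of the sample data; the vector x and the output column names are illustrative. as.POSIXlt stores the components as list elements, with months counted from 0 and years counted from 1900, hence the offsets.

```r
# Illustrative input in the same layout as my.data$Date_Time
x <- c("18/05/2011 07:32:40", "19/05/2011 13:26:02")

# Parse once; tz = "UTC" avoids daylight-saving surprises
lt <- as.POSIXlt(x, format = "%d/%m/%Y %H:%M:%S", tz = "UTC")

# Pull the six components out as numeric columns
out <- data.frame(my.day    = lt$mday,
                  my.month  = lt$mon + 1,     # months are stored 0-11
                  my.year   = lt$year + 1900, # years are stored since 1900
                  my.hour   = lt$hour,
                  my.minute = lt$min,
                  my.second = lt$sec)
```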

Related

How to remove non-alphabetic characters and convert all letter to lowercase in R?

In the following string:
"I may opt for a yam for Amy, May, and Tommy."
How can I remove non-alphabetic characters, convert all letters to lowercase, and sort the letters within each word in R?
I am also trying to sort the words in the sentence and remove the duplicates.
You could use stringi:
library(stringi)
txt <- "I may opt for a yam for Amy, May, and Tommy."
unique(stri_sort(stri_trans_tolower(stri_extract_all_words(txt, simplify = TRUE))))
Which gives:
## [1] "a" "amy" "and" "for" "i" "may" "opt" "tommy" "yam"
Update
As mentioned by @DavidArenburg, I overlooked the "sort the letters within words" part of your question. You didn't provide a desired output and no immediate application comes to mind but, assuming you want to identify which words have a matching counterpart (a q-gram string distance of 0):
library(stringdist)
library(magrittr)
unique(stri_sort(stri_trans_tolower(stri_extract_all_words(txt, simplify = TRUE)))) %>%
stringdistmatrix(., ., useNames = "strings", method = "qgram") %>%
# a amy and for i may opt tommy yam
# a 0 2 2 4 2 2 4 6 2
# amy 2 0 4 6 4 0 6 4 0
# and 2 4 0 6 4 4 6 8 4
# for 4 6 6 0 4 6 4 6 6
# i 2 4 4 4 0 4 4 6 4
# may 2 0 4 6 4 0 6 4 0
# opt 4 6 6 4 4 6 0 4 6
# tommy 6 4 8 6 6 4 4 0 4
# yam 2 0 4 6 4 0 6 4 0
apply(., 1, function(x) sum(x == 0, na.rm=TRUE))
# a amy and for i may opt tommy yam
# 1 3 1 1 1 3 1 1 3
Words with more than one 0 per row ("amy", "may", "yam") have a scrambled counterpart.
str <- "I may opt for a yam for Amy, May, and Tommy."
## Clean the words (just keep letters and convert to lowercase)
words <- strsplit(tolower(gsub("[^A-Za-z ]", "", str)), " ")[[1]]
## split the words into characters and sort them
sortedWords <- sapply(words, function(word) sort(unlist(strsplit(word, ""))))
## Join the sorted letters back together
sapply(sortedWords, paste, collapse="")
# i may opt for a yam for amy may and
# "i" "amy" "opt" "for" "a" "amy" "for" "amy" "amy" "adn"
# tommy
# "mmoty"
## If you want to convert result back to string
do.call(paste, lapply(sortedWords, paste, collapse=""))
# [1] "i amy opt for a amy for amy amy adn mmoty"
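The sorted letters can also serve as a key for grouping words that are rearrangements of each other. The following is only a base R sketch under that reading of the question; the names key, groups and anagrams are made up for the illustration.

```r
txt <- "I may opt for a yam for Amy, May, and Tommy."

# Keep letters and spaces, lowercase, split into unique words
words <- unique(strsplit(tolower(gsub("[^A-Za-z ]", "", txt)), " +")[[1]])

# Key for each word: its letters in sorted order
key <- vapply(strsplit(words, ""),
              function(ch) paste(sort(ch), collapse = ""),
              character(1))

# Words sharing a key are scrambled versions of one another
groups <- split(words, key)
anagrams <- groups[lengths(groups) > 1]
```

For this sentence the only key shared by several words is "amy", which collects "may", "yam" and "amy".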
stringr will let you work on all character sets in R and at C-speed, and magrittr will let you use a piping idiom that works well for your needs:
library(stringr)
library(magrittr)
txt <- "I may opt for a yam for Amy, May, and Tommy."
txt %>%
str_to_lower %>% # lowercase
str_replace_all("[[:punct:][:digit:][:cntrl:]]", "") %>% # only alpha
str_replace_all("[[:space:]]+", " ") %>% # single spaces
str_split(" ") %>% # tokenize
extract2(1) %>% # str_split returns a list
sort %>% # sort
unique # unique words
## [1] "a" "amy" "and" "for" "i" "may" "opt" "tommy" "yam"
The qdap package that I maintain has the bag_o_words function that works well for this:
txt <- "I may opt for a yam for Amy, May, and Tommy."
library(qdap)
unique(sort(bag_o_words(txt)))
## [1] "a" "amy" "and" "for" "i" "may" "opt" "tommy" "yam"

How to split character and numerical separately in R

I have a dataframe which looks like this:
df= data.frame(name= c("1Alex100.00","12Rina Faso92.31","113john00.00"))
And I want to split this into a data frame with 3 columns so that the output looks like:
name1 name2 name3
1 Alex 100.00
12 Rina Faso 92.31
113 john 00.00
I have tried stringr and grep() with limited success. The lack of a delimiter makes it a lot more difficult.
You could try
library(tidyr)
res <- extract(df, name, into=c('name1', 'name2', 'name3'),
'(\\d+)([^0-9]+)([0-9.]+)', convert=TRUE)
res
# name1 name2 name3
#1 1 Alex 100.00
#2 2 Rina Faso 92.31
#3 3 john 50.00
str(res)
# 'data.frame': 3 obs. of 3 variables:
# $ name1: int 1 2 3
# $ name2: Factor w/ 3 levels "Alex","john",..: 1 3 2
# $ name3: num 100 92.3 50
Update
Based on 'df' from @DavidArenburg's post:
res <- extract(df, name, into=c('name1', 'name2', 'name3'),
'(\\d+)([^0-9]+)([0-9.]+)', convert=TRUE)
res
# name1 name2 name3
#1 121 Réunion 13.76
#2 2 Côte d'Ivoire 22.40
#3 3 john 50.00
Try with str_match from stringr:
str_match(df$name, "^([0-9]*)([A-Za-z ]*)([0-9\\.]*)")
# [,1] [,2] [,3] [,4]
# [1,] "1Alex100.00" "1" "Alex" "100.00"
# [2,] "2Rina Faso92.31" "2" "Rina Faso" "92.31"
# [3,] "3john50.00" "3" "john" "50.00"
So as.data.frame(str_match(df$name, "^([0-9]*)([A-Za-z ]*)([0-9\\.]*)")[,-1]) should give you the desired result.
You could do like this also.
> df <- data.frame(name= c("1Alex100.00","12Rina Faso92.31","113john00.00"))
> x <- do.call(rbind.data.frame, strsplit(as.character(df$name), "(?<=[A-Za-z])(?=\\d)|(?<=\\d)(?=[A-Za-z])", perl=T))
> colnames(x) <- c("name1", "name2", "name3")
> print(x, row.names=FALSE)
name1 name2 name3
1 Alex 100.00
12 Rina Faso 92.31
113 john 00.00
With base R it could be done a bit uglier, though it works with special characters too:
with(df, cbind(sub("\\D.*", "", name),
gsub("[0-9.]", "", name),
gsub(".*[A-Za-z]", "", name)))
# [,1] [,2] [,3]
# [1,] "1" "Alex" "100.00"
# [2,] "2" "Rina Faso" "92.31"
# [3,] "3" "john" "50.00"
An example on special characters
df = data.frame(name= c("121Réunion13.76","2Côte d'Ivoire22.40","3john50.00"))
with(df, cbind(sub("\\D.*", "", name),
gsub("[0-9.]", "", name),
gsub(".*[A-Za-z]", "", name)))
# [,1] [,2] [,3]
# [1,] "121" "Réunion" "13.76"
# [2,] "2" "Côte d'Ivoire" "22.40"
# [3,] "3" "john" "50.00"
Base R solutions that are not ugly:
proto=data.frame(name1=numeric(),name2=character(),name3=numeric())
strcapture("(\\d+)(\\D+)(.*)",as.character(df$name),proto)
name1 name2 name3
1 1 Alex 100.00
2 12 Rina Faso 92.31
3 113 john 0.00
read.table(text=gsub("(\\d+)(\\D+)(.*)","\\1|\\2|\\3",df$name),sep="|")
V1 V2 V3
1 1 Alex 100.00
2 12 Rina Faso 92.31
3 113 john 0.00
You could use the package unglue:
df <- data.frame(name= c("1Alex100.00","12Rina Faso92.31","113john00.00"))
library(unglue)
unglue_unnest(df, name, "{name1}{name2=\\D+}{name3}", convert = TRUE)
#> name1 name2 name3
#> 1 1 Alex 100.00
#> 2 12 Rina Faso 92.31
#> 3 113 john 0.00
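For completeness, the same three-group idea can be written with regexec and regmatches from base R. This is a sketch under the assumption that every string is digits, then non-digits, then a number that may contain a dot; parts and res are illustrative names.

```r
df <- data.frame(name = c("1Alex100.00", "12Rina Faso92.31", "113john00.00"),
                 stringsAsFactors = FALSE)

# One match per string: the full match plus three capture groups
m <- regmatches(df$name, regexec("^(\\d+)(\\D+)([0-9.]+)$", df$name))

# Stack the matches and drop the full-match column
parts <- do.call(rbind, m)[, -1, drop = FALSE]

res <- data.frame(name1 = as.integer(parts[, 1]),
                  name2 = parts[, 2],
                  name3 = as.numeric(parts[, 3]),
                  stringsAsFactors = FALSE)
```

Strings that do not fit the pattern produce an empty match, so the rbind step would need extra handling for irregular input.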

Extract numbers from strings including '|'

I have data where some of the items are numbers separated by "|", like:
head(mintimes)
[1] "3121|3151" "1171" "1351|1381" "1050" "" "122"
head(minvalues)
[1] 14 10 11 31 Inf 22
What I would like to do is extract all the times and match them to the minvalues. To end up with something like:
times values
3121 14
3151 14
1171 10
1351 11
1381 11
1050 31
122 22
I've tried to strsplit(mintimes, "|") and I've tried str_extract(mintimes, "[0-9]+") but they don't seem to work. Any ideas?
| is a regular expression metacharacter. When used literally, these special characters need to be escaped either with [] or with \\ (or you could use fixed = TRUE in some functions). So your call to strsplit() should be
strsplit(mintimes, "[|]")
or
strsplit(mintimes, "\\|")
or
strsplit(mintimes, "|", fixed = TRUE)
Regarding your other try with stringr functions, str_extract_all() seems to do the trick.
library(stringr)
str_extract_all(mintimes, "[0-9]+")
To get your desired result,
> mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
> minvalues <- c(14, 10, 11, 31, Inf, 22)
> s <- strsplit(mintimes, "[|]")
> data.frame(times = as.numeric(unlist(s)),
values = rep(minvalues, sapply(s, length)))
# times values
# 1 3121 14
# 2 3151 14
# 3 1171 10
# 4 1351 11
# 5 1381 11
# 6 1050 31
# 7 122 22
By default strsplit splits using a regular expression and "|" is a special character in the regular expression syntax. You can either escape it
strsplit(mintimes,"\\|")
or just set fixed=T to not use regular expressions
strsplit(mintimes,"|", fixed=T)
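As a quick check, the three ways of treating the bar literally should give the same pieces. The sketch below uses the sample values from the question; the variable names are illustrative. The per-element piece counts are what rep() needs in order to line the times up with minvalues.

```r
mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")

escaped <- strsplit(mintimes, "\\|")             # backslash escape
bracket <- strsplit(mintimes, "[|]")             # bracket expression
literal <- strsplit(mintimes, "|", fixed = TRUE) # no regex at all

# Number of pieces per element; the empty string contributes none
n_pieces <- lengths(escaped)
```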
I have written a function called cSplit that is useful for these types of things. You can get it from my Gist: https://gist.github.com/mrdwab/11380733
Usage would be:
cSplit(data.table(mintimes, minvalues), "mintimes", "|", "long")
# mintimes minvalues
# 1: 3121 14
# 2: 3151 14
# 3: 1171 10
# 4: 1351 11
# 5: 1381 11
# 6: 1050 31
# 7: 122 22
It also has a "wide" setting, in case that would be at all useful to you:
cSplit(data.table(mintimes, minvalues), "mintimes", "|", "wide")
# minvalues mintimes_1 mintimes_2
# 1: 14 3121 3151
# 2: 10 1171 NA
# 3: 11 1351 1381
# 4: 31 1050 NA
# 5: Inf NA NA
# 6: 22 122 NA
Note: The output is a data.table.
As others have mentioned, you need to escape the | to include it literally in a regular expression. As always, we can skin this cat many ways, and here's one way to do it with stringr:
x <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
library(stringr)
unlist(str_extract_all(x, "\\d+"))
# [1] "3121" "3151" "1171" "1351" "1381" "1050" "122"
This won't work as expected if you have any decimal points in a character string of numbers, so the following (which says to match anything but |) might be safer:
unlist(str_extract_all(x, '[^|]+'))
# [1] "3121" "3151" "1171" "1351" "1381" "1050" "122"
Either way, you might want to wrap the result in as.numeric.
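The difference between the two patterns shows up as soon as a value has a decimal point. Here is a small base R sketch, using regmatches with gregexpr as a stand-in for str_extract_all; the input values are hypothetical.

```r
# Hypothetical input with a decimal value
x <- c("3.5|4", "1171", "")

# Runs of digits: the decimal value is broken into two pieces
digits_only <- unlist(regmatches(x, gregexpr("[0-9]+", x)))

# Runs of anything but the bar: the decimal value survives
not_bar <- unlist(regmatches(x, gregexpr("[^|]+", x)))

times <- as.numeric(not_bar)
```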
And here's another solution using stri_split_fixed from the stringi package. As an added value, we also play with mapply and do.call.
Input data:
mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
minvalues <- c(14, 10, 11, 31, Inf, 22)
Split mintimes w.r.t. | and convert to numeric:
library("stringi")
mintimes <- lapply(stri_split_fixed(mintimes, "|"), as.numeric)
## [[1]]
## [1] 3121 3151
##
## [[2]]
## [1] 1171
##
## [[3]]
## [1] 1351 1381
##
## [[4]]
## [1] 1050
##
## [[5]]
## [1] NA
##
## [[6]]
## [1] 122
Column-bind each minvalues with corresponding mintimes:
tmp <- mapply(cbind, mintimes, minvalues)
## [[1]]
## [,1] [,2]
## [1,] 3121 14
## [2,] 3151 14
##
## [[2]]
## [,1] [,2]
## [1,] 1171 10
##
## [[3]]
## [,1] [,2]
## [1,] 1351 11
## [2,] 1381 11
##
## [[4]]
## [,1] [,2]
## [1,] 1050 31
##
## [[5]]
## [,1] [,2]
## [1,] NA Inf
##
## [[6]]
## [,1] [,2]
## [1,] 122 22
Row-bind all the 6 matrices & remove NA-rows:
res <- do.call(rbind, tmp)
res[!is.na(res[,1]),]
## [,1] [,2]
## [1,] 3121 14
## [2,] 3151 14
## [3,] 1171 10
## [4,] 1351 11
## [5,] 1381 11
## [6,] 1050 31
## [7,] 122 22
To get the output you want, try something like this:
library(dplyr)
Split.Times <- function(x) {
mintimes <- as.numeric(unlist(strsplit(as.character(x$mintimes), "\\|")))
return(data.frame(mintimes = mintimes, minvalues = x$minvalues, stringsAsFactors=FALSE))
}
df <- data.frame(mintimes, minvalues, stringsAsFactors=FALSE)
df %>%
filter(mintimes != "") %>%
group_by(mintimes) %>%
do(Split.Times(.))
This produces:
mintimes minvalues
1 1050 31
2 1171 10
3 122 22
4 1351 11
5 1381 11
6 3121 14
7 3151 14
(I borrowed from my answer here - which is pretty much the same question/problem)
Here's a qdap package approach:
mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
minvalues <- c(14, 10, 11, 31, Inf, 22)
library(qdap)
list2df(setNames(strsplit(mintimes, "\\|"), minvalues), "times", "values")
## times values
## 1 3121 14
## 2 3151 14
## 3 1171 10
## 4 1351 11
## 5 1381 11
## 6 1050 31
## 7 122 22
You can use [:punct:]
strsplit(mintimes, "[[:punct:]]")

split strings on first and last commas

I would like to split strings on the first and last comma. Each string has at least two commas. Below is an example data set and the desired result.
A similar question here asked how to split on the first comma: Split on first comma in string
Here I asked how to split strings on the first two colons: Split string on first two colons
Thank you for any suggestions. I prefer a solution in base R. Sorry if this is a duplicate.
my.data <- read.table(text='
my.string some.data
123,34,56,78,90 10
87,65,43,21 20
a4,b6,c8888 30
11,bbbb,ccccc 40
uu,vv,ww,xx 50
j,k,l,m,n,o,p 60', header = TRUE, stringsAsFactors=FALSE)
desired.result <- read.table(text='
my.string1 my.string2 my.string3 some.data
123 34,56,78 90 10
87 65,43 21 20
a4 b6 c8888 30
11 bbbb ccccc 40
uu vv,ww xx 50
j k,l,m,n,o p 60', header = TRUE, stringsAsFactors=FALSE)
You can use the \K operator which keeps text already matched out of the result and a negative look ahead assertion to do this (well almost, there is an annoying comma at the start of the middle portion which I am yet to get rid of in the strsplit). But I enjoyed this as an exercise in constructing a regex...
x <- '123,34,56,78,90'
strsplit( x , "^[^,]+\\K|,(?=[^,]+$)" , perl = TRUE )
#[[1]]
#[1] "123" ",34,56,78" "90"
Explanation:
^[^,]+ : from the start of the string match one or more characters that are not a ,
\\K : but don't include those matched characters in the match
So the first match is the first comma...
| : or you can match...
,(?=[^,]+$) : a , so long as it is followed by [(?=...)] one or more characters that are not a , until the end of the string ($)...
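One way to avoid the stray comma is to capture the three pieces instead of splitting. This is a different technique from the strsplit call above, sketched in base R with regexec; it assumes each string contains at least two commas, as stated in the question, and parts is an illustrative name.

```r
x <- c("123,34,56,78,90", "a4,b6,c8888")

# Group 1: up to the first comma; group 3: after the last comma;
# group 2: everything in between (may itself contain commas)
m <- regmatches(x, regexec("^([^,]*),(.*),([^,]*)$", x))

# Stack the matches and drop the full-match column
parts <- do.call(rbind, m)[, -1, drop = FALSE]
```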
Here is a relatively simple approach. In the first line we use sub to replace the first and last commas with semicolons producing s. Then we read s using sep=";" and finally cbind the rest of my.data to it:
s <- sub(",(.*),", ";\\1;", my.data[[1]])
DF <- read.table(text=s, sep =";", col.names=paste0("mystring",1:3), as.is=TRUE)
cbind(DF, my.data[-1])
giving:
mystring1 mystring2 mystring3 some.data
1 123 34,56,78 90 10
2 87 65,43 21 20
3 a4 b6 c8888 30
4 11 bbbb ccccc 40
5 uu vv,ww xx 50
6 j k,l,m,n,o p 60
Here is code to split on the first and last comma. This code draws heavily from an answer by @bdemarest here: Split string on first two colons. The gsub pattern below, which is the meat of the answer, contains important differences. The code for creating the new data frame after the strings are split is the same as that of @bdemarest.
# Replace first and last commas with colons.
new.string <- gsub(pattern="(^[^,]+),(.+),([^,]+$)",
replacement="\\1:\\2:\\3", x=my.data$my.string)
new.string
# Split on colons
split.data <- strsplit(new.string, ":")
# Create data frame
new.data <- data.frame(do.call(rbind, split.data))
names(new.data) <- paste("my.string", seq(ncol(new.data)), sep="")
my.data$my.string <- NULL
my.data <- cbind(new.data, my.data)
my.data
# my.string1 my.string2 my.string3 some.data
# 1 123 34,56,78 90 10
# 2 87 65,43 21 20
# 3 a4 b6 c8888 30
# 4 11 bbbb ccccc 40
# 5 uu vv,ww xx 50
# 6 j k,l,m,n,o p 60
# Here is code for splitting strings on the first comma
my.data <- read.table(text='
my.string some.data
123,34,56,78,90 10
87,65,43,21 20
a4,b6,c8888 30
11,bbbb,ccccc 40
uu,vv,ww,xx 50
j,k,l,m,n,o,p 60', header = TRUE, stringsAsFactors=FALSE)
# Replace first comma with colon
new.string <- gsub(pattern="(^[^,]+),(.+$)",
replacement="\\1:\\2", x=my.data$my.string)
new.string
# Split on colon
split.data <- strsplit(new.string, ":")
# Create data frame
new.data <- data.frame(do.call(rbind, split.data))
names(new.data) <- paste("my.string", seq(ncol(new.data)), sep="")
my.data$my.string <- NULL
my.data <- cbind(new.data, my.data)
my.data
# my.string1 my.string2 some.data
# 1 123 34,56,78,90 10
# 2 87 65,43,21 20
# 3 a4 b6,c8888 30
# 4 11 bbbb,ccccc 40
# 5 uu vv,ww,xx 50
# 6 j k,l,m,n,o,p 60
# Here is code for splitting strings on the last comma
my.data <- read.table(text='
my.string some.data
123,34,56,78,90 10
87,65,43,21 20
a4,b6,c8888 30
11,bbbb,ccccc 40
uu,vv,ww,xx 50
j,k,l,m,n,o,p 60', header = TRUE, stringsAsFactors=FALSE)
# Replace last comma with colon
new.string <- gsub(pattern="^(.+),([^,]+$)",
replacement="\\1:\\2", x=my.data$my.string)
new.string
# Split on colon
split.data <- strsplit(new.string, ":")
# Create new data frame
new.data <- data.frame(do.call(rbind, split.data))
names(new.data) <- paste("my.string", seq(ncol(new.data)), sep="")
my.data$my.string <- NULL
my.data <- cbind(new.data, my.data)
my.data
# my.string1 my.string2 some.data
# 1 123,34,56,78 90 10
# 2 87,65,43 21 20
# 3 a4,b6 c8888 30
# 4 11,bbbb ccccc 40
# 5 uu,vv,ww xx 50
# 6 j,k,l,m,n,o p 60
You can do a simple strsplit here on that column
popshift<-sapply(strsplit(my.data$my.string,","), function(x)
c(x[1], paste(x[2:(length(x)-1)],collapse=","), x[length(x)]))
desired.result <- cbind(data.frame(my.string=t(popshift)), my.data[-1])
I just split up all the values and make a new vector with the first, middle and last strings. Then I cbind that with the rest of the data. The result is
my.string.1 my.string.2 my.string.3 some.data
1 123 34,56,78 90 10
2 87 65,43 21 20
3 a4 b6 c8888 30
4 11 bbbb ccccc 40
5 uu vv,ww xx 50
6 j k,l,m,n,o p 60
Using str_match() from package stringr, and a little help from one of your links,
> library(stringr)
> data.frame(str_match(my.data$my.string, "(.+?),(.*),(.+?)$")[,-1],
some.data = my.data$some.data)
# X1 X2 X3 some.data
# 1 123 34,56,78 90 10
# 2 87 65,43 21 20
# 3 a4 b6 c8888 30
# 4 11 bbbb ccccc 40
# 5 uu vv,ww xx 50
# 6 j k,l,m,n,o p 60

Creating A Dataframe From A Text Dataset

I have a dataset that has hundreds of thousands of fields. The following is a simplified dataset
dataSet <- c("Plnt SLoc Material Description L.T MRP Stat Auto MatSG PC PN Freq Qty CFreq CQty Cur.RPt New.RPt CurRepl NewRepl Updt Cost ServStock Unit OpenMatResb DFStorLocLevel",
"0231 0002 GB.C152260-00001 ASSY PISTON & SEAL/O-RING 44 PD X A A A 18 136 30 29 50 43 24.88 51.000 EA",
"0231 0002 WH.112734 MOTOR REDUCER, THREE-PHAS 41 PD X B B A 16 17 3 3 5 4 483.87 1.000 EA X",
"0231 0002 WH.920569 SPINDLE MOTOR MINI O 22 PD X A A A 69 85 15 9 25 13 680.91 21.000 EA",
"0231 0002 GB.C150583-00001 VALVE-AIR MDI 64 PD X A A A 16 113 50 35 80 52 19.96 116.000 EA",
"0231 0002 FG.124-0140 BEARING 32 PD X A A A 36 205 35 32 50 48 21.16 55.000 EA",
"0231 0002 WP.254997 BEARING,BALL .9843 X 2.04 52 PD X A A A 18 155 50 39 100 58 2.69 181.000 EA"
)
I would like to create a dataframe out of this dataSet for further calculation. The approach I am following is as follows:
I split the dataSet by space and then recombine it.
dataSetSplit <- strsplit(dataSet, "\\s+")
The header (which is the first line) splits correctly and produces 25 characters. This can be seen by the str() function.
str(dataSetSplit)
I then intend to combine all the rows using the following script:
combinedData <- data.frame(do.call(rbind, dataSetSplit))
Please note that the above "combinedData" script fails because the split did not produce an equal number of fields in every row.
For this approach to work, every row must split correctly into 25 fields.
If you think this is a sound approach, please let me know how to split the rows into 25 fields.
It is worth mentioning that I do not like the approach of splitting the data set with strsplit(). It is an extremely time-consuming step when used with a large data set. Can you please recommend an alternative approach to create a data frame out of the supplied data?
By the looks of it, you have a header row that is actually helpful. You can easily use gregexpr to calculate your "widths" to use with read.fwf.
Here's how:
## Use gregexpr to find the position of consecutive runs of spaces
## This will tell you the starting position of each column
Widths <- gregexpr("\\s+", dataSet[1])[[1]]
## `read.fwf` doesn't need the starting position, but the width of
## each column. We can use `diff` to calculate this.
Widths <- c(Widths[1], diff(Widths))
## Since there are no spaces after the last column, we need to calculate
## a reasonable width for that column too. We can do this with `nchar`
## to find the widest row in the data. From this, subtract the `sum`
## of all the previous values.
Widths <- c(Widths, max(nchar(dataSet)) - sum(Widths))
Let's also extract the column names. We could do this in read.fwf, but it would require us to substitute the spaces in the first line with a "sep" character.
Names <- scan(what = "", text = dataSet[1])
Now, read in everything except the first line. You would use the actual file instead of textConnection, I would suppose.
read.fwf(textConnection(dataSet), widths=Widths, strip.white = TRUE,
skip = 1, col.names = Names)
# Plnt SLoc Material Description L.T MRP Stat Auto MatSG PC PN Freq Qty
# 1 231 2 GB.C152260-00001 ASSY PISTON & SEAL/O-RING 44 PD NA X A A A 18 136
# 2 231 2 WH.112734 MOTOR REDUCER, THREE-PHAS 41 PD NA X B B A 16 17
# 3 231 2 WH.920569 SPINDLE MOTOR MINI O 22 PD NA X A A A 69 85
# 4 231 2 GB.C150583-00001 VALVE-AIR MDI 64 PD NA X A A A 16 113
# 5 231 2 FG.124-0140 BEARING 32 PD NA X A A A 36 205
# 6 231 2 WP.254997 BEARING,BALL .9843 X 2.04 52 PD NA X A A A 18 155
# CFreq CQty Cur.RPt New.RPt CurRepl NewRepl Updt Cost ServStock Unit OpenMatResb
# 1 NA NA 30 29 50 43 NA 24.88 51 EA <NA>
# 2 NA NA 3 3 5 4 NA 483.87 1 EA X
# 3 NA NA 15 9 25 13 NA 680.91 21 EA <NA>
# 4 NA NA 50 35 80 52 NA 19.96 116 EA <NA>
# 5 NA NA 35 32 50 48 NA 21.16 55 EA <NA>
# 6 NA NA 50 39 100 58 NA 2.69 181 EA <NA>
# DFStorLocLevel
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 NA
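The width calculation is easier to follow on a tiny example. The sketch below applies the same steps to three made-up fixed-width lines (the data and the names lines, starts, widths and parsed are hypothetical). It assumes, as the approach above does, that each column ends where a run of spaces begins in the header.

```r
# Hypothetical fixed-width data: header plus two rows, 20 characters each
lines <- c("Plnt  Material   Qty",
           "0231  GB.C1522   136",
           "0231  WH 11273    17")

# Start positions of the runs of spaces in the header
starts <- gregexpr("\\s+", lines[1])[[1]]

# Turn positions into widths, then add the width of the last column
widths <- c(starts[1], diff(starts))
widths <- c(widths, max(nchar(lines)) - sum(widths))

# Column names come from the header itself
col_names <- scan(what = "", text = lines[1], quiet = TRUE)

con <- textConnection(lines)
parsed <- read.fwf(con, widths = widths, skip = 1, strip.white = TRUE,
                   col.names = col_names, stringsAsFactors = FALSE)
close(con)
```

Note that the embedded space in "WH 11273" is kept, which is the advantage of fixed widths over splitting on whitespace.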
Many thanks to Ananda Mahto, who provided many pieces of this answer.
widthMinusFirst <- diff(gregexpr('(\\s[A-Z])+', dataSet[1])[[1]])
widthFirst <- gregexpr('\\s+', dataSet[1])[[1]][1]
Width <- c(widthFirst, widthMinusFirst)
Widths <- c(Width, max(nchar(dataSet)) - sum(Width))
columnNames <- scan(what = "", text = dataSet[1])
read.fwf(textConnection(dataSet[-1]), widths = Widths, strip.white = FALSE,
skip = 0, col.names = columnNames)