Clean character vector and strsplit into dataframe - regex

I have a character vector that I want to transform into a data frame.
It's mostly clean but I can't figure out how to finish the cleaning. Notice that the real data are a Date column as yyyy-mm-dd and a Variable column as a number (in this case four digits but not always) separated by a comma.
class(myvec)
[1] "character"
myvec
[1] " \"2016-01-01,8631n\" " " \"2016-01-02,8577n\" "
[3] " \"2016-01-03,8476n\" " " \"2016-01-04,8365n\" "
[5] " \"2016-01-05,8331n\" " " \"2016-01-06,8801n\" "
[7] " \"2016-01-07,5020n\""
The leading space and escaped quote (' \"') should be removed, and likewise the trailing n\".
The expected output should be a data frame like this
Date Variable
[1,] "2016-01-01" "8631"
[2,] "2016-01-02" "8577"
[3,] "2016-01-03" "8476"
[4,] "2016-01-04" "8365"
[5,] "2016-01-05" "8331"
[6,] "2016-01-06" "8801"
[7,] "2016-01-07" "5020"
Once the vector is clean, I think this does the job:
do.call(rbind,strsplit(clean_vector,","))
I think I can convert to date with lubridate and the var to numeric with as.numeric on my own, the question is about getting the character vector clean and in the correct format.

You can remove the offending characters by enumerating them:
# example
x = " \"2016-01-01,8631n\" "
gsub("[n \"]","",x)
# "2016-01-01,8631"
This works because [xyz] identifies any single character from the list xyz.
Or you can take a substring, since the formatting is fixed-width, with bad chars at the start and end:
substr(x,3,17)
# "2016-01-01,8631"
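If the var part of the string varies in length, nchar(x)-3 should work in place of 17, provided every element ends with the same three trailing characters (n, quote, space). Note that the last element of myvec has no trailing space, so the gsub approach is the safer choice for that data.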
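Putting the pieces together, here is a minimal sketch of the whole pipeline in base R. It assumes the only junk characters are spaces, double quotes and the letter n, as in the sample. The shortened myvec and the names parts and df are made up for illustration.

```r
# Shortened copy of the sample data (the last element has no trailing space)
myvec <- c(" \"2016-01-01,8631n\" ",
           " \"2016-01-02,8577n\" ",
           " \"2016-01-07,5020n\"")

# Drop every space, double quote and letter n
clean_vector <- gsub("[n \"]", "", myvec)

# Split on the comma and bind the pieces into a two-column matrix
parts <- do.call(rbind, strsplit(clean_vector, ",", fixed = TRUE))

# Convert the columns to their real types
df <- data.frame(Date = as.Date(parts[, 1]),
                 Variable = as.numeric(parts[, 2]))
```

as.Date() parses yyyy-mm-dd without a format string, so lubridate is optional here.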

Related

RegEx select all between two character

Example:
I want to extract everything from "Item:" up to the closing "*":
Item: *Sofa (1 SET), 2 × Mattress, 3 × Baby Mattress, 5
Seaters Car (Fabric)*
Total price: 100.00
Subtotal: 989.00
But I only managed to extract "Item: *" and " Seaters Car (Fabric)* " by using (.*?)\*
After matching Item:, match anything but a colon with [^:]+, and then lookahead for a newline, ensuring that the match ends at the end of a line just before another label (like Total price:) starts:
Item: ([^:]+)(?=\n)
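The question does not say which language is in use, so as a rough sketch, this is how the pattern could be applied in R. The lookahead needs perl = TRUE; the variable names txt and item are invented for the example.

```r
# Sample text from the question, with explicit newlines
txt <- paste("Item: *Sofa (1 SET), 2 \u00d7 Mattress, 3 \u00d7 Baby Mattress, 5",
             "Seaters Car (Fabric)*",
             "Total price: 100.00",
             "Subtotal: 989.00",
             sep = "\n")

# regexec() returns the full match plus capture groups;
# perl = TRUE is needed for the (?=...) lookahead
m <- regmatches(txt, regexec("Item: ([^:]+)(?=\n)", txt, perl = TRUE))

# Element 1 is the whole match, element 2 is the first capture group
item <- m[[1]][2]
```

The greedy [^:]+ first runs up to the colon after "Total price", then backtracks to the last position that is followed by a newline, which is the end of the "(Fabric)*" line.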

Comparing two version of the same string

I would like to write a function that compares two strings in R. More precisely, if I have this data:
data <- list(
"First sentence.",
"Very first sentence.",
"Very first and only one sentences."
)
I would like the output to be :
[1] "Very" " and only one sentences"
My output is built from every substring that is not included in the previous string. For example:
2nd vs 1st: remove the matching string - "first sentence." - from the 2nd, so the result is "Very".
# "First sentence."
# "Very first sentence."
# match: ^^^^^^^^^^^^^^^
Now compare 3rd vs 2nd: remove the matching string - "Very first" - from the 3rd, so the result is " and only one sentences".
# "Very first sentence."
# "Very first and only one sentences."
# match: ^^^^^^^^^^
Then compare 4th vs 3rd, etc...
So based on this example my output should be:
c("Very", " and only one sentences")
# [1] "Very" " and only one sentences"
Here's a tidyverse approach:
library(dplyr)
library(tidyr)
# put data in a data.frame
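# (tibble() and row_number() stand in for the deprecated data_frame() and add_rownames())
tibble(string = unlist(data)) %>%
# add ID column so we can recombine later
mutate(id = as.character(row_number())) %>%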
# add a lagged column to compare against
mutate(string2 = lag(string)) %>%
# break strings into words
separate_rows(string) %>%
# evaluate the following calls rowwise (until regrouped)
rowwise() %>%
# chop to rows with a string to compare against,
filter(!is.na(string2),
# where the word is not in the comparison string
!grepl(string, string2, ignore.case = TRUE)) %>%
# regroup by ID
group_by(id) %>%
# reassemble strings
summarise(string = paste(string, collapse = ' '))
## # A tibble: 2 x 2
## id string
## <chr> <chr>
## 1 2 Very
## 2 3 and only one sentences.
If you'd like just a vector, extract the string column by appending:
...
%>% `[[`('string')
## [1] "Very" "and only one sentences."
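For comparison, the same idea can be sketched in base R. This is not the pipeline above: it compares whole space-separated words case-insensitively instead of using grepl(), and the helper name new_words is made up for the example.

```r
data <- list(
  "First sentence.",
  "Very first sentence.",
  "Very first and only one sentences."
)
strings <- unlist(data)

# Keep the words of `current` that do not occur (ignoring case) in `previous`
new_words <- function(current, previous) {
  cur  <- strsplit(current,  " ", fixed = TRUE)[[1]]
  prev <- strsplit(previous, " ", fixed = TRUE)[[1]]
  paste(cur[!tolower(cur) %in% tolower(prev)], collapse = " ")
}

# Compare each string with the one before it
result <- mapply(new_words, strings[-1], strings[-length(strings)],
                 USE.NAMES = FALSE)
```

Like the tidyverse version, this drops the leading space and keeps the trailing period, so the second element is "and only one sentences." and not " and only one sentences".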

replacing values in selected columns of a dataframe using RegExp

Assume I have a dataframe
mydata <- c("10 stack"," 10 stack and x" , "10 stack / dd" ," 10 stackxx")
R>mydata
[1] " 10 stack"
[2] " 10 stack and x"
[3] " 10 stack / dd"
[4] " 10 stackxx"
What I want to do is replace any word beginning with 10 stack[anything] with another word, without removing the rest of the string. I also want to replace the slash with "and" or a comma.
The desired output:
[1] " new"
[2] " new and x"
[3] " new and dd"
[4] " new"
My code is:
mydata[mydata == "10 stack"] <- "new" # I can replace one type, but I need a faster operation.
mydata[mydata == "/"] <- "and" # for replacing the slash with "and"
I found another method that might solve the problem:
mydata<-as.data.frame(sapply(mydata,gsub,pattern="//\",replacement=","))
Try
library(stringi)
stri_replace_all_regex(mydata, c("10 stack", "\\/"), c("new", "and"), vectorize_all=FALSE)
Which gives:
#[1] "new" " new and x" "new and dd" " newxx"
As mentioned by @rock321987 in the comments, if you want to replace 10 stack[anything], you could use the pattern \\b10 stack[^\\s]* instead:
stri_replace_all_regex(mydata, c("\\b10 stack[^\\s]*", "\\/"), c("new", "and"),
vectorize_all=FALSE)
Which gives:
#[1] "new" " new and x" "new and dd" " new"
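If stringi is not available, roughly the same two replacements can be sketched with base R's gsub(), applied one after the other. The intermediate name step1 is invented here; \\S* is shorthand for [^\\s]*.

```r
mydata <- c("10 stack", " 10 stack and x", "10 stack / dd", " 10 stackxx")

# Replace "10 stack" plus any non-space characters glued to it
step1 <- gsub("\\b10 stack\\S*", "new", mydata, perl = TRUE)

# Replace the literal slash with "and"
result <- gsub("/", "and", step1, fixed = TRUE)
# [1] "new"        " new and x" "new and dd" " new"
```

Leading spaces in the input are preserved, which is why the second and fourth elements still start with a space.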
You need to use the sub() function, which matches a pattern and substitutes the replacement for its first occurrence in each element:
sub("10 stack", " new", mydata)

How to remove the [1]s, [[1]]s and double quotes from a csv data in R?

I've a CSV file. It contains the output of some previous R operations, so it is filled with the index numbers (such as [1], [[1]]). When it is read into R, it looks like this, for example:
V1
1 [1] 789
2 [[1]]
3 [1] "PNG" "D115" "DX06" "Slz"
4 [1] 787
5 [[1]]
6 [1] "D010" "HC"
7 [1] 949
8 [[1]]
9 [1] "HC" "DX06"
(I don't know why all that wasted space between line number and the output data)
I need the above data to appear as follows (without [1] or [[1]] or " " and with the data placed beside its corresponding number, like):
789 PNG,D115,DX06,Slz
787 D010,HC
949 HC,DX06
(possibly the 789 and its corresponding data PNG,D115,DX06,Slz should be separated by a tab.. and like that for each row)
How to achieve this in R?
We can create a grouping variable ('indx') and split the 'V1' column by that index, after removing the leading bracketed part ([1] or [[1]]) as well as the double quotes within the strings. Assuming the first column should hold the numeric element and the second column the non-numeric part, we can use regex to replace the spaces with commas (as shown in the expected result) and then rbind the list elements.
indx <- cumsum(c(grepl('\\[\\[', df1$V1)[-1], FALSE))
do.call(rbind,lapply(split(gsub('"|^.*\\]', '', df1$V1), indx),
function(x) data.frame(ind=x[1],
val=gsub('\\s+', ',', gsub('^\\s+|\\s+$', '',x[-1][x[-1]!=''])))))
# ind val
#1 789 PNG,D115,DX06,Slz
#2 787 D010,HC
#3 949 HC,DX06
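A shorter variant is possible if the file strictly alternates between a number line, a [[1]] line and a values line, which is an assumption based only on the sample shown. The names V1, stripped, and out below are illustrative.

```r
# Same content as the V1 column in the question
V1 <- c("[1] 789", "[[1]]", "[1] \"PNG\" \"D115\" \"DX06\" \"Slz\"",
        "[1] 787", "[[1]]", "[1] \"D010\" \"HC\"",
        "[1] 949", "[[1]]", "[1] \"HC\" \"DX06\"")

# Remove quotes and the leading index, then trim surrounding whitespace
stripped <- trimws(gsub('"|^.*\\]', "", V1))

# The [[1]] lines are now empty strings; drop them
stripped <- stripped[stripped != ""]

# What remains alternates: number, values, number, values, ...
out <- data.frame(ind = stripped[c(TRUE, FALSE)],
                  val = gsub("\\s+", ",", stripped[c(FALSE, TRUE)]),
                  stringsAsFactors = FALSE)
```

From here, write.table(out, sep = "\t", quote = FALSE, row.names = FALSE) would give the tab-separated layout mentioned in the question.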
data
df1 <- structure(list(V1 = c("[1] 789", "[[1]]",
"[1] \"PNG\" \"D115\" \"DX06\" \"Slz\"",
"[1] 787", "[[1]]", "[1] \"D010\" \"HC\"", "[1] 949",
"[[1]]", "[1] \"HC\" \"DX06\"")), .Names = "V1",
class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6",
"7", "8", "9"))
Honestly, a command-line fix using either sed/perl/egrep -o is less pain:
sed -e 's/.*\][ \t]*//' dirty.csv > clean.csv

Use string comparisons to split a column in R

To the best of my search this question hasn't been asked before.
I have a dataframe column called Product. This column has the company name as well as product model in just one column.
product.df <- data.frame("Product" = c("Company1 123M UG", "Company1 234M-I", "Company2 763-87-U","Company2 777-87", "Company3 Name1 87M", "Company3 Name1 O77M", "Company3 Name1 765-U MP"))
I want to split the company names and product model numbers from this single column into two columns. I need a function that can find words shared between rows and classify them as company names, and the remaining characters as the product model number. As far as I can tell, no two rows have the same model number. So in the case above, I would get this answer:
new.product.df <- data.frame("CompanyName" = c("Company1", "Company1", "Company2","Company2", "Company3 Name1", "Company3 Name1", "Company3 Name1"), "Model" = c("123M UG", "234M-I", "763-87-U", "777-87", "87M", "O77M", "765-U MP"))
I need a function that can compare two strings and return me similar continuous letters and dissimilar letters.
If you're guaranteed the first word is always a company name, then simply do a fixed split on the first space with max 2 output:
require(stringi)
stri_split_fixed(product.df[,1], ' ', n=2)
or:
apply(product.df, 2, function(...) { stri_split_fixed(..., ' ', n=2) } )
[1] "Company1" "123M UG"
[1] "Company1" "234M-I"
[1] "Company2" "763-87-U"
[1] "Company2" "777-87"
[1] "Company3" "Name1 87M"
[1] "Company3" "Name1 O77M"
[1] "Company3" "Name1 765-U MP"
Try this
new.product.df <- data.frame(company=
unlist(lapply(strsplit(as.character(product.df$Product), split=" .[0-9]"), function(x) x[1])),
name =
unlist(lapply(strsplit(as.character(product.df$Product), split="[1|2] "), function(x) x[2]))
)
According to your data, the separator between company and product is the first space character, so the first step is to convert that first space into something else, in this example __. I'll explain why below.
This is your actual data:
Product
1 Company1 123M UG
2 Company1 234M-I
3 Company2 763-87-U
4 Company2 777-87
5 Company3 Name1 87M
6 Company3 Name1 O77M
7 Company3 Name1 765-U MP
This code does the conversion (sub() replaces only the first match):
product.df$Product <- sub(pattern = " ", replacement = "__", x = product.df$Product)
The data should now look something like this:
Product
1 Company1__123M UG
2 Company1__234M-I
3 Company2__763-87-U
4 Company2__777-87
5 Company3__Name1 87M
6 Company3__Name1 O77M
7 Company3__Name1 765-U MP
Then use the tidyr library to separate this new data frame:
library("tidyr")
new.product.df <- separate( product.df , Product , c("Company" , "Model") , sep = "__")
The reason for converting the space to __ is that the part after the company name may itself contain spaces, as in 123M UG and Name1 87M. Splitting on every space would cause errors later, so the first step of this solution avoids that when separating the column.
Of course, it would be better to separate on the first occurrence of the space character directly, but I don't know how, because the separator regex is applied globally by default. Any suggestions are welcome.
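One way to split on the first space only, without the __ detour, is two sub() calls in base R. This is a sketch on a shortened copy of the data; the names product, company, model and split.df are hypothetical. As far as I know, tidyr's separate() also accepts extra = "merge" together with sep = " " to get the same effect.

```r
product <- c("Company1 123M UG", "Company1 234M-I", "Company3 Name1 87M")

# Everything before the first space
company <- sub(" .*$", "", product)

# Everything after the first space
model <- sub("^[^ ]* ", "", product)

split.df <- data.frame(Company = company, Model = model,
                       stringsAsFactors = FALSE)
```

As with the other first-space answers, "Company3 Name1 87M" comes out as Company3 and Name1 87M, so multi-word company names still need a separate rule.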