Split a dataframe column by regular expression on characters separated by a "."

In R, I have the following dataframe:
Name Category
1 Beans 1.12.5
2 Pears 5.7.9
3 Eggs 10.6.5
What I would like to have is the following:
Name Cat1 Cat2 Cat3
1 Beans 1 12 5
2 Pears 5 7 9
3 Eggs 10 6 5
Ideally some expression built inside plyr would be nice...
I will keep investigating on my side, but since searching for this might take me a lot of time, I was wondering whether some of you have hints on how to do this...

I've written a function concat.split (a "family" of functions, actually) as part of my splitstackshape package for dealing with these types of problems:
# install.packages("splitstackshape")
library(splitstackshape)
concat.split(mydf, "Category", ".", drop=TRUE)
# Name Category_1 Category_2 Category_3
# 1 Beans 1 12 5
# 2 Pears 5 7 9
# 3 Eggs 10 6 5
It also works nicely on "unbalanced" data.
dat <- data.frame(Name = c("Beans", "Pears", "Eggs"),
Category = c("1.12.5", "5.7.9.8", "10.6.5.7.7"))
concat.split(dat, "Category", ".", drop = TRUE)
# Name Category_1 Category_2 Category_3 Category_4 Category_5
# 1 Beans 1 12 5 NA NA
# 2 Pears 5 7 9 8 NA
# 3 Eggs 10 6 5 7 7
Because "long" or "molten" data are often required in these types of situations, the concat.split.multiple function has a "long" argument too:
concat.split.multiple(dat, "Category", ".", direction = "long")
# Name time Category
# 1 Beans 1 1
# 2 Pears 1 5
# 3 Eggs 1 10
# 4 Beans 2 12
# 5 Pears 2 7
# 6 Eggs 2 6
# 7 Beans 3 5
# 8 Pears 3 9
# 9 Eggs 3 5
# 10 Beans 4 NA
# 11 Pears 4 8
# 12 Eggs 4 7
# 13 Beans 5 NA
# 14 Pears 5 NA
# 15 Eggs 5 7

The qdap package has the colsplit2df function for just this sort of situation:
#recreate your data first:
dat <- data.frame(Name = c("Beans", "Pears", "Eggs"), Category = c("1.12.5",
"5.7.9", "10.6.5"),stringsAsFactors=FALSE)
library(qdap)
colsplit2df(dat, 2, paste0("cat", 1:3))
## Name cat1 cat2 cat3
## 1 Beans 1 12 5
## 2 Pears 5 7 9
## 3 Eggs 10 6 5

If you have a consistent number of categories, then this will work:
#recreate your data first:
dat <- data.frame(Name = c("Beans", "Pears", "Eggs"), Category = c("1.12.5",
"5.7.9", "10.6.5"),stringsAsFactors=FALSE)
spl <- strsplit(dat$Category,"\\.")
len <- sapply(spl,length)
dat[paste0("cat",1:max(len))] <- t(sapply(spl,as.numeric))
Result:
dat
Name Category cat1 cat2 cat3
1 Beans 1.12.5 1 12 5
2 Pears 5.7.9 5 7 9
3 Eggs 10.6.5 10 6 5
If you have differing numbers of separated values, then this should account for it:
#example unbalanced data
dat <- data.frame(Name = c("Beans", "Pears", "Eggs"), Category = c("1.12.5",
"5.7.9", "10.6.5"),stringsAsFactors=FALSE)
dat$Category[2] <- "5.7"
spl <- strsplit(dat$Category,"\\.")
len <- sapply(spl,length)
spl <- Map(function(x,y) c(x,rep(NA,max(len)-y)), spl, len)
dat[paste0("cat",1:max(len))] <- t(sapply(spl,as.numeric))
Result:
Name Category cat1 cat2 cat3
1 Beans 1.12.5 1 12 5
2 Pears 5.7 5 7 NA
3 Eggs 10.6.5 10 6 5

Related

Cumulative average by month in Power BI

I have the dataset below.
Math Literature Biology date student
4 2 5 2019-08-25 A
4 5 4 2019-08-08 A
5 4 5 2019-08-23 A
5 5 5 2019-08-15 A
5 5 5 2019-07-19 A
5 5 5 2019-07-15 A
5 5 5 2019-07-03 A
5 5 5 2019-06-26 A
1 1 2 2019-06-18 A
2 3 3 2019-06-14 A
5 5 5 2019-05-01 A
2 1 3 2019-04-26 A
I need to develop a solution in Power BI so that the output shows the cumulative average per subject per month.
For example:
April May June July August
Math | 2 3.5 3 3.75 4
Literature | 1 3 3 3.75 3.83
Biology | 3 4 3.6 4.125 4.33
Can you help?
You can use a matrix visualization for this.
Create a month-year variable and use it in the columns.
Use the average of Math, Literature, and Biology in the values.
Under the Format pane --> Values --> Show on rows --> select this.
This should give the view you are looking for. You can edit the value headers to your requirement.
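For reference, the cumulative averages in the example can be checked with a few lines of plain R. This is only a sketch of the arithmetic (not Power BI/DAX code), using the Math column of the sample data; the object and column names here are made up for illustration.
# Math scores and dates taken from the sample data above
math <- data.frame(
  Math = c(4, 4, 5, 5, 5, 5, 5, 5, 1, 2, 5, 2),
  date = as.Date(c("2019-08-25", "2019-08-08", "2019-08-23", "2019-08-15",
                   "2019-07-19", "2019-07-15", "2019-07-03", "2019-06-26",
                   "2019-06-18", "2019-06-14", "2019-05-01", "2019-04-26"))
)
math$month <- format(math$date, "%Y-%m")
# cumulative average = mean of all scores up to and including each month
sapply(sort(unique(math$month)), function(m) mean(math$Math[math$month <= m]))
# 2019-04 2019-05 2019-06 2019-07 2019-08
#    2.00    3.50    3.00    3.75    4.00
These values match the 2, 3.5, 3, 3.75, 4 row for Math in the expected output.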

Sum 5 rows at a time in an ordered SAS table with no unique identifier using proc sql

I'm working with a SAS table where I have ordered data that I need to sum in intervals of 5. I don't have a unique ID I can use for the group by statement and I'm struggling to find a solution.
Say I have this table
Number Name X Y
1 Susan 2 1
2 Susan 3 3
3 Susan 3 3
4 Susan 4 1
5 Susan 1 2
6 Susan 1 1
7 Susan 1 1
8 Susan 2 4
9 Susan 1 5
10 Susan 4 2
1 Steve 2 4
2 Steve 2 3
3 Steve 1 2
4 Steve 3 5
5 Steve 1 1
6 Steve 1 3
7 Steve 2 3
8 Steve 2 4
9 Steve 1 1
10 Steve 1 1
I'd want the output to look like
Number Name X Y
1-5 Susan 13 10
6-10 Susan 9 13
1-5 Steve 9 15
6-10 Steve 7 12
Is there an easy way to get output like this using proc sql? Thanks!
Try this, using ceil(Number/5) so that rows 1-5 fall into group 1 and rows 6-10 into group 2:
proc sql;
  select ceil(Number/5) as Grouping, Name, sum(X) as X, sum(Y) as Y
  from have
  group by Name, Grouping;
quit;

How to reshape a variable to wide in my dataset?

I am trying to reshape a variable to wide format but have not found the proper way to do so.
I have a day-wise count dataset for each SSUID, and I would like to reshape day to wide format so that the counts for each SSUID appear on a single row.
Dataset:
ssuid day count
1226 1 3
1226 2 7
1226 3 5
1226 4 7
1226 5 7
1226 6 6
1227 1 3
1227 2 6
1227 3 7
1227 4 4
1228 1 4
1228 2 4
1228 3 6
1228 4 7
1228 5 5
1229 1 3
1229 2 6
1229 3 6
1229 4 6
1229 5 5
I tried some code but I am getting the error:
count variable not constant within SSUID variable
My code:
reshape wide day, i(ssuid) j(count)
I would like to get the following result:
ssuid day1 day2 day3 day4 day5 day6
1226 3 7 5 7 7 6
1227 3 6 7 4 . .
1228 4 4 6 7 5 .
1229 3 6 6 6 5 .
The following works for me (note that day goes in j() and count is the variable being reshaped to wide):
clear
input ssuid day count
1226 1 3
1226 2 7
1226 3 5
1226 4 7
1226 5 7
1226 6 6
1227 1 3
1227 2 6
1227 3 7
1227 4 4
1228 1 4
1228 2 4
1228 3 6
1228 4 7
1228 5 5
1229 1 3
1229 2 6
1229 3 6
1229 4 6
1229 5 5
end
reshape wide count, i(ssuid) j(day)
rename count# day#
list
+-------------------------------------------------+
| ssuid day1 day2 day3 day4 day5 day6 |
|-------------------------------------------------|
1. | 1226 3 7 5 7 7 6 |
2. | 1227 3 6 7 4 . . |
3. | 1228 4 4 6 7 5 . |
4. | 1229 3 6 6 6 5 . |
+-------------------------------------------------+

Finding the max (latest) date in a column of dates, then grouping by employee

Importing the data frame:
import pandas as pd
df = pd.read_csv("C:\\Users")
Printing the list of employee usernames:
print (df['AssignedTo'])
Returns:
Out[4]:
0 vaughad
1 channln
2 stalasi
3 mitras
4 martil
5 erict
6 erict
7 channln
8 saia
9 channln
10 roedema
11 vaughad
Printing the dates:
Returns:
Out[6]:
0 2015-11-05
1 2016-05-27
2 2016-04-26
3 2016-02-18
4 2016-02-18
5 2015-11-02
6 2016-01-14
7 2015-12-15
8 2015-12-31
9 2015-10-16
10 2016-01-07
11 2015-11-20
Now I need to collect the latest date per employee.
I have tried:
MaxDate = max(df.FilledEnd)
But this just returns one date for all employees.
We see multiple employees in the data set with different dates. In a new column named "LatestDate", I need the latest date that corresponds to each employee: for "vaughad" the new column would contain "2015-11-20" for all of "vaughad"'s records, and in the same column, for username "channln", it would contain "2016-05-27" for all of "channln"'s records.
You need to group your data first, using DataFrame.groupby(), after which you can produce aggregate values, like the maximum date in the FilledEnd series:
df.groupby('AssignedTo')['FilledEnd'].max()
This produces a series, with AssignedTo as the index, and the latest date for each of those employees as the values:
>>> df.groupby('AssignedTo')['FilledEnd'].max()
AssignedTo
channln 2016-05-27
erict 2016-01-14
martil 2016-02-18
mitras 2016-02-18
roedema 2016-01-07
saia 2015-12-31
stalasi 2016-04-26
vaughad 2015-11-20
Name: FilledEnd, dtype: object
If you wanted to add those max dates values back to the dataframe, use groupby(...).transform() with numpy.max instead, so you get a series with the same indices:
import numpy as np
df['MaxDate'] = df.groupby('AssignedTo')['FilledEnd'].transform(np.max)
This adds in a MaxDate column:
AssignedTo FilledEnd MaxDate
0 vaughad 2015-11-05 2015-11-20
1 channln 2016-05-27 2016-05-27
2 stalasi 2016-04-26 2016-04-26
3 mitras 2016-02-18 2016-02-18
4 martil 2016-02-18 2016-02-18
5 erict 2015-11-02 2016-01-14
6 erict 2016-01-14 2016-01-14
7 channln 2015-12-15 2016-05-27
8 saia 2015-12-31 2015-12-31
9 channln 2015-10-16 2016-05-27
10 roedema 2016-01-07 2016-01-07
11 vaughad 2015-11-20 2015-11-20

Matching words from two files and extracting the matched one

I have the following data frame:
dataFrame <- data.frame(sent = c(1,1,2,2,3,3,3,4,5),
                        word = c("good printer", "wireless easy", "just right size",
                                 "size perfect weight", "worth price", "website great tablet",
                                 "pan nice tablet", "great price", "product easy install"),
                        val = c(1,2,3,4,5,6,7,8,9))
Data frame "dataFrame" looks like below:
sent word val
1 good printer 1
1 wireless easy 2
2 just right size 3
2 size perfect weight 4
3 worth price 5
3 website great tablet 6
3 pan nice tablet 7
4 great price 8
5 product easy install 9
And then I have words:
nouns <- c("printer", "wireless", "weight", "price", "tablet")
I need to extract only these words (nouns) from dataFrame and add the extracted words to a new column (e.g. extract) in dataFrame.
I would really appreciate any help or advice. Thanks a lot in advance.
Desired output:
sent word val extract
1 good printer 1 printer
1 wireless easy 2 wireless
2 just right size 3 size
2 size perfect weight 4 weight
3 worth price 5 price
3 website great tablet 6 tablet
3 pan nice tablet 7 tablet
4 great price 8 price
5 product easy install 9 remove this row (no match)
Here's a simple solution using the stringi package (size isn't in your nouns list btw).
library(stringi)
transform(dataFrame,
          extract = stri_extract_all(word,
                                     regex = paste(nouns, collapse = "|"),
                                     simplify = TRUE))
# sent word val extract
# 1 1 good printer 1 printer
# 2 1 wireless easy 2 wireless
# 3 2 just right size 3 <NA>
# 4 2 size perfect weight 4 weight
# 5 3 worth price 5 price
# 6 3 website great tablet 6 tablet
# 7 3 pan nice tablet 7 tablet
# 8 4 great price 8 price
# 9 5 product easy install 9 <NA>
This is another solution. It is a bit more involved, but it also deletes the rows that have no match between nouns and dataFrame$word.
require(stringr)
dataFrame <- data.frame("sent" = c(1,1,2,2,3,3,3,4,5),
"word" = c("good printer", "wireless easy", "just right size",
"size perfect weight", "worth price", "website great tablet",
"pan nice tablet", "great price", "product easy install"),
val = c(1,2,3,4,5,6,7,8,9))
nouns <- c("printer", "wireless", "weight", "price", "tablet")
test <- character()
df.del <- list()
for (i in 1:nrow(dataFrame)) {
  if (length(intersect(nouns, unlist(strsplit(as.character(dataFrame$word[i]), " ")))) == 0) {
    df.del <- rbind(df.del, i)
  } else {
    test <- rbind(test,
                  intersect(nouns, unlist(strsplit(as.character(dataFrame$word[i]), " "))))
  }
}
dataFrame <- dataFrame[-c(unlist(df.del)), ]
dataFrame <- cbind(dataFrame, test)
names(dataFrame)[4] <- "extract"
output:
sent word val extract
1 1 good printer 1 printer
2 1 wireless easy 2 wireless
4 2 size perfect weight 4 weight
5 3 worth price 5 price
6 3 website great tablet 6 tablet
7 3 pan nice tablet 7 tablet
8 4 great price 8 price
Here is another solution using nested for loops and an if statement.
word <- dataFrame$word
dat <- NULL
extract <- c(rep(c("remove"), each = length(word)))
n <- length(word)
m <- length(nouns)
for (i in 1:n) {
  g <- as.character(word[i])
  for (j in 1:m) {
    dat <- grepl(nouns[j], g)
    if (dat == TRUE) {extract[i] <- nouns[j]}
  }
}
dataFrame$extract <- extract
# sent word val extract
#1 1 good printer 1 printer
#2 1 wireless easy 2 wireless
#3 2 just right size 3 remove
#4 2 size perfect weight 4 weight
#5 3 worth price 5 price
#6 3 website great tablet 6 tablet
#7 3 pan nice tablet 7 tablet
#8 4 great price 8 price
#9 5 product easy install 9 remove