I am doing a text mining project that will analyze some speeches from the three remaining presidential candidates. I have completed POS tagging with OpenNLP and created a two-column data frame with the results, to which I have added a variable called pair. Here is a sample from the Clinton data frame:
V1 V2 pair
1 c( NN FALSE
2 "thank VBP FALSE
3 you PRP FALSE
4 so RB FALSE
5 much RB FALSE
6 . . FALSE
7 it PRP FALSE
8 is VBZ FALSE
9 wonderful JJ FALSE
10 to TO FALSE
11 be VB FALSE
12 here RB FALSE
13 and CC FALSE
14 see VB FALSE
15 so RB FALSE
16 many JJ FALSE
17 friends NNS FALSE
18 . . FALSE
19 ive JJ FALSE
20 spoken VBN FALSE
What I'm now trying to do is write a function that will iterate through the V2 POS column and evaluate it for specific pattern pairs. (These come from Turney's PMI article.) I'm not yet very knowledgeable when it comes to writing functions, so I'm certain I've done it wrong, but here is what I've got so far.
pairs <- function(x){
JJ <- "JJ" #adjectives
N <- "N[A-Z]" #any noun form
R <- "R[A-Z]" #any adverb form
V <- "V[A-Z]" #any verb form
for(i in 1:(length)(x) {
if(x == J && x+1 == N) { #i.e., if the first word = J and the next = N
pair[i] <- "JJ|NN" #insert this into the 'pair' variable
} else if (x == R && x+1 == J && x+2 != N) {
pair[i] <- "RB|JJ"
} else if (x == J && x+1 == J && x+2 != N) {
pair[i] <- "JJ|JJ"
} else if (x == N && x+1 == J && x+2 != N) {
pair[i] <- "NN|JJ"
} else if (x == R && x+1 == V) {
pair[i] <- "RB|VB"
} else {
pair[i] <- "FALSE"
}
}
}
# Run the function
cl.df.pairs <- pairs(cl.df$V2)
There are a number of (truly embarrassing) issues. First, when I try to run the function code, I get two Error: unexpected '}' in " }" errors at the end. I can't figure out why, because they match opening "{". I'm assuming it's because R is expecting something else to be there.
Also, and more importantly, this function won't exactly get me what I want, which is to extract the word pairs that match a pattern and then the pattern that they match. I honestly have no idea how to do that.
Then I need to figure out how to evaluate the semantic orientation of each word combo by comparing the phrases to the pos/neg lexical data sets that I have, but that's a whole other issue. I have the formula from the article, which I'm hoping will point me in the right direction.
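(For reference, the formula from the article is SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor"), i.e., the mutual information of the phrase with a positive reference word minus its mutual information with a negative one.)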
I have looked all over and can't find a comparable function in any of the NLP packages, such as OpenNLP, RTextTools, etc. I HAVE looked at other SO questions/answers, like this one and this one, but they haven't worked for me when I've tried to adapt them. I'm fairly certain I'm missing something obvious here, so would appreciate any advice.
EDIT:
Here is the first 20 lines of the Sanders data frame.
head(sa.POS.df, 20)
V1 V2
1 the DT
2 american JJ
3 people NNS
4 are VBP
5 catching VBG
6 on RB
7 . .
8 they PRP
9 understand VBP
10 that IN
11 something NN
12 is VBZ
13 profoundly RB
14 wrong JJ
15 when WRB
16 , ,
17 in IN
18 our PRP$
19 country NN
20 today NN
And I've written the following function:
pairs <- function(x, y) {
require(gsubfn)
J <- "JJ" #adjectives
N <- "N[A-Z]" #any noun form
R <- "R[A-Z]" #any adverb form
V <- "V[A-Z]" #any verb form
for(i in 1:(length(x))) {
ngram <- c(x[[i]], x[[i+1]])
# the ngram consists of the word on line `i` and the word below line `i`
}
strapply(y[i], "(J)\n(N)", FUN = paste(ngram, sep = " "), simplify = TRUE)
ngrams.df = data.frame(ngrams=ngram)
return(ngrams.df)
}
So, what is SUPPOSED to happen is that when strapply matches the pattern (in this case, an adjective followed by a noun), it should paste the ngram. And all of the resulting ngrams should populate ngrams.df.
So I've entered the following function call and get an error:
> sa.JN <- pairs(x=sa.POS.df$V1, y=sa.POS.df$V2)
Error in x[[i + 1]] : subscript out of bounds
I'm only just learning the intricacies of regular expressions, so I'm not quite sure how to get my function to pull the actual adjective and noun. Based on the data shown here, it should pull "american" and "people" and paste them into the data frame.
Okay, here we go. Using this data (shared nicely with dput()):
df = structure(list(V1 = structure(c(15L, 3L, 11L, 4L, 5L, 9L, 2L,
16L, 18L, 14L, 13L, 8L, 12L, 20L, 19L, 1L, 7L, 10L, 6L, 17L), .Label = c(",",
".", "american", "are", "catching", "country", "in", "is", "on",
"our", "people", "profoundly", "something", "that", "the", "they",
"today", "understand", "when", "wrong"), class = "factor"), V2 = structure(c(3L,
5L, 7L, 12L, 11L, 10L, 2L, 8L, 12L, 4L, 6L, 13L, 10L, 5L, 14L,
1L, 4L, 9L, 6L, 6L), .Label = c(",", ".", "DT", "IN", "JJ", "NN",
"NNS", "PRP", "PRP$", "RB", "VBG", "VBP", "VBZ", "WRB"), class = "factor")), .Names = c("V1",
"V2"), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
"16", "17", "18", "19", "20"))
I'll use the stringr package because of its consistent syntax, so I don't have to look up the argument order for grep. We'll first detect the adjectives, then the nouns, and figure out where they line up (offsetting by 1). Then we paste together the words that correspond to the matches.
library(stringr)
adj = str_detect(df$V2, "JJ")
noun = str_detect(df$V2, "NN")
pairs = which(c(FALSE, adj) & c(noun, FALSE))
ngram = paste(df$V1[pairs - 1], df$V1[pairs])
# [1] "american people"
Now we can put it in a function. I left the patterns as arguments (with adjective, noun as the defaults) for flexibility.
bigram = function(word, type, patt1 = "JJ", patt2 = "N[A-Z]") {
pairs = which(c(FALSE, str_detect(type, pattern = patt1)) &
c(str_detect(type, patt2), FALSE))
return(paste(word[pairs - 1], word[pairs]))
}
Demonstrating use on the original data
with(df, bigram(word = V1, type = V2))
# [1] "american people"
Let's cook up some data with more than one match to make sure it works:
df2 = data.frame(w = c("american", "people", "hate", "a", "big", "bad", "bank"),
t = c("JJ", "NNS", "VBP", "DT", "JJ", "JJ", "NN"))
df2
# w t
# 1 american JJ
# 2 people NNS
# 3 hate VBP
# 4 a DT
# 5 big JJ
# 6 bad JJ
# 7 bank NN
with(df2, bigram(word = w, type = t))
# [1] "american people" "bad bank"
And back to the original to test out a different pattern:
with(df, bigram(word = V1, type = V2, patt1 = "N[A-Z]", patt2 = "V[A-Z]"))
# [1] "people are" "something is"
I think the following is the code you were trying to write, but without throwing errors. Note that "N[A-Z]" etc. are regular expressions, so they have to be tested with grepl() rather than ==, and I anchored them with ^ so that, for example, PRP doesn't match "R[A-Z]":
pairs <- function(x) {
  J <- "^JJ"     #adjectives
  N <- "^N[A-Z]" #any noun form
  R <- "^R[A-Z]" #any adverb form
  V <- "^V[A-Z]" #any verb form
  pair <- rep("FALSE", nrow(x))
  for (i in 1:(nrow(x) - 2)) {
    this.pos      <- x[i, 2]
    next.pos      <- x[i + 1, 2]
    next.next.pos <- x[i + 2, 2]
    if (grepl(J, this.pos) && grepl(N, next.pos)) { #i.e., if the first word = J and the next = N
      pair[i] <- "JJ|NN" #insert this into the 'pair' variable
    } else if (grepl(R, this.pos) && grepl(J, next.pos) && !grepl(N, next.next.pos)) {
      pair[i] <- "RB|JJ"
    } else if (grepl(J, this.pos) && grepl(J, next.pos) && !grepl(N, next.next.pos)) {
      pair[i] <- "JJ|JJ"
    } else if (grepl(N, this.pos) && grepl(J, next.pos) && !grepl(N, next.next.pos)) {
      pair[i] <- "NN|JJ"
    } else if (grepl(R, this.pos) && grepl(V, next.pos)) {
      pair[i] <- "RB|VB"
    } else {
      pair[i] <- "FALSE"
    }
  }
  ## the last two rows stay "FALSE", since there is nothing after them to check
  return(pair)
}
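A quick sanity check on the dput() data frame df from above (a sketch; with your own data you'd pass the whole two-column data frame, e.g. pairs(cl.df)):
df$pair <- pairs(df)
df[df$pair != "FALSE", ]
#            V1 V2  pair
# 2    american JJ JJ|NN
# 13 profoundly RB RB|JJ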
not sure what you mean by this, though:
Also, and more importantly, this function won't exactly get me what I
want, which is to extract the word pairs that match a pattern and then
the pattern that they match. I honestly have no idea how to do that.
Related
Below I wrote a function which searches for specific regular expressions within a vector. The function always searches for the regular expressions "Beer" or "Wine" within the vector. Now I would like to pass the regular expressions I am searching for (in my case "Beer" and "Wine") to the function as additional arguments. How can I do this?
x <- c("Beer","Wine","wine","Beer","Beef","Potato","Vacation")
Thirsty <- function(x) {
Beer <- grepl("Beer",x, ignore.case = TRUE)
Beer <- as.numeric(Beer == "TRUE")
Wine <- grepl("Wine",x, ignore.case = TRUE)
Wine <- as.numeric(Wine == "TRUE")
Drink <- Beer + Wine
Drink <- as.numeric(Drink == "0")
Drink <- abs(Drink -1)
}
y <- Thirsty(x)
y
This can be done with the following code:
x <- c("Beer","Wine","wine","Beer","Beef","Potato","Vacation")
drinks <- c("Beer","Wine")
Thirsty <- function(x, drinks) {
Reduce("|",lapply(drinks, function(p)grepl(p,x, ignore.case = TRUE)))
}
y <- Thirsty(x,drinks)
y
lapply loops over the possibilities in drinks and produces a list of logical vectors, one for each drink. These are combined into a single vector by Reduce.
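For example, the combining step works like this on two logical vectors:
Reduce("|", list(c(TRUE, FALSE, FALSE), c(FALSE, TRUE, FALSE)))
# [1]  TRUE  TRUE FALSE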
I would simply try to concatenate the match patterns with |
strings = c("Beer","Wine","wine","Beer","Beef","Potato","Vacation")
thirstStrings = c("beer", "wine")
matchPattern = paste0(thirstStrings, collapse = "|") #"beer|wine"
grep(pattern = matchPattern, x = strings, ignore.case = T)
# [1] 1 2 3 4
You can easily wrap that in a function
Thirsty = function(x, matchStrings){
matchPattern = paste0(matchStrings, collapse = "|") #"beer|wine"
grep(pattern = matchPattern, x = x, ignore.case = T)
}
Thirsty(strings, thirstStrings) # [1] 1 2 3 4
This should also work
Thirsty = function(vec, ...) {
pattern = paste0(unlist(list(...)), collapse = "|")
stringr::str_detect(tolower(vec), pattern)
}
> Thirsty (x, "beer", "wine")
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE
I am looking for a shorter and prettier solution (possibly in the tidyverse) to the following problem. I have a data.frame "data":
id string
1 A 1.001 xxx 123.123
2 B 23,45 lorem ipsum
3 C donald trump
4 D ssss 134, 1,45
What I want to do is extract all the numbers (no matter whether the delimiter is "." or "," -> in this case I assume that the string "134, 1,45" can be extracted into two numbers: 134 and 1.45) and create a data.frame "output" looking similar to this:
id string
1 A 1.001
2 A 123.123
3 B 23.45
4 C <NA>
5 D 134
6 D 1.45
I managed to do this (code below), but the solution seems pretty ugly to me and not very efficient (two for-loops). Could someone suggest a better way to do this (preferably using dplyr)?
# data
data <- data.frame(id = c("A", "B", "C", "D"),
string = c("1.001 xxx 123.123",
"23,45 lorem ipsum",
"donald trump",
"ssss 134, 1,45"),
stringsAsFactors = FALSE)
# creating empty data.frame
len <- length(unlist(sapply(data$string, function(x) gregexpr("[0-9]+[,|.]?[0-9]*", x))))
output <- data.frame(id = rep(NA, len), string = rep(NA, len))
# main solution
start = 0
for(i in 1:dim(data)[1]){
tmp_len <- length(unlist(gregexpr("[0-9]+[,|.]?[0-9]*", data$string[i])))
for(j in (start+1):(start+tmp_len)){
output[j,1] <- data$id[i]
output[j,2] <- regmatches(data$string[i], gregexpr("[0-9]+[,|.]?[0-9]*", data$string[i]))[[1]][j-start]
}
start = start + tmp_len
}
# further modifications
output$string <- gsub(",", ".", output$string)
output$string <- as.numeric(ifelse(substring(output$string, nchar(output$string), nchar(output$string)) == ".",
substring(output$string, 1, nchar(output$string) - 1),
output$string))
output
1) Base R This uses relatively simple regular expressions and no packages.
In the first 2 lines of code replace any comma followed by a space with a
space and then replace all remaining commas with a dot. After these two lines s will be: c("1.001 xxx 123.123", "23.45 lorem ipsum", "donald trump", "ssss 134 1.45")
In the next 4 lines of code trim whitespace from beginning and end of each string field and split the string field on whitespace producing a
list. grep out those elements consisting only of digits and dots. (The regular expression ^[0-9.]*$ matches the start of a word followed by zero or more digits or dots followed by the end of the word, so only words made up of just those characters are matched.) Replace any zero-length components with NA. Finally add data$id as the names. After these 4 lines are run, the list L will be list(A = c("1.001", "123.123"), B = "23.45", C = NA, D = c("134", "1.45")).
In the last line of code convert the list L to a data frame with the appropriate names.
s <- gsub(", ", " ", data$string)
s <- gsub(",", ".", s)
L <- strsplit(trimws(s), "\\s+")
L <- lapply(L, grep, pattern = "^[0-9.]*$", value = TRUE)
L <- ifelse(lengths(L), L, NA)
names(L) <- data$id
with(stack(L), data.frame(id = ind, string = values))
giving:
id string
1 A 1.001
2 A 123.123
3 B 23.45
4 C <NA>
5 D 134
6 D 1.45
2) magrittr This variation of (1) writes it as a magrittr pipeline.
library(magrittr)
data %>%
transform(string = gsub(", ", " ", string)) %>%
transform(string = gsub(",", ".", string)) %>%
transform(string = trimws(string)) %>%
with(setNames(strsplit(string, "\\s+"), id)) %>%
lapply(grep, pattern = "^[0-9.]*$", value = TRUE) %>%
replace(lengths(.) == 0, NA) %>%
stack() %>%
with(data.frame(id = ind, string = values))
3) dplyr/tidyr This is an alternate pipeline solution using dplyr and tidyr. unnest converts to long form; id is made a factor so that we can later use complete to recover ids that are removed by the subsequent filtering; filter removes the junk rows; and complete inserts NA rows for each id that would otherwise not appear.
library(dplyr)
library(tidyr)
data %>%
mutate(string = gsub(", ", " ", string)) %>%
mutate(string = gsub(",", ".", string)) %>%
mutate(string = trimws(string)) %>%
mutate(string = strsplit(string, "\\s+")) %>%
unnest() %>%
mutate(id = factor(id)) %>%
filter(grepl("^[0-9.]*$", string)) %>%
complete(id)
4) data.table
library(data.table)
DT <- as.data.table(data)
DT[, string := gsub(", ", " ", string)][,
string := gsub(",", ".", string)][,
string := trimws(string)][,
string := setNames(strsplit(string, "\\s+"), id)][,
list(string = list(grep("^[0-9.]*$", unlist(string), value = TRUE))), by = id][,
list(string = if (length(unlist(string))) unlist(string) else NA_character_), by = id]
DT
Update Removed assumption that junk words do not have digit or dot. Also added (2), (3) and (4) and some improvements.
We can replace the , in between the numbers with . (using gsub), extract the numbers with str_extract_all (from stringr) into a list, replace the list elements that have length 0 with NA, set the names of the list with the 'id' column, use stack to convert the list to a data.frame, and rename the columns.
library(stringr)
setNames(stack(setNames(lapply(str_extract_all(gsub("(?<=[0-9]),(?=[0-9])", ".",
data$string, perl = TRUE), "[0-9.]+"), function(x)
if(length(x)==0) NA else as.numeric(x)), data$id))[2:1], c("id", "string"))
# id string
#1 A 1.001
#2 A 123.123
#3 B 23.45
#4 C NA
#5 D 134
#6 D 1.45
Same idea as Gabor's. I had hoped to use R's built-in parsing of strings (type.convert, used in read.table) rather than writing custom regex substitutions:
sp = setNames(strsplit(data$string, " "), data$id)
spc = lapply(sp, function(x) {
x = x[grep("[^0-9.,]$", x, invert=TRUE)]
if (!length(x))
NA_real_
else
mapply(type.convert, x, dec=gsub("[^.,]", "", x), USE.NAMES=FALSE)
})
setNames(rev(stack(spc)), names(data))
id string
1 A 1.001
2 A 123.123
3 B 23.45
4 C <NA>
5 D 134
6 D 1.45
Unfortunately, type.convert is not robust enough to consider both decimal delimiters at once, so we need this mapply malarkey instead of type.convert(x, dec = "[.,]").
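It is fine with a single known delimiter per element, though, which is what the mapply call exploits:
type.convert("23,45", dec = ",")
# [1] 23.45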
I have a data frame with 107 columns and 745000 rows (much bigger than in my example).
The thing is that I have character columns which I want to split, because they seem to contain a type-ish ending at the end of each sequence.
I want to separate these type-ending parts into new columns.
I have made my own solution, but it seems to be far too slow, iterating through all the 745000 rows 53 times.
So I embed my solution in the following code, with some arbitrary data:
set.seed(1)
code_1 <- paste0(round(runif(5000, 100000, 999999)), "_", round(runif(1000, 1, 15)))
code_2 <- sample(c(paste0(round(runif(10, 100000, 999999)), "_", round(runif(10, 1, 15))), NA), 5000, replace = TRUE)
code_3 <- sample(c(paste0(round(runif(3, 100000, 999999)), "_", round(runif(3, 1, 15))), NA), 5000, replace = TRUE)
code_4 <- sample(c(paste0(round(runif(1, 100000, 999999)), "_", round(runif(1, 1, 15))), NA), 5000, replace = TRUE)
code_type_1 <- rep(NA, 5000)
code_type_2 <- rep(NA, 5000)
code_type_3 <- rep(NA, 5000)
code_type_4 <- rep(NA, 5000)
df <- data.frame(cbind(code_1,
code_2,
code_3,
code_4,
code_type_1,
code_type_2,
code_type_3,
code_type_4),
stringsAsFactors = FALSE)
df_new <- data.frame(code_1 = character(),
code_2 = character(),
code_3 = character(),
code_4 = character(),
code_type_1 = character(),
code_type_2 = character(),
code_type_3 = character(),
code_type_4 = character(),
stringsAsFactors = FALSE)
for (i in 1:4) {
i_t <- i + 4
temp <- strsplit(df[, c(i)], "[_]")
for (j in 1:nrow(df)) {
df_new[c(j), c(i)] <- unlist(temp[j])[1]
df_new[c(j), c(i_t)] <- ifelse(is.na(unlist(temp[j])[1]), NA, unlist(temp[j])[2])
}
print(i)
}
for (i in 1:8) {
df_new[, c(i)] <- factor(df_new[, c(i)])
}
Do anyone have some ideas how to speed things up here?
First we pre-allocate the results data.frame to the desired final length. This is very important; see The R Inferno, Circle 2. Then we vectorize the inner loop. We also use fixed = TRUE and avoid the regex in strsplit.
system.time({
df_new1 <- data.frame(code_1 = character(nrow(df)),
code_2 = character(nrow(df)),
code_3 = character(nrow(df)),
code_4 = character(nrow(df)),
code_type_1 = character(nrow(df)),
code_type_2 = character(nrow(df)),
code_type_3 = character(nrow(df)),
code_type_4 = character(nrow(df)),
stringsAsFactors = FALSE)
for (i in 1:4) {
i_t <- i + 4
temp <- do.call(rbind, strsplit(df[, c(i)], "_", fixed = TRUE))
df_new1[, i] <- temp[,1]
df_new1[, i_t] <- ifelse(is.na(temp[,1]), NA, temp[,2])
}
df_new1[] <- lapply(df_new1, factor)
})
# user system elapsed
# 0.029 0.000 0.029
all.equal(df_new, df_new1)
#[1] TRUE
Of course, there are ways to make this even faster, but this is close to your original approach and should be sufficient.
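For example (a sketch along the same lines), the strsplit()/rbind step can be replaced with two vectorized sub() calls per column:
for (i in 1:4) {
  df_new1[, i]     <- sub("_.*", "", df[, i])           # part before the "_"
  df_new1[, i + 4] <- ifelse(is.na(df[, i]), NA,
                             sub(".*_", "", df[, i]))   # part after the "_"
}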
Here's another way, using gsub inside a custom function in combination with purrr::dmap() - which is equivalent to lapply, but outputs a data.frame instead of a list.
library(purrr)
# Define function which gets rid of everything after and including "_"
replace01 <- function(df, ptrn = "_.*")
dmap(df[,1:4], gsub, pattern = ptrn, replacement = "")
# Because "pattern" is argument we can change it to get 2nd part, then cbind()
test <- cbind(replace01(df),
replace01(df, ptrn = ".*_"))
Note that the output here is character columns; you can always convert them to factor if you like.
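For example, to convert them all at once:
test[] <- lapply(test, factor)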
Another possibility, using stringi:
library(stringi)
setNames(do.call(rbind.data.frame, lapply(1:nrow(df), function(i) {
x <- stri_split_fixed(df[i, 1:4], "_", 2, simplify=TRUE)
y <- c(x[,1], x[,2])
y[y==""] <- NA
y
})), colnames(df)) -> df_new
or
setNames(do.call(rbind.data.frame, lapply(1:nrow(df), function(i) {
x <- stri_split_fixed(df[i, 1:4], "_", 2, simplify=TRUE)
c(x[,1], x[,2])
})), colnames(df)) -> df_new
df_new[df_new==""] <- NA
df_new
which is marginally faster:
Unit: milliseconds
expr min lq mean median uq max neval cld
na_after 669.8357 718.1301 724.8803 723.5521 732.9998 790.1405 10 a
na_inner 719.3362 738.1569 766.4267 762.1594 791.6198 825.0269 10 b
I have a data frame with 2 columns, GL and GLDESC, and want to add a 3rd column called KIND based on some data that is inside column GLDESC.
The dataframe is as follows:
GL GLDESC
1 515100 Payroll-Indir Salary Labor
2 515900 Payroll-Indir Compensated Absences
3 532300 Bulk Gas
4 539991 Area Charge In
5 551000 Repairs & Maint-Spare Parts
6 551100 Supplies-Operating
7 551300 Consumables
For each row of the data table:
If GLDESC contains the word Payroll anywhere in the string then I want KIND to be Payroll
If GLDESC contains the word Gas anywhere in the string then I want KIND to be Materials
In all other cases I want KIND to be Other
I looked for similar examples on Stack Overflow but could not find any. I also looked in R for Dummies at switch, grep, apply, and regular expressions to try and match only part of the GLDESC column and then fill the KIND column with the kind of account, but I was unable to make it work.
Since you have only two conditions, you can use a nested ifelse:
#random data; it wasn't easy to copy-paste yours
DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10),
c("gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12",
"asdfg", "GAS--2", "fghfgh", "qweee"), sample(letters, 10), sep = " "))
DF$KIND <- ifelse(grepl("gas", DF$GLDESC, ignore.case = T), "Materials",
ifelse(grepl("payroll", DF$GLDESC, ignore.case = T), "Payroll", "Other"))
DF
# GL GLDESC KIND
#1 8 e gas l Materials
#2 1 c payroll12 y Payroll
#3 10 m GaSer v Materials
#4 6 t asdf n Other
#5 2 w qweaa t Other
#6 4 r PayROll-12 q Payroll
#7 9 n asdfg a Other
#8 5 d GAS--2 w Materials
#9 7 s fghfgh e Other
#10 3 g qweee k Other
EDIT 10/3/2016 (after receiving more attention than expected)
A possible solution to deal with more patterns could be to iterate over all patterns and, whenever there is a match, progressively reduce the amount of comparisons:
ff = function(x, patterns, replacements = patterns, fill = NA, ...)
{
stopifnot(length(patterns) == length(replacements))
ans = rep_len(as.character(fill), length(x))
empty = seq_along(x)
for(i in seq_along(patterns)) {
greps = grepl(patterns[[i]], x[empty], ...)
ans[empty[greps]] = replacements[[i]]
empty = empty[!greps]
}
return(ans)
}
ff(DF$GLDESC, c("gas", "payroll"), c("Materials", "Payroll"), "Other", ignore.case = TRUE)
# [1] "Materials" "Payroll" "Materials" "Other" "Other" "Payroll" "Other" "Materials" "Other" "Other"
ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"),
c("pat1a|pat1b", "pat2", "pat3"),
c("1", "2", "3"), fill = "empty")
#[1] "1" "1" "3" "empty"
ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"),
c("pat2", "pat1a|pat1b", "pat3"),
c("2", "1", "3"), fill = "empty")
#[1] "2" "1" "3" "empty"
I personally like matching by index. You can loop grep over your new labels, in order to get the indices of your partial matches, then use this with a lookup table to simply reassign the values.
If you wanna create new labels, use a named vector.
DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10),
c(
"gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12",
"asdfg", "GAS--2", "fghfgh", "qweee"
), sample(letters, 10),
sep = " "
))
lu <- stack(sapply(c(Material = "gas", Payroll = "payroll"), grep, x = DF$GLDESC, ignore.case = TRUE))
DF$KIND <- DF$GLDESC
DF$KIND[lu$values] <- as.character(lu$ind)
DF$KIND[-lu$values] <- "Other"
DF
#> GL GLDESC KIND
#> 1 6 x gas f Material
#> 2 3 t payroll12 q Payroll
#> 3 5 a GaSer h Material
#> 4 4 s asdf x Other
#> 5 1 m qweaa y Other
#> 6 10 y PayROll-12 r Payroll
#> 7 7 g asdfg a Other
#> 8 2 k GAS--2 i Material
#> 9 9 e fghfgh j Other
#> 10 8 l qweee p Other
Created on 2021-11-13 by the reprex package (v2.0.1)
I solved my problem in an imperative style, but it looks very ugly. How can I make it better (more elegant, more concise, more functional; it is Scala, after all)? Rows with the same value as the previous row but with a different letter should be skipped; the values of all other rows should be added.
val row1 = new Row(20, "A", true) // add value
val row2 = new Row(30, "A", true) // add value
val row3 = new Row(40, "A", true) // add value
val row4 = new Row(40, "B", true) // same value as the previous element & different letter -> skip row
val row5 = new Row(60, "B", true) // add value
val row6 = new Row(70, "B", true) // add value
val row7 = new Row(70, "B", true) // same value as the previous element, but the same letter -> add value
val rows = List(row1, row2, row3, row4, row5, row6, row7)
var previousLetter = " "
var previousValue = 0.00
var countSkip = 0
for (row <- rows) {
if (row.value == previousValue && row.letter != previousLetter) {
row.relevant = false
countSkip += 1
}
previousLetter = row.letter
previousValue = row.value
}
// get sum
val sumValue = rows.filter(_.relevant == true).map(_.value) reduceLeftOption(_ + _)
val sum = sumValue match {
case Some(d) => d
case None => 0.00
}
assert(sum == 290)
assert(countSkip == 1)
Thanks in advance
Twistleton
(rows.head :: rows).sliding(2).collect{
case List(Row(v1,c1), Row(v2,c2)) if ! (v1 == v2 && c1 != c2) => v2 }.sum
I think the shortest (bulletproof) solution when Row is a case class (dropping the boolean) is
(for ((Row(v1,c1), Row(v2,c2)) <- (rows zip rows.take(1) ::: rows) if (v1 != v2 || c1 == c2)) yield v1).sum
Some of the other solutions don't handle the list-is-empty case, but this is largely because sliding has a bug where it will return a partial list if the list is too short. Clearer to me (and also bulletproof) is:
(rows zip rows.take(1) ::: rows).collect{
case (Row(v1,c1), Row(v2,c2)) if (v1 != v2 || c1 == c2) => v1
}.sum
(which is only two characters longer if you keep it on one line). If you need the number skipped also,
val indicated = (rows zip rows.take(1) ::: rows).collect {
case (Row(v1,c1), Row(v2,c2)) => (v1, v1 != v2 || c1 == c2)
}
val countSkip = indicated.filterNot(_._2).length
val sum = indicated.filter(_._2).map(_._1).sum
Fold it:
scala> rows.foldLeft((row1, 0))((p:(Row,Int), r:Row) => (r, p._2 + (if (p._1.value == r.value && p._1.letter != r.letter) 0 else r.value)))._2
res2: Int = 290
(new Row(0, " ", true) +: rows).sliding(2).map { case List(r1, r2) =>
if (r1.value != r2.value || r1.letter == r2.letter) { r2.value }
else { 0 }
}.sum
Of course you can drop the boolean member of Row if you do not need it for something else
Reduce it:
rows.reduceLeft { (prev, curr) =>
if (prev.value == curr.value && prev.letter != curr.letter) {
curr.relevant = false
countSkip += 1
}
curr
}