Pattern matching in dataset - regex

I have been struggling with this for a while.
I have a dataset with two columns: a Description column and a Pattern column that I am trying to match against the Description column. If the corresponding pattern exists in the Description, it needs to be replaced by an asterisk.
For instance, if the Description is ABCDEisthedescription and the Pattern is ABCDE, then the new description should be *isthedescription.
I tried the following
data$NewDescription <- gsub(data$pattern, "\\*", data$Description)
Since there is more than one row in the dataset, it throws an error (a warning, rather):
"argument 'pattern' has length > 1 and only the first element will be used"
Any help will be hugely appreciated.

You can use mapply here to apply the function to each row.
# sample data
data <- data.frame(
  pattern = c("ABCDE", "XYZ"),
  Description = c("ABCDEisthedescription", "sillyXYZvalue")
)
Now use mapply (note that with fixed = TRUE the replacement is taken literally, so use "*" rather than "\\*"):
mapply(function(p, d) gsub(p, "*", d, fixed = TRUE),
       data$pattern, data$Description, USE.NAMES = FALSE)
# [1] "*isthedescription" "silly*value"

Additionally, here is a row-wise sapply approach on some larger sample data:
Patterns <- paste0(
  sample(LETTERS[1:4], 500, replace = TRUE),
  sample(LETTERS[1:4], 500, replace = TRUE),
  sample(LETTERS[1:4], 500, replace = TRUE),
  sample(LETTERS[1:4], 500, replace = TRUE))
##
Desc <- paste0(Patterns, "isthedescription")
Ptrn <- sample(Patterns, 500)
##
Data <- data.frame(
  Description = Desc,
  Pattern = Ptrn,
  stringsAsFactors = FALSE)
##
newDesc <- sapply(1:nrow(Data), function(X) {
  if (substr(Data$Description[X], 1, 4) == Data$Pattern[X]) {
    gsub(Data$Pattern[X], "*", Data$Description[X])
  } else {
    Data$Description[X]
  }
})
#MrFlick's approach seems more concise though.


Shiny Dashboard with plot

I am learning some Shiny in order to build a dashboard. The idea is to select a variable from a selectInput, group by that variable, and plot a barplot or histogram of the total for that variable.
I have generated a sample dataset to illustrate what I need, but I can't get the result I want.
The UI code is the following:
library(shiny)
shinyUI(fluidPage(
  titlePanel("Demo dashboard"),
  sidebarLayout(
    sidebarPanel(
      selectInput("variable",
                  "group by",
                  choices = c("City", "Country")
      )
    ),
    mainPanel(
      plotOutput("distPlot")
    )
  )
))
The server code is the following; here I aggregate by the input variable and plot the total:
library(shiny)
library(dplyr)
shinyServer(function(input, output) {
  output$distPlot <- renderPlot({
    sample <- tbl_df(data.frame(c("City1","City2","City3","City1","City2","City3","City2","City3"),
                                c("A","B","C","D","D","A","A","B"),
                                c(12,14,15,12,12,14,8,10)))
    colnames(sample) <- c("City","Country","Amount")
    df1 <- sample %>% group_by(input$variable) %>%
      summarise(total = sum(Amount))
    sample %>% group_by(input$variable) %>% summarise(total = sum(Amount))
    x <- df1$total
    hist(x)
  })
})
The result I get, however, is not the expected one: I can't get the required histogram.
The problem is your usage of dplyr:
Your original code doesn't evaluate input$variable to group by City; rather, it tries to group by a non-existent column literally called `input$variable`:
sample %>%
  group_by(input$variable) %>%
  summarise(total = sum(Amount))
Result:
# # A tibble: 1 x 2
# `input$variable` total
# <chr> <dbl>
# 1 City 97
You can check this yourself easily by adding either a print statement after the statement (e.g.: print(df1)) or adding a browser() before the statement.
This behaviour is because dplyr uses non-standard-evaluation by default. You can read up more about that here.
To use standard (programmable) evaluation you need to unquote input$variable so that the value is passed to dplyr. In the current version you can do that using a combination of !! and sym.
Example:
sample %>%
  group_by(!!sym(input$variable)) %>%
  summarise(total = sum(Amount))
Result:
# # A tibble: 3 x 2
# City total
# <fct> <dbl>
# 1 City1 24
# 2 City2 34
# 3 City3 39
Histogram: [plot of the grouped totals]
Edit: Some more explanation: group_by doesn't evaluate its input; rather, it quotes it. That's why you're getting `input$variable` as a column name.
The sym function, on the other hand, turns the actual value of input$variable into a symbol, and !! can then be used to remove the quoting.
What works in dplyr is an unquoted input, i.e. group_by(City).
Let's see what happens step by step:
input$variable contains the string "City". group_by("City") would still not work, because the value is quoted!
That's why we need !! together with sym: sym(input$variable) turns the string into the symbol City, and !! then injects it without quotes. So the expression evaluates to group_by(City), and thus works as expected.
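Putting it together, here is a minimal sketch of a corrected server function (an illustration, not the only option: it draws a barplot of the grouped totals so that each group gets one labelled bar):
library(shiny)
library(dplyr)

shinyServer(function(input, output) {
  output$distPlot <- renderPlot({
    sample <- data.frame(
      City    = c("City1", "City2", "City3", "City1", "City2", "City3", "City2", "City3"),
      Country = c("A", "B", "C", "D", "D", "A", "A", "B"),
      Amount  = c(12, 14, 15, 12, 12, 14, 8, 10),
      stringsAsFactors = FALSE
    )
    # group by the column named in input$variable and total the Amount
    df1 <- sample %>%
      group_by(!!sym(input$variable)) %>%
      summarise(total = sum(Amount))
    # one bar per group, labelled with the values of the grouping variable
    barplot(df1$total, names.arg = df1[[input$variable]],
            main = paste("Total Amount by", input$variable))
  })
})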

How to match a specific string using regular expressions in R

I am trying to extract some financial data using regular expressions in R.
I have used a RegEx tester, http://regexr.com/, to make a regular expression that SHOULD capture the information I need - the problem is just that it doesn't...
I have extracted data from this URL: http://finance.yahoo.com/q/cp?s=%5EOMXC20+Components
I want to match the company names (DANSKE.CO, DSV.CO, etc.) and I have created the following regular expression, which matches them on regexr.com:
.q\?s=(\S*\\)
But it doesn't work in R. Can someone help me figure out how to go about this?
Instead of messing around with regular expressions I would use XPath for something like fetching HTML content:
library("XML")
f <- tempfile()
download.file("https://finance.yahoo.com/q/cp?s=^OMXC20+Components", f)
doc <- htmlParse(f)
xpathSApply(doc, "//b/a", xmlValue)
# [1] "CARL-B.CO" "CHR.CO" "COLO-B.CO" "DANSKE.CO" "DSV.CO"
# [6] "FLS.CO" "GEN.CO" "GN.CO" "ISS.CO" "JYSK.CO"
# [11] "MAERSK-A.CO" "MAERSK-B.CO" "NDA-DKK.CO" "NOVO-B.CO" "NZYM-B.CO"
# [16] "PNDORA.CO" "TDC.CO" "TRYG.CO" "VWS.CO" "WDH.CO"
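If you prefer the tidyverse stack, a rough equivalent with rvest (a sketch that assumes the page still wraps the tickers in <b><a> elements, as the XPath above does):
library(rvest)
page <- read_html("https://finance.yahoo.com/q/cp?s=^OMXC20+Components")
page %>% html_nodes("b a") %>% html_text()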
Does this help? If not, post back, and I'll provide another suggestion.
library(XML)
stocks <- c("AXP", "BA", "CAT", "CSCO")
for (s in stocks) {
  url <- paste0("http://finviz.com/quote.ashx?t=", s)
  webpage <- readLines(url)
  html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
  tableNodes <- getNodeSet(html, "//table")
  # ASSIGN TO STOCK NAMED DFS
  assign(s, readHTMLTable(tableNodes[[9]],
                          header = c("data1", "data2", "data3", "data4", "data5", "data6",
                                     "data7", "data8", "data9", "data10", "data11", "data12")))
  # ADD COLUMN TO IDENTIFY STOCK
  df <- get(s)
  df['stock'] <- s
  assign(s, df)
}
# COMBINE ALL STOCK DATA
stockdatalist <- cbind(mget(stocks))
stockdata <- do.call(rbind, stockdatalist)
# MOVE STOCK ID TO FIRST COLUMN
stockdata <- stockdata[, c(ncol(stockdata), 1:(ncol(stockdata) - 1))]
# SAVE TO CSV
write.table(stockdata, "C:/Users/rshuell001/Desktop/MyData.csv", sep = ",",
            row.names = FALSE, col.names = FALSE)
# REMOVE TEMP OBJECTS
rm(df, stockdatalist)

output computation in R using shiny [duplicate]

This question already has answers here:
How to calculate the number of occurrence of a given character in each row of a column of strings?
(14 answers)
Closed 7 years ago.
I am trying to find the pattern "GC" in different genes (strings) with a user interface using Shiny. I am using the grep command of R to find the pattern, but I am not able to get the correct output. Below is the code of UI.R:
library(shiny)
setwd("C:/Users/ishaan/Documents/aaa")
shinyUI(fluidPage(
  # Copy the line below to make a select box
  selectInput("select", label = h3("Select Human Gene Sequence"),
              choices = list("CD83" = "UGGGUGAUUACAUAAUCUGACAAAUAAAAAAAUCCCGACUUUGGGAUGAGUGCUAGGAUGUUGUAAA",
                             "SEC23A" = "UUUCACUGU",
                             "ANKFY1" = "AAGUUUGACUAUAUGUGUAAAGGGACUAAAUAUUUUUGCAACAGCC",
                             "ENST00000250457" = "ACUUGUUGAAUAAACUCAGUCUCC"
              ),
              selected = "UGGGUGAUUACAUAAUCUGACAAAUAAAAAAAUCCCGACUUUGGGAUGAGUGCUAGGAUGUUGUAAA"),
  hr(),
  fluidRow(column(5, verbatimTextOutput("value")), column(5, verbatimTextOutput("value2")))
))
Server.R
library(shiny)
setwd("C:/Users/ishaan/Documents/aaa")
shinyServer(function(input , output) {
strings=input$select
# You can access the value of the widget with input$select, e.g.
output$value <- renderPrint({ input$select })
output$value2 <- renderPrint({ grep("*gc*",input$value })
})
As already indicated in the comments, there are parentheses missing in your code. Furthermore, the pattern itself seems to be wrong: grep expects a regular expression, and the bare star doesn't make sense here; you would have to use .* instead. However, that only makes grep match the entire string if it contains gc, which I guess is also not the result you want.
However, you can use gregexpr to search for the string gc:
>gregexpr("gc","aagccaagcca")[[1]]
[1] 3 8
attr(,"match.length")
[1] 2 2
attr(,"useBytes")
[1] TRUE
The output looks a bit confusing (to me). However, you can see that the string was found at positions 3 and 8.
The number of occurrences is then given by
length(gregexpr("gc","aagccaagcca")[[1]])
[1] 2
To make it match uppercase strings as well:
length(gregexpr("gc","GCaagccaagcca",ignore.case=TRUE)[[1]])
Finally, there is an issue with the length calculation if there is no match: gregexpr returns -1 in that case, so length() would still report 1.
To solve this issue you can use
mtch <- gregexpr("gcxx","GCaagccaagcxca",ignore.case=TRUE)[[1]]
if(mtch[1]==-1) 0 else length(mtch)
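Wired into the app, a minimal sketch of a corrected Server.R (an assumption of what is wanted: it echoes the selected sequence and prints the number of GC occurrences in it):
library(shiny)

shinyServer(function(input, output) {
  # echo the selected sequence
  output$value <- renderPrint({ input$select })
  # count case-insensitive occurrences of "gc" in the selected sequence
  output$value2 <- renderPrint({
    mtch <- gregexpr("gc", input$select, ignore.case = TRUE)[[1]]
    if (mtch[1] == -1) 0 else length(mtch)
  })
})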

Extract words that meet a length condition from string

I have a patent data set, and when I import the IPC-class information into R I get a string containing a variable amount of whitespace and a set of numbers I don't need. The following are the IPC codes corresponding to a patent file:
b <- "F24J 2/05 20060101AFI20150224BHEP F24J 2/46 20060101ALI20150224BHEP "
I would like to remove all the whitespace and the long alphanumeric strings, keeping just the data I am interested in, to obtain a data frame like this:
m <- data.frame(matrix(c("F24J 2/05", "F24J 2/46"), byrow = TRUE, nrow = 1, ncol = 2))
m
I am trying with gsub, since I know that the long string will always have a length considerably longer than the data I am interested in:
x = gsub("\\b[a-zA-Z0-9]{8,}\\b", "", ipc)
x
But I get stuck when I try to further clean this object in order to get the data frame I want. I am really stuck on this, and I would really appreciate if someone could help me.
Thank you very much in advance.
You can use str_extract_all from the stringr package, provided you know the pattern you are looking for:
library(stringr)
str_extract_all(b, "[A-Z]\\d{2}[A-Z] *\\d/\\d{2}")[[1]]
#[1] "F24J 2/05" "F24J 2/46"
Option 1, select all the noise data and remove it with a substitution:
/\s+|\w{5,}/g
(Spaces and 'long' words)
https://regex101.com/r/lG4dC4/1
Option 2, select only the short tokens (exactly 4 non-space characters):
/\b\S{4}\b/g
https://regex101.com/r/fZ8mH5/1
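For reference, a rough base-R translation of Option 2 (a sketch, assuming the sample string b from the question):
tokens <- regmatches(b, gregexpr("\\b\\S{4}\\b", b))[[1]]
tokens
# [1] "F24J" "2/05" "F24J" "2/46"
# pair consecutive tokens back into full IPC codes
paste(tokens[c(TRUE, FALSE)], tokens[c(FALSE, TRUE)])
# [1] "F24J 2/05" "F24J 2/46"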
or…
library(stringi)
library(readr)
read_fwf(paste0(stri_match_all_regex(b, "[[:alnum:][:punct:][:blank:]]{50}")[[1]][,1], collapse="\n"),
         fwf_widths(c(7, 12, 31)))[, 1:2]
## X1 X2
## 1 F24J 2/05
## 2 F24J 2/46
(this makes the assumption - from only seeing 2 'records' - that each 'record' is 50 characters long)
Here's an approach to make the matrix using qdapRegex (I maintain this package) plus magrittr's pipeline:
library(qdapRegex); library(magrittr)
b %>%
  rm_white_multiple() %>%
  rm_default(pattern = "F[0-9A-Z]+\\s\\d{1,2}/\\d{1,2}", extract = TRUE) %>%
  unlist() %>%
  strsplit("\\s") %>%
  do.call(rbind, .)
## [,1] [,2]
## [1,] "F24J" "2/05"
## [2,] "F24J" "2/46"

Extracting text after "?"

I have a string
x <- "Name of the Student? Michael Sneider"
I want to extract "Michael Sneider" out of it.
I have used:
str_extract_all(x,"[a-z]+")
str_extract_all(data,"\\?[a-z]+")
But I can't extract the name.
I think this should help (note that the ? has to be escaped, since it is a regex metacharacter):
substr(x, str_locate(x, "\\?")[1] + 1, nchar(x))
Try this:
sub('.*\\?(.*)','\\1',x)
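Note that the captured group keeps the leading space after the question mark; an optional cleanup (assuming base R 3.2+ for trimws):
trimws(sub(".*\\?(.*)", "\\1", x))
# [1] "Michael Sneider"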
x <- "Name of the Student? Michael Sneider"
sub(pattern = ".+?\\?" , x , replacement = '' )
To take advantage of the loose wording of the question, we can go WAY overboard and use natural language processing to extract all names from the string:
library(openNLP)
library(NLP)
# you'll also have to install the models with the next line, if you haven't already
# install.packages('openNLPmodels.en', repos = 'http://datacube.wu.ac.at/', type = 'source')
s <- as.String(x) # convert x to NLP package's String object
# make annotators
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
entity_annotator <- Maxent_Entity_Annotator()
# call sentence and word annotators
s_annotated <- annotate(s, list(sent_token_annotator, word_token_annotator))
# call entity annotator (which defaults to "person") and subset the string
s[entity_annotator(s, s_annotated)]
## Michael Sneider
Overkill? Probably. But interesting, and not actually all that hard to implement, really.
str_match is more helpful in this situation
str_match(x, ".*\\?\\s(.*)")[, 2]
#[1] "Michael Sneider"