I'm trying to do something fairly simple (I think) but I can't get my head round it. I'm trying to write a loop that checks if a character variable in a data frame contains any of a certain list of substrings, and to assign a corresponding value to a dummy variable.
So, imagine a data.frame, n = 2000, with a variable data.frame$text. Furthermore, I have a character vector containing all the substrings I want to test data.frame$text for. Let's call it hillary_exists:
hillary_exists <- c("Hilary Clinton", "hilary clinton","hilaryclinton", "hillaryclinton", "HilaryClinton",
"HillaryClinton","Hillary Clinton", "Hillary Rodham Clinton", "Hillary", "Hilary", "#Hillary2016", "#ImWithHer",
"Hillary2016", "hillary", "hilary", "Clinton 2016", "Clinton", "Secretary of State Clinton",
"Senator Clinton", "Hilary Rodham", "Hilary Rodham Clinton", "Hilary Rodham-Clinton", "Hillary Rodham-Clinton")
Now, I want my loop to test every row of data.frame$text for the existence of every element of hillary_exists, and if any of them is TRUE, to assign a value of 1 to the variable data.frame$hillary_mention. This is what I tried:
for(i in hillary_exists){
  if(grepl(hillary_exists[i], data.frame$text)){
    data.frame$hillary_mention <- 1
  } else {
    data.frame$hillary_mention <- 0
  }
}
Obviously I'm missing the i component for the data.frame$text element, but I don't know how to address it.
Any help would be greatly appreciated! Thanks
One approach we can use to get this to work is to turn hillary_exists into a regex: hillary_regex <- paste(hillary_exists, collapse = "|"). Essentially, this just takes all of your terms and turns them into one big OR statement, which takes care of one of the loops for us automatically. Next, we just loop over our text column, data.frame$text, using sapply.
data.frame$hillary_mention <- sapply(data.frame$text, function(s) grepl(hillary_regex, s, ignore.case = TRUE))
It's good to use ignore.case = TRUE here because there may be mentions in the text that aren't accounted for in hillary_exists, such as "hIllary cLinTon".
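For a self-contained illustration, here is a minimal sketch on a made-up data frame (df and its rows are hypothetical). Wrapping the result in as.integer() gives the 0/1 dummy the question asks for; note also that grepl() is already vectorised over the text column, so the sapply() wrapper is optional.
df <- data.frame(text = c("I'm voting for Hillary Clinton",
                          "nothing political here",
                          "#ImWithHer all the way"),
                 stringsAsFactors = FALSE)
hillary_regex <- paste(hillary_exists, collapse = "|")   # one big OR of all the terms
df$hillary_mention <- as.integer(grepl(hillary_regex, df$text, ignore.case = TRUE))
df$hillary_mention
# [1] 1 0 1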
Related
I have the following strings in a long string:
a=b=c=d;
a=b;
a=b=c=d=e=f;
I want to first search for the above-mentioned pattern (X=Y=...=Z) and then produce output like the following for each of the strings above:
a=d;
b=d;
c=d;
a=b;
a=f;
b=f;
c=f;
d=f;
e=f;
In general, I want all the variables to have an equal sign with the last variable on the extreme right of the string. Is there a way I can do it using regexprep in MATLAB? I am able to do it for a fixed-length string, but for variable lengths I have no idea how to achieve this. Any help is appreciated.
My attempt for the case of two equal signs is as follows:
funstr = regexprep(funstr, '([^;])+\s*=\s*+(\w+)+\s*=\s*([^;])+;', '$1 = $3; \n $2 = $3;\n');
Not a regexp, but if you stick to MATLAB you can make use of the cellfun function to avoid a loop:
str = 'a=b=c=d=e=f;' ; % input string
list = strsplit(str,'=') ;
strout = cellfun( @(a) [a,'=',list{end}] , list(1:end-1), 'uni', 0).' % Horchler's simplification of the solution below
% this does the same as above but is more convoluted
% strout = cellfun( @(a,b) cat(2,a,'=',b) , list(1:end-1) , repmat(list(end),1,length(list)-1) , 'uni',0 ).'
Will give you:
strout =
'a=f;'
'b=f;'
'c=f;'
'd=f;'
'e=f;'
Note: As Horchler rightly pointed out in a comment, although the cellfun instruction lets you compact your code, it is just a disguised loop. Moreover, since it runs on cells, it is notoriously slow. You won't see the difference on such simple inputs, but reserve this style for cases where performance is not a major concern.
Now, if you like regex you must like black-magic code. If all your strings are in a cell array from the start, there is a way to (over)abuse the cellfun capabilities to do it all in one line (and obscure your code in the process).
Consider:
strlist = {
'a=b=c=d;'
'a=b;'
'a=b=c=d=e=f;'
};
Then you can get all your substrings with:
strout = cellfun( @(s)cellfun(@(a,b)cat(2,a,'=',b),s(1:end-1),repmat(s(end),1,length(s)-1),'uni',0).' , cellfun(@(s) strsplit(s,'=') , strlist , 'uni',0 ) ,'uni',0)
>> strout{:}
ans =
'a=d;'
'b=d;'
'c=d;'
ans =
'a=b;'
ans =
'a=f;'
'b=f;'
'c=f;'
'd=f;'
'e=f;'
This gives you a 3x1 cell array, one cell for each group of substrings. If you want to concatenate them all, then simply: strall = cat(2,strout{:});
I haven't had much experience with MATLAB, but your problem can be solved by a simple string-split function.
[parts, m] = strsplit( funstr, {' ', '='}, 'CollapseDelimiters', true )
Now, store the last element of parts and iterate over the remaining elements up to it:
len = length( parts );
for i = 1:len-1
    fprintf( '%s = %s\n', parts{i}, parts{len} );
end
MATLAB has no print function as such; fprintf (used above) or disp will do the printing, and you can adjust the format string accordingly.
There isn't a single regex you can write that will cover all the cases, as explained in this answer:
https://stackoverflow.com/a/5019658/3393095
However, you have a few alternatives to achieve your final result:
You can get all the values in the line with regexp, pick the last value, then use a for loop over the other values to generate the output. The regex to get the values would be this:
matchStr = regexp(str,'([^=;\s]*)','match')
If you want to use regexprep at all costs, you should write a pattern generator and a replacement-expression generator based on the number of '=' signs in the input string, and pass these as parameters to your regexprep call.
You can forget about regex and split the input, then generate the output by looping over the values (similar to alternative #1).
This is pretty basic but I haven't found a simple way to do it. Say I have the following dataframe:
chars <- data.frame(type = c('ferrari_car--sport','ducati:bike:speed','honda:car_family','ninja_bike:speed','lambo_car','harley_bike'))
All I want is to search each of the values in the "type" column of this dataframe and create another column. If the text contains "car" then return "car"; if it contains "bike" then return "motorcycle" (ultimately I want to be able to do this for a bunch of different values).
My approach has been to duplicate the column, gsub "//car//" for "car" (and likewise for bike), then strip the "//" from either end.
Is there a faster/simpler way?
typestr <- c('ferrari_car','ducati_bike',
'honda:trolley_family','ninja_bike:speed','lambo_car','harley_bike')
library(stringr)
xstr <- str_extract(typestr,"(trolley|car|bike)")
rstr <- list(c("car","car"),c("bike","motorcycle"),c("trolley","trike"))
for (r in rstr) xstr <- gsub(r[1],r[2],xstr)
or
ifelse(grepl("bike",typestr),"motorcycle",
ifelse(grepl("car",typestr),"car",
ifelse(grepl("trolley",typestr),"trike",NA)))
There might be alternatives with str_replace, or ways to make the examples above more elegant with Reduce() ...
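As a rough sketch of that str_replace idea (using the same typestr as above), str_replace_all() accepts a named vector of pattern = replacement pairs, which folds the gsub loop into a single call:
library(stringr)
repl <- c("car" = "car", "bike" = "motorcycle", "trolley" = "trike")
str_replace_all(str_extract(typestr, "(trolley|car|bike)"), repl)
# "car" "motorcycle" "trike" "motorcycle" "car" "motorcycle"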
I'm at a loss as to why the following code doesn't work. The intention is to input a vector of strings, some of which can be converted to a number and some of which can't. The following sapply call should use a regex to match numbers and then return the number, or (if not) return the original string.
sapply(c("test","6","-99.99","test2"), function(v){
if(grepl("^[-+]?[0-9]*.?[0-9]+([eE][-+]?[0-9]+)?$",v)){as.numeric(v)} else {v}
})
Which returns the following result:
"test" "6" "-99.99" "test2"
Edit: What I expect the code to return:
"test" 6 -99.99 "test2
I can run the if statement on each element successfully.
> if(grepl("^[-+]?[0-9]*.?[0-9]+([eE][-+]?[0-9]+)?$","test")){as.numeric("test")} else {"test"}
[1] "test"
> if(grepl("^[-+]?[0-9]*.?[0-9]+([eE][-+]?[0-9]+)?$","6")){as.numeric("6")} else {"6"}
[1] 6
And etc...
I don't understand why this is happening. I guess I have two questions. One: Why is this happening? And two: Usually I'm pretty good at troubleshooting, but I have no idea where to even look for this. If you know the problem, how did you find/know the solution? Should I open up the internal lapply function code?
That happens because sapply returns a vector, and a vector can't hold mixed types. If you use lapply you get a list result, which can be mixed; the same code, but with lapply instead of sapply, works how you want it to.
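A minimal sketch of that suggestion, reusing the same function body:
out <- lapply(c("test","6","-99.99","test2"), function(v){
  if(grepl("^[-+]?[0-9]*.?[0-9]+([eE][-+]?[0-9]+)?$",v)){as.numeric(v)} else {v}
})
out[[2]]         # [1] 6          -- a numeric, not the string "6"
class(out[[1]])  # [1] "character" -- non-numbers are left alone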
@Jeremy points in the right direction: you can use lapply, which returns a list. Or you can tell sapply not to simplify the result. From the sapply documentation:
If simplification occurs, the output type is determined from the
highest type of the return values in the hierarchy NULL < raw <
logical < integer < double < complex < character < list < expression,
after coercion of pairlists to lists.
out <- sapply(c("test","6","-99.99","test2"), function(v){
if(grepl("^[-+]?[0-9]*.?[0-9]+([eE][-+]?[0-9]+)?$",v)){
as.numeric(v)
} else {
v
}
}, simplify = FALSE)
> out
$test
[1] "test"
$`6`
[1] 6
$`-99.99`
[1] -99.99
$test2
[1] "test2"
Say I have a line in a file:
string <- "thanks so much for your help all along. i'll let you know when...."
I want to return a value indicating whether the word "know" is within 6 words of "help".
This is essentially a very crude implementation of Crayon's answer as a basic function:
withinRange <- function(string, term1, term2, threshold = 6) {
x <- strsplit(string, " ")[[1]]
abs(grep(term1, x) - grep(term2, x)) <= threshold
}
withinRange(string, "help", "know")
# [1] TRUE
withinRange(string, "thanks", "know")
# [1] FALSE
I would suggest getting a basic idea of the text tools available to you, and using them to write such a function. Note Tyler's comment: As implemented, this can match multiple terms ("you" would match "you" and "your") leading to funny results. You'll need to determine how you want to deal with these cases to have a more useful function.
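One way to handle the partial-match issue, as a sketch, is to compare whole tokens instead of grepping (withinRangeExact is a hypothetical name; punctuation glued to a word, e.g. "along.", would still need separate handling):
withinRangeExact <- function(string, term1, term2, threshold = 6) {
  x <- strsplit(string, " ")[[1]]
  # exact token comparison, so "you" no longer matches "your"
  abs(which(x == term1) - which(x == term2)) <= threshold
}
withinRangeExact(string, "help", "know")
# [1] TRUE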
You won't be able to get this from regex alone. I suggest splitting on the space as a delimiter, then looping (or using a built-in function) to find the positions of your two terms and taking the difference of their indexes (array positions).
Edit: Okay, I thought about it for a second, and perhaps this will work for you as a regex pattern:
\bhelp(\s+[^\s]+){1,5}+\s+know\b
This uses the same "space is the delimiter" idea. It first matches help, then greedily matches up to 5 occurrences of a space plus a word, then looks for a space and know (since "know" would be the 6th word).
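For reference, the pattern can be tried from R; perl = TRUE is assumed here because the possessive {1,5}+ quantifier needs the PCRE engine:
grepl("\\bhelp(\\s+[^\\s]+){1,5}+\\s+know\\b", string, perl = TRUE)
# [1] TRUE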
Split your string:
> words <- strsplit(string, '\\s')[[1]]
Build an indices vector:
> indices <- 1:length(words)
Name indices:
> names(indices) <- words
Compute distance between words:
> abs(indices["help"] - indices["know"]) < 6
FALSE
EDIT: In a function:
distance <- function(string, term1, term2) {
words <- strsplit(string, "\\s")[[1]]
indices <- 1:length(words)
names(indices) <- words
abs(indices[term1] - indices[term2])
}
distance(string, "help", "know") < 6
EDIT Plus
There is a great advantage in indexing words: once it's done, you can compute a lot of statistics on the text.
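For instance, with the words vector from above, a frequency count of the whole text is a one-liner:
sort(table(words), decreasing = TRUE)   # how often each word occurs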
Alright, so here's the whole thing I'm supposed to do.
Input a number that corresponds with a number in Data Worksheet Column A and return the adjacent row data.
I want it to return the adjacent cells; for example, if it finds 052035 in cell A5378, I want it to return the data or the cell addresses of B5378 and C5378.
EDIT: I've deleted my code, since it didn't really amount to a good way to do this.
Worksheet Structure for Data:
Column A (rows 1 to ~7800): 6-digit numbers (digits 1-9)
Column B (rows 1 to ~7800): Area codes
Column C (rows 1 to ~7800): City/States
The data, by the way, is a relatively large set that I got from a query on a SQL Server. The number string I'm looking for should have no duplicates, based on my original query. [I grouped by before copying it over.]
If y'all have resources for a quick introduction to VB from a programming perspective, that would be helpful. I can program in C/C++, but the syntax in VB is a little weird to me.
If your end goal is simply to find the exact match in column A and return the values in the corresponding row of columns B and C, regular expressions are the wrong tool for the job. Use built-in functions like Match.
I still don't understand the point of this exercise; since the data is already arranged in columns A, B and C, you could simply use AutoFilter. This subroutine simply tells you that the value is found (and returns the corresponding data) or not found.
I have tested this (I made a small change in how the vals variable is dimensioned):
Sub Foo()
    Dim valToLookFor As String
    Dim rngToLookAt As Range
    Dim foundRow As Long
    Dim vals() As Variant

    valToLookFor = "052035"
    Set rngToLookAt = Range("A:A")

    If Not IsError(Application.Match(valToLookFor, rngToLookAt, False)) Then
        foundRow = Application.Match(valToLookFor, rngToLookAt, False)
        ReDim vals(1)
        vals(0) = rngToLookAt.Cells(foundRow).Offset(0, 1).Value
        vals(1) = rngToLookAt.Cells(foundRow).Offset(0, 2).Value
        'Alternatively, to return the cell addresses:
        'vals(0) = rngToLookAt.Cells(foundRow).Offset(0, 1).Address
        'vals(1) = rngToLookAt.Cells(foundRow).Offset(0, 2).Address
        MsgBox Join(vals, ",")
    Else
        Erase vals
        MsgBox valToLookFor & " not found!", vbInformation
    End If
End Sub
Here is proof that it works (screenshot of the resulting message box omitted).