R: searching within split character strings with apply - regex

Within a large data frame, I have a column containing character strings e.g. "1&27&32" representing a combination of codes. I'd like to split each element in the column, search for a particular code (e.g. "1"), and return the row number if that element does in fact contain the code of interest. I was thinking something along the lines of:
apply(df["MEDS"],2,function(x){x.split<-strsplit(x,"&")if(grep(1,x.split)){return(row(x))}})
But I can't figure out where to go from there since that gives me the error:
Error in apply(df["MEDS"], 2, function(x) { :
dim(X) must have a positive length
Any corrections or suggestions would be greatly appreciated, thanks!

I see a couple of problems here (in addition to the missing semicolon in the function).
df["MEDS"] is more correctly written df[,"MEDS"]. It is a single column. apply() is meant to operate on each column/row of a matrix as if they were vectors. If you want to operate on a single column, you don't need apply()
strsplit() returns a list of vectors. Since you are applying it to a row at a time, the list will have one element (which is a character vector). So you should extract that vector by indexing the list element strsplit(x,"&")[[1]].
You are returning row(x) as if the input to your function were a matrix, or as if it knew which row it came from. It does not: apply() pulls each row and passes it to your function as a plain vector, so row(x) will fail.
There might be other issues as well. I didn't get it fully running.
As I mentioned, you don't need apply() at all. You really only need to look at that one column, and you don't even need to split it.
OneRows <- which(grepl('(^|&)1(&|$)', df$MEDS))
as Matthew suggested. Or if your intention is to subset the dataframe,
newdf <- df[grepl('(^|&)1(&|$)', df$MEDS), ]
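A minimal, reproducible sketch of both variants (the toy df here is made up for illustration):
df <- data.frame(MEDS = c("1&27&32", "12&27", "27&1", "3"), stringsAsFactors = FALSE)
OneRows <- which(grepl('(^|&)1(&|$)', df$MEDS))
OneRows   # 1 3 -- "1" matches only as a whole code, not as part of "12"
newdf <- df[grepl('(^|&)1(&|$)', df$MEDS), ]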


How to get all cells that appear more than 5 times?

[screenshot of the table with region codes in column J]
I have a table in OpenOffice that contains a column with region codes (column J). Using spreadsheet functions, how can I get all codes that appear more than 5 times and write them in one cell?
Normally I would recommend breaking this problem down into smaller parts using helper columns. Or better yet, move the data into LibreOffice Base, which can easily work with distinct values.
However, I managed to come up with a rather large formula that seems to do what you asked. Enter it as an array formula (confirm with Ctrl+Shift+Enter).
=TEXTJOIN(",";1;IF(COUNTIF(исходник.J$2:J$552;исходник.J2:J552)>5;IF(ROW(исходник.J2:J552)=MATCH(исходник.J2:J552;исходник.J$2:J$552;0)+ROW(J$2)-1;исходник.J2:J552;"")))
I can't test this on your actual data since your example is only an image, but let's say that there are six of both 77 and 37. Then this would show 77,37 as the result.
Here is a breakdown. Look up the functions in LibreOffice Online Help for more information.
=TEXTJOIN(",";1; — Join all results into a single cell, separated by commas.
IF(COUNTIF(исходник.J$2:J$552;исходник.J2:J552)>5; — Find codes that occur more than 5 times. This is the same as what you wrote.
IF(ROW(исходник.J2:J552)= — Compare the next result to the row number that we are currently looking at.
MATCH(исходник.J2:J552;исходник.J$2:J$552;0)+ROW(J$2)-1; — Determine the first row that has this code. We do this to get unique results instead of 6 or more of each code in the result.
исходник.J2:J552;""))) — Return the code. (Your formula simply returns 1 here, which doesn't seem to be what you want.) If it doesn't match, return an empty string rather than 0, because TEXTJOIN ignores empty strings.

Matlab: What's the most efficient approach to parse a large table or cell array with regexp when sometimes there is no match?

I am working with a messy manually maintained "database" that has a column containing a string with name,value pairs. I am trying to parse the entire column with regexp to pull out the values. The column is huge (>100,000 entries). As a proxy for my actual data, let's use this code:
line1={'''thing1'': ''-583'', ''thing2'': ''245'', ''thing3'': ''246'', ''morestuff'':, '''''};
line2={'''thing1'': ''617'', ''thing2'': ''239'', ''morestuff'':, '''''};
line3={'''thing1'': ''unexpected_string(with)parens5'', ''thing2'': 245, ''thing3'':''246'', ''morestuff'':, '''''};
mycell=vertcat(line1,line2,line3);
This captures the general issues encountered in the database. I want to extract what thing1, thing2, and thing3 are in each line using cellfun to output a scalar cell array. They should normally be 3 digit numbers, but sometimes they have an unexpected form. Sometimes thing3 is completely missing, without the name even showing up in the line. Sometimes there are minor formatting inconsistencies, like single quotes missing around the value, spaces missing, or dashes showing up in front of the three digit value. I have managed to handle all of these, except for the case where thing3 is completely missing.
My general approach has been to use expressions like this:
expr1='(?<=thing1''):\s?''?-?([\w\d().]*?)''?,';
expr2='(?<=thing2''):\s?''?-?([\w\d().]*?)''?,';
expr3='(?<=thing3''):\s?''?-?([\w\d().]*?)''?,';
This looks behind for thingX' and then tries to match : followed by zero or one spaces, followed by 0 or 1 single quote, followed by zero or one dash, followed by any combination of letters, numbers, parentheses, or periods (this is defined as the token), using a lazy match, until zero or one single quote is encountered, followed by a comma. I call regexp as regexp(___,'tokens','once') to return the matching token.
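For example, applied to the first test line above, this returns the captured token; note that the leading dash is consumed by the -? that sits outside the capture group:
tok = regexp(mycell{1}, expr1, 'tokens', 'once')   % tok = {'583'}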
The problem is that when there is no match, regexp returns an empty array. This prevents me from using, say,
out=cellfun(@(x) regexp(x,expr3,'tokens','once'),mycell);
unless I call it with 'UniformOutput',false. The problem with that is twofold. First, I need to then manually find the rows where there was no match. For example, I can do this:
emptyout=cellfun(@(x) isempty(x),out);
emptyID=find(emptyout);
backfill=cell(length(emptyID),1);
[backfill{:}]=deal('Unknown');
out(emptyID)=backfill;
In this example, emptyID has a length of 1 so this code is overkill. But I believe this is the correct way to generalize for when it is longer. This code will change every empty cell array in out with the string Unknown. But this leads to the second problem. I've now got a 'messy' cell array of non-scalar values. I cannot, for example, check unique(out) as a result.
Pardon the long-windedness but I wanted to give a clear example of the problem. Now my actual question is in a few parts:
Is there a way to accomplish what I'm trying to do without using 'UniformOutput',false? For example, is there a way to have regexp pass a custom string if there is no match (e.g. pass 'Unknown' if there is no match)? I can think of one 'cheat', which would be to use the | operator in the expression, and if the first token is not matched, look for something that is ALWAYS found. I would then still need to double back through the output and change every instance of that result to 'Unknown'.
If I take the 'UniformOutput',false approach, how can I recover a scalar cell array at the end to easily manipulate it (e.g. pass it through unique)? I will admit I'm not 100% clear on scalar vs nonscalar cell arrays.
If there is some overall different approach that I'm not thinking of, I'm also open to it.
Tangential to the main question, I also tried using a single expression to run regexp using 3 tokens to pull out the values of thing1, thing2, and thing3 in one pass. This seems to require 'UniformOutput',false even when there are no empty results from regexp. I'm not sure how to get a scalar cell array using this approach (e.g. an Nx1 cell array where each cell is a 3x1 cell).
At the end of the day, I want to build a table using these results:
mytable=table(out1,out2,out3);
Edit: Using celldisp sheds some light on the problem:
celldisp(out)
out{1}{1} =
246
out{2} =
Unknown
out{3}{1} =
246
I assume that I need to change the structure of out so that the contents of out{1}{1} and out{3}{1} are instead just out{1} and out{3}. But I'm not sure how to accomplish this if I need 'UniformOutput',false.
Note: I've not used MATLAB and this doesn't answer the "efficient" aspect, but...
How about forcing there to always be a match?
Just thinking about you really wanting a match to skip this problem, how about an empty match?
Looking on the MATLAB help page here, I can see an 'emptymatch' option; perhaps this is something to try.
E.g.
the_thing_i_want_to_find|
Match "the_thing_i_want_to_find" or an empty match, note the | character.
In capture group it might look like this:
(the_thing_i_want_to_find|)
As a workaround, I have found that regexprep can be used to find entries where thing3 is missing and patch in a placeholder. For example:
replace='$1 ''thing3'': ''Unknown'', ''morestuff''';
missingexpr='(?<=thing2'':\s?)(''?-?[\w\d().]*?''?,) ''morestuff''';
regexprep(mycell{2},missingexpr,replace)
ans =
''thing1': '617', 'thing2': '239', 'thing3': 'Unknown', 'morestuff':, '''
Applying it to the entire array:
fixedcell=cellfun(@(x) regexprep(x,missingexpr,replace),mycell,'UniformOutput',false);
out=cellfun(@(x) regexp(x,expr3,'tokens','once'),fixedcell,'UniformOutput',false);
This feels a little roundabout, but it works.
cellfun can be replaced with a plain old for loop, as sketched below. Your code will be equally fast, or maybe even faster: cellfun is implemented with a loop anyway, so there is no advantage to using it other than fewer lines of code. In an explicit loop you can check the output of regexp and build your output array any way you like.
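A minimal sketch of that loop, reusing the mycell and expr3 defined in the question and unwrapping each token so the result is a flat cell array of character vectors:
out = cell(size(mycell));
for k = 1:numel(mycell)
    tok = regexp(mycell{k}, expr3, 'tokens', 'once');
    if isempty(tok)
        out{k} = 'Unknown';   % backfill non-matches inline
    else
        out{k} = tok{1};      % unwrap so out{k} is a char, not a nested cell
    end
end
unique(out)   % now works, since out is a uniform cell array of strings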

Using Regex in R with huge amount of strings

I have very big data, and the next step is to delete certain strings (i.e. the associated rows) based on patterns. I need to use regex for that. For example, imagine column A as:
A-929.XZT-93002-B-DKE
A-938-XZT-29849-B-DKE
A-938-AXZ-93923-B-DKE
...
...
There are many more columns besides A. Now I want to delete completely all rows which contain the phrase "XZT" preceded by anything except a letter. In this case it would be row 1 and row 2.
My question is as follows:
Can this be done in R as effectively as for example in VBA? Which package would you recommend to do so, or can it be done just as effectively with the base functions?
I am asking because there are different ways to apply regex in R, and I have to do it for roughly 20,000+ rows numerous times, so I want to do it as fast as possible.
Thanks
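For reference, a minimal base-R sketch of the deletion step (the data frame name df and the sample data are assumptions; the character class [^A-Za-z] encodes "anything before XZT except a letter"):
df <- data.frame(A = c("A-929.XZT-93002-B-DKE", "A-938-XZT-29849-B-DKE", "A-938-AXZ-93923-B-DKE"), stringsAsFactors = FALSE)
df_clean <- df[!grepl("[^A-Za-z]XZT", df$A), ]   # drops rows 1 and 2, keeps row 3
grepl() is vectorised, so tens of thousands of rows should take well under a second in base R.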

How do I apply a formula to a range without applying said formula to every cell?

I'm trying to apply a formula without having it add the formula data to each and every cell - in other words, I need the cells that are receiving the formula to be untouched until they get their data.
I was searching around and it looked like an ARRAYFORMULA would work but it doesn't seem to be doing anything when I apply it.
For example, I want to apply this formula to a cell range: =SPLIT(E2, ","). Each cell in the E column needs to be split into the two adjacent cells next to it based on its comma. When I try =ARRAYFORMULA(SPLIT(E2:E99, ",")), only the cell I add it to gets the formula.
In addition to the contribution of pnuts, also try:
=ArrayFormula(iferror(REGEXEXTRACT(","&E2:E,"^"&REPT(",+[^,]+",COLUMN(OFFSET(A1,,,1,6))-1)&",+([^,]+)")))
Note: the last parameter of OFFSET can be changed to match the maximum number of comma-separated values in the cells of range E2:E. E.g. if you have no more than 3 values per cell, set it to 3; the output will then be three columns wide (one column for each value).
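For instance, the three-values-per-cell variant of the same formula would be:
=ArrayFormula(iferror(REGEXEXTRACT(","&E2:E,"^"&REPT(",+[^,]+",COLUMN(OFFSET(A1,,,1,3))-1)&",+([^,]+)")))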
Hope that makes sense?
Also, credit is due to AdamL, who (I believe) originally crafted this workaround.
I think what you want may be ARRAY_CONSTRAIN, but for your example I can at present only offer two formulae (one for each side of the comma). The match("xxx",E:E)-1 part is a common trick for finding the last row with text in column E (assuming "xxx" sorts after all real values), so the output is constrained to the data rows:
=Array_constrain(arrayformula(left(E2:E,find(",",E2:E)-1)),match("xxx",E:E)-1,1)
=Array_constrain(arrayformula(mid(E2:E,find(",",E2:E)+1,len(E2:E))),match("xxx",E:E)-1,1)

How to use environments for lookups

My question builds upon the topic of matching a string against multiple patterns. One solution discussed here is to use sapply(keywords, grepl, strings, ignore.case=TRUE) which yields a two-dimensional matrix.
However, I run into significant speed issues when applying this approach to 5K+ keywords and 60K+ strings (I cancelled the process after 12 hours).
One idea is to use hash tables, or environments in R. However, I don't get how to "translate/convert" my strings into an environment while keeping the numerical index.
I have strings[1] ... through strings[60000]:
e <- new.env(hash=TRUE)
for (i in 1:length(strings)) {
assign(x=i, value=strings, envir=e)
}
As x in assign() must be a character, I can't use it like this, but I hope you get my idea: I want to be able to index the environment with the same numbers as in my strings[...] vector.
Thanks for your help!
R environments are not used as much as Perl hashes are, I think just because there are no widely understood idioms for doing so. In your case the key question is: do you really want the numerical index? If so, it should be the value. The key is your string; that's the whole point of the exercise.
e <- new.env(hash=TRUE)
strings <- as.character(chickwts$feed) # note! not unique
sapply(1:length(strings), function(i)assign(strings[i], i, e))
e$horsebean # returns 10
In this example only the last index associated with each string is kept, but you can assign anything that might be useful to each key, such as a vector of indices (see the sketch at the end).
You can then look up your data in a number of ways. You can regex-search for keys using ls(), for example, and retrieve the values using mget():
# find all keys containing 'beans'
ls(e, patt='bean')
# retrieve bean data
mget(ls(e, pat='bean'),e)
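And a minimal sketch of the vector-of-indices variant mentioned above, keeping every index per key rather than only the last one:
e2 <- new.env(hash=TRUE)
for (i in seq_along(strings)) {
  key <- strings[i]
  old <- if (exists(key, envir=e2, inherits=FALSE)) get(key, envir=e2) else NULL
  assign(key, c(old, i), envir=e2)   # append this index to the key's vector
}
e2$horsebean   # 1 2 3 4 5 6 7 8 9 10 (horsebean happens to fill the first ten rows of chickwts)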