Using Regex in R with a huge amount of strings

I have very big data, and the next step is to delete certain strings (i.e. the associated rows) based on patterns. I need to use regex for that. For example, imagine column A as:
A-929.XZT-93002-B-DKE
A-938-XZT-29849-B-DKE
A-938-AXZ-93923-B-DKE
...
...
There are many more columns besides A. Now I want to completely delete all rows which contain the phrase "XZT" preceded by anything except a letter. In this case that would be rows 1 and 2.
My question is as follows:
Can this be done in R as effectively as, for example, in VBA? Which package would you recommend for this, or can it be done just as effectively with the base functions?
I am asking because there are different ways to apply regex in R, and I have to do it for roughly 20,000+ rows numerous times, so I want it to be as fast as possible.
Thanks
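One possible sketch in base R, assuming the data frame is called df and the column is named A (both names are placeholders); the pattern encodes "XZT preceded by anything other than a letter":
# flag rows where "XZT" is preceded by a non-letter, then drop them
drop <- grepl("[^A-Za-z]XZT", df$A)
df_clean <- df[!drop, ]
On the three example values this flags TRUE, TRUE, FALSE, so rows 1 and 2 are removed. grepl() is vectorised, so a single call covers all 20,000+ rows at once; if that is still too slow, stringi::stri_detect_regex() is a commonly used faster alternative.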

Related

How to Keep rows of multi-line cells containing a keyword in google sheets

I'm trying to keep lines that contain the word "NOA" in column A, which has many multi-line cells, as can be viewed in this Google Spreadsheet.
If "NOA" is present, I would like to keep the line. The input and output should look like the image, which I have "working" with too many helper cells. Can this be combined into a single formula?
Theoretical Approaches:
I have been thinking about three approaches to solve this:
ARRAYFORMULA(REGEXREPLACE) - couldn't get it to work
JOIN(FILTER(REGEXMATCH(TRANSPOSE))) - showing promise, as it works in multiple steps
Using the QUERY function - unfamiliar with this function, but wondering if it has a fast solution
Practical attempts:
FIRST APPROACH: First I attempted using REGEXEXTRACT to extract everything that did not have NOA in it; the regex worked in a demo but didn't work properly in Sheets. I thought this might be a concise way to get the value, perhaps if my regex skills were better?
ARRAYFORMULA(REGEXREPLACE(A1:A7, "^(?:[^N\n]|N(?:[^O\n]|O(?:[^A\n]|$)|$)|$)+",""))
I think the regex became overly complex and didn't work in Google, or perhaps the formula could be improved; because Google RE2 has limitations, certain things are harder to do.
SECOND APPROACH:
Then I came up with an alternate approach which seems to work in 2 stages (with multiple helper cells), but I would like to do this with one formula.
=TRANSPOSE(split(A2,CHAR(10)))
=TEXTJOIN(CHAR(10),1,FILTER(C2:C7,REGEXMATCH(C2:C7,"NOA")))
Questions:
Can these formulas be combined and applied to the entire Column using an Index or Array?
Or perhaps, the REGEX in my first approach can be modified?
Is there a faster solution using Query?
The shared Google spreadsheet is here.
Thank you in advance for your help.
Here's one way you can do that:
=index(substitute(substitute(transpose(trim(
query(substitute(transpose(if(regexmatch(split(
filter(A2:A,A2:A<>""),char(10)),"NOA"),split(
filter(A2:A,A2:A<>""),char(10)),))," ","❄️")
,,9^9)))," ",char(10)),"❄️"," "))
First we split the data by the newline (char 10), then we filter out the lines that don't contain NOA, and finally we use a "query smush" to join everything back together.
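For comparison, the same split / filter / rejoin logic sketched in R on a single hypothetical multi-line cell:
cell  <- "NOA alpha\nskip me\nNOA beta"           # hypothetical cell contents
lines <- strsplit(cell, "\n", fixed=TRUE)[[1]]    # split on the newline
paste(lines[grepl("NOA", lines)], collapse="\n")  # keep only the NOA lines
# "NOA alpha\nNOA beta"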

Can a Regex-Replace be run on a range instead of looping through the cells in Excel?

I need to do many Regex replacements (~ 100 currently, but the list will grow) on a range of cells (varies, but up to 4 or 5 digit cell count).
Currently, my working draft is to loop through all cells repeatedly for each pattern, but obviously that's many loops.
Ideally, I'd call something like (pseudocode):
Sheet.Range("A1:G1000").RegexReplace(pattern, replacement)
However, the nearest thing is Range.Replace which only mentions "The string you want Microsoft Excel to search for".
The list of Regex.Replace overloads does not mention anything related to cells or ranges.
So, since Range.RegexReplace seems to be out - is there a more efficient way to replace many patterns in many cells than to loop through each pattern, row and column?
Don't iterate cells. Whether you're writing VBA, C#, or VB.NET, if you're working against Range objects in nested loops you're doing the single slowest thing you could possibly do with the Excel object model.
Work against an array instead - you need a function like this in your toolbox:
Public Function ToArray(ByVal target As Range) As Variant
Select Case True
Case target.Count = 1
'single cell
ToArray = Array(target.Value)
Case target.Rows.Count = 1
'horizontal 1D range
ToArray = Application.WorksheetFunction.Transpose(Application.WorksheetFunction.Transpose(target.Value))
Case target.Columns.Count = 1
'vertical 1D range
ToArray = Application.WorksheetFunction.Transpose(target.Value)
Case Else
'2D array: let Excel do the conversion itself
ToArray = target.Value
End Select
End Function
Now you iterate an in-memory array of values (with For loops), and for each value you run a number of Regex.Replace calls - cache and reuse the Regex objects as much as possible, so you're not re-creating the same objects over and over for thousands of values.
Once you've traversed the entire array, dump it into the worksheet (resize and transpose as needed), and voilà - you've instantly rewritten thousands of cells in a single operation.
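A minimal VBA sketch of that loop, assuming a 2D range and late-bound VBScript.RegExp objects (the patterns, replacements, and range address are hypothetical placeholders):
Public Sub BulkRegexReplace()
    Dim patterns As Variant, replacements As Variant
    patterns = Array("\d{3}", "foo|bar")       'placeholder patterns
    replacements = Array("###", "baz")         'placeholder replacements
    Dim target As Range
    Set target = ActiveSheet.Range("A1:G1000") 'placeholder range
    Dim values As Variant
    values = target.Value                      'one read from the sheet
    'create and configure each RegExp once, outside the cell loops
    Dim regexes() As Object, p As Long
    ReDim regexes(LBound(patterns) To UBound(patterns))
    For p = LBound(patterns) To UBound(patterns)
        Set regexes(p) = CreateObject("VBScript.RegExp")
        regexes(p).Pattern = patterns(p)
        regexes(p).Global = True
    Next
    Dim r As Long, c As Long
    For r = LBound(values, 1) To UBound(values, 1)
        For c = LBound(values, 2) To UBound(values, 2)
            For p = LBound(patterns) To UBound(patterns)
                values(r, c) = regexes(p).Replace(CStr(values(r, c)), replacements(p))
            Next
        Next
    Next
    target.Value = values                      'one write back to the sheet
End Sub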

R: searching within split character strings with apply

Within a large data frame, I have a column containing character strings e.g. "1&27&32" representing a combination of codes. I'd like to split each element in the column, search for a particular code (e.g. "1"), and return the row number if that element does in fact contain the code of interest. I was thinking something along the lines of:
apply(df["MEDS"],2,function(x){x.split<-strsplit(x,"&")if(grep(1,x.split)){return(row(x))}})
But I can't figure out where to go from there since that gives me the error:
Error in apply(df["MEDS"], 2, function(x) { :
dim(X) must have a positive length
Any corrections or suggestions would be greatly appreciated, thanks!
I see a couple of problems here (in addition to the missing semicolon in the function).
df["MEDS"] is more correctly written df[,"MEDS"]. It is a single column. apply() is meant to operate on each column/row of a matrix as if they were vectors. If you want to operate on a single column, you don't need apply()
strsplit() returns a list of vectors. Since you are applying it to a row at a time, the list will have one element (which is a character vector). So you should extract that vector by indexing the list element strsplit(x,"&")[[1]].
You are returning row(x) is if the input to your function is a matrix or knows what row it came from. It does not. apply() will pull each row and pass it to your function as a vector, so row(x) will fail.
There might be other issues as well. I didn't get it fully running.
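To illustrate the indexing point, using the string from the question:
strsplit("1&27&32", "&")       # a list holding one character vector
# [[1]]
# [1] "1"  "27" "32"
strsplit("1&27&32", "&")[[1]]  # the character vector itself
# [1] "1"  "27" "32"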
As I mentioned, you don't need apply() at all. You really only need to look at the one column. You don't even need to split it.
OneRows <- which(grepl('(^|&)1(&|$)', df$MEDS))
as Matthew suggested. Or, if your intention is to subset the data frame,
newdf <- df[grepl('(^|&)1(&|$)', df$MEDS),]
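A quick check of that pattern on a few made-up codes confirms it only matches "1" as a whole code, not as part of "11" or "13":
meds <- c("1&27&32", "11&27", "27&1", "2&13")
grepl("(^|&)1(&|$)", meds)
# [1]  TRUE FALSE  TRUE FALSE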

How to use environments for lookups

My question builds upon the topic of matching a string against multiple patterns. One solution discussed here is to use sapply(keywords, grepl, strings, ignore.case=TRUE) which yields a two-dimensional matrix.
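For reference, a tiny sketch of what that approach produces (the keywords and strings here are made up):
keywords <- c("foo", "bar")
strings  <- c("foobar", "baz", "barely")
sapply(keywords, grepl, strings, ignore.case=TRUE)
#        foo   bar
# [1,]  TRUE  TRUE
# [2,] FALSE FALSE
# [3,] FALSE  TRUE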
However, I run into significant speed issues when applying this approach to 5K+ keywords and 60K+ strings (I cancelled the process after 12 hrs).
One idea is to use hash tables, or environments in R. However, I don't get how to "translate/convert" my strings into an environment while keeping the numerical index.
I have strings[1] ... through strings[60000]:
e <- new.env(hash=TRUE)
for (i in 1:length(strings)) {
assign(x=i, value=strings, envir=e)
}
As x in assign must be a character, I can't use it like this, but I hope you get my idea... I want to be able to index the environment with the same numbers as in my strings[...] vector.
Thanks for your help!
R environments are not used as much as Perl hashes are, I think, just because there are not widely understood 'idioms' for doing so. In your case the key question is: do you really want the numerical index? If so, it should be the value. The key is your string; that's the whole point of the exercise.
e <- new.env(hash=T)
strings <- as.character(chickwts$feed) # note! not unique
sapply(1:length(strings), function(i)assign(strings[i], i, e))
e$horsebean # returns 10
In this example only the last index associated with each string is kept, but you can assign anything that might be useful to each key, such as a vector of indices, as sketched below.
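For instance, a short sketch (the variable names are mine) that keeps every index per key by grouping them with split() first:
e <- new.env(hash=TRUE)
strings <- as.character(chickwts$feed)
idx <- split(seq_along(strings), strings)  # list mapping each string to all its indices
for (key in names(idx)) assign(key, idx[[key]], envir=e)
e$horsebean # returns 1 2 3 4 5 6 7 8 9 10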
You can then look up your data in a number of ways. You can regex-search for keys using ls(), for example, and retrieve the values using mget():
# find all keys containing 'beans'
ls(e, patt='bean')
# retrieve bean data
mget(ls(e, pat='bean'),e)

regular expression to reverse text order

I need to reverse the order of an HTML file's title tag, so the first text before the ":" is put at the end, and so on.
original:
<title>text: texttwo: three more: four | site.com</title>
output:
<title>four: three more: texttwo: text | site.com</title>
The title inside is divided by ":" and I need to reverse the order. Sometimes there are four parts (separated by three ":") and sometimes three, or whatever.
I use Notepad++ to do the replacement - or feel free to suggest any other easy software for this.
Thanks
I don't believe this can be done with a standard regular expression - at least not with the requirement of supporting any number of fields.
Assuming you have a large number of these to process, I'd use your favorite programming or scripting language: split the fields into an array (you can use regular expressions for this), then read back from the array in reverse.
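For example, a minimal sketch of that approach in R, using the title text from the question (the function name is mine):
reverse_title <- function(title) {
  pieces <- strsplit(title, " | ", fixed=TRUE)[[1]]    # split off the " | site.com" suffix
  parts  <- strsplit(pieces[1], ": ", fixed=TRUE)[[1]] # split the fields on ": "
  paste0(paste(rev(parts), collapse=": "), " | ", pieces[2])
}
reverse_title("text: texttwo: three more: four | site.com")
# "four: three more: texttwo: text | site.com"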
If you really don't want to write code (which I think is not a good idea, because this is a really good opportunity to learn something new), you can try this:
http://jsimlo.sk/notepad/manual/wiki/index.php/Reverse_tools (Order of Words on Each Line (Ctrl+Shift+F))
but you need to download this:
http://jsimlo.sk/notepad/