Excel RegEx Functions in R

I regularly work with Excel sheets where some fields (observations) contain large amounts of text in a partly structured form (at least visually).
So the content of a single cell/observation might look something like this:
My name is John Doe
I live at my address
My Post code is ABC123
My Favorite Pet is: A dog
In Excel I've created a few functions that I can use to look for a string within the cell. So, say the data is in "A1";
in "A2" I can use "=GETPOSTCODE(A1)", where the function is:
Function GetPostCode(PostCode As Range) As String
    ' Late-bound VBScript RegExp object (the original omitted this declaration)
    Dim regex As Object, X As Object, x1 As Object
    Set regex = CreateObject("VBScript.RegExp")
    regex.Pattern = "[A-Z]{3}\d{3,}\b\w*"
    regex.IgnoreCase = True
    regex.MultiLine = True
    Set X = regex.Execute(PostCode.Value)
    For Each x1 In X
        ' return the first match, upper-cased
        GetPostCode = UCase(x1)
        Exit For
    Next
End Function
What kind of structures/functions could I use in R to accomplish this?
The cells really contain much more data than that; this is purely an example, and I have a number of different "get" functions with different regexes.
I've had a good look at all the grep-type commands but am struggling with limited/developing R skills.
I've been working along these lines but have pretty much stalled (where textfield is the column with my text in, obviously!). I can get a list of all the rows that contain a post code, but not JUST the post code:
df$postcode <- df[(df$textfield = grep("[A-Z]{3}\\d{3,}\\b\\w*", df$textfield), ]
Any help appreciated!

I think you need a combination of regexpr or gregexpr (to find the matches in the string) and regmatches (to extract the matching parts of the strings):
x <- "My name is John Doe
I live at my address
My Post code is ABC123
My Favorite Pet is: A dog"
> regmatches(x, regexpr("[A-Z]{3}\\d{3,}\\b\\w*", x, ignore.case = TRUE))
[1] "ABC123"
Other options include str_extract from the stringr package or stri_extract from the stringi package.
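For whole columns, here is a minimal sketch (assuming your data frame is df and the text column is textfield, as in your attempt above):
library(stringr)
# str_extract() returns the first match per element, or NA when there is none
df$postcode <- toupper(str_extract(df$textfield, regex("[A-Z]{3}\\d{3,}\\b\\w*", ignore_case = TRUE)))
# base R equivalent:
m <- regexpr("[A-Z]{3}\\d{3,}\\b\\w*", df$textfield, ignore.case = TRUE)
df$postcode <- NA
df$postcode[m > -1] <- toupper(regmatches(df$textfield, m))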

Related

filtering columns by regex in dataframe

I have a large dataframe (3000+ columns) and I am trying to get a list of all column names that follow this pattern:
"stat.mineBlock.minecraft.123456stone"
"stat.mineBlock.minecraft.DFHFFBSBstone2"
"stat.mineBlock.minecraft.AAAstoneAAAA"
My code:
stoneCombined<-grep("^[stat.mineBlock.minecraft.][a-zA-Z0-9]*?[stone][a-zA-Z0-9]*?", colnames(ingame), ignore.case =T)
where ingame is the dataframe I am searching. However, my code returns a list of numbers instead of the dataframe columns (like those above) that I was expecting. Can someone tell me why?
After adding value=TRUE (thanks to user227710):
I now get column names, but I get every column in my dataset, not just those that contain stat.mineBlock.minecraft. and stone as I was trying to get.
To return the column names you need to set value=TRUE as an additional argument of grep. The default in grep is value=FALSE, so it gives you the indices of the matched colnames.
help("grep")
value
if FALSE, a vector containing the (integer) indices of the matches determined by grep is returned, and if TRUE, a vector containing the matching elements themselves is returned.
grep("your regex pattern", colnames(ingame),value=TRUE, ignore.case =T)
Here is a solution in dplyr:
library(dplyr)
your_df %>%
select(starts_with("stat.mineBlock.minecraft"))
The more general way to match a column name to a regex is with matches() inside select(). See ?select for more information.
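For example, here is a minimal sketch reusing the pattern from above (your_df is the placeholder name from the snippet):
library(dplyr)
your_df %>%
  select(matches("^stat\\.mineBlock\\.minecraft\\..*stone"))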
My answer is based on this SO post. As for the regex, you were very close.
The problem is that [...] creates a character class matching a single character from the defined set, and that is the main reason your pattern was not working. Also, perl=TRUE is often safer to use with regex in R.
So, here is my sample code:
df <- data.frame(
  "stat.mineBlock.minecraft.123456stone" = 1,
  "stat.mineBlock.minecraft.DFHFFBSBwater2" = 2,
  "stat.mineBlock.minecraft.DFHFFBSBwater3" = 3,
  "stat.mineBlock.minecraft.DFHFFBSBstone4" = 4
)
grep("^stat\\.mineBlock\\.minecraft\\.[a-zA-Z0-9]*?stone[a-zA-Z0-9]*?", colnames(df), value=TRUE, ignore.case=T, perl=T)

Search a string by a mix of syntactical and regex patterns

I would like to use R to search a text for patterns expressed through a mix of POS and actual strings. (I have seen this functionality in a python library here: http://www.clips.ua.ac.be/pages/pattern-search).
For instance, a search pattern could be: 'NOUNPHRASE be|is|was ADJECTIVE than NOUNPHRASE', and should return all strings containing structures like: "a cat is faster than a dog".
I know that packages like openNLP and qdap offer convenient POS tagging. Has anyone used their output for this kind of pattern matching?
As a starter, using koRpus and TreeTagger:
library(koRpus)
library(tm)
mytxt <- c("This is my house.", "A house is better than no house.", "A cat is faster than a dog.")
pattern <- "Noun, singular or mass.*?Adjective, comparative.*?Noun, singular or mass"
tagged.results <- treetag(file = mytxt, treetagger = "C:/TreeTagger/bin/tag-english.bat",
                          lang = "en", format = "obj", stopwords = stopwords("en"))
tagged.results <- kRp.filter.wclass(tagged.results, "stopword")
# assign each token a sentence id, incrementing after sentence-ending punctuation
taggedText(tagged.results)$id <- factor(head(cumsum(c(0, taggedText(tagged.results)$desc == "Sentence ending punctuation")) + 1, -1))
# collapse the POS descriptions per sentence and test each one against the pattern
setNames(mytxt, grepl(pattern, aggregate(desc ~ id, taggedText(tagged.results), FUN = paste0)$desc))
# FALSE TRUE TRUE
# "This is my house." "A house is better than no house." "A cat is faster than a dog."

RegExReplace - A few examples to get me started, please

I'm trying to use RegExReplace to pre-process some text before it gets parsed for use in an Access database. Currently I have been defining a growing number of string patterns in a table, then using the stock Replace() function in VBA with that table. It works OK but misses the mark in a few areas; I am pretty sure regular expressions will be a better long-term solution for me, but I am completely clueless about how to construct them.
I'd like to see if the smart folks here can give me a leg up on the task using a few actual examples from my data, by illustrating the regex strings that will produce the desired result:
1. 6 IN           => 6IN
2. 12.3 IN X 2 YD => 12.3IN_X_2YD
3. 6IN X 4IN      => 6IN_X_4IN
4. 8X120MM        => 8_X_120MM
5. 1 1/2"         => 1.5IN
6. CAT, DOG       => CAT DOG
7. CAT,DOG        => CAT DOG
8. CAT ,DOG       => CAT DOG
9. CAT , DOG      => CAT DOG
My patterns fail in ways like: CATHETER INFUSION => CATHETERINFUSION
I will be using a multi-pass approach rather than attempting to come up with some terribly complex expressions.
Can anyone offer some initial guidance on any of these samples? I'm confident I will be able to leverage them and extend as needed.
[Edit:] I did just find a few helpful examples:
NewStr := RegExReplace("abc123123", "123$", "xyz") ; Returns "abc123xyz" because the $ allows a match only at the end.
NewStr := RegExReplace("abc123", "i)^ABC") ; Returns "123" because a match was achieved via the case-insensitive option.
NewStr := RegExReplace("abcXYZ123", "abc(.*)123", "aaa$1zzz") ; Returns "aaaXYZzzz" by means of the $1 backreference.
NewStr := RegExReplace("abc123abc456", "abc\d+", "", ReplacementCount) ; Returns "" and stores 2 in ReplacementCount.
[Edit 2]: Making good progress!
strText = "BANDAGE, ADHESIVE, 2 FT X 3.5 IN X 0.25MM, LATEX-FREE"
strResult = RegExReplace(strText, "(,|\s+)", " ", True)
strResult = RegExReplace(strResult, "\s+(IN|FT|YD)\s+", "$1 ", True)
strResult = RegExReplace(strResult, "\s+X\s+", "_X_", True)
Produces:
BANDAGE ADHESIVE 2FT_X_3.5IN_X_0.25MM LATEX-FREE
Some regexps that might be useful:
/\s+IN/IN/
/\s+X\s+/_X_/
/(?:\d)X(?:\d)/_X_/
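For comparison, here is a minimal sketch of the same multi-pass idea in R with gsub() (the patterns are adaptations, not the original wrapper; note the \b word boundary avoids the CATHETER INFUSION problem mentioned above):
strText <- "BANDAGE, ADHESIVE, 2 FT X 3.5 IN X 0.25MM, LATEX-FREE"
pass1 <- gsub("\\s*,\\s*", " ", strText)          # commas (and surrounding spaces) -> single space
pass2 <- gsub("\\s+(IN|FT|YD)\\b", "\\1", pass1)  # glue a unit to the preceding number
pass3 <- gsub("\\s+X\\s+", "_X_", pass2)          # " X " -> "_X_"
pass3
# [1] "BANDAGE ADHESIVE 2FT_X_3.5IN_X_0.25MM LATEX-FREE"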

Need help removing HTML tags, certain punctuation, and ending periods

Suppose I have this test string:
test.string <- c("This is just a <test> string. I'm trying to see, if a FN will remove certain things like </HTML tags>, periods; but not the one in ASP.net, for example.")
I want to:
Remove anything contained within an HTML tag
Remove certain punctuation (,:;)
Remove periods that end a sentence
So the above should be:
c("This is just a string I'm trying to see if a FN will remove certain things like periods but not the one in ASP.net for example")
For #1, I've tried the following:
gsub("<.*?>", "", x, perl = FALSE)
And that seems to work OK.
For #2, I think it's simply:
gsub("[:#$%&*:,;^():]", "", x, perl = FALSE)
Which works.
For #3, I tried:
gsub("+[:alpha:]?[.]+[:space:]", "", test.string, perl = FALSE)
But that didn't work...
Any ideas on where I went wrong? I totally suck at RegExp, so any help would be much appreciated!!
Based on your provided input and rules for what you want removed, the following should work.
gsub('\\s*<.*?>|[:;,]|(?<=[a-zA-Z])\\.(?=\\s|$)', '', test.string, perl=T)
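Applied to the test string above, this gives the requested result:
gsub('\\s*<.*?>|[:;,]|(?<=[a-zA-Z])\\.(?=\\s|$)', '', test.string, perl = TRUE)
# [1] "This is just a string I'm trying to see if a FN will remove certain things like periods but not the one in ASP.net for example"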
Try this:
test.string <- "There is a natural aristocracy among men. The grounds of this are virtue and talents. "
gsub("\\.\\s*", "", gsub("([a-zA-Z0-9]). ([A-Z])", "\\1 \\2", test.string))
# "There is a natural aristocracy among men The grounds of this are virtue and talents

Regular expression for matching different name formats in Python

I need a regular expression in Python that will be able to match different name formats.
I have 4 different formats for the same person's name, like:
R. K. Goyal
Raj K. Goyal
Raj Kumar Goyal
R. Goyal
What would the regular expression be to get all these names from a list of thousands using a single expression?
PS: My list has thousands of such names, so I need a generic solution so that I can combine these names together. In the above example, R and Goyal can be used to write the RE.
Thanks
"R(\.|aj)? (K(\.|umar)? )?Goyal" will only match those four cases. You can modify this for other names as well.
Fair warning: I haven't used Python in a while, so I won't be giving you specific function names.
If you're looking for a generic solution that will apply to any possible name, you're going to have to construct it dynamically.
ASSUMING that the first name is always the one that won't be dropped (I know people whose names follow the format "John David Smith" and go by David), you should be able to grab the first letter of the string and call that the first initial.
Next, you need to grab the last name: if you have no Jr's or Sr's or such, you can just take the last word (find the last occurrence of ' ', then take everything after that).
From there, "<firstInitial>* <lastName>" is a good start. If you bother to grab the whole first name as well, you can reduce your false positive matches further with "<firstInitial>(\.|<restOfFirstName>)* <lastName>" as in joon's answer.
If you want to get really fancy, detecting the presence of a middle name could reduce false positives even more.
I may be misunderstanding the problem, but I'm envisioning a solution where you iterate over the list of names and dynamically construct a new regexp for each name, and then store all of these regexps in a dictionary to use later:
import re

names = ['John Kelly Smith', 'Billy Bob Jones', 'Joe James', 'Kim Smith']

regexps = {}
for name in names:
    elements = name.split()
    if len(elements) == 3:
        # first and middle name may each be written in full or as an initial
        pattern = r'(%s(\.|%s)?)?(\ )?(%s(\.|%s)? )?%s$' % (elements[0][0],
                                                            elements[0][1:],
                                                            elements[1][0],
                                                            elements[1][1:],
                                                            elements[2])
    elif len(elements) == 2:
        pattern = r'%s(\.|%s)? %s$' % (elements[0][0],
                                       elements[0][1:],
                                       elements[1])
    else:
        continue
    regexps[name] = re.compile(pattern)

jksmith_regexp = regexps['John Kelly Smith']
print(bool(jksmith_regexp.match('K. Smith')))
print(bool(jksmith_regexp.match('John Smith')))
print(bool(jksmith_regexp.match('John K. Smith')))
print(bool(jksmith_regexp.match('J. Smith')))
This way you can easily keep track of which regexp will find which name in your text.
And you can also do handy things like this:
if sum(bool(reg.match('K. Smith')) for reg in regexps.values()) > 1:
    print("This string matches multiple names!")
Where you check to see if some of the names in your text are ambiguous.