subset doesn't recognize regex

I just cannot get this working. I want to subset all the rows containing "mail". I use this:
Email <- subset(Total_Content, source == ".*mail.*")
I have rows like these:
"snt152.mail.live.com",
"mailing.serviciosmovistar.com",
"blu179.mail.live.com"
But when using View(Email), I just get an empty data.frame (I only see the columns). I don't need to "escape" any metacharacter, because I need the "." to mean "any character" and the "*" to mean "0 or more times", right? Thanks.

Well, no, it doesn't - it's not meant to. You're not passing it a regular expression to be evaluated against each row; you're just passing it a character string. It doesn't know that . and * are regex metacharacters because it's not performing a regex search. It's returning all rows where source is the literal string ".*mail.*" - which in this case is 0 rows.
What you probably want to be doing (I'm assuming this is a data.frame here) is:
Email <- Total_Content[grepl(x = Total_Content$source, pattern = ".*mail.*"),]
grepl produces a vector of logical (TRUE/FALSE) values indicating whether each entry in Total_Content$source matched the pattern. Total_Content[boolean_vector,] then limits the result to those rows of Total_Content where the corresponding value is TRUE.
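Here is a minimal, self-contained sketch of that approach (the data frame below is invented for illustration; "twitter.com" is just a made-up non-matching row). Note that grepl matches anywhere in the string, so the pattern "mail" alone is enough - the surrounding .* is redundant:
Total_Content <- data.frame(
  source = c("snt152.mail.live.com",
             "mailing.serviciosmovistar.com",
             "twitter.com"),
  stringsAsFactors = FALSE
)
# grepl() returns one TRUE/FALSE per element; use it to index rows
Email <- Total_Content[grepl("mail", Total_Content$source), ]
Email
#                          source
# 1          snt152.mail.live.com
# 2 mailing.serviciosmovistar.com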

Why not use subset with a logical regex function?
Email <- subset(Total_Content, grepl(".*mail.*", source) )
The subset function creates a local environment for the evaluation of expressions used in either the 'subset' (row targets) or the 'select' (column targets) arguments, which is why source can be referenced without the Total_Content$ prefix.
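Using the same invented Total_Content as in the sketch above, this version should return the same two rows:
Email <- subset(Total_Content, grepl("mail", source))
Because grepl returns a logical vector, subset keeps exactly the rows where it evaluates to TRUE.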

Select all substrings before a special character

In PostgreSQL, I have the following text type in a column value.
{186=>15.55255158, 21=>5123.43494408, 164=>0.0}
I would like to select the numbers before the => character and use the output in a subquery. So the output should be:
186
21
164
I tried several regex statements but they did not work. Any help would be appreciated.
You need both a regular expression and a function that extracts the values matched by the regular expression into a result set. The ~ operator only performs matching in a WHERE clause; you need the REGEXP_MATCHES function.
SELECT REGEXP_MATCHES(your_column_name, '(\d+)=>', 'g')
FROM your_table_name
The 'g' option will return multiple matches, rather than just the first.
Alternatively, you can use the simpler substring(column_name from '(\d+)=>') to extract the data; note, though, that substring returns only the first match.

OpenRefine custom text faceting

I have a column of names like:
Quaglia, Pietro Paolo
Bernard, of Clairvaux, Saint, or
.E., Calvin F.
Swingle, M Abate, Agostino, Assereto
Abati, Antonio
10-NA)\u, Ferraro, Giuseppe, ed, Biblioteca comunale ariostea. Mss. (Esteri
I want to make a custom text facet with OpenRefine that marks as "true" the names with one comma and "false" all the others, so that I can work with the latter (".E., Calvin F." is not a problem; I'll deal with that later).
I'm trying using "Custom text facet" and this expression:
if(value.match(/([^,]+),([^,]+)/), "true", "false")
But the result is all false. What am I getting wrong?
The expression you are using:
if(value.match(/([^,]+),([^,]+)/), "true", "false")
will always evaluate to false because the output of the 'match' function is either an array or null, and when evaluated by 'if', neither an array nor null evaluates to true.
You can wrap the match function in 'isNonBlank' or similar to get a boolean true/false, which would then make the 'if' function work as you want. However, once you have a boolean true/false result, the 'if' becomes redundant: its only purpose is to turn the boolean true/false into the string "true" or "false", which makes no difference to the values used by the custom text facet.
So:
isNonBlank(value.match(/([^,]+),([^,]+)/))
should give you the desired result using match.
Instead of using 'match' you could use 'split' to split the string into an array, using the comma as the split character. The length of the resulting array then gives you the number of commas in the string (number of commas = length - 1).
So your custom text facet expression becomes:
value.split(",").length()==2
This will give you true/false.
If you want to break down the data based on the number of commas that appear, you could leave off the '==2' to get a facet which just gives you the length of the resulting array.
I would go with a lookahead assertion to check that exactly one "," can be found between the beginning and the end of the line:
^(?=[^\,]+,[^\,]+$).*
https://regex101.com/r/iG4hX6/2

filtering columns by regex in dataframe

I have a large dataframe (3000+ columns) and I am trying to get a list of all column names that follow this pattern:
"stat.mineBlock.minecraft.123456stone"
"stat.mineBlock.minecraft.DFHFFBSBstone2"
"stat.mineBlock.minecraft.AAAstoneAAAA"
My code:
stoneCombined<-grep("^[stat.mineBlock.minecraft.][a-zA-Z0-9]*?[stone][a-zA-Z0-9]*?", colnames(ingame), ignore.case =T)
where ingame is the dataframe I am searching. However, my code returns a list of numbers instead of the dataframe column names (like those above) that I was expecting. Can someone tell me why?
After adding value=TRUE (Thanks to user227710):
I now get column names, but I get every column in my dataset, not just those that contain stat.mineBlock.minecraft. and stone as I was trying to get.
To return the column names you need to set value=TRUE as an additional argument of grep. The default in grep is value=FALSE, which gives you the indices of the matched colnames.
help("grep")
value
if FALSE, a vector containing the (integer) indices of the matches determined by grep is returned, and if TRUE, a vector containing the matching elements themselves is returned.
grep("your regex pattern", colnames(ingame),value=TRUE, ignore.case =T)
Here is a solution in dplyr:
library(dplyr)
your_df %>%
  select(starts_with("stat.mineBlock.minecraft"))
The more general way to match a column name to a regex is with matches() inside select(). See ?select for more information.
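As a hedged sketch of that variant (using the same hypothetical your_df; matches() takes a regular expression and ignores case by default):
library(dplyr)
your_df %>%
  select(matches("^stat\\.mineBlock\\.minecraft\\..*stone"))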
My answer is based on this SO post. As for the regex, you were very close. The problem is that [] creates a character class matching a single character from the defined set, and that is the main reason it was not working. Also, perl=TRUE is generally safer to use with regex in R.
So, here is my sample code:
df <- data.frame(
  "stat.mineBlock.minecraft.123456stone" = 1,
  "stat.mineBlock.minecraft.DFHFFBSBwater2" = 2,
  "stat.mineBlock.minecraft.DFHFFBSBwater3" = 3,
  "stat.mineBlock.minecraft.DFHFFBSBstone4" = 4
)
grep("^stat\\.mineBlock\\.minecraft\\.[a-zA-Z0-9]*?stone[a-zA-Z0-9]*?", colnames(df), value=TRUE, ignore.case=T, perl=T)

R: replacing special character in multiple columns of a data frame

I am trying to replace the German special character "ö" in a data frame with "oe". The character occurs in multiple columns, so I would like to do this in one go, without having to specify individual columns.
Here is a small example of the data frame
data <- data.frame(a=c("aö","ab","ac"),b=c("bö","bb","ab"),c=c("öc","öb","acö"))
I tried:
data[data=="ö"]<-"oe"
but this did not work, since I would need to work with regular expressions here. However, when I try:
data[grepl("ö",data)]<-"oe"
I do not get what I want.
The dataframe at the end should look like:
> data
    a   b    c
1 aoe boe  oec
2  ab  bb  oeb
3  ac  ab acoe
The data come from a CSV file that I import with read.csv; however, there seems to be no read.csv option that would fix this at import time.
How do I get the desired outcome?
Here's one way to do it:
data <- apply(data, 2, function(x) gsub("ö", "oe", x))
Explanation:
Your grepl doesn't work because grepl just returns a boolean matrix (TRUE/FALSE) corresponding to the elements in your data frame for which the regex matches. The assignment then replaces not just the character you want replaced but each entire matching string. To replace part of a string, you need sub (to replace just the first occurrence in each string) or gsub (to replace all occurrences). To apply that to every column, you loop over the columns using apply.
If you want to return a data frame, you can use:
data.frame(lapply(data, gsub, pattern = "ö", replacement = "oe"))
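One difference worth noting (sketched with the sample data from the question): apply returns a character matrix, so you may want to convert the result back, while the lapply version yields a data frame directly:
data <- data.frame(a=c("aö","ab","ac"), b=c("bö","bb","ab"), c=c("öc","öb","acö"))
# apply() collapses the result to a character matrix:
as.data.frame(apply(data, 2, function(x) gsub("ö", "oe", x)))
# lapply() applies gsub() column by column, keeping a data frame:
data.frame(lapply(data, gsub, pattern = "ö", replacement = "oe"))
Both print the desired result shown in the question.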

Simplest way to find out if at least one cell in a cell array matches a regular expression

I need to search a cell array and return a single boolean value indicating whether any cell matches a regular expression.
For example, suppose I want to find out if the cell array strs contains foo or -foo (case-insensitive). The regular expression I need to pass to regexpi is ^-?foo$.
Sample inputs:
strs={'a','b'} % result is 0
strs={'a','foo'} % result is 1
strs={'a','-FOO'} % result is 1
strs={'a','food'} % result is 0
I came up with the following solution based on How can I implement wildcard at ismember function of matlab? and Searching cell array with regex, but it seems like I should be able to simplify it:
~isempty(find(~cellfun('isempty', regexpi(strs, '^-?foo$'))))
The problem I have is that it looks rather cryptic for such a simple operation. Is there a simpler, more human-readable expression I can use to achieve the same result?
NOTE: The answer refers to the original regexp in the question: '-?foo'
You can avoid the find:
any(~cellfun('isempty', regexpi(strs, '-?foo')))
Another possibility: first concatenate all cells into a single string:
~isempty(regexpi([strs{:}], '-?foo'))
Note that you can remove the "-?" from any of the above - since the "-" is optional and the match is unanchored, '-?foo' and 'foo' match exactly the same strings:
any(~cellfun('isempty', regexpi(strs, 'foo')))
~isempty(regexpi([strs{:}], 'foo'))
And that allows using strfind (with lower) instead of regexpi:
~isempty(strfind(lower([strs{:}]),'foo'))