Regex in R: match everything but not "some string" [duplicate]

This question already has answers here:
How can I remove all objects but one from the workspace in R?
(14 answers)
Remove all punctuation except apostrophes in R
(4 answers)
Closed 9 years ago.
The answers to another question explain how to match a string not containing a word.
The problem (for me) is that the solutions given don't work in R.
Often I create a data.frame() from existing vectors and want to clean up my workspace. So for example, if my workspace contains:
> ls()
[1] "A" "B" "dat" "V"
>
and I want to retain only dat, I'd have to clean it up with:
> rm(list=ls(pattern="A"))
> rm(list=ls(pattern="B"))
> rm(list=ls(pattern="V"))
> ls()
[1] "dat"
>
(where A, B, and V are just examples of a large number of complicated names like my.first.vector that are not easy to match with rm(list=ls(pattern="[ABV]"))).
It would be most convenient (for me) to tell rm() to remove everything except dat, but the problem is that the solution given in the linked Q&A does not work:
> rm(list=ls(pattern="^((?!dat).)*$"))
Error in grep(pattern, all.names, value = TRUE) :
invalid regular expression '^((?!dat).)*$', reason 'Invalid regexp'
>
So how can I match everything except dat in R?

This will remove all objects except dat. (Use the ls argument all.names = TRUE if you want to remove objects whose names begin with a dot as well.)
rm( list = setdiff( ls(), "dat" ) )
Replace "dat" with a vector of names, e.g. c("dat", "some.other.object"), if you want to retain several objects. If the objects to keep can all be readily matched by a regular expression, try something like this, which removes all objects whose names do not start with "dat":
rm( list = setdiff( ls(), ls( pattern = "^dat" ) ) )
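As a minimal sketch of the setdiff() approach, the example below works in a scratch environment (env, a name introduced here purely for illustration), so trying it out does not delete anything from your real workspace. ls() and rm() both take an envir argument; for the global workspace you would simply leave it out.

```r
# Scratch environment standing in for the workspace (hypothetical name).
env <- new.env()
for (nm in c("A", "B", "V", "dat", "my.first.vector")) {
  assign(nm, 1, envir = env)
}

# Names to keep; every other object in env is removed.
keep <- "dat"
rm(list = setdiff(ls(envir = env), keep), envir = env)

remaining <- ls(envir = env)
print(remaining)   # [1] "dat"
```

Because keep is an ordinary character vector, retaining several objects only means listing more names in it.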
Another approach is to save the data, save("dat", file = "dat.RData"), exit R, start a new R session and load the data, load("dat.RData"). Also note this method of restarting R.

Negative look-around requires the perl=TRUE argument in R, and ls() has no such argument, so you won't be able to use that regular expression directly with ls(pattern = ...). Alternatively, you can do:
rm(list = grep("^((?!dat).)*$", ls(), perl=TRUE, value=TRUE))
This is if you're looking for inexact matches (it keeps every object whose name contains dat). If you're looking for an exact match, you should just do what Ferdinand has commented:
rm(list=ls()[ls() != "dat"])
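To make the difference between the inexact and exact variants concrete, here is a small sketch on a plain character vector instead of the real workspace. The names in objs are made up for illustration. It also shows grep's invert argument, which gives the inexact result without any look-around.

```r
objs <- c("A", "B", "V", "dat", "dat2", "my.data")

# Inexact: select every name that does NOT contain "dat" anywhere.
no_dat_lookahead <- grep("^((?!dat).)*$", objs, perl = TRUE, value = TRUE)

# Same selection without look-around: invert an ordinary match.
no_dat_invert <- grep("dat", objs, invert = TRUE, value = TRUE)

# Exact: select every name that is not exactly "dat".
not_exactly_dat <- objs[objs != "dat"]

print(no_dat_lookahead)   # [1] "A" "B" "V"
print(not_exactly_dat)    # [1] "A"       "B"       "V"       "dat2"    "my.data"
```

Note that the inexact pattern also spares my.data and dat2 from removal, because both names contain dat.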

Related

how to parse the key value pair with regex in C++ [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 2 years ago.
I have some string with such format:
aaaaaaaaaaaa //first line
[key = [metadata = 1 metadata = 2 metadata =3] KEY(1) = 100
KEY(2) = 16:30:00 KEY(3) = 2020-12-12 08:30:30 KEY(4) = 0]
I want to get the key value pairs in Json format like
{"KEY(1)":"100", "KEY(2)":"16:30:00", "KEY(3)":"2020-12-12 08:30:30", "KEY(4)":"0"}
I am struggling with the last part, because a value can also contain a space, like 2020-12-12 08:30:30. The only way I can think of is to find each "=": the data between the first space and the second space on its left is the current key, and everything before that, back to the previous "=", is the value for the previous key. That is tricky, and I am new to regex. How should I do it? Thanks!
I would not try to use a regex to do this.
There are difficulties you haven't considered yet. For example, the quoted string can contain an =, or (worse) it can contain a quote mark, so something like this is unusual, but seems to be legitimate:
{ "\"key\"=\"value\"" = "This is the value"}
When you're done parsing it, the key in this case will be "key" = "value" (with the quote marks and equal sign included in the string).
So not only do you need to recognize the beginning and end of each part of what you're dealing with, but in some cases you need to do some transformations on it to get the correct string.
Now, I'm not going to say this can't be done using a regex--but I think (at best) developing a regex that will work correctly will be more trouble than it's worth.

matlab regexp exclude specific set of file extensions [duplicate]

This question already has answers here:
Regex: match everything but a specific pattern
(6 answers)
Closed 2 years ago.
I want to exclude a set of file extensions and otherwise list folder contents.
%get filenames in current directory
p = dir(pwd);
p = {p.name};
p = p(1:min(end,20))'
%construct regular expression
%exclude = {'ini','m'}; %just for your convenience
reg = '\.(^ini|m)$';
%actually print file names/paths of files without a certain extension
regexpi(p,reg,'match','once')
This, however, does not work. How can I list the files that do not have these file extensions (the last X characters of the path)? I tried [^abc], but this excludes individual characters, which I don't want. Please use regexp or regexprep in your answer.
You wrote:
reg = '\.(^ini|m)$';
The ^ caret anchor comes after a . dot,
so it will never match start-of-string.
Remove it FTW:
reg = '\.(ini|m)$';

How to find a specific string followed by a number, with any number of characters between? [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 6 years ago.
I'm trying to write a regex for the following pattern:
[MyLiteralString][0 or more characters without restriction][at least 1 digit]
I thought this should do it:
(theColumnName)[\s\S]*[\d]+
As it looks for the literal string theColumnName, followed by any number of characters (whitespace or otherwise), and then at least one digit. But this matches more than I want, as you can see here:
https://www.regex101.com/r/HBsst1/1
(EDIT) Second set of more complex data - https://www.regex101.com/r/h7PCv7/1
Using the sample data in that link, I want the regex to identify the two occurrences of theColumnName] VARCHAR(10) and nothing more.
I have 300+ SQL scripts which contain create statements for every type of database object: procedures, tables, triggers, indexes, functions -- everything. Because of that, I can't be too strict with my regex.
A stored procedure's file might include text like LEFT(theColumnName, 10) which I want to identify.
A create table statement would be like theColumnName VARCHAR(12).
So it needs to be very flexible as the number(s) isn't always the same. Sometimes it's 10, sometimes it's 12, sometimes it's 51 -- all kinds of different numbers.
Basically, I'm looking for the regex equivalent of this C# code:
//Get file data
string[] lines = File.ReadAllLines(filePath);
//Let's assume the first line contains 'theColumnName'
int theColumnNameIndex = lines[0].IndexOf("theColumnName");
if (theColumnNameIndex >= 0)
{
    //Get the text following 'theColumnName'
    string temp = lines[0].Remove(0, theColumnNameIndex + "theColumnName".Length);
    //Iterate over our substring
    foreach (char c in temp)
    {
        if (Char.IsDigit(c))
        {
            //do a thing
        }
    }
}
(theColumnName).*?[\d]+
That'll make it stop capturing after the first number it sees.
The difference between * and *? is about greediness vs. laziness. .*\d for example would match abcd12ad4 in abcd12ad4, whereas .*?\d would have its first match as abcd1. Check out this page for more info.
Btw, if you don't want to match newlines, use a . (period) instead of [\s\S]
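The greedy versus lazy behaviour described in this answer can be checked with a short sketch. It is written in R rather than C#, purely for illustration, and the sql line is a made-up example rather than data from the linked regex101 pages.

```r
x <- "abcd12ad4"

# Greedy: .* takes as much as it can, then backs off to the last digit.
greedy <- regmatches(x, regexpr(".*\\d", x, perl = TRUE))

# Lazy: .*? takes as little as it can, so it stops at the first digit.
lazy <- regmatches(x, regexpr(".*?\\d", x, perl = TRUE))

# Hypothetical SQL-like line: the lazy version stops after the first number.
sql <- "[theColumnName] VARCHAR(10) NOT NULL, [other] INT -- 5 rows"
hit <- regmatches(sql, regexpr("theColumnName.*?\\d+", sql, perl = TRUE))

print(greedy)   # [1] "abcd12ad4"
print(lazy)     # [1] "abcd1"
print(hit)      # [1] "theColumnName] VARCHAR(10"
```

The trailing \d+ is still greedy, which is why the whole number 10 is captured rather than only its first digit.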

Understanding `regexp` in R [duplicate]

This question already has answers here:
Extract all numbers from a single string in R
(4 answers)
Closed 8 years ago.
Understanding regular expressions can sometimes be troublesome, especially if you're not really familiar with writing them, like myself.
In R there are a couple of built-in functions (base package) which I would like to understand and be able to use, such as
grep and gsub, which take as arguments (p, x), where p is a pattern and x is a character vector to search. The strsplit function also takes a regexp as an argument, like many others.
Anyway, I have an example such as:
string <- "39 22' 19'' N"
and I need to be able to extract the numbers from it. Using the stringr, iterators, and foreach libraries, I am trying to figure out an expression using either iter or foreach.
str_locate(string, "[0-9]+") locates and z <- str_extract(string, "[0-9]+") extracts only the first match in my string.
I have tried making something like
x <- iter(z)
nextElem(x)
but it doesn't work. Here is another attempt, which doesn't work either:
a <- foreach(iter(z))
a
How should I fix this using the above libraries?
Thanks.
Check http://cran.r-project.org/web/packages/stringr/stringr.pdf
str_extract_all(your_string, "[0-9]+")
You get the same matches with base functions:
strsplit(gsub("(\\D+)"," ", string), " ")
This is another way to do it in base R:
string <- "39 22' 19'' N"
regmatches(string,gregexpr("[0-9]+",string))
# [[1]]
# [1] "39" "22" "19"
Note that regmatches(...) returns a list where each element is a char vector with the matches. So to get just the char vector you would use:
regmatches(string,gregexpr("[0-9]+",string))[[1]]
# [1] "39" "22" "19"
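Since the goal is to get numbers out of the string, a natural follow-up is converting the matches with as.numeric. The sketch below assumes several coordinate strings are processed at once; the second string is made up for illustration.

```r
coords <- c("39 22' 19'' N", "21 44' 05'' E")   # second value is hypothetical

# gregexpr finds every match; regmatches returns one character vector per input.
parts <- regmatches(coords, gregexpr("[0-9]+", coords))

# Convert each vector of digit strings to numbers.
nums <- lapply(parts, as.numeric)

print(nums[[1]])   # [1] 39 22 19
```

The result stays a list because each input string may contain a different number of matches.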

r gsub and regex, obtaining y*_x* from y*_x*_xxxx.csv

General situation: I am currently trying to name the data frames inside a list after the csv files they were read from, and I found that using gsub and regex is the way to go. Unfortunately, I can't produce exactly what I need, only something close.
I would be very grateful for some hints from someone more experienced. Maybe there is a reasonable R regex cheat sheet?
Files are named like r2_m1_enzyme.csv; the script should use the first 5 characters to name the corresponding data frame r2_m1, and so on.
# generates a list of dataframes, to mimic a lapply(f,read.csv) output:
data <- list(data.frame(c(1,2)),data.frame(c(1,2)),data.frame(c(1,2)),data.frame(c(1,2)))
# this mimics file names obtained by list.files() function
f <-c("r1_m1_enzyme.csv","r2_m1_enzyme.csv","r1_m2_enzyme.csv","r2_m2_enzyme.csv")
# this should name the data frames according to the csv file they have been derived from
names(data) <- gsub("r*_m*_.*","\\1", f)
but it doesn't work as expected: the data frames are named r2_m1_enzyme.csv instead of the desired r2_m1, although .* should stop it?
If I do:
names(data) <- gsub("r*_.*","\\1", f)
I do get r1, r2, r3 ... but I am missing my second index.
The question: what regular expression would allow me to obtain the strings “r1_m1”, “r2_m1”, “r1_m2”, ... from strings that are named r*_m*_xyz.csv?
Search history: R regex use * for only one character, Gsub regex replacement, R using parts of filename to name dataframe, R regex cheat sheet,...
If your names are always five characters long you could use substr:
substr(f, 1, 5)
If you want to use gsub, you have to group your expression (via ( and )), because \\1 refers to the first group and inserts its content, e.g.:
gsub("^(r[0-9]+_m[0-9]+).*", "\\1", f)
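For context on why the original attempts behaved as they did: in a regular expression, * is a quantifier meaning "zero or more of the preceding item", not a shell-style wildcard. The pattern r*_m*_.* therefore looks for zero or more r characters, an underscore, zero or more m characters and another underscore, a sequence that never occurs in these file names, so the strings come back unchanged. A minimal end-to-end sketch, reusing the question's f and mimicking its data list, might look like this (pattern is a name introduced here for illustration):

```r
f <- c("r1_m1_enzyme.csv", "r2_m1_enzyme.csv",
       "r1_m2_enzyme.csv", "r2_m2_enzyme.csv")
data <- rep(list(data.frame(x = c(1, 2))), length(f))

# Capture "r<digits>_m<digits>" and discard the rest of the file name.
pattern <- "^(r[0-9]+_m[0-9]+)_.*$"

# sub() returns non-matching strings unchanged, so check first
# instead of silently keeping full file names.
stopifnot(all(grepl(pattern, f)))

names(data) <- sub(pattern, "\\1", f)
print(names(data))   # [1] "r1_m1" "r2_m1" "r1_m2" "r2_m2"
```

Unlike substr(f, 1, 5), this keeps working if the indices grow to two digits, e.g. r10_m2_enzyme.csv.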