Remove part of column name - regex

I have a df with column names of a.b.c.v1, d.e.f.v1, h.j.k.v1, and would like to remove v1 from all the column names of df.
I suppose I should use gsub but my trials with that were not successful.

We can use sub to remove the .v1 from the end of each name. Here, we match a dot (\\.) followed by one or more characters that are not a dot ([^.]+) up to the end of the string ($), and replace the match with "". (To remove only 'v1' and keep the dot, drop the \\. from the pattern, although a trailing . in a column name does not look good.)
colnames(df) <- sub('\\.[^.]+$', '', colnames(df))
colnames(df)
#[1] "a.b.c" "d.e.f" "h.j.k"
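To make the choice of pattern concrete, here is a small sketch on a made-up vector of names (cols and the extra name x.y.v2 are illustrative, not from the question). The generic pattern strips whatever follows the last dot, while a literal \\.v1$ only touches names that really end in .v1.

```r
# Illustrative names; "x.y.v2" is added to show where the two patterns differ.
cols <- c("a.b.c.v1", "d.e.f.v1", "x.y.v2")

# Drop the last dot and everything after it, whatever the suffix is.
generic <- sub("\\.[^.]+$", "", cols)

# Drop only a literal ".v1" at the end of the name.
specific <- sub("\\.v1$", "", cols)

generic
#[1] "a.b.c" "d.e.f" "x.y"
specific
#[1] "a.b.c"  "d.e.f"  "x.y.v2"
```

If the data frame may contain columns with other suffixes that should be kept, the literal version is the safer of the two.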

Related

Regex finding all commas between two words

I am trying to clean up a large .csv file that contains many comma-separated words, parts of which I need to consolidate. So I have a subsection where I want to change all the commas to slashes. Let's say my file contains this text:
Foo,bar,spam,eggs,extra,parts,spoon,eggs,sudo,test,example,blah,pool
I want to select all commas between the unique words bar and blah. The idea is to then replace the commas with slashes (using find and replace), such that I get this result:
Foo,bar,spam/eggs/extra/parts/spoon/eggs/sudo/test/example,blah,pool
As per @EganWolf's input:
How do I include words in the search but exclude them from the selection (for the unique words) and how do I then match only the commas between the words?
Thus far I have only managed to select all the text between the unique words including them:
bar,.*,blah, bar:*, *,blah, (bar:.+?,blah)*,*\2
I experimented with negative lookahead but can't get any search results from my statements.
Using Notepad++, you can do:
Ctrl+H
Find what: (?:\bbar,|\G(?!^))\K([^,]*),(?=.+\bblah\b)
Replace with: $1/
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
(?: # start non capture group
\bbar, # word boundary then bar then a comma
| # OR
\G # restart from last match position
(?!^) # negative lookahead, make sure we are not at the beginning of a line
) # end group
\K # forget all we've seen until this position
([^,]*) # group 1, 0 or more non comma
, # a comma
(?= # positive lookahead
.+ # 1 or more any character but newline
\bblah\b # word boundary, blah, word boundary
) # end lookahead
Result for given example:
Foo,bar,spam/eggs/extra/parts/spoon/eggs/sudo/test/example,blah,pool
The following regex will capture the minimally required text to access the commas you want:
(?<=bar,)(.*?(,))*(?=.*?,blah)
If you want to replace the commas, you will need to replace everything in capture group 2. Capture group 0 has your entire match.
An alternative approach would be to split your string by comma to create an array of words. Then join words between bar and blah using / and append the other words joined by ,.
Here is a PowerShell example of split and join:
$a = "Foo,bar,spam,eggs,extra,parts,spoon,eggs,sudo,test,example,blah,pool"
$split = $a -split ","
$slashBegin = $split.indexof("bar")+1
$commaEnd = $split.indexof("blah")-1
$str1 = $split[0..($slashbegin-1)] -join ","
$str2 = $split[($slashbegin)..$commaend] -join "/"
$str3 = $split[($commaend+1)..($split.count-1)] -join ","
@($str1,$str2,$str3) -join ","
Foo,bar,spam/eggs/extra/parts/spoon/eggs/sudo/test/example,blah,pool
This could easily be made into a function with your entire line and keywords as inputs.
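A minimal sketch of such a function, written in R to match the other examples in this collection (the PowerShell answer itself does not provide one). The name slash_between and its arguments are hypothetical. It assumes each keyword appears as a whole comma-separated field and uses the first occurrence of each.

```r
# Hypothetical helper: join the fields strictly between `start` and `end`
# with "/" and leave every other comma in place.
slash_between <- function(line, start, end) {
  words <- strsplit(line, ",", fixed = TRUE)[[1]]
  i <- match(start, words)  # position of the first keyword (NA if absent)
  j <- match(end, words)    # position of the second keyword (NA if absent)

  # Return the line unchanged if a keyword is missing, the keywords are in
  # the wrong order, or there is nothing between them.
  if (is.na(i) || is.na(j) || j - i < 2) return(line)

  middle <- paste(words[(i + 1):(j - 1)], collapse = "/")
  paste(c(words[1:i], middle, words[j:length(words)]), collapse = ",")
}

line <- "Foo,bar,spam,eggs,extra,parts,spoon,eggs,sudo,test,example,blah,pool"
result <- slash_between(line, "bar", "blah")
result
#[1] "Foo,bar,spam/eggs/extra/parts/spoon/eggs/sudo/test/example,blah,pool"
```

For a whole file, the same function could be applied line by line, for example with vapply over the result of readLines.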

Rearrange a character string

I have a character vector where some entries have a certain pattern at the end. I want to remove this pattern from the end and put it in front of the rest.
Example:
#My initial character vector
names <- c("sdadohf abc", "fsdgodhgf abc", "afhk xyz")
> names
[1] "sdadohf abc" "fsdgodhgf abc" "afhk xyz"
#What I want is to move "abc" to the front
> names
[1] "abc sdadohf" "abc fsdgodhgf" "afhk xyz"
Is there an easy way to achieve this, or do I have to write my own function?
First let's add one more string to your vector, one with multiple spaces between the text.
names <- c("sdadohf abc", "fsdgodhgf abc", "afhk xyz", "aksle   abc")
You could use capturing groups in sub().
sub("(.*?)\\s+(abc)$", "\\2 \\1", names)
# [1] "abc sdadohf" "abc fsdgodhgf" "afhk xyz" "abc aksle"
Regex explanation courtesy of regex101:
(.*?) 1st Capturing group - matches any character (except newline) between zero and unlimited times, as few times as possible, expanding as needed
\\s+ matches any white space character [\r\n\t\f ] between one and unlimited times, as many times as possible, giving back as needed
(abc) 2nd Capturing group - abc matches the characters abc literally, and $ asserts position at end of the string
When we swap the groups in "\\2 \\1", we bring the second capturing group abc to the beginning of the string.
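To see what each capturing group actually holds before the swap, one option is to extract the groups with regexec() and regmatches(). This is only an inspection aid, not part of the answer. The three-element vector is illustrative, and perl = TRUE is an assumption made here so that the lazy quantifier is handled by PCRE.

```r
# Illustrative input: the last string has three spaces between the words.
names <- c("sdadohf abc", "afhk xyz", "aksle   abc")

# regexec() locates the full match and each group;
# regmatches() turns those positions into the matched text.
parts <- regmatches(names, regexec("(.*?)\\s+(abc)$", names, perl = TRUE))

parts[[1]]  # full match, group 1, group 2
#[1] "sdadohf abc" "sdadohf"     "abc"
parts[[2]]  # no match, so sub() would leave "afhk xyz" untouched
#character(0)
parts[[3]]  # the run of spaces is consumed by \\s+, not by group 1
#[1] "aksle   abc" "aksle"       "abc"
```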
Thanks to @Jota and @docendodiscimus for helping to improve my original regular expression.
Here is a split method. We split 'names' on one or more whitespace characters (\\s+) that are followed by 'abc' ((?=abc)), loop through the resulting list with vapply, reverse (rev) each element, and paste the pieces back together.
vapply(strsplit(names, "\\s+(?=abc)", perl=TRUE), function(x)
paste(rev(x), collapse=" "), character(1))
#[1] "abc sdadohf" "abc fsdgodhgf" "afhk xyz" "abc aksle"
data
names <- c("sdadohf abc", "fsdgodhgf abc", "afhk xyz", "aksle   abc")
Use this
sub("(.*) \\b(abc)$", "\\2 \\1", names)
.* is a greedy match. It will match as much as it can before finding the string ending with abc.
.* is in first captured group(\\1)
abc is in second captured group(\\2)
We can just interchange their position using \\2 \\1 to find our resultant string
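One consequence of the greedy match with a single literal space is worth checking: when several spaces separate the two parts, the extra spaces end up inside the first group. The sketch below compares this pattern with the lazy \\s+ version from the earlier answer on an illustrative string. perl = TRUE is an assumption added here for predictable behaviour, not part of either answer.

```r
s <- "aksle   abc"  # three spaces between the words (illustrative input)

# Greedy group plus one literal space: group 1 keeps the other two spaces.
greedy <- sub("(.*) \\b(abc)$", "\\2 \\1", s, perl = TRUE)

# Lazy group plus \\s+: the whole run of spaces is matched outside the groups.
lazy <- sub("(.*?)\\s+(abc)$", "\\2 \\1", s, perl = TRUE)

greedy
#[1] "abc aksle  "
lazy
#[1] "abc aksle"
trimws(greedy)  # trimming afterwards gives the same result
#[1] "abc aksle"
```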

R split a character string on the second underscore

I have character strings with two underscores. Like these
c54254_g4545_i5454
c434_g4_i455
c5454_g544_i3
.
.
etc
I need to split these strings at the second underscore, and I am afraid I have no clue how to do that in R (or any other tool, for that matter). I'd be very happy if anyone could help me out here.
Thank you
SM
One way would be to replace the second underscore by another delimiter (i.e. space) using sub and then split using that.
Using sub, we match one or more characters that are not a _ from the beginning (^) of the string (^[^_]+) followed by the first underscore (_) followed by one or more characters that are not a _ ([^_]+). We capture that as a group by placing it inside parentheses ((....)), then we match the second _ followed by the remaining characters up to the end of the string in the second capture group ((.*)$). In the replacement, we separate the first (\\1) and second (\\2) groups with a space.
strsplit(sub('(^[^_]+_[^_]+)_(.*)$', '\\1 \\2', v1), ' ')
#[[1]]
#[1] "c54254_g4545" "i5454"
#[[2]]
#[1] "c434_g4" "i455"
#[[3]]
#[1] "c5454_g544" "i3"
data
v1 <- c('c54254_g4545_i5454', 'c434_g4_i455', 'c5454_g544_i3')
strsplit(sub("(_)(?=[^_]+$)", " ", x, perl=TRUE), " ")  # x holds the strings (v1 above)
#[[1]]
#[1] "c54254_g4545" "i5454"
#
#[[2]]
#[1] "c434_g4" "i455"
#
#[[3]]
#[1] "c5454_g544" "i3"
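With the pattern "(_)(?=[^_]+$)", we match the last underscore, that is, an underscore followed only by non-underscore characters up to the end of the string. Because each string contains exactly two underscores, the last one is the second. That way we only need one capture group.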
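If the two pieces are wanted as separate vectors rather than as a list, a possible variation is to call sub() twice, once per piece. This is a sketch under the same assumption as above (exactly two underscores, so the last underscore is the second one). The names before and after are illustrative.

```r
v1 <- c("c54254_g4545_i5454", "c434_g4_i455", "c5454_g544_i3")

# Remove the last underscore and everything after it.
before <- sub("_[^_]*$", "", v1)

# Remove everything up to and including the last underscore
# (.* is greedy, so it runs to the final "_").
after <- sub("^.*_", "", v1)

before
#[1] "c54254_g4545" "c434_g4"      "c5454_g544"
after
#[1] "i5454" "i455"  "i3"
```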
I did this. However, although it works there may be a 'better' way?
str = 'c110478_g1_i1'
m = strsplit(str, '_')
f <- paste(m[[1]][1],m[[1]][2],sep='_')

remove characters from a string in a data frame

I have a data frame where column "ID" has values like these:
1234567_GSM00298873
1238416_GSM90473673
98377829
In other words, some rows have 7 numbers followed by "_" followed by letters and numbers; other rows have just numbers
I want to remove the numbers and the underscore preceding the letters, without affecting the rows that have only numbers. I tried
dataframe$ID <- gsub("*_", "", dataframe$ID)
but that only removes the underscore. So I learned that * means zero or more.
Is there a wildcard, and a repetition operator such that I can tell it to find the pattern "anything-seven-times-followed-by-_"?
Thanks!
Your regular expression syntax is incorrect. You have nothing preceding your repetition operator.
dataframe$ID <- gsub('[0-9]+_', '', dataframe$ID)
This matches any character of: 0 to 9 (1 or more times) that is followed by an underscore.
Something like this?:
dataframe$ID <- gsub("[0-9]+_", "", dataframe$ID)
The link http://marvin.cs.uidaho.edu/Handouts/regex.html may help you.
"[0-9]*_" will match zero or more digits followed by '_' (so it also matches a bare '_')
"[0-9]{7}_" will match 7 numbers followed by '_'
".{7}_" will match 7 characters followed by '_'
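A short sketch of how these quantifiers behave in practice, using the IDs from the question plus one made-up string ("ab_cd") that has an underscore but no digits before it. The ^ anchor in the first call is an addition not present in the patterns above. It ties the match to the start of the ID.

```r
ID <- c("1234567_GSM00298873", "1238416_GSM90473673", "98377829")

# Exactly seven digits at the start, then an underscore.
exact7 <- sub("^[0-9]{7}_", "", ID)
exact7
#[1] "GSM00298873" "GSM90473673" "98377829"

# "*" allows zero digits, so a bare underscore is removed as well.
star <- sub("[0-9]*_", "", "ab_cd")
star
#[1] "abcd"

# "+" needs at least one digit before the underscore, so nothing matches.
plus <- sub("[0-9]+_", "", "ab_cd")
plus
#[1] "ab_cd"
```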
A different method. If a string has an underscore, return from the underscore to the end of the string; if not, return the string.
ID <- c("1234567_GSM00298873", "1238416_GSM90473673", "98377829")
ifelse(grepl("_", ID), substr(x = ID, 9, nchar(ID)), ID)

remove initial period and text after final period in string

I have a regex edge case that I am unable to solve. I need to grep to remove the leading period (if it exists) and the text following the last period (if it exists) from a string.
That is, given a vector:
x <- c("abc.txt", "abc.com.plist", ".abc.com")
I'd like to get the output:
[1] "abc" "abc.com" "abc"
The first two cases are already solved with help I obtained in this related question. The third case, with the leading ., is not.
I am sure it is trivial, but I'm not making the connections.
This regex does what you want:
^\.+|\.[^.]*$
Replace its matches with the empty string.
In R:
gsub("^\\.+|\\.[^.]*$", "", x, perl=TRUE)
Explanation:
^ # Anchor the match to the start of the string
\.+ # and match one or more dots
| # OR
\. # Match a dot
[^.]* # plus any characters except dots
$ # anchored to the end of the string.
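Applied to the vector from the question, plus two extra strings added here as illustrations ("README" with no period at all, and ".bashrc" with only a leading period), the pattern behaves as follows. For ".bashrc" the first alternative is tried first at the start of the string, so only the leading dot is removed.

```r
# The first three strings come from the question; the last two are illustrative.
x <- c("abc.txt", "abc.com.plist", ".abc.com", "README", ".bashrc")

# gsub (not sub) is needed because one string can contain both a leading
# dot and a final ".something" part, which are two separate matches.
cleaned <- gsub("^\\.+|\\.[^.]*$", "", x, perl = TRUE)
cleaned
#[1] "abc"     "abc.com" "abc"     "README"  "bashrc"
```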