Perform multiple search-and-replaces on the colnames of a dataframe

I have a dataframe with 95 columns and want to batch-rename many of them with simple regexes, like the snippet below; there are ~30 such lines. Any columns that don't match the search regex must be left untouched.
**** Example: names(tr) = c('foo', 'bar', 'xxx_14', 'xxx_2001', 'yyy_76', 'baz', 'zzz_22', ...) ****
I started out with a wall of 25 gsub()s - crude but effective:
names(tr) <- gsub('_1$', '_R', names(tr))
names(tr) <- gsub('_14$', '_I', names(tr))
names(tr) <- gsub('_22$', '_P', names(tr))
names(tr) <- gsub('_50$', '_O', names(tr))
... yada yada
#Joshua: mapply doesn't work; it turns out to be more complicated and impossible to vectorize. names(tr) contains other columns, and when these patterns do occur, you cannot assume that all of them occur, let alone in the exact order we defined them. Hence, try 2 is:
pattern <- paste('_', c('1','14','22','50','52','57','76','1018','2001','3301','6005'), '$', sep='')
replace <- paste('_', c('R','I', 'P', 'O', 'C', 'D', 'M', 'L', 'S', 'K', 'G'), sep='')
do.call(gsub, list(pattern, replace, names(tr)))
Warning messages:
1: In function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE, :
argument 'pattern' has length > 1 and only the first element will be used
2: In function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE, :
argument 'replacement' has length > 1 and only the first element will be used
Can anyone fix this for me?
EDIT: I read all around SO and R doc on this subject for over a day and couldn't find anything... then when I post it I think of searching for '[r] translation table' and I find xlate. Which is not mentioned anywhere in the grep/sub/gsub documentation.
Is there anything in base/gsubfn/data.table etc. to allow me to write one search-and-replacement instruction? (like a dictionary or translation table)
Can you improve my clunky syntax to be call-by-reference to tr? (mustn't create temp copy of entire df)
EDIT2: my best effort after reading around was:
The dictionary approach (xlate) might be a partial answer, but this is more than a simple translation table, since the regex must be anchored at the end of the name (e.g. '_14$').
I could use gsub() or strsplit() to split on '_' then do my xlate translation on the last component, then paste() them back together. Looking for a cleaner 1/2-line idiom.
Or else I just use walls of gsub()s.

A wall of gsub() calls can always be replaced by a for loop, which you can wrap in a function:
renamer <- function(x, pattern, replace) {
  # Apply each pattern/replacement pair in turn
  for (i in seq_along(pattern))
    x <- gsub(pattern[i], replace[i], x)
  x
}
names(tr) <- renamer(
  names(tr),
  sprintf('_%s$', c('1','14','22','50','52','57','76','1018','2001','3301','6005')),
  sprintf('_%s' , c('R','I', 'P', 'O', 'C', 'D', 'M', 'L', 'S', 'K', 'G'))
)
I also find sprintf more useful than paste for building this kind of string.

The question predates the boom of the tidyverse but this is easily solved with the c(pattern1 = replacement1) option in stringr::str_replace_all.
tr <- data.frame("whatevs_1" = NA, "something_52" = NA)
tr
#> whatevs_1 something_52
#> 1 NA NA
patterns <- sprintf('_%s$', c('1','14','22','50','52','57','76','1018','2001','3301','6005'))
replacements <- sprintf('_%s' , c('R','I', 'P', 'O', 'C', 'D', 'M', 'L', 'S', 'K', 'G'))
names(replacements) <- patterns
names(tr) <- stringr::str_replace_all(names(tr), replacements)
tr
#> whatevs_R something_C
#> 1 NA NA
And of course, this particular case can benefit from dplyr (rename_all() has since been superseded by rename_with(), but it still works):
dplyr::rename_all(tr, stringr::str_replace_all, replacements)
#> whatevs_R something_C
#> 1 NA NA
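For readers who think of this as a translation-table problem (the "xlate" idea from the question), here is a minimal sketch of the same end-anchored suffix lookup, written in Python rather than R purely as an illustration. The names SUFFIX_MAP and rename_suffixes are hypothetical; the codes and letters are the ones from the question. One regex captures the trailing _<digits>, and a dictionary decides the replacement, so names whose suffix is not in the table are left untouched.

```python
import re

# Suffix codes and replacement letters taken from the question.
SUFFIX_MAP = dict(zip(
    ['1', '14', '22', '50', '52', '57', '76', '1018', '2001', '3301', '6005'],
    ['R', 'I', 'P', 'O', 'C', 'D', 'M', 'L', 'S', 'K', 'G'],
))

def rename_suffixes(names, table=SUFFIX_MAP):
    """Replace a trailing _<code> with _<letter> when <code> is in the table."""
    def swap(match):
        code = match.group(1)
        # Unknown codes fall back to the original text, so the name is unchanged.
        return '_' + table.get(code, code)
    return [re.sub(r'_(\d+)$', swap, name) for name in names]

print(rename_suffixes(['foo', 'bar', 'xxx_14', 'xxx_2001', 'yyy_76', 'baz', 'zzz_22']))
```

Because the lookup happens once per name, the order in which the codes are listed does not matter, unlike a sequence of gsub() calls.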

Using do.call() nearly does it, but it objects to the differing argument lengths. I think I need to nest do.call() inside apply(), as in "apply function to elements over a list".
But I need a partial do.call() over pattern and replace.
This is all starting to make a wall of gsub(..., fixed=TRUE) look like a more efficient idiom, if flabby code.
pattern <- paste('_', c('1','14','22','50'), '$', sep='')
replace <- paste('_', c('R','I', 'P', 'O'), sep='')
do.call(gsub, list(pattern, replace, names(tr)))
Warning messages:
1: In function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE, :
argument 'pattern' has length > 1 and only the first element will be used
2: In function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE, :
argument 'replacement' has length > 1 and only the first element will be used

Related

Alternative for matching specific characters in regex [duplicate]

This question already has answers here:
Optimize long lists of fixed string alternatives in regex
(2 answers)
Closed 6 months ago.
I have made a regex for matching the specific letters:
a, ae, eo, e, eu, ya, yae, yeo, ye, yo, o, oe, wa, wae, wo, we, wi, yu, u, ui, i, oo, ah
This is the solution I came up with: a[eh]||e[ou]|o[eo]|u[i]|w[aoei]|y[aeou]|[aeiou]. Is there an alternative that would improve its performance, or a better solution altogether?

I don't think there is a fundamentally different way to write a regex for this kind of pattern, but you can build it dynamically from the list. Put the longer alternatives first; otherwise 'a' matches before 'ae' ever gets a chance:
arr = ['a', 'ae', 'eo', 'e', 'eu', 'ya', 'yae', 'yeo', 'ye', 'yo', 'o', 'oe', 'wa', 'wae', 'wo', 'we', 'wi', 'yu', 'u', 'ui', 'i', 'oo', 'ah'];
// Sort a copy longest-first, then join the alternatives with '|'
s = arr.slice().sort((a, b) => b.length - a.length).join('|')
reg = new RegExp(s, 'gi')
// /yae|yeo|wae|ae|eo|eu|ya|ye|yo|oe|wa|wo|we|wi|yu|ui|oo|ah|a|e|o|u|i/gi

Replace element in string after first occurrence

I wish to replace all 2's in a string after the first occurrence of a 2, ideally using regex in base R. This seems like it must be a duplicate, but I cannot locate the answer.
Here is an example:
my.data <- read.table(text='
my.string
.1.222.2.2
..1..1..2.
1.1.2.2...
.222.232..
..1..1....
', header=TRUE, stringsAsFactors = FALSE)
my.data
desired.result <- read.table(text='
my.string
.1.2......
..1..1..2.
1.1.2.....
.2....3...
..1..1....
', header=TRUE, stringsAsFactors = FALSE)
desired.result
my.last.2 <- c(4, 9, 5, 2, NA)
my.last.2
Thank you for any assistance.

This appears to match your desired output:
> gsub(pattern = "(?<=2)(.*?)2",
replacement = "\\1\\.",
x = my.data$my.string,
perl = TRUE)
[1] ".1.2......" "..1..1..2." "1.1.2....." ".2....3..." "..1..1...."
This is a direct modification of this answer to a very similar question, made R-specific. I'll be honest, I don't quite understand this regex, so use (and up-vote) with caution.

This works, but is probably inefficient:
with(my.data, gsub("#", "2", gsub("2", ".", sub("2", "#", my.string))))
# [1] ".1.2......" "..1..1..2." "1.1.2....." ".2....3..." "..1..1...."
Approach: Use sub to match only the first occurrence and change it to # (or some other placeholder character that doesn't show up elsewhere in my.string), then use gsub to replace all remaining 2s, then gsub the # back into 2.
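The "keep the first match, blank out the rest" logic can also be expressed without a placeholder character by splitting the string at the first occurrence. The sketch below uses Python rather than R, only to illustrate the idea; the function name keep_first is hypothetical, and the test strings are the ones from the question.

```python
def keep_first(s, ch='2', fill='.'):
    """Keep the first occurrence of ch and replace every later one with fill."""
    # partition() splits at the first occurrence; if ch is absent,
    # it returns (s, '', '') and the string comes back unchanged.
    head, sep, tail = s.partition(ch)
    return head + sep + tail.replace(ch, fill)

strings = ['.1.222.2.2', '..1..1..2.', '1.1.2.2...', '.222.232..', '..1..1....']
print([keep_first(s) for s in strings])
```

Since nothing is temporarily substituted into the string, there is no need to find a character that is guaranteed not to appear in the data.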

Convert a set of numbers into a word

I need to convert a given string of numbers to the word those numbers correspond to. For example:
>>>number_to_word ('222 2 333 33')
'CAFE'
The numbers work like they do on a cell phone: you press the second button once and you get an 'A', you press it twice and you get a 'B', etc. If I want the letter 'E', I have to press the third button twice.
I would like some help understanding the easiest way to write this function. I have thought of creating a dictionary with the letter as the key and the number as the value, like this:
dic={'A':'2', 'B':'22', 'C':'222', 'D':'3', 'E':'33',etc...}
And then using a 'for' loop to read all the numbers in the string, but I do not know how to start.

You need to reverse your dictionary:
def number_to_word(number):
    dic = {'2': 'A', '22': 'B', '222': 'C', '3': 'D', '33': 'E', '333': 'F'}
    return ''.join(dic[n] for n in number.split())
>>> number_to_word('222 2 333 33')
'CAFE'
Let's start inside out. number.split() splits the text with your number at white space characters:
>>> number = '222 2 333 33'
>>> number.split()
['222', '2', '333', '33']
We use a generator expression ((dic[n] for n in number.split())) to find the letter for each number. Here is a list comprehension that does nearly the same but also shows the result as a list:
>>> [dic[n] for n in number.split()]
['C', 'A', 'F', 'E']
This lets n run through all elements in the list with the numbers and uses n as the key in the dictionary dic to get the corresponding letter.
Finally, we use the method join() with an empty string as the separator to turn the list into a string:
>>> ''.join([dic[n] for n in number.split()])
'CAFE'
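The dictionary above only covers the keys 2 and 3. Rather than typing out every entry, the reversed table can be generated from the keypad layout. This is a minimal sketch that assumes the standard phone keypad letter groups; the names KEYPAD and build_table are hypothetical and not part of the original answer.

```python
# Assumed standard phone keypad layout (keys 2 through 9).
KEYPAD = {
    '2': 'ABC', '3': 'DEF', '4': 'GHI', '5': 'JKL',
    '6': 'MNO', '7': 'PQRS', '8': 'TUV', '9': 'WXYZ',
}

def build_table(keypad=KEYPAD):
    """Map each press sequence ('2', '22', '222', ...) to its letter."""
    return {
        digit * presses: letter
        for digit, letters in keypad.items()
        for presses, letter in enumerate(letters, start=1)
    }

def number_to_word(number, table=None):
    """Translate whitespace-separated press sequences into a word."""
    if table is None:
        table = build_table()
    return ''.join(table[group] for group in number.split())

print(number_to_word('222 2 333 33'))
```

An unknown group such as '2222' raises a KeyError here, which seems a reasonable way to flag invalid input in a sketch like this.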

Python- How do I sort a list that the script is building to replicate another word?

I'm trying to implement a hangman game. I want part of the function to check whether a letter is correct or incorrect. After a letter is found to be correct, it is placed in a "used letters" list and a "correct letters" list. The correct letters list is built up as the game goes on. I'd like the list to be sorted to match the hidden word as the game progresses.
For instance let's use the word "hardware"
If someone guessed "e, a, and h" it would come out like
correct = ["e", "a", "h"]
I would like it to sort the list so it would go
correct = ["h", "a", "e"]
then
correct = ["h", "a", "r", "a", "e"] after r has been guessed.
I also need to know if it would also see that "a" is in there twice and place it twice.
My code doesn't allow you to win yet, but you can lose. It's a work in progress.
I also can't get the letters-left counter to work. I made the code print the list to check whether it was adding the letters, and it is, so I don't know what's wrong there.
def hangman():
    correct = []
    guessed = []
    guess = ""
    words = ["source", "alpha", "patch", "system"]
    sWord = random.choice(words)
    wLen = len(sWord)
    cLen = len(correct)
    remaining = int(wLen - cLen)
    print "Welcome to hangman.\n"
    print "You've got 3 tries or the guy dies."
    turns = 3
    while turns > 0:
        guess = str(raw_input("Take a guess. >"))
        if guess in sWord:
            correct.append(guess)
            guessed.append(guess)
            print "Great!, %d letters left." % remaining
        else:
            print "Incorrect, this poor guy's life is in your hands."
            guessed.append(guess)
            turns -= 1
            print "You have %d turns left." % turns
        if turns == 0:
            print "HE'S DEAD AND IT'S ALL YOUR FAULT! ARE YOU HAPPY?"
            print "YOU LOST ON PURPOSE, DIDN'T YOU?!"
hangman()

I'm not entirely clear on the desired behavior because:
correct = ["h", "a", "r", "a", "e"] after r has been guessed.
This is strange because a has only been guessed once, but it shows up once for each time it appears in hardware. Should r also appear twice? If that is the correct behavior, then a very simple list comprehension can be used:
def result(guesses, key):
    print [c for c in key if c in guesses]
In [560]: result('eah', 'hardware')
['h', 'a', 'a', 'e']
In [561]: result('eahr', 'hardware')
['h', 'a', 'r', 'a', 'r', 'e']
Iterate the letters in key and include them if the letter has been used as a "guess".
You can also have it insert a place holder for unfound characters fairly easily by using:
def result(guesses, key):
    print [c if c in guesses else '_' for c in key]
    print ' '.join([c if c in guesses else '_' for c in key])
In [567]: result('eah', 'hardware')
['h', 'a', '_', '_', '_', 'a', '_', 'e']
h a _ _ _ a _ e
In [568]: result('eahr', 'hardware')
['h', 'a', 'r', '_', '_', 'a', 'r', 'e']
h a r _ _ a r e
In [569]: result('eahrzw12', 'hardware')
['h', 'a', 'r', '_', 'w', 'a', 'r', 'e']
h a r _ w a r e
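The same comprehension also gives a working "letters left" counter and a win check, both of which the question mentions as missing. In the question's code, remaining is computed once before the loop, so it never changes. The sketch below is written for Python 3 (the thread itself uses Python 2 print statements), and the helper names reveal, letters_left and has_won are hypothetical.

```python
def reveal(word, guessed):
    """Return the word as a list, with unguessed letters shown as '_'."""
    return [c if c in guessed else '_' for c in word]

def letters_left(word, guessed):
    """Count the positions in the word that are still hidden."""
    return sum(1 for c in word if c not in guessed)

def has_won(word, guessed):
    """The player wins once no position is hidden."""
    return letters_left(word, guessed) == 0

guessed = ['e', 'a', 'h']
print(' '.join(reveal('hardware', guessed)))  # recompute after every guess
print(letters_left('hardware', guessed))
```

Calling letters_left inside the game loop after each guess, instead of once up front, keeps the counter in sync with the guesses.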

Concatenate gsub [duplicate]

This question already has answers here:
Replace multiple letters with accents with gsub
(11 answers)
Closed 9 years ago.
I'm currently running the following code to clean my data from accent characters:
df <- gsub('Á|Ã', 'A', df)
df <- gsub('É|Ê', 'E', df)
df <- gsub('Í', 'I', df)
df <- gsub('Ó|Õ', 'O', df)
df <- gsub('Ú', 'U', df)
df <- gsub('Ç', 'C', df)
However, I would like to do it in just one line (using another function for it would be ok). How can I do this?

Try something like this
iconv(c('Á'), "utf8", "ASCII//TRANSLIT")
You can just add more elements to the c().
EDIT: it is machine dependent, check help(iconv)
Here is the R solution
mychar <- c('ÁÃÉÊÍÓÕÚÇ')
iconv(mychar, "latin1", "ASCII//TRANSLIT") # one line, as requested
[1] "AAEEIOOUC"
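Since iconv transliteration depends on the platform, it may help to see what "stripping accents" amounts to at the Unicode level. The sketch below is in Python rather than R and is only an illustration of the idea: decompose each character into a base letter plus combining marks, then drop the marks. The function name strip_accents is hypothetical.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize('NFD', text)
    # Category 'Mn' (nonspacing mark) covers the acute, tilde,
    # circumflex and cedilla marks produced by the decomposition.
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')

print(strip_accents('ÁÃÉÊÍÓÕÚÇ'))
```

This only handles characters that decompose into a base letter plus marks; letters such as 'ø' or 'ß' have no such decomposition and pass through unchanged.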

It's an encoding problem. Normally you resolve it by specifying the right encoding. If you still want to use regular expressions, you can use gsubfn to write a one-liner solution:
library(gsubfn)
ll <- list('Á'='A', 'Ã'='A', 'É'='E',
'Ê'='E', 'Í'='I', 'Ó'='O',
'Õ'='O', 'Ú'='U', 'Ç'='C')
gsubfn('Á|Ã|É|Ê|Í|Ó|Õ|Ú|Ç',ll,'ÁÃÉÊÍÓÕÚÇ')
[1] "AAEEIOOUC"
gsubfn('Á|Ã|É|Ê|Í|Ó|Õ|Ú|Ç',ll,c('ÁÃÉÊÍÓÕÚÇ','ÍÓÕÚÇ'))
[1] "AAEEIOOUC" "IOOUC"

One option could be chartr
> toreplace <- LETTERS
> replacewith <- letters
> (somestring <- paste(sample(LETTERS,10),collapse=""))
[1] "MUXJVYNZQH"
>
> chartr(
+ old=paste(toreplace,collapse=""),
+ new=paste(replacewith,collapse=""),
+ x=somestring
+ )
[1] "muxjvynzqh"

df = as.data.frame(apply(df,2,function(x) gsub('Á|Ã', 'A', x)))
In apply(), a MARGIN of 2 means columns and 1 means rows.
