I am trying to replace the > and < characters in column-name strings using R:
datanames<-names(data)
datanames
## [1] BbMx>2.5 BbAv>2.5 BbMx<2.5 BbAv<2.5
datanames<-gsub("[>]","gt",datanames)
datanames<-gsub("[<]","lt",datanames)
datanames<-gsub("[.]","",datanames)
datanames
## [1] BbMx25 BbAv25 BbMx251 BbAv251
What am I doing wrong?
UPDATE: For some strange reason R doesn't read the same characters from the csv. When I open the csv in LibreOffice I see
"BbMx>2.5" "BbAv>2.5" "BbMx<2.5" "BbAv<2.5"
but once R reads the csv it turns these strings into
"BbMx.2.5" "BbAv.2.5" "BbMx.2.5.1" "BbAv.2.5.1"
If you just do
x <- c("BbMx>2.5","BbAv>2.5","BbMx<2.5","BbAv<2.5")
x <- gsub("[>]","gt",x)
x <- gsub("[<]","lt",x)
x <- gsub("[.]","",x)
You should get
"BbMxgt25" "BbAvgt25" "BbMxlt25" "BbAvlt25"
as expected. The problem is that the input from names(data) isn't what you think it is.
R has rules about valid column names in data.frames. R will run make.names on those values to attempt to make unique, valid names. This involves replacing non-alphanumeric characters with periods and adding suffixes to ensure uniqueness.
To disable the auto-renaming, you can set check.names=FALSE in the read.table/read.csv call and do the renaming yourself.
So if you have
x<-c("BbMx>2.5", "BbAv>2.5", "BbMx<2.5","BbAv<2.5" )
Then
make.names(x, unique=T)
# [1] "BbMx.2.5" "BbAv.2.5" "BbMx.2.5.1" "BbAv.2.5.1"
So ultimately this had nothing to do with gsub. This was really about how R transforms raw data into data.frames.
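For completeness, a minimal sketch of the check.names=FALSE route, assuming the data came from a file named "odds.csv" (a hypothetical name):
data <- read.csv("odds.csv", check.names = FALSE)
names(data)
# [1] "BbMx>2.5" "BbAv>2.5" "BbMx<2.5" "BbAv<2.5"
datanames <- gsub(">", "gt", names(data))
datanames <- gsub("<", "lt", datanames)
names(data) <- gsub(".", "", datanames, fixed = TRUE)  # fixed = TRUE: literal dot
names(data)
# [1] "BbMxgt25" "BbAvgt25" "BbMxlt25" "BbAvlt25"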
I know @MrFlick has already provided an answer, but just to comment on the way you are writing your gsub calls: < and > are not characters with special meaning in regular expressions, so you do not need to place them inside a character class [ ]; you can use them as literals.
You can also cascade your gsub calls together here:
datanames <- gsub('>', 'gt', gsub('<', 'lt', gsub('\\.', '', datanames)))
I have a factor with levels which represent intervals (as produced by cut):
> head(data.train$glucose)
[1] [0,126] [0,126] (126,199] [0,126] [0,126] [0,126]
Levels: [0,126] (126,199]
Now I want to generate a new factor with the same levels from a numeric vector so that when the respective number falls into the first interval (e.g. 24) it becomes [0,126] and if it falls into the second interval (e.g. 153) it becomes (126,199].
The number of intervals can differ as can the form of the brackets (depending on whether they are open or closed intervals).
I think I can use sub together with cut for that (as in the last example in the help file of cut), but I am not good enough at it to make it general enough. Is there another way that is a bit more intuitive? But perhaps I am overcomplicating things anyway...
If you give a solution with sub, please explain the expression. Please also do not give solutions using functions from other packages, as I am developing a package myself and want to keep it as lean as possible.
I was looking for an elegant way to do this, but ended up using regex like you suggested:
ints<-cut(1:10,5)
set.seed(345)
a<-runif(20,1,10)
# get levels
levs <- levels(ints)
# remove brackets
levs.num <- sub( "^[\\(\\[]{1}(.+)[\\)\\]]{1}$" , "\\1" ,levs , perl = TRUE)
levs.right <- sub( "^[\\(\\[]{1}.+([\\)\\]]{1})$" , "\\1" ,levs , perl = TRUE)
levs.left <- sub( "^([\\(\\[]{1}).+[\\)\\]]{1}$" , "\\1" ,levs , perl = TRUE)
# get breaks
breaks <- unique(as.numeric(unlist(strsplit(levs.num ,","))))
if (all(levs.right == "]")) {
  right.arg <- TRUE
} else if (all(levs.left == "[")) {
  right.arg <- FALSE
} else {
  stop("problem")
}
table(cut(a,breaks , right = right.arg ))
The regex selects everything between an opening [ or ( and a closing ] or ) and returns it via the backreference \1.
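Wrapping the steps above into a reusable helper might look like the sketch below; the function name recut is my own, and it assumes the template factor's level labels parse back to the original break points (which holds for cut's default labels):
recut <- function(x, template) {
  levs <- levels(template)
  # strip the surrounding brackets, leaving "lower,upper"
  levs.num <- sub("^[\\(\\[](.+)[\\)\\]]$", "\\1", levs, perl = TRUE)
  breaks <- unique(as.numeric(unlist(strsplit(levs.num, ","))))
  # intervals are closed on the right if every level ends with "]"
  right.arg <- all(substr(levs, nchar(levs), nchar(levs)) == "]")
  cut(x, breaks, right = right.arg)
}
table(recut(a, ints))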
I'm trying to import some publicly available life outcomes data using the code below:
require(gdata)
# Source SIMD12 data zone level data
simd.sg.xls <- read.xls(xls = "http://www.gov.scot/Resource/0044/00447385.xls",
sheet = "Quick Lookup", verbose = TRUE)
Unfortunately, the imported data frame doesn't look good: the column names are meaningless and the real header sits in the data rows.
I would like to amend my column names using the code below:
# Clean column names
names(simd.sg.xls) <- make.names(names = as.character(simd.sg.xls[1,]),
unique = TRUE,allow_ = TRUE)
But it produces rather unpleasant results:
> names(simd.sg.xls)
[1] "X1" "X1.1" "X771" "X354" "X229" "X74" "X67" "X33" "X19" "X1.2"
[11] "X6" "X1.3" "X8" "X7" "X7.1" "X6506" "X21" "X1.4" "X6158" "X6506.1"
[21] "X6506.2" "X6506.3" "X6263" "X6506.4" "X6468" "X1010" "X815" "X99" "X58" "X65"
[31] "X60" "X6506.5" "X21.1" "X1.5" "X6173" "X5842" "X6506.6" "X6506.7" "X6263.1" "X6506.8"
[41] "X6481" "X883" "X728" "X112" "X69" "X56" "X54" "X6506.9" "X21.2" "X1.6"
[51] "X6143" "X5651" "X6506.10" "X6506.11" "X6263.2" "X6506.12" "X6480" "X777" "X647" "X434"
[61] "X518" "X246" "X436" "X6506.13" "X21.3" "X1.7" "X6136" "X5677" "X6506.14" "X6506.15"
[71] "X6263.3" "X6506.16" "X660" "X567" "X480" "X557" "X261" "X456"
My question is whether there is a way to neatly force the values from the first row into the column names. As I'm processing a lot of data I'm looking for a solution that is easily reproducible. I can tolerate substantial changes to the actual strings to get syntactically valid names, but ideally I would avoid fiddling with elaborate regular expressions, as I often read files like the one linked here and don't want to be forced to adjust the rules for each individual import.
It looks like the problem is that the header is on the second line, not the first. You could include a skip=1 argument, but a more general way of dealing with this in read.xls seems to be the pattern and header arguments, which force the first line matching the pattern string to be treated as the header. Your code becomes:
require(gdata)
# Source SIMD12 data zone level data
simd.sg.xls <- read.xls(xls = "http://www.gov.scot/Resource/0044/00447385.xls",
sheet = "Quick Lookup", verbose = TRUE,
pattern="DATAZONE", header=TRUE)
UPDATE
I don't get the warning messages you do when I execute the code. The messages refer to an issue with locale. The locale settings on my system are:
Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Yours are probably different. Locale data can be OS dependent; I'm using Windows 8.1. Also I'm using Strawberry Perl, and you appear to be using something else. Those are some possible reasons for the discrepancy in warning messages, but nothing more specific.
On the second question in your comment: to read the entire file and convert a particular row (in this case, row 2) to column names, you could use the following code:
simd.sg.xls <- read.xls(xls = "http://www.gov.scot/Resource/0044/00447385.xls",
sheet = "Quick Lookup", verbose = TRUE,
header=FALSE, stringsAsFactors=FALSE)
names(simd.sg.xls) <- make.names(names = simd.sg.xls[2,],
unique = TRUE,allow_ = TRUE)
simd.sg.xls <- simd.sg.xls[-(1:2),]
All data will be of character type so you'll need to convert to factor and numeric as necessary.
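A minimal sketch of that conversion using base R's type.convert, which re-guesses a class for each column; whether it guesses correctly for this particular spreadsheet is an assumption:
simd.sg.xls[] <- lapply(simd.sg.xls, type.convert, as.is = TRUE)
str(simd.sg.xls)  # inspect the guessed column classes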
I've a directory of csv files, about 12k in number, with the naming format YYYY-MM-DD<TICK>.csv. The <TICK> part is the ticker of a stock, e.g. MSFT, GS, QQQ. There are 500 tickers in total, of various lengths.
My aim is to merge all the csv files for a particular ticker and save them as a zoo object in an individual RData file in a separate directory.
To automate this I've managed to do the csv manipulation, set up as a function that takes a ticker as input and does all the data modification. But I'm stuck at the file-listing stage: I can't make the pattern passed to list.files depend on the ticker being processed.
Below is the function I've tried to make work; it doesn't work:
csvlist2zoo <- function(symbol){
csvlist=list.files(path = "D:/dataset/",pattern=paste("'.*?",symbol,".csv'",sep=""),full.names=T)
}
This works, but I can't make it work inside the function:
csvlist2zoo <- function(symbol){
  csvlist <- list.files(path = "D:/dataset/", pattern = '.*?ibm.csv', full.names = TRUE)
}
I searched SO; there are similar questions, but none exactly meets my requirement. If I missed something, please point me in the right direction. Still fighting with regex.
OS: Win8 64bit, R version-3.1.0 (if needed)
Try:
csvlist2zoo <- function(symbol){
list.files(pattern=paste0('\\d{4}-\\d{2}-\\d{2}',symbol, ".csv"))
}
csvlist2zoo("QQQ")
#[1] "2002-12-19QQQ.csv" "2008-01-25QQQ.csv"
csvlist2zoo("GS")
#[1] "2005-05-18GS.csv"
I created some files in the working directory (linux)
v1 <- c("2001-05-17MSFT.csv", "2005-05-18GS.csv", "2002-12-19QQQ.csv", "2008-01-25QQQ.csv")
lapply(v1, function(x) write.csv(1:3, file=x))
Update
Using paste
csvlist2zoo <- function(symbol){
list.files(pattern=paste('\\d{4}-\\d{2}-\\d{2}',symbol, ".csv", sep=""))
}
csvlist2zoo("QQQ")
#[1] "2002-12-19QQQ.csv" "2008-01-25QQQ.csv"
I'll have strings of two forms:
"Initestimate" or "L#estimate", with # being a 1- or 2-digit number
"Nameestimate", with Name being the name of the actual symbol. In the example below, the name of our symbol is "6JU4".
And I have a matrix containing, among other things, columns named "InitSymb" and "L#Symb". I want to return the name of the column whose first row holds the substring that comes before "estimate".
I'm using stringr. Right now I have it coded with a bunch of calls to str_sub, but it's really sloppy and I wanted to clean it up and do it right.
example code:
> examplemat <- matrix(c("RYU4","6JU4","6EU4",1,2,3),ncol=6)
> colnames(examplemat) <- c("InitSymb","L1Symb","L2Symb","RYU4estimate","6JU4estimate","6EU4estimate")
> examplemat
InitSymb L1Symb L2Symb RYU4estimate 6JU4estimate 6EU4estimate
[1,] "RYU4" "6JU4" "6EU4" "1" "2" "3"
> searchStr <- "L1estimate"
So, with answer being the result I'm looking for, I want to be able to type examplemat[,answer] to extract the data column (in this case, "2").
I don't really know how to do regex, but I think the answer looks something like
examplemat[,paste0(**some regex function**("[(Init)|(L[:digit:]+)]",searchStr),"estimate")]
what function goes there, and is my regex code right?
Maybe you can try:
library(stringr)
Extr <- str_extract(searchStr, '^[A-Za-z]\\d+')
Extr
[1] "L1"
#If the searchStr is `Initestimate`
#Extr <- str_extract(searchStr, '^[A-Za-z]{4}')
pat1 <- paste0("(?<=",Extr,").*")
indx1 <-examplemat[,str_detect(colnames(examplemat),perl(pat1))]
pat2 <- paste0("(?<=",indx1,").*")
examplemat[,str_detect(colnames(examplemat), perl(pat2))]
#6JU4estimate
# "2"
#For searchStr using Initestimate;
#examplemat[,str_detect(colnames(examplemat), perl(pat2))]
#RYU4estimate
# "1"
The question is a bit confusing, so I am not quite sure whether my interpretation is correct.
First, you would extract the part of the string "coolSymb" that comes before "Symb".
Second, you can detect which column name contains that part ("cool") and return its location (column index) with a which() statement.
Finally, you can extract the value using simple matrix indexing.
library(stringr)
a = str_split("coolSymb", "Symb")[[1]][1]
b = which(str_detect(colnames(examplemat), a))
examplemat[1, b]
Hope this helps,
won782's use of str_split inspired me to find an answer that works, although I still want to know how to do this by matching the prefix instead of excluding the suffix, so I'll accept an answer that does that.
Here's the step-by-step
> str_split("L1estimate","estimate")[[1]][1]
[1] "L1"
For bonus points, replace the above step with one that matches the prefix "L1" directly instead of taking "everything that isn't estimate"; a sketch of that appears after the walkthrough below.
> paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")
[1] "L1Symb"
> examplemat[1,paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")]
L1Symb
[1,] "6JU4"
> paste0(examplemat[1,paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")],"estimate")
[1] "6JU4estimate"
> examplemat[,paste0(examplemat[1,paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")],"estimate")]
6JU4estimate
[1,] "2"
New to Pandas/Python and I'm having to write some kludgy code. I would appreciate any input on how you would do this and speed it up (I'll be doing this for gigabytes of data).
So, I'm using pandas/python for some ETL work. Row-wise calculations are performed, so I need the values as numeric types within the process (I've left this part out). I need to output some of the fields as an array and get rid of the single quotes, NaNs, and ".0"s.
First question: is there a way to vectorize these if-else statements, à la ifelse in R? Second, surely there is a better way to remove the ".0". There seem to be major issues with how pandas/numpy handle nulls in numeric types.
Finally, the .replace does not seem to work on the DataFrame for the single quotes. Am I missing something? Here's the sample code; please let me know if you have any questions about it:
import pandas as pd
import numpy as np
# have some nulls and need it in integers
d = {'one' : [1.0, 2.0, 3.0, 4.0], 'two' : [4.0, 3.0, np.nan, 1.0]}
dat = pd.DataFrame(d)
# make functions to get rid of the ".0", necessarily converting to strings
def removeforval(val):
    # drop a trailing ".0" left by float formatting, otherwise just stringify
    if str(val)[-2:] == ".0":
        val = str(val)[:len(str(val))-2]
    else:
        val = str(val)
    return val
def removeforcol(col):
    col = col.apply(removeforval)
    return col
dat = dat.apply(removeforcol,axis=0)
# remove the nan's
dat = dat.replace('nan','')
# need some fields in arrays on a postgres database
quoted = ['{' + str(tuple(x))[1:-1] + '}' for x in dat.to_records(index=False)]
print "Before single quote removal"
print quoted
# try to replace single quotes using DataFrame's replace
quoted_df = pd.DataFrame(quoted).replace('\'','')
quoted_df = quoted_df.replace('\'','')
print "DataFrame does not seem to work"
print quoted_df
# use a loop
for item in range(len(quoted)):
    quoted[item] = quoted[item].replace('\'','')
print "This Works"
print quoted
Thank you!
You understand that it is very odd to build a string exactly like this; it is not valid Python at all. What are you doing with this? Why are you stringifying it?
revised
In [144]: list([ "{%s , %s}" % tup[1:] for tup in df.replace(np.nan,0).astype(int).replace(0,'').itertuples() ])
Out[144]: ['{1 , 4}', '{2 , 3}', '{3 , }', '{4 , 1}']
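As a hedged aside on why the DataFrame .replace call in the question appeared to do nothing: by default DataFrame.replace matches whole cell values, not substrings; passing regex=True (or going through the Series .str accessor) performs substring replacement:
import pandas as pd
quoted_df = pd.DataFrame(["{'1' , '4'}", "{'2' , '3'}"])
# regex=True makes replace operate on substrings within each cell
print(quoted_df.replace("'", "", regex=True))
# the .str accessor replaces substrings on a Series as well
print(quoted_df[0].str.replace("'", ""))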