Removing duplicates from the data

Removing duplicates from the data - regex

I already loaded 20 csv files with function:
tbl = list.files(pattern="*.csv")
list_of_data = lapply(tbl, read.csv)
I combined all of those filves into one:
all_data = do.call(rbind.fill, list_of_data)
In the new table is a column called "Accession". After combining many of the names (Accession) are repeated. And I would like to remove all of the duplicates.
Another problem is that some of those "names" are ALMOST the same. The difference is that there is name and after become the dot and the number.
Let me show you how it looks:
AT3G26450.1 <--
AT5G44520.2
AT4G24770.1
AT2G37220.2
AT3G02520.1
AT5G05270.1
AT1G32060.1
AT3G52380.1
AT2G43910.2
AT2G19760.1
AT3G26450.2 <--
<-- = Same sample, different names. Should be treated as one. So just ignore dot and a number after.
Tried this one:
all_data$CleanedAccession = str_extract(all_data$Accession, "^[[:alnum:]]+")
all_data = subset(all_data, !duplicated(CleanedAccession))
Error in `$<-.data.frame`(`*tmp*`, "CleanedAccession", value = character(0)) :

You can use this command to both subset and rename the values:
subset(transform(alldata, Ascension = sub("\\..*", "", Ascension)),
!duplicated(Ascension))
Ascension
1 AT3G26450
2 AT5G44520
3 AT4G24770
4 AT2G37220
5 AT3G02520
6 AT5G05270
7 AT1G32060
8 AT3G52380
9 AT2G43910
10 AT2G19760

What about
df <- data.frame( Accession = c("AT3G26450.1",
"AT5G44520.2",
"AT4G24770.1",
"AT2G37220.2",
"AT3G02520.1",
"AT5G05270.1",
"AT1G32060.1",
"AT3G52380.1",
"AT2G43910.2",
"AT2G19760.1",
"AT3G26450.2"))
df[!duplicated(unlist(lapply(strsplit(as.character(df$Accession),
".", fixed = T), "[", 1))), ]

Related

Applying Rcpp on a dataframe

I'm new to C++ and exploring faster computation possibilities on R through the Rcpp package. The actual dataframe contains over ~2 million rows, and is quite slow.
Existing Dataframes
Main Dataframe
df<-data.frame(z = c("a","b","c"), a = c(303,403,503), b = c(203,103,803), c = c(903,803,703))
Cost Dataframe
cost <- data.frame("103" = 4, "203" = 5, "303" = 6, "403" = 7, "503" = 8, "603" = 9, "703" = 10, "803" = 11, "903" = 12)
colnames(cost) <- c("103", "203", "303", "403", "503", "603", "703", "803", "903")
Steps
df contains z which is a categorical variable with levels a, b and c. I had done a merge operation from another dataframe to bring in a,b,c into df with the specific nos.
First step would be to match each row in z with the column names (a,b or c) and create a new column called 'type' and copy the corresponding number.
So the first row would read,
df$z[1] = "a"
df$type[1]= 303
Now it must match df$type with column names in another dataframe called 'cost' and create df$cost. The cost dataframe contains column names as numbers e.g. "103", "203" etc.
For our example, df$cost[1] = 6. It matches df$type[1] = 303 with cost$303[1]=6
Final Dataframe should look like this - Created a sample output
df1 <- data.frame(z = c("a","b","c"), type = c("303", "103", "703"), cost = c(6,4,10))

A possible solution, not very elegant but does the job:
library(reshape2)
tmp <- cbind(cost,melt(df)) # create a unique data frame
row.idx <- which(tmp$z==tmp$variable) # row index of matching values
col.val <- match(as.character(tmp$value[row.idx]), names(tmp) ) # find corresponding values in the column names
# now put all together
df2 <- data.frame('z'=unique(df$z),
'type' = tmp$value[row.idx],
'cost' = as.numeric(tmp[1,col.val]) )
the output:
> df2
z type cost
1 a 303 6
2 b 103 4
3 c 703 10
see if it works

How to take specific number of input in python

How do I take a specific number of input in python. Say, if I only want to insert 5 elements in a list then how can I do that?
I tried to do it but couldn't figure out how.
In the first line I want to take an integer which will be the size of the list.
Second line will consist 5 elements separated by a space like this:
5
1 2 3 4 5
Thanks in advance.

count = int(raw_input("Number of elements:"))
data = raw_input("Data: ")
result = data.split(sep=" ", maxsplit=count)
if len(result) < count:
print("Too few elements")
You can also wrap int(input("Number of elements:")) in try/except to ensure that first input is actually int.
p.s. here is helpful q/a how to loop until correct input.

Input :-
5
1 2 3 4 5
then, use the below code :
n = int(input()) # Number of elements
List = list ( map ( int, input().split(" ") ) )
Takes the space separated input as list of integers. Number of elements count is not necessary here.
You can get the size of the List by len(List) .
Here list is a keyword for generating a List.
Or you may use an alternative :
n = int(input()) # Number of elements
List = [ int(elem) for elem in input().split(" ") ]
If you want it as List of strings, then use :
List = list( input().split(" ") )
or
s = input() # default input is string by using input() function in python 2.7+
List = list( s.split(" ") )
Or
List = [ elem for elem in input().split(" ") ]
Number of elements count is necessary while using a loop for receiving input in a new line ,then
Let the Input be like :
5
1
2
3
4
5
The modified code will be:-
n = int(input())
List = [ ] #declare an Empty list
for i in range(n):
elem = int(input())
List.append ( elem )
For Earlier version of python , use raw_input ( ) instead of input ( ), which receives default input as String.

Split Pandas Column by values that are in a list

I have three lists that look like this:
age = ['51+', '21-30', '41-50', '31-40', '<21']
cluster = ['notarget', 'cluster3', 'allclusters', 'cluster1', 'cluster2']
device = ['htc_one_2gb','iphone_6/6+_at&t','iphone_6/6+_vzn','iphone_6/6+_all_other_devices','htc_one_2gb_limited_time_offer','nokia_lumia_v3','iphone5s','htc_one_1gb','nokia_lumia_v3_more_everything']
I also have column in a df that looks like this:
campaign_name
0 notarget_<21_nokia_lumia_v3
1 htc_one_1gb_21-30_notarget
2 41-50_htc_one_2gb_cluster3
3 <21_htc_one_2gb_limited_time_offer_notarget
4 51+_cluster3_iphone_6/6+_all_other_devices
I want to split the column into three separate columns based on the values in the above lists. Like so:
age cluster device
0 <21 notarget nokia_lumia_v3
1 21-30 notarget htc_one_1gb
2 41-50 cluster3 htc_one_2gb
3 <21 notarget htc_one_2gb_limited_time_offer
4 51+ cluster3 iphone_6/6+_all_other_devices
First thought was to do a simple test like this:
ages_list = []
for i in ages:
if i in df['campaign_name'][0]:
ages_list.append(i)
print ages_list
>>> ['<21']
I was then going to convert ages_list to a series and combine it with the remaining two to get the end result above but i assume there is a more pythonic way of doing it?

the idea behind this is that you'll create a regular expression based on the values you already have , for example if you want to build a regular expressions that capture any value from your age list you may do something like this '|'.join(age) and so on for all the values you already have cluster & device.
a special case for device list becuase it contains + sign that will conflict with the regex ( because + means one or more when it comes to regex ) so we can fix this issue by replacing any value of + with \+ , so this mean I want to capture literally +
df = pd.DataFrame({'campaign_name' : ['notarget_<21_nokia_lumia_v3' , 'htc_one_1gb_21-30_notarget' , '41-50_htc_one_2gb_cluster3' , '<21_htc_one_2gb_limited_time_offer_notarget' , '51+_cluster3_iphone_6/6+_all_other_devices'] })
def split_df(df):
campaign_name = df['campaign_name']
df['age'] = re.findall('|'.join(age) , campaign_name)[0]
df['cluster'] = re.findall('|'.join(cluster) , campaign_name)[0]
df['device'] = re.findall('|'.join([x.replace('+' , '\+') for x in device ]) , campaign_name)[0]
return df
df.apply(split_df, axis = 1 )
if you want to drop the original column you can do this
df.apply(split_df, axis = 1 ).drop( 'campaign_name', axis = 1)
Here I'm assuming that a value must be matched by regex but if this is not the case you can do your checks , you got the idea

How to very efficiently extract specific pattern from characters?

I have big data like this :
> Data[1:7,1]
[1] mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5
[2] mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9
[3] mature=hsa-miR-448|mir_Family=mir-448|Gene=OR4F5
[4] mature=hsa-miR-659-3p|mir_Family=-|Gene=OR4F5
[5] mature=hsa-miR-5197-3p|mir_Family=-|Gene=OR4F5
[6] mature=hsa-miR-5093|mir_Family=-|Gene=OR4F5
[7] mature=hsa-miR-650|mir_Family=mir-650|Gene=OR4F5
what I want to do is that, in every row, I want to select the name after word mature= and also the word after Gene= and then pater them together with
paste(a,b, sep="-")
for example, the expected output from first two rows would be like :
hsa-miR-5087-OR4F5
hsa-miR-26a-1-3p-OR4F9
so, the final implementation is like this:
for(i in 1:nrow(Data)){
Data[i,3] <- sub("mature=([^|]*).*Gene=(.*)", "\\1-\\2", Data[i,1])
Name <- strsplit(as.vector(Data[i,2]),"\\|")[[1]][2]
Data[i,4] <- as.numeric(sub("pvalue=","",Name))
print(i)
}
which work well, but it's very slow. the size of Data is very big and it has 200,000,000 rows. this implementation is very slow for that. how can I speed it up ?

If you can guarantee that the format is exactly as you specified, then a regular expression can capture (denoted by the brackets below) everything from the equals sign upto the pipe symbol, and from the Gene= to the end, and paste them together with a minus sign:
sub("mature=([^|]*).*Gene=(.*)", "\\1-\\2", Data[,1])

Another option is to use read.table with = as a separator then pasting the 2 columns:
res = read.table(text=txt,sep='=')
paste(sub('[|].*','',res$V2), ## get rid from last part here
sub('^ +| +$','',res$V4),sep='-') ## remove extra spaces
[1] "hsa-miR-5087-OR4F5" "hsa-miR-26a-1-3p-OR4F9" "hsa-miR-448-OR4F5" "hsa-miR-659-3p-OR4F5"
[5] "hsa-miR-5197-3p-OR4F5" "hsa-miR-5093-OR4F5" "hsa-miR-650-OR4F5"

The simple sub solution already given looks quite nice but just in case here are some other approaches:
1) read.pattern Using read.pattern in the gsubfn package we can parse the data into a data.frame. This intermediate form, DF, can then be manipulated in many ways. In this case we use paste in essentially the same way as in the question:
library(gsubfn)
DF <- read.pattern(text = Data[, 1], pattern = "(\\w+)=([^|]*)")
paste(DF$V2, DF$V6, sep = "-")
giving:
[1] "hsa-miR-5087-OR4F5" "hsa-miR-26a-1-3p-OR4F9" "hsa-miR-448-OR4F5"
[4] "hsa-miR-659-3p-OR4F5" "hsa-miR-5197-3p-OR4F5" "hsa-miR-5093-OR4F5"
[7] "hsa-miR-650-OR4F5"
The intermediate data frame, DF, that was produced looks like this:
> DF
V1 V2 V3 V4 V5 V6
1 mature hsa-miR-5087 mir_Family - Gene OR4F5
2 mature hsa-miR-26a-1-3p mir_Family mir-26 Gene OR4F9
3 mature hsa-miR-448 mir_Family mir-448 Gene OR4F5
4 mature hsa-miR-659-3p mir_Family - Gene OR4F5
5 mature hsa-miR-5197-3p mir_Family - Gene OR4F5
6 mature hsa-miR-5093 mir_Family - Gene OR4F5
7 mature hsa-miR-650 mir_Family mir-650 Gene OR4F5
Here is a visualization of the regular expression we used:
(\w+)=([^|]*)
Debuggex Demo
1a) names We could make DF look nicer by reading the three columns of data and the three names separately. This also improves the paste statement:
DF <- read.pattern(text = Data[, 1], pattern = "=([^|]*)")
names(DF) <- unlist(read.pattern(text = Data[1,1], pattern = "(\\w+)=", as.is = TRUE))
paste(DF$mature, DF$Gene, sep = "-") # same answer as above
The DF in this section that was produced looks like this. It has 3 instead of 6 columns and remaining columns were used to determine appropriate column names:
> DF
mature mir_Family Gene
1 hsa-miR-5087 - OR4F5
2 hsa-miR-26a-1-3p mir-26 OR4F9
3 hsa-miR-448 mir-448 OR4F5
4 hsa-miR-659-3p - OR4F5
5 hsa-miR-5197-3p - OR4F5
6 hsa-miR-5093 - OR4F5
7 hsa-miR-650 mir-650 OR4F5
2) strapplyc
Another approach using the same package. This extracts the fields coming after a = and not containing a | producing a list. We then sapply over that list pasting the first and third fields together:
sapply(strapplyc(Data[, 1], "=([^|]*)"), function(x) paste(x[1], x[3], sep = "-"))
giving the same result.
Here is a visualization of the regular expression used:
=([^|]*)
Debuggex Demo

Here is one approach:
Data <- readLines(n = 7)
mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5
mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9
mature=hsa-miR-448|mir_Family=mir-448|Gene=OR4F5
mature=hsa-miR-659-3p|mir_Family=-|Gene=OR4F5
mature=hsa-miR-5197-3p|mir_Family=-|Gene=OR4F5
mature=hsa-miR-5093|mir_Family=-|Gene=OR4F5
mature=hsa-miR-650|mir_Family=mir-650|Gene=OR4F5
df <- read.table(sep = "|", text = Data, stringsAsFactors = FALSE)
l <- lapply(df, strsplit, "=")
trim <- function(x) gsub("^\\s*|\\s*$", "", x)
paste(trim(sapply(l[[1]], "[", 2)), trim(sapply(l[[3]], "[", 2)), sep = "-")
# [1] "hsa-miR-5087-OR4F5" "hsa-miR-26a-1-3p-OR4F9" "hsa-miR-448-OR4F5" "hsa-miR-659-3p-OR4F5" "hsa-miR-5197-3p-OR4F5" "hsa-miR-5093-OR4F5"
# [7] "hsa-miR-650-OR4F5"

Maybe not the more elegant but you can try :
sapply(Data[,1],function(x){
parts<-strsplit(x,"\\|")[[1]]
y<-paste(gsub("(mature=)|(Gene=)","",parts[grepl("mature|Gene",parts)]),collapse="-")
return(y)
})
Example
Data<-data.frame(col1=c("mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5","mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9"),col2=1:2,stringsAsFactors=F)
> Data[,1]
[1] "mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5" "mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9"
> sapply(Data[,1],function(x){
+ parts<-strsplit(x,"\\|")[[1]]
+ y<-paste(gsub("(mature=)|(Gene=)","",parts[grepl("mature|Gene",parts)]),collapse="-")
+ return(y)
+ })
mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5 mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9
"hsa-miR-5087-OR4F5" "hsa-miR-26a-1-3p-OR4F9"

How to add column to data.table with values from list based on regex

I have the following data.table:
id fShort
1 432-12 1245
2 3242-12 453543
3 324-32 45543
4 322-34 45343
5 2324-34 13543
DT <- data.table(
id=c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
fShort=c("1245", "453543", "45543", "45343", "13543"))
and the following list:
filenames <- list("3242-124342345.png", "432-124343.png", "135-13434.jpeg")
I would like to create a new column "fComplete" that includes the complete filename from the list. For this the values of column "id" need to be matched with the filename-list. If the filename starts with the "id" string, the complete filename should be returned. I use the following regex
t <- grep("432-12","432-124343.png",value=T)
that return the correct filename.
This is how the final table should look like:
id fShort fComplete
1 432-12 1245 432-124343.png
2 3242-12 453543 3242-124342345.png
3 324-32 45543 NA
4 322-34 45343 NA
5 2324-34 13543 NA
DT2 <- data.table(
id=c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
fshort=c("1245", "453543", "45543", "45343", "13543"),
fComplete = c("432-124343.png", "3242-124342345.png", NA, NA, NA))
I tried using apply and data.table approaches but I always get warnings like
argument 'pattern' has length > 1 and only the first element will be used
What is a simple approach to accomplish this?

Here's a data.table solution:
DT[ , fComplete := lapply(id, function(x) {
m <- grep(x, filenames, value = TRUE)
if (!length(m)) NA else m})]
id fShort fComplete
1: 432-12 1245 432-124343.png
2: 3242-12 453543 3242-124342345.png
3: 324-32 45543 NA
4: 322-34 45343 NA
5: 2324-34 13543 NA

In my experience with similar functions, sometimes the regex functions return a list, so you have to consider that in the apply - I usually do an example manually
Also apply will not always in y experience on its own return something that always works into a data.frame,sometimes I had to use lap ply, and or unlist and data.frame to modify it
Here is an answer - I am not familiar with data.tables and I was having issues with the filenames being in a list, but with some transformations this works. I worked it out by seeing what apply was outputting and adding the [1] to get the piece I needed
DT <- data.frame(
id=c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
fShort=c("1245", "453543", "45543", "45343", "13543"))
filenames <- list("3242-124342345.png", "432-124343.png", "135-13434.jpeg")
filenames1 <- unlist(filenames)
x<-apply(DT[1],1,function(x) grep(x,filenames1)[1])
DT$fielname <- filenames1[x]

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Removing duplicates from the data - regex

You can use this command to both subset and rename the values: subset(transform(alldata, Ascension = sub("\\..*", "", Ascension)), !duplicated(Ascension)) Ascension 1 AT3G26450 2 AT5G44520 3 AT4G24770 4 AT2G37220 5 AT3G02520 6 AT5G05270 7 AT1G32060 8 AT3G52380 9 AT2G43910 10 AT2G19760

Related

Applying Rcpp on a dataframe

How to take specific number of input in python

Split Pandas Column by values that are in a list

How to very efficiently extract specific pattern from characters?

How to add column to data.table with values from list based on regex

Categories

Resources