Removing brackets from a dataset

Removing brackets from a dataset - regex

I have imported a dataset in R of 10 columns and 100 row. But in few columns there are brackets([]) and commas along with the values. How can i get rid of them?
As an instance, consider one of 4 columns and 2 rows.
V1 V2 V3 V4
3( [4 ([5 8
(1 5 9 [10,
And what i want is
V1 V2 V3 V4
3 4 5 8
1 5 9 10

Just use gsub:
mydf[] <- lapply(mydf, function(x) gsub("[][(),]", "", x))
mydf
# V1 V2 V3 V4
# 1 3 4 5 8
# 2 1 5 9 10
Instead of lapply, you can also use as.matrix:
mydf[] <- gsub("[][(),]", "", as.matrix(mydf))
mydf
# V1 V2 V3 V4
# 1 3 4 5 8
# 2 1 5 9 10

Related

Python Pandas Data frame creation

I tried to create a data frame df using the below code :
import numpy as np
import pandas as pd
index = [0,1,2,3,4,5]
s = pd.Series([1,2,3,4,5,6],index= index)
t = pd.Series([2,4,6,8,10,12],index= index)
df = pd.DataFrame(s,columns = ["MUL1"])
df["MUL2"] =t
print df
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
While trying to create the same data frame using the below syntax, I am getting a wierd output.
df = pd.DataFrame([s,t],columns = ["MUL1","MUL2"])
print df
MUL1 MUL2
0 NaN NaN
1 NaN NaN
Please explain why the NaN is being displayed in the dataframe when both the Series are non empty and why only two rows are getting displayed and no the rest.
Also provide the correct way to create the data frame same as has been mentioned above by using the columns argument in the pandas DataFrame method.

One of the correct ways would be to stack the array data from the input list holding those series into columns -
In [161]: pd.DataFrame(np.c_[s,t],columns = ["MUL1","MUL2"])
Out[161]:
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
Behind the scenes, the stacking creates a 2D array, which is then converted to a dataframe. Here's what the stacked array looks like -
In [162]: np.c_[s,t]
Out[162]:
array([[ 1, 2],
[ 2, 4],
[ 3, 6],
[ 4, 8],
[ 5, 10],
[ 6, 12]])

If remove columns argument get:
df = pd.DataFrame([s,t])
print (df)
0 1 2 3 4 5
0 1 2 3 4 5 6
1 2 4 6 8 10 12
Then define columns - if columns not exist get NaNs column:
df = pd.DataFrame([s,t], columns=[0,'MUL2'])
print (df)
0 MUL2
0 1.0 NaN
1 2.0 NaN
Better is use dictionary:
df = pd.DataFrame({'MUL1':s,'MUL2':t})
print (df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
And if need change columns order add columns parameter:
df = pd.DataFrame({'MUL1':s,'MUL2':t}, columns=['MUL2','MUL1'])
print (df)
MUL2 MUL1
0 2 1
1 4 2
2 6 3
3 8 4
4 10 5
5 12 6
More information is in dataframe documentation.
Another solution by concat - DataFrame constructor is not necessary:
df = pd.concat([s,t], axis=1, keys=['MUL1','MUL2'])
print (df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12

A pandas.DataFrame takes in the parameter data that can be of type ndarray, iterable, dict, or dataframe.
If you pass in a list it will assume each member is a row. Example:
a = [1,2,3]
b = [2,4,6]
df = pd.DataFrame([a, b], columns = ["Col1","Col2", "Col3"])
# output 1:
Col1 Col2 Col3
0 1 2 3
1 2 4 6
You are getting NaN because it expects index = [0,1] but you are giving [0,1,2,3,4,5]
To get the shape you want, first transpose the data:
data = np.array([a, b]).transpose()
How to create a pandas dataframe
import pandas as pd
a = [1,2,3]
b = [2,4,6]
df = pd.DataFrame(dict(Col1=a, Col2=b))
Output:
Col1 Col2
0 1 2
1 2 4
2 3 6

Replace certain spaces to tabs - delimiters

I have one column data.frame where some spaces should be delimiters some just a space.
#input data
dat <- data.frame(x=c("A 2 2 textA1 textA2 Z1",
"B 4 1 textX1 textX2 textX3 Z2",
"C 3 5 textA1 Z3"))
# x
# 1 A 2 2 textA1 textA2 Z1
# 2 B 4 1 textX1 textX2 textX3 Z2
# 3 C 3 5 textA1 Z3
Need to convert it to 5 column data.frame:
#expected output
output <- read.table(text="
A 2 2 textA1 textA2 Z1
B 4 1 textX1 textX2 textX3 Z2
C 3 5 textA1 Z3",sep="\t")
# V1 V2 V3 V4 V5
# 1 A 2 2 textA1 textA2 Z1
# 2 B 4 1 textX1 textX2 textX3 Z2
# 3 C 3 5 textA1 Z3
Essentially, need to change 1st, 2nd, 3rd, and the last space to a tab (or any other delimiter if it makes it easier to code).
Playing with regex is not giving anything useful yet...
Note1: In real data I have to replace 1st, 2nd, 3rd,...,19th and the last spaces to tabs.
Note2: There is no pattern in V4, text can be anything.
Note3: Last column is one word text with variable length.

Try
v1 <- gsub("^([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+", '\\1,\\2,\\3,', dat$x)
read.table(text=sub(' +(?=[^ ]+$)', ',', v1, perl=TRUE), sep=",")
# V1 V2 V3 V4 V5
#1 A 2 2 textA1 textA2 Z1
#2 B 4 1 textX1 textX2 textX3 Z2
#3 C 3 5 textA1 Z3
Or an option inspired from #Tensibai's post
n <- 3
fpat <- function(n){
paste0('^((?:\\w+ ){', n,'})([\\w ]+)\\s+(\\w+)$')
}
read.table(text=gsub(fpat(n), "\\1'\\2' \\3", dat$x, perl=TRUE))
# V1 V2 V3 V4 V5
#1 A 2 2 textA1 textA2 Z1
#2 B 4 1 textX1 textX2 textX3 Z2
#3 C 3 5 textA1 Z3
For more columns,
n <- 19
v1 <- "A 24 34343 212 zea4 2323 12343 111 dsds 134d 153xd 153xe 153de 153dd dd dees eese tees3 zee2 2353 23335 23353 ddfe 3133"
read.table(text=gsub(fpat(n), "\\1'\\2' \\3", v1, perl=TRUE), sep='')
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15
#1 A 24 34343 212 zea4 2323 12343 111 dsds 134d 153xd 153xe 153de 153dd dd
# V16 V17 V18 V19 V20 V21
#1 dees eese tees3 zee2 2353 23335 23353 ddfe 3133

With a variable number of columns:
library(stringr)
cols <- 3
m <- str_match(dat$x, paste0("((?:\\w+ ){" , cols , "})([\\w ]+) (\\w+)"))
t <- paste0(gsub(" ", "\t", m[,2]), m[,3], "\t", m[,4])
> read.table(text=t,sep="\t")
V1 V2 V3 V4 V5
1 A 2 2 textA1 textA2 Z1
2 B 4 1 textX1 textX2 textX3 Z2
3 C 3 5 textA1 Z3
Change the number of columns to tell how many you wish before.
For the regex:
((?:\\w+ ){3}) Capture the 3 repetitions {3} of the non capturing group (?:\w+ ) which matche at least one alphanumeric character w+ followed by a space
([\\w ]+) (\w+) capture the free text from alphanumeric char or space [\w ]+ followed by a space and the capture the last word with \w+
Once that done, paste the 3 parts returned by str_match taking care of replacing the spaces in the first group m[,2] by tabs.
m[,1] is the whole match so it's unused here.
Old answer:
A basic one matching based on a fixed number of fields:
> read.table(text=gsub("(\\w+) (\\w+) (\\w+) ([\\w ]+) (\\w+)$","\\1\t\\2\t\\3\t\\4\t\\5",dat$x,perl=TRUE),sep="\t")
V1 V2 V3 V4 V5
1 A 2 2 textA1 textA2 Z1
2 B 4 1 textX1 textX2 textX3 Z2
3 C 3 5 textA1 Z3
Add as many (\w+) you wish before, and increase the number of \1 (back references)

Here can be one twisted way to go that will work whatever the number of "words" you have (and that works on your data); it's based on the number of alphanum characters in your "words" compared to the number of alphanum characters in the other fields:
res <- gsub("\\w{3,}\\K\\t(?=\\w{3,})", " ", gsub(" ", "\t", dat$x), perl=T)
res
# [1] "A\t2\t2\ttextA1 textA2\tZ1" "B\t4\t1\ttextX1 textX2 textX3\tZ2" "C\t3\t5\ttextA1\tZ3"
read.table(text=res, sep="\t")
# V1 V2 V3 V4 V5
#1 A 2 2 textA1 textA2 Z1
#2 B 4 1 textX1 textX2 textX3 Z2
#3 C 3 5 textA1 Z3
EDIT: A completely different way to go, only based on the number of the spaces k you need to replace before the last one:
k <- 3 # in your example
res <- sapply(as.character(dat$x),
function(x, k){
pos_sp <- gregexpr(" ", x)[[1]]
x <- strsplit(x, "")[[1]]
if (length(pos_sp) > k+1) pos_sp <- pos_sp[c(1:k, length(pos_sp))]
x[pos_sp] <- "\t"
x <- paste(x, collapse="")
}, k=k)
read.table(text=res, sep="\t")
# V1 V2 V3 V4 V5
# 1 A 2 2 textA1 textA2 Z1
# 2 B 4 1 textX1 textX2 textX3 Z2
# 3 C 3 5 textA1 Z3

How to split R data.frame column based regular expression condition

I have a data.frame and I want to split one of its columns to two based on a regular expression. More specifically the strings have a suffix in parentheses that needs to be extracted to a column of its own.
So e.g. I want to get from here:
dfInit <- data.frame(VAR = paste0(c(1:10),"(",c("A","B"),")"))
to here:
dfFinal <- data.frame(VAR1 = c(1:10), VAR2 = c("A","B"))

1) gsubfn::read.pattern read.pattern in the gsubfn package can do that. The matches to the parenthesized portions of the regular rexpression are regarded as the fields:
library(gsubfn)
read.pattern(text = as.character(dfInit$VAR), pattern = "(.*)[(](.*)[)]$")
giving:
V1 V2
1 1 A
2 2 B
3 3 A
4 4 B
5 5 A
6 6 B
7 7 A
8 8 B
9 9 A
10 10 B
2) sub Another way is to use sub:
data.frame(V1=sub("\\(.*", "", dfInit$VAR), V2=sub(".*\\((.)\\)$", "\\1", dfInit$VAR))
giving the same result.
3) read.table This solution does not use a regular expression:
read.table(text = as.character(dfInit$VAR), sep = "(", comment = ")")
giving the same result.

You could also use extract from tidyr
library(tidyr)
extract(dfInit, VAR, c("VAR1", "VAR2"), "(\\d+).([[:alpha:]]+).", convert=TRUE) # edited and added `convert=TRUE` as per #aosmith's comments.
# VAR1 VAR2
#1 1 A
#2 2 B
#3 3 A
#4 4 B
#5 5 A
#6 6 B
#7 7 A
#8 8 B
#9 9 A
#10 10 B

See Split column at delimiter in data frame
dfFinal <- within(dfInit, VAR<-data.frame(do.call('rbind', strsplit(as.character(VAR), '[[:punct:]]'))))
> dfFinal
VAR.X1 VAR.X2
1 1 A
2 2 B
3 3 A
4 4 B
5 5 A
6 6 B
7 7 A
8 8 B
9 9 A
10 10 B

An approach with regmatches and gregexpr:
as.data.frame(do.call(rbind, regmatches(dfInit$VAR, gregexpr("\\w+", dfInit$VAR))))

You can also use cSplit from splitstackshape.
library(splitstackshape)
cSplit(dfInit, "VAR", "[()]", fixed=FALSE)
# VAR_1 VAR_2
# 1: 1 A
# 2: 2 B
# 3: 3 A
# 4: 4 B
# 5: 5 A
# 6: 6 B
# 7: 7 A
# 8: 8 B
# 9: 9 A
#10: 10 B

Working with dataframes in a list: Rename variables

Define:
dats <- list( df1 = data.frame(A=sample(1:3), B = sample(11:13)),
df2 = data.frame(AA=sample(1:3), BB = sample(11:13)))
s.t.
> dats
$df1
A B
1 2 12
2 3 11
3 1 13
$df2
AA BB
1 1 13
2 2 12
3 3 11
I would like to change all variable names from all caps to lower. I can do this with a loop but somehow cannot get this lapply call to work:
dats <- lapply(dats, function(x)
names(x)<-tolower(names(x)))
which results in:
> dats
$df1
[1] "a" "b"
$df2
[1] "aa" "bb"
while the desired result is:
> dats
$df1
a b
1 2 12
2 3 11
3 1 13
$df2
aa bb
1 1 13
2 2 12
3 3 11

If you don't use return at the end of a function, the last evaluated expression returned. So you need to return x.
dats <- lapply(dats, function(x) {
names(x)<-tolower(names(x))
x})

Applying predicates on a list in R

Given a list of values in R, what is a nice way to filter values in a list by a given predicate function?

It's not entirely clear whether you have a proper list object in R, or another type of object such as a data.frame or vector. Assuming you have a true list object, we can combine lapply and subset to do what you want. If you don't have a list, then there's no need for lapply.
set.seed(1)
#Fake data
dat <- list(a = data.frame(x = sample(1:10, 20, TRUE))
, b = data.frame(x = sample(1:10, 20, TRUE)))
#Apply the subset function over the list
lapply(dat, subset, x < 3)
$a
x
10 1
12 2
$b
x
4 2
7 1
14 2
18 2
#Example two
lapply(dat, subset, x %in% c(1,7,9))
$a
x
6 9
8 7
9 7
10 1
13 7
$b
x
3 7
7 1
9 9
15 9
16 7

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Removing brackets from a dataset - regex

I have imported a dataset in R of 10 columns and 100 row. But in few columns there are brackets([]) and commas along with the values. How can i get rid of them? As an instance, consider one of 4 columns and 2 rows. V1 V2 V3 V4 3( [4 ([5 8 (1 5 9 [10, And what i want is V1 V2 V3 V4 3 4 5 8 1 5 9 10

Just use gsub: mydf[] <- lapply(mydf, function(x) gsub("[][(),]", "", x)) mydf # V1 V2 V3 V4 # 1 3 4 5 8 # 2 1 5 9 10 Instead of lapply, you can also use as.matrix: mydf[] <- gsub("[][(),]", "", as.matrix(mydf)) mydf # V1 V2 V3 V4 # 1 3 4 5 8 # 2 1 5 9 10

Related

Python Pandas Data frame creation

Replace certain spaces to tabs - delimiters

How to split R data.frame column based regular expression condition

Working with dataframes in a list: Rename variables

Applying predicates on a list in R

Categories

Resources