Converting a list to a pandas DataFrame

I build a list in a loop, and the final list looks like this:
L [col1, col2, col3, col4 \
0 N 225.0 12.0 03.0 B ,
col1, col2, col3, col4 \
0 W 223.0 12.0 01.0 M ,
col1, col2, col3, col4 \
0 X 203.0 11.0 04.0 P ]
I'm trying to convert this to a single pandas DataFrame.
Each element looks like a proper DataFrame in itself:
L[0]
col1 col2 col3 col4
N 225.0 12.0 03.0 B

I believe you need to create a 2D NumPy array and pass it to the DataFrame constructor:
import numpy as np
import pandas as pd
L = ['col1', 'col2', 'col3', 'col4',
'N 225.0', '12.0', '03.0', 'B' ,
'col1', 'col2', 'col3', 'col4',
'W 223.0', '12.0', '01.0', 'M' ,
'col1', 'col2', 'col3', 'col4',
'X 203.0', '11.0', '04.0', 'P' ]
a = np.array(L).reshape(-1, 8)[:, -4:]
print (a)
[['N 225.0' '12.0' '03.0' 'B']
['W 223.0' '12.0' '01.0' 'M']
['X 203.0' '11.0' '04.0' 'P']]
df = pd.DataFrame(a, columns = L[:4])
print (df)
col1 col2 col3 col4
0 N 225.0 12.0 03.0 B
1 W 223.0 12.0 01.0 M
2 X 203.0 11.0 04.0 P
Explanation:
First convert the list to a 1D NumPy array:
print (np.array(L))
['col1' 'col2' 'col3' 'col4' 'N 225.0' '12.0' '03.0' 'B' 'col1' 'col2'
'col3' 'col4' 'W 223.0' '12.0' '01.0' 'M' 'col1' 'col2' 'col3' 'col4'
'X 203.0' '11.0' '04.0' 'P']
Then reshape it to an N x 8 array, so that each row holds the 4 column names followed by the 4 values:
print (np.array(L).reshape(-1, 8))
[['col1' 'col2' 'col3' 'col4' 'N 225.0' '12.0' '03.0' 'B']
['col1' 'col2' 'col3' 'col4' 'W 223.0' '12.0' '01.0' 'M']
['col1' 'col2' 'col3' 'col4' 'X 203.0' '11.0' '04.0' 'P']]
Finally, select the last 4 columns:
print (np.array(L).reshape(-1, 8)[:, -4:])
[['N 225.0' '12.0' '03.0' 'B']
['W 223.0' '12.0' '01.0' 'M']
['X 203.0' '11.0' '04.0' 'P']]
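
If the elements of L are actually one-row DataFrames, as the L[0] output in the question suggests, concatenating them directly may be simpler than going through NumPy. This is a minimal sketch under that assumption. The values are rebuilt by hand from the question's printout (reading 'N 225.0' as a single value, as the answer above does), so the exact cell contents are a guess.

```python
import pandas as pd

# Assumption: each element of L is a one-row DataFrame with the same four columns.
L = [
    pd.DataFrame({"col1": ["N 225.0"], "col2": ["12.0"], "col3": ["03.0"], "col4": ["B"]}),
    pd.DataFrame({"col1": ["W 223.0"], "col2": ["12.0"], "col3": ["01.0"], "col4": ["M"]}),
    pd.DataFrame({"col1": ["X 203.0"], "col2": ["11.0"], "col3": ["04.0"], "col4": ["P"]}),
]

# Stack the frames vertically; ignore_index=True renumbers the rows 0..n-1
# instead of repeating the 0 index that every one-row frame carries.
df = pd.concat(L, ignore_index=True)
print(df)
```

Without ignore_index=True every row would keep the index label 0, which is usually not what you want after collecting frames in a loop.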

For a plain list of values, try this:
L = ['Thanks You', 'Its fine no problem', 'Are you sure']
#create new df
df = pd.DataFrame({'col':L})
print (df)


How to create a column containing column names as values in Redshift?

I have data like this:
id start_date end_date col1 col2 col3 col4 col5
issue_2017-09 2017-09-18 2017-09-30 true true true true false
I want to convert the data into the following format:
id start_date end_date new_col
issue_2017-09 2017-09-18 2017-09-30 {'col1', 'col2', 'col3', 'col4'}
new_col is built from the names of those columns among [col1, col2, col3, col4, col5] whose value is true.
I am using Redshift.
I was able to resolve this using the following query:
select id, start_date , end_date, listagg(col_name, ', ') as new_col
from (
select id, start_date, end_date, col1 as val, 'col1' as col_name
from table
union all
select id, start_date, end_date, col2 as val, 'col2' as col_name
from table
union all
select id, start_date, end_date, col3 as val, 'col3' as col_name
from table
union all
select id, start_date, end_date, col4 as val, 'col4' as col_name
from table
union all
select id, start_date, end_date, col5 as val, 'col5' as col_name
from table
) t
where val is True
group by id, start_date, end_date
Here is an alternative method that builds the string directly with CASE expressions:
select
id, start_date, end_date,
'{' +
case when col1 then '''col1''' else '' end +
case when col2 then case when col1 then ', ''col2''' else '''col2''' end else '' end +
case when col3 then case when (col1 or col2) then ', ''col3''' else '''col3''' end else '' end +
case when col4 then case when (col1 or col2 or col3) then ', ''col4''' else '''col4''' end else '' end +
case when col5 then case when (col1 or col2 or col3 or col4) then ', ''col5''' else '''col5''' end else '' end +
'}' as new_col
from table01;

Merge two duplicate rows with imputing values from each other

I have a dataframe (df1) with only one column (col1) having identical values while other columns have missing values, for example as follows:
df1
--------------------------------------------------------------------
col1 col2 col3 col4 col5 col6
--------------------------------------------------------------------
0| 1234 NaT 120 NaN 115 XYZ
1| 1234 2015/01/12 120 Abc 115 NaN
2| 1234 2015/01/12 NaN NaN NaN NaN
I would like to merge the three rows with identical col1 values into one row such that the missing values are replaced with values from the other rows where the values exist in place of missing values. The resulting df will look like this:
result_df
--------------------------------------------------------------------
col1 col2 col3 col4 col5 col6
--------------------------------------------------------------------
0| 1234 2015/01/12 120 Abc 115 XYZ
Can anyone help me with this issue? Thanks in advance!
If the column names contain duplicates (in the data used here, col3 and col4 each appear twice), first make them unique:
s = df.columns.to_series()
df.columns = (s + '.' + s.groupby(s).cumcount().replace({0:''}).astype(str)).str.strip('.')
print (df)
col1 col2 col3 col4 col3.1 col4.1
0 1234 NaT 120.0 NaN 115.0 XYZ
1 1234 2015-01-12 120.0 Abc 115.0 NaN
2 1234 2015-01-12 NaN NaN NaN NaN
Then aggregate with first(), which takes the first non-missing value of each column within each group:
df = df.groupby('col1', as_index=False).first()
print (df)
col1 col2 col3 col4 col3.1 col4.1
0 1234 2015-01-12 120.0 Abc 115.0 XYZ
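
The same aggregation can be tried on the frame from the question, where the column names are already unique (col1 to col6) and the renaming step is not needed. This is a minimal sketch; the frame is rebuilt by hand from the question's table, so the dtypes are an assumption.

```python
import numpy as np
import pandas as pd

# Frame reconstructed from the question; NaT/NaN mark the missing cells.
df1 = pd.DataFrame({
    "col1": [1234, 1234, 1234],
    "col2": pd.to_datetime([None, "2015-01-12", "2015-01-12"]),
    "col3": [120, 120, np.nan],
    "col4": [np.nan, "Abc", np.nan],
    "col5": [115, 115, np.nan],
    "col6": ["XYZ", np.nan, np.nan],
})

# GroupBy.first() returns, per group and per column, the first non-missing value,
# so the three partial rows collapse into one complete row.
result_df = df1.groupby("col1", as_index=False).first()
print(result_df)
```

Note that first() silently picks the earliest non-missing value; if two rows disagree on a value, the later one is dropped without warning.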

pandas DataFrame: match a column against an external file

I have a df with
col0 col1 col2 col3
a 1 2 text1
b 1 2 text2
c 1 3 text3
and I have another text file with
col0 col1 col2
met1 a text1
met2 b text2
met3 c text3
How do I match the values of col3 in my first df against col2 of the text file, and add only the matching col0 string from the text file to my df, without changing the structure of the df?
Desired output:
col0 col1 col2 col3 col4
a 1 2 text1 met1
b 1 2 text2 met2
c 1 3 text3 met3
You can use pandas.DataFrame.merge(). E.g.:
df.merge(df2.loc[:, ['col0', 'col2']], left_on='col3', right_on='col2')
print(df)
col0 col1 col2 col3
0 a 1 2 text1
1 b 1 2 text2
2 c 1 3 text3
print(df2)
col0 col1 col2
0 met1 a text1
1 met2 b text2
2 met3 c text3
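Merge df and df2; the suffixes argument leaves df's column names unchanged and appends _1 to the clashing names from df2:
df3 = df.merge(df2, left_on='col3', right_on='col2',suffixes=('','_1'))
Housekeeping: rename the matched column to col4 and drop the other columns that came from df2: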
df3 = df3.rename(columns={'col0_1':'col4'}).drop(['col1_1','col2_1'], axis=1)
print(df3)
col0 col1 col2 col3 col4
0 a 1 2 text1 met1
1 b 1 2 text2 met2
2 c 1 3 text3 met3
Reassign the result to df if you wish:
df = df3
Or do it in one step:
df = df.assign(col4=df.merge(df2, left_on='col3', right_on='col2',suffixes=('','_1'))['col0_1'])
print(df)
col0 col1 col2 col3 col4
0 a 1 2 text1 met1
1 b 1 2 text2 met2
2 c 1 3 text3 met3
Call your DataFrame df1. First load the text file into a DataFrame with df2 = pd.read_csv('filename.txt') (pass a suitable sep argument if the file is not comma-separated). Then rename the columns in df2 so that the column you want to merge on has the same name in both DataFrames:
df2.columns = ['new_col1', 'new_col2', 'col3']
Then:
pd.merge(df1, df2, on='col3')
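
If only a single column has to be brought over, a lookup with Series.map is an alternative to merging and then dropping columns. This is a different technique from the answers above, shown as a minimal sketch. The text file is simulated with io.StringIO, it is assumed to be whitespace-separated, and the name lookup is introduced here purely for illustration.

```python
import io
import pandas as pd

df1 = pd.DataFrame({
    "col0": ["a", "b", "c"],
    "col1": [1, 1, 1],
    "col2": [2, 2, 3],
    "col3": ["text1", "text2", "text3"],
})

# Stand-in for the external text file (assumed whitespace-separated).
text_file = io.StringIO(
    "col0 col1 col2\n"
    "met1 a text1\n"
    "met2 b text2\n"
    "met3 c text3\n"
)
df2 = pd.read_csv(text_file, sep=r"\s+")

# Build a Series indexed by the text-file key (col2) holding the value to add (col0),
# then look up each col3 value of df1 in it.
lookup = df2.set_index("col2")["col0"]
df1["col4"] = df1["col3"].map(lookup)
print(df1)
```

This keeps the row order and index of df1 untouched; values of col3 with no match in the file end up as NaN. It does require the key column in the file to be unique.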

Split a string into columns

I have a column of values that are a little messy:
Col1
----------------------------------------
B-Lipotropin(S)...............874 BTETLS
IgE-Dandelion(S).............4578 BTETLS
Beta Gamma-Globulin..........2807 BTETLS
Lactate, P
Phospholipid .........8296 BTETLS
How do I split these values into three columns like this?
Col1 Col2 Col3
-----------------------------------------------
B-Lipotropin(S) 874 BTETLS
IgE-Dandelion(S) 4578 BTETLS
Beta Gamma-Globulin 2807 BTETLS
Lactate, P
Phospholipid 8296 BTETLS
Appreciate any help.
You can also use tidyr for this:
library(tidyr)
dat <- read.table(text="B-Lipotropin(S)...............874 BTETLS
IgE-Dandelion(S).............4578 BTETLS
Beta Gamma-Globulin..........2807 BTETLS
Lactate, P
Phospholipid .........8296 BTETLS",
sep=";", stringsAsFactors=F, col.names = 'Col1')
dat %>%
separate(Col1, c('Col1', 'Col2'), '\\.+', extra = 'drop') %>%
separate(Col2, c('Col2', 'Col3'), ' ', extra = 'drop')
# Col1 Col2 Col3
# 1 B-Lipotropin(S) 874 BTETLS
# 2 IgE-Dandelion(S) 4578 BTETLS
# 3 Beta Gamma-Globulin 2807 BTETLS
# 4 Lactate, P <NA> <NA>
# 5 Phospholipid 8296 BTETLS
Edit: you can also do it in one step with separate(Col1, paste0('Col', 1:3), '([^,] )|(\\.+)', extra = 'drop')
Without the actual data, it is difficult to give a general solution. However, below is one that uses regular expressions.
Here I assume that the first two columns are always separated by at least one ".", possibly with spaces before or after it, and that the second and third columns are separated by spaces.
dat <- read.table(text="B-Lipotropin(S)...............874 BTETLS
IgE-Dandelion(S).............4578 BTETLS
Beta Gamma-Globulin..........2807 BTETLS
Lactate, P
Phospholipid .........8296 BTETLS",
sep=";", stringsAsFactors=F)
# separate first column
l <- strsplit(dat[,1], split="[[:space:]]*\\.+[[:space:]]*")
l <- lapply(l, function(x) c(x,rep("",2-length(x))))
l <- do.call(rbind,l)
dat <- cbind(dat, l[,1])
# separate last two columns
l <- strsplit(l[,2], split="[[:space:]]+")
l <- lapply(l, function(x) c(x,rep("",2-length(x))))
l <- do.call(rbind,l)
dat <- cbind(dat, l)
colnames(dat) <- c("original","col1","col2","col3")
The separated columns look like this:
> dat[,-1]
col1 col2 col3
1 B-Lipotropin(S) 874 BTETLS
2 IgE-Dandelion(S) 4578 BTETLS
3 Beta Gamma-Globulin 2807 BTETLS
4 Lactate, P
5 Phospholipid 8296 BTETLS
Using base R, with a regex that splits the string in the right places:
setNames(as.data.frame( # coerce to data.frame
do.call(rbind, # bind list
lapply(
strsplit(dat$Col1, "\\.+|[0-9]+(?= )", perl=T), # split messy string
`length<-`, 3) # normalize lengths of lists
)
), paste0("Col", 1:3)) # add column names
# Col1 Col2 Col3
# 1 B-Lipotropin(S) 874 BTETLS
# 2 IgE-Dandelion(S) 4578 BTETLS
# 3 Beta Gamma-Globulin 2807 BTETLS
# 4 Lactate, P <NA> <NA>
# 5 Phospholipid 8296 BTETLS

Sorting a two-dimensional DataFrame by value using pandas

I have a two-dimensional DataFrame. For simplicity, it looks like this:
df = pd.DataFrame([(1,2.2,5),(2,3,-1)], index=['row1', 'row2'], columns = ["col1","col2",'col3'])
with the output:
col1 col2 col3
row1 1 2.2 5
row2 2 3.0 -1
What's the best way to order it by values to get:
RowName ColName Value
row2 col3 -1
row1 col1 1
row2 col1 2
row1 col2 2.2
row2 col2 3.0
row1 col3 5
I tried using .stack() but didn't get very far. Constructing this with nested for loops is possible, but inelegant.
Any ideas?
melt is a reverse unstack; it gives the long format, which still has to be sorted by value:
In [6]: df
Out[6]:
col1 col2 col3
row1 1 2.2 5
row2 2 3.0 -1
In [7]: pd.melt(df.reset_index(),id_vars='index')
Out[7]:
index variable value
0 row1 col1 1.0
1 row2 col1 2.0
2 row1 col2 2.2
3 row2 col2 3.0
4 row1 col3 5.0
5 row2 col3 -1.0
stack() plus sort_values() gives the desired output (the in-place Series.sort() used in older pandas versions has since been removed):
In [35]: df
Out[35]:
col1 col2 col3
row1 1 2.2 5
row2 2 3.0 -1
In [36]: stacked = df.stack()
In [38]: stacked = stacked.sort_values()
In [39]: stacked
Out[39]:
row2 col3 -1.0
row1 col1 1.0
row2 col1 2.0
row1 col2 2.2
row2 col2 3.0
row1 col3 5.0
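
To also get the three named columns asked for in the question, the sorted Series can be turned back into a DataFrame. This is a minimal sketch that puts the steps above into one chain; the names RowName, ColName and Value are taken from the question's desired output.

```python
import pandas as pd

df = pd.DataFrame(
    [(1, 2.2, 5), (2, 3, -1)],
    index=["row1", "row2"],
    columns=["col1", "col2", "col3"],
)

result = (
    df.stack()                                # Series with a (row, column) MultiIndex
      .sort_values()                          # order by the cell values
      .rename_axis(["RowName", "ColName"])    # name the two index levels
      .reset_index(name="Value")              # levels become columns, values go to "Value"
)
print(result)
```

Because the frame mixes integer and float columns, stack() upcasts everything to float, so the values print as -1.0, 1.0 and so on.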