Convert word Python Pandas DataFrame into Zero One DataFrame - python-2.7

Input
userID col1 col2 col3 col4 col5 col6 col7 col8 col9
1 Java c c++ php python perl html hadoop nodejs
2 nodejs c# c++ oops css html angular java php
3 php python html java angular hadoop c nodejs c#
4 python php css perl hadoop c nodejs c# html
5 perl css python hadoop c nodejs c# java php
6 Java python css perl nodejs c# java php hadoop
7 javascript java perl nodejs angular php mysql hadoop html
8 angular mysql mongodb cs hadoop angular oops html perl
9 nodejs hadoop mysql mongodb angular oops html python java
Desired Output
userID Java C C++ php python perl html hadoop nodejs oops mysql mongo
1 1 1 1 1 1 1 1 1 1 0 0 0
2 1 0 1 1 0 0 1 0 1 0 0 0
3 1 1 0 1 1 1 1 1 1 0 0 0
4 0 0 0 0 1 1 1 0 1 1 1 1

Use get_dummies, then groupby the columns by name (level=0, axis=1) and aggregate with max:
df = pd.get_dummies(df.set_index('userID'), prefix='', prefix_sep='')
df = df.groupby(level=0, axis=1).max().reset_index()
print (df)
userID Java angular c c# c++ cs css hadoop html java javascript \
0 1 1 0 1 0 1 0 0 1 1 0 0
1 2 0 1 0 1 1 0 1 0 1 1 0
2 3 0 1 1 1 0 0 0 1 1 1 0
3 4 0 0 1 1 0 0 1 1 1 0 0
4 5 0 0 1 1 0 0 1 1 0 1 0
5 6 1 0 0 1 0 0 1 1 0 1 0
6 7 0 1 0 0 0 0 0 1 1 1 1
7 8 0 1 0 0 0 1 0 1 1 0 0
8 9 0 1 0 0 0 0 0 1 1 1 0
mongodb mysql nodejs oops perl php python
0 0 0 1 0 1 1 1
1 0 0 1 1 0 1 0
2 0 0 1 0 0 1 1
3 0 0 1 0 1 1 1
4 0 0 1 0 1 1 1
5 0 0 1 0 1 1 1
6 0 1 1 0 1 1 0
7 1 1 0 1 1 0 0
8 1 1 1 1 0 0 1
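Note that on recent pandas versions DataFrame.groupby(axis=1) is deprecated; a minimal sketch of the same idea using a transpose instead (toy data and names are my own, not from the question) is:
import pandas as pd

# toy frame with a duplicate dummy name ('java' appears in two columns)
df = pd.DataFrame({'userID': [1, 2],
                   'col1': ['java', 'nodejs'],
                   'col2': ['c', 'java']})
dummies = pd.get_dummies(df.set_index('userID'),
                         prefix='', prefix_sep='', dtype=int)
# transpose, group the duplicated column names, take max, transpose back
out = dummies.T.groupby(level=0).max().T.reset_index()
print (out)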

Related

SAS code - sum of last N rows for every row

I have a dataset like this for each ID:
Months      ID  Number
2018-07-01  1   0
2018-08-01  1   0
2018-09-01  1   1
2018-10-01  1   3
2018-11-01  1   1
2018-12-01  1   2
2019-01-01  1   0
2019-02-01  1   0
2019-03-01  1   1
2019-04-01  1   0
2019-05-01  1   0
2019-06-01  1   0
2019-07-01  1   1
2019-08-01  1   0
2019-09-01  1   0
2019-10-01  1   2
2019-11-01  1   0
2019-12-01  1   0
2020-01-01  1   0
2020-02-01  1   0
2020-03-01  1   0
2020-04-01  1   0
2020-05-01  1   0
2020-06-01  1   0
2020-07-01  1   0
2020-08-01  1   1
2020-09-01  1   0
2020-10-01  1   0
2020-11-01  1   1
2020-12-01  1   0
2021-01-01  1   0
2021-02-01  1   1
2021-03-01  1   1
2021-04-01  1   0
2018-07-01  2   0
.......
(Similar values for each ID)
I want a dataset like this:
Months      ID  Number  Sum_Next_6Number
2018-07-01  1   0       7
2018-08-01  1   0       7
2018-09-01  1   1       7
2018-10-01  1   3       4
2018-11-01  1   1       3
2018-12-01  1   2       1
2019-01-01  1   0       2
2019-02-01  1   0       2
2019-03-01  1   1       1
2019-04-01  1   0       3
2019-05-01  1   0       3
2019-06-01  1   0       3
2019-07-01  1   1       2
2019-08-01  1   0       2
2019-09-01  1   0       2
2019-10-01  1   2       0
2019-11-01  1   0       0
2019-12-01  1   0       0
2020-01-01  1   0       0
2020-02-01  1   0       1
2020-03-01  1   0       1
2020-04-01  1   0       1
2020-05-01  1   0       2
2020-06-01  1   0       2
2020-07-01  1   0       2
2020-08-01  1   1       2
2020-09-01  1   0       3
2020-10-01  1   0       3
2020-11-01  1   1       NaN
2020-12-01  1   0       NaN
2021-01-01  1   0       NaN
2021-02-01  1   1       NaN
2021-03-01  1   1       NaN
2021-04-01  1   0       NaN
2018-07-01  2   0       0
.......
If there are not 6 months left for an ID, the value should be NaN.
Is there a way to do this? Thank you in advance.
data want(drop = i n);
   set have curobs = c nobs = nobs;
   Sum_Next_6Numbers = 0;
   /* read ahead through the next 6 observations with POINT= */
   do p = c + 1 to 6 + c;
      /* fewer than 6 observations remain in the data set */
      if p > nobs then do;
         Sum_Next_6Numbers = .; leave;
      end;
      set have(keep = Number ID rename = (Number = n id = i)) point = p;
      /* the look-ahead crossed into the next ID's rows */
      if id ne i then do;
         Sum_Next_6Numbers = .; leave;
      end;
      Sum_Next_6Numbers + n;
   end;
run;
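For comparison, here is a rough pandas sketch of the same look-ahead sum; the frame mirrors the question's columns, but the shifted-rolling-sum trick is my own assumption of an equivalent, not part of the original answer:
import pandas as pd

# toy frame mirroring the question's columns (first 10 months of ID 1)
df = pd.DataFrame({
    "Months": pd.date_range("2018-07-01", periods=10, freq="MS"),
    "ID": 1,
    "Number": [0, 0, 1, 3, 1, 2, 0, 0, 1, 0],
})
# trailing 6-row rolling sum, shifted back 6 rows so each row gets the
# sum of the NEXT 6 Numbers; rows without 6 following rows become NaN
df["Sum_Next_6Number"] = (
    df.groupby("ID")["Number"]
      .transform(lambda s: s.rolling(6).sum().shift(-6))
)
print (df)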

Count the number of unique ids for every subset of variables

I want to find the number of unique ids for every subset combination of the variables. For example:
data have;
input id var1 var2 var3;
datalines;
5 1 0 0
5 1 1 1
5 1 0 1
5 0 0 0
6 1 0 0
7 1 1 1
8 1 0 1
9 0 0 0
10 1 0 0
11 1 0 0
12 1 . 1
13 0 0 1
;
run;
I want the result to be
var1 var2 var3 count
. . 0 5
. . 1 5
. 0 . 7
. 0 0 5
. 0 1 3
. 1 . 2
. 1 1 2
0 . . 3
0 . 0 2
0 . 1 1
0 0 . 3
0 0 0 2
0 0 1 1
1 . . 7
1 . 0 4
1 . 1 4
1 0 . 5
1 0 0 4
1 0 1 2
1 1 . 2
1 1 1 2
which is the result of appending all the possible PROC SQL group bys (the query for var1 alone is shown below):
proc sql;
create table sub1 as
select var1, count(distinct id) as count
from have
where not missing(var1)
group by var1
;
quit;
I don't care about the case where all variables are missing or when any of the variables in the group by are missing. Is there a more efficient way of doing this?
You can use Proc SUMMARY to compute the combinations of var1-var3 values present within each id BY group. From the SUMMARY output, a SQL query can then count the distinct ids per combination.
Example:
data have;
input id var1 var2 var3;
datalines;
5 1 0 0
5 1 1 1
5 1 0 1
5 0 0 0
6 1 0 0
7 1 1 1
8 1 0 1
9 0 0 0
10 1 0 0
11 1 0 0
12 1 . 1
13 0 0 1
;
proc summary noprint missing data=have;
by id;
class var1-var3;
output out=combos;
run;
proc sql;
create table want as
select var1, var2, var3, count(distinct id) as count
from combos
group by var1, var2, var3
;
quit;
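If you want to sanity-check the result outside SAS, a pandas sketch of the same subset counting (enumerating the non-empty subsets with itertools; the data is the example from the question) could look like:
from itertools import combinations
import pandas as pd

df = pd.DataFrame({
    "id":   [5, 5, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13],
    "var1": [1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0],
    "var2": [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, None, 0],
    "var3": [0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1],
})
pieces = []
for r in range(1, 4):
    for subset in combinations(["var1", "var2", "var3"], r):
        # drop rows missing any grouping value, count distinct ids
        pieces.append(df.dropna(subset=list(subset))
                        .groupby(list(subset))["id"].nunique()
                        .reset_index(name="count"))
print (pd.concat(pieces, ignore_index=True))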

Transform categorical column into dummy columns using Power Query M

Using Power Query "M" language, how would you transform a categorical column containing discrete values into multiple "dummy" columns? I come from the Python world, and there are several ways to do this, but one way is shown below:
>>> import pandas as pd
>>> dataset = pd.DataFrame(list('ABCDACDEAABADDA'),
columns=['my_col'])
>>> dataset
my_col
0 A
1 B
2 C
3 D
4 A
5 C
6 D
7 E
8 A
9 A
10 B
11 A
12 D
13 D
14 A
>>> pd.get_dummies(dataset)
my_col_A my_col_B my_col_C my_col_D my_col_E
0 1 0 0 0 0
1 0 1 0 0 0
2 0 0 1 0 0
3 0 0 0 1 0
4 1 0 0 0 0
5 0 0 1 0 0
6 0 0 0 1 0
7 0 0 0 0 1
8 1 0 0 0 0
9 1 0 0 0 0
10 0 1 0 0 0
11 1 0 0 0 0
12 0 0 0 1 0
13 0 0 0 1 0
14 1 0 0 0 0
Interesting question. Here's an easy, scalable method I've found:
Create a custom column of all ones (Add Column > Custom Column > Formula = 1).
Add an index column (Add Column > Index Column).
Pivot on the custom column (select my_col > Transform > Pivot Column).
Replace null values with 0 (select all columns > Transform > Replace Values).
Here's what the M code looks like for this process (these steps go inside your query's let expression, where #"Previous Step" stands for whatever step precedes them):
#"Added Custom" = Table.AddColumn(#"Previous Step", "Custom", each 1),
#"Added Index" = Table.AddIndexColumn(#"Added Custom", "Index", 0, 1),
#"Pivoted Column" = Table.Pivot(#"Added Index", List.Distinct(#"Added Index"[my_col]), "my_col", "Custom"),
#"Replaced Value" = Table.ReplaceValue(#"Pivoted Column",null,0,Replacer.ReplaceValue,Table.ColumnNames(#"Pivoted Column"))
Once you've completed the above, you can remove the index column if desired.

Convert this Word DataFrame into Zero One Matrix Format DataFrame in Python Pandas

I want to convert a DataFrame of user_Id and skills into a zero/one matrix of users and their corresponding skills.
Input DataFrame
user_Id skills
0 user1 [java, hdfs, hadoop]
1 user2 [python, c++, c]
2 user3 [hadoop, java, hdfs]
3 user4 [html, java, php]
4 user5 [hadoop, php, hdfs]
Desired Output DataFrame
user_Id java c c++ hadoop hdfs python html php
user1 1 0 0 1 1 0 0 0
user2 0 1 1 0 0 1 0 0
user3 1 0 0 1 1 0 0 0
user4 1 0 0 0 0 0 1 1
user5 0 0 0 1 1 0 0 1
You can join a new dummies DataFrame back to user_Id: if the skills column holds lists, first convert them to str with astype (otherwise omit this step), then remove the [] with strip and use get_dummies:
df = df[['user_Id']].join(df['skills'].astype(str).str.strip('[]').str.get_dummies(', '))
print (df)
user_Id c c++ hadoop hdfs html java php python
0 user1 0 0 1 1 0 1 0 0
1 user2 1 1 0 0 0 0 0 1
2 user3 0 0 1 1 0 1 0 0
3 user4 0 0 0 0 1 1 1 0
4 user5 0 0 1 1 0 0 1 0
Alternatively, build the dummies separately, clean the column names, and concat:
df1 = df['skills'].astype(str).str.strip('[]').str.get_dummies(', ')
#if necessary, remove ' from column names
df1.columns = df1.columns.str.strip("'")
df = pd.concat([df['user_Id'], df1], axis=1)
print (df)
user_Id c c++ hadoop hdfs html java php python
0 user1 0 0 1 1 0 1 0 0
1 user2 1 1 0 0 0 0 0 1
2 user3 0 0 1 1 0 1 0 0
3 user4 0 0 0 0 1 1 1 0
4 user5 0 0 1 1 0 0 1 0
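If the skills column really holds Python lists (not their string representation), a shorter route I'd sketch on pandas 0.25+ is explode plus crosstab:
import pandas as pd

# minimal sketch, assuming df['skills'] contains actual lists
df = pd.DataFrame({
    "user_Id": ["user1", "user2", "user3"],
    "skills": [["java", "hdfs", "hadoop"],
               ["python", "c++", "c"],
               ["hadoop", "java", "hdfs"]],
})
long = df.explode("skills")                  # one row per (user, skill) pair
out = pd.crosstab(long["user_Id"], long["skills"]).reset_index()
print (out)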

Printing the number of times an element occurs in a file using regex

I have a long file with data similar to the below:
16:24:59 0 0 0
16:24:59 0 1 0
16:25:00 0 1 0
16:25:00 0 1 0
16:25:00 0 2 0
16:25:00 0 2 0
16:25:00 1 0 1
16:25:01 0 0 0
16:25:01 0 0 0
16:25:01 0 0 0
16:25:01 0 0 0
16:25:01 4 9 4
16:25:02 0 0 0
16:25:02 0 0 0
16:25:02 0 0 0
16:25:02 0 1 0
16:25:02 1 9 1
16:25:02 2 0 2
I wish to have output that prints each element in column 1 and the number of times it occurs. Below is what I expect. How can I do this?
16:24:59 2
16:25:00 5
16:25:01 5
16:25:02 6
How can I then replace the above with the following?
t1 2
t2 5
t3 5
t4 6
.
.
tn 9
It's pretty straightforward using awk:
awk '{count[$1]++} END{ for ( i in count) print i, count[i]}'
Test
$ awk '{count[$1]++} END{ for ( i in count) print i, count[i]}' input
16:24:59 2
16:25:00 5
16:25:01 5
16:25:02 6
What it does:
count[$1]++ creates an associative array indexed by the first field and increments its count on every line.
END runs its action once, at the end of the input file.
for ( i in count ) print i, count[i] iterates through the array count and prints each index with its count.
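If you later want the t1, t2, ... relabelling asked about above, one way I'd sketch it in Python (assuming the same whitespace-separated input file, named input here):
from collections import Counter

counts = Counter()
with open("input") as f:
    for line in f:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1       # count by the first field

# Counter preserves insertion order on Python 3.7+, so the labels
# t1..tn follow the order the timestamps first appear in the file
for n, (key, c) in enumerate(counts.items(), start=1):
    print(f"t{n} {c}")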
Just in case you want a grep and uniq solution:
$ grep -Eo '^[[:space:]]*[0-9]{2}:[0-9]{2}:[0-9]{2}' /tmp/lines.txt | uniq -c
2 16:24:59
5 16:25:00
5 16:25:01
6 16:25:02
Or, if tab delimited, use cut (adjust -f to whichever field holds the timestamp):
$ cut -f 2 /tmp/lines.txt | uniq -c
2 16:24:59
5 16:25:00
5 16:25:01
6 16:25:02