Comparison of two CSV files in Python

Comparison of two CSV files in Python - python-2.7

I want to compare two csv files looking like below.
Here I want to find out unmatched signals.
I need some help in python. Please help me.
File 1
2
USER Name
7/31/2015 0:00
<XXXXXXX>
1 Signal_1 10
2 Signal_2 1 2 3 4 5
3 Signal_3 X 5 10 15 20 25 Y 6 11 16 21 26
1 Signal_4 20
1 Signal_5 30
2 Signal_6 6 7 8 9 10 11 12 13
2 Signal_7 55 1.05 1.6 14.1
3 Signal_8 X 30 40 50 60 40 Y 14 15 26 14 26
2 Signal_9 1 1 2 3 2
1 Signal_10 40
File 2
2
USER Name
7/31/2015 0:00
<XXXXXXX>
3 Signal_3 X 20 10 15 17 25 Y 6 11 16 21 26
1 Signal_5 5
2 Signal_7 55 1.05 1.6 14.1
1 Signal_1 10
3 Signal_8 X 30 40 50 60 40 Y 14 15 26 14 26
1 Signal_10 14
2 Signal_9 1 1 2 3 2
2 Signal_6 6 7 8 59 10 15 12 13
1 Signal_4 20
2 Signal_2 1 2 3 4 5
Result:
File
3 Signal_3 X 5 10 15 20 25 Y 6 11 16 21 26
1 Signal_5 30
1 Signal_10 40
2 Signal_6 6 7 8 9 10 11 12 13
File 2
3 Signal_3 X 20 10 15 17 25 Y 6 11 16 21 26
1 Signal_5 5
1 Signal_10 14
2 Signal_9 1 1 2 3 2

If you want to check for fairly exact comparisons, you can use sets quite easily:
def sigset(fname):
with open(fname, 'rb') as f:
data = set(' '.join(line.split()) for line in f
if 'Signal' in line)
return data
s1 = sigset('sig1.txt')
s2 = sigset('sig2.txt')
print('File 1')
for line in sorted(s1 - s2):
print(line)
print('')
print('File 2')
for line in sorted(s2 - s1):
print(line)

with open('Sample1.csv', 'r') as t1, open('Sample2.csv', 'r') as t2:
fileone = t1.readlines()
filetwo = t2.readlines()
print fileone
print filetwo
with open('update.csv', 'w') as outFile:
for line in filetwo:
if line not in fileone:
outFile.write(line)
for line in fileone:
if line not in filetwo:
outFile.write(line)

Related

count and group the column by sequence

I have a dataset that has to be grouped by number as follows.
ID dept count
1 10 2
2 10 2
3 20 4
4 20 4
5 20 4
6 20 4
7 30 4
8 30 4
9 30 4
10 30 4
so for every 3rd row I need a new level the output should be as follows.
ID dept count Level
1 10 2 1
2 10 2 1
3 20 4 1
4 20 4 1
5 20 4 2
6 20 4 2
7 30 4 1
8 30 4 1
9 30 4 2
10 30 4 2
I have tried counting the number of rows based on the dept and count.
data want;
set have;
by dept count;
if first.count then level=1;
else level+1;
run;
this generates a count but not what exactly I am looking for
ID dept count Level
1 10 2 1
2 10 2 1
3 20 4 1
4 20 4 1
5 20 4 2
6 20 4 2
7 30 4 1
8 30 4 1
9 30 4 2
10 30 4 2

It isn't quite clear what output you want. I've extended your input data a bit - please
could you clarify what output you'd expect for this input and what the logic is for generating it?
I've made a best guess at roughly what you might be aiming for - incrementing every 3 rows with the same dept and count - perhaps this will be enough for you to get to the answer you want?
data have;
input ID dept count;
cards;
1 10 2
2 10 2
3 20 4
4 20 4
5 20 4
6 20 4
7 30 4
8 30 4
9 30 4
10 30 4
11 30 4
12 30 4
13 30 4
14 30 4
;
run;
data want;
set have;
by dept count;
if first.count then do;
level = 0;
dummy = 0;
end;
if mod(dummy,3) = 0 then level + 1;
dummy + 1;
drop dummy;
run;
Output:
ID dept count level
1 10 2 1
2 10 2 1
3 20 4 1
4 20 4 1
5 20 4 1
6 20 4 2
7 30 4 1
8 30 4 1
9 30 4 1
10 30 4 2
11 30 4 2
12 30 4 2
13 30 4 3
14 30 4 3

One way to do this is to nest the SET statement inside a DO loop. Or in this case two DO loops. One to generate the LEVEL (within DEPT) and the second to count by twos. Use the LAST.DEPT flag to handle odd number of observations.
So if I modify the input to include odd number of observations in some groups.
data have;
input ID dept count;
cards;
1 10 2
2 10 2
3 20 4
4 20 4
5 20 4
6 20 4
7 20 4
8 30 4
9 30 4
10 30 4
;
Then can use this step to assign the LEVEL variable.
data want ;
do level=1 by 1 until(last.dept);
do sublevel=1 to 2 until(last.dept);
set have;
by dept;
output;
end;
end;
run;
Results:
Obs level sublevel ID dept count
1 1 1 1 10 2
2 1 2 2 10 2
3 1 1 3 20 4
4 1 2 4 20 4
5 2 1 5 20 4
6 2 2 6 20 4
7 3 1 7 20 4
8 1 1 8 30 4
9 1 2 9 30 4
10 2 1 10 30 4

Pandas grouped differences with variable lags

I have a pandas data frame with three variables. The first is a grouping variable, the second a within group "scenario" and the third an outcome. I would like to calculate the within group difference between the null condition, scenario zero, and the other scenarios within the group. The number of scenarios varies between the different groups. My data looks like:
ipdb> aDf
FieldId Scenario TN_load
0 0 0 134.922952
1 0 1 111.787326
2 0 2 104.805951
3 1 0 17.743467
4 1 1 13.411849
5 1 2 13.944552
6 1 3 17.499152
7 1 4 17.640090
8 1 5 14.220673
9 1 6 14.912306
10 1 7 17.233862
11 1 8 13.313953
12 1 9 17.967438
13 1 10 14.051882
14 1 11 16.307317
15 1 12 12.506358
16 1 13 16.266233
17 1 14 12.913150
18 1 15 18.149811
19 1 16 12.337736
20 1 17 12.008868
21 1 18 13.434605
22 2 0 454.857959
23 2 1 414.372215
24 2 2 478.371387
25 2 3 385.973388
26 2 4 487.293966
27 2 5 481.280175
28 2 6 403.285123
29 3 0 30.718375
... ... ...
29173 4997 3 53.193992
29174 4997 4 45.800968
I will also have to write functions to get percentage differences etc. but this has me stumped. Any help greatly appreciated.

You can get the difference with the scenario 0 within groups using groupby and transform like:
df['TN_load_0'] = df['TN_load'].groupby(df['FieldId']).transform(lambda x: x - x.iloc[0])
df
FieldId Scenario TN_load TN_load_0
0 0 0 134.922952 0.000000
1 0 1 111.787326 -23.135626
2 0 2 104.805951 -30.117001
3 1 0 17.743467 0.000000
4 1 1 13.411849 -4.331618
5 1 2 13.944552 -3.798915
6 1 3 17.499152 -0.244315

how to print a list vertically python

>list1=[1,2,3,4]
>list2=[5,6,7,8]
>list3=[9,10,11,12]
>list4=[13,14,15,16]
>list5=[17,18,19,20]
>lists=[list1,list2,list3,list4,list5
I want to print the following code so that it outputs this way:
4 8 12 16 20
3 7 11 15 19
2 6 10 14 18
sorry didn't knew it ignored new lines:
1 5 9 13 17
Thanks in advance (new to python)

One way you could achieve this is to zip up the reversed lists and simply print all the elements out.
list1=[1,2,3,4]
list2=[5,6,7,8]
list3=[9,10,11,12]
list4=[13,14,15,16]
list5=[17,18,19,20]
for l1, l2, l3, l4, l5 in zip(reversed(list1), reversed(list2), reversed(list3), reversed(list4), reversed(list5)):
print(l1, l2, l3, l4, l5, end=' ')
output
4 8 12 16 20 3 7 11 15 19 2 6 10 14 18 1 5 9 13 17

SAS IF then statement

Hello for whatever reason my if then statement will not work for this code. What I am trying to get it to do is (kinda obvious but whatever) if the salary is LE 30,000 then make new variable income equal to low. Here is what I have so far.
data newdd2;
input subject group$ year salary : comma7. ##;
IF (salary <= 30,000) THEN income = 'low';
datalines;
1 A 2 53,900 2 B 2 37,400 3 A 1 49,500
4 C 2 43,900 5 B 3 38,400 6 A 3 39,500
7 A 3 53,600 8 B 2 37,700 9 C 1 49,900
10 C 2 43,300 11 B 3 57,400 12 B 3 39,500
13 B 1 33,900 14 A 2 41,400 15 C 2 49,500
16 C 1 43,900 17 B 1 39,400 18 A 3 39,900
19 A 2 53,600 20 A 2 37,700 21 C 3 42,900
22 C 2 43,300 23 B 1 57,400 24 C 3 69,500
25 C 2 33,900 26 A 2 35,300 27 A 2 47,500
28 C 2 43,900 29 B 3 38,400 30 A 1 32,500
31 A 3 53,600 32 B 2 37,700 33 C 1 41,600
34 C 2 43,300 35 B 3 57,400 36 B 3 39,500
37 B 2 33,900 38 A 2 41,400 39 C 2 79,500
40 C 1 43,900 41 C 1 29,500 42 A 3 39,900
43 A 2 53,600 44 A 2 37,500 45 C 3 42,900
46 C 2 43,300 47 B 1 47,400 48 C 3 59,500
run;
The error I keep getting is (The work dataset may be incomplete), however I am sure that my code is correct I've tried a number of things but no success yet thanks in advance.

You cannot use a comma in a numeric literal.
IF (salary <= 30000) THEN income = 'low';

Converting dict of dicts into pandas DataFrame - memory issues

I have a data structure that consists of a three-level nested dict that keeps counts of occurrences of a three part object. I'd like to build a DataFrame out of it with a specific shape, but I can't figure out a way to do it that doesn't involve consuming a lot of working memory---because the table is quite large (several GBs at full extent).
The basic functionality looks like this:
class SparseCubeTable:
def __init__(self):
self.table = {}
self.dim1 = []
self.dim2 = []
self.dim3 = []
def increment(self, dim1, dim2, dim3):
if dim1 in self.table:
if dim2 in self.table[dim1]:
if dim3 in self.table[dim1][dim2]:
self.table[dim1][dim2][dim3] += 1
else:
self.dim3.append(dim3)
self.table[dim1][dim2][dim3] = 1
else:
self.dim2.append(dim2)
self.dim3.append(dim3)
self.table[dim1][dim2] = {dim3:1}
else:
self.dim1.append(dim1)
self.dim2.append(dim2)
self.dim3.append(dim3)
self.table[dim1] = {dim2:{dim3:1}}
This was constructed to make summing over keys easier, among other things. A SparseCubeTable is used like this:
In [23]: example = SparseCubeTable()
In [24]: example.increment("thing1", "thing2", "thing3")
In [25]: example.increment("thing1", "thing2", "thing3")
In [26]: example.increment("thing4", "thing5", "thing6")
In [27]: example.increment("thing1", "thing3", "thing5")
And you can get the data like this:
In [29]: example.table['thing1']['thing2']['thing3']
Out[29]: 2
The sort of DataFrame I want looks like this:
1 2 3 4
thing1 thing2 thing3 2
thing1 thing3 thing5 1
thing4 thing5 thing6 1
The DataFrame is going to be saved as an HDF5 db with columns 1-3 indexed and statistical transformations on column 4 (that require the whole table be temporarily in memory).
The problem is that the pandas.DataFrame.from_dict function builds a whole other sort of structure with the keys used as row labels, as far as I understand it. However, trying to use from_records forces me to copy out the whole data structure into a list, meaning that I now have double the memory size to worry about.
I tried implementing the solution in:
Create a pandas DataFrame from generator?
but in 0.12.0 what it ends up doing is first building a giant list of strings which is even worse. I assume writing out the structure to a csv and reading it back in is also going to be terrible on memory.
Is there a better way of doing this? Or should I just try to squeeze memory even further in the SparseCubeTable somehow? It seems so wasteful to have to build an intermediate list data structure to use from_records.

Here is a code for an efficient solution.
Create some data looking like yours. This is a list of 1000 3-tuples
In [1]: import random
In [2]: tags = [ 'thing{0}'.format(i) for i in xrange(100) ]
In [3]: data = [ (random.choice(tags),random.choice(tags),random.choice(tags)) for i in range(1000) ]
Our writing function, makes sure that when we write the index is globally unique (its not actually necessary, but since the index is actually written its 'nicer')
In [4]: def write(store,c):
...: df = DataFrame(c,columns=['dim1','dim2','dim3'])
...: try:
...: nrows = store.get_storer('df').nrows
...: except:
...: nrows = 0
...: df.index += nrows
...: store.append('df',df,data_columns=True)
...: return []
...:
In [5]: collector = []
In [6]: store = pd.HDFStore('data.h5',mode='w')
Iterate thru your data (or from a stream or whatever), and write it.
In [7]: for i, d in enumerate(data):
...: collector.append(d)
...: if i % 100 == 0 and i:
...: collector = write(store,collector)
...:
In [8]: write(store,collector)
Out[8]: []
The store
In [9]: store
Out[9]:
<class 'pandas.io.pytables.HDFStore'>
File path: data.h5
/df frame_table (typ->appendable,nrows->1000,ncols->3,indexers->[index],dc->[dim1,dim2,dim3])
In [9]: store
Out[9]:
<class 'pandas.io.pytables.HDFStore'>
File path: data.h5
/df frame_table (typ->appendable,nrows->1000,ncols->3,indexers->[index],dc->[dim1,dim2,dim3])
In [10]: store.select('df')
Out[10]:
dim1 dim2 dim3
0 thing28 thing87 thing29
1 thing62 thing70 thing50
2 thing64 thing12 thing98
3 thing33 thing98 thing46
4 thing46 thing5 thing76
5 thing2 thing9 thing21
6 thing1 thing63 thing68
7 thing42 thing30 thing45
8 thing56 thing71 thing77
9 thing99 thing10 thing91
10 thing40 thing9 thing10
11 thing70 thing54 thing59
12 thing94 thing65 thing3
13 thing93 thing24 thing25
14 thing95 thing94 thing86
15 thing41 thing55 thing3
16 thing88 thing10 thing47
17 thing89 thing58 thing33
18 thing16 thing66 thing55
19 thing68 thing20 thing99
20 thing34 thing71 thing28
21 thing67 thing87 thing97
22 thing77 thing74 thing6
23 thing63 thing41 thing30
24 thing14 thing62 thing66
25 thing20 thing36 thing67
26 thing33 thing19 thing58
27 thing0 thing71 thing24
28 thing1 thing48 thing42
29 thing18 thing12 thing4
30 thing85 thing97 thing20
31 thing73 thing71 thing70
32 thing91 thing43 thing48
33 thing45 thing6 thing87
34 thing0 thing28 thing8
35 thing56 thing38 thing61
36 thing39 thing92 thing35
37 thing69 thing26 thing22
38 thing16 thing16 thing79
39 thing4 thing16 thing12
40 thing81 thing79 thing1
41 thing77 thing90 thing83
42 thing53 thing17 thing89
43 thing53 thing15 thing37
44 thing25 thing7 thing20
45 thing44 thing14 thing25
46 thing62 thing84 thing23
47 thing83 thing50 thing60
48 thing68 thing64 thing24
49 thing73 thing53 thing43
50 thing86 thing67 thing31
51 thing75 thing63 thing82
52 thing8 thing10 thing90
53 thing34 thing23 thing12
54 thing66 thing97 thing26
55 thing66 thing53 thing27
56 thing79 thing22 thing37
57 thing43 thing82 thing66
58 thing87 thing53 thing92
59 thing33 thing71 thing97
... ... ...
[1000 rows x 3 columns]
In [11]: store.close()
Then you can do interesting things. If you are not reading the entire set in you may want to chunk this (which is a bit more involved if you are counting things).
In [56]: pd.read_hdf('data.h5','df').apply(lambda x: x.value_counts())
Out[56]:
dim1 dim2 dim3
thing0 12 6 8
thing1 14 7 8
thing10 10 10 7
thing11 8 10 14
thing12 11 14 11
thing13 11 12 7
thing14 8 14 3
thing15 12 11 11
thing16 7 10 11
thing17 16 9 13
thing18 13 8 10
thing19 11 7 8
thing2 9 5 17
thing20 6 7 11
thing21 7 8 8
thing22 4 17 14
thing23 14 11 7
thing24 10 5 14
thing25 11 11 12
thing26 13 10 15
thing27 12 15 16
thing28 11 10 8
thing29 7 7 8
thing3 11 14 14
thing30 11 16 8
thing31 7 6 12
thing32 8 12 9
thing33 13 12 12
thing34 12 8 5
thing35 6 10 8
thing36 6 9 13
thing37 8 10 12
thing38 7 10 4
thing39 14 11 7
thing4 9 7 10
thing40 12 8 9
thing41 8 16 11
thing42 9 11 13
thing43 8 6 13
thing44 9 13 11
thing45 7 13 7
thing46 12 8 13
thing47 9 10 9
thing48 8 9 9
thing49 4 8 7
thing5 13 7 7
thing50 14 12 9
thing51 5 7 11
thing52 9 11 12
thing53 9 15 15
thing54 7 9 13
thing55 6 10 10
thing56 12 11 11
thing57 12 9 11
thing58 12 12 10
thing59 6 13 10
thing6 8 5 7
thing60 12 9 6
thing61 5 9 9
thing62 8 10 8
... ... ...
[100 rows x 3 columns]
You can then do a 'groupby' like this:
In [69]: store = pd.HDFStore('data.h5')
In [61]: dim1 = Index(store.select_column('df','dim1').unique())
In [66]: store.close()
In [67]: groups = dim1[0:10]
In [68]: groups
Out[68]: Index([u'thing28', u'thing62', u'thing64', u'thing33', u'thing46', u'thing2', u'thing1', u'thing42', u'thing56', u'thing99'], dtype='object')
In [70]: pd.read_hdf('data.h5','df',where='dim1=groups').apply(lambda x: x.value_counts())
Out[70]:
dim1 dim2 dim3
thing1 14 2 1
thing10 NaN 1 1
thing11 NaN 1 2
thing12 NaN 5 NaN
thing13 NaN 1 NaN
thing14 NaN 1 1
thing15 NaN 1 1
thing16 NaN 1 3
thing17 NaN NaN 2
thing18 NaN 1 1
thing19 NaN 1 2
thing2 9 1 1
thing20 NaN 2 NaN
thing21 NaN NaN 1
thing22 NaN 2 2
thing23 NaN 2 3
thing24 NaN 2 1
thing25 NaN 3 2
thing26 NaN 2 2
thing27 NaN 3 1
thing28 11 NaN NaN
thing29 NaN 1 2
thing30 NaN 2 NaN
thing31 NaN 1 1
thing32 NaN 1 1
thing33 13 1 2
thing34 NaN 1 NaN
thing35 NaN 1 NaN
thing36 NaN 1 1
thing37 NaN 1 2
thing38 NaN 3 NaN
thing39 NaN 3 1
thing4 NaN 2 NaN
thing41 NaN NaN 1
thing42 9 1 1
thing43 NaN NaN 1
thing44 NaN 1 2
thing45 NaN NaN 2
thing46 12 NaN 1
thing47 NaN 1 1
thing48 NaN 1 NaN
thing49 NaN 1 NaN
thing5 NaN 2 2
thing50 NaN NaN 3
thing51 NaN 2 2
thing52 NaN 1 3
thing53 NaN 2 4
thing55 NaN NaN 2
thing56 12 1 1
thing57 NaN NaN 3
thing58 NaN 1 2
thing6 NaN NaN 1
thing60 NaN 1 1
thing61 NaN 1 4
thing62 8 2 1
thing63 NaN 1 1
thing64 15 NaN 1
thing66 NaN 1 2
thing67 NaN 2 NaN
thing68 NaN 1 1
... ... ...
[90 rows x 3 columns]

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Comparison of two CSV files in Python - python-2.7

Related

count and group the column by sequence

Pandas grouped differences with variable lags

how to print a list vertically python

SAS IF then statement

Converting dict of dicts into pandas DataFrame - memory issues

Categories

Resources