Replace specific values in openpyxl - python-2.7

I have an Excel file that looks like this:
1984 1 1
1985 1 1
I want to change all of the values in column 2 to 0, but I am not sure how to loop through the rows.
I have tried:
import openpyxl
wb = openpyxl.load_workbook(r'C:\file.xlsx')
ws = wb['Sheet1']
for row in ws:
    row = [x.replace('1', '0') for x in row]
but that must not be how you loop through rows.
My desired output is:
1984 0 1
1985 0 1

You can do something like this:
import openpyxl

excelFile = openpyxl.load_workbook('file.xlsx')
sheet1 = excelFile.get_sheet_by_name('Sheet1')  # excelFile['Sheet1'] in newer openpyxl
currentRow = 1
for eachRow in sheet1.iter_rows():
    # write the number 0, not the string "0", so the cells stay numeric
    sheet1.cell(row=currentRow, column=2).value = 0
    currentRow += 1
excelFile.save('file.xlsx')
This updates every cell in the second column to zero.
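If you only want to zero out cells that currently contain 1, which is closer to the OP's replace-based attempt, here is a minimal sketch assuming the same file and sheet names and a reasonably recent openpyxl (for the min_col/max_col arguments to iter_rows):
import openpyxl

wb = openpyxl.load_workbook('file.xlsx')
ws = wb['Sheet1']
# restrict iteration to column 2; each row tuple then holds a single cell
for row in ws.iter_rows(min_col=2, max_col=2):
    cell = row[0]
    if cell.value == 1:  # only replace the specific value 1
        cell.value = 0
wb.save('file.xlsx')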

Related

Create boolean dataframe showing existence of each element in a dictionary of lists

I have a dictionary of lists and I have constructed a dataframe where the index is the dictionary keys and the columns are the set of possible values contained within the lists. The dataframe values indicate whether each column value exists in each list contained in the dictionary. What is the most efficient way to construct this? Below is the way I have done it now using for loops, but I am sure there is a more efficient way using either vectorization or concatenation.
import pandas as pd

data = {0:[1,2,3,4], 1:[2,3,4], 2:[3,4,5,6]}
cols = sorted(list(set([x for y in data.values() for x in y])))
df = pd.DataFrame(0, index=data.keys(), columns=cols)
for row in df.iterrows():
    for col in cols:
        if col in data[row[0]]:
            df.loc[row[0], col] = 1
        else:
            df.loc[row[0], col] = 0
print(df)
Output:
   1  2  3  4  5  6
0  1  1  1  1  0  0
1  0  1  1  1  0  0
2  0  0  1  1  1  1
Use MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(data.values()),
                  columns=mlb.classes_,
                  index=data.keys())
print (df)
   1  2  3  4  5  6
0  1  1  1  1  0  0
1  0  1  1  1  0  0
2  0  0  1  1  1  1
A pure pandas, but much slower, solution with str.get_dummies:
df = pd.Series(data).astype(str).str.strip('[]').str.get_dummies(', ')
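Another pure pandas sketch that avoids the round-trip through strings, assuming pandas >= 0.25 for Series.explode:
import pandas as pd

data = {0: [1, 2, 3, 4], 1: [2, 3, 4], 2: [3, 4, 5, 6]}
# explode gives one row per list element, keyed by the original dict key;
# crosstab then tabulates key vs. element membership (add .clip(upper=1)
# if a list could repeat an element)
s = pd.Series(data).explode()
df = pd.crosstab(s.index, s)
print(df)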

Updating a pandas dataframe by matching two columns from two dataframes

I have a dataframe C and another dataframe S. I want to change the values in one of the columns of C if two columns in C and S have the same values.
Please consider the example given below:
C.head(3)
   id1  id2  title  val
0    1    0  'abc'    0
1    2    0  'bcd'    0
2    3    0  'efg'    0
S.head(3)
   id1  id2
0    1    1
1    3    0
I want to assign the value 1 to the column 'val' in C only for the rows where C.id1 = S.id1 and C.id2 = S.id2.
The combination (id1, id2) is unique in each of the respective tables.
In the above case, I want the result to be:
C.head(3)
   id1  id2  title  val
0    1    0  'abc'    0
1    2    0  'bcd'    0
2    3    0  'efg'    1
since only the third row of C matches one of the rows of S on columns id1 and id2.
I think you need merge with a left join and the parameter indicator, and last convert the boolean mask to 0 and 1:
# if the join columns are the only columns shared by both dataframes,
# the parameter on can be omitted
df = C.merge(S, indicator=True, how='left')
# if the dataframes share other columns too, specify on explicitly:
# df = C.merge(S, indicator=True, how='left', on=['id1', 'id2'])
df['val'] = (df['_merge'] == 'both').astype(int)
df = df.drop('_merge', axis=1)
print (df)
   id1  id2  val
0    1    0    0
1    2    0    0
2    3    0    1
The solution works nicely with the new data as well:
df = C.merge(S, indicator=True, how='left')
df['val'] = (df['_merge'] == 'both').astype(int)
df = df.drop('_merge', axis=1)
print (df)
   id1  id2 title  val
0    1    0   abc    0
1    2    0   bcd    0
2    3    0   efg    1
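For reference, a minimal merge-free sketch that flags matches with a MultiIndex membership test (assumes pandas >= 0.24 for MultiIndex.from_frame):
import pandas as pd

C = pd.DataFrame({'id1': [1, 2, 3], 'id2': [0, 0, 0],
                  'title': ['abc', 'bcd', 'efg'], 'val': [0, 0, 0]})
S = pd.DataFrame({'id1': [1, 3], 'id2': [1, 0]})

# build (id1, id2) keys for both frames and test row-wise membership
key_c = pd.MultiIndex.from_frame(C[['id1', 'id2']])
key_s = pd.MultiIndex.from_frame(S[['id1', 'id2']])
C['val'] = key_c.isin(key_s).astype(int)
print(C)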

In pandas, I have 2 columns (ID and Name). If an ID is assigned to more than one name, how do I replace duplicates with the first occurrence?

[Image of the CSV file with the two columns]
You can use:
df.drop_duplicates('Salesperson_1')
Or maybe you need:
df.groupby('Salesperson_1')['Salesperson_1_ID'].transform('first')
Sample:
df = pd.DataFrame({'Salesperson_1': ['a', 'a', 'b'],
                   'Salesperson_1_ID': [4, 5, 6]})
print (df)
  Salesperson_1  Salesperson_1_ID
0             a                 4
1             a                 5
2             b                 6
df1 = df.drop_duplicates('Salesperson_1')
print (df1)
  Salesperson_1  Salesperson_1_ID
0             a                 4
2             b                 6
df.Salesperson_1_ID = df.groupby('Salesperson_1')['Salesperson_1_ID'].transform('first')
print (df)
  Salesperson_1  Salesperson_1_ID
0             a                 4
1             a                 4
2             b                 6
Pandas.groupby.first()
If your DataFrame is called df, you could just do this:
df.groupby('Salesperson_1_ID').first()
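Applied to the sample data from the first answer, a self-contained sketch; note that groupby(...).first() collapses each group to one row and moves the grouping key into the index:
import pandas as pd

df = pd.DataFrame({'Salesperson_1': ['a', 'a', 'b'],
                   'Salesperson_1_ID': [4, 5, 6]})
# one row per name, keeping the first ID seen for each
print(df.groupby('Salesperson_1').first())
#                Salesperson_1_ID
# Salesperson_1
# a                             4
# b                             6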

Creating a unique list using Pandas

I have an xlsx file with over 1000 columns of data. I would like to first parse every second column from the data file (which can contain numbers and letters) and then create a unique list from the parsed data.
I'm a complete noob & have tried a "for" and "do while" loop but neither has worked for me.
So far I have:
import pandas as pd
workbook = pd.read_excel('C:\Python27\Scripts\Data.xlsx')
worksheet = workbook.sheetname='Data'
for col in range(worksheet[0], worksheet[1300]):
    print(col)
I think I need to append the data and maybe write it to a text file, then create a unique list from the text file. I can do the second part; it's getting the data into the text file that I'm having trouble with.
Thanks
You can iterate over your columns by slicing and using a step arg, i.e. df.iloc[:, ::2]
In [35]: df = pd.DataFrame({'a':1, 'b':[1,2,3,4,5], 'c':[2,3,4,5,6], 'd':0, 'e':np.random.randn(5)})
         df
Out[35]:
   a  b  c  d         e
0  1  1  2  0 -0.352310
1  1  2  3  0  1.189140
2  1  3  4  0 -1.470507
3  1  4  5  0  0.742709
4  1  5  6  0 -2.798007
Here we step every 2nd column:
In [37]: df.iloc[:, ::2]
Out[37]:
   a  c         e
0  1  2 -0.352310
1  1  3  1.189140
2  1  4 -1.470507
3  1  5  0.742709
4  1  6 -2.798007
We can then just call np.unique on the entire df to get a single array of all the unique values:
In [36]: np.unique(df.iloc[:, ::2])
Out[36]:
array([-2.79800676, -1.47050675, -0.35231005,  0.74270934,  1.        ,
        1.18914011,  2.        ,  3.        ,  4.        ,  5.        ,
        6.        ])
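Putting it together for the original question, a minimal sketch; the path and sheet name come from the post (sheet_name was spelled sheetname in older pandas), and casting to str guards np.unique against the mixed numbers-and-letters data described:
import numpy as np
import pandas as pd

# read the sheet, take every second column, flatten, and deduplicate
df = pd.read_excel(r'C:\Python27\Scripts\Data.xlsx', sheet_name='Data')
unique_values = np.unique(df.iloc[:, ::2].values.astype(str))
print(unique_values.tolist())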

Pandas read_table using wrong column as index

I'm trying to make a dataframe from a URL whose data is delimited by tabs. However, pandas is using the industry_code column as the index.
dff = pd.read_table('http://download.bls.gov/pub/time.series/ce/ce.industry')
will output
        industry_code naics_code          publishing_status  industry_name display_level selectable  sort_sequence
0                   -          B              Total nonfarm              0             T          1            NaN
5000000             -          A              Total private              1             T          2            NaN
6000000             -          A            Goods-producing              1             T          3            NaN
7000000             -          B          Service-providing              1             T          4            NaN
8000000             -          A  Private service-providing              1             T          5            NaN
Easy! The data rows appear to end with a trailing tab (note the NaN column above), so each row has one more field than the header; pandas resolves the mismatch by treating the first column as the index. Passing index_col=False disables that inference:
table_location = 'http://download.bls.gov/pub/time.series/ce/ce.industry'
dff = pd.read_table(table_location, index_col=False)
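For reference, the same call spelled with read_csv (read_table is essentially read_csv with a tab separator):
import pandas as pd

table_location = 'http://download.bls.gov/pub/time.series/ce/ce.industry'
# index_col=False stops pandas from inferring an index from the ragged rows
dff = pd.read_csv(table_location, sep='\t', index_col=False)
print(dff.head())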