storing different rows of two lists in new list - list

I am trying to create a new object by comparing two lists. If a row matches, it should be removed from splitted_row_list; otherwise it should be appended to a new list containing only the differences between both lists.
Sample data: splitted_row_list contains every row that all_rows has, plus additional, different rows (so the two lists also have an unequal number of rows). I am aiming to put these additional rows into a new object.
all_rows[0]:'1390', '139080', '13980', '1380', '139080', '13080'
splitted_row_list[0]:'35335','53527','353529','242424','5222','444'
results = []
for row in splitted_row_list:
    print(row)
    for row1 in all_rows:
        if row1 == row:
            splitted_row_list.remove(row)
        else:
            results.append(row)
print(results)
However, this code just returns all the rows. Does anyone have a suggestion?

The two lists are distinct, so you get every item from one list: because row1 == row is never true, nothing is ever removed.
There are no differences.
EDIT:
You can simply collect the matches and remove them:
nonunique = []
for row in list(splitted_row_list):  # iterate over a copy so removing is safe
    print(row)
    if row in all_rows:
        splitted_row_list.remove(row)  # remove() returns None, so append the row itself
        nonunique.append(row)
result = splitted_row_list  # the non-unique rows have been removed
If you want the non-unique rows removed from all_rows as well, just add an all_rows.remove(row).
For the complete set, just concatenate them after the loop:
combined = nonunique + splitted_row_list + all_rows

Thanks. I have 'solved' this problem in this particular context by appending a marker string to each matching row and later filtering on whether a row contains the marker. Not very elegant, but it works:
def append_mark2(splitted_row_list):
    for row in splitted_row_list:
        for row1 in all_rows:
            if row1 == row:
                row.append('jaja')
                print(row)
    return splitted_row_list
def sort_on_appendix(splitted_row_list_appended_mark):
    next_row_list3=[]
    for row in splitted_row_list_appended_mark:
        if 'jaja' not in row:
            print(row)
            next_row_list3.append(row)
    print('next_row_list3:',next_row_list3)
    return next_row_list3
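For comparison, here is a minimal sketch of another way to collect the extra rows without modifying either list while looping and without a marker string. The sample rows and the helper name rows_not_in are made up for illustration, and it assumes each row is a list of strings, as in the question.

```python
# Hypothetical, shortened sample data in the shape described in the question.
all_rows = [["1390", "139080"], ["13980", "1380"]]
splitted_row_list = [
    ["1390", "139080"],
    ["35335", "53527"],
    ["13980", "1380"],
    ["5222", "444"],
]


def rows_not_in(candidates, reference):
    """Return the rows of candidates that do not appear in reference."""
    # Lists are not hashable, so compare rows as tuples inside a set.
    seen = {tuple(row) for row in reference}
    return [row for row in candidates if tuple(row) not in seen]


results = rows_not_in(splitted_row_list, all_rows)
print(results)
```

Because neither input list is changed, the loop cannot skip elements the way it can when remove() is called on the list being iterated.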

Related

How to solve concatenate issue with .cell()? row=row works, column=column gives error

I am looping through an excel sheet, looking for a specific name. When found, I print the position of the cell and the value.
I would like to find the position and value of a neighbouring cell, however I can't get .cell() to work by adding 2, indicating I would like the cell 2 columns away in the same row.
row= row works, but column= column gives error, and column + 2 gives error. Maybe this is due to me listing columns as 'ABCDEFGHIJ' earlier in my code? (For full code, see below)
print 'Cell position {} has value {}'.format(cell_name, currentSheet[cell_name].value)
print 'Cell position next door TEST {}'.format(currentSheet.cell(row=row, column=column +2))
Full code:
file = openpyxl.load_workbook('test6.xlsx', read_only = True)
allSheetNames = file.sheetnames
#print("All sheet names {}" .format(file.sheetnames))
for sheet in allSheetNames:
    print('Current sheet name is {}'.format(sheet))
    currentSheet = file[sheet]
    for row in range(1, currentSheet.max_row + 1):
        #print row
        for column in 'ABCDEFGHIJ':
            cell_name = '{}{}'.format(column,row)
            if currentSheet[cell_name].value == 'sign_name':
                print 'Cell position {} has value {}'.format(cell_name, currentSheet[cell_name].value)
                print 'Cell position TEST {}'.format(currentSheet.cell(row=row, column=column +2))
I get this output:
Current sheet name is Sheet1
Current sheet name is Sheet2
Cell position D5 has value sign_name
and:
TypeError: cannot concatenate 'str' and 'int' objects
I get the same error whether I try "column=column" or "column=column + 2".
Why does row=row work, but column=column doesn't? And how do I find the cell name of the cell to the right of my resulting D5 cell?
The reason row=row works and column=column doesn't is that your column value is a string (a letter from A to J), while the column argument of cell() expects an int (A would be 1, B would be 2, Z would be 26, etc.).
There are a few changes I would make in order to iterate through the cells more effectively and find a neighbor. Firstly, OpenPyXl offers sheet.iter_rows(), which, given no arguments, provides a generator over all rows that are used in the sheet. So you can iterate with
for row in currentSheet.iter_rows():
    for cell in row:
because each row is a sequence of the cells in that row.
Then, in this new nested for loop, you can get the current column index with cell.column (D would give 4), and the cell to the right (incremented by one column) would be currentSheet.cell(row=cell.row, column=cell.column+1). Note that row is now a sequence of cells rather than a number, so the row number has to come from cell.row.
Note the difference between the two uses of cell: currentSheet.cell() is a request for a specific cell, while cell.column+1 is the column index of the current cell incremented by 1.
Relevant OpenPyXl documentation:
https://openpyxl.readthedocs.io/en/stable/api/openpyxl.cell.cell.html
https://openpyxl.readthedocs.io/en/stable/api/openpyxl.worksheet.worksheet.html
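To make the letter-to-number mapping mentioned in the answer concrete (A is 1, Z is 26), here is a small pure-Python sketch. The helper names column_index, column_label and neighbour_name are hypothetical and written only for this illustration; openpyxl ships comparable helpers in openpyxl.utils (column_index_from_string and get_column_letter), which would normally be the better choice.

```python
def column_index(letters):
    """Convert a column label such as 'D' or 'AA' to a 1-based index."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def column_label(index):
    """Convert a 1-based column index back to its letter label."""
    label = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        label = chr(ord("A") + remainder) + label
    return label


def neighbour_name(column, row, offset):
    """Name of the cell `offset` columns to the right of column+row."""
    return "{}{}".format(column_label(column_index(column) + offset), row)


# The cell two columns to the right of D5.
print(neighbour_name("D", 5, 2))
```

Adding 2 to the string 'D' is what raised the TypeError in the question; converting to an integer first makes the arithmetic well defined.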

Converting some elements of a list of a dataset to float

I'm reading in a csv file, and all rows contain string elements. One might be:
"orange", "2", "65", "banana"
I want to change this, within my dataset, to become:
row = ["orange", 2.0, 65.0, "banana"]
Here is my code:
data = f.read().split("\n")
for row in data:
    for x in row:
        if x.isdigit():
            x = float(x)
    print row
But it still prints the original rows like:
"orange", "2", "65", "banana"
I also want to achieve this without using list comprehensions (for now).
I believe it is because you cannot edit the row in place like that. Assigning to x only rebinds the loop variable to a new float; it does not replace the element stored in the row, so the change is lost as soon as the loop moves on.
I'm not sure if this is the idiomatic 'Python way' of doing this, but you could do:
data = f.read().split("\n")
for row in data:
    parsed_row = []
    for x in row:  # assumes row is already a list of strings, not a raw line
        if x.isdigit():
            x = float(x)
        parsed_row.append(x)
    print(parsed_row)
Alternatively a more 'pythonic' way, as provided by JGreenwell in the comments, may be to allow an exception to be thrown if an element cannot be parsed to float.
data = f.read().split("\n")
for row in data:
    parsed_row = []
    for x in row:  # assumes row is already a list of strings, not a raw line
        try:
            parsed_row.append(float(x))
        except ValueError:
            parsed_row.append(x)
    print(parsed_row)
It really would come down to personal preference I imagine. Python exceptions shouldn't be slow so I wouldn't be concerned about that.
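Perhaps the problem is your delimiter. Try something like
import csv
with open('yourfile.csv', newline='') as csvfile:
    # skipinitialspace lets the reader cope with the space after each comma
    reader = csv.reader(csvfile, delimiter=',', quotechar='"', skipinitialspace=True)
    for row in reader:
        for x in row:
            if x.isdigit():
                print(float(x))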
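Putting the two answers together, a minimal sketch that parses the rows with the csv module and updates each row in place by index, without list comprehensions. The sample text and the helper name to_float_if_possible are assumptions made for illustration; an in-memory string stands in for the real file.

```python
import csv
import io

# Hypothetical sample standing in for the CSV file from the question.
sample = io.StringIO(
    '"orange", "2", "65", "banana"\n'
    '"kiwi", "3.5", "x7", "-4"\n'
)


def to_float_if_possible(text):
    """Return float(text) when it parses, otherwise the original string."""
    try:
        return float(text)
    except ValueError:
        return text


dataset = []
# skipinitialspace=True ignores the space that follows each comma.
for row in csv.reader(sample, skipinitialspace=True):
    for i, value in enumerate(row):
        # Assigning through the index changes the list itself,
        # unlike rebinding the loop variable.
        row[i] = to_float_if_possible(value)
    dataset.append(row)

print(dataset)
```

Unlike isdigit(), the try/except version also converts values such as "3.5" and "-4".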

KeyError: Not in index, using a keys generated from a Pandas dataframe on itself

I have two columns in a Pandas DataFrame that has datetime as its index. The two columns contain data measuring the same parameter, but neither column is complete (some rows have no data at all, some rows have data in both columns, and others have data only in column 'a' or 'b').
I've written the following code to find gaps in columns, generate a list of indices of dates where these gaps appear and use this list to find and replace missing data. However I get a KeyError: Not in index on line 3, which I don't understand because the keys I'm using to index came from the DataFrame itself. Could somebody explain why this is happening and what I can do to fix it? Here's the code:
def merge_func(df):
    null_index = df[(df['DOC_mg/L'].isnull() == False) & (df['TOC_mg/L'].isnull() == True)].index
    df['TOC_mg/L'][null_index] = df[null_index]['DOC_mg/L']
    notnull_index = df[(df['DOC_mg/L'].isnull() == True) & (df['TOC_mg/L'].isnull() == False)].index
    df['DOC_mg/L'][notnull_index] = df[notnull_index]['TOC_mg/L']
    df.insert(len(df.columns), 'Mean_mg/L', 0.0)
    df['Mean_mg/L'] = (df['DOC_mg/L'] + df['TOC_mg/L']) / 2
    return df
merge_func(sve)
Whenever you are performing assignment, you should use .loc:
df.loc[null_index, 'TOC_mg/L'] = df['DOC_mg/L']
The error in your original code is the ordering of the subscripts in the index lookup:
df['TOC_mg/L'][null_index] = df[null_index]['DOC_mg/L']
Here df[null_index] looks the labels up among the columns rather than the rows, so this will produce an index error. On a toy dataset I get: IndexError: indices are out-of-bounds
If you changed the order to this it would probably work:
df['TOC_mg/L'][null_index] = df['DOC_mg/L'][null_index]
However, this is chained assignment and should be avoided; see the online docs.
So you should use loc:
df.loc[null_index, 'TOC_mg/L'] = df['DOC_mg/L']
df.loc[notnull_index, 'DOC_mg/L'] = df['TOC_mg/L']
Note that it is not necessary to use the same index for the right-hand side, as it will align correctly.
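A minimal sketch of the .loc approach on a toy frame. The dates and values are made up; only the column names come from the question. It uses boolean masks rather than an extracted index, which is an equivalent way to select the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the asker's DataFrame.
index = pd.to_datetime(["2004-01-14", "2004-03-04", "2004-03-30", "2004-04-12"])
df = pd.DataFrame(
    {
        "DOC_mg/L": [1.0, np.nan, 3.0, np.nan],
        "TOC_mg/L": [np.nan, 2.0, 5.0, np.nan],
    },
    index=index,
)

# Rows where exactly one of the two columns has a value.
doc_only = df["DOC_mg/L"].notnull() & df["TOC_mg/L"].isnull()
toc_only = df["DOC_mg/L"].isnull() & df["TOC_mg/L"].notnull()

# The right-hand side is aligned on the index, so the full column can be passed.
df.loc[doc_only, "TOC_mg/L"] = df["DOC_mg/L"]
df.loc[toc_only, "DOC_mg/L"] = df["TOC_mg/L"]

print(df)
```

Rows that are empty in both columns stay empty, and rows that already had both values are left untouched.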

Combining data from two dataframe columns into one column

I have time series data in two separate DataFrame columns which refer to the same parameter but are of differing lengths.
On dates where data only exist in one column, I'd like this value to be placed in my new column. On dates where there are entries for both columns, I'd like to have the mean value. (I'd like to join using the index, which is a datetime value)
Could somebody suggest a way that I could combine my two columns? Thanks.
Edit2: I've written some code which should merge the data from both of my columns, but I get a KeyError when I try to set the new values using the index generated from rows where my first column has values but my second column doesn't. Here's the code:
def merge_func(df):
    null_index = df[(df['DOC_mg/L'].isnull() == False) & (df['TOC_mg/L'].isnull() == True)].index
    df['TOC_mg/L'][null_index] = df[null_index]['DOC_mg/L']
    notnull_index = df[(df['DOC_mg/L'].isnull() == True) & (df['TOC_mg/L'].isnull() == False)].index
    df['DOC_mg/L'][notnull_index] = df[notnull_index]['TOC_mg/L']
    df.insert(len(df.columns), 'Mean_mg/L', 0.0)
    df['Mean_mg/L'] = (df['DOC_mg/L'] + df['TOC_mg/L']) / 2
    return df
merge_func(sve)
And here's the error:
KeyError: "['2004-01-14T01:00:00.000000000+0100' '2004-03-04T01:00:00.000000000+0100'\n '2004-03-30T02:00:00.000000000+0200' '2004-04-12T02:00:00.000000000+0200'\n '2004-04-15T02:00:00.000000000+0200' '2004-04-17T02:00:00.000000000+0200'\n '2004-04-19T02:00:00.000000000+0200' '2004-04-20T02:00:00.000000000+0200'\n '2004-04-22T02:00:00.000000000+0200' '2004-04-26T02:00:00.000000000+0200'\n '2004-04-28T02:00:00.000000000+0200' '2004-04-30T02:00:00.000000000+0200'\n '2004-05-05T02:00:00.000000000+0200' '2004-05-07T02:00:00.000000000+0200'\n '2004-05-10T02:00:00.000000000+0200' '2004-05-13T02:00:00.000000000+0200'\n '2004-05-17T02:00:00.000000000+0200' '2004-05-20T02:00:00.000000000+0200'\n '2004-05-24T02:00:00.000000000+0200' '2004-05-28T02:00:00.000000000+0200'\n '2004-06-04T02:00:00.000000000+0200' '2004-06-10T02:00:00.000000000+0200'\n '2004-08-27T02:00:00.000000000+0200' '2004-10-06T02:00:00.000000000+0200'\n '2004-11-02T01:00:00.000000000+0100' '2004-12-08T01:00:00.000000000+0100'\n '2011-02-21T01:00:00.000000000+0100' '2011-03-21T01:00:00.000000000+0100'\n '2011-04-04T02:00:00.000000000+0200' '2011-04-11T02:00:00.000000000+0200'\n '2011-04-14T02:00:00.000000000+0200' '2011-04-18T02:00:00.000000000+0200'\n '2011-04-21T02:00:00.000000000+0200' '2011-04-25T02:00:00.000000000+0200'\n '2011-05-02T02:00:00.000000000+0200' '2011-05-09T02:00:00.000000000+0200'\n '2011-05-23T02:00:00.000000000+0200' '2011-06-07T02:00:00.000000000+0200'\n '2011-06-21T02:00:00.000000000+0200' '2011-07-04T02:00:00.000000000+0200'\n '2011-07-18T02:00:00.000000000+0200' '2011-08-31T02:00:00.000000000+0200'\n '2011-09-13T02:00:00.000000000+0200' '2011-09-28T02:00:00.000000000+0200'\n '2011-10-10T02:00:00.000000000+0200' '2011-10-25T02:00:00.000000000+0200'\n '2011-11-08T01:00:00.000000000+0100' '2011-11-28T01:00:00.000000000+0100'\n '2011-12-20T01:00:00.000000000+0100' '2012-01-19T01:00:00.000000000+0100'\n '2012-02-14T01:00:00.000000000+0100' '2012-03-13T01:00:00.000000000+0100'\n 
'2012-03-27T02:00:00.000000000+0200' '2012-04-02T02:00:00.000000000+0200'\n '2012-04-10T02:00:00.000000000+0200' '2012-04-17T02:00:00.000000000+0200'\n '2012-04-26T02:00:00.000000000+0200' '2012-04-30T02:00:00.000000000+0200'\n '2012-05-03T02:00:00.000000000+0200' '2012-05-07T02:00:00.000000000+0200'\n '2012-05-10T02:00:00.000000000+0200' '2012-05-14T02:00:00.000000000+0200'\n '2012-05-22T02:00:00.000000000+0200' '2012-06-05T02:00:00.000000000+0200'\n '2012-06-19T02:00:00.000000000+0200' '2012-07-03T02:00:00.000000000+0200'\n '2012-07-17T02:00:00.000000000+0200' '2012-07-31T02:00:00.000000000+0200'\n '2012-08-14T02:00:00.000000000+0200' '2012-08-28T02:00:00.000000000+0200'\n '2012-09-11T02:00:00.000000000+0200' '2012-09-25T02:00:00.000000000+0200'\n '2012-10-10T02:00:00.000000000+0200' '2012-10-24T02:00:00.000000000+0200'\n '2012-11-21T01:00:00.000000000+0100' '2012-12-18T01:00:00.000000000+0100'] not in index"
You are close, but you actually don't need to iterate over the rows when using the isnull() functions. By default,
df[(df['DOC_mg/L'].isnull() == False) & (df['TOC_mg/L'].isnull() == True)].index
will return just the index of the rows where DOC_mg/L is not null and TOC_mg/L is null.
Now you can do something like this to set the values for TOC_mg/L:
null_index = df[(df['DOC_mg/L'].isnull() == False) & \
                (df['TOC_mg/L'].isnull() == True)].index
df['TOC_mg/L'][null_index] = df['DOC_mg/L'][null_index] # EDIT To switch the index position.
This will use the index of the rows where TOC_mg/L is null and DOC_mg/L is not null, and set the values for TOC_mg/L to those found in DOC_mg/L in the same rows.
Note: This is not the accepted way of setting values using an index, but it is how I've been doing it for some time. Just make sure that when setting values, the left side of the assignment is df['col_name'][index]. If col_name and index are switched, you will set the values on a copy, which is never written back to the original.
Now, to set the mean, you can create a new column (we'll call it Mean_mg/L) with the value 0.0, then set this new column to the mean of both columns:
# Insert a new col at the end of the dataframe columns name 'Mean_mg/L'
# with default value 0.0
df.insert(len(df.columns), 'Mean_mg/L', 0.0)
# Set this columns value to the average of DOC_mg/L and TOC_mg/L
df['Mean_mg/L'] = (df['DOC_mg/L'] + df['TOC_mg/L']) / 2
In the columns where we filled null values with the corresponding column value, the average will be the same as the values.
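As an alternative to filling the gaps first, a row-wise mean that skips missing values gives the same combined column in one step. This is a sketch on made-up data, not the method from the answer above; only the column names are taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the asker's DataFrame.
index = pd.to_datetime(["2004-01-14", "2004-03-04", "2004-03-30", "2004-04-12"])
df = pd.DataFrame(
    {
        "DOC_mg/L": [1.0, np.nan, 3.0, np.nan],
        "TOC_mg/L": [np.nan, 2.0, 5.0, np.nan],
    },
    index=index,
)

# mean(axis=1) skips NaN by default: a row with one value keeps that value,
# a row with two values gets their average, and an empty row stays NaN.
df["Mean_mg/L"] = df[["DOC_mg/L", "TOC_mg/L"]].mean(axis=1)

print(df)
```

This leaves the two source columns unchanged, which may or may not be what you want.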

using function when looping through dataframe python/pandas

I have a function that uses two columns in a dataframe:
def create_time(var, var1):
    if var == "Helår":
        y = var1+'Q4'
    else:
        if var == 'Halvår':
            y = var1+'Q2'
        else:
            y = var1+'Q'+str(var)[0:1]
    return y
Now I want to loop through my dataframe, creating a new column using the function, where var and var1 are columns in the dataframe.
I tried the following, but had no luck:
for row in bd.iterrows():
    A = str(bd['Var'])
    B = str(bd['Var1'])
    bd['period']=create_time(A,B)
Looping is a last resort. There is usually a "vectorized" way to operate on the entire DataFrame, which is almost always faster and usually more readable too.
To apply your custom function to each row, use apply with the keyword argument axis=1.
bd['period'] = bd[['Var', 'Var1']].apply(lambda x: create_time(*x), axis=1)
You might wonder why it's not just bd.apply(create_time). Since create_time wants two arguments, we have to "unpack" the row x into its two values and pass those to the function.
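A runnable sketch of the apply call on a small frame. The sample values are made up, since the question does not show the real column contents, and it assumes Var1 holds strings so that the concatenation in create_time works.

```python
import pandas as pd


def create_time(var, var1):
    """Same logic as the function in the question, with early returns."""
    if var == "Helår":
        return var1 + "Q4"
    if var == "Halvår":
        return var1 + "Q2"
    return var1 + "Q" + str(var)[0:1]


# Hypothetical sample frame.
bd = pd.DataFrame(
    {
        "Var": ["Helår", "Halvår", "3. kvartal"],
        "Var1": ["2015", "2015", "2016"],
    }
)

# axis=1 passes one row at a time; *x unpacks it into (var, var1).
bd["period"] = bd[["Var", "Var1"]].apply(lambda x: create_time(*x), axis=1)

# Pairing the two columns with zip gives the same values.
periods_via_zip = [create_time(a, b) for a, b in zip(bd["Var"], bd["Var1"])]

print(bd["period"].tolist())
```

The original loop failed because bd['Var'] inside the loop is the whole column, not the value for the current row.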