My pandas dataframe has a column where each row is a string which corresponds to a filename. I read my data from a JSON file and extract the column like this:
import pandas as pd

df = pd.read_json("mergedJSON.txt", lines=True, orient='columns')
df2 = df.set_index("subject")
for key, value in some_dict.items():  # .iteritems() is Python 2 only; use .items()
    df2.loc[value, "file_name"].to_csv(outfile, index=False, header=False)
I need to drop certain rows from this dataframe based on whether the corresponding file is found on disk, but I am not sure how to do this. Any help would be appreciated.
Just use this as the last line:
df2[df2.file_name.str.contains('stringValue')].loc[value, :].to_csv()
First, use set_index (or reindex) to make the filename column the index, and then do df.drop(filename).
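Neither suggestion actually checks the filesystem. A minimal sketch of the existence-based filter, assuming file_name holds paths resolvable from the current working directory (os.path.exists is the only addition):

import os
import pandas as pd

df = pd.read_json("mergedJSON.txt", lines=True, orient='columns')
# keep only the rows whose file actually exists on disk
df = df[df["file_name"].apply(os.path.exists)]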
I have a text file with two columns, and I want to average the second column while ignoring the first one. I'm new to programming, so I hope you can help.
Have you tried pandas?
import pandas as pd

df = pd.read_csv(file, sep="\t")  # assuming a tab (\t) separated file
df["xy"].mean()
I have a pandas dataframe in the following format:
Latitude    Longitude
53.553      -80.3123
58.1211     -81.3245
I am trying to convert this dataframe to a list of lists to use it in my plotting package.
[[53.553, -80.3123], [58.1211, -81.3245]]
I tried iterating through the pandas rows and appending the columns to a list, but for some reason I cannot even build the first-level list:
list.append(row['Latitude'], row['Longitude'])
Any help would be appreciated.
Use the tolist method on the underlying NumPy array:
df.values.tolist()
[[53.553000000000004, -80.3123], [58.1211, -81.3245]]
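On recent pandas versions, the documentation recommends to_numpy over the values attribute; the equivalent call would be:

df.to_numpy().tolist()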
I have a pandas dataframe called df that contains several columns, including a df['MY STATE'] column. My goal is to remove all the rows from the dataframe which do not contain US states. I want to do this by comparing the value in each cell to a pandas Series I have containing all the state abbreviations. I have seen people use something like the following to clean a dataframe:
df = df[df['COST'] <= 0]
But something like what I need (below) doesn't work:
df = df[df['MY STATE'] not in states['Abbreviation'].values]
Is there a way to do this simply?
I have read that df.query() can be used to do something like this, but I have not yet found an example, and have also read that df.query() cannot be used when there is a space in the name of the column.
Thank you,
Michael
IIUC you can use isin with the negation operator ~:
df = df[~df['MY STATE'].isin(states['Abbreviation'].values)]
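Note the direction of the filter: ~isin keeps rows whose state is not in the list, which matches the attempted not in expression; drop the ~ to instead keep only valid US states. A small sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'MY STATE': ['CA', 'XX', 'NY']})
states = pd.DataFrame({'Abbreviation': ['CA', 'NY']})

df[~df['MY STATE'].isin(states['Abbreviation'].values)]  # keeps only 'XX'
df[df['MY STATE'].isin(states['Abbreviation'].values)]   # keeps 'CA' and 'NY'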
How can I remove an entire blank row from an existing Excel sheet using Python?
I need a solution which does NOT involve reading the whole file and rewriting it without the deleted row.
Is there any direct solution?
I achieved this using the pandas package:
import pandas as pd

# Read from Excel
xl = pd.ExcelFile("test.xls")
# Parse the Excel sheet into a DataFrame
dfs = xl.parse(xl.sheet_names[0])
# Update the DataFrame as required
# (here, dropping rows with a blank value in the "Name" column;
# blank Excel cells are read in as NaN, so filter with dropna rather than != '')
dfs = dfs.dropna(subset=['Name'])
# Write the updated DataFrame back to the file
dfs.to_excel("test.xls", sheet_name='Sheet1', index=False)
If memory is not an issue, you can achieve this using an extra dataframe:
import pandas as pd

# Read from Excel
xl = pd.ExcelFile("test.xls")
dfs = xl.parse(xl.sheet_names[0])
# Filter into a second dataframe
df1 = dfs[dfs['Sal'] == 1000]
df1 = df1[df1['Message'] == "abc"]
ph_no = df1['Number']
print(df1)  # print() is a function in Python 3
To delete an Excel row in place, say row 5 (the column does not matter, so use 1):
sh.Cells(5, 1).EntireRow.Delete()
To delete a range of Excel rows, say rows 5 to 20:
sh.Range(sh.Cells(5, 1), sh.Cells(20, 1)).EntireRow.Delete()
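The snippet above assumes a worksheet COM object sh is already in hand. A minimal sketch using the pywin32 package, which drives Excel itself and therefore only works on Windows with Excel installed (the file path is a placeholder):

import win32com.client

excel = win32com.client.Dispatch("Excel.Application")
excel.Visible = False
wb = excel.Workbooks.Open(r"C:\path\to\test.xls")  # placeholder path
sh = wb.Worksheets(1)
sh.Cells(5, 1).EntireRow.Delete()  # delete row 5 in place, no rewrite of the file
wb.Save()
wb.Close()
excel.Quit()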
With the csv module, I loop through the rows to execute logic:
import csv
with open("file.csv", "r") as csv_read:
r = csv.reader(csv_read, delimiter = ",")
next(r, None) #Skip headers first row
for row in rows:
#Logic here
I'm new to pandas, and I want to execute the same logic, using only the second column of the csv as the input for the loop.
import pandas as pd

df = pd.read_csv("file.csv", usecols=[1])
Assuming the above is correct, what should I do from here to execute the logic based on the cells in column 2?
I want to use the cell values in column 2 as input for a web crawler. It takes each result and inputs it as a search term on a webpage, and then scrapes data from that webpage. Is there any way to grab each cell value in the array rather than the whole array at the same time?
Basically the pandas equivalent of your code is this:
import pandas as pd
df = pd.read_csv("file.csv", usecols=[1])
So passing usecols=[1] will load only the second column; see the docs.
Now, assuming this column has a name like 'url' (though the name really doesn't matter), we can do something like:
def crawl(x):
    ...  # fetch and scrape one value here

df['url'].apply(crawl)
So in principle the above will crawl each url in your column, one value at a time.
EDIT
Alternatively, you can pass the param axis=1 to DataFrame.apply so that it processes each row rather than each whole column:
df.apply(crawl, axis=1)
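Putting it together, a minimal end-to-end sketch; the requests call, the example.com search URL, and the q parameter are made-up assumptions standing in for the real crawler logic:

import pandas as pd
import requests

df = pd.read_csv("file.csv", usecols=[1])

def crawl(term):
    # hypothetical search endpoint; replace with the real target page
    resp = requests.get("https://example.com/search", params={"q": term})
    # scraping of resp.text would go here
    return resp.status_code

results = df.iloc[:, 0].apply(crawl)  # one request per cell of the column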