I have a CSV with name, role, and years of experience. I want to create a list of tuples that aggregates (name, role, total_exp_inthisRole) for all the employees.
So far I am able to use defaultdict to do the following:
import csv, urllib2
from collections import defaultdict

response = urllib2.urlopen(url)
cr = csv.reader(response)
parsed = ((row[0], row[1], int(row[2])) for row in cr)

employees = []
for item in parsed:
    employees.append(tuple(item))

employeeExp = defaultdict(int)
for x, y, z in employees:  # variable unpacking
    employeeExp[x] += z

employeeExp.items()
output: [('Ken', 15), ('Buckky', 5), ('Tina', 10)]
But how do I also use the second column to achieve the result I want? Should I try to solve this with a groupby on multiple keys, or is there a simpler way? Thanks all in advance.
You can simply pass a tuple of name and role to your defaultdict, instead of only one item:
for x, y, z in employees:
    employeeExp[(x, y)] += z
For your second expected output ([('Ken', ('engineer', 5), ('sr. engineer', 6)), ...]), you need to aggregate the result of the snippet above one more time, this time with a defaultdict holding lists:
d = defaultdict(list)
for (name, role), total_exp_inthisRole in employeeExp.items():
    d[name].append((role, total_exp_inthisRole))
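Putting both passes together on a small in-memory sample (a sketch; the rows below are made up just to match the shape of the expected output):

from collections import defaultdict

# Invented sample rows in the same (name, role, years) shape as the CSV.
employees = [
    ('Ken', 'engineer', 2), ('Ken', 'engineer', 3), ('Ken', 'sr. engineer', 6),
    ('Buckky', 'engineer', 5), ('Tina', 'sr. engineer', 10),
]

# First pass: total experience per (name, role).
employeeExp = defaultdict(int)
for name, role, years in employees:
    employeeExp[(name, role)] += years

# Second pass: group the per-role totals under each name.
d = defaultdict(list)
for (name, role), total_exp_inthisRole in employeeExp.items():
    d[name].append((role, total_exp_inthisRole))

print(dict(d))
# {'Ken': [('engineer', 5), ('sr. engineer', 6)],
#  'Buckky': [('engineer', 5)], 'Tina': [('sr. engineer', 10)]}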
I am trying to find October (mentioned twice). I had the idea of using a dictionary to solve this problem, but I struggled to figure out how to find/separate the months, and I could not make my solution work for the first str value, where there are some spaces around the commas. Can someone please suggest how I can modify that split section to cover "-", "," and whitespace?
import re

# str = "May-29-1990, Oct-18-1980 ,Sept-1-1980, Oct-2-1990"
str = "May-29-1990,Oct-18-1980,Sept-1-1980,Oct-2-1990"
val = re.split(',', str)
monthList = []
myDictionary = {}

# put the months in a list
def sep_month():
    for item in val:
        if not item.isdigit():
            month, day, year = item.split("-")
            monthList.append(month)

# process the month list from above
def count_month():
    for item in monthList:
        if item not in myDictionary.keys():
            myDictionary[item] = 1
        else:
            myDictionary[item] = myDictionary.get(item) + 1
    for k, v in myDictionary.items():
        if v == 2:
            print(k)

sep_month()
count_month()
from datetime import datetime
import calendar
from collections import Counter
datesString = "May-29-1990,Oct-18-1980,Sep-1-1980,Oct-2-1990"
datesListString = datesString.split(",")
datesList = []
for dateStr in datesListString:
    datesList.append(datetime.strptime(dateStr, '%b-%d-%Y'))
monthsOccurrencies = Counter((calendar.month_name[date.month] for date in datesList))
print(monthsOccurrencies)
# Counter({'October': 2, 'May': 1, 'September': 1})
Something to be aware of in my solution: with %b for the month, "Sept" had to change to "Sep" to work (%b is the month as the locale's abbreviated name). You can use either full month names (%B) or abbreviated names (%b). If you cannot get the big string with correctly formatted month names, just replace the wrong ones ("Sept" with "Sep", for example) and always work with date objects.
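For that last point, a minimal sketch of the normalization step (the only replacement pair shown is the one from this example):

from datetime import datetime

raw = "May-29-1990, Oct-18-1980 ,Sept-1-1980, Oct-2-1990"
cleaned = [part.strip().replace("Sept-", "Sep-") for part in raw.split(",")]
dates = [datetime.strptime(part, "%b-%d-%Y") for part in cleaned]
print(dates[2].month)
# 9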
Not sure that regex is the best tool for this job, I would just use strip() along with split() to handle your whitespace issues and get a list of just the month abbreviations. Then you could create a dict with counts by month using the list method count(). For example:
dates = 'May-29-1990, Oct-18-1980 ,Sept-1-1980, Oct-2-1990'
months = [d.split('-')[0].strip() for d in dates.split(',')]
month_counts = {m: months.count(m) for m in set(months)}
print(month_counts)
# {'May': 1, 'Oct': 2, 'Sept': 1}
Or even better with collections.Counter:
from collections import Counter
dates = 'May-29-1990, Oct-18-1980 ,Sept-1-1980, Oct-2-1990'
months = [d.split('-')[0].strip() for d in dates.split(',')]
month_counts = Counter(months)
print(month_counts)
# Counter({'Oct': 2, 'May': 1, 'Sept': 1})
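And since the original goal was to find the month mentioned exactly twice, the Counter can be filtered directly; with this data that yields just 'Oct':

repeated = [m for m, c in month_counts.items() if c == 2]
print(repeated)
# ['Oct']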
I've been trying to get the nlargest rows per group by following the method from this question. The solution there is correct up to a point.
In this example, I group by column A and want to return the rows of B and C based on the top two values in B.
For some reason the index of grp_df is a MultiIndex that includes both A and the original index of ddf.
I was hoping to simply reset_index() and drop the unwanted index and just keep A, but I get the following error:
ValueError: The columns in the computed data do not match the columns in the provided metadata
Here is a simple example reproducing the error:
import numpy as np
import dask.dataframe as dd
import pandas as pd
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
ddf = dd.from_pandas(df, npartitions=3)
grp_df = ddf.groupby('A')[['B', 'C']].apply(
    lambda x: x.nlargest(2, columns=['B']),
    meta={"B": 'f8', "C": 'f8'})
# Print is successful and results are correct
print(grp_df.head())
grp_df = grp_df.reset_index()
# Print is unsuccessful and shows error below
print(grp_df.head())
I found an approach for a solution here.
The following code now allows reset_index() to work and gets rid of the original ddf index. I am still not sure why the original ddf index came through the groupby in the first place, though.
meta = pd.DataFrame(columns=['B', 'C'], dtype=int, index=pd.MultiIndex([[], []], [[], []], names=['A', None]))
grp_df = ddf.groupby('A')[['B','C']].apply(lambda x: x.nlargest(2, columns=['B']), meta=meta)
grp_df = grp_df.reset_index().drop('level_1', axis=1)
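As a sanity check (a sketch, not part of the original answer), the same groupby/nlargest in plain pandas shows where the extra index level comes from: pandas keeps the original row index alongside the group key A, and that second level is exactly the 'level_1' column dropped above.

# Plain-pandas equivalent of the grouped nlargest, for comparison.
pdf = df.groupby('A')[['B', 'C']].apply(lambda x: x.nlargest(2, columns=['B']))
print(pdf.head())     # MultiIndex of (A, original row index)
print(grp_df.head())  # dask result after reset_index()/drop(), flat index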
I have a file with millions of records like this
2017-07-24 18:34:23|CN:SSL|RESPONSETIME:23|BYTESIZE:1456|CLIENTIP:127.0.0.9|PROTOCOL:SSL-V1.2
Each record contains around 30 key-value pairs with a "|" delimiter. The position of a key-value pair is not constant.
I am trying to parse these records using Python dictionary or list concepts.
Note: the 1st column is not in key-value format.
Your file is basically a |-separated CSV file holding first the timestamp, then fields whose key and value are separated by ":".
So you could use the csv module to read the cells, then pass the result of str.split to dict() in a generator expression to build the dictionary for all elements but the first one.
Then update the dict with the timestamp:
import csv
list_of_dicts = []
with open("input.txt") as f:
cr = csv.reader(f,delimiter="|")
for row in cr:
d = dict(v.split(":") for v in row[1:])
d["date"] = row[0]
list_of_dicts.append(d)
list_of_dicts contains dictionaries like
{'date': '2017-07-24 18:34:23', 'PROTOCOL': 'SSL-V1.2', 'RESPONSETIME': '23', 'CN': 'SSL', 'CLIENTIP': '127.0.0.9', 'BYTESIZE': '1456'}
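From there it is ordinary dict and list work. For example (a sketch; the field choices are just taken from the sample record), you can cast the numeric fields and filter records:

# Cast numeric fields in place, then select the SSL records.
for d in list_of_dicts:
    d["RESPONSETIME"] = int(d["RESPONSETIME"])
    d["BYTESIZE"] = int(d["BYTESIZE"])

ssl_records = [d for d in list_of_dicts if d.get("PROTOCOL", "").startswith("SSL")]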
You would repeat the process below for every line in your file. I am not clear about the date-time value, so I have not included it in the input; you can add it based on your understanding.
import re

given = "CN:SSL|RESPONSETIME:23|BYTESIZE:1456|CLIENTIP:127.0.0.9|PROTOCOL:SSL-V1.2"
results = dict()
list_for_this_line = re.split(r'\|', given)
for i in range(len(list_for_this_line)):
    separated_k_v = re.split(':', list_for_this_line[i])
    results[separated_k_v[0]] = separated_k_v[1]
print(results)
Hope this helps!
I am trying to use networkx to create a DiGraph. I want to use add_edges_from(), and I want the edges and their data to be generated from three tuples.
I am importing the data from a CSV file. I have three columns: one for ids (first set of nodes), one for a set of names (second set of nodes), and another for capacities (no headers in the file). So, I created a dictionary for the ids and capacities.
dictionary = dict(zip(id, capacity))
Then I zipped the tuples containing the edge data:
List = zip(id, name, capacity)
But when I execute the next line, it gives me an assertion error:
G.add_edges_from(List, 'weight': 1)
Can someone help me with this problem? I have been trying for a week with no luck.
P.S. I'm a newbie in programming.
EDIT:
So, I found the following solution. I am honestly not sure how it works, but it did the job!
Here is the code:
import networkx as nx
import csv
G = nx.DiGraph()
capacity_dict = dict(zip(zip(id, name),capacity))
List = zip(id, name, capacity)
G.add_edges_from(capacity_dict, weight=1)
for u, v, d in List:
    G[u][v]['capacity'] = d
Now when I run:
G.edges(data=True)
The result will be:
[(2.0, 'First', {'capacity': 1.0, 'weight': 1}), (3.0, 'Second', {'capacity': 2.0, 'weight': 1})]
I am using the network simplex. Now I am trying to find a way to make the output of the flowDict more understandable, because it only shows the ids of the flow. (Maybe I'll try to put them in a database and return the whole row of data instead of using the ids only.)
A few improvements on your version. (1) NetworkX algorithms assume that weight is 1 unless you specifically set it differently. Hence there is no need to set it explicitly in your case. (2) Using the generator allows the capacity attribute to be set explicitly and other attributes to also be set once per record. (3) The use of a generator to process each record as it comes through saves you having to iterate through the whole list twice. The performance improvement is probably negligible on small datasets but still it feels more elegant. Having said that -- your method clearly works!
import networkx as nx
import csv
# simulate a csv file.
# This makes a multi-line string behave as a file.
from StringIO import StringIO
filehandle = StringIO('''a,b,30
b,c,40
d,a,20
''')
# process each row in the file
# and generate an edge from each
def edge_generator(fh):
    reader = csv.reader(fh)
    for row in reader:
        row[-1] = float(row[-1])  # convert capacity to float
        # add other attributes to the dict() below as needed...
        # e.g. you might add weights here as well.
        yield (row[0],
               row[1],
               dict(capacity=row[2]))
# create the graph
G = nx.DiGraph()
G.add_edges_from(edge_generator(filehandle))
print G.edges(data=True)
Returns this:
[('a', 'b', {'capacity': 30.0}),
('b', 'c', {'capacity': 40.0}),
('d', 'a', {'capacity': 20.0})]
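As a hedged follow-up to the network simplex part of the question (not something the answer above covers): once the edges carry a capacity attribute, node demands can be attached and nx.network_simplex called directly. The demand values below are invented purely for illustration, and for real data the capacity column is better read as int, since network_simplex prefers integral capacities.

# Push 20 units from 'a' to 'c'; nodes without a demand attribute default to 0.
G.add_node('a', demand=-20)   # supply node
G.add_node('c', demand=20)    # sink node
flow_cost, flow_dict = nx.network_simplex(G)
print(flow_dict)
# e.g. {'a': {'b': 20.0}, 'b': {'c': 20.0}, 'd': {'a': 0}}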
Using: Python 2.7 and Pandas 0.11.0 on Mac OSX Lion
I'm trying to create an empty DataFrame and then populate it from another dataframe, based on a for loop.
I have found that when I construct the DataFrame and then use the for loop as follows:
data = pd.DataFrame()
for item in cols_to_keep:
    if item not in dummies:
        data = data.join(df[item])
This results in an empty DataFrame that has only the headers of the columns I wanted to add from the other DataFrame.
That's because you are using join incorrectly: DataFrame.join aligns on the index, and your starting frame is empty, so there are no rows to align with and each join keeps only the column label.
You can use a list comprehension to restrict the DataFrame to the columns you want:
df[[col for col in cols_to_keep if col not in dummies]]
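If you do want to build the frame column by column instead, a sketch with concat (unlike joining onto an empty frame, this keeps all the rows):

# Collect the wanted columns and glue them together side by side.
pieces = [df[item] for item in cols_to_keep if item not in dummies]
data = pd.concat(pieces, axis=1)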
What about just creating a new frame based off of the columns you know you want to keep, instead of creating an empty one first?
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': np.random.randn(5),
                   'b': np.random.randn(5),
                   'c': np.random.randn(5),
                   'd': np.random.randn(5)})
cols_to_keep = ['a', 'c', 'd']
dummies = ['d']
not_dummies = [x for x in cols_to_keep if x not in dummies]
data = df[not_dummies]
data
a c
0 2.288460 0.698057
1 0.097110 -0.110896
2 1.075598 -0.632659
3 -0.120013 -2.185709
4 -0.099343 1.627839