Merge two annotated results into one dictionary - django

I have two annotated results:
cred_rec = Disponibilidad.objects.values('mac__mac', 'int_mes').annotate(tramites=Count('fuar'), recibidas=Count('fecha_disp_mac'))
cred14 = Disponibilidad.objects.filter(int_disponible__lte=14).values('mac__mac', 'int_mes').annotate(en14=Count('fuar'))
Both have the same keys, 'mac__mac' and 'int_mes'. What I want is to create a new dictionary with the keys in cred_rec, plus en14 from cred14.
I tried some answers found here, but I'm missing something.
Thanks.
EDIT: After some trial and error, I got this:
for linea in cred_rec:
    clave = (linea['mac__mac'], linea['int_mes'])
    for linea2 in cred14:
        clave2 = (linea2['mac__mac'], linea2['int_mes'])
        if clave2 == clave:
            linea['en14'] = linea2['en14']
            linea['disp'] = float(linea2['en14'])/linea['recibidas']*100
Now, I have to ask for a better solution. Thanks again.
EDIT
This is how the input looks:
fuar, mac_id, int_mes, int_disponible, int_exitoso, fecha_tramite, fecha_actualiza_pe, fecha_disp_mac
1229012106349,1,7,21,14,2012-07-02 08:33:54.0,2012-07-16 17:33:21.0,2012-07-23 08:01:22.0
1229012106350,1,7,25,17,2012-07-02 09:01:25.0,2012-07-19 17:45:57.0,2012-07-27 17:45:59.0
1229012106351,1,7,21,14,2012-07-02 09:15:12.0,2012-07-16 19:14:35.0,2012-07-23 08:01:22.0
1229012106352,1,7,24,16,2012-07-02 09:25:19.0,2012-07-18 07:52:18.0,2012-07-26 16:04:11.0
... a few thousand lines dropped ...
fuar is like an order_id; mac__mac is like a site_id; int_mes is the month; int_disponible is the timedelta (in days) between fecha_tramite and fecha_disp_mac; int_exitoso is the timedelta between fecha_tramite and fecha_actualiza_pe.
The output is like this:
mac, mes, tramites, cred_rec, cred14, % rec, % en 14
1, 7, 2023, 2006, 1313, 99.1596638655, 65.4536390828
1, 8, 1748, 1182, 1150, 67.6201372998, 97.2927241963
2, 8, 731, 471, 441, 64.4322845417, 93.6305732484
3, 8, 1352, 840, 784, 62.1301775148, 93.3333333333
tramites is the count of all orders (fuar) in a month
cred is our product; in theory, for each fuar there is a cred. cred_rec is the count of all cred produced in a month
cred14 is the count of all cred produced within 14 days
% rec is the ratio of cred produced to fuar received (cred_rec / tramites), in %
% en 14 is the ratio of cred produced in time to all cred produced (cred14 / cred_rec), in %
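For example, for mac 1 in month 7: % rec = 2006 / 2023 × 100 ≈ 99.16 and % en 14 = 1313 / 2006 × 100 ≈ 65.45, matching the first output row above.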
I will use this table in an Annotated Time Line chart or a Combo Chart from Google Charts to show the performance of our manufacturing process.
Thanks for your time.

One immediate improvement to your current code would be to precalculate the en14 values from cred14 and index them by key, so each cred_rec row is matched with a single dictionary lookup instead of a scan over the whole cred14 list. This reduces the scanning at the cost of the memory needed to store the lookup table.
def line_key(line):
    return (line['mac__mac'], line['int_mes'])

# build a lookup of en14 values keyed by (mac, month)
en14_by_key = {line_key(line): line['en14'] for line in cred14}

for line in cred_rec:
    en14 = en14_by_key.get(line_key(line))
    if en14 is not None:
        line['en14'] = en14
        line['disp'] = float(en14) / line['recibidas'] * 100
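With the lookup dictionary, the merge does one dictionary lookup per cred_rec row instead of a full inner scan of cred14 for every row, so the work drops from roughly len(cred_rec) × len(cred14) comparisons to a single pass over each queryset.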

Related

Trouble using utcoffset with Chart.js

I'm trying to use Chart.js with a datetime x axis, and I need to adjust all my values by subtracting 5 hours. Here's some of my code:
var timeFormat = 'MM/DD HH:mm';
time: {
    format: timeFormat,
    tooltipFormat: 'll',
    parser: function(utcMoment) {
        return moment(utcMoment).utcOffset(5, true);
    }
},
Without the parser function, my values are normal (10:00, January 10, 2021), but with the parser function my values are set back all the way to 2001. Yes, two thousand and one (10:00, January 10, 2001). Note that the time is not actually changed. So there are two errors: 1. the time is not adjusted when it should be; 2. the year is adjusted when it shouldn't be. Why could this be?
I will assume that the reason you want to roll it back by 5 hours is because of a timezone difference. If that's the case, you should use moment-timezone instead of moment.
With that said, subtracting 5 hours from the current date is actually simpler than what you're doing.
Before feeding a date into moment, you need to convert it to a js Date object, like so: new Date('2021-01-10 00:00:00'). Since your parser function receives the date in MM/DD HH:mm format, you would need to append the year to it first.
So here is how your code should look:
parser: function(utcMoment) {
    const new_date = utcMoment.split(' ')[0] + '/' + (new Date().getFullYear()) + ' ' + utcMoment.split(' ')[1];
    return moment(new Date(new_date)).subtract({hours: 5});
}

Calculating results pro rata over several months with PowerQuery

I am currently stuck on the below issue:
I have two tables to work with: one contains financial information for vessels and the other contains arrival and departure times for vessels. I get my data by combining multiple Excel sheets from different folders:
financialTable
voyageTimeTable
I have to calculate the result for the above voyage and apportion it over June, July and August, for both estimated and updated.
Time in June : 4 hours (20/06/2020 20:00 - 23:59) + 10 days (21/06/2020 00:00 - 30/06/2020 23:59) = 10.1666
Time in July : 31 full days
Time in August: 1 day + 14 hours (02/08/2020 00:00 - 14:00) = 1.5833
Total voyage duration = 10.1666 + 31 + 1.5833 = 42.7499
The result for the "updated" financialItem would be the following:
Result June : 100*(10.1666/42.7499) = 23.7816
Result July : 100*(31/42.7499) = 72.5148
Result August : 100*(1.5833/42.7499) = 3.7036
sum = 100
and then for "estimated" it would be twice everything above.
This is the format I ideally would like to get:
prorataResultTable
I have to do this for multiple vessels, with multiple timespans and several voyage numbers.
Eagerly awaiting responses, if any. Many thanks in advance.
Brds,
Not sure if you're still looking for an answer, but the code below gives me your expected output:
let
    financialTable = Table.FromRows({{"A", 1, "profit/loss", 200, 100}}, type table [vesselName = text, vesselNumber = Int64.Type, financialItem = text, estimated = number, updated = number]),
    voyageTimeTable = Table.FromRows({{"A", 1, #datetime(2020, 6, 20, 20, 0, 0), #datetime(2020, 8, 2, 14, 0, 0)}}, type table [vesselName = text, vesselNumber = Int64.Type, voyageStartDatetime = datetime, voyageEndDatetime = datetime]),
    joined =
        let
            joined = Table.NestedJoin(financialTable, {"vesselName", "vesselNumber"}, voyageTimeTable, {"vesselName", "vesselNumber"}, "$toExpand", JoinKind.LeftOuter),
            expanded = Table.ExpandTableColumn(joined, "$toExpand", {"voyageStartDatetime", "voyageEndDatetime"})
        in expanded,
    toExpand = Table.AddColumn(joined, "$toExpand", (currentRow as record) =>
        let
            voyageInclusiveStart = DateTime.From(currentRow[voyageStartDatetime]),
            voyageExclusiveEnd = DateTime.From(currentRow[voyageEndDatetime]),
            voyageDurationInDays = Duration.TotalDays(voyageExclusiveEnd - voyageInclusiveStart),
            createRecordForPeriod = (someInclusiveStart as datetime) => [
                inclusiveStart = someInclusiveStart,
                exclusiveEnd = List.Min({
                    DateTime.From(Date.EndOfMonth(DateTime.Date(someInclusiveStart)) + #duration(1, 0, 0, 0)),
                    voyageExclusiveEnd
                }),
                durationInDays = Duration.TotalDays(exclusiveEnd - inclusiveStart),
                prorataDuration = durationInDays / voyageDurationInDays,
                estimated = prorataDuration * currentRow[estimated],
                updated = prorataDuration * currentRow[updated],
                month = Date.MonthName(DateTime.Date(inclusiveStart)),
                year = Date.Year(inclusiveStart)
            ],
            monthlyRecords = List.Generate(
                () => createRecordForPeriod(voyageInclusiveStart),
                each [inclusiveStart] < voyageExclusiveEnd,
                each createRecordForPeriod([exclusiveEnd])
            ),
            toTable = Table.FromRecords(monthlyRecords)
        in toTable
    ),
    expanded =
        let
            dropped = Table.RemoveColumns(toExpand, {"estimated", "updated", "voyageStartDatetime", "voyageEndDatetime"}),
            expanded = Table.ExpandTableColumn(dropped, "$toExpand", {"month", "year", "estimated", "updated"})
        in expanded
in
    expanded
The code tries to:
join financialTable and voyageTimeTable, so that for each vesselName and vesselNumber combination, we know: estimated, updated, voyageStartDatetime and voyageEndDatetime.
generate a list of months for the period between voyageStartDatetime and voyageEndDatetime (which get expanded into new table rows)
for each month (in the list), do all the arithmetic you mention in your question
get rid of some columns (like the old estimated and updated columns)
I recommend testing it with different vesselNames and vesselNumbers from your dataset, just to see if the output is always correct (I think it should be).
You should be able to manually inspect the cells in the $toExpand column (of the toExpand step/expression) to see the nested rows before they get expanded.
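If you want to sanity-check the month-by-month apportionment outside of Power Query, here is a minimal Python sketch of the same arithmetic (the dates are the example voyage from the question; the function name is just illustrative):
from datetime import datetime, timedelta

def month_shares(start, end):
    """Yield (year, month, days, share) for each calendar month the span touches."""
    total_days = (end - start).total_seconds() / 86400.0
    cursor = start
    while cursor < end:
        # first instant of the next month (exclusive end of the current chunk)
        next_month = (cursor.replace(day=1) + timedelta(days=32)).replace(
            day=1, hour=0, minute=0, second=0, microsecond=0)
        chunk_end = min(next_month, end)
        days = (chunk_end - cursor).total_seconds() / 86400.0
        yield cursor.year, cursor.month, days, days / total_days
        cursor = chunk_end

for year, month, days, share in month_shares(datetime(2020, 6, 20, 20, 0), datetime(2020, 8, 2, 14, 0)):
    print(year, month, round(days, 4), round(100 * share, 2))
# prints roughly: 2020 6 10.1667 23.78 / 2020 7 31.0 72.51 / 2020 8 1.5833 3.7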

Merging Pandas Dataframe from one CSV file with identical columns

I hope my question isn't a duplicate of another, but I have searched for three days and I haven't found the answer.
Okay, so I have a CSV file containing two header rows. The file contains information about hotels (their name), how much they cost (price), their rating, and where they are located (Area 1, 2 or 3):
The CSV file imported
As you can see, the first row describes the area, while the second row holds the hotel name, price and rating. What I want is to rearrange the file and save it to a new CSV file where the format looks like this:
The hopeful output
So the area information for the hotels has been given its own column. The names in the second row are all identical.
Is there a way to create this? I am a bit new to these tree-like data structures when they have to be imported. Could it be done if the tree had more levels (e.g. if we started with country, moved down to area and then down to hotel name, price and rating)? Can it be done with Pandas?
First, could you share the CSV files as text? That would really help for trying out a solution; it is not very productive to retype the data from a picture.
Second, have you tried to achieve this by scripting it yourself, or by using some library? You added the pandas tag, but you do not mention it in the text. Any specific reason it should be pandas?
A solution that works for this one case seems simple to do just by slicing. I guess your format is rather specific and non-standard, so libraries might not help much. Pandas, for example, allows multiple header rows, but it interprets them in a different way; see pandas dataframe with 2-rows header and export to csv
A solution idea:
table = []
with open(my_csv_file) as f:
    for line in f:
        a1, p1, r1, a2, p2, r2, a3, p3, r3 = line[:-1].split(",")
        table.append([a1, p1, r1, "area1"])
        table.append([a2, p2, r2, "area2"])
        table.append([a3, p3, r3, "area3"])
# ... convert table into dataframe etc.
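To finish that sketch, the collected rows can be turned into a DataFrame and written out; a minimal example (the column names Hotelname, Price and Rating and the output file name are assumptions, not taken from your file, and you would skip the two header rows before appending data rows):
import pandas as pd

# assumed column names; adjust them to match the real second header row
df = pd.DataFrame(table, columns=["Hotelname", "Price", "Rating", "Area"])
df.to_csv("hotels_by_area.csv", index=False)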
Okay so I created a possible solution to the problem:
import csv
import pandas as pd

infile = csv.reader(open(my_csv_file), delimiter=';')
out = []
counter = 0
i = 0
k = 0
names = []
temp1 = 0
for line in infile:
    if counter == 0:
        # first header row: the area names
        names = line
        counter += 1
    elif counter == 1:
        # second header row: hotel name, price, rating repeated for each area
        k = len(list(set(line)))
        while i < len(line):
            line.insert(i + k, 'Area')  # label for the new area column
            i += (k + 1)
        counter += 1
        out.append(line)
    else:
        # data rows: insert the matching area name after each group of k columns
        i = 0
        ind = 0
        while i < len(line):
            line.insert(i + k, names[ind * k])
            i += (k + 1)
            ind += 1
        out.append(line)

headers = out.pop(0)
n = len(set(headers))
table = pd.DataFrame(out, columns=headers)
for i in range(0, len(table.columns) // n):
    if i == 0:
        temp1 = table.iloc[:, n * i:n * (i + 1)]
    else:
        temp1 = pd.concat([temp1, table.iloc[:, n * i:n * (i + 1)]], ignore_index=True)
I would very much like some input and suggestions to make the solution more elegant or to add extra levels of headers to the file.

difflib.get_close_matches GET SCORE

I am trying to get the score of the best match using difflib.get_close_matches:
import difflib
best_match = difflib.get_close_matches(str,str_list,1)[0]
I know about the optional cutoff parameter, but I couldn't find out how to get the actual score of the match after setting the threshold.
Am I missing something? Is there a better solution to match unicode strings?
I found that difflib.get_close_matches is the simplest way for matching/fuzzy-matching strings. But there are a few other more advanced libraries like fuzzywuzzy as you mentioned in the comments.
But if you want to use difflib, you can use difflib.SequenceMatcher to get the score as follows:
import difflib
my_str = 'apple'
str_list = ['ape', 'fjsdf', 'aerewtg', 'dgyow', 'paepd']
best_match = difflib.get_close_matches(my_str, str_list, 1)[0]
score = difflib.SequenceMatcher(None, my_str, best_match).ratio()
In this example, the best match between 'apple' and the list is 'ape' and the score is 0.75.
You can also loop through the list and compute all the scores to check:
for word in str_list:
    print "score for: " + my_str + " vs. " + word + " = " + str(difflib.SequenceMatcher(None, my_str, word).ratio())
For this example, you get the following:
score for: apple vs. ape = 0.75
score for: apple vs. fjsdf = 0.0
score for: apple vs. aerewtg = 0.333333333333
score for: apple vs. dgyow = 0.0
score for: apple vs. paepd = 0.4
Documentation for difflib can be found here: https://docs.python.org/2/library/difflib.html
To answer the question, the usual route would be to obtain the comparative score for a match returned by get_close_matches() individually in this manner:
match_ratio = difflib.SequenceMatcher(None, 'aple', 'apple').ratio()
Here's a way that increases speed in my case by about 10% ...
I'm using get_close_matches() for spellchecking. It runs SequenceMatcher() under the hood, but normally it strips the scores and returns just a list of matching strings.
But with a small change in Lib/difflib.py (currently around line 736), the return value can be a dictionary with scores as values, so there is no need to run SequenceMatcher again on each list item to obtain its score. In the examples I've shortened the output floats for clarity (e.g. 0.8888888888888888 to 0.889). The argument n=7 limits the return to the highest 7 items if there are more than 7, which matters when there are many candidates.
Current plain list return
In this example the result would normally be something like ['apple', 'staple', 'able', 'lapel']
... at the default cutoff of .6 when it is omitted (as in Ben's answer, no judgement).
The change
in difflib.py is simple (the trailing comment shows the original return):
return {v: k for (k, v) in result} # hack to return dict with scores instead of list, original was ... [x for score, x in result]
New dictionary return
includes scores like {'apple': 0.889, 'staple': 0.8, 'able': 0.75, 'lapel': 0.667}
>>> to_match = 'aple'
>>> candidates = ['lapel', 'staple', 'zoo', 'able', 'apple', 'appealing']
Increasing minimum score cutoff/threshold from .4 to .8:
>>> difflib.get_close_matches(to_match, candidates, n=7, cutoff=.4)
{'apple': 0.889, 'staple': 0.8, 'able': 0.75, 'lapel': 0.667, 'appealing': 0.461}
>>> difflib.get_close_matches(to_match, candidates, n=7, cutoff=.7)
{'apple': 0.889, 'staple': 0.8, 'able': 0.75}
>>> difflib.get_close_matches(to_match, candidates, n=7, cutoff=.8)
{'apple': 0.889, 'staple': 0.8}
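If editing Lib/difflib.py isn't practical (the change affects every caller and disappears on the next Python upgrade), a small wrapper that mirrors what get_close_matches() does internally gives the same scored output without touching the standard library. A minimal sketch (the function name is just illustrative):
import difflib

def close_matches_with_scores(word, possibilities, n=3, cutoff=0.6):
    """Return (match, score) pairs, best first, like get_close_matches but with scores."""
    sm = difflib.SequenceMatcher()
    sm.set_seq2(word)
    scored = []
    for candidate in possibilities:
        sm.set_seq1(candidate)
        # the same three-stage filter get_close_matches uses internally
        if (sm.real_quick_ratio() >= cutoff and
                sm.quick_ratio() >= cutoff and
                sm.ratio() >= cutoff):
            scored.append((candidate, sm.ratio()))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

print(close_matches_with_scores('aple', ['lapel', 'staple', 'zoo', 'able', 'apple', 'appealing'], n=7, cutoff=0.4))
# roughly: [('apple', 0.89), ('staple', 0.8), ('able', 0.75), ('lapel', 0.67), ('appealing', 0.46)]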

Need help in improving the speed of my code for duplicate columns removal in Python

I have written code that takes a text file as input and prints only the variants which occur more than once. By variants I mean the chr positions in the text file.
The input file looks like this:
chr1 1048989 1048989 A G intronic C1orf159 0.16 rs4970406
chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407
chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407
chr1 1113121 1113121 G A intronic TTLL10 0.13 rs12092254
As you can see, rows 2 and 3 repeat. I'm just taking the first 3 columns and checking whether they are the same. Here, chr1 1049083 1049083 appears in both row 2 and row 3, so I print out that there is one duplicate, along with its position.
I have written the code below. Though it's doing what I want, it's quite slow. It takes about 5 min to run on a file which has 700,000 rows. I wanted to know if there is a way to speed things up.
Thanks!
#!/usr/bin/env python
""" takes an input file and
prints out only the variants that occur more than once """

import shlex
import collections

rows = open('variants.txt', 'r').read().split("\n")
# removing the header and storing it in a new variable
header = rows.pop(0)

indices = []
for row in rows:
    var = shlex.split(row)
    indices.append("_".join(var[0:3]))

dup_list = []
ind_tuple = collections.Counter(indices).items()
for x, y in ind_tuple:
    if y > 1:
        dup_list.append(x)

print dup_list
print len(dup_list)
Note: in this case the entire row 2 is a duplicate of row 3, but that is not necessarily always the case. Duplicates of chr positions (the first three columns) are what I'm looking for.
EDIT:
Edited the code as per the suggestion of damienfrancois. Below is my new code:
f = open('variants.txt', 'r')
indices = {}
for line in f:
    row = line.rstrip()
    var = shlex.split(row)
    index = "_".join(var[0:3])
    if indices.has_key(index):
        indices[index] = indices[index] + 1
    else:
        indices[index] = 1

dup_pos = 0
for key, value in indices.items():
    if value > 1:
        dup_pos = dup_pos + 1
print dup_pos
I used time to see how long both versions take.
My original code:
time run remove_dup.py
14428
CPU times: user 181.75 s, sys: 2.46 s,total: 184.20 s
Wall time: 209.31 s
Code after modification:
time run remove_dup2.py
14428
CPU times: user 177.99 s, sys: 2.17 s, total: 180.16 s
Wall time: 222.76 s
I don't see any significant improvement in the time.
Some suggestions:
do not read the whole file at once; read it line by line and process it on the fly; you'll save memory operations
let indices be a defaultdict and increment the value at key "_".join(var[0:3]); this saves the costly (guessing here, you should use a profiler) collections.Counter(indices).items() step (see the sketch after this list)
try PyPy or a Python compiler
split your data into as many subsets as your computer has cores, apply the program to each subset in parallel, then merge the results
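As a minimal sketch of the first two suggestions combined (reading line by line and counting with a defaultdict; the file name and column layout are taken from the question):
import collections

# count each chr/start/end key while streaming the file line by line
counts = collections.defaultdict(int)
with open('variants.txt') as f:
    next(f)  # skip the header row
    for line in f:
        cols = line.split()
        if len(cols) >= 3:
            counts["_".join(cols[:3])] += 1

duplicates = [key for key, count in counts.items() if count > 1]
print(len(duplicates))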
HTH
A big time sink is probably the if..has_key() portion of the code. In my experience, try-except is a lot faster...
f = open('variants.txt', 'r')
indices = {}
for line in f:
    var = line.split()
    index = "_".join(var[0:3])
    try:
        indices[index] += 1
    except KeyError:
        indices[index] = 1
f.close()

dup_pos = 0
for key, value in indices.items():
    if value > 1:
        dup_pos = dup_pos + 1
print dup_pos
Another option there would be to replace the four try/except lines with:
indices[index] = 1 + indices.get(index,0)
This approach only tells you how many of the lines are duplicated, not how many times each is repeated. (So if one line is duplicated 3x, it will still count as one...)
If you are only trying to count the duplicates, and not delete or record them, you could tally the lines of the file as you go and compare that to the length of the indices dictionary; the difference is the number of duplicate lines (instead of looping back through and re-counting). This might save a little time, but it gives a different answer:
#!/usr/bin/env python
f = open('variants.txt', 'r')
indices = {}
total_len = 0
for line in f:
    total_len += 1
    var = line.split()
    index = "_".join(var[0:3])
    indices[index] = 1 + indices.get(index, 0)
f.close()
print "Number of duplicated lines:", total_len - len(indices.keys())
I'd be curious to hear what your benchmarks are for code that does not include the has_key() test...