Understanding *[] in python passed to .agg() in pyspark - python-2.7

I am trying to understand how the *[] allows me to pass parameters to this aggregate in PySpark. The code runs, but I want to reuse it in another example, and I was hoping someone could point me to the appropriate documentation so that I know what is going on here. I like that it can pass the columns in the list as parameters.
Does anyone know what *[] is doing here?
How does it know to append a column to the DataFrame for each element, rather than iterating through the list and executing once for each element in testdata?
import pyspark.sql.functions as fn
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
testdata = spark.createDataFrame([
        (1, 144.5, 5.9, 33, 'M'),
        (2, 167.2, None, 45, 'M'),
        (3, 124.1, 5.2, 23, 'F'),
        (4, None, 5.9, None, 'M'),
        (5, 133.2, 5.7, 54, 'F'),
        (3, 124.1, None, None, 'F'),
        (5, 129.2, 5.3, None, 'M'),
    ],
    ['id', 'weight', 'height', 'age', 'gender']
)
testdata.where(
    fn.col("gender") == 'M'
).select(
    '*'
).agg(*[
    # Fraction of missing values: 1 - (non-null count / total row count)
    (1 - (fn.count(c) / fn.count('*'))).alias(c + '_missing')
    for c in testdata.columns
]).toPandas()
output:
+----------+--------------+--------------+-----------+--------------+
|id_missing|weight_missing|height_missing|age_missing|gender_missing|
+----------+--------------+--------------+-----------+--------------+
| 0.0| 0.25| 0.25| 0.5| 0.0|
+----------+--------------+--------------+-----------+--------------+

Using * in front of a list expands out the members as individual arguments. So, the following two function calls will be equivalent:
my_function(*[1, 2, 3])
my_function(1, 2, 3)
Obviously, the first one is not very useful if you already know the precise number of arguments. It becomes more useful with a comprehension like the one you are using, where it is not clear how many items will be in the list.
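To tie this back to the question about .agg(), here is a minimal sketch. fake_agg is a hypothetical stand-in, not part of PySpark; it only mimics a function declared with *exprs, which is how DataFrame.agg accepts its expressions. The column names are taken from the question.

```python
def fake_agg(*exprs):
    """Hypothetical stand-in for a function that takes any number of expressions."""
    # Inside the function, the unpacked arguments arrive as a single tuple.
    return list(exprs)

columns = ['id', 'weight', 'height']

# The comprehension builds one list first; the * then spreads it into separate arguments.
unpacked = fake_agg(*[c + '_missing' for c in columns])

# Without the *, the whole list is passed as one single argument.
not_unpacked = fake_agg([c + '_missing' for c in columns])

print(unpacked)      # ['id_missing', 'weight_missing', 'height_missing']
print(not_unpacked)  # [['id_missing', 'weight_missing', 'height_missing']]
```

So the list is built once, and the function is called once with one argument per column. That is why the query in the question returns a single row with one result column per input column, rather than running once per element.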

Related

Return the last k numbers of a list (Python) [duplicate]

I need the last 9 numbers of a list and I'm sure there is a way to do it with slicing, but I can't seem to get it. I can get the first 9 like this:
num_list[0:9]
You can use negative integers with the slicing operator for that. Here's an example using the python CLI interpreter:
>>> a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
>>> a[-9:]
[4, 5, 6, 7, 8, 9, 10, 11, 12]
The important line is a[-9:].
A negative index counts from the end of the list, so:
num_list[-9:]
Slicing
Python slicing is an incredibly fast operation, and it's a handy way to quickly access parts of your data.
Slice notation to get the last nine elements from a list (or any other sequence that supports it, like a string) would look like this:
num_list[-9:]
When I see this, I read the part in the brackets as "9th from the end, to the end." (Actually, I abbreviate it mentally as "-9, on")
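Two edge cases are worth knowing before relying on this notation. The sketch below uses made-up data, and last_n is a hypothetical helper name, not a built-in.

```python
num_list = [1, 2, 3, 4, 5]

# Asking for more items than exist returns the whole list instead of raising.
print(num_list[-9:])   # [1, 2, 3, 4, 5]

# Pitfall: -0 is just 0, so this returns everything, not an empty list.
n = 0
print(num_list[-n:])   # [1, 2, 3, 4, 5]

def last_n(seq, n):
    """Hypothetical helper: the last n items, where n == 0 gives an empty result."""
    # Computing an explicit non-negative start index avoids the -0 pitfall.
    return seq[max(len(seq) - n, 0):]

print(last_n(num_list, 0))  # []
print(last_n(num_list, 2))  # [4, 5]
```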
Explanation:
The full notation is
sequence[start:stop:step]
But the colon is what tells Python you're giving it a slice and not a regular index. That's why the idiomatic way of copying lists in Python 2 is
list_copy = sequence[:]
And clearing them is with:
del my_list[:]
(Lists get list.copy and list.clear in Python 3.)
Give your slices a descriptive name!
You may find it useful to separate forming the slice from passing it to the list.__getitem__ method (that's what the square brackets do). Even if you're not new to it, it keeps your code more readable so that others that may have to read your code can more readily understand what you're doing.
However, you can't just assign some integers separated by colons to a variable. You need to use the slice object:
last_nine_slice = slice(-9, None)
The second argument, None, is required so that the first argument is interpreted as the start argument; otherwise it would be the stop argument.
You can then pass the slice object to your sequence:
>>> list(range(100))[last_nine_slice]
[91, 92, 93, 94, 95, 96, 97, 98, 99]
islice
islice from the itertools module is another possibly performant way to get this. islice doesn't take negative arguments, so ideally your iterable has a __reversed__ special method (which list does have), and you first pass your list (or any iterable with __reversed__) to reversed. Note that the result then comes out in reverse order.
>>> from itertools import islice
>>> islice(reversed(range(100)), 0, 9)
<itertools.islice object at 0xffeb87fc>
islice allows for lazy evaluation of the data pipeline, so to materialize the data, pass it to a constructor (like list):
>>> list(islice(reversed(range(100)), 0, 9))
[99, 98, 97, 96, 95, 94, 93, 92, 91]
The last 9 elements can be read from left to right using num_list[-9:], or from right to left using num_list[:-10:-1], as you prefer.
>>> a=range(17)
>>> print a
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
>>> print a[-9:]
[8, 9, 10, 11, 12, 13, 14, 15, 16]
>>> print a[:-10:-1]
[16, 15, 14, 13, 12, 11, 10, 9, 8]
Here are several options for getting the "tail" items of an iterable:
Given
n = 9
iterable = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Desired Output
[2, 3, 4, 5, 6, 7, 8, 9, 10]
Code
We get the latter output using any of the following options:
from collections import deque
import itertools
import more_itertools
# A: Slicing
iterable[-n:]
# B: Implement an itertools recipe
def tail(n, iterable):
    """Return an iterator over the last *n* items of *iterable*.

        >>> t = tail(3, 'ABCDEFG')
        >>> list(t)
        ['E', 'F', 'G']
    """
    return iter(deque(iterable, maxlen=n))
list(tail(n, iterable))
# C: Use an implemented recipe, via more_itertools
list(more_itertools.tail(n, iterable))
# D: islice, via itertools
list(itertools.islice(iterable, len(iterable)-n, None))
# E: Negative islice, via more_itertools
list(more_itertools.islice_extended(iterable, -n, None))
Details
A. Traditional Python slicing is inherent to the language. This option works with sequences such as strings, lists and tuples. However, this kind of slicing does not work on iterators, e.g. iter(iterable).
B. An itertools recipe. It is generalized to work on any iterable and resolves the iterator issue in the last solution. This recipe must be implemented manually as it is not officially included in the itertools module.
C. Many recipes, including the latter tool (B), have been conveniently implemented in third-party packages. Installing and importing these libraries obviates manual implementation. One of these libraries is called more_itertools (install via pip install more-itertools); see more_itertools.tail.
D. A member of the itertools library. Note, itertools.islice does not support negative slicing.
E. Another tool is implemented in more_itertools that generalizes itertools.islice to support negative slicing; see more_itertools.islice_extended.
Which one do I use?
It depends. In most cases, slicing (option A, as mentioned in other answers) is the simplest option, as it is built into the language and supports most sequence types. For more general iterators, use any of the remaining options. Note that options C and E require installing a third-party library, which some users may find useful.
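To illustrate the difference between a sequence and a general iterator, here is a small sketch using only the standard library. The numbers generator is a made-up stand-in for a stream of unknown length; tail is the recipe from option B.

```python
from collections import deque

def tail(n, iterable):
    """Return an iterator over the last n items of iterable."""
    return iter(deque(iterable, maxlen=n))

def numbers():
    """Hypothetical generator standing in for a stream of unknown length."""
    for i in range(1, 11):
        yield i

# Slicing needs a sequence; a generator does not support subscripting.
try:
    numbers()[-9:]
    slicing_worked = True
except TypeError:
    slicing_worked = False

last_nine = list(tail(9, numbers()))

print(slicing_worked)  # False
print(last_nine)       # [2, 3, 4, 5, 6, 7, 8, 9, 10]
```

The deque keeps at most n items in memory while it consumes the iterable, which is why this works without knowing the length in advance.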

Django is giving different date values for objects accessed via queryset

Context: I get different values for a datetime field depending on how I access it. I am sure there is some UTC edge-case magic going on here.
(Pdb++) Foo.objects.all().values_list('gated_out__occurred__date')[0][0]
datetime.date(2021, 9, 9)
(Pdb++) Foo.objects.all()[0].gated_out.occurred.date()
datetime.date(2021, 9, 10)
Edit: They have the same PK
(Pdb++) Foo.objects.all().order_by("pk")[0].gated_out.occurred.date()
datetime.date(2021, 9, 10)
(Pdb++) Foo.objects.all().order_by("pk").values_list('gated_out__occurred__date')[0][0]
datetime.date(2021, 9, 9)
How do I fix/figure out what is happening?
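One possible explanation, offered as an assumption because the project settings are not shown: with USE_TZ = True, Django hands aware datetimes to Python in UTC, while the __date transform converts the value to the current time zone before extracting the date. The same instant can then fall on two different calendar days. The sketch below uses only the standard library; the timestamp and the UTC-5 offset are invented for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical stored instant: 02:30 UTC on 10 September 2021.
occurred_utc = datetime(2021, 9, 10, 2, 30, tzinfo=timezone.utc)

# Hypothetical local zone five hours behind UTC.
local_zone = timezone(timedelta(hours=-5))

# Date as seen when the value stays in UTC.
date_in_utc = occurred_utc.date()

# Date as seen after converting to the local zone first.
date_in_local = occurred_utc.astimezone(local_zone).date()

print(date_in_utc)    # 2021-09-10
print(date_in_local)  # 2021-09-09
```

If this is the cause, converting on the Python side with django.utils.timezone.localtime(...) before calling .date() should give the same date as the query.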

Django - update dictionary with missing date values, set to 0

So to display a small bargraph using Django and Chart.js I constructed the following query on my model.
views.py
import datetime
from collections import OrderedDict

from django.db.models import Count
from django.views.generic import TemplateView


class BookingsView(TemplateView):
    template_name = 'orders/bookings.html'

    def get_context_data(self, **kwargs):
        # Start from the context that TemplateView builds
        context = super().get_context_data(**kwargs)
        today = datetime.date.today()
        seven_days = today + datetime.timedelta(days=7)
        bookings = dict(Booking.objects.filter(start_date__range=[today, seven_days])
                        .order_by('start_date')
                        .values_list('start_date')
                        .annotate(Count('id')))
        # Edit: set a default for missing dictionary values
        for dt in range(7):
            bookings.setdefault(today + datetime.timedelta(dt), 0)
        # Edit: reorder the dictionary before using it in a template
        context['bookings'] = OrderedDict(sorted(bookings.items()))
        return context
This led me to the following result;
# Edit; after setting the default on the dictionary and the reorder
{
datetime.date(2019, 8, 6): 12,
datetime.date(2019, 8, 7): 12,
datetime.date(2019, 8, 8): 0,
datetime.date(2019, 8, 9): 4,
datetime.date(2019, 8, 10): 7,
datetime.date(2019, 8, 11): 0,
datetime.date(2019, 8, 12): 7
}
To use the data in a chart I would like to add the missing start_dates into the dictionary but I'm not entirely sure how to do this.
So I want to update the dictionary with a value "0" for the 8th and 11th of August.
I tried to add a for statement, but I got the error:
"'datetime.date' object is not iterable"
Like the error says, you can not iterate over a date object, so for start_date in seven_days will not work.
You can however use a for loop here like:
for dt in range(7):
bookings.setdefault(today+datetime.timedelta(dt), 0)
A dictionary has a .setdefault(..) method that sets a value only if the key does not yet exist in the dictionary. This is shorter and more efficient than first checking whether the key exists yourself, since Python does not have to perform two lookups.
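Here is a small standalone sketch of that behaviour. The dates and counts are made up to resemble the question's data, and a fixed date is used instead of date.today() so the output is reproducible.

```python
import datetime

today = datetime.date(2019, 8, 6)  # fixed date so the output is reproducible

# Hypothetical counts as the query might return them, with 8 August missing.
bookings = {
    datetime.date(2019, 8, 6): 12,
    datetime.date(2019, 8, 7): 12,
    datetime.date(2019, 8, 9): 4,
}

for offset in range(4):
    day = today + datetime.timedelta(days=offset)
    # Existing keys keep their value; missing keys are inserted with 0.
    bookings.setdefault(day, 0)

print(bookings[datetime.date(2019, 8, 8)])  # 0
print(bookings[datetime.date(2019, 8, 6)])  # 12
```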
EDIT: Since python-3.7, dictionaries preserve insertion order (in the CPython implementation of python-3.6 that was already the case, but it was considered an "implementation detail"). Since python-3.7, you can thus produce a sorted dictionary with:
bookings = dict(sorted(bookings.items()))
Prior to python-3.7, you can use an OrderedDict [Python-doc]:
from collections import OrderedDict
bookings = OrderedDict(sorted(bookings.items()))
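As an alternative sketch, assuming python-3.7 or later: if you build the dictionary from the list of days in the first place, the defaults and the ordering come out together, so no separate sort is needed. The counts dictionary below is made-up data standing in for the query result.

```python
import datetime

today = datetime.date(2019, 8, 6)  # fixed date so the output is reproducible

# Hypothetical query result, deliberately out of order and with gaps.
counts = {datetime.date(2019, 8, 9): 4, datetime.date(2019, 8, 6): 12}

# The days we want on the chart, already in chronological order.
days = [today + datetime.timedelta(days=i) for i in range(4)]

# Keys are inserted in the order of `days`; missing days get 0.
bookings = {day: counts.get(day, 0) for day in days}

print(list(bookings.values()))  # [12, 0, 0, 4]
```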

Django dynamic verbose name

I'm struggling to think about how to achieve this. What I want to do is have a series of questions (to represent a Likert table) in a CharField object like so:
for a in range(1, 11):
    locals()['ATL' + str(a)] = models.PositiveIntegerField(
        choices=[
            [1, 'Disagree Completely'],
            [2, 'Disagree Strongly'],
            [3, 'Disagree'],
            [4, 'Neutral'],
            [5, 'Agree'],
            [6, 'Agree Strongly'],
            [7, 'Agree Completely'],
        ],
        widget=widgets.RadioSelectHorizontal(),
        verbose_name=Constants.ATL_qu_list[a-1])
del a
And then change the verbose name for the question depending on the question number (again, I know I'm not supposed to be using locals() to store variables). Is there an easier way of achieving a dynamic label though? Thanks!
Okay, here's my answer (as well as a clarification for what I am looking for). Basically I had a series of Likert questions to put to participants which I wanted to represent as CharFields. Because each Likert question uses the same seven choice scale, it seems like inefficient coding to repeat the same functionality and only change the verbose name between each declaration.
Accordingly, I've instead used this method to achieve what I want:
import csv

# Reads in the list of survey questions
with open('survey/survey_questions.csv') as csvfile:
    data_read = list(csv.reader(csvfile))
...
for a in range(1, 11):
    locals()['ATL' + str(a)] = models.PositiveIntegerField(
        choices=[
            [1, 'Disagree Completely'],
            [2, 'Disagree Strongly'],
            [3, 'Disagree'],
            [4, 'Neutral'],
            [5, 'Agree'],
            [6, 'Agree Strongly'],
            [7, 'Agree Completely'],
        ],
        widget=widgets.RadioSelectHorizontal(),
        # First column of row a-1 holds the question text
        verbose_name=data_read[a-1][0])
del a
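For anyone who wants to avoid writing to locals(), here is a rough sketch of the same idea with a dictionary and type(). make_field is a hypothetical stand-in for models.PositiveIntegerField so that the snippet runs without the framework, and the question texts are invented. Whether your framework's model base class accepts classes built this way is an assumption you would need to check.

```python
LIKERT_CHOICES = [
    [1, 'Disagree Completely'],
    [2, 'Disagree Strongly'],
    [3, 'Disagree'],
    [4, 'Neutral'],
    [5, 'Agree'],
    [6, 'Agree Strongly'],
    [7, 'Agree Completely'],
]

def make_field(**kwargs):
    """Hypothetical stand-in for models.PositiveIntegerField; just returns its kwargs."""
    return dict(kwargs)

def likert_fields(questions, prefix='ATL'):
    """Build a {field_name: field} mapping, one entry per question."""
    return {
        prefix + str(number): make_field(choices=LIKERT_CHOICES, verbose_name=text)
        for number, text in enumerate(questions, start=1)
    }

questions = ['I enjoy surveys.', 'Surveys are too long.']  # made-up questions
fields = likert_fields(questions)

# type() builds a class from a name, base classes and an attribute dict,
# so the fields become class attributes without touching locals().
Survey = type('Survey', (object,), fields)

print(Survey.ATL1['verbose_name'])  # I enjoy surveys.
print(sorted(fields))               # ['ATL1', 'ATL2']
```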

Django - order queryset alphabetically for cyrillic symbols

I have an issue with the default ordering of Cyrillic CharFields in Django. Is there a way to order Cyrillic words alphabetically?
Taxonomy.objects.filter(type=tax_type).order_by('name')
This ordering returns me such data:
For English words, ordering works as expected:
Shell output is exactly the same:
In [3]: Taxonomy.objects.filter(type="COUNTRY").order_by('name').values_list('id', 'name')
Out[3]: [(30, 'Abkhazija'), (31, 'Armenia'), (33, 'Gruzia'), (53, 'Kipr'), (59, 'Nepal'), (56, 'Thailand'), (46, 'Turkey'), (52, 'USA')]
In [4]: Taxonomy.objects.filter(type="PLACE").order_by('name').values_list('id', 'name')
Out[4]: [(42, 'Дон'), (49, 'Крым'), (73, 'Алтай'), (4, 'Архыз'), (71, 'Плато Путорана'), (44, 'Адыгея'), (75, 'Байкал'), (64, 'Домбай'), (11, 'Кавказ'), (69, 'Хибины'), (35, 'Карелия'), (54, 'Эверест'), (32, 'Эльбрус'), (34, 'Камчатка'), (51, 'Псковская область'), (19, 'Заграничный'), (50, 'Подмосковье'), (65, 'Приэльбрусье'), (60, 'Ленинградская область')]
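The Cyrillic output above is clearly not alphabetical. Since order_by sorts in the database, this may point at the database collation, but that is an assumption because the database and its settings are not shown. As a fallback sketch for small result sets, you can sort on the Python side, shown here with a few names taken from the output above.

```python
names = ['Дон', 'Крым', 'Алтай', 'Архыз', 'Адыгея', 'Байкал']

# Python compares strings by Unicode code point. The basic Cyrillic letters
# are laid out in alphabetical order there (Ё/ё are the exception), so a
# case-insensitive sort gives alphabetical order for these names.
ordered = sorted(names, key=str.lower)

print(ordered)  # ['Адыгея', 'Алтай', 'Архыз', 'Байкал', 'Дон', 'Крым']
```

This does not fix the ordering in the database itself, and it would misplace words containing Ё, so treat it as a workaround rather than a full solution.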