Replace null values with mean for specific columns in pyspark


I would like to replace null values with the mean for the Age and Height columns. I know there is a post,
"Fill Pyspark dataframe column null values with average value from same column",
but the function given in that post throws an error.
df = spark.createDataFrame(
    [
        (1, 'John', 1.79, 28, 'M', 'Doctor'),
        (2, 'Steve', 1.78, 45, 'M', None),
        (3, 'Emma', 1.75, None, None, None),
        (4, 'Ashley', 1.6, 33, 'F', 'Analyst'),
        (5, 'Olivia', 1.8, 54, 'F', 'Teacher'),
        (6, 'Hannah', 1.82, None, 'F', None),
        (7, 'William', 1.7, 42, 'M', 'Engineer'),
        (None, None, None, None, None, None),
        (8, 'Ethan', 1.55, 38, 'M', 'Doctor'),
        (9, 'Hannah', 1.65, None, 'F', 'Doctor')
    ],
    ['Id', 'Name', 'Height', 'Age', 'Gender', 'Profession']
)
The function given in that post is:
def fill_with_mean(df, exclude=set()):
    stats = df.agg(*(
        avg(c).alias(c) for c in df.columns if c not in exclude
    ))
    return df.na.fill(stats.first().asDict())

fill_with_mean(df, ["Age", "Height"])
when I run this function, it says
NameError: name 'avg' is not defined
Can anybody fix this? Thank you.

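Here is a fixed example that works for me the way you expect. The NameError occurs because avg was never imported from pyspark.sql.functions. Also note that the second argument lists the columns to exclude, so to fill only Age and Height you pass all the other columns.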
from pyspark.sql.functions import avg

df = spark.createDataFrame(
    [
        (1, 'John', 1.79, 28, 'M', 'Doctor'),
        (2, 'Steve', 1.78, 45, 'M', None),
        (3, 'Emma', 1.75, None, None, None),
        (4, 'Ashley', 1.6, 33, 'F', 'Analyst'),
        (5, 'Olivia', 1.8, 54, 'F', 'Teacher'),
        (6, 'Hannah', 1.82, None, 'F', None),
        (7, 'William', 1.7, 42, 'M', 'Engineer'),
        (None, None, None, None, None, None),
        (8, 'Ethan', 1.55, 38, 'M', 'Doctor'),
        (9, 'Hannah', 1.65, None, 'F', 'Doctor')
    ],
    ['Id', 'Name', 'Height', 'Age', 'Gender', 'Profession']
)

def fill_with_mean(this_df, exclude=set()):
    # One average per column that is not excluded
    stats = this_df.agg(*(avg(c).alias(c) for c in this_df.columns if c not in exclude))
    # Fill nulls using the {column: mean} mapping
    return this_df.na.fill(stats.first().asDict())

res = fill_with_mean(df, ["Gender", "Profession", "Id", "Name"])
res.show()
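To see what fill_with_mean computes without needing a Spark session, here is a plain-Python sketch of the same idea on a list of tuples. The helper name fill_with_mean_rows and the reduced row layout are assumptions made for illustration; this is not PySpark API. As with avg, null (None) values are ignored when the mean is computed.

```python
from statistics import mean

COLUMNS = ["Id", "Name", "Height", "Age"]

def fill_with_mean_rows(rows, columns, targets):
    """Replace None in the target columns with that column's mean."""
    idx = {name: columns.index(name) for name in targets}
    # Mean of the non-null values of each target column.
    means = {
        name: mean(r[i] for r in rows if r[i] is not None)
        for name, i in idx.items()
    }
    filled = []
    for row in rows:
        row = list(row)
        for name, i in idx.items():
            if row[i] is None:
                row[i] = means[name]
        filled.append(tuple(row))
    return filled

rows = [
    (1, "John", 1.79, 28),
    (3, "Emma", 1.75, None),
    (5, "Olivia", None, 54),
]
result = fill_with_mean_rows(rows, COLUMNS, ["Height", "Age"])
```

Only the columns named in targets are touched; rows without nulls come back unchanged.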

Related

how to get only values from query dict

if request.method == 'POST':
    product = request.POST.get('product')
    upload_month = request.POST.get('upload_month')
    un_month = Planning_quantity_data.objects.values('month').filter(product=product, upload_month=upload_month).distinct()
    print(un_month)
<QuerySet [{'month': 'Mar_22'}, {'month': 'Apr_22'}, {'month': 'May_22'}, {'month': 'Jun_22'}]>
I want to get only the values, without the keys, and store them in a new list in views.py, like this:
newlist = ['Mar_22', 'Apr_22', 'May_22', 'Jun_22']
When I use
un_month1 = list(un_month.values())
print(un_month1)
it shows something like this:
[{'id': 1, 'upload_month': 'Mar_22', 'product': 'MAE675', 'material_code': 'MAE675 (MEMU – OB) RCF', 'order_type': 'Onhand', 'BOM_CODE': '675MEMU', 'month': 'Mar_22', 'quantity': 3, 'po_value': '37/5', 'remarks': 'Qty in Rakes. 3-5 rakes partial qty dispatched', 'empid': None},
 {'id': 2, 'upload_month': 'Mar_22', 'product': 'MAE675', 'material_code': 'MAE675 (MEMU – OB) RCF', 'order_type': 'Onhand', 'BOM_CODE': '675MEMU', 'month': 'Apr_22', 'quantity': 3, 'po_value': '37/5', 'remarks': 'Qty in Rakes. 3-5 rakes partial qty dispatched', 'empid': None},
 {'id': 3, 'upload_month': 'Mar_22', 'product': 'MAE675', 'material_code': 'MAE675 (MEMU – OB) RCF', 'order_type': 'Onhand', 'BOM_CODE': '675MEMU', 'month': 'May_22', 'quantity': 3, 'po_value': '37/5', 'remarks': 'Qty in Rakes. 3-5 rakes partial qty dispatched', 'empid': None}]

If you use values_list() [django-docs] with a single field, you can pass flat=True to get a QuerySet of single values instead of dictionaries or tuples:
if request.method == 'POST':
    product = request.POST.get('product')
    upload_month = request.POST.get('upload_month')
    newlist = list(
        Planning_quantity_data.objects
        .filter(product=product, upload_month=upload_month)
        .values_list('month', flat=True)
        .distinct()
    )
    print(newlist)
This prints just ['Mar_22', 'Apr_22', 'May_22', 'Jun_22']. The .distinct() call is kept from your original query so that each month appears only once.
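If you already hold the result of .values('month'), which iterates as one dictionary per row, the same list can also be built in plain Python. The sketch below uses a hard-coded list in place of the QuerySet (an assumption, so that it runs without Django).

```python
# Stand-in for the rows yielded by .values('month'); a real QuerySet
# iterates the same way, one dict per row.
un_month = [
    {'month': 'Mar_22'},
    {'month': 'Apr_22'},
    {'month': 'May_22'},
    {'month': 'Jun_22'},
]

# Pull the single value out of each dict.
newlist = [row['month'] for row in un_month]
print(newlist)
```

Letting the database return flat values is usually the tidier option; the comprehension is mainly useful when the dictionaries are needed elsewhere as well.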

Django ORM select, concat, extract from data and order by

I'm having trouble performing a simple transformation with the django orm.
Desired outcome should look like this:
2018-08
2018-07
2018-06
...
And is created with this sql:
select
distinct
strftime('%Y',a."Buchung") || "-" ||
strftime('%m',a."Buchung") as YearMonth
from
hhdata_transaktion a
order by
1 desc
I need it for a ModelChoiceField as queryset, so I'm bound to the ORM here?
My attempt:
from django.db.models.functions import TruncMonth, TruncYear

(Transaktion.objects
    .annotate(year=TruncYear('Buchung'),
              month=TruncMonth('Buchung'))
    .distinct()
    .order_by('-year', '-month')
    .values('year', 'month'))
returns:
<QuerySet [{'year': datetime.date(2018, 1, 1), 'month': datetime.date(2018, 8, 1)}, {'year': datetime.date(2018, 1, 1), 'month': datetime.date(2018, 7, 1)}, {'year': datetime.date(2018, 1, 1), 'month': datetime.date(2018, 6, 1)}, {'year': datetime.date(2018, 1, 1), 'month': datetime.date(2018, 5, 1)}, {'year': datetime.date(2018, 1, 1), 'month': datetime.date(2018, 4, 1)}, {'year': datetime.date(2018, 1, 1), 'month': datetime.date(2018, 3, 1)}, {'year': datetime.date(2018, 1, 1), 'month': datetime.date(2018, 2, 1)}, {'year': datetime.date(2018, 1, 1), 'month': datetime.date(2018, 1, 1)}, {'year': datetime.date(2017, 1, 1), 'month': datetime.date(2017, 12, 1)}, {'year': datetime.date(2017, 1, 1), 'month': datetime.date(2017, 11, 1)}, {'year': datetime.date(2017, 1, 1), 'month': datetime.date(2017, 10, 1)}, {'year': datetime.date(2017, 1, 1), 'month': datetime.date(2017, 9, 1)}, {'year': datetime.date(2017, 1, 1), 'month': datetime.date(2017, 8, 1)}]>
I have the feeling I'm miles away from the desired result.

If you want to obtain the year or month, you can use ExtractYear [Django-doc] and ExtractMonth [Django-doc] respectively. Truncating will give you the start of the year or month.
So we can rewrite the query to:
from django.db.models.functions import ExtractMonth, ExtractYear

qs = Transaktion.objects.annotate(
    year=ExtractYear('Buchung'),
    month=ExtractMonth('Buchung')
).order_by('-year', '-month').values('year', 'month').distinct()
Although it is possible to do the processing at the SQL level, I think it would make things more complex. For example, if you concatenate the numbers in SQL, it will probably take some work to get leading zeros for months below 10. Furthermore, the query is likely to rely on features specific to one SQL dialect, which makes it less portable.
Therefore I suggest doing the post-processing at the Django/Python level, for example with:
from django import forms
from django.db.models.functions import ExtractMonth, ExtractYear

class MyForm(forms.Form):
    my_choice_field = forms.ChoiceField()
    # ...

    def __init__(self, *args, **kwargs):
        super(MyForm, self).__init__(*args, **kwargs)
        qs = Transaktion.objects.annotate(
            year=ExtractYear('Buchung'),
            month=ExtractMonth('Buchung')
        ).order_by('-year', '-month').values('year', 'month').distinct()
        self.fields['my_choice_field'].choices = [
            (row['year']*100 + row['month'], '{}-{:02d}'.format(row['year'], row['month']))
            for row in qs
        ]
Here we generate a list of 2-tuples. The first element is a number that identifies the choice (I multiplied the year by 100, so that 201804 is April 2018). The second element is the string displayed for that choice, for example 2018-04.
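The tuple construction can be tried on its own, without a database. In the sketch below, rows is a hard-coded stand-in for the dictionaries returned by the query (an assumption for illustration).

```python
# Stand-in for the year/month dictionaries returned by the query.
rows = [
    {'year': 2018, 'month': 8},
    {'year': 2018, 'month': 7},
    {'year': 2017, 'month': 12},
]

# (numeric key, label) pairs: year*100 + month keeps keys sortable,
# and {:02d} pads single-digit months with a leading zero.
choices = [
    (row['year'] * 100 + row['month'], '{}-{:02d}'.format(row['year'], row['month']))
    for row in rows
]
print(choices)
```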

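If you want a list of strings like 2018-06, something like this should work. Note that the field name passed to order_by must be a quoted string, and that this produces one string per Transaktion, so a month with several rows appears more than once:
['%i-%02i' % (x.Buchung.year, x.Buchung.month) for x in Transaktion.objects.order_by('-Buchung')]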

Replace dictionary values containing array of integers with array of objects

I am currently working with Django and I am able to fetch the JSON of my model. But one of the keys of the JSON contains an array of numbers, which I need to replace with an array of the objects those numbers refer to. Below is the query to get the JSON of the Contexts model:
queryset = serializers.serialize("json", Contexts.objects.all())
This is what I get
[{"model": "app.contexts", "pk": 1, "fields": {"context_name": "tech-experts", "context_description": "this is for tech experts", "context_priority": "H", "users": [1, 3, 4]}}, {"model": "app.contexts", "pk": 2, "fields": {"context_name": "video-conf-issue", "context_description": "bla bla", "context_priority": "H", "users": [4, 5]}}, {"model": "app.contexts", "pk": 3, "fields": {"context_name": "video-conf-issue", "context_description": "bla bla", "context_priority": "L", "users": [3]}}, {"model": "app.contexts", "pk": 15, "fields": {"context_name": "Network debug", "context_description": "Group for debugging network issues", "context_priority": "L", "users": [2]}}]
Now I am interested in just the fields values, so I do this:
result = [i.get('fields') for i in ast.literal_eval(queryset)]
So now I get this
[{'context_priority': 'H', 'context_name': 'tech-experts', 'context_description': 'this is for tech experts', 'users': [1, 3, 4]}, {'context_priority': 'H', 'context_name': 'video-conf-issue', 'context_description': 'bla bla', 'users': [4, 5]}, {'context_priority': 'L', 'context_name': 'video-conf-issue', 'context_description': 'bla bla', 'users': [3]}, {'context_priority': 'L', 'context_name': 'Network debug', 'context_description': 'Group for debugging network issues', 'users': [2]}]
Now, as you can see, users holds an array of integers. These integers are user ids, and I want the User objects for these ids.
For example, the User object for userId 1 would be
User.objects.filter(userId=1)
So in order to achieve this I do the below operation
[i.update({"users":[].append(User.objects.filter(userId=j))}) for i in result for j in i.get("users")]
But now I get the resulting value for the key users as None
[{'context_description': 'this is for tech experts', 'users': None, 'context_priority': 'H', 'context_name': 'tech-experts'}, {'context_description': 'bla bla', 'users': None, 'context_priority': 'H', 'context_name': 'video-conf-issue'}, {'context_description': 'bla bla', 'users': None, 'context_priority': 'L', 'context_name': 'video-conf-issue'}, {'context_description': 'Group for debugging network issues', 'users': None, 'context_priority': 'L', 'context_name': 'Network debug'}]
How can I achieve this?
Added the Contexts and User model below
class User(models.Model):
    userId = models.PositiveIntegerField(null = False)
    pic = models.ImageField(upload_to=getUserImagePath,null=True)
    Email = models.EmailField(null = True)

class Contexts(models.Model):
    context_name = models.CharField(max_length=50)
    context_description = models.TextField()
    context_priority = models.CharField(max_length=1)
    users = models.ManyToManyField(User, related_name='context_users')

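The integers in users are the primary keys of the related User rows, so filter on pk rather than userId. That is, change
[i.update({"users":[].append(User.objects.filter(userId=j))}) for i in result for j in i.get("users")]
to
[i.update({"users":[].append(User.objects.filter(pk=j))}) for i in result for j in i.get("users")]
Note also that [].append(...) returns None, so the value stored under users is still None with either version. Build the list first and then assign it, for example:
for i in result:
    i["users"] = list(User.objects.filter(pk__in=i["users"]))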
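The behavior of list.append in the original expression can be checked in plain Python. The sketch below replaces the User table with a dictionary keyed by primary key (USERS_BY_PK is a hypothetical stand-in, so that the example runs without Django).

```python
# Hypothetical stand-in for the User table, keyed by primary key.
USERS_BY_PK = {1: "user-1", 3: "user-3", 4: "user-4"}

result = [{"context_name": "tech-experts", "users": [1, 3, 4]}]

# list.append mutates the list in place and returns None, which is
# why the original expression stored None under "users".
broken = [].append(USERS_BY_PK[1])

# Build the replacement list first, then assign it.
for entry in result:
    entry["users"] = [USERS_BY_PK[pk] for pk in entry["users"]]

print(broken)
print(result)
```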

Django How to access another object by having a user name?

I have this in models:
class CustomUser(AbstractUser):
    selectat = models.BooleanField(default=False)

    def __str__(self):
        return self.username

class Score(models.Model):
    VALUE = (
        (1, "Nota 1"),
        (2, "Nota 2"),
        (3, "Nota 3"),
        (4, "Nota 4"),
        (5, "Nota 5"),
        (6, "Nota 6"),
        (7, "Nota 7"),
        (8, "Nota 8"),
        (9, "Nota 9"),
        (10, "Nota 10"),
    )
    user_from = models.ForeignKey(settings.AUTH_USER_MODEL, default=0)
    user_to = models.ForeignKey(settings.AUTH_USER_MODEL, default=0, related_name='user_to')
    nota = models.PositiveSmallIntegerField(default=0, choices=VALUE)

    def __str__(self):
        return str(self.user_to)
How can I access the Score objects when I have the user?
When I filter Score by the user, I can get the nota values:
x = Score.objects.filter(user_to__username='Fane')
x
<QuerySet [<Punctaj: Fane>, <Punctaj: Fane>]>
for a in x:
    print(a.nota)
1
5
I want to use something like this:
y = CustomUser.objects.get(id=7)
x = y.score.all()
for a in x:
    print(a.nota)
1
5
But this won't work, it's giving me:
Traceback (most recent call last):
File "<input>", line 1, in <module>
AttributeError: 'CustomUser' object has no attribute 'score'

Score has two foreign keys to CustomUser. The first one, user_from, does not set a related_name, so it uses the default, which is score_set:
x = y.score_set.all()
The second does set a related_name, so you use that:
x = y.user_to.all()
Note that this does not make much sense as a related name, since it points to scores, not users; it should probably be something like scores_to_user.

Aggregate grouped annotation

I'd like to sum all the event durations per day. This is my model:
class Event(models.Model):
    start = models.DateTimeField()
    end = models.DateTimeField()
Sample data:
import datetime
from random import randint

for i in range(0, 1000):
    start = datetime.datetime(
        year=2016,
        month=1,
        day=randint(1, 10),
        hour=randint(0, 23),
        minute=randint(0, 59),
        second=randint(0, 59)
    )
    end = start + datetime.timedelta(seconds=randint(30, 1000))
    Event.objects.create(start=start, end=end)
I can get the event count per day like so:
(I know extra is bad, but I'm using 1.9 at the moment. When I upgrade I'll move to using TruncDate)
Event.objects.extra({'date': 'date(start)'}).order_by('date').values('date').annotate(count=Count('id'))
[{'count': 131, 'date': datetime.date(2016, 1, 1)},
{'count': 95, 'date': datetime.date(2016, 1, 2)},
{'count': 99, 'date': datetime.date(2016, 1, 3)},
{'count': 85, 'date': datetime.date(2016, 1, 4)},
{'count': 87, 'date': datetime.date(2016, 1, 5)},
{'count': 94, 'date': datetime.date(2016, 1, 6)},
{'count': 97, 'date': datetime.date(2016, 1, 7)},
{'count': 111, 'date': datetime.date(2016, 1, 8)},
{'count': 97, 'date': datetime.date(2016, 1, 9)},
{'count': 104, 'date': datetime.date(2016, 1, 10)}]
I can annotate to add the duration:
In [3]: Event.objects.annotate(duration=F('end') - F('start')).first().duration
Out[3]: datetime.timedelta(0, 470)
But I can't figure out how to sum this annotation the same way I can count events. I've tried the following but I get a KeyError on 'duration'.
Event.objects.annotate(duration=F('end') - F('start')).extra({'date': 'date(start)'}).order_by('date').values('date').annotate(total_duration=Sum('duration'))
And if I add duration to the values clause, it no longer groups by date.
Is this possible in a single query and without adding a duration field to the model?

I was about to write an answer saying that the Django ORM does not support this. Then I spent another hour on the problem (in addition to the 1.5 hours already spent before starting to write this answer), and as it turns out, Django does support it, and without hacking. Good news!
import datetime as dt
from django.db import models
from django.db.models import F, Sum, When, Case
from django.db.models.functions import TruncDate
from app.models import Event

a = Event.objects.annotate(date=TruncDate('start')).values('date').annotate(
    day_duration=Sum(Case(
        When(date=TruncDate(F('start')), then=F('end') - F('start')),
        default=dt.timedelta(), output_field=models.DurationField()
    ))
)
And some preliminary tests to (hopefully) prove that this stuff actually does what you asked.
In [71]: a = Event.objects.annotate(date=TruncDate('start')).values('date').annotate(day_duration=Sum(Case(
    ...:     When(date=TruncDate(F('start')), then=F('end') - F('start')),
    ...:     default=dt.timedelta(), output_field=models.DurationField()
    ...: ))
    ...: )
In [72]: for e in a:
    ...:     print(e)
    ...:
{'day_duration': datetime.timedelta(0, 41681), 'date': datetime.date(2016, 1, 10)}
{'day_duration': datetime.timedelta(0, 46881), 'date': datetime.date(2016, 1, 3)}
{'day_duration': datetime.timedelta(0, 48650), 'date': datetime.date(2016, 1, 1)}
{'day_duration': datetime.timedelta(0, 52689), 'date': datetime.date(2016, 1, 8)}
{'day_duration': datetime.timedelta(0, 45788), 'date': datetime.date(2016, 1, 5)}
{'day_duration': datetime.timedelta(0, 49418), 'date': datetime.date(2016, 1, 7)}
{'day_duration': datetime.timedelta(0, 45984), 'date': datetime.date(2016, 1, 9)}
{'day_duration': datetime.timedelta(0, 51841), 'date': datetime.date(2016, 1, 2)}
{'day_duration': datetime.timedelta(0, 63770), 'date': datetime.date(2016, 1, 4)}
{'day_duration': datetime.timedelta(0, 57205), 'date': datetime.date(2016, 1, 6)}
In [73]: q = dt.timedelta()
In [74]: o = Event.objects.filter(start__date=dt.date(2016, 1, 7))
In [75]: p = Event.objects.filter(start__date=dt.date(2016, 1, 10))
In [76]: for e in o:
    ...:     q += (e.end - e.start)
In [77]: q
Out[77]: datetime.timedelta(0, 49418) # Matches 2016.1.7, yay!
In [78]: q = dt.timedelta()
In [79]: for e in p:
    ...:     q += (e.end - e.start)
In [80]: q
Out[80]: datetime.timedelta(0, 41681) # Matches 2016.1.10, yay!
NB! This works from version 1.9; I don't think you can do this with earlier versions because the TruncDate function is missing. And before 1.8 you don't have Case and When either, of course.
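For comparison, the per-day totals that the query is meant to produce can be written out in plain Python, which is the same check as the manual loops above, generalized to all days. The events list is a hypothetical stand-in for Event rows, so the sketch runs without Django.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical (start, end) pairs standing in for Event rows.
events = [
    (dt.datetime(2016, 1, 7, 9, 0, 0), dt.datetime(2016, 1, 7, 9, 5, 0)),
    (dt.datetime(2016, 1, 7, 23, 0, 0), dt.datetime(2016, 1, 7, 23, 0, 30)),
    (dt.datetime(2016, 1, 10, 12, 0, 0), dt.datetime(2016, 1, 10, 12, 10, 0)),
]

# Group by the date of `start` and add up end - start per group.
# defaultdict(dt.timedelta) starts every day at a zero duration.
totals = defaultdict(dt.timedelta)
for start, end in events:
    totals[start.date()] += end - start

for day, total in sorted(totals.items()):
    print(day, total)
```

This pulls every row into Python, so it is only practical as a cross-check on small data; the annotated query keeps the summing in the database.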
