To extract numbers with more than 4 digits from a dataframe column using regex, I have these lines:
import pandas as pd
data = {'Company': ["0652369- INTER SUPPORT LLP, 202011",
"CIRCLE TRADING LTD 1-593616, 2020-06, 0201",
"Area Food Service Co., Ltd.-6958047, 2020-07"]}
df = pd.DataFrame(data)
df['co'] = df['Company'].str.extract('(\d+).{5,}')
print(df['co'])
Output:
0 0652369
1 1
2 6958047
It doesn't work correctly for the second line, which should return '593616'.
What's the right way to write it? Thank you.
Try extracting (\d{5,}), i.e. five or more digits. The original pattern (\d+).{5,} lets the group match a single digit as long as at least five more characters follow it, which is why the second row returns '1':
df['co'] = df['Company'].str.extract(r'(\d{5,})')
# Company co
# 0 0652369- INTER SUPPORT LLP, 202011 0652369
# 1 CIRCLE TRADING LTD 1-593616, 2020-06, 0201 593616
# 2 Area Food Service Co., Ltd.-6958047, 2020-07 6958047
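For reference, a minimal self-contained version. Passing expand=False makes str.extract return a Series instead of a single-column DataFrame, and the raw string avoids escape-sequence warnings in newer Python:

import pandas as pd

data = {'Company': ["0652369- INTER SUPPORT LLP, 202011",
                    "CIRCLE TRADING LTD 1-593616, 2020-06, 0201",
                    "Area Food Service Co., Ltd.-6958047, 2020-07"]}
df = pd.DataFrame(data)

# Capture the first run of five or more consecutive digits.
df['co'] = df['Company'].str.extract(r'(\d{5,})', expand=False)
print(df['co'])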
My original data looks like this:
SUBBASIN HRU HRU_SLP OV_N
1 1 0.016155144 0.15
1 2 0.015563287 0.14
2 1 0.010589782 0.15
2 2 0.011574839 0.14
3 1 0.013865396 0.15
3 2 0.01744597 0.15
3 3 0.018983217 0.14
3 4 0.013890315 0.05
3 5 0.011792533 0.05
I need to modify the value of OV_N for each SUBBASIN number:
hru = pd.read_csv('hru.csv')
for i in hru.OV_N:
    hru.ix[hru.SUBBASIN.isin([76,65,64,72,81,84,60,46,37,1,2]), 'OV_N'] = i*(1+df21.value[12])
    hru.ix[hru.SUBBASIN.isin([80,74,75,66,55,53,57,63,61,41,38,27,26,45,40,34,35,31,33,21,20,17,18,19,23,14,13,8,7,11,6,4,3,5,12]), 'OV_N'] = i*(1+df23.value[12])
    hru.ix[hru.SUBBASIN.isin([85,58,78,54,59,51,52,30,28,16,15,77,79,71,70,86,73,68,69,56,67,62,82,87,83,91,89,90,43,36,39,47,32,49,42,48,50,49,29,22,24,25,9,10]), 'OV_N'] = i*(1+df56.value[12])
    hru.ix[hru.SUBBASIN.isin([92,88,95,94,93]), 'OV_N'] = i*(1+df58.value[12])
where df21.value[12] is a value from a txt file
The code results in an infinite value of OV_N for all subbasins, so I assume the loop is applying the update multiple times, but I can't find the mistake, and this code worked before with different numbers of subbasins.
It is generally better not to loop and index over rows in a pandas DataFrame. Transforming the DataFrame with column operations is the more idiomatic pandas approach. A pandas DataFrame can be thought of as a zipped combination of pandas Series: each column is its own pandas Series, all sharing the same index. Operations can be applied to one or more pandas Series to create a new Series that shares the same index. Operations can also combine a Series with a one-dimensional numpy array to create a new Series. It is helpful to understand pandas indexing; however, this answer will just use sequential integer indexing.
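As a minimal illustration of such a column operation, with two toy Series:

import pandas as pd

s = pd.Series([0.15, 0.14, 0.15])
t = pd.Series([21, 23, 21])
# Element-wise arithmetic aligned on the shared index; no explicit loop.
u = s * (1 + t)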
To modify the value of OV_N for each SUBBASIN number:
Initialize the hru DataFrame by reading it in from hru.csv as in the original question. Here we initialize it with the data given in the question.
import numpy as np
import pandas as pd
hru = pd.DataFrame({
    'SUBBASIN': [1, 1, 2, 2, 3, 3, 3, 3, 3],
    'HRU': [1, 2, 1, 2, 1, 2, 3, 4, 5],
    'HRU_SLP': [0.016155144, 0.015563287, 0.010589782, 0.011574839, 0.013865396,
                0.01744597, 0.018983217, 0.013890315, 0.011792533],
    'OV_N': [0.15, 0.14, 0.15, 0.14, 0.15, 0.15, 0.14, 0.05, 0.05]})
Create one separate pandas Series that gathers and stores all the values from the various DataFrames, i.e. df21, df23, df56, and df58, in one place. This will be used to look up values by index. Let's call it subbasin_multiplier_ds. Let's assume values of 21, 23, 56, and 58, respectively, were read from the txt file; do replace these with the real values read in from the txt file.
subbasin_multiplier_ds = pd.Series([21]*96)
subbasin_multiplier_ds[[80,74,75,66,55,53,57,63,61,41,38,27,26,45,40,
    34,35,31,33,21,20,17,18,19,23,14,13,8,7,11,6,4,3,5,12]] = 23
subbasin_multiplier_ds[[85,58,78,54,59,51,52,30,28,16,15,77,79,71,70,
    86,73,68,69,56,67,62,82,87,83,91,89,90,43,36,39,47,32,49,42,48,50,
    49,29,22,24,25,9,10]] = 56
subbasin_multiplier_ds[[92,88,95,94,93]] = 58
Replace OV_N in hru DataFrame based on columns in the DataFrame and a lookup in subbasin_multiplier_ds by index.
hru['OV_N'] = hru['OV_N'] * (1 + subbasin_multiplier_ds[hru['SUBBASIN']].values)
A numpy array is created by .values above so that the multiplication aligns by position rather than by index, giving the expected results. If you want to see why this matters, try removing .values and observe what happens.
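An equivalent sketch uses Series.map with a plain dict keyed by subbasin; the mapped result shares hru's index, so no .values conversion is needed. The multiplier values are the same assumed placeholders as above:

# Dict lookup per subbasin: default of 21 for every subbasin, then
# override the groups, mirroring the Series construction above.
multipliers = dict.fromkeys(range(96), 21)
multipliers.update(dict.fromkeys([92,88,95,94,93], 58))
# ...update the 23 and 56 groups with their full subbasin lists as above...
hru['OV_N'] = hru['OV_N'] * (1 + hru['SUBBASIN'].map(multipliers))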
I have 3 models in a Django project:
from datetime import date
from django.db import models

class Hardware(models.Model):
    inventory_number = models.IntegerField(unique=True)

class Subdivision(models.Model):
    name = models.CharField(max_length=50)

class Relocation(models.Model):
    hardware = models.ForeignKey('Hardware', on_delete=models.CASCADE)
    subdivision = models.ForeignKey('Subdivision', on_delete=models.CASCADE)
    # Pass the callable, not date.today(), so the default is evaluated
    # on each save rather than once at import time.
    relocation_date = models.DateField(verbose_name='Relocation Date', default=date.today)
Table 'Hardware_Relocation' with example values:
id hardware subdivision relocation_date
1 1 1 01.01.2009
2 1 2 01.01.2010
3 1 1 01.01.2011
4 1 3 01.01.2012
5 1 3 01.01.2013
6 1 3 01.01.2014
7 1 3 01.01.2015 # Now hardware 1 located in subdivision 3 because relocation_date is max
I would like to write a filter to find the hardware located in a given subdivision today.
I'm trying to write a filter:
subdivision = Subdivision.objects.get(pk=1)
hardware_list = Hardware.objects.annotate(relocation__relocation_date=Max('relocation__relocation_date')).filter(relocation__subdivision = subdivision)
Now hardware_list contains hardware 1, but that is wrong (because hardware 1 is now in subdivision 3).
hardware_list should be empty in this example.
The following code also gives the wrong result (hardware_list contains hardware 1 for subdivision 1):
limit_date = datetime.datetime.now()
q1 = Hardware.objects.filter(relocation__subdivision=subdivision, relocation__relocation_date__lte=limit_date)
q2 = q1.exclude(~Q(relocation__relocation_date__gt=F('relocation__relocation_date')), ~Q(relocation__subdivision=subdivision))
hardware_list = q2.distinct()
Maybe it would be better to use raw SQL?
This might work...
from django.db.models import F, Q
hardware_list = (
    Hardware.objects
    .filter(relocation__subdivision=target_subdivision,
            relocation__relocation_date__lte=limit_date)
    .exclude(~Q(relocation__subdivision=target_subdivision),
             relocation__relocation_date__gt=F('relocation__relocation_date'))
    .distinct()
)
The idea is: give me all hardware that has been relocated to the target subdivision before the limit date and that has NOT been relocated to another subdivision after that.
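If the F() comparison above doesn't behave as intended (inside a single exclude() it compares a relocation's date with itself), an alternative sketch uses Subquery/OuterRef (available since Django 1.11) to look up each hardware's latest relocation and filter on its subdivision; target_subdivision and limit_date are assumed to be defined as in the question:

from django.db.models import OuterRef, Subquery

# The most recent relocation per hardware, up to the limit date.
latest = (Relocation.objects
          .filter(hardware=OuterRef('pk'), relocation_date__lte=limit_date)
          .order_by('-relocation_date'))

hardware_list = (Hardware.objects
                 .annotate(current_subdivision=Subquery(latest.values('subdivision')[:1]))
                 .filter(current_subdivision=target_subdivision.pk))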
My goal is to compute the KL distance between the following text documents:
1)The boy is having a lad relationship
2)The boy is having a boy relationship
3)It is a lovely day in NY
I first vectorised the documents in order to easily apply numpy:
1)[1,1,1,1,1,1,1]
2)[1,2,1,1,1,2,1]
3)[1,1,1,1,1,1,1]
I then applied the following code for computing KL distance between the texts:
import numpy as np

v = [[1,1,1,1,1,1,1], [1,2,1,1,1,2,1], [1,1,1,1,1,1,1]]
c = v[0]

def kl(p, q):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(np.where(p != 0, (p - q) * np.log10(p / q), 0))

for x in v:
    KL = kl(x, c)
    print(KL)
Here is the result of the above code: 0.0, 0.602059991328, 0.0.
Texts 1 and 3 are completely different, but the distance between them is 0, while texts 1 and 2, which are highly related, have a distance of 0.602059991328. This isn't accurate.
Does anyone have an idea of what I'm not doing right with regards to KL? Many thanks for your suggestions.
Though I hate to add another answer, there are two points here. First, as Jaime pointed out in the comments, KL divergence (or distance; according to the documentation below, they are the same) is designed to measure the difference between probability distributions. This basically means that what you pass to the function should be two array-likes, the elements of each of which sum to 1.
Second, scipy apparently does implement this, with a naming scheme more related to the field of information theory. The function is "entropy":
scipy.stats.entropy(pk, qk=None, base=None)
http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.entropy.html
From the docs:
If qk is not None, then compute a relative entropy (also known as
Kullback-Leibler divergence or Kullback-Leibler distance) S = sum(pk *
log(pk / qk), axis=0).
A bonus of this function is that it will normalize the vectors you pass it if they do not sum to 1 (though this means you have to be careful with the arrays you pass, i.e., how they are constructed from data).
Hope this helps, and at least a library provides it, so you don't have to code your own.
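As a quick sketch with the question's vectors (entropy normalizes each input to a probability distribution before computing the divergence):

from scipy.stats import entropy

p = [1, 1, 1, 1, 1, 1, 1]
q = [1, 2, 1, 1, 1, 2, 1]
# Relative entropy S = sum(pk * log(pk / qk)) after normalization.
print(entropy(p, q))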
After a bit of googling to understand the KL concept, I think that your problem is due to the vectorization: you're comparing the number of appearances of different words. You should either link each column index to one word, or use a dictionary:
# The boy is having a lad relationship It lovely day in NY
1)[1 1 1 1 1 1 1 0 0 0 0 0]
2)[1 2 1 1 1 0 1 0 0 0 0 0]
3)[0 0 1 0 1 0 0 1 1 1 1 1]
Then you can use your kl function.
To automatically vectorize to a dictionary, see How to count the frequency of the elements in a list? (collections.Counter is exactly what you need). Then you can loop over the union of the keys of the dictionaries to compute the KL distance.
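A minimal sketch of that vectorization with collections.Counter, assuming whitespace tokenization is enough for these documents:

from collections import Counter

docs = ["The boy is having a lad relationship",
        "The boy is having a boy relationship",
        "It is a lovely day in NY"]
counts = [Counter(d.split()) for d in docs]

# One column per word in the union of all vocabularies, in a fixed order.
vocab = sorted(set().union(*counts))
vectors = [[c[w] for w in vocab] for c in counts]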
A potential issue might be in your numpy definition of KL. Read the Wikipedia page for the formula: http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
Note that you multiply (p - q) by the log result. According to the KL formula, this factor should only be p:
return np.sum(np.where(p != 0, p * np.log10(p / q), 0))
That may help...