I have a table that looks like this:
date   car_crashes  city
01.01  1            Washington
01.02  4            Washington
01.03  0            Washington
01.04  2            Washington
01.05  0            Washington
01.06  3            Washington
01.07  4            Washington
01.08  1            Washington
01.01  0            Detroit
01.02  2            Detroit
01.03  4            Detroit
01.04  2            Detroit
01.05  0            Detroit
01.06  3            Detroit
01.07  1            Detroit
I want to know how many car crashes happened in the entire nation on each day, and I can do that with this:
from django.db.models import Sum

Model.objects.values("date") \
    .annotate(car_crashes=Sum('car_crashes')) \
    .values("date", "car_crashes")
Now, let's suppose I have an array like this:
weights = [
    {
        "city": "Washington",
        "weight": 1,
    },
    {
        "city": "Detroit",
        "weight": 2,
    }
]
This means that Detroit's car crashes should be multiplied by 2 before being aggregated with Washington's.
It can be done like this:
from django.db.models import Case, F, IntegerField, Sum, When

when_list = [When(city=w['city'], then=w['weight']) for w in weights]
case_params = {'default': 1, 'output_field': IntegerField()}

Model.objects.values('date') \
    .annotate(
        weighted_car_crashes=Sum(
            F('car_crashes') * Case(*when_list, **case_params)
        ))
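To make the weighting concrete, here is a small worked illustration derived from the sample table above (my own numbers, not part of the original post):
# Per date, this computes sum(car_crashes * weight), e.g. for 01.02:
#   4 * 1 (Washington) + 2 * 2 (Detroit) = 8
# and for 01.01:
#   1 * 1 (Washington) + 0 * 2 (Detroit) = 1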
However, this generates very slow SQL code, especially as more properties and a larger array are introduced.
Another solution which is way faster but still sub-optimal is using pandas:
import pandas as pd

aggregated = False
for weight in weights:
    ag = Model.objects.filter(city=weight['city']).values("date") \
        .annotate(car_crashes=Sum('car_crashes') * weight['weight']) \
        .values("date", "car_crashes")
    if aggregated is False:
        aggregated = ag
    else:
        aggregated = aggregated.union(ag)
aggregated = pd.DataFrame(aggregated)
if len(weights) > 1:
    aggregated = aggregated.groupby("date", as_index=False).sum()
This is faster, but still not as fast as what happens if, before calling pandas, I take the aggregated.query string and wrap it with a few lines of SQL:
SELECT "date", sum("car_crashes") FROM (
// String from Python
str(aggregated.query)
) as "foo" GROUP BY "date"
This works perfectly when pasted into my database SQL console. I could do this in Python/Django using .raw(), but the documentation says to ask before using .raw(), as almost anything can be accomplished with the ORM.
Yet, I don't see how. Once I call .union() on 2 querysets, I cannot aggregate further.
aggregated.union(ag).annotate(cc=Sum('car_crashes'))
gives
Cannot compute Sum('car_crashes'): 'car_crashes' is an aggregate
Is this possible to do with the Django ORM or should I use .raw()?
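For reference, the raw fallback mentioned above might look roughly like the following untested sketch; it uses a plain database cursor rather than .raw(), since the aggregated rows carry no primary key for .raw() to map onto model instances:
from django.db import connection

inner_sql = str(aggregated.query)  # same string used above; note it is not parameterized
sql = 'SELECT "date", SUM("car_crashes") FROM (' + inner_sql + ') AS "foo" GROUP BY "date"'
with connection.cursor() as cursor:
    cursor.execute(sql)
    rows = cursor.fetchall()  # [(date, weighted_total), ...]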
I'm new to Python and pandas, but I am trying to use pandas DataFrames to merge two dataframes based on regular expressions.
I have one dataframe with some 2 million rows. This table contains data about cars, but the model name is often specified in - let's say - a creative way, e.g. 'Audi A100', 'Audi 100', 'Audit 100 Quadro', or just 'A 100'. And the same for other brands. This is stored in a column called "Model". In a second column I have the manufacturer.
Index  Model        Manufacturer
0      A 100        Audi
1      A100 Quadro  Audi
2      Audi A 100   Audi
...    ...          ...
To clean up the data I created about 1000 regular expressions to search for some key words and stored them in a dataframe called 'regex'. In a second column of this table I save the manufacturer. This value is used in a second step to validate the result.
Index  RegEx        Manufacturer
0      .* A100 .*   Audi
1      .* A 100 .*  Audi
2      .* C240 .*   Mercedes
3      .* ID3 .*    Volkswagen
I hope you get the idea.
As far as I understood, the Pandas function "merge()" does not work with regular expressions. Therefore I use a loop to process the list of regular expressions, then use the "match" function to locate matching rows in the car DataFrame and assign the successfully used RegEx and the suggested manufacturer.
I added two additional columns to the cars table: 'RegEx' and 'Manufacturer'.
for index, row in regex.iterrows():
    cars.loc[cars['Model'].str.match(row['RegEx']), 'RegEx'] = row['RegEx']
    cars.loc[cars['Model'].str.match(row['RegEx']), 'Manufacturer'] = row['Manufacturer']
I learned 'iterrows' should not be used for performance reasons. It takes 8 minutes to finish the loop, which isn't too bad. However, is there a better way to get it done?
Kind regards
Jiriki
I have no idea if it would be faster (I'd be glad if you would test it), but it doesn't use iterrows():
regex.groupby(["RegEx", "Manufacturer"])["RegEx"]\
    .apply(lambda x: cars.loc[cars['Model'].str.match(x.iloc[0])])
EDIT: Code for reproduction:
import pandas as pd

cars = pd.DataFrame({"Model": ["A 100", "A100 Quatro", "Audi A 100", "Passat V", "Passat Gruz"],
                     "Manufacturer": ["Audi", "Audi", "Audi", "VW", "VW"]})
regex = pd.DataFrame({"RegEx": [".*A100.*", ".*A 100.*", ".*Passat.*"],
                      "Manufacturer": ["Audi", "Audi", "VW"]})
#Output:
#                                   Model  Manufacturer
#RegEx       Manufacturer
#.*A 100.*   Audi          0        A 100          Audi
#                          2   Audi A 100          Audi
#.*A100.*    Audi          1  A100 Quatro          Audi
#.*Passat.*  VW            3     Passat V            VW
#                          4  Passat Gruz            VW
I am working on an investment app in Django which requires calculating portfolio balances and values over time. The database is currently set up this way:
from django.db import models

class Ledger(models.Model):
    asset = models.ForeignKey('Asset', ....)
    amount = models.FloatField(...)
    date = models.DateTimeField(...)
    ...

class HistoricalPrices(models.Model):
    asset = models.ForeignKey('Asset', ....)
    price = models.FloatField(...)
    date = models.DateTimeField(...)
Users enter transactions in the Ledger, and I update prices through APIs.
To calculate the balance for a day (note multiple Ledger entries for the same asset can happen on the same day):
def balance_date(date):
    return Ledger.objects.filter(date__date__lte=date).values('asset').annotate(total_amount=Sum('amount'))
Trying to then get values for every day between the date of the first Ledger entry and today becomes more challenging. Currently I am doing it this way - assuming a start_date and end_date that are datetime.date() and tr_dates, a list of unique dates on which transactions did occur (to avoid calculating balances on days where nothing happened):
import pandas as pd
idx = pd.date_range(start_date, end_date)
main_df = pd.DataFrame(index=tr_dates)
main_df['date_send'] = main_df.index
main_df['balances'] = main_df['date_send'].apply(lambda x: balance_date(x))
main_df = main_df.sort_index()
main_df.index = pd.DatetimeIndex(main_df.index)
main_df = main_df.reindex(idx, method='ffill')
This works but my issue is performance. It takes at least 150-200ms to run this, and then I need to get the prices for each date (all of them, not just transaction dates) and somehow match and multiply by the correct balances, which makes the run time about 800 ms or more.
Given this is a web app the view taking 800ms at minimum to calculate makes it hardly scalable, so I was wondering if anyone had a better way to do this?
EDIT - Simple example of expected input / output
Ledger entries (JSON format):
[
{
"asset":"asset_1",
"amount": 10,
"date": "2015-01-01"
},
{
"asset":"asset_2",
"amount": 15,
"date": "2017-10-15"
},
{
"asset":"asset_1",
"amount": -5,
"date": "2018-02-09"
},
{
"asset":"asset_1",
"amount": 20,
"date": "2019-10-10"
},
{
"asset":"asset_2",
"amount": 3,
"date": "2019-10-10"
}
]
Sample Price from Historical Prices:
[
{
"date": "2015-01-01",
"asset": "asset_1",
"price": 5
},
{
"date": "2015-01-01",
"asset": "asset_2",
"price": 15
},
{
"date": "2015-01-02",
"asset": "asset_1",
"price": 6
},
{
"date": "2015-01-02",
"asset": "asset_2",
"price": 11
},
...
{
"date": "2017-10-15",
"asset": "asset_1",
"price": 20
},
{
"date": "2017-10-15",
"asset": "asset_2",
"price": 30
}
]
In this case:
tr_dates is ['2015-01-01', '2017-10-15', '2018-02-09', '2019-10-10']
date_range is ['2015-01-01', '2015-01-02', '2015-01-03', ..., '2019-12-14', '2019-12-15']
Final output I am after: Balances by date with price by date and total value by date
date        asset    balance  price  value
2015-01-01  asset_1  10       5      50
2015-01-01  asset_2  0        10     0
.... balances do not change as there are no new Ledger entries, but prices change
2015-01-02  asset_1  10       6      60
2015-01-02  asset_2  0        11     0
.... all dates between 2015-01-02 and 2017-10-15 (no change in balance but change in price)
2017-10-15  asset_1  10       20     200
2017-10-15  asset_2  15       30     450
... dates in between
2018-02-09  asset_1  5        ..     etc based on price
2018-02-09  asset_2  15       ..     etc based on price
... dates in between
2019-10-10  asset_1  25       ..     etc based on price
2019-10-10  asset_2  18       ..     etc based on price
... goes until the end of date_range
I have managed to get this working but takes about a second to compute and I ideally need this to be at least 10x faster if possible.
EDIT 2: Following ac2001's method:
from django.db.models import F, Sum, Window

ledger = (Ledger
          .transaction
          .filter(portfolio=p)
          .annotate(transaction_date=F('date__date'))
          .annotate(transaction_amount=Window(expression=Sum('amount'),
                                              order_by=[F('asset').asc(), F('date').asc()],
                                              partition_by=[F('asset')]))
          .values('asset', 'transaction_date', 'transaction_amount'))
df = pd.DataFrame(list(ledger))
df.transaction_date = pd.to_datetime(df.transaction_date).dt.date
df.set_index('transaction_date', inplace=True)
df.sort_index(inplace=True)
df = df.groupby(by=['asset', 'transaction_date']).sum()
yields the following dataframe (with multiindex):
                           transaction_amount
asset    transaction_date
asset_1  2015-01-01                      10.0
         2018-02-09                       5.0
         2019-10-10                      25.0
asset_2  2017-10-15                      15.0
         2019-10-10                      18.0
These balances are correct (and also yield correct results on more complex data), but now I need to find a way to ffill these results to all dates in between, as well as from the last date (2019-10-10) to today (2019-12-15), and I am not sure how that works given the multi-index.
Final solution
Thanks to @ac2001's code and pointers I have come up with the following:
ledger = (Ledger
          .objects
          .annotate(transaction_date=F('date__date'))
          .annotate(transaction_amount=Window(expression=Sum('amount'),
                                              order_by=[F('asset').asc(), F('date').asc()],
                                              partition_by=[F('asset')]))
          .values('asset', 'transaction_date', 'transaction_amount'))
df = pd.DataFrame(list(ledger))
df.transaction_date = pd.to_datetime(df.transaction_date)
df.set_index('transaction_date', inplace=True)
df.sort_index(inplace=True)
df['date_cast'] = pd.to_datetime(df.index).dt.date
df_grouped = df.groupby(by=['asset', 'date_cast']).last()
df_unstacked = df_grouped.unstack(['asset'])
df_unstacked.index = pd.DatetimeIndex(df_unstacked.index)
df_unstacked = df_unstacked.reindex(idx)
df_unstacked = df_unstacked.ffill()
This gives me a matrix of asset by dates. I then get a matrix of prices by dates (from database) and multiply the two matrices.
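Roughly, that last step looks like the following sketch (field and variable names are assumed rather than copied from the exact code, and it builds on idx and df_unstacked from above):
# Price matrix: one row per date in idx, one column per asset (assumed field names)
prices = (HistoricalPrices
          .objects
          .annotate(price_date=F('date__date'))
          .values('asset', 'price_date', 'price'))
prices_df = pd.DataFrame(list(prices))
prices_df = prices_df.pivot_table(index='price_date', columns='asset',
                                  values='price', aggfunc='last')
prices_df.index = pd.DatetimeIndex(prices_df.index)
prices_df = prices_df.reindex(idx).ffill()

# df_unstacked keeps 'transaction_amount' as the outer column level after unstack(),
# so selecting it yields a balance matrix with one column per asset, aligned on idx.
balances_df = df_unstacked['transaction_amount']
values_df = balances_df * prices_df  # element-wise: value = balance * price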
Thanks
I think this might take some back and forth. I think the best approach is to do this in a couple of steps.
Let's start with getting asset balances daily and then we will merge the prices together. The transaction amount is a cumulative total. Does this look correct? I don't have your data so it is a little difficult for me to tell.
ledger = (Ledger
          .objects
          .annotate(transaction_date=F('date__date'))
          .annotate(transaction_amount=Window(expression=Sum('amount'),
                                              order_by=[F('asset').asc(), F('date').asc()],
                                              partition_by=[F('asset')]))
          .values('asset', 'transaction_date', 'transaction_amount'))
df = pd.DataFrame(list(ledger))
df.transaction_date = pd.to_datetime(df.transaction_date)
df.set_index('transaction_date', inplace=True)
df.groupby('asset').resample('D').ffill()
df = df.reset_index()  # <-- added this line here
--- edit below ---
Then create a dataframe from HistoricalPrices and merge it with the ledger. You might have to adjust the merge criteria to ensure you are getting what you want, but I think this is the correct path.
# edit
ledger = df
prices = (HistoricalPrices
          .objects
          .annotate(transaction_date=F('date__date'))
          .values('asset', 'price', 'transaction_date'))
prices = pd.DataFrame(list(prices))
result = ledger.merge(prices, how='left', on=['asset', 'transaction_date'])
Depending on how you are using the data later, if you need a list of dicts (a convenient format for Django templates), you can do that conversion with df.to_dict(orient='records').
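For example (a minimal illustration; the variable and context key names are placeholders, not from the original answer):
# pass the merged rows to a template as plain dicts
context = {'rows': result.to_dict(orient='records')}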
If you want to group your Ledgers by date, then calculate the daily asset amount:
Ledger.objects.values('date__date').annotate(total_amount=Sum('amount'))
this should help (edit: fix typo)
second edit: assuming you want to group them by asset as well:
Ledger.objects.values('date__date', 'asset').annotate(total_amount=Sum('amount'))
I have a jsonb structure on postgres named data where each row (there are around 3 million of them) looks like this:
[
{
"number": 100,
"key": "this-is-your-key",
"listr": "20 Purple block, THE-CITY, Columbia",
"realcode": "LA40",
"ainfo": {
"city": "THE-CITY",
"county": "Columbia",
"street": "20 Purple block",
"var_1": ""
},
"booleanval": true,
"min_address": "20 Purple block, THE-CITY, Columbia LA40"
},
.....
]
I would like to query the min_address field in the fastest possible way. In Django I tried to use:
APModel.objects.filter(data__0__min_address__icontains=search_term)
but this takes ages to complete (also, "THE-CITY" is in uppercase, so I have to use icontains here). I tried dropping to raw SQL like so:
cursor.execute("""\
SELECT * FROM "apmodel_ap_model"
WHERE ("apmodel_ap_model"."data"
#>> array['0', 'min_address'])
#> %s \
""",\
[json.dumps([{'min_address': search_term}])]
)
but this throws me strange errors like:
LINE 4: #> '[{"min_address": "some lane"}]'
        ^
HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.
I am wondering what the fastest way is to query the min_address field using raw SQL cursors.
Late answer, probably it won't help OP anymore. Also I'm not at all an expert in Postgres/JSONB, so this might be a terrible idea.
Given this setup:
so49263641=# \d apmodel_ap_model;
Table "public.apmodel_ap_model"
 Column | Type  | Collation | Nullable | Default
--------+-------+-----------+----------+---------
 data   | jsonb |           |          |
so49263641=# select * from apmodel_ap_model ;
data
-------------------------------------------------------------------------------------------
[{"number": 1, "min_address": "Columbia"}, {"number": 2, "min_address": "colorado"}]
[{"number": 3, "min_address": " columbia "}, {"number": 4, "min_address": "California"}]
(2 rows)
The following query "expands" objects from data arrays to individual rows. Then it applies pattern matching to the min_address field.
so49263641=# SELECT element->'number' as number, element->'min_address' as min_address
FROM apmodel_ap_model ap, JSONB_ARRAY_ELEMENTS(ap.data) element
WHERE element->>'min_address' ILIKE '%col%';
 number | min_address
--------+---------------
      1 | "Columbia"
      2 | "colorado"
      3 | " columbia "
(3 rows)
However, I doubt it will perform well on large datasets, as the min_address values are cast to text before pattern matching.
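For completeness, a rough, untested sketch of running the same query from Django through a raw cursor, with the search term parameterized (table and field names as in the question):
from django.db import connection

def search_min_address(search_term):
    sql = """
        SELECT element->'number' AS number, element->'min_address' AS min_address
        FROM apmodel_ap_model ap, JSONB_ARRAY_ELEMENTS(ap.data) element
        WHERE element->>'min_address' ILIKE %s
    """
    with connection.cursor() as cursor:
        cursor.execute(sql, ['%' + search_term + '%'])
        return cursor.fetchall()  # list of (number, min_address) tuples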
Edit: Some great advice here on indexing JSONB data for search https://stackoverflow.com/a/33028467/1284043
I am trying to automate 100 Google searches (one per individual string in a row, returning URLs for each query) on a specific column in a CSV (via Python 2.7); however, I am unable to get pandas to pass the row contents to the Google Search automator.
*GoogleSearch source = https://breakingcode.wordpress.com/2010/06/29/google-search-python/
Overall, I can print URLs successfully for a query when I utilize the following code:
from google import search

query = "apples"
for url in search(query, stop=5, pause=2.0):
    print(url)
However, when I add pandas (to read each "query"), the rows are not read and queried as intended, i.e. "data.irow(n)" is being queried instead of the row contents, one at a time.
from google import search
import pandas as pd
from pandas import DataFrame

query_performed = 0
querying = True
query = 'data.irow(n)'

# read the excel file at column 2 (i.e. "Fruit")
df = pd.read_csv('C:\Users\Desktop\query_results.csv', header=0, sep=',', index_col='Fruit')

# need to specify "Column2" and one "data.irow(n)" queried at a time
while querying:
    if query_performed <= 100:
        print("query")
        query_performed += 1
    else:
        querying = False
        print("Asked all 100 query's")

# prints initial urls for each "query" in a google search
for url in search(query, stop=5, pause=2.0):
    print(url)
Incorrect output I receive at the command line:
query
Asked all 100 query's
query
Asked all 100 query's
Asked all 100 query's
http://www.irondata.com/
http://www.irondata.com/careers
http://transportation.irondata.com/
http://www.irondata.com/about
http://www.irondata.com/public-sector/regulatory/products/versa
http://www.irondata.com/contact-us
http://www.irondata.com/public-sector/regulatory/products/cavu
https://www.linkedin.com/company/iron-data-solutions
http://www.glassdoor.com/Reviews/Iron-Data-Reviews-E332311.htm
https://www.facebook.com/IronData
http://www.bloomberg.com/research/stocks/private/snapshot.asp?privcapId=35267805
http://www.indeed.com/cmp/Iron-Data
http://www.ironmountain.com/Services/Data-Centers.aspx
FYI: My Excel .CSV format is the following:
    B
1   Fruit
2   apples
3   oranges
4   mangos
5   mangos
6   mangos
...
101 mangos
Any advice on next steps is greatly appreciated! Thanks in advance!
Here's what I got. Like I mentioned in my comment, I couldn't get the stop parameter to work like I thought it should. Maybe I'm misunderstanding how it's used. I'm assuming you only want the first 5 URLs per search.
a sample df
d = {"B" : ["mangos", "oranges", "apples"]}
df = pd.DataFrame(d)
Then
stop = 5
urlcols = ["C", "D", "E", "F", "G"]
# Here I'm using an apply() to call the google search for each 'row'
# and a list is built from the urls returned by search()
df[urlcols] = df["B"].apply(lambda fruit: pd.Series([url for url in
                            search(fruit, stop=stop, pause=2.0)][:stop]))  # get 5 by slicing
which gives you the following (formatting is a bit rough on this):
B C D E F G
0 mangos http://en.wikipedia.org/wiki/Mango http://en.wikipedia.org/wiki/Mango_(disambigua... http://en.wikipedia.org/wiki/Mangifera http://en.wikipedia.org/wiki/Mangifera_indica http://en.wikipedia.org/wiki/Purple_mangosteen
1 oranges http://en.wikipedia.org/wiki/Orange_(fruit) http://en.wikipedia.org/wiki/Bitter_orange http://en.wikipedia.org/wiki/Valencia_orange http://en.wikipedia.org/wiki/Rutaceae http://en.wikipedia.org/wiki/Cherry_Orange
2 apples https://www.apple.com/ http://desmoines.citysearch.com/review/692986920 http://local.yahoo.com/info-28919583-apple-sto... http://www.judysbook.com/Apple-Store-BtoB~Cell... https://tr.foursquare.com/v/apple-store/4b466b...
If you'd rather not specify the columns (i.e. ["C", "D", ...]) you could do the following.
df.join(df["B"].apply(lambda fruit: pd.Series([url for url in
        search(fruit, stop=stop, pause=2.0)][:stop])))
I know Mongoid 4 is still in beta and maybe I've found a bug, but I'm having a hard time understanding why the first query works and the second one returns nothing:
Product.or({sender_uid: params[:user_id]}, {receiver_uid: params[:user_id]})
Product.where({sender_uid: params[:user_id]}).or({receiver_uid: params[:user_id]})
It sort of makes it hard to compose any complex queries, so any pointers would be appreciated.
See the following example:
Product 1: sender_uid = 1, receiver_uid = 2
Product 2: sender_uid = 2, receiver_uid = 1
Product 3: sender_uid = 1, receiver_uid = 2
params[:user_id] = 1
In the first query what you are getting is ALL the products where the sender_uid OR the receiver_uid is equal to 1. That is Product 1, 2 and 3.
In the second query you are querying all products where the sender_uid is 1, that is Product 1 and Product 3, and then (on that criteria) the products with receiver_uid = 1. Neither Product 1 nor Product 3 has a receiver with uid 1. So that's why you're getting nothing. What you are doing in the second query is something like:
Product.where(sender_uid: params[:user_id]).where(receiver_uid: params[:user_id])
UPDATE:
Answering to a comment:
Product.or({ product_id: 1 }, { product_id: 2, sender_uid: 2 })
As you can see, the or method receives two Hashes of conditions. Each one is like a where query.