BaseQuery.paginate() returns inconsistent results - python-2.7

When I manually execute this query in MySQL, the number of results I get is 28:
SELECT *
FROM position_reporting_structures prs
INNER JOIN positions AS pos ON pos.id = prs.reports_to
INNER JOIN jobs ON jobs.id = pos.job_id
INNER JOIN position_fulfillments AS pf ON pf.position_id = pos.id
INNER JOIN parties ON parties.id = pf.party_id
WHERE jobs.title LIKE '%QA/QC%' OR parties.name LIKE '%QA/QC%';
This is the SQLAlchemy equivalent of the query above:
query = (
    PositionReportingStructure.query
    .join(Position,
          Position.id == PositionReportingStructure.reports_to)
    .join(Job,
          Job.id == Position.job_id)
    .join(PositionFulfillment,
          PositionFulfillment.position_id == Position.id)
    .join(Party,
          Party.id == PositionFulfillment.party_id)
    .filter(db.or_(
        Job.title.like('%QA/QC%'),
        Party.name.like('%QA/QC%')))
)
I have a pagination decorator that wraps a function whose return value is a BaseQuery. Inside the decorator, this is the code I use:
query = f(*args, **kwargs)
page = 1  # just a dummy value
per_page = 5  # just a dummy value
paginate = query.paginate(page=page, per_page=per_page)
I'm expecting it to return roughly 6 pages and 28 items in total, but it doesn't. The outcome is 1 page, 4 items per page, 4 total items and no previous/next pages, which is incorrect. To investigate further, I tried changing the value of the page variable and compared the results:
page = 1 : 1 page, 4 items, 4 total items
page = 2 : 6 pages, 3 items, 28 total items
page = 3 : 6 pages, 2 items, 28 total items
page = 4 : 6 pages, 3 items, 28 total items
page = 5 : 6 pages, 5 items, 28 total items
page = 6 : 6 pages, 3 items, 28 total items
As you may notice, the per-page item counts are inconsistent. Can anybody explain what causes this? I am currently using Flask-SQLAlchemy 2.0.

I have since found the root cause of this problem. My query returns duplicate rows, and SQLAlchemy automatically deduplicates identical entities when it maps rows back to model objects (even without GROUP BY or DISTINCT in the SQL).
As stated in my question, I passed a count of 5 items in the per_page parameter of the paginate function. On the first page only 4 items come back, because the duplicates are collapsed, which satisfies the first branch of paginate's conditional statement (shown below):
if page == 1 and len(items) < per_page:
    total = len(items)
else:
    total = self.order_by(None).count()
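
One way to make the counts consistent is to deduplicate at the SQL level as well, so the database and SQLAlchemy agree on the number of rows. A minimal sketch, assuming the same query and decorator as above (adding DISTINCT is my suggestion, not something from the original code):

# Sketch: force deduplication in SQL so paginate's totals line up
# with the deduplicated entities SQLAlchemy actually returns.
query = query.distinct()
paginate = query.paginate(page=page, per_page=per_page)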

Related

Planning subsequent orders

Let's say I have 5 orders and 3 drivers. I want to maximize the amount of miles they have on the road. Each driver has times that they're available to drive and orders have times that they're able to be picked up at.
Ideally, I would like to be able to plan subsequent orders in one go, rather than writing multiple models. My current iteration writes multiple models that produce output, and subsequent models take those outputs as inputs. How can you write this as a single LP model?
O = {Order1, Order2, Order3, Order4, Order5}
D = {Driver1, Driver2, Driver3}
O_avail = {2pm, 3pm, 2:30pm, 8pm, 9pm, 12am}
D_avail = {2pm, 3pm, 2:30pm}
Time_to_depot = {7 hours, 5 hours, 2 hours, 5 hours, 3 hours, 4 hours}
Constraints:
d_avail <= o_avail
Objective function:
max sum D_i * time_to_depot_i
I laid it out in such a way that driver 1 takes order 1, order 5 and order 6. Driver 2 takes order 2 and order 4.
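
A single round of assignments can be expressed as one model with binary pick variables. Below is a minimal sketch using PuLP (my choice of library, not from the question); the availability values are illustrative encodings of the times above as hours on a 24-hour clock. Truly sequencing subsequent orders per driver would need extra time-tracking variables, which this sketch omits:

import pulp

orders = ["Order1", "Order2", "Order3", "Order4", "Order5"]
drivers = ["Driver1", "Driver2", "Driver3"]
# availability encoded as hours on a 24-hour clock (illustrative values)
o_avail = {"Order1": 14.0, "Order2": 15.0, "Order3": 14.5, "Order4": 20.0, "Order5": 21.0}
d_avail = {"Driver1": 14.0, "Driver2": 15.0, "Driver3": 14.5}
time_to_depot = {"Order1": 7, "Order2": 5, "Order3": 2, "Order4": 5, "Order5": 3}

model = pulp.LpProblem("order_assignment", pulp.LpMaximize)

# x[d, o] == 1 when driver d takes order o
x = pulp.LpVariable.dicts("x", [(d, o) for d in drivers for o in orders], cat="Binary")

# objective: maximize total time on the road
model += pulp.lpSum(x[d, o] * time_to_depot[o] for d in drivers for o in orders)

# each order goes to at most one driver
for o in orders:
    model += pulp.lpSum(x[d, o] for d in drivers) <= 1

# d_avail <= o_avail: a driver can only take orders that become available
# at or after the time the driver becomes available
for d in drivers:
    for o in orders:
        if d_avail[d] > o_avail[o]:
            model += x[d, o] == 0

model.solve()
for d in drivers:
    print(d, [o for o in orders if x[d, o].value() == 1])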

DynamoDB Client.Scan() is not returning the LastEvaluatedKey parameter

I have a table with 10k rows.
I'm trying to parse them with Python to change a small thing inside one attribute of each row, so I'm using client.scan(), taking batches of 10 rows and passing the LastEvaluatedKey parameter to the next scan().
The problem is that after 40 rows, scan() stops returning LastEvaluatedKey, as if the table were only 40 rows long.
I've noticed that when launching the same script against another table three times bigger, the stop happens at 120 rows (also three times bigger).
The table has On-Demand capacity.
Any idea about this?
client = boto3.client('dynamodb')
resource = boto3.resource('dynamodb')
table = resource.Table(table_name)

remaining = 3961
iteration = 0
limit = 10

while remaining > 0:
    # retrieve Limit
    if iteration == 0:
        response = client.scan(
            TableName=table_name,
            Limit=limit,
            Select='ALL_ATTRIBUTES',
            ReturnConsumedCapacity='TOTAL',
            TotalSegments=123,
            Segment=122,
        )
        key = response["LastEvaluatedKey"]
    else:
        response = client.scan(
            TableName=table_name,
            Limit=limit,
            Select='ALL_ATTRIBUTES',
            ExclusiveStartKey=key,
            ReturnConsumedCapacity='TOTAL',
            TotalSegments=123,
            Segment=122,
        )
        key = response["LastEvaluatedKey"]
    iteration += 1
    for el in response["Items"]:
        print(el)
I think there are two problems:
you seem to be scanning with a limit: try removing that
you are running a parallel scan and always scanning the last segment:
TotalSegments=123
Segment=122
I'm not sure how big your tables are, but 123 segments is quite a lot, and I don't see you scanning any of the other segments, from 0 to 121.
Try this:
iteration = 0
response = client.scan(
    TableName=table_name,
    Select='ALL_ATTRIBUTES',
    ReturnConsumedCapacity='TOTAL'
)
while True:
    iteration += 1
    for el in response["Items"]:
        print(el)
    # LastEvaluatedKey is absent from the final page, so use .get()
    last_key = response.get("LastEvaluatedKey")
    if not last_key:
        break
    response = client.scan(
        TableName=table_name,
        Select='ALL_ATTRIBUTES',
        ExclusiveStartKey=last_key,
        ReturnConsumedCapacity='TOTAL'
    )
I expect the above should work to retrieve all items in your table. Then, if you would still like to run a parallel scan, you can, but you'll have to handle splitting the work into segments, and for it to be efficient you'll also have to run those segments concurrently (more complicated than a sequential scan); see the sketch below.
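
For reference, a rough sketch of what that parallel scan could look like with a thread pool (the segment count of 4 is illustrative, and table_name is assumed to be defined as in the question):

# Rough sketch of a parallel scan: each worker scans one segment to
# completion; results are merged at the end.
from concurrent.futures import ThreadPoolExecutor

import boto3

TOTAL_SEGMENTS = 4  # illustrative; tune to table size and throughput
client = boto3.client('dynamodb')

def scan_segment(segment):
    items = []
    kwargs = {
        'TableName': table_name,
        'Select': 'ALL_ATTRIBUTES',
        'TotalSegments': TOTAL_SEGMENTS,
        'Segment': segment,
    }
    while True:
        response = client.scan(**kwargs)
        items.extend(response['Items'])
        last_key = response.get('LastEvaluatedKey')
        if not last_key:
            return items
        kwargs['ExclusiveStartKey'] = last_key

with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as pool:
    results = pool.map(scan_segment, range(TOTAL_SEGMENTS))
    all_items = [item for batch in results for item in batch]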

Pandas: Iterate on a column one row at a time to automate a google search?

I am trying to automate 100 Google searches (one per string in each row, returning the URLs for each query) on a specific column in a CSV, via Python 2.7; however, I am unable to get Pandas to feed the row contents into the Google search automator.
GoogleSearch source: https://breakingcode.wordpress.com/2010/06/29/google-search-python/
Overall, I can print URLs successfully for a query when I use the following code:
from google import search

query = "apples"
for url in search(query, stop=5, pause=2.0):
    print(url)
However, when I add Pandas (to read each "query"), the rows are not read and queried as intended; i.e., the literal string "data.irow(n)" is queried instead of the row contents, one row at a time.
from google import search
import pandas as pd
from pandas import DataFrame

query_performed = 0
querying = True
query = 'data.irow(n)'

# read the excel file at column 2 (i.e. "Fruit")
df = pd.read_csv('C:\Users\Desktop\query_results.csv', header=0, sep=',', index_col='Fruit')

# need to specify "Column2" and one "data.irow(n)" queried at a time
while querying:
    if query_performed <= 100:
        print("query")
        query_performed += 1
    else:
        querying = False
        print("Asked all 100 query's")

# prints initial urls for each "query" in a google search
for url in search(query, stop=5, pause=2.0):
    print(url)
Incorrect output I receive at the command line:
query
Asked all 100 query's
query
Asked all 100 query's
Asked all 100 query's
http://www.irondata.com/
http://www.irondata.com/careers
http://transportation.irondata.com/
http://www.irondata.com/about
http://www.irondata.com/public-sector/regulatory/products/versa
http://www.irondata.com/contact-us
http://www.irondata.com/public-sector/regulatory/products/cavu
https://www.linkedin.com/company/iron-data-solutions
http://www.glassdoor.com/Reviews/Iron-Data-Reviews-E332311.htm
https://www.facebook.com/IronData
http://www.bloomberg.com/research/stocks/private/snapshot.asp?privcapId=35267805
http://www.indeed.com/cmp/Iron-Data
http://www.ironmountain.com/Services/Data-Centers.aspx
FYI: My Excel .CSV format is the following:

      B
1     Fruit
2     apples
3     oranges
4     mangos
5     mangos
6     mangos
...
101   mangos
Any advice on next steps is greatly appreciated! Thanks in advance!
Here's what I got. Like I mentioned in my comment, I couldn't get the stop parameter to work like I thought it should; maybe I'm misunderstanding how it's used. I'm assuming you only want the first 5 URLs per search.
A sample df:
import pandas as pd

d = {"B": ["mangos", "oranges", "apples"]}
df = pd.DataFrame(d)
Then
stop = 5
urlcols = ["C", "D", "E", "F", "G"]

# Here I'm using an apply() to call the google search for each 'row',
# and a list is built from the urls returned by search()
df[urlcols] = df["B"].apply(lambda fruit: pd.Series(
    [url for url in search(fruit, stop=stop, pause=2.0)][:stop]))  # get 5 by slicing
which gives you the following (the formatting is a bit rough here):
B C D E F G
0 mangos http://en.wikipedia.org/wiki/Mango http://en.wikipedia.org/wiki/Mango_(disambigua... http://en.wikipedia.org/wiki/Mangifera http://en.wikipedia.org/wiki/Mangifera_indica http://en.wikipedia.org/wiki/Purple_mangosteen
1 oranges http://en.wikipedia.org/wiki/Orange_(fruit) http://en.wikipedia.org/wiki/Bitter_orange http://en.wikipedia.org/wiki/Valencia_orange http://en.wikipedia.org/wiki/Rutaceae http://en.wikipedia.org/wiki/Cherry_Orange
2 apples https://www.apple.com/ http://desmoines.citysearch.com/review/692986920 http://local.yahoo.com/info-28919583-apple-sto... http://www.judysbook.com/Apple-Store-BtoB~Cell... https://tr.foursquare.com/v/apple-store/4b466b...
If you'd rather not specify the columns (i.e. ["C", "D", ...]) you could do the following:
df.join(df["B"].apply(lambda fruit : pd.Series([url for url in
search(fruit, stop=stop, pause=2.0)][:stop])))
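
If you only need to fire one search per row and print the URLs, rather than building URL columns, a plain loop over the column also works. A minimal sketch, assuming the same df and search import as above:

# one Google search per row, printing the first 5 urls each time
for fruit in df["B"]:
    print("query: %s" % fruit)
    for url in search(fruit, stop=5, pause=2.0):
        print(url)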

Mongoid 4 .or not working in complex query

I know Mongoid 4 is still in beta and maybe I've found a bug, but I'm having a hard time understanding why the first query works and the second one returns nothing:
Product.or({sender_uid: params[:user_id]}, {receiver_uid: params[:user_id]})
Product.where({sender_uid: params[:user_id]}).or({receiver_uid: params[:user_id]})
This sort of thing makes it hard to compose complex queries, so any pointers would be appreciated.
See the following example:
Product 1: sender_uid = 1, receiver_uid = 2
Product 2: sender_uid = 2, receiver_uid = 1
Product 3: sender_uid = 1, receiver_uid = 2
params[:user_id] = 1
In the first query what you are getting is ALL the products where the sender_uid OR the receiver_uid is equal to 1. That is Product 1, 2 and 3.
In the second query you are selecting all products where the sender_uid is 1, that is Product 1 and Product 3, and then (on that criteria) the products with receiver_uid = 1. Neither Product 1 nor Product 3 has a receiver with uid 1, so that's why you're getting nothing. What you are doing in the second query is effectively:
Product.where(sender_uid: params[:user_id]).where(receiver_uid: params[:user_id])
UPDATE:
Answering a comment:
Product.or({ product_id: 1 }, { product_id: 2, sender_uid: 2 })
As you can see, the or method receives two Hashes of conditions. Each one is like a where query.

Get objects created in last 30 days, for each past day

I am looking for a fast method to count a model's objects created within the past 30 days, for each day separately. For example:
27.07.2013 (today) - 3 objects created
26.07.2013 - 0 objects created
25.07.2013 - 2 objects created
...
27.06.2013 - 1 objects created
I am going to use this data with the Google Charts API. Do you have any idea how to get it efficiently?
from django.db.models import Count

items = Foo.objects.filter(
    createdate__lte=datetime.datetime.today(),
    createdate__gt=datetime.datetime.today() - datetime.timedelta(days=30),
).values('createdate').annotate(count=Count('id'))
This will (1) filter the results to the last 30 days, (2) select just the createdate field and (3) count the ids, grouping by all selected fields (i.e. createdate). It returns a list of dictionaries of the format:
[
    {'createdate': <datetime.date object>, 'count': <int>},
    {'createdate': <datetime.date object>, 'count': <int>},
    ...
]
EDIT:
I don't believe there's a way to get all dates, including those with count == 0, with just SQL. You'll have to insert each missing date in Python code, e.g.:
import datetime

# needed to use .append() later on
items = list(items)
dates = [x.get('createdate') for x in items]
for d in (datetime.datetime.today() - datetime.timedelta(days=x) for x in range(0, 30)):
    if d not in dates:
        items.append({'createdate': d, 'count': 0})
I think this can be a somewhat more optimized take on #knbk's solution. It has fewer iterations, and set operations are highly optimized in Python (both in processing and in CPU cycles).
from_date = datetime.date.today() - datetime.timedelta(days=7)
orders = Order.objects.filter(created_at__gte=from_date, dealer__executive__branch__user=user)
orders = orders.values('created_at').annotate(count=Count('id')).order_by('created_at')

if len(orders) < 7:
    orders_list = list(orders)
    # every date in the window, as a set for fast membership tests
    dates = set(datetime.date.today() - datetime.timedelta(days=i) for i in range(7))
    order_set = set(ord['created_at'] for ord in orders)
    # append a zero count for every date that has no orders
    for dt in (dates - order_set):
        orders_list.append({'created_at': dt, 'count': 0})
    orders_list = sorted(orders_list, key=lambda item: item['created_at'])
else:
    orders_list = orders
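
Since the original question mentions feeding this into the Google Charts API, here is a small sketch of shaping the per-day counts into [label, value] rows for a DataTable (the date format string and variable names are my assumptions):

# Sketch: shape the per-day counts into [label, value] rows for a chart.
chart_rows = [[item['created_at'].strftime('%d.%m.%Y'), item['count']]
              for item in orders_list]
# e.g. [['21.07.2013', 0], ..., ['27.07.2013', 3]]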