MapReduce to replicate self join

In a traditional relational database, I can use joins to find, say, a list of users who visited 'pageA' but not 'pageB'.
Here's how I'm doing it:
Table Schema:
t_user_actions {
user_id,
action,
page
}
Sample Data:
user_id, action, page
111, visit, pageX
222, visit, pageA
222, visit, pageB
333, visit, pageA
I can write this SQL to find the list of all users who visited pageA but not pageB:
SELECT DISTINCT u1.user_id
FROM t_user_actions u1
LEFT JOIN t_user_actions u2
  ON u2.user_id = u1.user_id AND u2.page = 'pageB'
WHERE u1.page = 'pageA'
  AND u2.user_id IS NULL
How do I achieve the same with MapReduce if I'm working on a large data set, assuming I can import/insert the raw data into some NoSQL db?
I notice there are ways to do union and intersection, but I'm trying to figure out how to compute a relative complement (set difference) of tuples.

Depending on which database you are actually using there might be much better ways to do this than MapReduce. But you asked specifically for MapReduce, so...
The Map phase would check all documents for action == "visit" && (page == "pageA" || page == "pageB"). When this is true, it would emit a document with the user_id as key and page as value.
The Reduce phase would iterate over all values it receives per user. When there is at least one value with "pageB" it returns "pageB", otherwise it returns "pageA".
When you examine the result set, ignore all returned values with page == "pageB". Those users visited pageB at least once (but not necessarily pageA as well). Those with page == "pageA" are the ones you are searching for: the users who visited pageA but never pageB.
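To make the two phases concrete, here is a minimal Python sketch of the same logic (a plain in-memory simulation; in a real engine such as Hadoop or MongoDB you would express the map and reduce functions in that engine's API):

from collections import defaultdict

def map_phase(doc):
    # Emit (user_id, page) only for visits to the two pages of interest.
    if doc['action'] == 'visit' and doc['page'] in ('pageA', 'pageB'):
        yield doc['user_id'], doc['page']

def reduce_phase(pages):
    # Any visit to pageB dominates; otherwise the user only saw pageA.
    return 'pageB' if 'pageB' in pages else 'pageA'

docs = [
    {'user_id': 111, 'action': 'visit', 'page': 'pageX'},
    {'user_id': 222, 'action': 'visit', 'page': 'pageA'},
    {'user_id': 222, 'action': 'visit', 'page': 'pageB'},
    {'user_id': 333, 'action': 'visit', 'page': 'pageA'},
]

# Shuffle step: group the emitted values by key (user_id).
grouped = defaultdict(list)
for doc in docs:
    for user_id, page in map_phase(doc):
        grouped[user_id].append(page)

# Reduce, then keep only the users whose result is 'pageA'.
result = {u: reduce_phase(pages) for u, pages in grouped.items()}
print([u for u, page in result.items() if page == 'pageA'])  # -> [333]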

Related

NetSuite SuiteQL: how to get all available tables to query?

I am using Postman and NetSuite's SuiteQL to query some tables. I would like to write two queries. One returns all items (fulfillment items) for a given sales order. The other returns all sales orders that contain a given item. I am not sure which tables to use.
The sales order I can return from something like this.
"q": "SELECT * FROM transaction WHERE Type = 'SalesOrd' and id = '12345'"
The item I can get from this.
"q": "SELECT * FROM item WHERE id = 1122"
I can join transaction and transactionline for the sales order, but I get no items.
"q": "SELECT * from transactionline tl join transaction t on tl.transaction = t.id where t.id in ('12345')"
The best reference I have found is the Analytics Browser, https://system.netsuite.com/help/helpcenter/en_US/srbrowser/Browser2021_1/analytics/record/transaction.html, but it does not show relationships the way an ERD would.
What tables do I need to join to say, given this item id 1122, return me all sales orders (transactions) that have this item?
You are looking for TransactionLine.item. That will allow you to query transaction lines whose item is whatever internal id you specify.
{
"q": "SELECT Transaction.ID FROM Transaction INNER JOIN TransactionLine ON TransactionLine.Transaction = Transaction.ID WHERE type = 'SalesOrd' AND TransactionLine.item = 1122"
}
If you are serious about getting all available tables to query, take a look at the metadata catalog. It's not technically meant for learning SuiteQL (it's supposed to make the normal API calls easier to navigate), but I've found the catalog endpoints match the SuiteQL tables for the most part.
https://{{YOUR_ACCOUNT_ID}}.suitetalk.api.netsuite.com/services/rest/record/v1/metadata-catalog/
Headers:
Accept: application/schema+json
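Outside Postman, the same call is easy to script. A minimal Python sketch, assuming token-based authentication and a recent requests_oauthlib (all credentials and the account id below are placeholders; NetSuite signs REST calls with OAuth 1.0a using HMAC-SHA256 and the account id as the realm):

import requests
from requests_oauthlib import OAuth1

# Placeholder credentials from a token-based authentication setup.
auth = OAuth1(
    'CONSUMER_KEY', 'CONSUMER_SECRET', 'TOKEN_ID', 'TOKEN_SECRET',
    realm='YOUR_ACCOUNT_ID', signature_method='HMAC-SHA256',
)
url = ('https://YOUR_ACCOUNT_ID.suitetalk.api.netsuite.com'
       '/services/rest/record/v1/metadata-catalog/')
# The Accept header selects the schema representation of the catalog.
resp = requests.get(url, headers={'Accept': 'application/schema+json'}, auth=auth)
print(resp.json())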
You can review all the available records, fields and joins in the Record Catalog page (Customization > Record Catalog).

How to search through rows and assign column value based on search in Postgres?

I'm creating an application similar to Twitter, and I'm writing a query for the profile page. When a user visits another user's profile, they can view the tweets liked by that particular user. For that, my query retrieves all tweets liked by that user, along with the total likes and comments on each tweet.
An additional parameter I require is whether the current user has liked any of those tweets; if so, I want the query to return true so I can display the tweet as liked in the UI.
I don't know how to achieve this part. The following is a sub-query from my main query:
select l.tweet_id, count(*) as total_likes,
<insert here> as current_user_liked
from api_likes as l
INNER JOIN accounts_user ON l.liked_by_id = accounts_user.id
group by tweet_id
Is there an inbuilt function in Postgres that can scan through the filtered rows and check whether the current user's id is present in liked_by_id? If so, it should mark current_user_liked as true, else false.
You want to left outer join back into the api_likes table.
select l.tweet_id, count(*) as total_likes,
case
when count(lu.tweet_id) = 0 then false
else true
end as current_user_liked
from api_likes as l
INNER JOIN accounts_user ON l.liked_by_id = accounts_user.id
left join api_likes as lu on lu.tweet_id = l.tweet_id
and lu.liked_by_id = <current user id>
group by l.tweet_id
This will continue to bring in the rows you are seeing and will add a row for the lu alias on api_likes. If no such row exists matching the l.tweet_id and the current user's id, then the columns from the lu alias will be null.

DynamoDB query/scan only returns subset of items

I noticed that DynamoDB query/scan only returns a subset of each document: just the key columns, it appears.
This means I need to do a separate batch get to fetch the actual documents referenced by those keys.
I am not using a projection expression, and according to the documentation that means the whole item should be returned.
How do I get the query to return the entire document so I don't have to do a separate batch get?
An example bit of code that shows this is below. It prints the documents it finds, yet they contain only the table's primary key, the secondary key, and the sort key.
import json
import boto3

db = boto3.resource('dynamodb')
t1 = db.Table(tname)  # tname is defined elsewhere
q = {
    'IndexName': 'mysGSI',
    'KeyConditionExpression': "secKey = :val1 AND "
                              "begins_with(sortKey, :status)",
    'ExpressionAttributeValues': {
        ":val1": 'XXX',
        ":status": 'active-',
    },
}
res = t1.query(**q)
for doc in res['Items']:
    print(json.dumps(doc))
This situation is discussed in the documentation for the Select parameter. You have to read quite a lot to find this, which is not ideal.
If you query or scan a global secondary index, you can only request
attributes that are projected into the index. Global secondary index
queries cannot fetch attributes from the parent table.
Basically:
If you query the parent table then you get all attributes by default.
If you query an LSI then you get all attributes by default - they're retrieved from the projection in the LSI if all attributes are projected into the index (so that costs nothing extra) or from the base table otherwise (which will cost you more reads).
If you query or scan a GSI, you can only request attributes that are projected into the index. GSI queries cannot fetch attributes from the parent table.
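If you need whole items back from a GSI query, the usual fix is to project all attributes into the index. For completeness, a minimal boto3 sketch of defining a GSI with ProjectionType 'ALL' (table and key names here are made up to match the question; note an existing index's projection cannot be changed in place, so a misprojected GSI has to be deleted and recreated):

import boto3

db = boto3.resource('dynamodb')
db.create_table(
    TableName='mytable',  # hypothetical table name
    AttributeDefinitions=[
        {'AttributeName': 'pk', 'AttributeType': 'S'},
        {'AttributeName': 'secKey', 'AttributeType': 'S'},
        {'AttributeName': 'sortKey', 'AttributeType': 'S'},
    ],
    KeySchema=[{'AttributeName': 'pk', 'KeyType': 'HASH'}],
    GlobalSecondaryIndexes=[{
        'IndexName': 'mysGSI',
        'KeySchema': [
            {'AttributeName': 'secKey', 'KeyType': 'HASH'},
            {'AttributeName': 'sortKey', 'KeyType': 'RANGE'},
        ],
        # Project all attributes so GSI queries return whole items.
        'Projection': {'ProjectionType': 'ALL'},
    }],
    BillingMode='PAY_PER_REQUEST',
)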

DynamoDB QuerySpec {MaxResultSize + filter expression}

From the DynamoDB documentation
The Query operation allows you to limit the number of items that it
returns in the result. To do this, set the Limit parameter to the
maximum number of items that you want.
For example, suppose you Query a table, with a Limit value of 6, and
without a filter expression. The Query result will contain the first
six items from the table that match the key condition expression from
the request.
Now suppose you add a filter expression to the Query. In this case,
DynamoDB will apply the filter expression to the six items that were
returned, discarding those that do not match. The final Query result
will contain 6 items or fewer, depending on the number of items that
were filtered.
Looks like the following query should return (at least sometimes) 0 records.
In summary, I have a UserLogins table. A simplified version is:
1. UserId - HashKey
2. DeviceId - RangeKey
3. ActiveLogin - Boolean
4. TimeToLive - ...
Now, let's say UserId = X has 10,000 inactive logins in different DeviceIds and 1 active login.
However, when I run this query against my DynamoDB table:
QuerySpec{
hashKey: null,
rangeKeyCondition: null,
queryFilters: null,
nameMap: {"#0" -> "UserId"}, {"#1" -> "ActiveLogin"}
valueMap: {":0" -> "X"}, {":1" -> "true"}
exclusiveStartKey: null,
maxPageSize: null,
maxResultSize: 10,
req: {TableName: UserLogins,ConsistentRead: true,ReturnConsumedCapacity: TOTAL,FilterExpression: #1 = :1,KeyConditionExpression: #0 = :0,ExpressionAttributeNames: {#0=UserId, #1=ActiveLogin},ExpressionAttributeValues: {:0={S: X,}, :1={BOOL: true}}}
I always get 1 row: the one active login for UserId=X. And it's not happening just for one user; it's happening for multiple users in a similar situation.
Are my results contradicting the DynamoDB documentation?
It looks like a contradiction: maxResultSize=10 means that DynamoDB will only read the first 10 items (out of 10,001) and then apply the filter active=true to those (which might return 0 results). It seems very unlikely that the record with active=true happened to be among the first 10 records DynamoDB read.
This is happening for hundreds of customers running similar queries. It works great, even though according to the documentation it shouldn't.
I can't see any obvious problem with the Query. Are you sure about your premise that users have 10,000 items each?
Your keys are UserId and DeviceId. That seems to mean that if your user logs in with the same device, it would overwrite the existing item. Or put another way, I think you are saying your users have 10,000 different devices each (unless the DeviceId rotates in some way).
In your shoes I would just remove the filter expression and print the results to the log to see what you're getting in your 10 results. Then remove the limit too and see what you get with that.
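A sketch of that debugging step, using boto3 rather than the Java SDK from the question (table and key names as in your schema):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('UserLogins')

# Same key condition, limit and consistency, but no filter expression:
# print the raw first 10 items to see what ActiveLogin/DeviceId they hold.
res = table.query(
    KeyConditionExpression=Key('UserId').eq('X'),
    Limit=10,
    ConsistentRead=True,
)
for item in res['Items']:
    print(item)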

Multiple access to static data in a django app

I'm building an application and I'm having trouble choosing the best way to access static data multiple times in a Django app. My experience in the field is close to zero, so I could use some help.
The app basically consists of drag & drop of foods. When you drag a food to a given place (breakfast, for example), different values get updated: total breakfast calories, total daily nutrients (micro/macro), total daily calories, ... That's why I think the way I store and access the data is pretty important, performance-wise.
This is an excerpt of the json file I'm currently using:
foods.json
{
  "112": {
    "type": "Vegetables",
    "description": "Mushrooms",
    "nutrients": {
      "Niacin": {
        "unit": "mg",
        "group": "Vitamins",
        "value": 3.79
      },
      "Lysine": {
        "unit": "g",
        "group": "Amino Acids",
        "value": 0.123
      }
      ... (+40 nutrients)
    },
    "amount": 1,
    "unit": "cup whole",
    "grams": 87.0
  }
}
I've thought about different options:
1) JSON (the one I'm currently using):
Every time I drag a food to a "droppable" place, I call a getJSON function to access the food data and then update the corresponding values. The file is currently 2 MB, but it will surely grow as I add more foods. I'm using this option because it was the quickest way to start building the app, but I don't think it's a good choice for the live app.
2) RDBMS with normalized fields:
I could create two models, Food and Nutrient, with each food related to its 40+ nutrients by a FK. The problem I see with this is that every time food data is requested, the app will hit the db many times to retrieve it.
3) RDBMS with picklefield:
This is the option I'm actually considering. I could create a Food model and put the nutrients in a picklefield.
4) Something with Redis/Django Cache system:
I still have to dive more deeply into this option. I've read some things about these tools, but I don't clearly see whether there's a way to use them to solve my problem.
Thanks in advance,
Mariano.
This is a typical use case for a relational database. More or less normalized form is the proper way most of the time.
I wrote this data model off the top of my head, according to your example:
CREATE TABLE unit(
unit_id integer PRIMARY KEY
,unit text NOT NULL
,metric_unit text NOT NULL
,atomic_amount numeric NOT NULL
);
CREATE TABLE food_type(
food_type_id integer PRIMARY KEY
,food_type text NOT NULL
);
CREATE TABLE nutrient_type(
nutrient_type_id integer PRIMARY KEY
,nutrient_type text NOT NULL
);
CREATE TABLE food(
food_id serial PRIMARY KEY
,food text NOT NULL
,food_type_id integer REFERENCES food_type(food_type_id) ON UPDATE CASCADE
,unit_id integer REFERENCES unit(unit_id) ON UPDATE CASCADE
,base_amount numeric NOT NULL DEFAULT 1
);
CREATE TABLE nutrient(
nutrient_id serial PRIMARY KEY
,nutrient text NOT NULL
,metric_unit text NOT NULL
,base_amount numeric NOT NULL
,calories integer NOT NULL DEFAULT 0
);
CREATE TABLE food_nutrient(
food_id integer references food (food_id) ON UPDATE CASCADE ON DELETE CASCADE
,nutrient_id integer references nutrient (nutrient_id) ON UPDATE CASCADE
,amount numeric NOT NULL DEFAULT 1
,CONSTRAINT food_nutrient_pkey PRIMARY KEY (food_id, nutrient_id)
);
CREATE TABLE meal(
meal_id serial PRIMARY KEY
,meal text NOT NULL
);
CREATE TABLE meal_food(
meal_id integer references meal(meal_id) ON UPDATE CASCADE ON DELETE CASCADE
,food_id integer references food (food_id) ON UPDATE CASCADE
,amount numeric NOT NULL DEFAULT 1
,CONSTRAINT meal_food_pkey PRIMARY KEY (meal_id, food_id)
);
This is definitely not how it should work:
every time a food data request is made, the app will hit the db a lot
of times to retrieve it.
You should calculate / aggregate all values you need in a view or function and hit the database only once per request, not many times.
Simple example to calculate the calories of a meal according to the above model:
SELECT sum(n.calories * fn.amount * f.base_amount * u.atomic_amount * mf.amount)
AS meal_calories
FROM meal_food mf
JOIN food f USING (food_id)
JOIN unit u USING (unit_id)
JOIN food_nutrient fn USING (food_id)
JOIN nutrient n USING (nutrient_id)
WHERE mf.meal_id = 7;
You can also use materialized views. For instance, store computed values per food in a table and update it automatically if underlying data changes. Most likely, those rarely change (but are still easily updated this way).
I think the flat file version you are using comes in last place: every time it is requested, it is read from top to bottom, and at this size that will only get worse. The cache system would provide the best performance, but the RDBMS would be the easiest to manage and extend, plus your queries can be cached on top of it.
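If you do try option 4, Django's cache framework makes it a small amount of code. A minimal sketch, assuming a hypothetical load_food_from_db() helper that builds the per-food nutrient dict from the relational model above:

from django.core.cache import cache

def get_food(food_id):
    key = 'food:%s' % food_id
    food = cache.get(key)
    if food is None:
        # Cache miss: hit the database once, then keep the result warm.
        food = load_food_from_db(food_id)  # hypothetical loader
        cache.set(key, food, timeout=60 * 60)  # one hour
    return food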