Grouping collection with threshold filtering - mapreduce

I'm trying to come up with an algorithm to do the following in MapReduce. I receive a bunch of objects and the user ids of their owners. In other words, I receive a bunch of pairs:
(object, uid)
I want to end up with a list of pairs (object,count), where count refers to the number of times the object occurs in the list. The caveat is that we would need to filter everything as follows:
We should only include objects that occur for at least n different uids.
We should only include objects whose total count of occurrences is at least m.
Objects and users are all represented as integers. It would be trivial to convert each (object, uid) pair into (object, 1) and then reduce these together by summing the second integers; I could then filter out everything that doesn't hit the threshold from (2). The problem is that at this point I would have lost the information necessary to filter by (1), and that is what I don't know how to incorporate. Does anyone have any suggestions?

The easiest and most natural way is to run two MR jobs in sequence. The goal of the first job is to count how many times each object is owned by each uid. The result is triplets (object, uid, count). The uid field here is for debugging purposes only -- it is not required in the second job. The second job groups the triplets by object. At the end of each reduce() call, you know:
the number of different uids for the object (the number of triplets received)
the total number of times the object is owned (the sum of the count fields)
So now you can apply both filters.
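For concreteness, here is a plain-Python sketch of the same two-pass logic (not actual Hadoop code; n and m stand for the two thresholds from the question):

from collections import Counter, defaultdict

def job1(pairs):
    """First job: (object, uid) pairs -> (object, uid, count) triplets."""
    counts = Counter(pairs)                       # key is the (object, uid) pair
    return [(obj, uid, c) for (obj, uid), c in counts.items()]

def job2(triplets, n, m):
    """Second job: group triplets by object, then apply both filters."""
    per_object = defaultdict(list)
    for obj, uid, c in triplets:
        per_object[obj].append(c)
    result = []
    for obj, counts in per_object.items():
        distinct_uids = len(counts)               # one triplet per (object, uid)
        total = sum(counts)                       # sum of the count fields
        if distinct_uids >= n and total >= m:
            result.append((obj, total))
    return result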
A single-job setup is also possible, but it requires configuring the job at a slightly lower level with setSortComparatorClass(), setGroupingComparatorClass() and setPartitionerClass(). The idea is that map() should emit composite keys containing both the object and uid fields; the value is not used at all (NullWritable):
The Partitioner partitions keys using only the object field of the key. This guarantees that all records with the same object go to the same reduce task.
The SortComparator first compares the object field and, if the objects are identical, then the uid field.
The GroupingComparator uses only the object field for comparison.
As a result, the input of a single reduce task will look like the following:
object1 uid1
object1 uid2
object1 uid2
object1 uid2
object1 uid5
object1 uid6
object1 uid6
------------ <-- boundary of a call to reduce()
object7 uid1
object7 uid1
object7 uid5
------------ <-- boundary of a call to reduce()
object9 uid3
As you can see, the uids are strictly ordered within each call to reduce(), which allows you to count both the number of distinct uids and the total number of records in a single pass.
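A plain-Python sketch of what that reduce-side counting could look like, assuming the uids for one object arrive already sorted (names here are illustrative, not Hadoop API):

def reduce_object(obj, sorted_uids, n, m):
    """Reducer logic for one object; sorted_uids is the uid stream for this group."""
    distinct_uids = 0
    total = 0
    prev = None
    for uid in sorted_uids:
        total += 1
        if uid != prev:            # sorted order makes distinct-counting a single comparison
            distinct_uids += 1
            prev = uid
    if distinct_uids >= n and total >= m:
        return (obj, total)
    return None                    # filtered out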

Related

Is this a reasonable way to design this DynamoDB table? Alternatives?

Our team has started to use AWS and one of our projects will require storing approval statuses of various recommendations in a table.
There are various things that identify a single recommendation; let's say they're: State, ApplicationDate, LocationID, and Phase. And then there are a bunch of attributes corresponding to the recommendation (title, volume, etc.).
The use case will often require grabbing all entries for a given State and ApplicationDate (and then we will look at all the LocationId and Phase items that correspond to it) for review from a UI. Items are added to the table one at a time for a given State, ApplicationDate, LocationId, and Phase, and updated frequently.
A dev with a little more AWS experience mentioned we should probably use State+ApplicationDate as the partition key, and LocationId+Phase as the sort key. These two pieces combined would make the primary key. I generally understand this, but how does that work if we start getting multiple recommendations for the same primary key? I figure we are either OK with just overwriting what was previously there, OR we have to add some other attribute so we can write a recommendation for the same State+ApplicationDate / LocationId+Phase multiple times and get all previous values if we need to... but that would require adding something to the primary key, right? Would that be like adding some kind of unique value to the sort key? Or, for example, if we need to track status and want to record different values at different statuses, would we just need to add status to the sort key?
Does this sound like a reasonable approach, or should I be exploring a different AWS offering for storing this data?
Use a time-based id property, such as a ULID or KSUID. This will provide randomness to avoid overwriting data, but will also provide time-based sorting of your data when used as part of a sort key.
Because the id value is random, you will want to add it to the sort key of the table or index where you perform your list operations, and reserve the pk for known values that can be specified exactly.
It sounds like 'State' is a value that can change. You can't update an item's key attributes on the table, so it is more common to use such attributes in the key of a GSI if they are needed to list data.
Given the above, an alternative design is to use the LocationId as the pk, the random id value as the sk, and a GSI with 'State' as the pk and the random id as the sk. Or, if you want to list the items by State -> Phase -> date, the GSI sk could be a concatenation of the Phase and the id property. This pattern also gives you another list mechanism using the LocationId + the timestamp of the recommendation's create time.
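As a rough illustration only (the table name, attribute names, and the use of the python-ulid package are my assumptions, not from the question), a write under this layout could look like:

import boto3
from ulid import ULID  # python-ulid; any time-sortable id (KSUID, etc.) works the same way

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Recommendations")  # placeholder table name

def put_recommendation(location_id, state, phase, application_date, attrs):
    rec_id = str(ULID())                   # random, but lexicographically sorted by create time
    item = {
        "pk": location_id,                 # table partition key: a known, exact value
        "sk": rec_id,                      # table sort key: never collides, time-ordered
        "gsi1pk": state,                   # GSI partition key ('State' can change, so it lives on a GSI)
        "gsi1sk": f"{phase}#{rec_id}",     # GSI sort key: list by State -> Phase -> create time
        "application_date": application_date,
        **attrs,
    }
    table.put_item(Item=item)
    return rec_id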

DynamoDB get object count within sort-key

I'm creating a database of local businesses.
My Primary key looks like this:
Partition key: "slug" - used by URL engine
Sort key: "category" - the category of the business, such as "services", "electronics", etc.
How can I scan DynamoDB for the count of businesses within a certain category, while keeping reads to a minimum? I want to get a result like: "8 electronics, 3 building, 4 services".
I have about 8 categories; would scanning the database with a filter expression for each category be efficient?
Other variables to consider are how often you need these counts, when they are needed, and at what velocity. It might be cheaper to run an out-of-band process in Lambda that computes these counts every so often and writes the count values back into an item. That way, when the app needs the counts, it just grabs those items. Otherwise, doing counts all the time like this could get expensive for something that probably does not change very often.
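A minimal sketch of that out-of-band counting job with boto3; the table name, key names, and the COUNT#<category> item layout are assumptions for illustration:

import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Businesses")  # placeholder table name

def refresh_category_counts(categories):
    for category in categories:
        count = 0
        kwargs = {"Select": "COUNT", "FilterExpression": Attr("category").eq(category)}
        while True:                                   # page through the scan
            resp = table.scan(**kwargs)
            count += resp["Count"]
            if "LastEvaluatedKey" not in resp:
                break
            kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
        # Write the precomputed count back as its own item so the app can read it
        # with a single GetItem instead of re-scanning.
        table.put_item(Item={"slug": f"COUNT#{category}", "category": category, "count": count})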

Short incremental unique id for neo4j

I use Django with Neo4j as the database. I need short URLs based on node ids in my REST API. Neo4j has an internal id that is not recommended for use in applications, and the alternative approach of using a UUID gives ids that are too long for my short URLs. So I added my own uid generator:
import time
import uuid
from hashlib import sha256

from neomodel import db  # assuming neomodel's db for cypher_query, since the project is Django + neo4j

def uid_generator():
    last_id = db.cypher_query("MATCH (n) RETURN count(*) AS lastId")[0][0][0]
    if last_id is None:
        last_id = 0
    last_id = str(last_id)
    hash = sha256()
    hash.update(str(time.time()).encode())
    return (hash.hexdigest()[0:max(2, len(last_id))]
            + str(uuid.uuid4()).replace('-', '')[0:max(2, len(last_id))])
I have two questions. First, I read this question on Stack Overflow and I'm still not sure that MATCH (n) RETURN count(*) AS lastId is O(1); there was no reference given for that claim. Is there any reference for that answer? Second, is there a better approach in terms of both id uniqueness and speed?
First, you should put a unique constraint on the id property to make sure there are no collisions created by parallel create statements. This requires using a label, but you NEED this fail-safe if you plan to do anything serious with this data. It also means you can have rolling ids per label. (All indexed labels have a count table; a UNIQUE CONSTRAINT also creates an index.)
Second, you should do the generation and the creation in the same Cypher statement, like this:
MATCH (n:Node) WITH count(*) AS lastId
CREATE (:Node{id:lastId})
This will minimize the time between generation and commit, reducing the chance of collision. (Remember to retry attempts that fail with unique-constraint violations.)
I'm not sure what you are doing with the hash, just that you are doing it wrong. Either you generate a new time-based UUID (it requires no parameters) and use it as is, or you use the incrementing id. (By altering a UUID, you invalidate the logic that guarantees its uniqueness, thus significantly increasing the chance of collision.)
You can also store the current index count in a node, as explained here. It's not guaranteed to be thread-safe, but that shouldn't be a problem as long as you have unique constraints in place and retry on constraint violations, as sketched below. This approach is also more tolerant of node deletions.
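A rough sketch of the "generate and create in one statement, retry on constraint violations" idea using the official neo4j Python driver (connection details and the Node label are placeholders; this assumes a unique constraint on :Node(id) already exists):

from neo4j import GraphDatabase
from neo4j.exceptions import ConstraintError

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CREATE_WITH_ID = """
MATCH (n:Node) WITH count(*) AS lastId
CREATE (m:Node {id: lastId})
RETURN m.id AS id
"""

def create_node_with_rolling_id(max_retries=5):
    with driver.session() as session:
        for _ in range(max_retries):
            try:
                # The unique constraint on :Node(id) makes a losing racer fail
                # here instead of silently creating a duplicate id.
                return session.run(CREATE_WITH_ID).single()["id"]
            except ConstraintError:
                continue  # another writer took this id; recount and retry
    raise RuntimeError("could not allocate a unique id")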
Your approach is not good because it's based on the number of nodes in the database.
What happens if you create a node (call it A), then delete a random node, and then create a new node (call it B)?
A and B will have the same ID, and I think that's why you have added a time-based hash in your code (but I barely understand that line :)).
On the other hand, Neo4j's internal ID guarantees a unique ID across the database, but not over time. By default, Neo4j recycles unused IDs (an ID is released when a node is deleted).
You can change this behaviour via the configuration (see the docs HERE): dbms.ids.reuse.types.override=RELATIONSHIP
Be careful with such a configuration: the size of your database on your hard drive can only increase, even if you delete nodes.
Why not create your own identifier? You can get the maximum of your last identifier (let's call it RN, for record number):
match (n) return max(n.RN) as lastID
max is one of Cypher's aggregating functions.

how to deal with virtual index in a database table in Django + PostgreSQL

Here is my current scenario:
Need to add a new field to an existing table that will be used for ordering QuerySet.
This field will be an integer between 1 and some not very high number; I expect less than 1000. The whole reasoning behind this field is to use it for visual ordering on the front-end: index 1 would be the first element to be returned, index 2 the second, etc.
This is how the field is defined in model:
priority = models.PositiveSmallIntegerField(verbose_name=_(u'Priority'),
                                            default=0,
                                            null=True)
I will need to re-arrange (reorder) the whole set of elements in this table if a new or existing element gets this field updated. So, for instance, imagine I have 3 objects in this table:
Element A
priority 1
Element B
priority 2
Element C
priority 3
If I change Element C priority to 1 I should have:
Element C
priority 1
Element A
priority 2
Element B
priority 3
Since this is not a real db index (and can have empty values), I'm going to have to query for all elements in the database each time a new element is created or updated, and change the priority value for each record in the table. I'm not really worried about performance since the table will always be small, BUT I'm worried that this way of proceeding is not the way to go, or that it simply generates too much overhead.
Maybe there is a simpler way to do this with plain SQL? If I use a unique index, though, I will get an error every time an existing priority is reused, something I don't want either.
Any pointers?
To insert at the 10th position, all you need is a single SQL query:
MyModel.objects.filter(priority__gte=10).update(priority=models.F('priority')+1)
Then you would need a similar one for deleting an element, and for swapping two elements (or whatever your use case requires). It should all be doable in a similar manner with bulk update queries; there is no need to manually update entry by entry.
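For example, a "move this element to priority p" operation can be built from the same bulk updates; here is a rough sketch (MyModel stands in for the question's model, and the naming is mine, not from the answer):

from django.db import transaction
from django.db.models import F

def move_to(obj, new_priority):
    """Move obj to new_priority and shift everything in between by one."""
    with transaction.atomic():
        old_priority = obj.priority
        if new_priority == old_priority:
            return
        if new_priority < old_priority:
            # Moving up: rows in [new, old) slide down one slot.
            MyModel.objects.filter(
                priority__gte=new_priority, priority__lt=old_priority
            ).update(priority=F('priority') + 1)
        else:
            # Moving down: rows in (old, new] slide up one slot.
            MyModel.objects.filter(
                priority__gt=old_priority, priority__lte=new_priority
            ).update(priority=F('priority') - 1)
        obj.priority = new_priority
        obj.save(update_fields=['priority'])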
First, you can very well index this column; just don't enforce it to contain unique values. Such standard indexes can have nulls and duplicates... they are just used to locate the row(s) matching a criterion.
Second, whether updating each populated* row each time you insert or update a record is acceptable depends on the expected update frequency. If each user is inserting several records each time they use the system and you have thousands of concurrent users, it might not be a good idea... whereas if you have a single user updating any number of rows once in a while, it is not so much of an issue. In the same vein, you need to consider whether other updates are occurring to the same rows or not. You don't want to lock all the rows too often if they are also being updated/locked for changes to other fields.
*: to be accurate, you wouldn't update all populated rows, but only the ones whose priority value is greater than or equal to the inserted one (inserting at priority 999 would only shift the items currently at 999 and 1000).

Fastest way to select several inserted rows

I have a table in a database which stores items. Each item has a unique ID, which the DB generates upon insertion (auto-increment).
A user may perform a specific task that will add X items to the database, however my program (C++ server application using MySQL connector) should return the IDs that the database generated right away. For example, if I add 6 items, the server must return 6 new unique IDs to the client.
What is the fastest/cleanest way to do such a thing? So far I have been doing an INSERT followed by a SELECT for each new item, or an INSERT followed by last_insert_id; however, if there are 50 items to add it takes at least a few seconds, which is not good at all for the user experience.
sql_task.query("INSERT INTO `ItemDB` (`ItemName`, `Type`, `Time`) VALUES ('%s', '%d', '%d')", strName.c_str(), uiType, uiTime);
Getting the ID:
uint64_t item_id { sql_task.last_id() }; //This calls mysql_insert_id
I believe you need to rethink your design slightly. Let's use the analogy of a sales order. With a sales order (or invoice #), the user gets an invoice number (auto_inc) as well as multiple line item numbers (also auto_inc).
The sales order and all of its line items are selected for insert (from the GUI) and the inserts are performed. First, the sales order row is inserted and its id is saved in a variable for the subsequent calls that insert the line items. The line items are then just inserted, without immediately returning their auto_inc id values. The application is merely returned the sales order number in the end. How your app uses that sales order number in subsequent calls is up to you, but it does not need to immediately retrieve all X or 50 rows at once, as it has the sales order number saved somewhere. Let's call that sales order number XYZ.
When you actually need the information, an example call could look like
select lineItemId
from lineItems
where salesOrderNumber=XYZ
order by lineItemId
You need to remember that in a multi-user system that there is no guarantee of receiving a contiguous block of numbers. Nor should it matter to you, as they are all attached appropriately with the correct sales order number.
Again, the above is just an analogy, used for illustration purposes.
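To make the analogy concrete, here is a rough Python sketch of the pattern (the question's app is C++ and these table/column names are invented, so treat this as illustration only):

def create_order(cur, customer, line_items):
    cur.execute("INSERT INTO salesOrders (customer) VALUES (%s)", (customer,))
    order_number = cur.lastrowid          # the only id the app needs to keep
    cur.executemany(
        "INSERT INTO lineItems (salesOrderNumber, itemName) VALUES (%s, %s)",
        [(order_number, name) for name in line_items],
    )
    return order_number                   # line item ids can be listed later by order number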
That's a common but hard-to-solve problem. I'm unsure about MySQL, but PostgreSQL uses sequences to generate automatic ids. Inserting frameworks (object-relational mappers) use this when they expect to insert many values: they query the sequence directly for a batch of IDs and then insert the new rows using those already-known IDs. That way, there is no need for an additional query after each insert to get the ID.
The downside is that the relationship between ID and insertion time can be non-monotonic when different writers intermix their inserts. That is not a problem for the database, but some (poorly written?) programs could expect it to be.
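A rough PostgreSQL-side sketch of that "reserve the IDs from the sequence first" idea with psycopg2 (table and sequence names are assumptions for illustration):

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=test")  # placeholder connection string

def insert_with_known_ids(names):
    with conn, conn.cursor() as cur:
        # Reserve one sequence value per row up front.
        cur.execute(
            "SELECT nextval('items_id_seq') FROM generate_series(1, %s)",
            (len(names),),
        )
        ids = [row[0] for row in cur.fetchall()]
        # Insert the rows with IDs we already know, so no follow-up SELECT is needed.
        execute_values(
            cur,
            "INSERT INTO items (id, name) VALUES %s",
            list(zip(ids, names)),
        )
        return ids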
As your ID is auto-incremented, you can do only two SELECT queries - one before and one after the INSERT queries:
SELECT AUTO_INCREMENT FROM information_schema.tables WHERE table_name = 'dbTable' AND table_schema = DATABASE();
--
-- INSERT INTO dbTable... (one or many, does not matter);
--
SELECT LAST_INSERT_ID() AS lastID;
This will give you the sequence between the first and last inserted IDs, from which you can easily calculate how many there are.
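A rough sketch of this before/after approach in Python with mysql-connector (the question's code is C++, so this is only illustrative; credentials and column handling are placeholders):

import mysql.connector

conn = mysql.connector.connect(user="app", database="test")  # placeholder credentials

def insert_and_get_id_range(rows):
    cur = conn.cursor()
    cur.execute(
        "SELECT AUTO_INCREMENT FROM information_schema.tables "
        "WHERE table_name = 'ItemDB' AND table_schema = DATABASE()"
    )
    first_id = cur.fetchone()[0]
    for name, item_type, item_time in rows:
        cur.execute(
            "INSERT INTO ItemDB (ItemName, Type, Time) VALUES (%s, %s, %s)",
            (name, item_type, item_time),
        )
    conn.commit()
    cur.execute("SELECT LAST_INSERT_ID() AS lastID")
    last_id = cur.fetchone()[0]
    # Note: with concurrent writers the range [first_id, last_id] is not
    # guaranteed to contain only this client's rows.
    return first_id, last_id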