Building a topic hierarchy for indexing content

I'm looking to build a topic map to categorize content.
For example, the topic 'Art' may have subcategories such as 'Art History', 'Painting', 'Sculpture', etc.
I've crawled a few online resources, but I've hit a problem related to how I wish to use the hierarchy.
I've got a lot of content that I wish to index by topic. To take the above example, if a user searches for 'Art' then they will get not only anything that mentions 'Art', but also anything that mentions 'Painting', even if it doesn't mention 'Art'. Fair enough.
But if, in another part of my hierarchy, I have 'House Maintenance', for example, then that might also have a subtopic of 'Painting'.
But then if a user searches for 'Art', my engine will say "well, 'Painting' is a subcategory of 'Art', so I'll include this piece of content that's all about the best colour to paint your bathroom walls"...
Has anyone come across this problem before? I've tried googling, but without knowing the exact terminology it's hard to make headway...
EDIT: More succinctly, 'Painting' is a subtopic of 'Art', but if something is about 'Painting' then it doesn't necessarily follow that it's about 'Art', since 'Art' is not the only parent of 'Painting'.

In Topic Maps, as understood in the related standard, you can assign different "scopes" to a topic. So "painting" may be part of two scopes, with a different meaning in each.
A topic map:
http://www.ontopia.net/page.jsp?id=vizigator
Scope:
http://www.ontopia.net/topicmaps/materials/tao.html#stp-scope

If the Topic Map you are creating is built on Topic Maps technology, then subjectIdentifiers can be used to distinguish between two Topics with the same name (both named "Painting") that actually represent two different Subjects (Painting as an Art form, and Painting in the sense of home renovation).
If someone queries about Art and you drill down to Painting, then you can return only those entries related to 'Painting as an Art form' because those Painting entries are no longer thrown together on one heap.

Turning up late to this party (you've probably already built it or moved on or found an answer) but thought I'd throw in my 2 cents, having worked on a high-end Topic Map based CMS.
What's missing from your description is how topics are linked together. Topics are linked together via Associations, which themselves have Types and Roles. So yes, Painting would be a child of both Art and House Maintenance, but the two links would be of different kinds.
Defining your types and roles is really up to you; there are no hard and fast rules, it's just down to your own leanings. So:
Topic: Art
Association: Source=Art, Reference=Painting, Type=Culture, Role=Practice
Topic: House Maintenance
Association: Source=House Maintenance, Reference=Painting, Type=DIY, Role=Activity
I suck at categorisation but hopefully you can see what I'm getting at. You'd filter your searches based on the type and role. So if someone searched for Art you'd return Painting, and if you wanted to dig deeper and return co-related topics, you'd return Culture-associated topics and not DIY-associated topics.
Topic Maps, if done right, are extremely flexible; you also get scope and language baked in. You should be able to link the same topics together in 100 different ways and see the data differently depending on your starting point.
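To make the type/role filtering above a bit more concrete, here is a minimal Python sketch. It is not a Topic Maps engine: the Association tuple, the CONTENT list, the document titles and the search_by_association function are all hypothetical names invented for illustration, and it assumes each piece of content is tagged with both a topic and the association type it was filed under.

```python
from collections import namedtuple

# Hypothetical association record; field names mirror the example above.
Association = namedtuple("Association", "source reference type role")

ASSOCIATIONS = [
    Association("Art", "Painting", "Culture", "Practice"),
    Association("House Maintenance", "Painting", "DIY", "Activity"),
]

# Hypothetical content, tagged as (title, topic, association type).
CONTENT = [
    ("Monet's water lilies", "Painting", "Culture"),
    ("Best colour for bathroom walls", "Painting", "DIY"),
]


def search_by_association(topic):
    """Return titles filed under subtopics of `topic`, matching on the
    association type as well as the subtopic name."""
    wanted = {(a.reference, a.type) for a in ASSOCIATIONS if a.source == topic}
    return [title for title, sub, kind in CONTENT if (sub, kind) in wanted]


print(search_by_association("Art"))
print(search_by_association("House Maintenance"))
```

Because the match is on the (subtopic, type) pair rather than the subtopic name alone, the bathroom article is never pulled in by a search for Art.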

Information Architecture for the World Wide Web would give you a good start on organizing information... it's a good read, but might not be so technically detailed.

Since you want to process House/Painting and Art/Painting differently, then it seems like you'll need two distinct entries for Painting (one for each meaning). Which one you associate a given 'lump of text' with could be based on context clues from the text itself, if your text processor is powerful enough.
For example, whenever you have a conflict like this, look in the text - do you see other words there? Like 'sink', 'wall', 'hard wood', or 'windows'? Or do you see other terms like 'Monet', 'impressionism', 'canvas', and 'gallery'? That'll allow you to automate the decision, and should be fairly accurate. The only snag is that this presumes you have a fairly healthy dictionary of 'related terms' lying around somewhere.
On the user-end, when Painting is selected, you'd simply have to either merge all the results together, or present the user an option to select which parent topic they want to be viewing results from.
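The context-clue idea above could be sketched roughly as follows. This is only an illustration under simple assumptions: the RELATED_TERMS dictionary is seeded with the words from this answer, classify_painting is a made-up helper name, and scoring is a plain count of term occurrences, which a real text processor would likely replace with something more robust.

```python
import re

# Hypothetical related-term dictionaries, seeded with the words mentioned above.
RELATED_TERMS = {
    "House Maintenance/Painting": {"sink", "wall", "hard wood", "windows"},
    "Art/Painting": {"monet", "impressionism", "canvas", "gallery"},
}


def classify_painting(text):
    """Pick the sense of 'Painting' whose related terms occur most often.

    Returns None when there is a tie (including no hits at all), so the
    caller can fall back to asking the user or merging results.
    """
    lowered = text.lower()
    scores = {}
    for sense, terms in RELATED_TERMS.items():
        # \b anchors the start of a word, so "wall" also matches "walls".
        scores[sense] = sum(
            len(re.findall(r"\b" + re.escape(term), lowered)) for term in terms
        )
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else None


print(classify_painting("Which colour suits bathroom walls and windows?"))
print(classify_painting("Monet hung the canvas in a gallery."))
```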

I don't know of a specific name for that, but I don't think it should really be a problem, either. All it calls for is that Art/Painting and House Maintenance/Painting are understood as separate entities. Someone searching for "art" gets subcategories of Art, so gets Art/Painting. Someone searching for "house maintenance" gets subcategories of House Maintenance, so gets House Maintenance/Painting. Someone searching for "painting" gets Art/Painting and House Maintenance/Painting, which is appropriate.
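As a rough sketch of treating Art/Painting and House Maintenance/Painting as separate entities, each topic below is keyed by its full path rather than its bare name. The CONTENT_BY_TOPIC mapping, the document titles and search_by_path are hypothetical, and the index is just an in-memory dict.

```python
# Each topic is identified by its full path, so the two Paintings never merge.
# The document titles are placeholders for illustration.
CONTENT_BY_TOPIC = {
    ("Art",): ["intro to art"],
    ("Art", "Painting"): ["impressionism"],
    ("House Maintenance", "Painting"): ["bathroom walls"],
}


def search_by_path(term):
    """Return content for every topic whose path contains `term`.

    A path containing the term is either the topic itself or one of its
    descendants, so searching a parent also returns its subtopics.
    """
    hits = []
    for path, docs in CONTENT_BY_TOPIC.items():
        if term in path:
            hits.extend(docs)
    return sorted(hits)


print(search_by_path("Art"))                # Art and Art/Painting only
print(search_by_path("House Maintenance"))  # House Maintenance/Painting only
print(search_by_path("Painting"))           # both Painting entries
```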

Related

Django Master/Detail

I am designing a master/detail solution for my app. I have searched forever in the Django docs, and also here and everywhere else I could, so I guess the answer is not that obvious, despite being one many people look for, not only in Django but in every language, I think.
Generally, in most cases, the master already exists: for example, the Django docs illustrate the Book example, where we already have an Author and we want to add several Books for that Author.
In my case, the parent is not yet present in the database. Think of a purchase order, for instance.
I have thought to divide the process in two steps: the user would start to fill in the info for the master model (regular form) and then proceed to another view to add the lines (inline formset). But I don't think this is the best process at all - there are a lot of possible flaws in it.
I also thought about creating a temporary parent object in a different table and only having a definitive master when the children are finally created. But it still doesn't look clean.
Because of that, for my app it would be ideal to create the master object at the same time as the detail objects (lines) - again, like an order.
Is there a way where I can have the same view to manage both master and detail? Like this I would receive both in the same POST request and it would make a lot more sense, not to say it would be much cleaner.
Sorry if it's too long, and thank you in advance!

So I found out that in my case the process could actually be split in two phases.
For this I simply use the traditional model form and inline formset.
But! I also found out that there could be several answers to this:
We could get crazy and build some spaceship in AJAX that would get the job done, simply by sending a JSON object (in which the lines could be an array of objects).
Django also has its ways, and it's possible to send multiple forms in the same request! (Thank you @mousetail for the tip.)
Of course, be that as it may, there are many ways to build a house; these are just the ones I found.

Google Datastore ancestor query returning data too far down

I have an "inbox/messaging" structure that I'm working on, that allows for multiple kinds of parents. As in, people can leave comments on a few different kinds of objects. For this example, let's say someone is leaving a comment on an Article object.
The way we've formatted our data, the comment is created as a Message object, and that object is a child of Article (and Article is a child of Account). So when we query for the list of messages, we simply ask for all Messages that are children of that instance of Article. That looks like this:
Message.query(ancestor=source_key)
source_key here is the Key of the article we're viewing.
Great, this works really well and is pretty fast.
Now we want to add replies to those Message objects. I figure we'll just store replies the same way we add Messages to Articles. Which is to say, a Reply is simply another instance of Message, and the parent of that reply object is the message it's replying to. So basically, instead of leaving a comment on an Article, you're leaving a comment on a Message.
This sounds good on paper but it seems that in practice, the Key it ends up getting is structured like so:
Key('Account', 5629499534213120, 'Article', 5946158883012608, 'Message', 6509108836433920)
As it turns out, when we query for the list of messages, it returns the replies as well, as if they weren't replies at all.
Some questions:
Is there any way we can do like a "shallow" query? To strictly get only the immediate children of that Article?
I've read more on how ancestor queries work and because ancestor queries have a 1 write per second limitation, I'm now wondering if it may be better to change how we store this to where a Message is not the child of Article, and instead maybe have a KeyProperty of Article exist on Message, if that makes sense. And maybe no parent for Message. There could be lots of people leaving a comment on an article, or also lots of people leaving replies to those comments. But even so, Article is a child of Account too, along with a lot of other kinds of objects, and generally we don't run into any issues with lots of different writes. So would we even run into this write limitation?
EDIT: I've moved on a little bit and am trying to query only replies for a given message, so I'm looking for all messages that have a parent (ancestor) of another message.
So given this key as the ancestor: Key('Account', 5629499534213120, 'Article', 5946158883012608, 'Message', 5663034638860288)
I query our message table, and I get back that exact same key (as well as other messages). How is that possible? If I'm specifying an ancestor, in what world does it make sense that I would get back the same object I'm using to query the ancestor with? The parent of that message is just:
Key('Account', 5629499534213120, 'Article', 5946158883012608)
So, obviously the ancestor doesn't strictly match there. Why would my query return it then? Hastebin of what, basically, I'm running into: https://hastebin.com/karojolisi.py

Regarding the question on the write limitation: if you are using Cloud Firestore in Datastore mode, then the limit of 1 write per second applies per entity, not per entity group.
See https://cloud.google.com/datastore/docs/firestore-or-datastore
"Writes to an entity group are no longer limited to 1 per second."
and https://cloud.google.com/datastore/docs/concepts/limits
"Maximum write rate to an entity" is "1 per sec"
So, irrespective of which approach you take, with Datastore mode, writes shouldn't be a concern, as the messages and replies are not expected to be edited. The exception is if you keep any kind of aggregate information, like the number of replies for a given message, which requires updating the parent message record with each child record.
Regarding your main question of querying only the messages for an article and not their replies, one option is to have a field called article_id and populate this only for the top level messages and have this also in the index (prefix of the ancestor composite index). The reason to recommend article_id and not a boolean is, since this field is indexed, it is better to have the field not be based on a narrow range of values.
The reason to prefer this approach to storing the messages in a separate table is that all messages belonging to an article will be stored close by with the initial approach and that is better for read performance.
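On the "shallow query" part of the question, it may help to see why the ancestor query behaves the way it does. The sketch below does not call Datastore at all: it models keys as flat tuples of (kind, id, kind, id, ...) and mimics an ancestor query as a prefix match on that path. The IDs and helper names are made up. An ancestor filter matches the given entity itself as well as all of its descendants, which is why both the replies and the message used as the ancestor come back. With ndb, one possible (untested here) workaround along the same lines would be to post-filter results by comparing each result's key.parent() with the ancestor key.

```python
# Keys modelled as flat paths; the IDs are small made-up numbers.
ARTICLE = ("Account", 1, "Article", 2)
MESSAGE = ARTICLE + ("Message", 10)
REPLY = MESSAGE + ("Message", 11)   # a reply is a Message whose parent is a Message
OTHER = ARTICLE + ("Message", 12)
ALL_KEYS = [ARTICLE, MESSAGE, REPLY, OTHER]


def ancestor_query(keys, ancestor, kind):
    """Mimic an ancestor query: the entity's kind matches and its key path
    starts with the ancestor path (so the ancestor itself can match too)."""
    return [k for k in keys if k[-2] == kind and k[:len(ancestor)] == ancestor]


def immediate_children(keys, ancestor, kind):
    """Keep only results whose direct parent is exactly the ancestor."""
    return [k for k in ancestor_query(keys, ancestor, kind) if k[:-2] == ancestor]


# All messages under the article, replies included.
print(ancestor_query(ALL_KEYS, ARTICLE, "Message"))
# Top-level messages only.
print(immediate_children(ALL_KEYS, ARTICLE, "Message"))
# Using a message as the ancestor returns that message itself plus its reply.
print(ancestor_query(ALL_KEYS, MESSAGE, "Message"))
# Replies only.
print(immediate_children(ALL_KEYS, MESSAGE, "Message"))
```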

How to extend the event/occurrence models in django-scheduler

I'd like to augment events/occurrences in django-scheduler with three things:
Location
Invitees
RSVPs
For Location, my initial thought was to subclass Event and add Location as a foreign key to a Location class, but my assumption is that each occurrence saved won't then include Location, so if the location changes for one occurrence, I'll have nowhere to store that information.
In this situation, is it recommended to create an EventRelation instead? Will I then be able to specify a different Location for one occurrence in a series? The EventRelation solution seems untidy to me, I'd prefer to keep models in classes for clarity and simplicity.
I think Invitees is the same problem, so presumably I should use a similar solution?
For RSVPs, I intend to make an RSVP class with Occurrence as a foreign key, and as far as I can tell that should work without any issues as long as I save the occurrence before attaching it to an RSVP?
I've read all the docs, all the GitHub issues, various StackOverflow threads, the tests, the model source, etc, but it's still unclear what the "right" way to do it is.
I found a PR which introduces abstract models: https://github.com/llazzaro/django-scheduler/pull/389 which looks like exactly what I want, but I'm reluctant to use code which was seemingly abandoned 18 months ago as I won't get the benefit of future improvements.
EDIT: I'm now thinking that another way to do this would be to have just one object linked to the event using EventRelation, so I'd have an "EventDetails" object connected to the Event via EventRelation, then include FKs to Location, Guests, etc from that object.
I should then also be able to subclass my EventDetails object with different kinds of events and attach those too. I'll give it a go and see if it works!

Just in case anyone finds this and is wondering the same thing: I ended up ditching django-scheduler and using django-recurrence instead. Had to do a bit more work myself, but it was easier to create the custom event types that I was looking for. Worked pretty well!

Sitecore Content Tree Architecture

Let's say there exists a presentational component in a project that renders an unordered list (called ListRenderer, perhaps.) We have a couple options of supplying data to any given ListRenderer on a page:
Have a TreeList (or TreeListEx) field on the content item, and have ListRenderer read from it.
Supply a DataSource (or other Parameter) to the ListRenderer via the presentation details.
I usually avoid #1 in my projects because it binds Sublayouts to templates, which gets quite messy. If you go down that path, eventually you'll have fields to support every potential sublayout in your project.
So my solutions tend toward option #2, which gets rid of that problem. It does, however, come with its own bag of questions. Where do I put these various "Lists" for a given ListRenderer to use? To maximize reuse and sharing, I usually create a components directory near the site root that contains all these types of things, if I predict the Lists will be shared. This seems less findable and harder to use for the content author, who suddenly has no idea where the source for their ListRenderer is unless they know how to crack open the presentation details (which is slightly advanced for my average user).
If I feel like Lists won't be shared, and are very specific to the page, I'll put them directly underneath the item in question. This has a tendency to muddle up the content tree, though, and any dynamically generated navigation sublayout then has to check for whether or not an item is an actual page before it generates the link to it. The more I work in Sitecore, the less I use this approach, but it seems easier for the content author. There is much easier access to information when you use this approach.
Is there any industry-accepted way of approaching this problem? It happens in projects all the time, and in my head I struggle to balance technical and content authorship concerns in situations like these.

Great question. I've used all the techniques you mentioned, depending on the audience and specifics of the project. The problem is that, as with all things Sitecore, they are all valid ways of achieving the same goal and you will struggle to find one answer that will work in every situation.
I almost always use #2 as well, but some content author retraining may be necessary, and make sure you add restrictions on what the content author is able to select as a target. I have (within the same project) structured the items near the root (in a shared content folder) and under the item in question, depending on what I felt would provide the best context.
Also, if other child pages would exist below the item as well as the list items, then I would put the list items in a separate folder (with a common "list items" icon) and re-order it to be the first item for separation and clarity.
If you want to use any kind of personalization and DMS then you will need the ability to switch out the datasource anyway so you shouldn't hard code locations.
You might also (if you have not already) want to consider using:
Convert Data Source Paths to IDs Using the Sitecore ASP.NET CMS
- Useful if you need to restructure your content at a later date
Queryable Datasource Locations
- Useful for multi-site situations when you need to make clones, or setting as the default datasource value in Standard Values when the lists are directly below the item but gives you the flexibility to change it.
I prefer using queryable datasources personally; I find the XPath syntax more logical.

As Mark has commented, there is no real industry standard.
I feel like this is something that needs improvement.
Especially when you are using the DataSource option, things become less transparent to the editors and as the size of the site grows, so does the complexity.
All I can tell you is how I would do it, which is most likely much like how you are doing it.
1) For overview pages like news, events and faq items, I will put the items underneath the overview item and use the NewsMover shared source module to auto-create a hierarchy.
2) I will create a Global site that contains items that are shared across sites or pages. DataSource items for components will be put in here.
3) For components that are present on the standard values, I will add a list field to the template (for example, when you display related items on a content page)
Most often it's a logical choice and sometimes it's just a matter of taste.
I'd like to add that I've written a blog post on how to have datasource items created automatically for components that are set on standard values. That might help you if you are using those.
Edit:
"I usually avoid #1 in my projects because it binds Sublayouts to templates, which gets quite messy. If you go down that path, eventually you'll have fields to support every potential sublayout in your project."
Today I've blogged about a method of hiding fields and sections in the content editor if there is no sublayout set on the item that requires those fields, which helps to prevent the mess of having a lot of unused fields on your items.

how to make next/previous buttons to toggle between gql query results

Say I have a website that has 100 products. Then this is filtered down to 5 sections containing 20 products each. If you were in one of the sections that contained 20 products (e.g. toys), what would be the optimal method to display only 5 toys per page? At the bottom of the list would be next/previous buttons to show the next/previous set of 5 toys.
A better analogy would be google search. There are millions of results but only ~10 are shown at a given time.
So right now I'm using google app engine (python) and django templates. One way I thought of to remedy this problem would be making all the query results go into a div which could then be modified through javascript to give a similar effect. However, if someone were to click their browser's back button, they wouldn't go where they originally came from.
Thanks in advance. Any help would be useful...I don't know what this technique is called so google hasn't been really useful :(
Edit: based on responses, I found my question was solved here: How to use cursor() for pagination?

Look into query cursors. They are made to be serialized and sent to the client, to be used in creating "next" and "previous" paging requests.
NOTE: don't use offset on queries. This can be VERY expensive, as it actually fetches (and charges) all entities up to offset+limit position, but returns to application only limit results.
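To show the shape of cursor-based paging without depending on App Engine, here is a small in-memory simulation. It is not the Datastore cursor implementation: the cursor here is simply the last key of the previous page, base64-encoded so it can travel in a URL, and TOYS, encode_cursor, decode_cursor and this fetch_page are hypothetical stand-ins. The (results, next_cursor, more) return shape is chosen to resemble what ndb's fetch_page returns.

```python
import base64
import bisect

# 20 made-up product keys, already sorted.
TOYS = ["toy-%02d" % i for i in range(1, 21)]


def encode_cursor(last_key):
    """Serialize the last-seen key into a URL-safe string."""
    return base64.urlsafe_b64encode(last_key.encode("utf-8")).decode("ascii")


def decode_cursor(cursor):
    return base64.urlsafe_b64decode(cursor.encode("ascii")).decode("utf-8")


def fetch_page(items, page_size, cursor=None):
    """Return (results, next_cursor, more) for a sorted list of keys.

    Resumes right after the key stored in the cursor instead of counting
    an offset from the start of the result set.
    """
    start = 0
    if cursor is not None:
        start = bisect.bisect_right(items, decode_cursor(cursor))
    results = items[start:start + page_size]
    more = start + page_size < len(items)
    next_cursor = encode_cursor(results[-1]) if results else cursor
    return results, next_cursor, more


page, cursor, more = fetch_page(TOYS, 5)
print(page, more)
page, cursor, more = fetch_page(TOYS, 5, cursor)
print(page, more)
```

The "next" link would then carry the cursor string as a query parameter, which also keeps the browser's back button working, since every page has its own URL.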

I'm not sure that putting all the results as hidden content in the HTML and manipulating it using JS is a very good idea if you might have a large result set (think about what would happen if Google used this approach). There's also the back-button issue that you mentioned.
So, as for querying a wanted "results page" each time, I think Google's GQL Reference might help you. Take a look specifically at the LIMIT clause: it can help you create the paging mechanism you're looking for by supplying it with the number of items per page as "count" and the number of items previously viewed as "offset" (0 on the first call).
As for displaying, I think that the Google Images / Facebook News Feed approach might also be interesting to think about (loading on scroll instead of paging), but that's a matter of your personal choice :)
Hope this helps, good luck!
EDIT: After reading Peter's answer, I found it much more efficient to use cursors for pagination, a good reference is given in his answer.
