How do I find relevance between documents while creating a multi-document summary? - data-mining

I want to generate a multi-document summary. I have already generated a single-document summary of each document. In some research papers, a multi-document summary is generated by appending all the input documents into one single document and then producing a single-document summary of that combined text. I have instead done it by combining the summaries of each input document, but I am not satisfied with the result. I am following the traditional approach to finding the relevance between the summaries of the individual documents, i.e. TF-IDF. Am I working correctly, or should I follow a different approach?
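For what it's worth, here is a minimal sketch of that traditional TF-IDF comparison, assuming each single-document summary is already available as a plain string (scikit-learn is used here purely as one convenient implementation, not as the only option):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical single-document summaries; substitute your own.
summaries = [
    "summary of the first document ...",
    "summary of the second document ...",
    "summary of the third document ...",
]

# Build TF-IDF vectors over the summaries.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(summaries)

# similarity[i, j] is the cosine similarity between summaries i and j;
# high values suggest redundant content, low values suggest novel content.
similarity = cosine_similarity(tfidf)
print(similarity)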

Related

eXist-DB and XQuery: xi:includes or collections (TEI-XML)?

I have a corpus in TEI-XML which uses a 'master' corpus XML document that then contains, via xi:include, thousands of other documents. Each of these documents in turn contains xi:includes to master lists of named entities (people, places, etc., linked by xml:ids). All of this works very well in XSLT (and in my IDE, Oxygen, for fast encoding).
I am now embarking on building a website using eXist-DB applications. I am rewriting everything directly in XQuery (to replace XSLT), and I have run into an unexpected decision. I am used to using xi:includes to traverse the corpus and the various XML files. But reading the eXist-DB documentation, it seems that the encouraged practice is to use collections and query them directly, instead of navigating via xi:includes. It also seems that eXist-DB does not fully implement xi:include anyway and requires some workarounds?
I am looking for guidance as to best practices of eXist-DB/Xquery in this context.
Many thanks in advance.
Correct, eXist's XInclude implementation is focused on output (i.e., serialization) rather than on querying or indexing. As eXist's documentation page on XInclude states:
The XInclude processor is implemented as a filter in between the serializer's output event stream and the receiver... XInclude processing is therefore applied whenever eXist-db serializes an XML fragment, whether it's a document, the result of an XQuery or an XSLT stylesheet.
Thus, if you use XInclude to assemble your corpus and you want to query/traverse this corpus, you could do so by (1) writing a query to read your XInclude and following it like a map to find the component documents, (2) pre-serializing your data into a new document and then querying the resulting document directly, or (3) placing the documents into collections that facilitate the kinds of queries you want to do.
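As a rough illustration of option (2), assuming the corpus files are accessible on disk outside eXist, the XIncludes could be pre-expanded with a tool such as Python's lxml before loading the flattened result into the database (the file names here are hypothetical):

from lxml import etree

# Hypothetical paths: parse the 'master' document, resolve its xi:includes
# in place, and write out one flattened document that can then be stored
# in a collection and queried directly.
tree = etree.parse("corpus/master.xml")
tree.xinclude()
tree.write("corpus/master_expanded.xml", encoding="utf-8", xml_declaration=True)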
Depending on the size of those thousands of documents, traversing the xi:includes when running XQueries tends to be slow and quite memory-intensive. In my experience, Joe's option (3) is usually the way to go.
Unlike with straight-up XSLT, in eXist-db you can define indexes. Say, for example, you have a <listPerson> element as a wrapper for thousands of xi:includes pointing to <person> elements, each the root of its own document.
If you have defined an index for <person>, you can use e.g. ft:query() to query the index directly, irrespective of where in the tree of sub-collections and documents the element is located. This tends to be orders of magnitude faster than traversing the whole document starting at the master and resolving xi:includes.
As for validation, you will need to decide if a full validation run of the whole expanded document is really always necessary. This requires some fiddling, but there isn't much general advice I can offer without seeing the actual files and code.
You can find more information about indexing in eXist in the documentation.

How to get most similar words to a document in gensim doc2vec?

I have built a gensim Doc2vec model. Let's call it doc2vec. Now I want to find the most relevant words to a given document according to my doc2vec model.
For example, I have a document about "java" with the tag "doc_about_java". When I ask for similar documents, I get documents about other programming languages and topics related to java. So my document model works well.
Now I want to find the most relevant words to "doc_about_java".
I followed the solution from the closed question "How to find most similar terms/words of a document in doc2vec?", but it gives me seemingly random words; the word "java" is not even among the first 100 similar words:
docvec = doc2vec.docvecs['doc_about_java']
print(doc2vec.most_similar(positive=[docvec], topn=100))
I also tried this:
print(doc2vec.wv.similar_by_vector(doc2vec["doc_about_java"]))
but it didn't change anything. How can I find the most similar words to a given document?
Not all Doc2Vec modes even train word-vectors. In particular, the PV-DBOW mode (dm=0), which often works very well for doc-vector comparisons, leaves word-vectors at randomly-assigned (and unused) positions.
So that may explain why the results of your initial attempt to get a list-of-related-words seem random.
To get word-vectors, you'd need to use PV-DM mode (dm=1), or add optional concurrent word-vector training to PV-DBOW (dm=0, dbow_words=1).
(If this isn't the issue, there may be other problems in your training setup, so you should show more detail about your data source, size, and code.)
(Separately, your alternate attempt, doc2vec["doc_about_java"], retrieves a word-vector for "doc_about_java" (which may not be present at all). To get the doc-vector, use doc2vec.docvecs["doc_about_java"], as in your first code block.)
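Putting those points together, here is a hedged sketch; the toy corpus is purely illustrative, only the "doc_about_java" tag comes from the question, and newer gensim versions expose the doc-vectors as model.dv rather than model.docvecs:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus for illustration only; use your real tagged documents.
corpus = [
    TaggedDocument(words=["java", "virtual", "machine", "bytecode"], tags=["doc_about_java"]),
    TaggedDocument(words=["python", "interpreter", "scripting"], tags=["doc_about_python"]),
]

# PV-DBOW plus concurrent word-vector training (dm=0, dbow_words=1);
# alternatively, dm=1 selects PV-DM. Without one of these, the
# word-vectors stay at their random initial values.
model = Doc2Vec(corpus, dm=0, dbow_words=1, vector_size=100, min_count=1, epochs=40)

# Retrieve the doc-vector (not model["doc_about_java"], which is a word lookup),
# then ask the word-vector space what lies closest to it.
docvec = model.docvecs["doc_about_java"]
print(model.wv.most_similar(positive=[docvec], topn=5))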

What is the corresponding FACS Action unit name to the Affdex Emotion SDK expressions output?

I wonder where I can find a full list of the Action Unit names detected by Affdex. I have manually identified some: browfurrow (AU4), browraise (AU2), chinraise (AU17). However, an official reference document would be a better choice for me. Thanks
We follow the same FACS system for naming the AUs; you can find the list of Action Units on Wikipedia. For a more visual representation, you can refer to Academic Pages.
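For quick lookups in code, the handful of mappings already identified in the question could be kept in a simple table like the sketch below (the coverage is deliberately partial and the expression names are taken verbatim from the question; treat the official Affdex documentation as the authoritative source):

# Partial, illustrative mapping built only from the AUs named in the question above;
# the official Affdex documentation remains the authoritative reference.
AFFDEX_EXPRESSION_TO_FACS_AU = {
    "browfurrow": "AU4",   # FACS: Brow Lowerer
    "browraise": "AU2",    # FACS: Outer Brow Raiser
    "chinraise": "AU17",   # FACS: Chin Raiser
}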

Where Can I Find Details On The <ivy-report> XML Format?

I need to customize the dependency report from Ivy, and am trying to find more details on the format of the XML file that is the data source for any report.
I have found the Ivy Report documentation, and have easily provided my custom XSL template to generate the report. Indeed, my report works fine.
But I want a better understanding of the expected format of this file. Most of the tags and attributes are obvious, but there are some that I am just making assumptions about.
Is there a schema for this report file? Or some place that has an explanation of the various attributes?

Is there a way to count tags on a physical (PDF) page using XSL-FO?

Here is the scenario: I have an XML document which contains tags. I want to create a transform that does this:
<tag>content A</tag> 1. content A
<tag>content B</tag> ----> 2. content B
<tag>content C</tag> 3. content C
but only if the tag contents appear on the same physical page. The numbering should restart on each new page. Is there any way to do this using XSL-FO? I know that with LaTeX the only way to accomplish something like this is to run LaTeX twice, with the interim document used to determine content page placement.
As far as I can tell (and as confirmed by the Antenna House tech support team), there is no way to do this using standard XSL-FO. Antenna House offers <axf:footnote*/> extensions which include the ability to set an axf:footnote-number-reset="page" attribute, and as suggested in the comments, RenderX offers a generic mechanism which might be used for this purpose, but both of these involve vendor-specific extensions to the language.
This points to a number of shortcomings in XSL-FO that really should have been addressed a long time ago with a 2.0 version of the specification. A W3C committee to develop an XSL-FO 2.0 spec was formed and then disbanded quite some time ago; I have no idea why, as I find the tool indispensable for a large class of document-to-PDF conversions.