Is there a "code free" way to get SOLR/LUCENE (or something similar) pointed at a set of Word docs to make them quickly searchable by a user?
I am prototyping a system to search through some homegrown news articles, to see whether there is any value in it. Before I stand up code to handle search string input and document indexing, I wanted to see whether it was even worth trying to figure it all out.
Thanks,
Judd
Using the bin/post tool of Solr and the Tika handler (named the ExtractingRequestHandler), you should be able to get something up and running for prototyping rather quickly.
See the introduction of Uploading Data with Solr Cell using Apache Tika. Tika is used to process a wide range of different document types.
You can give the Solr post tool a directory or a list of files to submit to the index.
Automatically detect content types in a folder, and recursively scan it for documents for indexing into gettingstarted.
bin/post -c gettingstarted afolder/
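Once the folder is indexed, you can try searches without writing a UI by calling Solr's select handler over HTTP. Below is a minimal sketch, assuming Solr runs locally on the default port 8983 and the collection is the gettingstarted one from above; the function names are made up for illustration, and the fields that come back depend on what Tika extracted.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a local Solr instance on the default port.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(collection, query, rows=10):
    """Build the URL of a /select request for a simple keyword query."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return "%s/%s/select?%s" % (SOLR_BASE, collection, params)


def search(collection, query, rows=10):
    """Run the query and return the list of matching documents."""
    with urlopen(build_select_url(collection, query, rows)) as response:
        return json.load(response)["response"]["docs"]
```

With Solr running, something like search("gettingstarted", "budget meeting") should return a list of dicts, one per matching document.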
I have a Django project where I used Sphinx to create my documentation. I went through sphinx-apidoc and ran 'make latexpdf'. The resulting documentation has quite a few lines that flow out of the margin. On top of the margin issues, lines in the index start overflowing onto each other.
Overflowing Lines
Margin Issues :(
Is there an easy way to fix these issues (or an easier way to create PDF documentation)?
ELI5 if possible (I'm not well-versed in LaTeX)
The overflowing lines in the index should improve if you add this to conf.py:
latex_elements = {
    'printindex': '\\footnotesize\\raggedright\\printindex',
}
Or you can switch to the Japanese language setting, which does something like that (even better) out of the box thanks to its special document class ;-)
TeX does not always know by itself where to insert line breaks; after all, what it is good at is hyphenating natural language. But as pointed out in the comments, Sphinx has coerced LaTeX into handling long code lines better since 1.4.2.
Since the recent 1.5.3 release, users can customize page margins; see http://www.sphinx-doc.org/en/stable/latex.html#the-sphinx-latex-style-package-options for documentation of hmargin and vmargin, which can be configured via 'sphinxsetup'.
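For reference, here is a sketch of how the margin options and the index fix could sit together in conf.py. The 1in values are only placeholders I picked for illustration, and this assumes Sphinx 1.5.3 or later; both keys need to go into the same latex_elements dict, since a second assignment would overwrite the first.

```python
# conf.py (fragment)
latex_elements = {
    # Smaller, ragged-right index so entries stop running into each other.
    'printindex': '\\footnotesize\\raggedright\\printindex',
    # Page margins; the 1in values are placeholders, adjust to taste.
    'sphinxsetup': 'hmargin={1in,1in}, vmargin={1in,1in}',
}
```

Wider margins give long lines less room, so if lines still stick out, try smaller values first.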
Okay SO. I need some guidance. I apologize for the length of this post, but I need to provide some details:
I've got someone who is interested in having me do a small project for them. The application in general is a fairly straightforward employee record keeping / documentation app, but it makes pretty heavy use of templated Word and Lotus documents. The idea is you select the employee “event” such as commendation, promotion, discipline, etc., and it loads the appropriate template doc and you fill it in from there, and later you can select an employee, view all the “events,” and view the individual documents associated with each one.
Thus, the app must know where the .docs are saved when the user is done.
The client actually has a v1 of this app (it doesn’t do any management of the files or anything, just launches Word/Lotus with the document you wanted to view in a new instance, presumably via a system() call). We’ve not gotten into a detailed requirements phase, but the client and I agree that for this to really work, some kind of control over where the user saves the .docs is going to be critical, because otherwise the app provides them with the new copy of the template doc, they "Save as" somewhere else, and the app is left pointing to the blank copy it provided them with.
Obviously, I can’t think of a way to achieve “Save as” restriction/control in any way via just launching a new instance of Word. The client has the idea of an embedded Word/Lotus instance in the app with the template doc when you choose one, but I have a few reservations about that:
I’ve dug around online and I’ve read that whichever version of Word I borrow MSWORD.OLB from will be the one the end user would require?
I’ve tried to do the MSDN example of embedding a Word doc from here, but as I’ve come to get used to, the MSDN example doesn’t even compile.
Even if I CAN figure out how to embed a .doc file into their application, I don’t know that I could control the use of “Save as…”
All of this STILL hasn’t touched on Lotus (!)
So… instinctively, I feel the embedded Word/Lotus thing has to be more work than it’s worth in the end.
So I’ve had a few other ideas brewing around.
One is looking into using Office XML (and if there’s a lotus equivalent), and get the user’s “inputs” separately and generate the document on the fly each time. I’m not particularly thrilled with that idea, but I think it COULD work, provided I just use old features to try and stay far backwards compatible.
Get user’s “inputs” separately and generate a document in HTML. Meh. Works, very cross platform and easily parsed and understood, but not good if you want to be able to email it to someone (who emails a .html? Works, yes, very unconventional which to the average user will throw them off) and even worse if you need to email it to someone for revisions…
Perhaps some kind of editable PDF? I know there are PDF libraries out there, and the more I stew on it, the more this sounds like the best option, though I’ve not done much work with PDFs and I don’t know how easily embeddable they are / what options one has when creating them. I know they can be save-disabled, I’ve had that with my bloody state taxes before.
I need some input here. Here’s the TLDR questions:
Is launching a new instance of Word for each .doc as bad as I feel, given user can “Save as” document wherever and then application is left pointing to a blank document?
Is trying to support embedded Word as big of a trouble as I feel like it is / more work than it’s worth / likely to cause problems with supporting multiple versions of Word? (Forward compatibility as well as currently released versions?)
What are thoughts on the PDF plan?
Any other good ideas?
Word does allow for programming some "Save" and "Save As" control via its object model. Any subroutines coded in VBA and placed into your Word template will be copied into all documents generated from that template. Additionally, most menu and Ribbon commands can be intercepted by creating a module containing subroutines named for the intercepted commands. So, for example, if a module contains a sub named FileSaveAs(), any code in that sub will be executed instead of the standard File|Save As command. Lastly, this code will replace Save As commands executed via keystroke, toolbar, menu, or Ribbon.
The code below will launch a dialog box to a predetermined path whenever a "Save" or "Save As" command is executed:
Sub FileSave()
    ControlSaveLocation
End Sub

Sub FileSaveAs()
    ControlSaveLocation
End Sub

Sub ControlSaveLocation()
    Dim Directory As String
    Directory = "C:\Documents\"
    With Application.Dialogs(wdDialogFileSaveAs)
        .Name = Directory
        .Show
    End With
End Sub
Hope this helps.
I'm going to manage some documentation using Django (I come from Sphinx) in order to have more control over the output. The docs are in rst (reStructuredText) in a git archive, and it's trivial to display them in HTML using a filter. My problem is that they are quite long, and I'd like to have more control over how the pagination goes, so I can show a single section per HTML page, have comments for a single section and so on...
My goal would be to be able to parse each doc and create my TOC with links to each section on a separate HTML page, where a view would go through the whole doc to render just one section in HTML.
I understand that it's mostly an issue of docutils. The most interesting example I've been able to find is http://www.ibm.com/developerworks/library/x-matters24/#code2 but it seems outdated, and the examples in the section "Tree-oriented processing", which is where the magic happens, don't seem to work with my version of docutils. The article is good: I could use more on the same subject!
Is there something similar to what I'm planning to do already available that I can study, or maybe could someone point me to a gentle introduction to docutils for parsing rst documents?
Here is a blog post describing how to make a custom rst writer and call it from Django. I think it should give you a good start: http://www.arnebrodowski.de/blog/write-your-own-restructuredtext-writer.html
Pygments has a ReST lexer that you could examine (or possibly even use directly).
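To get a feel for the docutils document tree before writing a custom writer, here is a small sketch that parses an rst string and lists its sections, which is roughly what a TOC view would need. It assumes docutils is installed; SAMPLE_RST, iter_sections and list_sections are names invented for this example. Note that docutils promotes a lone top-level title to the document title, so "Guide" below is not reported as a section.

```python
from docutils import nodes
from docutils.core import publish_doctree

# Made-up sample document: one title, two sections.
SAMPLE_RST = """\
Guide
=====

Intro paragraph.

Install
-------

Run the installer.

Usage
-----

Start the app.
"""


def iter_sections(node):
    """Yield every section node below `node`, in document order."""
    for child in node.children:
        if isinstance(child, nodes.section):
            yield child
            yield from iter_sections(child)


def list_sections(rst_source):
    """Return (id, title) pairs that a TOC view could link to."""
    doctree = publish_doctree(rst_source)
    # A section's first child is its title; its first id is a URL-friendly slug.
    return [(s["ids"][0], s[0].astext()) for s in iter_sections(doctree)]
```

A Django view could then look up the section whose id matches the URL and render only that node.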
I have a book project which I'd like to start sooner rather than later. This would follow an agile-like publishing workflow, i.e. publish early and often. It is meant to be self-published by me and I'm not really looking to paper-publish it, even though we never know.
If I weren't a geek, I'd probably have already started writing in Word or any other WYSIWYG tool and just exported to PDF. However, we know it is not the best solution, and emacs rules my text-editing life, so the source format should be as simple as possible and be text-based.
I've thought about the following options:
Just use orgmode and export to PDF (orgmode has this feature natively)
Use markdown mode and export to PDF (markdown->LaTeX->PDF should not be hard to set up);
Use something similar to what the guys at Pragmatic Programmers do: XML + XSLT + LaTeX.
More complex, but much more control over the style.
EDIT: Someone just told me that he uses a combo of Textile+Adobe In Design and the XTags plugin. Not sure how they are glued together though, gotta do some research.
Any other ideas / references ?
I want to start writing as soon as possible. In fact, I already have a draft in an org-formatted file. However, I do want to have and use the full power of LaTeX later on to format it the way I want and make it look fabulous :)
Thanks in advance,
Marcelo.
I have done a TON of research on this lately, since I'm planning on starting my own small press soon.
It really depends on what you want your final output to be (PDF, HTML, other?), and what the book is about.
Org mode is great, as I'm sure you know, because it expands as you do. I often write my outlines in org mode, then just fill in the body text when I'm really ready to start writing.
IF it's prose, and you just need some simple divisions (chapters and sections and not much else), org mode -> latex should do you just fine. Then you also have the possibility of org mode -> html
IF you need math in it, you can just write the math right in the org mode file.
If it's really really technical information, docbook might be nice (emacs + nxml), then docbook 4.5 -> jade -> jadetex -> pdf.
I'd stay away from docbook 5, because it uses FOP to generate PDFs, and the typesetting is really inferior to latex.
BOTTOM LINE: If you want a PDF, use org -> latex, the path of least resistance ;) -- whatever you do, concentrate on the content of the book first, and worry about what it looks like afterward.
And why not paper publish? Have you looked at lulu.com? I recently formatted a book with latex, uploaded the pdf to lulu, and had them print it. The quality is pretty good, and definitely worth a look. I have a ton of bookmarks at home about publishing in general, if you're interested.
Typography is hard.
TeX/LaTeX are tools that can get you the best possible results, however they require knowledge about typography to be used correctly--especially with a big document like a book. And I haven't seen any other cheap (=not for professional use) software that would do things correctly automatically. (I haven't seen any professional software, so it is possible they don't do that either)
However, assuming that you'll write your book in some machine-readable format, putting it into TeX/LaTeX should not be very hard: once I had a set of documents in a custom XML format. Proper usage of XSLT, TeXML and LaTeX gave me something I could tweak manually (and this tweaking was necessary!) and get the best possible result.
My advice: prepare content in something that is easy to parse and easy to write in. I'd dismiss XML. Markdown seems to be good choice. This will also allow you to quickly show your work. Then if you decide to make the result better, write some simple script to translate that to TeX (it is not that hard to get basic functionality) and fix things by hand. This might actually be a good exercise to learn TeX.
Don't try to get everything right from the beginning. Firstly get the content, then play with formatting.
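To give an idea of how small such a "simple script" can start out, here is a rough sketch that handles only headings, paragraphs, *emphasis* and **bold**. It is nowhere near full Markdown, and the mapping of # to \chapter, ## to \section and ### to \subsection is just an assumption that suits a book class.

```python
import re

# Characters that LaTeX treats specially, with their escaped forms.
_LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}

# Assumed mapping from heading depth to LaTeX sectioning command.
_HEADINGS = {1: "chapter", 2: "section", 3: "subsection"}


def convert_inline(text):
    """Escape special characters, then convert **bold** and *emphasis*."""
    text = "".join(_LATEX_ESCAPES.get(ch, ch) for ch in text)
    text = re.sub(r"\*\*(.+?)\*\*", r"\\textbf{\1}", text)
    return re.sub(r"\*(.+?)\*", r"\\emph{\1}", text)


def markdown_to_latex(source):
    """Translate blank-line separated blocks into LaTeX body text."""
    out = []
    for block in re.split(r"\n\s*\n", source.strip()):
        heading = re.fullmatch(r"(#{1,3})\s+(.+)", block)
        if heading:
            command = _HEADINGS[len(heading.group(1))]
            out.append("\\%s{%s}" % (command, convert_inline(heading.group(2).strip())))
        else:
            # Join the lines of a paragraph; LaTeX reflows the text anyway.
            out.append(convert_inline(" ".join(block.split())))
    return "\n\n".join(out)
```

The output is body text only, so it would go into a chapter file that a hand-written main .tex file pulls in, which is also where the manual tweaking happens.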
If you are really wanting to do online only, I would suggest you use org mode and just stay in HTML. Then you can use CSS to style it however you would like.
That being said, if you really want to output to PDF for technical stuff, I would strongly suggest using Docbook (www.docbook.org). It's made for that, it works great with Emacs.
You have already answered the question yourself, not to mention that you have already started writing in org-mode. Org-mode is really extremely powerful and will enable you to publish to PDF and HTML with almost no effort.
In the case of PDF, you can take advantage of LaTeX and the way org-mode handles exports. You can include any LaTeX code in your org file. Also, IMHO it's way better to write the book/article in org-mode, since some things become even easier than in plain .tex files; take tables, for example.
Regarding publishing, it's the same story: with one single function you can trigger exporting to HTML/PDF and uploading to your server. And notice that you are still using just a plain text file, which is human readable and very clean.
Org-mode really follows the Emacs philosophy: just start using it and it will grow with you.
If you are writing a book, it would certainly be worth the overhead of learning tex.
Even something like,
\documentclass[a4paper,10pt]{book}
\title{SERPA'S BOOK}
\author{SERPA}
\date{\today}
\begin{document}
\maketitle
\tableofcontents
\include{chapterA}
\include{chapterB}
\include{chapterC}
\end{document}
Then, in the same directory have files chapterA.tex, chapterB.tex, chapterC.tex that look like
\chapter{My chapter title}
Lorem ipsum dolor sit amet, consectetur adipiscing elit....
That alone will produce an extremely nice looking document. You can edit each chapter separately and then just compile the main tex file. I think if you try to learn intermediate tools that try to abstract away from tex, you'll only make it more difficult later to do what you actually want, because you will be both fighting tex and an abstraction of tex at the same time.
Best of luck on such an undertaking.
Also, no matter what you do, make sure to use some kind of version control system, such as SVN, to manage your files. It will be worth it.
I would write it in Latex and have an online repository that does nightly compiles to PDF of the 'publish-ready' branch, available to readers.
I would not start with using LaTeX these days. TeX input is unstructured and the only thing you can get out of TeX input is PDF. If you need HTML or anything else, you are screwed.
Use something structured, such as XML (DocBook is a good suggestion) or define your own XML subset as you need it. Use XSLT to transform it into something usable (HTML etc.) That way you are set for the future.
Depending on your typographical needs, you can then use TeX as a backend processor, or XSLT or whatever.
Also, have a look at ConTeXt, it can read XML directly and has great typography!
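To illustrate the "define your own XML subset" route, here is a tiny sketch of turning such a file into HTML. It uses Python's ElementTree as a stand-in for XSLT (the standard library has no XSLT processor), and the book/chapter/para vocabulary is invented for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Made-up XML subset: a book with titled chapters made of paragraphs.
BOOK_XML = """<book title="Sample Book">
  <chapter title="Getting Started">
    <para>First paragraph.</para>
    <para>Second paragraph with &lt;markup&gt; characters.</para>
  </chapter>
</book>"""


def book_to_html(xml_text):
    """Map book/chapter/para elements onto h1/h2/p elements."""
    root = ET.fromstring(xml_text)
    lines = ["<h1>%s</h1>" % escape(root.get("title", ""))]
    for chapter in root.findall("chapter"):
        lines.append("<h2>%s</h2>" % escape(chapter.get("title", "")))
        for para in chapter.findall("para"):
            lines.append("<p>%s</p>" % escape(para.text or ""))
    return "\n".join(lines)
```

A second function walking the same tree could emit TeX instead, which is the point of keeping the source structured.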