Solr/Lucene "kit" to test searching? - templates

Is there a "code free" way to get SOLR/LUCENE (or something similar) pointed at a set of word docs to make them quickly searchable by a user?
I am prototyping, seeing if there is value in, a system to search through some homegrown news articles. Before I stand up code to handle search string input and document indexing, I wanted to see if it was even worth it before I starting trying to figure it all out.
Thanks,
Judd

Using Solr's bin/post tool and the Tika handler (the ExtractingRequestHandler), you should be able to get something up and running for prototyping rather quickly.
See the introduction to Uploading Data with Solr Cell Using Apache Tika. Tika is used to process a wide range of different document types.
You can give the Solr post tool a directory or a list of files to submit to the index.
For example, to automatically detect content types in a folder and recursively scan it for documents to index into the gettingstarted collection:
bin/post -c gettingstarted afolder/


How to realize a network-based book query system

Please note that I do not want to use a database.
I am currently learning Unix network programming, and I have my university library's complete book list in separate txt files: for example, books whose titles begin with 'b' are stored in b.txt. Across a-z there are about 1 million records in total, one line per book holding its name and other detailed info.
Now I want to write a program that provides a query service over the book list: given a book name, it returns the detailed info for that book if it exists.
So I first need to build a module that performs the query.
Then I will write the server side, which calls the query module, gets the result, and sends it to the client.
My question is: if I do not use a database, how do I implement the query module in C/C++? The obvious approach is to locate the file by first letter (for example, book names beginning with 'H' would be looked up in H.txt, or H1.txt and H2.txt), open it with fopen, read it line by line, and compare each line against the queried book name with strstr-/strcmp-like functions, returning the result if there is a match. I suspect this is too time-consuming to be practical. Is there any query system I could use as a reference that does not use a database but still answers in bearable time?
There are several options. The cheapest option (low development time, low maintenance, low hardware requirements), IMO, is to create an HTML page on a separate site that links to all the data files. Then you set up another page that uses google.com to search that site, and tell the Google web spider to index your site. That way you get excellent performance with minimal work. But... you don't get to program any C.
Simple solution using C:
Do as you yourself suggest. If you have lots of memory available for file caching the performance won't be so bad unless the load gets high.
There will still be some work to do with the rest of the solution since you should delegate the search to worker threads.
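For reference, a brute-force version of what the asker describes is only a handful of lines. This is just a sketch (in C++ for brevity; the same idea works in plain C with fopen/fgets), and the per-letter file naming and line layout are assumptions:
// Brute-force lookup: pick the per-letter file, then scan it line by line.
// Assumes one record per line with the book name at the start of the line.
#include <cctype>
#include <fstream>
#include <iostream>
#include <string>

// Returns the full record line for `name`, or an empty string if not found.
std::string find_book(const std::string& name) {
    if (name.empty()) return "";
    // e.g. "harry potter" -> "h.txt" (splitting into h1.txt/h2.txt works the same way)
    std::string file(1, static_cast<char>(std::tolower(static_cast<unsigned char>(name[0]))));
    file += ".txt";
    std::ifstream in(file);
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, name.size(), name) == 0)   // book name is a prefix of the line
            return line;
    }
    return "";
}

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: lookup <book name>\n"; return 1; }
    const std::string hit = find_book(argv[1]);
    std::cout << (hit.empty() ? "not found" : hit) << "\n";
    return 0;
}
Every query is a full scan of one file, so it only stays bearable while the files sit in the OS cache and the load is low, which is why worker threads and, eventually, a real index are suggested.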
Intermediate solution using C:
Find a 3rd party search engine and integrate it with your network code.
Advanced solution using C:
Implement your own search engine.
The first question is: why don't you want to use a database?
1. To make it easier to deploy?
SQLite may be a good choice.
2. To try another method?
Lucene is a good choice for information retrieval; it is written in Java.
CLucene is a port of Lucene to C++.
You may also need a stemmer (to get the root form of words), ICTCLAS (for Chinese word segmentation / term extraction), etc.
3. Would you like to do everything yourself?
It is easy to manage text files in the filesystem, but for a "query system" storage is not enough; the main problem is IR (information retrieval).
You should learn about building an index, then storing and querying it.
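To make that last point concrete, even without Lucene/CLucene an exact-title lookup can be served from a simple in-memory index built once at startup. This is a hedged sketch; the a.txt ... z.txt layout and the tab between title and details are assumptions about your data:
// Minimal in-memory index: book title -> full record line.
// Assumes records look like "title<TAB>other info", spread over a.txt ... z.txt.
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>

class BookIndex {
public:
    // Read every per-letter file once at startup.
    void load() {
        for (char c = 'a'; c <= 'z'; ++c) {
            std::ifstream in(std::string(1, c) + ".txt");
            std::string line;
            while (std::getline(in, line)) {
                const std::string title = line.substr(0, line.find('\t'));
                index_[title] = line;
            }
        }
    }
    // O(1) average lookup instead of scanning a file for every query.
    const std::string* find(const std::string& title) const {
        auto it = index_.find(title);
        return it == index_.end() ? nullptr : &it->second;
    }
private:
    std::unordered_map<std::string, std::string> index_;
};

int main() {
    BookIndex idx;
    idx.load();   // around 1 million short records should fit in RAM on a modern machine
    if (const std::string* rec = idx.find("some book title"))
        std::cout << *rec << "\n";
    else
        std::cout << "not found\n";
}
For partial-title or keyword matches you would want an inverted index, which is where Lucene/CLucene earn their keep.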

Extracting key words from HTML with C++ under Linux

I am working on a simple client-server project. The client is written in Java; it sends keywords to a C++ server running under Linux and receives a list of URLs with the best ranks (based on the number of occurrences of the keywords). The server's job is to go through some URLs in search of the keywords and return the best-fitting URLs.
The problem is that I have to parse HTML pages to find occurrences of the keywords, and I also need to extract the links from each visited page so I can search those as well. My question is: what library can I use to do that? Only C++ Linux libraries are suitable for me. There were some similar topics, so I tried to go through most of them, but some of the libraries parse only local HTML files, and I don't want to download every site I visit; I want to parse it on the fly and just store its rank and URL. Others look a bit complicated to me, for instance first parsing the HTML into XML or something else and only then working on the results in C++. Is there something simple and sufficient to do what I need? Any advice will be appreciated.
I don't think regular expressions are appropriate for HTML parsing. I'm using libxml2, and I enjoy it very much - easy to use, portable and lightning fast.
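For illustration only (this is not the answerer's code): parsing an HTML buffer with libxml2's lenient HTML parser and pulling the links out with XPath might look roughly like this. Visible text for keyword counting can be fetched the same way with an expression such as //body//text().
// Rough sketch: extract all <a href="..."> values from an HTML buffer with libxml2.
// Build with: g++ links.cpp $(xml2-config --cflags --libs)
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

std::vector<std::string> extract_links(const std::string& html) {
    std::vector<std::string> links;
    // Lenient parsing flags, since real-world HTML is rarely well formed.
    htmlDocPtr doc = htmlReadMemory(html.data(), (int)html.size(), nullptr, nullptr,
                                    HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (!doc) return links;
    xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
    xmlXPathObjectPtr res = xmlXPathEvalExpression(BAD_CAST "//a/@href", ctx);
    if (res && res->nodesetval) {
        for (int i = 0; i < res->nodesetval->nodeNr; ++i) {
            xmlChar* href = xmlNodeGetContent(res->nodesetval->nodeTab[i]);
            if (href) { links.emplace_back((const char*)href); xmlFree(href); }
        }
    }
    if (res) xmlXPathFreeObject(res);
    xmlXPathFreeContext(ctx);
    xmlFreeDoc(doc);
    return links;
}

int main() {
    // Read a page from stdin (e.g. piped from curl) and print its links.
    std::string html((std::istreambuf_iterator<char>(std::cin)),
                     std::istreambuf_iterator<char>());
    for (const std::string& url : extract_links(html))
        std::cout << url << "\n";
}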
To get URLs from the web using C/C++ you could use the libcurl library. To parse URLs and other reasonably simple things out of the page you can use a regex library.
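A minimal sketch of the libcurl part, fetching the page straight into memory (so nothing is saved to disk) and ranking it with a naive occurrence count; swap in a regex or the libxml2 approach from the previous answer for proper link extraction:
// Fetch a URL into memory with libcurl and count keyword occurrences in the raw HTML.
// Build with: g++ rank.cpp -lcurl
#include <curl/curl.h>
#include <iostream>
#include <string>

// libcurl write callback: append the received bytes to a std::string.
static size_t append_cb(char* data, size_t size, size_t nmemb, void* userp) {
    static_cast<std::string*>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

std::string fetch(const std::string& url) {
    std::string body;
    CURL* curl = curl_easy_init();
    if (!curl) return body;
    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);   // follow redirects
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, append_cb);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
    curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    return body;
}

// Naive rank: number of occurrences of the keyword in the page.
size_t count_occurrences(const std::string& text, const std::string& word) {
    size_t count = 0;
    for (size_t pos = text.find(word); pos != std::string::npos;
         pos = text.find(word, pos + word.size()))
        ++count;
    return count;
}

int main(int argc, char** argv) {
    if (argc < 3) { std::cerr << "usage: rank <url> <keyword>\n"; return 1; }
    curl_global_init(CURL_GLOBAL_DEFAULT);
    const std::string page = fetch(argv[1]);
    std::cout << argv[2] << ": " << count_occurrences(page, argv[2]) << "\n";
    curl_global_cleanup();
    return 0;
}
Counting matches in the raw HTML (tags included) is deliberately crude; it is just meant to show how the pieces fit together.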
Separating the HTML tags from the real content can also be done without the use of a library.
For more advanced stuff one could use Qt, which offers classes such as QWebPage (which uses WebKit) that allow one to access the DOM model of the page and extract individual HTML objects (e.g. single cells of a table) rather easily.
You can try xerces-c. It's a powerful library for XML parsing; it supports reading XML on the fly, with both DOM and SAX parsing.

How to retrieve a list of files of given repository (MirrorBrain)?

The repository I am asking about is for Linux, but my problem is on the client side, i.e. with retrieving the data, and the client can be Linux, Windows, Mac OS X, etc. So I opted against asking this question on the Unix & Linux site; if admins feel it should be a U&L question, please move it there.
Consider a repository such as http://download.opensuse.org/repositories/LCD/openSUSE_11.4/x86_64/ -- you can fetch its HTML, parse it, and get the list of files. However, I hardly believe that is the correct way: since the HTML is generated by a website engine (MirrorBrain in this case), there should be some web-service API to get this list directly.
I googled, but didn't find anything relevant.
So: how do I get the list of files directly, with no parsing, just a call that returns the collection of file names?
MirrorBrain doesn't have an API call to retrieve a list of files. (It only has API calls to retrieve a list of mirrors for a single file, by appending .mirrorlist or .meta4 to a file's URL.) It would be a worthwhile idea to add such an API call (patches welcome!).
So there's only the standard HTTP server directory index to read a file list from. The format varies from server to server, and even Apache has different variants. With Apache, a little trick that can help is to append ?F=0 to the directory URL if you want to get only the filenames (it will simplify the index), or to append ?F=1 to switch to the fancier variant which includes more details.
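For what it's worth, here is a rough sketch of that scraping fallback in C++: fetch the directory URL with ?F=0 appended (with curl, wget, or any HTTP client), then pull the file names out of the simplified index. The regex is an assumption about Apache-style listings and will need adjusting per server, as noted above:
// Rough sketch: pull file names out of an Apache/MirrorBrain directory index.
// Feed it the index HTML on stdin, e.g.:
//   curl 'http://download.opensuse.org/repositories/LCD/openSUSE_11.4/x86_64/?F=0' | ./listfiles
#include <iostream>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Fragile by nature: the listing format differs between servers, as noted above.
std::vector<std::string> list_files(const std::string& index_html) {
    std::vector<std::string> files;
    // Skip sort links (?C=...) and absolute links such as the parent directory.
    std::regex href("href=\"([^\"?/][^\"]*)\"");
    for (std::sregex_iterator it(index_html.begin(), index_html.end(), href), end; it != end; ++it)
        files.push_back((*it)[1].str());
    return files;
}

int main() {
    std::string html((std::istreambuf_iterator<char>(std::cin)),
                     std::istreambuf_iterator<char>());
    for (const std::string& f : list_files(html))
        std::cout << f << "\n";
}
Once you have the names, appending .mirrorlist or .meta4 to each file URL gives you the per-file mirror information mentioned above.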
Hope this helps.

Custom client app - need ability to control where documents are saved

Okay SO. I need some guidance. I apologize for the length of this post, but I need to provide some details:
I've got someone who is interested in having me do a small project for them. The application in general is a fairly straightforward employee record-keeping / documentation app, but it makes pretty heavy use of templated Word and Lotus documents. The idea is you select the employee “event” such as commendation, promotion, discipline, etc., it loads the appropriate template doc, and you fill it in from there; later you can select an employee, view all the “events,” and view the individual documents associated with each one.
Thus, the app must know where the .docs are saved when the user is done.
The client actually has a v1 of this app (it doesn’t do any management of the files or anything, just launches Word/Lotus in a new instance with the document you wanted to view, presumably via a system() call). We’ve not gotten into a detailed requirements phase, but the client and I agree that for this to really work, some kind of control over where the user saves the .docs is going to be critical, because otherwise the app provides them with the new copy of the template doc, they "Save as" somewhere else, and the app is left pointing at the blank copy it provided them with.
Obviously, I can’t think of a way to achieve “Save as” restriction/control just by launching a new instance of Word. The client has the idea of embedding a Word/Lotus instance in the app to show the template doc when you choose one, but I have a few reservations about that:
I’ve dug around online and I’ve read that whichever version of Word I borrow MSWORD.OLB from will be the one the end user would require?
I’ve tried to do the MSDN example of embedding a Word doc from here, but as I’ve come to get used to, the MSDN example doesn’t even compile.
Even if I CAN figure out how to embed a .doc file into their application, I don’t know that I could control the use of “Save as…”
All of this STILL hasn’t touched on Lotus (!)
So… instinctively, I feel the embedded Word/Lotus thing has to be more work than it’s worth in the end.
So I’ve had a few other ideas brewing around.
One is looking into Office XML (and a Lotus equivalent, if there is one), getting the user’s “inputs” separately and generating the document on the fly each time. I’m not particularly thrilled with that idea, but I think it COULD work, provided I stick to old features to stay as far backwards compatible as possible.
Another is to get the user’s “inputs” separately and generate a document in HTML. Meh. It works, it’s very cross-platform and easily parsed and understood, but it’s not good if you want to be able to email it to someone (who emails a .html? It works, yes, but it’s unconventional enough to throw off the average user), and even worse if you need to email it to someone for revisions…
Perhaps some kind of editable PDF? I know there are PDF libraries out there, and the more I stew on it, the more this sounds like the best option, though I’ve not done much work with PDFs and I don’t know how easily embeddable they are or what options one has when creating them. I know they can be save-disabled; I’ve had that with my bloody state taxes before.
I need some input here. Here’s the TLDR questions:
Is launching a new instance of Word for each .doc as bad as I feel it is, given that the user can “Save as” the document wherever they like and the application is then left pointing to a blank document?
Is trying to support embedded Word as big of a trouble as I feel like it is / more work than it’s worth / likely to cause problems with supporting multiple versions of Word? (Forward compatibility as well as currently released versions?)
What are thoughts on the PDF plan?
Any other good ideas?
Word does allow for programming some "Save" and "Save As" control via its object model. Any subroutines coded in VBA and placed into your Word template will be copied into all documents generated from that template. Additionally, most menu and Ribbon commands can be intercepted by creating a module containing subroutines named for the intercepted commands. So, for example, if a module contains a sub named FileSaveAs(), any code in that sub will be executed instead of the standard File|Save As command. Lastly, this code will replace Save As commands executed via keystroke, toolbar, menu, or Ribbon.
The code below will launch a dialog box to a predetermined path whenever a "Save" or "Save As" command is executed:
Sub FileSave()
    ' Intercepts the built-in File > Save command.
    ControlSaveLocation
End Sub

Sub FileSaveAs()
    ' Intercepts the built-in File > Save As command.
    ControlSaveLocation
End Sub

Sub ControlSaveLocation()
    ' Opens the Save As dialog already pointed at the predetermined folder.
    Dim Directory As String
    Directory = "C:\Documents\"
    With Application.Dialogs(wdDialogFileSaveAs)
        .Name = Directory
        .Show
    End With
End Sub
Hope this helps.

How should I migrate a site from ZWiki to MediaWiki?

I have a fairly extensive wiki running on ZWiki on Zope (in turn on Plone). Most pages are in reStructuredText format, but there are several in straight HTML as well.
What is the best approach to migrate those pages over to a MediaWiki wiki, with both the reStructuredText and the HTML pages converted to MediaWiki markup? Of course I'd like to automagically convert all links (internal and external) as well.
1. Extract your wiki content to files, using the zwikiexport.py script. The command will be something like:
ZOPE/bin/zopectl run ZOPE/Products/ZWiki/bin/zwikiexport.py /zodb/path/to/wiki/folder
2. Convert the reStructuredText markup to MediaWiki markup. pandoc should work well; for each wiki page, run something like:
pandoc -r rst -w mediawiki PAGE.rst >PAGE.mw
3. Convert the wiki links, which pandoc doesn't know about. Depending on your content, this may be the hardest part to do accurately. Write a Perl script, or modify the zwikiexport script, using ZWiki's knowledge of where the links are (see the methods in ZWikiPage.py); a rough sketch of the idea for plain CamelCase links is shown after these steps.
4. Import the MediaWiki-format pages into MediaWiki, however that's done.
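To illustrate step 3 (a starting point only, not part of ZWiki's tooling): if most of your links are bare CamelCase WikiNames, a crude regex pass can wrap them in MediaWiki's [[...]] syntax. ZWiki's bracketed and freeform link styles are ignored here, and CamelCase words inside code samples will be mangled, so check the output:
// Crude sketch: wrap bare CamelCase WikiNames in MediaWiki [[...]] link syntax.
// Reads a converted page on stdin and writes the result to stdout.
#include <iostream>
#include <iterator>
#include <regex>
#include <string>

std::string convert_links(const std::string& page) {
    // Two or more capitalised word parts, e.g. FrontPage, MeetingNotes2011.
    std::regex wikiname(R"(\b([A-Z][a-z0-9]+){2,}\b)");
    return std::regex_replace(page, wikiname, "[[$&]]");
}

int main() {
    std::string page((std::istreambuf_iterator<char>(std::cin)),
                     std::istreambuf_iterator<char>());
    std::cout << convert_links(page);
}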
Refinements:
The exported file tree will reflect your ZWiki page hierarchy; if you use this heavily, you'll want to think about how to represent it in MediaWiki.
As Mark says, you'll lose the page history, unless you work extra hard to find a way to replicate it. The same goes for any page metadata you may have been using (you can inspect most of this metadata on the page's Properties tab in the Zope management interface). In particular, the page creation time, last edit time, and the usernames of the page's creator and last editor are quite important for understanding your content, so I would try to script some way of preserving those or, if all else fails, do it by hand.
If you have uploaded files to the wiki, I think the export script might save those too; otherwise use the ZMI to export/save them. When you import them into MediaWiki, you may need to choose a page to attach them to. You could use grep or ZWiki's search to find the pages that reference a particular file.
Be prepared to iterate, testing the results pretty thoroughly and refining the process, before you declare victory. After that, the content will diverge and you won't want to re-do this.
Manual fixups: at some point, it may be cheaper to stop fiddling with scripts and do the remaining cleanup by hand, by yourself or with an army of helpers.
Good luck! - Simon
http://zwiki.org
I've no experience with ZWiki and don't know how large your wiki is. But as general advice: you can use find/replace in Notepad or Notepad++, or you can write a macro in Excel.
That is per-page copying, which is only really suitable if your wiki is no larger than, say, 1,000 pages.
I suspect you'll still have to check each page manually, though, and update your scripts accordingly.
Good luck with it - I suspect you'll be pleased with the final result because MediaWiki is pretty awesome.
Update: one disadvantage of moving to a new wiki is that you will lose the page history (i.e. who wrote what, when).