Indexing PDF - Faceted Search with Apache Solr and Apache Tika

For two weeks I have been unable to find a way to do this on the Internet. I need to integrate a web application with Apache Solr and Apache Tika to provide faceted search over the PDFs stored in the system's database. The configuration of Solr and Tika on my server is fine, but as I am new to both tools, I'm not sure how to integrate them with each other or with the application.

Solr 6.2 ships with a files example (in example/files) that is configured specifically to index and browse rich-content files such as PDFs.
Start by using that and try to understand how it is put together.
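If it helps to see the moving parts, the integration from the web application is plain HTTP: each PDF is posted to Solr's Tika-backed ExtractingRequestHandler (/update/extract), and faceted search is then just a query with facet parameters against the resulting index. The sketch below uses libcurl from C++; the Solr URL, the core name (files), the document id (doc1) and the local file name (report.pdf) are all assumptions to replace with your own values.

    // Minimal sketch, not production code: send one PDF to Solr's Tika-backed
    // extract handler so its text and metadata get indexed.
    #include <curl/curl.h>
    #include <cstdio>

    int main() {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL *curl = curl_easy_init();
        if (!curl) return 1;

        // literal.id sets the document id; commit=true makes it searchable at once.
        // Host, core name and id are assumptions -- adjust to your install.
        const char *url =
            "http://localhost:8983/solr/files/update/extract"
            "?literal.id=doc1&commit=true";
        curl_easy_setopt(curl, CURLOPT_URL, url);

        // Attach the PDF as a multipart upload (what "curl -F file=@report.pdf" does).
        curl_mime *form = curl_mime_init(curl);
        curl_mimepart *part = curl_mime_addpart(form);
        curl_mime_name(part, "file");
        curl_mime_filedata(part, "report.pdf");   // hypothetical local file
        curl_easy_setopt(curl, CURLOPT_MIMEPOST, form);

        CURLcode rc = curl_easy_perform(curl);
        if (rc != CURLE_OK)
            std::fprintf(stderr, "indexing failed: %s\n", curl_easy_strerror(rc));

        curl_mime_free(form);
        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return rc == CURLE_OK ? 0 : 1;
    }

A faceted search is then a GET against /select with facet=true and facet.field set to whichever indexed field you want to facet on; you can try those URLs in a browser first to see what the example's schema gives you.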

Related

Can a MediaWiki site running on Apache Tomcat be transferred to SharePoint 2013?

I have a requirement to transfer an internal MediaWiki site, which runs on a CentOS server with the Apache Tomcat web server, to SharePoint 2013.
Is it possible to migrate it, and if so, how should I proceed?
In general, the answer is no, unless your MediaWiki system contains only trivially basic articles.
MediaWiki and SharePoint are completely different products. They are not compatible in any special way. MediaWiki is a wiki with hundreds of features and behaviors that do not exist in SharePoint, not even in SharePoint's wiki product. Examples are:
Transclusion
A thousand configuration variables
Many extensions (plugins) that affect the content
Now, if your wiki content is completely trivial and you don't care about any of MediaWiki's rich features, you can dump all your articles to HTML files (say, using wget or similar) and try to import them into SharePoint somehow. You'll still need to handle embedded images (anything in the File: namespace) specially, however. Your HTML files will contain links to the images, and you'll need to change those links to point to the images' new location in SharePoint.
Alternatively, if you're running SharePoint in house (i.e., not as SaaS) on a Windows server, you could install MediaWiki on that server and pretend it's part of SharePoint. :-)
On hearing your question, I can't help but wonder what you're trying to accomplish. Does somebody in your organization just not like MediaWiki (or Linux)? Maybe they don't understand MediaWiki. (That's likely, if they think it can be migrated into SharePoint.) Anyway, good luck!

Apache Solr with C++ Application

I have a C++ Builder application. I need to know how to use Solr with my C++ application.
Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Apache Tomcat or Jetty.
I need Solr for indexing and search.
Is there any way to use Solr with my C++ application?
Thank you!
You'll interact with Solr over HTTP, so using libcurl or POCO to make the requests and then parsing the resulting XML or JSON is a possible (and easy) solution.
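As a rough sketch (assuming a local Solr with a core named mycore and a facet field named category, neither of which comes from your setup), the libcurl version looks roughly like this: build the select URL, collect the response body, and hand the JSON to whatever parser you already use.

    // Rough sketch: run a faceted Solr query over HTTP from C++ and print the raw JSON.
    #include <curl/curl.h>
    #include <cstdio>
    #include <string>

    // libcurl write callback: append each chunk of the response body to a std::string.
    static size_t collect(char *data, size_t size, size_t nmemb, void *userp) {
        static_cast<std::string *>(userp)->append(data, size * nmemb);
        return size * nmemb;
    }

    int main() {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL *curl = curl_easy_init();
        if (!curl) return 1;

        std::string body;
        curl_easy_setopt(curl, CURLOPT_URL,
            "http://localhost:8983/solr/mycore/select"
            "?q=*:*&wt=json&facet=true&facet.field=category");
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);

        CURLcode rc = curl_easy_perform(curl);
        if (rc == CURLE_OK)
            std::printf("%s\n", body.c_str());
        else
            std::fprintf(stderr, "query failed: %s\n", curl_easy_strerror(rc));

        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return rc == CURLE_OK ? 0 : 1;
    }

POCO would do the same thing through Poco::Net::HTTPClientSession; either way, Solr neither knows nor cares what language the client is written in.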
The only client I've seen mentioned is SolrCPP, although I don't think that is maintained or available in any decent form any longer (It's the only one mentioned on the Integrating Solr list).

ColdFusion WEB-INF\web.xml purpose

I'm looking at using New Relic to monitor our ColdFusion sites; however, it uses the web application display name defined in web.xml to identify applications in its admin console.
As far as I can work out, ColdFusion only has the one web.xml file, in:
...\ColdFusion9\wwwroot\WEB-INF\web.xml
What is the purpose of this file, and can elements of it be overridden on a site-by-site basis?
It looks like New Relic is a tool for monitoring Java (and other) apps. ColdFusion is a Java application, and with a standard installation like yours it is a single web application with a single web.xml. Regardless of how many ColdFusion sites (apps) you run on it, it is still a single web application.
If you have CF Enterprise you can set up a multi-server install and deploy each of your sites as a separate Java app, but the way you have it set up now you'll probably only be able to monitor CF as a whole rather than each individual site.

ColdFusion Unable to create Solr collection Error

I'm trying to create a Solr collection in ColdFusion 9. I have never used Solr before, but I am following the directions in Forta's Web Application Construction Kit.
Every time I go to create the collection, I get the following error:
Unable to create collection usaf.
Unable to create Solr collection usaf.
An error occurred while creating the collection: org.apache.solr.common.SolrException. Check the Solr logs for more detail.
Anyone have a clue what's wrong? I have read that the update to CF 9.0.1 causes some issues with Solr -- I tried installing that update and it failed several times. Could that be the problem?
If so, how do I solve it? This is on a production Windows Server 2008 machine, and a previous attempt to uninstall and reinstall forced us to restore the server from an image because it was such a disaster.
I know this is a bit old, but here is what I did to fix the same problem. The Solr service in CF Administrator wasn't showing the core collection and it wouldn't let me create a new collection (as per above).
Using Win7, CF9.0.1
1. Stopped the Search service and the Solr service via the Windows service manager.
2. Edited the file ColdFusion9\solr\multicore\solr.xml and removed the entries for the collections I was working on at the time it all stopped working. This is the step that seems to have made the difference. Back up the file first!
3. For the entries I removed from solr.xml, I also removed the collection folders and files completely from the file system using the Windows file manager.
4. Restarted the Search service and the Solr service. The core collection now appears in CF Administrator, and my CF pages create and index collections as they should. Phew!
Cheers,
Murray
You can check CF Admin under Data & Services > ColdFusion Collections to make sure Solr is running; there should be a default collection listed. If not, note that search runs as separate services on Windows: check that the ColdFusion 9 Search Server and Solr Service are both there and started.
Adobe has a standalone Solr install. http://www.adobe.com/support/coldfusion/downloads.html
Updating to 9.0.1 and hotfixes corrupted my Solr install. Had to reinstall CF from scratch.
It's also possible for the ColdFusion Solr Search Service to be running even though Solr is not. This can happen, for example, when there are errors in a collection's schema.xml file. I imagine there are other conditions under which this can happen. At any rate, as the poster above explained, if you look on CF Admin under "ColdFusion Collections" you should see at least the default Solr collection (core0). If you don't see that collection then Solr isn't running properly even if Windows tells you that the service is running.
Also, you may want to see if you can reach the Solr web service (port 8983 by default): http://localhost:8983/solr/
There could be three reasons for this:
1. Solr is not running.
2. Solr is running but, if you are on Unix, Solr/CF is running as a non-privileged user.
3. Solr was installed after CF. In that case, go to CF Admin, Data & Services -> Solr Server (CF10), and provide the Solr home path.
It seems like the Solr service is timing out or not working properly for some reason.
First make sure that you can reach the Solr Admin UI on the host; try http://hostname:8983/solr/ or http://localhost:8983/solr/ from an RDP session. If it is not working, that will show you the exact error or the reason you get an error while adding a CF collection. Most probably there is a CF collection that is not configured properly, and you can remove those entries from 'ColdFusion9\solr\multicore\solr.xml'.
"TAKE A BACKUP IF YOU ARE TRYING TO MODIFY ANYTHING"

Inconsistent PDF URLs returned from SharePoint query web service

I'm searching a SharePoint server through web services. When the web services return Word/Excel/PowerPoint documents, they contain links to the actual files, e.g. http://server/site/mydoc.doc. When the web services return PDF documents, they contain links to pages that link to the PDF document, e.g. http://server/site/DispForm.aspx?ID=1 which would contain a link to http://server/site/mydoc.pdf. I've tried _vti_bin/search.asmx with actions Query and QueryEx with no luck. What is the best way to get a link to the actual document so my app can download it?
Unfortunately, I'm using a large, shared SharePoint installation, and it's very unlikely that the server configuration can be changed.
You need the PDF IFilter installed on the server in order for the crawl to recognize PDF files and index them correctly. Here are some instructions from Adobe (PDF warning!) as well.