Best full text search alternative to MS SQL, C++ solution - c++

What is the best full text search alternative to Microsoft SQL? (which works with MS SQL)
I'm looking for something similar to Lucene and Lucene.NET but without the .NET and Java requirements. I would also like to find a solution that is usable in commercial applications.

Take a look at CLucene - It's a well maintained C++ port of java Lucene. It's currently licenced under LGPL and we use it in our commercial application.
Performance is incredible, however you do have to get your head around some of the strange API conventions.

Sphinx is one of the best solutions. It's written in C++ and has amazing performance.

DT Search is hands down the best search tool I have used. They have a number of solutions available. Their Engine will run on Native Win32, Linux or .NET. It will index pretty much every kind of document you might have (Excel, PDF, Word, etc.) I did some benchmarks comparisons a while ago and it was the easiest to use and had the best performance.

Solr is based on Lucene, but accessible via HTTP, so it can be used from any platform.

I second Sphinx, but Lucene is also not so bad despite the Java. :) If you are not dealing with too much data spread out etc., then also look into MySQL's FULLTEXT. We are using it to search across a 20 GB database.

Related

C++, Get text from a website

I was told I have to use winsock, but I dont know where to start. For example, I am trying to access, lets say http://www.newegg.com/, I am trying to get the text title of just the three front page products. Any help is greatly appreciated. :D
I'd also recommend libcurl for this sort of thing.
You can use the cURL command line tool to generate sample code as well, which is helpful for experimentation.
W3.org themselves provide sample C / C++ librarys for Http requests.
Find them here
Specifically, look for HTTPReq.c
Use boost library and poco. They both provide solutions for network programming. Boost also provide spirit library which you can use for parsing data from websites. Poco libraru also provides NetSSL, crypto solutions.
P.S. boost::spirit is not a library for parsing data from websites, it provides solution for parsing strings ...
you need to open a socket.
then you need to do an http get
somewhat like :-
http://www.esqsoft.com/examples/troubleshooting-http-using-telnet.htm
You could use the QNetworkAccessmanager class from Qt framework.
I'm assuming you need to use c++ for a reason, such as integration with existing software, otherwise, as per some of the other suggestions, choosing a language with a more convenient framework (eg: scripting language) would be better suited for the task.
If you would like to avoid getting your hands dirty with WINSOCK, or have the need to run on a platform other than windows, you could look at the using the boost asio library.
The following page contains links to simple sync and async http clients:
http://www.boost.org/doc/libs/1_37_0/doc/html/boost_asio/examples.html
You can find documentation on the library itself at:
http://www.boost.org/doc/libs/1_37_0/doc/html/boost_asio.html
Use c++ if you must, but it might be a lot less painful to use python.
Look at the Python httplib module for how to set the host you want to pull from etc. Python's available for free for most platforms and is enough like C++ that you can probably learn python a heck of a lot faster than you can learn to write a program controlled browser in c++. Well, maybe that's not true for everyone on this site, but I'll bet it's true for "most" of us. I used to get stock quotes updated in near real time from CNN Money years ago and IIRC it was around 100 lines of python code.
Hotei

How does one port c++ functions to the internet?

I have a few years experience programming c++ and a little less then that using Qt. I built a data mining software using Qt and I want to make it available online. Unfortunately, I know close to nothing about web programming. Firstly, how easy or hard is this to do and what is the best way to go about it?
Supposing I am looking to hire someone to make me a secure, long-term, extensible, website for an online software service, what skill set should I be looking for?
Edit:
I want to make my question a little more specific:
How can I take a bunch of working c++ functions and port the code so I can run it server side on a website?
Once this is done, would it be easy to make changes to the c++ code and have the algorithm automatically update on the site?
What technologies would be involved? Are there any cloud computing platforms that would be good for something like this?
#Niklaos-what does it mean to build a library and how does one do that?
You might want to have a look at Wt[1]. Its a C++ web framework which is programmed more or less like a desktop GUI application. One of the use cases quoted is to bring legacy apps into the web.
[1] http://www.webtoolkit.eu
Port the functions to Java, easily done from C++, you can even find some tools to help - don't trust them implicitly but they could provide a boost.
See longer answer below.
Wrap them in a web application, and deploy them on Google App-Engine.
Java version of a library would be a jar file.
If you really want to be able to update the algorithm implementation dynamically, then you could implement them in Groovy, and upload changes through a form on your webapp, either as files or as a big text block, need to consider version control.
The effort/skillset involved to perform the task depends on how your wrote your code. If it is in a self-contained library, and has a clean (re-entrant, thread safe) API, you could probably hire a web developer (html/php/asp etc) to write the UI interface to the library for a relatively small cost. The skills required would be dependant on the technologies you wanted to use. For Windows development I would suggest C#/ASP. The applicant would require knowledge of interfacing with native libraries from a managed language. This is assuming that you dont mind the costs of Windows deployment for your application.
On the otherhand, if the library is complex or needs to be re-written to support the extensibility you are looking for, asking here will not get you much.
BTW: here is a great article on Marshalling if you chose to implement using C#/ASP
http://msdn.microsoft.com/en-us/magazine/cc164193.aspx
First, DO NOT USE PHP :D
I used it for some projects (the last one with symphony framework) and i almost shoot my self !
If you are very familiar with C++, ASP .NET could be a good solution because if you like C++ you are going to love C#.
Any ways, I personally use Ruby on Rails for 6 months now and I LOVE IT. I won't write you a book here but the framework is pure gold !
The only problem is that Ruby is a very special language. You will probably be a bit lost a the beginning. But as every one you will learn to love it.
But that was only for the server side. Indeed, there 3 technologies you won't be able to avoid if you want to start to develop web applications.
HTML, CSS and JavaScript are presents every where. This is why i'm thinking you should start by HTML and CSS then JavaScript (with jQuery).
When you've got some basics with these 3 technologies you should be able to choose the server side language.
But you've got to tell you one thing, it's not going to be easy !
PS : Ruby on Rails uses HAML and SASS. These 2 languages replaces HTML and CSS you should have a look at them quickly because they are awesome.

What are some open-source applications written in C/C++ using PostgreSQL?

I'm trying to find open source applications using PostgreSQL that are written in C/C++ so I can study them. A few open source projects using PostgreSQL are Evergreen ILS, SpamAssassin, and pgpool. However, Evergreen and SpamAssassin are written in Perl, and pgpool (written in C) is a replication tool, not a typical application. Moreover, I looked at the SQL code in Evergreen, and it is quite voluminous and complicated.
Hence, I'm looking for one or more applications using PostgreSQL, preferably those that are somewhat trivial (but not too trivial).
seen libpqxx? try asking on its mailing list (but scour their wiki first)
http://pqxx.org/development/libpqxx
pgAdmin is written using c++ using wxwidgets.
how about pgAdmin 3 ?
Also, you may find Qt4 a very easy way interact with databases programming in C++.
http://doc.trolltech.com/4.6-snapshot/sql-programming.html
Have you searched through the projects at http://pgfoundry.org ?
Two examples that are open-source:
Kexi (see kexi-project.org)
FOST4 (
http://support.felspar.com/Fost%204 )
It's pretty big, but the KDE Project's Amarok is written in C++ and can use a PgSQL backend (among several others). While it's pretty large, you may be able to find some interesting things in the database code. Since it uses a pre-defined schema (as opposed to the extremely general types of access that something like pgAdmin uses) it may have some good things to teach you. It will definitely be easier to pick apart than Evergreen, which actually has an entire middleware layer that actually does the data access through exposed services (The OpenSRF Project).

How to replace text in a PowerPoint (.ppt) document?

What solutions are there? I know only solutions for replacing Bookmarks in Word (.doc) files with Apache POI?
Are there also possibilities to change images, layouts, text-styles in .doc and .ppt documents?
I think about replacement of areas in Word and PowerPoint documents for bulk processing.
Platform: MS-Office 2003
What are your platform limitations?
Obviously Apache POI will get you at least part of the way there.
Microsoft's own COM API's are fairly powerful and are documented here. I would recommend using them if a) you are not running in a server (many users, multithreaded) environment; b) you can have a proper version of powerpoint installed on the production machine; and c) you can code against a COM object model.
It's a bit pricey, but Aspose.Slides is a very powerful library for manipulating PowerPoint files
If you include using other Office suits as an option, here's a list of possible solutions:
Apache POI-HSLF
PowerPoint 2007 APIs
OpenOffice.org UNO
Using POI you can't edit .pptx file format, but you don't depend on the apps installed on the system. Other two options, on the contrary, make use of other apps, but they are definitely better for dealing with presentations. OpenOffice has better compability with older formats, by the way. Also if you use UNO, you'll have a great choice of languages, UNO exists for Java, C++, Python and other languages.
My experience is not directly with Power Point, but I've actually rolled my own WordML (XML) generator. It a) removed all dependencies on Word, b) was very fast c) and let me build up documents from scratch.
But it was a lot of work to create. And I was only creating a write only implementation.
I'm not as familiar with Power Point, so this is conjecture, but you may be able to roll your own by reading XML (Power Point 2003??) and/or cracking the Office Open XML file (zipped XML), then using XPath to manipulate the data, and then saving everything back to disk.
This won't work on older OLE Compound Document based Power Point files though.
I've done something like that before: programmatically accessed and manipulated PowerPoint presentations. Back when I did it, it was all in C++ using COM, but similar principles apply to C#/VB .NET apps, since they do COM interop very easily.
What you're looking for is called the Office Document Model. Basically, Office applications expose their documents programmatically, as trees of objects that define their contents. These objects are accessible via an API, and you can manipulate them, add new ones, and do whatever other processing you want. It's exceedingly powerful; you can use it to manipulate pretty much all aspects of a document. But you'll need an installation of Office and Visual Studio to be able to use it.
Some links:
Intro: http://msdn.microsoft.com/en-us/library/d58327k6.aspx
Hope this helps!
Apparently new users can only include one link per posting. How lame! :)
Here's the other link I meant to include:
Example of manipulating PowerPoint documents programmatically: http://msdn.microsoft.com/en-us/library/cc668192.aspx

What's the best way to parse RSS/Atom feeds for an iPhone application?

So I understand that there are a few options available as far as parsing straight XML goes: NSXMLParser, TouchXML from TouchCode, etc. That's all fine, and seems to work fine for me.
The real problem here is that there are dozens of small variations in RSS feeds (and Atom feeds too), so supporting all possible permutations of feeds available out on the Internet gets very difficult to manage. I searched around for a library that would handle all of these low-level details for me, but came out without anything.
Since one could link to an external C/C++ library in Objective-C, I was wondering if there is a library out there that would be best suited for this task? Someone must have already created something like this, it's just difficult to find the "right" option from the thousands of results in Google.
Anyway, what's the best way to parse RSS/Atom feeds in an iPhone application?
I've just released an open source RSS/Atom Parser for iPhone and hopefully it might be of some use.
I'd love to hear your thoughts on it too!
"Best" is relative. The best performance you'll need to go the SAX route and implement the handlers. I don't know of anything out there open source available (start a google code project and release it for the rest of us to use!)
Whatever you do, it's probably a really bad idea to try and load the whole XML file into memory and act on it like a DOM. Chances are you'll get feeds that are much larger than you can handle on the device leading to frequent memory warnings and crashes.
I'm currently trying out the MWFeedParser #Michael Waterfall is developing.
Quite easy to set up and use (I'm a beginner iPhone developer).
His sample code for using MWFeedParser to populate a UITableViewController implementation is helpful as well.
take a look at apple's XML Performance sample -- which points to using libXML directly -- for performance and quicker updates to the display. Which may be important if you are working with very large feeds.
Check out my library for parsing Atom feeds, (BSAtomParser) at GitHub. It doesn't care about validating the feed, it does its best at returning whatever is valid. The parser covers most of RFC 4287, even extensions.
Here's my solution: a really simple yet powerful RSS parsing library: https://github.com/H2CO3/RSSKit
Have you looked at TouchCode yet? I don't think it has an RSS processor, but it might give you a start.
http://code.google.com/p/touchcode/
I came accross igasus project on sourceforge today. I haven't used it or really checked it, but perhaps it might help.
From their site:
igagus is a web service for the iPhone that allows aggregation of RSS to be delivered in an iPhone friendly format.
Actually, I was trying to suggest you ask on the TouchCode discussion board, because I remember someone was trying to expand it to support RSS. That might be a decent starting point. But I was being rushed by my wife.
But I see now that TouchCode doesn't have a discussion board. I'd still ask the author, though, he might know what came of that effort.
This might be a reasonable starting point for you. Atom support isn't there yet, but you could help out?