Extracting key words from HTML in C++ under Linux - c++

I am working on a simple client-server project. The client is written in Java; it sends key words to a C++ server running under Linux and receives a list of URLs with the best ranks (depending on the number of occurrences of the key words). The server's job is to go through some URLs in search of the key words and return the best-fitting URLs. The problem is that I have to parse HTML sites to find occurrences of the key words, and I also need to extract the links from each visited page so I can search those as well.
My question is: what library can I use to do that? Remember, only C++ Linux libraries are suitable for me. There were some similar topics, so I tried to go through most of them, but some of the libraries parse only HTML files, and I don't want to download every site I visit - I want to parse it on the fly and just store its rank and URL. Some of them look a bit complicated to me - for instance, first parsing the HTML into XML or something else and only then working on the results in C++. Is there something simple and sufficient to do what I need? Any advice will be appreciated.

I don't think regular expressions are appropriate for HTML parsing. I'm using libxml2, and I enjoy it very much - easy to use, portable and lightning fast.
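As an illustration, here is a minimal sketch of pulling the href attributes out of a page that is already in memory, assuming libxml2's HTML parser and its XPath support (the base URL argument is only a placeholder):

    #include <libxml/HTMLparser.h>
    #include <libxml/xpath.h>
    #include <iostream>
    #include <string>

    // Parse an HTML page held in memory and print every href found in <a> tags.
    void extract_links(const std::string& html, const char* base_url)
    {
        htmlDocPtr doc = htmlReadMemory(html.c_str(), (int)html.size(), base_url, nullptr,
                                        HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
        if (!doc) return;

        xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
        xmlXPathObjectPtr res = xmlXPathEvalExpression(BAD_CAST "//a/@href", ctx);

        if (res && res->nodesetval) {
            for (int i = 0; i < res->nodesetval->nodeNr; ++i) {
                xmlChar* href = xmlNodeGetContent(res->nodesetval->nodeTab[i]);
                std::cout << href << "\n";   // candidate URL to crawl next
                xmlFree(href);
            }
        }
        xmlXPathFreeObject(res);
        xmlXPathFreeContext(ctx);
        xmlFreeDoc(doc);
    }

The RECOVER flag lets libxml2 cope with the tag soup found on real pages instead of rejecting it as ill-formed.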

To fetch pages from the web using C/C++ you could use the libcurl library. To parse URLs and other not-too-complicated things out of the page you can use a regex library.
Separating the HTML tags from the real content can also be done without the use of a library.
For more advanced stuff one could use Qt, which offers classes such as QWebPage (which uses WebKit) that allow one to access the DOM model of the page and extract individual HTML objects (e.g. single cells of a table) rather easily.
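For the libcurl part, a minimal sketch of fetching a page into a std::string could look like this (the URL is just a placeholder):

    #include <curl/curl.h>
    #include <iostream>
    #include <string>

    // libcurl write callback: append the received bytes to a std::string.
    static size_t write_cb(char* ptr, size_t size, size_t nmemb, void* userdata)
    {
        static_cast<std::string*>(userdata)->append(ptr, size * nmemb);
        return size * nmemb;
    }

    int main()
    {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL* curl = curl_easy_init();

        std::string body;
        curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/");   // placeholder URL
        curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);

        if (curl_easy_perform(curl) == CURLE_OK)
            std::cout << body << "\n";   // raw HTML, ready to be handed to a parser

        curl_easy_cleanup(curl);
        curl_global_cleanup();
    }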

You can try xerces-c. It's a powerful library for XML parsing. It supports reading XML on the fly, and both DOM and SAX parsing.
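As a rough sketch of what Xerces-C++ DOM parsing looks like (assuming Xerces-C++ 3.x and a well-formed input file, here called page.xml as a placeholder):

    #include <xercesc/util/PlatformUtils.hpp>
    #include <xercesc/util/XMLString.hpp>
    #include <xercesc/parsers/XercesDOMParser.hpp>
    #include <xercesc/dom/DOM.hpp>
    #include <iostream>

    using namespace xercesc;

    int main()
    {
        XMLPlatformUtils::Initialize();
        {
            XercesDOMParser parser;
            parser.parse("page.xml");                    // placeholder input file
            DOMDocument* doc = parser.getDocument();
            if (doc && doc->getDocumentElement()) {
                char* name = XMLString::transcode(doc->getDocumentElement()->getTagName());
                std::cout << "root element: " << name << "\n";
                XMLString::release(&name);
                // ...walk the DOM tree from here...
            }
        }   // the parser must be destroyed before Terminate()
        XMLPlatformUtils::Terminate();
    }

Keep in mind that Xerces expects well-formed XML, so raw HTML would need to be tidied up first.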

Related

Solr/Lucene "kit" to test searching?

Is there a "code free" way to get SOLR/LUCENE (or something similar) pointed at a set of word docs to make them quickly searchable by a user?
I am prototyping (seeing if there is value in) a system to search through some homegrown news articles. Before I stand up code to handle search-string input and document indexing, I wanted to see if it was even worth it before I start trying to figure it all out.
Thanks,
Judd
Using the bin/post tool of Solr and the Tika handler (named the ExtractingRequestHandler), you should be able to get something up and running for prototyping rather quickly.
See the introduction of Uploading Data with Solr Cell using Apache Tika. Tika is used to process a wide range of different document types.
You can give the Solr post tool a directory or a list of files to submit to the index.
Automatically detect content types in a folder, and recursively scan it for documents for indexing into gettingstarted.
bin/post -c gettingstarted afolder/

How to generate random words from real languages

How can I generate a random word from a real language?
Does anybody know of an internet API with this functionality?
For example, I send an HTTP request to 'ht_tp://www.any...api.com/getword?lang=en' and I get the response 'Town'. Or 'Fast'. Or 'Received'... Or I send an HTTP request to 'ht_tp://www.any...api.com/getword?lang=ru' and I get the response 'Ходить'. Or 'Шапка'. Or 'Отправлено'... Any form (noun, adjective, verb etc...) of a word from any language.
I found the resource 'http://www.randomlists.com/random-words', but it is not JSON, it is English only, and there is no guarantee it will keep working long-term.
Any ideas, please.
See this answer: https://stackoverflow.com/questions/824422/can-i-get-an-english-dictionary-word-list-somewhere Download a word dictionary, stick it in the database and fetch a random record, or read a random line from the file each time. This way you don't depend on a 3rd-party API and you can extend it to all the languages you can find words for.
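A minimal sketch of the read-a-random-line approach, assuming the downloaded list (here called words_en.txt as a placeholder) has one word per line:

    #include <fstream>
    #include <iostream>
    #include <random>
    #include <string>
    #include <vector>

    int main()
    {
        // "words_en.txt" stands in for whatever word list you downloaded.
        std::ifstream in("words_en.txt");
        std::vector<std::string> words;
        for (std::string line; std::getline(in, line); )
            if (!line.empty())
                words.push_back(line);
        if (words.empty()) return 1;

        // Pick one word uniformly at random.
        std::mt19937 rng{std::random_device{}()};
        std::uniform_int_distribution<std::size_t> pick(0, words.size() - 1);
        std::cout << words[pick(rng)] << "\n";
    }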
You can download the OpenOffice dictionaries. They come as extensions (.oxt), which are nothing other than ZIP files. You can open them with 7-Zip or the like. Inside you will find lots of files; the interesting ones for you are the *.dic files. They will also contain resolutions or number words.
When you encounter something like abandon/LdS, get rid of the /LdS - it is used for hunspell.
Take these *.dic files, use their names as keys, put them into a database, and pick a random word from there for a given language code.
Update
Older, but easier to access: the archived hunspell dictionaries from OpenOffice.
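For illustration, a small sketch of loading such a *.dic file and stripping the hunspell affix flags (the file name en_US.dic is a placeholder; the database insert is left as a comment):

    #include <fstream>
    #include <iostream>
    #include <string>

    int main()
    {
        // "en_US.dic" stands in for a dictionary pulled out of an .oxt extension.
        std::ifstream dic("en_US.dic");
        std::string line;
        std::getline(dic, line);                     // the first line of a .dic file is the entry count
        while (std::getline(dic, line)) {
            std::string::size_type slash = line.find('/');
            if (slash != std::string::npos)
                line.erase(slash);                   // drop hunspell flags such as "/LdS"
            std::cout << line << "\n";               // e.g. insert into your database here
        }
    }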
This question can be viewed in two ways, and therefore I give two answers:
To collect words, I would run a spider on websites with known language (Wikipedia is a good starting point) and strip HTML tags.
Generating words from a real language is trickier. Using statistics gathered from the collected words, it is possible to use Markov chains to produce statistically plausible words. I have tried letter-by-letter generation, and it works poorly; it is probably a better approach to use syllable construction instead.
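Just to illustrate the idea (not a recommendation of letter-level generation), here is a sketch of a letter-bigram Markov chain trained on a tiny placeholder corpus; a real implementation would train on the words collected by the spider, ideally over syllables rather than single letters:

    #include <iostream>
    #include <map>
    #include <random>
    #include <string>
    #include <vector>

    int main()
    {
        // Tiny placeholder corpus; in practice you would train on the words your spider collected.
        std::vector<std::string> corpus = {"town", "fast", "received", "random", "language"};

        // Count which letter follows which ('^' marks the start of a word, '$' the end).
        std::map<char, std::map<char, int>> counts;
        for (const std::string& w : corpus) {
            char prev = '^';
            for (char c : w) { ++counts[prev][c]; prev = c; }
            ++counts[prev]['$'];
        }

        std::mt19937 rng{std::random_device{}()};

        // Draw the next letter according to the observed frequencies.
        auto next = [&](char prev) {
            int total = 0;
            for (auto& kv : counts[prev]) total += kv.second;
            std::uniform_int_distribution<int> d(1, total);
            int r = d(rng);
            for (auto& kv : counts[prev])
                if ((r -= kv.second) <= 0) return kv.first;
            return '$';
        };

        // Walk the chain from the start marker until the end marker (capped at 12 letters).
        std::string word;
        for (char c = next('^'); c != '$' && word.size() < 12; c = next(c))
            word += c;
        std::cout << word << "\n";
    }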

C++ logger - should I use an ordinary XML parser?

I'm working on a logging system for my 2D engine, and I'm confused on how I should go about creating/editing the file, and how I should output that file.
I've learned that XML is more of a data carrier rather than a data displayer like HTML is. I've read that I can use XML to HTML converters. One method I've thought about is writing characters to a file in HTML.
Clarity on these matters is what I ask of you, Stack Overflow.
Creating an XML (or HTML) file doesn't need any special library. Straightforward string concatenation is usually good enough; you may have to encode some special characters (e.g. > into &gt;).
But as Owen says, plain text is a lot more common for log files. One reasonable compromise is comma-separated values in a text file; this gives you a little bit of structure without much overhead. For example, the Windows web server (IIS) uses this format by default, and if you have fields that are output for each line, such as the timestamp or the source filename and line number, this makes it easy to separate those out again.
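A sketch of what writing such a comma-separated record might look like; the field layout (timestamp, source file, line, message) is just an example, not a prescribed format:

    #include <cstdio>
    #include <ctime>

    // Append one comma-separated record: timestamp, source file, line, message.
    void log_csv(const char* path, const char* file, int line, const char* msg)
    {
        std::FILE* out = std::fopen(path, "a");
        if (!out) return;

        char stamp[32];
        std::time_t now = std::time(nullptr);
        std::strftime(stamp, sizeof stamp, "%Y-%m-%d %H:%M:%S", std::localtime(&now));

        std::fprintf(out, "%s,%s,%d,%s\n", stamp, file, line, msg);
        std::fclose(out);
    }

    // Typical call site:
    // log_csv("engine.log", __FILE__, __LINE__, "texture loaded");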
Just about every log I've ever worked with has been pure text delimited by newlines. If you're going to depart from that, you may want to ask yourself what it is about your logging needs that you want to accomplish with markup.
If you must go the way of markup, I would suggest an XML format that contains a minimal set of markup that would be useful in your situation. You could use XML to capture structure in your log entries (timestamp, severity, and operational code, for example) that would be inconvenient to code for in HTML.
Note that you could also go hybrid and embed some XHTML tags in an XML element whose purpose is to capture displayable text, if you want.
The problem with XML or HTML files is that you cannot append to them at any time. You have to close the final (document) tag properly at the end of writing.
Therefore, it's not a popular format for logging.
For logging, I suggest using one of the existing log engines, such as the Apache logger or John Torjo's Boost.Log candidate. They support log levels, runtime configuration, etc.
If you are considering writing logs in XML files, please, stop.
Log files should be simple plain-text files; XML-izing them introduces needless complexity. They are not structured data - they are meant to be read by people, not automated tools.
It all starts with XML logs, and then it goes downhill from there.

Get a particular text from website

I'm looking for a way, when you know the location of the text to read (say, under a particular category), to connect to a website and search for and read that text.
What steps do I need to follow to learn about that?
You could use libcurl/cURL for your HTML retrieval.
You're probably looking for a web crawler.
Here's an example of a simple crawler written in C++.
Moreover, you might want to have a look to wget, a software to retrieve files via HTTP, HTTPS and FTP.
If you are looking at a specific web page, you could try retrieving the page and parsing it to get to the exact location you want, e.g. a specific div, etc.
Since you are using C++, you could try reading up on using libcurl to retrieve the information you need from the URL.
You can download an HTML file with WinHTTP (working example) and then search the file. There are some find algorithms in the std::string class for searching, if your needs are relatively basic.
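Once the page body is in a std::string (however it was fetched: libcurl on Linux, WinHTTP on Windows), a basic search could be as simple as this sketch:

    #include <iostream>
    #include <string>

    // Count occurrences of a keyword in an already-downloaded page.
    void find_keyword(const std::string& page, const std::string& keyword)
    {
        std::size_t pos = 0, hits = 0;
        while ((pos = page.find(keyword, pos)) != std::string::npos) {
            ++hits;
            pos += keyword.size();
        }
        std::cout << keyword << " found " << hits << " time(s)\n";
    }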

Indexing text content of html

I want to pull the text out of html files for indexing purposes, and do so as fast as possible. Rather than create something from scratch, I want to see how much I can find already done for me.
Currently I'm just piping the output of html2text, which works, but between it being Python and trying to prettify the text, I'm sure the speed could be improved.
So, with Linux/unix being priority, what (c/c++) libraries would be best suited to this kind of task?
To extract the text you can use an HTML parser like htmlcxx or libxml. You can also use any XML library after tidying up the HTML. For indexing the text you can use CLucene.
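If you go the libxml route, a sketch of extracting the visible text (skipping script and style contents) from a page already in memory might look like this; error handling is kept minimal:

    #include <libxml/HTMLparser.h>
    #include <libxml/xpath.h>
    #include <string>

    // Collect the text nodes of an HTML page held in memory, ready for indexing.
    std::string html_to_text(const std::string& html)
    {
        std::string text;
        htmlDocPtr doc = htmlReadMemory(html.c_str(), (int)html.size(), "in-memory.html", nullptr,
                                        HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
        if (!doc) return text;

        xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
        // Skip text that lives inside <script> or <style> elements.
        xmlXPathObjectPtr res = xmlXPathEvalExpression(
            BAD_CAST "//text()[not(ancestor::script) and not(ancestor::style)]", ctx);

        if (res && res->nodesetval) {
            for (int i = 0; i < res->nodesetval->nodeNr; ++i) {
                xmlChar* chunk = xmlNodeGetContent(res->nodesetval->nodeTab[i]);
                if (chunk) {
                    text.append(reinterpret_cast<const char*>(chunk));
                    text.push_back(' ');
                    xmlFree(chunk);
                }
            }
        }
        xmlXPathFreeObject(res);
        xmlXPathFreeContext(ctx);
        xmlFreeDoc(doc);
        return text;
    }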