Parse Website Data in C++ - c++

So I am trying to develop a program that will parse a website for data and store that data in variables that I can then use in functions inside the program.
Specifically I'm trying to parse this page (Click the debuffs tab)
http://worldoflogs.com/reports/rt-1smdoscr7neq0k6b/spell/94075/
The source is pretty simple and looks like this.
<td><a href='/reports/rt-1smdoscr7neq0k6b/details/62/' class='actor'><span class='Warrior'>Zonnza</span></a></td>
<td>100</td>
</tr>
<tr>
<td><a href='/reports/rt-1smdoscr7neq0k6b/details/3/' class='actor'><span class='DeathKnight'>Fillzholez</span></a></td>
<td>89</td>
</tr>
I only want the numbers and the names, i.e. what is between the <td></td> tags and between the <span class=''></span> tags. Is there any way to do what I'm looking for?
Any help would be greatly appreciated.

I'd look into Tag Soup. It's a parser for HTML that can cope with all the horrible HTML that's out there. There's a C++ port of it available too (haven't used that so can't comment on how stable it is).

There are no C++ libraries for what you're trying to do (unless you're going to link against half of Mozilla or WebKit), but you can consider using Java with HTMLUnit.
And for those suggesting regular expressions, an obligatory reference.

There's no need to use C++ when C-style sscanf will do, or even Perl or any other language with regular expression support.
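For the fixed row layout shown in the question, the regular-expression route could be sketched in C++ with std::regex as below. This is only a sketch: extract_rows is a hypothetical helper name, and the pattern assumes every row looks exactly like the two quoted in the question (single-quoted class attribute, the number in the very next <td>). As the other answers warn, a regex like this breaks as soon as the markup changes, so a real HTML parser is the safer choice for anything less regular.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not from the original thread): returns the
// (name, number) pairs found in rows shaped like the ones in the question.
std::vector<std::pair<std::string, int>> extract_rows(const std::string& html) {
    // Group 1: text inside <span class='...'>...</span>
    // Group 2: digits inside the <td> that follows the name cell
    static const std::regex row(
        R"re(<span class='[^']*'>([^<]*)</span></a></td>\s*<td>(\d+)</td>)re");

    std::vector<std::pair<std::string, int>> out;
    for (std::sregex_iterator it(html.begin(), html.end(), row), end; it != end; ++it) {
        out.emplace_back((*it)[1].str(), std::stoi((*it)[2].str()));
    }
    return out;
}
```

Calling extract_rows on the page source (fetched separately, for example with libcurl) would give one pair per table row, such as the name Zonnza with the value 100 for the first row in the question.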

Related

Can anyone identify what language this code is?

I'm trying to identify what language this code is (it is being used in dotCMS) so I can learn to write it. Any help identifying it would be massively appreciated. The code has been sanitized.
#panelStart('XXX.XXX' )
<div class="col-md-6">
#readOnlyField('XXX', 'XXX.XXX', $XXX.XXX, {'XXX':X, 'notCurrency':true})
#readOnlyField('XXX', 'XXX.XXX', $XXX.XXX, {'XXX':X, 'notCurrency':true})
#readOnlyField('XXX', 'XXX.XXX', $XXX.XXX, {'XXX':X, 'notCurrency':true})
#readOnlyField('XXX', 'XXX.XXX', $XXX.XXX, {'XXX':X, 'notCurrency':true})
</div>
#panelEnd
Thanks
As it says in the official dotCMS documentation, they use HTML, CSS and JavaScript as programming languages.
The "#" signs are used by Apache Velocity, which is the main server-side scripting language dotCMS supports.
In Velocity, "#" signs identify Velocity commands, macros, or directives.
But those aren't built-in Velocity commands, and they're unlikely to be directives (which require Java code to implement), so they're probably macros. You'll need to look in other files to see where those macros are defined.

How to stop the Google Translate API from translating proper names that contain common words

I am using Google Cloud Platform to create content in an Indian regional language. Some of the content contains building and society names that are made of common words, like 'The Nest Glory'. After conversion by the Google Translate API, the building name should only be spelled out in the regional language; instead it is being translated literally. It sounds funny, and users will never find that building.
The Cloud Translation API can be told not to translate some part of the text. To do so, use one of the following HTML tags:
<span translate="no"> </span>
<span class="notranslate"> </span>
This functionality requires the source text to be submitted in HTML.
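A minimal sketch of preparing such a request body in C++: the hypothetical helper protect_name below wraps each occurrence of a known proper name in a translate="no" span before the text is submitted. It assumes the text is already valid HTML, that the list of names is known in advance, and that a name never occurs inside a tag or attribute; the call to the translation service itself is not shown.

```cpp
#include <string>

// Hypothetical helper (not part of any Google library): wraps every
// occurrence of `name` in `text` in <span translate="no">...</span>.
std::string protect_name(std::string text, const std::string& name) {
    const std::string open = "<span translate=\"no\">";
    const std::string close = "</span>";
    if (name.empty()) return text;

    std::size_t pos = 0;
    while ((pos = text.find(name, pos)) != std::string::npos) {
        text.insert(pos + name.size(), close);  // closing tag first, so `pos` stays valid
        text.insert(pos, open);
        pos += open.size() + name.size() + close.size();  // continue after the wrapped name
    }
    return text;
}
```

For example, protect_name("Visit The Nest Glory today", "The Nest Glory") yields the same sentence with the building name enclosed in the span, ready to be sent as HTML.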
Access to the Cloud Translation API
How to Disable Google Translate from Translating Specific Words
After some searching, I think one older way was to use the Transliteration API, which has since been deprecated by Google.
Google Transliteration API Deprecated
translate="no"
class="notranslate"
In the latest API versions these features don't work.
That's a very complex thing to do: it needs semantic analysis as well as translation, and there is no real way to simply add that to your application. The best we can suggest is that you identify the names, remove them from the strings (or replace them with non-translated markers, as they may not be in anything like the same position in the translated messages) and reinstate them afterwards.
But ... that may well change the quality of the translation; as I said, Google Translate is pretty clever ... but it isn't a human interpreter yet.
If that doesn't work, you need to use a real-world human translator who is very familiar with the source and target languages.
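The mask-and-reinstate idea from this answer could look roughly like the following C++ sketch. The function names and the "__NAME0__" marker format are made up for illustration, and it is an assumption that the translation service returns such markers unchanged; that would need to be checked against real output.

```cpp
#include <string>
#include <vector>

// Replaces every occurrence of `from` in `s` with `to`.
void replace_all(std::string& s, const std::string& from, const std::string& to) {
    if (from.empty()) return;
    for (std::size_t pos = 0; (pos = s.find(from, pos)) != std::string::npos;
         pos += to.size()) {
        s.replace(pos, from.size(), to);
    }
}

// Hypothetical marker format: "__NAME0__", "__NAME1__", ...
std::string marker(std::size_t i) { return "__NAME" + std::to_string(i) + "__"; }

// Step 1: swap each known proper name for its marker before translating.
std::string mask_names(std::string text, const std::vector<std::string>& names) {
    for (std::size_t i = 0; i < names.size(); ++i) replace_all(text, names[i], marker(i));
    return text;
}

// Step 2: after translating, put the original (or transliterated) names back.
std::string restore_names(std::string text, const std::vector<std::string>& names) {
    for (std::size_t i = 0; i < names.size(); ++i) replace_all(text, marker(i), names[i]);
    return text;
}
```

The translated string would be passed through restore_names with either the original names or their transliterations, depending on how the name should appear in the regional language.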

How to parse the java comments of a groovy file to html format?

I have a set of .groovy files (Java). All of these files have the same comment format.
I developed a tool with which I'm able to read those files and apply a regex to collect all the comments in a list. (In the end I just have to copy and paste these comments into an .html file.)
I would like to know whether this is a sound way to generate an HTML page from the comments (a kind of documentation). If not, what would you recommend?
I read about Doxygen and Javadoc, but I'm not sure about using them (or whether they would really be useful in my case, since the comments are already written).
Can you suggest a library for easily generating an HTML page, or give any other advice?
Any help is appreciated.
There exists Groovydoc which is roughly the equivalent of Javadoc, just for Groovy.
As your setup is not that (you already have comments, probably not in Groovydoc format, and you have half the tooling), there are still multiple ways open to you. As you already extract the documentation from groovy, if I were you, I would do a minimal post-formatting, if necessary, and output the documentation as markdown (e.g., github markdown) or asciidoc (e.g., asciidoctor). Then you can use any preferred tool to convert the post-formatted documentation into HTML.
To answer the question "How to parse the java comments": you shouldn't. If possible, especially in a new project, stick with the standard tooling. In the case of Groovy, that's Groovydoc. You should never need to extract the normal (non-Javadoc/Groovydoc style) comments from the source code. They should be so context-specific that they are useless without the corresponding code anyway.

Remove alt attributes from HTML

I have a huge HTML file which I'm trying to format to be able to import the content into a different application. The one thing that remains is that I need to remove all alt attributes from the HTML entirely. They all have different values and there are around 5000 of them, so clearly a straight find & replace isn't an option. Perhaps there's a way to find and replace with regex in Visual Web developer?
The tools/skills I have available are: HTML, Javascript, ASP (Classic), a little bit of .NET, Visual Web Developer Express 2010, but the only similar things I can find are PHP-based and they don't explain fully enough for me to set up a solution and feed the HTML to it.
I've found things like this: Regular expression to replace several html attributes, which give suggestions of regex functions which do similar things, but I'm not even sure how to run a regex function on a straight HTML file (my browser is struggling with the size of the HTML file as it is, so I don't think javascript is going to cut it).
Can anyone suggest the best way to accomplish this?
Thanks folks...
Since you use Visual Studio, you can try the Regex search & replace option, though the implementation of regexes in Visual Studio is pretty different from other regex engines.
Here's a short article about it:
http://www.codinghorror.com/blog/2006/07/the-visual-studio-ide-and-regular-expressions.html
As it says in the article, the built-in regex engine isn't ideal. They mention a plugin which implements standard regexes, though:
http://www.codeproject.com/Articles/9125/Standard-Regular-Expression-Searcher-Addin-For-VS
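If running the replacement outside the editor is an option, the same idea can be sketched with std::regex_replace in C++. The helper name strip_alt and the pattern are assumptions for illustration: the pattern removes an alt attribute whose value is double-quoted, single-quoted, or unquoted, together with the whitespace in front of it. It would also remove text that merely looks like an alt attribute outside a tag, so the result is worth spot-checking.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: removes alt="...", alt='...' and alt=value
// (case-insensitive), including the whitespace that precedes the attribute.
std::string strip_alt(const std::string& html) {
    static const std::regex alt_attr(
        R"re(\s+alt\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))re",
        std::regex::icase);
    return std::regex_replace(html, alt_attr, "");
}
```

For a very large file, reading it line by line and applying strip_alt to each line keeps memory use low, assuming no alt attribute is split across a line break.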

Extracting key words from HTML to C++ under linux

I am working on a simple client-server project. The client is written in Java; it sends keywords to a C++ server running on Linux and receives a list of the best-ranked URLs (ranked by the number of keyword occurrences). The server's job is to go through some URLs in search of the keywords and return the best-fitting URLs. The problem is that I have to parse HTML pages to find occurrences of the keywords, and I also need to extract links from each visited page so I can search those pages as well. What library can I use to do that? Only C++ libraries for Linux are suitable for me. There were some similar topics, and I went through most of them, but some of the libraries only parse HTML files, and I don't want to download every page I visit; I want to parse it on the fly and just store its rank and URL. Others look a bit complicated to me, for instance first converting HTML to XML or something else and only then working on the results in C++. Is there something simple and sufficient for what I need? Any advice will be appreciated.
I don't think regular expressions are appropriate for HTML parsing. I'm using libxml2, and I enjoy it very much - easy to use, portable and lightning fast.
To fetch pages from the web using C/C++ you could use the libcurl library. To extract URLs and other less trivial items from the page you can use a regex library.
Separating the HTML tags from the real content can also be done without the use of a library.
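A minimal sketch of that library-free approach, with hypothetical helper names: strip_tags drops everything between '<' and '>' and keeps the rest, and count_occurrences counts non-overlapping, case-sensitive occurrences of a keyword in the remaining text. It deliberately ignores harder cases such as script and style blocks, comments, character entities, and '>' inside attribute values.

```cpp
#include <string>

// Hypothetical helper: keeps only the characters that are outside <...>.
std::string strip_tags(const std::string& html) {
    std::string text;
    bool in_tag = false;
    for (char c : html) {
        if (c == '<') {
            in_tag = true;
        } else if (c == '>' && in_tag) {
            in_tag = false;
        } else if (!in_tag) {
            text += c;
        }
    }
    return text;
}

// Hypothetical helper: counts non-overlapping occurrences of `word` in `text`.
std::size_t count_occurrences(const std::string& text, const std::string& word) {
    if (word.empty()) return 0;
    std::size_t n = 0;
    for (std::size_t pos = 0; (pos = text.find(word, pos)) != std::string::npos;
         pos += word.size()) {
        ++n;
    }
    return n;
}
```

Because the keyword count is taken on the stripped text, a keyword that only appears inside an attribute (such as a file name in an href) does not contribute to the rank.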
For more advanced stuff one could use Qt, which offers classes such as QWebPage (which uses WebKit) that allow one to access the DOM of the page and extract individual HTML objects (e.g. single cells of a table) rather easily.
You can try xerces-c. It's a powerful library for XML parsing. It supports reading XML on the fly, as well as DOM and SAX parsing.