Qt adjust value relatively using regular expression - c++

I have some HTML (excerpt)
<span style="font-size:30pt;">HELLO</span>
I have this html stored in a QString.
I wish to reduce the font size by a factor. For simplicity's sake let's halve it, so I need it changed to
<span style="font-size:15pt;">HELLO</span>
Can I apply QString::replace() here somehow?
The only examples I have seen replace a match with an absolute value, but mine needs to read the current value, apply simple math to it, then write it back.
FWIW, I have this expression worked out:
<span.+font-size:(\d+)pt;.*?>
I don't think it really matters that this question places it in context of HTML.
I suppose it could apply to any string.

Use QRegularExpression.
QRegularExpression::match() returns a QRegularExpressionMatch, which gives you access to the captured strings.
You then have to parse the captured string as a number, do the maths, and rebuild the final string.
Note that Qt provides a useful example program for QRegularExpression: https://doc.qt.io/qt-5/qtwidgets-tools-regularexpression-example.html
If you are using Qt Creator you can open it from the Welcome page.
Also note that using regular expressions to parse HTML does not work well.
If you are sure that you will only get simple HTML it can work, but if the HTML comes from untrusted sources, you could end up with a mix of HTML, CSS and JavaScript that your regular expression won't match.
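The match, parse, compute and rebuild steps described above can be sketched as follows. This is only a minimal illustration under a few assumptions: it uses std::regex from the standard library instead of QRegularExpression so that it compiles without Qt, the function name scaleFontSizes is made up for this example, and it matches just the font-size:<N>pt part rather than the whole span tag. With Qt the same loop could be written around QRegularExpression::globalMatch() and QRegularExpressionMatch::captured().

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: scale every "font-size:<N>pt" value in `html` by `factor`.
// The result is truncated towards zero, so 15pt * 0.5 becomes 7pt.
std::string scaleFontSizes(const std::string& html, double factor) {
    static const std::regex re(R"(font-size:(\d+)pt)");
    std::string out;
    std::size_t last = 0;  // end of the previous match in `html`
    for (std::sregex_iterator it(html.begin(), html.end(), re), end; it != end; ++it) {
        const std::smatch& m = *it;
        const auto start = static_cast<std::size_t>(m.position(0));
        // Copy the untouched text between the previous match and this one.
        out.append(html, last, start - last);
        // Parse the captured number, apply the maths, write it back.
        const int size = std::stoi(m[1].str());
        const int scaled = static_cast<int>(size * factor);
        out += "font-size:" + std::to_string(scaled) + "pt";
        last = start + static_cast<std::size_t>(m.length(0));
    }
    // Copy whatever follows the last match.
    out.append(html, last, std::string::npos);
    return out;
}
```

Calling scaleFontSizes on the span from the question with a factor of 0.5 turns font-size:30pt into font-size:15pt and leaves the rest of the string untouched.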

Related

How to get the first digit on the left side of a string with python and regex?

I want to extract a specific number based on the text that follows it.
This part of the string is in body2.txt
string = "<li>3 <span class='text-info'>quartos</span></li><li>1 <span class='text-info'>suíte</span></li><li>96<span class='text-info'>Área Útil (m²)</span></li>"
import re
with open("body2.txt", 'r') as f:
    area = re.compile(r'</span></li><li>(\d+)<span class="text-info">Área Útil')
    area = area.findall(f.read())
    print(area)
output: []
expected output: 96
You have a quote mismatch. Note carefully the difference between 'text-info' and "text-info" in your example string and in your compiled regex. IIRC escaping quotes in raw strings is a bit of a pain in Python (if it's even possible?), but string concatenation sidesteps the issue handily.
area = re.compile(r'</span></li><li>(\d+)<span class='"'"'text-info'"'"'>Área Útil')
Focusing on the quotes, this concatenates the strings '...class=', "'", 'text-info', "'", and '>...'. The rule is that if you want a single quote ' in a single-quoted raw string you instead write '"'"' and try to ignore Turing turning in his grave. Only the first piece carries the r prefix; the later pieces are ordinary strings, which is harmless here because they contain no backslashes. Python joins adjacent string literals at compile time, so there is no copying behind the scenes and no runtime cost; the only price is readability. You'd still be better off with a cleaner spelling, such as delimiting the raw string with double quotes, r"...class='text-info'...", or using a triple-quoted raw string r'''...'''.
As one of the comments mentioned, you probably want to be parsing the HTML with something more powerful than regex. Regex cannot properly parse arbitrary HTML since it can't parse arbitrarily nested structures. There are plenty of libraries to make the job easier though and handle all of the bracket matching and string munging for you so that you can focus on a high-level description of exactly the data you want. I'm a fan of lxml. Without putting a ton of time into it, something like the following would be roughly equivalent to what you're doing.
from lxml import html
with open("body2.txt", 'r') as f:
    tree = html.fromstring(f.read())
area = tree.xpath("//li[contains(span/text(), 'Área Útil')]/text()")
print(area)
The html.fromstring() method parses your data as html. The tree.xpath method uses xpath syntax to query that parsed tree. Roughly speaking it means the following:
//: Arbitrarily far down in the tree
li: A list node
[*]: Satisfying whatever property is in the square brackets
contains(span/text(), 'Área Útil'): The li node needs to have a span/text() node containing the text 'Área Útil'
/text(): We want any text that is an immediate child of the root li we're describing.
I'm working on a pretty small amount of text here and don't know what your document structure is in the general case. You could add or change any of those properties to better describe the exact document you're parsing. When you inspect an element, any modern browser is able to generate a decent xpath expression to pick out exactly the element you're inspecting. Supposing this snippet came from a larger document I would imagine that functionality would be a time saver for you.
This will get the right digits no matter how / what form the target is in.
Capture group 1 contains the digits.
r"(\d*)\s*<span(?=\s)(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?\sclass\s*=\s*(?:(['\"])\s*text-info\s*\2))\s+(?=((?:\"[\S\s]*?\"|'[\S\s]*?'|[^>]?)+>))\3\s*Área\s+Útil"
https://regex101.com/r/pMATkj/1

Escaping and unescaping HTML

In a function I do not control, data is being returned via
return xmlFormat(rc.content)
I later want to do a
<cfoutput>#resultsofreturn#</cfoutput>
The problem is all the HTML tags are escaped.
I have considered
<cfoutput>#DecodeForHTML(resultsofreturn)#</cfoutput>
But I am not sure these are inverses of each other
Like Adrian concluded, the best option is to implement a system to get to the pre-encoded value.
In its current state, the string you're working with is encoded for an XML document. One option is to create an XML document containing the text and parse the text back out of it. I'm not sure how efficient this method is, but it will return the text to its pre-encoded value.
function xmlDecode(text){
return xmlParse("<t>#text#</t>").t.xmlText;
}
TryCF.com example
As of CF 10, you should be using the newer encodeFor functions. These functions account for high ASCII characters as well as UTF-8 characters.
Old and Busted
XmlFormat()
HTMLEditFormat()
JSStringFormat()
New Hotness
encodeForXML()
encodeForXMLAttribute()
encodeForHTML()
encodeForHTMLAttribute()
encodeForJavaScript()
encodeForCSS()
The output from these functions differs by context.
Then, if you're only getting escaped HTML, you can convert it back using Jsoup or the Jakarta Commons Lang library. There are some examples in a related SO answer.
Obviously, the best solution would be to update the existing function to return either version of the content. Is there a way to copy that function in order to return the unescaped content? Or can you just call it from a new function that uses the Java solution to convert the HTML?

Parse a string for open and close tags

Let's say I have the following strings:
"This [color=RGB]is[\color] a string."
"This [color=RGB][bold]is[\bold][\color] another string."
What I'm looking for is a good way to parse the string in order to extract the tag information and then reconstruct the original string without tags.
The tag information will be used during text rendering.
Obviously I can achieve the goal by working directly with strings (find/substr/replace and so on), but I'm asking whether there is a cleaner way, for example using regular expressions.
Note:
There are very few tags I need, but there is the possibility to nest them (only of different type).
Can't use Boost.
There's a very simple answer that might work, depending on the complexity of your strings (and assuming I understand you correctly, i.e. you just want the cleaned-up strings, not the tags themselves). Just remove all tags. Replace
\[.*?]
with nothing.
Note that if your strings can legitimately contain bracketed text that is not a tag, this will remove that text as well.
Regards
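Since the question also asks for the tag information, the replace-with-nothing idea above can be extended to record each tag while it is removed. The following is only a sketch under stated assumptions: it uses std::regex (so no Boost is needed), the names Tag, Parsed and stripTags are invented for this example, and every bracketed run is treated as a tag, with the same limitation as the plain replacement.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical record of one removed tag.
struct Tag {
    std::string name;      // text between the brackets, e.g. "color=RGB" or "\color"
    std::size_t position;  // offset in the cleaned text where the tag stood
};

// Hypothetical result type: the cleaned text plus the tags that were removed.
struct Parsed {
    std::string text;
    std::vector<Tag> tags;
};

Parsed stripTags(const std::string& input) {
    static const std::regex tagRe(R"(\[(.*?)\])");
    Parsed result;
    std::size_t last = 0;  // end of the previous tag in `input`
    for (std::sregex_iterator it(input.begin(), input.end(), tagRe), end; it != end; ++it) {
        const std::smatch& m = *it;
        const auto start = static_cast<std::size_t>(m.position(0));
        // Keep the plain text that precedes this tag.
        result.text.append(input, last, start - last);
        // Remember the tag and where it applies in the cleaned text.
        result.tags.push_back({m[1].str(), result.text.size()});
        last = start + static_cast<std::size_t>(m.length(0));
    }
    // Keep the plain text after the last tag.
    result.text.append(input, last, std::string::npos);
    return result;
}
```

Pairing opening and closing tags, and rejecting badly nested ones, would still have to be done on the recorded list, for example with a stack.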

Stripping superscript from plaintext

I often grab quotes from articles whose citations are marked with superscripted footnotes, which are a pain in the ass once copied. Because the text is pasted as plaintext rather than HTML, the superscripts show up as ordinary letters in the text.
Is there a way I could run this through a regex to take out these superscripts?
For example
In the abeginning bGod ccreated the dheaven and the eearth.
Should become
In the beginning God created the heaven and the earth.
I can't think of a way to have regex search for misspellings and a corresponding sequential set of numbers and letters.
Any thoughts? I'm also using Sublime Text 3 for the majority of my writing, but I wouldn't mind outsourcing this to an AppleScript, or text replacement app (aText, textExpander, etc.).
Matching Code vs. Matching a Screen
It's hard to tell without seeing an example, but this should be doable if you copy the text from code view, as opposed to the regular browser view. (Ctrl or Cmd-J is your friend). Since writing the rules will take time, this will only be worthwhile for large chunks of text.
In code view, your superscript will be marked up in a way that can be targeted by regex. For instance:
and therefore bananas make you smartera
in the browser view (where the a at the end is a citation note) may look like this in code view:
and therefore bananas make you smarter<span class="mycitations">a</span>
In your editor, using regex, you can process the text to remove all tags, or just certain tags. The rules may not always be easy to write, and of course there are many disclaimers about using regex to parse html.
However, if your source is always the same (Wikipedia for instance), then you can create and save rules that should work across many pages.
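To make the idea concrete, here is a small sketch of such a rule, written in C++ with std::regex purely for illustration. The class name mycitations comes from the example above; the function name stripCitations is made up, and the pattern assumes that the citation span contains no nested tags. The same pattern could be pasted into an editor's regex find-and-replace with an empty replacement.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: remove every <span class="mycitations">...</span>
// element, including the citation letter inside it.
std::string stripCitations(const std::string& html) {
    static const std::regex citationRe(R"(<span class="mycitations">[^<]*</span>)");
    return std::regex_replace(html, citationRe, "");
}
```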

Extract specific portion of HTML file using c++/boost::regex

I have a series of thousands of HTML files and, for the ultimate purpose of running a word-frequency counter, I am only interested in a particular portion of each file. For example, suppose the following is part of one of the files:
<!-- Lots of HTML code up here -->
<div class="preview_content clearfix module_panel">
<div class="textelement "><div><div><p><em>"Portion of interest"</em></p></div>
</div>
<!-- Lots of HTML code down here -->
How should I go about using regular expressions in c++ (boost::regex) to extract that particular portion of text highlighted in the example and put that into a separate string?
I currently have some code that opens the HTML file and reads the entire content into a single string, but when I run boost::regex_match looking for that particular opening line, <div class="preview_content clearfix module_panel">, I don't get any matches. I'm open to any suggestions as long as they are in C++.
How should I go about using regular expressions in c++ (boost::regex) to extract that particular portion of text highlighted in the example and put that into a separate string?
You don't.
Never use regular expressions to process HTML. Whether in C++ with Boost.Regex, in Perl, Python, JavaScript, anything and anywhere. HTML is not a regular language; therefore, it cannot be processed in any meaningful way via regular expressions. Oh, in extremely limited cases, you might be able to get it to extract some particular information. But once those cases change, you'll find yourself unable to get done what you need to get done.
I would suggest using an actual HTML parser, like LibXML2 (which does have the ability to read HTML4). Using regexes to parse HTML is simply using the wrong tool for the job.
Since all I needed was something quite simple (as per question above), I was able to get it done without using regex or any type of parsing. Following is the code snippet that did the trick:
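#include <fstream>
#include <iterator>
#include <string>
// Read HTML file into string variable str
std::ifstream t("/path/inputFile.html");
std::string str((std::istreambuf_iterator<char>(t)), std::istreambuf_iterator<char>());
// Find the two "flags" that enclose the content I'm trying to extract
size_t pos1 = str.find("<div class=\"preview_content clearfix module_panel\">");
// Look for the closing flag only after the opening one
size_t pos2 = (pos1 == std::string::npos) ? pos1 : str.find("</em></p></div>", pos1);
// Get that content, starting at the opening flag, and store it into a new string
// (buf stays empty if either flag is missing)
std::string buf;
if (pos2 != std::string::npos)
    buf = str.substr(pos1, pos2 - pos1);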
Thank you for pointing out the fact that I was totally on the wrong track.
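For thousands of files, the same find/substr approach may be easier to reuse as a small function that returns only the text between the two markers. This is a sketch, not the code from the answer: the name extractBetween and its signature are invented here, and it reports a missing marker through its return value instead of producing a garbage substring.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: copy the text between the first `open` marker and the
// next `close` marker after it into `out`, excluding both markers.
// Returns false, leaving `out` untouched, if either marker is missing.
bool extractBetween(const std::string& text,
                    const std::string& open,
                    const std::string& close,
                    std::string& out) {
    const std::size_t start = text.find(open);
    if (start == std::string::npos) return false;
    const std::size_t contentStart = start + open.size();
    // Search for the closing marker only after the opening one.
    const std::size_t end = text.find(close, contentStart);
    if (end == std::string::npos) return false;
    out = text.substr(contentStart, end - contentStart);
    return true;
}
```

With the file contents in str, calling extractBetween(str, "<em>", "</em>", buf) would leave just the quoted portion in buf, provided the first <em> in the file is the one of interest; otherwise a longer opening marker is needed.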