How to cut text placed between two words using a regular expression? - regex

I'm a beginner in regular expressions and I want to cut out some text placed between two other words. I'm using Qt to do it. An example:
<li class="wx-feels">
Feels like <i><span class="wx-value" itemprop="feels-like-temperature-fahrenheit">55</span>°</i>
</li>
I want to get
Feels like <i><span class="wx-value" itemprop="feels-like-temperature-fahrenheit">55</span>°
from the code above, especially the number 55. My idea was to cut the whole line out of the text first and then search it for numbers, but I cannot extract it from the whole text.
I typed something like this:
QRegExp rx("(Feels like <i><span class=\"wx-value\" itemprop=\"feels-like-temperature-fahrenheit\">)[0-9]{1,3}(</span>°</i>)");
QStringList list;
list = all.split(rx);
Where all is the whole text, but the list contains only the substrings I didn't want. Is there a way to split a QString into three pieces?
First - text at the beginning (which I don't want)
Second - wanted text
Third - rest of text?

Description
This regex collects the inner string of the li tag whose class is wx-feels. It also captures the numeric value inside the span tag.
<li\b[^>]*\bclass=(["'])wx-feels\1[^>]*?>(.*?\bitemprop=(['"])feels-like-temperature-fahrenheit\3[^>]*>(\d+).*?)<\/li>
Groups
Group 0 gets the entire string, including the open and close LI tags
Group 1 gets the open quote for the LI class attribute. This allows us to find the correct close quote after the value
Group 2 gets the string directly inside the LI tag
Group 3 gets the open quote for the itemprop attribute
Group 4 gets the digits from the span inner text
Example
This PHP example is simply to show how the regex works.
<?php
$sourcestring="<li class=\"wx-feels\">
Feels like <i><span class=\"wx-value\" itemprop=\"feels-like-temperature-fahrenheit\">55</span>°</i>
</li>";
preg_match('/<li\b[^>]*\bclass=(["\'])wx-feels\1[^>]*?>(.*?\bitemprop=([\'"])feels-like-temperature-fahrenheit\3[^>]*>(\d+).*?)<\/li>/ims',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
$matches Array:
(
[0] => <li class="wx-feels">
Feels like <i><span class="wx-value" itemprop="feels-like-temperature-fahrenheit">55</span>°</i>
</li>
[1] => "
[2] =>
Feels like <i><span class="wx-value" itemprop="feels-like-temperature-fahrenheit">55</span>°</i>
[3] => "
[4] => 55
)
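The question also asked how to split the text into three pieces. A minimal sketch of that in Python (chosen only for illustration; the names FEELS_LIKE and split_around are made up here), reusing the pattern above and slicing the text around group 2:

```python
import re

# Same pattern as above; the "i" and "s" flags of the PHP call
# become re.IGNORECASE and re.DOTALL.
FEELS_LIKE = re.compile(
    r'''<li\b[^>]*\bclass=(["'])wx-feels\1[^>]*?>(.*?\bitemprop=(['"])feels-like-temperature-fahrenheit\3[^>]*>(\d+).*?)</li>''',
    re.IGNORECASE | re.DOTALL,
)

def split_around(text):
    """Return (before, wanted, after, number), or None when nothing matches."""
    m = FEELS_LIKE.search(text)
    if m is None:
        return None
    before = text[:m.start(2)]  # everything up to the inner text of the <li>
    wanted = m.group(2)         # the inner text of the <li>
    after = text[m.end(2):]     # the rest of the text
    return before, wanted, after, int(m.group(4))

sample = '''<ul>
<li class="wx-feels">
Feels like <i><span class="wx-value" itemprop="feels-like-temperature-fahrenheit">55</span>°</i>
</li>
</ul>'''

before, wanted, after, number = split_around(sample)
print(number)
```

Slicing with the match offsets, rather than splitting on the pattern, keeps the wanted text instead of discarding it, which is what went wrong with split() in the question.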
Disclaimer
Parsing HTML with a regex can be problematic because of the high number of edge cases. If you are in control of the input text, or if it's always as basic as your sample, then you should have no problem.
If Qt has one, I recommend using an HTML parsing tool to capture this data.
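For comparison, here is a rough sketch of the parser-based route, written in Python with the standard html.parser module purely for illustration (it is not Qt code, and the class name FeelsLikeParser is invented for this example). It collects the digits found inside any span whose itemprop is feels-like-temperature-fahrenheit:

```python
from html.parser import HTMLParser

class FeelsLikeParser(HTMLParser):
    """Collect numbers inside <span itemprop="feels-like-temperature-fahrenheit">."""

    def __init__(self):
        super().__init__()
        self._in_target = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and dict(attrs).get("itemprop") == "feels-like-temperature-fahrenheit":
            self._in_target = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_target = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_target and text.isdigit():
            self.values.append(int(text))

html = '''<li class="wx-feels">
Feels like <i><span class="wx-value" itemprop="feels-like-temperature-fahrenheit">55</span>°</i>
</li>'''

parser = FeelsLikeParser()
parser.feed(html)
parser.close()
print(parser.values)
```

Unlike the regex, this does not depend on attribute order or quoting style inside the tag.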

Related

Mass regex search-and-replace BETWEEN patterns

I have a directory with a bunch of text files, all of which follow this structure:
...
- Some random number of list items of random text
- And even more of it
PATTERN_A (surrounded by empty lines)
- Again, some list items of random text
- Which does look similar as the first batch
PATTERN_B (surrounded by empty lines)
- And even more some random text
....
And I need to run a replace operation (let's say, I need to prepend CCC at the beginning of the line, just after the dash) on only those "list items", which are between PATTERN_A and PATTERN_B. The problem is they aren't really much different from the text above PATTERN_A, or below PATTERN_B, so an ordinary regex can't really catch them without also affecting the remaining text.
So, my question would be, what tool and what regex should I use to perform that replacement?
(Just in case, I'm fine with Vim, and I can collect those files in a QuickFix for a further :cdo, for example. I'm not that good with awk, unfortunately, and absolutely bad with Perl :))
Thanks!
If I have understood your question, you can do this quite easily with sed (the stream editor), using a pattern-range address together with the general substitution form. For example, in your case:
$ sed '/PATTERN_A/,/PATTERN_B/s/^\([ ]*-\)/\1CCC/' file
- Some random number of list items of random text
- And even more of it
PATTERN_A (surrounded by empty lines)
-CCC Again, some list items of random text
-CCC Which does look similar as the first batch
PATTERN_B (surrounded by empty lines)
- And even more some random text
(note: to substitute in place within the file add the -i option, and to create a backup of the original add -i.bak which will save the original file as file.bak)
Explanation
/PATTERN_A/,/PATTERN_B/ - select the lines from PATTERN_A through PATTERN_B (inclusive)
s/^\([ ]*-\)/\1CCC/ - substitute (general form 's/find/replace/'). The find part anchors at the beginning of the line with ^ and captures, between \(...\), the text matching [ ]*- (any number of spaces followed by a hyphen). The replace part is \1 (a backreference holding the characters captured by the group \(...\)) with CCC appended to it.
Look things over and let me know if you have questions or if I misinterpreted your question.
You can get the same result with Perl:
> perl -pe ' { s/^(\s*-)/\1CCC/g if /PATTERN_A/../PATTERN_B/ } ' mass_replace.txt
...
- Some random number of list items of random text
- And even more of it
PATTERN_A (surrounded by empty lines)
-CCC Again, some list items of random text
-CCC Which does look similar as the first batch
PATTERN_B (surrounded by empty lines)
- And even more some random text
....
>
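The range selection that both answers rely on can also be sketched in Python, in case neither sed nor Perl is an option. This is only an illustration under the assumption that the markers appear literally as PATTERN_A and PATTERN_B; the function name prefix_in_range is made up here:

```python
import re

def prefix_in_range(lines, start=r"PATTERN_A", end=r"PATTERN_B", prefix="CCC"):
    """Insert prefix after the leading dash of lines from start through end."""
    out = []
    in_range = False
    for line in lines:
        opened_here = False
        if not in_range and re.search(start, line):
            in_range = True
            opened_here = True
        # Like sed, only look for the end marker on lines after the start line.
        closes = in_range and not opened_here and re.search(end, line)
        if in_range:
            # Keep the captured spaces and dash, then append the prefix.
            line = re.sub(r"^(\s*-)", lambda m: m.group(1) + prefix, line)
        out.append(line)
        if closes:
            in_range = False
    return out

lines = [
    "- Some random number of list items of random text",
    "- And even more of it",
    "",
    "PATTERN_A",
    "",
    "- Again, some list items of random text",
    "- Which does look similar as the first batch",
    "",
    "PATTERN_B",
    "",
    "- And even more some random text",
]
result = prefix_in_range(lines)
print("\n".join(result))
```

The marker lines themselves fall inside the range, as they do with sed, but they are left alone because they do not start with a dash.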

Regex to Capture and wrap outline formatted text

I have source text that is not particularly clean or well formed but I have a need to find text and wrap a line in a tag. The text is in outline format.
1. becomes a <h1> tag
A. becomes a <h2> tag
(1) becomes a <h3> tag
and so on...
Here are some examples of the source.
1. PREPARE FOR TEST A. Open the door. B. Turn on the light.
The desired result would be
<h1>1. PREPARE FOR TEST</h1>
<h2>A. Open the door.</h2>
<h2>B. Turn on the light.</h2>
Unfortunately, the text could be on the same line, or it could be spread over multiple lines, or even have a different number of spaces between the outline number and the text. Another example:
(1) Check air inlet and air outlet valves are shown open if OAT is above > 53.6 deg F., or closed if OAT is below
48.2 deg F.
In this case the desired result would be
<h3>(1) Check skin air inlet and skin air outlet valves are shown open if temperature is above 53.6 deg F., or closed if temperature is below 48.2 deg F.</h3>
My questions are
How do I find an entire line of text that is associated with an outline level, i.e., the 1., A., (1) and so on.
How do I then wrap that text with the appropriate tag.
I'm not particularly strong at regex, I have been able to do some of the simpler things required of this project but this has me stumped a bit. Here's what I used to try to find the H1 lines, but as anyone that knows regex can plainly see, this won't work past the first word.
\d{1,3}.\s+[A-Z]{2,}
I'm using Python at the moment, but I am better with PHP than Python and can move to that if needed.
Thank you.
Since every regex needs a different substitution, you need to apply each regex in turn. Assuming that you want the match to always span an entire line, I'd suggest something like this:
import re
s = """1. becomes a h1 tag
A. becomes a h2 tag
(1) becomes a h3 tag
and so on..."""
regexes = {r"\d+\.": "h1",
           r"[A-Z]+\.": "h2",
           r"\(\d+\)": "h3",
          }
for regex in regexes:
    repl = regexes[regex]
    s = re.sub("(?m)^" + regex + ".*", "<" + repl + ">" + r"\g<0>" + "</" + repl + ">", s)
print(s)
Result:
<h1>1. becomes a h1 tag</h1>
<h2>A. becomes a h2 tag</h2>
<h3>(1) becomes a h3 tag</h3>
and so on...
Explanation:
Each of the regexes (which only match the actual identifiers) is modified to match from the start of the line until the end of the line:
"(?m)^" + regex + ".*" # (?m) allows ^ to match at the start of lines
The entire match is contained in group 0 which can be accessed in the replacement string via \g<0>.
"<" + repl + ">" + r"\g<0>" + "</" + repl + ">" # add tags around line
For future reference and to close this, what I eventually came up with was to run through the entire string of text and remove some trash first. There are actually 15 of these that I use for this step.
$regexes['lf'] = "/[\n\r]*/";
$regexes['tab-cr-lf'] = "/\t[\r\n]/";
$string = preg_replace($regexes, "", $string);
I then discovered that I could count on space and \t after each header identifier, so then I run some more regexes on the string
$regexes['step1'] = "/(\d{1,2}\..\t)/";
$regexes['step2'] = "/([A-Z]\. \t)/";
$replacements['step1'] = "\n\n<step1>$0";
$replacements['step2'] = "\n\n<step2>$0";
$string = preg_replace($this->headerRegexes, $replacements, $string);
These steps have given me some usable text that I can work with.
Thanks to everyone that chimed in, it gave me some things to think about as I tackled this problem.
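The wrapped-line case from the question (the item that continues with "48.2 deg F." on the next line) can be handled by a small cleanup pass before any tagging. The following Python sketch is one possible way to do it, not what was actually used; ITEM_START and join_wrapped_lines are hypothetical names, and it assumes that an item always opens with 1., A. or (1) followed by whitespace:

```python
import re

# Assumption: an outline item opens with digits+dot, one capital+dot,
# or digits in parentheses, followed by whitespace.
ITEM_START = re.compile(r"\s*(?:\d+\.|[A-Z]\.|\(\d+\))\s")

def join_wrapped_lines(text):
    """Append lines that do not start a new item to the previous item."""
    merged = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # drop empty lines
        if merged and not ITEM_START.match(line):
            merged[-1] += " " + stripped  # continuation of the previous item
        else:
            merged.append(stripped)
    return "\n".join(merged)

source = """(1) Check air inlet and air outlet valves are shown open if OAT is above 53.6 deg F., or closed if OAT is below
48.2 deg F.
A. Open the door."""

joined = join_wrapped_lines(source)
print(joined)
```

"48.2 deg F." is not mistaken for a new item because the dot after 48 is followed by a digit rather than whitespace. A continuation line that happens to begin with something like "F. " would be misread as a new item, so real data may need a stricter test.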

I need a regular Expression for Starts with and ends with

I am looking for a regular expression to locate numerous expressions for a find and replace. The expression looks like s360a__fieldname__c. I need to find all the instances where s360a__ is later followed by __c.
The match has to stay within one line, so that it does not pair a starting s360a__ with the next __c, which may be several lines below.
Here is an example of some of the xml I am changing.
<fields>
<fullName>s360a__AddressPreferredStreetAddressCity__c</fullName>
<deprecated>false</deprecated>
<externalId>false</externalId>
<label>Preferred Street City</label>
<length>255</length>
<required>false</required>
<trackFeedHistory>false</trackFeedHistory>
<trackHistory>false</trackHistory>
<type>Text</type>
<unique>false</unique>
</fields>
<fields>
<fullName>s360a__AddressPreferredStreetAddressCountry__c</fullName>
<deprecated>false</deprecated>
<externalId>false</externalId>
<label>Preferred Street Country</label>
<picklist>
You'd be better off using a parser combined with an XPath query instead. Here's an example with PHP (it can easily be adapted for e.g. Python as well). The idea is to load the DOM, then use XPath functions to filter the elements (starts-with() and text() in this example):
<?php
$xml = '<fields>
<fullName>s360a__AddressPreferredStreetAddressCity__c</fullName>
<deprecated>false</deprecated>
<externalId>false</externalId>
<label>Preferred Street City</label>
<length>255</length>
<required>false</required>
<trackFeedHistory>false</trackFeedHistory>
<trackHistory>false</trackHistory>
<type>Text</type>
<unique>false</unique>
</fields>';
$dom = simplexml_load_string($xml);
// find everything where the text starts with 's360a_'
$fields = $dom->xpath("//*[starts-with(text(), 's360a_')]");
print_r($fields);
# s360a__AddressPreferredStreetAddressCity__c
The code checks if the text starts with s360a_. Checking that it also ends with a specific string takes more work, because ends-with() is an XPath 2.0 function and the libxml-based XPath support in PHP implements only XPath 1.0:
# check if the node text starts and ends with a specific string
$fields = $dom->xpath("//*[starts-with(., 's360a_') and substring(text(), string-length(text()) - string-length('_c') +1) = '_c']");
?>
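Since the answer mentions Python, here is a rough equivalent using the standard xml.etree.ElementTree module. It is a sketch only: the root wrapper element and the variable names are assumptions added so the sample parses as one document, and the start and end checks are done with plain string methods instead of XPath:

```python
import xml.etree.ElementTree as ET

# The <root> wrapper is an assumption; the fragment in the question
# has no single top-level element.
xml_text = """<root>
<fields>
<fullName>s360a__AddressPreferredStreetAddressCity__c</fullName>
<label>Preferred Street City</label>
</fields>
<fields>
<fullName>s360a__AddressPreferredStreetAddressCountry__c</fullName>
<label>Preferred Street Country</label>
</fields>
</root>"""

root = ET.fromstring(xml_text)

# Keep elements whose text both starts with s360a__ and ends with __c.
matches = [
    el.text
    for el in root.iter()
    if el.text and el.text.startswith("s360a__") and el.text.endswith("__c")
]
print(matches)
```

Because each element's text is tested on its own, a match can never run from one field into another several lines below.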

vim: search, capture & replace on different lines using regex

Relatively new Linux/Vim/regex user here. I want to use a regex to search for a numerical pattern, capture it, and then use the captured value to prepend a string to the previous line. In other words, I have a file of this format:
title: description_id
text: {en: '2. text description'}
I want to capture the values from the text field and append them to the beginning of the title field...to yield something like this:
title: q2_description_id
text: {en: '2. text description'}
I feel like I've come across a way to reference other lines in a search & replace but am having trouble finding that now. Or maybe a macro would be suitable. Any help would be appreciated...thanks!
Perhaps something like:
:%s/\(title: \)\(.*\n\)\(text: \D*\)\(\d*\)/\1q\4_\2\3\4/
Where we are searching for 4 parts:
1) "title: "
2) rest of line and \n
3) "text: " and everything until the next digit in the line
4) first string of consecutive digits in the line
and spitting them back out, with 4) inserted between 1) and 2).
EDIT: Shorter solution by Peter in the comments:
:%s/title: \zs\ze\_.\{-}text: \D*\(\d*\)/q\1_/
Use \n to match newlines in the search pattern (and type ^V followed by Enter to insert a newline in the replacement). A quick and not very elegant example:
:%s/title: description_id\n\ntext: {en: '\(\i*\)\(.*\)/title: q\1_description_id^Mtext: {en: '\1\2/
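For readers who want to try the four-group substitution outside Vim, the same idea can be sketched with Python's re.sub. This is an illustration only; the second record in the sample is invented to show that every title/text pair is handled:

```python
import re

text = (
    "title: description_id\n"
    "text: {en: '2. text description'}\n"
    "title: other_id\n"
    "text: {en: '13. another description'}\n"
)

# Groups: 1 = "title: ", 2 = rest of the title line plus newline,
# 3 = "text: " up to the first digit, 4 = the digits.
pattern = re.compile(r"(title: )(.*\n)(text: \D*)(\d*)")

# Re-emit the groups with q<digits>_ inserted between groups 1 and 2.
result = pattern.sub(r"\g<1>q\g<4>_\g<2>\g<3>\g<4>", text)
print(result)
```

The \g<4> form is used instead of \4 so that the group reference cannot be misread when it is followed by other characters.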

Regex to remove footer using wildcards

Ok - this is well beyond my limited knowledge of regular expressions. We receive a report from a banking entity in a fixed-width text file format. Unfortunately, their system exports page headers with the data file, and these must be removed before processing on our end. The page headers start and end with the same text but the content changes (dates and page numbers). A typical one looks like:
00007xxxxx LAST1,FIRST1 111111 20120930
ABCD EXPORT RPT 10/04/12 at 10/04/12 16:20 Seq 1501 Page 16
MRK014 Report Date: 10/04/12
Acct# Name SH. Balance QTR (YYYYMMDD)
----------------------------------------------------------------------------------------------------
00007xxxxx LAST2,FIRST2 222222 20120930
So each header starts with "ABCD" (actually the name of the bank, just removed here for privacy) and ends with the row of -------------------.
What I need to get it down to is the customer data on two rows (00007xxxxx - those account numbers change per person).
So I need to select from the " ABCD" to the end of the "---" to remove that block of text.
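Try this regex. This is Java code, but you can use the same pattern in your language:
str = str.replaceAll("ABCD(?:.*[\n\r]+)+?-+[\n\r]*", "");
Here str contains your data above; I assume lines are separated by \n. The lazy +? makes the match stop at the first row of dashes after ABCD rather than running on to the end of the text.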
To make sure you are removing the correct part of the report, I would go with a more specific regex pattern.
Use the regex pattern
(?<=[\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]\-+[\n\r]+
and replace each match with an empty string.
However, if your environment does not support regex lookbehind, then you have to use the pattern:
([\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]\-+[\n\r]+
and replace each match with the first group.
For example, in JavaScript it would be:
str.replace(/([\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]\-+[\n\r]+/g, "$1")
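As a quick check of the lookbehind variant, here is a small Python sketch. The sample report is a shortened stand-in for the real data (the dash row is abbreviated and the column spacing is not preserved), and the names HEADER, report and cleaned are only for this example:

```python
import re

# Lookbehind variant of the pattern: the header must start right after a line break.
HEADER = re.compile(r"(?<=[\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]-+[\n\r]+")

report = (
    "00007xxxxx LAST1,FIRST1 111111 20120930\n"
    "ABCD EXPORT RPT 10/04/12 at 10/04/12 16:20 Seq 1501 Page 16\n"
    "MRK014 Report Date: 10/04/12\n"
    "Acct# Name SH. Balance QTR (YYYYMMDD)\n"
    "--------------------\n"
    "00007xxxxx LAST2,FIRST2 222222 20120930\n"
)

# Remove every page header, leaving only the account rows.
cleaned = HEADER.sub("", report)
print(cleaned)
```

Note that [^-]+ assumes the header text itself contains no hyphen before the dash row, which holds for the sample but should be verified against real reports.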