Regex to Capture and wrap outline formatted text - regex

I have source text that is not particularly clean or well formed but I have a need to find text and wrap a line in a tag. The text is in outline format.
1. becomes a <h1> tag
A. becomes a <h2> tag
(1) becomes a <h3> tag
and so on...
Here are some examples of the source.
1. PREPARE FOR TEST A. Open the door. B. Turn on the light.
The desired result would be
<h1>1. PREPARE FOR TEST</h1>
<h2>A. Open the door.</h2>
<h2>B. Turn on the light.</h2>
Unfortunately, the text could be the same line or it could be on multiple lines or even have a different number of spaces between the outline number and the text. Another example
(1) Check air inlet and air outlet valves are shown open if OAT is above > 53.6 deg F., or closed if OAT is below
48.2 deg F.
In this case the desired result would be
<h3>(1) Check skin air inlet and skin air outlet valves are shown open if temperature is above 53.6 deg F., or closed if temperature is below 48.2 deg F.</h3>
My questions are
How do I find an entire line of text that is associated with an outline level, i.e., the 1., A., (1) and so on.
How do I then wrap that text with the appropriate tag.
I'm not particularly strong at regex, I have been able to do some of the simpler things required of this project but this has me stumped a bit. Here's what I used to try to find the H1 lines, but as anyone that knows regex can plainly see, this won't work past the first word.
\d{1,3}.\s+[A-Z]{2,}
I'm using Python at the moment, but I'm better at PHP than Python and can move to that if needed.
Thank you.

Since every regex needs a different substitution, you need to apply each regex in turn. Assuming that you want the match to always span an entire line, I'd suggest something like this:
import re
s = """1. becomes a h1 tag
A. becomes a h2 tag
(1) becomes a h3 tag
and so on..."""
regexes = {r"\d+\.": "h1",
           r"[A-Z]+\.": "h2",
           r"\(\d+\)": "h3",
          }
for regex in regexes:
    repl = regexes[regex]
    s = re.sub("(?m)^" + regex + ".*", "<" + repl + ">" + r"\g<0>" + "</" + repl + ">", s)
print(s)
Result:
<h1>1. becomes a h1 tag</h1>
<h2>A. becomes a h2 tag</h2>
<h3>(1) becomes a h3 tag</h3>
and so on...
Explanation:
Each of the regexes (which only match the actual identifiers) is modified to match from the start of the line until the end of the line:
"(?m)^" + regex + ".*" # (?m) allows ^ to match at the start of lines
The entire match is contained in group 0 which can be accessed in the replacement string via \g<0>.
"<" + repl + ">" + r"\g<0>" + "</" + repl + ">" # add tags around line

For future reference and to close this, what I eventually came up with was to run through the entire string of text and remove some trash first. There are actually 15 of these that I use for this step.
$regexes['lf'] = "/[\n\r]*/";
$regexes['tab-cr-lf'] = "/\t[\r\n]/";
$string = preg_replace($regexes, "", $string);
I then discovered that I could count on space and \t after each header identifier, so then I run some more regexes on the string
$regexes['step1'] = "/(\d{1,2}\..\t)/";
$regexes['step2'] = "/([A-Z]\. \t)/";
$replacements['step1'] = "\n\n<step1>$0";
$replacements['step2'] = "\n\n<step2>$0";
$string = preg_replace($this->headerRegexes, $replacements, $string);
These steps have given me some usable text that I can work with.
Thanks to everyone that chimed in, it gave me some things to think about as I tackled this problem.
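For the wrapped-line case from the question (the (1) example), one option in Python is to join continuation lines onto the preceding outline item before tagging. The following is only a sketch under stated assumptions: the function name wrap_outline and the LEVELS table are made up for illustration, each outline item is assumed to start on its own line, and a marker is assumed to be followed by whitespace. It does not split several items that share one physical line, and a continuation line that happens to begin with something like "F. " would be misread as a new item.

```python
import re

# Outline markers and the tag each one maps to (mapping taken from the question).
# The trailing \s keeps "48.2 deg F." from being mistaken for a "48." marker.
LEVELS = [
    (re.compile(r"\d+\.\s"), "h1"),
    (re.compile(r"[A-Z]\.\s"), "h2"),
    (re.compile(r"\(\d+\)\s"), "h3"),
]

def wrap_outline(text):
    """Join continuation lines to their outline item, then wrap each item in its tag."""
    items = []  # list of [tag or None, text]
    for line in text.splitlines():
        line = " ".join(line.split())  # collapse runs of spaces and tabs
        if not line:
            continue
        tag = next((t for rx, t in LEVELS if rx.match(line)), None)
        if tag is not None:
            items.append([tag, line])       # a new outline item starts here
        elif items:
            items[-1][1] += " " + line      # continuation of the previous item
        else:
            items.append([None, line])      # text before the first marker
    return "\n".join(f"<{t}>{s}</{t}>" if t else s for t, s in items)

sample = (
    "1.   PREPARE FOR TEST\n"
    "A. Open the door.\n"
    "(1) Check valves are open if OAT is above 53.6 deg F., or closed if OAT is below\n"
    "48.2 deg F."
)
print(wrap_outline(sample))
```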

Related

Backreference without matching it on find result

Consider the text structure
(Title)[#1Title-link]
(Chapter1)[#Chapter1-link]
(Chapter2)[#Chapter2-link]
(Chapter3)[#Chapter3-link]
How can I backreference the title link without including it in the match? I'm trying to change
(Chapter1)[#Chapter1-link] => (Chapter1)[#1Title-link-Chapter1-link]
(Chapter2)[#Chapter2-link] => (Chapter2)[#1Title-link-Chapter2-link]
(Chapter3)[#Chapter3-link] => (Chapter3)[#1Title-link-Chapter3-link]
I tried searching for
(\(Title\)\[(.*?)])([\s\S]*?\[)#(\D.*?\])
then replace it with
$1$3$2-$4
but the problem is that it only matches once per search, and I have a lot of chapters, so replacing them one by one is too inefficient.
Hard-coding the title is no good either, because I have multiple files with the same structure.
Is this possible in regex? Any solution or alternative is welcome.
You can first do a search to get the correct substitution string and then do a subsequent replace operation with that substitution string. You did not specify what language you were using, so here is the code in Python (where that back reference to group 1 is \1 rather than the more usual $1):
import re
text = """(Title)[#1Title-link]
(Chapter1)[#Chapter1-link]
(Chapter2)[#Chapter2-link]
(Chapter3)[#Chapter3-link]"""
m = re.search(r'(?:\(Title\)\[#([^\]]*)\])', text)
assert(m) # that we have a match
substitution = m.group(1)
text = re.sub(r'\[#Chapter([^\]]*)\]', r'[#' + substitution + r'-Chapter\1' + ']', text)
print(text)
Prints:
(Title)[#1Title-link]
(Chapter1)[#1Title-link-Chapter1-link]
(Chapter2)[#1Title-link-Chapter2-link]
(Chapter3)[#1Title-link-Chapter3-link]
See Regex Demo 1 for getting the substitution string
See Regex Demo 2 for making the substitutions

Remove quotation marks between tags using regex

I have been struggling with trying to remove all quotation marks in an XML-file within specific tags in my Ruby on Rails project. The simple question is this: How do I remove all existing " if, and only if, they are within the description tag in the XML-file (using gsub)?
Example
<xml attribute="stuff"><name>Two inch thing (2")</name><description>This thing is really "awesome"></description></xml>
so that it becomes
<xml attribute="stuff"><name>Two inch thing (2")</name><description>This thing is really awesome></description></xml>
I have been struggling with regex for a few hours without getting anywhere.
I.e.
myxml_file.gsub(<regex matching quotation marks>, "")
This is part of a bigger problem. I use the gem "Ox" to parse XML files, loading them with Ox.load(myxml_file, mode: :hash), but the description parts hold CDATA, which Ox seems to ignore (it just sets it all to nil). So I do a gsub to remove the CDATA tags, but then some descriptions include quotation marks, which crashes the Ox load. This problem could (preferably) be solved already in the Ox.load part, for example by telling it to ignore CDATA tags...
Edit Upon request:
I fetch the XML-file (which is a product feed) from a url which is in this case gzipped (which I am quite sure does not affect the issue in case):
tmp_data = Net::HTTP.get(URI.parse(url))
gz = Zlib::GzipReader.new(StringIO.new(tmp_data))
data = gz.read
#feed = Ox.load(data, mode: :hash)
The product descriptions in this case looks like this example (where I have added a " just for sake of the issue):
<products><product><merchant_deep_link>https://www.sportlala.se/lopning-40y-edition-2-pack-thundercrus/22361/express</merchant_deep_link><display_price>SEK319</display_price><merchant_product_id>05353-392410-XS</merchant_product_id><merchant_image_url>https://www.sportlala.se/images/products/22361/1905353_392410_40y_Edition_2-Pack_Set_F.png</merchant_image_url><merchant_category></merchant_category><search_price>319</search_price><merchant_name>Sportlala SE</merchant_name><category_id>0</category_id><aw_deep_link>...</aw_deep_link><category_name></category_name><last_updated></last_updated><product_name>40y Edition 2-Pack Thunder/Crus</product_name><aw_product_id>24553291137</aw_product_id><aw_image_url>https://images2.productserve.com/?w=200&h=200&bg=white&trim=5&t=letterbox&url=ssl%3Awww.sportlala.se%2Fimages%2Fproducts%2F22361%2F1905353_392410_40y_Edition_2-Pack_Set_F.png&feedId=35735&k=477d0110b807fbbbcddc9fb74c52fc30c401ca4a</aw_image_url><delivery_cost></delivery_cost><data_feed_id>35735</data_feed_id><description><![CDATA[I detta paket får du två av Craft's absolut bästa baslager jerseys. Dessa "jerseys" har samlat det bästa från Craft's kollektioner och har den absolut högsta kvalitén! Material: 100% Polyester]]></description><merchant_id>17150</merchant_id><currency>SEK</currency><store_price></store_price><language></language></product></products>
This makes description=nil in the resulting hash from Ox, which I am quite certain is due to the CDATA wrapping in the tag (it is always nil, whether there are quotation marks (") or not).
I used a gsub to remove the CDATA (I have removed it now, but it was something like .gsub("<description><![CDATA[", "<description>").gsub("]]></description>", "</description>")), which effectively removed the CDATA but then brought out the quotation-mark issue.
So this problem can be solved either (preferably) at the "Ox load" level, through some configuration that I have not yet seen, or by a regexp on the " marks that works over the entire text.
Code:
s = '<xml attribute="stuff"><name>Two inch thing (2")</name><description>This thing is really "awesome"></description></xml>'
t = s.gsub(/(<description>)(.*?)(<\/description>)/) do
  open_tag, content, end_tag = $1, $2, $3
  content = content.gsub(/"/, '')
  [open_tag, content, end_tag].join
end
p s
p t
Output:
"<xml attribute=\"stuff\"><name>Two inch thing (2\")</name><description>This thing is really \"awesome\"></description></xml>"
"<xml attribute=\"stuff\"><name>Two inch thing (2\")</name><description>This thing is really awesome></description></xml>"
Limitations: This is very specific to the exact format of the XML. Many valid changes to the XML that do not change its meaning will break this code. For external use only; use only as directed. Stop taking this regular expression if serious side effects occur.
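For comparison, the same technique (match the whole description element, then clean only the captured content inside a replacement callback) looks like this in Python. It is a sketch with the same limitations as the Ruby version; the function name strip_description_quotes is made up for illustration, and it assumes the description tag carries no attributes.

```python
import re

DESCRIPTION = re.compile(r"(<description>)(.*?)(</description>)", re.DOTALL)

def strip_description_quotes(xml):
    """Remove double quotes only from text inside <description>...</description>."""
    def clean(match):
        open_tag, content, end_tag = match.groups()
        return open_tag + content.replace('"', "") + end_tag
    return DESCRIPTION.sub(clean, xml)

s = ('<xml attribute="stuff"><name>Two inch thing (2")</name>'
     '<description>This thing is really "awesome"></description></xml>')
print(strip_description_quotes(s))
```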

Regex to find 4th value inside bracket

How can I read the 4th value (the one inside "", i.e. "vV0....") using regex in the situation below?
Update: is it possible to first find the word "LaunchFileUploader" and then select the 4th value? If there are multiple instances of LaunchFileUploader in the file, just select the 4th value of the first one found. I'm attaching a screenshot of the file to be searched (in the file the word is "LaunchFileUploader").
I tried this, but group 1 gives me the third value and I need the 4th:
\bLaunchFileUploader\b(\:?.*?,){3}.*?\)
Match 1
Full match 11030-11428 LaunchFileUploader("ERM-1BLX3D04R10-0001", 1662, "2ecbb644-34fa-4919-9809-a5ff47594c2d", "8dZOPyHKBK...
Group 1. n/a "2ecbb644-34fa-4919-9809-a5ff47594c2d",
I am still looking for a solution for this. Any help is appreciated.
Depending on what's available to you to use, there's a couple of ways to do it.
Either way, this would work better if there were no new lines in the string, just plain ("value1","value2","value3","value4") etc. It'll still work, but you may need to clean up some new lines from the resulting string.
The easy way - use code for the hard part. Grab the inner string with:
(?<=\().*?(?=\))
This will get everything that's between the 2 parentheses (using positive lookarounds). In code, you could then split/explode this string on , and take the 4th item.
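A minimal Python sketch of this "grab the parentheses, then split" approach follows. The function name fourth_argument is hypothetical, the pattern anchors on the LaunchFileUploader name from the question rather than using lookarounds, and it assumes that none of the values contains a comma or a closing parenthesis.

```python
import re

def fourth_argument(text, name="LaunchFileUploader"):
    """Return the 4th comma-separated value of the first name(...) call, or None."""
    # re.search returns the first occurrence; DOTALL lets the call span lines.
    m = re.search(re.escape(name) + r"\s*\((.*?)\)", text, flags=re.DOTALL)
    if m is None:
        return None
    args = [a.strip() for a in m.group(1).split(",")]
    if len(args) < 4:
        return None
    return args[3].strip('"')  # drop the surrounding quotation marks

sample = ('LaunchFileUploader("ERM-1BLX3D04R10-0001", \n'
          ' 1662, \n'
          ' "2ecbb644-34fa-4919-9809-a5ff47594c2d", \n'
          '"vV0mX3VadCSP%2f");')
print(fourth_argument(sample))
```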
If you want to do it all in regex, you could use something along the lines of:
(?<=\()(?:.*?,){3}(.*?)(?=\))
This would a) match the entire contents of the parentheses and b) capture the 4th option in a capture group. To go even deeper:
(?<=\()(?:.*?,){3}\"(.*?)\"(?=\))
would capture the contents of the "" quotation marks only.
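The regex-only idea can be tried in Python as well. The sketch below is a variation on the patterns above, not the same pattern: it anchors on the LaunchFileUploader name instead of a lookbehind, and it allows whitespace and newlines after each comma. The name FOURTH is made up for illustration, and the pattern assumes the 4th value is quoted and that no value contains a comma.

```python
import re

# Skip three "anything up to a comma" chunks, then capture the quoted 4th value.
FOURTH = re.compile(
    r'LaunchFileUploader\s*\((?:.*?,){3}\s*"(.*?)"\s*\)',
    re.DOTALL,  # let . cross line breaks between the arguments
)

text = ('x LaunchFileUploader("a", 1662,\n "c", "first");\n'
        'LaunchFileUploader("a", 1662, "c", "second");')

match = FOURTH.search(text)  # search() stops at the first occurrence
print(match.group(1) if match else None)
```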
Some tools don't allow you to use lookarounds, if this is the case let me know and I'll see what other ways there are around it.
EDIT Ran this in JS console on browser. This absolutely does work.
EDIT 2 I see you've updated your question with the text you're actually searching in. This pattern will include the space and the new line character as per the copy/paste of the above text.
(?<=\(\")(?:.*?,\s?\n?){3}\"(.*?)\"(?=\))
See my second image for the test in console
This works for python and PHP:
(?<=\")(.*)(?:\"\);)\Z
Demo for Python and PHP
For JavaScript, replace \Z with $ as follows:
(?:")(.*)(?:\"\);)$
Demo for JavaScript
NOTE: Be sure to look at the captured group and not the full match.
UPDATE:
Try this for your updated request:
"(.*)"(?:[\\);\] \/>}]*)$
Demo for updated input string
All the above regex patterns assume there is a line break after each comma.
Auto-generated Java Code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "\"(.*)\"(?:[\\\\);\\] \\/>\\}]*)$";
final String string = "\n"
+ "}$(document).ready( function(){ PathUploader\n"
+ " (\"ERM-1BLX3D04R10-0001\", \n"
+ " 1662, \n"
+ " \"1bff5c85-7a52-4cc5-86ef-a4ccbf14c5d5\", \n"
+ "\"vV0mX3VadCSPnN8FsAO7%2fysNbP5b3SnaWWHQETFy7ORSoz9QUQUwK7jqvCEr%2f8UnHkNNVLkJedu5l%2bA%2bne%2fD%2b2F5EWVlGox95BYDhl6EEkVAVFmMlRThh1sPzPU5LLylSsR9T7TAODjtaJ2wslruS5nW1A7%2fnLB%2bljZaQhaT9vZLcFkDqLjouf9vu08K9Gmiu6neRVSaISP3cEVAmSz5kxxhV2oiEF9Y0i6Y5%2f5ASaRiW21w3054SmRF0rq3IwZzBvLx0%2fAk1m6B0gs3841b%2fw%3d%3d\"); } );//]]>";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
    System.out.println("Full match: " + matcher.group(0));
    for (int i = 1; i <= matcher.groupCount(); i++) {
        System.out.println("Group " + i + ": " + matcher.group(i));
    }
}

Vim search replace regex + incremental function

I'm currently stuck in vim trying to find a search/replace oneliner to replace a number with another + increment for each new iteration = when it finds a new match.
I'm working on SVG (XML) code to batch process files whose text Inkscape cannot process (plain SVG multiline text bug).
<tspan
x="938.91315"
y="783.20563"
id="tspan13017"
style="font-weight:bold">Text1:</tspan><tspan
x="938.91315"
y="833.20563"
id="tspan13019">Text2</tspan><tspan
x="938.91315"
y="883.20563"
id="tspan13021">✗Text3</tspan>
etc.
So what I want to do is to change that to this result:
<tspan
x="938.91315"
y="200"
id="tspan13017"
style="font-weight:bold">Text1:</tspan><tspan
x="938.91315"
y="240"
id="tspan13019">Text2</tspan><tspan
x="938.91315"
y="280"
id="tspan13021">✗Text3</tspan>
etc.
So I duckducked and found the best vim tips resource from zzapper, but I cannot understand it:
convert yy to 10,11,12 :
:let i=10 | 'a,'bg/Abc/s/yy/\=i/ |let i=i+1
I then adapted it to something I can understand and should work in my home vim:
:let i=300 | 327,$ smagic ! y=\"[0-9]\+.[0-9]\+\" ! \=i ! g | let i=i+50
But somehow it doesn't loop, all I get is that:
<tspan
x="938.91315"
300
id="tspan13017"
style="font-weight:bold">Text1:</tspan><tspan
x="938.91315"
300
id="tspan13019">Text2</tspan><tspan
x="938.91315"
300
id="tspan13021">✗Text3</tspan>
So here I'm seriously stuck. I cannot figure out what doesn't work :
My adaptation of the original formula ?
My data layout ?
My .vimrc ?
I'll try to find other resources by myself, but on that kind of trick they are pretty rare I find, and like in zzapper tips, not always delivered with a manual.
One way to fix it:
:let i = 300 | g/\m\<y=/ s/\my="\zs\d\+.\d\+\ze"/\=i/ | let i += 50
Translation:
let i = 300 - hopefully obvious
g/\m\<y=/ ... - for all lines matching \m\<y=, apply the following command; the "following command" is s/.../.../ | let ...; the regexp:
\m - "magic" regexp
\< - match only at word boundary
s/\my="\zs\d\+.\d\+\ze"/\=i/ - substitute; the regexp:
\m - "magic" regexp
\d\+ - one or more digits
\zs...\ze - replace only what is matched between these points
\=i - replace with the value of expression i
let i += 50 - hopefully obvious again.
For more information: :help :g, :help \zs, :help \ze, :help s/\=.
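If the batch processing ends up happening outside Vim, the same "replace each match with a running counter" idea can be sketched in Python with a replacement callback. This is an illustration only: renumber_y is a hypothetical name, and the start value 200 and step 40 are taken from the desired result in the question.

```python
import itertools
import re

def renumber_y(svg, start=200, step=40):
    """Replace each y="<decimal>" value with start, start+step, start+2*step, ..."""
    counter = itertools.count(start, step)
    # \b keeps attributes such as dy="..." from matching; x="..." is untouched.
    return re.sub(
        r'(\by=")\d+\.\d+(")',
        lambda m: f"{m.group(1)}{next(counter)}{m.group(2)}",
        svg,
    )

svg = '''<tspan
x="938.91315"
y="783.20563"
id="tspan13017">Text1:</tspan><tspan
x="938.91315"
y="833.20563"
id="tspan13019">Text2</tspan>'''
print(renumber_y(svg))
```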
Just to add my take as a memo (wrote this as an answer as an EDIT didn't seem right). Sorry it is not the best vim scripting here but it enables me to understand (I'm not a vim specialist).
:let i=300 | 323,$g/y="/smagic![0-9]\+.[0-9]\+!\=i!g | let i+=50
Assign the initial value to i :
:let i=300
Start :global (:g) function from line 323 to the end of file:
323,$g
Pattern to match for executing the commands (literal text here)
y="
Substitution with magic on (magic meaning special characters "enabled")
smagic
Pattern to find
[0-9]\+.[0-9]\+
(one or more digits 0-9, then any single character, then one or more digits again; note that the unescaped . matches any character, so use \. for a literal dot)
Replaced with
\=i
\= tells vim to evaluate i, not to write it literally
Increment i with 50 for the next iteration
let i+=50
This part is still inside the :g command.
The separators:
| separates the different commands
/ is the separator in the :g command
! is the separator in the smagic command

VB.Net Beginner: Replace with Wildcards, Possibly RegEx?

I'm converting a text file to a Tab-Delimited text file, and ran into a bit of a snag. I can get everything I need to work the way I want except for one small part.
One field I'm working with has the home addresses of the subjects as a single entry ("1234 Happy Lane Somewhere, St 12345") and I need each broken down by Street(Tab)City(Tab)State(Tab)Zip. The one part I'm hung up on is the Tab between the State and the Zip.
I've been using input=input.Replace throughout, and it's worked well so far, but I can't think of how to untangle this one. The wildcards I'm used to don't seem to be working, I can't replace ("?? #####") with ("??" + ControlChars.Tab + "#####")...which I honestly didn't expect to work, but it's the only idea on the matter I had.
I've read a bit about using Regex, but have no experience with it, and it seems a bit...overwhelming.
Is Regex my best option for this? If not, are there any other suggestions on solutions I may have missed?
Thanks for your time. :)
EDIT: Here's what I'm using so far. It makes some edits to the line in question, taking care of spaces, commas, and other text I don't need, but I've got nothing for the State/Zip situation; I've a bad habit of wiping something if it doesn't work, but I'll append the last thing I used to the very end, if that'll help.
If input Like "Guar*###/###-####" Then
input = input.Replace("Guar:", "")
input = input.Replace(" ", ControlChars.Tab)
input = input.Replace(",", ControlChars.Tab)
input = "C" + ControlChars.Tab + strAccount + ControlChars.Tab + input
End If
input = System.Text.RegularExpressions.Regex.Replace(" #####", ControlChars.Tab + "#####") <-- Just one example of something that doesn't work.
This is what's written to input in this example
" Guar: LASTNAME,FIRSTNAME 999 E 99TH ST CITY,ST 99999 Tel: 999/999-9999"
And this is what I can get as a result so far
C 99999/9 LASTNAME FIRSTNAME 999 E 99TH ST CITY ST 99999 999/999-9999
With everything being exactly what I need besides the "ST 99999" bit (with actual data obviously omitted for privacy and professional whatnots).
UPDATE: Just when I thought it was all squared away, I've got another snag. The raw data gives me this.
# TERMINOLOGY ######### ##/##/#### # ###.##
And the end result is giving me this, because this is a chunk of data that was just fine as-is...before I removed the Tabs. Now I need a way to replace them after they've been removed, or to omit this small group of code from a document-wide Tab genocide I initiate the code with.
#TERMINOLOGY###########/##/########.##
Would a variant on rgx.Replace work best here? Or can I copy the code to a variable, remove Tabs from the document, then insert the variable without losing the tabs?
I think what you're looking for is
Dim rgx As New System.Text.RegularExpressions.Regex(" (\d{5})(?!\d)")
input = rgx.Replace(input, ControlChars.Tab + "$1")
The first line compiles the regular expression. The \d matches a digit, and the {5}, as you can guess, matches 5 repetitions of the previous atom. The parentheses surrounding the \d{5} are known as a capture group, and are responsible for putting what's captured in a pseudovariable named $1. The (?!\d) is a more advanced concept known as a negative lookahead assertion; it peeks at the next character to check that it's not a digit (because then it could be a 6-or-more digit number, where the first 5 happened to get matched). Another version is
" (\d{5})\b"
where the \b is a word boundary, disallowing alphanumeric characters following the digits.
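The behaviour of the pattern is easy to check in isolation. The snippet below does so in Python rather than VB.Net, purely as an illustration: the regex is the one from this answer, while the name tab_before_zip and the sample string (adapted from the masked data in the question) are assumptions. Python writes the group reference as \1 where .NET uses $1.

```python
import re

# A space, five digits, and no sixth digit after them.
ZIP = re.compile(r" (\d{5})(?!\d)")

def tab_before_zip(line):
    """Replace the space in front of a 5-digit ZIP code with a tab."""
    return ZIP.sub(r"\t\1", line)

line = "999 E 99TH ST CITY,ST 99999 Tel: 999/999-9999"
print(repr(tab_before_zip(line)))
```

Shorter digit runs such as "999" and runs of six or more digits are left alone, because \d{5} needs exactly five digits and the lookahead rejects a sixth.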