I haven't worked with regex before, but I need to parse values from about 500 URLs and I need a regex to automate it.
Each page contains about 10 values, and I need to separate them into their own lists.
1.
<td width="78" style="padding-left:9px;" align="left"><a style="font-weight:bold;color:#E93393;" href="/meanings/Example1.html">Example1</a> </td>
2.
<td width="78" style="padding-left:9px;" align="left"><a style="font-weight:bold;color:#004EFF;" href="/meanings/Example2.html">Example2</a> </td>
So I need to get those two values into separate lists. The regex should look at the color code to determine which list each value goes into.
Could somebody help me? :)
NO..NO..NO..
Regex doesn't work for parsing HTML files.
HTML is neither strict nor regular in its format.
Use HtmlAgilityPack.
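HtmlAgilityPack is a .NET library. To show the same parser-based idea in Python, here is a minimal sketch that swaps in the standard-library html.parser instead. The class name ColorLinkCollector is made up for illustration, and the sketch assumes the color always appears as a hex code in the link's style attribute, as in the two sample rows.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class ColorLinkCollector(HTMLParser):
    """Group the text of each <a> by the hex color in its style attribute."""

    def __init__(self):
        super().__init__()
        self.by_color = defaultdict(list)
        self._color = None  # color of the <a> that is currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            style = dict(attrs).get("style") or ""
            # The lookbehind skips properties such as "background-color".
            match = re.search(r"(?<![-\w])color:\s*(#[0-9A-Fa-f]{3,6})", style)
            self._color = match.group(1).upper() if match else None

    def handle_data(self, data):
        # Only keep text that sits inside a colored link.
        if self._color and data.strip():
            self.by_color[self._color].append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self._color = None


html = '''
<td width="78" style="padding-left:9px;" align="left"><a style="font-weight:bold;color:#E93393;" href="/meanings/Example1.html">Example1</a> </td>
<td width="78" style="padding-left:9px;" align="left"><a style="font-weight:bold;color:#004EFF;" href="/meanings/Example2.html">Example2</a> </td>
'''

collector = ColorLinkCollector()
collector.feed(html)
pink = collector.by_color["#E93393"]
blue = collector.by_color["#004EFF"]
print(pink, blue)
```

Each color ends up with its own list, so running this over all 500 pages is a matter of feeding each page to a fresh collector and merging the lists.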
I have an HTML file that has some sections that need to be removed.
All sections will be removed except one. Below is a small example; strangely, a regex editor does recognize the section.
I want to remove everything between <!-- and -->, but it doesn't work.
import re
test = '<br/><br/> </span> <!--TABLE<table class=MsoTableGrid border=1 cellspacing=0 cellpadding=0 style=\'border-collapse:collapse;border:none\'> <tr style=\'height:12.95pt\'> <td width=225 valign=top style=\'width:109.45pt;border:solid windowtext 1.0pt;padding:2.4pt 5.4pt 2.4pt 5.4pt;height:12.95pt\'> <span style=\'font-family:"Arial",sans-serif\'> <b>Kontosaldo in \x80</b> </span> </td> </tr> <tr style=\'height:12.95pt\'> <td width=146 valign=top style=\'width:109.45pt;border:solid windowtext 1.0pt;padding:2.4pt 5.4pt 2.4pt 5.4pt;height:12.95pt\'> <span style=\'font-family:"Arial",sans-serif\'> [substringR] </span> </td> </tr> </table>TABLE-->'
r = re.compile(r"(?<=<!--)([\s\n.<>\]\[\\=;,€\/\-\'\":\w\n]+)(?=-->)")
mystring = r.sub('', test)
"Everything inbetween <!-- and -->" is this expression:
<!--.*?-->
replaced with the empty string. Compile with the re.DOTALL flag.
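A minimal sketch of that suggestion in Python. The sample string is a shortened stand-in for the one in the question, not the original data.

```python
import re

# re.DOTALL lets "." match newlines, so comments spanning lines are covered.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

html = "<p>keep</p> <!--TABLE<table>\n<tr><td>drop</td></tr>\n</table>TABLE--> <p>also keep</p>"
cleaned = COMMENT_RE.sub("", html)
print(cleaned)
```

The lazy `.*?` stops at the first `-->`, so two separate comments are removed individually instead of swallowing everything between them.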
Note: Modifying HTML with regex is a recipe for disaster. Don't do it. This particular task, namely "removing comments", is a grey area: regex cannot deal with languages that can be arbitrarily nested (such as HTML), but HTML comments cannot be nested, so there is a good chance that this works. However, don't try the same approach with "replacing all tables"; it won't work.
But still, HTML can be functional and yet horribly broken in so many ways that even for this task there will be HTML files that disintegrate completely when you try this seemingly safe regex on them.
The proper approach is just as @Aaron suggests: parse the HTML file into a DOM tree, find the nodes you want to remove, and write the DOM tree back to a file, as shown in this answer: How to find all comments with Beautiful Soup.
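The linked answer uses Beautiful Soup. As a dependency-free illustration of the parser-based route, here is a rough sketch using Python's standard-library html.parser instead. Note that this is a token stream rather than a real DOM tree, the names CommentStripper and strip_comments are made up, and the re-serialization is only approximate (for example, end tags come back lowercased, and processing instructions are not re-emitted).

```python
from html.parser import HTMLParser


class CommentStripper(HTMLParser):
    """Re-emit every token of the document except comments."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")

    def handle_comment(self, data):
        pass  # comments are dropped on purpose


def strip_comments(html):
    parser = CommentStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = "<br/><span>a &amp; b</span> <!--TABLE<table><tr><td>x</td></tr></table>TABLE--> <p>end</p>"
result = strip_comments(sample)
print(result)
```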
I am pretty new to regular expressions, so please excuse any naivety in my question :)
I am trying to extract a substring from a string (the string contains newlines as well).
I searched for similar questions on Stack Overflow, but none of them seem to answer what I am looking for.
Sample String
<td class="business-info">
<address>
104-59 118th Street<br />
Jamaica,
NY
11419
</address>
</td>
What I am looking for is something like:
jamaica 11419
Edit
I can select everything inside <tr> and </tr> with this
<tr>([^\n]*?\n+?)+?</tr>
But if I try <address>([^\n]*?\n+?)+?</address>, it does not seem to work.
Please provide some more details so we can fully understand what you are trying to do: what have you tried, and what kind of error or difficulty are you facing?
To match your address tag:
<address>([\s\n\d\w\"\/\<\>\,\-]*)</address>
Use this and see if it works.
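For what it's worth, here is a small Python sketch of getting from the sample block to "jamaica 11419". It swaps the character class above for a lazy `.*?` with re.DOTALL, which is a common alternative, and it assumes the layout of the sample: the street comes before the <br />, the city is a single word, and the ZIP code is the last token.

```python
import re

html = '''<td class="business-info">
<address>
104-59 118th Street<br />
Jamaica,
NY
11419
</address>
</td>'''

result = None
match = re.search(r"<address>(.*?)</address>", html, re.DOTALL | re.IGNORECASE)
if match:
    # Keep only the part after the <br />, i.e. city, state and ZIP.
    after_br = re.split(r"<br\s*/?>", match.group(1))[-1]
    # Split on commas and any whitespace, including newlines.
    tokens = [t for t in re.split(r"[,\s]+", after_br) if t]
    result = f"{tokens[0].lower()} {tokens[-1]}"

print(result)
```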
I am trying to pull some info; here is my regex:
<tr>
<td>([^<]+)<i><a href="([^<]+)" title="([^<]+)">([^<]+)<\/a><\/i><sup id="([^<]+)" class="([^<]+)"><a href="([^<]+)"><span>[<\/span>1<span>]<\/span><\/a><\/sup><\/td>
<td><a href="([^<]+)" title="([^<]+)">([^<]+)<\/a><\/td>
<td><a href="([^<]+)" title="([^<]+)">([^<]+)<\/a><\/td>
<td>([^<]+)<\/td>
<td>([^<]+)<\/td>
</tr>
Here is the sample HTML:
<tr>
<td><i>3Xtreme</i><sup id="cite_ref-18" class="reference"><span>[</span>18<span>]</span></sup></td>
<td>989 Studios</td>
<td>989 Studios</td>
<td>1999-03-31<sup>NA</sup></td>
<td>NA</td>
</tr>
For now I just want to get the data to find matches. Can you see any reason why it would not match this?
For all the haters....
I don't care about your opinions on whether I should use regex on HTML or not. For this case it will work great. I have one page, and the data I need is in a table. Once I get the data I will save it to my DB and never have to use the regex again. So if your comment or answer is about your opinion on using regex with HTML, don't post.
...Second line:
<td>([^<]+)<i>
cannot hope to match:
<td><i>
because you used a '+', which is equivalent to '{1,}', while there is nothing between those tags. I didn't check the rest of your regex, but as written it can't work.
Edit:
Please also correct the "([^<]+)" and so on (I hope you see why)... And edit your regex when you correct it.
Edit 2:
Seeing as it's quite a disaster (sorry, but it's the truth :/): please consider replacing all your ([^<]+) groups, which won't work for all your cases, with a simple (.*?)
Edit 3:
[ and ] must be escaped. (\d will help you catch numbers)
<span>[<\/span>1<span>]<\/span>
Lots of problems here: you must escape the brackets and obviously 1 won't match 18
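To make those points concrete, here is a sketch in Python that applies them to the first cell of the sample row only: lazy (.*?) groups, escaped brackets, and \d+ for the reference number. The pattern is shaped to the sample HTML shown in the question (which has no <a> inside the <i>), so treat it as an illustration rather than the finished regex.

```python
import re

row = ('<td><i>3Xtreme</i><sup id="cite_ref-18" class="reference">'
       '<span>[</span>18<span>]</span></sup></td>')

pattern = re.compile(
    r'<td><i>(.*?)</i>'                                # title in italics
    r'<sup id="(.*?)" class="(.*?)">'                  # citation id and class
    r'<span>\[</span>(\d+)<span>\]</span></sup></td>'  # [ and ] escaped, \d+ for 18
)

m = pattern.search(row)
title, sup_id, sup_class, ref = m.groups()
print(title, sup_id, sup_class, ref)
```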
This is the sample text that I'm working with. I'm using Coda to do a find and replace...
<td width="20%"><div > Item #</div></td>
<td width="20%"><div > Pole Tip</div></td>
<td width="20%"><div > Length</div></td>
<td width="20%"><div > Test Weight (lbs.)</div></td>
<td width="20%"><div > Price</div></td>
I want to get rid of the div tags that mark up the text inside the td.
Ex...I want to change this:
<td width="20%"><div > Item #</div></td>
to this:
<td width="20%">Item #</td>
So far I have this as a regex:
<div >[\s\w\(\)#]*</div>
However this matches all of the above in my sample text EXCEPT:
<td width="20%"><div > Test Weight (lbs.)</div></td>
In my regex, I even tried to add the ( and )...what am I doing wrong?
In reply to Andy: I agree that data parsing of well-formed markup should be left to DOM navigation tools. XML parsers for sure, or HTML-to-XML converters, are good. I don't know what Miles is working with, but I frequently work with HTML that is so malformed that markup parsers can't handle it.
In some of my regex tutorials on document parsing, I discuss the regex trim pattern, which is simply zero or more whitespace characters (\s*). Though you might shy away from it because it adds a bit of length to the pattern, there is virtually no efficiency loss. That being said...
(<td[^>]*>)\s*<div[^>]*>\s*((?:[^<]*(?(?!</div>\s*</td>)<))*)\s*</div>\s*(</td>)
Replace this with $1$2$3 and you win, as well as get back a clean result. Of course, you can replace or remove as many Trims (\s*) as you like, just a personal preference if I am parsing Documents or Malformed Markup.
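If you want to try the same replacement from Python rather than Coda, here is a rough sketch. As far as I know, Python's re module does not accept a lookahead as the condition in (?(...)...), so this version substitutes a simpler lazy (.*?) for the middle group, and it uses \1\2\3 instead of $1$2$3 in the replacement.

```python
import re

html = '''<td width="20%"><div > Item #</div></td>
<td width="20%"><div > Test Weight (lbs.)</div></td>'''

# Group 1: opening <td ...>, group 2: trimmed text, group 3: closing </td>.
pattern = re.compile(
    r"(<td[^>]*>)\s*<div[^>]*>\s*(.*?)\s*</div>\s*(</td>)",
    re.DOTALL,
)
cleaned = pattern.sub(r"\1\2\3", html)
print(cleaned)
```

Because the middle group is "anything, lazily" rather than a character class, parentheses and periods in the cell text no longer matter.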
That's because you missed the period (.). This works just fine:
<div >[\s\w\(\)#.]*</div>
I want to write a regular expression to parse this webpage(view-source:http://www.imdb.com/search/title?title=spiderman&title_type=feature). Basically I want to extract all the sections between <tr class=".+"> and </tr>. This webpage is a list of movies from imdb(http://www.imdb.com/search/title?title=spiderman&title_type=feature) and each section here indicates a movie. I tried the regular expression
<tr class=".+">(.+\n)+</tr>
However, it doesn't work. Also, I'm not allowed to use DOM. Does anyone have any suggestions? Thanks!
I strongly suggest you use a proper parser. But here is the regex for your case.
<tr class="(.+)">([\s\S]+?)</tr>
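A quick Python sketch of using it. The HTML below is a made-up stand-in for the IMDb page, not its real markup, and the class group is tightened from (.+) to ([^"]+) so it cannot run past the closing quote when a row sits on a single line.

```python
import re

# Hypothetical sample; the real page's markup will differ.
html = '''<table>
<tr class="odd detailed">
  <td class="title"><a href="/title/example1/">Spider-Man</a></td>
</tr>
<tr class="even detailed">
  <td class="title"><a href="/title/example2/">Spider-Man 2</a></td>
</tr>
</table>'''

# [\s\S]+? matches any character including newlines, lazily, up to the first </tr>.
pattern = re.compile(r'<tr class="([^"]+)">([\s\S]+?)</tr>')
rows = pattern.findall(html)

for css_class, body in rows:
    print(css_class, "->", body.strip())
```

findall returns one (class, body) tuple per row, so each movie section can be processed on its own afterwards.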