vb.net webbrowser regex extract to textbox - regex

I want to extract these numbers from the page loaded in a WebBrowser control:
<tr style="font-size: 14pt;">
<td align="center">1</td>
<td align="center">2</td>
<td align="center">3</td>
<td align="center">4</td>
<td align="center">5</td>
</tr>
The expected result is textbox.text = 12345

You can do it with a regex, but that is not recommended. Extract it like this instead:
Dim elemcol As HtmlElementCollection = Webbrowser1.Document.GetElementsByTagName("td")
For i As Integer = 0 To (elemcol.Count - 1)
    Textbox1.Text &= elemcol(i).InnerHtml ' here do whatever you want with its content
Next i
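The same idea (walk the td elements and concatenate their contents instead of using a regex) can be sketched outside the WebBrowser control. This is a minimal illustration in Python using only the standard library's html.parser; the names CellTextCollector and join_cells are made up for this example and are not part of any API.

```python
from html.parser import HTMLParser


class CellTextCollector(HTMLParser):
    """Collects the text found inside every <td> element."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data  # append text to the current cell


def join_cells(html):
    """Return the text of all <td> cells joined into one string."""
    collector = CellTextCollector()
    collector.feed(html)
    collector.close()
    return "".join(cell.strip() for cell in collector.cells)


row = """<tr style="font-size: 14pt;">
<td align="center">1</td>
<td align="center">2</td>
<td align="center">3</td>
<td align="center">4</td>
<td align="center">5</td>
</tr>"""

print(join_cells(row))  # prints 12345
```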

Related

using jsoup to modify data

I have successfully used jsoup to fetch the HTML from the website, but I am having some trouble displaying the data.
Here is the generated HTML:
<tr class="2" id="AS 2238_2022-10-18T08:50:00"> <td id=" Air"> <img src="/webfids/logos/AS.jpg" width="138" height="31" title=" Air" alt=" Air"> </td> <td id="2238"> 2238</td> <td id="Phoenix"> Phoenix</td> <td id="1666108200000"> 8:50A 10-18-22</td> <td id="AS 2238_2022-10-18T08:50:00_status"> <font class="default"> On Time </font></td> <td id="AS 2238_2022-10-18T08:50:00_gate">2A</td> <td id="AS 2238_2022-10-18T08:50:00_terminal"> </td> <td id="AS 2238_2022-10-18T08:50:00_codeShares"> </td> <td id="AS 2238_2022-10-18T08:50:00_CDS"> </td> <td id="marker" style="display: none">0</td> </tr>
I am trying to remove the last TD of every row. There are many rows, so I loop over them.
Here is my code:
rows = TheTable.select("tr");
for ( row in rows ){
    writedump(row.ToString());
    writeoutput('<br><br><br>');
    row.select('##marker').remove();
    row.select("td:eq(0)").attr("rel", "nofollow");
    // writeoutput(image.toString());
}
I am trying to remove the last TD.
I also want to remove the img and use the text from its title or alt attribute instead.
for( row in rows ){
    // get the first image object
    image = row.select( "img" )[ 1 ]
    // extract the alt or title text
    imageAlt = image.attr( "alt" )?:image.attr( "title" )?:""
    // replace the image with the extracted text
    image.parent().append( imageAlt )
    image.remove()
    // remove the last column
    row.select( "td" ).last().remove()
}
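For readers who want to try the two steps (replace the image with its alt or title text, then drop the last cell) without a jsoup setup, here is a rough Python sketch using the standard library's xml.etree.ElementTree. It is not jsoup: it assumes a well-formed fragment (note the self-closed img), and the shortened row and the function name simplify_row are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def simplify_row(row):
    """Replace images with their alt/title text and drop the last cell."""
    for cell in row.findall("td"):
        for image in cell.findall("img"):
            # prefer alt, fall back to title, then to an empty string
            label = image.get("alt") or image.get("title") or ""
            cell.remove(image)
            cell.text = ((cell.text or "") + label + (image.tail or "")).strip()
    cells = row.findall("td")
    if cells:
        row.remove(cells[-1])  # remove the last column
    return row


# Shortened, well-formed stand-in for one generated row (an assumption).
row = ET.fromstring(
    '<tr><td id="Air"><img src="/webfids/logos/AS.jpg" alt="Air" title="Air"/></td>'
    '<td id="2238">2238</td>'
    '<td id="marker" style="display: none">0</td></tr>'
)

simplify_row(row)
print(ET.tostring(row, encoding="unicode"))
```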

XPath - Retrieving text value when condition contains a tag

I have a section of a table and I am trying to get the value "Distributor 10":
<table class="d">
<tr>
<td class="ah">supplier<td>
<td class="ad">
Supplier 10
</td>
</tr>
<tr>
<td class="ah">distributor<pre><td>
<td class="ad">
Distributor 10
</td>
</tr>
</table>
In Chrome Developer Tools, I get this value with the following XPath string:
//tr/td[text()="distributor"]/following-sibling::td[@class="ad"]/a/text()
But when I run this in Python, it returns an empty list. From what I can see, this is because of the <pre> tag next to "distributor".
When I change the XPath to look for "supplier" instead of "distributor", it works perfectly well.
Any suggestions would be welcome.
Assuming you're using lxml, you can use either of the following XPath expressions to get this working:
//tr[contains(.,"distributor")]//a/text()
//a[parent::td[@class="ad"] and starts-with(@href,"/D")]/text()
Sample code:
from lxml import etree
from io import StringIO
html = '''<table class="d">
<tr>
<td class="ah">supplier<td>
<td class="ad">
Supplier 10
</td>
</tr>
<tr>
<td class="ah">distributor<pre><td>
<td class="ad">
Distributor 10
</td>
</tr>
</table>'''
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
data = tree.xpath('//tr[contains(.,"distributor")]//a/text()')
print (data)
Output: ['Distributor 10']
Alternative: use the lxml HTML Cleaner class ("remove_tags") to remove the pre element from your page.
References:
https://lxml.de/api/lxml.html.clean.Cleaner-class.html
https://lxml.de/lxmlhtml.html#cleaning-up-html
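To illustrate why matching on the whole row's text is less fragile than matching the label cell exactly, here is a small sketch using only the standard library. It is not lxml, and ElementTree's limited XPath has no contains(), so a plain substring test on the row's text stands in for it. The markup is a cleaned-up, well-formed variant of the table, and the href values and the function name link_texts are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the table; the href values are hypothetical.
table = ET.fromstring("""<table class="d">
<tr><td class="ah">supplier</td><td class="ad"><a href="/S10">Supplier 10</a></td></tr>
<tr><td class="ah">distributor<pre/></td><td class="ad"><a href="/D10">Distributor 10</a></td></tr>
</table>""")


def link_texts(table, label):
    """Return the link texts of every row whose full text contains label."""
    results = []
    for row in table.iter("tr"):
        row_text = "".join(row.itertext())  # text of the row and all descendants
        if label in row_text:
            results.extend((a.text or "").strip() for a in row.iter("a"))
    return results


print(link_texts(table, "distributor"))  # prints ['Distributor 10']
```

The check is case-sensitive, so "distributor" matches the label cell but not the capitalised link text in other rows.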

Repeat Groups to form objects

I have an HTML table like this:
<table style="width:100%">
<tr>
<td class="country">Germany</td>
</tr>
<tr>
<td class="city">Berlin</td>
</tr>
<tr>
<td class="city">Cologne</td>
</tr>
<tr>
<td class="city">Munich</td>
</tr>
<tr>
<td class="country">France</td>
</tr>
<tr>
<td class="city">Paris</td>
</tr>
<tr>
<td class="country">USA</td>
</tr>
<tr>
<td class="city">New York</td>
</tr>
<tr>
<td class="city">Las Vegas</td>
</tr>
</table>
From this table, I want to generate objects of the classes Country and City, where a Country has a list of Cities.
Now to the problem:
It's easy to write a regex that gets all countries and all cities, but can I make the city group repeat until the next country starts? I need this because I can't tell programmatically which city belongs to which country if they come from separate regex matches.
It should look something like this (quick and dirty):
country">([\w]*)<{.*\n.*\n.*\n.*"city">([\w]*)}
The part in curly braces should repeat until the next country item shows up.
If you have a completely different idea for getting objects out of an HTML table in C#, let me know!
Thanks in advance!
Agreed that for any non-trivial HTML, an HTML parser like HtmlAgilityPack should be used. That said, if your HTML is as simple as the snippet above, this works, even if there are multiple line breaks in the string:
string HTML = @"
<table style='width:100%'>
<tr><td class='country'>Germany</td></tr>
<tr><td class='city'>Berlin</td></tr>
<tr><td class='city'>Cologne</td></tr>
<tr><td class='city'>Munich</td></tr>
<tr><td class='country'>France</td></tr>
<tr><td class='city'>Paris</td></tr>
<tr><td class='country'>USA</td></tr>
<tr><td class='city'>New York</td></tr>
<tr><td class='city'>Las Vegas</td></tr>
</table>";
var regex = new Regex(
    @"
    class=[^>]*?
    (?<class>[-\w\d_]+)
    [^>]*>
    (?<text>[^<]+)
    <
    ",
    RegexOptions.Compiled | RegexOptions.IgnoreCase
    | RegexOptions.IgnorePatternWhitespace
);
var country = string.Empty;
var Countries = new Dictionary<string, List<string>>();
foreach (Match match in regex.Matches(HTML))
{
    string countryCity = match.Groups["class"].Value.Trim();
    string text = match.Groups["text"].Value.Trim();
    if (countryCity.Equals("country", StringComparison.OrdinalIgnoreCase))
    {
        country = text;
        Countries.Add(text, new List<string>());
    }
    else
    {
        Countries[country].Add(text);
    }
}
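The grouping logic does not depend on C#: match every cell in document order and attach each city to the most recently seen country. A minimal Python sketch of that idea follows. The pattern is a simplified variant that only recognises the two class names, and the names CELL and group_cities are made up for this example.

```python
import re

# Matches class="country" or class="city" followed by the cell text.
CELL = re.compile(
    r"""class=["'](?P<kind>country|city)["'][^>]*>(?P<text>[^<]+)<""",
    re.IGNORECASE,
)


def group_cities(html):
    """Map each country to the list of cities that follow it."""
    countries = {}
    current = None
    for match in CELL.finditer(html):
        kind = match.group("kind").lower()
        text = match.group("text").strip()
        if kind == "country":
            current = text
            countries.setdefault(current, [])
        elif current is not None:
            # a city before any country is skipped rather than raising
            countries[current].append(text)
    return countries


table = """<table style="width:100%">
<tr><td class="country">Germany</td></tr>
<tr><td class="city">Berlin</td></tr>
<tr><td class="city">Cologne</td></tr>
<tr><td class="city">Munich</td></tr>
<tr><td class="country">France</td></tr>
<tr><td class="city">Paris</td></tr>
<tr><td class="country">USA</td></tr>
<tr><td class="city">New York</td></tr>
<tr><td class="city">Las Vegas</td></tr>
</table>"""

print(group_cities(table))
```

Because the matches arrive in document order, no repeating group is needed; the "current country" variable carries the association between separate matches.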

get values from table with BeautifulSoup Python

I have a table from which I am extracting links and text, but I can only get one or the other. Any idea how to get both?
Essentially, I need to pull the text "TEXT TO EXTRACT HERE".
for tr in rows:
    cols = tr.findAll('td')
    count = len(cols)
    if len(cols) > 1:
        third_column = tr.findAll('td')[2].contents
        third_column_text = str(third_column)
        third_columnSoup = BeautifulSoup(third_column_text)
        # issue starts here. How can I get either the text of the element <td>text here</td> or the link text of <a>text here</a>?
        for elm in third_columnSoup.findAll("a"):
            # print elm.text, third_columnSoup
            item = { "code": random.upper(),
                     "name": elm.text }
            items.append(item)
The HTML code is the following:
<table cellpadding="2" cellspacing="0" id="ListResults">
<tbody>
<tr class="even">
<td colspan="4">sort results: <a href=
"/~/search/af.aspx?some=LOL&Category=All&Page=0&string=&s=a"
rel="nofollow" title=
"sort results in alphabetical order">alphabetical</a> | <strong>rank</strong> ?</td>
</tr>
<tr class="even">
<th>aaa</th>
<th>vvv.</th>
<th>gdfgd</th>
<td></td>
</tr>
<tr class="odd">
<td align="right" width="32">******</td>
<td nowrap width="60"><a href="/aaa.html" title=
"More info and direct link for this meaning...">AAA</a></td>
<td>TEXT TO EXTRACT HERE</td>
<td width="24"></td>
</tr>
<tr class="even">
<td align="right" width="32">******</td>
<td nowrap width="60"><a href="/someLink.html"
title="More info and direct link for this meaning...">AAA</a></td>
<td><a href=
"http://www.fdssfdfdsa.com/aaa">TEXT TO EXTRACT HERE</a></td>
<td width="24">
<a href=
"/~/search/google.aspx?q=lhfjl&f=a&cx=partner-pub-2259206618774155:1712475319&cof=FORID:10&ie=UTF-8"><img border="0"
height="21" src="/~/st/i/find2.gif" width="21"></a>
</td>
</tr>
<tr>
<td width="24"></td>
</tr>
<tr>
<td align="center" colspan="4" style="padding-top:6pt">
<b>Note:</b> We have 5575 other definitions for <strong><a href=
"http://www.ddfsadfsa.com/aaa.html">aaa</a></strong> in our
database</td>
</tr>
</tbody>
</table>
You can just use the text property on a td element:
from bs4 import BeautifulSoup
html = """HERE GOES THE HTML"""
soup = BeautifulSoup(html, 'html.parser')
for tr in soup.find_all('tr'):
    columns = tr.find_all('td')
    if len(columns) > 2:
        print(columns[2].text)
prints:
TEXT TO EXTRACT HERE
TEXT TO EXTRACT HERE
Hope that helps.
Another way to do it is the following:
third_column = tr.find_all('td')[2].contents
third_column_text = str(third_column)
third_columnSoup = BeautifulSoup(third_column_text, 'html.parser')
if third_columnSoup:
    print(third_columnSoup.text)
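If BeautifulSoup is not available, the same "text of the third cell, whether or not it is wrapped in a link" idea can be approximated with the standard library. This is only a sketch, not the bs4 API: the class RowCells and the function third_column_texts are invented for the example, and the sample markup is a trimmed version of the table from the question.

```python
from html.parser import HTMLParser


class RowCells(HTMLParser):
    """Collects, per <tr>, the text of each <td> including nested tags."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            # text inside <a> within the cell lands here too
            self.rows[-1][-1] += data


def third_column_texts(html):
    parser = RowCells()
    parser.feed(html)
    parser.close()
    return [cells[2].strip() for cells in parser.rows if len(cells) > 2]


sample = """<table>
<tr><th>aaa</th><th>vvv.</th><th>gdfgd</th><td></td></tr>
<tr><td>******</td><td><a href="/aaa.html">AAA</a></td><td>TEXT TO EXTRACT HERE</td><td></td></tr>
<tr><td>******</td><td><a href="/someLink.html">AAA</a></td><td><a href="http://www.fdssfdfdsa.com/aaa">TEXT TO EXTRACT HERE</a></td><td></td></tr>
<tr><td></td></tr>
</table>"""

print(third_column_texts(sample))
```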

Clean html code with a Regex

I am parsing some transactions, for example 3 transactions look like this:
<TR class=DefGVRow>
<TD>29/04/2013</TD>
<TD>DEPOSITO 0140959158</TD>
<TD>0140959158</TD>
<TD align=right>336,00</TD>
<TD align=center>+</TD>
<TD align=right>16.210,60</TD></TR>
<TR class=DefGVAltRow>
<TD>29/04/2013</TD>
<TD>RETIRO ATM CTA/CTE</TD>
<TD>1171029739</TD>
<TD align=right>600,00</TD>
<TD align=center>-</TD>
<TD align=right>15.610,60</TD></TR>
<TR class=DefGVRow>
<TD>29/04/2013</TD>
<TD>C.SERV.CAJERO AUT.</TD>
<TD>1171029739</TD>
<TD align=right>3,25</TD>
<TD align=center>-</TD>
<TD align=right>15.607,35</TD></TR>
My current regex is:
<TR class=\w+>
<TD>(?<day>\d{1,2})/(?<month>\d{1,2})/(?<year>\d{4})</TD>
<TD>(?<description>.+?)</TD>
<TD>(?<id>\d{3,30})</TD>
<TD.+?>(?<amount>[\d\.]{1,20},\d{1,10})</TD>
<TD.+?>(?<info>.+?)</TD>
<TD.+?>(?<balance>[\d\.]{1,20},\d{1,10})</TD></TR>
How can I edit this part:
<TD>(?<description>.+?)</TD>
so that it handles optional tags around the description? (Basically: how do I ignore the A tag when capturing the group?)
Thanks!
This is a very common problem. Please check this epic answer and stop using regexp to "parse" HTML; instead, use a proper parser and get what you need with XPath or even a CSS selector.
This removes the 'optional' link:
<TR class=\w+>
<TD>(?<day>\d{1,2})/(?<month>\d{1,2})/(?<year>\d{4})</TD>
<TD>(?:<A\b[^>]*>)?(?<description>.+?)(?:</A>)?</TD>
<TD>(?<id>\d{3,30})</TD>
<TD.+?>(?<amount>[\d\.]{1,20},\d{1,10})</TD>
<TD.+?>(?<info>.+?)</TD>
<TD.+?>(?<balance>[\d\.]{1,20},\d{1,10})</TD></TR>
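A quick way to check the optional-link pattern is to run it against one linked and one plain row. The sketch below does this in Python, whose re module writes named groups as (?P<name>...) rather than (?<name>...). The href value in the sample is hypothetical, since the original rows with links are not shown, and the \s* between cells is an assumption about the line breaks in the input.

```python
import re

ROW = re.compile(
    r"<TR class=\w+>\s*"
    r"<TD>(?P<day>\d{1,2})/(?P<month>\d{1,2})/(?P<year>\d{4})</TD>\s*"
    # optional opening <A ...> before and optional </A> after the description
    r"<TD>(?:<A\b[^>]*>)?(?P<description>.+?)(?:</A>)?</TD>\s*"
    r"<TD>(?P<id>\d{3,30})</TD>\s*"
    r"<TD[^>]*>(?P<amount>[\d.]{1,20},\d{1,10})</TD>\s*"
    r"<TD[^>]*>(?P<info>.+?)</TD>\s*"
    r"<TD[^>]*>(?P<balance>[\d.]{1,20},\d{1,10})</TD></TR>"
)

# The first row wraps the description in a link (hypothetical href).
sample = """<TR class=DefGVRow>
<TD>29/04/2013</TD>
<TD><A href="detail.aspx?id=1">DEPOSITO 0140959158</A></TD>
<TD>0140959158</TD>
<TD align=right>336,00</TD>
<TD align=center>+</TD>
<TD align=right>16.210,60</TD></TR>
<TR class=DefGVAltRow>
<TD>29/04/2013</TD>
<TD>RETIRO ATM CTA/CTE</TD>
<TD>1171029739</TD>
<TD align=right>600,00</TD>
<TD align=center>-</TD>
<TD align=right>15.610,60</TD></TR>"""

transactions = [m.groupdict() for m in ROW.finditer(sample)]
for t in transactions:
    print(t["description"], t["amount"], t["info"], t["balance"])
```

Using [^>]* inside the tag keeps the match from running past the closing bracket, which a greedy .* can do before backtracking.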