I have posted my HTML below; I want to get the name value out of it. I've tried several approaches and still have no working solution. Please check my HTML and code snippet, and show me a possible solution.
The prefix stays the same whenever I refresh the page. The name that follows it will change, but it always starts with the literal "mr." as its first 3 characters (4 characters if you count the trailing space). As a regex: ([mM]r\.\ ). Below is my table example.
<table>
<tr><td><b>Your Name is </b> mr. kamrul</td></tr>
<tr><td><b>your age </b> 12</td></tr>
<tr><td><b>Email:</b>kennethdasma30@gmail.com</td></tr>
<tr><td><b>job title</b> sales man</td></tr>
</table>
As shown below, I am trying to fill a ListBox this way, but nothing gets added.
HtmlElementCollection bColl =
webBrowser1.Document.GetElementsByTagName("table");
foreach (HtmlElement bEl in bColl)
{
if (bEl.GetAttribute("table") != null)
{
listBox1.Items.Add(bEl.GetAttribute("table"));
}
}
If anyone can give me an idea of how to get every such value in the browser window into my list box as ("mr. " + text), I would appreciate it. A detailed explanation with good comments would also help, as I'd like to understand the answer in depth.
Here is one simple way using Regex, assuming that the format of your html page doesn't change.
Regex re = new Regex(@"(?<=<tr><td><b>Your\sName\sis\s?</b>\s?)[mM]r\.\s.+?(?=</td></tr>)", RegexOptions.Singleline);
foreach (Match match in re.Matches(webBrowser1.DocumentText))
{
listBox1.Items.Add(match.Value);
}
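For readers who want to try the pattern outside WinForms, here is a minimal Python sketch of the same idea. It is an illustration only, not part of the answer above: Python's `re` module accepts only fixed-width lookbehind, so the sketch swaps the lookbehind for a capture group, and the `html` string and the `extract_names` helper are made up for the example.

```python
import re

# Hypothetical stand-in for webBrowser1.DocumentText.
html = """<table>
<tr><td><b>Your Name is </b> mr. kamrul</td></tr>
<tr><td><b>your age </b> 12</td></tr>
<tr><td><b>job title</b> sales man</td></tr>
</table>"""

NAME_RE = re.compile(
    r"<b>Your\sName\sis\s?</b>\s?"  # fixed label in front of the name
    r"([mM]r\.\s.+?)"               # captured: "mr. " plus the name itself
    r"(?=</td></tr>)",              # stop at the end of the table cell
    re.DOTALL,
)


def extract_names(document_text):
    """Return every 'mr. <name>' value that follows the label."""
    return NAME_RE.findall(document_text)


names = extract_names(html)
print(names)
```

Like the C# version, this only holds while the page keeps exactly this markup around the name.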
I'm stuck on how to read and tidy up these values from a multiline cell via ARRAYFORMULA.
I'm using regex because the preceding line can vary.
Formulas only, please; no custom code.
The first column looks like a set of these:
```
[config]
name = the_name
texture = blah.dds
cost = 1000
[effect0]
value = 1000
type = ATTR_A
[effect1]
value = 8
type = ATTR_B
[feature0]
name = feature_blah
[components]
0 = comp_one,1
[resources]
res_one = 1
res_five = 1
res_four = 1
```
To be useful elsewhere, at minimum each [tag] set ([effect\d], [feature\d], etc.) needs to end up in its own column. For example, the 'effects' column would look like:
ATTR_A:1000,ATTR_B:8
and so on.
Desired output can also be seen in the included spreadsheet
**Here is the example spreadsheet:**
https://docs.google.com/spreadsheets/d/1arMaaT56S_STTvRr2OxCINTyF-VvZ95Pm3mljju8Cxw/edit?usp=sharing
**Current REGEXREPLACE**
Kind of works: it finds each 'type' and 'value' fine, but I can't figure out how to extract just those from the rest. I tried capturing (and non-capturing) groups before and after, but that didn't work.
=ARRAYFORMULA(REGEXREPLACE($A3:$A,"[\n.][effect\d][\n.](.)\n(.)","1:$1 2:$2"))
**Current SUBSTITUTE + REGEXEXTRACT + REGEXREPLACE**
A different approach entirely. It also kind of works, but it is longer and leaves me having to parse the values out of the resulting string, which is where I got stuck again. The idea was to use this to simplify the input, then REGEXREPLACE as above. I'm stuck removing the content around the final matches, though; if that can be done, the approach above is fine too.
// First ran a substitute
=ARRAYFORMULA(SUBSTITUTE(SUBSTITUTE($A3:$A,char(10),";"),";;",char(10)))
// Then a variation of this (gave up on a single-line 'effect\d', so broke it up to try and get it working)
=ARRAYFORMULA(IF(A3:A<>"",IFERROR(REGEXEXTRACT(A3:A,"(?m)^(?:[effect0]);(.)$")&";;")&""&IFERROR(REGEXEXTRACT(A3:A,"(?m)^(?:[effect1]);(.)$")&";;")&""&IFERROR(REGEXEXTRACT(A3:A,"(?m)^(?:[effect2]);(.)$")&";;"),""))
// Then use regexreplace like above
=ARRAYFORMULA(REGEXREPLACE($B3:$B,"value = (.);type = (.);;","1:$1 2:$2"))
**--EDIT--**
Also, as my updated 'Desired Output' sheet shows (see timestamped comment below), bonus kudos if you can also extract just the values of matching 'type's into those extra columns (see spreadsheet).
All good if you can't, though; I just realized I would need that too for lookups.
**--END OF EDIT--**
I've tried dozens of things, discarding each in turn. I had a quick look in the version history to pull out two promising attempts and shared them in separate sheets.
One of these also used SUBSTITUTE to simplify the input column; I'm happy with a solution using either the RAW or the SUBSTITUTE results.
**Potentially Useful links:**
https://github.com/google/re2/wiki/Syntax
**Just some more words:**
I have also looked at dozens of Stack Overflow and Google support pages, and tried both REGEXEXTRACT and REGEXREPLACE. Both are promising but missing that final tweak, and I have already tried dozens of tweaks on both.
Any help would be great, and will hopefully help others in future, since examples with spreadsheets are useful when every new regex seems to be a new adventure ;)
P.S. If you can think of a better title for this post, please say so in a comment or your answer :)
paste in B3:
=ARRAYFORMULA(SUBSTITUTE(TRIM(TRANSPOSE(QUERY(TRANSPOSE(
IF(C3:E<>"", C2:E2&":"&C3:E, )),,999^99))), " ", ", "))
paste in C3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT($A3:$A, "(\d+)\ntype = "&C2)))
paste in D3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT($A3:$A, "(\d+)\ntype = "&D2)))
paste in E3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT($A3:$A, "(\d+)\ntype = "&E2)))
paste in F3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT(A3:A, "\[feature\d+\]\nname = (.*)")))
paste in G3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT(A3:A, "\[components\]\n\d+ = (.*)")))
paste in H3:
=ARRAYFORMULA(IFNA(REGEXREPLACE(INDEX(SPLIT(REGEXEXTRACT(
REGEXREPLACE(A3:A, "\n", ", "), "\[resources\], (.*)"), "["),,1), ", , $", )))
spreadsheet demo
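To make the logic behind the C3:E3 and B3 formulas easier to follow, here is a rough Python sketch of the same two steps: pull the number that sits on the line before a given `type = ...`, then join the pairs into one string. It is only an illustration of the regex idea, not a replacement for the formulas; the helper names `value_for_type` and `effects_column` are invented for the example, and the cell text is copied from the question.

```python
import re

cell = """[config]
name = the_name
texture = blah.dds
cost = 1000
[effect0]
value = 1000
type = ATTR_A
[effect1]
value = 8
type = ATTR_B
[feature0]
name = feature_blah
[components]
0 = comp_one,1
[resources]
res_one = 1
res_five = 1
res_four = 1"""

# One "value = N" line immediately followed by its "type = NAME" line.
PAIR_RE = re.compile(r"value = (\d+)\ntype = (\w+)")


def value_for_type(cell_text, attr):
    """Mirror of REGEXEXTRACT(A3, "(\\d+)\\ntype = " & C2); '' when absent."""
    match = re.search(r"(\d+)\ntype = " + re.escape(attr), cell_text)
    return match.group(1) if match else ""


def effects_column(cell_text):
    """Build the 'ATTR_A:1000,ATTR_B:8' style summary string."""
    pairs = PAIR_RE.findall(cell_text)
    return ",".join(f"{attr}:{value}" for value, attr in pairs)


print(effects_column(cell))
print(value_for_type(cell, "ATTR_B"))
```

The sketch assumes, as the formulas do, that `value` always comes on the line directly before `type`.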
This was a fun exercise. :-)
Caveat first: I have added some "input data". Examples:
[feature1]
name = feature_active_spoiler2
[components]
0 = spoiler,1
1 = spoilerA, 2
So the output has "extra" output.
See the tab ADW's Solution.
I'm trying to parse a CSV file. I got the following regular expression from Google. It works pretty well except for one issue: it doesn't parse blank fields.
let arrItem = row.match(/(".*?"|[^",]+)(?=\s*,|\s*$)/g);
arrItem = arrItem || [];
Example row data
9598,"HERE IS LOOKING AT YOU KID, LLC",85647 GOLDEN BLAH BLAH,,ASHBURN,VA,20147,USA,555-555-1511,45-1111111,SOME@GMAIL.COM,9598,,
Here is a screenshot of the arrItem:
I modified the data in the sample and covered it in the screenshot for privacy.
The problem is that in the array, the item at index 3 should be blank and the item at index 4 should be "ASHBURN", and so forth. Any ideas on how to fix the expression?
I created the following sample
Thanks
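To pin down what the expected result looks like, here is a small Python sketch that parses a shortened version of the example row with the standard `csv` module instead of a regex. It is not a fix for the JavaScript expression, just a reference for the output a correct parser should give; the shortened `row` and the `parse_row` helper are assumptions made for the illustration.

```python
import csv

# Shortened copy of the example row: one blank field in the middle,
# two blank fields at the end, and a quoted field containing a comma.
row = '9598,"HERE IS LOOKING AT YOU KID, LLC",85647 GOLDEN BLAH BLAH,,ASHBURN,VA,20147,USA,,'


def parse_row(line):
    """Parse a single CSV line; empty fields come back as empty strings."""
    return next(csv.reader([line]))


fields = parse_row(row)
for index, field in enumerate(fields):
    print(index, repr(field))
```

The blank field keeps its slot at index 3, so "ASHBURN" lands at index 4, which is the behaviour the regex is missing.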
I'm here to ask you some help with QXmlQuery and Xpath.
I'm trying to use this combination to extract some data from several HTML documents.
These documents are downloaded and then cleaned with the HTML Tidy Library.
The problem is when I try my XPath. Here is an example code :
[...]
<ul class="bullet" id="idTab2">
<li><span>Hauteur :</span> 1127 mm</li>
<li><span>Largeur :</span> 640 mm</li>
<li><span>Profondeur :</span> 685 mm</li>
<li><span>Poids :</span> 159.6 kg</li>
[...]
The clean code is stored in a QString "code" :
QStringList fields, values;
QXmlQuery query;
query.setFocus(code);
query.setQuery("//*[@id=\"idTab2\"]/*/*/string()");
query.evaluateTo(&fields);
My goal is to get all the fields (Hauteur, Largeur, Profondeur, Poids, etc.) and their value (1127 mm, 640 mm, 685 mm, 159.6 kg, etc.).
Question 1
As you can see, I use the XPath //*[@id="idTab2"]/*/*/string() to retrieve the fields, because //ul[@id="idTab2"]/li/span/string() doesn't work. When I specify a tag name, I get nothing; it only works with *. Why? I've checked the code returned by the tidy function and the XPath is not altered, so I don't see any problem. Is this normal? Or maybe there is something I don't know...
Question 2
In the previous XHTML code, the li tags wrap a span tag and some text. I don't know how to get only the text and not the content of the span tag. I tried :
//*[@id="idTab2"]/*/string() gives : Hauteur : 1127 mm Largeur : 640 mm Profondeur : 685 mm
//*[@id="idTab2"]/*[2]/string() gives : Nothing
So, if I'm not wrong, the text in the li tag is not considered as a child node but it should be. See the accepted answer : Select just text directly in node, not in child nodes.
Thanks for reading, I hope someone can help me.
To get the elements (not the text representation) inside the different <li>s, you can test the text content:
//*[@id=\"idTab2\"]/li[starts-with(span, "Hauteur")]
The same goes for the other items:
//*[@id=\"idTab2\"]/li[starts-with(span, "Largeur")]
//*[@id=\"idTab2\"]/li[starts-with(span, "Profondeur")]
//*[@id=\"idTab2\"]/li[starts-with(span, "Poids")]
To get the string representation of these <li>, you can use string() around the whole expression, like this:
string(//*[@id=\"idTab2\"]/li[starts-with(span, "Poids")])
which gives "Poids : 159.6 kg"
To extract only the text node in the <li>, without the <span>, you can use these expressions. They select the text nodes that are direct children of <li> (<span> is not a text node) and strip the leading and trailing whitespace with normalize-space():
normalize-space(//*[@id=\"idTab2\"]/li[starts-with(span, "Hauteur")]/text())
normalize-space(//*[@id=\"idTab2\"]/li[starts-with(span, "Largeur")]/text())
normalize-space(//*[@id=\"idTab2\"]/li[starts-with(span, "Profondeur")]/text())
normalize-space(//*[@id=\"idTab2\"]/li[starts-with(span, "Poids")]/text())
The last one gives "159.6 kg"
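The split between "text inside the <span>" and "text that is a direct child of the <li>" can be sketched in Python as well. This uses the standard library's `xml.etree.ElementTree` rather than QXmlQuery, and ElementTree supports only a small XPath subset (no `starts-with()` or `text()`), so the sketch reads `span.text` and `span.tail` instead. The closing `</ul>` and the `specs` dictionary are assumptions added for the example.

```python
import xml.etree.ElementTree as ET

# The fragment from the question, with a closing </ul> added so that
# it parses as well-formed XML (an assumption; the original is elided).
code = """<ul class="bullet" id="idTab2">
<li><span>Hauteur :</span> 1127 mm</li>
<li><span>Largeur :</span> 640 mm</li>
<li><span>Profondeur :</span> 685 mm</li>
<li><span>Poids :</span> 159.6 kg</li>
</ul>"""

root = ET.fromstring(code)

specs = {}
for li in root.findall("li"):
    span = li.find("span")
    # span.text is the label inside <span>; drop the trailing " :".
    field = span.text.strip().rstrip(":").strip()
    # span.tail is the text that follows </span> inside the same <li>,
    # i.e. the direct text child of <li> that /text() selects in XPath.
    value = (span.tail or "").strip()
    specs[field] = value

print(specs)
```

Here `strip()` plays the role of `normalize-space()` for these simple values.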
I'm about to break this down into two operations since I can't seem to figure out the regular expression to do it in one. However, I thought I would ask the brain trust here to see if anyone can do it (which I'm sure someone can).
Essentially I have a string containing a recipients field from an email in Exchange. I want to parse it out into individual recipients. I don't need to validate emails or anything. Essentially the data is comma separated except if the comma is in between a set of quotes. That's the part that's messing me up.
Right now I'm using: (?"[^"\r\n]*")
Which gives me the quoted names, and ([a-zA-Z0-9_-.]+)@(([[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.)|(([a-zA-Z0-9-]+.)+))([a-zA-Z]{2,4}|[0-9]{1,3})
which gives me the email addresses
Here's what I have..
Data:
"George Washington" <gwashington@government.net>, "Abraham Lincoln" <alincoln@government.net>, "Carter, Jimmy" <jimmy.carter@presidents.com>, "Nixon, Richard M." <tricky.dick@presidents.com>
What I'd like to get back is this:
"George Washington" <gwashington@government.net>
"Abraham Lincoln" <alincoln@government.net>
"Carter, Jimmy" <jimmy.carter@presidents.com>
"Nixon, Richard M." <tricky.dick@presidents.com>
I don't know enough about Exchange to write a pattern that matches every possible recipient entry.
But based on the information you posted as an example, I can offer this:
["][^"]+["][^",]+(?=[,]?)
This matches all the entries that you posted.
And now a simple example in C# of how to use it:
var input = "\"George Washington\" <gwashington@government.net>, \"Abraham Lincoln\" <alincoln@government.net>, \"Carter, Jimmy\" <jimmy.carter@presidents.com>, \"Nixon, Richard M.\" <tricky.dick@presidents.com>";
var pattern = "[\"][^\"]+[\"][^\",]+(?=[,]?)";
var items = Regex.Matches(input, pattern)
.Cast<Match>()
.Select(s => s.Value)
.ToList();
If there is input text for which this pattern doesn't work, please post it here.
Regex.Matches(input, @"""[^""]*""\s<[^>]*>");
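The quoted-name-plus-angle-bracket pattern can be checked quickly in Python. This is only a sketch under the same assumption as the answers above, namely that every recipient has the form "Name" <address>; the `RECIPIENT_RE` name is made up for the example.

```python
import re

data = (
    '"George Washington" <gwashington@government.net>, '
    '"Abraham Lincoln" <alincoln@government.net>, '
    '"Carter, Jimmy" <jimmy.carter@presidents.com>, '
    '"Nixon, Richard M." <tricky.dick@presidents.com>'
)

# A quoted display name (commas allowed inside the quotes), one
# whitespace character, then everything up to the closing '>'.
RECIPIENT_RE = re.compile(r'"[^"]*"\s<[^>]*>')

recipients = RECIPIENT_RE.findall(data)
for recipient in recipients:
    print(recipient)
```

Because the match is anchored on the quotes and angle brackets rather than on the separating commas, a comma inside a display name does not split the entry.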
I've stumbled on a bit of challenge here: how to get the contents of a table in HTML with the help of a regular expression. Let's say this is our table:
<table someprop=2 id="the_table" otherprop="val">
<tr>
<td>First row, first cell</td>
<td>Second cell</td>
</tr>
<tr>
<td>Second row</td>
<td>...</td>
</tr>
<tr>
<td>Another row, first cell</td>
<td>Last cell</td>
</tr>
</table>
I already found a method that works, but involves multiple regular expression to be executed in steps:
Get the right table and put its rows in back-reference 1 (there may be more than one in the document):
<table[^>]*?id="the_table"[^>]*?>(.*?)</table>
Get the rows of the table and put the cells in back-reference 1:
<tr.*?>(.*?)</tr>
And lastly fetch the cell contents in back-reference 1:
<td.*?>(.*?)</td>
Now this is all good, but it would be infinitely more awesome to do this all using one fancy regular expression... Does someone know if this is possible?
There really isn’t a possible regex solution that works for an arbitrary number of table data and puts each cell into a separate back reference. That’s because with backreferences, you need to have a distinct open paren for each backref you want to create, and you don’t know how many cells you have.
There’s nothing wrong with using looping of one or another sort to pull out the data. For example, on the last one, in Perl it would be this, given that $tr already contains the row you need:
@td = ( $tr =~ m{<td.*?>(.*?)</td>}sg );
Now $td[0] will contain the first <td>, $td[1] will contain the second one, etc. If you wanted a two-dimensional array, you might wrap that in a loop like this to populate a new @cells variable:
our $table; # assume has full table in it
my @cells;
# a scalar-context m//g match walks the table one <tr> at a time
while ($table =~ m{<tr.*?>(.*?)</tr>}sg) {
my $tr = $1; # contents of the current row
push @cells, [ $tr =~ m{<td.*?>(.*?)</td>}sg ];
}
Now you can do two-dimensional addressing, allowing for $cells[0][0], etc. The outer explicit loop processes the table a row at a time, and the inner implicit loop pulls out all the cells.
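The same outer-loop, inner-loop structure looks like this as a Python sketch, run against the sample table from the question. It carries the same caveat as the Perl: it works on this canned input and makes the same assumptions about the markup. The names `ROW_RE`, `CELL_RE` and `cells` are chosen for the illustration.

```python
import re

table = """<table someprop=2 id="the_table" otherprop="val">
<tr>
<td>First row, first cell</td>
<td>Second cell</td>
</tr>
<tr>
<td>Second row</td>
<td>...</td>
</tr>
<tr>
<td>Another row, first cell</td>
<td>Last cell</td>
</tr>
</table>"""

# re.DOTALL plays the role of Perl's /s: '.' also matches newlines.
ROW_RE = re.compile(r"<tr.*?>(.*?)</tr>", re.DOTALL)
CELL_RE = re.compile(r"<td.*?>(.*?)</td>", re.DOTALL)

# Outer loop: one match per row. Inner findall: all cells of that row.
cells = [CELL_RE.findall(tr) for tr in ROW_RE.findall(table)]

print(cells[0][0])
print(cells)
```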
That will work on the canned sample data you showed. If that’s good enough for you, then great. Use it and move on.
What Could Ever Be Wrong With That?
However, there are actually quite a few assumptions in your patterns about the contents of your data, ones I don’t know that you’re aware of. For one thing, notice how I’ve used /s so that it doesn’t get stuck on newlines.
But the main problem is that minimal matches aren’t always quite what you want here. At least, not in the general case. Sometimes they aren’t as minimal as you think, matching more than you want, and sometimes they just don’t match enough.
For example, a pattern like <i>(.*?)</i> will get more than you want if the string is:
<i>foo<i>bar</i>ness</i>
Because you will end up matching the string <i>foo<i>bar</i>.
The other common problem (and not counting the uncommon ones) is that a pattern like <tag.*?> may match too little, such as with
<img alt=">more" src="somewhere">
Now if you use a simplistic <img.*?> on that, you would only capture <img alt=">, which is of course wrong.
I think the last major remaining problem is that you have to altogether ignore certain things in parsing. The simplest demo of this is embedded comments (also <script>, <style>, and CDATA), since you could have something like
<i> some <!-- secret</i> --> stuff </i>
which will throw off something like <i>(.*?)</i>.
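The three failure cases above are easy to reproduce. Here is a short Python sketch that runs the same simplistic patterns on the same strings; the variable names are invented for the example.

```python
import re

# 1. Nested tags: the minimal match stops at the FIRST </i>, so it
#    swallows the inner opening tag along with it.
nested = re.search(r"<i>(.*?)</i>", "<i>foo<i>bar</i>ness</i>").group(0)

# 2. A '>' inside an attribute value ends the tag match too early.
quoted = re.search(r"<img.*?>", '<img alt=">more" src="somewhere">').group(0)

# 3. A closing tag hidden inside a comment is taken as the real one.
commented = re.search(
    r"<i>(.*?)</i>", "<i> some <!-- secret</i> --> stuff </i>"
).group(1)

print(repr(nested))
print(repr(quoted))
print(repr(commented))
```

In each case the pattern does exactly what it was told; the trouble is that the markup's rules are richer than the pattern.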
There are ways around all these, of course. Once you’ve done so, and it is really quite a bit of effort, you’ll find that you have built yourself a real parser, complete with a lot of auxiliary logic, not just one pattern.
Even then you are only processing well-formed input strings. Error recovery and failing softly is an entirely different art.
This answer was added before it was known that the OP needed a solution for C++...
Since using regex to parse html is technically wrong, I'll offer a better solution. You could use js to get the data and put it into a two dimensional array. I use jQuery in the example.
var data = [];
$('table tr').each(function(i, n){
var $tr = $(n);
data[i] = [];
$tr.find('td').each(function(j, td){
data[i].push($(td).text());
});
});
jsfiddle of the example: http://jsfiddle.net/gislikonrad/twzM7/
EDIT
If you want a plain javascript way of doing this (not using jQuery), then this might be more for you:
var data = [];
var rows = document.getElementById('the_table').getElementsByTagName('tr');
for(var i = 0; i < rows.length; i++){
var d = rows[i].getElementsByTagName('td');
data[i] = [];
for(var j = 0; j < d.length; j++){
data[i].push(d[j].innerText);
}
}
Both these functions return the data the same way.
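For comparison outside the browser, the same "walk the rows, then the cells" idea can be sketched with Python's built-in `html.parser`, which tokenizes the markup instead of pattern-matching it. This is a minimal sketch, not a general table extractor: the `TableCells` class and the `table_to_rows` helper are invented for the example, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>, as a 2D list."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text chunks of the <td> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def table_to_rows(html):
    parser = TableCells()
    parser.feed(html)
    parser.close()
    return parser.rows


html = """<table someprop=2 id="the_table" otherprop="val">
<tr><td>First row, first cell</td><td>Second cell</td></tr>
<tr><td>Second row</td><td>...</td></tr>
<tr><td>Another row, first cell</td><td>Last cell</td></tr>
</table>"""

data = table_to_rows(html)
print(data)
```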