Webscraping with VBA morningstar financial - regex

I'm trying to scrape the inside ownership from Morningstar at this url:
http://investors.morningstar.com/ownership/shareholders-overview.html?t=TWTR&region=usa&culture=en-US
This is the code I'm using:
Sub test()
Dim appIE As Object
Set appIE = CreateObject("InternetExplorer.Application")
With appIE
.Navigate "http://investors.morningstar.com/ownership/shareholders-overview.html?t=TWTR&region=usa&culture=en-US"
.Visible = True
End With
While appIE.Busy
DoEvents
Wend
Set allRowOfData = appIE.Document.getElementById("currentInsiderVal")
Debug.Print allRowOfData
Dim myValue As String: myValue = allRowOfData.Cells(0).innerHTML
appIE.Quit
Set appIE = Nothing
Range("A30").Value = myValue
End Sub
I get run-time error 13 at line
Set allRowOfData = appIE.Document.getElementById("currentInsiderVal")
but I can't see any mismatch. What is going on?

You can just do it with XHR and RegEx instead of cumbersome IE:
Sub Test()
Dim sContent
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "http://investors.morningstar.com/ownership/shareholders-overview.html?t=TWTR&region=usa&culture=en-US", False
.Send
sContent = .ResponseText
End With
With CreateObject("VBScript.RegExp")
.Pattern = ",""currInsiderVal"":(.*?),"
Range("A30").Value = .Execute(sContent).Item(0).SubMatches(0)
End With
End Sub
Here is the description how the code works:
First of all MSXML2.XMLHTTP ActiveX instance is created. GET request opened with target URL in synchronous mode (execution interrupts until response received).
Then VBScript.RegExp is created. By default .IgnoreCase, .Global and .MultiLine properties are False. The pattern is ,"currInsiderVal":(.*?),, where (.*?) is a capturing group, . means any character, .* - zero or more characters, .*? - as few as possible characters (lazy matching). Other characters in pattern to be found as is. .Execute method returns a collection of matches, there is only one match object in it since .Global is False. This match object has a collection of submatches, there is only one submatch in it since the pattern contains the only capturing group.There are some helpful MSDN articles on regex:
Microsoft Beefs Up VBScript with Regular Expressions
Introduction to Regular Expressions
Here is the description how I created the code:
First I found an element containing the target value on the webpage DOM using browser:
The corresponding node is:
<td align="right" id="currrentInsiderVal">143.51</td>
Then I made XHR and found this node in the response HTML, but it didn't contain the value (you can find response in the browser developer tools on network tab after you refresh the page):
<td align="right" id="currrentInsiderVal">
</td>
Such behavior is typical for DHTML. Dynamic HTML content is generated by scripts after the webpage loaded, either after retrieving a data from web via XHR or just processing already loaded withing webpage data. Then I just searched for the value 143.51 in the response, the snippet ,"currInsiderVal":143.51, located within JS function:
fundsArr = {"fundTotalHistVal":132.61,"mutualFunds":[[1,89,"#a71620"],[2,145,"#a71620"],[3,152,"#a71620"],[4,198,"#a71620"],[5,155,"#a71620"],[6,146,"#a71620"],[7,146,"#a71620"],[8,132,"#a71620"]],"insiderHisMaxVal":3.535,"institutions":[[1,273,"#283862"],[2,318,"#283862"],[3,351,"#283862"],[4,369,"#283862"],[5,311,"#283862"],[6,298,"#283862"],[7,274,"#283862"],[8,263,"#283862"]],"currFundData":[2,2202,"#a6001d"],"currInstData":[1,4370,"#283864"],"instHistMaxVal":369,"insiders":[[5,0.042,"#ff6c21"],[6,0.057,"#ff6c21"],[7,0.057,"#ff6c21"],[8,3.535,"#ff6c21"],[5,0],[6,0],[7,0],[8,0]],"currMax":4370,"histLineQuars":[[1,"Q2"],[2,"Q3"],[3,"Q4"],[4,"Q1<br>2015"],[5,"Q2"],[6,"Q3"],[7,"Q4"],[8,"Q1<br>2016"]],"fundHisMaxVal":198,"currInsiderData":[3,143,"#ff6900"],"currFundVal":2202.85,"quarters":[[1,"Q2"],[2,""],[3,""],[4,"Q1<br>2015"],[5,""],[6,""],[7,""],[8,"Q1<br>2016"]],"insiderTotalHistVal":3.54,"currInstVal":4370.46,"currInsiderVal":143.51,"use10YearData":"false","instTotalHistVal":263.74,"maxValue":369};
So the regex pattern created based on that it should find the snippet ,"currInsiderVal":<some text>, where <some text> is our target value.

Had a look on the site and the element you are trying to retrieve has a typo in it; instead of currentInsiderVal try using currrentInsiderVal and you should retrieve the data correctly.
Probably worth considering some error trapping to catch stuff like this for any other fields you retrieve?
After your comment I took a closer look. Your issue seemed like it was trying to trap the id of the individual cell rather than navigating down the object tree. I've modified the code to retrieve the row of the table you are after and then set myValue to be the correct cell within that row. Seemed to be working when I tried it out. Give this a shot?
Sub test()
Dim appIE As Object
Set appIE = CreateObject("internetexplorer.application")
With appIE
.Navigate "http://investors.morningstar.com/ownership/shareholders-overview.html?t=TWTR&region=usa&culture=en-US"
.Visible = True
End With
While appIE.Busy
DoEvents
Wend
Set allRowOfData = appIE.Document.getelementbyID("tableTest").getElementsByTagName("tbody")(0).getElementsByTagName("tr")(5)
myValue = allRowOfData.Cells(2).innerHTML
appIE.Quit
Set appIE = Nothing
Range("A30").Value = myValue
End Sub

Related

Set MSXML2.XSLTemplate60.stylesheet gives variable or parameter error before I have a chance to supply them

I have an XSLT that I want to test for processing XML data through MSXML2 in VBA.
The usual use of this XSLT is to process XML data from MicroStation or OpenRoads Designer which supplies these parameters according to settings in the report browser.
In my case, as I'm trying to run some of the report XML through MSXML2, I find that this error comes up:
A reference to variable or parameter 'xslStationPrecision' cannot be resolved. The variable or parameter may not be defined, or it may not be in scope.
This is true even as I'm repurposing code from Microsoft's own documentation here: https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms762312(v=vs.85)
To be specific, here's my code block that duplicates the sample code:
Sub test_xsl_from_ms_learn()
' from https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms762312(v=vs.85)
Dim xslt As New MSXML2.XSLTemplate60
Dim xsldoc As New MSXML2.FreeThreadedDOMDocument60
Dim xslproc
Dim PathToXslFile As String
PathToXslFile = "my_own_custom.xsl"
xsldoc.async = False
xsldoc.Load (PathToXslFile)
If xsldoc.parseError.ErrorCode <> 0 Then
Debug.Print xsldoc.parseError.reason
Else
On Error Resume Next
Set xslt.stylesheet = xsldoc
Debug.Print Err.Number, Err.Description
'this is where I get the error
'if not for this error, I would advance through to where I would use addParameter
On Error GoTo 0
End If
End Sub
How can I use addParameter before this error?
I found the solution. First, through testing I found that a reference within the XSLT file covers the gap with the parameters, so I needed to resolve external references.
But after resolving that, I got a new error "Security settings do not allow the execution of script code within this stylesheet." Through a bit of searching, I found the answer from Microsoft here: https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms763800(v=vs.85)?redirectedfrom=MSDN
So the following lines are needed before Set xslt.stylesheet:
xsldoc.resolveExternals = True
'this wasn't mentioned in the Learn code, but it is essential to resolve this error:
'A reference to variable or parameter 'xslStationPrecision' cannot be resolved. The variable or parameter may not be defined, or it may not be in scope.
xsldoc.SetProperty "AllowXsltScript", True
'to avoid this error for having <msxsl:script>:
'Security settings do not allow the execution of script code within this stylesheet.
In the end, this code block loads properly:
Sub test_xsl_from_ms_learn()
' from https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms762312(v=vs.85)
Dim xslt As New MSXML2.XSLTemplate60
Dim xsldoc As New MSXML2.FreeThreadedDOMDocument60
Dim xslproc
Dim PathToXslFile As String
PathToXslFile = "my_own_custom.xsl"
xsldoc.async = False
xsldoc.Load (PathToXslFile)
If xsldoc.parseError.ErrorCode <> 0 Then
Debug.Print xsldoc.parseError.reason
Else
xsldoc.resolveExternals = True
xsldoc.SetProperty "AllowXsltScript", True
On Error Resume Next
Set xslt.stylesheet = xsldoc
Debug.Print Err.Number, Err.Description
On Error GoTo 0
End If
End Sub

Apps Script findText() for Google Docs

I'm applying RegEx search to a Google Document text with some markdown code block ticks (```). Running the code below on my doc is returning a null result.
var codeBlockRegEx = '`{3}((?:.*?\s?)*?)`{3}'; // RegEx to find (lazily) all text between triple tick marks (/`/`/`), inclusive of whitespace such as carriage returns, tabs, newlines, etc.
var reWithCodeBlock = body.findText(codeBlockRegEx); // reWithCodeBlock evaluates to 'null'
I suspect that there's some element of regex in my code that is not supported by RE2, but the documentation has not shed light on this. Any ideas?
I received null as well- I was able to get the below to work using 3 ` surrounding the word test within a paragraph.
I did find this information:
findText method of objects of class Text in Apps Script, extending Google Docs. Documentation says “A subset of the JavaScript regular expression features are not fully supported, such as capture groups and mode modifiers.” In particular, it does not support lookarounds.
function findXtext() {
var body = DocumentApp.getActiveDocument().getBody();
var foundElement = body.findText("`{3}(test)`{3}");
while (foundElement != null) {
// Get the text object from the element
var foundText = foundElement.getElement().asText();
// Where in the element is the found text?
var start = foundElement.getStartOffset();
var end = foundElement.getEndOffsetInclusive();
// Set Bold
foundText.setBold(start, end, true);
// Change the background color to yellow
foundText.setBackgroundColor(start, end, "#FCFC00");
// Find the next match
foundElement = body.findText("`{3}(test)`{3}", foundElement);
}
}

vb.net Regex remove a tags with mailto

I have a text for example:
" Visit www.flexstaff.com for details
Email rachel#flexstaff.com apply online."
I would like to delete only the a tags that contain "mailto" so
rachel#flexstaff.com will become
rachel#flexstaff.com
I have this regex:
Dim rgxMailTo = New Regex("<a\b\s[^<>]*(?<=#.*)>|(?<=#.*)</a>",RegexOptions.IgnoreCase)
Dim ret As String = rgxMailTo.Replace(text, Environment.NewLine)
But it selects other a tags as well.
Use the below regex and then replace the match with $1.
<a\b\s*[^<>]*\bmailto\b[^<>]*>([^<>]*)<\/a>
DEMO
To select only the tags.
<a\b\s*[^<>]*\bmailto\b[^<>]*>|(?<=<a\b\s*[^<>]*\bmailto\b[^<>]*>[^<>]*)<\/a>
If your text is of uncertain source (so it was not all generated in 100% predictable way), using regex is a very bad idea - trust me, I've been there.
One option is to use Html Agility Pack, and load the HTML as an XElement (C#, as I have sample on hand):
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(HTML);
htmlDoc.OptionOutputAsXml = true;
using (var stream = new MemoryStream())
{
htmlDoc.Save(stream);
stream.Position = 0;
var xelement = XElement.Load(stream);
DoStuffToXElement(xelement);
}
Note, that in case you have just a fragment without a root element:
Link
<img src="#"/>
Remember to wrap it in something neutral, like htmlDoc.LoadHtml("<div>"+HTML+"</div>");
Now you can use LinqToXml to find whatever you need, traverse the tree or do anything quite safely:
xHtml
.Descendants()
.Where(e=>e.Name.LocalName.Equals("a", StringComparison.OrdinalIgnoreCase)
&& e.Attribute("href") != null
&& e.Attribute("href").Value.StartsWith("mailto:", StringComparison.OrdinalIgnoreCase))
.Remove();
Final note: this is nearly always much slower than regex - if time is important (for example you do it at every page load or sth) it might be too slow, but I guess this kind of processing can be done beforehand?
You can use the power of LINQ to XML like this:
Imports System.Text.RegularExpressions
Imports System.Xml.Linq
Imports System.Xml
Imports System.Xml.XPath
Public Class Form1
Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
Dim str As String = "Visit www.flexstaff.com for details\nEmail rachel#flexstaff.com apply online."
Dim xDoc As XDocument = XDocument.Parse("<?xml version= '1.0'?><root>" + str + "</root>")
Dim query = xDoc.XPathSelectElements("//a[contains(#href,'mailto')]")
For Each element In query
element.Remove()
Next element
Dim Res As String = xDoc.ToString().Replace("<root>", String.Empty).Replace("</root>", String.Empty)
End Sub
End Class
Outoput (Res):
Visit www.flexstaff.com for details\nEmail apply online.

How to match and return part of an HTML file in ASP

I'm looking to match a part of several HTML files that get passed into a loop in an ASP file and then return that part of the HTML files to include in my output. Here's my code so far:
<%for i=0 to uBound(fileIDs) ' fileIDs is an array of URLs
dim srcText, outText, url
Set ex = New RegExp
ex.Global = true
ex.IgnoreCase = true
ex.Pattern = "<section>[\S\s]+</section>" ' This finds the HTML I want
url = fileIDs(i)
Set xmlhttp = CreateObject("MSXML2.ServerXMLHTTP")
xmlhttp.open "GET", url, false
xmlhttp.send ""
srcText = xmlhttp.responseText
outputText = ex.Execute(mediaSrcText) ' I expect this to be the HTML I want
Response.Write(outputText.Item(0).Value) ' This would then return the first instance
set xmlhttp = nothing
next %>
I've tested the regular expression on my files and it's matching the parts that I want it to.
when I run the page containing this code, I get an error:
Microsoft VBScript runtime error '800a01b6'
Object doesn't support this property or method
on the line with ex.Execute. I've also tried ex.Match, but got the same error. So I'm clearly missing the proper method for returning the match so I can write it out into the file. What is that method? Or am I approaching the problem from the wrong direction?
Thanks!
You need a Set when you're assigning outputText:
Set outputText = ex.Execute(mediaSrcText)
I should probably also say that you really shouldn't be using regular expressions to attempt to parse HTML, although I don't know enough about the context to offer more specific advice.

How to use regular expression in WatiN

I'm working on WatiN automation tool. I'm having problem in regular expression. I've situation where i have to enter some text and click on a button in the popup window. I'm using AttachToIE method and URL attribute("http://192.168.25.10:215/admin/SelectUsers.aspx?Type=FeedbackID=ef5ad7ef5490-4656-9669-32464aeba7cd") of the popup to attach to the popup.
The problem is each time the popup appears the ID value in the URL changes. So i'm not able to access the popup. can anyone plz help with this by giving me Regular Expression for the changing value of ID in the below URL
("http://192.168.25.10:215/admin/SelectUsers.aspx?Type=FeedbackID=ef5ad7ef5490-4656-9669-32464aeba7cd")
thanking you
It appears that you have a URL with 2 query string parameters Type and ID and your pattern is:
"http://192.168.25.10:215/admin/SelectUsers.aspx?Type=Feedback&ID={some id}"
You can use the Find.ByUrl() attribute constraint method and pass it to AttachToIE() as shown below with the regex for matching that pattern.
string url = "http://192.168.25.10:215/admin/SelectUsers.aspx?Type=Feedback&ID="
Regex regex = new Regex(url + "[a-z0-9]+", RegexOptions.IgnoreCase);
IE ie = IE.AttachToIE(Find.ByUrl(regex));
string baseUrl ="http://192.168.25.10:215/admin/SelectUsers.aspx?Type=FeedbackID="
Regex urlIE= new Regex(baseUrl + "[\\wd]+", RegexOptions.IgnoreCase);
IE ie = IE.AttachToIE(Find.ByUrl(urlIE);
I'm not familiar with WatiN but it looks like it's runs on .Net so perhaps this might help?
var desiredId = "000000000000-0000-0000-000000000000";
var url = "http://192.168.25.10:215/admin/SelectUsers.aspx?Type=FeedbackID=ef5ad7ef5490-4656-9669-32464aeba7cd&someMoreStuff";
var pattern = #"(?i)(?<=FeedBackId=)[-a-z0-9]+";
var result = Regex.Replace(url, pattern, desiredId);
Console.WriteLine(result);
//Output: http://192.168.25.10:215/admin/SelectUsers.aspx?Type=FeedbackID=000000000000-0000-0000-000000000000&someMoreStuff
The following pattern should have the same affect but is more defensive. It should only match stuff in the query string, it requires the id to be 35 characters and won't match similar parameter names like "PreviousFeedBackId".
var pattern = #"(?i)(?<=\?.*\bFeedBackId=)[-a-z0-9]{35,35}\b";
If you just want to extract the id:
var id = Regex.Match(url, pattern).Value;
Console.WriteLine(id);
//output: ef5ad7ef5490-4656-9669-32464aeba7cd
WatiN has a feature where in we can use the url by neglecting the query string. Below is the code which is working fine for me.
string baseUrl = "http://192.168.25.10:215/admin/SelectUsers.aspx";
IE ie = IE.AttachToIE(Find.ByUrl(baseUrl,true));