VBA webscraper - Return InnerHTML with regex


Using Excel VBA, I have to scrape some data from this website.
Since the relevant website objects don't contain an id, I cannot use HTML.Document.GetElementById.
However, I noticed that the relevant information is always stored in a <div>-section like the following:
<div style="padding:7px 12px">Basler Versicherung AG Özmen</div>
Question:
Is it possible to construct a RegExp that, probably in a loop, returns everything between <div style="padding:7px 12px"> and the next </div>?
What I have so far returns the complete InnerHtml of the container; I still need to add code that loops over the yet-to-be-constructed RegExp.
Private Function GetInnerHTML(url As String) As String
    On Error GoTo catch
    'Internet Explorer object (ie) is already assigned elsewhere
    With ie
        .Navigate url
        'Wait until navigation has finished and the document is fully loaded
        Do While .Busy Or .readyState <> 4
            DoEvents
        Loop
        GetInnerHTML = .document.getElementById("cphContent_sectionCoreProperties").innerHTML
    End With
    Exit Function
catch:
    'Return the error text instead of raising it
    GetInnerHTML = Err.Number & " " & Err.Description
End Function

Another way to achieve the same result is an XMLHTTP request. Give it a go:
Sub Fetch_Data()
Dim S$, I&
With New XMLHTTP60
.Open "GET", "https://www.uid.admin.ch/Detail.aspx?uid_id=CHE-105.805.649", False
.send
S = .responseText
End With
With New HTMLDocument
.body.innerHTML = S
With .querySelectorAll("#cphContent_sectionCoreProperties label[id^='cphContent_ct']")
For I = 0 To .Length - 1
Cells(I + 1, 1) = .Item(I).innerText
Cells(I + 1, 2) = .Item(I).NextSibling.FirstChild.innerText
Next I
End With
End With
End Sub
References to add (Tools > References) before running the script above:
Microsoft HTML Object Library
Microsoft XML, V6.0

I don't think you need Regular expressions to find the content on the page. You can use the relative positions of the elements to find the content I believe you are after.
Code
Option Explicit
Public Sub GetContent()
Dim URL As String: URL = "https://www.uid.admin.ch/Detail.aspx?uid_id=CHE-105.805.649"
Dim IE As Object: Set IE = CreateObject("InternetExplorer.Application")
Dim Labels As Object
Dim Label As Variant
Dim Values As Variant: ReDim Values(0 To 1, 0 To 5000)
Dim i As Long
With IE
.Navigate URL
.Visible = False
'Load the page
Do Until IE.busy = False And IE.readystate = 4
DoEvents
Loop
'Find all labels in the table
Set Labels = IE.document.getElementByID("cphContent_pnlDetails").getElementsByTagName("label")
'Iterate the labels, then find the divs relative to these
For Each Label In Labels
Values(0, i) = Label.InnerText
Values(1, i) = Label.NextSibling.Children(0).InnerText
i = i + 1
Next
End With
'Dump the values to Excel
ReDim Preserve Values(0 To 1, 0 To i - 1)
ThisWorkbook.Sheets(1).Range("A1:B" & i) = WorksheetFunction.Transpose(Values)
'Close IE
IE.Quit
End Sub
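If you do want the regex loop the question asks about, a minimal sketch could look like the following. The function name ExtractDivContents is hypothetical, and the sketch assumes the target divs contain no nested <div> and that the style attribute appears in the HTML exactly as shown in the question. Internet Explorer may rewrite attributes when it serializes innerHTML, so the pattern may need adjusting against the real string.

```vba
' Sketch (hypothetical helper): collects the content of every
' <div style="padding:7px 12px"> ... </div> found in html.
' Late-bound VBScript.RegExp, so no extra reference is required.
Public Function ExtractDivContents(ByVal html As String) As Collection
    Dim re As Object
    Dim m As Object
    Dim results As New Collection

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True          ' find all matches, not just the first
    re.IgnoreCase = True      ' tolerate <DIV ...> as well as <div ...>
    ' [\s\S]*? = shortest run of any characters, line breaks included
    re.Pattern = "<div style=""padding:7px 12px"">([\s\S]*?)</div>"

    For Each m In re.Execute(html)
        results.Add m.SubMatches(0)   ' text between the opening and closing tag
    Next m

    Set ExtractDivContents = results
End Function
```

It could be called as ExtractDivContents(GetInnerHTML(url)) and the returned Collection iterated with For Each. Walking the DOM, as in the answers above, is usually more robust than matching HTML with a regex.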

Related

Get Full Range Text to String

I'm writing a script that looks through my outgoing emails and searches for frequent stylistic errors I make. It locates them using regex and then highlights them yellow. Code:
Public Sub highlightBadForm()
Dim oWordDoc As Object
Dim oMatches As Object
Dim oRange As Range
Dim strText As String
Dim lngFindFrom As Long
Dim varMtch As Variant
Set oWordDoc = Application.ActiveInspector.WordEditor
strText = LCase(oWordDoc.Range.Text)
lngFindFrom = InStr(strText, "from: ")
If lngFindFrom > 0 Then
strText = Left(strText, lngFindFrom - 1)
End If
Set oMatches = extractMatches(strText, getBadStrs)
If Not oMatches Is Nothing Then
For Each varMtch In oMatches
Set oRange = oWordDoc.Range(varMtch.firstindex, varMtch.firstindex + varMtch.Length)
oRange.HighlightColorIndex = wdYellow
Next varMtch
End If
Set oRange = Nothing
Set oWordDoc = Nothing
Set oMatches = Nothing
End Sub
extractMatches is a private function implementing VBA's RegEx engine. getBadStrs returns the regular expression containing the errors.
It all works unless I've embedded hyperlinks in my email. If so, oWordDoc.Range.Text returns only the anchor text of the links, not the links (and any other characters Word pads the hyperlinks with - I don't know what they might be). As a result, varMtch.firstindex is correct for strText but not oRange so the text it highlights is offset by several characters.
I tried to assemble the full oRange text by looping through the hyperlinks in oRange and adding the link text to the string assuming it would be included in oRange. Something like:
Dim lngEndLnk as Long
Set oRange = oWordDoc.Range
For Each varMtch In oRange.Hyperlinks
strText = strText & oWordDoc.Range(lngEndLnk, varMtch.Range.Start)
strText = strText & varMtch.TextToDisplay & varMtch.Name
lngEndLnk = varMtch.Range.End
Next varMtch
If lngEndLnk = 0 Then
strText = oRange.text
Else
strText = strText & oWordDoc.Range(lngEndLnk, oWordDoc.Range.End)
End If
That reduced the offset, but there still is one. Also, if I were to include a linked image in the email, the .Anchor property of varMtch fails so I'd have to come up with another workaround.
Is there a more straightforward way to get a String containing all the characters of the Range object so the regex indices line up?

You can access the hyperlink address using the hyperlinks collection of a document:
Private Sub CommandButton1_Click()
strtext = ActiveDocument.Range.Text
MsgBox (strtext)
For Each hLink In Documents(1).Hyperlinks
MsgBox (hLink.Address)
Next hLink
End Sub
This first displays all the text in a document, and then loops through each hyperlink displaying its URL.
You can then run your RegEx against those URLs.
For more information and examples, see the documentation for the Hyperlinks collection.

I ended up with a similar solution to @slightly snarky. I don't know that it's better, so I won't mark it as the solution. Happy for comments on pros and cons, in case there's a clear winner I'm just not seeing.
Personally, I like looping through the Characters collection and probably should use it in my code, but this works. I find using the position array to highlight matches much less intuitive than constructing a string from the range. For my purposes, padding the string with # in place of the zero-length characters in oWordDoc.Range works, but I know it won't work for everybody.
Public Sub highlightBadForm()
Dim oWordDoc As Object
Dim oMatches As Object
Dim oRange As Range
Dim strText As String
Dim lngFindFrom As Long, lngC As Long, lngPrevLen As Long
Dim varMtch As Variant
Set oWordDoc = Application.ActiveInspector.WordEditor
For lngC = 0 To oWordDoc.Range.End - 1
strText = strText & oWordDoc.Range(lngC, lngC + 1)
If Len(strText) = lngPrevLen Then
strText = strText & "#"
End If
lngPrevLen = lngPrevLen + 1
Next lngC
strText = LCase(strText)
lngFindFrom = InStr(strText, "from: ")
If lngFindFrom > 0 Then
strText = Left(strText, lngFindFrom - 1)
End If
Set oMatches = extractMatches(strText, getBadStrs)
If Not oMatches Is Nothing Then
For Each varMtch In oMatches
Set oRange = oWordDoc.Range(varMtch.FirstIndex, varMtch.FirstIndex + varMtch.Length)
oRange.HighlightColorIndex = wdYellow
Next varMtch
End If
Set oRange = Nothing
Set oWordDoc = Nothing
Set oMatches = Nothing
End Sub

The key to this seems to be that when you iterate through a Range looking at each "position" in the range, e.g. via something like
With ActiveDocument
For i = 0 To .Range.End - 1
Debug.Print i, AscW(.Range(i, i + 1).Text)
Next
End With
The Range does contain all the characters in the code of a field such as HYPERLINK field, and all the characters in its result (which might be displayed or it might be hidden text). But in some cases a Range may contain additional characters which are never displayed. For example, if you have a field code such as { SET x 123 } then the Range contains what are in effect the field braces and the code " SET X 123 ", but before the field end brace it also contains a marker followed by the value "123". But the SET field does not display its result.
That makes it difficult to construct a "find" string that's the same length as the Range.
But Range.Text is the same text as the concatenation of all the characters in Range.Characters, and each item in that collection is itself a Range with its own .Start and .End positions.
That lets us map a match index in the string back to a position in the document, as the following example shows.
This assumes you are working with the ActiveDocument in Word, and have some text, a HYPERLINK field (say), and possibly other fields, with the text "test1" in various places.
I haven't done much testing so it may still need tweaking.
Sub HighlightFinds()
Dim match As VBScript_RegExp_55.match
Dim matches As VBScript_RegExp_55.MatchCollection
Dim rng1 As Word.Range
Dim rng2 As Word.Range
Set rng1 = ActiveDocument.Content
Set rng2 = ActiveDocument.Content ' or rng1.Duplicate
' When you do this, rng1.Text returns the text of the field *codes* but
' not the field *results*, and so does rng1.Characters
'rng1.TextRetrievalMode.IncludeFieldCodes = True
' when you do this, it returns the *results* but not the *codes*
rng1.TextRetrievalMode.IncludeFieldCodes = False
' You could do both, one after the other, to try to get all the matches
' You might also need to set .TextRetrievalMode.IncludeHiddenText
With New VBScript_RegExp_55.RegExp
.Pattern = "test1"
.Global = True
Set matches = .Execute(rng1.Text)
End With
For Each match In matches
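' FirstIndex is 0-based, Characters is 1-based: the match covers characters FirstIndex + 1 to FirstIndex + Length
rng2.SetRange rng1.Characters(match.FirstIndex + 1).Start, rng1.Characters(match.FirstIndex + match.Length).End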
rng2.HighlightColorIndex = wdYellow
Next
Set matches = Nothing
Set rng2 = Nothing
Set rng1 = Nothing
End Sub

Using VBA regex on Array

My macro works fine, but I am trying to add some error handling so that when others use it and an error occurs, they can figure out what happened. The last problem I have is with Application.GetOpenFilename, which I use with MultiSelect = True to open multiple files. I use a regex to match the file name, and if the wrong file is chosen, an error message is displayed. If MultiSelect = False I get no errors, but when it is True I get a Type Mismatch error. I can only assume this is because with MultiSelect = True the result is an array, which the regex cannot handle. Is there a solution to this, or can anyone point me to a better way to handle the error? I have attached the VBA script as well.
Sub DataImport_Loop()
Dim nom As String
Dim wb As Excel.Workbook
Dim i, j, k, m, n, file As Variant
Dim strPattern As String: strPattern = "Strain End Point [0-9] - FEA Loop - Loading - (Timed)" 'File Pattern
Dim regex As Object
Set regex = CreateObject("VBScript.RegExp")
'Turns Screen Updating and Alert Displays off
Application.ScreenUpdating = False
Application.DisplayAlerts = False
nom = ActiveWorkbook.Name
'takes user straight into necessary folder
If CurDir() <> CurDir("J:") Then
ChDrive "J:"
ChDir "J:\FEA Material Data"
End If
'Number of specimens tested
For i = 1 To 5
'Allows user to select multiple files to open
file = Application.GetOpenFilename( _
FileFilter:="Text Files (*.csv), *.csv", _
MultiSelect:=True)
'If no file selected, stop data import and display error message
If Not IsArray(file) Then
MsgBox ("You only imported " & (i - 1) & " Specimens.")
Exit Sub
'Sets pattern to check if correct file
With regex
.Pattern = strPattern
End With
'Checks set pattern, displays error message if not correct file
If regex.Test(file) = False Then
MsgBox ("Select Loading Only")
Exit Sub
End If
Else
Counter = 1
While Counter <= UBound(file)
j = (2 * i) - 1
Workbooks.Open file(Counter)
Set wb = Workbooks("Strain End Point " & Counter & " - FEA Loop - Loading - (Timed).csv")
'End of column, needs + 3 to account for first 3 unused cells
k = Range("F4", Range("F4").End(xlDown)).Count + 3
'Loops through data, deletes negative values
For m = 4 To k
If Range("F" & m).value < 0 Or Range("F" & m).Offset(0, 1) < 0 Then
Range("F" & m).Delete
Range("F" & m).Offset(0, 1).Delete
'If cell is deleted, rechecks new value
m = m - 1
End If
Next m
Range("F4:G" & k).Copy
Workbooks(nom).Sheets(Counter + 1).Cells(4, j).PasteSpecial
wb.Close
'Opens next file
Counter = Counter + 1
Wend
End If
Next i
'Turns Screen Updating and Alert Displays back on
Application.DisplayAlerts = True
Application.ScreenUpdating = True
End Sub

When MultiSelect is True, file will always be a Variant array, even if only a single file is selected. Therefore you must iterate through each element of the array to check it against your mask.
With regard to your mask, I would suggest using the Like operator, as it seems simpler and will probably run faster. Note the # replacing the regex pattern [0-9], and the * wildcards, which are needed because GetOpenFilename returns full paths and Like compares the whole string. E.g.:
'Checks set pattern, displays error message if not correct file
Const strPattern As String = "*Strain End Point # - FEA Loop - Loading - (Timed)*" 'File Pattern
For I = LBound(file) To UBound(file)
If Not file(I) Like strPattern Then
MsgBox ("Select Loading Only")
Exit Sub
End If
Next I
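If you would rather keep the regular expression, the same element-by-element check applies. The sketch below is only an illustration under stated assumptions: the function name AllFilesMatch is hypothetical, and the file names are assumed to end in ".csv" as in the question's Workbooks(...) call. Note that the parentheses around Timed must be escaped; unescaped, they form a capture group instead of matching literal parentheses, so the original pattern would not match these file names.

```vba
' Sketch (hypothetical helper): True only if every selected path
' ends with the expected file name.
Public Function AllFilesMatch(ByVal files As Variant) As Boolean
    Dim re As Object
    Dim i As Long

    Set re = CreateObject("VBScript.RegExp")
    re.IgnoreCase = True
    ' \( and \) match literal parentheses, \. a literal dot,
    ' and $ anchors the match at the end of the path
    re.Pattern = "Strain End Point [0-9] - FEA Loop - Loading - \(Timed\)\.csv$"

    AllFilesMatch = True
    For i = LBound(files) To UBound(files)
        If Not re.Test(files(i)) Then
            AllFilesMatch = False   ' stop at the first wrong file
            Exit Function
        End If
    Next i
End Function
```

In the question's loop it would be called only after IsArray(file) has returned True, for example: If Not AllFilesMatch(file) Then MsgBox "Select Loading Only".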

Using regex in a libreoffice calc macro to extract text from parentheses in a cell

Using Libreoffice 3.5.7.2 on Ubuntu 12.04.
I have text in calc cells in the form of: (IBM) Ibm Corporation.
I am trying to use regex to extract the text between the ()'s using a basic macro. This is what I have tried so far.
Sub getMktValue()
Dim oDoc as Object
Dim oSheet as Object
Dim oCell as Object
oDoc = ThisComponent
oSheet = oDoc.Sheets.getByName("Income")
'regex test code'
oCell = oSheet.getCellByPosition(0, 1)
stk = oCell.String()
myRegex = oCell.createSearchDescriptor
myRegex.SearchRegularExpression = True
myRegex.SearchString = "\((.*)\)" '"[\([A-Z]\)]" "\(([^)]*)\)" "\(([^)]+)\)"'
found = oCell.FindFirst(myRegex)
MsgBox found.String
End Sub
The myRegex.SearchString line contains the various versions I have tried. The result is always the same. The entire contents of the cell are returned not just the text between the ()'s. Is there a way to extract just the text between the ()'s?
Thanks, Jim

The method you tried, .FindFirst, finds in an XSearchable (such as a spreadsheet or range) the first occurrence of the SearchString.
If you want to search within a string value, then you need a different service, com.sun.star.util.TextSearch.
Sub getMktValue()
Dim oDoc as Object
Dim oSheet as Object
Dim oCell as Object
oDoc = ThisComponent
oSheet = oDoc.Sheets.getByName("Income")
'regex test code'
oCell = oSheet.getCellByPosition(0, 1)
stk = oCell.getString()
oTextSearch = CreateUnoService("com.sun.star.util.TextSearch")
oOptions = CreateUnoStruct("com.sun.star.util.SearchOptions")
oOptions.algorithmType = com.sun.star.util.SearchAlgorithms.REGEXP
oOptions.searchString = "\((.*)\)"
oTextSearch.setOptions(oOptions)
oFound = oTextSearch.searchForward(stk, 0, Len(stk))
sFound = mid(stk, oFound.startOffset(0) + 1, oFound.endOffset(0) - oFound.startOffset(0))
MsgBox sFound
sFound = mid(stk, oFound.startOffset(1) + 1, oFound.endOffset(1) - oFound.startOffset(1))
MsgBox sFound
End Sub
Greetings
Axel

What is the RegExp Pattern to Extract Bullet Points Between Two Group Words using VBA in Word?

I can't seem to figure out the RegExp to extract the bullet points between two groups of words in a Word document.
For example:
Risk Assessment:
Test 1
Test 2
Test 3
Internal Audit
In this case I want to extract the bullet points between "Risk Assessment" and "Internal Audit", one bullet at a time, and assign each bullet to an Excel cell. As shown in the code below, I have pretty much everything done, except I can't figure out the correct Regex pattern. Any help would be great. Thanks in advance!
Sub PopulateExcelTable()
Dim fd As Office.FileDialog
Set fd = Application.FileDialog(msoFileDialogFilePicker)
With fd
.AllowMultiSelect = False
.Title = "Please select the file."
.Filters.Clear
.Filters.Add "Word 2007-2013", "*.docx"
If .Show = True Then
txtFileName = .SelectedItems(1)
End If
End With
Dim WordApp As Word.Application
Set WordApp = CreateObject("Word.Application")
Dim WordDoc As Word.Document
Set WordDoc = WordApp.Documents.Open(txtFileName)
Dim str As String: str = WordDoc.Content.Text ' Assign entire document content to string
Dim rex As New RegExp
rex.Pattern = "\b[^Risk Assessment\s].*[^Internal Audit\s]"
Dim i As long : i = 1
rex.Global = True
For Each mtch In rex.Execute(str)
Debug.Print mtch
Range("A" & i).Value = mtch
i = i + 1
Next mtch
WordDoc.Close
WordApp.Quit
End Sub

This is probably a long way around the problem but it works.
Steps I'm taking:
1. Find bullet list items using keywords before and after list in regexp.
2. (Group) regexp pattern so that you can extract everything in-between words.
3. Store listed items group into a string.
4. Split string by new line character into a new array.
5. Output each array item to excel.
6. Loop again since there may be more than one list in document.
Note: I don't see your code for a link to Excel workbook. I'll assume this part is working.
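Dim rex As New RegExp
'[\s\S]*? is lazy and spans paragraph marks, so each list is matched separately
rex.Pattern = "(\bRisk Assessment:?\s)([\s\S]*?)(Internal\sAudit\s)"
rex.Global = True
rex.MultiLine = True
rex.IgnoreCase = True
Dim lineArray() As String
Dim myMatches As Object
Set myMatches = rex.Execute(str)
For Each mtch In myMatches
    'Debug.Print mtch.SubMatches(1)
    'Word separates paragraphs with a carriage return (vbCr), not a line feed
    lineArray = Split(mtch.SubMatches(1), vbCr)
    For x = LBound(lineArray) To UBound(lineArray)
        'Debug.Print lineArray(x)
        'Skip the empty entry left by the trailing paragraph mark
        If Len(lineArray(x)) > 0 Then
            Range("A" & i).Value = lineArray(x)
            i = i + 1
        End If
    Next
Next mtch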
Results from the inner Debug.Print line on my test page:
Item 1
Item 2
Item 3

Removing unwanted characters VBA (excel)

I want to be able to copy raw data into column A, hit run on the macro and it should remove any unwanted characters both before and after the data that I want to keep resulting in a cell just containing the data that I want. I also want it to go through all cells that are in the column, bearing in mind some cells may be empty.
The data that I want to keep is in this format: somedata0000 or somedata000
Sometimes the cell will contain 'rubbish' both before and after the data that I want to keep i.e. rubbishsomedata0000 or somedata0000rubbish or rubbishsomedata0000rubbish.
And also, sometimes a single cell will contain:
rubbishsomedata0000rubbish
rubbishsomedata0000rubbish
rubbishsomedata0000rubbish
This will need to be changed to:
NEW CELL: somedata0000
NEW CELL: somedata0000
NEW CELL: somedata0000
The 'somedata' text will not change but the 0000 (which could be any 4 numbers) will sometimes be any 3 numbers.
Also there may be some rows in the column that have no useful data; these should be removed/deleted from the sheet.
Finally, some cells will contain the perfect somedata0000, these should stay the same.
Sub Test()
Dim c As Range
For Each c In Range("A2:A" & Range("A" & Rows.Count).End(xlUp).Row)
c = removeData(c.text)
Next
End Sub
Function removeData(ByVal txt As String) As String
Dim result As String
Dim allMatches As Object
Dim RE As Object
Set RE = CreateObject("vbscript.regexp")
RE.Pattern = "(somedata-\d{4}|\d{3})"
RE.Global = True
RE.IgnoreCase = True
Set allMatches = RE.Execute(text)
If allMatches.Count <> 0 Then
result = allMatches.Item(0).submatches.Item(0)
End If
ExtractSDI = result
End Function
This is the code I have so far. It goes through each cell, but when a cell matches, it removes the text I want to keep along with the text I want removed. Why?

There are several issues in your code:
As Gary said, your Function isn't returning a result
Your Regex.Pattern doesn't make sense
Your Sub doesn't attempt to handle multiple matches
Your Function doesn't even attempt to return multiple matches
Sub Test()
Dim rng As Range
Dim result As Variant
Dim i As Long
With ActiveSheet
Set rng = .Range(.Cells(2, 1), .Cells(.Rows.Count, 1).End(xlUp))
End With
For i = rng.Rows.Count To 1 Step -1
result = removeData(rng.Cells(i, 1))
If IsArray(result) Then
If UBound(result) = 1 Then
rng.Cells(i, 1) = result(1)
Else
rng.Cells(i, 1).Offset(1, 0).Resize(UBound(result) - 1, 1).Insert xlShiftDown
rng.Cells(i, 1).Resize(UBound(result), 1) = Application.Transpose(result)
End If
Else
rng.Cells(i, 1).ClearContents
End If
Next
End Sub
Function removeData(ByVal txt As String) As Variant
Dim result As Variant
Dim allMatches As Object
Dim RE As Object
Dim i As Long
Set RE = CreateObject("vbscript.regexp")
RE.Pattern = "(somedata\d{3,4})"
RE.Global = True
RE.IgnoreCase = True
Set allMatches = RE.Execute(txt)
If allMatches.Count > 0 Then
ReDim result(1 To allMatches.Count)
For i = 0 To allMatches.Count - 1
result(i + 1) = allMatches.Item(i).Value
Next
End If
removeData = result
End Function
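To see why the pattern in the question does not behave as intended, the two patterns can be compared on a sample string. This is only a sketch; the procedure name ComparePatterns and the sample text are made up for illustration. In the original pattern the alternation splits the whole group, so it means "somedata-" followed by four digits, or any three digits on their own.

```vba
' Sketch (hypothetical procedure): prints the first match of each pattern.
Public Sub ComparePatterns()
    Dim re As Object
    Const SAMPLE As String = "rubbishsomedata1234rubbish"

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True

    ' Question's pattern: "somedata-" + 4 digits, OR any 3 digits
    re.Pattern = "(somedata-\d{4}|\d{3})"
    Debug.Print re.Execute(SAMPLE)(0).Value   ' prints 123

    ' Corrected pattern: "somedata" followed by 3 or 4 digits
    re.Pattern = "(somedata\d{3,4})"
    Debug.Print re.Execute(SAMPLE)(0).Value   ' prints somedata1234
End Sub
```

Because the sample contains no hyphen, the first alternative never matches and only the bare three digits are found, which is why the text you wanted to keep disappeared.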
