Extract text from HTML with PowerShell - bad pattern - regex

I want to extract this text
Spectrum Mortis - Bit Meseri - The Incantation (2022)
Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)
from this html block
<span id='tid-span-369523'><a id="tid-link-369523" href="http://metalarea.org/forum/index.php?showtopic=369523" title="This topic was started: Sep 16 2022, 04:18:47">Spectrum Mortis - Bit Meseri - The Incantation (2022)</a></span>
<span id='tid-span-221568'><a id="tid-link-221568" href="http://metalarea.org/forum/index.php?showtopic=221568" title="This topic was started: Apr 11 2014, 14:31:18">Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)</a></span>
I'm trying to use this code, but nothing is written to output2.txt:
$html = Get-Content -Path 'C:\temp\html\metalarea2.html' -Raw
$pattern = '<span id="tid-span-\\d+"><a id="tid-link-\\d+" href=".+?" title=".+?">(.+?)</a></span>'
$matches = Select-String -InputObject $html -Pattern $pattern -AllMatches
$result = $matches | % { $_.Matches } | % { $_.Groups[1].Value }
$result | Out-File -FilePath "C:\temp\html\output2.txt"
I don't understand where the problem lies
EDIT: SOLUTIONS
$pattern = '<span id=\x27tid-span-\d+\x27><a id="tid-link-\d+" href=".+?" title=".+?">(.+?)</a></span>'
OR
$pattern = '<a id="tid-link-\d+".+?>(.+?)</a>'
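For anyone wondering why the first attempt wrote nothing, here is a minimal sketch that compares the original pattern with the shorter corrected one. It runs against one sample line copied from the HTML above rather than the real file, and the variable names ($broken, $fixed, $brokenCount, $titles) are only illustrative.

```powershell
# One line of the markup from the question; the here-string keeps both quote styles intact.
$html = @'
<span id='tid-span-369523'><a id="tid-link-369523" href="http://metalarea.org/forum/index.php?showtopic=369523" title="This topic was started: Sep 16 2022, 04:18:47">Spectrum Mortis - Bit Meseri - The Incantation (2022)</a></span>
'@

# Original pattern. In a single-quoted PowerShell string "\\d+" reaches the regex engine
# unchanged, where it means a literal backslash followed by the letter d. The pattern also
# expects id="..." while the markup uses id='...'.
$broken = '<span id="tid-span-\\d+"><a id="tid-link-\\d+" href=".+?" title=".+?">(.+?)</a></span>'

# Shorter corrected pattern from the edit above.
$fixed = '<a id="tid-link-\d+".+?>(.+?)</a>'

$brokenCount = [regex]::Matches($html, $broken).Count
$titles = [regex]::Matches($html, $fixed) | ForEach-Object { $_.Groups[1].Value }

"Matches with original pattern: $brokenCount"
$titles
```

Either problem on its own (the doubled backslash or the quote style around the span id) is enough to make the original pattern match nothing.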

It is generally a bad idea to peek and/or poke in structured text using regular expressions. Instead, it is better to use a proper (html) parser to manipulate your data.
To give you an example using the IHTMLDocument2 interface:
$Html = @'
<html>
<head>
<title>Title</title>
</head>
<body>
<span id="tid-span-369523"><a id="tid-link-369523" href="http://metalarea.org/forum/index.php?showtopic=369523" title="This topic was started: Sep 16 2022, 04:18:47">Spectrum Mortis - Bit Meseri - The Incantation (2022)</a></span>
<span id='tid-span-221568'><a id="tid-link-221568" href="http://metalarea.org/forum/index.php?showtopic=221568" title="This topic was started: Apr 11 2014, 14:31:18">Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)</a></span>
<div id="something">Text within div</div>
</body>
</html>
'@
function ParseHtml($String) {
    $Unicode = [System.Text.Encoding]::Unicode.GetBytes($String)
    $Html = New-Object -Com 'HTMLFile'
    if ($Html.PSObject.Methods.Name -Contains 'IHTMLDocument2_Write') {
        $Html.IHTMLDocument2_Write($Unicode)
    }
    else {
        $Html.write($Unicode)
    }
    $Html.Close()
    $Html
}
$Document = ParseHtml $Html
$Document.getElementsByTagName('a') |
    Where-Object { $_.id -Like 'tid-link-*' } |
    Foreach-Object { $_.innerText }
Spectrum Mortis - Bit Meseri - The Incantation (2022)
Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)

You can use the regular expression below to capture plain text between HTML tags:
(<[^>]*>)+(?<plaintext>[^<]+)<\/[^>]*>
You can refer to this example from regex101.com: Live sample
Here is a full script example:
$html = @"
<span id="tid-span-369523"><a id="tid-link-369523" href="http://metalarea.org/forum/index.php?showtopic=369523" title="This topic was started: Sep 16 2022, 04:18:47">Spectrum Mortis - Bit Meseri - The Incantation (2022)</a></span>
<span id='tid-span-221568'><a id="tid-link-221568" href="http://metalarea.org/forum/index.php?showtopic=221568" title="This topic was started: Apr 11 2014, 14:31:18">Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)</a></span>
<div id="something">Text within div</div>
"@
$pattern = '(<[^>]*>)+(?<plaintext>[^<]+)<\/[^>]*>'
$options = [System.Text.RegularExpressions.RegexOptions]::Multiline
# $found is used instead of $matches, which is an automatic variable that -match overwrites
$found = [regex]::Matches($html, $pattern, $options)
$results = $found | %{ $_.Groups["plaintext"].Value }
$results

Related

PowerShell filter out invalid AD users using -Filter {SamAccountName -eq $_} but with regex

I am trying to filter AD for user names based on computer names which contain the user name, like XXXXXX01BLOGGSJ (BLOGGSJ is the user name in this example).
In order to extract the user name, I use this method:
"XXXXXX01BLOGGSJ" | %{($_ -split '\d+')[-1]}
The output is BLOGGSJ
However, I need to filter many computer names like this, a small percentage of which have invalid usernames in the machine name like "XXXXXX01RUBBISH"
In order to stop the inevitable errors from appearing, I am trying to use the -Filter {SamAccountName -eq $_} method, which works like this:
"BLOGGSJ", "RUBBISH" | % {Get-ADUser -Server domain.com -Filter{SamAccountName -eq $_ }} | select Name
But not when I attempt to do this, which is what I want to do:
"XXXXXX01BLOGGSJ", "XXXXXX01BLOGGSJ" | % {Get-ADUser -Server domain.com -Filter{SamAccountName -eq "'($_ -split '\d+')[-1]'"}} | select Name
...or various permutations of that. So I think I am struggling with the syntax.
I know I can do this instead:
"XXXXXX01BLOGGSJ","XXXXXX01RUBBISH" | %{($_ -split '\d+')[-1]} | %{Get-ADUser -Server domain.com -Filter {SamAccountName -eq $_ }} | Select Name
but there is something else happening further down the pipe that requires me to do it in the way shown above.
Any help please.
Especially because you say something else is happening further down, I would suggest not trying to do it all in one line of code.
This should get you on your way:
"XXXXXX01BLOGGSJ","XXXXXX01RUBBISH" | ForEach-Object {
    $name = ($_ -split '\d+')[-1]
    $user = Get-ADUser -Server domain.com -Filter "SamAccountName -eq '$name'" -ErrorAction SilentlyContinue
    if ($user) {
        # a user with that SamAccountName was found
        [PsCustomObject]@{
            ComputerName   = $_
            SamAccountName = $user.SamAccountName
            UserName       = $user.Name
        }
    }
    else {
        # user not found
        [PsCustomObject]@{
            ComputerName   = $_
            SamAccountName = $name
            UserName       = "User Not found in AD"
        }
    }
}
Output:
ComputerName    SamAccountName UserName
------------    -------------- --------
XXXXXX01BLOGGSJ bloggsj        Joe Bloggs
XXXXXX01RUBBISH RUBBISH        User Not found in AD
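The part that makes this work is building the filter as a string, so the extracted name is expanded before Get-ADUser ever sees it. Here is a small sketch of just that step, with no AD module needed; $computers and $filters are illustrative names, not part of the answer above.

```powershell
$computers = 'XXXXXX01BLOGGSJ', 'XXXXXX01RUBBISH'

$filters = foreach ($computer in $computers) {
    # Take whatever follows the last run of digits in the computer name
    $name = ($computer -split '\d+')[-1]
    # Double quotes expand $name; the single quotes become part of the filter text
    "SamAccountName -eq '$name'"
}

$filters
```

Each resulting string is what would be handed to -Filter, for example SamAccountName -eq 'BLOGGSJ'.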

Extract data from .log file with Regex

I'm trying to extract data using Regex positive lookbehind. I have created a .ps1 file with the following content:
$input_path = ‘input.log’
$output_file = ‘Output.txt’
$regex = ‘(?<= "name": ")(.*)(?=",)|(?<= "fullname": ")(.*)(?=",)|(?<=Start identity token validation\r\n)(.*)(?=ids: Token validation success)|(?<= "ClientName": ")(.*)(?=",\r\n "ValidateLifetime": false,)’
select-string -Path $input_path -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } >$output_file
The input file looks like this:
08:15.27.47-922: T= 11 ids: Start end session request
08:15.27.47-922: T= 11 ids: Start end session request validation
08:15.27.47-922: T= 11 ids: Start identity token validation
08:15.27.47-922: T= 11 ids: Token validation success
{
"ClientId": "te_triouser",
"ClientName": "TE Trio User",
"ValidateLifetime": false,
"Claims": {
"iss": "http://sv-trio17.adm.linkoping.se:34000/core/",
"aud": "te_triouser",
"exp": "1552054900",
"nbf": "1552054600",
"nonce": "f1ae9044-25f9-4e7f-b39f-bd7bdcb9dc8d",
"iat": "1552054600",
"at_hash": "Wv_7nNe42gUP945FO4p0Wg",
"sid": "9870230d92cb741a8674313dd11ae325",
"sub": "23223",
"auth_time": "1551960154",
"idp": "tecs",
"name": "tele2",
"canLaunchAdmin": "1",
"isLockedToCustomerGroup": "0",
"customerGroupId": "1",
"fullname": "Tele2 Servicekonto Test",
"tokenIdentifier": "2Ljta5ZEovccNlab9QXb8MPXOqaBfR6eyKst/Dc4bF4=",
"tokenSequence": "bMKEXP9urPigRDUguJjvug==",
"tokenChecksum": "NINN0DDZpx7zTlxHqCb/8fLTrsyB131mWoA+7IFjGhAV303///kKRGQDuAE6irEYiCCesje2a4z47qvhEX22og==",
"idpsrv_lang": "sv-SE",
"CD_UserInfo": "23223 U2 C1",
"amr": "optional"
}
}
If I run the regex through http://regexstorm.net/tester I get the right matches. But when I run my script with PowerShell on my computer, I don't get the matches for the alternatives that have \r\n in the regex. I only get the matches from the first two alternatives.
I agree with @AdminOfThings to use Get-Content with the -Raw parameter.
Also, don't use typographic quotes in scripts.
If the number of leading spaces isn't really fixed, replace them with one space and a + or * quantifier.
Make the \r optional => \r?.
A minimal, complete, verifiable example should also include your expected output.
EDIT: changed the regex to be more readable.
The following script
## Q:\Test\2019\03\22\SO_55298614.ps1
$input_path = 'input.log'
$output_file = 'Output.txt'
$regexes = ('(?<= *"(full)?name": ")(.*)(?=",)',
'(?<=Start identity token validation\r?\n)(.*)(?=ids: Token validation success)',
'(?<= *"ClientName": ")(.*)(?=",\r?\n *"ValidateLifetime": false,)')
$regex = [RegEx]($regexes -join'|')
Get-Content $input_path -Raw | Select-String -pattern $regex -AllMatches |
ForEach-Object { $_.Matches.Value }
yields this sample output:
> Q:\Test\2019\03\22\SO_55298614.ps1
08:15.27.47-922: T= 11
TE Trio User
tele2
Tele2 Servicekonto Test
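To illustrate why -Raw matters here: Select-String -Path tests the file one line at a time, so a pattern that has to see a line break can never match. A minimal sketch follows, using an in-memory string with two lines from the log instead of the real file; $raw, $perLine and $whole are illustrative names.

```powershell
# Two consecutive log lines joined with CRLF, as they would appear in the file
$raw = "08:15.27.47-922: T= 11 ids: Start identity token validation`r`n08:15.27.47-922: T= 11 ids: Token validation success"
$pattern = '(?<=Start identity token validation\r?\n)(.*)(?=ids: Token validation success)'

# Line by line (how Select-String -Path sees the file): no single line contains a newline
$perLine = @($raw -split '\r?\n' | Select-String -Pattern $pattern -AllMatches)

# As one string (how Get-Content -Raw delivers it): the lookbehind can see the line break
$whole = ($raw | Select-String -Pattern $pattern -AllMatches).Matches.Value

"Line by line: $($perLine.Count) match(es)"
"Whole string: '$whole'"
```

Note that the captured value keeps the trailing space before "ids:", so a .Trim() may be wanted before writing it out.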

Parse email body paragraph in PowerShell

I am creating a script to parse an Outlook email body so that I can get, say, an ID number, date and name that follow the strings ID: xxxxxx Date: xxxxxx Name:xxxxx. I was looking around and could not find anything that allows me to take the string after a match.
What I have managed so far is to query Outlook for the emails that were sent by specific users.
Add-Type -Assembly "Microsoft.Office.Interop.Outlook"
$Outlook = New-Object -ComObject Outlook.Application
$namespace = $Outlook.GetNameSpace("MAPI")
$inbox = $namespace.GetDefaultFolder([Microsoft.Office.Interop.Outlook.OlDefaultFolders]::olFolderInbox)
foreach ($items in $inbox.items){if (($items.to -like "*email*") -or ($items.cc -like "*email.add*")){$FindID = $items.body}}
Now that I have the email body in the for loop I am wondering how I can parse the content?
In between the paragraphs will be a text something like this
ID: xxxxxxxx
Name: xxxxxxxxx
Date Of Birth : xxxxxxxx
I did some testing with the code below to see if I can add it to the for loop, but it seems like I cannot break up the paragraphs.
$FindID| ForEach-Object {if (($_ -match 'ID:') -and ($_ -match ' ')){$testID = ($_ -split 'ID: ')[1]}}
I get the following result, from which I cannot get just the ID.
Sample result when I output $testID:
xxxxxxxx
Name: xxxxxxxxx
Date Of Birth : xxxxxxxx
Regards,
xxxxx xxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
How do I get just the results I want? I am just struggling at that portion.
You'll need a regular expression with (named) capture groups to grep the values. See the example on regex101.com.
Provided $item.body is not HTML and is a single string, this could work:
## Q:\Test\2018\07\24\SO_51492907.ps1
Add-Type -Assembly "Microsoft.Office.Interop.Outlook"
$Outlook = New-Object -ComObject Outlook.Application
$namespace = $Outlook.GetNameSpace("MAPI")
$inbox = $namespace.GetDefaultFolder(
[Microsoft.Office.Interop.Outlook.OlDefaultFolders]::olFolderInbox)
## see $RE on https://regex101.com/r/1B2rD1/1
$RE = [RegEx]'(?sm)ID:\s+(?<ID>.*?)$.*?Name:\s+(?<Name>.*?)$.*?Date Of Birth\s*:\s*(?<DOB>.*?)$.*'
$Data = ForEach ($item in $inbox.items){
    if (($item.to -like "*email*") -or
        ($item.cc -like "*email.add*")){
        if (($item.body -match $RE )){
            [PSCustomObject]@{
                ID   = $Matches.ID
                Name = $Matches.Name
                DOB  = $Matches.DOB
            }
        }
    }
}
$Data
$Data | Export-Csv '.\data.csv' -NoTypeInformation
Sample output with the above anonymized mail:
> Q:\Test\2018\07\24\SO_51492907.ps1
ID        Name       DOB
--        ----       ---
xxxxxx... xxxxxxx... xxxxxx...
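The same regular expression can be tried without Outlook. The sketch below runs it against a made-up body string (the ID, name and date are invented sample values, and the CRLF line endings are an assumption about how the body text arrives). With (?m), $ matches just before the line feed, so on CRLF text each captured value ends in a carriage return; Trim() removes it.

```powershell
# Hypothetical mail body with CRLF line endings; all values are made up for illustration
$body = "Hello,`r`n`r`nID: 12345678`r`nName: Jane Doe`r`nDate Of Birth : 01/02/1990`r`n`r`nRegards,`r`nSomeone"

$RE = [regex]'(?sm)ID:\s+(?<ID>.*?)$.*?Name:\s+(?<Name>.*?)$.*?Date Of Birth\s*:\s*(?<DOB>.*?)$.*'

$parsed = if ($body -match $RE) {
    [PSCustomObject]@{
        # Trim() drops the trailing carriage return left in front of each line feed
        ID   = $Matches.ID.Trim()
        Name = $Matches.Name.Trim()
        DOB  = $Matches.DOB.Trim()
    }
}
$parsed
```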
I don't have Outlook available at the moment, but I think this will work:
Add-Type -Assembly "Microsoft.Office.Interop.Outlook"
$Outlook = New-Object -ComObject Outlook.Application
$namespace = $Outlook.GetNameSpace("MAPI")
$inbox = $namespace.GetDefaultFolder([Microsoft.Office.Interop.Outlook.OlDefaultFolders]::olFolderInbox)
$inbox.items | Where-Object { $_.To -like "*email*" -or $_.CC -like "*email.add*" } | ForEach-Object {
    $body = $_.body
    if ($body -match '(?s)ID\s*:\s*(?<id>.+)Name\s*:\s*(?<name>.+)Date Of Birth\s*:\s*(?<dob>\w+)') {
        New-Object -TypeName PSObject -Property @{
            'Subject'       = $_.Subject
            'Date Received' = ([datetime]$_.ReceivedTime).ToString()
            'ID'            = $matches['id']
            'Name'          = $matches['name']
            'Date of Birth' = $matches['dob']
        }
    }
}

Access next webpage after clicking

Requirement: After clicking a link on the web page named in $ie.Navigate below, I need to access the HTML / outerHTML source of the web page which opens next.
Ex: When I open https://www.healthkartplus.com/search/all?name=Sporanox (by setting $control = Sporanox), the code below simply clicks on the first matching link. After the link is clicked, I need to access the HTML of the resulting page.
Update: I referred to another SO question and learned that we can search for the appropriate window. The code seems to work for some scenarios but not for all. For $ie2 I have a problem accessing the Document property.
function getStringMatch
{
# Loop through all 2 digit combinations in the $path directory
foreach ($control In $controls)
{
$ie = New-Object -COMObject InternetExplorer.Application
$ie.visible = $true
$site = $ie.Navigate("https://www.healthkartplus.com/search/all?name=$control")
$ie.ReadyState
while ($ie.Busy -and $ie.ReadyState -ne 4){ sleep -Milliseconds 100 }
$link = $null
$link = $ie.Document.get_links() | where-object {$_.innerText -eq "$control"}
$link.click()
while ($ie.Busy -and $ie.ReadyState -ne 4){ sleep -Milliseconds 100 }
$ie2 = (New-Object -COM 'Shell.Application').Windows() | ? {
$_.Name -eq 'Windows Internet Explorer' -and $_.LocationName -match "^$control"
}
# NEED outerHTML of new page. CURRENTLY it is working for some.
$ie.Document.body.outerHTML > d:\med$control.txt
}
}
$controls = "Sporanox"
getStringMatch
I think the issue is when you look for the links in the first page.
The link innerText is not equal to $control, it contains $control i.e. innerText is "Sporanox (100mg)".
The following might help:
$link = $ie.Document.get_links() | where-object {if ($_.innerText){$_.innerText.contains($control)}}
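The difference between the two filters can be seen without Internet Explorer. This is a small sketch over a made-up list of link texts (only "Sporanox (100mg)" comes from the page described above; the other entries, including the $null one standing in for a link without text, are assumptions for illustration).

```powershell
$control = 'Sporanox'
# Hypothetical innerText values; $null represents a link with no text
$linkTexts = 'Home', $null, 'Sporanox (100mg)', 'Contact'

# Exact comparison, as in the question: finds nothing
$exact = @($linkTexts | Where-Object { $_ -eq $control })

# Null check plus Contains(), as suggested above: finds the link
$partial = @($linkTexts | Where-Object { $_ -and $_.Contains($control) })

"Exact matches:   $($exact.Count)"
"Partial matches: $($partial -join ', ')"
```

Keep in mind that Contains() is case-sensitive, and that more than one link may contain the text, in which case $link would be a collection.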
EDIT
Here is the complete code I'm using:
function getStringMatch
{
# Loop through all 2 digit combinations in the $path directory
foreach ($control In $controls)
{
$ie = New-Object -COMObject InternetExplorer.Application
$ie.visible = $true
$site = $ie.Navigate("https://www.healthkartplus.com/search/all?name=$control")
$ie.ReadyState
while ($ie.Busy -and $ie.ReadyState -ne 4){ sleep -Milliseconds 100 }
$link = $null
$link = $ie.Document.get_links() | where-object {if ($_.innerText){$_.innerText.contains($control)}}
$link.click()
while ($ie.Busy)
{
sleep -Milliseconds 100
}
# NEED outerHTML of new page. CURRENTLY it is working for some.
$ie.Document.body.outerHTML > d:\med$control.txt
}
}
$controls = "Sporanox"
getStringMatch

netsh result to a PowerShell object

I am trying to work with netsh from PowerShell. I want to get the result of this command as an object, but netsh returns a string:
netsh wlan show hostednetwork | Get-Member
TypeName: System.String
...
My script must work on systems with different localizations, so I can't use -match to parse the string into an object directly.
How can I solve this?
$netshResult = Invoke-Command -Computername localhost {netsh int tcp show global}
$result = @{}
$netshObject = New-Object psobject -Property @{
    ReceiveSideScalingState = $Null
    ChimneyOffloadState     = $Null
    NetDMAState             = $Null
}
$netshResult = $netshResult | Select-String : #break into chunks if colon only
$i = 0
while($i -lt $netshResult.Length){
$line = $netshResult[$i]
$line = $line -split(":")
$line[0] = $line[0].trim()
$line[1] = $line[1].trim()
$result.$($line[0]) = $($line[1])
$i++
}
$netshObject.ReceiveSideScalingState = $result.'Receive-Side Scaling State'
$netshObject.ChimneyOffloadState = $result.'Chimney Offload State'
$netshObject.NetDMAState = $result.'NetDMA State'
You have a few alternatives, none of which are nice.
1) Read the netsh output into a string[] and use a custom record parser to create your own object. That is, look at the output on different locales and find out if, say, Hosted network settings is always the first header followed by a bunch of - characters. If that's the case, assume that the next element in the array is Mode and so on. This is very error prone, but usually MS command line tools only translate messages, not their order.
2) Look for .Net API for the same information. There is System.Net.NetworkInformation which contains a bunch of connection things. It's a start, though I am not sure if it has info you need.
3) Failing the previous options, use P/Invoke to call native Win32 API. It's a lot of work, so look for pre-existing wrapper libraries before rolling your own.
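As a starting point for option 2, here is a minimal sketch that lists network interfaces through System.Net.NetworkInformation. The property values are enum members or adapter names rather than translated message text. As far as I know this API does not expose the hosted network settings themselves, so treat it as an illustration of the approach, not a replacement for the netsh command; $interfaces is an illustrative name.

```powershell
$interfaces = [System.Net.NetworkInformation.NetworkInterface]::GetAllNetworkInterfaces() |
    ForEach-Object {
        [PSCustomObject]@{
            Name   = $_.Name
            Type   = $_.NetworkInterfaceType   # enum value, e.g. Wireless80211
            Status = $_.OperationalStatus      # enum value, e.g. Up or Down
        }
    }

$interfaces
```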
I recently wrote a cmdlet to parse arbitrary, multi-line text using regular expressions, called ConvertFrom-Text. (Not a great name, if you ask me, but it conforms to the PowerShell naming rules; suggestions are welcome!) So assuming you have that cmdlet, here is one possible solution to your question. (Caveat emptor! The regular expression given was derived from a very small sample of netsh output, so may need some tuning.)
$regex = [regex] '(?ms)(?:^\s*$\s*)?^(?<section>.*?)\s*-+\s*(?<data>.*?)\s*^\s*$'
$result = netsh wlan show hostednetwork | Out-String |
    ConvertFrom-Text -pattern $regex -multiline
$result | % {
    $dataObj = [PsCustomObject]@{}
    $_.Data -split "`r`n" | % {
        $element = $_ -split '\s*:\s*'
        Add-Member -InputObject $dataObj -MemberType NoteProperty -Name $element[0].Trim() -Value $element[1].Trim()
    }
    $_.Data = $dataObj # Replace data text with data object
}
$result
On my test system, netsh wlan show hostednetwork returns this:
Hosted network settings
-----------------------
Mode : Allowed
Settings : <Not configured>
Hosted network status
---------------------
Status : Not available
And the output of the $result variable in the code above yields this:
section                 data
-------                 ----
Hosted network settings @{Mode=Allowed; Settings=<Not configured>}
Hosted network status   @{Status=Not available}
So $result is an array of objects with section and data properties, and the latter is an object with properties defined by the output of the netsh command.
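One detail worth watching when turning "key : value" lines into properties: a plain split on the colon also cuts values that contain colons themselves. The sketch below limits the split to two parts. The first two lines come from the sample output above; the BSSID line is a made-up example of such a value, and $lines, $props and $obj are illustrative names.

```powershell
# Mode and Settings are from the sample output; the BSSID line is hypothetical
$lines = 'Mode : Allowed', 'Settings : <Not configured>', 'BSSID : 00:11:22:33:44:55'

$props = [ordered]@{}
foreach ($line in $lines) {
    # The ", 2" limits the split to key and value, so colons inside the value survive
    $key, $value = $line -split '\s*:\s*', 2
    $props[$key.Trim()] = $value.Trim()
}

$obj = [PSCustomObject]$props
$obj
```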
Of course, the above does not get you very far without the ConvertFrom-Text cmdlet. So here is the implementation. (I have copious documentation and examples for it, which will be publicly available once I eventually add it to my open-source PowerShell library.)
filter ConvertFrom-Text
{
[CmdletBinding()]
Param (
[Parameter(Mandatory=$true,Position=0, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
[string[]]$InputObject,
[Parameter(Mandatory=$true,Position=1)]
[regex]$Pattern,
[switch]$RequireAll,
[switch]$Multiline
)
if ($Multiline) {
$dataString = $InputObject -join "`n"
IterateByMatch $dataString $Pattern
}
else {
IterateByLine $InputObject $Pattern
}
}
function IterateByLine([string[]]$data, [regex]$regex)
{
$data | ForEach-Object {
if ($PSItem -match $regex)
{
New-Object PSObject -Property (GetRegexNamedGroups $matches)
}
elseif ($RequireAll) {
throw "invalid line: $_"
}
}
}
function IterateByMatch([string[]]$data, [regex]$regex)
{
$regex.matches($data) | Foreach-Object {
$match = $_
$obj = new-object object
$regex.GetGroupNames() |
Where-Object {$_ -notmatch '^\d+$'} |
Foreach-Object {
Add-Member -InputObject $obj NoteProperty `
$_ $match.groups[$regex.GroupNumberFromName($_)].value
}
$obj
}
}
# Named GetRegexNamedGroups to match the call in IterateByLine
function GetRegexNamedGroups($hash)
{
    $newHash = @{};
    $hash.keys | ? { $_ -notmatch '^\d+$' } | % { $newHash[$_] = $hash[$_] }
    $newHash
}