Search HTML Data Retrieved from Invoke-WebRequest with Regex

I am trying to scrape data from https://www.reuters.com/finance/stocks/lookup?searchType=any&comSortBy=marketcap&sortBy=&dateRange=&search=Accor.
The end goal is to pull the table down that contains the Company, Symbol and Exchange.
I have successfully retrieved the HTML, but I can't extract the data I need from it.
I've tested the pattern in some online regex 'helpers', where it selects exactly the data I need, but it doesn't work when I run it in PowerShell.
$web = Invoke-WebRequest -uri 'https://www.reuters.com/finance/stocks/lookup?searchType=any&comSortBy=marketcap&sortBy=&dateRange=&search=Accor' -UseBasicParsing
$str = ($web.Content).ToString()
[regex]$regex = '<table[\s\S]*?</table>'
$str | Select-String -Pattern $regex -AllMatches
$str > raw.txt; Select-String -Pattern $regex -Path ./raw.txt -AllMatches
I expect it to return the whole <table> element, but the piped command returns the full string and the -Path command returns nothing.
I've also tried doing this with an IE COM object.

Rubber ducky effect.
As soon as I asked I figured it out...
$url = 'https://www.reuters.com/finance/stocks/lookup?searchType=any&comSortBy=marketcap&sortBy=&dateRange=&search=Accor'
$content = (New-Object System.Net.WebClient).DownloadString($url)
$content -match '<table[\s\S]*?</table>'
$matches
Name Value
---- -----
0 <table width="100%" cellspacing="0" cellpadding="1" class="search-table-data">...
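For anyone puzzled by the original Select-String behaviour, here is a minimal sketch. The $html sample and its "Accor SA" row are made up and only stand in for the downloaded page. Select-String emits MatchInfo objects whose default display is the whole input line, so a single piped string is shown in full; the matched text itself sits in the Matches property. With -Path the file is read line by line, so a pattern that has to span several lines finds nothing.

```powershell
# Hypothetical sample standing in for $web.Content (not the real Reuters page).
$html = "<html><body>`n<table class=`"search-table-data`">`n<tr><td>Accor SA</td></tr>`n</table>`n<p>footer</p></body></html>"

# Pipe the string as ONE object, then unwrap MatchInfo.Matches to get the matched text.
$tables = $html |
    Select-String -Pattern '<table[\s\S]*?</table>' -AllMatches |
    ForEach-Object { $_.Matches } |
    ForEach-Object { $_.Value }

$tables
```

The -match operator in the answer above avoids this unwrapping because it fills $matches directly, at the cost of returning only the first match.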

Related

RegEx not matching when using Select String

I've verified that my regex is correct with this code:
#this is the string where I'm trying to extract everything within the []
$text = "MS14-012[2925418],MS14-029[2953522;2961851]"
$text -match "\[(.*?)\]"
$matches[1]
Output:
True
2925418
I'd like to use Select-String to get my result, like this for example:
$result = $text| Select-String -Pattern $regex
Output:
MS14-012[2925418],MS14-029[2953522;2961851]
What else I've tried:
$result = Select-String -Pattern $regex -InputObject $text
$result = Select-String -Pattern ([regex]::Escape("\[(.*?)\]")) -InputObject $text
And some more variations as well as different kinds of " and ' around the regex and so on. I'm really out of ideas...
Can anyone please tell me why the regex is not matching when I'm using Select-String?
After piping the output to Get-Member I noticed that Select-String returns a MatchInfo object and that I needed to access the MatchInfo.Matches property to get the result. Thanks to Mathias R. Jessen for giving me the hint! ;)
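A short sketch of what that looks like, using the sample string from the question. The -AllMatches switch and the variable name $ids are additions for illustration, not part of the original post:

```powershell
$text = "MS14-012[2925418],MS14-029[2953522;2961851]"

# Select-String returns a MatchInfo object; its Matches property holds the regex matches.
$result = $text | Select-String -Pattern '\[(.*?)\]' -AllMatches

# Group 1 of each match is the text between the brackets.
$ids = $result.Matches | ForEach-Object { $_.Groups[1].Value }
$ids
# 2925418
# 2953522;2961851
```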

How to Write Powershell Script for Removing Specific Tags in c# Project Files

I'm editing a Powershell script written by Owen Johnson on GitHub for migrating MSBuild-Integrated solutions to use Automatic Package Restore with Nuget. Here is the original migration script:
########################################
# Regex Patterns for Really Bad Things!
$listOfBadStuff = @(
#sln regex
"\s*(\.nuget\\NuGet\.(exe|targets)) = \1",
#*proj regexes
"\s*<Import Project=""\$\(SolutionDir\)\\\.nuget\\NuGet\.targets"".*?/>",
"\s*<Target Name=""EnsureNuGetPackageBuildImports"" BeforeTargets=""PrepareForBuild"">(.|\n)*?</Target>"
"\s*<RestorePackages>\w*</RestorePackages>"
)
#######################
# Delete NuGet.targets
ls -Recurse -include 'NuGet.exe','NuGet.targets' |
foreach {
remove-item $_ -recurse -force
write-host deleted $_
}
#########################################################################################
# Fix Project and Solution Files to reverse damage done by "Enable NuGet Package Restore
ls -Recurse -include *.csproj, *.sln, *.fsproj, *.vbproj, *.wixproj, *.vcxproj |
foreach {
sp $_ IsReadOnly $false
$content = cat $_.FullName | Out-String
$origContent = $content
foreach($badStuff in $listOfBadStuff){
$content = $content -replace $badStuff, ""
}
if ($origContent -ne $content)
{
$content | out-file -encoding "UTF8" $_.FullName
write-host messed with $_.Name
}
}
I also want to remove <Target> tags and their contents where Name="EnsureBclBuildImported". I'm not very experienced with regular expressions, and my initial attempts to get this to work have unfortunately failed. I tried changing the regex for the target tag to "\s*<Target Name=""(EnsureNuGetPackageBuildImports|EnsureBclBuildImported)""/.+?(?=>)/(.|\n)*?</Target>"
I also tried making a new regular expression like this: "\s*<Target Name=""EnsureBclBuildImported"" ^[^\>]*(.|\n)*?</Target>"
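One possible direction, offered as a sketch rather than a tested fix for the real project files: both attempts contain characters the regex engine treats literally or as anchors (the / delimiters, and ^ in the middle of the pattern). Assuming the opening tag has no > inside its attribute values, [^>]*> can skip whatever attributes follow the name. The $sample project fragment below is hypothetical, and the pattern is written in a single-quoted string, so the quotes are not doubled as they are in the script above.

```powershell
# Matches a <Target> with either name, any further attributes, and its whole body.
$pattern = '\s*<Target Name="(EnsureNuGetPackageBuildImports|EnsureBclBuildImported)"[^>]*>(.|\n)*?</Target>'

# Hypothetical project fragment, for illustration only.
$sample = @'
<Project>
  <Target Name="EnsureBclBuildImported" BeforeTargets="BeforeBuild">
    <Error Text="Bcl.Build targets missing." />
  </Target>
  <PropertyGroup />
</Project>
'@

$cleaned = $sample -replace $pattern, ''
$cleaned
```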

Search multiple words using regular expression in PowerShell

I am new to PowerShell and would highly appreciate any help with the following. I have a PowerShell script, but I can't get it to extract all the data fields from the text file.
I have a file 1.txt as shown below.
I am trying to extract the values of "pid" and "ctl00_lblOurPrice" from the file in the table format below so that I can open it in Excel. Column headings are not important:
pid ctl00_lblOurPrice
0070362408 $6.70
008854787666 $50.70
Currently I am only able to get the pid values, as below. I would also like to get the price for each pid.
0070362408
008854787666
c:\scan\1.txt:
This is sentence 1.. This is sentence 1.1... This is sentence A1...
fghfdkgjdfhgfkjghfdkghfdgh gifdgjkfdghdfjghfdg
gkjfdhgfdhgfdgh
ghfghfjgh
...
href='http://example.com/viewdetails.aspx?pid=0070362408'>
This is sentence B1.. This is sentence B2... This is sentence B3...
GFGFGHHGH
HHGHGFHG
<p class="price" style="display:inline;">
ctl00_lblOurPrice=$6.70
This is sentence 1.. This is sentence 1.1... This is sentence A1...
fghfdkgjdfhgfkjghfdkghfdgh gifdgjkfdghdfjghfdg
gkjfdhgfdhgfdgh
ghfghfjgh
...
href='http://example.com/viewdetails.aspx?pid=008854787666'>
This is sentence B1.. This is sentence B2... This is sentence B3...
6GBNGH;L
887656HGFHG
<p class="price" style="display:inline;">
ctl00_lblOurPrice=$50.70
...
...
Current powershell script:
$files=Get-ChildItem c:\scan -recurse
$output_file = 'c:\output\outdata.txt'
foreach ($file in $files) {
$input_path = $file
$regex = 'num=\d{1,13}'
select-string -Path $input_path -Pattern $regex -AllMatches | % { $_.Matches } | % {
($_.Value) -replace "num=","" } | Out-File $output_file -Append }
Thanks in advance for your help
I'm going to assume that you either mean pid=\d{1,13} in your code, or that your sample text should have read num= instead of pid=. We will go with the assumption that it is in fact supposed to be pid.
In that case we will turn the entire file into one long string with -Join "", and then split it on "href" to create records for each site to parse against. Then we match for pid= and ending when it comes across a non-numeric character, and then we look for a dollar amount (a $ followed by numbers, followed by a period, and then two more numbers).
When we have a pair of PID/Price matches we can create an object with two properties, PID and Price, and output that. For this I will assign it to an array, to be used later. If you do not have PSv3 or higher you will have to change [PSCustomObject][ordered] into New-Object PSObject -Property but that loses the order of properties, so I like the former better and use it in my example here.
$files=Get-ChildItem C:\scan -recurse
$output_file = 'c:\output\outdata.csv'
$Results = @()
foreach ($file in $files) {
$Results += ((gc $File) -join "") -split "href" |?{$_ -match "pid=(\d+?)[^\d].*?(\$\d*?\.\d{2})"}|%{[PSCustomObject][ordered]@{"PID"=$Matches[1];"Price"=$Matches[2]}}
}
}
$Results | Select PID,Price | Export-Csv $output_file -NoTypeInformation
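To see the join, split and match steps in isolation, here is a small self-contained sketch that runs on in-memory lines instead of files. The $lines array is a shortened, hypothetical version of the sample file, and the pattern is a slightly simplified variant of the one above:

```powershell
# Shortened stand-in for the contents of 1.txt.
$lines = @(
    "href='http://example.com/viewdetails.aspx?pid=0070362408'>",
    'GFGFGHHGH',
    'ctl00_lblOurPrice=$6.70',
    "href='http://example.com/viewdetails.aspx?pid=008854787666'>",
    'ctl00_lblOurPrice=$50.70'
)

# Join into one string, split into one record per href, keep records with a pid and a price.
$records = ($lines -join '') -split 'href' |
    Where-Object { $_ -match 'pid=(\d+)\D.*?(\$\d+\.\d{2})' } |
    ForEach-Object { [PSCustomObject]@{ 'PID' = $Matches[1]; 'Price' = $Matches[2] } }

$records
```

This yields two objects: PID 0070362408 with price $6.70, and PID 008854787666 with price $50.70.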

Multiline Regex in PowerShell

I have a PowerShell script whose main purpose is to search through the HTML files in a folder, find specific HTML markup, and replace it with whatever I specify.
Three of my four find-and-replace operations work perfectly. The one I am having trouble with involves a regular expression.
This is the markup that I am trying to make my regex find and replace:
<a href="programsactivities_skating.html"><br />
</a>
Here is the regex I have so far, along with the function I am using it in:
automate -school "C:\Users\$env:username\Desktop\schools\$question" -query '(?mis)(?!exclude1|exclude2|exclude3)(<a[^>]*?>(\s| |<br\s?/?>)*</a>)' -replace ''
And here is the automate function:
function automate($school, $query, $replace) {
$processFiles = Get-ChildItem -Exclude *.bak -Include "*.html", "*.HTML", "*.htm", "*.HTM" -Recurse -Path $school
foreach ($file in $processFiles) {
$text = Get-Content $file
$text = $text -replace $query, $replace
$text | Out-File $file -Force -Encoding utf8
}
}
I have been trying to figure this out for about 2 days now and just can't get it to work. I have determined that the problem is that my regex needs to account for multiple lines, and that's the part I'm having trouble with.
Any help anyone can provide is greatly appreciated.
Thanks in advance.
Get-Content produces an array of strings, where each string contains a single line from your input file, so you won't be able to match text passages spanning more than one line. You need to merge the array into a single string if you want to be able to match more than one line:
$text = Get-Content $file | Out-String
or
[String]$text = Get-Content $file
or
$text = [IO.File]::ReadAllText($file)
Note that the 1st and 2nd method don't preserve line breaks from the input file. Method 2 simply mangles all line breaks, as Keith pointed out in the comments, and method 1 puts <CR><LF> at the end of each line when joining the array. The latter may be an issue when dealing with Linux/Unix or Mac files.
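The difference between matching against an array of lines and against a single string can be sketched without touching any files. The $lines array below is a hypothetical stand-in for what Get-Content returns, and the pattern is a simplified version of the one in the question. On PowerShell 3.0 and later, Get-Content -Raw is another way to read a file as one string.

```powershell
# What Get-Content hands back: one string per line.
$lines = @('<a href="programsactivities_skating.html"><br />', '</a>', '<p>keep</p>')
$pattern = '<a[^>]*?>(\s|<br\s?/?>)*</a>'

# On an array, -replace runs once per element, so the two-line anchor never matches.
$perLine = $lines -replace $pattern, ''

# Joined into a single string, the anchor spans the line break and is removed.
$joined = ($lines -join "`r`n") -replace $pattern, ''

$perLine
$joined
```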
I don't get what it is you're trying to do with those Exclude elements, but I find multi-line regex is usually easier to construct in a here-string:
$text = @'
<a href="programsactivities_skating.html"><br />
</a>
'@
$regex = @'
(?mis)<a href="programsactivities_skating.html"><br />
\s+?</a>
'@
$text -match $regex
True
Get-Content will return an array of strings, you want to concatenate the strings in question to create one:
function automate($school, $query, $replace) {
$processFiles = Get-ChildItem -Exclude *.bak -Include "*.html", "*.HTML", "*.htm", "*.HTM" -Recurse -Path $school
foreach ($file in $processFiles) {
$text = ""
Get-Content $file | % { $text += $_ + "`r`n" }
$text = $text -replace $query, $replace
$text | Out-File $file -Force -Encoding utf8
}
}

Powershell match Regex and Replace

I have a large file that I am searching through to locate and replace invalid dates. I'm using a regex to locate the dates and then determining whether they are valid. If the script finds an invalid date, it needs to replace it with the current date. For audit purposes I need to record the invalid string and the line number on which the error was found. So far (with some prior help from SO) I have been able to locate the invalid dates, but I have not been able to change them successfully.
This is the code I’m using to locate the invalid dates. How can I locate and change the date in a single pass?
$matchInfos = @(Select-String -Pattern $regex -AllMatches -Path $file)
foreach ($minfo in $matchInfos)
{
#"LineNumber $($minfo.LineNumber)"
foreach ($match in @($minfo.Matches | Foreach {$_.Groups[0].value}))
{
if (([Boolean]($match -as [DateTime]) -eq $false ) -or ([DateTime]::parseexact($match,"MM-dd-yyyy",$null).Year -lt "1800")) {
Write-host "Invalid date on line $($minfo.LineNumber) - $match"
#Add-Content -Path $LOGFILE -Value "Invalid date on line $($minfo.LineNumber) - $match"
# Replace the invalid date with a corrected one
Write-Host "Replacing $match with $(Get-Date -Format "MM-dd-yyyy")"
#Add-Content -Path $LOGFILE -Value "Replacing $match with $(Get-Date -Format "MM-dd-yyyy")"
}
}
}
You have to write out a temporary file with the changes and replace the file with the temporary. Here's one I wrote that will do that part for you:
Windows IT Pro: Replacing Strings in Files Using PowerShell
Example of use:
replace-filestring -pattern 'find' -replacement 'replace' -path myfile.txt -overwrite
With this command, the script will read myfile.txt, replace 'find' with 'replace', write the output to a temporary file, and then replace myfile.txt with the temporary file. (Without the -Overwrite parameter, the script will only output the contents of myfile.txt with the changes.)
Bill
$lines = Get-Content $file
$len = $lines.Count
$bad = @{}
for ($i = 0; $i -lt $len; $i++) {
if ($lines[$i] -match "") { # pseudo code: the date regex goes here
$bad_date = $lines[$i].Substring(10) # Get the bad date
$good_date = Get-Date -Format G
$bad["$i"] += @($lines[$i]) # remember the original line, keyed by its index
$lines[$i] = $lines[$i].Replace($bad_date, $good_date)
}
}
$lines > $NewFile
$bad > $bad_date_file
Here is some pseudo code of how I would combat this problem. Not sure how big your file is. Reading and writing could be slow.
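Another way to find and replace in one pass, sketched under assumptions: the question's $regex and input file are not shown, so the $pattern and the two sample lines below are hypothetical. The idea is to pass a script block to [regex]::Replace as the match evaluator, so each date is validated, logged and replaced at the moment it is found.

```powershell
# Hypothetical sample lines and date pattern (the real ones are not shown in the question).
$lines = @('id=1 date=02-30-2014', 'id=2 date=12-25-2013')
$pattern = '\d{2}-\d{2}-\d{4}'
$today = Get-Date -Format 'MM-dd-yyyy'
$log = New-Object System.Collections.Generic.List[string]

$fixed = for ($i = 0; $i -lt $lines.Count; $i++) {
    $lineNumber = $i + 1
    # The script block runs once per match and returns the replacement text.
    [regex]::Replace($lines[$i], $pattern, {
        param($m)
        $parsed = [DateTime]::MinValue
        $ok = [DateTime]::TryParseExact($m.Value, 'MM-dd-yyyy',
            [Globalization.CultureInfo]::InvariantCulture,
            [Globalization.DateTimeStyles]::None, [ref]$parsed)
        if ($ok -and $parsed.Year -ge 1800) { return $m.Value }  # valid: keep as is
        $log.Add("Invalid date on line $lineNumber - $($m.Value)")
        $today                                                   # invalid: use today
    })
}

$fixed
$log
```

With a real file, $lines would come from Get-Content, and $fixed and $log could be written out with Set-Content and Add-Content. For a very large file the whole content is held in memory, which may or may not be acceptable.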