Regex to match URL in Powershell - regex

I am new to programming and PowerShell, and I've put together the following script. It parses all the emails in a specified folder and extracts the URLs from them: a regex pattern identifies the URLs, and the matches are written to a text file. That file is then run through another command that is meant to remove the http:// or https:// portion (I need help figuring this out), and the results go into another text file, which I process again to remove duplicates.
The main issue is that the regex doesn't appear to extract the URLs correctly. What I get looks like the example I have created below:
URL is http://www.dropbox.com/3jksffpwe/asdj.exe
But I end up getting
dropbox.com/3jksffpwe/asdj.exe
dropbox.com
drop
dropbox
The script is
#Adjust paths to location of saved Emails
$in_files = 'C:\temp\*.eml, *.msg'
$out_file = 'C:\temp\Output.txt'
$Working_file = 'C:\temp\working.txt'
$Parsed_file = 'C:\temp\cleaned.txt'
# Removes the old output file from earlier runs.
if (Test-Path $Parsed_file) {
Remove-Item $Parsed_file
}
# regex to parse thru each email and extract the URLs to a text file
$regex = '([a-zA-Z]{3,})://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)*?'
select-string -Path $in_files -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $out_file
#Parses thru the output of urls to strip out the http or https portion
Get-Content $Out_file | ForEach-Object {$_.SubString(7)} | Out-File $Working_file
#Parses thru again to remove exact duplicates
$set = @{}
Get-Content $Working_file | %{
    if (!$set.Contains($_)) {
        $set.Add($_, $null)
        $_
    }
} | Set-Content $Parsed_file
#Removes the files no longer required
Del $out_file, $Working_file
#Confirms if the email messages should be removed
$Response = Read-Host "Do you want to remove the old messages? (Y|N)"
If ($Response -eq "Y") {del *.eml, *msg}
#Opens the output file in notepad
Notepad $Parsed_file
Exit
Thanks for any help

Try this RegEx:
(http[s]?|[s]?ftp[s]?)(:\/\/)([^\s,]+)
But remember that the PowerShell -match operator only captures the first match. To capture all matches, you could do something like this:
$txt = "https://test.com, http://tes2.net, http:/test.com, http://test3.ro, text, http//:wrong.value"
$hash = @{}
$txt | Select-String -AllMatches '(http[s]?|[s]?ftp[s]?)(:\/\/)([^\s,]+)' | % { $hash."Valid URLs" = $_.Matches.Value }
$hash
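To cover the remaining steps from the question (stripping the scheme and removing duplicates), a minimal sketch could look like the following. It assumes the text is already held in a string; `$sample`, `$pattern`, and `$urls` are illustrative names, and the pattern is the one suggested above.

```powershell
# Illustrative sample text; in the real script this would come from the email files.
$sample = 'See http://www.dropbox.com/3jksffpwe/asdj.exe and https://test.com, again https://test.com'

# Pattern from the answer above: scheme, "://", then everything up to whitespace or a comma.
$pattern = '(http[s]?|[s]?ftp[s]?)(:\/\/)([^\s,]+)'

# 1. Collect every match in the text.
# 2. Drop the scheme with an anchored -replace, whatever its length.
# 3. Keep only the first occurrence of each resulting string.
$urls = [regex]::Matches($sample, $pattern) |
    ForEach-Object { $_.Value -replace '^[a-z]+://', '' } |
    Select-Object -Unique

$urls
```

The anchored -replace avoids a fixed offset: SubString(7) fits "http://" but leaves a stray "/" in front of URLs that start with "https://", which is eight characters long.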
Best of luck! Enjoy!

A regex for matching URLs could look like this:
(?i)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
Check for more info here.

Related

Powershell script to replace link:lalala.html[lalala] with xref:lalala.adoc[lalala] capture pattern and replace recursively

I have a folder full of text documents in .adoc format that have some text in them. The text is following: link:lalala.html[lalala]. I want to replace this text with xref:lalala.adoc[lalala]. So, basically, just replace link: with xref:, .html with .adoc, leave all the rest unchanged.
But the problem is that lalala can be anything from a word to ../topics/halva.html.
I know that I need to use regex patterns; I previously used a similar script, with a replace directive wrapped in an object:
Get-ChildItem -Path *.adoc -file -recurse | ForEach-Object {
    $lines = Get-Content -Path $PSItem.FullName -Encoding UTF8 -Raw
    $patterns = @{
        '(\[\.dfn \.term])#(.*?)#' = '$1_$2_' ;
    }
    $option = [System.Text.RegularExpressions.RegexOptions]::Singleline
    foreach($k in $patterns.Keys){
        $pat = [regex]::new($k, $option)
        $lines = $pat.Replace($lines, $patterns.$k)
    }
    $lines | Set-Content -Path $PSItem.FullName -Encoding UTF8 -Force
}
Looks like I need a different script since the new task cannot be added as just another object. I could've just replaced each part separately, using two objects: replace link: with xref:, then replace .html with .adoc.
But this can interfere with other links that end with .html and don't start with link:. In the text, absolute links usually don't have link: in the beginning. They always start with http:// or https://. And they still may or may not end with .html. So the best idea is to take the whole string link:lalala.html[lalala] and try to replace it with xref:lalala.adoc[lalala].
I need the help of someone who knows regex and PowerShell, please this would save me.
As a pattern, you might use
\blink:(.+?)\.html(?=\[[^][]*])
\blink: Match link:
(.+?) Capture 1+ chars, as few as possible, in group 1
\.html Match .html
(?=\[[^][]*]) Assert that an opening square bracket through its closing bracket follows directly to the right
Regex demo
In the replacement use group 1 using $1
xref:$1.adoc
Example
$Strings = @("link:lalala.html[lalala]", "link:../topics/halva.html[../topics/halva.html]")
$Strings -replace "\blink:(.+?)\.html(?=\[[^][]*])",'xref:$1.adoc'
Output
xref:lalala.adoc[lalala]
xref:../topics/halva.adoc[../topics/halva.html]
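Applied to the files from the question, the same replacement could be sketched as below. The folder `.\docs` is a hypothetical placeholder, and the loop assumes PowerShell 5.0 or later for -NoNewline; it is one possible arrangement, not the only one.

```powershell
# Pattern and replacement from the answer above.
$pattern     = '\blink:(.+?)\.html(?=\[[^][]*])'
$replacement = 'xref:$1.adoc'

# Hypothetical folder; replace with the directory that holds the .adoc files.
$root = '.\docs'

if (Test-Path $root) {
    Get-ChildItem -Path $root -Filter *.adoc -File -Recurse | ForEach-Object {
        # -Raw reads the whole file as one string, so it is rewritten in one pass.
        $text = Get-Content -Path $_.FullName -Encoding UTF8 -Raw
        # -NoNewline keeps Set-Content from appending an extra line break.
        $text -replace $pattern, $replacement |
            Set-Content -Path $_.FullName -Encoding UTF8 -NoNewline
    }
}

# Absolute links that do not start with "link:" are left unchanged.
$untouched = 'http://example.com/page.html[page]' -replace $pattern, $replacement
$untouched
```

Because the pattern requires the literal link: prefix, links that start with http:// or https:// keep their .html ending, which was the concern with replacing the two parts separately.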

Regex is working on Regex101 but not inside Powershell

I have this Text file:
[Tabs]
MAILBOXSEND=1
MAILBOX=8
USERS=6
DOCUMENTS_Q=9
MED_WEBSERVCALLS_LOA=3
FCLMNA=1
INCZOOMFORM=1
USERSB=1
USERSB_ONE=1
DATAPRIV=1
MED_WEBSERVCALLS=2
TINVOICES=1
PORDERS=9
PORDERSTOTAL=1
LOGPART=1
LOGCOUNTERS=1
PARTMSG=1
[External Mail]
Send=Y
Hostname=Server
Domain=Domain
Myemail=My@email.com
MyName=My Test
Port=25
SSL=0
[Search]
SUPPLIERS=5,1
StartButton=1
Ignore Case=0
PART=6,1
I'm trying to capture all the text between [External Mail] and the next bracketed group.
I have this regex, which does the job when tested on Regex101, but after all my testing I found that it doesn't work inside PowerShell:
$Text = Get-Content c:\text.txt
$Text -match '(?s)(?<=\[External Mail\]).*?(?=\[.*?\])'
or:
$Text | Select-String '(?s)(?<=\[External Mail\]).*?(?=\[.*?\])'
Nothing is returned.
Do you have any idea what I'm missing?
Thanks
It looks like you are parsing an .ini file. Don't reinvent the wheel; take advantage of existing code. This solution reads the .ini file into nested hash tables that are easy to work with.
In case of link rot, here's the function from Scripting Guys archive:
function Get-IniContent ($filePath)
{
    $ini = @{}
    switch -regex -file $FilePath
    {
        "^\[(.+)\]" # Section
        {
            $section = $matches[1]
            $ini[$section] = @{}
            $CommentCount = 0
        }
        "^(;.*)$" # Comment
        {
            $value = $matches[1]
            $CommentCount = $CommentCount + 1
            $name = "Comment" + $CommentCount
            $ini[$section][$name] = $value
        }
        "(.+?)\s*=(.*)" # Key
        {
            $name,$value = $matches[1..2]
            $ini[$section][$name] = $value
        }
    }
    return $ini
}
# Sample usage:
$i = Get-IniContent c:\temp\test.ini
$i["external mail"]
Name                           Value
----                           -----
Domain                         Domain
SSL                            0
Hostname                       Server
Send                           Y
MyName                         My Test
Port                           25
Myemail                        My@email.com
$i["external mail"].hostname
Server
Since you are trying to get a multiline regex match you need to be working against a single multiline string. That is the difference between your two cases of regex101 and PowerShell. Get-Content will be returning a string array. Your regex was not matching anything as it was only doing the test on single lines within the file.
PowerShell 2.0
$Text = Get-Content c:\text.txt | Out-String
PowerShell 3.0 or higher
$Text = Get-Content c:\text.txt -Raw
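The difference between the two inputs can be reproduced without a file. In this sketch a here-string stands in for the result of Get-Content -Raw, and splitting it stands in for the default line-by-line array; `$text`, `$section`, and `$lines` are illustrative names, and the sample is a shortened version of the file from the question.

```powershell
# A here-string is a single multiline string, like the output of Get-Content -Raw.
$text = @'
[Tabs]
PARTMSG=1
[External Mail]
Send=Y
Hostname=Server
Port=25
[Search]
StartButton=1
'@

# Pattern from the question.
$pattern = '(?s)(?<=\[External Mail\]).*?(?=\[.*?\])'

# One string: the pattern can see across line breaks and finds the section body.
$section = [regex]::Match($text, $pattern).Value.Trim()
$section

# An array of lines, as Get-Content returns by default: no single line
# contains both "[External Mail]" and a following bracket group.
$lines = $text -split '\r?\n'
@($lines -match $pattern).Count
```

The first expression returns the three key=value lines of the section; the second counts the lines that match on their own, which is zero.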
As I said in my comments, you don't really need regex for this type of string extraction. Scripts already exist to parse INI content. If you intend to replace content, you would have to find the partner cmdlet Out-IniContent, assuming it exists, but I am sure someone has made it. vonPryz's answer contains more information on the cmdlet.

grep string between two other strings as delimiters

I have to do a report on how many times a certain CSS class appears in the content of our pages (over 10k pages). The trouble is, the header and footer contains that class, so a grep returns every single page.
So, how do I grep for content?
EDIT: I am looking for if a page has list-unstyled between <main> and </main>
So do I use a regular expression for that grep? or do I need to use PowerShell to have more functionality?
I have grep at my disposal and PowerShell, but I could use a portable software if that is my only option.
Ideally, I would get a report (.txt or .csv) with pages and line numbers where the class shows up, but just a list of the pages themselves would suffice.
EDIT: Progress
I now have this in PowerShell
$files = get-childitem -recurse -path w:\test\york\ -Filter *.html
foreach ($file in $files)
{
$htmlfile=[System.IO.File]::ReadAllText($file.fullName)
$regex="(?m)<main([\w\W]*)</main>"
if ($htmlfile -match $regex) {
$middle=$matches[1]
[regex]::Matches($middle,"list-unstyled")
Write-Host $file.fullName has matches in the middle:
}
}
Which I run with this command .\FindStr.ps1 | Export-csv C:\Tools\text.csv
It outputs the file name and path with the string in the console, but does not add anything to the CSV. How can I get that added in?
What Ansgar Wiechers' answer says is good advice. Don't string search html files. I don't have a problem with it but it is worth noting that not all html files are the same and regex searches can produce flawed results. If tools exists that are aware of the file content structure you should use them.
I would like to take a simple approach that reports all files that have enough occurrences of the text list-unstyled in all html files in a given directory. You expect there to be 2? So if more than that show up then there is enough. I would have done a more complicated regex solution but since you want the line number as well I came up with this compromise.
$pattern = "list-unstyled"
Get-ChildItem C:\temp -Recurse -Filter *.html |
    Select-String $pattern |
    Group-Object Path |
    Where-Object{$_.Count -gt 2} |
    ForEach-Object{
        $props = @{
            File = $_.Group | Select-Object -First 1 -ExpandProperty Path
            PatternFound = ($_.Group | Select-Object -ExpandProperty LineNumber) -join ";"
        }
        New-Object -TypeName PSCustomObject -Property $props
    }
Select-String is a grep-like tool that can search files for strings. It reports the line number of each match in the file, which is why we are using it here.
You should get output that looks like this on your PowerShell console.
File                 PatternFound
----                 ------------
C:\temp\content.html 4;11;54
Here 4, 11, and 54 are the lines where the text was found. The code filters out results where the count of matching lines is less than 3, so files where the class appears only once in the header and once in the footer are excluded.
You can create a regex that matches across multiple lines. The regex "(?m)<!-- main content -->([\w\W]*)<!-- end content -->" matches content delimited by your comments, even when it spans several lines. Note that the (?m) option only changes how ^ and $ behave; it is the [\w\W] class, which matches any character including newlines, that lets the match cross line breaks. The group ([\w\W]*) captures everything between your comments, so you can query $matches[1], which will contain your "main text" without headers and footers.
$htmlfile=[System.IO.File]::ReadAllText($fileToGrep)
$regex="(?m)<!-- main content -->([\w\W]*)<!-- end content -->"
if ($htmlfile -match $regex) {
$middle=$matches[1]
[regex]::Matches($middle,"list-unstyled")
}
This is only an example of how you could parse the file. Populate $fileToGrep with the name of the file you want to parse, then run this snippet to get the matches of list-unstyled found in the middle of that file.
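On the Export-Csv part of the question: Write-Host prints to the console only, so nothing reaches the next command in the pipeline. A hedged sketch that emits one object per file instead is shown here; the function name `Get-ClassCountInMain` and the property names `File` and `Hits` are made up for illustration, and `[pscustomobject]` assumes PowerShell 3.0 or later.

```powershell
function Get-ClassCountInMain {
    param(
        [string]$Html,                        # full page source as one string
        [string]$ClassName = 'list-unstyled'  # text to count inside <main>
    )
    # (?s) lets "." cross line breaks; the lazy .*? stops at the first </main>.
    $m = [regex]::Match($Html, '(?s)<main\b.*?</main>')
    if (-not $m.Success) { return 0 }
    return [regex]::Matches($m.Value, [regex]::Escape($ClassName)).Count
}

# Small in-memory page: the class appears in header, main, and footer.
$sample = "<header class='list-unstyled'></header>`n<main>`n<ul class='list-unstyled'></ul>`n</main>`n<footer class='list-unstyled'></footer>"
$sampleHits = Get-ClassCountInMain -Html $sample
$sampleHits

# Folder from the question; the loop is skipped when it does not exist.
$root = 'w:\test\york'
if (Test-Path $root) {
    Get-ChildItem -Path $root -Recurse -Filter *.html | ForEach-Object {
        $count = Get-ClassCountInMain -Html ([System.IO.File]::ReadAllText($_.FullName))
        if ($count -gt 0) {
            # Objects written to the pipeline reach Export-Csv; Write-Host output does not.
            [pscustomobject]@{ File = $_.FullName; Hits = $count }
        }
    }
}
```

Saved as FindStr.ps1, this could be run as in the question, for example .\FindStr.ps1 | Export-Csv C:\Tools\text.csv -NoTypeInformation, and each emitted object becomes one CSV row.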
Don't use string matches for something like this. Analyze the DOM instead. That should allow you to exclude headers and footers by selecting the appropriate root element.
$ie = New-Object -COM 'InternetExplorer.Application'
$url = '...'
$classname = 'list-unstyled'
$ie.Navigate($url)
do { Start-Sleep -Milliseconds 100 } until ($ie.ReadyState -eq 4)
$root = $ie.Document.getElementById('content-element-id')
$hits = $root.getElementsByTagName('*') | ? { $_.ClassName -eq $classname }
$hits.Count # number of occurrences of $classname below content element

Extracting text (filename) out of a file with Powershell if line contains search pattern

I've got a problem: I've got a file whose content looks like
IMPORT ("$(#T_Company-BAG)\KSAKTE13","06141030.eou")
IMPORT ("$(#T_Company-Gesmbh)\KSAKTE13","06141032.eou")
IMPORT ("$(#T_Company-ITALIA)\KSAKTE13","06141038.eou")
IMPORT ("$(#T_Company-ITALIA)\KSAKTE13","06141045.eou")
IMPORT ("$(#T_Company-ITALIA)\RWRECH13","06141512.eou")
What I want to do is extract the file name (*.eou) inside the last pair of quotes, but only from lines that contain the string T_Company-ITALIA.
The first part, extracting all lines containing the search pattern, isn't so difficult:
gc -Path C:\Scripts\Easyarchiv\level2.ebt | Select-String -Pattern T_Company-ITALIA
But I don't know how to get only the file names (*.eou) out of the already selected lines.
Now I'm searching for a regex that can extract this.
Here's an option without using Select-String:
Get-Content file.txt |
where {$_ -match 'T_Company-ITALIA'} |
foreach { $_ -replace '^.+,"(.+)"\).*$','$1'}
Try this instead of Select-String:
... | ? { $_ -match 'T_Company-ITALIA.*?,"(.*?)"' } | % { $matches[1] }
If you wanted to stay away from regular expressions (which can be hard to read, debug, and understand) you could do this:
gc file.txt |
foreach {
$splitArray = $_ -split '"'; # split at the quotation marks
if ($splitArray[1] -match "T_Company-ITALIA")
{$splitArray[3]}
}
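For completeness, the Select-String attempt from the question can also be kept by adding a capture group to its pattern. The sketch below runs against an in-memory copy of three of the sample lines; `$lines` and `$names` are illustrative names.

```powershell
# Sample lines from the question, held in an array instead of a file.
# Single quotes keep PowerShell from expanding the $( ... ) parts.
$lines = @(
    'IMPORT ("$(#T_Company-BAG)\KSAKTE13","06141030.eou")',
    'IMPORT ("$(#T_Company-ITALIA)\KSAKTE13","06141038.eou")',
    'IMPORT ("$(#T_Company-ITALIA)\RWRECH13","06141512.eou")'
)

# Only lines containing T_Company-ITALIA match; group 1 captures the
# file name inside the last pair of quotes.
$names = $lines |
    Select-String -Pattern 'T_Company-ITALIA.*,"([^"]+\.eou)"' |
    ForEach-Object { $_.Matches[0].Groups[1].Value }

$names
```

With a file, the array could be replaced by Get-Content on the path from the question, or the path passed to Select-String directly with -Path.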

Using Powershell v1 to Remove Script from webpages

My website has been hacked, with the effect being the addition of a script (VBScript, I think) just before the </body> tag on certain pages. I can select all of the pages which are targeted using
$files=get-childitem . -recurse -include $a | where {$_.LastWriteTime -gt [datetime]::parse("08/14/2011")}
where $a is an array of file specs. I would like to run each of these files through a get-content|-replace|set-content pipeline, but I can't get the -replace arguments right. Basically, I want to replace everything between the <script> and </script> tags, including the tags, with blank space or an HTML comment. I'm pretty sure this can be solved with regex, but I just can't get it right. I have something like:
foreach ($f in $files)
{(get-content $f)|foreach-object {$_ -replace "<script>\w+</script>","<!--Script Replaced-->"}|set-content $f}
Thanks in advance,
Eric F
Disclaimer: Regex is not HTML parser. You will run into corner cases.
The script tags are probably multiline, so you want to:
1) Get all the lines of the file ( get-content and piping it like you have done will only process line-by-line )
2) Use a regex that can replace / process over multiple line ( the regex you have used will only look within a single line)
So you can try something like below for getting the content and replacing the tags:
$content = [System.IO.File]::ReadAllText($f)
$content -replace "(?s)<script>.+?</script>","" | out-file $f
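The effect of the two points above can be seen on a small in-memory string. This is only a sketch: `$content` is an invented fragment, and the pattern is the one from the answer, which matches a bare <script> tag without attributes.

```powershell
# Illustrative page fragment with a script block that spans several lines.
$content = "<p>text</p>`n<script>`nmsgbox 1`n</script>`n</body>"

# Per line, as Get-Content pipes it: no single line holds both tags,
# so the replacement never fires.
$perLine = $content.Split("`n") -replace '(?s)<script>.+?</script>', ''

# Whole string: (?s) lets "." match newlines, so the block is replaced.
$whole = $content -replace '(?s)<script>.+?</script>', '<!--Script Replaced-->'
$whole
```

If the injected tag carries attributes, for example a language attribute, the opening part of the pattern would need to allow for them, such as <script[^>]*>.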