How do I escape double quote and other troublesome characters in Powershell regex - regex

I've hit a snag in a script I'm putting together to download the latest installation packages without needing to use Chocolatey or Ketarin. Unfortunately a few utilities aren't provided at a direct download link and are hidden behind redirecting URLs, with the download URL expiring after 15 minutes. To complicate things a bit further, I'm doing this in PowerShell 2 as we have a few Vista machines in our office.
After researching similar scenarios, it seems I can use the .NET WebClient to handle the download, though there isn't a progress bar. Since I haven't found sample code for downloading files behind expiring redirects with a .NET WebClient, I decided to use a WebClient request to load the page, extract the current direct download URL from the page with the following regex, and then use that URL to download the file. I've checked with regexr.com to verify that the regex catches the sample URL below.
Sample URL
CF DL here
Regex
<a(?: [^>]*?)? href=(["'])([^\1]*?ProgramName*?)\1(?: .*?)?>.*?<\/a>
Unfortunately PowerShell red-flags this, as it seems to think the double quotes need to be terminated. After attempting to escape any red-flagged characters using backticks, I've wound up with the following, which throws an error saying that '?:' is not recognized as a term, cmdlet, etc.
$downloadLinkRegex = New-Object System.Text.RegularExpressions.Regex (<a(?: [^>]*?)? href=(`[`"`'])(`[`^\1]*?ProgramName.exe*?)\1(?: .*?)?>.*?</a>)
if ("https://www.example.com/randomstring003ejdjd38/dl/ProgramName.exe" -match $downloadLinkRegex){
write-host "yay"
} else{
write-host "nope"}
Attempts to escape the ? using backticks fail as well. Regexes are incredibly difficult for me, so at this point I'm out of ideas on how to make the ISE accept this as a valid regex that doesn't need to be validated, so that it can be stored in a variable and applied later to the contents of a web request.
If anyone could point out where I've gone wrong, or how to resolve the issue, I would be immensely grateful.

The easiest way I can think of is by using the @" bla "@ block in PowerShell (I don't know the official name).
For example:
$regex = @"
Insert regex here
"@
Everything between the @" "@ block will be treated as a string value. Use @' '@ instead if the pattern contains $ or backticks that must stay literal.

I just removed the items PowerShell flags. I had to test several different ways to make sure this was the only way PowerShell would let me print to HTML. Even ConvertTo-Html won't bypass PowerShell's issues. It is like a hybrid of HTML. I also noticed that PowerShell doesn't pay attention to blank space when you type, so my real code has lots of spaces and empty lines to set apart sections of my script.
$My_HTML_table = "<!DOCTYPE html>
<head><title> My Excellent Page </title></head>
<H2> Table 1 </H2>
<text></text>
<table border=1;border-style:solid>
<tr>
<td colspan=1 style=color:blue;background-color:#CCCCCC;font-size:18;padding:5px> Cute Header </td>
</tr>"
$My_HTML_table > C:\File_Path\My_Excellent_HTML.html

But it doesn't match on regexr.com ...? It fails because it thinks the </a> is the end of the regex. It also fails because it's trying to match ProgramNam(one or unlimited 'e') and ignoring the .exe bit. (And "must not match octal number 1"? That's probably not what you want in there (no, I didn't know that, I just saw it while scratching my head trying to decipher this on regex101.com)).
Anyway, to your question: PowerShell doesn't have regex literals, so you can't just write <a(?: [^>]*?... into the shell and have it work. They have to be strings.
But they don't have to be run through New-Object System.Text.RegularExpressions.Regex.
e.g.
$url = 'CF DL here'
$pattern = "<a.*?href=[`"'](.*?)[`"'][^>]*>.*?</a>"
$url -match $pattern
$Matches[1]
I've quoted the string in double quotes around the outside. And then I've used a backtick to escape the double quotes inside the pattern.
Where the regex pattern is explained much more helpfully here
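For readers who want to check the idea outside PowerShell, here is a minimal Python sketch of the same pattern. The anchor tag is a hypothetical stand-in for the page content (the real sample link is not shown above); it reuses the example URL from the question.

```python
import re

# Hypothetical page fragment; only the URL comes from the question.
page = '<a href="https://www.example.com/randomstring003ejdjd38/dl/ProgramName.exe">CF DL here</a>'

# Same pattern as in the answer: group 1 captures the href value.
pattern = re.compile(r"""<a.*?href=["'](.*?)["'][^>]*>.*?</a>""")

match = pattern.search(page)
download_url = match.group(1) if match else None
print(download_url)
```

As in the PowerShell version, the pattern lives in an ordinary string; only the quoting rules of the host language differ.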

I actually reworked the regex into something simpler to resolve the issue. While the URL continually changes the file name doesn't, so I focused on the filename, rather than the whole URL, and was able to grab the URL I needed.

Looks good
$a='CF DL here'
$a -match '(?<=ef=")[^"]+?(\w+)\.(exe|pdf)'
Iwr $matches[0] -outfile "$($matches[1]).$($matches[2])"
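A quick Python check of this lookbehind approach, offered as a sketch only. The dot before the extension is escaped here so that it matches a literal '.', and the anchor tag is a hypothetical stand-in built from the example URL in the question.

```python
import re

# Hypothetical page fragment; the real sample link is not shown in the thread.
page = '<a href="https://www.example.com/randomstring003ejdjd38/dl/ProgramName.exe">CF DL here</a>'

# Start right after ef=" and capture the file stem and extension.
pattern = re.compile(r'(?<=ef=")[^"]+?(\w+)\.(exe|pdf)')

m = pattern.search(page)
url = m.group(0)        # the whole URL
stem = m.group(1)       # file name without extension
extension = m.group(2)  # exe or pdf
out_file = f"{stem}.{extension}"
print(url, out_file)
```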

Related

Regex removing bold markdown from inside codeblock only

I'm editing in bulk some markdown files to be compliant with mkdocs syntax (material theme).
My previous documentation software accepted bold inside code blocks, but I've now discovered that this is far from standard.
I have more than 10k code blocks in this documentation, across more than 300 md files in nested directories, and most of them use ** to bold some words.
To be precise I should make any CodeBlock from this:
this is a **code block** with some commands
```my_lexer
enable
configure **terminal**
interface **g0/0**
```
to this
this is a **code block** with some commands
```my_lexer
enable
configure terminal
interface g0/0
```
The fun parts:
there are bold words in the rest of the document I would like to maintain (outside code block)
not every row of the code block has bold in it
not even every code block has necessarily bold in it
I'm currently using Visual Studio Code with its replace-in-files feature, and most of the easy regexes I wrote for the port are working. But its regex syntax is not quite standard (for example, groups are denoted with $1 instead of \1, and there may be other differences I don't know about).
I'm open to other software (regex flavors) too, if they are more standards-compliant and support 'replace in all files and subdirectories' (like Notepad++, Atom, etc.).
Sadly, I don't even know how to start on something so complicated.
The most advanced I did is this: https://regex101.com/r/vRnkop/1 (there is also the text i'm using to test it)
(^```.*\n)(.*?\*\*(.*?)\*\*.*$\n)*
I hardly think this is a good start to do that!
Thanks
Visual Studio is not my forte, but I did read that you should be able to use PCRE2 regex syntax. Therefore, try substituting the following pattern with an empty string:
\*\*(?=(((?!^```).)*^```)(?:(?1){2})*(?2)$)
See an online demo. The pattern seems a bit rocky, and maybe someone else knows a much simpler one. However, I did want to make sure this would both leave italic alone and turn bold+italic into italic. Note that . matches newline here.
If you have unix tools like sed, it is quite easy:
sed '/^```my_lexer/,/^```/ s/\*\*//g' orig.md >new.md
/regex1/,/regex2/ cmd looks for a group of lines where the first line matches the first regex and the final line matches the second regex, and then runs cmd on each of them. This limits the replacements to the relevant sections of the file.
s/\*\*//g does the search and replace (I have assumed any instance of ** should be deleted).
Some versions of sed allow "in-place" editing with -i. For example, to edit file.md and keep original version as file.md.orig:
sed -i.orig '...' file.md
and you can edit multiple files with something like:
find -name '*.md' -exec sed -i.orig '...' \{} \+
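If sed is not available, the same range idea can be sketched in Python: track whether the current line is inside a fenced block and delete ** only there. This is a rough sketch under a few assumptions: every fence starts at the beginning of a line, fences are never nested, and (unlike the sed command, which only targets my_lexer blocks) any fenced block is cleaned. The function name is made up for illustration.

```python
FENCE = "`" * 3  # three backticks, built this way to keep the example fence-free


def strip_bold_in_code_blocks(text):
    """Remove ** markers on lines inside fenced code blocks only."""
    out = []
    in_block = False
    for line in text.split("\n"):
        if line.startswith(FENCE):
            # A fence line toggles the state and is kept unchanged.
            in_block = not in_block
            out.append(line)
        elif in_block:
            out.append(line.replace("**", ""))
        else:
            out.append(line)
    return "\n".join(out)


sample = "\n".join([
    "this is a **code block** with some commands",
    FENCE + "my_lexer",
    "enable",
    "configure **terminal**",
    "interface **g0/0**",
    FENCE,
])
cleaned = strip_bold_in_code_blocks(sample)
print(cleaned)
```

Walking the directory tree (for example with pathlib's rglob) and writing each file back would cover the "all files and subdirectories" part; keeping a backup copy first, as the -i.orig option does, seems prudent.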

regex to remove hyperlinks

Input:
source http://www.emaxhealth.com/1275/misdiagnosing from here http://www.cancerresearchuk.org/about-cancer/type recounting her experiences and thoughts blog http://fty720.blogspot.com even carried the new name. She was far from home.
From the above input I want to remove the hyperlinks. Below is the regex that I am trying:
http://[\w|\W|\d|\s]*(?=[ ])
This regex will encompass all characters,digits and whitespaces after encountering the word 'http' and will continue till first blank space.
Unfortunately, it is not working as expected. Please help me find my error. Thanks
Try this sed command
sed 's/http[^ ]\+//g' FileName
Output :
source from here recounting her experiences and thoughts blog even carried the new name. She was far from home.
To find the hyperlink use (with case-insensitive matching, since the character classes only list A-Z):
\b(https?)://[A-Z0-9+&@#/%?=~_|$!:,.;-]*[A-Z0-9+&@#/%=~_|$]
If you want to find the html a tag use:
<a\b[^>]*>(.*?)</a>
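The original attempt fails because [\w|\W|\d|\s] matches every character, spaces included, so the greedy * runs on to the last space in the text instead of stopping at the first. A small Python sketch of the "everything up to the next whitespace" idea, using \S+ and also swallowing the trailing space so no double spaces are left behind (the variable names are illustrative):

```python
import re

text = (
    "source http://www.emaxhealth.com/1275/misdiagnosing from here "
    "http://www.cancerresearchuk.org/about-cancer/type recounting her "
    "experiences and thoughts blog http://fty720.blogspot.com even carried "
    "the new name. She was far from home."
)

# \S+ stops at the first whitespace; \s* removes the gap the URL leaves behind.
without_links = re.sub(r"https?://\S+\s*", "", text)
print(without_links)
```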

search & replace wordpress video shortcode with plain URL using regular expressions

I am transferring a friend's wordpress.com blog to a self-hosted install on my server. The problem is that he has many videos embedded in his blog using a shortcode plugin that is not necessary on WordPress 3 (you only need to paste the plain URL to embed videos from YouTube, Vimeo, etc.).
I've found a Search Regex plugin that will search and replace using regular expressions, but I am unfamiliar with regex myself. How might I catch the URL in a shortcode such as [youtube="URL"] and replace it with just the URL?
Thanks for any help you can provide!!
-Jenny
Are you trying to go from "[youtube=http://www.youtube.com/watch?v=JaNH56Vpg-A]" to http://www.youtube.com/watch?v=JaNH56Vpg-A?
This works if there's a white space between different URLs.
find: \[youtube=(\S*)\]
replace with: $1
It's difficult to replace every service at once, since their shortcodes seem to differ. For Vimeo this would work. It allows any amount of white space (at least one character) between "vimeo" and the URL, and it again needs white space after the closing "]".
find: \[vimeo\s+(\S*)\]
replace with: $1
Maybe there's a more robust way to write the expression (one which validates the correct syntax). This one's pretty straightforward, though.
The actual regex syntax depends on the language used. Hope this helps.
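To see both substitutions in action outside the plugin, here is a small Python sketch. Python writes the backreference as \1 rather than $1. The post text and the Vimeo URL are invented for illustration; only the YouTube URL comes from the answer.

```python
import re

# Hypothetical post body with one shortcode of each kind.
post = (
    "Intro [youtube=http://www.youtube.com/watch?v=JaNH56Vpg-A] middle "
    "[vimeo   http://vimeo.com/12345] end"
)

# Same two patterns as above, applied one after the other.
step1 = re.sub(r"\[youtube=(\S*)\]", r"\1", post)
converted = re.sub(r"\[vimeo\s+(\S*)\]", r"\1", step1)
print(converted)
```

Because \S* is greedy, two shortcodes written back to back with no white space between them would be treated as one match, which is the limitation mentioned above.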

perl regex problem -- $amp in yahoo finance page

I found an old perl hack on the O'Reilly site http://oreilly.com/pub/h/1041 and decided to check it out. After a little fiddling around it started to run but the regex are out of date.
Here is the question: with this
/<a href="\/q\/op\?s=(.*?)\&m=(.*?)">/
as the first line of regex, what needs to be modified to make the regex function again? The following are snippets from
http://finance.yahoo.com/q/op?s=FISV
<a href="/q/op?s=FISV&amp;k=55.000000">
and
<a href="/q/os?s=FISV&amp;m=2011-04-15">
.
The original hack is dated 2004 and option symbols looked like this (FQVAH or FQVFF) back then instead of fisv110416c00060000 for a call option and fisv110416p00090000 for a put option. First thing I did to get it going was to modify all instances of $url to $curl because until the name was changed the symbol was not being passed to yahoo for lookup. The &amp is giving me the most trouble. If this is found to run without modification I would be very surprised and would very much like to know what system and perl -V is installed. SLES 10 and perl 5.8.0 is what I am currently using.
Any suggestions would be helpful. It could be a useful script to anyone who is serious about protecting themselves from a falling equity market.
Thanks,
robm
I'm not /100%/ sure what you're asking, but if I'm understanding, you want a regex that will capture "fisv110416c00060000" and tell you the first few letters, whether it's a call or a put, and the amount?
If so, you're looking for something like:
/([a-z]+)(\d+)([cp])(\d+)/
That should capture the following for the first example
$1 = "fisv"
$2 = 110416
$3 = c
$4 = 00060000
The original regex was very specific to that html string. You can include the beginning bits of it if you need to use it to check that the entire string is there as well. Of course, make your regex as tight as possible to avoid over-matches and wasted time pattern matching. I'm just not sure the exact pattern you're trying to match (ie: is it always "fisv"?).
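A short Python sketch of that symbol pattern, run on the two symbols from the question. Anchoring with fullmatch is my own choice here (the pattern above is unanchored), and the function name is made up for illustration.

```python
import re

# Letters, digits, a single c or p, then more digits.
OPTION_SYMBOL = re.compile(r"([a-z]+)(\d+)([cp])(\d+)")


def split_option_symbol(symbol):
    """Return the four captured parts, or None if the symbol does not fit."""
    m = OPTION_SYMBOL.fullmatch(symbol)
    return m.groups() if m else None


call_parts = split_option_symbol("fisv110416c00060000")
put_parts = split_option_symbol("fisv110416p00090000")
print(call_parts)
print(put_parts)
```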
You should either first unescape the HTML, which would turn the &amp; into a &, or just change the regex, like this:
/<a href="\/q\/os\?s=(.*?)\&(?:amp;)?m=(.*?)">/
To match both types of urls:
/<a href="\/q\/o[ps]\?s=(.*?)\&(?:amp;)?[mk]=(.*?)">/
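Both options can be tried quickly in Python as a sanity check. This is only a sketch: the two link strings are modelled on the snippets in the question, one with the ampersand written as an HTML entity and one with a bare ampersand.

```python
import html
import re

# Python equivalent of the combined pattern; (?:amp;)? makes the entity optional.
LINK = re.compile(r'<a href="/q/o[ps]\?s=(.*?)&(?:amp;)?[mk]=(.*?)">')

escaped = '<a href="/q/op?s=FISV&amp;k=55.000000">'
plain = '<a href="/q/os?s=FISV&m=2011-04-15">'

escaped_groups = LINK.search(escaped).groups()
plain_groups = LINK.search(plain).groups()

# The other option: decode the entity first, then match a bare ampersand.
unescaped = html.unescape(escaped)
print(escaped_groups, plain_groups, unescaped)
```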

replace url paths using Regex

How can I change the url of my images from this:
http://www.myOLDwebsite.com/**********.*** (i have gifs, jpgs, pngs)
to this:
http://www.myNEWwebiste.com/somedirectory/**********.***
Using REGexp text editor?
Really thanks for your time
[]'s
Mateus
Why use regex?
Using conventional means, replace:
src="http://www.myOLDwebsite.com/
with:
src="http://www.myNEWwebiste.com/somedirectory/
Granted, this assumes your image tags always follow the 'src="<url>"' pattern, with double quotes and everything.
Using regex is of course also possible. Replace this:
(src\s*=\s*["'])http://www\.myOLDwebsite\.com/
with:
\1http://www.myNEWwebiste.com/somedirectory/
alternatively, if your text editor uses $ to mark back references:
$1http://www.myNEWwebiste.com/somedirectory/
On second thought - why do your images have absolute URLs in the first place? Isn't that unnecessary?
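The regex variant can be tried in Python like this. It is a sketch only: the two image tags and their file names are invented, and \g<1> is simply Python's unambiguous spelling of the \1 backreference.

```python
import re

# Hypothetical markup with one double-quoted and one single-quoted src.
page = (
    '<img src="http://www.myOLDwebsite.com/logo.png"> '
    "<img src = 'http://www.myOLDwebsite.com/photo.gif'>"
)

# Group 1 keeps the src= prefix and the opening quote exactly as found.
OLD_PREFIX = re.compile(r"""(src\s*=\s*["'])http://www\.myOLDwebsite\.com/""")

updated = OLD_PREFIX.sub(r"\g<1>http://www.myNEWwebiste.com/somedirectory/", page)
print(updated)
```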
Well, the easiest way is probably to use sed in in-place mode:
sed -i -r \
's#http://www[.]myOLDwebsite[.]com/#http://www.myNEWwebsite.com/somedirectory/#g' \
file1 file2 ...
If for some reason you need to actually interpret the HTML (rather than just do a simple string replacement), a quick script built around BeautifulSoup is going to be safer -- lots of people try to do HTML or XML parsing via regular expressions, but it's very hard if not impossible to cover all corner cases.
All that said, it'd be better if you were using relative links to not have your HTML depend on the server it's hosted on. See also the <BASE HREF="..."> element you can put in your <HEAD> to specify a location all URLs are relative to; if you were using that, you'd only need to do a single replacement.