Extracting url from a string with regex and Powershell - regex

I'm using powershell and regex. I'm scraping a web page result to a variable, but I can't seem to extract a generated url from that variable.
this is the content (the actual url varies):
"https://api16-something-c-text.sitename.com/aweme/v2/going/?video_id=v12044gd0666c8ohtdbc77u5ov2cqqd0&
$reg = "([^&]*)&;$" always returns false.
I've been trying -match and Select-String with regex but I'm in need of guidance.

I suggest using a -replace operation:
$str = '"https://api16-something-c-text.sitename.com/aweme/v2/going/?video_id=v12044gd0666c8ohtdbc77u5ov2cqqd0&'
$str -replace '^"(.+)&$', '$1'

It really depends on what format the content is in.
(?<=\") looks behind "&quot" for (.*?) which any numbers of non-newline characters and then looks ahead for (?=\&) which is "&".
Here's a fair start:
$pattern = "(?<=\")(.*?)(?=\&)"
$someText = ""https://api16-something-c-text.sitename.com/aweme/v2/going/?video_id=v12044gd0666c8ohtdbc77u5ov2cqqd0&"
$newText = [regex]::match($someText, $pattern)
$newText.Value
Returns:
https://api16-something-c-text.sitename.com/aweme/v2/going/?video_id=v12044gd0666c8ohtdbc77u5ov2cqqd0

Related

Using Regex to replace multiple lines of text in file

Basically, I have a .bas file that I am looking to update. Basically the script requires some manual configuration and I don't want my team to need to reconfigure the script every time they run it. What I would like to do is have a tag like this
<BEGINREPLACEMENT>
'MsgBox ("Loaded")
ReDim Preserve STIGArray(i - 1)
ReDim Preserve SVID(i - 1)
STIGArray = RemoveDupes(STIGArray)
SVID = RemoveDupes(SVID)
<ENDREPLACEMENT>
I am kind of familiar with powershell so what I was trying to do is to do is create an update file and to replace what is in between the tags with the update. What I was trying to do is:
$temp = Get-Content C:\Temp\file.bas
$update = Get-Content C:\Temp\update
$regex = "<BEGINREPLACEMENT>(.*?)<ENDREPLACEMENT>"
$temp -replace $regex, $update
$temp | Out-File C:\Temp\file.bas
The issue is that it isn't replacing the block of text. I can get it to replace either or but I can't get it to pull in everything in between.
Does anyone have any thoughts as to how I can do this?
You need to make sure you read the whole files in with newlines, which is possible with the -Raw option passed to Get-Content.
Then, . does not match a newline char by default, hence you need to use a (?s) inline DOTALL (or "singleline") option.
Also, if your dynamic content contains something like $2 you may get an exception since this is a backreference to Group 2 that is missing from your pattern. You need to process the replacement string by doubling each $ in it.
$temp = Get-Content C:\Temp\file.bas -Raw
$update = Get-Content C:\Temp\update -Raw
$regex = "(?s)<BEGINREPLACEMENT>.*?<ENDREPLACEMENT>"
$temp -replace $regex, $update.Replace('$', '$$')

Replace text between two string powershell

I have a question which im pretty much stuck on..
I have a file called xml_data.txt and another file called entry.txt
I want to replace everything between <core:topics> and </core:topics>
I have written the below script
$test = Get-Content -Path ./xml_data.txt
$newtest = Get-Content -Path ./entry.txt
$pattern = "<core:topics>(.*?)</core:topics>"
$result0 = [regex]::match($test, $pattern).Groups[1].Value
$result1 = [regex]::match($newtest, $pattern).Groups[1].Value
$test -replace $result0, $result1
When I run the script it outputs onto the console it doesnt look like it made any change.
Can someone please help me out
Note: Typo error fixed
There are three main issues here:
You read the file line by line, but the blocks of texts are multiline strings
Your regex does not match newlines as . does not match a newline by default
Also, the literal regex pattern must when replacing with a dynamic replacement pattern, you must always dollar-escape the $ symbol. Or use simple string .Replace.
So, you need to
Read the whole file in to a single variable, $test = Get-Content -Path ./xml_data.txt -Raw
Use the $pattern = "(?s)<core:topics>(.*?)</core:topics>" regex (it can be enhanced in case it works too slow by unrolling it to <core:topics>([^<]*(?:<(?!</?core:topics>).*)*)</core:topics>)
Use $test -replace [regex]::Escape($result0), $result1.Replace('$', '$$') to "protect" $ chars in the replacement, or $test.Replace($result0, $result1).

Regex code for finding a URL containing and domain and adding a parameter

Help! Needed for wordpress search and replace
Have a bunch of links to amazon.com/productname/dp/productid or similar.
Want to have a regex code that will search any URLs that contain amazon.com and end of the URL add the following "?tag=tagname"
Thanks.
This should do the trick:
Search: (<a[^<>]*href=["']amazon.com.*?)(?=["'])
Replace: $0?tag=tagname
It basically looks for anything that start with <a, then tests if some characters are present or not. Finally it adds the addition, ?tag=tagname, before the closing " or '.
Assuming you're using PHP:
$re = '/(<a[^<>]*href=["\']amazon.com.*?)(?=["\'])/';
$str = '<a data-foo="href=\'amazon.com/\'" href=\'amazon.com/productname/dp/productid\' data-foo2=\'bar\'/>';
$subst = '$0?tag=tagname';
$result = preg_replace($re, $subst, $str);
echo "The result of the substitution is ".$result;
I am not sure why it's not capturing the data-foo="href='amazon.com/'" though, but works as required.

Trim More than 20 Characters

I am working on a script that will generate AD usernames based off of a csv file. Right now I have the following line working.
Select-Object #{n=’Username’;e={$_.FirstName.ToLower() + $_.LastName.ToLower() -replace "[^a-zA-Z]" }}
As of right now this takes the name and combines it into a AD friendly name. However I need to name to be shorted to no more than 20 characters. I have tried a few different methods to shorten the username but I haven't had any luck.
Any ideas on how I can get the username shorted?
Probably the most elegant approach is to use a positive lookbehind in your replacement:
... -replace '(?<=^.{20}).*'
This expression matches the remainder of the string only if it is preceded by 20 characters at the beginning of the string (^.{20}).
Another option would be a replacement with a capturing group on the first 20 characters:
... -replace '^(.{20}).*', '$1'
This captures at most 20 characters at the beginning of the string and replaces the whole string with just the captured group ($1).
$str[0..19] -join ''
e.g.
PS C:\> 'ab'[0..19]
ab
PS C:\> 'abcdefghijklmnopqrstuvwxyz'[0..19] -join ''
abcdefghijklmnopqrst
Which I would try in your line as:
Select-Object #{n=’Username’;e={(($_.FirstName + $_.LastName) -replace "[^a-z]").ToLower()[0..19] -join '' }}
([a-z] because PowerShell regex matches are case in-senstive, and moving .ToLower() so you only need to call it once).
And if you are using Strict-Mode, then why not check the length to avoid going outside the bounds of the array with the delightful:
$str[0..[math]::Min($str.Length, 19)] -join ''
To truncate a string in PowerShell, you can use the .NET String::Substring method. The following line will return the first $targetLength characters of $str, or the whole string if $str is shorter than that.
if ($str.Length -gt $targetLength) { $str.Substring(0, $targetLength) } else { $str }
If you prefer a regex solution, the following works (thanks to #PetSerAl)
$str -replace "(?<=.{$targetLength}).*"
A quick measurement shows the regex method to be about 70% slower than the substring method (942ms versus 557ms on a 200,000 line logfile)

Powershell to get a DLL name out of it's full path

I have a string "....\xyz\abc\0.0\abc.def.ghi.jkl.dll" am trying to get the value of a "abc.def.ghi.jkl.dll" into a variable using powershell.
I am totally new to regex and PS and kinda confused on how to get this done. I read various posts about regex and I am unable to get anything to work
Here is my code,
$str = "..\..\xyz\abc\0.0\abc.def.ghi.jkl.dll"
$regex = [regex] '(?is)(?<=\b\\b).*?(?=\b.dll\b)'
$result = $regex.Matches($str)
Write-Host $result
I would like to get "abc.def.ghi.jkl.dll" into $result. Could someone please help me out
You can use the following regex:
(?is)(?<=\\)[^\\]+\.dll\b
See regex demo
And no need to use Matches, just use a -match (or Match).
Explanation:
(?<=\\) - make sure there is a \ right before the current position in string
[^\\]+ - match 1 or more characters other than \
\.dll\b - match a . symbol followed by 3 letters dll that are followed by a trailing word boundary.
Powershell:
$str = "..\..\xyz\abc\0.0\abc.def.ghi.jkl.dll"
[regex]$regex = "(?is)(?<=\\)[^\\]+\.dll\b"
$match = $regex.match($str)
$result = ""
if ($match.Success)
{
$result = $match.Value
Write-Host $result
}