REGEX Pattern for Username inside a longer string - regex

MAC OSX, PowerShell 6.1 Core
I'm struggling with creating the correct REGEX pattern to find a username string in the middle of a url. In short, I'm working in Powershell Core 6.1 and pulling down a webpage and scraping out the "li" elements. I write this to a file so I have a bunch of lines like this:
<LI>Smith, Jimmy
The string I need is the "jimmysmith" part, and every line will have a different username, no longer than eight alpha characters. My current pattern is this:
(<(.|\n)+?>)|( )
and I can use a "-replace $pattern" in my code to grab the "Smith, Jimmy" part. I have no idea what I'm doing, and any success in getting what I did get was face-roll-luck.
After using several online regex helpers I'm still stuck on how to get just the string after the third "/", up to but not including the last quote.
Thank you for any assistance you can give me.

You could go super-simple,
expand-user/([^"]+)
Find expand-user, then capture until a quotation.
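For anyone who wants to try the simple pattern outside PowerShell, here is a minimal Python sketch. The sample line is an assumption: the real href was not preserved in the post, so the path below is hypothetical, and username_after_marker is just an illustrative helper name.

```python
import re

# Hypothetical sample line: the real href is not shown in the question,
# so this path is an assumption made only for illustration.
line = '<LI><A HREF="/hypothetical/expand-user/jimmysmith">Smith, Jimmy</A>'

def username_after_marker(text):
    """Return the text between 'expand-user/' and the next double quote."""
    match = re.search(r'expand-user/([^"]+)', text)
    return match.group(1) if match else None

print(username_after_marker(line))  # jimmysmith
```

The negated class [^"]+ stops at the first quote, so nothing after the href leaks into the capture.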

(?:\/.*){2}\/(?<username>.*)"
(?:\/.*) Matches a literal / followed by any number of characters
{2} do the previous match two times
\/ match another /
(?<username>.*)" match everything up until the next " and put it in the
username group.
https://regex101.com/r/0gj7yG/1
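The same pattern can be sketched in Python, with two assumptions: the sample line is hypothetical (the real href is not shown in the question), and Python's re module spells named groups (?P<name>...) rather than (?<name>...).

```python
import re

# Hypothetical sample line; the actual URL was not preserved in the post.
line = '<LI><A HREF="/hypothetical/expand-user/jimmysmith">Smith, Jimmy</A>'

# Two "/..." segments, a third "/", then the username up to a double quote.
# Python uses (?P<username>...) for the named group.
pattern = re.compile(r'(?:/.*){2}/(?P<username>.*)"')

def third_segment(text):
    """Return the named 'username' capture, or None if the line does not match."""
    match = pattern.search(text)
    return match.group("username") if match else None

print(third_segment(line))  # jimmysmith
```

Because .* is greedy, the capture runs to the last double quote the engine can reach, so a line with extra quotes after the href would need [^"]* in place of .* in the named group.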
Although, since each line is presumably identical up until the username:
$line = ("<LI>Smith, Jimmy ")
# Substring takes (startIndex, length), so subtract the start offset from the quote position
$line = $line.Substring(36, $line.LastIndexOf('"') - 36)

the answer is what was posted by Dave. I saved my scraped details to a file (the lines with "li") by doing:
get-content .\list.txt -ReadCount 1000| foreach-object { $_ -match "<li>"} |out-file .\transform.txt
I then used the method proposed by Dave as follows:
$a = get-content .\transform.txt |select-string -pattern '(?:\/.*){2}\/(?<username>.*)"' | % {"$($_.matches.groups[1])"} |out-file .\final.txt
I had to look up how to pull the group name out, and I used this reference to figure that out: How to get the captured groups from Select-String?

Related

Regex: Find username inside a url

I'm struggling with creating the correct REGEX pattern to find a username string in the middle of a url. In short, I'm working in Powershell and pulling down a webpage and scraping out the "li" elements. I write this to a file so I have a bunch of lines like this:
<LI>Smith, Jimmy
The string I need is the "jimmysmith" part, and every line will have a different username, no longer than eight alpha characters. My current pattern is this:
(<(.|\n)+?>)|( )
and I can use a "-replace $pattern" in my code to grab the "Smith, Jimmy" part. I have no idea what I'm doing, and any success in getting what I did get was face-roll-luck.
After using several online regex helpers I'm still stuck on how to get just the string after the third "/", up to but not including the last quote.
Thank you for any assistance you can give me.
I suggest you use an HTML parser instead. Try:
$html = New-Object -ComObject "HTMLFile"
$source = '<LI>Smith, Jimmy '
$html.IHTMLDocument2_write($source)
$html.links | % nameprop
jimmysmith
Try the following regex:
[^\/"]+(?=">.*<\/A>)
This will capture the last string in href attribute of <a> tag.
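A quick way to check the lookahead version is a small Python sketch. The sample line is hypothetical, since the real href is not shown in the question, and re.IGNORECASE is added only to mirror PowerShell's case-insensitive matching.

```python
import re

# Hypothetical sample line; the real href is not shown in the question.
line = '<LI><A HREF="/hypothetical/expand-user/jimmysmith">Smith, Jimmy</A>'

# A run of characters that are neither "/" nor a double quote, which must be
# followed by the closing quote of the href and, later on the line, by </A>.
pattern = re.compile(r'[^/"]+(?=">.*</A>)', re.IGNORECASE)

def last_href_segment(text):
    """Return the last path segment of the href, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(0) if match else None

print(last_href_segment(line))  # jimmysmith
```

The lookahead only asserts what follows, so the quote and the link text never become part of the match.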
Or simply replace the redundant strings with nothing:
'<LI>Smith, Jimmy ' -replace ".*user/|`"\>.*"
If you have multiple lines, try this:
'<LI>Smith, Jimmy ' -replace "^\<LI.*user/|`"\>.*"
Both work, tested.
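The replace-everything-else idea looks like this as a Python sketch. Again the sample line is an assumption (hypothetical href), and strip_around_username is an illustrative name, not something from the answer.

```python
import re

# Hypothetical sample line; the real href is not shown in the question.
line = '<LI><A HREF="/hypothetical/expand-user/jimmysmith">Smith, Jimmy</A>'

def strip_around_username(text):
    """Delete everything up to 'user/' and everything from '">' onward."""
    return re.sub(r'.*user/|">.*', '', text)

print(strip_around_username(line))  # jimmysmith
```

Note that a line without "user/" is not rejected; it just comes back with less removed, so this approach suits input that is already filtered down to the link lines.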
The answer to my question, was contained in this response by Sergio.
Try the following regex:
[^\/"]+(?=">.*<\/A>)
This will capture the last string in href attribute of <a> tag.

Powershell: Using regex to match patterns and its variations (which has special characters)

I have several thousand text files containing form information (one text file for each form), including the unique id of each form.
I have been trying to extract just the form id using regex (which I am not too familiar with) to match the string of characters found before and after the form id and extract only the form ID number in between them. Usually the text looks like this: "... 12 ID 12345678 INDEPENDENT BOARD..."
The bolded 8-digit number is the form ID that I need to extract.
The code I used can be seen below:
$id= ([regex]::Match($text_file, "12 ID (.+) INDEPENDENT").Groups[1].Value)
This works pretty well, but I soon noticed that there were some files for which this script did not work. After investigation, I found that there was another variation to the text containing the form ID used by some of the text files. This variation looks like this: "... 12 ID 12345678 (a.12(3)(b),45)..."
So my first challenge is to figure out how to change the script so that it will match the first or the second pattern. My second challenge is to escape all the special characters in "(a.12(3)(b),45)".
I know that the pipe | is used as an "or" in regex and two backslashes are used to escape special characters, however the code below gives me errors:
$id= ([regex]::Match($text_one_line, "34 PR (.+) INDEPENDENT"|"34 PR (.+) //(a//.12//(3//)//(b//)//,45//)").Groups[1].Value)
Where have I gone wrong here and how I can fix my code?
Thank you!
When you approach a regex pattern always look for fixed vs. variable parts.
In your case the ID seems to be fixed, and it is, therefore, useful as a reference point.
The following pattern applies this suggestion: (?:ID\s+)(\d{8})
$str = "... 12 ID 12345678 INDEPENDENT BOARD..."
$ret = [Regex]::Matches($str, "(?:ID\s+)(\d{8})")
for ($i = 0; $i -lt $ret.Count; $i++) {
    # Index with $i so every match is printed, not just the first one
    $ret[$i].Groups[1].Value
}
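To see that anchoring on the fixed "ID" part handles both text variants from the question, here is a minimal Python sketch. The two sample strings are the ones quoted in the question; form_id is a hypothetical helper name.

```python
import re

# "ID", some whitespace, then exactly eight digits captured as the form ID.
ID_PATTERN = re.compile(r'ID\s+(\d{8})')

def form_id(text):
    """Return the 8-digit form ID following 'ID', or None if absent."""
    match = ID_PATTERN.search(text)
    return match.group(1) if match else None

samples = [
    "... 12 ID 12345678 INDEPENDENT BOARD...",
    "... 12 ID 12345678 (a.12(3)(b),45)...",
]
for sample in samples:
    print(form_id(sample))  # 12345678 for both variants
```

Since the pattern never looks at what follows the digits, there is nothing in "(a.12(3)(b),45)" that needs escaping.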
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. It contains a treasure trove of useful information.

Regex to capture an URL

I've extracted an URL from a website in this string form:
#{href=http://download.company.net/file.exe}[0]
I can't figure out a pattern to get this part out of it: http://download.company.net/file.exe so I can use it as the URL to download the file.
From my point of view the logic would be that I need to first match "http" as the beginning of a string, a wildcard in between, and then match "}" but not include it in the final output. So IDK ...[http]*\} (I know that this "syntax" of mine is totally wrong, but you get the idea)
The reason I don't want to include "exe" in the pattern is that the file extension could be "msi" and I want it to be more universal. Also, a good and comprehensive PS regex article would help me greatly (with inexperience in mind); I really didn't find any that was "newbie friendly" or comprehensive enough to understand this topic.
You can use either [regex]::Match or -replace.
In the following example, I capture everything after href= that is not a closing curly bracket }:
'#{href=http://download.company.net/file.exe}[0]' -replace '#{href=([^}]+).*', '$1'
Output:
http://download.company.net/file.exe
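The capture-and-replace idea can be sketched in Python like this. The input string is the one from the question; url_via_replace is a hypothetical helper name, and the brace is escaped only to keep the pattern unambiguous.

```python
import re

raw = '#{href=http://download.company.net/file.exe}[0]'

def url_via_replace(text):
    """Keep only what follows 'href=' up to (not including) the first '}'."""
    # [^}]+ captures every character that is not a closing curly bracket;
    # the trailing .* swallows the rest so the whole string is replaced.
    return re.sub(r'#\{href=([^}]+).*', r'\1', text)

print(url_via_replace(raw))  # http://download.company.net/file.exe
```

Nothing in the pattern mentions the extension, so the same call works for an .msi link.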
I'd use -cmatch or -imatch as
if ($content -imatch '(?<=href=).*(?=})') {
$result = $matches[0]
} else {
$result = ''
}
In case of test data, it will return
http://download.company.net/file.exe
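The lookaround variant translates to Python almost unchanged. This is a minimal sketch using the string from the question; url_via_lookaround is a hypothetical name, and the empty-string fallback mirrors the else branch above.

```python
import re

content = '#{href=http://download.company.net/file.exe}[0]'

def url_via_lookaround(text):
    """Match text preceded by 'href=' and followed by '}', without consuming either."""
    match = re.search(r'(?<=href=).*(?=\})', text, re.IGNORECASE)
    return match.group(0) if match else ''

print(url_via_lookaround(content))  # http://download.company.net/file.exe
```

Because .* is greedy, the match extends to the last } on the line; with several braces per line, [^}]* would be the safer middle part.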

How to Regex Multiple URLs From Same Variable In Perl

I'm trying to search a field in a database to extract URLs. Sometimes there will be more than 1 URL in a field and I would like to extract those in to separate variables (or an array).
I know my regex isn't going to cover all possibilities. As long as I flag on anything that starts with http and ends with a space I'm ok.
The problem I'm having is that my efforts either seem to get only 1 URL per record or they get only the last letter from each URL. I've tried a couple of different techniques based on solutions others have posted but I haven't found a solution that works for me.
Sample input line:
Testing http://marko.co http://tester.net Just about anything else you'd like.
Output goal
$var[0] = http://marko.co
$var[1] = http://tester.net
First try:
if ( $status =~ m/http:(\S)+/g ) {
print "$&\n";
}
Output:
http://marko.co
Second try:
@statusurls = ($status =~ m/http:(\S)+/g);
print "@statusurls\n";
Output:
o t
I'm new to regex, but since I'm using the same regex for each attempt, I don't understand why it's returning such different results.
Thanks for any help you can offer.
I've looked at these posts and either didn't find what I was looking for or didn't understand how to implement it:
This one seemed the most promising (and it's where I got the 2nd attempt from), but it didn't return the whole URL, just the letter: How can I store regex captures in an array in Perl?
This has some great stuff in it. I'm curious if I need to look at the URL as a word since it's bookended by spaces: Regex Group in Perl: how to capture elements into array from regex group that matches unknown number of/multiple/variable occurrences from a string?
This one offers similar suggestions as the first two. How can I store captures from a Perl regular expression into separate variables?
Solution:
@statusurls = ($status =~ m/(http:\S+)/g);
print "@statusurls\n";
Thanks!
I think that you need to capture more than just one character. Try this regex instead:
m/http:(\S+)/g
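The "o t" output comes from putting the quantifier outside the capture group: a repeated group keeps only its last iteration. Python's re.findall behaves the same way as Perl's list-context match here, so a small sketch (using the sample line from the question) shows the difference.

```python
import re

status = "Testing http://marko.co http://tester.net Just about anything else you'd like."

# Quantifier outside the group: the group is overwritten on every repetition,
# so only the last character of each URL survives.
last_chars = re.findall(r'http:(\S)+', status)

# Quantifier inside the group, with "http:" included: the whole URL is captured.
urls = re.findall(r'(http:\S+)', status)

print(last_chars)  # ['o', 't']
print(urls)        # ['http://marko.co', 'http://tester.net']
```

Moving only the + inside the parentheses, as in http:(\S+), would capture "//marko.co" without the scheme, which is why the asker's final solution wraps the whole URL.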

Odd Perl Regex Behavior with Parens

I'm pulling in some Wikipedia markup and I'm wanting to match the URLs in relative (on Wikipedia) links. I don't want to match any URL containing a colon (not counting the protocol colon), to avoid special pages and the like, so I have the following code:
while ($body =~ m|<a href="(?<url>/wiki/[^:"]+)|gis) {
my $url = $+{url};
print "$url\n";
}
Unfortunately, this code is not working quite as expected. Any URL that contains a parenthetical [i.e. /wiki/Eon_(geology)] is getting truncated prematurely just before the opening paren, so that URL would match as /wiki/Eon_. I've been looking at the code for a bit and I cannot figure out what I'm doing wrong. Can anyone provide some insight?
There isn't anything wrong in this code as it stands, so long as your Perl is new enough to support these RE features. Tested with Perl 5.10.1.
$body = <<"__ENDHTML__";
Body Blah blah
Body
__ENDHTML__
while ($body =~ m|<a href="(?<url>/wiki/[^:"]+)|gis) {
my $url = $+{url};
print "$url\n";
}
Are you using an old Perl?
You didn't anchor the RE to the end of the string. Put a " afterwards.
While that is a problem, it isn't the problem he was trying to solve. The problem he was trying to solve was that there was nothing to match the method/hostname (http://en.wiki...) in the RE. Adding a .*? would help that, before the "(?"
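The point about the missing closing quote can be checked with a short Python sketch. The HTML body is hypothetical (the original heredoc content did not survive in the post), and Python writes the named group as (?P<url>...).

```python
import re

# Hypothetical markup: one ordinary article link and one special-page link.
body = (
    '<a href="/wiki/Eon_(geology)">Eon</a> '
    '<a href="/wiki/Special:Random">Random</a>'
)

# As posted: no closing quote after the capture.
unanchored = re.compile(r'<a href="(?P<url>/wiki/[^:"]+)', re.I | re.S)
# With the closing quote required, a URL containing a colon cannot match at all.
anchored = re.compile(r'<a href="(?P<url>/wiki/[^:"]+)"', re.I | re.S)

def wiki_urls(pattern, html):
    """Collect the named 'url' capture from every match in the markup."""
    return [m.group("url") for m in pattern.finditer(html)]

print(wiki_urls(unanchored, body))  # ['/wiki/Eon_(geology)', '/wiki/Special']
print(wiki_urls(anchored, body))    # ['/wiki/Eon_(geology)']
```

In both cases the parentheses in the URL are captured intact, which agrees with the answer that the posted pattern itself does not truncate at "(".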