Parse EML text With Regular Expression

Could you please help me parse EML text with a regular expression?
I want to extract separately:
1). the text between Content-Transfer-Encoding: base64 and --=_alternative, if the line above is Content-Type: text/html
2). the text between Content-Transfer-Encoding: base64 and --=_related, if the line two lines above is Content-Type: image/jpeg
Please take a look at this piece of PowerShell code:
$text = @"
--=_alternative XXXXXXXXXXXXXX_=
Content-Type: text/html; charset="KOI8-R"
Content-Transfer-Encoding: base64
111111111111111111111111111111111111111111111111111111
--=_alternative XXXXXXXXXXXXXX_=
Content-Type: text/html; charset="KOI8-R"
Content-Transfer-Encoding: base64
222222222222222222222222222222222222222222222222222222
--=_alternative XXXXXXXXXXXXXX_=--
--=_related XXXXXXXXXXXXXX_=--_=
Content-Type: image/jpeg
Content-ID: <_2_XXXXXXXXXXXXXX>
Content-Transfer-Encoding: base64
333333333333333333333333333333333333333333333333333333
--=_related XXXXXXXXXXXXXX_=
Content-Type: image/jpeg
Content-ID: <_2_XXXXXXXXXXXXXX>
Content-Transfer-Encoding: base64
444444444444444444444444444444444444444444444444444444
--=_related XXXXXXXXXXXXXX_=
Content-Type: image/jpeg
Content-ID: <_2_XXXXXXXXXXXXXX>
Content-Transfer-Encoding: base64
555555555555555555555555555555555555555555555555555555
--=_related XXXXXXXXXXXXXX_=--
"@
$regex1 = "(?ms).+?Content-Transfer-Encoding: base64(.+?)--=_alternative"
$text1 = ([regex]::Matches($text,$regex1) | foreach {$_.groups[1].value})
Write-Host "text1 : " -fore red
Write-Host $text1
#I want to get as output elements (of array, maybe, or one after another)
#1). text between Content-Transfer-Encoding: base64 and --=_alternative, if there is above line Content-Type: text/html
#this
#1111111111111111111111111111111111111111111111111111111
#then this
#2222222222222222222222222222222222222222222222222222222
$regex2 = "(?ms).+?Content-Transfer-Encoding: base64(.+?)--=_related"
$text2 = ([regex]::Matches($text,$regex2) | foreach {$_.groups[1].value})
#I want to get as output elements (of array, maybe, or one after another)
#2). text between Content-Transfer-Encoding: base64 and --=_related, if there is two lines above line Content-Type: image/jpeg
#this
#3333333333333333333333333333333333333333333333333333333
#then this
#4444444444444444444444444444444444444444444444444444444
#then this
#5555555555555555555555555555555555555555555555555555555
Write-Host "text2 : " -fore red
Write-Host $text2
Thanks for your help. Have a nice day.
P.S. Based on Jessie Westlake's code, here is a slightly edited version of the regex that worked for me:
$files = Get-ChildItem -Path "\\<SERVER_NAME>\mailroot\Drop"
Foreach ($file in $files){
$text = Get-Content $file.FullName
$RegexText = '(?:Content-Type: text/html.+?Content-Transfer-Encoding: base64(.+?)(?:--=_))'
$RegexImage = '(?:Content-Type: image/jpeg.+?Content-Transfer-Encoding: base64(.+?)(?:--=_))'
$TextMatches = [Regex]::Matches($text, $RegexText, [System.Text.RegularExpressions.RegexOptions]::Singleline)
$ImageMatches = [Regex]::Matches($text, $RegexImage, [System.Text.RegularExpressions.RegexOptions]::Singleline)
If ($TextMatches[0].Success)
{
Write-Host "Found $($TextMatches.Count) Text Matches:"
Write-Output $TextMatches.ForEach({$_.Groups[1].Value})
}
If ($ImageMatches[0].Success)
{
Write-Host "Found $($ImageMatches.Count) Image Matches:"
Write-Output $ImageMatches.ForEach({$_.Groups[1].Value})
}
}

TL;DR : Just go to the code at the bottom...
The code below is pretty ugly, so forgive me.
Essentially, I created a regular expression that starts matching at Content-Type: text/html. It lazily matches anything after that until it hits a newline \n, a carriage return \r, or the combination \r\n.
You have to wrap those in parentheses in order to use the or | operator. We don't want to actually capture/return any of those groups, so we use the non-capturing group syntax of (?:text-to-match). We use this elsewhere as you can see. You can place capturing and non-capturing groups inside of each other too.
Anyway, continuing on. After matching the new line, we want to see Content-Transfer-Encoding: base64. That seems to be required in each of your examples.
After that we want to identify the next newline, just like last time, except that this time we want to match one or more by using +. We need to match more than one because your data is sometimes preceded by an extra line. Since it is sometimes NOT preceded by an extra line, we make the quantifier "lazy" by following the plus with a question mark: +?.
After that comes the part where we will be capturing your actual data. This will be the first time we use an actual capturing group, versus a non-capturing group (i.e. no question mark followed by a colon).
We want to capture anything that is NOT a new line, because it seems that sometimes your data is followed by a new line and sometimes not. By not allowing ourselves to capture any new lines, it will also force our previous group to gobble up any extra new lines that are preceding our data. That capturing group is ([^(?:\n|\n\r)]+)
What we were doing there is wrapping the regex in parentheses in order to capture it. We place the expression inside brackets because we want to create our own "class" of characters. Any of the characters inside the brackets is what our code is looking for. The difference with ours, though, is that we put a caret ^ as the first character inside the brackets. That means NOT any of these characters. Note that inside brackets the characters (, ?, :, | and ) are literal rather than grouping syntax, so the class also excludes those; that is harmless here because base64 data never contains them. We want to match everything until the next line, so we capture anything that is not a newline, once or more, as many times as possible.
We then make sure our regex is anchored to some ending text, so we keep trying to match. Starting with another newline matching at least one, but as few as required to make our capture a success (?:\n|\r|\r\n)+?.
Lastly, we anchor to what we know for sure will be where we can stop looking for our important data. And that is the --=_. I wasn't sure if we would stumble across an "alternative" word or "related", so I didn't go that far. Now it's done.
THE KEY TO IT ALL
We wouldn't be able to match across newlines without the regular expression "Singleline" mode, which makes . match newline characters too. To enable it here, we call the .NET [Regex]::Matches method directly and pass an option from the [System.Text.RegularExpressions.RegexOptions] enumeration. The relevant options are "Singleline" and "Multiline".
I create a separate regex for the text/html and the image/jpeg searches. We save the results of those matches into their respective variables.
We can test whether the search succeeded by indexing into element 0 of the results, which holds the first match object, and reading its .Success property, which returns a boolean value. The number of matches is available through the .Count property. To reach a specific capture, we use dot notation on the match and pick the appropriate index in its Groups collection. Since we use only one capturing group and the rest are non-capturing, Groups[0] holds the entire matched text and Groups[1] holds the contents of our capture group. Because each group is an object, we read its .Value property to get the text.
Obviously the below code will require your $text variable to contain the data to search.
$RegexText = '(?:Content-Type: text/html.+?(?:\n|\r|\r\n)Content-Transfer-Encoding: base64(?:\n|\r|\r\n)+?([^(?:\n|\n\r)]+)(?:\n|\r|\r\n)+?(?:\n|\r|\r\n)(?:--=_))'
$RegexImage = '(?:Content-Type: image/jpeg.+?(?:\n|\r|\r\n)Content-Transfer-Encoding: base64(?:\n|\r|\r\n)+?([^(?:\n|\n\r)]+)(?:\n|\r|\r\n)+?(?:\n|\r|\r\n)(?:--=_))'
$TextMatches = [Regex]::Matches($text, $RegexText, [System.Text.RegularExpressions.RegexOptions]::Singleline)
$ImageMatches = [Regex]::Matches($text, $RegexImage, [System.Text.RegularExpressions.RegexOptions]::Singleline)
If ($TextMatches[0].Success)
{
Write-Host "Found $($TextMatches.Count) Text Matches:"
Write-Output $TextMatches.ForEach({$_.Groups[1].Value})
}
If ($ImageMatches[0].Success)
{
Write-Host "Found $($ImageMatches.Count) Image Matches:"
Write-Output $ImageMatches.ForEach({$_.Groups[1].Value})
}
The code above results in the below output to the screen:
Found 2 Text Matches:
111111111111111111111111111111111111111111111111111111
222222222222222222222222222222222222222222222222222222
Found 3 Image Matches:
333333333333333333333333333333333333333333333333333333
444444444444444444444444444444444444444444444444444444
555555555555555555555555555555555555555555555555555555
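For readers outside PowerShell, the same idea can be sketched in JavaScript, where the s (dotAll) flag plays the role of the Singleline option. This is only an illustration under assumptions: the shortened sample text, the blank line after the headers, and the helper name extractBase64Parts are made up for the example and are not part of the code above.

```javascript
// Shortened sample modelled on the question; the blank line after the
// headers and the CRLF line endings are assumptions for this sketch.
const text = [
  "--=_alternative XXXXXXXXXXXXXX_=",
  'Content-Type: text/html; charset="KOI8-R"',
  "Content-Transfer-Encoding: base64",
  "",
  "1111",
  "--=_alternative XXXXXXXXXXXXXX_=",
  'Content-Type: text/html; charset="KOI8-R"',
  "Content-Transfer-Encoding: base64",
  "",
  "2222",
  "--=_alternative XXXXXXXXXXXXXX_=--",
  "--=_related XXXXXXXXXXXXXX_=",
  "Content-Type: image/jpeg",
  "Content-ID: <_2_XXXXXXXXXXXXXX>",
  "Content-Transfer-Encoding: base64",
  "",
  "3333",
  "--=_related XXXXXXXXXXXXXX_=--",
].join("\r\n");

// Hypothetical helper: returns the base64 body of every part whose
// Content-Type header starts with the given type.
function extractBase64Parts(source, contentType) {
  // Escape regex metacharacters in the content type before embedding it.
  const escaped = contentType.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(
    `Content-Type: ${escaped}.+?Content-Transfer-Encoding: base64\\s+(.+?)\\s*--=_`,
    "gs" // g: find every part, s: let "." cross line breaks
  );
  return Array.from(source.matchAll(pattern), (m) => m[1]);
}

console.log(JSON.stringify(extractBase64Parts(text, "text/html")));  // ["1111","2222"]
console.log(JSON.stringify(extractBase64Parts(text, "image/jpeg"))); // ["3333"]
```

Like the PowerShell version, this does no real MIME parsing: a part of the requested type that lacks the base64 header would cause the lazy match to run on into the next part.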

Related

REGEX Pattern for Username inside a longer string

MAC OSX, PowerShell 6.1 Core
I'm struggling with creating the correct REGEX pattern to find a username string in the middle of a url. In short, I'm working in Powershell Core 6.1 and pulling down a webpage and scraping out the "li" elements. I write this to a file so I have a bunch of lines like this:
<LI>Smith, Jimmy
The string I need is the "jimmysmith" part, and every line will have a different username, no longer than eight alpha characters. My current pattern is this:
(<(.|\n)+?>)|( )
and I can use a "-replace $pattern" in my code to grab the "Smith, Jimmy" part. I have no idea what I'm doing, and any success in getting what I did get was face-roll-luck.
After using several online regex helpers, I'm still stuck on how to get just the string after the third "/", up to but not including the last quote.
Thank you for any assistance you can give me.

You could go super-simple,
expand-user/([^"]+)
Find expand-user, then capture until a quotation.

(?:\/.*){2}\/(?<username>.*)"
(?:\/.*) Matches a literal / followed by any number of characters
{2} do the previous match two times
\/ match another /
(?<username>.*)" match everything up until the last " on the line (the .* is greedy) and put it in the username group.
https://regex101.com/r/0gj7yG/1
Although, since each line is presumably identical up until the username:
$line = ("<LI>Smith, Jimmy ")
$line = $line.Substring(36, $line.LastIndexOf('"') - 36)
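Both patterns can be compared side by side in JavaScript, which also supports named groups. The input line below is hypothetical: the real href was not shown in the question, so the /people/expand-user/ path is an assumption chosen so that the username follows the third slash.

```javascript
// Hypothetical scraped line; the actual href is not shown in the question.
const line = '<LI><a href="/people/expand-user/jimmysmith">Smith, Jimmy</a>';

// Simple pattern: find "expand-user/", then capture up to the next quote.
const simple = /expand-user\/(?<username>[^"]+)/;

// Slash-counting pattern from the answer, unchanged.
const counted = /(?:\/.*){2}\/(?<username>.*)"/;

const simpleMatch = line.match(simple);
const countedMatch = line.match(counted);

console.log(simpleMatch ? simpleMatch.groups.username : null);   // jimmysmith
console.log(countedMatch ? countedMatch.groups.username : null); // jimmysmith
```

The simple pattern depends only on the literal text expand-user/, so it keeps working if the number of slashes before it changes.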

The answer is what Dave posted. I saved my scraped details to a file (the lines with "li") by doing:
get-content .\list.txt -ReadCount 1000| foreach-object { $_ -match "<li>"} |out-file .\transform.txt
I then used the method proposed by Dave as follows:
$a = get-content .\transform.txt |select-string -pattern '(?:\/.*){2}\/(?<username>.*)"' | % {"$($_.matches.groups[1])"} |out-file .\final.txt
I had to look up how to pull the group out, and I used this reference to figure that out: How to get the captured groups from Select-String?

Regex to match and optionally remove a / at the end of an url

I have a form that should accept URLs in the following formats:
http://website.com/test/1
http://website.com/test/1/
My regex is currently this:
url.match(/website.com(.*)/);
I want the capture group to match the content and automatically remove the last "/" at the end of the URL, so that, no matter if there is / or not, it would always return "/test/1" . How?

Try using the below regex,
url.match(/website.com(.*)\b/);
It should work for your case. Let me know if it does not.
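An alternative sketch makes the trailing slash explicit instead of relying on a word boundary: a lazy capture followed by an optional slash anchored at the end of the string. The helper name pathWithoutTrailingSlash is made up for this example, and the dot in website.com is escaped here, which the original pattern does not do.

```javascript
// Hypothetical helper: returns the part after the host, without a trailing slash.
function pathWithoutTrailingSlash(url) {
  // (.*?) is lazy, so the optional "/" before the end of the string
  // is left out of the capture group whenever it is present.
  const m = url.match(/website\.com(.*?)\/?$/);
  return m ? m[1] : null;
}

console.log(pathWithoutTrailingSlash("http://website.com/test/1"));  // /test/1
console.log(pathWithoutTrailingSlash("http://website.com/test/1/")); // /test/1
```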

Hmm... Regexes can't really "remove" characters. Say, for example, you have a field where you can enter a URL. You can stop the user from adding a trailing "/", either by showing an error message such as "Remove the / at the end of your URL" or by not allowing it at all (with JavaScript, so that when they try to add it, the "/" gets removed).
I'll give you a regex expression that will check for everything in your URL:
(https?:\/\/)?(([A-Za-z0-9]+)((\.[a-z]{1,3})))(\/\w+)\/(\d)\/?
Here's what this regex expression does:
First it checks to see if http:// or https:// is present in the URL: (https?:\/\/)?
The https? means that the "s" after http is optional; a ? makes whatever precedes it optional, and that could be any character: a? b? c? 1? 2? 3?. The same goes for the "?" after the group (https?:\/\/), but there it makes the entire http:// or https:// optional. That means a URL like example.com (without http or https at the beginning) would be matched too.
Then we have this entire section of the expression: (([A-Za-z0-9]+)((\.[a-z]{1,3}))). Let's break it down a bit:
([A-Za-z0-9]+)
Here it checks for any letters or digits, uppercase or lowercase (example: "website"), until it meets ((\.[a-z]{1,3})), which matches a dot followed by one to three lowercase letters (example: .com).
So (([A-Za-z0-9]+)((\.[a-z]{1,3}))) would match, to mention a few examples: stackoverflow.com, twitter.com, google.se, but not example.online, because {1,3} allows only between 1 and 3 letters.
Then we have the last part: (\/\w+)\/(\d)\/?. First we have (\/\w+), which checks for a word following a slash, for example: "/test". The \w matches a single word character (a letter, digit, or underscore), so \w+ matches a whole word.
After that it checks for a "/": \/, and lastly a single digit (\d) following the "/", so for example: "/1". At the end of this regex we have \/?, which allows an optional trailing slash.
So in PHP this regex expression could be used like this:
$pattern = "/(https?:\/\/)?(([A-Za-z0-9]+)((\.[a-z]{1,3})))(\/\w+)\/(\d)\/?/";
$url = "https://example.com/user/1/";
if(preg_match($pattern, $url, $matches)){
echo $matches[1]; // Will echo https://
echo $matches[2]; // Will echo "example.com"
echo $matches[3]; // Will echo "example"
echo $matches[4]; // Will echo ".com"
echo $matches[5]; // Will echo ".com"
echo $matches[6]; // Will echo "/user"
echo $matches[7]; // Will echo "1"
var_dump($matches); // Will dump the array
}
Hope this helps.
Edit: Of course, the regex can be written in other ways, and the syntax differs between languages. This is just an example of how I usually build my regexes: I always structure one and break it down into parts so I can more easily see what is what and think of everything I want to check for.
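The same breakdown can be made easier to read with named groups. The sketch below is in JavaScript rather than PHP; the group names, the ^ and $ anchors, the flattened nesting, and the helper name parseUrl are additions for illustration and are not part of the pattern above.

```javascript
// Same building blocks as the pattern above, with names instead of numbers.
// The anchors and group names are assumptions added for this sketch.
const urlPattern =
  /^(?<scheme>https?:\/\/)?(?<host>[A-Za-z0-9]+\.[a-z]{1,3})(?<section>\/\w+)\/(?<id>\d)\/?$/;

// Hypothetical helper: returns the named parts, or null when nothing matches.
function parseUrl(url) {
  const m = urlPattern.exec(url);
  return m ? { ...m.groups } : null;
}

console.log(parseUrl("https://example.com/user/1/"));
console.log(parseUrl("example.com/user/1"));            // scheme is undefined
console.log(parseUrl("https://example.online/user/1")); // null: "online" exceeds {1,3}
```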

Exclude a slash in capture group of regex

I have a string with a value like:
"urls":[
{
"url":"https:\/\/t.co\/OjiDUThEvK",
"expanded_url":"http:(escape sequence slash)/(escape sequence slash)/fb.me\/7Wnh0hMLL",
"display_url":"fb.me(escape sequence slash)/7Wnh0hMLL",
"indices":[48,71]}],
"user_mentions":[],
"symbols":[]
}
]
I need to capture only the "expanded_url" value. I tried the following regex:
"expanded_url"\:\"http\:\\\/\\\/(.*?)\"
this gave a result :
"fb.me(escape sequence slash)/7Wnh0hMLL"
But I want to exclude the escape sequence slash from the URL. Is it possible to achieve that? Kindly let me know what changes should be made to the regex.

I'm not 100% sure if this is what you're after. Can you post the raw input without the "(escape sequence slash)" part? I'm assuming that this is actually / in the text you're matching against.
match:
\"expanded_url\":\"http:\\\/\\\/([^\\]*)\\\/([^\\"]*)\"
replace with:
$1/$2
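A single capture group always returns a contiguous piece of the input, so the escape characters have to be removed in a second step or by a parser. The JavaScript sketch below shows both routes; it assumes that "(escape sequence slash)" in the question stands for a backslash, and the shortened input string is made up for the example.

```javascript
// Assumed raw input: each "/" in the URL is preceded by a backslash.
const raw = String.raw`{"expanded_url":"http:\/\/fb.me\/7Wnh0hMLL","indices":[48,71]}`;

// Option 1: capture the value, then replace every "\/" with "/".
const m = raw.match(/"expanded_url":"(.*?)"/);
const viaRegex = m ? m[1].replace(/\\\//g, "/") : null;

// Option 2: let a JSON parser decode the escapes.
const viaJson = JSON.parse(raw).expanded_url;

console.log(viaRegex); // http://fb.me/7Wnh0hMLL
console.log(viaJson);  // http://fb.me/7Wnh0hMLL
```

If the text really is JSON, the second option is usually the more robust of the two, since "\/" is a standard JSON escape for "/".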

Extract text from "content-disposition: attachment" body part

I regularly receive a generated email message containing a text part and a text attachment. I want to test whether the attachment is base64 encoded, then decode it like this:
:0B
* ^(Content-Transfer-Encoding: *base64(($)[a-z0-9].*)*($))
{
msgID=`printf '%s' "$MATCH" | base64 -d`
}
But it always says invalid input. Does anyone know what's wrong?
procmail: Match on "^()\/[a-z]+[0-9]+[^\+]"
procmail: Assigning "msgID=PGh0b"
procmail: matched "^(Content-Disposition: *attachment.*(($)[a-z0-9].*)* |Content-Transfer-Encoding: *base64(($)[a-z0-9].*)*($)"
procmail: Executing "printf '%s' "$MATCH" | base64 -d"
base64: invalid input
procmail: Assigning "msgID=<ht"
procmail: Unexpected EOL
procmail: Assigning "msgID=PGh0b"
procmail: Match on "^(Content-Transfer-Encoding: *base64(($)[a-z0-9].*)*($))"
procmail: Executing "printf '%s' "$MATCH" | base64 -d"
base64: invalid input
procmail: Assigning "msgID=<ht"
procmail: Unexpected EOL

If your requirements are complex, it might be easier to write a dedicated script which extracts the information you want -- a modern scripting language with proper MIME support is going to be a lot more versatile when it comes to all the myriad different possibilities for content encoding and body part structure in modern MIME email.
The following finds the first occurrence of MIME headers with Content-Disposition: attachment and extracts the first token of the following body. This might do what you want if you are corresponding with a sender who uses a well-defined, static template. There is no real MIME parsing here, so (say) a forwarded message which happens to contain an embedded part which matches the pattern will also trigger the conditions. (This can be a bug, or a feature.)
A useful but not frequently used feature of Procmail is the ability to write a regular expression which spans multiple lines. Within a regex, ($) always matches a literal newline. So with that, we can look for a Content-Disposition: attachment header followed by other headers (zero or more) followed by an empty line, followed by the token you want to extract.
:0B
* ^Content-Disposition: *attachment.*(($)[A-Z].*)*($)($)\/[A-Z]+[0-9]+
{ msgid="$MATCH" }
For simplicity, I have not attempted to cope with multi-line MIME headers. If you want to support that, the fix should be reasonably obvious, though not at all elegant.
In the somewhat more general case, you might want to add a condition to check that the group of MIME headers in the condition also contains a Content-type: text/plain; you will need to set up two alternatives for having Content-type: before or after Content-disposition: (or somehow normalize the MIME headers before getting to this recipe; or trust that the sender always generates them in exactly the order in the sample message).
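The debug log in the question hints at one likely contributor to the "invalid input" message: the value assigned to msgID is PGh0b, which is five characters, while base64 is decoded in groups of four. A small JavaScript sketch (assuming Node.js and its Buffer class) reproduces the partial result seen in the log; it is an illustration, not a diagnosis of the full recipe.

```javascript
// Token captured by the recipe, as shown in the procmail log in the question.
const token = "PGh0b";

// Base64 is decoded four characters at a time; a leftover fragment is incomplete.
const completeLength = token.length - (token.length % 4);
const decodable = token.slice(0, completeLength);

// Decode only the complete part.
const decoded = Buffer.from(decodable, "base64").toString("latin1");

console.log(decoded);                // <ht
console.log(token.length % 4 !== 0); // true: the captured token is truncated
```

This matches the log, where the decoder emits "<ht" and then reports invalid input, which suggests the match stopped too early rather than that the attachment itself is malformed.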

Regex to get a filename from a url

I am trying to write a regex to get the filename from a url if it exists.
This is what I have so far:
(?:[^/][\d\w\.]+)+$
So from the url http://www.foo.com/bar/baz/filename.jpg, I should match filename.jpg
Unfortunately, I match anything after the last /.
How can I tighten it up so it only grabs it if it looks like a filename?

The examples above fail to get the file name "file-1.name.zip" from this URL:
"http://sub.domain.com/sub/sub/handler?file=data/file-1.name.zip&v=1"
So I created my REGEX version:
[^/\\&\?]+\.\w{3,4}(?=([\?&].*$|$))
Explanation:
[^/\\&\?]+ # file name - group of chars without URL delimiters
\.\w{3,4} # file extension - 3 or 4 word chars
(?=([\?&].*$|$)) # positive lookahead to ensure that the file name is at the end of the string or is followed by query string parameters, which are to be ignored
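The pattern can be tried out with a short JavaScript sketch. The helper name fileNameFromUrl is made up for the example, and the slash inside the character class is escaped because the pattern is written as a regex literal.

```javascript
// Pattern from the explanation above, written as a JavaScript regex literal.
const fileNamePattern = /[^\/\\&?]+\.\w{3,4}(?=([?&].*$|$))/;

// Hypothetical helper: returns the file name, or null when none is found.
function fileNameFromUrl(url) {
  const m = url.match(fileNamePattern);
  return m ? m[0] : null;
}

console.log(fileNameFromUrl("http://www.foo.com/bar/baz/filename.jpg"));
// filename.jpg
console.log(fileNameFromUrl("http://sub.domain.com/sub/sub/handler?file=data/file-1.name.zip&v=1"));
// file-1.name.zip
console.log(fileNameFromUrl("http://www.foo.com/bar/baz/"));
// null
```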

This one works well for me.
(\w+)(\.\w+)+(?!.*(\w+)(\.\w+)+)

(?:.+\/)(.+)
Select all up to the last forward slash (/), capture everything after this forward slash. Use subpattern $1.

Non Pcre
(?:[^/][\d\w\.]+)$(?<=\.\w{3,4})
Pcre
(?:[^/][\d\w\.]+)$(?<=(?:.jpg)|(?:.pdf)|(?:.gif)|(?:.jpeg)|(more_extension))
Demo
Since you are testing on regexpal.com, which is based on JavaScript (it doesn't support lookbehind), try this instead:
(?=\w+\.\w{3,4}$).+

I'm using this:
(?<=\/)[^\/\?#]+(?=[^\/]*$)
Explanation:
(?<=): positive look behind, asserting that a string has this expression, but not matching it.
(?<=/): positive look behind for the literal forward slash "/", meaning I'm looking for an expression which is preceded, but does not match a forward slash.
[^/\?#]+: one or more characters which are not either "/", "?" or "#", stripping search params and hash.
(?=[^/]*$): positive look ahead for anything not matching a slash, then matching the line ending. This is to ensure that the last forward slash segment is selected.
Example usage:
const urlFileNameRegEx = /(?<=\/)[^\/\?#]+(?=[^\/]*$)/;
const testCases = [
"https://developer.mozilla.org/en-US/docs/Web/API/MutationObserverInit#yo",
"https://developer.mozilla.org/static/fonts/locales/ZillaSlab-Regular.subset.bbc33fb47cf6.woff2",
"https://developer.mozilla.org/static/build/styles/locale-en-US.520ecdcaef8c.css?is-nice=true"
];
testCases.forEach(testStr => console.log(`The file of ${testStr} is ${urlFileNameRegEx.exec(testStr)[0]}`))

It might work as well:
(\w+\.)+\w+$

You know what your delimiters look like, so you don't need a regex. Just split the string. Since you didn't mention a language, here's an implementation in Perl:
use strict;
use warnings;
my $url = "http://www.foo.com/bar/baz/filename.jpg";
my @url_parts = split/\//,$url;
my $filename = $url_parts[-1];
if(index($filename,".") > 0 )
{
print "It appears as though we have a filename of $filename.\n";
}
else
{
print "It seems as though the end of the URL ($filename) is not a filename.\n";
}
Of course, if you need to worry about specific filename extensions (png,jpg,html,etc), then adjust appropriately.

> echo "http://www.foo.com/bar/baz/filename.jpg" | sed 's/.*\/\([^\/]*\..*\)$/\1/g'
filename.jpg

Assuming that you will be using javascript:
var fn=window.location.href.match(/([^/])+/g);
fn = fn[fn.length-1]; // get the last element of the array
alert(fn.substring(0,fn.indexOf('.')));//alerts the filename

Here is the code you may use:
\/([\w.][\w.-]*)(?<!\/\.)(?<!\/\.\.)(?:\?.*)?$
The names "." and ".." are not treated as file names.
You can play with this regexp here: https://regex101.com/r/QaAK06/1/

In case you are using the JavaScript URL object, you can use the pathname combined with the following RegExp:
.*\/(.[^(\/)]+)
Benefit:
It matches anything at the end of the path, but excludes a possible trailing slash (as long as there aren't two trailing slashes)!
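A minimal sketch of that combination, assuming an environment where the global URL class is available (browsers and current Node.js); the helper name lastPathSegment is made up for the example.

```javascript
// Hypothetical helper: applies the pattern above to the pathname only,
// so the query string and hash never reach the regex.
function lastPathSegment(href) {
  const { pathname } = new URL(href);
  // Note: the pattern needs at least two characters in the last segment.
  const m = pathname.match(/.*\/(.[^(\/)]+)/);
  return m ? m[1] : null;
}

console.log(lastPathSegment("http://www.foo.com/bar/baz/filename.jpg?x=1#top")); // filename.jpg
console.log(lastPathSegment("http://www.foo.com/bar/baz/"));                     // baz
```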

Try this one instead:
(?:[^/]*+)$(?<=\..*)

This worked for me. Whether or not there is a '.', it takes the suffix of the URL:
\/(\w+)[\.|\w]+$
