Extract text from "content-disposition: attachment" body part - procmail

I regularly receive a generated email message containing a text part and a text attachment. I want to test whether the attachment is base64-encoded and, if so, decode it, like this:
:0B
* ^(Content-Transfer-Encoding: *base64(($)[a-z0-9].*)*($))
{
msgID=`printf '%s' "$MATCH" | base64 -d`
}
But it always says "invalid input". Does anyone know what's wrong?
procmail: Match on "^()\/[a-z]+[0-9]+[^\+]"
procmail: Assigning "msgID=PGh0b"
procmail: matched "^(Content-Disposition: *attachment.*(($)[a-z0-9].*)* |Content-Transfer-Encoding: *base64(($)[a-z0-9].*)*($)"
procmail: Executing "printf '%s' "$MATCH" | base64 -d"
base64: invalid input
procmail: Assigning "msgID=<ht"
procmail: Unexpected EOL
procmail: Assigning "msgID=PGh0b"
procmail: Match on "^(Content-Transfer-Encoding: *base64(($)[a-z0-9].*)*($))"
procmail: Executing "printf '%s' "$MATCH" | base64 -d"
base64: invalid input
procmail: Assigning "msgID=<ht"
procmail: Unexpected EOL

If your requirements are complex, it might be easier to write a dedicated script which extracts the information you want -- a modern scripting language with proper MIME support is going to be a lot more versatile when it comes to all the myriad different possibilities for content encoding and body part structure in modern MIME email.
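As a rough illustration of the "dedicated script" route, here is a minimal Python sketch using the standard library's email package. The function name first_attachment_text and the SAMPLE message are made up for this example; the sketch assumes the attachment is a single text part and lets the library undo the base64 (or quoted-printable) transfer encoding.

```python
from email import message_from_string
from typing import Optional

# Hypothetical sample message; "PGh0bWw+" is the base64 form of "<html>".
SAMPLE = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="b1"

--b1
Content-Type: text/plain

Body text.
--b1
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: attachment; filename="id.txt"
Content-Transfer-Encoding: base64

PGh0bWw+
--b1--
"""


def first_attachment_text(raw_message: str) -> Optional[str]:
    """Return the decoded text of the first attachment part, or None."""
    msg = message_from_string(raw_message)
    for part in msg.walk():
        if part.get_content_disposition() != "attachment":
            continue
        # decode=True reverses the Content-Transfer-Encoding for us.
        payload = part.get_payload(decode=True)
        if payload is None:  # multipart containers carry no payload of their own
            continue
        charset = part.get_content_charset() or "utf-8"
        return payload.decode(charset, errors="replace")
    return None


if __name__ == "__main__":
    print(first_attachment_text(SAMPLE))
```

From Procmail, a script like this could be called from a recipe that pipes the message to it; how you wire that up depends on your setup.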
The following finds the first occurrence of MIME headers with Content-Disposition: attachment and extracts the first token of the following body. This might do what you want if you are corresponding with a sender who uses a well-defined, static template. There is no real MIME parsing here, so (say) a forwarded message which happens to contain an embedded part which matches the pattern will also trigger the conditions. (This can be a bug, or a feature.)
A useful but not frequently used feature of Procmail is the ability to write a regular expression which spans multiple lines. Within a regex, ($) always matches a literal newline. So with that, we can look for a Content-Disposition: attachment header followed by other headers (zero or more) followed by an empty line, followed by the token you want to extract.
:0B
* ^Content-Disposition: *attachment.*(($)[A-Z].*)*($)($)\/[A-Z]+[0-9]+
{ msgid="$MATCH" }
For simplicity, I have not attempted to cope with multi-line MIME headers. If you want to support that, the fix should be reasonably obvious, though not at all elegant.
In the somewhat more general case, you might want to add a condition to check that the group of MIME headers in the condition also contains a Content-type: text/plain; you will need to set up two alternatives for having Content-type: before or after Content-disposition: (or somehow normalize the MIME headers before getting to this recipe; or trust that the sender always generates them in exactly the order in the sample message).
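For readers who want to experiment with the multi-line pattern outside Procmail, here is an approximate Python translation. It is only a sketch: the names ATTACHMENT_TOKEN, extract_token and SAMPLE_BODY are invented for the example, Procmail's ($) is written as \n, and IGNORECASE stands in for Procmail's default case-insensitive matching.

```python
import re
from typing import Optional

ATTACHMENT_TOKEN = re.compile(
    r"^Content-Disposition: *attachment.*"  # the header we key on
    r"(?:\n[A-Z].*)*"                       # zero or more further header lines
    r"\n\n"                                 # the empty line that ends the headers
    r"([A-Z]+[0-9]+)",                      # the token at the start of the body
    re.IGNORECASE | re.MULTILINE,
)

# Hypothetical body part for demonstration.
SAMPLE_BODY = (
    "--b1\n"
    "Content-Type: text/plain\n"
    'Content-Disposition: attachment; filename="id.txt"\n'
    "Content-Transfer-Encoding: 7bit\n"
    "\n"
    "ABC12345 some more text\n"
    "--b1--\n"
)


def extract_token(message_text: str) -> Optional[str]:
    """Return the first token after an attachment's headers, or None."""
    match = ATTACHMENT_TOKEN.search(message_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_token(SAMPLE_BODY))
```

Like the recipe, this does no real MIME parsing and does not handle folded (multi-line) headers.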

Related

Special (Meta) Characters while using Regular Expression Extractor in JMeter

In a response there is a string like this:
,"Software":"abcde",
and I want to select abcde.
I wrote the regular expression as ,"Software":"([^"]+)",
Then I used this variable in a GET request and a POST request.
It works in the GET request but not in the POST request.
When I look at the POST request I see some % characters, but they are not part of the abcde string.
The abcde string may contain special characters such as + and /, and they are not transmitted properly in the request.
Why do you think that is? And is there any solution for this?
Check the variable value using the Debug Sampler and View Results Tree listener combination. If the variable contains these "special" characters and they should not be there, amend your regular expression accordingly.
If your response is in JSON format, it is worth considering switching to the JSON Extractor.
With regard to these % and / characters, could it be connected with URL encoding?
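To see what URL encoding does to such a value, here is a small Python sketch (not JMeter code). The value in extracted is a made-up stand-in for the captured string; the point is only that + and / turn into %-sequences when percent-encoded, which would explain the % characters seen in the request.

```python
from urllib.parse import quote, unquote

# Hypothetical extracted value containing "+" and "/".
extracted = "ab+c/d=="

# safe="" asks quote() to encode "/" as well, as form encoding would.
encoded = quote(extracted, safe="")
print(encoded)  # ab%2Bc%2Fd%3D%3D

# Decoding gives the original value back, so nothing is lost in transit
# as long as both sides agree on whether the value is encoded.
decoded = unquote(encoded)
print(decoded == extracted)  # True
```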

Regex Match Sentence Except When it Contains a Word

I'm trying to write a regex that our mail filter can use to analyse the message headers of an email and look for a specific header and act upon it depending on whether a word is in that header...
So:
I want to match:
X-FEAS-ANTIVIRUS: FortiSandbox:
except when it is
X-FEAS-ANTIVIRUS: FortiSandbox: uri
I've tried:
/(?:X\-FEAS\-ANTIVIRUS\:\ FortiSandbox\:\ \ (?:[^uri]*))/
But it didn't work, so I tried:
/(?:[^X\-FEAS\-ANTIVIRUS\:\ FortiSandbox\:\ \ uri]*)(?:X\-FEAS\-ANTIVIRUS\:\ FortiSandbox\:)/
And neither did that.
I need to be specific about the whole "X-FEAS-ANTIVIRUS: FortiSandbox:" header so that there's no chance that the action is taken in error if it matched a different header by accident that contained "uri".
Thank you!
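One common way to express "match this header unless it is followed by a given word" is a negative lookahead. The Python sketch below illustrates the idea; it assumes the mail filter's regex engine supports (?!...), which is not confirmed here, and the names HEADER and should_act are invented for the example. Note also that a class such as [^uri] excludes the single characters u, r and i rather than the word "uri".

```python
import re

# Match the full header name and value prefix, but only when the value
# is NOT "uri". [ \t]* keeps the lookahead on the same line.
HEADER = re.compile(
    r"^X-FEAS-ANTIVIRUS: FortiSandbox:(?![ \t]*uri\b)",
    re.MULTILINE,
)


def should_act(headers: str) -> bool:
    """True when the FortiSandbox header is present without the 'uri' value."""
    return HEADER.search(headers) is not None


if __name__ == "__main__":
    print(should_act("X-FEAS-ANTIVIRUS: FortiSandbox: malware"))  # True
    print(should_act("X-FEAS-ANTIVIRUS: FortiSandbox: uri"))      # False
```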

Parse EML text With Regular Expression

Could you please help me parse EML text with a regular expression?
I want to get separately:
1). The text between Content-Transfer-Encoding: base64 and --=_alternative, if the line above is Content-Type: text/html
2). The text between Content-Transfer-Encoding: base64 and --=_related, if the line two lines above is Content-Type: image/jpeg
Please take a look at this piece of PowerShell code:
$text = @"
--=_alternative XXXXXXXXXXXXXX_=
Content-Type: text/html; charset="KOI8-R"
Content-Transfer-Encoding: base64
111111111111111111111111111111111111111111111111111111
--=_alternative XXXXXXXXXXXXXX_=
Content-Type: text/html; charset="KOI8-R"
Content-Transfer-Encoding: base64
222222222222222222222222222222222222222222222222222222
--=_alternative XXXXXXXXXXXXXX_=--
--=_related XXXXXXXXXXXXXX_=--_=
Content-Type: image/jpeg
Content-ID: <_2_XXXXXXXXXXXXXX>
Content-Transfer-Encoding: base64
333333333333333333333333333333333333333333333333333333
--=_related XXXXXXXXXXXXXX_=
Content-Type: image/jpeg
Content-ID: <_2_XXXXXXXXXXXXXX>
Content-Transfer-Encoding: base64
444444444444444444444444444444444444444444444444444444
--=_related XXXXXXXXXXXXXX_=
Content-Type: image/jpeg
Content-ID: <_2_XXXXXXXXXXXXXX>
Content-Transfer-Encoding: base64
555555555555555555555555555555555555555555555555555555
--=_related XXXXXXXXXXXXXX_=--
"@
$regex1 = "(?ms).+?Content-Transfer-Encoding: base64(.+?)--=_alternative"
$text1 = ([regex]::Matches($text,$regex1) | foreach {$_.groups[1].value})
Write-Host "text1 : " -fore red
Write-Host $text1
#I want to get as output elements (of array, maybe, or one after another)
#1). text between Content-Transfer-Encoding: base64 and --=_alternative, if there is above line Content-Type: text/html
#this
#1111111111111111111111111111111111111111111111111111111
#then this
#2222222222222222222222222222222222222222222222222222222
$regex2 = "(?ms).+?Content-Transfer-Encoding: base64(.+?)--=_related"
$text2 = ([regex]::Matches($text,$regex2) | foreach {$_.groups[1].value})
#I want to get as output elements (of array, maybe, or one after another)
#2). text between Content-Transfer-Encoding: base64 and --=_related, if there is two lines above line Content-Type: image/jpeg
#this
#3333333333333333333333333333333333333333333333333333333
#then this
#4444444444444444444444444444444444444444444444444444444
#then this
#5555555555555555555555555555555555555555555555555555555
Write-Host "text2 : " -fore red
Write-Host $text2
Thanks for your help. Have a nice day.
P.S. Based on Jessie Westlake's code, here is a slightly edited version of the regex that worked for me:
$files = Get-ChildItem -Path "\\<SERVER_NAME>\mailroot\Drop"
Foreach ($file in $files){
$text = Get-Content $file.FullName
$RegexText = '(?:Content-Type: text/html.+?Content-Transfer-Encoding: base64(.+?)(?:--=_))'
$RegexImage = '(?:Content-Type: image/jpeg.+?Content-Transfer-Encoding: base64(.+?)(?:--=_))'
$TextMatches = [Regex]::Matches($text, $RegexText, [System.Text.RegularExpressions.RegexOptions]::Singleline)
$ImageMatches = [Regex]::Matches($text, $RegexImage, [System.Text.RegularExpressions.RegexOptions]::Singleline)
If ($TextMatches[0].Success)
{
Write-Host "Found $($TextMatches.Count) Text Matches:"
Write-Output $TextMatches.ForEach({$_.Groups[1].Value})
}
If ($ImageMatches[0].Success)
{
Write-Host "Found $($ImageMatches.Count) Image Matches:"
Write-Output $ImageMatches.ForEach({$_.Groups[1].Value})
}
}
TL;DR : Just go to the code at the bottom...
The code below is pretty ugly, so forgive me.
Essentially, I created a regular expression that matches starting with Content-Type: text/html. It matches anything following that, lazily, until it hits a newline \n, a carriage return \r, or a combination of one after the other, \r\n.
You have to wrap those in parentheses in order to use the or | operator. We don't want to actually capture/return any of those groups, so we use the non-capturing group syntax of (?:text-to-match). We use this elsewhere as you can see. You can place capturing and non-capturing groups inside of each other too.
Anyway, continuing on. After matching the new line, we want to see Content-Transfer-Encoding: base64. That seems to be required in each of your examples.
After that we want to identify the next newline, just like the last time. Except this time we want to match one or more, by using the +. The reason we need to match more than one is that the data you want to save is sometimes preceded by an extra line. But since sometimes it is NOT preceded by an extra line, we make the quantifier "lazy" by following the plus with a question mark: +?.
After that comes the part where we will be capturing your actual data. This will be the first time we use an actual capturing group, versus a non-capturing group (i.e. no question mark followed by a colon).
We want to capture anything that is NOT a newline, because it seems that sometimes your data is followed by a newline and sometimes not. By not allowing ourselves to capture any newlines, it will also force our previous group to gobble up any extra newlines that precede our data. That capturing group is ([^(?:\n|\n\r)]+)
What we are doing there is wrapping the regex in parentheses in order to capture it. We place the expression inside brackets because we want to create our own "class" of characters. Any of the characters inside the brackets is what our code is looking for. The difference with ours, though, is that we put a caret ^ as the first character inside the brackets. That means NOT any of these characters. Obviously we want to match everything until the next line, so we want to capture anything that is not a newline, once or more, as many times as possible. (Strictly speaking, group syntax has no special meaning inside brackets, so this class also excludes the literal characters (, ?, :, | and ); that is harmless for base64 data, and [^\r\n]+ would be the simpler way to write it.)
We then make sure our regex is anchored to some ending text, so we keep trying to match. Starting with another newline matching at least one, but as few as required to make our capture a success (?:\n|\r|\r\n)+?.
Lastly, we anchor to what we know for sure will be where we can stop looking for our important data. And that is the --=_. I wasn't sure if we would stumble across an "alternative" word or "related", so I didn't go that far. Now it's done.
THE KEY TO IT ALL
We wouldn't be able to match across newlines without the regular expression "Singleline" mode, which lets . match newline characters. Here I enable it by calling the .NET [Regex]::Matches method and passing a value from the [System.Text.RegularExpressions.RegexOptions] enumeration (the inline (?s) flag at the start of the pattern is another way to turn it on). The two options most relevant here are "Singleline" and "Multiline".
I create a separate regex for the text/html and the image/jpeg searches. We save the results of those matches into their respective variables.
We can test the success of the matches by indexing into the 0 index, which would contain the entire match object and accessing its .success property, which returns a boolean value. The count of matches is accessible with the .count property. In order to access the specific groups and captures, we have to dot notate into them after finding the appropriate capture group index. Since we are only using one capturing group and the rest are non-capturing, we will have the [0] index for our entire text match, and [1] should contain the match of our capture group. Because it is an object, we have to access the value property.
Obviously the below code will require your $text variable to contain the data to search.
$RegexText = '(?:Content-Type: text/html.+?(?:\n|\r|\r\n)Content-Transfer-Encoding: base64(?:\n|\r|\r\n)+?([^(?:\n|\n\r)]+)(?:\n|\r|\r\n)+?(?:\n|\r|\r\n)(?:--=_))'
$RegexImage = '(?:Content-Type: image/jpeg.+?(?:\n|\r|\r\n)Content-Transfer-Encoding: base64(?:\n|\r|\r\n)+?([^(?:\n|\n\r)]+)(?:\n|\r|\r\n)+?(?:\n|\r|\r\n)(?:--=_))'
$TextMatches = [Regex]::Matches($text, $RegexText, [System.Text.RegularExpressions.RegexOptions]::Singleline)
$ImageMatches = [Regex]::Matches($text, $RegexImage, [System.Text.RegularExpressions.RegexOptions]::Singleline)
If ($TextMatches[0].Success)
{
Write-Host "Found $($TextMatches.Count) Text Matches:"
Write-Output $TextMatches.ForEach({$_.Groups[1].Value})
}
If ($ImageMatches[0].Success)
{
Write-Host "Found $($ImageMatches.Count) Image Matches:"
Write-Output $ImageMatches.ForEach({$_.Groups[1].Value})
}
The code above results in the below output to the screen:
Found 2 Text Matches:
111111111111111111111111111111111111111111111111111111
222222222222222222222222222222222222222222222222222222
Found 3 Image Matches:
333333333333333333333333333333333333333333333333333333
444444444444444444444444444444444444444444444444444444
555555555555555555555555555555555555555555555555555555
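The same idea carries over to other regex engines. As a rough cross-check, here is a Python sketch of the approach, with re.DOTALL playing the role of Singleline mode. The function name extract_base64_parts and the shortened SAMPLE_EML are made up for this example, and like the PowerShell version it assumes each payload sits on a single line.

```python
import re
from typing import List

# Shortened, hypothetical version of the sample data from the question.
SAMPLE_EML = """\
--=_alternative XXXXXXXXXXXXXX_=
Content-Type: text/html; charset="KOI8-R"
Content-Transfer-Encoding: base64
1111
--=_alternative XXXXXXXXXXXXXX_=
Content-Type: text/html; charset="KOI8-R"
Content-Transfer-Encoding: base64
2222
--=_alternative XXXXXXXXXXXXXX_=--
--=_related XXXXXXXXXXXXXX_=--_=
Content-Type: image/jpeg
Content-ID: <_2_XXXXXXXXXXXXXX>
Content-Transfer-Encoding: base64
3333
--=_related XXXXXXXXXXXXXX_=
Content-Type: image/jpeg
Content-ID: <_2_XXXXXXXXXXXXXX>
Content-Transfer-Encoding: base64
4444
--=_related XXXXXXXXXXXXXX_=--
"""


def extract_base64_parts(eml_text: str, content_type: str) -> List[str]:
    """Return the single-line base64 payloads of parts with this content type."""
    pattern = re.compile(
        r"Content-Type: " + re.escape(content_type)
        + r".+?Content-Transfer-Encoding: base64"  # lazily skip other headers
        + r"\s+([^\r\n]+)"                         # the payload line itself
        + r"\s+--=_",                              # anchored to the next boundary
        re.DOTALL,                                 # let "." cross line breaks
    )
    return pattern.findall(eml_text)


if __name__ == "__main__":
    print(extract_base64_parts(SAMPLE_EML, "text/html"))
    print(extract_base64_parts(SAMPLE_EML, "image/jpeg"))
```

One caveat that applies to both versions: because .+? can cross part boundaries, a part of the requested type that is not base64-encoded would cause the payload of a later part to be picked up instead.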

Setting regular expression to validate URL format in Adobe CQ5

I want to validate a URL inside a textfield using Adobe CQ5, so I set up the regex and regexText properties as usual, but for some reason it is not working:
<facebook
jcr:primaryType="cq:Widget"
emptyText="http://www.facebook.com/account-name"
fieldDescription="Set the Facebook URL"
fieldLabel="Facebook"
name="./facebookUrl"
regex="/^(http://www.|https://www.|http://|https://)[a-z0-9]+([-.]{1}[a-z0-9]+)*.[a-z]{2,5}(:[0-9]{1,5})?(/.*)?$/"
regexText="Invalid URL format"
xtype="textfield"/>
So when I type inside the component I can see an error message at the console:
Uncaught TypeError: this.regex.test is not a function
To be more accurate the error comes from this line:
if (this.regex && !this.regex.test(value)) {
I tried several regular expressions and none of them worked. I guess the problem is the regular expression itself, because on the other hand I have this other regex to validate email addresses, and it works perfectly fine:
/^[A-za-z0-9]+[\\._]*[A-za-z0-9]*@[A-za-z.-]+[\\.]+[A-Za-z]{2,4}$/
Any suggestions? Thanks in advance.
The syntax of your regex seems to treat the forward slashes (/) as special characters. Since you want to parse a URL containing slashes, my guess is you should escape them twice like this: '\\/' instead of '/'. The result would be:
/^(http:\\/\\/www.|https:\\/\\/www.|http:\\/\\/|https:\\/\\/)[a-z0-9]+([-.]{1}[a-z0-9]+)*.[a-z]{2,5}(:[0-9]{1,5})?(\\/.*)?$/
You need to escape them twice because the string to be compiled as a regex must contain '\/' to escape the slashes, but to introduce a backslash in a string you have to escape the backslash itself too.
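The double-escaping argument can be illustrated in any language with backslash escapes in string literals. The Python sketch below is only an analogy with made-up variable names: it shows that a literal written with two backslashes holds a single backslash, and that the resulting \/ matches a plain slash. Whether CQ5 applies the same unescaping step to the regex property is an assumption of the answer that this sketch cannot confirm.

```python
import re

# Written with two backslashes in source, stored as backslash + slash.
pattern_text = "\\/"
print(len(pattern_text))  # 2

# As a regex, the escaped slash matches a literal "/".
matches_slash = re.fullmatch(pattern_text, "/") is not None
print(matches_slash)  # True
```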

Regex HTTP header parsing

I'm trying to use Regex to get a bit of HTTP header parsing done. I'd like to use groups to organize some of the information:
Let's say I have this:
Content-Disposition: form-data; name="item1"
I'd like the result of my regex to create two groups:
contentdisposition : form-data
name : item1
I've tried several methods, but I can't seem to figure out how to do this. If name= doesn't exist then only one group should be created, but the regex should not error out.
Any ideas?
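/Content-Disposition: (.*?);(?: name="(.*?)")?/ might be what you're looking for. The name part sits in an optional group with a greedy ? quantifier, so the name is captured when it is present and the match still succeeds when it is absent. Note that the pattern as written still requires the semicolon after the disposition value.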
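Here is a Python sketch of a variant using named groups. The names CONTENT_DISPOSITION and parse_content_disposition are invented for this example, and unlike the pattern above it makes the semicolon part of the optional group, so a header without any parameters also matches.

```python
import re
from typing import Dict, Optional

CONTENT_DISPOSITION = re.compile(
    r'Content-Disposition: (?P<contentdisposition>[^;\r\n]+)'  # e.g. form-data
    r'(?:; name="(?P<name>[^"]*)")?'                           # optional name
)


def parse_content_disposition(header: str) -> Optional[Dict[str, str]]:
    """Return the groups that matched, or None if the header is absent."""
    match = CONTENT_DISPOSITION.search(header)
    if match is None:
        return None
    # Drop groups that did not participate in the match (e.g. no name=).
    return {key: value for key, value in match.groupdict().items()
            if value is not None}


if __name__ == "__main__":
    print(parse_content_disposition('Content-Disposition: form-data; name="item1"'))
    print(parse_content_disposition("Content-Disposition: form-data"))
```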