MailGun: Stored Message API: How to separate forwarded email from original email? - mailgun

I'm using Mailgun routing to post emails into my app when they are received at xyz@ouremail.myapp.com.
This is similar to the Retrieving Stored Messages API (https://documentation.mailgun.com/en/latest/api-sending.html#retrieving-stored-messages).
The documentation says:
Note: Do not rely on the body-plain, stripped-text, and stripped-signature fields for HTML sanitization. These fields merely provide content from the text/plain portion of an incoming message. This content may contain unescaped HTML.
I have been using body-plain to retrieve the thread, but it includes the entire email thread with no separation between messages.
I'd like to show a single message. There's no documentation on the recommended way to parse the thread and separate it into multiple messages.

So far I have found that all forwarded messages retain their headers, and thus start with From:.
Using Regexp I have been able to split the thread into multiple messages.
email_body.split(/(?=From:)/)
The positive lookahead (?=...) allows me to keep the delimiter in the split-up chunks.
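A rough Python sketch of the same idea, for illustration only. The function name split_thread and the sample text are hypothetical. The pattern is anchored to the start of a line, which is an assumption beyond the original expression, so that a "From:" appearing mid-sentence does not trigger a split. Splitting on a zero-width match needs Python 3.7 or later.

```python
import re

# Zero-width lookahead: split *before* each line that starts with "From:",
# so the header stays attached to the message it introduces.
_FORWARD_BOUNDARY = re.compile(r"(?m)(?=^From:)")


def split_thread(body: str) -> list[str]:
    """Split a plain-text thread into chunks, one per retained From: header."""
    parts = _FORWARD_BOUNDARY.split(body)
    # Drop empty chunks (e.g. when the body itself begins with "From:").
    return [part for part in parts if part.strip()]


sample = (
    "Please see below.\n"
    "\n"
    "From: alice@example.com\n"
    "Original question\n"
    "From: bob@example.com\n"
    "Earlier reply\n"
)
messages = split_thread(sample)
```

Text before the first From: header (the newest message) ends up as the first chunk. Bodies that quote "From:" at the start of a line for other reasons would still be split, so this remains a heuristic.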

Related

Any reason to use a json string in the MessageBody instead of individual attributes?

I think using Message Attributes is the way to go. We only use 4 attributes and are worried that eventually we'll hit the 10 attribute limitation.
Is there any benefit to using MessageBody instead of individual attributes other than the 10 attribute limitation?
I believe MessageBody doesn't have a limit except for the total message size limit of 256 KB which is huge. Then again, a single attribute also has the same limit.
A better question is when to use one over the other?
SQS message attributes are designed to carry message metadata (such as the message category or message type), not the message itself.
E.g. if your application supports both JSON and XML payload types, you could put the payload type in a message attribute; when you fetch the message, you can then use that attribute to choose whether an XML or a JSON message processor should handle it. This is just a superficial example to explain the usage of body and attributes.
Ideally, the actual message payload should go in the body of the SQS message.
The following paragraph is an extract from the AWS documentation:
Amazon SQS lets you include structured metadata (such as timestamps, geospatial data, signatures, and identifiers) with messages using message attributes. Each message can have up to 10 attributes. Message attributes are optional and separate from the message body (however, they are sent alongside it). Your consumer can use message attributes to handle a message in a particular way without having to process the message body first.
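The JSON/XML example above can be sketched in Python as follows. This is only an illustration under assumptions: the attribute name payloadType, the helper names, and the queue URL are made up, and nothing here calls AWS. The dictionaries mirror the shape boto3 uses for send_message arguments and for received messages.

```python
import json
import xml.etree.ElementTree as ET


def build_send_kwargs(queue_url: str, payload: str, payload_type: str) -> dict:
    """Build keyword arguments in the shape boto3's sqs.send_message expects."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": payload,  # the actual payload goes in the body
        "MessageAttributes": {
            # Hypothetical attribute name; metadata only, not the payload.
            "payloadType": {"DataType": "String", "StringValue": payload_type},
        },
    }


def process(message: dict) -> dict:
    """Pick a parser from the attribute before touching the body."""
    kind = message["MessageAttributes"]["payloadType"]["StringValue"]
    body = message["Body"]
    if kind == "json":
        return json.loads(body)
    if kind == "xml":
        root = ET.fromstring(body)
        return {child.tag: child.text for child in root}
    raise ValueError(f"unsupported payload type: {kind}")


# Hypothetical queue URL, used only to show the argument shape.
kwargs = build_send_kwargs(
    "https://sqs.example.invalid/queue", '{"id": 1}', "json"
)
# A received message uses "Body" rather than "MessageBody".
received = {
    "Body": kwargs["MessageBody"],
    "MessageAttributes": kwargs["MessageAttributes"],
}
result = process(received)
```

The consumer decides how to handle the message from the attribute alone, which is the use case the AWS extract describes.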

How does multi-line logging work in Lambda -> CloudWatch

My multi-line log messages all end up as multiple events, one event per line. According to the documentation:
Each call to LambdaLogger.log() results in a CloudWatch Logs event...
but then:
However, note that AWS Lambda treats each line returned by System.out
and System.err as a separate event.
Looking inside LambdaAppender's source code, it seems that it proceeds to log the event to System.out anyway. So does that mean multi-line messages will always be broken down into multiple events?
I have read about configuring the multi_line_start_pattern, but that seems only applicable when you get to deploy a log agent, which isn't accessible in Lambda.
[Edit] LambdaAppender logs to LambdaLogger which logs to System.out.
[Edit] I found some post where a workaround was suggested - use '\r' for the eol when printing the messages. This seems to work for messages that my code produces. Stack traces logged everywhere are still a problem.
[Edit] I have been using two workarounds:
Log complex data structures (e.g. sizable maps) as JSON. CloudWatch recognizes JSON strings in log events and pretty-prints them.
Replace '\n' with '\r'. For stack traces I created a utility method (this is in Kotlin, but the idea is generic enough):
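import java.io.PrintWriter
import java.io.StringWriter

// Render the stack trace as a single log event by replacing '\n' with '\r'.
fun formatThrowable(t: Throwable): String {
    val buffer = StringWriter()
    t.printStackTrace(PrintWriter(buffer))
    return buffer.toString().replace("\n", "\r")
}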
I think in the long run a more ideal solution would be an Appender implementation that decorates ConsoleAppender, which would do the \r replacement on all messages passing through.
Best practice is to use JSON in your logs. Instead of sending multi-line output, send formatted JSON (whatever language you are using, you will find a library that already does this for you).
You will be amazed how easy it becomes to browse your logs from there. For instance, CloudWatch Logs Insights automatically detects your fields and lets you parse and query them within seconds.
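As a minimal sketch of the JSON approach, shown in Python rather than Java and not tied to any particular logging library: the helper name format_event and the field names are assumptions. The point is that json.dumps escapes the newlines inside the stack trace, so the whole event is written as one line.

```python
import json
import traceback


def format_event(level, message, exc=None):
    """Serialize one log event, including any stack trace, as a single line."""
    event = {"level": level, "message": message}
    if exc is not None:
        # The formatted traceback contains newlines; json.dumps escapes them.
        event["stack_trace"] = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
    return json.dumps(event)


try:
    int("not a number")
except ValueError as err:
    line = format_event("ERROR", "could not parse input", err)
    print(line)
```

Because the output contains no raw line breaks, a line-oriented collector sees it as one event.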
I suggest using the project slf4j-simple-lambda and referring to this blog for more explanations.
Using slf4j and slf4j-simple-lambda solves your problem elegantly, and the solution stays lightweight. The project includes the parameter org.slf4j.simpleLogger.newlineMethod, which is there to solve this problem. By default, its value is auto, which should automatically detect the need for manual newline handling.
Disclosure: I am co-author of slf4j-simple-lambda and author of the blog.

How can I extract the canonical email address given an address that includes BATV or other tags?

Our webapp has a feature that allows users to import data by sending emails to a specific email address. When the emails are received by our app, they are processed differently depending on who sent them. We look at the "sender" field of the email, and match it to a user in our database. Once the user who sent the email has been determined, we handle that email based on that user's personal settings.
This has generally been working fine for most users. However, certain users were complaining that their emails weren't getting processed. When we looked into it, we found that their email server was adding information to the sender's email address, and this caused the email address not to match what was in our User table in the database. For example, the user's email might be testuser@example.com in the database, but the "sender" field in the email we received would be something like btv1==502867923ab==testuser@example.com. Some research suggested this was caused by Bounce Address Tag Validation (BATV) being used by the sender's server.
We need to be able to extract the canonical email address from the "sender" field provided to us, so we can match it to our user table. One of the other developers here wrote a function to do this, and submitted it to me for code review. This is what he wrote (C#):
private static string SanitizeEmailSender(string sender)
{
    if (sender == null)
        return null;
    return System.Text.RegularExpressions.Regex.Replace(
        sender,
        @"^((btv1==.{11}==)|(prvs=.{9}=))",
        "",
        System.Text.RegularExpressions.RegexOptions.None);
}
The regex pattern here covers the specific cases we've seen in our email logs. My concern is that the regex might be too specific. Are btv1 and prvs the only prefixes used in these tags? Are there always exactly 9 characters after prvs=? Are there other email sender tagging schemes other than BATV that we need to look out for? What I don't want is to put this fix in production just to find out next month that we need to fix it again because there were other cases we didn't consider.
My gut instinct was to just trim the email address to only include the part after the last =. However, research suggests that = is a valid character in email addresses and thus may be part of the user's canonical email address. I personally have never seen = used in an email address outside some kind of tagging or sub-addressing scheme, but you never know. Murphy's law suggests that the minute I assume a user will never have a certain character in their email address, somebody with that sort of address will immediately sign up.
My question is: is there an industry-accepted, reliable way to extract a user's canonical email address given a longer address that may include BATV or other tags? Failing that, is there at least a more reliable way than what we've got so far? Or is what we've got actually sufficient?
As the information added by BATV is always preceded by the BATV tag and delimited by two == strings, this is what I would use:
((btv1|prvs)==([^=]|=[^=])*==)
Of course, you are right that an = sign is a valid character in an email address, but that is precisely the reason to use that sequence (so that the result is still a valid email address).
If you dig a little more into the RFCs relating to email, you'll see that MIME adds some constructs to allow non-ASCII characters in an email address through the quoted-printable feature. A little RFC reading is needed to decide how to cope correctly with these things.
Finally, to answer your question: mail servers are authorised to modify or rewrite the envelope addresses (these are the addresses used in the SMTP control protocol for routing mail messages), and sendmail can do so even in the mail header fields. So the right answer is that there is no reliable way (industry-accepted or not) to extract the sender's canonical email address. Addresses are rewritten as the message progresses to the target recipient, and information is lost along the way. You cannot recover the original address used.
And last, to illustrate a little:
Sender field is added by the final SMTP recipient to include in the email the address of the envelope sender (the address used as FROM: <sender@address.com> in the original SMTP protocol message)
From field is added by the original mail client to identify the origin of the message. This behaviour can be modified by the existence of Resent-from or Resent-sender fields in case the message is resent. These identify the resend of messages.
Finally, the sender can use a Reply-to header to indicate responses to be sent to that address.
To get an idea of how the SMTP protocol works, read the dense RFC-2821 (SMTP protocol) and RFC-2822 (format of internet mail messages) documents.
Are btv1 and prvs the only prefixes used in these tags?
prvs is a prefix that conforms to the "meta-syntax" defined in the RFC. btv1 is a Barracuda appliance Invalid Spoof Suppression rewrite which doesn't follow the BATV standard (hence the double equal sign).
A regex that just matches all BATV local-parts would be
[0-9A-Za-z\-]+=[0-9A-Za-z\-]+=.+@.+
But this wouldn't catch the Barracuda btv1 rewrites (and other rewrites)
Are there always exactly 9 characters after prvs=?
No, the spec says there are 10 but in the wild it's most often 9
Are there other email sender tagging schemes other than BATV that we need to look out for?
Yes, see below.
is there an industry-accepted, reliable way to extract a user's canonical email address given a longer address that may include BATV or other tags?
No
By looking at various code bases it looks like everybody implements their own solution. Some of the complexity comes from the fact that there are
the BATV rewrites
BATV rewrites which try but fail to follow the standard by swapping the loc-core and tag-val positions. Here is an example showing these reversed versions and some code which validates each to see if it's a prvs value and then assumes the other one is the loc-core
the Barracuda non standard rewrites
other non BATV rewrites like
SRS
Google Forwards
Here's a unit test containing a list of possible sender rewritten examples and here are some examples of syntaxes found in the wild.
Failing that, is there at least a more reliable way than what we've got so far? Or is what we've got actually sufficient?
It looks like the best approach is to address each of the conditions, in the way that ezmlm-idx and rspamd do.
The regex you're using won't cover
prvs with loc-core and tag-val reversed
prvs that follow the spec with 10 characters instead of 9
SRS
Google forwards
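Handling each condition separately might look like the following Python sketch. It is an illustration under stated assumptions, not the approach taken by ezmlm-idx or rspamd: the function name is hypothetical, the tag is assumed to be 9 or 10 hexadecimal characters, the btv1 token is assumed to contain no = sign, SRS0 addresses are assumed to have the form SRS0=hash=timestamp=domain=local, and SRS1 and Google forwards are deliberately left out.

```python
import re

_TAG = r"[0-9A-Fa-f]{9,10}"  # assumed: 10 characters per spec, often 9 in the wild
_PRVS = re.compile(rf"^prvs=({_TAG})=(.+)$", re.IGNORECASE)
_PRVS_REVERSED = re.compile(rf"^prvs=(.+)=({_TAG})$", re.IGNORECASE)
_BTV1 = re.compile(r"^btv1==[^=]+==(.+)$", re.IGNORECASE)
_SRS0 = re.compile(r"^SRS0[=+-][^=]+=[^=]+=([^=]+)=(.+)$", re.IGNORECASE)


def canonical_sender(address: str) -> str:
    """Best-effort removal of known sender rewrites; unknown forms pass through."""
    local, at, domain = address.rpartition("@")
    if not at:
        return address
    srs = _SRS0.match(local)
    if srs:
        # SRS keeps the original domain inside the local part.
        return f"{srs.group(2)}@{srs.group(1)}"
    # (pattern, index of the group holding the original local part)
    for pattern, group in ((_PRVS, 2), (_PRVS_REVERSED, 1), (_BTV1, 1)):
        match = pattern.match(local)
        if match:
            return f"{match.group(group)}@{domain}"
    return address


examples = [
    "btv1==502867923ab==testuser@example.com",
    "prvs=0123456789=testuser@example.com",
    "prvs=testuser=0123456789@example.com",
    "SRS0=HHH=TT=example.org=alice@forwarder.example",
    "a=b@example.com",
]
cleaned = [canonical_sender(address) for address in examples]
```

An address with an ordinary = in its local part is returned unchanged because nothing matches. A local part that itself looks like a hexadecimal tag is ambiguous between the standard and reversed prvs forms, so matching the result against the user table remains the final check.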

Handlebars and escaping input?

I am testing some frontend code, and I can see that it takes input using the {{}} Handlebars syntax. If I enter the input &123, shouldn't this be converted to &amp;123 before it is stored on the server, since double mustaches mean that characters like '&' are escaped? When I look at the POST being sent to the server, it still appears as &123.
No, the HTML escaping done by {{}} only has to do with how a value is rendered into the DOM. A string entered using {{input}} is not transformed in any way by Ember, nor should it be.
In general, one does not want to HTML-escape information being held in the DB. The data in the DB should be the actual data. The HTML escaping is something that should be done, as Ember does, "on the way out" when the data is being displayed in an HTML context.
If you really want to keep HTML-escaped data on your server, then you could escape it on the server prior to saving, or perhaps in an Ember serializer. However, when retrieving the data, you'd then have to either unescape it on the server or send it down to the client as is, either unescaping it in the deserializer or remembering that it is already escaped and putting it in the DOM using {{{}}} (triple handlebars).
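The "escape on the way out" principle is not specific to Ember. A small Python sketch of the same idea, with a hypothetical render_comment helper standing in for a template:

```python
import html

# What the user typed is stored unchanged.
stored_value = "&123"


def render_comment(value: str) -> str:
    """Escape only at the moment the value is placed into HTML."""
    return f"<p>{html.escape(value)}</p>"


rendered = render_comment(stored_value)
```

The stored value stays &123, and only the rendered markup contains &amp;123.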

Using HTTP URIs as identifiers of resources in RESTful web API

Usually to retrieve a resource one uses:
GET http://ws.mydomain.com/resource/123212
But what if your item IDs are HTTP URIs?:
GET http://ws.mydomain.com/resource/http://id.someotherdomain.com/SGX.3211
Browsers replace two slashes with one, and the request turns into:
GET http://ws.mydomain.com/resource/http:/id.someotherdomain.com/SGX.3211
which will not work.
URI-encoding the "http://id.someotherdomain.com/SGX.3211" part results in HTTP 400 - Bad request.
Is there a best practice for handling this?
Edit:
Then of course, if we needed (I don't at the moment) requests of the form:
resources/ID/collections/ID
and all IDs are HTTP URIs, things get out of hand... Possibly one could do something like this and parse the contents inside the curly braces:
resources/{http://id...}/collections/{http://id...}
Encode the other system's URI, and then pass the value as a query parameter:
GET http://ws.mydomain.com/resource?ref=http%3A%2F%2Fid.someotherdomain.com%2FSGX.3211
Ugly looking, but no one said that URIs used in a REST architecture have to be beautiful. :)
By the way, a GET actually looks like this when it's sent:
GET /resource?ref=http%3A%2F%2Fid.someotherdomain.com%2FSGX.3211 HTTP/1.1
Host: ws.mydomain.com
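Building and reading such a URL can be sketched with Python's standard library. The helper names are hypothetical, and the parameter name ref is taken from the example above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_resource_url(base: str, resource_id: str) -> str:
    """Pass the foreign URI as a percent-encoded query parameter."""
    # urlencode percent-encodes ':' and '/' in the value by default.
    return f"{base}?{urlencode({'ref': resource_id})}"


def extract_resource_id(url: str) -> str:
    """Recover the original URI on the receiving side."""
    return parse_qs(urlsplit(url).query)["ref"][0]


resource_id = "http://id.someotherdomain.com/SGX.3211"
url = build_resource_url("http://ws.mydomain.com/resource", resource_id)
round_tripped = extract_resource_id(url)
```

Most web frameworks decode query parameters automatically, so the explicit parse step is only there to show that the identifier survives the round trip.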
UPDATE: apparently you no longer have to encode "/" and "?" within a query component. From RFC 3986:
The characters slash ("/") and question mark ("?") may represent data
within the query component. Beware that some older, erroneous
implementations may not handle such data correctly when it is used as
the base URI for relative references (Section 5.1), apparently
because they fail to distinguish query data from path data when
looking for hierarchical separators. However, as query components are
often used to carry identifying information in the form of "key=value"
pairs and one frequently used value is a reference to another URI, it
is sometimes better for usability to avoid percent-encoding those
characters.
So you could legally do this:
GET /resource?ref=id.someotherdomain.com/SGX.3211 HTTP/1.1
Host: ws.mydomain.com