Getting procmail to exclude the hostname when saving Maildir-format messages

How do I get procmail to save messages in my Maildir folder without including the hostname in the file name (message name)? I get the following message names in my new/ sub-folder:
1464003587.H805375P95754.gator3018.hostgator.com, S=20238_2
I just want to eliminate the hostname. Is that possible to do using procmail? How? Separately, is it possible to replace the first timestamp with a timestamp of when the message was sent? Is it possible to prescribe a filename format for procmail?

No, you can't override Maildir's filename format, not least because it's prescribed to be in a particular way for interoperability reasons. The format is guaranteed to be robust against clashes when multiple agents on multiple hosts concurrently write to the same message store. This can only work correctly if they all play by the same rules. An obvious part of those rules is the one which dictates that the host name where the agent is running must be included in the filename of each new message.
The Wikipedia Maildir article has a good overview of the format's design and history, and of course links to authoritative standards and other primary sources.
If you don't particularly require Maildir compatibility (with the tmp / new / cur subdirectories etc) you can simply create a unique mbox file on each run; if you can guarantee that it is unique, you don't need locking when you write to it.
For example, if you have a tool called uuid which generates a guaranteed unique identifier on each invocation, you can use that as the file name easily:
# deliver each message to a uniquely named file; no locking needed
:0 # or maybe :0r
`uuid`
It should be easy to see how to supply your own tool instead, if you really think you can create your own solution for concurrent delivery. (Maildir solves concurrent and distributed delivery, so the requirements for that are stricter.)
The other formats supported by Procmail have their own hardcoded rules for how file names are generated, though perhaps the simple MH folder format, with a (basically serially incrementing) message number as the file name, would be worth investigating as well. The old mini-FAQ has a brief overview of the supported formats and how to select which one Procmail uses for delivery in each individual recipe.
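For reference, the trailing characters of the destination name are what select the delivery format in a recipe; roughly (folder names here are placeholders):
# a destination ending in "/." is treated as an MH folder: the next free number becomes the file name
:0
mhfolder/.

# a destination ending in "/" is treated as a maildir folder (tmp/new/cur, hostname in the file name)
:0
Maildir/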

Inotify-like feature in a distributed file system

As the title says, I want to trigger a notification when certain events happen.
An event can be user-defined, such as specified files being updated within one minute.
If the files were stored locally, I could easily do this with the inotify system call, but the files live on a distributed file system such as mfs.
How can I do this? I would like to know whether there are existing solutions or open-source projects for this problem. Thanks.
If you have only black-box access (e.g. the NFS protocol) to the remote system(s), you don't have many options unless the protocol supports what you need. So I'll assume you have control over the remote systems.
The "trivial" approach is running a local inotify/fanotify listener on each computer that would forward the notification over the network. FAM can do this over NFS.
A problem with all notification-based systems is the risk of lost notifications in various edge cases. This becomes much more acute over a network - e.g. the client confirms receipt of a notification, then immediately crashes. There are reliable message queues you can build on, but IMHO this way lies madness...
A saner approach is a stateless, hash-based scan.
I like to call the following design "hnotify" but that's not an established term. The ideas are widely used by many version control and backup systems, dating back to Plan 9.
The core idea is that if you know cryptographic hashes for the files, you can compose a single hash that represents a directory of files - it changes if any of the files change - and you can build these bottom-up to represent the whole filesystem's state.
(Git stores things this way and is very efficient at it.)
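Roughly, with std::hash standing in only as a placeholder for a real cryptographic hash such as SHA-1, composing a directory's hash from its children could look like this:

#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Sketch: combine the (sorted) child names and child hashes of a directory
// into one value. A real implementation would use a cryptographic hash
// (e.g. SHA-1/SHA-256); std::hash is only a stand-in here.
uint64_t directory_hash(const std::map<std::string, uint64_t>& children) {
    std::string buf;
    for (const auto& [name, hash] : children) {   // std::map keeps names sorted
        buf += name;
        buf += ':';
        buf += std::to_string(hash);
        buf += '\n';
    }
    return std::hash<std::string>{}(buf);
}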
Why are hash trees cool? If you have 2 hash trees — one representing the filesystem state you saw at some point in the past, one representing the current state — you can easily find out what changed between them:
1. Start at the roots. If they are different, read the 2 root directories and compare the hashes of their subdirectories.
2. If a subdirectory has the same hash in both trees, then nothing under it changed. No point going there.
3. If a subdirectory's hash changed, compare its contents recursively — i.e. go back to step (1).
4. If one tree has a subdirectory the other doesn't, well, that's a change. With some global table you can also detect moves/renames.
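A rough sketch of that walk over two in-memory trees (the Node type and the output are made up for illustration):

#include <iostream>
#include <map>
#include <memory>
#include <string>

// Hypothetical node: a directory entry with a hash and (for directories) children.
struct Node {
    std::string hash;                                    // hash of file content or of the child listing
    std::map<std::string, std::shared_ptr<Node>> children;
};

// Recursively report paths that differ between an old and a current snapshot.
void diff(const std::string& path, const Node* oldN, const Node* newN) {
    if (oldN && newN && oldN->hash == newN->hash) return;            // identical subtree: skip it
    if (!oldN || !newN) { std::cout << "changed: " << path << "\n"; return; }  // added or removed
    if (oldN->children.empty() && newN->children.empty()) {
        std::cout << "changed: " << path << "\n";                    // plain file whose hash differs
        return;
    }
    // Walk the union of child names and recurse where the hashes differ.
    auto it = oldN->children.begin();
    auto jt = newN->children.begin();
    while (it != oldN->children.end() || jt != newN->children.end()) {
        if (jt == newN->children.end() || (it != oldN->children.end() && it->first < jt->first)) {
            diff(path + "/" + it->first, it->second.get(), nullptr); ++it;
        } else if (it == oldN->children.end() || jt->first < it->first) {
            diff(path + "/" + jt->first, nullptr, jt->second.get()); ++jt;
        } else {
            diff(path + "/" + it->first, it->second.get(), jt->second.get()); ++it; ++jt;
        }
    }
}

Calling diff("", oldRoot.get(), newRoot.get()) then prints only the paths whose subtree hashes differ between the two snapshots.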
Note that if few files changed, you only read a small portion of the current state. So the remote system doesn't have to send you the whole tree of hashes, it can be an interactive ping-pong of "give me hashes for this directory; ok now for this...".
(This is akin to how Git's dumb http protocol worked; there is a newer protocol with less round trips.)
This is as robust and bug-proof as polling the whole filesystem for changes — you can't miss anything — but reasonably efficient!
But how does the server track current hashes?
Unfortunately, fully hashing all disk writes is too expensive for most people. You may get it for free if you're lucky enough to be running a deduplicating filesystem, e.g. ZFS or Btrfs.
Otherwise you're stuck with re-reading all changed files (which is even more expensive than doing it in the filesystem layer) or using fake file hashes: upon any change to a file, invent a new random "hash" to invalidate it (and try to keep the fake hashes on moves). Still compute real hashes up the tree. Now you may have false positives — you "detect a change" when the content is the same — but never false negatives.
Anyway, the point is that whatever stateful hacks you do (e.g. inotify with periodic scans to be sure), you only do them locally on the server. Across the network, you only ever send hashes that represent snapshots of current state (or its subtrees)! This way you can have a distributed system with many servers and clients, intermittent connectivity, and still keep your sanity.
P.S. Btrfs can efficiently find differences from an older snapshot. But this is a snapshot taken on the server (and causing all data to be preserved!), less flexible than a client-side lightweight tree-of-hashes.
P.S. One of your tags is HadoopFS. I'm not really familiar with it, but I suspect a lot of its files are write-once-then-immutable, and it might be able to natively give you some kind of file/chunk ids that can serve as fake hashes?
Existing tools
The first tool that springs to my mind is bup index. bup is a very clever deduplicating backup tool built on git (and, unlike plain git, able to scale to huge amounts of data), so it sits on the foundation described above. In theory, indexing data in bup on the server and doing git fetch over the network would even implement the hash-walking comparison of what's new that I described above — unfortunately the git repositories that bup produces are too big for git itself to cope with. Also you probably don't want bup to read and store all your data. But bup index is a separate subsystem that quickly scans a filesystem for potential changes, without yet reading the changed files.
Currently bup doesn't use inotify but it's been discussed in depth.
Oh, and bup uses Bloom Filters, which are a nearly optimal way to represent sets with false positives. I'm almost certain Bloom filters have a role to play in optimizing stateless notification protocols ("here is a compressed bitmap of all I have; you should be able to narrow your queries with it" or "here is a compressed bitmap of what I want to be notified about"). Not sure if the way bup uses them is directly useful to you, but this data structure should definitely be in your toolbelt.
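For reference, a toy Bloom filter is only a few lines (the size and the double-hashing trick are illustrative choices here, not what bup actually uses):

#include <bitset>
#include <functional>
#include <string>

// Toy Bloom filter: k derived bit positions per item; membership queries may
// yield false positives but never false negatives.
class BloomFilter {
    static constexpr size_t kBits = 1 << 20;   // ~1 Mbit, arbitrary for the sketch
    static constexpr int kHashes = 7;
    std::bitset<kBits> bits_;

    static size_t position(const std::string& item, int i) {
        // Double hashing: combine two seeded hashes to simulate k independent ones.
        size_t h1 = std::hash<std::string>{}(item);
        size_t h2 = std::hash<std::string>{}(item + "#salt");
        return (h1 + static_cast<size_t>(i) * h2) % kBits;
    }
public:
    void add(const std::string& item) {
        for (int i = 0; i < kHashes; ++i) bits_.set(position(item, i));
    }
    bool possibly_contains(const std::string& item) const {
        for (int i = 0; i < kHashes; ++i)
            if (!bits_.test(position(item, i))) return false;   // definitely absent
        return true;                                            // maybe present
    }
};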
Another tool is git annex. It's also based on Git (are you noticing a trend?) but is designed to keep the data itself out of Git repos (so git fetch should just work!) and has a "WORM" option that uses fake hashes for faster performance.
Alternative design: compressed replayable journal
I used to think the above is the only sane stateless approach for clients to check what's changed. But I just read http://arstechnica.com/apple/2007/10/mac-os-x-10-5/7/ about OS X's FSEvents framework, which has a perhaps simpler design:
ALL changes are logged to a file. It's kept forever.
Clients can ask "replay for me everything since event 51348".
The magic trick is that the log has coarse granularity ("something in this directory changed, go re-scan it to find out what"; repeated changes within 30 seconds are combined), so this journal file is very compact.
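A rough sketch of such a coarse, replayable journal (the 30-second coalescing window follows the description above; the types and in-memory storage are made up for illustration):

#include <cstdint>
#include <ctime>
#include <string>
#include <vector>

// One coarse entry: "something under this directory changed around this time".
struct JournalEntry {
    uint64_t id;            // monotonically increasing event id
    std::string directory;  // clients re-scan this directory to see what changed
    std::time_t when;
};

class ChangeJournal {
    std::vector<JournalEntry> log_;   // a real system would append this to a file
    uint64_t next_id_ = 1;
public:
    // Record a change, coalescing repeated changes to the same directory
    // within a 30-second window.
    void record(const std::string& directory) {
        std::time_t now = std::time(nullptr);
        for (auto it = log_.rbegin(); it != log_.rend() && now - it->when <= 30; ++it)
            if (it->directory == directory) { it->when = now; return; }
        log_.push_back({next_id_++, directory, now});
    }
    // "Replay for me everything since event N."
    std::vector<JournalEntry> since(uint64_t id) const {
        std::vector<JournalEntry> out;
        for (const auto& e : log_)
            if (e.id > id) out.push_back(e);
        return out;
    }
};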
At the low level you might resort to similar techniques — e.g. hashes — but the top-level interface is different: instead of snapshots you deal with a timeline of events. It may be an easier fit for some applications.

Should output be encoded at the API or client level?

We are moving our Web app architecture to being microservice based. We have an internal debate as to whether a REST API that provides content (in JSON, let's say) should encode that content to make it safe, or whether the consumers that take the content and display it (in HTML, for example) or otherwise use it should be responsible for the encoding. The use case is preventing XSS attacks and similar issues.
The provider stance is "Well, we can't know how to encode it for everyone, or how you're going to use the content, so of course the consumers should encode the content."
The consumer stance is "There is one provider and multiple consumers, so it's more secure to do it once in the providing API than to hope that every consumer does it."
Are there any generally accepted best practices on this and why?
As a rule, data passing through "internal" processes (whatever "internal" might mean to you) should be stored or encoded in whatever "internal" format makes sense. The format chosen is typically designed to minimize encoding/decoding steps and to prevent data loss.
Then, upon output, data is encoded using whatever output format makes sense. Preventing data loss is important, but also proper escaping and formatting is key here.
So for example, with internal APIs, data in a binary format may be sufficient. But when you output JSON or HTML or XML or PDF, you have to encode and escape your data appropriately to fit the output format.
The important point here is that different output formats have different concepts of "safe". What's "safe" for HTML may not be safe for JSON, and what's safe for JSON may not be safe for SQL. Data is encoded upon output specifically so that you can use the proper encoding for the task. You cannot assume that this step is done for you ahead of time, nor should you put your output function in the position to determine whether or not encoding must be done. If you stick with the rule: "output function ALWAYS encodes for safety", then you will never have to worry about data injection attacks.
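To make that concrete, here is a rough sketch (these escaping functions are deliberately minimal illustrations, not complete encoders): the same string is escaped differently for each output format, and always at output time:

#include <iostream>
#include <string>

// Escape for embedding in HTML element content.
std::string htmlEscape(const std::string& in) {
    std::string out;
    for (char c : in) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&#39;";  break;
            default:   out += c;
        }
    }
    return out;
}

// Escape for embedding in a JSON string literal (minimal subset only).
std::string jsonEscape(const std::string& in) {
    std::string out;
    for (char c : in) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            default:   out += c;
        }
    }
    return out;
}

int main() {
    std::string comment = "<script>alert('hi')</script>";
    std::cout << "<p>" << htmlEscape(comment) << "</p>\n";             // safe in HTML
    std::cout << "{\"comment\":\"" << jsonEscape(comment) << "\"}\n";  // safe in JSON
}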
I would say that the two important points are the following:
The encoding used by the provider MUST be specified with extreme clarity and precision in a reference document, so that all consumer implementors can know what to expect.
Whatever default encoding is used by the provider MUST keep all needed information, i.e. still be amenable to transcoding by any consumer who would wish to do it.
If you follow these two rules then you will have done 95% of the job for reliability and security.
As for your specific question, a good practice is a middle-ground: the provider follows by default a "generic" encoding, but consumers can ask (optionally) for a specific encoding which the provider may then apply -- this allows the provider to support a number of dumb, lightweight clients of possibly different kinds and can be extended later on with extra encodings without breaking the API.
I firmly believe it is both the consumer and the provider that need to do their part in being good citizens in the security space.
As the provider, I want to make sure I deliver a secure product. I don't need to know the context in which my client is going to use my product; all I need to know is how I am going to deliver it. If my delivery is in JSON, then I can use that context to escape my data before sending it off, and similarly for XML, plain text, etc. Furthermore, there are transport methods that already aid in security; JSONP is one such delivery method, and it helps ensure the payload is consumed appropriately.
As the consumer (and, by the way, in our environment no one is the final consumer; we are all providers to the final end client, mostly end users via a web browser), we also have to secure the data at this end. I would never trust a black-box API to do this job for me; I would always make a point of ensuring a secure payload. There are many tools out there (the ESAPI project from OWASP comes to mind) that will aid in sanitizing data by context. Remember that you are eventually sending this data on to the end user (the browser), and if something is awry you won't be able to pass the buck. Your service will be viewed as the vulnerable one regardless of where the flaw lies. Additionally, as the consumer, you may not always be able to rely on the black-box provider to fix their flaws in a timely fashion. What if their support is lacking, or they have higher priorities? Does that mean you continue to serve a known flaw to your end users?
Security is about layers, and having safeguards at the source and end-points is always preferable.

How to create extended (custom) file property in Windows?

We have a proprietary file format which has embedded in it a product-code.
I am just starting down the path of "enabling the end-user to sort / filter by product-code when opening a file".
The simplest approach for us might be to simply have another drop-down in our customized Open File Dialog in which to choose a product-code to filter by.
However, I think it might be more useful to the end-user if we could present this information as a column in the details view for this file type - just as name, date-modified, type, size, etc., are also detail properties of a file-type (or perhaps generic to all files).
My vague understanding is that XP and prior Windows versions embedded this sort of metadata in an alternate data stream in NTFS. However, starting in Vista, Microsoft stopped using alternate data streams due to their dependence upon NTFS, and hence their fragility (e.g. they can't be sent as a file attachment or moved to a FAT-formatted thumb drive).
Things I need to know but haven't figured out yet:
Is it possible (and practicable) to create a custom extended file property for our file type that exposes the product-code to the Windows shell, so that it can be seen in Windows Explorer (and hence in file dialogs)? And if so, how?
If that is doable, how do we configure things so that the product-code column is displayed by default for folders containing our file type?
Can anyone point me to a good starting point on the above? We certainly don't have to accomplish this by publishing a custom extended file property - but that seems like a sensible approach, in the absence of any way to measure the costs of going this route.
If you have sensible alternative approaches to the problem, I'd be interested in those as well!
Just found: http://www.codeproject.com/Articles/830/The-Complete-Idiot-s-Guide-to-Writing-Shell-Extens
CRAP! It seems I'm very late to the banquet, and MS has already removed this functionality from their shell: http://xpwasmyidea.blogspot.com/2009/10/evil-conspiracy-behind-customizable.html
By far the easiest approach to developing a shell extension is to use a library made for the purpose.
I can recommend EZShellExtension because I have used it in the past to add columns and thumbnails/preview for a custom file format for our company.

URL shortening algorithm in C/C++ - interview

Refer - https://stackoverflow.com/a/742047/161243
The above algorithm uses a DB to store the data. Now suppose the interviewer says that you can't use a DB. In that case we can have a structure:
struct st_short_url {
    char *short_url;
    char *url;
};
Then a hashtable: st_short_url *hashTable[N];
Now we can have an int id which is incremented each time, or a randomly generated id, which is converted to base62 to form the short code.
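For illustration, the id-to-base62 step could look like this (the alphabet ordering is just a convention picked for the sketch):

#include <cstdint>
#include <string>

// Convert an incrementing (or random) 64-bit id to a base62 short code.
std::string to_base62(uint64_t id) {
    static const char alphabet[] =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    if (id == 0) return "0";
    std::string code;
    while (id > 0) {
        code.insert(code.begin(), alphabet[id % 62]);
        id /= 62;
    }
    return code;
}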
Problems I see:
-- If this process terminates, I lose the int id and the entire hashTable held in RAM. So do I keep writing the hashTable back to disk so that it is persisted? If so, should a B-tree be used? And do we need to write the id to disk as well?
P.S. A hashtable plus writing to disk is a database, but what if I can't use a DBMS? What if I need to come up with my own implementation?
Your thoughts please...
Another Question:
In general, How do we handle infinite redirects in URL shortening?
If you can't use a DB of any kind (i.e. no persistent storage; the file system is nothing but a primitive DB!), then the only way to do it that I see is lossless compression + encoding in the allowed characters. The compression algorithm may employ knowledge about URLs (e.g. that it is very likely that they begin with either http:// or https://, that quite a few continue with www., and that the domain name most often ends in .com, .org or .net). Moreover, you can always assume a slash after the host name (because http://example.org and http://example.org/ are equivalent). You also may assume that the URL only contains valid characters, and special-case some substrings which are very likely to occur in the URL (e.g. frequently linked domains, or known naming schemes for certain sites). Probably the compression scheme should feature a version field so that you can update the algorithm when usage patterns change (e.g. a new web site gets popular and you want to special-case it as well, or a popular site changes a URL pattern that you special-cased) without risking that old links become invalid.
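A crude sketch of the prefix-substitution part of such a scheme (the token table and the version byte are made up for illustration, not an established format):

#include <string>
#include <utility>
#include <vector>

// Replace well-known URL prefixes with one-byte tokens; unmatched input is
// copied verbatim. A real scheme would follow this with a general-purpose
// compressor and then encode the resulting bytes in URL-safe characters.
std::string compress_url(const std::string& url) {
    static const std::vector<std::pair<std::string, char>> prefixes = {
        {"https://www.", '\x01'},
        {"https://",     '\x02'},
        {"http://www.",  '\x03'},
        {"http://",      '\x04'},
    };
    std::string out;
    out += '\x01';   // version byte, so the scheme can evolve without breaking old links
    std::string rest = url;
    for (const auto& [prefix, token] : prefixes) {
        if (rest.compare(0, prefix.size(), prefix) == 0) {
            out += token;
            rest = rest.substr(prefix.size());
            break;
        }
    }
    out += rest;     // remaining characters copied as-is
    return out;
}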
Such a scheme could also be supported directly in the browser through an extension, saving server bandwidth (the server would still have to be there for those without a browser extension and as fallback if the extension doesn't yet have the newest compression data).
The requirement isn't practical, but you don't have to give a practical answer. Just use the file system and the interviewer won't even realize it.
To store:
convert input URL to a string e.g. base64 conversion.
make a file of that name
return the inode number as the short url (e.g. ls -i filename ) or stat() etc.
To retrieve:
get the inode number from user.
find / -inum n -print or some other mechanism.
convert that back to a URL from filename.
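A rough sketch of the store step with plain POSIX calls (the directory is a placeholder, and the URL is assumed to already be encoded into a file-system-safe name):

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <string>

// Create (or reuse) a file whose name encodes the URL and return its inode
// number, which then serves as the "short" identifier.
long long store_url(const std::string& encoded_url) {
    std::string path = "/var/tmp/urlstore/" + encoded_url;   // placeholder directory
    int fd = open(path.c_str(), O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return -1; }
    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return -1; }
    close(fd);
    return static_cast<long long>(st.st_ino);
}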
A database is a data structure that supports insertion, removal and search of items. As has been pointed out in the comments to the OP, nearly everything is a database, so this constraint seems somewhat uninformed.
If you're not allowed to use an existing DBMS, you can resort to storing items on disk, making use of mkstemp() or a similar technique that doesn't suffer from race conditions (the often-suggested tmpnam() is racy between generating the name and creating the file). mkstemp() yields a unique name, and you can use the associated file to store information.
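A minimal sketch of that idea (the directory and name template are arbitrary):

#include <cstdio>
#include <cstdlib>
#include <string>
#include <unistd.h>

// Atomically create a uniquely named file and use the generated name as the id.
// mkstemp() both generates the name and opens the file, so there is no window
// for another process to grab the same name.
std::string new_record(const std::string& contents) {
    char path[] = "/var/tmp/shorturl.XXXXXX";     // placeholder location and template
    int fd = mkstemp(path);                       // replaces XXXXXX and opens the file
    if (fd < 0) { perror("mkstemp"); return ""; }
    write(fd, contents.data(), contents.size());  // store the long URL in the file
    close(fd);
    return path;                                  // the unique suffix acts as the key
}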

Logging Etiquette

I have a server program that I am writing. In this program, I log a lot. Is it customary in logging (for a server) to overwrite the log from previous runs, to append to the file with some sort of new-run header, or to create a new log file (it won't be restarted too often)?
Which of these is the usual way of doing things under Linux/Unix/macOS?
Also, can anyone suggest a logging library for C++/C? I need one, regardless of the answer to the above question.
Take a look in /var/log/ and you'll see that files are structured like:
serverlog
serverlog.1
serverlog.2
This is done by logrotate which is called in a cronjob. But everything is simply in chronological order within the files. So you should just append to the same log file each time, and let logrotate split it up if needed.
You can also add a configuration file to /etc/logrotate.d/ to control how a particular log is rotated. Depending on how big your logfiles are, it might be a good idea to add here information about your logging. You can take a look at other files in this directory to see the syntax.
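For example, a minimal /etc/logrotate.d/ entry could look like this (the path and rotation policy are placeholders):

/var/log/myserver/*.log {
    weekly
    rotate 8
    compress
    delaycompress
    missingok
    notifempty
}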
This is a rather complex issue. I don't think that there is a silver bullet that will kill all your concerns in one go.
The first step in deciding what policy to follow would be to set your requirements. Why is each entry logged? What is its purpose? In most cases this will result in some rather concrete facts, such as:
You need to be able to compare the current log with past logs. Even when an error message is self-evident, the process that led to it can be determined much faster by playing spot-the-difference, rather than puzzling through the server execution flow diagram - or, worse, its source code. This means that you need at least one log from a past run - overwriting blindly is a definite No.
You need to be able to find and parse the logs without going out of your way. That means using whatever facilities and policies are already established. On Linux it would mean using the syslog facility for important messages, to allow them to appear in the usual places.
There is also some good advice to heed:
Time is important. Not only because there's never enough of it, but also because log files without proper timestamps for each entry are practically useless. Make sure that each entry has a timestamp - most system-wide logging facilities will do that for you. Also make sure that the clocks on all your computers are as accurate as possible - using NTP is a good way to do that.
Log entries should be as self-contained as possible, with minimal cruft. You don't need to have a special header with colors, bells and whistles to announce that your server is starting - a simple MyServer (PID=XXX) starting at port YYYYY would be enough for grep (or the search function of any decent log viewer) to find.
You need to determine the granularity of each logging channel. Sending several GB of debugging log data to the system logging daemon is not a good idea. A good approach might be to use separate log files for each logging level and facility, so that e.g. user activity is not mixed up with low-level data that is only useful when debugging the code.
Make sure your log files are in one place, preferably separated from other applications. A directory with the name of your application is a good start.
Stay within the norm. Sure you may have devised a new nifty logfile naming scheme, but if it breaks the conventions in your system it could easily confuse even the most experienced operators. Most people will have to look through your more detailed logs in a critical situation - don't make it harder for them.
Use the system's log handling facilities. E.g. on Linux that would mean appending to the same file and letting an external tool like logrotate handle the log files. Not only is it less work for you, it also automatically keeps your logging consistent with any system-wide logging policies.
Finally: always copy important data to the system log as well. Operators watch the system logs. Please, please, please don't make them have to look in other places just to find out that your application is about to launch the ICBMs...
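For example, sending the important messages to the system log from C or C++ takes only a few lines (the program name, facility and messages are placeholders):

#include <syslog.h>

int main() {
    // "myserver" is a placeholder identity; LOG_PID adds the process id to each entry.
    openlog("myserver", LOG_PID, LOG_DAEMON);
    syslog(LOG_INFO, "myserver starting on port %d", 12345);
    syslog(LOG_ERR, "could not open configuration file: %s", "/etc/myserver.conf");
    closelog();
    return 0;
}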
https://stackoverflow.com/questions/696321/best-logging-framework-for-native-c
For the logging, I would suggest creating a new log file and cleaning it out at a certain frequency to avoid it growing too large. Overwriting the logs from previous runs is usually a bad idea.