How do I imitate twitters url-shortener? - regex

the main question is a bit short so I'll collaborate.
I'm building an app for twitter with which you can do the basic actions (get posts, do a post, reply etc.)
Now I figured it would be a good idea if I'd check the max 140 char limit in my app.
So far so good, then someone asked if I could also do the url-shortener thing.
so at the moment I have a regex that picks op most (in fact too much) url's, takes the lenght of them and either adds or deduces the difference from the 140 max.
It's still a but buggy but I can manage that.
Now my problem....
It seems twitter is quite picky in what they think is an url:
I got the most basic ones (starting with http(s):// and such), but twitter also replaces some tld's very easily, (www.)google.com [whatever].net/.biz/.info are just a few of them)
but not .nl .de .tk
Now I was wondering if perhaps someone has found out which ones they do and which ones they don't 'shorten'.
now because I'm pretty sure my regex isn't the best either I'll drop that here as well:
((http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:\/~\+#]*[\w\-\#?^=%&\/~\+#])?)|([\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:\/~\+#]*[\w\-\#?^=%&\/~\+#])?)

http://support.twitter.com/articles/78124-how-to-shorten-links-urls# indicates that all URLs posted to Twitter will be rewritten to be exactly 19 characters long.

I am using this: var url_expression = /[-a-zA-Z0-9#:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9#:%_\+.~#?&//=]*)?/gi; Nobody has complained :)

I figured it out, I found a pretty important line on the tld wikipage. It states that all country TLD's are two chars long. And also the other way around; all 2 char tld's are countries. With that in mind, I started testing a bunch of them with twitter and I'm pretty sure I now know what url's twitter shortens and which ones they don't.
All url's starting with http:// or https://
All url's like [something].[non country tld] # .com .biz .mobi etc. (Except .arpa & .aero)
All url's like [something].[something].[valid tld] # including countries
links like http://[user]:[pass]#[something].[tld] will NOT be shortened
Now to build a regex for it, i'll post it here as soon as I think I have it :D
this is what I got this far:
/(^(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?:(?:[-\w]+\.)+(?:com|asia|cat|coop|edu|int|tel|pro|org|net|gov|mil|biz|info|mobi|name|jobs|museum|travel|([a-z]{2})))(?::[\d]{1,5})?(?:(?:(?:\/(?:[-\w~!$+|.,=\(\)]|%[a-f\d]{2})+)+|\/)+|\?|#)?(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?)/gim;
one major flaw still in it, it also accepts [domain].[tld] which twitter doesn't.
I hope this will help someone in the future. I'm pretty sure there's not a whole lot easy-to-find info about this on the web (or at least I couldn't find it).

Related

Get spamassassin to drop emails containing a specific REGEX in attached filenames

newbie asking first question :)
I'm running a mail server (Ubuntu/Postfix/Dovecot) with SpamAssassin. Most of the known spam is flagged (RBLs, and obvious UCE) except for this particular malspam in attached zip files like "order_info_654321.zip", "paymet_document_123456.zip", and so on, when it doesn't fit any other SA rules. I'd like to procure a rule which drops the matching offenders into oblivion.
After fiddling with regex101.com, I've come up with an expression that matches these patterns exclusively:
/\w+[_][0-9]{6}.zip$/img
Question is... How to format it all, get it to work, and where to put it? So far, I edited /etc/spamassassin/local.cf, added this to the bottom, and restarted:
mimeheader TROJAN_ATTACHED Content-Type =~ /\w+[_][0-9]{6}.zip$/img
describe ZIP_ATTACHED email contains a zip trojan attachment
score TROJAN_ATTACHED 99.
But it doesn't seem to do the magic. Where else can I look for this?
Thank you all,
Keijo.-
You have a wrong regex. You do not need a $ char at the end, because filename strings are not necessarily at the end of the Content-Type header. Instead, you can use a word boundary \b anchor. In my rules, I have the following, and it perfectly works:
mimeheader MIME_FAIL Content-Type =~ /\.(ade|adp|bat|chm|cmd|com|cpl|exe|hta|ins|isp|jse|lib|lnk|mde|msc|msp|mst|pif|scr|sct|shb|sys|vb|vbe|vbs|vxd|wsc|wsf|wsh|reg)\b/i
describe MIME_FAIL Blacklisted file extension detected
score MIME_FAIL 5
First up, SA doesn't drop e-mails by default, but it can score them so high on spam content that they don't show up to anyone's inbox. Second, the "ingredients" I started with were incorrect, plus messed up with SA ability to function at all.
This actually did the trick when added into/etc/spamassassin/local.cf:
full TROJAN_ZIPUNDS /\w*[_][\d]{1,6}\.zip/img
score TROJAN_ZIPUNDS 99
describe TROJAN_ZIPUNDS RM zip attached trojan underscore
Even though these spammers altered from zip to rar, to underscores to dashes, different filenames, and so on, creating rules to counter them became simple after succeeding with the first one. Here's what I added too:
full TROJAN_RARDASH /\w*[-][\d]{1,6}\.rar/img
score TROJAN_RARDASH 99
describe TROJAN_RARDASH RM rar attached trojan dash
Also, as first described, I needed to specifically block certain zip file names which soon morphed to rar and dashes, so, morphing the regex and appending as a rule triad to spamassassin's local.cf (and restarting) is currently holding, until next spam wave :-)
Finally, this is a very very blunt workaround, so anyone with expertise on the subject is more than welcome to chime in.
You are using the wrong mime header to check for the filename. Use this instead:
mimeheader TROJAN_ATTACHED Content-Disposition =~ /\w+[_][0-9]{6}.zip/img
Also make sure you have the MimeHeader plugin loaded.
loadplugin Mail::SpamAssassin::Plugin::MIMEHeader

SQL Server Regular Expression Workaround in T-SQL?

I have some SQLCLR code for working with Regular Expresions. But now that it is getting migrated into Azure, which does not allow SQLCLR, that's out. I need to find a way to do regex in pure T-SQL.
Master Data Services are not available because the dev edition of MSSQL we have is not R2.
All ideas appreciated, thanks.
Regular expression match samples that need handling
(culled from regexlib and other places over the past few years)
email address
^[\w-]+(\.[\w-]+)*#([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?$
dollars
^(\$)?(([1-9]\d{0,2}(\,\d{3})*)|([1-9]\d*)|(0))(\.\d{2})?$
uri
^(http|https|ftp)\://([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]+)*#)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2}))(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*$
one numeric digit
^\d$
percentage
^-?[0-9]{0,2}(\.[0-9]{1,2})?$|^-?(100)(\.[0]{1,2})?$
height notation
^\d?\d'(\d|1[01])"$
numbers between 1 1000
^([1-9]|[1-9]\d|1000)$
credit card numbers
^((4\d{3})|(5[1-5]\d{2})|(6011))-?\d{4}-?\d{4}-?\d{4}|3[4,7]\d{13}$
list of years
^([1-9]{1}[0-9]{3}[,]?)*([1-9]{1}[0-9]{3})$
days of the week
^(Sun|Mon|(T(ues|hurs))|Fri)(day|\.)?$|Wed(\.|nesday)?$|Sat(\.|urday)?$|T((ue?)|(hu?r?))\.?$
time on 12 hour clock
(?<Time>^(?:0?[1-9]:[0-5]|1(?=[012])\d:[0-5])\d(?:[ap]m)?)
time on 24 hour clock
^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
usa phone numbers
^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$
Unfortunately, you will not be able to move your CLR function(s) to SQL Azure. You will need to either use the normal string functions (PATINDEX, CHARINDEX, LIKE, and so on) or perform these operations outside of the database.
EDIT Adding some information for the examples added to the question.
Email address
This one is always controversial because people disagree about which version of the RFC they want to support. The original didn't support apostrophes, for example (or at least people insist that it didn't - I haven't dug it up from the archives and read it myself, admittedly), and it has to be expanded quite often for new TLDs (once for 4-letter TLDs like .info, then again for 6-letter TLDs like .museum). I've often heard quite knowledgeable people state that perfect e-mail validation is impossible, and having previously worked for an e-mail service provider, I can tell you that it was a constantly moving target. But for the simplest approaches, see the question TSQL Email Validation (without regex).
One numeric digit
Probably the easiest one of the bunch:
WHERE #s LIKE '[0-9]';
Credit card numbers
Assuming you strip out dashes and spaces, which you should do in any case. Note that this isn't an actual check of the credit card number algorithm to ensure that the number itself is actually valid, just that it conforms to the general format (AmEx = 15 digits starting with a 3, the rest are 16 digits - Visa starts with a 4, MasterCard starts with a 5, Discover starts with 6 and I think there's one that starts with a 7 (though that may just be gift cards of some kind)):
WHERE #s + ' ' LIKE '[3-7]'+ REPLICATE('[0-9]', 14) + '[0-9 ]';
If you want to be a little more precise at the cost of being long-winded, you can say:
WHERE (LEN(#s) = 15 AND #s LIKE '3' + REPLICATE('[0-9]', 14))
OR (LEN(#s) = 16 AND #s LIKE '[4-7]' + REPLICATE('[0-9]', 15));
USA phone numbers
Again, assuming you're going to strip out parentheses, dashes and spaces first. Pretty sure a US area code can't start with a 1; if there are other rules, I am not aware of them.
WHERE #s LIKE '[2-9]' + REPLICATE('[0-9]', 9);
-----
I'm not going to go further, because a lot of the other expressions you've defined can be extrapolated from the above. Hopefully this gives you a start. You should be able to Google for some of the others to see how other people have replicated the patterns with T-SQL. Some of them (like days of the week) can probably just be checked against a table - seems overkill to do an invasie pattern matching for a set of 7 possible values. Similarly with a list of 1000 numbers or years, these are things that will be much easier (and probably more efficient) to check if the numeric value is in a table rather than convert it to a string and see if it matches some pattern.
I'll state again that a lot of this will be much better if you can cleanse and validate the data before it gets into the database in the first place. You should strive to do this wherever possible, because without CLR, you just can't do powerful RegEx inside SQL Server.
Ken Henderson wrote about ways to replicate RegEx without CLR, but they require sp_OA* procedures, which are even less likely to ever see the light of day in Azure than CLR. Most of the other articles you'll find online use an approach similar to Ken's or use complex use of built-in string functions.
Which portions of RegEx specifically are you trying to replicate? Can you show an example of the input/output of one of your functions? Perhaps it will be easy to convert to get similar results using the built-in string functions like PATINDEX.

Controlling Word Wrap in a container

I have a peculiar problem. I have an email group that pipes emails to a message board. The word wrap of the emails varies. In yahoo, the messages tend to fill the entire container on the message board. But in all other mail clients, only part of the container width is filled, because the original mail was wrapped. I want all of the email messages to fill the entire width of the container. I've thought of two possible solutions: CSS, or a Regex that eliminates line breaks. Because I am only a garage mechanic (at these sorts of things), I simply cannot get the job done. Any help out there?
Here is a link that shows the issue: http://seanwilson.org/forum/index.php?t=msg&th=1729&start=0&S=171399e41f2c10c4357dd9b217caaa3f
(compare the message of "sean" with that of "rob." One fills the container, the other not).
Can any of you suggest how to get all the mail to fill the container?
You gave too little information - what programming language are you using - PHP/Javascript/anything different?
I think you only need to replace \n, \r and \r\n with whitespace. PHP code for that:
$nowrap = str_replace('\r\n', ' ', $nowrap);
$nowrap = str_replace('\r', ' ', $nowrap);
$nowrap = str_replace('\n', ' ', $nowrap);
You can do that analogically in other languages (for JS see string.replace method: http://www.tizag.com/javascriptT/javascript-string-replace.php).
Depending on the situation (people always seem to add 2 linebreaks between paragraphs), you could say the problem is: replace all newlines not directly preceded or followed by a newline with a space.
//just to be sure, remove \r's
$string = str_replace("\r",'',$string);
$string = preg_replace('/(?<!\n)\n(?!\n)/',' ',$string);
While allowing \r's:
$string = preg_replace('/(?<!\r|\n)\r?\n(?!\r|\n)/',' ',$string);
Edit: nevermind: do not use: while people tend to write their email text in paragraphs, you will break their signature / signoff with this regex. One could fiddle around with a minimum linelength before deeming it 'breakable' (i chose 63), but fiddly it will be:
$string = preg_replace('/([^\r\n]{63,})\r?\n(?!\r|\n)/','$1 ',$string);
The problem is: there are no assurance the linebreak wasn't intended. With a fiddleable line-length you could base it on average users, but the question is: what do they mind more: the differences between breaking & non-breaking paragraphs, or the breaking of their signatures?
Thanks for getting back so quickly!
The discussion board uses php (and also CSS). The only trouble is that I am somewhat limited in my ability to tinker with its programing. If I am to do this at my current level of skilty, I have only one of two options.
using a preg-replace in php. The discussion board allows us to do this from a control panel. So If I could do it with one preg-replace statement, it should work.
Would Wrikken's solution work if I do not remove \r's? Because that seems to be spot on. (could the \r's be added to the preg-replace?)
I had hoped the solution could come through a css property of some sort. I guess that isn't possible.
Thanks so much for your help!
[NOTE: thanks so much for your help! The solution worked!!! I changed the number to 53 or so. It needed to be a little smaller. I don't care that a rare, long signature lines may lose its carriage return. That's a small price to pay for a full message box! You easily saved me several days of learning something that was bound to be moderately frustrating, Thanks so much for that quick fix. I am joyous at the help I received here.]

Regexp to parse out a person's name?

This might be a hard one (if not impossible), but can anyone think of a regular expression that will find a person's name, in say, a resume? I know this won't be 100% accurate, but I can't come up with something.
Let's assume the name only shows up once in the document.
No, you can't use regular expressions for this. The only chance you have is if the document is always in the same format and you can find the name based on the context surrounding it. But this probably isn't the case for you.
If you are asking your applicants to submit their résumé online you could provide a separate field for them to enter their name and any other information you need instead of trying to automatically parse résumés.
Forget it - seriously.
Or expect to get a lot of applications from a Mr C Vitae
In my experience, having written something very similar (but a very long time ago), about 95% of resumes have the person's name as the very first line. You could probably have a pretty loose regex checking for alpha, hyphens, periods, and assume that's the name.
Obviously there's no way to do this 100% accurately, as you said, but this would be close.
Unless you wanted to build an expression that contained every possible name, or-ed together, the expression you are referring to is not "Regular," with a capital R. A good guess might be to go looking for the largest-font words in the document. If they follow a pattern that looks like firstname-lastname, name-initial-name, etc., you could call it a good guess...
That's a really hairy problem to tackle. The regex has to match two words that could be someone's name. The problem with that is that some people, of Hispanic origin, for example, might have a name that's more than 2 words. Also, how would you define two words to match for a name? Would you use a database of common first and last name fields? That might work unless someone has an uncommon name.
I'm reminded of a story of a COBOL teacher in college told me about an individual of Asian origin who's name would break every rule the programmers defined for a bank's internal system. His first name was "O." just the letter O.
The only remotely dependable way to nail down the regex would be if you had something to set off your search with; maybe if a line of text in the resume began with "Name: " then you'd know where to start looking.
tl;dr: People's names and individual resumes are too heavily varied for a regular expression to pick apart.
You could do something like Amazon does for book overviews: SIPs. This would require some after-the-fact double checking by humans but you might find the person's name(s) in there.

Coding a Gmail style "hide quoted text" for web based mailing list archive

I'm working on a web application that parses and displays email messages in a threaded format (among other things). Emails may come from any number of different mail clients, and in either text or HTML format.
Given that most people have a tendency to top post, I'd like to be able to hide the duplicated message in an email reply in a manner similar to how Gmail does it (e.g. "show quoted text").
Determining which part of the message is the reply is somewhat challenging. Personally, I use "> " delimiters at the beginning of the quoted text when replying. I created a regexp that looks for these lines and wraps a div around them to allow some JS to hide or show this block of text.
I then noticed that Outlook doesn't use the "> " characters by default, it simply adds a header block above the reply with the summary of the headers (From, Subject, Date, etc.). The reply is untouched. I can match on this and hide the rest of the email, working with the assumption that it's a top quote.
I then looked at Thunderbird, and it uses "> " for text, and <blockquote> for HTML mails. I still haven't looked at what Apple Mail does, what Notes does, or what any of the other millions of mail clients out there do.
Will I be writing a special case regexp for every single client out there? or is there something I'm missing?
Any suggestions, sample code or pointers to third party libraries much appreciated!
It'll be pretty hard to duplicate the way gmail does it since it doesn't care about whether it was a quoted piece or not, like Zac says, it just seems to care about the diff.
Its actually pretty hard to get this right 100% of the time. Plain text email is "lossy", its entirely possible for you to send
> Here is my long line that is over 74 chars (email line length limit)
Which can get encoded as something like
> Here is my long line that is over 74 chars (email=
line length limit)
And then is decoded as
> Here is my long line that is over 74 chars (email
line length limit)
Making it indistinguishable from an inline-reply.
This is email, so variations are abound. Email usually line-wraps at something like 74 characters, and encoding schemes can differ. Its a real PITA. If you can access the HTML version, you will probably have better luck looking for quote tags and the like. Another idea would be to parse both the plain text and html version to try and determine the boundries.
Additionally, its best to just plan for specific client hacks. They all construct mime messages differently, both in structure and header content.
Edit: I say this with the experience of writing an email processing system as well as seeing several people try to do the -exact- thing you're doing. It always only got "ok" results.
From what I can tell, gmail does not bother about prefixed lines or section headings, except to ignore them. If the text lines appeared earlier in the thread, and then reappear, it is considered to be quoted. Thus, e.g., if you send multiple messages and don't change your signature, the signature is considered to be quoted. If you've already dealt with the '>' prefix, a simple diff should do most of the rest. No need to get fancy.
First thing I think I'd do is strip out all the white space, or reduce white space to 1 between each word, and special characters from both blocks, then look for the old one in the new one.
Here's a mozdev project that may be helpful for others who stumble across this page looking for a Thunderbird solution:
http://quotecollapse.mozdev.org/