preg match email and name from to - regex

I want to extract the name and email from the following formats (if you know of any other format that mail applications use when sending emails, please mention it in a comment :)).
How can I get the name and email from strings like the following (it's a single string and can be in any of these formats):
- jon435@hotmail.com
- james jon435@hotmail.com
- "James Jordan" <jon435@hotmail.com> (gmail format)
- janne - jon44@hotmail.com (possible format)

The answer is straightforward, at least for the email portion. The rest can be special-cased away.
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Proof I'm not insane.

If you only have those strings, it is going to require more work than a simple regular expression. For instance, your first example doesn't include the full name, only the e-mail address, so you would have to use the Microsoft Live ID API to retrieve that information... and that turns out to be really hard.
What exactly are you trying to do? Perhaps there is another way?
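For the four formats listed in the question, a rough sketch of the "special-cased" approach could look like this. It assumes the addresses use the usual @ separator, it uses a deliberately loose address pattern rather than a full RFC validator, and `parseRecipient` is a hypothetical helper name, not part of any library:

```javascript
// Hypothetical helper: split a "To"-style string into a display name and an address.
// The address pattern is intentionally loose; it only locates the address, it does not validate it.
function parseRecipient(input) {
  const emailPattern = /[^\s<>"]+@[^\s<>"]+\.[^\s<>"]+/;
  const match = input.match(emailPattern);
  if (!match) {
    return null; // no address found in the string
  }
  const email = match[0];
  // Everything before the address is treated as the name; strip the quotes,
  // the opening angle bracket and a trailing separator dash.
  const name = input
    .slice(0, match.index)
    .replace(/[<"]/g, "")
    .replace(/\s*-\s*$/, "")
    .trim();
  return { name: name || null, email };
}

console.log(parseRecipient("jon435@hotmail.com"));
console.log(parseRecipient("james jon435@hotmail.com"));
console.log(parseRecipient('"James Jordan" <jon435@hotmail.com>'));
console.log(parseRecipient("janne - jon44@hotmail.com"));
```

When the string holds only an address, `name` comes back as `null`, which matches the point above that the name simply isn't there to extract.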

Related

Validate Street Address Format

I'm trying to validate the format of a street address in Google Forms using regex. I won't be able to confirm it's a real address, but I would like to at least validate that the string is:
[numbers(max 6 digits)] [word(minimum one to max 8 words with
spaces in between and numbers and # allowed)], [words(minimum one to max four words, only letters)], [2
capital letters] [5 digit number]
I want the spaces and commas I left in between the brackets to be required, exactly where I put them in the above example. This would validate
123 test st, test city, TT 12345
That's obviously not a real address, but at least it requires the entry to be in the correct format. The data is coming from people answering a question on a form, so it will always be just an address, no names. Plus, they're all addresses in one area, South Florida, where pretty much all addresses will match this format. The problem I'm having is people not entering a city, or commas, so I want to give them an error if they don't. So far, I've found this:
^([0-9a-zA-Z]+)(,\s*[0-9a-zA-Z]+)*$
But that doesn't allow for multiple words between the commas, or the capital letters and numbers for zip. Any help would save me a lot of headaches, and I would greatly appreciate it.
There really is a lot to consider when dealing with a street address--more than you can meaningfully deal with using a regular expression. Besides, if a human being is at a keyboard, there's always a high likelihood of typing mistakes, and there just isn't a regex that can account for all possible human errors.
Also, depending on what you intend to do with the address once you receive it, there's all sorts of helpful information you might need that you wouldn't get just from splitting the rough address components with a regex.
As a software developer at SmartyStreets (disclosure), I've learned that regular expressions really are the wrong tool for this job because addresses aren't as 'regular' (standardized) as you might think. There are more rigorous validation tools available, even plugins you can install on your web form to validate the address as it is typed, and which return a wealth of useful metadata and information.
Try Regex:
\d{1,6}\s(?:[A-Za-z0-9#]+\s){0,7}(?:[A-Za-z0-9#]+,)\s*(?:[A-Za-z]+\s){0,3}(?:[A-Za-z]+,)\s*[A-Z]{2}\s*\d{5}
See Demo
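As a quick check, the pattern above can be run against the sample address from the question. This sketch adds `^` and `$` anchors (my assumption, so that the whole answer has to match rather than a substring), and `isValidAddressFormat` is a hypothetical helper name:

```javascript
// The pattern from the answer above, wrapped in ^...$ so the entire input must match.
const addressPattern = new RegExp(
  String.raw`^\d{1,6}\s(?:[A-Za-z0-9#]+\s){0,7}(?:[A-Za-z0-9#]+,)\s*(?:[A-Za-z]+\s){0,3}(?:[A-Za-z]+,)\s*[A-Z]{2}\s*\d{5}$`
);

// Hypothetical helper: true when the text follows "number street, city, ST 12345".
function isValidAddressFormat(text) {
  return addressPattern.test(text.trim());
}

console.log(isValidAddressFormat("123 test st, test city, TT 12345")); // true
console.log(isValidAddressFormat("123 test st test city TT 12345"));   // false: commas missing
console.log(isValidAddressFormat("123 test st, TT 12345"));            // false: city missing
```

Note that the `\s*` parts make the spaces after the commas and before the zip code optional; if they must be required exactly as in the question, `\s` could be used there instead.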
Accepts Apt# also:
(^[0-9]{1,5}\s)([A-Za-z]{1,}(\#\s|\s\#|\s\#\s|\s)){1,5}([A-Za-z]{1,}\,|[0-9]{1,}\,)(\s[a-zA-Z]{1,}\,|[a-zA-Z]{1,}\,)(\s[a-zA-Z]{2}\s|[a-zA-Z]{2}\s)([0-9]{5})

Regex Check Facebook Video URL

I'm trying to check Facebook video URLs using regex.
This is an example of a valid FB video URL:
https://www.facebook.com/video.php?v=100000000000000 (VALID)
This is an example of a valid FB video URL with a username:
https://www.facebook.com/{username}/videos/100000000000000
Note: {username} can be any string.
Examples:
https://www.facebook.com/username1/videos/100000000000000 (VALID)
https://www.facebook.com/username2/videos/100000000000000 (VALID)
But my regex still gets it wrong when I check an FB video URL with a username.
This is my regex:
^http(s)?://(www\.)?facebook.([a-z]+)/(?!(?:video\.php\?v=\d+|usernameFB/videos/\d+)).*$
You can run it here:
https://regex101.com/r/dF5iP1/6
This will work for you:
^(https?://www\.facebook\.com/(?:video\.php\?v=\d+|.*?/videos/\d+))$
Demo
https://regex101.com/r/sC6oR2/3
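For context, the pattern in the question wraps the alternatives in a negative lookahead `(?!...)`, which rejects URLs that match them instead of requiring them, and it hardcodes `usernameFB`. A small sketch of the answer's pattern applied to the question's sample URLs (the helper name `isFacebookVideoUrl` is made up for illustration):

```javascript
// The pattern from the answer above; built with RegExp so the slashes need no escaping.
const fbVideoPattern = new RegExp(
  String.raw`^(https?://www\.facebook\.com/(?:video\.php\?v=\d+|.*?/videos/\d+))$`
);

// Hypothetical helper: true for video.php?v=<id> and <username>/videos/<id> URLs.
function isFacebookVideoUrl(url) {
  return fbVideoPattern.test(url);
}

console.log(isFacebookVideoUrl("https://www.facebook.com/video.php?v=100000000000000"));      // true
console.log(isFacebookVideoUrl("https://www.facebook.com/username1/videos/100000000000000")); // true
console.log(isFacebookVideoUrl("https://www.facebook.com/username1/photos/100000000000000")); // false
```

Because `.*?` also matches slashes, this accepts any path that ends in `/videos/<digits>`; `[^/]+` would be a stricter choice if the username must be a single path segment.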
UPDATED October 2018
Neither of the two existing REGEX proposals worked for me, and there are more visible cases than the ones considered.
Here's my REGEX Proposal:
^(?:(?:https?:)?\/\/)?(?:www\.)?facebook\.com\/[a-z\.]+\/videos\/(?:[a-z0-9\.]+\/)?([0-9]+)\/?(?:\?.*)?$
^(?:(?:https?:)?\/\/)?(?:www\.)?facebook\.com\/[a-zA-Z0-9\.]+\/videos\/(?:[a-zA-Z0-9\.]+\/)?([0-9]+)
I ignored video.php, I think it's old enough to safely ignore it.
Matches:
https://www.facebook.com/aguardos.nocturnos/videos/vb.1614866072064590/1828228624061666/?type=2&theater
https://www.facebook.com/aguardos.nocturnos/videos/vb.1614866072064590/1828228624061666?type=2&theater
https://www.facebook.com/aguardos.nocturnos/videos/1828228624061666/
https://www.facebook.com/latavernadelssomnis/videos/1609038972452561/?hc_ref=NEWSFEED
//www.facebook.com/aguardos.nocturnos/videos/1828228624061666/
https://facebook.com/aguardos.nocturnos/videos/1828228624061666/
http://www.facebook.com/aguardos.nocturnos/videos/1828228624061666/
www.facebook.com/aguardos.nocturnos/videos/18282286240612666/
facebook.com/aguardos.nocturnos/videos/18282286240612666/
https://www.facebook.com/aguardos.nocturnos/videos/1828228624061666
https://www.facebook.com/WEAU13News/videos/588612391555522/UzpfSTEzMzAzMDk4NjM6MTAyMTMxMjMzNDE3ODE0MTI/
I neither own nor have watched any of these videos. I just picked random ones that were on my Facebook feed.
Groups
Video ID.
Gotchas
One of the most common Facebook video formats is more complex than I'd like it to be and matching every case perfectly with REGEX would probably lead to a very messy query.
https://www.facebook.com/RolandGarros/videos/10155404760334920/FOO (valid)
https://www.facebook.com/RolandGarros/videos/FOO/10155404760334920 (valid)
https://www.facebook.com/RolandGarros/videos/10155404760334920/FOO/FOO (invalid)
The way this one seems to work is by retrieving the numeric value in the first or second part after videos/.
https://www.facebook.com/RolandGarros/videos/10155361533554920/1015536153355492134
What about this one where two valid numeric values are involved? It seems like the second one is the one that will prevail.
For this reason the REGEX solution above was softened [1] to match only the beginning of the Facebook URL, up to the video group that we're looking for. Considering that your goal is probably to extract the video ID rather than to verify the URL, I think that's a valid trade-off. At the end of the day, you'll be checking the video either way (through the API or by scraping) to extract the video information, since having an ID doesn't mean that the video exists or that it's public.
[1] Not just softened, but also improved to match the test case format.
Test
You can easily test it yourself on Regex101
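Since the stated goal is extracting the video ID, here is a minimal sketch that uses the softened pattern from this answer and returns capture group 1. The function name `extractFacebookVideoId` is hypothetical:

```javascript
// The softened pattern from this answer: it matches only up to the video ID (group 1).
const videoIdPattern =
  /^(?:(?:https?:)?\/\/)?(?:www\.)?facebook\.com\/[a-zA-Z0-9\.]+\/videos\/(?:[a-zA-Z0-9\.]+\/)?([0-9]+)/;

// Hypothetical helper: returns the video ID as a string, or null when the URL doesn't match.
function extractFacebookVideoId(url) {
  const match = videoIdPattern.exec(url);
  return match ? match[1] : null;
}

console.log(extractFacebookVideoId(
  "https://www.facebook.com/aguardos.nocturnos/videos/vb.1614866072064590/1828228624061666/?type=2&theater"
)); // "1828228624061666"
console.log(extractFacebookVideoId(
  "https://www.facebook.com/RolandGarros/videos/10155361533554920/1015536153355492134"
)); // "1015536153355492134" (the second numeric value wins)
console.log(extractFacebookVideoId("https://www.facebook.com/video.php?v=100000000000000")); // null
```

The ID is kept as a string on purpose: these values are larger than what a JavaScript number can represent exactly.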
This is a little different than Pedro's, but it works well.
^http(?:s)?://(?:www\.)?facebook.(?:[a-z]+)/((?:video\.php\?v=\d+|username\d/videos/\d+)).*$
https://regex101.com/r/nV4rI3/1
Latest:
/(?:https?:\/\/)?(?:www.|web.|m.)?(facebook|fb).(com|watch)\/(?:video.php\?v=\d+|(\S+)|photo.php\?v=\d+|\?v=\d+)|\S+\/videos\/((\S+)\/(\d+)|(\d+))\/?/
This should help you:
regexr.com/4tdur
You can use it like this:
const myURL = "https://www.facebook.com/video.php?v=100000000000000";
const res = /^https?:\/\/www\.facebook\.com.*\/(video(s)?|watch|story)(\.php?|\/).+$/gm.test(myURL);
console.log(res);
Facebook video URLs nowadays come in the following formats:
https://www.facebook.com/NowThisPolitics/videos/968643940204333/
https://www.facebook.com/chandni.nathani2/videos/10158204539960536/UzpfSTEwMDAwMTc3MzU1MjI2NzoyNzMxNDUyMTYzNTkwNTQy/
Also, since facebook can be replaced by fb, I created this regex:
/(?:https?:\/{2})?(?:w{3}\.)?(facebook|fb).com\/.*\/videos\/.*/

Are my regex just wrong or is there a buggy behaviour in td-agent's format behaviour?

I am using fluentd, elasticsearch and kibana to organize logs. Unfortunately, these logs are not written in any standard format such as Apache's, so I had to come up with the regex for the format myself. I used this site to verify that it works: http://fluentular.herokuapp.com/ .
The logs have roughly this format here:
DEBUG: 24.04.2014 16:00:00 [SingleActivityStrategy] Start Activitiy 'barbecue' zu verabeiten.
The format regex I am using is as follows:
format /(?<pri>([INFO]|[DEBUG]|[ERROR])+)...(?<date>(\d{2}\.\d{2}\.\d{4})).(?<time>(\d{2}:\d{2}:\d{2})).\[(?<subject>(.*))\].(?<msg>(.*))/
Now, judging by that website that is supposed to test specifically fluentd's behaviour with regexes, the output SHOULD be this one:
Record
Key Value
pri DEBUG
date 24.04.2014
subject SingleActivityStrategy
msg Start Activitiy 'barbecue' zu verabeiten.
Instead though, I have this ?bug? where pri is always shortened to DEBU. The same goes for ERROR, which becomes ERRO; only INFO stays INFO. I am not very experienced with regular expressions and I find it hard to believe that this is a bug, but it still confuses me, and any help is greatly appreciated.
I'm not sure I can link the complete config file because I don't personally own these log files and I am trying to keep it on a level that my boss won't get mad at me for posting sensitive information, but should it definitely be needed, I will post them later on after having asked him how much I can reveal.
In general, the logs always look roughly like this:
First the priority, which is either DEBUG, ERROR or INFO, next the date, next what we call the subject, which is always written in [ ], and finally just a message.
Here is a link to fluentular with the format I am using and a test string that produces the right result in fluentular, but not in my config file:
Fluentular
Sorry I couldn't make it work like a regular link to just click on.
Another link to test out regex with my format and test string is this one:
http://rubular.com/r/dfXOkQYNXP
tl;dr version:
my td-agent format regex cuts off the last letter, although fluentular says it shouldn't. My fault or a bug?
How the regex would look if you're trying to match the data specifically:
(INFO|DEBUG|ERROR)\:\s+(\d{2}\.\d{2}\.\d{4})\s(\d{2}:\d{2}:\d{2})\s\[(.*)\](.*)
In your format string, you were using . and ... where your spaces and colon should be. I'm not too sure why this works in Fluentular, but you should have matched the \: and each space between the values explicitly.
So you'd be looking at the following regular expression with the Fluentd fields (which are grouping names):
(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))
Meaning your td-agent.conf should look like:
<source>
type tail
path /var/log/foo/bar.log
pos_file /var/log/td-agent/foo-bar.log.pos
tag foo.bar
format /(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))/
</source>
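To see why the original format trims the last letter: `[DEBUG]` is a character class (any one of the letters D, E, B, U, G), not the literal word, so `([INFO]|[DEBUG]|[ERROR])+` just consumes a run of those letters. The `...` that follows must consume exactly three characters before the date, and with only `: ` available the engine backtracks and gives up the final letter. The sketch below reproduces this in JavaScript rather than in Ruby (which td-agent uses), on the assumption that backtracking behaves the same way for this pattern. INFO presumably survives because those lines carry one extra separator character, which I can't confirm from the question.

```javascript
const line =
  "DEBUG: 24.04.2014 16:00:00 [SingleActivityStrategy] Start Activitiy 'barbecue' zu verabeiten.";

// Original pattern from the question: character classes plus "..." as separator.
const original =
  /(?<pri>([INFO]|[DEBUG]|[ERROR])+)...(?<date>(\d{2}\.\d{2}\.\d{4})).(?<time>(\d{2}:\d{2}:\d{2})).\[(?<subject>(.*))\].(?<msg>(.*))/;

// Corrected pattern: plain alternation, explicit colon and whitespace.
const corrected =
  /(?<pri>INFO|ERROR|DEBUG):\s+(?<date>\d{2}\.\d{2}\.\d{4})\s(?<time>\d{2}:\d{2}:\d{2})\s\[(?<subject>.*)\]\s(?<msg>.*)/;

console.log(original.exec(line).groups.pri);  // "DEBU"
console.log(corrected.exec(line).groups.pri); // "DEBUG"
```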
I would also take a look into comparing Logstash vs. Fluentd. I like Logstash far more because you create Grok filters to match the type of data you want, and it makes formatting your fields much easier because you are providing an abstraction layer, but you essentially will get the same data.
And I would watch out when you're using sites like Rubular, as they are fairly particular about multi-line matching and the like. I'd suggest something like Regexr which gives immediate feedback and you can set global and multiline matching as well.

How to create Gmail filter searching for text only at start of subject line?

We receive regular automated build messages from Jenkins build servers at work.
It'd be nice to ferret these away into a label, skipping the inbox.
Using a filter is of course the right choice.
The desired identifier is the string [RELEASE] at the beginning of a subject line.
Attempting to specify any of the following regexes causes emails with the string release in any case anywhere in the subject line to be matched:
\[RELEASE\]*
^\[RELEASE\]
^\[RELEASE\]*
^\[RELEASE\].*
From what I've read subsequently, Gmail doesn't have standard regex support, and from experimentation it seems, as with google search, special characters are simply ignored.
I'm therefore looking for a search parameter which can be used, maybe something like atstart:mystring in keeping with their has:, in: notations.
Is there a way to force the match only if it occurs at the start of the line, and only in the case where square brackets are included?
Sincere thanks.
Regex is not on the list of search features, and it was (more or less, as "Better message search functionality (i.e. Wildcard and partial word search)") on the list of pre-canned feature requests, so the answer is "you cannot do this via the Gmail web UI" :-(
There are no current Labs features which offer this. SIEVE filters would be another way to do this, but that too was not supported, and there seems to no longer be any definitive statement on SIEVE support in the Gmail help.
Updated for link rot: the pre-canned list of feature requests was, er, canned. The original is on archive.org dated 2012; now you just get redirected to a dumbed-down page telling you how to give feedback. Lack of SIEVE support was covered in answer 78761, "Does Gmail support all IMAP features?". Since some time in 2015 that answer silently redirects to the answer about IMAP client configuration; archive.org has a copy dated 2014.
With the current search facility, brackets of any form () {} [] are used for grouping, and they have no observable effect if there's just one term within. Using (aaa|bbb) and [aaa|bbb] are equivalent and will both find the words aaa or bbb. Most other punctuation characters, including \, are treated as a space or a word separator. + - : and " do have special meaning though; see the help.
As of 2016, only the form "{term1 term2}" is documented for this, and is equivalent to the search "term1 OR term2".
You can do regex searches on your mailbox (within limits) programmatically via Google docs: http://www.labnol.org/internet/advanced-gmail-search/21623/ has source showing how it can be done (copy the document, then Tools > Script Editor to get the complete source).
You could also do this via IMAP as described here:
Python IMAP search for partial subject
and script something to move messages to a different folder. The IMAP SEARCH verb only supports substrings, not regex (Gmail search is further limited to complete words, not substrings), so further processing of the matches to apply a regex would be needed.
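The "further processing" step could be as small as the sketch below. It assumes the subjects have already been fetched by whatever IMAP client you use (that part is not shown), and `filterReleaseSubjects` is a hypothetical name:

```javascript
// Hypothetical post-processing step: the server-side search only narrows messages down
// by substring or word, so the anchored regex is applied afterwards to the returned subjects.
function filterReleaseSubjects(subjects) {
  const releasePrefix = /^\[RELEASE\]/;
  return subjects.filter((subject) => releasePrefix.test(subject));
}

const candidates = [
  "[RELEASE] build 42 succeeded",
  "Re: [RELEASE] build 42 succeeded",
  "Please release the staging server",
];
console.log(filterReleaseSubjects(candidates)); // [ "[RELEASE] build 42 succeeded" ]
```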
For completeness, one last workaround: Gmail supports plus addressing. If you can change the destination address to youraddress+jenkinsrelease@gmail.com, it will still be sent to your mailbox, where you can filter by recipient address. Make sure to filter using the full email address, to:youraddress+jenkinsrelease@gmail.com. This is of course more or less the same thing as setting up a dedicated Gmail address for this purpose :-)
Using Google Apps Script, you can use this function to filter email threads by a given regex:
function processInboxEmailSubjects() {
  // Change this to whatever regex you want; this one should cover OP's scenario.
  const regex = /^\[RELEASE\]/;
  // addLabel() expects a GmailLabel object, so look up the already existing label by name.
  const label = GmailApp.getUserLabelByName("customLabel");
  const threads = GmailApp.getInboxThreads();
  for (let i = 0; i < threads.length; i++) {
    const subject = threads[i].getFirstMessageSubject();
    if (regex.test(subject)) {
      Logger.log(subject);
      // Now do what you want with the email thread. For example, skip the inbox and add the label:
      threads[i].moveToArchive().addLabel(label);
    }
  }
}
As far as I know, unfortunately there isn't a way to trigger this with every new incoming email, so you have to create a time trigger like so (feel free to change it to whatever interval you think best):
function createTrigger() { // you only need to run this once; the trigger then executes the function every hour in perpetuity
  ScriptApp.newTrigger('processInboxEmailSubjects').timeBased().everyHours(1).create();
}
The only option I have found to do this is to find some exact wording and put that under the "Has the words" option. It's not the best option, but it works.
I was wondering how to do this myself; it seems Gmail has since silently implemented this feature. I created the following filter:
Matches: subject:([test])
Do this: Skip Inbox
And then I sent a message with the subject
[test] foo
And the message was archived! So it seems all that is necessary is to create a filter for the subject prefix you wish to handle.

Coding a Gmail style "hide quoted text" for web based mailing list archive

I'm working on a web application that parses and displays email messages in a threaded format (among other things). Emails may come from any number of different mail clients, and in either text or HTML format.
Given that most people have a tendency to top post, I'd like to be able to hide the duplicated message in an email reply in a manner similar to how Gmail does it (e.g. "show quoted text").
Determining which part of the message is the reply is somewhat challenging. Personally, I use "> " delimiters at the beginning of the quoted text when replying. I created a regexp that looks for these lines and wraps a div around them to allow some JS to hide or show this block of text.
I then noticed that Outlook doesn't use the "> " characters by default, it simply adds a header block above the reply with the summary of the headers (From, Subject, Date, etc.). The reply is untouched. I can match on this and hide the rest of the email, working with the assumption that it's a top quote.
I then looked at Thunderbird, and it uses "> " for text, and <blockquote> for HTML mails. I still haven't looked at what Apple Mail does, what Notes does, or what any of the other millions of mail clients out there do.
Will I be writing a special case regexp for every single client out there? or is there something I'm missing?
Any suggestions, sample code or pointers to third party libraries much appreciated!
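For reference, the "> " delimiter approach described above could be sketched roughly as follows for plain-text bodies. The `quoted-text` class name and both function names are hypothetical, and this covers only the "> " convention, not the Outlook header block or HTML blockquotes:

```javascript
// Escape text before embedding it in HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Wrap each run of consecutive lines starting with ">" in a div that
// client-side script can hide or show.
function wrapQuotedText(body) {
  const out = [];
  let inQuote = false;
  for (const line of body.split(/\r?\n/)) {
    const quoted = /^\s*>/.test(line);
    if (quoted && !inQuote) out.push('<div class="quoted-text">');
    if (!quoted && inQuote) out.push("</div>");
    out.push(escapeHtml(line));
    inQuote = quoted;
  }
  if (inQuote) out.push("</div>"); // close a quote block that runs to the end
  return out.join("\n");
}

console.log(wrapQuotedText("Sounds good.\n> Shall we meet?\n> At noon?\nBob"));
```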
It'll be pretty hard to duplicate the way Gmail does it, since it doesn't care whether something was a quoted piece or not. Like Zac says, it just seems to care about the diff.
It's actually pretty hard to get this right 100% of the time. Plain text email is "lossy"; it's entirely possible for you to send
> Here is my long line that is over 74 chars (email line length limit)
Which can get encoded as something like
> Here is my long line that is over 74 chars (email=
line length limit)
And then is decoded as
> Here is my long line that is over 74 chars (email
line length limit)
Making it indistinguishable from an inline-reply.
This is email, so variations abound. Email usually line-wraps at something like 74 characters, and encoding schemes can differ. It's a real PITA. If you can access the HTML version, you will probably have better luck looking for quote tags and the like. Another idea would be to parse both the plain text and HTML versions to try to determine the boundaries.
Additionally, it's best to just plan for specific client hacks. They all construct MIME messages differently, both in structure and header content.
Edit: I say this with the experience of writing an email processing system as well as seeing several people try to do the -exact- thing you're doing. It always only got "ok" results.
From what I can tell, Gmail does not bother about prefixed lines or section headings, except to ignore them. If the text lines appeared earlier in the thread, and then reappear, they are considered to be quoted. Thus, e.g., if you send multiple messages and don't change your signature, the signature is considered to be quoted. If you've already dealt with the '>' prefix, a simple diff should do most of the rest. No need to get fancy.
The first thing I think I'd do is strip out all the white space (or reduce it to a single space between words) and the special characters from both blocks, then look for the old one in the new one.
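A minimal sketch of that idea, as a line-by-line variant rather than a whole-block search: normalise every line, remember the ones seen in earlier messages of the thread, and flag lines of the reply that reappear. The function names are hypothetical, and this is only an approximation of the behaviour described above, not Gmail's actual algorithm:

```javascript
// Normalise a line so that quote prefixes, punctuation and spacing don't defeat matching.
// Note: \w is ASCII-only here, so accented letters are dropped as "special characters".
function normalizeLine(line) {
  return line
    .replace(/^[>\s]+/, "")   // drop leading "> " markers
    .replace(/[^\w\s]/g, "")  // drop special characters
    .replace(/\s+/g, " ")     // collapse white space
    .trim()
    .toLowerCase();
}

// Mark each line of `reply` as quoted if its normalised form already
// appeared in any earlier message of the thread.
function markQuotedLines(earlierMessages, reply) {
  const seen = new Set();
  for (const message of earlierMessages) {
    for (const line of message.split(/\r?\n/)) {
      const key = normalizeLine(line);
      if (key) seen.add(key); // ignore lines that normalise to nothing
    }
  }
  return reply.split(/\r?\n/).map((line) => ({
    text: line,
    quoted: seen.has(normalizeLine(line)),
  }));
}

const earlier = ["Shall we meet at noon?\n-- \nAlice"];
const reply = "Yes, noon works.\n\nOn Monday Alice wrote:\n> Shall we meet at noon?\n-- \nBob";
console.log(markQuotedLines(earlier, reply));
```

As noted above, an unchanged signature would be flagged as quoted too, and very short lines can match by coincidence, so a real implementation would probably want a minimum length or a minimum run of consecutive matching lines.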
Here's a mozdev project that may be helpful for others who stumble across this page looking for a Thunderbird solution:
http://quotecollapse.mozdev.org/