How to find arbitrary URLs in plain text? - regex

There are tons of solutions to find and/or parse normal URLs, but none of them deals with arbitrary text, i.e. URLs that are split over several lines? How would you find a URL that can have line breaks after any character?
Note: I'm not interested in the individual parts of the URL. I just want to find all URLs in a given text to convert them to links (e.g. like in plain e-mail text).
Example:
Text text text text text. Look at this:
http://stackoverfl
ow.com/
questions/15252042/
find-urls-in-text
Question question question.

Several approaches are possible:
1) Write a regex with whitespace rules after each regular char. This will certainly blow up the regex pattern but is the most flexible one. For catching line breaks use DOT_ALL mode. DOT_ALL will however produce the same problems as the next approach.
2) (Temporarily) remove line breaks and use normal regex pattern matching. This approach has problems though as it can happen that you include more text than necessary (at the end of the URL) or don't find a URL (if the linebreak is at the start, messing up the protocol string).
2a) A modification of 2) could be to do several match attempts removing only certain line breaks, e.g. after looking for an initial URL part (e.g. www, http etc.). Only possible if recognition time is secondary.
3) Ease your task with domain specific knowlege. For instance if you know where line breaks can occur (or if they occur only at specific positions) then look for these specific cases and solve them first. Then return to the usual regex search.
3a) A variation of 3) could be to look specifically for the protocol and and page extension using a regex with full whitespace rules to find start and stop of an URL. This works obviously only if there's always a protocol/filename_with_extension. Transform the found tokens into regular ones without whitespaces (but include a space before the protocol and after the extension) and then remove all line breaks in the text. Now you can match the URL with a regular regex.
There are certainly more variations possible, but the general idea is the same.

Related

Use regex in UFT PDF comparison

In one of my UFT test cases, I need to verify a amount on a PDF file.
Sometimes the amount is "3000" and sometimes it is "3.000". And sometimes even "3 000"!
I would like to accept those 3 possibilities, knowing that this amount is stored in a datatable.
I tried something like "3.?000" (with regex check in the file checkpoint) but it's not matching any of the 3 solutions.
How would you do?
One of UFT's idiosyncrasies when dealing with regexs is that it add implicit anchors at the beginning and ends of lines.
Try adding .* before and after the text you want to match - .*\s3[., ]?000\s.*.
Also verify that you've activated the regular expression flag for your line. I find the UI for File Content Checkpoints to be a bit unintuitive so you may have missed that.

RegEx filter links from a document

I am currently learning regex and I am trying to filter all links (eg: http://www.link.com/folder/file.html) from a document with notepad++. Actually I want to delete everything else so that in the end only the http links are listed.
So far I tried this : http\:\/\/www\.[a-zA-Z0-9\.\/\-]+
This gives me all links which is find, but how do I delete the remaining stuff so that in the end I have a neat list of all links?
If I try to replace it with nothing followed by \1, obviously the link will be deleted, but I want the exact opposite to have everything else deleted.
So it should be something like:
- find a string of numbers, letters and special signs until "http"
- delete what you found
- and keep searching for more numbers, letters ans special signs after "html"
- and delete that again
Any ideas? Thanks so much.
In Notepad++, in the Replace menu (CTRL+H) you can do the following:
Find: .*?(http\:\/\/www\.[a-zA-Z0-9\.\/\-]+)
Replace: $1\n
Options: check the Regular expression and the . matches newline
This will return you with a list of all your links. There are two issues though:
The regex you provided for matching URLs is far from being generic enough to match any URL. If it is working in your case, that's fine, else check this question.
It will leave the text after the last matched URL intact. You have to delete it manually.
The answer made previously by #psxls was a great help for me when I have wanted to perform a similar process.
However, this regex rule was written six years ago now: accordingly, I had to adjust / complete / update it in order it can properly work with the some recent links, because:
a lot of URL are now using HTTPS instead of HTTP protocol
many websites less use www as main subdomain
some links adds punctuation mark (which have to be preserved)
I finally reshuffle the search rule to .*?(https?\:\/\/[a-zA-Z0-9[:punct:]]+) and it worked correctly with the file I had.
Unfortunately, this seemingly simple task is going to be almost impossible to do in notepad++. The regex you would have to construct would be...horrible. It might not even be possible, but if it is, it's not worth it. I pretty much guarantee that.
However, all is not lost. There are other tools more suitable to this problem.
Really what you want is a tool that can search through an input file and print out a list of regex matches. The UNIX utility "grep" will do just that. Don't be scared off because it's a UNIX utility: you can get it for Windows:
http://gnuwin32.sourceforge.net/packages/grep.htm
The grep command line you'll want to use is this:
grep -o 'http:\/\/www.[a-zA-Z0-9./-]\+\?' <filename(s)>
(Where <filename(s)> are the name(s) of the files you want to search for URLs in.)
You might want to shake up your regex a little bit, too. The problems I see with that regex are that it doesn't handle URLs without the 'www' subdomain, and it won't handle secure links (which start with https). Maybe that's what you want, but if not, I would modify it thusly:
grep -o 'https\?:\/\/[a-zA-Z0-9./-]\+\?' <filename(s)>
Here are some things to note about these expressions:
Inside a character group, there's no need to quote metacharacters except for [ and (sometimes) -. I say sometimes because if you put the dash at the end, as I have above, it's no longer interpreted as a range operator.
The grep utility's syntax, annoyingly, is different than most regex implementations in that most of the metacharacters we're familiar with (?, +, etc.) must be escaped to be used, not the other way around. Which is why you see backslashes before the ? and + characters above.
Lastly, the repetition metacharacter in this expression (+) is greedy by default, which could cause problems. I made it lazy by appending a ? to it. The way you have your URL match formulated, it probably wouldn't have caused problems, but if you change your match to, say [^ ] instead of [a-zA-Z0-9./-], you would see URLs on the same line getting combined together.
I did this a different way.
Find everything up to the first/next (https or http) (then everything that comes next) up to (html or htm), then output just the '(https or http)(everything next) then (html or htm)' with a line feed/ carriage return after each.
So:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace with: \1\2\3\r\n
Saves looking for all possible (incl non-generic) url matches.
You will need to manually remove any text after the last matched URL.
Can also be used to create url links:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace: \1\2\3\r\n
or image links (jpg/jpeg/gif):
Find: .*?(https:|http:)(.*?)(jpeg|jpg|gif)
Replace: <img src="\1\2\3">\r\n
I know my answer won't be RegEx related, but here is another efficient way to get lines containing URLs.
This won't remove text around links like Toto mentioned in comments.
At least if there is nice pattern to all links, like https://.
CTRL+F => change tab to Mark
Insert https://
Tick Mark to bookmark.
Mark All.
Find => Bookmarks => Delete all lines without bookmark.
I hope someone who lands here in search of same problem will find my way more user-friendly.
You can still use RegEx to mark lines :)

Regular Expression to match a specific URL broken up by arbitrary characters

I run a Django-based forum (the framework is probably not important to the question, but still) and it has been increasingly getting spammed with posts that link to a specific website constantly (www.solidwoodkitchen.co.uk - these people are apparently the worst).
I've implemented a string blocking system that stops them posting to the forum if the URL of the website is included in the post, but as spam bots usually do, it has figured out a way around that by breaking up the URL with other characters (eg. w_w_w.s*olid_wood*kit_ch*en._*co.*uk .). So a couple of questions:
Is it even possible to build a regex capable of finding the specific URL within a block of text even when it has been modified like that?
If it is, would this cause a performance hit?
Description
You could break the url into a string of characters, then join them together with [^a-z0-9]*?. So in this case with www.solidwoodkitchen.co.uk the resulting regex would look like:
w[^a-z0-9]*?w[^a-z0-9]*?w[^a-z0-9]*?[.][^a-z0-9]*?s[^a-z0-9]*?o[^a-z0-9]*?l[^a-z0-9]*?i[^a-z0-9]*?d[^a-z0-9]*?w[^a-z0-9]*?o[^a-z0-9]*?o[^a-z0-9]*?d[^a-z0-9]*?k[^a-z0-9]*?i[^a-z0-9]*?t[^a-z0-9]*?c[^a-z0-9]*?h[^a-z0-9]*?e[^a-z0-9]*?n[^a-z0-9]*?[.][^a-z0-9]*?c[^a-z0-9]*?o[^a-z0-9]*?[.][^a-z0-9]*?u[^a-z0-9]*?k
Edit live on Debuggex
This could would basically search for the entire string of characters seperated by zero or more non alphanumeric characters.
Or you could take the input text and strip out all punctuation then simply search for wwwsolidwoodkitchencouk.

regex, find last part of a url

Let's take an url like
www.url.com/some_thing/random_numbers_letters_everything_possible/set_of_random_characters_everything_possible.randomextension
If I want to capture "set_of_random_characters_everything_possible.randomextension" will [^/\n]+$work? (solution taken from Trying to get the last part of a URL with Regex)
My question is: what does the "\n" part mean (it works even without it)? And, is it secure if the url has the most casual combination of characters apart "/"?
First, please note that www.url.com/some_thing/random_numbers_letters_everything_possible/set_of_random_characters_everything_possible.randomextension is not a URL without a scheme like http:// in front of it.
Second, don't parse URLs yourself. What language are you using? You probably don't want to use a regex, but rather an existing module that has already been written, tested, and debugged.
If you're using PHP, you want the parse_url function.
If you're using Perl, you want the URI module.
Have a look at this explanation: http://regex101.com/r/jG2jN7
Basically what is going on here is "match any character besides slash and new line, infinite to 1 times". People insert \r\n into negated char classes because in some programs a negated character class will match anything besides what has been inserted into it. So [^/] would in that case match new lines.
For example, if there was a line break in your text, you would not get the data after the linebreak.
This is however not true in your case. You need to use the s-flag (PCRE_DOTALL) for this behavior.
TL;DR: You can leave it or remove it, it wont matter.
Ask away if anything is unclear or I've explained it a little sloppy.

In Yahoo-Pipes, how to use regex when you can't see non-printable characters and html tags?

I keeping having the problem trying to extract data using regex whereas my result is not what I wanted because there might be some newlines, spaces, html tags, etc in the string, but is there anyway to actually see what is in the string, the debugger seems to show only the real text. How do you deal with this?
If the content of the string is HTML then debugger gives you a choice of viewing "HTML" or "Source". Source should show you any HTML tags that are there.
However if your concern is white space, this may not be enough. Your only option is to "view source" on the original page.
The best course of action is to explicitly handle these possibilities in your regex. For example, if you think you might be getting white space in your target string, use the \s* pattern in the critical positions. That will match zero or more spaces, tabs, and new lines (you must also have the "s" option checked in the regex panel for new lines).
However, without specific examples of source text and the regex you are using - advice can only be generic.
What I do is use a regex tester (whichever uses the same regex engine that you are using) and I test my pattern on it. I've tried using text editors that display invisible characters but to me they only add to the confusion.
So I just go by trial and error. For instance, if a line ends in:
</a>
Then I'll try the following patterns on the regex tester until I find one that works:
</a>.
</a>..
</a>\s
</a>\s*
</a>\n
</a>\r
</a>\r\n
Etc.