Regex help: Identifying websites in text - regex

I am trying to write a function which removes websites from a piece of text. I have:
removeWebsites <- function(text) {
  text <- gsub("(http://|https://|www.)[[:alnum:]~!#$%&+-=?,:/;._]*", '', text)
  return(text)
}
This handles a large part of the problem, but not a common case, i.e. something of the form xyz.com.
I do not wish to add .com at the end of the above regex, as that limits its scope. However, I tried writing some more regexes like:
gsub("[[:alnum:]~!#$%&+-=?,:/;._]*.com",'',testset[10])
This worked, but it also modified email ids of the form abc@xyz.com to abc@. I don't want this, so I modified it to
gsub("*((^#)[[:alnum:]~!#$%&+-=?,:/;._]*).com",'\\1',testset[10])
This left the email ids alone but stopped recognising websites of the form xyz.com.
I understand that I need some sort of set difference here, along the lines of what was explained here, but I was not able to implement it (mainly because I was not able to completely understand it). Any ideas on how to go about solving my problem?
Edit: I tried negative lookaheads:
gsub("[[:alnum:]~!#$%&+-=?,:/;._](?!#)[^(?!.*#)]*.com",'',testset[10])
I got an 'invalid regex' error. I believe a little help in correcting it may get this to work...

I can't believe it. There actually is a simple solution to it.
gsub(" ([[:alnum:]~!#$%&+-=?,:/;._]+)((.com)|(.net)|(.org)|(.info))",' ',text)
This works by:
Starting with a space.
Putting all sorts of things, except an '@', in between.
Ending with .com/.net/.org/.info.
Please do look into breaking it! I'm sure there will be cases that will break this as well.
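For anyone who wants to poke at it, here is a rough Python sketch of the same idea on a made-up sample, with a simplified character class of my own (the behaviour should be the same in any PCRE-style engine). The leading space is what keeps the match from starting inside an e-mail address:

import re

text = "visit xyz.com and www.foo.org, mail me at abc@xyz.com"

# Same trick as the answer above: require a leading space so the match
# cannot start in the middle of an e-mail address. Bare domains at the very
# start of the string, or wrapped in brackets, will still slip through.
print(re.sub(r" [\w~!#$%&+=?,:/;.-]+\.(com|net|org|info)", " ", text))
# the two bare domains are removed; the e-mail address survives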

Your lookarounds look a bit funny to me: you can't look behind inside a character class, and why are you looking ahead at all? A lookbehind is IMHO more appropriate.
I think the following expression should work, although I didn't test it:
gsub("((?<!@)[[:alnum:]~!#$%&+-=?,:/;._]*).com", '\\1', testset[10], perl = TRUE)
Also note that lookbehinds must have a fixed length, so no quantifiers are allowed inside them, and that gsub needs perl = TRUE for lookarounds to be recognised at all.
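A rough Python sketch of the lookbehind idea, on a made-up sample. Note that this version also excludes word characters in the lookbehind; otherwise the engine simply restarts one character later inside the e-mail address and eats its domain anyway:

import re

text = "visit xyz.com, mail me at abc@xyz.com"

# Refuse to start a match right after an '@' or in the middle of a word.
print(re.sub(r"(?<![\w@])[\w.-]+\.com\b", "", text))
# -> "visit , mail me at abc@xyz.com"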


select area within characters using regex (spaces are an issue)

Some other guy asked a similar question earlier which got a lot of downvotes, and I was interested in solving it. I ran into a similar issue and would like some help with it.
Take into consideration this wall of text:
__don't__ and __do it__
__yellow__
__green__ and __purple__
I would like to select everything between the double underscores (__).
I attempted the following regex:
/__[!-~]+__/g
which worked great on most things. I would like to add the ability to have spaces within the underscores: __do it__ is not captured because it contains a space, which the regex rules out. I then attempted:
/__[ -~]+__/g
It didn't work as planned, and selected everything from the very first __ to the very last. I was wondering how to tell the regex it has reached the end of a search once it sees a space after a __.
Here is the regex you could play around with below:
http://regexr.com/39br7
I tried using __[^ ]/g at the end but it didn't seem to help.
You could simply use the regex below:
__[^_]*__
DEMO
__(.*?)__
This seems to work. Look at the demo:
http://regex101.com/r/lJ1jB1/1
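To see what each of the two suggested patterns actually returns, here is a quick Python sketch using the question's own sample text; [^_]* forbids underscores inside the match, while the lazy .*? stops at the first closing __:

import re

text = "__don't__ and __do it__\n__yellow__\n__green__ and __purple__"

# First pattern: findall returns the full matches, underscores included.
# Second pattern: the group means findall returns only the captured text.
print(re.findall(r"__[^_]*__", text))
print(re.findall(r"__(.*?)__", text))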

Select start and end of a RegEx

I'm having trouble naming this question, and it feels like it's something I should have found myself, but it seems I'm too dumb. RegEx is still incredibly complicated to me, so please don't be too harsh.
Basically, I have a huge list of text from which I need to extract certain word sections. I know the mask around the word, but I obviously only need the word itself. Let me try to give you a simple example:
<b>Name1</b>
<i>Name2</i>
<u>Name3</u>
I can clearly see the things I want are all surrounded by <> tags. My approach was always to find the entire string and then simply do a plain replace to get rid of these extra characters.
<\w>{1}\w+<\/\w>{1}
string.replace("<b>","");
string.replace("</b>","");
... and so on.
However, something just feels wrong about it. Like, incredibly wrong. Can't I just directly say in my RegEx search what exactly I'm looking for? Like:
<\w>{1}START\w+END<\/\w>{1}
Does something like this exist?
(This is a general question, not a specific problem, so please don't provide alternate workarounds or something. I've had this problem many, many times already, and I'm fed up with solving it with this hackish way.)
A regex with lookarounds like (?<=<\w>)\w+(?=<\/\w>) might be what you are looking for. See the example here: regextester
How about <[^>]+>([^<]+)<\/[^>]+>? It'll match the whole "tag", but it'll only capture what's between the tags...
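A quick Python sketch of the capture-group version, using the sample from the question; findall returns only what the parentheses capture:

import re

html = "<b>Name1</b>\n<i>Name2</i>\n<u>Name3</u>"

# The group in the middle marks the part to keep; the surrounding tag
# patterns are matched but not returned.
print(re.findall(r"<[^>]+>([^<]+)</[^>]+>", html))
# -> ['Name1', 'Name2', 'Name3']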

Need to better my regex for full sentence removal instead of link removal

So, after some help from some lovely folks on Stack Overflow, I got a regex to remove links that people post. Now I think I want one that removes their entire post, perhaps replacing it with " ", so my form will not allow the post (instead of "hey, check out my site at [LINK REMOVED]", which is awesome, but it could be better if it removed the whole sentence instead of just the link). I am terrible with regexes at the moment, so any help would be greatly appreciated!
Here is my current regex:
$a = $_POST['msge'];
$b = preg_replace('%[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)%', '[LINK REMOVED]', $a);
Any ideas?
There are better ways to find links in a string; here's an example in Perl that was given in this related question. If you're dead set on using a regex, this one was mentioned in another related question and looks more promising than the one you're currently trying.
If you want to do replacement of the entire sentence given a link, you could use something like the following:
[^.|^!|^?]*(link)[^.|^!|^?]*[.|!|?]
Obviously you would want to replace link with your link pattern match.
Subjectively I would also suggest it may be a little odd to remove entire sentences from the middle of content that people are posting since it may alter the entire meaning of the post. If your main intent is to remove the link (for example, to prevent spam backlinks) you may just want to obfuscate the link by replacing it with something obvious like -LINK-.
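If it helps, here is a rough Python sketch of the sentence-removal idea; the sample post and the simplified link pattern are made up for illustration, and the same approach translates directly to preg_replace:

import re

post = "Hello there. Hey, check out my site at spam-site.com! Thanks for reading."

# Hypothetical, simplified link pattern; swap in whatever link regex you use.
link = r"[a-zA-Z0-9-]+\.(?:com|org|net|mil|edu)"

# Remove the whole sentence containing the link: non-terminators, the link,
# more non-terminators, then the sentence terminator itself.
print(re.sub(r"[^.!?]*" + link + r"[^.!?]*[.!?]\s*", "", post, flags=re.IGNORECASE))
# the middle sentence disappears; the other two are left alone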

regex best practice?

Today I got an email from my boss saying to change the regex in our JavaScript code that goes onto our client's website from
[a-zA-Z0-9]+[a-zA-Z0-9_\.\-]
to
[a-zA-Z0-9]+[a-zA-Z0-9_\-\.]
because one of our clients was complaining that it wasn't regex best practice and that it was causing problems with their CMS and their DB.
Looking at those two regexes, it appears to me that they match exactly the same thing.
The . and the - are swapped at the end, but that shouldn't make a difference. Should it?
Am I missing something?
The developer from our client's company is really adamant about us changing it.
Can someone shed some light?
Thanks!
There is no functional difference.
If anything is having issues with that regex, then it is a non-standard/buggy implementation. I recommend finding out exactly what the problem is.
While I see no reason to change it, I see no reason not to change it, so do what you wish.
Tip: I'm guessing the regex is written wrong. If it is supposed to mean what I think it means, I would write it as:
[a-zA-Z0-9]+[_\.\-]?
If you use a - in a character class, it goes first or last; anywhere else it denotes a range of characters, like A-Z. If you escape it, as you are doing, then it can be anywhere.
It's possible the CMS or other code they use un-escapes the regex, in which case it will throw errors if the - isn't the last character in the class. I would say that having as few escaped characters as possible in a regular expression makes it easier to read, but that's a personal preference.
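To make the dash point concrete, here is a tiny Python check on throwaway sample inputs; escaped, or placed first or last in the class, the dash is literal, anywhere else it tries to form a range:

import re

# Both classes accept the same characters: alphanumerics plus '_', '.', '-'.
pat_escaped = re.compile(r"[a-zA-Z0-9]+[a-zA-Z0-9_\.\-]")
pat_plain   = re.compile(r"[a-zA-Z0-9]+[a-zA-Z0-9_.-]")  # dash last, no escapes needed

for candidate in ("user-", "user.", "user_"):
    print(candidate, bool(pat_escaped.fullmatch(candidate)), bool(pat_plain.fullmatch(candidate)))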

How can I manipulate just part of a Perl string?

I'm trying to write some Perl to convert some HTML-based text over to MediaWiki format and hit the following problem: I want to search and replace within a delimited subsection of some text and wondered if anyone knew of a neat way to do it. My input stream is something like:
Please <a href="mailto:help@myco.com&Subject=Please help&Body=Please can some one help me out here">mail support</a> if you want some help.
and I want to change Please help and Please can some one help me out here to Please%20help and Please%20can%20some%20one%20help%20me%20out%20here respectively, without changing any of the other spaces on the line.
Naturally, I also need to be able to cope with more than one such link on a line so splicing isn't such a good option.
I've taken a good look around Perl tutorial sites (it's not my first language) but didn't come across anything like this as an example. Can anyone advise an elegant way of doing this?
Your task has two parts. Find and replace the mailto URIs: use an HTML parsing module for that. This topic is covered thoroughly on Stack Overflow.
The other part is to canonicalise the URI. The module URI is suitable for this purpose.
use URI::mailto;
my @hrefs = ('mailto:help@myco.com&Subject=Please help&Body=Please can some one help me out here');
print URI::mailto->new($_)->as_string for @hrefs;
__END__
mailto:help@myco.com&Subject=Please%20help&Body=Please%20can%20some%20one%20help%20me%20out%20here
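A rough Python equivalent of the same two-part idea, in case it helps; the HTML line is reconstructed from the question, so treat the exact markup as an assumption:

import re
from urllib.parse import quote

# Assumed input: the question's markup suggests the link sits inside an
# href="mailto:..." attribute somewhere on the line.
line = ('Please <a href="mailto:help@myco.com&Subject=Please help'
        '&Body=Please can some one help me out here">mail support</a> '
        'if you want some help.')

# Re-encode only the text inside the mailto href; every other space on the
# line is left untouched, and each link on a line is handled separately.
fixed = re.sub(r'(?<=href=")mailto:[^"]+(?=")',
               lambda m: quote(m.group(0), safe=':@&=?/'),
               line)
print(fixed)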
Why don't you just search for the "Body=" part up to the closing quote and replace every space with %20?
I would not even use regular expressions for that, since I don't find them useful for anything except mass changes where everything on the line changes.
A simple loop might be the best solution.
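That no-regex suggestion in a rough Python sketch, on a made-up sample along the lines of the answer above:

# Find the "Body=" part, then the closing quote, and fix the spaces by hand.
href = 'href="mailto:help@myco.com&Body=Please can some one help me out here"'
start = href.find("Body=")
end = href.find('"', start)
print(href[:start] + href[start:end].replace(" ", "%20") + href[end:])
# only the spaces between Body= and the closing quote are touched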