Notepad++ replacement expression - replace

Hi guys, I really need some help. I need to do a mass regex replace across several files.
I have a large list of URLs which need to be replaced.
I want to search the files and replace each URL with the appropriate brand anchor link, e.g.
http://www.example.com
becomes
<a href="http://www.example.com">http://www.example.com</a>
I need to do this with a large list of urls in multiple files
I tried the following expression
(1)|(2)|(3)
(?1a)(?2b)(?3c)
But it doesn't work. This is beyond me. Any help would be appreciated. Thanks

Go to Search > Replace menu (shortcut CTRL+H) and do the following:
Find what:
http:\/\/www\.\w+\.com
Replace:
<a href="$0">$0</a>
Select radio button "Regular Expression"
Then press Replace All in All Opened Documents
You can test it and see the results at regex101.
Important note: matching URLs with regular expressions can be complicated! I gave you the simplest example, matching only URLs like http://www.example.com. If you have more complicated cases, let us know by showing some of your data! More info on this matter here and here.
UPDATE:
Let's make it slightly more complicated to match also
yoursite.com/index.php?remainingurl
Find what:
(?:https?:\/\/)?(?:www\.)?(\w+\.\w{2,6})(?:\/\w+\.\w+(?:\?\w+)?)?\b
Replace:
$1
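For readers who would rather script this than click through Notepad++, here is a minimal Python sketch of the simple case above. It assumes the desired output is the anchor format shown in the question; the function name wrap_urls is made up for illustration, and the pattern only covers URLs of the form http://www.<word>.com.

```python
import re

# Same simple pattern as the first "Find what" field:
# it only matches URLs shaped like http://www.<word>.com
URL_PATTERN = re.compile(r"http://www\.\w+\.com")


def wrap_urls(text: str) -> str:
    """Wrap every matched URL in an anchor tag that links to itself."""
    # \g<0> is Python's spelling of "the whole match" ($0 in Notepad++).
    return URL_PATTERN.sub(r'<a href="\g<0>">\g<0></a>', text)


if __name__ == "__main__":
    print(wrap_urls("Visit http://www.example.com today"))
```

To process many files, the same function could be applied to each file's contents in a loop; that part is left out here because the file layout is not known.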

use regex to get both link and text associated with it (anchor tag)

I created a regex string that I hoped would get both the link and the associated text in an html page. For instance, if I had a link such as:
<a href='www.la.com/magic.htm'>magicians of los angeles</a>
Then the link I want is 'www.la.com/magic.htm' and the text I want is 'magicians of los angeles'.
I used the following regex expression:
strsearch = "\<a\s+(.*?)\>(.*?)\</a\s*?\>|"
But my VB program reported far too many matches.
Is there something wrong with the regular expression?
The parentheses are meant to define groups that can be back-referenced.
Thanks
What about this one:
\<a href=.+\</a>
All there is left to do is to go over each match and extract the substrings using regular string manipulation.
Check here (although RegExr follows the JavaScript regex implementation, it is still useful in our scenario).
That said, I often see people state that regexes are not suited to parsing HTML, so you might need an HTML parser for this. The ones I know of are HtmlAgilityPack (which, as far as I know, was no longer maintained at the time of writing) and AngleSharp.
I tried the following pattern and it worked:
\<a href=(.*?)\>(.*?)\<\/a\s*?\>|
I also found two errors in your original string:
- the / in /a is missing an escape
- the attribute name 'href' is captured in the first group
Finally, I would like to recommend a great site for testing regex strings. It makes debugging much faster. See this (it also demonstrates the result you want):
REGEX101
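One more thing worth checking is the trailing | at the end of the pattern. The sketch below uses Python's re module, not VB, so it is only an illustration under the assumption that the VB regex engine treats an empty alternative the same way: the empty branch matches at every position, which would explain "too many matches". The second pattern, which drops the pipe and captures only the quoted href value, is a suggestion rather than anything from the posts above.

```python
import re

html = "<p><a href='www.la.com/magic.htm'>magicians of los angeles</a></p>"

# Pattern ending in "|": the empty alternative can match at any position.
with_pipe = re.compile(r"<a\s+(.*?)>(.*?)</a\s*?>|")

# Suggested variant: no trailing pipe, and only the quoted href value
# (not the word "href" itself) lands in group 1.
without_pipe = re.compile(r'''<a\s+href=['"](.*?)['"]\s*>(.*?)</a\s*>''')

matches_with_pipe = with_pipe.findall(html)
matches_without_pipe = without_pipe.findall(html)

print(len(matches_with_pipe), "matches with the trailing pipe")
for link, text in matches_without_pipe:
    print(link, "->", text)
```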

Dreamweaver regexp positive look behind error

I'm getting a syntax error and an "invalid quantifier" error in Dreamweaver when I try to use a regexp in the source code.
The purpose is to find spaces in front of numbers in table cells and delete them.
(?<=>)\s+(?=\d)
This expression works in Notepad++ but not in Dreamweaver.
Is this a Dreamweaver bug, or is the syntax wrong?
Of course I can do a text search for >\s and replace it with >, but then I can't catch more spaces than the ones specified in the search string.
Thanks in advance.
PS: It would also be nice to have a multi-search option in the Dreamweaver search screen, to run multiple search-and-replace operations in one go, like a code clean-up. An extension, maybe?
I don't use DW but, since I have read several posts about lookaround problems with DW, I assume that DW doesn't support these regex features.
You can use capturing groups instead (if DW supports them!):
search : (>)\s+(\d)
replace: $1$2
or
replace: \1\2
To append to the previous answer, when formulating a replace statement in DreamWeaver, use the format of $1 rather than ^1 for the variables.
I receive a similar "invalid quantifier" response in DreamWeaver CC 2015.1 when using a negative lookbehind:
(?<!somephrase)
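To show that the capture-group rewrite gives the same result as the lookaround version, here is a small comparison in Python. Python's re module is used only because it supports both forms; nothing here says anything about Dreamweaver's engine, and the sample table cells are invented.

```python
import re

cells = "<td>   42</td><td> 7</td><td> n/a</td>"

# Capture-group version: keep ">" and the first digit, drop the whitespace
# in between. Works in engines without lookbehind support.
no_lookaround = re.sub(r"(>)\s+(\d)", r"\1\2", cells)

# Lookaround version from the question, for comparison.
with_lookaround = re.sub(r"(?<=>)\s+(?=\d)", "", cells)

print(no_lookaround)
print(no_lookaround == with_lookaround)
```

The cell containing " n/a" is left alone by both versions, because the whitespace there is not followed by a digit.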

RegEx filter links from a document

I am currently learning regex and I am trying to filter all links (e.g. http://www.link.com/folder/file.html) out of a document with Notepad++. Actually, I want to delete everything else so that in the end only the http links are listed.
So far I tried this : http\:\/\/www\.[a-zA-Z0-9\.\/\-]+
This gives me all the links, which is fine, but how do I delete the remaining stuff so that in the end I have a neat list of all links?
If I replace the match with nothing, obviously the link itself is deleted, but I want the exact opposite: everything else deleted and the links kept.
So it should be something like:
- find a string of numbers, letters and special signs until "http"
- delete what you found
- and keep searching for more numbers, letters and special signs after "html"
- and delete that again
Any ideas? Thanks so much.
In Notepad++, in the Replace menu (CTRL+H) you can do the following:
Find: .*?(http\:\/\/www\.[a-zA-Z0-9\.\/\-]+)
Replace: $1\n
Options: check the Regular expression and the . matches newline
This will leave you with a list of all your links. There are two issues though:
- The regex you provided for matching URLs is far from generic enough to match any URL. If it works in your case, that's fine; otherwise check this question.
- It will leave the text after the last matched URL intact. You have to delete it manually.
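Outside Notepad++, the same "keep only the links" result can be had without any replace trick by collecting the matches directly. This is a minimal Python sketch using the URL pattern from the question (with the unnecessary escapes dropped); the sample text and the name extract_links are invented for illustration.

```python
import re

# URL pattern from the question, minus escapes that are not needed.
URL_PATTERN = re.compile(r"http://www\.[a-zA-Z0-9./-]+")


def extract_links(text):
    """Return every URL match in order of appearance."""
    return URL_PATTERN.findall(text)


sample = (
    "intro http://www.link.com/folder/file.html middle\n"
    "http://www.other.com/a.html trailing text"
)
links = extract_links(sample)
print("\n".join(links))
```

Because only the matches are collected, there is no leftover text after the last URL to clean up by hand.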
The earlier answer by psxls was a great help when I wanted to do something similar.
However, that regex was written six years ago, so I had to adjust it to work properly with more recent links, because:
- many URLs now use HTTPS instead of HTTP
- many websites no longer use www as their main subdomain
- some links contain punctuation marks (which have to be preserved)
I finally changed the search pattern to .*?(https?\:\/\/[a-zA-Z0-9[:punct:]]+) and it worked correctly with the file I had.
Unfortunately, this seemingly simple task is going to be almost impossible to do in notepad++. The regex you would have to construct would be...horrible. It might not even be possible, but if it is, it's not worth it. I pretty much guarantee that.
However, all is not lost. There are other tools more suitable to this problem.
Really what you want is a tool that can search through an input file and print out a list of regex matches. The UNIX utility "grep" will do just that. Don't be scared off because it's a UNIX utility: you can get it for Windows:
http://gnuwin32.sourceforge.net/packages/grep.htm
The grep command line you'll want to use is this:
grep -o 'http:\/\/www.[a-zA-Z0-9./-]\+\?' <filename(s)>
(Where <filename(s)> are the name(s) of the files you want to search for URLs in.)
You might want to shake up your regex a little bit, too. The problems I see with that regex are that it doesn't handle URLs without the 'www' subdomain, and it won't handle secure links (which start with https). Maybe that's what you want, but if not, I would modify it thusly:
grep -o 'https\?:\/\/[a-zA-Z0-9./-]\+\?' <filename(s)>
Here are some things to note about these expressions:
Inside a character group, there's no need to quote metacharacters except for [ and (sometimes) -. I say sometimes because if you put the dash at the end, as I have above, it's no longer interpreted as a range operator.
The grep utility's syntax, annoyingly, is different than most regex implementations in that most of the metacharacters we're familiar with (?, +, etc.) must be escaped to be used, not the other way around. Which is why you see backslashes before the ? and + characters above.
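Lastly, the repetition metacharacter in this expression (+) is greedy by default, which could cause problems. I appended a ? to it with the aim of making it lazy, but note that lazy quantifiers are a Perl-style feature: in grep's default syntax \+\? is simply read as "one or more, optionally", and the match is still the longest one possible. That is harmless here, because a genuinely lazy + at the very end of a pattern would stop after a single character. The way you have your URL match formulated, greedy matching doesn't cause problems, but if you change your match to something much broader than [a-zA-Z0-9./-], you could see URLs on the same line getting combined together.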
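To see what greedy versus lazy repetition does when a pattern is loose enough to span several URLs, here is a small Python illustration. Python's re module is used because it supports lazy quantifiers directly; this is not grep syntax, and the deliberately loose patterns and the sample line are invented for the demonstration.

```python
import re

line = "see http://www.a.com/x and http://www.b.com/y"

# Greedy: .+ runs to the end of the line and backtracks to the LAST "/",
# so both URLs are swallowed into one match.
greedy = re.findall(r"https?://.+/", line)

# Lazy: .+? stops at the FIRST "/" after the host, giving one match per URL.
lazy = re.findall(r"https?://.+?/", line)

print(greedy)
print(lazy)
```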
I did this a different way.
Find everything up to the next https: or http:, then everything that follows up to html or htm, and output just those three pieces (scheme, middle part, extension) with a carriage return and line feed after each.
So:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace with: \1\2\3\r\n
Saves looking for all possible (incl non-generic) url matches.
You will need to manually remove any text after the last matched URL.
Can also be used to create url links:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace: <a href="\1\2\3">\1\2\3</a>\r\n
or image links (jpg/jpeg/gif):
Find: .*?(https:|http:)(.*?)(jpeg|jpg|gif)
Replace: <img src="\1\2\3">\r\n
I know my answer isn't regex related, but here is another efficient way to get the lines containing URLs, at least if all the links share a recognisable pattern such as https://.
As Toto mentioned in the comments, this won't remove the text around the links.
CTRL+F => change tab to Mark
Insert https://
Tick Mark to bookmark.
Mark All.
Find => Bookmarks => Delete all lines without bookmark.
I hope someone who lands here with the same problem will find my way more user-friendly.
You can still use RegEx to mark lines :)
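The effect of the bookmark steps above (keep only the lines that contain a marker string, drop the rest) can also be sketched in a few lines of Python. The function name and the sample text are invented; as with the bookmark approach, whole lines are kept, so text around a link on the same line stays.

```python
def keep_lines_containing(text: str, marker: str = "https://") -> str:
    """Keep only the lines that contain the marker string."""
    kept = [line for line in text.splitlines() if marker in line]
    return "\n".join(kept)


sample = "header\nsee https://a.example/page.html here\nfooter\nhttps://b.example/"
print(keep_lines_containing(sample))
```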

Regular Expression to Remove Subdomains from Domain List

I have a list of domains and subdomains stored in a .txt file (I'm using Windows XP).
The format of the domains is this:
somesite1.com
sub1.somesite1.com
sub2.somesite1.com
somesite2.com
sub1.somesite2.com
sub2.somesite2.com
somesite3.com
sub1.somesite3.com
sub2.somesite3.com
I use Notepad++, and I need to do this with regular expressions.
Anyway, I don't know what to put in the find & replace boxes so it can go through the contents of the file and leave me with only the root domains. If done properly, it would turn the above example list into this:
somesite1.com
somesite2.com
somesite3.com
Can somebody help me out?
Thank you in advance.
It's an old question, but the answers provided didn't work for me. You need a negative lookahead. The correct regex is:
^\w*\.(?!\w+\s*\n)
You can use:
Find what: [^\r\n]+\.[^.\r\n]+\.[^.\r\n]+[\r\n]+
Replace with: (leave empty)
with Regular expression checked and ". matches newline" NOT checked
I suggest using the Mark tab of the Notepad++ Find dialogue. Enter the regular expression ^\w+\.\w+\.\w+$, make sure that Bookmark line is selected, then click Mark all. Next, use Menu => Search => Bookmark => Remove bookmarked lines. This will remove all entries that have three "words" separated by two dots. It will leave all other lines in place.
An alternative is to mark all lines matching the regular expression ^\w+\.\w+$ and use the Remove unmarked lines menu entry. This I do not recommend as it will remove all lines with an unexpected format as well as the lines for subdomains.
Another method would use the Replace tab of the Notepad++ Find dialogue. Enter the regular expression ^\w+\.\w+\.\w+\r\n in the Find what field, and leave the Replace with field empty. The \r\n part of this expression may need some adjustment to account for the line endings set on the file.
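If the list is large or the job has to be repeated, the "remove lines with three words and two dots" idea can be scripted. This is a minimal Python sketch under the same assumption as the Notepad++ answers, namely that every root domain has exactly one dot; a root such as example.co.uk would be wrongly treated as a subdomain. The function name is made up for illustration.

```python
import re

# Same expression as in the Mark-tab suggestion: word.word.word
SUBDOMAIN_LINE = re.compile(r"^\w+\.\w+\.\w+$")


def remove_subdomains(lines):
    """Drop lines of the form sub.domain.tld, keep everything else."""
    return [line for line in lines if not SUBDOMAIN_LINE.match(line.strip())]


domains = [
    "somesite1.com", "sub1.somesite1.com", "sub2.somesite1.com",
    "somesite2.com", "sub1.somesite2.com", "sub2.somesite2.com",
    "somesite3.com", "sub1.somesite3.com", "sub2.somesite3.com",
]
roots = remove_subdomains(domains)
print("\n".join(roots))
```

Filtering line by line also sidesteps the \r\n versus \n line-ending adjustment mentioned above.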

How do I do a regex search and replace in Sublime Text 2?

I want to remove inline style from an html document in ST2.
I imagine my regex will be something like this
style=\"*\"
If that's wrong, it doesn't matter. I'm sure I'll figure out the expression I'll need.
What I haven't been able to figure out, is how to actually use a regex to find or to find and replace text in ST2. The docs say that it can be done. But I can't find the documentation for how to do it.
Simply open the Search+Replace function (via Ctrl-H or the menu bar) and check the first box on the left of it (the one with an '*' on it, or you can press Alt+R)
Then the search field will be used as a Regex, and you can use the found patterns using the usual $1, $2, $3 vars in the replace box
More info here
I had a similar task to perform. The regex I used, in the manner that Nas62 suggested, was
style=\"(.*?)\"
Find What: style=".*"
Replace With: (leave blank)
Click the Replace All button.
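One caveat when choosing between style=".*" and style="(.*?)": the greedy form can swallow everything up to the last double quote on the line. The Python sketch below illustrates the difference on an invented line of HTML; Python's re is used only for demonstration and may differ in detail from Sublime Text's engine. The leading \s* in the lazy pattern is an addition so that no stray space is left behind.

```python
import re

html = '<p style="color:red" class="a">x</p><p style="margin:0">y</p>'

# Greedy: .* runs to the last double quote on the line, so the class
# attribute and the first paragraph's content are deleted as well.
greedy = re.sub(r'style=".*"', "", html)

# Lazy: .*? stops at the closing quote of each style attribute.
lazy = re.sub(r'\s*style=".*?"', "", html)

print(greedy)
print(lazy)
```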