replace url paths using Regex


How can I change the URLs of my images from this:
http://www.myOLDwebsite.com/**********.*** (I have GIFs, JPGs, and PNGs)
to this:
http://www.myNEWwebiste.com/somedirectory/**********.***
using a text editor with regex support?
Thanks a lot for your time.
[]'s
Mateus

Why use regex?
Using a plain (non-regex) search and replace, replace:
src="http://www.myOLDwebsite.com/
with:
src="http://www.myNEWwebiste.com/somedirectory/
Granted, this assumes your image tags always follow the 'src="<url>"' pattern, with double quotes and everything.
Using regex is of course also possible. Replace this:
(src\s*=\s*["'])http://www\.myOLDwebsite\.com/
with:
\1http://www.myNEWwebiste.com/somedirectory/
Alternatively, if your text editor uses $ to mark back references:
$1http://www.myNEWwebiste.com/somedirectory/
On second thought - why do your images have absolute URLs in the first place? Isn't that unnecessary?
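For anyone doing the same replacement from a script rather than a text editor, here is a minimal Python sketch of the regex approach above. The function name rewrite_image_urls is made up for illustration; the pattern and the replacement string are the ones given in this answer.

```python
import re

# Pattern from the answer: capture 'src=' plus the opening quote (group 1),
# then match the old host prefix literally (dots escaped).
OLD_PREFIX = re.compile(r"""(src\s*=\s*["'])http://www\.myOLDwebsite\.com/""")

# \1 re-inserts the captured 'src="' part in front of the new prefix.
NEW_PREFIX = r"\1http://www.myNEWwebiste.com/somedirectory/"


def rewrite_image_urls(html):
    """Return html with every old image URL prefix replaced by the new one."""
    return OLD_PREFIX.sub(NEW_PREFIX, html)
```

Because only the prefix is matched, the file name and extension after the last slash pass through untouched, whatever they are.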

Well, the easiest way is probably to use sed in in-place mode:
sed -i \
's#http://www[.]myOLDwebsite[.]com/#http://www.myNEWwebsite.com/somedirectory/#g' \
file1 file2 ...
If for some reason you need to actually interpret the HTML (rather than just do a simple string replacement), a quick script built around BeautifulSoup is going to be safer -- lots of people try to do HTML or XML parsing via regular expressions, but it's very hard if not impossible to cover all corner cases.
All that said, it'd be better if you were using relative links to not have your HTML depend on the server it's hosted on. See also the <BASE HREF="..."> element you can put in your <HEAD> to specify a location all URLs are relative to; if you were using that, you'd only need to do a single replacement.

Related

Bash - Regex for HTML contents

I'm learning about Bash scripting, and need some help understanding regex's.
I have a variable that is basically the html of a webpage (exported using wget):
currentURL="https://www.example.com"
currentPage=$(wget -q -O - "$currentURL")
I want to get the id's of all linked photos in this page. I just need help figuring out what the RegEx should be.
I started with this, but I need to modify the regex:
Test string (this is what currentPage contains, there can be zero to many instances of this):
<img src="./download/file.php?id=123456&t=1">
Current Regex:
.\/download\/file.php\?id=[0-9]{6}\&mode=view
Here's the regex I created, but it doesn't seem to work in bash.
The best solution would be to have the ID of each file. In this case, simply 123456. But if we can start with getting the /download/file.php?id=123456, that'd be a good start.

Don't parse XML/HTML with regex, use a proper XML/HTML parser.
theory :
According to compiler theory, HTML can't be parsed with regular expressions, which are based on finite-state machines. Because of HTML's hierarchical structure, you need a pushdown automaton and a grammar (for example LALR) handled by a tool like YACC.
realLife©®™ everyday tool in a shell :
You can use one of the following :
xmllint often installed by default with libxml2, xpath1
xmlstarlet can edit, select, transform... Not installed by default, xpath1
xpath installed via perl's module XML::XPath, xpath1
xidel xpath3
saxon-lint my own project, wrapper over @Michael Kay's Saxon-HE Java library, xpath3
or you can use high level languages and proper libs, I think of :
python's lxml (from lxml import etree)
perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath
Check: Using regular expressions with HTML tags
Example using xidel:
xidel -s "$currentURL" -e '//a/extract(@href,"id=(\d+)",1)'
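As a rough illustration of the "use a parser" advice in a high-level language, here is a sketch that relies only on Python's standard library (html.parser and urllib.parse) instead of lxml. The class and function names are invented for this example, and looking at both img src and a href is an assumption about where the download links appear in the page.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class PhotoIdCollector(HTMLParser):
    """Collect the 'id' query parameter from <img src> and <a href> attributes."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # Pick the attribute that carries the URL for this tag, if any.
        wanted = {"img": "src", "a": "href"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value and "download/file.php" in value:
                # Let the URL parser split the query string instead of a regex.
                query = parse_qs(urlparse(value).query)
                self.ids.extend(query.get("id", []))


def extract_photo_ids(html):
    """Return the ids found in html, in document order."""
    collector = PhotoIdCollector()
    collector.feed(html)
    collector.close()
    return collector.ids
```

The parser takes care of attribute quoting and entity decoding, so the extraction does not depend on how the attribute happens to be written in the markup.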

Let's first clarify a couple of misunderstandings.
"I'm learning about Bash scripting, and need some help understanding regex's."
You seem to be implying some sort of relation between Bash and regex.
As if Bash was some sort of regex engine.
It isn't. The [[ builtin is the only thing I recall in Bash that supports regular expressions, but I think you mean something else.
There are some common commands executed in Bash that support some implementation of regular expressions such as grep or sed and others. Maybe that's what you meant. It's good to be specific and accurate.
"I want to get the id's of all linked photos in this page. I just need help figuring out what the RegEx should be."
This suggests an underlying assumption that if you want to extract content from an HTML, then regex is the way to go. That assumption is incorrect.
Although it's best to extract content from HTML using an XML parser (using one of the suggestions in Gilles' answer),
and reaching for regex is not a good reflex,
for simple cases like yours it might just be good enough:
grep -oP '\./download/file\.php\?id=\K\d+(?=&mode=view)' file.html
Take note that you escaped the wrong characters in the regex:
/ and & don't have a special meaning and don't need to be escaped
. and ? have special meaning and need to be escaped
Some extra tricks in the above regex are worth explaining:
The -P flag of grep enables Perl-style (powerful) regular expressions
\K is a Perl-specific symbol; it means that the content before the \K is not included in the match
The (?=...) is a zero-width positive lookahead assertion. For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab in the match.
The \K and the lookahead are there to work with grep -o, which outputs only the matched part. Without these tricks the matched part would be, for example, ./download/file.php?id=123456&mode=view, which is more than what you want.
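The same extraction can be sketched in Python. Python's re module has no \K, so this version uses a capturing group for the digits instead; the lookahead is kept as in the grep command. The name find_ids is only illustrative.

```python
import re

# Same pattern as the grep command, with \K replaced by a capturing group:
# '.' and '?' are escaped, '/' and '&' are not.
ID_PATTERN = re.compile(r"\./download/file\.php\?id=(\d+)(?=&mode=view)")


def find_ids(page):
    """Return every id that is followed by '&mode=view' in page."""
    # With exactly one group, findall returns just the captured digits.
    return ID_PATTERN.findall(page)
```

Note that, like the grep command, this only matches links that end in &mode=view, so a link such as ./download/file.php?id=123456&t=1 is skipped.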

How to match plain text URL in a markdown?

I'm currently trying to match all plain text links in a markdown text.
Example of the markdown text:
Dude, look at this url http://www.google.com .. it's a great search engine
I would like it to be converted into
Dude, look at this url <http://www.google.com> .. it's a great search engine
So in short, processing url should produce <url>, but processing an existing <url> shouldn't produce <<url>>. Also, the link in the markdown can be in the form of (url), so we'll have to avoid matching the normal brackets too.
So my working regex for matching the plain text URL in Java is:
"[^(\\<|\\(](https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|][^(\\>|\\)]",
with [^(\\<|\\(] and [^(\\>|\\)] to avoid matching the wrapping brackets.
But one problem remains: I also do not want to match this kind of URL:
[1]: http://slashdot.org
So, if the markdown text is
Dude, look at this url http://www.google.com .. it's a great search engine
[1]: http://slashdot.org
I want only http://www.google.com to be matched, but not the http://slashdot.org.
What pattern would meet these criteria?

What you have here is a parsing problem. Regexes are fine, but using only regexes here will make it a mess (supposing you manage it at all). After you fix this problem, you'll probably find yourself facing other ones, like URLs in code (between ` or in lines starting with tabs or four spaces) that you don't want to replace.
A solution would be to split into lines and then
detect patterns (for example ^\[\d+\]:\s+)
apply your replacements (for example this URL to link change) only on lines that don't match an incompatible pattern
That's the logic I use in this small pseudo-markdown parser that you can test here.
Note that there's always the option of using an existing, proven markdown parser; there are many of them.
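The line-by-line idea above could look roughly like this in Python. This is a sketch, not the parser mentioned in the answer: the function name wrap_plain_urls and the simplified URL pattern are assumptions made for the example, and code spans or indented code lines are not handled.

```python
import re

# Lines such as "[1]: http://slashdot.org" are reference definitions: skip them.
REFERENCE_DEF = re.compile(r"^\[\d+\]:\s+")

# A bare URL: not directly preceded by '<' or '(', and not ending in
# sentence punctuation. Simplified compared to the regex in the question.
BARE_URL = re.compile(r"(?<![<(])\b((?:https?|ftp|file)://[^\s<>()]*[^\s<>().,;:!?])")


def wrap_plain_urls(markdown):
    """Wrap bare URLs in <...>, leaving reference-definition lines untouched."""
    out = []
    for line in markdown.split("\n"):
        if not REFERENCE_DEF.match(line):
            line = BARE_URL.sub(r"<\1>", line)
        out.append(line)
    return "\n".join(out)
```

Further incompatible line patterns (code blocks, for instance) can be added as extra checks next to REFERENCE_DEF without touching the URL pattern.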

Is it possible to add characters to a string as part of a regular expression (regex)?

I use an application to find specific text patterns in free text fields in XML records. It uses regex to identify the pattern, which is then tagged in the XML. For a specific project, it would be a great time saver (I am working with about 18 million records) if I could add the two characters 27 in front of one of the patterns I have to use.
Can this be done or am I just going to have to go the long way around?

No, you can't have a regex match text that isn't there. A regex will only be able to return text that is part of the original text.
However, if you matched into groups, you could potentially use the group name for extra information about what you're matching.

Regex is not the right tool if you'd like to edit an XML file. Instead, use a language such as Python, Perl, Ruby, PHP, or Java with a proper XML parser module. If you work in a Unix-like shell, I recommend xmlstarlet.
That said, if you'd like to go ahead with a substitution, you can try sed (at your own risk):
sed -i -r 's/987654/27&/g' files*.xml
(use the -i switch only if you want to modify the files in place)
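To make the distinction in these answers concrete: a match can only return text that is already there, but a substitution can put extra characters in front of it. Here is a small Python sketch equivalent to the sed command; the function name prepend_prefix and the sample value 987654 (taken from the sed example) are placeholders.

```python
import re


def prepend_prefix(text, pattern, prefix="27"):
    """Insert prefix in front of every match of pattern in text."""
    # m.group(0) is the whole match, the counterpart of '&' in the sed command.
    return re.sub(pattern, lambda m: prefix + m.group(0), text)
```

Using a function as the replacement avoids any ambiguity between the digits of the prefix and a numeric backreference.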

search & replace wordpress video shortcode with plain URL using regular expressions

I am transferring a friend's wordpress.com blog to a self-hosted install on my server. The problem is that he has many videos embedded in his blog using a shortcode plugin that is not necessary on WordPress 3 (you only need to paste the plain URL to embed videos from YouTube, Vimeo, etc.).
I've found a Search Regex plugin that will search & replace using regular expressions, but I am unfamiliar with regex myself. How might I catch the URL in a shortcode such as [youtube="URL"] and replace it with just the URL?
Thanks for any help you can provide!!
-Jenny

Are you trying to go from "[youtube=http://www.youtube.com/watch?v=JaNH56Vpg-A]" to "http://www.youtube.com/watch?v=JaNH56Vpg-A"?
This works if there's white space between different URLs.
find: \[youtube=(\S*)\]
replace with: $1
It's difficult to replace every different service at once since it seems that their shortcodes are different. For Vimeo this would work. It allows any amount of white space between "vimeo" and the URL. And again it needs white space after the closing "]".
find: \[vimeo\s+(\S*)\]
replace with: $1
Maybe there's a more robust way to write the expression (one which validates the correct syntax). This one's pretty straightforward though.
The actual regex syntax depends on the language used. Hope this helps.
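For reference, here is how the two find/replace pairs above might look in Python, where the backreference is written \1 rather than $1. The function name strip_shortcodes and the Vimeo URL in the usage line are made up for illustration.

```python
import re

# The two "find" expressions from the answer, unchanged.
YOUTUBE = re.compile(r"\[youtube=(\S*)\]")
VIMEO = re.compile(r"\[vimeo\s+(\S*)\]")


def strip_shortcodes(text):
    """Replace [youtube=URL] and [vimeo URL] shortcodes with the bare URL."""
    text = YOUTUBE.sub(r"\1", text)
    return VIMEO.sub(r"\1", text)
```

A call such as strip_shortcodes("[vimeo   http://vimeo.com/12345] clip") leaves only the URL followed by " clip". As noted above, \S* is greedy, so two shortcodes written back to back with no white space between them would be treated as one match.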

get data using regex

Hello,
I want to get data from a site using regex:
http://helwa.maktoob.com/sec8180/art97048/pno1/title_%D8%B7%D8%A8%D9%82-%D9%81%D9%8A%D8%AA%D9%88%D8%AA%D8%B4%D9%8A%D9%86%D9%8A-%D8%A8%D8%A7%D9%84%D8%AE%D8%B6%D8%A7%D8%B1/index.htm
I used the regex /<div class="txtblk"(.*)?<div class="imgv cls">/is
but it gave me "Invalid RegExp".
Why?
I want to get the data inside <div class="txtblk"></div>

Try escaping your double-quotes. Depending on your regex interpreter, those might be causing you problems.

The regex itself looks valid.
It depends on where/how you are using it, though; JavaScript for example doesn't know the /s modifier. To simulate a dot-matches-all mode in JavaScript, use [\s\S] instead of the dot (.).
Then, you might be running into problems with the quotes depending on the quoting rules for your language.
Also, you probably want to use (.*?) instead of (.*)?. (Or, if it's JavaScript, ([\s\S]*?)).
Finally, using regex to match HTML is not recommended. Use a DOM parser.
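As an illustration of the dot-matches-all and lazy-quantifier points, here is a sketch in Python, where the /is modifiers become the re.IGNORECASE and re.DOTALL flags. The function name extract_txtblk is invented for the example, and the caveat about using a DOM parser for real pages still applies.

```python
import re

# (.*?) is lazy, so the match stops at the first closing marker;
# re.DOTALL lets '.' match newlines, like the /s modifier.
TXTBLK = re.compile(
    r'<div class="txtblk"(.*?)<div class="imgv cls">',
    re.IGNORECASE | re.DOTALL,
)


def extract_txtblk(html):
    """Return the text captured between the two markers, or None if absent."""
    match = TXTBLK.search(html)
    return match.group(1) if match else None
```

With the original (.*)? the group would be greedy and run to the last occurrence of the closing marker; the lazy form keeps the capture to the nearest one.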

You may need to use a site that collects RSS from links,
like this one:
http://www.allwebdesignresources.com/webdesignblogs/graphics/turn-html-web-sites-into-rss-feeds-20-tools-converters-for-html-to-rss-conversions/
