Because redbubble.com has no API, I'm using an Atom feed to scrape information about a user's pictures.
This is what the XML looks like:
<entry>
<id>ID</id>
<published>Date Published</published>
<updated>Date Updated</updated>
<link type="text/html" rel="alternate" href="http://www.redbubble.com/link/to/post"/>
<title>Title</title>
<content type="html">
Blah blah blah stuff about the image..
<a href="http://www.redbubble.com/products/configure/config-id"><img src="http://ih1.redbubble.net/path-to-image" alt="" />
</content>
<author>
<name>Author Name</name>
<uri>http://www.redbubble.com/people/author-user-name</uri>
</author>
<link type="image/jpeg" rel="enclosure" href="http://ih0.redbubble.net/path-to-the-original-image"/>
<category term="1"/>
<category term="2"/>
</entry>
Using a regex, how would I go about getting the href attribute of the link inside the content tag?
One thing we know for sure is that the path will always contain configure, i.e. http://somesite.com/configure/id
So basically I just need to find the URL with configure in it and grab the whole thing.
The following regex will extract the href content based on your requirements. It seems to work for the sample code.
href="(\w[^"]+/configure/\w[^"]+)
Whatever programming language you're using, don't try to parse the whole thing with a regex. Use an XML parser first to extract the href="...". Then, sure, use a regex to make sure the URL contains configure.
As @KARASZI commented, XPath is another good approach.
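The parser-first advice can be sketched in Python as follows. This is only an illustration under stated assumptions: the feed is assumed to use the standard Atom namespace and to carry the HTML in <content> in escaped form (as a well-formed feed would), and SAMPLE_FEED and configure_links are hypothetical names made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the feed declares the standard Atom namespace.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical, trimmed-down feed; the HTML inside <content> is escaped.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>ID</id>
    <title>Title</title>
    <content type="html">Blah blah &lt;a href="http://www.redbubble.com/products/configure/config-id"&gt;&lt;img src="http://ih1.redbubble.net/path-to-image" alt="" /&gt;&lt;/a&gt;</content>
    <link type="image/jpeg" rel="enclosure" href="http://ih0.redbubble.net/path-to-the-original-image"/>
  </entry>
</feed>"""

# Only applied to the already-extracted content text, not to the whole feed.
CONFIGURE_HREF = re.compile(r'href="([^"]*/configure/[^"]*)"')


def configure_links(feed_xml):
    """Return every href containing /configure/ found in entry content."""
    root = ET.fromstring(feed_xml)
    links = []
    for content in root.findall("atom:entry/atom:content", ATOM_NS):
        # The parser has already unescaped the HTML into plain text.
        links.extend(CONFIGURE_HREF.findall(content.text or ""))
    return links


if __name__ == "__main__":
    print(configure_links(SAMPLE_FEED))
```

The parser locates the content element, so the regex only has to deal with one small HTML fragment.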
If you have to use regex try this one:
href="(?=[^"]*configure)([^"]*)
rubular.com
I am using a lookahead to find if it contains configure.
Thanks for your awesome answers but my colleague solved it for me!
This is what I ended up using:
/http:\/\/([^"\/]*\/)*configure\/([^"]*)/
(Ruby regex by the way)
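For comparison, here is a rough Python check of the three patterns from this thread against the sample link. The Ruby pattern is ported by hand with its groups made non-capturing, which is my own adaptation; HTML, PATTERNS and extract are hypothetical names for this sketch.

```python
import re

# Fragment of the HTML from the question's <content> element.
HTML = ('<a href="http://www.redbubble.com/products/configure/config-id">'
        '<img src="http://ih1.redbubble.net/path-to-image" alt="" />')

PATTERNS = {
    # First answer: URL starting with a word character, /configure/ inside.
    "word-anchored": r'href="(\w[^"]+/configure/\w[^"]+)',
    # Second answer: lookahead checks for "configure" before capturing.
    "lookahead": r'href="(?=[^"]*configure)([^"]*)',
    # Asker's Ruby pattern, ported to Python with non-capturing groups.
    "colleague": r'http://(?:[^"/]*/)*configure/[^"]*',
}


def extract(pattern, text):
    """Return the captured URL (group 1 if present, else the whole match)."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.group(1) if match.re.groups else match.group(0)


if __name__ == "__main__":
    for name, pattern in PATTERNS.items():
        print(name, extract(pattern, HTML))
```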
We receive banking statements from the SAP system. Sometimes the file name does not follow the naming standard, and the file is rejected.
We want to validate the file name, which arrives in the name attribute as in the example below.
Can the country ISO code be left out of the validation?
We want an XPath that matches GLO_***_UPLOAD_STATEMENT, so that the ISO code itself is not validated.
Example XML:
<?xml version="1.0" encoding="UTF-8"?>
<Details name="GLO_ZFA_UPLOAD_STATEMENT" type="banking" version="3.0">
<description/>
<object>
<encrypted data="b528f05b96102f5d99743ff6122bb0984aa16a02893984a9e427a44fcedae1612104a7df1173d9c61a99ebe0c34ea67a46aecc86f41f5924f74dd525"/>
</object>
</Details>
Xpath tried:
Details[@type="banking"]/@name[not(starts-with(., "GLO_***_UPLOAD_STATEMENT"))]
which is not working :(
Can anyone help here, please :)
Thanks in advance!
Try using the matches() function for a regex like this:
Details[@type="banking"]/@name[not(matches(., "^GLO_(.){3}_UPLOAD_STATEMENT"))]
starts-with() compares literal characters; it doesn't recognize patterns such as ***.
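If your XPath version doesn't support regex (XPath 1.0 has neither matches() nor ends-with()), you can compare the prefix and the suffix instead. The name is invalid when either check fails:
Details[@type="banking"]/@name[not(starts-with(., "GLO_") and substring(., string-length(.) - 16) = "_UPLOAD_STATEMENT")]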
You can match regular expressions using the matches() function. For example:
//Details[@type="banking" and not(matches(@name, "GLO_[A-Z]*_UPLOAD_STATEMENT"))]/@name
This selects the name attribute only for Details elements that have type="banking" and a name that does not match the regular expression "GLO_[A-Z]*_UPLOAD_STATEMENT". You can refine the regex as needed.
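If the validation can be done outside XPath, the same check can be sketched in Python. This is an assumption-laden illustration rather than a drop-in solution: the standard library's ElementTree has no matches(), so the regex is applied in Python instead, and the ISO code is assumed to be exactly three uppercase letters. NAME_PATTERN and invalid_banking_names are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the ISO code is exactly three uppercase letters.
NAME_PATTERN = re.compile(r"GLO_[A-Z]{3}_UPLOAD_STATEMENT")


def invalid_banking_names(xml_text):
    """Return the name attributes of banking Details that break the convention."""
    root = ET.fromstring(xml_text)
    bad = []
    # iter() also visits the root element itself.
    for details in root.iter("Details"):
        if details.get("type") != "banking":
            continue
        name = details.get("name", "")
        if not NAME_PATTERN.fullmatch(name):
            bad.append(name)
    return bad


if __name__ == "__main__":
    sample = '<Details name="GLO_ZFA_UPLOAD_STATEMENT" type="banking" version="3.0"/>'
    print(invalid_banking_names(sample))
```

fullmatch() anchors the pattern at both ends, which is the equivalent of writing ^ and $ in the matches() call.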
In my Ant build file, I have a task that needs to replace a specific element of an XML file.
Here is the target XML that I am trying to modify:
<foo>
<sub>
<elem>name1</elem>
</sub>
<sub>
<elem>name2</elem>
</sub>
<sub>
<elem>name3</elem>
</sub>
</foo>
Ant build task:
<replaceregexp file="myfoo.xml"
match="&lt;elem&gt;(.*)&lt;/elem&gt;"
replace="&lt;elem&gt;${replace_only_second_match}&lt;/elem&gt;"
byline="true"
/>
The problem with the above task is that all the tags will get replaced. However, I want only the second element to be modified, not the first or third match. (Such a thing is quite easy with normal regular expressions.)
I don't know how to do it with Ant's regular expressions. This is where I need help or suggestions on how best to solve this problem.
You should use xmltask for XML-related tasks. For your problem, use it like this:
Modify the file in place:
<xmltask source="whatever.xml" dest="whatever.xml">
<replace path="//sub[2]/elem/text()" withText="newname2"/>
</xmltask>
Create a new file:
<xmltask source="whatever.xml" dest="newfile.xml">
<replace path="//sub[2]/elem/text()" withText="newname2"/>
</xmltask>
The replace element also provides withXml / withFile / withBuffer.
See xmltask manual and tutorial for details.
Some XPath essentials here.
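To illustrate why a positional path is easier than counting regex matches, here is the same "second sub only" replacement sketched in Python with the standard library. It is not part of Ant or xmltask; SAMPLE and replace_second_elem are hypothetical names, and ElementTree supports only a limited XPath subset, so the path is written relative to the root element.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    "<foo>"
    "<sub><elem>name1</elem></sub>"
    "<sub><elem>name2</elem></sub>"
    "<sub><elem>name3</elem></sub>"
    "</foo>"
)


def replace_second_elem(xml_text, new_text):
    """Replace the text of <elem> inside the second <sub> only."""
    root = ET.fromstring(xml_text)
    # sub[2] is a 1-based positional predicate, as in //sub[2]/elem.
    target = root.find("sub[2]/elem")
    if target is not None:
        target.text = new_text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(replace_second_elem(SAMPLE, "newname2"))
```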
A little bit of background:
I work at a multilingual communication company, where we’re working with a CMS system. Since its last update, all the files I export out of the system are ‘polluted’ with metadata, which I don't want to see, use or replace. To filter and change a heap of xml files, I use Powergrep, which operates with regexes.
I want my regex to find, e.g. "there is no spoon", "oracle", "I know kung-fu" and "bending method" (all straight quotation marks) and replace it with “there is no spoon”, “oracle”, “I know kung-fu” and “bending method” (all with curly quotation marks).
I don’t want it to find the metadata "concept.dtd" and "map.dtd"
The following lines are the first lines of my xml file. It's this "concept.dtd" that I would like to ignore.
<?xml version="1.0" encoding="UTF-16" standalone="no"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"[
]>
<?ish ishref="GUID-6B84EF92-DA99-4C54-BA91-FD0A113D4A96" version="1" lang="sv" srclng="en"?>
This is somewhere in the middle of the xml file
<row>
<entry colname="col1" valign="middle" align="left">"Bending method" </entry>
<entry colname="col2" valign="middle" align="left">another word</entry>
</row>
So, this is the original regex:
(?<!=)"\b(.+?)\b"(?! \[)
Replacement:
“\1”
Problem:
As the metadata “concept.dtd” and “map.dtd” are part of the file, I don’t want to replace their quotation marks in order not to change anything crucial. So I tried rewriting the regex:
(?<!=)"\b(.+?[\.d])\b"(?! \[)
It almost works: “concept.dtd” and “map.dtd” are skipped, most of the terms between quotation marks are found, but not all: “Bending method” is not found, for example.
What am I missing? Any help or opinions would be greatly appreciated!
Based on your last answers, here is a regexp that can help you:
(?<=<entry)[^>]+>[^<>]*?(".+?")[^<>]*?(?=<\x2Fentry>)
Demo
http://regex101.com/r/lX2cU3
Discussion
I assume that you have one series of words between straight quotation marks and that there are no carriage returns or line feeds inside an <entry> node.
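The idea of touching only quoted text inside element content can be sketched in Python like this. It is a rough illustration, not PowerGREP syntax: it assumes that no attribute value contains a literal > character, and TEXT_NODE, STRAIGHT_QUOTED and curl_quotes are hypothetical names.

```python
import re

# Text that sits between a closing ">" and the next "<", i.e. element content.
# Assumption: attribute values never contain a literal ">".
TEXT_NODE = re.compile(r'(?<=>)[^<>]+(?=<)')
STRAIGHT_QUOTED = re.compile(r'"([^"]+)"')


def curl_quotes(xml_text):
    """Replace "..." with curly quotes in element text only."""
    def fix(match):
        return STRAIGHT_QUOTED.sub('“\\1”', match.group(0))
    return TEXT_NODE.sub(fix, xml_text)


if __name__ == "__main__":
    sample = (
        '<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"[\n'
        ']>\n'
        '<row>\n'
        '<entry colname="col1" valign="middle" align="left">"Bending method" </entry>\n'
        '</row>\n'
    )
    print(curl_quotes(sample))
```

Because the DOCTYPE and the attribute values live inside tags, their straight quotes are never seen by the inner replacement.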
I need to replace the image below (the complete img tag, including the URL) with text. I am not very good with regex. As you can see, the URL is dynamic because it contains dates, and the tag ends in two different ways.
Sometimes it ends with alt=";)">
<img src="http://thailandsbloggare.se/wp-content/uploads/2012/10/icon_wink.gif" alt=";)">
and sometimes with class="wp-smiley" at the end
<img src="http://thailandsbloggare.se/wp-content/uploads/2012/09/icon_wink.gif" alt=";)" class="wp-smiley" />
So any time this image is posted, I want the complete string to be replaced with the text ";)"
I have managed to write the regex for everything up to alt=";)"> (or class="wp-smiley" />), but then I am stuck. I presume I need some OR functionality here.
<img src="http://thailandsbloggare.se/wp-content/uploads/20\d\d/\d+/icon_wink\.gif
Updated information after replies below
<img src="http://thailandsbloggare.se/wp-content/uploads/20[0-9]{2}/[01][0-9]/icon_wink.gif" alt=";\)" *(|class="wp-smiley")?>
and
Both fail on strings that include class="wp-smiley" />
It's a site built in WordPress using PHP, and I am using http://urbangiraffe.com/plugins/search-regex/
Thanks in advance!
Normally, in a regex, you can create alternative sub-regexes:
(match this|or this)
In your case
(alt=";\)"|class="wp-smiley")
If alt=";)" is always there, do:
alt=";\)" *(|class="wp-smiley")
Of course, we don't know in which editor or programming language you are operating, and the actual regex implementation can be different from the above example.
Try the following search pattern:
<img src="http://thailandsbloggare\.se/wp-content/uploads/20[0-9]{2}/[01][0-9]/icon_wink\.gif" alt=";\)"(\sclass="wp-smiley"\s?/)?>
The optional group covers the variant that ends in class="wp-smiley" />. Check the syntax supported by the regex engine you are using, but the pattern above should work in most engines. Note the character classes used for the date parts; adjust them as appropriate.
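As a quick way to check an optional trailing group against both variants from the question, here is a small Python sketch. The Search Regex plugin runs PHP's regex engine, so treat this only as an illustration of the pattern's logic; WINK and replace_winks are hypothetical names.

```python
import re

# One pattern for both endings: the class attribute and the
# self-closing slash are each optional.
WINK = re.compile(
    r'<img src="http://thailandsbloggare\.se/wp-content/uploads/'
    r'20\d\d/\d+/icon_wink\.gif" alt=";\)"'
    r'(?:\s+class="wp-smiley")?\s*/?>'
)


def replace_winks(html):
    """Replace every wink smiley image tag with the text ;)"""
    return WINK.sub(";)", html)


if __name__ == "__main__":
    plain = '<img src="http://thailandsbloggare.se/wp-content/uploads/2012/10/icon_wink.gif" alt=";)">'
    styled = ('<img src="http://thailandsbloggare.se/wp-content/uploads/2012/09/icon_wink.gif" '
              'alt=";)" class="wp-smiley" />')
    print(replace_winks("Hello " + plain + " and " + styled))
```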
I have a massive amount of HTML code with loads of images. The problem is that every single image has a different path, for example:
<img src="../media/2010/01/something.jpg" />
<img src="../media/logo.png" />
What I wanted to do with regular expressions is, to find every image path and replace it with:
<img src="../img/FILENAME.EXTENSION" />
I know that it's definitely possible with regular expressions, but it's just not my cup of tea. Could anyone help me, please?
Cheers, Mart
This might not be the best solution but it might work:
(<img.*?src=")([^"]*?(\/[^/]*\.[^"]+))
and then you use capture groups 1 and 3 to create the new string (depending on flavor):
$1../img$3
You can see it in action here: http://regexr.com?2v8ir
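In Python's flavor, for instance, the same pattern and replacement could be tried like this. The group references become \1 and \3, and rewrite_paths is a hypothetical name for this sketch.

```python
import re

# The pattern from this answer, unchanged. Group 1 is the tag up to the
# opening quote of src, group 3 is the final "/filename.ext" part.
IMG_SRC = re.compile(r'(<img.*?src=")([^"]*?(\/[^/]*\.[^"]+))')


def rewrite_paths(html):
    """Point every image at ../img/ while keeping the file name."""
    return IMG_SRC.sub(r'\1../img\3', html)


if __name__ == "__main__":
    html = ('<img src="../media/2010/01/something.jpg" />\n'
            '<img src="../media/logo.png" />')
    print(rewrite_paths(html))
```

Note that group 3 needs at least one slash in the path, so a src that is already a bare file name is left untouched.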
If you want to parse HTML, it's much better to use an HTML parser instead of a regex. There are quite a lot of them, and they do a very good job.
Html Agility Pack is a good one.
Use the regex <img src="[\w/.]*/([\w.-]+)"\s?/> and replace with <img src="../img/$1" />, where $1 is the captured file name and extension (the group reference syntax depends on your regex flavor).