I have an XML file I use to manually route users to specific pages in a website.
Currently, we have separate entries for every variation of possible searches (plural, typos etc.). I would like to know if there is a way I can condense it with regex to something like so:
<OnSiteSearch>
...
<Search>
<SearchTerm>(horses?|cows?) for sale</SearchTerm>
<Destination>~/some/path.html</Destination>
</Search>
...
</OnSiteSearch>
Is something like this possible? I've looked online for regex and XML but it seems to be about validating content between the XML tags and not about using regex as the content.
Yes, a regex can be stored in XML as long as you mind XML escaping rules to keep the XML well-formed:
Element content: Escape any < as &lt; and any & as &amp; when writing the regex; reverse the substitution before using the regex.
Attribute value: Follow the rules for element content, plus escape any " as &quot; or any ' as &apos; to avoid conflict with the chosen attribute value delimiters.
CDATA: No escaping needed, but make sure your regex doesn't include the string ]]>.
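As a rough illustration of the element-content rule, here is a minimal Python sketch. It assumes the routing file looks like the snippet in the question; the second `Search` entry, the function names `load_routes` and `route`, and the choice of whole-query, case-insensitive matching are all made up for the example. Note that an XML parser reverses the `&amp;` / `&lt;` substitution for you when it returns the element text.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical routing file; the second entry is invented to show escaping.
ROUTES_XML = """<OnSiteSearch>
  <Search>
    <SearchTerm>(horses?|cows?) for sale</SearchTerm>
    <Destination>~/some/path.html</Destination>
  </Search>
  <Search>
    <SearchTerm>tack &amp; feed|&lt;b&gt;hay&lt;/b&gt;</SearchTerm>
    <Destination>~/supplies.html</Destination>
  </Search>
</OnSiteSearch>"""


def load_routes(xml_text):
    """Return (compiled_regex, destination) pairs in document order."""
    root = ET.fromstring(xml_text)
    routes = []
    for search in root.iter("Search"):
        # The parser has already unescaped the text: &amp; -> &, &lt; -> <
        pattern = search.findtext("SearchTerm")
        destination = search.findtext("Destination")
        routes.append((re.compile(pattern, re.IGNORECASE), destination))
    return routes


def route(query, routes):
    """Return the destination of the first pattern matching the whole query."""
    for regex, destination in routes:
        if regex.fullmatch(query.strip()):
            return destination
    return None


routes = load_routes(ROUTES_XML)
print(route("horse for sale", routes))
print(route("Cows for sale", routes))
print(route("tack & feed", routes))
```

Whether to match the whole query (`fullmatch`) or any part of it (`search`) is a design choice the question does not settle; the sketch picks the stricter option.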
Related
I'm working with existing code where regular expressions are used to parse HTML. For specific reasons it is not possible to use XPath. The HTML is actually an HTML/text email. In the email I have multiple div elements with text content. I'm trying to write a regex that matches the n-th div element. Unfortunately, these div elements do not have any attributes like classes or ids. I tried this, but it matches all occurrences:
<div>(.*)<\/div>{1}
There are many suggestions out there, but none of them works for me.
Thanks.
Fellow Forum Members,
I am using the latest NotePad++. I have 430 separate XML files and my goal is to make a "dmcode" list of all 430 XML files. The dmcode identifies each XML file and looks like the example code shown below. I need help developing a regular expression that will grab the dmcode tag content located between the <dmCode opening tag and the closing /> terminator. The extraction should apply only to dmcode tags that follow the <dmIdent> tag. In other words, any dmcode tag that is not preceded by a <dmIdent> tag should not end up in my NotePad++ search result list. Is a regular expression that can pull targeted data from a lot of XML files possible?
<dmIdent>
<dmCode assyCode="00" disassyCode="00" disassyCodeVariant="00" infoCode="042" infoCodeVariant="A" itemLocationCode="O" modelIdentCode="SASA" subSubSystemCode="6" subSystemCode="0" systemCode="A03" systemDiffCode="XY"/>
As an alternative I have been researching using an XPath expression to accomplish the same task. However, I can't seem to find a NotePad++ XPath plugin that will enable me to specify the data I want to extract from 430 XML files by using an XPath expression instead of a Regular Expression. I will also appreciate it if anyone can provide an example of an XPath expression that will perform the same task I'm trying to accomplish by using a Regular Expression.
Any help will be greatly appreciated.
I know there are plugins for XPath, but I don't know one that allows you to search several files. The following XPath would match all attributes in <dmCode> as a child of the root element <dmIdent>:
/dmIdent/dmCode[@*]
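If a script is an option instead of a NotePad++ plugin, the same idea can be sketched in Python with the standard library. This is only an illustration: the `<root>` and `<other>` wrapper elements, the trimmed attribute list, and the function name `dm_codes` are assumptions, and `xml.etree.ElementTree` supports only a subset of XPath, so the path is written as `.//dmIdent/dmCode`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down document; real files will have more structure.
SAMPLE = """<root>
  <dmIdent>
    <dmCode assyCode="00" infoCode="042" modelIdentCode="SASA" systemCode="A03"/>
  </dmIdent>
  <other>
    <dmCode modelIdentCode="IGNORED"/>
  </other>
</root>"""


def dm_codes(xml_text):
    """Return the attributes of every <dmCode> that is a child of <dmIdent>."""
    root = ET.fromstring(xml_text)
    # './/dmIdent' looks at descendants of the root, not the root itself.
    return [dict(el.attrib) for el in root.findall(".//dmIdent/dmCode")]


codes = dm_codes(SAMPLE)
for attrs in codes:
    print(attrs)
```

To cover a folder of files, the same function could be called on the text of each file found with `pathlib.Path(folder).glob("*.xml")`.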
I need help in developing a Regular Expression that will grab the dmcode tag content located between the <dmCode opening tag and the closing /> terminator. Also I only need this extraction to only apply to dmCode tags that follow the <dmIdent> tag.
This will work for the most simple cases, where:
<dmCode> is the first child of <dmIdent>
There are no comments, CDATA tags, or similar constructs that could make it fail.
(?i)<dmIdent>\s*<dmCode \K[^"/>]*(?>(?:"[^\\"]*(?:\\.[^\\"]*)*"|/(?!>))[^"/>]*)*(?=/>)
regex101 demo
Matches:
(?i)<dmIdent>\s*<dmCode both tags separated by whitespace (case-insensitively)
\K resets the matched text
[^"/>]* Any characters except ", / or >
And loops:
"[^\\"]*(?:\\.[^\\"]*)*" text in quotes, or
/(?!>) a / not followed by >
both followed by the previous [^"/>]*
(?=/>) All followed by />
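For anyone running this outside NotePad++, here is a hedged Python adaptation of the breakdown above. Python's `re` module has no `\K`, so the sketch captures the attribute text in group 1 instead, and it replaces the atomic group with a plain non-capturing group. It also simplifies the quoted-string part to `"[^"]*"`, since XML attribute values do not use backslash escapes. The sample text and the names `DM_CODE` and `TEXT` are made up for the example.

```python
import re

DM_CODE = re.compile(
    r'''<dmIdent>\s*<dmCode[ ]   # both tags, separated by optional whitespace
        (                        # group 1 stands in for what \K would keep
          [^"/>]*                # any characters except quote, slash or >
          (?:
            (?: "[^"]*"          # text in quotes, or
              | /(?!>)           # a slash not followed by >
            )
            [^"/>]*
          )*
        )
        (?=/>)                   # all followed by the terminator
    ''',
    re.IGNORECASE | re.VERBOSE,
)

# Hypothetical input: only the first dmCode follows a <dmIdent> tag.
TEXT = (
    '<dmIdent>\n'
    '  <dmCode assyCode="00" infoCode="042" modelIdentCode="SASA"/>\n'
    '</dmIdent>\n'
    '<other><dmCode modelIdentCode="IGNORED"/></other>'
)

matches = DM_CODE.findall(TEXT)
print(matches)

# Optionally split the captured text into name/value pairs.
parsed = dict(re.findall(r'(\w+)="([^"]*)"', matches[0]))
print(parsed)
```

The same caveats as the original pattern apply: comments, CDATA sections, or anything between `<dmIdent>` and `<dmCode` other than whitespace will make it miss.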
I have the following content
<li>Title: [...]</li>
and I'm looking for regex that will match and replace this so that I can parse it as XML. I'm just looking to use a regex find and replace inside Sublime Text 2, so I want to match everything in the above example except for the [...] which is the content.
Why not extract the content and use it to build the XML, rather than trying to mold the wrapper of the content into XML? (Or am I misunderstanding you?)
<li>Title: ([^<]*)<\/li>
is the regular expression to extract the content.
It's pretty self-explanatory, other than the [^<]*, which matches any number of characters that are not a "<".
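A small Python sketch of that "extract, then build" idea follows. The input string and the output element names `titles` and `title` are invented for the illustration; only the regular expression comes from the answer.

```python
import re
import xml.etree.ElementTree as ET

# The pattern from the answer; group 1 is the content after "Title: ".
TITLE = re.compile(r"<li>Title: ([^<]*)</li>")

# Hypothetical input.
html = "<ul><li>Title: First post</li><li>Title: Second post</li></ul>"

titles = TITLE.findall(html)
print(titles)

# Build fresh XML from the extracted content instead of patching the HTML.
root = ET.Element("titles")
for text in titles:
    ET.SubElement(root, "title").text = text

xml_out = ET.tostring(root, encoding="unicode")
print(xml_out)
```

Building the output with an XML library also takes care of escaping any `&` that appears in a title.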
I don't know Sublime, but something like this should suffice to get you the contents of the li. It allows for optional extra attributes on the tag. Make sure to turn off case-sensitivity, in case of LI or Li etc. (lifted straight from http://www.regular-expressions.info/examples.html ):
<li\b[^>]*>(.*?)</li>
<li>\S*(.*)?</li>
That should match your string, with the content being capturing group 1.
If I have a bunch of URLs like this:
<li><a href="...">Xyz 123</a></li>
<li><a href="...">Xyz 345</a></li>
What would a regex look like to erase the URLs inside the hrefs so that they become:
<li><a href="">Xyz 123</a></li>
<li><a href="">Xyz 345</a></li>
The following should do what you like:
/href=\"([^\"]*)\"/
Basically match href="<any text but a '"'>".
Search for <a href="[^"]*" and replace with <a href="".
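Since no language was given, here is one possible rendering of that search-and-replace in Python. The input lines and their example.com URLs are placeholders invented for the sketch.

```python
import re

# Hypothetical input; the URLs are placeholders.
html = (
    '<li><a href="http://example.com/123">Xyz 123</a></li>\n'
    '<li><a href="http://example.com/345">Xyz 345</a></li>'
)

# Search for <a href="[^"]*" and replace it with <a href="".
cleaned = re.sub(r'<a href="[^"]*"', '<a href=""', html)
print(cleaned)
```

This only handles the exact `<a href="...">` form; links with other attributes before `href`, or with single-quoted values, would need a looser pattern or a parser.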
If you add more details about which language you're using, I can be more specific. Be aware also that regular expressions are usually not the tool of choice when dealing with HTML.
First of all, do not use regex to parse HTML — why? Have a look here or here.
Process the HTML using an XML reader / XML document processing engine. Then use XPath to find nodes matching your criteria and alter href attributes in the DOM.
Note: For HTML which is not well-formed XML a more-general HTML (SGML) parser is required.
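A minimal sketch of the parser-based route in Python, assuming the fragment is well-formed XML (the wrapping `<ul>` and the example.com URLs are invented for the example). For real-world HTML that is not well-formed, an HTML parser would be needed instead of `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment.
fragment = (
    '<ul>'
    '<li><a href="http://example.com/123">Xyz 123</a></li>'
    '<li><a href="http://example.com/345">Xyz 345</a></li>'
    '</ul>'
)

root = ET.fromstring(fragment)

# Visit every <a> element and blank its href attribute in the tree.
for link in root.iter("a"):
    if "href" in link.attrib:
        link.set("href", "")

result = ET.tostring(root, encoding="unicode")
print(result)
```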
I partially agree with the others but a more complete version would be
/(<a[^>]+href\s*=\s*\")(.*?)("[^>]*>)/$1$3/gi
I want to match a keyword that is not linked. As the following example shows, I want to match only the google keyword that is neither between <a></a> nor included in the attributes, i.e. only the last google:
google is linked, google is not linked.
Do not parse HTML with regular expressions. HTML is not a regular language. Use an HTML parser.
This works for me (javascript):
var matches = str.match(/(?:<a[^>]*>[^<]*<\/a>[\s\S]*)*(google)/);
See it in action
Provided you can be sure that your HTML is well behaved (and valid), especially does not contain comments or nested a tags, you can try
google(?!((?!<a[\s>]).)*</a>)
That matches any "google" that is not followed by a closing a tag before the next opening a tag. But you might be better off using an HTML parser instead.
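To see the lookahead at work, here is a short Python sketch. The input string is an assumption about what the question's example looked like before its markup was lost; the pattern is the one from the answer, unchanged.

```python
import re

# Pattern from the answer: "google" not followed by </a> before the next <a.
UNLINKED = re.compile(r'google(?!((?!<a[\s>]).)*</a>)')

# Hypothetical input with "google" in an attribute, in link text, and unlinked.
text = '<a href="http://google.com">google</a> is linked, google is not linked.'

positions = [m.start() for m in UNLINKED.finditer(text)]
print(positions)
print(text[positions[0]:])
```

Only the last occurrence is reported, because the first two are each followed by `</a>` with no new `<a` in between. Note that `.` does not cross line breaks by default, so links spanning several lines would need the `re.DOTALL` flag.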