Last Matched String Issue - regex

I'm using the following regular expression to pull out some html:
(?i)(?:\<tr\s*class='list'[^\>]*\>)[^$+]*\</tr\>
The problem is it's not segregating the TRs correctly. I'm trying to use $+ to reference the tag selector again, to ensure that the contents of the match don't contain the start tag again. Here is the sample HTML:
http://www.pastie.org/1311827
There are multiple <tr>s in some matches. Please help.

I don't know what you think [^$+]* means, but it defines a negated character class that matches zero or more times. In other words, it matches an empty string, or one or more characters that aren't a literal dollar sign or plus.
HTML cannot be trivially parsed by regex (unless you know ahead of time what the structure will look like), because properly parsing a document requires recursion: elements can be nested within themselves (for instance, a <div> can contain another <div>). While some languages (you didn't specify which you're using) support recursive regular expressions (Perl and PHP, for instance), it would likely be more efficient to use a proper DOM parser than a recursive regex, quite apart from the complexity of the latter.

Use document.getElementsByTagName in your favorite DOM library, iterate through the nodeList with a loop, and check getAttribute('class') on each node.
I suggest not using regex: unless you're dealing with very trivial markup, it's only a matter of time before the regex breaks. Besides, the DOM is made for exactly this purpose.
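Since the question doesn't name a language, here is a minimal sketch of the DOM approach in PHP (the file name is a placeholder; the class value comes from the question):

<?php
// Hypothetical sketch: collect each <tr class='list'> with a DOM parser
// instead of a regex, so one match can never spill into the next row.
$html = file_get_contents('rows.html');   // placeholder source file

$doc = new DOMDocument();
libxml_use_internal_errors(true);          // tolerate sloppy real-world HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select only rows whose class attribute is exactly "list".
foreach ($xpath->query("//tr[@class='list']") as $row) {
    echo $doc->saveHTML($row), PHP_EOL;    // the full <tr>…</tr> markup
}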

Find and replace with regular expression in Notepad++

At the moment, I have a PHP function that gets the contents of a CSV file and puts it into a multi-dimensional array, which contains text that I print out in various places, using the indexes.
An example of use would be:
$localText[index][pageText][conceptQualityText][$lang];
The first index, [index], would be the name of the page. The second index [pageText] would indicate what it is (text for the page). The third index, [conceptQualityText] indicates what the actual text is. The last index, [$lang] gets the text in the desired language.
so:
->page location
->what is it
->the content
->what language it should be displayed in.
This all worked fine in previous PHP versions. However, after upgrading to 7.2, PHP seems to be a bit more strict. I was a bit more green ~2 years ago when I first made this solution, and now know that since these indexes aren't written as strings, i.e. encapsulated in single quotes like so: ['index'], they fit the notation of a constant (as if created with define()). I didn't give it much thought back then, but now PHP interprets them as such, and so I get thrown the error that x word is an undefined constant.
My initial thought is to make a search and replace on my example string:
$localText[index][pageText][conceptQualityText][$lang];
using the regular expression functionality in Notepad++.
However, the example is just one of many, the notation of the array indexing is basically:
$localText[index][index2][index3][$lang];
So my question is:
How can I make use of the Notepad++ search and replace, using a regular expression, so that my index pointers become strings, instead of being treated as undefined constants?
e.g. make:
$localText[index][index2][index3][$lang];
into:
$localText['index']['index2']['index3'][$lang];
I will need some sort of logic that checks for whatever is inside the brackets and encapsulates them with single quotes, except for the last index, [$lang].
I tried to give as much information as possible; let me know if anything needs to be elaborated.
I tried to refer to these docs without much luck.
I found a solution using this:
find: \b(localText\[)([a-zA-Z0-9_\-]+)(\]\[)([a-zA-Z0-9_\-]+)(\]\[)([a-zA-Z0-9_\-]+)
replace: $1'$2'$3'$4'$5'$6'
and it works like a charm. Thanks to everyone who took the time to help.
You can use the following regex to match:
\[(\w+)\]
The regex matches a word between square brackets unless it is quoted; \w matches neither quotes nor a dollar sign, so already-quoted indexes are skipped.
Replace with:
['$1']
The regex will not match the last pair of brackets because their content contains a '$' sign, which \w does not match.
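If running a one-off script is easier than doing this in Notepad++, the same substitution can be sketched with preg_replace in PHP. This is only an illustration: the file name is a placeholder, and the pattern is not anchored to $localText (unlike the accepted find pattern), so review the result before saving.

<?php
// Rough sketch: wrap bare array indexes in single quotes, leaving
// already-quoted indexes and variable indexes such as [$lang] untouched,
// because \w matches neither quotes nor the $ sign.
$source  = file_get_contents('page.php');            // placeholder file name
$patched = preg_replace('/\[(\w+)\]/', '[\'$1\']', $source);
file_put_contents('page.php', $patched);              // note: touches every [word] index in the file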

Problems with finding and replacing

Hey stackoverflow community. I need help with a huge information file. Is it possible, with a regular expression, to find this tag:
<category_name><![CDATA[Prekiniai ženklai&gt;Adler|Kita buitinė technika&gt;Buičiai naudingi prietaisai|Kita buitinė technika&gt;Lygintuvai]]></category_name>
Somehow replace all the other data and leave only 'Adler' or 'Lygintuvai'? I'm using Altova to edit XML files, so I can't find any other way than find-replace. And I'm new to the regex stuff, so I thought maybe you could help me.
#\<category_name\>.+?gt\;([\w]+?)\|.+?gt;([\w]+?)\]\]\>\<\/category_name\>#i
\1 - Adler
\2 - Lygintuvai
(PHP flavor; demo on regex101.com)
Fields may contain alphanumeric characters without spaces.
If you want to modify the range of acceptable characters, change [\w] to something else:
[a-z] - only letters
[0-9] - only digits
etc.
It's possible, but use of regular expressions to process XML will never be 100% correct (you can prove that using computer science theory), and it may also be very inefficient. For example, the solution given by Luk is incorrect because it doesn't allow whitespace in places where XML allows it. Much better to use XQuery or XSLT, both of which are designed for the job (and both work in Altova). You can then use XPath expressions to locate the element or attribute nodes you are interested in, and you can still use regular expressions (e.g. in the XPath replace() function) to process the content of text or attribute nodes.
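Altova can run XSLT or XQuery directly; purely as an illustration of the same "XPath first, text handling second" idea, here is a rough PHP sketch (catalog.xml is a placeholder file name):

<?php
// Locate the nodes with XPath, then post-process only their text content.
$doc = new DOMDocument();
$doc->load('catalog.xml');                 // placeholder file name

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//category_name') as $node) {
    // The CDATA text keeps the literal &gt; sequences, so decode them first.
    $text  = html_entity_decode($node->textContent);
    $paths = explode('|', $text);
    // Keep only the last segment ("leaf") of the first and last category path.
    $first = explode('>', $paths[0]);
    $last  = explode('>', end($paths));
    echo end($first), ' | ', end($last), PHP_EOL;   // Adler | Lygintuvai
}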
Incidentally, your input is rather strange because it uses escape sequences like &gt; within a CDATA section; but XML escape sequences are not recognized in a CDATA section.

Reg ex to match everything between two tags

I have a string similar to this:
<td><p>alakjsdlajsdlkj</p><p><b>asdkjalsdkjaskldj</b></p><p>asdjlaksjdlaksjd</p></td>
What is the regular expression to grab everything between the tags?
I want to grab the following (including the HTML)
<p>alakjsdlajsdlkj</p><p><b>asdkjalsdkjaskldj</b></p><p>asdjlaksjdlaksjd</p>
You can't accomplish this with regular expressions. They just aren't descriptive/powerful enough, mainly in that there is no mechanism to keep track of how many of something it has seen. In short, this is because the regex mechanism has no notion of a stack (it represents finite state machines, not pushdown automata).
For example, consider the pattern <p>(.*)</p>. If you used greedy mode (match as much as possible) and have a string like <p>first</p><p>second</p>, the match will be first</p><p>second. If you went with non-greedy mode (make the smallest match possible) and get a string like <p><p>stuff</p></p>, you'll be rewarded with the match <p>stuff. So neither mode covers all cases (or any case) well.
As @kristopher points out, it's possible to have a pattern that avoids including another tag inside the match, but this will only match innermost tags.
To do what you want robustly, you'll need an actual parser. Several html parsing solutions have been created by others, or for simple parsing needs, you might be able to write your own.
If your tags nest, this gets messy fast.
Are you unable to use an HTML parser library? It would be FAR better to do so.
<([^>]+)>([^<]+)</\1>
gets you
any string wrapped in angle brackets
plus any characters up until the next tag
this doesn't handle nested or mismatched tags though
<div>test <b>nested</b></div>
will only catch the <b>, not the <div>, since the <div> match will encounter the start of the <b> before encountering the end of its own tag.
Try this; it should just match the outermost tags and return the inner string in the group:
^<\w+>(.*)</\w+>$
But it does not check for correct nesting, etc. Use an appropriate framework if possible.
If you can't use an HTML parser and the td and ending td are at the beginning and end of the string:
^<td>(.*)</td>$
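As a usage sketch (PHP assumed, since the question doesn't name a language):

<?php
// The anchored pattern applied to the question's sample string.
$html = '<td><p>alakjsdlajsdlkj</p><p><b>asdkjalsdkjaskldj</b></p><p>asdjlaksjdlaksjd</p></td>';

if (preg_match('#^<td>(.*)</td>$#s', $html, $m)) {
    echo $m[1];   // everything between the outer <td> tags, HTML included
}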

How to efficiently match an input string against several regular expressions at once?

How would one efficiently match one input string against any number of regular expressions?
One scenario where this might be useful is with REST web services. Let's assume that I have come up with a number of URL patterns for a REST web service's public interface:
/user/with-id/{userId}
/user/with-id/{userId}/profile
/user/with-id/{userId}/preferences
/users
/users/who-signed-up-on/{date}
/users/who-signed-up-between/{fromDate}/and/{toDate}
…
where {…} are named placeholders (like regular expression capturing groups).
Note: This question is not about whether the above REST interface is well-designed or not. (It probably isn't, but that shouldn't matter in the context of this question.)
It may be assumed that placeholders usually do not appear at the very beginning of a pattern (but they could). It can also be safely assumed that it is impossible for any string to match more than one pattern.
Now the web service receives a request. Of course, one could sequentially match the requested URI against one URL pattern, then against the next one, and so on; but that probably won't scale well for a larger number of patterns that must be checked.
Are there any efficient algorithms for this?
Inputs:
An input string
A set of "mutually exclusive" regular expressions (ie. no input string may match more than one expression)
Output:
The regular expression (if any) that the input string matched against.
The Aho-Corasick algorithm is a very fast algorithm for matching an input string against a set of patterns (actually keywords), which are preprocessed and organized in a trie to speed up matching.
There are variations of the algorithm that support regex patterns (e.g. http://code.google.com/p/esmre/, just to name one) and are probably worth a look.
Or, you could split the URLs into chunks, organize them in a tree, then split the URL to match and walk the tree one chunk at a time. The {userId} can be treated as a wildcard, or required to match some specific format (e.g. be an int).
When you reach a leaf, you know which URL pattern you matched.
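A rough PHP sketch of that tree idea (the route names and the {param} handling are my own assumptions, and parameter values are not captured here):

<?php
// Routes keyed by a name; {…} marks a placeholder segment.
$routes = [
    'user-by-id'      => '/user/with-id/{userId}',
    'user-profile'    => '/user/with-id/{userId}/profile',
    'users-signed-up' => '/users/who-signed-up-on/{date}',
];

// Build a tree: one node per path segment, '*' for a placeholder segment.
$tree = [];
foreach ($routes as $name => $pattern) {
    $node = &$tree;
    foreach (explode('/', trim($pattern, '/')) as $segment) {
        $key  = $segment[0] === '{' ? '*' : $segment;
        $node = &$node[$key];
    }
    $node['#'] = $name;        // leaf marker: which route this path belongs to
    unset($node);
}

// Walk the tree one segment at a time; literal segments win over wildcards.
function matchUrl(array $tree, string $url): ?string {
    $node = $tree;
    foreach (explode('/', trim($url, '/')) as $segment) {
        if (isset($node[$segment])) { $node = $node[$segment]; }
        elseif (isset($node['*']))  { $node = $node['*']; }
        else                        { return null; }
    }
    return $node['#'] ?? null;
}

echo matchUrl($tree, '/user/with-id/42/profile');   // user-profile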
The standard solution for matching multiple regular expressions against an input stream is a lexer generator such as Flex (there are lots of these available, typically several for each programming language).
These tools take a set of regular expressions associated with "tokens" (think of tokens as just names for whatever a regular expression matches) and generate efficient finite-state automata to match all the regexes at the same time. This is linear time, with a very small constant, in the size of the input stream; it is hard to ask for "faster" than this. You feed it a character stream, it emits the token name of the regex that matches "best" (this handles the case where two regexes can match the same string; see the lexer generator for the definition of this), and it advances the stream past what was recognized. So you can apply it again and again to match the input stream as a series of tokens.
Different lexer generators will allow you to capture different bits of the recognized stream in different ways, so you can, after recognizing a token, pick out the part you care about (e.g., for a literal string in quotes, you only care about the string content, not the quotes).
If there is a hierarchy in the URL structure, that should be used to maximize performance. Only a URL that starts with /user/ can match any of the first three, and so on.
I suggest storing the hierarchy to match in a tree corresponding to the URL hierarchy, where each node matches one level of the hierarchy. To match a URL, test it against all roots of the tree, which here are only the nodes with regexes for "user" and "users". Matching URLs are then tested against the children of those nodes, until a match is found in a leaf node. A successful match can be returned as the list of nodes from the root to the leaf. Named groups with property values such as {user-id} can be fetched from the nodes of the successful match.
Use named expressions and the OR operator, i.e. "(?P<re1>...)|(?P<re2>...)|...".
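For instance, a minimal PHP sketch of that suggestion (the three sub-patterns are abbreviated versions of the ones in the question):

<?php
// One combined, anchored regex; the non-empty named group identifies the match.
$combined = '#^(?:'
          . '(?P<userById>/user/with-id/[^/]+)'
          . '|(?P<users>/users)'
          . '|(?P<signedUpOn>/users/who-signed-up-on/[^/]+)'
          . ')$#';

if (preg_match($combined, '/user/with-id/42', $m)) {
    foreach (['userById', 'users', 'signedUpOn'] as $name) {
        if (!empty($m[$name])) {
            echo "matched: $name\n";   // matched: userById
        }
    }
}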
At first I thought I couldn't see any good optimization for this process.
However, if you have a really large number of regexes, you might want to partition them (I'm not sure if this is technically partitioning).
What I suggest you do is:
Suppose that you have 20 possible URLs that start with /user:
/user/with-id/X
/user/with-id/X/preferences # instead of preferences, you could have another 10 possibilities like /friends, /history, etc
Then, you also have 20 possible URLs starting with /users:
/users/who-signed-up-on
/users/who-signed-up-on-between #others: /registered-for, /i-might-like, etc
And the list goes on for /products, /companies, etc instead of users.
What you could do in this case is use "multi-level" matching.
First, match the start of the string. You'd be matching for /products, /companies, /users, one at a time, and ignoring the rest of the string. This way, you don't have to test all 100 possibilities.
After you know the URL starts with /users, you can match it against only the possible URLs that start with /users.
This way, you avoid a lot of unneeded matching: you won't test the string against all of the /products possibilities.

How do we create such a regular expression to extract data?

<br>Aggie<br><br>John<br><p>Hello world</p><br>Mary<br><br><b>Peter</b><br>
I'd like to create a regexp that safely matches these:
<br>Aggie<br>
<br>John<br>
<br>Mary<br>
<br><b>Peter</b><br>
It is possible that there are other tags (e.g. <i>, <strike>, etc.) between a pair of <br> tags, and they have to be captured as well, just like <br><b>Peter</b><br>.
What should the regexp look like?
If you learn one thing on SO, let it be this: "Do not parse HTML with a regex". Use an HTML parser.
<br>.*?<br>
will match anything from one <br> tag to the closest following one.
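A quick usage sketch (PHP assumed) against the sample string:

<?php
$html = '<br>Aggie<br><br>John<br><p>Hello world</p><br>Mary<br><br><b>Peter</b><br>';
// Non-overlapping matches: each <br>…<br> pair is consumed before the next scan.
preg_match_all('#<br>.*?<br>#', $html, $m);
print_r($m[0]);   // <br>Aggie<br>, <br>John<br>, <br>Mary<br>, <br><b>Peter</b><br>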
The main problem with parsing HTML using regexes is that regexes can't handle arbitrarily nested structures. This is not a problem in your example.
Split the string at (<br>)+. You'll get empty strings at the beginning and the end of the result, so you need to remove them, too.
If you want to preserve the <br>, then this is not possible unless you know that there is one before and after each element in the result.