Regex in Notepad++ to select on string length between specific XML tags - regex

I'm working with Emergency Services data in the NEMSIS XSD. I have a field, which is constrained to only 50 characters. I've searched this site extensively, and tried many solutions - Notepad++ rejects all of them, saying not found.
Here's an XML Sample:
<E09>
<E09_01>-5</E09_01>
<E09_02>-5</E09_02>
<E09_03>-5</E09_03>
<E09_04>-5</E09_04>
<E09_05>this one is too long Non-Emergency - PT IS BEING DISCHARGED FROM H AFTER BEING ADMITTED FOR FAILURE TO THRIVE AND ALCOHOL WITHDRAWAL</E09_05>
</E09>
<E09>
<E09_01>-5</E09_01>
<E09_02>-5</E09_02>
<E09_03>-5</E09_03>
<E09_04>-5</E09_04>
<E09_05>this one is is okay</E09_05>
</E09>
I've tried solutions naming the E09_05 tag in different ways, using <\/E09_05> for the closing tag as I've seen in some examples, and as just </E09_05> as I've seen in others. I've tried ^.{50,}$ between them, or [a-zA-Z]{50,}$ between them, I've tried wrapping those in-between expressions in () and without. I even tried just [\s\S]*? in between the tags. The only thing that Notepad++ finds is when I use ^.{50,}$ by itself with no XML tags ... but then I wind up hitting on all the E13_01 tags (which are EMS narratives, and always > 50 characters) -- making for painstaking and wrist-aching clicks.
I wanted to XSLT this, but each E09_05 field needs too much individual, hands-on tweaking to automate it. Perl is not an option in this environment (and not a tool I know at all anyway).
To be truly sublime, both E09_05 and E09_08 fields with string lengths >50 need to be what is selected on the search ... but no other elements of any kind or length.
Thanks in advance. I'm sure I'm just missing some subtle \, or () or [] somewhere ... hopefully ...

The following regex will find the text content of <E09_05> elements with more than 50 characters.
(?<=<E09_05>).{51,}?(?=</E09_05>)
Explanation
(?<=<E09_05>) Start matching right after <E09_05>
.{51,}? Match 51 or more characters (in a single line)
The ? makes it reluctant, so it'll stop at first </E09_05>
(?=</E09_05>) Stop matching right before </E09_05>
For truly sublime matching, i.e. both E09_05 and E09_08 fields with string lengths >50, use:
(?<=<(E09_0[58])>).{51,}?(?=</\1>)
Explanation
<(E09_0[58])> Match <E09_05> or <E09_08>, and capture the name as group 1
</\1> Use \1 backreference to match name inside </name>
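To sanity-check the pattern outside the editor, here is a minimal sketch in Python. It assumes Python's `re` module behaves like Notepad++'s engine for this particular pattern; the name `find_long_fields` is made up for illustration.

```python
import re

# The pattern quoted above, unchanged. Assumption: Python's `re` is used here
# only as a stand-in for Notepad++'s regex engine.
LONG_FIELD = re.compile(r"(?<=<(E09_0[58])>).{51,}?(?=</\1>)")

def find_long_fields(xml_text):
    """Return (tag, text) pairs for E09_05/E09_08 values longer than 50 characters."""
    return [(m.group(1), m.group(0)) for m in LONG_FIELD.finditer(xml_text)]
```

Because `.` does not cross line breaks by default, a value is only reported when the opening and closing tags sit on the same line, as in the sample.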
If you want to shorten the text with ellipsis at the end, e.g. Hello World with max length 8 becomes Hello..., use:
Find what: (?<=<(E09_0[58])>)(.{47}).{4,}(?=</\1>)
Replace with: \2...
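The same find/replace pair can be tried out in a script before touching the real file. This is a sketch only, assuming Python's `re` module as a stand-in for Notepad++; `truncate_fields` is a hypothetical helper name.

```python
import re

# Find/replace pair from above: keep the first 47 characters and append "..."
# so the value ends up exactly 50 characters long.
TOO_LONG = re.compile(r"(?<=<(E09_0[58])>)(.{47}).{4,}(?=</\1>)")

def truncate_fields(xml_text):
    """Shorten E09_05/E09_08 values longer than 50 characters to 47 characters plus '...'."""
    return TOO_LONG.sub(r"\2...", xml_text)
```

Values of 50 characters or fewer are left alone, because `.{4,}` needs at least four more characters after the 47 that are kept.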

Related

Regex capture group with non-uniform space group

I'm trying to parse the output of the "display interface brief" Comware switch command to convert it to a CSV file using RegEx. This command is printed using the following format:
Interface Link Speed Duplex Type PVID Description
BAGG51 UP 4G(a) F(a) T 1
FGE1/0/42 DOWN auto A T 1 ### LIVRE ###
GE6/0/20 UP 100M(a) F(a) A 1 LIVRE (MGMT - [WAN8-P8]
It seems quite challenging to me because, no matter which RegEx I try, it doesn't properly handle the "DOWN auto" and "100M(a) F(a)" output, which has only one space between the fields. I also couldn't find a way to properly handle the last field, which can contain one or more spaces: most of the RegExes I tried create a separate capture group for each space-separated word instead of treating the description as a single piece of text.
I've also tried countless other ways to parse it, and I couldn't find much about parsing non-uniform columns on the Internet or in the StackOverflow community.
I need to parse it into the following format, with 7 capture groups per line, respecting the end of line:
BAGG51;UP;4G(a);F(a);T;1
FGE1/0/42;DOWN;auto;A;T;1;### LIVRE ###
GE6/0/20;UP;100M(a);F(a);A;1;LIVRE (MGMT - [WAN8-P8]
The most successful RegEx that I've found so far is: ^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+) replaced with $1;$2;$3;$4;$5;$6;$7 in Notepad++, but it doesn't properly handle the "Description" field, which can be empty.
The following pattern seems to be working here:
^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)(?:[ ]+(.*))?
This follows your pattern with six mandatory capture groups, followed by an optional seventh capture group. The (?:[ ]+(.*))? at the end of the pattern matches one or more spaces followed by the rest of the line, which becomes the description. Note that this pattern should be used in multiline mode.
Here is a working demo
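For a check outside Notepad++, the pattern can be applied one line at a time. This is a rough sketch, assuming Python's `re` module and a hypothetical helper named `to_csv_row`; matching per line is a choice made here so that `\s+` can never run on into the next row.

```python
import re

# Pattern from the answer: six mandatory columns plus an optional description.
ROW = re.compile(r"^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)(?:[ ]+(.*))?")

def to_csv_row(line):
    """Convert one 'display interface brief' row to a ';'-separated string."""
    m = ROW.match(line.strip())
    if m is None:
        return None  # fewer than six columns: not a data row
    # The seventh group is None when the description is empty, so drop it.
    return ";".join(g for g in m.groups() if g is not None)
```

Note that the header line also has seven words, so it would be converted like any other row unless it is skipped beforehand.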

Oracle - Search for text - Retrieve snippet of result

I'm currently building a simple search page in Node JS Express and Oracle.
I'd like to show the user a snippet of the matching text (the first instance would do) to add a bit of context about what the SQL found.
Example:
Search term: 'fish'
Results: Henry really likes going fishing, and once he caug ...
I'm not sure the best way to approach this - I could retrieve the whole block of text and do it in Node JS, but I don't really like the idea of dragging the whole text across to the app, just to get a snippet.
I've been thinking that REGEXP_SUBSTR could be a way to do it... But I'm not sure whether I could use a regular expression to retrieve x characters before and after the matching word.
Have I got the right idea or am I going about it in the wrong way?
Thanks
SELECT text
, REGEXP_SUBSTR(LOWER(text), LOWER('fish')) AS potential_snippet
FROM table
WHERE LOWER(text) LIKE LOWER('%fish%');
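Try this:
select text
, SUBSTR( text, GREATEST(INSTR(LOWER(text), 'fish') - 50, 1), 100 ) AS snippet
FROM test
WHERE INSTR(LOWER(text), 'fish') <> 0;
Play with the position and length numbers (50 and 100 in my example) to limit the length of the string. The GREATEST(..., 1) guard matters when the match falls within the first 50 characters: without it the start position becomes zero or negative, and Oracle's SUBSTR counts a negative position backward from the end of the string.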
If you need to extract some context with the help of JavaScript, you can use limiting quantifiers in a regex:
/\b.{0,15}fish.{0,15}\b/i
See demo
Here,
\b - matches at the word boundary (so that the context contains only whole words)
.{0,15} - any characters other than a newline (replace with [\s\S] or [^] if you need to include newlines)
fish - the keyword
The /i modifier enables case-insensitive search.
If you need to build the regex dynamically, use the constructor notation (if keyword comes from user input, escape any regex metacharacters in it first):
RegExp("\\b.{0,15}" + keyword + ".{0,15}\\b", "i");
Also, if you need to find multiple matches, use the g modifier alongside i.
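The same limiting-quantifier idea carries over to other regex flavours. As an illustration only, here is a Python sketch (not the Node JS code from the question); `snippet` and its `context` parameter are hypothetical names, and `re.escape` is used so the keyword is treated as literal text.

```python
import re

def snippet(text, keyword, context=15):
    """Return the first occurrence of keyword with up to `context` characters
    on each side, trimmed to word boundaries, or None if there is no match."""
    pattern = re.compile(
        r"\b.{0,%d}%s.{0,%d}\b" % (context, re.escape(keyword), context),
        re.IGNORECASE,
    )
    m = pattern.search(text)
    return m.group(0) if m else None
```

The returned snippet may begin or end with whitespace, since a word boundary also sits between a space and a letter; strip it if that matters for display.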

Using regex multiple capture groups to split up a string

I have a file that looks like this...
"1234567123456","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123456","D","TEST1 "
"1234567123456","D","TEST 2~TEST3"
"1234567123456","R","TEST4~TEST5"
"1234567123457","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123457","D","TEST 6"
"1234567123457","D","TEST7"
"1234567123457","R","TEST 8~TEST9~TEST,10"
All I'm trying to do is parse the D and R lines. The ~ is used in this case as a separator. So the end result would be...
"1234567123456","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123456","D","TEST1 "
"1234567123456","D","TEST 2"
"1234567123456","D","TEST3"
"1234567123456","R","TEST4"
"1234567123456","R","TEST5"
"1234567123457","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123457","D","TEST 6"
"1234567123457","D","TEST7"
"1234567123457","R","TEST 8"
"1234567123457","R","TEST9"
"1234567123457","R","TEST,10"
I'm using regex on applications like Textpad and Notepad++. I have not figured out how to use a regex like /.+/g because the applications do not like the forward slashes. So I don't think I can use things like the global modifier. I currently have the following regex...
//In a program like Textpad/Notepad++
<FIND> "(.{13})","D","([^~]*)~(.*)
<REPLACE> "\1","D","\2"\n"\1","D","\3
Now if I run a find and replace with the above params a few times it would work fine (for the D lines only). The problem is there is an unknown number of lines to be made. For example...
"1234567123456","D","TEST1~TEST2~TEST3~TEST4~TEST5"
"1234567123457","D","TEST1~TEST2~TEST3"
"1234567123458","D","TEST1~TEST2"
"1234567123459","D","TEST1~TEST2~TEST3~TEST4"
I was hoping to be able to use a MULTI capture group to make this work. I found this PAGE talking about the common mistake between repeating a capturing group and capturing a repeated group. I need to capture a repeated group. For some reason I just could not make mine work right though. Anyone else have an idea?
Note: If I could get rid of the leading and trailing spaces EX: "1234567123456","D","TEST1 " ending up as "1234567123456","D","TEST1" that would be even better but not necessary.
RESOURCES:
http://www.regular-expressions.info/captureall.html
http://regex101.com/
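Because the number of ~ separators per line is unknown, a single find-and-replace pass struggles to expand every line in one go. One alternative is to step outside the editor. The following is a minimal sketch in Python, assuming a small script is acceptable in this environment (the question itself only mentions Textpad and Notepad++); `expand_records` is a hypothetical name, and the sketch also trims leading and trailing spaces as mentioned in the note above.

```python
import csv
import io

def expand_records(text):
    """Split the third field of D and R records on '~', one output line per part."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        if len(row) == 3 and row[1] in ("D", "R"):
            for part in row[2].split("~"):
                # strip() removes the leading/trailing spaces ("TEST1 " -> "TEST1")
                writer.writerow([row[0], row[1], part.strip()])
        else:
            writer.writerow(row)  # V lines and anything else pass through unchanged
    return out.getvalue()
```

Using the `csv` module rather than a regex keeps quoted commas such as "TEST,10" intact.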

Regular expression for syntax highlighting attributes in HTML tag

I'm working on regular expressions for some syntax highlighting in a Sublime/TextMate language file, and it requires that I "begin" on a non-self closing html tag, and end on the respective closing tag:
begin: (<)([a-zA-Z0-9:.]+)[^/>]*(>)
end: (</)(\2)([^>]*>)
So far, so good, I'm able to capture the tag name, and it matches to be able to apply the appropriate patterns for the area between the tags.
jsx-tag-area:
begin: (<)([a-zA-Z0-9:.]+)[^/>]*>
beginCaptures:
'1': {name: punctuation.definition.tag.begin.jsx}
'2': {name: entity.name.tag.jsx}
end: (</)(\2)([^>]*>)
endCaptures:
'1': {name: punctuation.definition.tag.begin.jsx}
'2': {name: entity.name.tag.jsx}
'3': {name: punctuation.definition.tag.end.jsx}
name: jsx.tag-area.jsx
patterns:
- {include: '#jsx'}
- {include: '#jsx-evaluated-code'}
Now I'm also looking to capture zero or more of the HTML attributes in the opening tag so that I can highlight them.
So if the tag were <div attr="Something" data-attr="test" data-foo>
It would be able to match on attr, data-attr, and data-foo, as well as the < and div
Something like (this is very rough):
(<)([a-zA-Z0-9:.]+)(?:\s(?:([0-9a-zA-Z_-]*=?))\s?)*)[^/>]*(>)
It doesn't need to be perfect, it's just for some syntax highlighting, but I was having a hard time figuring out how to achieve multiple capture groups within the tag, whether I should be using look-around, etc, or whether this is even possible with a single expression.
Edit: here are more details about the specific case / question - https://github.com/reactjs/sublime-react/issues/18
I may have found a possible solution.
It is not perfect because, as @skamazin said in the comments, if you are trying to capture an arbitrary number of attributes you will have to repeat the pattern that matches an attribute once for every attribute you want to allow, which caps the number of attributes that can be matched.
The regex is pretty scary but it may work for your goal. Maybe it would be possible to simplify it a bit or maybe you will have to adjust some things
For only one attribute it will be as this:
(<)([a-zA-Z0-9:.]+)(?:(?: ((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)(?: |>))
DEMO
For more attributes you will need to add this as many times as you want:
(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)(?: |>))?
So for example if you want to allow maximum 3 attributes your regex will be like this:
(<)([a-zA-Z0-9:.]+)(?:(?: ((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)(?: |>))(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)?(?: |>))?(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)?(?: |>))?
DEMO
Tell me if it suits you and if you need further details.
I'm unfamiliar with sublimetext or react-jsx but this to me sounds like a case of "Regex is your tool, not your solution."
A solution that uses regex as a tool for this would be something like this JsFiddle (note that the regex is slightly obfuscated because of html-entities like &gt; for > etc.)
Code that does the actual replacing:
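blabla.replace(/(<!--(?:[^-]|-(?!->))*-->)|(<(?:(?!>).)+>)|(\{[^\}]+\})/g, function(m, c, t, a) {
  // c = comment, t = tag, a = {...} block; only one of them is defined per match
  if (c !== undefined)
    return '<span class="comment">' + c + '</span>';
  if (t !== undefined)
    // highlight attribute names inside the tag
    return '<span class="tag">' + t.replace(/ [a-z_-]+=?/ig, '<span class="attr">$&</span>') + '</span>';
  if (a !== undefined)
    // highlight single-quoted strings inside the {...} block
    return a.replace(/'[^']+'/g, '<span class="quoted">$&</span>');
});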
So here I'm first capturing the separate types of groups, following this general pattern adapted for this use case of HTML with curly-brace blocks. Those captures are fed to a function that determines what type of capture we're dealing with and further replaces subgroups within that capture using its own .replace() calls.
There's really no other reliable way to do this. I can't tell you how this translates to your environment but maybe this is of help.
Regex alone doesn't seem to be good enough, but since you're working with sublime's scripting here, there's a way to simplify both the code and the process. Keep in mind, I'm a vim user and not familiar with sublime's internals - also, I usually work with javascript regexes, not PCREs (which seems to be the format used by sublime, or closest thereof).
The idea is as follows:
use a regex to get the tag, attributes (in a string) and contents of the tag
use capture groups to do further processing and matching if necessary
In this case, I made this regex:
<([a-z]+)\ ?([a-z]+=\".*?\"\ ?)?>([.\n\sa-z]*)(<\/\1>)?
It starts by finding an opening tag, creates a capture group for the tag name, and if it finds a space it proceeds to match the bulk of the attributes. Inside the \"...\" pattern I could have used \"[^\"]*?\" to match only non-quote characters, but I purposefully match any character up to the closing quote that sits right before the >; this captures the bulk of the attributes, which we can process later. It then matches any text in between the tags and finally the closing tag.
It creates 4 capture groups:
tag name
attribute string
tag contents
closing tag
As you can see in this demo, if there is no closing tag we get no capture group for it, and the same goes for attributes, but we always get a capture group for the contents of the tag. This can be a problem in general (since we can't assume that a captured feature will always be in the same group), but it isn't here: in the conflict case where we get no attributes and no content, the 2nd capture group is empty, so we can just assume there are no attributes, and the lack of a 3rd group speaks for itself. If there's nothing to parse, nothing can be parsed wrongly.
Now to parse the attributes, we can simply do it with:
([a-z]+=\"[^\"]*?\")
demo here. This gives us the attributes exactly. If sublime's scripting lets you get this far, it certainly would allow you further processing if necessary. You can of course always use something like this:
(([a-z]+)=\"([^\"]*?)\")
which will provide capture groups for the attribute as a whole and its name and value separately.
Using this approach, you should be able to parse the tags well enough for highlighting in 2-3 passes and send off the contents for highlighting to whatever highlighter you want (or just highlight it as plaintext in whatever fancy way you want).
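To make the two-pass idea concrete, here is a small sketch in Python rather than in Sublime's own scripting, which is an assumption made purely for illustration. `parse_tag` is a hypothetical helper; the two patterns are the ones given above, with the unnecessary backslash escapes dropped.

```python
import re

# Pass 1: tag name, attribute string, contents, optional closing tag.
TAG = re.compile(r'<([a-z]+) ?([a-z]+=".*?" ?)?>([.\n\sa-z]*)(</\1>)?')
# Pass 2: individual attributes, with name and value captured separately.
ATTR = re.compile(r'(([a-z]+)="([^"]*?)")')

def parse_tag(html):
    """Return the first tag's name, attributes, contents and whether it is closed."""
    m = TAG.search(html)
    if m is None:
        return None
    name, attr_string, contents, closing = m.groups()
    # attr_string is None when the tag has no attributes.
    attrs = {a.group(2): a.group(3) for a in ATTR.finditer(attr_string or "")}
    return {"tag": name, "attrs": attrs, "contents": contents, "closed": closing is not None}
```

As with the patterns themselves, only lowercase attribute names without hyphens are recognised, so something like data-attr would need a wider character class.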
Your own regex was quite helpful in answering your question.
This seems to work well for me:
/(:?<|<\/)([a-zA-Z0-9:.]+)(?:\s(?:([0-9a-zA-Z_-]*=?))\s?)*[^/>]*(:?>|\/>)/g
The / at the beginning and end are just the wrappers regex usually requires. In addition, the g at the end stands for global, so it works for repetitions as well.
A good tool I use to figure out what I am doing wrong with my regex is: http://regexr.com/
Hope this helps!

Regexp for finding tags without nested tags

I'm trying to write a regexp which will help to find non-translated texts in html code.
Translated texts are those that go through a special tag, <fmt:message>, or through the construction ${...}
Ex. non-translated:
<h1>Hello</h1>
Translated texts are:
<h1><fmt:message key="hello" /></h1>
<button>${expression}</button>
I've written the following expression:
\<(\w+[^>])(?:.*)\>([^\s]+?)\</\1\>
It finds correct strings like:
<p>text</p>
Correctly skips
<a><fmt:message key="common.delete" /></a>
But also catches:
<li><p><fmt:message key="common.delete" /></p></li>
And I can't figure out how to add exception for ${...} strings in this expression
Can anybody help me?
If I understand you properly, you want to ensure the data inside the "tag" doesn't contain fmt:message or ${....}
You might be able to use a negative lookahead in conjunction with a . to assert that the characters captured by the . are not one of those cases:
/<(\w+)[^>]*>(?:(?!<fmt:message|\$\{|<\/\1>).)*<\/\1>/i
If you want to avoid capturing any "tags" inside the tag, you can ignore the <fmt:message portion, and just use [^<] instead of a . - to match only non <
/<(\w+)[^>]*>(?:(?!\$\{)[^<])*<\/\1>/i
Added from a comment: if you also want to exclude "empty" tags, add another negative lookahead, this time (?!\s*<), to ensure that the content of the tag is not empty or whitespace-only:
/<(\w+)[^>]*>(?!\s*<)(?:(?!\$\{)[^<])*<\/\1>/i
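As a quick check of the last pattern against the examples from the question, here is a sketch in Python. It assumes Python's `re` module, with the surrounding /.../i delimiters replaced by `re.IGNORECASE`; `find_untranslated` is a name made up for this illustration.

```python
import re

# Last pattern from above: a tag whose content has no nested tag, no ${...}
# expression, and is not empty or whitespace-only.
UNTRANSLATED = re.compile(r"<(\w+)[^>]*>(?!\s*<)(?:(?!\$\{)[^<])*</\1>", re.IGNORECASE)

def find_untranslated(html):
    """Return every element whose text looks like a hard-coded, untranslated string."""
    return [m.group(0) for m in UNTRANSLATED.finditer(html)]
```

Run against the four sample lines from the question, only <h1>Hello</h1> is reported; the fmt:message and ${expression} lines are skipped, including the nested <li><p> case.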
If the format is simple as in your examples you can try this:
<(\w+)>(?:(?!<fmt:message).)+</\1>
Rewritten into a more formal question:
Can you match
aba
but not
aca
without catching
abcba ?
Yes.
FSM:
Start->A->B->A->Terminate
Insert abcba and run it
Start is ready for input.
a -> MATCH, transition to A
b -> MATCH, transition to B
c -> FAIL, return fail.
I've used a simple one like this with success,
<([^>]+)[^>]*>([^<]*)</\1>
Of course, if there is any CDATA with '<' in it, this is not going to work so well. But it should do fine for simple XML.
Also see
https://blog.codinghorror.com/parsing-html-the-cthulhu-way/
for a discussion of using regex to parse HTML.
Executive summary: don't.