Hoping that a regex can do this. Fixing broken XML - regex

I have a large XML file that I now want to parse. The XML is fundamentally broken, and with over 2000 lines, I'm trying to avoid a hand cranked fix ;)
Can I use regex replace in Notepad++ to do this?
<Sensor ID="21.1.1_L"/>
to
<Sensor ID="21.1.1_L">
losing the tag's closing slash in all "Sensor" tags (bearing in mind that I cannot simply replace /> with >, and that the ID is variable, including its length, and may or may not have the trailing underscore and letter).
Thanks for any suggestions.

This should work: Search for
(<Sensor [^<>]*)/>
and replace all with
\1>
[^<>]* will match any number of characters except angle brackets (this is to make sure that we can never match across a tag's boundary). Then, /> matches only if the current tag ends with a slash.
You will need to turn on regex matching in Notepad++, of course.
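If you would rather script the fix than run it in an editor, the same search and replace can be sketched in Python. This is only an illustration using the standard `re` module; the function name `open_sensor_tags` and the sample input are made up for the example.

```python
import re

# Same pattern as above: capture everything in the tag up to, but not including, "/>".
SENSOR_SELF_CLOSING = re.compile(r'(<Sensor [^<>]*)/>')

def open_sensor_tags(xml_text: str) -> str:
    """Turn self-closing <Sensor .../> tags into opening <Sensor ...> tags."""
    # \1 re-inserts the captured part of the tag; the slash is simply dropped.
    return SENSOR_SELF_CLOSING.sub(r'\1>', xml_text)

# Hypothetical sample: only the Sensor tags should change.
sample = '<Sensor ID="21.1.1_L"/>\n<Other ID="x"/>\n<Sensor ID="7"/>'
result = open_sensor_tags(sample)
print(result)
```

Tags other than Sensor are left alone because the pattern requires the literal `<Sensor ` prefix.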

Related

Removing newline from text within tags

I would like to remove all the newlines within a specific html-tag that contains a block of text.
I'm sure this is basic stuff, but I have no experience with regex, so any help would be welcome.
Thanks
You haven’t specified your language, so I’ll just give you the regex (no code):
\n(?=[^<>]*</)
Replace all matches with a blank (to “delete” them).
This assumes well-formed XML (or HTML that is well-formed in the same sense, such as XHTML).
It works by requiring any matched newline to be followed by characters such that the next angle bracket encountered is a closing tag.
It’s not bulletproof, but will probably work for most cases, and hopefully your case.
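Since no language was specified, here is a rough sketch of how that regex behaves in Python's `re` module. The function name and the sample markup are assumptions made for illustration, and the sample also shows one of the ways in which the pattern is not bulletproof.

```python
import re

# A newline matches only if the next angle bracket after it starts a closing tag.
NEWLINE_BEFORE_CLOSING_TAG = re.compile(r'\n(?=[^<>]*</)')

def strip_newlines_in_text(markup: str) -> str:
    """Delete newlines that are followed (before any other tag) by a closing tag."""
    return NEWLINE_BEFORE_CLOSING_TAG.sub('', markup)

# Hypothetical sample markup.
sample = '<div>\n<p>first line\nsecond line\n</p>\n</div>'
result = strip_newlines_in_text(sample)

# The newline after <div> survives because the next tag (<p>) is an opening tag.
# The newline between </p> and </div> is removed too, since </div> is a closing tag.
print(repr(result))
```

Note that the lines are joined without a separator; replacing with a single space instead of an empty string may be preferable for running text.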
I guess you want to do this :
str.replace(/<(html|div)>(.*)\n+(?=[\s\S]*<\/\1>)/g, "<$1>$2 ")
This regex targets the html or div tags; you can add more by extending the alternation, e.g. (html|div|p|input|html6tag).
However, you have to run this regex repeatedly until no more replacements are found.

How to replace all lines based on previous lines in Notepad++?

I have an XML code:
<Line1>Matched_text Other_text</Line1>
<Line2>Text_to_replace</Line2>
How can I tell Notepad++ to find Matched_text and replace Text_to_replace with Replaced_text? There are several similar blocks of code, each with exactly the same Matched_text but different Other_text and Text_to_replace. I want to replace them all at once.
My idea is to put
Matched_text*<Line2>*</Line2>
in the Find field, and
Matched_text*<Line2>Replaced_text</Line2>
in the Replace field. I know that \1 in regex might be useful, but I don't know where to start.
The actual code is:
<Name>Matched_text, Other_text</Name>
<IsBillable>false</IsBillable>
<Color>-Text_to_replace</Color>
The regex you're looking for is something like the following.
Find: (Matched_text[\w,\s<>\/]*<Color>-).*(</Color>)
Replace: \1Replaced_text\2
Broken down:
`()` is how you tell regex that you want to keep things (for use in \1, \2, etc.); these are called capture groups in regex land.
`Matched_text[\w,\s<>\/]*` means you want your anchor `Matched_text` and everything after it up to the next part of the expression.
`<Color>-).*(</Color>)` selects everything between <Color>- and </Color> for replacement.
If you have any questions about the expression, I highly recommend looking at a regex cheatsheet.
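To see the capture groups at work outside Notepad++, here is a small sketch in Python using the same Find and Replace expressions. The second block in the sample data (with `Unrelated` and `Keep_me`) is invented to show that blocks without the anchor are left untouched.

```python
import re

# Same Find expression as above. Group 1 keeps everything from the anchor up to
# "<Color>-", group 2 keeps the closing tag, and the ".*" in between is discarded.
FIND = re.compile(r'(Matched_text[\w,\s<>\/]*<Color>-).*(</Color>)')
REPLACE = r'\1Replaced_text\2'

# Hypothetical sample with one matching block and one unrelated block.
sample = """<Name>Matched_text, Other_text</Name>
<IsBillable>false</IsBillable>
<Color>-Text_to_replace</Color>
<Name>Unrelated, Other_text</Name>
<IsBillable>false</IsBillable>
<Color>-Keep_me</Color>
"""

result = FIND.sub(REPLACE, sample)
print(result)
```

The character class does not include `-`, so the first group cannot run past the first `<Color>-` that follows the anchor.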

Regex - Globally replace a slash in between given XML tags only

I'm trying to replace backslashes by forwardslashes, globally, over several lines, in an xml file but only in a given tag.
Example where I want to work on the content of Path:
<name>file1</name><path>c:\folder\folder</path><test>just\the\lolz</test>
<name>file2</name><path>c:\folder\folder\folder</path><test>some more\lolz</test>
Should become:
<name>file1</name><path>c:/folder/folder</path><test>just\the\lolz</test>
<name>file2</name><path>c:/folder/folder/folder</path><test>some more\lolz</test>
I've been trying with look arounds and recursion but I'm getting nowhere...
Last useless try was:
(?<=path>)(\w*?(\x2F))+(?=.*<\/path>)
Thanks!
You can search for this:
(?<=path>[^<]*)\\
And replace with this:
/
It's worth cautioning you that this will not work with any and every XML file. Truly parsing XML files and properly handling any and all possible legal XML is not possible (or at least not recommended) with regex, but as long as the data is consistent, that should suffice.
This is what you need: (?<=<path>.*)\\(?=.*<\/path>)
Note: this does not work in older JavaScript engines, because JavaScript only gained lookbehind support in ES2018.
Let me explain:
(?<=<path>.*) is a lookbehind: it requires <path>, followed by any characters, to precede the character you put after it, in our case \\ (an escaped backslash).
(?=.*<\/path>) is a lookahead. It works the same way as the lookbehind, but checks the text to the right of the character preceding it, in our case the backslash. The lookahead does work in JavaScript.
Hope it helped.
Here is a simple pattern for selecting the text between <path>...</path>:
(?<=path>).*(?=<\/path)
Example in Regex101

What is the regex syntax for a file name with spaces

I'm using a custom blog syndication tool and having problems in using the regex syntax.
Example:
The original code
<img src="http://www.mydomain.com/some image.png">
I tried:
/\<img src\=\"http\:\/\/www\.mydomain\.com\/some\%20image\.png\"\>/
and
/\<img src\=\"http\:\/\/www\.mydomain\.com\/some image\.png\"\>/
But none of them seem to work.
Any suggestions?
The pattern delimiters (/) at either end mean the regex engine should know where the pattern ends, so the engine shouldn't be confused by a space. Are you sure there isn't something else wrong with the pattern? I suspect this is the most likely problem.
You might like to try the %20 without the % being escaped.
Another thing to try could be an escaped space (with a preceding backslash).
Otherwise, you could try using \s to match a space - it's reasonably standard in regex engines (but also matches tabs, line feeds and carriage returns).
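The syndication tool's regex engine isn't known, so the following is only a sanity check in Python's `re` module (an assumption, not the tool's actual engine). It suggests that a plain space in the pattern is normally not the problem, and that `%20` only matches if the source text really contains the encoded form.

```python
import re

html = '<img src="http://www.mydomain.com/some image.png">'

# A literal space in the pattern matches a literal space in the text.
literal_space = re.compile(r'<img src="http://www\.mydomain\.com/some image\.png">')

# \s matches the space as well (and would also match tabs and line breaks).
whitespace_class = re.compile(r'<img src="http://www\.mydomain\.com/some\simage\.png">')

# %20 is just three literal characters, which do not occur in this text.
encoded_space = re.compile(r'<img src="http://www\.mydomain\.com/some%20image\.png">')

for name, pattern in [('literal space', literal_space),
                      (r'\s', whitespace_class),
                      ('%20', encoded_space)]:
    print(name, '->', bool(pattern.search(html)))
```

If the literal-space version fails in the tool, it is worth checking whether the tool sees the raw HTML or a URL-encoded version of it.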

Adding "/index.html" to paths in Vim

I'm trying to append "/index.html" to some folder paths in a list like this:
path/one/
/another/index.html
other/file/index.html
path/number/two
this/is/the/third/path/
path/five
sixth/path/goes/here/
Obviously the text only needs to be added where it does not exist yet. I could achieve some good results with (vim command):
:%s/^\([^.]*\)$/\1\/index.html/
The only problem is that after running this command, some lines like the 1st, 5th and 7th in the previous example end up with duplicated slashes. That's easy to solve too; all I have to do is search for duplicates and replace them with a single slash.
But the question is:
Isn't there a better way to achieve the correct result at once?
I'm a Vim beginner, and not a regex master either. Any tips are really appreciated!
Thanks!
So very close :)
Just add an optional slash to the end of the regex:
\/\?
Then you need to change the rest of the pattern to a non-greedy match so that it ignores a trailing slash. The syntax for a non-greedy match in vim (replacing the *) is:
\{-}
So we end up with:
:%s/^\([^\.]\{-}\)\/\?$/\1\/index.html/
(Doesn't hurt to be safe and escape the period.)
Vim's regex can match a bit of text foo only if it does or doesn't precede or follow some other text bar, without matching bar itself, and this is exactly the sort of thing you're looking for. Here you want to match the end of the line with an optional /, but only if the line doesn't already end in index.html, and then replace it with /index.html. A quick look at Vim's help tells me \@<! is exactly what to use. It tells Vim that the preceding atom must not match just before what follows. With a little experimentation, I get
:%s;/\?\(index\.html\)\@<!$;/index.html;
I use ; to delimit the parts of the :s command so that I don't have to escape any / in the regex or replacement expression. In this particular situation, it's not a big deal though.
The / is optional, and we say so with \?.
We need to group index.html together because otherwise our special \@<! would only affect the l.
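For comparison, the same idea can be sketched outside Vim. Python's `re` module spells the negative lookbehind as `(?<!...)`. The function name below is made up, and the list is the one from the question.

```python
import re

# Optional trailing slash, then end of line, but only if "index.html"
# does not sit immediately before the end of the line.
NEEDS_INDEX = re.compile(r'/?(?<!index\.html)$')

def add_index(path: str) -> str:
    """Append /index.html unless the path already ends in index.html."""
    # count=1 matters: after consuming a trailing "/", the pattern could match
    # again as an empty string at the very end and append the suffix twice.
    return NEEDS_INDEX.sub('/index.html', path, count=1)

paths = [
    'path/one/',
    '/another/index.html',
    'other/file/index.html',
    'path/number/two',
    'this/is/the/third/path/',
    'path/five',
    'sixth/path/goes/here/',
]

fixed = [add_index(p) for p in paths]
for line in fixed:
    print(line)
```

Each line is processed on its own here, without a trailing newline, so `$` only ever means the end of that path.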