UltraEdit (or MacOS regex): Delete multiple lines in xml - regex

I have an unformatted xml file in which I would like to delete tags of a specific name that contain some value.
Example:
<XmlElement1>
</XmlElement1>
<XmlElement2 ... >
...
<Xml1SubElement someParameter="...SearchTerm..."/>
...
</XmlElement2>
<XmlElement3/>
... stands for random characters and random multiple lines
In the example above I would like to delete all XmlElement2 elements that contain "SearchTerm" in the body. In other words, select all text between <XmlElement2 and </XmlElement2> across multiple lines where SearchTerm is in the middle, and replace it with "".
I'm using UltraEdit on MacOS and am flexible with what tools to use.
Your help is much appreciated!

The Perl regular expression search string for this task can be for example:
(?s)^[\t ]*<XmlElement2(?:.(?!</XmlElement2>))+?SearchTerm.+?</XmlElement2>[\t ]*(?:\r?\n|\r)
Explanation:
(?s) ... flag to match newline characters also by dot in search expression.
^[\t ]* ... start search at beginning of a line and match 0 or more tabs or spaces.
<XmlElement2 ... the start tag of the element to remove on containing SearchTerm.
(?:.(?!</XmlElement2>))+? ... a non-capturing group that matches any character one or more times non-greedy, as long as the string after the current character is not </XmlElement2>. Without the negative lookahead (?!</XmlElement2>), a match could start at one <XmlElement2 and run across one or more </XmlElement2> and <XmlElement2 tags until SearchTerm is found somewhere later in the file.
SearchTerm ... string which must be found inside element XmlElement2.
.+? ... any character (including newline characters) one or more times non-greedy. Non-greedy means here to stop matching characters on next occurrence of </XmlElement2> and not on last occurrence of </XmlElement2> in file.
</XmlElement2> ... the end tag of the XML element to remove on containing SearchTerm.
[\t ]*(?:\r?\n|\r) ... 0 or more tabs or spaces and either DOS/Windows (carriage return + line-feed) or UNIX (just line-feed) or MAC (just carriage return) line ending.
PS: The Perl regular expression replace was tested with UltraEdit for Windows v22.20.0.49 on Windows XP and v25.20.0.88 on Windows 7 as I don't have a Mac.
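For anyone without UltraEdit, the same expression can be tried in Python as a rough sketch. Two things here are assumptions and not part of the answer above: the sample XML with its id attributes is made up, and the m flag is added because Python's re module only lets ^ match at line starts in multiline mode.

```python
import re

# Hypothetical sample input; the id attributes are invented for illustration.
xml = (
    "<XmlElement1>\n"
    "</XmlElement1>\n"
    '<XmlElement2 id="keep">\n'
    '  <Xml1SubElement someParameter="other"/>\n'
    "</XmlElement2>\n"
    '<XmlElement2 id="drop">\n'
    '  <Xml1SubElement someParameter="xxSearchTermxx"/>\n'
    "</XmlElement2>\n"
    "<XmlElement3/>\n"
)

# Same search string as above, with (?ms) instead of (?s) so that ^ also
# matches at the start of every line in Python.
pattern = re.compile(
    r"(?ms)^[\t ]*<XmlElement2(?:.(?!</XmlElement2>))+?SearchTerm"
    r".+?</XmlElement2>[\t ]*(?:\r?\n|\r)"
)

# Replace every matching element (including its line ending) with nothing.
cleaned = pattern.sub("", xml)
print(cleaned)
```

The element without SearchTerm comes first in the sample on purpose: the negative lookahead stops the match from running past its end tag into the next element.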

Related

Eclipse Add text to first line of all files

I need to add text to the first line of all my JSPs in Eclipse. This is the regex I am using: \A.* but somehow it selects the whole first line; I just want to prepend text to the start of the file. Any help will be very much appreciated.
The .* pattern matches any 0+ chars other than line break characters, so it matches the first line.
It seems that Eclipse Find/Replace regex feature does not match entirely zero-width patterns (e.g. (?=,) will not find and insert a text before commas).
A workaround is to match and capture some text with (...) (where ... stand for a consuming pattern) capturing group and use $1 in the replacement pattern to reinsert the matched text.
Use
\A(.*)
Replace with MY_NEW_TEXT_HERE_AT_THE_START_OF_FILE$1.
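The capture-and-reinsert idea is not specific to Eclipse. As a small sketch in Python (the file content and the header text are made-up placeholders, and Python refers to the group as m.group(1) where Eclipse uses $1):

```python
import re

# Hypothetical file content and header text, for illustration only.
jsp = "<html>\n<body>hello</body>\n</html>\n"
header = "<%-- prepended header --%>\n"

# \A anchors at the very start of the input; (.*) captures the first line
# so it can be put back right after the new text.
result = re.sub(r"\A(.*)", lambda m: header + m.group(1), jsp)
print(result)
```

A function is used as the replacement here only so that the header text is inserted literally, without any backslash processing.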

Regular expression matching space but at the end of line

I'm trying to replace multiple spaces with a single one, but not at the start of the line.
Example:
___abc___def__
___ghi___jkl__
should turn to
___abc_def__
___ghi_jkl__
Note that I've replaced space with underscore
A simple search using the following pattern:
([^\s])\s+
matches the space at the end of the first line up to the space at the beginning of the next one.
So, if I replace with \1_, I get the following:
___abc_def_ghi_jkl
And that is absolutely not what I expect and regex engines, e.g., PowerGREP or the one in Visual Studio, don't behave that way.
If you want to match only horizontal spaces, use \h:
Find what: (?<=\S)\h+(?=\S)
Replace with: (a space)
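To try this outside the editor, here is a minimal Python sketch. Python's re module has no \h, so [ \t] stands in for it; that substitution is my assumption and only covers spaces and tabs.

```python
import re

text = "   abc   def  \n   ghi   jkl  "

# Collapse runs of spaces/tabs only when they sit between two
# non-whitespace characters; leading and trailing runs are left alone.
collapsed = re.sub(r"(?<=\S)[ \t]+(?=\S)", " ", text)

# Show spaces as underscores, as in the question.
print(collapsed.replace(" ", "_"))
```

Because [ \t] cannot match a line break, the match never runs from the end of one line into the start of the next.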
There are several possible interpretations of the question. For each of them the replacement will be a single space character.
If spaces is plural and means space characters but not tabs, then use a find string of (^ {2,})|( {2,}$).
If spaces is plural and should include tabs, then use a find string of (^[ \t]{2,})|([ \t]{2,}$).
If any leading or trailing spaces and tabs (one or more) are to be replaced with a space, then use a find string of (^[ \t]+)|([ \t]+$).
The general form of each of these is (^...)|(...$). The | means an alternation so either the preceding or the following bracketed expression can match. Hence the find what text can match either at the beginning or the end of a line. The ... varies depending on exactly what needs to be matched. Specifying [ \t] means only the two characters space and tab, whereas \s includes the line-end characters.
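A quick way to see the (^...)|(...$) form at work is a Python sketch like the following. The sample text is made up, and re.MULTILINE is needed in Python so that ^ and $ apply to every line.

```python
import re

# Hypothetical sample: leading and trailing runs of spaces and tabs.
text = "   abc   def  \n\t ghi  "

# Third variant from above: any leading or trailing run becomes one space.
trimmed = re.sub(r"(^[ \t]+)|([ \t]+$)", " ", text, flags=re.MULTILINE)
print(repr(trimmed))
```

The run of spaces between abc and def is untouched because it is neither at the start nor at the end of a line.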
Ok, so the intention was to replace this:
Hey diddle diddle, \n
The Cat and the fiddle,\n
with this:
Hey diddle diddle,\n
The Cat and the fiddle,\n
A slightly modified version of Toto's answer did the trick:
(?<=\S)\h+(?=\S)|\s+$
It finds any horizontal whitespace between non-whitespace characters, plus trailing whitespace at the end of the line.

Regex expression to select pair of lines

I have a logfile. Its general format is
log text 1
log text 2
Error: xxxxxx
error description (1 line only)
log text 3
log text 4
....
Error: xxxxxx
error description (1 line only)
log text 5
....
I would like to select and extract each pair of lines containing an error (each error has two lines; the first line always has Error: as keyword, and the word Error does not occur anywhere else in the logfile).
How do I do it using regex or any other way? I can use MacOS, Unix or Windows XP. MacOS preferred.
Using grep on Mac or a Unix-based OS (-A1 prints each matching line plus the one line after it):
grep -i error -A1 inputfile
A regular expression to find those 2 lines is for example:
^.*?Error.*(?:\r?\n|\r).*$
^ ... start each search at beginning of a line.
.*? ... match any character except carriage return and line-feed zero or more times non-greedy. Non-greedy means as few characters as possible. In other words, stop on the first occurrence of Error and not on the last occurrence.
Error ... this word must be found in the first of 2 lines to match.
.* ... match any character except carriage return and line-feed zero or more times greedy. Greedy means now match as many characters as possible.
(?:\r?\n|\r) ... is a non-marking group matching either carriage return + line-feed (DOS/Windows text files), or only line-feed (UNIX text files), or only carriage return (old MAC text file).
.* ... match any character except carriage return and line-feed 0 or more times greedy.
$ ... anchor for end of line. Line termination is not included in matched string.
In other words this expression matches an entire line containing anywhere the word Error, the line terminator of this first line and everything on next line up to end of line, but not matching also the line terminator of this second line.
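The same expression can be tried in a few lines of Python. This is only a sketch: the log content is invented, and re.MULTILINE is added so that ^ and $ work per line in Python.

```python
import re

# Hypothetical log content in the format described in the question.
log = "\n".join([
    "log text 1",
    "log text 2",
    "Error: disk full",
    "could not write /tmp/x",
    "log text 3",
    "Error: timeout",
    "no response after 30 s",
    "log text 5",
])

# Each match is the Error line, its line ending, and the following line.
pairs = re.findall(r"^.*?Error.*(?:\r?\n|\r).*$", log, flags=re.MULTILINE)

for pair in pairs:
    print(pair)
```

Since the group in the pattern is non-capturing, findall returns the full two-line matches as strings.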
UltraEdit is a shareware text editor available for Windows, Linux and Mac.
Using this Perl regular expression in Search - Find dialog of UltraEdit with advanced find option List lines containing string enabled results in getting all found 2 line strings written to a window listing all found lines.
Opening the context menu of this window (right click on Windows) and clicking on Copy to Clipboard results in copying all found lines to system clipboard.
Pressing Ctrl+N to open a new file, Ctrl+V to paste the copied lines, and Ctrl+S to save the new file results finally in having a file with the wanted information.
Another method is using the UltraEdit script FindStringsToNewFile with the reduced regular expression search string Error.*(?:\r?\n|\r).*. This script writes all found strings starting with keyword Error and ending at end of next line directly to a new file.
One more note:
Whether a . (dot) also matches newline characters such as carriage return and line-feed depends on a flag. In UltraEdit the default is that a dot does not match newline characters. With (?s) at the beginning of a Perl regular expression search string the flag is changed, and the dot then also matches newline characters for this search. With (?-s) at the beginning of a search string the dot can be set to not match newline characters if the application's internal default is the opposite.
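The effect of the flag is easy to check in Python, where the default is also that a dot does not match a line-feed. A minimal sketch with a made-up two-line string:

```python
import re

text = "Error: x\nnext line"

# Default: the dot stops at the line-feed, so this cannot span both lines.
default = re.search(r"Error.*line", text)

# With (?s) the dot also matches the line-feed and the search succeeds.
dotall = re.search(r"(?s)Error.*line", text)

print(default)
print(dotall.group(0) if dotall else None)
```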

Regular Expression to search and replace in notepad ++

If I have a line of text that I want to remove from a text file in Notepad++ and it is always formatted like this
[text]:
except that the words in the text area change, what regular expression could I create to remove the whole section with the search and replace function in Notepad++?
To delete the entire line starting with [any text]: you can use: ^[\t ]*\[.*?\]:.*?\r\n
Explanation:
^ ... start search at beginning of a line (in this case).
[\t ]* ... find 0 or more tabs or spaces.
\[ ... find the opening square bracket as literal character.
.*? ... find 0 or more characters except the newline characters carriage return and line-feed, non-greedy, which means as few characters as possible to get a positive match, i.e. stop matching on the first occurrence of the following ] in the search expression.
\]: ... find the closing square bracket as literal character and a colon.
.*?\r\n ... find 0 or more characters except the new line characters and finally also the carriage return and line-feed terminating the line.
The search string ^[\t ]*\[.*?\]:.*?$ would find also the complete line, but without matching also the line termination.
The replace string is for both search strings an empty string.
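As a rough check outside Notepad++, the first search string can be run in Python. The sample text with Windows line endings is made up, and re.MULTILINE is my addition so that ^ matches at every line start.

```python
import re

# Hypothetical text with DOS/Windows line endings.
text = "keep 1\r\n  [some text]: remove this line\r\nkeep 2\r\n"

# Remove the whole line that starts with [any text]: including its \r\n.
result = re.sub(r"^[\t ]*\[.*?\]:.*?\r\n", "", text, flags=re.MULTILINE)
print(repr(result))
```

Because the expression ends with \r\n, it only removes lines terminated that way; a file with UNIX line endings would need \n there.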
If by removing the entire section, you mean remove the [text]: up to the next [otherText]:, you can try this:
\[text\]:((?!\[[^\]]*\]:).)*
Remember to set the flag for ". matches newline".
This regex basically first matches your section title. Then, it would start matching right after this title and for each character, it uses a negative lookahead to check if the string following this character looks like a section title. If it does the matching is terminated.
Note: Remember that this regex replaces all occurrences of the matched pattern. In other words, if you have more than one such section, all of them are replaced.
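Here is a small Python sketch of that section-removal regex. The section names and contents are invented, and re.DOTALL plays the role of the ". matches newline" option.

```python
import re

# Hypothetical text with three sections; only [text]: should be removed.
text = "[intro]: first\n[text]: line a\nline b\n[other]: line c\n"

# Match the title, then consume characters one at a time as long as the
# text at the current position does not look like another section title.
pattern = r"\[text\]:((?!\[[^\]]*\]:).)*"
result = re.sub(pattern, "", text, flags=re.DOTALL)
print(result)
```

The match stops right before [other]:, so the following section is left intact.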

How to find and replace contents of a bracket inside notepad++

I have a large file with content inside every bracket. This is not at the beginning of the line.
1. Atmos-phere (7800)
2. Atmospheric composition (90100)
3.Air quality (10110)
4. Atmospheric chemistry and composition (889s120)
5.Atmospheric particulates (10678130)
I need to do the following
Replace the entire content, get rid of line numbers
1.Atmosphere (10000) to plain Atmosphere
Delete the line numbers as well
1.Atmosphere (10000) to plain Atmosphere
make it a hyperlink
1.Atmosphere (10000) to plain linky study
[I added/Edit] Extract the words into a new file, where we get a simple list of key words. Can you also please explain the numbers in replace the \1\2, and escape on some characters
Each set of key words is a new line
Atmospheric
Atmospheric composition
Air quality
Each set is on one line, separated by a comma and one space
Atmospheric, Atmospheric composition, Air quality
I tried find with a regex like so, \(*\), and it finds the brackets, but I don't know how to replace this, where to put the replacement, and what variable holds the replacement value.
Here is my expression for Notepad++: ([0-9(). ]*)(.*)(\s\()(.*)
You need to split your search into groups
([0-9. ]*) numbers, spaces and dots combination in 0 or more times
(.*) everything till next expression
(\s\() space and opening parenthesis
(.*) everything else
In the Replace box, for practice, you can try:
\1\2\3\4 does nothing :) it just prints all four groups from above
\2 gives you only the second group
new_thing\2new_thing adds your text before and after the group
<a href=blah.com/\2.html>linky study</a> so now your text is added. Spaces between words can be problematic when creating a link, so another expression needs to be made to replace all spaces in the link with e.g. _
If you need to add a backslash as text (or another special character used by regex) it must be escaped, so you put \\ for a backslash or \$ for a dollar sign
Want more tuning? <a href=blah.com/\2.html>\2</a> adds the second group again, or use whichever group you want
On the screenshot you can see how I use it (I had found and replaced one line)
OK, and then we have case 4.2 with a comma at the end, so simply add a comma after the extracted section:
change the replacement from \2 to \2,
Now you need to join the lines, and the simplest way is Edit->Line Operations->Join Lines
but if you want to be a real pro, switch to Extended mode (just above Regular expression mode in the Replace window), find \r\n and replace it with a space.
Removing line endings can differ in some cases, but that is another story. For now I assume that you are using Windows, since Notepad++ is a Windows tool and the line endings are in Windows style :)
The following regex should do the job: \d+\.\s*(.*?)\s*\(.*?\).
And the replacement: <a href=example.com\\\1.htm>\1</a>.
Explanation:
\d+ : Match a digit 1 or more times.
\. : Match a dot.
\s* : Match spaces 0 or more times.
(.*?) : Group and match everything until ( found.
\s* : Match spaces 0 or more times.
\(.*?\) : Match parenthesis and what's between it.
The replacement part is simple since \1 is referring to the matching group.
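For the keyword list asked for in the edit to the question, the same regex can be sketched in Python. The input lines are the ones from the question; replacing with just \1 instead of the link markup is my simplification.

```python
import re

lines = [
    "1. Atmos-phere (7800)",
    "2. Atmospheric composition (90100)",
    "3.Air quality (10110)",
    "4. Atmospheric chemistry and composition (889s120)",
    "5.Atmospheric particulates (10678130)",
]

pattern = re.compile(r"\d+\.\s*(.*?)\s*\(.*?\)")

# Keep only the captured words from each line.
keywords = [pattern.sub(r"\1", line) for line in lines]

one_per_line = "\n".join(keywords)      # each set of key words on a new line
comma_separated = ", ".join(keywords)   # all sets on one line

print(one_per_line)
print(comma_separated)
```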
Try replacing ^\d+\.(.*) \(\w+\)$ with <a href=blah.com\\\1.htm>linky study</a>.
The ^\d+\. matches the leading number and dot so that they are removed. The (.*) collects the words. Then there is a single space. The \(\w+\)$ matches the final number in brackets.
Update for the added Q4.
Regular expressions capture things written between round brackets ( and ). Brackets that are to be found in the text being searched must be escaped as \( and \). In the replacement expression, \1, \2 etc. are replaced by the text of the corresponding capture. So a search expression such as Z(\d+)X([aeiou]+)Y might match Z29XeieiY, and then the replacement expression P\2Q\1R would insert PeieiQ29R. In the search at the top of this answer there is one capture: the (.*) captures or collects the words, and then the \1 inserts the captured words into the replacement text.
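Both examples can be reproduced in Python, which uses the same \1 and \2 notation in replacement strings. This is just a sketch; the input line 1.Atmosphere (10000) is taken from the question.

```python
import re

# The Z...X...Y example: group 1 is the digits, group 2 the vowels.
swapped = re.sub(r"Z(\d+)X([aeiou]+)Y", r"P\2Q\1R", "Z29XeieiY")
print(swapped)

# The search from the top of this answer: \\ in the replacement becomes a
# literal backslash and \1 becomes the captured words.
link = re.sub(
    r"^\d+\.(.*) \(\w+\)$",
    r"<a href=blah.com\\\1.htm>linky study</a>",
    "1.Atmosphere (10000)",
)
print(link)
```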