Is there a regex to remove spaces and newlines from an XML input file?

I would like to change an xml which is in format
<input>My
Input</input>
<input2>My
input2</input2>
to
<input>My Input</input>
<input2>My input2</input2>
The input XML file has more than 10,000 records in the above format, which prevents the software from working properly.
I need a regex to fix it in one pass.
I tried ('//n','') but it does not work as expected.

If your regex flavor supports Lookbehinds, you may use something like this:
(?<!>)(\s)*[\r\n]+
..and replace with \1.
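This matches one or more newline characters, optionally preceded by other whitespace, as long as the match does not start right after a > character. The replacement \1 keeps one captured whitespace character if there was any and inserts nothing otherwise, so a line break with no adjacent space (My directly followed by a newline) is joined as MyInput. Replace with a single space instead if you need My Input.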
If Lookbehind is not supported, you may use:
([^>])(\s)*[\r\n]+
..and replace with \1\2.
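As a quick check of the idea, here is a minimal Python sketch. It is a variant rather than the exact pattern above: it uses a positive lookbehind and always substitutes a single space, which produces the My Input form asked for in the question. The name join_broken_lines is hypothetical, and the sketch assumes that line breaks inside element text never come directly after a > character.

```python
import re

# A whitespace run containing a newline, preceded by a character that is
# neither '>' nor whitespace, so breaks right after a tag are left alone.
BROKEN_LINE = re.compile(r"(?<=[^>\s])\s*\n\s*")

def join_broken_lines(xml_text: str) -> str:
    """Replace line breaks inside element text with a single space."""
    return BROKEN_LINE.sub(" ", xml_text)

sample = "<input>My\nInput</input>\n<input2>My\ninput2</input2>"
print(join_broken_lines(sample))
```

Because the lookbehind also rejects whitespace, a CRLF that follows a closing tag is left untouched in this sketch.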

Related

regex match file with multiple extension

I have several strings like this
XYZ_TEST_2017.txt
ASD_TEST_2017.txt.tmp
I need to extract only those strings ending with .txt
So I'm using this regex:
[A-Z]{3}_TEST_[0-9]{4}.txt
However I still get the strings with multiple extensions like the second one (.txt.tmp)
See my regex demo.
How can I handle it?
To have your regex match everything up to the end, append an "end-of-text marker" ($) to your pattern like this:
[A-Z]{3}_TEST_[0-9]{4}\.txt$
As you may have noticed, I also escaped the dot, otherwise this filename would match as well:
SOM_TEST_1234Etxt
The dot (.) would match any character (depending on your flags, even newline and carriage return), in this case, the E before txt.
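A small Python sketch of the anchored pattern, using the file names from the question plus the SOM_TEST_1234Etxt counterexample. The helper name txt_only is hypothetical.

```python
import re

# Escaped dot plus end anchor: the name must end in a literal ".txt".
ONLY_TXT = re.compile(r"[A-Z]{3}_TEST_[0-9]{4}\.txt$")

def txt_only(names):
    """Return only the names that end with .txt after the expected prefix."""
    return [name for name in names if ONLY_TXT.search(name)]

names = ["XYZ_TEST_2017.txt", "ASD_TEST_2017.txt.tmp", "SOM_TEST_1234Etxt"]
print(txt_only(names))
```

In Python specifically, $ also matches just before a trailing newline; re.fullmatch is a stricter alternative if that matters.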

Regular Expression Find all LF characters not CRLF hexadecimal

I am viewing a CSV file which has LF characters in the middle of a field and CRLF character to actually denote a new line. I am viewing the file in hexadecimal in Sublime Text 3 and I want to do a simple find and replace where I search for LF characters but NOT CRLF and replace it with a space.
I've gotten as far as to search for LF but NOT CRLF, I could use the regular expression
[^0d]0a. Problem with this is that it doesn't capture the case where you could have XX0d 0aXX and I don't know how to capture this with regular expressions. I would then want to replace this with '20' which is space in hexadecimal.
Use a negative lookbehind that matches 0d with optional whitespace.
(?<!0d\s*)0a
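However, many regex engines don't allow quantifiers in lookbehinds. A plain (?<!0d) is not enough on its own, because in 0d 0a the two characters before 0a are "d " rather than 0d, so the lookbehind passes and the CRLF is matched anyway. With such engines, use one fixed-width lookbehind per separator you expect (here, none or a single whitespace character):
(?<!0d)(?<!0d\s)0a
and replace with 20.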
It would probably be easier if you did this in text mode instead of hex. Replace
(?<!\r)\n
with space.
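The text-mode replacement can also be done outside the editor. The following is a minimal Python sketch that works on raw bytes, so CRLF pairs are preserved exactly. The function name replace_lone_lf is hypothetical.

```python
import re

# A line feed that is not immediately preceded by a carriage return.
LONE_LF = re.compile(rb"(?<!\r)\n")

def replace_lone_lf(data: bytes) -> bytes:
    """Replace bare LF bytes with a space, leaving CRLF untouched."""
    return LONE_LF.sub(b" ", data)

print(replace_lone_lf(b"a,b\nc\r\nd,e\r\n"))
```

When applying this to a real file, open it in binary mode so that Python does not translate the line endings before the pattern sees them.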

RegExp to match visible non-letter characters before line break

I am working on a vbs regexp that will detect a tag which contains text and a CRLF character before closing tag.
I am currently using \w+[:;?!.,""\)\]-~]*(\s)*(\r\n\s*)(<\/.*>)
Reading the expression from the end: it matches any closing tag, a CRLF optionally followed by whitespace, optional spaces before the CRLF, and, optionally, any visible non-letter characters that follow a word.
This is to match things like
myword! CRLF</tag>
mywordCRLF</tag>
myword CRLF</tag>
myword...CRLF </tag>
etc.
However, I do not want to match below, as I need to detect tags containing TEXT and linebreaks.
</otherclosingtag> CRLF </tag>
I am concerned about the \w+[:;?!.,""\)\]-~]* bit as it doesn't look right to me, as I would need to insert quite a large number of characters here.
I tried replacing it with \S, \W but they all seem to match CRLF characters as well.
Any ideas?
Cheers!
How about using non-greedy modifier:
\w+\W*?\r\n\s*(<\/.*>)
or
\w+[^\r\n]*\r\n\s*(<\/.*>)
The solution that I used:
\w+[^\r\n<>]*(\r\n\s*)(<\/.*>)
It matches a word (so not a tag), then anything that is not CR, LF, < or > (so it doesn't match openingtag> CRLF</closingtag>)
This is a modified version of what M42 proposed; I added <> to the character class to make sure it won't match a tag.
Thanks for suggestions!
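The accepted pattern can be checked against the examples from the question. The question targets VBScript, so this Python sketch only illustrates the matching behaviour under the assumption that both engines treat this pattern the same way. The helper name has_text_before_crlf is hypothetical.

```python
import re

# A word, then any characters except CR, LF, '<' and '>', then CRLF,
# optional whitespace, and a closing tag.
TEXT_THEN_CRLF = re.compile(r"\w+[^\r\n<>]*(\r\n\s*)(</.*>)")

def has_text_before_crlf(fragment: str) -> bool:
    """True if text, not just another tag, precedes the CRLF and closing tag."""
    return TEXT_THEN_CRLF.search(fragment) is not None

wanted = ["myword! \r\n</tag>", "myword\r\n</tag>",
          "myword \r\n</tag>", "myword...\r\n </tag>"]
unwanted = "</otherclosingtag> \r\n </tag>"

print([has_text_before_crlf(s) for s in wanted])
print(has_text_before_crlf(unwanted))
```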
Try this:
^.*[\n\t\s]*</.*>$ --> BAD
^.*[\r\n\t\s]*</.*>$

Regexp to strip characters after URL in emacs

I have a .org file with lines of this sort:
*
http://en.wikipedia.org/wiki/Qibla Qibla - Wikipedia, the free
as you can see, an asterisk, followed by newline, followed by URL, followed by one space, and then some extraneous useless text that i want to get rid of.
i would like to format this file to this structure:
*
http://en.wikipedia.org/wiki/Qibla
or, strip all the characters after the end of the URL while maintaining the rest of the structure.
how can i do this in emacs?
Assuming you're doing this interactively with query-replace-regexp, try using this regex to strip the junk off the end of the URLs:
^\(http[^ ]+\).*$
Replacement:
\1
You can get rid of the asterisks easily enough, use this regex and replace with nothing:
*^J
Use control-Q followed by control-J to enter the newline.
Edit: Or, to do it in one, replace
*^J\(http[^ ]+\) .*^J
With
\1^J
Where ^J is a literal newline inserted by typing control-Q followed by control-J.
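To sanity-check the first replacement (keeping only the URL on each line) outside Emacs, here is a rough Python equivalent. Note the syntax difference: Emacs writes groups as \( \), while Python uses plain parentheses. The sketch uses \S+ instead of [^ ]+ so that the URL part cannot run across a newline; the function name strip_after_url is hypothetical.

```python
import re

# A line starting with a URL: keep the URL, drop the rest of the line.
URL_LINE = re.compile(r"^(http\S+).*$", re.MULTILINE)

def strip_after_url(text: str) -> str:
    """Remove everything after the URL on lines that begin with one."""
    return URL_LINE.sub(r"\1", text)

org_text = "*\nhttp://en.wikipedia.org/wiki/Qibla Qibla - Wikipedia, the free\n"
print(strip_after_url(org_text))
```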

Regex to match tag contents while simultaneously omitting leading and trailing whitespace

I am trying to write a regex that matches entire contents of a tag, minus any leading or trailing whitespace. Here is a boiled-down example of the input:
<tag> text </tag>
I want only the following to be matched (note how the whitespace before and after the match has been trimmed):
"text"
I am currently trying to use this regex in .NET (Powershell):
(?<=<tag>(\s)*).*?(?=(\s)*</tag>)
However, this regex matches "text" plus the leading whitespace inside of the tag, which is undesired. How can I fix my regex to work as expected?
You should not use regex to parse HTML.
Use a parser instead.
Also:
Regex to remove body tag attributes (C#)
Also also: RegEx match open tags except XHTML self-contained tags
If all that doesn't convince you, then don't use the dot in the middle of your expression. Use the alphanumeric escape. Your dot is consuming whitespace. Use \w (I think) instead.
Drop the lookarounds; they just make the job more complicated than it needs to be. Instead, use a capturing group to pick out the part you want:
<tag>\s*(.*?)\s*</tag>
The part you want is available as $matches[1].
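The same capturing-group idea, sketched in Python for illustration (the question itself is about .NET and PowerShell). The function name tag_text is hypothetical.

```python
import re

# \s* outside the group absorbs the padding; the lazy group keeps the rest.
TAG_TEXT = re.compile(r"<tag>\s*(.*?)\s*</tag>")

def tag_text(source: str):
    """Return the trimmed contents of the first <tag>...</tag>, or None."""
    match = TAG_TEXT.search(source)
    return match.group(1) if match else None

print(repr(tag_text("<tag> text </tag>")))
```

The lazy quantifier matters here: with a greedy .* the group would swallow the trailing whitespace before the final \s* had a chance to match it.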
Use these regular expressions to strip trailing and leading whitespaces. /^\s+/ and /\s+$/
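using System;
using System.Text.RegularExpressions;
// Capture everything between the tags, then trim the surrounding whitespace.
string test = "<tag> test </tag>";
string pattern3 = @"<tag>(.*?)</tag>";
Console.WriteLine("{0}", Regex.Match(test, pattern3).Groups[1].Value.Trim());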