Regex: Finding space between two strings that is too long

I have an XML file that I am trying to parse into my database, but am getting an error stating a certain field exceeds my max character count (2000). I've identified the field in question, but don't have a row number in my error, so I have to find and delete the offender(s) in the XML itself.
Below is a sample. I need to find any entries where the text between the first occurrence of "CCCStmts Correction" and "RoAmts" is over 2000 characters long. I'm using Notepad++ and can only think this will work with regex. Ideas?
<CCCStmts Correction="sample text" />
<CCCStmts Correction="sample text" />
<CCCStmts Correction="sample text" />
<CCCStmts Correction="sample text" />
<CCCStmts Correction="sample text" />
<CCCStmts Correction="sample text" />
<CCCStmts Correction="sample text" />
<RoAmts PayType="x" AmtType="x" TotalAmt="x" />

Regex is not the answer. You could do it with regex, of course, but I assume you have used an API to represent the XML programmatically in a model? Or, even if not, that you are parsing it in order to submit the relevant value contained within the XML, to your database. So once you acquire the value, simply test its length then, and submit it if it conforms to the field's requirements.
To check the length of the string, simply use...
// if the length is 2000 or less
if (string.length() < 2001) {
    // your routine
}
... and it will skip over any value that is composed of 2001+ characters.
This approach does not require an additional iteration purely to search, and does not require any replacements to be made. It will be much tidier, and much more efficient.
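If you still want to locate the offending entries in the file itself, a short script can report where they are. This is only a sketch in Python, not something the question or the answer above used: the function name find_long_spans is made up for illustration, and it assumes a "span" means everything from the first "<CCCStmts Correction" of a group up to the next "<RoAmts", as described in the question.

```python
import re

LIMIT = 2000  # maximum field length mentioned in the question


def find_long_spans(xml_text, limit=LIMIT):
    """Return (line_number, length) for every span longer than limit.

    A span starts at the first "<CCCStmts Correction" of a group and
    ends just before the next "<RoAmts" (assumption based on the question).
    """
    # DOTALL lets "." cross line breaks; the lookahead stops before "<RoAmts"
    pattern = re.compile(r"<CCCStmts Correction.*?(?=<RoAmts)", re.DOTALL)
    results = []
    for match in pattern.finditer(xml_text):
        length = len(match.group(0))
        if length > limit:
            # 1-based line number of the start of the span
            line_number = xml_text.count("\n", 0, match.start()) + 1
            results.append((line_number, length))
    return results
```

Calling find_long_spans(open("data.xml").read()) (the file name is hypothetical) gives the line numbers to jump to in Notepad++ with Ctrl+G.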

Related

Closing tags formatting in IntelliJ IDEA

I need help setting up a macro, code style, or reformat rule for the code I write (ColdFusion), which is tag based, so that the end of each tag gets formatted.
The CFML code formatter is not doing this. All I want is that, when I format my code, any tag that ends with /> without a space before it (after any character) gets a space inserted before the />.
Example: any code line that ends with )/> or "/> or character/> or capital-letter/> or digit/> or anything/> needs to be changed to ) /> or " /> or character /> or capital-letter /> or digit /> or anything /> respectively.
How do I get this done?

If you are looking for an automatic conversion from <empty-tag/> to <empty-tag />, you can do that in the IntelliJ preferences: Editor->Code Style->XML. Open "Other" and, under the "Spaces" section on the left, check "In empty tag".

I don't think it's possible to configure the IntelliJ formatter to do this. You need to use Find | Replace in Path... with a regular expression.
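As a rough idea of what such a regular expression could look like, here is a sketch in Python. The helper name space_before_self_close is made up for illustration. The pattern uses a lookbehind so that a space is only inserted when the character before /> is not already whitespace.

```python
import re


def space_before_self_close(code):
    # Insert a space before "/>" only when the preceding character
    # is not whitespace, so already-spaced tags are left untouched.
    return re.sub(r"(?<=\S)/>", " />", code)
```

In the IDE's replace dialog, the equivalent would presumably be to search for (\S)/> with the regex option enabled and replace with $1 /> but that is an assumption to verify on a small selection first.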

How to get string of everything between these two em tag?

I want to get the string between the em tags, including any other HTML inside them.
for example:
<em>UNIVERSALPOSTAL UNION - International Bureau Circular<br />
By: K.J.S. McKeown</em>
output should be as:
UNIVERSALPOSTAL UNION - International Bureau Circular<br />
By: K.J.S. McKeown
please help me.
Thanks

Use the regular expression function like this:
REMatch("(?s)<em>.*?</em>", html)
See also: http://livedocs.adobe.com/coldfusion/8/htmldocs/help.html?content=regexp_01.html
The (?s) sets single-line mode, so that the dot also matches line feeds and the input text is treated as one line. As Peter pointed out in a comment, this is not the default and therefore must be set.
The .*? matches all characters in between <em> and </em>. The question mark after the quantifier makes it "non-greedy", so that as few characters as possible are matched. This is needed in case the input HTML contains something like <em>foo</em><em>bar</em>, where a greedy match would otherwise run from the first <em> to the last </em>.
The returned array contains all matches found, i.e. every <em> element, with each match including the <em> and </em> tags themselves along with any HTML between them.
Note that this could fail in circumstances where </em> also occurs as attribute text and is incorrectly not HTML-encoded, for example: <em><a title="Help for </em> tag">click</a></em>, or in other rare circumstances (e.g. JavaScript script tags etc.). A regex cannot replace a full HTML/XML parser, and if you need 100% accuracy you should consider using one: http://livedocs.adobe.com/coldfusion/8/htmldocs/help.html?content=functions_t-z_23.html
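To see the effect of the two pieces of the pattern, here is a small sketch in Python rather than ColdFusion (an illustration only; re.DOTALL plays the role of (?s), and the sample string is made up):

```python
import re

html = "<em>foo</em><em>line one<br />\nline two</em>"

# Non-greedy with DOTALL: one match per <em> element.
lazy = re.findall(r"<em>.*?</em>", html, flags=re.DOTALL)

# Greedy with DOTALL: a single match from the first <em> to the last </em>.
greedy = re.findall(r"<em>.*</em>", html, flags=re.DOTALL)

# Without DOTALL, "." stops at the line feed, so the element that
# spans two lines is not matched at all.
no_dotall = re.findall(r"<em>.*?</em>", html)

print(lazy)
print(greedy)
print(no_dotall)
```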

If your input is exactly in the format given above, you don't even need regex - just strip the outer tags:
<cfsavecontent variable="Input">[text from above]</cfsavecontent>
<cfset Output = mid( Input, 5 , len(Input) - 9 ) />
If your input is more than this (i.e. a significant piece of HTML, or a full HTML document), regex is still not the ideal tool - instead, you should be using a HTML parser, such as JSoup:
<cfset jsoup = createObject('java','org.jsoup.Jsoup') />
<cfset Output = jsoup.parse(Input).select('em').html() />
(With CF8, this code requires placing the jsoup JAR file in CF's lib directory, or using a tool such as JavaLoader.)

If you are using jQuery, you can also do this pretty easily.
$("em").html();
Will return all html between the em tags.
See this fiddle

I had to remove any text that followed a particular tag. The HTML content was generated dynamically from a database that caters to 5 different languages, so I only had the div tag to help me. I am not sure why REMatch("(?s).*?", html) did not work for me. However, Ben helped me here (http://www.bennadel.com/blog/769-Learning-ColdFusion-8-REMatch-For-Regular-Expression-Matching.htm). My code looks like this:
<cfset extContentArr = REMatch("(?i)<div class=""inlineBlock"" style=""margin-right:30px;"">.+?</div>",qry_getContent.colval) />
<cfif !ArrayIsEmpty(extContentArr)>
<!--- Loop over the array and do whatever you need with the extracted blocks; I just deleted them. --->
</cfif>

Trying to bulk remove in notepad++

I am trying to remove some data in about 1500 lines. Here is my problem:
Text<br />random text<br />
Text<br />slight different random text to the above<br />
Text<br />slightly different again random text the above 2<br />
What I need to do is remove
<br />slightly different again random text the above 2<br />
and everything in it. The problem is that the text changes every time. Is there a wildcard that I can use in the replace function?

You can use the "Use regular expressions" option in the Find and Replace dialog. You'll need something like:
<br />.+?<br />
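To preview what that pattern removes before running it on all 1500 lines, the same expression can be tried on a couple of sample lines. This sketch uses Python's re module purely for illustration (the sample lines are taken from the question), with an empty replacement string:

```python
import re

lines = [
    "Text<br />random text<br />",
    "Text<br />slight different random text to the above<br />",
]

# Replace everything from the first <br /> up to the next <br /> with nothing.
cleaned = [re.sub(r"<br />.+?<br />", "", line) for line in lines]

print(cleaned)
```

In Notepad++ the same thing is done by leaving the "Replace with" field empty.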

Try this regular expression in the find and replace dialog box:
(<br\s*\/?>\s*)+[a-zA-Z]*(<br\s*\/?>\s*)+

Xslt: Embedding an image in a RSS feed

I am using Umbraco and I need to display an image in a Rss Feed. The feed is generated by Xslt.
Everything works as long as I output plain text. Embedding an image is technically feasible, since I have seen it done in a feed, but the feed I analyzed had been generated by WordPress.
The challenge is that I have no idea how I can embed <img> within my <content:encoded> tag.
I have a variable, say "url", that returns the full URL of the underlying image. How can I insert it within <content:encoded>? Remember that I am using XSLT to achieve the task.
<content:encoded>
<img src="{$url}" />
</content:encoded>
I guess that CDATA must be used, but I am not able to correctly escape the illegal characters :(
Thanks for your help.
Roland

roland, you're trying to escape things twice. It's unnecessary (not to mention hideous!) This page shows:
<content:encoded><![CDATA[This is <i>italics</i>.]]></content:encoded>
I.e. they're just escaping the markup inside the <content:encoded> once, and they use CDATA to do that. In your case, CDATA is awkward because you need to substitute $url in the middle. So you could use two CDATA sections wrapped around an <xsl:value-of select="$url" />: (indented for clarity)
<content:encoded>
<![CDATA[<img src="]]>
<xsl:value-of select='$url' />
<![CDATA[">]]>
</content:encoded>
But that would be needlessly verbose. The second CDATA section is unneeded. And we can do better while using the same principle: escape the markup characters (once) that would cause the string to be parsed into a tree. In your case, only the initial < needs to be escaped. You can use &lt; instead of CDATA to escape the <. Put this in your XSLT:
<content:encoded>&lt;img src="<xsl:value-of select='$url' />"></content:encoded>
The <xsl:value-of> is not really inside quotes, from XSLT's perspective... those quotes are just the content of text nodes. The <xsl:value-of> works as a normal XSLT instruction.
Change select='$url' to select="concat($siteUrl, photo)" if that's what you need. (I.e. photo is a child element of the context node, and its text value is the name of the image file.)
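The "escape once" principle can be checked outside XSLT as well. The following is a sketch in Python, not part of the stylesheet: the element name encoded is a stand-in for <content:encoded>, and the URL is hypothetical. It stores the image markup as plain text, lets the serializer escape it once, and then confirms that a consumer parsing the feed gets the original markup back.

```python
import xml.etree.ElementTree as ET

url = "http://example.com/photo.jpg"  # hypothetical image URL

item = ET.Element("encoded")  # stand-in for <content:encoded>
item.text = '<img src="' + url + '">'  # markup stored as plain text

# Serializing escapes the markup characters exactly once.
serialized = ET.tostring(item, encoding="unicode")

# A feed reader that parses the XML sees the original markup again.
round_trip = ET.fromstring(serialized).text

print(serialized)
print(round_trip)
```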

XSL disable-output-escaping removes whitespaces

Part of the XML:
<text><b>Title</b> <b>Happy</b></text>
In my XSL I have:
<xsl:value-of select="text" disable-output-escaping="yes" />
My output becomes
**TitleHappy**
My spacing went missing - there's supposed to be a space between </b> and <b>.
I tried normalize-space(), it doesn't work.
Any suggestions? Thanks!

If you want whitespace from an XSL stylesheet, use:
<xsl:text> </xsl:text>
Whitespace is only preserved if it is recognized as a text node (e.g. in " a " both spaces will be recognized).
Whitespace from the original source XML has to be preserved by telling the parser, for example:
parser.setPreserveWhitespace(true);
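To illustrate what "recognized as a text node" means, here is a sketch using Python's xml.etree.ElementTree, which is an assumption for demonstration only and not the parser or XSLT processor from the question. This parser keeps the whitespace-only text between the two <b> elements, so the space survives when the text is read back:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<text><b>Title</b> <b>Happy</b></text>")

# ElementTree stores the text that follows an element in its .tail,
# so the single space between </b> and <b> lives on the first <b>.
space_node = root[0].tail

# Concatenating all text nodes keeps the space.
full_text = "".join(root.itertext())

print(repr(space_node))
print(full_text)
```

A processor that strips whitespace-only text nodes would drop that node, which would match the missing space described in the question.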

As you're outputting HTML, you could substitute your space with a non-breaking space.

Do you have any control over the generation of the original XML? If so, you could try adding xml:space="preserve" to the text element which should tell the processor to keep the whitespace.
<text xml:space="preserve"><b>Title</b> <b>Happy</b></text>
Alternatively, try looking at the "xsl:preserve-space" element in XSLT.
<xsl:preserve-space elements="text"/>
Although I have never used this personally, it might be of some help. See W3Schools for more information.

Thank you, everyone, for your input.
Currently I am using MattH's suggestion, which is to test for the space and substitute a non-breaking space. Another method I thought of is to test for "</b> <b>" and substitute " </b><b>". A space contained within the bold tags is actually output. Both methods worked. I don't know what the implications are, though. And I still can't figure out why the spacing is removed when it is found between two separate bold tags.
