XSL-FO / Apache FOP break line at comma,semicolon etc - line-breaks

I haven't been able to find an answer to this question about line breaks:
How can I tell the XSL-FO processor (I am using Apache FOP) to break long lines not only at whitespace but also at commas, semicolons, minus signs, maybe backslashes, etc.?
For example, if I have a text like "Hello I am using Apache FOP" and there is not enough room, it breaks just fine at one of the spaces; but a text like "one,two,three,four,five,six" won't break.

Insert the zero-width space character (U+200B) after each character where a break should be allowed. How you insert it into your output depends on which XSLT version you are using: a regex replace (XSLT 2) or a recursive template (XSLT 1) would be the likely approaches, but you have not provided details on how the content is derived and processed.
In your example, you would have ...
one,​two,​three,​four,​five,​six
This would insert breaking spaces after the commas in your list.
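To illustrate the regex-style replacement, here is a minimal Python sketch of the same idea as a pre-processing step. It is not XSLT and not part of FOP; the function name add_break_opportunities and the default set of break characters are assumptions made up for this example. In a stylesheet the equivalent would be a regex replace (XSLT 2) or a recursive template (XSLT 1).

```python
import re

ZWSP = "\u200b"  # ZERO WIDTH SPACE: invisible, but permits a line break


def add_break_opportunities(text: str, break_chars: str = ",;-\\") -> str:
    """Append a zero-width space after every character listed in break_chars."""
    # re.escape keeps '-' and '\' literal inside the character class.
    char_class = "[" + re.escape(break_chars) + "]"
    return re.sub(char_class, lambda m: m.group(0) + ZWSP, text)


if __name__ == "__main__":
    result = add_break_opportunities("one,two,three,four,five,six")
    # Show the invisible characters as \u200b escapes.
    print(result.encode("unicode_escape").decode("ascii"))
```

The break characters are a parameter so that backslashes or other separators can be added or left out depending on the content.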

Related

Regex: Replace double double quotes (solved), but only in lines that contain a special string (subcondition unsolved)

1. Summary of the problem
I have a csv file where I want to replace normal quotes in text with typographic ones.
It was hard (because HTML is also included), but I have meanwhile created a good regex expression that does just the right thing: in three "capturing groups" I find the left and right quotation marks and the text inside. Replacing then is a piece of cake.
2. Regex engine
I can use the regex engine of Notepad++ (Boost) or anything PCRE2-compatible; for development and testing I have used https://regex101.com.
3. What I'm having a hard time with and just can't get right, where I need your help is here:
I want to add a subcondition so that the text in quotes is found only in certain lines. I want to identify these lines by the language, e.g. ENGLISH or FRENCH (see also the example in the screenshot).
Screenshot of a sample
The string indicating the language is always in the same line before the text to be found, BUT only the text in quotes (main condition) should be marked after matching the sub condition, so that I will be able to replace them.
It is about a few thousand records in the csv file, in the worst case I could also replace it manually. But I'm pretty sure that this should also work via regex.
4. What I have tried
Different approaches with look arounds and non-capturing groups didn't lead me to the desired result - possibly because I didn't really understand how they work.
An example can be found here: https://regex101.com/r/ketwwm/1
It contains only the regex that matches and marks the (three) groups, WITHOUT the subcondition I am looking for:
("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))
Hopefully anyone in the community could help? (Hopefully I have not missed anything, it's my first post here )
5. Update 03/18/2022: Almost resolved with two slightly different approaches (thank you all!) What is still unsolved ..
Solution of @Thefourthbird (see answer 1)
^(?!.*?"ENGLISH")[^"]*".*(*SKIP)(*F)|("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))
Nearly perfect, just missing matches in an HTML section. HTML sections in the csv file are always enclosed by double quotes and may have line feeds (LF). https://regex101.com/r/x5shnx/1
Solution of @Wiktor Stribiżew (see in comments below)
^.*?"ENGLISH".*?\K("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))
The same with matches in HTML sections, see above. Plus: Doesn't match text in double double quotes if more than one such entry occurs within a text. https://regex101.com/r/I4NTdb/1
Screenshot (only to illustrate)
If you want to match multiple occurrences, you can use (*SKIP)(*F) to skip every line that does not start with "FRENCH":
^"(?!FRENCH")[^"]*".*(*SKIP)(*F)|("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))
The pattern matches:
^ Start of string
" Match literally
(?!FRENCH") Negative lookahead, assert not FRENCH" directly to the right
[^"]*" Match any char except " and match "
.*(*SKIP)(*F) Match the rest of the line and skip it
| Or
("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$))) Your current pattern
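For anyone scripting this instead of using Notepad++: Python's built-in re module does not support (*SKIP)(*F), so the sketch below swaps in a plain line filter that does the same job, namely leaving non-matching lines alone and applying the quote pattern to the rest. The function names, the language parameter and the assumption that the language marker is the first quoted field of each line are all made up for this example.

```python
import re

# The quote pattern from the question: ""text"" that is not inside an HTML tag.
QUOTED = re.compile(r'("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))')


def curl_quotes(line: str, language: str = "ENGLISH") -> str:
    """Replace ""text"" with typographic quotes, only on lines for `language`."""
    # Assumption: the language marker is the first quoted field of the line.
    if not line.startswith(f'"{language}"'):
        return line  # plays the role of (*SKIP)(*F): leave the line untouched
    return QUOTED.sub(lambda m: "\u201c" + m.group(2) + "\u201d", line)


def curl_quotes_in_text(text: str, language: str = "ENGLISH") -> str:
    """Apply curl_quotes to every line of a multi-line string."""
    return "\n".join(curl_quotes(line, language) for line in text.split("\n"))
```

Because it works line by line, this sketch does not handle HTML sections that span several lines.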

Removing newline from text within tags

I would like to remove all the newlines within a specific html-tag that contains a block of text.
I'm sure this is basic stuff, but I have no experience with regex, so any help would be welcome.
Thanks
You haven’t specified your language, so I’ll just give you the regex (no code):
\n(?=[^<>]*</)
Replace all matches with a blank (to “delete” them).
This assumes well-formed markup, as in XML or XHTML.
It works by requiring any matched newline to be followed by characters such that the next angle bracket encountered is a closing tag.
It’s not bulletproof, but will probably work for most cases, and hopefully your case.
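Since no language was given, here is a minimal Python sketch of that replacement, just to show the pattern in action. The function name is made up for the example.

```python
import re


def strip_newlines_before_closing_tag(markup: str) -> str:
    """Delete each newline whose next angle bracket starts a closing tag."""
    return re.sub(r"\n(?=[^<>]*</)", "", markup)


if __name__ == "__main__":
    sample = "<p>line one\nline two</p>\n<p>next</p>"
    print(strip_newlines_before_closing_tag(sample))
```

Note that a newline sitting directly before a closing tag of an outer element is removed as well, which is one of the ways in which the pattern is not bulletproof.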
I guess you want to do this:
str.replace(/<(html|div)>(.*)\n+(?=[\s\S]*<\/\1>)/g, "<$1>$2 ")
This regex targets the html and div tags; you can add more by extending the alternation, e.g. (html|div|p|input|html6tag).
However, each pass removes only one run of newlines per opening tag, so you have to rerun the replacement until no more matches are found.
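The "run it until nothing changes" part can be sketched as a loop. This is a Python translation of the pattern, not the original JavaScript, and the function name is an assumption for the example; re.subn is used because it returns the number of replacements made.

```python
import re

# Python spelling of the pattern above; the tag list is only an example.
PATTERN = re.compile(r"<(html|div)>(.*)\n+(?=[\s\S]*</\1>)")


def join_lines_in_tags(markup: str) -> str:
    """Repeat the replacement until a pass makes no change."""
    while True:
        markup, count = PATTERN.subn(r"<\1>\2 ", markup)
        if count == 0:
            return markup


if __name__ == "__main__":
    print(join_lines_in_tags("<div>a\nb\nc</div>"))
```

Each successful pass removes at least one newline, so the loop always terminates.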

How to search and replace from the last match of a until b?

I have a latex file in which I want to get rid of the last \\ before a \end{quoting}.
The section of the file I'm working on looks similar to this:
\myverse{some text \\
some more text \\}%
%
\myverse{again some text \\
this is my last line \\}%
\footnote{possibly some footnotes here}%
%
\end{quoting}
over several hundred lines, covering maybe 50 quoting environments.
I tried with :%s/\\\\}%\(\_.\{-}\)\\end{quoting}/}%\1\\end{quoting}/gc but unfortunately the non-greedy quantifier \{-} is still too greedy.
It matches from the second line of my example to the end of the quoting environment; I guess the greedy quantifier would match up to the last \end{quoting} in the file. Is there a way to do this with search and replace, or should I write a macro for this?
EDIT: my expected output would look something like this:
this is my last line }%
\footnote{possibly some footnotes here}%
%
\end{quoting}
(I should add that I've by now solved the task by writing a small macro, still I'm curious if it could also be done by search and replace.)
I think you're trying to match from the last occurrence of \\}% before \end{quoting} up to that \end{quoting}. In that case you don't really want any character (\_.); you want "any character that isn't \\}%" (yes, I know that's not a single character, but that's basically it).
So, simply (ha!) change your pattern to use \%(\%(\\\\}%\)\@!\_.\)\{-} instead of \_.\{-}; this means that the matched text cannot contain another \\}% sequence, thus achieving your aims (as far as I can determine them).
This uses the negative zero-width look-ahead \@! to ensure that each character matched by \_. is not the start of the text we want to avoid (anything else still matches). See :help /zero-width for more of these.
I.e. your final command would be:
:%s/\\\\}%\(\%(\%(\\\\}%\)\@!\_.\)\{-}\)\\end{quoting}/}%\1\\end{quoting}/g
(I note your "expected" output does not contain the first few lines for some reason, were they just omitted or was the command supposed to remove them?)
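The same "any character, as long as the unwanted sequence does not start here" trick works in other regex flavours too. Below is a rough Python equivalent of the command, offered only as a sketch of the technique and not as a replacement for the Vim one-liner; the names LAST_BREAK and drop_last_linebreak are made up for the example.

```python
import re

LAST_BREAK = re.compile(
    r"\\\\\}%"                     # literal \\}%
    r"((?:(?!\\\\\}%)[\s\S])*?)"   # any characters, none of which starts another \\}%
    r"\\end\{quoting\}"            # literal \end{quoting}
)


def drop_last_linebreak(tex: str) -> str:
    """Remove the \\\\ of the last \\\\}% before each \\end{quoting}."""
    # A function replacement is used so the backslashes are taken literally.
    return LAST_BREAK.sub(lambda m: "}%" + m.group(1) + r"\end{quoting}", tex)


if __name__ == "__main__":
    sample = (
        "some more text \\\\}%\n"
        "this is my last line \\\\}%\n"
        "\\footnote{possibly some footnotes here}%\n"
        "\\end{quoting}"
    )
    print(drop_last_linebreak(sample))
```

The lookahead is what stops the match from starting at an earlier \\}%: from there the lazy run would have to cross the next \\}% to reach \end{quoting}, which it is not allowed to do.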
You’re on the right track using the non-greedy multi. The Vim help files
state that,
"\{-}" is the same as "*" but uses the shortest match first algorithm.
However, the very next line warns of the issue that you have encountered.
BUT: A match that starts earlier is preferred over a shorter match: "a\{-}b" matches "aaab" in "xaaab".
To the best of my knowledge, your best solution would be to use the macro.
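The "earlier start beats shorter match" rule is not specific to Vim. As a small illustration, here is the help file's example reproduced with Python's lazy quantifier *? (the helper name first_match is made up for this sketch).

```python
import re

LAZY = re.compile(r"a*?b")  # lazy counterpart of Vim's a\{-}b


def first_match(text: str, start: int = 0) -> str:
    """Return the first match at or after `start`, or '' if there is none."""
    m = LAZY.search(text, start)
    return m.group(0) if m else ""


if __name__ == "__main__":
    print(first_match("xaaab"))     # prints: aaab
    print(first_match("xaaab", 3))  # prints: ab
```

The shorter match "ab" is only found when the search is forced to start later, which is exactly why the original pattern grabbed the earlier \\}%.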

Hoping that a regex can do this. Fixing broken XML

I have a large XML file that I now want to parse. The XML is fundamentally broken, and with over 2000 lines, I'm trying to avoid a hand cranked fix ;)
Can I use regex replace in Notepad++ to do this?
<Sensor ID="21.1.1_L"/>
to
<Sensor ID="21.1.1_L">
losing the closing slash in all "Sensor" tags (bearing in mind that I cannot simply replace /> with > everywhere, and that the ID is variable, including its length, and may or may not have the trailing underscore and letter).
Thanks for any suggestions.
This should work: Search for
(<Sensor [^<>]*)/>
and replace all with
\1>
[^<>]* will match any number of characters except angle brackets (this is to make sure that we can never match across a tag's boundary). Then, /> matches only if the current tag ends with a slash.
You will need to turn on regex matching in Notepad++, of course.
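If the file ever needs to be fixed from a script rather than in Notepad++, the same search and replace can be sketched in Python as follows. The function name is an assumption for the example.

```python
import re


def open_sensor_tags(xml: str) -> str:
    """Turn self-closing <Sensor .../> tags into opening <Sensor ...> tags."""
    # [^<>]* cannot cross a tag boundary, so only Sensor tags are affected.
    return re.sub(r"(<Sensor [^<>]*)/>", r"\1>", xml)


if __name__ == "__main__":
    print(open_sensor_tags('<Sensor ID="21.1.1_L"/>'))
```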

Need assistance regex matching a single quote, but do not include the quote in the result

I'm trying to find out a way to match the following test string:
token = '1866FB352F4DF76BCB92C3482DB7D7B4F562';
The data I want returned is...
1866FB352F4DF76BCB92C3482DB7D7B4F562
I've tried the following; the closest I have is this, but it includes the single quote at the end:
(?!token = ')(\w+)';
Another attempt comes close, but it also includes the last single quote:
'([^']+)'
Anyone want to take a stab at this?
Update: After looking at what I need to parse, I found the same value in the html, in the form area, which looks like it might be easier to grab:
name="token" value="482CD1FE037F68D5A36F4C961A6D57D9"
Again, I just need the contents within value="*"
However, the regex will have to parse the entire HTML source, so I assume I will need to search for name="token" value= but not include that in the result set.
If your regex engine supports lookaround, you can use
(?<=')\w+(?=')
This matches an alphanumeric word if it's surrounded by single quotes, without making those quotes a part of the actual match. If you only want to match hexadecimal numbers, use
(?i)(?<=')[0-9A-F]+(?=')
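As a quick check of both patterns, here is a minimal Python sketch (Python's re supports lookbehind and lookahead). The function names are made up for the example, and re.IGNORECASE stands in for the inline (?i) flag.

```python
import re
from typing import Optional


def quoted_word(text: str) -> Optional[str]:
    """Return the first word enclosed in single quotes, without the quotes."""
    m = re.search(r"(?<=')\w+(?=')", text)
    return m.group(0) if m else None


def quoted_hex(text: str) -> Optional[str]:
    """Same, but restricted to hexadecimal digits (case-insensitive)."""
    m = re.search(r"(?<=')[0-9A-F]+(?=')", text, re.IGNORECASE)
    return m.group(0) if m else None


if __name__ == "__main__":
    print(quoted_word("token = '1866FB352F4DF76BCB92C3482DB7D7B4F562';"))
```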
EDIT:
Since you have now added that you're using JMeter, and because JMeter doesn't support lookbehind assertions for reasons incomprehensible to me (because Java itself does support it just fine), you can possibly cheat like this:
\b[0-9A-F]+(?=')
This only checks whether an entire hex number occurs right before a ' character. It does not check for the presence of an opening quote, but chances are that this won't matter.
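To make that caveat concrete, here is a small Python sketch of the lookbehind-free pattern. It is only an illustration of the regex and says nothing about JMeter's own behaviour; the helper name is made up.

```python
import re
from typing import Optional

HEX_BEFORE_QUOTE = re.compile(r"\b[0-9A-F]+(?=')")


def hex_before_quote(text: str) -> Optional[str]:
    """Return the first run of uppercase hex digits directly followed by a quote."""
    m = HEX_BEFORE_QUOTE.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    print(hex_before_quote("token = '1866FB352F4DF76BCB92C3482DB7D7B4F562';"))
    # No opening quote here, yet the pattern still matches:
    print(hex_before_quote("id = DEAD'"))
```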