Regex (grep) Backward match - regex

I need a regex that matches the closest [A-Za-z_0-9]*\.xml before ERROR.
For the following input, it should match match_me.xml.
ERROR should be located first.
The first [A-Za-z_0-9]*\.xml before ERROR should then be matched. -> match_me.xml
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy
eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam
voluptua. At vero eos et accusam et justo duo
match_not_me.xml
dolores et ea rebum.
match_me.xml
Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna
ERROR
aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

You can use awk for this:
awk 'p && /ERROR/{print p; p=""} /^[A-Za-z_0-9]*\.xml$/{p=$0}' file
match_me.xml
When a line matches our pattern, we store that line in the variable p.
When p is set and we encounter a line containing ERROR, we print p and reset it to the empty string.

grep (PCRE) solution:
grep -Poz 'ERROR[\s\S]+?\s\K[A-Za-z_0-9]+\.xml' <(tac file) && echo
tac file - prints the lines of the file in reverse order
[\s\S]+? - matches any character, including newlines, in a "non-greedy" manner
\K - discards everything matched so far, so only the text after it is reported
The output:
match_me.xml
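For comparison, the same "remember the last candidate, report it when ERROR shows up" logic can be sketched in Python. This is only an illustration and not part of either answer: the function name last_xml_before_error is made up here, and the sketch assumes, like the awk version, that the file name stands alone on its line.

```python
import re

# Same character class as in the question, but requiring at least one
# character before ".xml" (an assumption; the question uses "*").
XML_LINE = re.compile(r"[A-Za-z_0-9]+\.xml")

def last_xml_before_error(lines):
    """Return, for every ERROR line, the closest preceding *.xml line."""
    results = []
    candidate = None
    for line in lines:
        stripped = line.strip()
        if candidate is not None and "ERROR" in stripped:
            results.append(candidate)
            candidate = None  # reset, as the awk one-liner does with p=""
        elif XML_LINE.fullmatch(stripped):
            candidate = stripped  # a later match overwrites an earlier one
    return results

sample = """voluptua. At vero eos et accusam et justo duo
match_not_me.xml
dolores et ea rebum.
match_me.xml
Stet clita kasd gubergren, no sea takimata sanctus est
ERROR
aliquyam erat, sed diam voluptua."""

print(last_xml_before_error(sample.splitlines()))  # ['match_me.xml']
```

Because the candidate is overwritten on every matching line, only the nearest file name above ERROR survives, which is what "closest" asks for.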


Extract specific string via regex from textfile

I have a text file with a lot of content. I want to extract the following text fragments beginning with TXT_ and ending with a ).
e.g.
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. (TXT_I_WANT_TO_EXTRACT_THIS). At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. (TXT_AND_THIS) Lorem ipsum dolor sit amet.
Expected result:
TXT_I_WANT_TO_EXTRACT_THIS
TXT_AND_THIS
I just need the regex for the result.
Thank you so much for your help.
Greetings
Not sure in which language you are working. In R, for example, you can use str_extract_all:
library(stringr)
str_extract_all(txt, "TXT\\w+")
[[1]]
[1] "TXT_I_WANT_TO_EXTRACT_THIS" "TXT_AND_THIS"
Even if you don't work in R, the pattern used in the solution will not change greatly; it is in fact simple: supposing that all target strings start with the same literal pattern, say "TXT", and only contain alphabetic characters and the underscore, the parts after "TXT" can conveniently be matched by \\w+, a character class for alphanumeric characters and the underscore.
txt <- "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. (TXT_I_WANT_TO_EXTRACT_THIS). At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. (TXT_AND_THIS) Lorem ipsum dolor sit amet."
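To back up the claim that the pattern carries over to other languages almost unchanged, here is a minimal Python sketch using the standard re module. The shortened sample text is an assumption for brevity; only the two parenthesised fragments matter.

```python
import re

txt = ("Lorem ipsum dolor sit amet. (TXT_I_WANT_TO_EXTRACT_THIS). "
       "At vero eos et accusam. (TXT_AND_THIS) Lorem ipsum dolor sit amet.")

# \w+ matches letters, digits and the underscore, so each match
# stops right before the closing parenthesis.
matches = re.findall(r"TXT_\w+", txt)
print(matches)  # ['TXT_I_WANT_TO_EXTRACT_THIS', 'TXT_AND_THIS']
```

In Python a raw string (r"...") avoids the doubled backslash that R needs in "TXT\\w+".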

Xml-Fo to keep justified text with same number of words in line as in left

I want to produce justified text with Apache FOP that has the same number of words per line as the left-aligned text.
The goal is for the justified text to have exactly the same words on each line as the text with align="left".
I would try a different formatter. If anything, one would assume that the justified text would yield fewer or the same number of words per line, not more.
I used this sample:
<fo:flow flow-name="xsl-region-body">
<fo:block-container>
<fo:block>Dolores vero dolor sed accusam ipsum tempor justo ipsum tempor lorem vel
ipsum no. Ipsum dolor eleifend lorem ipsum dolor in takimata sit dolor. Sit et
te te duo diam diam stet no assum invidunt illum lorem no. Augue amet clita no
sit tempor nonumy nulla gubergren elitr liber magna ut facilisis eum et eos.
Tempor magna accusam elitr sit voluptua et aliquyam ut consectetuer lorem et
volutpat stet diam dolor voluptua consetetur amet. Eos magna stet. Ea dolore
adipiscing eos et dolor dolore et magna nulla eleifend. Sed est at molestie
dolore amet sit accusam nulla consetetur vero mazim sed kasd lobortis dolore.
Amet ipsum sanctus duo nibh vero invidunt quis clita at dolores nonumy eum kasd
vero illum eum eos est.</fo:block>
</fo:block-container>
<fo:block space-after="12pt"><fo:leader/></fo:block>
<fo:block-container text-align="justify">
<fo:block>Dolores vero dolor sed accusam ipsum tempor justo ipsum tempor lorem vel
ipsum no. Ipsum dolor eleifend lorem ipsum dolor in takimata sit dolor. Sit et
te te duo diam diam stet no assum invidunt illum lorem no. Augue amet clita no
sit tempor nonumy nulla gubergren elitr liber magna ut facilisis eum et eos.
Tempor magna accusam elitr sit voluptua et aliquyam ut consectetuer lorem et
volutpat stet diam dolor voluptua consetetur amet. Eos magna stet. Ea dolore
adipiscing eos et dolor dolore et magna nulla eleifend. Sed est at molestie
dolore amet sit accusam nulla consetetur vero mazim sed kasd lobortis dolore.
Amet ipsum sanctus duo nibh vero invidunt quis clita at dolores nonumy eum kasd
vero illum eum eos est.</fo:block>
</fo:block-container>
</fo:flow>
I formatted it with Apache FOP and it yields this (replicating what you get). Note that the justified text actually has more words per line, which suggests to me that justification is changing letter and word spacing so much that more words fit on a line.
I formatted it with RenderX XEP and it yields this, which is what you expect. The justification just increases letter and word spacing in a pleasing manner so that the same number of words is on each line.
I would note that in my (totally biased, as I work for RenderX) opinion, the original left-justified text from RenderX is much better. The ragged right edges are much tighter, and it is already fitting content even with left justification. RenderX does not merely go word/space/word/space ... and break because a word does not fit on the line. RenderX has an adjustment for line tightness that will already squeeze to fit, even with left-justified text, if a word's length at the line end is within a certain threshold. FOP does not do this, or it is not apparent that it does.
I tried modifying things like letter and word spacing with no success. Not sure you can report this as a "bug" because it really isn't. It is merely a difference in the core engine behavior.
FOP may have implemented TeX line breaking too literally. I haven't seen a way to stop it from shrinking white-space.
FOP implements the TeX line-breaking algorithm, as does AH Formatter (https://www.antenna.co.jp/AHF/help/en/ahf-tech.html#line-breaking). The TeX algorithm (http://www.eprg.org/G53DOC/pdfs/knuth-plass-breaking.pdf) allows spaces to have both 'stretchability' and 'shrinkability'.
word-spacing (https://www.w3.org/TR/xsl11/#word-spacing) is the XSL property that, unsurprisingly, controls the spacing between words. CSS also has word-spacing, but the XSL word-spacing can be a <space>, with .minimum, .optimum, and .maximum components. As I read the word-spacing definition, the default normal value means that a space should be the width of a space.
text-align="justify" is defined as expanding the contents (https://www.w3.org/TR/xsl11/#text-align).
So as I read the spec, text-align="justify" should give the same words per line as text-align="start" (the XSL default value), as in the Antenna House sample below and @KevinBrown's sample. Squeezing in more words per line should require a negative word-spacing.minimum value, as below.

Is there an option to generate dynamically n tests in Spock

I have a file with n lines.
In my Spock test, I download, read the file, and assert each line of it.
Is there a way to produce n tests in the report instead of just one?
Maybe you know how to @Unroll Spock tests and feature method names like this:
package de.scrum_master.stackoverflow.q63002164
import spock.lang.Specification
import spock.lang.Unroll
class FixedInputBasedParametrisedTest extends Specification {
  @Unroll
  def "verify #inputLine"() {
    expect:
    inputLine.contains("et")
    where:
    inputLine << ["weather", "whether", "getters & setters"]
  }
}
The result when running the test e.g. in IntelliJ IDEA looks like this:
But you can also use dynamic data providers, not just fixed sets of values. They just need to be Iterable.
If for example you have a resource file like this
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy
eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam
voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet
clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit
amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam
nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,
sed diam voluptua. At vero eos et accusam et justo duo dolores et ea
rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem
ipsum dolor sit amet.
you can use it as a data provider like this:
package de.scrum_master.stackoverflow.q63002164
import spock.lang.Specification
import spock.lang.Unroll
class InputFileBasedParametrisedTest extends Specification {
  @Unroll
  def "verify #inputLine"() {
    expect:
    inputLine.contains("et")
    where:
    // readLines() returns a List<String>, i.e. one test iteration per line
    inputLine << new File("src/test/resources/test.txt").readLines()
  }
}
The result will look like this:

Regular Expression for length that validate word with length but not email

I need a regular expression to validate a large text (max 2000 characters), it should work as follows:
Say n=20
If there is any word in the text with more than n letters, the whole text should fail validation.
If there is an email address in the text, say emailaddress@emailserver.com, then it should ignore that.
If the email address is like 123456789012345678901@email.com (in this case the part before the @ is longer than 20 characters), the whole text should fail validation.
I'm using ^(?!.*\S{10}).*$ as of now but it does not validate the email.
I racked my brain for a good ten minutes with no luck, I think it must have been some syntax I have yet to learn.
All suggestions would be greatly appreciated, many thanks!
I would use the following pattern:
\b[^\s@]{30,}\b
This pattern matches any word of 30 or more characters, so the text is valid only if the pattern finds nothing.
It will handle e-mail addresses as two words (one word before and one word after the @).
var n = 30;
// backslashes are doubled because the pattern is built from a string; g = global search
var regex = new RegExp('\\b[^\\s@]{' + n + ',}\\b', 'g');
var text = document.getElementById('text').innerHTML;
var match = text.match(regex);
if(match) {
console.log("Document contains a value with over " + n + " characters.");
console.log(match);
}
else {
console.log("Document does not contain a value with over " + n + " characters.");
}
<div id="text">
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata
sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum.
Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos
et accusametjustoduodoloreseterebumstet@clita.kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla
facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet,
</div>
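The same check can be sketched outside the browser. The Python version below reuses the answer's pattern; the function name too_long_words is hypothetical, and n=20 is taken from the question rather than the 30 used in the snippet above.

```python
import re

def too_long_words(text, n):
    """Return every run of n or more characters that contains neither
    whitespace nor '@'; an empty list means the text passes."""
    pattern = re.compile(r"\b[^\s@]{%d,}\b" % n)
    return pattern.findall(text)

ok = "mail me at emailaddress@emailserver.com please"
bad = "mail 123456789012345678901@email.com"

print(too_long_words(ok, 20))   # []
print(too_long_words(bad, 20))  # ['123456789012345678901']
```

Splitting at the @ is what lets a normal address pass: each half is measured separately, so only an overlong local part or domain is reported.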

procmail regex filter mails containing a list specific word patterns

Is it possible to apply a regex in procmail that filters for specific word patterns?
For example I could do this with a normal regex:
/(?=.*dolor)(?=.*ipsum)(?=.*sit)/s
This would produce a match with the following text, whereas this wouldn't:
/(?=.*money)(?=.*ipsum)(?=.*sit)/s
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam
nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,
sed diam voluptua. At vero eos et accusam et justo duo dolores et ea
rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem
ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur
sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et
dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam
et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea
takimata sanctus est Lorem ipsum dolor sit amet.
I would like to adapt this for procmail use, and even extend it so that instead of just searching for "money" it would also match "mOney", "möney", "móney" and so on.
Is it possible?
If so, how?
Yes, it is possible. Let me show you how.
Your regex checks whether the words dolor, ipsum and sit appear, in any order, somewhere within the text. The following procmail recipe does the same:
:0 B
* -2^0
* 1^0 \<dolor\>
* 1^0 \<ipsum\>
* 1^0 \<sit\>
action_dolor_ipsum_sit
The first condition contains an empty regular expression which, because it always matches, is used to give your score a negative offset. A match of each of the next rules will increase that score by one (regardless of how often each word occurs). At the end, the score will only be positive (and therefore trigger the action) if the text contains all 3 words at least once.
To add more keywords, you could either add more rules (and decrease the negative offset accordingly) or extend an existing rule, e.g. like this
* 1^0 \<(mOney|möney|móney)\>
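The scoring arithmetic is easy to check outside procmail. The Python sketch below only mimics the idea of the recipe (start at the negative offset, add one per keyword that occurs at least once); it is not how procmail is implemented, the function name scores_positive is made up, and \b is merely a rough stand-in for procmail's \< and \> word boundaries.

```python
import re

def scores_positive(body, keywords, offset):
    """Start at `offset` and add 1 for each keyword found at least once;
    the 'recipe' fires only if the final score is positive."""
    score = offset
    for word in keywords:
        # Repeated occurrences do not add more, mirroring the 1^0 weights.
        if re.search(r"\b%s\b" % re.escape(word), body):
            score += 1
    return score > 0

body = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr"

print(scores_positive(body, ["dolor", "ipsum", "sit"], -2))  # True
print(scores_positive(body, ["money", "ipsum", "sit"], -2))  # False
```

With three keywords the offset must be -2, so that only a full set of matches lifts the score to 1; for k keywords the offset would be -(k - 1).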