I have to format 50k lines of chat logs.
The source file is pure text and looks something like this:
13. Mär. 01:32 - Walter:
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.
13. Mär. 06:15 - Horst:
Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et
dolore magna aliquyam erat, sed diam voluptua.
magna aliquyam erat, sed diam voluptua.
There are only two persons in the whole chat - Walter and Horst.
I need two regular expressions, one that selects all chat text from Walter and one that selects all chat text from Horst.
The regular expression for Walter should select this text from the example:
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.
The regular expression for Horst should select this text from the example:
Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et
dolore magna aliquyam erat, sed diam voluptua.
magna aliquyam erat, sed diam voluptua.
It's important to me to only select the text lines and not the date / time / person line.
UPDATE
First off, thanks for the fast reply. Unfortunately this doesn't solve my problem.
Chat texts have a varying number of lines.
And somehow I cannot get a selection with your example.
I tried it here:
http://regexr.com/39m2a
I tried this instead:
Walter:.\n(.)
This selects Walter: and the first line. Is there a way NOT to select Walter: ?
(I need this to format an Indesign Document using text formats)
These are actually two questions:
1. How to do a match across newlines (asked in the question title)
2. How to do a match that discards the date/time/person (asked in the question body)
I'll answer question 1:
Before doing the match you want to change the line separator/record separator.
This separator is tool dependent (it is not part of the regex language itself). E.g. for awk you can change the RS variable (you can set it to multiple characters, e.g., colon+newline). For GNU grep you can use -z. See longer discussion at
How to find patterns across multiple lines using grep?
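As a rough illustration of matching across newlines while leaving the header line out of the result, here is a Python sketch. The header pattern (day, month abbreviation, time, name) is an assumption inferred from the two sample lines in the question, and messages_from is a hypothetical helper name, not something from the original thread.

```python
import re

# Assumed header shape, e.g. "13. Mär. 01:32 - Walter:" (inferred from the sample).
HEADER = r"^\d{1,2}\. \S+ \d{2}:\d{2} - "


def messages_from(name, text):
    """Return the message bodies of one speaker, without the header lines."""
    pattern = re.compile(
        HEADER + re.escape(name) + r":\n"  # header line of the wanted speaker
        r"(.*?)"                            # message body, may span several lines
        r"(?=" + HEADER + r"\w+:$|\Z)",     # stop before the next header or at the end
        re.DOTALL | re.MULTILINE,
    )
    return [body.rstrip("\n") for body in pattern.findall(text)]


chat = (
    "13. Mär. 01:32 - Walter:\n"
    "Lorem ipsum dolor sit amet.\n"
    "13. Mär. 06:15 - Horst:\n"
    "first line,\n"
    "second line,\n"
    "third line.\n"
)

print(messages_from("Walter", chat))  # ['Lorem ipsum dolor sit amet.']
print(messages_from("Horst", chat))   # ['first line,\nsecond line,\nthird line.']
```

The capture group keeps only the body, so the date/time/person line is matched but not returned. A message line that itself looks like a header would end the body early.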
Here's my solution:
awk '$5~/Walter:$/{p=1} $5!~/Walter:$/&&$5~/:$/{p=0} p'
or
awk -vname=Walter 'match($5,name":$"){p=1} !match($5,name":$")&&$5~/:$/{p=0} p'
To filter out empty and date lines, pipe through
awk '$5!~":$"&&NF>0'
Try it here: http://refiddle.com/1iws
I have modified the regex so that it works on your data. Once again, though, your data isn't well structured, so it's not possible to write a single regex that matches it correctly.
I'm trying to extract some text from a column on a CSV file. Here is an example:
"Lorem ipsum dolor sit amet (2015), consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua (2000)."
I want to get a new column with "amet (2015)" and "aliqua (2000)". This expression gives me the (2015) and (2000): value.find(/\(.*?\)/)
But how can I also get the word before the parentheses?
Here is the regex you are looking for: /\w* \([^\)]*\)/gm
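A minimal Python sketch of the same pattern, only to show what it captures on the sample sentence. The original question uses value.find in a different tool; the variable names here are made up for illustration, and \w+ is used instead of \w* so that a word is required before the parentheses.

```python
import re

cell = (
    "Lorem ipsum dolor sit amet (2015), consectetur adipiscing elit, "
    "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua (2000)."
)

# One word, a space, then a parenthesised group without nested parentheses.
word_with_year = re.compile(r"\w+ \([^)]*\)")

matches = word_with_year.findall(cell)
print(matches)  # ['amet (2015)', 'aliqua (2000)']
```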
I want justified text in Apache FOP to have the same number of words per line as the same text with left alignment.
The goal is to make the justified text to have exactly the same words on each line, as the one with align="left".
I would try a different formatter. If anything, one would assume that the justified text would yield fewer or equal words per line, not more.
I used this sample:
<fo:flow flow-name="xsl-region-body">
<fo:block-container>
<fo:block>Dolores vero dolor sed accusam ipsum tempor justo ipsum tempor lorem vel
ipsum no. Ipsum dolor eleifend lorem ipsum dolor in takimata sit dolor. Sit et
te te duo diam diam stet no assum invidunt illum lorem no. Augue amet clita no
sit tempor nonumy nulla gubergren elitr liber magna ut facilisis eum et eos.
Tempor magna accusam elitr sit voluptua et aliquyam ut consectetuer lorem et
volutpat stet diam dolor voluptua consetetur amet. Eos magna stet. Ea dolore
adipiscing eos et dolor dolore et magna nulla eleifend. Sed est at molestie
dolore amet sit accusam nulla consetetur vero mazim sed kasd lobortis dolore.
Amet ipsum sanctus duo nibh vero invidunt quis clita at dolores nonumy eum kasd
vero illum eum eos est.</fo:block>
</fo:block-container>
<fo:block space-after="12pt"><fo:leader/></fo:block>
<fo:block-container text-align="justify">
<fo:block>Dolores vero dolor sed accusam ipsum tempor justo ipsum tempor lorem vel
ipsum no. Ipsum dolor eleifend lorem ipsum dolor in takimata sit dolor. Sit et
te te duo diam diam stet no assum invidunt illum lorem no. Augue amet clita no
sit tempor nonumy nulla gubergren elitr liber magna ut facilisis eum et eos.
Tempor magna accusam elitr sit voluptua et aliquyam ut consectetuer lorem et
volutpat stet diam dolor voluptua consetetur amet. Eos magna stet. Ea dolore
adipiscing eos et dolor dolore et magna nulla eleifend. Sed est at molestie
dolore amet sit accusam nulla consetetur vero mazim sed kasd lobortis dolore.
Amet ipsum sanctus duo nibh vero invidunt quis clita at dolores nonumy eum kasd
vero illum eum eos est.</fo:block>
</fo:block-container>
</fo:flow>
I format with Apache FOP and it yields this (replicating what you get). Note that the justified text actually yields more words per line, which suggests to me that justification is changing letter and word spacing so much that more words fit on a line.
I format with RenderX XEP and it yields this, which is what you expect. The justification just increases letter and word spacing in a pleasing manner so that the same number of words are on each line.
I would note that in my (totally biased as I work for RenderX) opinion, the original left-justified text by RenderX is much better. The ragged right edges are much tighter and it is already fitting content even in a left-justification. RenderX is not merely word/space/word/space ... break because a word does not fit on the line. RenderX has an adjustment for line tightness that will already squeeze to fit even with left-justified text if a word's length at line end is within a certain threshold. FOP does not do this or it is not apparent that it does.
I tried modifying things like letter and word spacing with no success. Not sure you can report this as a "bug" because it really isn't. It is merely a difference in the core engine behavior.
FOP may have implemented TeX line breaking too literally. I haven't seen a way to stop it from shrinking white-space.
FOP implements the TeX line-breaking algorithm, as does AH Formatter (https://www.antenna.co.jp/AHF/help/en/ahf-tech.html#line-breaking). The TeX algorithm (http://www.eprg.org/G53DOC/pdfs/knuth-plass-breaking.pdf) allows spaces to have both 'stretchability' and 'shrinkability'.
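To make 'shrinkability' concrete, here is a small Python sketch of the adjustment ratio from the Knuth-Plass paper: a line whose natural width exceeds the measure can still be accepted if its spaces may shrink enough. The widths below are made-up numbers for illustration, not values taken from FOP or any other formatter.

```python
def adjustment_ratio(natural, stretch, shrink, target):
    """Knuth-Plass adjustment ratio for one candidate line.

    natural: width of the line with spaces at their optimum size
    stretch/shrink: total amount the spaces may grow or shrink
    target: the available line width (the measure)
    """
    if natural < target:
        return (target - natural) / stretch if stretch > 0 else float("inf")
    if natural > target:
        return (target - natural) / shrink if shrink > 0 else float("-inf")
    return 0.0


def fits(natural, stretch, shrink, target):
    # A ratio below -1 would need more shrinking than the spaces allow.
    return adjustment_ratio(natural, stretch, shrink, target) >= -1


# Hypothetical line that is 4 units too wide for a measure of 100.
print(fits(natural=104, stretch=10, shrink=6, target=100))  # True: spaces can shrink
print(fits(natural=104, stretch=10, shrink=0, target=100))  # False: no shrink allowed
```

With a non-zero shrink the extra word stays on the line, which is one way a justified paragraph can end up with more words per line than the ragged version.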
word-spacing (https://www.w3.org/TR/xsl11/#word-spacing) is the XSL property that, unsurprisingly, controls the spacing between words. CSS also has word-spacing, but the XSL word-spacing can be a <space>, with .minimum, .optimum, and .maximum components. As I read the word-spacing definition, the default normal value means that a space should be the width of a space.
text-align="justify" is defined as expanding the contents (https://www.w3.org/TR/xsl11/#text-align).
So as I read the spec, text-align="justify" should give the same words per line as text-align="start" (the XSL default value), as in the Antenna House sample below and @KevinBrown's sample. Squeezing in more words per line should require a negative word-spacing.minimum value, as below.
I have a String
Lorem ipsum dolor sit amet
*consectetur adipiscing elit
sed do eiusmod tempor incididunt*
ut labore et dolore magna aliqua
Ut enim ad minim veniam.
now I want to select the * content *
This [*](.*?)[*] is my current regex, but it only works when the content is on a single line:
*consectetur adipiscing elit*
How do I make it multiline?
This REGEX worked for me [*]([\\s\\S]*?)[*]
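A short Python sketch of the same idea, for comparison. The doubled backslashes in the answer above are string-literal escaping; in a Python raw string the pattern is written with single backslashes. The variable names are illustrative only.

```python
import re

text = (
    "Lorem ipsum dolor sit amet\n"
    "*consectetur adipiscing elit\n"
    "sed do eiusmod tempor incididunt*\n"
    "ut labore et dolore magna aliqua\n"
    "Ut enim ad minim veniam.\n"
)

# [\s\S] matches any character including newlines; *? keeps the match lazy
# so it stops at the first closing asterisk.
starred = re.findall(r"\*([\s\S]*?)\*", text)
print(starred)  # ['consectetur adipiscing elit\nsed do eiusmod tempor incididunt']

# Equivalent alternative: let "." match newlines via the DOTALL flag.
assert starred == re.findall(r"\*(.*?)\*", text, flags=re.DOTALL)
```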
I need a regex which matches the closest [A-Za-z_0-9]*\.xml before ERROR.
The following input should match match_me.xml
ERROR should be determined.
The first [A-Za-z_0-9]*\.xml before ERROR should be matched. -> match_me.xml
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy
eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam
voluptua. At vero eos et accusam et justo duo
match_not_me.xml
dolores et ea rebum.
match_me.xml
Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna
ERROR
aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
You can use awk for this:
awk 'p && /ERROR/{print p; p=""} /^[A-Za-z_0-9]*\.xml$/{p=$0}' file
match_me.xml
When a line matches our pattern we store that line in variable p
When p is set and we encounter ERROR we print p and reset it to blank.
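The same line-by-line logic can be sketched in Python. This is an illustration of the approach rather than a replacement for the awk one-liner; last_xml_before_error is a hypothetical helper name, and it returns at the first ERROR instead of continuing through the file.

```python
import re

XML_NAME = re.compile(r"[A-Za-z_0-9]*\.xml")


def last_xml_before_error(lines):
    """Return the most recent '*.xml' line seen before the first ERROR line."""
    last = None
    for line in lines:
        stripped = line.strip()
        if XML_NAME.fullmatch(stripped):
            last = stripped          # remember the latest file name
        elif "ERROR" in stripped and last is not None:
            return last              # the closest name before ERROR
    return None


sample = [
    "Lorem ipsum dolor sit amet",
    "match_not_me.xml",
    "dolores et ea rebum.",
    "match_me.xml",
    "Stet clita kasd gubergren",
    "ERROR",
    "aliquyam erat, sed diam voluptua.",
]

print(last_xml_before_error(sample))  # match_me.xml
```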
grep (PCRE) solution:
grep -Poz 'ERROR[\s\S]+?\s\K[A-Za-z_0-9]+\.xml' <(tac file) && echo
tac file - prints the lines of the file in reverse order
[\s\S]+? - matches any characters, including newlines, in a non-greedy manner
\K - drops everything matched so far from the reported match
The output:
match_me.xml
Is it possible to apply a regex in procmail that filters for specific word patterns?
For example I could do this with a normal regex:
/(?=.*dolor)(?=.*ipsum)(?=.*sit)/s
This would produce a match with the following text, whereas this one wouldn't:
/(?=.*money)(?=.*ipsum)(?=.*sit)/s
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam
nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,
sed diam voluptua. At vero eos et accusam et justo duo dolores et ea
rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem
ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur
sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et
dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam
et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea
takimata sanctus est Lorem ipsum dolor sit amet.
I want to adapt this for procmail use, and even extend it so that instead of just searching for "money" it would also match "mOney", "möney", "móney" and so on.
Is it possible?
If so, how?
Yes, it is possible. Let me show you how.
Your regex checks whether the words dolor, ipsum and sit appear in any order somewhere within the text. The following procmail recipe does the same:
:0 B
* -2^0
* 1^0 \<dolor\>
* 1^0 \<ipsum\>
* 1^0 \<sit\>
action_dolor_ipsum_sit
The first condition contains an empty regular expression which, because it always matches, is used to give your score a negative offset. A match of each of the next rules will increase that score by one (regardless of how often each word occurs). At the end, the score will only be positive (and therefore trigger the action) if the text contains all 3 words at least once.
To add more keywords, you could either add more rules (and decrease the negative offset accordingly) or extend an existing rule, e.g. like this
* 1^0 \<(mOney|möney|móney)\>
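The scoring arithmetic of the recipe can be sketched in Python. This only mimics the counting (an offset of -2, plus one per keyword found at least once); it does not reproduce procmail's regex dialect or its matching options, and recipe_score is a hypothetical name used for illustration.

```python
import re


def recipe_score(body, keywords, offset=-2):
    """Start from the offset and add 1 for each keyword found at least once."""
    total = offset
    for word in keywords:
        # \b is used here as a rough stand-in for the word-boundary
        # anchors in the recipe above.
        if re.search(r"\b" + re.escape(word) + r"\b", body):
            total += 1
    return total


keywords = ["dolor", "ipsum", "sit"]

all_three = "Lorem ipsum dolor sit amet"
only_two = "Lorem ipsum sit amet"

print(recipe_score(all_three, keywords))  # 1  -> positive, the action would run
print(recipe_score(only_two, keywords))   # 0  -> not positive, no action
```

Adding a fourth keyword rule would need an offset of -3 to keep the "all words present" behaviour.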