A bunch of us English grad students are studying dialog in Virginia Woolf's novel The Waves, and I've been trying to mark up the novel in TEI. To do this, it would be useful to write a regex that captures the dialog. Thankfully, The Waves is extremely regular, and almost all the dialog is in the form:
'Now they have all gone,' said Louis. 'I am alone. They have gone into the house for breakfast,'
A speech can, however, continue for several paragraphs. I'm trying to write a regex to match all the paragraphs of a given speaker.
This is discussed briefly in Chris Foster's blog post, where he suggests something like /'([\^,]+,)' said Louis, '(*)'/, although this would only match single paragraphs, I think. This is how I'm thinking through it:
For every paragraph containing the text "said Louis" (or any other character's name) in the first line of the paragraph, match every line until reaching another character's speech, i.e. "said Rhoda."
I could probably do this with a ton of awkward Python, but I'd love to know whether this is possible with regex.
It seems, from your link, that the text follows the following rules.
Each "line" is indeed a line in the strict sense, i.e. separated by \n.
Paragraphs are demarcated by two or more consecutive new lines, i.e. \n\n+.
Only the non-directional single quote ' is used to demarcate speech.
Here's a quick attempt (scroll all the way down to view the match groups). It is flawed, I'm sure, but there should be enough here to lead you in the right direction. Note how if you concatenate the three capture groups, idiomatically known as $1, $2, and $3, you get each character's speech, including the punctuation around the "said" separator. However, certain quirks of the text throw this regular expression off. For example, quotes are not closed at the end of a paragraph, yet a new quote is opened if the speech continues into the next paragraph, which defeats the whole balanced-quotes strategy. Apostrophes cause the same problem.
\n\n.*?'([^^]+?[?]?),?' said (?:[A-Z][a-z]+)(?:([.]) |, )'([^^]+?)'(?=[^']*(?:'[^']')*[^']*\n\n.*'(?:[^^]+?[?]?),?' said (?:[A-Z][a-z]+)(?:[.] |, ))
| | | <----><--> <>|<-------------------><------------>| <----> |<--------------------------------------------------------------------------------->
| | | | | | || | | | ||
| | | | | | || | | | |assert that this end-quote is followed by a string of non-quote characters, then
| | | | | | || | | | |zero or more strings of quoted non-quote characters, then another string of non-
| | | | | | || | | | |quote characters, a new paragraph, and the next "said Bernard"; otherwise fail.
| | | | | | || | | | |
| | | | | | || | | | match an (end-)quote
| | | | | | || | | |
| | | | | | || | | match any character as needed (but no more than needed)
| | | | | | || | |
| | | | | | || | match a (start-)quote
| | | | | | || |
| | | | | | || match either a period followed by two spaces, or a comma followed by one space
| | | | | | ||
| | | | | | |match the "said Bernard"
| | | | | | |
| | | | | | match an (end-)quote
| | | | | |
| | | | | match a comma, optionally
| | | | |
| | | | match a question mark, optionally
| | | |
| | | match any character as needed (but no more than needed)
| | |
| | match a (start-)quote
| |
| match as many non-newline characters as needed (but no more than needed)
|
new paragraph
Rubular matches (an excerpt):
Match 3
1. But when we sit together, close
2.
3. we melt into each
other with phrases. We are edged with mist. We make an
unsubstantial territory.
Match 4
1. I see the beetle
2. .
3. It is black, I see; it is green,
I see; I am tied down with single words. But you wander off; you
slip away; you rise up higher, with words and words in phrases.
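The paragraph-by-paragraph idea from the question can also be sketched as a small amount of Python around two simple patterns, instead of one large regex. This is only a sketch under the rules listed above (paragraphs separated by two or more newlines, the speaker named in the first line of a speech). The function name speeches_by_speaker and the sample text are made up for illustration; the sample is not quoted from the novel.

```python
import re
from collections import defaultdict

# Assumption: a speaker tag looks like "said Louis" (one capitalised name).
SPEAKER = re.compile(r"\bsaid ([A-Z][a-z]+)")

def speeches_by_speaker(text):
    """Group paragraphs under the most recent 'said <Name>' seen."""
    speeches = defaultdict(list)
    current = None
    # Paragraphs are separated by two or more consecutive newlines.
    for paragraph in re.split(r"\n{2,}", text.strip()):
        first_line = paragraph.split("\n", 1)[0]
        match = SPEAKER.search(first_line)
        if match:
            # A new speaker tag starts a new speech.
            current = match.group(1)
        if current is not None:
            # Paragraphs without a tag continue the current speech.
            speeches[current].append(paragraph)
    return dict(speeches)

# Made-up sample text in the same shape as the novel's dialog.
sample = (
    "'First speech,' said Louis. 'It goes on.\n"
    "\n"
    "'It continues in a second paragraph.'\n"
    "\n"
    "'Another speech,' said Rhoda. 'It ends here.'\n"
)
result = speeches_by_speaker(sample)
print(len(result["Louis"]), len(result["Rhoda"]))
```

Because the speaker is carried forward from paragraph to paragraph, unbalanced quotes and apostrophes do not matter here; the trade-off is that narration between speeches, if there is any, would be attributed to the previous speaker.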
I want to extract matches from a string column for each id. How can I achieve that?
+--------+---------------------------------------+
| id | text |
+--------+---------------------------------------+
| fsaf12 | Other Questions,Missing Document |
| sfas11 | Others,Missing Address,Missing Name |
+--------+---------------------------------------+
Desired output:
+--------+------------------+
| id | extracted |
+--------+------------------+
| fsaf12 | Other Questions |
| fsaf12 | Missing Document |
| sfas11 | Others |
| sfas11 | Missing Address |
| sfas11 | Missing Name |
+--------+------------------+
Here is the query for sample data: FIDDLE
You can use regexp_split_to_table for your requirement like below:
WITH t1 AS (
SELECT 'fsaf12' AS id, 'Other Questions,Missing Document' AS text UNION ALL
SELECT 'sfas11', 'Others,Missing Address,Missing Name'
)
SELECT id, regexp_split_to_table(text, ',') AS extracted
FROM t1
OUTPUT
| id | extracted |
|-----------|-----------------------|
| fsaf12 | Other Questions |
| fsaf12 | Missing Document |
| sfas11 | Others |
| sfas11 | Missing Address |
| sfas11 | Missing Name |
DEMO
Postgres is not my forte at all, but based on this older post on SO you could try to use unnest(). I included a TRIM() to remove possible trailing spaces after a split:
SELECT id, TRIM(unnest(string_to_array(text, ','))) as "extracted" FROM t1;
Or, if you want to use regexp_split_to_table():
SELECT id, regexp_split_to_table(text, '\s*,\s*') as "extracted" FROM t1;
Here we match 0+ whitespace characters, a literal comma and again 0+ whitespace characters.
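For readers who want to try the \s*,\s* pattern outside the database, the same split can be sketched in Python with re.split. The rows below are the sample data from the question with some extra spaces added around the commas as an assumption, to show that the whitespace is consumed by the separator.

```python
import re

# Sample rows from the question, with stray spaces added around the commas.
rows = [
    ("fsaf12", "Other Questions, Missing Document"),
    ("sfas11", "Others ,Missing Address , Missing Name"),
]

# One output row per comma-separated item, like regexp_split_to_table().
extracted = [
    (row_id, part)
    for row_id, text in rows
    for part in re.split(r"\s*,\s*", text)
]

for row in extracted:
    print(row)
```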
Besides the sequence shown below, which other characters does Python treat specially inside strings?
\ is an escape character in Python
\t gets interpreted as a tab
When I opened a file with test_file=open('c:\Python27\test.txt','r'), it gave the error IOError: [Errno 22] invalid mode ('r') or filename: 'C:\Python27\test.txt'. After a Google search I learned that \t is interpreted as a tab in Python. Are there any other character sequences that Python reserves for a specific use?
From the String literals section of the Python language reference, suggested by @Praveen:
Unless an 'r' or 'R' prefix is present, escape sequences in
strings are interpreted according to rules similar to those used by
Standard C. The recognized escape sequences are:
+-----------------+---------------------------------+
| Escape Sequence | Meaning |
+=================+=================================+
| ``\newline`` | Ignored |
+-----------------+---------------------------------+
| ``\\`` | Backslash (``\``) |
+-----------------+---------------------------------+
| ``\'`` | Single quote (``'``) |
+-----------------+---------------------------------+
| ``\"`` | Double quote (``"``) |
+-----------------+---------------------------------+
| ``\a`` | ASCII Bell (BEL) |
+-----------------+---------------------------------+
| ``\b`` | ASCII Backspace (BS) |
+-----------------+---------------------------------+
| ``\f`` | ASCII Formfeed (FF) |
+-----------------+---------------------------------+
| ``\n`` | ASCII Linefeed (LF) |
+-----------------+---------------------------------+
| ``\N{name}`` | Character named *name* in the |
| | Unicode database (Unicode only) |
+-----------------+---------------------------------+
| ``\r`` | ASCII Carriage Return (CR) |
+-----------------+---------------------------------+
| ``\t`` | ASCII Horizontal Tab (TAB) |
+-----------------+---------------------------------+
| ``\uxxxx`` | Character with 16-bit hex value |
| | *xxxx* (Unicode only) |
+-----------------+---------------------------------+
| ``\Uxxxxxxxx`` | Character with 32-bit hex value |
| | *xxxxxxxx* (Unicode only) |
+-----------------+---------------------------------+
| ``\v`` | ASCII Vertical Tab (VT) |
+-----------------+---------------------------------+
| ``\ooo`` | Character with octal value |
| | *ooo* |
+-----------------+---------------------------------+
| ``\xhh`` | Character with hex value *hh* |
+-----------------+---------------------------------+
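Applied to the path from the question, the table explains the error: the \t in \test.txt is read as a TAB. A minimal sketch of the usual ways around it follows; the path is the one from the question and nothing is opened, so the file does not need to exist.

```python
# In an ordinary literal "\t" is a TAB, so this path is silently corrupted.
# (The first backslash is doubled here only to keep the example warning-free.)
broken = "c:\\Python27\test.txt"

# Three ways to keep the backslashes literal:
raw = r"c:\Python27\test.txt"        # raw string: no escape processing
escaped = "c:\\Python27\\test.txt"   # double every backslash
forward = "c:/Python27/test.txt"     # forward slashes are accepted by open() on Windows

print("\t" in broken)   # True: the path contains a TAB character
print("\t" in raw)      # False
print(raw == escaped)   # True: both spell the same path
```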
I have a monster regex at the moment, and am currently looking at how this best functions.
My regex is listed below, and I am curious whether there is a way to prioritize the alternatives within one pattern rather than just taking a match wherever it happens to exist.
Example:
If my string contains a match for ([\d]+/[\d]+) or ([\d]+ / [\d]+), it would pick that first.
If that match does not exist but ([\d]+-[\d]+) or ([\d]+ - [\d]+) does, it would pick that match.
After that, if ([\d]+) matches, it would pick that match as the end marker. If none of those exist, it would then just move on to any of the other matches.
So my question is:
With Regex is there any way to prioritize which match to take first?
example: Some of my address strings are in the format of 1 - 12 example street,
often the regex will pull 12 example street rather than taking 1 - 12 example street.
Thanks!
The full regex is listed below:
New Regex("( ([\d]+) | ([\d]+-[\d]+) | ([\d]+ - [\d]+) | CAR
SMOULDERING | GAS BOTTLE EXPLOSION | INPUT | OFF | OPPOSITE | CNR |
SPARKING | INCIC1 | INCIC3 | STRUC1 | STRUC3 | G&SC1 | G&SC3 | ALARC1 |
ALARC3 | NOSTC1| NOSTC3 | RESCC1 | RESCC3 | HIARC1 | HIARC3 | CAR
ACCIDENT - POSS PERSON TRAPPED | EXPLOSIONS HEARD | WASHAWAY AS A
RESULT OF ACCIDENT | ENTRANCE | ENT |FIRE| LHS | RHS | POWER LINES
ARCING AND SPARKING | SMOKE ISSUING FROM FAN | CAR FIRE | FIRE ALARM
OPERATING | GAS LEAK | GAS PIPE | NOW OUT | ACCIDENT | SMOKING | ROOF |
GAS | REQUIRED | FIRE | LOCKED IN CAR | SMOKE RISING | SINGLE CAR
ACCIDENT | ACCIDENT | FIRE)(.*?)(?=\SVSE| M | SVC | SVSW | SVNE | SVNW
)", RegexOptions.RightToLeft)
Alternatives are tried in the order they are written, so change the order of the first three to put the most specific ones first:
(\d+-\d+) | (\d+ - \d+) | (\d+ )
instead of:
([\d]+) | ([\d]+-[\d]+) | ([\d]+ - [\d]+)
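The effect of the ordering can be sketched in Python. Note the assumptions: Python's re module scans left to right and has no equivalent of RegexOptions.RightToLeft, the patterns are simplified versions of the first three alternatives, and first_by_priority is a hypothetical helper, not part of any library. Reordering only decides between alternatives that can start at the same position; if a lower-priority pattern can match earlier in the string, trying the patterns one by one is a way to enforce a strict priority.

```python
import re

address = "1 - 12 example street"

# Shortest alternative first: the engine takes the first one that matches.
print(re.search(r"\d+|\d+-\d+|\d+ - \d+", address).group())   # 1
# Most specific alternative first.
print(re.search(r"\d+ - \d+|\d+-\d+|\d+", address).group())   # 1 - 12

def first_by_priority(patterns, text):
    """Return the match of the highest-priority pattern found anywhere."""
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group()
    return None

# Priority from the question: slash ranges, then dash ranges, then a number.
PRIORITY = [r"\d+\s*/\s*\d+", r"\d+\s*-\s*\d+", r"\d+"]

text = "unit 7 at 3/45 example street"
# A single alternation stops at the leftmost match, which is the bare "7".
print(re.search("|".join(PRIORITY), text).group())   # 7
# Trying the patterns in order picks the slash range instead.
print(first_by_priority(PRIORITY, text))             # 3/45
```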
I have recently started to use Qt, as it is much more intuitive than using win32. I have been playing around with a bunch of the different widgets and I want to try something more complex, but I haven't been able to find anything in the Qt reference or on Google related to what I want.
I am trying to build something like the Unity3D Inspector box. I have a rough idea of how to go about it, but there doesn't seem to be a ready-made widget for one of the needed components.
I would have a dockable widget containing a scrollable area, and I am looking to add 'components' to this scrollable area. These components will all be somewhat different. They should be able to expand and collapse into a single line (the identifier of the component), and when expanded they should be able to hold multiple widgets, such as labels, checkboxes, other collapsible sections, etc.
I must be wording my Google searches badly, as I can't find anything similar to what I want, even though it seems like a common idea.
2 solutions:
1/ Manual design
Dock:
*---------------QDockWidget---------------*
| |
| *-------------QScrollArea-------------* |
| | | |
| | *--------ExpandableWidget---------* | |
| | | | | |
| | | | | |
| | | | | |
| | *---------------------------------* | |
| | *--------ExpandableWidget---------* | |
| | | | | |
| | | | | |
| | | | | |
| | *---------------------------------* | |
| | *--------ExpandableWidget---------* | |
| | | | | |
| | | | | |
| | | | | |
| | *---------------------------------* | |
| | *--------VerticalSpacer-----------* | |
| | | | |
| | | | |
| *-------------------------------------* |
| |
*-----------------------------------------*
ExpandableWidget:
ArrowL is a QLabel containing only the arrow indicating whether the widget is collapsed or extended. You set the custom widget to the input widget you want, for example an int input. You hide this widget when collapsing, and show it when expanding.
*------------ExpandableWidget-------------*
| |
| *-------------QVBoxLayout-------------* |
| | | |
| | *-----------QHBoxLayout-----------* | |
| | | *-ArrowL-* *------QLabel------* | | |
| | *---------------------------------* | |
| | | |
| | *---------Custom QWidget----------* | |
| | | | | |
| | *---------------------------------* | |
| | | |
| *-------------------------------------* |
| |
*-----------------------------------------*
Advantage: you can entirely control how the dock behaves.
Drawback: you have to implement this hierarchy by yourself, in a global widget, to ensure its consistency.
2/ QtPropertyBrowser
QtPropertyBrowser is part of the now discontinued Qt Solutions (licence). It enables you to do almost what you want in a few lines of code.
Can you have dynamic naming in regex groups? Something like
reg = re.compile(r"(?PText|Or|Something).*(?PTextIWant)")
r = reg.find("TextintermingledwithTextIWant")
r.groupdict()["Text"] == "TextIWant"
So that, depending on what the beginning was, group["Text"] == TextIWant
Updated to make the question more clear.
Some regex engines support this, some don't. This site says that Perl, Python, PCRE (and thus PHP), and .NET support it, all with slightly different syntax:
+--------+----------------------------+----------------------+------------------+
| Engine | Syntax | Backreference | Variable |
+--------+----------------------------+----------------------+------------------+
| Perl | (?<name>...), (?'name'...) | \k<name>, \k'name' | %+{name} |
| | (?P<name>...) | \g{name}, (?&name)* | |
| | | (?P>name)* | |
+--------+----------------------------+----------------------+------------------+
| Python | (?P<name>...) | (?P=name), \g<name> | m.group('name') |
+--------+----------------------------+----------------------+------------------+
| .NET | (?<name>...), (?'name'...) | \k<name>, \k'name' | m.Groups['name'] |
+--------+----------------------------+----------------------+------------------+
| PCRE | (?<name>...), (?'name'...) | \k<name>, \k'name' | Depends on host |
| | (?P<name>...) | \g{name}, \g<name>* | language. |
| | | \g'name'*, (?&name)* | |
| | | (?P>name)* | |
+--------+----------------------------+----------------------+------------------+
This is not a complete list, but it's what I could find. If you know more flavors, add them! The backreference forms with a * are those which are "recursive" as opposed to just a back-reference; I believe this means they match the pattern again, not what was matched by the pattern. Also, I arrived at this by reading the docs, but there could well be errors—this includes some languages I've never used and some features I've never used. Let me know if something's wrong.
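As a quick check of the Python row of the table, here is a minimal sketch using a made-up pattern that finds a doubled word. It exercises the three forms listed: (?P<name>...) to define the group, (?P=name) to refer back to it inside the pattern, and \g<name> in a replacement string.

```python
import re

# (?P<word>...) names the group; (?P=word) must match the same text again.
doubled = re.compile(r"(?P<word>\w+) (?P=word)")

match = doubled.search("it is is here")
print(match.group("word"))    # is
print(match.groupdict())      # {'word': 'is'}

# In a replacement string the named group is written \g<word>.
print(doubled.sub(r"\g<word>", "it is is here"))   # it is here
```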
Your question is worded kind of funny, but I think what you are looking for is a non-capturing group. Make it like this:
(?:Must_Match_This_First)What_You_Want(?:Must_Match_This_Last)
The ?: is what designates that a group matches but does not capture.
You could first build the string in a dynamic way and then pass it to the Regex engine.
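A minimal sketch of that idea in Python, using the strings from the question. The helper build_pattern and its parameters are hypothetical; the group name is chosen by ordinary code before compiling, because the name cannot depend on what the pattern itself matched. The prefix goes in a non-capturing group so that only the wanted text is captured.

```python
import re

def build_pattern(group_name, prefixes, wanted):
    """Compile a pattern whose single named group is called group_name.

    group_name must be a valid Python identifier; the other arguments
    are escaped so they are matched as literal text.
    """
    alternatives = "|".join(re.escape(prefix) for prefix in prefixes)
    return re.compile(
        rf"(?:{alternatives}).*(?P<{group_name}>{re.escape(wanted)})"
    )

reg = build_pattern("Text", ["Text", "Or", "Something"], "TextIWant")
r = reg.search("TextintermingledwithTextIWant")
print(r.groupdict()["Text"])   # TextIWant
```

If the name really has to depend on how the string begins, one option is a two-step approach: match the prefix first, then call the helper with that prefix as the group name.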