Emacs regexp - replacing text strings, query replace regexp - regex

It seems simple enough but I can't get it done.
My text file looks like this :
Johnson Cary, 2009, This important article, 109 pages.
Smith Tom, 2003, Much ado about nothing: a study, 89 pages.
I need this :
Johnson Cary%2009%This important article%109 pages.
Any special character unlikely to appear in text will do. The end goal is to end up with a .csv then a .xls file.
I am using
^\([^,]+\)\([,]\)
to find the first occurring comma, but when I try to replace with
\1 %
it does not work, nor any kind of close combination of that sort for that matter.
Any help will be dearly welcome!
Thank you much in advance.

Replace this:
^\([^,]*\), \([^,]*\), \([^,]*\), \(.*\)$
with this:
\1%\2%\3%\4
to get the correct result.
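For anyone who would rather script the conversion than run it interactively, the same replacement can be sketched in Python. This is only an illustration and not part of the Emacs answer: the function name is made up, the groups use Python's bare parentheses instead of Emacs's \( \), and [^,\n] is used so that a field can never run across a line break.

```python
import re

# Same four-field pattern as the Emacs regexp above, in Python syntax.
# [^,\n] (rather than [^,]) keeps each field on a single line.
FIELDS = re.compile(r"^([^,\n]*), ([^,\n]*), ([^,\n]*), (.*)$", re.MULTILINE)


def to_percent_delimited(text):
    """Replace the first three ', ' separators on each line with '%'."""
    return FIELDS.sub(r"\1%\2%\3%\4", text)


sample = (
    "Johnson Cary, 2009, This important article, 109 pages.\n"
    "Smith Tom, 2003, Much ado about nothing: a study, 89 pages."
)
print(to_percent_delimited(sample))
```

Lines that do not have at least three ", " separators are left unchanged, so they are easy to spot afterwards.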

Regex: How to extract dialogue tags from fiction, with speaker information

Totally stumped on this. I need help extracting dialogue from a story so I can hand it off for narration.
Basically, this is a problem where I have a big chunk of text (a novel), and I want to extract all the dialogue from the text in a format I can pipe into a spreadsheet.
But, I also want, if it exists, the speaker information as well. So, given a string like:
'"I'm really hungry," she said.'
I would like the values returned as:
[ "I'm really hungry", "she said" ]
If there is no speaker information, as in this example:
"I'm not hungry."
the result would just be:
["I'm not hungry."]
Is this madness? Is it even possible? I have fooled around with this regex (am not a regex guru, knowing only enough to be dangerous):
"([^"]*)"
This seems to get the dialogue, but it doesn't get the speaker info. Any advice on how to get the speaker info as well would be greatly appreciated. I've been wrestling with this for a while now.
Maybe a better approach would be to get the dialogue in one field, and the entire paragraph it is found in as the second field. That could also work, but I have no idea where to start with this.
Basically, I want to put these all into a spreadsheet so I can hand them off to a narrator with enough context that they know whose dialogue is whose in the story.
Any help is greatly appreciated!
It definitely is possible.
Look at this regex: ^.*?'?(?P<line>\".*\")(?P<actor>[^'\n]*)'?.*?$
demo here: https://regex101.com/r/UCRZwY/5
It basically marks the outer quotes as optional, stores the quoted dialogue as '$line', and stores whatever follows it as '$actor'. These are of course just names I've given the groups; feel free to change them.
Note: updated to also handle dialogue that appears as part of a regular sentence; see the example in the demo.
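As a rough sketch of how the pattern above could be used from Python to build spreadsheet rows: the function name and the clean-up steps (dropping the surrounding quotes, the trailing comma, and the final period of the speaker tag) are my own assumptions, not part of the answer. Because ".*" is greedy, a line containing several quoted passages is captured as one span from the first quote to the last, so treat this as a starting point only.

```python
import re

# The pattern from the answer above; MULTILINE makes ^ and $ apply per line.
DIALOGUE = re.compile(
    r"""^.*?'?(?P<line>".*")(?P<actor>[^'\n]*)'?.*?$""",
    re.MULTILINE,
)


def extract_dialogue(text):
    """Return one [dialogue] or [dialogue, speaker] row per matching line."""
    rows = []
    for match in DIALOGUE.finditer(text):
        # Drop the surrounding double quotes and a trailing comma, if any.
        spoken = match.group("line").strip('"').rstrip(",")
        # Drop surrounding whitespace and the final period of the tag.
        actor = match.group("actor").strip().rstrip(".")
        rows.append([spoken, actor] if actor else [spoken])
    return rows


print(extract_dialogue('"I\'m really hungry," she said.'))
print(extract_dialogue('"I\'m not hungry."'))
```

Each row can then be written out with the csv module for import into a spreadsheet.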

Regex that search for sentences that exclude one word

Ciao guys,
I'm creating a corpus composed of tweets that contain the keyword "catastrophic", in XML format. Each tweet is embedded like this:
<tweet>"Catastrophic loss" at Tennessee's Zoo Knoxville as 33 reptiles are found dead </tweet>
<tweet>Overcoming Catastrophic Forgetting by Incremental Moment Matching, Lee et al.</tweet>
After trimming tons of unnecessary data, there are still like 200+ tweets that don't contain the keyword at all. I'd like to delete them, so I tried regex like this, but it just didn't work:
<tweet>^.*(?!catastrophic).*$</tweet>
Does anybody have any idea?
Not sure what programming language or other toolset you are using.
But a quite simple approach might be to rewrite the file (or whatever kind of input it is) using a filter that only writes the entries that do contain "catastrophic".
Assuming that it is a file with one line per tweet (just to illustrate the idea), and matching case-insensitively because the sample tweets capitalize the keyword:
egrep -i '<tweet>.*catastrophic.*</tweet>' originalFile > newFile
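The same keep-only-the-matches idea can be sketched in Python for anyone not working in a shell. The function name and sample data below are made up for illustration, and the sketch assumes one tweet per line, as above. As a side note, the original pattern fails because (?!catastrophic) only checks the single position where it sits, and the surrounding .* can always move to a position where that check passes.

```python
import re

# Case-insensitive, because the sample tweets write "Catastrophic".
KEYWORD_TWEET = re.compile(r"<tweet>.*catastrophic.*</tweet>", re.IGNORECASE)


def keep_keyword_tweets(lines):
    """Return only the lines whose tweet contains the keyword."""
    return [line for line in lines if KEYWORD_TWEET.search(line)]


tweets = [
    "<tweet>Overcoming Catastrophic Forgetting by Incremental Moment Matching</tweet>",
    "<tweet>Nice weather today</tweet>",
    "<tweet>a catastrophic failure</tweet>",
]
for line in keep_keyword_tweets(tweets):
    print(line)
```

Filtering for what should stay is usually simpler than writing a pattern for what should go.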

fuzzy matching japanese strings in python?

This problem has had me stumped for the whole day.
I have two Japanese strings that I want to fuzzy match in Python 2.7. Currently I'm using fuzzywuzzy and
jpnStr = "日本語".encode('utf-8')
jpnList = ["日本語1".encode('utf-8'),"日本語2".encode('utf-8'),"日本語3".encode('utf-8')]
bestmatch = process.extractOne(jpnStr, jpnList)
but the resulting bestmatch is always
("日本語1",0)
How would I go about resolving this issue, or is there a best practice that I'm totally missing here? Sorry if I sound frustrated, it's been a roadblock for a while. Thanks in advance.
Ok, I'm not sure how helpful this is but I've found a workaround.
I found that I could fuzzy match Japanese strings using fuzzywuzzy.
First, you get the Unicode Japanese string, i.e. "日本語です".
Then you output it as ASCII text into a text file. The output will look something like "\u65e5\u672c\u8a9e\u3067\u3059".
Then you read the text file back and compare the ASCII representations of the Japanese strings against each other. This gives a workable fuzzywuzzy match rating.
It's probably not an ideal method, but it works and is fairly accurate. Hope this helps somebody.
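A minimal sketch of the workaround described above, written for Python 3 and skipping the intermediate text file. To keep it self-contained it uses the standard library's difflib.SequenceMatcher as a stand-in scorer rather than fuzzywuzzy, so the scores will not be identical to what fuzzywuzzy reports; the function names and the sample choices are made up.

```python
from difflib import SequenceMatcher


def to_ascii_escapes(text):
    """Turn a Unicode string into its ASCII \\uXXXX escape form."""
    return text.encode("unicode_escape").decode("ascii")


def best_match(query, choices):
    """Return (choice, ratio) for the choice whose escaped form is closest."""
    escaped_query = to_ascii_escapes(query)
    scored = [
        (choice, SequenceMatcher(None, escaped_query, to_ascii_escapes(choice)).ratio())
        for choice in choices
    ]
    return max(scored, key=lambda pair: pair[1])


print(to_ascii_escapes("日本語"))
print(best_match("日本語", ["東京", "日本", "日本語1"]))
```

Comparing the escaped forms means that every character contributes six ASCII characters to the comparison, which is one reason the ratings are only approximate.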

Using RegEx to Find a Block of Text

I'm attempting to block a long string of unnecessary text that's on every page of a document.
Ex: "36075 This is another page and this is the date March 4 2013"
I know this must be very simple, but I'm hoping there is a way to block text verbatim. Is the only way to block this text by using a lot of \d, \s, \w+, etc., or is there a way to say, "match 36075 This is another page and this is the date March 4 2013"?
This would be SO HELPFUL to know. Thank you for helping!
From what you wrote, I assume you need to get the leading number from the string. To do that, you just need this pattern: ^\d+ which, from this input:
36075 This is another page and this is the date March 4 2013
will return this:
36075
In the future, for questions like this, please provide an example string and the expected output, as well as what you have tried.
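As a small Python sketch of both readings of the question: the first function applies the ^\d+ idea from the answer above, and the second matches a string verbatim by escaping it first. The verbatim part is my own addition rather than something from the answer, and the function names are made up.

```python
import re

PAGE_HEADER = "36075 This is another page and this is the date March 4 2013"


def leading_number(text):
    """Return the digits at the start of the text, or None if there are none."""
    match = re.match(r"\d+", text)  # re.match already anchors at the start
    return match.group(0) if match else None


def remove_verbatim(text, literal):
    """Delete every exact occurrence of `literal` from `text`."""
    # re.escape neutralizes any regex metacharacters in the literal.
    return re.sub(re.escape(literal), "", text)


print(leading_number(PAGE_HEADER))
print(remove_verbatim("Page text. " + PAGE_HEADER, PAGE_HEADER))
```

For a purely literal match, plain text.replace(literal, "") does the same job without any regex; re.escape is mainly useful when the literal has to be embedded in a larger pattern.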
I realized the issue I was having. I didn't need to use RegEx. The program I was using has the functionality to match specific words or groups of words and pronounce them differently. What I discovered is that it will not match the words unless the word groups are input exactly the way the program typically reads them.
Ergo --> The channel saw
the end of the British hold over
would have to be listed as one group for "The channel saw" and a second group for "the end of the British hold over".
In addition, there were some numbers --> 11960_30_o_ho_
and if the program naturally read 119 and then 60_3 and then _o_ho_ then three strings would need to be input for each section.
A few frustrating hours later, problem solved :) Thank you for your assistance.

Splitting a title into separate parts

I need to split a string of the form
2,9.1,The Godfather (1972), (it's a csv line)
to:
2
9.1
The Godfather
1972
any ideas for a good regular expression?
BTW,
if you know a good regular expressions creator based on examples you provide it'd be great.
I'm a bit new to this..
10x!!
(\d+),(\d+\.\d+),(.*?) \((\d{4})\)
^^^^^ ^^^^^^^^^^ ^^^^^   ^^^^^^^
2     9.1        Title   Year
I wouldn't recommend using regex to split CSV files, as it can't handle comma escaping well. That said, how about using the simplest available solution?
A simple regex like this should solve your problem:
'(.*?),(.*?),(.*?) \((\d+)\)'
A little time with Google gave me this: /,(?!(?:[^",]|[^"],[^"])+")/. Seems to split CSV just fine.
>>> '2,9.1,The Godfather (1972)'.split(/,(?!(?:[^",]|[^"],[^"])+")/)
["2", "9.1", "The Godfather (1972)"]
If you are sure that the format is static, you can use this:
(\d+),(\d+\.\d+),(.*?) \((\d+)\)
But if it can contain more information, use a real CSV parser to read the line and then just split The Godfather (1972) using (.*?) \((\d+)\).
CSV has a lot of corner cases, and your regexp approach might take you into a world of pain.
For example, if the title has a comma in it, the title would then be double-quoted, which would break all of the regexps given so far.
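The "real CSV parser first, then a small regex for the title" suggestion above might look roughly like this in Python. It is a sketch under assumptions: the function name is made up, the title is assumed to be the third field, and the year is assumed to sit in parentheses at the very end of that field.

```python
import csv
import re

# Title, a space, then the year in parentheses at the end of the field.
TITLE_YEAR = re.compile(r"(.*?) \((\d+)\)")


def split_movie_line(line):
    """Split one CSV line into [rank, rating, title, year]."""
    # csv.reader takes care of quoting, e.g. titles that contain commas.
    rank, rating, title_field = next(csv.reader([line]))[:3]
    match = TITLE_YEAR.fullmatch(title_field.strip())
    if match is None:
        raise ValueError("unexpected title field: %r" % title_field)
    return [rank, rating, match.group(1), match.group(2)]


print(split_movie_line("2,9.1,The Godfather (1972),"))
print(split_movie_line('7,8.9,"Good, the Bad and the Ugly, The (1966)"'))
```

The second call shows the quoted-title corner case mentioned above: the parser keeps the commas inside the quotes in one field, so the regex only ever sees the title.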