Having a bit of trouble with some code I'm working through. Basically, I have transcripts (txt files) for a few Japanese anime, from which I want to remove everything but the spoken lines (Japanese sentences) in order to do some NLP experiments.
I've managed to accomplish a good bit of cleaning, but where I'm stuck is with parentheses. A majority of the elements in my list start with a character's name inside parentheses (e.g. (Armin)). I want to remove these, but none of the regex code I've found online seems to work.
Here's a snippet of the list I'm working with:
['（アルミン）その日', '人類は思い出した', '（アルミン）奴らに', '支配されていた恐怖を', '（アルミン）鳥籠の中に', 'とらわれていた―', '屈辱を', '（キース）総員', '戦闘用意!', '目標は1体だ', '必ず仕留め―', 'ここを', '我々', '人類', '最初の壁外拠点とする!', '（エルヴィン）あっ…', '目標接近!', '（キース）訓練どおり5つに分かれろ!', '囮は我々が引き受ける!', '全攻撃班', '立体機動に移れ!', '（エルヴィン）全方向から', '同時に叩くぞ!', '（モーゼス）やあーっ!']
I've tried the following code (it's as close as I could get):
no_parentheses = []
for line in mylist:
    if '(' in line:
        line = re.sub('\(.*\)','', line)
        no_parentheses.append(line)
    else:
        no_parentheses.append(line)
But when I view the results, those pesky parentheses remain in my list mockingly.
Could anyone offer suggestions to resolve this issue?
Thanks again!
The brackets used in the text are full-width brackets. Specifically, U+FF08 FULLWIDTH LEFT PARENTHESIS, and U+FF09 FULLWIDTH RIGHT PARENTHESIS.
Your regex should use full-width brackets as well.
line = re.sub('（.*）','', line)
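As a minimal sketch of the same idea, assuming each speaker tag sits at the start of the line and is wrapped in U+FF08 / U+FF09: the names OPEN, CLOSE, speaker_tag and the shortened sample list are only for illustration. The code points are written as escapes so the full-width characters cannot be confused with ASCII parentheses.

```python
import re

# Full-width parentheses, written as escapes so they are unambiguous.
OPEN, CLOSE = '\uff08', '\uff09'

# Shortened sample in the same shape as the list from the question.
mylist = [
    OPEN + 'アルミン' + CLOSE + 'その日',
    '人類は思い出した',
    OPEN + 'キース' + CLOSE + '総員',
    '戦闘用意!',
]

# Match a leading tag: open paren, anything except the close paren, close paren.
speaker_tag = re.compile('^' + OPEN + '[^' + CLOSE + ']*' + CLOSE)

# Lines without a tag pass through unchanged, so no if/else is needed.
no_parentheses = [speaker_tag.sub('', line) for line in mylist]
print(no_parentheses)
```

Using [^...]* instead of .* keeps the match from running on to a later closing parenthesis if a line happens to contain more than one pair.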
I'm working with Emergency Services data in the NEMSIS XSD. I have a field, which is constrained to only 50 characters. I've searched this site extensively, and tried many solutions - Notepad++ rejects all of them, saying not found.
Here's an XML Sample:
<E09>
<E09_01>-5</E09_01>
<E09_02>-5</E09_02>
<E09_03>-5</E09_03>
<E09_04>-5</E09_04>
<E09_05>this one is too long Non-Emergency - PT IS BEING DISCHARGED FROM H AFTER BEING ADMITTED FOR FAILURE TO THRIVE AND ALCOHOL WITHDRAWAL</E09_05>
</E09>
<E09>
<E09_01>-5</E09_01>
<E09_02>-5</E09_02>
<E09_03>-5</E09_03>
<E09_04>-5</E09_04>
<E09_05>this one is is okay</E09_05>
</E09>
I've tried solutions naming the E09_05 tag in different ways, using <\/E09_05> for the closing tag as I've seen in some examples, and as just </E09_05> as I've seen in others. I've tried ^.{50,}$ between them, or [a-zA-Z]{50,}$ between them, I've tried wrapping those in-between expressions in () and without. I even tried just [\s\S]*? in between the tags. The only thing that Notepad++ finds is when I use ^.{50,}$ by itself with no XML tags ... but then I wind up hitting on all the E13_01 tags (which are EMS narratives, and always > 50 characters) -- making for painstaking and wrist-aching clicks.
I wanted to XSLT this, but there is too much individual, hands-on tweaking of each E09_05 field for automating it. Perl is not an option in this environment (and not a tool I know at all anyway).
To be truly sublime, both E09_05 and E09_08 fields with string lengths >50 need to be what is selected on the search ... but no other elements of any kind or length.
Thanks in advance. I'm sure I'm just missing some subtle \, or () or [] somewhere ... hopefully ...
The following regex will find the text content of <E09_05> elements with more than 50 characters.
(?<=<E09_05>).{51,}?(?=</E09_05>)
Explanation
(?<=<E09_05>) Start matching right after <E09_05>
.{51,}? Match 51 or more characters (in a single line)
The ? makes it reluctant, so it'll stop at first </E09_05>
(?=</E09_05>) Stop matching right before </E09_05>
For truly sublime matching, i.e. both E09_05 and E09_08 fields with string lengths >50, use:
(?<=<(E09_0[58])>).{51,}?(?=</\1>)
Explanation
<(E09_0[58])> Match <E09_05> or <E09_08>, and capture the name as group 1
</\1> Use \1 backreference to match name inside </name>
If you want to shorten the text with ellipsis at the end, e.g. Hello World with max length 8 becomes Hello..., use:
Find what: (?<=<(E09_0[58])>)(.{47}).{4,}(?=</\1>)
Replace with: \2...
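For anyone who wants to try the same logic outside Notepad++, here is a rough Python sketch. It is not the Notepad++ expression itself: it matches the tags directly instead of using lookarounds, and the function names find_too_long and truncate, the sample string, and the E09_08 sample value are made up for illustration.

```python
import re

sample = (
    "<E09>\n"
    "<E09_05>this one is too long Non-Emergency - PT IS BEING DISCHARGED "
    "FROM H AFTER BEING ADMITTED FOR FAILURE TO THRIVE AND ALCOHOL WITHDRAWAL</E09_05>\n"
    "<E09_08>short enough</E09_08>\n"
    "</E09>"
)

# Group 1 is the tag name (E09_05 or E09_08), group 2 the over-long content.
too_long = re.compile(r"<(E09_0[58])>(.{51,}?)</\1>")

def find_too_long(text):
    """Return (tag, content) pairs for fields longer than 50 characters."""
    return [(m.group(1), m.group(2)) for m in too_long.finditer(text)]

def truncate(text, limit=50):
    """Shorten over-long fields to `limit` characters, ending in '...'."""
    def shorten(m):
        return "<{0}>{1}...</{0}>".format(m.group(1), m.group(2)[:limit - 3])
    return too_long.sub(shorten, text)

print(find_too_long(sample))
print(truncate(sample))
```

As in the Notepad++ version, the dot does not cross line breaks, so a field whose text spans several lines would not be found by this sketch.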
I am trying to match everything between multiple set of brackets
Example of data
[[42.30722,-83.181125],[42.30722,-83.18112667],[42.30722167,-83.18112667,[42.30721667,-83.181125],[+42.30721667,-83.181125]]
I need to match everything within the inner brackets as below
42.30722,-83.181125,
42.30722,-83.18112667,
42.30722167,-83.18112667,
42.30721667,-83.181125,
+42.30721667,-83.181125
How do I do that? I tried \[([^\[\]]|)*\] but it gives me values with brackets. Can anybody please help me with this? Thanks in advance.
Seems like one of them is missing a bracket maybe, or if not, maybe some expression similar to:
\[([+-]?\d+\.\d+)\s*,\s*([+-]?\d+\.\d+)\s*\]?
might be OK to start with.
Test
import re
expression = r"\[([+-]?\d+\.\d+)\s*,\s*([+-]?\d+\.\d+)\s*\]?"
string = """
[[42.30722,-83.181125],[42.30722,-83.18112667],[42.30722167,-83.18112667,[42.30721667,-83.181125],[+42.30721667,-83.181125]]
"""
print([list(i) for i in re.findall(expression, string)])
print(re.findall(expression, string))
Output
[['42.30722', '-83.181125'], ['42.30722', '-83.18112667'], ['42.30722167', '-83.18112667'], ['42.30721667', '-83.181125'], ['+42.30721667', '-83.181125']]
[('42.30722', '-83.181125'), ('42.30722', '-83.18112667'), ('42.30722167', '-83.18112667'), ('42.30721667', '-83.181125'), ('+42.30721667', '-83.181125')]
If you wish to simplify/modify/explore the expression, it is explained in the top-right panel of regex101.com, where you can also see how it matches against some sample inputs.
A little late, but figured I would include it anyhow.
Your 3rd set is missing a ']'.
If that is in there, then in Alteryx, you can just use Text to Columns splitting to Rows and ignore delimiter in brackets
I am trying to exclude delimiters within text qualifiers. For this, I am trying to use Regex. However, I am new to Regex and am not able to fully accomplish my needs. I would be very grateful if someone could help me out.
In Alteryx, I load a delimited flat text file as 'non-delimited' and say that it does not have text qualifiers. Thus, the input will look something like this:
"aabb"|ccdd|eeff|gghh
"aa|bb"|ccdd|eeff|gghh
"aa|bb"|ccdd|"ee|ff"|gghh
"aa|bb"|"cc|dd"|"ee|ff"|"gg|hh"
"aabb"|"ccdd"|"eeff"|"gghh"
"aabb"|"ccdd"|"eeff"|"gg|hh"
aabb|ccdd|eeff|gghh
"aa|bb"|ccdd|eeff|"gg|hh"
aabb|cc|dd|eeff|gghh
aabb|"cc||dd"|eeff|gghh
aabb|"c|c|dd"|eeff|gghh
"aa||bb"|ccdd|eeff|gghh
"a|a|b|b"|ccdd|eeff|gghh
"aabb"|ccdd|eeff|"g|g|hh"
"aabb"|ccdd|eeff|"gg||hh"
I want to exclude all delimiters that are in between text qualifiers.
I have tried to use Regex to replace the delimiters within text qualifiers with nothing.
So far, I have tried the following Regex code for my target:
(")(.*?[^"])\|+(.*?)(")
And I have used the following for my replace:
$1$2$3$4
However, this will not fix lines 11, 13, 14 and 15.
I wish to obtain the following results:
"aabb"|ccdd|eeff|gghh
"aabb"|ccdd|eeff|gghh
"aabb"|ccdd|"eeff"|gghh
"aabb"|"ccdd"|"eeff"|"gghh"
"aabb"|"ccdd"|"eeff"|"gghh"
"aabb"|"ccdd"|"eeff"|"gghh"
aabb|ccdd|eeff|gghh
"aabb"|ccdd|eeff|"gghh"
aabb|cc|dd|eeff|gghh
aabb|"ccdd"|eeff|gghh
aabb|"ccdd"|eeff|gghh
"aabb"|ccdd|eeff|gghh
"aabb"|ccdd|eeff|gghh
"aabb"|ccdd|eeff|"gghh"
"aabb"|ccdd|eeff|"gghh"
Thank you in advance for helping me out!
With kind regards,
Robin
I can't think of the correct syntax in REGEX unless you are putting in each pattern that could be found.
However, an easier way (maybe not as performant), would be to use a Text to Columns selecting Ignore delimiters in quotes. If you need it back together in one cell afterwards, you can transpose, then remove delimiters followed by a Summarize to concatenate each RecordID Group.
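Outside Alteryx, the same cleanup can be sketched in Python by matching each quoted field as a whole and stripping the delimiter only inside it. This is an illustration, not an Alteryx formula: the names QUOTED and strip_delimiters_in_quotes are made up, and it assumes qualifiers are plain double quotes that are never escaped or nested.

```python
import re

# One quoted field: opening quote, anything but a quote, closing quote.
QUOTED = re.compile(r'"[^"]*"')

def strip_delimiters_in_quotes(line, delimiter="|"):
    """Remove `delimiter` inside quoted fields, leave the rest of the line alone."""
    return QUOTED.sub(lambda m: m.group(0).replace(delimiter, ""), line)

rows = [
    '"aa|bb"|ccdd|"ee|ff"|gghh',
    'aabb|"c|c|dd"|eeff|gghh',
    '"aabb"|ccdd|eeff|"gg||hh"',
    'aabb|cc|dd|eeff|gghh',
]
for row in rows:
    print(strip_delimiters_in_quotes(row))
```

Because the replacement is done by a function on the whole quoted field, it does not matter how many delimiters the field contains, which is where a single fixed pattern with numbered groups runs out of steam.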
I've searched around quite a bit now, but I can't get any suggestions to work in my situation. I've seen success with negative lookahead or lookaround, but I really don't understand it.
I wish to use RegExp to find URLs in blocks of text but ignore them when quoted. While not perfect yet I have the following to find URLs:
(https?\://)?(\w+\.)+\w{2,}(:[0-9])?\/?((/?\w+)+)?(\.\w+)?
I want it to match the following:
www.test.com:50/stuff
http://player.vimeo.com/video/63317960
odd.name.amazone.com/pizza
But not match:
"www.test.com:50/stuff
http://plAyerz.vimeo.com/video/63317960"
"odd.name.amazone.com/pizza"
Edit:
To clarify, I could be passing a full paragraph of text through the expression. Sample paragraph of what I'd like below:
I would like the following link to be found www.example.com. However this link should be ignored "www.example.com". It would be nice, but not required, to have "www.example.com and www.example.com" ignored as well.
A sample of a different one I have working below. language is php:
$articleEntry = "Hey guys! Check out this cool video on Vimeo: player.vimeo.com/video/63317960";
$pattern = array('/\n+/', '/(https?\:\/\/)?(player\.vimeo\.com\/video\/[0-9]+)/');
$replace = array('<br/><br/>',
'<iframe src="http://$2?color=40cc20" width="500" height="281" frameborder="0" webkitAllowFullScreen mozallowfullscreen allowFullScreen></iframe>');
$articleEntry = preg_replace($pattern,$replace,$articleEntry);
The result of the above will replace any new lines "\n" with a double break "<br/><br/>" and will embed the Vimeo video by replacing the Vimeo address with an iframe and link.
I've found a solution!
(?=(([^"]+"){2})*[^"]*$)((https?:\/\/)?(\w+\.)+\w{2,}(:[0-9]+)?((\/\w+)+(\.\w+)?)?\/?)
The first part, from (? to *$), is what makes it work for me. I found it in an answer to "Java Regex - split but ignore text inside quotes?" by https://stackoverflow.com/users/548225/anubhava
While I had read that question before, I had overlooked his answer because it wasn't the one that "solved" the question. I just changed the single quote to double quote and it works out for me.
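To show what the lookahead is doing, here is a small Python sketch (the question was about PHP, so treat this as an illustration only). The lookahead succeeds only when an even number of double quotes remains between the current position and the end of the text, which means the position is outside any quoted span. The names OUTSIDE_QUOTES, URL and find_unquoted_urls are made up, the URL part is simplified, and the lookahead is written as [^"]*"[^"]*" rather than ([^"]+"){2} so that two adjacent quotes are tolerated.

```python
import re

# True only if an even number of double quotes follows the current position.
OUTSIDE_QUOTES = r'(?=(?:[^"]*"[^"]*")*[^"]*$)'

# Simplified URL: optional scheme, dotted host, optional port, optional path.
URL = r'(?:https?://)?(?:\w+\.)+\w{2,}(?::[0-9]+)?(?:/\w+)*'

pattern = re.compile(OUTSIDE_QUOTES + URL)

def find_unquoted_urls(text):
    """Return URLs that are not inside a pair of double quotes."""
    return [m.group(0) for m in pattern.finditer(text)]

paragraph = (
    'I would like the following link to be found www.example.com. '
    'However this link should be ignored "www.example.com". '
    'It would be nice, but not required, to have '
    '"www.example.com and www.example.com" ignored as well.'
)
print(find_unquoted_urls(paragraph))
```

Two caveats: an unbalanced quote anywhere in the text flips what counts as inside and outside, and the lookahead rescans to the end of the text at every position, so very long inputs may be slow.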
add ^ and $ to your regex
^(https?\://)?(\w+\.)+\w{2,}(:[0-9])?\/?((/?\w+)+)?(\.\w+)?$
please notice you might need to escape the slashes after http (meaning https?\:\/\/)
update
if you want it to be case sensitive, you shouldn't use \w but [a-z]. \w matches all letters (upper and lower case), digits, and the underscore, so you should be careful while using it.
I'm trying to write a regexp which will help to find non-translated texts in html code.
Translated text means text that goes through a special tag, <fmt:message>, or through the construction ${...}
Ex. non-translated:
<h1>Hello</h1>
Translated texts are:
<h1><fmt:message key="hello" /></h1>
<button>${expression}</button>
I've written the following expression:
\<(\w+[^>])(?:.*)\>([^\s]+?)\</\1\>
It finds correct strings like:
<p>text</p>
Correctly skips
<a><fmt:message key="common.delete" /></a>
But also catches:
<li><p><fmt:message key="common.delete" /></p></li>
And I can't figure out how to add exception for ${...} strings in this expression
Can anybody help me?
If I understand you properly, you want to ensure the data inside the "tag" doesn't contain fmt:message or ${....}
You might be able to use a negative-lookahead in conjunction with a . to assert that the characters captured by the . are not one of those cases:
/<(\w+)[^>]*>(?:(?!<fmt:message|\$\{|<\/\1>).)*<\/\1>/i
If you want to avoid capturing any "tags" inside the tag, you can ignore the <fmt:message portion, and just use [^<] instead of a . - to match only non <
/<(\w+)[^>]*>(?:(?!\$\{)[^<])*<\/\1>/i
Added from comment: if you also want to exclude "empty" tags, add another negative-lookahead, this time (?!\s*<), to ensure that the stuff inside the tag is not empty or only whitespace:
/<(\w+)[^>]*>(?!\s*<)(?:(?!\$\{)[^<])*<\/\1>/i
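A quick way to check the last expression against the examples from the question is a small Python sketch like the one below. The pattern is the same one without the /.../i delimiters (the i flag becomes re.IGNORECASE); the function name find_untranslated and the sample markup are only for illustration.

```python
import re

# Same expression as above, minus the /.../i delimiters.
untranslated = re.compile(
    r'<(\w+)[^>]*>(?!\s*<)(?:(?!\$\{)[^<])*</\1>',
    re.IGNORECASE,
)

def find_untranslated(html):
    """Return elements whose text is neither a nested tag nor a ${...} expression."""
    return [m.group(0) for m in untranslated.finditer(html)]

html = """
<h1>Hello</h1>
<h1><fmt:message key="hello" /></h1>
<button>${expression}</button>
<li><p><fmt:message key="common.delete" /></p></li>
<p>text</p>
"""
print(find_untranslated(html))
```

Only the two elements with literal text are reported; elements that wrap a fmt:message tag or a ${...} expression are skipped.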
If the format is simple as in your examples you can try this:
<(\w+)>(?:(?!<fmt:message).)+</\1>
Rewritten into a more formal question:
Can you match
aba
but not
aca
without catching
abcba ?
Yes.
FSM:
Start->A->B->A->Terminate
Insert abcba and run it
Start is ready for input.
a -> MATCH, transition to A
b -> MATCH, transition to B
c -> FAIL, return fail.
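The state machine above can be sketched in a few lines of Python. This is only an illustration of the idea; the function name matches_aba and the use of an integer as the state are choices made for the sketch.

```python
def matches_aba(text, expected="aba"):
    """Walk the input one character at a time, failing on the first mismatch."""
    state = 0  # number of expected characters consumed so far (0 = Start)
    for ch in text:
        # Extra input after the final state, or a wrong character, is a failure.
        if state >= len(expected) or ch != expected[state]:
            return False
        state += 1
    # Accept only if the whole expected sequence was consumed.
    return state == len(expected)

for candidate in ("aba", "aca", "abcba"):
    print(candidate, matches_aba(candidate))
```

For abcba the loop returns at the third character, which is the "c -> FAIL" step in the trace.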
I've used a simple one like this with success,
<([^>]+)[^>]*>([^<]*)</\1>
of course if there is any CDATA with '<' in those it's not going to work so well. But should do fine for simple XML.
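A small Python sketch of that expression, including the case where it stops working, assuming simple non-nested elements. The name simple_element and the sample strings are made up for the example.

```python
import re

# Group 1 is the tag name, group 2 the text content.
simple_element = re.compile(r'<([^>]+)[^>]*>([^<]*)</\1>')

# Plain elements, with or without attributes, are found.
print(simple_element.findall('<name>Alice</name><a href="x">link</a>'))

# A literal '<' in the content defeats the [^<]* part, so nothing is found.
print(simple_element.findall('<x>1 < 2</x>'))
```

For the element with an attribute, the engine first tries to take the whole tag contents as group 1 and only settles on the bare tag name after backtracking, so this can be slow on long tags.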
also see
https://blog.codinghorror.com/parsing-html-the-cthulhu-way/
for a discussion of using regex to parse html
executive summary: don't