Python and Regex with special characters - regex

I can't get my regex to work as desired in my Python 3 code.
I am trying to parse a file to find a specific pattern (the exact pattern is Total Optimized).
I am doing this because the file can also contain lines that say "Total Optimization (Active)" and other permutations. I have tried the following lines. None of them work:
PkOp = re.compile(r'Total Optimized\t\d')
PkOp = re.compile(r'Total Optimized\t[^(Active)]')
My basic code (simplified here) just prints the matching line. Once that works, I would pick the array item I want, such as
PkOp = PkOpArray[4]
App = re.compile(r'Appliance\s(Active)')
PkOp = re.compile(r"Total Optimized\t\d")
with open("SteelheadMetric2.txt","r") as f:
    with open("mydumbfile.txt","w") as o:
        for line in f:
            line = line.lstrip()
            matches = PkOp.findall(line)
            for firestick in matches:
                PkOpArray = line.split()
                PkOp = PkOpArray
                print(PkOp)
Mostly I get this error
matches = PkOp.findall(line)
AttributeError: 'list' object has no attribute 'findall'
If I remove the backslash sequences I can get it to show lines with 'Total Optimization' or 'Appliance' or whatever. I just can't be more specific about what I want.
What am I missing? It works fine if I compile a plain text string, but it fails as soon as I use special regex sequences such as whitespace, tab, or digit. The regex checks out in Notepad++.

When you write PkOp = PkOpArray, you rebind the name PkOp from your compiled regex to a list, so the next call to PkOp.findall(line) raises the AttributeError.
If you delete that line and change print(PkOp) to print(PkOpArray), it should fix your problem, assuming the rest of your code is correct.
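The fix above can be sketched as follows. This is a minimal illustration, not the asker's full script: the in-memory sample text and the names pk_op_pattern, fields, and matching_rows are assumptions made so the snippet runs on its own. The point is only that the compiled pattern and the split fields live under different names.

```python
import io
import re

# Hypothetical sample data standing in for SteelheadMetric2.txt.
sample = io.StringIO(
    "Total Optimization (Active)\t1\n"
    "Total Optimized\t42\t7\n"
    "Appliance (Active)\n"
)

# The compiled pattern keeps its own name for the whole loop.
pk_op_pattern = re.compile(r"Total Optimized\t\d")

matching_rows = []
for line in sample:
    line = line.lstrip()
    if pk_op_pattern.search(line):
        # Store the split fields under a different name so the
        # pattern object is never overwritten by a list.
        fields = line.split()
        matching_rows.append(fields)

print(matching_rows)
```

Because the pattern is never rebound, the loop can keep calling it on every line.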

Related

Conditionally extracting the beginning of a regex pattern

I have a list of strings containing the names of actors in a movie that I want to extract. In some cases, the actor's character name is also included which must be ignored.
Here are a couple of examples:
# example 1
input = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
expected_output = ['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']
# example 2
input = 'Yoosuf Shafeeu\nAhmed Saeed\nMohamed Manik'
expected_output = ['Yoosuf Shafeeu', 'Ahmed Saeed', 'Mohamed Manik']
Here is what I've tried to no avail:
import re
output = re.findall(r'(?:\\n)?([\w ]+)(?= as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?: as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?:(?= as )|(?! as ))', input)
The \n in the input string are new line characters. We can make use of this fact in our regex.
Essentially, each line always begins with the actor's name. After the actor's name, there is either the word as or the end of the line.
Using this info, we can write the regex like this:
^(?:[\w ]+?)(?:(?= as )|$)
First, we assert that we must be at the start of the line ^. Then we match some word characters and spaces lazily [\w ]+?, until we see (?:(?= as )|$), either as or the end of the line.
In code,
output = re.findall(r'^(?:[\w ]+?)(?:(?= as )|$)', input, re.MULTILINE)
Remember to use the multiline option. That is what makes ^ and $ mean "start/end of line".
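As a quick check, here is a small runnable sketch of that pattern applied to both examples from the question. The helper name extract_actors is an assumption added for illustration. The pattern itself is the one given above.

```python
import re

# Same pattern as above: lazily match name characters from the start of
# a line until either " as " follows or the line ends.
ACTOR_PATTERN = re.compile(r'^(?:[\w ]+?)(?:(?= as )|$)', re.MULTILINE)

def extract_actors(cast_text):
    """Return the actor name found at the start of each line."""
    return ACTOR_PATTERN.findall(cast_text)

example_1 = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
example_2 = 'Yoosuf Shafeeu\nAhmed Saeed\nMohamed Manik'

print(extract_actors(example_1))
print(extract_actors(example_2))
```

Since the pattern has no capturing groups, findall returns the whole match for each line.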
You can do this without using a regular expression as well.
Here is the code (splitting on ' as ' with both spaces, so a name that merely contains " as" is not cut short):
output = [x.split(' as ')[0] for x in input.split('\n')]
You can also combine the values obtained from two alternative regex matches:
output = re.findall(r'(?:\n)?(.+)(?:\W[a][s].*?)|(?:\n)?(.+)$', input)
which gives
[('Levan Gelbakhiani', ''), ('Ana Javakishvili', ''), ('', 'Anano Makharadze')]
from which you filter out the empty strings:
output = list(map(lambda x: list(filter(len, x))[0], output))
which gives
['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']

How to replace the periods of just URL(s) and/or email address(s) buried in text

I am using the great answer provided by D Greenberg in the Stack Overflow Q&A "Python split text on sentences" to split text into sentences. I would like help augmenting one part of it.
The overall code uses a series of regular expressions to recognize abbreviations, acronyms, websites, prefixes (Mr., Mrs., etc.) and other non-sentence endings, and changes u'.' into u'<prd>'. Any u'.' left unchanged must be a period that ends a sentence.
The regex that recognizes websites only works for URLs of the form text.(com|org|gov...). It doesn't work for text1.text2.text3.(com|org|gov...). May I have some help in making this work?
I have edited the original code to just the relevant section:
def split_into_sentences(text):
    prefixes = u"(Mr|St|Mrs|Ms|Dr)[.]"
    suffixes = u"(Inc|Ltd|Jr|Sr|Co)"
    websites = u"[.](com|net|org|io|gov)"
    digits = u"([0-9])"
    text = text.replace(u"\n",u" ")
    text = re.sub(prefixes,u"\\1<prd>",text)
    text = re.sub(websites,u"<prd>\\1",text)
    text = re.sub(digits + u"[.]" + digits,u"\\1<prd>\\2",text)
    if u"Ph.D" in text: text = text.replace(u"Ph.D.",u"Ph<prd>D<prd>")
    text = text.replace(u".",u".<stop>")
    text = text.replace(u"?",u"?<stop>")
    text = text.replace(u"!",u"!<stop>")
    text = text.replace(u"<prd>",u".")
    sentences = text.split(u"<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences
I believe the following re will find a full URL or email address (I know there are more domains possible and I will augment if needed)
websites = ur"([\w#-]+[.])+(com|net|org|io|gov)"
What I can't figure out how to do is change the text = re.sub(websites,u"<prd>\\1",text) to accomplish what I want: in the portions of text that match the website pattern, change all of the u'.' into u'<prd>'
You can use your pattern to match each such substring and pass a function (here a lambda) as the second argument to re.sub. The function receives every match and returns the replacement text for it:
result = re.sub(websites, lambda x: x.group().replace(u".", u"<prd>"),text)
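A minimal Python 3 sketch of that callback approach follows. The function name protect_url_periods and the sample sentence are assumptions made for illustration. The pattern is the one from the question, written as a plain raw string because the ur"" prefix is not valid in Python 3.

```python
import re

# Pattern from the question: one or more "label." parts followed by a
# known top-level domain.
WEBSITES = re.compile(r"([\w#-]+[.])+(com|net|org|io|gov)")

def protect_url_periods(text):
    """Replace every '.' inside a matched website with '<prd>'."""
    # The lambda receives each match object; only the matched substring
    # is rewritten, so sentence-ending periods are left untouched.
    return WEBSITES.sub(lambda m: m.group().replace(".", "<prd>"), text)

sample = "Visit docs.python.org today. It helps."
print(protect_url_periods(sample))
```

The periods after "today" and "helps" stay as they are because neither is followed by one of the listed domains.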

Find and remove specific string from a line

I am hoping to receive some feedback on some code I have written in Python 3 - I am attempting to write a program that reads an input file which has page numbers in it. The page numbers are formatted as: "[13]" (this means you are on page 13). My code right now is:
pattern='\[\d\]'
for line in f:
    if pattern in line:
        re.sub('\[\d\]',' ')
        re.compile(line)
        output.write(line.replace('\[\d\]', ''))
I have also tried:
for line in f:
    if pattern in line:
        re.replace('\[\d\]','')
        re.compile(line)
        output_file.write(line)
When I run these programs, a blank file is created, rather than a file containing the original text minus the page numbers. Thank you in advance for any advice!
Your if statement won't work because it isn't doing a regex match; it's looking for the literal string \[\d\] in line. Use re.search (re.match only matches at the start of the line) and pass the line to re.sub as its third argument. Also, \d matches a single digit, so a page number like [13] needs \d+.
for line in f:
    # determine if the pattern is found anywhere in the line
    if re.search(r'\[\d+\]', line):
        line = re.sub(r'\[\d+\]', ' ', line)
    output_file.write(line)
Additionally, you're using re.compile() incorrectly. Its purpose is to pre-compile your pattern into a pattern object. This can help if you use the pattern a lot, because the pattern is compiled once up front and then reused on every pass through the loop.
pattern = re.compile(r'\[\d+\]')
if pattern.search(line):
    # ...
Lastly, you're getting a blank file because write() is only reached inside the if block, and that condition is never true. File objects have write() and writelines() but no writeline() method, so call output_file.write() for every line, changed or not.
You don't write unmodified lines to your output.
Try something like this:
if re.search(pattern, line):
    line = re.sub(pattern, '', line)  # remove page number stuff
output_file.write(line) # note that it's not part of the if block above
That's why your output file is empty.
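Putting both answers together, a minimal end-to-end sketch might look like the one below. It uses in-memory text instead of real files so it runs as-is. The names PAGE_MARKER and strip_page_markers and the sample text are assumptions. The \d+ assumes page numbers can have more than one digit, as in the "[13]" example from the question.

```python
import io
import re

# Compiled once and reused for every line; \d+ allows multi-digit pages.
PAGE_MARKER = re.compile(r"\[\d+\]")

def strip_page_markers(lines):
    """Yield every line, with markers such as [13] removed where present."""
    for line in lines:
        # sub() returns the line unchanged when nothing matches, so no
        # separate "if" test is needed and no line is ever dropped.
        yield PAGE_MARKER.sub("", line)

# Stand-ins for the input and output files.
source = io.StringIO("First line [13]\nNo marker here\n[7] starts this one\n")
output = io.StringIO()
output.writelines(strip_page_markers(source))

print(output.getvalue())
```

With real files, source and output would be the objects returned by open().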

Python script to extract data from text file

I have a text file that contains a list of website links, like this:
test.txt:
http://www.site1.com/
http://site232546ee.com/
https://www.site3eiue213.org/
http://site4.biz/
I want to write a simple Python script that extracts only the site names, cut to a length of at most 8 characters. The output should look like this:
output.txt:
site1
site2325
site3eiu
site4
I have written some code:
txt1 = open("test.txt").read()
txt2 = txt1.split("http://www.")
f = open('output.txt', 'w')
for us in txt2:
    f.write(us)
print('./done')
but I don't know how to split() on more than one delimiter in one line. I also tried the re module, but I couldn't work out how to write the code for it.
Can someone please help me write this script? :(
You can achieve this using a regular expression, as below.
import re
no = 8  # maximum number of characters to keep
# scheme plus an optional "www." prefix; the dot is escaped so it matches literally
regesx = r"\bhttps?://(?:www\.)?"
text = "http://site232546ee.com/"
match = re.search(regesx, text)
start = match.end(0)
end = start + no
string1 = text[start:end]
end = string1.find('.')
if end > 0:
    final = string1[0:end]
else:
    final = string1
print(final)
You said you want to extract site names with 8 characters, but the output.txt example shows truncated pieces of domain names. If you instead want to keep only the domain names that have eight or fewer characters, here is a solution.
Step 1: Get all the domain names.
import tldextract
import pandas as pd
text_s=''
list_u=('http://www.site1.com/','http://site232546ee.com/','https://www.site3eiue213.org/','http://site4.biz/')
#http:\//www.(\w+).*\/?
for l in list_u:
    extracted = tldextract.extract(l)
    text_s += extracted.domain + ' '
print(text_s)  # gives a string of domain names delimited by whitespace
Step 2: Filter domain names with 8 or fewer characters.
word = text_s.split()
lent = [len(x) for x in text_s.split()]
word_len_list = pd.DataFrame(
    {'words': word,
     'char_length': lent,
    })
word_len_list[(word_len_list.char_length <= 8)]
Output looks like this:
   words  char_length
0  site1            5
3  site4            5
Disclaimer: I am new to Python. Please ignore any unnecessary and/or stupid steps I may have written
Have you tried printing txt2 before doing anything with it? You will see that it did not do what (I expect) you wanted it to do, since there's only one "http://www." available in the text. Try to split at a newline \n. That way you get a list of all the urls.
Then, for each url you'll want to strip the front and back, which you can do with regular expression but which can be quite hard, depending on what you want to be able to strip off. See here.
When you have found a regular expression that works for you, simply check the length of each domain and use an if statement to write the domains that satisfy your condition to a file (if len(domain) <= 8: f.write(domain))
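For the truncating behaviour shown in the question's output.txt, one possible sketch uses the standard library's urllib.parse instead of a hand-written regex. That is a swapped-in technique, not what the answers above use. The helper name short_site_name and the max_len parameter are assumptions. It also assumes the site name is the first host label after an optional "www." prefix.

```python
from urllib.parse import urlparse

def short_site_name(url, max_len=8):
    """Return the first host label (without 'www.'), cut to max_len characters."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    # Keep only the part before the first dot, then truncate it.
    return host.split(".")[0][:max_len]

urls = [
    "http://www.site1.com/",
    "http://site232546ee.com/",
    "https://www.site3eiue213.org/",
    "http://site4.biz/",
]
names = [short_site_name(u) for u in urls]
print("\n".join(names))
```

To produce output.txt, the same names could be written out one per line with f.write(name + "\n").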

how to use pyparsing to match multiple lines while using iterator to read the file

In the definition of my Pyparsing grammar, there are some grammars which will match strings that span multiple lines.
If I use the api like:
PyGrammar.parseString(open('file_name').read())
it behaves correctly.
However if I want to use the iterator to read the file like
with open('file_name') as f:
    for line in f:
        PyGrammar.parseString(line)
the parser breaks.
Is there a way to work around this? Thanks.
According to Paul (the author of pyparsing):
with open('file_name') as f:
    for line in f:
        PyGrammar.parseString(line)
The code above is not the correct way to use pyparsing. Pyparsing needs to see the whole source text before parsing it, so calling parseString on each line separately cannot match expressions that span several lines. One workaround is to wrap the parser in a loop that buffers the input, like this:
# set up a generator to yield a line of text at a time
linegenerator = open('big_hairy_file.txt')
# buffer will accumulate lines until a fully parseable piece is found
buffer = ""
for line in linegenerator:
    buffer += line
    match = next(grammar.scanString(buffer), None)
    while match:
        tokens, start, end = match
        print(tokens.asList())
        # drop the consumed text and look for another match in what is left
        buffer = buffer[end:]
        match = next(grammar.scanString(buffer), None)
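The buffering idea does not depend on pyparsing itself. Below is a self-contained sketch of the same accumulate-then-consume loop, using the standard re module as a stand-in for the grammar so that it runs without third-party packages. The BEGIN ... END record format, the RECORD pattern, and the parse_stream helper are all assumptions made for illustration, not part of pyparsing.

```python
import re

# Stand-in "grammar": a record that may span several lines,
# written as BEGIN <word> END.
RECORD = re.compile(r"BEGIN\s+(\w+)\s+END")

def parse_stream(lines):
    """Yield each record name as soon as the buffer holds a complete record."""
    buffer = ""
    for line in lines:
        buffer += line
        match = RECORD.search(buffer)
        while match:
            yield match.group(1)
            # Discard everything up to the end of the match, then retry,
            # since one line may complete more than one record.
            buffer = buffer[match.end():]
            match = RECORD.search(buffer)

lines = ["BEGIN\n", "alpha\n", "END\n", "BEGIN beta\n", "END BEGIN gamma END\n"]
records = list(parse_stream(lines))
print(records)
```

One trade-off of this design is that the buffer keeps growing while no match is found, so it suits input where complete records turn up regularly.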