Python script to extract data from text file - python-2.7

I have a text file which contains a list of website links, like
test.txt:
http://www.site1.com/
http://site232546ee.com/
https://www.site3eiue213.org/
http://site4.biz/
I want to make a simple Python script which extracts only the site names, keeping at most 8 characters of each name (no name longer than 8 characters). The output should look like:
output.txt:
site1
site2325
site3eiu
site4
I have written some code:
txt1 = open("test.txt").read()
txt2 = txt1.split("http://www.")
f = open('output.txt', 'w')
for us in txt2:
    f.write(us)
print './done'
but I don't know how to split() on more than one delimiter at once... I also tried the re module, but I couldn't work out how to write the code with it.
Can someone please help me make this script? :(

You can achieve this using a regular expression, as below.
import re

no = 8
# Escape the dots so they only match literal dots
regex = r"\bhttp://www\.|\bhttp://|\bhttps://www\."
text = "http://site232546ee.com/"
match = re.search(regex, text)
start = match.end(0)      # index just after the URL prefix
end = start + no          # take at most 8 characters
string1 = text[start:end]
end = string1.find('.')   # cut at the first dot, if any
if end > 0:
    final = string1[0:end]
else:
    final = string1
print(final)
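To process the whole file, a minimal sketch (assuming the test.txt and output.txt names from the question) could apply the same logic line by line:

import re

# Matches any of the prefixes handled above
prefix = re.compile(r"https?://(?:www\.)?")

with open("test.txt") as infile, open("output.txt", "w") as outfile:
    for line in infile:
        match = prefix.search(line)
        if not match:
            continue
        name = line[match.end():match.end() + 8]  # at most 8 characters
        dot = name.find('.')                      # stop before the domain suffix
        if dot > 0:
            name = name[:dot]
        outfile.write(name + '\n')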

You said you want to extract site names with 8 characters, but the output.txt example shows bits of domain names. If you want to filter out domain names which have eight or fewer characters, here is a solution.
Step 1: Get all the domain names.
import tldextract
import pandas as pd

text_s = ''
list_u = ('http://www.site1.com/', 'http://site232546ee.com/',
          'https://www.site3eiue213.org/', 'http://site4.biz/')

for l in list_u:
    extracted = tldextract.extract(l)
    text_s += extracted.domain + ' '

print(text_s)  # gives a string of domain names delimited by whitespace
Step 2: Filter domain names with 8 or fewer characters.
word = text_s.split()
lent = [len(x) for x in word]
word_len_list = pd.DataFrame({
    'words': word,
    'char_length': lent,
})
word_len_list[(word_len_list.char_length <= 8)]
Output looks like this:
   words  char_length
0  site1            5
3  site4            5
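For what it's worth, the same filter can also be written without pandas as a plain list comprehension, a minimal sketch:

short_names = [w for w in text_s.split() if len(w) <= 8]
print(short_names)  # ['site1', 'site4']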
Disclaimer: I am new to Python. Please ignore any unnecessary and/or stupid steps I may have written

Have you tried printing txt2 before doing anything with it? You will see that it did not do what (I expect) you wanted it to do, since there's only one "http://www." available in the text. Try splitting at a newline \n instead. That way you get a list of all the URLs.
Then, for each URL you'll want to strip the front and back, which you can do with a regular expression, but which can be quite hard depending on what you want to be able to strip off. See here.
When you have found a regular expression that works for you, simply check each domain for its length and write the domains that satisfy your condition to a file, using an if statement (if len(domain) <= 8: f.write(domain)).
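Putting those steps together, a rough sketch (the two regexes here are only an assumption about what you want to strip off) might be:

import re

front = re.compile(r"^https?://(?:www\.)?")  # strip the scheme and optional www.
back = re.compile(r"\..*$")                  # strip everything from the first dot on

with open("test.txt") as f, open("output.txt", "w") as out:
    for url in f.read().split("\n"):
        domain = back.sub("", front.sub("", url.strip()))
        if 0 < len(domain) <= 8:
            out.write(domain + "\n")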

Related

Making a text file which will contain my list items and applying regular expression to it

I am supposed to write code which reads a text file containing some words with some common linguistic features, applies some regular expressions to all of the words, and writes one file which contains the changed words.
For now let's say my text file named abcd.txt has these words
king
sing
ping
cling
booked
looked
cooked
packed
My first question starts here: how should I write these words in my simple text file to get the results mentioned above? Shall I write them line-separated or comma-separated?
This is the code provided by user palvarez.
import re

with open("new_abcd", "w+") as new, open("abcd") as original:
    for word in original:
        new_word = re.sub("ing$", "xyz", word)
        new.write(new_word)
Can I add something like -
with open("new_abcd", "w+") as file, open("abcd") as original:
    for word in original:
        new_aword = re.sub("ed$", "abcd", word)
        new.write(new_aword)
in the same code file? I want something like -
kabc
sabc
pabc
clabc
bookxyz
lookxyz
cookxyz
packxyz
PS - I don't know whether mentioning this is necessary or not, but I am supposed to do this for Devanagari, a Unicode-supported script. I didn't use it in my examples here because many of us can't read the script. Additionally, that script uses some diacritics, e.g. 'का' has one consonant character 'क' and one vowel sign 'ा' which together make 'का'. In my regular expressions I need to handle the diacritics.
I think the approach you have, with one word per line, is better since you don't have to trouble yourself with delimiters and stripping.
With a file like this:
king
sing
ping
cling
booked
looked
cooked
packed
And code like this, using re.sub to replace a pattern:

import re

with open("new_abcd.txt", "w") as new, open("abcd.txt") as original:
    for word in original:
        new_word = re.sub("ing$", "xyz", word)
        new_word = re.sub("ed$", "abcd", new_word)
        new.write(new_word)
It creates a resulting file:
kxyz
sxyz
pxyz
clxyz
bookabcd
lookabcd
cookabcd
packabcd
I tried it out with the diacritic you gave us and it seems to work fine:
print(re.sub("ा$", "ing", "का"))
>>> कing
EDIT: added multiple replacements. You can put your replacements in a list and iterate over it, calling re.sub, as follows.
import re

# List of pairs: first is the pattern, second is the replacement string
replacements = [("ing$", "xyz"), ("ed$", "abcd")]

with open("new_abcd.txt", "w") as new, open("abcd.txt") as original:
    for word in original:
        new_word = word
        for pattern, replacement in replacements:
            new_word = re.sub(pattern, replacement, word)
            if new_word != word:
                break
        new.write(new_word)
This limits modifications to one per word: only the first pattern that changes the word is applied.
For starters, it is recommended that you use the with context manager to open your file; this way you do not need to explicitly close the file once you are done with it.
Another advantage is that you can then process the file line by line, which will be very useful if you are working with larger sets of data. Whether you write the results on a single line or in CSV format will depend on the requirements of your output and how you want to process it further.
As an example, to read from a file and, say, substitute a substring, you can use re.sub.
import re

with open('abcd.txt', 'r') as f:
    for line in f:
        # do something here
        print(re.sub("ing$", 'ring', line.strip()))
>>
kring
sring
pring
clring
Another nifty trick is to manage both the input and the output with the same context manager, like:
import re

with open('abcd.txt', 'r') as f, open('out_abcd.txt', 'w') as o:
    for line in f:
        # notice that we add '\n' to write each output to a new line
        o.write(re.sub("ing$", 'ring', line.strip()) + '\n')
This creates an output file with your new contents in a very memory-efficient way.
If you'd like to write to a CSV file or any other specific format, I highly suggest you spend some time understanding Python's input and output functions here. If linguistics in text is what you are going for, also take time to understand the encoding of different languages and further study Python's regex operations.

Is there a regular expression for finding all question sentences from a webpage?

I am trying to extract some questions from a web site using BeautifulSoup, and want to use a regular expression to get these questions from the web. Is my regular expression incorrect? And how can I combine soup.find_all with re.compile?
I have tried the following:
from bs4 import BeautifulSoup
import requests
from urllib.request import urlopen
import urllib
import re

url = "https://www.sanfoundry.com/python-questions-answers-variable-names/"
headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(url)
soup = BeautifulSoup(page.text, "lxml")
a = soup.find_all("p")
for m in a:
    print(m.get_text())
Now I have some text containing questions like "1. Is Python case sensitive when dealing with identifiers?". I want to use r"[^.!?]+\?" to filter out the unwanted text, but I get the following error:
a = soup.find_all("p" : re.compile(r'[^.!?]+\?'))
                      ^
SyntaxError: invalid syntax
I checked my regular expression on https://regex101.com and it seems right. Is there a way to combine the regular expression and soup.find_all?
One method to find p elements containing a ? is to define a criterion function:

def criterion(tag):
    return tag.name == 'p' and re.search(r'\?', tag.text)

and use it in find_all:

pars = soup.find_all(criterion)

But you want to print only the questions, not the whole paragraphs from pars.
To match these questions, define a pattern:

pat = re.compile(r'\d+\.\s[^?]+\?')

(a sequence of digits, a dot, a space, then a sequence of chars other than ? and finally a ?).
Note that in the general case one paragraph may contain multiple questions, so the loop processing the paragraphs found should:
- use findall to find all questions in the current paragraph (the result is a list of found strings),
- print all of them on separate lines, so you should use join with '\n' as a separator.
So the whole loop should be:
for m in pars:
    questions = pat.findall(m.get_text())
    print('\n'.join(questions))
Not a big regex fan, so I tried this:

for q in a:
    for i in q:
        if '?' in i:
            print(i)
Output:
1. Is Python case sensitive when dealing with identifiers?
2. What is the maximum possible length of an identifier?
3. Which of the following is invalid?
4. Which of the following is an invalid variable?
5. Why are local variable names beginning with an underscore discouraged?
6. Which of the following is not a keyword?
8. Which of the following is true for variable names in Python?
9. Which of the following is an invalid statement?
10. Which of the following cannot be a variable?

How to replace the periods of just URL(s) and/or email address(s) buried in text

I am using the great answer provided by D Greenberg in the Stack Overflow Q&A Python split text on sentences to split text into sentences. I would like help augmenting one part of it.
The overall code uses a bunch of regular expressions to recognize abbreviations, acronyms, websites, prefixes (Mr., Mrs., etc.) and other non-sentence endings and changes u'.' into u'<prd>'. All the u'.' that aren't changed must be periods that end sentences.
The re that recognizes websites only works for URLs of the form text.(com|org|gov...). It doesn't work for text1.text2.text3.(com|org|gov...). May I have some help in making this work?
I have edited the original code to just the relevant section:
def split_into_sentences(text):
    prefixes = u"(Mr|St|Mrs|Ms|Dr)[.]"
    suffixes = u"(Inc|Ltd|Jr|Sr|Co)"
    websites = u"[.](com|net|org|io|gov)"
    digits = u"([0-9])"
    text = text.replace(u"\n", u" ")
    text = re.sub(prefixes, u"\\1<prd>", text)
    text = re.sub(websites, u"<prd>\\1", text)
    text = re.sub(digits + u"[.]" + digits, u"\\1<prd>\\2", text)
    if u"Ph.D" in text: text = text.replace(u"Ph.D.", u"Ph<prd>D<prd>")
    text = text.replace(u".", u".<stop>")
    text = text.replace(u"?", u"?<stop>")
    text = text.replace(u"!", u"!<stop>")
    text = text.replace(u"<prd>", u".")
    sentences = text.split(u"<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences
I believe the following regex will find a full URL or email address (I know more domains are possible and I will augment it if needed):
websites = ur"([\w#-]+[.])+(com|net|org|io|gov)"
What I can't figure out how to do is change the text = re.sub(websites,u"<prd>\\1",text) to accomplish what I want: in the portions of text that match the website pattern, change all of the u'.' into u'<prd>'
You may use your pattern to match all those substrings in question and perform a custom search-and-replace on each match, using a lambda expression as the second argument to re.sub:
result = re.sub(websites, lambda x: x.group().replace(u".", u"<prd>"),text)
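For instance, a quick check with a made-up sentence (the pattern is the one proposed above) would look like:

import re

websites = r"([\w#-]+[.])+(com|net|org|io|gov)"
text = u"Visit images.example.com today. It is great."
result = re.sub(websites, lambda x: x.group().replace(u".", u"<prd>"), text)
print(result)
# Visit images<prd>example<prd>com today. It is great.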

Python and Regex with special characters

I can't get my regex to work as desired in my Python 3 code.
I am trying to parse a file to find a specific pattern (the exact pattern is Total Optimized).
I am doing this because the file can contain lines which say """Total Optimization (Active)""" and other permutations. I have tried the following lines; none work:
PkOp = re.compile(r'Total Optimized\t\d')
PkOp = re.compile(r'Total Optimized\t[^(Active)]')
My basic code (simplified here) just prints the matching line out. If I got that working I would then choose the array item I wanted, such as PkOp = PkOpArray[4]:
App = re.compile(r'Appliance\s(Active)')
PkOp = re.compile(r"Total Optimized\t\d")

with open("SteelheadMetric2.txt", "r") as f:
    with open("mydumbfile.txt", "w") as o:
        for line in f:
            line = line.lstrip()
            matches = PkOp.findall(line)
            for firestick in matches:
                PkOpArray = line.split()
                PkOp = PkOpArray
                print(PkOp)
Mostly I get this error:
matches = PkOp.findall(line)
AttributeError: 'list' object has no attribute 'findall'
If I remove the backslash sequences I can get it to show lines with 'Total Optimization' or 'Appliance' or whatever. I just can't be more specific about what I want.
What am I missing? It works fine if I just compile a plain text string, but when I use special regex tokens like whitespace, tab, or digit it fails. The regex checks out in Notepad++.
When you write PkOp = PkOpArray, you replace your compiled regex with a list, so on the next iteration of the loop PkOp.findall(line) fails.
If you delete that line, and change your print(PkOp) to print(PkOpArray), it should fix your problem, assuming the rest of your code is correct.
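In other words, a corrected sketch of the loop (keeping the file and variable names from the question) would be:

import re

PkOp = re.compile(r"Total Optimized\t\d")

with open("SteelheadMetric2.txt", "r") as f:
    for line in f:
        line = line.lstrip()
        matches = PkOp.findall(line)
        for firestick in matches:
            PkOpArray = line.split()
            print(PkOpArray)  # PkOp itself stays a compiled regex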

regex to strip out image urls?

I need to separate out a bunch of image URLs from a document in which the images are associated with names like this:
bellpepper = "http://images.com/bellpepper.jpg"
cabbage = "http://images.com/cabbage.jpg"
lettuce = "http://images.com/lettuce.jpg"
pumpkin = "http://images.com/pumpkin.jpg"
I assume I can detect the start of a link with:
/http:[^ ,]+/i
But how can I get all of the links separated from the document?
EDIT: To clarify the question: I just want to strip out the URLs from the file minus the variable name, equals sign and double quotes so I have a new file that is just a list of URLs, one per line.
Try this...
(http://)([a-zA-Z0-9\/\\.])*
If the format is constant, then this should work (Python):
import re
s = """bellpepper = "http://images.com/bellpepper.jpg" (...) """
re.findall("\"(http://.+?)\"", s)
Note: this is not "find an image in a file" regexp, just an answer to the question :)
Do you mean to say you have that kind of format in your document and you just want to get the http part? You can just split on the " = " delimiter without regex:
$f = fopen("file","r");
if ($f){
    while( !feof($f) ){
        $line = fgets($f,4096);
        $s = explode(" = ",$line);
        $s = preg_replace("/\"/","",$s);
        print $s[1];
    }
    fclose($f);
}
On the command line:
#php5 myscript.php > newfile.ext
If you are using a language other than PHP, there are similar string-splitting methods you can use, e.g. Python's or Perl's split(); please read your documentation to find out. A Python version of the same idea is sketched below.
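A minimal Python sketch of the same split-and-strip idea (the file names are assumptions) might be:

# Split each line on " = " and strip the surrounding double quotes
with open("file") as f, open("newfile.ext", "w") as out:
    for line in f:
        parts = line.split(" = ")
        if len(parts) == 2:
            out.write(parts[1].strip().strip('"') + "\n")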
You may try this, if your tool supports positive lookbehind:
/(?<=")[^"\n]+/