Reading a list of URLs into a Python list - python-2.7

I am new to Python and want to read a list of URLs into a Python list. Below is my script:
if __name__ == 'main' :
    urls_list = [] #Creating the URL vector
    with open('url_list.doc') as my_file :
        for line in my_file :
            urls_list.append(line)
    print (urls_list)
The above code produces neither output nor an error. Below is url_list.doc:
https://www.nytimes.com/2017/03/03/us/politics/majority-rule-means-the-power-to-stop-not-just-start-an-investigation.html
https://www.nytimes.com/2017/03/03/business/retirement/millennials-ask-elders-about-retirement.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=photo-spot-region&region=top-news&WT.nav=top-news
https://www.nytimes.com/2017/03/02/us/politics/jeff-sessions-russia-trump-investigation-democrats.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news
https://www.nytimes.com/2017/03/02/business/trump-pentagon-budget.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news
https://www.nytimes.com/2017/03/03/style/modern-love-you-may-want-to-marry-my-husband.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=second-column-region&region=top-news&WT.nav=top-news
Any suggestions on how to do this?

To parse .doc files you can use Antiword. It is a standalone command-line executable, so it has to be run as a subprocess with Popen and PIPE.
Code:
from subprocess import Popen, PIPE
file_path = "Your file path here"
cmd = ['antiword', file_path]
p = Popen(cmd, stdout=PIPE)
stdout, stderr = p.communicate()
text = stdout.decode('ascii', 'ignore')
text now has all the content of the .doc file.
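Separately from the .doc parsing, the script in the question prints nothing because it compares __name__ with 'main'. When a file is run as a script the value is '__main__', so the guarded block is silently skipped. A minimal sketch, assuming the URLs are saved as plain text, one per line, in a hypothetical url_list.txt (the file name and the read_urls helper are not from the question):

```python
import os


def read_urls(path):
    """Return the non-empty lines of a plain-text file, stripped of newlines."""
    with open(path) as my_file:
        return [line.strip() for line in my_file if line.strip()]


if __name__ == '__main__':  # double underscores on both sides of main
    # 'url_list.txt' is an assumed file name, not taken from the question.
    if os.path.exists('url_list.txt'):
        print(read_urls('url_list.txt'))
```

Stripping each line also removes the trailing newline that line-by-line iteration would otherwise leave at the end of every URL.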

Related

read instructions from file python

I have this code:
import os
def inplace_change(filename, old_string, new_string):
    # Safely read the input filename using 'with'
    with open(filename, 'r') as f:
        s = f.read()
    if old_string not in s:
        print('"{old_string}" not found in {filename}.'.format(**locals()))
        return
    else:
        # Safely write the changed content, if found in the file
        with open(filename, 'w') as f:
            s = s.replace(old_string, new_string)
            f.write(s)
path = raw_input("Enter the file's full path: ")
old = raw_input("String to change: ")
new = raw_input("change to: ")
print "********************************"
print "**** WORKING... PLEASE WAIT ****"
print "********************************\n"
for file in os.listdir(path):
    filename = os.path.join(path,file)
    inplace_change(filename, old, new)
os.system("pause")
As you can see, the code replaces one substring in each file with another. I want it to take the replacements from a text file instead, an "instruction file" that says what to change.
The text file will look like this:
"old_string" "new_string"
"old_string" "new_string"
"old_string" "new_ string"
The result should be that every old_string is replaced with its new_string in all files in the directory.
How can I do it?
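One possible approach, sketched under the assumption that every instruction line holds exactly two double-quoted strings: let shlex.split handle the quoting, collect the pairs, and apply them in order. The names load_replacements and apply_replacements are hypothetical helpers, not part of the original script.

```python
import shlex


def load_replacements(instruction_path):
    """Parse lines of the form "old" "new" into a list of (old, new) pairs."""
    pairs = []
    with open(instruction_path) as f:
        for line in f:
            parts = shlex.split(line)  # honours the double quotes
            if len(parts) == 2:        # skip blank or malformed lines
                pairs.append((parts[0], parts[1]))
    return pairs


def apply_replacements(text, pairs):
    """Apply every (old, new) replacement to text, in file order."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text
```

Inside inplace_change, the single s.replace(old_string, new_string) call could then become apply_replacements(s, pairs), with the file rewritten only when the result differs from the original content. Note that shlex.split raises ValueError on a line with an unclosed quote.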

Passing stdin for cpp program using a Python script

I'm trying to write a Python script which
1) Compiles a cpp file.
2) Reads a text file "Input.txt" whose contents have to be fed to the compiled program.
3) Compares the output with the "Output.txt" file and prints "Pass" if all test cases have passed, else prints "Fail".
import subprocess
from subprocess import PIPE
from subprocess import Popen
subprocess.call(["g++", "eg.cpp"])
inputFile = open("input.txt",'r')
s = inputFile.readlines()
for i in s :
    proc = Popen("./a.out", stdin=int(i), stdout=PIPE)
    out = proc.communicate()
    print(out)
For the above code, I'm getting an output like this,
(b'32769', None)
(b'32767', None)
(b'32768', None)
Traceback (most recent call last):
File "/home/zanark/PycharmProjects/TestCase/subprocessEg.py", line 23, in <module>
proc = Popen("./a.out", stdin=int(i), stdout=PIPE)
ValueError: invalid literal for int() with base 10: '\n'
PS :- eg.cpp contains code to increment the number from the "Input.txt" by 2.
Pass the string to communicate() instead, and open your file in binary mode (otherwise Python 3 won't accept it and you'll have to encode the string as bytes yourself):
with open("input.txt",'rb') as input_file:
    for i in input_file:
        print("Feeding {} to program".format(i.decode("ascii").strip()))
        proc = Popen("./a.out", stdin=PIPE, stdout=PIPE)
        out,err = proc.communicate(input=i)
        print(out)
Also, don't convert the input read from the file to an integer; leave it as a string (I suspect you'll have to filter out blank lines, though).
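The comparison step from the question can be sketched along the same lines. This is only an illustration: run_cases and read_cases are hypothetical helpers, and a one-line Python command stands in for ./a.out so the sketch runs without a compiler.

```python
import subprocess
import sys


def read_cases(path):
    """Return the non-blank lines of a file, stripped."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def run_cases(cmd, inputs, expected):
    """Feed each input to cmd on stdin; True only if every output matches."""
    if len(inputs) != len(expected):
        return False
    for given, wanted in zip(inputs, expected):
        proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        out, _ = proc.communicate(input=(given + "\n").encode("ascii"))
        if out.decode("ascii").strip() != wanted.strip():
            return False
    return True


# Stand-in for ./a.out (an assumption): reads a number and prints it plus 2.
stand_in = [sys.executable, "-c", "print(int(input()) + 2)"]
print("Pass" if run_cases(stand_in, ["1", "40"], ["3", "42"]) else "Fail")
```

With the real program, the call would look like run_cases(["./a.out"], read_cases("input.txt"), read_cases("output.txt")); read_cases already drops the blank lines that caused the ValueError in the question.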

Read & write txt file error - 'str' object has no attribute 'name', Polish diacritical chars in path error

I use Python 2.7 on Win 7 Pro SP1.
I am trying this code:
import os
path = "E:/data/keyword"
os.chdir(path)
files = os.listdir(path)
query = "{keyword} AND NOT("
result = open("query.txt", "w")
for file in files:
    if file.endswith(".txt"):
        file_path = file.name
        dane = open(file_path, "r")
        query.append(dane)
        result.append(" OR ")
result.write(query)
result.write(")")
result.close()
I get this error:
file_path = file.name AttributeError: 'str' object has no attribute 'name'
I can't figure out why.
I get a second error when the path contains Polish diacritical characters like "ąęłńóżć". The error occurs for:
path = "E:/Bieżące projekty/keyword"
I tried changing it to:
path =u"E:/Bieżące projekty/keyword"
but it does not help. I'm just starting with Python and I can't work out why this code is not working.
What I want:
Find all text files in the directory.
Join all the text files into one text file named "query.txt",
e.g.
file 1
data1 data2
file 2
data 3 data 4
Output in "query.txt":
data1 data2 data 3 data 4
The code above works fine when the path variable contains no Polish diacritical characters. When I change the path I get this error:
SyntaxError: Non-ASCII character '\xc5' in file query.py on line 9, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
In PEP 263 I found the magic comment. The standard encoding for Polish characters like "ąęłńóźżć" is ISO-8859-2, so I tried adding that encoding to the code. I tried UTF-8 too and got the same error. My whole code is (without the first 5 lines, which are comments describing what the code does):
import os
#path = r"E:/data"
# -*- coding: iso-8859-2 -*-
path = r"E:/Bieżące przedsięwzięcia"
os.chdir(path)
files = os.listdir(path)
query = "{keyword} AND NOT("
for file in files:
    if file.endswith(".txt"):
        dane = open(file, "r")
        text = dane.read()
        query += text
        print(query)
        dane.close()
        query.join(" OR ")
result = open("query.txt", "w")
result.write(query)
result.write(")")
result.close()
On Unicode/UTF-8 character here I found that the Polish character "ż" is encoded in UTF-8 as "\xc5\xbc". Commenting out the path line that contains "ż" with # still gives the error. When I remove the line with this character:
path = r"E:/Bieżące przedsięwzięcia"
the code works fine and I get the result I want.
For editing I use Notepad++ with default settings. The only thing I changed is that, for Python code, a tab is replaced by four spaces.
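Two details probably explain both errors. First, os.listdir returns plain file names as strings, so there is no .name attribute to read. Second, PEP 263 only honours the coding declaration when it sits on the first or second line of the file, and Python 2 rejects non-ASCII bytes anywhere in the source, comments included, when no valid declaration is found. A minimal sketch, assuming the script and the data files are saved as UTF-8 (join_text_files and its parameters are hypothetical names):

```python
# -*- coding: utf-8 -*-
# The declaration above only takes effect on line 1 or 2 of the file (PEP 263),
# and the file itself must really be saved in the declared encoding.
import io
import os


def join_text_files(folder, out_name=u"query.txt"):
    """Concatenate every .txt file in folder (except the output) into out_name."""
    parts = []
    for name in sorted(os.listdir(folder)):
        if name.endswith(u".txt") and name != out_name:
            # os.listdir yields bare names (strings), so join them with the folder.
            with io.open(os.path.join(folder, name), "r", encoding="utf-8") as f:
                parts.append(f.read().strip())
    joined = u" ".join(parts)
    with io.open(os.path.join(folder, out_name), "w", encoding="utf-8") as out:
        out.write(joined)
    return joined


# Hypothetical call; a unicode path keeps non-ASCII folder names intact:
# join_text_files(u"E:/Bieżące projekty/keyword")
```

In Notepad++ the encoding the file is saved in can be checked from the Encoding menu; it has to agree with the declaration on the first line.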
*
Second Question
I tried to find in the Python docs what the r in the path variable means. I can't find it in the Python 2.7 string documentation. Could someone tell me what this part of Python (the u or r before a string value) is called, e.g.
path = u"somedata"
path = r"somedata"?
I would like to find the documentation and read about it.
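These are string literal prefixes, covered under "String literals" in the lexical analysis chapter of the Python language reference. The r prefix marks a raw string, in which a backslash is kept as a literal character instead of starting an escape sequence; the u prefix marks a unicode literal in Python 2. A small illustration (the example values are made up):

```python
# Ordinary literal: backslash followed by n is read as one newline character.
plain = "C:\new"
# Raw literal: the backslash and the n stay as two separate characters.
raw = r"C:\new"
print(len(plain), len(raw))

# u"..." creates a unicode string in Python 2; in Python 3 every str is unicode.
text = u"somedata"
print(type(text).__name__)
```

Raw strings are mostly useful for Windows paths written with backslashes and for regular expressions. With forward slashes, as in the path from the question, the r prefix changes nothing.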

ARFF to CSV multiple files conversions

Has anyone successfully converted many ARFF files to CSV files from the Windows command line?
I tried weka.core.converters.CSVSaver, but it works for a single file only.
Can it be done for multiple files?
I found a way to do this conversion using R, as shown in the following script:
#### Set the default directory to the folder that contains all ARFF files
temp = list.files(pattern="*.arff")
library(foreign)
for (i in 1:length(temp)) assign(temp[i], read.arff(temp[i]))
for(i in 1:length(temp))
{
  mydata=read.arff(temp[i])
  t=temp[i]
  # paste0 avoids the space that paste() would insert before ".csv"
  x=paste0(t,".csv")
  write.csv(mydata,x,row.names=FALSE)
  mydata=0
}
On a Windows command line, type powershell.
Change to the directory where your *.arff files reside.
Enter this command:
dir *.arff | Split-Path -Leaf| ForEach-Object {Invoke-Expression "java -cp 'C:\Program Files\Weka-3-6\weka.jar;.' weka.core.converters.CSVSaver -i $_ -o $_.csv"}
This assumes that your filenames do not contain any blanks, and all arff files reside in a single directory, and you want to convert them all. It will create a new csv file from each arff file. myfile.arff will be exported/converted to myfile.arff.csv
I wrote a simple Python script, arff2csv.py, and put it on GitHub.
Here is the code:
"""trans multi-label *.arff file to *.csv file."""
import re
def trans_arff2csv(file_in, file_out):
    """trans *.arff file to *.csv file."""
    columns = []
    data = []
    with open(file_in, 'r') as f:
        data_flag = 0
        for line in f:
            if line[:2] == '@a':
                # find indices
                indices = [i for i, x in enumerate(line) if x == ' ']
                columns.append(re.sub(r'^[\'\"]|[\'\"]$|\\+', '', line[indices[0] + 1:indices[-1]]))
            elif line[:2] == '@d':
                data_flag = 1
            elif data_flag == 1:
                data.append(line)
    content = ','.join(columns) + '\n' + ''.join(data)
    # save to file
    with open(file_out, 'w') as f:
        f.write(content)
if __name__ == '__main__':
    # setting arff file path
    file_attr_in = r'D:\Downloads\birds\birds-test.arff'
    # setting output csv file path
    file_csv_out = r"D:\Downloads\birds\birds-test.csv"
    # trans
    trans_arff2csv(file_attr_in, file_csv_out)
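Since the question asks about many files at once, the same idea can be wrapped in a loop over a folder. The following is a rough sketch, not the GitHub script itself: arff_to_csv and convert_folder are hypothetical names, and the converter assumes dense ARFF files whose attribute names contain no spaces.

```python
import glob
import os


def arff_to_csv(file_in, file_out):
    """Write the @attribute names as a header row, then the @data rows unchanged."""
    columns, rows, in_data = [], [], False
    with open(file_in) as f:
        for line in f:
            stripped = line.strip()
            lowered = stripped.lower()
            if in_data:
                # Keep data rows; skip blank lines and % comments.
                if stripped and not stripped.startswith('%'):
                    rows.append(stripped)
            elif lowered.startswith('@attribute'):
                # The second whitespace-separated token is the attribute name.
                columns.append(stripped.split()[1].strip('\'"'))
            elif lowered.startswith('@data'):
                in_data = True
    with open(file_out, 'w') as f:
        f.write(','.join(columns) + '\n' + '\n'.join(rows) + '\n')


def convert_folder(folder):
    """Convert every *.arff file in folder; return the CSV paths written."""
    written = []
    for arff_path in sorted(glob.glob(os.path.join(folder, '*.arff'))):
        csv_path = os.path.splitext(arff_path)[0] + '.csv'
        arff_to_csv(arff_path, csv_path)
        written.append(csv_path)
    return written
```

Here myfile.arff becomes myfile.csv rather than myfile.arff.csv, because os.path.splitext drops the original extension before the new one is added.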

Search for Files of a Specific Type

I'm using Python 2.7.2 and PyScriptor to create a script that I hope to use to first find a file of type (.shp) and then check to see if there is a matching file with the same name but with a (.prj) suffix. I'm stuck in the first part. I hope to share this script with others so I am trying to make the starting folder/directory a variable. Here's my script so far:
# Import the operating and system modules.
import os, sys
def Main():
    # Retrieve the starting folder location.
    InFolder = sys.argv[1]
    #InFolder = "C:\\_LOCALdata\\junk"
    os.chdir(InFolder)
    print os.path.exists(InFolder)
    print InFolder
    # Begin reading the files and folders within the starting directory.
    for root, dirs, files in os.walk(InFolder):
        print os.path.exists(root)
        for file in files:
            print file
            if file.endswith(".py"):
                print os.path.join(root, file)
            elif file.endswith(".xlsx"):
                print os.path.join(root, file)
            elif file.endswith(".shp"):
                print "Shapefile present."
            elif file.endswith(".txt"):
                print "Text file present."
            else:
                print "No Python, Excel or Text files present."
    return()
print "End of Script."
sys.exit(0)
if __name__ == '__main__':
    sys.argv = [sys.argv[0], r"C:\_LOCALdata\junk"]
    Main()
The only result I have gotten so far is:
*** Remote Interpreter Reinitialized ***
>>>
End of Script.
Exit code: 0
>>>
Any suggestions?
Layne
You're exiting your script before it even has a chance to get started! Remove or indent these lines:
print "End of Script."
sys.exit(0)
Starting procedurally from the top (like Python does), we:
Import libraries.
Define the Main function.
Print a line, and immediately exit!
Your __name__ == '__main__' check is never reached.
Oh man - duh. Thanks for the input. I did get it to work. The code searches for shapefiles (.shp) and then looks for the matching projection file (.prj) and prints the shapefiles that do not have a projection file. Here is the finished code:
# Import the operating and system modules.
import os, sys
def Main():
    # Retrieve the starting folder location.
    InFolder = sys.argv[1]
    text = (InFolder + "\\" + "NonProjected.txt")
    outFile = open(text, "w")
    # Begin reading the files and folders within the starting directory.
    for root, dirs, files in os.walk(InFolder):
        for file in files:
            if file.endswith(".shp"):
                PRN = os.path.join(root,(str.rstrip(file,"shp") + "prj"))
                if os.path.exists(PRN):
                    pass
                else:
                    outFile.write((str(os.path.join(root,file))) + "\n")
                    print os.path.join(root,file)
            else:
                pass
    outFile.close()
    print
    print "End of Script."
    sys.exit(0)
    return()
#================================================================================
if __name__ == '__main__':
    sys.argv = [sys.argv[0], r"G:\GVI2Hydro"]
    #sys.argv = [sys.argv[0], r"C:\_LOCALdata"]
    #sys.argv = [sys.argv[0], r"U:"]
    Main()
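A possible variation on the finished code: str.rstrip(file, "shp") removes a set of trailing characters rather than a suffix, and only behaves here because the dot stops it. os.path.splitext states the intent more directly. A minimal sketch, where shapefiles_without_prj is a hypothetical helper that returns the paths instead of writing NonProjected.txt:

```python
import os


def shapefiles_without_prj(start_folder):
    """Return paths of .shp files under start_folder that lack a sibling .prj."""
    missing = []
    for root, dirs, files in os.walk(start_folder):
        for name in files:
            base, ext = os.path.splitext(name)
            if ext.lower() == ".shp":
                # The projection file shares the base name, in the same folder.
                if not os.path.exists(os.path.join(root, base + ".prj")):
                    missing.append(os.path.join(root, name))
    return sorted(missing)
```

The returned list can be printed or written to a file one path per line, as the finished script does.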