How to get text from a .doc file using Python 2.7

I am extracting text from a .docx file using the following code:
import docx
def getText(filename):
    # Collect the text of every paragraph and join it, one paragraph per line.
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
data = getText(file_path)
Now I also want to extract text from .doc files in my Django REST API hosted on PythonAnywhere. Because the API runs on PythonAnywhere, I am unable to install the textract library or antiword. How can I do it?

abiword is installed on PythonAnywhere:
abiword --to=txt myfile.doc
will produce a file called myfile.txt.
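If you want to run that conversion from inside the API rather than from a shell, a minimal sketch could look like the following. The helper names abiword_command and doc_to_text are made up for illustration, and the sketch assumes that abiword is on the PATH, that it writes the .txt file next to the input file, and that the output is UTF-8 encoded.

```python
import io
import os
import subprocess


def abiword_command(doc_path):
    # Same command line as above, built as an argument list for subprocess.
    return ["abiword", "--to=txt", doc_path]


def doc_to_text(doc_path):
    # Run the conversion; check_call raises CalledProcessError on failure.
    subprocess.check_call(abiword_command(doc_path))
    # Assumption: the output file sits next to the input, with a .txt extension.
    txt_path = os.path.splitext(doc_path)[0] + ".txt"
    # Assumption: the converted text is UTF-8 encoded.
    with io.open(txt_path, encoding="utf-8") as handle:
        return handle.read()
```

Calling doc_to_text(file_path) would then play the same role for .doc files as getText does for .docx files.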

Related

Django, Store jpg file received as string in http POST

I am receiving an HTTP request from a desktop application with a screenshot. I cannot speak with the developer or see the source code, so all I have is the HTTP request I am getting.
The file isn't in request.FILES; it is in request.POST.
from django.views.decorators.csrf import csrf_exempt
@csrf_exempt
def create_contract_event_handler(request, contract_id, event_type):
    keyboard_events_count = request.POST.get('keyboard_events_count')
    mouse_events_count = request.POST.get('mouse_events_count')
    # The screenshot arrives as a text field rather than as an uploaded file.
    screenshot_file = request.POST.get('screenshot_file')
    barr2 = bytes(screenshot_file.encode(encoding='utf8'))
    with open('.test/output.jpeg', 'wb') as f:
        f.write(barr2)
The file is corrupted.
The binary starts like this, I don't know if that helps:
����JFIFHH��C
%# , #&')*)-0-(0%()(��C
(((((((((((((((((((((((((((((((((((((((((((((((((((�� `"��
Also, if I try to open the image with PIL, I get the following error:
from PIL import Image
im = Image.open('./test/output.jpg')
#OSError: cannot identify image file './test/output.jpg'

In the end I managed to get access to the code on the client side. The 'filename' parameter was missing from the part's Content-Disposition header, which is why I was getting the file in request.POST instead of in the request.FILES dictionary.
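To make the difference concrete, here is a small sketch that parses two hand-written multipart bodies and reports whether each part carries a filename. It uses the standard-library email parser as a stand-in, not Django's own multipart parser, and it targets Python 3. The boundary XYZ, the payload JPEG-BYTES and the name shot.jpg are invented for illustration.

```python
from email import message_from_bytes


def describe_parts(body, boundary):
    """Return (field name, filename) for each part of a multipart body."""
    header = b"Content-Type: multipart/form-data; boundary=" + boundary + b"\r\n\r\n"
    message = message_from_bytes(header + body)
    return [
        (part.get_param("name", header="content-disposition"), part.get_filename())
        for part in message.get_payload()
    ]


# Hypothetical body as the desktop client sent it: no filename parameter.
WITHOUT_FILENAME = (
    b'--XYZ\r\n'
    b'Content-Disposition: form-data; name="screenshot_file"\r\n'
    b'\r\n'
    b'JPEG-BYTES\r\n'
    b'--XYZ--\r\n'
)

# Hypothetical corrected body: the part now declares a filename.
WITH_FILENAME = (
    b'--XYZ\r\n'
    b'Content-Disposition: form-data; name="screenshot_file"; filename="shot.jpg"\r\n'
    b'Content-Type: image/jpeg\r\n'
    b'\r\n'
    b'JPEG-BYTES\r\n'
    b'--XYZ--\r\n'
)
```

A part without a filename is treated as an ordinary form field, which matches the behaviour described above: the data shows up as a decoded string in request.POST, and re-encoding that string does not restore the original bytes.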

How to attach a PDF in the Google App Engine Python send_mail function?

I cannot find any example of how to attach files (PDFs) that are in the root folder of my site using the send_mail function in Python on Google App Engine.
from google.appengine.api import mail, urlfetch
url_test = "https://mywebsite.com/pdf/test.pdf"
test_file = urlfetch.fetch(url_test)
if test_file.status_code == 200:
    test_document = test_file.content
    mail.send_mail(sender=EMAIL_SENDER,
                   to=['test@test.com'],
                   subject=subject,
                   body=theBody,
                   attachments=[("testing", test_document)])
I decided to try it with EmailMessage:
message = mail.EmailMessage(sender=EMAIL_SENDER,
                            subject=subject, body=theBody, to=['myemail@gmail.com'],
                            attachments=[(attachname, blob.archivoBlob)])
message.send()
The blob attachment above sends successfully, but attaching a file read from a relative path always fails with "invalid attachment":
new_file = open(os.path.dirname(__file__) +
                '/../pages/pdf/test.PDF').read()
message = mail.EmailMessage(sender=EMAIL_SENDER,
                            subject=subject, body=theBody, to=['myemail@gmail.com'],
                            attachments=[('testing', new_file)])
message.send()
While debugging, I also tried to see whether the file is being read:
logging.info(new_file)
It seems to be reading the file, as it outputs some unicode characters.
Why am I not able to attach a PDF from a file when I can attach a blob?

When building the attachments list, the file type has to be indicated in the attachment's file name through its extension, for example attachments=[('testing.pdf', new_file)].
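As a rough sketch of that advice, the helper below builds the (file name, data) tuple from a path and refuses names that have no extension. The name build_attachment is made up for illustration, and reading the file in binary mode is my own assumption about what suits a PDF. It is not part of the App Engine API.

```python
import os


def build_attachment(path):
    """Return a (file name, file bytes) tuple suitable for an attachments list."""
    filename = os.path.basename(path)
    # The file name must carry an extension such as .pdf.
    if not os.path.splitext(filename)[1]:
        raise ValueError("attachment name needs a file extension: %r" % filename)
    # Assumption: binary mode, so the PDF bytes are read unchanged.
    with open(path, "rb") as handle:
        return (filename, handle.read())
```

The result could then be passed as attachments=[build_attachment(pdf_path)], where pdf_path is whatever path you already compute from os.path.dirname(__file__).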

Python read a file in zip archives from api call

I have a RESTful endpoint that my REST API sends a GET request to, and the response is a zip file. This zip file contains two files, and I only want to read the content of one of them. When I ran a test, my code appeared to get stuck on the line file = zipfile.ZipFile(io.BytesIO(response_object.content)).
import io
import re
import zipfile
class ZipFileResponseHandler:
    def __init__(self, **args):
        self.csv_file_to_index = args['csv_file_to_index']
    def __call__(self, response_object, raw_response_output, response_type, req_args, endpoint):
        # Wrap the downloaded content in a file-like object so zipfile can read it.
        file = zipfile.ZipFile(io.BytesIO(response_object.content))
        for name in file.namelist():
            if re.match(name, self.csv_file_to_index):
                data = file.read(name)
                print_xml_stream(repr(data))

I found the solution to my own question. Because I use Python 2.7, the class needed to wrap response_object.content is StringIO, not BytesIO. So the line:
file = zipfile.ZipFile(io.BytesIO(response_object.content))
should be
file = zipfile.ZipFile(StringIO.StringIO(response_object.content))
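For reference, a minimal sketch of reading a single member from a zip archive held in memory could look like this. The helper name read_member is made up for illustration. The import fallback is an assumption that lets the same code run on Python 3, where the StringIO module does not exist and io.BytesIO is used instead.

```python
import io
import zipfile

try:
    # Python 2.7, as in the fix described above.
    from StringIO import StringIO as Buffer
except ImportError:
    # Python 3: there is no StringIO module, so fall back to io.BytesIO.
    Buffer = io.BytesIO


def read_member(zip_bytes, member_name):
    """Return the raw content of one named file inside an in-memory zip."""
    with zipfile.ZipFile(Buffer(zip_bytes)) as archive:
        return archive.read(member_name)
```

With the handler above, the call would be something like read_member(response_object.content, self.csv_file_to_index), assuming csv_file_to_index holds the exact member name rather than a pattern.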

Upload an mp3 file to SoundCloud using Python (file name is random)

I'd like to upload an mp3 file from a hot folder without knowing the name of the file (such as *.mp3).
Here's what I tried (it uploads a specific file with a known name):
import soundcloud
# create client object with app and user credentials
client = soundcloud.Client(client_id='***',
                           client_secret='***',
                           username='***',
                           password='***')
# print authenticated user's username
print client.get('/me').username
mp3_file = 'test.mp3'
# upload audio file
track = client.post('/tracks', track={
    'title': 'Test Sound',
    'asset_data': open(mp3_file, 'rb')
})
# print track link
print track.permalink_url
How can I make the script upload any mp3 file in that folder? (The script and the files are located in the same folder.)

From the question as written, it's not clear what you mean by "upload any mp3 file in that folder." Would uploading the first file in the folder satisfy your need, or does it need to be a different file each time the script runs? If the latter, my suggestion is to get a list of files and then randomly select one of them.
To get a list of all files in a folder in Python (mypath is the folder to scan):
from os import listdir
from os.path import isfile, join
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
Then, to randomly select one of them:
import random
print(random.choice(onlyfiles))
Hope this helps.
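Since the question is about *.mp3 specifically, here is a small variation that keeps only mp3 files by using glob instead of listdir. The helper name pick_mp3 is made up for illustration. Note that the *.mp3 pattern is case-sensitive on case-sensitive file systems, so a file named SONG.MP3 would be skipped there.

```python
import glob
import os
import random


def pick_mp3(folder):
    """Return the path of one randomly chosen .mp3 file in folder, or None."""
    candidates = sorted(glob.glob(os.path.join(folder, "*.mp3")))
    if not candidates:
        # Nothing to upload yet.
        return None
    return random.choice(candidates)
```

The returned path can replace the hard-coded mp3_file in the upload script, for example mp3_file = pick_mp3('.') when the script and the files share a folder.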

Getting "new-line character seen in unquoted field" when parsing csv document using django-storages

I am trying to parse csv files that have been uploaded to Amazon S3 using django-storages. I keep getting "Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?". The normal workaround for this is to open the file with "rU", but that does not seem to work with django-storages. If I drop the file directly on the server and open it from there, it works; I just want to avoid storing the files directly on the server if possible. Here is the code I am using:
import csv
from django.core.files.storage import default_storage as s3_storage
n = 'csvdumps/130331548894.csv'
csvf = s3_storage.open(n, "rU")
csvReader = csv.reader(csvf)
for item in csvReader:
    print item

I can see that this is a reported django-storages bug (http://jgrid.org/david/django-storages/issue/80/trying-to-parse-csv-file-from-django), but perhaps you can try reading the whole file and splitting it into lines yourself:
csvReader = csv.reader(csvf.read().splitlines())
It would also be great if you could share a link to some of your sample S3 csv files so I can open them and check the line endings.
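One way to avoid depending on how the storage backend handles "rU" is to read the content and split it into lines before handing it to csv.reader. The sketch below shows the idea on plain text. The helper name parse_csv_text is made up for illustration, and on Python 3 the content read from storage would probably need decoding to text first, which is an assumption about your setup.

```python
import csv


def parse_csv_text(text):
    """Parse CSV text whose rows may end in \\r, \\n or \\r\\n."""
    # splitlines() treats all three line endings as row separators,
    # so csv.reader never sees a bare newline character inside a row.
    return list(csv.reader(text.splitlines()))
```

This loads the whole file into memory, and it would also split quoted fields that contain embedded line breaks, so it only suits files where every row sits on a single line.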
