How to extract specific fields that have no labels from a string read from an image with tesseract-ocr - regex

I have written a program that asks the user to upload a picture of their DMV license and reads the license details from it using tesseract-ocr. The tesseract part works reasonably well: I get a raw string, and now I need to parse that string to extract the user's details. The problem is that some fields on a DMV license, such as the name and address, have no labels, and I need to extract these too. I can't come up with a way to do it (maybe a regex would work, but I don't know how to write one). If someone has already done this, I'd love to take a look. Any suggestion would be welcome.
CODE
Here is the code to read the uploaded file and get text from image using tesseract.
from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract

# Replace this with the path to the tesseract executable on the server.
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"


@csrf_exempt
def upload_dmv(request):
    if request.method == "POST":
        dmv = request.FILES['dmv']
        # Run OCR on the uploaded image and print the raw text.
        extracted_data = pytesseract.image_to_string(Image.open(dmv))
        print(extracted_data)
    return HttpResponse(b'OK')
Output
W YORK STALE
DRIVER Ee C{ENeSae
876 071652
BOGADO
PETER,GIOVANNI.
9520 93RD ST FL 2
OZONE PARK, NY 114116
SexM_ Height6'-02" Eyes BRO:
00806/06/1992
Expires 06/06/2018
ENONE
RB
Issued 03/09/2017
Usa
~e-h.
Crecutive Deputy Comminsioner of Motor
Class E
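
One possible approach, sketched under strong assumptions about this particular layout: anchor on something recognisable (the ID number line, which contains only digits and spaces) and read the unlabelled fields by their position relative to it. The function name `parse_license`, the field names, and the assumption that the four lines after the ID are last name, first names, street, and city/state/ZIP are all hypothetical and taken only from the sample output above; other license layouts or noisier OCR would need different rules.

```python
import re

# Raw OCR text from the question (trailing noise lines omitted).
OCR_TEXT = """W YORK STALE
DRIVER Ee C{ENeSae
876 071652
BOGADO
PETER,GIOVANNI.
9520 93RD ST FL 2
OZONE PARK, NY 114116
SexM_ Height6'-02" Eyes BRO:
00806/06/1992
Expires 06/06/2018
ENONE
RB
Issued 03/09/2017
"""

DATE = r"\d{2}/\d{2}/\d{4}"


def parse_license(text):
    """Extract fields by position relative to the ID line (layout is assumed)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    fields = {}

    # Assumption: the ID number is the first line made only of digits and spaces.
    id_idx = next(
        (i for i, ln in enumerate(lines) if re.fullmatch(r"[\d ]{9,}", ln)), None
    )
    if id_idx is not None and len(lines) >= id_idx + 5:
        last, first, street, city_line = lines[id_idx + 1:id_idx + 5]
        fields["id_number"] = lines[id_idx].replace(" ", "")
        fields["last_name"] = last
        # Split first names on commas, dots, and whitespace left by the OCR.
        fields["first_names"] = [n for n in re.split(r"[,.\s]+", first) if n]
        fields["street"] = street
        m = re.match(
            r"(?P<city>[A-Z .]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d+)", city_line
        )
        if m:
            fields.update(m.groupdict())

    # Labelled dates are matched by their label; the unlabelled date is
    # assumed to be the date of birth.
    for ln in lines:
        m = re.search(DATE, ln)
        if not m:
            continue
        if ln.startswith("Expires"):
            fields["expires"] = m.group()
        elif ln.startswith("Issued"):
            fields["issued"] = m.group()
        else:
            fields["date_of_birth"] = m.group()
    return fields


parsed = parse_license(OCR_TEXT)
print(parsed)
```

The values come back exactly as the OCR read them (the ZIP code here has six digits, for instance), so some validation or manual review of the parsed fields is probably still needed.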


Error ALDialog Python Nao

I have a problem using the ALDialog module from a Python IDE to load a dialog on Nao. I have tried different ways to load a dialog, but I always get the same error:
RuntimeError LoadTopic::ALDialog Incorrect file myDialog.top
In the first case, I write the text directly into a .top file that I save, but loadTopic() raises the error. In the second case, I try to load the .top file by giving its path, and I get the same error again.
Do you have a solution to my problem? Thank you very much.
import qi
import argparse
import os
import sys
from naoqi import ALProxy


def main(robot_ip, robot_port):
    dialog = """
topic: ~myTopic() \n
language: enu \n
u:(test) hello \n """
    # Write the topic to a .top file in the current working directory.
    file = open("myDialog.top", "w")
    file.write(dialog)
    file.close()
    # load topic
    proxy = ALProxy("ALDialog", robot_ip, robot_port)
    proxy.setLanguage("English")
    topic = proxy.loadTopic("myDialog.top")
    # start dialog
    proxy.subscribe("myModule")
    # activate dialog
    proxy.activateTopic(topic)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ip", type=str, default="169.254.245.164",
                        help="Robot's IP address : '169.254.245.164'")
    parser.add_argument("--port", type=int, default=9559,
                        help="port number, the default value is OK in most cases")
    args = parser.parse_args()
    main(args.ip, args.port)
ALDialog.loadTopic expects an absolute file path on the robot. It doesn't know anything about the context from which you're calling it (the call could come from another computer, in which case it obviously can't open that file). You need to make sure that your .top file is actually on the robot, and pass its absolute path to ALDialog.
Once installed on the robot this path will be something like /home/nao/.local/share/PackageManager/apps/your-package-id/your-dialog-name/your-dialog-name_enu.top
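
A minimal sketch of that advice, assuming the .top file has already been copied to the robot. The helper name `load_topic_from_robot_path` and the example path in the usage comment are hypothetical; `loadTopic`, `subscribe`, and `activateTopic` are the same ALDialog calls used in the question. The helper only checks that the path is absolute before handing it to ALDialog; it cannot verify that the file exists on the robot.

```python
import posixpath


def load_topic_from_robot_path(dialog_proxy, top_path, subscriber="myModule"):
    """Load and activate a topic from an absolute path on the robot."""
    # The robot runs Linux, so the path is checked with POSIX rules.
    if not posixpath.isabs(top_path):
        raise ValueError(
            "ALDialog needs an absolute path on the robot, got %r" % top_path
        )
    topic_name = dialog_proxy.loadTopic(top_path)
    dialog_proxy.subscribe(subscriber)
    dialog_proxy.activateTopic(topic_name)
    return topic_name


# Usage against a real robot (needs the naoqi SDK; the path is an assumption):
# from naoqi import ALProxy
# proxy = ALProxy("ALDialog", "169.254.245.164", 9559)
# proxy.setLanguage("English")
# load_topic_from_robot_path(proxy, "/home/nao/myDialog.top")
```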

Python request download a file and save to a specific directory

Hello, and sorry if this question has been asked before; I have tried many of the methods suggested elsewhere.
I want to download a file from a website using the code shown below. The code works, but the file is saved automatically to our download folder.
I want to download the file and save it to a specific folder instead.
I'm aware that we could change the browser settings, but this is a server that different users access remotely, so the file ends up in each user's temporary /users/adam_01/download/ folder.
I want it saved on the server disk, in C://ExcelFile/
Below is my script; some of the data has been changed because it is confidential.
import pandas as pd
import html5lib
import time
from bs4 import BeautifulSoup
import requests
import csv
from datetime import datetime
import urllib.request
import os

with requests.Session() as c:
    proxies = {"http": "http://:911"}
    url = 'https://......./login.jsp'
    USERNAME = 'mwirzonw'
    PASSWORD = 'Fiqr123'
    c.get(url, verify=False)
    csrftoken = ''
    login_data = dict(proxies, atl_token=csrftoken, os_username=USERNAME, os_password=PASSWORD, next='/')
    c.post(url, data=login_data, headers={"referer": "https://.....com"})
    page = c.get('https://........s...../SearchRequest-96010.csv')
    location = 'C:/Users/..../Downloads/'
    # The file name is relative, so it is written to the current working directory.
    with open('asdsad906010.csv', 'wb') as output:
        output.write(page.content)
    print("Done!")
Thank you. Please ask if anything I have written is unclear.
Regards,
Fiqri
From your script, it looks like you are writing the file to asdsad906010.csv, a relative path, so it is created in the script's current working directory. You should be able to change the output directory as follows.
# Set the output directory to your desired location
output_directory = 'C:/ExcelFile/'
# Create a file path by joining the directory name with the desired file name
file_path = os.path.join(output_directory, 'asdsad906010.csv')
# Write the file
with open(file_path, 'wb') as output:
    output.write(page.content)
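
If the target folder might not exist yet, the same idea can be wrapped in a small helper that creates it first. This is only a sketch: the name `save_bytes` is hypothetical, and a temporary directory stands in for C:/ExcelFile/ so the example runs anywhere. In the real script you would pass `page.content` as the first argument.

```python
import os
import tempfile


def save_bytes(content, directory, filename):
    """Write bytes to directory/filename, creating the directory if needed."""
    os.makedirs(directory, exist_ok=True)
    file_path = os.path.join(directory, filename)
    with open(file_path, "wb") as output:
        output.write(content)
    return file_path


# Demonstration: a temporary directory stands in for C:/ExcelFile/.
demo_dir = os.path.join(tempfile.mkdtemp(), "ExcelFile")
saved_path = save_bytes(b"id,name\n1,test\n", demo_dir, "asdsad906010.csv")
print(saved_path)
```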

Django-extension runscript No (valid) module for script

I'm trying to create a script that will populate my model families with information extracted from a text file.
This is my first post in StackOverflow, please be gentle, sorry if the question is not well expressed or not correctly formatted.
Django V 1.9 and running on Python 3.5
Django-extensions installed
This is my model: it's in an app called browse
from django.db import models
from django_extensions.db.models import TimeStampedModel


class families(TimeStampedModel):
    rfam_acc = models.CharField(max_length=7)
    rfam_id = models.CharField(max_length=40)
    description = models.CharField(max_length=75)
    author = models.CharField(max_length=50)
    comment = models.CharField(max_length=500)
    rfam_URL = models.URLField()
Here is my script, familiespopulate.py, located in the PROJECT_ROOT/scripts directory.
import csv
from browse.models import families

file_path = "/Users/work/Desktop/StructuRNA/website/scripts/RFAMfamily12.1.txt"


def run(file_path):
    listoflists = list(csv.reader(open(file_path, 'rb'), delimiter='\t'))
    for row in listoflists:
        families.objects.create(
            rfam_acc=row[0],
            rfam_id=row[1],
            description=row[3],
            author=row[4],
            comment=row[9],
        )
When I run this from the terminal:
python manage.py runscript familiespopulate
it returns:
No (valid) module for script 'familiespopulate' found
Try running with a higher verbosity level like: -v2 or -v3
The problem must be in importing the model families, I'm new to django, and I cannot find any solution here on StackOverflow or anywhere else online.
This is why I ask for your help!
Do you know how the model should be imported?
Or am I doing something else wrong?
One important piece of information: the script runs if I modify it to print out the parameters instead of creating an object in families.
For your information and curiosity I will also post here an extract of the textfile that I'm using.
RF00001 5S_rRNA 1302 5S ribosomal RNA Griffiths-Jones SR, Mifsud W, Gardner PP Szymanski et al, 5S ribosomal database, PMID:11752286 38.00 38.00 37.90 5S ribosomal RNA (5S rRNA) is a component of the large ribosomal subunit in both prokaryotes and eukaryotes. In eukaryotes, it is synthesised by RNA polymerase III (the other eukaryotic rRNAs are cleaved from a 45S precursor synthesised by RNA polymerase I). In Xenopus oocytes, it has been shown that fingers 4-7 of the nine-zinc finger transcription factor TFIIIA can bind to the central region of 5S RNA. Thus, in addition to positively regulating 5S rRNA transcription, TFIIIA also stabilises 5S rRNA until it is required for transcription. NULL cmbuild -F CM SEED cmcalibrate --mpi CM cmsearch --cpu 4 --verbose --nohmmonly -T 24.99 -Z 549862.597050 CM SEQDB 712 183439 0 0 Gene; rRNA; Published; PMID:11283358 7946 0 0.59496 -5.32219 1600000 213632 305 119 1 -3.78120 0.71822 2013-10-03 20:41:44 2016-04-21 23:07:03
This is the first line and the result of the extraction from the listoflists is :
RF00002
5_8S_rRNA
5.8S ribosomal RNA
Griffiths-Jones SR, Mifsud W
5.8S ribosomal RNA (5.8S rRNA) is a component of the large subunit of the eukaryotic ribosome. It is transcribed by RNA polymerase I as part of the 45S precursor that also contains 18S and 28S rRNA. Functionally, it is thought that 5.8S rRNA may be involved in ribosome translocation [2]. It is also known to form covalent linkage to the p53 tumour suppressor protein [3]. 5.8S rRNA is also found in archaea.
Try adding an empty file named __init__.py (double underscores) to your /scripts folder and run with:
python manage.py runscript scripts.familiespopulate
Apart from adding __init__.py, you are not supposed to pass any parameters to the run function.
def run():
    <your code goes here>
Thanks for the useful comments.
I modified my code in this way:
import csv
from browse.models import families


def run():
    file_path = "/Users/work/Desktop/StructuRNA/website/scripts/RFAMfamily12.1.txt"
    listoflists = list(csv.reader(open(file_path, 'r'), delimiter='\t'))
    print(listoflists)
    for row in listoflists:
        families.objects.create(
            rfam_acc=row[0],
            rfam_id=row[1],
            description=row[3],
            author=row[4],
            comment=row[9],
        )
That is all. Now it works smoothly.
I want to confirm that my file familiespopulate.py was already in the scripts folder together with __init__.py.
The problem was resolved when I put
file_path = "/Users/work/Desktop/StructuRNA/website/scripts/RFAMfamily12.1.txt"
inside the run function and removed the file_path parameter from run(file_path).
The other change was the mode argument to open: it is now open(file_path, 'r'); before it was open(file_path, 'rb'), which means read binary.
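
The mode change matters on Python 3 because csv.reader expects text, not bytes. As a hedged sketch of just the parsing step (no Django involved), the helper below reads tab-delimited rows from any text file object and picks out the same column positions as the script. The names `read_family_rows` and `COLUMNS` are hypothetical, and the sample row uses placeholder values for the columns the script does not use.

```python
import csv
import io

# Column positions taken from the script above; keys mirror the model fields.
COLUMNS = {"rfam_acc": 0, "rfam_id": 1, "description": 3, "author": 4, "comment": 9}


def read_family_rows(text_file):
    """Return one dict per tab-delimited row that has enough columns."""
    rows = []
    for row in csv.reader(text_file, delimiter="\t"):
        if len(row) <= max(COLUMNS.values()):
            continue  # skip blank or truncated lines
        rows.append({name: row[index] for name, index in COLUMNS.items()})
    return rows


# Placeholder row: columns 5 to 8 hold dummy values ("c5" ... "c8").
sample = io.StringIO(
    "RF00001\t5S_rRNA\t1302\t5S ribosomal RNA\tGriffiths-Jones SR"
    "\tc5\tc6\tc7\tc8\tA comment\n"
)
family_rows = read_family_rows(sample)
print(family_rows)

# With a real file, open it in text mode, for example:
# with open(file_path, "r", newline="") as handle:
#     family_rows = read_family_rows(handle)
```

Inside run(), each dict could then be passed to families.objects.create(**row), assuming the keys match the model's field names.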
I was getting exactly the same error. I tried all of the solutions above, but unfortunately none of them worked for me. Then I found my mistake.
Inside the script file (which is in the scripts/ folder) I had used a different name for the function, which must be named 'run'. So check that as well if you get this error.
Here you can read more about "runscript"

Why does OleFileIO_PL only work with .doc files and not .docx in Python?

I'm working on a Python script (Python 2.7) that extracts the metadata from OLE files. I am using OleFileIO_PL, and it works perfectly with OLE files from Office 97-2003, but for anything later than that it says the file is not an OLE2 file.
Is there any way I can modify my code to support both .doc and .docx? The same goes for .ppt and .pptx, etc.
Thank you in advance.
Source Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import OleFileIO_PL
import StringIO
import optparse
import sys
import os


def printMetadata(fileName):
    # Read the whole file into memory and parse it as an OLE2 container.
    data = open(fileName, 'rb').read()
    f = StringIO.StringIO(data)
    OLEFile = OleFileIO_PL.OleFileIO(f)
    meta = OLEFile.get_metadata()
    print('Author:', meta.author)
    print('Title:', meta.title)
    print('Creation date:', meta.create_time)
    meta.dump()
    OLEFile.close()


def main():
    parser = optparse.OptionParser('usage = -F + Name of the OLE file with the extention For example: python Ms Office Metadata Extraction Script.py -F myfile.docx ')
    parser.add_option('-F', dest='fileName', type='string',
                      help='specify OLE (MS Office) file name')
    (options, args) = parser.parse_args()
    fileName = options.fileName
    if fileName is None:
        print(parser.usage)
        exit(0)
    else:
        printMetadata(fileName)


if __name__ == '__main__':
    main()
To answer your question, this is because the newer MS Office 2007+ files (docx, xlsx, xlsb, pptx, etc) have a completely different structure from the legacy MS Office 97-2003 formats.
They are mainly collections of XML files inside a Zip archive. So with a little bit of work, you can extract everything you need using zipfile and ElementTree from the standard library.
If openxmllib does not work for you, you may try other solutions:
officedissector: https://www.officedissector.com/
python-opc: https://pypi.python.org/pypi/python-opc
openpack: https://pypi.python.org/pypi/openpack
paradocx: https://pypi.python.org/pypi/paradocx
BTW, OleFileIO_PL has been renamed to olefile, and the new project page is https://github.com/decalage2/olefile
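
As a rough sketch of the zipfile and ElementTree route mentioned above (written for Python 3, not the Python 2.7 of the question): Office 2007+ files usually keep author, title, and creation date in a part named docProps/core.xml. That location is a convention rather than a guarantee, and the function name `read_core_properties` is hypothetical. The demonstration builds a tiny stand-in archive in memory instead of using a real .docx.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core-properties part.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}


def read_core_properties(file_or_path):
    """Read author, title, and creation date from docProps/core.xml."""
    with zipfile.ZipFile(file_or_path) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))

    def text_of(tag):
        element = root.find(tag, NS)
        return element.text if element is not None else None

    return {
        "author": text_of("dc:creator"),
        "title": text_of("dc:title"),
        "created": text_of("dcterms:created"),
    }


# Build a minimal stand-in archive in memory (not a complete .docx).
CORE_XML = (
    '<cp:coreProperties xmlns:cp="%(cp)s" xmlns:dc="%(dc)s" '
    'xmlns:dcterms="%(dcterms)s">'
    "<dc:title>Demo</dc:title><dc:creator>Jane Doe</dc:creator>"
    "<dcterms:created>2017-03-09T00:00:00Z</dcterms:created>"
    "</cp:coreProperties>" % NS
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("docProps/core.xml", CORE_XML)
buffer.seek(0)

props = read_core_properties(buffer)
print(props)
```

A script that has to handle both families of formats could check zipfile.is_zipfile(fileName) first and fall back to the OLE parser when it returns False.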

Python - fetching image from urllib and then reading EXIF data from PIL Image not working

I use the following code to fetch an image from a URL in Python:
import urllib
from PIL import Image
urllib.urlretrieve("http://www.gunnerkrigg.com//comics/00000001.jpg", "00000001.jpg")
filename = '00000001.jpg'
img = Image.open(filename)
exif = img._getexif()
However, this way the EXIF data is always "None". But when I download the image by hand and then read the EXIF data in Python, the EXIF data is not None.
I have also tried the following approach (from Downloading a picture via urllib and python):
import urllib
f = open('00000001.jpg','wb')
f.write(urllib.urlopen('http://www.gunnerkrigg.com//comics/00000001.jpg').read())
f.close()
filename = '00000001.jpg'
img = Image.open(filename)
exif = img._getexif()
But this gives me 'None' for 'exif' again. Could someone please point out what I may do to solve this problem?
Thank you!
The .jpg you are using contains no EXIF information. If you try the same Python code with an EXIF sample from http://www.exif.org/samples/ , I think you will find it works.
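
One way to check this without PIL is to look at the downloaded bytes directly. The sketch below (Python 3, standard library only) walks the JPEG segments before the image data and reports whether an APP1 segment tagged "Exif" is present. The function name `has_exif_segment` is hypothetical, and the parser is deliberately simplified: it assumes every segment before the scan data carries a length field, which holds for typical files but not for every valid JPEG.

```python
import struct


def has_exif_segment(data):
    """Return True if the JPEG bytes contain an APP1 segment tagged 'Exif'."""
    if data[:2] != b"\xff\xd8":  # every JPEG starts with the SOI marker
        return False
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return False  # not at a marker; give up on malformed input
        marker = data[pos + 1]
        if marker == 0xDA:  # start of scan: header segments are over
            return False
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker == 0xE1 and data[pos + 4:pos + 10] == b"Exif\x00\x00":
            return True
        pos += 2 + length  # the length field counts itself but not the marker
    return False


# Synthetic headers for illustration; real use would read the file:
# with open("00000001.jpg", "rb") as handle:
#     print(has_exif_segment(handle.read()))
app0 = b"\xff\xe0" + struct.pack(">H", 4) + b"JF"
app1 = b"\xff\xe1" + struct.pack(">H", 8) + b"Exif\x00\x00"
with_exif = b"\xff\xd8" + app0 + app1 + b"\xff\xda\x00\x02"
without_exif = b"\xff\xd8" + app0 + b"\xff\xda\x00\x02"
print(has_exif_segment(with_exif), has_exif_segment(without_exif))
```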