AWS Textract (OCR) not detecting some cells - amazon-web-services

I am using AWS Textract to read and parse tables from PDF into CSV. Lovely, AWS has a documentation for it! https://docs.aws.amazon.com/textract/latest/dg/examples-export-table-csv.html
I have set up the Asynchronous method as they suggest, and it works for a POC. However, for some documents, some lines are not shown in my csv.
After digging a bit into the json produced (the issue is persistent if I use AWS CLI to make the document analysis), I noticed that the values missing have no CELL block referenced. Those missing values are referenced into WORD block, and LINE block, but not in CELL block. According to the script, that's exactly the reason why it's not added to my csv.
We could assume it's not that good OCR algorithm. But the fun fact about this, is that if I use the same pdf within AWS Textract console, all the data is parsed into the table!
Is any of you aware of any parameters I would need to use to be sure to detect the values as CELL? Or do you think behind the scenes, they simply use a more powerful script (that would actually use the (x,y) coordinates of each WORD to match the table?
I also compared the json produced from CLI to the one from the console, and it's actually different! (not only IDs, but also as said some values are in CELL's block for console, while in LINE/WORD only for CLI)
Important fact: my PDF is 3 pages long. The first page is working perfectly fine with all the values, but the second one is missing the first 10 lines of the table basically. After those 10 lines, everything is parsed on this page as well.
Any suggestions? Or script to parse more efficiently the json provided?
Thank you!

Update: Basically the issue was the pagination of the results. There is a maximum of 1000 objects according to AWS documentation: https://docs.aws.amazon.com/textract/latest/dg/API_GetDocumentAnalysis.html#API_GetDocumentAnalysis_RequestSyntax
If you have more than this amount of object in the single table, then the IDs are in the first 1000, while the object itself is referenced in second batch (1001 --> 2000). So when trying to add the cell to the table, it can't find the reference.
Basically the solution is quite easy. We need to alter the GetResults function to concatenate each response, and THEN run the other functions.
Here is a functioning code:
def GetResults(jobId, file_name):
maxResults = 1000
paginationToken = None
finished = False
blocks = []
while finished == False:
response = None
if paginationToken == None:
response = textract.get_document_analysis(JobId=jobId, MaxResults=maxResults)
else:
response = textract.get_document_analysis(JobId=jobId, MaxResults=maxResults,
NextToken=paginationToken)
blocks += response['Blocks']
if 'NextToken' in response:
paginationToken = response['NextToken']
else:
finished = True
table_csv = get_table_csv_results(blocks)
output_file = file_name + ".csv"
# replace content
with open(output_file, "w") as fout: # Important to change "at" to "w"
fout.write(table_csv)
# show the results
print('Detected Document Text')
print('Pages: {}'.format(response['DocumentMetadata']['Pages']))
print('OUTPUT TO CSV FILE: ', output_file)
Hope this will help people.

Related

How to convert list into DataFrame in Python (Binance Futures API)

Using Binance Futures API I am trying to get a proper form of my position regarding cryptocurrencies.
Using the code
from binance_f import RequestClient
request_client = RequestClient(api_key= my_key, secret_key=my_secet_key)
result = request_client.get_position()
I get the following result
[{"symbol":"BTCUSDT","positionAmt":"0.000","entryPrice":"0.00000","markPrice":"5455.13008723","unRealizedProfit":"0.00000000","liquidationPrice":"0","leverage":"20","maxNotionalValue":"5000000","marginType":"cross","isolatedMargin":"0.00000000","isAutoAddMargin":"false"}]
The type command indicates it is a list, however adding at the end of the code print(result) yields:
[<binance_f.model.position.Position object at 0x1135cb670>]
Which is baffling because it seems not to be the list (in fact, debugging it indicates object of type Position). Using PrintMix.print_data(result) yields:
data number 0 :
entryPrice:0.0
isAutoAddMargin:True
isolatedMargin:0.0
json_parse:<function Position.json_parse at 0x1165af820>
leverage:20.0
liquidationPrice:0.0
marginType:cross
markPrice:5442.28502271
maxNotionalValue:5000000.0
positionAmt:0.0
symbol:BTCUSDT
unrealizedProfit:0.0
Now it seems like a JSON format... But it is a list. I am confused - any ideas how I can convert result to a proper DataFrame? So that columns are Symbol, PositionAmt, entryPrice, etc.
Thanks!
Your main question remains as you wrote on the header you should not be confused. In your case you have a list of Position object, you can see the structure of Position in the GitHub of this library
Anyway to answer the question please use the following:
df = pd.DataFrame([t.__dict__ for t in result])
For more options and information please read the great answers on this question
Good Luck!
you can use that
df = pd.DataFrame([t.__dict__ for t in result])
klines=df.values.tolist()
open = [float(entry[1]) for entry in klines]
high = [float(entry[2]) for entry in klines]
low = [float(entry[3]) for entry in klines]
close = [float(entry[4]) for entry in klines]

elasticsearch-dsl using from and size

I'm using python 2.7 with Elasticsearch-DSL package to query my elastic cluster.
Trying to add "from and limit" capabilities to the query in order to have pagination in my FE which presents the documents elastic returns but 'from' doesn't work right (i.e. I'm not using it correctly I spouse).
The relevant code is:
s = Search(using=elastic_conn, index='my_index'). \
filter("terms", organization_id=org_list)
hits = s[my_from:my_size].execute() # if from = 10, size = 10 then I get 0 documents, altought 100 documents match the filters.
My index contains 100 documents.
even when my filter match all results (i.e nothing is filtered out), if I use
my_from = 10 and my_size = 10, for instance, then I get nothing in hits (no matching documents)
Why is that? Am I misusing the from?
Documentation states:
from and size parameters. The from parameter defines the offset from the first result you want to fetch. The size parameter allows you to configure the maximum amount of hits to be returned.
So it seems really straightforward, what am I missing?
The answer to this question can be found in their documentation under the Pagination Section of the Search DSL:
Pagination
To specify the from/size parameters, use the Python slicing API:
s = s[10:20]
# {"from": 10, "size": 10}
The correct usage of these parameters with the Search DSL is just as you would with a Python list, slicing from the starting index to the end index. The size parameter would be implicitly defined as the end index minus the start index.
Hope this clears things up!
Try to pass from and size params as below:
search = Search(using=elastic_conn, index='my_index'). \
filter("terms", organization_id=org_list). \
extra(from_=10, size=20)
result = search.execute()

Why must I run this code a few times before my entire .csv file is converted into a .yaml file?

I am trying to build a tool that can convert .csv files into .yaml files for further use. I found a handy bit of code that does the job nicely from the link below:
Convert CSV to YAML, with Unicode?
which states that the line will take the dict created by opening a .csv file and dump it to a .yaml file:
out_file.write(ry.safe_dump(dict_example,allow_unicode=True))
However, one small kink I have noticed is that when it is run once, the generated .yaml file is typically incomplete by a line or two. In order to have the .csv file exhaustively read through to create a complete .yaml file, the code must be run two or even three times. Does anybody know why this could be?
UPDATE
Per request, here is the code I use to parse my .csv file, which is two columns long (with a string in the first column and a list of two strings in the second column), and will typically be 50 rows long (or maybe more). Also note that it designed to remove any '\n' or spaces that could potentially cause problems later on in the code.
csv_contents={}
with open("example1.csv", "rU") as csvfile:
green= csv.reader(csvfile, dialect= 'excel')
for line in green:
candidate_number= line[0]
first_sequence= line[1].replace(' ','').replace('\r','').replace('\n','')
second_sequence= line[2].replace(' ','').replace('\r','').replace('\n','')
csv_contents[candidate_number]= [first_sequence, second_sequence]
csv_contents.pop('Header name', None)
Ultimately, it is not that important that I maintain the order of the rows from the original dict, just that all the information within the rows is properly structured.
I am not sure what would cause could be but you might be running out of memory as you create the YAML document in memory first and then write it out. It is much better to directly stream it out.
You should also note that the code in the question you link to, doesn't preserve the order of the original columns, something easily circumvented by using round_trip_dump instead of safe_dump.
You probably want to make a top-level sequence (list) as in the desired output of the linked question, with each element being a mapping (dict).
The following parses the CSV, taking the first line as keys for mappings created for each following line:
import sys
import csv
import ruamel.yaml as ry
import dateutil.parser # pip install python-dateutil
def process_line(line):
"""convert lines, trying, int, float, date"""
ret_val = []
for elem in line:
try:
res = int(elem)
ret_val.append(res)
continue
except ValueError:
pass
try:
res = float(elem)
ret_val.append(res)
continue
except ValueError:
pass
try:
res = dateutil.parser.parse(elem)
ret_val.append(res)
continue
except ValueError:
pass
ret_val.append(elem.strip())
return ret_val
csv_file_name = 'xyz.csv'
data = []
header = None
with open(csv_file_name) as inf:
for line in csv.reader(inf):
d = process_line(line)
if header is None:
header = d
continue
data.append(ry.comments.CommentedMap(zip(header, d)))
ry.round_trip_dump(data, sys.stdout, allow_unicode=True)
with input xyz.csv:
id, title_english, title_russian
1, A Title in English, Название на русском
2, Another Title, Другой Название
this generates:
- id: 1
title_english: A Title in English
title_russian: Название на русском
- id: 2
title_english: Another Title
title_russian: Другой Название
The process_line is just some sugar that tries to convert strings in the CSV file to more useful types and strings without leading spaces (resulting in far less quotes in your output YAML file).
I have tested the above on files with 1000 rows, without any problems (I won't post the output though).
The above was done using Python 3 as well as Python 2.7, starting with a UTF-8 encoded file xyz.csv. If you are using Python 2, you can try unicodecsv if you need to handle Unicode input and things don't work out as well as they did for me.

Python multiprocessing queues and streaming data

I am connecting to an API that sends streaming data in string format. This stream runs for about 9 hours per day. Each string that is sent (about 1-5, 40 charter stings per second) need to be parsed and certain values need to extracted. Those values are then stored in a list which is parsed and data is retrieved at various intervals to create another list which needs to be parsed. What is the best way to accomplish this with miltiprocessing, queues? Is there a better way?
from multiprocessing import Process
import requests
data_stream = requests.get("http://myDataStreamUrl", stream=True)
lines = data_stream.iter_lines()
first_line = next(lines)
for line in lines:
find what i need and append to first_list
def parse_first_list_and create_second_list():
find what i need in first_list to create second list
def parse_second_list():
find what i need

How to Alter returned data format called from csv using python

I am looking for guidance regarding a return result FORMAT from a csv file. The code I have to date partially ahcieves my objective but despite significant effort researching through this and many other sites/forums I cannot resolve the final step. I have also posed this question on gis.stackexchange but was redirected to this forum with the comment "Questions relating to general Information Technology, with no clear GIS component, are off-topic here, but can be researched/asked at Stack Overflow".
My successful piece of python code that reads selected data from a csv and returns it in dict format is below ; (Yes I know the reason it returns as type dict is due to the format my code is calling!!! and that is the crux of the problem)
import arcpy, csv
Att_Dict ={}
with open ("C:/Data/Code/Python/Library/Peter/123.csv") as f:
reader = csv.DictReader(f)
for row in reader:
if row['Status']=='Keep':
Att_Dict.update({row['book_id']:row['book_ref']})
print Att_Dict
Att_Dict = {'7643': '7625', '9644': '2289', '4406': '4443', '7588': '9681', '2252': '7947'}
For the next part of my code to run I need the result above but in the format of ; (this is part of a very lengthy code but the only show stopper is the returned format so little value in posting the other 200 or so lines)
Att_Dict = [[7643, 7625], [9644, 2289], [4406, 4443], [7588, 9681], [2252, 7947]]
Although I have experimented endlessly and can achieve this by reverting to csv.Reader rather than csv.DictReader, I then lose the ability to 'weed out' rows where column 'Status' has value 'Keep' in them and that is a requirement for the task at hand.
My sledgehammer approach to date has been to use 'search and replace' within Idle to amend the returned set to the meet the other requirement but Im sure it can be done programatically rather than manually. Similar but not exact to https://docs.python.org/2/library/index.html, plus my startout question at Returning values from multiple CSV columns to Python dictionary? and Using Python's csv.dictreader to search for specific key to then print its value plus a multitude of csv based questions at geonet.esri.
(Using Win 7, ArcGIS 10.2, Python 2.7.5)
Try this
Att_Dict = {'7643': '7625', '9644': '2289', '4406': '4443', '7588': '9681', '2252': '7947'}
Att_List = []
for key, value in Att_Dict.items():
Att_List.append([int(key), int(value)])
print Att_List
Out: [[7643, 7625], [9644, 2289], [4406, 4443], [7588, 9681], [2252, 7947]]