How to alter the format of data returned from a CSV using Python - python-2.7

I am looking for guidance on the FORMAT of a result returned from a CSV file. The code I have so far partially achieves my objective, but despite significant research through this and many other sites/forums I cannot resolve the final step. I also posed this question on gis.stackexchange but was redirected to this forum with the comment "Questions relating to general Information Technology, with no clear GIS component, are off-topic here, but can be researched/asked at Stack Overflow".
The Python code below successfully reads selected data from a CSV and returns it in dict format (yes, I know it returns a dict because of the call my code makes, and that is the crux of the problem):
import arcpy, csv

Att_Dict = {}
with open("C:/Data/Code/Python/Library/Peter/123.csv") as f:
    reader = csv.DictReader(f)
    for row in reader:
        if row['Status'] == 'Keep':
            Att_Dict.update({row['book_id']: row['book_ref']})
print Att_Dict
Att_Dict = {'7643': '7625', '9644': '2289', '4406': '4443', '7588': '9681', '2252': '7947'}
For the next part of my code to run, I need the result above but in the following format (this is part of a very lengthy script, but the only show-stopper is the returned format, so there is little value in posting the other 200 or so lines):
Att_Dict = [[7643, 7625], [9644, 2289], [4406, 4443], [7588, 9681], [2252, 7947]]
Although I have experimented endlessly and can achieve this by reverting to csv.reader rather than csv.DictReader, I then lose the easy way of 'weeding out' rows based on whether the 'Status' column contains 'Keep', and that filtering is a requirement for the task at hand.
My sledgehammer approach to date has been to use 'search and replace' within IDLE to amend the returned set to meet the other requirement, but I'm sure it can be done programmatically rather than manually. Similar, but not exactly what I need: https://docs.python.org/2/library/index.html, plus my starting-out question at Returning values from multiple CSV columns to Python dictionary? and Using Python's csv.dictreader to search for specific key to then print its value, plus a multitude of CSV-based questions at geonet.esri.
(Using Win 7, ArcGIS 10.2, Python 2.7.5)

Try this:
Att_Dict = {'7643': '7625', '9644': '2289', '4406': '4443', '7588': '9681', '2252': '7947'}
Att_List = []
for key, value in Att_Dict.items():
    Att_List.append([int(key), int(value)])
print Att_List
Out: [[7643, 7625], [9644, 2289], [4406, 4443], [7588, 9681], [2252, 7947]]
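If you would rather skip the intermediate dictionary, you can also build the nested list directly while still filtering on the Status column with csv.DictReader. A minimal sketch, assuming the same file and column names as in the question:

import csv

Att_List = []
with open("C:/Data/Code/Python/Library/Peter/123.csv") as f:
    reader = csv.DictReader(f)
    for row in reader:
        if row['Status'] == 'Keep':
            # keep only the two columns of interest, converted to int
            Att_List.append([int(row['book_id']), int(row['book_ref'])])
print Att_List  # [[7643, 7625], [9644, 2289], ...]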

Related

AWS Textract (OCR) not detecting some cells

I am using AWS Textract to read and parse tables from a PDF into CSV. Lovely, AWS has documentation for it! https://docs.aws.amazon.com/textract/latest/dg/examples-export-table-csv.html
I have set up the Asynchronous method as they suggest, and it works for a POC. However, for some documents, some lines are not shown in my csv.
After digging a bit into the JSON produced (the issue persists if I use the AWS CLI to run the document analysis), I noticed that the missing values have no CELL block referencing them. Those missing values are referenced in WORD and LINE blocks, but not in any CELL block. According to the script, that is exactly why they are not added to my CSV.
We could assume the OCR algorithm simply isn't that good. But the fun fact is that if I use the same PDF within the AWS Textract console, all the data is parsed into the table!
Are any of you aware of any parameters I would need to use to make sure the values are detected as CELL blocks? Or do you think that behind the scenes they simply use a more powerful script (one that actually uses the (x,y) coordinates of each WORD to match it to the table)?
I also compared the JSON produced by the CLI to the one from the console, and they are actually different! (Not only the IDs; as mentioned, some values appear in CELL blocks for the console but only in LINE/WORD blocks for the CLI.)
Important fact: my PDF is 3 pages long. The first page works perfectly, with all the values, but the second one is basically missing the first 10 lines of the table. After those 10 lines, everything on that page is parsed as well.
Any suggestions? Or a script to parse the provided JSON more efficiently?
Thank you!
Update: Basically the issue was the pagination of the results. There is a maximum of 1000 objects per response according to the AWS documentation: https://docs.aws.amazon.com/textract/latest/dg/API_GetDocumentAnalysis.html#API_GetDocumentAnalysis_RequestSyntax
If you have more than that number of objects in a single table, the IDs are in the first 1000 results while the objects themselves are referenced in the second batch (1001 --> 2000). So when the script tries to add a cell to the table, it can't find the reference.
Basically the solution is quite easy. We need to alter the GetResults function to concatenate each response, and THEN run the other functions.
Here is a functioning code:
def GetResults(jobId, file_name):
    maxResults = 1000
    paginationToken = None
    finished = False
    blocks = []

    # keep fetching pages of blocks until there is no NextToken left
    while finished == False:
        response = None
        if paginationToken == None:
            response = textract.get_document_analysis(JobId=jobId,
                                                      MaxResults=maxResults)
        else:
            response = textract.get_document_analysis(JobId=jobId,
                                                      MaxResults=maxResults,
                                                      NextToken=paginationToken)
        blocks += response['Blocks']
        if 'NextToken' in response:
            paginationToken = response['NextToken']
        else:
            finished = True

    # only build the CSV once every block has been collected
    table_csv = get_table_csv_results(blocks)
    output_file = file_name + ".csv"

    # replace content
    with open(output_file, "w") as fout:  # important to change "at" to "w"
        fout.write(table_csv)

    # show the results
    print('Detected Document Text')
    print('Pages: {}'.format(response['DocumentMetadata']['Pages']))
    print('OUTPUT TO CSV FILE: ', output_file)
Hope this will help people.

How to convert list into DataFrame in Python (Binance Futures API)

Using the Binance Futures API, I am trying to get my cryptocurrency positions into a proper form.
Using the code
from binance_f import RequestClient
request_client = RequestClient(api_key=my_key, secret_key=my_secret_key)
result = request_client.get_position()
I get the following result
[{"symbol":"BTCUSDT","positionAmt":"0.000","entryPrice":"0.00000","markPrice":"5455.13008723","unRealizedProfit":"0.00000000","liquidationPrice":"0","leverage":"20","maxNotionalValue":"5000000","marginType":"cross","isolatedMargin":"0.00000000","isAutoAddMargin":"false"}]
The type command indicates it is a list; however, adding print(result) at the end of the code yields:
[<binance_f.model.position.Position object at 0x1135cb670>]
This is baffling because it does not appear to be a plain list (in fact, debugging indicates an object of type Position). Using PrintMix.print_data(result) yields:
data number 0 :
entryPrice:0.0
isAutoAddMargin:True
isolatedMargin:0.0
json_parse:<function Position.json_parse at 0x1165af820>
leverage:20.0
liquidationPrice:0.0
marginType:cross
markPrice:5442.28502271
maxNotionalValue:5000000.0
positionAmt:0.0
symbol:BTCUSDT
unrealizedProfit:0.0
Now it looks like JSON... but it is a list. I am confused - any ideas on how I can convert result into a proper DataFrame, so that the columns are symbol, positionAmt, entryPrice, etc.?
Thanks!
Your main question remains as you wrote in the header, and you should not be confused: in your case you have a list of Position objects, and you can see the structure of Position in the GitHub repository of this library.
Anyway, to answer the question, please use the following:
df = pd.DataFrame([t.__dict__ for t in result])
For more options and information, please read the great answers to this question.
Good Luck!
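If you only need specific columns in a particular order, you can select and convert them after building the DataFrame. A small sketch, assuming the attribute names shown in the PrintMix output above (adjust them to whatever your Position objects actually expose):

import pandas as pd

df = pd.DataFrame([t.__dict__ for t in result])
# keep only the columns of interest, in the desired order
cols = ['symbol', 'positionAmt', 'entryPrice', 'markPrice', 'unrealizedProfit']
df = df[cols]
# make sure the numeric columns are floats rather than strings
for col in cols[1:]:
    df[col] = df[col].astype(float)
print(df)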
You can use this:
df = pd.DataFrame([t.__dict__ for t in result])
klines=df.values.tolist()
open = [float(entry[1]) for entry in klines]
high = [float(entry[2]) for entry in klines]
low = [float(entry[3]) for entry in klines]
close = [float(entry[4]) for entry in klines]

Why must I run this code a few times before my entire .csv file is converted into a .yaml file?

I am trying to build a tool that can convert .csv files into .yaml files for further use. I found a handy bit of code that does the job nicely from the link below:
Convert CSV to YAML, with Unicode?
which states that the following line will take the dict created from a .csv file and dump it to a .yaml file:
out_file.write(ry.safe_dump(dict_example,allow_unicode=True))
However, one small kink I have noticed is that when it is run once, the generated .yaml file is typically incomplete by a line or two. In order to have the .csv file exhaustively read through to create a complete .yaml file, the code must be run two or even three times. Does anybody know why this could be?
UPDATE
Per request, here is the code I use to parse my .csv file, which is two columns long (with a string in the first column and a list of two strings in the second column) and will typically be 50 rows long (or maybe more). Also note that it is designed to remove any '\n' or spaces that could potentially cause problems later on in the code.
csv_contents = {}
with open("example1.csv", "rU") as csvfile:
    green = csv.reader(csvfile, dialect='excel')
    for line in green:
        candidate_number = line[0]
        first_sequence = line[1].replace(' ', '').replace('\r', '').replace('\n', '')
        second_sequence = line[2].replace(' ', '').replace('\r', '').replace('\n', '')
        csv_contents[candidate_number] = [first_sequence, second_sequence]
csv_contents.pop('Header name', None)
Ultimately, it is not that important that I maintain the order of the rows from the original dict, just that all the information within the rows is properly structured.
I am not sure what the cause could be, but you might be running out of memory, as you create the YAML document in memory first and then write it out. It is much better to stream it out directly.
You should also note that the code in the question you link to doesn't preserve the order of the original columns, something easily circumvented by using round_trip_dump instead of safe_dump.
You probably want to make a top-level sequence (list) as in the desired output of the linked question, with each element being a mapping (dict).
The following parses the CSV, taking the first line as keys for mappings created for each following line:
import sys
import csv
import ruamel.yaml as ry
import dateutil.parser  # pip install python-dateutil


def process_line(line):
    """convert lines, trying, int, float, date"""
    ret_val = []
    for elem in line:
        try:
            res = int(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        try:
            res = float(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        try:
            res = dateutil.parser.parse(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        ret_val.append(elem.strip())
    return ret_val


csv_file_name = 'xyz.csv'
data = []
header = None
with open(csv_file_name) as inf:
    for line in csv.reader(inf):
        d = process_line(line)
        if header is None:
            header = d
            continue
        data.append(ry.comments.CommentedMap(zip(header, d)))

ry.round_trip_dump(data, sys.stdout, allow_unicode=True)
with input xyz.csv:
id, title_english, title_russian
1, A Title in English, Название на русском
2, Another Title, Другой Название
this generates:
- id: 1
  title_english: A Title in English
  title_russian: Название на русском
- id: 2
  title_english: Another Title
  title_russian: Другой Название
process_line is just some sugar that tries to convert the strings in the CSV file to more useful types, and strips leading spaces from the remaining strings (resulting in far fewer quotes in your output YAML file).
I have tested the above on files with 1000 rows, without any problems (I won't post the output though).
The above was done using Python 3 as well as Python 2.7, starting with a UTF-8 encoded file xyz.csv. If you are using Python 2, you can try unicodecsv if you need to handle Unicode input and things don't work out as well as they did for me.
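If you want the YAML to go to a file instead of sys.stdout (as in the original question), dumping inside a with block guarantees the file is flushed and closed, which also rules out a truncated .yaml file caused by an open handle. A minimal sketch reusing the data list built above; the output name xyz.yaml is just an example:

import io
import ruamel.yaml as ry

with io.open('xyz.yaml', 'w', encoding='utf-8') as out_file:
    # stream the document straight into the file; closing the handle
    # flushes everything, so the output cannot be cut short
    ry.round_trip_dump(data, out_file, allow_unicode=True)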

Efficient means of mass converting thousands of different find and replaces within one file

I am trying to convert a map file for some SNP data I want to use from Affy ids to dbSNP rs ids.
I am trying to find an effective way to do this. I have the annotation file for the Axiom array the data comes from, so I know the proper ids.
I was wondering if anyone could suggest a good bash/Python/Perl-based method to do this. It amounts to >100,000 different replacements. The idea I had in mind was the
sed -i 's/Affy#/rs#/g' filename
method, but I figure this would not be the most efficient. Any suggestions? Thanks!
Python code, assuming your substitutions are stored in subs.csv:
import csv

subs = dict(csv.reader(open('subs.csv'), delimiter='\t'))
source = csv.reader(open('all_snp.map'), delimiter='\t')
dest = csv.writer(open('all_snp_out.map', 'wb'), delimiter='\t')

for row in source:
    row[1] = subs.get(row[1], row[1])
    dest.writerow(row)
In the line row[1] = subs.get(row[1], row[1]): row[1] is the Affx column, and it is replaced with a dictionary lookup that either returns the rs-number equivalent if there is one, or the original Affx id if there isn't.
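For the dict(csv.reader(...)) call to work, subs.csv is assumed to be a plain two-column, tab-separated file mapping each Affy id to its rs id, one pair per line, for example (hypothetical ids):

AFFX-SNP_000001	rs0000001
AFFX-SNP_000002	rs0000002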

What causes "UserWarning: Discarded range with reserved name" - openpyxl

I have a simple Excel sheet with the names of cities in column A, and I want to extract them and put them in a list:
def getCityfromEXCEL():
    wb = load_workbook(filename='test.xlsx', read_only=True)
    ws = wb['Sheet1']
    cityList = []
    for i in range(2, ws.get_highest_row()+1):
        acell = "A" + str(i)
        cityString = ws[acell].value
        city = ftfy.fix_text_encoding(cityString)
        cityList.append(city)

getCityfromEXCEL()
With a small file that worked perfectly (70 rows). Now I'm processing a big file (8300 rows) and it gives me this error:
/Library/Python/2.7/site-packages/openpyxl/workbook/names/named_range.py:121: UserWarning: Discarded range with reserved name
warnings.warn("Discarded range with reserved name")
but it does not abort. It just does not seem to continue anymore. Can someone tell me what might cause the error? Is it something in the .xlsx? Any special hints what I can look for?
It's supposed to be a friendly warning letting you know that some of the defined names are being lost when reading the file. Warnings in Python are not exceptions but informational notices.
Support for defined names is essentially limited to references to cell ranges in openpyxl at the moment. But they can refer to lots of other things like printing settings. However, if the objects/values they refer to are not preserved by openpyxl and the file is saved and later opened by Excel it might complain about the missing objects.
If you want to ignore it:
import warnings
warnings.simplefilter("ignore")
wb = load_workbook(path)
warnings.simplefilter("default")
In my case this warning shows up when filtering is enabled on one of my worksheets. I wanted to suppress the warning so that it didn't bother my users, so I just put this line in my code before the openpyxl.load_workbook call:
warnings.simplefilter("ignore")
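If you prefer not to change the warning filter globally, a warnings.catch_warnings block restores the previous filters automatically once the workbook has been loaded. A small sketch of that variant, using the same path variable as above:

import warnings
from openpyxl import load_workbook

# ignore the "Discarded range with reserved name" notice only while loading
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    wb = load_workbook(path)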