elasticsearch-dsl using from and size - python-2.7

I'm using python 2.7 with Elasticsearch-DSL package to query my elastic cluster.
Trying to add "from and limit" capabilities to the query in order to have pagination in my FE which presents the documents elastic returns but 'from' doesn't work right (i.e. I'm not using it correctly I spouse).
The relevant code is:
s = Search(using=elastic_conn, index='my_index'). \
    filter("terms", organization_id=org_list)
hits = s[my_from:my_size].execute()  # if my_from = 10 and my_size = 10 I get 0 documents, although 100 documents match the filters
My index contains 100 documents.
Even when my filter matches all results (i.e. nothing is filtered out), if I use my_from = 10 and my_size = 10, for instance, then I get nothing in hits (no matching documents).
Why is that? Am I misusing from?
Documentation states:
from and size parameters. The from parameter defines the offset from the first result you want to fetch. The size parameter allows you to configure the maximum amount of hits to be returned.
So it seems really straightforward, what am I missing?

The answer to this question can be found in their documentation under the Pagination Section of the Search DSL:
Pagination
To specify the from/size parameters, use the Python slicing API:
s = s[10:20]
# {"from": 10, "size": 10}
With the Search DSL you use these parameters just as you would slice a Python list: from the starting index to the end index. The size parameter is implicitly the end index minus the start index.
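Applied to the question's code, a minimal sketch (assuming my_from is the offset and my_size is the page size you want) would be:
s = Search(using=elastic_conn, index='my_index'). \
    filter("terms", organization_id=org_list)
# Slice from the offset to offset + page size, not from offset to page size.
hits = s[my_from : my_from + my_size].execute()  # from=10, size=10 -> hits 10-19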
Hope this clears things up!

Try to pass from and size params as below:
search = Search(using=elastic_conn, index='my_index'). \
    filter("terms", organization_id=org_list). \
    extra(from_=10, size=20)
result = search.execute()

Related

AWS Textract (OCR) not detecting some cells

I am using AWS Textract to read and parse tables from a PDF into CSV. Lovely, AWS has documentation for it! https://docs.aws.amazon.com/textract/latest/dg/examples-export-table-csv.html
I have set up the Asynchronous method as they suggest, and it works for a POC. However, for some documents, some lines are not shown in my csv.
After digging a bit into the json produced (the issue is persistent if I use AWS CLI to make the document analysis), I noticed that the values missing have no CELL block referenced. Those missing values are referenced into WORD block, and LINE block, but not in CELL block. According to the script, that's exactly the reason why it's not added to my csv.
We could assume the OCR algorithm just isn't that good. But the fun fact is that if I use the same PDF within the AWS Textract console, all the data is parsed into the table!
Are any of you aware of any parameters I would need to use to make sure the values are detected as CELL blocks? Or do you think that behind the scenes they simply use a more powerful script (one that would actually use the (x,y) coordinates of each WORD to reconstruct the table)?
I also compared the JSON produced from the CLI to the one from the console, and it's actually different! (Not only the IDs, but also, as said, some values are in CELL blocks for the console while only in LINE/WORD for the CLI.)
Important fact: my PDF is 3 pages long. The first page is working perfectly fine with all the values, but the second one is missing the first 10 lines of the table basically. After those 10 lines, everything is parsed on this page as well.
Any suggestions? Or script to parse more efficiently the json provided?
Thank you!
Update: Basically the issue was the pagination of the results. There is a maximum of 1000 objects according to AWS documentation: https://docs.aws.amazon.com/textract/latest/dg/API_GetDocumentAnalysis.html#API_GetDocumentAnalysis_RequestSyntax
If you have more than this number of objects in a single table, then the IDs are in the first 1000 blocks, while the objects themselves are referenced in the second batch (1001 --> 2000). So when trying to add the cell to the table, the script can't find the reference.
Basically the solution is quite easy. We need to alter the GetResults function to concatenate each response, and THEN run the other functions.
Here is a functioning code:
def GetResults(jobId, file_name):
    maxResults = 1000
    paginationToken = None
    finished = False
    blocks = []

    # Keep calling get_document_analysis until there is no NextToken left,
    # concatenating the blocks from every page of results.
    while finished == False:
        response = None
        if paginationToken == None:
            response = textract.get_document_analysis(JobId=jobId, MaxResults=maxResults)
        else:
            response = textract.get_document_analysis(JobId=jobId, MaxResults=maxResults,
                                                      NextToken=paginationToken)
        blocks += response['Blocks']
        if 'NextToken' in response:
            paginationToken = response['NextToken']
        else:
            finished = True

    # Only now, with all blocks collected, build the CSV.
    table_csv = get_table_csv_results(blocks)
    output_file = file_name + ".csv"

    # replace content
    with open(output_file, "w") as fout:  # Important to change "at" to "w"
        fout.write(table_csv)

    # show the results
    print('Detected Document Text')
    print('Pages: {}'.format(response['DocumentMetadata']['Pages']))
    print('OUTPUT TO CSV FILE: ', output_file)
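For context, a minimal sketch of how this function might be driven end to end (the bucket and document names are placeholders, and the simple polling loop is just one way to wait for the asynchronous job, not necessarily how the original setup works):
import time
import boto3

textract = boto3.client('textract')

# Start the asynchronous analysis with table detection enabled.
start = textract.start_document_analysis(
    DocumentLocation={'S3Object': {'Bucket': 'my-bucket', 'Name': 'my-document.pdf'}},
    FeatureTypes=['TABLES'])
jobId = start['JobId']

# Poll until the job has finished before paginating through the blocks.
while True:
    status = textract.get_document_analysis(JobId=jobId, MaxResults=1)['JobStatus']
    if status in ('SUCCEEDED', 'FAILED'):
        break
    time.sleep(5)

if status == 'SUCCEEDED':
    GetResults(jobId, 'my-document')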
Hope this will help people.

REST Api pagination Loop... Power Query M language

I am wondering if anyone can help me with API pagination... I am trying to get all records from an external API, but it restricts me to a maximum of 10 records per request. There are around 40k records.
The API also does not show the number of pages (response below), hence I can't get my head around a solution.
There is no "skip", "count" or "top" supported either. I am stuck, and I don't know how to write a loop in the M language that runs until all records are fetched. Can someone help me write the code, or show what it could look like?
Below is my code.
let
    Source = Json.Document(
        Web.Contents(
            "https://api.somedummy.com/api/v2/Account",
            [
                RelativePath = "Search",
                Headers =
                [
                    ApiKey = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx",
                    Authorization = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
                    #"Content-Type" = "application/json"
                ],
                Content =
                    Json.FromValue(
                        [key = "status", operator = "EqualTo", value = "Active", resultType = "Full"]
                    )
            ]
        )
    )
in
    Source
and below is the output:
"data": {
    "totalCount": 6705,
    "page": 1,
    "pageSize": 10,
    "list": [
This might help you along your way. While I was looking into something similar for working with Jira, I found some helpful info from two individuals on the Atlassian Community site. Below is a snippet from a query I developed with the assistance of their posts (to be clear, this snippet is their code, which I used in my query). I'm providing a larger segment of the query further down, but the key part that relates to your particular issue is this:
yourJiraInstance = "https://site.atlassian.net/rest/api/2/search",
Source = Json.Document(Web.Contents(yourJiraInstance, [Query=[maxResults="100",startAt="0"]])),
totalIssuesCount = Source[total],
// Now it is time to build a list of startAt values, starting on 0, incrementing 100 per item
startAtList = List.Generate(()=>0, each _ < totalIssuesCount, each _ +100),
urlList = List.Transform(startAtList, each Json.Document(Web.Contents(yourJiraInstance, [Query=[maxResults="100",startAt=Text.From(_)]]))),
// ===== Consolidate records into a single list ======
// so we have all the records in data, but it is in a bunch of lists each 100 records
// long. The issues will be more useful to us if they're consolidated into one long list
I'm thinking that maybe you could try substituting pageSize for maxResults and totalCount for totalIssuesCount. I don't know about startAt. There must be something similar available to you. Who knows? It could actually be startAt. I believe your pageSize would be 10 and you would increment your startAt by 10 instead of 100.
This is from Nick's and Tiago's posts on this thread. I think the only real difference may be that I buffered a table. (It's been a while and I did not dig into their thread and compare it for this answer.)
let
    // I must credit the first part of this code -- the part between the ********** lines -- as being from Nick Cerneaz (and Tiago Machado) from their posts on this thread:
    // https://community.atlassian.com/t5/Marketplace-Apps-Integrations/All-data-not-displayed-in-Power-BI-from-Jira/qaq-p/723117.
    // **********
    yourJiraInstance = "https://site.atlassian.net/rest/api/2/search",
    Source = Json.Document(Web.Contents(yourJiraInstance, [Query=[maxResults="100",startAt="0"]])),
    totalIssuesCount = Source[total],

    // Now it is time to build a list of startAt values, starting at 0, incrementing by 100 per item
    startAtList = List.Generate(()=>0, each _ < totalIssuesCount, each _ +100),
    urlList = List.Transform(startAtList, each Json.Document(Web.Contents(yourJiraInstance, [Query=[maxResults="100",startAt=Text.From(_)]]))),

    // ===== Consolidate records into a single list ======
    // So we have all the records in data, but they sit in a bunch of lists, each 100 records
    // long. The issues will be more useful to us if they're consolidated into one long list.
    //
    // In essence we need to extract the separate lists of issues in each data{i}[issues] for 0 <= i < #"total"
    // and concatenate those into a single list of issues, which we can then analyse.
    //
    // To figure this out I found this post particularly helpful (thanks Vitaly!):
    // https://potyarkin.ml/posts/2017/loops-in-power-query-m-language/
    //
    // So first create a single list that has as its members each sub-list of the issues,
    // 100 in each except for the last one, which will have just the residual list.
    // So iLL is a List of Lists (of issues):
    iLL = List.Generate(
        () => [i=-1, iL={} ],
        each [i] < List.Count(urlList),
        each [
            i = [i]+1,
            iL = urlList{i}[issues]
        ],
        each [iL]
    ),

    // and finally, collapse that list of lists into just a single list (of issues)
    issues = List.Combine(iLL),

    // Convert the list of issues records into a table
    #"Converted to table" = Table.Buffer(Table.FromList(issues, Splitter.SplitByNothing(), null, null, ExtraValues.Error)),
    // **********

compare two dictionaries, one with a list of float values per key, the other with one value per key (python)

I have a query sequence that I blasted online using NCBIWWW.qblast. In my XML BLAST result file I obtained, for a query sequence, a list of hits (i.e. gi|). Each hit or gi| has multiple HSPs. I made a dictionary my_dict1 where I placed each gi| as a key and appended the bit scores as values, so there are multiple values for each key.
my_dict1 = {
    'gi|1002819492|': [437.702, 384.47, 380.86, 380.86, 362.83],
    'gi|675820360|' : [2617.97, 2614.37, 122.112],
    'gi|953764029|' : [414.258, 318.66, 122.112, 86.158],
    'gi|675820410|' : [450.653, 388.08, 386.27] }
Then I looked for the max value for each key using:
for key, value in my_dict1.items():
    max_value = max(value)
And made a second dictionary my_dict2:
my_dict2 = {
    'gi|1002819492|': 437.702,
    'gi|675820360|' : 2617.97,
    'gi|953764029|' : 414.258,
    'gi|675820410|' : 450.653 }
I want to compare both dictionaries so I can extract the HSP with the highest bit score. I am also including other parameters like query coverage and identity percentage (not shown here). The end goal is to get the best gi| with the highest bit score, coverage and identity percentage.
I tried many things to compare both dictionaries, like this:
First code:
matches[]
if my_dict1.keys() not in my_dict2.keys():
    matches[hit_id] = bit_score
else:
    matches = matches[hit_id], bit_score
Second code:
if hit_id not in matches.keys():
    matches[hit_id] = bit_score
else:
    matches = matches[hit_id], bit_score
Third code:
intersection = set(set(my_dict1.items()) & set(my_dict2.items()))
However, I always end up with 2 types of errors:
1 ) TypeError: list indices must be integers, not unicode
2 ) ... float not iterable...
Please I need some help and guidance. Thank you very much in advance for your time. Best regards.
It's not clear what you're trying to do. What is hit_id? What is bit_score? It looks like your second dict is always going to have the same keys as your first if you're creating it by pulling the max value for each key of the first dict.
You say you're trying to compare them, but don't really state what you're actually trying to do. Find those with values under a certain max? Find those with the highest max?
Your first code doesn't work because I'm assuming you're trying to use a dict key value as an index to matches, which you define as a list. That's probably where your first error is coming from, though you haven't given the lines where the error is actually occurring.
See in-code comments below:
# First off, this needs to be a dict.
matches{}
# This will never happen if you've created these dicts as you stated.
if my_dict1.keys() not in my_dict2.keys():
    matches[hit_id] = bit_score  # Not clear what bit_score is?
else:
    # Also not sure what you're trying to do here. This will assign a tuple
    # to matches with whatever the value of matches[hit_id] is and bit_score.
    matches = matches[hit_id], bit_score
Regardless, we really need more information and the full code to figure out your actual goal and what's going wrong.
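That said, if the end goal is simply the gi| with the highest bit score, a minimal sketch (assuming my_dict1 has string keys, as shown in the question) could be:
# Build the max-score dict (your my_dict2) straight from the dict of score lists.
my_dict2 = {key: max(values) for key, values in my_dict1.items()}

# Pick the gi| whose best bit score is highest overall.
best_hit = max(my_dict2, key=my_dict2.get)
print(best_hit, my_dict2[best_hit])  # e.g. gi|675820360| 2617.97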

Processing influx db output of 'influxdb.resultset.ResultSet'

I am trying to integrate InfluxDB with my application and process the output. I am importing the InfluxDBClient package to connect to the Influx instance running on my local machine, and using query(), which returns data in the 'influxdb.resultset.ResultSet' format.
However, I want to be able to pick specific elements out of the ResultSet for my computations. I tried different functions like keys(), items() and values() from the influxdb-python manual here, but to no avail:
http://influxdb-python.readthedocs.io/en/latest/api-documentation.html
This is the sample output of the query():
Result: ResultSet({'(u'cpu', None)': [{u'usage_guest_nice': 0, u'usage_user': 0.90783871790308868, u'usage_nice': 0, u'usage_steal': 0, u'usage_iowait': 0.056348610076366427, u'host': u'xxx.xxx.hostname.com', u'usage_guest': 0, u'usage_idle': 98.184322579062794, u'usage_softirq': 0.0062609566755314457, u'time': u'2016-06-26T16:25:00Z', u'usage_irq': 0, u'cpu': u'cpu-total', u'usage_system': 0.84522915123660536}]})
I am also finding it hard to get the data in JSON format using Raw mentioned in the above link. Would be great to have any pointers to process the above output.
items() returns tuples in the format ((u'cpu', None), <generator>), where the generator can be used to loop over and get the actual data in dictionary format. Took some time for me to figure out, but it was fun!!
According to the docs you could use the get_points() function to retrieve results from an InfluxDB resultset. The function allows you to filter by either measurement, tag, both measurement AND tag, or simply get all the results without any filtering.
Getting all points
Using rs.get_points() will return a generator for all the points in the ResultSet.
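For example (assuming cli is the same InfluxDBClient instance used in the snippets below):
rs = cli.query("SELECT * from cpu")
all_points = list(rs.get_points())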
Filtering by measurement
Using rs.get_points('cpu') will return a generator for all the points that are in a series with measurement name cpu, no matter the tags.
rs = cli.query("SELECT * from cpu")
cpu_points = list(rs.get_points(measurement='cpu'))
Filtering by tags
Using rs.get_points(tags={'host_name': 'influxdb.com'}) will return a generator for all the points that are tagged with the specified tags, no matter the measurement name.
rs = cli.query("SELECT * from cpu")
cpu_influxdb_com_points = list(rs.get_points(tags={"host_name": "influxdb.com"}))
Filtering by measurement and tags
Using a measurement name and tags will return a generator for all the points that are in a series with the specified measurement name AND whose tags match the given tags.
rs = cli.query("SELECT * from cpu")
points = list(rs.get_points(measurement='cpu', tags={'host_name': 'influxdb.com'}))
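To then pull specific values out of each point, which is what the question asks for: each point is just a dict, so something like this should work (field names taken from the sample output above):
rs = cli.query("SELECT * from cpu")
for point in rs.get_points(measurement='cpu'):
    # Each point is a plain dict keyed by field and tag names.
    print(point['time'], point['host'], point['usage_user'])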

TypeError during executemany() INSERT statement using a list of strings

I am just trying to do a basic INSERT operation into a PostgreSQL database from Python via the Psycopg2 module. I have read a great many of the questions already posted regarding this subject, as well as the documentation, but I seem to have done something uniquely wrong and none of the fixes seem to work for my code.
#API CALL + JSON decoding here
x = 0
for item in ulist:
    idValue = list['members'][x]['name']
    activeUsers.append(str(idValue))
    x += 1

dbShell.executemany("""INSERT INTO slickusers (username) VALUES (%s)""", activeUsers)
The loop creates a list of strings that looks like this when printed:
['b2ong', 'dune', 'drble', 'drars', 'feman', 'got', 'urbo']
I am just trying to have the code INSERT these strings as 1 row each into the table.
The error specified when running is:
TypeError: not all arguments converted during string formatting
I tried changing the INSERT to:
dbShell.executemany("INSERT INTO slackusers (username) VALUES (%s)", (activeUsers,) )
But that seems like it's merely treating the entire list as a single string as it yields:
psycopg2.DataError: value too long for type character varying(30)
What am I missing?
First, the code you pasted:
x = 0
for item in ulist:
idValue = list['members'][x]['name']
activeUsers.append(str(idValue))
x += 1
is not the right way to accomplish what you are trying to do.
Also, list is a built-in name in Python and you shouldn't use it as a variable name. I am assuming you meant ulist.
If you really need access to the index of an item in Python, you can use enumerate:
for x, item in enumerate(ulist):
But the best way to do what you are trying to do is something like:
for item in ulist:  # or list['members']; your example is kinda broken here
    activeUsers.append(str(item['name']))
Your first try was:
['b2ong', 'dune', 'drble', 'drars', 'feman', 'got', 'urbo']
Your second attempt was:
(['b2ong', 'dune', 'drble', 'drars', 'feman', 'got', 'urbo'], )
What I think you want is:
[['b2ong'], ['dune'], ['drble'], ['drars'], ['feman'], ['got'], ['urbo']]
You could get this many ways:
dbShell.executemany("INSERT INTO slackusers (username) VALUES (%s)", [ [a] for a in activeUsers] )
or even better:
for item in ulist:  # or list['members']; your example is kinda broken here
    activeUsers.append([str(item['name'])])

dbShell.executemany("""INSERT INTO slickusers (username) VALUES (%s)""", activeUsers)
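Equivalently, since psycopg2's executemany() just needs a sequence of sequences, a list of one-element tuples works too; a quick sketch reusing the names from the answer above:
# Build one single-element tuple per row instead of a nested list.
rows = [(str(item['name']),) for item in ulist]
dbShell.executemany("INSERT INTO slickusers (username) VALUES (%s)", rows)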