I have the following query in Athena, using S3 as the data store. There are two query parameters inside the lambda expression.
PREPARE my_select1 FROM
SELECT address.phonenumbers FROM "nested-query-db"."data_with_json1" WHERE
    cardinality(filter(address.phonenumbers, js -> js.type = ? and js.number = ?)) > 0 and
    cardinality(filter(address.phonenumbers, js -> js.type = 'city' and js.number = '4')) > 0 and
    firstname = 'Emily';
When I execute it using
EXECUTE my_select1 USING 'Home', '1';
It throws the following error.
java.lang.RuntimeException: Query Failed to run with Error Message: SYNTAX_ERROR: line 1:1:
Sample Data:
{"firstname":"Emily","address":{"streetaddress":"101","city":"abc","state":"","phonenumbers":[{"type":"home","number":"11"},{"type":"city","number":"4"}]}}
I think the problem may be caused by the following.
You are probably using Athena engine version 2, which is based on Presto, specifically Presto version 0.217.
The PrestoDB GitHub repository describes a very similar problem to the one you reported in this issue.
The issue was addressed in PrestoDB by this pull request and included in release 0.256.
This means the fix is not included in Athena engine version 2.
I am not sure whether it will work, but to solve the issue you could try Athena engine version 3, which is based on Trino instead.
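If you decide to try that, one way to switch a workgroup's engine version is through the Athena API. Below is a hedged sketch using boto3; the workgroup name "primary" is an assumption, and the exact version string should match what your Athena console lists.

import boto3

athena = boto3.client("athena")
# Point the workgroup at engine version 3 (Trino-based); adjust the workgroup
# name and the version string to your environment.
athena.update_work_group(
    WorkGroup="primary",
    ConfigurationUpdates={
        "EngineVersion": {"SelectedEngineVersion": "Athena engine version 3"}
    },
)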
As @jccampanero explained, query parameters inside a lambda expression are not supported in Athena engine version 2.
But I have a workaround for you.
The following query works and matches your requirements.
PREPARE my_select1 FROM
SELECT
    address.phonenumbers
FROM "nested-query-db"."data_with_json1"
CROSS JOIN UNNEST(address.phonenumbers) AS t(phone)
WHERE
    phone.type = ? and phone.number = ? and
    cardinality(filter(address.phonenumbers, js -> js.type = 'city' and js.number = '4')) > 0 and
    firstname = 'Emily';

EXECUTE my_select1 USING 'home', '11';
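If you want to run the prepared statement from Python rather than the console, a minimal sketch with boto3 could look like the following; the S3 output location is a placeholder, and the prepared statement is assumed to already exist in the workgroup you query against.

import boto3

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="EXECUTE my_select1 USING 'home', '11'",
    QueryExecutionContext={"Database": "nested-query-db"},
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},  # placeholder bucket
)
print(response["QueryExecutionId"])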
I'm trying to run the following query in Google Cloud BigQuery:
select REGEXP_REPLACE(SPLIT(site, "=")[OFFSET(1)], r'%\d+', ' ')
from some_db
where site = 'something'
and STARTS_WITH(site, 'XXX')
and during the execution I get the following error:
Array index 1 is out of bounds (overflow)
When I was working with AWS Athena, I used to solve such errors using try statements, but I could not find anything equivalent for BigQuery.
How should I handle exceptions?
You should use SAFE_OFFSET instead of OFFSET:
select REGEXP_REPLACE(SPLIT(site, "=")[SAFE_OFFSET(1)], r'%\d+', ' ')
from some_db
where site = 'something'
and STARTS_WITH(site, 'XXX')
As for the more generic try/catch question - BigQuery does not have one, but there is a SAFE prefix that can be used with most functions as SAFE.function_name() - https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#safe-prefix
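If you run the query from Python, a minimal sketch with the google-cloud-bigquery client could look like this; credentials and project are assumed to come from the environment, and some_db is the table referenced in the question.

from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT REGEXP_REPLACE(SPLIT(site, "=")[SAFE_OFFSET(1)], r'%\\d+', ' ')
    FROM some_db
    WHERE site = 'something'
      AND STARTS_WITH(site, 'XXX')
"""
# Iterating over the returned job waits for and yields the result rows.
for row in client.query(sql):
    print(row[0])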
Problem:
I'm currently trying to insert a date time object into my Cassandra database using the following code:
dt_str = '2016-09-01 12:00:00.00000'
dt_thing = datetime.datetime.strptime(dt_str, '%Y-%m-%d %H:%M:%S.%f')
def insert_record(session):
    session.execute(
        """
        INSERT INTO record_table (last_modified)
        VALUES(dt_thing)
        """
    )
However, I'm receiving the following error:
cassandra.protocol.SyntaxException: <Error from server: code=2000 [Syntax error in CQL query] message="line 3:17 no viable alternative at input ')' (...record_table (last_modified) VALUES([dt_thing])...)">
Background Info
I'm relatively new to Cassandra and I'm not sure how to go about this. What I'm basically trying to do is add an existing date time value in my database since an earlier version of the code is looking for one but it does not exist yet (hence, why I'm manually adding one).
I'm using Python 2.7 and Cassandra 3.0.
Any input or how to go about this would be great!
I answered a similar question yesterday. The general idea is that you'll want to define a prepared statement and then bind your dt_thing variable to it, like this:
import datetime

dt_str = '2016-09-01 12:00:00.00000'
dt_thing = datetime.datetime.strptime(dt_str, '%Y-%m-%d %H:%M:%S.%f')

def insert_record(session):
    preparedInsert = session.prepare(
        """
        INSERT INTO record_table (last_modified)
        VALUES (?)
        """
    )
    session.execute(preparedInsert, [dt_thing])
Also, I don't recommend using a timestamp as a lone PRIMARY KEY (which is the only model for which that INSERT would work).
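For completeness, a minimal usage sketch of the helper above; the contact point and keyspace name are assumptions, not from the question.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # assumed local node
session = cluster.connect("my_keyspace")  # assumed keyspace name
insert_record(session)                    # prepared-statement version shown above
cluster.shutdown()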
I think I may have identified a bug in haystack / solr but I'm not sure and wanted to see if I'm doing something completely wrong first. I'm using:
django 1.8
haystack 2.4.1
solr 4.10.4
When I try to filter my SearchQuerySet, SOLR complains of invalid syntax on the filter query that is generated from haystack. Bizarrely, stepping through the code in pdb works, but it all fails under normal circumstances. The relevant portion of code is:
# this is built from a query string but essentially resolves to something like
applicable_filters = {'job_type__in':['PE', 'TE'], 'sector__in':['12','13']}
# Do the query.
sqs = SearchQuerySet().models(self._meta.queryset.model).filter(**applicable_filters).order_by(order).load_all().auto_query(request.GET.get('q', ''))
if not sqs:
    sqs = EmptySearchQuerySet()
When executing this query, SOLR throws the following:
[vagrant#127.0.0.1:2222] out: Failed to query Solr using '(job_type:("PE" OR "TE") AND sector:("12" OR "13") AND )': [Reason: org.apache.solr.search.SyntaxError: Cannot parse '(job_type:("PE" OR "TE") AND sector:("12" OR "13") AND )': Encountered " ")" ") "" at line 1, column 55.
As you can see, it appears that haystack (or perhaps pysolr?) is adding an extra AND clause to the SOLR query, which seems completely wrong. The really bizarre bit is that if I step through the same function in pdb it works.
I am at a loss....
Fixed.
The problem was that I was passing an empty string to the final auto_query() clause. I've migrated from a whoosh backend, which seems to be more tolerant of an empty search string than SOLR.
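For reference, a sketch of the kind of guard that avoids this, based on the snippet from the question; auto_query() is only applied when the search string is non-empty.

sqs = (SearchQuerySet()
       .models(self._meta.queryset.model)
       .filter(**applicable_filters)
       .order_by(order)
       .load_all())
q = request.GET.get('q', '')
if q:
    # Only add the auto_query() clause when there is an actual search string,
    # so haystack does not emit the trailing empty AND clause.
    sqs = sqs.auto_query(q)
if not sqs:
    sqs = EmptySearchQuerySet()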
I am currently developing a Python (2.7) application that uses the Azure Table Storage service via the Azure Python package. As far as I have read in Azure's REST API documentation, batch operations create atomic transactions, so if one of the operations fails, the whole batch fails and no operations are executed.
The problem that I have encountered is the following:
The method below receives via the "rows" parameter a list of dicts. Some have an ETag set (provided from a previous query).
If the ETag is set, attempt a merge operation. Otherwise, attempt an insert operation. Since there are multiple processes that may modify the same entity, it is required to address the concurrency problem via the "if_match" parameter of the merge_entity function. If the merge/insert operations are individual operations (not included in batch), the system works as expected, raising an Exception if the ETags do not match.
Unfortunately, this does not happen if they are wrapped in "begin_batch" / "commit_batch" calls. The entities are merged (wrongly) even though the ETags DO NOT match.
I have provided below both the code and the test case used. I also ran some manual tests several times, with the same conclusion.
I am unsure of how to approach this problem. Am I doing something wrong or is it an issue with the Python package?
The code used is the following:
def persist_entities(self, rows):
    success = True
    self._service.begin_batch()  # If commented, works as expected (fails)
    for row in rows:
        print row
        etag = row.pop("ETag")
        if not etag:
            self._service.insert_entity(self._name,
                                        entity=row)
        else:
            print "Merging " + etag
            self._service.merge_entity(self._name,
                                       row["PartitionKey"],
                                       row["RowKey"],
                                       row, if_match=etag)
    try:  # Also tried with the try at the beginning of the code
        self._service.commit_batch()  # If commented, works as expected (fails)
    except WindowsAzureError:
        print "Failed to merge"
        self._service.cancel_batch()
        success = False
    return success
The test case used:
def test_fail_update(self):
    service = self._conn.get_service()
    partition, new_rows = self._generate_data()  # Partition key and list of dicts
    success = self._wrapper.persist_entities(new_rows)  # Inserts a fresh new entity
    ok_(success)  # Insert succeeds
    rows = self._wrapper.get_entities_by_row(partition)  # Retrieves inserted data for the ETag
    eq_(len(rows), 1)
    for index in rows:
        row = rows[index]
        data = new_rows[0]
        data["Count"] = 155  # Same data, different value
        data["ETag"] = "random_etag"  # Change ETag to a random string
        val = self._wrapper.persist_entities([data])  # Try to merge
        ok_(not val)  # val = True for merge success, False for merge failure.
        # It's always True when the operations are in a batch, False when they are out of a batch
        rows1 = self._wrapper.get_entities_by_row(partition)
        eq_(len(rows1), 1)
        eq_(rows1[index].Count, 123)
        break

def _generate_data(self):
    date = datetime.now().isoformat()
    partition = "{0}_{1}_{2}".format("1",
                                     Stats.RESOLUTION_DAY, date)
    data = {
        "PartitionKey": partition,
        "RowKey": "viewitem",
        "Count": 123,
        "ETag": None
    }
    return partition, [data]
That's a bug in the SDK (v0.8 and earlier). I've created an issue and checked in a fix. It will be part of the next release. You can pip install from the git repo to test the fix.
https://github.com/Azure/azure-sdk-for-python/issues/149
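For example, installing straight from the repository might look like the line below (the exact URL or branch you need may differ):
pip install git+https://github.com/Azure/azure-sdk-for-python.git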
Azure Table Storage has a new Python library in preview release that is available for installation via pip. To install, use the following pip command:
pip install azure-data-tables
There are some differences between how you are performing batching and how batching works in the newest library:
A batch operation can only be committed on entities contained within the same partition key.
A batch operation cannot contain multiple operations on the same row key.
Batches will only be committed if the entire batch is successful; if send_batch raises an error, there will be no updates made.
With this being stated, the create and update operations work the same as a non-batching create and update. For example:
from azure.data.tables import TableClient, UpdateMode
table_client = TableClient.from_connection_string(conn_str, table_name="myTable")
batch = table_client.create_batch()
batch.create_entity(entity1)
batch.update_entity(entity2, etag=etag, match_condition=UpdateMode.MERGE)
batch.delete_entity(entity['PartitionKey'], entity['RowKey'])
try:
    table_client.send_batch(batch)
except BatchErrorException as e:
    print("There was an error with the batch")
    print(e)
For more samples with the new library, check out the samples page on the repository.
(FYI, I am a Microsoft employee on the Azure SDK for Python team)
I'm working on a project in Amazon MTurk. I'm using the Python Boto API.
The boto connection's create_hit() method returns a ResultSet object from which I am trying to get the HIT Id. I also used the Response Groups 'HITDetail', 'HITAssignmentSummary' and 'HITQuestion' in create_hit().
my_hit = mturk_connection.create_hit(hit_type=my_hit_type,
                                     question=my_question,
                                     max_assignments=1,
                                     annotation="An annotation from boto",
                                     lifetime=8*60,
                                     response_groups=['HITDetail', 'HITQuestion', 'HITAssignmentSummary'])
But I am not able to find a way to get the HIT Id from what it returns.
Please help me with this.
In create_hit(), pass 'Minimal' as the value of the 'response_groups' argument.
Then, in your case, use my_hit[0].HITId.
It should work fine now. :)
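For example, a hedged sketch that combines the call from your question with the 'Minimal' response group; mturk_connection, my_hit_type and my_question are assumed to be set up as in your code.

my_hit = mturk_connection.create_hit(hit_type=my_hit_type,
                                     question=my_question,
                                     max_assignments=1,
                                     annotation="An annotation from boto",
                                     lifetime=8*60,
                                     response_groups=['Minimal'])
# create_hit() returns a ResultSet whose first element is the created HIT.
print(my_hit[0].HITId)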