AWS Sagemaker CustomerError: Encoding Mismatch when monitoring input - amazon-web-services

I've deployed a Pipeline model in AWS and am now trying to use ModelMonitor to assess incoming data behavior, but it fails when generating the monitoring report.
The pipeline consists of a preprocessing step followed by a regular XGBoost container. The model is invoked with Content-type: application/json.
I set up monitoring as described in the docs, but it fails with the following error:
Exception in thread "main" com.amazonaws.sagemaker.dataanalyzer.exception.CustomerError: Error: Encoding mismatch: Encoding is JSON for endpointInput, but Encoding is CSV for endpointOutput. We currently only support the same type of input and output encoding at the moment.
I found this issue on GitHub, but it didn't help me.
Digging deeper into how XGBoost outputs its predictions, I found that the output is CSV encoded, so the error makes sense. However, even deploying the model with the serializers enforced fails (code in the section below).
I'm configuring the schedule as recommended by AWS; I've only changed the location of my constraints (I had to adjust them manually).
Tried so far (all attempts fail with the exact same error):
As mentioned in the issue, but since I'm expecting a JSON payload, I've used
data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    json_content_types=['application/json'],
    destination_s3_uri=MY_BUCKET)
Tried enforcing the (de)serializer of the predictor (I'm not sure if that even makes sense)
predictor = Predictor(
    endpoint_name=MY_ENDPOINT,
    # Hoping that I could force the output to be a JSON
    deserializer=sagemaker.deserializers.JSONDeserializer)
and later
predictor = Predictor(
    endpoint_name=MY_ENDPOINT,
    # Hoping that I could force the input to be a CSV
    serializer=sagemaker.serializers.CSVSerializer)
Setting (de)serializer during deploy
p_model = pipeline_model.deploy(
    initial_instance_count=1,
    instance_type='ml.m4.xlarge',
    endpoint_name=MY_ENDPOINT,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
    wait=True)

I came across a similar issue earlier while invoking the endpoint using the boto3 SageMaker runtime client. Try adding the 'Accept' parameter to the invoke_endpoint call with the value 'application/json'.
For more details, see https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html#API_runtime_InvokeEndpoint_RequestSyntax
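A minimal sketch of that suggestion, assuming the caller passes in a client created with boto3.client("sagemaker-runtime"). The helper name invoke_with_json_accept is hypothetical, and whether the model container actually honors the Accept header depends on the container, so treat this as something to try rather than a confirmed fix.

```python
import json


def invoke_with_json_accept(runtime_client, endpoint_name, payload):
    """Invoke an endpoint, sending JSON and asking for JSON back.

    runtime_client is expected to behave like
    boto3.client("sagemaker-runtime"); it is passed in so the helper
    can also be exercised with a stub.
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",  # encoding of the request body
        Accept="application/json",       # encoding requested for the response
        Body=json.dumps(payload),
    )
    # The response body is a stream; read() returns the raw bytes.
    return response["Body"].read()
```

If the captured endpointOutput is still recorded as CSV after this change, the container is probably ignoring the Accept header.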


Why doesn't Django's `full_clean` work when using a non-default database?

Note, this is a rewrite of a deleted question. I'd have edited it, but there were no comments, so I never even knew it was deleted until I went to add new details today.
Background: I'm modifying our code-base to make use of multiple databases for validation of user-submitted load files. This is for automated checks of that user data so that they can fix simple issues with their files, like checks on uniqueness and such.
Our original code (not written by me) has some fundamental issues with loading. It had side-effects. It should have used transaction.atomic, but didn't and simply adding it broke the codebase, so while a refactor will eventually fix that properly, to reduce effort...
Question: I created a second database (with the alias "validation") and inserted .using(db) and .save(using=db) in all the necessary places in our code, so that loading data can be tested without risking the production data.
Everything works as expected with the 2 databases except calls to full_clean(). Take this example:
new_compound_dict = {
    "name": "test",
    "formula": "C1H4",
    "hmdb_id": "HMBD0000001",
}
new_compound = Compound(**new_compound_dict)
new_compound.full_clean()
new_compound.save(using="validation")
It gives me this error:
django.core.exceptions.ValidationError: {'name': ['Compound with this Name already exists.'], 'hmdb_id': ['Compound with this HMDB ID already exists.']}
I get the same error with this code:
new_compound, inserted = Compound.objects.using("validation").get_or_create(**new_compound_dict)
if inserted:
    new_compound.full_clean()
Both examples above work without a problem on the default database.
I looked up full_clean in the docs for Django 3.2, but I don't see a way to have it run against a database other than the default, which I'm assuming is what I would need to do to fix this. I can't even find a mention of potential issues related to a non-default database. I had expected the docs to show a using parameter for full_clean (like the one for .save(using=db)), but there's no such parameter.
I debugged the above examples with this before and after each example block:
Compound.objects.using("default").filter(**new_compound_dict).count()
Compound.objects.using("validation").filter(**new_compound_dict).count()
For the default database, the counts are 0 before and 1 after with no error. For the validation database, the counts are 0 and 1, but with the error mentioned above.
At this point, I'm confounded. How can I run full_clean on a database other than the default? Does full_clean just fundamentally not support non-default databases?
Footnote: The compound loading data in the example above is never validated in a user submission. It is necessary that the compound data be in both databases in order to validate the data submitted by the user, so the compound load script is one of 2 scripts that loads data into both databases (so that it's in the validation DB when the user submits their data). The default load always happens before the validation load and when I load the validation database, the test compound is always present in the default database and is only present after the validation load in the validation database.

Trouble Adding Publishing Action to append to an existing big query table?

I have a Dataflow flow that appends to an existing BigQuery table, which has been working for the last few weeks. When I run it now, it gives me the error "Cannot run job. Please reload the page and try again." and won't even start the job.
After trying a lot of things, I made a copy of the flow. When the publishing action creates a new CSV file, it works, but when I try to add a publishing action to an existing BigQuery table, it keeps giving me another bizarre error: "SyntaxError: Unexpected token C in JSON at position 0".
I have no idea what is happening, since everything used to work perfectly and I made no changes whatsoever.
I am not adding anything new, but I want to give you an answer.
It seems that one of your JSON inputs may be malformed. Try logging it to see what the problem is, and also try skipping malformed JSON strings.
As mentioned in past comments, I suggest checking your logs in Stackdriver to find out why you are getting the error:
"Cannot run job. Please reload the page and try again."
Also, if you can retrieve more information about the error from those logs, it would be useful for assisting you further.
Besides the above, it may be easier to solve this by simply checking the JSON format: a third-party JSON validator can show you where the error in your JSON is.
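As a rough illustration of the "log it and skip malformed strings" advice, here is a sketch in Python using only the standard json module. The function name split_valid_json and the sample inputs are hypothetical. The sample also shows one plausible reading of the error: a parser that fails at position 0 on the letter C may have been handed a plain-text message (for example one starting with "Cannot ...") instead of JSON.

```python
import json


def split_valid_json(raw_strings):
    """Parse each string; return (parsed_values, malformed_entries).

    Each malformed entry is (raw_string, error_position, error_message),
    which is the information worth logging.
    """
    valid, malformed = [], []
    for raw in raw_strings:
        try:
            valid.append(json.loads(raw))
        except json.JSONDecodeError as err:
            malformed.append((raw, err.pos, err.msg))
    return valid, malformed


# Hypothetical inputs: one valid document and one plain-text message.
valid, malformed = split_valid_json(['{"a": 1}', "Cannot run job."])
for raw, pos, msg in malformed:
    print(f"Skipping malformed JSON at position {pos}: {msg!r} in {raw!r}")
```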

From local scrapy to scrapy cloud (scraping hub) - Unexpected results

The scraper I deployed on Scrapy cloud is producing an unexpected result compared to the local version.
My local version can easily extract every field of a product item (from an online retailer) but on the scrapy cloud, the field "ingredients" and the field "list of prices" are always displayed as empty.
You'll see in the attached picture the two elements that always come back empty, whereas it works perfectly locally.
I'm using Python 3 and the stack was configured with a scrapy:1.3-py3 configuration.
At first I thought it was an issue with the regex and Unicode, but it seems not.
So I tried everything (ur, ur RE.ENCODE ....) and nothing worked.
For the ingredients part, my code is the following:
data_box = response.xpath('//*[@id="ingredients"]').css('div.information__tab__content *::text').extract()
data_inter = ''.join(data_box).strip()
match1 = re.search(r'([Ii]ngr[ée]dients\s*\:{0,1})\s*(.*)\.*', data_inter)
match2 = re.search(r'([Cc]omposition\s*\:{0,1})\s*(.*)\.*', data_inter)
if match1:
    result_matching_ingredients = match1.group(1, 2)[1].replace('"', '').replace(".", "").replace(";", ",").strip()
elif match2:
    result_matching_ingredients = match2.group(1, 2)[1].replace('"', '').replace(".", "").replace(";", ",").strip()
else:
    result_matching_ingredients = ''
ingredients = result_matching_ingredients
It seems that the matching never occurs on scrapy cloud.
For prices, my code is the following:
list_prices = []
for package in list_packaging:
    tonnage = package.css('div.product__varianttitle::text').extract_first().strip()
    prix_inter = ''.join(package.css('span.product__smallprice__text').re(r'\(\s*\d+\,\d*\s*€\s*\/\s*kg\)'))
    prix = prix_inter.replace("(", "").replace(")", "").replace("/", "").replace("€", "").replace("kg", "").replace(",", ".").strip()
    list_prices.append(prix)
It's the same story: still empty.
I repeat: it works fine on my local version.
These two fields are the only ones causing issues: I'm extracting a bunch of other data (with regex too) with Scrapy Cloud and I'm very satisfied with it.
Any ideas?
I work with ScrapingHub very often, and the way I usually debug is:
Check the job requests (through the ScrapingHub interface)
This checks whether there is a redirection that makes the page slightly different, such as a query string ?lang=en
Check the job logs (through the ScrapingHub interface)
You can either print or use a logger to check anything you want through your parser. So if you really want to be sure the scraper sees the same thing on your local machine and on ScrapingHub, you can print(response.body) and compare to find what might cause this difference.
If you cannot find it, I'll try to deploy a little spider on ScrapingHub and edit this post if I have some time left today!
Check that ScrapingHub's logs display the expected version of Python, even if the stack is correctly set up in the project's yml file.
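One way to do the "compare response.body" step is to save the body from a local run and from a cloud run, then diff them. The sketch below uses Python's standard difflib. The function name diff_bodies and the two sample HTML strings are made up for illustration; real bodies would come from response.text (or decoded response.body) logged in each environment.

```python
import difflib


def diff_bodies(local_html, cloud_html, context=1):
    """Return unified-diff lines between the two page bodies."""
    return list(
        difflib.unified_diff(
            local_html.splitlines(),
            cloud_html.splitlines(),
            fromfile="local",
            tofile="cloud",
            lineterm="",
            n=context,
        )
    )


# Hypothetical bodies: the cloud run received a differently localized page.
local_body = '<html>\n<div id="ingredients">Ingrédients: eau</div>\n</html>'
cloud_body = '<html>\n<div id="ingredients">Contains: water</div>\n</html>'

changes = diff_bodies(local_body, cloud_body)
print("\n".join(changes))
```

If the diff shows that the cloud run receives a different page (another language, a consent page, a redirect target), the regexes may simply have nothing to match there.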

Use atomic blocks with error handling in Django

I have a Django 1.9 application that is running a section of code where changes are made to the database based on the results of queries to certain remote APIs. For instance, this might be data about commits, file changes, reviewers, pull requests, etc. that I want to save as entities in my database.
commit_data = commit_API_client.get_commit_info(argument1, argument2)
new_commit = models.Commit.Create(**commit_data)
# If the last API call failed, this will fail.
# I will need to run this again to get these files, so I need
# to get the commit all over again, too.
files = file_API_client.get_file_info(new_commit.id)
new_files = models.Files.Create(**files)
# Do some more stuff here
It's very possible that one of the few APIs I'm calling will return some error instead of valid data. I essentially need to make this section into a single atomic transaction so that if there are no errors returned from the HTTP requests, I commit all the changes to the database. Otherwise, I might lose some data if 2 APIs return correctly, but the 3rd doesn't.
I saw that Django supports commit hooks for database transactions, but I was wondering whether that would be applicable to this situation and how I would go about implementing it.
If you want to make that into an atomic operation, just wrap it in transaction.atomic(). If any of your code raises an exception the whole block will roll back. If you're only doing reads from the remote API that should work fine. A typical pattern is:
from django.db import transaction

def my_function():
    try:
        with transaction.atomic():
            # Do your work here; any exception rolls back the whole block
            pass
    except Exception:
        # Do some error handling
        pass
Django does have an on_commit hook, but that isn't really applicable here. Its purpose is to run some code after a transaction has completed successfully. For example, you might use it to write some log data to a remote API if the transaction was successful.
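To make the all-or-nothing behavior concrete without needing a Django project, here is an analogous sketch using Python's standard sqlite3 module rather than Django's transaction.atomic(). The table names, RemoteAPIError, and fetch_files are hypothetical stand-ins for the commit and file API calls in the question; the second call is made to fail on purpose.

```python
import sqlite3


class RemoteAPIError(Exception):
    """Hypothetical error standing in for a failed remote API call."""


def fetch_files(commit_id):
    # Hypothetical stand-in for the second remote call; it always fails here.
    raise RemoteAPIError("file API unavailable")


def import_commit(conn, commit_id):
    """Insert a commit and its files as one unit; return True on success."""
    try:
        with conn:  # commits on success, rolls back if the block raises
            conn.execute("INSERT INTO commits (id) VALUES (?)", (commit_id,))
            for name in fetch_files(commit_id):
                conn.execute(
                    "INSERT INTO files (commit_id, name) VALUES (?, ?)",
                    (commit_id, name),
                )
        return True
    except RemoteAPIError:
        # Error handling goes here; the commit row was already rolled back.
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commits (id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE files (commit_id TEXT, name TEXT)")
conn.commit()

ok = import_commit(conn, "abc123")
remaining = conn.execute("SELECT COUNT(*) FROM commits").fetchone()[0]
```

Because the second call fails, the commit row inserted just before it is rolled back too, so the script can simply be run again later without leaving a half-imported commit behind.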

Django haystack with Solr - how to specify GET parameters?

I'm using django_haystack with Solr 4.9. I've amended the /select request handler so that all requests use dismax by default.
The problem is that sometimes I would like to query specific fields, but I can't find a way to get the SearchQuerySet API to play nicely with dismax. So basically I want to send the following (or equivalent) request to Solr: q=hello&qf=content_auto
I've tried the following approaches:
Standard Api
SearchQuerySet().filter(content_auto='hello')
# understandably results in the following being sent to solr:
q=content_auto:hello
AltParser
query = AltParser('dismax', 'hello', qf="content_auto")
sqs = SearchQuerySet().filter(content=query)
# Results in
q=(_query_:"{!dismax+qf%3Dcontent_auto}hello")
Raw
query = Raw('hello&qf=content_auto')
# results in
q=hello%26qf%3Dcontent_auto
The last approach was so close, but since it escaped the = and & it doesn't seem to process the query correctly.
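The escaping seen in the Raw attempt is what you would expect if the whole string is sent as the value of the single q parameter. The snippet below only illustrates that with Python's standard urllib; it is not Haystack's or pysolr's actual code path, which is an assumption here.

```python
from urllib.parse import urlencode

# The whole Raw string treated as the value of one parameter, q.
as_single_value = urlencode({"q": "hello&qf=content_auto"})

# What the request would need instead: two separate parameters.
as_two_params = urlencode({"q": "hello", "qf": "content_auto"})

print(as_single_value)  # q=hello%26qf%3Dcontent_auto
print(as_two_params)    # q=hello&qf=content_auto
```

In other words, qf has to reach Solr as its own GET parameter, which cannot be achieved by editing the query string passed as q.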
What is the best approach to dealing with this? I have no need for non-dismax querying so it would be preferable to keep the /select request handler the same rather than having to wrap every query in a Raw or AltParser.
In short, the answer is that it can't be done without creating a custom backend and SearchQuerySet. In the end I had to revert to a standard configuration and specify dismax with an AltParser, which is slightly annoying because it affects your spelling suggestions.