how to deal with different ways to write the same thing - django

I want to know whether Django has a module to deal with this problem.
I have multiple spellings of the same city name in a PostgreSQL database, because the data came from scraping different websites. The field "city name" could be "S. Diego" or "San Diego". My question is whether there is a module that would always normalize both to "San Diego", and that would let me add a new rule whenever a new variant such as "S Diego" appears, so that I can maintain this workflow.
Thanks

You can use an API to normalize the data you have scraped. Yandex and Google both have a feature that returns a list of possible location names for a search query. Take the most likely answer they return and use it to map your input to the correct name. Manual mapping is also possible, but I highly recommend relying on one of the giants that has already solved this problem.
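For the manual mapping route mentioned above, a minimal sketch in plain Python might look like the following. The names CITY_ALIASES, normalize_city and add_alias are made up for illustration and are not part of Django or any library; in a Django project the alias table would more likely live in a model than in a dict.

```python
import re

# Hypothetical alias table: normalized variant -> canonical city name.
CITY_ALIASES = {
    "san diego": "San Diego",
    "s diego": "San Diego",
}


def _key(raw):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", raw.casefold())
    return " ".join(cleaned.split())


def normalize_city(raw, aliases=CITY_ALIASES):
    """Return the canonical name if the variant is known, else the stripped input."""
    return aliases.get(_key(raw), raw.strip())


def add_alias(variant, canonical, aliases=CITY_ALIASES):
    """Register a newly discovered spelling for an existing canonical name."""
    aliases[_key(variant)] = canonical
```

Because punctuation is dropped before the lookup, "S. Diego" and "S Diego" share one entry; a spelling that is not yet known is returned unchanged until add_alias is called for it.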

Related

Case Insensitive Search parameters for API endpoint

I am working on a project that integrates the PUBG API. On my site, a player can look up stats by player name, platform and season. One issue I am facing is that the player name has to be exact and is case sensitive. I assumed this to be the case from the beginning. However, after searching for a name on this site, I found that it does not require the name to be case sensitive. A post from the PUBG Dev community here also confirmed my initial assumption. So my question is: if the PUBG API requires names to be case sensitive, how can the linked site find a player even when the name provided is not in the exact, matching case? For example:
I looked up the player name MyCholula. On the PUBG API page for player lookup, it returns the proper value. When I tried mycholula, it did not, and returned a 404. On the linked site above, both variants seem to work. If spaces or other separators were involved in the name, it would be easy to convert it by assuming that all separated words are capitalized (a somewhat naive assumption, though). For this name, I don't see any way of converting mycholula to MyCholula. I also tried many other variants on the linked site above (including user names I got from my friends) to confirm that the site really returns the expected data for any casing of a user name. I also tried other sites like this, and there it did not work, just as it does not work from the PUBG Dev API page or from my page.
I am really confused as to how they are doing it. The only explanation I can come up with is that they store the player records in their own database, where they can perform an advanced regexp-based search to get the actual name. However, this sounds far-fetched, since there are millions of players and it would require them to know all the player names and associated IDs. Also, as far as I know, it is not possible to use regex or other string manipulation to recover the actual name, because there can be many combinations (I am not an expert on regex, so I can't be definitive on this).
Any help or suggestions will be greatly appreciated. Thanks.
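The hypothesis above (a local store of names that have already been seen) would not need regex at all. As a rough sketch of that idea only, and not a description of how the linked site actually works, an index keyed by the case-folded name could look like this. PlayerNameIndex and its methods are hypothetical names.

```python
class PlayerNameIndex:
    """Maps a case-folded name to the exact spelling that was seen earlier."""

    def __init__(self):
        self._by_folded = {}

    def remember(self, exact_name):
        # Store the exact spelling under its case-insensitive key.
        self._by_folded[exact_name.casefold()] = exact_name

    def resolve(self, typed_name):
        # Return the exact spelling, or None if the name was never seen.
        return self._by_folded.get(typed_name.casefold())


index = PlayerNameIndex()
index.remember("MyCholula")
exact = index.resolve("mycholula")
```

The resolved exact name could then be passed to the case-sensitive API. Such an index only helps for names that were recorded before, which matches the concern about needing to know the player names in advance.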

What's the correct way to create a REST service that allows for different types of identifiers?

I need to create a RESTful webservice that allows for addressing entities by using different types of IDs. I will give you an example based on books (which is not what I need to process but I want to build a common understanding this way).
Books can be identified by:
ISBN 13
ID
title
I can create a book by POSTing to /api/v1/books/The%20Bible. This book can then later be addressed by its ISBN /api/v1/books/12312312301 or ID /api/v1/books/A9471IZ1. If I implemented it this way I would need to analyze whatever identifier gets sent and convert it internally.
Is it 'legal' to add the type of identifier to the URL ? Like /api/v1/books/title/The%20Bible?
It seems that what you need is not simply to retrieve resources, but to search for them by certain criteria (in your case, by ISBN, title or ID). In that case, rather than complicating your /books endpoint (which, ideally, should only return books by ID), I'd create a separate /search function. You can then use it to search for books by any field.
For example, you would have:
GET /search?title=bible
GET /search?isbn=12312312301
It can even be easily expanded to add more fields later on.
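A minimal sketch of how such a /search handler might filter by whichever field is supplied, written in plain Python rather than any particular web framework. The in-memory BOOKS list, the second book, and the names SEARCHABLE_FIELDS and search_books are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlparse

# Illustrative data; the first entry reuses the identifiers from the question.
BOOKS = [
    {"id": "A9471IZ1", "isbn": "12312312301", "title": "The Bible"},
    {"id": "B0000XY2", "isbn": "45645645602", "title": "Moby Dick"},
]

# Adding a field here is all it takes to make it searchable.
SEARCHABLE_FIELDS = {"id", "isbn", "title"}


def search_books(url, books=BOOKS):
    """Filter books by every recognized query parameter (substring, case-insensitive)."""
    params = parse_qs(urlparse(url).query)
    results = books
    for field, values in params.items():
        if field not in SEARCHABLE_FIELDS:
            continue  # ignore unknown parameters
        needle = values[0].casefold()
        results = [b for b in results if needle in b[field].casefold()]
    return results
```

For example, search_books("/search?title=bible") returns the entry for "The Bible", and the same function handles /search?isbn=12312312301 without any URL changes.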
First: a RESTful URL should contain only nouns, not verbs. You can find a lot of best practices online, for example: RESTful API Design: nouns are good, verbs are bad
One approach would be to detect the id/identifier in code.
The pattern would be, as you already mentioned:
GET /api/v1/books/{id}, like /api/v1/books/12312312301 or /api/v1/books/The%20Bible
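Detecting the identifier type in code could be sketched roughly as below. The rules are assumptions: an ISBN-13 is taken to be exactly 13 digits once hyphens are removed, and the internal ID format (8 uppercase letters and digits) is guessed from the single example A9471IZ1. classify_identifier is a hypothetical helper.

```python
import re


def classify_identifier(value):
    """Guess which kind of identifier a path segment is."""
    digits = value.replace("-", "")
    if digits.isdigit() and len(digits) == 13:
        return "isbn13"
    # Assumed ID format: 8 chars, uppercase letters and digits, at least one of each.
    if re.fullmatch(r"(?=.*[A-Z])(?=.*\d)[A-Z0-9]{8}", value):
        return "id"
    # Anything else is treated as a title.
    return "title"
```

Under these rules the 11-digit sample 12312312301 from the question would fall through to "title", and a title that happens to look like an ID would be misclassified, which illustrates why guessing the type from the value alone is fragile.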
Another approach, similar to the one from this.lau_, would be to use a query parameter. But I suggest adding the query parameter to the books URL (because of the nouns-only rule):
GET /api/v1/books?isbn=12312312301
The better solution? Not sure…
Because you are selecting “one book by id” (except title), rather than performing a query/search, I prefer the first approach (…/books should return “a collection of books” and .../books/{id} should return only one book).
But maybe someone has a better approach/idea?
Edit:
I suggest not adding the identifier type to the URL; it has a “bad smell”. But it is also a possible approach, and I have seen it in a lot of other APIs. Let’s see if I can find some information on whether it is “ok” or should be avoided.
Edit 2:
See REST API DESIGN - Getting a resource through REST with different parameters but same url pattern and REST - supporting multiple possible identifiers

RESTful API and Foreign key handling for POSTs and PUTs

I'm helping develop a new API for an existing database.
I'm using Python 2.7.3, Django 1.5 and the django-rest-framework 2.2.4 with PostgreSQL 9.1
I need/want good documentation for the API, but I'm shorthanded and I hate writing/maintaining documentation (one of my many flaws).
I need to allow consumers of the API to add new "POS" (points of sale) locations. In the Postgres database, there is a foreign key from pos to pos_location_type. So, here is a simplified table structure.
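CREATE TABLE pos_location_type (
    id serial PRIMARY KEY,
    description text NOT NULL
);
CREATE TABLE pos (
    id serial PRIMARY KEY,
    pos_name text NOT NULL,
    -- every pos must point at an existing location type
    pos_location_type_id int NOT NULL REFERENCES pos_location_type(id)
);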
So, to allow them to POST a new pos, they will need to give me a "pos_name" and a valid pos_location_type. I've been reading about this stuff all weekend. Lots of debates out there.
How are my API consumers going to know what a pos_location_type is? Or what value to pass here?
It seems like I need to tell them where to get a valid list of pos_locations. Something like:
GET /pos_location/
As a quick note, examples of pos_location_type descriptions might be: ('school', 'park', 'office').
I really like the "browsability" of the Django REST Framework, but it doesn't seem to address this type of thing. I actually had a very nice chat on IRC with Tom Christie earlier today, and he didn't really have an answer on what to do here (or maybe I never made my question clear).
I've looked at Swagger, and that's a very cool/interesting project, but take a look at their "pet" resource on their demo here. Notice it is pretty similar to what I need to do. To add a new pet, you need to pass a category, which they define as class Category(id: long, name: string). How is the consumer supposed to know what to pass here? What's a valid id? Or name?
In Django REST Framework, I can define/override what is returned by the OPTIONS call. I guess I could come up with my own little "system" here and return some information like:
pos-location-url: '/pos_location/'
in the generic form, it would be: {resource}-url: '/path/to/resource_list'
and that would sort of work for the documentation side, but I'm not sure if that's really a nice solution programmatically. What if I change the resource's location? That would mean that my consumers would need to programmatically make an OPTIONS call for the resource to figure out all of the relations. Maybe not a bad thing, but it feels a little weird.
So, how do people handle this kind of thing?
Final notes: I get the fact that I don't really want a "leaking" abstraction here, with my database peeking through the API layer, but the fact remains that there is a foreign key constraint on this existing database, and any insert that doesn't have a valid pos_location_type_id raises an error.
Also, I'm not trying to open up the URI vs. ID debate. Whether the user has to use the pos_location_type_id int value or a URI doesn't matter for this discussion. In either case, they have no idea what to send me.
I've worked with this kind of thing in the past. I think there are two ways of approaching this problem. The first you already mentioned: provide an endpoint that lets users of the API find out the id-like value of each pos_location_type. Many APIs do this, because anyone developing against your API will have to read your documentation and will know where to get the pos_location_type values from. End users should not have to worry about this, because they will probably see an interface with a dropdown list of text values.
The other way, which I've also used, is not very RESTful. Suppose you have a location in New York; the POST could be something like:
POST /pos/new_york/
You can handle /pos/(location_name)/ by normalizing the text and then searching the database for that value or something similar; if the place does not exist, you just create a new one. That works if users are allowed to add new places. If not, the user has to know which fixed places exist, which brings us back to the first situation.
That way you can leave pos_location_type out of the request data and map it programmatically to a valid ID.
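The normalize-then-look-up step could be sketched as follows, with a plain dict standing in for the database table. The function names and the way new IDs are assigned are assumptions for illustration; a real implementation would query and insert into the existing tables instead.

```python
def normalize_location(raw):
    """Turn a URL segment like 'new_york' or 'New-York' into 'new york'."""
    spaced = raw.replace("_", " ").replace("-", " ")
    return " ".join(spaced.casefold().split())


def get_or_create_location_id(raw_name, known):
    """Return the ID for a location name, creating a new entry if it is unknown.

    `known` maps normalized names to integer IDs and stands in for the database.
    """
    key = normalize_location(raw_name)
    if key not in known:
        known[key] = max(known.values(), default=0) + 1
    return known[key]


known_locations = {"new york": 1}
same_id = get_or_create_location_id("new_york", known_locations)
new_id = get_or_create_location_id("boston", known_locations)
```

This sketch uses exact matching after normalization; matching on "some similarity" would need an additional fuzzy comparison step.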

perl module to detect foreign url

I'm making a crawler and I only want to use U.S. domains. For example, I would want:
http://thenorthface.com/
but I would not want:
http://uk.thenorthface.com
or
http://se.thenorthface.com/
Does anyone know of a way to do this or a perl module that does this? I know it could be done with regex, but I'm trying to avoid having to get together a list of all foreign domain beginnings... Thanks a lot!
You cannot reliably determine what a "US" domain is from the URL. It's not even clear that the term "US domain" has any meaning.
For example, many US state abbreviations are also ISO-3166 country codes. What will you do with ar.xyz.com? Is that Arkansas or Argentina? What about ma.pdq.com: Massachusetts or Morocco (Maroc in French)?
You may be able to link second-level domains to a country (for a headquarters at least) but hostnames and third-level domains will be impossible to classify.
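The ambiguity can be illustrated with a small sketch, written in Python rather than Perl. The two lookup tables are deliberately tiny, hand-picked samples and possible_meanings is a hypothetical helper; the sketch demonstrates the problem and is not a working classifier.

```python
from urllib.parse import urlparse

# Tiny hand-picked samples, only to demonstrate the overlap.
US_STATES = {"ar": "Arkansas", "ma": "Massachusetts"}
COUNTRIES = {"ar": "Argentina", "ma": "Morocco", "se": "Sweden"}


def possible_meanings(url):
    """List everything the leftmost host label could stand for."""
    host = urlparse(url).hostname or ""
    label = host.split(".")[0].lower()
    meanings = []
    if label in US_STATES:
        meanings.append(US_STATES[label])
    if label in COUNTRIES:
        meanings.append(COUNTRIES[label])
    return sorted(meanings)
```

With these tables, possible_meanings("http://ar.xyz.com") yields both Argentina and Arkansas, so the label alone cannot decide whether the host is "foreign".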

Inexact full-text search in PostgreSQL and Django

I'm new to PostgreSQL, and I'm not sure how to go about doing an inexact full-text search. Not that it matters too much, but I'm using Django. In other words, I'm looking for something like the following:
q = 'hello world'
# body_tsv is presumably a tsvector column; @@ is the PostgreSQL text-search match operator
queryset = Entry.objects.extra(
    where=['body_tsv @@ plainto_tsquery(%s)'],
    params=[q])
for entry in queryset:
    print(entry.title)
where the list of entries should contain either exactly 'hello world' or something similar. The results should then be ordered according to how far their value is from the specified string. For instance, I would like the query to include entries containing "Hello World", "hEllo world", "helloworld", "hell world", etc., with some sort of ranking indicating how far each item is from the perfect, unchanged query string.
How would you go about doing this?
Your best bet is to use Django raw querysets, I use it with MySQL to perform full text matching. If the data is all in the database and Postgres provides the matching capability then it makes sense to use it. Plus Postgres offers some really useful things in terms of stemming etc with full text queries.
Basically it lets you write the actual query you want yet returns models (as long as you are querying a model table obviously).
The advantage this gives you is that you can test the exact query you will be using first in Postgres, the documentation covers full text queries pretty well.
The main gotcha with raw querysets at the moment is they don't support count. So if you will be returning lots of data and have memory constraints on your application you might need to do something clever.
"Inexact" matching, however, isn't really part of the full-text search capabilities. Instead you want the Postgres fuzzystrmatch contrib module. Its use is described here with indexes.
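fuzzystrmatch provides a levenshtein() function, among others. To show the kind of ranking the question asks for, here is a pure-Python sketch of Levenshtein distance applied to the sample strings. It runs in application code, not in the database, and rank_by_distance is a hypothetical helper; in practice the distance would more likely be computed in the SQL query itself.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def rank_by_distance(query, candidates):
    """Sort candidates by case-insensitive edit distance to the query."""
    folded = query.casefold()
    scored = [(levenshtein(folded, c.casefold()), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0])


ranked = rank_by_distance(
    "hello world",
    ["hell world", "Hello World", "helloworld", "hEllo world"],
)
```

Because the comparison is case-insensitive, "Hello World" and "hEllo world" come first with distance 0, followed by the two variants that are one edit away.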
The best would be to use a search engine for this purpose. Django-haystack supports the integration of three different search engines.
In 2022, Django supports full text search with postgres. Full documentation here: https://docs.djangoproject.com/en/4.0/ref/contrib/postgres/search/