How can I scrape specific words from news web pages? - python-2.7

I want to scrape the writer's name from news articles.
But the page structure differs too much between news websites to get it the same way everywhere.
For instance, the name may be in an element whose class name is reporter, author, writer, etc.
I think the best way to get it is to search the body of the web page for the word "reporter", the way "Ctrl + F" would. If it is there, it will probably appear first.
Then I would take the word that comes right before the matched word "reporter".
But I don't know how to write this in Python, or whether it is even possible.
Please give me a draft or a link that I can refer to.
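A minimal sketch of the "Ctrl + F" idea described above, using only the standard library. It crudely strips tags with regular expressions (an assumption that is good enough for a quick experiment; a real HTML parser would be more robust), finds the first occurrence of the keyword, and returns the single word in front of it. The function name `word_before` and the sample HTML are made up for illustration.

```python
import re


def word_before(html, keyword="reporter"):
    """Return the word right before the first `keyword` in the page text."""
    # Drop <script> and <style> blocks so their contents are not searched.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    # Crude tag stripping: replace every remaining tag with a space.
    text = re.sub(r"(?s)<[^>]*>", " ", text)
    # First "<word> <keyword>" pair, ignoring case.
    pattern = r"(\S+)\s+" + re.escape(keyword) + r"\b"
    match = re.search(pattern, text, re.IGNORECASE)
    return match.group(1) if match else None


sample = ('<html><body><p class="byline">Hong Gildong reporter</p>'
          '<p>Article text</p></body></html>')
print(word_before(sample))
```

This only captures one word, so a two-word name would need a wider pattern, and pages that mention "reporter" in a menu before the byline would give a wrong hit. Fetching the page itself is left out here.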

Related

Case Insensitive Search parameters for API endpoint

I am working on a project that involves integrating the PUBG API. On my site, a player can look up stats using their player name, platform and season. One issue I am facing is that the player name has to be exact and is case sensitive. I assumed that to be the case from the beginning. However, after searching for the name on this site, I found that they don't need the name in matching case. Also, this post from the PUBG Dev community here confirmed my initial assumption. So my question is: if the PUBG API requires names to be case sensitive, how can the (linked) site find the player even if the name provided is not in the exact, matching case? For example:
I looked up the player name MyCholula. From the PUBG API page for player lookup, it returns the proper value. When I tried mycholula, it doesn't and sends a 404. On the linked site above, both combinations seem to work. If spaces or other separators were involved in the name, it would be easy to convert it, assuming that separated words are all capitalized (a somewhat naive assumption, though). For this name, I don't see any way of converting mycholula to MyCholula. I also tried many other combinations on the linked site above (and different user names I got from my friends) to confirm that the linked site actually returns the expected data for any casing of a user name. I also tried it on other sites like this, and it didn't work, just like it doesn't work from the PUBG DEV API page or from my page.
I am really confused as to how they are doing it. The only possible explanation I can come up with is that they have the player records stored in their own database, where they can perform an advanced regexp-based search to get the actual name. However, this sounds far-fetched, since there are millions of players and it would require them to know all the player names and associated IDs. Also, as far as I know, it is not possible to use regex or other string manipulation to convert to the actual name, because there can be many combinations (I'm not an expert on regex, so I can't be definitive on this).
Any help or suggestions will be greatly appreciated. Thanks.
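One way a site could accept any casing, assuming it keeps its own record of names it has already resolved through the API, is a lookup keyed on the lowercased name. The sketch below is only a guess at such a design, not how the linked site is known to work; `PlayerNameIndex`, `remember` and `resolve` are hypothetical names.

```python
class PlayerNameIndex(object):
    """Maps a name typed in any case to the exact name seen earlier."""

    def __init__(self):
        self._by_lower = {}

    def remember(self, canonical_name):
        # Store the exact spelling under its lowercased form.
        self._by_lower[canonical_name.lower()] = canonical_name

    def resolve(self, typed_name):
        # Return the exact spelling, or None if the name was never seen.
        return self._by_lower.get(typed_name.lower())


index = PlayerNameIndex()
index.remember("MyCholula")        # e.g. after one successful exact lookup
print(index.resolve("mycholula"))
```

A plain dictionary like this does not need regular expressions, but if two distinct players differ only by case, the later one overwrites the earlier one, so a real index would have to keep a list per key.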

Needing advanced ID3 tag handling for specific situation

My situation is probably quite rare and complex, so I'll explain it in detail.
Many years ago, I put together a hand-selected collection of MP3s, which ended up taking a month or so and is now at 8000 songs. All of these songs were manually ID3 tagged, which took me forever. Unfortunately, I had a strange tagging philosophy. For songs that featured multiple artists, I would put the features in the Artist field, rather than the Title field. Here's what I mean:
What I have: OB O'Brien (ft. Drake) - 2 On/Thotful
What every normal person has: OB O'Brien - 2 On/Thotful (ft. Drake)
Is there any software or script that handles ID3 tags that will let me perform an advanced renaming like this? Basically, I want to batch handle my MP3s so that if "(ft. *)" is found in the Artist field, it is removed and instead appended to the end of the Title field. Possible?
Yes, this is possible. Please try Mp3Tag.
In the associated forum you will find many examples of how to add your own "Actions" that do exactly what you want:
Check a specific tag (like Artist or AlbumArtist) for a specific string (like "ft.")
If found, do something with the catch, like moving this part to another tag
You can even use Regular Expressions, like
Action: Guess values
Source format:
%ARTIST%$regexp(%TITLE%,'^(.+?)\s+[[({<]?\s*(?:featuring|feat\.?|ft\.?)\s*([^])}>]+)\>\s*[])}>]?(.*)$',' feat. $2$3+++$1',1)
... or ...
%ARTIST%$regexp(%TITLE%,'^(.+?)\s+[[({<]?\s*(?:featuring|feat\.?|ft\.?)\s*([^])}>]+)[])}>]?(.*)$',' feat. $2$3+++$1',1)
Guessing pattern:
%ARTIST%+++%TITLE%
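For anyone who prefers a script over Mp3Tag, the same move can be sketched in Python on plain strings. This is only the string-handling half, under the assumption that the feature always appears in round brackets in the Artist field; reading and writing the actual ID3 tags would need a tag library and is not shown. `move_feature` is a hypothetical helper name.

```python
import re

# "(ft. X)", "(feat. X)" or "(featuring X)", plus any whitespace before it.
FEAT_RE = re.compile(r"\s*(\((?:featuring|feat\.?|ft\.?)\s+[^)]*\))",
                     re.IGNORECASE)


def move_feature(artist, title):
    """Move a bracketed feature credit from the artist to the title."""
    match = FEAT_RE.search(artist)
    if not match:
        return artist, title          # nothing to move
    new_artist = (artist[:match.start()] + artist[match.end():]).strip()
    new_title = "%s %s" % (title, match.group(1))
    return new_artist, new_title


print(move_feature("OB O'Brien (ft. Drake)", "2 On/Thotful"))
```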

RESTful API and Foreign key handling for POSTs and PUTs

I'm helping develop a new API for an existing database.
I'm using Python 2.7.3, Django 1.5 and the django-rest-framework 2.2.4 with PostgreSQL 9.1
I need/want good documentation for the API, but I'm shorthanded and I hate writing/maintaining documentation (one of my many flaws).
I need to allow consumers of the API to add new "POS" (points of sale) locations. In the Postgres database, there is a foreign key from pos to pos_location_type. So, here is a simplified table structure.
pos_location_type(
id serial,
description text not null
);
pos(
id serial,
pos_name text not null,
pos_location_type_id int not null references pos_location_type(id)
);
So, to allow them to POST a new pos, they will need to give me a "pos_name" and a valid pos_location_type. I've been reading about this stuff all weekend. Lots of debates out there.
How are my API consumers going to know what a pos_location_type is? Or what value to pass here?
It seems like I need to tell them where to get a valid list of pos_locations. Something like:
GET /pos_location/
As a quick note, examples of pos_location_type descriptions might be: ('school', 'park', 'office').
I really like the "Browseability" of the Django REST Framework, but it doesn't seem to address this type of thing. I actually had a very nice chat on IRC with Tom Christie earlier today, and he didn't really have an answer on what to do here (or maybe I never made my question clear).
I've looked at Swagger, and that's a very cool/interesting project, but take a look at their "pet" resource on their demo here. Notice it is pretty similar to what I need to do. To add a new pet, you need to pass a category, which they define as class Category(id: long, name: string). How is the consumer supposed to know what to pass here? What's a valid id? Or name?
In Django REST framework, I can define/override what is returned in the OPTIONS call. I guess I could come up with my own little "system" here and return some information like:
pos-location-url: '/pos_location/'
in the generic form, it would be: {resource}-url: '/path/to/resource_list'
and that would sort of work for the documentation side, but I'm not sure if that's really a nice solution programmatically. What if I change the resource's location? That would mean that my consumers would need to programmatically make an OPTIONS call for the resource to figure out all of the relations. Maybe not a bad thing, but it feels a little weird.
So, how do people handle this kind of thing?
Final notes: I get the fact that I don't really want a "leaky" abstraction here, with my database peeking through the API layer, but the fact remains that there is a foreign key constraint on this existing database, and any insert that doesn't have a valid pos_location_type_id raises an error.
Also, I'm not trying to open up the URI vs. ID debate. Whether the user has to use the pos_location_type_id int value or a URI doesn't matter for this discussion. In either case, they have no idea what to send me.
I've worked with this kind of stuff in the past. I think there are two ways of approaching this problem. The first one you already mentioned: provide an endpoint where users of the API can find out the id-like values of pos_location_type. Many APIs do this, because a person developing against your API is going to have to read your documentation and will know where to get the pos_location_type values from. End users should not have to worry about this, because they will have an interface that probably shows a dropdown list of text values.
The other way I've handled this is not very RESTful. Let's suppose you have a location in New York, and the POST could be something like:
POST /pos/new_york/
You can handle /pos/(location_name)/ by normalizing the text and then searching the database for that value or something similar; if the place does not exist, you just create a new one. That applies if users can add new places; if not, the user has to know which fixed places exist, which again is the first situation.
That way you can leave pos_location_type out of the request data and programmatically map it to a valid ID.
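A rough sketch of the "normalize, then map to a valid ID" step, kept independent of Django so it can be read on its own. The in-memory dictionary stands in for the pos_location_type table, and the IDs and the helper names `normalize` and `resolve_location_type_id` are assumptions for illustration only.

```python
def normalize(text):
    """Lowercase, turn underscores into spaces and collapse whitespace."""
    return " ".join(text.replace("_", " ").lower().split())


def resolve_location_type_id(raw_name, known_types):
    """Return the id whose description matches raw_name, or None."""
    wanted = normalize(raw_name)
    for type_id, description in known_types.items():
        if normalize(description) == wanted:
            return type_id
    return None


# Stand-in for rows of pos_location_type (ids are made up).
known_types = {1: "school", 2: "park", 3: "office"}
print(resolve_location_type_id("  Park ", known_types))
```

When the lookup returns None, the API can either reject the request with a list of valid descriptions or create a new row, depending on whether consumers are allowed to add location types.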

Django models - link group of words to single word in lexicon/dictionary

Sorry for the lack of a better title. I didn't know how to name what I'm trying to do. Anyways, Django noob trying to do the following:
I'm building a lexicon/dictionary. You search for a word and information about that word is being displayed. But the info also contains related words, grouped into some sort of logical cluster(s).
For instance, you search for the word 'bicycle'. On the same page, words like 'unicycle' and 'tricycle' (they are from the same table as the main word) are grouped together under the cluster name 'bicycle types'. Grouping of the words is done by making a group and adding words to it. For this I have the following model (simplified):
models.py
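from django.db import models

class Word(models.Model):
    # CharField requires max_length; 255 is an arbitrary choice here.
    word = models.CharField(max_length=255)

class WordGroup(models.Model):
    name = models.CharField(max_length=255)
    words = models.ManyToManyField(Word)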
Then in the admin I can select the group from an inline.
I'm not sure if this is conceptually the way to do it. It crashes python (locally) so I can imagine it's not :)
Any help greatly appreciated!
With some help from a co-worker I managed to get the output I was looking for. The solution was to define an InlineModelAdmin object for the relation (https://docs.djangoproject.com/en/dev/ref/contrib/admin/#working-with-many-to-many-models).
The problem with not being able to access the wordgroupadmin seems to lie in MySQL. The same setup works in sqlite3, but hangs in MySQL. I'll have to investigate that further.

REST URIs and operations on an object that can be commented on, tagged, rated, etc

I'm doing research into a web API for my company, and it's starting to look like we might implement a RESTful one. I've read a couple of books about this now (O'Reilly's "RESTful web services" seeming the most useful) and have come up with the following set of URIs and operations for an object that can be commented on, tagged, and rated.
It doesn't really matter what the object is, as this scenario applies to many things on the net, but for the sake of argument let's say it's a movie.
Some of these seem to fit quite naturally, but others seem a bit forced (rating and tagging particularly) so does anybody have any suggestions about how these could be improved? I'll list them with the URI and then the supported verbs, and what I propose they would do.
/movies
GET = List movies
/movies/5
GET = Get movie 5
/movies/5/comments
GET = List comments on movie 5
POST = Create a new comment on movie 5
/movies/5/comments/8
GET = Get comment 8 on movie 5
POST = Reply to comment 8 on movie 5
PUT = Update comment 8 on movie 5
/movies/5/comments/8/flag
GET = Check whether the movie is flagged as inappropriate (404 if not)
PUT = Flag movie as inappropriate
/movies/5/rating
GET = Get the rating of the movie
POST = Add the user rating of the movie to the overall rating
Edit: My intention is that the movie object would contain its rating as a property, so I wouldn't really expect the GET method to be used here. The URI really exists so that the rating can be an individual resource that can be updated using the POST verb. I'm not sure if this is the best way of doing it, but I can't think of a better one
/movies/5/tags/tagname
GET = Check whether the movie is tagged with tagname (404 if not; but if it is tagged with the tag name, should it return the actual tag resource by redirecting to something like /tags/tagname?)
PUT = Add tag tagname to the movie, creating the tag resource /tags/tagname if required
DELETE = Remove tag tagname from the movie, deleting the tag resource tags/tagname if nothing is tagged with it after this removal
Note that these wouldn't be the entire URIs, for example the URI to list the movies would support filtering, paging and sorting. For this I was planning on something like:
/movies/action;90s/rating,desc/20-40
Where:
action;90s is a semi-colon delimited set of filter criteria
rating,desc is the sort order and direction
20-40 is the range of item indices to get
Any comments about this API scheme too?
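To make the proposed path format concrete, here is a small parser sketch for it. It assumes exactly the four segments shown above (collection, filters, sort, range) and does no validation of filter names or sort fields; `parse_listing_path` and the dictionary keys are hypothetical names.

```python
def parse_listing_path(path):
    """Split /movies/<filters>/<sort>/<range> into its parts."""
    parts = path.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "movies":
        raise ValueError("expected /movies/<filters>/<sort>/<range>")
    _, filters, sort, item_range = parts
    sort_field, sort_direction = sort.split(",")
    start, end = item_range.split("-")
    return {
        "filters": filters.split(";"),   # semicolon-delimited criteria
        "sort_field": sort_field,
        "sort_direction": sort_direction,
        "start": int(start),             # first item index
        "end": int(end),                 # last item index
    }


print(parse_listing_path("/movies/action;90s/rating,desc/20-40"))
```

One thing the sketch makes visible is that every segment is positional: a client that wants paging without filtering has no obvious way to skip the filter segment.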
Edit #1
This post is getting quite long now! After reading some of the answers and comments, these are the changes from above I'm planning on making:
Tags will be handled as a group rather than individually, so they will be at:
/movies/5/tags
GET = List tags
POST = Union of specified tags and existing tags
PUT = Replace any current tags with specified tags
DELETE = Delete all tags
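The four verbs on /movies/5/tags map directly onto set operations. A minimal sketch of those semantics, with the tag storage reduced to a Python set and `apply_tag_request` as a hypothetical helper name:

```python
def apply_tag_request(method, existing, specified=()):
    """Return the movie's tag set after applying one request."""
    existing = set(existing)
    specified = set(specified)
    if method == "GET":
        return existing                 # list tags, no change
    if method == "POST":
        return existing | specified     # union with the current tags
    if method == "PUT":
        return specified                # replace the current tags
    if method == "DELETE":
        return set()                    # delete all tags
    raise ValueError("unsupported method: %s" % method)


print(sorted(apply_tag_request("POST", ["funny"], ["funny", "classic"])))
```

With these definitions, repeating the same POST, PUT or DELETE leaves the tag set unchanged after the first call.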
I'm still really not sure how to handle flagging a comment though. One option is that instead of POSTing to a comment replying to it, a comment object will include its parent so it can be POSTed to the general URI, i.e.
/movie/5/comment
POST = Create a new comment (which may be a reply to a comment)
I could then use the POST to a comment to flag it. But this still doesn't feel quite right.
/movie/5/comment/8
POST = Flag comment
Most of what you have looks good. There were just a couple of strange things I saw. When I put my URLs together, I try to follow these four principles.
Peel the onion
If you make the R in REST really be a resource then the resource URL should be able to be peeled back and still be meaningful. If it doesn't make sense you should rethink how to organize the resource. So in the case below, each makes sense. I am either looking at a specific item, or a collection of items.
/movies/horror/10/
/movies/horror/
/movies/
The following seems funny to me because flag isn't a resource, it's a property of the movie.
/movies/5/comments/8/flag -> Funny
/movies/5/comments/8/ -> Gives me all properties of comment including flag
Define the View
The last piece of the URL describes how to show the resource. The URL /movies/horror/ tells me I will have a collection of movies refined by horror. But there might be different ways I want to display that collection.
/movies/horror/simple
/movies/horror/expanded
The simple view might just be the title and an image. The expanded view would give a lot more information like description, synopsis, and ratings.
Helpers
After the resource has been limited and the proper view figured out, query string parameters are used to help the UI with the little stuff. The most common query string parameters I use are
p => Page
n => number of items to display
sortby => field to sort by
asc => sort ascending
So I could end up with a URL like
/movies/horror/default?p=12&n=50&sortby=name
This will give me the list of movies limited to horror movies with the default view; starting on page 12 with 50 movies per page where the movies are sorted by name.
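As a sketch of how a server might read these helper parameters, the snippet below parses the example URL with the standard library. The defaults (page 1, 20 items) and the derived `offset` value are assumptions that are not part of the scheme described above, and `parse_listing_url` is a hypothetical name.

```python
try:
    from urllib.parse import urlsplit, parse_qs   # Python 3
except ImportError:
    from urlparse import urlsplit, parse_qs       # Python 2.7


def parse_listing_url(url):
    """Extract the path and paging/sorting helpers from a listing URL."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    page = int(query.get("p", ["1"])[0])          # assumed default: page 1
    per_page = int(query.get("n", ["20"])[0])     # assumed default: 20 items
    return {
        "path": parts.path,
        "page": page,
        "per_page": per_page,
        "sort_by": query.get("sortby", [None])[0],
        "offset": (page - 1) * per_page,          # index of the first item
    }


print(parse_listing_url("/movies/horror/default?p=12&n=50&sortby=name"))
```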
Actions
The last thing needed is your actions on the resource. The actions are either collection based or item based.
/movies/horror/
GET -> Get resources as a list
POST -> Create, Update
/movies/horror/10/
GET -> Get resource as item
POST -> Update
I hope this helps.
I disagree with the edit. Queries should be defined by querystrings as per Martijn Laarman's post. i.e.:
/movies?genre=action&timeframe=90s&lbound=20&ubound=40&order=desc
Well, the way I see it some of the information you return now as objects could simply be added to the metadata of its parent object.
For instance, rating could be part of the response of /movies/5
<movie>
<title>..</title>
..
<rating url="movies/ratings/4">4</rating>
<tags>
<tag url="movies/tags/creative">creative</tag>
...
Removing a tag simply means posting the above response without that tag.
Also queries should go in URL variables, I believe:
/movies/?startsWith=Forrest%20G&orderBy=DateAdded
Based on my understanding of ROA (I'm only on chapter five of RESTful Web Services) it looks good to me.
This is an awesome initial draft for a spec of a REST API. The next step would be to specify expected return codes (like you did with "404 No Tag Available"), acceptable Content-Types, and available content-types (e.g., HTML, JSON). Doing that should expose any additional chinks you'll need to hammer out.
@Nelson LaQuet:
Using the HTTP methods as they are actually defined gives you the safety of knowing that executing a GET on anything on a web site or service won't eat your data or otherwise mangle it. As an example (pointed out in RESTful Web Services) Google's Web Accelerator expects this behaviour -- as stated in the FAQ -- and presumably other services do too.
Also, it gets you idempotency for free. That is, doing a GET, DELETE, HEAD or PUT on a resource more than once is the same as doing it only once. Thus, if your request fails, all you have to do is run it again.
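The "just run it again" point can be sketched as a retry wrapper that only repeats methods listed above as idempotent. The `send` callable stands in for whatever actually performs the HTTP request, and the choice of IOError as the retryable failure and three attempts are assumptions for illustration.

```python
# Methods the text above lists as safe to repeat.
IDEMPOTENT_METHODS = frozenset(["GET", "HEAD", "PUT", "DELETE"])


def send_with_retry(method, send, max_attempts=3):
    """Call send(); retry on IOError only if the method is idempotent."""
    attempts = max_attempts if method.upper() in IDEMPOTENT_METHODS else 1
    last_error = None
    for _ in range(attempts):
        try:
            return send()
        except IOError as error:
            last_error = error        # remember the failure and try again
    raise last_error
```

A POST is attempted once only, because repeating it could, for example, create the same comment twice.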
This is not REST.
A REST API must not define fixed resource names or hierarchies (an obvious coupling of client and server). Servers must have the freedom to control their own namespace. Instead, allow servers to instruct clients on how to construct appropriate URIs, such as is done in HTML forms and URI templates, by defining those instructions within media types and link relations. [Failure here implies that clients are assuming a resource structure due to out-of band information, such as a domain-specific standard, which is the data-oriented equivalent to RPC's functional coupling].
http://roy.gbiv.com/untangled/2008/rest-apis-must-be-hypertext-driven
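What "hypertext-driven" means for a client can be sketched as follows: instead of building /movies/5/comments itself, the client looks up a link relation in the representation it received and uses whatever URI the server put there. The "links", "rel" and "href" layout below is an assumed representation format, not one defined anywhere in this thread, and `find_link` is a hypothetical helper.

```python
def find_link(representation, rel):
    """Return the href of the first link with the given relation."""
    for link in representation.get("links", []):
        if link.get("rel") == rel:
            return link["href"]
    return None


# A made-up movie representation as a client might receive it.
movie = {
    "title": "Example Movie",
    "links": [
        {"rel": "self", "href": "/movies/5"},
        {"rel": "comments", "href": "/movies/5/comments"},
    ],
}
print(find_link(movie, "comments"))
```

If the server later moves comments to a different URI, a client written this way keeps working as long as the "comments" relation is still present.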