Extract a specific user's comments from a list using Wikipedia API and Python 2.7 - python-2.7

I am using the wikitools package to query the Wikipedia API. I get output in the format shown below, and I want to extract the timestamp and the comment for the revisions made by a specific user across several pages. Say I only want the comments made by TechBot; I figured I could do something like:
for revision in res["query"]["pages"]["7940378"]["revisions"]:
    if revision["user"] == "TechBot":
        do.something()
The problem is ["7940378"]: it is a page ID that is unique to each page, and I don't know how to get the page ID in advance. Is there another way of doing this?
[{
"query": {
"pages": {
"7940378": {
"ns": 0,
"pageid": 7940378,
"revisions": [
{
"comment": "robot Modifying: [[az:T\u00fcrk Tarixi]]",
"timestamp": "2009-01-03T19:47:11Z",
"user": "TechBot"
},
{
"comment": "",
"timestamp": "2009-02-14T02:07:49Z",
"anon": "",
"user": "88.231.237.130"
},
{
"comment": "fixing recent deletion by merging it with the next paragraph",
"timestamp": "2009-04-03T14:49:27Z",
"user": "Soap"
},
{
"comment": "robot Modifying: [[az:T\u00fcrk tarixi]]",
"timestamp": "2009-04-09T14:35:19Z",
"user": "RibotBOT"
},
{
"comment": "Repairing link to disambiguation page - [[Wikipedia:Disambiguation pages with links|You can help!]]",
"timestamp": "2009-06-12T23:55:55Z",
"user": "J04n"
}
],
"title": "History of the Turkic peoples"
}
}
},
"continue": {
"rvcontinue": "20090807172715|306635892",
"continue": "||"
},
"warnings": {
"main": {
"*": "Unrecognized parameter: 'user'"
}
}
}]

Instead of a single for loop, use two nested loops: the outer loop iterates over the pages (whatever their IDs are), and the inner loop iterates over each page's revisions.
for pageid, pagedetails in res["query"]["pages"].iteritems():
    for revision in pagedetails["revisions"]:
        if revision["user"] == "TechBot":
            do.something()
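As a slightly fuller sketch of the same idea, the nested loops can be wrapped in a function that collects the timestamp and comment for one user. The function name comments_by_user and the trimmed-down sample dictionary are assumptions for illustration; it also assumes res is the dictionary itself (not the one-element list shown in the printed output) and uses .items(), which works on both Python 2.7 and Python 3.

```python
def comments_by_user(res, user):
    """Collect (pageid, timestamp, comment) for every revision made by `user`."""
    found = []
    # Iterate over all pages so no page ID has to be hard-coded.
    for pageid, page in res["query"]["pages"].items():
        # .get() guards against pages that carry no "revisions" key.
        for revision in page.get("revisions", []):
            if revision.get("user") == user:
                found.append((pageid, revision["timestamp"], revision["comment"]))
    return found


# Trimmed-down copy of the response shown in the question.
sample = {
    "query": {"pages": {"7940378": {
        "title": "History of the Turkic peoples",
        "revisions": [
            {"comment": "robot Modifying: [[az:T\u00fcrk Tarixi]]",
             "timestamp": "2009-01-03T19:47:11Z", "user": "TechBot"},
            {"comment": "fixing recent deletion by merging it with the next paragraph",
             "timestamp": "2009-04-03T14:49:27Z", "user": "Soap"},
        ],
    }}}
}

for pageid, timestamp, comment in comments_by_user(sample, "TechBot"):
    print("%s %s %s" % (pageid, timestamp, comment))
```

Note that the comparison is case-sensitive, so "Techbot" would not match "TechBot".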

Related

AWS StepFunctions - Merge and flatten the task output combined with the original input

How do we use Parameters, ResultPath and ResultSelector to combine the result of a Task with the original input at the same JSON level?
I checked the AWS documentation, but it seems that ResultSelector always creates a new dictionary, which places the result one level below the original input.
Example input
{
"status": "PENDING",
"uuid": "00000000-0000-0000-0000-000000000000",
"first_name": "John",
"last_name": "Doe",
"email": "john.doe#email.com",
"orders": [
{
"item_uuid": "11111111-1111-1111-1111-111111111111",
"quantities": 2,
"price": 2.38,
"created_at": 16049331038000
}
]
}
State Machine definition
"Review": {
"Type": "Task",
"Resource": "arn:aws:states:us-east-1:123456789012:activity:Review",
"ResultPath": null,
"Next": "Processing",
"Parameters": {
"task_name": "REVIEW_REQUIRED",
"uuid.$": "$.uuid"
}
},
Example output from Review Activity
{
"review_status": "APPROVED"
}
Question
How do I update the State Machine definition to combine the result of the Review Activity with the original input, producing something like the following?
{
"status": "PENDING",
"uuid": "00000000-0000-0000-0000-000000000000",
"first_name": "John",
"last_name": "Doe",
"email": "john.doe#email.com",
"orders": [
{
"item_uuid": "11111111-1111-1111-1111-111111111111",
"quantities": 2,
"price": 2.38,
"created_at": 16049331038000
}
],
"review_status": "APPROVED"
}
NOTE
I don't have access to the Activity code, just the definition file.
I recommend NOT doing it the way suggested in the other answer, as you will drop all data that you do not explicitly include. That is not a long-term approach; you can do it more easily like this:
Step Input
{
"a": "a_value",
"b": "b_value",
"c": {
"c": "c_value"
}
}
In your state-machine.json
"Flatten And Keep All Other Keys": {
"Type": "Pass",
"InputPath": "$.c.c",
"ResultPath": "$.c",
"Next": "Some Other State"
}
Step Output
{
"a": "a_value",
"b": "b_value",
"c": "c_value"
}
Step Functions does not let you do this directly, but as a workaround you can add a Pass state that flattens the input.
Example Input:
{
"name": "John Doe",
"lambdaResult": {
"age": "35",
"location": "Eastern Europe"
}
}
Amazon States Language:
"Flatten": {
"Type": "Pass",
"Parameters": {
"name.$" : "$.name",
"age.$" : "$.lambdaResult.age",
"location.$": "$.lambdaResult.location"
},
"Next": "MyNextState"
}
Output:
{
"name": "John Doe",
"age": "35",
"location": "Eastern Europe"
}
It's tedious, but it gets the job done.
Thanks for your question.
It looks like you don't necessarily need to manipulate the output in any way, and are looking for a way to combine the state's output with its input before passing it on to the next state. The ResultPath field allows you to combine a task result with task input, or to select one of these. The path you provide to ResultPath controls what information passes to the output.
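To make the ResultPath behaviour concrete, here is a toy Python model of it. This is only a sketch of the documented semantics, not how Step Functions is implemented: the function apply_result_path is hypothetical, it handles only null, "$", and a single top-level key such as "$.review", and the key name "review" is an assumption.

```python
import copy


def apply_result_path(state_input, task_result, result_path):
    """Toy model of ResultPath for None, "$", or one top-level key ("$.key")."""
    if result_path is None:
        # null: the task result is discarded and the input passes through.
        return copy.deepcopy(state_input)
    if result_path == "$":
        # "$" (the default): the task result replaces the input entirely.
        return copy.deepcopy(task_result)
    if not result_path.startswith("$.") or "." in result_path[2:]:
        raise ValueError("this sketch only supports a single top-level key")
    merged = copy.deepcopy(state_input)
    # "$.key": the task result is inserted into the input under that key.
    merged[result_path[2:]] = copy.deepcopy(task_result)
    return merged


state_input = {"status": "PENDING", "uuid": "00000000-0000-0000-0000-000000000000"}
task_result = {"review_status": "APPROVED"}

nested = apply_result_path(state_input, task_result, "$.review")
print(nested)

# The result still sits one level down; lifting it to the top level is a
# separate step (in a state machine, for example a follow-up Pass state).
flat = dict(nested)
flat.update(flat.pop("review"))
print(flat)
```

With "ResultPath": null, as in the definition from the question, the first branch applies, which is why review_status never reaches the next state.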

How to get school information by ID?

I am trying to find a way to get some information (address, geo-data and so on) about an educational institution by its ID.
My current request returns the following object:
{
"education": [
{
"school": {
"id": "110415448986654",
"name": "BULME"
},
"type": "High School",
"id": "587913304561851"
},
{
"school": {
"id": "114054375298355",
"name": "HTL Bulme Graz-Gösting"
},
"type": "College",
"id": "587913327895182"
}
],
"id": "769605149725998"
}
I can see that there are IDs for the schools, but how can I use them to load data about the institution?
I have read a lot on the Facebook developers pages, but I can't find a solution.

Regular Expressions and Elastic Search

I am trying to retrieve some company results using Elasticsearch. I want to get companies that start with "A", then "B", etc. I can run a fairly typical "prefix" query like so:
GET apple/company/_search
{
"query": {
"prefix": {
"name": "a"
}
},
"fields": [
"id",
"name",
"websiteUrl"
],
"size": 100
}
But this returns Acme as well as Lemur and Associates, so I need to distinguish between an A at the beginning of the whole name and an A at the beginning of any word.
Regular expressions would seem to come to the rescue here, but Elasticsearch just ignores whatever I try. In tests with other applications, ^[\S]a* should get you anything that starts with A that doesn't have a space in front of it. Elasticsearch returns 0 results with the following:
GET apple/company/_search
{
"query": {
"regexp": {
"name": "^[\S]a*"
}
},
"fields": [
"id",
"name",
"websiteUrl"
],
"size": 100
}
In fact, the Sense UI for Elasticsearch immediately flags a "Bad String Syntax Error", because backslashes must be escaped inside the JSON query body. Nonetheless, ^[\\S]a* doesn't work either.
Searching in Elasticsearch is about the query itself, but also about modelling your data so that it suits the query you intend to run. You cannot simply index anything and then struggle to come up with a query that does what you want.
The Elasticsearch way for your query is to have the following mapping for that field:
PUT /apple
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"keyword_lowercase": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase"
]
}
}
}
}
},
"mappings": {
"company": {
"properties": {
"name": {
"type": "string",
"fields": {
"analyzed_lowercase": {
"type": "string",
"analyzer": "keyword_lowercase"
}
}
}
}
}
}
}
And to use this query:
GET /apple/company/_search
{
"query": {
"prefix": {
"name.analyzed_lowercase": {
"value": "a"
}
}
}
}
or
GET /apple/company/_search
{
"query": {
"query_string": {
"query": "name.analyzed_lowercase:A*"
}
}
}
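To see why the mapping matters, the following Python sketch imitates the two ways the name can be turned into index terms. It is a rough stand-in, not Elasticsearch's actual analyzers: standard_tokens approximates a word-splitting, lowercasing analyzer, keyword_lowercase_tokens approximates the keyword tokenizer plus lowercase filter from the mapping above, and all function names and the third sample company are made up for illustration.

```python
import re


def standard_tokens(text):
    # Approximation of a word-based analyzer: one lowercase term per word.
    return [t.lower() for t in re.findall(r"\w+", text)]


def keyword_lowercase_tokens(text):
    # Approximation of keyword tokenizer + lowercase filter: one term per name.
    return [text.lower()]


def prefix_match(tokens, prefix):
    # A prefix query matches a document if any indexed term starts with the prefix.
    return any(t.startswith(prefix) for t in tokens)


names = ["Acme", "Lemur and Associates", "Apple Pie Co"]

analyzed_hits = [n for n in names if prefix_match(standard_tokens(n), "a")]
keyword_hits = [n for n in names if prefix_match(keyword_lowercase_tokens(n), "a")]

print(analyzed_hits)  # word-level terms: "and" and "associates" also match
print(keyword_hits)   # whole-name terms: only names that begin with "a"
```

Because the prefix is compared against each indexed term, a word-split field matches any word starting with "a", while the single whole-name term only matches when the name itself starts with "a".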

How do you handle large relationship data attributes and compound documents?

If an article has many comments (think thousands over time), should data.relationships.comments be returned with a limit?
{
"data": [
{
"type": "articles",
"id": 1,
"attributes": {
"title": "Some title",
},
"relationships": {
"comments": {
"links": {
"related": "https://www.foo.com/api/v1/articles/1/comments"
},
"data": [
{ "type": "comment", "id": "1" }
...
{ "type": "comment", "id": "2000" }
]
}
}
}
],
"included": [
{
"type": "comments",
"id": 1,
"attributes": {
"body": "Lorem ipusm",
}
},
.....
{
"type": "comments",
"id": 2000,
"attributes": {
"body": "Lorem ipusm",
}
},
]
}
This becomes concerning when you think of compound documents (http://jsonapi.org/format/#document-compound-documents): the included section will list all of the comments as well, making the JSON payload quite large.
If you want to limit the number of records you get at a time from a long list, use pagination (JSON API spec).
I would load the comments separately with store.query (Ember docs), like so:
store.query('comments', { author_id: <author_id>, page: 3 });
which will return the relevant subset of comments.
If you don't want to make two requests per author initially, you could include the first 'page' of comments in the author's request, as you're doing now.
You may also want to look into an addon like Ember Infinity (untested), which provides an infinite scrolling list and makes the pagination requests automatically.
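On the server side, the same idea could look like the Python sketch below: only one page of resource identifiers goes into the relationship, plus a link to the next page. The function name comments_relationship, the page[number] and page[size] parameter names, and the page size of 25 are assumptions; the spec reserves the page query parameter family but leaves the strategy to the server.

```python
def comments_relationship(article_id, comment_ids, page_number=1, page_size=25):
    """Build a relationship object holding one page of comment identifiers."""
    base = "https://www.foo.com/api/v1/articles/%s/comments" % article_id
    start = (page_number - 1) * page_size
    page_ids = comment_ids[start:start + page_size]

    links = {"related": base}
    # Only advertise a next page when there are identifiers left to serve.
    if start + page_size < len(comment_ids):
        links["next"] = "%s?page[number]=%d&page[size]=%d" % (
            base, page_number + 1, page_size)

    return {
        "links": links,
        "data": [{"type": "comments", "id": str(i)} for i in page_ids],
    }


all_ids = list(range(1, 2001))  # stand-in for 2000 stored comment IDs
first_page = comments_relationship(1, all_ids)
print(len(first_page["data"]), first_page["links"]["next"])
```

The included section can then be limited to the same page of comments, so the payload size no longer grows with the total number of comments.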

Search words with 'and' logical condition with Facebook Graph API

Using the Facebook Graph API, I am trying to search for all public pages related to two or more words. I want the AND condition to be satisfied.
I am trying a query such as
https://graph.facebook.com/v2.5/search?access_token=my_token&type=page&q=marziano+venusiano&limit=1000
but it returns an empty data array.
I've tried the approaches suggested in older questions, but they no longer seem to work.
What is the right syntax to use, if one exists?
I suspect there is no page with these two words in the name. If you try scuderia and ferrari, the results look as desired:
/search?type=page&q=scuderia+ferrari
returns
{
"data": [
{
"name": "Scuderia Ferrari",
"id": "500214176674878"
},
{
"name": "Scuderia Ferrari",
"id": "105467226153743"
},
{
"name": "Scuderia Ferrari Club Prealpi Venete",
"id": "342159749154826"
},
{
"name": "Scuderia Ferrari Club Prato",
"id": "1414918342093728"
},
{
"name": "Scuderia Ferrari Club Zola Predosa",
"id": "226860530777232"
},
...
],
"paging": {
"cursors": {
"before": "MAZDZD",
"after": "MjQZD"
},
"next": "https://graph.facebook.com/v2.5/search?access_token=&pretty=0&q=scuderia+ferrari&type=page&limit=25&after=MjQZD"
},
}
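For completeness, here is a small Python 3 sketch that builds the same kind of search URL and then double-checks the AND condition on the client side. The helper names build_search_url and has_all_words are hypothetical, the token is a placeholder, and the sketch only constructs the URL and filters names; it makes no request and assumes nothing about how the API itself ranks or matches results.

```python
from urllib.parse import urlencode


def build_search_url(words, token, limit=25):
    """Join the words with spaces; urlencode turns each space into '+'."""
    params = [
        ("type", "page"),
        ("q", " ".join(words)),
        ("limit", limit),
        ("access_token", token),
    ]
    return "https://graph.facebook.com/v2.5/search?" + urlencode(params)


def has_all_words(name, words):
    """Client-side AND check: every word must occur in the page name."""
    lowered = name.lower()
    return all(word.lower() in lowered for word in words)


words = ["scuderia", "ferrari"]
url = build_search_url(words, "my_token")
print(url)

# Names taken from the response above, plus one made-up non-matching name.
names = ["Scuderia Ferrari", "Scuderia Ferrari Club Prato", "Ferrari Fans"]
print([n for n in names if has_all_words(n, words)])
```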