Determine number of contacts with Google People API

I am using the Google People API to fetch a user's connections. Since the results of people.connections.list do not include email addresses, phone numbers, etc., after retrieving a batch of 50 results I poll people:batchGet to fetch the data for those users (it only accepts up to 50 users at a time). This works fine, and after looping over the results a few times I can import all of the contacts. Great!
But because of this setup and the need to loop (some users have thousands of connections, after all), I am using a process that (basically) redirects over and over until we're done. This also works fine, but I'd love to show a progress bar on the redirect screens, and to do that I need to know the total number of connections the user has. I can't seem to find any way to determine the total number of results that people.connections.list could return (provided no filters or sync tokens are passed in). Does anyone know how I can determine the total number of connections we need to loop over with people.connections.list?
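For reference, here is a minimal sketch of the loop described above, assuming the Python client (google-api-python-client), an already-authorized service object, and a hypothetical import_batch() helper standing in for the actual import step:
def import_all_connections(service):
    page_token = None
    while True:
        response = service.people().connections().list(
            resourceName='people/me',
            pageSize=50,
            personFields='names',
            pageToken=page_token,
        ).execute()
        resource_names = [p['resourceName'] for p in response.get('connections', [])]
        if resource_names:
            # people.getBatchGet accepts up to 50 resource names per call
            batch = service.people().getBatchGet(
                resourceNames=resource_names,
                personFields='names,emailAddresses,phoneNumbers',
            ).execute()
            import_batch(batch.get('responses', []))  # hypothetical import step
        page_token = response.get('nextPageToken')
        if not page_token:
            break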

What language are you using? This works for me in Java.
// Get the user's connections
ListConnectionsResponse r = peopleService.people().connections()
.list("people/me")
.setPageSize(500)
// specify fields to be returned
.setRequestMaskIncludeField("person.names,person.emailAddresses")
.execute();
List<Person> connections = r.getConnections();
System.out.println(connections.size());
You can also specify which fields to include if you use a URL like this:
https://people.googleapis.com/v1/people/me/connections?pageSize=500&requestMask.includeField=person.names%2Cperson.emailAddresses&key={YOUR_API_KEY}

There is a recently added totalItems field on the list response which should return the total number of people in the list.
However, the better fix is to request email addresses, phone numbers, etc. in the request mask for the list call, which removes the need for the batchGet loop entirely.
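As a rough sketch of both suggestions combined, assuming the Python client and an authorized service object (newer versions of the API take personFields where the Java snippet above uses a request mask):
response = service.people().connections().list(
    resourceName='people/me',
    pageSize=500,
    # request the extra fields directly so no batchGet call is needed
    personFields='names,emailAddresses,phoneNumbers',
).execute()

print(response.get('totalItems'))  # total number of connections in the list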

Related

aws identity pool filter list users api on multiple conditions

Is there a way to filter the results from the ListUsers API on multiple conditions? I want to get a list of all users whose usernames come from a given list.
import boto3
client = boto3.client('cognito-idp')
client.list_users(UserPoolId='us-east-1_123456789', AttributesToGet=['email'], Filter="username =\"user_name_1\"")
The above code returns only one user. If I want to get the same information for multiple usernames, I can't seem to find a way to do it.
For example:
import boto3
usernames=['user_id1','user_id2']
client = boto3.client('cognito-idp')
client.list_users(UserPoolId='us-east-1_123456789', AttributesToGet=['email'], Filter="username =usernames")
Unfortunately not: https://docs.aws.amazon.com/cognito-user-identity-pools/latest/APIReference/API_ListUsers.html#API_ListUsers_RequestSyntax
You can only filter by strict equality or starts-with; no wildcards or arrays.
That said, ListUsers does not seem to have a specific API-calling limit, so you could call it multiple times in quick succession until you had processed all the usernames.
client.list_users() returns at most 60 users per call (the Limit parameter caps at 60); larger result sets have to be walked with the PaginationToken it returns.
I faced a similar problem and built the filter value ahead of the actual call, inserting element i of the username list into the complete filter expression, so that I could pass that filter value to the function.
In the end, I looped over the list so that the filter value took on each element i in turn.
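A minimal sketch of that loop, reusing the pool ID and username list from the question; it issues one ListUsers call per username with an exact-match filter:
import boto3

client = boto3.client('cognito-idp')
usernames = ['user_id1', 'user_id2']

users = []
for username in usernames:
    # one exact-match filter per username; each call returns at most one user
    response = client.list_users(
        UserPoolId='us-east-1_123456789',
        AttributesToGet=['email'],
        Filter='username = "{}"'.format(username),
    )
    users.extend(response['Users'])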

Parse large file and paginate / load its parts with scrolling

I'm looking for suggestions on the most Django-like way of loading large variable content (say, a massive 10,000-line list) to the user's page part by part, displaying only some lines until the user asks for more.
Here is the detailed scenario (I hope it makes sense; it is just a simple example to illustrate dealing with large template variables and pagination):
User goes to website.com/searchfiles which is hosted on my Django backend and returned as a template
The searchfiles.html template contains one form with a Select drop-down menu that lets the user choose a file that already exists on the server (say there are 20 massive log files). Below the drop-down menu there is a text box that allows the user to enter a regular expression string. So only two items in the form.
P.S. Each file is usually pretty big, e.g. 20-30 MB.
When the user selects a file, enters a regular expression in the text box and clicks "Submit", an HTTP POST is made.
The Django backend receives the POST, reads the filename and regexp string, and executes the function dosearch(FILE, pattern).
dosearch function does something like this:
import re

def dosearch(FILE, pattern):
    result = []
    with open(FILE, 'r') as fh:
        for line in fh:
            if re.match(pattern, line):
                result.append(line)
    return result
Now, "result" is a list that, depending on the pattern, can be pretty large (e.g. 10-20 MB). Processing of the file is now complete and I want to present the user with the "result" variable. After the HTTP POST, the user is redirected to website.com/parsed.
As you can imagine, my goal in step 7 is to return this variable to the user after the HTTP POST. But because the "result" variable can be huge, I don't want to just dump, say, 10,000 lines of output directly onto the page. What I want is for maybe the first 200 lines to be displayed, and as the user scrolls down, an additional 200 lines to be loaded once they reach the bottom of the page.
To keep it simple, ignore the scroll part. The user could also be presented with a [NEXT] button to click and load 200 more entries, and so on.
What is the most Django way to achieve this? Do I need to save results variable to database and use Ajax?
Also assume that multiple users are going to use the very same page/website, so I need to be able to distinguish between two users searching two different files at the same time.
When the user navigates away, the generated "result" variable should be destroyed and freed from memory.
I can think of two possibilities:
A. using a Model
from django.contrib.auth.models import User
from django.db import models

class ResultLine(models.Model):
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    sequence_number = models.IntegerField()
    line = models.CharField(max_length=1000)
    created_at = models.DateTimeField(auto_now_add=True)
After parsing the file you would store each result line as an instance of this model, using sequence_number to specify the order of the lines.
In your result view you could use pagination or generic ListView to show the first lines, or use AJAX to fetch more lines.
You would need to add a delete button to clear the user's data from this model, or run periodic jobs (maybe using crontab and a custom management command) to delete old result lines.
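For illustration, a rough sketch of such a result view using Django's Paginator, assuming 200 lines per page and a hypothetical parsed.html template:
from django.core.paginator import Paginator
from django.shortcuts import render

def parsed(request):
    lines = (ResultLine.objects
             .filter(user=request.user)
             .order_by('sequence_number'))
    paginator = Paginator(lines, 200)  # 200 lines per page
    page = paginator.get_page(request.GET.get('page'))
    return render(request, 'parsed.html', {'page': page})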
B. using session data
Another possibility would be to store the result in the user's session.
request.session['result_list'] = dosearch(FILE, pattern)
Depending on the session engine there could be size restrictions; this post states that database-backed sessions are limited only by the database engine (which means you could store many MB or even GB of data in the session).
Also, your server needs sufficient RAM to hold the whole result list of multiple users.
And later, in your result view, you just read from the session instead of from a model.
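A comparable sketch for the session-based variant, again assuming 200 lines per page and the same hypothetical template:
from django.shortcuts import render

def parsed(request):
    result_list = request.session.get('result_list', [])
    page_number = int(request.GET.get('page', 1))
    start = (page_number - 1) * 200
    lines = result_list[start:start + 200]  # slice of the stored result
    return render(request, 'parsed.html', {
        'lines': lines,
        'total': len(result_list),
    })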
Performance-wise there are differences: both approaches store the data in the database (with database-backed sessions), but option A allows you to do partial reads in your result view, while option B always reads the whole result list into memory on each request (because the entire session dict is stored in an encoded format).

Feed Algorithm + Database: Either too many rows or too slow retrieval

Say I have a general website that allows someone to download their feed in a small amount of time. A user can be subscribed to many different pages, and the user's feed must be returned from the server with only the N most recent posts across all of the pages they subscribe to. Originally, when a user queried the server for a feed, the algorithm was as follows:
Look at all of the pages the user is subscribed to
Get the N most recent posts from each page
Sort all of the posts
Return the N most recent posts to the user as their feed
As it turns out, doing this EVERY TIME a user tried to refresh a feed was really slow. Thus, I changed the database to have a table of feedposts, which simply has a foreign key to a user and a foreign key to the post. Every time a page makes a new post, it creates a feed post for each of its subscribing followers. That way, when a user wants their feed, it is already created and does not have to be created upon retrieval.
The way I am doing this is creating far too many rows and simply does not seem scalable. For instance, if a single page makes 1 post & has 1,000,000 followers, we just created 1,000,000 new rows in our feedpost table.
Please help!
How do companies such as facebook handle this problem? Do they generate the feed upon request? Are my database relationships terrible?
It's not that the original schema itself would be inherently wrong, at least not based on the high-level description you have provided. The slowness stems from the fact that you're not accessing the database in a way relational databases should be accessed.
In general, when querying a relational database, you should use JOINs and in-database ordering where possible, instead of fetching a bunch of data, and then trying to connect related objects and sort them in your code. If you let the database do all this for you, it will be much faster, because it can take advantage of indices, and only access those objects that are actually needed.
As a rule of thumb, if you need to sort the results of a QuerySet in your Python code, or loop through multiple querysets and combine them somehow, you're most likely doing something wrong and you should figure out how to let the database do it for you. Of course, it's not true every single time, but certainly often enough.
Let me try to illustrate with a simple piece of code. Assume you have the following models:
from django.db import models

class Page(models.Model):
    name = models.CharField(max_length=47)
    followers = models.ManyToManyField('auth.User', related_name='followed_pages')

class Post(models.Model):
    title = models.CharField(max_length=147)
    page = models.ForeignKey(Page, on_delete=models.CASCADE, related_name='posts')
    content = models.TextField()
    time_published = models.DateTimeField(auto_now_add=True)
You could, for example, get the list of the last 20 posts posted to pages followed by the currently logged in user with the following single line of code:
latest_posts = Post.objects.filter(page__followers=request.user).order_by('-time_published')[:20]
This runs a single SQL query against your database, which only returns the (up to) 20 results that match, and nothing else. And since you're joining on primary keys of all tables involved, it will conveniently use indices for all joins, making it really fast. In fact, this is exactly the kind of operation relational databases were designed to perform efficiently.
Caching will be the solution here.
You will have to reduce the database reads, which are much slower compared to cache reads.
You can use something like Redis to cache the posts.
Here is an amazing answer for better understanding
Is Redis just a cache
Each page can be assigned a key, and you can pull all of the posts for that page under that key.
You need not cache everything; just cache the most recent M posts, where M >> N and is large enough to noticeably reduce the database calls. If a user requests posts beyond the latest M, those can be fetched directly from the DB.
Now when you have to generate the feed, you can make a DB call to get all of the subscribed pages (or put those in the cache as well) and then just get the required number of posts from the cache.
The problem here would be keeping the cache up to date.
For that you can use something like django-signals. Whenever a new post is added, add it to the cache as well using the signal.
So for each DB write you will have to write to the cache as well.
But then you will not have to read from the DB, and as Redis is an in-memory datastore it is pretty fast compared to standard relational databases.
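A rough sketch of that write-through idea, assuming redis-py, the Post model from the answer above, and an illustrative cap of M = 500 cached posts per page:
import json

import redis
from django.db.models.signals import post_save
from django.dispatch import receiver

r = redis.Redis()
M = 500  # how many recent posts to keep cached per page

@receiver(post_save, sender=Post)
def cache_new_post(sender, instance, created, **kwargs):
    if not created:
        return
    key = 'page:{}:posts'.format(instance.page_id)
    r.lpush(key, json.dumps({'id': instance.id, 'title': instance.title}))
    r.ltrim(key, 0, M - 1)  # keep only the latest M posts per page

def latest_posts_for_page(page_id, n):
    # read the n most recent cached posts for one page
    key = 'page:{}:posts'.format(page_id)
    return [json.loads(item) for item in r.lrange(key, 0, n - 1)]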
Edit:
These are a few more articles that can help with understanding:
Does Stack Exchange use caching and if so, how
How Twitter Uses Redis to Scale - 105TB RAM, 39MM QPS, 10,000+ Instances

Find User's First Post?

Using the Graph API or FQL, is there a way to efficiently find a user's first post or status? As in, the first one they ever made?
The slow way, I assume, would be to paginate through the feed, but for users like me who joined in 2005 or earlier, that would take a very long time and a huge number of API calls.
From what I have found, we cannot obtain the date the user registered with Facebook for a good starting point, and we cannot sort by date ascending (not outside of the single page of data returned) to get the oldest post on top.
Is there any reasonable way to do this?
You can use Facebook Query Language (FQL) to get the first post's information.
Please refer to the query below:
SELECT message, time FROM status WHERE uid= me() ORDER BY time ASC LIMIT 1
I think the public API is limited in the depth of information it is allowed to query. Facebook probably put in these constraints for performance and cost reasons. Maybe they've changed it. When I tried to go backwards through a person's stream about 4 months ago, there seemed to be a limit on how far back I could go. Maybe it's a time limit, or a limit on the number of posts back. If you know when your user first posted, then getting to it should be fairly quick using the since/until timestamps in your queries.
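For illustration, a hedged sketch of using since/until, assuming the requests library, a valid access token, and that the feed endpoint still accepts these parameters (this API has changed repeatedly over the years):
import requests

def posts_between(access_token, since_ts, until_ts):
    # fetch posts in a known time window instead of paging the whole feed
    params = {
        'access_token': access_token,
        'since': since_ts,   # Unix timestamps
        'until': until_ts,
        'limit': 100,
    }
    resp = requests.get('https://graph.facebook.com/me/feed', params=params)
    return resp.json().get('data', [])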

How can I get the total number of notes without retrieving them all

In order to go through all notes of a user, I first want to know the total number available. The only way I have found so far is to get them all by setting the limit to a high enough value, but this is pretty slow since all note objects will be fetched, including their content. There must be a more efficient way, since the normal public https://www.facebook.com/<user>?sk=notes page displays the total in its bottom line very quickly.
You can't get only the number of notes without fetching the details for all of them.
You can, however, speed up the query by limiting the fields retrieved, adding a fields argument to the notes connection URL:
http://graph.facebook.com/me/notes?fields=id&access_token=...
This will return only the IDs of the notes without the rest of the details (which should be faster than retrieving all of the notes' data).
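A minimal sketch of counting notes this way, assuming the requests library and a valid access token; it pages through me/notes with fields=id and sums the page sizes:
import requests

def count_notes(access_token):
    url = 'https://graph.facebook.com/me/notes'
    params = {'fields': 'id', 'access_token': access_token, 'limit': 100}
    total = 0
    while url:
        data = requests.get(url, params=params).json()
        total += len(data.get('data', []))
        # follow the "next" link until there are no more pages
        url = data.get('paging', {}).get('next')
        params = None  # the next URL already carries the query string
    return total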