The Stripe API supports cursor-based pagination: to retrieve the next page, you pass the ID of the last item of the previous page as the cursor.
Is there a way to retrieve pages concurrently?
The best solution here is to parallelize your requests based on the object's creation date, not just on the object ID.
The first thread would start at the top of the list and paginate until a certain date, such as the first day of the next month. The second thread would do the same but with created[lte] set to that month's first day at 12 AM, the third thread would do the same for the following month's first day, and so on.
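A minimal sketch of that idea in Python, assuming the official stripe client and charges as the object type; the API key and the month boundaries are placeholders:

import stripe

stripe.api_key = "sk_test_..."  # placeholder

def fetch_range(start_ts, end_ts):
    """Fetch all charges created in [start_ts, end_ts); each range can run in its own thread."""
    results, last_id = [], None
    while True:
        page = stripe.Charge.list(
            created={"gte": start_ts, "lt": end_ts},
            limit=100,
            starting_after=last_id,  # cursor: ID of the last item of the previous page
        )
        results.extend(page.data)
        if not page.has_more:
            return results
        last_id = page.data[-1].id

Because each date range is an independent cursor, the ranges can be handed to a ThreadPoolExecutor and the results merged afterwards.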
Here's the relationship I'm trying to model in DynamoDB:
My service contains posts and topics. A post may belong to multiple topics, and a topic may have multiple posts. Every post has an interest value, adjusted based on a combination of likes and time since posting; interest measures the popularity of a post at the current moment. Once a post gets too old, its interest value drops to 0 and stays that way forever (archival).
The REST API endpoints work like this:
GET /posts/{id} returns a post object containing the title, text, author name, a link to the author's REST endpoint (not relevant for this example), and the number of likes (the interest value is not included)
GET /topics/{name} should return an object with both a list of the N newest posts of the topic and a list of the N currently most interesting posts
POST /posts/ creates a new post; multiple topics can be specified
POST /topics/ creates a new topic
POST /likes/ creates a like for a specified post (it does not actually create an object, it just adds the user to the given post object's list of likers, which is invisible to users)
The problem now becomes: how do I model the relationship between topics and posts in DynamoDB NoSQL?
I thought about adding lists of copies of posts to the tag (topic) entries in DynamoDB, where every tag has a list of both its newest and its most interesting posts.
One way I could do this is with a CloudWatch job that runs every 10 minutes, loops through every topic object, finds both the most interesting and the newest entries, and replaces the topic's old lists.
Another job would also have to regularly update the interest value of every non-archived post (keep in mind that both likes and time affect the interest value).
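The exact interest formula is left open here, but for concreteness, a hypothetical decay along the lines of the Hacker News ranking could look like this (all names and constants are made up):

import time

def interest(likes, posted_at_epoch, gravity=1.8, max_age_hours=24 * 14):
    """Hypothetical score: likes decayed by age, pinned to 0 once archived."""
    age_hours = (time.time() - posted_at_epoch) / 3600
    if age_hours > max_age_hours:
        return 0.0  # archival: too old, interest stays 0 forever
    return (likes + 1) / (age_hours + 2) ** gravity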
One problem with this is that many posts in a tag's list would be out of date for up to 10 minutes whenever a user edits or deletes a post. Likes would also not be properly tracked in the tags' post lists. This could perhaps be solved with transactions, although DynamoDB limits transactions to 10 objects.
Another problem is that the add-posts-to-tags job would have to load all non-archived posts into memory, sort them by both time and interest, split them up by tag, and add the first N of each ordering to the tag lists every 10 minutes.
I also had another idea: by limiting each post to a single tag, I could use the tag as the partition key and the post time as the sort key, and use a GSI to add interest as a second sort key (sketched below).
This does have several downsides, though:
Very popular tags may be limited to a single partition, since all their posts share one partition key
The tag limit is 1
A CloudWatch job to adjust the interest value of posts may still be required
It would require a GSI, which may lead to dangerous race conditions
But it would have the advantage that there is no replication of post objects aside from the GSI. It would also allow essentially infinite paging of all posts by date, instead of being limited to just the N newest posts.
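For concreteness, that single-tag design as a boto3 table definition might look like this (table, index, and attribute names are all made up):

import boto3

dynamodb = boto3.client("dynamodb")

# Hypothetical single-tag design: tag as partition key, post time as sort key,
# plus a GSI that re-sorts each tag's posts by interest.
dynamodb.create_table(
    TableName="Posts",
    AttributeDefinitions=[
        {"AttributeName": "tag", "AttributeType": "S"},
        {"AttributeName": "postedAt", "AttributeType": "S"},  # ISO-8601 timestamp
        {"AttributeName": "interest", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "tag", "KeyType": "HASH"},
        {"AttributeName": "postedAt", "KeyType": "RANGE"},
    ],
    GlobalSecondaryIndexes=[{
        "IndexName": "tag-interest-index",
        "KeySchema": [
            {"AttributeName": "tag", "KeyType": "HASH"},
            {"AttributeName": "interest", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    BillingMode="PAY_PER_REQUEST",
)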
So what is a good approach here? It seems both of my solutions have horrible dealbreakers. Is this just one of those problems that NoSQL simply can't solve?
You are trying to model relational data using a non-relational DB. To do this I would use two types of databases.
I would store the post information in DynamoDB; in your example that covers:
GET /posts/{id}
POST /posts/
POST /likes/
For the topic-related information I would use Elasticsearch (Amazon Elasticsearch Service).
GET /topics/{name}: the search index would store the full topic info, the post IDs, and the relevant fields you want to search on (in your case the update date, to get the most recent posts).
What this entails is a background process (with DynamoDB this can be done via streams) that picks up changes to DynamoDB, such as new posts and updates to like counts, and populates the search index.
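A rough sketch of that background process as a Python Lambda handler on the table's stream; the endpoint, index name, and attribute layout are all assumptions:

import requests  # hypothetical choice; any HTTP or Elasticsearch client works

ES_ENDPOINT = "https://my-es-domain.example.com"  # placeholder domain

def handler(event, context):
    """Triggered by the DynamoDB stream; mirrors post changes into the search index."""
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            post = record["dynamodb"]["NewImage"]
            doc = {
                "title": post["title"]["S"],
                "topics": [t["S"] for t in post["topics"]["L"]],
                "updatedAt": post["updatedAt"]["S"],
            }
            requests.put(f"{ES_ENDPOINT}/posts/_doc/{post['id']['S']}", json=doc)
        elif record["eventName"] == "REMOVE":
            post_id = record["dynamodb"]["Keys"]["id"]["S"]
            requests.delete(f"{ES_ENDPOINT}/posts/_doc/{post_id}")

GET /topics/{name} then becomes two small search queries against that index: one sorted by update date, one sorted by interest.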
Note: this could also be solved with a graph DB, but for scaling purposes it is better to separate the source of the data (posts) from the data relations (topics).
My scenario is like this:
I want to see/query the current aggregation value(s) of a particular query for the active processing window.
I have seen this in Apache Flink.
For example:
Say I have a query counting the total number of failures, windowed to every 12 hours, and I want to ask (from another application) what the current count is for the active aggregating window. Note that the active window is still processing.
The reason is that my application needs to give the user feedback about the current total failure count, so he/she can act on it. Waiting until the window has been processed to get the count is not the desired behavior from the user's perspective.
Is this possible? If so, how?
One option is to use a rolling time window. A rolling time window gives you the rolling aggregation (sum, count, etc.) for a given time range, so for every incoming event you get an output event with the count, which you can use to give feedback. There are two catches to this approach. One is that it is a rolling count, not a batch count. The other is that processing is triggered by events on the count stream; if you want to trigger the feedback on something else (e.g., user initiated, every hour), this approach will not work, and you need the approach below.
Use a time batch window and join it with another stream that is triggered according to the business requirement. Below is a sample, and here are the test cases for your reference.
from countStream#window.timeBatch(12 hours) right outer join
feedbackTriggerStream#window.length(1)
select count() as totalFailures
insert into FeedbackStream;
Another option is to use the query feature. This approach is suitable if you are using Siddhi as a library and have access to the SiddhiAppRuntime. Below is a code sample for that. Let's assume this is your window query to calculate the count:
define window countWindow(userid string, reason string) timeBatch(12 hours);
from countStream
select *
insert into countWindow;
Then you can use queries as below to access window data.
Event[] events = siddhiAppRuntime.query(
"from countWindow " +
"select count() as totalCount ");
events will contain one event with the count. Here is a reference to the test cases.
I have some 200+ tables in my DynamoDB. Since all my tables have localSecondaryIndexes defined, I have to ensure that no table is in the CREATING status at the time of my CreateTable() call.
While adding a new table, I list all tables and iterate through their names, firing describeTable() calls one by one and checking the returned data for the TableStatus key. Each describeTable() call takes a second, which implies an average of 3 minutes of waiting before each table creation. So if I have to create 50 new tables, it takes around 4 hours.
How do I go about optimizing this? I think a BatchGetItem() call works on items inside a table, not on table metadata. Can I do a bulk describeTable() call?
It is enough to wait until the last table you created becomes ACTIVE: run DescribeTable on that last created table at a few seconds' interval.
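In boto3 terms that is essentially the built-in table_exists waiter, which polls DescribeTable until the table reports ACTIVE; a sketch, where table_names and create_my_table stand in for your own code:

import boto3

client = boto3.client("dynamodb")
waiter = client.get_waiter("table_exists")  # polls DescribeTable until ACTIVE

for name in table_names:   # hypothetical: your 50 new table names
    create_my_table(name)  # hypothetical: your existing CreateTable call
    # Wait only on the table just created, not on all 200+ existing tables.
    waiter.wait(TableName=name, WaiterConfig={"Delay": 5, "MaxAttempts": 60})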
Using the Graph API or FQL, is there a way to efficiently find a user's first post or status? As in, the first one they ever made?
The slow way, I assume, would be to paginate through the feed, but for users like me who joined in 2005 or earlier, that would take a very long time and a huge number of API calls.
From what I have found, we cannot obtain the date the user registered with Facebook to use as a starting point, and we cannot sort by date ascending (not beyond the single page of data returned) to get the oldest post first.
Is there any reasonable way to do this?
You can use Facebook Query Language (FQL) to get the first post's information. The following query should do it:
SELECT message, time FROM status WHERE uid= me() ORDER BY time ASC LIMIT 1
I think the public API is limited in how deep it is allowed to query; Facebook probably put these constraints in for performance and cost reasons. Maybe they've changed it. When I tried to go backwards through a person's stream about 4 months ago, there seemed to be a limit on how far back I could go, whether a time limit or a limit on the number of posts back. If you know when your user first posted, getting to it should be fairly quick using the since/until timestamps in your queries.
I have an Event object with a from_date field representing when the event starts. I want to find the nearest upcoming event, and then find the further upcoming events that still fall within that same month. Here are the two queries I have so far; is there a way to combine them?
import datetime

today = datetime.date.today()

date = Event.objects.filter(
    status='P',            # Published status
    pub_date__lte=today,   # Published today or earlier
    from_date__gte=today,  # Starting today or later
).order_by('from_date').only('from_date')[:1][0].from_date

events = Event.objects.filter(
    # Published today or earlier, with a published status,
    # and starting in the month of the nearest upcoming event.
    pub_date__lte=today,
    from_date__gte=today,
    status='P',
    # We're only going to show one month at a time.
    from_date__month=date.month,
    from_date__year=date.year,
)
I think what you're already doing is actually pretty efficient. Django's query mechanism should collapse those into two SQL queries, one for each filter.
Jamming everything into a single SQL query doesn't always make it more efficient.
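If you want to check that for yourself, a quick sanity test (assuming DEBUG = True, since Django only records queries then) is to inspect what was actually sent to the database:

from django.db import connection, reset_queries

reset_queries()
# ... run the two snippets from the question here ...
list(events)  # force evaluation; querysets are lazy
for query in connection.queries:
    print(query["sql"])  # should show exactly two SELECT statements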