CouchDB View with time limitation - mapreduce

I have the following problem:
I am storing URLs in a CouchDB database along with some additional information, such as the release date. I have a view that returns all URLs whose published date is less than 12 hours old.
The odd thing is that I am surprised it works at all. That is, after 24 hours of not touching the database, where the last action was running the 'depreciating' view (which returned some URLs), the next call to the view returns no items.
I thought I had read that a view does not run over all documents, but only over those that have changed or been added since the view was last queried; this is why running a view a second time is usually faster than the first.
In my example, where documents 'expire' from the view, I would not have expected this to happen when no edits are taking place.
Where am I going wrong?

Please make sure that your view implementation does not depend on data outside the document (such as the current date); otherwise the incremental indexing mechanism in CouchDB will be completely broken.
To get the URLs published less than 12 hours ago, you must:
generate an index on the publication date (with the date in a sortable form, such as [2013,10,22,13,54]):
function (doc) {
  // emit the publication date as a sortable array key
  emit(doc.date, null);
}
query the index starting from the oldest time you want (note the _view segment in the path):
GET /mydb/_design/myapp/_view/myview?startkey=[2013,10,22,1,56]&include_docs=true
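On the application side, the startkey for "12 hours ago" can be computed along these lines (a minimal sketch; the helper name and the use of UTC are assumptions, not part of the original answer):

```javascript
// Hypothetical helper: build the sortable [year, month, day, hour, minute]
// key array for "N hours ago", matching the keys emitted by the view above.
function hoursAgoKey(hours, now) {
  var d = new Date((now || new Date()).getTime() - hours * 60 * 60 * 1000);
  return [d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate(),
          d.getUTCHours(), d.getUTCMinutes()];
}
```

The query URL is then '/mydb/_design/myapp/_view/myview?startkey=' + encodeURIComponent(JSON.stringify(hoursAgoKey(12))) + '&include_docs=true'.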

Related

Sitecore Lucene Index queue lagging behind in Prod server

In our Sitecore (6.6) implementation we use Lucene indexing. On our PROD server the index-building process is very slow; at the moment there are 5000+ entries waiting in the index queue.
The queries I used (on the master database):
select * from Properties (to check when the index last ran)
select * from History where created > 'last index updated time'
As a result of this delay, newly created data is not reflected on the website, and the queue keeps growing. When the site is taken offline, index building catches up after a while.
It is a heavily read-intensive website.
We had encountered issues with the CPU spiking, but those have now been sorted out. We thought index building was lagging because of the high CPU, but the CPU now runs at around 30-40% and the Lucene index queue still grows quickly.
How can I solve this issue? Please help.
You need to set up a database maintenance task so that you regularly flush your History table. If you have sites that are indexing-heavy, this table can grow excessively large. I think the default job cleans out everything older than 30 days; you could set this much lower, such as one day or a couple of days.
This article on SDN covers most of the standard maintenance tasks: http://sdn.sitecore.net/Articles/Administration/Database%20Maintenance.aspx
More general information about searching, indexing and performance here: http://sdn.sitecore.net/upload/sitecore6/65/sitecore_search_and_indexing_sc60-65-a4.pdf#search=%22clean%22
I think you need to take a step back and ask the question as to why there is such a large number of entries being added to the history table to begin with, before looking at what configuration changes to Sitecore can be made.
You should trace through your code in your development environment based on each of the use cases for your implementation, to find all calls to the Sitecore API where an item is:
Added into the Sitecore Tree
Edited - changing any of the item's fields, including security, presentation, workflow, publishing restrictions, etc.
Duplicated
Deleted from the Sitecore Tree
Moved to a new location.
Has a new version added
Has a version removed
As you go through, make sure that all edit actions on an item are wrapped in a single Sitecore.Data.Items.Item.Editing.BeginEdit() / Sitecore.Data.Items.Item.Editing.EndEdit() pair whenever possible, so that the changes are performed as one edit action instead of several. Every time Sitecore.Data.Items.Item.Editing.EndEdit() is called, a new record is inserted into the history table, so unnecessary edits only cause the history table to grow.
If you are duplicating an item using the Sitecore.Data.Items.Item.CopyTo() method, remember that all versions of the item will be duplicated, as well as the item's descendants. This means the history table will get a record for every version of every item that was copied. If you only require the latest version and therefore remove the older versions from the new item after it is created, be aware that removing a version also inserts a record into the history table for each version deleted.
If you have minimized all of the above actions to the bare minimum that is required to make the system functional, you should find that the Lucene Indexing will keep up-to-date pretty well without having to change Sitecore's default index configuration.

Overcoming querying limitations in Couchbase

We recently made a shift from relational (MySQL) to NoSQL (Couchbase), as the back-end for a social mobile game. We were facing a lot of problems scaling our back-end to handle an increasing number of users. With MySQL, loading a user took a long time because of the many joins between multiple tables. We saw a huge improvement after moving to Couchbase, especially when loading data, as most of it is kept in a single document.
On the downside, Couchbase also seems to have a lot of limitations as far as querying is concerned. Couchbase's alternative to SQL queries is views. While we managed to handle most of our queries using map-reduce, we are really having a hard time figuring out how to handle time-based queries. For example, we need to filter users based on a timestamp attribute, and we only want a user in the view if that time is less than the current time:
if(user.time < new Date().getTime() / 1000)
What happens is that once a user's time is set to some future time, the user is excluded from this view, which is the desired behavior, but it is never added back unless we update the document: a document is only re-indexed in a view when it is updated.
Our solution right now is to load the first x user documents and then check the time in our application. Sorting is done on the user.time attribute, so we get the users whose time is less than or close to the current time. But I am not sure this will actually work in a live environment; ideally we would like to avoid this type of check at the application level.
There are also times, e.g. matchmaking, when we need to check multiple time-based attributes. Our current strategy does not work in such cases, and we frequently get documents from the view that do not pass these checks when they are done in the application. I would really appreciate it if someone who has already tackled similar problems could share their experiences. Thanks in advance.
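The application-level check described above might look like the following sketch (the function name and row shape are assumptions, based on how view results with included documents typically look):

```javascript
// Hypothetical post-filter: keep only users whose time has already passed.
// Assumes each row carries the full user document under `doc`, and that the
// view was queried sorted ascending on user.time.
function filterDueUsers(rows, nowSeconds) {
  return rows.filter(function (row) {
    return row.doc.time < nowSeconds;
  });
}
```

This mirrors the check from the map function, but evaluated at request time instead of index time.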
Update:
We tried using range queries, but they only work on a single key. As I said, in most cases we have multiple time-based attributes, which would mean multiple ranges, and that does not work.
If you use new Date().getTime() inside a view function, you will always get the time at which the view was indexed; that is exactly why, as you said, "it never gets added back to view unless we update it".
There are two ways:
Bad way (don't do this in production): query the view with the stale=false parameter. That forces the view to update before it returns results, but view indexing is a slow process, especially if you have more than a million records.
Good way: use range requests. You just need to emit your date in the map function as the key (or as part of a compound key) and then query with a key range. Just look at my example below:
I.e. you will have docs like:
doc = {
  "id": 1,
  "type": "doctype",
  "timestamp": 123456,  // document update or creation time
  "data": "lalala"
}
For those docs map function will look like:
map = function (doc, meta) {
  if (doc.type === "doctype") {
    emit(doc.timestamp, null);
  }
}
And now to get recently "updated" docs you need to query this view with params:
startKey="dateTimeNowFromApp"
endKey="{}"
descending=true
Note that startKey and endKey are swapped because I used descending order. Couchbase's view documentation describes the key types that are supported.
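Building that query string from the application might look like this sketch (the helper name is an assumption; here the range runs from the current timestamp down to 0, returning everything at or before "now", newest first):

```javascript
// Hypothetical query-string builder for the view above. Key values must be
// JSON-encoded; startkey and endkey are swapped because descending=true
// iterates from the highest key down to the lowest.
function buildRangeQuery(nowTimestamp) {
  return '?' + [
    'startkey=' + encodeURIComponent(JSON.stringify(nowTimestamp)),
    'endkey=' + encodeURIComponent(JSON.stringify(0)),
    'descending=true'
  ].join('&');
}
```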

What is the performance of CouchDB's stale=update_after?

I'm curious about how the stale=update_after feature of the CouchDB view API works.
I can see here that it returns stale results and then updates the view:
If stale=ok is set, CouchDB will not refresh the view even if it is stale; the benefit is improved query latency. If stale=update_after is set, CouchDB will update the view after the stale result is returned. update_after was added in version 1.1.0.
Assume that I have inserted some large number of documents -- enough to require several minutes to update the view index -- and then I query the view twice in rapid succession with stale=update_after. The first query will return very quickly; that's the whole point of update_after.
My question is, will the 2nd query also return stale results quickly, or will it wait for the view to finish updating?
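In code, the two back-to-back queries would hit a URL like the one this hypothetical helper builds (the database, design document, and view names are made up for illustration):

```javascript
// Hypothetical helper: build a CouchDB view URL, optionally with a stale mode.
function viewUrl(db, ddoc, view, stale) {
  var url = '/' + db + '/_design/' + ddoc + '/_view/' + view;
  return stale ? url + '?stale=' + stale : url;
}
// Both queries in the scenario above would use:
//   viewUrl('mydb', 'myapp', 'byDate', 'update_after')
```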
The second query also returns stale results. It uses the partial results that are available at the time the query hits the server. If you just added documents, you're fine.
But if you have modified your view definition, the first query triggers a complete rebuild of the view and returns whatever partial results exist at that moment, so the second query will probably deliver no results, or only very few rows.
So the short answer: In your case, both queries will return quickly, with the second query probably giving the same result as the first one, maybe with some additional rows.
Hope I could help!
Yours, Bernhard

Find User's First Post?

Using the Graph API or FQL, is there a way to efficiently find a user's first post or status? As in, the first one they ever made?
The slow way, I assume, would be to paginate through the feed, but for users like me who joined in 2005 or earlier, that would take a very long time with a huge amount of API calls.
From what I have found, we cannot obtain the date the user registered with Facebook for a good starting point, and we cannot sort by date ascending (not outside of the single page of data returned) to get the oldest post on top.
Is there any reasonable way to do this?
You can use Facebook Query Language (FQL) to get the first post. The following query should return it:
SELECT message, time FROM status WHERE uid= me() ORDER BY time ASC LIMIT 1
Please check and let me know in case of any issue.
Thanks and Regards
Durgaprasad
I think the public API is limited in how far back it is allowed to query. Facebook probably put in these constraints for performance and cost reasons, and maybe they have changed since. When I tried to go backwards through a person's stream about 4 months ago, there seemed to be a limit on how far back I could go: either a time limit or a limit on the number of posts back. If you know when your user first posted, getting to it should be fairly quick using the since/until timestamps in your queries.

Web Service query on Sharepoint 2007 List with 12,000 items fails to return all documents

We are querying a large SP 2007 document library with over 12,000 documents using the Lists web service, for document comparison purposes.
All queries are built using CAML, to limit the results returned by one of the fields on the list.
In general, the CAML query will return no more than 200 records.
Unfortunately, we are finding that one query will return 20 documents, and the exact same query will return 23 documents 15 minutes later.
As this crawl occurs after hours, it is not possible that documents have been added during that time.
Has anyone experienced similar issues?
If you're using the Lists.GetListItems method, try setting the RowLimit parameter to something larger.
rowLimit: A string that specifies the number of items, or rows, to display on a page before paging begins. If supplied, the value of this parameter overrides the row limit set in the view specified by the viewName parameter, or the row limit set in the default view for the list.
If you don't specify, it will use the limit for the default view which is probably 200 judging by your question.
I don't understand the second part of your question. The search index uses a completely separate web service and you'll never use CAML to query the search index.
Turns out that the issue was related to hardware errors on one of our front end web servers.
This caused a validation failure for some of the List Items.