SimpleDB Incremental Index - amazon-web-services

I understand SimpleDB doesn't have an auto increment, but I am working on a script where I need to query the database by sending the id of the last record I've already pulled and then pull all subsequent records. In normal SQL fashion, if there were 6200 records and I already had 6100 of them, when I run the script I would query for records with an ID greater than 6100. Looking at the response object, I don't see anything I can use. It just seems like there should be a sequential index there. The other option I was thinking of would be a real timestamp. Any ideas are much appreciated.

Using a timestamp was perfect for what I needed to do. I followed this article to help me on my way: http://aws.amazon.com/articles/1232. I would still welcome an answer if anyone knows of a way to get an incremental index number.
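For reference, the timestamp approach can be sketched with boto roughly as follows (the domain name and the created_at attribute are invented for illustration, not taken from the article). SimpleDB compares attribute values as strings, so storing an ISO 8601 timestamp makes lexicographic order match chronological order:

import boto

sdb = boto.connect_sdb()               # AWS credentials taken from the environment
domain = sdb.get_domain('mydomain')    # illustrative domain name

last_seen = '2012-01-15T10:30:00Z'     # timestamp of the last record already pulled

# ISO 8601 strings sort lexicographically in chronological order, so a plain
# string comparison returns everything newer than the last record we have.
query = ("select * from `mydomain` where created_at > '%s' "
         "order by created_at" % last_seen)

for item in domain.select(query):
    print(item.name, dict(item))       # replace with your own processing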

Related

How to know when elasticsearch is ready for query after adding new data?

I am trying to do some unit tests using Elasticsearch. I start by calling the index API about 100 times to add new data to my index. Then I use the search API with aggs. The problem is that if I don't pause for 1 second after adding the data 100 times, I get random results; if I wait 1 second, I always get the same result.
I'd rather not have to wait an arbitrary amount of time in my tests, as that seems like bad practice. Is there a way to know when the data is ready?
I am already waiting for a success response from the Elasticsearch index API, but that does not seem to be enough.
First, I'd suggest you index your documents with a single bulk request: it saves some time by avoiding the HTTP/TCP overhead of 100 separate calls.
To answer your question, you should consider using the refresh=true parameter (or refresh=wait_for) while indexing your 100 documents.
As stated in the documentation, it will:
"Refresh the relevant primary and replica shards (not the whole index) immediately after the operation occurs, so that the updated document appears in search results immediately."
More about it here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-refresh.html
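A minimal sketch of both suggestions using the Python Elasticsearch client (index and field names are invented for illustration):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

# Build the 100 documents and send them in one bulk request instead of 100 index calls.
actions = [{"_index": "test-index", "_source": {"value": i}} for i in range(100)]

# refresh="wait_for" blocks until the new documents are visible to search,
# so no arbitrary one-second sleep is needed before running the aggregation.
helpers.bulk(es, actions, refresh="wait_for")

result = es.search(index="test-index",
                   body={"size": 0, "aggs": {"total": {"sum": {"field": "value"}}}})
print(result["aggregations"]["total"]["value"])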

Depth of sys.dm_pdw_exec_requests on Azure SQL Data Warehouse

I am running tests on ADW that take many hours to complete, and the amount of SQL involved rolls off the 10,000-row limit of sys.dm_pdw_exec_requests (as documented at https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-service-capacity-limits ) in less than 30 minutes.
Is my only option to create a process to capture into a table in my database the data on sys.dm_pdw_exec_requests every N minutes (where N << 30 )?
I'm not sure what your use case is, but perhaps you can get the same useful information out of the audit logs?
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-auditing-overview
You might be able to use something that was already built for that purpose, instead of reinventing the wheel:
https://github.com/andrealibero/Azure_SQL_DWH_Perf_Stats
The PowerShell script can collect the output of DMVs (configured in an XML file) in a loop or for a specified number of iterations.
Given how quickly the DMVs roll over for you, this might help in your scenario.
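If you do end up rolling your own capture loop, a minimal sketch could look like this (assuming pyodbc and a pre-created dbo.exec_requests_history table with the same columns as the DMV; the connection string, table name, and interval are illustrative):

import time
import pyodbc

conn = pyodbc.connect("DSN=my_adw;UID=my_user;PWD=my_password", autocommit=True)

while True:
    # Copy any requests not captured yet; request_id identifies a request uniquely.
    conn.execute("""
        INSERT INTO dbo.exec_requests_history
        SELECT r.*
        FROM sys.dm_pdw_exec_requests AS r
        WHERE NOT EXISTS (SELECT 1
                          FROM dbo.exec_requests_history AS h
                          WHERE h.request_id = r.request_id)
    """)
    time.sleep(300)   # snapshot every 5 minutes, well inside the ~30 minute rollover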

Use a long running database migration script

I'm trialing FluentMigrator as a way of keeping my database schema up to date with minimum effort.
For the release I'm currently building, I need to run a database script to make a simple change to a large number of rows of existing data (around 2% of 21,000,000 rows need to be updated).
There's too much data to be updated in a single transaction (the transaction log gets full and the script aborts), so I use a WHILE loop to iterate through the table, updating 10,000 rows at a time, each batch in a separate transaction. This works, and takes around 15 minutes to run to completion.
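For illustration, the batching pattern looks roughly like this when expressed as a small Python/pyodbc driver instead of a T-SQL WHILE loop (a simplified sketch with invented table, column, and connection names, not the actual script):

import pyodbc

conn = pyodbc.connect("DSN=my_db;UID=my_user;PWD=my_password")
cursor = conn.cursor()

while True:
    # Update at most 10,000 rows, then commit, so each batch runs in its own
    # transaction and the transaction log never has to hold the whole change.
    cursor.execute("""
        UPDATE TOP (10000) dbo.MyLargeTable
        SET    SomeFlag = 1
        WHERE  SomeFlag = 0
    """)
    conn.commit()
    if cursor.rowcount == 0:   # nothing left to update
        break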
Now that the script is complete, I'm trying to integrate it into FluentMigrator.
FluentMigrator seems to run all the migrations for a single batch in one transaction.
How do I get FM to run each migration in a separate transaction?
Can I tell FM to not use a transaction for a specific migration?
This is not possible as of now.
There are ongoing discussions and some work already in progress.
Check it out here: https://github.com/schambers/fluentmigrator/pull/178
But your use case will surely help in pushing things in the right direction.
You are welcome to take part in the discussion!
Maybe someone will find a temporary workaround?

Simple query working for years, then suddenly very slow

I've had a query that has been running fine for about 2 years. The database table has about 50 million rows, and is growing slowly. This last week one of my queries went from returning almost instantly to taking hours to run.
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)).latest('id')
I have narrowed the slow query down to the Rank model. It seems to have something to do with using the latest() method. If I just ask for a queryset, it returns an empty queryset right away.
#count returns 0 and is fast
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)).count() == 0
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)) == [] #also very fast
Here are the results of running EXPLAIN. http://explain.depesz.com/s/wPh
And EXPLAIN ANALYZE: http://explain.depesz.com/s/ggi
I tried vacuuming the table, no change. There is already an index on the "site" field (ForeignKey).
Strangely, if I run this same query for another client that already has Rank objects associated with her account, the query returns very quickly once again. So it seems that this is only a problem when there are no Rank objects for that client.
Any ideas?
Versions:
Postgres 9.1,
Django 1.4 svn trunk rev 17047
Well, you've not shown the actual SQL, so that makes it difficult to be sure. But, the explain output suggests it thinks the quickest way to find a match is by scanning an index on "id" backwards until it finds the client in question.
Since you said it has been fast until recently, this is probably not a silly choice. However, there is always the chance that a particular client's record will be right at the far end of this search.
So - try two things first:
Run an analyze on the table in question, see if that gives the planner enough info.
If not, increase the stats (ALTER TABLE ... SET STATISTICS) on the columns in question and re-analyze. See if that does it.
http://www.postgresql.org/docs/9.1/static/planner-stats.html
If that's still not helping, then consider an index on (client,id), and drop the index on id (if not needed elsewhere). That should give you lightning fast answers.
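Concretely, those steps might look like this through psycopg2 (the table and column names are invented; substitute the actual Rank table, e.g. appname_rank):

import psycopg2

conn = psycopg2.connect("dbname=mydb")
conn.autocommit = True        # run each statement immediately
cur = conn.cursor()

cur.execute("ANALYZE myapp_rank;")   # refresh planner statistics
cur.execute("ALTER TABLE myapp_rank ALTER COLUMN site_id SET STATISTICS 500;")
cur.execute("ANALYZE myapp_rank;")   # re-analyze with the higher statistics target

# If the stats alone don't help, a composite index lets the planner find the
# newest row for a given site without walking the id index backwards:
cur.execute("CREATE INDEX myapp_rank_site_id_id ON myapp_rank (site_id, id);")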
latest() is normally used for date comparison; maybe you should try ordering by id descending and then limiting to one result.
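In Django ORM terms that would be roughly the following (reusing the names from the original query; slicing adds a LIMIT 1 to the generated SQL):

site = Site.objects.get(profile__client=client, profile__is_active=False)
qs = Rank.objects.filter(site=site).order_by('-id')[:1]
rank = qs[0] if qs else None   # latest Rank for that site, or None if there are none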

Reliably get Latest Event Log Record with WQL

I have written an application which collects Windows logs from Linux, via the Zenoss wmi-client package.
It uses WQL to query the Event Log and parses the result. My problem is trying to find the latest entry in the log.
I stumbled across this which tells me to use the NumberOfRecords column in a query such as this
Select NumberOfRecords from Win32_NTEventLogFile Where LogFileName = 'Application'
and use the return value from that as the highest log.
My question is: I have heard that the Windows Event Log is a circular buffer, that is, it overwrites its oldest logs with new ones as the log gets full. Will this have an impact on NumberOfRecords? If that happens, the "RecordNumber" property of the events will continue to increase, but the actual number of records in the event log wouldn't change (as for every entry written, one is dropped).
Can anyone shed some insight to how this actually works (whether NumberOfRecords is the highest RecordNumber, or the actual number of events in the log), and perhaps suggest a solution?
Update
So we know now that NumberOfRecords won't work on its own, because the Event Log is a ring buffer. The MS solution is to get the oldest record number and add it to NumberOfRecords to get the actual latest record.
This is possible through WinAPI, but I am calling remotely from Linux. Does anyone know how I might achieve this in my scenario?
Thanks
NumberOfRecords will not always be the max record number, because the log is circular and can be cleared; you may have 1 entry whose record number is 1000.
The way you would do this using the Win API would be to get the oldest record number and add the number of records in the log to get the max record number. It doesn't look like Win32_NTEventLogFile has an oldest-record-number field to use.
Are you trying to get the latest record every time you query the log? You can use TimeGenerated when you query Win32_NTLogEvent to get everything > NOW. You can iterate that list to find your max record number.
You need the RecordNumber of the newest record, but there is no fast way to get it.
Generally, you have to:
SELECT RecordNumber FROM Win32_NTLogEvent WHERE LogFile='Application'
And find the max RecordNumber in the results. But this can take tens of seconds or minutes if the log file is big... it's very slow.
But!
You can get number of records:
SELECT NumberOfRecords FROM Win32_NTEventlogFile WHERE LogfileName='Application'
This is very fast. You can then reduce the selection to speed up the search for the newest record:
SELECT RecordNumber FROM Win32_NTLogEvent WHERE LogFile='Application' AND RecordNumber>='_number_of_records_'
The execution time of this is less than or equal to that of the general case.
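Putting the two queries together, the approach looks roughly like this (run_wql is a placeholder for however you invoke wmi-client from Linux and turn its output into rows; it is not a real function from that package):

def newest_record_number(run_wql):
    # Step 1: the fast query - how many records are currently in the log.
    rows = run_wql("SELECT NumberOfRecords FROM Win32_NTEventlogFile "
                   "WHERE LogfileName='Application'")
    number_of_records = int(rows[0]["NumberOfRecords"])

    # Step 2: only scan records at or above that count, which is a small slice
    # of the log instead of every event, then take the highest RecordNumber.
    rows = run_wql("SELECT RecordNumber FROM Win32_NTLogEvent "
                   "WHERE LogFile='Application' AND RecordNumber>=%d"
                   % number_of_records)
    return max(int(r["RecordNumber"]) for r in rows)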