ColdFusion 8 - Problems with indexing large data using Verity

I am currently running ColdFusion 8 with Verity running on a K2 server. I am using a query to index several different columns of my table using cfindex. One of the columns is a large varchar type.
It seems that when the data is being indexed, only the first 30KB is stored, so no results come back if I search for anything after that point. I tried moving several different phrases and words further up in the data, within the first 30KB, and the results then appeared.
I then carried out more Verity tests using the browse command at the command prompt to see what's actually in the collection, e.g.:
Coldfusion8\verity\collections\\parts browse 0000001.ddd
I found that the body being indexed (CF_BODY) never exceeds a size of 32,000.
Can anyone tell me if there is a fixed index size per document in Verity?
Many thanks,
Richard

Punch line
Version 6 has operator limits:
up to 32,764 children in one "topic" for the ANY operator
up to 64 children for NEAR
Exceeding these limits doesn't necessarily produce an error message. Are you certain your searches don't exceed them?
Source
The Verity documentation, Appendix B: Query limits, says there are two kinds of limits: search-time limits and operator limits. The quote below is the whole section on the latter, straight from the book.
Verity Query Language and Topic Guide, Version 6.0:
Note the following limits on the use of operators:
There can be a maximum of 32,764 children for the ANY operator. If a topic exceeds this limit, the search engine does not always return an error message.
The NEAR operator can evaluate only 64 children. If a topic exceeds this limit, the search engine does not return an error message.
For example, assume you have created a large topic that uses the ACCRUE operator with 8365 children. This topic exceeds the 1024 limit for any ACCRUE-class topic and the 16000/3 limit for the total number of nodes.
In this case, you cannot substitute ANY for ACCRUE, because that would cause the topic to exceed the 8,000 limit for the maximum number of children for the ANY operator. Instead, you can build a deeper tree structure by grouping topics and creating some named subnodes.

Related

AWS Elasticsearch indexing memory usage issue

The problem: very frequent "403 Request throttled due to too many requests" errors during data indexing, which appears to be a memory usage issue.
The infrastructure:
Elasticsearch version: 7.8
t3.small.elasticsearch instance (2 vCPU, 2 GB memory)
Default settings
Single domain, 1 node, 1 shard per index, no replicas
There are 3 indices with searchable data. Two of them have roughly 1 million documents (500-600 MB) each and one has 25k (~20 MB). Indexing is not very simple (it has history tracking), so I've been testing the refresh parameter with true and wait_for values, or calling the refresh API separately when needed. The process uses search and bulk queries (I've tried batch sizes of 500 and 1000). There should be a 10 MB request limit on the AWS side, so these are safely below that. I've also tested adding 0.5/1 second delays between requests, but none of this fiddling has any noticeable benefit.
The project is currently in development, so there is basically no traffic besides the indexing process itself. The smallest index generally needs an update once every 24 hours, the larger ones once a week. Scaling up the infrastructure is not something we want to do just because indexing is so brittle. Even updating the 25k-document index twice in a row tends to fail with the above-mentioned error. Any ideas on how to reasonably solve this issue?
Update 2020-11-10
Did some digging in past logs and found that we used to get 429 circuit_breaking_exception errors (instead of the current 403) with a reason along the lines of [parent] Data too large, data for [<http_request>] would be [1017018726/969.9mb], which is larger than the limit of [1011774259/964.9mb], real usage: [1016820856/969.7mb], new bytes reserved: [197870/193.2kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=197870/193.2kb, accounting=4309694/4.1mb]. I used the cluster stats API to track memory usage during indexing, but didn't find anything I could identify as a direct cause of the issue.
Ended up creating a solution based on the information I could find. After some searching and reading, it seemed that simply retrying when running into errors is a valid approach with Elasticsearch. For example:
Make sure to watch for TOO_MANY_REQUESTS (429) response codes
(EsRejectedExecutionException with the Java client), which is the way
that Elasticsearch tells you that it cannot keep up with the current
indexing rate. When it happens, you should pause indexing a bit before
trying again, ideally with randomized exponential backoff.
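As an aside, a minimal sketch of that randomized exponential backoff in Python (ThrottledError is a hypothetical stand-in for a 429/403 response, not a real library exception):

import random
import time

class ThrottledError(Exception):
    """Hypothetical stand-in for an HTTP 429/403 throttling response."""

def with_backoff(send_bulk, max_retries=8, base=1.0, cap=60.0):
    # Retry send_bulk() with randomized exponential backoff on throttling.
    for attempt in range(max_retries):
        try:
            return send_bulk()
        except ThrottledError:
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise RuntimeError("still throttled after %d attempts" % max_retries)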
The same guide also has useful information about refreshes:
The operation that consists of making changes visible to search -
called a refresh - is costly, and calling it often while there is
ongoing indexing activity can hurt indexing speed.
By default, Elasticsearch periodically refreshes indices every second,
but only on indices that have received one search request or more in
the last 30 seconds.
In my use case, indexing is a single linear process that does not run frequently, so this is what I did:
Disabled automatic refreshes (index.refresh_interval set to -1)
Using refresh API and refresh parameter (with true value) when and where needed
When running into a "403 Request throttled due to too many requests" error, the program keeps retrying every 15 seconds until it succeeds or the time limit (currently 60 seconds) is hit. I will adjust the numbers/functionality if needed, but results have been good so far.
This way the indexing is still fast, but will slow down when needed to provide better stability.
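For illustration, a rough sketch of that flow with the official Python client (the endpoint, index name, and elasticsearch-py 7.x call shapes here are assumptions, not from the original post):

import time
from elasticsearch import Elasticsearch, TransportError, helpers

es = Elasticsearch("https://my-domain.example:443")  # hypothetical endpoint

# 1) Disable automatic refreshes for the duration of the indexing run.
es.indices.put_settings(index="my-index", body={"index": {"refresh_interval": "-1"}})

def bulk_with_retry(actions, retry_delay=15, time_limit=60):
    # On throttling (403/429), retry every 15 s until success or the time limit.
    deadline = time.time() + time_limit
    while True:
        try:
            return helpers.bulk(es, actions)
        except TransportError as e:
            if e.status_code not in (403, 429) or time.time() >= deadline:
                raise
            time.sleep(retry_delay)

# ... run the search/bulk indexing process through bulk_with_retry(...) ...

# 2) Make the new documents searchable and restore the default refresh behaviour.
es.indices.refresh(index="my-index")
es.indices.put_settings(index="my-index", body={"index": {"refresh_interval": None}})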

Is it possible to determine the max length of a field in a CSV file using regex?

This has been discussed in stackoverflow before but I couldn't find a case/answer that might apply to my situation:
From time to time I have raw text data to import into SQL. In almost every case I have to try several times, because the SSIS wizard doesn't know the max size of each field and defaults to 50 characters. Only after it fails can I tell from the error message which (first) field was truncated, and I then increase that field's size.
More than one field might need its size increased, and the SSIS wizard reports only one error each time it encounters a truncation, so as you can see this is very tedious. I want a way to quickly inspect the data first and determine the max size of each field.
I came across an old post on stackoverflow: Here is the post
Unfortunately it might not work in my case: my raw data can have as many as 10 million rows (yes, in one single text file that is over a GB).
I rather doubt there is a way to do this, but I still want to post my question here hoping to get some clue.
Thank you very much.
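Regex aside, one hedged sketch of how such widths can be found: a single streaming pass that tracks the longest value per column runs in constant memory even on multi-GB files (the file name and delimiter below are assumptions):

import csv

max_len = {}  # column index -> longest value seen so far
with open("raw_data.txt", newline="", encoding="utf-8") as f:  # hypothetical file
    for row in csv.reader(f, delimiter=","):
        for i, field in enumerate(row):
            if len(field) > max_len.get(i, 0):
                max_len[i] = len(field)

for col in sorted(max_len):
    print("column %d: max length %d" % (col, max_len[col]))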

Efficiently read data from a structured file in C/C++

I have a file as follows:
The file consists of 2 parts: header and data.
The data part is separated into equally sized pages. Each page holds data for a specific metric. Multiple pages (which need not be consecutive) might be needed to hold data for a single metric. Each page consists of a page header and a page body. The page header has a field called "Next page", the index of the next page holding data for the same metric. The page body holds the real data. All pages have the same fixed size (20 bytes for the header and 800 bytes for the body; if there is less than 800 bytes of data, the rest is zero-filled).
The header part consists of 20,000 elements, each with information about a specific metric (point 1 -> point 20000). An element has a field called "first page", which is the index of the first page holding data for that metric.
The file can be up to 10 GB.
Requirement: re-order the data of the file in the shortest time, so that pages holding data for a single metric are consecutive, ordered from metric 1 to metric 20000 alphabetically (the header part must be updated accordingly).
An obvious approach: for each metric, read all its data (page by page) and write it to a new file. But this takes a long time, especially reading data from the file.
Are there any more efficient ways?
One possible solution is to create an index from the file, containing the page number and the page metric that you need to sort on. Create this index as an array, so that the first entry (index 0) corresponds to the first page, the second entry (index 1) the second page, etc.
Then you sort the index using the metric specified.
Once sorted, you end up with a new array whose entries give the new first, second, etc. pages, and you read the input file, writing to the output file in the order given by the sorted index.
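A sketch of that index-then-rewrite idea in Python (the integer layout of the "Next page" field, the end-of-chain marker, and the offsets are assumptions about the format described in the question):

import struct

PAGE_HEADER = 20                      # bytes, per the question
PAGE_BODY = 800                       # bytes, per the question
PAGE_SIZE = PAGE_HEADER + PAGE_BODY
END_OF_CHAIN = -1                     # hypothetical "no next page" sentinel

def chain_for_metric(f, data_start, first_page):
    # Follow the "Next page" links; return the metric's pages in order.
    pages, page = [], first_page
    while page != END_OF_CHAIN:
        f.seek(data_start + page * PAGE_SIZE)
        header = f.read(PAGE_HEADER)
        pages.append(page)
        (page,) = struct.unpack_from("<i", header, 0)  # hypothetical field offset
    return pages

def rewrite(src_path, dst_path, metrics, data_start):
    # metrics: [(name, first_page), ...] parsed from the 20,000-element header.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for _name, first_page in sorted(metrics):   # alphabetical order
            for page in chain_for_metric(src, data_start, first_page):
                src.seek(data_start + page * PAGE_SIZE)
                dst.write(src.read(PAGE_SIZE))
    # (Writing the updated header part of dst is omitted for brevity.)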
An obvious approach: for each metric, read all its data (page by page) and write it to a new file. But this takes a long time, especially reading data from the file.
Are there any more efficient ways?
Yes. After you get a working solution, measure its efficiency, then decide which parts you wish to optimize. What and how you optimize will depend greatly on the results you get (i.e. where your bottlenecks are).
A few generic things to consider:
if you have one set of steps that reads data for a single metric and moves it to the output, you should be able to parallelize it (have 20 sets of steps instead of one).
a 10 GB file will take a while to process regardless of what hardware you run your code on (conceivably you could run it on a supercomputer, but I am ignoring that case). You / your client may accept a slower solution if it displays its progress / shows a progress bar.
do not use string comparisons for sorting.
Edit (addressing comment)
Consider performing the read as follows:
create a list of block offsets for the blocks you want to read
create a list of worker threads, of fixed size (for example, 10 workers)
each idle worker will receive the file name and a block offset, then create a std::ifstream instance on the file, read the block, and return it to a receiving object (and then, request another block number, if any are left).
pass the read pages to a central structure that manages/stores them.
Also consider managing the memory for the blocks separately (for example, allocate chunks of multiple blocks preemptively, when you know the number of blocks to be read).
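A minimal sketch of that worker-pool read, in Python for brevity rather than C++ threads (the file name, offsets, and pool size are assumptions; as suggested above, each worker opens its own handle):

from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 820  # 20-byte header + 800-byte body, per the question

def read_block(path, offset):
    # Each worker opens its own handle, mirroring the per-worker std::ifstream.
    with open(path, "rb") as f:
        f.seek(offset)
        return offset, f.read(PAGE_SIZE)

offsets = [0, 7 * PAGE_SIZE, 42 * PAGE_SIZE]  # hypothetical blocks to fetch
with ThreadPoolExecutor(max_workers=10) as pool:
    blocks = dict(pool.map(lambda off: read_block("data.bin", off), offsets))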
I first read the header part, then sort the metrics in alphabetical order. For each metric in the sorted list I read all its data from the input file and write it to the output file. To remove the bottleneck in the reading step, I used memory mapping. The results showed that with memory mapping, the execution time for a 5 GB input file was reduced 5-6 times compared with not using it. This temporarily solves my problem; however, I will also consider the suggestions of @utnapistim.
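For reference, a sketch of what page reads through a memory map look like in Python (mmap in C/C++ is analogous; the file name and offsets are assumptions):

import mmap

PAGE_SIZE = 820
DATA_START = 1_000_000  # hypothetical: size of the 20,000-element header part

with open("input.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # A page read becomes a slice; the OS pages data in and caches it.
    page = mm[DATA_START + 7 * PAGE_SIZE : DATA_START + 8 * PAGE_SIZE]
    mm.close()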

Most efficient way to process complex histogram data?

I'm currently implementing a histogram that will show very large-scale data using Qt, and I have some doubts about which data structure(s) I should use for my problem. I will be displaying the number of queries received from users of an application, in a single application that shows different histograms when different "show me this data" buttons are clicked:
1) Display the histogram of total queries per month. There are 4 months of data here; I kept four variables and incremented them as I encountered queries belonging to those months in the CSV file.
2) Display the histogram of total queries per day in a selected month. I was thinking of using 4 QVectors to represent the months, incrementing an element of a vector (a day) each time I come across that specific day; e.g. one vector represents August, and whenever I come across data with 2011-08-XY, I increment the (XY + 1)th element of that vector by 1. My second alternative is to use 4 QLinkedLists for the sake of better complexity, but I'm not sure the approaches I've come up with are efficient enough, and I'm willing to listen to any other idea.
3) Here's where things get a bit complicated: display the histogram of total queries per hour in a selected day and month. The amount of data grows vastly here, and I don't know which data structure, or combination of structures, I should use to implement this one. A list of lists, perhaps?
Any ideas on problems 2) and 3) would be helpful. Thanks in advance.
Actually, it shouldn't be too unmanageable to always count queries per hour. Assuming the number of queries per hour never exceeds the maximum int value, that's only 24 ints per day, each 32 or 64 bits depending on your machine. Assuming 32 bits, you could fit up to 28 years' worth of data per MB.
There's no need to record the month/year; your program can work that out. Just assign hour 0 to the earliest point in your data, which you keep as a constant, then work out the date from the hours elapsed since then.
This avoids needing a list of lists or anything fancy: just use an array where each index is the number of hours since hour 0 and each element holds the number of queries for that hour.
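A sketch of that flat hour-indexed array, in Python for brevity rather than Qt/C++ (the epoch, timestamp format, and buffer size are assumptions):

from datetime import datetime

EPOCH = datetime(2011, 5, 1)  # hypothetical hour 0: earliest point in the data

def hour_index(ts):
    return int((ts - EPOCH).total_seconds()) // 3600

counts = [0] * (24 * 124)  # roughly four months of hourly buckets
for line in ["2011-08-14 09:31:02"]:  # stand-in for rows from the CSV
    counts[hour_index(datetime.strptime(line, "%Y-%m-%d %H:%M:%S"))] += 1

# Coarser histograms are sums over slices, e.g. total queries on 2011-08-14:
day = hour_index(datetime(2011, 8, 14)) // 24
queries_that_day = sum(counts[day * 24 : (day + 1) * 24])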
Why don't you simply use a classical database?
When you start asking these kinds of questions, I think it is a good time to consider a more robust structure. There are multiple data structures implemented inside any DB, each optimized for a different access pattern. You should consider at least lookup, insertion, deletion, and range queries. No structure beats the others on every cost, so there is always a trade-off.
Qt has some database classes you can use. I have never used the Qt SQL library, but I think you should give it a shot. Fortunately, there is a Qt SQL programming guide at the end of the linked page.
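To make that concrete, a sketch with SQLite via Python's standard library (Qt's QSqlDatabase with the QSQLITE driver is the in-Qt equivalent; the table layout is an assumption):

import sqlite3

con = sqlite3.connect("queries.db")
con.execute("CREATE TABLE IF NOT EXISTS hits (ts TEXT)")  # one row per query

# Histogram of total queries per day in August 2011: the DB does the bucketing.
rows = con.execute(
    "SELECT strftime('%d', ts) AS day, COUNT(*) FROM hits "
    "WHERE ts BETWEEN '2011-08-01' AND '2011-08-31 23:59:59' "
    "GROUP BY day ORDER BY day"
).fetchall()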

Amazon SimpleDB Woes: Implementing counter attributes

Long story short, I'm rewriting a piece of a system and am looking for a way to store some hit counters in AWS SimpleDB.
For those of you not familiar with SimpleDB, the (main) problem with storing counters is that the cloud propagation delay is often over a second. Our application currently gets ~1,500 hits per second. Not all those hits will map to the same key, but a ballpark figure might be around 5-10 updates to a key every second. This means that if we were to use a traditional update mechanism (read, increment, store), we would end up inadvertently dropping a significant number of hits.
One potential solution is to keep the counters in memcache and use a cron task to push the data. The big problem with this is that it isn't the "right" way to do it. Memcache shouldn't really be used for persistent storage; after all, it's a caching layer. In addition, we'll end up with issues when we do the push: making sure we delete the correct elements and hoping there is no contention for them as we delete them (which is very likely).
Another potential solution is to keep a local SQL database and write the counters there, updating our SimpleDB out-of-band every so many requests or running a cron task to push the data. This solves the syncing problem, as we can include timestamps to easily set boundaries for the SimpleDB pushes. Of course, there are still other issues, and though this might work with a decent amount of hacking, it doesn't seem like the most elegant solution.
Has anyone encountered a similar issue, or have any novel approaches? Any advice or ideas would be appreciated, even if they're not completely fleshed out. I've been thinking about this one for a while and could use some new perspectives.
The existing SimpleDB API does not lend itself naturally to being a distributed counter. But it certainly can be done.
Working strictly within SimpleDB, there are 2 ways to make it work: an easy method that requires something like a cron job to clean up, or a much more complex technique that cleans as it goes.
The Easy Way
The easy way is to make a different item for each "hit", with a single attribute which is the key. You can pump the domain(s) with counts quickly and easily. When you need to fetch the count (presumably much less often), you have to issue a query:
SELECT count(*) FROM domain WHERE key='myKey'
Of course this will cause your domain(s) to grow unbounded and the queries will take longer and longer to execute over time. The solution is a summary record where you roll up all the counts collected so far for each key. It's just an item with attributes for the key {summary='myKey'} and a "Last-Updated" timestamp with granularity down to the millisecond. This also requires that you add the "timestamp" attribute to your "hit" items. The summary records don't need to be in the same domain. In fact, depending on your setup, they might best be kept in a separate domain. Either way you can use the key as the itemName and use GetAttributes instead of doing a SELECT.
Now getting the count is a two-step process. You have to pull the summary record and also query for 'Timestamp' strictly greater than whatever the 'Last-Updated' time is in your summary record, then add the two counts together.
SELECT count(*) FROM domain WHERE key='myKey' AND timestamp > '...'
You will also need a way to update your summary record periodically. You can do this on a schedule (every hour) or dynamically based on some other criteria (for example do it during regular processing whenever the query returns more than one page). Just make sure that when you update your summary record you base it on a time that is far enough in the past that you are past the eventual consistency window. 1 minute is more than safe.
This solution works in the face of concurrent updates because even if many summary records are written at the same time, they are all correct and whichever one wins will still be correct because the count and the 'Last-Updated' attribute will be consistent with each other.
This also works well across multiple domains. Even if you keep your summary records with the hit records, you can pull the summary records from all your domains simultaneously and then issue your queries to all domains in parallel. The reason to do this is if you need higher throughput for a key than you can get from one domain.
This works well with caching. If your cache fails you have an authoritative backup.
The time will come when someone wants to go back and edit / remove / add a record with an old 'Timestamp' value. You will have to update your summary record (for that domain) at that time, or your counts will be off until you recompute that summary.
This will give you a count that is in sync with the data currently viewable within the consistency window. This won't give you a count that is accurate up to the millisecond.
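A sketch of that two-step read in Python; sdb_get_attributes and sdb_select below are hypothetical stand-ins for SimpleDB's GetAttributes and Select calls, not a real library:

# Hypothetical stand-ins; substitute your SimpleDB client's equivalents.
def sdb_get_attributes(domain, item_name): ...
def sdb_select(expression): ...  # assumed to return count(*) as an int

def current_count(key):
    # Step 1: the rolled-up total as of 'Last-Updated'.
    summary = sdb_get_attributes("summaries", item_name=key)
    base = int(summary["Count"])
    # Step 2: hits recorded strictly after the summary was written.
    recent = sdb_select(
        "SELECT count(*) FROM hits "
        "WHERE key='%s' AND timestamp > '%s'" % (key, summary["Last-Updated"])
    )
    return base + int(recent)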
The Hard Way
The other way is to do the normal read - increment - store mechanism, but also write a composite value that includes a version number along with your value, where the version number you use is 1 greater than the version number of the value you are updating.
get(key) returns the attribute value="Ver015 Count089"
Here you retrieve a count of 89 that was stored as version 15. When you do an update you write a value like this:
put(key, value="Ver016 Count090")
The previous value is not removed, and you end up with an audit trail of updates reminiscent of Lamport clocks.
This requires you to do a few extra things.
the ability to identify and resolve conflicts whenever you do a GET
a simple version number isn't going to work; you'll want to include a timestamp with resolution down to at least the millisecond, and maybe a process ID as well
in practice you'll want your value to include the current version number and the version number of the value your update is based on, to more easily resolve conflicts
you can't keep an infinite audit trail in one item, so you'll need to issue deletes for older values as you go
What you get with this technique is like a tree of divergent updates. You'll have one value, then all of a sudden multiple updates will occur, and you will have a bunch of updates based off the same old value, none of which know about each other.
When I say resolve conflicts at GET time I mean that if you read an item and the value looks like this:
    11 --- 12
   /
10 --- 11
   \
    11
You have to be able to figure out that the real value is 14 (the base of 10 plus the four increments across the divergent branches). You can do this if each new value includes the version of the value(s) it is updating.
It shouldn't be rocket science
If all you want is a simple counter, this is way overkill. It shouldn't be rocket science to make a simple counter, which is why SimpleDB may not be the best choice for making simple counters.
That isn't the only way, but most of those things will need to be done if you implement a SimpleDB solution in lieu of actually having a lock.
Don't get me wrong, I actually like this method, precisely because there is no lock, and the bound on the number of processes that can use this counter simultaneously is around 100 (because of the limit on the number of attributes in an item). And you can get beyond 100 with some changes.
Note
But if all these implementation details were hidden from you and you just had to call increment(key), it wouldn't be complex at all. With SimpleDB the client library is the key to making the complex things simple. But currently there are no publicly available libraries that implement this functionality (to my knowledge).
To anyone revisiting this issue, Amazon just added support for Conditional Puts, which makes implementing a counter much easier.
Now, to implement a counter, simply call GetAttributes, increment the count, and then call PutAttributes with the expected value set correctly. If Amazon responds with the error ConditionalCheckFailed, retry the whole operation.
Note that you can only have one expected value per PutAttributes call. So, if you want to have multiple counters in a single row, then use a version attribute.
pseudo-code:
begin
  attributes = SimpleDB.GetAttributes
  initial_version = attributes[:version]
  attributes[:counter1] += 3
  attributes[:counter2] += 7
  attributes[:version] += 1
  SimpleDB.PutAttributes(attributes, :expected => {:version => initial_version})
rescue ConditionalCheckFailed
  retry
end
I see you've accepted an answer already, but this might count as a novel approach.
If you're building a web app then you can use Google's Analytics product to track page impressions (if the page to domain-item mapping fits) and then to use the Analytics API to periodically push that data up into the items themselves.
I haven't thought this through in detail so there may be holes. I'd actually be quite interested in your feedback on this approach given your experience in the area.
Thanks
Scott
For anyone interested in how I ended up dealing with this... (slightly Java-specific)
I ended up using an EhCache on each servlet instance. I used the UUID as a key and a Java AtomicInteger as the value. Periodically a thread iterates through the cache and pushes rows to a SimpleDB temp stats domain, as well as writing a row with the key to an invalidation domain (which fails silently if the key already exists). The thread also decrements the counter by the previous value, ensuring that we don't miss any hits while it is updating. A separate thread pings the SimpleDB invalidation domain and rolls up the stats in the temporary domains (there are multiple rows per key, since we're using multiple EC2 instances), pushing the result to the actual stats domain.
I've done a little load testing, and it seems to scale well. Locally I was able to handle about 500 hits/second before the load tester broke (not the servlets - hah), so if anything, I think running on EC2 should only improve performance.
Answer to feynmansbastard:
If you want to store a huge number of events, I suggest you use a distributed commit log system such as Kafka or AWS Kinesis. They let you consume a stream of events cheaply and simply (Kinesis's pricing is $25 per month for 1K events per second); you just need to implement a consumer (in any language) that bulk-reads all events since the previous checkpoint, aggregates counters in memory, then flushes the data into permanent storage (DynamoDB or MySQL) and commits the checkpoint.
Events can be logged simply using the nginx log and transferred to Kafka/Kinesis using fluentd. This is a very cheap, performant and simple solution.
I also had similar needs/challenges.
I looked at using Google Analytics and count.ly. The latter seemed too expensive to be worth it (plus they have a somewhat confusing definition of sessions). GA I would have loved to use, but I spent two days with their libraries and some 3rd-party ones (gadotnet and one other, maybe from CodeProject). Unfortunately I could only ever see counters post in GA's realtime section, never in the normal dashboards, even when the API reported success. We were probably doing something wrong, but we exceeded our time budget for GA.
We already had an existing SimpleDB counter updated using conditional updates, as mentioned by a previous commenter. This works well, but suffers when there is contention and concurrency, where counts are missed (for example, our most-updated counter lost several million counts over a period of 3 months, versus a backup system).
We implemented a newer solution which is somewhat similar to the answer to this question, except much simpler.
We just sharded/partitioned the counters. When you create a counter, you specify the number of shards, as a function of how many simultaneous updates you expect. This creates a number of sub-counters, each of which has the shard count stored with it as an attribute:
COUNTER (w/5shards) creates :
shard0 { numshards = 5 } (informational only)
shard1 { count = 0, numshards = 5, timestamp = 0 }
shard2 { count = 0, numshards = 5, timestamp = 0 }
shard3 { count = 0, numshards = 5, timestamp = 0 }
shard4 { count = 0, numshards = 5, timestamp = 0 }
shard5 { count = 0, numshards = 5, timestamp = 0 }
Sharded Writes
Knowing the shard count, just randomly pick a shard and try to write to it conditionally. If the write fails because of contention, choose another shard and retry.
If you don't know the shard count, get it from the root shard, which is present regardless of how many shards exist. Because this supports multiple simultaneous writes per counter, it lessens the contention issue to whatever level your needs require.
Sharded Reads
If you know the shard count, read every shard and sum them.
If you don't know the shard count, get it from the root shard, then read all shards and sum.
Because of slow update propagation, you can still miss counts when reading, but they should get picked up later. This is sufficient for our needs, although if you wanted more control you could ensure, when reading, that the last timestamp was what you expect, and retry.
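A sketch of the sharded write/read in Python; sdb_get, sdb_conditional_put, and sdb_get_all below are hypothetical stand-ins for GetAttributes, a conditional PutAttributes, and a full-domain read, not a real SimpleDB library:

import random

# Hypothetical stand-ins; substitute your SimpleDB client's equivalents.
def sdb_get(counter, shard): ...
def sdb_conditional_put(counter, shard, attrs, expected): ...
def sdb_get_all(counter): ...

def increment(counter, delta=1, max_attempts=10):
    num_shards = int(sdb_get(counter, "shard0")["numshards"])  # root shard
    for _ in range(max_attempts):
        shard = "shard%d" % random.randint(1, num_shards)
        old = int(sdb_get(counter, shard)["count"])
        # The conditional put succeeds only if nobody else changed this shard.
        if sdb_conditional_put(counter, shard, {"count": old + delta},
                               expected={"count": old}):
            return
    raise RuntimeError("all attempts contended; consider more shards")

def read(counter):
    # Sum every shard; shard0 is informational only and carries no count.
    return sum(int(i["count"]) for i in sdb_get_all(counter) if "count" in i)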