How to count the number of times a value exists in a line? - regex

My data looks like this:
[123:1000,156,132,123,156,123]
[123:1009,392,132,123,156,123]
[234:987,789,132,123,156,123]
[234:8765,789,132,123,156,123]
I need to count the number of times "123" exists in each line using expression language in NiFi.
I need to do it in expression language only. How can I count it?
Any help appreciated.

You should use the SplitContent processor to split the flowfile content into individual flowfiles per line, then use ExtractText with a regex like pattern=(123)? which will result in an attribute being added to the flowfile for each matching group:
[123:1009,392,132,123,156,123] -> pattern.1, pattern.2, pattern.3
[234:987,789,132,123,156,123] -> pattern.1, pattern.2
Finally, you can use a ScanAttribute processor to detect the attribute with the highest group count in each of the flowfiles and route it to an UpdateAttribute to put that value into a common flowfile attribute (i.e. count). You could also replace some steps with an ExecuteStreamCommand and use a variety of OS-level tools (grep/awk/sed/cut/etc.) to perform the count, return that value, and update the content of the flowfile.
It would probably be simpler for you to perform this count action within an ExecuteScript processor, as it could be done in 1-2 lines of Groovy, Ruby, Python, or Javascript, and would not require multiple processors. Apache NiFi is designed for data routing and simple transformation, not complex event processing, so there are not standard processors developed for these tasks. There is an open Jira for "Add processor to perform simple aggregations" which has a patch available here, which may be useful for you.
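For reference, the core counting logic is tiny once you step outside expression language; here is a minimal sketch in plain Python of just the count itself (the ExecuteScript session/flowfile plumbing is omitted, and the sample lines are taken from the question):

    import re

    # Match "123" only when it is not part of a longer number (e.g. not 1234 or 5123).
    pattern = re.compile(r"(?<!\d)123(?!\d)")

    lines = [
        "[123:1000,156,132,123,156,123]",
        "[234:987,789,132,123,156,123]",
    ]

    for line in lines:
        count = len(pattern.findall(line))
        print(line, "->", count)   # first line -> 3, second line -> 2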

According to the documentation, a count can be done like this:
${allMatchingAttributes(".*"):contains("123"):count()}

One-line regex independent of the number of items

Can I have a one-line regex that matches the values between pipe ("|") delimiters, independent of the number of items between the pipes? E.g. I have the following regex:
^(.*?)\|(.*?)\|(.*?)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)$
which works only if I have 12 items. How can I make the same pattern work for e.g. 6 items as well?
([^|]+)+
This is the pattern I've used in the past for that purpose. It matches 1 or more groups that do not contain the pipe delimiter.
For Adobe Classification Rule Builder (CRB), there is no way to write a regex that will match an arbitrary number of repetitions of your pattern and push them to $n capture groups. Most regex engines do not allow for this, though some languages offer ways to more or less effectively do this as returned arrays or whatever. But CRB doesn't offer that sort of thing.
But, it's mostly pointless to want this anyways, since there's nothing upstream or downstream that really dynamically/automatically accommodates this sort of thing anyways.
For example, there's no way in the CRB interface to dynamically populate the output value with an arbitrary $1$2$3[$n..] value, nor is there a way to dynamically generate an arbitrary number of rules in the rule set.
In addition, Adobe Analytics (AA) does not offer arbitrary on-the-fly classification column generation anyways (unless you want to write a script using the Classification API, but you can't say the same for CRBs anyways).
For example if you have
s.eVar1='foo1|foo2';
And you want to classify this into 2 classification columns/reports, you have to go and create them in the classification interface. And then let's say your next value sent in is:
s.eVar1='foo1|foo2|foo3';
Well AA does not automatically create a new classification level for you; you have to go in and add a 3rd, and so on.
So overall, even though it is not possible to return an arbitrary number of captured groups $n in a CRB, there isn't really a reason you need to.
Perhaps it would help if you explain what you are actually trying to do overall? For example, what report(s) do you expect to see?
One common reason I see this sort of "wish" come up is when someone wants to track stuff like header or breadcrumb navigation links that have an arbitrary depth to them. So they push e.g. a breadcrumb
Home > Electronics > Computers > Monitors > LED Monitors
...or whatever to an eVar (but pipe delimited, based on your question), and then they want to break this up into classified columns.
And the problem is, it could be an arbitrary length. But as mentioned, setting up classifications and rules for them doesn't really accommodate this sort of thing.
Usually the best practice for a scenario like this is to look at the raw data and see how many levels represent the bulk of your data, on average. For example, if you look at your raw eVar report and see that even though upwards of 5 or 6 levels can be found in the values, most values on average have between 1 and 3 levels, then you should create 4 classification columns. The first 3 classifications represent the first 3 levels, and the 4th one will have everything else.
So going back to the example value:
Home|Electronics|Computers|Monitors|LED Monitors
You can have:
Level1 => Home
Level2 => Electronics
Level3 => Computers
Level4+ => Monitors|LED Monitors
Then you set up a CRB with 4 rules, one for each of the levels. And you'd use the same regex in all 4 rule rows:
^([^|]+)(?:\|([^|]+))?(?:\|([^|]+))?(?:\|(.+))?
Which will return the following captured groups to use in the CRB outputs (a quick check is sketched after the list):
$1 => Home
$2 => Electronics
$3 => Computers
$4 => Monitors|LED Monitors
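If you want to sanity-check the pattern outside the CRB interface, here is a quick sketch (Python used purely for verification; only the regex itself goes into the rule builder):

    import re

    pattern = re.compile(r"^([^|]+)(?:\|([^|]+))?(?:\|([^|]+))?(?:\|(.+))?")

    value = "Home|Electronics|Computers|Monitors|LED Monitors"
    groups = pattern.match(value).groups()

    # Groups map to the CRB outputs $1..$4; unmatched optional groups come back as None.
    for i, group in enumerate(groups, start=1):
        print(f"${i} =>", group)
    # $1 => Home
    # $2 => Electronics
    # $3 => Computers
    # $4 => Monitors|LED Monitors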
Yeah, this isn't the same as having a classification column for every possible length, but it is more practical, because when it comes to analytics, you shouldn't really try to be too granular about things in the first place.
But if you absolutely need to have something for every possible amount of delimited values, you will need to find out what the max possible is and make that many, hard coded.
Or as an alternative to classifications, consider one of the following alternatives:
Use a list prop
Use a list variable (e.g. list1)
Use a Merchandising eVar (product variable syntax)
This isn't exactly the same thing, and they each have their caveats, but you didn't provide details for what you are ultimately trying to get out of the reports, so this may or may not be something you can work with.
Well anyways, hopefully some of this is food for thought for you.

How would I merge related records in apache beam / dataflow, based on hundreds of rules?

I have data I have to join at the record level. For example, data about users is coming in from different source systems, but there is no common primary key or user identifier.
Example Data
Source System 1:
{userid = 123, first_name="John", last_name="Smith", many other columns...}
Source System 2:
{userid = EFCBA-09DA0, fname="J.", lname="Smith", many other columns...}
There are about 100 rules I can use to compare one record to another to see if a customer in source system 1 is the same as a customer in source system 2.
Some rules may be able to infer record values and add data to a master record about a customer.
Because some rules may infer/add data to any particular record, the rules must be re-applied again when a record changes.
We have millions of records per day we'd have to unify
Apache Beam / Dataflow implementation
An Apache Beam DAG is by definition acyclic, but I could just republish the data through Pub/Sub to the same DAG to make the algorithm cyclic.
I could create a PCollection of hashmaps that continuously does a self-join against all other elements, but this seems like an inefficient method.
Immutability of a PCollection is a problem if I want to be constantly modifying things as the data goes through the rules. This sounds like it would be more efficient with Flink Gelly or Spark GraphX.
Is there any way you may know in dataflow to process such a problem efficiently?
Other thoughts
Prolog: I tried running on a subset of this data with a subset of the rules, but SWI-Prolog did not seem scalable, and I could not figure out how I would continuously emit the results to other processes.
JDrools/Jess/Rete: Forward chaining would be perfect for the inference and efficient partial application, but this algorithm is more about applying many many rules to individual records, rather than inferring record information from possibly related records.
Graph database: Something like neo4j or datomic would be nice since joins are at the record level rather than row/column scans, but I don't know if it's possible in beam to do something similar
BigQuery or Spanner: Brute forcing these rules in SQL and doing full table scans per record is really slow. It would be much preferred to keep the graph of all records in memory and compute in-memory. We could also try to concat all columns and run multiple compare and update across all columns
Or maybe there's a more standard way of solving this class of problems.
It is hard to say what solution works best for you from what I can read so far. I would try to split the problem further and try to tackle different aspects separately.
From what I understand, the goal is to combine together the matching records that represent the same thing in different sources:
records come from a number of sources:
it is logically the same data but formatted differently;
there are rules to tell if the records represent the same entity:
collection of rules is static;
So, the logic probably roughly goes like this (a rough sketch in code follows the list):
read a record;
try to find existing matching records;
if matching record found:
update it with new data;
otherwise save the record for future matching;
repeat;
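That loop is easier to see in code than in prose. Here is a rough sketch in plain Python, deliberately not Beam/Dataflow API; blocking_key, matches, and merge are hypothetical stand-ins for your ~100 rules, and everything is assumed to fit in memory:

    # Hypothetical sketch of the match/update loop, not Beam/Dataflow code.
    master_records = {}   # blocking key -> list of master records

    def blocking_key(record):
        # Cheap key so we don't compare every record against every other record;
        # stand-in for your own blocking logic (here: normalized last name).
        return (record.get("last_name") or record.get("lname", "")).lower()

    def matches(a, b):
        # Stand-in for the ~100 comparison rules; one illustrative rule:
        # same last name and same first initial.
        last_a = (a.get("last_name") or a.get("lname", "")).lower()
        last_b = (b.get("last_name") or b.get("lname", "")).lower()
        first_a = (a.get("first_name") or a.get("fname", ""))[:1].lower()
        first_b = (b.get("first_name") or b.get("fname", ""))[:1].lower()
        return last_a != "" and last_a == last_b and first_a == first_b

    def merge(master, record):
        # Stand-in for the inference rules: copy over fields the master lacks.
        for key, value in record.items():
            master.setdefault(key, value)

    def process(record):
        bucket = master_records.setdefault(blocking_key(record), [])
        for master in bucket:
            if matches(master, record):
                merge(master, record)   # may add inferred data to the master record
                return master
        bucket.append(record)           # no match yet: keep it for future matching
        return record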
To me this looks very high level and there's probably no single 'correct' solution at this level of detail.
I would probably try to approach this by first understanding it in more detail (maybe you already do); a few thoughts:
what are the properties of the data?
are there patterns? E.g. when one system publishes something, do you expect something else from other systems?
what are the requirements in general?
latency, consistency, availability, etc;
how data is read from the sources?
can all the systems publish the records in batches in files, submit them into PubSub, does your solution need to poll them, etc?
can the data be read in parallel or is it a single stream?
Then the main question of how you can efficiently match a record in general will probably look different under different assumptions and requirements as well. For example, I would think about:
can you fit all data in memory;
are your rules dynamic? Do they change at all, and what happens when they do;
can you split the data into categories that can be stored separately and matched efficiently, e.g. if you know you can try to match some things by id field, some other things by hash of something, etc;
do you need to match against all of historical/existing data?
can you have some quick elimination logic to not do expensive checks?
what is the output of the solution? What are the requirements for the output?

Storm and stop words

I am new to the Storm framework (https://storm.incubator.apache.org/about/integrates.html).
I am testing locally with my code and I think that if I remove stop words it will perform well, but I searched online and I can't find any example of removing stop words in Storm.
If the size of the stop words list is small enough to fit in memory, the most straightforward approach would be to simply filter the tuples with an implementation of a Storm Filter that knows that list. This Filter could possibly poll the DB every so often to get the latest list of stop words if the list evolves over time.
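The predicate inside such a Filter is trivial; here is a minimal sketch of the logic only (plain Python rather than the actual Storm/Trident Java API, and the stop word set is a made-up example that would normally come from your DB):

    # Stop words loaded once, or refreshed periodically from the DB.
    STOP_WORDS = {"the", "a", "an", "and", "of", "to"}

    def keep(word):
        """Return True if the tuple should pass through, False to drop it."""
        return word.lower() not in STOP_WORDS

    words = ["the", "storm", "and", "trident", "tutorial"]
    print([w for w in words if keep(w)])   # ['storm', 'trident', 'tutorial']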
If the size of the stop words list is bigger, then you can use a QueryFunction, called from your topology with the stateQuery function, which would:
receive a batch of tuples to check (say 10000 at a time)
build a single query from their content and look up corresponding stop words in persistence
attach a boolean to each tuple specifying what to do with each one;
then add a Filter right after that to filter based on that boolean.
And if you feel adventurous:
Another and faster approach would be to use a bloom filter approximation. I heard that Algebird is meant to provide this kind of functionality and targets both Scalding and Storm (how cool is that?), but I don't know how stable it is nor do I have any experience in practically plugging it into Storm (maybe Sunday if it's rainy...).
Also, Cascading (which is not directly related to Storm but has a very similar set of primitive abstractions on top of map reduce) suggests in this tutorial a method based on left joins. Such joins exist in Storm and the right branch could possibly be fed with a FixedBatchSpout emitting all stop words every time, or even a custom spout that reads the latest version of the list of stop words from persistence every time, so maybe that would work too? Maybe? This also assumes the size of the stop words list is relatively small though.

How to handle searches for very common keywords

I want to be able to return useful records if a user searches for a keyword that is very, very common in a Solr index, for example "education".
In this case, close to 99% of the records would contain that word, so searches for this word or similar ones take a long time.
This is for solr on ColdFusion but I'm open to solutions which are isolated to just solr.
Right now I'm thinking of coming up with a list of stopwords and preventing those searches from taking place altogether.
If searches are taking a long time, it could be because you are not limiting the number of results that are returned. The <cfsearch> tag has a maxrows attribute, as well as a startrow attribute, that you could use to limit or paginate the data. Alternately, you could call Solr's web service directly through a <cfhttp> call:
<cfhttp url="http://localhost:8983/solr/<collection_name>/select/?q=<searchterm>&fl=*,score&rows=100&wt=json" />
Solr will return 10 rows by default; you can change this with the rows parameter. You can use the start parameter as well (note that Solr starts counting with 0 instead of 1). I believe this solution is more flexible, especially if you're using CF 9, as it allows you to paginate while sorting on a field other than score.
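If you want to call Solr directly without the <cfhttp> wrapper, the same paginated request looks roughly like this (sketched in Python with the requests library; host, port, and collection name are placeholders):

    import requests

    # Placeholders: adjust host, port, and collection name to your install.
    SOLR_SELECT = "http://localhost:8983/solr/collection1/select"

    params = {
        "q": "education",
        "fl": "*,score",
        "rows": 100,   # page size (Solr's default is 10)
        "start": 0,    # 0-based offset: 100 for page 2, 200 for page 3, ...
        "wt": "json",
    }

    response = requests.get(SOLR_SELECT, params=params)
    docs = response.json()["response"]["docs"]
    print(len(docs), "documents returned")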
You can find more detail here:
http://www.thefaberfamily.org/search-smith/coldfusion-solr-tutorial/
If the user searches on just one term that is exceedingly common then you need to limit your results and advise the user that there were too many matches.
In the more general case, you want to perform a two-pass (at least) approach. Take your search terms and perform a lookup to determine their 'common-ness'. You want to filter based on least common terms first, and more common terms last.
For example, user searches serendipitous education. You identify that you have 11 matches for serendipitous, and 900000 matches for education. Thus you apply the serendipitous filter first, resulting in 11 matches. Then apply the education filter, resulting in 7 matches.
The key to fast searching is indexing and precomputed statistics. If you have statistics like this on hand, you can dynamically create an optimised approach.
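A toy illustration of that least-common-first ordering, assuming you already have per-term posting sets and their sizes on hand (the data here is made up):

    # term -> set of matching document ids (toy data standing in for the index).
    postings = {
        "serendipitous": {3, 17, 42, 99},
        "education":     set(range(0, 1000, 2)),   # very common term (toy scale)
    }

    def search(terms):
        # Filter by the least common terms first so intermediate sets stay small.
        ordered = sorted(terms, key=lambda t: len(postings.get(t, ())))
        results = None
        for term in ordered:
            docs = postings.get(term, set())
            results = docs if results is None else results & docs
            if not results:
                break   # quick elimination: no point applying further filters
        return results or set()

    print(search(["education", "serendipitous"]))   # -> {42}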

Create and use HTML full text search index (C++)

I need to create a search index for a collection of HTML pages.
I have no experience in implementing a search index at all, so I would appreciate any general information on how to build one, what information to store, how to implement advanced searches such as "entire phrase", ranking of results, etc.
I'm not afraid to build it myself, though I'd be happy to reuse an existing component (or use one to get started with a prototype). I am looking for a solution accessible from C++, preferably without requiring additional installations at runtime. The content is static (so it makes sense to aggregate search information), but a search might have to accumulate results from multiple such repositories.
I can make a few educated guesses, though: create a map word ==> pages for all (relevant) words; a rank can be assigned to the mapping by prominence (h1 > h2 > ... > <p>) and proximity to the top. Advanced searches could be built on top of that: searching for the phrase "homo sapiens" could list all pages that contain "homo" and "sapiens", then scan all pages returned for locations where they occur together. However, there are a lot of problematic scenarios and unanswered questions, so I am looking for references to what should be a huge amount of existing work that somehow escapes my google-fu.
[edit for bounty]
The best resource I found until now is this and the links from there.
I do have an implementation roadmap for an experimental system; however, I am still looking for:
Reference material regarding index creation and individual steps
available implementations of individual steps
reusable implementations (with above environment restrictions)
This process is generally known as information retrieval. You'll probably find this online book helpful.
Existing libraries
Here are two existing solutions that can be fully integrated into an application without requiring a separate process (I believe both will compile with VC++).
Xapian is mature and may do much of what you need, from indexing to ranked retrieval. Separate HTML parsing would be required because, AFAIK, it does not parse html (it has a companion program Omega, which is a front end for indexing web sites).
Lucene is an indexing/searching Apache library in Java, with an official pre-release C version, Lucy, and an unofficial C++ version, CLucene.
Implementing information retrieval
If the above options are not viable for some reason, here's some info on the individual steps of building and using an index. Custom solutions can go from simple to very sophisticated, depending on what you need for your application. I've broken the process into 5 steps:
HTML processing
Text processing
Indexing
Retrieval
Ranking
HTML Processing
There are two approaches here
Stripping The page you referred to discusses a technique generally known as stripping, which involves removing all the html elements that won't be displayed and translating others to their display form. Personally, I'd preprocess using perl and index the resulting text files. But for an integrated solution, particularly one where you want to record significance tags (e.g. <h1>, <h2>), you probably want to roll your own. Here is a partial implementation of a C++ stripping routine (it appears in Thinking in C++; the final version of the book is here) that you could build from.
Parsing A level up in complexity from stripping is html parsing, which would help in your case for recording significance tags. However, a good C++ HTML parser is hard to find. Some options might be htmlcxx (never used it, but active and looks promising) or hubbub (C library, part of NetSurf, but claims to be portable).
If you are dealing with XHTML or are willing to use an HTML-to-XML converter, you can use one of the many available XML parsers. But again, HTML-to-XML converters are hard to find, the only one I know of is HTML Tidy. In addition to conversion to XHTML, its primary purpose is to fix missing/broken tags, and it has an API that could possibly be used to integrate it into an application. Given XHTML documents, there are many good XML parsers, e.g. Xerces-C++ and tinyXML.
Text Processing
For English at least, processing text to words is pretty straightforward. There are a couple of complications when search is involved though.
Stop words are words known a priori not to provide a useful distinction between documents in the set, such as articles and prepositions. Often these words are not indexed and are filtered from query streams. There are many stop word lists available on the web, such as this one.
Stemming involves preprocessing documents and queries to identify the root of each word to better generalize a search. E.g. searching for "foobarred" should yield "foobarred", "foobarring", and "foobar". The index can be built and searched on roots alone. The two general approaches to stemming are dictionary based (lookups from word ==> root) and algorithm based. The Porter algorithm is very common and several implementations are available, e.g. C++ here or C here. Stemming in the Snowball C library supports several languages.
Soundex encoding One method to make search more robust to spelling errors is to encode words with a phonetic encoding. Then when queries have phonetic errors, they will still map directly to indexed words. There are a lot of implementations around, here's one.
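A small sketch of the stop word filtering and stemming steps together (using the Porter stemmer from NLTK; the tiny stop word list here is only an illustration, real lists are much longer):

    from nltk.stem import PorterStemmer   # pip install nltk

    STOP_WORDS = {"the", "a", "an", "and", "of", "is"}
    stemmer = PorterStemmer()

    def tokens_for_index(text):
        words = [w for w in text.lower().split() if w.isalpha()]
        # Drop stop words, then reduce the remaining words to their stems.
        return [stemmer.stem(w) for w in words if w not in STOP_WORDS]

    print(tokens_for_index("The foobarring of foobarred monitors"))
    # expected output along the lines of ['foobar', 'foobar', 'monitor']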
Indexing
The map word ==> page data structure is known as an inverted index. It's inverted because it's often generated from a forward index of page ==> words. Inverted indexes generally come in two flavors: an inverted file index, which maps words to each document they occur in, and a full inverted index, which maps words to each position in each document they occur in.
The important decision is what backend to use for the index, some possibilities are, in order of ease of implementation:
SQLite or Berkeley DB - both of these are database engines with C++ APIs that integrate into a project without requiring a separate server process. Persistent databases are essentially files, so multiple index sets can be searched by just changing the associated file. Using a DBMS as a backend simplifies index creation, updating and searching.
In-memory data structure - if you're using an inverted file index that is not prohibitively large (memory consumption and time to load), this could be implemented as a std::map<std::string,word_data_class>, using boost::serialization for persistence.
On-disk data structure - I've heard of blazingly fast results using memory-mapped files for this sort of thing, YMMV. Having an inverted file index would involve having two index files, one representing words with something like struct {char word[n]; unsigned int offset; unsigned int count; };, and the second representing (word, document) tuples with just unsigned ints (words implicit in the file offset). The offset is the file offset of the first document id for the word in the second file, and count is the number of document ids associated with that word (the number of ids to read from the second file). Searching would then reduce to a binary search through the first file with a pointer into a memory-mapped file. The downside is the need to pad/truncate words to get a constant record size.
The procedure for indexing depends on which backend you use. The classic algorithm for generating an inverted file index (detailed here) begins with reading through each document and extending a list of (page id, word) tuples, ignoring duplicate words in each document. After all documents are processed, sort the list by word, then collapse it into (word, (page id1, page id2, ...)).
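An in-memory version of that classic algorithm is only a few lines; here is a sketch (Python for brevity, though the same structure maps directly onto the std::map approach above):

    from collections import defaultdict

    documents = {
        1: "homo sapiens and other hominids",
        2: "the monitors of homo sapiens",
    }

    inverted = defaultdict(set)   # word -> set of page ids (an inverted file index)

    for page_id, text in documents.items():
        for word in set(text.lower().split()):   # ignore duplicate words per document
            inverted[word].add(page_id)

    # Keep posting lists sorted so boolean retrieval can intersect them efficiently.
    index = {word: sorted(ids) for word, ids in inverted.items()}
    print(index["sapiens"])   # [1, 2]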
The mifluz gnu library implements inverted indexes w/ storage, but without document or query parsing. GPL, so may not be a viable option, but will give you an idea of the complexities involved for an inverted index that supports a large number of documents.
Retrieval
A very common method is boolean retrieval, which is simply the union/intersection of documents indexed for each of the query words that are joined with or/and, respectively. These operations are efficient if the document ids are stored in sorted order for each term, so that algorithms like std::set_union or std::set_intersection can be applied directly.
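Given sorted posting lists like the ones above, AND/OR queries reduce to plain set operations; a sketch (the index argument is assumed to be the word ==> sorted page-id map built during indexing):

    def boolean_and(index, terms):
        # Intersection of the posting lists: pages containing every term.
        postings = [set(index.get(t, [])) for t in terms]
        return sorted(set.intersection(*postings)) if postings else []

    def boolean_or(index, terms):
        # Union of the posting lists: pages containing any of the terms.
        return sorted(set().union(*(index.get(t, []) for t in terms)))

    # e.g. boolean_and(index, ["homo", "sapiens"]) -> pages containing both words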
There are variations on retrieval, wikipedia has an overview, but standard boolean is good for many/most application.
Ranking
There are many methods for ranking the documents returned by boolean retrieval. Common methods are based on the bag of words model, which just means that the relative position of words is ignored. The general approach is to score each retrieved document relative to the query, and rank documents based on their calculated score. There are many scoring methods, but a good starting place is the term frequency-inverse document frequency formula.
The idea behind this formula is that if a query word occurs frequently in a document, that document should score higher, but a word that occurs in many documents is less informative so this word should be down weighted. The formula is, over query terms i=1..N and document j
score[j] = sum_over_i(word_freq[i,j] * inv_doc_freq[i])
where the word_freq[i,j] is the number of occurrences of word i in document j, and
inv_doc_freq[i] = log(M/doc_freq[i])
where M is the number of documents and doc_freq[i] is the number of documents containing word i. Notice that words that occur in all documents will not contribute to the score. A more complex scoring model that is widely used is BM25, which is included in both Lucene and Xapian.
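A direct transcription of that formula, assuming the per-document word counts and document frequencies are already available from the indexing step (names follow the formula above):

    import math

    def tf_idf_score(query_terms, doc_id, word_freq, doc_freq, num_docs):
        """score[j] = sum over query terms i of word_freq[i, j] * log(M / doc_freq[i])"""
        score = 0.0
        for term in query_terms:
            tf = word_freq.get((term, doc_id), 0)   # occurrences of the term in this document
            df = doc_freq.get(term, 0)              # number of documents containing the term
            if tf and df:
                # A term occurring in every document contributes log(M/M) = 0.
                score += tf * math.log(num_docs / df)
        return score

    # Rank the documents returned by boolean retrieval by this score, highest first.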
Often, effective ranking for a particular domain is obtained by adjusting by trial and error. A starting place for adjusting rankings by heading/paragraph context could be inflating word_freq for a word based on heading/paragraph context, e.g. 1 for a paragraph, 10 for a top level heading. For some other ideas, you might find this paper interesting, where the authors adjusted BM25 ranking for positional scoring (the idea being that words closer to the beginning of the document are more relevant than words toward the end).
Objective quantification of ranking performance is obtained by precision-recall curves or mean average precision, detailed here. Evaluation requires an ideal set of queries paired with all the relevant documents in the set.
Depending on the size and number of the static pages, you might want to look at an already existent search solution.
"How do you implement full-text search for that 10+ million row table, keep up with the load, and stay relevant? Sphinx is good at those kinds of riddles."
I would choose the Sphinx engine for full text searching. The licence is GPL but they also have a commercial version available. It is meant to be run stand-alone [2], but it can also be embedded into applications by extracting the needed functionality (be it indexing [1], searching [3], stemming, etc).
The data should be obtained by parsing the input HTML files and transforming them to plain-text by using a parser like libxml2's HTMLparser (I haven't used it, but they say it can parse even malformed HTML). If you aren't bound to C/C++ you could take a look at Beautiful Soup.
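For the Beautiful Soup route, extracting the plain text is essentially a one-liner; a minimal sketch (assuming the bs4 package is installed and page.html is your input file):

    from bs4 import BeautifulSoup   # pip install beautifulsoup4

    with open("page.html", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

    # get_text() strips the markup; separator/strip keep adjacent words from running together.
    text = soup.get_text(separator=" ", strip=True)
    print(text[:200])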
After obtaining the plain-texts, you could store them in a database like MySQL or PostgreSQL. If you want to keep everything embedded you should go with sqlite.
Note that Sphinx doesn't work out-of-the-box with sqlite, but there is an attempt to add support (sphinx-sqlite3).
I would attack this with a little sqlite database. You could have tables for 'page', 'term' and 'page term'. 'Page' would have columns like id, text, title and url. 'Term' would have a column containing a word, as well as the primary ID. 'Page term' would have foreign keys to a page ID and a term ID, and could also store the weight, calculated from the distance from the top and the number of occurrences (or whatever you want).
Perhaps a more efficient way would be to only have two tables - 'page' as before, and 'page term' which would have the page ID, the weight, and a hash of the term word.
An example query - you want to search for "foo". You hash "foo", then query all page term rows that have that term hash. Sort by descending weight and show the top ten results.
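A compact sketch of the first (three-table) variant and a query equivalent to that example, using Python's built-in sqlite3 module (table and column names follow the description above; computing the weight is left out):

    import sqlite3

    conn = sqlite3.connect("search.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS page (
            id    INTEGER PRIMARY KEY,
            url   TEXT,
            title TEXT,
            text  TEXT
        );
        CREATE TABLE IF NOT EXISTS term (
            id   INTEGER PRIMARY KEY,
            word TEXT UNIQUE
        );
        CREATE TABLE IF NOT EXISTS page_term (
            page_id INTEGER REFERENCES page(id),
            term_id INTEGER REFERENCES term(id),
            weight  REAL
        );
    """)

    # Example query: the top ten pages for the word "foo", heaviest weight first.
    rows = conn.execute("""
        SELECT page.url, page.title, page_term.weight
        FROM page_term
        JOIN term ON term.id = page_term.term_id
        JOIN page ON page.id = page_term.page_id
        WHERE term.word = ?
        ORDER BY page_term.weight DESC
        LIMIT 10
    """, ("foo",)).fetchall()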
I think this should query reasonably quickly, though it obviously depends on the number and size of the pages in question. Sqlite isn't difficult to bundle and shouldn't need an additional installation.
Ranking pages is the really tricky bit here. With a large sample of pages you can use links quite a lot in working out ranks. Otherwise you need to check how words seem to be placed, and also make sure your engine doesn't get fooled by 'dictionary' pages.
Good luck!