Quicksight breaking up strings for use of all aspects - amazon-web-services

I was wondering if anyone has every had experience with breaking a string up in quicksight and using certain aspects of the string. My example is a data set that returns tags like this "animals|funny|dog-park" I have used "split(tags,'|',1)" but then all that gets returned is the first part(animals). I have also tried a combination of ifelse->locate->split with no luck. Is there a way to split these tags to where they are all usable (animals) & (funny) or (funny) & (dog-park), etc.? Say the article associated will then be broken up into one tag but also another separately? I know this will end up being a calculated field most likely. Thank you in advance!

Since QuickSight does not support any form of nested fields (including objects and list) in analysis, you need to normalise this into separate rows before feeding the data to QuickSight.
Otherwise, if you leave it as is, you would be limited to filtering using string contains and doing string lookup in calculated fields - nevertheless you would not be able to use these tags as categories (such as in colours field well of visuals).

Related

How to find entity in search query in Elasticsearch?

I'm using Elasticsearch to build search for ecommerece site.
One index will have products stored in it, in products index I'll store categories in it's other attributes along with. Categories can be multiple but the attribute will have single field value. (E.g. color)
Let's say user types in Black(color) Nike(brand) shoes(Categories)
I want to process this query so that I can extract entities (brand, attribute, etc...) and I can write Request body search.
I have tought of following option,
Applying regex on query first to extract those entities (But with this approach not sure how Fuzzyness would work, user may have typo in any of the entity)
Using OpenNLP extension (But this one only works on indexation time, in above scenario we want it on query side)
Using NER of any good NLP framework. (This is not time & cost effective because I'll have millions of products in engine also they get updated/added on frequent basis)
What's the best way to solve above issue ?
Edit:
Found couple of libraries which would allow fuzzy text matching in regex. But the entities to find will be many, so what's the best solution to optimise that ?
Still not sure about OpenNLP
NER won't work in this case because there are fixed number of entities so prediction is not right when there are no entity available in the query.
If you cannot achieve desired results with tuning of built-in ElasticSearch scoring/boosting most likely you'll need some kind of 'natural language query' processing:
Tokenize free-form query. Regex can be used for splitting lexems, however very often it is better to write custom tokenizer for that.
Perform named-entity recognition to determine possible field(s) for each keyword. At this step you will get associations like (Black -> color), (Black -> product name) etc. In fact you don't need OpenNLP for that as this should be just an index (keyword -> field(s)), and you can try to use ElasticSearch 'suggest' API for this purpose.
(optional) Recognize special phrases or combinations like "released yesterday", "price below $20"
Generate possible combinations of matches, and with help of special scoring function determine 'best' recognition result. Scoring function may be hardcoded (reflect 'common sense' heuristics) or it this may be a result of machine learning algorithm.
By recognition result (matches metadata) produce formal query to produce search results - this may be ElasticSearch query with field hints, or even SQL query.
In general, efficient NLQ processing needs significant development efforts - I don't recommend to implement it from scratch until you have enough resources & time for this feature. As alternative, you can try to find existing NLQ solution and integrate it, but most likely this will be commercial product (I don't know any good free/open-source NLQ components that really ready for production use).
I would approach this problem as NER tagging considering you already have corpus of tags. My approach for this problem will be as below:
Create a annotated dataset of queries with each word tagged to one of the tags say {color, brand, Categories}
Train a NER model (CRF/LSTMS).
This is not time & cost effective because I'll have millions of
products in engine also they get updated/added on frequent basis
To handle this situation I suggest dont use words in the query as features but rather use the attributes of the words as features. For example create an indicator function f(x',y) for word x with context x' (i.e the word along with the surrounding words and their attributes) and tag y which will return a 1 or 0. A sample indicator function will be as below
f('blue', 'y') = if 'blue' in `color attribute` column of DB and words previous to 'blue' is in `product attribute` column of DB and 'y' is `colors` then return 1 else 0.
Create lot of these indicator functions also know as features maps.
These indicator functions are then used to train a models using CRFS or LSTMS. Finially we use viterbi algorithm to find the best tagging sequence for your query. For CRFs you can use packages like CRFSuite or CRF++. Using these packages all you have go do is create indicator functions and the package will train a model for you. Once trained you can use this model to predict the best sequence for your queries. CRFs are very fast.
This way of training without using vector representation of words will generalise your model without the need of retraining. [Look at NER using CRFs].

Is it possible to query multiple AWS Cloudsearch fields for the same value without repeating?

Using AWS Cloudsearch, I need to query 2 separate fields for the same value using a structured (compound) query e.g.
(and (or name:'john smith') (or curr_addr:'123 someplace' other_addr:'123 someplace'))
This query works, but I'm wondering if it's necessary to repeat the value for each field that I want to search against. Is there some way to specify the value only once e.g. curr_addr+other_addr:'123 someplace'
That is the correct way to structure your compound query. From the AWS documentation, you'll see that they structure their example query the same way:
(and title:'star' (or actors:'Harrison Ford' actors:'William Shatner')(not actors:'Zachary Quinto'))
From Constructing Compound Queries
You may be able to get around this by listing the more repetitive fields in the query options (q.options), and then specify the field for the rest of the fields. The fields list is sort of a fallback for when you don't specify which field you are searching in a compound query. So if you list the address fields there, and then only specify the name field in your query, you may get close to the behavior you're looking for.
Query options
q.options={fields: ['curr_addr','other_addr']}
Query
(and (or name:'john smith') (or '123 someplace'))
But this approach would only work for one set of repetitive fields, so it's not a silver bullet by any means.
From Search API Reference (see q.options => fields)

I have large file contents that I want to make searchable on AWS CloudSearch but the maximum document size is 1MB - how do I deal with this?

I could split the file contents up into separate search documents but then I would have to manually identify this in the results and only show one result to the user - otherwise it will look like there are 2 files that match their search when in fact there is only one.
Also the relevancy score would be incorrect. Any ideas?
So the response from AWS support was to split the files up into separate documents. In response to my concerns regarding relevancy scoring and multiple hits they said the following:
You do raise two very valid concerns here for your more challenging use case here. With regard to relevance, you face a very significant problem already in that is harder to establish a strong 'signal' and degrees of differentiation with large bodies of text. If the documents you have are much like reports or whitepapers, a potential workaround to this may be in indexing the first X number of characters (or the first identified paragraph) into a "thesis" field. This field could be weighted to better indicate what the document subject matter may be without manual review.
With regard to result duplication, this will require post-processing on your end if you wish to filter it. You can create a new field that can generate a unique "Parent" id that will be shared for each chunk of the whole document. The post-processing can check to see if this "Parent" id has already been return(the first result should be seen as most relevant), and if it has, filter the subsequent results. What is doubly useful in such a scenario, is that you include a refinement link into your results that could filter on all matches within that particular Parent id.

How I can encode/escape a varchar to be more secure without using cfqueryparam?

How I can encode/escape a varchar to be more secure without using cfqueryparam? I want to implement the same behaviour without using <cfqueryparam> to get around "Too many parameters were provided in this RPC request. The maximum is 2100" problem. See: http://www.bennadel.com/blog/1112-Incoming-Tabular-Data-Stream-Remote-Procedure-Call-Is-Incorrect.htm
Update:
I want the validation / security part, without generating a prepared-statement.
What's the strongest encode/escape I can do to a varchar inside <cfquery>?
Something similar to mysql_real_escape_string() maybe?
As others have said, that length-related error originates at a deeper level, not within the queryparam tag. And it offers some valuable protection and therefore exists for a reason.
You could always either insert those values into a temporary table and join against that one or use the list functions to split that huge list into several smaller lists which are then used separately.
SELECT name ,
..... ,
createDate
FROM somewhere
WHERE (someColumn IN (a,b,c,d,e)
OR someColumn IN (f,g,h,i,j)
OR someColumn IN (.........));
cfqueryparam performs multiple functions.
It verifies the datatype. If you say integer, it makes sure there is an integrer, and if not, it does nto allow it to pass
It separates the data of a SQL script from the executable code (this is where you get protection from SQL injection). Anything passed as a param cannot be executed.
It creates bind variables at the DB engine level to help improve performance.
That is how I understand cfqueryparam to work. Did you look into the option of making several small calls vs one large one?
It is a security issue. Stops SQL injections
Adobe recommends that you use the cfqueryparam tag within every cfquery tag, to help secure your databases from unauthorized users. For more information, see Security Bulletin ASB99-04, "Multiple SQL Statements in Dynamic Queries," at www.adobe.com/devnet/security/security_zone/asb99-04.html, and "Accessing and Retrieving Data" in the ColdFusion Developer's Guide.
The first thing I'd be asking myself is "how the heck did I end up with more than 2100 params in a single query?". Because that in itself should be a very very big red flag to you.
However if you're stuck with that (either due to it being outwith your control, or outwith your motivation levels to address ;-), then I'd consider:
the temporary table idea mentioned earlier
for values over a certain length just chop 'em in half and join 'em back together with a string concatenator, eg:
*
SELECT *
FROM tbl
WHERE col IN ('a', ';DROP DATABAS'+'E all_my_data', 'good', 'etc' [...])
That's a bit grim, but then again your entire query sounds grim, so that might not be such a concern.
param values that are over a certain length or have stop words in them or something. This is also quite a grim suggestion.
SERIOUSLY go back over your requirement and see if there's a way to not need 2100+ params. What is it you're actually needing to do that requires all this???
The problem does not reside with cfqueryparam, but with MsSQL itself :
Every SQL batch has to fit in the Batch Size Limit: 65,536 * Network Packet Size.
Maximum size for a SQL Server Query? IN clause? Is there a Better Approach
And
http://msdn.microsoft.com/en-us/library/ms143432.aspx
The few times that I have come across this problem I have been able to rewrite the query using subselects and/or table joins. I suggest trying to rewrite the query like this in order to avoid the parameter max.
If it is impossible to rewrite (e.g. all of the multiple parameters are coming from an external source) you will need to validate the data yourself. I have used the following regex in order to perform a safe validation:
<cfif ReFindNoCase("[^a-z0-9_\ \,\.]",arguments.InputText) IS NOT 0>
<cfthrow type="Application" message="Invalid characters detected">
</cfif>
The code will force an error if any special character other than a comma, underscore, or period is found in a text string. (You may want to handle the situation cleaner than just throwing an error.) I suggest you modify this as necessary based on the expected or allowed values in the fields you are validating. If you are validating a string of comma separated integers you may switch to use a more limiting regex like "[^0-9\ \,]" which will only allow numbers, commas, and spaces.
This answer will not escape the characters, it will not allow them in the first place. It should be used on any data that you will not use with <cfqueryparam>. Personally, I have only found a need for this when I use a dynamic sort field; not all databases will allow you to use bind variables with the ORDER BY clause.

How to limit columns returned by Django query?

That seems simple enough, but all Django Queries seems to be 'SELECT *'
How do I build a query returning only a subset of fields ?
In Django 1.1 onwards, you can use defer('col1', 'col2') to exclude columns from the query, or only('col1', 'col2') to only get a specific set of columns. See the documentation.
values does something slightly different - it only gets the columns you specify, but it returns a list of dictionaries rather than a set of model instances.
Append a .values("column1", "column2", ...) to your query
The accepted answer advising defer and only which the docs discourage in most cases.
only use defer() when you cannot, at queryset load time, determine if you will need the extra fields or not. If you are frequently loading and using a particular subset of your data, the best choice you can make is to normalize your models and put the non-loaded data into a separate model (and database table). If the columns must stay in the one table for some reason, create a model with Meta.managed = False (see the managed attribute documentation) containing just the fields you normally need to load and use that where you might otherwise call defer(). This makes your code more explicit to the reader, is slightly faster and consumes a little less memory in the Python process.