How to match strings in MongoDb and ignore any whitespace - regex

Is it possible to ignore all whitespace using regex in MongoDB queries?
My Node.js program uses Cheerio to pull data from a number of websites, parses and then stores the data in MongoDB. My database has a People collection that keys on the string field Name.
Problem occurs where one website (site-A) shows the name HTML text as John&npsp;Smith, whereas another website (site-B) shows name as John Smith. My program has two scripts, one that scrapes site-A and another to scrape site-B; both of which use the following to scrape the Name data -
var $ = cheerio.load(htmlrow);
var personobj = { name: $('td.person a').text().trim() }
Each script then uses the following MongoDb command (using the native driver) to upsert the scraped data, keying on the Name field. However, this results in two records in the People collection -
db.collection('people').update(
{ Name: personobj.name },
{ $set: { LastScan: new Date() }},
{ upsert: true },
function(){} );
Now, I tried using the regex "extended" 'x' option to query in MongoDb, but it's not working. In fact, I tried testing the 'x' option via the find operator in Robomongo, and it returns zero records. I also note that when find testing in Robomongo, and I simply type Name: "John Smith", it only returns the site-B record, the one without the $nbsp; whitespace; even though when I view the detail of both records, the name strings appear identical. (I suppose difference is caused somewhere by all the encoding/decoding going on here to scrape, parse, store, retrieve... but I'm not sure where or why).
Is it possible to ignore all whitespace when querying MongoDb using regex?
Or, is it easier to handle this in my javascript parse line, to somehow replace and 'standardize' all possible whitespace characters? (Any recommended library to do so?)

Related

How can I use regex to construct an API call in my Jekyll plugin?

I'm trying to write my own Jekyll plugin to construct an api query from a custom tag. I've gotten as far as creating the basic plugin and tag, but I've run into the limits of my programming skills so looking to you for help.
Here's my custom tag for reference:
{% card "Arbor Elf | M13" %}
Here's the progress on my plugin:
module Jekyll
class Scryfall < Liquid::Tag
def initialize(tag_name, text, tokens)
super
#text = text
end
def render(context)
# Store the name of the card, ie "Arbor Elf"
#card_name =
# Store the name of the set, ie "M13"
#card_set =
# Build the query
#query = "https://api.scryfall.com/cards/named?exact=#{#card_name}&set=#{#card_set}"
# Store a specific JSON property
#card_art =
# Finally we render out the result
"<img src='#{#card_art}' title='#{#card_name}' />"
end
end
end
Liquid::Template.register_tag('cards', Jekyll::Scryfall)
For reference, here's an example query using the above details (paste it into your browser to see the response you get back)
https://api.scryfall.com/cards/named?exact=arbor+elf&set=m13
My initial attempts after Googling around was to use regex to split the #text at the |, like so:
#card_name = "#{#text}".split(/| */)
This didn't quite work, instead it output this:
[“A”, “r”, “b”, “o”, “r”, “ “, “E”, “l”, “f”, “ “, “|”, “ “, “M”, “1”, “3”, “ “]
I'm also then not sure how to access and store specific properties within the JSON response. Ideally, I can do something like this:
#card_art = JSONRESPONSE.image_uri.large
I'm well aware I'm asking a lot here, but I'd love to try and get this working and learn from it.
Thanks for reading.
Actually, your split should work – you just need to give it the correct regex (and you can call that on #text directly). You also need to escape the pipe character in the regex, because pipes can have special meaning. You can use rubular.com to experiment with regexes.
parts = #text.split(/\|/)
# => => ["Arbor Elf ", " M13"]
Note that they also contain some extra whitespace, which you can remove with strip.
#card_name = parts.first.strip
#card_set = parts.last.strip
This might also be a good time to answer questions like: what happens if the user inserts multiple pipes? What if they insert none? Will your code give them a helpful error message for this?
You'll also need to escape these values in your URL. What if one of your users adds a card containing a & character? Your URL will break:
https://api.scryfall.com/cards/named?exact=Sword of Dungeons & Dragons&set=und
That looks like a URL with three parameters, exact, set and Dragons. You need to encode the user input to be included in a URL:
require 'cgi'
query = "https://api.scryfall.com/cards/named?exact=#{CGI.escape(#card_name)}&set=#{CGI.escape(#card_set)}"
# => "https://api.scryfall.com/cards/named?exact=Sword+of+Dungeons+%26+Dragons&set=und"
What comes after that is a little less clear, because you haven't written the code yet. Try making the call with the Net::HTTP module and then parsing the response with the JSON module. If you have trouble, come back here and ask a new question.

How to run a combination of query and filter in elasticsearch?

I am experimenting using elasticsearch in a dummy project in django. I am attempting to make a search page using django-elasticsearch-dsl. The user may provide a title, summary and a score to search for. The search should match all the information given by the user, but if the user does not provide any info about something, this should be skipped.
I am running the following code to search for all the values.
client = Elasticsearch()
s = Search().using(client).query("match", title=title_value)\
.query("match", summary=summary_value)\
.filter('range', score={'gt': scorefrom_value, 'lte': scoreto_value})
When I have a value for all the fields then the search works correctly, but if for example I do not provide a value for the summary_value, although I am expecting the search to continue searching for the rest of the values, the result is that it comes up with nothing as a result.
Is there some value that the fields should have by default in case the user does not provide a value? Or how should I approach this?
UPDATE 1
I tried using the following, but it returns every time no matter the input i am giving the same results.
s = Search(using=client)
if title:
s.query("match", title=title)
if summary:
s.query("match", summary=summary)
response = s.execute()
UPDATE 2
I can print using the to_dict().
if it is like the following then s is empty
s = Search(using=client)
s.query("match", title=title)
if it is like this
s = Search(using=client).query("match", title=title)
then it works properly but still if i add s.query("match", summary=summary) it does nothing.
You need to assign back into s:
if title:
s = s.query("match", title=title)
if summary:
s = s.query("match", summary=summary)
I can see in the Search example that django-elasticsearch-dsl lets you apply aggregations after a search so...
How about "staging" your search? I can think if the following:
#first, declare the Search object
s = Search(using=client, index="my-index")
#if parameter1 exists
if parameter1:
s.filter("term", field1= parameter1)
#if parameter2 exists
if parameter2:
s.query("match", field=parameter2)
Do the same for all your parameters (with the needed method for each) so only the ones that exist will appear in your query. At the end just run
response = s.execute()
and everything should work as you want :D
I would recommend you to use the Python ES Client. It lets you manage multiple things related to your cluster: set mappings, health checks, do queries, etc.
In its method .search(), the body parameter is where you send your query as you normally would run it ({"query"...}). Check the Usage example.
Now, for your particular case, you can have a template of your query stored in a variable. You first start with, let's say, an "empty query" only with filter, just like:
query = {
"query":{
"bool":{
"filter":[
]
}
}
}
From here, you now can build your query from the parameters you have.
This is:
#This would look a little messy, but it's useful ;)
#if parameter1 is not None or emtpy
#(change the if statement for your particular case)
if parameter1:
query["query"]["bool"]["filter"].append({"term": {"field1": parameter1}})
Do the same for all your parameters (for strings, use "term", for ranges use "range" as usual) and send the query in the .search()'s body parameter and it should work as you want.
Hope this is helpful! :D

How to efficiently store records as Ember query parameters

Background: The user can filter a list based on tags, every tag is an Ember Data model. The selected filtering tags should be stored in the URL via query parameters. I also want to display the selected tags in the interface.
What's the best way to go about this?
I'm pretty sure I only want to store the IDs, but if I do this I have to maintain a separate variable / computed property where I store the actual record of the ID which comes from the query parameters. This kind of duplication seems wrong to me.
Apart from this issue I don't know if I should use an array query parameter or build an comma separated string of the IDs. If I do the former I end up with ugly URLs like ?tags=%5B"3"%5D. But doing the later means doing even more work.
So what's your approach to this kind of problem? I hope I'm missing something obvious that doesn't have so many downsides :)
Easy!
Find out which non-alphanumeric characters are not getting encoded in URLs. Using this tool, I found these characters: ~!*()_-.
Pick one character that is not used in your tag names. Let's say it's *. Use it as a delimiter.
Use tag names for tag ids. Or at least enforce unique tag names.
Then you've got a nice readable query:
?tags=foo*bar*baz&search=quux
To implement custom query param serialization/deserialization, do this:
Ember.Route.reopen({
serializeQueryParam: function(value, urlKey, defaultValueType) {
if (defaultValueType === 'array') {
return `${value.join('*')}`;
}
return this._super(...arguments);
},
deserializeQueryParam: function(value, urlKey, defaultValueType) {
if (defaultValueType === 'array') {
return value.split('*');
}
return this._super(...arguments);
}
});
Demo: app, code.
If you must use numeric tag ids in query params (for example, your tag names aren't unique), then you need some more customization: app, code.

Partial text search with postgreql & django

Tried to implement partial text search with postgresql and django,used the following query
Entry.objects.filter(headline__contains="search text")
This returns records having exact match,ie suppose checking for a match against the record "welcome to the new world" with query __contains="welcome world" , returns zero records
How can i implement this partial text search with postgresql-8.4 and django?
If you want this exact partial search you can use the startswitch field lookup method: Entry.objects.filter(headline__startswith="search text"). See more info at https://docs.djangoproject.com/en/dev/ref/models/querysets/#startswith.
This method creates a LIKE query ("SELECT ... WHERE headline LIKE 'search text%'") so if you're looking for a fulltext alternative you can check out PostgreSQL's built in Tsearch2 extension or other options such as Xapian, Solr, Sphinx, etc.
Each of the former engines mentioned have Django apps that makes them easier to integrate: Djapian for Xapian integration or Haystack for multiple integrations in one app.

Regular expression for validating url with parameters

I have been searching high and low for a solution to this, but to no avail. I am trying to prevent users from entering poorly formed URLs. Currently I have this regular expression in place:
^(http|https)\://.*$
This does a check to make sure the user is using http or https in the URL. However I need to go a step further and validate the structure of the URL.
For example this URL: http://mytest.com/?=test is clearly invalid as the parameter is not specified. All of the regular expressions that I've found on the web return valid when I use this URL.
I've been using this site to test the expressions that I've been finding.
Look I think the best solution for testing the URL as :
var url="http://mytest.com/?=test";
Make 2 steps :
1- test only URL as :
http://mytest.com/
use pattern :
var pattern1= "^(http:\/\/www.|https:\/\/www.|ftp:\/\/www.|www.){1}([0-9A-Za-z]+\.)([A-Za-z]){2,3}(\/)?";
2- split URL string by using pattern1 to get the URL query string and IF URL has Query string then make test on It again by using the following pattern :
var query=url.split(pattern1);
var q_str = query[1];
var pattern2 = "^(\?)?([0-9A-Za-z]+=[0-9A-Za-z]+(\&)?)+$";
Good Luck,
I believe the problem you are having comes from the fact that what is or is not a valid parameter from a query string is not universally defined. And specifically for your problem, the criteria for a valid query is still not well defined from your single example of what should fail.
To be precise, check this out RFC3986#3.4
Maybe you can make up a criteria for what should be an "acceptable" query string and from that you can get an answer. ;)