Smart data extraction algorithm from websites - regex

I'm building a deal aggregator, so I need a crawler that will extract data from several sites: price, discount, image, coordinates and, of course, the name of the deal.
Do you know of any tutorials, ebooks or other resources that would help me? For the image, the coordinates and the discount I already have a solution and a pattern (sketched below):
image: the biggest image on the page is always the main image of the deal
discount: the discount is always a number between 50 and 99 and always has a "%" symbol
coordinates: the coordinates are always decimal numbers, so I get them with a regex
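For illustration, here is roughly what I mean by those two patterns in Python (a rough sketch; every site will need tweaks, and the sample string is made up):

    import re

    html = 'Save now! 75% off. Meet at 46.0569, 14.5058 ...'

    # discount: a number between 50 and 99, followed by "%"
    discount = re.search(r'\b([5-9][0-9])\s?%', html)

    # coordinates: a pair of signed decimal numbers separated by a comma
    coords = re.search(r'(-?\d{1,3}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)', html)

    print(discount.group(1))                 # 75
    print(coords.group(1), coords.group(2))  # 46.0569 14.5058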
How do I get the following items?
The name of the deal?
The price?
Do you know of any data extraction algorithms that can be helpful?

I'd suggest using an XPath-based scraper, for example Web-Harvest.
Or, if you want to analyze raw text, I'd suggest a state-machine parser for recognizing the templated parts of the text.
Look at this topic: Are there APIs for text analysis/mining in Java?
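Web-Harvest is Java-based, but the same XPath idea is quick to prototype in Python with lxml. A minimal sketch; the "deal-title" and "deal-price" class names are invented, so inspect the real page to find the right expressions:

    from lxml import html

    # Toy markup standing in for a fetched deal page.
    page = html.fromstring("""
    <div class="deal">
      <h2 class="deal-title">Spa weekend for two</h2>
      <span class="deal-price">49.90</span>
    </div>
    """)

    # XPath pulls the name and price directly, no regex needed.
    name = page.xpath('//h2[@class="deal-title"]/text()')[0]
    price = page.xpath('//span[@class="deal-price"]/text()')[0]
    print(name, price)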

Related

Extract list from range Google Sheets

I have data from workplaces with several different work areas, and I need to extract, for each workplace, a list of its available work areas. I have an attempt that is really close to what I want: I use the formula =IF(D2=$G$1, "Yes", "No"), but with more data this will take a long time. I want to make it more automatic with formulas, but I don't know where to start.
Give the formula below a try. Put it in cell G1, then drag down as needed. FILTER returns the values in D2:D16 whose matching entry in A2:A16 equals F2 (skipping blanks), and TRANSPOSE lays the results out across the row.
=TRANSPOSE(IFERROR(FILTER($D$2:$D$16,$A$2:$A$16=F2,$D$2:$D$16<>""),""))

AWS GroundTruth text labeling - hide columns in the data, and checking quality of answers

I am new to SageMaker. I have a large CSV dataset which I would like labelled:
    sentence_id  sentence          pre_agreed_label
    148392       A sentence        0
    383294       Another sentence  1
For each sentence, I would like a) a yes/no binary classification in response to a question, and b) on a scale of 1-3, how obvious the classification was. I need the sentence id to map to other parts of the dataset, and will use the pre-agreed labels to assess accuracy.
I have identified SageMaker GroundTruth labelling jobs as a possible way to do this. Is this the best way? In trying to set it up I have run into a few problems.
The first problem is I can't find a way to display only the sentence column to the labellers, hiding the sentence_id and pre_agreed_labels.
The second is that there is either single labelling or multi labelling, but I would like a way to have two sets of single-selection labels:
Select one for binary classification:
Yes
No
Select one for difficulty of classification:
Easy
Medium
Hard
It seems as though this can be done using custom HTML, but I don't know how to do this; the template it gives you doesn't even render.
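For reference, this is the kind of template I have in mind (an untested sketch using the crowd-html elements and the Liquid variables that custom GroundTruth templates use; the name/value attributes are my guesses). Only the sentence would be rendered, so sentence_id and pre_agreed_label stay hidden in the manifest:

    <script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
    <crowd-form>
      <p>{{ task.input.source }}</p>

      <p>Select one for binary classification:</p>
      <crowd-radio-group>
        <crowd-radio-button name="label-yes" value="yes">Yes</crowd-radio-button>
        <crowd-radio-button name="label-no" value="no">No</crowd-radio-button>
      </crowd-radio-group>

      <p>Select one for difficulty of classification:</p>
      <crowd-radio-group>
        <crowd-radio-button name="diff-easy" value="easy">Easy</crowd-radio-button>
        <crowd-radio-button name="diff-medium" value="medium">Medium</crowd-radio-button>
        <crowd-radio-button name="diff-hard" value="hard">Hard</crowd-radio-button>
      </crowd-radio-group>
    </crowd-form>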
Finally, having not used Mechanical Turk before, are there ways of ensuring people take the work seriously and don't just select random answers? I can see there's an option to have x number of people answer the same question, but is there also a way to insert an obvious question (one for which we already have a pre_agreed_label) every nth question, and kick people off the task if they get it wrong? There also appears to be a maximum of $1.20 per task, which seems odd.

Clear approach for assigning semantic tags to each sentence (or short documents) in Python

I am looking for a good approach using python libraries to tackle the following problem:
I have a dataset with a column that contains product descriptions. The values in this column can be very messy and contain a lot of words that are not related to the product. I want to know which rows are about the same product, so I need to tag each description with its main topics. For example, given
"500 units shoe green sport tennis import oversea plastic", I would like the tags to be something like "shoe" and "sport". So I am looking to build an approach for semantic tagging of sentences, not part-of-speech tagging. Assume I don't have labeled (tagged) data for training.
Any help would be appreciated.
Lack of labeled data means you cannot apply any semantic classification method using word vectors, which would be the optimal solution to your problem. An alternative, however, is to compute the document frequencies of your token n-grams and assume importance based on some smoothed variant of idf (i.e. words that tend to appear often across descriptions probably carry some semantic weight). You can then inspect your sorted-by-idf list of words and handpick (or erase) the words you deem important (or unimportant). The results won't be perfect, but it's a clean and simple solution given your lack of training data.
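A minimal sketch of that inspection loop, assuming descriptions is a plain list of description strings (the regex tokenizer and the smoothing constant are illustrative choices):

    import math
    import re
    from collections import Counter

    descriptions = [
        "500 units shoe green sport tennis import oversea plastic",
        "200 units shoe blue sport running import",
        "plastic chair garden green oversea",
    ]

    # Document frequency: in how many descriptions does each token appear?
    df = Counter()
    for text in descriptions:
        df.update(set(re.findall(r"[a-z]+", text.lower())))

    # Smoothed idf: low values = tokens common across descriptions.
    n = len(descriptions)
    idf = {tok: math.log((1 + n) / (1 + k)) + 1 for tok, k in df.items()}

    # Inspect the sorted list and handpick/erase tokens by hand.
    for tok, score in sorted(idf.items(), key=lambda kv: kv[1]):
        print(f"{tok:10s} {score:.3f}")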

C++ FIR noise filter

I'm digging up some info about filtering the noise out of my IQ data samples in C++.
I have learned that this can be done with a simple filter that calculates the average of the last few data samples and applies it to the current sample.
Do you have any further experience with this kind of filtering or do you recommend using some existing FIR filtering library?
Thanks for your comments!
Unfortunately, it is not as simple as "just get some library and it will do all the work for you"; digital filtering is a fairly complicated subject.
It is easy to apply a digital filter to your data only if your measurements come at fixed time intervals (the "sample rate" in digital-filter terms). Otherwise (if the time intervals vary), applying digital filters is not trivial (I suspect you might need an FFT to do it, but I might be wrong here).
Digital filters (both IIR and FIR) are interesting in that, as soon as you know the coefficients, you don't really need a library; it is easy to write the filter yourself (see, for example, the first picture at https://en.wikipedia.org/wiki/Finite_impulse_response : it looks simple, right?). It is finding the coefficients that is tricky.
As a prerequisite to finding the coefficients, you need to understand quite a lot about filters: what kind of filter you need (if it is applied after demodulation you'll likely need a low-pass filter; otherwise see the comment by MSalters below), what a "corner frequency" is, and how those frequencies map to your samples (for example, you can say that your samples come once per second, or at any other rate, but this choice will affect your desired "corner frequency"). As soon as you have this understanding of what you need in terms of digital filters, finding the coefficients is quite easy: you can do it either in MATLAB or with an online calculator; search Google for "digital filter calculator".
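To make the "write it yourself" part concrete, here is a direct-form FIR filter as a minimal sketch, assuming uniformly spaced samples (Python for brevity; the inner loop translates line-for-line to C++). The 5-tap moving average is just an example coefficient set:

    def fir_filter(samples, coeffs):
        """y[n] = sum_k coeffs[k] * x[n-k], with x[m] = 0 for m < 0."""
        out = []
        for n in range(len(samples)):
            acc = 0.0
            for k, c in enumerate(coeffs):
                if n - k >= 0:
                    acc += c * samples[n - k]
            out.append(acc)
        return out

    taps = [0.2] * 5  # moving average: equal taps summing to 1
    print(fir_filter([1.0, 2.0, 10.0, 2.0, 1.0, 1.5], taps))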

What is the best data mining method for vehicle search?

I'm trying to build a search engine that goes through online vehicle classifieds such as Oodle, eBay Motors, and Craigslist. I also have a large database of standard vehicle names and specifications. What I would like to do is, for each record I find on a classifieds site, determine exactly which vehicle model and style it is (from my database). For example, a standard name for a Ford truck in my db is:
2003 Ford F150.
However, on classified sites people might refer to it as "2003 Ford F 150", "2003 Ford f-150", or "03 Ford truck 150". Is there an effective data mining/text classification algorithm that can normalize these strings to the standard name above?
You could use the Levenshtein distance to match the found string against your database records.
Another (probably better) idea is to tokenize the strings and use a term vector model for the vehicle names. This way you can use cosine similarity to find relevant matches.
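A minimal sketch of the edit-distance route, using difflib from the Python standard library as a stand-in for a proper Levenshtein implementation (the vehicle names are toy data):

    import difflib

    standard_names = ["2003 Ford F150", "2004 Ford F250", "2003 Ford Focus"]

    def normalize(s):
        # Collapse the separators that vary between listings.
        return " ".join(s.lower().replace("-", " ").split())

    def best_match(listing, names):
        # Score every standard name and keep the closest one.
        return max(
            names,
            key=lambda n: difflib.SequenceMatcher(
                None, normalize(listing), normalize(n)
            ).ratio(),
        )

    print(best_match("2003 Ford f-150", standard_names))  # 2003 Ford F150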
If you're going to develop a whole search engine intended to scale in both usage and size, you will need something robust to support your queries.
If you're going to use edit distance, Bed-trees provide a good index structure. Another good approach, depending on the size of your dataset, is a Levenshtein automaton. Levenshtein automata are also great for providing autocomplete functionality, which you may need since you're developing a search engine.
Another alternative to edit distance is to use n-grams combined with the Jaccard index. For this approach you can use MinHash + LSH. You can also turn Jaccard into a distance metric (1 - Jaccard index) that respects the triangle inequality and can therefore be used in a metric tree such as a VP-tree.
One of these approaches will certainly help you.
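For concreteness, here is the n-gram + Jaccard idea as a minimal sketch on character trigrams (exact set operations; at scale you would swap in MinHash + LSH as described above):

    def trigrams(s):
        s = " ".join(s.lower().replace("-", " ").split())
        return {s[i:i + 3] for i in range(len(s) - 2)}

    def jaccard(a, b):
        ta, tb = trigrams(a), trigrams(b)
        return len(ta & tb) / len(ta | tb)

    print(jaccard("2003 Ford f-150", "2003 Ford F150"))   # high
    print(jaccard("2003 Ford f-150", "2003 Ford Focus"))  # lower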