Could anyone give me help on ground-truth data? - data-mining

I recently came across a term in one of my email communications with my supervisor. I am doing a data-mining project on Facebook user profiles, and he said I should be collecting ground-truth data.
I am very new to this term, and when I searched online for it I found very few results about it in the data-mining sense.
Could anyone give me an example of what ground-truth data is in a data-mining task, please?
Thank you very much.

Ground truth is data, generally annotated by humans, whose labels are known to be 100% correct.
It's used to train an algorithm, since it's what you expect the algorithm to give you.
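For example, for a Facebook profile task, the ground truth could be a set of profiles labelled by hand. A minimal sketch, assuming a made-up profile-classification task (all field names and labels below are hypothetical):

```python
# Minimal sketch of ground-truth data for a (hypothetical) task of
# classifying Facebook profiles as real or fake. Every label here was
# assigned by a human annotator, so it is trusted to be 100% correct.
ground_truth = [
    ({"num_friends": 250, "posts_per_week": 3, "has_photo": True}, "real"),
    ({"num_friends": 4, "posts_per_week": 40, "has_photo": False}, "fake"),
]

# A model is trained on these human-verified labels and later evaluated
# against a held-out portion of the same ground truth.
features = [profile for profile, label in ground_truth]
labels = [label for profile, label in ground_truth]
```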

Related

On AWS Personalize, the users and items datasets require 1 metadata field - why? Is it necessary?

I am trying to create a recommendation system.
I would like to recommend users to subscribe to topics based on topics they already have subscribed to.
As mentioned here, there is a requirement of one metadata field for the users and items (topics, in my case) schemas.
I couldn't find any worthwhile field to put there.
Why is it a requirement? What should I do then? How much does it impact the final recommendation scores if I leave it blank? Am I missing any other way to solve this issue? Do you think that AWS Personalize is a good fit for my problem?
It seems like it is not mandatory to create all three dataset types. I found this out from aws-personalize-samples; I couldn't figure it out from the docs.
I thought of deleting this question because it might be too much of a noob question (it seems so from the downvoting), but heck, if it may shorten someone else's time to figure this out by pointing to the samples, I will leave it as is.
Sorry if I annoyed some people!
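For anyone who lands here: since only the interactions dataset is strictly needed to get started, a minimal sketch of registering an interactions-only schema with boto3 might look like this (the schema name is a placeholder; USER_ID, ITEM_ID, and TIMESTAMP are the standard required interaction fields):

```python
import json
import boto3

# Minimal sketch: register only an Interactions schema, skipping the
# optional Users and Items datasets entirely.
personalize = boto3.client("personalize")

interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},  # topic IDs in this use case
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
}

response = personalize.create_schema(
    name="topic-subscriptions-interactions",  # placeholder name
    schema=json.dumps(interactions_schema),
)
print(response["schemaArn"])
```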

Which way should I choose to calculate PV and UV in Django?

I'm building a news website using Django and hope it can handle millions of visits. Now I'm coding a feature that displays the most-viewed articles of the last 48 hours to readers, so I need to calculate the PV (page views).
I have searched for a while and asked some people. I know I have some options:
1. Simply using click_num = click_num + 1, but I know this is the worst way.
2. A better way is using Celery to write a distributed task, but I don't know exactly how to do it.
3. I heard Redis can also be used to calculate PV and UV, but I have no idea how to do it and am not sure it can satisfy my needs (see the sketch after this question).
4. Another good way is to use Google Analytics, but I also have no idea how to do it and am not sure it can satisfy my needs.
5. The last way, and I think the easiest, is to use JavaScript, but I'm not sure whether it can satisfy my needs.
Can anyone give me some advice? Thank you so much!
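Regarding option 3, here is a minimal sketch of the common Redis pattern for this, using the redis-py client; the key names and function names are made up for illustration. PV is a plain counter, and UV uses a HyperLogLog for an approximate distinct-visitor count:

```python
import time
import redis

# Minimal sketch of PV/UV counting with Redis via the redis-py client.
# Key names and function names below are illustrative, not a fixed API.
r = redis.Redis(host="localhost", port=6379, db=0)

def record_view(article_id, visitor_id):
    # Bucket counters by hour so a "last 48 hours" total is a simple sum.
    hour = int(time.time()) // 3600
    # PV: a plain counter, incremented once per page view.
    r.incr(f"pv:{article_id}:{hour}")
    # UV: a HyperLogLog gives an approximate distinct-visitor count
    # using a small, fixed amount of memory.
    r.pfadd(f"uv:{article_id}:{hour}", visitor_id)

def pv_last_48h(article_id):
    hour = int(time.time()) // 3600
    keys = [f"pv:{article_id}:{h}" for h in range(hour - 47, hour + 1)]
    return sum(int(v) for v in r.mget(keys) if v is not None)
```

You would call record_view from the article detail view; bucketing the counters by hour keeps the 48-hour window cheap to compute.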

Specific topics on TensorFlow for CNNs

I have a mini project for my TensorFlow course this semester, with a free choice of topic. Since I have some background in Convolutional Neural Networks, I intend to use one for my project. My computer can only run the CPU version of TensorFlow.
However, as a newbie, I realize that there are a lot of topics, such as MNIST, CIFAR-10, etc., so I don't know which one to pick. I only have two weeks left. It would be great if the topic were not too complicated, but also not too easy, since that matches my intermediate level.
In your experience, could you give me some advice about which specific topic I should do for my project?
Moreover, it would be better if the topic let me provide my own data to test my trained model, because my professor said that is a plus point towards an A grade on the project.
Thanks in advance,
I think that to answer this question you need to properly evaluate the marking criteria for your project. However, I can give you a brief overview of what you've just mentioned.
MNIST: MNIST is an optical character recognition task for individual digits 0-9 in 28px-square images. This is considered the "Hello World" of CNNs. It's pretty basic and might be too simplistic for your requirements; that's hard to gauge without more information. Nonetheless, it will run pretty quickly with CPU TensorFlow, and the online tutorial is pretty good.
CIFAR-10: CIFAR is a much bigger dataset of objects and vehicles. The image sizes are 32px square so individual image processing isn't too bad. But the dataset is very large and your CPU might struggle with it. It takes a long time to train. You could try training on a reduced dataset but I don't know how that would go. Again, depends on your course requirements.
Flowers/Poets: There is the TensorFlow for Poets re-training example. Even if that walkthrough is not suitable for your course, you could use its flowers dataset to build your own model.
Build-your-own-model: You could use tf.layers to build your own network and experiment with it. tf.layers is pretty easy to use. Alternatively, you could look at the new Estimators API, which will automate a lot of the training process for you. There are a number of tutorials (of varying quality) on the TensorFlow website.
I hope that helps give you a run-down of what's out there. Other datasets to look at are PASCAL VOC and ImageNet (however, they are huge!). Models to experiment with may include VGG-16 and AlexNet.
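To make the MNIST option concrete, here is a minimal CNN sketch. It uses tf.keras rather than the tf.layers API mentioned above, since tf.keras is portable across TensorFlow versions; the layer sizes are arbitrary illustrative choices, not tuned values:

```python
import tensorflow as tf

# Minimal MNIST CNN sketch with tf.keras; runs fine on CPU.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0  # shape (60000, 28, 28, 1), scaled to [0, 1]
x_test = x_test[..., None] / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),  # one output per digit
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, validation_data=(x_test, y_test))
```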

Clarification on Sitecore A/B Testing Results

We have recently started using Sitecore A/B Testing, and I am getting lots of questions about how the scoring works. I have been through the relevant Sitecore DMS documents, but I am still not 100% sure I understand it.
My basic understanding is that the scores are based on Value Per Visit, and my assumption is that the value relates to the whole visit, not just the specific components we may be trying to optimize with the A/B test.
For example, if option A has a goal associated with it worth 5 points, anyone presented with this option would get 5 points PLUS any other goal values they trigger during that visit to the site. That might add 5, 10, 50 or more points to the visit score, and then the option A score would be "total visit score / total visits".
Can anyone confirm if my assumptions are correct or explain where I may be off base? Can a user presented with option B change the score for option A?
By default, the Engagement Value is calculated on a per-visit basis. So your assumption is basically correct - and it does make it hard to test how a particular component variation does against another.
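As a rough illustration of the per-visit arithmetic described in the question (all numbers are made up):

```python
# Illustrative only: per-visit Engagement Value scoring with made-up numbers.
# Each entry is the total value accumulated during one visit that saw the
# variant: the variant's own 5-point goal plus any other goals triggered.
visits_option_a = [5, 15, 55, 5]  # 5-point goal plus 0/10/50/0 extra points
visits_option_b = [10, 10]

value_per_visit_a = sum(visits_option_a) / len(visits_option_a)  # 20.0
value_per_visit_b = sum(visits_option_b) / len(visits_option_b)  # 10.0
```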
That being said, there are tools to help you.
We're currently implementing SBOS Accelerators into our solution. We have the same issue you are describing and need a more fine-grained approach to testing.
Basically, SBOS Accelerators will allow you to track individual personalisation performance rather than "just" looking at the overall Engagement Value.
Lars Petersen blogs about it here: http://www.larsdk.dk/2014/01/must-have-marketplace-modules-for-sitecore-digital-marketing-system/
Marketplace link for the module here: http://marketplace.sitecore.net/en/Modules/SBOS_Accelerators.aspx
We found a few issues in testing the module, but none were really severe. I know these issues are being fixed if they haven't already.

Data mining/BI/Analytics/ML: Can a mathematically challenged person move into this field?

I have recently become interested in the field(s) of data mining and machine learning. The idea of going through huge datasets and trying to correlate hidden patterns and trends is fascinating. So far I have done the following:
Used Weka to load simple data sets and generate decision trees
Continuously read books, wikis, blogs and SO on the same
Started playing around with SQL Server DM and Python APIs
Got an idea of the freely available data sets on the web (freedb, UN, etc.)
What is hindering me is that the minute I try to go beyond classification/association and into algorithms like Apriori, I am stuck, because understanding mathematical equations and logic is not (to put it modestly) one of my strong points.
So my question is: is there anybody in the data mining field (in the role of product owner or builder) who is not naturally a mathematician? If so, how would you approach understanding the field, since free tools like Weka and RapidMiner both expect some mathematical/statistical background?
P.S.: Excuse me if I made some mistakes in the question, like mixing up data mining and analytics when they are separate fields, as I am still getting my feet wet. I hope my core question is clear.
Well, being able to do some analysis of what the data mining models are showing is absolutely vital. However, these days all of the math and statistics are taken care of by the data mining models. You don't need to understand the math behind them (although it helps).
For example, you can look through the SQL Server Analysis Services Data Mining Algorithms and see that even the technical reference covers how to use these implementations, not how to recreate them.
If you can understand the business cases and you can understand what the data mining is telling you, there's really no need to delve into the math behind it.
As for some of the free tools, I've never used them, so I can't speak to them. However, I'm a big fan of SSAS and those data mining models, which don't require an extensive mathematical background.
As Eric says, as long as you only intend to use the existing algorithms and APIs and make sense of their output, I don't see problems with the required math/statistics skill set (you'll still need some basic prior knowledge, though).
Now, if you intend to do research, or if you want to improve or modify existing algorithms, or even create your own, then math and statistics are a MUST. I just started doing some research in this area, and I'm still trying to fill my own skills gap =)
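To illustrate the point both answers make, here is a minimal sketch of fitting a decision tree with scikit-learn (a free Python library comparable to what Weka offers; the toy data is made up). All of the underlying math, such as the split-quality calculations, happens inside the library:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy, made-up data: [age, yearly_purchases] -> whether the customer churned.
X = [[25, 2], [40, 10], [35, 1], [50, 12], [23, 0], [44, 9]]
y = [1, 0, 1, 0, 1, 0]  # 1 = churned, 0 = stayed

# The entropy/Gini computations behind the splits are handled internally;
# the caller only chooses a few high-level knobs like tree depth.
clf = DecisionTreeClassifier(max_depth=2)
clf.fit(X, y)

print(clf.predict([[30, 1]]))  # e.g. [1]
```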