I have a set of training data consisting of 20 multiple-choice questions (A/B/C/D) answered by a hundred respondents. The answers are purely categorical and cannot be scaled to numerical values. Fifty of these respondents were selected for a free product trial; the selection process is not known. What interesting knowledge can be mined from this information?
The following is a list of what I have come up with so far:
A study of percentages (example: the percentage of people who answered B on Q5 and got selected for the free product trial).
Conditional probabilities (example: the probability that a person gets selected for the free product trial given that they answered B on Q5).
A naive Bayes classifier (this can be used to predict whether a person will be selected, for a given set of values for any subset of questions).
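For the naive Bayes idea, a minimal sketch of how it might look with scikit-learn's CategoricalNB (toy data with just two questions; the encoding step and all names here are illustrative, not from my actual data):

# Sketch: naive Bayes over categorical answers (A-D ordinal-encoded to 0-3)
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

answers = [["A", "B"], ["B", "B"], ["C", "D"], ["B", "A"]]  # toy: 2 questions
selected = [0, 1, 0, 1]                                     # trial label
X = OrdinalEncoder().fit_transform(answers)
model = CategoricalNB().fit(X, selected)
print(model.predict_proba(X[:1]))  # [P(not selected), P(selected)]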
Can you think of any other interesting analysis or data-mining activities that can be performed?
The usual suspects like correlation can be eliminated, as the responses are not quantifiable/scoreable.
Is my approach correct?
This is a kind of reverse engineering.
For each respondent, you have 20 answers and one label, which indicates whether that respondent got the product trial or not.
You want to know which of the 20 questions are critical to the trial/no-trial decision. I'd suggest you first build a decision tree model on the training data, then study the tree carefully for insights; for example, the decision nodes nearest the root contain the most discriminant questions.
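A minimal sketch of that suggestion, assuming the responses sit in a hypothetical survey.csv with columns Q1..Q20 (values A-D) and a 0/1 IsSelected label (file and column names are placeholders, not from the question):

# Sketch: one-hot encode the answers, fit a shallow tree, read off the splits
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("survey.csv")                            # hypothetical input
X = pd.get_dummies(df[[f"Q{i}" for i in range(1, 21)]])   # A/B/C/D -> 0/1 columns
y = df["IsSelected"]
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))   # splits near the root matter most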
The answers can be made numeric for analysis purposes, for example:

RespondentID  IsSelected  Q1AnsA  Q1AnsB  Q1AnsC  Q1AnsD  Q2AnsA ...
12345         1           0       0       1       0       0
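In pandas, for instance, that coding is a couple of calls (a sketch with made-up values; fixing the category levels guarantees a column for every answer, even ones nobody picked):

import pandas as pd

df = pd.DataFrame({"RespondentID": [12345, 12346],
                   "IsSelected":   [1, 0],
                   "Q1":           ["C", "A"],
                   "Q2":           ["A", "D"]})
for q in ["Q1", "Q2"]:                                    # force all four levels
    df[q] = pd.Categorical(df[q], categories=list("ABCD"))
encoded = pd.get_dummies(df, columns=["Q1", "Q2"],
                         prefix=["Q1Ans", "Q2Ans"], prefix_sep="")
print(encoded)   # RespondentID, IsSelected, Q1AnsA..Q1AnsD, Q2AnsA..Q2AnsD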
Use association analysis to see if there are patterns in the answers.
Q3AnsC + Q8AnsB -> IsSelected
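One way to mine such rules (a sketch using the third-party mlxtend package on the 0/1 table above; the file name and both thresholds are arbitrary assumptions):

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# hypothetical: the 0/1 answer table from above, saved to CSV
items = pd.read_csv("answers_encoded.csv").drop(columns=["RespondentID"]).astype(bool)
frequent = apriori(items, min_support=0.1, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
# keep only rules that predict selection, e.g. {Q3AnsC, Q8AnsB} -> {IsSelected}
hit = rules["consequents"].apply(lambda c: c == frozenset({"IsSelected"}))
print(rules[hit])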
Use classification (such as logistic regression or a decision tree) to model how users are selected.
Use clustering. Are there distinct groups of respondents? In what ways are they different? Use the "elbow" or scree method to determine the number of clusters.
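A sketch of the elbow check (k-means on the same hypothetical 0/1 matrix; the range of k is arbitrary):

import pandas as pd
from sklearn.cluster import KMeans

X = pd.read_csv("answers_encoded.csv").drop(columns=["RespondentID", "IsSelected"])
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # plot inertia against k and look for the bend (the elbow)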
Do you have other info about the respondents, such as demographics? A pivot table would be useful in that case.
Is there missing data? Are there patterns in the way that people skipped questions?
I would appreciate it if someone could assist with code to calculate the number of heatwave days where relative humidity is > 66% or < 33%
(here, a heatwave event is defined as one in which temperature exceeds the 90th percentile of the daily mean temperature for at least three consecutive days).
OK, well, here is a solution:
# temperature percentile (timpctl also needs the time min and max as inputs)
cdo timpctl,90 t2m.nc -timmin t2m.nc -timmax t2m.nc t2m_pcen90.nc
# mask the temperature: 1 where t2m is at or above the 90th percentile, 0 otherwise
cdo ge t2m.nc t2m_pcen90.nc mask.nc
# need three consecutive days: the 3-day running mean of the 0/1 mask
# equals 1 only when all three days exceed the percentile
cdo --timestat_date last runmean,3 mask.nc mask3.nc
cdo gec,1 mask3.nc heatwave_T.nc
# now filter for dry heatwaves (RH assumed in %; scale the 33/66 thresholds if RH is a fraction)
cdo lec,33 rh.nc rhdry.nc
cdo mul heatwave_T.nc rhdry.nc heatwave_dry.nc
# and same for wet
cdo gec,66 rh.nc rhwet.nc
cdo mul heatwave_T.nc rhwet.nc heatwave_wet.nc
Each file should have a 1 in it for each location/time when you are in a heatwave according to your definition. Of course, the metadata is appropriate for T2m, not the index; use NCO to change that if required. I have several video guides that would help with this question, the key one being the one on masking (it doesn't include the running-mean part, though). Note also that the RH criterion is applied ONLY on the day (no running mean), but that is how the definition in your question is written; duplicate the running-mean part if needed.
PS: In general it is good to show that you have attempted a solution yourself before asking. SO guidelines are that questions should be of a debugging nature, or can be a request for a one-liner, but not coding requests like "write me code that does X or Y" - I think that is why you were getting downvoted.
I am working on a Power BI report. There are two dimensions, DimWorkedClass and DimWorkedService. (The above snippet was obtained by exporting the matrix values to CSV.)
The requirement is that, where the Worked Service is Text5, the Worked Class should also show Text5 instead of A (the current value).
It can be transformed at the back end, but is there any way to do it in Power BI?
This is trickier than it might appear, but it looks like this question has already been answered here:
Power Query Transform a Column based on Another Column
In your case, the M code would look something like this:
= Table.FromRecords(Table.TransformRows(#"[Source or Previous Step Here]",
    (here) => Record.TransformFields(here, {"Worked Class",
        each if here[Worked Service] = "text5" then "text5" else here[Worked Class]})))
(In the above, here represents the current row.)
Another answer points out a slightly cleaner way of doing this:
= Table.ReplaceValue(#"[Source or Previous Step Here]",
    each [Worked Class],
    each if [Worked Service] = "text5" then "text5" else [Worked Class],
    Replacer.ReplaceText, {"Worked Class"})
I'm practicing with classes, and I'm given the task of creating an employee management system. I'm given two .txt files. One (details.txt) has details of each employee with the following info: ID, name, DOB, SSN, department, and position. A sample of the file looks like this:
5 ali 6/24/1988 126-42-6989 support assistant
13 tim 2/10/1981 131-12-1034 logistics manager
The other .txt file (timelog.txt) will contain a daily log of when employees clock in and clock out. The format for this file is: ID, date, clock-in time, and clock-out time. Sample:
5 3/11 0800 1800
13 3/11 0830 1830
Firstly, I am to allow users to look up an employee by ID, name, department, or position. Doing so will display all of the employee's info (multiple employees if they have the same name or position, or are from the same department) as well as the total number of hours they have worked at the company.
Secondly, users are given another option: to look up employee time logs by ID number. This will display the entire clock-in/clock-out history of that employee as well as the total hours worked each day.
I'm planning to read in the info from the .txt files via ifstream and store them as an array of objects. I'm just wondering how many classes I should create. I'm thinking two classes: one for employee info (from details.txt) and one for time logs (from timelog.txt). Is there any other class I should create, or should those two suffice?
Short answer: At least two.
Long answer: It depends on many things, especially what parts of the code you can identify as potentially reusable.
If you asked for the highest possible number of classes that could accomplish your task, I would think about a single class for each of the following:
Employee
EmployeeManager (factory, holder, etc.) – creates, holds, and deletes the Employee objects; provides the search feature
DayWork – a row from timelog.txt; can calculate the hours/minutes spent at work that day
WorkLog – a list of DayWork objects for one employee; can calculate the total time spent
TextLineParser – encapsulation of std::ifstream
The right answer is most likely somewhere in between. Keep in mind that C++ is a multi-paradigm language, and you can perform some operations without having a class for them; instead, they can be performed in a function or a set of functions in a C-like unit. That's especially useful for one-time operations where the functions don't share common data (potential properties).
I've noticed that Twitter's people search can come up with some weird results. Searching for a match in screen_name, twitter_name, and bio is obvious, but they also do something different. I guess it has something to do with triadic closure, but I find its usage for search (instead of suggestions) weird. I wanted to hear your thoughts about this.
I think your question might be a little nonspecific, but here are my thoughts:
Suppose your search query was "Miley Cyrus", for instance. Now the top results will for sure include her real account, then fake ones, but then the results will get a little distorted.
I expect it ranks each account / person X in this manner (or something similar):
If person X follows accounts that have the search query in their bio / name, X ranks higher than if they didn't.
In our search, "Rock Mafia" is a good example; it doesn't have the term "Miley Cyrus" in its bio or its name, but if you look at the people "Rock Mafia" is following, you'll find a lot of "similar" names / bios. Another ranking criterion would be this:
If person X has tweets that contain the search query, X would also rank higher.
A good example is the result "AnythingDisney" (#adljupdated); you can see that the 4th most recent tweet contains "Miley".
So basically the search prioritization looks like this:
Look in name / bio.
Need more results? Rank each person X by his followers and the people he follows, and by tweets that contain the query.
Need even more results? Look at "deeper" levels, rank each person X by the people being followed by the people X is following.
And so on, recursively.
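To make those layers concrete, here is a purely illustrative toy scoring in Python (my guess at the behavior, definitely not Twitter's actual code; all handles and profile data are made up):

accounts = {   # hypothetical handle -> profile
    "rockmafia":  {"name": "Rock Mafia", "bio": "music producers",
                   "following": ["mileycyrus"], "tweets": ["new single out"]},
    "mileycyrus": {"name": "Miley Cyrus", "bio": "singer",
                   "following": [], "tweets": ["hi"]},
}

def matches(profile, q):
    return q in profile["name"].lower() or q in profile["bio"].lower()

def score(handle, query):
    q, profile = query.lower(), accounts[handle]
    s = 3.0 if matches(profile, q) else 0.0                    # level 1: own name/bio
    s += 2.0 * sum(matches(accounts[h], q)                     # level 2: who they follow
                   for h in profile["following"] if h in accounts)
    s += 1.0 * sum(q in t.lower() for t in profile["tweets"])  # level 3: tweet text
    return s

print(sorted(accounts, key=lambda h: score(h, "Miley Cyrus"), reverse=True))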
I hope this helped in any manner!
Where can I find some GPS unit test data to validate my code?
For example:
Distance between two coordinates (miles / kilometers)
Heading/bearing from Point A to Point B
Speed from Point A to Point B given a duration
Right now I'm using Google Earth to fumble around with this, but it would be nice to know I'm validating my calculations against something, well, valid.
"GPS unit test data" is quite vague. You could easily have a pile of data, but if you don't know what they represent, what value are the tests?
If you're looking for a math sample of latitude/longitude calculations, check out the example on Wikipedia's Great Circle distances article: http://en.wikipedia.org/wiki/Great-circle_distance#Worked_example It has two points and works the math to compute the distance between them.
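For instance, a quick haversine implementation you can check against that worked example (BNA to LAX, which the article puts at roughly 2887 km):

import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.009):
    # great-circle distance via the haversine formula
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

print(haversine_km(36.12, -86.67, 33.94, -118.40))   # expect ~2887 km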
Or are you looking for the data that comes directly from a GPS unit? These are called NMEA sentences. An NMEA sentence begins with $GP and the next 3 characters are the sentence code, followed by the sentence data. http://aprs.gids.nl/nmea/ has a list.
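As a sanity check when parsing captured sentences: the two hex digits after the '*' are an XOR checksum of every character between '$' and '*'. A quick validator sketch (the sample sentence is a commonly circulated GGA example, not data from your unit):

def nmea_checksum_ok(sentence):
    # XOR all characters between '$' and '*', compare to the trailing hex pair
    body, sep, given = sentence.strip().lstrip("$").partition("*")
    if sep != "*":
        return False
    calc = 0
    for ch in body:
        calc ^= ord(ch)
    return f"{calc:02X}" == given.upper()[:2]

print(nmea_checksum_ok("$GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,46.9,M,,*47"))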
You could certainly Google for "sample nmea data". The magnalox site appears to have some downloadable sample files, but I didn't check them to see if they'd be useful to you.
A better option would probably be to record your own data. Connect your laptop to your GPS unit, set it up to capture the serial data being emitted from the GPS, set the GPS to record your track, and take it for a short test drive. You can then compare how you processed the captured data based on what you know from the stored track (and from your little drive.) You could even have a web cam record the screen of the GPS to record heading/bearing information that doesn't arrive in the sentences.
Use caution if screen-scraping NMEA sentences from a web site. All valid NMEA sentences from a GPS receiver begin with "$GP", so that prefix is a quick sanity check on what you capture.
RandomProfile offers randomly generated valid NMEA sentences.