Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 months ago.
Improve this question
I would appreciate it if someone can assist with a code to calculate the "specific number of heatwaves days where relative humidity > 66% and < 33%".
(whereas, a heatwave event is defined as one in which temperatures exceeded the 90th percentile of the daily mean temperature for at least three consecutive days, respectively).
Ok well here is a solution
# temperature percentile
cdo timpctl,90 infile -timmin infile -timmax t2m.nc t2m_pcen90.nc
# mask the temperature
cdo ge t2m.nc t2m_pcen90.nc mask.nc
# Need to make sure we have three consecutive days
cdo --timestat_date last runmean,3 mask.nc mask3.nc
cdo gec,1 mask3.nc heatwave_T.nc
# Now filter for dry heatwaves, assuming RH is %, change X if fraction
cdo lec,33 rh.nc rhdry.nc
cdo mul heatwave_T.nc rhdry.nc heatwave_dry.nc
# and same for wet
cdo gec,66 rh.nc rhwet.nc
cdo mul heatwave_T.nc rhwet.nc heatwave_wet.nc
Each file should have a 1 in it for each location/time when you are in a heatwave according to your definition. Of course the metadata is appropriate for T2m not the index, use NCO to change that if required. I have several video guides that would help with this question, the key one being the one on masking (it doesn't include the running mean part though). Note also that the RH criterion is applied ONLY on the day (no running mean) but that is how you write the definition in your question. Duplicate the running mean part if needed.
ps: In general it is good to show that you have attempted a solution yourself, before asking, SO guidelines are that questions are of a debugging nature, or can be a request for a one-liner, but not coding requests like "write me a code that does X or Y" - I think that is why you were getting downvoted.
Related
Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 2 years ago.
Improve this question
Currently, my line of code is really long and I was curious to know if there was a more efficient way of doing this.
As Nick has pointed out your question is missing most of the information that would make it answerable. Please read more here, and add more information to your question.
In the meantime, a useful approach is to merge your zipcode data with a dataframe (or dataset) with the state-zipcode link in it.
* first you need to get the zipcode data from somewhere.
* Here is one way:
!wget "https://www2.census.gov/geo/docs/maps-data/data/rel/zcta_county_rel_10.txt"
* now put this data in a frame
frame create zctaFrame
frame zctaFrame{
import delimited "zcta_county_rel_10.txt"
}
* now I'm making up a dataset (share some of yours with dataex from ssc
input str10 name zip
"sam" 55901
"sasha" 84101
"saul" 84111
end
frlink 1:1 zip, frame(zctaFrame zcta5)
frget state, from(zctaFrame)
If this doesn't match what you're trying to do, please add more detail to the question.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I have to create an Application to read some live data feed from more than 200 tables simultaneously and process this data. I want to discuss what could be the best approach to solve this problem with optimum speed as for each table we are getting 20+ records in every minute. So far I can think of following solutions :-
1) I can make multiple thread handling some 20 odd symbols independently.
2) I can make two thread one for data read and other for data processing but reader thread will take more time as it has to read all tables sequentially.
my database is MySQL and I am not looking to shift to nosql DB right now.I am using C++ to solve this problem.I feel that if instead of 200+ tables I can get live data feed in a single table then my second approach will become much appropriate and faster.
Is the use of MySQL required if not you might get a speed increase from any nosql "database". Furthermore retrieving data from a database is always a bottleneck, generally when it comes to that much data volume you want to load as much as you can into RAM and read it from there, as it is much faster.
You could make a query that would only retrieve the newest data from a certain timestamp(which is the same timestamp of the execution of your last query) then load that into memory do all the operations that require speed there, and clean up old entries that are not required anymore.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
I'm working on a research project and am assigned to do a bit of data scraping and writing code in R that can help extract current temperature for a particular zip code from a site such as wunderground.com. Now this may be a bit of an abstract question but does anyone know how to do the following:
I can extract the current temperature of a particular zip code by doing this:
temps <- readLines("http://www.wunderground.com/q/zmw:20904.1.99999")
edit(temps)
temps //gives me the source code for the website where I can look at the line that contains the temperature
ldata <- temps[lnumber]
ldata
# then have a few gsub functions that basically extracts
# just the numerical data (57.8 for example) from that line of code
I have a cvs file that contains zip code of every city in the country and I have that imported in R. It is arranged in a table according to zip, city and state. My challenge now is to write a method (using java analogy here because I'm new to R) that basically extracts 6-7 consecutive zip codes (after a particular one specified) and runs the above code by modifying the link within the readLines function and putting in the respective zip code after the link segment zmw:XXXXX and running everything after that based on that link. Now I don't quite know how to extract the data from the table. Maybe with a for-loop function? But then I don't know how to use that to modify the link. I think that's where I'm really getting stuck on. I have a bit of Java background so I understand HOW to approach this problem, just not the knowledge of the syntax. I understand this is quite an abstract question as I didn't provide a lot of code but I just want to know they functions/syntax that will help me extract the data from the table and somehow use that to modify the link through a function rather than manually doing it.
So this is about the Weather Underground data.
You can download csv files from individual weather stations in wunderground, however you need to know the weather station identifier. Here is an example URL for a weather station in Kirkland, WA (KWAKIRKL8):
http://www.wunderground.com/weatherstation/WXDailyHistory.asp?ID=KWAKIRKL8&day=31&month=1&year=2014&graphspan=day&format=1
Here is some R code:
url <- 'http://www.wunderground.com/weatherstation/WXDailyHistory.asp?ID=KWAKIRKL8&day=31&month=1&year=2014&graphspan=day&format=1'
s <- getURL(url)
s <- gsub("<br>\n","",s)
wdf <- read.csv(con<-textConnection(s))
And here is a page with which you can manually find stations and their codes.
http://www.wunderground.com/wundermap/
Since you only need a few you can pick them out manually.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
I've noticed twitter people search can come up with some weird results. Searching for match in screen_name twitter_name and bio is obvious, but they also do something different. I guess it has something to do with Triadic Closure but find its usage for search (instead of suggestions) weird. Wanted to hear your thoughts about this issue.
I think your question might be a little nonspecific, but here are my thoughts:
Suppose your search query was "Miley Cyrus", for instance. Now the top results will for sure include her real account, then fake ones, but then the results will get a little distorted.
I expect it ranks each account / person X in this manner (or something similar):
If person X follows accounts that has the search query in its bio / name, it has a higher rank than if that person didn't.
In our search, "Rock Mafia" is a good example; it doesn't have the term "Miley Cyrus" in its bio nor its name, but if you look at the people "Rock Mafia" is following, you'll find a lot of "similar" names / bios. Another ranking criteria would be this:
If person X has tweets that contains the search query in its content, it would also have a higher rank
A good example is the result "AnythingDisney" (#adljupdated), you can see that the 4th most recent tweet contains "Miley".
So basically the search prioritization looks like this:
Look in name / bio.
Need more results? Rank each person X by his followers and the people he follows, and by tweets that contain the query.
Need even more results? Look at "deeper" levels, rank each person X by the people being followed by the people X is following.
An so on, recursively.
I hope this helped in any manner!
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
I have a set of training data consisting of 20 multiple choice questions (A/B/C/D) answered by a hundred respondents. The answers are purely categorical and cannot be scaled to numerical values. 50 of these respondents were selected for free product trial. The selection process is not known. What interesting knowledge can be mined from this information?
The following is a list of what I have come up with so far-
A study of percentages (Example - Percentage of people who answered B on Qs.5 and got selected for free product trial)
Conditional probabilities (Example - What is the probability that a person will get selected for free product trial given that he answered B on Qs.5)
Naive Bayesian classifier (This can be used to predict whether a person will be selected or not for a given set of values for any subset of questions).
Can you think of any other interesting analysis or data-mining activities that can be performed?
The usual suspects like correlation can be eliminated as the response is not quantifiable/scoreable.
Is my approach correct?
It is kind of reverse engineering.
For each respondent, you have 20 answers and one label, which indicates whether this respondent gets the product trial or not.
You want to know which of the 20 questions are critical to give trial or not decision. I'd suggest you first build a decision tree model on the training data. And study the tree carefully to get some insights, e.g. the low level decision nodes contain most discriminant questions.
The answers can be made numeric for analysis purposes, example:
RespondentID IsSelected Q1AnsA Q1AnsB Q1AnsC Q1AnsD Q2AnsA...
12345 1 0 0 1 0 0
Use association analysis to see if there are patterns in the answers.
Q3AnsC + Q8AnsB -> IsSelected
Use classification (such as logistic regression or a decision tree) to model how users are selected.
Use clustering. Are there distinct groups of respondents? In what ways are they different? Use the "elbow" or scree method to determine the number of clusters.
Do you have other info about the respondents, such as demographics? Pivot table would be good in that case.
Is there missing data? Are there patterns in the way that people skipped questions?