I am trying to compare data from census tracts over time, but some census tracts change from census to census. Is there a way to match old census tract IDs to the corresponding new census tract IDs? Specifically for the 1990 and 2010 censuses.
I realize your question is several months old, but in case you are still wondering about this (or for anyone who stumbles on here later): the IPUMS NHGIS project has geographic crosswalk files available on its website for 1990 blocks to 2010 blocks.
Those are available here:
https://www.nhgis.org/user-resources/geographic-crosswalks
Keep in mind that if you are doing spatial analysis, the Census Bureau did some cleanup of its geography around 2008 and after, improving the accuracy of the boundary lines, so you may find some discrepancies if you're doing spatial overlays. See here for more info:
https://www.nhgis.org/documentation/gis-data
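In case it helps, the general pattern for using such a crosswalk is: merge your old-geography data onto the crosswalk, multiply by the interpolation weight, and aggregate to the new geography. Here is a minimal pandas sketch of that pattern; the file name and column names (GEOID_1990, GEOID_2010, WEIGHT, POP1990) are placeholders, so check the actual crosswalk headers and the NHGIS documentation before relying on it.
```python
# Sketch: apportion 1990-geography values to 2010 geography using an
# NHGIS-style crosswalk. File and column names are placeholders.
import pandas as pd

crosswalk = pd.read_csv("nhgis_crosswalk.csv", dtype=str)
crosswalk["WEIGHT"] = crosswalk["WEIGHT"].astype(float)

tracts_1990 = pd.read_csv("my_1990_data.csv", dtype={"GEOID_1990": str})

# Merge the 1990 values onto the crosswalk and split them by weight.
merged = tracts_1990.merge(crosswalk, on="GEOID_1990", how="left")
merged["POP_allocated"] = merged["POP1990"] * merged["WEIGHT"]

# Sum the apportioned pieces up to the 2010 geography.
tracts_2010 = merged.groupby("GEOID_2010", as_index=False)["POP_allocated"].sum()
print(tracts_2010.head())
```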
If someone could please advise on how to achieve the following function on my site: I have a business where I ship live animals to customers. The challenge is communicating the weather forecast to the customer, along with the safest day(s) to ship for the animals' wellbeing. This planning is especially key during the peak summer and winter months when temperatures are very extreme. I am looking to automate these functions during the customer's checkout process, giving them control of the shipping date. I would like to implement something like this on my site.
I found another website where I recently placed an order that provided this information for the customer to review before deciding on which day they would like the animal shipment sent out. The table provided a 5-day forecast, which included morning and evening temperatures at the point of origin, in transit, and at the customer's destination address. Any day where the temperature was too hot or too cold was marked as unavailable (red) for shipping. Possible shipping dates consisted of two consecutive days where temperatures are in the safe range, labeled as available (green), excluding Friday, Saturday, and Sunday. It is also important for the customer to plan ahead and pick a shipping day based on their availability to be home to receive the package the following morning.
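To make that rule concrete, here is a rough Python sketch of just the availability logic (two consecutive safe days, no Friday/Saturday/Sunday ship dates). The temperature thresholds and the forecast data structure are assumptions for illustration only; a real implementation would pull forecasts from a weather API for the origin, transit hub(s), and destination address, and would sit behind the checkout page.
```python
# Rough sketch of the shipping-availability rule described above.
# Thresholds and the forecast structure are illustrative assumptions.
from datetime import date, timedelta

SAFE_LOW_F = 40               # assumed coldest safe temperature
SAFE_HIGH_F = 85              # assumed hottest safe temperature
NO_SHIP_WEEKDAYS = {4, 5, 6}  # Friday, Saturday, Sunday (Monday == 0)

def day_is_safe(temps_f):
    """temps_f: all forecast lows/highs for origin, transit, destination."""
    return all(SAFE_LOW_F <= t <= SAFE_HIGH_F for t in temps_f)

def shippable_days(forecast):
    """forecast: dict mapping date -> list of temps (origin/transit/dest).
    A ship date is available if it and the following day are both safe
    and the ship date is not a Friday, Saturday, or Sunday."""
    available = []
    for d, temps in forecast.items():
        next_day = d + timedelta(days=1)
        if d.weekday() in NO_SHIP_WEEKDAYS:
            continue
        if next_day not in forecast:
            continue
        if day_is_safe(temps) and day_is_safe(forecast[next_day]):
            available.append(d)
    return available

# Example with made-up numbers: a flat, safe 5-day forecast.
today = date.today()
demo_forecast = {today + timedelta(days=i): [55, 72, 60] for i in range(5)}
print(shippable_days(demo_forecast))
```
The checkout page would then render the available dates in green and everything else in red, letting the customer pick a day when they can be home the next morning.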
I'm new to Google AutoML Tables and have a basic question about which data is worth including in the training of my model.
I have a dataset of golfers and will be looking at the averages of scores over different periods. For example, the average over the past 3 months, 6 months, 1 year, etc.
My question is: is it worthwhile also including the sample size for each date range for each player? For example, over the past 3 months, some players will have a sample size of 28 while some will only have 2. Those players with 28 rounds will have more accurate averages than those with 2. However, I didn't know whether Google AutoML Tables would pick up this link automatically, whether I could create a separate weighting/reliability variable, or whether there's a way to specify a link between columns. Or is this automated type of AutoML not really suitable for this?
Thanks in advance
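Not a full answer, but one common approach is to precompute both the average and the number of rounds behind it as separate feature columns, so the model has a chance to learn that small-sample averages are noisier. A rough pandas sketch of that preprocessing, with made-up file and column names and an assumed prediction date:
```python
# Sketch: precompute the average score and the sample size (round count)
# per look-back window, so each becomes its own column in the AutoML
# Tables training data. File and column names are hypothetical.
import pandas as pd

rounds = pd.read_csv("golf_rounds.csv", parse_dates=["round_date"])
cutoff = pd.Timestamp("2020-01-01")  # assumed prediction date

features = []
for months in (3, 6, 12):
    window = rounds[
        (rounds["round_date"] >= cutoff - pd.DateOffset(months=months))
        & (rounds["round_date"] < cutoff)
    ]
    agg = window.groupby("player_id")["score"].agg(
        **{f"avg_score_{months}m": "mean", f"n_rounds_{months}m": "count"}
    )
    features.append(agg)

training_table = pd.concat(features, axis=1).reset_index()
print(training_table.head())
```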
I am looking into utilising Power BI to identify time saved due to various projects. People will add the projects to a SharePoint list, which then feeds into Power BI.
PROJECTS table:
Project Title, Desc, Hours/Month Saved, StartDate, EndDate, Repeat? (T/F)
[Some projects only save a fixed 10 or so hours; others save time per month (indicated by the Repeat column)]
I've created two measures: RUNTIME, determining how long the project has run in months ((TodayDate - StartDate)/30), and TIMESAVED, which is the total hours saved from that specific project (RUNTIME * Hours/Month Saved).
Whilst this works, it has a pretty big limitation. When selecting a range, say 01/01/2017 - 01/01/2018, any projects with a start date before that range are excluded. However, these may be ongoing, meaning the time saved by these projects during the range needs to be added.
I've attempted to find a solution to this, but I keep getting stuck at needing the filter dates from the slicer, and I'm not certain that is possible. I need those projects with ongoing savings to have their savings during the given period counted as well.
A possible alternative may be to create a column per Month/Year with a custom formula to determine that project's hours saved for that Month/Year; however, this seems inefficient, and at that point going back to Excel might be better.
Any ideas / suggestions would be greatly appreciated; I keep running through ideas but coming back to needing the value specified by the filter. Cheers in advance for any advice tackling this :)
See also: https://community.powerbi.com/t5/Desktop/Re-occuring-Savings-over-Time-with-Time-Date-Slicer/m-p/346100
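For reference, the core arithmetic any measure would need is the overlap between each project's date span and the slicer range, converted to months and multiplied by Hours/Month Saved. Here is a small Python sketch of just that logic, purely to pin down the calculation before translating it into a DAX measure; it reuses the ~30-day month approximation from the RUNTIME measure above, and the fixed-hours (non-repeating) projects would be handled separately.
```python
# Sketch of the overlap arithmetic a measure would need: hours saved by a
# project *within* the slicer range, even if the project started earlier.
from datetime import date

def hours_saved_in_range(start, end, hours_per_month, range_start, range_end):
    """Overlap the project's [start, end] span with the slicer range and
    convert the overlapping days into months of savings (~30-day months)."""
    overlap_start = max(start, range_start)
    overlap_end = min(end or range_end, range_end)  # end=None means ongoing
    overlap_days = (overlap_end - overlap_start).days
    if overlap_days <= 0:
        return 0.0
    return (overlap_days / 30) * hours_per_month

# Project started before the slicer range but is still ongoing:
print(hours_saved_in_range(date(2016, 6, 1), None, 10,
                           date(2017, 1, 1), date(2018, 1, 1)))
```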
Unfortunately, there is no simple out-of-the-box solution to this problem in Power BI at present. All of the slicers seem to handle dates as a single point in time. They suffer in that, if you are dealing with items that span a start and end date (like your projects, and most of my data examples), they only take one of the dates as the input. In our case, the slicers would need to accept an optional end date and then perform simple date-span overlap logic to determine the items that match.
I tried to solve your problem with the out-of-the-box Power BI Desktop slicers and a custom Timeline Slicer visual I found in the store, with no luck, earlier this month. Out of frustration, I posted a question in the Power BI forums for suggestions.
The final suggestion I got from the forums was to "use two Filters at the Filter pane", but I am not satisfied with that answer.
The Timeline Slicer code is open source and when I get more time (ha ha), I would like to make this change to the Timeline Slicer and publish it back to the repository for everyone to use.
I will monitor this question and the forum to see if a solution emerges in the future.
You can use Timeline Storyteller. You can create your timeline and add a couple of slicers for Start and End. It will split the dates by day and you won't miss any data.
I am currently doing a customer segmentation project in SAS.
I have identified 2700 customers who have made a purchase in each of the 4 years I am analysing. For the cluster analysis, the more purchases per customer each year, the better the data quality. However, as I become more selective over the number of purchases needed per customer each year, fewer customers can be considered in the cluster analysis.
How should I go about choosing the cutoff point for the number of purchases per customer per year necessary to be considered for analysis? I am struggling with this trade-off between data quality and having enough customers for the analysis.
Thanks a lot! :)
There is no correct way. It entirely depends on your data.
Clustering such data is "magic" and the results tend to be anything but statistically sound; more like random guesses.
Because of this, always try multiple parameters and carefully inspect the results. No equation will ever tell you what a good clustering is.
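One concrete way to act on "try multiple parameters" is to tabulate, before committing to a cutoff, how many customers survive each candidate threshold. A quick sketch of that tabulation (in Python rather than SAS, just to show the idea; file and column names are made up):
```python
# How many customers remain if we require at least k purchases in *every*
# year analysed? File and column names are made up for illustration.
import pandas as pd

purchases = pd.read_csv("purchases.csv")  # one row per purchase, with year

# Purchases per customer per year; missing years count as zero.
counts = (
    purchases.groupby(["customer_id", "year"]).size().unstack(fill_value=0)
)
min_per_year = counts.min(axis=1)  # weakest year for each customer

for k in range(1, 11):
    n_customers = (min_per_year >= k).sum()
    print(f"cutoff >= {k} purchases/year: {n_customers} customers remain")
```
Seeing the drop-off curve makes the data-quality vs. sample-size trade-off explicit, and you can rerun the clustering at two or three plausible cutoffs to check how stable the segments are.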
I am currently developing a sentiment index using Google search frequencies taken from Google Trends.
I am using Stata 12 on Windows.
My approach is as follows:
I downloaded approximately 150 business-related search queries from Google Trends, covering Jan 2004 to Dec 2013.
I now want to construct an index using the 30 queries that are, at that point in time, most relevant to the market I observe.
To achieve that, I want to use monthly expanding, backward-looking rolling regressions of each query on the market.
Thus I need to regress the 150 items one-by-one on the market 120 times (12 months x 10 years), using different time windows, and then extract the 30 queries with the most negative t-statistics.
To exemplify the procedure: if I wanted to construct the sentiment for January 2010, I would regress the query terms on the market over the period from Jan 2004 to December 2009 and then extract the 30 queries with the most negative t-statistics.
Now I am looking for a way to make this as automated as possible. I guess I should be able to run the 150 items at once, and I can specify the time window using the time stamps. Using Excel commands and creating a do-file with all the regression commands in it (which would be quite large), I could probably create the regressions relatively efficiently (although it depends on how much Stata can handle - any experience with that?).
What I would need to make the data extraction much easier is a command I can use to rank the results of the regressions according to their t-statistics. Does anyone have an efficient approach to this, or general advice?
If you are using Stata, once you run a ttest you can type return list and you will get the scalars that Stata stores. Once you run a loop, you can store these values in a number of different ways. Check out the post command.
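To make the loop-and-rank part concrete, here is a rough sketch of the expanding-window procedure from the question in Python (pandas + statsmodels), purely to illustrate the workflow; in Stata the same pattern is a loop around regress, collecting the stored coefficients and standard errors with postfile/post and then sorting. The file and variable names below are made up.
```python
# Sketch of the expanding-window regressions and t-statistic ranking.
# Data layout assumption: one monthly row with a market return column and
# one column per search query (names are hypothetical).
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("trends_and_market.csv", parse_dates=["month"],
                   index_col="month")  # sample starts Jan 2004
queries = [c for c in data.columns if c != "market_return"]

def rank_queries(as_of):
    """For each query, regress the query on the market over all months up
    to (and including) the month before `as_of`, and return the 30 queries
    whose market coefficient has the most negative t-statistic."""
    window = data.loc[: as_of - pd.DateOffset(months=1)]
    X = sm.add_constant(window[["market_return"]])
    tstats = {}
    for q in queries:
        res = sm.OLS(window[q], X, missing="drop").fit()
        tstats[q] = res.tvalues["market_return"]
    return pd.Series(tstats).nsmallest(30)

# Example: the January 2010 index uses Jan 2004 .. Dec 2009.
print(rank_queries(pd.Timestamp("2010-01-01")))
```
Looping `rank_queries` over the 120 month-ends gives the full set of rankings; the Stata equivalent would post one row per (month, query, t-stat) and sort within each month.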