Geocoding for full addresses - SAS

I have more than two million records in my dataset. The addresses are given in full, not parsed out into fields such as address number, street, city and state. There is no standardized pattern in the way these addresses are formed, and since there are two million records, I can't fully investigate the whole dataset.
Nor can I change anything in the address field, as I pulled the dataset from my company's database.
I want to turn the addresses into longitudes and latitudes, but the SAS procedure requires the addresses to be parsed into smaller fields and, as I mentioned above, it isn't practical for me to do so.
http://support.sas.com/documentation/cdl/en/graphref/63022/HTML/default/viewer.htm#overview-geocode.htm
My company is in the financial sector, so due to security measures I can't install third-party software or applications; I need to do this within SAS Enterprise. If you have any suggestions, they'd be greatly appreciated.
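Not from the original post, but one possible fallback, sketched under the assumption that the free-form strings at least end in a US ZIP code: pull the ZIP out with a regular expression and geocode at ZIP-centroid level, which PROC GEOCODE supports without any street-level parsing. The dataset and variable names below (work.raw, full_address) are placeholders.

/* Sketch only: ZIP-centroid geocoding when full parsing is not feasible.  */
/* Assumes each address ends in a 5-digit ZIP (optionally ZIP+4); dataset  */
/* and variable names are placeholders, not from the original post.        */
data work.addr_zip;
    set work.raw;
    length zip $5;
    if _n_ = 1 then re = prxparse('/(\d{5})(-\d{4})?\s*$/');
    retain re;
    drop re;
    if prxmatch(re, strip(full_address)) then
        zip = prxposn(re, 1, strip(full_address));   /* trailing ZIP, if any */
run;

proc geocode
    method=zip                    /* ZIP-centroid level, no street parsing   */
    data=work.addr_zip
    out=work.geocoded             /* adds X (longitude) and Y (latitude)     */
    lookup=sashelp.zipcode;       /* ZIP lookup data shipped with SAS        */
run;

This only gives ZIP-level accuracy, and rows where no trailing ZIP is found will not geocode, but it avoids parsing the full address strings.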

Related

Best way to organize data from multiple companies in Power BI

Basically, I have a big Excel dataset, about 500x500, with economic information from various companies.
Each row represents a different company and the columns hold the information. A little of it is qualitative, like ZIP code, type, etc., but most of it is quantitative. For each quantitative measure we have data for 5 years, so there is one column per measure per year, i.e. Debt 2019, Debt 2020, etc.
So my question is: what is the best way to preprocess this data to work with it, and how should it be done? Either doing the preprocessing in Excel, running a script in Power BI, using Power Query, SQL, ...
The objective is to have a report that will be accessible online, where the user types the name of a company and is shown the dashboard with the information for that company (only that one), so they can navigate through it.
The structure and the information shown are the same for each company; the only thing that changes is the "numbers" each company has. So it has to be possible to change which data is displayed (to use the data from the company the user wants).
It also needs to be able to show comparative data against other groups of companies or against the total.
I want to get this right from the start, because changes get complicated later.
I thought about doing a sort of "relational model": one "table" per company with the quantitative data (one row per year and one column per data point), and then a general table with the qualitative data (one row per company and one column per attribute). But I am not really sure.
I know how to use Power BI, but I have never used it for something this big. I would like to know which way of organizing this data is better, and some guidance on how to do it.
Many thanks to everyone.
I thought about doing a sort of "relational model": one "table" per company with the quantitative data (one row per year and one column per data point), and then a general table with the qualitative data (one row per company and one column per attribute).
Yes, do that.
General guidance is to use Power Query in Power BI to transform the data into a star schema model. See "Understand star schema and the importance for Power BI".
That would typically result in one table holding the "dimension" data for each company, a date table, and a "fact" table at the grain of (CompanyId, Date) with the quantitative data.
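In Power BI itself the reshaping step is Power Query's "Unpivot Columns" transform; purely as a conceptual sketch of the wide-to-long reshape into that (CompanyId, Year) fact grain, here is the same idea expressed as a SAS transpose. Column names such as Debt2019 and Cost2019 are placeholders, not from the question.

/* Conceptual sketch only: melt one-column-per-year measures into a long  */
/* fact table keyed by company and year. Names are placeholders.          */
proc sort data=companies_wide; by CompanyId; run;

proc transpose data=companies_wide
               out=fact_long(rename=(_name_=measure col1=value));
    by CompanyId;
    var Debt2019 Debt2020 Cost2019 Cost2020;
run;

data fact;                          /* split "Debt2019" into measure + year */
    set fact_long;
    length measure_name $32;
    year         = input(substr(measure, length(measure) - 3), 4.);
    measure_name = substr(measure, 1, length(measure) - 4);
run;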

BI - How to design a data model to compare order price vs cost in a star schema

I'm struggling trying to build a star schema from a set of tables with different origins (two SQL databases, Excel files and CSV reports); it's a bit of a puzzle.
The initial tables that they provide me are laid out like this:
The important points of this set of tables are:
In the Products table, IdProduct is not unique, because one product can be made with one type of machine in factory A and with another type of machine in factory B, so there is one row for every Factory/Machine/IdProduct combination.
The OrderItems table has mixed rows with materials and products, so you have all the products in the Order and all the materials used in each product of the same Order.
The cost of the material changes daily and is updated in the system from where I get the OrderItems table.
The delivery cost is different for each order.
The packaging and fixed costs are updated once a week.
The product price changes from order to order (it is set taking into account the client, day and size of the order).
I got to this model by splitting OrderItems into products and costs (materials) and joining the fixed costs and the packaging costs onto them. I haven't joined the delivery costs, but I end up with two fact tables and a snowflake schema:
I am thinking of Region, Factory, Machine, Date, Product, and a set of cost concepts (materials, fixed costs, etc.) as dimensions, and the total amounts and quantities as facts, so that I can compare total sales against total costs across the different dimensions.
I just want to know whether this is the correct path or whether there is a better way. I tried to search for more on the subject, but the case is too specific, so I found nothing.
Thanks in advance for your answers.
Choosing a star schema over a snowflake schema?
The star schema is in a more denormalized form and can be better for performance. Along the same lines, the star schema uses fewer foreign keys, so query execution time is reduced. In almost all cases the data retrieval speed of a star schema beats that of a snowflake schema.
But you can also split your work into data marts:
A data mart is a simple form of data warehouse focused on a single subject or line of business.
A data mart can contain star schemas and other tables for more than one warehouse pack. For example, a single data mart might contain the data for your reporting needs related to costs.
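As a rough sketch of the single-star layout this points toward (all table and column names here are invented for illustration, not taken from the question), the fact table sits at the order-item grain and carries the sales amount next to the cost components, keyed to shared dimensions:

/* Illustrative star layout only; all names are invented for this sketch. */
proc sql;
    create table dim_product
        (ProductKey num, IdProduct char(20), Factory char(20),
         Machine char(20), Region char(20));
    create table dim_date
        (DateKey num, CalendarDate num format=date9., YearQuarter char(6));
    create table fact_order_item
        (OrderId       char(20),
         ProductKey    num,      /* foreign key to dim_product            */
         DateKey       num,      /* foreign key to dim_date               */
         Quantity      num,
         SalesAmount   num,      /* price side                            */
         MaterialCost  num,      /* cost side: one column per cost        */
         PackagingCost num,      /* concept, so price and cost can be     */
         FixedCost     num,      /* compared across the same dimensions   */
         DeliveryCost  num);
quit;

Whether the cost concepts become columns (as above) or a small cost-type dimension with one row per cost line is a judgment call; keeping them as columns keeps everything in a single fact table.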

SAS EG - Individual datasets split by date vs. a single appended dataset containing all dates

This is mainly a question about efficiency, as I'm unfamiliar with how SAS processes datasets. A lot of code that I run reads from multiple datasets with consecutive dates (whether this is consecutive months/quarters/years depends on the datasets).
At the moment, the code requires manual updates each time it's run to ensure it's picking up the correct dates, so I would have something such as:
Data Quarters;
    Set XYZ_201803
        XYZ_201806
        ...
        ...
        XYZ_202006;
Run;
To help tidy up the code and make it a bit less tedious, I've considered a few different ideas and had a few sent my way, and one of the big ones is to store all of the XYZ_YYYYMM datasets as a single appended dataset, so they can be read with a simple filter on the date, as below:
Data Quarters;
Set AppendedData;
Where Date > 201812;
Run;
Which of these two options is more efficient as far as computation goes? For datasets that are typically a couple of GB in size, which would you recommend? What other pros and cons come with each idea?
Thanks for any input. :)
Most likely a single dataset and several separate datasets will be similar from a performance standpoint; there is some small overhead opening new datasets, but as long as it's not thousands of them you probably won't notice a difference.
There will be a performance hit with a single dataset, both in creating it and in using it, if you usually use only small sections of it. Separate datasets are typically found where people analyse individual quarters and rarely combine them.
Finally, if the datasets can vary from quarter to quarter in their contents (if the formats could change, if the fields can change), then keeping them separate is easier in some ways than having to manage the changes between the different periods.
That said, there's a huge organizational benefit to a single dataset, and all of the above issues can be dealt with. Think of SAS datasets as large SQL tables - they are effectively the same, and the same things that help SQL tables help SAS. Proper sizing of columns, proper sorting of the stored data, and appropriate indexing are all important. If you have a database team at your place of work, they may be able to help construct an ideal table plan. Files of several GB can definitely benefit from indexing and proper sorting, to allow users to easily get at the bits they need.
If you were to stay with separate datasets, you can use the macro language to make sure you're reading in the right datasets, assuming they're named in a consistent fashion. That might be the ideal solution if there are other reasons to stay separate - then no changes are needed each quarter.
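For example, a minimal sketch of that macro approach, assuming the XYZ_YYYYMM quarterly naming from the question; the macro name and its parameters are invented here:

/* Sketch only: build the SET list for quarterly datasets named XYZ_YYYYMM. */
/* The macro name and its parameters are illustrative.                      */
%macro read_quarters(start=201803, end=202006);
    %local d end_d;
    %let d     = %sysfunc(inputn(&start.01, yymmdd8.));     /* first of month  */
    %let end_d = %sysfunc(inputn(&end.01,   yymmdd8.));
    data Quarters;
        set
        %do %while (&d <= &end_d);
            XYZ_%sysfunc(putn(&d, yymmn6.))                  /* e.g. XYZ_201803 */
            %let d = %sysfunc(intnx(qtr, &d, 1, s));         /* next quarter    */
        %end;
        ;
    run;
%mend read_quarters;

%read_quarters(start=201812, end=202006)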
Points of interest:
From a coding standpoint
Dealing with a single, stacked data set created by appending the quarterly data sets is more efficient.
From a resource standpoint
You have to make sure you have a large enough disk to hold the single large table.
Keep additional offline storage to hold the original pieces -- no need to clutter up the primary data disk with all the pieces.
A 2 TB SSD is very fast, remarkably cheap, and low power, and can hold a table made up of quite a few "couple GB" pieces.
Spinning disk has a lower $/TB and more capacity, but I/O will be slower and consume more power.
To further improve query performance, you will want to index the variables most commonly used in BY, CLASS, and WHERE statements (see the sketch after these points).
"... simple filter ..." is part of "Keep It Simple, S****" (KISS).

How to estimate monthly/daily sales of an item using the Amazon Advertising API

I have looked into the responses of the "ItemSearch()" and "lookUp()" functions in the Amazon Advertising API and could not find a possible way to get the daily/monthly sales of an item.
Popular product research software like JungleScout, ProfitPhonix, AMZ Tracker, etc. does display a number of monthly sales, but all of them show different results.
Does Amazon provide this information? If not, then how is the above software estimating it?
I think that when they fetch the ASIN information, they store "something" in their DB, and the next time the same ASIN is pulled, the estimated sales are roughly calculated based on the previous value/score in the DB.
Any help will be highly appreciated.
Thanks
This is not a solution, but here is a reply from UnicornSmasher that I found; it may help save time searching for something that doesn't exist.
constantine We just took all of the bulk data from the products that are being tracked in AMZ Tracker and applied a formula to it all. If you have specific products that are way off please let us know! Certain categories we had less data on. This is version 1 of the research tool, so I'm sure it will continue to improve quickly over time.
Here is the link to the question and answer:
amz forum
So, now, the question is 'What formula do they use?'
Let me know if you come up with an idea :)
Let me tell you first that if you're not part of the Amazon data team, you can't get the sales numbers of any product. And it's probably not easy to estimate sales using the Amazon Advertising API; you need to constantly track a huge number of products to estimate sales. Here I can explain how AMZ Insight, an Amazon tracking tool, estimates the sales of any product.
They constantly track a few thousand products across all the categories and collect a massive amount of data. Then their in-house data scientists analyze the data to form the sales-estimating algorithm. The relationships between the multiple data points plot as a scattered graph, which of course means the sales estimates are not 100 percent right.
Data is continuously gathered and analyzed by tracking the Best Seller Rank (BSR), Buy Box, reviews and other factors. The relationship between these data points and unit sales is then modeled, and once that relationship is in place it is much easier to estimate monthly sales and revenue for the product.
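The actual formulas these tools use aren't public, so purely as a toy illustration of the kind of relationship they might fit (the dataset, variable names and the log-log form below are assumptions, not AMZ Insight's method), one could calibrate rank against known sales and then apply the fit to newly tracked ranks:

/* Toy illustration only -- not any vendor's actual formula. Assumes a    */
/* calibration sample where unit sales are known (e.g. your own listings) */
/* alongside the tracked Best Seller Rank.                                */
data calib;
    set tracked_sample;            /* hypothetical input dataset           */
    log_sales = log(unit_sales);
    log_bsr   = log(bsr);
run;

proc reg data=calib;
    model log_sales = log_bsr;     /* fit sales against rank on log scale  */
run;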

How do I choose an appropriate number of customers for cluster analysis?

I am currently doing a customer segmentation project in SAS.
I have identified 2,700 customers who have made a purchase in each of the 4 years I am analysing. For the cluster analysis, the more purchases per customer each year, the better the data quality. However, the more selective I am about the number of purchases required per customer per year, the fewer customers can be considered in the cluster analysis.
How should I go about choosing the cutoff point for the number of purchases necessary per customer per year to be considered for the analysis? I am struggling with this trade-off between data quality and having enough customers for the analysis.
Thanks a lot! :)
There is no correct way. It entirely depends on your data.
Clustering such data is "magic" and the results tend to be anything but statistically sound; more like random guesses.
Because of this, always try multiple parameters and carefully inspect the results. No equation will ever tell you what a good clustering is.
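For instance, a minimal sketch of "try multiple parameters and inspect" in SAS, assuming the per-customer variables are already built; the variable names and the range of cluster counts are placeholders:

/* Sketch: run k-means for several cluster counts and keep the summaries  */
/* so the solutions can be compared side by side. Names are examples.     */
%macro try_clusters(data=customers, kmin=2, kmax=8);
    %local k;
    %do k = &kmin %to &kmax;
        proc fastclus data=&data maxclusters=&k
                      out=clus_&k outstat=stat_&k;
            var purchases_per_year recency monetary;
        run;
    %end;
%mend try_clusters;

%try_clusters(data=customers, kmin=2, kmax=8)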