Extracting dollar amounts from existing SQL data? - regex

I have a field that contains a mix of descriptions and dollar amounts. With T-SQL, I would like to extract those dollar amounts and insert them into a new field for each record.
-- UPDATE --
Some data samples could be:
Used knife set for sale $200.00 or best offer.
$4,500 Persian rug for sale.
Today only, $100 rebate.
Five items for sale: $20 Motorola phone car charger, $150 PS2, $50.00 3 foot high shelf.
In the set above I was thinking of just grabbing the first occurrence of the dollar figure... that is the simplest.
I'm not trying to remove the amounts from the original text, just get their value, and add them to a new field.
The amounts may or may not contain decimals and commas.
I'm sure PATINDEX won't cut it, and I don't need an extremely complex RegEx function to accomplish this.
However, the OLE Regex Find (Execute) function described here appears to be the most robust; when trying to use it, though, I get the following error message in SSMS:
SQL Server blocked access to procedure 'sys.sp_OACreate' of component
'Ole Automation Procedures' because this component is turned off as
part of the security configuration for this server. A system
administrator can enable the use of 'Ole Automation Procedures' by
using sp_configure. For more information about enabling 'Ole
Automation Procedures', see "Surface Area Configuration" in SQL Server
Books Online.
I don't want to go changing my server settings just for this function. I have another regex function that works just fine without changes.
I can't imagine it being that complicated just to extract dollar amounts. Any simpler ways?
Thanks.

CREATE FUNCTION dbo.fnGetAmounts(@str nvarchar(max))
RETURNS TABLE
AS
RETURN
(
    -- generate all possible starting positions (1 to len(@str))
    WITH StartingPositions AS
    (
        SELECT 1 AS Position
        UNION ALL
        SELECT Position + 1
        FROM StartingPositions
        WHERE Position <= LEN(@str)
    )
    -- generate possible lengths
    , Lengths AS
    (
        SELECT 1 AS [Length]
        UNION ALL
        SELECT [Length] + 1
        FROM Lengths
        WHERE [Length] <= 15
    )
    -- a Cartesian product between StartingPositions and Lengths;
    -- if the substring is numeric, keep it
    , PossibleCombinations AS
    (
        SELECT CASE
                 WHEN ISNUMERIC(SUBSTRING(@str, sp.Position, l.[Length])) = 1
                 THEN SUBSTRING(@str, sp.Position, l.[Length])
                 ELSE NULL
               END AS Number
             , sp.Position
             , l.[Length]
        FROM StartingPositions sp, Lengths l
        WHERE sp.Position <= LEN(@str)
    )
    -- keep only the numbers that start with a dollar sign,
    -- group by starting position and take the maximum value
    -- (i.e. from $, $2, $20, $200 etc.)
    SELECT MAX(CONVERT(money, Number)) AS Amount
    FROM PossibleCombinations
    WHERE Number LIKE '$%'
    GROUP BY Position
)
GO
declare @str nvarchar(max) = 'Used knife set for sale $200.00 or best offer.
$4,500 Persian rug for sale.
Today only, $100 rebate.
Five items for sale: $20 Motorola phone car charger, $150 PS2, $50.00 3 foot high shelf.'
SELECT *
FROM dbo.fnGetAmounts(@str)
OPTION (MAXRECURSION 32767) -- the MAXRECURSION option is required in any SELECT that uses this function

This link should help.
http://blogs.lessthandot.com/index.php/DataMgmt/DataDesign/extracting-numbers-with-sql-server
This assumes you are OK with extracting the numerics regardless of whether or not there is a $ sign. If the $ sign is a strict requirement, some modifications will be needed.
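If only the first dollar figure per record is needed (which the question says is the simplest acceptable outcome), a lighter sketch using PATINDEX may suffice, despite the question's doubts about it. This is illustrative only; TRY_CONVERT assumes SQL Server 2012 or later:

-- sketch: grab the FIRST dollar amount in a string with PATINDEX (no regex)
DECLARE @str nvarchar(max) = 'Used knife set for sale $200.00 or best offer.';
DECLARE @start int = PATINDEX('%$[0-9]%', @str);
IF @start > 0
BEGIN
    -- keep characters from the $ up to the first one that cannot belong
    -- to a money literal (anything other than $, digits, commas, points)
    DECLARE @tail nvarchar(max) = SUBSTRING(@str, @start, LEN(@str));
    DECLARE @len int = PATINDEX('%[^$0-9,.]%', @tail + ' ') - 1;
    SELECT TRY_CONVERT(money, LEFT(@tail, @len)) AS Amount; -- NULL if not convertible
END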

Related

Implementing a calculated field within my Tableau Viz

I have data within Tableau for which I wish to show a breakdown of USED and FREE storage. However, I need to first filter a specific column to perform two different types of calculations. Here is the data:
Total Free SKU
10 5 A
20 1 A
5 4 B
2 0 B
10 5 C
10 6 D
I want to show a bar chart that displays the available, used, and total within Tableau. However, I need to first filter by SKU.
I created this calculated field:
Used = Total - Free
as well as this one:
IF CONTAINS(ATTR([SKU]),'A') or
CONTAINS(ATTR([SKU]),'D')
THEN SUM([Total])
ELSEIF CONTAINS(ATTR([SKU]),'B') or
CONTAINS(ATTR([SKU]),'C')
THEN AVG([Total])
END
This is what I have done so far, but I am not sure how to incorporate the calculated field within the viz.
Any suggestion is appreciated.
If I understand your problem correctly, proceed like this.
Situation 1: you want to work at the SKU group level.
Create one calculated field each for Total/Used/Free, as:
SUM(ZN(IF CONTAINS([SKU], 'A') OR CONTAINS([SKU], 'D')
THEN [Total] END))
+
AVG(ZN(IF CONTAINS([SKU], 'B') OR CONTAINS([SKU], 'C')
THEN [Total] END))
Needless to say, replace [Total] with [Used] or [Free] as applicable.
Situation 2: you want to work at a higher level of detail instead. In this case you need to decide what to do with each SKU group. Let's assume you want to add them; then creating similar fields will do. Otherwise, replace the + in a separate field with your desired operator, as in the sketch below.
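For instance, the matching Used field might look like this (a sketch following the same pattern; it assumes [Total] and [Free] are numeric measures and Used = Total - Free, per the question):
SUM(ZN(IF CONTAINS([SKU], 'A') OR CONTAINS([SKU], 'D')
THEN [Total] - [Free] END))
+
AVG(ZN(IF CONTAINS([SKU], 'B') OR CONTAINS([SKU], 'C')
THEN [Total] - [Free] END))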
Good luck!

Add custom column based on string in another column

Source data:
Market Platform Web sales $ Mobile sales $ Insured
FR iPhone 1323 8709 Y
IT iPad 12434 7657 N
FR android 234 2352355 N
IT android 12323 23434 Y
Is there a way to evaluate the sales of devices that are insured?
if List.Contains({"iPhone","iPad","iPod"},[Platform]) and ([Insured]="Y") then [Mobile sales] else "error"
Something to that extent; I'm just not sure how to approach it.
A direct answer to your question is
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    SumUpSales = Table.AddColumn(Source, "Sales of insured devices",
        each if List.Contains({"iPhone","iPad","iPod"}, _[Platform]) and Text.Upper(_[Insured]) = "Y"
             then _[#"Mobile sales $"] else null,
        type number)
in
    SumUpSales
However, I would like to stress a few things.
First, it's better to convert the values in the [Insured] column to boolean first. That way you can catch errors before they corrupt your data without you noticing. My example doesn't do that; all it does is normalize the letter case of [Insured], since Power Query M is a case-sensitive language.
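A sketch of that conversion (Table.TransformColumns is the standard library function; the step name is made up):
// hypothetical step: turn "Y"/"N" text into true/false before any logic uses it
InsuredAsBool = Table.TransformColumns(Source,
    {{"Insured", each Text.Upper(_) = "Y", type logical}})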
Second, you'd better use null rather than the text value error. Then you can set the column type and do some math with its values, such as summing them up. With mixed text and number values you would get an error in this and many other cases.
And last: it is probably better to use a pivot table for visualizing data like this. You just need to add a column which groups all Apple (and/or other) devices together based on the same logic, but excluding [Insured]. Pivot tables are more flexible, and I personally like them very much.

HiveQL: Parse strings and count

I am using HiveQL to work with millions of rows of domain name text data stored in HDFS. The following is a hand-selected subset to illustrate lexical diversity. There are duplicate entries.
dnsvm.mgmtsubnet.mgmtvcn.oraclevcn.com.
mgmtsubnet.mgmtvcn.oraclevcn.com.
asdf.mgmtvcn.oraclevcn.com.
dnsvm.mgmtsubnet.mgmtvcn.oraclevcn.com.
localhost.
a.localhost.
img.pulsemgr.com.
36.136.154.156.in-addr.arpa.
accounts.spotify.com.
_dmarc.ixia-devops.com.
&eventtype=close&reason=4&duration=35.
&eventtype=close&reason=3&duration=10336.
I am trying to get a count of the number of rows based on the last two levels of the domain, where sometimes the 2nd level is absent (e.g. localhost.). For example:
domain_root count
oraclevcn.com. 4
localhost. 1
a.localhost. 1
pulsemgr.com. 1
in-addr.arpa. 1
spotify.com. 1
ixia-devops.com 1
It would be nice to also see how to filter out domains where the 2nd level is absent.
I am not sure where to start. I have seen use of the SPLIT() function, but that may not be robust, since there could be many levels to a domain name, for example: a.b.c.d.e.f.g.h.i etc.
Any ideas or implementations are appreciated.
Below is a query using regexp_extract:
select domain_root, count(*)
from (
    select regexp_extract('dnsvm.mgmtsubnet.mgmtvcn.oraclevcn.com.', '[A-Za-z0-9-]+\.[A-Za-z0-9-]+\.$', 0) as domain_root
    from table -- replace the first argument of regexp_extract with your column name
) A
group by A.domain_root
The regex extracts the domain root, allowing alphanumeric characters and the special character '-'.
Hope this helps.
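To additionally drop the rows where the 2nd level is absent (e.g. localhost.), one option is to keep only non-empty extracts, since regexp_extract returns an empty string when nothing matches. Table and column names below are placeholders:
select domain_root, count(*) as cnt
from (
    select regexp_extract(domain_col, '[A-Za-z0-9-]+\.[A-Za-z0-9-]+\.$', 0) as domain_root
    from my_table
) A
where A.domain_root != ''
group by A.domain_root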

Data validation using regular expressions in Google Sheets

I am using the below date/time format in gSheets:
01 Apr at 11:00
I wonder whether it is possible to use Data Validation (or any other function) to report an error (add the small red triangle to the corner of the cell) when the format differs in any way.
Possible values in the given format:
01 -> any number between 01-31 (but not "1", there must be the leading zero)
space
Apr -> 3 letters for month (Jan, Feb, Mar... Dec)
space
at
space
11 -> hours in 24h format (00, 01...23)
:
00 -> minutes (00, 01,...59)
Is there any way to validate that the cell contains "text/data" exactly in the above mentioned format?
The right way to do this is using a regular expression with the regexmatch() function in Google Sheets. For the given example, I made the regular expression below:
[0-3][0-9] (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) at [0-2][0-9]\:[0-5][0-9]
Process:
Select range of cells to be validated
Go to Data > Data Validation
Under Criteria select "Own pattern is" (the label in the English UI is "Custom formula is")
Paste: =regexmatch(to_text(K4); "[0-3][0-9] (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) at [0-2][0-9]\:[0-5][0-9]")
Make sure that instead of K4 in "to_text(K4)" you reference the upper-left cell of the selected range
Save
Hope it helps someone :)
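A tighter pattern is also possible. The sketch below anchors the match and bounds the day to 01-31 and the hour to 00-23; it still accepts impossible dates such as 31 Feb:
=regexmatch(to_text(K4); "^(0[1-9]|[12][0-9]|3[01]) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) at ([01][0-9]|2[0-3]):[0-5][0-9]$")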
You may try this formula for data validation:
=not(iserror(SUBSTITUTE(A1," at","")*1))*(len(A1)=15)*(right(A1,2)*1<61)
not(iserror(SUBSTITUTE(A1," at","")*1)) checks that the whole entry is a legal date
(len(A1)=15) checks that dates are entered with 2 digits
(right(A1,2)*1<61) checks for too many minutes; for some reason 01 Apr at 11:99 is otherwise treated as a legal date.
Select the range of cells where you need the data validation to occur.
Go to Data -> Data validation
For "Criteria" select "Custom formula is"
Enter the following in the textfield next to "Custom formula is":
=regexmatch(Tablename!B2; "^[a-z_]*$")
Where as "Tablename" should be replaced by the table name and "B2" should be replaced by the first cell of the range.
Inside the "" you enter then your regex-expression. Here this would allow only small letters and underscores.
Using the to_text() function additionally didn't work for me. So you should maybe avoid it in order to make sure, that it works.
Press save

Generating rolling z-scores of panel data in Stata

I have an unbalanced panel data set (countries and years). For simplicity let's say I have one variable, x, that I am measuring. The panel data are sorted first by country (a 3-digit numeric country code) and then by year. I would like to write a .do file that generates a new variable, z_x, containing the standardized values of the variable x. The variable should be standardized by subtracting the mean over the preceding (exclusive) m time periods, and then dividing by the standard deviation over those same time periods. If this is not possible, return a missing value.
Currently, the code I am using to accomplish this is the following (edited now for clarity)
xtset weocountrycode year
sort weocountrycode year
local win_len = 5 // Defining rolling window length.
quietly: rolling sd_x=r(sd) mean_x=r(mean), window(`win_len') saving(stats_x, replace): sum x
use stats_x, clear
rename end year
save, replace
use all_data_PROCESSED_FINAL.dta, clear
quietly: merge 1:1 weocountrycode year using stats_x
replace sd_x = . if x[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n] // this and the next line delete values that -rolling- calculates where missing values are wanted instead
replace mean_x = . if x[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n]
gen z_x = (x - mean_x[_n-1])/sd_x[_n-1] // calculate z-score
UPDATE:
My struggle with rolling is that when rolling is set up to use a window length 5 rolling mean, it automatically does window length 1,2,3,4 means for the first, second, third and fourth entries (when there are not 5 preceding entries available to average out). In fact, it does this in general - if the first non-missing value is on entry 5, it will do a length 1 rolling average on entry 5, length 2 rolling average on entry 6, ..... and then finally start doing length 5 moving averages on entry 9. My issue is that I do not want this, so I would like to avoid performing these calculations. Until now, I have only been able to figure out how to delete them after they are done, which is both inefficient and bothersome.
I tried adding an if clause to the -rolling- statement:
quietly: rolling sd_x=r(sd) mean_x=r(mean) if x[_n-`win_len'+1] != . & weocountrycode[_n-`win_len'+1] != weocountrycode[_n], window(`win_len') saving(stats_x, replace): sum x
But it did not fix the problem and the output is "weird" in the sense that
1) If `win_len' is equal to, say, 10, there are 15 missing values in the resulting z_x variable, instead of 9.
2) Even though there are "extra" missing values in z_x, the observations still start out as window length 1 means, then window length 2 means, etc. which makes no sense to me.
Which leads me to believe I fundamentally don't understand 1) what -rolling- is doing and 2) how an if clause works in the context of -rolling-.
Does this help?
Thanks!
I'm not sure I understand completely, but I'll try to answer based on what I think your problem is, and based on a comment by @NickCox.
You say:
... when rolling is set up to use a window length 5 rolling mean...
if the first non-missing value is on entry 5, it will do a length 1 rolling average on entry 5, a length 2 rolling average on entry 6, ...
This is expected. help rolling states:
The window size refers to calendar periods, not the number of observations. If there are missing data (for example, because of weekends), the actual number of observations used by command may be less than window(#).
It's not actually doing a "length 1 rolling average", but I get to that later.
Below some examples to see what rolling does:
clear all
set more off
*-------------------------- example data -----------------------------
set obs 92
gen dat = _n - 1
format dat %tq
egen seq = fill(1 1 1 1 2 2 2 2)
tsset dat
tempfile main
save "`main'"
list in 1/12, separator(4)
*------------------- Example 1. None missing ------------------------
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
*------- Example 2. All but one value, missing in first window ------
use "`main'", clear
replace seq = . in 1/3
list in 1/8
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
*------------- Example 3. All missing in first window --------------
use "`main'", clear
replace seq = . in 1/4
list in 1/8
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
Note I use the stepsize option to make things much easier to follow. Because the date variable is in quarters, I set window(4) and stepsize(4) so rolling is just computing averages by year. I hope that's easy to see.
Example 1 does as expected. No problem here.
Example 2 on the other hand, should be more interesting for you. We've said that what matters are calendar periods, so the mean is computed for the whole year (four quarters), even though it contains missings. There are three missings and one non-missing. summarize is computing the mean over the whole year, but summarize ignores missings, so it just outputs the mean of non-missings, which in this case is just one value.
Example 3 has missings for all four quarters of the year. Therefore, summarize outputs . (missing).
Your problem, as I understand it, is that when you face a situation like Example 2, you'd like the output to be missing. This is where I think Nick Cox's advice comes in. You could try something like:
rolling mean=r(mean) N=r(N), window(4) stepsize(4) clear: summarize seq, detail
replace mean = . if N != 4
list in 1/12, separator(0)
This says: if the number of non-missings for the window (r(N), also computed by summarize), is not the same as the window size, then replace it with missing.
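Transplanted back into the original setup, the same guard might look like this (a sketch only; it reuses the question's variable and file names and is untested against that data):
local win_len = 5
quietly: rolling sd_x=r(sd) mean_x=r(mean) N_x=r(N), window(`win_len') saving(stats_x, replace): summarize x
use stats_x, clear
replace sd_x = . if N_x != `win_len' // blank out windows that are not fully populated
replace mean_x = . if N_x != `win_len'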