HiveQL: Parse strings and count - regex

I am using HiveQL to work with millions of rows of domain name text data stored in HDFS. The following is a hand-selected subset to illustrate lexical diversity. There are duplicate entries.
dnsvm.mgmtsubnet.mgmtvcn.oraclevcn.com.
mgmtsubnet.mgmtvcn.oraclevcn.com.
asdf.mgmtvcn.oraclevcn.com.
dnsvm.mgmtsubnet.mgmtvcn.oraclevcn.com.
localhost.
a.localhost.
img.pulsemgr.com.
36.136.154.156.in-addr.arpa.
accounts.spotify.com.
_dmarc.ixia-devops.com.
&eventtype=close&reason=4&duration=35.
&eventtype=close&reason=3&duration=10336.
I am trying to get a count of # of rows based on the last two levels of the domain, where sometimes the 2nd level is absent (i.e. localhost.). For example:
domain_root count
oraclevcn.com. 4
localhost. 1
a.localhost. 1
pulsemgr.com. 1
in-addr.arpa. 1
spotify.com. 1
ixia-devops.com 1
It would be nice to also see how to filter out domains 2nd level is absent.
I am not sure where to start. I have seen use of the SPLIT() function, but that may not be robust since there could be many levels to a domain name, for example: a.b.c.d.e.f.g.h.i etc.
Any ideas are implementations are appreciated.

Below would be the query with regexp_extract.
select domain_root, count(*) from (select regexp_extract('dnsvm.mgmtsubnet.mgmtvcn.oraclevcn.com.', '[A-Za-z0-9-]+\.[A-Za-z0-9-]+\.$', 0) as domain_root from table) A group by A.domain_root -- replace first argument with column name
regex will extract for domain root with Alphanumeric and special character '-'
hope this helps.

Related

Remove columns by name based on pattern

How can I remove a large number of columns by name based on a pattern?
A data set exported from Jira has a ton of extra columns that I've no interest in. 400 Log entries, 50 Comments, dozens of links or attachments. Problem is that they get random numbers assigned which means that removing them with hardcoded column names will not work. That would look like this and break as the numbers change:
= Table.RemoveColumns(#"Previous Step",{"Watchers", "Watchers_10", "Watchers_11", "Watchers_12", "Watchers_13", "Watchers_14", "Watchers_15", "Watchers_16", "Watchers_17", "Watchers_18", "Watchers_19", "Watchers_20", "Watchers_21", "Watchers_22", "Watchers_23", "Watchers_24", "Watchers_25", "Watchers_26", "Watchers_27", "Watchers_28", "Log Work", "Log Work_29", "Log Work_30", "Log Work_31", "Log Work_32", ...
How can I remove a large number of columns by using a pattern in the name? i.e. remove all "Log Work" columns.
The best way I've found is to use List.FindText on Table.ColumnNames to get a list of column names dynamically based on target string:
= Table.RemoveColumns(#"Previous Step", List.FindText(Table.ColumnNames(#"Previous Step"), "Log Work")
This works by first grabbing the full list of Column Names and keeping only the ones that match the search string. That's then sent to RemoveColumns as normal.
Limitation appears to be that FindText doesn't offer complex pattern matching.
Of course, when you want to remove a lot of different patterns, having individual steps isn't very interesting. A way to combine this is to use List.Combine to join the resulting column names together.
That becomes:
= Table.RemoveColumns(L, List.Combine({ List.FindText(Table.ColumnNames(L), "Watchers_"), List.FindText(Table.ColumnNames(L), "Log Work"), List.FindText(Table.ColumnNames(L), "Comment"), List.FindText(Table.ColumnNames(L), "issue link"), List.FindText(Table.ColumnNames(L), "Attachment")} ))
SO what's actually written there is:
Table.RemoveColumns(PreviousStep, List.Combine({ foundList1, foundlist2, ... }))
Note the { } that signifies a list! You need to use this as List.Combine only accepts a single argument which is itself already a List of lists. And the Combine call is required here.
Also note the L here instead of #"Previous Step". That's used to make the entire thing more readable. Achieved by inserting a step named "L" that just has = #"Promoted Headers".
This allows relatively maintainable removal of multiple columns by name, but it's far from perfect.

Parsing a name from a complex string in Tableau

I have a series of values in Tableau that are long strings intermixed with letters and numbers. I am unable to control the data output, but would like to parse the names from these strings. They follow the following format:
Potato 1TByte 4.5 NFA
Board 256GByte 553 NCA
Launch 4 512GByte 4.5 NFA
Launch 4S 512GByte 4.5 NCA
From each of these, I am attempting to capture the following:
"Potato"
"Board"
"Launch 4"
"Launch 4S"
Each string follows the same format: the name, followed by size, followed by some extra information we don't really care about.
I've tried to put together some text parsing strings, but am coming up short, and am still trying to learn regular expressions.
The Tableau calculated field I was trying to work with was something like the following:
LEFT([String], FIND([String], "Byte") - 2)
The issue is that the text and numbers preceding Byte can be anywhere from 4 to 2 characters and I need a way to identify the length of that.
Any help would be greatly appreciated!
One option which uses a regex replacement:
REGEXP_REPLACE('Launch 4 512GByte 4.5 NFA', ' \d+[A-Z]Byte .*$', '')
This strips off everything from the Byte term to the right, leaving us with only the product name.
You could try the following - this seems to work - Screenshot of Tableau output. Find below the formulas for the various derived columns you see in the screenshot (Your source column is called [Name])
Step1 = LEFT([Name],FIND([Name],"Byte")-1)
Step2 = LEN([Step1])-LEN(REPLACE([Step1]," ",""))
Step3 = FINDNTH([Step1]," ",[Step2])
Step4 = LEFT([Step1],[Step3]-1)
And of course you can nest all these in a single calculated field - kept them as separate columns for easier understanding

Remove all lines does not contain in other file

I have file A 'Emails' with so many email , and file B 'Domain' with so many domain
Example File A 'Emails ':
ctv#ymail.com
kfi#aol.in
hi#axus.cc
0#gmail.com
igp#yahoo.com
encor#mail2.com
cjang#mail.com
vn#gmail.com
87#gmail.com
ee#maoyt.com
Example file B 'Domain'
#gmail.com
#yahoo.com
My expected result :
0#gmail.com
igp#yahoo.com
vn#gmail.com
87#gmail.com
is there a way to do with 2 file in emeditor .Thanks much
I would propose using the Join CSV function. #Abimanyu's regex method may work if you have less than 10 or so domains. More than that, it might take a while to process the data.
To prepare the document for joining, right click on the CSV/Sort toolbar and edit the User-defined separated format to use # as the delimiter.
Now on both file A and file B, change the CSV mode to User-defined separated. On the CSV/Sort toolbar, there is a button called "Join CSV".
Join CSV options:
Make sure the correct documents are selected
Key Column is the email domain columns
In the list at the bottom, select the output columns, which should be column 1 and 2 from file A
Press the Join Now button, change CSV mode to Normal mode and you will get an output which looks like this:
0#gmail.com
igp#yahoo.com
vn#gmail.com
87#gmail.com
May be this will be help to you :
Pattern : .*#gmail.com|.*#yahoo.com
Match groups:
Match 1
1. 0#gmail.com
Match 2
1. igp#yahoo.com
Match 3
1. vn#gmail.com
Match 4
1. 87#gmail.com
https://rubular.com/r/M3MVSoRj6qnSbl

Coldfusion - Checking for all lowercase or uppercase

I have been given the daunting task of sifting through a database of over 30,000 registrants and correcting the letter casing of names and addresses where needed. I am trying to write a program that will search for names and addresses in our database that are either all lowercase or all uppercase and output these mishaps in a webpage for me to review and correct more efficiently. I was informed that I could utilize Regular Expressions to find fields that adhere to my criteria, only I am new to programming and I am unfamiliar with the syntax of RegEx.
If anyone could provide me with some pointers as how to use RegEx to query for these inconsistencies, it would be greatly appreciated.
Thank you.
strComp should work
SELECT col
FROM table
WHERE strComp(col, lcase(col), 0) = 0 --all lower case
OR strComp(col, ucase(col), 0) = 0 --all upper case
The first two arguments are the columns to compare. The 3rd argument says to do a binary comparison. If the two strings are equal 0 is returned.
How will you accurately correct the data? If you see a last name of "MACGUYVER" should it change to Macguyver or MacGuyver? If you see a last name of "DE LA HOYA" will it become de la Hoya, De La Hoya, or something else? This task seems a bit dangerous.
If your plan is basically to just do initial capitalization then I suggest that you run an update first before doing any manual review.
You could run something like this to change your name fields to initial capital letters:
update yourTable
set lname = StrConv(lname,3)
where StrComp(lname, StrConv(lname,3), 0) <> 0
and StrComp(mid(lname,2,len(lname)), lcase(mid(lname,2,len(lname))), 0) = 0;
Where "lname" above is your last name column, for example.
The above would have to be run for each name field.
Note that this will not update names that legitimately have multiple capital letters, like MacGuyver or O'Connor, which need manual review.
Also note that it will update last names that start with van, von, de la, and others that may intentionally be lowercase.
You could then query for just the names that need manual review, which I assume will be a much smaller subset:
select *
from yourTable
where StrComp(lname, StrConv(lname,3), 0) <> 0;
Addresses are tougher. To find just those that are either all lowercase or all uppercase you can do this:
select *
from yourTable
where strComp(address1, lcase(address1), 0) = 0;
select *
from yourTable
where strComp(address1, ucase(address1), 0) = 0;
Obviously this won't catch address lines like "123 New YORK AveNUE".
Consider asking for permission to just set all address values to uppercase.
You'll save yourself a lot of trouble.

Extracting dollar amounts from existing sql data?

I have a field with that contains a mix of descriptions and dollar amounts. With TSQL, I would like to extract those dollar amounts, then insert them into a new field for the record.
-- UPDATE --
Some data samples could be:
Used knife set for sale $200.00 or best offer.
$4,500 Persian rug for sale.
Today only, $100 rebate.
Five items for sale: $20 Motorola phone car charger, $150 PS2, $50.00 3 foot high shelf.
In the set above I was thinking of just grabbing the first occurrence of the dollar figure... that is the simplest.
I'm not trying to remove the amounts from the original text, just get their value, and add them to a new field.
The amounts could/could not contain decimals, and commas.
I'm sure PATINDEX won't cut it and I don't need an extremely RegEx function to accomplish this.
However, looking at The OLE Regex Find (Execute) function here, appears to be the most robust, however when trying to use the function I get the following error message in SSMS:
SQL Server blocked access to procedure 'sys.sp_OACreate' of component
'Ole Automation Procedures' because this component is turned off as
part of the security configuration for this server. A system
administrator can enable the use of 'Ole Automation Procedures' by
using sp_configure. For more information about enabling 'Ole
Automation Procedures', see "Surface Area Configuration" in SQL Server
Books Online.
I don't want to go and changing my server settings just for this function. I have another regex function that works just fine without changes.
I can't imagine this being that complicated to just extract dollar amounts. Any simpler ways?
Thanks.
CREATE FUNCTION dbo.fnGetAmounts(#str nvarchar(max))
RETURNS TABLE
AS
RETURN
(
-- generate all possible starting positions ( 1 to len(#str))
WITH StartingPositions AS
(
SELECT 1 AS Position
UNION ALL
SELECT Position+1
FROM StartingPositions
WHERE Position <= LEN(#str)
)
-- generate possible lengths
, Lengths AS
(
SELECT 1 AS [Length]
UNION ALL
SELECT [Length]+1
FROM Lengths
WHERE [Length] <= 15
)
-- a Cartesian product between StartingPositions and Lengths
-- if the substring is numeric then get it
,PossibleCombinations AS
(
SELECT CASE
WHEN ISNUMERIC(substring(#str,sp.Position,l.Length)) = 1
THEN substring(#str,sp.Position,l.Length)
ELSE null END as Number
,sp.Position
,l.Length
FROM StartingPositions sp, Lengths l
WHERE sp.Position <= LEN(#str)
)
-- get only the numbers that start with Dollar Sign,
-- group by starting position and take the maximum value
-- (ie, from $, $2, $20, $200 etc)
SELECT MAX(convert(money, Number)) as Amount
FROM PossibleCombinations
WHERE Number like '$%'
GROUP BY Position
)
GO
declare #str nvarchar(max) = 'Used knife set for sale $200.00 or best offer.
$4,500 Persian rug for sale.
Today only, $100 rebate.
Five items for sale: $20 Motorola phone car charger, $150 PS2, $50.00 3 foot high shelf.'
SELECT *
FROM dbo.fnGetAmounts(#str)
OPTION(MAXRECURSION 32767) -- max recursion option is required in the select that uses this function
This link should help.
http://blogs.lessthandot.com/index.php/DataMgmt/DataDesign/extracting-numbers-with-sql-server
Assuming you are OK with extracting the numeric's, regardless of wether or not there is a $ sign. If that is a strict requirement, some mods will be needed.