This is Siva Ramanjaneyulu. I am working on Hive, and I have run into the following problem.
sample.log: <ABC>
CREATE TABLE sample4( num1 STRING ) ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH
SERDEPROPERTIES ( "input.regex" = "<.*>", "output.format.string" =
"%1$s" ) STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH "../hive-0.9.0/sample.log" INTO TABLE sample4;
select * from sample4;
NULL
Expected output: ABC
Why does RegexSerDe not work on the regular expression <.*>?
How can I remove the < and > symbols using a regular expression? Can you please provide a solution?
Try this:
hive> CREATE TABLE s(num1 STRING) ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH
SERDEPROPERTIES ( "input.regex" = "(<.*>)",
"output.format.string" = "%1$s" ) STORED AS TEXTFILE;
Mind the parentheses around regex.
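A quick check with the sample.log from the question should then return the whole match; since the angle brackets sit inside the capturing group, they stay in the output:
hive> LOAD DATA LOCAL INPATH "../hive-0.9.0/sample.log" INTO TABLE s;
hive> select * from s;
<ABC>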
You are getting a NULL value because you didn't include parentheses in the regex definition. If you don't want the angle brackets to be included in the output, you need to put them outside the parentheses: whatever is inside the parentheses is what will be returned as the output.
CREATE TABLE sample4 (num1 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "<(.*)>"
, "output.format.string" = '%1$s'
)
STORED AS TEXTFILE;
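After dropping the old table and re-creating it with this definition, loading the same sample.log and querying it should give the expected output with the brackets stripped:
LOAD DATA LOCAL INPATH "../hive-0.9.0/sample.log" INTO TABLE sample4;
select * from sample4;
ABC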
Related
I have a regular expression like the following (running on Oracle's regexp_like(), although the question isn't Oracle-specific):
abc|bcd|def|xyz
This basically matches a tags field in the database to see if it contains abc OR bcd OR def OR xyz when the user has entered the search query "abc bcd def xyz".
The tags field in the database holds keywords separated by spaces, e.g. "cdefg abcd xyz".
On Oracle, this would be something like:
select ... from ... where
regexp_like(tags, 'abc|bcd|def|xyz');
It works fine as it is, but I want to add an extra option for users to search for results that match all keywords. How should I change the regular expression so that it matches abc AND bcd AND def AND xyz?
Note: Because I won't know what exact keywords the user will enter, I can't pre-structure the query in PL/SQL like this:
select ... from ... where
tags like '%abc%' AND
tags like '%bcd%' AND
tags like '%def%' AND
tags like '%xyz%';
You can split the input pattern and check that all the parts of the pattern match:
SELECT t.*
FROM table_name t
CROSS APPLY(
WITH input (match) AS (
SELECT 'abc bcd def xyz' FROM DUAL
)
SELECT 1
FROM input
CONNECT BY LEVEL <= REGEXP_COUNT(match, '\S+')
HAVING COUNT(
REGEXP_SUBSTR(
t.tags,
REGEXP_SUBSTR(match, '\S+', 1, LEVEL)
)
) = REGEXP_COUNT(match, '\S+')
)
Or, if you have Java enabled in the database, you can create a Java function to match regular expressions:
CREATE AND COMPILE JAVA SOURCE NAMED RegexParser AS
import java.util.regex.Pattern;
public class RegexpMatch {
public static int match(
final String value,
final String regex
){
final Pattern pattern = Pattern.compile(regex);
return pattern.matcher(value).matches() ? 1 : 0;
}
}
/
Then wrap it in an SQL function:
CREATE FUNCTION regexp_java_match(value IN VARCHAR2, regex IN VARCHAR2) RETURN NUMBER
AS LANGUAGE JAVA NAME 'RegexpMatch.match( java.lang.String, java.lang.String ) return int';
/
Then use it in SQL:
SELECT *
FROM table_name
WHERE regexp_java_match(tags, '(?=.*abc)(?=.*bcd)(?=.*def)(?=.*xyz).*') = 1;
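Since the exact keywords are only known at run time, the look-ahead pattern itself can also be assembled on the fly rather than hard-coded. A rough sketch, assuming a space-separated keyword string and the regexp_java_match function above (Oracle 11g+ for REGEXP_COUNT):
SELECT *
FROM   table_name
WHERE  regexp_java_match(
         tags,
         -- builds '(?=.*abc)(?=.*bcd)(?=.*def)(?=.*xyz).*' from the keyword list
         ( SELECT LISTAGG('(?=.*' || REGEXP_SUBSTR('abc bcd def xyz', '\S+', 1, LEVEL) || ')', '')
                    WITHIN GROUP (ORDER BY LEVEL) || '.*'
           FROM   DUAL
           CONNECT BY LEVEL <= REGEXP_COUNT('abc bcd def xyz', '\S+') )
       ) = 1;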
Try this; the idea is to count that the number of matches equals the number of patterns:
with data(val) AS (
select 'cdefg abcd xyz' from dual union all
select 'cba lmnop xyz' from dual
),
targets(s) as (
select regexp_substr('abc bcd def xyz', '[^ ]+', 1, LEVEL) from dual
connect by regexp_substr('abc bcd def xyz', '[^ ]+', 1, LEVEL) is not null
)
select val from data d
join targets t on
regexp_like(val,s)
group by val having(count(*) = (select count(*) from targets))
;
Result:
cdefg abcd xyz
I think dynamic SQL will be needed for this. The match-all option will require individual matching, with logic to ensure every individual match is found.
An easy way would be to build a condition for each keyword, concatenate the conditions into a single string, and use dynamic SQL to execute the string as a query.
The example below uses the customer table from the sample schemas provided by Oracle.
DECLARE
-- match string should be just the values to match with spaces in between
p_match_string VARCHAR2(200) := 'abc bcd def xyz';
-- need logic to determine match one (OR) versus match all (AND)
p_match_type VARCHAR2(3) := 'OR';
l_sql_statement VARCHAR2(4000);
-- create type if bulk collect is needed
TYPE t_email_address_tab IS TABLE OF customers.EMAIL_ADDRESS%TYPE INDEX BY PLS_INTEGER;
l_email_address_tab t_email_address_tab;
BEGIN
WITH sql_clauses(row_idx,sql_text) AS
(SELECT 0 row_idx -- build select plus beginning of where clause
,'SELECT email_address '
|| 'FROM customers '
|| 'WHERE 1 = '
|| DECODE(p_match_type, 'AND', '1', '0') sql_text
FROM DUAL
UNION
SELECT LEVEL row_idx -- build joins for each keyword
,DECODE(p_match_type, 'AND', ' AND ', ' OR ')
|| 'email_address'
|| ' LIKE ''%'
|| REGEXP_SUBSTR( p_match_string,'[^ ]+',1,level)
|| '%''' sql_text
FROM DUAL
CONNECT BY LEVEL <= LENGTH(p_match_string) - LENGTH(REPLACE( p_match_string, ' ' )) + 1
)
-- put it all together by row_idx
SELECT LISTAGG(sql_text, '') WITHIN GROUP (ORDER BY row_idx)
INTO l_sql_statement
FROM sql_clauses;
dbms_output.put_line(l_sql_statement);
-- can use execute immediate (or ref cursor) for dynamic sql
EXECUTE IMMEDIATE l_sql_statement
BULK COLLECT
INTO l_email_address_tab;
END;
With p_match_string = 'abc bcd def xyz' and p_match_type = 'AND', l_sql_statement becomes:
SELECT email_address FROM customers WHERE 1 = 1 AND email_address LIKE '%abc%' AND email_address LIKE '%bcd%' AND email_address LIKE '%def%' AND email_address LIKE '%xyz%'
With p_match_string = 'abc bcd def xyz' and p_match_type = 'OR', l_sql_statement becomes:
SELECT email_address FROM customers WHERE 1 = 0 OR email_address LIKE '%abc%' OR email_address LIKE '%bcd%' OR email_address LIKE '%def%' OR email_address LIKE '%xyz%'
My Athena query
CREATE EXTERNAL TABLE IF NOT EXISTS cwmilenko.new1 ( `ticket-id` string, `amount_stake_one` struct )
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1') location 's3://cw-milenko-tests/1507tick2.parquet/part-00000-bbab8f70-4758-4041-9f69-c17f21c916dac000.snappy.parquet'
Error
FAILED: ParseException line 1:6 missing EOF at '-' near 'cw'
How can I fix this problem?
This works:
CREATE EXTERNAL TABLE parket3 (
`amount_stake_one` double,
`bet_event_etl_date` string,
`ticket_id` string
) STORED AS PARQUET LOCATION 's3://myparket/' tblproperties ("parquet.compression"="SNAPPY")
I'm trying to optimize my Glue/PySpark job by using push down predicates.
start = date(2019, 2, 13)
end = date(2019, 2, 27)
print(">>> Generate data frame for ", start, " to ", end, "... ")
relaventDatesDf = spark.createDataFrame([
Row(start=start, stop=end)
])
relaventDatesDf.createOrReplaceTempView("relaventDates")
relaventDatesDf = spark.sql("SELECT explode(generate_date_series(start, stop)) AS querydatetime FROM relaventDates")
relaventDatesDf.createOrReplaceTempView("relaventDates")
print("===LOG:Dates===")
relaventDatesDf.show()
flightsGDF = glueContext.create_dynamic_frame.from_catalog(database = "xxx", table_name = "flights", transformation_ctx="flights", push_down_predicate="""
querydatetime BETWEEN '%s' AND '%s'
AND querydestinationplace IN (%s)
""" % (start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d"), ",".join(map(lambda s: str(s), arr))))
However, it appears that Glue still attempts to read data outside the specified date range:
INFO S3NativeFileSystem: Opening 's3://.../flights/querydestinationplace=12191/querydatetime=2019-03-01/part-00045-6cdebbb1-562c-43fa-915d-93b125aeee61.c000.snappy.parquet' for reading
INFO FileScanRDD: Reading File path: s3://.../flights/querydestinationplace=12191/querydatetime=2019-03-10/part-00021-34a13146-8fb2-43de-9df2-d8925cbe472d.c000.snappy.parquet, range: 0-11797922, partition values: [12191,17965]
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
INFO S3NativeFileSystem: Opening 's3://.../flights/querydestinationplace=12191/querydatetime=2019-03-10/part-00021-34a13146-8fb2-43de-9df2-d8925cbe472d.c000.snappy.parquet' for reading
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
Notice that querydatetime=2019-03-01 and querydatetime=2019-03-10 are outside the specified range of 2019-02-13 to 2019-02-27. Is that why the next line says "aborting HTTP connection"? It goes on to say "This is likely an error and may result in sub-optimal behavior"; is something wrong?
I wonder if the problem is that BETWEEN or IN is not supported inside the predicate?
The table create DDL
CREATE EXTERNAL TABLE `flights`(
`id` string,
`querytaskid` string,
`queryoriginplace` string,
`queryoutbounddate` string,
`queryinbounddate` string,
`querycabinclass` string,
`querycurrency` string,
`agent` string,
`quoteageinminutes` string,
`price` string,
`outboundlegid` string,
`inboundlegid` string,
`outdeparture` string,
`outarrival` string,
`outduration` string,
`outjourneymode` string,
`outstops` string,
`outcarriers` string,
`outoperatingcarriers` string,
`numberoutstops` string,
`numberoutcarriers` string,
`numberoutoperatingcarriers` string,
`indeparture` string,
`inarrival` string,
`induration` string,
`injourneymode` string,
`instops` string,
`incarriers` string,
`inoperatingcarriers` string,
`numberinstops` string,
`numberincarriers` string,
`numberinoperatingcarriers` string)
PARTITIONED BY (
`querydestinationplace` string,
`querydatetime` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://pinfare-glue/flights/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='pinfare-parquet',
'averageRecordSize'='19',
'classification'='parquet',
'compressionType'='none',
'objectCount'='623609',
'recordCount'='4368434222',
'sizeKey'='86509997099',
'typeOfData'='file')
One of the issues I can see with the code is that you are using "today" instead of "end" in the BETWEEN clause. Though I don't see the today variable declared anywhere in your code, I am assuming it has been initialized with today's date.
In that case the range will be different, and the partitions being read by the Glue Spark job are correct.
In order to push down your condition, you need to change the order of the columns in the PARTITIONED BY clause of the table definition: a condition with an IN predicate on the first partition column cannot be pushed down the way you are expecting; see the example below.
Let me know if it helps.
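For example (a hypothetical reordering of the PARTITIONED BY clause from your DDL, with everything else unchanged), putting the range-filtered column first:
PARTITIONED BY (
  `querydatetime` string,
  `querydestinationplace` string)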
Pushdown predicates in a Glue DynamicFrame work fine with BETWEEN as well as with an IN clause, as long as you have the correct sequence of partition columns defined in the table definition and in the query.
I have a table with three levels of partitions.
s3://bucket/flights/year=2018/month=01/day=01 -> 50 records
s3://bucket/flights/year=2018/month=02/day=02 -> 40 records
s3://bucket/flights/year=2018/month=03/day=03 -> 30 records
Read data in dynamicFrame
ds = glueContext.create_dynamic_frame.from_catalog(
database = "abc",table_name = "pqr", transformation_ctx = "flights",
push_down_predicate = "(year == '2018' and month between '02' and '03' and day in ('03'))"
)
ds.count()
Output:
30 records
So you will get the correct results if the sequence of columns is correctly specified. Also note that you need to quote the values in the IN clause, i.e. IN ('%s').
Partition columns in table:
querydestinationplace string,
querydatetime string
Data read in DynamicFrame:
flightsGDF = glueContext.create_dynamic_frame.from_catalog(database = "xxx", table_name = "flights", transformation_ctx="flights",
push_down_predicate=
"""querydestinationplace IN ('%s') AND
querydatetime BETWEEN '%s' AND '%s'
"""
%
( ",".join(map(lambda s: str(s), arr)),
start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d")))
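For illustration, with arr = [12191] (a hypothetical single destination), start = 2019-02-13 and today = 2019-02-27, the rendered predicate string would be roughly:
querydestinationplace IN ('12191') AND querydatetime BETWEEN '2019-02-13' AND '2019-02-27'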
Try handling the end date like this:
start = str(date(2019, 2, 13))
end = str(date(2019, 2, 27))
# Set your push_down_predicate variable
pd_predicate = "querydatetime >= '" + start + "' and querydatetime < '" + end + "'"
#pd_predicate = "querydatetime between '" + start + "' AND '" + end + "'" # Or this one?
flightsGDF = glueContext.create_dynamic_frame.from_catalog(
database = "xxx"
, table_name = "flights"
, transformation_ctx="flights"
, push_down_predicate=pd_predicate)
The pd_predicate will be a string that will work as a push_down_predicate.
Here is a nice read about it if you like.
https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/
This is my first attempt with Athena, please be gentle :)
This query is failing with the error "No viable alternative":
CREATE EXTERNAL TABLE IF NOT EXISTS dev.mytokendata (
'dateandtime' timestamp, 'requestid' string,
'ip' string, 'caller' string, 'token' string, 'requesttime' int,
'httpmethod' string, 'resourcepath' string, 'status' smallint,
'protocol' string, 'responselength' int )
ROW FORMAT SERDE 'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
'input.format'='%{TIMESTAMP_ISO8601:dateandtime}
\s+\{\"requestId\":\s+\"%{USERNAME:requestid}\",
\s+\"ip\":\s+\"%{IP:ip}\",
\s+\"caller\":\s+\"%{USERNAME:caller}\",
\s+\"token\":\s+\"%{USERNAME:token}\",
\s+\"requestTime\":\s+\"%{INT:requesttime}\",
\s+\"httpMethod\":\s+\"%{WORD:httpmethod}\",
\s+\"resourcePath\":\s+\"%{UNIXPATH:resourcepath}\",
\s+\"status\":\s+\"%{INT:status}\",
\s+\"protocol\":\s+\"%{UNIXPATH:protocol}\",
\s+\"responseLength:\"\s+\"%{INT:responselength}\"\s+\}' )
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://athena-abc/1234/'
TBLPROPERTIES ('has_encrypted_data'='false', 'compressionType'='gzip');
This is a line from the log file (a .gz file) which I am trying to parse:
2018-07-30T02:23:34.134Z { "requestId":
"810000-9100-1100-a100-f100000", "ip": "01.01.01.001", "caller": "-",
"token": "1234-5678-78910-abcd", "requestTime":
"1002917414000", "httpMethod": "POST", "resourcePath":
"/MyApp/v1.0/MyService.wsdl",
"status": "200", "protocol": "HTTP/1.1", "responseLength": "1000" }
Can anyone please point out what may be wrong? It would be of great help.
You have used apostrophes unnecessarily around the column names, and there was also a problem with the escape characters.
The correct version:
CREATE EXTERNAL TABLE mytokendata (
`dateandtime` timestamp,
`requestid` string,
`ip` string,
`caller` string,
`token` string,
`requesttime` int,
`httpmethod` string,
`resourcepath` string,
`status` smallint,
`protocol` string,
`responselength` int )
ROW FORMAT SERDE 'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
'input.format'='%{TIMESTAMP_ISO8601:dateandtime}
\\s+\\{\"requestId\":\\s+\"%{USERNAME:requestid}\",
\\s+\"ip\":\\s+\"%{IP:ip}\",
\\s+\"caller\":\\s+\"%{USERNAME:caller}\",
\\s+\"token\":\\s+\"%{USERNAME:token}\",
\\s+\"requestTime\":\\s+\"%{INT:requesttime}\",
\\s+\"httpMethod\":\\s+\"%{WORD:httpmethod}\",
\\s+\"resourcePath\":\\s+\"%{UNIXPATH:resourcepath}\",
\\s+\"status\":\\s+\"%{INT:status}\",
\\s+\"protocol\":\\s+\"%{UNIXPATH:protocol}\",
\\s+\"responseLength:\"\\s+\"%{INT:responselength}\"\\s+\\}' )
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://athena-abc/1234/'
TBLPROPERTIES ('has_encrypted_data'='false', 'compressionType'='gzip');
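Once the table is created, a quick sanity check against the sample log line might look like this (a hypothetical query, just to confirm the Grok pattern extracts the fields):
SELECT dateandtime, httpmethod, status, responselength
FROM mytokendata
LIMIT 10;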
This is a sample row in the input data file, with two fields, dept and names:
dept,names
Mathematics,[foo,bar,alice,bob]
Here, 'names' is an array of strings and I want to load it as an array of strings in Athena. Any suggestions?
To have a valid CSV file, make sure you put quotes around your array:
Mathematics,"[foo,bar,alice,bob]"
If you can remove the "[" and "]" the solution below becomes even easier and you can just split without the regex.
Better: Mathematics,"foo,bar,alice,bob"
First create a simple table from CSV with just strings:
CREATE EXTERNAL TABLE IF NOT EXISTS test.mydataset (
`dept` string,
`names` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'quoteChar' = '"',
"separatorChar" = ',',
'collection.delim' = ',',
'mapkey.delim' = ':'
) LOCATION 's3://<your location>'
TBLPROPERTIES ('has_encrypted_data'='false')
Then create a view which uses a regex to remove your '[' and ']' characters, then splits the rest by ',' into an array.
CREATE OR REPLACE VIEW mydataview AS
SELECT dept,
split(regexp_extract(names, '^\[(.*)\]$', 1), ',') as names
FROM mydataset
Then use the view for your queries. I am not 100% sure, as I've only spent about 12 hours using Athena.
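If you need the individual elements rather than the whole array, a possible follow-up (standard Presto/Athena syntax, using the view above) is to unnest it:
SELECT dept, name
FROM mydataview
CROSS JOIN UNNEST(names) AS t(name);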
--
Note that in order to use the quotes you need to use OpenCSVSerde; the 'lazyserde' won't work, as it does not support quotes. lazyserde DOES support internal arrays, but you can't use ',' as the separator in that case. If you want to try that, your data would look like:
Better: Mathematics,foo|bar|alice|bob
In that case this MIGHT work directly:
CREATE EXTERNAL TABLE IF NOT EXISTS test.mydataset (
`dept` string,
`names` array<string>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'quoteChar' = '"',
"separatorChar" = ',',
'collection.delim' = '|',
'mapkey.delim' = ':'
) LOCATION 's3://<your location>'
TBLPROPERTIES ('has_encrypted_data'='false')
Note how collection.delim = '|', which should translate your field directly to an array.
Sorry, I don't have time to test this; I'll be happy to update my answer if you can confirm what works. Hopefully this gets you started.