I have a column that has multiple comments along with some DateTime stamps. The example is like this:
abc#gmail.com - 03/03/2022 13:04:40
Documents Pending
Some random comment
I want to extract only the DateTime stamp from this column. I tried the to_char and to_date functions in PostgreSQL, but neither of them worked for me.
I also tried writing a regex to extract the DateTime stamp, but it didn't work.
What is the correct way to extract only DateTime from the above column? What would be the regular expression to extract the DateTime?
Thanks in advance.
Edit: I want to extract a date from a column like this:
abc#gmail.com - 12-Aug-2022
Documents Pending
Some random comment
How can we identify the month number or the date format from this comment?
You could use SUBSTRING() here:
SELECT col,
SUBSTRING(col FROM '\y\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\y') AS ts
FROM yourTable;
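For the second format from the edit (12-Aug-2022), a similar pattern with a three-letter month name should work. The sketch below checks both patterns in Python rather than PostgreSQL, so \b stands in for PostgreSQL's \y. The helper name extract_stamp is made up for illustration, and reading 03/03/2022 as day/month is an assumption, since the sample is ambiguous.

```python
import re
from datetime import datetime

# Patterns mirror the SUBSTRING() regex; Python uses \b where PostgreSQL uses \y.
NUMERIC_STAMP = re.compile(r"\b\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\b")
NAMED_MONTH = re.compile(r"\b\d{2}-[A-Za-z]{3}-\d{4}\b")


def extract_stamp(comment):
    """Return the first timestamp or date found in the comment, or None."""
    m = NUMERIC_STAMP.search(comment)
    if m:
        # Assumption: day comes before month (03/03/2022 is ambiguous).
        return datetime.strptime(m.group(), "%d/%m/%Y %H:%M:%S")
    m = NAMED_MONTH.search(comment)
    if m:
        # %b parses the abbreviated month name (English in the default locale).
        return datetime.strptime(m.group(), "%d-%b-%Y")
    return None


first = extract_stamp("abc#gmail.com - 03/03/2022 13:04:40\nDocuments Pending\nSome random comment")
second = extract_stamp("abc#gmail.com - 12-Aug-2022\nDocuments Pending\nSome random comment")
print(first, second.month)
```

In PostgreSQL itself, the extracted text could then be passed to to_timestamp or to_date with a matching format string such as 'DD-Mon-YYYY' to get the month number.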
So, I've been trying to load CSVs from an S3 bucket into Athena. However, the CSVs are designed like the following:
ns=2;s=A_EREG.A_EREG.A_PHASE_PRESSURE,102.19468,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.A_PHASE_REF_SIGNAL_TO_VALVE,50.0,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.A_PHASE_SEC_CURRENT,15.919731,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.A_PHASE_SEC_VOLTAGE,0.22070877,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.ACTIVE_PWR,0.0,12/12/19 00:00:01.2144275 GMT
The csv is just one record. Each column of the record has a value associated with it, which sits between the two commas that separate the name from the timestamp. That value is what I am trying to capture.
I've been trying to parse it using Regex Serde and I got to this Regular expression:
((?<=\,).*?(?=\,))
demo
I want the output of the above to be:
col_a col_b col_c col_d col_e
102.19468 50.0 15.919731 0.22070877 0.0
My DDL query looks like this:
CREATE EXTERNAL TABLE IF NOT EXISTS
(...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = "\(?<=\,).*?(?=\,)"
) LOCATION 's3://jackson-nifi-plc-data-1/2019-12-12/'
TBLPROPERTIES ('has_encrypted_data'='false');
The table creation query above works successfully, but when I try to preview my table I get the following error:
HIVE_CURSOR_ERROR: Number of matching groups doesn't match the number of columns
I am fairly new to Hive and Regex so I don't know what is going on. Can someone help me out here?
Thanks in advance,
BR
One column in a Hive table corresponds to one capturing group in the regex. If you want to select a single column containing everything between the commas, then this will work:
'.*,(.*),.*'
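As a quick illustration of the group-per-column rule, the pattern can be tried on one of the sample lines. This is a Python sketch only (RegexSerDe runs on Java regex, and the variable names here are assumptions), but the group count works the same way:

```python
import re

line = "ns=2;s=A_EREG.A_EREG.A_PHASE_PRESSURE,102.19468,12/12/19 00:00:01.2144275 GMT"

# One capturing group in the pattern -> one column in the table.
pattern = re.compile(r".*,(.*),.*")
match = pattern.fullmatch(line)

print(pattern.groups)   # number of capturing groups in the pattern
print(match.group(1))   # the value between the two commas
```

A table declared with five columns would need five capturing groups, which is why a single-group pattern triggers the error about the number of matching groups.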
Athena serdes require that each record in the input is a single line. Multiline records are not supported.
What you can do instead is to create a table which maps each line in your data to a row in a table, and use a view to pivot the rows that belong together into a single row.
I'm going to assume that the ns field at the start of the lines is an ID, if not, I assume there is some other thing identifying which lines belong together that you can use.
I used your demo to create a regex that matched all the fields of each line and came up with ns=(\d);s=([^,]+),([^,]+),(.+) (see https://regex101.com/r/HnjnxK/5).
CREATE EXTERNAL TABLE my_data (
ns string,
s string,
v double,
dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = "ns=(\\d);s=([^,]+),([^,]+),(.+)"
)
LOCATION 's3://jackson-nifi-plc-data-1/2019-12-12/'
TBLPROPERTIES ('has_encrypted_data'='false')
Apologies if the regex isn't correctly escaped, I'm just typing this into Stack Overflow.
This table has four columns, corresponding to the four fields in each line. I've named them ns and s from the data, v for the numerical value, and dt for the date. The date needs to be typed as a string since it's not in a format Athena natively understands.
Assuming that ns is a record identifier, you can then create a view that pivots rows with different values for s into columns. You will have to adapt this to your needs; the following is of course just a demonstration:
CREATE VIEW my_pivoted_data AS
WITH data_aggregated_by_ns AS (
SELECT
ns,
map_agg(s, v) AS s_and_v
FROM my_data
GROUP BY ns
)
SELECT
ns,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_PRESSURE') AS phase_pressure,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_REF_SIGNAL_TO_VALVE') AS phase_ref_signal_to_valve,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_SEC_CURRENT') AS phase_sec_current,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_SEC_VOLTAGE') AS phase_sec_voltage,
element_at(s_and_v, 'A_EREG.A_EREG.ACTIVE_PWR') AS active_pwr
FROM data_aggregated_by_ns
Apologies if there are syntax errors in the SQL above.
This creates a view with two parts (but start by trying it out as a plain query, using everything from WITH onwards).
The first part, the first SELECT, produces rows that aggregate all the s and v values for each value of ns into a map. Try running this query by itself to see what the result looks like.
The second part, the second SELECT, uses the results of the first part and picks the v values out of the aggregated map for a number of s values that I chose from your question.
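If it helps to see the idea outside of SQL, here is a rough Python sketch of the same three steps (parse each line, collect s and v into a map per ns, then pick named values out of the map). It is only an illustration of the logic, not something Athena runs, and the variable names and the shortened sample list are assumptions.

```python
import re
from collections import defaultdict

LINE_RE = re.compile(r"ns=(\d);s=([^,]+),([^,]+),(.+)")

lines = [
    "ns=2;s=A_EREG.A_EREG.A_PHASE_PRESSURE,102.19468,12/12/19 00:00:01.2144275 GMT",
    "ns=2;s=A_EREG.A_EREG.A_PHASE_REF_SIGNAL_TO_VALVE,50.0,12/12/19 00:00:01.2144275 GMT",
    "ns=2;s=A_EREG.A_EREG.ACTIVE_PWR,0.0,12/12/19 00:00:01.2144275 GMT",
]

# Step 1: one row per input line, like the my_data table.
rows = []
for line in lines:
    ns, s, v, dt = LINE_RE.fullmatch(line).groups()
    rows.append((ns, s, float(v), dt))

# Step 2: group by ns and collect s -> v into a map, like the first SELECT.
s_and_v_by_ns = defaultdict(dict)
for ns, s, v, _dt in rows:
    s_and_v_by_ns[ns][s] = v

# Step 3: pick named values out of each map, like the second SELECT.
pivoted = {
    ns: {
        "phase_pressure": s_and_v.get("A_EREG.A_EREG.A_PHASE_PRESSURE"),
        "active_pwr": s_and_v.get("A_EREG.A_EREG.ACTIVE_PWR"),
    }
    for ns, s_and_v in s_and_v_by_ns.items()
}
print(pivoted)
```

Since every sample line has ns=2, all of them end up in a single pivoted row, which matches the one-record layout described in the question.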
I need to create a new column with an IF.
If the difference between two dates is more than a month, I have to show a text like "much time", but if it is not, I have to show a date.
So the date must be converted to a string to use a text column.
How can I convert date to text?
Fecha_real =
IF( DATEDIFF(ventas[fecha_pedido]; ventas[fecha]; month) = 1 ;
"much time";
ConvertToTextInSomeWay ventas[fecha]
)
This is pretty simple with the FORMAT function. For example, FORMAT(ventas[fecha], "Short Date") will convert fecha into text like "12/31/2018".
That's just one format example. There are plenty of pre-defined and custom options if you'd rather something else. For example, FORMAT(ventas[fecha], "dd-mm-yyyy") would format that same date as "31-12-2018" instead.
I want to convert the string 20160101000000 into datetime format using expression. I have used below date function
TO_DATE(PERIOD_END_DATE),'MM/DD/YYYY HH24:MI:SS')
But my table file is not loading, even though my session and workflow succeed. My target and source are both flat files.
I want to change the string 20160101000000 into MM/DD/YYYY HH24:MI:SS for loading data into my target table.
You need to give the exact format of the incoming string so that the TO_DATE function can parse it and convert it into a date.
TO_DATE(PERIOD_END_DATE,'YYYYMMDDHH24MISS')
So here your date looks like YYYYMMDDHH24MISS (20160101000000).
There is often confusion about the TO_DATE function. It converts a string into a date, and its format argument describes the pattern of the incoming string, not the desired output. If you want to render a date value in a specific format, you must use TO_CHAR.
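The same parse-then-format split exists in most languages. As a sketch in Python (not Informatica expression syntax; the variable names are assumptions), strptime plays the TO_DATE role and strftime plays the TO_CHAR role:

```python
from datetime import datetime

raw = "20160101000000"

# Parsing step (the TO_DATE role): the format describes the incoming string.
parsed = datetime.strptime(raw, "%Y%m%d%H%M%S")

# Formatting step (the TO_CHAR role): the format describes the desired output.
formatted = parsed.strftime("%m/%d/%Y %H:%M:%S")
print(formatted)
```

In the expression from the question, that would presumably mean wrapping the TO_DATE call above in TO_CHAR with the 'MM/DD/YYYY HH24:MI:SS' format when the target is a flat file.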
I'm having trouble calling a date within an if statement. My date is, for example, "2001-08-05". I am trying to subset my data based on the date. So this is my code:
If ID = "Yes" and Date > 2001-08-05 then delete;
But this just doesn't do what I'm asking. I don't get an error, but it doesn't perform what I ask. I tried "2001-08-05"d. as well but this produced an error. Is there a certain way to read this format?
The proper format for a date constant in SAS is 'ddmmmyyyy'd, so:
if ID = 'Yes' and Date > '05Aug2001'd then delete;
You can use either single or double quotes to delimit the constant. The month name in the constant is case insensitive.
On a side note, if you need a datetime constant in SAS, the format is 'ddmmmyyyy:hh:mm:ss'dt. Notice the suffix becomes dt rather than just d, and there is a colon between the date and time.
Or you could try to convert the character string to a date:
If ID = "Yes" and Date > input('2001-08-05',yymmdd10.) then delete;
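A likely reason the original statement ran without an error but did not do what was expected: unquoted, 2001-08-05 is read as plain arithmetic (2001 minus 8 minus 5), and SAS stores dates as day counts from 1 January 1960. The Python sketch below only illustrates that arithmetic; the variable names are assumptions.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # the date that SAS stores as 0

# What the unquoted expression 2001-08-05 evaluates to: plain subtraction.
unquoted_value = 2001 - 8 - 5
# The calendar date that this day count represents.
unquoted_as_date = SAS_EPOCH + timedelta(days=unquoted_value)

# The day count that the constant '05Aug2001'd stands for.
intended_value = (date(2001, 8, 5) - SAS_EPOCH).days

print(unquoted_value, unquoted_as_date, intended_value)
```

So the original condition was effectively comparing Date against a day in 1965, which is true for nearly every recent date.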