Dataprep: change string yyyymmdd date to datetime column - google-cloud-platform

I have a column with dates (in a string format) in Dataprep: yyyymmdd. I would like it to become a datetime object. Which function/transformation should I apply to achieve this result automatically?

In this case, you actually don't need to apply a transformation at all: you can just change the column type to Date/Time and select the appropriate format options.
Note: This is one of the least intuitive parts of Dataprep, as you have to select an incorrect format (in this case yy-mm-dd) before you can drill down to the correct format (yyyymmdd).
Here's a screenshot of the Date/Time type window to illustrate this.
While it's unintuitive, this will correctly treat the column as a date in future operations, including assigning the correct type in export operations (e.g. BigQuery).
Through the UI, this will generate the following Wrangle Script:
settype col: YourDateCol customType: 'Datetime','yy-mm-dd','yyyymmdd' type: custom
According to the documentation, this should also work (and is more succinct):
settype col: YourDateCol type: 'Datetime','yy-mm-dd','yyyymmdd'
Note that if you absolutely needed to do this in a function context, you could extract the date parts using SUBSTRING/LEFT/RIGHT and pass them to the DATE or DATETIME function to construct a datetime object. As you've probably already found, DATEFORMAT will return NULL if the source column isn't already of type Datetime.
(From a performance standpoint, though, it would probably be far more efficient for a large dataset to either just change the type of the existing column or create a new column with the correct type, versus having to perform those extra operations on so many rows.)
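Outside of Dataprep (say, in a quick Python/pandas sanity check), the same "take the parts and build a datetime" logic is easy to express. This is only a minimal sketch with the column name from the script above, not a substitute for the settype approach:

import pandas as pd

# Hypothetical example data: yyyymmdd strings, as described in the question.
df = pd.DataFrame({"YourDateCol": ["20180131", "20180201", "20180315"]})

# Option 1: let pandas parse the whole string with an explicit format.
df["parsed"] = pd.to_datetime(df["YourDateCol"], format="%Y%m%d")

# Option 2: the "extract the parts and construct a datetime" idea,
# analogous to SUBSTRING/LEFT/RIGHT + DATE in Dataprep.
parts = pd.DataFrame({
    "year": df["YourDateCol"].str[:4].astype(int),
    "month": df["YourDateCol"].str[4:6].astype(int),
    "day": df["YourDateCol"].str[6:8].astype(int),
})
df["parsed_manual"] = pd.to_datetime(parts)

print(df.dtypes)  # both new columns are datetime64[ns]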

Related

Converting a datastore column from string to timestamp

I have a Datastore entity which has a column named timestamp. It was supposed to be a timestamp type, but it is a string type as of now. This column has values in two formats: YYYY-MM-DDTHH:MM:SSZ and YYYY-MM-DDTHH:MM:SS-offset_hours.
In our code, we are sorting on timestamp, which is essentially sorting the "string". The question is: how can I convert this "string" column into a "Timestamp"?
Do I have to do any conversion for the existing values, which are in different formats? How can I do it in Terraform?
Google Datastore has no notion of schema migrations; you're going to have to write a task queue job to do it.
The proper way would be to create a new column called timestamp_2 and backfill it. Here is an article GCP wrote:
https://cloud.google.com/appengine/articles/update_schema
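If you do go down the backfill route, here is a rough sketch of what the job's core loop could look like with the google.cloud.datastore Python client. The kind name MyKind is a placeholder, a real job would run from a task queue worker as suggested above, and %z parsing of the 'Z' suffix requires Python 3.7+:

from datetime import datetime
from google.cloud import datastore

client = datastore.Client()  # assumes default project/credentials

def parse_ts(value):
    # On Python 3.7+, %z accepts both 'Z' and '-HH:MM' style offsets,
    # which covers the two formats described in the question.
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z")

# 'MyKind' is a hypothetical kind name; replace it with your entity kind.
query = client.query(kind="MyKind")
batch = []
for entity in query.fetch():
    ts = entity.get("timestamp")
    if isinstance(ts, str):
        entity["timestamp_2"] = parse_ts(ts)  # stored as a real timestamp
        batch.append(entity)
    if len(batch) >= 500:
        client.put_multi(batch)  # Datastore caps mutations per commit
        batch = []
if batch:
    client.put_multi(batch)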

Google Cloud DataPrep DATEDIF function inconsistent

I have four DateTime columns, all in long format, e.g. 2016-08-01T21:13:02Z. They are called EnqDateTime, QuoteCreatedDateTime, BookingCreatedDateTime and RejAt.
I want to add columns for the duration (in days) between EnqDateTime and the other three columns, i.e.
DATEDIF(EnqDateTime, QuoteCreatedDateTime, day)
This works for RejAt, but throws an error for all the other columns:
Parameter "rhs" accepts only ["Datetime"]
As per the image below, all four columns are DateTime.
Can anyone see any other reason this may not be working for two of the three columns?
As you can see in the image below, I reproduced a scenario such as the one you presented here, and I had no issue with it. I created the three X2Y columns using the same formulas that you shared:
DATEDIF(EnqDateTime, QuoteCreatedDateTime, day)
DATEDIF(EnqDateTime, BookingCreatedDateTime, day)
DATEDIF(EnqDateTime, RejAt, day)
My guess is that, for some reason, the columns do not have an appropriate Datetime format. You can try applying some transformations to make sure that the data contained in the columns has the appropriate format. I recommend that you try the following:
Clean all missing values by clicking on the column and then Clean > Missing > Fill with NULL. Missing values can prevent Dataprep from recognizing a data type properly.
Change the data type again to Datetime, just to double-check that every field has the Datetime type. You can do so by clicking on the column and then Change type > Date/Time.
If these methods do not solve your issue, try working with a minimal example of only a few rows, so that you can narrow down the variables involved. Then you can update your question with more information.
It would also be nice to know where you are getting the error Parameter "rhs" accepts only ["Datetime"]. It is not clear to me what the rhs (Right Hand Side) parameter is in this case, so maybe you can also provide more details about that.
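In the meantime, if you want to sanity-check the raw values outside of Dataprep, a few lines of pandas will flag anything that does not parse as a datetime. This is just a sketch assuming you export a sample of the four columns to a CSV (sample_export.csv is a hypothetical file name):

import pandas as pd

# Hypothetical sample export of the four columns as strings.
df = pd.read_csv("sample_export.csv")

date_cols = ["EnqDateTime", "QuoteCreatedDateTime", "BookingCreatedDateTime", "RejAt"]
for col in date_cols:
    parsed = pd.to_datetime(df[col], errors="coerce", utc=True)
    bad = df.loc[parsed.isna() & df[col].notna(), col]
    print(col, "unparseable values:", bad.unique()[:5])

# Rough equivalent of DATEDIF(EnqDateTime, QuoteCreatedDateTime, day):
days = (pd.to_datetime(df["QuoteCreatedDateTime"], utc=True)
        - pd.to_datetime(df["EnqDateTime"], utc=True)).dt.days
print(days.head())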

How to work with data different from schema in pandas python

I am currently using pandas (0.22.0) with read_table and names.
How can I address when my underlying data schema changes?
For example, my read_table is reading 5 columns and the data file has 5 columns. How would I tackle changes in the data (e.g. when a new column is added)? Does that mean I have to update the schema every time the data format changes? Is there a way to ignore the columns not mentioned via names in pandas read_table?
There is a usecols parameter that you can pass to read_table to read only a subset of the available data. So long as the 5 columns that you are concerned with are always present, you should be able to name them explicitly in the call.
cols_of_interest = ['col1', 'col2', 'col3', 'col4', 'col5']
df = pd.read_table(file_path, usecols=cols_of_interest)
Documentation for pd.read_table here - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html
Note that you can also pass a callable which can decide which columns to parse, or specify column indices instead of named columns (depends on the underlying data I guess).
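For instance, the callable form of usecols makes it easy to keep only the columns you care about and silently ignore any new ones. A short sketch reusing the column names from above (file_path is a hypothetical path, as in the earlier snippet):

import pandas as pd

file_path = "data.txt"  # path to the data file, as in the snippet above (hypothetical name)
expected = {"col1", "col2", "col3", "col4", "col5"}

# The callable is evaluated once per column name found in the file;
# returning True keeps the column, so new, unexpected columns are skipped.
df = pd.read_table(file_path, usecols=lambda name: name in expected)

# Positional variant: keep only the first five columns, whatever their names.
df_by_index = pd.read_table(file_path, usecols=[0, 1, 2, 3, 4])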
The problem I have here is that I am iterating over data files with a set schema, using read_table and names. I do not want to be updating the schema every time the underlying data changes.
I found a work-around (more of a hack) at this point: I added a few 'dummy' columns to the names array.

Convert String attributes to numeric values in WEKA

I am new to Weka. My data contains a column of student names. I want to convert these names to numeric values, over the whole column.
E.g.: suppose there are 10 names: abcd, cdef, xyz, etc. I want to preprocess the data so that each name corresponds to a distinct numeric value, like abcd changes to 1, cdef changes to 2, etc.
Also, two or more rows can have the same name. In this case, the same name should get the same value.
Please help me...
Weka supports 4 non-relational attribute types: nominal, numeric, string and date. You can find out more about them in the Weka Manual (it can be found in the same folder where you downloaded Weka), chapter "The ARFF Header Section".
You should find out what the type of the "student name" attribute is (probably string, but could be nominal), and decide what the type of the attribute with converted values should be (numeric, nominal, or string).
There can be 2 scenarios:
(1) If the types of the existing and desired attributes are the same (string-string or nominal-nominal, i.e. you only want to change values, not the attribute type), you could do so
(a) manually - open the data file in Weka Explorer, and click Edit... button, or
(b) write a small program using Weka's Attribute class functions value and setValue.
(2) Types are different - Weka attribute types cannot be converted, so you will have to create and insert a new attribute with the converted values, and delete the old attribute. An example of how to create a new attribute can be found at
http://weka.wikispaces.com/Programmatic+Use#Step.
As far as I understand, strictly converting names into a "numeric" type doesn't seem like the best approach within the context of WEKA: WEKA will treat numeric attributes differently than it does "string" or "nominal" attributes (for example, for certain "attribute selection" algorithms you cannot use "numeric" types; they need to be "discretized" or converted into nominal form).
So, for your case, I think you can convert your "string" names into just the "nominal" type using the StringToNominal class (this class acts as a WEKA "filter" to help convert a given "string" attribute into an attribute of type "nominal"). This will also take care of the repeating names: the list of "nominal" values for the names (generated after you apply this filter) will contain any given name, however many times it appears, only once.
"Nominal" attributes also have the advantage that, implicitly, they do have a numeric representation (the index of the value within the set of values, similar to how "enums" in Java have a numeric index). So you can utilize that as the "numeric" information corresponding to the names (though, as I said earlier, it's probably best to just use it as a "nominal" attribute; it really depends on your particular use case).
I had the same problem as the one mentioned in the question, and I could "address" it in the following way.
I first applied the StringToNominal filter as mentioned before (don't forget to change the attribute range from "last" to "first-last"). Once that was done, I saved the dataset in LibSVM format, which changes the nominal values to numeric ones.
Then, if you close Weka and open it again, you will have the same dataset with the same number of features, but they will be numeric. A couple of changes should then be made: first, normalize all the numeric values in the dataset using the Normalize filter. After that, apply the NumericToNominal filter to the last attribute.
Then, you will have a similar dataset with numeric values.
Hope this helps.
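For what it's worth, the underlying idea in these answers (give each distinct name one integer, with repeated names sharing it) is easy to see outside of Weka too. This is only a conceptual illustration in Python/pandas, not part of the Weka workflow:

import pandas as pd

# Hypothetical student names, with repeats.
names = pd.Series(["abcd", "cdef", "xyz", "abcd", "cdef"])

# factorize assigns a distinct integer code to each distinct value;
# identical names always receive the same code.
codes, uniques = pd.factorize(names)
print(codes)      # [0 1 2 0 1]
print(uniques)    # Index(['abcd', 'cdef', 'xyz'], dtype='object')
print(codes + 1)  # add 1 if you prefer 1-based values, as in the question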

Getting a date value in a Postgres table column and checking if it's bigger than today's date

I have a Postgres table called clients. The name column contains certain values, e.g.
test23233 [987665432,2014-02-18]
At the end of the value is a date. I need to compare this date and return all records where this specific date is younger than today.
I tried
select id,name FROM clients where name ~ '(\d{4}\-\d{1,2}\-\d{1,2})';
but this isn't returning any values. How would I go about achieving the results I want?
If the data is always stored this way (i.e. after the comma), I would not use a regex, but extract the date part and convert it to a proper date type.
SELECT *
FROM the_table
WHERE to_date(substring(name, strpos(name, ',') + 1, 10), 'yyyy-mm-dd') < current_date
You might want to put that to_date(...) thing into a view to make this easier for other queries.
In the long run you should really try to fix that data model.
Using a regular expression for this would be extremely hard. Is it possible to change the schema and data to separate the name, whatever the second value is, and the timestamp into separate columns? That would be far more logical, less error prone, and significantly faster.
Otherwise, I suspect you'll have to use some sort of parsing (possibly a regex) to extract the date, then convert it to a Postgres date, then compare that with the current time... for every single row. Ick.
EDIT: Actually, it's not quite that bad... because your dates are stored in a sort-friendly way, it's possible that you could do the extraction (whether with a regex or anything else) and just do an ordinal comparison with the string representation of today's date, without actually performing any date conversion for each row. It's still ugly though, and doesn't validate that the date isn't (say) 2011-99-99. If you can possibly store the data more sensibly, do.
I solved my issue by doing something similar to:
select id,substring(name,'[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}'),name FROM clients where substring(name,'[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}') > '2011-03-18';
It might not be best practice, but it works. I'm open to better suggestions, though.