Converting a datastore column from string to timestamp - google-cloud-platform

I have a Datastore entity which has a column named timestamp. It was supposed to be a timestamp type, but it is a string type as of now. This column currently has values in 2 formats: YYYY-MM-DDTHH:MM:SSZ and YYYY-MM-DDTHH:MM:SS-offset_hours.
In our code we are sorting on timestamp, which is essentially sorting the "string". Now the question is, how can I convert this "string" column into "Timestamp"?
Do I have to do any conversion for the existing values, which are in different formats? How can I do it in Terraform?

Google Datastore has no notion of schema migrations; you're going to have to write a task queue job to do it.
The proper way would be to create a new column called timestamp_2 and backfill it. Here is an article GCP wrote on updating schemas:
https://cloud.google.com/appengine/articles/update_schema
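As a rough sketch of such a backfill (not the article's exact code) using the google-cloud-datastore Python client, where the kind MyKind and the new property timestamp_2 are placeholders: both of your string formats are ISO 8601, so datetime.fromisoformat can parse them once the trailing Z is normalized.
from datetime import datetime
from google.cloud import datastore

client = datastore.Client()

# Placeholder kind name; batch with put_multi and use cursors for real data volumes.
for entity in client.query(kind="MyKind").fetch():
    raw = entity.get("timestamp")
    if not raw:
        continue
    # "2021-06-01T12:00:00Z" -> "2021-06-01T12:00:00+00:00"; values that already
    # carry an offset such as "-05:00" parse as-is.
    entity["timestamp_2"] = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    client.put(entity)
Once backfilled, point your sorting at the new property. Terraform can manage things like Datastore indexes, but it won't rewrite the values of existing entities, so the conversion itself has to happen in a job like this.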

Related

Groupby existing attribute present in json string line in apache beam java

I am reading JSON files from GCS and I have to load the data into different BigQuery tables. These files may have multiple records for the same customer with different timestamps. I have to pick the latest among them for each customer. I am planning to achieve this as below:
Read files
Group by customer id
Apply DoFn to compare timestamp of records in each group and have only latest one from them
Flatten it, convert to table rows, and insert into BQ.
But I am unable to proceed with the group-by step. I see GroupByKey.create() but am unable to make it use customer id as the key.
I am implementing this in Java. Any suggestions would be of great help. Thank you.
Before you GroupByKey you need to have your dataset in key-value pairs. It would be good if you had shown some of your code, but without knowing much, you'd do the following:
PCollection<JsonObject> objects = p.apply(FileIO.read(....)).apply(FormatData...)
// Once we have the data in JsonObjects, we key by customer ID.
// MapElements needs .into(...) so the lambda's output type can be inferred:
PCollection<KV<String, Iterable<JsonObject>>> groupedData =
    objects
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptor.of(JsonObject.class)))
            .via(elm -> KV.of(elm.getString("customerId"), elm)))
        .apply(GroupByKey.create());
Once that's done, you can check timestamps and discard all but the most recent, as you were thinking.
Note that you will need to set coders, etc - if you get stuck with that we can iterate.
As a hint / tip, you can consider this example of a Json Coder.

Athena shows no value against boolean column, table created using glue crawler

I am using an AWS Glue CSV crawler to crawl an S3 directory containing CSV files. The crawler works fine in the sense that it creates the schema with correct data types for each column; however, when I query the data from Athena, it doesn't show any value under the boolean-type column.
A csv looks like this:
"val","ts","cond"
"1.2841974","15/05/2017 15:31:59","True"
"0.556974","15/05/2017 15:40:59","True"
"1.654111","15/05/2017 15:41:59","True"
And the table created by crawler is:
Column name Data type
val string
ts string
cond boolean
However, when I run say select * from <table_name> limit 10 it returns:
val ts cond
1 "1.2841974" "15/05/2017 15:31:59"
2 "0.556974" "15/05/2017 15:40:59"
3 "1.654111" "15/05/2017 15:41:59"
Does anyone have any idea what might be the reason?
I forgot to add: if I change the data type of the cond column to string, it does show the data as a string, e.g. "True" or "False".
I don't know why Glue classifies the cond column as boolean, because Athena will not understand that value as a boolean. I think this is a bug in Glue, or an artefact of it not targeting Athena exclusively. Athena expects boolean values to be either true or false. I don't remember if that includes different capitalizations of the strings or not, but either way yours will fail because they are quoted. The actual bug is that Glue has not configured your table so that it strips the quotes from the strings, and therefore Athena sees a boolean column containing "True" with quotes and all, and that is not a supported boolean value. Instead you get NULL values.
You could try changing your tables to use the OpenCSVSerDe instead; it supports quoted values.
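If you want to switch the existing crawler-created table to that SerDe in place (rather than recreating it), a rough boto3 sketch might look like the following; the database and table names are placeholders, and the crawler may revert the change on a later run unless its schema-change settings are adjusted.
import boto3

glue = boto3.client("glue")
database = "my_database"  # placeholder
table_name = "my_table"   # placeholder

# Fetch the current definition, swap in the OpenCSVSerDe, and write it back.
table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
table["StorageDescriptor"]["SerdeInfo"] = {
    "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
    "Parameters": {"separatorChar": ",", "quoteChar": "\"", "escapeChar": "\\"},
}

# get_table returns read-only metadata that update_table's TableInput does not accept.
for key in ("DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
            "IsRegisteredWithLakeFormation", "CatalogId", "VersionId"):
    table.pop(key, None)

glue.update_table(DatabaseName=database, TableInput=table)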
It's surprising that Glue continues to stumble on basic things like this. Glue is unfortunately rarely worth the effort over writing some basic scripts yourself.

Dataprep change str yyyymmdd date to datetime column

I have a column with dates (in a string format) in Dataprep: yyyymmdd. I would like it to become a datetime object. Which function/transformation should I apply to achieve this result automatically?
In this case, you actually don't need to apply a transformation at all—you can just change column type to Date/Time and select the appropriate format options.
Note: This is one of the least intuitive parts of Dataprep as you have to select an incorrect format (in this case yy-mm-dd) before you can drill-down to the correct format (yyyymmdd).
Here's a screenshot of the Date / Time type window to illustrate this:
While it's unintuitive, this will correctly treat the column as a date in future operations, including assigning the correct type in export operations (e.g. BigQuery).
Through the UI, this will generate the following Wrangle Script:
settype col: YourDateCol customType: 'Datetime','yy-mm-dd','yyyymmdd' type: custom
According to the documentation, this should also work (and is more succinct):
settype col: YourDateCol type: 'Datetime','yy-mm-dd','yyyymmdd'
Note that if you absolutely needed to do this in a function context, you could extract the date parts using SUBSTRING/LEFT/RIGHT and pass them to the DATE or DATETIME function to construct a datetime object. As you've probably already found, DATEFORMAT will return NULL if the source column isn't already of type Datetime.
(From a performance standpoint though, it would probably be far more efficient for a large dataset to either just change the type or create a new column with the correct type, versus having to perform those extra operations on so many rows.)

How do I ensure that the AWS Glue crawler I've written is using the OpenCSV SerDe instead of the LazySimpleSerDe?

For context: I skimmed this previous question but was dissatisfied with the answer for two reasons:
I'm not writing anything in Python; in fact, I'm not writing any custom scripts for this at all as I'm relying on a crawler and not a Glue script.
The answer is not as complete as I require since it's just a link to some library.
I'm looking to leverage AWS Glue to accept some CSVs into a schema, and using Athena, convert that CSV table into multiple Parquet-formatted tables for ETL purposes. The data I'm working with has quotes embedded in it, which would be okay save for the fact that one record I have has a value of:
"blablabla","1","Freeman,Morgan","bla bla bla"
It seems that Glue is tripping over itself when it encounters the "Freeman,Morgan" piece of data.
If I use the standard Glue crawler, I get a table created with the LazySimpleSerDe, which truncates the record above in its column to:
"Freeman,
...which is obviously not desirable.
How do I force the crawler to output the file with the correct SerDe?
[Unpleasant] Constraints:
Looking to not accomplish this with a Glue script, since for that to work I believe I have to have a table beforehand, whereas the crawler will create the table on my behalf.
If I have to do this all through Amazon Athena, I'd feel like that would largely defeat the purpose but it's a tenable solution.
This is going to turn into a very dull answer, but apparently AWS provides its own set of rules for classifying if a file is a CSV.
To be classified as CSV, the table schema must have at least two columns and two rows of data. The CSV classifier uses a number of heuristics to determine whether a header is present in a given file. If the classifier can't determine a header from the first row of data, column headers are displayed as col1, col2, col3, and so on. The built-in CSV classifier determines whether to infer a header by evaluating the following characteristics of the file:
Every column in a potential header parses as a STRING data type.
Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.
Every column in a potential header must meet the AWS Glue regex requirements for a column name.
The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.
I believed that I had met all of these requirements, given that the column names are wildly divergent from the actual data in the CSV, and ideally there shouldn't be much of an issue there.
However, in spite of my belief that it would satisfy the AWS Glue regex (which I can't find a definition for anywhere), I elected to move away from commas and to pipes instead. The data now loads as I expect it to.
Use glueContext.create_dynamic_frame_from_options() while converting CSV to Parquet, and then run the crawler over the Parquet data.
df = glueContext.create_dynamic_frame_from_options("s3", {"paths": [src]}, format="csv")
The default separator is , and the default quoteChar is ".
If you wish to change them, check https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html
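As a hedged sketch of that conversion (the bucket paths are placeholders), with the defaults written out explicitly in format_options so you can see where they would be overridden:
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

src = "s3://my-bucket/csv-input/"       # placeholder
dst = "s3://my-bucket/parquet-output/"  # placeholder

# Read the CSVs, spelling out the separator and quoteChar defaults.
df = glueContext.create_dynamic_frame_from_options(
    "s3",
    {"paths": [src]},
    format="csv",
    format_options={"withHeader": True, "separator": ",", "quoteChar": "\""},
)

# Write the data back out as Parquet; point the crawler at dst afterwards.
glueContext.write_dynamic_frame.from_options(
    frame=df,
    connection_type="s3",
    connection_options={"path": dst},
    format="parquet",
)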

How to work with data different from schema in pandas python

I am currently using pandas (0.22.0) with read_table with names.
How can I address when my underlying data schema changes?
For example, my read_table is reading 5 columns and the data file has 5 columns. How would I tackle changes in the data (e.g. when a new column is added)? Does that mean that I have to update the schema when the data format changes? Is there a way to ignore the columns not mentioned via names in pandas read_table?
There is a usecols parameter that you can pass to read_table to read only a subset of the available data. So long as the 5 columns that you are concerned with are always present, you should be able to name them explicitly in the call.
cols_of_interest = ['col1', 'col2', 'col3', 'col4', 'col5']
df = pd.read_table(file_path, usecols=cols_of_interest)
Documentation for pd.read_table here - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html
Note that you can also pass a callable which can decide which columns to parse, or specify column indices instead of named columns (depends on the underlying data I guess).
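For example (file_path is a placeholder; this is just a sketch of those two variants):
import pandas as pd

file_path = "data.tsv"  # placeholder

# By position: keep only the first five columns, whatever they end up being named.
df_by_position = pd.read_table(file_path, usecols=range(5))

# By callable: keep any column whose name is in the set we care about.
wanted = {"col1", "col2", "col3", "col4", "col5"}
df_by_name = pd.read_table(file_path, usecols=lambda name: name in wanted)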
The problem I have here is that I am iterating over data files with a set schema using read_table and names. I do not want to update the schema every time the underlying data changes.
I found a work-around (more of a hack) at this point: I added a few 'dummy' columns to the names array.
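A rough illustration of that hack (the column names, path, and padding count are placeholders): pad names with throwaway labels to leave headroom for columns the file may gain later, then keep only the columns you actually use.
import pandas as pd

file_path = "data.tsv"  # placeholder

base_names = ["col1", "col2", "col3", "col4", "col5"]
padding = [f"dummy{i}" for i in range(3)]  # headroom for up to 3 future columns

# Read with the padded name list, then drop the dummy columns.
df = pd.read_table(file_path, names=base_names + padding, header=None)
df = df[base_names]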