How to work with data different from schema in pandas python - python-2.7

I am currently using pandas (0.22.0) and calling read_table with names.
How can I handle changes to the underlying data schema?
For example, my read_table call reads 5 columns and the data file has 5 columns. How would I tackle changes in the data? When a new column is added to the data, does that mean I have to update the schema every time the data format changes? Is there a way to ignore the columns not mentioned via names in pandas read_table?

There is a usecols parameter that you can pass to read_table to read only a subset of the available data. So long as the 5 columns that you are concerned with are always present, you should be able to name them explicitly in the call.
import pandas as pd
cols_of_interest = ['col1', 'col2', 'col3', 'col4', 'col5']
df = pd.read_table(file_path, usecols=cols_of_interest)
Documentation for pd.read_table here - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html
Note that you can also pass a callable which can decide which columns to parse, or specify column indices instead of named columns (depends on the underlying data I guess).
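As a rough sketch of the three usecols styles mentioned above (names, a callable, and positions), here is a small self-contained example. The sample data and the new_col column are made up for illustration, and it uses pd.read_csv with sep="\t", which should behave like read_table for tab-separated input. It assumes a reasonably recent pandas on Python 3.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample: three wanted columns plus one column
# that the data producer added later.
with_header = "col1\tcol2\tcol3\tnew_col\n1\t2\t3\t4\n5\t6\t7\t8\n"
cols_of_interest = ["col1", "col2", "col3"]

# 1) Select by header name; new_col is simply never parsed.
by_name = pd.read_csv(io.StringIO(with_header), sep="\t", usecols=cols_of_interest)

# 2) Select with a callable that is given each column name in turn.
by_callable = pd.read_csv(
    io.StringIO(with_header),
    sep="\t",
    usecols=lambda name: name in cols_of_interest,
)

# 3) Headerless file: select by position, then label the kept columns.
no_header = "1\t2\t3\t4\n5\t6\t7\t8\n"
by_position = pd.read_csv(
    io.StringIO(no_header),
    sep="\t",
    header=None,
    usecols=[0, 1, 2],
    names=cols_of_interest,
)

print(by_position)
```

The positional form is probably the closest fit when the files have no header row and you are already passing names, since extra trailing columns are then ignored without touching the names list.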

The problem I have here is that I am iterating over data files with a set schema, using read_table and names. I do not want to update the schema every time the underlying data changes.
I found a work-around (more of a hack) for now: I added a few 'dummy' columns to the names array.

Related

ColumnsofType Record not returning any columns

I've got a table full of different data types, including records, and I want to extract the names of all columns that contain records so I can use them in an expand function. I've included a screenshot of a column containing records. However, when I use = Table.ColumnsOfType(#"Expanded fields", {type record}), it returns an empty list.
I've tried looking through the entire column to see if there was anything different, but it's all record types. Any help please.
EDIT:
Error using Table.TransformColumnTypes
Record is not a valid type to search for. And judging by your image, your column's type is Type.Any, as denoted by the ABC123.
Your best bet is to unpivot all the columns (perhaps those starting with a certain prefix), then expand the new Value column like so:
#"PriorStepNameHere" = .... ,
ExpandList= List.Distinct(List.Combine(List.Transform(Table.Column(#"PriorStepNameHere", "Value"), each if _ is record then Record.FieldNames(_) else {}))),
Expand= Table.ExpandRecordColumn(#"PriorStepNameHere", "Value", ExpandList,ExpandList)
It sounds like the Table.ColumnsOfType function is not properly identifying the columns in your table that contain records. One possible reason is that the column's datatype is not set to 'record'. Another is that the data in the columns is not structured properly and so is not being identified as a record. You can try using the Table.TransformColumnTypes function to convert the column's datatype to 'record' and see if that resolves the issue.
If the issue still persists, please share the sample data and the code you are using.

Converting a datastore column from string to timestamp

I have a datastore entity which has a column named timestamp. It was supposed to be a timestamp type, but it is a string type as of now. This column has values in 2 formats: YYYY-MM-DDTHH:MM:SSZ and YYYY-MM-DDTHH:MM:SS-offset_hours.
In our code, we sort on timestamp, which is essentially sorting the "string". Now the question is, how can I convert this "string" column into a "Timestamp"?
Do I have to do any conversion for existing values which are in a different format? How can I do it in terraform?
Google Datastore has no notion of schema migrations, so you're going to have to write a taskqueue job to do it.
The proper way would be to create a new column called timestamp_2 and backfill it. Here is an article GCP wrote:
https://cloud.google.com/appengine/articles/update_schema
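Whatever job ends up doing the backfill, it needs to turn both string formats into one comparable value. Below is a minimal sketch of just that conversion step, using only the standard library. The function name parse_timestamp is hypothetical, and the accepted offset spellings (-05, -0500, -05:00) are an assumption, since the question does not say exactly how the offset is written. Reading and writing the entities is left out.

```python
import re
from datetime import datetime, timedelta, timezone

# Date-time part followed by either "Z" or a numeric offset.
# Assumption: offsets look like -05, -0500 or -05:00.
_TS = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(Z|[+-]\d{2}(?::?\d{2})?)$"
)


def parse_timestamp(value):
    """Convert either string format into a timezone-aware UTC datetime."""
    match = _TS.match(value)
    if match is None:
        raise ValueError("unrecognised timestamp: %r" % (value,))
    local = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    suffix = match.group(2)
    if suffix == "Z":
        offset = timedelta(0)
    else:
        sign = -1 if suffix[0] == "-" else 1
        digits = suffix[1:].replace(":", "")
        offset = sign * timedelta(hours=int(digits[:2]), minutes=int(digits[2:] or 0))
    # Local time minus its UTC offset gives the UTC instant.
    return (local - offset).replace(tzinfo=timezone.utc)


values = ["2020-01-01T12:00:00Z", "2020-01-01T11:00:00-05"]
as_strings = sorted(values)                           # lexicographic order
as_instants = sorted(values, key=parse_timestamp)     # chronological order
print(as_strings)
print(as_instants)
```

The two sorted lists come out in opposite orders, which is why sorting the raw strings is unreliable once offsets are mixed in.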

Athena Query Results: Are they always strings?

I'm in the process of building new "ETL" pipelines with CTAS. Unfortunately, quite often the CTAS query is too intensive, which causes Athena to time out. As such, I use CTAS to create the initial table and populate it with a small sample. I then write a script that queries the same table the CTAS was generated from (which is in Parquet format) for the remaining days that the CTAS couldn’t handle upfront. I write the output of these queries to the same directory that holds the results of the CTAS query, then repair the table (to pick up the new data). However, it seems to be a pretty clunky process for a number of reasons:
1) Query results written out with standard SQL statements all end up being strings. For example, when I write out the number of DAUs (which is a count cast to an int), the CSV output is a string, i.e. wrapped in “”.
Is it possible to write out Athena "query_results" (not the CTAS) as anything other than a string when in CSV format? The main problem is that the output can't be read back into the table produced by the CTAS, since these columns expect a bigint. This, of course, can be resolved with a lambda function, but that seems like a big overhead for something that should be trivial.
2) Can you put query results (not from CTAS) directly into Parquet instead of CSV?
3) Is there any way to prevent metadata being generated with the query_results (not from CTAS)? Again, it can be cleaned up with a lambda function, but it's just additional nonsense I need to handle.
Thanks in advance!
The data type of the result depends on the SQL used to create it and also on how you consume it. Based on your question I'm going to assume that you're creating a table using CTAS and that the output is CSV, and that you're then looking at the CSV data directly.
That CSV is going to have quotes in it, but that doesn't mean that it's not possible to read integer values as integers, and so on. Athena uses a schema-on-read approach, and as long as the serde can interpret a value as a particular type, that type will work as the type of the column.
If you query the table created by your CTAS operation you should get back integers for the integer columns.
Using CTAS you can also create output of different types, like JSON, Avro, Parquet, and ORC, that keep the type information. Just use the format property to select the output type.
I am a bit confused what you mean by your third question. With a normal query you get two files on S3, the data file and the metadata file, and they will be written to the output location given in the StartQueryExecution API call, but with a CTAS query you get the output data in a different location (given in the SQL) than the metadata file.
Are you actually using CTAS, or are you talking about the regular query result files?
Update after the question got clarified:
1) Athena is unfortunately unable to properly read its own output in many situations. It really surprises me that this was never considered before launch. You might be able to set up a table that uses the regex serde.
2) No, unfortunately the only output of a regular query is CSV at this time.
3) No, the metadata is always written to the same prefix as the output.
I think your best bet is running multiple CTAS queries that select subsets of your source data. If there is a date column, for example, you could run one CTAS per month or some other time range that works. After the CTAS queries have completed, you can move the result files into the same directory on S3 and create a final table that has that directory as its location.
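To make the "one CTAS per month" idea concrete, here is a rough sketch that only builds the SQL text for each month. Everything specific in it is hypothetical: the table names source_events and tmp_events_*, the event_date column and the example-bucket location are placeholders. Submitting the statements and moving the result files afterwards is left out.

```python
from datetime import date


def month_starts(first, last):
    """Yield the first day of every month from `first` through `last`."""
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        yield date(year, month, 1)
        month += 1
        if month == 13:
            year, month = year + 1, 1


def next_month(start):
    """Return the first day of the month after `start`."""
    return date(start.year + (start.month == 12), start.month % 12 + 1, 1)


def build_ctas(start):
    """Build one CTAS statement covering a single calendar month."""
    # Hypothetical names: tmp_events_*, source_events, event_date, example-bucket.
    return (
        "CREATE TABLE tmp_events_{suffix}\n"
        "WITH (format = 'PARQUET',\n"
        "      external_location = 's3://example-bucket/staging/{suffix}/')\n"
        "AS SELECT * FROM source_events\n"
        "WHERE event_date >= DATE '{start}' AND event_date < DATE '{end}'"
    ).format(
        suffix=start.strftime("%Y_%m"),
        start=start.isoformat(),
        end=next_month(start).isoformat(),
    )


statements = [build_ctas(m) for m in month_starts(date(2019, 11, 15), date(2020, 1, 31))]
print(statements[1])
```

Each statement writes to its own location, so a month that times out can be retried or split further without touching the others.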

How do I ensure that the AWS Glue crawler I've written is using the OpenCSV SerDe instead of the LazySimpleSerDe?

For context: I skimmed this previous question but was dissatisfied with the answer for two reasons:
I'm not writing anything in Python; in fact, I'm not writing any custom scripts for this at all as I'm relying on a crawler and not a Glue script.
The answer is not as complete as I require since it's just a link to some library.
I'm looking to leverage AWS Glue to accept some CSVs into a schema, and using Athena, convert that CSV table into multiple Parquet-formatted tables for ETL purposes. The data I'm working with has quotes embedded in it, which would be okay save for the fact that one record I have has a value of:
"blablabla","1","Freeman,Morgan","bla bla bla"
It seems that Glue is tripping over itself when it encounters the "Freeman,Morgan" piece of data.
If I use the standard Glue crawler, I get a table created with the LazySimpleSerDe, which truncates the record above in its column to:
"Freeman,
...which is obviously not desirable.
How do I force the crawler to output the file with the correct SerDe?
[Unpleasant] Constraints:
Looking to not accomplish this with a Glue script, since for that to work I believe I have to have a table beforehand, whereas the crawler will create the table on my behalf.
If I have to do this all through Amazon Athena, I'd feel like that would largely defeat the purpose but it's a tenable solution.
This is going to turn into a very dull answer, but apparently AWS provides its own set of rules for classifying if a file is a CSV.
To be classified as CSV, the table schema must have at least two columns and two rows of data. The CSV classifier uses a number of heuristics to determine whether a header is present in a given file. If the classifier can't determine a header from the first row of data, column headers are displayed as col1, col2, col3, and so on. The built-in CSV classifier determines whether to infer a header by evaluating the following characteristics of the file:
Every column in a potential header parses as a STRING data type.
Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.
Every column in a potential header must meet the AWS Glue regex requirements for a column name.
The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.
I believed that I had met all of these requirements, given that the column names are wildly divergent from the actual data in the CSV, and ideally there shouldn't be much of an issue there.
However, in spite of my belief that it would satisfy the AWS Glue regex (which I can't find a definition for anywhere), I elected to move away from commas and to pipes instead. The data now loads as I expect it to.
Use glueContext.create_dynamic_frame_from_options() while converting csv to parquet and then run crawler over parquet data.
df = glueContext.create_dynamic_frame_from_options("s3", {"paths": [src]}, format="csv")
Default separator is ,
Default quoteChar is "
If you wish to change them, see https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html
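To see why the separator and quote character matter for the record in the question, here is a small illustration using Python's standard csv module. It is only an analogy for quote-aware versus plain delimiter splitting. It does not call Glue or reproduce either SerDe.

```python
import csv

# The problem record from the question.
line = '"blablabla","1","Freeman,Morgan","bla bla bla"'

# Plain delimiter splitting ignores quotes, so the embedded comma
# breaks the third field in two.
naive_fields = line.split(",")

# A quote-aware parser treats everything between quoteChar pairs as one field.
quoted_fields = next(csv.reader([line], delimiter=",", quotechar='"'))

print(naive_fields)
print(quoted_fields)
```

The naive split yields five fields with the name cut at the comma, while the quote-aware parse yields the four intended fields.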

How to subtract 2 columns with dtype = object within data frame to form a new column of the difference pandas

I have a merged data frame (mdf) built from 2 data frames that are retrieved from SQL. I wish to create a new column within mdf that is the difference of 2 existing columns.
I'm not sure what you mean by a "merge data frame," but here's a sketch of what you might be after. Please elaborate on your question a little so it will be more useful to others.
df = pd.read_sql('select ....', some_sql_connection)
df['difference'] = df['some column name'] - df['another column name']
Also, referring to the title of your question where you mention dtype=object, data extracted from a SQL database sometimes defaults to the generic object datatype, even if it is actually numeric. (This is not ideal, and better handling of datatypes to and from SQL databases is being actively improved for a future release of pandas.)
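For now, before manipulating your data, you might want to run df.convert_objects(convert_numeric=True) if you have all numerical data. Note that convert_objects has been deprecated since pandas 0.21 in favour of pd.to_numeric, so prefer that on newer versions. See documentation.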
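On pandas versions where convert_objects is deprecated or removed, pd.to_numeric should do the same job column by column. A minimal sketch follows, with a made-up frame standing in for mdf. The column names a and b and the sample values are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the merged frame: numeric values that arrived
# from SQL as generic Python objects (here, strings).
mdf = pd.DataFrame(
    {"a": ["10", "7.5", "3"], "b": ["4", "2.5", "x"]},
    dtype=object,
)

# Convert each object column to a numeric dtype; values that cannot be
# parsed (such as "x") become NaN instead of raising.
for col in ["a", "b"]:
    mdf[col] = pd.to_numeric(mdf[col], errors="coerce")

# With numeric dtypes in place, the subtraction is element-wise.
mdf["difference"] = mdf["a"] - mdf["b"]
print(mdf)
```

Using errors="coerce" is a design choice: it keeps the job running on dirty rows, at the cost of silently turning bad values into NaN, so it may be worth counting the NaNs afterwards.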