Google cloud not recognizing header with string column - google-cloud-platform

When I try to create a dataset in BigQuery, it is not able to autodetect my header (which is the first row) because I have a column containing only string values. Is there any way to circumvent this?
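One common workaround is to skip autodetect entirely: declare the schema yourself and tell the load job that the first row is a header. This is a minimal sketch using the Python client library; the bucket, table, and column names are placeholders for your own:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # first row is the header, not data
    autodetect=False,     # use the explicit schema below instead
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("city", "STRING"),
        bigquery.SchemaField("amount", "NUMERIC"),
    ],
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/data.csv",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load to finish

The same idea works in the console or the bq CLI: turn off auto-detect, enter the schema manually, and set the number of header rows to skip to 1.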

Related

How to determine if my AWS Glue Custom CSV Classifier is working?

I am using AWS Glue to catalog (and hopefully eventually transform) data. I am trying to create a Custom CSV Classifier for the crawler so that I can provide a known set of column headers to the table. The data is in TSV (tab-separated value) format, and the files themselves have no header row. There is no quote character in the data, but there are 1 or 2 columns which use a double quote in the data, so I've indicated in the Classifier that it should use a single quote ('). To ensure I start clean, I delete the AWS Glue Catalog Table and then run the Crawler with the Classifier attached. When I subsequently check the created table, it lists csv as the classification, and the column names specified in the Classifier are not associated with the table (they are instead labelled col0, col1, col2, col3, etc.). Further, when inspecting a few rows in the table, it appears as though the data associated with the columns does not use the same column ordering as the raw data itself, which I can confirm because I have a copy of the raw data open locally on my computer.
AWS Glue Classifier documentation indicates that a crawler will attempt to use the Custom Classifiers associated with a Crawler in the order they are specified in the Crawler definition, and if no match is found with certainty 1.0, it will use Built-in Classifiers. In the event a match with certainty 1.0 is still not found, the Classifier with the highest certainty will be used.
My questions are these:
How do I determine if my Custom CSV Classifier (which I have specifically named, say for sake of argument customClassifier) is actually being used, or if it is defaulting to the Built-In CSV Classifier?
More importantly, given the situation above (having the columns known but separate from the data, and having double quotes be used in the actual data but no quoted values), how do I get the Crawler to use the specified column names for the table schema?
Why does it appear as though my data in the Catalog is not using the column order specified in the file (even if it is the generic column names)?
If it is even possible, how could I use an ApplyMapping transform to rename the columns for the workflow (which would be sufficient for my case)? I need to do so without enabling script-only mode (by modifying an AWS Glue Studio Workflow), and without manually entering over 200 columns (see the sketch below).
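For reference, this is roughly what an ApplyMapping call looks like in a Glue PySpark script, with the mappings generated from a list of column names rather than typed by hand. It is only a sketch: the database, table, and column names are placeholders, it assumes every field is a string, and it does require a script step rather than a purely visual Glue Studio transform.

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical list of the real column names, in the same order as the data;
# in practice this could be loaded from a small config file instead of typed out.
column_names = ["customer_id", "order_date", "amount"]  # ... ~200 entries

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table"
)

# Map the crawler's generic col0, col1, ... names onto the real ones.
mappings = [(f"col{i}", "string", name, "string") for i, name in enumerate(column_names)]
renamed = ApplyMapping.apply(frame=dyf, mappings=mappings)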

Manipulating .xls columns and rows with Open Refine

I need to manipulate a data set such that it can be mapped with Google Fusion Tables. Current xls data is formatted as follows:
Image of xls file with personal data anonymized
Note that a blank row indicates a new entry. I need the information in the column to be sorted into rows under the appropriate headings, specifically the address for geocoding. Any ideas?
First, do some cleanup to merge your second and third columns into a single one, then use the Columnize by key/value column feature to transpose the data in the third and fourth columns into separate fields.
Once this is done, Fusion Tables should be able to geocode the dataset based on the address. If not, there are plenty of tutorials on geocoding a dataset with OpenRefine. See:
OpenRefine wiki,
Google Maps,
OpenStreetMap,
Yahoo Maps.

Extract data from JSON field with Power BI desktop

I'm using Power BI desktop to connect to a MySQL database.
One of the fields contains data with the following structure:
a:1:{s:3:"IVA";O:8:"stdClass":3:{s:11:"tax_namekey";s:3:"IVA";s:8:"tax_rate";s:7:"0.23000";s:10:"tax_amount";d:25.07000000000000028421709430404007434844970703125;}}
I need to transform the data in a way that allows the extraction of the value of the tax amount. That is, I need to transform this column to: 25.07.
How can I do this? I tried splitting the column by semicolon, but since not all the values have the same number of semicolons, it didn't work.
Thanks in advance!
Use this function. It works only for your task - it parses the number 25.07 from the source string:
(src) => Splitter.SplitTextByEachDelimiter({";d:",";"})(src){1}
The value in the column is not actually JSON (it looks like PHP-serialized data). Power BI does have an option to parse a JSON column, but the value has to be valid JSON. To check whether a value is valid JSON, try this link:
https://jsonformatter.curiousconcept.com/
After that, go to Edit Queries, right-click on the JSON column, and choose Transform -> JSON.
It will expand your JSON value into columns.

Pentaho DI (Kettle) best way to select flow based on csv file header?

I'm using Pentaho DI (Kettle) and I'm not sure what's the best way to do the following:
From a downloaded csv file, check if a column exists, and based on that select the right next step.
There are 3 possible options.
Thanks,
Isaac
You did not mention the possible options, so I'll just provide you with a sketch showing how to check if a column exists in a file.
For this you will need a CSV file input step and a Metadata structure of stream step, which reads the metadata of the incoming stream.
For a sample CSV file with 3 columns named col1, col2 and col3, you get every column as a separate row, with its name as the value of the Fieldname column, in the output of the Metadata structure of stream step.
Then, depending on your needs, you could use, for example, a Filter Rows or a Switch / Case step for further processing.

Parquet: read particular columns into memory

I have exported a MySQL table to a Parquet file (Avro based). Now I want to read particular columns from that file. How can I read particular columns completely? I am looking for Java code examples.
Is there an API where I can pass the columns I need and get back a 2D array of the table?
If you can use Hive, creating a Hive table and issuing a simple select query would be by far the easiest option.
create external table tbl1(<columns>) stored as parquet location '<file_path>';
select col1, col2 from tbl1;
-- this works in Hive 0.14
You can use the Hive JDBC driver to do that from a Java program as well.
Otherwise, if you want to stay completely in Java, you need to modify the Avro schema by excluding all the fields but the ones you want to fetch. Then, when you read the file, supply the modified schema as the reader schema and it will only read the included columns. But you will get your original Avro record back with the excluded fields nullified, not a 2D array.
To modify the schema, look at org.apache.avro.Schema and org.apache.avro.SchemaBuilder. Make sure that the modified schema is compatible with the original schema.
Options:
Use a Hive table created with all the columns and Parquet as the storage format, and read the required columns by specifying the column names
Create a Thrift definition for the table and use the Thrift fields to read the data from code (Java or Scala)
You can also use Apache Drill, which natively parses Parquet files.
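If staying completely in Java is negotiable, one more option is to do the column projection in Python with pyarrow; Parquet is columnar, so only the requested columns are read from the file. This is a minimal sketch, with the file path and column names as placeholders:

import pyarrow.parquet as pq

# Only the listed columns are read from the Parquet file.
table = pq.read_table("export.parquet", columns=["col1", "col2"])

# Turn the column-oriented result into rows, close to the "2D array" asked for.
cols = table.to_pydict()          # {"col1": [...], "col2": [...]}
rows = list(zip(*cols.values()))  # [(col1_value, col2_value), ...]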