I have a parallel job in DataStage which loads a table from a CSV file and inserts the values into an Oracle DB.
I want to add a step that replaces values in one of the columns.
Let's say I have a column called ID, and I want to change values like null or "0" to the value "N/A".
How can I do it?
Thanks.
From the little information in your question, it seems like a pretty normal job with a Transformer stage in front of the Oracle target.
In the derivation field of that stage you can specify things like
NullToValue (a DataStage-provided function for exactly the null case you describe), so this would look like NullToValue(<inputcol>, "N/A"),
or a conditional expression like IF <inputcol> = "X" THEN "y" ELSE <inputcol>. To cover both cases from your question in a single derivation, something like If IsNull(<inputcol>) Or <inputcol> = "0" Then "N/A" Else <inputcol> should work.
This should do the job.
I am reading JSON files from GCS and I have to load the data into different BigQuery tables. These files may have multiple records for the same customer with different timestamps, and I have to pick the latest one for each customer. I am planning to achieve this as below:
Read the files.
Group by customer ID.
Apply a DoFn to compare the timestamps of the records in each group and keep only the latest one.
Flatten it, convert to TableRow, and insert into BQ.
But I am unable to proceed with the grouping step. I see GroupByKey.create(), but I am unable to make it use the customer ID as the key.
I am implementing this in Java. Any suggestions would be of great help. Thank you.
Before you can apply GroupByKey you need to have your dataset in key-value pairs. It would be good if you had shown some of your code, but without knowing much more, you'd do something like the following:
// Read the files (e.g. with TextIO for newline-delimited JSON); FormatData... stands in for your own transform that parses each line into a JsonObject.
PCollection<JsonObject> objects = p.apply(TextIO.read().from(...)).apply(FormatData...);
// Once we have the data in JsonObjects, we key by customer ID and group:
PCollection<KV<String, Iterable<JsonObject>>> groupedData =
    objects
        .apply(MapElements.into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptor.of(JsonObject.class)))
            .via((JsonObject elm) -> KV.of(elm.getString("customerId"), elm)))
        .apply(GroupByKey.create());
Once that's done, you can check the timestamps and discard all but the most recent, as you were thinking.
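As a minimal sketch of that step (this assumes each record carries a string field named "timestamp" whose format sorts chronologically, e.g. ISO-8601; otherwise parse it into an Instant first and compare those):
PCollection<JsonObject> latestPerCustomer = groupedData.apply(
    MapElements.into(TypeDescriptor.of(JsonObject.class))
        .via((KV<String, Iterable<JsonObject>> kv) -> {
          JsonObject newest = null;
          // Walk the group and keep the record with the greatest timestamp.
          // "timestamp" is an assumed field name - adjust it to your schema.
          for (JsonObject rec : kv.getValue()) {
            if (newest == null
                || rec.getString("timestamp").compareTo(newest.getString("timestamp")) > 0) {
              newest = rec;
            }
          }
          return newest;
        }));
From there, converting each surviving JsonObject into a TableRow and writing it with BigQueryIO is the remaining part of the plan in the question.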
Note that you will need to set coders, etc - if you get stuck with that we can iterate.
As a hint / tip, you can consider this example of a Json Coder.
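For example, a minimal coder sketch for JsonObject (assuming the javax.json API) that simply round-trips each object through its JSON text could look like this:
// Needs: org.apache.beam.sdk.coders.{CustomCoder, StringUtf8Coder}, javax.json.{Json, JsonObject, JsonReader}, java.io.*
public class JsonObjectCoder extends CustomCoder<JsonObject> {
  private static final StringUtf8Coder STRING_CODER = StringUtf8Coder.of();

  @Override
  public void encode(JsonObject value, OutputStream out) throws IOException {
    // Serialize the object as its JSON text.
    STRING_CODER.encode(value.toString(), out);
  }

  @Override
  public JsonObject decode(InputStream in) throws IOException {
    // Parse the JSON text back into a JsonObject.
    try (JsonReader reader = Json.createReader(new StringReader(STRING_CODER.decode(in)))) {
      return reader.readObject();
    }
  }
}
You would then register it once on the pipeline, e.g. p.getCoderRegistry().registerCoderForClass(JsonObject.class, new JsonObjectCoder());, so the PCollections of JsonObject above can be serialized between workers.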
I am using an AWS Glue CSV crawler to crawl an S3 directory containing CSV files. The crawler works fine in the sense that it creates the schema with the correct data type for each column; however, when I query the data from Athena, it doesn't show any values under the boolean-type column.
A CSV file looks like this:
"val","ts","cond"
"1.2841974","15/05/2017 15:31:59","True"
"0.556974","15/05/2017 15:40:59","True"
"1.654111","15/05/2017 15:41:59","True"
And the table created by the crawler is:
Column name Data type
val string
ts string
cond boolean
However, when I run, say, select * from <table_name> limit 10, it returns:
val ts cond
1 "1.2841974" "15/05/2017 15:31:59"
2 "0.556974" "15/05/2017 15:40:59"
3 "1.654111" "15/05/2017 15:41:59"
Does anyone have any idea what might be the reason?
I forgot to add: if I change the data type of the cond column to string, it does show the data as strings, e.g. "True" or "False".
I don't know why Glue classifies the cond column as boolean, because Athena will not understand that value as a boolean. I think this is a bug in Glue, or an artefact of it not targeting Athena exclusively. Athena expects boolean values to be either true or false. I don't remember whether that includes different capitalizations of the strings, but either way yours will fail because they are quoted. The actual bug is that Glue has not configured your table to strip the quotes from the strings, so Athena sees a boolean column containing "True", quotes and all, which is not a supported boolean value. Instead you get NULL values.
You could try changing your table to use the OpenCSVSerDe (org.apache.hadoop.hive.serde2.OpenCSVSerde) instead; it supports quoted values.
It's surprising that Glue continues to stumble on basic things like this. Glue is unfortunately rarely worth the effort over writing some basic scripts yourself.
I have a dataset with JSON files in it. Some of the property names in these JSONs have spaces in them, like:
{
'propertyOne': 'something',
'property Two': 'something'
}
I've had this dataset crawled by several different crawlers to try to get the schema I want. For some reason, on one of my crawls the spaces were removed, but when trying to replicate the process I cannot get the spaces removed, and when querying in Athena I get this error:
HIVE_METASTORE_ERROR: : expected at position x in 'some string' but ' ' found instead.
Position x is the position of the space between 'property' and 'Two' in the JSON entry.
I would just like to be able to exclude this field or have the space removed when crawled, but I'm not sure how. I can't change the JSON format. Any help is appreciated.
This is actually a bug with the AWS Glue JSON classifier: it doesn't play nice with nested properties that have spaces in them. The syntax error is in the schema generated by the crawler, not in the JSON. It generates something like this:
struct<propertyOne:string, property Two:string>
The space in "property Two" should have been escaped by the crawler. At this point, generating the DDL for the table does not work either. We are also facing this issue and looking for workarounds.
I believe your only option, in this case, would be to create your own custom JSON classifier to select only those attributes you want the Crawler to add to the Data Catalog.
I.e., if you only want to retrieve propertyOne, you can specify the JSONPath expression $.propertyOne in the classifier.
Note also that your JSON should use double quotes; the single quotes could also be causing issues when the data is parsed.
I am trying to understand Apache NiFi inside and out, keeping files in HDFS, and I have various scenarios to work on. Please let me know the feasibility of each, with explanations. I am adding my current understanding to each scenario.
1. Can we check whether a null value is present within a single column? I have checked different processors and found a notNull property, but I think this works on file names, not on columns present within the file.
2. Can we drop a column from a file stored in HDFS using NiFi transformations?
3. Can we change column values, as in replace one text with another? I have checked the ReplaceText processor for this.
4. Can we delete a row from a file in the file system?
Please suggest the possibilities and how to achieve the goal.
Try with this:
1. Can we check whether a null value is present within a single column?
Yes. Using the ReplaceText processor you can check and replace the value if you want to replace it, or use RouteOnAttribute if you want to route based on a null-value condition.
2. Can we drop a column present in HDFS using NiFi transformations?
Yes. Using the same ReplaceText processor you can keep just the desired fields with a delimiter. For example, I only needed a current-date field and some mandatory fields in my data, comma-separated, so I provided the Replacement Value as
"${'userID'}","${'appID'}","${sitename}","${now():format("yyyy-MM-dd")}"
3. To change a column value (replace one text with another), use the same ReplaceText processor.
I currently have a package pulling data from an Excel file, but when pulling the data out I get rows I do not want. So I need to extract every row whose 'ID' field has any sort of letter in it.
I need to be able to run a regex such as "%[a-zA-Z]%" to pull out that data, but with the current limitations of the Conditional Split it's not letting me do that. Any ideas on how this can be done?
At the core of the logic, you would use a Script Transformation, as that's the only place you can access regular expressions.
You could simply add a second column to your data flow, IDCleaned, and that column would contain only cleaned values or NULL. You could then use the Conditional Split to filter good rows vs. bad ones. See also: System.Text.RegularExpressions.Regex.Replace error in C# for SSIS.
If you don't want to add another column, you can set your current ID column to be ReadWrite for the Script and then update in place. Perhaps adding a boolean column might make the Conditional Split logic easier at this point.