sqoop: multiple export paths - mapreduce

Is it possible to export files from multiple HDFS locations in a single Sqoop command? I tried to specify multiple --export-dir options in the command, but it seems only one takes effect.

Assume you have three files:
.../basedir/folder_1/file.txt
.../basedir/folder_2/file.txt
.../basedir/folder_3/file.txt
To export all three files, use
--export-dir .../basedir/*/file.txt
To export only the files in folder_1 and folder_2, use
--export-dir .../basedir/{folder_1,folder_2}/file.txt
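For example, a complete export command using the first glob might look like the following sketch (the JDBC URL, table name, base path, and delimiter are placeholders, not values from the original question):
sqoop export \
  --connect jdbc:mysql://db-host/mydb \
  --username myuser -P \
  --table my_table \
  --export-dir '/user/me/basedir/*/file.txt' \
  --input-fields-terminated-by ','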

In an Oozie workflow using the Sqoop action, you can specify multiple paths, where each path comes from job.properties:
<arg>--export-dir</arg>
<arg>${rootPath}/{${folder1},${folder2},${folder3}}</arg>
Pay attention to the curly brackets.
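For reference, the corresponding entries in job.properties might look like this (the NameNode address, base path, and folder names are placeholders):
rootPath=hdfs://namenode:8020/data/basedir
folder1=folder_1
folder2=folder_2
folder3=folder_3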

Related

How to exclude either files or folder paths on S3 within an AWS Glue job when reading an Athena table?

We have an AWS Glue job that is attempting to read data from an Athena table that is being populated by HUDI. Unfortunately, we are running into an error that relates to create_dynamic_frame.from_catalog trying to read from these tables.
An error occurred while calling o82.getDynamicFrame. s3://bucket/folder/.hoodie/20220831153536771.commit is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [32, 125, 10, 125]
This appears to be a known issue on GitHub: https://github.com/apache/hudi/issues/5891
Unfortunately, no workaround was provided. We are attempting to see if we can exclude either the .hoodie folder or the *.commit file(s) within the additional_options of the create_dynamic_frame.from_catalog connection. Unfortunately, we are not having any success excluding either a file or a folder. Note: we have .hoodie files in the root directory as well as a .hoodie folder that contains a .commit file, among other files. We would prefer to exclude them all.
Per AWS:
"exclusions": (Optional) A string containing a JSON list of Unix-style glob patterns to exclude. For example, "["**.pdf"]" excludes all PDF files. For more information about the glob syntax that AWS Glue supports, see Include and Exclude Patterns.
Question: how do we exclude both file and folder from a connection?
Folder
datasource0 = glueContext.create_dynamic_frame.from_catalog(database=args['ENV']+"_some_database", table_name="some_table", transformation_ctx="datasource_x1", additional_options={"exclusions": "[\".hoodie/**\"]"})
File
datasource0 = glueContext.create_dynamic_frame.from_catalog(database=args['ENV']+"_some_database", table_name="some_table", transformation_ctx="datasource_x1", additional_options={"exclusions": "[\"**.commit\"]"})
It turns out that the originally attempted solution of {"exclusions": "[\"**.commit\"]"} worked. Unfortunately, I wasn't paying close enough attention, and there were multiple tables that needed to be excluded. After hacking through all of the file types, here are two working solutions:
Exclude folder
additional_options={"exclusions": "[\"s3://bucket/folder/.hoodie/*\"]"}
Exclude file(s)
additional_options={"exclusions": "[\"**.commit\",\"**.inflight\",\"**.properties\"]"}

Cloud Storage Text to BigQuery using GCP Template

I'm trying to execute a pipeline using the GCP template available at:
https://cloud.google.com/dataflow/docs/templates/provided-templates#cloud-storage-text-to-bigquery
But I'm getting the error:
2018-03-30 (15:35:17) java.lang.IllegalArgumentException: Failed to match any files with the pattern: gs://.......
Can anyone share a working CSV file to be used as an input for running that pipeline?
The problem was between chair and keyboard: you just need to create a CSV file that matches the data structure defined in the JSON schema file and transformed by the JavaScript UDF file.
I see that this has been answered, but I was having a similar issue and that answer was only partially helpful for me. As it turns out, the path pattern in the template (at the moment, at least) does not support some types of patterns.
For example, for multiple CSV files across multiple sub-directories in a given GCS path (this was my use-case):
gs://bucket-name/dir/
The pattern that will work is:
gs://bucket-name/dir/*/*.csv
These patterns, although they are valid via gsutil ls and return the correct files, will not work in the template:
gs://bucket-name/dir/*
gs://bucket-name/dir/*.csv
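For completeness, a run of the template with the working pattern might look roughly like this sketch (the job name, bucket, UDF, schema, and output table values are placeholders; check the parameter names against the current template documentation):
gcloud dataflow jobs run text-to-bq-example \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
    --parameters \
javascriptTextTransformGcsPath=gs://bucket-name/udf.js,\
javascriptTextTransformFunctionName=transform,\
JSONPath=gs://bucket-name/schema.json,\
inputFilePattern=gs://bucket-name/dir/*/*.csv,\
outputTable=my-project:my_dataset.my_table,\
bigQueryLoadingTemporaryDirectory=gs://bucket-name/tmp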

Pick up a particular file from a directory using regex in Talend

My directory contains files named WM_PersonFile_22022018, WM_PersonFile_23022018, WM_PersonFile_24022018, WM_PersonFile_25022018, and these files arrive on a daily basis. I am using tFileList to iterate through the files.
What should my regex in the Filemask be to pick up the most recent file? Should the Use Global Expressions as Filemask option be unchecked?
I tried "*.txt", which picks up all the files.
RegEx would help you to filter for the correct files.
Some other logic would get you the newest file. If you use tFileList, you might be able to sort by date and only take the first result.
Alternatively, if you also want to check the date in the filename is correct, you might need to add a little logic with a tMap, tAssert, tJava or tJavaRow.
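If the newest file is always the one carrying today's date, another option is to build the Filemask dynamically with Talend's date routines. This is only a sketch; it assumes the ddMMyyyy suffix seen in the file names, and you may need to append an extension pattern if the files have one:
"WM_PersonFile_" + TalendDate.formatDate("ddMMyyyy", TalendDate.getCurrentDate())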

I have headers separately, how do I import them into an Informatica target?

I have a source and a target in Informatica PowerCenter Developer. I need a different header name to be imported into the target file automatically, without any manual entry. How can I import customized headers into an Informatica target?
What have you tried?
You can use a header command in the session configuration for the target. I haven't used it, and I couldn't find any documentation on it (i.e. what is possible and how, whether parameters can be used or not, etc.). I did test using an ECHO command (on Windows) to output its text to the header row, but it didn't seem to recognize parameters.
Or you can try to include the header as the first data output row. That means your output will have to be all string types, and length restrictions may compound the issue.
Or you can try using two mappings, one that truncates the files and writes the header and one which outputs the data specifying append in the session. You may need two target definitions pointing to the same files. I don't know if the second mapping would attempt to load the existing data (i.e. typecheck), in which case it might throw an error if it didn't match.
Other options may be possible, we don't do much with flat files.
The logic is:
In the session command, there is an option called user defined headers. Type echo followed by the column names, comma delimited:
echo A, B, C

How to implement my product resource into a Pods structure?

Reading http://www.ember-cli.com/#pod-structure
Let's say I have a product resource, which currently has the following directory structure:
app/controllers/products/base.js
app/controllers/products/edit.js
app/controllers/products/new.js
app/controllers/products/index.js
With pods, is all the logic in these files put into a single file, app/products/controller.js?
At the same time, my routes and templates for these resources currently look like:
app/routes/products/base.js
app/routes/products/edit.js
app/routes/products/new.js
app/routes/products/index.js
app/templates/products/-form.hbs
app/templates/products/edit.hbs
app/templates/products/index.hbs
app/templates/products/new.hbs
app/templates/products/show.hbs
How should this be converted to Pods?
You can use ember generate --pod --dry-run to help with that:
$ ember g -p -d route products/base
version: 0.1.6
The option '--dryRun' is not supported by the generate command. Run `ember generate --help` for a list of supported options.
installing
You specified the dry-run flag, so no changes will be written.
create app/products/base/route.js
create app/products/base/template.hbs
installing
You specified the dry-run flag, so no changes will be written.
create tests/unit/products/base/route-test.js
$
(I don't know why it complains yet still honours the option; it might be a bug.)
So you'd end up with a structure like:
app/products/base/route.js
app/products/edit/route.js
etc.
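To answer the merging question: the logic is not collapsed into a single app/products/controller.js; each sub-route keeps its own pod directory with its own route, controller, and template. For the product resource you would end up with something along these lines (the -form partial is left out here, since partial placement is a separate question):
app/products/base/route.js
app/products/base/controller.js
app/products/index/route.js
app/products/index/controller.js
app/products/index/template.hbs
app/products/edit/route.js
app/products/edit/controller.js
app/products/edit/template.hbs
app/products/new/route.js
app/products/new/controller.js
app/products/new/template.hbs
app/products/show/template.hbs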