I am currently using Flume 1.7 and have configured a spooling directory source. I have enabled recursiveDirectorySearch=true to look into the subdirectories for files.
source.spoolDir=/tmp/test
and under /tmp/test, subdirectories get created with data files: /tmp/test/data1/file.csv, /tmp/test/data2/file2.csv.
I want the exact sub directory structure to be created in the HDFS sink path.
/sink/data1/file.csv
/sink/data2/file2.csv
When I use %{file} for the HDFS sink file path, I get the complete absolute path, and %{basename} gives me only the file name. I want to extract the subdirectory structure from the spoolDir source path. Is there any way to achieve this?
You can make use of the fileHeader and fileHeaderKey properties and refer to this header variable in your sink configuration to get the absolute path.
https://flume.apache.org/FlumeUserGuide.html#spooling-directory-source
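As a minimal sketch (agent, source, and sink names here are placeholders): enabling fileHeader puts the absolute path into a header the sink can reference. Note that trimming that down to just the subdirectory part is not something the stock configuration does; that would need a custom interceptor that rewrites the header.

```properties
# Sketch only -- agent/source/sink names are placeholders.
agent.sources.src.type = spooldir
agent.sources.src.spoolDir = /tmp/test
agent.sources.src.recursiveDirectorySearch = true
agent.sources.src.fileHeader = true
agent.sources.src.fileHeaderKey = file
agent.sources.src.basenameHeader = true
agent.sources.src.basenameHeaderKey = basename

# At the sink, %{file} expands to the absolute path
# (e.g. /tmp/test/data1/file.csv) and %{basename} to the
# file name only (e.g. file.csv).
agent.sinks.hdfs1.hdfs.path = hdfs://namenode/sink
agent.sinks.hdfs1.hdfs.filePrefix = %{basename}
```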
For example I have structure like this.
bucketname/checked/folder1/some files
bucketname/checked/folder2/some files
bucketname/checked/folder3/some files
bucketname/checked/folder4/some files
bucketname/checked/folder5/some files
bucketname/checked/folder6/some files
bucketname/checked/folder7/some files
bucketname/checked/folder8/some files
bucketname/checked/folder9/some files
bucketname/checked/folder10/some files
bucketname/checked/folder11/some files
......
......
bucketname/checked/folder-1million/some files
Now,
1. If I have to check whether folder99999 exists or not, what would be the best way to check it (we have the folder name, folder99999)?
2. If we simply check whether the path exists, and if not, conclude that the folder doesn't exist, would that work fine if we have millions of folders?
3. Which data structure does GCP use to retrieve the folder data?
The true answer is the one provided by John: the folder doesn't exist. All the files are stored at the root (bucket) level, and the file name is the full path. By human convention, the / is the folder separator, and the console displays fake folders.
If there are no files in a "folder", the "folder" doesn't exist; it is only deduced from the fully qualified object name. The folder is not a Cloud Storage resource.
It's also for that reason that you can only search by path prefix.
However, it depends on what you want to check. If you know exactly which folder you want to check and validate, and whether there is at least one file in it, you can directly list the files with the folder path as prefix.
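To make the flat-namespace point concrete, here is a pure-Python illustration (the object names are made up; with the real Cloud Storage API you would issue a list request with the prefix and a result limit of 1 instead of holding names in memory):

```python
# A bucket's object names form a flat namespace: "folders" are just
# name prefixes. Checking that a folder "exists" therefore means
# checking whether at least one object name starts with that prefix.
object_names = [
    "checked/folder1/a.txt",
    "checked/folder2/b.txt",
    "checked/folder99999/c.txt",
]

def folder_exists(names, folder):
    prefix = folder.rstrip("/") + "/"
    return any(name.startswith(prefix) for name in names)

print(folder_exists(object_names, "checked/folder99999"))  # True
print(folder_exists(object_names, "checked/folder42"))     # False
```

Because the server only needs to find one object with the prefix, this check stays cheap even with millions of "folders".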
I am trying to read the SACL properties of a folder.
The application will run on the Domain Controller, and it needs to read and update the SACL properties of a folder or file that is present in a member computer.
Are there any APIs available for this?
Can I use GetNamedSecurityInfo to read the file? If yes, how should the path of the file be formatted?
Consider that the domain is 'Raja.org' and the folder for which I am trying to set the SACL is 'C:\Test'.
What path should I pass to the GetNamedSecurityInfo function?
You could use GetNamedSecurityInfo, where pObjectName would be the path to the file and ObjectType is SE_FILE_OBJECT.
SE_FILE_OBJECT
Indicates a file or directory. The name string that identifies a file
or directory object can be in one of the following formats:
A relative path, such as FileName.dat or ..\FileName
An absolute path, such as FileName.dat, C:\DirectoryName\FileName.dat,
or G:\RemoteDirectoryName\FileName.dat.
A UNC name, such as \\ComputerName\ShareName\FileName.dat.
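Putting this together, a minimal C sketch for reading the SACL of a folder on a member computer might look like the following. The UNC path targeting the administrative C$ share is an assumption; reading a SACL also requires the SeSecurityPrivilege to be enabled in the calling token, and that privilege handling is omitted here:

```c
#include <windows.h>
#include <aclapi.h>
#include <stdio.h>

int main(void)
{
    PACL sacl = NULL;
    PSECURITY_DESCRIPTOR sd = NULL;

    /* Hypothetical UNC path to C:\Test on the member computer,
       reached through the administrative C$ share. */
    LPCSTR path = "\\\\MemberPC\\C$\\Test";

    /* SACL_SECURITY_INFORMATION requires SeSecurityPrivilege to be
       enabled in the process token (omitted for brevity). */
    DWORD rc = GetNamedSecurityInfoA(path, SE_FILE_OBJECT,
                                     SACL_SECURITY_INFORMATION,
                                     NULL, NULL, NULL, &sacl, &sd);
    if (rc != ERROR_SUCCESS) {
        printf("GetNamedSecurityInfo failed: %lu\n", rc);
        return 1;
    }
    printf("SACL ACE count: %lu\n", sacl ? (DWORD)sacl->AceCount : 0UL);
    LocalFree(sd);
    return 0;
}
```

The security descriptor returned through the last parameter must be freed with LocalFree; the SACL pointer points into that descriptor and is not freed separately.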
I’m using the Zip utility from the Info-Zip library to compress a directory tree into an xlsx file.
For that I’m using the following command:
zip -r -D res.xlsx source
source contains the correct directory tree of the xlsx file.
But if you then look at the resulting file structure, the source directory is included in the paths of all files and directories at the top level, and MS Office Excel will not be able to open the file. This is a well-known problem; to avoid it, zip needs to be run from inside the source directory.
The problem is that I want to use the source code of this utility in my project, so I cannot simply call a separate zip process, run in the right working directory, to compress directories into xlsx files.
I’ve tried to find the place in the zip source code where the parent directory is prepended at the top level, but it seems to be done implicitly.
Can anyone suggest how this can be done?
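For comparison, the desired archive layout can be produced by computing entry names relative to the source directory, which is what running zip from inside source effectively does. A minimal sketch using Python's zipfile module (not the Info-Zip C code, but the same idea of stripping the top-level prefix):

```python
import os
import zipfile

def zip_dir_contents(source_dir, dest_zip):
    """Zip the contents of source_dir so that entry names are relative
    to source_dir itself, i.e. without the top-level "source/" prefix
    that `zip -r res.xlsx source` would add."""
    with zipfile.ZipFile(dest_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                full = os.path.join(root, name)
                arcname = os.path.relpath(full, source_dir)
                zf.write(full, arcname)
```

In the Info-Zip C code, the equivalent change is to make the stored names relative to the argument directory rather than to the working directory, which is what the implicit prefixing does.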
I am new to B2B DX, having been using it for 2 months. I have a requirement where files are generated in dynamic folders. For example, a file named 20170503test.txt will be generated in /2017_05/20170503/20170503test.txt.
The next day it will be generated in /2017_05/20170504/20170504test.txt. How can my endpoint pick up these files as they are generated in different folders? I can set the file pattern to *test.txt, but how can the endpoint descend into the different directories?
If you go high enough in the target file system, there will be a single directory which is common to all the target files, even if it is the root directory; set your target folder to that. Then, in the mapping itself, create a special FileName port on your target. This port dynamically sets the name of the output file to whatever value it holds, so you can fully qualify the string you assign to it to include the full dynamic file path up to the shared directory described above.
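The value assigned to that FileName port boils down to deriving the path from the file's date. A sketch of that derivation in Python (illustrative only, not B2B DX expression syntax; the function name and suffix are assumptions):

```python
from datetime import date

def dynamic_path(d, suffix="test.txt"):
    """Build the dynamic target path described in the question, e.g.
    2017-05-03 -> /2017_05/20170503/20170503test.txt (assuming the
    path is derived purely from the file's date)."""
    ym = d.strftime("%Y_%m")    # e.g. 2017_05
    ymd = d.strftime("%Y%m%d")  # e.g. 20170503
    return f"/{ym}/{ymd}/{ymd}{suffix}"

print(dynamic_path(date(2017, 5, 3)))  # /2017_05/20170503/20170503test.txt
```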
I want to write a log file for my application. The path where I want to store the file is:
destination::"C:\ColdFusion8\wwwroot\autosyn\logs"
I have used the sample below to generate the log file:
<cfset destination = expandPath('logs')>
<cfoutput>destination::"#destination#"</cfoutput><br/>
<cflog file='#destination#/test' application="yes" text="Running test log.">
When I supply the full path, it doesn't create a log file. When I remove my destination and only provide a file name, the log is generated in the ColdFusion server path C:\ColdFusion8\logs.
How can I generate a log file in my application directory?
Here is the description of the file attribute according to the cflog tag documentation:
Message file. Specify only the main part of the filename. For example,
to log to the Testing.log file, specify "Testing".
The file must be located in the default log directory. You cannot
specify a directory path. If the file does not exist, it is created
automatically, with the extension .log.
You can use the cffile tag to write information into a custom folder.
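For example, a minimal cffile-based sketch (the timestamp format and log file name are assumptions; CF8-compatible functions are used):

```cfm
<!--- Append a log line to a custom path with cffile instead of cflog --->
<cfset destination = expandPath('logs')>
<cfset entry = "#dateFormat(now(), 'yyyy-mm-dd')# #timeFormat(now(), 'HH:mm:ss')# Running test log.">
<cffile action="append"
        file="#destination#/test.log"
        output="#entry#"
        addNewLine="yes">
```

Unlike cflog, cffile accepts a full path, so the logs directory under your application root works, provided the ColdFusion service account can write to it.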
From the docs for <cflog>:
file
Optional
Message file. Specify only the main part of the filename. For example, to log to the Testing.log file, specify "Testing".
The file must be located in the default log directory. You cannot specify a directory path. If the file does not exist, it is created automatically, with the extension .log.
(My emphasis).
Reading the docs is always a good place to start when wondering how things might work.
So <cflog> will only log to the ColdFusion logs directory, and that is by design.
I don't have CF8 handy, but you would be able to set the logging directory to be a different one via either the CFAdmin UI (CF9 has this, I just confirmed), or neo-logging.xml in WEB-INF/cfusion/lib.
Or you could use a different logging mechanism. I doubt it will work on a rusty old CF8 install, but perhaps LogBox?