I want to list HDFS file names through Drill, but the Apache Drill user guide (https://drill.apache.org/docs/file-system-storage-plugin/) doesn't explain how to do it. I need help.
You can use this command:
SHOW FILES [ FROM filesystem.directory_name | IN filesystem.directory_name ];
For more detailed info, see:
http://drill.apache.org/docs/show-files-command/
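If you want to pull the file names out programmatically rather than from a SQL prompt, one option is Drill's REST API. Below is a minimal sketch, assuming a Drillbit reachable on localhost:8047 and a storage plugin named hdfs with a /data directory; both names are placeholders for your own configuration.
# "hdfs" and "/data" are placeholders - use your storage plugin name and directory
curl -s -X POST http://localhost:8047/query.json \
  -H 'Content-Type: application/json' \
  -d '{"queryType": "SQL", "query": "SHOW FILES IN hdfs.`/data`"}' \
  | jq -r '.rows[].name'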
I am using the gcloud CLI (the bq tool) to query BigQuery.
Example:
bq query --use_legacy_sql=false --format json 'select * from `ga4-extract.analytics_123456789.events_20220722` limit 1;' > data.json
After running this, if I cat data.json I see my data, but above the data in the file is the following text:
root#9e4947a68356:/# cat data.json
Welcome to BigQuery! This script will walk you through the
process of initializing your .bigqueryrc configuration file.
First, we need to set up your credentials if they do not
already exist.
Setting project_id data-extract as the default.
BigQuery configuration complete! Type "bq" to get started.
Then my data appears underneath in the desired JSON format. How can I get rid of this text so it does not show? I tried the following flags after reading the documentation; in each case there was no difference, and the above output was still added to my data.json file:
--batch=true
--quiet=true
--headless=true
How can I save my output to data.json without this text at the top of the file?
Just run another command first, one that will produce this one-time initialization output instead.
For example, in your case:
bq show ga4-extract.analytics_123456789
will show some basic information about your dataset and will also create the .bigqueryrc file. Your next command will not have this text in its output.
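Putting it together, the flow would look like this (using the same dataset and table names from your question):
# One-time call: triggers the welcome/initialization text and creates ~/.bigqueryrc
bq show ga4-extract.analytics_123456789 > /dev/null 2>&1
# From now on, only the query result is written to the file
bq query --use_legacy_sql=false --format json \
  'select * from `ga4-extract.analytics_123456789.events_20220722` limit 1;' > data.json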
I've been asked to implement a way to load data into my datasets once a month. As the Power BI Service doesn't have this option, I had to find a solution using Power Query, and below I describe the step-by-step of my solution.
If it helps you in some way, please let me know by posting a comment below. If you have a better and/or more elegant solution, I'd be glad to hear from you.
So, as my first solution didn't work, here I'll post the definitive solution that we (my colleagues and I) found.
I have to say that this solution is not so simple, as it uses a Linux server, GitLab and Jenkins, so it requires a relatively complex environment, and I'll not describe how to build it.
At the end, I'll suggest a simpler solution.
THE ENVIRONMENT
At my company we use Jenkins to schedule jobs, GitLab to store source code, and a Linux server to execute small tasks using shell scripts. For this problem I used all three services, plus the Power BI API.
JENKINS
I use Jenkins to schedule a job that runs monthly. This job was created with the following configs:
Parameters: I created 2 parameters (workspace_id and dataset_id) so I can test the script in any environment (Power BI workspace) by just changing the value of those parameters;
Schedule Job: this job was scheduled to run on day 1 of every month at 02:00 a.m. As Jenkins uses the same syntax as cron (I think it is just an intermediary between you and cron), the value of this field is 0 2 1 * *.
Build: as we have a remote Linux server to execute the scripts, I used the "Execute shell script on remote host using ssh" build step. I don't know why, but on Jenkins you cannot execute the curl command directly in the job (it just didn't work), so I had to split the solution between Jenkins and the Linux server. In the SSH site field you have to select the credentials (previously created by my team), and the command field contains the commands below:
#Navigate to the script shell directory
cd "script-shell-script/"
# pulls the last version of the script. If you aren't using Gitlab,
# remove this command
git pull
# every time git pulls a new version of the file, it comes with read access only.
# This command restores execute permission on the file
chmod +x powerbi_refresh_dataset.sh
# make a call to the file passing as parameter the workspace id and dataset id
./powerbi_refresh_dataset.sh $ID_WORKSPACE $ID_DATASET
SHELL SCRIPT
As you can already imagine, the core of the solution is the content of powerbi_refresh_dataset.sh. But before going there, you must understand how the Power BI API works, and you have to configure your Power BI environment so that the API calls work. So please make sure that you already have your service principal properly configured by following this tutorial: https://learn.microsoft.com/en-us/power-bi/developer/embedded/embed-service-principal
Once you have your object_id, client_id and client_secret, you can create your shell script file. Below is the code of my .sh file.
# load OBJECT_ID, CLIENT_ID and CLIENT_SECRET as environment variables
source credential_file.sh
# This command retrieves a new token from Microsoft Credentials Manager
token_msg=$(curl -X POST "https://login.windows.net/$OBJECT_ID/oauth2/token" \
-H 'Content-Type: application/x-www-form-urlencoded' \
-H 'Accept: application/json' \
-d 'grant_type=client_credentials&resource=https://analysis.windows.net/powerbi/api&client_id='$CLIENT_ID'&client_secret='$CLIENT_SECRET
)
# Extract the token from the response message
token=$(echo "$token_msg" | jq -r '.access_token')
# Ask Power BI to refresh dataset
refresh_msg=$(curl -X POST 'https://api.powerbi.com/v1.0/myorg/groups/'$1'/datasets/'$2'/refreshes' \
-H 'Authorization: Bearer '$token \
-H 'Content-Type: application/json' \
-d '{"notifyOption": "NoNotification"}')
And here goes some explanation. The first command is source credential_file.sh, which loads 3 variables (OBJECT_ID, CLIENT_ID and CLIENT_SECRET). The intention here is to separate confidential info from the script, so I can store the main script file in version control (Git) and not disclose any sensitive information. So, besides the powerbi_refresh_dataset.sh file, you must have credential_file.sh in the same directory, with the following content:
OBJECT_ID=OBJECT_ID_VALUE
CLIENT_ID=CLIENT_ID_VALUE
CLIENT_SECRET=CLIENT_SECRET_VALUE
It's important to say that if you are using Git or any other version control, only the powerbi_refresh_dataset.sh file goes to version control; the credential_file.sh file must remain only on your Linux server. I suggest you save its content in a password store application like KeePass, as the CLIENT_SECRET cannot be retrieved later.
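As an optional extra (not part of the original script), you can have the script confirm that the refresh request was registered by calling the Power BI refresh-history endpoint with the same token and IDs. A hedged sketch, to be appended to powerbi_refresh_dataset.sh if you want it:
# Optional: query the refresh history of the dataset and print the latest status.
# Reuses $token, $1 (workspace id) and $2 (dataset id) from the script above.
status_msg=$(curl -s "https://api.powerbi.com/v1.0/myorg/groups/$1/datasets/$2/refreshes?\$top=1" \
  -H "Authorization: Bearer $token")
# Typical status values are Unknown (still in progress), Completed or Failed
echo "$status_msg" | jq -r '.value[0].status'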
FINAL CONSIDERATIONS
So, above is the most relevant info of my solution. As you can see, I'm omitting (intentionally) how to build the environment and make the pieces talk to each other (Jenkins with Linux, Jenkins with Git, and so on).
If all you have is a Linux or Windows host, I suggest the following:
Linux Host
On this simpler environment, just create powerbi_refresh_dataset.sh and credential_file.sh, place them in any directory, and create a cron task to call powerbi_refresh_dataset.sh as often as you wish.
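As an illustration, a crontab entry reproducing the "day 1 at 02:00" schedule from the Jenkins job could look like this (the script path and the two IDs are placeholders):
# min hour day-of-month month day-of-week  command
0 2 1 * * /home/youruser/script-shell-script/powerbi_refresh_dataset.sh WORKSPACE_ID DATASET_ID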
Windows Host
On Windows you can do almost the same as on Linux, but you'll have to replace the content of the shell script file with PowerShell commands (google it) and use Task Scheduler to regularly execute your PowerShell file.
Well, I think this will help you. I know it's not a complete answer, as it will only work if you have a similar environment, but I hope that the final tips might help you.
Best regards
The Solution
First let me summarize the solution. I just put a conditional execution at the end of each query that checks whether today is the day when new data must be uploaded. If yes, it returns the step to be executed; if not, it raises an error.
There are many ways to implement that, and I'll go from the simplest form to the most complex one.
Simplest Version: checking if it's the day to load new data directly at the query
This is the simplest way to implement the solution, but, depending on your dataset it may not be the smartest one.
Let's say you have this foo query:
let
step1 = ...,
...,
...,
step10 = SomeFunction(Somevariable, someparameter)
in
step10
Now let's say you want that query to upload new data only on the 1st day of the month. To do that, you just insert a conditional instruction in the in clause.
let
step1 = ...,
...,
...,
step10 = SomeFunction(Somevariable, someparameter)
in
if Date.Day(DateTime.LocalNow()) = 1 then step10 else error "Today is not the day to load data"
In this example I just replaced step10 at the return of the query with this piece of code: if Date.Day(DateTime.LocalNow()) = 1 then step10 else error "Today is not the day to load data". By doing that, step10 will be the result of this query only if the query is being executed on the 1st day of the month; otherwise, it will return an error.
And here some explanation is worthwhile. Power Query is not a script language that runs in the same order in which it is declared. So the fact that the conditional statement was placed at the end of the query doesn't mean that all the code above will be executed before the error is raised. As Power Query executes only what's necessary, the if... statement will probably be the first one to be executed. For more info about how Power Query works behind the scenes, I strongly recommend this reading: https://bengribaudo.com/blog/2018/02/28/4391/power-query-m-primer-part5-paradigm
Using a function
Now let's move forward. Let's say that your dataset has not only one but many queries, and all of them need to be executed only once a month. In this case, a smart way to do that is by using what all other programming languages have for reusing blocks of code: create a function!
For this, create a new Blank Query and paste this code into its body:
(step) =>
let
result = if Date.Day(DateTime.LocalNow()) = 1 then step else error "Today is not the day to load data"
in
result
Now, in each query you'll call this function, passing the last step as a parameter. The function will check which day today is and return the same step passed as a parameter if it's the day to load the data. Otherwise, it will return the error.
Below is the code of our query using our function, called check_if_upload:
let
step1 = ...,
...,
...,
step10 = SomeFunction(Somevariable, someparameter),
step11 = check_if_upload(step10)
in
step11
Using parameters
One final tip. As your query raises an error if today is not the upload day, it means that you can only test your ETL once a month, right? The error also limits your ability to save your Power Query work, which means that if you don't apply the modifications you can't upload the new Power Query version (with this implementation) to the Power BI Service.
Well, you could change the day value inside the function whenever you need to test, but that's, let's say, a little clumsy.
A more elegant way to change this value is by using parameters. So, let's do it. Create a parameter (I'll call it Upload Day) as a number type. Now, all you have to do is use this parameter in your function. It will look like this:
(step) =>
let
result = if Date.Day(DateTime.LocalNow()) = #"Upload Day" then step else error "Today is not the day to load data"
in
result
That's it. Now you can change the upload day directly in the Power BI Service by just changing this parameter on the dataset (click on the dataset name and go to Settings >> Parameters).
I hope you nailed it and that it's helpful for you.
Best regards.
I am trying to automate a download process from a website for my data curation. It involves a few searches and selecting the right file to download if the search term matches. Could you please give me some insights or sample code to start with?
The flow chart of the process would be:
Go to website --> Click on Advanced Search --> Click on "Title" --> Select from a few dropdowns and fix the search space --> Enter the search string --> Select the correct result and download the latest release.
I want to be able to create EMR clusters, and for those clusters to send messages back to some central queue. In order for this to work, I need to have some sort of agent running on each master node. Each one of those agents will have to identify itself in this message so that the recipient knows which cluster the message is about.
Does the master node know its ID (j-*************)? If not, then is there some other piece of identifying information that could allow the message recipient to infer this ID?
I've taken a look through the config files in /home/hadoop/conf, and I haven't found anything useful. I found the ID in /mnt/var/log/instance-controller/instance-controller.log, but it looks like it'll be difficult to grep for. I'm wondering where instance-controller might get that ID from in the first place.
You may look at /mnt/var/lib/info/ on the master node to find a lot of info about your EMR cluster setup. More specifically, /mnt/var/lib/info/job-flow.json contains the jobFlowId (the cluster ID).
You can use the pre-installed JSON parser (jq) to get the jobFlowId.
cat /mnt/var/lib/info/job-flow.json | jq -r ".jobFlowId"
(updated as per #Marboni)
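Since the goal is for an agent on each master node to identify its cluster in a message to a central queue, below is a minimal, hedged sketch of that step. It assumes the AWS CLI is available on the node and that the central queue is an SQS queue; the queue URL and the message shape are placeholders.
# Read the cluster ID from the local info file (jq is pre-installed on EMR nodes)
CLUSTER_ID=$(jq -r '.jobFlowId' /mnt/var/lib/info/job-flow.json)
# Hypothetical central queue; replace with your own queue URL
QUEUE_URL="https://sqs.us-east-1.amazonaws.com/123456789012/central-queue"
# Include the cluster ID so the recipient knows which cluster the message is about
aws sqs send-message --queue-url "$QUEUE_URL" \
  --message-body "{\"clusterId\": \"$CLUSTER_ID\", \"event\": \"heartbeat\"}"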
You can use the Amazon EC2 API to figure it out. The example below uses shell commands for simplicity. In real life you should use the appropriate API to do these steps.
First you should find out your instance ID:
INSTANCE=`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`
Then you can use your instance ID to find out the cluster ID:
ec2-describe-instances $INSTANCE | grep TAG | grep aws:elasticmapreduce:job-flow-id
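Note that ec2-describe-instances is part of the legacy EC2 API tools. If only the newer AWS CLI is available on the node, a roughly equivalent lookup of the same tag (a sketch, assuming credentials and region are configured) would be:
INSTANCE=$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-id)
# Read the job-flow-id tag attached to this instance
aws ec2 describe-tags \
  --filters "Name=resource-id,Values=$INSTANCE" "Name=key,Values=aws:elasticmapreduce:job-flow-id" \
  --query 'Tags[0].Value' --output text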
Hope this helps.
As specified above, the information is in the job-flow.json file. This file has several other attributes, so, knowing where it's located, you can get the value in a very easy way:
cat /mnt/var/lib/info/job-flow.json | grep jobFlowId | cut -f2 -d: | cut -f2 -d'"'
Edit: This command also works on core nodes.
Another option - query the metadata server:
curl -s http://169.254.169.254/2016-09-02/user-data/ | sed -r 's/.*clusterId":"(j-[A-Z0-9]+)",.*/\1/g'
Apparently the Hadoop MapReduce job has no way to know which cluster it is running on - I was surprised to find this out myself.
BUT: you can use other identifiers for each map to uniquely identify the mapper which is running, and the job that is running.
These are specified in the environment variables passed on to each mapper. If you are writing a job in Hadoop streaming, using Python, the code would be:
import os

if 'map_input_file' in os.environ:
    fileName = os.environ['map_input_file']
if 'mapred_tip_id' in os.environ:
    mapper_id = os.environ['mapred_tip_id'].split("_")[-1]
if 'mapred_job_id' in os.environ:
    jobID = os.environ['mapred_job_id']
That gives you: input file name, the task ID, and the job ID. Using one or a combination of those three values, you should be able to uniquely identify which mapper is running.
If you are looking for a specific job: "mapred_job_id" might be what you want.
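If your streaming mapper happens to be a shell script rather than Python, the same values can be read from the environment, since Hadoop exports job properties with dots replaced by underscores. A small sketch using the legacy names from this answer (newer Hadoop releases expose mapreduce_*-prefixed names instead, so adjust if needed):
#!/bin/sh
# Legacy Hadoop streaming environment variables (property dots become underscores);
# newer Hadoop versions use mapreduce_-prefixed equivalents.
echo "input file: ${map_input_file:-unknown}" >&2
echo "task id:    ${mapred_tip_id:-unknown}" >&2
echo "job id:     ${mapred_job_id:-unknown}" >&2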
I'm a first time EMR/Hadoop user and first time Apache Nutch user. I'm trying to use Apache Nutch 2.1 to do some screen scraping. I'd like to run it on hadoop, but don't want to setup my own cluster (one learning curve at a time). So I'm using EMR. And I'd like S3 to be used for output (and whatever input I need).
I've been reading the setup wikis for Nutch:
http://wiki.apache.org/nutch/NutchTutorial
http://wiki.apache.org/nutch/NutchHadoopTutorial
And they've been very helpful in getting me up to speed on the very basics of Nutch. I realize I can build Nutch from source, preconfigure some regexes, and then be left with a Hadoop-friendly jar:
$NUTCH_HOME/runtime/deploy/apache-nutch-2.1.job
Most of the tutorials culminate in a crawl command being run. In the Hadoop examples, it's:
hadoop jar nutch-${version}.jar org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3 -topN 5
And in the local deployment example it's something like:
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
My question is as follows. What do I have to do to get my apache-nutch-2.1.job to run on EMR? What arguments do I pass it? For the Hadoop crawl example above, the "urls" file is already on HDFS with the seed URLs. How do I do this on EMR? Also, what do I specify on the command line to have my final output go to S3 instead of HDFS?
So to start off, this cannot really be done using the GUI. Instead, I got Nutch working using the AWS Java API.
I have my seed files located in s3, and I transfer my output back to s3.
I use the s3distcp jar to copy the data from S3 to HDFS.
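For the seed list itself, it just needs to be sitting in S3 before the first s3distcp step copies it onto the cluster's HDFS. A hedged one-liner to get it there (the bucket name and local path are placeholders, not from the original setup):
# Upload the local seed file to S3; the s3distcp step below then copies it to HDFS
aws s3 cp ./urls/seed.txt s3://my-nutch-bucket/nutch/urls/seed.txt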
Here is my basic step config. MAINCLASS is going to be the package details of your Nutch crawl, something like org.apache.nutch.mainclass.
String HDFSDIR = "/user/hadoop/data/";
stepconfigs.add(new StepConfig()
.withName("Add Data")
.withActionOnFailure(ActionOnFailure.TERMINATE_JOB_FLOW)
.withHadoopJarStep(new HadoopJarStepConfig(prop.getProperty("S3DISTCP"))
.withArgs("--src", prop.getProperty("DATAFOLDER"), "--dest", HDFSDIR)));
stepconfigs.add(new StepConfig()
.withName("Run Main Job")
.withActionOnFailure(ActionOnFailure.CONTINUE)
.withHadoopJarStep(new HadoopJarStepConfig("nutch-1.7.jar")
.withArgs("org.apache.nutch.crawl.Crawl", prop.getProperty("CONF"), prop.getProperty("STEPS"), "-id=" + jobId)));
stepconfigs.add(new StepConfig()
.withName("Pull Output")
.withActionOnFailure(ActionOnFailure.TERMINATE_JOB_FLOW)
.withHadoopJarStep(new HadoopJarStepConfig(prop.getProperty("S3DISTCP"))
.withArgs("--src", HDFSDIR, "--dest", prop.getProperty("DATAFOLDER"))));
new AmazonElasticMapReduceClient(
    new PropertiesCredentials(new File("AwsCredentials.properties")),
    proxy ? new ClientConfiguration().withProxyHost("dawebproxy00.americas.nokia.com").withProxyPort(8080) : null)
.runJobFlow(new RunJobFlowRequest()
.withName("Job: " + jobId)
.withLogUri(prop.getProperty("LOGDIR"))
.withAmiVersion(prop.getProperty("AMIVERSION"))
.withSteps(getStepConfig(prop, jobId))
.withBootstrapActions(getBootStrap(prop))
.withInstances(getJobFlowInstancesConfig(prop)));