Extracting main directory from path using Regex in Hive

I am using a regex function in Hive to find the main folder.
I want to parse out "main" from this file path:
/main/one/path/to/hdfs
This is the regex which I used:
regexp_extract(filepath,'(^/[^/]+)',0)

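Escaping the "/" with a "\" is harmless but not required, because Hive uses Java regular expressions, in which "/" has no special meaning:
(^\/[^\/]+)
The leading slash appears in the result because index 0 returns the whole match, and the capturing group here includes the slash as well.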

Assuming the goal is the first directory after the leading slash, we can start with this simple expression:
\/(.+?)\/.+
The directory name is captured by the first capturing group:
(.+?)
which we can retrieve as group 1, so the call would look like:
regexp_extract(filepath,'\/(.+?)\/.+', 1)
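To illustrate how the group index changes the result, here is a minimal sketch that uses Python's re module as a stand-in for Hive's regexp_extract (an assumption: both engines treat these simple patterns the same way). The anchored pattern ^/([^/]+) is an illustrative variant and does not come from the answers above.

```python
import re

filepath = "/main/one/path/to/hdfs"

# Pattern from the question: the group includes the leading slash,
# and index 0 (the whole match) includes it too.
whole = re.search(r"(^/[^/]+)", filepath).group(0)

# Pattern from this answer: group 1 holds only the directory name.
first_dir = re.search(r"/(.+?)/.+", filepath).group(1)

# Illustrative variant (assumption): anchored, and it also works
# when the path has a single component such as "/main".
anchored = re.search(r"^/([^/]+)", filepath).group(1)

print(whole)      # /main
print(first_dir)  # main
print(anchored)   # main
```

The Hive equivalent of the last variant would presumably be regexp_extract(filepath, '^/([^/]+)', 1).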

Related

regex - get new path string from old path string

I'm trying to run a shell script on Linux and want to turn this:
/path/to/(\w+)/b/c
into
/path/to/(\w+)/b/(\w+)\.txt
(where \w+ should remain the same as given in input).
I keep getting 'No match found'.

You need to use a capturing group and then reference it in your substitution.
\/r\/path\/to\/(\w+).*
Test string
/r/path/to/teststring/b/c
Substitution
/path/to/\1/b/\1\.txt
Result
/path/to/teststring/b/teststring.txt
I have created a regex101 playground for you here
https://regex101.com/r/R0O3OK/1
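The same substitution can be sketched in Python, assuming the test string shown above. Python's replacement syntax does not accept \. so the dot is written unescaped here; a shell tool such as sed may differ in its details.

```python
import re

# Pattern and test string taken from the answer above.
pattern = r"/r/path/to/(\w+).*"
old_path = "/r/path/to/teststring/b/c"

# \1 reuses whatever the capturing group matched.
replacement = r"/path/to/\1/b/\1.txt"

new_path = re.sub(pattern, replacement, old_path)
print(new_path)  # /path/to/teststring/b/teststring.txt
```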

How to extract file name from URL?

I have file names in a URL and want to strip out the preceding URL and file path as well as the version that appears after the ?
Sample URL
I am trying to use a regex to pull out CaptialForecasting_Datasheet.pdf
The REGEXP_EXTRACT in Google Data Studio seems unique. Tried the suggestion but kept getting "could not parse" error. I was able to strip out the first part of the url with the following. Event Label is where I store URL of downloaded PDF.
The URL:
https://www.dudesolutions.com/Portals/0/Documents/HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033
REGEXP_EXTRACT( Event Label , 'Documents/([^&]+)' )
The result:
HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033
Now I am trying to work out how to drop everything from the ? onward, where the version data is, so as to extract just the Filename.pdf.

You could try:
[^\/]+(?=\?[^\/]*$)
This will match CaptialForecasting_Datasheet.pdf even if there is a question mark in the path. For example, the regex will succeed in both of these cases:
https://www.dudesolutions.com/somepath/CaptialForecasting_Datasheet.pdf?ver
https://www.dudesolutions.com/somepath?/CaptialForecasting_Datasheet.pdf?ver
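A quick check of this pattern against both URLs, sketched in Python (an assumption: the target tool supports lookahead the way Python's re does).

```python
import re

# Match a run of non-slash characters that is followed by "?" and
# then by no further "/" up to the end of the string.
FILE_BEFORE_QUERY = re.compile(r"[^/]+(?=\?[^/]*$)")

urls = [
    "https://www.dudesolutions.com/somepath/CaptialForecasting_Datasheet.pdf?ver",
    "https://www.dudesolutions.com/somepath?/CaptialForecasting_Datasheet.pdf?ver",
]

names = [FILE_BEFORE_QUERY.search(u).group(0) for u in urls]
print(names)
```

In the second URL the "?" after somepath is rejected because a "/" still follows it, so only the file name qualifies.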

Assuming that the name appears right after the last / and ends with the ?, the regular expression below will leave the name in group 1 where you can get it with \1 or whatever the tool that you are using supports.
.*\/(.*)\?
It basically says: take everything between the last / and the ? that follows it (the last ?, if there are several, because .* is greedy), and put it in group 1.
Another regular expression that only matches the file name that you want but is more complex is:
(?<=\/)[^\/]*(?=\?)
It matches all non-/ characters, [^\/], immediately preceded by /, (?<=\/), and immediately followed by ?, (?=\?). The first parenthesized expression is a positive lookbehind, and the second is a positive lookahead.
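Both expressions can be compared side by side in a short Python sketch, using the URL from the question. The string named tricky is a hypothetical input added only to show how the greedy .* behaves when more than one ? follows the last slash.

```python
import re

url = ("https://www.dudesolutions.com/Portals/0/Documents/"
       "HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033")

# Group-based form: the file name ends up in group 1.
by_group = re.search(r".*/(.*)\?", url).group(1)

# Lookaround form: the whole match is the file name.
by_lookaround = re.search(r"(?<=/)[^/]*(?=\?)", url).group(0)

# Hypothetical input with two "?" after the last slash.
tricky = "a/b.pdf?x=1?y"
greedy_capture = re.search(r".*/(.*)\?", tricky).group(1)

print(by_group)        # HC_Brochure_Digital.pdf
print(by_lookaround)   # HC_Brochure_Digital.pdf
print(greedy_capture)  # b.pdf?x=1
```

Lookarounds are not available in every engine (RE2, for example, does not support them), so the group-based form is the more portable of the two.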

This REGEXP_EXTRACT formula captures the characters a-zA-Z0-9_. between / and ?
REGEXP_EXTRACT(Event Label, "/([\\w\\.]+)\\?")
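As a rough cross-check outside Data Studio, the same pattern can be tried in Python. The doubled backslashes in the formula are string escaping, so the underlying pattern is /([\w\.]+)\? and the variable name event_label below is only a stand-in for the Event Label field.

```python
import re

# Stand-in for the Event Label field (URL from the question).
event_label = ("https://www.dudesolutions.com/Portals/0/Documents/"
               "HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033")

# A slash, then word characters or dots (captured), then a literal "?".
match = re.search(r"/([\w.]+)\?", event_label)
file_name = match.group(1) if match else None
print(file_name)  # HC_Brochure_Digital.pdf
```

Because the character class allows only word characters and dots, a file name containing a hyphen or a space would not be captured by this pattern.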

Please try the following regex:
[A-Za-z_]*\.pdf
I have tried it online at https://regexr.com/.
Please note that this only works for .pdf files whose names consist of letters and underscores.
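The effect of escaping the dot, and the limit of the letters-and-underscores character class, can be seen in a small Python sketch. The sample strings are hypothetical.

```python
import re

unescaped = re.compile(r"[A-Za-z_]*.pdf")   # "." matches any character
escaped = re.compile(r"[A-Za-z_]*\.pdf")    # "\." matches a literal dot

# Hypothetical text with no ".pdf" extension in it.
no_extension = "my_xpdf_notes"
loose_hit = unescaped.search(no_extension).group(0)
strict_hit = escaped.search(no_extension)

# Hypothetical file name containing digits, which the class excludes.
with_digits = "HC_Brochure_2018.pdf"
partial = escaped.search(with_digits).group(0)

print(loose_hit)   # my_xpdf
print(strict_hit)  # None
print(partial)     # .pdf
```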

The following regex will extract a file name with the .pdf extension:
(?:[^\/][\d\w\.]+)(?<=(?:\.pdf))
You can add more extensions like this:
(?:[^\/][\d\w\.]+)(?<=(?:\.pdf)|(?:\.jpg))

Regex processing in systemverilog using svlib

I am a new user of the svlib package in a SystemVerilog environment (see Verilab svlib). I have the following sample text, {'PARAMATER': 'lollg_1', 'SPEC_ID': '1G3HSB_1'}, and I want to use a regex to extract 1G3HSB_1 from it.
I am using the following code snippet, but I get the whole line instead of only that value.
wordsRe = regex_match(words[i], "\'SPEC_ID\': \'(.*?)\'");
$display("This is the output of Regex: %s", wordsRe.getStrContents());
Can anybody tell me what is going wrong?
The output I am getting: {'PARAMATER': 'lollg_1', 'SPEC_ID': '1G3HSB_1'}
And I want to get: 1G3HSB_1

It seems you need to get the contents of the first capturing group with getMatchString(1). Also, you need to use a greedy quantifier (lazy ones are not POSIX compliant) and a negated bracket expression - [^']* instead of .*?:
wordsRe = regex_match(words[i], "\'SPEC_ID\': \'([^\']*)\'");
$display("This is the output of Regex: %s", wordsRe.getMatchString(1));
See the User Guide details:
getMatchString(m) is always exactly equivalent to calling the range method on the Str object containing the string that was searched:
range(getMatchStart(m), getMatchLength(m))
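The distinction between the searched string, the whole match, and the first capturing group can be sketched with Python's re module. This is only an analogy, not svlib code: treating m.string as the counterpart of getStrContents() and m.group(1) as the counterpart of getMatchString(1) is an assumption based on the descriptions above.

```python
import re

line = "{'PARAMATER': 'lollg_1', 'SPEC_ID': '1G3HSB_1'}"

# Greedy quantifier with a negated bracket expression, as in the
# corrected svlib call; no lazy quantifier is needed.
m = re.search(r"'SPEC_ID': '([^']*)'", line)

searched = m.string       # the full string that was searched
whole_match = m.group(0)  # the text matched by the entire pattern
spec_id = m.group(1)      # the text of the first capturing group

print(searched)     # the whole line
print(whole_match)  # 'SPEC_ID': '1G3HSB_1'
print(spec_id)      # 1G3HSB_1
```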

How to include regular expression in logstash file input path

I am using logstash to convert tomcat access logs into json format. The access log names are in below format
abcd_access_log.2016-03-15.log
efgh_access_log.2016-02-16.log
The input filter is:
input {
  file {
    path => "C:\tools\apache-tomcat-8.0.32\logs\*_access_log.*.log"
    start_position => beginning
  }
}
It is not showing logs with the regex used. What regex should I use here to select only these files?

Simple and fast way
If your log file always comes right after logs\, you can use the following regex:
logs\\(.*?\.log)
It captures everything, (.*?), that follows logs\ (note the double \\ in the regex, which matches one literal backslash). Your file name also ends with .log, so don't forget to escape the dot, as in \.log.
Detailed description is at Regex101.
Bullet-proof and slower way
If the key string logs\ might be missing, you can use the following regex, which captures the last part of the path using a negative lookahead:
\\((?:.(?!\\))+\.log)
Here is the detailed description again: Regex101
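Both expressions can be tried against a full Windows path in Python. The path below is an assumption, built by combining the directory from the question with one of its sample file names.

```python
import re

# Assumed full path: directory and file name taken from the question.
path = r"C:\tools\apache-tomcat-8.0.32\logs\abcd_access_log.2016-03-15.log"

# Simple way: capture what follows the literal "logs\".
simple = re.search(r"logs\\(.*?\.log)", path).group(1)

# Bullet-proof way: after a backslash, take characters that are not
# followed by another backslash, ending in ".log".
last_part = re.search(r"\\((?:.(?!\\))+\.log)", path).group(1)

print(simple)     # abcd_access_log.2016-03-15.log
print(last_part)  # abcd_access_log.2016-03-15.log
```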

Regex: split to the last occurence of path

I want to split a UNC path into hostname, shared folder, path, filename and extension. I almost got it, but the last part is somehow wrong because I don't get the filename correctly.
e.g.
//host/shared/path1/path2/path3/filename.pdf
should be split up to:
host
shared
path1/path2/path3
filename
pdf
But at the moment I get something like this:
host
shared
path1/path2/path3/filenam
e
pdf
using this regex:
std::regex rgx("\/\/(\\w+?){1,1}\/(\\w+?)\/([\\w\/]+)([^\\.])\\.(.+)$");
So what is wrong with it and how can I solve it?

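The group "([^\\.])" captures exactly one non-period character, which is why the final "e" of the file name ends up in a group of its own. Remove that group, since the following "\\." already matches the period, and add a separate word group for the file name itself, preceded by a slash and followed by the period. The "\/" escapes are dropped because "/" needs no escaping in a C++ string literal or in the regex, and the lazy quantifiers and {1,1} are dropped because they do not change what is matched:
std::regex rgx("//(\\w+)/(\\w+)/([\\w/]+)/(\\w+)\\.(.+)$");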
https://regex101.com/r/yK4zH1/4
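The five-way split can be checked quickly in Python, used here only as a stand-in for std::regex (an assumption: both engines agree on this pattern). The pattern is written without the extra string-literal escaping that C++ needs.

```python
import re

unc_path = "//host/shared/path1/path2/path3/filename.pdf"

# host / share / directory path / file name / extension
UNC_PARTS = re.compile(r"//(\w+)/(\w+)/([\w/]+)/(\w+)\.(.+)$")

host, share, directory, name, extension = UNC_PARTS.match(unc_path).groups()
print(host, share, directory, name, extension)
# host shared path1/path2/path3 filename pdf
```

The greedy ([\w/]+) first swallows the file name as well, then backtracks to the last slash so that the remaining groups can match.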
