Matching string between nth occurrence of character in python with RegEx - regex

I'm working with a tar.gz archive that contains txt files, and I'm trying to extract the filename from each TarInfo object, whose member.name property looks like this:
aclImdb/test/neg/1026_2.txt
aclImdb/test/neg/1027_5.txt
...
aclImdb/test/neg/1030_4.txt
I've written the following code which prints the string test/neg/1268_2
import re
import tarfile

regex = r'\/((?:[^/]*/).*?)\.'
with tarfile.open("C:\\Users\\Orestis\\Desktop\\aclImdb_v1.tar.gz") as archive:
    for member in archive.getmembers():
        # Only look at regular files, not directories
        if member.isreg():
            m = re.findall(regex, member.name)
            print(m)
How should I modify the regex to extract only the 1268_2 part of the filenames? Effectively I want to extract the string after the 3rd occurrence of "/" and before the 1st occurrence of ".".

You could hardcode this:
.*?\/.*?\/.*?\/(.*?)\.
More elegant is something along the lines of this:
(.*?\/){3}(.*?)\.
You can simply change the 3 to suit your pattern. (Note that the group you want is the second one: $2, or m.group(2) on a Python match object.)
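As a minimal sketch of how the counted-group pattern could be used from Python (the helper name name_after_third_slash is made up for illustration, and the sample names are taken from the question):

```python
import re

# The answer's pattern: skip three "/"-terminated segments,
# then lazily capture everything up to the first "."
NTH_SLASH = re.compile(r'(.*?/){3}(.*?)\.')

def name_after_third_slash(member_name):
    """Return the text between the 3rd "/" and the next ".", or None."""
    m = NTH_SLASH.match(member_name)
    return m.group(2) if m else None

for name in ["aclImdb/test/neg/1026_2.txt", "aclImdb/test/neg/1030_4.txt"]:
    print(name_after_third_slash(name))
```

Note that re.findall with this pattern would return a tuple per match because there are two groups, so re.match plus group(2) is the simpler route here.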

Related

Get each value after specific words with Regex

I have the string below and I am trying to get every value after Id and DisplayName. I tried a lookbehind, but I could not get it to work: it only grabs the first value, while I want to grab all of them.
This was my pattern to grab the value after DisplayName:
(?<=\bDisplayName\\\"\=\>\\\")(\w+)
When I tried it, it grabbed the first value, but only when the value was alphabetic, whereas most of my text is a mix of Japanese kanji, katakana, hiragana and special characters such as ・.
"{\"Ancestor\"=>{\"Ancestor\"=>{\"Ancestor\"=>{\"Ancestor\"=>{\"ContextFreeName\"=>\"本\", \"DisplayName\"=>\"本\", \"Id\"=>\"465392\"}, \"ContextFreeName\"=>\"本\", \"DisplayName\"=>\"ジャンル別\", \"Id\"=>\"465610\"}, \"ContextFreeName\"=>\"ビジネス・経済\", \"DisplayName\"=>\"ビジネス・経済\", \"Id\"=>\"466282\"}, \"ContextFreeName\"=>\"経営学・キャリア・MBA\", \"DisplayName\"=>\"経営学・キャリア・MBA\", \"Id\"=>\"492076\"}, \"ContextFreeName\"=>\"経営学・キャリア・MBAの起業・開業\", \"DisplayName\"=>\"起業・開業\", \"Id\"=>\"492058\", \"IsRoot\"=>false}"
What I want to achieve from the above string is the following:
Grab each string after DisplayName
ex.
"DisplayName"=>"本" grab 本
"DisplayName"=>"経営学・キャリア・MBA" grab 経営学・キャリア・MBA
Grab each integer after Id
ex.
"Id"=>"465392" grab 465392
"Id"=>"492058" grab 492058
Is it possible to do this in Regex or should I look for something else than Regex?
You can use capturing groups, as in
"DisplayName"=>"([^"]*)"
"Id"=>"(\d+)
Details:
"DisplayName"=>"([^"]*)" - "DisplayName"=>" is matched first, then zero or more chars other than " are captured into Group 1.
"Id"=>"(\d+) - "Id"=>" is matched first, then one or more digits are captured into Group 1.
See the Python demo:
import re
s = "{\"Ancestor\"=>{\"Ancestor\"=>{\"Ancestor\"=>{\"Ancestor\"=>{\"ContextFreeName\"=>\"本\", \"DisplayName\"=>\"本\", \"Id\"=>\"465392\"}, \"ContextFreeName\"=>\"本\", \"DisplayName\"=>\"ジャンル別\", \"Id\"=>\"465610\"}, \"ContextFreeName\"=>\"ビジネス・経済\", \"DisplayName\"=>\"ビジネス・経済\", \"Id\"=>\"466282\"}, \"ContextFreeName\"=>\"経営学・キャリア・MBA\", \"DisplayName\"=>\"経営学・キャリア・MBA\", \"Id\"=>\"492076\"}, \"ContextFreeName\"=>\"経営学・キャリア・MBAの起業・開業\", \"DisplayName\"=>\"起業・開業\", \"Id\"=>\"492058\", \"IsRoot\"=>false}"
print(re.findall(r'"DisplayName"=>"([^"]*)"', s))
# => ['本', 'ジャンル別', 'ビジネス・経済', '経営学・キャリア・MBA', '起業・開業']
print(re.findall(r'"Id"=>"(\d+)', s))
# => ['465392', '465610', '466282', '492076', '492058']

The regex in string.format of LUA

I use string.find(str, regex) in Lua to fetch some keywords.
local RICH_TAGS = {
    "texture",
    "img",
}
--\[((img)|(texture))=
local START_OF_PATTER = "\\[("
for index = 1, #RICH_TAGS - 1 do
    START_OF_PATTER = START_OF_PATTER .. "(" .. RICH_TAGS[index] .. ")|"
end
START_OF_PATTER = START_OF_PATTER .. "(" .. RICH_TAGS[#RICH_TAGS] .. "))"
function RichTextDecoder.decodeRich(str)
    local result = {}
    print(str, START_OF_PATTER)
    dump({string.find(str, START_OF_PATTER)})
end
output
hello[img=123] \[((texture)|(img))
dump from: [string "utils/RichTextDecoder.lua"]:21: in function 'decodeRich'
"<var>" = {
}
The output means:
str = hello[img=123]
START_OF_PATTER = \[((texture)|(img))
This regex works well in some online regex tools, but it finds nothing in Lua.
Is there anything wrong in my code?
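Lua's standard string library does not support regular expressions. Use Lua's string patterns to match strings instead.
See How to write this regular expression in Lua?
In Lua patterns, magic characters such as [ are escaped with %, not with a backslash. Try dump({str:find("%[img=")})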
Also note that this loop:
for index = 1, #RICH_TAGS - 1 do
    START_OF_PATTER = START_OF_PATTER .. "(" .. RICH_TAGS[index] .. ")|"
end
skips the last element of RICH_TAGS; that element is only added by the separate concatenation after the loop.
Edit:
But what I want is to fetch several specific words. For example, the
pattern should fetch any one of "[img=", "[texture=" or "[font=". With
the regex string I wrote in my question, a regex engine can do the work. But
with Lua, the way to do the job is to write code like string.find(str,
"[img=") and string.find(str, "[texture=") and string.find(str,
"[font="). I wonder whether there is a way to do the job with a single
pattern string. I tried a pattern string like "%[%a*=", but obviously it
will fetch a lot more strings than I need.
You cannot match several specific words with a single Lua pattern unless they appear in the string in a specific order, because Lua patterns have no alternation operator. The only thing you could do is put all the characters that make up those words into a class, but then you risk matching any word that can be built from those letters.
Usually you would match each word with a separate pattern, or match any word and check whether the match is one of your words, for example using a lookup table.
So basically you do in a few lines of Lua what a regex library would do.
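The "match any word, then check it against a lookup table" idea can be sketched as follows. This is written in Python purely as an illustration of the approach (Python's re does support alternation, so it would not strictly need it); the names RICH_TAGS, TAG_START and find_rich_tag are assumptions made for this example, not part of the original code.

```python
import re

# Lookup table of the tags we actually care about
RICH_TAGS = {"texture", "img"}

# Generic pattern: "[", any run of letters, then "="
# (comparable in spirit to the Lua pattern "%[(%a+)=")
TAG_START = re.compile(r"\[([A-Za-z]+)=")

def find_rich_tag(text):
    """Return (position, tag) for the first known tag, or None."""
    for m in TAG_START.finditer(text):
        # Accept the generic match only if the word is in the table
        if m.group(1) in RICH_TAGS:
            return m.start(), m.group(1)
    return None

print(find_rich_tag("hello[img=123]"))
print(find_rich_tag("hello[font=123]"))
```

The generic pattern over-matches on purpose; the set membership test is what narrows the result down to the wanted tags.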

Match a file name that includes the path and a period within the name

I have a file path from which I need just the file name:
/var/www/foo/dog.tur-tles.chickens.txt
I want to match just the:
dog.tur-tles.chickens
I have tried this in regexer:
([^\/]*)$
This matches:
dog.tur-tles.chickens.txt
I can't figure out how to exclude only the last period and what follows it.
You can assume it will always be a .txt, but I want to build in the ability that, if a file were named dog-turtles.txt.txt, the name would be seen as dog-turtles.txt.
You could use something like this: ([^\/]*)(\.).+?$
An example is available here. Note, though, that this will fail for extensions such as .tar.gz and so on.
You may use File::Basename.fileparse to get the file name, then use rindex to get the last index of . and then get the required substring using substr:
use File::Basename;
$x = fileparse('/var/www/foo/dog.tur-tles.chickens.txt');
print substr($x, 0, rindex($x, '.')) . "\n";
Output of a sample program:
dog.tur-tles.chickens
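For comparison, the same "basename first, then cut at the last dot" idea could look like this in Python. It is only a sketch: file_stem is a hypothetical helper name, and posixpath is used on the assumption that the paths always use forward slashes.

```python
import posixpath

def file_stem(path):
    """Return the file name without its directory and last extension."""
    base = posixpath.basename(path)   # drop the directory part
    dot = base.rfind(".")             # index of the last period, -1 if none
    # Keep the whole name when there is no extension to strip
    return base[:dot] if dot > 0 else base

print(file_stem("/var/www/foo/dog.tur-tles.chickens.txt"))
print(file_stem("/var/www/foo/dog-turtles.txt.txt"))
```

Unlike the substr/rindex one-liner, this version leaves a name without any period unchanged rather than trimming its last character.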
With Perl 5.14 or later, you can chain two non-destructive substitutions:
$name = ($pathname =~ s{.*/}{}r =~ s{\.[^.]+$}{}r);
Substitution 1: removes the directory.
Substitution 2: removes the extension, if present.
Just add .txt to your regex; since * is greedy by default, it will match everything up to the last .txt:
([^\/]*)\.txt$
Input:
/var/www/foo/dog.tur-tles.chickens.txt.txt
/var/www/foo/dog.tur-tles.chickens.txt
Output:
dog.tur-tles.chickens.txt
dog.tur-tles.chickens
See DEMO
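The greedy pattern above can be checked quickly in Python. This is a small sketch under the question's assumption that the file always ends in .txt; txt_stem is a name invented for the example.

```python
import re

# Greedy [^/]* takes as much of the last path segment as it can,
# then backs off just enough for the final ".txt" to match
TXT_NAME = re.compile(r'([^/]*)\.txt$')

def txt_stem(path):
    """Return the file name without its final .txt, or None."""
    m = TXT_NAME.search(path)
    return m.group(1) if m else None

print(txt_stem("/var/www/foo/dog.tur-tles.chickens.txt.txt"))
print(txt_stem("/var/www/foo/dog.tur-tles.chickens.txt"))
```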

Extract string of numbers from URL using regex PIG

I'm using PIG to generate a list of URLs that have been recently visited. Each URL contains a string of digits that identifies the product page visited. I'm trying to use the regex_extract_all() function to extract just that string of digits, which varies in length from 6 to 8. It appears directly after jobs2/view/ and usually ends with +&cd, but sometimes it ends with ).
Here are a few example URLs:
(http://a.com/search?q=cache:QD7vZRHkPQoJ:ca.xyz.com/jobs2/view/17069404+&cd=1&hl=en&ct=clnk&gl=ca)
(http://a.com/search?q=cache:G9323j2oNbAJ:ca.xyz.com/jobs2/view/5977065+&cd=1&hl=en&ct=clnk&gl=ca)
(http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk)
(http://a.com/search?q=cache:aNspmG11AJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk)
(http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=cl k&gl=hk)
Here is the current regex I am using:
J = FOREACH jpage GENERATE FLATTEN(REGEX_EXTRACT_ALL(TEXTCOLUMN, '\/view\/(\d+)\+\&')) as (output:chararray);
I have also tried other forms such as:
'[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]', 'view.([0-9]+)', 'view\/([\d]+)\+',
'[0-9][0-9][0-9]+', and
'[0-9][0-9][0-9]*'; none of which work.
Can anybody assist here or have another way of going about it?
Much appreciated,
MM
The reason for the "Unexpected character 'D'" error is that you need a double backslash instead of a single backslash in the pattern string, e.g. replace \d with \\d.
Here is a solution; please validate it against all your input strings.
input.txt
http://a.com/search?q=cache:QD7vZRHkPQoJ:ca.xyz.com/jobs2/view/17069404+&cd=1&hl=en&ct=clnk&gl=ca
http://a.com/search?q=cache:G9323j2oNbAJ:ca.xyz.com/jobs2/view/5977065+&cd=1&hl=en&ct=clnk&gl=ca
http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk
http://a.com/search?q=cache:aNspmG11AJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk
http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clk&gl=hk
http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928)=2&hl=zh-TW&ct=clk&gl=hk
http://webcache.googleusercontent.com/search?q=cache:http://my.linkedin.com/jobs2/view/9919248
Updated Pigscript:
A = LOAD 'input.txt' as line;
B = FOREACH A GENERATE REGEX_EXTRACT(line,'.*/view/(\\d+)([+|&|cd|)?]+)?',1);
dump B;
(17069404)
(5977065)
(16988928)
(16988928)
(16988928)
(16988928)
I'm not familiar with PIG, but this regex will match your target:
(?<=/jobs2/view/)\d+
Because a lookbehind does not consume any characters, the entire match (not just a group of the match) is your number.
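The lookbehind pattern can be tried outside PIG as well. Here is a short Python sketch using URLs from the sample input; job_id is a hypothetical helper name, and whether PIG's regex functions behave identically is not verified here.

```python
import re

# Lookbehind: the match must be preceded by "/jobs2/view/",
# but that prefix is not part of the match itself
JOB_ID = re.compile(r'(?<=/jobs2/view/)\d+')

def job_id(url):
    """Return the digits following /jobs2/view/, or None."""
    m = JOB_ID.search(url)
    return m.group(0) if m else None

urls = [
    "http://a.com/search?q=cache:QD7vZRHkPQoJ:ca.xyz.com/jobs2/view/17069404+&cd=1&hl=en&ct=clnk&gl=ca",
    "http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928)=2&hl=zh-TW&ct=clk&gl=hk",
]
for url in urls:
    print(job_id(url))
```

Since \d+ stops at the first non-digit, the pattern does not need to know whether the number is followed by +, ) or the end of the string.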

How can I get a substring from the middle of a file path in VBScript?

I have the following string in VBScript:
myPath = "C:\Movies\12 Monkeys\12_MONKEYS.ISO"
The path C:\Movies\ is always going to be the same. So here is another path as an example:
myPath = "C:\Movies\The Avengers\DISC_1.ISO"
My question is, how can I pull only the movie folder name, so in the above examples I would get:
myMovie = "12 Monkeys"
myMovie = "The Avengers"
Is there a way to use RegEx with this? Or should I just do some substring and index calls? What is the easiest way to do this?
Consider the code below:
arrPathParts = Split(myPath, "\")
myMovie = arrPathParts(2)
Split the string where the delimiter is the backslash character. Splitting a string returns an array of strings. Your movie is the third item in the array of strings.
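The same split-and-index idea, sketched in Python for comparison (movie_folder and its index parameter are names made up for this example; the paths are the ones from the question):

```python
def movie_folder(path, index=2):
    """Return the path component at `index` after splitting on backslashes."""
    parts = path.split("\\")
    # parts[0] is the drive ("C:"), parts[1] is "Movies", parts[2] the movie folder
    return parts[index] if len(parts) > index else None

print(movie_folder("C:\\Movies\\12 Monkeys\\12_MONKEYS.ISO"))
print(movie_folder("C:\\Movies\\The Avengers\\DISC_1.ISO"))
```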
http://regexr.com?3332n
(?<=C:\\Movies\\).*?(?=\\)
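The lookbehind assertion makes the match start right after C:\Movies\ without including it in the result, the lazy .*? then takes everything up to the next backslash, and the lookahead assertion excludes that backslash from the result. Note that VBScript's own RegExp object does not support lookbehind, so there you would capture the folder name with a group instead, e.g. C:\\Movies\\([^\\]+)\\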
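As a rough check of the lookaround pattern, here it is in Python, whose re module accepts this fixed-width lookbehind. The function name movie_from_path is an assumption for illustration only.

```python
import re

# Lookbehind anchors the match right after C:\Movies\,
# the lazy .*? stops at the next backslash (checked by the lookahead)
MOVIE = re.compile(r'(?<=C:\\Movies\\).*?(?=\\)')

def movie_from_path(path):
    """Return the folder name directly under C:\\Movies\\, or None."""
    m = MOVIE.search(path)
    return m.group(0) if m else None

print(movie_from_path("C:\\Movies\\12 Monkeys\\12_MONKEYS.ISO"))
print(movie_from_path("C:\\Movies\\The Avengers\\DISC_1.ISO"))
```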