Retrieving contents of a CSS Selector - regex

I would like to extract "1381912680" from the following code:
[<abbr class="timestamp" data-utime="1381912680"></abbr>]
Using Python 2.7, this is what I currently have in my code to get to that stage:
s = soup.find_all("abbr", { "class" : "timestamp" })
print s
Should I use regex or can BS do it on its own?
EDIT
I tried using a regex, but with no luck:
import re
regex = 'data-utime=\"(\d+)\"'
x = re.compile(regex)
x2 = re.findall(x, s)
print x2
I got: TypeError: expected string or buffer

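class is a reserved word in Python, so BeautifulSoup takes the keyword argument class_ instead:
s = soup.find("abbr", class_="timestamp")
but the <abbr> tag has no text content, so this alone won't give you the number; use the other answers to read the data-utime attribute :)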
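For reference, BeautifulSoup can read the attribute directly, without a regex. The following is a minimal sketch, assuming the bs4 package is installed and that the markup matches the snippet in the question; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question
html = '<abbr class="timestamp" data-utime="1381912680"></abbr>'
soup = BeautifulSoup(html, "html.parser")

# A Tag supports dictionary-style access to its attributes
tag = soup.find("abbr", class_="timestamp")
value = tag["data-utime"]
print(value)

# find_all returns a list-like result, so collect the attribute per tag
all_values = [t["data-utime"] for t in soup.find_all("abbr", class_="timestamp")]
print(all_values)
```

The attribute value comes back as a string, so wrap it in int() if you need a number.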

You could use the below regex to extract the number within double quotes,
(?<=data-utime=\")[^\"]*
Python code would be,
>>> import re
>>> text = '[<abbr class="timestamp" data-utime="1381912680"></abbr>]'
>>> m = re.findall(r'(?<=data-utime=\")[^\"]*', text)
>>> m
['1381912680']
Explanation:
(?<=data-utime=\") Regex engine sets a marker just after to the string data-utime="
[^\"]* Matches nay character zero or more times upto the literal "
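As for the TypeError in the question: re.findall expects a string, but find_all returns a list-like result set of tags, not text. Below is a small sketch of the capture-group pattern from the question applied to a plain string. The html variable is an assumption that stands in for str(s).

```python
import re

# Stand-in for str(s), where s is the result of soup.find_all(...)
html = '[<abbr class="timestamp" data-utime="1381912680"></abbr>]'

# With a capture group, findall returns only the captured digits
pattern = re.compile(r'data-utime="(\d+)"')
timestamps = pattern.findall(html)
print(timestamps)  # ['1381912680']
```

Inside a raw string delimited by single quotes, the double quotes need no backslash.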

Related

Regex to extract text from request id

I have a log in which one field is a request ID, and I need to extract some text from that ID.
Ex: RES_1621480647_49610052479341623017223137119508459972977816017376903362_Book,
Can anyone please help me extract Book from it?
Consider string splitting instead
>>> s = "RES_1621480647_49610052479341623017223137119508459972977816017376903362_Book"
>>> s.split("_")[-1]
'Book'
String splitting is likely to be more efficient, but if you must use regular expressions, here is an example:
#!/usr/bin/env python3
import re
print(
    re.findall(r"^\w+_\d+\d+_(\w+)$", 'RES_1621480647_49610052479341623017223137119508459972977816017376903362_Book')
)
# output: ['Book']
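If only the part after the last underscore matters, both approaches can be made a little more direct. This is a sketch under that assumption, written for Python 3; the variable names are illustrative.

```python
import re

s = "RES_1621480647_49610052479341623017223137119508459972977816017376903362_Book"

# Split once from the right, so only the last underscore is considered
last_by_split = s.rsplit("_", 1)[-1]

# Regex equivalent: everything after the final underscore
match = re.search(r"_([^_]+)$", s)
last_by_regex = match.group(1) if match else None

print(last_by_split, last_by_regex)  # Book Book
```

Neither version depends on how many underscore-separated fields come before the last one.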

python3 regular expression max length limit [duplicate]

I'm trying to compile a large pattern with re.compile in Python 3.
The pattern is built from 500 short words (I want to remove them from a text). The problem is that the pattern appears to stop after about 18 words.
Python doesn't raise any error.
What I do is:
stoplist = map(lambda s: "\\b" + s + "\\b", stoplist)
stopstring = '|'.join(stoplist)
stopword_pattern = re.compile(stopstring)
stopstring is fine (it contains all the words), but the compiled pattern looks much shorter. It even stops in the middle of a word!
Is there a max length for the regex pattern?
Consider this example:
import re
stop_list = map(lambda s: "\\b" + str(s) + "\\b", range(1000, 2000))
stopstring = "|".join(stop_list)
stopword_pattern = re.compile(stopstring)
If you try to print the pattern, you'll see something like
>>> print(stopword_pattern)
re.compile('\\b1000\\b|\\b1001\\b|\\b1002\\b|\\b1003\\b|\\b1004\\b|\\b1005\\b|\\b1006\\b|\\b1007\\b|\\b1008\\b|\\b1009\\b|\\b1010\\b|\\b1011\\b|\\b1012\\b|\\b1013\\b|\\b1014\\b|\\b1015\\b|\\b1016\\b|\\b1017\\b|\)
which seems to indicate that the pattern is incomplete. However, this is only a limitation of how compiled pattern objects are displayed: in CPython the repr shows only about the first 200 characters of the pattern string. The compiled pattern itself is complete. If you try to perform a match against the "missing" part of the pattern, you'll see that it still succeeds:
>>> stopword_pattern.match("1999")
<_sre.SRE_Match object; span=(0, 4), match='1999'>
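To check this without relying on the printed form, you can compare the pattern attribute of the compiled object with the original string. A minimal sketch, reusing the numeric stop list from the example; the sample sentence is made up.

```python
import re

stop_list = [r"\b" + str(n) + r"\b" for n in range(1000, 2000)]
stopstring = "|".join(stop_list)
stopword_pattern = re.compile(stopstring)

# The full source string is kept on the compiled object
full_pattern_kept = stopword_pattern.pattern == stopstring

# Only the printed representation is shorter than the pattern
repr_is_shorter = len(repr(stopword_pattern)) < len(stopstring)

# Alternatives near the end of the pattern still match
cleaned = stopword_pattern.sub("", "keep 1999 and 1500 out")
print(full_pattern_kept, repr_is_shorter, repr(cleaned))
```

For a stop list made of real words, passing each word through re.escape before joining guards against words that contain regex metacharacters.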

capture special string between 2 characters in python

I'm working on a Python project and don't have much experience. If I have the string Synset'dog.n.01' and want to extract only dog, what should I do?
That is, I want to extract whatever string sits between Synset' and .n.01'.
I suggest using re (regex):
import re
s = "Synset'dog.n.01'"
result = re.search(r"Synset'(.*)\.n\.01'", s)  # escape the dots so they match literal periods
print(result.group(1))
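re.search returns None when the text does not match, so calling group(1) directly can raise AttributeError. Here is a small sketch of a helper that handles that case. The names SYNSET_RE and extract_lemma are hypothetical, and the non-greedy (.+?) is a design choice rather than something the question requires.

```python
import re

# Dots are escaped so they only match literal periods
SYNSET_RE = re.compile(r"Synset'(.+?)\.n\.01'")

def extract_lemma(text):
    """Return the text between Synset' and .n.01', or None if absent."""
    match = SYNSET_RE.search(text)
    return match.group(1) if match else None

print(extract_lemma("Synset'dog.n.01'"))  # dog
print(extract_lemma("no synset here"))    # None
```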

regex : all strings containing "tomcat/logs"

I want to know how to match all strings containing tomcat/logs ?
For example : /home/tomcat/logs, /etc/tomcat/logs, /home/folder/tomcat/logs
Thanks.
Edited :
I'm using this to exclude backup directories, so I just need a regular expression, independent of any specific language.
You can do something like this (in Python):
>>> import re
>>> string_to_find_in = '/home/tomcat/logs'
>>> m = re.search(r'.*tomcat/logs', string_to_find_in)
>>> m.group(0)
'/home/tomcat/logs'
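Since the question asks for a language-independent pattern: the bare pattern tomcat/logs is enough when the tool looks for a match anywhere in the string, while .*tomcat/logs.* covers tools that require the whole string to match. The sketch below, in Python 3, shows both behaving the same on the example paths. The last path is a made-up non-match.

```python
import re

paths = [
    "/home/tomcat/logs",
    "/etc/tomcat/logs",
    "/home/folder/tomcat/logs",
    "/var/log/nginx",  # hypothetical path that should not match
]

contains = re.compile(r"tomcat/logs")   # for "search anywhere" semantics
whole = re.compile(r".*tomcat/logs.*")  # for "match the whole string" semantics

by_search = [p for p in paths if contains.search(p)]
by_fullmatch = [p for p in paths if whole.fullmatch(p)]
print(by_search == by_fullmatch)  # True
```

In tools that delimit patterns with slashes, the / inside the pattern has to be escaped as \/.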

Regular expression syntax in python

I'm trying to write a Python script to analyze a data file (txt). I want the script to find all the time values in one line and compare them.
This is my first time writing regex syntax, so I wrote a small script first.
My script is:
import sys
txt = open('1.txt','r')
a = []
for eachLine in txt:
    a.append(eachLine)
import re
pattern = re.compile('\d{2}:\d{2}:\d{2}')
for i in xrange(len(a)):
    print pattern.match(a[i])
#print a
The output is always None.
My txt looks like the picture:
What's the problem? Please help me. Thanks a lot.
I'm using Python 2.7.2 on Windows XP SP3.
Didn't you miss one of the ":" in your regex? I think you meant
re.compile('\d{2}:\d{2}:\d{2}')
There are two other problems:
First, if you want to search the whole text, use search instead of match: match only succeeds at the beginning of the string. Second, to access the result you need to call group() on the match object returned by search.
Try it:
import re
import sys

txt = open('1.txt', 'r')
a = []
for eachLine in txt:
    a.append(eachLine)
pattern = re.compile(r'\d{2}:\d{2}:\d{2}')
for i in xrange(len(a)):
    match = pattern.search(a[i])
    if match:  # search returns None for lines without a time
        print match.group()
#print a
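To illustrate the difference between match and search described above, here is a minimal sketch. The sample line is hypothetical, since the original data file is not shown.

```python
import re

pattern = re.compile(r"\d{2}:\d{2}:\d{2}")
line = "start 10:15:30 end 10:17:02"  # hypothetical sample line

# match only looks at the beginning of the string, so this is None
at_start = pattern.match(line)

# search scans the whole string and returns the first hit
anywhere = pattern.search(line)
first_time = anywhere.group() if anywhere else None

print(at_start, first_time)  # None 10:15:30
```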
I think you're missing the colons and dots in your regex. Also try using re.search or re.findall instead on the entire text. Like this:
import re, sys
text = open("./1.txt", "r").read() # or readlines() to make a list of lines
pattern = re.compile('\d{2}:\d{2}:\d{2}')
matches = pattern.findall(text)
for i in matches:
    print(i)
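The question also asks how to compare the times found in one line. One possible approach, sketched here under the assumption that each line holds times in HH:MM:SS form, is to parse them with datetime.strptime and subtract. The sample line is hypothetical.

```python
import re
from datetime import datetime

pattern = re.compile(r"\d{2}:\d{2}:\d{2}")
line = "start 10:15:30 end 10:17:02"  # hypothetical sample line

# findall returns every time in the line as a string; parse each one
times = [datetime.strptime(t, "%H:%M:%S") for t in pattern.findall(line)]

if len(times) >= 2:
    # Subtracting two datetime objects gives a timedelta
    elapsed = times[-1] - times[0]
    print(elapsed.total_seconds())  # 92.0
```

Note that the pattern also accepts impossible values such as 99:99:99, for which strptime raises ValueError.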