Parsing a text file using Python for fields and their values - regex

I am new to coding with Python and learning it step by step. I have an assignment to parse a text file and update a database. The file contains status fields like:
ticket summary:
Frequency:
Action taken: <something like "restarted server">
Status:
I want to parse this file, fetch the values of fields like "ticket summary" and "frequency", and put them into the database, where columns for them are defined. I have been reading about Python regex and substring parsing, but I can't find how to start. I need help.

Since you haven't included an example, I will provide one to use here. I am assuming you have already installed MongoDB and pymongo and have an instance running locally. As an example, I am choosing my file to have the following format:
Frequency: restarted server, Action taken- none, Status: active
The code below uses regex to extract the fields. For demos, have a look here: regex demo
import pymongo
import re

client = pymongo.MongoClient()
db = client['some-db']

# Each pattern puts the label in a named group so the value can be
# sliced out between the end of the label and the end of the match.
freqgroup = re.compile(r"(?P<Frequency>Frequency[:\-]\s?).*?(?=,)", flags=re.I)
actgroup = re.compile(r"(?P<ActionTaken>Action\sTaken[:\-]\s?).*?(?=,)", flags=re.I)
statgroup = re.compile(r"(?P<Status>Status[:\-]\s?).*", flags=re.I)

# Open in text mode: the patterns are str, so the lines must be str too.
with open("some-file.txt", "r") as f:
    for line in f:
        k = freqgroup.search(line)
        db.posts.insert_one({"Frequency": line[k.end("Frequency"):k.end()]})
        k = actgroup.search(line)
        db.posts.insert_one({"ActionTaken": line[k.end("ActionTaken"):k.end()]})
        k = statgroup.search(line)
        db.posts.insert_one({"Status": line[k.end("Status"):k.end()]})

Related

Regex in spark.read.json

I want to read, from a Hadoop directory, all JSON files whose timestamp is one hour before the current time.
The file names look like test_2020021418553333.
import java.util.Calendar;
import java.text.SimpleDateFormat;
val form = new SimpleDateFormat("yyyyMMddhh");
val c = Calendar.getInstance();
c.add(Calendar.HOUR, -1);
val path ="/Test_"+form.format(c.getTime())+"*";
val test_df = spark.read.json(path)
When I run this code, I get a "Path does not exist" error.
Can anyone suggest how to read file names like Test_20200214{any combination of digits}?
A quick test shows that the formatted string already contains the hour:
form.format(c.getTime())
res2: String = 2020021401
so the glob becomes Test_2020021401*, which doesn't match file names like test_2020021418553333. Remove the last two characters from the formatted string.
Regards
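An alternative is to keep the hour in the glob but format it with a 24-hour pattern; Java's hh in the question is the 12-hour clock, which is another reason the glob can disagree with the 24-hour file names. A minimal PySpark sketch of this idea (the path and the Test_ prefix are assumptions taken from the question):

from datetime import datetime, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Files written during the previous hour share this prefix, e.g. "2020021417".
# %H is the 24-hour clock, matching timestamps like 2020021418553333.
hour_prefix = (datetime.now() - timedelta(hours=1)).strftime("%Y%m%d%H")

test_df = spark.read.json("/Test_" + hour_prefix + "*")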

Cleaning up re.search output?

I have been writing a script that recovers CVSS3 scores for me when I enter a vulnerability name. I've pretty much got it working as intended, except for one minor annoying detail.
π ~/Documents/Tools/Scripts ❯ python3 CVSS3-Grabber.py
Paste Vulnerability Name: PHP 7.2.x < 7.2.21 Multiple Vulnerabilities.
Base Score: None
Vector: <re.Match object; span=(27869, 27913), match='CVSS:3.0/AV:N/AC:L/PR:N/UI:R/S:U/C:L/I:N/A:H'>
Temporal Vector: <re.Match object; span=(27986, 28008), match='CVSS:3.0/E:U/RL:O/RC:C'>
As can be seen, the output could be much neater; I would much prefer something like this:
π ~/Documents/Tools/Scripts ❯ python3 CVSS3-Grabber.py
Paste Vulnerability Name: PHP 7.2.x < 7.2.21 Multiple Vulnerabilities.
Base Score: None
Vector: CVSS:3.0/AV:N/AC:L/PR:N/UI:R/S:U/C:L/I:N/A:H
However, I have been struggling to figure out how to make the output nicer. Is there an easy part of the re module that I'm missing that can do this for me? Or would writing the output to a file first let me manipulate the text into the form I need?
Here is my code. I would appreciate any feedback on how to improve it, as I have recently gotten back into Python and scripting in general.
import requests
import re
from bs4 import BeautifulSoup
from googlesearch import search

def get_url():
    vuln = input("Paste Vulnerability Name: ") + "tenable"
    for url in search(vuln, tld='com', lang='en', num=1, start=0, stop=1, pause=2.0):
        return url

def get_scores(url):
    response = requests.get(url)
    html = response.text
    cvss3_temporal_v = re.search("CVSS:3.0/E:./RL:./RC:.", html)
    cvss3_v = re.search("CVSS:3.0/AV:./AC:./PR:./UI:./S:./C:./I:./A:.", html)
    cvss3_basescore = re.search("Base Score:....", html)
    print("Base Score: ", cvss3_basescore)
    print("Vector: ", cvss3_v)
    print("Temporal Vector: ", cvss3_temporal_v)

urll = get_url()
get_scores(urll)

### IMPROVEMENTS ###
# Include the base score in output
# Tidy up output
# Vulnerability list?
# modify to accept flags, i.e python3 CVSS3-Grabber.py -v VULNAME ???
# State whether it is a failing issue or Action point
Thanks!
Don't print the match object. Print the match value.
In Python the value is accessible through the .group() method. If there are no regex subgroups (or you want the entire match, like in this case), don't specify any arguments when you call it:
print("Vector: ", cvss3_v.group())

How to save Regular Expression Extractor output to a file (CSV or any other format) in JMeter

Hi StackOverflow community,
I am working on JMeter. I built a script that runs against our web application.
In one request I fetched a value from the HTTP response using the Regular Expression Extractor; all well so far.
I want to save this value to a file (CSV, TXT, any format is fine).
How can this be done in JMeter?
Thanks in advance
To save the extracted value to a CSV file, follow the procedure below.
Remove the Regular Expression Extractor, as we will extract from the response in a JSR223 PostProcessor instead.
Add a JSR223 PostProcessor and paste the code below into its script section. Choose the language "Groovy 2.4.12 / Groovy scripting engine 2.0".
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.io.FileWriter;
import java.io.BufferedWriter;

String stringToSearch = prev.getResponseDataAsString();
Pattern p = Pattern.compile('value="(PR.+?)"');
Matcher match = p.matcher(stringToSearch);
if (match.find()) {
    def value = match.group(1)
    log.info('------------------')
    log.info(value) // to check the extracted data in the jmeter log
    vars.put('a', value)
}
// open the csv file for appending (a new file is created if it does not exist)
FileWriter fileWriter = new FileWriter("C:\\Users\\Tarik\\Desktop\\example.csv", true); // true to append
BufferedWriter out = new BufferedWriter(fileWriter);
out.write(vars.get("a"));
out.close();
fileWriter.close();
This worked, thank you.
I used println instead of write so that consecutive values each go on a new line.

Python lxml xpath no output

For educational purposes I am trying to scrape this page using lxml and requests in Python.
Specifically, I just want to print the research areas of all the professors on the page.
This is what I have done so far:
import requests
from lxml import html

response = requests.get('http://cse.iitkgp.ac.in/index.php?secret=d2RkOUgybWlNZzJwQXdLc28wNzh6UT09')
parsed_body = html.fromstring(response.content)
for row in parsed_body.xpath('//div[@id="maincontent"]//tr[position() mod 2 = 1]'):
    for column in row.xpath('//td[@class="fcardcls"]/tr[2]/td/font/text()'):
        print column.strip()
But it is not printing anything. I was struggling quite a bit with XPath and initially used the "copy XPath" feature in Chrome. I followed what was done in the following SO questions/answers, cleaned up my code quite a bit, and got rid of tbody in the XPath expressions. Still the code returns nothing.
1. Empty List Returned
2. Python-lxml-xpath problem
First of all, the main content with the desired data is loaded from a different endpoint via an XHR request, so simulate that request in your code.
Here is complete working code that prints each name and the list of research areas per name:
import requests
from lxml import html

response = requests.get('http://cse.iitkgp.ac.in/faculty4.php?_=1450503917634')
parsed_body = html.fromstring(response.content)
for row in parsed_body.xpath('.//td[@class="fcardcls"]'):
    name = row.findtext(".//a[@href]/b")
    name = ' '.join(name.split())  # getting rid of multiple spaces
    research_areas = row.xpath('.//*[. = "Research Areas: "]/following-sibling::text()')[0].split(", ")
    print(name, research_areas)
The idea here is to use the fact that all "professor blocks" are located in td elements with class="fcardcls". For every block, get the name from the bold link text and the research areas from the text that follows the bold "Research Areas: " label.

Scraping a messy source page with Beautiful Soup

I am trying to do some web scraping using Python and Beautiful Soup, but the source of the webpage is not the prettiest. The code below is a small part of the source page:
...717301758],"birthdayFriends":2,"lastActiveTimes":{"719317510":0,"719435783":0,...
I want to get the value 2 after the string "birthdayFriends", but I have no idea how to get it. So far I have written the code below, but it only prints an empty list.
import urllib2
from bs4 import BeautifulSoup

# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='myWebpage',
                          user='myUsername',
                          passwd='myPassword')
opener = urllib2.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib2.install_opener(opener)

page = urllib2.urlopen('myWebpage')
soup = BeautifulSoup(page.read())
bf = soup.findAll('birthdayFriends')
print bf
>> []
Suppose somewhere in the HTML there is a script tag like the following:
<script>
var x = {"birthdayFriends":2,"lastActiveTimes":{"719317510":0,"719435783":0}}
</script>
then your code might look something like this:
script = soup.findAll('script')[0]  # or whichever index it appears at in the page
# take the json part (everything after the assignment)
j = script.text.split('=')[1]
import json
# load the json string into a dictionary
d = json.loads(j, strict=False)
print d["birthdayFriends"]
In case the content of the script tag is more complicated, consider looping over the script's lines, or see How can I parse Javascript variables using python?
Also, for parsing JavaScript in Python, see pynoceros.
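If the JavaScript around the value is messier than a clean var x = {...} assignment, a regex over the raw page text is another option. A sketch targeting just this one field (the sample string comes from the question):

import re

html = '...717301758],"birthdayFriends":2,"lastActiveTimes":{"719317510":0,"719435783":0,...'

# Grab the integer that follows the "birthdayFriends" key.
m = re.search(r'"birthdayFriends"\s*:\s*(\d+)', html)
if m:
    print(int(m.group(1)))  # prints 2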