Regex to find everything in between - regex

I have the following regex, which works when there is no leading \d,"There is 1 interface on the system:
or trailing ",2017-01-...
Here is the regex:
(?m)(?<_KEY_1>\w+[^:]+?):\s(?<_VAL_1>[^\r\n]+)$
Here is a sample of what I am trying to parse:
1,"There is 1 interface on the system:
Name : Mobile Broadband Connection
Description : Qualcomm Gobi 2000 HS-USB Mobile Broadband Device 250F
GUID : {1234567-12CD-1BC1-A012-C1A1234CBE12}
Physical Address : 00:a0:c6:00:00:00
State : Connected
Device type : Mobile Broadband device is embedded in the system
Cellular class : CDMA
Device Id : A1000001234f67
Manufacturer : Qualcomm Incorporated
Model : Qualcomm Gobi 2000
Firmware Version : 09010091
Provider Name : Verizon Wireless
Roaming : Not roaming
Signal : 67%",2017-01-20T16:00:07.000-0700
I am trying to extract field names and values (for example, Cellular class would equal CDMA), but only for the fields that come after:
1,"There is 1 interface on the system: (where the leading 1 increments: 1, 2, 3, 4 and so on)
and before the trailing ",2017-01....
Any help is much appreciated!

You could use look-ahead to ensure that the strings you match come before a ",\d sequence, and do not include a ". The latter would ensure you will only match between double quotes, of which the second has the pattern ",\d:
/^\h*(?<_KEY_1>[\w\h]+?)\h*:\h*(?<_VAL_1>[^\r\n"]+)(?="|$)(?=[^"]*",\d)/gm
See it on regex101
NB: I put the g and m modifiers at the end, but if your environment requires them at the start with (?m) notation, that will work too of course.
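For readers working in Python, the same look-ahead idea can be sketched as follows. This is only an illustration under stated assumptions: Python's re module has no \h, so [ \t] is used instead, named groups are written (?P<name>...), and the group names key and val as well as the shortened SAMPLE string are made up for the example.

```python
import re

# Shortened copy of the record from the question (assumption: one record per string).
SAMPLE = (
    '1,"There is 1 interface on the system:\n'
    'Name : Mobile Broadband Connection\n'
    'Physical Address : 00:a0:c6:00:00:00\n'
    'Cellular class : CDMA\n'
    'Signal : 67%",2017-01-20T16:00:07.000-0700'
)

PATTERN = re.compile(
    r'^[ \t]*(?P<key>[\w \t]+?)[ \t]*:[ \t]*'  # key: words and blanks up to the first colon
    r'(?P<val>[^\r\n"]+)'                      # value: rest of the line, never a double quote
    r'(?="|$)'                                 # value ends at a quote or at the end of the line
    r'(?=[^"]*",\d)',                          # and the closing ",<digit> still lies ahead
    re.MULTILINE,
)

pairs = {m.group('key'): m.group('val') for m in PATTERN.finditer(SAMPLE)}
for key, val in pairs.items():
    print(key, '=', val)
```

The header line is skipped because its first character is followed by a comma, which is neither part of a key nor a colon.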

Your example string seems to be a record from a CSV file. This is how I would accomplish the task with Python (2.7 or 3.x):
import csv
with open('file.csv', 'r') as fh:
    reader = csv.reader(fh)
    results = []
    for fields in reader:
        lines = fields[1].splitlines()
        keyvals = [list(map(str.strip, line.split(':', 1))) for line in lines[1:]]
        results.append(keyvals)
print(results)
It can be done in a similar way with other languages.

You haven't responded to my comments or any of the answers, but here is my answer - try
^\s*(?<_KEY_1>[\w\s]+?)\s*:\s*(?<_VAL_1>[^\r\n"]+).*$
See it here at regex101.

Related

Telegraf: How to extract from field using regex processor?

I would like to extract the values for connections, upstream and downstream using telegraf regex processor plugin from this input:
2022/11/16 22:38:48 In the last 1h0m0s, there were 10 connections. Traffic Relayed ↑ 60 MB, ↓ 4 MB.
With this configuration, the result key "upstream" is a copy of the original message in which only the matched portion has been replaced.
[[processors.regex]]
tagpass = ["snowflake-proxy"]
[[processors.regex.fields]]
## Field to change
key = "message"
## All the power of the Go regular expressions available here
## For example, named subgroups
pattern = 'Relayed.{3}(?P<UPSTREAM>\d{1,4}\W.B),'
replacement = "${UPSTREAM}"
## If result_key is present, a new field will be created
## instead of changing existing field
result_key = "upstream"
Current output:
2022/11/17 10:38:48 In the last 1h0m0s, there were 1 connections. Traffic 3 MB ↓ 5 MB.
How do I get the decimals?
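I'm quite confused about how to use the regex here, because according to several examples on the web it should work like this. See for example: http://wiki.webperfect.ch/index.php?title=Telegraf:_Processor_Plugins
The replacement config option specifies what is substituted for the text that the pattern matches. Because your pattern matches only part of the message, only that part is replaced and the rest of the message is kept.
I think you want something closer to this, where the pattern matches the whole message: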
[[processors.regex.fields]]
key = "message"
pattern = '.*Relayed.{3}(?P<UPSTREAM>\d{1,4}\W.B),.*$'
replacement = "${1}"
result_key = "upstream"
to get:
upstream="60 MB"
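The difference between the two patterns can be reproduced outside Telegraf. The sketch below uses Python's re module rather than Go's regexp package, so the replacement is written \g<UPSTREAM> instead of ${UPSTREAM}; the variable names are made up for the illustration, and it is only meant to show why the pattern has to cover the whole message.

```python
import re

MESSAGE = (
    "2022/11/16 22:38:48 In the last 1h0m0s, there were 10 connections. "
    "Traffic Relayed ↑ 60 MB, ↓ 4 MB."
)

# Pattern from the question: matches only the "Relayed ↑ 60 MB," part.
PARTIAL = r'Relayed.{3}(?P<UPSTREAM>\d{1,4}\W.B),'
# Pattern from the answer: the leading .* and trailing .*$ consume the rest too.
WHOLE = r'.*Relayed.{3}(?P<UPSTREAM>\d{1,4}\W.B),.*$'

partial_result = re.sub(PARTIAL, r'\g<UPSTREAM>', MESSAGE)
whole_result = re.sub(WHOLE, r'\g<UPSTREAM>', MESSAGE)

print(partial_result)  # the message with only the matched part swapped out
print(whole_result)    # just the captured group
```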

fetching name and age from a text file

I have a .txt file from which I have to fetch name and age.
The .txt file has data in the format like:
Age: 71 . John is 47 years old. Sam; Born: 05/04/1989(29).
Kenner is a patient Age: 36 yrs Height: 5 feet 1 inch; weight is 56 kgs.
This medical record is 10 years old.
Output 1: John, Sam, Kenner
Output_2: 47, 29, 36
I am using regular expressions to extract the data. For example, for age, I am using the regular expressions below:
re.compile(r'age:\s*\d{1,3}',re.I)
re.compile(r'(age:|is|age|a|) \s*\d{1,3}(\s|y)',re.I)
re.compile(r'.* Age\s*:*\s*[0-9]+.*',re.I)
re.compile(r'.* [0-9]+ (?:year|years|yrs|yr) \s*',re.I)
I will apply another regular expression to the output of these regular expressions to extract the numbers. The problem is that these regular expressions also return data that I do not want. For example:
This medical record is 10 years old.
I am getting '10' from the above sentence which I do not want.
I only want to extract the names of people and their age. I want to know what should be the approach? I would appreciate any kind of help.
Please take a look at the Cloud Data Loss Prevention API. Here is a GitHub repo with examples. This is what you'll likely want.
def inspect_string(project, content_string, info_types,
                   min_likelihood=None, max_findings=None, include_quote=True):
    """Uses the Data Loss Prevention API to analyze strings for protected data.
    Args:
        project: The Google Cloud project id to use as a parent resource.
        content_string: The string to inspect.
        info_types: A list of strings representing info types to look for.
            A full list of info type categories can be fetched from the API.
        min_likelihood: A string representing the minimum likelihood threshold
            that constitutes a match. One of: 'LIKELIHOOD_UNSPECIFIED',
            'VERY_UNLIKELY', 'UNLIKELY', 'POSSIBLE', 'LIKELY', 'VERY_LIKELY'.
        max_findings: The maximum number of findings to report; 0 = no maximum.
        include_quote: Boolean for whether to display a quote of the detected
            information in the results.
    Returns:
        None; the response from the API is printed to the terminal.
    """
    # Import the client library.
    import google.cloud.dlp
    # Instantiate a client.
    dlp = google.cloud.dlp.DlpServiceClient()
    # Prepare info_types by converting the list of strings into a list of
    # dictionaries (protos are also accepted).
    info_types = [{'name': info_type} for info_type in info_types]
    # Construct the configuration dictionary. Keys which are None may
    # optionally be omitted entirely.
    inspect_config = {
        'info_types': info_types,
        'min_likelihood': min_likelihood,
        'include_quote': include_quote,
        'limits': {'max_findings_per_request': max_findings},
    }
    # Construct the `item`.
    item = {'value': content_string}
    # Convert the project id into a full resource id.
    parent = dlp.project_path(project)
    # Call the API.
    response = dlp.inspect_content(parent, inspect_config, item)
    # Print out the results.
    if response.result.findings:
        for finding in response.result.findings:
            try:
                if finding.quote:
                    print('Quote: {}'.format(finding.quote))
            except AttributeError:
                pass
            print('Info type: {}'.format(finding.info_type.name))
            print('Likelihood: {}'.format(finding.likelihood))
    else:
        print('No findings.')
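If a regex-only approach is preferred, one option is to accept a number only when it follows a capitalised word in the same sentence. The sketch below is a heuristic tuned to the three sample lines from the question, not a general solution: the NOT_NAMES stop list, the three age phrasings, and all identifiers are assumptions made for this illustration.

```python
import re

TEXT = (
    "Age: 71 . John is 47 years old. Sam; Born: 05/04/1989(29).\n"
    "Kenner is a patient Age: 36 yrs Height: 5 feet 1 inch; weight is 56 kgs.\n"
    "This medical record is 10 years old."
)

# Hypothetical stop list: capitalised words in the sample that are not names.
NOT_NAMES = ("This", "Age", "Height", "Born")

NAME_AGE = re.compile(
    r"\b(?!(?:" + "|".join(NOT_NAMES) + r")\b)"  # skip words on the stop list
    r"(?P<name>[A-Z][a-z]+)\b"                   # candidate name: one capitalised word
    r"[^.\n]*?"                                  # stay inside the same sentence
    r"(?:\bis\s+(?P<a1>\d{1,3})\s+years?\s+old"  # "is 47 years old"
    r"|\bAge:\s*(?P<a2>\d{1,3})"                 # "Age: 36"
    r"|\d{2}/\d{2}/\d{4}\((?P<a3>\d{1,3})\))"    # "05/04/1989(29)"
)

def extract_name_ages(text):
    """Return (name, age) pairs found by the heuristic pattern above."""
    pairs = []
    for m in NAME_AGE.finditer(text):
        age = m.group("a1") or m.group("a2") or m.group("a3")
        pairs.append((m.group("name"), int(age)))
    return pairs

print(extract_name_ages(TEXT))
```

Tying each age to a name is what drops the leading "Age: 71", which has no name in front of it, while the stop list drops "This medical record is 10 years old." Real records would need a proper name recogniser instead of a stop list.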

Get a string after a specific word, using a program that has limited regex features?

I'm looking for help building a regex that captures a one-line string after a specific word.
The challenge I'm running into is that the program where I need to build this regex uses single-line mode, in other words the dot matches newlines, so the formula I created isn't working. See more details below. Any advice or tips?
More specific regex task:
I'm trying to grab the line that comes after the word Details in entries like the ones below. The goal is to pull out 100% Silk or 100% Velvet. This is the material of the product, which always comes after Details.
Raw data:
<p>Loose fitted blouse green/yellow lily print.
V-neck opening with a closure string.
Small tie string on left side of top.</p>
<h3>Details</h3> <p>100% Silk.</p>
<p>Made in Portugal.</p> <h3>Fit</h3>
<p>Model is 5‰Ûª10,‰Û size 2 wearing size 34.</p> <p>Size 34 measurements</p>
OR
<p>The velvet version of this dress. High waist fit with hook and zipper closure.
Seams run along edges of pants to create a box-like.</p>
<h3>Details</h3> <p>100% Velvet.</p>
<p>Made in the United States.</p>
<h3>Fit</h3> <p>Model is 5‰Ûª10‰Û, size 2 and wearing size M pants.</p> <p>Size M measurements Length: 37.5"åÊ</p>
<p>These pants run small. We recommend sizing up.</p>
Here is the current formula I created that's not working:
Replace (.)(\bDetails\s+(.)) with $3
The output gives the below:
<p>100% Silk.</p>
<p>Made in Portugal.</p>
<h3>Fit</h3>
<p>Model is 5‰Ûª10,‰Û size 2 wearing size 34.</p>
<p>Size 34 measurements</p>
OR
<p>100% Velvet.</p>
<p>Made in the United States.</p>
<h3>Fit</h3> <p>Model is 5‰Ûª10‰Û, size 2 and wearing size M pants.</p> <p>Size M measurements Length: 37.5"åÊ</p>
<p>These pants run small. We recommend sizing up.</p>
How do I capture just the desired string? Let me know if you have any tips! Thank you!
It is difficult to provide a working solution for your situation, because you mention that your program has "limited regex features" but don't explain what the limitations are.
Here is a regex you can try in order to capture the target string:
^(?:<h3>Details<\/h3>)(.*)$
I would personally use BeautifulSoup for something like this, but here are two solutions you could use:
Match the line after "Details", then pull out the data.
import re
# re.M lets $ match at the end of every line, not only at the end of the text
matches = re.findall('(?<=Details<).*$', text, flags=re.M)
matches = [i.strip('<>') for i in matches]
matches = [i.split('<')[0] for i in [j.split('>')[-1] for j in matches]]
Replace "Details<...>data" with "Detailsdata", then find the data.
# replace with 'Details' and keep both tag matches lazy so the data itself survives
text = re.sub('Details<.*?<.*?>', 'Details', text)
matches = re.findall('(?<=Details).*?(?=<)', text)
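Another option is a single pattern that never relies on the dot, so it behaves the same whether or not the dot matches newlines. The sketch below is in Python and assumes the markup always has the shape <h3>Details</h3> followed by one <p>...</p>; the names DETAILS and material are made up for the example, and whether the asker's tool supports the same syntax is unknown.

```python
import re

# [^<]+ stops at the next tag, so no dot (and no single-line setting) is involved.
DETAILS = re.compile(r'<h3>Details</h3>\s*<p>([^<]+)</p>')

def material(html):
    """Return the text of the first paragraph after the Details heading, or None."""
    match = DETAILS.search(html)
    return match.group(1).strip() if match else None

SAMPLE = (
    '<p>Small tie string on left side of top.</p>\n'
    '<h3>Details</h3> <p>100% Silk.</p>\n'
    '<p>Made in Portugal.</p> <h3>Fit</h3>'
)

print(material(SAMPLE))
```

In a replace-only tool the equivalent would be to match the whole entry around that capture group and replace it with the group reference.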

R- Subset a corpus by meta data (id) matching partial strings

I'm using the R (3.2.3) tm package (0.6-2) and would like to subset my corpus by partial string matches within the metadata field "id".
For example, I would like to filter all documents that contain the string "US" within the "id" column. The string "US" would be preceded and followed by various characters and numbers.
I have found a similar example here. It is recommended to download the quanteda package but I think this should also be possible with the tm package.
Another, more relevant answer to a similar problem is found here. I have tried to adapt that sample code to my context, but I have not managed to incorporate the partial string matching.
I imagine there might be multiple things wrong with my code so far.
What I have so far looks like this:
US <- tm_filter(corpus, FUN = function(corpus, filter) any(meta(corpus)["id"] == filter), grep(".*US.*", corpus))
And I receive the following error message:
Error in structure(as.character(x), names = names(x)) :
'names' attribute [3811] must be the same length as the vector [3]
I'm also not sure how to come up with a reproducible example simulating my problem for this post.
It could work like this:
library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
(corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain)))
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 20
(idx <- grep("0", sapply(meta(corp, "id"), paste0), value=TRUE))
# 502 704 708
# "502" "704" "708"
(corpsubset <- corp[idx] )
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 3
You are looking for "US" instead of "0". Have a look at ?grep for details (e.g. fixed=TRUE).

Extract texts from a large character string based a pattern

I have a large character vector and would like to extract certain information from each element by pattern matching:
str(input)
chr [1:109094] "{'asin': '0981850006', 'description': 'Steven Raichlen\'s Best of Barbecue Primal Grill DVD. The first three volumes of the si"| truncated ...
This is the content of input[1], the metadata record of one product:
[1] ("{'asin': '144072007X', 'related': {'also_viewed': ['B008WC0X0A', 'B000CPMOVG', 'B0046641AE', 'B00J150GAO', 'B00005AMCG', 'B005WGX97I'],
'bought_together': ['B000H85WSA']},
'title': 'Sand Shark Margare Maron Audio CD',
'price': 577.15,
'salesRank': {'Patio, Lawn & Garden': 188289},
'imUrl': 'http://ecx.images-amazon.com/images/I/31B9X0S6dqL._SX300_.jpg',
'brand': 'Tesoro',
'categories': [['Patio, Lawn & Garden', 'Lawn Mowers & Outdoor Power Tools', 'Metal Detectors']],
'description': \"The Tesoro Sand Shark metal combines time-proven PI circuits with the latest digital technology creating the first.\"}")
Now I would like to iterate over each element of the vector and extract asin, title, price, salesRank, brand and categories, which should be saved in a data.frame for easier handling.
The data originally comes from a JSON file, as you might notice. I tried to import it with the stream_in command, but that did not work, so I just imported it with readLines. Please help, I'm a bit desperate. Any hint is appreciated!
The jsonlite package shows the following problem:
lexical error: invalid char in json text.
{'asin': '0981850006', 'descript
(right here) ------^
closing fileconnectionoldClass input connection.
Any new ideas on that?
Given lots of unanswered questions on that issue, must be very relevant for newbies ;)
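The lexical error points at the first single quote: strict JSON requires double-quoted strings, while these records look like Python dict literals. If that assumption holds for every line, one possible route is to parse the lines in Python with ast.literal_eval and keep only the wanted fields. The sketch below is not a tested solution for the full file; the names FIELDS, parse_record and SAMPLE_LINE are made up for the illustration, and the sample is a shortened copy of the record shown in the question.

```python
import ast

# Fields named in the question.
FIELDS = ("asin", "title", "price", "salesRank", "brand", "categories")

def parse_record(line):
    """Parse one dict-literal line and keep only the wanted fields (None if absent)."""
    record = ast.literal_eval(line)  # evaluates literals only, never arbitrary code
    return {name: record.get(name) for name in FIELDS}

SAMPLE_LINE = (
    "{'asin': '144072007X', 'title': 'Sand Shark Margare Maron Audio CD', "
    "'price': 577.15, 'salesRank': {'Patio, Lawn & Garden': 188289}, "
    "'brand': 'Tesoro', 'categories': [['Patio, Lawn & Garden', "
    "'Lawn Mowers & Outdoor Power Tools', 'Metal Detectors']]}"
)

row = parse_record(SAMPLE_LINE)
print(row["asin"], row["title"], row["price"])
```

The resulting rows could then be written out as real JSON lines or CSV and read into a data.frame from there.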