I have the following bit in a google script that parses pdfs:
function extractPDFtext(text){
  const regexp = /[w,W,s,S]*(\d{3}).?(\d{3}).?(\d{3}).?(\d{3})?.?(\d{3})?[\w\W]*?(\d+.\d+)/gm;
  try{
    let array = [...text.match(regexp)];
    return array;
  }catch(e){
    let array = ["No items found"]
    return array;
  }
};
The existing regex only partially works (because the PDFs are not all alike), so I have to restrict the matching to the text between specific words. When I try to do that, I get no results. I would like to retrieve the digits that follow the Reference and Amount tags, while ignoring any words and digits in between. This is where I'm having trouble: on regex101 I get the full match plus the correct capturing groups, but in the script I get no results.
This is a regex example based on what was suggested on another question of mine but in the end has the same problem as any of my other attempts:
^Reference\b[^\d\n]*[\t ](\d{3})[\t ]*(\d{3})[\t ]*(\d{3})[\t ]*(\d{3})[\t ]*(\d{3})(?:\n(?!Amount\b)\S.*)*\nAmount\b[^\d\n]*[\t ](\d+(?:,\d+)?)\b
So I'm wondering if the problem is with the regex or with the script and how to solve in any of those circumstances.
Below is a dummy example of the text variable that the regex is applied to. Bear in mind that each "tag" can be followed by more words (for example: Reference of something // Amount of first payment:), and the : may or may not be present.
Some dummy text that may have words in common like `reference` or `amount` throughout the document
Reference: 245 154 343 345 345
Entity: 34567
Amount: 11,11
Payment date: 14/07/2022
Some more text
Maybe you're trying to do too much with one command. Try breaking it up as I show below.
console.log(text);
// match() returns null when nothing is found, so check before indexing
let ref = text.match(/Reference.+/gm);
if( ref && ref.length > 0 ) {
  ref = ref[0].match(/\d.+/);
  if( ref ) console.log(ref[0]);
}
ref = text.match(/Amount.+/);
if( ref && ref.length > 0 ) {
  ref = ref[0].match(/\d.+/);
  if( ref ) console.log(ref[0]);
}
Execution log
8:55:50 AM Notice Execution started
8:55:50 AM Info Some dummy text that may have words in common like `reference` or `amount` throughout the document
Reference: 245 154 343 345 345
Entity: 34567
Amount: 11,11
Payment date: 14/07/2022
Some more text
8:55:50 AM Info 245 154 343 345 345
8:55:50 AM Info 11,11
8:55:50 AM Notice Execution completed
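One possible reason for the regex101/script mismatch described in the question: with the g flag, String.prototype.match returns only the full matches and drops the capturing groups. A minimal sketch along the lines of this answer, using one non-global match per tag with a null check, might look like the following. The function name extractFields and the shape of the returned object are assumptions for illustration; they are not part of the original script.

```javascript
// Hypothetical helper: one non-global match per tag, so capture groups survive.
function extractFields(text) {
  // The m flag makes ^ match at each line start, so "reference" inside a
  // sentence is ignored. [^\d\n]* skips ":" and any extra words on the line.
  const ref = text.match(/^Reference\b[^\d\n]*(\d{3}(?:[\t ]*\d{3}){2,4})/m);
  const amount = text.match(/^Amount\b[^\d\n]*(\d+(?:[.,]\d+)?)/m);
  return {
    reference: ref ? ref[1].replace(/\s+/g, "") : null,
    amount: amount ? amount[1] : null,
  };
}

const sample = [
  "Some dummy text that may have words in common like `reference` or `amount` throughout the document",
  "Reference: 245 154 343 345 345",
  "Entity: 34567",
  "Amount: 11,11",
  "Payment date: 14/07/2022",
  "Some more text",
].join("\n");

console.log(extractFields(sample));
```

Returning null for a missing field lets the caller decide how to report it, instead of relying on try/catch around a failed match.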
I need to write the regex to fetch the details from the following data
Type Time(s) Ops TPS(ops/s) Net(M/s) Get_miss Min(us) Max(us) Avg(us) Std_dev Geo_dist
Period 5 145443 29088 22.4 37006 352 116302 6600 7692.04 4003.72
Global 10 281537 28153 23.2 41800 281 120023 6797 7564.64 4212.93
The above is what I get from a log file.
I have tried writing a regex to extract the details in table format, but could not get it to work.
Below is the regex I tried.
Type[\s+\S+].+\n(?<time>[\d+\S+\s+]+)[\s+\S+].*Period
The regex fails when it reaches the Period keyword.
If for some reason RichG's suggestion of using multikv doesn't work, the following should:
| rex field=_raw "(?<type>\w+)\s+(?<time>[\d\.]+)\s+(?<ops>[\d\.]+)\s+(?<tps>[\d\.]+)\s+(?<net>[\d\.]+)\s+(?<get_miss>[\d\.]+)\s+(?<min>[\d\.]+)\s+(?<max>[\d\.]+)\s+(?<avg>[\d\.]+)\s+(?<std_dev>[\d\.]+)\s+(?<geo_dist>[\d\.]+)"
Where is your data coming from?
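Outside Splunk, the same column-by-column idea as the rex above can be sketched without one long pattern by splitting each line on whitespace and pairing the cells with the header names. This is only an illustration in JavaScript; the function name parseTable is hypothetical, and it assumes no cell contains spaces.

```javascript
const log = [
  "Type Time(s) Ops TPS(ops/s) Net(M/s) Get_miss Min(us) Max(us) Avg(us) Std_dev Geo_dist",
  "Period 5 145443 29088 22.4 37006 352 116302 6600 7692.04 4003.72",
  "Global 10 281537 28153 23.2 41800 281 120023 6797 7564.64 4212.93",
].join("\n");

// Hypothetical helper: the first line supplies the keys, every later line one record.
function parseTable(text) {
  const [headerLine, ...rowLines] = text.trim().split(/\r?\n/);
  const headers = headerLine.trim().split(/\s+/);
  return rowLines.map((line) => {
    const cells = line.trim().split(/\s+/);
    // Values are kept as strings; convert with Number() where needed.
    return Object.fromEntries(headers.map((h, i) => [h, cells[i]]));
  });
}

const rows = parseTable(log);
console.log(rows);
```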
I'm using the following script which is working correctly to pull 2 fields out of an email body.
This is causing the script execution time to increase significantly due to the amount of content in the body. Is there a way to make this search through only the first 5 lines of the email body?
First lines of e-mail:
Name: Full Report
Store: River North (Wells St)
Date Tripped: 19 Feb 2020 1:07 PM
Business Date: 19 Feb 2020 (Open)
Message:
Information:
This alert was tripped based on a user defined trigger: Every 15 minutes.
Script:
// Gets the first (latest) thread with the given label
var threads = GmailApp.getUserLabelByName('South Loop').getThreads(0,1);
if (threads && threads.length > 0) {
  // Get the first email message of the thread
  var message = threads[0].getMessages()[0];
  // Get the plain text body of the email message
  // You may also use getRawContent() for parsing HTML
  var tmp,
      subject = message.getSubject(),
      content = message.getPlainBody();
  // Implement parsing rules using regular expressions
  if (content) {
    tmp = content.match(/Date Tripped:\s*([:\w\s]+)\r?\n/);
    var tripped = (tmp && tmp[1]) ? tmp[1].trim() : 'N/A';
    tmp = content.match(/Business Date:\s([\w\s]+\(\w+\))/);
    var businessdate = (tmp && tmp[1]) ? tmp[1].trim() : 'N/A';
  }
}
You can use the pattern /^(?:.*\r?\n){0,5}/ to grab the first 5 lines of the email, then run your search against this smaller string. Here's a browser example with hardcoded content, but I tested it in Google Apps Script.
const Logger = console; // Remove this for GAS!
const content = `Name: Full Report
Store: River North (Wells St)
Date Tripped: 19 Feb 2020 1:07 PM
Business Date: 19 Feb 2020 (Open)
Message:
Information:
This alert was tripped based on a user defined trigger: Every 15 minutes.`;
const searchPattern = /(Date Tripped|Business Date): *(.+?)\r?\n/g;
const matches = [...content.match(/^(?:.*\r?\n){0,5}/)[0]
.matchAll(searchPattern)]
const result = Object.fromEntries(matches.map(e => e.slice(1)));
Logger.log(result);
If you wish to dynamically inject the search terms, use:
const Logger = console; // Remove this for GAS!
const content = `Name: Full Report
Store: River North (Wells St)
Date Tripped: 19 Feb 2020 1:07 PM
Business Date: 19 Feb 2020 (Open)
Foo: this will match because it's on line 5
Bar: this won't match because it's on line 6
Information:
`;
const searchTerms = ["Date Tripped", "Business Date", "Foo", "Bar"];
const searchPattern = new RegExp(`(${searchTerms.join("|")}): *(.+?)\r?\n`, "g");
const matches = [...content.match(/^(?:.*\r?\n){0,5}/)[0]
.matchAll(searchPattern)]
const result = Object.fromEntries(matches.map(e => e.slice(1)));
Logger.log(result);
ES5 version if you're using the older engine:
var Logger = console; // Remove this for GAS!
var content = "Name: Full Report\nStore: River North (Wells St)\nDate Tripped: 19 Feb 2020 1:07 PM\nBusiness Date: 19 Feb 2020 (Open)\nMessage:\nInformation:\nThis alert was tripped based on a user defined trigger: Every 15 minutes.\n";
var searchPattern = /(Date Tripped|Business Date): *(.+?)\r?\n/g;
var truncatedContent = content.match(/^(?:.*\r?\n){0,5}/)[0];
var result = {};
for (var m; m = searchPattern.exec(truncatedContent); result[m[1]] = m[2]);
Logger.log(result);
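As an alternative sketch, the truncation step can also be done without a regex: String.prototype.split accepts a limit, so only the first few lines are kept and each is checked on its own. The function name parseHead and the maxLines parameter are hypothetical names chosen for this example.

```javascript
// Hypothetical helper: look for the fields only in the first maxLines lines.
function parseHead(content, maxLines) {
  // The second argument to split() caps the number of pieces returned.
  const head = content.split(/\r?\n/, maxLines);
  const result = {};
  for (const line of head) {
    const m = line.match(/^(Date Tripped|Business Date): *(.+)$/);
    if (m) result[m[1]] = m[2].trim();
  }
  return result;
}

const body = [
  "Name: Full Report",
  "Store: River North (Wells St)",
  "Date Tripped: 19 Feb 2020 1:07 PM",
  "Business Date: 19 Feb 2020 (Open)",
  "Message:",
  "Information:",
  "This alert was tripped based on a user defined trigger: Every 15 minutes.",
].join("\n");

console.log(parseHead(body, 5));
```

Because each line is matched separately, the pattern needs no trailing \r?\n and a field on the last kept line is still found.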
@ggorlen's answer is not precise enough for my taste. Let's have a look at regex101.
My problem with (?:.*\r?\n){0,5} is this: in English, this regex says:
Take any number of characters (0 or more) ending with a newline.
Do this between 0 and 5 times.
This means that an empty string also matches. If you did a global match, there would be a lot of those.
So, how could you grab the first 5 lines? Be exact! So something like
^([^\r\n]+\r?\n){5}
See regex101
P.S. @ggorlen mentioned that I left the default multiline matching on in regex101, and he's right about that. Your preference may vary: whether to ignore messages with fewer than 5 lines or to accept strings with empty lines depends on your particular case.
P.S.2 I've adapted my wording and disabled the multiline and global settings in regex101 to illustrate my concerns.
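The trade-off between the two patterns discussed in this thread can be checked with a small sketch. The variable names lenient and strict and the test strings are made up for this comparison.

```javascript
const lenient = /^(?:.*\r?\n){0,5}/;   // up to five lines, possibly none
const strict = /^([^\r\n]+\r?\n){5}/;  // exactly five non-empty lines

// Return the matched text, or null when the pattern does not match.
function firstMatch(re, s) {
  const m = s.match(re);
  return m ? m[0] : null;
}

const threeLines = "a\nb\nc\n";
const withBlank = "a\n\nb\nc\nd\ne\n";

console.log(JSON.stringify(firstMatch(lenient, threeLines))); // "a\nb\nc\n"
console.log(JSON.stringify(firstMatch(strict, threeLines)));  // null
console.log(JSON.stringify(firstMatch(lenient, "")));         // ""
console.log(JSON.stringify(firstMatch(lenient, withBlank)));  // "a\n\nb\nc\nd\n"
console.log(JSON.stringify(firstMatch(strict, withBlank)));   // null
```

The lenient pattern always produces a match, even an empty one, while the strict pattern rejects short input and input with blank lines among the first five.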
I have a text file with data formatted as below. I figured out how to format the second part of the file for upload into a db table, but I'm hitting a wall trying to get just the first 7 lines formatted in the same way.
If it wasn't obvious, I'm trying to get it pipe-delimited with the exact same number of columns, so I can easily upload it to the db.
Year: 2019 Period: 03
Office: NY
Dept: Sales
Acct: 111222333
SubAcct: 11122234-8
blahblahblahblahblahblahblah
Status: Pending
1000
AAAAAAAAAA
100,000.00
2000
BBBBBBBBBB
200,000.00
3000
CCCCCCCCCC
300,000.00
4000
DDDDDDDDDD
400,000.00
Some kind folks answered my question about the bottom part. Using the following regex, I can format it to look like this:
(.*)\r?\n(.*)\r?\n(.*)(?:\r?\n|$)
substitute with |||||||$1|$2|$3\n
|||||||1000|AAAAAAAAAA|100,000.00
|||||||2000|BBBBBBBBBB|200,000.00
|||||||3000|CCCCCCCCCC|300,000.00
|||||||4000|DDDDDDDDDD|400,000.00
I just need help formatting the top part to look like this, so that the entire file has the exact same number of columns.
Year: 2019|Period: 03|Office: NY|Dept: Sales|Acct: 111222333|SubAcct: 11122234-8|blahblahblahblahblahblahblah|Status: Pending|||
I'm ok with having multiple passes on the file to get the desired end result.
I've helped you on your previous question, so I will focus now on the first part of your file.
You can use this regex:
\n|\b(?=Period)
Working demo
And use | as the replacement string
If you don't want the previous space before Period, then you can use:
\n|\s(?=Period)
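As a rough sketch of how this replacement behaves, here it is applied in JavaScript to the seven header lines only. Two things are assumptions and not part of the answer: restricting the input to the header, and appending the three trailing pipes so the line matches the desired output shown in the question.

```javascript
const header = [
  "Year: 2019 Period: 03",
  "Office: NY",
  "Dept: Sales",
  "Acct: 111222333",
  "SubAcct: 11122234-8",
  "blahblahblahblahblahblahblah",
  "Status: Pending",
].join("\n");

// Replace every line break, and the whitespace right before "Period", with a pipe.
const piped = header.replace(/\r?\n|\s(?=Period)/g, "|");

// Assumption: pad with three empty trailing columns, as in the desired output.
const padded = piped + "|||";

console.log(padded);
```

Run against the whole file, the same pattern would also join the data rows, so the header has to be handled in its own pass.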
Looking for help on building a regex that captures a 1-line string after a specific word.
The challenge I'm running into is that the program where I need to build this regex uses a single line format, in other words dot matches new line. So the formula I created isn't working. See more details below. Any advice or tips?
More specific regex task:
I'm trying to grab the line that comes after the word Details in entries like the ones below. The goal is to pull out 100% Silk or 100% Velvet. This is the material of the product, which always comes after Details.
Raw data:
<p>Loose fitted blouse green/yellow lily print.
V-neck opening with a closure string.
Small tie string on left side of top.</p>
<h3>Details</h3> <p>100% Silk.</p>
<p>Made in Portugal.</p> <h3>Fit</h3>
<p>Model is 5‰Ûª10,‰Û size 2 wearing size 34.</p> <p>Size 34 measurements</p>
OR
<p>The velvet version of this dress. High waist fit with hook and zipper closure.
Seams run along edges of pants to create a box-like.</p>
<h3>Details</h3> <p>100% Velvet.</p>
<p>Made in the United States.</p>
<h3>Fit</h3> <p>Model is 5‰Ûª10‰Û, size 2 and wearing size M pants.</p> <p>Size M measurements Length: 37.5"åÊ</p>
<p>These pants run small. We recommend sizing up.</p>
Here is the current formula I created that's not working:
Replace (.)(\bDetails\s+(.)) with $3
The output gives the below:
<p>100% Silk.</p>
<p>Made in Portugal.</p>
<h3>Fit</h3>
<p>Model is 5‰Ûª10,‰Û size 2 wearing size 34.</p>
<p>Size 34 measurements</p>
OR
<p>100% Velvet.</p>
<p>Made in the United States.</p>
<h3>Fit</h3> <p>Model is 5‰Ûª10‰Û, size 2 and wearing size M pants.</p> <p>Size M measurements Length: 37.5"åÊ</p>
<p>These pants run small. We recommend sizing up.</p>
How do I capture just the desired string? Let me know if you have any tips! Thank you!
It is difficult to provide a working solution for your situation: you mention that your program has "limited regex features" but don't explain what the limitations are.
Here is a Regex you can try to work with to capture the target string
^(?:<h3>Details<\/h3>)(.*)$
I would personally use BeautifulSoup for something like this, but here are two solutions you could use:
Match the line after "Details", then pull out the data.
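import re
# re.M lets $ match at the end of each line, not only at the end of the text
matches = re.findall(r'(?<=Details<).*$', text, flags=re.M)
matches = [i.strip('<>') for i in matches]
matches = [i.split('<')[0] for i in [j.split('>')[-1] for j in matches]]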
Replace "Details<...>data" with "Detailsdata", then find the data.
# Collapse 'Details</h3> <p>' to 'Details' so the data directly follows the keyword
text = re.sub(r'Details<.*?<.*?>', 'Details', text)
matches = re.findall(r'(?<=Details).*?(?=<)', text)
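Since the question says the dot matches newlines in the asker's tool, one way around that is a pattern that avoids the dot altogether and uses a negated character class. The sketch below uses JavaScript syntax for illustration; whether the asker's program supports the same syntax is unknown, and the function name extractMaterial is hypothetical.

```javascript
// Hypothetical helper: capture the text of the first <p> after the Details heading.
function extractMaterial(html) {
  // [^<]* stops at the next tag, so the result does not depend on whether
  // "." matches newlines in the regex engine.
  const m = html.match(/<h3>\s*Details\s*<\/h3>\s*<p>([^<]*)<\/p>/);
  return m ? m[1].trim() : null;
}

const silk = [
  "<p>Small tie string on left side of top.</p>",
  "<h3>Details</h3> <p>100% Silk.</p>",
  "<p>Made in Portugal.</p> <h3>Fit</h3>",
].join("\n");

const velvet = [
  "<p>Seams run along edges of pants to create a box-like.</p>",
  "<h3>Details</h3> <p>100% Velvet.</p>",
  "<p>Made in the United States.</p>",
].join("\n");

console.log(extractMaterial(silk));   // 100% Silk.
console.log(extractMaterial(velvet)); // 100% Velvet.
```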
I'm involved in a digital radio propagation study where a remote transmitter sends a predefined beacon at a defined time that's easily matched with a regex.
But due to solar and atmospheric conditions, it's not always 100% decoded. What I want to do is calculate the percentage of the beacon that was decoded.
The beacon format is as so:
de va6shs va6shs va6shs Loc DO46gs Olivia-4-250 NBEMS test 2218Z
| | | |
(Station) (Location) (Digital Mode) (UTC Time)
Can I actually figure out the percentage with Perl, or should I be looking for another solution?
Edit: Because there is limited error correction in the data mode we are using, random characters often end up in the decoded string, or characters are not decoded at all. These are strings received from the same station at different times of the same day as solar conditions degraded.
100% decode
de ve6rfm ve6rfm ve6rfm Loc DO46gs Olivia-4-250 NBEMS test 0218Z
93.75%
P!de ve6rfm ve6rfm ve6rfm Loc DO46gs Olivia-4-250 NBEMS <TAB>est F248Z
9.375%
de ve6rfmr&
The only difference between the two beacon strings should be the UTC time at the end of the string, but as you can see, a few characters didn't decode correctly.
The correctly decoded string has 64 characters.
The first incorrectly decoded string has 60 correct characters.
So 60/64 * 100 = 93.75% decode.
My regex for the station call sign, the three repeated words is
/[vV][aAeEyY][15678]\w{2,3}/
There are several different stations involved in the study across western Canada so I need to capture them as propagation permits, and using the above regex saves me from having to update my script every time a new station comes on the air.
The problem is one of partial or fuzzy matching. There are modules out there that may help. They mostly use Levenshtein distance, the number of edits needed to get one string from the other, but there are other methods. See a partial list in Text::Levenshtein. See this post for search phrases that will offer far more.
Here are examples using String::Approx, String::Similarity, and Text::Fuzzy. None gives exactly what you ask but all retrieve similar measures, and have options that may allow you to get your target.
use warnings 'all';
use strict;
my $beacon =
'de va6shs va6shs va6shs Loc DO46gs Olivia-4-250 NBEMS test 2218Z';
my $received =
'P!de ve6rfm ve6rfm ve6rfm Loc DO46gs Olivia-4-250 NBEMS <TAB>est F248Z';
# Can use an object, or the functional interface
use Text::Fuzzy qw(fuzzy_index distance_edits);
my $tf = Text::Fuzzy->new ($beacon);
my ($offset, $edits, $distance);
# Different distance/edits
$distance = $tf->distance($received);
($offset, $edits, $distance) = fuzzy_index ($received, $beacon);
($distance, $edits) = distance_edits ($received, $beacon);
# Provides "similarity", in terms of edit distance
use String::Similarity;
my $similarity = similarity $beacon, $received;
# Can be tuned, but is more like regex in some sense. See docs.
use String::Approx qw(amatch);
my @matches = amatch($beacon, $received); # within 10%
# amatch($beacon, ["20%"], $received); # within 20%
# amatch($beacon, ["S0"], $received); # no "substitutions"
Please look through their documentation.
String::Approx considers it a "match" if the input differs from the pattern by no more than 10% of the pattern's length. This is the default, and the module allows you to adjust that parameter. For example,
amatch($beacon, ["20%"], $received);
would make that 20%. Other refinements that may be useful to you are possible.
Newer versions of the module are written in C and perform much better.
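To show what an edit-distance-based percentage could look like, here is a minimal sketch of the Levenshtein distance mentioned above, written in JavaScript for illustration; the same logic ports directly to Perl. The names levenshtein and decodePercent are hypothetical. The score is "one minus distance over expected length", which is an assumption and will not necessarily reproduce the character counts given in the question.

```javascript
// Classic dynamic-programming Levenshtein distance, keeping one row at a time.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or a free match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical score: share of the expected string that needed no edits.
// Assumes expected is non-empty.
function decodePercent(expected, received) {
  const distance = levenshtein(expected, received);
  return Math.max(0, 1 - distance / expected.length) * 100;
}

console.log(levenshtein("kitten", "sitting")); // 3
console.log(decodePercent("abcd", "abxd"));    // 75
```

Because the UTC time changes with every beacon, it may be worth comparing only the fixed part of the string, or masking the time before scoring.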
You could calculate the number of matching characters relative to the length of the initial string, for instance by looking for /\bva\dshs\b/ (\b is a word boundary and \d is a digit; see the manual page):
my $s = 'de va6shs va6shs va6shs Loc DO46gs Olivia-4-250 NBEMS test 2218Z';
my $r = join('', $s =~ m/\bva\dshs\b/g);
print(((length($r)*100) / length($s)) . "%\n");
The matching strings, combined, give
"va6shsva6shsva6shs"
which is 28.125% of the initial string length.