extracting certain values from a text file in python - regex

I have a text file in the below format and I have to extract all range of motion and Location values. In some files, the value is given in the next line and in some, it is not given
File1.txt:
Functional Assessment: Patient currently displays the following functional
limitations and would benefit from treatment to maximize functional use and
pain reduction: Range of Motion: limited . ADLs: limited . Gait: limited .
Stairs: limited . Squatting: limited . Work participation status: limited .
Current Status: The patient's current status is improving.
Location: Right side
Expected output: limited | Right side
File2.txt:
Functional Assessment: Patient currently displays the following functional
limitations and would benefit from treatment to maximize functional use and
pain reduction:
Range of Motion:
painful
and
limited
Strength:
limited
Expected output: painful and limited | Not given
This is the code which I am trying:
if "Functional Assessment:" in line:
    result=str(line.rsplit('Functional Assessment:'))
    romvalue = result.rsplit('Range of Motion:')[-1].split()[0]
    outputfile.write(romvalue)
    partofbody = result.rsplit('Location:')[-1].split()[0]
    outputfile.write(partofbody)
I am not getting the output which I want with this code. Can someone please help.

You may collect all lines after a line that starts with Functional Assessment:, join them and use the following regex:
(?sm)\b(Location|Range of Motion):\s*([^\W_].*?)\s*(?=(?:\.\s*)?[^\W\d_]+:|\Z)
See the regex demo.
Details
(?sm) - re.S and re.M modifiers
\b - word boundary
(Location|Range of Motion) - Group 1: either Location or Range of Motion
:\s* - a colon and 0+ whitespaces
([^\W_].*?) - Group 2: a letter or digit, then any 0+ chars (including line breaks, because of re.S), as few as possible
\s* - 0+ whitespaces
(?=(?:\.\s*)?[^\W\d_]+:|\Z) - a positive lookahead that, immediately to the right of the current location, requires
(?:\.\s*)? - an optional sequence of . and 0+ whitespaces
[^\W\d_]+: - 1+ letters followed with :
| - or
\Z - end of string.
Here is a Python demo:
import re

reg = re.compile(r'\b(Location|Range of Motion):\s*([^\W_].*?)\s*(?=(?:\.\s*)?[^\W\d_]+:|\Z)', re.S | re.M)
# files is assumed to be a list of strings, each holding the full text of one file
for file in files:
    flag = False
    tmp = ""
    for line in file.splitlines():
        if line.startswith("Functional Assessment:"):
            # Start collecting from the "Functional Assessment:" line
            tmp = tmp + line + "\n"
            flag = not flag
        elif flag:
            tmp = tmp + line + "\n"
    print(dict(list(reg.findall(tmp))))
Output (for the two texts you posted):
{'Location': 'Right side', 'Range of Motion': 'limited'}
{'Range of Motion': 'painful \nand\nlimited'}
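To get from those dictionaries to the "value | value" lines the question asks for, something like the following sketch could be used. The function name summarize and the "Not given" fallback handling are assumptions made for illustration; the pattern itself is the one from the answer above.

```python
import re

# Same pattern as in the answer above.
FIELD_RE = re.compile(
    r'\b(Location|Range of Motion):\s*([^\W_].*?)\s*(?=(?:\.\s*)?[^\W\d_]+:|\Z)',
    re.S | re.M,
)


def summarize(text):
    """Return 'range of motion | location' ('Not given' for a missing field)."""
    # Only look at the text from "Functional Assessment:" onwards.
    start = text.find("Functional Assessment:")
    section = text[start:] if start != -1 else ""
    found = dict(FIELD_RE.findall(section))

    def clean(key):
        # Collapse line breaks and repeated spaces inside a multi-line value.
        return " ".join(found[key].split()) if key in found else "Not given"

    return clean("Range of Motion") + " | " + clean("Location")


file2 = (
    "Functional Assessment: Patient currently displays the following functional\n"
    "limitations and would benefit from treatment to maximize functional use and\n"
    "pain reduction:\n"
    "Range of Motion:\npainful\nand\nlimited\nStrength:\nlimited\n"
)
print(summarize(file2))  # painful and limited | Not given
```

Collapsing whitespace with split/join is what turns 'painful \nand\nlimited' into a single phrase.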

Related

Regex for validating account names for NEAR protocol

I want to have accurate form field validation for NEAR protocol account addresses.
I see at https://docs.near.org/docs/concepts/account#account-id-rules that the minimum length is 2, maximum length is 64, and the string must either be a 64-character hex representation of a public key (in the case of an implicit account) or must consist of "Account ID parts" separated by . and ending in .near, where an "Account ID part" consists of lowercase alphanumeric symbols separated by either _ or -.
Here are some examples.
The final 4 cases here should be marked as invalid (and there might be more cases that I don't know about):
example.near
sub.ex.near
something.near
98793cd91a3f870fb126f66285808c7e094afcfc4eda8a970f6648cdf0dbd6de
wrong.near.suffix (INVALID)
shouldnotendwithperiod.near. (INVALID)
space should fail.near (INVALID)
touchingDotsShouldfail..near (INVALID)
I'm wondering if there is a well-tested regex that I should be using in my validation.
Thanks.
P.S. Originally my question pointed to what I was starting with at https://regex101.com/r/jZHtDA/1 but starting from scratch like that feels unwise given that there must already be official validation rules somewhere that I could copy.
I have looked at code that I would have expected to use some kind of validation, such as these links, but I haven't found it yet:
https://github.com/near/near-wallet/blob/40512df4d14366e1b8e05152fbf5a898812ebd2b/packages/frontend/src/utils/account.js#L8
https://github.com/near/near-wallet/blob/40512df4d14366e1b8e05152fbf5a898812ebd2b/packages/frontend/src/components/accounts/AccountFormAccountId.js#L95
https://github.com/near/near-cli/blob/cdc571b1625a26bcc39b3d8db68a2f82b91f06ea/commands/create-account.js#L75

The pre-release (v0.6.0-0) version of the JS SDK comes with a built-in accountId validation function:
const ACCOUNT_ID_REGEX =
  /^(([a-z\d]+[-_])*[a-z\d]+\.)*([a-z\d]+[-_])*[a-z\d]+$/;
/**
 * Validates the Account ID according to the NEAR protocol
 * [Account ID rules](https://nomicon.io/DataStructures/Account#account-id-rules).
 *
 * @param accountId - The Account ID string you want to validate.
 */
export function validateAccountId(accountId: string): boolean {
  return (
    accountId.length >= 2 &&
    accountId.length <= 64 &&
    ACCOUNT_ID_REGEX.test(accountId)
  );
}
https://github.com/near/near-sdk-js/blob/dc6f07bd30064da96efb7f90a6ecd8c4d9cc9b06/lib/utils.js#L113
Feel free to implement this in your program too.
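For anyone validating on a Python backend, the same check might be ported roughly as below. This is only a sketch: the function name validate_account_id is made up here, and [a-z0-9] is used instead of \d so that only ASCII digits are accepted. Note that this regex does not require a .near suffix, so wrong.near.suffix from the question passes it.

```python
import re

# Port of ACCOUNT_ID_REGEX from the snippet above; [a-z0-9] replaces [a-z\d]
# because Python's \d would also accept non-ASCII digits.
ACCOUNT_ID_RE = re.compile(
    r'^(([a-z0-9]+[-_])*[a-z0-9]+\.)*([a-z0-9]+[-_])*[a-z0-9]+$'
)


def validate_account_id(account_id):
    """Length check (2..64) plus the structural regex, as in the JS version."""
    return (
        2 <= len(account_id) <= 64
        and ACCOUNT_ID_RE.fullmatch(account_id) is not None
    )


for candidate in [
    "example.near",
    "sub.ex.near",
    "shouldnotendwithperiod.near.",
    "space should fail.near",
    "touchingDotsShouldfail..near",
]:
    print(candidate, validate_account_id(candidate))
```

If only .near or .testnet accounts should be accepted, that suffix check has to be added separately.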

Something like this should do: /^(\w|(?<!\.)\.)+(?<!\.)\.(testnet|near)$/gm
Breakdown
^                 # start of line
(
  \w              # match word characters (letters, digits, underscore)
  |               # OR
  (?<!\.)\.       # dots can't be preceded by dots
)+
(?<!\.)           # "." should not precede:
\.                # "."
(testnet|near)    # match "testnet" or "near"
$                 # end of line
Try the Regex out: https://regex101.com/r/vctRlo/1

If you want to match word characters only, separated by a dot:
^\w+(?:\.\w+)*\.(?:testnet|near)$
Explanation
^ Start of string
\w+ Match 1+ word characters
(?:\.\w+)* Optionally repeat . and 1+ word characters
\. Match .
(?:testnet|near) Match either testnet or near
$ End of string
Regex demo
A bit broader variant that matches any non-whitespace characters except the dot:
^[^\s.]+(?:\.[^\s.]+)*\.(?:testnet|near)$
Regex demo
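A quick way to compare the two patterns against the examples from the question is a small Python check like this one. The names WORD_PARTS, BROAD_PARTS and check are made up for the sketch. Keep in mind that \w also accepts uppercase letters and underscores, and that neither pattern accepts a 64-character implicit (hex) account, because both require the .near or .testnet suffix.

```python
import re

# The two patterns from this answer, unchanged.
WORD_PARTS = re.compile(r'^\w+(?:\.\w+)*\.(?:testnet|near)$')
BROAD_PARTS = re.compile(r'^[^\s.]+(?:\.[^\s.]+)*\.(?:testnet|near)$')


def check(pattern, account_id):
    """True if the whole string matches (fullmatch also rejects a trailing newline)."""
    return pattern.fullmatch(account_id) is not None


samples = [
    "example.near",
    "sub.ex.near",
    "wrong.near.suffix",
    "shouldnotendwithperiod.near.",
    "space should fail.near",
    "touchingDotsShouldfail..near",
]
for sample in samples:
    print(sample, check(WORD_PARTS, sample), check(BROAD_PARTS, sample))
```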

PostgreSQL: .csv regex - test for repeating substrings within a string (digits)

Introduction:
I have the following scenario in PostgreSQL whereby I want to perform some data validation on a .csv string prior to inserting it into a table (see the fiddle here).
I've managed to get a regex (in a CHECK constraint) which disallows spaces within strings (e.g. "12 34") and also disallows preceding zeros ("00343").
Now, the icing on the cake would be if I could use regular expressions to disallow strings which contain a repeat of an integer - i.e. if a sequence \d+ matched another \d+ within the same string.
Is this beyond the capacities of regular expressions?
My table is as follows:
CREATE TABLE test
(
  data TEXT NOT NULL,
  CONSTRAINT d_csv_only_ck
    CHECK (data ~ '^([ ]*([1-9]\d*)+[ ]*)(,[ ]*([1-9]\d*)+[ ]*)*$')
);
And I can populate it as follows:
INSERT INTO test VALUES
  ('992,1005,1007,992,456,456,1008'), -- want to make this line unacceptable - repeats!
  ('44,1005,1110'),
  ('13, 44 , 1005, 10078 '), -- acceptable - spaces before and after integers
  ('11,1203,6666'),
  ('1,11,99,2222'),
  ('3435'),
  (' 1234 '); -- acceptable
But:
INSERT INTO test VALUES ('23432, 3433 ,00343, 567'); -- leading 0 - unacceptable
fails (as it should), and this also fails (again, as it should):
INSERT INTO test VALUES ('12 34'); -- spaces within numbers - unacceptable
The question:
However, if you look at the first string, it has repeats of 992 and 456.
I would like to be able to match these.
All of these rules do not have to be in the same regex - I can use a second CHECK constraint.
I would like to know if what I am asking is possible using Regular Expressions?
I did find this post which appears to go some (all?) of the way to solving my issue, but I'm afraid it's beyond my skillset to get it to work - I've included a small test at the bottom of the fiddle.
Please let me know should you require any further information.
p.s. as an aside, I'm not very experienced with regexes and I would welcome any input on my basic one above.

Since PostgreSQL regular expressions do not allow backreferences inside lookahead constraints, you cannot apply this restriction in a single pattern, because it needs a negative lookahead with a backreference in it.
Have a look at this PCRE regex:
^(?!.*\b(\d+)\b.*\b\1\b) *[1-9]\d* *(?:, *[1-9]\d* *)*$
See this regex demo.
Details:
^ - start of string
(?!.*\b(\d+)\b.*\b\1\b) - no same two numbers as whole word allowed anywhere in the string
* - zero or more spaces
[1-9]\d* - a non-zero digit and then any zero or more digits
* - zero or more spaces
(?:, *[1-9]\d* *)* - zero or more occurrences of
, * - comma and zero or more spaces
[1-9]\d* - a non-zero digit and then any zero or more digits
* - zero or more spaces
$ - end of string.
Even if you replace \b with \y (PostgreSQL regex word boundaries) in the PostgreSQL code, it won't work due to the drawback mentioned at the top of the answer.
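If the string can be validated in application code before the INSERT, the PCRE-style pattern above also works as-is with Python's re module, which allows a backreference inside a lookahead. A minimal sketch, with the function name is_valid_csv made up for illustration:

```python
import re

# The pattern from the answer, unchanged: the negative lookahead rejects any
# string in which the same whole number occurs twice.
NO_REPEATS = re.compile(
    r'^(?!.*\b(\d+)\b.*\b\1\b) *[1-9]\d* *(?:, *[1-9]\d* *)*$'
)


def is_valid_csv(data):
    """True for comma-separated positive integers with no leading zeros and no repeats."""
    return NO_REPEATS.fullmatch(data) is not None


print(is_valid_csv('992,1005,1007,992,456,456,1008'))  # False (992 and 456 repeat)
print(is_valid_csv('13, 44 , 1005, 10078 '))           # True
```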

Replace version in yaml structure with regexp

I'm trying to replace the value of the version property in the following yaml structure.
My reason for using regex rather than parsing the yaml is that I need to write it back again. If I parse it and then write it back, it'll lose its existing formatting.
environments:
  local:
    values:
      - kubeContext: default
      - surfScreenshotter:
          installed: false
          version: 0
      - whoamiMn:
          installed: false
          version: 0
  dev:
    values:
      - kubeContext: nuc
      - surfScreenshotter:
          installed: false
          version: 0
      - whoamiMn:
          installed: false
          version: 0
My kotlin code
val regex = """environments:
|.*
| $environment:
| values:
|.*
| - $projectName:
|.*
| {10}version: (\S+)
""".trimMargin().toRegex(MULTILINE)
val updatedHelmfile = regex.replaceFirst(helmfileContent, version)
$environment can be either "local" or "dev" and projectName can be "surfScreenshotter" or "whoamiMn".
Nothing is matched. Anyone got an idea how to make this work?

You can rely on indentation to make sure you are in the right section of your text block and capture the whole part before the version into a capturing group:
val regex = """(environments:
|(?:\R\h{2}.*)*?\R\h{2}$environment:
|(?:\R\h{4}.*)*?\R\h{6}-\h*$projectName:
|(?:\R\h{10}.*)*?\R\h{10}
|version:\h*)\S+
""".trimMargin().toRegex(RegexOption.COMMENTS)
Then, you need to make sure to restore Group 1 contents with $1 in the replacement pattern:
val updatedHelmfile = regex.replaceFirst(helmfileContent, "$1" + version)
See the regex demo and the Kotlin demo.
Details
(environments: - Group 1 start and environments: string
(?:\R\h{2}.*)*?\R\h{2}dev: - zero or more occurrences (as few as possible) of a line break followed with two horizontal whitespace and then the rest of the line, then a line break, two horizontal whitespace and dev: string
(?:\R\h{4}.*)*?\R\h{6}-\h*whoamiMn: - zero or more occurrences (as few as possible) of a line break followed with four horizontal whitespace and then the rest of the line, then a line break, six horizontal whitespace and - + 0 or more spaces, and then whoamiMn: string
(?:\R\h{10}.*)*?\R\h{10} - zero or more occurrences (as few as possible) of a line break followed with ten horizontal whitespace and then the rest of the line, then a line break, ten horizontal whitespace
version:\h*) - version:, 0 or more spaces, end of Group 1
\S+ - one or more non-whitespace chars.
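The same indentation-based idea can be sketched in Python for comparison. Python's re has no \R or \h, so this version assumes \n line endings and space indentation (2, 4, 6 and 10 spaces, as in the pattern above); the function name set_version is hypothetical.

```python
import re


def set_version(helmfile, environment, project, version):
    """Replace the version of one project in one environment, keeping formatting."""
    pattern = re.compile(
        r"(environments:"
        r"(?:\n {2}.*)*?\n {2}" + re.escape(environment) + r":"
        r"(?:\n {4}.*)*?\n {6}- *" + re.escape(project) + r":"
        r"(?:\n {10}.*)*?\n {10}version: *)\S+"
    )
    # Group 1 holds everything before the old version value; put it back
    # and append the new value. A function replacement avoids escaping issues.
    return pattern.sub(lambda m: m.group(1) + version, helmfile, count=1)


HELMFILE = """\
environments:
  local:
    values:
      - kubeContext: default
      - surfScreenshotter:
          installed: false
          version: 0
      - whoamiMn:
          installed: false
          version: 0
  dev:
    values:
      - kubeContext: nuc
      - surfScreenshotter:
          installed: false
          version: 0
      - whoamiMn:
          installed: false
          version: 0
"""

print(set_version(HELMFILE, "dev", "whoamiMn", "1.2.3"))
```

Only the last version line changes in this call; the other three stay at 0.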

How to extract headings in text file using regex in python?

I have always used stackoverflow for solving many of my problems by searching the threads. Today I would like some guidance on creating a regex pattern for my text files. My files have headings that are varied in nature and do not follow the same naming pattern. The pattern they do follow somewhat is like this:
2.0 DESCRIPTION
3.0 PLACE OF PERFORMANCE
5.0 SERVICES RETAINED
6.0 STRUCTURE AND ROLES
etc....
It always follows a number and then capital letters or number and then spaces and then capital letters. The output I need is a list :
output = ['2.0 DESCRIPTION','3.0 PLACE OF PERFORMANCE','5.0 SERVICES RETAINED','6.0 STRUCTURE AND ROLES']
I am extremely new to python and regex. I tried the following but it did not give me the output desired:
import re
text = f'''2.0 DESCRIPTION
some text here
3.0 SERVICES
som text
5.0 SERVICES RETAINED
some text
6.0 STRUCTURE AND ROLES
sometext'''
pattern = r"\d\s[A-Z][A-Z]+"
matches = re.findall(pattern,text)
But it returned:
['0 DESCRIPTION', '0 SERVICES', '0 SERVICES']
Not the output that I was looking for. Your guidance in finding a pattern will be really appreciated.
Cheers,
Abhishek

You may use
matches = re.findall(r'^\d+(?:\.\d+)* *[A-Z][A-Z ]*$',text, re.M)
See the regex demo.
Here,
^ - start of a line (re.M redefines ^ behavior to include these positions, too)
\d+(?:\.\d+)* - 1+ digits and then 0+ sequences of a . and 1+ digits
* - zero or more spaces
[A-Z][A-Z ]* - an uppercase letter and then 0 or more uppercase letters or spaces
$ - end of a line.
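Applied to the sample text from the question, the pattern can be tried out like this (the variable names are just for the sketch):

```python
import re

text = """2.0 DESCRIPTION
some text here
3.0 SERVICES
som text
5.0 SERVICES RETAINED
some text
6.0 STRUCTURE AND ROLES
sometext"""

# re.M makes ^ and $ match at the start and end of every line.
HEADING_RE = re.compile(r'^\d+(?:\.\d+)* *[A-Z][A-Z ]*$', re.M)

headings = HEADING_RE.findall(text)
print(headings)
# ['2.0 DESCRIPTION', '3.0 SERVICES', '5.0 SERVICES RETAINED', '6.0 STRUCTURE AND ROLES']
```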

import pdfplumber
import re

pdfToString = ""
with pdfplumber.open(r"sample.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())
        pdfToString += page.extract_text()
# Each match is a numbered heading plus the lines that follow it, up to the next heading
matches = re.findall(r'^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*', pdfToString, re.M)
for i in matches:
    if "word_to_extract" in i[:50]:
        print(i)
This solution extracts every heading that has the same format as the headings in the question, together with the paragraphs that follow it, and then prints the section whose heading contains the required word.

Regular expression for specific file mask

I want to have 2 regex patterns that check file names against a specific file mask. The checks I have in mind are listed below.
Pattern 1:
check if the left side of _ has 7 digits.
checks if the right side of _ is numeric.
checks for the specified extension is there.
the input will look like this : 1234567_1.jpg
Pattern 2:
check if there is 10 digits to the left of a "Space" char
check if there is 4 digits to the right of a "Space" char
check to the right side of _ is numeric
check for the specified extension is there.
The input will look like this: 1234567891 1234_1.png
As stated above this is to be used to check for a specific file mask.
I have been playing around with ideas like ^[0-9][0-9].jpg$ and ^[0-9] [0-9][0-9].jpg$; these are my first tries.
I do apologize for not providing my tries.

I suggest combining patterns with | (or):
string pattern = string.Join("|",
    @"(^[0-9]{7}_[0-9]+\.jpg$)",            // 1st possibility
    @"(^[0-9]{10} [0-9]{4}_[0-9]+\.png$)"); // 2nd one
....
string fileName = @"c:\myFiles\1234567_1.jpg";
// RegexOptions.IgnoreCase - let's accept ".JPG" or ".Jpg" files
if (Regex.IsMatch(Path.GetFileName(fileName), pattern, RegexOptions.IgnoreCase)) {
    ...
}
Let's explain the second pattern: (^[0-9]{10} [0-9]{4}_[0-9]+\.png$)
^ - anchor (string start)
[0-9]{10} - 10 digits - 0-9
' ' - single space
[0-9]{4} - 4 digits
_ - single underscore
[0-9]+ - one or more digits
\.png - .png (. is escaped)
$ - anchor (string end)
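For readers not on .NET, the same two masks can be sketched in Python. The name matches_mask is made up, and PureWindowsPath is used here only to strip the directory from a Windows-style path on any operating system.

```python
import re
from pathlib import PureWindowsPath

# Both masks from the answer, combined with | and matched case-insensitively.
FILE_MASK = re.compile(
    r'^[0-9]{7}_[0-9]+\.jpg$'
    r'|^[0-9]{10} [0-9]{4}_[0-9]+\.png$',
    re.IGNORECASE,
)


def matches_mask(path):
    """Check only the file name part of the path against the two masks."""
    name = PureWindowsPath(path).name
    return FILE_MASK.fullmatch(name) is not None


print(matches_mask(r"c:\myFiles\1234567_1.jpg"))  # True
print(matches_mask("1234567891 1234_1.png"))      # True
print(matches_mask("123456_1.jpg"))               # False (only 6 digits)
```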

This should work for the first pattern (the dot is escaped so it only matches a literal ".", and \d+ requires at least one digit after the underscore):
^\d{7}_\d+\.(jpg|png)$
This should work for the second pattern:
^\d{10}\s\d{4}_\d+\.(jpg|png)$
If you want to use them together, combine them like this:
^(\d{7}_\d+\.(jpg|png)|\d{10}\s\d{4}_\d+\.(jpg|png))$
In the group (jpg|png) you can add other extensions by separating them with | (or).
You can check if it works for you at: https://regex101.com/
Cheers!
