Extracting main subject from a sentence in python

I am trying to extract the main subject from each sentence in a text file. For example, the file contains the following data:
I never used tobacco
They smoke tobacco
I do not like today's weather
Good weather
Exercise 3 to 4 times a week
No exercise
Family history of Cancer
No Cancer
,,· Alcohol use
Amazing football match
Pathetic football match
Has Depression
I have to extract the main subject and print it as follows:
I never used tobacco | Tobacco | False
They smoke tobacco | Tobacco | True
I do not like today's weather | Weather | False
Good weather | Weather | True
Exercise 3 to 4 times a week | Exercise | True
No exercise | Exercise | False
Family history of Cancer | Cancer | True
No Cancer | Cancer | False
,,· Alcohol use. | Alcohol | True
Amazing football match | Football Match | True
Pathetic football match | Football Match | False
Has Depression | Depression | True
I am trying spaCy for this but am not able to get the desired output. I tokenized the sentences with spaCy and then used part-of-speech tagging to extract the nouns, but I am still not getting what is required.
Can anyone help with how this could be done?

There is no exact solution to this, but the code below, which I used, is somewhat helpful:
from collections import Counter
import spacy

nlp = spacy.load('en_core_web_md')
negatedwords = read_words_from_file('false.txt')  # file containing all the negation words
# read_words_from_file() reads the words from the file
# `line` is one sentence from the input file; `file_write` is the open output file
count = Counter(line.split())
negated_word_found = False
for key, val in count.items():
    key = key.rstrip('.,?!\n')  # strip trailing punctuation
    if key in negatedwords:
        negated_word_found = True
if negated_word_found:
    file_write.write("False")
else:
    file_write.write("True")
file_write.write(" | ")
document = nlp(line)
for word in document:
    look_for_word = word.text
    word_pos = word.pos_
    # spaCy tags 'use' as NOUN, so it is excluded explicitly
    if word_pos in ("NOUN", "ADJ", "PROPN") and look_for_word != "use":
        file_write.write(look_for_word)
        file_write.write(' ')
false.txt
never
Never
no
No
NO
not
NOT
Not
NEVER
don't
Don't
DON'T
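The negation check above can be sketched without listing every capitalization in false.txt, by lowercasing each sentence before the lookup. This is only a minimal sketch: the NEGATION_WORDS set and the is_negated() helper are assumptions for illustration, not part of the original code, and the sketch covers only the True/False column, not the subject extraction.

```python
import re

# Assumed negation list: the words from false.txt, stored once in lowercase.
NEGATION_WORDS = {"never", "no", "not", "don't"}


def is_negated(sentence, negation_words=NEGATION_WORDS):
    """Return True if any token of the sentence is a negation word."""
    # Keep letters and apostrophes so that "don't" survives tokenization.
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return any(token in negation_words for token in tokens)


if __name__ == "__main__":
    for s in ["I never used tobacco", "They smoke tobacco", "No exercise"]:
        # The last column in the question is the opposite of "negated".
        print(s, "|", not is_negated(s))
```

Sentiment cases such as "Pathetic football match" are not negations, so a word list like this will not mark them False.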

Related

How to regexp_extract if a matching pattern resides anywhere in the string - pyspark

I was trying to get some insight into regexp_extract in PySpark and ran the following check to understand it better.
Below is my dataframe:
data = [('2345', 'Checked|by John|for kamal'),
('2398', 'Checked|by John|for kamal '),
('2328', 'Verified|by Srinivas|for kamal than some random text'),
('3983', 'Verified|for Stacy|by John')]
df = sc.parallelize(data).toDF(['ID', 'Notes'])
df.show()
+----+-----------------------------------------------------+
| ID| Notes |
+----+-----------------------------------------------------+
|2345|Checked|by John|for kamal |
|2398|Checked|by John|for kamal |
|2328|Verified|by Srinivas|for kamal than some random text |
|3983|Verified|for Stacy|by John |
+----+-----------------------------------------------------+
Here I am trying to identify whether an ID was checked or verified by John.
With the help of SO members I worked out how to use regexp_extract and came to the solution below:
result = df.withColumn('Employee', regexp_extract(col('Notes'), '(Checked|Verified)(\\|by John)', 1))
result.show()
+----+------------------------------------------------+------------+
| ID| Notes |Employee|
+----+------------------------------------------------+------------+
|2345|Checked|by John|for kamal | Checked|
|2398|Checked|by John|for kamal | Checked|
|2328|Verified|by Srinivas|for kamal than some random text| |
|3983|Verified|for Stacy|by John | |
+----+------------------------------------------------+------------+
For a few IDs this gives the right result, but for the last ID it didn't print Verified. Could someone please tell me whether the regular expression needs to be changed?
My impression is that (Checked|Verified)(\\|by John) matches only adjacent values. I tried * and $, but it still didn't print Verified for ID 3983.

I would have phrased the regex as:
(Checked|Verified)\b.*\bby John
This pattern finds Checked/Verified followed by by John with the two separated by any amount of text. Note that I just use word boundaries here instead of pipes.
Updated code:
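# Raw string, so that \b reaches the regex engine as a word boundary rather than a backspace character
result = df.withColumn('Employee', regexp_extract(col('Notes'), r'\b(Checked|Verified)\b.*\bby John', 1))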
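The pattern can be tried outside Spark with Python's re module, assuming (as is the case for word boundaries, alternation and .*) that Python and the Java regex engine used by Spark behave the same way here. The extract_action() helper is a hypothetical stand-in for regexp_extract with group index 1; it returns an empty string when nothing matches.

```python
import re

PATTERN = re.compile(r'\b(Checked|Verified)\b.*\bby John')


def extract_action(notes):
    """Return capture group 1, or '' when the pattern does not match."""
    match = PATTERN.search(notes)
    return match.group(1) if match else ''


NOTES = [
    'Checked|by John|for kamal',
    'Checked|by John|for kamal ',
    'Verified|by Srinivas|for kamal than some random text',
    'Verified|for Stacy|by John',
]

if __name__ == "__main__":
    for note in NOTES:
        print(repr(note), '->', repr(extract_action(note)))
```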

You can try this regex:
import pyspark.sql.functions as F
result = df.withColumn('Employee', F.regexp_extract('Notes', '(Checked|Verified)\\|.*by John', 1))
result.show()
+----+--------------------+--------+
| ID| Notes|Employee|
+----+--------------------+--------+
|2345|Checked|by John|f...| Checked|
|2398|Checked|by John|f...| Checked|
|2328|Verified|by Srini...| |
|3983|Verified|for Stac...|Verified|
+----+--------------------+--------+

Another way is to check if the column Notes contains a string by John:
df.withColumn('Employee',F.when(col('Notes').like('%Checked|by John%'), 'Checked').when(col('Notes').like('%by John'), 'Verified').otherwise(" ")).show(truncate=False)
+----+----------------------------------------------------+--------+
|ID |Notes |Employee|
+----+----------------------------------------------------+--------+
|2345|Checked|by John|for kamal |Checked |
|2398|Checked|by John|for kamal |Checked |
|2328|Verified|by Srinivas|for kamal than some random text| |
|3983|Verified|for Stacy|by John |Verified|
+----+----------------------------------------------------+--------+

Keep words starting with character/letter in Pandas | Python

I'm not sure how to do this in a dataframe context
I have the table below here with text information
TEXT |
-------------------------------------------|
"Get some new #turbo #stacks today!" |
"Is it one or three? #phone" |
"Mayhaps it be three afterall..." |
"So many new issues with phone... #iphone" |
And I want to edit it down to where only the words with a '#' symbol are kept, like in the result below.
TEXT |
-----------------|
"#turbo #stacks" |
"#phone" |
"" |
"#iphone" |
In some cases, I'd also like to know if it's possible to eliminate the rows that are empty by checking for NaN as true or if you run a different kind of condition to get this result:
TEXT |
-----------------|
"#turbo #stacks" |
"#phone" |
"#iphone" |
Python 2.7 and pandas for this.

You could try using regex and extractall:
df.TEXT.str.extractall(r'(#\w+)').groupby(level=0)[0].apply(' '.join)
Output:
0 #turbo #stacks
1 #phone
3 #iphone
Name: 0, dtype: object
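For comparison, the same idea can be sketched in plain Python without pandas. This is not the answer's method, just an illustration of what the regex does per row; keep_hashtags() is a hypothetical helper name. It also shows both variants from the question: keeping empty rows and dropping them.

```python
import re

TEXTS = [
    "Get some new #turbo #stacks today!",
    "Is it one or three? #phone",
    "Mayhaps it be three afterall...",
    "So many new issues with phone... #iphone",
]


def keep_hashtags(text):
    """Keep only the words that start with '#', joined by single spaces."""
    return ' '.join(re.findall(r'#\w+', text))


# Variant 1: one result per row, with '' where no hashtag was found.
cleaned = [keep_hashtags(t) for t in TEXTS]

# Variant 2: drop the rows that ended up empty.
non_empty = [t for t in cleaned if t]

if __name__ == "__main__":
    print(cleaned)
    print(non_empty)
```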

How to extract specific information from strings

I have a dataset with the addresses of authors' affiliations. The addresses differ in length, but the text before the first comma is always the name of the institution, and the text after the last comma is the country. I want to extract the country and create a new variable for it.
I tried this code in Stata. It works to extract the name of institutions.
generate splitat = strpos(institutions ,",")
generate str80 univ = substr(institutions, 1, splitat - 1)
I am wondering whether this code can also be adapted to extract the country.
I thought it could search from the end instead of from the start?
My dataset looks like the following example:
Natl Taiwan Univ, Inst Epidemiol, Taipei 106, Taiwan
Radboud Univ Nijmegen, Inst Water & Wetland Res, Dept Anim Ecol & Ecophysiol, NL-6525 AJ Nijmegen, Netherlands

There is a specific function in Stata 14+ to look for the last occurrence of a substring (e.g. a specific character) in a string. See help string functions in Stata 14 for documentation of strrpos().
If that is not in your version of Stata, you merely reverse the string, find the substring using the method you already know, and then reverse what you found.
If you are not using the latest version of Stata, it is always a good idea to say so in questions in any forum that supports Stata questions.
clear
input str244 institutions
"Natl Taiwan Univ, Inst Epidemiol, Taipei 106, Taiwan"
"Radboud Univ Nijmegen, Inst Water & Wetland Res, Dept Anim Ecol & Ecophysiol, NL-6525 AJ Nijmegen, Netherlands"
end
compress
gen country = substr(institutions, strrpos(institutions, ",") + 1, .)
local rev strreverse(institutions)
gen country2 = strreverse(substr(`rev', 1, strpos(`rev', ",") - 1))
assert country == country2
l country
+--------------+
| country |
|--------------|
1. | Taiwan |
2. | Netherlands |
+--------------+
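For readers who don't use Stata, the two approaches (search for the last comma directly, or reverse the string and search for the first comma) can be sketched in Python. The function names are made up for illustration, and unlike the Stata code this sketch also strips the space that follows the comma.

```python
def country_last_comma(address):
    """Take everything after the last comma (the strrpos() approach)."""
    # str.rfind returns -1 when there is no comma, so the whole string is kept.
    return address[address.rfind(',') + 1:].strip()


def country_reversed(address):
    """Reverse, cut at the first comma, reverse back (the fallback approach)."""
    rev = address[::-1]
    pos = rev.find(',')
    head = rev if pos == -1 else rev[:pos]
    return head[::-1].strip()


ADDRESSES = [
    "Natl Taiwan Univ, Inst Epidemiol, Taipei 106, Taiwan",
    "Radboud Univ Nijmegen, Inst Water & Wetland Res, Dept Anim Ecol & Ecophysiol, NL-6525 AJ Nijmegen, Netherlands",
]

if __name__ == "__main__":
    for a in ADDRESSES:
        print(country_last_comma(a), '|', country_reversed(a))
```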

cucumber Repeat steps

I am learning Cucumber and trying to write a feature file.
The following is my feature file.
Feature: Doctors handover Notes Module
  Scenario: Search for patients on the bases of filter criteria
    Given I am on website login page
    When I put username, password and select database:
      | Field    | Value |
      | username | test  |
      | password | pass  |
      | database | test  |
    Then I login to eoasis
    Then I click on doctors hand over notes link
    And I am on doctors handover notes page
    Then I select sites, wards, onCallTeam, grades,potential Discharge, outstanding task,High priority:
      | siteList      | wardsList                     | onCallTeamList   | gradesList | potentialDischargeCB | outstandingTasksCB | highPriorityCB |
      | THE INFIRMARY | INFIRMARY WARD 9 - ASSESSMENT | null             | null       | null                 | null               | null           |
      | THE INFIRMARY | INFIRMARY WARD 9 - ASSESSMENT | GENERAL MEDICINE | null       | null                 | null               | null           |
      | THE INFIRMARY | INFIRMARY WARD 9 - ASSESSMENT | GENERAL MEDICINE | CONSULTANT | null                 | null               | null           |
      | THE INFIRMARY | INFIRMARY WARD 9 - ASSESSMENT | GENERAL MEDICINE | CONSULTANT | true                 | null               | null           |
      | THE INFIRMARY | INFIRMARY WARD 9 - ASSESSMENT | GENERAL MEDICINE | CONSULTANT | true                 | true               | null           |
      | THE INFIRMARY | INFIRMARY WARD 9 - ASSESSMENT | GENERAL MEDICINE | CONSULTANT | true                 | true               | true           |
    Then I click on search button
    Then I should see search results
I want to repeat the last three steps: select the search criteria, click the search button, and then check the search results. How should I break up this feature file? If I use a Scenario Outline, there would be two different scenarios, one for login and one for the search criteria. Is that fine? Will the session be maintained in that case? What is the best way to write such a feature file?
Or is this the right way to write it?

I don't think we can have multiple example sets in a Scenario Outline.
Most of the scenario steps in the example are too procedural to each deserve their own step.
The first three steps could be reduced to something like:
Given I am logged into eoasis as a <user>
The code goes in the step definition, which could call a separate login method that takes care of entering the username and password and selecting the database.
Another rule is to avoid statements like "When I click the doctor's handover link". The keyword to avoid here is click. Today it's a click; tomorrow it could be a drop-down or a button. So the focus should be on the functional expectation of the user, which is viewing the handover notes. So we modify this to:
When I view the doctor's handover notes link
To summarize, this is how I would write this test.
Scenario Outline: Search for patients on the basis of filter criteria
  Given I am logged into eoasis as a <user>
  When I view the doctor's handover notes link
  And I select sites, wards, onCallTeam, grades, potential Discharge, outstanding task, High priority
  And perform a search
  Then I should see the search results
  Examples:
    | sites         | wards                         | onCallTeam | grades | potential Discharge | outstanding task | High priority |
    | THE INFIRMARY | INFIRMARY WARD 9 - ASSESSMENT | null       | null   | null                | null             | null          |

This really is the wrong way to write features. This feature is very imperative: it's all about HOW you do something. What a feature should do is explain WHY you are doing something.
Another bad thing this feature does is mix up the details of two different operations: signing in and searching for patients. Write a feature for each one, e.g.
Feature: Signing in
As a doctor
I want my patients' data to only be available if I sign in
So I ensure their confidentiality
Scenario: Sign in
Given I am a doctor
When I sign in
Then I should be signed in
Feature: Search for patients
Explain why searching for patients gives value to the doctor
...
You should focus on the name of the feature and the bit at the top that explains why this has value first. If you do that well then the scenarios are much easier to write (look how simple my sign in scenario is).
The art of writing features is doing this bit well, so that you end up with simple scenarios.

Regular Expression patterns for Tracking numbers

Does anybody know a good source of patterns for determining which carrier a given package tracking number belongs to? The idea is that, after scanning a package's barcode, I check the tracking number against the patterns and show which company shipped it.

Just thought I would post an update on this as I am working on this to match via jquery and automatically select the appropriate shipping carrier. I compiled a list of the matching regex for my project and I have tested a lot of tracking numbers across UPS FedEX and USPS.
If you come across something which doesn't match, please let me know here via comments and I will try to come up for that as well.
UPS:
/\b(1Z ?[0-9A-Z]{3} ?[0-9A-Z]{3} ?[0-9A-Z]{2} ?[0-9A-Z]{4} ?[0-9A-Z]{3} ?[0-9A-Z]|[\dT]\d\d\d ?\d\d\d\d ?\d\d\d)\b/
FedEX: (3 Different Ones)
/(\b96\d{20}\b)|(\b\d{15}\b)|(\b\d{12}\b)/
/\b((98\d\d\d\d\d?\d\d\d\d|98\d\d) ?\d\d\d\d ?\d\d\d\d( ?\d\d\d)?)\b/
/^[0-9]{15}$/
USPS: (4 Different Ones)
/(\b\d{30}\b)|(\b91\d+\b)|(\b\d{20}\b)/
/^E\D{1}\d{9}\D{2}$|^9\d{15,21}$/
/^91[0-9]+$/
/^[A-Za-z]{2}[0-9]+US$/
Please note that I did not come up with these myself. I simply searched around and compiled the list from different sources, including some which may have been mentioned here.
Thanks
Edit: Fixed missing end delimiter.

I needed something more robust for my use case. I kept running across examples that were incomplete, incorrect, or overly verbose without any improvement in correctness. Hopefully this helps someone else! It covers all of the different formats in the other answers, plus a few more, and doesn't overlap between FedEx and USPS unlike some of the other answers.
Tracking Number Regular Expressions:
USPS/S10:
https://postalpro.usps.com/mnt/glusterfs/2020-02/Pub%20199%20Intelligent%20Mail%20Package%20Barcode%20(IMpb)%20Implementation%20Guide%202020_02_11%20TT%20v6.pdf
\b([A-Z]{2}\d{9}[A-Z]{2}|(420\d{9}(9[2345])?)?\d{20}|(420\d{5})?(9[12345])?(\d{24}|\d{20})|82\d{8})\b
UPS:
\b1Z[A-Z0-9]{16}\b
FedEx:
\b([0-9]{12}|100\d{31}|\d{15}|\d{18}|96\d{20}|96\d{32})\b
Caveats/notes:
FedEx SmartPost is [intentionally] categorized as USPS; it can be tracked with either
USPS includes S10 format tracking numbers used for international post
Tracking numbers include check digits; these regexes don't validate them
This was found by reading spec sheets, reading other answers, looking at open source code, etc. It matched ~6,000 tracking numbers I ran it against with 100% accuracy, but I can't be sure it will be correct in all cases.
These assume you've removed all whitespace before applying the regex
Example Tracking Numbers
Mostly pulled from:
https://tools.usps.com/go/TrackConfirmAction
https://github.com/jkeen/tracking_number_data
| Tracking Number | Kind | Tracking Carrier |
|------------------------------------|-------------------------------------|------------------|
| 03071790000523483741 | USPS 20 | USPS |
| 71123456789123456787 | USPS 20 | USPS |
| 4201002334249200190132607600833457 | USPS 34v2 | USPS |
| 4201028200009261290113185417468510 | USPS 34v2 | USPS |
| 420221539101026837331000039521 | USPS 91 | USPS |
| 71969010756003077385 | USPS 91 | USPS |
| 9505511069605048600624 | USPS 91 | USPS |
| 9101123456789000000013 | USPS 91 | USPS |
| 92748931507708513018050063 | USPS 91 | USPS |
| 9400111201080805483016 | USPS 91 | USPS |
| 9361289878700317633795 | USPS 91 | USPS |
| 9405803699300124287899 | USPS 91 | USPS |
| EK115095696SA | S10 | USPS |
| 1Z5R89390357567127 | UPS | UPS |
| 1Z879E930346834440 | UPS | UPS |
| 1Z410E7W0392751591 | UPS | UPS |
| 1Z8V92A70367203024 | UPS | UPS |
| 1ZXX3150YW44070023 | UPS | UPS |
| 986578788855 | FedEx Express (12) | FedEx |
| 477179081230 | FedEx Express (12) | FedEx |
| 799531274483 | FedEx Express (12) | FedEx |
| 790535312317 | FedEx Express (12) | FedEx |
| 974367662710 | FedEx Express (12) | FedEx |
| 1001921334250001000300779017972697 | FedEx Express (34) | FedEx |
| 1001921380360001000300639585804382 | FedEx Express (34) | FedEx |
| 1001901781990001000300617767839437 | FedEx Express (34) | FedEx |
| 1002297871540001000300790695517286 | FedEx Express (34) | FedEx |
| 61299998820821171811 | FedEx SmartPost | USPS |
| 9261292700768711948021 | FedEx SmartPost | USPS |
| 041441760228964 | FedEx Ground | FedEx |
| 568283610012000 | FedEx Ground | FedEx |
| 568283610012734 | FedEx Ground | FedEx |
| 000123450000000027 | FedEx Ground (SSCC-18) | FedEx |
| 9611020987654312345672 | FedEx Ground 96 (22) | FedEx |
| 9622001900000000000000776632517510 | FedEx Ground GSN | FedEx |
| 9622001560000000000000794808390594 | FedEx Ground GSN | FedEx |
| 9622001560001234567100794808390594 | FedEx Ground GSN | FedEx |
| 9632001560123456789900794808390594 | FedEx Ground GSN | FedEx |
| 9400100000000000000000 | USPS Tracking | USPS |
| 9205500000000000000000 | Priority Mail | USPS |
| 9407300000000000000000 | Certified Mail | USPS |
| 9303300000000000000000 | Collect On Delivery Hold For Pickup | USPS |
| 8200000000 | Global Express Guaranteed | USPS |
| EC000000000US | Priority Mail Express International | USPS |
| 9270100000000000000000 | Priority Mail Express | USPS |
| EA000000000US | Priority Mail Express | USPS |
| CP000000000US | Priority Mail International | USPS |
| 9208800000000000000000 | Registered Mail | USPS |
| 9202100000000000000000 | Signature Confirmation | USPS |
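The three expressions from this answer can be combined into a small carrier lookup. The following is a minimal sketch under a few assumptions: detect_carrier() and CARRIER_PATTERNS are names invented here, whitespace is stripped first (as the caveats require), and re.fullmatch is used in place of the \b anchors because each input is expected to be a single tracking number. USPS is tried first, although the answer states the patterns do not overlap.

```python
import re

# Patterns copied from the answer, without the surrounding \b anchors.
CARRIER_PATTERNS = [
    ("USPS", re.compile(
        r'([A-Z]{2}\d{9}[A-Z]{2}|(420\d{9}(9[2345])?)?\d{20}'
        r'|(420\d{5})?(9[12345])?(\d{24}|\d{20})|82\d{8})')),
    ("UPS", re.compile(r'1Z[A-Z0-9]{16}')),
    ("FedEx", re.compile(
        r'([0-9]{12}|100\d{31}|\d{15}|\d{18}|96\d{20}|96\d{32})')),
]


def detect_carrier(raw):
    """Return the carrier name for a tracking number, or None if unknown."""
    number = re.sub(r'\s+', '', raw)  # the patterns assume no whitespace
    for name, pattern in CARRIER_PATTERNS:
        if pattern.fullmatch(number):
            return name
    return None


if __name__ == "__main__":
    for sample in ["1Z5R89390357567127", "9400111201080805483016",
                   "986578788855", "EK115095696SA"]:
        print(sample, "->", detect_carrier(sample))
```

Check digits are still not validated, so a well-formed but invalid number will be assigned a carrier.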

I need to verify JUST United States Postal Service (USPS) tracking numbers. WikiAnswers says that my number formats are as follows:
USPS only offers tracking with Express
mail, with usually begins with an "E",
another letter, followed by 9 digits,
and two more letters. USPS does have
"Label numbers" for other services
that are between 16 and 22 digits
long.
http://wiki.answers.com/Q/How_many_numbers_in_a_USPS_tracking_number
I'm adding in that the Label numbers start with a "9" as all the ones I have from personal shipments for the past 2 years start with a 9.
So, assuming that WikiAnswers is correct, here is my regex that matches both:
/^E\D{1}\d{9}\D{2}$|^9\d{15,21}$/
It's pretty simple. Here is the breakdown:
^E - Begins with E (for Express numbers)
\D{1} - followed by one non-digit character (intended: another letter; \D also accepts punctuation and spaces)
\d{9} - followed by 9 digits
\D{2} - followed by 2 more non-digit characters (intended: two letters)
$ - End of string
| - OR
^9 - Basic Track & Ship number, begins with 9
\d{15,21} - followed by 15 to 21 digits
$ - End of string
Using www.gummydev.com's regex tester, this pattern matches both of my test strings:
EXPRESS MAIL : EK225651436US
LABEL NUMBER: 9410803699300003725216
**Note: If you're using ColdFusion (I am), remove the first and last "/" from the pattern
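One detail worth seeing in action: \D means "any non-digit", not "any letter", so the pattern above also accepts punctuation in the letter positions. The sketch below compares it with a stricter variant; STRICT is an assumption added here for illustration (uppercase letters only), not something the answer proposes.

```python
import re

# The pattern from the answer, without the / delimiters.
LOOSE = re.compile(r'^E\D{1}\d{9}\D{2}$|^9\d{15,21}$')

# Assumed stricter variant: the \D positions must be uppercase letters.
STRICT = re.compile(r'^E[A-Z]\d{9}[A-Z]{2}$|^9\d{15,21}$')

SAMPLES = [
    "EK225651436US",           # Express Mail example from the answer
    "9410803699300003725216",  # label number example from the answer
    "E-225651436!!",           # punctuation where letters are expected
]

if __name__ == "__main__":
    for s in SAMPLES:
        print(s, bool(LOOSE.match(s)), bool(STRICT.match(s)))
```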

I pressed Royal Mail for a regex for the Recorded Delivery & Special Delivery tracking references but didn't get very far. Even a full set of rules so I could roll my own was beyond them.
Basically, even after they had taken about a week and came back with various combinations of letters denoting service type, I was able to provide examples from our experience that showed there were additional combinations that were obviously valid but that they had not documented.
The references follow the apparently standard international format that I think Jefe's /^[A-Za-z]{2}[0-9]+GB$/ regex would describe:
XX123456789GB
Even though this seems to be a standard format, i.e. most international mail has the same format where the last two letters denote the country of origin, I've not been able to find out any more about this 'standard' or where it originates from (any clarification welcome!).
Particular to Royal Mail seems to be the use of the first two letters to denote service level. I have managed to compile a list of prefixes that denote Special Delivery, but am not convinced that it is 100% complete:
AD AE AF AJ AK AR AZ BP CX DS EP HC HP KC KG
KH KI KJ KQ KU KV KW KY KZ PW SA SC SG SH SI
SJ SL SP SQ SU SW SY SZ TX WA WH XQ WZ
Without one of these prefixes the service is Recorded Delivery which gives delivery confirmation but no tracking.
It seems generally that inclusion of an S, X or Z denotes a higher service level and I don't think I've ever seen a normal Recorded Delivery item with any of those letters in the prefix.
However, as you can see there are many prefixes that would need to be tested if service level were to be checked using regex, and given the fact that Royal Mail seem incapable of providing a comprehensive rule set then trying to test for service level may be futile.
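If you still want to try it, the prefix list above fits more naturally in a set lookup than in a regex. This is a rough sketch only: royal_mail_service() is a hypothetical helper, the prefix list is the possibly incomplete one compiled above, and the format check reuses the XX123456789GB shape described earlier.

```python
import re

# Prefixes believed to denote Special Delivery (possibly incomplete).
SPECIAL_DELIVERY_PREFIXES = set(
    "AD AE AF AJ AK AR AZ BP CX DS EP HC HP KC KG "
    "KH KI KJ KQ KU KV KW KY KZ PW SA SC SG SH SI "
    "SJ SL SP SQ SU SW SY SZ TX WA WH XQ WZ".split()
)

# Two letters, digits, then GB (the format quoted above).
REFERENCE = re.compile(r'[A-Za-z]{2}[0-9]+GB')


def royal_mail_service(reference):
    """Guess the service level from the prefix; None if the format is wrong."""
    ref = reference.strip().upper()
    if not REFERENCE.fullmatch(ref):
        return None
    if ref[:2] in SPECIAL_DELIVERY_PREFIXES:
        return "Special Delivery"
    return "Recorded Delivery"


if __name__ == "__main__":
    for ref in ["SA123456789GB", "XX123456789GB", "12345"]:
        print(ref, "->", royal_mail_service(ref))
```

Because the list is not guaranteed to be complete, a "Recorded Delivery" result should be read as "no known Special Delivery prefix" rather than as a definite answer.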

Here are some sample numbers from the main US carriers:
USPS:
70160910000108310009 (certified)
23153630000057728970 (signature confirmation)
RE360192014US (registered mail)
EL595811950US (priority express)
9374889692090270407075 (regular)
FEDEX:
810132562702 (all seem to follow same pattern regardless)
795223646324
785037759224
UPS:
K2479825491 (UPS ground)
J4603636537 (UPS next day express)
1Z87585E4391018698 (regular)
Patterns I am using (php code). Yep I gave up and started testing against all the patterns at my disposal. Had to write the second UPS one.
public function getCarrier($trackingNumber){
    $matchUPS1 = '/\b(1Z ?[0-9A-Z]{3} ?[0-9A-Z]{3} ?[0-9A-Z]{2} ?[0-9A-Z]{4} ?[0-9A-Z]{3} ?[0-9A-Z]|[\dT]\d\d\d ?\d\d\d\d ?\d\d\d)\b/';
    $matchUPS2 = '/^[kKJj]{1}[0-9]{10}$/';
    $matchUSPS0 = '/(\b\d{30}\b)|(\b91\d+\b)|(\b\d{20}\b)/';
    // Stray spaces before the ^ anchors removed; with them those alternatives could never match
    $matchUSPS1 = '/(\b\d{30}\b)|(\b91\d+\b)|(\b\d{20}\b)|(\b\d{26}\b)|^E\D{1}\d{9}\D{2}$|^9\d{15,21}$|^91[0-9]+$|^[A-Za-z]{2}[0-9]+US$/i';
    $matchUSPS2 = '/^E\D{1}\d{9}\D{2}$|^9\d{15,21}$/';
    $matchUSPS3 = '/^91[0-9]+$/';
    $matchUSPS4 = '/^[A-Za-z]{2}[0-9]+US$/';
    $matchUSPS5 = '/(\b\d{30}\b)|(\b91\d+\b)|(\b\d{20}\b)|(\b\d{26}\b)|^E\D{1}\d{9}\D{2}$|^9\d{15,21}$|^91[0-9]+$|^[A-Za-z]{2}[0-9]+US$/i';
    $matchFedex1 = '/(\b96\d{20}\b)|(\b\d{15}\b)|(\b\d{12}\b)/';
    $matchFedex2 = '/\b((98\d\d\d\d\d?\d\d\d\d|98\d\d) ?\d\d\d\d ?\d\d\d\d( ?\d\d\d)?)\b/';
    $matchFedex3 = '/^[0-9]{15}$/';
    // Carriers are tested in order: UPS, then USPS, then FedEx
    if (preg_match($matchUPS1, $trackingNumber) ||
        preg_match($matchUPS2, $trackingNumber))
    {
        echo('UPS');
        $carrier = 'UPS';
        return $carrier;
    } else if (preg_match($matchUSPS0, $trackingNumber) ||
        preg_match($matchUSPS1, $trackingNumber) ||
        preg_match($matchUSPS2, $trackingNumber) ||
        preg_match($matchUSPS3, $trackingNumber) ||
        preg_match($matchUSPS4, $trackingNumber) ||
        preg_match($matchUSPS5, $trackingNumber)) {
        $carrier = 'USPS';
        return $carrier;
    } else if (preg_match($matchFedex1, $trackingNumber) ||
        preg_match($matchFedex2, $trackingNumber) ||
        preg_match($matchFedex3, $trackingNumber)) {
        $carrier = 'FedEx';
        return $carrier;
    } else if (0) {
        // Placeholder: no DHL pattern yet, so this branch never runs
        $carrier = 'DHL';
        return $carrier;
    }
    return; // no carrier matched
}

Been researching this for a while, and made these based mostly on the answers here.
These should cover everything, without being too lenient.
UPS:
/^(1Z\s?[0-9A-Z]{3}\s?[0-9A-Z]{3}\s?[0-9A-Z]{2}\s?[0-9A-Z]{4}\s?[0-9A-Z]{3}\s?[0-9A-Z]$|[\dT]\d{3}\s?\d{4}\s?\d{3})$/i
USPS:
/^(EA|EC|CP|RA)\d{9}(\D{2})?$|^(7\d|03|23|91)\d{2}\s?\d{4}\s?\d{4}\s?\d{4}\s?\d{4}(\s\d{2})?$|^82\s?\d{3}\s?\d{3}\s?\d{2}$/i
FEDEX:
/^(((96|98)\d{5}\s?\d{4}$|^(96|98)\d{2})\s?\d{4}\s?\d{4}(\s?\d{3})?)$/i

I'm working on an Angular 2+ app and just put together a component to handle common US tracking numbers. It tests them using standard JavaScript RegExps built from the two resources linked in the code comment below, and sets the href on an anchor tag to the tracking URL if the number is valid. You don't have to be using Angular or TypeScript to adapt this to your application. I tested it with different dummy numbers and it seems to work so far. Note that you can also swap the null in the last else statement for the commented-out URL, and it will send you to a Google search.
If you have any feedback (or if your tracking numbers don't work), please let me know and I will update the answer. Thanks!
USAGE IN HTML:
<app-tracking-number [trackNum]="myTrackingNumberInput"></app-tracking-number>
COMPONENT .TS
import { Component, OnInit, Input } from '@angular/core';
@Component({
selector: 'app-tracking-number',
templateUrl: './tracking-number.component.html',
styleUrls: ['./tracking-number.component.scss']
})
export class TrackingNumberComponent implements OnInit {
@Input() trackNum:string;
trackNumHref:string = null;
// Carrier tracking numbers patterns from https://www.iship.com/trackit/info.aspx?info=24 AND https://www.canadapost.ca/web/en/kb/details.page?article=how_to_track_a_packa&cattype=kb&cat=receiving&subcat=tracking
isUPS:RegExp = new RegExp('^1Z[A-HJ-NPR-Z0-9]{16}$'); // UPS tracking numbers usually begin with "1Z", contain 18 characters, and do not contain the letters "O", "I", or "Q". (No commas inside the character class: they would match a literal comma.)
isFedEx:RegExp = new RegExp('^[0-9]{12}$|^[0-9]{15}$'); // FedEx Express tracking numbers are normally 12 digits long and do not contain letters AND FedEx Ground tracking numbers are normally 15 digits long and do not contain letters.
isUSPS:RegExp = new RegExp('^[0-9]{20,22}$|^[A-Z]{2}[0-9A-Z]{9}US$'); // USPS Tracking numbers are normally 20-22 digits long and do not contain letters AND USPS Express Mail tracking numbers are normally 13 characters long, begin with two letters, and end with "US".
isDHL:RegExp = new RegExp('^[0-9]{10,11}$'); // DHL tracking numbers are normally 10 or 11 digits long and do not contain letters.
isCAPost:RegExp = new RegExp('^[0-9]{16}$|^[A-Z]{2}[0-9]{9}[A-Z]{2}$'); // 16 numeric digits (0000 0000 0000 0000) AND 13 numeric and alphabetic characters (AA 000 000 000 AA).
constructor() { }
ngOnInit() {
this.setHref();
}
setHref() {
if(!this.trackNum) this.trackNumHref = null;
else if(this.isUPS.test(this.trackNum)) this.trackNumHref = `https://wwwapps.ups.com/WebTracking/processInputRequest?AgreeToTermsAndConditions=yes&loc=en_US&tracknum=${this.trackNum}&Requester=trkinppg`;
else if(this.isFedEx.test(this.trackNum)) this.trackNumHref = `https://www.fedex.com/apps/fedextrack/index.html?tracknumber=${this.trackNum}`;
else if(this.isUSPS.test(this.trackNum)) this.trackNumHref = `https://tools.usps.com/go/TrackConfirmAction?tLabels=${this.trackNum}`;
else if(this.isDHL.test(this.trackNum)) this.trackNumHref = `http://www.dhl.com/en/express/tracking.html?AWB=${this.trackNum}&brand=DHL`;
else if(this.isCAPost.test(this.trackNum)) this.trackNumHref =`https://www.canadapost.ca/trackweb/en#/search?searchFor=${this.trackNum}`;
else this.trackNumHref = null; // Google search as fallback... `https://www.google.com/search?q=${this.trackNum}`;
}
}
COMPONENT .HTML
<a *ngIf="trackNumHref" [href]="trackNumHref" target="_blank">{{trackNum}}</a>
<span *ngIf="!trackNumHref">{{trackNum}}</span>

Here is a great resource which captures just about all possibilities and is as tight as I have found:
https://andrewkurochkin.com/blog/code-for-recognizing-delivery-company-by-track
string[] upsPattern = new string[]
{
"^(1Z)[0-9A-Z]{16}$",
"^(T)+[0-9A-Z]{10}$",
"^[0-9]{9}$",
"^[0-9]{26}$"
};
string[] uspsPattern = new string[]
{
"^(94|93|92|94|95)[0-9]{20}$",
"^(94|93|92|94|95)[0-9]{22}$",
"^(70|14|23|03)[0-9]{14}$",
"^(M0|82)[0-9]{8}$",
"^([A-Z]{2})[0-9]{9}([A-Z]{2})$"
};
string[] fedexPattern = new string[]
{
"^[0-9]{20}$",
"^[0-9]{15}$",
"^[0-9]{12}$",
"^[0-9]{22}$"
};

You can try these (not guaranteed):
UPS:
\b(1Z ?[0-9A-Z]{3} ?[0-9A-Z]{3} ?[0-9A-Z]{2} ?[0-9A-Z]{4} ?[0-9A-Z]{3} ?[0-9A-Z]|[\dT]\d\d\d ?\d\d\d\d ?\d\d\d)\b
UPS:
\b(1Z ?\d\d\d ?\d\w\w ?\d\d ?\d\d\d\d ?\d\d\d ?\d|[\dT]\d\d\d ?\d\d\d\d ?\d\d\d)\b
USPost:
\b(\d\d\d\d ?\d\d\d\d ?\d\d\d\d ?\d\d\d\d ?\d\d\d\d ?\d\d|\d\d\d\d ?\d\d\d\d ?\d\d\d\d ?\d\d\d\d ?\d\d\d\d)\b
But please test before you use them. I recommend RegexBuddy.

I use these in an eBay application I wrote:
USPS Domestic:
/^91[0-9]+$/
USPS International:
/^[A-Za-z]{2}[0-9]+US$/
FedEx:
/^[0-9]{15}$/
However, this might be eBay/Paypal specific, as all USPS Domestic labels start with "91". All USPS International labels start with two characters and end with "US". As far as I know, FedEx just uses 15 random digits.
(Please note that these regular expressions assume all spaces are removed. It would be fairly easy to allow for spaces though)

Check out this github project that lists a lot of PHP tracking regexes. https://github.com/darkain/php-tracking-urls

Here are the ones I am now using in my Java app. These are determined by my experience of sucking tracking numbers out of shipping confirmation emails from a whole pile of drop ship services. I just made a new USPS one from scratch since none of the ones I found worked for some of my numbers based on example numbers on the USPS site. These only work for US tracking codes because we only sell in the US.
private final Pattern UPS_TRACKING_NUMBER =
Pattern.compile("[^A-Za-z0-9](1Z[A-Za-z0-9]{6,})",
Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
private final Pattern FEDEX_TRACKING_NUMBER =
Pattern.compile("\\b((96|98)\\d{18,20}|\\d{15}|\\d{12})\\b",
Pattern.MULTILINE);
private final Pattern USPS_TRACKING_NUMBER =
Pattern.compile("\\b(9[2-4]\\d{20}(?:(?:EA|RA)\\d{9}US)?|(?:03|23|14|70)\\d{18})\\b",
Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

I believe FedEx is 12 digits:
^[0-9]{12}$

I also came across tracking numbers from FedEx with 22 digits recently, so watch out!
I haven't found any good reference for the FedEx's general format yet.
FedEx Example #: 9612019059803563050071

Late to the party however, the below will work with 26 char USPS numbers as well.
/(\b\d{30}\b)|(\b91\d+\b)|(\b\d{20}\b)|(\b\d{26}\b)|^E\D{1}\d{9}\D{2}$|^9\d{15,21}$|^91[0-9]+$|^[A-Za-z]{2}[0-9]+US$/i

I know there are already lots of answers and that this was asked a long time ago, but I don't see a single one that addresses all the possible USPS tracking numbers with a single expression.
Here is what I came up with:
((\d{4})(\s?\d{4}){4}\s?\d{2})|((\d{2})(\s?\d{3}){2}\s?\d{2})|((\D{2})(\s?\d{3}){3}\s?\D{2})
See it working here: http://regexr.com/3e61u

// Note: backslashes are doubled because these are Java string literals
//UPS - UNITED PARCEL SERVICE
final String UPS = "\\b(1Z ?[0-9A-Z]{3} ?[0-9A-Z]{3} ?[0-9A-Z]{2} ?[0-9A-Z]{4} ?[0-9A-Z]{3} ?[0-9A-Z]|T\\d{3} ?\\d{4} ?\\d{3})\\b";
//USPS - UNITED STATES POSTAL SERVICE - FORMAT 1
final String USPS_FORMAT1 = "\\b((420 ?\\d{5} ?)?(91|92|93|94|01|03|04|70|23|13)\\d{2} ?\\d{4} ?\\d{4} ?\\d{4} ?\\d{4}( ?\\d{2,6})?)\\b";
//USPS - UNITED STATES POSTAL SERVICE - FORMAT 2
final String USPS_FORMAT2 = "\\b((M|P[A-Z]?|D[C-Z]|LK|E[A-C]|V[A-Z]|R[A-Z]|CP|CJ|LC|LJ) ?\\d{3} ?\\d{3} ?\\d{3} ?[A-Z]?[A-Z]?)\\b";
//USPS - UNITED STATES POSTAL SERVICE - FORMAT 3
final String USPS_FORMAT3 = "\\b(82 ?\\d{3} ?\\d{3} ?\\d{2})\\b";
//FEDEX - FEDERAL EXPRESS
final String FED_EX = "\\b(((96\\d\\d|6\\d)\\d{3} ?\\d{4}|96\\d{2}|\\d{4}) ?\\d{4} ?\\d{4}( ?\\d{3})?)\\b";
//ONTRAC
final String ONTRAC = "\\b(C\\d{14})\\b";
//DHL
final String DHL = "\\b(\\d{4}[- ]?\\d{4}[- ]?\\d{2}|\\d{3}[- ]?\\d{8}|[A-Z]{3}\\d{7})\\b";
Sample tracking number
UPS
//"1Z 999 AA1 01 2345 6784"
Fed-ex
// "449044304137821"
USPS
//"9400 1000 0000 0000 0000 00"
// Compile the UPS pattern, since the sample below is a UPS tracking number
final Pattern pattern = Pattern.compile(UPS, Pattern.CASE_INSENSITIVE |
        Pattern.UNICODE_CASE);
final Matcher matcher = pattern.matcher("1Z 999 AA1 01 2345 6784");
if (matcher.find()) {
    System.out.println(true + "");
}
This works in Java and on Android.
https://regex101.com/
You can use this site to convert a regex to another language's flavor and to generate code.

Here's an up-to-date regex for UPS. It works with standard and Mail Innovation type tracking numbers:
\b(1Z ?[0-9A-Z]{3} ?[0-9A-Z]{3} ?[0-9A-Z]{2} ?[0-9A-Z]{4} ?[0-9A-Z]{3} ?[0-9A-Z]|[\dT]\d\d\d ?\d\d\d\d ?\d\d\d|\d\d\d ?\d\d\d ?\d\d\d|\d{22,34})\b

I solved this by using an external API : https://shippingcarrierdetector.com/
If your project allows external API's it might be a much quicker and easier solution than trying to build the logic yourself.
