Regex how to find pattern? - regex

I need to split the text below with a regex. I have already found patterns for the dddd-dddd and dddd-ddd[X] parts, but what about the text? I need to get the title as a string, for example "British Journal of Applied Science & Technology". How do I write that as a regex?
337 British Journal of Applied Science & Technology 2231-0843 5
338 British Journal of Economics, Management & Trade 2278-098X 5
339 British Journal of Education, Society & Behavioural Science 2278-0998 6
340 British Journal of Environment and Climate Change 2231-4784 5
341 British Journal of Mathematics & Computer Science 2231-0851 4
342 British Journal of Medicine and Medical Research 2231-0614 8
343 British Journal of Pharmaceutical Research 2231-2919 4
344 British Microbiology Research Journal 2231-0886 9
345 Bromatologia i Chemia Toksykologiczna 0365-9445 5
346 Budownictwo Górnicze i Tunelowe 1234-5342 5
347 Budownictwo i Architektura 1899-0665 3
348 Budownictwo, Technologie, Architektura 1644-745X 3
349 Builder 1896-0642 2
350 Built Environment 0263-7960 10
351 Bulgarian Journal of Veterinary Medicine 1311-1477 8
352 Bulgarian Medicine 1314-3387 2
353 Bulletin de la Société des sciences et des lettres de Łódź, Série: Recherches sur les déformations 0459-6854 7
354 Bulletin of Alfred Nobel University. Series "Legal Science" 2226-2873 6
355 Bulletin of Geography. Socio-economic Series 1732-4254 10
356 Bulletin of Geography: Physical Geography Series 2080-7686 9
357 Bulletin of the Polish Academy of Sciences. Mathematics 0239-7269 9
358 Business and Economic Horizons 1804-1205 8
359 Business and Economics Research Journal 1309-2448 10
360 Business Process Management Journal 1463-7154 10

(?<=\d\s)\D+(?=\s\d)
That should find what you need. If you are interested in how it works:
The first part of the regex, (?<=\d\s), declares that the match must come right after a digit (\d) followed by a whitespace character (\s).
The second part, \D+, is what is actually matched: one or more non-digit characters.
The third part, (?=\s\d), makes sure that the match is followed by a whitespace character and a digit.
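To try the three parts together, here is a minimal Python sketch. The helper name extract_title is made up for illustration, and Python's re module is only one of many engines that accept this pattern.

```python
import re

# Pattern from the answer: non-digits that come after "digit + whitespace"
# and are followed by "whitespace + digit".
TITLE = re.compile(r"(?<=\d\s)\D+(?=\s\d)")

def extract_title(line):
    """Return the title part of one row, or None if nothing matches."""
    match = TITLE.search(line)
    return match.group(0) if match else None

print(extract_title("337 British Journal of Applied Science & Technology 2231-0843 5"))
```

Because \D+ cannot cross a digit, this only works while the titles themselves contain no digits.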

You can do it with an expression that uses lookahead and lookbehind, like this:
(?<=\d{3}\s).*(?=\s\d{4}-)
This expression requires three digits followed by a space in front of the text and, after the text, a space followed by four digits and a dash. The name itself is matched by a plain .* pattern.
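A short Python sketch of the same expression, applied to a few rows from the question. The function name titles_from is hypothetical; the pattern is the one given above.

```python
import re

# Three digits and a space before the title, a space and "dddd-" after it.
ISSN_ANCHORED = re.compile(r"(?<=\d{3}\s).*(?=\s\d{4}-)")

def titles_from(rows):
    """Collect the title of every row in which the pattern is found."""
    found = (ISSN_ANCHORED.search(row) for row in rows)
    return [match.group(0) for match in found if match]

rows = [
    "345 Bromatologia i Chemia Toksykologiczna 0365-9445 5",
    "348 Budownictwo, Technologie, Architektura 1644-745X 3",
    "350 Built Environment 0263-7960 10",
]
for title in titles_from(rows):
    print(title)
```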

Since you don't specify a target language, here's how you could do it with Perl:
perl -pe 's/^\d+\s+//; s/\s+\d{4}-\d{3}[\dX]\s+\d+$//' test.txt
The second substitution might need adapting depending on what the rest of your data looks like.
This prints:
British Journal of Applied Science & Technology
British Journal of Economics, Management & Trade
British Journal of Education, Society & Behavioural Science
British Journal of Environment and Climate Change
[snip]
Bulletin of the Polish Academy of Sciences. Mathematics
Business and Economic Horizons
Business and Economics Research Journal
Business Process Management Journal

\d+ (.+) ....-.... \d+
Extracting:
British Journal of Applied Science & Technology
British Journal of Economics, Management & Trade
British Journal of Education, Society & Behavioural Science
British Journal of Environment and Climate Change
British Journal of Mathematics & Computer Science
British Journal of Medicine and Medical Research
British Journal of Pharmaceutical Research
[... cut ...]

(\d{3})\s([\D]+)(\d{4}-\d{3,4}X?\s\d{1,2})
This splits the string into 3 capture groups:
3 digits
Anything NOT containing a digit, up to the next digit (so the title keeps its trailing space)
The reference at the end (assumes it begins with 4 digits and is in a consistent format)
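A possible way to use the three groups from Python, as a sketch only. The function name split_with_groups is invented here, and the rstrip() call is my addition to drop the space that the second group picks up before the reference.

```python
import re

# The three-group pattern from above: number, title, reference.
ROW = re.compile(r"(\d{3})\s([\D]+)(\d{4}-\d{3,4}X?\s\d{1,2})")

def split_with_groups(line):
    """Return (number, title, reference), or None if the row does not fit."""
    match = ROW.fullmatch(line)
    if match is None:
        return None
    number, title, reference = match.groups()
    return number, title.rstrip(), reference

print(split_with_groups("338 British Journal of Economics, Management & Trade 2278-098X 5"))
```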

I understand you are looking for a regex, but if you want something slightly more straightforward, it looks like your document can easily be parsed with simple string manipulation. I offer this as an alternative for people who would rather not use regex.
import java.util.StringTokenizer;

String tmp = "340 British Journal of Environment and Climate Change 2231-4784 5";
String ending = tmp.substring(tmp.length() - 11); // last 11 chars: ISSN, space, one-digit column
tmp = tmp.substring(0, tmp.length() - 11); // strip the ending
StringTokenizer st = new StringTokenizer(tmp, " ");
String index = st.nextToken(); // reads the leading number up to the first space
tmp = tmp.substring(index.length()).trim(); // strip the number and the surrounding spaces
Now tmp is the name of the journal, index is the leading number, and the reference at the end is saved as ending. This method only works if every line has exactly the layout shown here: the fixed offset of 11 assumes a nine-character ISSN, a space, and a one-digit last column, so a row ending in 10 (such as 350) would need 12.
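The same no-regex idea can be sketched without a fixed offset by splitting once from the left and twice from the right. This is a Python illustration, not a translation of the Java above; split_by_spaces and the column name points are hypothetical, and it assumes the columns are separated by single spaces.

```python
def split_by_spaces(line):
    """Split a row into (index, name, issn, points) without any regex."""
    # The first space ends the leading number.
    index, rest = line.split(" ", 1)
    # The last two spaces separate the ISSN and the final column from the name.
    name, issn, points = rest.rsplit(" ", 2)
    return index, name, issn, points

print(split_by_spaces("340 British Journal of Environment and Climate Change 2231-4784 5"))
print(split_by_spaces("350 Built Environment 0263-7960 10"))
```

Splitting from the right means the width of the last column no longer matters.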

This one:
(?<=\d\s)\D+(?=\s\d)
works very well, but I found that titles in my PDF can contain digits, for example
338 British Journal of 5Economics, Management & Trade 2278-098X 5
How do I parse this properly?
PS: I am writing my app in C# (.NET).
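One way to cope with digits inside the title is to stop describing the title and describe its surroundings instead: a leading number, then anything, then an ISSN and a final number at the end of the line. The sketch below is in Python rather than C#, and the function name title_with_digits is made up, but the pattern uses only constructs that .NET's Regex class also supports. It assumes every row ends with an ISSN of the form dddd-dddd or dddd-dddX followed by a number.

```python
import re

# Leading number, lazily matched title, then ISSN and final number at the end.
ROW = re.compile(r"^\d+\s+(.+?)\s+\d{4}-\d{3}[\dX]\s+\d+$")

def title_with_digits(line):
    """Return the title even when it contains digits, or None if no match."""
    match = ROW.match(line.strip())
    return match.group(1) if match else None

print(title_with_digits("338 British Journal of 5Economics, Management & Trade 2278-098X 5"))
```

On that row the lookaround pattern stops at "British Journal of", because \D+ cannot cross the 5.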

Related

Using Regex in SOLR Query

I have a data set of street names and numbers which I need to search.
e.g. 12 HILL STREET
12A HILL STREET
12B HILL STREET
123 HILL STREET
12 HILARY STREET
If I search as follows q=(street_name:12\ HILL*), I get
12 HILL STREET
I want to obtain the following results:
12 HILL STREET
12A HILL STREET
12B HILL STREET
Is there a way to query in SOLR to return the results as the above example shows?
I have tried querying as:
q=(street_name:/12[A-Z]\ HILL*/)
but don't get anything back.
You can use
q=(street_name:/12[A-Z]* HILL.*/)
Here, the pattern means
12 - string starts with 12
[A-Z]* - zero or more ASCII uppercase letters
(space) - a literal space
HILL - HILL char sequence
.* - zero or more characters other than line breaks, as many as possible (so, the rest of the line).
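The logic of the pattern can be checked outside Solr. The sketch below uses Python's re module, which is not the Lucene regex engine, so it only illustrates what the pattern accepts. It uses fullmatch on the assumption that street_name is indexed as a single untokenized term and that the regex has to cover the whole term; matching_streets is a made-up helper name.

```python
import re

# Same pattern as in the query, without the surrounding slashes.
PATTERN = re.compile(r"12[A-Z]* HILL.*")

def matching_streets(names):
    """Keep only the names that the pattern matches from start to end."""
    return [name for name in names if PATTERN.fullmatch(name)]

streets = [
    "12 HILL STREET",
    "12A HILL STREET",
    "12B HILL STREET",
    "123 HILL STREET",
    "12 HILARY STREET",
]
print(matching_streets(streets))
```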

Regular expression for this problem (extracting string between strings)

I'm working on a project, and unfortunately the data extracted from another piece of software needs more formatting. Take a look at this line:
Instructor : 95371 XXX XXX XXX Associate Professor Course Name EE 311 Microprocessors lecture 834 1 32 3 3 1 08:00 AM - 08:50 AM 1 09:00 AM - 09:50 AM 3 10:00 AM - 10:50 AM 21 Total : 3 Section Position : Serial Campus Hrs Weekly Activity Semester: Time Schedule Type : 411 Reg. Regular First Semester 41/42 Rank : Course
Each line starts with Instructor, followed by : and an ID. The name may be missing. After that comes the teacher's rank, which is one of the following:
Associate Professor
Assistant Professor
Lecturer
Teacher
Teaching Assistant
After the word lecture, exercise, or practical there are six numbers; I need to extract the last one (the first from the right).
Could you please suggest a starting regular expression for this? A solution using the Qt library is welcome.
This regex matches your text and captures the value in group 1 (it expects a name to be present):
Instructor :\s*\d+\s+(?:\w+(?: \w+)*)\s+(?:Associate Professor|Assistant Professor|Lecturer|Teacher|Teaching Assistant)\s+Course Name\s+\w+ \d+\s+\w+(?: \w+)*\s+(?:lecture|exercise|practical)\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+(\d+)\s+\d{2}:\d{2} (?:AM|PM) - \d{2}:\d{2} (?:AM|PM)\s+\d\s+\d{2}:\d{2} (?:AM|PM) - \d{2}:\d{2} (?:AM|PM)\s+\d\s+\d{2}:\d{2} (?:AM|PM) - \d{2}:\d{2} (?:AM|PM)\s+\d+\s+Total : \d\s+Section Position : \s+Serial\s+Campus\s+Hrs Weekly\s+Activity\s+Semester:\s+Time\s+Schedule Type : \d+ Reg\.\s+Regular First Semester \d{2}\/\d{2}\s+Rank :\s+Course\s+
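If only the ID, the rank, and the sixth number are needed, a shorter pattern that ignores the rest of the line may be easier to maintain. The following is a Python sketch rather than Qt code; the group names and the function parse_instructor are made up for illustration, and it assumes the rank is always immediately followed by "Course Name".

```python
import re

RANKS = "Associate Professor|Assistant Professor|Lecturer|Teaching Assistant|Teacher"

LINE = re.compile(
    r"^Instructor\s*:\s*(?P<id>\d+)\s+(?P<name>.*?)\s*"      # ID and optional name
    r"(?P<rank>" + RANKS + r")\s+Course Name\s+.*?"           # rank, then the course
    r"\b(?:lecture|exercise|practical)(?:\s+\d+){5}\s+(?P<value>\d+)"  # sixth number
)

def parse_instructor(line):
    """Return a dict with id, name, rank and value, or None if no match."""
    match = LINE.search(line)
    return match.groupdict() if match else None

sample = ("Instructor : 95371 XXX XXX XXX Associate Professor Course Name "
          "EE 311 Microprocessors lecture 834 1 32 3 3 1 08:00 AM - 08:50 AM")
print(parse_instructor(sample))
```

The name group is lazy and may be empty, so lines without a name still match.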

Regex (Posix) to get first word only, not including numbers

New to Regex (which was recently added to SQL in DB2 for i). I don't know anything about the different engines but research indicates that it is "based on POSIX extended regular expressions".
I would like to get the street name (first non-numeric word) from an address.
e.g.
101 Main Street = Main
2/b Pleasant Ave = Pleasant
5H Unpleasant Crescent = Unpleasant
I'm sorry I don't have a string that isn't working, as suggested by the forum software. I don't even know where to start. I tried a few things I found in search but they either yielded nothing or the first "word" - i.e. the number (101, 2/b, 5H).
Thanks
Edit: Although it's looking as if IBM's implementation of regex on the DB2 family of databases may be too alien for many of the resident experts, I'll press ahead with some more detail in case it helps.
A plain English statement of the requirement would be:
Basic/acceptable: Find the first word/unbroken string that contains no numbers or special characters
Advanced/ideal: Find the first word that contains three or more characters, being only letters and zero or one embedded dash/hyphen, but no numbers or other characters.
Additional examples (original ones at top are still valid)
190 - 192 Tweety-bird avenue = Tweety-bird
190-192 Tweety-bird avenue = Tweety-bird
Charles Bronson Place = Charles
190H Charles-Bronson Place = Charles-Bronson
190 to 192 Charles Bronson Place = Charles
Second Edit:
Mooching around on the internet and trying every vaguely connected expression that I could find, I stumbled on this one:
[a-zA-Z]+(?:[\s-][a-zA-Z]+)*
which actually works pretty well - it gives the street name and street type, which on reflection would actually suit my purpose as well as the street name alone (I can easily expand common abbreviations - e.g. RD to ROAD - on the fly).
Sample SQL:
select HAD1,
regexp_substr(HAD1, '[a-zA-Z]+(?:[\s-][a-zA-Z]+)*')
from ECH
where HEDTE > 20190601
Sample output
Ship To Address Line 1          REGEXP_SUBSTR
32 CHRISTOPHER STREET           CHRISTOPHER STREET
250 - 270 FEATHERSTON STREET    FEATHERSTON STREET
118 MONTREAL STREET             MONTREAL STREET
7 BIRMINGHAM STREET             BIRMINGHAM STREET
59 MORRISON DRIVE               MORRISON DRIVE
118 MONTREAL STREET             MONTREAL STREET
MASON ROAD                      MASON ROAD
I know this wasn't exactly the question I asked, so apologies to anyone who could have done this but was following the original request faithfully.
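For anyone who wants to try that expression outside DB2, here is a small Python sketch. Python's re module is not the DB2 engine, so treat the results as an illustration only; street_part is a made-up helper name.

```python
import re

# The expression from the second edit: letter runs joined by a space or hyphen.
STREET = re.compile(r"[a-zA-Z]+(?:[\s-][a-zA-Z]+)*")

def street_part(address):
    """Return the first run of letter-only words, or None if there is none."""
    match = STREET.search(address)
    return match.group(0) if match else None

for address in ["32 CHRISTOPHER STREET", "250 - 270 FEATHERSTON STREET",
                "5H Unpleasant Crescent", "190 to 192 Charles Bronson Place"]:
    print(address, "=>", street_part(address))
```

A letter attached to the number (5H) or a short word between numbers (to) is picked up as the start of the match, so those cases would need extra handling.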
Not sure if this is POSIX compliant, but something like this could work: ^[\w\/]+?\s((\w+\s)+?)\s*\w+?$
The pattern assumes that the first chunk is the building number, the second chunk is the street name, and the last chunk is Road/Ave/Blvd/etc.
This should also cater for street names that contain whitespace.
The following regex matches your examples:
(?<=[^ ]+ )[^ ]*[ ]
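The "advanced/ideal" rule from the question (first word of three or more characters, letters only, with at most one embedded hyphen) can also be written as a per-word check. This is a Python sketch of the rule itself, not DB2 SQL; first_street_word is a hypothetical name, and words are assumed to be separated by whitespace.

```python
import re

# Letters only, optionally with one embedded hyphen.
WORD = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)?")

def first_street_word(address):
    """Return the first word that satisfies the rule, or None."""
    for token in address.split():
        if len(token) >= 3 and WORD.fullmatch(token):
            return token
    return None

for address in ["101 Main Street", "2/b Pleasant Ave", "190 - 192 Tweety-bird avenue",
                "190 to 192 Charles Bronson Place"]:
    print(address, "=", first_street_word(address))
```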

Regex for extracting names of colleges, universities, and institutes?

I have a bunch of strings like this in a file:
M.S., Arizona University, Tucson, Az., 1957
B.A., American International College, Springfield, Mass., 1978
B.A., American University, Washington, D.C., 1985
and I'd like to extract Tufts University, American International College, American University, University of Massachusetts, etc, but not the high schools (it's probably safe to assume that if it contains "Academy" or "High School" that it's a high school). Any ideas?
Tested with preg_match_all in PHP, will work for the sample text you provided:
/(?<=,)[\w\s]*(College|University|Institute)[^,\d]*(?=,|\d)/
Will need to be modified somewhat if your regex engine does not support lookaheads/lookbehinds.
Update: I looked at your linked sample text and updated the regex accordingly:
/([A-Z][^\s,.]+[.]?\s[(]?)*(College|University|Institute|Law School|School of|Academy)[^,\d]*(?=,|\d)/
The first part matches a word starting with a capital letter, optionally followed by a period (.), then a space, then an optional opening parenthesis. This group is matched zero or more times.
This should get all relevant words preceding the keywords.
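For comparison, here is the first (lookbehind) expression run from Python instead of PHP, as a sketch only. The function name school_name is invented, and the strip() call is my addition, because the match starts right after the comma and so begins with a space.

```python
import re

# First expression from the answer, without the PHP delimiters.
SCHOOL = re.compile(r"(?<=,)[\w\s]*(College|University|Institute)[^,\d]*(?=,|\d)")

def school_name(line):
    """Return the institution named in one line, or None if none is found."""
    match = SCHOOL.search(line)
    return match.group(0).strip() if match else None

lines = [
    "M.S., Arizona University, Tucson, Az., 1957",
    "B.A., American International College, Springfield, Mass., 1978",
    "B.A., American University, Washington, D.C., 1985",
]
for line in lines:
    print(school_name(line))
```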

Regular expression for dividing country calling codes

I have a list of calling codes for all countries (the phone number prefixes). I would like to split each entry into the country name and the actual code so I can put them into an XML file.
I have tried back and forth but cannot get a regexp going that takes all cases into account.
I think it is fairly simple for someone with a bit of experience.
The codes have these formats:
Afghanistan 93
Anguilla 1 264
Antarctica 6721
Antigua and Barbuda 1 268
Bosnia and Herzegovina 387
Canada 1
Congo, Republic of the 242
Cote d'Ivoire 225
Ireland (Eire) 353
United States of America 1
There are around 235 of them in total, but these are the regulars and the exceptions.
My idea is ^[a-zA-Z\s,'()] for between 1 and X words, and then [0-9\s]{1,5}$ for the numbers, which come in these shapes:
X
XX
XXX
XXXX
X XXX
So if I should express it as a sentence it would be: "from beginning of a line, take all characters (1) including space,'() until you encounter digits, then take all of these including space(2) until you encounter a line break."
I am using TextMate, and the docs say:
TextMate uses the Oniguruma regular expression library by K. Kosako.
I would appreciate any help given:)
Thank you.
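This POSIX regex should be sufficient for names made only of letters and spaces: ^[a-zA-Z ]+[0-9 ]+$
For the names in your list that contain a comma, an apostrophe, or parentheses, add those characters to the first class: ^[a-zA-Z ,'()]+[0-9 ]+$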
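To actually split each entry into name and code, capture groups are needed. Here is a minimal Python sketch, with the character class widened to cover the comma, apostrophe, and parentheses that appear in the listed names. The function name split_entry is made up, and the lazy +? is my choice so that the name stops at the space before the first digit.

```python
import re

# Group 1: country name (lazy); group 2: digits, possibly separated by spaces.
ENTRY = re.compile(r"^([a-zA-Z ,'()]+?) ([0-9][0-9 ]*)$")

def split_entry(line):
    """Return (name, code) for one line, or None if the line does not fit."""
    match = ENTRY.match(line.strip())
    return (match.group(1), match.group(2)) if match else None

for line in ["Afghanistan 93", "Anguilla 1 264", "Congo, Republic of the 242",
             "Cote d'Ivoire 225", "Ireland (Eire) 353"]:
    print(split_entry(line))
```

The two captured values can then be written into whatever XML structure is wanted.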