Separate numbers and words in a string - stata

I have a string variable in Stata, that contains entries such as:
110xyz 43 abc
110xyz 44 abc
111 xyz 56 abc
The key is that sometimes the first set of numbers (anything from 1 to 5 digits long) is not followed by a space before the next word begins (and that word can be anything from 1 to 50 letters long). Is there a neat way to insert a blank to separate number and word? (110 xyz would be the desired result in the above example.) I tried regexr(), but that doesn't help.

What went wrong with regexr()? It seems exactly the tool to use, so perhaps it's regular expressions you're having trouble with.
My Stata syntax might be a little off, but try something like this:
gen string = "110xyz 43 abc"
gen number = regexs(1) if regexm(string, "^([0-9]+) *([a-zA-Z].*)")
gen remain = regexs(2) if regexm(string, "^([0-9]+) *([a-zA-Z].*)")
gen fixed = number + " " + remain
This can be simplified, but I'm starting you off with what I think has the highest probability of working (i.e. a guaranteed match), because I don't know what kind of null value results if regexm() doesn't match.
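For readers who want to check the pattern outside Stata, here is a minimal Python sketch of the same idea: capture the leading digits and add a space only when a letter follows directly. The function name `separate_number` is made up for illustration, and Python's `re` syntax is not identical to Stata's regular expression functions, so treat it as a way to test the logic rather than as Stata code.

```python
import re

# Leading run of digits that is immediately followed by a letter.
# The lookahead (?=[A-Za-z]) does not consume the letter, so only a space is added.
LEADING_NUMBER = re.compile(r"^(\d+)(?=[A-Za-z])")


def separate_number(text: str) -> str:
    """Insert one space after the leading number when a letter follows directly."""
    return LEADING_NUMBER.sub(r"\1 ", text)


samples = ["110xyz 43 abc", "110xyz 44 abc", "111 xyz 56 abc"]
for s in samples:
    print(repr(s), "->", repr(separate_number(s)))
```

Entries that already have the space, such as 111 xyz 56 abc, are left unchanged because the lookahead fails on the blank.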

moss from SSC offers a convenience wrapper for Stata's regular expression functions. Demonstration:
clear
input str16 test
"110xyz 43 abc"
"110xyz 44 abc"
"111 xyz 56 abc"
end
ssc inst moss
help moss
moss test, match("([0-9]+)") regex pre(num)
moss test, match("([a-z]+)") regex pre(word)
list test *match*
     +------------------------------------------------------------+
     |           test   nummat~1   nummat~2   wordma~1   wordma~2 |
     |------------------------------------------------------------|
  1. |  110xyz 43 abc        110         43        xyz        abc |
  2. |  110xyz 44 abc        110         44        xyz        abc |
  3. | 111 xyz 56 abc        111         56        xyz        abc |
     +------------------------------------------------------------+


Regex match multiple numbers stop at string (word) despite more matches exist

Goal:
Match all variations of phone numbers with 8 digits + (optional) country code.
Stop matching when "keyword" is found, even if more matches exist after the "keyword".
I need this as a one-liner. I have tried a plethora of variations with lookahead/lookbehind and a negated class [^keyword], but I cannot work out how to achieve this.
Example of text:
abra 90998855
kadabra 04 94 84 54
cat 132 23 564
oh the nice Hat +41985 32 565
+17 98 56 32 56
Ladida
keyword
I Want It To Stop Matching Here Or Right Before The "keyword"
more nice text with some matches
cat 132 23 564
oh the nice Hat +41985 32 565
+17 98 56 32 56
Example of regex:
(\+\d{1,2})?[\s]?\(?\d{2,3}\)?[\s]?(\d{2})[\s]?(\d{2})?[\s]?(\d{2,3})
-> This matches all numbers also below the keyword
(\+\d{1,2})?[\s]?\(?\d{2,3}\)?[\s]?(\d{2})[\s]?(\d{2})?[\s]?(\d{2,3})[^keyword]
-> This matches all numbers also below the keyword
(\+\d{1,2})?[\s]?\(?\d{2,3}\)?[\s]?(\d{2})[\s]?(\d{2})?[\s]?(\d{2,3})(?!keyword)
-> This matches all numbers also below the keyword
(\+\d{1,2})?[\s]?\(?\d{2,3}\)?[\s]?(\d{2})[\s]?(\d{2})?[\s]?(\d{2,3})(?=keyword)
-> This matches nothing
((\+\d{1,2})?[\s]?\(?\d{2,3}\)?[\s]?(\d{2})[\s]?(\d{2})?[\s]?(\d{2,3})(?:(?!keyword))*)
-> This matches all numbers also below the keyword
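One possible way around this, sketched here in Python, is to avoid encoding the stop condition in the pattern at all: cut the text at the first occurrence of the keyword and run the original phone-number regex only on the part before it. This is not a pure one-regex answer. The helper name `numbers_before` and the shortened sample text are assumptions made for illustration.

```python
import re

# The phone-number pattern from the question, unchanged.
PHONE = re.compile(
    r"(\+\d{1,2})?[\s]?\(?\d{2,3}\)?[\s]?(\d{2})[\s]?(\d{2})?[\s]?(\d{2,3})"
)


def numbers_before(text: str, stop_word: str = "keyword") -> list[str]:
    """Return phone-number matches found before the first stop_word."""
    head = text.split(stop_word, 1)[0]  # everything before the stop word
    return [m.group(0).strip() for m in PHONE.finditer(head)]


text = (
    "abra 90998855\n"
    "kadabra 04 94 84 54\n"
    "keyword\n"
    "cat 132 23 564\n"
)
found = numbers_before(text)
print(found)
```

Note that [^keyword] is a character class (any single character other than k, e, y, w, o, r, d), not a negated word, which is why that attempt did not stop at the keyword.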

Match number ending in 1 except when ending in 11

I need to match any number ending in 1 except numbers ending in 11. I use awk. To illustrate, the correctly working lines are:
if ( max ~ /1$/ && max !~ /11$/ ) { print max }
or using regex:
if ( max ~ /[^1]1$|^1$/ ) { print max }
or a much slower variant of the same regex:
([^1]|^)1$
I actually suspect just this one part (with a modification) should work somehow. It is nice, short and readable, does the job in far fewer steps than the combinations above, and works for all numbers with 2 digits or more, but fails for 1 itself. I fixed that above, but would prefer a better solution (if there is one). I actually need it to work for 1 to 3 digit numbers, but would prefer not to limit it.
[^1]1$
As soon as I try quantifiers to fix it, it stops working correctly. It either starts picking up leading 1s (e.g. 1211 is matched and it should not be) or loses the single-digit number 1 as a match. Obviously, my problem lies in the fact that I must match the end of the number. How can I make a better regex?
Test cases:
Matching numbers are:
1
21
31
121
131
1021
skip (not match) numbers ending in 11 like:
11
111
211
1011
1211
Can't you just use arithmetic? I believe it is quicker than regex matching:
If you know max is a number:
if ( max%10 == 1 && max%100 != 11 ) { print max }
If you do not know max is a number:
if ( max+0==max && max%10 == 1 && max%100 != 11 ) { print max }
If you want a regex, you can use ^[0-9]*[02-9]1$|^1$ but this is just an extension of RavinderSingh13's answer to make sure it is a number.
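To see that the modulo test and the regex agree on the test cases from the question, here is a small Python sketch (the awk versions above are the actual answer). The function names `by_modulo` and `by_regex` are illustrative only.

```python
import re

# Same regex as above: any digits, then a digit other than 1, then a final 1,
# or the single digit 1 on its own.
ENDS_IN_1_NOT_11 = re.compile(r"^[0-9]*[02-9]1$|^1$")


def by_modulo(n: int) -> bool:
    """Last digit is 1 and the last two digits are not 11."""
    return n % 10 == 1 and n % 100 != 11


def by_regex(n: int) -> bool:
    """Apply the regex to the decimal representation of n."""
    return ENDS_IN_1_NOT_11.search(str(n)) is not None


should_match = [1, 21, 31, 121, 131, 1021]
should_skip = [11, 111, 211, 1011, 1211]

for n in should_match + should_skip:
    print(n, by_modulo(n), by_regex(n))
```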
If your Input_file is the same as the sample shown, then the following awk may help you here.
awk '/[02-9]1$/||/^1$/' Input_file
Let's say the following is the sample Input_file.
cat Input_file
1
2001
21
31
121
131
1021
11
111
211
1011
1211
Then the following will be the output after running the code.
awk '/[02-9]1$/||/^1$/' Input_file
1
2001
21
31
121
131
1021

How can I find values which does not contains characters and spaces?

Below is sample data from the list I am working with:
74
7491
75
75010
75013
78
8081
84
8400 Winterthu
852
9000 Aalborg
974
A
A CORUÑA
aa
Aalborg
Aargau
Aarhus
aas
AAT
AB
ABERC
Abu Dhabi
Abuja
AC
ACT
AD
Using [^\p{L}-] I can get a list but it also includes the following values which I do not want in the list
Abu Dhabi
Puerto Rico
Hong Kong
How can I do this?
You want to find multiple items, so you must use the g option.
You will be checking each line separately. The usual way to construct a pattern
for such a case is ^...$, but both ^ and $
should match the beginning and end of each line, not of the whole string,
so you must also use the m option.
The last point is what the accepted content of a candidate
line should be, i.e. what goes between ^ and $: any non-empty sequence of
letters in any language or literal minus signs, i.e. [\p{L}-]+.
So, to sum up, the whole regex should be:
/^[\p{L}-]+$/gm
This way names containing a space (e.g. Puerto Rico) will not be matched
(as you specified).
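As a rough illustration of the same approach in Python: the standard `re` module does not support \p{L}, so this sketch substitutes [^\W\d_], which matches a word character that is neither a digit nor an underscore, as an approximation of "any letter". That substitution is an assumption of the sketch, not part of the answer above. `re.MULTILINE` plays the role of the m option and `findall` that of the g option.

```python
import re

# ^ and $ anchor to each line because of re.MULTILINE.
# (?:[^\W\d_]|-)+ stands in for [\p{L}-]+ : letters (approximately) or hyphens.
LETTER_OR_HYPHEN_LINE = re.compile(r"^(?:[^\W\d_]|-)+$", re.MULTILINE)

sample = "\n".join([
    "74",
    "8400 Winterthu",
    "A",
    "A CORUÑA",
    "aa",
    "Aalborg",
    "Abu Dhabi",
    "AC",
])

matches = LETTER_OR_HYPHEN_LINE.findall(sample)
print(matches)
```

Lines containing a digit or a space fail the pattern as a whole, so only the single-word entries are returned.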
Say your file is test.dat
A simple one-liner in grep will give what you want:
grep -o -P "[0-9]+$" test.dat
Output:
74
7491
75
75010
75013
78
8081
84
852
974

Match Regular Expressions patterns if exist, else

Here is what I am trying to achieve. Given a certain set of data I am trying to get the entire row that contains the matching regular expressions that I have.
Essentially, given a data set such as this
AFAM 002A AFAM & DEV AM HIS/GV 03 46493 3 LEC D2 70 P 20/15 W 1800-2045 08/24/16-12/12/16 WSQ 207 K WHITE
AFAM 102 AFRO-AMER MUSIC 01 47200 3 LEC P 5/30 W 1800-2045 08/24/16-12/12/16 MUS 250 V GROCE-ROBERTS
AFAM 125 THE BLACK FAMILY 01 47198 3 LEC P 16/40 M 1800-2045 08/24/16-12/12/16 CCB 101 S MILLNER
AFAM 152 THE BLACK WOMAN 01 47199 3 LEC P 8/40 T 1800-2045 08/24/16-12/12/16 CL 111 R WILSON
AFAM 159 ECON ISSUES BLKCM 01 47197 3 LEC P 11/40 MW 1330-1445 08/24/16-12/12/16 CL 234 R WILSON
AFAM 180 INDIVIDUAL STUDIES 01 46982 3 SUP P 0/10 TBA TBA 08/24/16-12/12/16
The regex that I have created basically groups each row into the following:
Course ID eg. AFAM 002A
Course Name eg. AFRO-AMER MUSIC
Start date
end date
Professor Name (This is the value that I want to be optional)
The problem that I am having now concerns the optional value: what I want is to check whether it exists and, if not, leave it empty. If someone could show me the correct way to do this, I would greatly appreciate it.
Essentially, this part of my regular expression, ([A-Z][\s][A-Z]+[-]*[A-Z]+)?, needs to be included if it exists. I understand that that's how the ? operator is supposed to work; however, I can't seem to find the right keyword for this question, so here I am.
([A-Z]+[\s][0-9]+[A-Z]*)(.+)[\s][0-9]+[\s][0-9]+.+(\d\d\/\d\d\/\d\d)-(\d\d\/\d\d\/\d\d)[\s]([A-Z][\s][A-Z]+[-]*[A-Z]+)?
The Expected results for this dataset for the last two rows should be
{ [ (AFAM 159), (ECON ISSUES BLKCM), (08/24/16), (12/12/16), (R WILSON)],
[(AFAM 180), (INDIVIDUAL STUDIES), (08/24/16), (12/12/16), ()]
}
Your regex does not match CL 234 in the last but one line. You need to consume it. However, just adding .*? won't work, you need to make your optional pattern obligatory (remove ?) and wrap .*?([A-Z]\s[A-Z]+-*[A-Z]+) with an optional non-capturing group (?:....).
([A-Z]+\s\d+[A-Z]*)(.+?)\s\d+\s\d+.+?(\d\d\/\d\d\/\d\d)-(\d\d\/\d\d\/\d\d)\s(?:.*?([A-Z]\s[A-Z]+-*[A-Z]+))?
See the regex demo.
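A minimal Python sketch of the optional-group idea, applied to the last two rows of the sample data. It makes one adjustment that is an assumption of this sketch and not part of the regex above: the \s after the end date is moved inside the optional non-capturing group, so that a row ending directly after the date (no trailing whitespace) still matches. The helper name `parse_row` is hypothetical.

```python
import re

ROW = re.compile(
    r"([A-Z]+\s\d+[A-Z]*)(.+?)\s\d+\s\d+.+?"   # course ID, course name, section, class number
    r"(\d\d/\d\d/\d\d)-(\d\d/\d\d/\d\d)"        # start date, end date
    r"(?:\s.*?([A-Z]\s[A-Z]+-*[A-Z]+))?"        # optional: skip the room, capture the professor
)


def parse_row(row: str):
    """Return (course, name, start, end, professor); professor is '' when absent."""
    m = ROW.search(row)
    if m is None:
        return None
    course, name, start, end, professor = m.groups()
    return (course, name.strip(), start, end, professor or "")


rows = [
    "AFAM 159 ECON ISSUES BLKCM 01 47197 3 LEC P 11/40 MW 1330-1445 08/24/16-12/12/16 CL 234 R WILSON",
    "AFAM 180 INDIVIDUAL STUDIES 01 46982 3 SUP P 0/10 TBA TBA 08/24/16-12/12/16",
]
for row in rows:
    print(parse_row(row))
```

When the optional group does not take part in the match, Python reports its capture as None, which the sketch converts to an empty string to mirror the expected result in the question.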

Regex to validate 13 digit telephone number is numeric and can contain maximum of 3 spaces

I would like to validate a telephone number which can contain 10 to 13 digit numbers and can contain 0 to 3 spaces (can come anywhere in the data). Please let me know how to do it?
I tried using regex ^(\d*\s){0,3}\d*$ which works fine but I need to restrict the total number of characters to 13.
You want to match the same text against 2 different whole-line patterns.
This is achievable either by applying the patterns consecutively:
$ cat file
1234567 90
1234567890
123 456 789 0123
123 456 789 01 23
$ sed -rn '/^([0-9] ?){9,12}[0-9]$/{/^([0-9]+ ){0,3}[0-9]+$/p}' file
1234567890
123 456 789 0123
$
Or, if your regex engine (perl/"grep -P"/java/etc.) supports lookaheads, the patterns can be combined:
// This is Java
Pattern p = Pattern.compile("(?=^([0-9] ?){9,12}[0-9]$)(?=^([0-9]+ ){0,3}[0-9]+$)^.*$");
System.out.println(p.matcher("1234567 90").matches()); // false
System.out.println(p.matcher("123 456 789 0123").matches()); // true
System.out.println(p.matcher("123 456 789 01 23").matches()); // false
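The combined-lookahead pattern carries over to other engines that support lookaheads. Here is a minimal Python sketch that uses the same two conditions (10 to 13 digits, at most 3 separating spaces) and runs them on the sample lines from the sed example. The function name `is_valid_phone` is illustrative, and the capturing groups are written as non-capturing ones, which does not change what is matched.

```python
import re

PHONE = re.compile(
    r"(?=^(?:[0-9] ?){9,12}[0-9]$)"    # 10 to 13 digits, each optionally followed by one space
    r"(?=^(?:[0-9]+ ){0,3}[0-9]+$)"    # at most 3 space-separated breaks
    r"^.*$"
)


def is_valid_phone(value: str) -> bool:
    """True when both lookahead conditions hold for the whole string."""
    return PHONE.fullmatch(value) is not None


for line in ["1234567 90", "1234567890", "123 456 789 0123", "123 456 789 01 23"]:
    print(repr(line), is_valid_phone(line))
```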