For the grammar given, provide 3 valid example strings - regex

Today I got homework in my 'Programming Languages' class and I'm having trouble. Here is the full question:
For the grammars given below, draw transition diagram and transition
table. Provide 3 valid example strings.
And this is the one I'm having trouble with:
1?1.(0|1)+
I don't really know what the question mark (?) stands for in this example, and I couldn't find an online reference. I don't want any help with the diagram or the table; I could make them if I knew what '?' means. Please help me with this, thanks in advance.

Usually the question mark ? means "zero or one occurrence" of the element before it.
So 1? matches "" (the empty string) and "1".
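A minimal Python sketch of this, for illustration only. It assumes that the "." in the homework pattern denotes concatenation (the usual textbook convention), so the pattern is written as 1?1(0|1)+ for Python's re module, where "." would otherwise mean "any character". The name is_valid is made up for this example.

```python
import re

# Assumption: "." in the homework pattern denotes concatenation, so the
# pattern is written here as 1?1(0|1)+ for Python's re module.
HOMEWORK = re.compile(r"1?1(0|1)+")

def is_valid(s):
    """Return True when the whole string matches the homework pattern."""
    return HOMEWORK.fullmatch(s) is not None

# 1? on its own accepts exactly the empty string and "1".
optional_one = [s for s in ["", "1", "11", "0"] if re.fullmatch(r"1?", s)]
print(optional_one)                                       # ['', '1']
print([is_valid(s) for s in ["10", "11", "110", "1"]])    # [True, True, True, False]
```

Under that assumption, "10", "11" and "110" would be three valid example strings.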

Related

how to build regular expressions

I'm dealing with a Google spreadsheet in which some of the data is stored in a confusing but regular format, so I hope we can figure this out.
I've tried regex builders, but I can't find one that suits Google Sheets, or I'm misunderstanding something.
I would appreciate help with the strings below:
1. {"user":{"Czy faktura?":"Y","Nazwa firmy":"Name of the company ","NIP":"113 234 20 57"}}
2. {"user":{"Czy faktura?":"Y","Nazwa firmy":"The longer name of the company","NIP":"2352225961"}}
3. {"user":{"Czy faktura?":"N","Nazwa firmy":"","NIP":""}}
The point is to extract: (using arrayformula in google sheets)
Y or N
Name of the company
NIP number
Problems:
The company name varies in length, and the NIP number sometimes contains whitespace.
Do you have any idea how to write this properly?
I know REGEXEXTRACT is the formula to use, of course :)
I just have a problem formulating the regular expression.
=regexreplace(B1, "(^.*Nazwa firmy"":"")(.*)("",""NIP.*$)", "$2")
Well the support was fantastic :)
After all, a simple "Y|N" solves the first problem
I used @ttarchala's solution for the company name, as it seems to work, although I don't know why or how :)
"(^.*Nazwa firmy"":"")(.*)("",""NIP.*$)", "$2"
and the NIP is isolated by this one: "NIP\"":\""(.+)\"""),"-|\s","" and is then stripped of "-" signs and whitespace.
cheers
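For readers who want to try the same idea outside Google Sheets, here is a rough Python sketch of the extraction. It is not the formula used above: the names ROW_RE and extract are invented for this example, and it assumes every row has the three keys in the order shown in the question.

```python
import re

# Assumption: each row contains the three keys in exactly this order.
ROW_RE = re.compile(
    r'"Czy faktura\?":"([YN])","Nazwa firmy":"(.*?)","NIP":"(.*?)"'
)

def extract(row):
    """Return (invoice flag, company name, NIP digits) or None if no match."""
    m = ROW_RE.search(row)
    if m is None:
        return None
    invoice, company, nip = m.groups()
    # Remove hyphens and whitespace from the NIP, trim the company name.
    return invoice, company.strip(), re.sub(r"[-\s]", "", nip)

row = '{"user":{"Czy faktura?":"Y","Nazwa firmy":"Name of the company ","NIP":"113 234 20 57"}}'
print(extract(row))  # ('Y', 'Name of the company', '1132342057')
```

The lazy (.*?) groups stop at the first closing quote that is followed by the next key, which is what lets the company name have any length.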

How to write regular expression if the reference to the sentence is not clear

I'm learning NLP and I ran into a problem when I tried to use regular expressions to answer the following questions:
How much did A drop?
How much did B drop?
And the given sentences are below:
1. At about 3:45, A careened to still another limit, of 30 points down, and trading was locked again.
2. Futures traders say A was signaling that B could fall as much as 200 points.
3. A had plunged 12 points
I tried to extract the correct answers 30 and 12, and my regular expression code is:
'\s?A (.+ )?(fall|drop|go\sdown|down|fell|plunged)(\sas\smuch\sas)? (\d+)'
Obviously, it's not correct: it gives the answer "200" for 'A' and misses '30'.
Could someone please show me how to write a regex for this situation?
Any response will be greatly appreciated!
If we assume that the text you want to match has the format Letter ... VERB ... [0-9]+ points, then we can try using the following pattern:
\b[A-Z]\b.*?(?:fall|drop|go\sdown|down|fell|plunged|careened).*?(\d+) points
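As a rough illustration, the pattern can be tried in Python like this. The helper name points is made up for this sketch, and the sentences are the ones from the question.

```python
import re

DROP_RE = re.compile(
    r"\b[A-Z]\b.*?(?:fall|drop|go\sdown|down|fell|plunged|careened).*?(\d+) points"
)

def points(sentence):
    """Return the first number of points that follows a subject and a verb."""
    m = DROP_RE.search(sentence)
    return int(m.group(1)) if m else None

s1 = "At about 3:45, A careened to still another limit, of 30 points down, and trading was locked again."
s2 = "Futures traders say A was signaling that B could fall as much as 200 points."
s3 = "A had plunged 12 points"

print(points(s1))  # 30
print(points(s3))  # 12
print(points(s2))  # 200, because the pattern cannot tell that B is the subject of "fall"
```

The second sentence shows the limit of this approach: the pattern only checks the order letter, verb, number, so a sentence with two subjects needs extra handling.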

Regex for binary multiple of 3

I would like to know how I can construct a regex that checks whether a number in base 2 (binary) is a multiple of 3. I read the thread "Check if a number is divisible by 3", but it doesn't do this with a regex, and the graph someone drew there is wrong (because it doesn't accept even numbers). I have tried ((1+)(0*)(1+))(0), but it doesn't work for some values. I hope you can help me.
UPDATE:
OK, thanks all for your help. Now I know how to draw the NFA; here are the graph and the regular expression:
In the graph, each state is the number in base 10 mod 3.
For example: to reach state 1 you must have read 1; then you can append 1 or 0. If you append 1, you have 11 (3 in base 10), and this number mod 3 is 0, so you draw the arc to state 0.
((0*)((11)*)((1((00) *)1) *)(101 *(0|((00) *1 *) *0)1) *(1(000)+1*01)*) *
And the other regex works, but this is shorter.
Thanks a lot :)
I know this is an old question, but an efficient answer is yet to be given and this question pops up first for "binary divisible by 3 regex" on Google.
Based on the DFA proposed by the author, a ridiculously short regex can be generated by simplifying the routes a binary string can take through the DFA.
The simplest one, using only state A, is:
0*
Including state B:
0*(11)*0*
Including state C:
0*(1(01*0)*1)*0*
And include the fact that after going back to state A, the whole process can be started again.
0*((1(01*0)*1)*0*)*
Using some basic regex rules, this simplifies to
(1(01*0)*1|0)*
Have a nice day.
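A quick brute-force check of that final regex, sketched in Python. The names MULT3 and is_multiple_of_3 are invented for this example, and the range of 1024 numbers is an arbitrary choice.

```python
import re

MULT3 = re.compile(r"(1(01*0)*1|0)*")

def is_multiple_of_3(bits):
    """True when the whole binary string matches the simplified regex."""
    return MULT3.fullmatch(bits) is not None

# Compare the regex with ordinary arithmetic for the first 1024 numbers.
mismatches = [
    n for n in range(1024)
    if is_multiple_of_3(format(n, "b")) != (n % 3 == 0)
]
print(mismatches)  # []
```

Note that fullmatch is used so that the pattern has to cover the entire string, which plays the role of the ^ and $ anchors.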
If I may plug my solution for this code golf question! It's a piece of JavaScript that generates regexes (probably inefficiently, but does the job) for divisibility for each base.
This is what it generates for divisibility by 3 in base 2:
/^((((0+)?1)(10*1)*0)(0(10*1)*0|1)*(0(10*1)*(1(0+)?))|(((0+)?1)(10*1)*(1(0+)?)|(0(0+)?)))$/
Edit: comparing to Asmor's, probably very inefficient :)
Edit 2: Also, this is a duplicate of this question.
For anyone who is learning and searching for how to do this:
see this video:
https://www.youtube.com/watch?v=SmT1DXLl3f4&t=138s
Write the state equations and solve them with Arden's Theorem.
My working is visible in the image; the result is the same as the one pointed out by user @Kert Ojasoo. I hope I did it correctly, because I spent 2 days solving it...
n+2n = 3n. Thus, 2 adjacent bits set to 1 denote a multiple of 3. If there are an odd number of adjacent 1s, that would not be 3.
So I'd propose this regex:
(0*(11)?)+
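A small Python check of this proposal against ordinary arithmetic, as a sketch (the variable names are invented here). It suggests that the pattern accepts only multiples of 3, but also that it misses some of them, such as 9 (1001) and 21 (10101), where the 1 bits are not adjacent.

```python
import re

ADJACENT_ONES = re.compile(r"(0*(11)?)+")

def accepted(n):
    """True when the binary form of n fully matches the proposed pattern."""
    return ADJACENT_ONES.fullmatch(format(n, "b")) is not None

numbers = range(1, 64)
# Accepted although not a multiple of 3.
false_positives = [n for n in numbers if accepted(n) and n % 3 != 0]
# Multiples of 3 that the pattern rejects.
false_negatives = [n for n in numbers if not accepted(n) and n % 3 == 0]

print(false_positives)
print(false_negatives)
```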

Regular expression to search for Gadaffi [closed]

I'm trying to search for the word Gadaffi, which can be spelled in many different ways. What's the best regular expression to search for this?
This is a list of 30 variants:
Gadaffi
Gadafi
Gadafy
Gaddafi
Gaddafy
Gaddhafi
Gadhafi
Gathafi
Ghadaffi
Ghadafi
Ghaddafi
Ghaddafy
Gheddafi
Kadaffi
Kadafi
Kaddafi
Kadhafi
Kazzafi
Khadaffy
Khadafy
Khaddafi
Qadafi
Qaddafi
Qadhafi
Qadhdhafi
Qadthafi
Qathafi
Quathafi
Qudhafi
Kad'afi
My best attempt so far is:
\b[KG]h?add?af?fi$\b
But I still seem to be missing some variants. Any suggestions?
Easy... (Qadaffi|Khadafy|Qadafi|...)... it's self-documented, maintainable, and assuming your regexp engine actually compiles regular expressions (rather than interpreting them), it will compile to the same DFA that a more obfuscated solution would.
Writing compact regular expressions is like using short variable names to speed up a program. It only helps if your compiler is brain-dead.
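A minimal Python sketch of the plain-alternation approach, using the 30 spellings from the question. The names VARIANTS, PATTERN, find_variants and missed are invented for this example; missed is a small helper for checking which listed spellings a hand-written pattern fails to cover.

```python
import re

VARIANTS = """
Gadaffi Gadafi Gadafy Gaddafi Gaddafy Gaddhafi Gadhafi Gathafi Ghadaffi
Ghadafi Ghaddafi Ghaddafy Gheddafi Kadaffi Kadafi Kaddafi Kadhafi Kazzafi
Khadaffy Khadafy Khaddafi Qadafi Qaddafi Qadhafi Qadhdhafi Qadthafi
Qathafi Quathafi Qudhafi Kad'afi
""".split()

# One alternative per spelling; re.escape guards any special characters.
PATTERN = re.compile(r"\b(?:" + "|".join(re.escape(v) for v in VARIANTS) + r")\b")

def find_variants(text):
    """Return every listed spelling that occurs in the text."""
    return PATTERN.findall(text)

def missed(candidate):
    """Return the listed spellings that a hand-written pattern does not match."""
    return [v for v in VARIANTS if re.fullmatch(candidate, v) is None]

print(find_variants("Mr. Gaddafi and Kad'afi, but not Godby"))
print(missed(r"[KG]h?add?af?fi"))
```

Because the alternation is built from the list itself, it matches exactly the listed spellings and nothing else.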
\b[KGQ]h?add?h?af?fi\b
Arabic transcription is (Wiki says) "Qaḏḏāfī", so maybe adding a Q. And one H ("Gadhafi", as the article (see below) mentions).
Btw, why is there a $ at the end of the regex?
Btw, nice article on the topic:
Gaddafi, Kadafi, or Qaddafi? Why is the Libyan leader’s name spelled so many different ways?.
EDIT
To match all the names in the article you've mentioned later, this should match them all. Let's just hope it won't match a lot of other stuff :D
\b(Kh?|Gh?|Qu?)[aeu](d['dt]?|t|zz|dhd)h?aff?[iy]\b
One interesting thing to note from your list of potential spellings is that there's only 3 Soundex values for the contained list (if you ignore the outlier 'Kazzafi')
G310, K310, Q310
Now, there are false positives in there ('Godby' also is G310), but by combining the limited metaphone hits as well, you can eliminate them.
<?php
// Soundex and metaphone codes shared by the known spellings.
$soundexMatch = array('G310','K310','Q310');
$metaphoneMatch = array('KTF','KTHF','FTF','KHTF','K0F');
$text = "This is a big glob of text about Mr. Gaddafi. Even using compound-Khadafy terms in here, then we might find Mr Qudhafi to be matched fairly well. For example even with apostrophes sprinkled randomly like in Kad'afi, you won't find false positives matched like godfrey, or godby, or even kabbadi";
$wordArray = preg_split('/[\s,.;-]+/',$text);
$matches = array();
foreach ($wordArray as $item){
    // Keep a word only when both its soundex and its metaphone code are listed.
    $rate = in_array(soundex($item),$soundexMatch) + in_array(metaphone($item),$metaphoneMatch);
    if ($rate > 1){
        // Escape the word so it is safe inside the pattern built below.
        $matches[] = preg_quote($item, '/');
    }
}
// Highlight every surviving word in the original text.
$pattern = implode("|",$matches);
$text = preg_replace("/($pattern)/","<b>$1</b>",$text);
echo $text;
?>
A few tweaks, and let's say some Cyrillic transliteration, and you'll have a fairly robust solution.
Using CPAN module Regexp::Assemble:
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';    # say is not enabled by default
use Regexp::Assemble;
my $ra = Regexp::Assemble->new;
# Add every known spelling; the module merges them into one pattern.
$ra->add($_) for qw(Gadaffi Gadafi Gadafy Gaddafi Gaddafy
Gaddhafi Gadhafi Gathafi Ghadaffi Ghadafi
Ghaddafi Ghaddafy Gheddafi Kadaffi Kadafi
Kaddafi Kadhafi Kazzafi Khadaffy Khadafy
Khaddafi Qadafi Qaddafi Qadhafi Qadhdhafi
Qadthafi Qathafi Quathafi Qudhafi Kad'afi);
say $ra->re;
This produces the following regular expression:
(?-xism:(?:G(?:a(?:d(?:d(?:af[iy]|hafi)|af(?:f?i|y)|hafi)|thafi)|h(?:ad(?:daf[iy]|af?fi)|eddafi))|K(?:a(?:d(?:['dh]a|af?)|zza)fi|had(?:af?fy|dafi))|Q(?:a(?:d(?:(?:(?:hd)?|t)h|d)?|th)|u(?:at|d)h)afi))
I think you're over complicating things here. The correct regex is as simple as:
\u0627\u0644\u0642\u0630\u0627\u0641\u064a
It matches the concatenation of the seven Arabic Unicode code points that forms the word القذافي (i.e. Gadaffi).
If you want to avoid matching things that no-one has used (ie avoid tending towards ".+") your best approach would be to create a regular expression that's just all the alternatives (eg. (Qadafi|Kadafi|...)) then compile that to a DFA, and then convert the DFA back into a regular expression. Assuming a moderately sensible implementation that would give you a "compressed" regular expression that's guaranteed not to contain unexpected variants.
If you've got a concrete listing of all 30 possibilities, just concatenate them all together with a bunch of "ors". Then you can be sure that it only matches the exact things you've listed, and no more. Your RE engine will probably be able to optimize it further, and, well, with 30 choices even if it doesn't it's still not a big deal. Trying to fiddle around with manually turning it into a "clever" RE can't possibly turn out better and may turn out worse.
(G|Gh|K|Kh|Q|Qh|Q|Qu)(a|au|e|u)(dh|zz|th|d|dd)(dh|th|a|ha|)(\x27|)(a|)(ff|f)(i|y)
Certainly not the most optimized version, split on syllables to maximize matches while trying to make sure we don't get false positives.
Well since you are matching small words why don't you try a similarity search engine with the Levenshtein distance? You can allow at most k insertions or deletions. This way you can change the distance function to other things that work better for your specific problem. There are many functions available in the simMetrics library.
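A small sketch of that idea in plain Python, without the simMetrics library. The function names, the reference spelling "Gaddafi" and the threshold k=2 are all assumptions made for this example. Note that the classic Levenshtein distance counts substitutions as well as insertions and deletions.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def is_close(word, reference="Gaddafi", k=2):
    """True when word is within k edits of the reference spelling."""
    return levenshtein(word.lower(), reference.lower()) <= k

print(levenshtein("Gaddafi", "Gadafi"))  # 1
print(is_close("Gadafy"))                # True
print(is_close("Godby"))                 # False
```

A single reference spelling with a small k will not cover every variant in the list, so in practice one might compare against several reference spellings or raise k and accept more false positives.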
A possible alternative is the online tool for generating regular expressions from examples: http://regex.inginf.units.it.
Give it a chance!
Why not do a mixed approach? Something between a list of all possibilities and a complicated Regex that matches far too much.
Regex is about pattern matching, and I can't see a single pattern for all the variants in the list. Trying to write one will also match things like "Gazzafy" or "Quud'haffi", which are most probably not variants in actual use and definitely not on the list.
But I can see patterns for some of the variants, and so I ended up with this:
\b(?:Gheddafi|Gathafi|Kazzafi|Kad'afi|Qadhdhafi|Qadthafi|Qudhafi|Qu?athafi|[KG]h?add?h?aff?[iy]|Qad[dh]?afi)\b
At the beginning I list the ones where I can't see a pattern, then followed by some variants where there are patterns.
See it here on www.rubular.com
I know this is an old question, but...
Neither of these two regexes is the prettiest, but they are optimized and both match ALL the variations in the original post.
"Little Beauty" #1
(?:G(?:a(?:d(?:d(?:af[iy]|hafi)|af(?:f?i|y)|hafi)|thafi)|h(?:ad(?:daf[iy]|af?fi)|eddafi))|K(?:a(?:d(?:['dh]a|af?)|zza)fi|had(?:af?fy|dafi))|Q(?:a(?:d(?:(?:(?:hd)?|t)h|d)?|th)|u(?:at|d)h)afi)
"Little Beauty" #2
(?:(?:Gh|[GK])adaff|(?:(?:Gh|[GKQ])ad|(?:Ghe|(?:[GK]h|[GKQ])a)dd|(?:Gadd|(?:[GKQ]a|Q(?:adh|u))d|(?:Qad|(?:Qu|[GQ])a)t)h|Ka(?:zz|d'))af)i|(?:Khadaff|(?:(?:Kh|G)ad|Gh?add)af)y
Rest in Peace, Muammar.
Just an addendum: you should add "Gheddafi" as alternate spelling. So the RE should be
\b[KG]h?[ae]dd?af?fi$\b
[GQK][ahu]+[dtez]+\'?[adhz]+f{1,2}(i|y)
In parts:
[GQK]
[ahu]+
[dtez]+
\'?
[adhz]+
f{1,2}(i|y)
Note: Just wanted to give a shot at this.
What else that people actually search for starts with Q, G, or K, has a d, z or t in the middle, and ends in "fi"?
/\b[GQK].+[dzt].+fi\b/i
Done.
>>> import re
>>> a = re.compile(r"\b[GQK].+[dzt].+fi\b", re.I)
>>> print(re.search(a, "Gadasadasfiasdas") is not None)
False
>>> print(re.search(a, "Gadasadasfi") is not None)
True
>>> print(re.search(a, "Qa'dafi") is not None)
True
Interesting that I'm getting downvoted. Can someone leave some false positives in the comments?

I'm going to be teaching a few developers regular expressions - what are some good homework problems? [closed]

I'm thinking of presenting questions in the form of "here is your input: [foo], here are the capture groups/results: [bar]" (and maybe writing a small script to test their answers for my results).
What are some good regex questions to ask? I need everything from beginner questions like "validate a 4 digit number" to "extract postal codes from addresses".
A few that I can think of off the top of my head:
Phone numbers in any format e.g. 555-5555, 555 55 55 55, (555) 555-555 etc.
Remove all html tags from text.
Match social security number (Finnish one is easy;)
All IP addresses
IP addresses with shorthand netmask (xx.xx.xx.xx/yy)
There's a bunch of examples of various regular expression techniques over at www.regular-expressions.info, covering everything from simple literal matching to backreferences and lookahead.
To keep things a bit more interesting than the usual email/phone/url stuff, try looking for more original exercises. Avoid boredom.
For example, have a look at the Forsyth-Edwards Notation, which is used for describing a particular board position of a chess game.
Have your students validate and extract all the bits of information from a string like this:
rnbqkbnr/pp1ppppp/8/2p5/4P3/5N2/PPPP1PPP/RNBQKB1R b KQkq - 1 2
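One possible shape of a solution to this exercise, as a hedged Python sketch. The group names and the parse_fen helper are invented here, and the pattern is deliberately loose: it checks the six fields and the allowed characters, but not that each rank adds up to exactly eight squares.

```python
import re

FEN_RE = re.compile(
    r"^(?P<board>(?:[pnbrqkPNBRQK1-8]{1,8}/){7}[pnbrqkPNBRQK1-8]{1,8})"
    r" (?P<side>[wb])"                  # side to move
    r" (?P<castling>-|[KQkq]{1,4})"     # castling availability
    r" (?P<en_passant>-|[a-h][36])"     # en passant target square
    r" (?P<halfmove>\d+)"               # halfmove clock
    r" (?P<fullmove>\d+)$"              # fullmove number
)

def parse_fen(fen):
    """Return the six FEN fields as a dict, or None when the string is malformed."""
    m = FEN_RE.match(fen)
    return m.groupdict() if m else None

fields = parse_fen("rnbqkbnr/pp1ppppp/8/2p5/4P3/5N2/PPPP1PPP/RNBQKB1R b KQkq - 1 2")
print(fields["side"], fields["castling"], fields["fullmove"])  # b KQkq 2
```

Checking that every rank describes exactly eight squares is a good follow-up question, since it is awkward to express in a regex alone.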
Additionally, have a look at algebraic chess notation, used to describe moves. Extract chess moves out of a piece of text (and make them bold).
1. e4 e5 2. Nf3 Black now defends his pawn 2...Nc6 3. Bb5 Black threatens c4
Validate phone numbers (extract the area code and the rest of the number with grouping). This assumes US phone numbers; otherwise generalize for your style.
Play around with validating email addresses (you probably want to tell the students that the full regular expression is hugely complicated, but for simple addresses it is pretty straightforward).
regexplib.com has a good library you can search through for examples.
How about extracting first name, middle name, last name, personal suffix (Jr., III, etc.) from a format like:
Smith III, John Paul
How about a regex to remove line breaks and tabs from the input?
I would start with the common ones:
validate email
validate phone number
separate the parts of a URL
Be cruel. Tell them to parse HTML.
RegEx match open tags except XHTML self-contained tags
Are you teaching them theory of finite automata as well?
Here is a good one: parse the addresses of churches correctly from this badly structured format (copy and paste it as text first)
http://www.churchangel.com/WEBNY/newhart.htm
I'm a fan of parsing date strings. Define a few common date formats, as well as time and date-time formats. These are often good exercises because some dates are simple mixes of digits and punctuation. There's a limited degree of freedom in parsing dates.
Just to throw them for a loop, why not reword a question or two to suggest that they write a regular expression to generate data fitting a specific pattern like email addresses, phone numbers, etc.? It's the same thing as validating, but can help them get out of the mindset that regex is just for validation (whereas the data generation tool in visual studio uses regex to randomly generate data).
Rather than teaching examples based on the data set, I would do examples from the perspective of the rule set to get the basics across. Give them simple examples to solve that lead them to use ONE of several basic groupings in each solution. Then have a couple of "compound" regexes at the end.
Simple:
s/abc/def/
Spinners and special characters:
s/a\s*b/abc/
Character classes:
s/[abc]/def/
Backreference:
s/ab(c)/def$1/
Anchors:
s/^fred/wilma/
s/rubble$/and betty/
Modifiers:
s/Abcd/def/gi
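For students who work in Python rather than Perl or sed, the same progression can be sketched with re.sub. The input strings are made up for illustration; count=1 mimics a substitution without the g modifier, and the end-of-line example assumes the anchor is meant to match "rubble" at the end of the string.

```python
import re

# Simple literal replacement (first occurrence only, like s/abc/def/).
simple = re.sub(r"abc", "def", "xabcx", count=1)
# Quantifier and special character: "a", any amount of whitespace, "b".
spinner = re.sub(r"a\s*b", "abc", "a   b", count=1)
# Character class: the first a, b or c is replaced.
char_class = re.sub(r"[abc]", "def", "cat", count=1)
# Backreference: \1 reuses the captured "c".
backref = re.sub(r"ab(c)", r"def\1", "abc", count=1)
# Anchors: start and end of the string.
start = re.sub(r"^fred", "wilma", "fred and barney", count=1)
end = re.sub(r"rubble$", "and betty", "barney rubble", count=1)
# Modifiers: replace everywhere (g) and ignore case (i).
modifiers = re.sub(r"Abcd", "def", "abcd ABCD", flags=re.IGNORECASE)

print(simple, spinner, char_class, backref)  # xdefx abc defat defc
print(start, "|", end, "|", modifiers)
```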
After this, I would give a few examples illustrating the pitfalls of trying to match HTML tags or other strings that shouldn't be handled with regexes, to show the limitations.
Try to think of some tests whose solutions can't simply be found with Google.
An email validator, for example, is trivial to find.
Try something like a "5 proof" test.
The input is 5 digits, and the sum of the digits must be divisible by five: 12345 = 1+2+3+4+5 = 15 / 5 = 3(.0)
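One way to check answers to this exercise, sketched in Python. The name passes_five_test is invented here, and the split of work is an assumption: the regex only validates the shape (exactly five digits), while the digit sum is checked with ordinary arithmetic.

```python
import re

FIVE_DIGITS = re.compile(r"[0-9]{5}")

def passes_five_test(s):
    """True for a 5-digit string whose digit sum is divisible by five."""
    if FIVE_DIGITS.fullmatch(s) is None:
        return False
    return sum(int(ch) for ch in s) % 5 == 0

print(passes_five_test("12345"))  # True  (1+2+3+4+5 = 15)
print(passes_five_test("12346"))  # False (digit sum is 16)
print(passes_five_test("1234"))   # False (only four digits)
```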