c++ find any string from a list in another string

What options do I have to find any string from a list in another string?
With s being a std::string, I tried
s.find("CAT" || "DOG" || "COW" || "MOUSE", 0);
I want to find the first of these strings and get its position in the string; so if s was "My cat is sleeping\n" I'd get 3 as the return value.
boost::to_upper(s);
was applied (for those wondering).

You can do this with a regex.
std::regex_search fills in a std::smatch, and its position() member gives the offset of the match within the searched string, so no second search is needed. Like this:
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string s = "My cat is sleeping\n";
    std::smatch m;
    // The alternation matches whichever animal occurs first in s.
    std::regex animal("cat|dog|cow|mouse");
    if (std::regex_search(s, m, animal)) {
        std::cout << "Match found: " << m.str() << std::endl;
        // position(0) is the offset of the whole match within s.
        std::cout << "First animal found at: " << m.position(0) << std::endl;
    }
    return 0;
}

You may convert your search cases to a DFA. It is an efficient way of doing it, because the input is scanned in a single pass.
states (a trailing dot marks an accepting state):
nil, c, ca, cat., d, do, dog., co, cow., m, mo, mou, mous, mouse.
transition table:
state | on | goto
nil | c | c
nil | d | d
nil | m | m
c | a | ca
c | o | co
d | o | do
m | o | mo
ca | t | cat.
co | w | cow.
do | g | dog.
mo | u | mou
mou | s | mous
mous | e | mouse.
* | * | nil
You may express this with a lot of intermediary functions, with a lot of switches, or with an enum to represent the states and a mapping to represent the transitions.
If your list of search strings is dynamic or grows too big, then manually hardcoding the states will not suffice. However, as you can see, the rule for making the states and the transitions is very simple.


Extract letters and numbers from string

I have the following strings:
KZ1,345,769.1
PKS948,123.9
XG829,823.5
324JKL,282.7
456MJB87,006.01
How can I separate the letters and numbers?
This is the outcome I expect:
KZ 1345769.1
PKS 948123.9
XG 829823.5
JKL 324282.7
MJB 45687006
I have tried using the split command for this purpose but without success.
@Pearly Spencer's answer is surely preferable, but the following kind of naive looping should occur to any programmer. Look at each character in turn and decide whether it is a letter, a number or decimal point, or (implicitly) something else, and build up the answers that way. Note that although we loop over the length of the string, the loop over observations is tacit.
clear
input str42 whatever
"KZ1,345,769.1"
"PKS948,123.9"
"XG829,823.5"
"324JKL,282.7"
"456MJB87,006.01"
end
compress
local length = substr("`: type whatever'", 4, .)
gen letters = ""
gen numbers = ""
quietly forval j = 1/`length' {
local arg substr(whatever,`j', 1)
replace letters = letters + `arg' if inrange(`arg', "A", "Z")
replace numbers = numbers + `arg' if `arg' == "." | inrange(`arg', "0", "9")
}
list
+-----------------------------------------+
| whatever letters numbers |
|-----------------------------------------|
1. | KZ1,345,769.1 KZ 1345769.1 |
2. | PKS948,123.9 PKS 948123.9 |
3. | XG829,823.5 XG 829823.5 |
4. | 324JKL,282.7 JKL 324282.7 |
5. | 456MJB87,006.01 MJB 45687006.01 |
+-----------------------------------------+

Test if all characters in string are not alphanumeric

The string below is probably the result of a bad API call:
_±êµÂ’¥÷“_¡“__‘_Ó ’¥Ï“ùü’ÄÛ“_« “_Ô“Ü“ù÷ “Ïã“_÷’¥Ï “µÏ“ÄÅ“ù÷ “Á¡ê±«“ùã ê¡Û“_ã “__’
I am not sure which rows contain non-alphanumeric characters and my task is to identify which rows are problematic.
Another problem is that some non-alphanumeric characters appear with strings that I would like to still keep and search, like:
This sentence is fine and searchable, but a few non-alphanumeric äóî donäó»t popup
Is there a way to test if the entire contents of a string are non-alphanumeric?
You can use a regular expression to find all rows with only standard alphabetic and numeric characters including commas, periods, exclamation and question marks as well as spaces:
clear
input str168 var1
"_±êµÂ’¥÷“_¡“__‘_Ó ’¥Ï“ùü’ÄÛ“_« “_Ô“Ü“ù÷ “Ïã“_÷’¥Ï “µÏ“ÄÅ“ù÷ “Á¡ê±«“ùã ê¡Û“_ã “__’"
"This sentence is fine and searchable, but a few non unicode äóî donäó»t popup"
" This is a regular sentence of course"
" another sentence, but with comma"
" but what happens with question marks?"
" or perhaps an exclamation mark!"
end
generate tag = ustrregexm(var1, "^[A-Za-z0-9 ,.?!]*$")
list tag, separator(0)
+-----+
| tag |
|-----|
1. | 0 |
2. | 0 |
3. | 1 |
4. | 1 |
5. | 1 |
6. | 1 |
+-----+
Another possibility is to use a regular expression to flag the rows that contain no alphabetic or numeric characters at all, a solution which in this case covers both required cases:
clear
input str168 var1
"_±êµÂ’¥÷“_¡“__‘_Ó ’¥Ï“ùü’ÄÛ“_« “_Ô“Ü“ù÷ “Ïã“_÷’¥Ï “µÏ“ÄÅ“ù÷ “Á¡ê±«“ùã ê¡Û“_ã “__’"
"This sentence is fine and searchable, but a few non unicode äóî donäó»t popup"
" This is a regular sentence of course"
" another sentence, but with comma"
" but what happens with question marks?"
" or perhaps an exclamantion mark!"
"¥Ï“ùü’ÄÛ“_« “_Ô“Ü“ù÷ "
"¥Ï“ùü’ÄÛ hihuo"
end
generate tag = ustrregexm(var1, "^[^A-Za-z0-9]*$")
list tag, separator(0)
+-----+
| tag |
|-----|
1. | 1 |
2. | 0 |
3. | 0 |
4. | 0 |
5. | 0 |
6. | 0 |
7. | 1 |
8. | 0 |
+-----+

Spark - extracting numeric values from an alphanumeric string using regex

I have an alphanumeric column named "Result" that I'd like to parse into 4 different columns: prefix, suffix, value, and pure_text.
I'd like to solve this in Spark SQL using RLIKE and REGEX, but I'm also open to PySpark/Scala.
pure_text: the string contains only letters, or, if numbers are present, they contain the special character "-", have multiple decimal points (e.g. 9.9.0), or are a number followed by a letter and then a number again (e.g. 3x4u).
prefix: applies only to strings that can't be categorized as "pure_text". Any character(s) before the first digit [0-9] need to be extracted.
suffix: applies only to strings that can't be categorized as "pure_text". Any character(s) after the last digit [0-9] need to be extracted.
value: applies only to strings that can't be categorized as "pure_text". Extract all numerical values, including the decimal point.
Result
11 H
111L
<.004
>= 0.78
val<=0.6
xyz 100 abc
1-9
aaa 100.3.4
a1q1
Expected Output:
Result      | Prefix | Suffix | Value | Pure_Text
11 H        |        | H      | 11    |
111L        |        | L      | 111   |
.9          |        |        | 0.9   |
<.004       | <      |        | 0.004 |
>= 0.78     | >=     |        | 0.78  |
val<=0.6    | val<=  |        | 0.6   |
xyz 100 abc | xyz    | abc    | 100   |
1-9         |        |        |       | 1-9
aaa 100.3.4 |        |        |       | aaa 100.3.4
a1q1        |        |        |       | a1q1
Here's one approach using a UDF that applies pattern matching to extract the string content into a case class. The pattern matching centers around the numeric value with Regex pattern [+-]?(?:\d*\.)?\d+ to extract the first occurrence of numbers like "1.23", ".99", "-100", etc. A subsequent check of numbers in the remaining substring captured in suffix determines whether the numeric substring in the original string is legitimate.
import org.apache.spark.sql.functions._
import spark.implicits._
case class RegexRes(prefix: String, suffix: String, value: Option[Double], pure_text: String)
val regexExtract = udf{ (s: String) =>
val pattern = """(.*?)([+-]?(?:\d*\.)?\d+)(.*)""".r
s match {
case pattern(pfx, num, sfx) =>
if (sfx.exists(_.isDigit))
RegexRes("", "", None, s)
else
RegexRes(pfx, sfx, Some(num.toDouble), "")
case _ =>
RegexRes("", "", None, s)
}
}
val df = Seq(
"11 H", "111L", ".9", "<.004", ">= 0.78", "val<=0.6", "xyz 100 abc", "1-9", "aaa 100.3.4", "a1q1"
).toDF("result")
df.
withColumn("regex_res", regexExtract($"result")).
select($"result", $"regex_res.prefix", $"regex_res.suffix", $"regex_res.value", $"regex_res.pure_text").
show
// +-----------+------+------+-----+-----------+
// | result|prefix|suffix|value| pure_text|
// +-----------+------+------+-----+-----------+
// | 11 H| | H| 11.0| |
// | 111L| | L|111.0| |
// | .9| | | 0.9| |
// | <.004| <| |0.004| |
// | >= 0.78| >= | | 0.78| |
// | val<=0.6| val<=| | 0.6| |
// |xyz 100 abc| xyz | abc|100.0| |
// | 1-9| | | null| 1-9|
// |aaa 100.3.4| | | null|aaa 100.3.4|
// | a1q1| | | null| a1q1|
// +-----------+------+------+-----+-----------+

Puts and Gets in c++ [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 6 years ago.
I know what puts and gets do, but I don't understand the meaning of this code.
int main(void) {
char s[20];
gets(s); //Helloworld
gets(s+2);//dog
sort(s+1,s+7);
puts(s+4);
}
Could you please help me to understand?
Draw it on paper, along these lines.
At first, twenty uninitialised elements:
| | | | | | | | | | | | | | | | | | | | |
gets(s):
|H|e|l|l|o|w|o|r|l|d|0| | | | | | | | | |
gets(s+2):
|H|e|d|o|g|0|o|r|l|d|0| | | | | | | | | |
     ^
     |
     s+2
sort(s+1, s+7):
|H|0|d|e|g|o|o|r|l|d|0| | | | | | | | | |
   ^           ^
   |           |
   s+1         s+7
puts(s+4):
|H|0|d|e|g|o|o|r|l|d|0| | | | | | | | | |
         ^
         |
         s+4
The best thing to say about the code is that it is very bad. Luckily, it is short but it is vulnerable, unmaintainable and error prone.
However, since the previous is not really an answer, let's go through the code, assuming the standard include files were used and "using namespace std;":
char s[20];
This declares an array of 20 characters with the intent of filling it with a null-terminated string. If the string somehow becomes larger than that, you're in trouble.
gets(s); //Helloworld
This reads in a string from stdin. No checks can be done on the size. The comment assumes it will read in Helloworld, which should fit in s.
gets(s+2);//dog
This reads in a second string from stdin, but it will overwrite the previous string starting from the third character. So if the comment is right, s will contain the null-terminated string "Hedog".
sort(s+1,s+7);
This will sort the characters from the second up to the seventh character in ascending ASCII value. With the given input, we already have a problem: the null character is in the sixth position, so it will be part of the sorted characters and will end up second, so the null-terminated string will be "H".
puts(s+4);
Writes out the string from the fifth position on, so up to the null character that was read in for "Helloworld", which by now has been partly overwritten and half-sorted. With the given input this prints "goorld". Of course input can be anything, so expect surprises.
gets(s); //Helloworld -- reads a string from keyboard to s
gets(s+2);//dog -- reads a string from keyboard to s started with char 2
sort(s+1,s+7); -- sorts s in interval [1, 7), i.e. chars 1 to 6
puts(s+4); -- writes to console s from char 4
gets(s); //Helloworld --> s=Helloworld
gets(s+2);//dog --> s=Hedog (the old chars "orld" still follow the terminator)
sort(s+1,s+7); --> s=H\0degoorld (the terminator is sorted to char 1)
puts(s+4); --> console=goorld

How to disable non-standard features in SML/NJ

SML/NJ provides a series of non-standard features, such as higher-order modules, vector literal syntax, etc.
Is there a way to disable these non-standard features in SML/NJ, through some command-line param maybe, or, ideally, using a CM directive?
Just by looking at the grammar used by the parser, I'm going to say that there is not a way to do this. From "admin/base/compiler/Parse/parse/ml.grm":
apat' : OP ident (VarPat [varSymbol ident])
| ID DOT qid (VarPat (strSymbol ID :: qid varSymbol))
| int (IntPat int)
| WORD (WordPat WORD)
| STRING (StringPat STRING)
| CHAR (CharPat CHAR)
| WILD (WildPat)
| LBRACKET RBRACKET (ListPat nil)
| LBRACKET pat_list RBRACKET (ListPat pat_list)
| VECTORSTART RBRACKET (VectorPat nil)
| VECTORSTART pat_list RBRACKET (VectorPat pat_list)
| LBRACE RBRACE (unitPat)
| LBRACE plabels RBRACE (let val (d,f) = plabels
in RecordPat{def=d,flexibility=f}
end)
The VectorPat stuff is fully mixed in with the rest of the patterns. A recursive grep for VectorPat will also show that there aren't any options to turn this off anywhere else.