subset of data using regular exp in spark

I have sample data like the following:
class: 9
section: A
stud : Robert
subject: maths
mark : 69
subject:science
mark: 75
stud : Billy
subject: maths
mark : 69
subject:science
mark: 75
stud : Venice
subject: maths
mark : 69
subject:science
mark: 75
stud : Marc
subject: maths
mark : 69
subject:science
mark: 75
class: 10
section: A
stud : Agnes
subject: maths
mark : 69
subject:science
mark: 75
stud : Sarah
subject: maths
mark : 69
subject:science
mark: 75
stud : Scott
subject: maths
mark : 69
subject:science
mark: 75
stud : Alex
subject: maths
mark : 69
subject:science
mark: 75
line1
line2
line3
...
line n
I am trying to extract the class 9 student data from this file. Here is my code:
val datafile = sc.textFile("file.txt").collect().mkString(" ")
// take the data I need from the whole file
val datpattern = """(class: 9).*?(?=\bline\s)""".r
val finaldata = datpattern.findAllIn(datafile)
// regex to extract the student data
val stupattern = """section: (\S+)\s+ stud : ([\w\S]+)\s+ subject: ([\w\S]+)\s+ mark : (\d+)""".r
val finalresult = finaldata.flatMap { a => stupattern findAllIn a }
  .map { l =>
    val stupattern(section, stuname, sub, mark) = l
    (section, stuname, sub, mark)
  }
  .foreach(println)
But this gave me only the first record in each class, and only the first subject and mark (Robert's maths mark and Agnes's maths mark, from section A of classes 9 and 10).
I think this is because only that part matches the entire pattern.
I tried changing the pattern to allow zero or more occurrences of subject and mark, as below (only the changed lines are shown):
val stupattern = """section: (\S+)\s+ stud : ([\w\S]+)\s+ (subject: ([\w\S]+)\s+ mark : (\d+))*""".r
val finalresult = finaldata.flatMap { a => stupattern findAllIn a }
  .map { l =>
    val stupattern(section, stuname, {sub, mark}) = l // this does not compile
    (section, stuname, sub, mark) // this does not compile
  }
  .foreach(println)
It errors out on those two lines with "Illegal start of pattern".
Can someone tell me how to extract the repeated subset of data from the above?
Thanks in advance.

Related

output formatting error getting extra dots [duplicate]

This question already has answers here:
Which iomanip manipulators are 'sticky'?
(3 answers)
Restore the state of std::cout after manipulating it
(9 answers)
C++ - How to reset the output stream manipulator flags [duplicate]
(5 answers)
My output formatting is wrong. Here is the display function:
std::ostream &Seat::display(std::ostream &coutRef) const
{
    if (isEmpty()) {
        coutRef << "Invalid Seat!";
    }
    // check if seat no is valid or not
    else if (!validate(m_row, m_letter)) {
        char temp[41];
        strncpy(temp, m_passengerName, 40);
        temp[40] = '\0';
        coutRef << temp << setw(41 - strlen(temp)) << setfill('.') << " " << "___";
    }
    else {
        char temp[41];
        strncpy(temp, m_passengerName, 40);
        temp[40] = '\0';
        coutRef << temp << setw(41 - strlen(temp)) << setfill('.') << " " << m_row << m_letter;
    }
    return coutRef;
}
My output is below:
..1- Business Class Window: Baby Gerald............................. 1A
..2- Business Class Aisle: Groundskeeper Willie.................... 1B
..3- Business Class Aisle: Dolph Starbeam.......................... 1E
..4- Business Class Window: Kirk Van Houten......................... 1F
..5- Business Class Window: Artie Ziff.............................. 2A
..6- Business Class Aisle: Edna Krabappel.......................... 2B
..7- Business Class Aisle: Luann Van Houten........................ 2E
..8- Business Class Window: Janey Powell............................ 2F
..9- Business Class Window: Akira Kurosawa.......................... 3A
.10- Business Class Aisle: Luigi Risotto........................... 3B
.11- Business Class Aisle: Homer Simpson........................... 3E
.12- Business Class Window: Selma Bouvier........................... 3F
.13- Business Class Window: Wendell Borton.......................... 4A
.14- Business Class Middle: Manjula Nahasapeemapetilon.............. 4B
.15- Business Class Middle: Kearney Zzyzwicz........................ 4E
.16- Business Class Window: Brandine Spuckler....................... 4F
.17- Invalid seat assigned: Moe Szyslak............................. ___
.18- Invalid seat assigned: Ralph Wiggum............................ ___
.19- Economy Plus Aisle: Barney Gumble........................... 7D
I just want to remove the dots in front of the numbers (..1, ..2, ..3), so that they print as 1, 2, 3.
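The linked duplicates point at the likely cause: std::setfill is "sticky", so the '.' fill set inside display() stays in effect on the stream, and the next setw (presumably the one that pads the row number) fills with dots too. Below is a minimal sketch of saving and restoring the fill character. It assumes the name is available as a std::string, and displayName is a hypothetical helper, not part of the Seat class from the question.

```cpp
#include <iomanip>
#include <ostream>
#include <string>

// Hypothetical helper (not from the question): prints `name`, pads it with
// dots up to column 41, then restores the fill character it found.
std::ostream& displayName(std::ostream& os, const std::string& name) {
    const std::string shown = name.substr(0, 40);  // same 40-character cap as temp[41]
    const char previousFill = os.fill('.');        // fill(c) sets '.' and returns the old fill
    os << shown << std::setw(41 - static_cast<int>(shown.size())) << " ";
    os.fill(previousFill);                         // undo the sticky fill before returning
    return os;
}
```

With the fill restored, a later `os << std::setw(3) << number` on the same stream pads with spaces again. The same save-and-restore could be applied inside display() itself, around each setfill('.') use.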

How To Interpret Least Square Means and Standard Error

I am trying to understand the results I got for a fake dataset. I have two independent variables, hours and type, and a response, pain.
First question: How was 82.46721 calculated as the lsmeans for the first type?
Second question: Why is the standard error exactly the same (8.24003) for both types?
Third question: Why is the degrees of freedom 3 for both types?
library(lsmeans)  # provides lsmeans()
data = data.frame(
  type = c("A", "A", "A", "B", "B", "B"),
  hours = c(60, 72, 61, 54, 68, 66),
  # pain = c(85, 95, 69, 73, 29, 30)
  pain = c(85, 95, 69, 85, 95, 69)
)
model = lm(pain ~ hours + type, data = data)
lsmeans(model, c("type", "hours"))
> data
  type hours pain
1    A    60   85
2    A    72   95
3    A    61   69
4    B    54   85
5    B    68   95
6    B    66   69
> lsmeans(model, c("type", "hours"))
 type hours   lsmean      SE df lower.CL upper.CL
 A     63.5 82.46721 8.24003  3 56.24376 108.6907
 B     63.5 83.53279 8.24003  3 57.30933 109.7562

Try this:
newdat <- data.frame(type = c("A", "B"), hours = c(63.5, 63.5))
predict(model, newdata = newdat)
An important thing to note here is that your model has hours as a continuous predictor, not a factor.
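To spell out what that predict() call computes: the fitted model is evaluated for each type at the mean of hours, which is why both rows of the lsmeans output show hours = 63.5. The sketch below assumes R's default treatment coding with type A as the reference level. The coefficient symbols are notation introduced here, not output from the question.

```latex
\[
\bar{h} = \frac{60 + 72 + 61 + 54 + 68 + 66}{6} = 63.5
\]
\[
\widehat{\text{pain}}_A = \hat\beta_0 + \hat\beta_{\text{hours}}\,\bar{h},
\qquad
\widehat{\text{pain}}_B = \hat\beta_0 + \hat\beta_{\text{hours}}\,\bar{h} + \hat\beta_{\text{typeB}}
\]
```

Plugging the estimates from coef(model) into the first expression should give the 82.46721 reported for type A.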

Record with optional and mutable fields

In the docs: https://bucklescript.github.io/docs/en/object.html there are examples for a record with mutable fields and optional fields. When I try to use both it fails:
Compiles:
type person = {
  mutable age: int;
  job: string;
} [@@bs.deriving abstract]
let joe = person ~age:20 ~job:"teacher"
let () = ageSet joe 21
Adding the [@bs.optional] attribute:
type person = {
  mutable age: int;
  job: string [@bs.optional];
} [@@bs.deriving abstract]
let joe = person ~age:20 ~job:"teacher"
let () = ageSet joe 21
Error message:
Line 7, 20:
This expression has type unit -> person
but an expression was expected of type person
Line 7 is the ageSet line.
Am I missing anything here?

I re-read the documentation, and this is the part I missed:
Note: now that your creation function contains optional fields, we mandate an unlabeled () at the end to indicate that you've finished applying the function.
type person = {
  mutable age: int;
  job: string [@bs.optional];
} [@@bs.deriving abstract]
let joe = person ~age:20 ~job:"teacher" ()
let () = ageSet joe 21

Adding a new column based on values

I have the following sample data:
data weight_club;
input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight;
Loss = StartWeight - EndWeight;
datalines;
1023 David Shaw red 189 165
1049 Amelia Serrano yellow 145 124
1219 Alan Nance purple 210 192
1246 Ravi Sinha yellow 194 177
1078 Ashley McKnight green 127 118
;
What I would like to do now is the following:
Create two lists of colours (e.g. list1 = "red" and "yellow", and list2 = "purple" and "green")
Classify the records according to whether or not they are in list1 and list2 and add a new column.
So the pseudo code is like this:
'Set new category called class
If item is in list1 then class = 1
Else if item is in list2 then class = 2
Else class = 3
Any thoughts on how I can do this most efficiently?

Your pseudocode is almost exactly it. In a data step, using the Team variable from your sample data:
if Team in ('red', 'yellow') then class = 1;
else if Team in ('purple', 'green') then class = 2;
else class = 3;
This is really a lookup, so there are many other methods. One I usually recommend as well is Proc format, though in a simple case like this I'm not sure there are any gains.
Proc format;
Value $ colour_cat
'red', 'yellow' = 1
'purple', 'green' = 2
Other = 3;
Run;
And then in a data/SQL either of the following can be used.
*actual conversion;
Category = put(colour, $colour_cat.);
* change display only;
Format colour $colour_cat.;

How read separate lines from files?

------------------------------------------------
Artiles for a magazine
------------------------------------------------
There are total 5 articles in the magazine
------------------------------------------------
ID : 3
Description : opis2
Price : 212
Incoming amount : 2
Outgoing amount : 0
Taxes : 0
Total : 424
Date : 20324
------------------------------------------------
ID : 3
Description : 54
Price : 123
Incoming amount : 12
Outgoing amount : 0
Taxes : 0
Total : 1476
Date : 120915
------------------------------------------------
ID : 3
Description : opsi2
Price : 12
Incoming amount : 324
Outgoing amount : 0
Taxes : 0
Total : 3888
Date : 570509
------------------------------------------------
ID : 2
Description : vopi
Price : 2
Incoming amount : 2
Outgoing amount : 0
Taxes : 0
Total : 4
Date : 951230
------------------------------------------------
ID : 1
Description : opis1
Price : 2
Incoming amount : 2
Outgoing amount : 0
Taxes : 0
Total : 4
Date : 101
------------------------------------------------
I have a file called directory.dat with the contents above. What I'm trying to do is the following.
I want to find all articles with the same ID in a given year and do the following : outgoing amount - incoming amount. So, my problem is how can I find all the articles with same ID in a given year (by the user) and do the outgoing amount-incoming amount for them, by working with the file?
I tried something like this:
ifstream directory("directory.dat");
//directory.open("directory.dat");
string line;
string priceLine = "Price : ";
int price;
while(getline(directory, line)){
if(line.find(priceLine) == 0){
cout << atoi(line.substr(priceLine.size()).c_str()) << endl;
}
}
cout << price << endl;
directory.close();
But I am far away from getting on the right track and I need some help to achieve something like this.

You need to define precisely the format of your input (perhaps as a BNF grammar). A single example is not enough. We can't guess if Artiles for a magazine is meaningful or not.
while (getline(directory, line)) {
    int colonpos = -1;
    if (line.find("----") == 0) {  // find() returns 0 when the line starts with dashes
        /// check that line has only dashes, then
        process_dash_line();
    }
    else if ((colonpos = line.find(':')) > 0) {
        std::string name = line.substr(0, colonpos-1);  // colonpos-1 drops the space before ':'
        std::string value = line.substr(colonpos+1);
        process_name_value(name, value);
    }
}
Also, study (and perhaps adapt) the source code of some free software C++ parsers for JSON (e.g. jsoncpp) and YAML (e.g. yaml-cpp). They will certainly give you some inspiration.
Learn more about C++ standard libraries, e.g. on cppreference.com & cplusplus.com (both sites are easy to read but are imperfect) and of course by reading the C++11 standard, or at least its draft n3337
