Finding avg/min/max for a dataset using MapReduce

I am trying to write a sample MapReduce practice program where my data set looks something like this.
It is about people's salaries per year in a country/city/state:
place year salary($)
america 2014 60,000
france 2010 40,000
india 2012 20,000
australia 2001 50,000
america 2014 65,000
I want output something like this:
place year avg min max
america 2014 62,500 60,000 65,000
france 2010 40,000 40,000 40,000
Please guide me on how to write such a MapReduce program, or point me to a sample program that already handles this case.
Thanks in advance :)
I have tried the mapper part:
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String year = null;
        String country = null;
        String amount = null;
        // this will work even if we receive more than 1 line
        Scanner scanner = new Scanner(value.toString());
        String line;
        String[] tokens;
        while (scanner.hasNextLine()) {
            line = scanner.nextLine();
            tokens = line.split("\\s+");
            country = tokens[0];
            year = tokens[1];
            amount = tokens[2];
            context.write(new Text(country), new Text(year));
            context.write(new Text(year), new Text(amount));
        }
    }
}
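One possible approach, sketched here in plain Python rather than Hadoop Java so that the logic is easy to run: the mapper emits a composite key of place and year with the salary as the value, and the reducer computes the average, minimum and maximum over all salaries that share a key. The function names (map_record, reduce_salaries, run_job) are hypothetical, and the in-memory grouping only stands in for the shuffle phase that a framework such as Hadoop would perform.

```python
from collections import defaultdict


def map_record(line):
    """Mapper step: turn one input line into ((place, year), salary)."""
    place, year, salary = line.split()
    # Salaries are written like "60,000", so drop the thousands separator.
    return (place, year), int(salary.replace(",", ""))


def reduce_salaries(salaries):
    """Reducer step: (avg, min, max) over all salaries that share one key."""
    return sum(salaries) / len(salaries), min(salaries), max(salaries)


def run_job(lines):
    """Simulate map -> shuffle (group by key) -> reduce in memory."""
    groups = defaultdict(list)
    for line in lines:
        key, value = map_record(line)
        groups[key].append(value)
    return {key: reduce_salaries(values) for key, values in groups.items()}


records = [
    "america 2014 60,000",
    "france 2010 40,000",
    "india 2012 20,000",
    "australia 2001 50,000",
    "america 2014 65,000",
]
result = run_job(records)
for (place, year), (avg, low, high) in sorted(result.items()):
    print(place, year, avg, low, high)
```

In a real Hadoop job the same idea would presumably translate to a mapper that writes one Text key combining place and year, and a reducer that iterates over the values received for that key.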

Related

One to many merge using Akka streams

I have a use case where I have a list of values to be fetched from the database and a list of dates for which the values need to be fetched. I want to use Akka Streams (Flow or Source with GraphDSL) to make a one-to-many (or many-to-one) relationship between them so that I fetch each value for each of the dates.
For example,
animals = cow, goat, sheep
years=2018, 2019
expected stream output is
cow & 2018
goat & 2018
sheep & 2018
cow & 2019
goat & 2019
sheep & 2019
If you want a product like this, you don't need the Graph DSL.
def animalsAndYears(animals: Source[Animal, NotUsed], years: Source[Year, NotUsed]): Source[(Animal, Year), NotUsed] =
  years.flatMapConcat { year =>
    animals.map { animal =>
      animal -> year
    }
  }
So:
animalsAndYears(Source(listOfAnimals), Source(listOfYears))
would give you a stream of animal, year tuples. Let's say that you have a function:
def queryDBForAnimalYear(aandy: (Animal, Year)): Future[Seq[Row]] = ???
Then you can get a stream of the rows with:
val parallelism: Int = ??? // How many queries to have in-flight at a time
animalsAndYears(Source(listOfAnimals), Source(listOfYears))
  .mapAsync(parallelism) { params => queryDBForAnimalYear(params) }
  .mapConcat(identity) // gives you a Source[Row]

Enum is defined but not found in the class

The solution consists of three classes: SongGenre, Song, and Library (plus Program). I am just following the instructions, so most of the code comes from my lectures and the book rather than from experience. It is what you see, and I am not really proud of it. Pointers are much appreciated. The main question is: why can't the enum values be seen in other classes?
This code has been fixed (see comments).
namespace SongLibrary
{
[Flags]
enum SongGenre
{
Unclassified = 0,
Pop = 0b1,
Rock = 0b10,
Blues = 0b100,
Country = 0b1_000,
Metal = 0b10_000,
Soul = 0b100_000
}
}
namespace SongLibrary
{
/*This acts like a record for the song. The setter is missing for all the properties.
* There are no fields.
* This class comprise of four auto-implemented properties with public getters and
* setters absent. */
public class Song
{
public string Artist { get; }
public string Title { get; }
public double Length { get; }
public SongGenre Genre { get; }
/*This constructor that takes four arguments and assigns them to the appropriate properties.*/
public Song(string title, string artist, double length, SongGenre genre)
{
Artist = artist;
Title = title;
Length = length;
SongGenre Genre = SongGenre.genre;/*<-ERROR 'SongGenre does not contain a definition for 'genre'*/
}
public override string ToString()
{
return string.Format("[{0} by ,{1} ,({2}) ,{3}min]", Title, Artist, Genre, Length);
}
}
}
namespace SongLibrary
{
public static class Library
{
/*This is a static class therefore all the members also have to be static. Class members
* are accessed using the type instead of object reference.
* There are no properties.
* There is no constructor for this class.
* There are four over-loaded methods. */
/*This private field is a list of song object is a class variable.*/
private static List<string> songs = new List<string> { "title", "artist", "length", "genre" };
/*This is a public class method that does not take any argument and displays all the songs in
* the collection.*/
public static void DisplaySongs()
{
for (int i = 0; i < songs.Count; i++)
Console.WriteLine(songs[i]);
}
/*This is a public class method that takes a double argument and displays only songs that are
* longer than the argument.*/
public static void DisplaySongs(double longerThan)
{
foreach (string songs in songs)
{
if (songs.Length > longerThan)
{
Console.WriteLine("\n" + songs);
}
}
}
/*This is a public class method that takes a SongGenre argument and displays only songs that
* are of this genre.*/
public static void DisplaySongs(SongGenre genre)
{
foreach (string songs in songs)
{
if (songs.Genre == genre)/*<-ERROR 'string' does not contain a definition for 'Genre'
* and no accessable extension method 'Genre' accepting a first
* argument of type 'string' could be found*/
{
Console.WriteLine("\n" + songs);
}
}
}
/*This is a public class method that takes a string argument and displays only songs by this artist.*/
public static void DisplaySongs(string artist)
{
foreach (string songs in songs)
{
if (songs.Artist == artist) /*< -ERROR 'string' does not contain a definition for 'Artist'
* and no accessable extension method 'Artist' accepting a first
* argument of type 'string' could be found */
{
Console.WriteLine("\n" + songs);
}
}
}
/*This a class method that is public. It takes a single string argument that represents a text file containing
* a collection of songs. You will read all the data and create songs and add it to the songs collection.You
* will have to read four lines to create one Song. Your loop body should have four ReadLine(). */
public static void LoadSongs(string fileName)
{
/*Initialize the songs field to a new List of Song*/
List<string> songs = new List<string> { "title", "artist", "length", "genre" };
/*Declare four string variable (title, artist, length, genre) to store the results of four in reader.ReadLine().*/
string title;
string artist;
double length;
SongGenre genre;
/*The first ReadLine() is a string representing the title of the song. This can and should be used as a check
* for termination condition. If this is empty then there are no more songs to read i.e. it is the end of
* the file. The next ReadLine() will get the Artist. The next ReadLine() will be a string that represents
* the weight. Use the Convert.ToDouble to get the required type. The next ReadLine() will be a string that
* represents the genre. Use the Enum.Parse() to get the required type. Use the above four variables to create
* a Song object. Add the newly created object to the collection.And finally do one more read for the title
* to re-enter the loop.*/
TextReader reader = new StreamReader(filename);//<-ERROR The name 'filename' does not exist in the current context
string line = reader.ReadLine();
while (line != null)
{
string[] data = line.Split();
title.Add(data[0]);//<-ERROR Use of unassigned local variable 'title'| 'string' does not contain definition for 'Add'
artist.Add(data[1]);//<-ERROR Use of unassigned local variable 'artist'| 'string' does not contain definition for 'Add'
length.Add(Convert.ToDouble(data[2]));/*<-ERROR Use of unassigned local variable 'length'| 'string' does not contain
* definition for 'Add'*/
genre.Add(Enum.Parse(data[3]));/*<-ERROR Use of unassigned local variable 'genre' |ERROR 'string' does not contain
* definition for 'Add' | ERROR The type arguments for method Enum.Parse cannot be
inferred from the usage*/
line = reader.ReadLine();
}
reader.Close();
}
}
}
class Program
{
static void Main(string[] args)
{
List<string> songs = new List<string>();
string filename = @"D:\songs4.txt";//<-Warning The variable 'filename' is assigned but it's value never used.
//To test the constructor and the ToString method
Console.WriteLine(new Song("Baby", "Justin Bebier", 3.35, SongGenre.Pop));//<-ERROR 'Pop'
//This is first time to use the bitwise or. It is used to specify a combination of genres
Console.WriteLine(new Song("The Promise", "Chris Cornell", 4.26, SongGenre.Country | SongGenre.Rock));//<-ERROR 'Country' and 'Rock'
Library.LoadSongs("songs4.txt"); //Class methods are invoke with the class name
Console.WriteLine("\n\nAll songs");
Library.DisplaySongs();
SongGenre genre = SongGenre.Rock;//<-ERROR 'SongGenre' does no contain a definition for 'Rock'
Console.WriteLine($"\n\n{genre} songs");
Library.DisplaySongs(genre);
string artist = "Bob Dylan";
Console.WriteLine($"\n\nSongs by {artist}");
Library.DisplaySongs(artist);
double length = 5.0;
Console.WriteLine($"\n\nSongs more than {length}mins");
Library.DisplaySongs(length);
Console.ReadKey();
}
}
}
The songs4.txt file is used to test the solution:
Baby
Justin Bebier
3.35
Pop
Fearless
Taylor Swift
4.03
Pop
Runaway Love
Ludacris
4.41
Pop
My Heart Will Go On
Celine Dion
4.41
Pop
Jesus Take The Wheel
Carrie Underwood
3.31
Country
If Tomorrow Never Comes
Garth Brooks
3.40
Country
Set Fire To Rain
Adele
4.01
Soul
Don't You Remember
Adele
3.03
Soul
Signed Sealed Deliverd I'm Yours
Stevie Wonder
2.39
Soul
Just Another Night
Mick Jagger
5.15
Rock
Brown Sugar
Mick Jagger
3.50
Rock
All I Want Is You
Bono
6.30
Metal
Beautiful Day
Bono
4.08
Metal
Like A Rolling Stone
Bob Dylan
6.08
Rock
Just Like a Woman
Bob Dylan
4.51
Rock
Hurricane
Bob Dylan
8.33
Rock
Subterranean Homesick Blues
Bob Dylan
2.24
Rock
Tangled Up In Blue
Bob Dylan
5.40
Rock
Love Me
Elvis Presley
2.42
Rock
In The Getto
Elvis Presley
2.31
Rock
All Shook Up
Elvis Presley
1.54
Rock
The output should look like this:
Baby by Justin Bebier (Pop) 3.35min
The Promise by Chris Cornell (Rock, Country) 4.26min
All songs
Baby by Justin Bebier (Pop) 3.35min
Fearless by Taylor Swift (Pop) 4.03min
Runaway Love by Ludacris (Pop) 4.41min
My Heart Will Go On by Celine Dion (Pop) 4.41min
Jesus Take The Wheel by Carrie Underwood (Country) 3.31min
If Tomorrow Never Comes by Garth Brooks (Country) 3.40min
Set Fire To Rain by Adele (Soul) 4.01min
Don't You Remember by Adele (Soul) 3.03min
Signed Sealed Deliverd I'm Yours by Stevie Wonder (Soul) 2.39min
Just Another Night by Mick Jagger (Rock) 5.15min
Brown Sugar by Mick Jagger (Rock) 3.50min
All I Want Is You by Bono (Metal) 6.30min
Beautiful Day by Bono (Metal) 4.08min
Like A Rolling Stone by Bob Dylan (Rock) 6.08min
Just Like a Woman by Bob Dylan (Rock) 4.51min
Hurricane by Bob Dylan (Rock) 8.33min
Subterranean Homesick Blues by Bob Dylan (Rock) 2.24min
Tangled Up In Blue by Bob Dylan (Rock) 5.40min
Love Me by Elvis Presley (Rock) 2.42min
In The Getto by Elvis Presley (Rock) 2.31min
All Shook Up by Elvis Presley (Rock) 1.54min
Rock songs
Just Another Night by Mick Jagger (Rock) 5.15min
Brown Sugar by Mick Jagger (Rock) 3.50min
Like A Rolling Stone by Bob Dylan (Rock) 6.08min
Just Like a Woman by Bob Dylan (Rock) 4.51min
Hurricane by Bob Dylan (Rock) 8.33min
Subterranean Homesick Blues by Bob Dylan (Rock) 2.24min
Tangled Up In Blue by Bob Dylan (Rock) 5.40min
Love Me by Elvis Presley (Rock) 2.42min
In The Getto by Elvis Presley (Rock) 2.31min
All Shook Up by Elvis Presley (Rock) 1.54min
Songs by Bob Dylan
Like A Rolling Stone by Bob Dylan (Rock) 6.08min
Just Like a Woman by Bob Dylan (Rock) 4.51min
Hurricane by Bob Dylan (Rock) 8.33min
Subterranean Homesick Blues by Bob Dylan (Rock) 2.24min
Tangled Up In Blue by Bob Dylan (Rock) 5.40min
Songs more than 5mins
Just Another Night by Mick Jagger (Rock) 5.15min
All I Want Is You by Bono (Metal) 6.30min
Like A Rolling Stone by Bob Dylan (Rock) 6.08min
Hurricane by Bob Dylan (Rock) 8.33min
Tangled Up In Blue by Bob Dylan (Rock) 5.40min
There are a few different things wrong with it, and it'll take a little while to work through them with some explanations, but the basic problem (the one you pointed me to from your question), "Genre can't be seen in other classes", is that the Genre enum is declared inside a class called SongGenre rather than directly in the namespace. You're hence not referring to it properly (it's of type SongGenre.Genre, not Genre), so in the Song class (for example) you'd declare it like:
public SongGenre.Genre Genre { get; }
       ^^^^^^^^^^^^^^^ ^^^^^
       this is the type the name
Consequently, this is a bit of a syntax error in the Song constructor:
SongGenre Genre = SongGenre.genre;/*<-ERROR 'SongGenre does not contain a definition for 'genre'*/
It should be like:
Genre = SongGenre.Genre.Blues;
Or like:
Genre = genre;
But then you have to adjust your constructor not to take a SongGenre class but to take a SongGenre.Genre enum:
public Song(string title, string artist, double length, SongGenre.Genre genre)
It's actually causing you a lot of headaches by having that enum inside the SongGenre class. You should consider throwing the SongGenre class away, moving the enum into the namespace directly, and renaming the enum to SongGenre:
namespace whatever{
enum SongGenre{ Blues...
This means you don't have to refer to it by the class name prefix all the time, and your existing code will work more like you expect.
You have another type confusion here:
if (songs.Genre == genre)/*<-ERROR 'string' does not contain a definition for 'Genre'
* and no accessable extension method 'Genre' accepting a first
* argument of type 'string' could be found*/
{
Console.WriteLine("\n" + songs);
}
songs is a list of strings, not a list of Songs, and strings don't have a Genre property. Try List<Song> instead.
= new List<string> { "title", "artist", "length", "genre" };
This doesn't make sense to me; are you expecting these to be column headers for something? This just declares a list of 4 strings, which has nothing really to do with songs. You could perhaps load these strings into a combo box so the user can "choose a thing to search by", but they aren't anything to do with songs.
title.Add(data[0]);//<-ERROR Use of unassigned local variable 'title'| 'string' does not contain definition for 'Add'
title is a string, not a list or other container, so nothing can be added to it.
TextReader reader = new StreamReader(filename);//<-ERROR The name 'filename' does not exist in the current context
string line = reader.ReadLine();
while (line != null)
{
string[] data = line.Split();
title.Add(data[0]);//<-ERROR Use of unassigned local variable 'title'| 'string' does not contain definition for 'Add'
artist.Add(data[1]);//<-ERROR Use of unassigned local variable 'artist'| 'string' does not contain definition for 'Add'
length.Add(Convert.ToDouble(data[2]));/*<-ERROR Use of unassigned local variable 'length'| 'string' does not contain
* definition for 'Add'*/
genre.Add(Enum.Parse(data[3]));/*<-ERROR Use of unassigned local variable 'genre' |ERROR 'string' does not contain
* definition for 'Add' | ERROR The type arguments for method Enum.Parse cannot be
inferred from the usage*/
line = reader.ReadLine();
}
reader.Close();
If I were reading that file, I'd do it like this:
// read all lines into an array
var songFile = File.ReadAllLines("...path to file...");
List<Song> library = new List<Song>();
// iterate, stepping 4 lines at a time (title, artist, length, genre)
for (int i = 0; i + 3 < songFile.Length; i += 4)
{
    string title = songFile[i];
    string artist = songFile[i + 1];
    double durn = double.Parse(songFile[i + 2]);
    // the non-generic Enum.Parse needs the enum type and returns object, hence the cast
    SongGenre gen = (SongGenre)Enum.Parse(typeof(SongGenre), songFile[i + 3]);
    Song s = new Song(title, artist, durn, gen);
    library.Add(s);
}

Splitting a long string in pandas cell near the n-th character position into multiple cells without splitting words

As MS Excel limits the number of characters in a cell to 32767, I have to split longer strings in a pandas dataframe into several cells.
Is there a way to split the strings of the pandas column "Text" into several columns "Text_1", "Text_2", "Text_3", ...? It is also important that the text is not split within a word, so I assume a regex is needed.
An example dataframe:
df_test = pd.DataFrame({'Text' : ['This should be the first very long string','This is the second very long string','This is the third very long string','This is the last string which is very long'],
'Date' : [2019, 2018, 2019, 2018],
'Source' : ["FAZ", "SZ" , "HB", "HB"],
'ID' : ["ID_1", "ID_2", "ID_3", "ID_4"]})
df_test
Text Date Source ID
0 This should be the first very long string 2019 FAZ ID_1
1 This is the second very long string 2018 SZ ID_2
2 This is the third very long string 2019 HB ID_3
3 This is the last string which is very long 2018 HB ID_4
Assuming that the cut in this example occurs at n=15 and not at n=32767, I want to split the Text column accordingly to something like this:
Text_1 Text_2 Text_3 Text_4 Date Source ID
0 This should be the first very long string 2019 FAZ ID_1
1 This is the second very long string 2018 SZ ID_2
2 This is the third very long string 2019 HB ID_3
3 This is the last string which is very long 2018 HB ID_4
Ultimately the approach should be scalable to n=32767 and at least ten new columns "Text_1", "Text_2", and so on.
So far I have created a new column "n" indicating the length of the df_test["Text"] strings per row:
df_test['n'] = df_test['Text'].str.len()
Here is the general idea.
# find longest long string, then divide the text
# into the number of new cols you want, adding a | at
# the division and then later splitting by that |
longest = ""
for x in df_test['Text']:
    if len(x) > len(longest):
        longest = x
        continue
import math
num_cols = math.floor(len(longest.split(' ')) / 3) # shoot for 3 words per row
for index,row in df_test.iterrows():
    word_str = row['Text']
    word_char_len = len(word_str)
    word_as_list = word_str.split(' ')
    num_words = len(word_as_list)
    col_index = math.ceil(len(word_as_list) / num_cols)
    for _ in range(num_cols - 1):
        word_as_list.insert(col_index,'|')
        col_index += col_index
    new = ' '.join(word_as_list)
    df_test.at[index,'Text'] = new
cols = ['Text'+str(i) for i in range(1,num_cols+1)]
df_test[cols] = df_test.Text.str.split('|',expand=True)
del df_test['Text']
print(df_test)
OUTPUT
Date Source ID Text1 Text2 Text3
0 2019 FAZ ID_1 This should be the first very long string
1 2018 SZ ID_2 This is the second very long string
2 2019 HB ID_3 This is the third very long string
3 2018 HB ID_4 This is the last string which is very long
I will upload a full one when I am done. Comment if you don't like this way or have other suggestions.
Yes, a single pandas cell should contain at most 32767 characters, so the strings from df_test["Text"] should be split accordingly.
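For the character limit itself, a minimal sketch that swaps in the standard library's textwrap module instead of a regex or word counts: textwrap.wrap breaks a string into pieces of at most n characters at whitespace only. The helper name split_chunks is hypothetical. Two caveats: runs of whitespace are normalised at the break points, and with break_long_words=False a single word longer than n would end up in a chunk that exceeds the limit.

```python
import textwrap


def split_chunks(text, n):
    """Split text into pieces of at most n characters without cutting words."""
    return textwrap.wrap(
        text,
        width=n,
        break_long_words=False,   # never cut inside a word
        break_on_hyphens=False,   # treat hyphenated words as one unit
    )


chunks = split_chunks("This should be the first very long string", 15)
print(chunks)  # ['This should be', 'the first very', 'long string']
```

With pandas, one way to spread the pieces over columns would be to apply this function to df_test["Text"] and build a new DataFrame from the resulting lists, naming the columns "Text_1", "Text_2", and so on; rows with fewer pieces would then have missing values in the trailing columns.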

LINQ to extract duplicate data more than 3

class Year
{
public int YearNumber;
public List<Month> Months = new List<Month>();
}
class Month
{
public int MonthNumber;
public List<Day> Days = new List<Day>();
}
class Day
{
public int DayNumber;
public string Event;
}
So I have a list of years (List<Year> years). How do I get another list containing the events that recur on the same day? An event can happen on multiple dates, and that does not matter; what matters is finding out whether the same event happens on the same date in different years. Finally, I want to keep a result only if it occurs 3 or more times. For example, 5 July 2014, 5 July 2017 and 5 July 2019 are all 'Abc Festival', which occurs 3 times. The result should give the date, the event, and the count.
Using just the classes you show we can only group dates, where a "date" is a day in a month:
var query = from y in years
            from m in y.Months
            from d in m.Days
            select new { m.MonthNumber, d.DayNumber }
            into date
            group date by date
            into dateGroup
            where dateGroup.Count() > 2
            select dateGroup;
As you see, the core solution is to build new { m.MonthNumber, d.DayNumber } objects and group them.
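The same grouping idea can be extended to include the event name and the count. The following is only a language-neutral sketch in Python, not LINQ: the nested dictionaries stand in for the Year, Month and Day classes, the function name recurring_events is hypothetical, and it assumes each event appears at most once per calendar date.

```python
from collections import Counter


def recurring_events(years, min_count=3):
    """Count (month, day, event) across all years; keep those seen min_count+ times."""
    counts = Counter(
        (month["month"], day["day"], day["event"])
        for year in years
        for month in year["months"]
        for day in month["days"]
    )
    return {key: n for key, n in counts.items() if n >= min_count}


def make_year(number, month, day, event):
    """Build one year holding a single month with a single day (sample data only)."""
    return {"year": number,
            "months": [{"month": month, "days": [{"day": day, "event": event}]}]}


years = [
    make_year(2014, 7, 5, "Abc Festival"),
    make_year(2017, 7, 5, "Abc Festival"),
    make_year(2019, 7, 5, "Abc Festival"),
    make_year(2018, 1, 1, "Other Event"),
]
recurring = recurring_events(years)
print(recurring)  # {(7, 5, 'Abc Festival'): 3}
```

In LINQ terms this would presumably correspond to adding the event to the anonymous grouping key and selecting the key together with the group's count.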

PySpark: add new column derived from another and with a time condition

I have DataFrame #1 with columns year, B, C, startyear, endyear, with values:
year B C startyear endyear
2010 2 A 2012 2014
2011 2 A 2010 2013
2013 2 B .. ..
2012 2 C
I want to create a new column called result:
df = df.withColumn(...)
result should take the start and end years into account and compute the mean of B over every year between the start and end years. For example, if
startyear = 2012
endyear = 2014
then result is the mean of B2012, B2013 and B2014: (2 + 2 + 2) / 3 = 2.
Any advice?
Thank you
You can use filter with two conditions and then use an aggregate function to calculate the mean.
df = df.filter((df.year >= df.startyear) & (df.year <= df.endyear))
df.agg({"B": "mean"})
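If the goal is instead one result value per row, as the question describes, the computation can be sketched in plain Python to make the logic explicit. This is not PySpark code: the sample rows are hypothetical (the question's own table is incomplete), the function name windowed_mean is made up for illustration, and it assumes there is exactly one row per year.

```python
def windowed_mean(rows):
    """For each row, average B over all years in [startyear, endyear]."""
    b_by_year = {r["year"]: r["B"] for r in rows}  # assumes one row per year
    results = []
    for r in rows:
        values = [b for year, b in b_by_year.items()
                  if r["startyear"] <= year <= r["endyear"]]
        # None when no year falls inside the row's window
        results.append(sum(values) / len(values) if values else None)
    return results


# Hypothetical sample data, not taken from the question
rows = [
    {"year": 2010, "B": 1, "startyear": 2012, "endyear": 2014},
    {"year": 2011, "B": 2, "startyear": 2010, "endyear": 2013},
    {"year": 2012, "B": 3, "startyear": 2012, "endyear": 2012},
    {"year": 2013, "B": 4, "startyear": 2020, "endyear": 2021},
    {"year": 2014, "B": 5, "startyear": 2010, "endyear": 2014},
]
means = windowed_mean(rows)
print(means)  # [4.0, 2.5, 3.0, None, 3.0]
```

In Spark, the equivalent would likely be a self-join of the DataFrame on the condition that the joined year lies between startyear and endyear, followed by a group-by and a mean.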