Applied Statistics Projects - Business English | Trường Đại học Hùng Vương

Course: Applied Statistics
Projects
Bui Anh Tuan
March 7, 2023
Overview
Due: Session 12.
Details
The class will be divided into groups. Each group of 5 to 6 students will be assigned
a topic to study and present in class. The objective of this assessment is to encourage
students to do research in groups and to communicate their results in an oral presentation.
The presentation should be created using PowerPoint and should address:
1. An overview of the dataset and why we would investigate this topic.
2. Basic insights from the data using plots and descriptive statistics.
3. Models and results
4. Conclusion.
Presentations should generally not exceed 15 minutes, to allow time for questions and
discussion.
Marking criteria and standards
The presenters will be evaluated by the lecturer (50%) as well as the rest of the class (50%)
based on the following criteria:
i. Content: Is the presentation clear and focused? Does it cover all important content of
the assigned topic?
ii. Preparation: How well prepared is this group? How good are the slides and supporting
materials? How well does this group know their materials?
iii. Presentation and Communication: How well organized is the presentation? How
effectively does this group present, interact and involve the rest of the class? Does this
group use time effectively?
iv. Addressing questions: How effectively does this group deal with questions and comments?
v. Interest and Creativity: How interesting and creative is this group’s presentation?
Project 1
Dataset:
The file houseprice.csv contains house sale prices for King County, which includes Seattle.
It covers homes sold between May 2014 and May 2015. Besides the sale prices, the
dataset also provides details of the houses that are helpful for determining the
price. Use this dataset to build a regression model to predict the house price.
Main variables are:
price: price of the houses
floors: number of floors
condition: rating from 1 to 5 (from worst to best)
view: rating from 0 to 4 (from worst to best)
sqft_above: area of the house
sqft_living: living area (includes land around the house)
sqft_basement: area of the basement
bedrooms: number of bedrooms
Tasks:
Part 1. Data Preprocessing
1. Import data: houseprice.csv
2. Data cleaning: remove all observations containing "NA" (missing data).
Part 2. Visualization and Descriptive Statistics
1. Data visualization: choose some suitable plots (boxplot, scatter plot, ...) and try to get
some basic insights into the data.
2. Descriptive statistics of the price: mean, median, ... Any insights?
Part 3. Models and Data Analysis
1. Build a linear regression model to evaluate which factors affect the price of the house.
2. Given the details of a chosen house, predict its price.
3. Any interesting insights based on the data? (Choose your own methods.)
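As a starting point for Parts 1 and 2, the cleaning and summary steps could be sketched as follows. This is only an illustration using the Python standard library: the inline sample rows are invented stand-ins for houseprice.csv, and the helper name load_clean_rows is hypothetical. Groups are free to use whatever software the course recommends.

```python
import csv
import io
import statistics

# Tiny stand-in for houseprice.csv; the rows and values are invented for illustration.
SAMPLE = """price,bedrooms,floors
300000,3,1
NA,2,1
450000,4,2
250000,NA,1
500000,3,2
"""


def load_clean_rows(handle):
    """Read CSV rows and drop every row that has an empty or "NA" field."""
    rows = []
    for row in csv.DictReader(handle):
        values = [(v or "").strip() for v in row.values()]
        if any(v == "" or v == "NA" for v in values):
            continue  # discard the whole observation, as the task asks
        rows.append(row)
    return rows


rows = load_clean_rows(io.StringIO(SAMPLE))
prices = [float(r["price"]) for r in rows]

# Basic descriptive statistics of the price column.
summary = {
    "n": len(prices),
    "mean": statistics.mean(prices),
    "median": statistics.median(prices),
    "min": min(prices),
    "max": max(prices),
}
print(summary)
```

To run this on the real file, replace io.StringIO(SAMPLE) with an opened file handle, for example open("houseprice.csv", newline="").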
Project 2
Dataset:
The dataset "student performance.csv" describes student achievement in secondary
education at two Portuguese schools. The data attributes include student grades and
demographic, social and school-related features, and the data were collected using school
reports and questionnaires.
Main variables are:
G1: first period grade (numeric: from 0 to 20)
G2: second period grade (numeric: from 0 to 20)
G3: final grade (numeric: from 0 to 20, output target)
studytime: weekly study time (numeric code: 1 = less than 2 hours, 2 = 2 to 5 hours,
3 = 5 to 10 hours, 4 = more than 10 hours)
failures: number of past class failures
absences: number of school absences (numeric: from 0 to 93)
paid: extra paid classes within the course subject (Math or Portuguese) (binary: yes
or no)
sex: student’s sex (binary: ’F’ - female or ’M’ - male)
Tasks:
Part 1. Data Preprocessing
1. Import data: student performance.csv
2. Data cleaning: remove all observations containing "NA" (missing data).
Part 2. Visualization and Descriptive Statistics
1. Data visualization: choose some suitable plots (boxplot, scatter plot, ...) and try to get
some basic insights into the data.
2. Descriptive statistics of the grades: mean, median, ... Any insights?
Part 3. Models and Data Analysis
1. Build a linear regression model to evaluate which factors affect the final grade.
2. Given the details of a chosen student, predict the final grade.
3. Any interesting insights based on the data? (Choose your own methods.)
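For Part 3, the simplest case of a linear regression has a single predictor. The sketch below fits a least-squares line for the final grade G3 from the second period grade G2 and then predicts G3 for one chosen student. The grade pairs are invented for illustration, and the function names fit_line and r_squared are hypothetical. A full project would normally include several predictors and use a statistics package.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented (G2, G3) pairs, not taken from the real dataset.
g2 = [8, 10, 12, 14, 16]
g3 = [7, 10, 12, 15, 16]

intercept, slope = fit_line(g2, g3)
r2 = r_squared(g2, g3, intercept, slope)

# Predicted final grade for a chosen student whose second period grade is 13.
predicted_g3 = intercept + slope * 13
print(f"G3 = {intercept:.2f} + {slope:.2f} * G2, R^2 = {r2:.3f}")
print(f"Predicted G3 for G2 = 13: {predicted_g3:.2f}")
```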
Project 3
Dataset:
The dataset "Diet.csv" contains information on 78 people who undertook one of three diets.
There is background information such as age, gender (Female=0, Male=1) and height. The
aim of the study was to see which diet was best for losing weight, but it was also thought
that the best diets for males and females may be different, so the independent variables are
diet and gender.
Main variables are:
Person: index of the participant
gender: 0 = female, 1 = male
Age:
Height:
pre.weight: weight before the diet
Diet: type of diets (1,2 or 3)
weight6weeks: weight after 6 weeks on the chosen diet
Tasks:
Part 1. Data Preprocessing
1. Import data: Diet.csv
2. Data cleaning: remove all observations containing "NA" (missing data).
Part 2. Visualization and Descriptive Statistics
1. Data visualization: choose some suitable plots (boxplot, scatter plot, ...) and try to get
some basic insights into the data.
2. Descriptive statistics of the variables: mean, median, ... Any insights?
Part 3. Models and Data Analysis
1. Use one-factor ANOVA to see which diet was best for losing weight.
2. You may divide the dataset into two subsets, one for males and one for
females, to see whether the best diet differs between them.
3. Any interesting insights based on the data? (Choose your own methods.)
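The one-factor ANOVA in Part 3 compares the variation between diet groups with the variation within them. A minimal sketch of the F statistic is given below, taking the weight lost (weight before the diet minus weight after 6 weeks) as the response. The records are invented, and the function name one_way_anova is hypothetical. The p-value needs the F distribution, which is not computed here and would normally come from a statistics package.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a dict mapping a label to a list of values."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)

    ss_between = 0.0  # variation of the group means around the grand mean
    ss_within = 0.0   # variation of the observations around their own group mean
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in values)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Invented records: (diet, weight before, weight after 6 weeks), in kg.
records = [
    (1, 60, 58), (1, 70, 67), (1, 80, 76),
    (2, 65, 62), (2, 75, 71), (2, 85, 80),
    (3, 62, 57), (3, 72, 66), (3, 82, 75),
]

# Group the weight lost by diet.
loss_by_diet = {}
for diet, before, after in records:
    loss_by_diet.setdefault(diet, []).append(before - after)

f_stat, df_between, df_within = one_way_anova(loss_by_diet)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

The same function can be applied separately to the male and female subsets to check whether the conclusion changes.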
Project 4
Dataset:
The dataset "flights.csv" contains information about all flights that departed from the two
major airports of the Pacific Northwest (PNW), SEA in Seattle and PDX in Portland, in
2014: 162,049 flights in total. The main goal of the project is to use this dataset to
find the major factors that cause flights to be delayed.
Main variables are:
year, month, day: Date of departure
carrier: Two-letter carrier abbreviation. See airlines to get the name.
origin, dest: Origin and destination. See airports for additional metadata.
dep_delay, arr_delay: Departure and arrival delays, in minutes. Negative times
represent early departures/arrivals.
dep_time, arr_time: Actual departure and arrival times (format HHMM or HMM),
in the local time zone.
distance: Distance between airports, in miles.
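Because the departure and arrival times are stored as numbers in HHMM or HMM form, they cannot be used directly as a time scale (for example, 559 and 600 are one minute apart, not 41). A small sketch of a conversion to minutes after midnight is shown below. The function name hhmm_to_minutes is hypothetical, and accepting 2400 as a valid value for midnight is an assumption about the data.

```python
def hhmm_to_minutes(value):
    """Convert a clock time written as HHMM or HMM (e.g. 1542 or 542) to minutes after midnight."""
    hours, minutes = divmod(int(value), 100)
    # Assumption: hour 24 is allowed so that a value of 2400 (midnight) is accepted.
    if not (0 <= hours <= 24 and 0 <= minutes < 60):
        raise ValueError(f"not a valid HHMM time: {value!r}")
    return hours * 60 + minutes


for raw in (542, "1542", 5):
    print(raw, "->", hhmm_to_minutes(raw), "minutes after midnight")
```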
Tasks:
Part 1. Data Preprocessing
1. Import data: flights.csv
2. Data cleaning: remove all observations containing "NA" (missing data).
Part 2. Visualization and Descriptive Statistics
1. Data visualization: choose some suitable plots (boxplot, scatter plot, ...) and try to get
some basic insights into the data.
2. Descriptive statistics of arr_delay: mean, median, ... Any insights?
Part 3. Models and Data Analysis
1. Use one-factor ANOVA to evaluate the differences in delay time between airlines.
2. Based on your analysis, which carrier(s) tend to be delayed more than the others?
3. Any interesting insights based on the data? (Choose your own methods.)
Project 5
Dataset:
The dataset "salary data.csv" consists of salaries for Data Scientists, Machine Learning
Engineers, Data Analysts and Data Engineers in various cities across India (2022). The aim
of this project is to find the differences in salary between the 4 data-related fields using
one-factor ANOVA.
Main variables are:
Company Name:
Job Title:
Salary( )/Year: annual salary
Field: four data-related fields
Level: level of position
Tasks:
Part 1. Data Preprocessing
1. Import data: salary data.csv
2. Data cleaning: remove all observations containing "NA" (missing data).
Part 2. Visualization and Descriptive Statistics
1. Data visualization: choose some suitable plots (boxplot, scatter plot, ...) and try to get
some basic insights into the data.
2. Descriptive statistics of the salary: mean, median, ... Any insights?
Part 3. Models and Data Analysis
1. Use one-factor ANOVA to evaluate the differences in salary between the 4 data-related
fields.
2. Based on your analysis, which field tends to have a higher salary than the others?
3. Any interesting insights based on the data? (Choose your own methods.)
Project 6
Dataset:
The dataset "insurance.csv" consists of 1338 records of insurance contracts. The aim of
this project is to build a model to predict the insurance costs.
Main variables are:
age: age of primary beneficiary
sex: insurance contractor gender (female, male)
bmi: Body mass index, an objective index of body weight relative to height, computed
as weight divided by height squared (kg/m^2); the ideal range is 18.5 to 24.9
children: Number of children covered by health insurance / Number of dependents
smoker: smoking or non-smoking
region: the beneficiary’s residential area in the US (northeast, southeast, southwest or
northwest)
charges: Individual medical costs billed by health insurance
Tasks:
Part 1. Data Preprocessing
1. Import data: insurance.csv
2. Data cleaning: remove all observations containing "NA" (missing data), if any.
Part 2. Visualization and Descriptive Statistics
1. Data visualization: choose some suitable plots (boxplot, scatter plot, ...) and try to get
some basic insights into the data.
2. Descriptive statistics of the variables: mean, median, ... Any insights?
Part 3. Models and Data Analysis
1. Build a linear regression model to evaluate which factors affect the insurance charges.
2. Give an example of a contractor and then predict the insurance charge.
3. Any interesting insights based on the data? (Choose your own methods.)
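A regression for the insurance charges will mix numeric predictors (such as age) with categorical ones (such as smoker), which have to be coded as 0/1 indicator variables first. The sketch below shows one way this could look, solving the normal equations directly with the Python standard library. The five records are invented and noise-free so that the recovered coefficients are easy to check, and the helper names solve and fit_ols are hypothetical. In practice a statistics package would do this fitting and also report standard errors.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Invented records: (age, smoker, charges). They follow
# charges = 1000 + 200 * age + 20000 * smoker exactly, with no noise.
records = [
    (20, "no", 5000.0),
    (30, "yes", 27000.0),
    (40, "no", 9000.0),
    (50, "yes", 31000.0),
    (60, "no", 13000.0),
]

# Design matrix: intercept column, age, and a 0/1 indicator for smokers.
design = [[1.0, float(age), 1.0 if smoker == "yes" else 0.0] for age, smoker, _ in records]
charges = [c for _, _, c in records]

beta = fit_ols(design, charges)

# Example contractor: a 45-year-old smoker.
predicted = beta[0] + beta[1] * 45 + beta[2] * 1
print("coefficients (intercept, age, smoker):", [round(b, 2) for b in beta])
print("predicted charge:", round(predicted, 2))
```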
Project 7
Dataset:
The dataset "supermarket sales.csv" contains the historical sales of a supermarket company,
recorded in 3 different branches over 3 months. The aim of this project is to investigate
customer satisfaction based on the ratings in the different branches.
Main variables are:
Invoice id: Computer generated sales slip invoice identification number
Branch: Branch of supercenter (3 branches are available identified by A, B and C).
Customer type: Type of customer, recorded as Members for customers using a member
card and Normal for customers without one.
Product line: General item categorization groups - Electronic accessories, Fashion
accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and
travel
Unit price: Price of each product in US dollars
Quantity: Number of products purchased by the customer
Total: Total price including tax
Rating: Customer satisfaction rating of their overall shopping experience (on a
scale of 1 to 10)
Tasks:
Part 1. Data Preprocessing
1. Import data: supermarket sales.csv
2. Data cleaning: remove all observations containing "NA" (missing data), if any.
Part 2. Visualization and Descriptive Statistics
1. Data visualization: choose some suitable plots (boxplot, scatter plot, ...) and try to get
some basic insights into the data.
2. Descriptive statistics of the variables: mean, median, ... Any insights?
Part 3. Models and Data Analysis
1. Use one-factor ANOVA to evaluate the differences in customer satisfaction (based on
ratings) between the 3 branches.
2. Based on your analysis, which branch tends to have higher customer satisfaction?
3. Any interesting insights based on the data? (Choose your own methods.)
Project 8
Dataset:
The dataset "crime.xlsx" contains information on the residents and crime records of 50
cities. The aim of this project is to predict the total overall reported crime rate per 1 million
residents using the other information.
Main variables are:
annual police funding: annual police funding in dollars per resident
per25 hs: percent of people aged 25+ with 4 years of high school
per16 19: percent of 16 to 19 year-olds not in high school and not high school graduates
per18 24: percent of 18 to 24 year-olds in college
per25 clg: percent of people 25 years+ with at least 4 years of college
crime rate: total overall reported crime rate per 1 million residents
Tasks:
Part 1. Data Preprocessing
1. Import data: crime.xlsx
2. Data cleaning: remove all observations containing "NA" (missing data), if any.
Part 2. Visualization and Descriptive Statistics
1. Data visualization: choose some suitable plots (boxplot, scatter plot, ...) and try to get
some basic insights into the data.
2. Descriptive statistics of the variables: mean, median, ... Any insights?
Part 3. Models and Data Analysis
1. Build a linear regression model to evaluate which factors affect the crime rate.
2. Give an example of a city and then predict its crime rate.
3. Any interesting insights based on the data? (Choose your own methods.)
Project 9
Dataset:
The dataset “OnlineNewsPopularity.xlsx” summarizes a heterogeneous set of features
of articles published by Mashable over a period of two years. The goal is to predict the
number of shares in social networks (popularity).
Main variables are:
n_tokens_title: Number of words in the title
n_tokens_content: Number of words in the content
num_hrefs: Number of links
num_imgs: Number of images
num_videos: Number of videos
data_channel_is_lifestyle: Is data channel ’Lifestyle’?
data_channel_is_entertainment: Is data channel ’Entertainment’?
data_channel_is_bus: Is data channel ’Business’?
data_channel_is_socmed: Is data channel ’Social Media’?
data_channel_is_tech: Is data channel ’Tech’?
data_channel_is_world: Is data channel ’World’?
weekday_is_monday: Was the article published on a Monday?
weekday_is_tuesday: Was the article published on a Tuesday?
weekday_is_wednesday: Was the article published on a Wednesday?
weekday_is_thursday: Was the article published on a Thursday?
weekday_is_friday: Was the article published on a Friday?
weekday_is_saturday: Was the article published on a Saturday?
weekday_is_sunday: Was the article published on a Sunday?
is_weekend: Was the article published on the weekend?
global_subjectivity: Text subjectivity
global_rate_positive_words: Rate of positive words in the content
global_rate_negative_words: Rate of negative words in the content
shares: Number of shares (target)
Tasks:
Part 1. Data Preprocessing
1. Import data: OnlineNewsPopularity.xlsx
2. Data cleaning: remove all observations containing "NA" (missing data), if any.
Part 2. Visualization and Descriptive Statistics
1. Data visualization: choose some suitable plots (boxplot, scatter plot, ...) and try to get
some basic insights into the data.
2. Descriptive statistics of the number of shares: mean, median, ... Any insights?
Part 3. Models and Data Analysis
1. Build a linear regression model to evaluate which factors affect the number of shares.
2. Give an example of an article and then predict its number of shares.
3. Any interesting insights based on the data? (Choose your own methods.)