Lecture 5 - ENEE1006IU

Study material for the course Applied Statistics (ENEE1006IU) at International University, Vietnam National University Ho Chi Minh City. The document has 22 pages.
APPLIED STATISTICS
COURSE CODE: ENEE1006IU
Lecture 5:
Chapter 3: Descriptive statistics
(3 credits: 2 is for lecture, 1 is for lab-work)
Instructor: TRAN THANH TU
Email: tttu@hcmiu.edu.vn
3.3. MEASURES OF DISTRIBUTION SHAPE, RELATIVE LOCATION, AND DETECTING
OUTLIERS
•Distribution Shape
•z-Scores
•Chebyshev’s Theorem
•Empirical Rule
•Detecting Outliers
•Distribution Shape (notation):
•$x_i$ = i-th observation of the variable
•$\bar{x}$ = mean of the distribution
•n = number of observations in the distribution
•σ (or s) = standard deviation
•Distribution Shape:
•For a symmetric distribution,
the mean and the median are
equal.
•When the data are positively
skewed, the mean will usually
be greater than the median.
•When the data are negatively
skewed, the mean will usually
be less than the median.
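The relationship between skewness, mean, and median can be checked numerically. The sketch below is only an illustration: the data are made up, and the skewness formula used, $\frac{n}{(n-1)(n-2)} \sum \left(\frac{x_i - \bar{x}}{s}\right)^3$, is an assumption (a common sample-skewness formula), since the slide's own formula did not survive extraction.

```python
from statistics import mean, median, stdev

def sample_skewness(data):
    """Sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3).

    Assumed formula (not recoverable from the slides).
    """
    n = len(data)
    if n < 3:
        raise ValueError("need at least 3 observations")
    x_bar = mean(data)
    s = stdev(data)  # sample standard deviation (divides by n - 1)
    return n / ((n - 1) * (n - 2)) * sum(((x - x_bar) / s) ** 3 for x in data)

# Hypothetical data: mostly small values with a long right tail
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
# Hypothetical data: symmetric around 3
symmetric = [1, 2, 3, 4, 5]

print("mean:", mean(right_skewed), "median:", median(right_skewed))
print("skewness (right-skewed):", sample_skewness(right_skewed))
print("skewness (symmetric):", sample_skewness(symmetric))
```

For the right-skewed sample the mean exceeds the median and the skewness is positive. For the symmetric sample the skewness is zero up to floating-point rounding.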
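As a small illustration of the z-score, the sketch below standardizes a sample using the sample mean and sample standard deviation, i.e. $z_i = (x_i - \bar{x})/s$. The data values and the function name `z_scores` are illustrative assumptions, not taken from the slides.

```python
from statistics import mean, stdev

def z_scores(data):
    """Return the z-score (x - mean) / s for each observation."""
    x_bar = mean(data)
    s = stdev(data)  # sample standard deviation (divides by n - 1)
    return [(x - x_bar) / s for x in data]

# Illustrative sample: mean is 44 and sample standard deviation is 8
class_sizes = [46, 54, 42, 46, 32]

for x, z in zip(class_sizes, z_scores(class_sizes)):
    position = "above" if z > 0 else "below" if z < 0 else "at"
    print(f"x = {x}: z = {z:+.2f} ({position} the mean)")
```

Here the value 32 has z = −1.5, meaning it lies 1.5 standard deviations below the mean.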
•Chebyshev’s Theorem: enables us to make statements about the proportion (%) of data values that must be within a specified number of standard deviations of the mean (applies to all distribution shapes).
“At least $(1 - 1/z^2)$ of the data values must be within z standard deviations of the mean, where z is any value greater than 1.”
(z need not be an integer)
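The bound $1 - 1/z^2$ is easy to tabulate. The sketch below evaluates it for a few values of z and compares it against the observed proportion in a made-up, strongly skewed data set. The function names and data are illustrative assumptions.

```python
from statistics import mean, pstdev

def chebyshev_lower_bound(z):
    """Minimum proportion of data within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("z must be greater than 1")
    return 1 - 1 / z**2

def fraction_within(data, z):
    """Observed proportion of data within z standard deviations of the mean."""
    m = mean(data)
    sd = pstdev(data)  # standard deviation of the data set itself (divides by n)
    return sum(abs(x - m) <= z * sd for x in data) / len(data)

# Hypothetical, strongly skewed data: the bound must hold for any shape
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]

for z in (1.5, 2, 3, 4):
    print(f"z = {z}: at least {chebyshev_lower_bound(z):.4f}, "
          f"observed {fraction_within(data, z):.2f}")
```

For z = 2 the theorem guarantees at least 75% of the values. In this sample 90% fall within two standard deviations, which is consistent with the bound.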
•Detecting Outliers:
•Sometimes a data set will have one or more observations with unusually large or
unusually small values. These extreme values are called outliers.
•An outlier may be a data value that has been incorrectly recorded; such a value needs to be corrected or removed before further analysis.
- Standardized values (z-scores) can be used to identify outliers.
Treat any data value with a z-score less than −3 or greater than +3 as an outlier.
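A minimal sketch of the z-score rule follows, using made-up data and a hypothetical helper `z_score_outliers`. Note that in very small samples no value can reach |z| > 3, so the example uses 15 observations.

```python
from statistics import mean, stdev

def z_score_outliers(data, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    x_bar = mean(data)
    s = stdev(data)  # sample standard deviation
    return [x for x in data if abs((x - x_bar) / s) > threshold]

# Hypothetical measurements: fourteen values near 10 and one suspicious 50
measurements = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10, 10, 11, 9, 10, 50]

print(z_score_outliers(measurements))
```

Only the value 50 is flagged: its z-score is about 3.6, while every other observation lies within half a standard deviation of the mean.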
•Detecting Outliers:
- Another approach to identifying outliers is based upon the values of the first and third quartiles ($Q_1$ and $Q_3$) and the interquartile range (IQR):
Lower limit = $Q_1 - 1.5\,(\mathrm{IQR})$
Upper limit = $Q_3 + 1.5\,(\mathrm{IQR})$
An observation is classified as an outlier if its value is less than the lower limit or greater than the upper limit.
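The quartile-based rule can be sketched as follows. Two assumptions are made here: the limits are taken as $Q_1 - 1.5\,\mathrm{IQR}$ and $Q_3 + 1.5\,\mathrm{IQR}$ (the usual convention), and the quartiles come from Python's `statistics.quantiles` with its default method, which may differ slightly from other quartile conventions. The data are made up.

```python
from statistics import quantiles

def iqr_outliers(data, k=1.5):
    """Return (lower_limit, upper_limit, outliers) using the k * IQR rule."""
    q1, _, q3 = quantiles(data, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    outliers = [x for x in data if x < lower_limit or x > upper_limit]
    return lower_limit, upper_limit, outliers

# Hypothetical data: Q1 = 4, Q3 = 8, so IQR = 4
values = [2, 4, 5, 6, 7, 8, 25]

lower, upper, flagged = iqr_outliers(values)
print("limits:", lower, upper)
print("outliers:", flagged)
```

With these values the limits are −2 and 14, so only 25 is classified as an outlier.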
End of file 1.
Any questions?
3.4. FIVE-NUMBER SUMMARIES AND BOX PLOTS
•Five-Number Summary
•Box Plot
•Comparative Analysis Using Box Plots
•Five-Number Summary: is especially useful in descriptive analyses or during
the preliminary investigation of a large data set.
A summary consists of five values: the most extreme values in the data set (the
maximum and minimum values), the lower and upper quartiles, and the median.
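A five-number summary can be computed in a few lines. This is a sketch with made-up data. The quartiles come from `statistics.quantiles` with its default method, which is an assumption about the quartile convention rather than something stated in the slides.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(data)
    q1, med, q3 = quantiles(ordered, n=4)  # three cut points: Q1, median, Q3
    return ordered[0], q1, med, q3, ordered[-1]

# Hypothetical scores (unsorted on purpose)
scores = [14, 5, 10, 30, 8, 12, 7, 11, 9, 13, 10]

minimum, q1, med, q3, maximum = five_number_summary(scores)
print(f"min={minimum}, Q1={q1}, median={med}, Q3={q3}, max={maximum}")
```

For this sample the summary is 5, 8, 10, 13, 30. The large gap between Q3 and the maximum already hints at an extreme value in the upper tail.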
•Box Plot: A box plot is a graphical display of data based on a five-number summary. A key to the development of a box plot is the computation of the interquartile range, $\mathrm{IQR} = Q_3 - Q_1$.
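The numbers needed to draw a box plot can be computed without any plotting library. The sketch below assumes the common convention that whiskers extend to the most extreme data values inside the limits $Q_1 - 1.5\,\mathrm{IQR}$ and $Q_3 + 1.5\,\mathrm{IQR}$, with anything outside drawn as an individual point. The convention, the helper name `box_plot_components`, and the data are assumptions for illustration.

```python
from statistics import quantiles

def box_plot_components(data, k=1.5):
    """Return the values needed to draw a box plot by hand."""
    ordered = sorted(data)
    q1, med, q3 = quantiles(ordered, n=4)
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    # Whiskers end at the most extreme observations still inside the limits
    inside = [x for x in ordered if lower_limit <= x <= upper_limit]
    return {
        "box": (q1, med, q3),
        "iqr": iqr,
        "limits": (lower_limit, upper_limit),
        "whiskers": (inside[0], inside[-1]),
        "outliers": [x for x in ordered if x < lower_limit or x > upper_limit],
    }

# Hypothetical scores
scores = [14, 5, 10, 30, 8, 12, 7, 11, 9, 13, 10]

parts = box_plot_components(scores)
for name, value in parts.items():
    print(f"{name}: {value}")
```

Here the box spans 8 to 13 with the median line at 10, the whiskers reach 5 and 14, and 30 is plotted separately as an outlier.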
•Comparative Analysis Using Box Plots: Box plots can also be used to
provide a graphical summary of two or more groups and facilitate visual
comparisons among the groups.
End of file 2.
Any questions?
3.5. MEASURES OF ASSOCIATION BETWEEN TWO VARIABLES
•Covariance
•Interpretation of the Covariance
•Correlation Coefficient
•Interpretation of the Correlation Coefficient
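•Covariance: For a sample of size n with the observations $(x_1, y_1)$, $(x_2, y_2)$, and so on, the sample covariance and population covariance are defined as follows:
Sample covariance: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
Population covariance: $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$
The covariance is used to measure the strength of the linear relationship between x and y.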
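A short sketch of both covariance definitions follows: the sample version divides the sum of cross-products by n − 1, the population version by N. The data (weekly number of commercials and sales volume) are illustrative and not taken from the slides, and the function names are assumptions.

```python
def _sum_of_cross_products(x, y):
    """Sum of (x_i - mean_x) * (y_i - mean_y) over all paired observations."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

def sample_covariance(x, y):
    """s_xy: divide by n - 1."""
    if len(x) < 2:
        raise ValueError("need at least 2 observations")
    return _sum_of_cross_products(x, y) / (len(x) - 1)

def population_covariance(x, y):
    """sigma_xy: divide by N (data are treated as the whole population)."""
    return _sum_of_cross_products(x, y) / len(x)

# Illustrative paired data
commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

print("sample covariance:", sample_covariance(commercials, sales))
print("population covariance:", population_covariance(commercials, sales))
```

The sum of cross-products is 99, so the sample covariance is 99/9 = 11. The positive sign indicates that the two variables tend to increase together.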
•Interpretation of the Covariance: In a scatter diagram, a vertical line at $\bar{x}$ and a horizontal line at $\bar{y}$ divide the graph into four quadrants:
Points in quadrant I correspond to $x_i$ greater than $\bar{x}$ and $y_i$ greater than $\bar{y}$
Points in quadrant II correspond to $x_i$ less than $\bar{x}$ and $y_i$ greater than $\bar{y}$
Points in quadrant III correspond to $x_i$ less than $\bar{x}$ and $y_i$ less than $\bar{y}$
Points in quadrant IV correspond to $x_i$ greater than $\bar{x}$ and $y_i$ less than $\bar{y}$
The value of $(x_i - \bar{x})(y_i - \bar{y})$ must be:
- positive for points in quadrant I
- negative for points in quadrant II
- positive for points in quadrant III
- negative for points in quadrant IV
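The quadrant argument can be traced point by point. The sketch below uses four made-up points, one in each quadrant, and prints the sign of each cross-product $(x_i - \bar{x})(y_i - \bar{y})$. The helper `quadrant` is hypothetical.

```python
def quadrant(xi, yi, x_bar, y_bar):
    """Classify a point relative to the lines x = x_bar and y = y_bar."""
    dx, dy = xi - x_bar, yi - y_bar
    if dx == 0 or dy == 0:
        return "boundary"  # the point lies on one of the dividing lines
    if dx > 0 and dy > 0:
        return "I"
    if dx < 0 and dy > 0:
        return "II"
    if dx < 0 and dy < 0:
        return "III"
    return "IV"

# Hypothetical data: both means equal 3
x = [1, 2, 4, 5]
y = [1, 4, 2, 5]
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

quadrants = [quadrant(xi, yi, x_bar, y_bar) for xi, yi in zip(x, y)]
products = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]

for point, q, p in zip(zip(x, y), quadrants, products):
    print(f"{point}: quadrant {q}, product {p:+.1f}")
print("sample covariance:", sum(products) / (len(x) - 1))
```

The points in quadrants I and III contribute +4 each and those in II and IV contribute −1 each, so the positive contributions dominate and the sample covariance is positive (6/3 = 2).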
•Correlation Coefficient: Pearson product moment correlation coefficient:
Sample: $r_{xy} = \frac{s_{xy}}{s_x s_y}$
Population: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
The sample correlation coefficient $r_{xy}$ is a point estimator of the population correlation coefficient $\rho_{xy}$.
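The sample correlation coefficient can be sketched as the sample covariance divided by the product of the two sample standard deviations, $r_{xy} = s_{xy} / (s_x s_y)$. The data and the function name are illustrative assumptions.

```python
from statistics import mean, stdev

def sample_correlation(x, y):
    """Pearson sample correlation: r_xy = s_xy / (s_x * s_y)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    return s_xy / (stdev(x) * stdev(y))

# Illustrative paired data
commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

r = sample_correlation(commercials, sales)
print(f"r_xy = {r:.4f}")
```

For these data r is roughly 0.93, a strong positive linear relationship. Unlike the covariance, this value does not depend on the units of x and y.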
•Interpretation of the Correlation Coefficient:
In general, it can be shown that if all the points in a data set fall on a positively
sloped straight line, the value of the sample correlation coefficient is +1; that is, a
sample correlation coefficient of +1 corresponds to a perfect positive linear
relationship between x and y.
Moreover, if the points in the data set fall on a straight line having negative slope,
the value of the sample correlation coefficient is −1; that is, a sample correlation
coefficient of −1 corresponds to a perfect negative linear relationship between x
and y.
Note that correlation provides a measure of linear association and not necessarily causation.
A high correlation between two variables does not mean that changes in one
variable will cause changes in the other variable.
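The limiting cases can be checked numerically. In the sketch below (made-up data, hypothetical function name), points on a positively sloped line give r = +1, points on a negatively sloped line give r = −1, and a perfectly determined but nonlinear relationship (y = x²) gives r = 0, which illustrates that the coefficient measures linear association only.

```python
import math

def sample_correlation(x, y):
    """Pearson r computed from sums of squares and cross-products."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

x = [1, 2, 3, 4, 5]
increasing = [2 * xi + 1 for xi in x]    # all points on a line with slope +2
decreasing = [10 - 3 * xi for xi in x]   # all points on a line with slope -3

x_sym = [-2, -1, 0, 1, 2]
parabola = [xi ** 2 for xi in x_sym]     # exact, but nonlinear, relationship

print("positive slope:", sample_correlation(x, increasing))
print("negative slope:", sample_correlation(x, decreasing))
print("parabola:", sample_correlation(x_sym, parabola))
```

The third case shows that r = 0 does not mean the variables are unrelated, only that there is no linear relationship between them.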
End of file 3.
Any questions?
