Glossary

Outliers

Point outlier:

  • A point outlier is a datum that behaves unusually at a specific time instant when compared either to the other values in the time series (global outlier) or to its neighboring points (local outlier).1

Subsequences:

  • This term refers to consecutive points in time whose joint behavior is unusual, although each observation individually is not necessarily a point outlier.1

Outlier Estimation and methods:

  • If outliers are identified using past, current, and future data, the approach is called an estimation method. If they are identified using only past data, it is called a prediction method.

  • Some methods:

    1. Descriptive statistics:

      • Finding the maximum and minimum of the data. For example, if the data are student grades with a maximum possible value of 100, but a data-entry mistake produces a maximum of 1000, this can be found by retrieving the range of the data.
      • Histograms and boxplots can be used to visualise outliers. In a boxplot, all observations outside the interval given by the interquartile range (IQR) criterion, $I=[q_{0.25}-1.5\cdot IQR;\ q_{0.75}+1.5\cdot IQR]$, are considered outliers.
      • Percentiles: all observations that lie beyond a percentile of interest are considered outliers.
    2. Statistical tests: These tests require the data to be normally distributed. This can be checked either by visualising the data with a histogram or by using the Shapiro-Wilk normality test, shapiro.test() in R.

      • Grubbs's test: detects whether the highest or lowest value in a dataset is an outlier.
      • Dixon's test: tests whether a particular value is an outlier.
      • Rosner's test: detects several outliers at once.
      • Z-scores: if the data are normally distributed, observations are classified as outliers based on their z-score.
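
Two of the criteria listed above, the IQR interval and the z-score, can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the R workflow referred to in this note. The function names, the grades data, and the z-score threshold of 3 are assumptions made for the example.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [q1 - k*IQR, q3 + k*IQR]."""
    # "inclusive" treats the data as the full population when interpolating
    # quartiles; other quantile definitions give slightly different fences.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    # Assumes the data are roughly normal; the threshold is a convention,
    # not a value taken from this note.
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]


# Hypothetical grades with one data-entry error (1000 instead of 100).
grades = [55, 61, 64, 68, 70, 72, 75, 78, 81, 85, 90, 1000]
print(iqr_outliers(grades))
print(zscore_outliers(grades))
```

Note that a single extreme value inflates both the mean and the standard deviation, so the z-score criterion is less sensitive than the IQR criterion in small samples.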

Hampel Filter

This is also part of descriptive statistics. It considers values outside the interval $I=[\text{median}-3\cdot MAD,\ \text{median}+3\cdot MAD]$ as outliers. For the definition of MAD (median absolute deviation), see MAD.
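
The interval above can be computed directly. The following is a minimal sketch in Python (standard library only) that uses the raw, unscaled MAD exactly as in the interval above; some implementations multiply MAD by a consistency constant, which would widen the interval. The function names and the data are hypothetical.

```python
import statistics


def mad(values):
    """Median absolute deviation: median of |x - median(x)| (unscaled)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def hampel_outliers(values, k=3.0):
    """Return values outside [median - k*MAD, median + k*MAD]."""
    med = statistics.median(values)
    spread = mad(values)
    lower, upper = med - k * spread, med + k * spread
    return [v for v in values if v < lower or v > upper]


# Hypothetical grades with one data-entry error.
grades = [55, 61, 64, 68, 70, 72, 75, 78, 81, 85, 90, 1000]
print(mad(grades))
print(hampel_outliers(grades))
```

Because both the median and the MAD are robust to extreme values, the erroneous entry barely shifts the interval, unlike criteria based on the mean and standard deviation.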

Confidence interval

In general, a 95% confidence interval means that if the sampling procedure were repeated many times, about 95% of the resulting intervals would contain the true mean.2 To understand the CI of a sample proportion, the term population proportion is defined first.

A population proportion is the proportion of individuals in a population sharing a certain trait, denoted $p$. The sample proportion is the proportion of individuals in a sample sharing a certain trait, denoted $\hat p$.3

Just as for the CI of a mean, the CI of a proportion is estimated by adding and subtracting the margin of error from $\hat p$ to obtain the limits of the CI.

$$\text{Margin of Error} = z\times\sqrt{\frac{\hat p\times(1-\hat p)}{n}}$$

Here z is the z-score for the 95% confidence level (approximately 1.96) and n is the sample size.4 For multinomial sample proportions, the confidence intervals are often approximated by a single binomial confidence interval; I assume the trait of interest is treated as $\hat p$ while all others together become $(1-\hat p)$. There are also methods to calculate the confidence intervals simultaneously. One such method, Sison-Glaz (sisonglaz), was used in this work through the function MultinomCI from the DescTools package.
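
As a worked illustration of the single-proportion case, the margin of error can be computed as follows. This is a minimal sketch in Python (standard library only) of the normal-approximation interval, not the simultaneous Sison-Glaz method used through MultinomCI. The function name and the counts are hypothetical.

```python
import math
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation CI for a proportion: p_hat +/- margin of error."""
    p_hat = successes / n
    # Two-sided z-score, about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Hypothetical sample: 40 of 100 individuals share the trait of interest.
lower, upper = proportion_ci(40, 100)
print(round(lower, 3), round(upper, 3))
```

The approximation is generally considered unreliable when n is small or when the sample proportion is close to 0 or 1.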

Interpolation

Estimating a new data point from existing data points is called interpolation. Common methods include linear, polynomial, and spline interpolation. Linear interpolation fits a straight line between known points and uses the slope of the line to interpolate the missing data points. Both polynomial and spline interpolation use polynomials. The difference is that spline interpolation fits multiple piecewise polynomials to subsets of the data, whereas polynomial interpolation fits a single polynomial to the entire data set.5
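
The linear case can be sketched as below. This is a minimal illustration in Python, assuming an evenly indexed series in which missing values are marked with None. The function names and the example series are hypothetical.

```python
def linear_interpolate(x0, y0, x1, y1, x):
    """Value at x on the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)


def fill_missing(series):
    """Fill None gaps that lie between two known points, using the index as x.

    Leading or trailing gaps are left as None, since filling them would be
    extrapolation rather than interpolation.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            filled[i] = linear_interpolate(
                left, series[left], right, series[right], i
            )
    return filled


print(fill_missing([10.0, None, None, 16.0, None, 20.0]))
```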

