Environmental Protection

Environmental Statistics

Glossary of Statistical Information

This glossary is by no means definitive: it offers a few terms used in statistics work, with brief descriptions, that may be of use to readers of the e-Digest.

A

Aggregated data are data about a collection of people rather than about individuals.

The all-commodities price ratio is the weighted mean of all the price ratios for the commodities under consideration, using as weights the total expenditure on each commodity in the previous year.
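As a rough illustration of the weighted mean described above, the sketch below combines three price ratios using the previous year's expenditure as weights. The function name and all figures are hypothetical and chosen only for the example.

```python
def all_commodities_price_ratio(price_ratios, expenditures):
    """Weighted mean of price ratios, weighted by previous-year expenditure."""
    if len(price_ratios) != len(expenditures):
        raise ValueError("need one expenditure weight per price ratio")
    total_weight = sum(expenditures)
    weighted_sum = sum(r * w for r, w in zip(price_ratios, expenditures))
    return weighted_sum / total_weight


# Hypothetical figures: price ratio (this year / last year) for three
# commodities, and last year's total expenditure on each.
ratios = [1.10, 1.02, 0.95]
spend = [200.0, 500.0, 300.0]

overall = all_commodities_price_ratio(ratios, spend)
print(round(overall, 3))
```

The commodity with the largest expenditure pulls the overall ratio most strongly towards its own price ratio.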

The alternative hypothesis H1 represents the conclusion that is reached when the null hypothesis is rejected.

Analysis of variance is a statistical technique which may be used for making many simultaneous comparisons (e.g. to compare the complete set of data resulting from a whole range of different experimental treatments but involving only a single control group).

The word average is often used instead of measure of location. Thus the median, mean, mid-range and mid interquartile range are all types of averages.

B

The base date of an index is the starting date from which the index is calculated; it is the date at which the value of the index is set at 100.
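A minimal sketch of how a series of values might be turned into an index with a chosen base date is given below. The function name and the prices are hypothetical; the only point illustrated is that the value at the base date becomes 100.

```python
def index_series(values, base_position=0):
    """Rescale a series so that the value at the base date equals 100."""
    base_value = values[base_position]
    return [100.0 * value / base_value for value in values]


# Hypothetical prices at four successive dates; the first is the base date.
prices = [40.0, 42.0, 45.0, 50.0]
index = index_series(prices)
print(index)
```

Choosing a different base_position changes every index value, but not the ratios between them.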

A batch is a collection of data values.

The batch size is the number of data values in a batch.

Bias is the name given to an error (other than sampling error) introduced by using a poor sampling scheme.

C

A relationship is causal if a change in one variable (the explanatory variable) produces a change in the other variable (the dependent variable).

A census is a process which obtains information about a complete target population.

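A chained index is a measure of change, for example in prices, in which the change over each period is calculated relative to the preceding period and the successive changes are linked (chained) together to give the change since the base date.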

Cleaning the data is the process of dealing with omissions in a batch of data, ensuring that measurements are given in the same units and to the same reasonable level of accuracy etc.

Cluster sampling is a method of random sampling which restricts the sample to a limited number of geographical areas. First, a limited number of suitable geographical areas are selected. Then, for each of these geographical areas, a subsample is chosen from the members of the target population (cluster) in that area. Finally, the subsamples are combined to give a cluster sample.
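The three steps of cluster sampling can be sketched as follows. This is only an illustration: the function name, the area names and the household labels are all hypothetical, and both stages use simple random selection, which is one possible choice.

```python
import random


def cluster_sample(members_by_area, n_areas, n_per_area, seed=None):
    """Pick n_areas areas at random, then a random subsample within each."""
    rng = random.Random(seed)
    # Step 1: select a limited number of geographical areas.
    chosen_areas = rng.sample(sorted(members_by_area), n_areas)
    # Step 2: choose a subsample from the cluster in each selected area.
    sample = []
    for area in chosen_areas:
        cluster = members_by_area[area]
        size = min(n_per_area, len(cluster))
        sample.extend(rng.sample(cluster, size))
    # Step 3: the combined subsamples form the cluster sample.
    return chosen_areas, sample


# Hypothetical population: five areas with 20 numbered households each.
households = {
    area: [f"{area}-{i}" for i in range(20)]
    for area in ["North", "South", "East", "West", "Central"]
}
areas, sample = cluster_sample(households, n_areas=2, n_per_area=5, seed=1)
print(areas, len(sample))
```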

In order to control for a specific factor the sample data are selected in such a way as to eliminate the effects due to that particular factor.

A control group is the group that does not receive the treatment, when a comparison is made between a group that does and a group that does not receive some form of treatment.

A correlation coefficient is a number, lying between –1 and +1, which summarises the degree of relationship between two variables.
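The glossary does not say which correlation coefficient is meant. As an assumption, the sketch below computes the commonly used Pearson product-moment coefficient for two hypothetical batches of linked values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length batches."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two batches of equal size, at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a batch has no spread")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: y rises exactly in step with x, then falls exactly.
rising = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
falling = pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])
print(round(rising, 6), round(falling, 6))
```

Values near +1 or –1 indicate a strong straight-line relationship; values near 0 indicate little or none.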

A crossover design is the design of an experiment in which during the course of the experiment each subject crosses over from receiving one treatment to receiving the other.

D

The dependent variable is the variable being explained when investigating the relationship between two variables. It is often desirable to explain the values which are taken by one variable in terms of the values taken by the other variable.

E

Ecological fallacy is the mistake of falsely assuming that a relationship observed in aggregated data is also present in the individual data from which the aggregated data are produced.

A method of sampling which reduces sampling error is said to be efficient.

Epidemiology is the study of the causes of diseases by observing human populations. Techniques of epidemiology include prospective studies and retrospective studies.

An estimate is an indication of the value of an unknown quantity based on observed data. More formally, an estimate is the particular value of an estimator that is obtained from a particular sample of data and used to indicate the value of a parameter.

Estimation is the process by which sample data are used to indicate the value of an unknown quantity in a population. Results of estimation can be expressed as a single value, known as a point estimate; or a range of values, known as a confidence interval.
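The difference between a point estimate and a confidence interval can be illustrated with a short sketch. It assumes a normal approximation with the multiplier 1.96 for an approximate 95% interval; for a sample as small as the hypothetical one used here, a multiplier from the t-distribution would normally be preferred.

```python
import math
import statistics


def estimate_mean(sample, z=1.96):
    """Return a point estimate of the mean and an approximate interval."""
    n = len(sample)
    point = statistics.mean(sample)
    # Standard error of the mean, using the sample standard deviation.
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    margin = z * standard_error
    return point, (point - margin, point + margin)


# Hypothetical sample of eight measurements.
readings = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
point, (lower, upper) = estimate_mean(readings)
print(round(point, 2), round(lower, 2), round(upper, 2))
```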

An experiment is any process or study which results in the collection of data, the outcome of which is unknown. In statistics, the term is usually restricted to situations in which the researcher has control over some of the conditions under which the experiment takes place.

An experimental (or sampling) unit is a person, animal, plant or thing which is actually studied by a researcher; such units are the basic objects upon which the study or experiment is carried out. Examples include a person, a monkey, a sample of soil, a pot of seedlings, a postcode area and a doctor's practice.

The explanatory variable is the variable doing the explaining when investigating the relationship between two variables. It is often desirable to explain the values which are taken by one variable in terms of the values taken by the other variable. The explanatory variable is sometimes called the independent variable.

G

A group-comparative design is the design of an experiment in which the subjects are simply divided into two groups which are both representative of the population being studied. Then one treatment is given to the subjects in one group and the other treatment is given to those in the other group.

H

Historical fallacy is the mistake of falsely assuming that a relationship observed in cross-sectional data will be present in similar longitudinal data, or vice versa.

A hypothesis test is a method of inferring back from sample data to the population as a whole by deciding between a null hypothesis and an alternative hypothesis.

A hypothesis testing experiment is an experiment which tests whether the things that the hypothesis predicts do actually happen.

I

Index-linking is a process used to safeguard the value of money held or received in savings or pensions.

The interquartile range is a measure of spread of a batch; it is the distance between the lower and upper quartiles.

Measurements on an interval scale are actual quantities in definite units, such as length in metres, height in centimetres, price in pounds, etc., for which differences between the measurements can be compared.

L

Linked data is the term used to describe data from two batches where each value in one batch is naturally linked with a unique value in the other batch (see pairing).

The location or level of a batch of data is the average or centre of the batch. See also measure of location.

Longitudinal data are data about a relatively small number of individuals collected over a period of time.

M

A matched pairs design is the design of an experiment in which the subjects are matched in pairs according to factors relevant to what is being measured in the experiment. Then one treatment is given to one member of the pair and the other treatment is given to the other member.

The mean is a measure of location of a batch; it is given by the sum of all the data values in the batch, divided by the batch size.

The median is a measure of location of a batch; if the batch size is odd then the median is the middle value of the ordered batch; if the batch size is even then the median is halfway between the middle two values.

The mid-range is a measure of location of a batch; it is the point halfway between the lower and upper extremes.
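The three measures of location defined above (mean, median and mid-range) can be compared on a single batch. The sketch below follows the definitions as given; the function names and the batch values are hypothetical.

```python
def mean(batch):
    """Sum of the data values divided by the batch size."""
    return sum(batch) / len(batch)


def median(batch):
    """Middle value of the ordered batch, or halfway between the middle two."""
    ordered = sorted(batch)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


def mid_range(batch):
    """Point halfway between the lower and upper extremes."""
    return (min(batch) + max(batch)) / 2


# Hypothetical batch of five values (odd batch size).
batch = [3, 7, 8, 12, 20]
print(mean(batch), median(batch), mid_range(batch))

# Dropping the largest value gives an even batch size.
print(median(batch[:4]))
```

The mid-range depends only on the two extremes, so a single outlier moves it far more than it moves the median.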

N

The null hypothesis H0 is an assumption about a population which may or may not be rejected as the result of a hypothesis test.

O

Ordinal data are any data that can be ordered or ranked.

An outlier is a data value which lies a long way away from the main body of data; high outliers lie above the main body of data, and low outliers lie below it.

P

Pairing is a method of control whereby people or things are selected or matched in pairs so that both members of a pair possess the same characteristics (e.g. gender, age, etc). Pairing gives rise to ‘before’ and ‘after’ data and data in the form of matched pairs.

A parameter is a value, usually unknown (and which therefore has to be estimated), used to represent a certain population characteristic. For example, the population mean is a parameter that is often used to indicate the average value of a quantity. Within a population, a parameter is a fixed value which does not vary. Each sample drawn from the population has its own value of any statistic that is used to estimate this parameter. For example, the mean of the data in a sample is used to give information about the overall mean in the population from which that sample was drawn.

A population is any entire collection of people, animals, plants or things from which we may collect data. It is the entire group we are interested in, which we wish to describe or draw conclusions about. In order to make any generalisations about a population, a sample that is meant to be representative of the population is often studied. For each population there are many possible samples. A sample statistic gives information about a corresponding population parameter. For example, the sample mean for a set of data would give information about the overall population mean. It is important that the investigator carefully and completely defines the population before collecting the sample, including a description of the members to be included. Example: The population for a study of infant health might be all children born in the U.K. in the 1980s. The sample might be all babies born on 7th May in any of those years.

Probability is a measure of the chance of an event occurring. The probability of selecting at random a person with a particular property from a population is simply the proportion of people in the population with that property.
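The idea of probability as a proportion can be shown with a small sketch. The population of 20 people and the property "left-handed" are hypothetical, as is the function name.

```python
from fractions import Fraction


def probability_of(population, has_property):
    """Proportion of the population for which has_property is true."""
    matching = sum(1 for person in population if has_property(person))
    return Fraction(matching, len(population))


# Hypothetical population of 20 people, 3 of whom are left-handed.
people = [{"id": i, "left_handed": i < 3} for i in range(20)]
p = probability_of(people, lambda person: person["left_handed"])
print(p, float(p))
```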

A prospective study is, for example, an epidemiological study in which a potential cause of a disease is investigated by finding two groups of people, one of which is exposed to the potential cause and the other of which is not. The two groups are then followed up for some time to see if one group suffers more from the disease than the other.

A pseudo experiment is a study in which groups are selected after the decision as to which patients should be treated in a particular way, for example with a particular drug.

The purchasing power (in pence) of the pound at date A compared with date B measures how much a consumer can buy with a fixed amount of money at date A compared with date B.
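The glossary gives no formula for purchasing power. One common way of calculating it, assumed here, is to divide the value of a price index at date B by its value at date A and multiply by 100 to express the result in pence. The index values below are hypothetical.

```python
def purchasing_power_pence(index_at_a, index_at_b):
    """Purchasing power of the pound at date A compared with date B, in pence.

    Assumes the ratio (index at B) / (index at A) x 100; this formula is an
    assumption and is not stated in the glossary itself.
    """
    return 100.0 * index_at_b / index_at_a


# Hypothetical price index: 100 at date B, risen to 125 by date A.
pence = purchasing_power_pence(index_at_a=125.0, index_at_b=100.0)
print(pence)
```

On these figures a pound at date A buys what 80 pence bought at date B.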

Q

The quartiles of a batch cut off the top and bottom 25% of the data values; the lower quartile Q1 separates off the bottom 25%, the upper quartile Q3 separates off the top 25%.
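Several slightly different conventions exist for locating quartiles in a small batch, and the glossary does not specify one. The sketch below uses the default method of Python's statistics.quantiles as one possible choice, and also reports the interquartile range, which is the distance between the lower and upper quartiles. The batch values are hypothetical.

```python
import statistics


def quartile_summary(batch):
    """Return the lower quartile, upper quartile and interquartile range."""
    # n=4 splits the batch into four equal parts and returns three cut
    # points: the lower quartile, the median and the upper quartile.
    lower, _median, upper = statistics.quantiles(batch, n=4)
    return lower, upper, upper - lower


# Hypothetical batch of seven values.
batch = [1, 2, 3, 4, 5, 6, 7]
lower_q, upper_q, iqr = quartile_summary(batch)
print(lower_q, upper_q, iqr)
```

Other conventions may give slightly different quartiles for the same batch, particularly when the batch size is small.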

Quota sampling is a method of sampling in which interviewers are allocated a quota of interviews to achieve. It is frequently used for market research surveys and opinion polls. Although the selection of individuals is haphazard, random sampling is not involved.

R

A random number table is a table of numbers in which there is no apparent pattern. The numbers are used in the selection of random samples and in designing clinical trials and other experiments.

Random sampling is a method of selecting a sample from a target population using random methods (e.g. a die or a random number table) which produces a sample which is close to the ideal representative sample.
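In practice a computer's pseudo-random number generator often takes the place of a die or a random number table. The following sketch draws a simple random sample, without replacement, from a hypothetical list of 100 people; the function name and the seed are assumptions made for the example.

```python
import random


def simple_random_sample(sampling_frame, size, seed=None):
    """Select `size` distinct members of the sampling frame at random."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(sampling_frame, size)


# Hypothetical list of 100 individuals in the target population.
frame = [f"person-{i}" for i in range(1, 101)]
chosen = simple_random_sample(frame, size=10, seed=42)
print(chosen)
```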

The Retail Prices Index (RPI) is the main measure used in this country to record changes in the level of the prices most people pay for the goods and services they buy.

A retrospective study is, for example, an epidemiological study in which potential causes of a disease are investigated by finding two groups of people, one suffering from the disease and the other not. The history of the people in the groups is then investigated to identify differences in their exposure to different potential causes.

S

A sample is a group of units selected from a larger group (the population). By studying the sample it is hoped to draw valid conclusions about the larger group. A sample is generally selected for study because the population is too large to study in its entirety. The sample should be representative of the general population. This is often best achieved by random sampling. Also, before collecting the sample, it is important that the researcher carefully and completely defines the population, including a description of the members to be included.

A sampling frame is a list of all the individuals in a target population.

The spread or scatter of a batch is the pattern of the data values about the centre of location. It is pictured by a stem plot or represented by a measure of spread.

Spurious accuracy is introduced if, when carrying out a calculation, many more places of decimals are included in the answer than there were in the original data, so the answer looks more accurate than it really is.

The standard deviation is a measure of spread of a batch, based on averaging the distances of the data values from the mean of the batch.
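One common form of the standard deviation is the square root of the average squared distance of the data values from the mean. The sketch below assumes this form, dividing by the batch size; when a sample is used to estimate the spread of a population, the divisor is often taken as the batch size minus one instead. The batch values are hypothetical.

```python
import math


def standard_deviation(batch):
    """Square root of the mean squared distance from the batch mean."""
    n = len(batch)
    centre = sum(batch) / n
    mean_square = sum((value - centre) ** 2 for value in batch) / n
    return math.sqrt(mean_square)


# Hypothetical batch with mean 5.
batch = [2, 4, 4, 4, 5, 5, 7, 9]
spread = standard_deviation(batch)
print(spread)
```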

Statistical inference makes use of information from a sample to draw conclusions (inferences) about the population from which the sample was taken.

Two events are said to be statistically independent if the occurrence of one has no effect on the likelihood of occurrence of the other.

Stratified sampling is a method of random sampling. First, the target population is categorised into strata. Then, subsamples are selected from each of these strata using simple random sampling or systematic random sampling. The subsamples are then combined to form a stratified sample which is representative of the population with respect to the strata.
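The steps of stratified sampling can be sketched as follows. The sketch assumes that the same fraction is sampled from every stratum (proportional allocation) and that simple random sampling is used within each stratum; the function name and the urban/rural population are hypothetical.

```python
import random


def stratified_sample(population, stratum_of, fraction, seed=None):
    """Sample the same fraction from every stratum and combine the results."""
    rng = random.Random(seed)
    # Step 1: categorise the target population into strata.
    strata = {}
    for member in population:
        strata.setdefault(stratum_of(member), []).append(member)
    # Step 2: draw a simple random subsample from each stratum.
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    # Step 3: the combined subsamples form the stratified sample.
    return sample


# Hypothetical population: 60 urban and 40 rural households.
population = [("urban", i) for i in range(60)] + [("rural", i) for i in range(40)]
sample = stratified_sample(population, lambda m: m[0], fraction=0.1, seed=7)
urban = sum(1 for stratum, _ in sample if stratum == "urban")
rural = sum(1 for stratum, _ in sample if stratum == "rural")
print(urban, rural)
```

Because each stratum contributes in proportion to its size, the sample has the same urban to rural split as the population.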

A systematic error is an error which is consistent in its direction and approximately constant in its magnitude.

T

A target population, often just referred to as the population, consists of all the individual members of a specified population of interest.

A test statistic is a summary measure, calculated from the data in a random sample which indicates how extreme that random sample would be if the null hypothesis were true.

U

Two samples are unrelated if the way in which one sample is selected does not influence the way in which the other sample is selected, for example when the samples are not matched in any way.

Page last modified: 27 October 2006
Page published: 27 October 2006

Department for Environment, Food and Rural Affairs