This Web site is a course in statistics appreciation; i.e., acquiring a feeling for the statistical way of thinking. It contains various useful concepts and topics at many levels of learning statistics for decision making under uncertainties. The cardinal objective for this Web site is to increase the extent to which statistical thinking is merged with managerial thinking for good decision making under uncertainty.
Chapter 1: Towards Statistical Thinking for Decision Making
Chapter 2: Descriptive Sampling Data Analysis
Chapter 3: Probability as a Confidence Measuring Tool for Statistical Inference
Chapter 4: Necessary Conditions for Statistical Decision Making
Chapter 5: Estimators and Their Qualities
Chapter 6: Hypothesis Testing: Rejecting a Claim
Chapter 7: Hypotheses Testing for Means and Proportions
Chapter 8: Tests for Statistical Equality of Two or More Populations
Chapter 9: Applications of the Chi-square Statistic
Chapter 10: Regression Modeling and Analysis
Chapter 11: Unified Views of Statistical Decision Technologies
Chapter 12: Index Numbers and Ratios with Applications
Introduction to Statistical Thinking for Decision Making
This site builds up the basic ideas of business statistics systematically and correctly. It is a combination of lectures and computer-based practice, joining theory firmly with practice. It introduces techniques for summarizing and presenting data, estimation, confidence intervals and hypothesis testing. The presentation focuses more on understanding of key concepts and statistical thinking, and less on formulas and calculations, which can now be done on small computers through user-friendly statistical JavaScript tools. A Spanish version of this site is available at Razonamiento Estadístico para la Toma de Decisiones Gerenciales, together with its collection of JavaScript.
Today's good decisions are driven by data. In all aspects of our lives, and importantly in the business context, an amazing diversity of data is available for inspection and analytical insight. Business managers and professionals are increasingly required to justify decisions on the basis of data. They need statistical model-based decision support systems.
Statistical skills enable them to intelligently collect, analyze and interpret data relevant to their decision-making. Statistical concepts and statistical thinking enable them to:
- solve problems in a diversity of contexts.
- add substance to decisions.
- reduce guesswork.
This Web site is a course in statistics appreciation; i.e., acquiring a feel for the statistical way of thinking. It aims to make sound statistical thinking understandable in business terms. As an introductory course in statistics, it is designed to provide you with the basic concepts and methods of statistical analysis for processes and products. The materials in this Web site are tailored to help you make better decisions and to get you thinking statistically. A cardinal objective for this Web site is to embed statistical thinking in managers, who must often decide with little information.
In a competitive environment, business managers must design quality into products, and into the processes of making the products. They must facilitate a process of never-ending improvement at all stages of manufacturing and service. This is a strategy that employs statistical methods, particularly statistically designed experiments, and produces processes that provide high yield and products that seldom fail. Moreover, it facilitates development of robust products that are insensitive to changes in the environment and to internal component variation. Carefully planned statistical studies remove hindrances to high quality and productivity at every stage of production. This saves time and money. It is well recognized that quality must be engineered into products as early as possible in the design process. One must know how to use carefully planned, cost-effective statistical experiments to improve, optimize and make robust products and processes.
Business Statistics is a science assisting you to make business decisions under uncertainties based on some numerical and measurable scales. Decision making processes must be based on data, not on personal opinion nor on belief.
The Devil is in the Deviations: Variation is inevitable in life! Every process, every measurement, every sample has variation. Managers need to understand variation for two key reasons. First, so that they can lead others to apply statistical thinking in day-to-day activities and secondly, to apply the concept for the purpose of continuous improvement. This course will provide you with hands-on experience to promote the use of statistical thinking and techniques to apply them to make educated decisions, whenever you encounter variation in business data. You will learn techniques to intelligently assess and manage the risks inherent in decision-making. Therefore, remember that:
Just like weather, if you cannot control something, you should learn how to measure and analyze it, in order to predict it, effectively.
If you have taken statistics before, and have a feeling of inability to grasp concepts, it may be largely due to your former non-statistician instructors teaching statistics. Their deficiencies lead students to develop phobias for the sweet science of statistics. In this respect, Professor Herman Chernoff (1996) made the following remark:
Inadequate statistical teaching during university education leads, even after graduation, to one or a combination of the following scenarios:
Plugging numbers into the formulas and crunching them have no value by themselves. You should continue to put effort into the concepts and concentrate on interpreting the results.
Even when you solve a small size problem by hand, I would like you to use the available computer software and Web-based computation to do the dirty work for you.
You must be able to read the logic behind any formula rather than memorize it. For example, in computing the variance, consider its formula. Instead of memorizing it, start by asking why:
i. Why do we square the deviations from the mean?
Because if we add up all the deviations, we always get zero. To deal with this problem, we square the deviations. Why not raise them to the power of four (three will not work, since a cube keeps the sign of the deviation)? Squaring does the trick; why should we make life more complicated than it is? Notice also that squaring magnifies the large deviations; therefore it works to our advantage in measuring the quality of the data.
ii. Why is there a summation notation in the formula?
To add up the squared deviation of each data point and so compute the total sum of squared deviations.
iii. Why do we divide the sum of squares by n-1?
The amount of deviation should also reflect how large the sample is, so we must bring in the sample size. In general, larger samples have a larger sum of squared deviations from the mean. Why n-1 and not n? When you divide by n-1, the sample variance is, on average, closer to the population variance than when you divide by n. For a large sample size n (say over 30), it hardly matters whether you divide by n or n-1; the results are almost the same, and both are acceptable. The factor n-1 is what we call the "degrees of freedom".
This example shows how to question statistical formulas rather than memorize them. In fact, when you try to understand the formulas, you do not need to remember them; they become part of your brain connectivity. Clear thinking is always more important than the ability to do arithmetic.
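The three "why" questions above can be checked numerically. The following is a minimal sketch in Python, using a small hypothetical data set chosen only for illustration; it is not part of the site's own JavaScript tools.

```python
from statistics import mean, pvariance, variance

# Hypothetical sample, chosen only for illustration.
data = [4.0, 7.0, 9.0, 10.0, 15.0]

n = len(data)
xbar = mean(data)                      # sample mean
deviations = [x - xbar for x in data]  # deviations from the mean

# (i) Raw deviations cancel out: their sum is zero (up to rounding).
sum_dev = sum(deviations)

# (ii) Squared deviations do not cancel; the summation adds them all up.
sum_sq = sum(d ** 2 for d in deviations)

# (iii) Divide by n - 1 (the degrees of freedom) or by n.
var_n_minus_1 = sum_sq / (n - 1)
var_n = sum_sq / n

print("sum of deviations:", sum_dev)
print("sum of squared deviations:", sum_sq)
print("variance with n - 1:", var_n_minus_1)
print("variance with n:", var_n)

# The standard library follows the same two conventions:
# variance() divides by n - 1, pvariance() divides by n.
print("statistics.variance:", variance(data))
print("statistics.pvariance:", pvariance(data))
```

Trying this with larger data sets shows the two divisors giving nearly the same answer, which is the point made about large n.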
When you look at a statistical formula, the formula should talk to you, just as a musician who looks at a sheet of music hears the music.
Computer-assisted learning: Computer-assisted learning provides you with a "hands-on" experience which will enhance your understanding of the concepts and techniques covered in this site.
JavaScript, once an esoteric programming language for animating Web pages, is now a full-fledged platform for building the E-labs' learning objects with useful applications. Just as you used to do experiments in physics labs to learn physics, computer-assisted learning enables you to use any online interactive tool available on the Internet to perform experiments. The purpose is the same; i.e., to understand statistical concepts by using statistical applets, which are both entertaining and educational.
The appearance of computer software, JavaScript, Statistical Demonstration Applets, and Online Computation is among the most important events in the process of teaching and learning concepts in model-based statistical decision-making courses. These e-lab technologies allow you to construct numerical examples to understand the concepts, and to find their significance for yourself.
Unfortunately, most classroom courses are not learning systems. The way the instructors attempt to help their students acquire skills and knowledge has little to do with the way students actually learn. Many instructors rely on lectures, tests, and memorization. All too often, they rely on "telling." No one remembers much that is taught by telling, and what is told does not translate into usable skills. We learn by doing, failing, and practicing until we do it right. Computer-assisted learning serves this purpose.
A course in appreciation of statistical thinking gives business professionals an edge. Professionals with strong quantitative skills are in demand. This demand will grow as the impetus for data-based decisions strengthens and the amount and availability of data increase. The statistical toolkit can be developed and enhanced at all stages of a career. The decision-making process under uncertainty is largely based on the application of statistics for probability assessment of uncontrollable events (or factors), as well as risk assessment of your decision. For the foundation of decision making, visit the Operations/Operational Research site. For more statistics-based Web sites with decision-making applications, visit the Decision Science Resources and the Modeling and Simulation Resources sites. The main objective for this course is to learn statistical thinking; to place more emphasis on concepts and less on theory and recipes; and finally to foster active learning using useful and interesting Web sites. It has been said that "Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write." So, let's be ahead of our time.
Further Readings:
Chernoff H., A Conversation With Herman Chernoff, Statistical Science, Vol. 11, No. 4, 335-350, 1996.
Churchman C., The Design of Inquiring Systems, Basic Books, New York, 1971. Early in the book he stated that knowledge could be considered as a collection of information, or as an activity, or as a potential. He also noted that knowledge resides in the user and not in the collection.
Rustagi M., et al. (eds.), Recent Advances in Statistics: Papers in Honor of Herman Chernoff on His Sixtieth Birthday, Academic Press, 1983.
The Birth of Probability and Statistics
The original idea of "statistics" was the collection of information about and for the "state". The word statistics derives directly, not from any classical Greek or Latin roots, but from the Italian word for state.
The birth of statistics occurred in the mid-17th century. A commoner named John Graunt, who was a native of London, began reviewing a weekly church publication issued by the local parish clerk that listed the number of births, christenings, and deaths in each parish. These so-called Bills of Mortality also listed the causes of death. Graunt, who was a shopkeeper, organized this data in the form we call descriptive statistics, which was published as Natural and Political Observations Made upon the Bills of Mortality. Shortly thereafter he was elected as a member of the Royal Society. Thus, statistics had to borrow some concepts from sociology, such as the concept of Population. It has been argued that since statistics usually involves the study of human behavior, it cannot claim the precision of the physical sciences.
Probability has a much longer history. Probability is derived from the verb to probe, meaning to "find out" what is not too easily accessible or understandable. The word "proof" has the same origin; a proof provides the necessary details to understand what is claimed to be true.
Probability originated from the study of games of chance and gambling during the 16th century. Probability theory was a branch of mathematics studied by Blaise Pascal and Pierre de Fermat in the seventeenth century. Currently, in the 21st century, probabilistic modeling is used to control the flow of traffic through a highway system, a telephone interchange, or a computer processor; to find the genetic makeup of individuals or populations; and in quality control, insurance, investment, and other sectors of business and industry.
New and ever growing diverse fields of human activities are using statistics; however, it seems that this field itself remains obscure to the public. Professor Bradley Efron expressed this fact nicely:
Statistical Modeling for Decision-Making under Uncertainties:
In this diverse world of ours, no two things are exactly the same. A statistician is interested in both the differences and the similarities; i.e., both departures and patterns.
From Data to the Instrumental Knowledge
The actuarial tables published by insurance companies reflect their statistical analysis of the average life expectancy of men and women at any given age. From these numbers, the insurance companies then calculate the appropriate premiums for a particular individual to purchase a given amount of insurance.
Exploratory analysis of data makes use of numerical and graphical techniques to study patterns and departures from patterns. The widely used descriptive statistical techniques are: Frequency Distribution; Histograms; Boxplot; Scattergrams and Error Bar plots; and diagnostic plots.
In examining distribution of data, you should be able to detect important characteristics, such as shape, location, variability, and unusual values. From careful observations of patterns in data, you can generate conjectures about relationships among variables. The notion of how one variable may be associated with another permeates almost all of statistics, from simple comparisons of proportions through linear regression. The difference between association and causation must accompany this conceptual development.
Data must be collected according to a well-developed plan if valid information on a conjecture is to be obtained. The plan must identify important variables related to the conjecture, and specify how they are to be measured. From the data collection plan, a statistical model can be formulated from which inferences can be drawn.
As an example of statistical modeling with managerial implications, such as "what-if" analysis, consider regression analysis. Regression analysis is a powerful technique for studying the relationship between dependent variables (i.e., output, performance measure) and independent variables (i.e., inputs, factors, decision variables). Summarizing the relationships among the variables by the most appropriate equation (i.e., modeling) allows us to predict the output, to identify the most influential factors, and to study the impact on the output of any changes in their current values.
Frequently, marketing managers, for example, are faced with the question: What sample size do I need? This is an important and common statistical decision that deserves due consideration, since an inadequate sample size invariably leads to wasted resources. The sample size determination section provides a practical solution to this risky decision.
Statistical models are currently used in various fields of business and science. However, the terminology differs from field to field. For example, calibration, history matching, and data assimilation all refer to the fitting of models to data and are synonymous with parameter estimation.
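To make the "what-if" idea concrete, here is a minimal sketch of a simple linear regression with one input and one output, written in Python. The data (advertising spend and sales), the helper names `fit_line` and `predict`, and the use of ordinary least squares are illustrative assumptions, not taken from the course materials.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope


# Hypothetical data: advertising spend (input) and sales (output).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_line(ad_spend, sales)


def predict(x_new):
    """Predicted output for a given value of the input."""
    return intercept + slope * x_new


# "What-if" question: what output does the model predict
# if the input is raised to 6.0?
what_if = predict(6.0)

print("intercept:", round(intercept, 4))
print("slope:", round(slope, 4))
print("predicted sales at spend 6.0:", round(what_if, 4))
```

The slope is the estimated change in the output per unit change in the input, which is what a manager would examine when asking which factor is most influential. A prediction outside the observed range of inputs, as here, should be treated with extra caution.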
Your organization database contains a wealth of information, yet the decision technology group members tap a fraction of it. Employees waste time scouring multiple sources for a database. The decision-makers are frustrated because they cannot get business-critical data exactly when they need it. Therefore, too many decisions are based on guesswork, not facts. Many opportunities are also missed, if they are even noticed at all.
Knowledge is what we know well. Information is the communication of knowledge. In every knowledge exchange, there is a sender and a receiver. The sender makes common what is private; the sender does the informing, the communicating. Information can be classified into explicit and tacit forms. Explicit information can be explained in structured form, while tacit information is inconsistent and fuzzy to explain.
Data are only crude information and not knowledge by themselves. The sequence from data to knowledge is: from Data to Information, from Information to Facts, and finally, from Facts to Knowledge. Data become information when they become relevant to your decision problem. Information becomes fact when the data can support it. Facts are what the data reveal. However, the decisive instrumental (i.e., applied) knowledge is expressed together with some statistical degree of confidence.
Fact becomes knowledge, when it is used in the successful completion of a decision process. Once you have a massive amount of facts integrated as knowledge, then your mind will be superhuman in the same sense that mankind with writing is superhuman compared to mankind before writing. The following figure illustrates the statistical thinking process based on data in constructing statistical models for decision making under uncertainties.
The Path from Statistical Data to Managerial Knowledge
The above figure depicts the fact that as the exactness of a statistical model increases, the level of improvements in decision-making increases. That's why we need Business Statistics. Statistics arose from the need to place knowledge on a systematic evidence base. This required a study of the rules of computational probability, the development of measures of data properties and relationships, and so on.
Statistical inference aims at determining whether any statistical significance can be attached to results after due allowance is made for random variation as a source of error. Intelligent and critical inferences cannot be made by those who do not understand the purpose, the conditions, and the applicability of the various techniques for judging significance.
Considering the uncertain environment, the chance that "good decisions" are made increases with the availability of "good information." The chance that "good information" is available increases with the level of structuring of the process of Knowledge Management.
Knowledge is more than knowing something technical. Knowledge needs wisdom. Wisdom is the power to put our time and our knowledge to the proper use. Wisdom comes with age and experience. Wisdom is the accurate application of accurate knowledge, and its key component is knowing the limits of your knowledge. Wisdom is about knowing how something technical can be best used to meet the needs of the decision-maker. Wisdom, for example, creates statistical software that is useful, rather than technically brilliant. Ever since the Web entered the popular consciousness, observers have noted that it puts information at your fingertips but tends to keep wisdom out of reach.
The notion of "wisdom" in the sense of practical wisdom has entered Western civilization through biblical texts. In the Hellenic experience this kind of wisdom received a more structural character in the form of philosophy. In this sense philosophy also reflects one of the expressions of traditional wisdom.
Business professionals need a statistical toolkit. Statistical skills enable you to intelligently collect, analyze and interpret data relevant to your decision-making. Statistical concepts enable you to solve problems in a diversity of contexts. Statistical thinking enables you to add substance to your decisions.
That's why we need statistical data analysis in probabilistic modeling. Statistics arose from the need to place knowledge management on a systematic evidence base. This required a study of the rules of computational probability, the development of measures of data properties, relationships, and so on.
The purpose of statistical thinking is to get acquainted with the statistical techniques, to be able to execute procedures using available JavaScript, and to be conscious of the conditions and limitations of various techniques.
Statistical Decision-Making Process
Unlike in deterministic decision-making processes, such as linear optimization by solving systems of equations or parametric systems of equations, in decision making under pure uncertainty the variables are often more numerous and more difficult to measure and control. However, the steps are the same. They are:
Fortunately the probabilistic and statistical methods for analysis and decision making under uncertainty are more numerous and powerful today than ever before. The computer makes possible many practical applications. A few examples of business applications are the following:
Questions Concerning the Statistical Decision-Making Process:
What is Business Statistics?
The main objective of Business Statistics is to make inferences (e.g., prediction, making decisions) about certain characteristics of a population based on information contained in a random sample from the entire population. The condition for randomness is essential to make sure the sample is representative of the population.
Business Statistics is the science of "good" decision making in the face of uncertainty and is used in many disciplines, such as financial analysis, econometrics, auditing, production and operations, and marketing research. It provides knowledge and skills to interpret and use statistical techniques in a variety of business applications. A typical Business Statistics course is intended for business majors, and covers statistical study, descriptive statistics (collection, description, analysis, and summary of data), probability, the binomial and normal distributions, tests of hypotheses and confidence intervals, linear regression, and correlation.
Statistics is a science of making decisions with respect to the characteristics of a group of persons or objects on the basis of numerical information obtained from a randomly selected sample of the group. Statisticians refer to this numerical observation as a realization of a random sample. However, notice that one cannot see a random sample. What we observe is only a finite set of outcomes of a random process.
At the planning stage of a statistical investigation, the question of sample size (n) is critical. For example, the sample size for sampling from a finite population of size N is set at √N + 1, rounded up to the nearest integer. Clearly, a larger sample provides more relevant information, and as a result a more accurate estimation and better statistical judgement regarding tests of hypotheses.
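The sample-size rule just stated is easy to compute. The sketch below, in Python, is one possible way to do it; the function name `rule_of_thumb_sample_size` is hypothetical, and capping the result at N (so that the sample is never larger than the population) is an added assumption that the text does not state.

```python
import math


def rule_of_thumb_sample_size(population_size: int) -> int:
    """Sample size sqrt(N) + 1, rounded up to the nearest integer."""
    if population_size < 1:
        raise ValueError("population_size must be a positive integer")
    # For N >= 1, ceil(sqrt(N)) equals isqrt(N - 1) + 1, so the rule
    # can be evaluated with integer arithmetic and no rounding error.
    n = math.isqrt(population_size - 1) + 1 + 1
    # Assumption (not in the text): never sample more than the population.
    return min(n, population_size)


for N in (100, 1000, 10000):
    print(N, rule_of_thumb_sample_size(N))
```

Integer arithmetic is used instead of `math.sqrt` so that perfect squares such as N = 100 are handled exactly.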
Under-lit Streets and the Crime Rate: It is a fact that if residential city streets are under-lit then major crimes take place therein. Suppose you are working in the Mayor's office and have been put in charge of helping him/her decide which manufacturer to buy the light bulbs from in order to reduce the crime rate by at least a certain amount, given a limited budget.
Activities Associated with the General Statistical Thinking and Its Applications
The above figure illustrates the idea of statistical inference from a random sample about the population. It also provides estimation for the population's parameters; namely the expected value μx, the standard deviation σx, and the cumulative distribution function (cdf) Fx, and their corresponding sample statistics: the mean x̄, the sample standard deviation Sx, and the empirical (i.e., observed) cumulative distribution function (cdf), respectively.
The major task of Statistics is to provide the scientific methodology for collecting, analyzing, and interpreting a random sample in order to draw inferences about some particular characteristic of a specific homogeneous population. For two major reasons, it is often impossible to study an entire population:
- The process would be too expensive or too time-consuming.
- The process would be destructive.
In either case, we resort to looking at a sample chosen from the population and trying to infer information about the entire population by examining only the smaller sample. Very often the numbers that interest us most about the population are the mean μ and the standard deviation σ. Any number, like the mean or standard deviation, that is calculated from an entire population is called a Parameter. If the very same numbers are derived only from the data of a sample, then the resulting numbers are called Statistics. Frequently, Greek letters represent parameters and Latin letters represent statistics (as shown in the above figure).
The uncertainties in extending and generalizing sampling results to the population are measured and expressed by probabilistic statements called Inferential Statistics. Therefore, probability is used in statistics as a measuring tool and decision criterion for dealing with uncertainties in inferential statistics.
An important aspect of statistical inference is estimating population values (parameters) from samples of data. An estimate of a parameter is unbiased if the expected value of its sampling distribution is equal to that population parameter. The sample mean is an unbiased estimate of the population mean. The sample variance (with divisor n-1) is an unbiased estimate of the population variance. This allows us to combine several estimates to obtain a much better estimate. The empirical distribution is the distribution of a random sample, shown by a step-function in the above figure. The empirical distribution function is an unbiased estimate for the population distribution function F(x).
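The step-function shape of the empirical distribution can be seen from a short computation. The following Python sketch builds the empirical cdf of a small hypothetical sample; the helper name `ecdf` and the data are assumptions made only for illustration.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical cdf of the sample as a function F.

    F(x) is the proportion of observations less than or equal to x.
    """
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many ordered values are <= x.
        return bisect_right(ordered, x) / n

    return F


# Hypothetical realization of a random sample.
sample = [12, 15, 15, 18, 20]
F = ecdf(sample)

# F stays flat between observations and jumps at each observed value;
# the jump at 15 is twice as high because 15 occurs twice.
for x in (11, 12, 15, 17, 20, 25):
    print(x, F(x))
```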
Given you already have a realization set of a random sample, to compute the descriptive statistics including those in the above figure, you may like using Descriptive Statistics JavaScript.
Hypothesis testing is a procedure for reaching a probabilistic conclusive decision about a claimed value for a population's parameter based on a sample. To reduce the uncertainty and to have high confidence that statistical inferences are correct, a sample must give each member of the population an equal chance of being selected, which can be achieved by sampling randomly and by using a relatively large sample size n.
Given you already have a realization set of a random sample, to perform hypothesis testing for mean μ and variance σ², you may like using Testing the Mean and Testing the Variance JavaScript, respectively.
Statistics is a tool that enables us to impose order on the disorganized cacophony of the real world of modern society. The business world has grown both in size and competition. Corporate executives must take risks in business, hence the need for business statistics. Business statistics has grown with the art of constructing charts and tables! It is a science of basing decisions on numerical data in the face of uncertainty.
Business statistics is a scientific approach to decision making under risk. In practicing business statistics, we search for an insight, not the solution. Our search is for the one solution that meets all the business's needs with the lowest level of risk. Business statistics can take a normal business situation, and with the proper data gathering, analysis, and re-search for a solution, turn it into an opportunity.
While business statistics cannot replace the knowledge and experience of the decision maker, it is a valuable tool that the manager can employ to assist in the decision making process in order to reduce the inherent risk, measured by, e.g., the standard deviation σ.
Among other useful questions, you may ask why we are interested in estimating the population's expected value μ and its Standard Deviation σ. Here are some applicable reasons. Business Statistics must provide justifiable answers to the following concerns for every consumer and producer:
- What is your (or your customers') Expectation of the product/service you buy (or that you sell)? That is, what is a good estimate for μ?
- Given the information about your (or your customers') expectation, what is the Quality of the product/service you buy (or that you sell)? That is, what is a good estimate for σ?
- Given the information about the expectation and the quality of the product/service you buy (or that you sell), how does the product/service compare with other existing similar types? That is, comparing several μ's and several σ's.
Common Statistical Terminology with Applications
Like all professions, statisticians have their own keywords and phrases to ease precise communication. However, one must interpret the results of any decision making in a language that is easy for the decision-maker to understand. Otherwise, he/she does not believe in what you recommend, and therefore does not go into the implementation phase. This lack of communication between statisticians and managers is the major roadblock to using statistics.
Population: A population is any entire collection of people, animals, plants or things on which we may collect data. It is the entire group of interest, which we wish to describe or about which we wish to draw conclusions. In the above figure, the life of the light bulbs manufactured, say, by GE is the population concerned.
Qualitative and Quantitative Variables: Any object or event that can vary in successive observations, either in quantity or in quality, is called a "variable." Variables are classified accordingly as quantitative or qualitative. A qualitative variable, unlike a quantitative variable, does not vary in magnitude in successive observations. The values of quantitative and qualitative variables are called "Variates" and "Attributes", respectively.
Variable: A characteristic or phenomenon that may take different values, such as weight or gender, which differ from individual to individual.
Randomness: Randomness means unpredictability. The fascinating fact about inferential statistics is that, although each random observation may not be predictable when taken alone, collectively they follow a predictable pattern called its distribution function. For example, as a common rule of thumb, the distribution of a sample average is approximately normal for sample sizes over 30. In other words, an extreme value of the sample mean is less likely than an extreme value of a few raw data.
Sample: A subset of a population or universe.
An Experiment: An experiment is a process whose outcome is not known in advance with certainty.
Statistical Experiment: An experiment in general is an operation in which one chooses the values of some variables and measures the values of other variables, as in physics. A statistical experiment, in contrast, is an operation in which one takes a random sample from a population and infers the values of some variables. For example, in a survey, we "survey", i.e., "look at", the situation without aiming to change it, such as in a survey of political opinions. A random sample from the relevant population provides information about the voting intentions.
In order to make any generalization about a population, a random sample from the entire population, one that is meant to be representative of the population, is often studied. For each population, there are many possible samples. A sample statistic gives information about a corresponding population parameter. For example, the sample mean for a set of data would give information about the overall population mean μ.
It is important that the investigator carefully and completely defines the population before collecting the sample, including a description of the members to be included.
Example: The population for a study of infant health might be all children born in the U.S.A. in the 1980's. The sample might be all babies born on 7th of May in any of the years.
An experiment is any process or study which results in the collection of data, the outcome of which is unknown. In statistics, the term is usually restricted to situations in which the researcher has control over some of the conditions under which the experiment takes place.
Example: Before introducing a new drug treatment to reduce high blood pressure, the manufacturer carries out an experiment to compare the effectiveness of the new drug with that of one currently prescribed. Newly diagnosed subjects are recruited from a group of local general practices. Half of them are chosen at random to receive the new drug, the remainder receives the present one. So, the researcher has control over the subjects recruited and the way in which they are allocated to treatment.
Design of experiments is a key tool for increasing the rate of acquiring new knowledge. Knowledge in turn can be used to gain competitive advantage, shorten the product development cycle, and produce new products and processes which will meet and exceed your customer's expectations.
Primary data and Secondary data sets: If the data are from a planned experiment relevant to the objective(s) of the statistical investigation, collected by the analyst, it is called a Primary Data set. However, if some condensed records are given to the analyst, it is called a Secondary Data set.
Random Variable: A random variable is a real function (yes, it is called "variable", but in reality it is a function) that assigns a numerical value to each simple event. For example, in sampling for quality control an item could be defective or non-defective; therefore, one may assign X = 1 and X = 0 for a defective and non-defective item, respectively. You may assign any other two distinct real numbers, as you wish; however, non-negative integer random variables are easy to work with. Random variables are needed since one cannot do arithmetic operations on words; the random variable enables us to compute statistics, such as average and variance. Any random variable has a distribution of probabilities associated with it.
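As a small illustration of the quality-control coding above, the sketch below maps words to numbers so that an average and a variance can be computed. The inspection results and the function name `X` are hypothetical.

```python
from statistics import mean, pvariance

# Hypothetical inspection results; the labels are illustrative only.
inspections = ["defective", "ok", "ok", "defective", "ok", "ok", "ok", "ok"]


def X(outcome: str) -> int:
    """Random variable: assigns 1 to a defective item and 0 otherwise."""
    return 1 if outcome == "defective" else 0


values = [X(outcome) for outcome in inspections]

proportion_defective = mean(values)   # average of the 0/1 codes
spread = pvariance(values)            # equals p(1 - p) for 0/1 data

print(values, proportion_defective, spread)
```

With 0/1 coding, the average is simply the proportion of defective items, which is one reason this particular assignment is convenient.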
Probability: Probability (i.e., probing for the unknown) is the tool used for anticipating what the distribution of data should look like under a given model. Random phenomena are not haphazard: they display an order that emerges only in the long run and is described by a distribution. The mathematical description of variation is central to statistics. The probability required for statistical inference is not primarily axiomatic or combinatorial, but is oriented toward describing data distributions.
Sampling Unit: A unit is a person, animal, plant or thing which is actually studied by a researcher; the basic objects upon which the study or experiment is executed. For example, a person; a sample of soil; a pot of seedlings; a zip code area; a doctor's practice.
Parameter: A parameter is an unknown value, and therefore it has to be estimated. Parameters are used to represent a certain population characteristic. For example, the population mean μ is a parameter that is often used to indicate the average value of a quantity.
Within a population, a parameter is a fixed value that does not vary. Each sample drawn from the population has its own value of any statistic that is used to estimate this parameter. For example, the mean of the data in a sample is used to give information about the overall mean μ in the population from which that sample was drawn.
Statistic: A statistic is a quantity that is calculated from a sample of data. It is used to give information about unknown values in the corresponding population. For example, the average of the data in a sample is used to give information about the overall average in the population from which that sample was drawn.
A statistic is a function of an observable random sample. It is therefore an observable random variable. Notice that, while a statistic is a "function" of observations, unfortunately, it is commonly called a random "variable", not a function.
It is possible to draw more than one sample from the same population, and the value of a statistic will in general vary from sample to sample. For example, the average value in a sample is a statistic. The average values in more than one sample, drawn from the same population, will not necessarily be equal.
Statistics are often assigned Roman letters (e.g., x̄ and s), whereas the equivalent unknown values in the population (parameters) are assigned Greek letters (e.g., μ, σ).
The word estimate means to esteem, that is giving a value to something. A statistical estimate is an indication of the value of an unknown quantity based on observed data.
More formally, an estimate is the particular value of an estimator that is obtained from a particular sample of data and used to indicate the value of a parameter.
Example: Suppose the manager of a shop wanted to know μ, the mean expenditure of customers in her shop in the last year. She could calculate the average expenditure of the hundreds (or perhaps thousands) of customers who bought goods in her shop; that is, the population mean μ. Instead she could use an estimate of this population mean μ by calculating the mean of a representative sample of customers. If this value were found to be $25, then $25 would be her estimate.
There are two broad subdivisions of statistics: Descriptive Statistics and Inferential Statistics as described below.
Descriptive Statistics: The numerical statistical data should be presented clearly, concisely, and in such a way that the decision maker can quickly obtain the essential characteristics of the data in order to incorporate them into decision process.
The principal descriptive quantity derived from sample data is the mean (x̄), which is the arithmetic average of the sample data. It serves as the most reliable single measure of the value of a typical member of the sample. If the sample contains a few values that are so large or so small that they have an exaggerated effect on the value of the mean, the sample is more accurately represented by the median, the value where half the sample values fall below and half above.
The quantities most commonly used to measure the dispersion of the values about their mean are the variance s² and its square root, the standard deviation s. The variance is calculated by determining the mean, subtracting it from each of the sample values (yielding the deviations of the sample values), and then averaging the squares of these deviations. The mean and standard deviation of the sample are used as estimates of the corresponding characteristics of the entire group from which the sample was drawn. They do not, in general, completely describe the distribution F(x) of values within either the sample or the parent group; indeed, different distributions may have the same mean and standard deviation. They do, however, provide a complete description of the normal distribution, in which positive and negative deviations from the mean are equally common, and small deviations are much more common than large ones. For a normally distributed set of values, a graph showing the dependence of the frequency of the deviations upon their magnitudes is a bell-shaped curve. About 68 percent of the values will differ from the mean by less than the standard deviation, and almost 100 percent will differ by less than three times the standard deviation.
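The variance recipe and the 68 percent figure above can be traced step by step. This is a minimal sketch on made-up data; the divisor n - 1 is an assumption here (the usual choice when the sample variance is meant to estimate the population variance), and `normal_within` is a hypothetical helper name.

```python
import math

data = [4.0, 8.0, 6.0, 5.0, 7.0]  # hypothetical sample

n = len(data)
x_bar = sum(data) / n                      # step 1: the mean
deviations = [x - x_bar for x in data]     # step 2: deviations from the mean
squared = [d * d for d in deviations]      # step 3: squared deviations
s2 = sum(squared) / (n - 1)                # step 4: average them (divisor n - 1, assumed)
s = math.sqrt(s2)                          # standard deviation


def normal_within(k: float) -> float:
    """Probability that a normal value lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


print("mean:", x_bar, "variance:", s2, "std dev:", round(s, 4))
print("within 1 sd:", round(normal_within(1), 4))
print("within 3 sd:", round(normal_within(3), 4))
```

The two probabilities come out near 0.68 and 0.997, matching the "about 68 percent" and "almost 100 percent" statements.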
Inferential Statistics: Inferential statistics is concerned with making inferences from samples about the populations from which they have been drawn. In other words, if we find a difference between two samples, we would like to know whether this is a "real" difference (i.e., is it present in the population) or just a "chance" difference (i.e., it could just be the result of random sampling error). That's what tests of statistical significance are all about. Any inferred conclusion from sample data to the population from which the sample is drawn must be expressed in probabilistic terms. Probability is the language and a measuring tool for uncertainty in our statistical conclusions.
Inferential statistics could be used for explaining a phenomenon or checking for validity of a claim. In these instances, inferential statistics is called Exploratory Data Analysis or Confirmatory Data Analysis, respectively.
Statistical Inference: Statistical inference refers to extending the knowledge obtained from a random sample to the whole population from which the sample was drawn. This is known in mathematics as Inductive Reasoning, that is, knowledge of the whole from a particular. Its main application is in hypotheses testing about a given population. Statistical inference guides the selection of appropriate statistical models. Models and data interact in statistical work. Inference from data can be thought of as the process of selecting a reasonable model, including a statement in probability language of how confident one can be about the selection.
Normal Distribution Condition: The normal or Gaussian distribution is a continuous symmetric distribution that follows the familiar bell-shaped curve. One of its nice features is that the mean and variance uniquely and independently determine the distribution. It has been noted empirically that many measurement variables have distributions that are at least approximately normal. Even when a distribution is non-normal, the distribution of the mean of many independent observations from the same distribution becomes arbitrarily close to a normal distribution as the number of observations grows large. Many frequently used statistical tests assume that the data come from a normal distribution.
Estimation and Hypothesis Testing: Inference in statistics is of two types. The first is estimation, which involves the determination, with a possible error due to sampling, of the unknown value of a population characteristic, such as the proportion having a specific attribute or the average value μ of some numerical measurement. To express the accuracy of the estimates of population characteristics, one must also compute the standard errors of the estimates. The second type of inference is hypothesis testing. It involves the definition of a hypothesis as one set of possible population values and an alternative, a different set. There are many statistical procedures for determining, on the basis of a sample, whether the true population characteristic belongs to the set of values in the hypothesis or the alternative.
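The behavior of the mean of many independent observations from a non-normal distribution can be explored by simulation. The sketch below is only an illustration under assumed settings: an exponential population with mean 1 (strongly skewed), samples of size 40, 5000 repetitions, and a fixed seed so the run is repeatable.

```python
import math
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the run is repeatable

n = 40        # observations per sample (assumed)
reps = 5000   # number of samples drawn (assumed)

# Exponential population with mean 1 and standard deviation 1: strongly skewed.
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(reps)
]

center = mean(sample_means)            # should be close to the population mean 1
spread = pstdev(sample_means)          # should be close to 1 / sqrt(n)
expected_spread = 1.0 / math.sqrt(n)

# Share of sample means within one standard error of the population mean.
within_one = sum(abs(m - 1.0) < expected_spread for m in sample_means) / reps

print(round(center, 3), round(spread, 3), round(expected_spread, 3), round(within_one, 3))
```

Even though single observations are far from bell-shaped, the share of sample means within one standard error should land near the 68 percent expected for a normal distribution.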
Statistical inference is grounded in probability, idealized concepts of the group under study, called the population, and the sample. The statistician may view the population as a set of balls from which the sample is selected at random, that is, in such a way that each ball has the same chance as every other one for inclusion in the sample.
Notice that to be able to estimate the population parameters, the sample size n must be greater than one. For example, with a sample size of one, the variation (s²) within the sample is 0/1 = 0. An estimate for the variation (σ²) within the population would be 0/0, which is an indeterminate quantity; that is, the estimate cannot be computed.
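The sample-size-of-one case can be seen directly with Python's standard `statistics` module, where `pvariance` uses the divisor n and `variance` uses the divisor n - 1. This is a small sketch with a made-up observation.

```python
from statistics import StatisticsError, pvariance, variance

single = [42.0]  # hypothetical sample of size one

within_sample = pvariance(single)   # 0/1 = 0: no variation inside the sample

try:
    estimate = variance(single)     # would need the divisor n - 1 = 0
except StatisticsError:
    estimate = None                 # the 0/0 case: no estimate is available

print(within_sample, estimate)
```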
Greek Letters Commonly Used as Statistical Notations
We use Greek letters as scientific notations in statistics and other scientific fields to honor the ancient Greek philosophers who invented science and scientific thinking. Before Socrates, in 6th Century BC, Thales and Pythagoras, among others, applied geometrical concepts to arithmetic, and Socrates is the inventor of dialectic reasoning. The revival of scientific thinking (initiated by Newton's work) was valued and hence reappeared almost 2000 years later.

| Name | alpha | beta | ki-sqre | delta | mu | nu | pi | rho | sigma | tau | theta |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Symbol | α | β | χ² | δ | μ | ν | π | ρ | σ | τ | θ |

Note: ki-square (ki-sqre, Chi-square), χ², is not the square of anything; its name implies Chi-square (read, ki-square). Ki does not exist in statistics.
I'm glad that you're overcoming all the confusions that exist in learning statistics.
Type of Data and Levels of Measurement
Information can be collected in statistics using qualitative or quantitative data. Qualitative data, such as the eye color of a group of individuals, are not computable by arithmetic relations. They are labels that indicate in which category or class an individual, object, or process falls. They are called categorical variables.
Quantitative data sets consist of measures that take numerical values for which descriptions such as means and standard deviations are meaningful. They can be put into an order and further divided into two groups: discrete data or continuous data.
Discrete data are countable data and are collected by counting, for example, the number of defective items produced during a day's production.
Continuous data are collected by measuring and are expressed on a continuous scale. For example, measuring the height of a person.
Among the first activities in statistical analysis is to count or measure: Counting/measurement theory is concerned with the connection between data and reality. A set of data is a representation (i.e., a model) of the reality based on numerical and measurable scales. Data are called "primary type" data if the analyst has been involved in collecting the data relevant to his/her investigation. Otherwise, they are called "secondary type" data.
Data come in the forms of Nominal, Ordinal, Interval, and Ratio (remember the French word NOIR for the color black). Data can be either continuous or discrete.
Levels of Measurements

| | Nominal | Ordinal | Interval/Ratio |
|---|---|---|---|
| Ranking? | no | yes | yes |
| Numerical difference? | no | no | yes |

Both the zero point and the units of measurement are arbitrary on the Interval scale. While the unit of measurement is arbitrary on the Ratio scale, its zero point is a natural attribute. The categorical variable is measured on an ordinal or nominal scale.
Counting/measurement theory is concerned with the connection between data and reality. Both statistical theory and counting/measurement theory are necessary to make inferences about reality.
Since statisticians live for precision, they prefer Interval/Ratio levels of measurement.
Pareto Chart: A Pareto chart is similar to the histogram, except that it is a frequency bar chart for qualitative variables, rather than being used for quantitative data that have been grouped into classes. The following is an example of a Pareto chart that shows the frequency of the types of shoes worn in the class on a particular day:
A Typical Pareto Chart
For a good business application of discrete random variables, visit Markov Chain Calculator, Large Markov Chain Calculator and Zero-Sum Games.
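The frequencies behind a Pareto chart like the shoe example can be tabulated in a few lines. This is a sketch only: the shoe data are made up, and sorting the bars from most to least frequent with a running cumulative percentage is the conventional Pareto layout, assumed here rather than taken from the figure.

```python
from collections import Counter

# Hypothetical observations of a qualitative variable (type of shoe worn).
shoes = ["sneaker", "boot", "sneaker", "sandal", "sneaker", "boot", "loafer", "sneaker"]

counts = Counter(shoes).most_common()   # categories in descending frequency
total = sum(count for _, count in counts)

pareto_rows = []
cumulative = 0
for category, count in counts:
    cumulative += count
    pareto_rows.append((category, count, round(100 * cumulative / total, 1)))

for category, count, cum_pct in pareto_rows:
    print(f"{category:<8} {'#' * count:<5} {count}  cumulative {cum_pct}%")
```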
Why Statistical Sampling?
Sampling is the selection of part of an aggregate or totality known as population, on the basis of which a decision concerning the population is made.
The following are the advantages and/or necessities for sampling in statistical decision making:
- Cost: Cost is one of the main arguments in favor of sampling, because often a sample can furnish data of sufficient accuracy and at much lower cost than a census.
- Accuracy: Much better control over data collection errors is possible with sampling than with a census, because a sample is a smaller-scale undertaking.
- Timeliness: Another advantage of a sample over a census is that the sample produces information faster. This is important for timely decision making.
- Amount of Information: More detailed information can be obtained from a sample survey than from a census, because it takes less time, is less costly, and allows us to take more care in the data processing stage.
- Destructive Tests: When a test involves the destruction of an item under study, sampling must be used. Statistical sampling determination can be used to find the optimal sample size within an acceptable cost.
Further Reading:
Thompson S., Sampling, Wiley, 2002.
Sampling Methods
From the food you eat to the television you watch, from political elections to school board actions, much of your life is regulated by the results of sample surveys.
A sample is a group of units selected from a larger group (the population). By studying the sample, one hopes to draw valid conclusions about the larger group.
A sample is generally selected for study because the population is too large to study in its entirety. The sample should be representative of the general population. This is often best achieved by random sampling. Also, before collecting the sample, it is important that one carefully and completely defines the population, including a description of the members to be included.
A common problem in business statistical decision-making arises when we need information about a collection called a population but find that the cost of obtaining the information is prohibitive. For instance, suppose we need to know the average shelf life of current inventory. If the inventory is large, the cost of checking records for each item might be high enough to cancel the benefit of having the information. On the other hand, a hunch about the average shelf life might not be good enough for decision-making purposes. This means we must arrive at a compromise that involves selecting a small number of items and calculating an average shelf life as an estimate of the average shelf life of all items in inventory. This is a compromise, since the measurements for a sample from the inventory will produce only an estimate of the value we want, but at substantial savings. What we would like to know is how "good" the estimate is and how much more it will cost to make it "better". Information of this type is intimately related to sampling techniques. This section provides a short discussion on the common methods of business statistical sampling.
Cluster sampling can be used whenever the population is homogeneous but can be partitioned. In many applications the partitioning is a result of physical distance. For instance, in the insurance industry, there are small "clusters" of employees in field offices scattered about the country. In such a case, a random sampling of employee work habits might require travel to many of the "clusters" or field offices in order to get the data. Completely sampling each one of a small number of clusters chosen at random can eliminate much of the cost associated with the data requirements of management.
Stratified sampling can be used whenever the population can be partitioned into smaller sub-populations, each of which is homogeneous according to the particular characteristic of interest. If there are k sub-populations and we let $N_i$ denote the size of sub-population i, let N denote the overall population size, and let n denote the sample size, then we select a stratified sample whenever we choose:

$$n_i = n\,(N_i/N)$$

items at random from sub-population i, i = 1, 2, ..., k. The estimate is:

$$\bar{x}_s = \sum_t W_t\,\bar{x}_t, \quad t = 1, 2, \ldots, L \text{ (strata)},$$

where $W_t = N_t/N$ is the weight of stratum t and $\bar{x}_t = \sum_i X_{it}/n_t$ is the sample mean within stratum t. Its variance is:

$$\sum_t W_t^2\,\frac{(N_t - n_t)\,S_t^2}{n_t\,(N_t - 1)}$$

Population total T is estimated by $N\,\bar{x}_s$; its variance is

$$\sum_t N_t^2\,\frac{(N_t - n_t)\,S_t^2}{n_t\,(N_t - 1)}$$

Random sampling is probably the most popular sampling method used in decision making today. Many decisions are made, for instance, by choosing a number out of a hat or a numbered bead from a barrel, and both of these methods are attempts to achieve a random choice from a set of items. But true random sampling must be achieved with the aid of a computer or a random number table whose values are generated by computer random number generators.
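The stratified-sampling estimates given earlier in this section can be sketched as follows. The strata, their sizes, and the sampled values are all hypothetical, the stratum weight is taken as the stratum's share of the population, and the function name `proportional_allocation` is illustrative; rounding the allocation to whole items is an assumption that may need adjusting when the shares are not exact.

```python
from statistics import mean, variance

# Hypothetical strata: name -> (stratum size N_t, values sampled from it).
strata = {
    "A": (50, [12.0, 15.0, 11.0, 14.0, 13.0]),
    "B": (30, [22.0, 25.0, 25.0]),
    "C": (20, [40.0, 44.0]),
}
N = sum(size for size, _ in strata.values())  # overall population size


def proportional_allocation(n, sizes):
    """n_i = n * (N_i / N), rounded to whole items."""
    total = sum(sizes.values())
    return {name: round(n * size / total) for name, size in sizes.items()}


allocation = proportional_allocation(10, {name: v[0] for name, v in strata.items()})

x_bar_s = 0.0    # stratified estimate of the population mean
var_mean = 0.0   # estimated variance of x_bar_s
for N_t, values in strata.values():
    n_t = len(values)
    W_t = N_t / N                       # stratum weight
    x_bar_s += W_t * mean(values)
    var_mean += W_t**2 * (N_t - n_t) * variance(values) / (n_t * (N_t - 1))

total_estimate = N * x_bar_s            # estimate of the population total T
var_total = N**2 * var_mean             # variance of the estimated total

print(allocation, round(x_bar_s, 3), round(var_mean, 4), round(total_estimate, 1))
```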
A random sample of size n is drawn from a population of size N. The unbiased estimate for the variance of $\bar{x}$ is:

$$\mathrm{Var}(\bar{x}) = S^2\,(1 - n/N)/n,$$

where n/N is the sampling fraction. For a sampling fraction less than 10%, the finite population correction factor (N-n)/(N-1) is almost 1.

The total T is estimated by $N \times \bar{x}$; its variance is $N^2\,\mathrm{Var}(\bar{x})$.
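For a simple random sample, the estimates of the mean, the total, and their variances can be computed as in the sketch below. The sample values and the population size N = 50 are made up, and `srs_estimates` is a hypothetical helper name.

```python
from statistics import mean, variance


def srs_estimates(sample, N):
    """Mean, total, and their estimated variances for a simple random sample."""
    n = len(sample)
    x_bar = mean(sample)
    s2 = variance(sample)               # sample variance, divisor n - 1
    var_mean = s2 * (1 - n / N) / n     # (1 - n/N) accounts for the finite population
    return {
        "mean": x_bar,
        "var_mean": var_mean,
        "total": N * x_bar,
        "var_total": N**2 * var_mean,
    }


sample = [12.0, 15.0, 11.0, 14.0, 13.0]   # hypothetical sample
est = srs_estimates(sample, N=50)
print(est)
```

Here the sampling fraction is 5/50 = 10%, so the correction (1 - n/N) = 0.9 still has a visible effect; for much smaller fractions it approaches 1.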
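For 0, 1 (binary) type variables, the variation in the estimated proportion p is:

$$S^2 = p(1-p)\,(1 - n/N)/(n-1).$$

For the ratio $r = \sum x_i / \sum y_i = \bar{x}/\bar{y}$, the approximate variation for r is:

$$\frac{(N-n)\,\left(S_x^2 + r^2 S_y^2 - 2\,r\,\mathrm{Cov}(x, y)\right)}{n\,(N-1)\,\bar{y}^2}.$$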
Determination of sample sizes (n) with regard to binary data: the smallest integer greater than or equal to:

$$\frac{t^2\,N\,p(1-p)}{t^2\,p(1-p) + a^2\,(N-1)},$$

with N being the total number of cases, n the sample size, a the expected error, t the value taken from the t-distribution corresponding to a certain confidence level, and p the probability of an event.
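The sample-size rule for binary data can be turned into a small function. This is a sketch under stated assumptions: the function name `sample_size_binary` is hypothetical, and the default t = 1.96 is the familiar large-sample value for 95% confidence, used here in place of an exact t-table lookup.

```python
import math


def sample_size_binary(N, p, a, t=1.96):
    """Smallest integer n >= t^2 N p(1-p) / (t^2 p(1-p) + a^2 (N-1))."""
    numerator = t**2 * N * p * (1 - p)
    denominator = t**2 * p * (1 - p) + a**2 * (N - 1)
    return math.ceil(numerator / denominator)


# Hypothetical case: 10,000 cases, worst-case p = 0.5, error a = 0.05.
print(sample_size_binary(N=10_000, p=0.5, a=0.05))
# The same requirements for a population of only 100 cases.
print(sample_size_binary(N=100, p=0.5, a=0.05))
```

Taking p = 0.5 maximizes p(1 - p), so it gives the most conservative (largest) sample size when nothing is known about p in advance.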
Cross-Sectional Sampling: A cross-sectional study is the observation of a defined population at a single point in time or time interval. Exposure and outcome are determined simultaneously.
What is a statistical instrument? A statistical instrument is any process that aims at describing a phenomenon by using any instrument or device; however, the results may be used as a control tool. Examples of statistical instruments are questionnaires and survey sampling.
What is grab sampling technique? The grab sampling technique is to take a relatively small sample over a very short period of time; the results obtained are usually instantaneous. In contrast, passive sampling is a technique where a sampling device is used for an extended time under similar conditions. Depending on the desired statistical investigation, passive sampling may be a useful alternative or even more appropriate than grab sampling. However, a passive sampling technique needs to be developed and tested in the field.
Further Reading:
Thompson S., Sampling, Wiley, 2002.
Statistical Summaries
Representative of a Sample: Measures of Central Tendency Summaries
How do you describe the "average" or "typical" piece of information in a set of data? Different procedures are used to summarize the most representative information depending on the type of question asked and the nature of the data being summarized.
Measures of location give information about the location of the central tendency within a group of numbers. The measures of location presented in this unit for ungrouped (raw) data are the mean, the median, and the mode.
Mean: The arithmetic mean (or the average, simple mean) is computed by summing all numbers in an array of numbers ($x_i$) and then dividing by the number of observations (n) in the array.

$$\text{Mean} = \bar{x} = \sum_i x_i / n,$$

where the sum is over all i's.
The mean uses all of the observations, and each observation affects the mean. Even though the mean is sensitive to extreme values; i.e., extremely large or small data can cause the mean to be pulled toward the extreme data; it is still the most widely used measure of location. This is due to the fact that the mean has valuable mathematical properties that make it convenient for use with inferential statistical analysis. For example, the sum of the deviations of the numbers in a set of data from the mean is zero, and the sum of the squared deviations of the numbers in a set of data from the mean is the minimum value.
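The two properties of the mean mentioned above are easy to check numerically. The sketch below uses made-up data; `sse` is a hypothetical helper that returns the sum of squared deviations about any chosen point.

```python
data = [3.0, 7.0, 8.0, 10.0, 12.0]  # hypothetical observations

x_bar = sum(data) / len(data)

# Property 1: deviations from the mean sum to zero.
sum_of_deviations = sum(x - x_bar for x in data)


def sse(center):
    """Sum of squared deviations of the data about `center`."""
    return sum((x - center) ** 2 for x in data)


# Property 2: the sum of squared deviations is smallest at the mean.
print(x_bar, sum_of_deviations)
print(sse(x_bar), sse(x_bar - 0.5), sse(x_bar + 0.5))
```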
You might like to use Descriptive Statistics to compute the mean.
Weighted Mean: In some cases, the data in the sample or population should not be weighted equally, rather each value should be weighted according to its importance.
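A weighted mean divides the weighted sum of the values by the sum of the weights. The following is a minimal sketch; the scores, the weights, and the function name `weighted_mean` are illustrative assumptions.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical example: three scores with importance weights 2, 3 and 5.
scores = [80.0, 90.0, 70.0]
weights = [2, 3, 5]

print(weighted_mean(scores, weights))        # weighted mean
print(weighted_mean(scores, [1, 1, 1]))      # equal weights give the simple mean
```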
Median: The median is the middle value in an ordered array of observations. If there is an even number of observations in the array, the median is the average of the two middle numbers. If there is an odd number of data in the array, the median is the middle number.
The median is often used to summarize the distribution of an outcome. If the distribution is skewed, the median and the interquartile range (IQR) may be better than other measures to indicate where the observed data are concentrated.
Generally, the median provides a better measure of location than the mean when there are some extremely large or small observations; i.e., when the data are skewed to the right or to the left. For this reason, median income is used as the measure of location for the U.S. household income. Note that if the median is less than the mean, the data set is skewed to the right. If the median is greater than the mean, the data set is skewed to the left. For a normal population and a large sample, the sample median is approximately normally distributed with mean μ, and the standard error of the median is (π/2)^½ times the standard error of the mean.
The mean has two distinct advantages over the median. It is more stable, and one can compute the mean of two combined samples from the two separate means and sample sizes.
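The median rules and the mean-versus-median comparison can be illustrated on a small skewed data set. The incomes below are made up, and the final lines simply evaluate the (π/2)^½ factor that relates the standard error of the median to that of the mean for a normal population.

```python
import math
from statistics import mean, median, stdev

# Hypothetical household incomes in $1000, with one very large value.
incomes = [28.0, 31.0, 35.0, 40.0, 46.0, 52.0, 250.0]

odd_case = median(incomes)                  # middle value of an odd-sized array
even_case = median([1.0, 2.0, 3.0, 4.0])    # average of the two middle values

skew = "right" if odd_case < mean(incomes) else "left or none"

# Normal-population factor for the standard error of the median.
factor = math.sqrt(math.pi / 2)
se_mean = stdev(incomes) / math.sqrt(len(incomes))
se_median_if_normal = factor * se_mean      # only meaningful for normal data

print(odd_case, even_case, round(mean(incomes), 2), skew, round(factor, 4))
```

The single value of 250 pulls the mean well above the median, which is the pattern described for right-skewed data.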
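The combining property of the mean can be sketched as follows: the mean of two pooled samples is recoverable from the two means and sample sizes alone, while the pooled median is not determined by the two medians. The data and the function name `combined_mean` are hypothetical.

```python
from statistics import mean, median


def combined_mean(n1, mean1, n2, mean2):
    """Mean of the pooled data, using only the sizes and means of the two samples."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)


a = [2.0, 4.0, 6.0]    # hypothetical sample 1
b = [10.0, 20.0]       # hypothetical sample 2

from_summaries = combined_mean(len(a), mean(a), len(b), mean(b))
from_raw_data = mean(a + b)

print(from_summaries, from_raw_data)        # the two agree
print(median(a), median(b), median(a + b))  # pooled median needs the raw data
```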
Mode: The mode is the most frequently occurring value in a set of observations. Why use the mode? The classic example is the shirt/shoe manufacturer who wants to decide what sizes to introduce. Data may have two modes. In this case, we say the data are bimodal, and sets of observations with more than two modes are referred to as multimodal. Note that the mode is not a helpful measure of location, because there can be more than one mode or even no mode.
When the mean and the median are known, it is possible to estimate the mode for the unimodal distribution using the other two averages as follows:
Mode ≈ 3(median) - 2(mean)
This estimate is applicable to both grouped and ungrouped data sets.
Whenever more than one mode exists, the population from which the sample came is a mixture of more than one population, as shown, for example, in the following bimodal histogram.
A Mixture of Two Different Populations
However, notice that a Uniform distribution has an uncountable number of modes having equal density value; therefore it is considered as a homogeneous population.
Almost all standard statistical analyses are conditioned on the assumption that the population is homogeneous.
Notice that Excel has very limited statistical capability. For example, it displays only one mode, the first one. Unfortunately, this is very misleading. However, you may find out if there are others by inspection only, as follows: Create a frequency distribution, invoke the menu sequence: Tools, Data analysis, Frequency and follow instructions on the screen. You will see the frequency distribution and then find the mode visually. Unfortunately, Excel does not draw a Stem and Leaf diagram. All commercial off-the-shelf software, such as SAS and SPSS, display a Stem and Leaf diagram, which is a frequency distribution of a given data set.
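As an alternative to inspecting a frequency table by eye, all modes can be listed programmatically. The sketch below uses `statistics.multimode` from Python's standard library (Python 3.8 or later) on made-up data, and also evaluates the mode estimate 3(median) - 2(mean) for a unimodal sample; `empirical_mode` is a hypothetical helper name.

```python
from collections import Counter
from statistics import mean, median, multimode

bimodal = [1, 2, 2, 3, 4, 4, 5]             # hypothetical data with two modes

frequencies = Counter(bimodal)              # the frequency distribution
modes = multimode(bimodal)                  # every value tied for the top frequency


def empirical_mode(values):
    """Rough mode estimate for unimodal data: 3 * median - 2 * mean."""
    return 3 * median(values) - 2 * mean(values)


unimodal = [1, 2, 2, 2, 3, 3, 4, 5, 9]      # hypothetical right-skewed data

print(sorted(frequencies.items()))
print(modes)
print(multimode(unimodal), round(empirical_mode(unimodal), 2))
```

For the skewed sample the estimate lands close to the actual mode, though it is only an approximation.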
Selecting Among the Mode, Median, and Mean
It is a common mistake to specify the wrong index for central tendency.
Selecting Among the Mode, Median, and Mean
The first consideration is the type of data: if the variable is categorical, the mode is the single measure that best describes the data.
The second consideration in selecting the index is to ask whether the total of all observations is of any interest. If the answer is yes, then the mean is the proper index of central tendency.
If the total is of no interest, then, depending on whether the histogram is symmetric or skewed, one must use either the mean or the median, respectively.
In all cases the histogram must be unimodal. However, notice that, e.g., a Uniform distribution has an uncountable number of modes of equal density; therefore it is considered a homogeneous population.
Notice also that:
$|\text{Mean} - \text{Median}| \le \sigma$, where $\sigma$ is the standard deviation.
The main characteristics of these three statistics are tabulated below:
| Fact No. | The Mode | The Median | The Mean |
|---|---|---|---|
| 1 | It is the most frequent value in the distribution; it is the point of greatest density. | It is the value of the middle point of the array (not the midpoint of the range), such that half the items are above and half below it. | It is the value in a given aggregate which would obtain if all the values were equal. |
| 2 | The value of the mode is established by the predominant frequency, not by the values in the distribution. | The value of the median is fixed by its position in the array and does not reflect the individual values. | The sums of the deviations on either side of the mean are equal; hence, the algebraic sum of the deviations is equal to zero. |
| 3 | It is the most probable value, hence the most typical. | The aggregate distance between the median point and all the values in the array is less than from any other point. | It reflects the magnitude of every value. |
| 4 | A distribution may have 2 or more modes. On the other hand, there is no mode in a rectangular distribution. | Each array has one and only one median. | An array has one and only one mean. |
| 5 | The mode does not reflect the degree of modality. | It cannot be manipulated algebraically: medians of subgroups cannot be weighted and combined. | Means may be manipulated algebraically: means of subgroups may be combined when properly weighted. |
| 6 | It cannot be manipulated algebraically: modes of subgroups cannot be combined. | It is stable in that grouping procedures do not affect it appreciably. | It may be calculated even when individual values are unknown, provided the sum of the values and the sample size n are known. |
| 7 | It is unstable in that it is influenced by grouping procedures. | Values must be ordered, and may be grouped, for computation. | Values need not be ordered or grouped for this calculation. |
| 8 | Values must be ordered and grouped for its computation. | It can be computed when ends are open. | It cannot be calculated from a frequency table when ends are open. |
| 9 | It can be calculated when table ends are open. | It is not applicable to qualitative data. | It is stable in that grouping procedures do not seriously affect it. |
The Descriptive Statistics JavaScript provides a complete set of information about all statistics that you ever need. You might like to use it to perform some numerical experimentation for validating the above assertions for a deeper understanding.
Specialized Averages: The Geometric & Harmonic Means
The Geometric Mean: The geometric mean (G) of n non-negative numerical values is the nth root of the product of the n values. If some values are very large in magnitude and others are small, then the geometric mean is a better representative of the data than the simple average. In a "geometric series", the most meaningful average is the geometric mean (G); the arithmetic mean is heavily biased toward the larger numbers in the series.
An Application: Suppose sales of a certain item increase to 110% in the first year and to 150% of that in the second year. For simplicity, assume you sold 100 items initially. Then the number sold in the first year is 110 and the number sold in the second is 150% x 110 = 165. The arithmetic average of 110% and 150% is 130%, so that we would incorrectly estimate that the number sold in the first year is 130 and the number in the second year is 169. The geometric mean of 110% and 150% is $G = (1.65)^{1/2}$, so that we would correctly estimate that we would sell $100\,G^2 = 165$ items in the second year.
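The sales example above can be checked numerically. This is a minimal sketch; the function name is an assumption, and the growth factors 1.10 and 1.50 are the ones given in the text.

```python
import math


def geometric_mean(values):
    """nth root of the product of n non-negative values."""
    return math.prod(values) ** (1.0 / len(values))


growth = [1.10, 1.50]  # yearly growth factors from the example
initial_sales = 100

g = geometric_mean(growth)
a = sum(growth) / len(growth)  # arithmetic average, 1.30

# Two-year estimates obtained by applying each average for two years.
estimate_geometric = initial_sales * g ** 2
estimate_arithmetic = initial_sales * a ** 2

print(round(estimate_geometric, 2), round(estimate_arithmetic, 2))
```

Applying the geometric mean for two years reproduces the actual second-year figure of 165, while the arithmetic average overshoots to 169.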
The Harmonic Mean: The harmonic mean (H) is another specialized average, which is useful in averaging variables expressed as a rate per unit of time, such as miles per hour or number of units produced per day. The harmonic mean (H) of n non-zero numerical values $x_i$ is: $H = n / \sum_i (1/x_i)$.
An Application: Suppose 4 machines in a machine shop are used to produce the same part. However, the four machines take 2.5, 2.0, 1.5, and 6.0 minutes, respectively, to make one part. What is the average rate of speed?
The harmonic mean is: H = 4/[(1/2.5) + (1/2.0) + (1/1.5) + (1/6.0)] = 2.31 minutes.
If all machines work for one hour, how many parts will be produced? Since four machines running for one hour represent 240 minutes of operating time, 240 / 2.31 = 104 parts will be produced.
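The machine-shop calculation can be reproduced in a few lines. The sketch below uses the four times from the example; the function and variable names are assumptions. It also counts the parts machine by machine as a cross-check.

```python
def harmonic_mean(values):
    """n divided by the sum of reciprocals of n non-zero values."""
    return len(values) / sum(1.0 / v for v in values)


minutes_per_part = [2.5, 2.0, 1.5, 6.0]  # one entry per machine

h = harmonic_mean(minutes_per_part)  # average minutes per part, about 2.31

# Four machines running for 60 minutes give 240 machine-minutes in total.
parts_via_h = 4 * 60 / h

# Cross-check: count what each machine produces in one hour.
parts_direct = sum(60 / m for m in minutes_per_part)

print(round(h, 2), round(parts_via_h), round(parts_direct))
```

The two counts agree, which is why the harmonic mean, and not the arithmetic mean of the four times, is the appropriate average here.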
The Order Among the Three Means: If all three means exist, then the Arithmetic Mean is never less than the other two; moreover, the Harmonic Mean is never larger than the other two.
You might like to use The Other Means JavaScript in performing some numerical experimentation for validating the above assertions for a deeper understanding.
Further Reading:
Langley R., Practical Statistics Simply Explained, 1970, Dover Press.
Histogramming: Checking for Homogeneity of Population
A histogram is a graphical presentation of an estimate for the density (for continuous random variables) or probability mass function (for discrete random variables) of the population. The geometric features of a histogram enable us to find useful information about the data, such as:
- The location of the "center" of the data.
- The degree of dispersion.
- The extent to which it is skewed, that is, does not fall off symmetrically on both sides of its peak.
- The degree of peakedness. How steeply it rises and falls.
- The degree of peakedness. How steeply it rises and falls.
The mode is the most frequently occurring value in a set of observations. Data may have two modes. In this case, we say the data are bimodal, and sets of observations with more than two modes are referred to as multimodal. Whenever more than one mode exists, the population from which the sample came is a mixture of more than one population. Almost all standard statistical analyses are conditioned on the assumption that the population is homogeneous, meaning that its density (for continuous random variables) or probability mass function (for discrete random variables) is unimodal. However, notice that, e.g., a Uniform distribution has an uncountable number of modes of equal density; therefore it is considered a homogeneous population.
To check the unimodality of sampling data, one may use the histogramming process.
Number of Class Intervals in a Histogram: Before we can construct our frequency distribution we must determine how many classes to use. The choice is somewhat arbitrary, but too few or too many classes will not provide as clear a picture as some more nearly optimum number. An empirical (i.e., observed) relationship, known as Sturges' rule, may be used as a guide: the optimal number of classes (k) is the smallest integer greater than or equal to
$\min\{\sqrt{n},\ 10\log_{10} n\}$, for $n \ge 30$,
where n is the total number of numerical values that comprise the data set.
Therefore, class width is:
(highest value - lowest value) / k
The following JavaScript produces a histogram based on this rule: Test for Homogeneity of a Population.
To have an "optimum" you need some measure of quality; presumably, in this case, the "best" way to display whatever information is available in the data. The sample size contributes to this, so the usual guideline is to use between 5 and 15 classes, with more classes if you have a larger sample. You should also take into account a preference for tidy class widths, preferably a multiple of 5 or 10, because this makes the histogram easier to understand.
Beyond this it becomes a matter of judgement. Try out a range of class widths, and choose the one that works best. This assumes you have a computer and can generate alternative histograms fairly readily.
There are often management issues that come into play as well. For example, if your data is to be compared to similar data -- such as prior studies, or from other countries -- you are restricted to the intervals used therein.
If the histogram is very skewed, then unequal classes should be considered. Use narrow classes where the class frequencies are high, wide classes where they are low.
The following approaches are common:
Let n be the sample size, then the number of class intervals could be
$\min\{\sqrt{n},\ 10\log_{10} n\}$, where the logarithm is in base 10. Thus for 200 observations you would use about 14 intervals, but for 2000 you would use about 33.
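The rule can be sketched in Python as follows. The function names and the `round_up` switch are assumptions made for illustration: the earlier statement of the rule rounds up to the next integer, whereas the figures 14 and 33 quoted here correspond to rounding down, so the sketch exposes both choices.

```python
import math


def class_count(n, round_up=True):
    """Number of classes from min(sqrt(n), 10*log10(n)); intended for n >= 30."""
    raw = min(math.sqrt(n), 10 * math.log10(n))
    return math.ceil(raw) if round_up else math.floor(raw)


def class_width(values, k):
    """Equal class width: (highest value - lowest value) / k."""
    return (max(values) - min(values)) / k


for n in (200, 2000):
    print(n, class_count(n, round_up=False), class_count(n))
```

Either rounding gives a starting point only; as the text notes, tidy class widths and comparability with other studies may justify a different number.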
Alternatively,
- Find the range (highest value - lowest value).
- Divide the range by a reasonable interval size: 2, 3, 5, 10 or a multiple of 10.
- Aim for no fewer than 5 intervals and no more than 15.
One of the main applications of histogramming is to Test for Homogeneity of a Population. The unimodality of the histogram is a necessary condition for the homogeneity of the population, without which no statistical analysis is meaningful. However, notice that, e.g., a Uniform distribution has an uncountable number of modes of equal density; therefore it is considered a homogeneous population.
Further Reading:
Efron B., and R. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall (now the CRC Press), 1994. Contains a tedious test for multimodality that is based on Gaussian kernel density estimates and uses the window-size approach.
How to Construct a BoxPlot
A BoxPlot is a graphical display that has many characteristics. It reveals possible outliers. It illustrates the range of the data. It shows measures of dispersion, such as the upper quartile, lower quartile, and interquartile range (IQR) of the data set, as well as the median as a measure of central location, which is useful for comparing sets of data. It also gives an indication of the symmetry or skewness of the distribution. The main reason for the popularity of boxplots is that they offer much information in a compact way.
Steps to Construct a BoxPlot:
- Horizontal lines are drawn from the smallest observation (A) to the lower quartile (B), and from the upper quartile (D) to the largest observation (E). Vertical lines at points B and D join these horizontal lines and form the ends of the box.
- A vertical line is drawn at the median point (C), as shown in the figure above.
For a deeper understanding, you may like using graph paper, and Descriptive Sampling Statistics JavaScript in constructing the BoxPlots for some sets of data; e.g., from your textbook.
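The five points A to E used in a BoxPlot can also be computed directly. The sketch below relies on Python's `statistics.quantiles` with the "inclusive" method; quartile conventions differ between packages, so other tools may give slightly different B and D. The data set is hypothetical.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (A, B, C, D, E): minimum, lower quartile, median, upper quartile, maximum."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]


data = [2, 4, 4, 5, 7, 8, 9, 11, 12]  # hypothetical observations
a, b, c, d, e = five_number_summary(data)
print(f"A={a}, B={b}, C={c}, D={d}, E={e}, IQR={d - b}")
```

The box spans B to D, so its length is the interquartile range, and the position of C inside the box hints at the skewness of the data.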
Measuring the Quality of a Sample
Average by itself is not a good indication of quality. You need to know the variance to make any educated assessment. We are reminded of the dilemma of the six-foot tall statistician who drowned in a stream that had an average depth of three feet.
Statistical measures are often used for describing the nature and extent of differences among the information in the distribution. A measure of variability is generally reported together with a measure of central tendency.
Statistical measures of variation are numerical values that indicate the variability inherent in a set of data measurements. Note that a small value for a measure of dispersion indicates that the data are concentrated around the mean; therefore, the mean is a good representative of the data set. On the other hand, a large measure of dispersion indicates that the mean is not a good representative of the data set. Also, measures of dispersion can be used when we want to compare the distributions of two or more sets of data. Quality of a data set is measured by its variability: Larger variability indicates lower quality. That is why high variation makes the manager very worried. Your job, as a statistician, is to measure the variation, and if it is too high and unacceptable, then it is the job of the technical staff, such as engineers, to fix the process.
Decision situations with complete lack of knowledge, known as the flat uncertainty, have the largest risk. For simplicity, consider the case when there are only two outcomes, one with probability of p. Then, the variation in the outcomes is p(1-p). This variation is the largest if we set p = 50%. That is, equal chance for each outcome. In such a case, the quality of information is at its lowest level.
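A quick numerical check of this claim: the sketch below evaluates p(1-p) on a grid of probabilities and picks the largest value. The grid resolution and the function name are arbitrary choices for illustration.

```python
def two_outcome_variation(p):
    """Variation of a two-outcome situation with probabilities p and 1 - p."""
    return p * (1 - p)


# Probabilities 0.00, 0.01, ..., 1.00.
grid = [i / 100 for i in range(101)]
p_max = max(grid, key=two_outcome_variation)

print(p_max, two_outcome_variation(p_max))
```

The variation falls to zero as p approaches 0 or 1, that is, as the outcome becomes certain.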
Remember, quality of information and variation are inversely related. The larger the variation in the data, the lower the quality of the data (i.e., information): the Devil is in the Deviations.
The four most common measures of variation are the range, variance, standard deviation, and coefficient of variation.
Range: The range of a set of observations is the absolute value of the difference between the largest and smallest values in the data set. It measures the size of the smallest contiguous interval of real numbers that encompasses all of the data values. It is not useful when extreme values are present. It is based solely on two values, not on the entire data set. In addition, it cannot be defined for open-ended distributions such as Normal distribution.
Notice that, when dealing with discrete random observations, some authors define the range as:
Range = Largest value - Smallest value + 1.
A normal distribution does not have a range. A student said, "since the tails of a normal density function never touch the x-axis, and since for an observation to contribute to forming such a curve, very large positive and negative values must exist." Such remote values are always possible, but increasingly improbable. This encapsulates the asymptotic behavior of the normal density very well. In spite of this behavior, the normal distribution is useful and applicable to a wide range of decision-making situations.
Quartiles: When we order the data, for example in ascending order, we may divide the data into quarters, Q1…Q4, known as quartiles. The first Quartile (Q1) is that value where 25% of the values are smaller and 75% are larger. The second Quartile (Q2) is that value where 50% of the values are smaller and 50% are larger. The third Quartile (Q3) is that value where 75% of the values are smaller and 25% are larger.
Percentiles: Percentiles have a similar concept and therefore, are related; e.g., the 25th percentile corresponds to the first quartile Q1, etc. The advantage of percentiles is that they may be subdivided into 100 parts. The percentiles and quartiles are most conveniently read from a cumulative distribution function, as depicted in the following figure.
Empirical Cumulative Distribution Function as an Informative Tool
Interquartile Range: The interquartile range (IQR) describes the extent to which the middle 50% of the observations are scattered or dispersed. It is the distance between the first and the third quartiles:
IQR = Q3 - Q1,
which is twice the Quartile Deviation. For data that are skewed, the relative dispersion, similar to the coefficient of variation (C.V.), is given (provided the denominator is not zero) by the Coefficient of Quartile Variation:
CQV = (Q3 - Q1) / (Q3 + Q1).
Note that almost all statistics that we have covered up to now can be obtained and understood deeply by a graphical method using the Empirical (i.e., observed) Cumulative Distribution Function (ECDF) JavaScript. However, the numerical Descriptive Statistics JavaScript provides a complete set of information about all the statistics that you will ever need.
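Given the two quartiles, the three quartile-based measures of spread follow directly. The sketch below is illustrative only; the function name and the quartile values 4 and 9 are hypothetical.

```python
def quartile_spread(q1, q3):
    """Return (IQR, quartile deviation, coefficient of quartile variation)."""
    iqr = q3 - q1
    quartile_deviation = iqr / 2
    if q3 + q1 == 0:
        raise ValueError("CQV is undefined when Q3 + Q1 is zero")
    cqv = iqr / (q3 + q1)
    return iqr, quartile_deviation, cqv


# Hypothetical quartiles of a data set.
iqr, qd, cqv = quartile_spread(4.0, 9.0)
print(iqr, qd, round(cqv, 4))
```

Because the CQV is a ratio, it does not depend on the unit of measurement, which is what makes it comparable to the coefficient of variation.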
The Duality between the ECDF and the Histogram: Notice that the height of the empirical (i.e., observed) cumulative distribution function (ECDF) at a particular point is numerically equal to the area of the corresponding histogram to the left of that point. Therefore, either or both could be used, depending on the intended application.
Mean Absolute Deviation (MAD): A simple measure of variability is the mean absolute deviation:
$\text{MAD} = \sum |x_i - \bar{x}| / n$, where $\bar{x}$ is the mean.
The mean absolute deviation is widely used as a performance measure to assess the quality of modeling, such as forecasting techniques. However, MAD does not lend itself to further use in making inferences; moreover, even in error-analysis studies the variance is preferred, since variances of independent (i.e., uncorrelated) errors are additive, whereas MAD has no such feature.
The MAD is a simple measure of variability, which unlike range and quartile deviation, takes every item into account, and it is simpler and less affected by extreme deviations. It is therefore often used in small samples that include extreme values.
The mean absolute deviation theoretically should be measured from the median, since it is at its minimum there; however, it is more convenient to measure the deviations from the mean.
As a numerical example, consider the price (in $) of same item at 5 different stores: $4.75, $5.00, $4.65, $6.10, and $6.30. The mean absolute deviation from the mean is $0.67, while from the median is $0.60, which is a better representative of deviation among the prices.
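The two figures in this example can be reproduced with a short Python sketch. The helper name is an assumption; the prices are those listed above.

```python
from statistics import mean, median


def mean_abs_dev(values, center):
    """Average absolute deviation of the values from a chosen center."""
    return sum(abs(x - center) for x in values) / len(values)


prices = [4.75, 5.00, 4.65, 6.10, 6.30]  # the five store prices in $

mad_from_mean = mean_abs_dev(prices, mean(prices))
mad_from_median = mean_abs_dev(prices, median(prices))

print(round(mad_from_mean, 2), round(mad_from_median, 2))
```

Trying any other center in place of the median gives a mean absolute deviation of at least $0.60 for these prices.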
Variance: An important measure of variability is variance. Variance is the average of the squared deviations of each observation in the set from the arithmetic mean of all of the observations.
$\text{Variance} = \sum (x_i - \bar{x})^2 / (n - 1)$, where $\bar{x}$ is the mean and n is at least 2.
The variance is a measure of spread or dispersion among values in a data set. Therefore, the greater the variance, the lower the quality.
The variance is not expressed in the same units as the observations. In other words, the variance is hard to understand because the deviations from the mean are squared, making it too large for logical explanation. This problem can be solved by working with the square root of the variance, which is called the standard deviation.
Standard Deviation: Both variance and standard deviation provide the same information; one can always be obtained from the other. In other words, the process of computing a standard deviation always involves computing a variance. Since standard deviation is the square root of the variance, it is always expressed in the same units as the raw data:
$\text{Standard Deviation} = S = (\text{Variance})^{1/2}$
For large data sets (say, more than 30 observations) with a roughly bell-shaped distribution, approximately 68% of the data lie within one standard deviation of the mean and 95% within two standard deviations; 99.7% (almost 100%) of the data lie within three standard deviations (S) of the mean.
You may use Descriptive Statistics JavaScript to compute the mean, and standard deviation.
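For readers who prefer to compute by hand, the following sketch evaluates the variance with the n - 1 divisor and the standard deviation, reusing the five store prices from the mean absolute deviation example. The function name is an assumption; the result is compared with Python's own `statistics` module.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


prices = [4.75, 5.00, 4.65, 6.10, 6.30]

var = sample_variance(prices)
sd = math.sqrt(var)  # same units ($) as the raw data

print(round(var, 5), round(sd, 4))
print(round(statistics.variance(prices), 5), round(statistics.stdev(prices), 4))
```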
The Mean Square Error (MSE) of an estimate is the variance of the estimate plus the square of its bias; therefore, if an estimate is unbiased, then its MSE is equal to its variance, as is the case in the ANOVA table.
Coefficient of Variation: The Coefficient of Variation (CV) is the absolute relative deviation with respect to size $\bar{x}$, provided $\bar{x}$ is not zero, expressed as a percentage:
$CV = 100\,|S/\bar{x}|\,\%$
CV is independent of the unit of measurement. In estimation of a parameter, when its CV is less than 10%, the estimate is assumed acceptable. The inverse of CV; namely, 1/CV is called the Signal-to-noise Ratio.
The coefficient of variation is used to represent the relationship of the standard deviation to the mean, telling how representative the mean is of the numbers from which it came. It expresses the standard deviation as a percentage of the mean; i.e., it reflects the variation in a distribution relative to the mean. However, confidence intervals for the coefficient of variation are rarely reported. One of the reasons is that the exact confidence interval for the coefficient of variation is computationally tedious.
Note that, for a skewed or grouped data set, the coefficient of quartile variation:
VQ = 100(Q3 - Q1)/(Q3 + Q1)%
is more useful than the CV.
You may use Descriptive Statistics to compute the mean, standard deviation and the coefficient of variation.
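To illustrate that the CV does not depend on the unit of measurement, the sketch below computes it for the same five prices expressed in dollars and in cents. The function name is an assumption, and the sample standard deviation (n - 1 divisor) is used.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV in percent: 100 * |S / mean|, undefined when the mean is zero."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * abs(stdev(values) / m)


prices_dollars = [4.75, 5.00, 4.65, 6.10, 6.30]
prices_cents = [100 * p for p in prices_dollars]

cv_dollars = coefficient_of_variation(prices_dollars)
cv_cents = coefficient_of_variation(prices_cents)

print(round(cv_dollars, 2), round(cv_cents, 2))
```

Both calls give the same percentage, whereas the standard deviation itself is 100 times larger in cents than in dollars.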
Variation Ratio for Qualitative Data: Since the mode is the most frequently used measure of central tendency for qualitative variables, variability is measured with reference to the mode. The statistic that describes the variability of qualitative data is the Variation Ratio (VR):
VR = 1 - fm/n, where fm is the frequency of the mode, and n is the total number of scores in the distribution.
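A minimal sketch of the variation ratio for a categorical variable follows; the function name and the color data are hypothetical.

```python
from collections import Counter


def variation_ratio(categories):
    """VR = 1 - fm/n, where fm is the frequency of the modal category."""
    counts = Counter(categories)
    f_m = max(counts.values())
    return 1 - f_m / len(categories)


# Hypothetical responses: "red" is the mode with frequency 3 out of 6.
colors = ["red", "blue", "red", "green", "red", "blue"]
print(variation_ratio(colors))
```

A VR of 0 means every observation falls in the modal category; values closer to 1 mean the observations are spread over many categories.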
Z Score: how many standard deviations a given point (i.e., observation) is above or below the mean. In other words, a Z score represents the number of standard deviations that an observation (x) is above or below the mean. The larger the Z value, the further away a value will be from the mean. Note that values beyond three standard deviations are very unlikely. Note that if a Z score is negative, the observation (x) is below the mean. If the Z score is positive, the observation (x) is above the mean. The Z score is found as:
$Z = (x - \bar{x}) / S$, where $\bar{x}$ is the mean and $S$ is the standard deviation of X.
The Z score is a measure of the number of standard deviations that an observation is above or below the mean. Since the standard deviation is never negative, a positive Z score indicates that the observation is above the mean, a negative Z score indicates that the observation is below the mean. Note that Z is a dimensionless value, and therefore is a useful measure by which to compare data values from two different populations, even those measured by different units.
Z-Transformation: Applying the formula $z = (X - \mu) / \sigma$ will always produce a transformed variable with a mean of zero and a standard deviation of one. However, the shape of the distribution will not be affected by the transformation. If X is not normal, then the transformed distribution will not be normal either.
One of the nice features of the z-transformation is that the resulting distribution of the transformed data has an identical shape but with mean zero, and standard deviation equal to 1.
One can generalize this data transformation to have any desirable mean and standard deviation other than 0 and 1, respectively. Suppose we wish the transformed data to have the mean and standard deviation of M and D, respectively. For example, in the SAT Scores, they are set at M = 500, and D=100. The following transformation should be applied:
$Z = (\text{standard } Z) \times D + M$
Suppose you have two data sets with very different scales (e.g., one has very low values, another very high values). If you wish to compare these two data sets, then, due to the differences in scales, the statistics that you generate are not comparable. It is a good idea to apply the Z-transformation to both original data sets and then make any comparison.
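The z-transformation and its rescaled version can be sketched as below. The function names and the raw scores are hypothetical, the sample mean and sample standard deviation are used, and the targets M = 500 and D = 100 are those mentioned for SAT scores.

```python
from statistics import mean, stdev


def z_transform(values):
    """Subtract the mean and divide by the standard deviation of the data."""
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]


def rescale(values, target_mean, target_sd):
    """Transform the data to have a chosen mean (M) and standard deviation (D)."""
    return [z * target_sd + target_mean for z in z_transform(values)]


raw_scores = [52, 60, 71, 45, 88, 64]  # hypothetical raw scores

z = z_transform(raw_scores)
scaled = rescale(raw_scores, 500, 100)

print(round(mean(z), 6), round(stdev(z), 6))
print(round(mean(scaled), 6), round(stdev(scaled), 6))
```

The ordering of the observations and the shape of their distribution are unchanged; only the location and the scale are replaced.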
You have heard the terms z value, z test, z transformation, and z score. Do all of these terms mean the same thing? Certainly not:
The z value refers to the critical value (a point on the horizontal axis) of the Normal (0, 1) density function, for a given area to the left of that z-value.
The z test refers to the procedures for testing the equality of the mean(s) of one (or two) population(s).
The z score of a given observation x, in a sample of size n, is simply (x - average of the sample) divided by the standard deviation of the sample. One must be careful not to mistake z scores for the Standard Scores.
The z transformation of a set of observations of size n is simply (each observation - average of all observations) divided by the standard deviation among all observations. The aim is to produce a transformed data set with a mean of zero and a standard deviation of one. This makes the transformed set dimensionless and manageable with respect to its magnitudes.