Statistics I
Practice 1 Notes
Introduction to Statistics with Excel; Descriptive Statistics

In the following notes we will learn how to use Excel 2007 for descriptive statistics.

1. Introduction to statistics with Excel

1.1 Load or import data

In Excel 2007 the data can be entered manually, generated using certain functions of the program (ALEATORIO, for example), or loaded from an external file. For example, in a blank sheet we will enter the numbers 2, 5, 7, 9 and 13 in cells A1 through A5.

Say we want to calculate the mean of these numbers and write it in cell A7. To do this, we can use the built-in formulas:

1. Select the cell where you want the result to appear; in our case, A7. Then go to Fórmulas in the menu at the top.
2. Select Insertar función (insert function).
3. Search for the function that calculates the mean of the given data: PROMEDIO. If the function does not appear, it may be in a different category, so try changing the category to Todas. Once you have found and selected the desired function, press Aceptar.
4. Once you press Aceptar, Excel opens the Argumentos de función window. Excel has automatically selected the cells A1:A6, that is, all the cells from A1 to A6. However, we want the mean of the numbers in cells A1 through A5, so we need to change the range manually. This can be done in two ways: either by correcting A6 to A5, or by minimizing the Argumentos de función window, selecting the range with the mouse, and then pressing Enviar and Aceptar.
5. Finally, Excel 2007 gives the mean we were looking for.

Besides entering the data manually, we can also load it from an external file.
In this class we will use the data from the file “Paises.xlsx”, which contains data from 91 countries described by 7 variables: 6 quantitative (birth rate, death rate, infant death rate, life expectancy for men, life expectancy for women, GDP) and 1 qualitative (zone). To open the file, follow these steps:

1. Press the Office button.
2. Select Abrir.
3. Select the file “Paises.xlsx” in the location where you saved it and press Abrir.

1.2 How to load statistics add-in in Excel 2007

Excel 2007 offers an add-in for statistical computation. Since it is not loaded by default, it needs to be installed. For the item Análisis de datos to appear in the Datos menu, we follow the steps provided by the help of Excel 2007:

1. Press the question mark (top right) to open the Excel 2007 help.
2. Enter the phrase “Herramientas para análisis” in the search bar and press Buscar.
3. Among the results, look for the article called Herramientas para análisis. Select it and follow the instructions to install the statistics add-in.

Once the process is finished, the Datos menu contains a new item called Análisis de datos.

2. Descriptive statistics

2.1. Data analysis

Next, we will perform a descriptive statistical analysis of the variable GDP (PIB). To do this, we will use the statistics add-in we have just installed, Análisis de datos:

1. Select Análisis de datos in the Datos menu; select Estadística descriptiva and press Aceptar.
2. In the window that opens, enter the range of the data we want to analyze (either manually or by selecting the range on the worksheet). In our case the range is “$G$1:$G$92”. We have to select Agrupados por: Columnas and Rótulos en la primera fila because we have also selected the header of the column.
Also, we have to indicate where we want Excel to place the results; in our case we select a new sheet, En una hoja nueva:, and name it An_Uni_PIB (or any other name that helps you remember the location of the results). Finally, also select Resumen de estadísticas and press Aceptar.
3. Once we press Aceptar, Excel 2007 shows the summary statistics in the new sheet.

Another useful option for descriptive statistics in the add-in Análisis de datos is Jerarquía y percentil. Once we select it, as before, we have to indicate the range of the data and whether the column header is included. As for the place where we want to see the results, we can select the same sheet as before: 'An_Uni_PIB'!$D$1. Once we press Aceptar, Excel 2007 adds the ranking to that sheet.

From this output we can obtain the first, second and third quartiles (the second quartile is the median). Since n + 1 = 92 is divisible by 4, each quartile coincides with a single observation. Jerarquía y percentil lists the observations from largest to smallest, so in that list the first quartile is the observation that occupies position 3(n+1)/4, the second quartile is the observation that occupies position 2(n+1)/4 = (n+1)/2, and the third quartile is the observation in position (n+1)/4. We can obtain these three positions by using Excel 2007 as a simple calculator. Put the cursor on cell I1 and enter the formula =3*(B15+1)/4, where B15 is the cell in which Excel has calculated the number of observations n. Next, in cells I2 and I3 write =(B15+1)/2 and =(B15+1)/4. The results are the positions 69, 46 and 23 (equivalently, positions 23, 46 and 69 counting from the smallest value), and the quartiles we are looking for are the corresponding observations: 470, 1690 and 7600. We can write down this new information in the same sheet An_Uni_PIB.

Note: to calculate the quartiles we can also use the built-in Excel functions PERCENTIL or CUARTIL; however, these functions follow a different rule and may interpolate between two neighbouring observations. With our data, for example, Excel 2007 calculates Q1 as the mean of the observations in the 23rd and 24th positions.
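The position rule used above can also be checked outside Excel. The following Python sketch is only an illustration and is not part of the Excel workflow: the function names are hypothetical, the sample data (the integers 1 to 91) is made up and is NOT the GDP column, and the sketch assumes that n + 1 is divisible by 4, as it is for n = 91.

```python
def quartile_positions(n):
    """Positions of Q1, Q2 and Q3 counted from the smallest value,
    using the (n + 1) * k / 4 rule for k = 1, 2, 3."""
    if (n + 1) % 4 != 0:
        # Assumption of this sketch: each quartile falls on one observation.
        raise ValueError("this sketch only covers n + 1 divisible by 4")
    return [(n + 1) * k // 4 for k in (1, 2, 3)]


def quartiles(values):
    """Return (Q1, Q2, Q3) as the observations found at those positions."""
    ordered = sorted(values)  # ascending order
    return tuple(ordered[p - 1] for p in quartile_positions(len(ordered)))


n = 91
positions = quartile_positions(n)                # counted from the smallest value
ranks_from_top = [n + 1 - p for p in positions]  # counted from the largest value

sample = list(range(1, 92))  # made-up data for illustration only

print(positions)
print(ranks_from_top)
print(quartiles(sample))
```

Counting from the smallest value, the quartiles sit at positions 23, 46 and 69; in a list sorted from largest to smallest the same observations have ranks 69, 46 and 23.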
Note: the value of a quartile IS NOT the position, but the observation CORRESPONDING to that position.

The descriptive statistics menu does not provide all the information we might be interested in; for example, the coefficient of variation. It can be calculated manually with the formula Standard Deviation/Mean. We have calculated this coefficient in cell B22 by entering the formula =B7/B3 (B7 contains the standard deviation and B3 the mean).

The data we are using in these notes also contains one qualitative variable: the zone. Say we want to perform the descriptive analysis of the GDP variable only for European countries. To do this, we again use the options available in Análisis de datos, but select only the observations in the chosen zone.

Note: in our case the observations are already ordered by zone. If this were not the case, we would have to sort the data as follows:
1. Press Inicio.
2. Find the sorting and filtering tool Ordenar y Filtrar and select Orden Personalizado.
3. Enter the sorting criteria (first alphabetically by zone, then by country).

2.2. Frequency tables and histograms: quantitative variables

In this part we will show how to create frequency tables and histograms for quantitative variables, using the variable ln(GDP). Go to a new sheet and rename it Hist_Frec_PIB. In the first column we will calculate the class limits using the following information:
· Number of observations: 91
· Minimum value: 4,38202663…, approximately 4,3
· Maximum value: 10,4359964…, approximately 10,5
· Range: 6,2
· Number of classes: 91^(1/2) = 9,53939201…, approximately 9 or 10 classes. We will use 10 classes.

How do we create the histogram?
1. In cell B1 calculate the length of each interval by dividing the range by the number of classes: =(10,5-4,3)/10
2. In cell A4 calculate the upper limit of the first class, which is “minimum value + length”, using the formula =4,3+$B$1
3.
Next, calculate the rest of the upper limits. Since the first upper limit is the minimum value plus the length, every subsequent upper limit is the previous upper limit plus the length. Enter the formula =A4+$B$1 in A5 and copy it down through cell A13.

Once we have the upper limits, we can obtain the frequency tables and histograms:
1. Select Análisis de datos in the Datos menu; select Histograma and press Aceptar.
2. In the next window enter the data range in Rango de entrada, “Hoja1!$H$1:$H$92”, and the class range in Rango de clases, “$A$3:$A$13”; select Rótulos because we have also selected the column headers; enter the output range (for example, cell A15 in the Hist_Frec_PIB sheet); and select Crear Gráfico.
3. Once we press Aceptar, Excel 2007 returns the table of absolute frequencies and the histogram.

The obtained results can be improved in two ways:
· We can provide more information about the frequencies.
· We can improve the histogram.

Referring to the first point, since we have the absolute frequencies, we can also calculate the rest of the frequency table: relative frequencies, accumulated absolute frequencies and accumulated relative frequencies:
1. Copy the table of frequencies obtained before (except for the last line, which says y mayor…). Next, mark the column labelled Lim_Sup, press the right mouse button and select Insertar… -> Desplazar las celdas hacia la derecha, which shifts the existing cells to the right and leaves an empty column. In this new column we can calculate the first lower limit by entering the formula =4,92-$B$1, and then fill in the rest of the cells in the same way (each lower limit is the corresponding upper limit minus the length).
2. In the next column to the right calculate the relative frequencies; write the formula in the first cell and then copy it to the rest of the column.
3. Repeat the same for the rest of the table: accumulated absolute frequencies and accumulated relative frequencies.
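The construction of the class limits and of the full frequency table can be sketched in Python as a cross-check. This is an illustration under stated assumptions, not the Excel procedure itself: the function name frequency_table and the six sample values are hypothetical (they are not the ln(GDP) data), and each value is counted in the first class whose upper limit is greater than or equal to it.

```python
import math


def frequency_table(values, minimum, maximum, n_classes):
    """Return rows (lower, upper, abs, rel, cum_abs, cum_rel)
    for equal-width classes between minimum and maximum."""
    width = (maximum - minimum) / n_classes
    uppers = [minimum + width * (i + 1) for i in range(n_classes)]
    counts = [0] * n_classes
    for v in values:
        # Count v in the first class whose upper limit is >= v.
        # Values above the last limit are not counted in this sketch.
        for i, upper in enumerate(uppers):
            if v <= upper:
                counts[i] += 1
                break
    n = len(values)
    rows, cum = [], 0
    for upper, count in zip(uppers, counts):
        cum += count
        rows.append((upper - width, upper, count, count / n, cum, cum / n))
    return rows


# Square-root rule for the number of classes with 91 observations.
print(round(math.sqrt(91), 2))

sample = [4.4, 5.0, 5.0, 6.1, 7.3, 10.4]  # made-up values for illustration
rows = frequency_table(sample, 4.3, 10.5, 10)
for lower, upper, f, fr, cum_f, cum_fr in rows:
    print(f"{lower:5.2f} {upper:5.2f} {f:3d} {fr:6.3f} {cum_f:3d} {cum_fr:6.3f}")
```

The limits 4.3 and 10.5 and the 10 classes are the ones chosen in the notes, so the first upper limit is 4.92 and the class length is 0.62. With the real data, the last accumulated relative frequency should equal 1.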
Referring to the second point, Excel 2007 draws the histogram with spaces between the columns, whereas our data is continuous and each class shares its lower and upper limits with the neighbouring classes. To join the columns of the histogram:
1. Position the cursor over one of the columns of the histogram, press the right mouse button and select Dar formato a serie de datos…
2. In the next window change the space between columns to 0%.

The histogram should then look as follows:

[Figure: histogram titled “Histograma”; vertical axis Frecuencia, horizontal axis lim_sup]

2.3. Frequency tables and pie charts: qualitative variables

In this part we will show how to create frequency tables and pie charts for qualitative variables, using the variable Zone. Open a new worksheet and write Zone in cell A1. Fill in the cells below with the names of the zones (AFR, ASIA, AME, EU) followed by Total.

In the second column, in cell B2, we will calculate the absolute frequency for AFR. To do this, we can use the function CONTAR.SI() by writing =CONTAR.SI(Hoja1!I$2:I$92;A2). Next, copy the formula to the rest of the cells; the last cell, which corresponds to Total, should contain the total number of observations: =SUMA(B2:B5).

From these absolute frequencies we already know how to calculate the rest of the table.

Finally, we will see how to create a pie chart:
1. Position the cursor on any cell, go to the Insertar menu and select the option Circular.
2. Excel 2007 will create an empty chart. To select the data we want to use in the chart, press the right mouse button and select Seleccionar Datos.
3. In Rango de datos del gráfico select, from the frequency table, the names of the groups and the corresponding frequencies.

The resulting pie chart looks as follows:

[Figure: pie chart titled “Gráfico Circular” with sectors AFR, ASIA, AME and EU; sector labels 29%, 30%, 15% and 26%]

2.4. Box plot

Excel 2007 has no built-in function for box plots. Therefore, we need to make use of macros, which are like external Excel programs made to carry out specific tasks.
We will use a macro that draws the box plot: http://www.cms.murdoch.edu.au/areas/maths/statsnotes/samplestats/BoxPlotMacro.xls

To be able to use it, we first have to permit the use of macros by removing the security restrictions of Excel 2007. First go to the Excel options -> Más frecuentes and tick the box next to Mostrar ficha Programador en la cinta de Opciones. A new menu called Programador will then appear. There, press Seguridad de macros. Once it is open, select Habilitar todas las macros (no recomendado…). Evidently, in general one should always be careful not to use macros of dubious origin.

To load the macro: Office button -> Abrir -> BoxPlotMacro, and use it with the desired data. For example, in the file “Paises.xlsx” select one of the columns containing the data and run the box plot macro from the Macros menu. We obtain:

[Figure: box plot of Tasa de natalidad, n = 91, with the placeholder labels <Chart Title> and <Data scale description>]

By clicking on <Chart Title> and <Data scale…> we can modify the labels of the graph.

Exercises (hand in to the professor at the end of the class, with the answers written on the last page)

Generate a new variable X, “mean life expectancy”, which is the mean of the life expectancy for men and for women. Generate another variable Y, “qualitative expectancy”, which divides the mean life expectancy into three categories: short (<50), medium (50-65) and long (>65), by using the formula: =SI(J2<50;"Short";SI(J2<65;"Medium";"Long"))

1. Analyze the quantitative variable X. Report your results in Table 1.
2. Analyze the qualitative variable Y. Report your results in Table 2.
3. Analyze the quantitative variable X considering only Europe. Report your results in Table 3.

Answers.
Name and surname(s):____________________________________________________________
NIU:_____________________ Degree:___________________________________ Group:______

Table 1
Count
Mean
Standard Deviation
Variation coefficient
Minimum
Maximum
Range
Standardized skewness
Standardized kurtosis

Table 2
Class Value    Frequency    Rel.freq.    Ac.abs.freq.    Ac.rel.freq.
Short
Medium
Long

Table 3
Count
Mean
Standard Deviation
Variation coefficient
Minimum
Maximum
Range
Standardized skewness
Standardized kurtosis