Percentiles -- Quantiles



Earlier we looked at the median of a collection of data values. Conceptually, the median has half of the data values smaller than it and half of the data values larger than it. Later we looked at the quartile values: the first quartile, denoted Q1, has a quarter of the data values less than Q1 and three quarters of the data values larger than Q1, and the third quartile, denoted Q3, has three quarters of the data values smaller than Q3 and one quarter of the values larger than Q3. Q2 is just the median so, again, it has half the values smaller and half the values larger. We can capture this in a table.

Table 1
Quartile Name    Percent of Values Less Than the Quartile Value
Q1               25%
Q2               50%
Q3               75%

As we review the median and quartiles, we recall that we must sort the values before we can determine any of them. For example, consider the values in Table 2.
In order for us to find the median and quartiles we sort the values to get



Finding the first and third quartile values is not so easy. In fact, as was discussed in an earlier page, there is not even a single generally agreed upon rule for finding those values. One method, not the one used by R, is to find the middle value of the values below the median and call that the first quartile. Similarly, that method looks at the middle of the values above the median and calls that the third quartile. In our case that puts the first quartile as the value in position

All of this review of the median and quartile values sets the stage for discussing percentiles. In the sorted list, just as Q3 has 75% of the items below it, Q2 (the median) has 50% of the items below it, and Q1 has 25% of the items below it, the 95th percentile is the value that has 95% of the items below it, the 40th percentile is the value that has 40% of the items below it, and the 27th percentile is the value that has 27% of the items below it.

Upon reflection, the 40th percentile has `2/5` of the values below it. With fewer than `5` values in the table, no item can have exactly that fraction of the values below it, so any computed value has to be something of a compromise. Similarly, the 95th percentile has `19/20` of the values below it, so the computed value is a compromise if we have fewer than `20` values in the table. And, of course, the 27th percentile has `27/100` of the values below it, so the computed value is a compromise if we have fewer than `100` values in the table. This difficulty extends to a collection of any size, to the extent that there are at least 9 different methods for calculating percentiles. However, for large collections of values all of the different methods yield similar results.
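To get a feel for how much the choice of rule matters, we can compare the rules on a tiny collection and on a large one. The sketch below relies on the type argument of R's quantile() function, which selects one of the nine rules that R offers. The data, the seed value 1, and the names small, large, and by_type are made up for this illustration.

```r
# Compare R's nine quantile rules on a small and a large collection.
# The data, the seed, and all variable names here are illustrative assumptions.
small <- c(10, 20, 30, 40)

set.seed(1)                                   # arbitrary seed, for repeatability only
large <- rnorm(10000, mean = 500, sd = 100)

# Compute the percentile for probability p under each rule (type = 1, ..., 9)
by_type <- function(x, p) {
  sapply(1:9, function(t) quantile(x, p, type = t, names = FALSE))
}

small_results <- by_type(small, 0.40)
large_results <- by_type(large, 0.40)

small_results                  # the nine answers for the 4 values
diff(range(small_results))     # spread among the nine answers for 4 values
diff(range(large_results))     # spread among the nine answers for 10000 values
```

For the four values the nine answers are spread over several units, while for the 10000 values they should agree to within a small fraction of a unit.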

Percentiles make the most sense if we have a really large collection of values. The values in Table 4 represent a large, but by no means huge, collection. (You will, of course, need to scroll through the text area to see all of the values in the table.)

Table 4: R style listing of the original 348 values
Although it is good to see the original data, as in Table 4, in order for us to find the percentiles of these values we will need a sorted listing of them. We have such a listing, of the same values, in Table 5.

Table 5: R style listing of the sorted 348 values
Table 5 presents, in the R style of presenting values, an ordered list of all 348 values. The 95th percentile of these must be a value that has 95% of the values less than it. That means that we just have to find the value that sits 95% of the way through the listing: 95% of 348 = 0.95*348 = 330.6. Clearly, there is no item in position 330.6, but there is an item in position 330, namely 667, and there is an item in position 331, namely 669. Which value we choose, or what value around 668 we choose, depends on which of those 9 different rules we want to use. However, it is safe to say that nobody is really going to care all that much if we just choose 669 as the 95th percentile. As we will see later, if we ask R to compute the 95th percentile it gives us the value 668.3.

How about finding the 40th percentile? We just compute 40% of 348 = 0.4*348 = 139.2. Again, there is no item in position 139.2, but there is an item in position 139, namely 476, and there is an item in position 140, also 476. We could justifiably choose 476 as the answer, and R will also choose 476.

To find the 27th percentile we compute 27% of 348 = 0.27*348 = 93.96. There is no item in position 93.96, but there is an item in position 93, namely 444, and there is an item in position 94, namely 445. It would be reasonable to choose 445 as the answer. However, R will choose 445.69.
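The decimal answers from R come from interpolation. Unless told otherwise, quantile() uses its rule number 7, which places the percentile at position (n-1)*p + 1 in the sorted list and then moves the corresponding fraction of the way between the two neighboring values. For the 95th percentile of our 348 values that position is 347*0.95 + 1 = 330.65, which is 65% of the way from 667 to 669. The sketch below repeats that arithmetic and then checks a hand-written version of the rule against quantile() on a small made-up collection. The name quantile_type7 is hypothetical; it is not part of R.

```r
# A hand-rolled version of R's default rule (type = 7), for illustration only.
# quantile_type7 is a hypothetical helper name, not part of R.
quantile_type7 <- function(x, p) {
  s <- sort(x)
  n <- length(s)
  h <- (n - 1) * p + 1          # fractional position in the sorted list
  lo <- floor(h)                # position at or just below h
  hi <- min(lo + 1, n)          # next position, capped at the end of the list
  s[lo] + (h - lo) * (s[hi] - s[lo])   # interpolate between the two neighbors
}

# The 95th percentile position for 348 values
(348 - 1) * 0.95 + 1            # 330.65, between positions 330 and 331

# Interpolating between the values 667 and 669 found at those positions
667 + 0.65 * (669 - 667)        # 668.3

# A small made-up collection to compare against quantile()
x <- c(10, 20, 30, 40, 50)
quantile_type7(x, 0.27)
quantile(x, 0.27, names = FALSE)
```

The last two statements should print the same number, which suggests the hand-written rule matches what quantile() does by default.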

Quantiles vs. Percentiles

So far we have just talked about percentiles. The title of this page includes quantiles. What is the difference? Not much. Percentiles are given as percent values, values such as 95%, 40%, or 27%. Quantiles are given as decimal values, values such as 0.95, 0.4, and 0.27. The 0.95 quantile point is exactly the same as the 95th percentile point.

R does not work with percentiles; rather, R works with quantiles. The R command for this is quantile(). We give that function the variable holding the data we are using along with one or more decimal values. Interestingly, the quantile() function returns the desired value but labels it with a name in the form of a percentage. We will look at an example.

First, we need to get the values in our table. The R command set.seed(34211) is used to set a starting point for the pseudo-random number generator that R uses. By setting the seed value we create an environment where the subsequent generation of seemingly random values is completely determined. That way, should we or someone else want to replicate our steps, the random numbers we or they get will be exactly the same as the values we will see here. Figure 1 starts with that statement.

Figure 1

Figure 1 ends with a statement,
mylist <- round( rnorm(348, mean=500, sd=100 ) )
that generates 348 random values drawn from a normal distribution with mean 500 and standard deviation 100, so the values themselves will have a mean of approximately 500 and a standard deviation of approximately 100. Those 348 random values are then rounded to give 348 random integers. Finally, those values are assigned to the variable mylist.
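We can convince ourselves that setting the seed really does pin down the values by generating the list twice and comparing the results. The names first_run and second_run are made up for this check.

```r
# Setting the same seed twice reproduces exactly the same "random" values.
# first_run and second_run are illustrative names only.
set.seed(34211)
first_run <- round(rnorm(348, mean = 500, sd = 100))

set.seed(34211)
second_run <- round(rnorm(348, mean = 500, sd = 100))

identical(first_run, second_run)   # TRUE: the two runs match value for value
length(first_run)                  # 348
mean(first_run)                    # should be close to 500
sd(first_run)                      # should be close to 100
```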

Once defined, we can ask to see the values by using the mylist command. The result is shown in Figure 2.

Figure 2


You will notice that the values in Figure 2 are identical to those in Table 4. In fact they are identical because the text in Figure 2 was copied and placed in this web page as the data behind Table 4.

There is no need to do the actions shown in Figures 3 and 4, but doing them allows us to verify the contents of Table 5. In Figure 3 we use the sort() function to sort the values stored in mylist. We assign those sorted values to the variable mylist_sorted.

Figure 3


Then, in Figure 4, we use the command mylist_sorted to display the entire sorted collection of values.

Figure 4
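Rather than scrolling through the whole sorted listing, we can ask R for just the positions we care about. The sketch below rebuilds the data so that it runs on its own, confirms that the list is in order, and pulls out the items in positions 330 and 331.

```r
# Rebuild the data so that this block runs on its own
set.seed(34211)
mylist <- round(rnorm(348, mean = 500, sd = 100))
mylist_sorted <- sort(mylist)

is.unsorted(mylist_sorted)       # FALSE: the values are in increasing order
head(mylist_sorted)              # the six smallest values
mylist_sorted[c(330, 331)]       # the values in positions 330 and 331
```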


To actually find a percentile value for mylist we ask for the corresponding quantile by using the quantile() function. Figure 5 shows the command to get the 95th percentile of mylist, along with the resulting value.

Figure 5


Note how the quantile(mylist,.95) command produces output that is actually labeled as 95%. The value is the same 668.3 that was noted above.
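The percentage label is the name attached to the value that quantile() returns. If we want just the number, we can strip the name. The values in x below are made up for illustration.

```r
# quantile() returns a named numeric vector; the name is the percentage label.
x <- c(12, 15, 21, 30, 42, 57)      # made-up values for illustration
q95 <- quantile(x, 0.95)

names(q95)                          # "95%"
unname(q95)                         # the bare number, without the label
quantile(x, 0.95, names = FALSE)    # another way to drop the label
```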

We could give quantile() more than one value by using the c() function to combine those values into one argument as in quantile(mylist,c(.95,.40,.27)), the statement shown in Figure 6.

Figure 6


As you can see, the statement produces the percentile values that we expect.
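When we ask for several quantiles at once, the results come back in the order we asked for them, each with its own label. That lets us pick out a single result by label or by position. The values in x below are made up for illustration.

```r
# Several quantiles in one call, then selecting individual results.
x <- c(12, 15, 21, 30, 42, 57)          # made-up values for illustration
q <- quantile(x, c(0.95, 0.40, 0.27))

names(q)        # "95%" "40%" "27%", in the order requested
q["40%"]        # pick out one result by its label
q[[2]]          # or by its position, without the label
```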

We could take the idea of giving quantile() many values to a higher level. The statement quantile(mylist,seq(0.05,0.95,0.05)) asks R to compute percentile values for 5%, 10%, 15%, and so on up to 95%. The command and its related output are shown in Figure 7.

Figure 7
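It can help to look at what seq() produces before handing it to quantile(). The sketch below shows that seq(0.05,0.95,0.05) yields 19 evenly spaced values and that the resulting percentiles never decrease as the percentage goes up. The data in x is made up for illustration.

```r
# The sequence of probabilities from 5% to 95% in steps of 5%
probs <- seq(0.05, 0.95, by = 0.05)
length(probs)            # 19
probs

# Made-up data: the integers 1 through 1000
x <- 1:1000
q <- quantile(x, probs)

names(q)                 # the labels "5%", "10%", and so on up to "95%"
all(diff(q) >= 0)        # TRUE: percentiles never decrease
```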


The output in Figure 7 gives us all of the values that we requested. However, it might be nice if we could convert this to a vertical format. The statements shown in Figure 8 recompute the percentiles that we just found, but store the results in the variable qtile. Then, the statements pull out the names and the values in qtile, concluding with the creation of a data frame, that is then stored in the variable qdf.

Figure 8


Then, the statement qdf produces the vertical listing we desired.

Figure 9


Of course, the labels on the top of the values come from the names of the variables we used to create the data frame. We can use the names() function to change those titles to something more appropriate. This is done in Figure 10.

Figure 10
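As an alternative to renaming the columns after the fact, we could supply the titles when we create the data frame. This is only a sketch of another way to reach the same result, not the approach used in the figures. The values in x are made up, and check.names = FALSE is needed so that R keeps the unusual title "%-tile" exactly as typed.

```r
# Build the data frame with the column titles supplied up front.
x <- c(12, 15, 21, 30, 42, 57)                   # made-up values for illustration
qtile <- quantile(x, seq(0.05, 0.95, 0.05))

# check.names = FALSE keeps the title "%-tile" exactly as typed
qdf <- data.frame(Percent = names(qtile),
                  `%-tile` = as.numeric(qtile),
                  check.names = FALSE)

names(qdf)       # "Percent" "%-tile"
head(qdf, 3)     # the first three rows of the vertical listing
```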


Now that the data frame is defined we can use View(qdf) to produce the "pretty" output shown in Figure 11.

Figure 11


The work that we have seen so far has taken the form: here is a list of data values, now find the nth percentile of that data (using the quantile() function). We can, and often do, turn that question around. For example, with the data given in Figure 4, we might ask "What percentile is the value 432?" That is an especially nice value because 432 is not a value in Figure 4. Remembering that the values in Figure 4 are already sorted, we can see that there are 75 values that are less than 432. Therefore, since there are 348 values in the table, it makes sense to say that 75/348 ≈ 0.2155, or 21.55%, of the values are less than 432, or that 432 is at the 21.55th percentile.
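Counting by eye works, but R can do the counting for us. The sketch below uses a small made-up list so that the count is easy to check by hand; the same two statements would work on any list of values.

```r
# Percent of values strictly below a given value.
x <- c(410, 425, 440, 455, 470, 485, 500, 515)   # made-up values for illustration
value <- 432

below <- sum(x < value)          # how many values are less than 432: here 2
below / length(x) * 100          # 2 of 8 values, or 25 percent

mean(x < value) * 100            # the same result in one step
```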

This gets more complicated if the value we are using is in the table, and especially complicated if the value is repeated in the table. For example, what percentile should we assign to 528? There are 209 values less than 528 but 528 occupies positions 210, 211, 212, and 213 of the sorted list. As we might expect, there are any number of "rules" that might guide us to an answer for this kind of situation. Because there is no definitive, universally accepted rule, we can come up with one that serves our purpose. Our rule will give an answer that, in all but the most contrived situations, will be close to the answer that any of the other generally accepted rules produce. Our rule is captured in the function find_percentile(). We need to give that function the list of values (it does not even have to be a sorted list) and the value for which we want to determine a percentile. Thus the statements
 source( "../find_percentile.R") 
 find_percentile( mylist, 432)
 find_percentile( mylist, 528)
 find_percentile( mylist, 562)
will return the percentile to be assigned to the values 432, 528, and 562. This is shown in Figure 12.

Figure 12
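The file find_percentile.R is not listed on this page, so the exact rule inside find_percentile() is not shown here. Purely as a sketch, the function below uses one common convention for repeated values: count every value below the target, plus half of the values equal to it. This is an assumption and may not match what find_percentile() does. The name percentile_rank_sketch and the data in x are made up for illustration.

```r
# One common convention for ties: values below, plus half of the values equal.
# This is an assumed rule, NOT necessarily the rule used by find_percentile().
percentile_rank_sketch <- function(values, target) {
  below <- sum(values < target)      # values strictly less than the target
  equal <- sum(values == target)     # values tied with the target
  100 * (below + 0.5 * equal) / length(values)
}

x <- c(3, 5, 5, 5, 5, 8, 9, 10)      # made-up values, with 5 repeated four times

percentile_rank_sketch(x, 4)   # 1 below, 0 equal: 100 * 1/8 = 12.5
percentile_rank_sketch(x, 5)   # 1 below, 4 equal: 100 * (1 + 2)/8 = 37.5
```

Under this assumed convention, a value with 209 values below it and 4 copies of itself among 348 values would be placed at 100*(209 + 2)/348, or about 60.63.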



Here is a listing of the R statements used on this page
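# for the percentile web page
set.seed(34211)                                    # fix the seed so the values can be reproduced
mylist <- round( rnorm(348, mean=500, sd=100 ) )   # 348 rounded random values
mylist                                             # display the original values
mylist_sorted <- sort( mylist )
mylist_sorted                                      # display the sorted values
quantile( mylist, .95 )                            # the 95th percentile
quantile(mylist,c(.95,.40,.27))                    # three percentiles at once
quantile(mylist,seq(0.05,0.95,0.05))               # 5%, 10%, ..., 95%
qtile <- quantile(mylist,seq(0.05,0.95,0.05))
qnames <- names(qtile)                             # the percentage labels
qvals <- as.numeric(qtile)                         # the percentile values
qdf <- data.frame(qnames,qvals)
qdf
names(qdf) <- c("Percent","%-tile")                # more appropriate column titles
qdf
View(qdf)                                          # opens a viewer in an interactive session
source( "../find_percentile.R")                    # load find_percentile()
find_percentile( mylist, 432)
find_percentile( mylist, 528)
find_percentile( mylist, 562)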
©Roger M. Palay     Saline, MI 48176     June, 2021