{"id":166,"date":"2014-07-22T18:31:52","date_gmt":"2014-07-22T18:31:52","guid":{"rendered":"http:\/\/sonny..ogi.edu\/~kgorman\/blog\/?p=166"},"modified":"2014-07-22T18:31:52","modified_gmt":"2014-07-22T18:31:52","slug":"a-tutorial-on-contingency-tables","status":"publish","type":"post","link":"https:\/\/www.wellformedness.com\/blog\/a-tutorial-on-contingency-tables\/","title":{"rendered":"A tutorial on contingency tables"},"content":{"rendered":"<p>Many results in science and medicine can be compactly represented as a table containing\u00a0the co-occurrence frequencies of two or more discrete random variables. This data structure is\u00a0called the\u00a0<em>contingency table<\/em> (a name suggested by <a title=\"Karl Pearson\" href=\"http:\/\/en.wikipedia.org\/wiki\/Karl_Pearson\">Karl Pearson<\/a>\u00a0in 1904). This tutorial will\u00a0cover\u00a0descriptive and inferential statistics that can be used on the simplest form of contingency table, in which both variables are binary (and thus the table is 2&#215;2).<\/p>\n<p>Let&#8217;s begin by taking a look at a real-world example: graduate school admissions in a single department at UC\u00a0Berkeley in 1973. 
(This is part of a famous<a title=\"Berkeley gender bias case\" href=\"http:\/\/en.wikipedia.org\/wiki\/Simpson's_paradox#Berkeley_gender_bias_case\">\u00a0real-world example<\/a> which may be of interest to the\u00a0reader.)\u00a0Our dependent variable is &#8220;admitted&#8221; or &#8220;rejected&#8221;, and we&#8217;ll use applicant gender\u00a0as an independent variable.<\/p>\n<table>\n<tbody>\n<tr>\n<td><\/td>\n<td>Admitted<\/td>\n<td>Rejected<\/td>\n<\/tr>\n<tr>\n<td>Male<\/td>\n<td>120<\/td>\n<td>205<\/td>\n<\/tr>\n<tr>\n<td>Female<\/td>\n<td>202<\/td>\n<td>391<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>I can scarcely look at this table without seeing the inevitable question: are admissions in this department gender-biased?<\/p>\n<p><strong>Odds ratios<\/strong><\/p>\n<p>37% (= 120 \/ 325) of the male applicants were admitted, and\u00a034% (= 202 \/ 593) of the female applicants were. Is that 3% difference a meaningful one? It is tempting to focus on &#8220;3%&#8221;, but the researcher should avoid this temptation. The magnitude of the difference between admission rates in the two groups (defined by the independent variable) is very sensitive to the <em>base rate<\/em>, in this case the\u00a0overall admission rate. Intuitively, if 2% of males were admitted and only 1% of females, we would definitely consider the possibility that admissions are gender-biased: we would estimate that males are\u00a0<em>twice<\/em> as likely to be admitted as females.\u00a0But we would be much less likely to say there is an admissions bias if those percentages were 98% and 99%. Yet,\u00a0in both scenarios the admission rates differ by exactly 1%.<\/p>\n<p>A better way to quantify the effect of gender on admissions\u2014a method that is insensitive to the overall admission rate\u2014is the\u00a0<em>odds ratio<\/em>. The name is nearly self-explanatory if you are familiar with the notion of\u00a0<em>odds<\/em>. 
The odds of some event occurring with probability <em>P<\/em>\u00a0are simply\u00a0<em>P <\/em>\/ (1 \u2013\u00a0<em>P<\/em>). In our example\u00a0above, the odds of\u00a0admission for a male applicant are\u00a00.5854 (= 120 \/ 205), and 0.5166 (= 202 \/ 391) for a female applicant. The ratio of these two is 1.1331 (= 0.5854 \/ 0.5166, up to rounding). As\u00a0this ratio is greater than one, we say that maleness was\u00a0<em>associated<\/em> with admission. This is not\u00a0enough to establish bias: it simply means that males were somewhat more likely to be admitted than females.<\/p>\n<p><strong>Tests for association<\/strong><\/p>\n<p>We can now return to the original question: is this table likely to have arisen if there is in fact no gender bias in admissions? <a title=\"Pearson's chi-squared test\" href=\"http:\/\/en.wikipedia.org\/wiki\/Pearson%27s_chi-squared_test\">Pearson&#8217;s chi-squared test<\/a>\u00a0estimates the probability of the\u00a0observed contingency table under the null hypothesis that there is no association between\u00a0the two variables;\u00a0see <a href=\"http:\/\/math.hws.edu\/javamath\/ryan\/ChiSquare.html\">here<\/a> for a\u00a0worked example. We reject this null hypothesis when the probability is sufficiently small (often at <em>P<\/em> &lt; 0.05). For this table,\u00a0<em>\u03c7<\/em><sup>2<\/sup> = 0.6332 (with Yates&#8217; continuity correction), and\u00a0the probability of the data under the null hypothesis (no association between gender and admission rate) is\u00a0<em>P<\/em>(<em>\u03c7<\/em><sup>2<\/sup>) = 0.4262. 
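These figures can be reproduced in a few lines of plain Python; this is a sketch for checking the arithmetic, not a substitute for a statistics library. It assumes the reported statistic includes Yates' continuity correction (which matches the number above), and uses the fact that with one degree of freedom the chi-squared survival function reduces to the complementary error function.

```python
from math import erfc, sqrt

# The Berkeley department table: rows = gender, columns = admitted/rejected.
table = [[120, 205],   # male
         [202, 391]]   # female
(a, b), (c, d) = table

# Odds ratio: odds of admission for males divided by odds for females.
odds_ratio = (a / b) / (c / d)
print(round(odds_ratio, 4))  # 1.1331

# Pearson's chi-squared statistic with Yates' continuity correction.
n = a + b + c + d
row = (a + b, c + d)
col = (a + c, b + d)
chi2 = 0.0
for i, observed in enumerate((a, b, c, d)):
    expected = row[i // 2] * col[i % 2] / n
    chi2 += (abs(observed - expected) - 0.5) ** 2 / expected

# With one degree of freedom, the chi-squared survival function
# reduces to the complementary error function.
p = erfc(sqrt(chi2 / 2))
print(round(chi2, 4), round(p, 4))  # 0.6332 0.4262
```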
So, we&#8217;d probably say the observed difference in admission rates was not sufficient to\u00a0establish that females were\u00a0less likely to be admitted than males in this department.<\/p>\n<p>The chi-squared test for contingency tables depends on an approximation\u00a0which is asymptotically valid,\u00a0but inadequate for small samples; a popular (albeit arbitrary) rule of thumb is that a sample is &#8220;small&#8221; if any of the four cells has fewer than 5 observations. The best\u00a0solution is to use an alternative,\u00a0<a title=\"Fisher's exact test\" href=\"http:\/\/en.wikipedia.org\/wiki\/Fisher%27s_exact_test\">Fisher&#8217;s exact test<\/a>; as\u00a0the name suggests, it provides an exact <em>p<\/em>-value. Rather than working with the <em>\u03c7<\/em><sup>2<\/sup> statistic, the Fisher test takes as its null hypothesis that the true (population) odds ratio is equal to 1.<\/p>\n<h1>Accuracy, precision, and recall<\/h1>\n<p>In the above example, we attempted to measure the association of two random variables which represented different constructs (i.e., admission status and gender). Contingency tables can also be used to look at random variables which are in some sense imperfect\u00a0measures of the same underlying\u00a0construct. In a machine learning context, one variable might represent\u00a0the predictions of a binary classifier, and\u00a0the other the labels taken from the &#8220;oracle&#8221;, the trusted data source the classifier is intended\u00a0to approximate. Such tables are sometimes known as\u00a0<em>confusion matrices<\/em>. 
Convention holds that one of the two outcomes should be labeled (arbitrarily if necessary) as &#8220;hit&#8221; and the other as &#8220;miss&#8221;\u2014often the &#8220;hit&#8221; is the one which requires further attention whereas the &#8220;miss&#8221; can be ignored\u2014and that the following labels be assigned to the four cells of the confusion matrix:<\/p>\n<table>\n<tbody>\n<tr>\n<td>Prediction \/ Oracle<\/td>\n<td>Hit<\/td>\n<td>Miss<\/td>\n<\/tr>\n<tr>\n<td>Hit<\/td>\n<td>True positive (TP)<\/td>\n<td>False positive (FP)<\/td>\n<\/tr>\n<tr>\n<td>Miss<\/td>\n<td>False negative (FN)<\/td>\n<td>True negative (TN)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The prediction labels are on the rows, corresponding to the independent variable\u2014gender\u2014in the Berkeley example, and the oracle labels are on the columns, corresponding to the dependent variable, admission status.<\/p>\n<p>When both variables of the 2&#215;2 table measure the same construct, we start with the assumption that the two random variables are associated, and instead measure <em>agreement.<\/em>\u00a0The simplest measure of agreement is\u00a0<em>accuracy<\/em>, which is the probability that an observation will be correctly classified.<\/p>\n<p style=\"text-align: center;\">accuracy = (<em>TP + TN<\/em>) \/ (<em>TP<\/em> + <em>FP<\/em> + <em>FN<\/em> + <em>TN<\/em>)<\/p>\n<p style=\"text-align: left;\">Accuracy is not always the most informative measure, for the same reason that differences in probabilities were not informative above: accuracy neglects the\u00a0<em>base rate <\/em>(or\u00a0<em>prevalence<\/em>). Consider, for example, the plight of the much-maligned <a title=\"TSA\" href=\"http:\/\/en.wikipedia.org\/wiki\/Transportation_Security_Administration#Screening_effectiveness\">Transportation Security Administration<\/a>\u00a0(TSA). Very, very\u00a0few airline passengers attempt\u00a0to commit a terrorist attack during their flight. 
Since 9\/11, there have only been\u00a0two documented attempted terrorist attacks by passengers on commercial airlines, one by\u00a0<a title=\"shoe bomber\" href=\"http:\/\/en.wikipedia.org\/wiki\/Richard_Reid\">Richard Reid<\/a> (the &#8220;shoe bomber&#8221;) and one by\u00a0<a title=\"underwear bomber\" href=\"http:\/\/en.wikipedia.org\/wiki\/Umar_Farouk_Abdulmutallab\">Umar Farouk Abdulmutallab<\/a> (the &#8220;underwear bomber&#8221;), and in both cases, these attempts\u00a0were thwarted by some combination of the attentive citizenry and incompetent attacker,\u00a0not by the TSA&#8217;s <a title=\"Security theater\" href=\"http:\/\/en.wikipedia.org\/wiki\/Security_theater\">security theater<\/a>.\u00a0There are approximately 650,000,000\u00a0passenger-flights per year on US flights, so according to my back-of-envelope calculation, there have been around 7 billion passenger-flights\u00a0since 9\/11. In the meantime, the TSA could have\u00a0achieved sky-high accuracy simply by\u00a0making no arrests at all. (I have, of course, ignored the possibility that security theater serves as a deterrent to terrorist activity.) A corollary is that the TSA faces what is called the <a title=\"false positive paradox\" href=\"https:\/\/en.wikipedia.org\/wiki\/False_positive_paradox\">false positive paradox<\/a>: a false positive (say, detaining a law-abiding citizen) is much more likely than a true positive (catching a terrorist). The TSA isn&#8217;t alone: a<a title=\"Cascells et al. 1978\" href=\"http:\/\/europepmc.org\/abstract\/MED\/692627\">\u00a0famous paper<\/a> found that few physicians used the base rate (&#8220;prevalence&#8221;)\u00a0when estimating the likelihood that a patient has a particular disease, given that they had a positive result on a test.<\/p>\n<p style=\"text-align: left;\">To better account for the role of base rate, we can break accuracy down into its constituent parts. 
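To put a number on how hollow raw accuracy can be here, consider a hypothetical screener that simply never flags anyone. This is a back-of-envelope sketch using the rough figures above (roughly 7 billion passenger-flights, 2 of which involved actual attackers); the counts are the post's approximations, not official statistics.

```python
# A screener that never flags anyone: no true or false positives.
tp, fp = 0, 0
fn = 2                      # the two attackers, both missed
tn = 7_000_000_000 - 2      # every other passenger-flight, correctly ignored

accuracy = (tp + tn) / (tp + fp + fn + tn)
print(round(accuracy, 10))  # 0.9999999997
```

Despite doing literally nothing, the screener is accurate to nine nines, which is why accuracy alone says almost nothing when the base rate is extreme.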
The best known of these measures is\u00a0<em>precision <\/em>(or\u00a0<em>positive predictive value<\/em>),\u00a0which is defined as the probability that a predicted\u00a0hit is correct.<\/p>\n<p style=\"text-align: center;\">precision = <em>TP<\/em> \/ (<em>TP<\/em> + <em>FP<\/em>)<\/p>\n<p style=\"text-align: left;\">Precision isn&#8217;t completely free of the base rate problem, however; it fails to penalize false negatives. For this, we turn to\u00a0<em>recall<\/em> (or\u00a0<em>sensitivity<\/em>, or\u00a0<em>true positive rate<\/em>), which is the probability that a true hit is correctly discovered.<\/p>\n<p style=\"text-align: center;\">recall =\u00a0<em>TP <\/em>\/ (<em>TP<\/em> + <em>FN<\/em>)<\/p>\n<p style=\"text-align: left;\">It is difficult to improve precision without sacrificing recall, or vice versa. Consider, for example, an\u00a0<a title=\"information retrieval\" href=\"http:\/\/en.wikipedia.org\/wiki\/Information_retrieval\">information retrieval<\/a>\u00a0(IR) application, which takes natural language queries as input and attempts to return\u00a0all documents relevant for the query. Internally, the IR system ranks all documents for relevance to the query, then returns the top\u00a0<em>n<\/em>. A document which is relevant and returned by the system is a true positive, a document which is irrelevant but returned by the system is a false positive, and a document which is relevant but not returned by the system is a false negative (we&#8217;ll ignore true negatives for the time being). With this system, we can achieve\u00a0<em>perfect<\/em>\u00a0recall by returning all documents, no matter what the query is, though the precision will be very poor. It is often helpful for\u00a0<em>n<\/em>, the number of documents retrieved, to vary as a function of the query; in a sports news database, for instance, there are simply more documents about the New York Yankees than about the congenitally mediocre\u00a0<a title=\"St. 
Louis Blues\" href=\"http:\/\/en.wikipedia.org\/wiki\/St._Louis_Blues\">St. Louis Blues<\/a>. We can maximize precision by reducing the average\u00a0<em>n\u00a0<\/em>for queries, but this will also reduce\u00a0recall, since there will be more false negatives.<\/p>\n<p style=\"text-align: left;\">To quantify the tradeoff between precision and recall, it is conventional to use the harmonic mean of precision and recall.<\/p>\n<p style=\"text-align: center;\"><em>F<\/em>1 = (2 \u00b7 precision \u00b7 recall) \/ (precision +\u00a0recall)<\/p>\n<p style=\"text-align: left;\">This measure is also known as\u00a0the <em>F<\/em>-score (or\u00a0<em>F<\/em>-measure), though it is properly called\u00a0<em>F<\/em>1, since an\u00a0<em>F<\/em>-score need not weigh precision and recall equally. In many applications, however, the real-world costs of false positives and false negatives are not equivalent. In the context of screening for serious illness, a\u00a0false positive\u00a0would simply lead to further testing, whereas a false negative\u00a0could be fatal; consequently, recall is more important than precision. On the other hand, when the resources necessary\u00a0to derive value from true positives are limited\u00a0(such as in\u00a0fraud detection), false negatives are considered more acceptable than false positives, and so precision is ranked above recall.<\/p>\n<p style=\"text-align: left;\">Another thing to note about <em>F<\/em>1: the harmonic mean of two positive numbers is always closer to the smaller of the two. 
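This property is easy to check numerically. A sketch with made-up precision and recall values:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# F1 sits much nearer the smaller of the two values:
print(round(f1(0.9, 0.3), 4))  # 0.45 -- closer to 0.3 than to 0.9

# Improving the smaller term pays off more than improving the larger one:
print(round(f1(0.9, 0.4), 4))  # 0.5538
print(round(f1(1.0, 0.3), 4))  # 0.4615
```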
So, if you want to maximize <em>F<\/em>1, the best place to start is to increase\u00a0whichever of the two terms (precision and recall) is smaller.<\/p>\n<p style=\"text-align: left;\">To make this all a bit more concrete,\u00a0consider the following 2&#215;2 table.<\/p>\n<table>\n<tbody>\n<tr>\n<td>Prediction \/ Oracle<\/td>\n<td>Hit<\/td>\n<td>Miss<\/td>\n<\/tr>\n<tr>\n<td>Hit<\/td>\n<td>10<\/td>\n<td>2<\/td>\n<\/tr>\n<tr>\n<td>Miss<\/td>\n<td>5<\/td>\n<td>20<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p style=\"text-align: left;\">We can see that false\u00a0negatives are somewhat more common than false\u00a0positives, so we could have predicted that precision (0.8333) would be somewhat greater than recall (0.6667), and that <em>F<\/em>1 would be somewhat closer to the latter (0.7407).<\/p>\n<p style=\"text-align: left;\">This of course does not exhaust the space of possible summary statistics of a 2&#215;2 confusion matrix: see <a title=\"Confusion matrix\" href=\"http:\/\/en.wikipedia.org\/wiki\/Confusion_matrix#Table_of_confusion\">Wikipedia<\/a> for more.<\/p>\n<p style=\"text-align: left;\"><strong>Response bias<\/strong><\/p>\n<p>It is sometimes useful to directly quantify\u00a0predictor\u00a0<em>bias<\/em>, which can be thought of as a signed measure representing the degree to which\u00a0the prediction system&#8217;s base rate differs from the true base rate. A positive bias indicates that the system predicts\u00a0&#8220;hit&#8221; more often than it should were it hewing to the true base rate, and a negative bias indicates that &#8220;hit&#8221; is guessed less often than the true base rate would suggest. 
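For the worked confusion matrix above, the sign of this bias can be read off by comparing the predicted and true "hit" rates. A quick sketch of the idea just described (the variable names are mine):

```python
# The worked confusion matrix: rows = prediction, columns = oracle.
tp, fp, fn, tn = 10, 2, 5, 20
n = tp + fp + fn + tn

predicted_base_rate = (tp + fp) / n   # how often the system says "hit"
true_base_rate = (tp + fn) / n        # how often the oracle says "hit"

bias_sign = predicted_base_rate - true_base_rate
print(round(bias_sign, 4))  # -0.0811: the system predicts "hit" too rarely
```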
One conventional measure of bias is\u00a0<em>B<sub>d<\/sub>&#8221;<\/em>, which is a function of recall and the\u00a0<em>false alarm rate<\/em>\u00a0(FAR, also known as the false positive rate), defined as follows.<\/p>\n<p style=\"text-align: center;\">FAR = <em>FP<\/em>\u00a0\/\u00a0(<em>TN<\/em> + <em>FP<\/em>)<\/p>\n<p style=\"text-align: left;\"><em>B<sub>d<\/sub>&#8221;<\/em> has a rather unwieldy formula; it is<\/p>\n<p style=\"text-align: center;\"><em>B<sub>d<\/sub>&#8221;<\/em> =\u00a0[(recall\u00a0\u00b7 (1 \u2013\u00a0recall)) \u2013\u00a0(FAR\u00a0\u00b7 (1 \u2013\u00a0FAR))] \/ [(recall\u00a0\u00b7 (1 \u2013\u00a0recall)) + (FAR\u00a0\u00b7 (1 \u2013\u00a0FAR))]<\/p>\n<p style=\"text-align: left;\">when recall \u2265 FAR and<\/p>\n<p style=\"text-align: center;\"><em>B<sub>d<\/sub>&#8221;<\/em>\u00a0=\u00a0[(FAR \u00b7 (1 \u2013 FAR)) \u2013\u00a0(recall \u00b7 (1 \u2013 recall))] \/ [(recall\u00a0\u00b7 (1 \u2013\u00a0recall)) + (FAR\u00a0\u00b7 (1 \u2013\u00a0FAR))]<\/p>\n<p style=\"text-align: left;\">otherwise (i.e., when FAR &gt; recall).<\/p>\n<p style=\"text-align: left;\">You may also be familiar with\u00a0<em>\u03b2<\/em>, a parametric measure of bias, but there does not seem to be anything to\u00a0recommend it over\u00a0<em>B<sub>d<\/sub>&#8221;<\/em>, which makes fewer assumptions\u00a0(see\u00a0Donaldson 1992 and citations therein).<\/p>\n<p style=\"text-align: left;\"><strong>Cohen&#8217;s\u00a0<em>\u039a<\/em><\/strong><\/p>\n<p>Cohen&#8217;s\u00a0<em>\u039a <\/em>(&#8220;kappa&#8221;)\u00a0is a statistical measure of\u00a0<em>interannotator agreement<\/em>\u00a0which works on 2&#215;2 tables. Unlike other measures we have reviewed so far, it is adjusted for\u00a0the percentage of agreement that would\u00a0occur by chance.\u00a0<em>\u039a<\/em>\u00a0is computed from two terms. The first,\u00a0<em>P<\/em>(<em>a<\/em>), is the observed probability of agreement, which is given by the same formula as\u00a0accuracy. 
The second,\u00a0<em>P<\/em>(<em>e<\/em>), is the probability of agreement due to chance. Let\u00a0<em>P<sub>x\u00a0<\/sub><\/em>and <em>P<sub>y<\/sub><\/em>\u00a0be the probability of a &#8220;yes&#8221; or &#8220;hit&#8221; answer from annotator\u00a0<em>x<\/em> and\u00a0<em>y<\/em>, respectively. Then, <em>P<\/em>(<em>e<\/em>) is<\/p>\n<p style=\"text-align: center;\"><em>P<sub>x<\/sub><\/em> <em>P<sub>y<\/sub><\/em> + (1 \u2013\u00a0<em>P<sub>x<\/sub><\/em>) (1 \u2013\u00a0<em>P<sub>y<\/sub><\/em>)<\/p>\n<p style=\"text-align: left;\">and<em> \u039a\u00a0<\/em>is then given by<\/p>\n<p style=\"text-align: center;\">[<em>P<\/em>(<em>a<\/em>) \u2013\u00a0<em>P<\/em>(<em>e<\/em>)] \/ [1 \u2013\u00a0<em>P<\/em>(<em>e<\/em>)] .<\/p>\n<p style=\"text-align: left;\">For the previous 2&#215;2 table,\u00a0<em>\u039a<\/em> = 0.5947; but what does this mean?\u00a0<em>\u039a<\/em>\u00a0is usually\u00a0interpreted with reference to conventional\u2014but entirely\u00a0arbitrary\u2014guidelines. One of the best known of these is due to Landis and Koch (1977), who propose 0\u20130.20 as &#8220;slight&#8221;, 0.21\u20130.40 as &#8220;fair&#8221;, 0.41\u20130.60 as &#8220;moderate&#8221;, 0.61\u20130.80 as &#8220;substantial&#8221;, and 0.81\u20131 as &#8220;almost perfect&#8221; agreement.\u00a0<em>\u039a<\/em> has a known statistical distribution, so it is also possible to test the null hypothesis that the observed agreement is entirely due to chance. This\u00a0test is rarely performed or reported, however, as the null\u00a0hypothesis is exceptionally unlikely to be true in real-world annotation scenarios.<\/p>\n<p style=\"text-align: left;\">(h\/t: <a title=\"Steven Bedrick\" href=\"http:\/\/www.bedrick.org\">Steven Bedrick<\/a>.)<\/p>\n<h1 style=\"text-align: left;\">References<\/h1>\n<p>A. Agresti. 2002.\u00a0<em>Categorical data analysis.\u00a0<\/em>Hoboken, NJ: Wiley.<br \/>\nW. Donaldson. 1992. 
Measuring recognition memory.\u00a0<em>Journal of Experimental Psychology: General<\/em>\u00a0121(3): 275\u2013277.<br \/>\nJ.R. Landis &amp; G.G. Koch. 1977. The measurement of observer agreement for categorical data.\u00a0<em>Biometrics <\/em>33(1): 159\u2013174.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Many results in science and medicine can be compactly represented as a table containing\u00a0the co-occurrence frequencies of two or more discrete random variables. This data structure is\u00a0called the\u00a0contingency table (a name suggested by Karl Pearson\u00a0in 1904). This tutorial will\u00a0cover\u00a0descriptive and inferential statistics that can be used on the simplest form of contingency table, in which &hellip; <a href=\"https:\/\/www.wellformedness.com\/blog\/a-tutorial-on-contingency-tables\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;A tutorial on contingency tables&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","footnotes":""},"categories":[10],"tags":[],"class_list":["post-166","post","type-post","status-publish","format-standard","hentry","category-stats"],"_links":{"self":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/166","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/comments?post=166"}],"version-history":[{"count":0,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/166\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/media?parent=166"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/categories?post=166"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/tags?post=166"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}