Correspondence Analysis (Notes)


Correspondence Analysis (CA) is a generalized principal component analysis tailored for the analysis of qualitative data. The goal of correspondence analysis is to transform a data table into two sets of factor scores: one for rows and one for columns. The factor scores give the best representation of the similarity structure of the rows and the columns of the table. In CA, the factor scores of the rows and columns have the same variance, and therefore both rows and columns can be represented in a single map called the biplot. The technique is also known as optimal scaling, dual scaling and reciprocal averaging.

The data used for the analysis in this post can be downloaded from here.

Computations

The first step of the analysis is to transform the data matrix into a probability matrix, denoted $\mathbf{Z}$, by dividing each entry by the grand total. The row totals of $\mathbf{Z}$ are denoted $\mathbf{r}$ and the column totals of $\mathbf{Z}$ are denoted $\mathbf{c}$. The probability matrix is then double centered by subtracting $\mathbf{r} \mathbf{c}^T$ from $\mathbf{Z}$. The heatmap of this matrix is shown in figure 1. The factor scores are obtained from the generalized singular value decomposition (GSVD) of this matrix, i.e. $$\mathbf{Z} - \mathbf{r} \mathbf{c}^T = \mathbf{P \Delta Q^T}$$

From the GSVD, the row and column factor scores are obtained as:
$$\mathbf{F=D_r^{-1}P \Delta}$$ and $$\mathbf{G=D_c^{-1}Q \Delta}$$ where $\mathbf{D_c} = \mathrm{diag}\{\mathbf{c}\}$ and $\mathbf{D_r} = \mathrm{diag}\{\mathbf{r}\}$.

Fig 1. Heatmap of the double-centered probability matrix used for the GSVD in Correspondence Analysis
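The computation steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used to produce the figures: the function name `correspondence_analysis` and the `toy` table are made up for the example, and the GSVD is obtained here from an ordinary SVD of the rescaled matrix $\mathbf{D_r}^{-1/2}(\mathbf{Z} - \mathbf{r}\mathbf{c}^T)\mathbf{D_c}^{-1/2}$, which assumes the usual CA constraints $\mathbf{P}^T\mathbf{D_r}^{-1}\mathbf{P} = \mathbf{Q}^T\mathbf{D_c}^{-1}\mathbf{Q} = \mathbf{I}$.

```python
import numpy as np

def correspondence_analysis(table):
    """Sketch of CA: returns F, G, the singular values, and the masses r, c."""
    X = np.asarray(table, dtype=float)
    Z = X / X.sum()                    # probability matrix
    r = Z.sum(axis=1)                  # row totals of Z
    c = Z.sum(axis=0)                  # column totals of Z
    centered = Z - np.outer(r, c)      # double-centered matrix Z - r c^T

    # Plain SVD of the rescaled matrix stands in for the GSVD
    # (assumed constraints: P^T D_r^{-1} P = Q^T D_c^{-1} Q = I).
    S = centered / np.sqrt(np.outer(r, c))
    U, delta, Vt = np.linalg.svd(S, full_matrices=False)
    k = min(X.shape) - 1               # centering leaves one singular value at ~0
    U, delta, V = U[:, :k], delta[:k], Vt[:k].T

    P = np.sqrt(r)[:, None] * U        # generalized left singular vectors
    Q = np.sqrt(c)[:, None] * V        # generalized right singular vectors
    F = (P / r[:, None]) * delta       # F = D_r^{-1} P Delta
    G = (Q / c[:, None]) * delta       # G = D_c^{-1} Q Delta
    return F, G, delta, r, c

# Hypothetical 4 x 3 contingency table, not the orange juice data.
toy = [[20, 5, 10], [8, 15, 4], [6, 7, 18], [12, 9, 9]]
F, G, delta, r, c = correspondence_analysis(toy)
```

With this scaling, the mass-weighted variance of both `F` and `G` on each factor equals the squared singular value, which is why rows and columns can share one map.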

Eigenvalues/Variances

We examine the eigenvalues to determine the number of axes to consider in our interpretation. The Scree Plot of CA for `Orange Juice Rating` is shown in figure 2.

Fig 2. Scree Plot

From the above figure, we can see that the first three factors explain most of the variance (almost 85%). Inference analysis also found these three dimensions to be significant, and hence we consider these three dimensions in our further analysis.
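The percentages read off a scree plot come from the eigenvalues, which are the squared singular values of the decomposition. The sketch below shows one way to compute them; `inertia_table` and the `toy` table are hypothetical names for illustration, and the numbers it produces are unrelated to the 85% reported for the orange juice data.

```python
import numpy as np

def inertia_table(table):
    """Eigenvalues, percentage of variance per factor, and cumulative percentage."""
    X = np.asarray(table, dtype=float)
    Z = X / X.sum()
    r, c = Z.sum(axis=1), Z.sum(axis=0)
    S = (Z - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    # Singular values only; the last one is ~0 because of the centering.
    delta = np.linalg.svd(S, compute_uv=False)[: min(X.shape) - 1]
    eigenvalues = delta ** 2
    percent = 100 * eigenvalues / eigenvalues.sum()
    return eigenvalues, percent, np.cumsum(percent)

# Hypothetical contingency table used only for illustration.
toy = [[20, 5, 10], [8, 15, 4], [6, 7, 18], [12, 9, 9]]
eigenvalues, percent, cumulative = inertia_table(toy)
for i, (ev, p, cp) in enumerate(zip(eigenvalues, percent, cumulative), start=1):
    print(f"factor {i}: eigenvalue={ev:.4f}  explained={p:.1f}%  cumulative={cp:.1f}%")
```

The eigenvalues sum to the total inertia of the table, which equals the chi-square statistic divided by the grand total.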

Elements Important for a Dimension

In CA, the rows and the columns of the table play similar roles, and hence we can use the same statistics to identify the rows and the columns important for a given dimension. To examine the importance of an element we look at its contribution, which is the ratio of its squared factor score, weighted by its mass, to the eigenvalue of that factor. For column $j$ and factor $\ell$ this is $c_j g_{j,\ell}^2 / \lambda_\ell$, so the contributions to a given factor sum to 1. The contribution of columns to the first three components is shown in figure 3.

Fig 3. Contribution of columns

From the above figure we can see that the variables sweet, sour, artificial and bitter contribute most to component 1; the variables dark.orange, cooked flavor and sweetness, over.ripe and mixed.fruit contribute most to component 2; and the variables pulpy, sparkling and dilute contribute most to component 3.
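Column contributions like those plotted in figure 3 can be computed as sketched below. The sketch assumes the mass-weighted definition from the Abdi and Williams reference (column mass times squared factor score, divided by the eigenvalue); `column_contributions` and the `toy` table are hypothetical names, not the orange juice data.

```python
import numpy as np

def column_contributions(table):
    """Contribution of each column (rows of the result) to each factor (columns)."""
    X = np.asarray(table, dtype=float)
    Z = X / X.sum()
    r, c = Z.sum(axis=1), Z.sum(axis=0)
    S = (Z - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    _, delta, Vt = np.linalg.svd(S, full_matrices=False)
    k = min(X.shape) - 1                      # drop the ~0 singular value
    delta, V = delta[:k], Vt[:k].T
    G = (V / np.sqrt(c)[:, None]) * delta     # column factor scores
    # Mass-weighted squared score over the eigenvalue of the factor.
    return c[:, None] * G ** 2 / delta ** 2

# Hypothetical contingency table used only for illustration.
toy = [[20, 5, 10], [8, 15, 4], [6, 7, 18], [12, 9, 9]]
contributions = column_contributions(toy)
# One common heuristic: flag elements whose contribution exceeds the average 1/J.
important = contributions > 1.0 / contributions.shape[0]
```

Because the contributions to a factor sum to 1, comparing each one against the average `1/J` (with `J` the number of columns) gives a simple cut-off; other thresholds are possible.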

Interpreting Factor Map

In a CA map, when two row (respectively column) points are close to each other, they have similar profiles, and when two points have the same profile, they are located at exactly the same place. The proximity between a row point and a column point, however, cannot be interpreted in this plot. This map is called a symmetric plot. The symmetric plot of column elements is shown in figure 4.

Fig 4. Symmetric Plot

By observing the above symmetric plot we can make the following conclusions:

  • Component 1 contrasts the sweetness factor (sweet and candy vs sour and bitter), make (artificial vs natural) and smell (fruity vs floral) of the juice.
  • Component 2 contrasts the colour (dark orange vs dark yellow), flavor (orange vs cooked) and taste (citrus vs honey) of the juice.
  • Component 3 accounts for the concentration (dilute vs concentrated) and sparkling (sparkling vs natural) of the juice.
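The rule used to read the symmetric plot, that points with the same profile land at the same place, can be checked numerically. In the hypothetical table below (not the orange juice data), the last column is exactly twice the first, so the two columns have identical profiles; `column_scores` is an illustrative helper name.

```python
import numpy as np

def column_scores(table, tol=1e-12):
    """Column factor scores G, keeping only factors with a non-zero singular value."""
    X = np.asarray(table, dtype=float)
    Z = X / X.sum()
    r, c = Z.sum(axis=1), Z.sum(axis=0)
    S = (Z - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    _, delta, Vt = np.linalg.svd(S, full_matrices=False)
    keep = delta > tol                         # discard numerically empty factors
    return (Vt[keep].T / np.sqrt(c)[:, None]) * delta[keep]

# Hypothetical table: column 3 is 2 x column 0, so their profiles are identical.
toy = [[10, 4, 7, 20],
       [6, 9, 3, 12],
       [3, 12, 8, 6],
       [8, 5, 11, 16],
       [5, 7, 2, 10]]
G = column_scores(toy)
same_place = np.allclose(G[0], G[3])           # identical profiles, identical coordinates
```

Columns 0 and 3 receive the same coordinates on every factor, while a column with a different profile, such as column 1, sits elsewhere on the map.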

Interpreting Row and Column Proximity

The proximity of a row point and a column point cannot be interpreted from a standard symmetric plot. To make it interpretable, we rescale the column factor scores so that they no longer include the singular values: $$\mathbf{\hat{G}} = \mathbf{D_c^{-1} Q}$$

In the asymmetric plot obtained with $\mathbf{F}$ and $\mathbf{\hat{G}}$, the distance from a row point to a column point reflects their association. An asymmetric biplot of our dataset is shown in figure 5.

Fig 5. Asymmetric Biplot

From the above biplot the following conclusions can be made:

  • Minute Made, Goldenpan and Biley are sweet and artificial and have a fruity smell, whereas Malee, Tipco and UFC are sour and natural and have a floral smell.
  • Malee, Tipco and some variations of UFC have a dark yellow colour, whereas Unif and some variations of UFC have a dark orange colour.
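The asymmetric scaling $\mathbf{\hat{G}} = \mathbf{D_c^{-1} Q}$ behind such a biplot can be sketched as follows. The sketch also checks a property that helps explain why row-to-column distances become meaningful: each row point in $\mathbf{F}$ is the average of the column points in $\mathbf{\hat{G}}$, weighted by that row's profile. The names `asymmetric_scores` and `toy` are hypothetical, and the table is not the orange juice data.

```python
import numpy as np

def asymmetric_scores(table):
    """Row scores F, rescaled column scores G_hat, singular values, column masses, row profiles."""
    X = np.asarray(table, dtype=float)
    Z = X / X.sum()
    r, c = Z.sum(axis=1), Z.sum(axis=0)
    S = (Z - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, delta, Vt = np.linalg.svd(S, full_matrices=False)
    k = min(X.shape) - 1                       # drop the ~0 singular value
    U, delta, V = U[:, :k], delta[:k], Vt[:k].T
    Q = np.sqrt(c)[:, None] * V                # generalized right singular vectors
    F = (U / np.sqrt(r)[:, None]) * delta      # F = D_r^{-1} P Delta
    G_hat = Q / c[:, None]                     # G_hat = D_c^{-1} Q
    row_profiles = Z / r[:, None]              # each row of Z divided by its total
    return F, G_hat, delta, c, row_profiles

# Hypothetical contingency table used only for illustration.
toy = [[20, 5, 10], [8, 15, 4], [6, 7, 18], [12, 9, 9]]
F, G_hat, delta, c, row_profiles = asymmetric_scores(toy)
barycentric = np.allclose(F, row_profiles @ G_hat)
```

A row therefore lies close to the columns that dominate its profile, which is the sense in which proximity reflects association in this plot. Multiplying `G_hat` by `delta` recovers the symmetric-plot column scores.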

References

  1. Abdi, H. & Bera, M. (2018). Correspondence analysis. In R. Alhajj and J. Rokne (Eds.), Encyclopedia of Social Network Analysis and Mining (2nd Edition). New York: Springer Verlag.
  2. Abdi, H. & Williams, L.J. (2010). Correspondence analysis. In N.J. Salkind, D.M. Dougherty, & B. Frey (Eds.), Encyclopedia of Research Design. Thousand Oaks (CA): Sage. pp. 267-278.