Readability vs Cosine Similarity
- Most of the graphs show a general cone/pyramid pattern: the lower the cosine similarity, the wider the spread between the readability metric values of Document 1 and Document 2 (referred to below as D1 and D2). As cosine similarity increases, the readability values of the two documents converge.
- For further research, manually inspecting document pairs with high cosine similarity but dissimilar D1 and D2 values could yield more insight into the relationship between cosine similarity and document readability scores.
- Documents with similar readability appear to have higher cosine similarity. Since readability formulas take sentence and document structure into account, this could imply that word-frequency vectors also carry some information about sentence structure.
- Higher readability also appears to go with higher similarity: the easier a document is to read, the more it seems to have in common with other documents.
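The cone pattern described above can be checked numerically by binning document pairs by cosine similarity and measuring the average readability gap |D1 - D2| in each bin. The following is a minimal sketch, not the notebook's own code: it assumes plain whitespace tokenization and raw word counts, and the names `cosine_similarity` and `gap_by_similarity_bin` are hypothetical helpers introduced for illustration.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts using raw word-count vectors.

    Assumption: whitespace tokenization and lowercasing only.
    """
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def gap_by_similarity_bin(pairs, n_bins=10):
    """Mean |D1 - D2| per cosine-similarity bin.

    `pairs` is an iterable of (similarity, d1, d2) tuples, where d1 and d2
    are the same readability metric computed on each document of the pair.
    Returns {bin_index: mean_gap} for non-empty bins only.
    """
    gaps = {}
    for similarity, d1, d2 in pairs:
        # Clamp so that similarity == 1.0 falls in the last bin.
        index = min(int(similarity * n_bins), n_bins - 1)
        gaps.setdefault(index, []).append(abs(d1 - d2))
    return {index: sum(v) / len(v) for index, v in sorted(gaps.items())}
```

If the cone pattern holds, the mean gap should shrink as the bin index rises. Pairs that land in a high-similarity bin but still show a large gap are the ones worth inspecting manually.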
Principal Component Analysis (PCA)
PCA compresses a high-dimensional matrix into a lower-dimensional one. We will attempt to use PCA to visualize the distribution of readability scores across our corpus. The readability scores will be reduced to three components (pc1, pc2, and pc3) and then clustered using KMeans. A sample of points from each cluster will be plotted in a 3-dimensional plot.
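The pipeline described above could look roughly like the sketch below, written with scikit-learn. Several details are assumptions rather than anything stated in this notebook: the scores are standardized before PCA, the number of clusters is 4, and 50 points are sampled per cluster. The function name `project_and_cluster` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def project_and_cluster(scores, n_clusters=4, sample_per_cluster=50, seed=0):
    """Reduce a (documents x metrics) matrix to 3 components and cluster it.

    Returns the 3-D coordinates (pc1, pc2, pc3), the cluster label of every
    document, the row indices sampled for plotting, and the fraction of
    variance explained by each component.
    """
    # Assumption: metrics are on different scales, so standardize first.
    scaled = StandardScaler().fit_transform(scores)

    pca = PCA(n_components=3, random_state=seed)
    components = pca.fit_transform(scaled)

    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(components)

    # Draw up to `sample_per_cluster` rows from each cluster without replacement.
    rng = np.random.default_rng(seed)
    sampled = []
    for cluster in range(n_clusters):
        members = np.flatnonzero(labels == cluster)
        take = min(sample_per_cluster, members.size)
        sampled.append(rng.choice(members, size=take, replace=False))
    sample_idx = np.concatenate(sampled)

    return components, labels, sample_idx, pca.explained_variance_ratio_
```

The sampled rows `components[sample_idx]` can then be passed to a 3-D scatter plot, colored either by `labels[sample_idx]` or by a single readability metric. Checking the explained variance ratio first is worthwhile, since a low total means the 3-D plot hides much of the structure.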
Interesting observation: PCA is essentially summarizing the entire readability matrix in a 3-dimensional plot, with each component being a weighted combination of all the readability metrics. When the plot is colored by FleschReadingEase score, some of the most extreme outliers are not the darkest bubbles, indicating a difference between Flesch and the other readability scores.
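One way to probe this observation is to look at the component loadings, that is, the weight each readability metric receives in pc1, pc2, and pc3. A metric whose loading on a component is small, or opposite in sign to the others, contributes differently to that axis. The sketch below computes loadings with NumPy's SVD. It assumes the score columns are standardized and none is constant, and `pca_loadings` is a hypothetical helper, not part of the original analysis.

```python
import numpy as np


def pca_loadings(scores, n_components=3):
    """Return the PCA loadings of a (documents x metrics) matrix.

    Row i holds the unit-length weights that combine the standardized
    metrics into component i. Assumes no column is constant.
    """
    scores = np.asarray(scores, dtype=float)
    standardized = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(standardized, full_matrices=False)
    return vt[:n_components]
```

Comparing the column for FleschReadingEase against the columns of the other metrics in the first row shows whether it moves with or against them along pc1. Note that the sign of an entire row is arbitrary, so only relative signs within a row are meaningful.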