Comparing Frequency and Dispersion Keywords: Effects of Variations in Target and Reference Corpora
Abstract
Dispersion keyword analysis, which identifies words that occur in significantly more texts in the target corpus than in the reference corpus, has recently been introduced as a more effective method than traditional frequency keyword analysis. Previous research has applied this method to target corpora that usually consist of hundreds of texts, with a much larger corpus serving as the reference. However, questions remain regarding its applicability to cases involving fewer texts and to comparisons between smaller, specific corpora. This study compares the top 100 frequency keywords and dispersion keywords identified under several conditions, which varied in the number of texts in the target corpus (24, 100, and 200 texts) and the types of reference corpora used. Both methods identified unique and shared keywords; however, frequency keywords were found to be more frequent and more widely dispersed than dispersion keywords, not only within the target corpus but also in the reference corpus, whereas dispersion keywords were notably more relevant to the target corpus. The choice between the frequency and dispersion methods and the relevance of frequency and dispersion keywords in research with differing focuses are discussed.
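The contrast between the two methods can be illustrated with a minimal Python sketch. It is not the procedure used in the study: it assumes the two-term log-likelihood (G2) as the keyness statistic, which the abstract does not specify, and the function names and toy corpora are hypothetical. Frequency keyness compares token counts against corpus sizes in tokens, while dispersion keyness compares the number of texts containing a word against the number of texts in each corpus.

```python
import math
from collections import Counter


def log_likelihood(a, b, n1, n2):
    """Two-term G2 for a count `a` out of n1 (target) vs `b` out of n2 (reference).

    Assumption: this statistic is chosen for illustration only.
    """
    total = a + b
    e1 = n1 * total / (n1 + n2)  # expected count in the target
    e2 = n2 * total / (n1 + n2)  # expected count in the reference
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def rank_keywords(t_counts, r_counts, n1, n2):
    """Score words that are relatively more common in the target; sort by G2."""
    scores = {
        w: log_likelihood(c, r_counts[w], n1, n2)
        for w, c in t_counts.items()
        if c / n1 > r_counts[w] / n2  # keep positive keywords only
    }
    return sorted(scores, key=scores.get, reverse=True)


def frequency_keywords(target, reference):
    """Keyness from token counts; corpus size is the number of tokens."""
    t = Counter(tok for text in target for tok in text)
    r = Counter(tok for text in reference for tok in text)
    return rank_keywords(t, r, sum(t.values()), sum(r.values()))


def dispersion_keywords(target, reference):
    """Keyness from text counts; corpus size is the number of texts."""
    t = Counter(tok for text in target for tok in set(text))
    r = Counter(tok for text in reference for tok in set(text))
    return rank_keywords(t, r, len(target), len(reference))


# Hypothetical toy corpora: each inner list is one tokenized text.
target = [
    ["patient", "dose", "dose", "dose", "dose", "dose", "dose"],
    ["patient", "care", "the"],
    ["patient", "care", "the"],
]
reference = [
    ["the", "market", "care"],
    ["the", "market", "price"],
    ["the", "price", "care"],
]

freq_ranked = frequency_keywords(target, reference)
disp_ranked = dispersion_keywords(target, reference)
print("frequency keywords: ", freq_ranked)
print("dispersion keywords:", disp_ranked)
```

In this toy data, "dose" is repeated six times in a single text, so it ranks first by frequency, whereas "patient" appears once in every target text and ranks first by dispersion.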
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.