Research on Hungarian Keywords and Comparing the Results of their Extraction

– Pilot Study

  • Dodé Réka
doi: 10.59648/filologia.2023.1-4.3

Abstract

Keyword and term extraction is a well-established area of research that has attracted scholarly attention for the past five decades, yet it continues to pose challenges. Language models add a new dimension to many areas of natural language processing, including keyword and term extraction: they can generate keywords that are absent from the source text or only partially represented in it. When authors assign keywords manually, they draw on their own background knowledge, so these keywords do not necessarily appear in the text. Manually assigned keywords are therefore worth studying and can be treated as the gold standard, a benchmark against which keyword extraction applications can be tested. In our study, we compared the manually assigned keywords of 30 scientific texts (from different domains) with the keywords provided by ChatGPT in response to various prompts. Our findings indicate that while there may not be a statistically significant difference in quantitative metrics, a qualitative examination of the ChatGPT-generated solutions reveals their relevance and utility in augmenting keyword assignments. The aim of the study is to evaluate how close the outputs given by ChatGPT are to the keywords given by the authors.

Keywords:

term, keyword extraction, providing keyword, language model, ChatGPT

How to Cite

Dodé, R. (2024). Research on Hungarian Keywords and Comparing the Results of their Extraction: Pilot Study. Filológia.Hu, 14(1–4), 51–64. https://doi.org/10.59648/filologia.2023.1-4.3
