Eliciting explicit knowledge from domain experts in direct intrinsic evaluation of word embeddings for specialized domains

Goya van Boven (Utrecht University, j.g.vanboven@students.uu.nl)
Jelke Bloem (University of Amsterdam, j.bloem@uva.nl)

Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval), pages 107–113, Online, April 19, 2021. ©2021 Association for Computational Linguistics

Abstract

We evaluate the use of direct intrinsic word embedding evaluation tasks for specialized language. Our case study is philosophical text: human expert judgements on the relatedness of philosophical terms are elicited using a synonym detection task and a coherence task. Uniquely for our task, experts must rely on explicit knowledge and cannot use their linguistic intuition, which may differ from that of the philosopher. We find that inter-rater agreement rates are similar to those of more conventional semantic annotation tasks, suggesting that these tasks can be used to evaluate word embeddings of text types for which implicit knowledge may not suffice.

1 Introduction

Philosophical research often relies on the close reading of texts, which is a slow and precise process, allowing for the analysis of a few texts only. Supporting philosophical research with distributional semantic (DS) models (Bengio et al., 2003; Turney and Pantel, 2010; Erk, 2012; Mikolov et al., 2013) has been proposed as a way to speed up the process (van Wierst et al., 2016; Ginammi et al., in press; Herbelot et al., 2012), and could increase the number of analysed texts, decreasing reliance on a canon of popular texts (cf. addressing the great unread, Cohen, 1999). However, we cannot evaluate semantic models of philosophical text using a general English gold standard, as philosophical concepts often have a very specific meaning. For example, the term abduction, usually meaning a kidnapping, denotes a specific type of inference in philosophy (Douven, 2017). Therefore, models must be evaluated in a domain-specific way.

The critical difference between the general case and the philosophy case is the following. It is easy to find native speakers of e.g. British English who have a good intuition of the meaning of its terms in general use, and the relations between them. This yields e.g. the SimLex-999 word similarity dataset (Hill et al., 2015), covering frequent words and their typical senses. More difficult is finding ‘native speakers’ who have an intuition of the meaning of the terms used by a particular philosopher. The only candidate would be that philosopher themselves, and even then, the meaning of some of the terms used is the result of explicit analysis and definition rather than implicit language knowledge of the philosopher. Uncommon terms with highly specific meanings are explicitly defined and debated, leading them to differ between philosophers or even within the works of a single philosopher. Any accurate evaluation or annotation would require expert knowledge, and methods that can incorporate explicit knowledge, rather than judgements based on implicit knowledge of a standard language or jargon by one of its speakers.

We test two direct evaluation methods for DS models described by Schnabel et al. (2015) on our case study, the works of Willard V. O. Quine, a 20th century American philosopher. Instead of native English-speaking crowdworkers, we selected expert participants who have studied this philosopher extensively. We aim to test whether these methods produce reliable results when participants need to use explicit rather than implicit knowledge, and consider the methods to be successful if inter-rater agreement matches that of other semantic evaluations. More broadly, our methodological findings apply to evaluation of DS models for specialized domains, language for specific purposes (LSP), historical language varieties or other language (varieties) for which no native annotators are available.
2 Related work

Most intrinsic evaluations compare word embedding similarities (e.g. in terms of cosine distance) to premade datasets of human similarity or relatedness judgements. Sets of words are created and evaluated on semantic relations by participants, and the similarity between the assessments and an embedding space is used as a measure of performance. In specific domains, examples of such datasets of term ratings can be found for identifier names in source code (Wainakh et al., 2019), in the medical domain (Pakhomov et al., 2010, 2011; Pedersen et al., 2007) and in geosciences, both for English (Padarian and Fuentes, 2019) and Portuguese (Gomes et al., 2021). The last two studies compare domain-specific embeddings to general-domain embeddings and both find that the former perform better. A problem of these indirect datasets is that only naturally occurring, often high-frequency terms without any spelling variations are evaluated, while DS models include many more variations (Batchkarov et al., 2016).

Direct intrinsic evaluation methods, where participants respond directly to the output of models, can be categorized as absolute and comparative intrinsic evaluation (Schnabel et al., 2015). The former evaluates embeddings individually and compares their final scores, while in the latter participants directly express their preferences between models. To the best of our knowledge, the only example of a domain-specific direct human evaluation is Dynomant et al. (2019), who evaluate French embeddings of health care terms in a human evaluation in which two medical doctors rate the relevance of the first five nearest neighbours of target terms from models trained on in-domain text.

In the philosophical domain some evaluations have been conducted with other methods, sometimes incorporating expert explicit knowledge, but none are direct. In each of these studies the work of Quine is used as data. Firstly, Bloem et al. (2019) propose a method of evaluating word embedding quality by measuring model consistency, not making use of expert knowledge. Secondly, Oortwijn et al. (2021a) construct a conceptual network which serves as a ground truth of expert knowledge. They compare the similarity of embeddings for target philosophical terms to their position in the manually created network. Here, the conceptual relatedness between terms is restricted to the property of sharing hypernyms, and only terms that were predefined in the ground truth can be considered for evaluation. Betti et al. (2020) introduce a more elaborate ground truth that is concept-focused, including more types of conceptual relations and including irrelevant as well as relevant terms for better evaluation of model precision. Still, evaluation remains restricted to terms in the ground truth. Only with direct evaluation methods can we attempt to evaluate all model output.
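To make the contrast with the direct tasks used in this study concrete, the indirect setup described above can be sketched in a few lines: a premade dataset of human relatedness ratings is compared against a model's cosine similarities, typically with Spearman's rank correlation. This is an illustrative sketch rather than code from this paper; the gensim and scipy calls are standard, but the file names and data format are assumptions.

```python
# Indirect intrinsic evaluation (sketch): correlate human relatedness
# ratings for word pairs with a model's cosine similarities.
# Assumes word2vec-format vectors and a TSV of "word1 <TAB> word2 <TAB> rating".
from gensim.models import KeyedVectors
from scipy.stats import spearmanr

vectors = KeyedVectors.load_word2vec_format("quine_vectors.txt")  # hypothetical file

human_scores, model_scores = [], []
with open("ratings.tsv", encoding="utf-8") as f:  # hypothetical gold standard
    for line in f:
        w1, w2, rating = line.rstrip("\n").split("\t")
        if w1 in vectors and w2 in vectors:        # skip out-of-vocabulary pairs
            human_scores.append(float(rating))
            model_scores.append(vectors.similarity(w1, w2))  # cosine similarity

rho, p_value = spearmanr(human_scores, model_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g}, n = {len(human_scores)})")
```

The direct tasks described in the next section avoid the need for such a premade gold standard by asking experts to judge model output itself.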
3 Task description

We perform a synonym detection task and a coherence task. In these tasks, participants are asked to judge model-generated candidate terms that semantic models deem closest to a target term. In the synonym detection task, participants select the most similar word to a target term t out of a set of options: the k-nearest neighbours of the target term in each model that is being compared. In the coherence task, the participant selects a semantic outlier in a set of words, where one of the words is not close to t in the model. We refer to Schnabel et al. (2015) for details and a comparison to other tasks for general semantic evaluation. Our participant instructions are based on Schnabel et al., who use the instructions of the WordSim-353 dataset (Finkelstein et al., 2001). But as this study focuses on explicit knowledge, several adjustments are needed.
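As a concrete illustration, both item types can be assembled from trained models' nearest-neighbour lists. The sketch below shows one possible way to do this with gensim; it is not the code used in this study, and the model names, target term, rank choices and intruder-sampling strategy are illustrative assumptions.

```python
# Constructing survey items from trained embedding models (illustrative sketch).
import random
from gensim.models import KeyedVectors

models = {  # hypothetical models being compared
    "model_a": KeyedVectors.load("model_a.kv"),
    "model_b": KeyedVectors.load("model_b.kv"),
}

def synonym_detection_item(target, rank):
    """One option per model: that model's neighbour of `target` at the given rank."""
    options = {name: kv.most_similar(target, topn=rank)[rank - 1][0]
               for name, kv in models.items()}
    options["none"] = "None of these words is even remotely related"
    return options

def coherence_item(target, kv, n_close=3):
    """A set of near neighbours of `target` plus one distant intruder word."""
    close = [w for w, _ in kv.most_similar(target, topn=n_close)]
    distant = random.choice(kv.index_to_key[-1000:])  # a low-frequency, likely unrelated word
    words = close + [distant]
    random.shuffle(words)                              # randomize presentation order
    return words

print(synonym_detection_item("stimulus", rank=1))
print(coherence_item("stimulus", models["model_a"]))
```

In the surveys, the options drawn from the different models are pooled into a single multiple-choice question and their presentation order is randomized, as discussed below.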
Although explicit knowledge is easier to verbalize than implicit knowledge, it involves controlled rather than automatic processing (Bowles, 2011; Ellis, 2004, 2005), so our version of the task might take longer. Yet in order to retain the required focus, the test should not take too long. We therefore conduct a pilot study in which response times are measured to estimate task durations, and we adapt the size of the main study accordingly.

The original task instructions do not define similarity, while other studies define it as co-hyponymy (Turney and Pantel, 2010) or synonymy (Hill et al., 2015). According to Batchkarov et al. (2016), defining similarity is difficult as it depends on the context and downstream application in which the terms are used. We keep a consistent context, both training and evaluating in the domain of a particular philosopher, although the concern of capturing the multidimensional concept of similarity in a single number is valid in this context as well. Gladkova and Drozd (2016) claim participants are likely to prefer synonyms when asked to select the most similar word. In this study we are looking to find any relationship present, rather than a specific one, and expect the experts to explicitly consider this, so we ask for relatedness. Gladkova and Drozd further argue that when asked for relatedness, participants must choose between the various relations present, a choice that can be subjective or random, and might reflect other factors such as “frequency, speed of association, and possibly the order of presentation of words”. The first two factors are alleviated in this study, as the participants must take Quine’s definitions of words into account rather than their own. This forces participants to think their answers through, which should reduce the association effects typical of fast-paced online studies. To account for effects of order of presentation, we randomize the order of the options. In our instructions, we define relatedness as synonymy, meronymy, hyponymy, and co-hyponymy, and provide examples. Participants are also allowed to base their judgements on other types of relations.

After the experiment we present a post-test survey, querying the types of relation participants based their judgements on, and task difficulty. Furthermore, we change the option “I don’t know the meaning of one (or several) of the words” in the synonym detection task to “None of these words is even remotely related”, and also include a similar option in the coherence task, namely “No coherent group can be formed from these words”. This is done to avoid any random selection of words when there are no meaningful relations, making the responses more accurate. As we aim to gather explicit knowledge, participants are allowed to look up relevant information on the presented words. For reproducibility, our instructions (and results) are included in the supplementary materials, to be found at https://github.com/gvanboven/direct-intrinsic-evaluation.

As the tasks require participants to be experts on the work of Quine, the number of possible participants is limited. Although participants are philosophers trained to work precisely and make consistent judgements, subjectivity can be a risk, as participants must choose the relation they deem most important while lacking context. We use inter-rater agreement to evaluate this. We report the joint probability of agreement (percentage of agreement), as we have added the none options to avoid chance agreement. As joint probabilities cannot be compared across studies, we also report Cohen’s κ.
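For reference, the two statistics reported here can be computed directly from two annotators' answers to the same questions. This is a generic sketch rather than the study's own analysis script, and the toy answer lists are invented.

```python
# Joint probability of agreement and Cohen's kappa for two annotators (sketch).
from collections import Counter

def joint_agreement(a, b):
    """Fraction of items on which both annotators chose the same option."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = joint_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    # Expected agreement for independent annotators with these label distributions.
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Invented toy answers to six questions (one chosen option label per question).
ann1 = ["verbs", "collateral", "numbers", "none", "them", "prediction"]
ann2 = ["modifiers", "collateral", "none", "none", "them", "construction"]
print(f"agreement = {joint_agreement(ann1, ann2):.3f}, kappa = {cohens_kappa(ann1, ann2):.3f}")
```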
All experiments were approved by The Ethics Committee of the Faculty of Humanities of the University of Amsterdam and are conducted on the survey platform Qualtrics (www.qualtrics.com). Participants are asked to execute the experiments in a silent environment.

4 Case study for philosophy

We make use of the QUINE corpus (v0.5, Betti et al., 2020), which includes 228 philosophical books, articles and bundles authored by Quine, consisting of 2.15 million tokens in total. As target terms for evaluation, we use Bloem et al.’s (2019) test set for the book Word & Object (Quine, 1960), one of Quine’s most influential works. It consists of 55 terms that were selected from the book’s index. We used 10 of these terms in the pilot study, 25 in the synonym detection task, 14 in the coherence task and 6 in both experiments (listed in the supplemental materials, with frequencies).

One Quine expert participated in the pilot study. The pilot study consists of short versions of the two tasks, both testing five target terms. In the synonym detection task, each target term has six candidate related terms from the models that the participant should choose between. Each term is tested three times, with candidates of differing similarity from the model (nearest-neighbour ranks k ∈ {1, 5, 50}). The pilot coherence (outlier) task has ten questions. The average response time was 109.5 s for the synonym detection task and 42.1 s for the coherence task. Because this was higher than anticipated for the first task, we reduced the number of ranks to two and divided the task across two separate surveys.

4.1 Experiment 1: Synonym detection task

Three experts on the work of Quine, including the participant of the pilot study, took part in this experiment. They all hold a Master’s degree in philosophy and have studied the philosopher extensively. This task includes 31 target words, which are all tested on two ranks k, with k ∈ {1, 10}, resulting in 62 questions. Of the 45 test set terms not used in the pilot study, we took the fifteen highest-frequency terms (n > 275) and the sixteen lowest (n < 84). The experiment was conducted through two surveys, each consisting of 31 questions and lasting around 50 minutes, with a break halfway (example surveys and raw results for each participant are included in the supplementary materials).

The data from one of the participants was excluded, as the participant indicated that the test was too difficult and that their expertise on the work of Quine did not suffice. Moreover, the response times of this participant were much lower than those of the other participants. For this experiment, the overall inter-rater agreement was 58.1%, with κ = 0.492.
4.2 Experiment 2: Coherence task

Two of the participants from the previous experiments also participated in this study. 20 target words are used: the 14 test set terms not used in the pilot or Exp. 1, and the 3 highest and 3 lowest frequency terms from Exp. 1. We divide these into eleven low-frequency words (n < 142) and nine high-frequency words (n > 187). Using 3 DS models, this results in 60 questions; the test takes approximately 40 minutes, with a break. The inter-rater agreement was 56.7%, with κ = 0.345.

[Figure 1: Inter-rater agreement in different conditions of (a) the synonym detection task and (b) the coherence task.]

Table 1: Response times in different conditions of the synonym detection (Exp. 1) and coherence (Exp. 2) tasks.

Condition       | Exp. 1 | Exp. 2
Overall         | 45.5 s | 25.8 s
None            | 53.4 s | 28.2 s
Not-none        | 43.4 s | 23.4 s
High frequency  | 35.7 s | 26.9 s
Low frequency   | 54.9 s | 24.9 s

5 Analysis

To assess whether the method was successful, we discuss some reliability metrics and examine disagreement examples. First of all, the fact that the data from one participant had to be excluded confirms the high standard of expertise required for participating in our version of the tasks. The results might have differed had there been more or different participants. However, other studies on expert explicit knowledge also execute tasks with two (Dynomant et al., 2019) or three (Padarian and Fuentes, 2019; Gomes et al., 2021) participants.

Inter-rater agreement scores for the two tasks were 58.1% (κ = 0.492) and 56.7% (κ = 0.345), indicating moderate and fair agreement respectively. Batchkarov et al. (2016) found the average inter-rater agreement of two raters of the WordSim-353 (Finkelstein et al., 2001) and MEN (Bruni et al., 2014) datasets to lie between κ = 0.21 and κ = 0.62. Thus, agreement scores in this study are not lower than those of commonly used similarity datasets, despite participants having to agree on another person’s semantics and including a None option.

Both experiments yield lower inter-rater agreement for the None option than for the other choices, shown in Figure 1(a) and (b). Response times were also higher for the None option in both tasks (Table 1), suggesting this choice is more difficult. Most disagreement thus concerned the presence of a semantic relationship, but if the annotators agreed there was one, they mostly preferred the same relation. This suggests a None option increases annotation quality in general. In the coherence task, there was more agreement on low- than high-frequency words, which may be due to their lesser ambiguity.
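The per-condition breakdown behind Figure 1 and Table 1 amounts to grouping the answered items by condition and recomputing agreement and mean response time within each group. The sketch below illustrates this; the item records and field names are invented.

```python
# Per-condition breakdown (sketch): agreement and mean response time by condition.
# Each item records both annotators' answers, their response times, and its condition.
items = [  # invented examples
    {"ann1": "collateral", "ann2": "collateral", "rt": [38.0, 41.5], "condition": "high frequency"},
    {"ann1": "none",       "ann2": "verbs",      "rt": [61.2, 49.9], "condition": "low frequency"},
    {"ann1": "none",       "ann2": "none",       "rt": [55.0, 58.3], "condition": "low frequency"},
]

def by_condition(items):
    groups = {}
    for item in items:
        groups.setdefault(item["condition"], []).append(item)
    for condition, group in groups.items():
        agreement = sum(i["ann1"] == i["ann2"] for i in group) / len(group)
        mean_rt = sum(sum(i["rt"]) / len(i["rt"]) for i in group) / len(group)
        print(f"{condition}: agreement = {agreement:.2f}, mean response time = {mean_rt:.1f} s")

by_condition(items)
```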
According to the post-test survey, participants mostly based their judgements on sharing the same super term. Relationships that were used without being listed in the instructions were antonymy, forming a technical bigram term together, having the same stem and being used in the same context by Quine. We see this reflected when examining some examples of disagreement. In Table 2, we see disagreement on the related term for adjectives, because both chosen terms have a relation to this target term, but these are two different relations. We see agreement for information, as collateral information is a meaningful bigram in Quine’s thought experiment on radical translation. In Table 3 we see disagreement on the ambiguity outlier. While believe has a tenuous relation to ambiguity, participant 2 may have considered this relation too tenuous and went for none. One expert stated that unclear boundaries of the none option were the reason for many none disagreements. The sense datum disagreement was guessed to be over a rare non-mathematics sense of divisibility that one participant remembered but the other might not have.

Table 2: Example of disagreement and agreement in the synonym detection task. To be read vertically, with target terms in the header row. Model terms marked (1), (2) or (1,2) were chosen by the corresponding participant(s) as related to the target term.

adjectives    | information      | application
translation   | learning         | numbers (1)
embodying     | reduction        | ambiguity
modifiers (2) | collateral (1,2) | multiplicity
specious      | application      | subtraction
present       | ordered pair     | belong
verbs (1)     | None             | abbreviative
None          |                  | None (2)

Table 3: Example of disagreement and agreement in the coherence task. Terms marked (1), (2) or (1,2) were chosen by the corresponding participant(s) as outliers; in the original table, underlined terms were the model outliers with lower word embedding similarity.

ambiguity   | objects    | sense datum
parts       | object     | prediction (1)
phoneme     | physical   | construction
believe (1) | them (1,2) | divisibility (2)
None (2)    | None       | None
6 Discussion

In the post-test survey, participants commented that it was sometimes difficult to select the most related word, as different relations were present and selecting the most important one is partly a matter of preference. Such ambiguity is prevalent in any semantic annotation task in which context is unspecified, and in other language annotation tasks in which no explicit choice is made in the guidelines among possible competing valid interpretations (Plank et al., 2014). As noted by Sommerauer et al. (2020), justified disagreement is possible, though detecting it requires meta-annotation, and this is in itself a difficult task. However, it might yield additional insights, i.e. that certain DS models might prioritize certain relation types in their nearest neighbours, and that these are equally valid because the experts disagreed on them. Disagreement can also be caused by poorly specified tasks and insufficient conceptual alignment among annotators, especially when the goal is creating a ground truth (Oortwijn et al., 2021b) or otherwise annotating for a specific theory or interpretation.

In future experiments, more specific instructions on when to consider a relation to be relevant, or guidelines on prioritizing certain relations over others, can reduce the difficulty of the task. Our expert participants used many semantic relation types in their interpretation, with no clear hierarchy among them. However, applying this to DS model evaluation may require more insight into what exactly the geometric relationships between embeddings in a DS model capture. It may also be interesting for philosophers to make use of models trained to represent particular relations, such as antonymy (Dou et al., 2018). With more specific instructions explicitly directing participants to prioritize or ignore specific relations, our evaluation approach can be adapted to evaluate such models, and we expect higher agreement in this type of task. In other cases different interpretations can be desirable, e.g. where there is no hierarchy of relations and a model should capture relatedness in a broad sense. For this purpose, we should consider allowing multiple answers — while a forced choice helps to elicit implicit knowledge, explicit knowledge may not always support a categorical decision, though this adds the complication of deciding when an option is relevant enough, similar to the none option.

Our results show that absolute and comparative intrinsic evaluation tasks can be used to agree on semantic relatedness between word embeddings even when the target language variety is highly specific. By instructing domain experts to perform the evaluation task using explicit expert knowledge rather than implicit knowledge, inter-rater agreement rates similar to other semantic annotation tasks can be reached. Due to the inherent lack of context in evaluating type-based, non-contextual word embeddings, participants struggled with the generality of the task. Based on our analysis and post-test survey, we expect more specific guidelines on word relatedness to increase the reliability of the annotators’ judgements, while limiting their generalizability. The addition of a None option seemed particularly beneficial for obtaining more reliable annotations based on explicit knowledge. We expect these findings to apply in the context of other domains for which no ‘native’ annotators are available — for example, language for specific purposes (LSP), historical language varieties or idiolects. In future work, the absolute and comparative intrinsic evaluation tasks we have described can be used to compare the quality of the representations of different word embedding models on these specialized language varieties.

Acknowledgements

We are grateful to Yvette Oortwijn, Thijs Ossenkoppele and Arianna Betti for their input as Quine domain experts. This research was supported by VICI grant e-Ideas (277-20-007), financed by the Dutch Research Council (NWO).
References

Miroslav Batchkarov, Thomas Kober, Jeremy Reffin, Julie Weeds, and David Weir. 2016. A critique of word similarity as a method for evaluating distributional semantic models. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 7–12.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

Arianna Betti, Martin Reynaert, Thijs Ossenkoppele, Yvette Oortwijn, Andrew Salway, and Jelke Bloem. 2020. Expert concept-modeling ground truth construction for word embeddings evaluation in concept-focused domains. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6690–6702.

Jelke Bloem, Antske Fokkens, and Aurélie Herbelot. 2019. Evaluating the consistency of word embeddings from small data. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 132–141.

Melissa A. Bowles. 2011. Measuring implicit and explicit linguistic knowledge: What can heritage language learners contribute? Studies in Second Language Acquisition, pages 247–271.

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47.

Margaret Cohen. 1999. The Sentimental Education of the Novel. Princeton University Press.

Zehao Dou, Wei Wei, and Xiaojun Wan. 2018. Improving word embeddings for antonym detection using thesauri and SentiWordNet. In CCF International Conference on Natural Language Processing and Chinese Computing, pages 67–79. Springer.

Igor Douven. 2017. Abduction. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy, summer 2017 edition. Metaphysics Research Lab, Stanford University.

Emeric Dynomant, Romain Lelong, Badisse Dahamna, Clément Massonnaud, Gaétan Kerdelhué, Julien Grosjean, Stéphane Canu, and Stefan J. Darmoni. 2019. Word embedding for the French natural language in health care: comparative study. JMIR Medical Informatics, 7(3):e12310.

Rod Ellis. 2004. The definition and measurement of L2 explicit knowledge. Language Learning, 54(2):227–275.

Rod Ellis. 2005. Measuring implicit and explicit knowledge of a second language: A psychometric study. Studies in Second Language Acquisition, 27(2):141–172.

Katrin Erk. 2012. Vector space models of word meaning and phrase meaning: A survey. Language and Linguistics Compass, 6(10):635–653.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web, pages 406–414.

Annapaola Ginammi, Rob Koopman, Shenghui Wang, Jelke Bloem, and Arianna Betti. In press. Bolzano, Kant, and the traditional theory of concepts: A computational investigation. In The Dynamics of Science: Computational Frontiers in History and Philosophy of Science. Pittsburgh University Press.

Anna Gladkova and Aleksandr Drozd. 2016. Intrinsic evaluations of word embeddings: What can we do better? In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 36–42.

Diogo da Silva Magalhães Gomes, Fábio Corrêa Cordeiro, Bernardo Scapini Consoli, Nikolas Lacerda Santos, Viviane Pereira Moreira, Renata Vieira, Silvia Moraes, and Alexandre Gonçalves Evsukoff. 2021. Portuguese word embeddings for the oil and gas industry: Development and evaluation. Computers in Industry, 124:103347.

Aurélie Herbelot, Eva Von Redecker, and Johanna Müller. 2012. Distributional techniques for philosophical enquiry. In Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 45–54. Association for Computational Linguistics.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Yvette Oortwijn, Jelke Bloem, Pia Sommerauer, Francois Meyer, Wei Zhou, and Antske Fokkens. 2021a. Challenging distributional models with a conceptual network of philosophical terms. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). In press.

Yvette Oortwijn, Thijs Ossenkoppele, and Arianna Betti. 2021b. Interrater disagreement resolution: A systematic procedure to reach consensus in annotation tasks. In Proceedings of the First Workshop on Human Evaluation of NLP Systems (HumEval).

José Padarian and Ignacio Fuentes. 2019. Word embeddings for application in geosciences: development, evaluation, and examples of soil-related concepts. SOIL, 5(2):177–187.

Serguei Pakhomov, Bridget McInnes, Terrence Adam, Ying Liu, Ted Pedersen, and Genevieve B. Melton. 2010. Semantic similarity and relatedness between clinical terms: an experimental study. In AMIA Annual Symposium Proceedings, page 572. American Medical Informatics Association.

Serguei V. S. Pakhomov, Ted Pedersen, Bridget McInnes, Genevieve B. Melton, Alexander Ruggieri, and Christopher G. Chute. 2011. Towards a framework for developing semantic relatedness reference standards. Journal of Biomedical Informatics, 44(2):251–265.

Ted Pedersen, Serguei V. S. Pakhomov, Siddharth Patwardhan, and Christopher G. Chute. 2007. Measures of semantic similarity and relatedness in the biomedical domain. Journal of Biomedical Informatics, 40(3):288–299.

Barbara Plank, Dirk Hovy, and Anders Søgaard. 2014. Linguistically debatable or just plain wrong? In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 507–511.

Willard Van Orman Quine. 1960. Word and Object. MIT Press.

Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. 2015. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 298–307.

Pia Sommerauer, Antske Fokkens, and Piek Vossen. 2020. Would you describe a leopard as yellow? Evaluating crowd-annotations with justified and informative disagreement. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4798–4809.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Yaza Wainakh, Moiz Rauf, and Michael Pradel. 2019. Evaluating semantic representations of source code. CoRR, abs/1910.05177.

Pauline van Wierst, Sanne Vrijenhoek, Stefan Schlobach, and Arianna Betti. 2016. Phil@Scale: Computational methods within philosophy. In Proceedings of the Third Conference on Digital Humanities in Luxembourg with a Special Focus on Reading Historical Sources in the Digital Age, volume 1681 of CEUR Workshop Proceedings, Aachen. CEUR-WS.org.