WFL is at DH 2017 Montréal

WFL is presenting a poster on how to browse the resource and what kinds of research questions the web application can help with. These are some passages from the DH2017 Book of Abstracts:

Word Formation Latin (WFL) is a derivational morphology resource for Classical Latin, where lemmas are analysed into their formative components, and relationships between them are established on the basis of Word Formation Rules (WFRs). For example amo (to love) and amator (lover) are connected with a relationship that describes a change from a verb to a noun through the addition of a suffix (-a-tor) that in itself bears semantic information (in this case it characterises agentive and instrumental nouns, i.e. someone or something performing an action).

[…]

The lexical basis used for the resource comprises all 69,682 lemmas featured in the morphological analyser for Latin LEMLAT 3.0 (http://www.lemlat3.eu/).

The word formation lexicon is built in two steps:

  1. Word formation rules (WFRs) are detected using a mixture of previous literature on Latin derivational morphology (Jenks, 1911; Fruyt, 2011; Oniga, 1988) and semi-automatic procedures (Passarotti and Mambrini, 2012).
  2. WFRs are applied to lexical data: lemmas and WFRs are paired using a MySQL relational database, and a number of MySQL queries provide the candidate lemmas for each WFR. Input and output pairs are then checked manually, in order to clear out false friends and duplicate results due to homography.
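
The second step can be pictured with a small sketch. It is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name, its columns, the sample lemmas and the function name are all hypothetical rather than taken from the WFL database. The query proposes candidate pairs for one rule (verb to noun in -ator); in the real workflow such candidates would still be checked by hand.

```python
import sqlite3

# Hypothetical schema: one table of lemmas with a part-of-speech tag.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lemma (form TEXT, pos TEXT)")
conn.executemany(
    "INSERT INTO lemma VALUES (?, ?)",
    [("amo", "V"), ("amator", "N"), ("canto", "V"), ("cantator", "N"),
     ("oro", "V"), ("orator", "N"), ("castor", "N")],
)


def candidates_for_ator(connection):
    """Pair verbs ending in -o with nouns built as verb stem + 'ator'."""
    query = """
        SELECT v.form, n.form
        FROM lemma AS v
        JOIN lemma AS n
          ON n.form = substr(v.form, 1, length(v.form) - 1) || 'ator'
        WHERE v.pos = 'V' AND n.pos = 'N' AND v.form LIKE '%o'
        ORDER BY v.form
    """
    return connection.execute(query).fetchall()


pairs = candidates_for_ator(conn)
for verb, noun in pairs:
    print(f"{verb} -> {noun}")
```

A purely string-based match like this one is exactly why the manual check matters: the query knows nothing about homographs or accidental look-alikes.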

This poster will describe the resource, and, more specifically, it will illustrate the web application that is being developed to easily access the data.

The WFL dataset is both an integral part of Lemlat (https://github.com/CIRCSE/LEMLAT3) and the basis of a standalone web application (http://wfl.marginalia.it). The database will be made available for download, so that extensive queries can be run and the data can be used and reused at will. The web application is intuitive and user-friendly. It supports scholars and students who are not familiar with database query languages such as SQL, as well as Classicists with specific research questions.

The lexicon can be browsed either by WFR, affix, input and output Part-of-Speech (PoS) or lemma. Drop-down menus provide the available options for each selection, such as the list of affixes and lemmas. Results are visualised as lists of lemmas and tree graphs, whose nodes are lemmas and edges are WFRs. Trees are interactive. Clicking on a node shows the full derivational tree (“word formation cluster”) for the lemma reported in that node. For example, figure 1 shows the word formation cluster for the lemma computo, ‘to calculate’. Clicking on an edge shows the lemmas built by the WFR described by that edge.
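
The two click behaviours can be sketched in a few lines of Python. This is a minimal sketch, not the web application's code: the edge list, the rule labels and both function names are hypothetical, and the sample derivations are illustrative rather than exported from WFL. A cluster is treated here simply as the connected component that contains the chosen lemma.

```python
from collections import defaultdict, deque

# Illustrative edges: (input lemma, output lemma, rule label). Not WFL data.
EDGES = [
    ("puto", "computo", "prefix com-"),
    ("computo", "computatio", "suffix -tio"),
    ("computo", "computator", "suffix -tor"),
    ("puto", "reputo", "prefix re-"),
    ("amo", "amator", "suffix -tor"),
]


def word_formation_cluster(lemma, edges):
    """Return every edge in the connected component that contains `lemma`."""
    neighbours = defaultdict(set)
    for source, target, _rule in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    seen = {lemma}
    queue = deque([lemma])
    while queue:  # breadth-first walk over the undirected graph
        current = queue.popleft()
        for other in neighbours[current]:
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return [edge for edge in edges if edge[0] in seen]


def lemmas_built_by(rule, edges):
    """Return the output lemmas of all edges labelled with `rule`."""
    return sorted(target for _source, target, label in edges if label == rule)


cluster = word_formation_cluster("computo", EDGES)
tor_nouns = lemmas_built_by("suffix -tor", EDGES)
```

Clicking a node corresponds to `word_formation_cluster`, clicking an edge to `lemmas_built_by`.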

Methodological motivations will be given for each browsing option, together with suggestions for potential uses of the web application to investigate Latin derivational processes.

 

Derivation graph for computo

Figure 1. Word formation cluster for computo, ‘to calculate’.

Four browsing choices can help the scholar with an array of linguistic investigations.

  1. By WFR – opens research questions on a specific word formation behaviour; for example, it is possible to view and download a list of all verbs that derive from a noun with a conversive derivation process (e.g. radix ‘root’ -> radicor ‘to grow roots’).
  2. By Affix – works similarly to the above, but focuses more specifically on affixal behaviour: for example, it is possible to see all agentive nouns in -tor and verify how many have a corresponding feminine equivalent in -trix.
  3. By PoS – useful for studies on macro-categories, such as nominalisation or verbalisation.
  4. By Lemma – useful when studying the productivity of one specific morphological family (like the one for bellum above) or a group of morphological families.
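
The -tor / -trix question from the affix option is the kind of check that becomes easy once a list of lemmas has been downloaded. A rough sketch, assuming a plain list of noun lemmas (the toy list and the function name below are hypothetical, and a missing counterpart in the toy list says nothing about Latin itself):

```python
def feminine_counterparts(nouns):
    """Map each noun in -tor to its -trix counterpart, or None if absent."""
    lexicon = set(nouns)
    result = {}
    for noun in sorted(lexicon):
        if noun.endswith("tor"):
            candidate = noun[:-3] + "trix"  # swap the final -tor for -trix
            result[noun] = candidate if candidate in lexicon else None
    return result


# Toy sample, for illustration only.
sample = ["amator", "amatrix", "victor", "victrix", "cantator", "orator"]
matches = feminine_counterparts(sample)
with_feminine = [noun for noun, fem in matches.items() if fem is not None]
print(f"{len(with_feminine)} of {len(matches)} nouns in -tor have a -trix form here")
```

Because the test is purely orthographic, any noun that merely happens to end in -tor would be counted too, so results from real data would need the same manual checking as the lexicon itself.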

These explorations lead in many directions through investigations on derivational production and semantics (Can semantic identification of outputs help to show which WFRs are more morphotactically transparent? Which inputs produce a certain kind of outputs? Etc.).

 

Bibliographic References

Forcellini, A. “Lexicon totius latinitatis ab Aegidio Forcellini seminarii Patavini alumno lucubratum, deinde a Iosepho Furlanetto eiusdem seminarii alumno emendatum et auctum, nunc vero curantibus Francisco Corradini et Iosepho Perin seminarii Patavini item alumnis emendatius et auctius melioremque in formam redactum.” Tom. I AG, Patavii, Typis Seminarii (1940).

Georges, Karl Ernst. Ausführliches lateinisch-deutsches und deutsch-lateinisches Handwörterbuch... Vol. 2. Hahn’sche Verlags-buchhandlung, 1880.

Glare, Peter GW. Oxford latin dictionary. Clarendon Press. Oxford University Press, 1982.

Gradenwitz, Otto. Laterculi vocum Latinarum: voces Latinas et a fronte et a tergo ordinandas. S. Hirzel, 1904.

Jenks, Paul Rockwell. A manual of Latin word formation for secondary schools. DC Heath & Company, 1911.

Fruyt, Michèle. “Word‐Formation in Classical Latin.” A Companion to the Latin Language (2011): 157-175.

Oniga, Renato. I composti nominali latini: una morfologia generativa. Vol. 29. Pàtron, 1988.

Passarotti, Marco Carlo. “Development and perspectives of the Latin morphological analyser LEMLAT.” Linguistica computazionale 20, no. A (2004): 397-414.

Passarotti, Marco, and Francesco Mambrini. “First Steps towards the Semi-automatic Development of a Wordformation-based Lexicon of Latin.” In Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 852-859. European Language Resources Association (ELRA), 2012.

 

This is the poster presented at DH2017:

Poster_DH2017


WFL at CLiC-it 2016

WFL was presented at the Third Italian Conference on Computational Linguistics, held in Naples on 5-6 December 2016, with a talk entitled “Formatio formosa est. Building a Word Formation Based Lexicon for Latin”.


The program was full, the weather welcoming and the street food to die for.

You can find the conference proceedings at the Aaccademia University Press website, or download the article directly here.

WFL’s summer tour

Word Formation Latin has been on tour this summer: the project was presented at three special venues.

The first stop was Verona, setting of the workshop Formal Representation and Digital Humanities: text, language and tools, organised within the framework of the Marie Curie funded project A computer-aided study of the Luwian (Morpho-)Syntax.

The workshop program was rich, with papers ranging from Hittite to Old English.

The second day kicked off with an exciting keynote speech by WFL’s own supervisor Marco Passarotti, titled “Well, It Depends. Theoretical and Practical Aspects of the Dependency Turn in Computational Linguistics”.

Marco’s paper was followed by WFL’s public debut. We were able to describe how the project was set up and is being carried out, the workflow, the challenges, and the first attempt at visualising the data.

The second stop was the annual International Digital Humanities conference, this year taking place at the Jagiellonian University in Cracow. The organization of the conference was flawless, food and drinks abundant, and the setting of old Cracow fascinating. WFL was successfully introduced to the Digital Humanities community with a short but effective 10-minute presentation, during which an alpha version of the resource was launched online at http://wfl.marginalia.it.

WFL was also briefly introduced to the Italian Digital Humanities community during the fifth annual conference of the Associazione per l’Informatica Umanistica e le Culture Digitali (AIUCD), titled Digital editions: Representation, interoperability, text analysis and infrastructures, held in Venice on 7-9 September 2016.

 

The Word Formation Latin Project

In the past two decades there has been a considerable increase in the creation of computational linguistic resources for the investigation of classical languages, which has brought the state of the art almost to the same level as that of the resources currently available for modern languages. These resources include annotated corpora, treebanks, computational lexica, and digital libraries. Besides these language resources there are NLP tools, such as morphological analysers, part-of-speech taggers, and syntactic parsers.

The WFL project consists in the compilation of a derivational morphological dictionary of the Latin language, which connects lexical elements on the basis of word-formation rules. Lemmas are segmented and analysed into their derivational morphological components, so as to establish relationships between them on the basis of word formation. For example, the verbal noun amator can be connected to the verb amo through suffixation with –(a)tor.

A first attempt at constructing a lexicon based on word-formation for Latin was made by Marco Passarotti and Francesco Mambrini in 2012 [M. Passarotti & F. Mambrini, First Steps towards the Semi-automatic Development of a wordformation-based Lexicon of Latin, in Proceedings of LREC 2012, Istanbul, Turkey, 852-859], when they published a paper proposing a model for the semi-automatic extraction of word formation rules and the subsequent pairing of each lemma with its morphologically simplest (i.e. non-derived) ancestor lemma. WFL is expanding on this first attempt and will result in a definitive linguistic resource.

The WFL project has three main aims:

  1. the enrichment of an existing morphological analyser for the Latin language, LEMLAT [Passarotti, M. (2004). “Development and perspectives of the Latin morphological analyser LEMLAT”. In A. Bozzi, L. Cignoni & J.L. Lebrave (Eds.), Digital Technology and Philological Disciplines. Linguistica Computazionale, XX-XXI, pp. 397-414.], with wordformation information, and the integration of data within an interface similar to Word Manager [Domenig, M. & ten Hacken, P. (1992). Word Manager: A system for morphological dictionaries. Hildesheim: Olms.], which has already been applied to modern languages (English, German, Italian);
  2. the integration of the information extracted from the resulting derivational morphological dictionary into the morphological layer of annotation of the Index Thomisticus Treebank (IT-TB). The Index Thomisticus (IT), started by Padre Roberto Busa in 1949, is considered a pathfinder in digital humanities. It is a database containing the opera omnia of Thomas Aquinas (118 texts), plus works by 61 other authors related to Thomas (61 texts). The size of the corpus is around 11 million tokens (150,000 types; 20,000 lemmas). The corpus is fully lemmatised and morphologically tagged. The IT-TB, based at CIRCSE, is the syntactically annotated portion of the IT, and it contains around 300,000 tokens in 15,000 syntactically parsed sentences. The morphological layer reports the lemmatisation and the morphological features (PoS, gender, number, tense, etc.) of each word in the base text;
  3. offering the results of the project work via a user-friendly project website which will display the derivational morphological dictionary through a web based search interface. This will allow the lexicon to be accessed:
    • by single lexical entry, which will show both the ancestors and their derived words;
    • by morphological family, i.e. the set of lemmas morphologically derived from one common ancestor-lemma;
    • by WFR.

The project relies on the automatic realisation of the linguistic resource, both in the creation of WFRs and in their application to the lexical items included in the morphological analyser LEMLAT.

The final resource will be both a standalone dictionary accessible through its own website, and interconnected with the Index Thomisticus Treebank (IT-TB).

The integration with the IT-TB will be carried out by embedding the dictionary data within the morphological layer of annotation of the treebank, using TEI (Text Encoding Initiative) P5 conformant XML encoding to favour data exchange and linking to other lexical resources. The data resulting from the dictionary, once encoded in XML, will be applied to the IT-TB data.

The results of the project work will be offered via a user-friendly website which will display the derivational morphological dictionary through a web based search interface.

Follow our blog to keep up to date with news regarding the project, progress and more.

 

Morphology beyond inflection. Building a wordformation based dictionary for Latin

Welcome to the Word Formation Latin blog!

WFL is funded by a Marie Skłodowska-Curie actions (MSCA-IF-2014) grant and it is based at the Centro Interdisciplinare di Ricerche per la Computerizzazione dei Segni dell’Espressione (CIRCSE), at the Università Cattolica del Sacro Cuore in Milan. The project runs from November 2015 to the end of October 2017, and will result in the publication of a word-formation based lexicon, which will be accessible digitally through this website and in connection to the forthcoming new LEMLAT3 website. This space will provide a dissemination platform for the WFL project.

The WFL team is made up of Eleonora Litta, Marie Curie Research Fellow, and Marco Passarotti, head of the Index Thomisticus Treebank project, a syntactically annotated corpus of texts of Thomas Aquinas, both based at the Centro Interdisciplinare di Ricerche per la Computerizzazione dei Segni dell’Espressione (CIRCSE), at the Università Cattolica del Sacro Cuore in Milan.

The blog will be updated with news on progress, on the various phases of building a digital linguistic resource, notes on methodology, curiosities from the data, and news on papers and presentations given by the project team.

We hope this blog will also become a space for discussion on language resources for Latin, and other languages, in relation to NLP and digital humanities, so please feel free to add your comments below and show your interest in the resource we are trying to build.

More content coming very soon…
