This paper describes the approaches the authors developed while participating in the i2b2/VA 2010 challenge to automatically extract medical concepts and to annotate assertions on concepts and relations between concepts.
The authors' approaches rely on both rule-based and machine-learning methods. Natural language processing is used to extract features from the input texts; these features then feed the machine-learning components. The authors used Conditional Random Fields for concept extraction and Support Vector Machines for assertion and relation annotation. Depending on the task, they tested various combinations of rule-based and machine-learning methods.
The authors' assertion annotation system obtained an F-measure of 0.931, ranking fifth out of 21 participants at the i2b2/VA 2010 challenge. Their relation annotation system ranked third out of 16 participants with a 0.709 F-measure. The 0.773 F-measure they obtained on concept extraction did not rank among the top 10.
On the one hand, the authors confirm that purely machine-learning methods are highly dependent on the annotated training data, and thus performed better on well-represented classes. On the other hand, a purely rule-based method was not sufficient to handle new types of data. Finally, hybrid approaches combining machine-learning and rule-based methods yielded the highest scores.
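The abstract names the learning machinery (CRFs over NLP-derived features) but no code. As a minimal, hypothetical sketch of the kind of per-token feature dictionary typically fed to a CRF sequence labeller for concept extraction — the actual feature set of the system is not specified in the abstract — one might write:

```python
def token_features(tokens, i):
    """Features for the i-th token of a sentence, of the kind commonly
    fed to a CRF sequence labeller: word shape, affixes, and context."""
    w = tokens[i]
    feats = {
        "word.lower": w.lower(),
        "word.isupper": w.isupper(),
        "word.istitle": w.istitle(),
        "word.isdigit": w.isdigit(),
        "prefix3": w[:3],
        "suffix3": w[-3:],
    }
    # Context features: neighbouring tokens help the CRF model label transitions.
    feats["prev.lower"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next.lower"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats

sentence = ["Patient", "denies", "chest", "pain", "."]
print(token_features(sentence, 2)["next.lower"])  # pain
```

Each token's dictionary would then be paired with a BIO-style concept label for CRF training; rule-based output can be combined with the learner's predictions downstream, as the hybrid configurations in the paper suggest.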
Syntactic transformations to help learning to read: typology, adequacy and adapted corpora. In this paper, we present a typology of syntactic transformations targeted at adapting textual contents addressed to poor readers and dyslexic children. To arrive at this proposal, we analyzed a set of parallel texts (original and adapted). We also applied lexical, morpho-syntactic and discursive transformations to corpora usually read at primary school (second to fourth grades). The different versions were read by different reader profiles at school. Based on both studies, we defined a typology of syntactic transformations, with deleted, kept or added information, that could serve as guidelines to adapt texts and facilitate reading for children facing difficulties.
This article is set within an ongoing project named Logoscope, developed at the Université de Strasbourg (2012-2015). The core of the project is a computer program that, every day, scans the web pages of the French-language daily press (Le Monde, La Croix, L'Equipe, Dernières Nouvelles d'Alsace, etc.) in search of new words, with the aim of building a dynamic resource (i.e. one enriched without any time limit) useful to various communities of users. What distinguishes the Logoscope from similar existing tools is that its semi-automated acquisition of neologisms takes into account, from the outset, the textual and discursive conditions of lexical innovation. This research direction is essential: in the field of neology, everything we already know about the morphology of a language and its word-formation types should be complemented by a description of the conditions under which neologisms are produced and received, given that no lexical creation ever occurs outside a particular text and a defined communication situation. In practical terms, we therefore propose an acquisition tool and a neological resource that documents neologisms not only through the traditional variables (morphology, word formation, parts of speech, etc.), but also through variables describing the precise communicational context in which they appear.
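The Logoscope's first filtering step — spotting word forms absent from a reference lexicon in press text — can be sketched as follows. This is an illustration under simple assumptions (a plain word list as the lexicon, capitalization as a proper-noun proxy, and a made-up candidate word); the actual pipeline is far more elaborate, notably in its handling of textual and discursive context:

```python
import re

def neologism_candidates(text, lexicon):
    """Return word forms absent from a reference lexicon: a crude first
    filter of the kind a neologism tracker could apply to press articles."""
    words = re.findall(r"[a-zàâçéèêëîïôûùüÿ'-]+", text, flags=re.IGNORECASE)
    candidates = []
    for w in words:
        if w[0].isupper():        # skip likely proper nouns
            continue
        if w.lower() not in lexicon:
            candidates.append(w)
    return candidates

lexicon = {"le", "du", "jour", "sur", "web"}
print(neologism_candidates("Le clictivisme du jour sur le web", lexicon))
# ['clictivisme']
```

In practice the candidates surviving this filter would be stored with the traditional variables (morphology, word formation, part of speech) plus the communicational-context variables the article argues for.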
In this article, we present a method for the automatic construction of a morphological resource of deverbal agent nouns. Starting from a manually validated sample, we then present several avenues considered for designing an automatic validation method that would reduce the amount of manual validation. To build the resource automatically, we use two methods: the first consists of heuristics based on the formal properties of nouns and verbs, and the second exploits the definitions of a dictionary. Both methods are attractive because they are very quick to implement, and the first also appears to have good coverage of the phenomenon of deverbal agent-noun formation. However, the second method has low coverage of the phenomenon, and manual validation of a sample of the resource shows that the first method also generates a great deal of noise, and therefore requires genuine validation, whether manual, automatic or semi-automatic. This is why we consider several methods for automatically validating the resource, in order to reduce manual validation. The various automatic validation experiments show disappointing results. However, these results do not call the tested methods into question; rather, they seem to reveal the difficulty of finding suitable methods for low-frequency words.
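A formal heuristic of the kind the first method relies on can be sketched for French: strip a first-group infinitive's -er ending and attach the agentive suffixes -eur/-euse. This is a deliberately naive illustration (the paper's heuristics are richer), and its over-generation on real verb lists is exactly the noise that makes validation necessary:

```python
def agent_noun_candidates(verb):
    """Candidate deverbal agent nouns from a French -er infinitive,
    by pure formal manipulation: strip -er, add -eur / -euse.
    Naive sketch; generates noise on many real verbs."""
    if verb.endswith("er"):
        stem = verb[:-2]
        return [stem + "eur", stem + "euse"]
    return []

print(agent_noun_candidates("danser"))   # ['danseur', 'danseuse']
print(agent_noun_candidates("marcher"))  # ['marcheur', 'marcheuse']
```

Running this over a full verb lexicon yields many well-formed pairs but also non-words, which is why the article turns to automatic validation, e.g. against dictionary definitions.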
In this paper, we present the system we have developed for participating in the second task of the i2b2/VA 2011 challenge dedicated to emotion detection in clinical records. In the official evaluation, we ranked 6th out of 26 participants. Our best configuration, based upon a combination of a machine-learning approach and manually-defined transducers, obtained a 0.5383 global F-measure, while the distribution of the 26 participants' results is characterized by mean = 0.4875, stdev = 0.0742, min = 0.2967, max = 0.6139, and median = 0.5027. The combination of machine learning and transducers is achieved by computing the union of the results from both approaches, each using a hierarchy of sentiment-specific classifiers.
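The fusion scheme described — taking the union of the annotations produced by the learner and by the transducers — is simple enough to sketch directly. The span/label representation and the label names below are hypothetical, chosen only for illustration:

```python
def combine_by_union(ml_annotations, transducer_annotations):
    """Combine two annotation sets by union, the fusion scheme the
    abstract describes. Each annotation is represented here as a
    hypothetical (span_start, span_end, label) triple."""
    return ml_annotations | transducer_annotations

ml = {(0, 4, "HOPELESSNESS"), (10, 15, "GUILT")}
rules = {(10, 15, "GUILT"), (20, 27, "LOVE")}
print(sorted(combine_by_union(ml, rules)))
```

Union favours recall: an annotation found by either component survives, with duplicates collapsing automatically because the annotations are hashable tuples in a set.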
Until recently, most of the research work in Natural Language Processing (NLP) has been focused on a few well-described languages with many speakers. The situation is rapidly evolving, with a clear increase in interest towards so-called "under-resourced" languages. The goal of this issue of the Traitement Automatique des Langues journal is to give an overview of current research on NLP for under-resourced languages from all over the world, encompassing a large variety of tasks. The selected papers address languages which are still at very early stages as well as languages whose situation has very recently improved. We hope that they can help guide future research on other languages with little or no resources and tools.
Semantic relationships like specialisation can be acquired either by word-external methods relying on the context or by word-internal methods based on lexical structure. Word segments are thus a relevant cue for the automatic acquisition of semantic relationships. We have developed an unsupervised method for morphological segmentation devised for this objective. Semantic relationships are deduced from specific morphological structures based on the segments discovered. Evaluation of the validity of the inferred semantic relationships is performed against WordNet and the NCI Thesaurus.
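The word-internal idea — deducing a specialisation link when a word decomposes into a modifier plus a head that is itself an attested word — can be sketched as follows. The segmentations here are given by hand rather than learned, and the right-headed assumption is a simplification of the morphological structures the method actually exploits:

```python
def infer_specialisation(word, segments_of):
    """If `word` segments as modifier + head and the head is itself an
    attested word, hypothesise 'word is a kind of head' (specialisation).
    Assumes right-headed compounds; segmentation is given, not learned."""
    segs = segments_of.get(word, [])
    if len(segs) >= 2:
        head = segs[-1]
        if head in segments_of:          # head is an attested word
            return (word, "is_a", head)
    return None

segments = {
    "adenocarcinoma": ["adeno", "carcinoma"],
    "carcinoma": ["carcinoma"],
}
print(infer_specialisation("adenocarcinoma", segments))
# ('adenocarcinoma', 'is_a', 'carcinoma')
```

Relations hypothesised this way are exactly the kind that can then be checked against WordNet or the NCI Thesaurus, as the evaluation in the paper does.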
This paper describes a system for unsupervised morpheme analysis and the results it obtained at Morpho Challenge 2007. The system takes a plain list of words as input and returns a list of labelled morphemic segments for each word. Morphemic segments are obtained by an unsupervised learning process which can be applied directly to different natural languages. Results obtained at competition 1 (evaluation of the morpheme analyses) are better in English, Finnish and German than in Turkish. For information retrieval (competition 2), the best results are obtained when indexing is performed using Okapi (BM25) weighting for all morphemes minus those belonging to an automatic stop list made of the most common morphemes.
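The indexing scheme that worked best — BM25 weighting over morphemes, minus a stop list of the most common ones — can be sketched with the standard Okapi formula. The parameter values below are the usual defaults, not values reported by the paper:

```python
import math
from collections import Counter

def bm25_weights(doc_morphemes, doc_freq, n_docs, avg_len,
                 stop_list, k1=1.2, b=0.75):
    """Okapi BM25 weight for each morpheme of a document, skipping the
    stop list of most common morphemes. k1/b are the usual defaults."""
    tf = Counter(doc_morphemes)
    dl = len(doc_morphemes)
    weights = {}
    for m, f in tf.items():
        if m in stop_list:
            continue  # frequent morphemes carry little content
        idf = math.log((n_docs - doc_freq[m] + 0.5) / (doc_freq[m] + 0.5) + 1)
        weights[m] = idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avg_len))
    return weights

doc = ["hous", "e", "hous", "ing"]
df = {"hous": 10, "e": 990, "ing": 500}
w = bm25_weights(doc, df, n_docs=1000, avg_len=4.0, stop_list={"e"})
print(sorted(w))  # ['hous', 'ing'] -- 'e' excluded by the stop list
```

Dropping the most common morphemes plays the role stop words play in word-based indexing: without it, highly frequent segments like single letters would dominate the index.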
This paper investigates a novel approach to unsupervised morphology induction relying on community detection in networks. In a first step, morphological transformation rules are automatically acquired based on graphical similarities between words. These rules encode substring substitutions for transforming one word form into another. The transformation rules are then applied to the construction of a lexical network. The nodes of the network stand for words, while edges represent transformation rules. In the next step, a clustering algorithm is applied to the network to detect families of morphologically related words. Finally, morpheme analyses are produced based on the transformation rules and the word families obtained after clustering. While still in its preliminary development stages, this method obtained encouraging results at Morpho Challenge 2009, which demonstrate the viability of the approach.
This article describes MorphoClust and MorphoNet, two methods for the unsupervised acquisition of morphological families. MorphoClust builds families by iterative conflations, similarly to hierarchical clustering methods. The MorphoNet method relies on community detection in lexical networks. The nodes of these networks stand for words, while edges represent morphological transformation rules which are automatically acquired based on graphical similarities between words. The two methods are applied to a German-English bilingual lexicon, both in isolation and in combination. We evaluate the results using the CELEX lexical database.
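The MorphoClust idea — growing families by iterative conflation, as in agglomerative clustering — can be sketched with single-linkage merging under a shared-prefix criterion. The criterion is a placeholder: MorphoClust's actual conflation test is richer, and the example's German forms also show where pure graphical similarity fails (ablaut forms like "sang" stay isolated):

```python
def conflate(words, min_stem=4):
    """MorphoClust-style iterative conflation (sketch): start from
    singleton clusters and merge any two clusters containing words that
    share a long common prefix, until no merge applies (single linkage)."""
    clusters = [{w} for w in words]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(len(a) >= min_stem and len(b) >= min_stem
                       and a[:min_stem] == b[:min_stem]
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] |= clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return [sorted(c) for c in clusters]

print(conflate(["singen", "sang", "singer", "gesungen"]))
# [['singen', 'singer'], ['sang'], ['gesungen']]
```

Cases like "sang"/"gesungen" remaining outside the "singen" family illustrate why combining MorphoClust with the network-based MorphoNet, as the article does, can be worthwhile.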