The Database and Expert Systems Applications (DEXA) conferences bring together researchers and practitioners from all over the world to exchange ideas, experiences and opinions in a friendly and stimulating environment. The papers are at once a record of what has been achieved and the first steps towards shaping the future of information systems. DEXA covers a broad field, and all aspects of database, knowledge base and related technologies and their applications are represented. Once again there was a good number of submissions: of the 241 papers submitted, the programme committee selected 103 for presentation. DEXA’99 took place in Florence and was the tenth conference in the series, following events in Vienna, Berlin, Valencia, Prague, Athens, London, Zurich, Toulouse and Vienna. The decade has seen many developments in the areas covered by DEXA, developments in which DEXA has played its part. I would like to express thanks to all the institutions which have actively supported this conference and made it possible, namely:

• University of Florence, Italy
• IDG CNR, Italy
• FAW – University of Linz, Austria
• Austrian Computer Society
• DEXA Association

In addition, we must thank all the people who have contributed their time and effort to make the conference possible. Special thanks go to Maria Schweikert (Technical University of Vienna), M. Neubauer and G. Wagner (FAW, University of Linz). We must also thank all the members of the programme committee, whose careful reviews are important to the quality of the conference.
Artificial neural networks have been extensively applied to document analysis and recognition. Most efforts have been devoted to the recognition of isolated handwritten and printed characters, with widely recognized successful results. However, many other document processing tasks, like preprocessing, layout analysis, character segmentation, word recognition, and signature verification, have also been effectively addressed, with very promising results. This paper surveys the most significant problems in the area of offline document image processing where connectionist-based approaches have been applied. Similarities and differences between approaches belonging to different categories are discussed. Particular emphasis is given to the crucial role of prior knowledge in the conception of both appropriate architectures and learning algorithms. Finally, the paper provides a critical analysis of the reviewed approaches and depicts the most promising research directions in the field. In particular, a second generation of connectionist-based models is foreseen, based on appropriate graphical representations of the learning environment.
Motivation: Predicting the secondary structure of a protein (alpha-helix, beta-sheet, coil) is an important step towards elucidating its three-dimensional structure, as well as its function. Presently, the best predictors are based on machine learning approaches, in particular neural network architectures with a fixed, and relatively short, input window of amino acids, centered at the prediction site. Although a fixed small window avoids overfitting problems, it does not permit capturing variable long-range information. Results: We introduce a family of novel architectures which can learn to make predictions based on variable ranges of dependencies. These architectures extend recurrent neural networks, introducing non-causal bidirectional dynamics to capture both upstream and downstream information. The prediction algorithm is completed by the use of mixtures of estimators that leverage evolutionary information, expressed in terms of multiple alignments, both at the input and output levels. While our system currently achieves an overall performance close to 76% correct prediction, at least comparable to the best existing systems, the main emphasis here is on the development of new algorithmic ideas. Availability: The executable program for predicting protein secondary structure is available from the authors free of charge. Contact: pfbaldi@ics.uci.edu, gpollast@ics.uci.edu, brunak@cbs.dtu.dk, paolo@dsi.unifi.it
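The non-causal bidirectional dynamics described above can be sketched in NumPy as follows. This is a toy illustration of the recurrence scheme only: the weights are random rather than learned, the dimensions are invented, and the mixtures of estimators and alignment profiles of the full system are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

T, d_in, d_h, n_cls = 12, 20, 8, 3   # sequence length, input dim, state dim, classes (H/E/C)
X = rng.standard_normal((T, d_in))   # stand-in for per-residue input vectors

# Illustrative random weights; a real model would learn these.
Wf = rng.standard_normal((d_h, d_h)); Uf = rng.standard_normal((d_h, d_in))
Wb = rng.standard_normal((d_h, d_h)); Ub = rng.standard_normal((d_h, d_in))
V  = rng.standard_normal((n_cls, d_in + 2 * d_h))

# Forward (upstream) states: h_f[t] summarizes positions 0..t-1.
h_f = np.zeros((T + 1, d_h))
for t in range(T):
    h_f[t + 1] = np.tanh(Wf @ h_f[t] + Uf @ X[t])

# Backward (downstream) states: h_b[t] summarizes positions t+1..T-1.
h_b = np.zeros((T + 1, d_h))
for t in range(T - 1, -1, -1):
    h_b[t] = np.tanh(Wb @ h_b[t + 1] + Ub @ X[t])

def predict(t):
    # Prediction at position t combines the local input with both contexts.
    z = V @ np.concatenate([X[t], h_f[t], h_b[t]])
    e = np.exp(z - z.max())
    return e / e.sum()           # probabilities over helix / sheet / coil

probs = np.array([predict(t) for t in range(T)])
```

Because each position sees both a forward and a backward state, the effective context is the whole sequence, not a fixed window, which is the point of the bidirectional extension.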
We propose an approach for the word-level indexing of modern printed documents which are difficult to recognize using current OCR engines. By means of word-level indexing, it is possible to retrieve the position of words in a document, enabling queries involving proximity of terms. Web search engines implement this kind of indexing, allowing users to retrieve Web pages on the basis of their textual content. Nowadays, digital libraries hold collections of digitized documents that can be retrieved either by browsing the document images or by relying on appropriate metadata assembled by domain experts. Word indexing tools would therefore improve access to these collections. The proposed system is designed to index homogeneous document collections by automatically adapting to different languages and font styles without relying on OCR engines for character recognition. The approach is based on three main ideas: the use of Self Organizing Maps (SOM) to perform unsupervised character clustering, the definition of a suitable vector-based word representation whose size depends on the word aspect ratio, and the run-time alignment of the query word with indexed words to deal with broken and touching characters. The most appropriate applications are for processing printed documents of the 17th to 19th centuries, where current OCR engines are less accurate. Our experimental analysis addresses six data sets containing documents ranging from books of the 17th century to contemporary journals.
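The unsupervised character clustering idea can be illustrated with a minimal SOM written from scratch in NumPy. The grid size, decay schedules and toy "glyph" vectors below are all assumptions made for the sketch, not the parameters of the actual system.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "character images": 64-dim vectors drawn around three prototypes,
# standing in for glyph bitmaps extracted from document pages.
protos = rng.standard_normal((3, 64))
chars = np.vstack([p + 0.1 * rng.standard_normal((50, 64)) for p in protos])

# A small 4x4 SOM; each node holds a 64-dim codebook vector.
grid = np.array([(i, j) for i in range(4) for j in range(4)], dtype=float)
codebook = rng.standard_normal((16, 64))

for epoch in range(20):
    lr = 0.5 * (1 - epoch / 20)            # decaying learning rate
    sigma = 2.0 * (1 - epoch / 20) + 0.5   # decaying neighbourhood radius
    for x in rng.permutation(chars):
        # Best-matching unit: the node whose codebook vector is closest.
        bmu = np.argmin(np.linalg.norm(codebook - x, axis=1))
        # Update the BMU and its neighbours on the 2-D grid.
        d = np.linalg.norm(grid - grid[bmu], axis=1)
        h = np.exp(-(d ** 2) / (2 * sigma ** 2))
        codebook += lr * h[:, None] * (x - codebook)

# Each character is labelled by its nearest SOM node (its cluster).
labels = np.argmin(np.linalg.norm(chars[:, None] - codebook[None], axis=2), axis=1)
```

Unlike plain k-means, the SOM grid places similar clusters at nearby nodes, which is what later makes cluster-index distances meaningful when comparing words.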
We describe a tool for Table of Contents (ToC) identification and recognition in PDF books. This task is part of ongoing research on the development of tools for the semi-automatic conversion of PDF documents into the Epub format, which can be read on several E-book devices. Among the various sub-tasks, ToC extraction and recognition is particularly useful for easy navigation of book contents.
The proposed tool first identifies the ToC pages. The bounding boxes of ToC titles in the book body are subsequently found in order to add suitable links in the Epub ToC. The proposed approach is tolerant to discrepancies between the ToC text and the corresponding titles. We evaluated the tool on several open access books edited by University Presses that are partners in the OAPEN EcontentPlus project.
In this paper, we describe a general approach for script (and language) recognition in printed documents and for writer identification in handwritten documents. The method is based on a bag-of-visual-words strategy where the visual words correspond to characters and the clustering is obtained by means of Self Organizing Maps (SOM). Unknown pages (words, in the case of script recognition) are classified by comparing their vectorial representations with those of a training set using a cosine similarity. The comparison is improved by a similarity score that takes into account the SOM organization of cluster centroids. Promising results are presented for both printed documents and handwritten musical scores.
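The baseline comparison step can be sketched as follows, assuming hypothetical occurrence vectors in which component k counts how often a page's characters fall in SOM cluster k. The class names and 16-cluster vocabulary are invented for illustration, and the refinement that exploits the SOM topology of the centroids is not shown.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two occurrence vectors, in [-1, 1].
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical training pages, one occurrence vector per script class.
train = {
    "latin": np.array([5, 0, 3, 9, 0, 1, 0, 2, 7, 0, 0, 4, 1, 0, 0, 6], float),
    "greek": np.array([0, 8, 0, 1, 6, 0, 5, 0, 0, 3, 7, 0, 0, 2, 4, 0], float),
}
# An unknown page, represented the same way.
unknown = np.array([4, 1, 2, 8, 0, 0, 0, 3, 6, 0, 1, 5, 0, 0, 0, 7], float)

scores = {name: cosine_similarity(unknown, v) for name, v in train.items()}
best = max(scores, key=scores.get)   # the most similar training class
```

For this toy page, the cluster-usage profile of `unknown` overlaps heavily with `train["latin"]`, so the cosine score selects "latin".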
In the current programming period of the European Structural Funds 2000-06 in Calabria, the Integrated Territorial Projects (PIT) could become a model for public initiatives in which the themes of local economic development and modernisation of public administrations are addressed in an innovative way and with new forms of governance. The Calabria Government has created 23 PIT which will carry out, in the next three years, many actions in different sectors of public policy: infrastructure, education, social regeneration, economic development, technological development, and information and communication technologies. This paper analyses the political, social and economic factors concerning the planning process of the PIT in Calabria from the point of view of public policy analysis.
In this paper we describe our recent research on mathematical symbol indexing and its possible application in the Digital Library domain. The proposed approach represents mathematical symbols by means of the Shape Context (SC) descriptor. Indexed symbols are represented with a vector space-based method, but peculiar to our approach is the use of Self Organizing Maps (SOM) to perform the clustering instead of the commonly used k-means algorithm. Retrieval performance is measured on a large collection of mathematical symbols gathered from the widely used INFTY database.
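For illustration, a common formulation of the Shape Context descriptor, a log-polar histogram of contour-point positions relative to a reference point, can be sketched as follows. The bin counts, radial range and mean-distance normalization are generic assumptions, not necessarily those used in the system described above.

```python
import numpy as np

def shape_context(points, ref, n_r=5, n_theta=12):
    """Log-polar histogram of the positions of `points` relative to `ref`."""
    rel = points - ref
    rel = rel[np.any(rel != 0, axis=1)]        # drop the reference point itself
    r = np.hypot(rel[:, 0], rel[:, 1])
    theta = np.arctan2(rel[:, 1], rel[:, 0])   # angles in (-pi, pi]
    r_norm = r / r.mean()                      # normalize for scale invariance
    # Logarithmically spaced radial bins, finer near the reference point.
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r + 1)
    r_bin = np.clip(np.searchsorted(r_edges, r_norm) - 1, 0, n_r - 1)
    t_bin = ((theta + np.pi) / (2 * np.pi) * n_theta).astype(int) % n_theta
    hist = np.zeros((n_r, n_theta))
    np.add.at(hist, (r_bin, t_bin), 1)         # count points per (r, theta) bin
    return hist.ravel()

# Toy symbol: points sampled on a circle, standing in for a glyph contour.
angles = np.linspace(0, 2 * np.pi, 40, endpoint=False)
contour = np.stack([np.cos(angles), np.sin(angles)], axis=1)
desc = shape_context(contour, contour[0])
```

Each contour point yields one such histogram; stacking or pooling them gives the vector representation that is then clustered, here with a SOM rather than k-means.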