This paper presents the ParlaMint corpora, which contain transcriptions of the sessions of 17 European national parliaments and comprise half a billion words. The corpora are uniformly encoded, contain rich meta-data about 11 thousand speakers, and are linguistically annotated following the Universal Dependencies formalism and with named entities. Samples of the corpora and conversion scripts are available from the project’s GitHub repository, and the complete corpora are openly available via the CLARIN.SI repository for download, as well as through the NoSketch Engine and KonText concordancers and the Parlameter interface for on-line exploration and analysis.
Rodent kidneys exhibit three isoforms of metallothioneins (MTs), MT1, MT2 and MT3, with poorly characterized localization along the nephron. Here we studied, in adult male Wistar rats, the renal expression of MT mRNA by end-point RT-PCR and of MT proteins by immunochemical methods. The expression pattern of MT1 mRNA was cortex (CO)>outer stripe (OS)=inner stripe (IS)=inner medulla (IM), of MT2 mRNA was IM>CO>IS=OS, and of MT3 mRNA was IM>CO=OS=IS. The MT1/2 antibody stained the cell cytoplasm and nuclei in the proximal tubule (PT) and thin ascending limb with heterogeneous intensity, whereas the MT3 antibody weakly stained the cell cytoplasm in various cortical tubules and strongly stained the nuclei in all nephron segments. However, the isolated nuclei exhibited an absence of MT1/2 and presence of MT3 protein. In MT1/2-positive PT cells, the intracellular staining appeared diffuse or bipolar, but the isolated brush-border, basolateral and endosomal membranes were devoid of MT1/2 proteins. In the lumen of some PT profiles, heterogeneously sized MT1/2-rich vesicles were observed, with the limiting membrane positive for NHE3, but negative for V-ATPase, CAIV, and megalin, whereas their interior was positive for CAII and negative for cytoskeleton. They seem to be pinched off from the luminal membrane of MT1/2-rich cells, as also indicated by transmission electron microscopy. We conclude that in male rats, MTs are heterogeneously abundant in the cell cytoplasm and/or nuclei along the nephron. The MT1/2-rich vesicles in the tubule lumen may represent a source of urinary MT and membranous material, whereas MT3 in nuclei may handle zinc and locally produced reactive oxygen species.
We present a widely applicable methodology to bring machine translation (MT) to under-resourced languages in a cost-effective and rapid manner. Our proposal relies on web crawling to automatically acquire parallel data to train statistical MT systems if any such data can be found for the language pair and domain of interest. If that is not the case, we resort to (1) crowdsourcing to translate small amounts of text (hundreds of sentences), which are then used to tune statistical MT models, and (2) web crawling of vast amounts of monolingual data (millions of sentences), which are then used to build language models for MT. We apply these to two respective use-cases for Croatian, an under-resourced language that has gained relevance since it recently attained official status in the European Union. The first use-case regards tourism, given the importance of this sector to Croatia's economy, while the second has to do with tweets, due to the growing importance of social media. For tourism, we crawl parallel data from 20 web domains using two state-of-the-art crawlers and explore how to combine the crawled data with bigger amounts of general-domain data. Our domain-adapted system is evaluated on a set of three additional tourism web domains and it outperforms the baseline in terms of automatic metrics and/or vocabulary coverage. In the social media use-case, we deal with tweets from the 2014 edition of the soccer World Cup. We build domain-adapted systems by (1) translating small amounts of tweets to be used for tuning by means of crowdsourcing and (2) crawling vast amounts of monolingual tweets. These systems outperform the baseline (Microsoft Bing) by 7.94 BLEU points (5.11 TER) for Croatian-to-English and by 2.17 points (1.94 TER) for English-to-Croatian on a test set translated by means of crowdsourcing. A complementary manual analysis sheds further light on these results.
The digital era has unlocked unprecedented possibilities of compiling corpora of social discourse, which has brought corpus linguistic methods into closer interaction with other methods of discourse analysis and the humanities. Even when they do not use any specific techniques of corpus linguistics, researchers increasingly draw on some sort of corpus for empirically-grounded social-scientific analysis (sometimes dubbed ‘corpus-assisted discourse analysis’ or ‘corpus-based critical discourse analysis’, cf. Hardt-Mautner 1995; Baker 2016). In the post-Yugoslav space, recent corpus developments have brought table-turning advantages in many areas of discourse research, along with an ongoing proliferation of corpora and tools. Still, for linguists and discourse analysts who embark on collecting specialized corpora for their own research purposes, many questions persist – partly due to the fast-changing background of these issues, but also because there is still a gap in the corpus method, and in guidelines for corpus compilation, when applied beyond anglophone contexts. In this paper we aim to discuss some possible solutions to these difficulties by presenting a step-by-step account of a corpus building procedure specifically for Croatian, Serbian and Slovenian, through an example of compiling a thematic corpus from digital media sources (news articles and reader comments). Following an overview of corpus types, uses and advantages in social sciences and digital humanities, we present the corpus compilation possibilities in the South Slavic language contexts, including data scraping options, permissions and ethical issues, the factors that facilitate or complicate automated collection, and corpus annotation and processing possibilities. The study shows expanding possibilities for work with the given languages, but also some persistently grey areas where researchers need to make decisions based on research expectations.
Overall, the paper aims to recapitulate our own corpus compilation experience in the wider context of South-Slavic corpus linguistics and corpus linguistic approaches in the humanities more generally.
Web corpora have become an attractive source of linguistic content, yet for many languages they are still not available. This paper introduces two new annotated web corpora: the Croatian hrWaC and the Slovene slWaC. Both were built using a standard “Web as Corpus” pipeline, modified with the limited amount of available web data in mind. The modifications are described in the paper, focusing on the content extraction from HTML pages, which combines high precision of extracted language content with a decent recall. The paper also investigates text-types of the acquired corpora using topic modeling, comparing the two corpora with each other and with ukWaC.
In this paper we investigate the phenomenon of linguistic accommodation among Serbian Twitter users by analysing geo-encoded Twitter messages published between 2013 and 2016 in the area of Bosnia, Croatia, Montenegro and Serbia. We describe the linguistic production of Twitter users via 16 variables that are known to vary among the speakers of the pluricentric BCMS language. We test for accommodation by comparing the production of mobile Serbian Twitter users to that of non-mobile Serbian Twitter users, and by comparing the mobile users' language production inside and outside Serbia. While the first analysis shows support for accommodation, the second analysis yields no signal for that phenomenon.
In this minireview, the state of the art in Croatian monolingual lexicography is presented. A brief overview and classification of all existing lexicographic resources is provided in the first part of the minireview, followed by a somewhat more detailed insight into the existing Croatian monolingual dictionaries and monolingual lexicographic projects, orthography dictionaries, and dictionary writing systems used.
In this paper we discuss the parallel manual normalisation of samples extracted from Croatian and Serbian Twitter corpora. We describe the datasets, outline the unified guidelines provided to annotators, and present a series of analyses of standard-to-non-standard transformations found in the Twitter data. The results show that closed part-of-speech classes are transformed more frequently than the open classes, that the most frequently transformed lemmas are auxiliary and modal verbs, interjections, particles and pronouns, that character deletions are more frequent than insertions and replacements, and that more transformations occur at the word end than in other positions. Croatian and Serbian are found to share many, but not all transformation patterns; while some of the discrepancies can be ascribed to the structural differences between the two languages, others appear to be better explained by looking at extralinguistic factors. The produced datasets and their initial analyses can be used for studying the properties of non-standard language, as well as for developing language technologies for non-standard data.
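The character-level analysis mentioned above (deletions versus insertions and replacements) can be illustrated with a minimal sketch. It is not the authors' procedure: the function name `char_transformations`, the use of Python's `difflib` alignment, and the example word pair are all assumptions made for illustration only.

```python
from collections import Counter
from difflib import SequenceMatcher


def char_transformations(standard: str, nonstandard: str) -> Counter:
    """Count character-level operations that turn `standard` into `nonstandard`.

    Hypothetical helper: the alignment comes from difflib, not from the
    method used in the paper.
    """
    counts = Counter()
    matcher = SequenceMatcher(None, standard, nonstandard, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        src_len, tgt_len = i2 - i1, j2 - j1
        if tag == "delete":
            counts["deletion"] += src_len
        elif tag == "insert":
            counts["insertion"] += tgt_len
        elif tag == "replace":
            # Pair up characters as replacements; any length difference
            # is counted as extra deletions or insertions.
            counts["replacement"] += min(src_len, tgt_len)
            if src_len > tgt_len:
                counts["deletion"] += src_len - tgt_len
            elif tgt_len > src_len:
                counts["insertion"] += tgt_len - src_len
    return counts


# Illustrative pair (not taken from the datasets): standard "rekao"
# versus the colloquial spelling "reko".
print(char_transformations("rekao", "reko"))
```

Summing such counters over all normalised token pairs would give the kind of deletion, insertion and replacement totals the abstract compares; a position-aware variant would additionally record where in the word each operation falls.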
Although duckweed Lemna minor L. is a known accumulator of cadmium, detailed studies on its physiological and/or defense responses to this metal are still lacking. In this study, the effects of 10 μM CdCl2 on Lemna minor were monitored after 6 and 12 days of treatment, while growth was estimated every 2 days. Cadmium treatment resulted in progressive accumulation of the metal in the plants and led to a decrease in the growth rate to 54% of the control value. The metal also considerably impaired chloroplast ultrastructure and caused a significant reduction in pigment content, i.e., at day 12, by 30 and 34% for chlorophylls a and b, and by 25% for carotenoids. During cadmium treatment, the contents of malondialdehyde and endogenous H2O2 progressively increased (rising 77 and 46% above the controls by day 12), indicating that cadmium induced considerable oxidative stress. On the other hand, higher activities of pyrogallol peroxidase (PPX), ascorbate peroxidase (APX) and catalase (CAT), as well as the induction of a new APX isoform, in cadmium-treated plants, clearly showed activation of an antioxidative response. At day 6, only PPX activity was significantly above the controls (15%), while, at day 12, PPX, APX and CAT activities were increased (74, 78 and 63%). Cadmium also led to accumulation of the heat shock protein 70 (HSP70) and induced an additional isoform of this protein. The obtained results suggest that cadmium (10 μM) is phytotoxic to Lemna minor, inducing oxidative stress, and that antioxidative enzymes and HSP70 play important roles in the defense against cadmium toxicity.
Abstract
The article presents the results of a survey on dictionary use in Europe, focusing on general monolingual dictionaries. It is the broadest survey of dictionary use to date, covering close to 10,000 dictionary users (and non-users) in nearly thirty countries. Our survey covers varied user groups, going beyond the students and translators who have tended to dominate such studies thus far. The survey was delivered via an online survey platform, in language versions specific to each target country. It was completed by 9,562 respondents, over 300 respondents per country on average. The survey consisted of a general section, which was translated and presented to all participants, as well as country-specific sections for a subset of 11 countries, which were drafted by collaborators at the national level. The present report covers the general section.