ABSTRACT This paper presents and describes data for nonprofit Internal Revenue Service (IRS) filings in the United States of America. The data contain 831 attributes and 1,102,884 records for the years 2016 to 2021. Among other items, the data include nonprofits' comparative financial data, governance disclosures, hired contractors, management compensation, a detailed statement of revenue, a statement of functional expenses, external audits, federal audit elections, and a reconciliation of net assets. The data are generated using self-developed Structured Query Language (SQL) code that converts the IRS Form 990 Extensible Markup Language (XML) tax filing files to a dataset in Excel. This paper is the first to convert these XML files and provide much-needed open access to nonprofit data in a long format. The source code that we developed and a step-by-step guide are included in this paper, allowing researchers to update this dataset.
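The XML-to-long-format conversion described above can be sketched in miniature. This is an illustrative sketch only: the element names below are hypothetical placeholders, not the actual IRS Form 990 schema, and the paper's own SQL code handles the real conversion.

```python
# Sketch: flatten one XML tax filing into long-format rows (entity, year,
# attribute, value) and emit CSV. Element names are invented for illustration.
import csv
import io
import xml.etree.ElementTree as ET

SAMPLE_FILING = """
<Return>
  <Filer><EIN>123456789</EIN><Name>Example Nonprofit</Name></Filer>
  <TaxYear>2019</TaxYear>
  <TotalRevenue>500000</TotalRevenue>
  <TotalExpenses>450000</TotalExpenses>
</Return>
"""

def filing_to_rows(xml_text):
    """Flatten one filing into long-format (ein, year, attribute, value) rows."""
    root = ET.fromstring(xml_text)
    ein = root.findtext("Filer/EIN")
    year = root.findtext("TaxYear")
    rows = []
    for tag in ("TotalRevenue", "TotalExpenses"):
        value = root.findtext(tag)
        if value is not None:
            rows.append({"ein": ein, "year": year, "attribute": tag, "value": value})
    return rows

def rows_to_csv(rows):
    """Serialize the long-format rows to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["ein", "year", "attribute", "value"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

rows = filing_to_rows(SAMPLE_FILING)
print(rows_to_csv(rows))
```

Each financial attribute becomes its own row keyed by entity and year, which is what "long format" means here.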
A complex interface can disorient the user even in a mild situation when designing a web-based application. Measures should therefore be taken to reduce complexity and provide reasonable qualities (simplicity, generality, and usability) to the user. This paper evaluates an Interface Complexity Metric (ICM) for two XML schema languages, World Wide Web Consortium XML Schema (WXS) and Regular Language for XML Next Generation (RELAX NG, abbreviated RNG), to weigh human insight on the factors affecting the effort of comprehending schema documents, taking into account elements and attributes of XML documents acquired from Web Services Description Language (WSDL) descriptions. From one hundred and twenty (120) schema documents implemented in RNG and WXS, conclusions can be drawn about the perceived qualities: RNG has the lowest complexity values for almost all of the schema documents when compared with WXS, because RNG exhibits better presentation with a higher degree of simplicity, generality, and usability than WXS. This benefits developers by assisting them in writing less lengthy code and simplifying software modularization.
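A minimal proxy for this kind of interface-complexity count is to tally the distinct elements and attributes a document exposes. This is a deliberate simplification for illustration, not the paper's actual ICM formula.

```python
# Toy complexity proxy: count distinct element names and attribute names
# in an XML document. The sample document is invented for illustration.
import xml.etree.ElementTree as ET

SAMPLE_DOC = """
<order id="42">
  <item sku="A1" qty="2"/>
  <item sku="B7" qty="1"/>
  <total currency="USD">19.98</total>
</order>
"""

def interface_counts(xml_text):
    """Return (distinct element count, distinct attribute count)."""
    root = ET.fromstring(xml_text)
    elements, attributes = set(), set()
    for node in root.iter():
        elements.add(node.tag)
        attributes.update(node.attrib)
    return len(elements), len(attributes)

print(interface_counts(SAMPLE_DOC))  # (3, 4)
```

A schema language that lets the same vocabulary be described with fewer, simpler constructs would score lower on such a count, which is the intuition behind comparing WXS and RNG.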
Over the past decade, there has been increasing interest in using the Extensible Markup Language (XML), which has made it a de facto standard for representing and exchanging data across different systems and platforms (specifically the internet). Owing to the popularity of XML and the growing number of XML documents, knowledge discovery from this type of data has attracted more attention. Although several methods for mining XML documents have been proposed in the last decade, this research field is still in its infancy compared to traditional data mining. As with relational data, association rule mining over XML documents has drawn strong research interest. In this paper we perform a comprehensive study of the major work done so far on mining association rules from XML documents. The main contribution of the paper is to provide a reference point for future research by collecting the different techniques and methods concerning the topic, classifying them into a number of categories, and creating a complete bibliography of the major published works. We believe this paper can help researchers in the XML association rule mining domain quickly survey current work as a basis for future activities.
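The core operation the surveyed methods share can be sketched briefly: extract transactions (itemsets) from XML records, then compute support and confidence for candidate rules. The element names and data below are invented for illustration; the surveyed systems differ widely in how they map XML structure to transactions.

```python
# Sketch: mine a simple association rule from XML "transaction" records.
import xml.etree.ElementTree as ET

SAMPLE = """
<transactions>
  <transaction><item>bread</item><item>milk</item></transaction>
  <transaction><item>bread</item><item>butter</item></transaction>
  <transaction><item>bread</item><item>milk</item><item>butter</item></transaction>
  <transaction><item>milk</item></transaction>
</transactions>
"""

def load_transactions(xml_text):
    """Map each <transaction> element to an itemset."""
    root = ET.fromstring(xml_text)
    return [frozenset(i.text for i in t.findall("item"))
            for t in root.findall("transaction")]

def support(transactions, itemset):
    """Fraction of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    return (support(transactions, antecedent | consequent)
            / support(transactions, antecedent))

txns = load_transactions(SAMPLE)
print(support(txns, {"bread", "milk"}))       # 0.5
print(confidence(txns, {"bread"}, {"milk"}))  # 0.666...
```

The interesting research questions in the XML setting concern the mapping step, i.e. which fragments of the tree structure count as a transaction and as an item.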
XQuery is the best language for querying, manipulating, and transforming XML and JSON documents. Because XML is in many ways the lingua franca of the digital humanities, learning XQuery empowers humanists to discover and analyze their data in new ways. Until now, though, XQuery has been difficult to learn because there was no textbook designed for non- or beginner programmers. XQuery for Humanists fills this void with an approachable guidebook aimed directly at digital humanists. Clifford B. Anderson and Joseph C. Wicentowski introduce XQuery in terms accessible to humanities scholars and do not presuppose any prior background in programming. The book provides an informed, opinionated overview and recommends the best implementations, libraries, and paradigms to empower those who need them most. Emphasizing practical applicability, the authors go beyond the XQuery language to cover the basics of underlying standards like XPath and related standards like XQuery Full Text and XQuery Update, and explain the difference between XQuery and languages like Python and R. This book will afford readers the skills they need to build and analyze large-scale documentary corpora in XML. XQuery for Humanists is immeasurably valuable to instructors of digital humanities and library science courses alike and is likewise a ready reference for faculty, graduate students, and librarians who seek to master XQuery for their projects.
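For readers coming from Python, a flavour of the underlying XPath standard the book covers can be tried without any XQuery engine: Python's standard library supports a limited XPath subset. The TEI-style snippet below is invented for illustration.

```python
# A limited XPath query from Python's standard library, for comparison
# with the richer queries XQuery supports natively.
import xml.etree.ElementTree as ET

TEI_SNIPPET = """
<text>
  <body>
    <p n="1">First paragraph.</p>
    <p n="2">Second paragraph.</p>
  </body>
</text>
"""

root = ET.fromstring(TEI_SNIPPET)
# Limited XPath: select every <p> element anywhere under the root.
paragraphs = root.findall(".//p")
print([p.get("n") for p in paragraphs])  # ['1', '2']
```

A full XQuery implementation adds FLWOR expressions, typing, and whole-document transformation on top of this kind of path selection, which is the gap the book's comparison with Python and R addresses.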
•OpenMC is an open source Monte Carlo particle transport code.
•Solid geometry and continuous-energy physics allow high-fidelity simulations.
•Development has focused on high performance and modern I/O techniques.
•OpenMC is capable of scaling up to hundreds of thousands of processors.
•Other features include plotting, CMFD acceleration, and variance reduction.
This paper gives an overview of OpenMC, an open source Monte Carlo particle transport code recently developed at the Massachusetts Institute of Technology. OpenMC uses continuous-energy cross sections and a constructive solid geometry representation, enabling high-fidelity modeling of nuclear reactors and other systems. Modern, portable input/output file formats are used in OpenMC: XML for input, and HDF5 for output. High performance parallel algorithms in OpenMC have demonstrated near-linear scaling to over 100,000 processors on modern supercomputers. Other topics discussed in this paper include plotting, CMFD acceleration, variance reduction, eigenvalue calculations, and software development processes.
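The choice of XML for input means run configurations can be generated programmatically. The sketch below shows the general idea; the element names are illustrative and do not reproduce OpenMC's actual input schema.

```python
# Sketch: build a simulation settings file as XML, in the spirit of
# OpenMC's XML input format (element names here are placeholders).
import xml.etree.ElementTree as ET

def build_settings(particles, batches):
    """Assemble a minimal <settings> document."""
    settings = ET.Element("settings")
    ET.SubElement(settings, "particles").text = str(particles)
    ET.SubElement(settings, "batches").text = str(batches)
    return settings

doc = build_settings(particles=10000, batches=100)
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```

Generating inputs this way makes parameter sweeps scriptable, one motivation for preferring a structured, portable input format over a bespoke text format.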
The aim of this paper is, first of all, to offer an overview of the problems related to the production of XML-TEI files; it also summarizes the experience gained in the realization of a digital edition project financed by a Marie Curie Fellowship (DigiFlorimont, grant n° 745821). The tools are in fact still far from being ergonomic. The issue is crucial, as the instrumental gap affects both the spread of digital publishing and the quality of the proposed textual encoding. The risk is to continue producing a "simplified encoding with complex tools" when, after decades of practice, we should rather produce a "complex encoding with simplified tools". The DigiFlorimont project involved the creation of an XML-TEI editor prototype that generates the code automatically, starting from a symbolic syntax (inspired by the Markdown model) applied to a .txt file associated with a .csv file. Thanks to this tool, it was possible to create a very complex XML-TEI encoding according to a textual model conceived as a database of words, ready to be used for other analyses beyond the specific objectives of the production context. These technical reflections also aim to clarify the role of the "digital philologist": a philologist charged with disseminating within the academic community a philologically and computationally correct text, "ready for use", according to a scholarly model of progressive capitalization of the analysis work.
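The editor-prototype's core idea, expanding a lightweight symbolic syntax into XML-TEI markup, can be caricatured in a few lines. The `*emphasis*` convention below is borrowed from Markdown purely for illustration; the project's actual symbolic syntax is its own.

```python
# Toy sketch: expand a Markdown-like symbolic syntax into TEI markup.
import re

def to_tei(line):
    """Wrap a line in <p> and turn *word* spans into <hi rend="italic">."""
    body = re.sub(r"\*([^*]+)\*", r'<hi rend="italic">\1</hi>', line)
    return "<p>" + body + "</p>"

print(to_tei("Li contes de *Florimont* commence."))
```

The payoff of such a pipeline is that the encoder types lightweight markers rather than nested tags, while the generated output remains valid, richly structured TEI.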
Multilabel learning involving hundreds of thousands or even millions of labels is referred to as extreme multilabel learning (XML), in which the labels often follow a power-law distribution, with the majority occurring in very few data points as tail labels. Recent years have witnessed the intensive use of deep-learning methods for high-performance XML, but they are typically optimized for the head labels with abundant training instances and pay less attention to performance on tail labels, which, like needles in haystacks, are often the focus of attention in real-life applications. In light of this, we present BoostXML, a deep-learning-based XML method for extreme multilabel text classification, enhanced greatly by gradient boosting. In BoostXML, we pay more attention to tail labels in each Boosting Step by optimizing the residual mostly from unfitted training instances with tail labels. A Corrective Step is further proposed to avoid mismatch between the text encoder and the weak learners during optimization, which reduces the risk of falling into local optima and improves model performance. A Pretraining Step is also introduced in the initial stage of BoostXML to avoid excessive bias toward tail labels. Extensive experiments on five benchmark datasets against state-of-the-art baselines demonstrate the advantage of BoostXML in tail-label prediction.
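The residual-focused intuition behind the Boosting Step can be illustrated with a scalar caricature: instances the current model fits poorly (here standing in for tail-label instances) dominate the residuals, so the next weak learner concentrates on them. This is an illustration of gradient boosting in general, not BoostXML itself.

```python
# Scalar caricature of residual fitting in gradient boosting: the rare,
# poorly fit "tail" instance produces the largest residual for the next
# weak learner to target.
targets = [1.0, 1.0, 1.0, 5.0]            # 5.0 plays the role of a tail case
prediction = sum(targets) / len(targets)  # constant first-stage model: 2.0

residuals = [y - prediction for y in targets]
print(residuals)  # [-1.0, -1.0, -1.0, 3.0]
```

BoostXML applies this principle in a deep-learning setting, additionally correcting for drift between the text encoder and the weak learners during training.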
Since the emergence of Web 2.0, data has been floating all over the web, through online and offline applications, and across all application domains. Web data (semi-structured data loaded through web browsers and applications communicating via internet protocols such as HTTP), in particular XML-based data, is being used for simple commercial information display (i.e., XHTML/HTML in commercial websites), instant messaging (e.g., XMPP for messaging in WhatsApp, Skype, GTalk, etc.), financial transactions (i.e., CDF3 in e-commerce), medical record processing and storage (e.g., HL7 for electronic medical records), social media (e.g., XHTML/HTML in Facebook, LinkedIn, Google Plus, etc.), and others. This phenomenon has rendered web data manipulation (i.e., monitoring, modifying, controlling, etc.) by IT (information technology) experts, computer technicians, and engineers utterly difficult, given its exponential growth rate in volume and diversity, not to mention the dynamicity of the data, which is continuously changing around the clock, and its heterogeneity (e.g., HTML/HTML5, XML, XHTML, RDF, OWL, etc.).
Consequently, the manipulation of web data and in particular XML data (since XML has become one of the most essential data types used in computer communications) has shifted from the hands of computer scientists and programmers towards public computer users in all application domains.
This has brought a new criterion into the web data manipulation research field: web data manipulation by non-experts. In this paper, we study and analyze existing techniques for manipulating semi-structured web data, particularly XML data, from a non-expert point of view, while relating them to traditional manipulation techniques defined in the literature (i.e., filtering, adaptation, data extraction, transformation, access control, encryption, etc.). Web data manipulation techniques for non-experts are categorized under three major headings: (i) XML-oriented visual languages dealing with XML data extraction and transformation, (ii) mashups tackling mainly XML restructuring with value manipulations, and (iii) dataflow visual programming languages targeting non-expert manipulations and providing means to visually manipulate scientific data. A full analysis was conducted, allowing existing approaches and techniques to be compared and evaluated and providing an overview of the current requirements on this subject.
Although temporal XML data are being stored and manipulated by several XML-based applications in different domains (e.g., e-commerce, e-health), there is neither a temporal XML update language proposed by researchers nor built-in support provided by existing XML DBMSs and tools for maintaining such data. Furthermore, the well-known temporal XML framework tauXSchema has no features for inserting, deleting, or updating temporal XML instances. In this paper, we bridge these gaps by proposing a temporal extension of the W3C XQuery Update Facility (XUF) language, named tauXUF (Temporal XUF), which allows manipulating temporal XML data in tauXSchema. With tauXUF, both the syntax and the semantics of the update expressions of the XUF language are extended to support temporal aspects. Examples are also provided to motivate and illustrate our proposal.