Thymic clear cell carcinoma is a rare mediastinal neoplasm, with only 25 reported cases to date. We report a case of thymic clear cell carcinoma in a 45-year-old man. We believe imaging and laboratory tests may aid the differential diagnosis.
A 45-year-old man was admitted to a local hospital for chest distress with palpitations. CT showed a mediastinal mass. Laboratory examination results were all within the normal range. Histologically, the tumor cells had clear cytoplasm, and immunohistochemically, they were positive for epithelial markers. We performed abdominal and pelvic CT and further examined serum levels of thyroxine, parathyroid hormone and AFP postoperatively, all of which were normal. The patient received postoperative radiotherapy, and CT showed left adrenal metastasis 20 months after surgery.
Thymic clear cell carcinoma is a rare malignant neoplasm in which adrenal metastasis can occur. Patients who undergo thymectomy with chemotherapy or radiotherapy have better outcomes. Metastasis, direct invasion by parathyroid carcinoma, and other primary mediastinal tumors should be excluded; immunohistochemical markers, imaging and laboratory examination can help exclude metastasis.
Abstract
The frozen section (FS) diagnoses of pathology experts are used in China to determine intraoperatively whether sentinel lymph nodes of breast cancer patients harbor metastasis. Direct implementation of a deep neural network (DNN) in clinical practice may be hindered by algorithmic misdiagnoses, which affect patients' treatment decisions. In this study, we first obtained the predictions of the commonly used patch-based DNN; we then present a relative risk classification and regression tree (RRCART) to identify misdiagnosed whole-slide images (WSIs) and recommend them for review by pathologists. Applying this framework to 2362 WSIs of breast cancer lymph node metastasis, testing on frozen sections yielded a mean area under the curve (AUC) of 0.9851. However, the mean misdiagnosis rate (0.0248) was significantly higher than the pathologists' misdiagnosis rate (p < 0.01). The RRCART assigned more than 80% of the WSIs to a high-accuracy group whose average accuracy reached 0.995, a performance not significantly different from the pathologists' (p > 0.01). The remaining low-accuracy group contained most of the misdiagnoses of the DNN models. Our research shows that the misdiagnoses of a deep learning model can be enriched by our method: the low-accuracy WSIs must be selected for pathologist review, while the high-accuracy ones may be ready for pathologists to issue diagnostic reports.
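The routing idea described above can be sketched as follows. This is a minimal illustration assuming slide-level tumor probabilities from a patch-DNN; the thresholds and the simple confidence rule are hypothetical placeholders, not the fitted RRCART splits from the study:

```python
import numpy as np

def split_by_confidence(probs, low=0.1, high=0.9):
    """Split slide-level tumor probabilities into a 'high-accuracy' group
    (confident predictions, auto-reportable) and a 'low-accuracy' group
    (uncertain predictions, flagged for pathologist review).
    Thresholds are illustrative assumptions."""
    probs = np.asarray(probs, dtype=float)
    confident = (probs <= low) | (probs >= high)
    return probs[confident], probs[~confident]

# Example: six hypothetical slide-level probabilities from a patch-DNN
high_grp, low_grp = split_by_confidence([0.02, 0.98, 0.55, 0.91, 0.45, 0.05])
```

In this toy rule, four confident slides would proceed to reporting while the two mid-range slides (0.55 and 0.45) would be routed to a pathologist; RRCART instead learns such splits from data.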
Breast cancer is the most common malignant tumor in the world. Intraoperative frozen section examination of sentinel lymph nodes is an important basis for determining whether axillary lymph node dissection is required during breast cancer surgery. We propose an RRCART model based on a deep-learning network to identify metastases in 2362 frozen sections and to tally the wrongly identified sections and the associated reasons. The purpose is to summarize the factors that affect the accuracy of the artificial intelligence model and to propose corresponding solutions.
We took the pathological diagnoses of senior pathologists as the gold standard and identified errors. The pathologists and artificial intelligence engineers jointly read the images and heatmaps to determine the locations of the identification errors on the sections, and the pathologists determined the reasons for the errors. Using NVivo 12 Plus, qualitative word-frequency and node analyses were performed on the error reasons, and a top-down error-reason framework for the "artificial intelligence RRCART model identifying frozen sections of breast cancer lymph nodes" was constructed based on the importance of the error reasons.
There were 101 incorrectly identified sections among the 2362 slides, including 42 false negatives and 59 false positives. Using NVivo 12 Plus software, the error causes were coded into nodes, finally yielding 2 parent nodes (high-frequency error, low-frequency error) and 5 child nodes (section quality, normal lymph node structure, secondary reaction of lymph nodes, micrometastasis, and special growth pattern of tumor); the most frequent error was that caused by normal lymph node structure, with 45 cases (44.55%), followed by micrometastasis, with 30 cases (29.70%).
The causes of identification errors in the examination of sentinel lymph node frozen sections by artificial intelligence are, in descending order of influence, normal lymph node structure, micrometastases, section quality, special tumor growth patterns and secondary lymph node reactions. By constructing an artificial intelligence model to identify the error causes in frozen sections of breast cancer lymph nodes and analyzing it in detail, we found that poor section quality was a precursor to many identification errors and can lead to others, such as the computer failing to recognize lymph node structure clearly. Therefore, we believe the artificial intelligence pathological diagnosis workflow should be optimized: quality control of the pathological sections submitted for AI reading should be carried out first, to exclude the influence of poor section quality on the computer model. For cases of micrometastasis, we suggest differentiating sections into high- and low-confidence groups so that low-confidence micrometastatic sections can be separated for manual identification. Errors on normal lymph node structure can be reduced by adding samples and training the model in a targeted manner.
This paper proposes an automatic spatially-aware concept discovery approach using weakly labeled image-text data from shopping websites. We first fine-tune GoogleNet by jointly modeling clothing images and their corresponding descriptions in a visual-semantic embedding space. Then, for each attribute (word), we generate its spatially-aware representation by combining its semantic word vector representation with its spatial representation derived from the convolutional maps of the fine-tuned network. The resulting spatially-aware representations are further used to cluster attributes into multiple groups to form spatially-aware concepts (e.g., the neckline concept might consist of attributes like v-neck, round-neck, etc.). Finally, we decompose the visual-semantic embedding space into multiple concept-specific subspaces, which facilitates structured browsing and attribute-feedback product retrieval by exploiting multimodal linguistic regularities. We conducted extensive experiments on our newly collected Fashion200K dataset, and results on clustering quality evaluation and the attribute-feedback product retrieval task demonstrate the effectiveness of our automatically discovered spatially-aware concepts.
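The representation-then-clustering step can be illustrated with a small sketch. The concatenation of a word vector with a pooled activation map, and the minimal k-means below, are simplifying assumptions for exposition, not the paper's exact procedure:

```python
import numpy as np

def spatially_aware_repr(word_vec, conv_map):
    """Concatenate an attribute's semantic word vector with a pooled,
    flattened spatial activation map (channels x H x W) to form its
    spatially-aware representation. Pooling choice is an assumption."""
    spatial = conv_map.mean(axis=0).ravel()  # average over channels, flatten H*W
    return np.concatenate([word_vec, spatial])

def kmeans(X, k, iters=20):
    """Minimal k-means to group attribute representations into concepts."""
    X = np.asarray(X, dtype=float)
    # Deterministic initialization: k evenly spaced rows of X
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # Assign each point to its nearest center
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # Recompute each center as the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy example: two "neckline-like" attributes activate the same region,
# two "sleeve-like" attributes do not (all values are hypothetical)
reps = np.stack([
    spatially_aware_repr(np.array([1.0, 0.0]), np.ones((3, 2, 2))),
    spatially_aware_repr(np.array([0.9, 0.1]), np.ones((3, 2, 2))),
    spatially_aware_repr(np.array([0.0, 1.0]), np.zeros((3, 2, 2))),
    spatially_aware_repr(np.array([0.1, 0.9]), np.zeros((3, 2, 2))),
])
labels = kmeans(reps, k=2)
```

Attributes with similar semantics and similar spatial footprints end up in the same cluster, which is the intuition behind a "spatially-aware concept".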
Exploring dense matching between the current frame and past frames for long-range context modeling, memory-based methods have demonstrated impressive results in video object segmentation (VOS) recently. Nevertheless, due to the lack of instance understanding ability, the above approaches are oftentimes brittle to large appearance variations or viewpoint changes resulting from the movement of objects and cameras. In this paper, we argue that instance understanding matters in VOS, and that integrating it with memory-based matching yields a synergy, which is intuitively sensible from the definition of the VOS task, i.e., identifying and segmenting object instances within the video. Towards this goal, we present a two-branch network for VOS, where the query-based instance segmentation (IS) branch delves into the instance details of the current frame and the VOS branch performs spatial-temporal matching with the memory bank. We employ the well-learned object queries from the IS branch to inject instance-specific information into the query key, with which instance-augmented matching is further performed. In addition, we introduce a multi-path fusion block to effectively combine the memory readout with multi-scale features from the instance segmentation decoder, which incorporates high-resolution instance-aware features to produce the final segmentation results. Our method achieves state-of-the-art performance on DAVIS 2016/2017 val (92.6% and 87.1%), DAVIS 2017 test-dev (82.8%), and YouTube-VOS 2018/2019 val (86.3% and 86.3%), outperforming alternative methods by clear margins.
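The memory-based matching that the VOS branch builds on can be sketched as a softmax attention readout: affinities between the current frame's query key and memory keys weight a sum over memory values. The dot-product affinity and the dimensions below are illustrative assumptions, not this paper's exact formulation:

```python
import numpy as np

def memory_readout(query_key, mem_keys, mem_vals):
    """Space-time memory readout sketch.
    query_key: (N, C) keys for N query pixels
    mem_keys:  (M, C) keys stored for M memory pixels
    mem_vals:  (M, D) values stored alongside the keys
    Returns (N, D): per-query weighted sum of memory values."""
    affinity = query_key @ mem_keys.T / np.sqrt(mem_keys.shape[1])
    affinity -= affinity.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(affinity)
    w /= w.sum(axis=1, keepdims=True)                # softmax over memory
    return w @ mem_vals

# A query that closely matches the first memory key reads out its value
q = np.array([[10.0, 0.0]])
keys = np.array([[10.0, 0.0], [0.0, 10.0]])
vals = np.array([[1.0], [0.0]])
out = memory_readout(q, keys, vals)
```

Injecting instance-specific information into `query_key`, as the paper's IS branch does via object queries, biases this readout toward pixels of the same instance.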
Identity-consistent video generation seeks to synthesize videos that are guided by both textual prompts and reference images of entities. Current approaches typically utilize cross-attention layers to integrate the appearance of the entity, which predominantly capture semantic attributes, resulting in compromised entity fidelity. Moreover, these methods necessitate iterative fine-tuning for each new entity encountered, limiting their applicability. To address these challenges, we introduce VideoAssembler, a novel end-to-end framework for identity-consistent video generation that can conduct inference directly when encountering new entities. VideoAssembler is adept at producing videos that are not only flexible with respect to the input reference entities but also responsive to textual conditions. Additionally, by modulating the quantity of input images for the entity, VideoAssembler enables tasks ranging from image-to-video generation to sophisticated video editing. VideoAssembler comprises two principal components: the Reference Entity Pyramid (REP) encoder and the Entity-Prompt Attention Fusion (EPAF) module. The REP encoder is designed to infuse comprehensive appearance details into the denoising stages of the stable diffusion model. Concurrently, the EPAF module is utilized to integrate text-aligned features effectively. Furthermore, to mitigate the challenge of scarce data, we present a methodology for preprocessing the training data. Our evaluation of the VideoAssembler framework on the UCF-101, MSR-VTT, and DAVIS datasets indicates that it achieves strong performance in both quantitative and qualitative analyses (346.84 in FVD and 48.01 in IS on UCF-101). Our project page is at https://gulucaptain.github.io/videoassembler/.
Temporal Video Grounding (TVG) aims to localize a moment in an untrimmed video given a language description. Since the annotation of TVG is labor-intensive, TVG under limited supervision has attracted increasing attention in recent years. The great success of vision-language pre-training guides TVG to follow the traditional "pre-training + fine-tuning" paradigm; however, the pre-training process suffers from a lack of temporal modeling and fine-grained alignment due to differences in data characteristics between pre-training and testing. Besides, the large gap between pretext and downstream tasks makes zero-shot testing impossible for the pre-trained model. To avoid the drawbacks of the traditional paradigm, we propose AutoTVG, a new vision-language pre-training paradigm for TVG that enables the model to learn semantic alignment and boundary regression from automatically annotated untrimmed videos. Specifically, AutoTVG consists of a novel Captioned Moment Generation (CMG) module that generates captioned moments from untrimmed videos, and TVGNet with a regression head to predict localization results. Experimental results on Charades-STA and ActivityNet Captions show that, for zero-shot temporal video grounding, AutoTVG achieves highly competitive performance with in-distribution methods under out-of-distribution testing, and is superior to existing pre-training frameworks while using much less training data.
Inspired by the remarkable success of Latent Diffusion Models (LDMs) for image synthesis, we study LDMs for text-to-video generation, which is a formidable challenge due to the computational and memory constraints during both model training and inference. A single LDM is usually capable of generating only a very limited number of video frames. Some existing works use separate prediction models to generate more video frames, which, however, suffer from additional training cost and frame-level jittering. In this paper, we propose a "Reuse and Diffuse" framework, dubbed \(\textit{VidRD}\), to produce more frames following the frames already generated by an LDM. Conditioned on an initial video clip with a small number of frames, additional frames are iteratively generated by reusing the original latent features and following the previous diffusion process. Besides, for the autoencoder used for translation between pixel space and latent space, we inject temporal layers into its decoder and fine-tune these layers for higher temporal consistency. We also propose a set of strategies for composing video-text data that involve diverse content from multiple existing datasets, including video datasets for action recognition and image-text datasets. Extensive experiments show that our method achieves good results in both quantitative and qualitative evaluations. Our project page is available \(\href{https://anonymous0x233.github.io/ReuseAndDiffuse/}{here}\).
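The iterative "generate more frames from an initial clip" loop can be outlined in a toy form. The real method runs a full diffusion process conditioned on reused latents; here the denoising step is deliberately stubbed as a running mean plus noise, so everything except the looping structure is a hypothetical placeholder:

```python
import numpy as np

def extend_latents(init_latents, n_new, noise_scale=0.1, seed=0):
    """Toy sketch of iterative frame extension: each new frame latent is
    conditioned on the tail of already-generated latents (here, the mean
    of the last two) plus noise. A real LDM would replace this stub with
    a denoising diffusion step."""
    rng = np.random.default_rng(seed)
    frames = list(np.asarray(init_latents, dtype=float))
    for _ in range(n_new):
        context = np.mean(frames[-2:], axis=0)  # reuse recent latent features
        frames.append(context + noise_scale * rng.standard_normal(context.shape))
    return np.stack(frames)

# Extend a 2-frame initial clip (4-dim latents, all zeros) by 3 frames
out = extend_latents(np.zeros((2, 4)), n_new=3)
```

The loop makes explicit why this avoids training a separate prediction model: the same generative step is reapplied, conditioned on its own previous outputs.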