Given a high-order large-scale tensor, how can we decompose it into latent factors? Can we process it on commodity computers with limited memory? These questions are closely related to recommender systems, which have modeled rating data not as a matrix but as a tensor to utilize contextual information such as time and location. This increase in order calls for tensor-factorization methods that scale with both the order and the size of a tensor. In this paper, we propose two distributed tensor-factorization methods, CDTF and SALS. Both methods are scalable with all aspects of data and show a trade-off between convergence speed and memory requirements. CDTF, based on coordinate descent, updates one parameter at a time, while SALS generalizes this by varying the number of parameters updated at a time. In our experiments, only our methods factorized a five-order tensor with 1 billion observable entries, 10M mode length, and 1K rank, while all other state-of-the-art methods failed. Moreover, our methods required several orders of magnitude less memory than their competitors. We implemented our methods on MapReduce with two widely applicable optimization techniques: local disk caching and greedy row assignment. They sped up our methods by up to 98.2× and even the competitors by up to 5.9×.
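As a point of reference, here is a minimal in-memory sketch of the coordinate-descent update that CDTF builds on, shown for a 3-order tensor with rank-R factor matrices A, B, C and L2 regularization. The variable names are illustrative and the actual method is distributed; this only demonstrates the closed-form one-parameter-at-a-time update.

```python
import numpy as np

def cdtf_update_A(entries, A, B, C, lam=0.01):
    """One coordinate-descent sweep over factor matrix A.

    entries: list of observed (i, j, k, x) tuples of a 3-order tensor.
    Each column A[:, r] is updated in closed form while all other
    parameters are held fixed (the core idea behind CDTF; the real
    method is distributed and generalizes to higher orders).
    """
    R = A.shape[1]
    for r in range(R):
        num = np.zeros(A.shape[0])
        den = np.full(A.shape[0], lam)  # L2 regularization term
        for i, j, k, x in entries:
            # Residual of entry (i, j, k) excluding component r.
            pred = A[i] @ (B[j] * C[k])
            res = x - pred + A[i, r] * B[j, r] * C[k, r]
            num[i] += res * B[j, r] * C[k, r]
            den[i] += (B[j, r] * C[k, r]) ** 2
        A[:, r] = num / den  # closed-form ridge solution per row
    return A
```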
Hypergraphs naturally represent group interactions, which are omnipresent in many domains: collaborations of researchers, co-purchases of items, and joint interactions of proteins, to name a few. In this work, we propose tools for answering the following questions in a systematic manner: (Q1) what are the structural design principles of real-world hypergraphs? (Q2) how can we compare local structures of hypergraphs of different sizes? (Q3) how can we identify the domains from which hypergraphs originate? We first define hypergraph motifs (h-motifs), which describe the overlapping patterns of three connected hyperedges. Then, we define the significance of each h-motif in a hypergraph as its occurrences relative to those in properly randomized hypergraphs. Lastly, we define the characteristic profile (CP) as the vector of the normalized significance of every h-motif. Regarding Q1, we find that the occurrences of h-motifs in 11 real-world hypergraphs from 5 domains are clearly distinguished from those in randomized hypergraphs. In addition, we demonstrate that CPs capture local structural patterns unique to each domain; thus, comparing the CPs of hypergraphs addresses Q2 and Q3. The concept of CP naturally extends to represent the connectivity pattern of each node or hyperedge as a vector, which proves useful in node classification and hyperedge prediction. Our algorithmic contribution is to propose MoCHy, a family of parallel algorithms for counting the occurrences of h-motifs in a hypergraph. We theoretically analyze their speed and accuracy, and we show empirically that the advanced approximate version MoCHy-A+ is up to 25× more accurate and 32× faster than the basic approximate and exact versions, respectively. Furthermore, we explore ternary hypergraph motifs, which extend h-motifs by taking into account not only the presence but also the cardinality of intersections among hyperedges. This extension proves beneficial for all previously mentioned applications.
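A minimal sketch of how a characteristic profile could be computed from motif counts, following the relative-significance idea described above. The exact normalization and the stabilizing constant are assumptions for illustration, not the paper's precise formulas.

```python
import numpy as np

def characteristic_profile(real_counts, rand_counts, eps=1.0):
    """Sketch of a characteristic profile (CP) computation.

    real_counts[t]: occurrences of h-motif t in the hypergraph.
    rand_counts[t]: mean occurrences of t in randomized hypergraphs.
    Significance compares real and randomized counts (eps is an
    assumed stabilizer); the CP is the unit-normalized vector.
    """
    real = np.asarray(real_counts, dtype=float)
    rand = np.asarray(rand_counts, dtype=float)
    significance = (real - rand) / (real + rand + eps)
    return significance / np.linalg.norm(significance)
```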
Given a sequence of epidemic events, can a single epidemic model capture its dynamics during the entire period? How should we divide the sequence into segments to better capture the dynamics? Throughout human history, infectious diseases (e.g., the Black Death and COVID-19) have been serious threats. Consequently, understanding and forecasting the evolving patterns of epidemic events are critical for prevention and decision making. To this end, epidemic models based on ordinary differential equations (ODEs), which effectively describe dynamic systems in many fields, have been employed. However, a single epidemic model is not enough to capture the long-term dynamics of epidemic events, especially when the dynamics heavily depend on external factors (e.g., lockdowns and the capability to perform tests). In this work, we demonstrate that properly dividing the event sequence regarding COVID-19 (specifically, the numbers of active cases, recoveries, and deaths) into multiple segments and fitting a simple epidemic model to each segment leads to a better fit with fewer parameters than fitting a complex model to the entire sequence. Moreover, we propose a methodology for balancing the number of segments and the complexity of epidemic models, based on the Minimum Description Length (MDL) principle. Our methodology is (a) Automatic: not requiring any user-defined parameters, (b) Model-agnostic: applicable to any ODE-based epidemic models, and (c) Effective: effectively describing and forecasting the spread of COVID-19 in 70 countries.
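A minimal sketch of the MDL trade-off described above, with hypothetical cost functions (the actual codelength formulas are model-specific): the objective sums the bits needed to describe each segment's fitted model and the bits needed to describe the data given that model, and the best segmentation minimizes this sum.

```python
def total_description_length(segments, fit_model, model_cost, data_cost):
    """MDL objective for a candidate segmentation (a sketch).

    segments: list of sub-sequences of the epidemic event sequence.
    fit_model: fits an ODE-based epidemic model to one segment.
    model_cost: bits to encode a fitted model's parameters.
    data_cost: bits to encode a segment's residuals given the model.
    Few complex models vs. many simple models: MDL balances the
    model description cost against the data encoding cost.
    """
    total = 0.0
    for seg in segments:
        model = fit_model(seg)
        total += model_cost(model) + data_cost(seg, model)
    return total
```

The segmentation that minimizes this quantity over candidate split points balances the number of segments against model complexity without user-defined parameters.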
If we cannot store all edges in a dynamic graph, which edges should we store to estimate the triangle count accurately? Counting triangles (i.e., cliques of size three) is a fundamental graph problem with many applications in social network analysis, web mining, anomaly detection, etc. Recently, much effort has been made to accurately estimate the counts of global triangles (i.e., all triangles) and local triangles (i.e., all triangles incident to each node) in large dynamic graphs, especially with limited space. Although existing algorithms use sampling techniques without considering temporal dependencies among edges, we observe temporal locality in the formation of triangles in real dynamic graphs. That is, future edges are more likely to form triangles with recent edges than with older edges. In this work, we propose a family of single-pass streaming algorithms called Waiting-Room Sampling (WRS) for estimating the counts of global and local triangles in a fully dynamic graph, where edges are inserted and deleted over time, within a fixed memory budget. WRS exploits the temporal locality by always storing the most recent edges, which future edges are more likely to form triangles with, in the waiting room, while it uses reservoir sampling and its variant for the remaining edges. Our theoretical and empirical analyses show that WRS is: (a) Fast and 'any time': runs in linear time, always maintaining and updating estimates while the input graph evolves, (b) Effective: yields up to 47% smaller estimation error than its best competitors, and (c) Theoretically sound: gives unbiased estimates with small variances under the temporal locality.
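A minimal sketch of the waiting-room idea for insertion-only streams (illustrative class and weighting; the actual WRS also handles deletions and local counts, and its estimator is more carefully derived): the newest edges are kept for certain in a FIFO waiting room, edges evicted from it enter a reservoir sample, and each arriving edge's triangles are counted against both stores with inverse-probability weights.

```python
import random
from collections import deque

class WaitingRoomSampler:
    """Sketch of waiting-room sampling for the global triangle count."""

    def __init__(self, room_size, reservoir_size):
        self.room = deque()   # the most recent edges, always stored
        self.room_size = room_size
        self.reservoir = []   # uniform sample of the older edges
        self.k = reservoir_size
        self.n_old = 0        # number of edges that left the room
        self.estimate = 0.0

    def insert(self, u, v):
        # Inclusion probability of an older edge in the reservoir.
        p_res = min(1.0, self.k / self.n_old) if self.n_old else 1.0
        nbr_u, nbr_v = {}, {}
        stored = [(e, 1.0) for e in self.room] + \
                 [(e, p_res) for e in self.reservoir]
        for (a, b), p in stored:
            for x, y in ((a, b), (b, a)):
                if x == u:
                    nbr_u[y] = p
                elif x == v:
                    nbr_v[y] = p
        # Each common neighbor w closes a triangle {u, v, w}; weight it
        # by the inverse probability that both stored edges are present.
        for w in nbr_u.keys() & nbr_v.keys():
            self.estimate += 1.0 / (nbr_u[w] * nbr_v[w])
        # Admit the new edge; evict the oldest room edge into the
        # reservoir via standard reservoir sampling.
        self.room.append((u, v))
        if len(self.room) > self.room_size:
            old = self.room.popleft()
            self.n_old += 1
            if len(self.reservoir) < self.k:
                self.reservoir.append(old)
            elif random.random() < self.k / self.n_old:
                self.reservoir[random.randrange(self.k)] = old
```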
Given a stream of edge additions and deletions, how can we estimate the count of triangles in it? If we can store only a subset of the edges, how can we obtain unbiased estimates with small variances?
Counting triangles (i.e., cliques of size three) in a graph is a classical problem with applications in a wide range of research areas, including social network analysis, data mining, and databases. Recently, streaming algorithms for triangle counting have been extensively studied since they can naturally be used for large dynamic graphs. However, existing algorithms either cannot handle edge deletions or suffer from low accuracy.
Can we handle edge deletions while achieving high accuracy? We propose ThinkD, which accurately estimates the counts of global triangles (i.e., all triangles) and local triangles (i.e., those associated with each node) in a fully dynamic graph stream with additions and deletions of edges. Compared to its best competitors, ThinkD is (a) Accurate: up to 4.3× more accurate within the same memory budget, (b) Fast: up to 2.2× faster for the same accuracy requirements, and (c) Theoretically sound: always maintaining estimates with zero bias (i.e., no difference between the true triangle count and the expected value of its estimate) and small variance. As an application, we use ThinkD to detect suddenly emerging dense subgraphs, and we show its advantages over state-of-the-art methods.
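A minimal sketch of an update rule in the spirit of ThinkD's simpler variant, using Bernoulli edge sampling at rate p (names and structure are illustrative simplifications): estimates are updated before the sample is modified, so every triangle is accounted for exactly when its last edge arrives, and deletions are handled symmetrically.

```python
import random
from collections import defaultdict

class ThinkDFastSketch:
    """Sketch of triangle counting over a fully dynamic edge stream."""

    def __init__(self, p):
        self.p = p                      # edge sampling probability
        self.sample = defaultdict(set)  # adjacency lists of sampled edges
        self.estimate = 0.0             # global triangle estimate

    def process(self, u, v, insert=True):
        # Update first: each triangle closed (or opened) by this edge
        # has its two other edges in the sample with probability p each,
        # so each discovered triangle contributes weight 1 / p^2.
        common = self.sample[u] & self.sample[v]
        delta = len(common) / (self.p ** 2)
        self.estimate += delta if insert else -delta
        # Then maintain the sample.
        if insert:
            if random.random() < self.p:
                self.sample[u].add(v)
                self.sample[v].add(u)
        else:
            self.sample[u].discard(v)
            self.sample[v].discard(u)
```

Updating the estimate before touching the sample is what keeps the expectation exact under both additions and deletions.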
This study evaluates the performance of a deep learning model, Deep-learning-based Rain Nowcasting and Estimation (DEEPRANE), for very short-term (1–6 h) rainfall forecasts in South Korea. Rainfall forecasts and in-situ observations from June–September 2020, when record-breaking summer rainfall was observed in South Korea, are particularly considered. It is found that DEEPRANE adequately predicts moderate rainfall events (MREs; ≥ 1 mm h⁻¹) and strong rainfall events (SREs; ≥ 10 mm h⁻¹) with critical success indices of 0.6 and 0.4 at the 1-h lead time, respectively. The probability of detection scores of MRE forecasting are higher than the false alarm rates at all lead times, suggesting that DEEPRANE MRE forecasts can be useful even at long lead times. However, for SRE forecasting, the probability of detection scores become smaller than the false alarm rates at a lead time of 2 h. Localized heavy rainfall events (LHREs; ≥ 30 mm h⁻¹) are also reasonably well predicted only at a lead time of 2 h. Irrespective of the event patterns, the forecast scores systematically decrease with lead time. This result indicates that the DEEPRANE SRE forecast is useful only for nowcasting. DEEPRANE generally shows better performance in the early morning hours, when rainfall events are more frequent than in other hours. When considering synoptic conditions, better performance is found when rainfall events are organized by the monsoon rainband rather than caused by extratropical or tropical cyclones. These results suggest that DEEPRANE is especially useful for nowcasting early-morning rainfall events that are embedded in the monsoon rainband.
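For reference, the verification scores used above follow the standard definitions from a 2×2 forecast contingency table; this sketch shows those definitions, not code from the study itself.

```python
def verification_scores(hits, misses, false_alarms):
    """Standard categorical scores for event forecasts.

    hits: event forecast and observed; misses: observed but not
    forecast; false_alarms: forecast but not observed.
    """
    pod = hits / (hits + misses)                 # probability of detection
    far = false_alarms / (hits + false_alarms)   # false alarm ratio
    csi = hits / (hits + misses + false_alarms)  # critical success index
    return pod, far, csi
```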
Multi-aspect data appear frequently in web-related applications. For example, product reviews are quadruplets of the form (user, product, keyword, timestamp), and search-engine logs are quadruplets of the form (user, keyword, location, timestamp). How can we analyze such web-scale multi-aspect data on an off-the-shelf workstation with a limited amount of memory? Tucker decomposition has been used widely for discovering patterns in such multi-aspect data, which are naturally expressed as large but sparse tensors. However, existing Tucker decomposition algorithms have limited scalability, failing to decompose large-scale high-order (≥ 4) tensors, since they explicitly materialize intermediate data, whose size grows exponentially with the order. To address this problem, which we call the "Materialization Bottleneck," we propose S-HOT, a scalable algorithm for high-order Tucker decomposition. S-HOT minimizes materialized intermediate data by using on-the-fly computation, and it is optimized for disk-resident tensors that are too large to fit in memory. We theoretically analyze the amount of memory and the number of data scans required by S-HOT. Moreover, we empirically show that S-HOT handles tensors of higher order, dimensionality, and rank than baselines. For example, S-HOT successfully decomposes a real-world tensor from the Microsoft Academic Graph on an off-the-shelf workstation, while all baselines fail. In particular, in terms of dimensionality, S-HOT decomposes 1000× larger tensors than baselines.
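A minimal sketch of the on-the-fly idea for the mode-n projection used in Tucker fitting (illustrative function and names; S-HOT itself is disk-based, computes only leading singular vectors, and is considerably more refined): each nonzero's contribution is accumulated directly into the small result, so no exponentially large intermediate tensor is ever materialized.

```python
import numpy as np

def project_on_the_fly(nonzeros, factors, mode, dims, ranks):
    """Accumulate Y = X x_{m != mode} A_m^T over a sparse tensor.

    nonzeros: iterable of (index_tuple, value) pairs of X.
    factors: list of factor matrices A_m of shape (dims[m], ranks[m]).
    Returns Y of shape (dims[mode], product of the other ranks),
    built one nonzero at a time instead of via a materialized
    intermediate tensor.
    """
    other = [m for m in range(len(dims)) if m != mode]
    out_cols = int(np.prod([ranks[m] for m in other]))
    Y = np.zeros((dims[mode], out_cols))
    for idx, val in nonzeros:
        # Rank-1 contribution of this nonzero: the Kronecker product
        # of the relevant factor rows, flattened to one row vector.
        row = np.array([val])
        for m in other:
            row = np.outer(row, factors[m][idx[m]]).ravel()
        Y[idx[mode]] += row
    return Y
```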
How can we detect fraudulent lockstep behavior in large-scale multi-aspect data (i.e., tensors)? Can we detect it when data are too large to fit in memory or even on a disk? Past studies have shown that dense subtensors in real-world tensors (e.g., social media, Wikipedia, TCP dumps, etc.) signal anomalous or fraudulent behavior such as retweet boosting, bot activities, and network attacks. Thus, various approaches, including tensor decomposition and search, have been proposed for detecting dense subtensors rapidly and accurately. However, existing methods suffer from low accuracy, or they assume that tensors are small enough to fit in main memory, which is unrealistic in many real-world applications such as social media and the web. To overcome these limitations, we propose D-Cube, a disk-based dense-subtensor detection method, which also can run in a distributed manner across multiple machines. Compared to state-of-the-art methods, D-Cube is (1) Memory Efficient: requires orders of magnitude less memory and handles far larger data than in-memory methods, (2) Fast: significantly faster due to its near-linear scalability, (3) Provably Accurate: gives a guarantee on the densities of the detected subtensors, and (4) Effective: spotted network attacks from TCP dumps and synchronized behavior in rating data most accurately.
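A minimal in-memory sketch of the greedy-shaving strategy behind this style of dense-subtensor search (an illustrative simplification; D-Cube operates on disk, removes values in bulk, and proves an approximation guarantee for its greedy choices): repeatedly remove the attribute value whose slice carries the least mass, and remember the densest subtensor seen along the way, using mass divided by the average per-mode cardinality as the density measure.

```python
def densest_subtensor(entries, order):
    """Greedy search for a dense subtensor (an in-memory sketch).

    entries: dict mapping an index tuple of length `order` to a
    nonnegative value (e.g., event counts). Density is total mass
    divided by the average number of distinct values kept per mode.
    """
    entries = dict(entries)

    def density(es):
        if not es:
            return 0.0
        mass = sum(es.values())
        avg_card = sum(len({idx[m] for idx in es})
                       for m in range(order)) / order
        return mass / avg_card

    best, best_density = dict(entries), density(entries)
    while entries:
        # Mass of every (mode, attribute value) slice still present.
        slice_mass = {}
        for idx, val in entries.items():
            for m in range(order):
                key = (m, idx[m])
                slice_mass[key] = slice_mass.get(key, 0.0) + val
        # Shave off the lightest slice and reassess the density.
        m, a = min(slice_mass, key=slice_mass.get)
        entries = {idx: v for idx, v in entries.items() if idx[m] != a}
        if density(entries) > best_density:
            best, best_density = dict(entries), density(entries)
    return best, best_density
```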
Accurate nowcasting is critical for preemptive action in response to heavy rainfall events (HREs). However, operational numerical weather prediction models have difficulty predicting HREs in the short term, especially for rapidly and sporadically developing cases. Here, we present multi-year evaluation statistics showing that deep-learning-based HRE nowcasting, trained with radar images and ground measurements, outperforms short-term numerical weather prediction at lead times of up to 6 h. For predicting HREs in South Korea during the Asian summer monsoon, the deep learning nowcasting improves accuracy over numerical prediction by 162% at the 1-h lead time down to 31% at the 6-h lead time. The spatial distribution and diurnal cycle of HREs are also well predicted. Predictions of isolated HREs in the late afternoon to early evening, which mostly result from convective processes associated with surface heating, are particularly useful. These results suggest that the deep learning algorithm may be viable for HRE nowcasting, potentially serving as an alternative to the operational numerical weather prediction model.