OverlayFS, a Linux kernel driver for the Docker container, supports access control in Linux. User processes in the container must have appropriate privileges to access files in the backing file ...system. To that end, OverlayFS temporarily overrides credentials of a process with the mounter’s credentials whenever the process accesses a file, inode, or directory. However, we found that this mechanism incurs severe overhead owing to contention in updating a shared reference counter for the mounter’s credentials, which hinders OverlayFS scalability. In this paper, we propose a credential caching scheme, CredsCache, which greatly relieves contention by maintaining a per-process cached version of the mounter’s credentials. In CredsCache, each process performs credential overriding and reverting using its own cached version without updating the shared reference counter. We implemented CredsCache in OverlayFS and evaluated its performance using micro- and macro-benchmarks. Using solid-state drive (SSD) backend storage, the micro-benchmark results showed that CredsCache achieved up to 23.3× and 54.7× higher performance for data and metadata I/O, respectively, than vanilla OverlayFS. Moreover, the macro-benchmark and real-world benchmark results showed that CredsCache produced up to 5.6× and 2.9× better performance for respective database and web server benchmarks than vanilla OverlayFS.
•Temporary overriding of credentials incurs severe overhead in OverlayFS.•Mounter’s credentials never change while a container is in a normal running state.•Per-process credential caching reduces temporary credential overriding overhead.
The recently introduced _Perceus_ algorithm can automatically insert
reference count instructions such that the resulting (cycle-free) program is
_garbage free_: objects are freed at the very moment ...they can no longer be
referenced. An important extension is reuse analysis. This optimization pairs
objects of known size with fresh allocations of the same size and tries to
reuse the object in-place at runtime if it happens to be unique. Unfortunately,
current implementations of reuse analysis are fragile with respect to small
program transformations, or can cause an arbitrary increase in the peak heap
usage. We present a novel _drop-guided_ reuse algorithm that is simpler and
more robust than previous approaches. Moreover, we generalize the linear
resource calculus to precisely characterize garbage-free and frame-limited
evaluations. On each function call, a frame-limited evaluation may hold on to
memory longer if the size is bounded by a constant factor. Using this framework
we show that our drop-guided reuse _is_ frame-limited and find that an
implementation of our new reuse approach in Koka can provide significant
speedups.
elPrep is an established multi-threaded framework for preparing SAM and BAM files in sequencing pipelines. To achieve good performance, its software architecture makes only a single pass through a ...SAM/BAM file for multiple preparation steps, and keeps sequencing data as much as possible in main memory. Similar to other SAM/BAM tools, management of heap memory is a complex task in elPrep, and it became a serious productivity bottleneck in its original implementation language during recent further development of elPrep. We therefore investigated three alternative programming languages: Go and Java using a concurrent, parallel garbage collector on the one hand, and C++17 using reference counting on the other hand for handling large amounts of heap objects. We reimplemented elPrep in all three languages and benchmarked their runtime performance and memory use.
The Go implementation performs best, yielding the best balance between runtime performance and memory use. While the Java benchmarks report a somewhat faster runtime than the Go benchmarks, the memory use of the Java runs is significantly higher. The C++17 benchmarks run significantly slower than both Go and Java, while using somewhat more memory than the Go runs. Our analysis shows that concurrent, parallel garbage collection is better at managing a large heap of objects than reference counting in our case.
Based on our benchmark results, we selected Go as our new implementation language for elPrep, and recommend considering Go as a good candidate for developing other bioinformatics tools for processing SAM/BAM data as well.
Third-party Python modules are usually implemented as binary extensions by using native code (C/C++) to provide additional features and runtime acceleration. In native code, the heap-allocated ...PyObjects are managed by the reference counting mechanism provided in Python/C APIs for automatic reclaiming. Hence, improper refcount manipulations can lead to memory leaks and use-after-free problems, and cannot be detected by simply pairing the occurrence of source and sink points. To detect such problems, state-of-the-art approaches have made groundbreaking contributions to identifying inappropriate final refcount values before returning from native code to Python. However, not all problems can be exposed at the end of a path. To detect those hidden in the middle of a path in native code, it is also crucial to track the lifecycle state of PyObjects through the refcount and lifecycle operations in API calls. To achieve this goal, we propose the PyObject State Transition Model (PSTM) recording the lifecycle states and refcount values of PyObjects to describe the effects of Python/C API calls and pointer operations. We track state transitions of PyObjects with symbolic execution based on the model, and report problems when a statement triggers a transition to buggy states. The program state is also expanded to handle pointer nullity checks and smart pointers of PyObjects. We conduct experiments on 12 open-source projects and detect 259 real problems out of 280 reports, which is twice as many bugs as state-of-the-art approaches. We submit 168 real bugs to those active projects, and 106 issues are either confirmed or resolved.
Reversible functional languages have been proposed that use patterns symmetrically for matching and building data: A pattern used on the left-hand side of a function rule takes apart a data ...structure, and a pattern used on the right-hand side of a rule builds a data structure. When calling a function in reverse, the meaning of patterns are reversed, so patterns on the right-hand side take apart data and patterns on the left-hand side build data. If using a pattern to build data creates a node with exactly one reference, a node that is taken apart must by symmetry also have exactly one reference. This implies linearity: A node that is built or taken apart has exactly one incoming reference. A recursive function can take apart one data structure while building two or more new copies, allowing multiple uses of values, but the new copies will also have one reference each. Linearity makes garbage collection simple: Whenever a node is taken apart by pattern matching, it is freed. The cost is that multiple uses of a value require deep copies. The reason for the linearity restriction is the symmetry between constructing and deconstructing nodes: Since constructing a node creates exactly one reference to this node, the inverse can only be applied if the reference count is exactly one. We can overcome this limitation if constructing a node can return a node with multiple references. We achieve this through maximal sharing: If a newly constructed node is identical to an already existing node, we return a pointer to the existing node (increasing its reference count) instead of allocating a new node with reference count one. This allows multiple references to nodes while retaining the symmetry of pattern matching and construction, and it allows values to be used multiple times without making deep copies. To avoid searching the entire heap for an identical node, we use hash-consing to restrict the search to a small segment of the heap. We estimate how large this segment needs to be to give a low probability of allocation failure with acceptable utilisation and support this estimate with experiments. We sketch how a functional program can be translated to use the memory manager, and we test the memory manager both with artificial test programs and a hand-compiled functional program.
Relaxed Radix Balanced Trees (RRB-Trees) is one of the latest members in a family of persistent tree based data-structures that combine wide branching factors with simple and relatively flat ...structures. Like the battle-tested immutable sequences of Clojure and Scala, they have effectively constant lookup and updates, good cache utilization, but also logarithmic concatenation and slicing. Our goal is to bring the benefits of persistent data structures to the discipline of systems programming via generic yet efficient immutable vectors supporting transient batch updates. We describe a C++ implementation that can be integrated in the runtime of higher level languages with a C core (Lisps like Guile or Racket, but also Python or Ruby), thus widening the access to these persistent data structures.
In this work we propose (1) an
Embedding RRB-Tree
(ERRB-Tree) data structure that efficiently stores arbitrary unboxed types, (2) a technique for implementing tree operations orthogonal to optimizations for a more compact representation of the tree, (3) a
policy-based
design to support multiple memory management and reclamation mechanisms (including automatic garbage collection and reference counting), (4) a model of
transience
based on move-semantics and reference counting, and (5) a definition of transience for confluent
meld
operations. Combining these techniques a performance comparable to that of mutable arrays can be achieved in many situations, while using the data structure in a functional way.
Biased reference counting Choi, Jiho; Shull, Thomas; Torrellas, Josep
Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques,
11/2018
Conference Proceeding
Open access
Reference counting (RC) is one of the two fundamental approaches to garbage collection. It has the desirable characteristics of low memory overhead and short pause times, which are key in today's ...interactive mobile platforms. However, RC has a higher execution time overhead than its counterpart, tracing garbage collection. The reason is that RC implementations maintain per-object counters, which must be continually updated. In particular, the execution time overhead is high in environments where low memory overhead is critical and, therefore, non-deferred RC is used. This is because the counter updates need to be performed atomically.
To address this problem, this paper proposes a novel algorithm called Biased Reference Counting (BRC), which significantly improves the performance of non-deferred RC. BRC is based on the observation that most objects are only accessed by a single thread, which allows most RC operations to be performed non-atomically. BRC leverages this by biasing each object towards a specific thread, and keeping two counters for each object --- one updated by the owner thread and another updated by the other threads. This allows the owner thread to perform RC operations non-atomically, while the other threads update the second counter atomically.
We implement BRC in the Swift programming language runtime, and evaluate it with client and server programs. We find that BRC makes each RC operation more than twice faster in the common case. As a result, BRC reduces the average execution time of client programs by 22.5%, and boosts the average throughput of server programs by 7.3%.
The goal of the Cyclone project is to investigate how to make a low-level C-like language safe. Our most difficult challenge has been providing programmers with control over memory management while ...retaining safety. This paper describes our experience trying to integrate and use effectively two previously-proposed, safe memory-management mechanisms: statically-scoped regions and tracked pointers. We found that these typing mechanisms can be combined to build alternative memory-management abstractions, such as reference counted objects and arenas with dynamic lifetimes, and thus provide a flexible basis. Our experience — porting C programs and device drivers, and building new applications for resource-constrained systems — confirms that experts can use these features to improve memory footprint and sometimes to improve throughput when used instead of, or in combination with, conservative garbage collection.
We present a technique for concurrent memory management that combines the ease-of-use of automatic memory reclamation, and the efficiency of state-of-the-art deferred reclamation algorithms.
First, ...we combine ideas from referencing counting and hazard pointers in a novel way to implement automatic concurrent reference counting with wait-free, constant-time overhead. Second, we generalize our previous algorithm to obtain a method for converting any standard manual SMR technique into an automatic reference counting technique with a similar performance profile.
We have implemented the approach as a C++ library and compared it experimentally to existing atomic reference-counting libraries and state-of-the-art manual techniques. Our results indicate that our technique is faster than existing reference-counting implementations, and competitive with manual memory reclamation techniques. More importantly, it is significantly safer than manual techniques since objects are reclaimed automatically.
Reference-counting is traditionally considered unsuitable for multiprocessor systems. According to conventional wisdom, the update of reference slots and reference-counts requires atomic or ...synchronized operations. In this work we demonstrate this is not the case by presenting a novel reference-counting algorithm suitable for a multiprocessor system that does not require any synchronized operation in its write barrier (not even a compare-and-swap type of synchronization). A second novelty of this algorithm is that it allows eliminating a large fraction of the reference-count updates, thus, drastically reducing the reference-counting traditional overhead. This article includes a full proof of the algorithm showing that it is safe (does not reclaim live objects) and live (eventually reclaims all unreachable objects).We have implemented our algorithm on Sun Microsystems' Java Virtual Machine (JVM) 1.2.2 and ran it on a four-way IBM Netfinity 8500R server with 550-MHz Intel Pentium III Xeon and 2 GB of physical memory. Our results show that the algorithm has an extremely low latency and throughput that is comparable to the stop-the-world mark and sweep algorithm used in the original JVM.