# Towards Hardware Accelerated Garbage Collection with Near-Memory Processing

Samuel Thomas\*, Jiwon Choe\*, Ofir Gordon<sup>†</sup>, Erez Petrank<sup>†</sup>, Tali Moreshet<sup>‡</sup>, Maurice Herlihy\*, R. Iris Bahar<sup>§</sup> <sup>∗</sup>Brown University, †Technion Institute-Israel, ‡Boston University, §Colorado School of Mines

*Abstract*—Garbage collection is widely available in popular programming languages, yet it may incur high performance overheads in applications. Prior works have proposed specialized hardware acceleration implementations to offoad garbage collection overheads off the main processor, but these solutions have yet to be implemented in practice. In this paper, we propose using offthe-shelf hardware to accelerate off-the-shelf garbage collection algorithms. Furthermore, our work is *latency* oriented as opposed to other works that focus on bandwidth. We demonstrate that we can get a  $2\times$  performance improvement in some workloads and a  $2.3\times$  reduction in LLC traffic by integrating generic Near-Memory Processing (NMP) into the built-in Java garbage collector. We will discuss architectural implications of these results and consider directions for future work.

*Index Terms*—garbage collection, near-memory processing, benchmarking

## I. INTRODUCTION

Garbage collection (GC) is an automatic memory management protocol utilized by many high level programming languages [\[8\]](#page-5-0), including Java, Python, Go, etc. These garbagecollected programming languages are popular among developers because manual memory management is diffcult and error-prone, leading to memory bugs that are notoriously hard to overcome.

Unfortunately, garbage collection may utilize 10-30% of application CPU cycles, and the resulting overhead can be even more for memory-intensive edge-case workloads [\[20\]](#page-5-1). As such, prior work [\[3\]](#page-5-2), [\[16\]](#page-5-3), [\[20\]](#page-5-1) has looked into hardwareaccelerated garbage collection. This is an idea about as old as garbage collection itself [\[12\]](#page-5-4), [\[17\]](#page-5-5). However, while these works provide strong theoretical solutions, none of them have been adapted in practice due to complex hardware that has highly specialized use cases. That is, none of the older proposals have been integrated into commodity or production hardware, for the solutions were not generalizable across various programming languages with different GC algorithms and implementation-level details [\[21\]](#page-5-6), [\[22\]](#page-5-7).

Nonetheless, prior work on alleviating GC bottlenecks have provided a fundamental understanding of GC behavior. In particular, the *marking phase* of the *mark-and-sweep* algorithm, which identifes the live objects (*i.e.*, uncollectible memory) in an application, is fundamentally pointer-chasing and thus subject to bottlenecks arising from frequent but irregular memory accesses.

To address this, we turn to *near-memory processing* (NMP). NMP describes a lightweight, simple compute unit being placed at the logic controller of a DRAM vault. In the NMP architecture, having these compute units physically near the memory allows for higher memory bandwidth and lower latency memory access. As such, NMP has recently reemerged as a promising solution to memory bottlenecks in pointerchasing, data-intensive applications. Many prior works have delved into accelerating pointer-chasing operations through hardware and software redesigns with NMP [\[11\]](#page-5-8), [\[13\]](#page-5-9), [\[19\]](#page-5-10). They do so by exploiting *bandwidth* – a well explored property of NMP-style accelerators.

In this work, we aim to provide an exploratory analysis on accelerating garbage collection with NMP via *latency* benefts. We identify that the marking subroutine of the mark-andsweep garbage collection algorithm exhibits this same pointerchasing behavior. Work in [\[10\]](#page-5-11), [\[11\]](#page-5-8) provide precedent for accelerating pointer-chasing applications via NMP. In this work, we extend these themes to garbage collection – which is more complex and works at a higher level of abstraction.

This work specifcally examines the effectiveness of offoading the pointer-chasing marking phase of garbage collection to *generic* near-memory compute units. Instead of developing idealistic custom hardware, we aim to exploit a generic NMP hardware similar to what is already in production [\[1\]](#page-5-12). We aim to take advantage of existing under-utilized hardware as opposed to developing new hardware with the hope that it will be deployed. While custom garbage collection hardware may lead to performance benefts in simulated environments, these proposed designs may not be realistic. Instead, we work from the overriding belief that *good solutions on unmodifed hardware* can lead to implementations that can be easily adopted in off-the-shelf systems.

Our early results are promising. Depending on the evaluation confguration, offoading Java garbage collection's marking phase to near-memory compute units can reduce last-level cache misses by up to  $2.3\times$  and improve overall benchmark performance by up to  $2\times$ . Although further investigation is needed in order to extend the benefts to a wider range of workloads and confgurations, the analysis points to several promising directions for future work.

This paper is organized as follows: in Section [II,](#page-1-0) we briefy review the mark-and-sweep algorithm used in garbage collection and provide motivation for the use of generic NMP architectures; in Section [III,](#page-2-0) we describe the mechanism for offoading computation; in Section [IV,](#page-2-1) we evaluate and analyze the architecture-level behavior of our approach across varying evaluation confgurations; in Section [V,](#page-4-0) we consider opportunities for future work based on our initial fndings.



<span id="page-1-1"></span>Fig. 1. Illustration of the marking phase in the mark-and-sweep GC algorithm. The stack of live object roots point to live objects in the heap. Objects with an  $\bar{x}$  have been marked as live. Figure (a) shows the initial state of the heap. Figure (b) illustrates A being popped from the live object roots to mark referent item C. fgure (c) shows C being added to the live object roots. Figure (d) shows the state after B has been popped from the live object roots, with H marked as live.

## II. BACKGROUND

<span id="page-1-0"></span>Garbage collection often employs the *mark-and-sweep* method. Briefy described, let *roots* denote all pointers directly reachable by program threads. The *marking phase* starts by pushing all root pointers to the *mark-stack* (denoted as *live object roots* in Fig [1\)](#page-1-1). Next, an iterative process of handling the mark-stack is executed until it is empty. In each iteration, (1) a pointer is popped from the mark-stack (depicted as A in Fig. [1b](#page-1-1)), (2) the referent object is scanned for pointers, and for each referenced object: (2a) it is marked as live (depicted as C in Fig. [1b](#page-1-1)) and (2b) if not previously marked, it gets pushed into the mark-stack (depicted as C in Fig. [1c](#page-1-1)). Once the stack is empty, the heap is *swept* clean of all objects that have not been marked. In Fig. [1,](#page-1-1) this would ultimately clear nodes  $E$  and  $F$ .

We look to utilize near-memory processing (NMP) to optimize garbage collection in Java. NMP architecture based on 3D-die stacking technologies have been explored as a workaround for computations with memory bottlenecks, including graph processing and generic pointer-chasing [\[2\]](#page-5-13), [\[10\]](#page-5-11), [\[11\]](#page-5-8), [\[13\]](#page-5-9)–[\[15\]](#page-5-14), [\[19\]](#page-5-10), [\[23\]](#page-5-15). NMP describes an architecture in which a lightweight processor is placed at the logic controller of the memory die to perform more complex computation than can be performed in a traditional memory controller. While NMP architectures do not cut down latencies inherent to DRAM, they reduce performance degradation and energy consumption caused by data movement on long and narrow off-chip interconnects

The benefts of using NMP have been shown to be two-fold. For one, pointer-chasing applications tend to follow pointers to random locations in memory and exhibit poor cache locality as a result. Instead, by using NMPs, the same computation can be performed without polluting the cache with pointers that will not be reused. The other beneft of NMP is that it exhibits a lower latency than CPUs in that there is no interconnect latency at play, which saves cycles. In our work, we work from the observation that the marking phase of garbage collection in Java exhibits similar behavior.

The pointer-chasing aspect of mark-and-sweep causes memory bottlenecks, which relevant prior work have tried to address by developing customized hardware accelerators. In particular, Maas *et al.* [\[20\]](#page-5-1) designs specialized accelerators for both the marking and sweeping phases of GC to avoid cache pollution. They also describe an additional device driver that is maintained by the kernel so that any higher level programming language can utilize the underlying hardware. Charon [\[16\]](#page-5-3) identifes certain fne-grained operations (*e.g.,* scanning the heap, copying objects for memory consolidation, etc.) as GC primitives and designs custom near-memory accelerators for each of these primitives. While these works have shown performance improvements, each required signifcant hardware modifcation and limited their evaluation to workloads with large objects in order to maximally exploit the high bandwidth benefts of NMP.

Our work is distinctly different from prior works in that we assume generic near-memory compute units. That is, these cores have no hardware components that are specialized to optimized garbage collection. Instead, all components of these cores can be utilized by general computation. We focus on appropriately partitioning the GC work between the on-chip host cores and the near-memory compute units and in turn aim to modify the mark-and-sweep algorithm accordingly to take advantage of NMPs latency benefts without polluting on-chip caches.

Java JDK Garbage Collection: While Java is not the only language that uses mark-and-sweep, our investigations are based on Java JDK's mark-and-sweep garbage collection. This garbage collection happens in a *stop-the-world* manner, meaning that application execution is stopped entirely while the garbage collection process runs. Garbage collection algorithms that run concurrently with the application have been proven to be too slow to be used in [\[7\]](#page-5-16), so we do not consider them in our evaluation. Moreover, garbage collection in Java is triggered when heap space utilization exceeds 80% of its capacity. As such, larger heap sizes means that fewer collections will occur and overall application execution will ultimately be faster. However, because our approach aims to reduce the GC execution time, we intentionally reduce the heap size in order to trigger more collections and measure their performance appropriately.

More specifcally, the JDK employs *generational* garbage collection, in which the heap is divided into a *young generation* heap and an *old generation* heap. This is based on the insight that objects are likely to be freed soon after their allocation. As such, objects are frst placed in the smaller young generation heap. Objects that survive young generation collections are moved to the old generation heap, where collections are infrequent but incur longer pauses. In particular, it is in old generation collections (*i.e.*, *full collections*) that the mark-and-sweep algorithm is used. Thus, in order to trigger more mark-and-sweep garbage collections for evaluation, we

TABLE I EVALUATION FRAMEWORK CONFIGURATION.

<span id="page-2-2"></span>

| <b>Host Configuration</b>                                                        |                                           |
|----------------------------------------------------------------------------------|-------------------------------------------|
| <b>Host</b> cores                                                                | in-order processor (gem5 TimingSimpleCPU) |
|                                                                                  | 2GHz frequency, 1 thread/core             |
| L1 cache                                                                         | 48kB icache, 32kB dcache, private         |
|                                                                                  | 2-way set-associative LRU                 |
| $L2$ cache                                                                       | 1MB, shared, 8-way set associative LRU    |
| <b>NMP</b> Core Configuration                                                    |                                           |
| <b>NMP</b> cores                                                                 | in-order single-cycle processor           |
|                                                                                  | (gem5 TimingSimpleCPU), 2GHz frequency    |
| $\overline{L1}$ cache                                                            | 48kB icache, 32kB dcache, private         |
|                                                                                  | 2-way set-associative LRU                 |
| <b>Memory Configuration</b>                                                      |                                           |
| 2GB DRAM (gem5 DDR3_1600_3x3)                                                    |                                           |
| $t_{RP}$ : 13.75ns, $t_{RCD}$ : 13.75ns, $t_{CL}$ : 13.75ns, $t_{BURST}$ : 3.2ns |                                           |

<span id="page-2-0"></span>varied the young and old generation heap sizes.

## III. METHODOLOGY

We perform our evaluation on a gem5 [\[4\]](#page-5-17) full-system simulation extended from the ARM bigLITTLE configuration. ARM bigLITTLE describes a heterogeneous CPU architecture with high-powered and low-powered cores, and low-powered cores exhibit properties similar to NMP cores. In particular, little cores and NMP cores both are single-threaded, in-order cores with simple execution pipelines, and neither utilize the host's cache hierarchy. This makes the ARM bigLITTLE architecture particularly well suited to evaluate the ability of our approach to reduce cache pollution. Then, we modify the timing across the little core memory access pipeline so as to accurately model the memory access latency of a near-memory processing core. In particular, we remove interconnect latency, which was shown to have a significant impact on performance and energy consumption in [\[11\]](#page-5-8), and reduce memory access latency by 5% to model NMP's faster access times due to proximity to memory.

We found this simulation model appropriate given the similar properties of ARM little cores and NMP cores in production. Furthermore, the JDK proved to be too complex for other state-of-the-art simulators that may more precisely simulate NMP, such as SMCSim [\[9\]](#page-5-18). Running these benchmarks on these simulators requires intensive changes to the simulators that would have had little impact on overall runtime performance or cache behavior results. Given this, ARM bigLITTLE provides a sufficient model for performing our evaluation. Full details of our hardware confgurations can be viewed in Table [I.](#page-2-2)

At a high level, we aim to pin the marking protocol to NMP cores in our simulation. However, achieving this end is nontrivial. In order to do so, we modify the Java garbage collector in the JDK version 14. In particular, we isolate the marking phase of the mark-and-sweep algorithm in order to run it on the NMP cores. This entails creating a specialized workGroup – a special garbage collection process object in the JDK source. We then leverage the fact that our assumed architecture allows the operating system (OS) to see the NMP core as a core in the simulated environment to offoad computation by pinning

it to NMP cores, but that the Java runtime environment only has explicit knowledge of host cores.

The ability of the OS to manage the host processors and NMP cores solves diffcult problems in NMP research that are tangential to offoading computation. For example, coherence between host and NMP cores remains an open problem that is distinct from partitioning work across the heterogeneous architecture, but is diffcult to navigate without the help of the OS.

By implementing this protocol in software, our approach is not limited by the ability to implement software primitives in hardware. This enables our approach to take advantage of general-purpose hardware without any invasive architectural changes. Furthermore, this means that we can extend our approach to newer versions of garbage collection in the JDK and other high level programming languages that might exhibit different semantics. Supporting this fexibility is important given that software frequently evolves – for example, JDK version 14 had stable changes pushed almost every month during its release.

## IV. EVALUATION

<span id="page-2-1"></span>In our evaluation, we use two confgurations: *host-only* and *NMP*. The host-only confguration models a conventional system with on-chip processors, L1 caches and a shared L2 cache. The NMP confguration incorporates NMP cores into the host-only confguration. In the NMP confguration, there is a single near-memory core (*NMP core*) equipped with an L1 instruction and data cache. We use the h2 benchmark from the DaCapo benchmark suite [\[5\]](#page-5-19), which is a standard Java garbage collection testing suite. Prior evaluation [\[7\]](#page-5-16) of the DaCapo suite has demonstrated that the h2 benchmark is among the most memory intensive benchmarks in the suite and is well suited for evaluating full collections. When using the DaCapo benchmark suite, it is standard practice to run warm-up iterations prior to collecting measurement data. This is done in order to avoid measuring extraneous behavior such as dynamic loading and compiling of modules and classes, which rarely happen in the typical execution of long-running applications. The number of warm-up iterations varies in prior work from one [\[6\]](#page-5-20) to 20 [\[18\]](#page-5-21). In our preliminary evaluation, we found that the performance improvements gained by increasing the number of warm-up iterations from four to 20 is less than 5%. We perform evaluations with and without warm-up iterations. For evaluations with warm-ups, we run three warmup iterations and report the measurements that we obtain from the fourth unless the discussion explicitly states otherwise.

## *A. Performance*

Our initial evaluation is promising, and its results are shown in Fig. [2a](#page-3-0). Performance refers to runtime in milliseconds, so lower is better. The evaluation involved varying the young and old generation heap sizes to show scalability and promise of our approach. We used the following old/young heap size confgurations: (1) 4GB/256MB (JDK default), (2) 250MB/50MB, and (3) 250MB/100MB. By utilizing smaller old generation



<span id="page-3-0"></span>Fig. 2. Running time (y-axis) of the h2 benchmark under different heap size confgurations (x-axis). (a) shows performance without warm-up iterations and (b) shows performance with warm-up iterations.

heap sizes in configurations (2) and (3), more mark-andsweep full collections were incurred overall. No warm up iterations were used in this initial evaluation. The fgure shows that the NMP confguration can demonstrate up to a 2x improvement in performance (which occurred in the default JDK confguration).

We see that the 250MB/100MB configuration is consistently slower than the 250MB/50MB confguration. This is largely due to the fact that young garbage collections will take longer in confgurations where the young generation size is bigger (as there are more objects in the heap), and our approach does not modify young generation collections. In the 250MB/50MB heap size configuration, we see a 48% improvement in overall performance of the NMP-based solution relative to the unmodifed host-only baseline. In the 250MB/100MB heap size confguration, we see a 47% improvement in overall performance by the NMP-based solution. The takeaway from this result is that full collections of the young and old generation heaps are the bottleneck to performance rather than having a large number of young generational collections. Fine-tuning heap-sizes for optimal performance is difficult, and is beyond the scope of this project. However, we see that the performance improvement with our approach is consistent across each heap confguration without warm-up iterations. As such, we use these results to further our intuition, but otherwise largely focus on the default heap size confguration.

Fig. [2b](#page-3-0) shows the impact of warm up iterations on performance. The results demonstrate that the differences in performance are much less signifcant between the NMP and host only confgurations with warm ups. Fig. [3](#page-3-1) shows the impact on performance as the number of warm-up iterations increases in the h2 benchmark with and without NMP in the default heap size confguration. As previously described, there is  $2 \times$  difference in performance without any warm up iterations, but this difference is reduced over time. In fact, the difference in performance becomes negligible across the two confgurations after three warm up iterations.

We see a greater impact of our approach on performance



<span id="page-3-1"></span>Fig. 3. Running time in milliseconds versus number of warm-up iterations on the h2 benchmark in host only and NMP confgurations with the default heap size configurations.



<span id="page-3-2"></span>Fig. 4. Number of requests for the L2 in each of the heap confgurations by architecture. Most requests occur in the frst warm-up iteration.

without warm-up iterations. This is largely attributed to differences in application behavior. In general, evaluating performance on a cold-start tends to induce more cache misses. Seeing as our approach optimizes L2 misses, it makes sense that our approach performs better in configurations where there are more L2 misses. We evaluate and discuss this in more detail with a more complete sensitivity analysis in Sec. [IV-C.](#page-4-1)

### *B. Cache Traffc*

At the architecture level, we fnd that L2 traffc is the primary culprit for differences in overall behavior. Fig. [4](#page-3-2) shows that more than 77% of overall L2 accesses occur in the frst warm-up iteration. This implies that the majority of cache traffc occurs during the marking phase of full generational garbage collection. The NMP provides the most benefit when the application exhibits a large number inefficient cache accesses. Seeing as we only move the marking phase of full generation collections to the NMP, it is implied that the difference in L2 cache accesses between the host only and NMP confgurations are due to L2 accesses from this procedure. In our evaluation, we also measure how many of these accesses to the L2 come in the frst warm-up iteration relative to the rest of the benchmark's execution.

In this evaluation, we observe two key takeaways. For one, we fnd that the NMP approach is very effective at reducing the number of L2 accesses in the frst iteration of benchmark execution. As demonstrated in comparing default heap confgurations, we see that there are approximately 2.4 times more frst warm-up iteration L2 accesses in the host only confguration compared to the NMP confguration. This shows that the marking phase of full generation collections signifcantly dominates L2 cache accesses in baseline protocols. This trend is consistent across all heap sizes, which suggests that application behavior in a cold-start requires lots of L2 accesses as a consequence of garbage collection. We also see that the number of L2 accesses is reduced in the rest of benchmark execution by as much as 81% (in the 250MB/100MB confguration), but the number of accesses as a whole is much lower. As such, the impact of our technique will not be as apparent.

The other takeaway from this evaluation is that an overwhelming majority of L2 accesses occur in the first warmup iteration across all heap confgurations. On average, we observe that 61% of L2 accesses occur in the frst warm-up iteration. As such, the impact of our technique will be most visible in this iteration.

More generally, these large differences in cache traffc for the L2 in the host largely describe the benefts of utilizing NMP. That is, the cache hierarchy in the host is not saturated by requests for the L2 during the marking phase of garbage collection. Performing these operations elsewhere in the architecture means that these caches will be subject to less pollution. This should ultimately help application performance.

### <span id="page-4-1"></span>*C. LLC Variations*

We vary the L2 size in order to provide a sensitivity analysis of our approach with warm-up iterations. In particular, we look to demonstrate the efficacy of our approach – namely, providing a fast alternative for workloads with many L2 accesses – while controlling for warm-up iterations and heap sizes. That is, we evaluate our technique with the default heap configuration and with four warm-up iterations – circumstances when the L2 misses are minimized. Our approach optimizes for L2 misses, and we aim to use this section to confrm our initial hypothesis that applications that use the L2 ineffciently will beneft from our approach.

Fig. [5](#page-4-2) demonstrates the differences in performance between NMP and host only confgurations. We measure for overall running time, so lower is better. In the confguration where the L2 is set to 512kB, we see the fewest number of L2 misses and the differences in performance between the NMP and host only confguration is less than 1%. As L2 cache size decreases, the number of L2 misses and the impact of the NMP-based approach both increase. That is, we see a 12% performance improvement in the NMP confguration versus the host only confguration with a 64kB L2.

This evaluation demonstrates that our approach is effective at optimizing for L2 misses. It highlights that the performance



<span id="page-4-2"></span>Fig. 5. Performance results of the h2 benchmark in the host only and NMP confgurations with varying sizes for the L2 with four warm up iterations in the default heap confguration.

benefts demonstrated in Fig. [2a](#page-3-0) were a result of reducing the impact of ineffcient utilization of the L2. However, this coupled with the reduction in performance impact in Fig. [2b](#page-3-0) also leads to the conclusion that utilizing warm-up iterations also signifcantly improves the effciency by which the L2 is utilized. In general, measuring performance after warm-up iterations avoids measuring cache behavior from a cold-start in which caches are not used as efficiently.

## V. DISCUSSION

<span id="page-4-0"></span>In this work, we propose and implement a means to reduce the negative impact of Java's mark-and-sweep algorithm on efficient L2 cache utilization. We offload this costly computation to NMP in order to avoid accessing the cache hierarchy. This approach is designed to be minimally invasive by taking advantage of emerging technologies through *software modifcation*. By doing so, we demonstrated that we can improve performance by  $2 \times$  in benchmarks without warmup iterations, and we attribute this performance improvement to the  $2.3\times$  reduction in L2 traffic. This work shows that the pointer-chasing nature of the mark-and-sweep algorithm does not utilize the L2 effciently without warm-up iterations. We address this by offoading the computation to a component of the microarchitecture that does not negatively impact onchip caches. The performance advantage of this approach is two-fold – (1) application cache accesses should hit more frequently as important data will not be evicted prematurely, and (2) the mark-and-sweep algorithm will not have to wait for cache miss latency in addition to memory access latency on values that likely will not be in cache. Our evaluation shows that the impact of our technique is a function of L2 utilization in host only confgurations, so using warm-up iterations reduce the impact of our technique.

We believe there is an open opportunity to evaluate whether or not cache ineffciency in early warm-up iterations is a consequence of cold caches or application behavior. In particular, Java programs execute inside the Java Runtime Environment (JRE), which performs many tasks at program invocation – such as just-in time compilation, dynamic module loading, etc. The JRE as a whole is a complex software structure, and coming up with a more precise view of its behavior throughout the duration of an application would lend itself to other interesting architectural insights. We believe that a further understanding of the JRE could lead to interesting insights into hardware-driven modifcations to the software. This could help extend our techniques to other applications as well, in Java and otherwise, that exhibit similar program behavior.

To begin that discussion, we observe that one potential contributing factor is the cold-start of the L2 cache in the frst warm-up iteration. While the frst warm-up iteration may not be of particular interest to users looking to optimize long-running applications, such as databases or hosting an application on a web server, there may be use cases that exhibit similar behaviors. For example, if future work deems dynamic module loading the primary contributing feature of garbage collection during frst warm-up iteration, then it makes sense to perform similar offoading in applications that are bound by dynamic module loading.

Improving performance in garbage collection remains an important issue, in that long pauses due to full generation collections stops the execution of the application in order to execute automatic memory management. We believe that prior work has produced highly effective specialized hardware to accelerate specifc garbage collection primitives, but the ever-changing nature and wide breadth of software makes it unlikely that these solutions will ever be fully adapted. Instead, our technique attempts to be minimally invasive, and it is fexible to be applied on other parts of the application and to alternative use cases. As a result, we hope this work leads to future opportunities in re-designing high-level software protocols to better take advantage of emerging hardware and its ability to improve performance without invasive hardware changes.

#### ACKNOWLEDGMENT

We thank the anonymous reviewers for their helpful comments. This work was supported by NSF grant 1909715.

#### **REFERENCES**

- <span id="page-5-12"></span>[1] Upmem. [https://www.upmem.com.](https://www.upmem.com) Accessed: 2022-03-11.
- <span id="page-5-13"></span>[2] Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. A scalable processing-in-memory accelerator for parallel graph processing. In *Proceedings of the 42nd Annual International Symposium on Computer Architecture*, pages 105–117, 2015.
- <span id="page-5-2"></span>[3] David F Bacon, Perry Cheng, and Sunil Shukla. And then there were none: A stall-free real-time garbage collector for reconfgurable hardware. *ACM SIGPLAN Notices*, 47(6):23–34, 2012.
- <span id="page-5-17"></span>[4] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R Hower, Tushar Krishna, Somayeh Sardashti, et al. The gem5 simulator. *ACM SIGARCH computer architecture news*, 39(2):1–7, 2011.
- <span id="page-5-19"></span>[5] Stephen M Blackburn, Robin Garner, Chris Hoffmann, Asjad M Khang, Kathryn S McKinley, Rotem Bentzur, Amer Diwan, Daniel Feinberg, Daniel Frampton, Samuel Z Guyer, et al. The dacapo benchmarks: Java benchmarking development and analysis. In *Proceedings of the 21st annual ACM SIGPLAN conference on Object-oriented programming systems, languages, and applications*, pages 169–190, 2006.
- <span id="page-5-20"></span>[6] Stephen M Blackburnα, Robin Garnerβ, Chris Hoffmannγ, Asjad M Khanγ, Kathryn S McKinleyδ, Rotem Bentzurε, Amer Diwanζ, Daniel Feinbergε, Daniel Framptonβ, Samuel Z Guyerη, et al. The dacapo benchmarks: Java benchmarking development and analysis. 2006.
- <span id="page-5-16"></span>[7] Maria Carpen-Amarie, Patrick Marlier, Pascal Felber, and Gaël Thomas. A performance study of java garbage collectors on multicore architectures. In *Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores*, pages 20–29, 2015.
- <span id="page-5-0"></span>[8] Stephen Cass. The 2018 top programming languages. *IEEE Spectrum*, 31:1, 2018.
- <span id="page-5-18"></span>[9] Jiwon Choe and Erfan Azarkhish. Brown-smcsim. [https://github.com/](https://github.com/jiwon-choe/Brown-SMCSim) [jiwon-choe/Brown-SMCSim,](https://github.com/jiwon-choe/Brown-SMCSim) 2022.
- <span id="page-5-11"></span>[10] Jiwon Choe, Andrew Crotty, Tali Moreshet, Maurice Herlihy, and R Iris Bahar. Hybrids: Cache-conscious concurrent data structures for nearmemory processing architectures. In *The 34th ACM Symposium on Parallelism in Algorithms and Architectures*, pages 321–332, 2022.
- <span id="page-5-8"></span>[11] Jiwon Choe, Amy Huang, Tali Moreshet, Maurice Herlihy, and R Iris Bahar. Concurrent data structures with near-data-processing: An architecture-aware implementation. In *The 31st ACM Symposium on Parallelism in Algorithms and Architectures*, pages 297–308, 2019.
- <span id="page-5-4"></span>[12] Cliff Click, Gil Tene, and Michael Wolf. The pauseless gc algorithm. In *Proceedings of the 1st ACM/USENIX international conference on Virtual execution environments*, pages 46–56, 2005.
- <span id="page-5-9"></span>[13] Byungchul Hong, Gwangsun Kim, Jung Ho Ahn, Yongkee Kwon, Hongsik Kim, and John Kim. Accelerating linked-list traversal through near-data processing. In *Proceedings of the 2016 International Conference on Parallel Architectures and Compilation*, pages 113–124, 2016.
- [14] Kevin Hsieh, Samira Khan, Nandita Vijaykumar, Kevin K Chang, Amirali Boroumand, Saugata Ghose, and Onur Mutlu. Accelerating pointer chasing in 3d-stacked memory: Challenges, mechanisms, evaluation. In *2016 IEEE 34th International Conference on Computer Design (ICCD)*, pages 25–32. IEEE, 2016.
- <span id="page-5-14"></span>[15] Yu Huang, Long Zheng, Pengcheng Yao, Jieshan Zhao, Xiaofei Liao, Hai Jin, and Jingling Xue. A heterogeneous pim hardware-software codesign for energy-effcient graph processing. In *2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)*, pages 684– 695. IEEE, 2020.
- <span id="page-5-3"></span>[16] Jaeyoung Jang, Jun Heo, Yejin Lee, Jaeyeon Won, Seonghak Kim, Sung Jun Jung, Hakbeom Jang, Tae Jun Ham, and Jae W Lee. Charon: Specialized near-memory processing architecture for clearing dead objects in memory. In *Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture*, pages 726–739, 2019.
- <span id="page-5-5"></span>[17] Richard Jones and Rafael Lins. *Garbage collection: algorithms for automatic dynamic memory management*. John Wiley & Sons, Inc., 1996.
- <span id="page-5-21"></span>[18] Philipp Lengauer, Verena Bitto, Hanspeter Mössenböck, and Markus Weninger. A comprehensive java benchmark study on memory and garbage collection behavior of dacapo, dacapo scala, and specjvm2008. In *Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering*, pages 3–14, 2017.
- <span id="page-5-10"></span>[19] Zhiyu Liu, Irina Calciu, Maurice Herlihy, and Onur Mutlu. Concurrent data structures for near-memory computing. In *Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures*, pages 235–245, 2017.
- <span id="page-5-1"></span>[20] Martin Maas, Krste Asanović, and John Kubiatowicz. A hardware accelerator for tracing garbage collection. In *2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA)*, pages 138–151. IEEE, 2018.
- <span id="page-5-6"></span>[21] Thomas Perl. Python garbage collector implementations cpython, pypy and gas, 2012.
- <span id="page-5-7"></span>[22] Tony Printezis. Garbage collection in the java hotspot virtual machine, 2005.
- <span id="page-5-15"></span>[23] Linghao Song, Youwei Zhuo, Xuehai Qian, Hai Li, and Yiran Chen. Graphr: Accelerating graph processing using reram. In *2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)*, pages 531–543. IEEE, 2018.