Prudent Memory Reclamation in Procrastination-Based Synchronization

위 논문을 읽으면서 공부한 자료입니다.

Notion에서 보기

Abstraction

RCU (Read-Copy-Update)
- 기본 요소
i) Reader보호 받아야 하는 데이터 (protected rcu data)에 접근할 때 항상 CS영역 내에서만 접근해야 한다.non-preemtable RCU 커널을 사용하는 경우 read_rcu_lock() 내부에서 preemption을 금지하는 코드만 적용되고, preemptable RCU 커널을 사용하는 경우 read_rcu_lock()내부에서 로컬 카운터를 증가시키고 read_rcu_unlock()내부에서 로컬 카운터를 감소시킨다. read_side critical section내에서 preemption될 수 있다.write-side critical section보호받을 데이터를 copy하고 수정한 뒤, 기존 데이터를 free하기 위해 synchronize_rcu() (동기 함수) 또는 call_rcu() (비동기 함수)를 호출하여 아래의 reclaimer를 동작시킨다.kernel/rcu/tree_plugin.h (SKKU 워크스테이션에는 /sys/kernel/rcu_normal 이 0으로 되어 있다 왜지.. )void synchronize_rcu(void){ /*RCU 스케줄러가 동작하지 않는 경우 처리를 포기*/ if(!rcu_scheduler_active) return; /*/sys/kernel/rcu_expedited 값이 설정된 경우 Brute-force RCU GP를 사용한다.*/ if(rcu_expedited) synchronize_rcu_expedited(); /*없다면 일반 GP 방법을 사용한다.*/ else wait_rcu_gp(call_rcu); }** call_rcu() : grace period가 지난 후에 preemptible-RCU callback을 호출할 수 있도록 큐에 추가한다.iii) Reclaimer
- updater (writer)가 발생되는 시점에 시작되는 구간⇒ reader들은 원본 자료에만 접근하기 때문에 안전하게 데이터에 접근할 수 있다.— reclamation 구간 : grace period 완료를 인지한 뒤, 기존 사본을 원본에 반영하고 필요 없어진 이전 자료는 free 한다. reclamation 작업은 RCU 이후에 call-back 함수를 리스트에 보관해 두었다가 호출되어 수행한다.
- — grace period 구간 : removal 직후에 reclamation을 수행하면 CPU에서 동시에 처리되고 있는 reader들에 문제가 발생하기 때문에 모든 reader의 처리가 완료될 때까지 기다린다.
- — **removal (read-copy-update)**구간 : updater 구간에서 read하고 copy를 만들어 update 작업을 한다.
- Grace period 완료를 빠르게 인지하기 위해 모든 CPU들이 Quiescent State 상태를 지난 경우에 grace period가 완료된 것으로 판단한다.
- Quiescent State
- grace period 이전에 이미 진행되고 있던 reader의 구간이 끝난 것을 인지할 목적으로 사용하는 시간 상태이다.
update 과정에서 최종으로 수정한 객체가 아닌 사용이 완료된 객체를 cleanup 한다. 안전하게 cleanup할 수 있도록 데이터에 접근하는 reader들이 없음을 보장하기 위해 grace period가 지난 후에 reclaimer가 작동한다.
__call_rcu() : 헬퍼 함수
** synchronize_rcu() : grace period가 지날 때까지 기다린다.
기존의 여러가지 lock 중 하나를 사용하여 데이터를 보호하고 수정한다. 실제 보호 받아야 하는 데이터에 접근하기 전에 메모리 barrier를 사용한 후에 참조해야 한다.
ii) Updater (=writer)
이 영역을 벗어나서 접근하는 것은 RCU 규칙을 벗어난 것으로 존재하지 않는 자료 또는 데이터로 접근될 수 있다.
rcu_read_lock()과 rcu_read_unlock() 코드 범위의 Read-side critical section
RCU와 같은 (procrastination based)동기화 매커니즘에서 updater는 grace period 동안 사용이 완료된 객체의 free를 지연시켜 안전한 reclaim을 보장한다.** memory allocator는 grace period를 기다리느라 deferred된 객체가 invisible하기 때문이다.
↔ poor memory allocator performance
reclamation이 지연된 객체들은 해당 메모리 영역이 곧 free될 것이라는 hint를 제공한다.
↔ memory allocator에게 deferred object는 보이지 않기 때문에 hint의 효용이 떨어진다.
Prudence = dynamic memory allocator⇒ deferred object의 메모리 영역을 reclaim하는 safe time을 식별할 수 있다.⇒ 중요한 state transition 과정 동안 hint를 기반으로 최적화를 한다.
⇒ allocated, free, about-to-be-freed 와 같은 object의 상태를 전체적으로 볼 수 있다.
memory allocator가 deferred object를 인지할 수 있도록하는 동기화 매커니즘을 사용한다.

1. Introduction

synchronization via procrastination
- ex) RCU
- 리스트, 트리, 해시 테이블 등 다양한 자료 구조와 운영체제에 적용되어 왔다.
- 객체를 update하기 위해 Synchronization via procrastination에서 updater (writer)는 기존 객체를 copy하여 새로운 버전의 객체를 만들고 이전의 오래된 객체는 해당 객체를 참조하는 reader가 없다는 것을 확인할 때까지 free하는 것을 지연시킨다.
- 즉, 여러 버전의 객체를 동시에 사용하여 reader가 안정된 데이터에 접근할 수 있도록 보장하는 방식이다. (grace period)
- update를 수행할 때마다 새로운 객체를 할당하고 오래된 객체는 free해야 하기 때문에 memory allocator를 바쁘게 만든다.
- 오래된 객체를 언제 reclaim해도 되는지 결정해야 한다. 보통 이 작업은 백그라운드 또는 일괄 작업으로 처리된다.
- → 객체 할당은 시간이 좀 걸리지만 free 작업은 순식간에 처리된다. 하지만 기존의 memory allocator는 이러한 할당 패턴을 처리하지 못한다.
- deferred 객체의 백로그에 따라 동시에 수행되는 memory reclamation의 병목 현상을 조절한다. 하지만 memory allocator는 이러한 reclamation의 병목현상은 고려되지 않아 성능에 영향을 미친다. 이것은 update 비율이 높을 때 memory가 부족한 현상을 초래한다.
- free가 지연된 객체들은 about-to-be-freed라는 hint를 제공한다. 하지만 기존의 memory allocator는 deferred 객체를 볼 수 없기 때문에 hint를 사용할 수 없다.
synchronization via in-place update
- reader와 writer가 mutual exclusion을 가지고 ****in-place update를 한다.
- writer가 copy하고 update하지 않고, 오래된 객체를 free하기 위해 reader를 기다리지도 않는다. 그래서 객체 상태에 대한 hint도 필요하지 않다.
- 따라서 memory allocator의 성능에 영향을 미치지 않는 동기화 매커니즘이다.
Prudence
- slab allocator 원리 기반의 dynamic memory allocator
- deferred 객체에 대한 visibility를 얻기 위해 synchronization via procrastination 매커니즘과 통합시킨다.→ deferred 객체를 효율적으로 처리할 수 있게 한다.
- → reclamation을 하기 전에 오래된 객체를 참조하는 reader의 작업이 완료될때까지 기다리는 것을 보장한다.
- → allocated, freed, about-to-be-freed 객체 상태를 볼 수 있다. (hint)

2. Background

2.1 Synchronization via Procrastination

** linked list update operation을 예시로 설명한다.

reader는 오버헤드가 거의 없는 반면 writer는 데이터 일관성 보장을 위해 오버헤드가 발생한다.
writer는 update하려는 객체 Q에 새로운 객체 Q'를 할당하고 기존 Q의 내용을 copy한다. 그리고 업데이트 된 객체 Q'를 리스트에 삽입하고 Q를 리스트에서 지운다.리스트를 순회하는 reader는 wait-free이기 때문에 Q를 참조하는 reader의 수는 결국 0 이 된다.
Q를 리스트에서 제거할 때 Q를 참조하는 reader는 없어야 한다. 만약 있다면 Q를 참조하는 모든 reader가 완료할 때까지 Q의 free를 지연시킨다. (Q = deferred for freeing)
동기화 매커니즘은 객체 메모리를 reclamation하기 전에 시스템 내에 deferred 객체를 계속해서 추적하고 오래된 객체를 참조하는 reader가 완료될 때까지 기다린다.
- pre-existing reader의 완료를 식별하는 기법 (grace period)

2.2 Defer freeing an object

RCU에서 writer가 비동기적으로 객체 free를 지연시키기 위한 API
update이후에 writer는 callback 함수를 등록하여 오래된 객체의 free를 지연시킨다.
RCU는 grace period가 완료되면 cb_func을 호출하여 실제 free 작업을 수행한다.

2.3 Slab Allocator

slab allocator는 동일한 유형과 크기의 객체 할당 요청을 효율적으로 관리하기 위한 동적 메모리 할당자이다.
자주 사용되는 객체 cache (=slab cache)로 구성되어 있다.
** 예를 들어 운영체제 커널의 slab allocator는 inode, thread와 같이 자주 사용되는 객체를 slab cache로 가지고 있다.
각 slab cache는 per-CPU object cache, per-NUMA node list of slabs로 구성되어 있다.per-NUMA node list는 같은 NUMA 도메인에 속한 slab list를 가지고 있다.
per-CPU object cache는 free object의 cache이다.
per-CPU object cache는 per-NUMA node list에서 필요한 객체를 이동시켜 채운다. node list에 있는 slab은 사용가능한 객체들에 따라 fully allocated, partially allocated, free 를 표시한다.
** 모든 객체가 per-CPU object cache로 이동하면 fully-allocated 상태가 되는 것이다.

page allocator -(grow/shrink)-per-NUMA node list (slab) -(refill/flush)- per-CPU object cache

slab allocator는 할당 요청을 받으면 현재 CPU의 per-CPU object cache에 있는 free 객체를 찾아 할당한다. 이 경우가 가장 최적의 할당 방식으로 cache hit라고 한다.node list의 모든 slab이 fully 할당되면 slab cache는 page allocator에서 slab을 더 할당받는다.
object cache에서 할당할 수 없는 경우 현재 per-NUMA node list에서 다시 object cache를 채운다 (refill)
할당한 객체가 free되면 slab allocator는 해당 객체를 현재 per-CPU object cache에 넣는다. 만약 object cache가 오버플로우 되면 객체는 node list로 flush된다.
** free 된 객체 → object cache / object cache overflow → flushing to node list
node list에 free slab의 개수가 threshold를 넘으면 slab cache는 다시 page alloator에게 돌아가 page를 반환한다.

3. Movitation

이번 장에서는 dynamic memory allocator의 성능에 영향을 미치는 synchroniztion via procratination의 중요한 부분에 대하여 다룬다.

3.1 Bursty freeing of memory

실제 어플리케이션은 한번의 grace period interval 동안 수천번의 update operation을 수행한다.→ 따라서 grace period 완료를 기다리는 deferred 객체가 수천개가 생길 수 있다.→ freeing 객체가 폭발적으로 발생한다. (bursty freeing of object)
→ grace period 완료 후 동기화 매커니즘은 memory reclamation을 시작한다.
→ 각 update는 하나 이상의 객체의 free를 지연시킨다.
bursty freeing of object→ object cache flushing (to node list)
** object cache flushing과 deferred object는 모든 CPU에서 병렬적으로 발생할 수 있다.
→ object cache overflow
parallel object cache flushing→ free slab 개수가 threshold를 넘으면 slab cache shrinking 발생
** slab cache shrinking : node list에 있는 slab cache를 page allocator에 반환하는 것
→ node list lock 경쟁 발생 & node list에 free slab의 개수 증가
reclamation 중 deferred object를 free→ object cache flush
⇒ memory allocator의 object cache refill로 인한 자원 낭비
→ object cache overflow
Prudence는 object refill/flush를 수행할 때 reclamation을 기다리는 deferred object의 개수를 고려한다.

3.2 Extended object lifetimes

객체의 lifetime = allocation ~ free (back to memory allocator)
deferred object는 동기화 매커니즘에 의해서만 free된다. 그러나 deferred object와 관련된 메모리는 grace period 완료 직후에 즉시 reclamation 되지 않는다.ex) interrupts, preemption, explicitly invoking registered callback functions
** deferring object 처리가 지연되는 요인
동기화 매커니즘에 의해 처리되는 deferred object의 개수가 제한되고, 안전한 deferred object의 memory reclamation도 지연된다.
기존의 memory allocator가 safe deferred object 를 인지하지 못하기 때문에 동기화 매커니즘에 의해 deferred 객체가 처리되기 전까지 객체를 reuse하거나 reallocation할 수 없다.→ object cache refill (node list의 slab → object cache) , slab cache grow (page allocator → node list)
** safe deferred object : 해당 객체를 참조하는 reader가 없는 object

3.3 High object cache and slab churn rate

bursty freeing of memory + extended object lifetime
⇒ high object cache & slab churn rates
적절하지 않은 cache 조절 → 불필요한 object cache refill/flush로 인해 memory allocator 성능 저하 (object cache - node list 간 빈번한 이동)
ex) object cache 가 비어있으면 refille을 하는데 이때 처리되는 deferred object의 개수가 제한된다. / object cache가 full이면 flushing을 하는데 이때 더 많은 deferred object를 처리한다.
Linux kernel에서의 측정ii) slab cache grow작업이 포함되면 object allocation은 14배 expensive
⇒ object cache refill/flush, slab cache grow/shrink expensive하므로 피해야 한다.
i) refill 작업을 포함하면 object allocation은 cache hit에 비해 4배 expensive
object cache 크기를 증가시키는 것은 효과가 없다.
이 문제는 memory allocator에 deferred object는 visible하게 만들어 처리한다.

3.4 Denial-of-service attacks

extended object lifetime은 deferred free를 많이 유발하는 시스템을 load하여 DoS 공격에 사용될 수 있다.
reclamation을 기다리는 deferred object의 개수가 꾸준히 증가하게 되고, 결국 시스템의 전체 메모리를 소진하게 된다.
Prudence는 memory allocator에 deferred object를 visible하게 만들어 extended object lifetime을 제거한다.
Prudence는 memory allocator의 처리를 기다리지 않고 free할 수 있는 deferred object를 즉시 reallocate할 수 있다.

3.5 Evaluating the impact of RCU in the Linux kernel

Linux kernel의 SLUB allocator에 미치는 RCU의 영향을 측정하였다.SLAB : 1 Queue / NUMA Node
**SLUB : 1 Queue / CPU
실험 환경
- Intel Xeon E5-4640 processor with 4 CPU sockets, 8 cores per socket for a total of 32 CPUs
- two-way hyper-threading (logical CPU - 최대 64개)
- 252 GB physical memory
- Linux kernel 3.17.0.
- object size = 512 bytes
- sampling every 10ms
- 모든 CPU에서 계속해서 linked list update 연산을 수행하는 워크로드를 실행시켜 RCU subsystem에 부하를 가하였다.
- 매 update 연산은 새로운 object를 할당한 다음 이전 버전의 객체의 free가 지연된다.
실험 결과
- update 연산을 하면 일부 객체 할당 요청이 slab cache grow를 초래하기 때문에 시스템의 총 메모리 사용량이 증가한다.
- ** slab cache grow : page allocator → node list (slab cache) 페이지 할당
- 그 다음 grace period 완료 이후에 발생하는 burst freeing object 로 인해 총 메모리 사용량이 급락한다. free slab의 개수를 증가시켜 slab cache shrink를 발생시킨다.
- ** slab cache shrink : slab cache가 사용하던 page를 page allocator에 반환하는 것
- 빈번한 slab cache grow & shrink → slab churn
- 객체의 allocated와 deferred free 상태 비율이 같아도 RCU는 reclamation을 기다리는 객체를 많이 발생시키는 deferred object를 처리하는 비율을 유지할 수 없기 때문에 extended object lifetime으로 인해 전체 메모리 사용량은 증가하고 결국 시스템 메모리가 부족하게 된다.
- 현재 RCU는 시스템이 memory pressure 사용량이 높을 때 deferred object를 처리하도록 한다. (메모리 pressure가 높아질수록 deferred object 처리를 더 많이 시도한다)
- → RCU는 freeing deferred object의 속도를 따라잡지 못하고 out-of-memory 되어버린다.
이상적인 총 사용된 메모리는 allocated와 deferred free 비율이 같아야 한다.
→ Prudence는 거의 이상적인 성능에 도달하였다.

3.6 Hints about the future

about-to-be-freed 한 메모리 영역에 대한 힌트를 제공한다.
→ dynamic memory allocator가 미리 미래의 free 연산을 알 수 있다.
Prudence는 중요한 state transition 동안에 힌트 기반으로 최적화를 하였다.

4. The Prudence Dynamic Memory Allocator

Prudence

a slab-based dynamic memory allocator integrated with procrastination-based synchronization mechanism

다른 subsystem에게 object free를 지연시킬 수 있는 인터페이스 제공⇒Prudence는 freeing object를 지연시키는 새로운 API를 도입한다. (asynchronous하게 defer freeing)
subsystem에서 object free를 defer하는 인터페이스를 제공한다. (callback function대신) 그리고 deferred free와 함께 일반적인 allocation과 free 연산을 시작할 수 있게 한다.
→ deferred objects에 대한 visibility
memory allocator가 deferred object를 reclamation할 수 있을 때를 식별할 수 있다⇒ procrastination 기반 동기화 매커니즘은 grace period를 식별하기 위해 reader의 상태와 함께 object의 active, deferred 상태를 유지한다. 이 매커니즘에서 memory allocator에 grace period 상태 정보와 deferred object 정보를 제공하여 Prudence는 grace period의 진행/완료 상태에 따라 deferred object의 reclamation을 수행할 safe time을 식별할 수 있다.
→ deferred memory 처리

4.1 Latent Cache and Latent Slab

Prudence는 deferred object를 효율적으로 처리하고, about-to-be-freed 힌트를 효과적으로 활용하기 위해 latent cache와 latent slab을 도입한다.
latent cache & latent slab은 reuse 또는 reclamation하기에 적합하지 않는 deferred object를 가지고 있다.ii) latent slab은 모든 slab을 정의하며, slab에 속한 deferred object를 가지고 있다.
⇒ latent ~에 있는 객체들은 grace period가 끝날 때까지 보이지 않다가 grace period가 완료하면 latent cache 에 있는 객체들은 object cach에 합쳐지고, latent slab에 있는 객체들은 slab에 합쳐진다.
i) latent cache는 모든 object cache를 정의하고, object cache level에서 deferred object를 가지고 있다.
Prudence는 latent cache의 크기의 limit을 object cache크기로 설정할 수 있다. limit에 도달하면 이후 free가 지연된 다음 객체는 latent slab으로 들어간다.(limit이 latent slab에는 적용되지 않고, slab 크기를 초과할 수 없다.)
→ latent cache의 safe deferred object가 object cache에 합쳐질때 object cache overflow를 방지한다.
latent cache와 latent slab은 deferred object의 reclamation을 지연시키지 않기 때문에 extended object lifetime을 완전히 제거한다.
memory allocator에서 deferred object 추적과 처리가 가능하고 grace period 완료 직후에 reallocation될 수 있다. 즉, deferred object는 memory allocator 외부의 처리를 기다릴 필요가 없다.

4.2 Exploiting hints about the future

Prudence는 latent cache와 latent slab으로부터 객체의 be-about-to-be-freed 정보를 얻는다.
→ 이 힌트는 cache refill과 flush 중에 사용하고 memory allocator의 성능을 향상시킨다.

Object cache refill

할당 요청이 object cache에서 서비스할 수 없을 때 Prudence는 grace period를 기다리고 있는 latent cache의 deferred object를 확인한다.→ latent cache에 적절한 deferred object가 없다면 object cache refille을 수행한다.
→ latent cache에 적절한 deferred object가 있다면 object cache로 병합되고 할당 요청은 object cache에서 서비스 된다.
Prudence는 object cache refill을 수행하는 동안 latent cache에 있는 deferred object의 개수를 고려한다.** object cache 크기가 o이고 latent cache에 deferred object가 d만큼 있으면 o-d만큼 refill된다. 이러한 방식으로 deferred object가 object cache로 병합될 때 오버플로우가 발생하는 것을 막을 수 있다.
→ latent cache에 deferred object가 있다면 object cache는 부분적으로 refill한다.

Latent cache pre-flush

pre-flushing : latent cache에 있는 deferred object를 latent slab으로 이동시킨다.
Prudence는 grace period 이후에 object cache와 latent cache에 있는 object의 개수가 object cache의 크기를 초과하면 (limit에 도달하면) latent slab으로 pre-flushing을 하고 다시 내려가면 종료한다.
pre-flushing은 CPU idle time에 스케줄되어 object allocation, free, deferred free calls로 pre-flushing을 방해하지 않는다. 그러나 idle cycle이 unavailable하면 pre-flushing은 해당 latent cache가 object cache로 병합될 때 수행된다.
pre-flush하는 동안 latent cache에서 latent slab으로 이동시키는 deferred object의 개수는 object cache와 latent cache에 있는 object의 개수, 최근 몇개의 grace period interval동안에 object allocation, free, deferred free rate에 따라 결정된다.→ 반대의 경우 (allocation rate > free/deferred free rate) object allocation rate가 높으면 object cache가 더 빨리 디워지기 때문에 즉시 sleep하거나 CPU를 양보하여 덜 aggressive하게 pre-flush를 한다.
→ latent cache가 거의 full이거나 allocation rate < free/deferred free rate인 경우, latent cache의 deferred object는 aggressive하게 pre-flush 된다.
pre-flushing을 하는 동안 grace period가 완료되는 경우 latent cache에 있는 eligible/safe deferred object는 object cache로 병합되고 pre-flushing이 계속된다.
서로 다른 latent cache의 pre-flushing은 node list에 lock 경쟁을 줄인다.

Object cache flush

object cache를 부분적으로 refill하고, latent cache를 pre-flushing하여 object cache flushing을 줄인다.
그럼에도 불구하고 free 연산들이 object cache를 overflow하게 하면 flushing이 필요하다.
Prudence에서 object cache flush할 때 flush되는 object의 개수는 latenc cache에 있는 deferred object의 수에 따라 결정된다. ⇒ latent cache에 deferred cache가 많을수록 flush되는 object가 많다.

Slab pre-movement

Prudence는 slab을 full, partially full, free로 그룹화한다.
그리고 미래의 움직임을 예측하면 각 그룹 리스트간 slab pre-movement를 수행한다. 예를들어 full slab 관련 object가 freeing deferred되면 full list→ partially full list pre-movement가 수행된다.
pre-movement도 node list lock 경쟁을 피한다. 반면에 기존의 memroy allocator는 bursty freeing이후에 모든 CPU에서 병렬적으로 slab movement가 발생하기 때문에 node list lock 경쟁이 심하다.

Reduces total fragmentation

slab-based allocator는 object cache와 slab에서 사용하지 않는 object가 있기 때문에 더 많은 메모리를 사용한다.
total fragmentation Ft 는 allocator에 의해 소비된 초과 메모리를 측정한다.
Prudence는 object cache refill을 위해 slab을 고를때 latent slab에 있는 deferred object를 고려하기 때문에 total fragmentation 을 줄인다.
slab selection optimization
- slab cache에 있는 slab A, B 각각 두개의 allocated object를 가지고 있다.
- 기존의 memory allocator는 slab selection이나 object cache refill 작업을 할 때 slab에 있는 object의 allocated/free 정보를 활용하지 못한다.
- ⇒ object cache refill을 위해 slab B가 선택되었다고 가정한다.
- memory allocator에게 deferred object 정보를 보여주는 시나리오 (b)의 경우, slab B에 있는 두개의 allocated object가 deferred for freeing이라고 가정한다.→ 힌트를 이용하여 object cache refill을 위해 slab A를 선택할 것이다. grace period 완료 이후에 slab B는 완전히 free 될 것이기 때문이다.→ totla fragmentation이 감소한다.
- → A를 object cache refill에 사용하고 B에 사용된 page들을 page allocator로 반환될 수 있다.
- → grace period 이후에 해당하는 deferred object 영역은 free될 것이라는 정보를 hint로 얻을 수 있다.
Prudence는 slab의 deferred object 힌트를 가지고 object cache refill 등의 작업을 수행하기 때문에 total fragmentation을 줄인다. 그리고 slab에 있는 대부분의 allocated object가 deferred for freeing 상태라면 Prudence는 해당 slab을 object cache refill에 사용하지 않는다.
⇒ 이렇게 하면 freed/deferred for freeing하는 allocated object를 최소화할 수 있어 slab과 관련된 page를 page allocator에 반환할 수 있기 때문에 total fragmentation을 감소시킬 수 있다.

Handling memory pressure

dynamic memory allocator는 시스템 실행하는데 메모리가 부족할 때 memory reclamation을 하기위해 out-of-memory(OOM) handler를 호출한다.
OOM handler는 memory reclamation을 하기 위해 프로세스들을 kill 한다.
Prudence는 grace period완료를 기다리는 deferred object가 여러개 있다면 OOM handler 호출을 지연시킨다.
→ 프로세스를 죽이지 않고, grace period 완료 후에 deferred object를 reallocate한다.

4.3 Reuse of existing optimizations

Prudence는 기존 memory allocator에 적용된 최적화 기법을 재사용한다.

5. Evaluation

Prudence implementation : Linux kernel + RCU subsystem
new APIs : kfree_deferred(), kmem_cache_free_deferred()
**kmem_cache_free_deferred() : 인자로 slab cache 식별자가 필요하다
RCU를 사용하는 커널 서브시스템이 새로운 API를 사용하도록 하고, 나머지 커널 서브시스템들은 SLUB을 사용한다.
** SLUB allocator : slab allocation 원리 기반의 리눅스 커널에서 사용 중인 allocator (default)

5.1 Experiment setup

mix of micro, synthetic (Postmark, Netperf)
application benchmarks (Apache, PostgreSQL)
리눅스 커널에서 SLUB allocator와 결과를 비교하였다.
** SLUB allocator 은 이전의 SLAB보다 나은 객체 관리기법, 적은 메타데이터 메모리 사용, 적은 fragmentation을 보인다.

5.2 Micro Benchmark

모든 CPU에서 서로 다른 사이즈의 객체에 대하여 tight한 loop에서 kmalloc()/kfree_deferred()를 실행시킨다.
초당 allocation/deferred free 연산의 총 횟수를 측정한다.
각 object당 각 CPU마다 5million pair의 allocation, deferred free를 실행시킨다.
벤치마크를 세번 실행시키고 평균값을 기록한다.

초당 kmalloc()/kfree deferred() pair를 실행시킨 횟수 결과 Prudence가 SLUB보다 3.9배-28.6배 더 좋은 성능을 보인다.
→ extended object lifetime을 제거하고, 불필요한 object cache와 slab churn을 피하기 때문이다.
성능 개선은 객체의 크기가 클수록 높아진다.** 객체 사이즈가 클수록 object cache에 객체의 수가 적어지고, slab이 작아진다.
→ object cache에 객체가 적을수록 빈번한 object cache refill/flush가 발생하고, slab이 작을수록 slab churn이 자주 발생한다.
→ 객체 크기가 4096 byte 일때 28.6배의 성능 향상

5.3 Synthetic and application benchmark details

Postmark-1.51

mail server simulator
file read, append, create, delete 를 수행하며 파일시스템에 stress를 준다.
64 instances of Postmark on an ext4 filesystem

Netperf-2.6.0

네트워크 성능을 측정한다.
TCP Connect/Request/Response (TCP CRR) test on localhost

Apache-2.4.6

Apache HTTP server 의 일부
Apache server 벤치마크
Apache server가 초당 처리하는 요청의 수를 측정한다.
ab on localhost , 128 parallel instances
** ab: apache bench : 웹서버의 성능 테스트를 위해서 접속을 강제로 늘려 부하를 주는 스트레스 테스트

PostgreSQL-9.2.7

pgbench tool : 여러개의 동시 DB 세션에 연속적인 SQL 연산을 수행한다.
64 threads, 128 clients on localhost

slab cache

file : file descriptor table
eventpoll_epi : poll() 시스템 콜
**poll() : 다중 입출력 함수
dentry : directory entry
ext4_inode : inode
kmalloc-64 : 64bytes의 kmalloc slab cache

5.4 Results

Prudence의 성능

Object cache hits

latent cache를 히용하여 deferred object를 관리하기 때문에 cache hit가 개선된다.
** latent cache는 grace period 완료 후 할당 요청에 대하여 object cache에서 deferred object를 즉시 재할당하는 것이 가능하도록 보장한다.
SLUB에서는 grace period이후에 deferred object가 즉시 할당 가능하지 않아서 object cache의 object가 부족해서 refill을 발생시킨다.
object cache 오버플로우가 발생할 때, Prudence는 latent cache에 있는 deferred object의 수를 고려해서 flush하는 object의 개수를 조절한다.
→ object cache의 불필요한 flushing을 줄여 cache hit를 증가시킨다.

Object cache churns

Prudence는 SLUB에 비해 object cache churn을 25.97%-96.47% 감소시킨다.
bursty freeing와 extended object lifetime을 유발하는 불필요한 refill/flush 연산을 줄여 churn rate가 향상된다.
Prudence는 bursty freeing으로 인한 objecy cache flush를 예측하면 차우의 deferred object를 latent slab으로 옮긴다.
latent cache를 사용하여 불필요한 object cache refill를 감소시키고 extended object lifetime을 완전히 제거할 수 있다.
object cache의 부분적 refill을 수행하여 불필요한 object cache flush를 줄여 churn rate를 개선할 수 있다.

Slab churns

Prudence는 Postmark에서 dentry 객체를 제외하고 21%-98.3% slab churn을 감소시켰다.
extended object lifetime + bursty freeing으로 발생하는 불필요한 slab grow/shrink를 감소시켜 slab churn이 감소한다.
object cache churn도 감소하여 slab churn도 감소한다.

Peak slab usage

벤치마크 실행 중 각 slab cache에 의한 최대 메모리 사용량
Prudence는 peak slab usage는SLUB에 비해 2.5%-30.6% 감소시킨다.
entended object lifetime을 제거해서 peak slab usage가 감소하는 것이다.

Total fragmentation

Prudence는 많은 slab cache에서 fragmentation을 감소시키고 성능을 동일하게 유지한다.
Netperf filp slab cache에서는 8.7% 증가한다.
Prudence는 latent slab에 있는 deferred object의 수를 고려하여 slab을 선택하여 total fragmentation을 감소시킨다.
total fragmentation ↔ latency

Throughput

Prudence는 deferred free를 최적화하도록 설계되었기 때문에 RCU deferred free 중 free의 발생 횟수가 성능 지표이다.
각 벤치마크 실행 중 deferred free / free 퍼센트의 평균을 나타낸다.
file creation, deletion 중에 deferred object 처리를 잘했기 때문에 성능 개선이 있다.

5.5 Endurance

Prudence는 slab churn, extended object lifetime을 피한다.
초기 deferred for freeing object가 grace period를 기다린 후에 재할당 될 수 있기 때문에 Prudence는 총 메모리 사용량을 처음에는 증가시킨다.
그러나 deferred object 재할당으로 이후 메모리 사용량은 똑같아진다.
slab churn을 조금만 발생시킨다.

6. Related Work

RCU in Linux kernel : deferred free reclamation을 kernel 스레드로 offload하고, backlog를 기반으로 object reclaim 횟수를 조절한다.
TRASH : throttling to reclaim deferred objects based on backlog
Masstree, CITRUS : further details on memory reclamation
Silo : 메인 스레드와 백그라운드 둘 다에서 memory reclamation을 수행한다.
추후 memory allocator 최적화 연구로 object lifetime을 예측하는 작업이 있을 수 있다.Prudence는 object lifetime 예측을 기반으로 최적화될 수 있다.
→ Prudence는 future free에 대한 hint를 제공하여 예측한다.

7. Conclusion

Prudence : dynamic memory allocator
RCU와 통합
deferred object에 대한 memory reclamation을 처리한다.
memory reclamation의 적절한 시기를 찾아 수행한다.

8. 장점과 단점

1) 장점

Prudence는 latent cache와 latent slab을 사용하여 deferred object tracking이 가능하여 grace period 완료 직후에 외부의 처리를 기다릴 필요 없이 바로 reallocation할 수 있다.
object cache에 object가 부족할 때 바로 refill하는 것이 아니라 latent cache에서 deferred object를 확인하고 적절한 object가 있으면 object cache로 병합하여 할당 요청을 처리하기 때문에 불필요한 object cache refill 작업을 피할 수 있다.
latent cache에서도 적절한 deferred object가 없다면 object cache refill을 수행하는데 이때 latent cache에 있는 deferred object의 개수를 고려하여 부분적으로 refill하기 때문에 이후에 deferred object가 병합되었을 때 발생할 수 있는 오버프로우로 인한 불필요한 flush를 방지할 수 있다.
latent cache가 가지고 있는 deferred object 개수가 limit 을 초과하면 미리 latent slab으로 flush되어 이후에 bursty freeing으로 인한 오버플로우로 object cache flush가 발생할 때 node list에 대한 lock 경쟁을 줄일 수 있다.
slab의 deferred object 힌트를 가지고 object cache refill 등의 작업을 수행하기 때문에 total fragmentation을 줄인다. 예를 들어 slab이 가지고 있는 allocated object가 대부분 deferred 상태라면 object cache refill을 할 때 해당 slab을 선택하지 않으려고 한다.

2) 단점

latent cache, latent slab을 관리하기 위한 오버헤드가 발생한다.
slab을 full, partially full, free로 그룹화하여 관리하기 위한 오버헤드가 발생한다.

'논문 리뷰' 카테고리의 다른 글

How to Deal with Lock Holder Preemption (0)	2021.04.06
Compact NUMA-aware Locks (0)	2021.04.06
Making Huge Pages Actually Useful : Illuminator (0)	2021.04.06
Perforated Page: Supporting Fragmented Memory Allocation for Large Pages (0)	2021.04.06
HawkEye: Efficient Fine-grained OS Support for Huge Pages (0)	2021.04.06

Prudent Memory Reclamation in Procrastination-Based Synchronization

Prudent Memory Reclamation in Procrastination-Based Synchronization

Abstraction

1. Introduction

2. Background

2.1 Synchronization via Procrastination

2.2 Defer freeing an object

2.3 Slab Allocator

3. Movitation

3.1 Bursty freeing of memory

3.2 Extended object lifetimes

3.3 High object cache and slab churn rate

3.4 Denial-of-service attacks

3.5 Evaluating the impact of RCU in the Linux kernel

3.6 Hints about the future

4. The Prudence Dynamic Memory Allocator

Prudence

4.1 Latent Cache and Latent Slab

4.2 Exploiting hints about the future

Object cache refill

Latent cache pre-flush

Object cache flush

Slab pre-movement

Reduces total fragmentation

Handling memory pressure

4.3 Reuse of existing optimizations

5. Evaluation

5.1 Experiment setup

5.2 Micro Benchmark

5.3 Synthetic and application benchmark details

Postmark-1.51

Netperf-2.6.0

Apache-2.4.6

PostgreSQL-9.2.7

slab cache

5.4 Results

Object cache hits

Object cache churns

Slab churns

Peak slab usage

Total fragmentation

Throughput

5.5 Endurance

6. Related Work

7. Conclusion

8. 장점과 단점

1) 장점

2) 단점

'논문 리뷰' 카테고리의 다른 글

'논문 리뷰' Related Articles

티스토리툴바