Perforated Page: Supporting Fragmented Memory Allocation for Large Pages

These are my notes from reading the paper above.


Abstract

Large Page (=huge page)
- Large pages have been adopted to make address translation efficient for applications that use large contiguous memory regions.
  - Background on address translation
    - With virtual memory, every process has its own virtual address space. Pages are loaded into main memory on demand, so main memory acts as a kind of cache, and the MMU uses the page table to make this work.
    - A virtual page can be placed anywhere in physical memory, and pages can be shared between processes or protected from them. Address translation is what implements this.
    - When the CPU running a process needs to read memory, it sends the virtual address to the MMU, which looks up the matching page table entry (PTE) in the page table.
    - → The physical address stored in the PTE is then sent to memory or the cache, and the requested value is returned to the CPU.
    - → Fetching the PTE and then fetching the data means memory is accessed twice.
    - ⇒ To cut this two-access latency, the MMU contains a translation cache called the TLB (translation lookaside buffer), which holds roughly 128 to 256 PTEs.
    - On a TLB miss, the TLB is checked first and memory is still accessed twice, so a miss costs more than having no TLB at all. For workloads with good locality, however, TLB misses are rare enough that this overhead is negligible.
- Fragmented memory and non-movable pages make large pages hard to allocate.
- If part of a large page has a different permission from the rest, the large page must be split.
- If the application accesses memory sparsely, memory bloat can occur (memory is consumed unnecessarily).
This paper proposes perforated pages, which make it possible to allocate huge pages in fragmented physical memory.
- perforated page: a page with holes punched in it
- The OS can punch 4KB page-sized holes in the physical address range allocated to a large page.
- Those 4KB holes can then be used by other addresses when needed.
- The 4KB-level (base) page table entries are reused to store the hole locations within a large page and to translate the holes as regular base pages.
- For performance, translations of hole pages are cached in the TLB, and holes are tracked with a bitmap in the L2 TLB.
- Advantages
  - i) Large pages become easy to use even in fragmented memory.
  - ii) Regions with a different permission are allowed inside a large page, which increases sharing flexibility.
  - iii) Unused parts of a large page can be used by other addresses, which mitigates memory bloat.
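Evaluation results
- The effect of perforated pages was evaluated under a variety of realistic fragmentation scenarios.
- Even with fragmented memory, perforated pages reached 93.2% to 99.9% of the performance of ideal memory allocation and improved performance by 2.0% to 11.5%.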

1. Introduction

Address translation is a decisive source of slowdown in data-intensive applications, so many proposals aim to improve TLB efficiency. One of them is to increase the page size and thereby extend the TLB reach.
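As a rough illustration of why page size matters, TLB reach can be approximated as the number of entries multiplied by the page size. The sketch below is only a back-of-the-envelope calculation; the function name `tlb_reach` is made up for this example, and 1536 is one of the L2 TLB sizes quoted in the Background section.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory covered when every TLB entry maps one page."""
    return entries * page_size


# 1536 entries is one of the L2 TLB sizes quoted in the Background section.
L2_ENTRIES = 1536
reach_4k = tlb_reach(L2_ENTRIES, 4 * KB)
reach_2m = tlb_reach(L2_ENTRIES, 2 * MB)

print(reach_4k // MB, "MB covered with 4KB pages")
print(reach_2m // MB, "MB covered with 2MB pages")
```

With the same number of entries, 2MB pages cover 512 times as much memory as 4KB pages, which is the gain that large pages are after.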
Problems with large pages
- If the application does not use the full contiguous region, physical memory bloating occurs.
- Pages in a physical region that have different permissions, and immovable pages, cannot be promoted to a large page.
- Today's large pages trade translation efficiency against memory management flexibility.
⇒ This paper aims to ease that trade-off by providing the memory management flexibility needed to overcome the problems of large pages.

perforated pages

Punch 4KB hole pages out of large 2MB pages.
A large page can be used even when part of it is not allocated. (addresses memory bloating)
A large page can be used even when it is not backed by fully contiguous physical memory. (addresses fragmentation)
A large page can be used even when it contains regions with different permissions. (addresses decreased sharing)
example
- When a perforated page is backed by a physical large-page region that contains immovable pages, its holes are given separate physical 4KB regions.
- When a large page contains a page with a different permission, a COW may be triggered on that page to protect the large page, which makes sharing impossible.
- → Mapping that page separately to a 4KB physical region lets the perforated page keep sharing physical memory with other pages.

address translation mechanism in perforated pages

Page table entry (PTE) for a perforated page
- The address of the large page-sized memory region
- The address of the next-level PTE page used to track hole pages
- If the virtual address does not fall in a hole page, the physical large page address in the PTE is used for translation.
- If the virtual address falls in a hole page, the next-level PTE must be accessed to translate it.
⇒ This makes it possible to track the hole pages of a perforated page.
two levels of hole bitmaps
- Hole page translations inside a perforated page are cached in ordinary TLB entries.
- first-level hole bitmap: stored in the unused bits of the perforated page's PTE and cached in the TLB.
- → It tracks perforated pages that contain no holes and filters out accesses to the second-level hole bitmap.
- second-level hole bitmap: marks which sub-pages of the perforated page are holes and therefore need an access to the next-level PTE for translation.
- → It is stored in a reserved part of physical memory and cached in the TLB on demand alongside other PTEs. Only accesses that the first-level hole bitmap does not filter out reach it.

extension of perforated pages

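Supports two-dimensional page walks and the interaction between the hypervisor's and the virtual machine's hole bitmaps.
Overlaps large pages and base pages so that large pages remain usable under fragmentation.
Caches on-demand bitmaps in the TLB together with PTEs for fast translation.
Reuses the existing PTE structure and page table walk so that HW and SW changes stay minimal.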

2. Background

x86 page table
- 4-level page table tree
- A level-1 page table entry points to a 4KB page.
- A level-2 page table entry points to a 2MB page.
- A level-3 page table entry points to a 1GB page.
⇒ The page walk ends at the level that matches the size of the page being translated.
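To make the level structure concrete, here is a small sketch of how a 48-bit x86-64 virtual address splits into four 9-bit table indices plus a page offset. The function name and the dictionary keys are illustrative only; the bit positions follow standard x86-64 4-level paging.

```python
def split_virtual_address(va: int) -> dict:
    """Split a 48-bit x86-64 virtual address into page-walk indices."""
    return {
        "l4_index": (va >> 39) & 0x1FF,   # top-level table
        "l3_index": (va >> 30) & 0x1FF,   # walk stops here for a 1GB page
        "l2_index": (va >> 21) & 0x1FF,   # walk stops here for a 2MB page
        "l1_index": (va >> 12) & 0x1FF,   # walk stops here for a 4KB page
        "offset_4k": va & 0xFFF,          # 12-bit offset inside a 4KB page
        "offset_2m": va & 0x1FFFFF,       # 21-bit offset inside a 2MB page
        "offset_1g": va & 0x3FFFFFFF,     # 30-bit offset inside a 1GB page
    }


example = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
parts = split_virtual_address(example)
print(parts)
```

Note that the 2MB offset is simply the L1 index and the 4KB offset concatenated, which is why a 2MB mapping needs 9 fewer physical address bits in its PTE than a 4KB mapping.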
x86 TLB
- The level-1 TLB holds 64 to 100 entries for all page sizes.
- The level-2 TLB holds 1536 to 2048 entries for 2MB and 4KB pages.

2.1 OS Large Page Support

Applications explicitly request huge pages when they allocate memory.

Linux

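Large pages are allocated by calling mmap with the MAP_HUGETLB flag.
** mmap: takes the location and size of the memory to map as arguments and maps a file into the process's virtual memory. Mapped data can be read and written directly without file I/O functions.
→ The application's memory allocation code has to be modified.
Transparent Huge Page (THP): provides large pages to user processes automatically.
→ If the requested allocation is large and a contiguous mapping is possible, the kernel allocates a large page transparently to the process.
→ The kernel selects groups of base pages to promote to a large page, which improves TLB performance.
→ If a process modifies part of a huge page, the kernel breaks the huge page into base pages to handle the request.
Enough contiguous space in physical memory is required.
→ Explicit memory compaction or relocation is needed.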
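For experimenting with THP from user space, one option is to hint the kernel with `madvise(MADV_HUGEPAGE)`. This hint is not discussed in these notes; it is the standard Linux opt-in for THP and Python's `mmap` module happens to expose it. The sketch below is a minimal, defensive example: the helper name `request_thp` is made up, and the call simply reports `False` on platforms or kernels where the hint is unavailable or rejected.

```python
import mmap


def request_thp(length: int = 4 * 1024 * 1024) -> bool:
    """Map anonymous memory and hint that it may be backed by huge pages.

    Returns True if the kernel accepted the hint, False otherwise.
    """
    region = mmap.mmap(-1, length)  # anonymous, page-aligned mapping
    try:
        advice = getattr(mmap, "MADV_HUGEPAGE", None)
        if advice is None:
            return False  # constant not available on this platform
        try:
            region.madvise(advice)  # only a hint; the kernel may ignore it
        except OSError:
            return False  # e.g. THP disabled or not compiled in
        region[:4] = b"data"  # touch the mapping so pages get faulted in
        return True
    finally:
        region.close()


print("THP hint accepted:", request_thp())
```

Whether the region actually ends up on 2MB pages still depends on the kernel finding contiguous physical memory, which is exactly the limitation described above.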

2.2 Related Work

Prior work on improving address translation performance for large workloads
- B. Pham, V. Vaidyanathan, A. Jaleel, and A. Bhattacharjee, “CoLT: Coalesced Large-Reach TLBs”
  - Merges multiple adjacent translations to increase TLB reach.
- B. Pham, A. Bhattacharjee, Y. Eckert, and G. H. Loh, “Increasing TLB reach by exploiting clustering in page translations”
  - Clusters adjacent pages into small groups so that memory mappings are sequential within each group.
- C. H. Park, T. Heo, J. Jeong, and J. Huh, “Hybrid TLB coalescing: Improving TLB translation coverage under diverse fragmented memory allocations”
  - Supports contiguous translations by adjusting the degree of contiguity at runtime.
- A. Basu, J. Gandhi, J. Chang, M. D. Hill, and M. M. Swift, “Efficient Virtual Memory for Big Memory Servers”
  - Uses segmentation to translate large contiguous memory regions efficiently.
- V. Karakostas, J. Gandhi, F. Ayar, A. Cristal, M. D. Hill, K. S. McKinley, M. Nemirovsky, M. M. Swift, and O. Unsal, “Redundant Memory Mappings for Fast Access to Large Memories” (RMM)
  - Supports multiple segments.
- C. H. Park, T. Heo, and J. Huh, “Efficient Synonym Filtering and Scalable Delayed Translation for Hybrid Virtual Caching”
  - A many-segment translation system.
- Z. Yan, D. Lustig, D. Nellans, and A. Bhattacharjee, “Translation Ranger: Operating System Support for Contiguity-aware TLBs”
  - Proposes operating system changes for creating large contiguous regions with unmovable pages.
- Y. Kwon, H. Yu, S. Peter, C. J. Rossbach, and E. Witchel, “Coordinated and Efficient Huge Page Management with Ingens”
  - Improves Linux THP.
- A. Panwar, A. Prasad, and K. Gopinath, “Making Huge Pages Actually Useful”
  - Handles compaction of page blocks that contain non-movable pages.
- A. Bhattacharjee, “Large-reach Memory Management Unit Caches”; T. W. Barr, A. L. Cox, and S. Rixner, “Translation Caching: Skip, Don’t Walk (the Page Table)”; T. W. Barr, A. L. Cox, and S. Rixner, “SpecTLB: A mechanism for speculative address translation”
  - Improve the page table walker to reduce TLB misses and latency.
- A. Margaritov, D. Ustiugov, E. Bugnion, and B. Grot, “Prefetched Address Translation”; A. Bhattacharjee, “Translation-Triggered Prefetching”
  - Reduce TLB misses and latency through prefetching.
- J. Ahn, S. Jin, and J. Huh, “Revisiting Hardware-assisted Page Walks for Virtualized Systems” and “Fast Two-Level Address Translation for Virtualized Systems”
  - Address the translation problems of virtualized systems.
- B. Pham, J. Vesely, G. H. Loh, and A. Bhattacharjee, “Large pages and lightweight memory management in virtualized environments: Can you have it both ways?”
  - Tries to stitch back together large pages that the hypervisor has split.
- C. Alverti, S. Psomadakis, V. Karakostas, J. Gandhi, K. Nikas, G. Goumas, and N. Koziris, “Enhancing and Exploiting Contiguity for Fast Memory Virtualization”
  - Minimizes page walk latency with contiguous allocation and a page table walk prediction mechanism.
- M. Swanson, L. Stoller, and J. Carter, “Increasing TLB reach using superpages backed by shadow memory”
  - Proposes a way to allocate large pages on non-contiguous physical memory.
  - Adds an intermediate address layer that maps the large pages seen by the core caches onto non-contiguous physical base pages.
- Y. Du, M. Zhou, B. R. Childers, D. Mosse, and R. Melhem, “Supporting superpages in non-contiguous physical memory”
  - Proposes a way to support large pages on non-contiguous physical memory.
- D. Jevdjic, S. Volos, and B. Falsafi, “Die-Stacked DRAM Caches for Servers: Hit Ratio, Latency, or Bandwidth? Have It All with Footprint Cache”; J. H. Ryoo, M. R. Meswani, A. Prodromou, and L. K. John, “SILCFM: Subblocked InterLeaved Cache-Like Flat Memory Organization”; J. H. Ryoo, S. Song, and L. K. John, “Puzzle Memory: Multifractional Partitioned Heterogeneous Memory Scheme”
  - DRAM caches and heterogeneous memory systems use bitmaps in the TLB to mark accessed and modified sub-blocks, or the validity of coalesced/clustered mappings.

3. Motivation

3.1 Large Page Management Challenges

Memory bloating

memory bloat: the region actually in use is smaller than the physical memory allocated for it.
When an application does not use all of the physical memory allocated to a large page, physical memory is wasted, which causes memory bloat.
Memory bloat arises because the OS's THP allocates large pages aggressively without knowing the application's access pattern.
A perforated page punches holes in the parts of a large page that the application does not use, so that other pages can be allocated to those holes.
⇒ Memory is used efficiently while large pages are still exploited.
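The bloat argument above can be put into numbers with a toy model. The sketch below compares, for one 2MB-aligned virtual region, how much physical memory is consumed and how many TLB entries cover the non-hole data under three policies. It is a deliberate simplification: the function name `footprint` is invented here, untouched sub-pages of a perforated page are assumed to be unallocated holes, and TLB entries for bitmaps are ignored.

```python
BASE = 4 * 1024                 # 4KB base page
LARGE = 2 * 1024 * 1024         # 2MB large page
SUBPAGES = LARGE // BASE        # 512 sub-pages per large page


def footprint(touched: set, policy: str) -> tuple:
    """Return (physical bytes used, TLB entries for the touched data)."""
    if policy == "base":
        return len(touched) * BASE, len(touched)
    if policy == "large":
        # The whole 2MB is backed even if only part of it is used.
        return (LARGE if touched else 0), (1 if touched else 0)
    if policy == "perforated":
        # Assumption: untouched sub-pages are unallocated holes that
        # other mappings can use, so they cost no physical memory here.
        return len(touched) * BASE, (1 if touched else 0)
    raise ValueError(f"unknown policy: {policy}")


touched = set(range(128))       # the application uses a quarter of the region
for policy in ("base", "large", "perforated"):
    print(policy, footprint(touched, policy))
```

In this model the perforated page combines the memory footprint of base pages with the single-entry coverage of a large page, which is the combination the section argues for.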

Trade-off between saving memory and performance

VMware's transparent page sharing and the Linux kernel's same-page merging (KSM)
- These features save memory in virtualized environments by removing duplicates.
- They find pages with identical content and map them, shared, to a single page.
- sharing → memory saving
large page → translation efficiency ↑, sharing ↓
→ Cold large pages (pages accessed infrequently) are broken into base pages to increase sharing.
The larger the page, the harder it is to find another page with identical content, so the amount of possible sharing drops by roughly 4 to 10 times.
evaluation
- KSM on two Linux VMs
- mcf benchmark
- With KSM page sharing enabled, total memory usage fell by up to 11%; 40% of the large pages were broken into base pages, most of which were then shared.
- TLB misses rose by 34%, which led to a 4.9% performance drop.

Compaction overheads and immovable pages

Promoting to a large page requires a large, contiguous region of physical memory.
→ The longer the system runs, the worse memory fragmentation gets.
→ Linux performs physical memory compaction to create new free contiguous memory.
memory compaction
- It is expensive.
- It runs in the page fault handler.
- It increases page fault latency.
- Steps: scan pages → update the page table entries → TLB shootdown → copy the physical pages to their new location.
- ** TLB shootdown: invalidating the TLB entries for the page being updated so that other processors can no longer use its old address.
immovable page
- Pages such as those used by device drivers or the file cache are immovable and cannot be compacted.
With perforated pages, the OS can either move a movable page or punch a hole for it inside the perforated page, and it can also punch holes over immovable sub-pages.
→ This lowers the cost of page promotion.

3.2 Case Study: Memory Bloating

application memory bloat vs address translation performance
With THP enabled, the effect of large pages and regular pages was compared on the Redis in-memory database (4 million keys + 16KB objects).
large page
→ TLB misses per kilo instruction (MPKI) are 82% lower than with regular pages and translation coverage increases, giving better performance (a 29% improvement).
→ Memory allocation bloats by up to 20%.
TLB miss comparison: regular page + large page vs perforated page
4 million keys with 16KB entries were allocated in Redis, and a memory address trace was collected over 1 million random key accesses.
large page → With a random access pattern over a large memory working set, large pages could not reduce TLB misses by much.
Even under this harsh access pattern, perforated pages reduced TLB misses considerably: they cut the TLB misses seen with large pages in half while also removing the unnecessary bloat of large pages.

3.3 Case Study: Real-world Fragmentation

To understand real-world fragmentation challenges, an experiment was run on a machine with 12GB of physical memory running Linux 5.3.8.
The benchmark compiles the Linux kernel, which allocates both user-space and kernel-space memory.
For performance, the OS keeps the page cache until no free memory is left.
→ This makes memory compaction and large page allocation difficult.
→ Page cache allocation creates non-movable pages.
When the benchmark requests large pages with no free memory left, only 2.5% of the requested memory can be allocated as large pages because of fragmentation.
With memory compaction running in the background, up to 22% can be allocated as large pages.
A 2MB region is allocated as a perforated page when it is not too fragmented: no more than 25% of it is unmovable pages and no more than 4 of the 8 filter bits are set.
→ An additional 18% of the 2GB can then be allocated.
A system with 22% large pages → 22% large pages + 18% perforated pages ⇒ a 20.3% performance improvement
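The "not too fragmented" rule used in this case study can be written as a small predicate. This is a sketch under stated assumptions: the two thresholds are treated as upper bounds, each of the 8 filter bits is assumed to cover 64 consecutive 4KB sub-pages, and the names `filter_bits_set` and `is_perforated_candidate` are invented for the example.

```python
SUBPAGES = 512                       # 4KB sub-pages in a 2MB region
FILTER_BITS = 8                      # coarse-grained filter bits per region
REGION = SUBPAGES // FILTER_BITS     # assumed: 64 sub-pages per filter bit


def filter_bits_set(unmovable) -> int:
    """Number of filter bits that would be set by these sub-page indices."""
    return len({index // REGION for index in unmovable})


def is_perforated_candidate(unmovable,
                            max_unmovable_ratio: float = 0.25,
                            max_filter_bits: int = 4) -> bool:
    """True if the 2MB region qualifies under the case-study thresholds."""
    unmovable = set(unmovable)
    few_enough = len(unmovable) <= max_unmovable_ratio * SUBPAGES
    clustered_enough = filter_bits_set(unmovable) <= max_filter_bits
    return few_enough and clustered_enough


print(is_perforated_candidate(range(100)))         # few pages, 2 filter bits
print(is_perforated_candidate(range(0, 512, 64)))  # few pages, all 8 bits
print(is_perforated_candidate(range(200)))         # too many unmovable pages
```

The second case shows that even a handful of unmovable pages can disqualify a region when they are spread across every filter region.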

4. Architecture

4.1 Overview

Assumes the x86 architecture with 2MB large pages and 4KB regular pages.

explanation
- VA space1 is a perforated page that has holes in its virtual address space. The holes of VA space1 are backed by non-contiguous 4KB regions in physical memory.
- VA space2 shares the address space with VA space1. The pages whose permission differs are given separate 4KB pages with the appropriate permission.
- When a CoW occurs between address spaces that share the same perforated page, the OS does not copy the entire large page; it only separates the small 4KB page assigned to the hole.
Changes required for perforated pages
1. Identify the hole pages inside a perforated page
  - ⇒ Use a hole bitmap.
  - → To access the hole bitmap efficiently, it is cached on demand in the L2 TLB and a coarse-grained bitmap filter is used.
2. Provide translations for hole pages
  - ⇒ Add a shadow L2 page table entry to the standard multi-level page table structure.
  - → The existing L2 PTE holds the physical address of the region for the perforated page, and the shadow L2 PTE points to the next-level L1 page table node that holds the translation information for hole pages.
  - → An access to a 4KB page (hole or not) is resolved through the perforated page in the L2 TLB, and the resulting translation is stored in the L1 TLB for later use.

4.2 Hole Page Tracking

A hole bitmap is used to track efficiently where the holes are inside a perforated page, and a shadow L2 PTE is used to reach the 4KB hole page translations efficiently.

Shadow L2 PTEs for accessing hole translations:

The L2 PTE has to provide two addresses:
i) the physical address of the 2MB region, for the parts that are not holes
ii) the physical address of the next-level L1 page table node used to reach the hole translations (held in the shadow L2 PTE)
The shadow L2 PTE is placed adjacent to the corresponding L2 page table node. (address of the shadowed node = original node + 4KB offset)
→ This keeps locating the corresponding hole translation simple.
The shadow L2 PTE is accessed during the page walk for a 4KB hole page translation, after which the corresponding 4KB hole PTE is stored in the TLB.
To speed up the page table walk, the contents of the shadow L2 entry can also be kept in the Page Walk Cache or in the regular caches.
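Because the shadow node sits exactly one 4KB page after the original L2 node, finding the shadow entry is plain address arithmetic. A minimal sketch follows, assuming standard 8-byte x86-64 PTEs; the helper names and the example base address are hypothetical.

```python
PAGE_SIZE = 4096   # the shadow node is placed 4KB after the L2 node
PTE_SIZE = 8       # bytes per x86-64 page table entry


def l2_pte_address(l2_node_base: int, va: int) -> int:
    """Physical address of the regular L2 PTE that covers va."""
    l2_index = (va >> 21) & 0x1FF
    return l2_node_base + l2_index * PTE_SIZE


def shadow_l2_pte_address(l2_node_base: int, va: int) -> int:
    """Physical address of the shadow L2 PTE: same slot, next 4KB page."""
    return l2_pte_address(l2_node_base, va) + PAGE_SIZE


node_base = 0x10000            # hypothetical, page-aligned L2 node address
va = 3 << 21                   # falls in L2 slot 3
print(hex(l2_pte_address(node_base, va)))
print(hex(shadow_l2_pte_address(node_base, va)))
```

No extra pointer has to be stored or followed: the walker derives the shadow entry from the address it already has.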

Hole page bitmap for identifying holes:

512-bit bitmap (1 bit per 4KB page)
It indicates whether an access to the shadow PTE is needed (that is, whether the 4KB page is a hole in the perforated page).
The hole bitmap is decoupled from the PTE: the hole bitmap for the entire system physical address space is stored in a contiguously allocated block of physical memory.
Every 4KB page in the system has 1 bit in the hole bitmap (this takes only 0.003% of physical memory).
perforated page translation:
→ First, the bitmap is checked to see whether the 4KB sub-page is a hole.
→ If it is a hole, the shadow L2 PTE (which leads to the hole translations) is accessed and the L1 page walk returns the 4KB hole page translation.
→ If it is not a hole, the translation is done directly with the physical address of the 2MB large page.
To make perforated page translation more efficient, a second, coarse-grained hole bitmap is used as a filter.
→ The PTE of a 2MB perforated page needs 9 fewer address translation bits than a 4KB PTE, so 8 of those 9 unused bits serve as the hole bitmap filter. Each of the 8 bits indicates whether the corresponding coarse-grained region of the full hole bitmap contains a hole.
→ The filter lives in the L2 PTE and is cached in the perforated page's TLB entry.
⇒ When the filter shows that a coarse-grained region contains no hole, the full hole bitmap does not need to be accessed at all.
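The relation between the 512-bit hole bitmap and the 8-bit filter can be sketched with plain integers as bit vectors. The split into 8 equal regions of 64 sub-pages is an assumption made for this example, and all function names are illustrative rather than taken from the paper.

```python
SUBPAGES = 512                       # bits in the per-page hole bitmap
FILTER_BITS = 8                      # bits in the coarse-grained filter
REGION = SUBPAGES // FILTER_BITS     # assumed: 64 sub-pages per filter bit


def build_filter(hole_bitmap: int) -> int:
    """Set filter bit r if any sub-page in region r is a hole."""
    region_mask = (1 << REGION) - 1
    filt = 0
    for r in range(FILTER_BITS):
        if (hole_bitmap >> (r * REGION)) & region_mask:
            filt |= 1 << r
    return filt


def needs_bitmap_lookup(filt: int, subpage: int) -> bool:
    """The full bitmap is consulted only if the region's filter bit is set."""
    return bool((filt >> (subpage // REGION)) & 1)


def is_hole(hole_bitmap: int, subpage: int) -> bool:
    return bool((hole_bitmap >> subpage) & 1)


def bitmap_overhead_percent() -> float:
    """One bit of bitmap per 4KB page, as a percentage of physical memory."""
    return 100 / (4096 * 8)


bitmap = (1 << 3) | (1 << 70)        # holes at sub-pages 3 and 70
filt = build_filter(bitmap)
print(bin(filt), needs_bitmap_lookup(filt, 100), is_hole(bitmap, 100))
print(needs_bitmap_lookup(filt, 200), round(bitmap_overhead_percent(), 3))
```

Sub-page 100 illustrates the filter's false positives: its region contains a hole (sub-page 70), so the bitmap must be consulted even though sub-page 100 itself is not a hole.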

4.3 L2 TLB Extension

To avoid burdening the latency-sensitive L1 TLB, 4KB entries for non-hole sub-pages are generated directly at the L2 TLB and installed in the L1 TLB like any other page.
The hole bitmap is cached in L2 TLB entries.
The L2 TLB must be modified so that it can access perforated pages and the hole bitmap.

naive approach: load the full bitmap into L2 TLB
- If all bitmap entries for a perforated page sit in one 64B cache line, a single memory access is enough.
- → The bitmap for unused regions of the perforated page is loaded too, which wastes L2 TLB space.
- → Each perforated page needs 17 L2 TLB entries.
more efficient approach: load the bitmaps on-demand
- Instead of loading the bitmap of every perforated page up front, bitmaps are loaded when they are requested.
coarse-grained filter bitmap + perforated page TLB entry + on-demand bitmap data
- Thanks to filtering, bitmaps for regions without holes do not have to be stored in the L2 TLB.
- Thanks to on-demand loading, only the bitmaps of regions that are actually accessed are stored in the L2 TLB.

4.4 Address Translation Flow

Supporting perforated large pages requires changes to the L2 TLB lookup and to the page table walk logic.
On an L1 TLB miss, the L2 TLB is accessed.
- If the lookup hits a regular page entry or an already cached 4KB hole entry, the translation completes immediately and is stored in the L1 TLB.
- If the lookup hits a perforated page, the L2 TLB first checks the filter bitmap in the TLB entry.
  - If the filter bitmap shows no hole in the coarse-grained region for this translation, a 4KB translation is generated from the 2MB perforated page and stored in the L1 TLB.
  - If the filter bitmap shows a hole in the coarse-grained region, the bitmap is consulted through a second L2 TLB access or a memory access.
    - If the bitmap says the sub-page is a hole, a page table walk starts through the shadow PTE entry, and the 4KB translation for the hole is installed in the L2 and L1 TLBs.
    - If the bitmap says the sub-page is not a hole, the 4KB translation is generated from the 2MB perforated page as above.
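The perforated-page branch of the lookup can be mimicked in a few lines. The sketch below is a toy model, not the hardware design: `PerforatedEntry` and its fields are invented for illustration, a dictionary stands in for the shadow L2 PTE and the L1 page table node behind it, and each filter bit is assumed to cover 64 sub-pages.

```python
from dataclasses import dataclass, field


@dataclass
class PerforatedEntry:
    base_pa: int        # physical base of the 2MB region
    filter_bits: int    # 8-bit coarse-grained filter
    hole_bitmap: int    # 512-bit bitmap, bit i set means sub-page i is a hole
    # Stand-in for the shadow L2 PTE and the L1 node it points to:
    hole_frames: dict = field(default_factory=dict)  # sub-page -> 4KB frame PA


def translate(entry: PerforatedEntry, va: int):
    """Return (physical address, list of steps taken) for an L2 TLB hit."""
    subpage = (va >> 12) & 0x1FF
    offset = va & 0xFFF
    steps = ["check filter"]
    if (entry.filter_bits >> (subpage // 64)) & 1:
        steps.append("look up hole bitmap")
        if (entry.hole_bitmap >> subpage) & 1:
            steps.append("page walk via shadow L2 PTE")
            return entry.hole_frames[subpage] + offset, steps
    steps.append("derive 4KB translation from 2MB base")
    return entry.base_pa + subpage * 4096 + offset, steps


entry = PerforatedEntry(base_pa=0x40000000, filter_bits=0b1,
                        hole_bitmap=1 << 5, hole_frames={5: 0x90000000})
for sub in (5, 6, 100):
    pa, steps = translate(entry, (sub << 12) | 0x10)
    print(sub, hex(pa), steps)
```

Sub-page 5 is a hole and takes the page walk, sub-page 6 shares a filter region with the hole and pays for a bitmap lookup, and sub-page 100 is resolved from the filter alone.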

4.5 Changes and Overhead Analysis

Storage

The overall page table structure is unchanged.
Space is needed to store the shadow L2 PTEs and the bitmap: 4KB of additional page table data per perforated page.

HW

The L2 TLB controller stores, searches, and fetches hole bitmaps.
The page table walker logic recognizes perforated pages; on a hit, if the filter (coarse-grained bitmap) shows that the translation involves no hole, it produces the translation from the large page.

Potential performance overheads

If the bitmap is not in the L2 TLB, memory has to be accessed, which adds L2 TLB translation latency.
TLB capacity is consumed by bitmap entries.
In the worst case, where the filter is ineffective and the whole range is accessed, each perforated page consumes 16 additional TLB entries.

TLB shootdown

A perforated page shootdown has to invalidate the translations for the perforated page itself, all of its bitmap entries in the L2 TLB, its hole translations in the L1 and L2 TLBs, and all of its non-hole entries in the L1 TLB.
⇒ The design follows the Linux kernel approach and flushes the TLB.
Shootdowns of pages within a perforated page arise when a new hole is punched or an existing hole is patched.
→ They can be batched with existing shootdowns, so their cost is negligible.
→ The bitmap of the perforated page has to be updated as well.

5. Virtualization

system virtualization: running multiple virtual machines on one physical system to raise system utilization
i) first address translation: guest virtual address (gVA) of a process inside the virtual machine → guest physical address (gPA)
ii) second address translation: guest physical address → host machine address (MA)
iii) HW support: two-dimensional nested page walks
That is, gVA → gPA → MA

To support perforated pages in a virtualized system, the hole page status bits must be set appropriately by taking both the guest and the host page tables into account.
If both the guest and the host page tables map a region with perforated pages, the TLB can perform the translation with a perforated page entry.
To install a perforated page entry for the gVA → MA mapping, the hole bitmap cached in the L2 TLB must be generated from both the guest and the host mappings.
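The two translation stages compose like ordinary function composition. The sketch below models each page table as a flat dictionary from page number to frame number, which is a simplification of the real multi-level nested walk. The `combined_hole_bitmap` helper encodes one plausible reading of "generated from both mappings": a sub-page is treated as a hole if either stage marks it as one. That rule, and the requirement that the guest and host 2MB regions line up, are assumptions of this example rather than statements from the notes.

```python
PAGE = 4096


def translate(page_table: dict, addr: int) -> int:
    """Single-stage translation with a flat page-number -> frame-number map."""
    frame = page_table[addr // PAGE]
    return frame * PAGE + addr % PAGE


def nested_translate(guest_pt: dict, host_pt: dict, gva: int) -> int:
    """gVA -> gPA with the guest table, then gPA -> MA with the host table."""
    gpa = translate(guest_pt, gva)
    return translate(host_pt, gpa)


def combined_hole_bitmap(guest_bitmap: int, host_bitmap: int) -> int:
    # Assumption: with aligned 2MB regions, a sub-page needs a separate
    # 4KB translation if it is a hole in the guest OR in the host mapping.
    return guest_bitmap | host_bitmap


guest_pt = {0x10: 0x20}          # guest page 0x10 -> guest frame 0x20
host_pt = {0x20: 0x30}           # guest frame 0x20 -> machine frame 0x30
ma = nested_translate(guest_pt, host_pt, 0x10 * PAGE + 7)
print(hex(ma), bin(combined_hole_bitmap(0b0101, 0b0011)))
```

In real hardware every guest page table access during the first stage is itself a guest physical address that needs the second stage, which is what makes nested walks expensive.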

6. Operating System Interaction

Choosing perforated pages

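The OS has to decide when to allocate a perforated page and when to allocate multiple base pages.
The trade-off depends on the application's access pattern and on how fragmented memory is.
⇒ The more accesses go to hole-free parts of perforated pages, the better the performance; and the more the fragmented pages are concentrated in a narrow range, the more effective the bitmap filter.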

Page allocation

When the OS allocates an L2 page table node, it must also allocate a second, contiguous page for the shadow L2 page table node.
Page table allocation is also handled by the buddy allocator, so the allocation path is easy to modify.
The L1 page table node is then allocated the same way as before.
When a perforated page is allocated and its page mapping is inserted into the L2 page table entry, the OS must mark the holes in the bitmap.
→ The physical space for a hole can be allocated immediately or lazily.
Lazy allocation of hole pages supports reservation-based large page allocation, in the form of perforated pages with unallocated holes.
A perforated page can reserve a whole large page, but base pages are mapped only when they are actually accessed.

Modifying the page mapping

When an application releases some base pages of a large page, or the OS remaps some base pages of a large page to another region, the OS can either split the large page into base pages or punch holes and turn it into a perforated page.
When adding a hole to a large page or a perforated page, the OS must perform a TLB shootdown so that the TLB entries and bitmap entries of the affected pages are updated.
A perforated page can have its holes removed and be restored to a non-perforated page.
If unused memory is released without compaction having run, the OS compacts that memory to remove the holes.
Patching an existing hole requires a TLB shootdown.

7. Evaluation

7.1 Simulation Methodology

HW and OS support is implemented in the gem5 simulator under system-call emulation (SE) mode.
The performance impact of perforated pages is analyzed under the best allocation scenario.
Fragmented systems with dispersed and clustered hole pages are implemented for the experiments.
- clustered: hole pages are gathered in one part of a fragmented 2MB page block. (The bitmap filter is expected to work better here.)
- dispersed: hole pages are spread across the whole fragmented 2MB page block.
To control the degree of fragmentation, the experiments vary how fragmented each 2MB region is.
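The difference between the two hole layouts is easy to see by counting how many filter bits each one sets. The generators below are a simplified stand-in for the paper's fragmentation setup: `clustered_holes` packs the holes at the start of the block, `dispersed_holes` spaces them evenly, and each filter bit is assumed to cover 64 sub-pages. All names are made up for this sketch.

```python
SUBPAGES = 512     # 4KB sub-pages per 2MB block
REGION = 64        # assumed number of sub-pages covered by one filter bit


def clustered_holes(n: int) -> list:
    """n holes packed together at the start of the block."""
    return list(range(n))


def dispersed_holes(n: int) -> list:
    """n holes spread evenly across the block (requires 1 <= n <= 512)."""
    stride = SUBPAGES // n
    return [i * stride for i in range(n)]


def filter_bits_set(holes) -> int:
    """How many of the 8 filter regions contain at least one hole."""
    return len({h // REGION for h in holes})


for n in (8, 32):
    print(n, "holes:",
          "clustered sets", filter_bits_set(clustered_holes(n)), "filter bits,",
          "dispersed sets", filter_bits_set(dispersed_holes(n)), "filter bits")
```

With the same number of holes, the clustered layout leaves most filter regions clean, so most accesses skip the bitmap lookup, while the dispersed layout sets every filter bit and makes the filter useless.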

7.2 Sensitivity to Fragmentation

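This part analyzes performance with respect to system and application fragmentation.
worst case memory access scenario for TLB: a random memory access benchmark with a 2GB dataset that just fits in the TLB when 2MB pages are used
As fragmentation increases, conventional large pages are split into base pages, which reduces TLB reach and degrades performance.
Performance impact of the degree of fragmentation of 2MB regions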

Portion of fragmentation

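As the fragmented portion of 2MB blocks grows (moving to the right in the paper's table), performance drops.
⇒ This is because a conventional TLB splits large pages into base pages under fragmentation, which increases the number of page walks and reduces TLB coverage.
Perforated pages lose less performance as fragmentation increases.
As the share of pages with holes grows, performance still drops because TLB reach becomes insufficient.
⇒ Even when half of the large pages have 10% holes, the performance benefit of large pages is retained.
In the extreme case where every page has holes and each perforated page is 90% fragmented, performance is almost identical to the conventional case.
As the share of fragmented 2MB blocks grows, the share of 2MB entries in the TLB falls and the number of perforated page entries rises.
Fragmentation within perforated pages raises the share of 4KB hole entries in the TLB, so TLB coverage becomes less efficient.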

Fragmentation per block

As fragmentation increases, perforated pages access hole regions and the hole bitmap more often.
→ This reduces TLB coverage.
As fragmentation decreases, the share of perforated and large page TLB entries rises and the share of base page entries falls.
In short, perforated pages with up to 50% hole pages help performance.
Beyond that point, performance is similar to that of the conventional large page system.

Hole type

unallocated / allocated
Unallocated hole pages are never accessed, so they are never inserted into the TLB.
→ Unallocated holes therefore give efficient TLB coverage and good performance.
With clustered unallocated holes, the hole bitmap filter is effective when fragmentation is low. With dispersed unallocated holes, the hole bitmap filter is ineffective and bitmap lookups degrade performance.
Conditions under which perforated pages perform best
i) unallocated pages
ii) clustered

Hole distribution

Perforated pages perform better with clustered holes because the hole bitmap filter screens out a large share of the page bitmap lookups.

7.3 Application Performance

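Application performance when the share of fragmented 2MB blocks and the amount and distribution of fragmentation are varied

For every application, the performance of perforated pages dropped as fragmentation worsened.
Filtering works better in the clustered case than in the dispersed case.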

7.4 Comparison to GTSM

GTSM: Gap-Tolerant Sequential Mapping
It divides physical memory into 4MB chunks and builds one 2MB large page within each 4MB chunk, using 64KB blocks to allow flexible mapping. The remaining part can be used as base pages.
→ Large page allocation is limited to half of the region.
It allows large page allocation even in the presence of holes caused by fragmentation.
Perforated pages perform 6.2% better than GTSM on average. Perforated pages are therefore the better choice in virtualized environments, where TLB misses are costly.
On the other hand, when the holes inside a perforated page are dispersed, filtering is ineffective and cannot reduce L2 TLB accesses.

7.5 Virtualized System

In virtualized systems the cost of a TLB miss is higher, so the performance gain from large pages is larger than on native systems.
The more expensive the page table walk, the larger the performance benefit of perforated pages.

8. Conclusion

Large pages increase TLB reach and thus translation efficiency, but for applications that use memory sparsely they cause physical memory bloat, so they can only be used in limited situations.
This paper introduces perforated pages, which let the OS punch 4KB holes in large pages.
Perforated pages make large pages usable even when memory is fragmented by unmovable regions or when there are many shared pages.
They are implemented by extending the existing hierarchical page table with shadow PTE entries to translate holes, and by using a two-level hole bitmap that is cached and filtered in the L2 TLB.
Perforated pages reduce the fragmentation problems of large pages while increasing large page reach, and they perform even better in virtualized environments.

9. Pros and Cons

1) Pros

Improves large page availability.
Eases the problem of large pages being hard to share because some part has a different permission.

2) Cons

Holes are split off from the large page and backed by other base pages in physical memory, which can actually increase memory usage.
TLB coverage drops as fragmentation within perforated pages increases.
For perforated pages, the L2 TLB checks once more for holes through the coarse-grained hole bitmap, but because it is coarse-grained, the hole bitmap may be accessed even when the sub-page is not a hole.
Extra space is needed to store the shadow L2 PTEs and the bitmap, although the amount is small.
→ TLB capacity is consumed by bitmap entries.
If the bitmap is not in the L2 TLB, memory has to be accessed, which adds L2 TLB translation latency.
