I attended JPOUG> SET EVENTS 20151017

  • JPOUG> SET EVENTS 20151017 | Japan Oracle User Group (JPOUG)
  • JPOUG > SET EVENTS 20151017 - Togetter summary

The photo is from a tweet by @tadayima_jp.

I didn't give a presentation myself, but I was briefly called on to speak during Yuko Hayashi-san and Oda-san's session “DBエンジニアのスキルの現実と伸ばし方” (the reality of DB engineers' skills and how to grow them).

As my most Hatena-bookmarked article, I briefly introduced 大きなテキストファイルをawkで処理するときにcatで投げ込むと速い理由 - ablog. I only touched on the content lightly, but the reason cat file | awk was faster than awk file in my environment came down to how the Linux kernel's process scheduler scheduled the work. For details, see ”私の環境で”大きなテキストファイルをawkで処理するときにcatで投げ込むと速い理由 - ablog.

Recommended books

Since I mentioned them during the session, here are my recommended books. I assume everyone already knows the best-selling Japanese titles such as the “○○を支える技術” and “絵で見てわかる○○” series, so instead these are good books that seem less well known, which I learned about by word of mouth.

A book Takabou told me about.

Systems Performance: Enterprise and the Cloud

A book id:wmo6hash-san told me about. It is from the 8i era, but it is still useful today.
Oracle8I Internal Services for Waits, Latches, Locks, and Memory

Close in content to Oracle8I Internal Services ..., and covers 11gR2.
Oracle Core: Essential Internals for DBAs and Developers (Expert's Voice in Databases)

A book Shinkubo-san told me about. The contributors are the kind of people whose work became the roots of today's Oracle Database performance analysis techniques and features such as OWI, Method R, AWR, and ASH.
Oracle Insights: Tales of the Oak Table

  • Authors: Cary Millsap, Anjo Kolk, Connor McDonald, Tim Gorman, Kyle Hailey, David Ensor, Jonathan Lewis, Gaja Krishna Vaidyanatha, David Ruthven, James Morle
  • Publisher: Apress
  • Release date: 2014/03/12
  • Format: Paperback
A book Shinkubo-san told me about, in which Craig Shallahamer, one of the leading authorities on Oracle Database performance analysis, writes about ORTA (Oracle Response Time Analysis).
Oracle Performance Firefighting: Craig Shallahamer: 9780984102303: Amazon.com: Books

Sessions that left an impression

ハイパフォーマンスを実現する設計方法とSQLチューニング実践講座 (design methods for high performance and hands-on SQL tuning)

I finally got to meet Mick-san in person.

DBをリファクタリングしよう、DBとアプリの架け橋 DBFlute (let's refactor the DB: DBFlute, a bridge between the DB and the application)

“I can't even see the Performance tab!!”

The “mechanism that collects all sorts of metrics with fluentd + GrowthForecast and visualizes them” was cool!

新人SE女子がつまづいた Oracle Database 5つのこと (five Oracle Database things a new SE stumbled over)

A charming young SE with great delivery; dropping lines like “monitoring i-node usage is just one of the basics, right?” is simply unfair :)

DBエンジニアのスキルの現実と伸ばし方 (the reality of DB engineers' skills and how to grow them)

It was a thrill to get to talk with Hayashi-san and Oda-san, whose DB Magazine series I had been reading since my days in Osaka. The following words from Hayashi-san really stayed with me.

Below are some memories with Hayashi-san and Oda-san.

The night out after “JPOUG> SET EVENTS 20151017”

After the event I joined the drinking party at オフィス OFFICE SHOP | TRANSIT GENERAL OFFICE INC. I chatted with people I work with on site, pitched a new project idea to an editor from a publisher, and when I shared that idea with Shibachou-san we really hit it off.

The second party


Finally, my thanks to the JPOUG organizing members, CO-Sol, and everyone at Oracle Japan for providing such a wonderful occasion.
Next up: JPOUG Advent Calendar 2015 - Japan Oracle User Group (JPOUG) | Doorkeeper

Can a process be killed when an I/O system call over NFS gets no response?

NFS has the mount options soft and hard. After a process issues an I/O system call and context-switches from user mode to kernel mode, if there is no response, a soft mount retries and then returns an I/O error, whereas a hard mount keeps waiting until a response arrives.

  • hard + intr: the process can be killed*1, presumably because it sleeps in TASK_INTERRUPTIBLE.
  • hard + nointr: the process cannot be killed, presumably because it sleeps in TASK_UNINTERRUPTIBLE.

Since kernel 2.6.25, TASK_KILLABLE has been available, and the NFS client code was changed so that it sleeps in TASK_KILLABLE after issuing an I/O system call, so the process can now be killed even when the hard mount option is specified.

intr / nointr This option is provided for backward compatibility. It is ignored after kernel 2.6.25.

nfs(5) - Linux manual page

The fact that the intr / nointr options are ignored since 2.6.25 is presumably due to this change. RHEL 5 (2.6.18) does not have this change, but RHEL 6 (2.6.32) and later presumably do*2.
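
For reference, here is a minimal kernel-module-style sketch of mine (not taken from the NFS client sources) showing the kind of wait this change introduces: wait_event_killable() sleeps in TASK_KILLABLE, so ordinary signals are ignored but a fatal signal such as SIGKILL wakes the task.

/* Minimal sketch (mine): a kernel-style wait that sleeps in TASK_KILLABLE.
 * Unlike wait_event_interruptible(), ordinary signals do not wake the task,
 * but a fatal signal (e.g. SIGKILL) does, and the macro returns -ERESTARTSYS. */
#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(reply_wq);
static int reply_arrived;

static int wait_for_reply(void)
{
        /* 0 once reply_arrived becomes true, -ERESTARTSYS on a fatal signal */
        return wait_event_killable(reply_wq, reply_arrived);
}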

References

The Linux Programming Interface: A Linux and UNIX System Programming Handbook

  • 22.3 Interruptible and Uninterruptible Process Sleep States

We need to add a proviso to our earlier statement that SIGKILL and SIGSTOP always act immediately on a process. At various times, the kernel may put a process to sleep, and two sleep states are distinguished:

  • TASK_INTERRUPTIBLE: The process is waiting for some event. For example, it is waiting for terminal input, for data to be written to a currently empty pipe, or for the value of a System V semaphore to be increased. A process may spend an arbitrary length of time in this state. If a signal is generated for a process in this state, then the operation is interrupted and the process is woken up by the delivery of a signal. When listed by ps(1), processes in the TASK_INTERRUPTIBLE state are marked by the letter S in the STAT (process state) field.
  • TASK_UNINTERRUPTIBLE: The process is waiting on certain special classes of event, such as the completion of a disk I/O. If a signal is generated for a process in this state, then the signal is not delivered until the process emerges from this state. Processes in the TASK_UNINTERRUPTIBLE state are listed by ps(1) with a D in the STAT field.

Because a process normally spends only very brief periods in the TASK_UNINTERRUPTIBLE state, the fact that a signal is delivered only when the process leaves this state is invisible. However, in rare circumstances, a process may remain hung in this state, perhaps as the result of a hardware failure, an NFS problem, or a kernel bug. In such cases, SIGKILL won’t terminate the hung process. If the underlying problem can’t otherwise be resolved, then we must restart the system in order to eliminate the process.
The TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE states are present on most UNIX implementations. Starting with kernel 2.6.25, Linux adds a third state to address the hanging process problem just described:

  • TASK_KILLABLE: This state is like TASK_UNINTERRUPTIBLE, but wakes the process if a fatal signal (i.e., one that would kill the process) is received. By converting relevant parts of the kernel code to use this state, various scenarios where a hung process requires a system restart can be avoided. Instead, the process can be killed by sending it a fatal signal. The first piece of kernel code to be converted to use TASK_KILLABLE was NFS.

Linux kernel version 2.6.25 introduced a new state for putting processes to sleep: TASK_KILLABLE. When a process sleeps in this new killable state, it behaves as it would in TASK_UNINTERRUPTIBLE, yet it can still respond to fatal signals.

The NFS client code was changed in several places to use this new process state. Listing 3 shows the difference in the nfs_wait_event macro between Linux kernel 2.6.18 and 2.6.26.

  • Listing 3. Changes to nfs_wait_event from the introduction of TASK_KILLABLE
Linux Kernel 2.6.18
===================
#define nfs_wait_event(clnt, wq, condition)
({
	int __retval = 0;
	if (clnt->cl_intr) {
		sigset_t oldmask;
		rpc_clnt_sigmask(clnt, &oldmask);
		__retval = wait_event_interruptible(wq, condition);
		rpc_clnt_sigunmask(clnt, &oldmask);
	} else
		wait_event(wq, condition);
	__retval;
})

Linux Kernel 2.6.26
===================
#define nfs_wait_event(clnt, wq, condition)
({
	int __retval = wait_event_killable(wq, condition);
	__retval;
})

Listing 4 shows the difference in the definition of the nfs_direct_wait() function between Linux kernel 2.6.18 and 2.6.26.

  • Listing 4. Changes to nfs_direct_wait() from the introduction of TASK_KILLABLE
Linux Kernel 2.6.18
===================
static ssize_t nfs_direct_wait(struct nfs_direct_req *dreq)
{
	ssize_t result = -EIOCBQUEUED;

	/* Async requests don't wait here */
	if (dreq->iocb)
		goto out;

	result = wait_for_completion_interruptible(&dreq->completion);

	if (!result)
		result = dreq->error;
	if (!result)
		result = dreq->count;

out:
	kref_put(&dreq->kref, nfs_direct_req_release);
	return (ssize_t) result;
}

Linux Kernel 2.6.26
===================
static ssize_t nfs_direct_wait(struct nfs_direct_req *dreq)
{
	ssize_t result = -EIOCBQUEUED;

	/* Async requests don't wait here */
	if (dreq->iocb)
		goto out;

	result = wait_for_completion_killable(&dreq->completion);

	if (!result)
		result = dreq->error;
	if (!result)
		result = dreq->count;

out:
	return (ssize_t) result;
}

For details of the NFS client changes that take advantage of this new feature, see the Linux Kernel Mailing List entries listed in the references.

Previously, specifying the NFS mount option intr made it possible to interrupt an NFS client process waiting on some event, but in that case all interrupts were allowed, not just a single signal intended to kill the process (as with TASK_KILLABLE).

https://www.ibm.com/developerworks/jp/linux/library/l-task-killable/

Or maybe not. A while back, Matthew Wilcox realized that many of these concerns about application bugs do not really apply if the application is about to be killed anyway. It does not matter if the developer thought about the possibility of an interrupted system call if said system call is doomed to never return to user space. So Matthew created a new sleeping state, called TASK_KILLABLE; it behaves like TASK_UNINTERRUPTIBLE with the exception that fatal signals will interrupt the sleep.

...

The TASK_KILLABLE patch was merged for the 2.6.25 kernel, but that does not mean that the unkillable process problem has gone away. The number of places in the kernel (as of 2.6.26-rc8) which are actually using this new state is quite small - as in, one need not worry about running out of fingers while counting them. The NFS client code has been converted, which can only be a welcome development. But there are very few other uses of TASK_KILLABLE, and none at all in device drivers, which is often where processes get wedged.

https://lwn.net/Articles/288056/

NFS: Switch from intr mount option to TASK_KILLABLE

By using the TASK_KILLABLE infrastructure, we can get rid of the 'intr' mount option. We have to use _killable everywhere instead of _interruptible as we get rid of rpc_clnt_sigmask/sigunmask.

https://lkml.org/lkml/2007/12/6/329?cm_mc_uid=48289949268313906794256&cm_mc_sid_50200000=1446010393

*1: The process can be stopped by sending it a signal.

*2: For the mapping between releases and kernel versions, see https://access.redhat.com/ja/node/16476

How wmark_{min|low|high} are calculated from vm.min_free_kbytes

A note on how wmark_min, wmark_low, and wmark_high, the thresholds for Linux page reclaim, are calculated.

Formulas

Strictly speaking they are calculated per zone within each NUMA node, but the system-wide totals can be approximated with the formulas below.

min_free_kbytes = sqrt(physical memory size (KB) * 16)
wmark_min = min_free_kbytes
wmark_low = wmark_min + (wmark_min / 4)
wmark_high = wmark_min + (wmark_min / 2) 

Worked example

For example, on x86_64 with 16 GB of physical memory:

min_free_kbytes = sqrt(16,777,216KB * 16) = 16,384 KB
wmark_min = 16,384 KB
wmark_low = 16,384 + (16,384 / 4) = 20,480 KB
wmark_high = 16,384 + (16,384 / 2) = 24,576 KB 
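
To make the approximation easy to reproduce, here is a minimal user-space sketch of mine (not from the post; compile with -lm) that reads MemTotal from /proc/meminfo and applies the formulas above. The kernel actually computes min_free_kbytes from lowmem_kbytes (nr_free_buffer_pages), so values based on MemTotal are only an approximation.

/* Minimal sketch (mine): approximate min_free_kbytes and the system-wide
 * wmark_{min,low,high} totals from MemTotal, using the formulas above. */
#include <stdio.h>
#include <math.h>

int main(void)
{
        char line[256];
        unsigned long mem_total_kb = 0;
        FILE *fp = fopen("/proc/meminfo", "r");

        if (fp == NULL) {
                perror("fopen");
                return 1;
        }
        while (fgets(line, sizeof(line), fp)) {
                if (sscanf(line, "MemTotal: %lu kB", &mem_total_kb) == 1)
                        break;
        }
        fclose(fp);

        unsigned long min_free_kbytes = (unsigned long)sqrt((double)mem_total_kb * 16);
        unsigned long wmark_min  = min_free_kbytes;
        unsigned long wmark_low  = wmark_min + wmark_min / 4;
        unsigned long wmark_high = wmark_min + wmark_min / 2;

        printf("min_free_kbytes ~= %lu KB\n", min_free_kbytes);
        printf("wmark_min  ~= %lu KB\n", wmark_min);
        printf("wmark_low  ~= %lu KB\n", wmark_low);
        printf("wmark_high ~= %lu KB\n", wmark_high);
        return 0;
}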

Results on an actual machine

I checked on my own environment → How to check usage of different parts of memory?

$ uname -r
2.6.39-400.17.1.el6uek.x86_64 ★ Kernel 2.6.39-400 uses present_pages, not managed_pages, in the calculation
$ cat /proc/meminfo|head -1
MemTotal:       16158544 kB ★ physical memory size
$ sysctl -a|grep min_free_kbytes
vm.min_free_kbytes = 16114 ★ min_free_kbytes = 16,114 KB
$ cat /proc/zoneinfo
Node 0, zone      DMA
  pages free     3976
        min      3 ★ min_pages = 3 * 4KB = 12 KB
        low      3 ★ low_pages = 3 * 4KB = 12 KB
        high     4 ★ high_pages = 4 * 4KB = 16 KB
        scanned  0
        spanned  4080
        present  3920 ★
...
Node 0, zone    DMA32
  pages free     327938
        min      822 ★ min_pages = 822 * 4KB = 3,288 KB
        low      1027 ★ low_pages = 1027 * 4KB = 4,108 KB
        high     1233 ★ high_pages = 1233 * 4KB = 4,932 KB
        scanned  0
        spanned  1044480
        present  828008 ★
...
Node 0, zone   Normal
  pages free     3994
        min      3202 ★ min_pages = 3202 * 4KB = 12,808 KB
        low      4002 ★ low_pages = 4002 * 4KB = 16,008 KB
        high     4803 ★ high_pages = 4803 * 4KB = 19,212 KB
        scanned  0
        spanned  3270144
        present  3225435

Summing min, low, and high across the three zones (DMA, DMA32, and Normal) gives:

min:  16,108 KB (12 + 3,288 + 12,808)
low:  20,128 KB (12 + 4,108 + 16,008)
high: 24,160 KB (16 + 4,932 + 19,212)

As shown above, in reality the thresholds exist not system-wide but per NUMA node and, within each node, per zone (DMA, DMA32, and Normal on x86_64), and page reclaim appears to be triggered when free memory drops below the threshold in each individual zone.
As Gachapin-sensei pointed out, estimating the threshold for each zone of each NUMA node precisely requires a somewhat more complicated formula.
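
To compare with the formula-based approximation, the per-zone watermarks can simply be summed. Here is a minimal sketch of mine (not from the post) that parses /proc/zoneinfo:

/* Minimal sketch (mine): sum the per-zone min/low/high watermarks reported
 * in /proc/zoneinfo and print the totals in KB. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        char line[256];
        unsigned long v, min_kb = 0, low_kb = 0, high_kb = 0;
        long page_kb = sysconf(_SC_PAGESIZE) / 1024;   /* 4 on x86_64 */
        FILE *fp = fopen("/proc/zoneinfo", "r");

        if (fp == NULL) {
                perror("fopen");
                return 1;
        }
        while (fgets(line, sizeof(line), fp)) {
                /* watermark lines look like "        min      3" */
                if (sscanf(line, " min %lu", &v) == 1)
                        min_kb += v * page_kb;
                else if (sscanf(line, " low %lu", &v) == 1)
                        low_kb += v * page_kb;
                else if (sscanf(line, " high %lu", &v) == 1)
                        high_kb += v * page_kb;
        }
        fclose(fp);

        printf("min:  %lu KB\nlow:  %lu KB\nhigh: %lu KB\n", min_kb, low_kb, high_kb);
        return 0;
}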

Background

Linux*1 puts free memory to good use: when files*2 are read or written, their contents are cached*3 in memory, and when free memory runs low, pages are reclaimed to make room. Pages that are used less frequently are freed based on the LRU lists*4.

Types of page reclaim and how they behave
  • Background reclaim
    • When free memory falls below wmark_low (low pages), the kernel thread kswapd starts reclaiming pages in the background.
    • When free memory rises above wmark_high (high pages), kswapd stops reclaiming.
  • Direct reclaim
    • When free memory falls below wmark_min (min pages), a process requesting a memory allocation first reclaims pages itself to free up space and only then gets the allocation (a sketch for observing both kinds of reclaim follows this list).
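
To get a feel for how much of each kind of reclaim a system has been doing, the pgscan counters in /proc/vmstat can be compared. This is a minimal sketch of mine (not from the post); matching on the counter-name prefix is an assumption that works because the exact field names vary slightly by kernel version.

/* Minimal sketch (mine): sum the pgscan_kswapd* and pgscan_direct* counters
 * in /proc/vmstat to compare background (kswapd) reclaim with direct reclaim.
 * pgscan_direct_throttle counts throttle events, not pages, so it is skipped. */
#include <stdio.h>
#include <string.h>

int main(void)
{
        char name[128];
        unsigned long long v, kswapd = 0, direct = 0;
        FILE *fp = fopen("/proc/vmstat", "r");

        if (fp == NULL) {
                perror("fopen");
                return 1;
        }
        while (fscanf(fp, "%127s %llu", name, &v) == 2) {
                if (strncmp(name, "pgscan_kswapd", 13) == 0)
                        kswapd += v;
                else if (strncmp(name, "pgscan_direct", 13) == 0 &&
                         strstr(name, "throttle") == NULL)
                        direct += v;
        }
        fclose(fp);

        printf("pages scanned by kswapd (background reclaim): %llu\n", kswapd);
        printf("pages scanned by direct reclaim:              %llu\n", direct);
        return 0;
}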


From Systems Performance: Enterprise and the Cloud, Figure 7.8, “kswapd wake-ups and modes”

From Reducing Memory Access Latency

References

Professional Linux Kernel Architecture (Wrox Programmer to Programmer)

3.2.2.3. Calculation of Zone Watermarks

Before calculating the various watermarks, the kernel first determines the minimum memory space that must remain free for critical allocations. This value scales nonlinearly with the size of the available RAM. It is stored in the global variable min_free_kbytes. Figure 3.4 provides an overview of the scaling behavior, and the inset — which does not use a logarithmic scale for the main memory size in contrast to the main graph — shows a magnification of the region up to 4 GiB. Some exemplary values to provide a feeling for the situation on systems with modest memory that are common in desktop environments are collected in Table 3.1. An invariant is that not less than 128 KiB but not more than 64 MiB may be used. Note, however, that the upper bound is only necessary on machines equipped with a really satisfactory amount of main memory. The file /proc/sys/vm/min_free_kbytes allows reading and adapting the value from userland.

Filling the watermarks in the data structure is handled by init_per_zone_pages_min, which is invoked during kernel boot and need not be started explicitly.

setup_per_zone_pages_min sets the pages_min, pages_low, and pages_high elements of struct zone. After the total number of pages outside the highmem zone has been calculated (and stored in lowmem_ pages), the kernel iterates over all zones in the system and performs the following calculation:

  • mm/page_alloc.c
void setup_per_zone_pages_min(void)
{
        unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
        unsigned long lowmem_pages = 0;
        struct zone *zone;
        unsigned long flags;

...
        for_each_zone(zone) {
                u64 tmp;

                tmp = (u64)pages_min * zone->present_pages;
                do_div(tmp,lowmem_pages);
                if (is_highmem(zone)) {
                        int min_pages;

                        min_pages = zone->present_pages / 1024;
                        if (min_pages < SWAP_CLUSTER_MAX)
                                min_pages = SWAP_CLUSTER_MAX;
                        if (min_pages > 128)
                                min_pages = 128;
                        zone->pages_min = min_pages;
                } else {
                        zone->pages_min = tmp;
                }

                zone->pages_low = zone->pages_min + (tmp >> 2);
                zone->pages_high = zone->pages_min + (tmp >> 1);
        }
}



The corresponding code in current kernels (from the linux-stable tree linked below) uses managed_pages and the zone->watermark[] array, but the calculation is essentially the same:

static void __setup_per_zone_wmarks(void)
{
	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
	unsigned long lowmem_pages = 0;
	struct zone *zone;
	unsigned long flags;

	/* Calculate total number of !ZONE_HIGHMEM pages */
	for_each_zone(zone) {
		if (!is_highmem(zone))
			lowmem_pages += zone->managed_pages;
	}

	for_each_zone(zone) {
		u64 tmp;

		spin_lock_irqsave(&zone->lock, flags);
		tmp = (u64)pages_min * zone->managed_pages;
		do_div(tmp, lowmem_pages);
		if (is_highmem(zone)) {
			/*
			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
			 * need highmem pages, so cap pages_min to a small
			 * value here.
			 *
			 * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_MIN)
			 * deltas controls asynch page reclaim, and so should
			 * not be capped for highmem.
			 */
			unsigned long min_pages;

			min_pages = zone->managed_pages / 1024;
			min_pages = clamp(min_pages, SWAP_CLUSTER_MAX, 128UL);
			zone->watermark[WMARK_MIN] = min_pages;
		} else {
			/*
			 * If it's a lowmem zone, reserve a number of pages
			 * proportionate to the zone's size.
			 */
			zone->watermark[WMARK_MIN] = tmp;
		}

		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + (tmp >> 2);
		zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);

		setup_zone_migrate_reserve(zone);
		spin_unlock_irqrestore(&zone->lock, flags);
	}

	/* update totalreserve_pages */
	calculate_totalreserve_pages();
}

...

/*
 * Initialise min_free_kbytes.
 *
 * For small machines we want it small (128k min).  For large machines
 * we want it large (64MB max).  But it is not linear, because network
 * bandwidth does not increase linearly with machine size.  We use
 *
 * 	min_free_kbytes = 4 * sqrt(lowmem_kbytes), for better accuracy: ★
 *	min_free_kbytes = sqrt(lowmem_kbytes * 16) ★
 *
 * which yields
 *
 * 16MB:	512k
 * 32MB:	724k
 * 64MB:	1024k
 * 128MB:	1448k
 * 256MB:	2048k
 * 512MB:	2896k
 * 1024MB:	4096k
 * 2048MB:	5792k
 * 4096MB:	8192k
 * 8192MB:	11584k
 * 16384MB:	16384k
 */
int __meminit init_per_zone_wmark_min(void)
{
	unsigned long lowmem_kbytes;

	lowmem_kbytes = nr_free_buffer_pages() * (PAGE_SIZE >> 10);

	min_free_kbytes = int_sqrt(lowmem_kbytes * 16);
	if (min_free_kbytes < 128)
		min_free_kbytes = 128;
	if (min_free_kbytes > 65536)
		min_free_kbytes = 65536;
	setup_per_zone_wmarks();
	refresh_zone_stat_thresholds();
	setup_per_zone_lowmem_reserve();
	setup_per_zone_inactive_ratio();
	return 0;
}
module_init(init_per_zone_wmark_min)

This is an example of the problem and solution above.

  • System Memory: 2GB
  • High memory pressure

In this case, min_free_kbytes and watermarks are automatically
set as follows.
(Here, watermark shows sum of the each zone's watermark.)

min_free_kbytes: 5752
watermark[min] : 5752
watermark[low] : 7190
watermark[high]: 8628

Tunable watermark [LWN.net]

min_free_kbytes:

This is used to force the Linux VM to keep a minimum number of kilobytes free. The VM uses this number to compute a watermark[WMARK_MIN] value for each lowmem zone in the system. Each lowmem zone gets a number of reserved free pages based proportionally on its size.

Some minimal amount of memory is needed to satisfy PF_MEMALLOC allocations; if you set this to lower than 1024KB, your system will become subtly broken, and prone to deadlock under high loads.

Setting this too high will OOM your machine instantly.

kernel/git/stable/linux-stable.git - Linux kernel stable tree

*1: Not just Linux; most operating systems behave in a similar way.

*2: Or blocks of a block device, to be precise.

*3: The page cache.

*4: Either (1) page cache is freed or (2) process memory (anon pages) is paged out. The balance between 1 and 2 can be tuned with the kernel parameter vm.swappiness.

Material on ext4 performance

Just a note:
Scaling the Linux Kernel(Revisited): Using ext4 as a Case Study by Theodore Ts'o (Google)

A page for looking up which kernel version ships with each Red Hat Enterprise Linux release

Red Hat Enterprise Linux release dates and included kernel versions - Red Hat Customer Portal

Notes on the NFS mount options hard and soft

Notes from looking into the NFS mount options hard and soft (Linux only).

Summary

How hard behaves
  • Writes are retried forever until the NFS server responds.
  • After issuing the I/O, the application keeps sleeping, waiting for completion.
  • Combining hard with intr allows the I/O to be stopped by sending a signal*1:
kill -s SIGINT or SIGQUIT or SIGHUP <PID>
How soft behaves
  • If a write fails the number of times specified by retrans, an error is returned to the application that issued the I/O (see the sketch after this list).
Which should you use?
  • Use hard when reading or writing data that requires integrity.
    • Otherwise incomplete writes*2 or incomplete reads*3 can occur.
  • Also use hard for filesystems that hold executables.
    • If the NFS server crashes while an executable's data is being read into memory, or while a paged-out page is being read back in, unexpected behavior*4 can result.
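
As an illustration of why applications must check for errors on a soft mount, here is a minimal sketch of mine (the path is hypothetical): on a soft-mounted filesystem a timed-out request typically surfaces as an error such as EIO from write(2) or close(2).

/* Minimal sketch (mine): on a soft-mounted NFS filesystem, a timed-out
 * request shows up as an error (typically EIO) from write(2) or close(2),
 * so both must be checked. The path below is hypothetical. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        const char buf[] = "important data\n";
        int fd = open("/mnt/nfs/data.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (write(fd, buf, sizeof(buf) - 1) != (ssize_t)(sizeof(buf) - 1)) {
                perror("write");        /* EIO here may mean a soft-mount timeout */
                close(fd);
                return 1;
        }
        if (close(fd) < 0) {            /* delayed write errors can show up here too */
                perror("close");
                return 1;
        }
        return 0;
}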

References

soft / hard
Determines the recovery behavior of the NFS client after an NFS request times out. If neither option is specified (or if the hard option is specified), NFS requests are retried indefinitely. If the soft option is specified, then the NFS client fails an NFS request after retrans retransmissions have been sent, causing the NFS client to return an error to the calling application.
NB: A so-called "soft" timeout can cause silent data corruption in certain cases. As such, use the soft option only when client responsiveness is more important than data integrity. Using NFS over TCP or increasing the value of the retrans option may mitigate some of the risks of using the soft option.

retrans=n
The number of times the NFS client retries a request before it attempts further recovery action. If the retrans option is not specified, the NFS client tries each request three times.
The NFS client generates a "server not responding" message after retrans retries, then attempts further recovery (depending on whether the hard mount option is in effect).

intr / nointr
This option is provided for backward compatibility. It is ignored after kernel 2.6.25.

nfs(5) - Linux manual page


Managing Nfs and Nis

  • 6.3. Mounting filesystems - Mount options

hard/soft
By default, NFS filesystems are hard mounted, and operations on them are retried until they are acknowledged by the server. If the soft option is specified, an NFS RPC call returns a timeout error if it fails the number of times specified by the retrans option.

  • 6.3. Mounting filesystems - Mounting filesystems - Hard and soft mounts

Hard and soft mounts
The hard and soft mount options determine how a client behaves when the server is excessively loaded for a long period or when it crashes. By default, all NFS filesystems are mounted hard, which means that an RPC call that times out will be retried indefinitely until a response is received from the server. This makes the NFS server look as much like a local disk as possible — the request that needs to go to disk completes at some point in the future. An NFS server that crashes looks like a disk that is very, very slow.
A side effect of hard-mounting NFS filesystems is that processes block (or “hang”) in a high-priority disk wait state until their NFS RPC calls complete. If an NFS server goes down, the clients using its filesystems hang if they reference these filesystems before the server recovers. Using intr in conjunction with the hard mount option allows users to interrupt system calls that are blocked waiting on a crashed server. The system call is interrupted when the process making the call receives a signal, usually sent by the user typing CTRL-C (interrupt) or using the kill command. CTRL-\ (quit) is another way to generate a signal, as is logging out of the NFS client host. When using kill , only SIGINT, SIGQUIT, and SIGHUP will interrupt NFS operations.
When an NFS filesystem is soft-mounted, repeated RPC call failures eventually cause the NFS operation to fail as well. Instead of emulating a painfully slow disk, a server exporting a soft-mounted filesystem looks like a failing disk when it crashes: system calls referencing the soft-mounted NFS filesystem return errors. Sometimes the errors can be ignored or are preferable to blocking at high priority; for example, if you were doing an ls -l when the NFS server crashed, you wouldn’t really care if the ls command returned an error as long as your system didn’t hang.
The other side to this “failing disk” analogy is that you never want to write data to an unreliable device, nor do you want to try to load executables from it. You should not use the soft option on any filesystem that is writable, nor on any filesystem from which you load executables. Furthermore, because many applications do not check the return value of the read(2) system call when reading regular files (because those programs were written in the days before networking was ubiquitous, and disks were reliable enough that reads from disks virtually never failed), you should not use the soft option on any filesystem that is supplying input to applications that are in turn using the data for a mission-critical purpose. NFS only guarantees the consistency of data after a server crash if the NFS filesystem was hard-mounted by the client. Unless you really know what you are doing, never use the soft option.
We’ll come back to hard- and soft-mount issues when we discuss modifying client behavior in the face of slow NFS servers in Chapter 18.

  • 18.2. Soft mount issues

Repeated retransmission cycles only occur for hard-mounted filesystems. When the soft option is supplied in a mount, the RPC retransmission sequence ends at the first major timeout, producing messages like:

NFS write failed for server wahoo: error 5 (RPC: Timed out)
NFS write error on host wahoo: error 145.
(file handle: 800000 2 a0000 114c9 55f29948 a0000 11494 5cf03971)

The NFS operation that failed is indicated, the server that failed to respond before the major timeout, and the filehandle of the file affected. RPC timeouts may be caused by extremely slow servers, or they can occur if a server crashes and is down or rebooting while an RPC retransmission cycle is in progress.
With soft-mounted filesystems, you have to worry about damaging data due to incomplete writes, losing access to the text segment of a swapped process, and making soft-mounted filesystems more tolerant of variances in server response time. If a client does not give the server enough latitude in its response time, the first two problems impair both the performance and correct operation of the client. If write operations fail, data consistency on the server cannot be guaranteed. The write error is reported to the application during some later call to write( ) or close( ), which is consistent with the behavior of a local filesystem residing on a failing or overflowing disk. When the actual write to disk is attempted by the kernel device driver, the failure is reported to the application as an error during the next similar or related system call.
A well-conditioned application should exit abnormally after a failed write, or retry the write if possible. If the application ignores the return code from write( ) or close( ), then it is possible to corrupt data on a soft-mounted filesystem. Some write operations may fail and never be retried, leaving holes in the open file.
To guarantee data integrity, all filesystems mounted read-write should be hard-mounted. Server performance as well as server reliability determine whether a request eventually succeeds on a soft-mounted filesystem, and neither can be guaranteed. Furthermore, any operating system that maps executable images directly into memory (such as Solaris) should hard-mount filesystems containing executables. If the filesystem is soft-mounted, and the NFS server crashes while the client is paging in an executable (during the initial load of the text segment or to refill a page frame that was paged out), an RPC timeout will cause the paging to fail. What happens next is system-dependent; the application may be terminated or the system may panic with unrecoverable swap errors.
A common objection to hard-mounting filesystems is that NFS clients remain catatonic until a crashed server recovers, due to the infinite loop of RPC retransmissions and timeouts. By default, Solaris clients allow interrupts to break the retransmission loop. Use the intr mount option if your client doesn’t specify interrupts by default. Unfortunately, some older implementations of NFS do not process keyboard interrupts until a major timeout has occurred: with even a small timeout period and retransmission count, the time required to recognize an interrupt can be quite large.
If you choose to ignore this advice, and choose to use soft-mounted NFS filesystems, you should at least make NFS clients more tolerant of soft-mounted NFS fileservers by increasing the retrans mount option. Increasing the number of attempts to reach the server makes the client less likely to produce an RPC error during brief periods of server loading.

Additional note

  • I am not addressing the more fundamental question of whether NFS should be used at all for reading and writing data that requires integrity, or for areas that hold executables.

*1: The nfs(5) man page says this option is ignored after kernel 2.6.25.

*2: Only part of the data that the I/O request intended to write actually gets written.

*3: Only part of the data that the I/O request intended to read actually gets read.

*4: Depends on the OS implementation, but e.g. abnormal application termination or a kernel panic.

"Reducing Memory Access Latency" が素晴らしすぎる

Reducing Memory Access Latency by Satoru Moriya (Hitachi LTC)
is so good that I had to take notes.

Summary

  • vm.swappiness = 0 keeps process memory (anon pages) from being swapped out*1 as long as there is freeable page cache.
    • Even with swappiness=0 there used to be a problem where process memory was swapped out while freeable page cache still remained, but a patch by Moriya-san, the author of this material, was merged into kernel 3.5 → mm: avoid swapping out with swappiness==0
  • extra_free_kbytes raises the threshold at which kswapd starts page reclaim, making direct reclaim less likely to occur.
  • Since kernel 3.2, direct reclaim only targets clean page cache.
    • If dirty page cache were reclaimed during direct reclaim, it would have to be written back to disk, delaying the memory allocation substantially.
    • Mel Gorman's patch was merged into kernel 3.2 → mm: vmscan: do not writeback filesystem pages in direct reclaim
  • Preallocation plus mlock(2)/mlockall(2) lets a critical process get its physical memory allocated up front and excluded from page-out, though the application has to be written that way (see the sketch after this list).
  • The cgroup memory.limit_in_bytes setting can cap how much page cache a particular process uses.
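
As a rough illustration of the preallocation + mlockall(2) idea mentioned above, here is a minimal sketch of mine (the buffer size and usage are made up, not from the slides):

/* Minimal sketch (mine): preallocate and lock a working buffer so it is never
 * paged out. mlockall(MCL_CURRENT | MCL_FUTURE) pins current and future
 * mappings; touching the buffer forces its pages to be resident. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define BUF_SIZE (64UL * 1024 * 1024)   /* 64 MB working set (example value) */

int main(void)
{
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
                perror("mlockall");     /* needs CAP_IPC_LOCK or a high RLIMIT_MEMLOCK */
                return 1;
        }

        char *buf = malloc(BUF_SIZE);
        if (buf == NULL) {
                perror("malloc");
                return 1;
        }
        memset(buf, 0, BUF_SIZE);       /* touch every page so it is resident and locked */

        /* ... latency-sensitive work using buf ... */

        free(buf);
        munlockall();
        return 0;
}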


REDUCING MEMORY ACCESS LATENCY | Hitachi Data Systems appears to be the same material.

*1: Strictly speaking, it is not paged out.