洛阳铲的日志 (The Luoyang Shovel's Log)

October 13, 2010

RCS version control

Filed under: LAMP — HackGou @ 18:00

RCS is the most primitive version control system on *NIX, but also the simplest and most reliable. It does exactly one thing: version control.
Concurrency, distribution, and all the rest can go to hell; KISS is what makes it powerful.

A key concept in RCS is the lock. Yes, you read that right: plenty of people sneer at lock-based version control,
but simple is king.

The usual RCS workflow is:

1. Initialize a.txt with rcs -i a.txt
2. Check it out and lock it with co -l a.txt
3. Edit
4. Commit the change and release the lock with ci -u -m'comment' a.txt

That simple sequence is the whole story. The tool is simple; the problems are not.

If you co without -l, the later ci will fail with a "no lock set by XXXX" error.
In that case, lock the revision with rcs -l[rev] and the ci will go through.

References:

1. RCS 版本控制系統: http://www.csie.cyut.edu.tw/~dywang/linuxProgram/node42.html
2. RCS 簡介: http://www.csie.nctu.edu.tw/~tsaiwn/course/introcs/history/rcs-cookie/phi.sinica.edu.tw/aspac/reports/96/96007/#32

December 4, 2009

pdo_oci_handle_factory: OCI_INVALID_HANDLE error

Filed under: LAMP, Linux — HackGou @ 18:15

An application kept reporting this error:

[DEBUG] SQLSTATE[]: pdo_oci_handle_factory: OCI_INVALID_HANDLE (/home/szhou/rpmbuild/BUILD/PDO_OCI-1.0/oci_driver.c:463)

Google turns up plenty of similar reports, but no good fix. Although the bug reporters never say whether SELinux was enabled, for the server in question today
the cause was indeed SELinux; audit.log contains entries like:


type=AVC msg=audit(1259911324.873:28565): avc: denied { execstack } for pid=19315 comm="httpd" scontext=user_u:system_r:httpd_t:s0 tcontext=user_u:system_r:httpd_t:s0 tclass=process

The fix is to run: /usr/bin/execstack -c /usr/lib64/oracle/10.2.0.3/client/lib/*.so*

Drepper's personal page has a write-up on SELinux memory protection:
http://people.redhat.com/~drepper/selinux-mem.html
my overview of security features

https://bugzilla.redhat.com/show_bug.cgi?id=540466 explains how to deal with execstack under SELinux.


October 23, 2007

[Repost] UNIX signals explained

Filed under: FreeBSD,LAMP,Linux,OS_Tips — HackGou @ 18:16

A very good article explaining the signals on UNIX. Once you know them,
you can treat processes like toys: play with them however you like, kill them or bring them back, as you please.
The original is at:
http://blog.csdn.net/baobao8505/archive/2006/08/25/1115820.aspx

Run the following command to see the list of signals Linux supports:

~$ kill -l
1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL
5) SIGTRAP 6) SIGABRT 7) SIGBUS 8) SIGFPE
9) SIGKILL 10) SIGUSR1 11) SIGSEGV 12) SIGUSR2
13) SIGPIPE 14) SIGALRM 15) SIGTERM 17) SIGCHLD
18) SIGCONT 19) SIGSTOP 20) SIGTSTP 21) SIGTTIN
22) SIGTTOU 23) SIGURG 24) SIGXCPU 25) SIGXFSZ
26) SIGVTALRM 27) SIGPROF 28) SIGWINCH 29) SIGIO
30) SIGPWR 31) SIGSYS 34) SIGRTMIN 35) SIGRTMIN+1
36) SIGRTMIN+2 37) SIGRTMIN+3 38) SIGRTMIN+4 39) SIGRTMIN+5
40) SIGRTMIN+6 41) SIGRTMIN+7 42) SIGRTMIN+8 43) SIGRTMIN+9
44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13
52) SIGRTMAX-12 53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9
56) SIGRTMAX-8 57) SIGRTMAX-7 58) SIGRTMAX-6 59) SIGRTMAX-5
60) SIGRTMAX-4 61) SIGRTMAX-3 62) SIGRTMAX-2 63) SIGRTMAX-1
64) SIGRTMAX

In this list, signals numbered 1 through 31 are the signals supported by traditional UNIX; they are unreliable (non-real-time) signals. Signals numbered 32 through 63 were added later and are called reliable (real-time) signals. The difference is that unreliable signals do not queue and can therefore be lost, while reliable signals cannot.

Below we discuss the signals numbered below SIGRTMIN.

1) SIGHUP
This signal is sent when a user's terminal connection ends (normally or otherwise), usually when the terminal's controlling process exits, to notify every job in the same session that it is no longer attached to a controlling terminal.

When you log in to Linux, the system assigns you a terminal (session). All programs run on that terminal, foreground process group and background process groups alike, normally belong to that session. When you log out, the foreground process group and any background processes that write to the terminal receive SIGHUP. The default action for this signal is to terminate the process, so those processes are killed. The signal can be caught, though: wget, for example, catches SIGHUP and ignores it, so it keeps downloading even after you log out.

In addition, for daemons that have detached from any terminal, this signal is conventionally used to tell them to reread their configuration files.
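
As a rough sketch of that convention (my addition, not part of the original article; reload_config() is a made-up placeholder), a daemon might handle SIGHUP like this:

/* Sketch: catch SIGHUP and treat it as "reload the configuration".
 * reload_config() is a hypothetical placeholder, not a real API. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_hup = 0;

static void on_hup(int sig)
{
    (void)sig;
    got_hup = 1;                 /* only set a flag; do the real work outside the handler */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_hup;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    sigaction(SIGHUP, &sa, NULL);

    for (;;) {
        pause();                 /* sleep until some signal arrives */
        if (got_hup) {
            got_hup = 0;
            fprintf(stderr, "SIGHUP: re-reading configuration\n");
            /* reload_config(); */
        }
    }
}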

2) SIGINT
Program interrupt signal, sent when the user types the INTR character (usually Ctrl-C); it asks the foreground process group to terminate.

3) SIGQUIT
Similar to SIGINT, but triggered by the QUIT character (usually Ctrl-\). A process that exits on SIGQUIT produces a core file, so in that sense it resembles a program-error signal.

4) SIGILL
An illegal instruction was executed. Usually the executable itself is corrupt, or the program tried to execute a data segment. A stack overflow can also produce this signal.

5) SIGTRAP
Generated by a breakpoint or other trap instruction. Used by debuggers.

6) SIGABRT
Generated by calling the abort function.

7) SIGBUS
Invalid address, including misaligned access: for example, reading a four-byte integer whose address is not a multiple of four. It differs from SIGSEGV in that SIGSEGV is triggered by an illegal access to a valid address (such as memory that does not belong to the process, or read-only memory).

8) SIGFPE
Sent on a fatal arithmetic error: not only floating-point errors but also overflow, division by zero, and every other arithmetic error.

9) SIGKILL
Ends a program immediately. This signal cannot be blocked, handled, or ignored. If an administrator finds a process that simply will not die, this is the signal to try.

10) SIGUSR1
Reserved for the user.

11) SIGSEGV
An attempt to access memory not allocated to the process, or to write to an address without write permission.

12) SIGUSR2
Reserved for the user.

13) SIGPIPE
Broken pipe. This usually comes up in inter-process communication. For example, with two processes talking over a FIFO (pipe), if the read end is not open or the reader has died unexpectedly and the writer keeps writing, the writer receives SIGPIPE. The same goes for two processes communicating over a socket: the writer gets it when it writes to a socket whose reading process has already terminated.
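
A small sketch (mine, not from the article) of the usual defensive move: ignore SIGPIPE so a write to a broken pipe fails with EPIPE instead of killing the process.

/* Sketch: ignore SIGPIPE so writing to a broken pipe returns EPIPE
 * instead of terminating the process. */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];

    signal(SIGPIPE, SIG_IGN);    /* without this, the write below would kill us */
    pipe(fds);
    close(fds[0]);               /* close the read end: nobody is left to read */

    if (write(fds[1], "x", 1) == -1 && errno == EPIPE)
        fprintf(stderr, "got EPIPE instead of being killed by SIGPIPE\n");
    return 0;
}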

14) SIGALRM
Timer signal, based on real (wall-clock) time. The alarm function uses this signal.

15) SIGTERM
Program termination signal. Unlike SIGKILL, it can be blocked and handled. It is normally used to ask a program to exit cleanly; the shell's kill command sends it by default. Only when a process refuses to terminate do we resort to SIGKILL.

17) SIGCHLD
Sent to the parent when a child process terminates.

If the parent neither handles this signal nor waits for the child, the terminated child still occupies an entry in the kernel process table; such a child is called a zombie. This should be avoided: the parent should either ignore SIGCHLD, or catch it, or wait for the children it spawned, or terminate first so that init takes over and reaps the children.
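
A rough sketch (my addition) of the reap-in-a-handler pattern: loop on waitpid with WNOHANG so every exited child is collected.

/* Sketch: reap exited children in a SIGCHLD handler so no zombies accumulate. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static void on_chld(int sig)
{
    (void)sig;
    while (waitpid(-1, NULL, WNOHANG) > 0)   /* several children may have exited */
        ;
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_chld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    sigaction(SIGCHLD, &sa, NULL);

    if (fork() == 0)
        _exit(0);                            /* child exits immediately */

    sleep(1);                                /* by now the handler has reaped the child */
    printf("no zombie left behind\n");
    return 0;
}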

18) SIGCONT
Resumes a stopped process. This signal cannot be blocked. A handler can be installed so the program does something specific when it goes from stopped back to running, such as redisplaying a prompt.

19) SIGSTOP
Stops a process. Note the difference from terminate and interrupt: the process has not ended, it is only paused. This signal cannot be blocked, handled, or ignored.

20) SIGTSTP
Stops a process, but this signal can be handled and ignored. Sent when the user types the SUSP character (usually Ctrl-Z).

21) SIGTTIN
When a background job tries to read from the user's terminal, every process in that job receives SIGTTIN. By default they stop.

22) SIGTTOU
Like SIGTTIN, but received when writing to the terminal (or changing its modes).

23) SIGURG
Generated when "urgent" or out-of-band data arrives on a socket.

24) SIGXCPU
The CPU time resource limit was exceeded. The limit can be read and changed with getrlimit/setrlimit.

25) SIGXFSZ
Sent when a process tries to grow a file beyond the file-size resource limit.

26) SIGVTALRM
Virtual timer signal. Like SIGALRM, but it counts the CPU time used by the process.

27) SIGPROF
Like SIGALRM/SIGVTALRM, but counts the CPU time used by the process plus time spent in system calls.

28) SIGWINCH
Sent when the window size changes.

29) SIGIO
A file descriptor is ready for input or output.

30) SIGPWR
Power failure.

31) SIGSYS
Bad system call.

Of the signals listed above:
Signals a program cannot catch, block, or ignore: SIGKILL, SIGSTOP.
Signals that cannot be restored to their default action: SIGILL, SIGTRAP.
Signals whose default action is to abort the process (with a core dump): SIGABRT, SIGBUS, SIGFPE, SIGILL, SIGIOT, SIGQUIT, SIGSEGV, SIGTRAP, SIGXCPU, SIGXFSZ.
Signals whose default action is to terminate the process: SIGALRM, SIGHUP, SIGINT, SIGKILL, SIGPIPE, SIGPOLL, SIGPROF, SIGSYS, SIGTERM, SIGUSR1, SIGUSR2, SIGVTALRM.
Signals whose default action is to stop the process: SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU.
Signals ignored by default: SIGCHLD, SIGPWR, SIGURG, SIGWINCH.

In addition, SIGIO terminates the process on SVR4 but is ignored on 4.3BSD; SIGCONT resumes a suspended process, is otherwise ignored, and cannot be blocked.

September 28, 2007

C10K Problems

Filed under: FreeBSD,LAMP,Linux,web — HackGou @ 02:15

The original article is at: http://www.kegel.com/c10k.html

The C10K problem
It’s time for web servers to handle ten thousand clients simultaneously, don’t you think? After all, the web is a big place now.

And computers are big, too. You can buy a 1000MHz machine with 2 gigabytes of RAM and a 1000Mbit/sec Ethernet card for $1200 or so. Let's see – at 20000 clients, that's 50KHz, 100Kbytes, and 50Kbits/sec per client. It shouldn't take any more horsepower than that to take four kilobytes from the disk and send them to the network once a second for each of twenty thousand clients. (That works out to $0.08 per client, by the way. Those $100/client licensing fees some operating systems charge are starting to look a little heavy!) So hardware is no longer the bottleneck.

In 1999 one of the busiest ftp sites, cdrom.com, actually handled 10000 clients simultaneously through a Gigabit Ethernet pipe. As of 2001, that same speed is now being offered by several ISPs, who expect it to become increasingly popular with large business customers.

And the thin client model of computing appears to be coming back in style — this time with the server out on the Internet, serving thousands of clients.

With that in mind, here are a few notes on how to configure operating systems and write code to support thousands of clients. The discussion centers around Unix-like operating systems, as that’s my personal area of interest, but Windows is also covered a bit.

Contents
The C10K problem
Related Sites
Book to Read First
I/O frameworks
I/O Strategies
Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification
The traditional select()
The traditional poll()
/dev/poll (Solaris 2.7+)
kqueue (FreeBSD, NetBSD)
Serve many clients with each thread, and use nonblocking I/O and readiness change notification
epoll (Linux 2.6+)
Polyakov’s kevent (Linux 2.6+)
Drepper’s New Network Interface (proposal for Linux 2.6+)
Realtime Signals (Linux 2.4+)
Signal-per-fd
kqueue (FreeBSD, NetBSD)
Serve many clients with each thread, and use asynchronous I/O and completion notification
Serve one client with each server thread
LinuxThreads (Linux 2.0+)
NGPT (Linux 2.4+)
NPTL (Linux 2.6, Red Hat 9)
FreeBSD threading support
NetBSD threading support
Solaris threading support
Java threading support in JDK 1.3.x and earlier
Note: 1:1 threading vs. M:N threading
Build the server code into the kernel
Comments
Limits on open filehandles
Limits on threads
Java issues [Updated 27 May 2001]
Other tips
Zero-Copy
The sendfile() system call can implement zero-copy networking.
Avoid small frames by using writev (or TCP_CORK)
Some programs can benefit from using non-Posix threads.
Caching your own data can sometimes be a win.
Other limits
Kernel Issues
Measuring Server Performance
Examples
Interesting select()-based servers
Interesting /dev/poll-based servers
Interesting kqueue()-based servers
Interesting realtime signal-based servers
Interesting thread-based servers
Interesting in-kernel servers
Other interesting links

Related Sites
In October 2003, Felix von Leitner put together an excellent web page and presentation about network scalability, complete with benchmarks comparing various networking system calls and operating systems. One of his observations is that the 2.6 Linux kernel really does beat the 2.4 kernel, but there are many, many good graphs that will give the OS developers food for thought for some time. (See also the Slashdot comments; it’ll be interesting to see whether anyone does followup benchmarks improving on Felix’s results.)
Book to Read First
If you haven’t read it already, go out and get a copy of Unix Network Programming : Networking Apis: Sockets and Xti (Volume 1) by the late W. Richard Stevens. It describes many of the I/O strategies and pitfalls related to writing high-performance servers. It even talks about the ‘thundering herd’ problem. And while you’re at it, go read Jeff Darcy’s notes on high-performance server design.

(Another book which might be more helpful for those who are *using* rather than *writing* a web server is Building Scalable Web Sites by Cal Henderson.)

I/O frameworks
Prepackaged libraries are available that abstract some of the techniques presented below, insulating your code from the operating system and making it more portable.

ACE, a heavyweight C++ I/O framework, contains object-oriented implementations of some of these I/O strategies and many other useful things. In particular, his Reactor is an OO way of doing nonblocking I/O, and Proactor is an OO way of doing asynchronous I/O.
ASIO is a C++ I/O framework which is becoming part of the Boost library. It's like ACE updated for the STL era.
libevent is a lightweight C I/O framework by Niels Provos. It supports kqueue and select, and soon will support poll and epoll. It’s level-triggered only, I think, which has both good and bad sides. Niels has a nice graph of time to handle one event as a function of the number of connections. It shows kqueue and sys_epoll as clear winners.
My own attempts at lightweight frameworks (sadly, not kept up to date):
Poller is a lightweight C++ I/O framework that implements a level-triggered readiness API using whatever underlying readiness API you want (poll, select, /dev/poll, kqueue, or sigio). It’s useful for benchmarks that compare the performance of the various APIs. This document links to Poller subclasses below to illustrate how each of the readiness APIs can be used.
rn is a lightweight C I/O framework that was my second try after Poller. It’s lgpl (so it’s easier to use in commercial apps) and C (so it’s easier to use in non-C++ apps). It was used in some commercial products.
Matt Welsh wrote a paper in April 2000 about how to balance the use of worker thread and event-driven techniques when building scalable servers. The paper describes part of his Sandstorm I/O framework.
Cory Nelson’s Scale! library – an async socket, file, and pipe I/O library for Windows
I/O Strategies
Designers of networking software have many options. Here are a few:
Whether and how to issue multiple I/O calls from a single thread
Don’t; use blocking/synchronous calls throughout, and possibly use multiple threads or processes to achieve concurrency
Use nonblocking calls (e.g. write() on a socket set to O_NONBLOCK) to start I/O, and readiness notification (e.g. poll() or /dev/poll) to know when it’s OK to start the next I/O on that channel. Generally only usable with network I/O, not disk I/O.
Use asynchronous calls (e.g. aio_write()) to start I/O, and completion notification (e.g. signals or completion ports) to know when the I/O finishes. Good for both network and disk I/O.
How to control the code servicing each client
one process for each client (classic Unix approach, used since 1980 or so)
one OS-level thread handles many clients; each client is controlled by:
a user-level thread (e.g. GNU state threads, classic Java with green threads)
a state machine (a bit esoteric, but popular in some circles; my favorite)
a continuation (a bit esoteric, but popular in some circles)
one OS-level thread for each client (e.g. classic Java with native threads)
one OS-level thread for each active client (e.g. Tomcat with apache front end; NT completion ports; thread pools)
Whether to use standard O/S services, or put some code into the kernel (e.g. in a custom driver, kernel module, or VxD)
The following five combinations seem to be popular:

Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification
Serve many clients with each thread, and use nonblocking I/O and readiness change notification
Serve many clients with each server thread, and use asynchronous I/O
serve one client with each server thread, and use blocking I/O
Build the server code into the kernel
1. Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification
… set nonblocking mode on all network handles, and use select() or poll() to tell which network handle has data waiting. This is the traditional favorite. With this scheme, the kernel tells you whether a file descriptor is ready, whether or not you’ve done anything with that file descriptor since the last time the kernel told you about it. (The name ‘level triggered’ comes from computer hardware design; it’s the opposite of ‘edge triggered’. Jonathon Lemon introduced the terms in his BSDCON 2000 paper on kqueue().)

Note: it’s particularly important to remember that readiness notification from the kernel is only a hint; the file descriptor might not be ready anymore when you try to read from it. That’s why it’s important to use nonblocking mode when using readiness notification.

An important bottleneck in this method is that read() or sendfile() from disk blocks if the page is not in core at the moment; setting nonblocking mode on a disk file handle has no effect. Same thing goes for memory-mapped disk files. The first time a server needs disk I/O, its process blocks, all clients must wait, and that raw nonthreaded performance goes to waste.
This is what asynchronous I/O is for, but on systems that lack AIO, worker threads or processes that do the disk I/O can also get around this bottleneck. One approach is to use memory-mapped files, and if mincore() indicates I/O is needed, ask a worker to do the I/O, and continue handling network traffic. Jef Poskanzer mentions that Pai, Druschel, and Zwaenepoel’s 1999 Flash web server uses this trick; they gave a talk at Usenix ’99 on it. It looks like mincore() is available in BSD-derived Unixes like FreeBSD and Solaris, but is not part of the Single Unix Specification. It’s available as part of Linux as of kernel 2.3.51, thanks to Chuck Lever.

But in November 2003 on the freebsd-hackers list, Vivek Pai et al reported very good results using system-wide profiling of their Flash web server to attack bottlenecks. One bottleneck they found was mincore (guess that wasn't such a good idea after all). Another was the fact that sendfile blocks on disk access; they improved performance by introducing a modified sendfile() that returns something like EWOULDBLOCK when the disk page it's fetching is not yet in core. (Not sure how you tell the user the page is now resident… seems to me what's really needed here is aio_sendfile().) The end result of their optimizations is a SpecWeb99 score of about 800 on a 1GHz/1GB FreeBSD box, which is better than anything on file at spec.org.

There are several ways for a single thread to tell which of a set of nonblocking sockets are ready for I/O:

The traditional select()
Unfortunately, select() is limited to FD_SETSIZE handles. This limit is compiled in to the standard library and user programs. (Some versions of the C library let you raise this limit at user app compile time.)
See Poller_select (cc, h) for an example of how to use select() interchangeably with other readiness notification schemes.
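
Roughly, a level-triggered select() loop looks like the sketch below (my illustration, not Kegel's; error handling and per-client bookkeeping are omitted, and port 8080 is arbitrary):

/* Level-triggered select() sketch: accept on a nonblocking listening socket. */
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(lfd, (struct sockaddr *)&addr, sizeof addr);
    listen(lfd, 128);
    fcntl(lfd, F_SETFL, O_NONBLOCK);   /* readiness is only a hint, so stay nonblocking */

    for (;;) {
        fd_set rset;
        FD_ZERO(&rset);
        FD_SET(lfd, &rset);
        /* a real server would also FD_SET every connected client here */
        if (select(lfd + 1, &rset, NULL, NULL, NULL) < 0)
            continue;                  /* EINTR etc. */
        if (FD_ISSET(lfd, &rset)) {
            int cfd = accept(lfd, NULL, NULL);   /* may still fail with EWOULDBLOCK */
            if (cfd >= 0) {
                fcntl(cfd, F_SETFL, O_NONBLOCK);
                /* ...add cfd to the watched set and service it on later passes... */
                close(cfd);
            }
        }
    }
}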

The traditional poll()
There is no hardcoded limit to the number of file descriptors poll() can handle, but it does get slow above a few thousand, since most of the file descriptors are idle at any one time, and scanning through thousands of file descriptors takes time.
Some OS’s (e.g. Solaris 8) speed up poll() et al by use of techniques like poll hinting, which was implemented and benchmarked by Niels Provos for Linux in 1999.

See Poller_poll (cc, h, benchmarks) for an example of how to use poll() interchangeably with other readiness notification schemes.
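
For comparison, a poll() sketch of the same idea (again my illustration; stdin stands in for a real socket, and error checking is omitted):

/* poll() sketch: no FD_SETSIZE ceiling, but every descriptor is still scanned per call. */
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct pollfd pfds[1024];          /* arbitrary example capacity */
    int nfds = 1;

    pfds[0].fd = 0;                    /* watch stdin for readability */
    pfds[0].events = POLLIN;

    for (;;) {
        int ready = poll(pfds, nfds, 1000 /* ms */);
        if (ready <= 0)
            continue;                  /* timeout or EINTR */
        for (int i = 0; i < nfds; i++) {
            if (pfds[i].revents & POLLIN) {
                char buf[512];
                ssize_t n = read(pfds[i].fd, buf, sizeof buf);
                printf("fd %d: read %zd bytes\n", pfds[i].fd, n);
            }
        }
    }
}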

/dev/poll
This is the recommended poll replacement for Solaris.
The idea behind /dev/poll is to take advantage of the fact that often poll() is called many times with the same arguments. With /dev/poll, you get an open handle to /dev/poll, and tell the OS just once what files you’re interested in by writing to that handle; from then on, you just read the set of currently ready file descriptors from that handle.

It appeared quietly in Solaris 7 (see patchid 106541) but its first public appearance was in Solaris 8; according to Sun, at 750 clients, this has 10% of the overhead of poll().

Various implementations of /dev/poll were tried on Linux, but none of them perform as well as epoll, and were never really completed. /dev/poll use on Linux is not recommended.

See Poller_devpoll (cc, h, benchmarks) for an example of how to use /dev/poll interchangeably with many other readiness notification schemes. (Caution – the example is for Linux /dev/poll, might not work right on Solaris.)

kqueue()
This is the recommended poll replacement for FreeBSD (and, soon, NetBSD).
See below. kqueue() can specify either edge triggering or level triggering.

2. Serve many clients with each thread, and use nonblocking I/O and readiness change notification
Readiness change notification (or edge-triggered readiness notification) means you give the kernel a file descriptor, and later, when that descriptor transitions from not ready to ready, the kernel notifies you somehow. It then assumes you know the file descriptor is ready, and will not send any more readiness notifications of that type for that file descriptor until you do something that causes the file descriptor to no longer be ready (e.g. until you receive the EWOULDBLOCK error on a send, recv, or accept call, or a send or recv transfers less than the requested number of bytes).
When you use readiness change notification, you must be prepared for spurious events, since one common implementation is to signal readiness whenever any packets are received, regardless of whether the file descriptor was already ready.

This is the opposite of “level-triggered” readiness notification. It’s a bit less forgiving of programming mistakes, since if you miss just one event, the connection that event was for gets stuck forever. Nevertheless, I have found that edge-triggered readiness notification made programming nonblocking clients with OpenSSL easier, so it’s worth trying.

[Banga, Mogul, Drusha ’99] described this kind of scheme in 1999.

There are several APIs which let the application retrieve ‘file descriptor became ready’ notifications:

kqueue() This is the recommended edge-triggered poll replacement for FreeBSD (and, soon, NetBSD).
FreeBSD 4.3 and later, and NetBSD-current as of Oct 2002, support a generalized alternative to poll() called kqueue()/kevent(); it supports both edge-triggering and level-triggering. (See also Jonathan Lemon’s page and his BSDCon 2000 paper on kqueue().)

Like /dev/poll, you allocate a listening object, but rather than opening the file /dev/poll, you call kqueue() to allocate one. To change the events you are listening for, or to get the list of current events, you call kevent() on the descriptor returned by kqueue(). It can listen not just for socket readiness, but also for plain file readiness, signals, and even for I/O completion.

Note: as of October 2000, the threading library on FreeBSD does not interact well with kqueue(); evidently, when kqueue() blocks, the entire process blocks, not just the calling thread.

See Poller_kqueue (cc, h, benchmarks) for an example of how to use kqueue() interchangeably with many other readiness notification schemes.
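
For flavor, a bare-bones kqueue()/kevent() sketch (my illustration only; it just watches one descriptor for readability):

/* kqueue()/kevent() sketch: register one descriptor and wait for it to become readable. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    int kq = kqueue();
    struct kevent change, events[64];

    /* watch fd 0 (stdin); a server would register its sockets instead */
    EV_SET(&change, 0, EVFILT_READ, EV_ADD, 0, 0, NULL);
    kevent(kq, &change, 1, NULL, 0, NULL);            /* apply the change, fetch nothing */

    int n = kevent(kq, NULL, 0, events, 64, NULL);    /* block until something is ready */
    for (int i = 0; i < n; i++)
        printf("fd %d readable, %ld bytes pending\n",
               (int)events[i].ident, (long)events[i].data);

    close(kq);
    return 0;
}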

Examples and libraries using kqueue():

PyKQueue — a Python binding for kqueue()
Ronald F. Guilmette’s example echo server; see also his 28 Sept 2000 post on freebsd.questions.

epoll
This is the recommended edge-triggered poll replacement for the 2.6 Linux kernel.
On 11 July 2001, Davide Libenzi proposed an alternative to realtime signals; his patch provides what he now calls /dev/epoll www.xmailserver.org/linux-patches/nio-improve.html. This is just like the realtime signal readiness notification, but it coalesces redundant events, and has a more efficient scheme for bulk event retrieval.

Epoll was merged into the 2.5 kernel tree as of 2.5.46 after its interface was changed from a special file in /dev to a system call, sys_epoll. A patch for the older version of epoll is available for the 2.4 kernel.

There was a lengthy debate about unifying epoll, aio, and other event sources on the linux-kernel mailing list around Halloween 2002. It may yet happen, but Davide is concentrating on firming up epoll in general first.
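
An illustrative edge-triggered epoll sketch (mine; Linux 2.6+ only), showing the drain-until-EAGAIN discipline that edge triggering requires:

/* Edge-triggered epoll sketch. With EPOLLET you must drain each descriptor
 * until EAGAIN, or you will never be notified about it again. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void)
{
    int epfd = epoll_create(1);                /* the size argument is only a hint */
    struct epoll_event ev, events[64];

    fcntl(0, F_SETFL, O_NONBLOCK);             /* edge triggering needs nonblocking fds */
    ev.events = EPOLLIN | EPOLLET;             /* EPOLLET = edge-triggered */
    ev.data.fd = 0;                            /* stdin stands in for a socket */
    epoll_ctl(epfd, EPOLL_CTL_ADD, 0, &ev);

    for (;;) {
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; i++) {
            char buf[4096];
            ssize_t r;
            while ((r = read(events[i].data.fd, buf, sizeof buf)) > 0)
                ;                              /* drain everything that is there */
            if (r == -1 && errno != EAGAIN)
                perror("read");
        }
    }
}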

Polyakov’s kevent (Linux 2.6+) News flash: On 9 Feb 2006, and again on 9 July 2006, Evgeniy Polyakov posted patches which seem to unify epoll and aio; his goal is to support network AIO. See:
the LWN article about kevent
his July announcement
his kevent page
his naio page
some recent discussion

Drepper’s New Network Interface (proposal for Linux 2.6+)
At OLS 2006, Ulrich Drepper proposed a new high-speed asynchronous networking API. See:
his paper, “The Need for Asynchronous, Zero-Copy Network I/O”
his slides
LWN article from July 22

Realtime Signals
This is the recommended edge-triggered poll replacement for the 2.4 Linux kernel.
The 2.4 linux kernel can deliver socket readiness events via a particular realtime signal. Here’s how to turn this behavior on:

/* Mask off SIGIO and the signal you want to use. */
sigemptyset(&sigset);
sigaddset(&sigset, signum);
sigaddset(&sigset, SIGIO);
sigprocmask(SIG_BLOCK, &sigset, NULL);
/* For each file descriptor, invoke F_SETOWN, F_SETSIG, and set O_ASYNC. */
fcntl(fd, F_SETOWN, (int) getpid());
fcntl(fd, F_SETSIG, signum);
flags = fcntl(fd, F_GETFL);
flags |= O_NONBLOCK|O_ASYNC;
fcntl(fd, F_SETFL, flags);

This sends that signal when a normal I/O function like read() or write() completes. To use this, write a normal poll() outer loop, and inside it, after you’ve handled all the fd’s noticed by poll(), you loop calling sigwaitinfo().
If sigwaitinfo or sigtimedwait returns your realtime signal, siginfo.si_fd and siginfo.si_band give almost the same information as pollfd.fd and pollfd.revents would after a call to poll(), so you handle the i/o, and continue calling sigwaitinfo().
If sigwaitinfo returns a traditional SIGIO, the signal queue overflowed, so you flush the signal queue by temporarily changing the signal handler to SIG_DFL, and break back to the outer poll() loop.

See Poller_sigio (cc, h) for an example of how to use rtsignals interchangeably with many other readiness notification schemes.

See Zach Brown’s phhttpd for example code that uses this feature directly. (Or don’t; phhttpd is a bit hard to figure out…)

[Provos, Lever, and Tweedie 2000] describes a recent benchmark of phhttpd using a variant of sigtimedwait(), sigtimedwait4(), that lets you retrieve multiple signals with one call. Interestingly, the chief benefit of sigtimedwait4() for them seemed to be it allowed the app to gauge system overload (so it could behave appropriately). (Note that poll() provides the same measure of system overload.)

Signal-per-fd
Chandra and Mosberger proposed a modification to the realtime signal approach called “signal-per-fd” which reduces or eliminates realtime signal queue overflow by coalescing redundant events. It doesn’t outperform epoll, though. Their paper ( www.hpl.hp.com/techreports/2000/HPL-2000-174.html) compares performance of this scheme with select() and /dev/poll.

Vitaly Luban announced a patch implementing this scheme on 18 May 2001; his patch lives at www.luban.org/GPL/gpl.html. (Note: as of Sept 2001, there may still be stability problems with this patch under heavy load. dkftpbench at about 4500 users may be able to trigger an oops.)

See Poller_sigfd (cc, h) for an example of how to use signal-per-fd interchangeably with many other readiness notification schemes.

3. Serve many clients with each server thread, and use asynchronous I/O
This has not yet become popular in Unix, probably because few operating systems support asynchronous I/O, also possibly because it (like nonblocking I/O) requires rethinking your application. Under standard Unix, asynchronous I/O is provided by the aio_ interface (scroll down from that link to “Asynchronous input and output”), which associates a signal and value with each I/O operation. Signals and their values are queued and delivered efficiently to the user process. This is from the POSIX 1003.1b realtime extensions, and is also in the Single Unix Specification, version 2.

AIO is normally used with edge-triggered completion notification, i.e. a signal is queued when the operation is complete. (It can also be used with level triggered completion notification by calling aio_suspend(), though I suspect few people do this.)
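
A small illustrative sketch (mine) of the aio_ interface using aio_suspend() for the level-triggered style just mentioned (link with -lrt on older glibc; /etc/hosts is just an arbitrary readable file):

/* POSIX aio sketch: queue a read, then wait for it with aio_suspend(). */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char buf[4096];
    struct aiocb cb;
    const struct aiocb *list[1] = { &cb };

    int fd = open("/etc/hosts", O_RDONLY);
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_offset = 0;

    aio_read(&cb);                       /* queue the read and return immediately */
    aio_suspend(list, 1, NULL);          /* block here until it completes */

    ssize_t n = aio_return(&cb);         /* a real program would check aio_error() first */
    printf("read %zd bytes asynchronously\n", n);
    close(fd);
    return 0;
}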

glibc 2.1 and later provide a generic implementation written for standards compliance rather than performance.

Ben LaHaise’s implementation for Linux AIO was merged into the main Linux kernel as of 2.5.32. It doesn’t use kernel threads, and has a very efficient underlying api, but (as of 2.6.0-test2) doesn’t yet support sockets. (There is also an AIO patch for the 2.4 kernels, but the 2.5/2.6 implementation is somewhat different.) More info:

The page “Kernel Asynchronous I/O (AIO) Support for Linux” which tries to tie together all info about the 2.6 kernel’s implementation of AIO (posted 16 Sept 2003)
Round 3: aio vs /dev/epoll by Benjamin C.R. LaHaise (presented at 2002 OLS)
Asynchronous I/O Support in Linux 2.5, by Bhattacharya, Pratt, Pulaverty, and Morgan, IBM; presented at OLS ‘2003
Design Notes on Asynchronous I/O (aio) for Linux by Suparna Bhattacharya — compares Ben’s AIO with SGI’s KAIO and a few other AIO projects
Linux AIO home page – Ben’s preliminary patches, mailing list, etc.
linux-aio mailing list archives
libaio-oracle – library implementing standard Posix AIO on top of libaio. First mentioned by Joel Becker on 18 Apr 2003.
Suparna also suggests having a look at the DAFS API’s approach to AIO.
Red Hat AS and Suse SLES both provide a high-performance implementation on the 2.4 kernel; it is related to, but not completely identical to, the 2.6 kernel implementation.

In February 2006, a new attempt is being made to provide network AIO; see the note above about Evgeniy Polyakov’s kevent-based AIO.

In 1999, SGI implemented high-speed AIO for Linux. As of version 1.1, it’s said to work well with both disk I/O and sockets. It seems to use kernel threads. It is still useful for people who can’t wait for Ben’s AIO to support sockets.

The O’Reilly book POSIX.4: Programming for the Real World is said to include a good introduction to aio.

A tutorial for the earlier, nonstandard, aio implementation on Solaris is online at Sunsite. It’s probably worth a look, but keep in mind you’ll need to mentally convert “aioread” to “aio_read”, etc.

Note that AIO doesn’t provide a way to open files without blocking for disk I/O; if you care about the sleep caused by opening a disk file, Linus suggests you should simply do the open() in a different thread rather than wishing for an aio_open() system call.

Under Windows, asynchronous I/O is associated with the terms “Overlapped I/O” and IOCP or “I/O Completion Port”. Microsoft’s IOCP combines techniques from the prior art like asynchronous I/O (like aio_write) and queued completion notification (like when using the aio_sigevent field with aio_write) with a new idea of holding back some requests to try to keep the number of running threads associated with a single IOCP constant. For more information, see Inside I/O Completion Ports by Mark Russinovich at sysinternals.com, Jeffrey Richter’s book “Programming Server-Side Applications for Microsoft Windows 2000” (Amazon, MSPress), U.S. patent #06223207, or MSDN.

4. Serve one client with each server thread
… and let read() and write() block. Has the disadvantage of using a whole stack frame for each client, which costs memory. Many OS’s also have trouble handling more than a few hundred threads. If each thread gets a 2MB stack (not an uncommon default value), you run out of *virtual memory* at (2^30 / 2^21) = 512 threads on a 32 bit machine with 1GB user-accessible VM (like, say, Linux as normally shipped on x86). You can work around this by giving each thread a smaller stack, but since most thread libraries don’t allow growing thread stacks once created, doing this means designing your program to minimize stack use. You can also work around this by moving to a 64 bit processor.

The thread support in Linux, FreeBSD, and Solaris is improving, and 64 bit processors are just around the corner even for mainstream users. Perhaps in the not-too-distant future, those who prefer using one thread per client will be able to use that paradigm even for 10000 clients. Nevertheless, at the current time, if you actually want to support that many clients, you’re probably better off using some other paradigm.

For an unabashedly pro-thread viewpoint, see Why Events Are A Bad Idea (for High-concurrency Servers) by von Behren, Condit, and Brewer, UCB, presented at HotOS IX. Anyone from the anti-thread camp care to point out a paper that rebuts this one? :-)

LinuxThreads
LinuxThreads is the name for the standard Linux thread library. It is integrated into glibc since glibc2.0, and is mostly Posix-compliant, but with less than stellar performance and signal support.
NGPT: Next Generation Posix Threads for Linux
NGPT is a project started by IBM to bring good Posix-compliant thread support to Linux. It’s at stable version 2.2 now, and works well… but the NGPT team has announced that they are putting the NGPT codebase into support-only mode because they feel it’s “the best way to support the community for the long term”. The NGPT team will continue working to improve Linux thread support, but now focused on improving NPTL. (Kudos to the NGPT team for their good work and the graceful way they conceded to NPTL.)
NPTL: Native Posix Thread Library for Linux
NPTL is a project by Ulrich Drepper (the benevolent dict^H^H^H^Hmaintainer of glibc) and Ingo Molnar to bring world-class Posix threading support to Linux.
As of 5 October 2003, NPTL is now merged into the glibc cvs tree as an add-on directory (just like linuxthreads), so it will almost certainly be released along with the next release of glibc.

The first major distribution to include an early snapshot of NPTL was Red Hat 9. (This was a bit inconvenient for some users, but somebody had to break the ice…)

NPTL links:

Mailing list for NPTL discussion
NPTL source code
Initial announcement for NPTL
Original whitepaper describing the goals for NPTL
Revised whitepaper describing the final design of NPTL
Ingo Molnar’s first benchmark showing it could handle 10^6 threads
Ulrich’s benchmark comparing performance of LinuxThreads, NPTL, and IBM’s NGPT. It seems to show NPTL is much faster than NGPT.
Here’s my try at describing the history of NPTL (see also Jerry Cooperstein’s article):
In March 2002, Bill Abt of the NGPT team, the glibc maintainer Ulrich Drepper, and others met to figure out what to do about LinuxThreads. One idea that came out of the meeting was to improve mutex performance; Rusty Russell et al subsequently implemented fast userspace mutexes (futexes), which are now used by both NGPT and NPTL. Most of the attendees figured NGPT should be merged into glibc.

Ulrich Drepper, though, didn’t like NGPT, and figured he could do better. (For those who have ever tried to contribute a patch to glibc, this may not come as a big surprise :-) Over the next few months, Ulrich Drepper, Ingo Molnar, and others contributed glibc and kernel changes that make up something called the Native Posix Threads Library (NPTL). NPTL uses all the kernel enhancements designed for NGPT, and takes advantage of a few new ones. Ingo Molnar described the kernel enhancements as follows:

While NPTL uses the three kernel features introduced by NGPT: getpid() returns PID, CLONE_THREAD and futexes; NPTL also uses (and relies on) a much wider set of new kernel features, developed as part of this project.
Some of the items NGPT introduced into the kernel around 2.5.8 got modified, cleaned up and extended, such as thread group handling (CLONE_THREAD). [the CLONE_THREAD changes which impacted NGPT’s compatibility got synced with the NGPT folks, to make sure NGPT does not break in any unacceptable way.]

The kernel features developed for and used by NPTL are described in the design whitepaper, http://people.redhat.com/drepper/nptl-design.pdf …

A short list: TLS support, various clone extensions (CLONE_SETTLS, CLONE_SETTID, CLONE_CLEARTID), POSIX thread-signal handling, sys_exit() extension (release TID futex upon VM-release), the sys_exit_group() system-call, sys_execve() enhancements and support for detached threads.

There was also work put into extending the PID space – eg. procfs crashed due to 64K PID assumptions, max_pid, and pid allocation scalability work. Plus a number of performance-only improvements were done as well.

In essence the new features are a no-compromises approach to 1:1 threading – the kernel now helps in everything where it can improve threading, and we precisely do the minimally necessary set of context switches and kernel calls for every basic threading primitive.

One big difference between the two is that NPTL is a 1:1 threading model, whereas NGPT is an M:N threading model (see below). In spite of this, Ulrich’s initial benchmarks seem to show that NPTL is indeed much faster than NGPT. (The NGPT team is looking forward to seeing Ulrich’s benchmark code to verify the result.)
FreeBSD threading support
FreeBSD supports both LinuxThreads and a userspace threading library. Also, an M:N implementation called KSE was introduced in FreeBSD 5.0. For one overview, see www.unobvious.com/bsd/freebsd-threads.html.
On 25 Mar 2003, Jeff Roberson posted on freebsd-arch:

… Thanks to the foundation provided by Julian, David Xu, Mini, Dan Eischen, and everyone else who has participated with KSE and libpthread development Mini and I have developed a 1:1 threading implementation. This code works in parallel with KSE and does not break it in any way. It actually helps bring M:N threading closer by testing out shared bits. …
And in July 2006, Robert Watson proposed that the 1:1 threading implementation become the default in FreeBSD 7.x:
I know this has been discussed in the past, but I figured with 7.x trundling forward, it was time to think about it again. In benchmarks for many common applications and scenarios, libthr demonstrates significantly better performance over libpthread… libthr is also implemented across a larger number of our platforms, and is already libpthread on several. The first recommendation we make to MySQL and other heavy thread users is “Switch to libthr”, which is suggestive, also! … So the strawman proposal is: make libthr the default threading library on 7.x.
NetBSD threading support
According to a note from Noriyuki Soda:
Kernel supported M:N thread library based on the Scheduler Activations model is merged into NetBSD-current on Jan 18 2003.
For details, see An Implementation of Scheduler Activations on the NetBSD Operating System by Nathan J. Williams, Wasabi Systems, Inc., presented at FREENIX ’02.
Solaris threading support
The thread support in Solaris is evolving… from Solaris 2 to Solaris 8, the default threading library used an M:N model, but Solaris 9 defaults to 1:1 model thread support. See Sun’s multithreaded programming guide and Sun’s note about Java and Solaris threading.
Java threading support in JDK 1.3.x and earlier
As is well known, Java up to JDK1.3.x did not support any method of handling network connections other than one thread per client. Volanomark is a good microbenchmark which measures throughput in messages per second at various numbers of simultaneous connections. As of May 2003, JDK 1.3 implementations from various vendors are in fact able to handle ten thousand simultaneous connections — albeit with significant performance degradation. See Table 4 for an idea of which JVMs can handle 10000 connections, and how performance suffers as the number of connections increases.
Note: 1:1 threading vs. M:N threading
There is a choice when implementing a threading library: you can either put all the threading support in the kernel (this is called the 1:1 threading model), or you can move a fair bit of it into userspace (this is called the M:N threading model). At one point, M:N was thought to be higher performance, but it’s so complex that it’s hard to get right, and most people are moving away from it.
Why Ingo Molnar prefers 1:1 over M:N
Sun is moving to 1:1 threads
NGPT is an M:N threading library for Linux.
Although Ulrich Drepper planned to use M:N threads in the new glibc threading library, he has since switched to the 1:1 threading model.
MacOSX appears to use 1:1 threading.
FreeBSD and NetBSD appear to still believe in M:N threading… The lone holdouts? Looks like freebsd 7.0 might switch to 1:1 threading (see above), so perhaps M:N threading’s believers have finally been proven wrong everywhere.
5. Build the server code into the kernel
Novell and Microsoft are both said to have done this at various times, at least one NFS implementation does this, khttpd does this for Linux and static web pages, and “TUX” (Threaded linUX webserver) is a blindingly fast and flexible kernel-space HTTP server by Ingo Molnar for Linux. Ingo’s September 1, 2000 announcement says an alpha version of TUX can be downloaded from ftp://ftp.redhat.com/pub/redhat/tux, and explains how to join a mailing list for more info.
The linux-kernel list has been discussing the pros and cons of this approach, and the consensus seems to be instead of moving web servers into the kernel, the kernel should have the smallest possible hooks added to improve web server performance. That way, other kinds of servers can benefit. See e.g. Zach Brown’s remarks about userland vs. kernel http servers. It appears that the 2.4 linux kernel provides sufficient power to user programs, as the X15 server runs about as fast as Tux, but doesn’t use any kernel modifications.

Comments
Richard Gooch has written a paper discussing I/O options.

In 2001, Tim Brecht and Michal Ostrowski measured various strategies for simple select-based servers. Their data is worth a look.

In 2003, Tim Brecht posted source code for userver, a small web server put together from several servers written by Abhishek Chandra, David Mosberger, David Pariag, and Michal Ostrowski. It can use select(), poll(), epoll(), or sigio.

Back in March 1999, Dean Gaudet posted:

I keep getting asked “why don’t you guys use a select/event based model like Zeus? It’s clearly the fastest.” …
His reasons boiled down to “it’s really hard, and the payoff isn’t clear”. Within a few months, though, it became clear that people were willing to work on it.
Mark Russinovich wrote an editorial and an article discussing I/O strategy issues in the 2.2 Linux kernel. Worth reading, even if he seems misinformed on some points. In particular, he seems to think that Linux 2.2’s asynchronous I/O (see F_SETSIG above) doesn’t notify the user process when data is ready, only when new connections arrive. This seems like a bizarre misunderstanding. See also comments on an earlier draft, Ingo Molnar’s rebuttal of 30 April 1999, Russinovich’s comments of 2 May 1999, a rebuttal from Alan Cox, and various posts to linux-kernel. I suspect he was trying to say that Linux doesn’t support asynchronous disk I/O, which used to be true, but now that SGI has implemented KAIO, it’s not so true anymore.

See these pages at sysinternals.com and MSDN for information on “completion ports”, which he said were unique to NT; in a nutshell, win32’s “overlapped I/O” turned out to be too low level to be convenient, and a “completion port” is a wrapper that provides a queue of completion events, plus scheduling magic that tries to keep the number of running threads constant by allowing more threads to pick up completion events if other threads that had picked up completion events from this port are sleeping (perhaps doing blocking I/O).

See also OS/400’s support for I/O completion ports.

There was an interesting discussion on linux-kernel in September 1999 titled “> 15,000 Simultaneous Connections” (and the second week of the thread). Highlights:

Ed Hall posted a few notes on his experiences; he’s achieved >1000 connects/second on a UP P2/333 running Solaris. His code used a small pool of threads (1 or 2 per CPU) each managing a large number of clients using “an event-based model”.
Mike Jagdis posted an analysis of poll/select overhead, and said “The current select/poll implementation can be improved significantly, especially in the blocking case, but the overhead will still increase with the number of descriptors because select/poll does not, and cannot, remember what descriptors are interesting. This would be easy to fix with a new API. Suggestions are welcome…”
Mike posted about his work on improving select() and poll().
Mike posted a bit about a possible API to replace poll()/select(): “How about a ‘device like’ API where you write ‘pollfd like’ structs, the ‘device’ listens for events and delivers ‘pollfd like’ structs representing them when you read it? … ”
Rogier Wolff suggested using “the API that the digital guys suggested”, http://www.cs.rice.edu/~gaurav/papers/usenix99.ps
Joerg Pommnitz pointed out that any new API along these lines should be able to wait for not just file descriptor events, but also signals and maybe SYSV-IPC. Our synchronization primitives should certainly be able to do what Win32’s WaitForMultipleObjects can, at least.
Stephen Tweedie asserted that the combination of F_SETSIG, queued realtime signals, and sigwaitinfo() was a superset of the API proposed in http://www.cs.rice.edu/~gaurav/papers/usenix99.ps. He also mentions that you keep the signal blocked at all times if you’re interested in performance; instead of the signal being delivered asynchronously, the process grabs the next one from the queue with sigwaitinfo().
Jayson Nordwick compared completion ports with the F_SETSIG synchronous event model, and concluded they’re pretty similar.
Alan Cox noted that an older rev of SCT’s SIGIO patch is included in 2.3.18ac.
Jordan Mendelson posted some example code showing how to use F_SETSIG.
Stephen C. Tweedie continued the comparison of completion ports and F_SETSIG, and noted: “With a signal dequeuing mechanism, your application is going to get signals destined for various library components if libraries are using the same mechanism,” but the library can set up its own signal handler, so this shouldn’t affect the program (much).
Doug Royer noted that he’d gotten 100,000 connections on Solaris 2.6 while he was working on the Sun calendar server. Others chimed in with estimates of how much RAM that would require on Linux, and what bottlenecks would be hit.
Interesting reading!

Limits on open filehandles
Any Unix: the limits set by ulimit or setrlimit (a short setrlimit sketch appears at the end of this section).
Solaris: see the Solaris FAQ, question 3.46 (or thereabouts; they renumber the questions periodically).
FreeBSD:

Edit /boot/loader.conf, add the line
set kern.maxfiles=XXXX
where XXXX is the desired system limit on file descriptors, and reboot. Thanks to an anonymous reader, who wrote in to say he’d achieved far more than 10000 connections on FreeBSD 4.3, and says
“FWIW: You can’t actually tune the maximum number of connections in FreeBSD trivially, via sysctl…. You have to do it in the /boot/loader.conf file.
The reason for this is that the zalloci() calls for initializing the sockets and tcpcb structures zones occurs very early in system startup, in order that the zone be both type stable and that it be swappable.
You will also need to set the number of mbufs much higher, since you will (on an unmodified kernel) chew up one mbuf per connection for tcptempl structures, which are used to implement keepalive.”
Another reader says
“As of FreeBSD 4.4, the tcptempl structure is no longer allocated; you no longer have to worry about one mbuf being chewed up per connection.”
See also:
the FreeBSD handbook
SYSCTL TUNING, LOADER TUNABLES, and KERNEL CONFIG TUNING in ‘man tuning’
The Effects of Tuning a FreeBSD 4.3 Box for High Performance, Daemon News, Aug 2001
postfix.org tuning notes, covering FreeBSD 4.2 and 4.4
the Measurement Factory’s notes, circa FreeBSD 4.3
OpenBSD: A reader says
“In OpenBSD, an additional tweak is required to increase the number of open filehandles available per process: the openfiles-cur parameter in /etc/login.conf needs to be increased. You can change kern.maxfiles either with sysctl -w or in sysctl.conf but it has no effect. This matters because as shipped, the login.conf limits are a quite low 64 for nonprivileged processes, 128 for privileged.”
Linux: See Bodo Bauer’s /proc documentation. On 2.4 kernels:
echo 32768 > /proc/sys/fs/file-max

increases the system limit on open files, and
ulimit -n 32768
increases the current process’ limit.
On 2.2.x kernels,

echo 32768 > /proc/sys/fs/file-max
echo 65536 > /proc/sys/fs/inode-max

increases the system limit on open files, and
ulimit -n 32768
increases the current process’ limit.
I verified that a process on Red Hat 6.0 (2.2.5 or so plus patches) can open at least 31000 file descriptors this way. Another fellow has verified that a process on 2.2.12 can open at least 90000 file descriptors this way (with appropriate limits). The upper bound seems to be available memory.
Stephen C. Tweedie posted about how to set ulimit limits globally or per-user at boot time using initscript and pam_limit.
In older 2.2 kernels, though, the number of open files per process is still limited to 1024, even with the above changes.
See also Oskar’s 1998 post, which talks about the per-process and system-wide limits on file descriptors in the 2.0.36 kernel.
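
To make the generic ulimit/setrlimit route above concrete, here is a short sketch (mine) of a process raising its own descriptor limit up to its hard limit:

/* Raise this process's file-descriptor limit as far as its hard limit allows. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    getrlimit(RLIMIT_NOFILE, &rl);
    printf("soft limit %llu, hard limit %llu\n",
           (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

    rl.rlim_cur = rl.rlim_max;           /* going past the hard limit needs root */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        perror("setrlimit");
    return 0;
}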

Limits on threads
On any architecture, you may need to reduce the amount of stack space allocated for each thread to avoid running out of virtual memory. You can set this at runtime with pthread_attr_init() if you’re using pthreads.
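For instance, a sketch (mine) of shrinking per-thread stacks with the pthread_attr_ calls (compile with -pthread; 64 KB and client_worker are arbitrary placeholders):

/* Create a worker thread with a small (64 KB) stack instead of the multi-MB default. */
#include <pthread.h>
#include <stdio.h>

static void *client_worker(void *arg)
{
    (void)arg;                           /* a real server would service one client here */
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_t tid;

    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 64 * 1024);

    int rc = pthread_create(&tid, &attr, client_worker, NULL);
    if (rc != 0)
        fprintf(stderr, "pthread_create failed: %d\n", rc);
    else
        pthread_join(tid, NULL);

    pthread_attr_destroy(&attr);
    return 0;
}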

Solaris: it supports as many threads as will fit in memory, I hear.
Linux 2.6 kernels with NPTL: /proc/sys/vm/max_map_count may need to be increased to go above 32000 or so threads. (You’ll need to use very small stack threads to get anywhere near that number of threads, though, unless you’re on a 64 bit processor.) See the NPTL mailing list, e.g. the thread with subject “Cannot create more than 32K threads?”, for more info.
Linux 2.4: /proc/sys/kernel/threads-max is the max number of threads; it defaults to 2047 on my Red Hat 8 system. You can increase this as usual by echoing new values into that file, e.g. “echo 4000 > /proc/sys/kernel/threads-max”
Linux 2.2: Even the 2.2.13 kernel limits the number of threads, at least on Intel. I don’t know what the limits are on other architectures. Mingo posted a patch for 2.1.131 on Intel that removed this limit. It appears to be integrated into 2.3.20.
See also Volano’s detailed instructions for raising file, thread, and FD_SET limits in the 2.2 kernel. Wow. This document steps you through a lot of stuff that would be hard to figure out yourself, but is somewhat dated.

Java: See Volano’s detailed benchmark info, plus their info on how to tune various systems to handle lots of threads.
Java issues
Up through JDK 1.3, Java’s standard networking libraries mostly offered the one-thread-per-client model. There was a way to do nonblocking reads, but no way to do nonblocking writes.

In May 2001, JDK 1.4 introduced the package java.nio to provide full support for nonblocking I/O (and some other goodies). See the release notes for some caveats. Try it out and give Sun feedback!

HP’s java also includes a Thread Polling API.

In 2000, Matt Welsh implemented nonblocking sockets for Java; his performance benchmarks show that they have advantages over blocking sockets in servers handling many (up to 10000) connections. His class library is called java-nbio; it’s part of the Sandstorm project. Benchmarks showing performance with 10000 connections are available.

See also Dean Gaudet’s essay on the subject of Java, network I/O, and threads, and the paper by Matt Welsh on events vs. worker threads.

Before NIO, there were several proposals for improving Java’s networking APIs:

Matt Welsh’s Jaguar system proposes preserialized objects, new Java bytecodes, and memory management changes to allow the use of asynchronous I/O with Java.
Interfacing Java to the Virtual Interface Architecture, by C-C. Chang and T. von Eicken, proposes memory management changes to allow the use of asynchronous I/O with Java.
JSR-51 was the Sun project that came up with the java.nio package. Matt Welsh participated (who says Sun doesn’t listen?).
Other tips
Zero-Copy
Normally, data gets copied many times on its way from here to there. Any scheme that eliminates these copies to the bare physical minimum is called “zero-copy”.
Thomas Ogrisegg’s zero-copy send patch for mmaped files under Linux 2.4.17-2.4.20. Claims it’s faster than sendfile().
IO-Lite is a proposal for a set of I/O primitives that gets rid of the need for many copies.
Alan Cox noted that zero-copy is sometimes not worth the trouble back in 1999. (He did like sendfile(), though.)
Ingo implemented a form of zero-copy TCP in the 2.4 kernel for TUX 1.0 in July 2000, and says he’ll make it available to userspace soon.
Drew Gallatin and Robert Picco have added some zero-copy features to FreeBSD; the idea seems to be that if you call write() or read() on a socket, the pointer is page-aligned, and the amount of data transferred is at least a page, *and* you don’t immediately reuse the buffer, memory management tricks will be used to avoid copies. But see followups to this message on linux-kernel for people’s misgivings about the speed of those memory management tricks.
According to a note from Noriyuki Soda:

Sending side zero-copy is supported since NetBSD-1.6 release by specifying “SOSEND_LOAN” kernel option. This option is now default on NetBSD-current (you can disable this feature by specifying “SOSEND_NO_LOAN” in the kernel option on NetBSD_current). With this feature, zero-copy is automatically enabled, if data more than 4096 bytes are specified as data to be sent.
The sendfile() system call can implement zero-copy networking.
The sendfile() function in Linux and FreeBSD lets you tell the kernel to send part or all of a file. This lets the OS do it as efficiently as possible. It can be used equally well in servers using threads or servers using nonblocking I/O. (In Linux, it’s poorly documented at the moment; use _syscall4 to call it. Andi Kleen is writing new man pages that cover this. See also Exploring The sendfile System Call by Jeff Tranter in Linux Gazette issue 91.) Rumor has it, ftp.cdrom.com benefitted noticeably from sendfile().
A zero-copy implementation of sendfile() is on its way for the 2.4 kernel. See LWN Jan 25 2001.

One developer using sendfile() with Freebsd reports that using POLLWRBAND instead of POLLOUT makes a big difference.

Solaris 8 (as of the July 2001 update) has a new system call ‘sendfilev’. A copy of the man page is here. The Solaris 8 7/01 release notes also mention it. I suspect that this will be most useful when sending to a socket in blocking mode; it’d be a bit of a pain to use with a nonblocking socket.
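
As a rough illustration of the Linux call (my sketch, not from the original page; sock_fd is assumed to be an already-connected socket, and error handling is minimal):

/* Linux sendfile() sketch: push a whole file out of a connected socket
 * without copying it through user space. */
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

static int send_whole_file(int sock_fd, const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    fstat(fd, &st);

    off_t offset = 0;
    while (offset < st.st_size) {
        ssize_t sent = sendfile(sock_fd, fd, &offset, st.st_size - offset);
        if (sent <= 0)
            break;                       /* error, or EAGAIN on a nonblocking socket */
    }
    close(fd);
    return offset == st.st_size ? 0 : -1;
}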

Avoid small frames by using writev (or TCP_CORK)
A new socket option under Linux, TCP_CORK, tells the kernel to avoid sending partial frames, which helps a bit e.g. when there are lots of little write() calls you can’t bundle together for some reason. Unsetting the option flushes the buffer. Better to use writev(), though…
See LWN Jan 25 2001 for a summary of some very interesting discussions on linux-kernel about TCP_CORK and a possible alternative MSG_MORE.
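
An illustrative writev() sketch (mine): hand the kernel the header and body in one call so it can build full-sized frames itself (stdout stands in for a socket here):

/* writev() sketch: submit header and body in one call, avoiding a tiny trailing frame. */
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    const char *header = "HTTP/1.0 200 OK\r\nContent-Length: 12\r\n\r\n";
    const char *body   = "hello world\n";
    struct iovec iov[2];

    iov[0].iov_base = (void *)header;
    iov[0].iov_len  = strlen(header);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = strlen(body);

    ssize_t n = writev(1, iov, 2);       /* one syscall instead of two write()s */
    fprintf(stderr, "wrote %zd bytes in a single call\n", n);
    return 0;
}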

Behave sensibly on overload.
[Provos, Lever, and Tweedie 2000] notes that dropping incoming connections when the server is overloaded improved the shape of the performance curve, and reduced the overall error rate. They used a smoothed version of “number of clients with I/O ready” as a measure of overload. This technique should be easily applicable to servers written with select, poll, or any system call that returns a count of readiness events per call (e.g. /dev/poll or sigtimedwait4()).
Some programs can benefit from using non-Posix threads.
Not all threads are created equal. The clone() function in Linux (and its friends in other operating systems) lets you create a thread that has its own current working directory, for instance, which can be very helpful when implementing an ftp server. See Hoser FTPd for an example of the use of native threads rather than pthreads.
Caching your own data can sometimes be a win.
“Re: fix for hybrid server problems” by Vivek Sadananda Pai (vivek@cs.rice.edu) on new-httpd, May 9th, states:
“I’ve compared the raw performance of a select-based server with a multiple-process server on both FreeBSD and Solaris/x86. On microbenchmarks, there’s only a marginal difference in performance stemming from the software architecture. The big performance win for select-based servers stems from doing application-level caching. While multiple-process servers can do it at a higher cost, it’s harder to get the same benefits on real workloads (vs microbenchmarks). I’ll be presenting those measurements as part of a paper that’ll appear at the next Usenix conference. If you’ve got postscript, the paper is available at http://www.cs.rice.edu/~vivek/flash99/”

Other limits
Old system libraries might use 16 bit variables to hold file handles, which causes trouble above 32767 handles. glibc2.1 should be ok.
Many systems use 16 bit variables to hold process or thread id’s. It would be interesting to port the Volano scalability benchmark to C, and see what the upper limit on number of threads is for the various operating systems.
Too much thread-local memory is preallocated by some operating systems; if each thread gets 1MB, and total VM space is 2GB, that creates an upper limit of 2000 threads.
Look at the performance comparison graph at the bottom of http://www.acme.com/software/thttpd/benchmarks.html. Notice how various servers have trouble above 128 connections, even on Solaris 2.6? Anyone who figures out why, let me know.
Note: if the TCP stack has a bug that causes a short (200ms) delay at SYN or FIN time, as Linux 2.2.0-2.2.6 had, and the OS or http daemon has a hard limit on the number of connections open, you would expect exactly this behavior. There may be other causes.
Kernel Issues
For Linux, it looks like kernel bottlenecks are being fixed constantly. See Linux Weekly News, Kernel Traffic, the Linux-Kernel mailing list, and my Mindcraft Redux page.

In March 1999, Microsoft sponsored a benchmark comparing NT to Linux at serving large numbers of http and smb clients, in which they failed to see good results from Linux. See also my article on Mindcraft’s April 1999 Benchmarks for more info.

See also The Linux Scalability Project. They’re doing interesting work, including Niels Provos’ hinting poll patch, and some work on the thundering herd problem.

See also Mike Jagdis’ work on improving select() and poll(); here’s Mike’s post about it.

Mohit Aron (aron@cs.rice.edu) writes that rate-based clocking in TCP can improve HTTP response time over ‘slow’ connections by 80%.

Measuring Server Performance
Two tests in particular are simple, interesting, and hard:

raw connections per second (how many 512 byte files per second can you serve?)
total transfer rate on large files with many slow clients (how many 28.8k modem clients can simultaneously download from your server before performance goes to pot?)
Jef Poskanzer has published benchmarks comparing many web servers. See http://www.acme.com/software/thttpd/benchmarks.html for his results.

I also have a few old notes about comparing thttpd to Apache that may be of interest to beginners.

Chuck Lever keeps reminding us about Banga and Druschel’s paper on web server benchmarking. It’s worth a read.

IBM has an excellent paper titled Java server benchmarks [Baylor et al, 2000]. It’s worth a read.

Examples
Interesting select()-based servers
thttpd Very simple. Uses a single process. It has good performance, but doesn't scale with the number of CPUs. Can also use kqueue.
mathopd. Similar to thttpd.
fhttpd
boa
Roxen
Zeus, a commercial server that tries to be the absolute fastest. See their tuning guide.
The other non-Java servers listed at http://www.acme.com/software/thttpd/benchmarks.html
BetaFTPd
Flash-Lite – web server using IO-Lite.
Flash: An efficient and portable Web server — uses select(), mmap(), mincore()
The Flash web server as of 2003 — uses select(), modified sendfile(), async open()
xitami – uses select() to implement its own thread abstraction for portability to systems without threads.
Medusa – a server-writing toolkit in Python that tries to deliver very high performance.
userver – a small http server that can use select, poll, epoll, or sigio
Interesting /dev/poll-based servers
N. Provos, C. Lever, “Scalable Network I/O in Linux,” May, 2000. [FREENIX track, Proc. USENIX 2000, San Diego, California (June, 2000).] Describes a version of thttpd modified to support /dev/poll. Performance is compared with phhttpd.
Interesting kqueue()-based servers
thttpd (as of version 2.21?)
Adrian Chadd says “I’m doing a lot of work to make squid actually LIKE a kqueue IO system”; it’s an official Squid subproject; see http://squid.sourceforge.net/projects.html#commloops. (This is apparently newer than Benno’s patch.)
Interesting realtime signal-based servers
Chromium’s X15. This uses the 2.4 kernel’s SIGIO feature together with sendfile() and TCP_CORK, and reportedly achieves higher speed than even TUX. The source is available under a community source (not open source) license. See the original announcement by Fabio Riccardi.
Zach Brown’s phhttpd – “a quick web server that was written to showcase the sigio/siginfo event model. consider this code highly experimental and yourself highly mental if you try and use it in a production environment.” Uses the siginfo features of 2.3.21 or later, and includes the needed patches for earlier kernels. Rumored to be even faster than khttpd. See his post of 31 May 1999 for some notes.
Interesting thread-based servers
Hoser FTPD. See their benchmark page.
Peter Eriksson’s phttpd and
pftpd
The Java-based servers listed at http://www.acme.com/software/thttpd/benchmarks.html
Sun’s Java Web Server (which has been reported to handle 500 simultaneous clients)
Interesting in-kernel servers
khttpd
“TUX” (Threaded linUX webserver) by Ingo Molnar et al. For 2.4 kernel.
Other interesting links
Jeff Darcy’s notes on high-performance server design
Ericsson’s ARIES project — benchmark results for Apache 1 vs. Apache 2 vs. Tomcat on 1 to 12 processors
Prof. Peter Ladkin’s Web Server Performance page.
Novell’s FastCache — claims 10000 hits per second. Quite the pretty performance graph.
Rik van Riel’s Linux Performance Tuning site

——————————————————————————–

Changelog
$Log: c10k.html,v $
Revision 1.212 2006/09/02 14:52:13 dank
added asio

Revision 1.211 2006/07/27 10:28:58 dank
Link to Cal Henderson’s book.

Revision 1.210 2006/07/27 10:18:58 dank
Listify polyakov links, add Drepper’s new proposal, note that FreeBSD 7 might move to 1:1

Revision 1.209 2006/07/13 15:07:03 dank
link to Scale! library, updated Polyakov links

Revision 1.208 2006/07/13 14:50:29 dank
Link to Polyakov’s patches

Revision 1.207 2003/11/03 08:09:39 dank
Link to Linus’s message deprecating the idea of aio_open

Revision 1.206 2003/11/03 07:44:34 dank
link to userver

Revision 1.205 2003/11/03 06:55:26 dank
Link to Vivek Pei’s new Flash paper, mention great specweb99 score

——————————————————————————–

Copyright 1999-2006 Dan Kegel
dank@kegel.com
Last updated: 2 Sept 2006

2006年05月28日

字符集支持

Filed under: 生活小札,LAMP — HackGou @ 23:36

Jeikul 写的一篇非常好的、解释 MySQL 字符集处理的文章。美文，特转之！
MySQL 4.1版本改进了对字符集处理的支持。这里描述的特性是MySQL 4.1.1 里已经
实现的。(MySQL 4.1.0里有一些,不过不包含这里的全部特性,并且有的实现也是不同的)

本章讨论下面的主题:

·什么是字符集和 collations?
·The multiple-level default system
·MySQL 4.1 里新的语法
·Affected functions and operations
·Unicode 支持
·每种特殊字符集和 collation 的含义

目前 MyISAM、MEMORY (HEAP) 以及（从 MySQL 4.1.2 开始）InnoDB 等存储引擎都包含
字符集支持。ISAM 存储引擎不包含字符集支持，也没有加入的计划，因为 ISAM 已经被
淘汰了。

10.1 字符集和Collations的一般介绍

一个character set (字符集)是一组符号和编码,而一个 collation 是在一个字符集里
比较字符的一套规则,让我们通过一个虚构的字符集例子来说明区别。

假设我们有个四个字母的字母表:`A’, `B’, `a’, `b’.我们给每个字母一个编号:
`A’ = 0, `B’ = 1, `a’ = 2, `b’ = 3. 字母`A’ 是一个符号,而数字0是 `A’ 的
encoding(编码),而这四个字母和他们的编码合起来就是一个字符集(character set)。

现在,假设我们要比较两个字符串的值,`A’ 和`B’,最简单的方法是看编码,`A’ 是 0
而 `B’是 1. 因为0比1小,我们就说`A’ 比 `B’ 小。现在,我们就算已经对我们的字符
集使用了一个collation,collation 是一组规则(在这个例子里只有一条规则):
“比较编码”.我们把所有可能的 collation 中最简单的这种叫做binary collation

但是如果我们想让大写字母和小写字母一样怎么办?那么我们就得有两条规则:
(1)把小写字母`a’ 和 `b’ 看作跟 `A’ 和 `B’相等;
(2)然后比较编码。
我们称这是一个case-insensitive collation(不区分大小写的 collation).
这比binary collation 稍微复杂了一点。
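下面用一个小例子验证这两种 collation 的区别（仅作演示，假设两个字符串都是 latin1 编码，collation 名取自 MySQL 自带的 latin1 系列）：

mysql> SELECT _latin1'a' = _latin1'A' COLLATE latin1_general_ci;
        -> 1
mysql> SELECT _latin1'a' = _latin1'A' COLLATE latin1_bin;
        -> 0

大小写不敏感的 collation 认为 'a' 和 'A' 相等；二元 collation 直接比较编码，所以结果不同。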

在实际生活中,大多数字符集都包含很多字符:不是仅仅`A’和`B’ 而是整个字母表,
有时是多个字母表或者东方书写系统里几千的字符,和很多专有符号和标点符。
并且在实际生活中,大多数的collations 有很多规则:除了不区分大小写外还有不区分
重音(重音“accent” 是像在德语里字符附加的重音符那样的)和多字符映射。

MySQL 4.1 可以为你做以下事:

·使用各种字符集存储字符串
·使用各种collation比较字符串。
·在同一台服务器上或者同一个数据库甚至同一个表中使用不同的字符集和collation混合
·允许在任何级别上指明字符集和collation

在这些方面,MySQL 4.1 不只远远比MySQL 4.0复杂,也比其他DBMS先进很多。不过要想
有效的使用这些新特性,你需要学习哪些字符集和collation是可用的,怎样把他们改成
默认,还有各种字符串运算符如何操作他们。

10.2 MySQL 里的字符集和Collations

MySQL 服务器可支持多个字符集。要列出可用的字符集,使用 SHOW CHARACTER SET 语句:

mysql> SHOW CHARACTER SET;
+---------+-----------------------------+-------------------+
| Charset | Description                 | Default collation |
+---------+-----------------------------+-------------------+
| big5    | Big5 Traditional Chinese    | big5_chinese_ci   |
| dec8    | DEC West European           | dec8_swedish_ci   |
| cp850   | DOS West European           | cp850_general_ci  |
| hp8     | HP West European            | hp8_english_ci    |
| koi8r   | KOI8-R Relcom Russian       | koi8r_general_ci  |
| latin1  | ISO 8859-1 West European    | latin1_swedish_ci |
| latin2  | ISO 8859-2 Central European | latin2_general_ci |
输出实际上包含另一列,这里为了让例子在页面上显示更合适,没显示出来

任一给出的字符集至少包含一个collation. 它可能包含多个 collations.

要列出一个字符集的 collations , 使用 SHOW COLLATION 语句. 例如, 要看latin1
(“ISO-8859-1 West European”)的collations, 使用这个语句来找到哪些名字以latin1
开头的collation

mysql> SHOW COLLATION like 'latin1%';
+-------------------+---------+----+---------+----------+---------+
| Collation         | Charset | Id | Default | Compiled | Sortlen |
+-------------------+---------+----+---------+----------+---------+
| latin1_german1_ci | latin1  |  5 |         |          |       0 |
| latin1_swedish_ci | latin1  |  8 | Yes     | Yes      |       1 |
| latin1_danish_ci  | latin1  | 15 |         |          |       0 |
| latin1_german2_ci | latin1  | 31 |         | Yes      |       2 |
| latin1_bin        | latin1  | 47 |         | Yes      |       1 |
| latin1_general_ci | latin1  | 48 |         |          |       0 |
| latin1_general_cs | latin1  | 49 |         |          |       0 |
| latin1_spanish_ci | latin1  | 94 |         |          |       0 |
+-------------------+---------+----+---------+----------+---------+

latin1 collations 有下列含义:

Collation           含义

latin1_bin          Binary according to latin1 encoding
latin1_danish_ci    Danish/Norwegian
latin1_general_ci   Multilingual
latin1_general_cs   Multilingual, case sensitive
latin1_german1_ci   German DIN-1
latin1_german2_ci   German DIN-2
latin1_spanish_ci   Modern Spanish
latin1_swedish_ci   Swedish/Finnish

Collations 有这些一般特性:

·两个不同字符集没法拥有同一个collation.
·每个字符集有一个默认 collation. 例如, latin1 的默认 collation 是
latin1_swedish_ci.
·collation 的命名有个约定: 它们以所关联的字符集的名字打头，通常包含一个
语言名，并以 _ci (case insensitive，大小写不敏感)、
_cs (case sensitive，大小写敏感) 或者 _bin (binary，二进制) 结尾。
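上面提到每个字符集都有一个默认 collation，想快速确认某个字符集的默认 collation，可以直接用 SHOW CHARACTER SET 过滤（示例，和前面一样省略了最后一列）：

mysql> SHOW CHARACTER SET LIKE 'latin1';
+---------+--------------------------+-------------------+
| Charset | Description              | Default collation |
+---------+--------------------------+-------------------+
| latin1  | ISO 8859-1 West European | latin1_swedish_ci |
+---------+--------------------------+-------------------+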

10.3 决定默认字符集和 Collation

有四个级别的默认字符集和 collation 设置: 服务器、数据库、表和连接。下面的描述
可能看起来复杂，不过实践证明，多级的默认设置会带来自然而然的结果。

10.3.1 服务器级字符集和 Collation

MySQL服务器有一个服务器级别的字符集和 collation, 不能为空。

MySQL 这样决定服务器级的字符集和collation:
·服务器启动时，根据启动选项决定
·运行期间，根据系统变量决定

在服务器级别，决定方法很简单: 服务器字符集和 collation 取决于你启动 mysqld 时
使用的选项。你可以用 --default-character-set 指定字符集，并且和这个一起
还可以用 --default-collation 指定 collation。如果你不指定字符集，就相当于指定了
--default-character-set=latin1。如果你只指定了字符集(例如 latin1)但是没有指定
collation，就相当于指定了
--default-character-set=latin1 --default-collation=latin1_swedish_ci
因为 latin1_swedish_ci是latin1字符集的默认collation, 因此下面三个命令都具有
同样效果:

shell> mysqld
shell> mysqld --default-character-set=latin1
shell> mysqld --default-character-set=latin1 \
           --default-collation=latin1_swedish_ci

还有一个改变这些设置的方法是重新编译。如果你想通过编译源码来改变默认的服务器
字符集和 collation，可以在运行 configure 时加上参数 --with-charset 和
--with-collation，例如:

shell> ./configure --with-charset=latin1

或者:

shell> ./configure --with-charset=latin1 \
           --with-collation=latin1_german1_ci
mysqld 和 configure 都会核实字符集与 collation 的组合是否有效；如果无效，这两个
程序都会报错并中止。

当前的服务器字符集和 collation 对应于 character_set_server 和 collation_server
这两个系统变量的值，这些变量可以在运行时更改。
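例如，可以先查看再修改这两个变量（仅作示意，SET GLOBAL 需要相应的权限）：

mysql> SELECT @@character_set_server, @@collation_server;
mysql> SET GLOBAL character_set_server = latin2;
mysql> SET GLOBAL collation_server = latin2_general_ci;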

10.3.2 数据库字符集和 Collation

每个数据库都有一个数据库字符集和数据库collation,并且不能为空,create DATABASE
和 alter DATABASE 语句有专门指明数据库字符集和collation的可选子句:

create DATABASE db_name
[[DEFAULT] CHARACTER SET charset_name]
[[DEFAULT] COLLATE collation_name]

alter DATABASE db_name
[[DEFAULT] CHARACTER SET charset_name]
[[DEFAULT] COLLATE collation_name]

例子:

create DATABASE db_name
DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci;

MySQL可以这样选择数据库字符集和数据库collation:

·如果 CHARACTER SET X 和 COLLATE Y 被指定了, 那么字符集是 X collation 是 Y.
·如果 CHARACTER SET X 被指定,但是没有指定 COLLATE, 那么字符集是 X collation
是默认collation.
·否则, 就用服务器字符集和服务器 collation.

MySQL 的 create DATABASE … DEFAULT CHARACTER SET … 语法类似于标准 SQL
的 create SCHEMA … CHARACTER SET … 语法. 因为这样, 就可能在同一个MySQL
服务器上创建具有不同字符集和collation的数据库。

如果在建表的语句里没有指定表的字符集和collation,那么数据库字符集和 collation
就作为表的字符集和collation的默认值. 它们没有别的作用。

默认数据库的字符集和 collation是和 character_set_database 以及
collation_database 这两个系统变量的值一样。 当默认数据库更改时服务程序会设置
这些变量的值。如果没有默认数据库, 变量的值会和配套的服务器级系统变量
character_set_server 以及 collation_server的值一致.
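例如，可以这样查看当前默认数据库的字符集和 collation（示例，db_name 换成你自己的数据库名）：

mysql> SELECT @@character_set_database, @@collation_database;
mysql> SHOW CREATE DATABASE db_name;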

10.3.3 表字符集和 Collation

每个表有一个表字符集以及一个表collation,不能为空。create TABLE 和 alter TABLE
语句有可选子句指定表字符集和collation。

create TABLE tbl_name (column_list)
[DEFAULT CHARACTER SET charset_name [COLLATE collation_name]]

alter TABLE tbl_name
[DEFAULT CHARACTER SET charset_name] [COLLATE collation_name]

例子:

create TABLE t1 ( … )
DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;

MySQL 通过下面的方法选择表字符集和collation:

·如果 CHARACTER SET X 和 COLLATE Y 都被指定了, 那么字符集就是 X collation 是Y
·如果只指定了CHARACTER SET X 而没有指定 COLLATE, 那么字符集为 X 并配默认的
collation.
·否则就使用数据库字符集和 collation.

表字符集和 collation 用来在没有指定个别列字符集和列collation的时候做为它们
的默认值。表字符集和 collation 是MySQL 的扩展;在标准SQL里没有这种东西

10.3.4 列字符集和 Collation

每个“character” 列(是指列属性为CHAR, VARCHAR, 或 TEXT的)都有一个列字符集
和一个列collation,不能为空。列定义语句有可选子句指定列字符集和collation:

col_name {CHAR | VARCHAR | TEXT} (col_length)
[CHARACTER SET charset_name [COLLATE collation_name]]

Example:

create TABLE Table1
(
column1 VARCHAR(5) CHARACTER SET latin1 COLLATE latin1_german1_ci
);

MySQL 这样选择列字符集和collation:

·如果 CHARACTER SET X 和 COLLATE Y 都被指定了, 那么字符集就是 X collation 就是 Y.
·如果指定了 CHARACTER SET X 但没有指定 COLLATE, 那么字符集是 X 并配默认的collation.
·否则,就用表字符集和 collation.
CHARACTER SET 和 COLLATE 子句是标准SQL.
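建好表之后，可以用 SHOW FULL COLUMNS 查看每一列实际使用的 collation（示例，沿用上面的 Table1；结果里的 Collation 一列对 column1 会显示 latin1_german1_ci）：

mysql> SHOW FULL COLUMNS FROM Table1;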

10.3.5 字符集和 Collation 分配的例子

下面的例子显示了 MySQL 怎样决定默认的字符集和collation的值:

例子1:表+列定义

create TABLE t1
(
c1 CHAR(10) CHARACTER SET latin1 COLLATE latin1_german1_ci
) DEFAULT CHARACTER SET latin2 COLLATE latin2_bin;

这里我们有一个用latin1的字符集和latin1_german1_ci collation的列。
定义非常明显,所以很简单。注意把一个latin1 的列存到一个latin2的表里不会有问题

例子2:表+列定义

create TABLE t1
(
c1 CHAR(10) CHARACTER SET latin1
) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;

这次我们有一列是latin1字符集加默认的collation。现在,虽然它看上去很自然,
但是默认的collation却不是从表级继承而来。事实上,因为latin1的默认collation
始终是latin1_swedish_ci,所以c1列的collation是latin1_swedish_ci (而不是
latin1_danish_ci).

例子3:表+列定义

create TABLE t1
(
c1 CHAR(10)
) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;

我们有一个默认字符集和默认collation的列。在这个环境下,MySQL向上到表级决定
列字符集和collation。所以,c1的列字符集是latin1,它的collation是
latin1_danish_ci

例子4:数据库+表+列定义

create DATABASE d1
DEFAULT CHARACTER SET latin2 COLLATE latin2_czech_ci;
USE d1;
create TABLE t1
(
c1 CHAR(10)
);

我们创建了一个没有指定列字符集和collation的列。我们也没有指定表级字符集和
collation。在这个条件下,MySQL向上到数据库级决定。(数据库的设置变为表的设置,
之后成为列的设置),所以c1的列字符集是latin2,collation是latin2_czech_ci
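想确认 MySQL 最终为表选定的字符集和 collation，可以用 SHOW CREATE TABLE 查看（示例，沿用例子4里的 t1；输出的表定义末尾大致会带有 DEFAULT CHARSET=latin2 COLLATE=latin2_czech_ci 这样的信息）：

mysql> SHOW CREATE TABLE t1;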

10.3.6 连接的字符集和Collations

有几个字符集和 collation 系统变量与客户端和服务器之间的交互有关，其中一些在前面已经提及了:

·服务器的字符集和collation和 character_set_server 及 collation_server 变量
的值一样
·默认数据库的字符集和collation和 character_set_database 及 collation_database
变量的值一样.

另外还引入了一些字符集和 collation 变量，用来处理服务器和客户端之间通过连接进行
的通信。每个客户端都有自己的一组与连接相关的字符集和 collation 变量。

想想“连接”是什么: 就是你连到服务器时建立起来的东西。客户端通过这条连接发送 SQL 语句
(比如查询)，服务器则通过这条连接给客户端送回响应(比如结果集)。这就引出了关于
连接的字符集和 collation 的几个问题，每个问题都可以用系统变量来回答:

·当查询离开客户端的时候应该是什么字符集的?服务器用character_set_client
这个变量来作为客户端发送查询所用的字符集
·服务器收到查询以后，应该把它转换成什么字符集?对于这个问题，服务程序
用的是 character_set_connection 和 collation_connection 这两个变量。
它把客户端送来的语句从 character_set_client 转换成 character_set_connection
(带有 _latin1、_utf8 之类 introducer 的字符串文字除外)。collation_connection
对于文字型字符串之间的比较很重要；字符串和列值比较时它不起作用，因为列的
collation 优先级更高。
·当服务程序要送回结果集合或者错误信息给客户端时应该用什么字符集?
character_set_results 变量指示了这个值,这包括了列值,或者列名等结果数据。

你可以调整这些变量的值,或者就使用默认的(那样就可以省略这节了)
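想查看当前连接上这些变量实际生效的值，可以用下面的语句（示例）：

mysql> SHOW VARIABLES LIKE 'character_set%';
mysql> SHOW VARIABLES LIKE 'collation%';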

有两个语句影响连接字符集设置:
SET NAMES 'charset_name'
SET CHARACTER SET charset_name

SET NAMES 指出客户端送出的 SQL 语句使用什么字符集。因此，SET NAMES 'cp1251' 就是
告诉服务程序“这个客户端后面送来的语句都使用 cp1251 字符集”。它也指定了服务程序
送回结果时所用的字符集(例如，如果你执行一个 select 语句，它指明了列值送回时
使用的字符集)。

SET NAMES 'x' 语句相当于下面三个语句:

mysql> SET character_set_client = x;
mysql> SET character_set_results = x;
mysql> SET character_set_connection = x;

把 character_set_connection 设置成 x，同时也会把 collation_connection 设置成
字符集 x 的默认 collation。

SET CHARACTER SET 与之类似，不过它把连接的字符集和 collation 设置成默认数据库
所用的字符集和 collation。SET CHARACTER SET x 语句相当于这三个语句:

mysql> SET character_set_client = x;
mysql> SET character_set_results = x;
mysql> SET collation_connection = @@collation_database;

当一个客户端连接时，它会向服务程序发送自己想使用的字符集的名字，服务程序则把
character_set_client、character_set_results 和 character_set_connection
这些变量设置成那个字符集(事实上，相当于服务程序用这个字符集执行了一次 SET NAMES 操作)。

如果你想使用非默认的字符集，用 mysql 客户端程序时并不需要每次启动都执行 SET NAMES。
你可以在 mysql 的命令行里加上 --default-character-set 这个选项，或者把它写进你的
选项文件。比如，下面的选项文件设置会让你每次运行 mysql 程序时把默认字符集变量
改成 koi8r:

[mysql]
default-character-set=koi8r

例如: 假设 column1 的定义是 CHAR(5) CHARACTER SET latin2。如果你不执行 SET NAMES
或者 SET CHARACTER SET，那么对于查询 select column1 FROM t，服务程序
会把 column1 的所有值按连接建立时客户端指定的字符集送回。另一方面，如果你
用了 SET NAMES 'latin1' 或者 SET CHARACTER SET latin1，那么在送回结果之前，
服务程序会把 latin2 的值转换成 latin1；如果其中有不是两个字符集共有的字符，
转换就会有损耗。

如果你不希望服务程序作任何转换,就把character_set_results 设置成 NULL

mysql> SET character_set_results = NULL;

10.3.7. 字符串文字字符集和collation

每个字符串文字都有自己的字符集和collation,不能为空

一个字符串文字可能有一个可选字符集introducer和COLLATION子句:

[_charset_name]'string' [COLLATE collation_name]

例如:

select 'string';
select _latin1'string';
select _latin1'string' COLLATE latin1_danish_ci;

对于简单语句 select 'string'，字符串的字符集和collation是由两个系统变量
character_set_connection 和 collation_connection 定义的。
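可以用 CHARSET() 和 COLLATION() 函数直接验证这一点（示例；返回的就是当前 character_set_connection 和 collation_connection 的值）：

mysql> SELECT CHARSET('string'), COLLATION('string');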

_charset_name 表达式的正式名称叫做 introducer。它告诉解析器“后面的字符串使用的
是字符集 X”。由于这一点以前造成过很多人的困扰，这里强调一下: introducer
并不做任何转换，严格来讲也不改变字符串的值，它只是一个标记。introducer 放在
标准十六进制文字和数字式十六进制记法前面都是合法的(x'literal' 和 0xnnnn)，
放在 ? 前面也是合法的(在程序设计语言接口里使用预处理语句做参数替换的时候)。

例如:

select _latin1 x'AABBCC';
select _latin1 0xAABBCC;
select _latin1 ?;

MySQL 这样决定一个文字的字符集和collation:

·如果 _X 和 COLLATE Y 都被指定了,那么字符集就是 X collation 是 Y
·如果 指定了 _X 而没有指定 COLLATE ,那么字符集是 X collation 是 X 的默认
collation
·否则,由系统变量 character_set_connection 和 collation_connection 决定字符集
和collation

例如:
·一个字符集是 latin1 而collation是 latin1_german1_ci 的字符串:

select _latin1'Müller' COLLATE latin1_german1_ci;

·一个字符集是 latin1 以及其配套默认collation的(latin1_swedish_ci)字符串:

select _latin1'Müller';

·一个连接默认字符集和collation的字符串:

select 'Müller';

字符集 introducer 和 COLLATE 子句是符合标准 SQL 规则的工具

10.3.8. 在 SQL 语句里使用 COLLATE

通过 COLLATE 子句，你可以在比较时覆盖任何默认的 collation。COLLATE 可以用
在 SQL 语句的很多地方，这里是一些例子:

·在 ORDER BY 里:

select k
FROM t1
ORDER BY k COLLATE latin1_german2_ci;

·在 AS 里:

select k COLLATE latin1_german2_ci AS k1
FROM t1
ORDER BY k1;

·在GROUP BY里 :

select k
FROM t1
GROUP BY k COLLATE latin1_german2_ci;

·在集合函数里:

select MAX(k COLLATE latin1_german2_ci)
FROM t1;

·在DISTINCT里

select DISTINCT k COLLATE latin1_german2_ci
FROM t1;

·在where 里:

select *
FROM t1
where _latin1 'Müller' COLLATE latin1_german2_ci = k;

·在HAVING里:

select k
FROM t1
GROUP BY k
HAVING k = _latin1 'Müller' COLLATE latin1_german2_ci;

User Comments
Posted by [name withheld] on January 14 2005 2:33pm

在不同的列/表里:

select t1.k FROM t1 where NOT EXISTS
( select * FROM t2 where t1.k=t2.k COLLATE latin1_german2_ci);

这样在比较不同 collation 的列的时候，能够避免出错信息。

10.3.9. COLLATE 子句优先级

COLLATE子句具有高优先级(比||高),所以下面两个表达式是相同的:

x || y COLLATE z
x || (y COLLATE z)

10.3.10. BINARY 运算

BINARY 运算符是 COLLATE 子句的速记法。BINARY 'x' 和 'x' COLLATE y 是等价的，
其中 y 是 'x' 所属字符集的二元 collation 的名字。每个字符集都有一个二元
collation。例如，latin1 字符集的二元 collation 是 latin1_bin，所以如果列 a
属于 latin1 字符集，下面两个语句有同样效果:

select * FROM t1 ORDER BY BINARY a;
select * FROM t1 ORDER BY a COLLATE latin1_bin;

10.3.11. 一些决定collation 比较棘手的情况

在绝大多数查询里,MySQL 用什么collation来进行比较操作都是很显而易见的,例如,
在下面的情况里,很显然collation 应该是”列 x 的列 collation”:

select x FROM T ORDER BY x;
select x FROM T where x = x;
select DISTINCT x FROM T;

但是,当卷入了多操作数时,就很难搞了,例如:

select x FROM T where x = 'Y';

这个查询应该使用列 x 的collation 呢，还是使用字符串'Y' 的?

标准SQL 使用被叫做“coercibility” 的规则来解决这个问题。本质就是:因为
x 和 'Y' 都有collation ,优先使用谁的collation呢?这很复杂,不过下面的规则能
应付大多数情况:

·一个COLLATE 子句的 coercibility 是0 (也就是根本不coercible)

·两个具有不同collation 的字符串连结的 coercibility 是1

·一个列的 collation 的 coercibility 是 2

·一个文字型的collation 的 coercibility 是3。

这些规则这样消除歧义:

·使用具有最低 coercibility 值的collation

·如果两边的 coercibility 相同，而两个 collation 又不同，那就是错误。

例如:

column1 = 'A'                        使用 column1 的 collation
column1 = 'A' COLLATE x              使用 'A' 的 collation
column1 COLLATE x = 'A' COLLATE y    Error

COERCIBILITY() 函数可以用来判断一个字符串表达式的coercibility:

mysql> select COERCIBILITY('A' COLLATE latin1_swedish_ci);
-> 0
mysql> select COERCIBILITY('A');
-> 3

User Comments
Posted by Thierry Danard on November 5 2004 10:34pm

对数据库引擎来说，看上去显而易见的排序并不总是那么显而易见(version 4.1)。

一个类似 select concat(mycolumn, '%') from mytable 这样不带排序指令的查询，
在 mycolumn 和 '%' 的字符集不相同的情况下是无法工作的。

在我这里，整个数据库使用 UTF-8，而默认情况下 '%' 被当作 latin1，
于是触发了一个错误。
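一个简单的规避办法（仅作示例，mycolumn/mytable 沿用评论里假设的名字）是让文字型字符串使用和列相同的字符集，比如加上 introducer 或者用 CONVERT()：

mysql> SELECT CONCAT(mycolumn, _utf8'%') FROM mytable;
mysql> SELECT CONCAT(mycolumn, CONVERT('%' USING utf8)) FROM mytable;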

10.3.12. Collations Must Be for the Right Character Set

记得说过每个字符集都有一个或者多个collation,每个collation只和一个字符集关联。
因此,下面的语句会导致错误,因为 latin2_bin 这个collation 和 latin1 这个字符集
不配套。

mysql> select _latin1 'x' COLLATE latin2_bin;
ERROR 1251: COLLATION 'latin2_bin' is not valid
for CHARACTER SET 'latin1'

在某些情况下，如果不把字符集和 collation 考虑进去，一些在 MySQL 4.1 之前能正常
工作的表达式，从 MySQL 4.1 开始会执行失败。例如，在 MySQL 4.1 之前，这个语句
可以正常工作:

mysql> select SUBSTRING_INDEX(USER(),'@',1);
+-------------------------------+
| SUBSTRING_INDEX(USER(),'@',1) |
+-------------------------------+
| root                          |
+-------------------------------+

升级到MySQL 4.1 以后,语句失效:

mysql> select SUBSTRING_INDEX(USER(),'@',1);
ERROR 1267 (HY000): Illegal mix of collations
(utf8_general_ci,IMPLICIT) and (latin1_swedish_ci,COERCIBLE)
for operation 'substr_index'

发生这个错误的原因是用户名使用 utf8 存储(参看10.6节)，因此，USER() 函数
和文字型字符串 '@' 具有不同的字符集(当然也是不同的 collation):

mysql> select COLLATION(USER()), COLLATION('@');
+-------------------+-------------------+
| COLLATION(USER()) | COLLATION('@')    |
+-------------------+-------------------+
| utf8_general_ci   | latin1_swedish_ci |
+-------------------+-------------------+

解决的一个方法是告诉MySQL把文字型字符串翻译成utf8:

mysql> select SUBSTRING_INDEX(USER(),_utf8'@',1);
+------------------------------------+
| SUBSTRING_INDEX(USER(),_utf8'@',1) |
+------------------------------------+
| root                               |
+------------------------------------+

另一个方法是把连接的字符集和collation改成utf8，你可以使用 SET NAMES 'utf8'
或者直接设置两个系统变量 character_set_connection 和 collation_connection
的值来达到这个目的。
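具体写出来大致是这样（仅作示意，假设服务器编译了 utf8 支持）：

mysql> SET NAMES 'utf8';

或者只改这两个变量:

mysql> SET character_set_connection = utf8;
mysql> SET collation_connection = utf8_general_ci;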

10.3.13. Collation 的效果的一个例子

假设表 T 里的列 X 具有这些 latin1 的列值:

Muffler
Müller
MX Systems
MySQL

并且假设这些列值可以用下列语句找回:

select X FROM T ORDER BY X COLLATE collation_name;

下面这张表按列给出了使用不同 collation 时得到的排序结果:

latin1_swedish_ci   latin1_german1_ci   latin1_german2_ci
Muffler             Muffler             Müller
MX Systems          Müller              Muffler
Müller              MX Systems          MX Systems
MySQL               MySQL               MySQL

这张表展示了在 ORDER BY 子句里使用不同 collation 的效果。导致排序结果不同的
字符是上面带两个点的 U，德语里把它叫做 U-umlaut(变音)，这里我们称它为
U-分音符(diaeresis)。

·第一列显示了使用瑞典/芬兰 collation 规则的 select 的结果,U-分音符
通过Y归类

·第二列显示了使用德语DIN-1 规则的select 语句的结果,U-分音符通过U归类

·第三列显示了使用德语DIN-2 规则的select 语句的结果,U-分音符通过UE归类

三种不同的collation ,三种不同的结果,这是MySQL 在这里的处理。通过使用合适的
collation,你可以选择你想要的排序次序。
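下面是一个可以自己动手重现上表结果的最小示例（仅作演示，假设客户端连接使用 latin1；表名、列名沿用上文的 T 和 X）：

mysql> SET NAMES 'latin1';
mysql> CREATE TABLE T (X CHAR(20)) DEFAULT CHARACTER SET latin1;
mysql> INSERT INTO T VALUES ('Muffler'), ('Müller'), ('MX Systems'), ('MySQL');
mysql> SELECT X FROM T ORDER BY X COLLATE latin1_german2_ci;

把最后一句里的 collation 换成 latin1_swedish_ci 或 latin1_german1_ci，就能看到表中另外两列的次序。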
