查看“︁UPGMA”︁的源代码

'''UPGMA''' （'''unweighted pair group method with arithmetic mean'''）是一種相對簡單的[[层次聚类|層次聚類]]方法。這個方法存在另一種變體 [[世界太平洋汽车制造商协会|WPGMA]]。這個方法的創始人被認為是[[罗伯特·R·索卡尔|Sokal]]和[[查尔斯·邓肯·米切纳|Michener]] 。 <ref>{{Cite journal |vauthors=[[Robert R. Sokal|Sokal]], [[Charles Duncan Michener|Michener]] |year=1958 |title=A statistical method for evaluating systematic relationships |url=https://archive.org/details/cbarchive_33927_astatisticalmethodforevaluatin1902/page/n2 |journal=University of Kansas Science Bulletin |volume=38 |page=1409–1438}}</ref>

== 演算方法 ==
UPGMA 演法構建出一棵有根樹（[[树枝状图|樹狀圖]]）表現[[相似度|相似矩陣]]或[[距离矩阵|相異矩陣]]中的特徵與結構。在算法裡的每一步，距離最近的兩個集群（子樹）將被組合成一個更高級別的集群。任意兩個集群<math>\mathcal{A}</math>和<math>\mathcal{B}</math> 之間的距離，是由所有<math>\mathcal{A}</math>裡的<math>x</math>元素和所有<math>\mathcal{B}</math>裡的<math>y</math>元素的距離<math>d(x,y)</math>的平均值，即每個集群的元素之間的平均距離，其中 <math>{|\mathcal{A}|}</math>和<math>{|\mathcal{B}|}</math>是該兩個集群的[[势 (数学)|基數]]（集合大小）：

:: <math> d_{\mathcal{A},\mathcal{B}}={1 \over {|\mathcal{A}|\cdot|\mathcal{B}|}}\sum_{x \in \mathcal{A}}\sum_{ y \in \mathcal{B}} d(x,y)</math>

換句話說，在每一次組合成新集群的步驟中，可以由<math>d_{\mathcal{A},X}</math>和<math>d_{\mathcal{B},X}</math>的加權平均給出集群<math>\mathcal{A} \cup \mathcal{B}</math>和一個新集群<math>X</math>之間的距離：

<math>
d_{(\mathcal{A} \cup \mathcal{B}),X} = \frac{|\mathcal{A}| \cdot d_{\mathcal{A},X} + |\mathcal{B}| \cdot d_{\mathcal{B},X}}{|\mathcal{A}| + |\mathcal{B}|} 
</math>

UPGMA 演算法生成的有根樹狀圖是一個[[超度量空间|超度量]]樹，該樹需要套用等速率的假設，也就是說根到每個分支尖端的距離皆相等。當尖端是同時採樣的分子數據（即[[脱氧核糖核酸|DNA]] 、 [[核糖核酸|RNA]]和[[蛋白质|蛋白質]]）時，[[超度量空间|超度量]]假設就等同於[[分子鐘]]假設。

== 示例 ==
這個示例是基於[[DNA进化模型|JC69]]基因距離矩陣，該矩陣是根據五種細菌的[[5S 核糖体RNA|5S 核糖體 RNA]]序列計算出來的，五種細菌如下所列<ref name="Erdmann1986">{{Cite journal |vauthors=Erdmann VA, Wolters J |date=1986 |title=Collection of published 5S, 5.8S and 4.5S ribosomal RNA sequences |journal=Nucleic Acids Research |volume=14 Suppl |issue=Suppl |page=r1–59 |doi=10.1093/nar/14.suppl.r1 |pmc=341310 |pmid=2422630}}</ref> <ref name="Olsen1988">{{Cite journal |vauthors=Olsen GJ |date=1988 |title=Phylogenetic analysis using ribosomal RNA |journal=Methods in Enzymology |volume=164 |page=793–812 |doi=10.1016/s0076-6879(88)64084-5 |pmid=3241556}}</ref>：

[[枯草桿菌|枯草桿菌 ''Bacillus subtilis'']]( <math>a</math> )

嗜热脂肪芽孢杆菌 <nowiki><i>Bacillus stearothermophilus</i></nowiki>( <math>b</math> )

[[魏斯氏菌属|魏斯氏菌 ''Lactobacillus viridescens'']]( <math>c</math> )

[[枯草桿菌|無原枯草桿菌 ''Acholeplasma modicum'']]( <math>d</math> )

藤黄微球菌 <nowiki><i>Micrococcus luteus</i></nowiki>( <math>e</math> )

=== 第一步 ===

* '''首次集群'''

假設有五個物件<math>(a,b,c,d,e)</math>和他們之間的相異矩陣<math>D_1</math>：
{| class="wikitable" style="text-align: center"
! style="width: 50px;" |
! style="width: 50px;" |a
! style="width: 50px;" | b
! style="width: 50px;" | c
! style="width: 50px;" | d
! style="width: 50px;" |e
|-
!a
| 0
| style="background:#ffffcc;" | 17
| 21
| 31
| 23
|-
! b
| style="background:#ffffcc;" | 17
| 0
| 30
| 34
| 21
|-
! c
| 21
| 30
| 0
| 28
| 39
|-
! d
| 31
| 34
| 28
| 0
| 43
|-
!e
|23
| 21
| 39
| 43
| 0
|}
在這裡，<math>D_1 (a,b)=17</math>是最小值，所以將<math>a</math>和<math>b</math>集群。

* '''第一分支長度估計'''

令<math>u</math>表示現在<math>a</math> 和 <math>b</math> 的祖先。為了讓<math>a</math> 和 <math>b</math> 與 <math>u</math>等距，假設<math>\delta(a,u)=\delta(b,u)=D_1(a,b)/2</math>，這對應到了超度量的假設。在這個範例中：<math>\delta(a,u)=\delta(b,u)=17/2=8.5</math>

* '''第一次相異矩陣更新'''

然後將<math>D_1</math>更新成一個新的距離矩陣<math>D_2</math>（計算在下方），由於<math>a</math>和<math>b</math>的集群，該矩陣的尺寸減少了一行一列。（<math>D_2</math>中粗體表示的值是由加權平均計算出的新距離）

<math>D_2((a,b),c)=(D_1(a,c) \times 1 + D_1(b,c) \times 1)/(1+1)=(21+30)/2=25.5</math>

<math>D_2((a,b),d)=(D_1(a,d) + D_1(b,d))/2=(31+34)/2=32.5</math>

<math>D_2((a,b),e)=(D_1(a,e) + D_1(b,e))/2=(23+21)/2=22</math>

<math>D_2</math>中的斜體值不受矩陣更新影響，因為他們與第一個集群中的元素完全美有關連。

=== 第二步 ===

* '''第二次集群'''

現在重複前面的三個步驟，並從新的相異矩陣<math>D_2</math>開始
{| class="wikitable" style="text-align: center"
! style="width: 50px;" |
! style="width: 50px;" |(a,b)
! style="width: 50px;" | ｃ
! style="width: 50px;" | d
! style="width: 50px;" |ｅ
|-
!(a,b)
| 0
| '''25.5'''
| '''32.5'''
| style="background:#ffffcc;" | '''22'''
|-
! ｃ
| '''25.5'''
| 0
| ''28''
| ''39''
|-
! d
| '''32.5'''
| ''28''
| 0
| ''43''
|-
!ｅ
| style="background:#ffffcc;" |'''22'''
| ''39''
| ''43''
| 0
|}
在這個矩陣中， <math>D_2 ((a,b),e)=22</math>是<math>D_2</math>中的最小值，所以將<math>(a,b)</math>和元素<math>e</math>集成新群。

* '''第二次分支長度估計'''

令<math>v</math>表示節點<math>(a,b)</math>和<math>e</math>的祖先。由超度量假設可以得到<math>a,b,e</math>三頂點到<math>v</math>的距離相等，即：<math>\delta(a,v)=\delta(b,v)=\delta(e,v)=22/2=11</math>，從而可以計算出<math>u</math>到<math>v</math>的距離<math>\delta(u,v)=\delta(e,v)-\delta(a,u)=\delta(e,v)-\delta(b,u)=11-8.5=2.5</math>

* '''第二次距離矩陣更新'''

然後將<math>D_2</math>更新成新的距離矩陣<math>D_3</math>，數值計算如下：

<math>D_3(((a,b),e),c)=(D_2((a,b),c) \times 2 + D_2(e,c) \times 1)/(2+1)=(25.5 \times 2 + 39 \times 1)/3=30</math>


<math>D_3(((a,b),e),d)=(D_2((a,b),d) \times 2 + D_2(e,d) \times 1)/(2+1)=(32.5 \times 2 + 43 \times 1)/3=36</math>

=== 第三、四步 ===
重複上述動作可以得到<math>D_3</math>是

{| class="wikitable" style="text-align: center"
! style="width: 50px;" |
! style="width: 50px;" | ((a,b),e)
! style="width: 50px;" | c
! style="width: 50px;" | d
|-
! ((a,b),e)
| 0
| '''30'''
| '''36'''
|-
! c
| '''30'''
| 0
| style="background:#ffffcc;" | ''28''
|-
! d
| '''36'''
| style="background:#ffffcc;" | ''28''
| 0
|}
<math>D_4</math>是：
{| class="wikitable" style="text-align: center"
! style="width: 50px;" |
! style="width: 50px;" | ((a,b),e)
! style="width: 50px;" | （c,d）
|-
! ((a,b),e)
| 0
| style="background:#ffffcc;" | '''33'''
|-
! （c,d）
| style="background:#ffffcc;" | '''33'''
| 0
|}

=== UPGMA樹狀圖 ===
[[File:UPGMA_Dendrogram_5S_data.svg|center|400x400px]]
這裡顯示了完成的樹狀圖。<ref name="Swofford1996">{{Cite book|title=Phylogenetic inference|url=https://archive.org/details/molecularsystema0000unse_a8v5|last=Swofford|first=David L.|last2=Olsen|first2=Gary J.|last3=Waddell|first3=Peter J.|last4=Hillis|first4=David M.|name-list-style=vanc|year=1996|isbn=9780878932825|editor-last=Hillis|editor-first4=Craig|publisher=Sinauer|location=Sunderland, MA|pages=[https://archive.org/details/molecularsystema0000unse_a8v5/page/n428 407]–514}}</ref>它是超度量的，所有尖端（ <math>a</math>到<math>e</math> ) 與<math>r</math>等距離 :

<math>\delta(a,r)=\delta(b,r)=\delta(e,r)=\delta(c,r)=\delta(d,r)=16.5</math>

這個樹狀圖的根是它最深的節點<math>r</math> 。

== 時間複雜度 ==
構建 UPGMA 樹的算法有<math>O(n^3)</math>時間複雜度。使用一個堆維護兩個即群之間的距離可以使時間達到<math>O(n^2 \log n)</math> . 另外Fionn Murtagh 提出了一個<math>O(n^2)</math>時空複雜度的算法。 <ref>{{Cite journal |vauthors=Murtagh F |year=1984 |title=Complexities of Hierarchic Clustering Algorithms: the state of the art |journal=Computational Statistics Quarterly |volume=1 |page=101–113}}</ref>

== See also ==

* [[邻接法|鄰接法]]
* [[聚类分析|集群分析]]
* 单链接聚类
* 完全连锁聚类
* [[层次聚类|層次即群]]
* DNA进化模型
* [[分子鐘]]
{{Reflist}}

== 外部連結 ==

* [https://github.com/SergioFierens/ai4r/blob/master/lib/ai4r/clusterers/average_linkage.rb UPGMA聚类算法在Ruby中的实现（AI4R）] {{Wayback|url=https://github.com/SergioFierens/ai4r/blob/master/lib/ai4r/clusterers/average_linkage.rb |date=20200727170323 }}
* [https://books.google.com/books?id=KBoHuoNRO5MC&dq=UPGMA+clustering&pg=PA319 使用相似矩阵计算 UPGMA 的示例] {{Wayback|url=https://books.google.com/books?id=KBoHuoNRO5MC&dq=UPGMA+clustering&pg=PA319 |date=20230423123129 }}
* [http://www.slimsuite.unsw.edu.au/teaching/upgma/ 使用距离矩阵计算 UPGMA 的示例] {{Wayback|url=http://www.slimsuite.unsw.edu.au/teaching/upgma/ |date=20230302094209 }}
[[Category:系统发生学]]
[[Category:计算系统发生学]]
[[Category:生物信息学算法]]