Kategori arşivi: Veri Madenciliği

dijkstra algoritması python

Dijkstra algoritması, Veri Madenciliğinde (Optimizasyonda) kullanılan Dna’ların benzemezliklerine (benzerliklerine) göre kümeleme algoritmasının Python kodu. (Dijkstra’s Algorithm, Data Mining, Optimization, Clustering Dissimilarity Matrix Calculation)

Ödev Soru 2: (pdf)

A group of plant biologists has identified 12 new species in a restricted geographical area. Based on a DNA sequence study, they have calculated a dissimilarity index (out of 100) for each pair. The resulting dissimilarity matrix is given below. They would like to understand how these species are related with each other. Are they all closely related to each other? Or, are they related to each other in group(s)? In order to answer these questions, they aim to cluster/organize them in groups for which
* dissimilarity within the group is minimal, and
* dissimilarity between the groups is maximal.
To achieve this goal, they use the following procedure:
Step 1. Given a threshold value (D) for dissimilarity index, construct a dissimilarity network where there is an arc between any two species with a dissimilarity index less than or equal to D. Identify each connected component and declare them as groups.
Step 2. Within each group construct a minimum spanning tree of dissimilarity so that the group’s similarity score can be calculated as the ratio of number of species in the group to the total dissimilarity index of the corresponding spanning tree.
As a result of this study, they will also identify a viable threshold value D.
Implement the above classification algorithm using a generic programming language (such as Python) for different values of D, identify each class and calculate the group similarity index for at least three different threshold values where the elements of the clusters and/or clusters’ similarity scores change.

dissimilarity_matrix

dissimilarity_cluster.py


# similarity cluster

import sets

dissimilarity_cluster = [
  [-1,50,21,41,44,18,91,12,90,77,89,31],
  [50,-1,81,88,36,70,19,49,67,100,35,74],
  [21,81,-1,40,98,79,65,84,60,15,37,60],
  [41,88,40,-1,18,49,31,96,78,93,41,38],
  [44,36,98,18,-1,36,77,94,70,52,49,31],
  [18,70,79,49,36,-1,87,40,15,90,0,46],
  [91,19,65,31,77,87,-1,40,31,34,22,60],
  [12,49,84,96,94,40,40,-1,15,25,3,87],
  [90,67,60,78,70,15,31,15,-1,60,17,100],
  [77,100,15,93,52,90,34,25,60,-1,21,83],
  [89,35,37,41,49,0,22,3,17,21,-1,15],
  [31,74,60,38,31,46,60,87,100,83,15,-1]
]

t = 15 # threshold

items = sets.Set() # 1..12

for i in range(1,len(dissimilarity_cluster)+1):
  items.add(i)

tree_list_set = [] # result list of cluster sets
tls = 0

comps = sets.Set() # compare set
comps.add(items.pop())

new_added_flag = 1

while (len(items)>0):
  # print items
  # print comps
  # print tls
  if (new_added_flag==0):
    tree_list_set.append( comps.copy() )
    comps.clear()
    comps.add(items.pop())
    new_added_flag = 1
    tls += 1
  else:
    new_added_flag = 0
  items_tmp = items.copy()
  comps_tmp = comps.copy()
  for i in comps_tmp:
    items_tmp = items.copy()
    comps_tmp = comps.copy()
    for j in items_tmp:
      if dissimilarity_cluster[i-1][j-1]<=t:
        # print "i: "+str(i)+"   j: "+str(j)
        items.remove(j)
        comps.add(j)
        new_added_flag = 1

if len(comps)>0:
  tree_list_set.append( comps.copy() )
  tls += 1

print tree_list_set