Note
Click here to download the full example code or to run this example in your browser via Binder
Biclustering documents with the Spectral Co-clustering algorithm
This example demonstrates the Spectral Co-clustering algorithm on the twenty newsgroups dataset. The "comp.os.ms-windows.misc" category is excluded because it contains many posts that consist only of data.
The TF-IDF vectorized posts form a word-frequency matrix, which is then biclustered using Dhillon's Spectral Co-Clustering algorithm. The resulting document-word biclusters indicate the subsets of words that are used more often in those subsets of documents.
For a few of the best biclusters, their most common document categories and their ten most important words are printed. The best biclusters are determined by their normalized cut. The best words are determined by comparing their sums inside and outside the bicluster.
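Concretely, writing R for a bicluster's document rows and C for its word columns, the score computed by the bicluster_ncut helper below is

    cut(R, C) / weight(R, C)

where weight(R, C) is the sum of the matrix entries inside the bicluster, and cut(R, C) is the sum of the entries that lie in rows R outside columns C or in columns C outside rows R; lower values indicate better biclusters.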
For comparison, the documents are also clustered using MiniBatchKMeans. The document clusters derived from the biclusters achieve a better V-measure than the clusters found by MiniBatchKMeans.
Out:
Vectorizing...
Coclustering...
Done in 5.86s. V-measure: 0.4431
MiniBatchKMeans...
Done in 7.08s. V-measure: 0.3344
Best biclusters:
----------------
bicluster 0 : 1961 documents, 4388 words
categories : 23% talk.politics.guns, 18% talk.politics.misc, 17% sci.med
words : gun, geb, guns, banks, gordon, clinton, pitt, cdt, surrender, veal
bicluster 1 : 1269 documents, 3558 words
categories : 27% soc.religion.christian, 25% talk.politics.mideast, 24% alt.atheism
words : god, jesus, christians, sin, objective, kent, belief, christ, faith, moral
bicluster 2 : 2201 documents, 2747 words
categories : 18% comp.sys.mac.hardware, 17% comp.sys.ibm.pc.hardware, 16% comp.graphics
words : voltage, board, dsp, packages, receiver, stereo, shipping, package, compression, image
bicluster 3 : 1773 documents, 2620 words
categories : 27% rec.motorcycles, 23% rec.autos, 13% misc.forsale
words : bike, car, dod, engine, motorcycle, ride, honda, bikes, helmet, bmw
bicluster 4 : 201 documents, 1175 words
categories : 81% talk.politics.mideast, 10% alt.atheism, 7% soc.religion.christian
words : turkish, armenia, armenian, armenians, turks, petch, sera, zuma, argic, gvg47
from collections import defaultdict
import operator
from time import time
import numpy as np
from sklearn.cluster import SpectralCoclustering
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.cluster import v_measure_score
print(__doc__)
def number_normalizer(tokens):
    """ Map all numeric tokens to a placeholder.

    For many applications, tokens that begin with a number are not directly
    useful, but the fact that such a token exists can be relevant. By applying
    this form of dimensionality reduction, some methods may perform better.
    """
    return ("#NUMBER" if token[0].isdigit() else token for token in tokens)
class NumberNormalizingVectorizer(TfidfVectorizer):
    def build_tokenizer(self):
        tokenize = super().build_tokenizer()
        return lambda doc: list(number_normalizer(tokenize(doc)))
# exclude 'comp.os.ms-windows.misc'
categories = ['alt.atheism', 'comp.graphics',
              'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
              'comp.windows.x', 'misc.forsale', 'rec.autos',
              'rec.motorcycles', 'rec.sport.baseball',
              'rec.sport.hockey', 'sci.crypt', 'sci.electronics',
              'sci.med', 'sci.space', 'soc.religion.christian',
              'talk.politics.guns', 'talk.politics.mideast',
              'talk.politics.misc', 'talk.religion.misc']
newsgroups = fetch_20newsgroups(categories=categories)
y_true = newsgroups.target
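# Build the vectorizer and the two clustering models; n_clusters is set to
# the number of selected newsgroup categories for both the co-clustering and
# the MiniBatchKMeans baseline.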
vectorizer = NumberNormalizingVectorizer(stop_words='english', min_df=5)
cocluster = SpectralCoclustering(n_clusters=len(categories),
                                 svd_method='arpack', random_state=0)
kmeans = MiniBatchKMeans(n_clusters=len(categories), batch_size=20000,
                         random_state=0)
print("Vectorizing...")
X = vectorizer.fit_transform(newsgroups.data)
print("Coclustering...")
start_time = time()
cocluster.fit(X)
y_cocluster = cocluster.row_labels_
print("Done in {:.2f}s. V-measure: {:.4f}".format(
time() - start_time,
v_measure_score(y_cocluster, y_true)))
print("MiniBatchKMeans...")
start_time = time()
y_kmeans = kmeans.fit_predict(X)
print("Done in {:.2f}s. V-measure: {:.4f}".format(
time() - start_time,
v_measure_score(y_kmeans, y_true)))
feature_names = vectorizer.get_feature_names()
document_names = list(newsgroups.target_names[i] for i in newsgroups.target)
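# feature_names maps the columns of X back to tokens, and document_names maps
# each document to its newsgroup name; both are used for reporting below.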
def bicluster_ncut(i):
    """Return the normalized cut of bicluster ``i`` (lower is better)."""
    rows, cols = cocluster.get_indices(i)
    if not (np.any(rows) and np.any(cols)):
        import sys
        return sys.float_info.max
    row_complement = np.nonzero(np.logical_not(cocluster.rows_[i]))[0]
    col_complement = np.nonzero(np.logical_not(cocluster.columns_[i]))[0]
    # Note: the following is identical to X[rows[:, np.newaxis],
    # cols].sum() but much faster in scipy <= 0.16
    weight = X[rows][:, cols].sum()
    cut = (X[row_complement][:, cols].sum() +
           X[rows][:, col_complement].sum())
    return cut / weight
def most_common(d):
    """Items of a defaultdict(int) with the highest values.

    Like Counter.most_common in Python >=2.7.
    """
    return sorted(d.items(), key=operator.itemgetter(1), reverse=True)
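# Score every bicluster by its normalized cut and keep the five lowest
# (i.e. best) for reporting.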
bicluster_ncuts = list(bicluster_ncut(i)
                       for i in range(len(newsgroups.target_names)))
best_idx = np.argsort(bicluster_ncuts)[:5]
print()
print("Best biclusters:")
print("----------------")
for idx, cluster in enumerate(best_idx):
    n_rows, n_cols = cocluster.get_shape(cluster)
    cluster_docs, cluster_words = cocluster.get_indices(cluster)
    if not len(cluster_docs) or not len(cluster_words):
        continue

    # categories
    counter = defaultdict(int)
    for i in cluster_docs:
        counter[document_names[i]] += 1
    cat_string = ", ".join("{:.0f}% {}".format(float(c) / n_rows * 100, name)
                           for name, c in most_common(counter)[:3])

    # words
    out_of_cluster_docs = cocluster.row_labels_ != cluster
    out_of_cluster_docs = np.where(out_of_cluster_docs)[0]
    word_col = X[:, cluster_words]
    word_scores = np.array(word_col[cluster_docs, :].sum(axis=0) -
                           word_col[out_of_cluster_docs, :].sum(axis=0))
    word_scores = word_scores.ravel()
    important_words = list(feature_names[cluster_words[i]]
                           for i in word_scores.argsort()[:-11:-1])

    print("bicluster {} : {} documents, {} words".format(
        idx, n_rows, n_cols))
    print("categories : {}".format(cat_string))
    print("words : {}\n".format(', '.join(important_words)))
Total running time of the script: (1 minute 55.015 seconds)