This page shows a few examples of clustering web content, using algorithms and visualisations that we developed during my time at Netbreeze. Clustering means dividing a large set into many smaller groups of closely related items (in our case these are mostly words or phrases). We implemented several clustering algorithms for documents, words, adjectives, phrases, products, and even drug ingredients that appear on the internet. Using one CPU, we were able to cluster the interesting terms of around one million web documents on user request (i.e. within a few seconds of computing time). The dynamic visualisation examples below show the relations between terms, as well as their importance (item size) and trend (item color). The visualisation applet was built using Prefuse.
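To make the idea of grouping closely related terms concrete, here is a minimal sketch in Python. It is not the algorithm used at Netbreeze, which this page does not describe. It assumes that two terms are "related" when they occur in largely the same documents (Jaccard similarity) and merges such terms into one cluster. The function name `cluster_terms`, the `min_similarity` threshold, and the sample documents are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def cluster_terms(documents, min_similarity=0.5):
    """Group terms whose document sets overlap strongly (Jaccard similarity)."""
    # Map each term to the set of document indices it occurs in.
    occurrences = defaultdict(set)
    for doc_id, terms in enumerate(documents):
        for term in terms:
            occurrences[term].add(doc_id)

    # Union-find: every term starts in its own cluster.
    parent = {term: term for term in occurrences}

    def find(term):
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    # Merge the clusters of two terms when their similarity is high enough.
    for a, b in combinations(sorted(occurrences), 2):
        shared = len(occurrences[a] & occurrences[b])
        total = len(occurrences[a] | occurrences[b])
        if shared / total >= min_similarity:
            parent[find(a)] = find(b)

    # Collect the terms of each cluster in alphabetical order.
    clusters = defaultdict(list)
    for term in sorted(occurrences):
        clusters[find(term)].append(term)
    return sorted(clusters.values())


# Hypothetical sample data: each document is reduced to its set of terms.
documents = [
    {"climate", "warming", "emissions"},
    {"climate", "warming"},
    {"heart", "attack", "aspirin"},
    {"heart", "attack"},
]
clusters = cluster_terms(documents)
print(clusters)
```

This sketch compares every pair of terms, so it only suits small vocabularies. Clustering the terms of a million documents within seconds would need a far more selective approach.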
Clustering of trendy phrases related to climate change (Data source: a selection of internet documents containing 'climate change' up to May 2007):
Applet and Data © by Netbreeze GmbH
Meaning of the visualisation:
The size of an item represents how frequently it occurred in the given set of documents, and the color of an item represents its trendiness (i.e. how much use of the term has increased or decreased lately).
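The mapping from frequency to size and from trend to color could be sketched as follows. All details here are assumptions made for illustration: the page does not state how trendiness was computed, how sizes were scaled, or which colors the applet used. The sketch takes the trend as the relative change between two periods, scales size with the square root of the frequency, and blends from blue (falling) through grey (stable) to red (rising).

```python
import math


def trend(recent_count, earlier_count):
    """Relative change in frequency between two periods, clamped to [-1, 1]."""
    if earlier_count == 0:
        return 1.0 if recent_count > 0 else 0.0
    change = (recent_count - earlier_count) / earlier_count
    return max(-1.0, min(1.0, change))


def item_size(frequency, max_frequency, min_px=8.0, max_px=40.0):
    """Square-root scaling keeps very frequent items from dwarfing the rest."""
    if max_frequency <= 0:
        return min_px
    return min_px + (max_px - min_px) * math.sqrt(frequency / max_frequency)


def item_color(trend_value):
    """Blend from blue (falling) through grey (stable) to red (rising) as RGB."""
    grey, blue, red = (128, 128, 128), (0, 0, 255), (255, 0, 0)
    target = red if trend_value >= 0 else blue
    weight = abs(trend_value)
    return tuple(round(g + (t - g) * weight) for g, t in zip(grey, target))


# Hypothetical counts: a term seen 150 times recently and 100 times earlier.
rising = trend(150, 100)
print(item_size(25, 100), item_color(rising))
```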
How to use the applet:
Use the two slider controls at the bottom of the applet to adjust the number of items displayed per cluster and the maximum distance, measured from a selected node, within which items are displayed. You can also zoom into the visualisation with your mouse scroll wheel and drag any item with the mouse. Clicking on the hidden control panel on the right lets you adjust further options, such as the gravitational force and the distance between the nodes.
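The two slider filters can be illustrated with a small sketch. It assumes that "distance" means the number of links between two nodes in the term graph, which the page does not state explicitly, and it is written in plain Python rather than with the Prefuse API. The function names and the sample graph are hypothetical.

```python
from collections import deque


def nodes_within_distance(graph, start, max_distance):
    """Breadth-first search: hop distance of every node within max_distance of start."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] >= max_distance:
            continue  # do not expand beyond the distance limit
        for neighbour in graph.get(node, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


def top_items_per_cluster(clusters, frequency, max_items):
    """Keep only the max_items most frequent items of each cluster."""
    return [
        sorted(cluster, key=lambda item: (-frequency[item], item))[:max_items]
        for cluster in clusters
    ]


# Hypothetical term graph stored as adjacency lists.
graph = {
    "climate change": ["global warming", "emissions"],
    "global warming": ["climate change", "sea level"],
    "emissions": ["climate change"],
    "sea level": ["global warming"],
}
visible = nodes_within_distance(graph, "climate change", 1)
print(sorted(visible))
```

Raising the distance limit to 2 would also reveal "sea level", which is two links away from the selected node.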
Clustering of phrases linked to heart attack (Data from May 2007):
Applet and Data © by Netbreeze GmbH
Clustering of drug ingredients based on internet data (Data from May 2007):
Applet and Data © by Netbreeze GmbH
Links:
Netbreeze GmbH is a Swiss company building knowledge generators based on internet data.
Prefuse interactive information visualization toolkit (very nice Java open source toolkit).