Net Journal 5

Research on semi-structured data mining to find the relationships among pieces of knowledge and data

---- In establishing the theory of knowledge creation, the research on semi-structured mining that you are conducting, Dr. Arimura, is playing an important role.

Dr. Arimura: The Internet is said to contain data on an exabyte scale (i.e., on the order of 10 to the 18th power bytes), but most of this is simply an enormous accumulation of individual pieces of knowledge. It is very difficult for a single researcher to understand all the pieces in this mixture of knowledge, relate the various fields involved to one another, and conduct cross-disciplinary research. My research theme is therefore the semi-automatic extraction of useful patterns and rules to support human knowledge acquisition. Through the research adopted for the 2001-2005 Grant-in-Aid for Scientific Research on the Priority Area of Information Sciences and the study entitled "Efficient Pattern Discovery from Massive Semi-Structured Data for Knowledge Infrastructure Formation on the Web" (adopted for the 2005-2007 Grant-in-Aid for Specially Promoted Research), we are conducting research and development on technology that finds specific rules and patterns in web pages, XML data and other semi-structured information sources (Ex. 2). These efforts have produced a tree-mining algorithm known as FREQT, which discovers characteristic substructures in a collection of tree-structured data as small tree patterns (Ex. 3/Fig. 1).

Semi-structured data mining is considered useful in biotechnology and nanotechnology research. In genome analysis, for example, each combination of elements such as genes, amino acids, or proteins has until now had to be verified in order to extract, classify and analyze the functions and expression patterns of individual genomes. With semi-structured data mining, it becomes possible to compare all the genomes simultaneously and extract their common rules and patterns. The FREQT algorithm has already been confirmed to perform stably on real data, and it is expected to be fully suitable for commercial use as well as for biotechnology, nanotechnology and other research fields. We intend to make these mining engines available through the Internet and other media to support the discovery and coordination of knowledge.
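
The idea of discovering small patterns that recur across a collection of tree-structured data can be illustrated with a toy sketch. This is not the FREQT algorithm itself: it handles only path-shaped patterns (chains of labels from a node down to a descendant), whereas FREQT is described above as finding general small tree patterns. The tree representation as `(label, children)` tuples, the function names `downward_paths` and `frequent_paths`, and the sample documents are all assumptions made for illustration.

```python
from collections import Counter

# Assumed representation: a tree is a (label, [child trees]) tuple.

def downward_paths(tree, max_len):
    """Yield every downward label path of length 1..max_len starting at any node."""
    def from_node(node, depth):
        label, children = node
        yield (label,)
        if depth < max_len:
            for child in children:
                for tail in from_node(child, depth + 1):
                    yield (label,) + tail

    # Paths that start at this node, then paths that start further down.
    yield from from_node(tree, 1)
    for child in tree[1]:
        yield from downward_paths(child, max_len)

def frequent_paths(trees, min_support, max_len=3):
    """Return the paths that occur in at least min_support of the trees."""
    counts = Counter()
    for tree in trees:
        # A set ensures each tree contributes at most once per pattern.
        counts.update(set(downward_paths(tree, max_len)))
    return {path: n for path, n in counts.items() if n >= min_support}

# Hypothetical sample: three small HTML-like documents.
trees = [
    ("html", [("body", [("table", [("tr", []), ("tr", [])])])]),
    ("html", [("body", [("table", [("tr", [])]), ("p", [])])]),
    ("html", [("body", [("p", [])])]),
]

for path, support in sorted(frequent_paths(trees, min_support=2).items()):
    print(" > ".join(path), support)
```

Here the support of a pattern is the number of documents that contain it at least once, and `min_support` is the threshold above which a pattern counts as frequent. Raising `min_support` to 3 leaves only the patterns shared by every sample document.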
