Student Column「IST Lounge」
Combinatorial View of Big Data
Division of Computer Science and Information Technology
Research Group of Knowledge Software Science
Laboratory of Knowledge Base, first year of the doctoral course (DC)
(Nationality: China, Year of enrollment: 2015)
In recent years, big data has become a hot topic in the mass media. With the appearance of social networks such as Twitter and Facebook, people can share their ideas much more easily. Back in 2007, only 5,000 tweets were sent per day; now, in 2015, the number is 340,000,000 tweets per day. These tweets certainly contain interesting information, but it is impossible to read through all 340 million of them. Sometimes we just want to look at the general trend. For example, we can count how many times each noun appears in the tweets; the counts tell us how much each topic is being talked about. Most of the time, statistical methods like this are a good way to find general trends. Sometimes, however, general trends are not what we are interested in. When a bill is sent to Congress, it is not enough to hear only from its supporters; we usually also want to hear the voices from other viewpoints. That is when we have to dig into the data instead of looking at general trends.
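As a small illustration of the counting idea, the sketch below tallies in how many tweets each noun appears. The sample tweets and the hand-written noun list are assumptions made only for this example; a real system would need a proper part-of-speech tagger to find the nouns.

```python
from collections import Counter

# Hypothetical sample tweets, invented for this illustration.
TWEETS = [
    "the new tax bill helps families",
    "this bill raises the tax on fuel",
    "fuel prices hurt families",
]

# Assumption: a fixed noun list stands in for a real part-of-speech tagger.
NOUNS = {"tax", "bill", "families", "fuel", "prices"}


def count_nouns(tweets, nouns):
    """Count, for each noun, the number of tweets that mention it."""
    counts = Counter()
    for tweet in tweets:
        words = set(tweet.lower().split())
        # Intersecting with the noun set counts each noun once per tweet.
        counts.update(words & nouns)
    return counts


noun_counts = count_nouns(TWEETS, NOUNS)
for noun, count in noun_counts.most_common():
    print(noun, count)
```

Counts like these show which topics are popular, but they say nothing about who holds which viewpoint on a topic.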
My research is mainly about finding these "viewpoints" with combinatorial approaches. Unlike statistical approaches, which are good at giving a rough view, combinatorial approaches usually consider all possible combinations. In other words, combinatorial methods do not miss anything. Because I construct graphs from the data, my research naturally connects to finding interesting combinations in graphs. Take the tweets as an example again: let each tweet be a vertex, and connect two vertices (tweets) if they share some nouns. If there are groups in which most pairs of vertices are connected, we can say that each of these groups has a common topic. For one topic (a noun here), we may find several groups, each with its own viewpoint on the topic. It is easy to imagine that for big data the graph will be extremely large, and it is impossible to check all the combinations in such a big graph. The good news is that not all of the data is useful. The point of the research is to exclude useless or trivial combinations so that we can focus on what we really want. This task has two main parts. The first is to find and characterize the features of the target, that is, to describe them in the language of graph theory. Then, with this characterization, it becomes possible to design an algorithm for finding the target. A good algorithm not only finds the target but also avoids useless search. With a proper characterization and algorithm design, combinatorial methods become useful for big data.
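The tweet graph described above can be sketched in a few lines. This is a minimal illustration and not the algorithm used in the research itself. The noun sets are made-up sample data. As a simplification, the code looks for maximal cliques (groups in which all pairs are connected) rather than groups in which only most pairs are connected. It uses the well-known Bron-Kerbosch algorithm with pivoting, where the pivot is one simple way to skip branches that cannot lead to a new maximal group.

```python
from itertools import combinations

# Hypothetical data: tweet id -> set of nouns found in that tweet.
TWEET_NOUNS = {
    0: {"bill", "tax"},
    1: {"bill", "tax", "fuel"},
    2: {"bill", "fuel"},
    3: {"bill", "school"},
    4: {"school", "teacher"},
}


def build_graph(tweet_nouns):
    """Connect two tweets (vertices) if they share at least one noun."""
    adj = {tweet: set() for tweet in tweet_nouns}
    for a, b in combinations(tweet_nouns, 2):
        if tweet_nouns[a] & tweet_nouns[b]:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques (Bron-Kerbosch with pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            # Nothing can extend the current group, so it is maximal.
            cliques.append(frozenset(current))
            return
        # Neighbours of the pivot are skipped here: any maximal clique
        # containing one of them is reached through the pivot or through
        # one of the pivot's non-neighbours.
        pivot = max(candidates | excluded,
                    key=lambda v: len(adj[v] & candidates))
        for v in list(candidates - adj[pivot]):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques


graph = build_graph(TWEET_NOUNS)
groups = maximal_cliques(graph)
for group in groups:
    print(sorted(group))
```

On this sample data the tweets mentioning "bill" form one group and the two tweets sharing "school" form another. On a real, very large graph, exhaustive enumeration like this quickly becomes too expensive, which is why characterizing the target and pruning useless combinations matter.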
Life in the laboratory is interesting and challenging. Students are free to choose their own topics and to solve problems in their own way. Students who are new to the laboratory have enough time to read and study, so that they can find what really interests them. The weekly meeting is another source of new ideas. Students do not simply report their progress; they also share their thoughts with the rest of the laboratory. The weekly meeting is usually not "talk-and-listen" but, most of the time, "talk-and-discuss": anyone can ask a question at any time and discuss any related topic. Like the weekly meeting, other aspects of laboratory life also leave students a lot of freedom. During your research, you will not be asked to learn a specific programming language or library. You can also work in your favorite environment at whatever time you like. Thus, doing research is not only about producing ideas; it is also a process of solving problems by exploring the things you like. Take myself as an example: besides implementing the algorithms for my research, I have worked on many other things. During my master's course, I learned three new programming languages: D, Go and OOC. I also took part in the development of the compiler for the OOC programming language, and I am working on a game engine and a parser generator. You can find my activities on my GitHub page (https://github.com/zhaihj?tab=repositories). In my opinion, working on the areas that interest you is not a waste of time, but a good way to stay motivated and to encounter new ideas.