Behavioral Data Mining

CS294-1 Spring 2013: Behavioral Data Mining

Instructor: John Canny
CCN: 26922

M-W 9-10:30am in 306 Soda Hall

This is a course about large-scale mining of behavioral data - data generated by people. Examples include the web itself, social media (Facebook, Twitter, LiveJournal), digital mega-libraries, shopping (Amazon, eBay), tagging (Flickr, Digg), repositories (Wikipedia, Stack Overflow), MOOC data, server logs, and recommenders (Netflix, Amazon, etc.). These datasets have an enormous variety of potential uses in health care, government, education, commerce, etc. They provide previously unavailable opportunities to understand human behavior and to provide better services to people. The course will be hands-on and will include several assignments on large datasets and a final project. The course covers data mining algorithms from general machine learning, causal analysis, social networks, and natural language processing. There will be modest coverage of systems issues: big data toolkits and their affordances, some recent innovations from scientific computing, and GPU programming. Students should have some familiarity with Java or Scala, and will be working with Hadoop and/or Spark, MATLAB, and BIDMat/BIDMach. Projects will have access to a large number of experimental datasets totaling approximately 6 TB compressed.

Tentative Outline:

Introduction, example problems, open research questions

Basic statistics, bias-variance tradeoffs, Naïve Bayes classifier

Regression and generalized linear models (GLMs)

Causal analysis: Matching, propensity scores

About people - power laws, traits, social network structure

Performance measurement, significance tests, resampling and bootstrap

Map-Reduce: Hadoop, Spark

Machine biology - bandwidth hierarchy, caching, communication networks

Scientific computing design patterns: communication-minimizing algorithms

GPU architecture and programming

Text and metadata indexing and retrieval, XML content-bases

Natural language processing: dependency parsers

Natural language processing: alignment and chart parsers


Excavating - crawling, web services, processing streams

Query languages and architectures - BIDMat/BIDMach, GraphLab, Hyracks, DryadLINQ, …

Factor models - Naïve Bayes, LDA, GaP


Clustering - k-means, PAM, generative models

Causal analysis: Targeted Maximum-Likelihood Estimation

Causal graphical models

Prediction - kNN, kd-trees, kNC

Network algorithms - HITS and PageRank

Network algorithms - diffusion

Visualizing large datasets
