Thursday, October 30, 2008

Abstraction Learning

http://archive.nyu.edu/bitstream/2451/14247/1/IS-93-11.pdf

http://www.springerlink.com/content/n285q6878mgg8x01/fulltext.pdf

Thursday, September 11, 2008

Terminology Abstraction

One of the common problems in terminology-driven data analysis is to perform abstraction to get a "big-picture" view of the data. Further, if there is mapping involved, the problem of abstraction on source versus target terminology adds another layer of complexity. 

DATA SETTING
- A terminology with terms organized in a form of hierarchy
- A set of data points annotated with the terms

OPTIMAL ABSTRACTION
- Minimizing the number of "parent climbs" and maximizing the Annotations/term score.

ALGORITHM

Greedy Approach:
- Choose the parent climbs that maximize the immediate step annotations/term

Heuristic Approach:
- First climb the parents with max dist to root. 
- Equalize the "average" distance to root 

Labels: ,

Wednesday, April 16, 2008

Complex Eigenvec

I m noticing complex values in the eigen vectors.. and getting errors about the matrix being singular.. (one that does not have a inverse)

Eigen vectors and values can be complex because of the reasons explained in here 

To overcome the issue the matrix has to be symmetric matrix. So basically I have to treat all the edges in the MRREL as undirected.

Friday, April 11, 2008

LaplaceEigen on UMLS Concept

So Laplace of a Graph 

degree of u if (u=v or diagonal elements)
 L (u, v)=   -1 if u is connected to v
0 otherwise

Calculating eigen values of this graph has interesting characteristics that can be exploited

laplace_eigen(G, n) : we use the same dimensions  and there is no reduction in dimensionality

The eigen values with low values are "important" and smooth functions.. In "fourier" terminology low frequency eigen vectors are important in graph laplacian

Experiment:
My hypothesis that UMLS concepts with low eigen values will be the "informational hub concepts" as they are important connectors -- I can corroborate it with the # of source vocabs

Using the f(u)

Tuesday, April 01, 2008

PCA versus NMF

Towards trying to analyze the graph regularization problem. I m looking into PCA and LSA and stuff

Modeling link and content using a single model is a challenge: http://www.dbs.informatik.uni-muenchen.de/~yu_k/sigir2007_mf.pdf

Generally Non-negative Matrix factorization is better than PCA because (http://www.cs.huji.ac.il/~shashua/papers/spca-nips06.pdf)

"PCA is attractive for a number of reasons. The maximum variance property provides a way to
compress the data with minimal information loss. In fact, the principal vectors provide the closest
(in least squares sense) linear subspace to the data. Second, the representation of the data in the
projected space is uncorrelated, which is a useful property for subsequent statistical analysis. Third,
the PCA decomposition can be achieved via an eigenvalue decomposition of the data covariance
matrix.
Two particular drawbacks of PCA are the lack of sparseness of the principal vectors, i.e., all the data
coordinates participate in the linear combination, and the fact that the linear combination may mix
both positive and negative weights, which might partly cancel each other. The purpose of our work
is to incorporate both nonnegativity and sparseness into PCA, maintaining the maximal variance
property of PCA."

this compares NMF and PCA for image classification: http://ieeexplore.ieee.org/iel5/8091/22464/01048251.pdf?tp=&arnumber=1048251&isnumber=22464

To calculate NMF using Matlab: http://www.bsp.brain.riken.jp/ICALAB/nmflab.html

Friday, January 04, 2008

USSR dataset for relBN approach

The idea is to create a "topologically sorted" and directed graphical model representation of the META and then use the N-step dataset to learn the probabilities


N-step connections
C1 C2 C3...... C1.7milion
1 0 1 1 ...
2 1 0 1 ...

then learn the "joint" probabilities from the partial data

P(C1=0/C2=0, C3=0) = ....

then show that

P(N+1-step conn) is higher than P(N-step conn)


Adv.
- we ll have lot of training records.. since thse are sampled from the UMLS itself
- does not require external dataset to test

Labels: ,

Thursday, December 20, 2007

Directed UMLS Graph

In order to run probabilistic inference algorithms, it is necessary to model the UMLS META into a directed graph. One of the ways to convert a graph with cycles into a directed one is by using Topological Sort

TOPO_SORT
1. Sort the edges based on some criteria.
2. First add the most important edge and repeat until you notice a cycle..
3. Drop the edge that causes a cycle in the graph

In the UMLS, we can drop one of the bi-directional edges.. so for example,

PAR, CHD.. we just need one.. and I think using CHD makes sense as you want the probabilities to flow from the Parent to Child and not vice-versa...

To short-list the relationships necessary
CHD
RL
RN
RO
RQ
SY

drop all other edges...

To perform topological sort.. sort the edges based on # of source asserted relationships.. and then start creating the directed graph.

Labels: ,