Stuff
Machine Learning, Graph Theory, UMLS and all that...
Thursday, October 30, 2008
Thursday, September 11, 2008
Terminology Abstraction
One of the common problems in terminology-driven data analysis is to perform abstraction to get a "big-picture" view of the data. Further, if there is mapping involved, the problem of abstraction on source versus target terminology adds another layer of complexity.
DATA SETTING
- A terminology with terms organized in a form of hierarchy
- A set of data points annotated with the terms
OPTIMAL ABSTRACTION
- Minimizing the number of "parent climbs" and maximizing the Annotations/term score.
ALGORITHM
Greedy Approach:
- Choose the parent climbs that maximize the immediate step annotations/term
Heuristic Approach:
- First climb the parents with max dist to root.
- Equalize the "average" distance to root
Labels: abstraction, terminology
Wednesday, April 16, 2008
Complex Eigenvec
I m noticing complex values in the eigen vectors.. and getting errors about the matrix being singular.. (one that does not have a inverse)
Eigen vectors and values can be complex because of the reasons explained in here
To overcome the issue the matrix has to be symmetric matrix. So basically I have to treat all the edges in the MRREL as undirected.
Friday, April 11, 2008
LaplaceEigen on UMLS Concept
So Laplace of a Graph
degree of u if (u=v or diagonal elements)
L (u, v)= -1 if u is connected to v
0 otherwise
Calculating eigen values of this graph has interesting characteristics that can be exploited
laplace_eigen(G, n) : we use the same dimensions and there is no reduction in dimensionality
The eigen values with low values are "important" and smooth functions.. In "fourier" terminology low frequency eigen vectors are important in graph laplacian
Experiment:
My hypothesis that UMLS concepts with low eigen values will be the "informational hub concepts" as they are important connectors -- I can corroborate it with the # of source vocabs
Using the f(u)
Tuesday, April 01, 2008
PCA versus NMF
Towards trying to analyze the graph regularization problem. I m looking into PCA and LSA and stuff
Modeling link and content using a single model is a challenge: http://www.dbs.informatik.uni-muenchen.de/~yu_k/sigir2007_mf.pdf
Generally Non-negative Matrix factorization is better than PCA because (http://www.cs.huji.ac.il/~shashua/papers/spca-nips06.pdf)
"PCA is attractive for a number of reasons. The maximum variance property provides a way to
compress the data with minimal information loss. In fact, the principal vectors provide the closest
(in least squares sense) linear subspace to the data. Second, the representation of the data in the
projected space is uncorrelated, which is a useful property for subsequent statistical analysis. Third,
the PCA decomposition can be achieved via an eigenvalue decomposition of the data covariance
matrix.
Two particular drawbacks of PCA are the lack of sparseness of the principal vectors, i.e., all the data
coordinates participate in the linear combination, and the fact that the linear combination may mix
both positive and negative weights, which might partly cancel each other. The purpose of our work
is to incorporate both nonnegativity and sparseness into PCA, maintaining the maximal variance
property of PCA."
this compares NMF and PCA for image classification: http://ieeexplore.ieee.org/iel5/8091/22464/01048251.pdf?tp=&arnumber=1048251&isnumber=22464
To calculate NMF using Matlab: http://www.bsp.brain.riken.jp/ICALAB/nmflab.html
Friday, January 04, 2008
USSR dataset for relBN approach
The idea is to create a "topologically sorted" and directed graphical model representation of the META and then use the N-step dataset to learn the probabilities
N-step connections
C1 C2 C3...... C1.7milion
1 0 1 1 ...
2 1 0 1 ...
then learn the "joint" probabilities from the partial data
P(C1=0/C2=0, C3=0) = ....
then show that
P(N+1-step conn) is higher than P(N-step conn)
Adv.
- we ll have lot of training records.. since thse are sampled from the UMLS itself
- does not require external dataset to test
Labels: bayesian networks, ussr
Thursday, December 20, 2007
Directed UMLS Graph
In order to run probabilistic inference algorithms, it is necessary to model the UMLS META into a directed graph. One of the ways to convert a graph with cycles into a directed one is by using Topological Sort
TOPO_SORT
1. Sort the edges based on some criteria.
2. First add the most important edge and repeat until you notice a cycle..
3. Drop the edge that causes a cycle in the graph
In the UMLS, we can drop one of the bi-directional edges.. so for example,
PAR, CHD.. we just need one.. and I think using CHD makes sense as you want the probabilities to flow from the Parent to Child and not vice-versa...
To short-list the relationships necessary
CHD
RL
RN
RO
RQ
SY
drop all other edges...
To perform topological sort.. sort the edges based on # of source asserted relationships.. and then start creating the directed graph.
Labels: topological sort, umls
