KnowledgeBlog
Copyright 2002, Michael Bales

systems, information, data, concepts, knowledge, ideas, people, computers, the world, interrelationships

Archive, October 2001 - February 2002

An exciting new world
We are entering an exciting new world.  The growing amount of text on the web, paired with ever-faster processing speeds, gives us an unprecedented opportunity to analyze and represent human knowledge.  Automated techniques can be used to process large volumes of unstructured text.  When the resulting data are analyzed, associations and patterns emerge between concepts.  These methods are, by nature, imprecise; however, conclusions known to be especially imprecise can be disregarded.  The results could improve interdisciplinary collaboration and encourage scientific progress and discovery.

Michael Bales

Feel free to e-mail me.  I would love to hear from you!

mikeybales@yahoo.com

 

Sunday, February 3, 2002

OK, I’m thinking about practical ways of implementing Idea 13.  Imagine a set of 1000 short documents containing a total of 100,000 unique concepts.  The frequency of each concept is a known integer.  Each concept has a simple numeric identifier from 1 to 100,000.  Each concept then receives a power inversely related to its frequency in the document set.  (Less frequent occurrence gives a concept more power, because it will contribute more to the uniqueness of each document.)  Well, each document’s identifier is now simply a set consisting of the concepts it contains, each weighted more or less strongly according to its frequency.  Using conventional methods, a user wishing to retrieve a particular document can find the document quickly based on keywords.  However, I wish to link the concepts so that their meanings, or identifying power, influence each other.

A tabular representation of the first three concepts in the entire document set:

Concept     Frequency   Power
our         345         1/345
known       6           1/6
universe    1920        1/1920

A tabular representation of the first three concepts in document A:

Concept      # times in doc   Power of concept   Power of concept in identifying document
universe     110              1/1920             1/1920 * 110
without      14               1/3                1/3 * 14
boundaries   3                1/36               1/36 * 3

The start of Document A’s representation is as follows:

Universe 0.057; without 4.67; boundaries 0.083
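The arithmetic behind this representation can be sketched in a few lines of Python.  This is only a minimal sketch of the weighting described above, not a finished system: the function names are my own invention, and the corpus frequencies for "without" and "boundaries" are simply read off the powers in the Document A table (a power of 1/3 implies a frequency of 3).

```python
from fractions import Fraction


def concept_power(corpus_frequency):
    """Rarer concepts get more power: power = 1 / frequency in the document set."""
    return Fraction(1, corpus_frequency)


def document_representation(counts_in_doc, corpus_frequency):
    """Weight each concept by (times in document) * (power of concept)."""
    return {
        concept: float(count * concept_power(corpus_frequency[concept]))
        for concept, count in counts_in_doc.items()
    }


# Illustrative numbers taken from the Document A table (assumed, not measured).
corpus_frequency = {"universe": 1920, "without": 3, "boundaries": 36}
counts_in_doc_a = {"universe": 110, "without": 14, "boundaries": 3}

representation = document_representation(counts_in_doc_a, corpus_frequency)
for concept, weight in representation.items():
    print(f"{concept} {weight:.3f}")
```

Using exact fractions until the last step keeps the powers readable as 1/1920, 1/3, and so on, which matches how they are written in the tables.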

If we look at the representations of all the documents, we find certain patterns emerging.  The concept universe, when powerful in a given document, often occurs near the concepts star and galaxy.  We can see, from a purely mathematical standpoint, that these are related concepts.  Now, when someone starts talking about star, and the concept galaxy is also mentioned, we know the person is talking not about movie stars but about the celestial body.
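One possible way to look for these patterns is to count how often two concepts are both powerful in the same document.  The sketch below assumes each document is already a mapping from concept to weight; the threshold, the function names, and the three toy documents are all invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, threshold):
    """Count how often two concepts are both 'powerful' in the same document."""
    pair_counts = Counter()
    for weights in documents:
        strong = sorted(c for c, w in weights.items() if w >= threshold)
        pair_counts.update(combinations(strong, 2))
    return pair_counts


def related_concepts(pair_counts, concept):
    """Return concepts that co-occur with `concept`, most frequent first."""
    related = Counter()
    for (a, b), n in pair_counts.items():
        if a == concept:
            related[b] += n
        elif b == concept:
            related[a] += n
    return [c for c, _ in related.most_common()]


# Invented toy representations (concept -> weight), not real data.
documents = [
    {"universe": 0.9, "star": 0.8, "galaxy": 0.7, "movie": 0.01},
    {"star": 0.6, "galaxy": 0.9, "universe": 0.5},
    {"star": 0.7, "movie": 0.8, "actor": 0.9},
]
pairs = cooccurrence_counts(documents, threshold=0.4)
print(related_concepts(pairs, "star"))
```

In this toy set, "galaxy" and "universe" come out ahead of "movie" and "actor" as neighbors of "star", which is the kind of signal that could separate the celestial sense from the movie-star sense.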

Sunday, January 13, 2002

On the way home from the coffeehouse, I explained to Jessica how nice it is to get my thoughts onto paper, and she agreed that it's good to journal from time to time.  I asked, "Can you do that in Chinese?" and she replied that it's faster for her to write in English because "we don't have a good typing system."  I outlined one based on optical character recognition: it allows input on a touch screen, then looks at the characters and uses word frequency and contextual analysis to help present an array of likely choices for characters!  The user taps the screen once if the proposed character is the right one, or selects other characters that appear in boxes.

It strikes me that there should already be decent OCR software for printed Chinese; applied to handwritten input, this software would probably achieve a markedly lower level of accuracy, but would probably still be useful for typing in Chinese.  An improvement over current methods involving pinyin?  Maybe not… but it would allow the "average" Chinese person (with the touch-screen input device) to "type" e-mails.  Yes, what might sell is a small, inexpensive touch-screen device and corresponding OCR software.

Saturday, January 5, 2002

I've thought of another crazy idea.  This is really an idea that I first thought of when I was little -- maybe about 11 years old. It's basically a virtual Earth simulator, for simulated travel to anywhere in the world.  It dynamically gathers needed data from the Internet.  Users can fly from one world city to another, see the Great Pyramids, see other sights, or zoom down to the street level of any city or small town in the world.  Small towns are generated (ahead of time) based on existing GIS maps and aerial photography.  It includes details, like cars and people.  Navigation is like the navigation in Celestia: you can accelerate by an order of magnitude to cover great distances.

This system can develop over time, using GIS technology and other technologies.  To start, all you see is a sphere that represents the Earth.  Then we begin adding details, and you can see the oceans and landscape at a low resolution.  Then you can see where country boundaries are, and names of cities.  Then you can get closer to the cities, and you start to see the city map, and then you can see the buildings (simulated at first, then actual widths and heights, then a façade that represents the type of building, then, in selected cities, the true façade of the building).

Practical uses of this: Learn about other countries and places.  Go to Paris on a rainy day.  Find out that the Taj Mahal is on a river; see what it looks like from various angles.  See how close people live to the Taj Mahal.  Find out what places to eat are near the Washington Monument.

Make a map of human knowledge!  No one's done that yet!

I want a map of all human knowledge.  I want information.  I just heard about soft computing, but I don't really know what it is.  I don't want just somebody's idea of what it is -- I want the meta-idea of what it is based on all available information.  I want to know what concepts it's related to.  I want to be able to know meta information about those concepts.  I want to know how my topic is related to the other topics: the nature of the relationship.

I am going to start the map of all human knowledge right now.  This is a model that can grow (it is scalable and extensible).

Nut

This map of all human knowledge is dedicated to my mother on her birthday.  Now I am going to expand the model. 

Nut (9s345Sl4ttC): a piece of metal used to fasten things together.

Related concepts: bolt (9s345S14ak3), fastener (9s3ay2f).
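As a rough illustration of how this tiny model could be held in a program, here is a sketch in Python.  The class and method names are hypothetical (nothing here is an existing library), and the identifiers are copied as written above.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    identifier: str
    name: str
    definition: str = ""
    related: set = field(default_factory=set)  # identifiers of related concepts


class KnowledgeMap:
    """A growing collection of concepts and 'related to' links between them."""

    def __init__(self):
        self.concepts = {}

    def add(self, identifier, name, definition=""):
        self.concepts[identifier] = Concept(identifier, name, definition)

    def relate(self, id_a, id_b):
        """Record a symmetric 'related to' link between two concepts."""
        self.concepts[id_a].related.add(id_b)
        self.concepts[id_b].related.add(id_a)

    def related_names(self, identifier):
        """Names of the concepts linked to `identifier`, in alphabetical order."""
        linked = self.concepts[identifier].related
        return sorted(self.concepts[i].name for i in linked)


kmap = KnowledgeMap()
kmap.add("9s345Sl4ttC", "nut", "a piece of metal used to fasten things together")
kmap.add("9s345S14ak3", "bolt")
kmap.add("9s3ay2f", "fastener")
kmap.relate("9s345Sl4ttC", "9s345S14ak3")
kmap.relate("9s345Sl4ttC", "9s3ay2f")
print(kmap.related_names("9s345Sl4ttC"))
```

The links here carry no description of the nature of the relationship; a fuller model would presumably attach a label to each link.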

OK.  I think I know a system that I could really demonstrate.  To capture just 90% of concepts that people might search for, I would probably need about 500,000 random web pages.    If someone types "Britney Spears", I want my system to give a result.  If someone types "angel", I want my system to give a result.  If someone types "soft computing", I want my system to give a result… but unfortunately, this isn't in the top 90% so it wouldn't work in the demonstration system.  But anyway, I can make this system.

I need a concept extractor.  It pulls the data from the web pages.  In my model, each document is an entity containing many concepts and their interrelationships.

I need a relational database management system.

I need a search engine and a web server.

The key is that the content maps for each concept are generated automatically and incorporate other concepts.

I don’t want it to seem like I'm throwing in the towel, but this project is probably too big for me to do effectively at this stage in my life.  My 9-5 job is too different from this.  I probably should move to a different job, one that allows me to pursue this area for work.  If I try to work on it at home, I may just become overwhelmed by the problems.

Saturday, December 8, 2001

Can the system, given a free-text query, respond reliably with the top 10% of concepts showing what has been said/written about how the concepts interrelate?  That way, you can write something to fill in the gaps -- and later, others interested in the same relationships can benefit.  And that would further empirical science.  (Slowly, but surely, many others need to write about the relationship before it takes on a personality -- correct or incorrect -- in the knowledge universe.)

Just thought of another idea.  If you distill the most important concepts in a document, or a string of text, you can assign the document to a spot in a multidimensional space.  For example, you determine the document's three most important concepts are 98457389443485748, 4857394458484474, and 4395876345454944.  Well, the document maps spatially to each of those three neighborhoods, but the document's exact location within each of the neighborhoods depends on the other concepts in the document.  For example, the mathematically-derived location of the document within the neighborhood is -38476234, 46384344, -23464460.  If this process is applied to millions of documents, then maybe you could get the numeric signature of the document, go to one of its neighborhoods, and simply look around.  Maybe there will be another document nearby, one that fills in some of the gaps in your own ideas.
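To make "looking around the neighborhood" a little more concrete, here is a sketch that swaps in a standard technique, cosine similarity between concept-weight vectors, in place of the neighborhood coordinates described above.  It is not the same scheme, just a simple stand-in; the document names and weights are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse concept-weight vectors."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_documents(query_id, documents, k=1):
    """Return the ids of the k documents closest to documents[query_id]."""
    query = documents[query_id]
    scored = [
        (cosine_similarity(query, vec), doc_id)
        for doc_id, vec in documents.items()
        if doc_id != query_id
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Invented toy documents (concept -> weight), not real data.
documents = {
    "mine": {"star": 0.8, "galaxy": 0.6},
    "astronomy": {"star": 0.7, "galaxy": 0.5, "universe": 0.4},
    "cinema": {"star": 0.5, "movie": 0.9},
}
print(nearest_documents("mine", documents))
```

The nearby document also mentions "universe", a concept missing from "mine", which is the gap-filling effect I am hoping for.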

I just saw a quote from "An Interface for Melody Input" (Lutz Prechelt and Rainer Typke, 1998): "Parsons showed that a simple encoding of tunes that ignores most of the information in the musical signal can still provide enough information for distinguishing between a large number of tunes."  This is magic!  The boiled-down essence of the tunes!
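The encoding referred to in the quote, usually called Parsons code, keeps only the contour of a melody: whether each note goes up, down, or repeats relative to the one before it.  A small sketch, assuming the tune is given as a list of MIDI note numbers (that input format is my assumption, not something from the paper):

```python
def parsons_code(pitches):
    """Encode a melody by contour only: '*' for the first note, then
    U (up), D (down) or R (repeat) for each following note."""
    if not pitches:
        return ""
    code = ["*"]
    for previous, current in zip(pitches, pitches[1:]):
        if current > previous:
            code.append("U")
        elif current < previous:
            code.append("D")
        else:
            code.append("R")
    return "".join(code)


# Opening of "Twinkle, Twinkle, Little Star" as MIDI note numbers
# (C C G G A A G), chosen here only as an illustration.
twinkle = [60, 60, 67, 67, 69, 69, 67]
print(parsons_code(twinkle))
```

Note that the same tune transposed to any other key gives the same code, because only the direction of each step is kept.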

Sunday, December 2, 2001

New and improved management of text documents.  Need: Parser.  Something that finds the concepts in the text.  The parser assigns codes to the concepts and determines relationships where possible.

The document is represented with a new string (or data table) which has now been encoded with some degree of semantic meaning.  It looks kind of like this: 

lawfirm.assign.paralegal employee

paralegal employee.search.hardcopy record

I will refer to this as a signature.  A short document would have a short signature and a long document a long one.  And meta-information from the signature (concepts referred to, nature of the relationships) will become a simplified version of the signature, and can be visualized at high resolution.  I hypothesize that documents with similar content would have visually identifiable similarities -- IF the original assignment of codes by the parser is done with some degree of awareness about how concepts normally fit together in that great body of textual information that is the Internet.
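Before any visualization, the similarity between two signatures could be checked numerically.  The sketch below assumes each signature line has the form concept.relation.concept, as in the two example lines above, and compares signatures by Jaccard overlap (my choice of measure).  The function names and the second document's "file" line are made up for illustration.

```python
def parse_signature(text):
    """Turn lines like 'lawfirm.assign.paralegal employee' into
    (subject, relation, object) triples."""
    triples = set()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        subject, relation, obj = line.split(".", 2)
        triples.add((subject.strip(), relation.strip(), obj.strip()))
    return triples


def signature_similarity(sig_a, sig_b):
    """Jaccard overlap: shared triples divided by all distinct triples."""
    if not sig_a and not sig_b:
        return 1.0
    return len(sig_a & sig_b) / len(sig_a | sig_b)


doc_a = parse_signature("""
lawfirm.assign.paralegal employee
paralegal employee.search.hardcopy record
""")
# Hypothetical second document: same concepts, one different relationship.
doc_b = parse_signature("""
lawfirm.assign.paralegal employee
paralegal employee.file.hardcopy record
""")
print(signature_similarity(doc_a, doc_b))
```

Exact matching of triples is strict; a parser with some awareness of how concepts normally fit together could also treat near-synonymous relations as partial matches.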

Thursday, November 29, 2001

Right now, what do I really care about (intellectually)?  I care about semantics, language, meaning, connections between ideas.  I care about representing semantic meaning, visualizing it, encoding it electronically so that it can be manipulated by people and by computers.  I want to find hidden patterns in human knowledge, not just in data.  I want to move away from just alphanumeric representations of the world, which are used in rectangular data tables and in GIS.  I want to explore human knowledge in a way nobody else has explored it before.  I want my own flexible tool, so that I can ask certain kinds of questions, and find the answers quickly, colorfully, and at a high resolution.  But what questions?

These questions.

Is there a relationship between nitrosamines, vitamin C, and antioxidants?  If so, what is the nature of that relationship?  In the vast world of all human knowledge, what other triads have a relationship like this?  Can I learn anything from this?

What idea is most closely connected to my triad?  Maybe it's hybridomas.  Well, great.  What other foursomes have relationships like this?  Can I start to draw a picture that describes the relationship between the concepts in my foursome?  How does it resemble the picture describing the relationship between the other foursomes I retrieved?

I know nothing about the play "Measure for Measure", by Shakespeare.  I bet people wrote about this play.  Historians, people who critique literature from that period.  The period Shakespeare lived in.  I bet people compare the characters in that play with characters in other plays.  Let's find out what plays have characters like the main character in "Measure for Measure".  I mean, first of all, show me a list of the main characters in all theatrical productions, past and present.  Now show me the qualities associated with the characters, as a whole.  Loyal?  Brave?  Maybe the main character of "Measure for Measure" is remarkably similar to the character from a play by Longfellow.  Has anyone already pointed this out, or did I just discover it?

What is the current state of research into a vaccine for AIDS?

OK, this is all based on deriving information from free text.  But these techniques, I believe, could be applied using quite a number of other input devices.  For example, I think I could design a system that could capture the essence of a song, and then find surprising similarities with other songs -- possibly, songs from a different era or continent.

Eventually, these kinds of systems could be running in the background while I am typing, or listening to the radio, or talking to a friend.  They could continually bring up related ideas -- ideas that match the relationships among concepts being discussed.  And they could offer a colorful visualization showing the relationships.  The "Visual Thesaurus", which currently exists at http://www.visualthesaurus.com, might be a rough model for this type of visualization.  Less-related concepts fade away into the background.  At a very high resolution, individual concepts could be represented with some kind of recognizable three-dimensional shape.

I wrote this when I was at work yesterday, waiting for someone:  "Better Internet searching: Free text query like 'nitrosamines vitamin C'.  Reports back sentences describing relationships between nitrosamines and vitamin C.  Complete sentences without '…' like Google now has.  Right margin has keywords related to this sentence including the larger context in which it appears -- paragraph, document.  So you can click on it and it acts like a new search."
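The simplest version of the "report back complete sentences" part might look like the sketch below.  It only checks that every query term appears somewhere in a sentence, which is a crude stand-in for "describes a relationship"; the sentence splitting is naive, and the sample text is placeholder wording, not real search results.

```python
import re


def sentences_mentioning(text, terms):
    """Return whole sentences that contain every query term (case-insensitive)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    wanted = [t.lower() for t in terms]
    return [s for s in sentences if all(t in s.lower() for t in wanted)]


# Placeholder text for illustration only.
sample = (
    "This sentence mentions nitrosamines only. "
    "This one mentions both nitrosamines and vitamin C. "
    "This one mentions vitamin C only."
)
for sentence in sentences_mentioning(sample, ["nitrosamines", "vitamin C"]):
    print(sentence)
```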

Wednesday, October 24, 2001

Public health -- passion, thought, and action (these are the words of Michael Bird).  So I was at a session on global public health this morning... One gentleman in the audience said something to this effect: "public health is focusing intensely on small components of public health, rather than on the 'big picture'."  Another person brought up the 90/10 gap -- the fact that only 10% of the public health funds are spent on the problems responsible for 90% of the public health burden in the world.

So I thought about the GIS presentation I attended at CDC on Friday.  The presentation was on LandView, I think it was called.  Anyway, so they had all this data from the census of various countries, and other sources (the speaker said they had incorporated literally two tons of data into the current version).  And it allows you to look at how the population is distributed.  Wouldn't it be nice to get some global public health data into the system (not just structured, quantitative data, but unstructured, qualitative data)?  This, together with appropriate data on the effectiveness of various interventions, could not just provide insight to guide public health priorities, but could also output unbiased, objective data on the interventions that can achieve the greatest impact on burdens to global population health.

Such an effort would encourage philanthropy -- if the objectives of such a system were communicated clearly.

I wonder if the open-source community would want to take this on.  Relevant data could be included in the software, just like with the Celestia system I discovered on Tuesday.  It wouldn't involve that much data, really.  Well, maybe it would... like 100 megabytes.  I guess there would have to be add-on modules if the system were to expand for more detailed analyses, etc.

But the total amount of data needed for a system consistent with the spirit of this system?  It wouldn't really need to be that much.  I envision a massive spreadsheet with many rows (representing cells on the map) and a limited number of columns (burdens on global public health -- only the most significant ones).  Diarrheal diseases, tuberculosis, and AIDS would certainly be included.  Every "cell" on the map has a list of distinct risk factors, or causes, for the burden on public health.  To pick an arbitrary cutoff, the list of these burdens could include the top five or top ten in every cell, or just those that cause 90% of the disability.
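The 90% cutoff for a single cell could be worked out roughly as follows.  This is only a sketch under my own assumptions: the function name is invented, and the burden figures are placeholders with no connection to real health data.

```python
def burdens_covering(cell_burdens, share=0.9):
    """Pick the largest burdens in a cell until they account for at least
    `share` of that cell's total burden."""
    total = sum(cell_burdens.values())
    selected, running = [], 0.0
    ranked = sorted(cell_burdens.items(), key=lambda kv: kv[1], reverse=True)
    for name, value in ranked:
        if total == 0 or running >= share * total:
            break
        selected.append(name)
        running += value
    return selected


# Placeholder numbers for one map cell (arbitrary units, not real data).
cell = {"diarrheal diseases": 40, "tuberculosis": 30, "AIDS": 25, "other": 5}
print(burdens_covering(cell))
```

With these placeholder values the first three burdens cover 95% of the total, so the fourth is left out of the cell's list.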

Now this spreadsheet would start out empty and would take a long time to fill... but the cells could be filled using the best available methods, using approximation where necessary, and using methods similar to LandScan's approximation methods.  Now these numbers would change over time -- a dynamic mosaic of data.

I don't think the 90/10 gap can be corrected, or fought effectively.  It seems to be inherent in nature, a certainty of chaos theory.  However, such a system could conceivably achieve dramatic reductions in burdens on global public health.

The speaker just said, "The future is in the people who dream dreams and then move them forward into reality."

Could I just submit this as a project in the Open Source web site that Celestia's on, even if I don't currently have the serious intent to follow up?

I know that I would give my money to an organization using this system.

 

All original material on KnowledgeBlog copyright Michael Bales unless otherwise noted.