KnowledgeBlog
An exciting new world
Michael Bales
OK, I’m thinking about practical ways of implementing Idea 13. Imagine a set of 1000 short documents containing a total of 100,000 unique concepts. The frequency of each concept is a known integer. Each concept has a simple numeric identifier from 1 to 100,000. Each concept then receives a power corresponding to its frequency in the document set. (Less frequent occurrence gives a concept more power, because it will contribute more to the uniqueness of each document.) Well, each document’s identifier is now simply a set consisting of the concepts it contains, more or less strong based on the frequency of the concepts. Using conventional methods, a user wishing to retrieve a particular document can find the document quickly based on keywords. However, I wish to link the concepts so that their meanings, or identifying power, influence each other.
A tabular representation of the first three concepts in the entire document set:
A tabular representation of the first three concepts in document A:
Universe 0.057; without 4.67; boundaries 0.833

If we look at the representations of all the documents, we find certain patterns emerging. The concept universe, when powerful in a given document, often occurs near the concepts star and galaxy. We can see, from a purely mathematical standpoint, that these are related concepts. Now, when someone starts talking about star, and the concept galaxy is also mentioned, we know the person is talking not about movie stars but about the celestial body.

On the way home from the coffeehouse, I explained to Jessica how nice it is to get my thoughts onto paper, and she agreed that it's good to journal from time to time. I asked, "Can you do that in Chinese?" and she replied that it's faster for her to write in English because "we don't have a good typing system." I outlined one based on optical character recognition: it allows input on a touch screen, then looks at the characters and uses word frequency and contextual analysis to help present an array of likely choices for characters! The user taps the screen once if the proposed character is the right one, or selects other characters that appear in boxes. It strikes me that there should already be decent OCR software for printed
Chinese; this software would probably achieve a remarkably lower level of
accuracy, but would probably be useful for typing in Chinese. An improvement over current methods involving pinyin?
Maybe not… but it would allow the "average"
Chinese person (with the touch-screen input device) to "type" e-mails.
Yes, what might sell is a small, inexpensive touch-screen device and
corresponding OCR software.

I've thought of another crazy idea.
This is really an idea that I first thought of when I was little -- maybe about
11 years old. It's basically a virtual Earth
simulator, for simulated travel to anywhere in the world. It dynamically gathers needed data from the Internet.
Users can fly from one world city to another, see the Great Pyramids, see other
sights, or zoom down to the street level of any city or small town in the world.
Small towns are generated (ahead of time)
based on existing GIS maps and aerial photography. It includes details, like cars and
people. Navigation is like the
navigation in Celestia: you can accelerate by an order of magnitude to cover
great distances. This
system can develop over time, using GIS technology and other technologies.
To start, all you see is a sphere that
represents the Earth. Then we begin
adding details, and you can see the oceans and landscape at a low resolution.
Then you can see where country boundaries are, and names of cities. Then you can get closer to the cities, and you start to see the city map,
and then you can see the buildings (simulated at first, then actual widths and
heights, then a façade that represents the type of building, then, in selected
cities, the true façade of the building.)

Practical uses of this: Learn about other countries and places. Go to Paris on a rainy day.
Find out that the Taj Mahal is on a
river; see what it looks like from various angles. See how close people live to the Taj
Mahal. Find out what places to eat are near the
Washington Monument.

Make a map of human knowledge! No one's done that yet! I want a
map of all human knowledge. I want
information. I just heard about soft
computing, but I don't really know what it is. I don't want just somebody's idea of what it is -- I want the meta-idea
of what it is based on all available information. I want to know what concepts it's related
to. I want to be able to know meta
information about those concepts. I
want to know how my topic is related to the other
topics: the nature of the relationship. I am going
to start the map of all human knowledge right now. This is a model that can grow (it is
scalable, extensible.)

Nut
Nut (9s345Sl4ttC): a piece of metal used to fasten things together. Related
concepts: bolt (9s345S14ak3), fastener (9s3ay2f).

I need a
concept extractor. It pulls the data
from the web pages. In my model,
each document is an entity containing many concepts and their
interrelationships. I need a
relational database management system. I need a
search engine and a web server. The key is
that the content maps for each concept are generated automatically and
incorporate other concepts.

I don’t want it to seem like I'm throwing in the towel, but this project is probably too big for me to do effectively at this stage in my life. My 9-5 job is too different from this. I probably should move to a different job, one that allows me to pursue this area for work. If I try to work on it at home, I may just become overwhelmed by the problems.

Can the
system, given a free-text query, respond reliably with the top 10% of concepts
showing what has been said/written about how the concepts interrelate? That way,
you can write something
to fill in the gaps -- and later, others interested in the same relationships
can benefit. And further empirical
science. (Slowly, but surely, many others need to write about the relationship
before it takes on a personality -- correct or incorrect -- in the knowledge
universe.)

Just thought of another idea. If you
distill the most important concepts in a document, or a string of text, you can
assign the document to a spot in a multidimensional space. For example, you determine the document's
three most important concepts are 98457389443485748, 4857394458484474, and
4395876345454944. Well, the document
maps spatially to each of those three neighborhoods, but the document's exact
location within each of the neighborhoods depends on the other concepts in the
document. For example, the
mathematically-derived location of the document within the neighborhood is -38476234, 46384344, -23464460.
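
One way the neighborhood idea might be sketched in code is shown below. Everything here is an assumption made for illustration: the hash-based placement of concepts in a three-dimensional cube, the `radius` of a neighborhood, the concept weights, and the function names `concept_point`, `weighted_centroid`, and `neighborhood_positions` are all hypothetical, and the coordinates come out on a much smaller scale than the example numbers above. The sketch gives each of the document's three heaviest concepts a fixed center, then nudges the document's position inside each neighborhood according to the document's remaining concepts.

```python
import hashlib
import math
from typing import Dict, Tuple

Point = Tuple[float, float, float]


def concept_point(concept_id: int) -> Point:
    """Hypothetical placement: hash a concept ID to a fixed point in [-1, 1]^3."""
    digest = hashlib.sha256(str(concept_id).encode("ascii")).digest()
    # Use three 4-byte slices of the digest, one per axis.
    coords = [
        int.from_bytes(digest[4 * axis:4 * axis + 4], "big") / 0xFFFFFFFF * 2.0 - 1.0
        for axis in range(3)
    ]
    return (coords[0], coords[1], coords[2])


def weighted_centroid(weights: Dict[int, float]) -> Point:
    """Weighted average of the concept points; heavier concepts pull harder."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    sums = [0.0, 0.0, 0.0]
    for concept_id, weight in weights.items():
        point = concept_point(concept_id)
        for axis in range(3):
            sums[axis] += weight * point[axis]
    return (sums[0] / total, sums[1] / total, sums[2] / total)


def neighborhood_positions(
    weights: Dict[int, float], top_n: int = 3, radius: float = 0.1
) -> Dict[int, Point]:
    """For each of the top_n concepts, locate the document inside that
    concept's neighborhood, offset by the document's other concepts."""
    top = sorted(weights, key=lambda c: weights[c], reverse=True)[:top_n]
    positions: Dict[int, Point] = {}
    for anchor in top:
        center = concept_point(anchor)
        others = {c: w for c, w in weights.items() if c != anchor}
        offset = weighted_centroid(others) if others else (0.0, 0.0, 0.0)
        positions[anchor] = (
            center[0] + radius * offset[0],
            center[1] + radius * offset[1],
            center[2] + radius * offset[2],
        )
    return positions


# Concept IDs from the example above; the weights and the minor concept
# 1234 are made up for illustration.
doc = {
    98457389443485748: 5.0,
    4857394458484474: 4.0,
    4395876345454944: 3.0,
    1234: 0.5,
}
positions = neighborhood_positions(doc)
for concept_id, position in positions.items():
    print(concept_id, position)
```

Because the placement is deterministic, two documents that share their heaviest concepts land in the same neighborhoods, and their distance within a neighborhood reflects how much their remaining concepts differ.
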
If this process is applied to millions of
documents, then maybe you could get the numeric signature of the document, go to
one of its neighborhoods, and simply look around. Maybe there will be another document
nearby, one that fills in some of the gaps in your own ideas.

I just saw a quote from "An Interface for Melody Input" (Lutz Prechelt and Rainer Typke, 1998): "Parsons showed that a simple encoding of tunes that ignores most of the information in the musical signal can still provide enough information for distinguishing between a large number of tunes." This is magic! The boiled-down essence of the tunes!

New and improved management of text documents. Need: Parser. Something that finds the concepts in the text. The parser assigns codes to the concepts and determines relationships where possible. The document is represented with a new string (or data table) which has now been
encoded with some degree of semantic meaning. It looks kind of like this:
lawfirm.assign.paralegal employee
paralegal employee.search.hardcopy record

Right now, what do I really care about (intellectually)? I care about semantics, language, meaning, connections between ideas. I care about representing semantic meaning, visualizing it, encoding it electronically so that it can be manipulated by people and by computers. I want to find hidden patterns in human knowledge, not just in data. I want to move away from just alphanumeric representations of the world, which are used in rectangular data tables and in GIS. I want to explore human knowledge in a way nobody else has explored it before. I want my own flexible tool, so that I can ask certain kinds of questions, and find the answers quickly, colorfully, and at a high resolution.

But what questions? These questions. Is there a relationship between nitrosamines, vitamin C, and antioxidants? If so, what is the nature of that relationship? In the vast world of all human knowledge, what other triads have a relationship like this? Can I learn anything from this? What idea is most closely connected to my triad? Maybe it's hybridomas. Well, great. What other foursomes have relationships like this? Can I start to draw a picture that describes the relationship between the concepts in my foursome? How does it resemble the picture describing the relationship between the other foursomes I retrieved?

I know nothing about the play "Measure for Measure", by Shakespeare. I bet people wrote about this play. Historians, people who critique literature from that period. The period Shakespeare lived in. I bet people compare the characters in that play with characters in other plays. Let's find out what plays have characters like the main character in "Measure for Measure". I mean, first of all, show me a list of the main characters in all theatrical productions, past and present. Now show me the qualities associated with the characters, as a whole. Loyal? Brave?
Maybe the main character of "Measure for Measure" is remarkably similar to the character from a play by Longfellow. Has anyone already pointed this out, or did I just discover it?

What is the current state of research into a vaccine for AIDS?

OK, this is all based on deriving information from free text. But these techniques, I believe, could be applied using quite a number of other input devices. For example, I think I could design a system that could capture the essence of a song, and then find surprising similarities with other songs -- possibly, songs from a different era or continent. Eventually, these kinds of systems could be running in the background while I am typing, or listening to the radio, or talking to a friend. They could continually bring up related ideas -- ideas that match the relationships among concepts being discussed. And they could offer a colorful visualization showing the relationships. The "Visual Thesaurus", which currently exists at http://www.visualthesaurus.com, might be a rough model for this type of visualization. Less-related concepts fade away into the background. At a very high resolution, individual concepts could be represented with some kind of recognizable three-dimensional shape.

I wrote this when I was at work yesterday, waiting for
someone: "Better Internet searching:
Free text query like "nitrosamines vitamin C". Reports back sentences describing
relationships between nitrosamines and vitamin C. Complete sentences without "…"
like Google now has. Right margin
has keywords related to this sentence including the larger context in which it
appears -- paragraph, document. So you can click on it and it acts like a new search."

Public health -- passion, thought, and action (these are the words of Michael
Bird). So I was at a session on
global public health this morning... One gentleman in the audience said
something to this effect: "public health is focusing intensely on small
components of public health, rather than on the 'big picture'." Another person brought up the 90/10 gap
-- the fact that only 10% of the public health funds are spent on the problems
responsible for 90% of the public health burden in the world. So I thought about the GIS presentation I attended at CDC on Friday.
The presentation was on LandView, I think
it was called. Anyway, so they had
all this data from the census of various countries, and other sources (the
speaker said they had incorporated literally two tons of data into the current
version). And it allows you to look
at how the population is distributed. Wouldn't it be nice to get some global public health data into the system
(not just structured, quantitative data, but unstructured, qualitative data)?
This, together with appropriate data on the effectiveness of various
interventions, could not just provide insight to guide public health priorities,
but could output unbiased objective data on the interventions that can achieve
the greatest impact on burdens to global population health. Such an effort would encourage philanthropy -- if the objectives of such a
system were communicated clearly. I wonder if the open-source community would want to take this on.
Relevant data could be included in the
software, just like with the Celestia system I discovered on Tuesday. It wouldn't involve that much data,
really. Well, maybe it would... like 100
megabytes. I guess there would have
to be add-on modules if the system were to expand for more detailed analyses,
etc. But the total amount of data needed for a system consistent with the spirit of
this system? It wouldn't really need
to be that much. I envision a
massive spreadsheet with many rows (representing
cells on the map), and a limited number of columns (burdens on global public health --
only the most significant ones). Diarrheal diseases, tuberculosis, and AIDS would certainly be included.
To pick an arbitrary cutoff, every "cell"
on the map has a list of distinct risk factors, or causes, for the burden on
public health. Well, the list of
these burdens could include the top five or top ten in every cell, or just those
that cause 90% of the disability. Now this spreadsheet would start out empty and would take a long time to fill...
but the cells could be filled using the best available methods, using
approximation where necessary, and using methods similar to LandScan's
approximation methods. Now these
numbers would change over time -- a dynamic mosaic of data. I don't think the 90/10 gap can be corrected, or fought effectively.
It seems to be inherent in nature, a
certainty of chaos theory. However,
such a system could conceivably achieve dramatic reductions in burdens on global
public health. The speaker just said, "The future is in the people who dream dreams and then
move them forward into reality." Could I just submit this as a project in the Open Source web site that
Celestia's on, even if I don't currently have the serious intent to follow up? I know that I would give my money to an organization using this system.
All original material on KnowledgeBlog copyright Michael Bales unless otherwise noted.