Tim Berners-Lee and the Semantic Web
Tim Berners-Lee is the Director of the World Wide Web Consortium, a Senior Research Scientist and the 3Com Founders Professor of Engineering in the School of Engineering, with a joint appointment in the Department of Electrical Engineering and Computer Science, at MIT’s CSAIL, where he leads the Decentralized Information Group (DIG), and a Professor of Computer Science at Southampton ECS.
In 1989, while at CERN, the European Particle Physics Laboratory, he invented the World Wide Web, an Internet-based hypermedia initiative for global information sharing. He wrote the first Web client and server in 1990. His specifications of URIs, HTTP, and HTML were refined as Web technology spread.
Learn more: Wikipedia on Sir Berners-Lee
In 2001 he became a Fellow of the Royal Society. He has received several international awards, including the Japan Prize, the Prince of Asturias Foundation Prize, the Millennium Technology Prize and Germany’s Die Quadriga award. In 2004 he was knighted by H.M. Queen Elizabeth II. He is the author of “Weaving the Web”.
Topic: How can the Semantic Web help with medical research? Source: Business Week
Business Week: You have touched on the idea that the Semantic Web will make it easier to discover cures for diseases. How will it do that?
Tim Berners-Lee: “Well, when a drug company looks at a disease, they take the specific symptoms that are connected with specific proteins inside a human cell which might lead to those symptoms. So the art of finding the drug is to find the chemical that will interfere with the bad things happening and encourage the good things happening inside the cell, which involves understanding the genetics and all the connections between the proteins and the symptoms of the disease.
It also requires looking at all the other connections, whether there are federal regulations about the use of the protein and how it’s been used before. We’ve got government regulatory information, clinical trial data, the genomics data, and the proteomics data that are all in different departments and different pieces of software. A scientist who is going through that creative process of brainstorming to find something that could possibly solve the disease has to somehow keep everything in their head at the same time or be able to explore all these different axes in a connected way. The Semantic Web is a technology designed to specifically do that—to open up the boundaries between the silos, to allow scientists to explore hypotheses, to look at how things connect in new combinations that have never before been dreamt of.”
Business Week. (2007, April). Q&A with Tim Berners-Lee. Retrieved January 20, 2008, from http://www.businessweek.com/technology/content/apr2007/tc20070409_961951_page_2.htm.
Topic: W3C Semantic Web Health Care and Sciences Source: w3.org
The Semantic Web for Health Care and Life Sciences Interest Group (HCLSIG) is chartered to develop and support the use of Semantic Web technologies and practices to improve collaboration, research and development, and innovation adoption in the Health Care and Life Science domains. Success in these domains depends on a foundation of semantically rich system, process, and information interoperability.
The scope of HCLSIG will include:
1. Core vocabularies and ontologies to support cross-community data integration and collaborative efforts.
2. Guidelines and Best Practices for Resource Identification to support integrity and version control.
3. Better integration of scientific publications with people, data, software, and clinical trials.
4. HCLS will actively coordinate with groups and consortia within Life Sciences and Health Care areas:
Health Care, Clinical, and Life Science Consortia
Research Institutes and Centers
Pharmaceutical and Biotechnology Companies
IT Solution Vendors
Government Agencies
W3C Technology and Society. (2008). Semantic Web activity. Retrieved January 19, 2008, from http://www.w3.org/2001/sw/hcls/#mission
Topic: Harnessing the Semantic Web to Answer Scientific Questions Source: A Health Care and Life Sciences Interest Group demo
Download PDF on Semantic Web for answering scientific questions: Harnessing the Semantic Web to Answer Scientific Questions
W3.org (2007). Harnessing the Semantic Web to answer scientific questions. Retrieved March 3, 2008, from http://www.w3.org/2007/Talks/www2007-AnsweringScientificQuestions-Ruttenberg.pdf.
The following is a transcript of a podcast produced by Talis, with Tim Berners-Lee.
[music] Paul Miller: 00:10 Hello and welcome to this Talking with Talis Podcast with your host, Paul Miller. Today, I talk with Sir Tim Berners-Lee, inventor of the World Wide Web and now Director of the World Wide Web Consortium. We talk about the Semantic Web, Linked Data and Tim’s ambitions and vision for both.
Tim, thank you very much for joining me today for this podcast. Usually what I do with these podcasts is ask people to introduce themselves and talk a little bit about where they’ve come from. I guess with you, we probably don’t need to bother. I will point people to your page at the Web Consortium and certainly to things like Weaving the Web for anyone who doesn’t know who you are and what you have done to get to where we are today. So, I think, what we will do is, probably just move straight on to start looking at the questions. Tim Berners-Lee: OK, good to be with you. Paul: 01:12 Thank you; thanks for joining us. Over the past few years, there has been an awful lot of Semantic Web research going on in universities and inside research and development departments at – mainly – big corporations. What we are seeing now is really the results of that beginning to enter the mainstream. What do you think we have to do to move finally into the sort of mainstream deployment of some of these technologies and ideas that have been building for a very long time? Tim: I think the Semantic Web is such a broad set of technologies and is going to do so many different things for different people. It is really difficult to put it on one thing. What are the steps necessary right now for the life sciences community to be able to use it for their data about proteins is probably different from which steps do we need to be able to get interoperability between repositories of library data and museum data.
So, different communities have different faces; different communities always have different social considerations, and often there are social steps. When you finally get people to share data more, to be able to re-use data more, then, just like with the introduction of the Web, there are a lot of echoes of the same sort of social concerns.
People saying, “If I put up a Web server, then I will be out of the loop. Nobody will come and I won’t get the credit.” People don’t have to come to my door and knock on it to get the information. And sort of misunderstandings like that. Or “If I give people my data, then they will be able to use it in ways which are better than the ways in which I have used it, and then, I will fade into the limelight.” So, there are all these… we get these social things, but they tend to be different in different areas.
An important step we have just got over is bringing SPARQL out. So, SPARQL changed the landscape a lot, because there are such a lot of data sets which it was impractical to just load and publish as Linked Data; SPARQL gives access to those. So, I think, we will see a growing number of SPARQL endpoints, and that is exciting. Paul: 03:21 OK. And we will touch on certainly Linked Data a little bit later on. You talked a little bit about people’s concerns there with loss of control or loss of credibility, or loss of visibility. Are those concerns justified or is it simply an outmoded way of looking at how you appear on the Web? Tim: I think that both are true. In a way it is reasonable to worry in an organization, for example. Suppose you are in a department in an organization and you own the data about a particular thing, whether it is when the machines are going to be maintained or fixed, or what the temperature has been in each of your offices, or something. You own that data, and you are worried that if it is exposed, people will start criticizing your maintenance of heating systems or something.
So, there are some organizations where if you do just sort of naively expose data, society doesn’t work very well and you have to be careful to watch your backside. But, on the other hand, if that is the case, there is a problem. So the Semantic Web is about integration, it is like getting power when you use the data, it is giving people in the company the ability to do queries across the huge amounts of data the company has.
And if a company doesn’t do that, then it will be seriously disadvantaged competitively. If a company has got this feeling where people don’t want other people in the company to know what is going on, then it has already got a problem; this just exposes the problem. It is like what people say: “Well, my data is actually a mess, or a lot of my addresses are out of date, or inconsistent. It would expose… we would see all these inconsistencies.” Well, in a way, you have got the inconsistencies already; if it exposes them, then actually it helps you. So, I think, it is important for the leadership in the company, for example, to give kudos to the people who provided the data upon which a decision was made, even though they weren’t the people who made the decision.
So, generally, to recognize the fact that people are providing access to their data is important. It’s very important in Science, too. If you publish a paper in which you happen to have got a lot of the results by running a SPARQL query over existing cell line data, existing genomics data, existing clinical trials data, whatever it is, then obviously it is very important in the scientific ethos to credit the people who produced them.
If you produce the experiments and put those out there in RDF on the Web, then, the good news is you can expect credit back; and sometime in the future after you have retired, people may in fact… you may get credit from people who are using that. Paul: 06:05 OK, sounds good. Going a little bit broader than those questions then, back in 2001 in that Scientific American article, you and the other authors painted a very broad grand vision of where the Semantic Web could take us. Did you think we’d be closer to that seven years on? Tim: Well, for one thing that article was, I think, too sci-fi. I think, that really what we have… the message has been… it was looking too far into the future. It imagined the Semantic Web was deployed, and then people had made all kinds of fairly AI-like systems which run on top of that.
In fact, the gain from the Semantic Web comes much before that. So maybe we should have written about enterprise and intra-enterprise data integration and scientific data integration. So, I think, data integration is the name of the game. That’s happening, it’s showing benefits. Public data as well; public data is happening and it is providing the fodder for all kinds of mashups.
So, what we should realize is that the return on investment will come much earlier when we just have got this interoperable data that we can query over.
Paul: 07:29 OK, and we are pretty close to that now with the Linked Data work that again we will probably dig into shortly. Weaving the Web, 1999 – is it time for another book that paints again sort of the picture given the experience of where we have gone in that time? Tim: Yeah, I think, it has been time for another book for so long, but, when am I going to find the time to write it. I think, that the same things to be… it would be good to write a number of books.
The books I would like to write if I had time include… I would like to write a whole bunch of technical books about actually practically how to do Semantic Web things. I’d like to write a book about Semantic Web Architecture. And I’d like to write a book sort of painting the path for people in the industry, because I get a lot of questions along the lines of “OK, I read the specs, OK, but here I am, I am the CIO of a company, what does it mean for us now, what should we do?”
So, there is a story about the answer to that one, typically “Well you should take an inventory of what you have got in the way of data and you should think about how valuable each piece of data in the company would be if it were available to other people across the company, or if it were available publicly, and if it were available to your partners.”
And then, you should make a list of these things and tackle them in order. You should make sure you don’t change the way any of your data is existing, is managed, so you don’t mess up the existing systems and so on.
So, there’s all this sort of advice, which is being repeated all over the place. Semantic Web experts are being called up by CIOs and asked what they can do. We need more books on that at the moment, I think, explaining how to present the Semantic Web as a win-only and win-win solution, and adding it to existing infrastructure in companies and things like that.
So, there’s lots of books. But, when things get exciting, as they are now, come Monday morning, what should I do? It’s just like back in the early days of the Web. Should I go and encourage a working group, participate in an open source project, should I go and give a keynote speech, should I go and do a podcast with Paul Miller? There’s so many things to do, that I’m afraid writing another book just hasn’t made it to the top of the pile yet. Paul: 10:00 OK, so there may be an opportunity there for someone else to write that book. Tim: There’s a lot of books out there to be written. Maybe also for people to interview the people who understand it, who understands things like Linked Data. Interview the people from the community and sort of write the books for them. Paul: 10:23 You mentioned SPARQL back at the beginning of our conversation. With SPARQL, I guess, a lot of the technical pieces are now in place. We’ve got RDF, we’ve got OWL, we’ve got GRDDL, we’ve got SPARQL, and we’ve got the rest. Are there any big gaps left in the puzzle, or do we have the bits we need now to stop using lack of standards as an excuse? Tim: I think, really we’ve got all the pieces to be able to go ahead and do pretty much everything. I suppose, really you should be able to implement a huge amount of the dream, we should be able to get huge benefits from interoperability using what we’ve got. So, people are realizing it’s time to just go do it.
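Part of why "just go do it" is realistic is that querying one of these SPARQL endpoints needs nothing beyond plain HTTP: the query travels as a URL parameter and the Accept header picks the result format. A minimal sketch using only Python’s standard library; the endpoint URL here is invented for illustration.

```python
import urllib.parse
import urllib.request

def build_sparql_request(endpoint, query):
    """Build an HTTP GET request for a SPARQL endpoint.

    Endpoints conventionally accept the query as a URL-encoded
    `query` parameter and honour the Accept header for the
    result serialization.
    """
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url,
        headers={"Accept": "application/sparql-results+json"},
    )

# A toy query: ten things and their labels.
QUERY = """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?thing ?label WHERE { ?thing rdfs:label ?label } LIMIT 10"""

req = build_sparql_request("https://example.org/sparql", QUERY)
# urllib.request.urlopen(req) would perform the actual HTTP call.
```

Any endpoint that implements the SPARQL protocol can be queried this way, which is what makes a "growing number of SPARQL endpoints" interesting: the same client code works against all of them.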
If there’s one thing which we’ve foreseen from the early stages, I think, it would be rule languages. So, the rule language is in the works. From the first moment we started playing with the Semantic Web, the first thing I did was to write a rule engine. Because to me, that was a way of, for example, translating data from one ontology to another, for slimming it down, thickening it up, making general inferences, writing consistency checkers and things.
So, a rule engine is such a very general thing. Yeah, you see them everywhere; effectively, you’ve got a rule engine when you filter your email. You set up email filtering rules through various smart mailboxes. Smart photo albums tend to be little rule engines. So, a lot of applications already are used to that. Users are used to that. Some users, I guess advanced users, to be fair, are used to using those rule systems to sort of make their life run better, increasing the automation.
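The shape of such a rule engine, including the ontology-to-ontology translation use mentioned above, can be sketched in a few lines. This is only a toy illustration, nothing like a real engine such as cwm, and the vocabulary terms are hypothetical.

```python
# A toy forward-chaining rule engine over (subject, predicate, object)
# triples: apply rules until no new facts can be derived.
def infer(triples, rules):
    """Each rule is a function that takes the current fact set
    and yields any new triples it can derive from it."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for fact in list(rule(facts)):
                if fact not in facts:
                    facts.add(fact)
                    changed = True
    return facts

# Example rule: translate one vocabulary's term into another's,
# the kind of ontology-to-ontology mapping described above.
# (Both property names are hypothetical.)
def map_fullname_to_label(facts):
    for s, p, o in facts:
        if p == "vcard:fn":
            yield (s, "rdfs:label", o)

data = {("#tim", "vcard:fn", "Tim Berners-Lee")}
result = infer(data, [map_fullname_to_label])
```

After running, `result` holds the original fact plus the derived `rdfs:label` triple; consistency checkers and filters follow the same pattern with different rule functions.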
So, I think, the rule language will be really useful. The trouble is, there are so many different types of rule languages. And various other things as well. There are all kinds of things that are around. There are style sheet languages for Semantic Web forms. There are lots of things which, I think, you’ll see in the explosion of technology.
But the core, absolutely, is very solid now that we’ve got SPARQL. Paul: 12:42 OK. Another way that we’ve looked at the Semantic Web idea, is through the famous layer cake diagram. One of the obvious pieces on that that actually comes into play quite strongly with a lot of discussion around things like social networks and data portability and all the rest, is trust. What are we doing to address trust? Tim: Well, yes, and trust is fairly high up the stack. Originally in the roadmap, I felt that we would want to… when we had rule languages, then, for example, we would have a more expressive system that would allow us to actually express the trust that we feel.
A lot of pre-Semantic Web trust-based systems do things like give people a numerical value of trust. “I trust this person to the level of 0.75,” or something, or they say the guy is trustworthy, or he’s not. And they’re too simplistic.
So, I think, when you look at real systems in the world, you trust one person to give you a recommendation on a movie, and you trust a completely different person to give you a recommendation on whether a piece of code was good code. And so you trust different people for different things, and different agents for different things.
So, in fact, in the code that I’ve been writing, the rule engine, very often it hasn’t been for saying “this is OK”; it’s “a person with these particular properties, who meets this particular criterion, has said something,” or “a message which comes from a certain source contains this information.” And then you can do things with it.
So, one of the features of the N3 language, which extends RDF, is that it allows you to do that. It allows you to talk about what documents have said what, and allows you to write these rules, which say that sort of thing. You can argue about where the information is coming from.
Provenance, of course, is a really important word for almost anybody doing Semantic Web development; provenance somehow comes into their lives. If you’re building a triple-store, often these things we think of as triple-stores, storing the subject, the verb, and the object, are actually, for each of those little sentences, also storing where it came from.
In the Tabulator, the data browser that we’ve got here at MIT, you can look at the aggregated view of data about all kinds of things. If you click on a cell, then you can see the list of sources: where did this particular data in that particular cell come from? People always need to go back to the source.
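Storing the source alongside each subject-verb-object sentence, a quad rather than a triple, is what makes that "where did this come from" question answerable. A toy sketch, with invented URIs:

```python
# A toy quad store: each statement carries the document it came
# from, so any cell in an aggregated view can be traced back.
class QuadStore:
    def __init__(self):
        self.quads = []  # (subject, predicate, object, source)

    def add(self, s, p, o, source):
        self.quads.append((s, p, o, source))

    def triples(self):
        """The aggregated view: statements without their sources."""
        return {(s, p, o) for s, p, o, _ in self.quads}

    def sources_of(self, s, p, o):
        """Which documents asserted this statement?"""
        return {src for s2, p2, o2, src in self.quads
                if (s2, p2, o2) == (s, p, o)}

store = QuadStore()
# The same fact asserted by two documents (URIs are illustrative):
store.add("#meeting", "dc:date", "2008-02-07", "http://example.org/cal.rdf")
store.add("#meeting", "dc:date", "2008-02-07", "http://example.org/mail.rdf")
```

`store.sources_of("#meeting", "dc:date", "2008-02-07")` returns both source documents; a richer system would attach metadata such as licences to those source URIs, which is exactly the filtering Tim goes on to describe.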
And I think, as we build trusted systems, we’ll build them, which will not only go back to the source, they’ll look at the metadata about the source. And then, they’ll operate on that. So, for example, you’ll be able to look at all the data you’ve got about something you’re doing for a school project. And you’ll be able to say, “OK, now just show me the subset of that data, which is released under a Creative Commons license, so I know that I can use it in my school project. So, I won’t get into trouble with the teacher.”
So, that involves this awareness that a lot of Semantic Web systems are being built with now, which I think, is very important. It’s the connection from the data to the provenance of the data, and not just for the name of the document that it came from, but the actual properties of that – the licensing, what it’s supposed to be used for, what it’s appropriate to use it for, whether I got it because I’ve gone through an authentication process, and actually whether it’s private data, which I should not actually publish at all.
These are, I think… Building systems which track that sort of thing, where I got the data from, and what I’m allowed to use it for, is going to be more and more important. So, those are the ways we’re going towards building trusted systems. I think, it’s a very important part of the puzzle, but we’ll build it using the things which mostly we have.
In the end, to strengthen it, we’ll probably put cryptography in there. And N3 gives you a way of talking about what’s in a document; you can relate it to its signature. So, you can bootstrap the whole trusted system using the things which we’ve really been playing with for several years. Paul: 17:15 Right. And it’s about then making some of those things explicit, I guess. You mentioned, for example, that provenance is in the URI, but thinking far more carefully about what that means, rather than inferring meaning that perhaps isn’t always there. Tim: Yes. The really challenging thing about a lot of trust issues is the user interface. We’re having this problem with the browser. In a regular Web browser, at the moment, how do you let the person know that they’re actually talking to the bank, their own bank, and not something which has got a URI which has got the name of the bank somewhere in it, but not in the domain name?
How do you prevent these phishing attacks, for example? That’s the great question. When you look at the browser, you realize there’s a certificate there, but the browser isn’t actually showing you who the certificate is owned by. It’s only showing the padlock to show it has a certificate. You’re not checking it every moment to know that you’re talking to the right person.
So, they obviously want to put in changes to the browser, so the user interface actually, instead of showing you the URI, it shows you the owner of the certificate and then you’ve moved up a level of trust.
With Semantic Web applications, imagine you’re looking at your calendar. On the calendar, you’ve got interoperability between different applications, suddenly on your calendar you’ve got bank statement transactions that show up. The photographs you’ve taken show up on the calendar because they’ve all got dates. And we’ve got interoperability.
So when you look at a calendar, suddenly it’s got information from all the different places. It may have personal photographs from the family, bank statements from your company, financial statements from your company showing up there. You might use those for deciding where you want to meet with somebody at any particular point, figuring out what was happening on any particular day. You may not want to share that with the people you’re in a meeting with.
So, to be able to see from the user interface that there’s confidential stuff on the screen, and I should not share my screen. To be able to ask the user interface to filter it, so that I would only know, now I’m meeting with somebody else, can you just make sure that everything on the screen is the sort of thing which I’d be prepared to show somebody else. That also assists in understanding the policies.
And we’ve got systems at MIT where you get in there and play around with things like mixing OpenID authentication with friend-of-a-friend, for example. Currently, if you want to comment on my blog, then you have to be related through the social network to somebody in the group, the Decentralized Information Group. You have to be a friend of a friend of a friend to some level of somebody in the group; just so we know that you’re not a spammer. It’s not that we want to cut down the people who can, we just want to cut out the spammers.
So, we see a lot of spammers who are using the social network, which is part of the Semantic Web, to produce the traffic which actually shows up on the Semantic Web stack. Paul: 20:34 Well, that’s clearly an area which will require a lot of activity moving forwards. Another area that will require a huge amount of effort moving forward is around data for the Semantic Web. We’re going to need an awful lot of it. Where are we going to get it from? Tim: There’s an awful lot of data out there. And I think, one of the huge misunderstandings about the Semantic Web is, “oh, the Semantic Web is going to involve us all going out to our HTML pages and marking them up to put semantics in them.” Now, there’s an important thread there, but to my mind, it’s actually a very minor part of it. Because I’m not going to hold my breath while other people put semantics in by hand.
I’m not going to wait for other people to do it, and I don’t want to do it either, to sort of add the semantics to HTML pages. So, where is the data going to come from? It’s already there. It’s in databases. So, most of this data is in databases. Often the data is already available through some kind of a Web interface.
So, if you take a government department, which is interested in defense data; you take a company’s products. You take a printer manufacturer, it sells all these printers, it sells all these ink cartridges, it can sometimes be an afternoon’s work to try and figure out which ink cartridge is actually compatible with which printer. Because they’re on different parts of the website and they haven’t published that information in RDF.
Suppose they publish the information in RDF, then you could just look up the printer and find all the ink cartridges which are compatible with it. And you could write programs that automatically go and buy all the ink cartridges at the appropriate prices and from the appropriate stores, and it’s all getting very automatable.
But, the thing that’s holding us up is that there’s data which the companies have got on this, sitting and going round and round on their disks. Or it’s in their SQL systems and needs to be exported in a way that we can get at it, as linked RDF or through a SPARQL endpoint. And then, that could be reused. And all the people, all the resellers, all the stationers who sell ink cartridges, for example, will be able to make much better websites, because they’ll be able to pull compatibility information from the manufacturer.
The company will find that its users are happier, and they’ll end up selling more printers and more cartridges. The whole world will run more smoothly and we’ll have more time to get on with more important questions. Paul: 23:09 How does a company that has one of these databases take the step to make it available to the Semantic Web? What do they have to do? You said they don’t have to go away and rewrite all their pages, but presumably there is a step they have to go through. Tim: Well, there’s a couple of ways of doing it. Say that you’ve got a database-type website. One way to do it is to look at it… let’s stay with the printers, for example. When you look at the website you notice there’s a page on the printer, which has got the specifications, and it’s got a little table of the properties of the printer. And there’s a PHP script somewhere, which produces that.
So, you get somebody who understands these things to write another PHP script which is totally parallel, which just expresses the same information in RDF. That’s all. Expressing it in RDF is actually kind of simpler than expressing it in HTML, because when you express it in HTML, then you have to make sure that the CSS is pretty, and you’ve got the icons in the right places, and it meets the organization’s guidelines for being part of the website, and it’s got a consistent style and everything, and it’s got the navigation buttons.
When you do an RDF one, to a certain extent you need navigation buttons in the sense that when you output the data about the printer, you have to make sure that when you mention the compatible ink cartridge, you use the URI for the compatible ink cartridge, which will cause the RDF machine which is interested in that to pull up the RDF page about the ink cartridge.
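The "parallel script" Tim describes might look something like this: a small function that emits the same product facts the HTML page shows, but as RDF in Turtle syntax, with the cartridge links expressed as URIs a browser can follow. All URIs and property names here are invented for illustration; a real deployment would use the company’s own namespace.

```python
# Emit a product description as RDF (Turtle syntax). The ex:
# namespace and property names are hypothetical.
def printer_as_turtle(printer_uri, name, cartridge_uris):
    lines = [
        "@prefix ex: <http://example.org/schema#> .",
        "",
        f"<{printer_uri}> a ex:Printer ;",
        f'    ex:name "{name}" ;',
    ]
    # Pointing at each cartridge's own URI is the RDF equivalent of
    # a navigation button: an RDF browser can dereference it to pull
    # up the data about that cartridge.
    carts = " ,\n        ".join(f"<{u}>" for u in cartridge_uris)
    lines.append(f"    ex:compatibleCartridge {carts} .")
    return "\n".join(lines)

doc = printer_as_turtle(
    "http://example.org/products/printer42",
    "LaserWriter 42",
    ["http://example.org/products/cart7",
     "http://example.org/products/cart9"],
)
```

Note how much shorter this is than the HTML version: no CSS, icons, or navigation chrome, just the facts, which is Tim’s point about the RDF page being the simpler one to generate.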
But this RDF page is meant for machines or people using RDF browsers. So people using things like the Tabulator will be able to pull that up and follow that and look at the links, and then also, they’re going to make tables of all vendors and tables of all the cartridges, and use the interface to concoct queries effectively, for all the printers that take ink cartridges that cost less than $20 or something.
So, one way to do it is to parallel the website, parallel each Web page, which has got interesting data about a particular thing, such as a product, with the same thing in RDF. And then, you try it out using an RDF browser to see if it works and see if it’s got the data. And then, you pass it to some of your friends or your colleagues or your peers in other companies who would be interested in the data and see whether they can use it and match it up.
The other way is that you start at the database. You just look at the database, you do that inventory of all the tables in the database, hopefully, you sit down with some of the people who designed the database tables, because often within a company, it’s really a bit of a black art knowing exactly what some of the columns in a database are actually meant to mean. Sometimes, you have to go and have coffee with the people who actually designed the database schema.
But then you sit down and look at the tables, and you can point a piece of code at it, such as, for example, the D2R Server code from Berlin. We’ve got Python code, dbview; you can point it at a MySQL database. And it will make a sort of default ontology, and say, “That’s OK, let’s assume that each table is about a set of things. We’ll make a class which corresponds to the table, and we’ll give a URI to each of the things which is described by a row in the table.” And it will generate for you a default ontology.
Now, the more you tell it about the database, then the better it will be, because it realizes, “Oh, yeah, this is a product ID that crops up here,” and it knows where the product ID crops up in other tables. Then, when you do a look-up for the RDF URIs for a given product, it will not only give you the data in the product table, it’ll give you the other links, the links pointing into it, that say that this is the product, or this company has ordered the product, it’s in this invoice, and it’s got this compatible product and so on.
So, as you tell the system more and more about the database, it starts to produce a more and more reasonable RDF view of the world. And you can wander around it with Disco or one of the other RDF browsers out there, and as you slowly add pieces or add labels to the ontology, your mapping file is now becoming part ontology and part mapping. It’s mapping from the internal database schemas, and explaining how they map to this ontology that you’re making.
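The default-ontology trick that tools like D2R Server and dbview perform can be sketched against a database: each table becomes a class, each row gets a URI, each column becomes a property. This is a simplified illustration using SQLite rather than MySQL, with an invented URI scheme; real mapping tools do far more (keys, cross-table links, customizable mappings).

```python
import sqlite3

# Sketch of a "default ontology" export: table -> class,
# row -> resource URI, column -> property. The base URI and
# the row-index URI scheme are invented for illustration; a
# real tool would key URIs on the table's primary key.
def table_to_triples(conn, table, base="http://example.org/db/"):
    cur = conn.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    triples = []
    for i, row in enumerate(cur):
        uri = f"{base}{table}/{i}"
        triples.append((uri, "rdf:type", f"{base}ontology#{table}"))
        for col, val in zip(cols, row):
            triples.append((uri, f"{base}ontology#{col}", val))
    return triples

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.execute("INSERT INTO products VALUES ('ink cartridge', 19.99)")
triples = table_to_triples(conn, "products")
```

Telling the system more, for instance which column is a product ID shared with other tables, is what upgrades this mechanical export into the richer linked view Tim describes.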
As this grows, you get a more and more useful system. So, you may in fact go through that process to a point where you actually start using the data. At a certain point, you may just decide that you’ve done enough for now, and just export that data and let other people clean it up.
One of the things they’re likely to come back and say is, “Well, you’ve got a start and end time there, could you just export those in the iCal ontology?” Because a lot of people use start and end times, and if you do that, then we could all put it on our calendars, and we can all put it on our timeline views and things.
So, after you’ve done this initial export raw from your database, then you can look around for terms that you’re using where there are actually ontologies out there, go to the Semantic Web Interest Group, go to your friends. If you’re in Cambridge, come to the gatherings we have every second Tuesday at MIT, with people who are doing this sort of thing with the Semantic Web. Ask around, “Is this ontology useful? Has anybody got an ontology for this?” Use tools like Swoogle to search for ontologies.
You don’t just use internal terms, but where you’re sharing terms, you use terms that other people use as well. Paul: 29:26 Is this the kind of thing that the Linked Data Project are doing then? Tim: The Linked Data Project as a whole has got lots of projects within it. It’s a sort of linked open data movement, I suppose, in which there are different projects. And the greatest thing is, with this linked RDF, you get interoperability between projects which are really quite different.
For example, DBpedia is one of the more famous pieces of it. DBpedia is an extraction from Wikipedia of all the little data boxes. So, you have a box, for example, in Wikipedia for cities, that gives its latitude and its longitude, which county it’s in, how many people, the population, and so on. And so, they construct relationships between geographical entities and so on. And so, it produces a really interesting graph, all by scraping the rather formalized HTML generated from the page markup which is in Wikipedia.
Other systems, like MusicBrainz, have a database, and there an ontology was created and the mapping done very deliberately. Of course, there are lots of people dealing with tracks and singers and albums, and so there's a lot of interest in interoperability for music players, music look-up services, and so on.
So MusicBrainz could piggyback on lots of other ontologies; but also, the singers are often in Wikipedia, so they could connect songs, in some cases, and certainly artists, to their Wikipedia entries. That took a certain amount of negotiation between the parties to actually go through and do a data cleanup, in which they made sure they identified the corresponding nodes appropriately and linked them together.
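That cleanup boils down to asserting identity links between nodes in the two datasets. A sketch, using owl:sameAs; the MusicBrainz-style URI here is invented, while the DBpedia one is real.

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix dbr: <http://dbpedia.org/resource/> .

# A MusicBrainz artist node declared to denote the same
# real-world act as a DBpedia resource. Once this link
# exists, data from both graphs can be merged on that node.
<http://musicbrainz.example/artist/b10bbbfc>
    owl:sameAs dbr:The_Beatles .
```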
So some data is scraped from HTML pages, some of it is pulled out of databases, and some of it comes from projects which have been in XML. So things come in many different ways. And once they're exported, as you browse around the RDF graph, as you write mash-ups to reuse that data, you really don't have to be aware of how it was produced.

Paul: 32:03 That sounds like what you need, more or less, isn't it? It's about making this data visible and seamless and available to other people and their applications. And how big is this activity at the moment? I do remember some figure suggesting very large numbers of records now being available.

Tim: There are. I'm never good at large numbers. But there are various linked open data projects. There's linkeddata.org, which I see somebody has put together; there's Linked Data in Wikipedia. If you Google for 'linked open data', then you'll find pointers to various things. Where you'll find a completely up-to-date list, I'm not sure. Maybe we should attach comments at the end of the blog, pointing to things like Richard's Venn diagram of how the different pieces overlap in the large Linked Data projects. There's a lot of Linked Data which is not publicly advertised at DBpedia levels, because it's more of a niche interest, which benefits from connecting into these people, and those sorts of things. I think it's difficult to measure it all.
I think one of the large contributions to it has been friend-of-a-friend data, which is exported by a bunch of social networking sites. FOAF data used to be the largest contribution to the Semantic Web; I'm not sure whether it still is.
There's the Web conference in China, and there's going to be a Linked Open Data workshop associated with it. That's in April. And there's also going to be some sort of linked open data conference in New York in June, I think. So there's a lot of stuff happening with people interested in this. Go to the wiki page and contact people in some of the interest groups; there's an IRC channel to find the people who are involved in that.

Paul: 34:21 Yeah. And I'll include links to all of these in the show notes so people can follow along. There certainly is an awful lot happening. From our perspective here, we're certainly watching this as the thing that begins to demonstrate the real value of the Semantic Web outside of the research community. It's what you can actually start to do with this data as you link it up.

Tim: Yeah. Well, I think there's a certain role that's played by public data. So if you say in your FOAF file… I say that I work in Boston. It's kind of nice to say, "Well, I live in Boston," and get a DBpedia reference to it, so somebody's immediately got access to a lot of data about it.
I did a demonstration for Fidelity the other day, where I pulled some comma-separated-value files, some CSV files, from their website about their mutual funds. I could then put that in RDF, of course, and in that case you could look at it in tables, and you could look at other people's mutual funds, which you had pulled from other websites. To distinguish them, of course, I could put in a link to who's actually offering the funds, and I could use the DBpedia URI for Fidelity. And that suddenly makes it much richer. Suddenly you could say, "OK, I'm looking at all the funds which are based in cities on the East Coast." Suddenly they see this connection.

Paul: 35:47 Yeah, definitely. I think, as we move from the original, reasonably straightforward idea of the Web towards a more Semantic Web, do we risk having difficulty bringing understanding with us? The original idea of the Web was pretty easy for almost anyone to get.
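A sketch of what one such converted CSV row might look like as RDF: the fund URI and property names here are made up for illustration, while dbr:Fidelity_Investments is a real DBpedia resource.

```turtle
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix ex:  <http://example.org/funds#> .

# One row of the CSV, re-expressed as triples. Linking the
# offering company to its DBpedia URI is what lets queries
# reach out to city, region, and other background data.
ex:fund-001
    ex:name      "Example Index Fund" ;
    ex:offeredBy dbr:Fidelity_Investments .
```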
It was also actually very easy to do. You could simply open a text editor and write HTML. I can remember doing that, very early on, before the other tools came along.
As we move towards a more complicated Semantic Web, does it become harder to understand, and also harder to actually engage with?

Tim: I think it depends, really; it depends so much on how you look at it. The Web and the Semantic Web, the existing Web… maybe we should say the hypertext Web, the document Web, and a data Web. In some ways we are not leaving the document Web behind. It is not as though, when you say we are moving towards a Semantic Web… Yes, we are moving towards and implementing a Semantic Web, but it very much complements the Web of documents. The Web of documents will continue to exist, and as we add more and more video, and all forms of interactivity, that is going to be very exciting too.
So we have got these two things maturing, in a way. If you use something like the N3 syntax, the N-Triples syntax, or the Turtle syntax for data, it's very simple; it is a very simple language. You can write things down very easily. It is not really more complicated than HTML.
So being able to write a FOAF file for yourself in N3 is easy, and then you can convert it into RDF/XML for output. It has got, more or less, the same structure in the N3 syntax, the same sort of items, and you can convert your RDF into it and look at it. I can fix it and put it out there just as I would have done with HTML. There's a lot to be said for that.
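For instance, a minimal personal FOAF file of the kind being described, in Turtle/N3; the name, mailbox, and friend URI are invented.

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# A complete, valid personal FOAF file: a few triples about
# me, plus one link out to a friend's FOAF description.
<#me> a foaf:Person ;
    foaf:name  "Alice Example" ;
    foaf:mbox  <mailto:alice@example.org> ;
    foaf:knows <http://example.org/bob/foaf#me> .
```

Tools can serialize the same graph as RDF/XML mechanically, which is the conversion Tim mentions.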
On the other hand, data is different from documents. When you write a document, if you write a blog, if you write a poem, it is the power of the word. Even if the website adds a lot of decoration, the really important thing is the words, and it is one brain communicating to another through those words.

It is a person expressing himself to another human being. The machines will try to understand it, but they will really not be able to catch the poetry.
When we are creating data, when, for example, I am creating some information about an event, I am putting in the time and place in such a way that you can get latitude and longitude out of it, and putting in the people who are invited to the event, using their email addresses, for example, in such a way that those people can be identified and linked in. So I am constructing something which fits in and will be reused in all kinds of different ways, just as the poem will be reused: in the future people will read the poem and get all kinds of different things out of it.
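Put as triples, an event record of that kind might look like this. It is a loose sketch: the vocabularies are the W3C RDF Calendar, WGS84 geo, and FOAF namespaces, and everything else (URIs, coordinates, mailboxes) is invented.

```turtle
@prefix ical: <http://www.w3.org/2002/12/cal/ical#> .
@prefix geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

<#dinner> a ical:Vevent ;
    ical:dtstart "2008-03-01T19:00:00"^^xsd:dateTime ;
    # The place carries machine-readable coordinates...
    ical:location [ geo:lat "42.36" ; geo:long "-71.09" ] ;
    # ...and invitees are identified by mailbox, so they can
    # be linked to FOAF descriptions held elsewhere.
    ical:attendee [ foaf:mbox <mailto:alice@example.org> ] ,
                  [ foaf:mbox <mailto:bob@example.org> ] .
```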
But the data, in a way, is mathematically more powerful as a whole. People will be able to do inference: they will be able to conclude, from the fact that a person is at that event in that location, that they are not in a city 100 miles away, and therefore that they cannot attend something else.
They will be able to conclude, from the fact that a different person at the event took a photograph on their camera during the event, that the photograph was of the event; and therefore that it would be reasonable to share the photograph with anybody else who was at the event, since those people have been identified.
So, in principle, a whole lot of work can happen; our life can be made much easier, because we put this in. And thinking about how that all works is, I suppose, more complicated, because it is more powerful: we are building systems which are going to do a whole lot more.
There we are moving, I suppose, from the horse to the motorcar. In a way, the motorcar is a more… the internal combustion engine is a more complicated thing, but it will enable things the horse can’t do. It can just go a whole lot faster and it can go on for a whole lot longer. Even though we still like horses.
One of the important things about the motorcar, of course, is that for most people, when they use it, it has a very simple user interface.
So I think a crucial thing is that most of the time we are using the Semantic Web, we will be using it underneath a user interface, which will make it very, very easy, in the way that people have got used to sorting their email and managing their calendars, their contacts, their address books, their appointment diaries.
These things have interfaces, which have matured over time. They are not perfect. They could all be improved upon. They will be improved upon. They get more complicated when they are linked together, but in a way they also become easier to use when they all have been linked together. When I can automatically go from my appointment diary, you know, through to the person, and then to the information about the place they work, and so on.
The underlying architecture of the Semantic Web is actually much simpler than that of HTML: it is just these triples. So, in a way, understanding it, and developing with it, is actually a whole lot simpler. It is not inherently more complicated.
However, when you look at the complexity of all the things on the Web, or the Semantic Web, when you look at the size of them, of data from all kinds of different places, and you think about the implications of that, then yes, it is complicated. The Semantic Web is already complicated. The public Semantic Web is a very interesting and complicated, very powerful, useful thing which is interesting to analyze.
So, from that point of view, the Web as a beast gets more complicated and more intricate, and in a way more exciting, with the Semantic Web.

Paul: 42:32 I guess one of the areas we have seen a lot of growth in easy-to-use tools on the Web recently has been around the whole sort of participative movement that has loosely been labeled Web 2.0.
You wrote a blog post towards the end of last year, on the Giant Global Graph, which really made it quite clear how some of the Semantic Web ideas that had been around for a long time, applied to the sort of new kids on the block, in the social networking sphere.
Do you think developers of applications like, say, Facebook and LinkedIn and the rest are ready to embrace the Semantic Web, or do you think they think they can do it themselves?

Tim: I think there are two parts to that: whether they are willing to give up the data, and whether they are willing to use the standards. To start with, you will find a lot of places, like LiveJournal, for example, expose FOAF: standard RDF Friend-of-a-Friend data for your friend network.
If you look at MyOpera, not only do they expose a FOAF link, but they allow you, in your Opera profile, to say, "I am also this LiveJournal person." So you can follow your links; you can follow the friend-of-a-friend social network through one site and into another.
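In FOAF terms, that cross-site statement is just another pair of triples. A sketch, with invented profile URIs standing in for the two sites:

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# My profile on one site asserts that it describes the same
# person as my profile on another, and points any crawler at
# the other FOAF document so the network can be followed
# across site boundaries.
<http://my.opera.example/alice/foaf#me>
    owl:sameAs   <http://livejournal.example/alice/foaf#me> ;
    rdfs:seeAlso <http://livejournal.example/alice/foaf> .
```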
I think it is a very grown-up thing to realize that you are not the only social networking site. Otherwise it is like a website which doesn't have any links out. Similarly in the Semantic Web, if you don't have any links out, well, that's boring.
In fact, a lot of the value of many websites is the links out.
So if you start off with one of these social networks that does have links out, then you will find out a huge amount. If you find one which doesn’t, then you will be able to explore it using common tools, if they use the FOAF standards, but I bet you’ll be limited; you will bump into the edges.
Now if you look at the social networking sites which, if you like, are traditional Web 2.0 social networking sites, they hoard this data. The business model appears to be, “We get the users to give us data and we reuse it to our benefit. We get the extra value.” When one person has typed in who it is that’s in a photo, then we can benefit. We give the other person extra benefit by being able to give them a list of photos that they are in. That’s tremendously beneficial.
That's the power of the Semantic Web. And I think some of the social networking sites that have become very popular have done it because they captured the semantics. They haven't just allowed you to tag something with somebody's name; they've allowed you to capture the difference between somebody who took the photo and somebody who's in the photo, so that the power of the reuse of the data has been much greater.
So, first of all, are they going to let people use the data? I think, the push now, as we’ve seen during the last year, has been unbearable pressure from users to say, “Look, I have told you who my friends are. You are the third site I’ve told who my friends are. Now, I’m going to a travel site and now I’m going to a photo site and now I’m going to a t-shirt site. Hello? You guys should all know who my friends are.” Or, “You should all know who my colleagues are. I shouldn’t have to tell you again.”
So the users are saying, "Give me my data back. That's my data." That was one of the cries originally behind XML, back in the days of desktop applications: don't store it in a format which I can't reuse. So now it's, "Give it to me using standards. If you do that, then I can do things with it."
Now, there are two architectures which allow you to do this. The way some of the sites are working is that you’ll go to, for example, a t-shirt site which is going to allow you to print a t-shirt or something. Or say you go to a photo site and say, “Now I want to see the photos of my friends. You don’t know who my friends are. I am going to authorize you in some way, using something like OpenAuth, to go to another site. I’ll open the gate with them, I’ll tell them that it’s OK to use the information about who my friends are.”
So, just for the purpose of printing those t-shirts, or just for the purpose of sharing these photographs with my friends, I'll allow you to know who my friends are. So we're getting this moving of user data between different sites. Now we've got the user data stored in more than one place; obviously refreshing is important, and we've got dangers of inconsistency and so on, and we've got all this third-party authentication going on.
There’s another model, which is that I, the user, run an application in my browser, for example, or on my desktop. It could be an AJAX application. It could be an application which allows me to look at photos. But, what it does is, it pulls the photo information from many places, and I directly authenticate.
And when it pulls that information in, it pulls in all the information I rightfully have access to. It pulls my friends’ information as well from different places. So, if I’ve got social networks, or for that matter, if I’ve just got files in Web space. If I’ve got a friend-of-a-friend file, or even if I’ve got my local file on my desktop that now I can use. So, I can use my address book.
So, it now pulls all the information that I have access to about the social network, and it pulls all the information in that I have access to about photos, and then it allows me to browse the web of photographs of people using the full power of the integration of all those things. It allows me to look at photographs of friends, photographs of people that are friends of friends, but are not my friends, to see if I should be adopting them as friends and so on.
It can do all these powerful things, and it's happening actually in the user's browser, or on the user's machine. Both of these systems at some point allow people to share data. The second system is much simpler: it involves people writing scripts which will operate across different data sources.
Web 2.0 is a stovepipe system. It's a set of stovepipes where each site has got its data and it's not sharing it. What people are sometimes calling a Web 3.0 vision is one where you've got lots of different data out there on the Web and you've got lots of different applications, but they're independent. A given application can use different data; an application can run on a desktop or in my browser, as my agent. It can access all the data which I can use, and everything's much more seamless and much more powerful, because you get this integration. The same application has access to data from all over the place. Does that make sense?

Paul: 49:52 Absolutely, yes. And I think creating and maintaining that split between the data and the interface or the application certainly has to be the way that we go. As you say, persuading some of the companies who have built a business model around holding the data is the task that we still have to get right.
Although, I suppose, as choice arises, people can choose not to go to those sites, can't they?

Tim: People can indeed choose not to go to that site. It reminds me of the story of what happened when bookshops went onto the Web. Sometimes I have to remind people about this. When originally the Web came up and bookshop owners learned about it, they said, "OK, I was told at my dinner party that we have to have a website, so get us a website." The website would come up and it would say, "We recommend you go to see this wonderful bookshop. This is the address." And it would give you the directions.
But they never put up a list of books that they sold. If they put that up, people wouldn't go to the bookstore, and the important thing was that people should go to the bookstore. And anyway, it was also commercially sensitive data: if you put up a list of the books you were carrying, if you put up a catalogue, then your competition could immediately use that information to compete unfairly.
And then suddenly they would realize, they would be told, “Well, excuse me, sir, the competition already has its catalogue on the site. So, everybody is going to their website. Nobody is going to our website because it doesn’t have any information. When they go to the competition, they go to the website first to check if they have the book, and then, they know if they go to the store they’re going to be able to find it.”
“Oh, OK. Well, I guess we’d better put our catalogue up.” “Shall we put the prices up?” “Oh, no. Don’t put the prices up, because that’s commercially sensitive. They should see that when they go to the store.” “Oh, the other people have put their prices up now? And so now they’re taking our customers again?” “OK, I guess we’ll have to put our prices up.”
Stock levels? Of course you don't put the stock levels up! "That's our back-end information; you don't put the stock levels up. People can come to the store, and when they order it they'll find out whether we've got it in stock or not. Oh, really, they don't like that?" Clearly, then, bookstores moved to putting their stock levels online, because people got fed up with ordering a book and then getting a little email saying that it's been backordered for two months.
So there's this syndrome of competitive disclosure. Actually, business works better when people have disclosed and are communicating with one another. Once it starts, it can snowball. So, once we have people putting their catalogues up in RDF, it may be that there will be aggregators that look at products, and they won't see your products if they're not up there using Semantic Web standards.
Or if you're giving a talk and you don't advertise it using the standards, it won't be streamed. It won't be up there; people won't have it on their calendars. People won't come and see you talk, because the information wasn't made available publicly in an interoperable fashion.
So I think the lesson of the bookstores on the Web is an important one. If you're working for a company and there's a sort of hesitation about sharing information with peers that you know will actually make the company work better, tell them that story.

Paul: 53:33 Will do. We will write it down and disseminate it widely. As we come to the end, and I am conscious of the time, what do you think – and this is probably the hardest question of them all – what do you think the biggest challenges facing Semantic Web adoption and Semantic Web rollout are over the next couple of years?

Tim: Oh, that is a great question. I suppose the paradigm shift is the biggest hurdle. The fact that when you think in terms of the Semantic Web, you think differently. It was actually a problem for the Web too. People look back and they say, "Well, the Web was so easy, you just downloaded the Web browser and then you could just…" But at the beginning, to use the Web you had to write HTML; then you could edit HTML pages with editors, and the whole world took off.
Well actually, before there was a significant amount of Web, it was really difficult to persuade people it would be a good idea. They just didn’t understand how fundamentally essential it would be to be on the Web. They didn’t understand what a kick they’d get out of finding that somebody had reused their information in a different way. They didn’t understand how beneficial it would be to have more or less all information that they could think of available.
Now imagine being able to write a SPARQL query as though all the data to which you actually, legally, practically have access really were technically available to you as well: just anything which comes into your mind, as a scientist, as a businessman, or just as a school kid wondering about the answer to a science project question… There is obviously a set of people who get it.
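Such a query might look like this. It is a sketch against DBpedia-style data; the exact property names are assumptions about how the data is modeled, and real datasets would vary.

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# "Which cities in Massachusetts have more than half a
# million people?" asked directly of the Web of data.
SELECT ?city ?population
WHERE {
  ?city dbo:isPartOf dbr:Massachusetts ;
        dbo:populationTotal ?population .
  FILTER (?population > 500000)
}
ORDER BY DESC(?population)
```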
They have a twinkle in their eye; they are incredibly fired up, because they understand it is going to be really, really exciting when it all happens. And to a certain extent they are finding that in areas like life science, like social networking, like the Linked Open Data projects, it is all starting to come together.
There are other areas where somebody who has worked in data systems doesn't get it. And explaining it is why you can't do it in a 60-minute audio blog: when you explain this new way of looking at the world, this new way of looking at data, moving up a level from the database to the Web of things, you have to listen to where somebody is coming from, you have to understand what concepts they've got at the moment. Whether someone comes at this from the point of view of an object-oriented programmer or of a database person, the way you paint the Semantic Web is going to be very different.
And the misunderstandings they will naturally have about the Semantic Web will be very different. But it is happening more and more. So I think it is a question of how this meme can spread, of how understanding of what this is can spread. And I hope that having Linked Data online, and having user interfaces to it, will help.
One of the crazy things, one of the big impediments we have had for the last few years, which I guess was maybe a planning fault, is that we didn't have user interfaces; we didn't have generic interfaces. When people asked me what the Semantic Web browser would be like, I'd say, well, you don't understand, it is not really… documents are for browsers, data is for applications, so applications will use it.
And in fact, I realized we need to get that feedback, “Oh look, Ma, at my Semantic Web data”, just like “look, Ma, at my Web page.” Hence the development of things like the Tabulator, which are very much in their infancy, but starting to be able to give people that instant gratification. I put my data up there and now I can see it up there, now I can show you, now I can immediately get kudos. Now, I can stop having to answer the phone. I can point people to the data. They can go and use a Semantic Web browser on it, they don’t have to come and ask me.
So that, I think, is an important thing. We are really only at the early stages of the art and science of producing good generic, cross-domain Semantic Web browsers… and editors, of course.

Paul: 58:14 And that is an interesting point, actually. Your original Web tool was a browser and an editor. Was it a mistake not to push harder to maintain that, right back at the beginning?

Tim: I really wish we could have, for a lot of reasons. To start with, people shouldn't have had the pain of having to write angle brackets; that people were prepared to… that was a total shock to me. I had assumed that people wouldn't. Also, we would have had, I think, a much more collaborative space had all the browsers been editors.
And also we wouldn't have all this terrible markup, because the markup would have been generated automatically, and it would actually have had matching tags. So in that respect, in a number of respects, one being the collaborative space, we had to wait too long for blogs and wikis. Blogs and wikis would have happened very much more easily if pages had been editable things, if people had had editors.
The problem, of course, was that HTML got complicated. It had all kinds of things, like DIVs, which are difficult to edit; when you have nested lists, they are more difficult to edit than unnested lists. And I think that is part of it. Also, I think, to deal with the fear of actually letting people edit a whole page, we would have had to develop some sort of templating. When you edit a blog, you actually edit only the very middle of the page: you edit a stream of text with a very limited markup, and you don't get the option of editing all the stuff around it that is generated automatically.
So I think what we need are editors where the editing is limited in that way. We should have a type of HTML form where you can just type and do bold, and strong, and emphasis, and so on, very easily, using interfaces which are supported by the browser instead of by a bunch of JavaScript. And that will help us be more collaboratively creative.
And at the same time, for the data Web, it is important that when people see data that is wrong, if they've got the rights to access it, they should be able to fix it. If they see a wrong address up there, or a wrong email address, they should be able to say, actually, this is not right, and correct it.
It should be very easy for people to enter data. Things that we do, like entering bug reports, entering agenda items for meetings, entering new events, are all really generating data, and we should be able to do that really easily. Keeping both the Web of hypertext and the Web of data, keeping it a Read/Write Web, is a really big priority for me.

Paul: 61:21 Absolutely. And it is fascinating to see how people take to things like wikis. Certainly, in conversations internally within Talis, the development team uses wikis all the time, and as you roll them out to other parts of the organization there is an initial fear to get over; but once you get over it, people take to them like ducks to water. It is remarkable to see, yeah, absolutely. Good, thank you very much, Tim. Before we wrap up, do you have any final things that you wish I'd asked you?

Tim: No fundamental thing. Paul, it is really good of you to do this series. I find it really useful to be able to delve into, and it has been great listening. So thanks for keeping on with this tradition. I think it is great for us now, and maybe it is pretty interesting for posterity as well, to track people's ideas in this space.

Paul: 63:24 Thank you very much, and thank you for taking the time to take part.

Tim: You are very welcome.

Paul: Thanks.
Reference: Talis. (February 7, 2008). Sir Tim Berners-Lee talks with Talis about the Semantic Web. Retrieved April 19, 2008, from http://talis-podcasts.s3.amazonaws.com/twt20080207_TimBL.html.