Question answering

Question answering (QA) is a type of information retrieval. Given a collection of documents (such as the World Wide Web), a QA system should be able to retrieve answers to questions posed in natural language. QA is regarded as requiring more complex natural language processing (NLP) techniques than other types of information retrieval such as document retrieval, and it is sometimes regarded as the next step beyond search engines.

Closed-domain question answering deals with questions within a specific domain (for example, medicine or automotive maintenance), and can be seen as an easier task because NLP systems can exploit domain-specific knowledge such as ontologies.

Open-domain question answering deals with questions about nearly everything, and can only rely on general ontologies. On the other hand, these systems have much more data available from which to extract the answer.

Architecture

The first QA systems were developed in the 1960s and they were basically natural-language interfaces to expert systems that were tailored to specific domains. In contrast, current QA systems use text documents as their underlying knowledge source and combine various natural language processing techniques to search for the answers.

Current QA systems include a question classifier module that determines the type of question and the type of answer. After the question is analysed, the system typically uses several modules that apply increasingly complex NLP techniques to a gradually reduced amount of text. Thus, a document retrieval module uses search engines to identify the documents or paragraphs in the document set that are likely to contain the answer. Subsequently, a filter preselects small text fragments that contain strings of the same type as the expected answer. For example, if the question is "Who invented Penicillin?", the filter returns text that contains names of people. Finally, an answer extraction module looks for further clues in the text to determine whether the answer candidate can indeed answer the question.
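
The staged architecture described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not a description of any particular system: the in-memory document list, the stopword list, the function names and the regular expressions standing in for a named-entity recogniser are all hypothetical, and a real system would use a search engine and far richer NLP at each stage.

```python
import re

# Hypothetical toy corpus; a real system would query a search engine.
DOCUMENTS = [
    "Penicillin was discovered in 1928 by Alexander Fleming in London.",
    "Penicillin is an antibiotic used to treat bacterial infections.",
    "The telephone was patented in 1876.",
]

# Assumed stopword list, used only to pick out content words.
STOPWORDS = {"who", "what", "when", "is", "a", "an", "the", "did", "was", "in", "of"}

# Crude patterns standing in for a named-entity recogniser (assumption).
ANSWER_PATTERNS = {
    "PERSON": re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"),
    "DATE": re.compile(r"\b\d{4}\b"),
}


def words_of(text):
    """Lower-cased alphabetic tokens of a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def classify_question(question):
    """Question classifier: map the question word to an expected answer type."""
    tokens = question.lower().split()
    first = tokens[0] if tokens else ""
    return {"who": "PERSON", "when": "DATE"}.get(first, "OTHER")


def keywords(question):
    """Content words of the question."""
    return words_of(question) - STOPWORDS


def retrieve(question, documents):
    """Document retrieval: keep documents sharing a keyword with the question."""
    keys = keywords(question)
    return [doc for doc in documents if keys & words_of(doc)]


def filter_candidates(answer_type, documents):
    """Filter: keep only strings whose form matches the expected answer type."""
    pattern = ANSWER_PATTERNS.get(answer_type)
    if pattern is None:
        return []
    return [(match, doc) for doc in documents for match in pattern.findall(doc)]


def extract_answer(question, candidates):
    """Answer extraction: prefer the candidate whose text shares most keywords."""
    if not candidates:
        return None
    keys = keywords(question)
    best = max(candidates, key=lambda cand: len(keys & words_of(cand[1])))
    return best[0]


def answer(question, documents=DOCUMENTS):
    """Run the stages in order, each on a smaller amount of text."""
    answer_type = classify_question(question)
    retrieved = retrieve(question, documents)
    candidates = filter_candidates(answer_type, retrieved)
    return extract_answer(question, candidates)


print(answer("Who invented Penicillin?"))
```

Each stage narrows the text passed to the next one, so the cheap keyword match runs over the whole collection while the type filter and the extraction step only see the retrieved documents.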

Some systems use templates to find the final answer. For example, given the question "What is a dog?", such a system would detect the pattern "What is a X" and look for documents that start with "X is a Y".
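
The template approach can be sketched with regular expressions. The following Python fragment is only an illustration: the function name, the two patterns and the sample documents are assumptions, and it covers just the single "What is a X" / "X is a Y" pair mentioned above.

```python
import re

# Question template "What is a X" (also accepts "an"); pattern is an assumption.
QUESTION_TEMPLATE = re.compile(r"^what is an? (?P<x>[\w ]+?)\??$", re.IGNORECASE)


def answer_from_template(question, documents):
    """Return the Y of the first document starting with "X is a Y", if any."""
    match = QUESTION_TEMPLATE.match(question.strip())
    if match is None:
        return None  # the question does not fit the template
    x = match.group("x")
    # Answer template "X is a Y", anchored at the start of the document.
    answer_template = re.compile(
        r"^(?:an? )?" + re.escape(x) + r" is (?P<y>an? [^.]+)\.",
        re.IGNORECASE,
    )
    for doc in documents:
        found = answer_template.match(doc.strip())
        if found:
            return found.group("y")
    return None


# Hypothetical sample documents.
docs = [
    "Cats are popular pets.",
    "A dog is a domesticated descendant of the wolf. Dogs bark.",
]
print(answer_from_template("What is a dog?", docs))
```

Such templates are cheap to apply but brittle: a document that phrases the definition differently is simply not matched.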

Other systems use the result of web search as a means to expand the amount of text available and therefore increase the likelihood of finding the correct answer.

More sophisticated systems are capable of performing inference (such as abduction) and exploiting world knowledge.

Issues

In 2002 a group of researchers wrote a roadmap of research in question answering (see external links). The following issues were identified.

;Question classes: Different types of questions require the use of different strategies to find the answer. Question classes are arranged hierarchically in taxonomies.

;Question processing: The same information request can be expressed in various ways, some interrogative and some assertive. A semantic model of question understanding and processing is needed, one that would recognize equivalent questions regardless of the speech act or of the words, syntactic inter-relations or idiomatic forms used. This model would enable the translation of a complex question into a series of simpler questions, and would identify ambiguities and treat them in context or by interactive clarification.

;Context and Q&A: Questions are usually asked within a context and answers are provided within that specific context. The context can be used to clarify a question, resolve ambiguities or keep track of an investigation performed through a series of questions.

;Data sources for Q&A: Before a question can be answered, it must be known what knowledge sources are available. If the answer to a question is not present in the data sources, no matter how well question processing, retrieval and answer extraction are performed, a correct result cannot be obtained.

;Answer extraction: Answer extraction depends on the complexity of the question, on the answer type provided by question processing, on the actual data where the answer is searched, on the search method and on the question focus and context. Because answer processing depends on so many factors, research on it deserves particular care and attention.

;Answer formulation: The result of a Q&A system should be presented as naturally as possible. In some cases, simple extraction is sufficient. For example, when the question classification indicates that the answer type is a name (of a person, organization, shop or disease, etc.), a quantity (monetary value, length, size, distance, etc.) or a date (e.g. the answer to the question "On what day did Christmas fall in 1989?"), the extraction of a single datum is sufficient. In other cases, the presentation of the answer may require fusion techniques that combine partial answers from multiple documents.

;Real time question answering: There is a need for Q&A systems that are capable of extracting answers from large data sets within several seconds, regardless of the complexity of the question, the size and multitude of the data sources or the ambiguity of the question.

;Multi-lingual question answering: The ability to develop Q&A systems for languages other than English is very important. Equally important is the ability to find answers in texts written in languages other than English when the question is asked in English.

;Interactive Q&A: It is often the case that the information need is not well captured by a Q&A system, because the question processing part fails to classify the question properly or because the information needed for extracting and generating the answer is not easily retrieved. In such cases, the questioner might want not only to reformulate the question, but also to have a dialogue with the system.

;Advanced reasoning for Q&A: More sophisticated questioners expect answers which are outside the scope of written texts or structured databases. To upgrade a Q&A system with such capabilities, we need to integrate reasoning components operating on a variety of knowledge bases, encoding world knowledge and common-sense reasoning mechanisms as well as knowledge specific to a variety of domains.

;User profiling for Q&A: The user profile captures data about the questioner, comprising context data, domain of interest, reasoning schemes frequently used by the questioner, common ground established within different dialogues between the system and the user, etc. The profile may be represented as a predefined template, where each template slot represents a different profile feature. Profile templates may be nested one within another.
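
The idea of a user profile as a template with slots, possibly nested, can be sketched with Python dataclasses. The class names, slot names and sample values below are hypothetical and chosen only to mirror the features listed in the last item above; the roadmap does not prescribe a concrete representation.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ContextProfile:
    """Nested template holding context data (slot names are assumptions)."""
    language: str = "en"
    location: str = ""


@dataclass
class UserProfile:
    """Top-level template: each field is one slot, i.e. one profile feature."""
    context: ContextProfile = field(default_factory=ContextProfile)
    domains_of_interest: list = field(default_factory=list)
    reasoning_schemes: list = field(default_factory=list)
    # Common ground established in earlier dialogues, keyed by dialogue id.
    common_ground: dict = field(default_factory=dict)


# Hypothetical profile for one questioner.
profile = UserProfile(
    context=ContextProfile(language="en", location="Lisbon"),
    domains_of_interest=["medicine"],
)

# asdict() flattens the nested templates into plain dictionaries.
print(asdict(profile))
```

Using `default_factory` gives every profile its own empty lists and dictionaries, so unfilled slots do not end up shared between users.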

External links

QA systems regularly compete in the TREC competition and some of them have demos available on the World Wide Web.