conText - LSA - Psychometrica and Universität Würzburg

conText - Latent Semantic Analysis (LSA)

conText (an intelligent tutoring system) includes a semantic space that can also be embedded in other programs. You can use it to compute the similarity of German-language texts or words. The comparison is based not on surface features of the words or texts but on the semantic content of the text material. The semantic space supports the categorization of texts, the assessment of plagiarism at the content level, and automatic essay scoring.

Preparation

Programming is done in Java (version 1.6 or higher). Please install the Java Runtime and a suitable IDE such as Eclipse.
Please download the file ConTextTools.zip. The program uses the database system H2. The necessary jar file is already included in the zip archive.
Unpack the archive and add ConTextTools.jar and h2 1.3.167.jar to the classpath of your program.
The archive contains the folder doc with the program documentation and the source code files.
You will need to purchase a license for conText (available here) in order to use the semantic space.

Perform calculations

The program needs the semantic space included in conText. Before performing similarity comparisons, you must set the path to the folder that contains its database files. Opening the connection to the database takes a few seconds. Afterwards you can project texts into the semantic space and perform similarity comparisons. Here is a simple example:

// VectorStore, SemanticVector and LSATools are provided by ConTextTools.jar.
// Import them from the package named in the doc folder of the archive.

public class LSAExample {
	public static void main(String[] args) {
		// First set the directory path to the database
		// files and connect to the database. ALWAYS do
		// this before using any other function.
		VectorStore.setPath("c:/programs/context/database");
		VectorStore.getInstance();

		// If the path is correct, the database is connected
		// after a few seconds. Now it's time to play.
		// Each string fragment ends with a space so that the
		// concatenated words do not run together.
		SemanticVector vector1 = LSATools.getVector("Elefanten sind " +
				"Säugetiere, die in der afrikanischen Steppe und im " +
				"indischen Dschungel leben.");
		SemanticVector vector2 = LSATools.getVector("Ein Elefant ist " +
				"ein Großsäuger aus der Savanne Afrikas. Manche Arten " +
				"leben im Urwald Indiens.");

		// Cosine of the angle between the two text vectors
		float cosine = LSATools.getCosine(vector1, vector2);

		System.out.println("The semantic similarity is: " + cosine);
	}
}

License

ConTextTools is licensed under the LGPL. This means you can use the code for personal, scientific and even commercial purposes, as long as you indicate the origin of the code and reproduce a copy of the license text. H2 is dual-licensed under MPL 1.1 or EPL 1.0 (see License Information), which also permits commercial use. The database files (semantic space) of conText, however, are not free. Psychometrica holds the copyright to these files. For each redistribution you need to purchase a license for the software, and you have to deliver a copy of the software together with your program.

conText can be purchased here. For installation on a single computer, you need a single license (order number 310120-001, price: 69 € incl. VAT). The download includes all necessary database files.

If you are interested in semantic spaces that are not subject to this limitation, please contact us for a non-binding offer.

Applications

LSA has been used successfully in a wide range of applications, for example:

cross-language information retrieval
automatic classification of texts such as e-mails
automatic essay grading (AES)
intelligent search algorithms
creation of feedback in intelligent tutoring systems

Background

How does LSA work?

Latent Semantic Analysis (LSA; Deerwester, Dumais, Furnas, Landauer & Harshman, 1990) represents the semantic content of words and texts numerically as vectors in an n-dimensional vector space. The starting point of LSA is a large text collection. The texts of this collection are divided into text units (e.g. paragraphs), and a frequency matrix of the words in the texts is generated. Each cell of the matrix records how often a given word occurs in a given text unit. The cells are weighted and stop words are eliminated. The matrix is then decomposed by means of singular value decomposition, a process comparable to the eigenvalue decomposition in factor analysis. Unlike factor analysis, which decomposes the square covariance matrix, singular value decomposition is applied to the rectangular matrix of the weighted word frequencies in the texts (for a mathematical description, see Berry, Dumais & O'Brien, 1995). For example, the frequency matrix in conText is based on 65 000 text units with 311 000 different word forms. The texts stem from the areas of geology, geography, meteorology and physics. Finally, as in factor analysis, the number of factors or dimensions is reduced, usually to 300. This simultaneously eliminates irrelevant data and abstracts word meanings. The resulting data structure is an n-dimensional orthogonal space, called the semantic space.
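
The first steps described above (splitting into text units, removing stop words, counting, weighting) can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the code used to build the conText space: the class name TermDocumentMatrixSketch, the tiny stop word list and the log(1 + count) weighting are all hypothetical choices, since the text does not say which weighting scheme is used. The singular value decomposition itself is not shown; it would normally be delegated to a linear algebra library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class TermDocumentMatrixSketch {

    // Split a text unit into lower-case word tokens and drop stop words.
    static List<String> tokenize(String unit, Set<String> stopWords) {
        List<String> tokens = new ArrayList<String>();
        for (String t : unit.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty() && !stopWords.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Assign a row index to every distinct word, in order of first occurrence.
    static Map<String, Integer> vocabulary(List<String> units, Set<String> stopWords) {
        Map<String, Integer> index = new LinkedHashMap<String, Integer>();
        for (String unit : units) {
            for (String t : tokenize(unit, stopWords)) {
                if (!index.containsKey(t)) {
                    index.put(t, index.size());
                }
            }
        }
        return index;
    }

    // Rows are words, columns are text units. Raw counts are replaced by
    // log(1 + count); this weighting is an assumption made for the sketch.
    static double[][] weightedMatrix(List<String> units, Map<String, Integer> vocab,
                                     Set<String> stopWords) {
        double[][] m = new double[vocab.size()][units.size()];
        for (int d = 0; d < units.size(); d++) {
            for (String t : tokenize(units.get(d), stopWords)) {
                m[vocab.get(t)][d] += 1.0;
            }
        }
        for (double[] row : m) {
            for (int d = 0; d < row.length; d++) {
                row[d] = Math.log(1.0 + row[d]);
            }
        }
        return m;
    }

    public static void main(String[] args) {
        List<String> units = Arrays.asList(
                "Der Elefant lebt in der Savanne.",
                "Die Savanne ist trocken, die Savanne ist weit.");
        Set<String> stopWords = new HashSet<String>(
                Arrays.asList("der", "die", "in", "ist"));
        Map<String, Integer> vocab = vocabulary(units, stopWords);
        double[][] m = weightedMatrix(units, vocab, stopWords);
        for (Map.Entry<String, Integer> e : vocab.entrySet()) {
            System.out.println(e.getKey() + " " + Arrays.toString(m[e.getValue()]));
        }
    }
}
```

The result is a rectangular matrix (words by text units), which is the kind of input the singular value decomposition operates on.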

Vectorial representation of the semantic content of words and texts

The database generated by the singular value decomposition is usually called the semantic space. It is a generalized representation of the semantic content of words based on their co-occurrence in texts, as well as a compressed form of the knowledge contained in the texts. Computing a semantic space is very resource intensive. Once the semantic space is available, however, the semantic content of texts can be compared very quickly. To project a new text into the space, the vectors of its individual words are added. For similarity judgments, measures such as the cosine of the angle between two vectors can be calculated. The cosine can be interpreted much like a Pearson correlation: near-zero values mean semantic independence, and values close to one mean high similarity. Substantial negative values almost never occur. Even on old Pentium IV computers, tens of thousands of text comparisons per second are possible. LSA has been used successfully, for example, to assess large numbers of student texts submitted as part of university lectures.
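
The two operations described here, projecting a text by adding its word vectors and comparing two vectors by their cosine, can be illustrated with a small self-contained sketch. It does not use ConTextTools: the class CosineSketch, its methods and the three-dimensional toy vectors are invented purely for illustration, whereas a real semantic space has about 300 dimensions and its word vectors come from the singular value decomposition.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class CosineSketch {

    // Hypothetical toy word vectors; real ones would come from the semantic space.
    static Map<String, float[]> toyVectors() {
        Map<String, float[]> v = new HashMap<String, float[]>();
        v.put("elefant", new float[] {1.0f, 0.2f, 0.0f});
        v.put("savanne", new float[] {0.8f, 0.4f, 0.0f});
        v.put("urwald", new float[] {0.7f, 0.5f, 0.1f});
        v.put("gewitter", new float[] {0.0f, 0.1f, 1.0f});
        v.put("regen", new float[] {0.0f, 0.2f, 0.9f});
        return v;
    }

    // Project a text into the space by summing the vectors of its known words.
    static float[] project(String text, Map<String, float[]> wordVectors, int dims) {
        float[] sum = new float[dims];
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            float[] v = wordVectors.get(token);
            if (v == null) {
                continue; // unknown words contribute nothing
            }
            for (int i = 0; i < dims; i++) {
                sum[i] += v[i];
            }
        }
        return sum;
    }

    // Cosine of the angle between two vectors; 0 if either vector is all zeros.
    static float cosine(float[] a, float[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0f;
        }
        return (float) (dot / Math.sqrt(normA * normB));
    }

    public static void main(String[] args) {
        Map<String, float[]> vectors = toyVectors();
        float[] a = project("Elefant Savanne", vectors, 3);
        float[] b = project("Elefant Urwald", vectors, 3);
        float[] c = project("Gewitter Regen", vectors, 3);
        System.out.println("related texts:   " + cosine(a, b));
        System.out.println("unrelated texts: " + cosine(a, c));
    }
}
```

With these toy vectors, the two elephant texts yield a cosine close to one, while the elephant text and the weather text yield a value near zero. Because a comparison is a single pass over two short arrays, very high comparison rates are plausible even on modest hardware.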
