Data Mining
Learning Analytics is an emerging discipline concerned with developing methods for exploring the unique types of data that come from educational settings, and using those methods to better understand students and the settings in which they learn.
One technique used in learning analytics is educational data mining, which employs data mining techniques to analyze large corpora gathered from educational software tools. By applying these and other techniques, we are analyzing the Grade Grinder corpus. We aim to develop techniques and approaches that colleagues can subsequently reuse on the other large educational data sets that are likely to become ubiquitous as, for example, learning management systems become more widely used. The following sections outline what we are doing with the data and the questions we seek to address.
- The taxonomisation of student errors: Educational data mining techniques will be used to construct a detailed taxonomy of errors made by students as they learn to reason in this formal domain. Through detailed analyses of patterns of error instances across students, and within individual students over time, we aim to identify distinct types of error, such as misconceptions and slips. In pilot work on a small subset of the data (Barker-Plummer, Cox, Dale and Etchemendy, 2008), we have so far identified three top-level error types (which we call "structural", "connective" and "atomic") that account for a significant proportion of the data. We also propose to discover the mal-rules that students appear to use in producing the error patterns we observe; a minimal illustration of this idea appears at the end of this item. The error types and mal-rules will reflect current cognitive theories of human problem solving, reasoning and comprehension, and will take into account individual differences in reasoning style.
In pilot work (Barker-Plummer et al., 2008; Dale, Barker-Plummer and Cox, 2009) we have established that a major source of difficulty for students stems from the fact that conditionals (e.g. if) and quantifiers (e.g. some) are used quite differently in natural language compared to their use in logic. This result mirrors existing work in mathematics education, and suggests that our findings will generalise. Although our corpus contains work in undergraduate logic, the general task of learning to manipulate and use formal expressions in a careful way underlies all of mathematics, technology, engineering and science. We are confident that our results will carry over into these other domains and can be used to inform education beyond the undergraduate logic curriculum.
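To make the notion of a mal-rule concrete, the following minimal sketch (our own illustration, not the project's analysis code) models a mal-rule as a systematic rewrite of the reference solution and checks whether a student submission could have been produced by applying it; the plain-string representation of formulas and the two example rules are assumptions made for this sketch.

```python
# Illustrative sketch only: a mal-rule modelled as a systematic (incorrect)
# rewrite of the reference solution. Formulas are plain strings here, an
# assumption made for this example.
MAL_RULES = {
    "conjunction-for-conditional": lambda s: s.replace("->", "&"),
    "biconditional-for-conditional": lambda s: s.replace("->", "<->"),
}

def explain_error(reference: str, submission: str):
    """Return the name of a mal-rule that maps the reference solution onto
    the student's submission, or None if no known rule explains it."""
    for name, rule in MAL_RULES.items():
        if rule(reference) == submission:
            return name
    return None

if __name__ == "__main__":
    reference = "All x (Cube(x) -> Small(x))"
    submission = "All x (Cube(x) & Small(x))"
    print(explain_error(reference, submission))  # conjunction-for-conditional
```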
- Informing the design of student learning support: The results of the error taxonomy analyses will, inter alia, inform the design of automated diagnostic and remedial extensions to the current e-assessment system. Our pilot work here has been promising, with an approach to classification based on regular expression patterns correctly identifying an average of 85% of errors; a sketch of this style of classification follows this item. The aim is for the e-assessment system to ultimately provide highly targeted, personalised support to learners. We expect that the techniques we develop will also generalise to a wide array of other domains and subject areas.
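As a rough indication of what pattern-based classification involves, the sketch below matches incorrect submissions against illustrative regular-expression patterns; the patterns and category names are our own assumptions for this example, not the rules used in the pilot study.

```python
import re

# Illustrative (hypothetical) error patterns for submitted FOL translations.
# Each pattern maps a surface regularity in an incorrect submission to a
# candidate error category.
ERROR_PATTERNS = [
    # Biconditional used where a conditional was expected.
    ("connective: <-> for ->", re.compile(r"<->")),
    # Conditional inside an existential, a common quantifier confusion.
    ("connective: -> inside existential", re.compile(r"Exists \w+ \(.*->")),
    # Wrong predicate arity, e.g. Between(a, b) instead of Between(a, b, c).
    ("atomic: arity of Between", re.compile(r"Between\([^,()]+,[^,()]+\)")),
]

def classify(submission: str) -> list[str]:
    """Return the (possibly empty) list of error categories whose pattern
    matches the submitted sentence."""
    return [label for label, pattern in ERROR_PATTERNS if pattern.search(submission)]

if __name__ == "__main__":
    print(classify("Exists x (Cube(x) -> Small(x))"))
    # ['connective: -> inside existential']
```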
- The development of innovative language technologies: We will explore the use of statistical and symbolic corpus analysis methods from computational linguistics and language technology for the purpose of generating appropriate English paraphrases of students' submitted logic sentences; a toy illustration of the idea follows this item. The goal here is to improve the effectiveness of e-assessment system feedback, and in so doing to make it possible for more students to come to grips with this traditionally difficult subject.
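As a toy illustration of symbolic generation over a small fragment of the blocks-world language, the sketch below renders formulas (represented as nested tuples, an assumption made for this example) as stylised English; the phrasing rules are our own and do not reflect the project's generation system.

```python
def paraphrase(formula) -> str:
    """Render a formula, given as a nested tuple, as stylised English.
    Toy illustration only; a real system would also handle parsing,
    quantifier scope and more natural aggregation."""
    op = formula[0]
    if op == "all":
        _, var, body = formula
        return f"every object {var} is such that {paraphrase(body)}"
    if op == "exists":
        _, var, body = formula
        return f"some object {var} is such that {paraphrase(body)}"
    if op == "->":
        _, left, right = formula
        return f"if {paraphrase(left)} then {paraphrase(right)}"
    if op == "&":
        _, left, right = formula
        return f"{paraphrase(left)} and {paraphrase(right)}"
    if op == "not":
        return f"it is not the case that {paraphrase(formula[1])}"
    pred, *args = formula  # atomic formula, e.g. ("Cube", "x")
    if len(args) == 1:
        article = "a " if pred in ("Cube", "Tet", "Dodec") else ""
        return f"{args[0]} is {article}{pred.lower()}"
    return f"{args[0]} is {pred.lower()} {args[1]}"

if __name__ == "__main__":
    fol = ("all", "x", ("->", ("Cube", "x"), ("Small", "x")))
    print(paraphrase(fol))
    # every object x is such that if x is a cube then x is small
```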
- Studying the time-course of student learning: Individual student submissions are time-stamped. By analysing successive exercise submissions by individuals, we can examine individual students' learning trajectories, the time-course of their learning, and learning impasses. In pilot work (Dale, Barker-Plummer and Cox, 2009) we have identified a useful measure of learning that we term stickiness: the number of attempts it takes for a student to determine a correct answer once they have made their initial mistake. We would like to research this metric further and use it as an outcome measure in learning evaluation studies; a sketch of how it might be computed follows this item.
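The sketch below shows one way stickiness could be computed from time-stamped submission records; the record layout (student, exercise, timestamp, correct flag) is an assumption made for this illustration, not the actual corpus schema.

```python
from collections import defaultdict

def stickiness(submissions):
    """Compute stickiness per (student, exercise): the number of attempts a
    student needs to reach a correct answer after their first incorrect one.

    `submissions` is an iterable of (student, exercise, timestamp, correct)
    tuples -- an assumed record layout. Returns a dict mapping
    (student, exercise) to an attempt count, or to None if the student
    never recovers from the initial error.
    """
    by_task = defaultdict(list)
    for student, exercise, timestamp, correct in submissions:
        by_task[(student, exercise)].append((timestamp, correct))

    result = {}
    for task, attempts in by_task.items():
        attempts.sort()  # order attempts chronologically
        first_error = next((i for i, (_, ok) in enumerate(attempts) if not ok), None)
        if first_error is None:
            continue  # this student never made a mistake on this exercise
        recovery = next((i for i, (_, ok) in enumerate(attempts)
                         if i > first_error and ok), None)
        result[task] = None if recovery is None else recovery - first_error
    return result

if __name__ == "__main__":
    log = [("s1", "7.12", 1, False), ("s1", "7.12", 2, False), ("s1", "7.12", 3, True)]
    print(stickiness(log))  # {('s1', '7.12'): 2}
```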
- Studying the role of diagrams in learning: The corpus contains diagrams as well as sentences of logic. Students use desktop applications to build or manipulate "blocks worlds" such that sentences of natural language or logic are true in them; a minimal illustration of what this involves follows this item. Hence we are able to triangulate students' performance in the linguistic domain (natural language, logic) with their performance in the graphical (diagrammatic) domain. A preliminary study of a small data subset (Cox, Dale, Etchemendy and Barker-Plummer, 2008) has revealed theoretically significant findings. For example, errors in diagramming sentences such as "not a small cube" are manifested much more frequently with respect to the object's size than with respect to its shape. We investigated this phenomenon further in a human-subjects study, which enabled us to learn more about the factors influencing this effect. This work is reported in our Readings and Realizations project.
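For readers unfamiliar with these exercises, the minimal sketch below (our own illustration, using an assumed representation of a world) shows what it means for a diagrammed world to satisfy a sentence such as "a is not a small cube".

```python
from dataclasses import dataclass

@dataclass
class Block:
    name: str
    shape: str  # "cube", "tet" or "dodec"
    size: str   # "small", "medium" or "large"

def not_small_cube(block: Block) -> bool:
    """True iff the block is not a small cube, i.e. it fails to be a cube,
    fails to be small, or both."""
    return not (block.shape == "cube" and block.size == "small")

if __name__ == "__main__":
    # Both worlds make "a is not a small cube" true; the Cox et al. (2008)
    # finding is that students' diagramming errors here tend to involve the
    # size attribute rather than the shape attribute.
    print(not_small_cube(Block("a", "tet", "small")))   # True: not a cube
    print(not_small_cube(Block("a", "cube", "large")))  # True: not small
```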
- Unintended roles of content in exercise difficulty: Our blocks world language involves information about the size and shape of blocks, as well as spatial relations between them. Previous research suggests that processing information involving mixed spatial and visual properties is significantly more difficult than processing homogeneous information, perhaps due to competition for processing resources; that research has focused on tasks involving reasoning with mixed information. We data-mined the translation exercises, partitioning the sentences according to the mix of information types they contain, and found that this effect holds even in the case of simple English-to-FOL translation; a sketch of such a partition follows this item. This work is reported in (Barker-Plummer, Dale and Cox, 2011a).
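As an indication of how such a partition can be computed, the sketch below groups sentences by the kinds of information their predicate symbols convey; the predicate groupings are assumptions about the blocks-world vocabulary made for this example, not the analysis code from the 2011 paper.

```python
import re
from collections import Counter

# Assumed groupings of blocks-world predicates by the kind of information
# they convey; illustrative only.
SIZE_PREDICATES = {"Small", "Medium", "Large", "Larger", "Smaller", "SameSize"}
SHAPE_PREDICATES = {"Cube", "Tet", "Dodec", "SameShape"}
SPATIAL_PREDICATES = {"LeftOf", "RightOf", "FrontOf", "BackOf", "Between", "Adjoins"}

def information_mix(fol_sentence: str) -> frozenset[str]:
    """Return the set of information types ('size', 'shape', 'spatial')
    mentioned in a first-order sentence, based on its predicate symbols."""
    predicates = set(re.findall(r"[A-Z]\w*", fol_sentence))
    mix = set()
    if predicates & SIZE_PREDICATES:
        mix.add("size")
    if predicates & SHAPE_PREDICATES:
        mix.add("shape")
    if predicates & SPATIAL_PREDICATES:
        mix.add("spatial")
    return frozenset(mix)

if __name__ == "__main__":
    corpus = ["Cube(a) & Small(a)", "LeftOf(a, b)", "Tet(b) & FrontOf(b, c)"]
    print(Counter(information_mix(s) for s in corpus))
```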
- The construction of an open-access front-end: We would like to make our corpus of data accessible to the wider academic community. To that end, we propose to develop OpenFace, a user-friendly web-based front-end designed to facilitate data filtering, sharing and re-use. Users will be encouraged to "grow" the resource by submitting the results of their analyses, and ancillary materials such as copies of publications. A discussion forum will also be provided. We plan to accommodate interoperability requirements (e.g. with existing data mining tools). We intend to structure the corpus in terms of the learning tasks posed to the learner and in terms of a philosophical logic curriculum (i.e., a hierarchy of conceptual pre- and co-requisites).
As a first step toward this goal, we have made available the subcorpus of translation exercises. This subcorpus contains solution attempts for the translation exercises in Language, Proof and Logic. Here, students are given English sentences and are asked to translate them into first-order logic; an illustrative example follows. The corpus is described in (Barker-Plummer, Dale and Cox, 2011b and 2011c), and is available on request.
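To give a flavour of what the subcorpus contains, here is an illustrative record for a single submission; the field names, exercise number and sample sentence are assumptions made for this sketch, not the published schema.

```python
# Illustrative example of the kind of record the translation subcorpus
# contains; all field names and values here are hypothetical.
example_record = {
    "exercise": "7.12",                          # exercise identifier (hypothetical)
    "english": "Every cube is small.",           # sentence to be translated
    "reference": "All x (Cube(x) -> Small(x))",  # instructor's solution
    "submission": "All x (Cube(x) & Small(x))",  # a (wrong) student attempt
    "correct": False,
    "timestamp": "2011-02-14T09:31:05Z",
}

if __name__ == "__main__":
    status = "correct" if example_record["correct"] else "incorrect"
    print(f"{example_record['english']!r} -> {example_record['submission']} ({status})")
```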