Human Languages in Source Code Piech, Chris; Abu-El-Haija, Sami
Proceedings of the Seventh ACM Conference on Learning @ Scale,
08/2020
Conference Proceeding
Computer science education has promised open access around the world, but access is largely determined by what human language you speak. As younger students learn computer science it is less appropriate to assume that they should learn English beforehand. To that end, we present CodeInternational, the first tool to translate code between human languages. To develop a theory of non-English code, and inform our translation decisions, we conduct a study of public code repositories on GitHub. The study is, to the best of our knowledge, the first on human language in code and covers 2.9 million Java repositories. To demonstrate CodeInternational's educational utility, we build an interactive version of the popular English-language Karel reader and translate it into 100 spoken languages. Our translations have already been used in classrooms around the world, and represent a first step in an important open CS-education problem.
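As a rough illustration of what translating code between human languages involves, the sketch below renames identifiers through a glossary while leaving Java keywords untouched. It is not CodeInternational's method, which the abstract does not describe; the function `translate_identifiers`, the keyword subset, and the Spanish glossary are all hypothetical, and a real tool would also need to handle comments, string literals, and compound identifiers.

```python
import re

# Assumption: a small subset of Java keywords, enough for the example below.
JAVA_KEYWORDS = {
    "public", "private", "class", "void", "int", "new",
    "while", "if", "else", "return",
}

# Matches Java-style identifiers and keywords alike.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")


def translate_identifiers(source: str, glossary: dict) -> str:
    """Replace each identifier found in `glossary`; keywords are never changed."""

    def swap(match: re.Match) -> str:
        word = match.group(0)
        if word in JAVA_KEYWORDS:
            return word  # the language itself stays in English
        return glossary.get(word, word)  # unknown names are left as-is

    return IDENTIFIER.sub(swap, source)


# Hypothetical English-to-Spanish glossary for a Karel-style program.
glossary = {"run": "correr", "frontIsClear": "frenteDespejado", "move": "mover"}
karel = "public void run() { while (frontIsClear()) { move(); } }"
translated = translate_identifiers(karel, glossary)
```

Because the substitution is purely lexical, it would also rewrite matching words inside string literals and comments, which is one reason a production translator needs a real parser.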
The use of the Internet for learning provides a unique and growing opportunity to revisit the task of quantifying how much people have learned about a given subject in different regions around the world. Google alone receives over 5 billion searches a day and its publicly available data provides insight into a learning process that is otherwise unobservable on a global scale. In this paper, we introduce the Computer Science Literacy-Proxy Index via Search (CSLI-s), a measure that utilizes online search data to make an educated guess about trends in computer science education. This measure uses a statistical signal processing technique to compose search volumes from a spectrum of topics into a coherent score. We intentionally explore and mitigate the biases of search data and, in the process, develop CSLI-s scores that correlate with traditional, more expensive metrics of learning. We then use search-trend data to measure patterns in subject literacy across countries and over time. To the best of our knowledge, this is the first measure of learning via Internet search-trends. The Internet is becoming a standard tool for learners and, as such, we anticipate search-trend data will have growing relevance to the learning science community.
Human commonsense understanding of the physical and social world is organized around intuitive theories. These theories support making causal and moral judgments. When something bad happens, we naturally ask: who did what, and why? A rich literature in cognitive science has studied people's causal and moral intuitions. This work has revealed a number of factors that systematically influence people's judgments, such as the violation of norms and whether the harm is avoidable or inevitable. We collected a dataset of stories from 24 cognitive science papers and developed a system to annotate each story with the factors they investigated. Using this dataset, we test whether large language models (LLMs) make causal and moral judgments about text-based scenarios that align with those of human participants. On the aggregate level, alignment has improved with more recent LLMs. However, using statistical analyses, we find that LLMs weigh the different factors quite differently from human participants. These results show how curated challenge datasets combined with insights from cognitive science can help us go beyond comparisons based merely on aggregate metrics: we uncover LLMs' implicit tendencies and show to what extent these align with human intuitions.
Clustering is a fundamental task in data science with wide-ranging
applications. In $k$-medoids clustering, cluster centers must be actual
datapoints and arbitrary distance metrics may be used; these features allow for
greater interpretability of the cluster centers and the clustering of exotic
objects in $k$-medoids clustering, respectively. $k$-medoids clustering has
recently grown in popularity due to the discovery of more efficient $k$-medoids
algorithms. In particular, recent research has proposed BanditPAM, a randomized
$k$-medoids algorithm with state-of-the-art complexity and clustering accuracy.
In this paper, we present BanditPAM++, which accelerates BanditPAM via two
algorithmic improvements, and is $O(k)$ faster than BanditPAM in complexity and
substantially faster than BanditPAM in wall-clock runtime. First, we
demonstrate that BanditPAM has a special structure that allows the reuse of
clustering information $\textit{within}$ each iteration. Second, we demonstrate
that BanditPAM has additional structure that permits the reuse of information
$\textit{across}$ different iterations. These observations inspire our proposed
algorithm, BanditPAM++, which returns the same clustering solutions as
BanditPAM but often several times faster. For example, on the CIFAR10 dataset,
BanditPAM++ returns the same results as BanditPAM but runs over 10$\times$
faster. Finally, we provide a high-performance C++ implementation of
BanditPAM++, callable from Python and R, that may be of interest to
practitioners at https://github.com/motiwari/BanditPAM. Auxiliary code to
reproduce all of our experiments via a one-line script is available at
https://github.com/ThrunGroup/BanditPAM_plusplus_experiments.
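To make the $k$-medoids setting concrete, the sketch below computes the clustering loss for a set of medoids (which are indices of actual datapoints) under an arbitrary distance function, and performs one exhaustive swap step in the style of classic PAM. It is not BanditPAM or BanditPAM++ and does not use the BanditPAM library's API; the names `kmedoids_loss` and `naive_swap_step` are illustrative, and the brute-force search shown here is exactly the cost that randomized methods aim to avoid.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
Metric = Callable[[Point, Point], float]


def l1(a: Point, b: Point) -> float:
    """Manhattan distance; any other dissimilarity function could be passed instead."""
    return sum(abs(x - y) for x, y in zip(a, b))


def kmedoids_loss(points: Sequence[Point], medoids: Sequence[int], dist: Metric) -> float:
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def naive_swap_step(
    points: Sequence[Point], medoids: Sequence[int], dist: Metric
) -> Tuple[List[int], float]:
    """Try every (medoid slot, non-medoid point) swap and keep the best one."""
    best = list(medoids)
    best_loss = kmedoids_loss(points, best, dist)
    for slot, candidate in product(range(len(medoids)), range(len(points))):
        if candidate in medoids:
            continue
        trial = list(medoids)
        trial[slot] = candidate
        loss = kmedoids_loss(points, trial, dist)
        if loss < best_loss:
            best, best_loss = trial, loss
    return best, best_loss


# Two well-separated groups, with both initial medoids placed in the first group.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
new_medoids, new_loss = naive_swap_step(points, [0, 1], l1)
```

Each swap step here evaluates the full loss for every candidate pair, so repeating it until no swap improves the loss scales poorly with dataset size.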
High-quality computer science education is limited by the difficulty of providing instructor feedback to students at scale. While this feedback could in principle be automated, supervised approaches to predicting the correct feedback are bottlenecked by the intractability of annotating large quantities of student code. In this paper, we instead frame the problem of providing feedback as few-shot classification, where a meta-learner adapts to give feedback to student code on a new programming question from just a few examples annotated by instructors. Because data for meta-training is limited, we propose a number of amendments to the typical few-shot learning framework, including task augmentation to create synthetic tasks, and additional side information to build stronger priors about each task. These additions are combined with a transformer architecture to embed discrete sequences (e.g. code) to a prototypical representation of a feedback class label. On a suite of few-shot natural language processing tasks, we match or outperform state-of-the-art performance. Then, on a collection of student solutions to exam questions from an introductory university course, we show that our approach reaches an average precision of 88% on unseen questions, surpassing the 82% precision of teaching assistants. Our approach was successfully deployed to deliver feedback to 16,000 student exam-solutions in a programming course offered by a tier 1 university. This is, to the best of our knowledge, the first successful deployment of machine-learning-based feedback to open-ended student code.
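The idea of classifying against a prototypical representation of each feedback label can be sketched as follows: embed the few annotated examples, average the embeddings per label, and assign a new solution to the nearest prototype. This is only an illustration under strong assumptions. The paper uses a transformer embedding with task augmentation and side information, whereas the `embed` function here is a hypothetical bag-of-tokens stand-in, and the labels and snippets are made up.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def embed(code: str, vocab: Sequence[str]) -> List[float]:
    """Stand-in embedding (assumption): counts of a few chosen tokens."""
    counts = Counter(code.split())
    return [float(counts[token]) for token in vocab]


def prototypes(
    support: Sequence[Tuple[str, str]], vocab: Sequence[str]
) -> Dict[str, List[float]]:
    """Average the embeddings of the annotated examples for each label."""
    by_label = defaultdict(list)
    for code, label in support:
        by_label[label].append(embed(code, vocab))
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in by_label.items()
    }


def classify(code: str, protos: Dict[str, List[float]], vocab: Sequence[str]) -> str:
    """Return the label whose prototype is closest in Euclidean distance."""
    query = embed(code, vocab)
    return min(protos, key=lambda label: math.dist(query, protos[label]))


# A handful of instructor-annotated examples for one hypothetical question.
vocab = ["for", "while", "return"]
support = [
    ("for x in xs : total += x", "loop"),
    ("for i in range ( n ) : pass", "loop"),
    ("return a + b", "no_loop"),
    ("return 0", "no_loop"),
]
protos = prototypes(support, vocab)
prediction = classify("for k in ks : print ( k )", protos, vocab)
```

Swapping `embed` for a learned encoder leaves `prototypes` and `classify` unchanged, which is what lets a few annotated examples define a new classifier for an unseen question.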
The PyramidSnapshot Challenge Yan, Lisa; McKeown, Nick; Piech, Chris
Proceedings of the 50th ACM Technical Symposium on Computer Science Education,
02/2019
Conference Proceeding
In the ideal CS1 classroom, we should understand programming process---how student code evolves over time. However, for graphics-based programming assignments, the task of understanding and grading final solutions, let alone thousands of intermediate steps, is incredibly labor-intensive. In this work, we present a challenge, a dataset, and a promising first solution to autonomously use image output to identify functional, intermediate stages of a student solution. By using computer vision techniques to associate visual output of intermediate student code with functional progress, we supplement a lot of the teacher labor associated with understanding graphics-based, open-ended assignments. We hope our publication of the dataset used in our case study sparks discussion in the community on how to analyze programs with visual program output.
This paper describes the results of the first shared task on the generation of teacher responses in educational dialogues. The goal of the task was to benchmark the ability of generative language models to act as AI teachers, replying to a student in a teacher-student dialogue. Eight teams participated in the competition hosted on CodaLab. They experimented with a wide variety of state-of-the-art models, including Alpaca, Bloom, DialoGPT, DistilGPT-2, Flan-T5, GPT-2, GPT-3, GPT-4, LLaMA, OPT-2.7B, and T5-base. Their submissions were automatically scored using BERTScore and DialogRPT metrics, and the top three among them were further manually evaluated in terms of pedagogical ability based on Tack and Piech (2022). The NAISTeacher system, which ranked first in both automated and human evaluation, generated responses with GPT-3.5 using an ensemble of prompts and a DialogRPT-based ranking of responses for given dialogue contexts. Despite the promising achievements of the participating teams, the results also highlight the need for evaluation metrics better suited to educational contexts.
Large language models (LLMs) are quickly being adopted in a wide range of learning experiences, especially via ubiquitous and broadly accessible chat interfaces like ChatGPT and Copilot. This type of interface is readily available to students and teachers around the world, yet relatively little research has been done to assess the impact of such generic tools on student learning. Coding education is an interesting test case, both because LLMs have strong performance on coding tasks, and because LLM-powered support tools are rapidly becoming part of the workflow of professional software engineers. To help understand the impact of generic LLM use on coding education, we conducted a large-scale randomized control trial with 5,831 students from 146 countries in an online coding class in which we provided some students with access to a chat interface with GPT-4. We estimate positive benefits on exam performance for adopters, the students who used the tool, but over all students, the advertisement of GPT-4 led to a significant average decrease in exam participation. We observe similar decreases in other forms of course engagement. However, this decrease is modulated by the student's country of origin. Offering access to LLMs to students from low human development index countries increased their exam participation rate on average. Our results suggest there may be promising benefits to using LLMs in an introductory coding class, but also potential harms for engagement, which makes their longer term impact on student success unclear. Our work highlights the need for additional investigations to help understand the potential impact of future adoption and integration of LLMs into classrooms.
TMOSS Yan, Lisa; McKeown, Nick; Sahami, Mehran ...
Proceedings of the 49th ACM Technical Symposium on Computer Science Education,
02/2018
Conference Proceeding
As computer science classes grow, instructor workload also increases: teachers must simultaneously teach material, provide assignment feedback, and monitor student progress. At scale, it is hard to know which students need extra help, and as a result some students can resort to excessive collaboration--using online resources or peer code--to complete their work. In this paper, we present TMOSS, a tool that analyzes the intermediate steps a student takes to complete a programming assignment. We find that for three separate course offerings, TMOSS is almost twice as effective as traditional software similarity detectors in identifying the number of students who exhibit excessive collaboration. We also find that such students spend significantly less time on their assignment, use fewer class tutoring resources, and perform worse on exams than their peers. Finally, we provide a theory of the parametric distribution of typical student assignment similarity, which allows for probabilistic interpretation.
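The benefit of analyzing intermediate steps rather than only the final submission can be illustrated with a small sketch: score every saved snapshot against a reference source and keep the maximum. The abstract does not specify TMOSS's similarity measure, so the token-set Jaccard score below is a deliberately simple stand-in (it is not the fingerprinting used by software similarity detectors), and `max_snapshot_similarity` and the example snapshots are hypothetical.

```python
from typing import Sequence, Set


def tokens(code: str) -> Set[str]:
    """Whitespace tokenization; a real detector would use a language-aware lexer."""
    return set(code.split())


def jaccard(a: str, b: str) -> float:
    """Size of the shared token set divided by the size of the combined token set."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def max_snapshot_similarity(snapshots: Sequence[str], reference: str) -> float:
    """Highest similarity reached by any intermediate snapshot."""
    return max((jaccard(s, reference) for s in snapshots), default=0.0)


# Hypothetical history: an early snapshot matches the reference exactly,
# while the final submission has been renamed and restructured.
reference = "def area ( w , h ) : return w * h"
snapshots = [
    "def area ( w , h ) : return w * h",
    "def size ( a , b ) : out = a * b ; return out",
]
final_only = jaccard(snapshots[-1], reference)
over_history = max_snapshot_similarity(snapshots, reference)
```

In this example the final submission shares fewer than half of its tokens with the reference, while the history contains an exact match, which is the kind of signal a final-submission-only check would miss.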