Jonathan Katzy

PhD candidate
Trustworthy Code Models for Software Engineering

Software Engineering Research Group, TU Delft

Biography

My research focuses on making code language models more trustworthy for software engineering. I study how these models can be explained, evaluated, maintained, and designed inclusively, using empirical methods and human-in-the-loop approaches that keep developer needs, real-world software engineering practices, and human values at the center. Current directions in my work include improving support for non-English-speaking developers and scaling mechanistic interpretability methods to enable component-wise maintenance of large code models.

Interests

Artificial Intelligence
Computational Linguistics
Machine Learning for Software Engineering
Large Language Models
Multi-lingual Language Models
Programming Languages

Education

PhD in Machine Learning for Software Engineering
Delft University of Technology
MSc in Computer Science (Artificial Intelligence)
Delft University of Technology
BSc in Technische Informatica (Computer Science)
Delft University of Technology

Featured Publications

Jonathan Katzy, Razvan-Mihai Popescu, Arie van Deursen, Maliheh Izadi

August, 2024 In FORGE24

An Exploratory Investigation into Code License Infringements in Large Language Model Training Datasets

Does the training of large language models potentially infringe upon code licenses? Furthermore, are there any datasets available that can be safely used for training these models without violating such licenses? In our study, we assess the current trends in the field and the importance of incorporating code into the training of large language models. Additionally, we examine publicly available datasets to see whether these models can be trained on them without the risk of legal issues in the future. To accomplish this, we compiled a list of 53 large language models trained on file-level code. We then extracted their datasets and analyzed how much they overlap with a dataset we created, consisting exclusively of strong copyleft code. Our analysis revealed that every dataset we examined contained license inconsistencies, despite being selected based on their associated repository licenses. We analyzed a total of 514 million code files, discovering 38 million exact duplicates present in our strong copyleft dataset. Additionally, we examined 171 million file-leading comments, identifying 16 million with strong copyleft licenses and another 11 million comments that discouraged copying without explicitly mentioning a license. Based on the findings of our study, which highlights the pervasive issue of license inconsistencies in large language models trained on code, our recommendation for both researchers and the community is to prioritize the development and adoption of best practices for dataset creation and management.

Maliheh Izadi, Jonathan Katzy, Tim van Dam, Marc Otten, Razvan-Mihai Popescu, Arie van Deursen

July, 2024 In ICSE 2024

Language Models for Code Completion: A Practical Evaluation

Transformer-based language models for automatic code completion have shown great promise so far, yet the evaluation of these models rarely uses real data. This study provides both quantitative and qualitative assessments of three public code language models when completing real-world code. We first developed an open-source IDE extension, Code4Me, for the online evaluation of the models. We collected real auto-completion usage data for over a year from more than 1200 users, resulting in over 600K valid completions. These models were then evaluated using six standard metrics across twelve programming languages. Next, we conducted a qualitative study of 1690 real-world completion requests to identify the reasons behind the poor model performance. A comparative analysis of the models’ performance in online and offline settings was also performed, using benchmark synthetic datasets and two masking strategies. Our findings suggest that while developers utilize code completion across various languages, the best results are achieved for mainstream languages such as Python and Java. InCoder outperformed the other models across all programming languages, highlighting the significance of training data and objectives. Our study also revealed that offline evaluations do not accurately reflect real-world scenarios. Upon qualitative analysis of the models’ predictions, we found that 66.3% of failures were due to models’ limitations, 24.4% occurred due to inappropriate model usage in a development context, and 9.3% were valid requests that developers overwrote. Given these findings, we propose several strategies to overcome the current limitations. These include refining training objectives, improving resilience to typographical errors, adopting hybrid approaches, and enhancing implementations and usability.

Jonathan Katzy, Maliheh Izadi, Arie van Deursen

August, 2023 In SCAM23

On the Impact of Language Selection for Training and Evaluating Programming Language Models

The recent advancements in Transformer-based Language Models have demonstrated significant potential in enhancing the multilingual capabilities of these models. The remarkable progress made in this domain not only applies to natural language tasks but also extends to the domain of programming languages. Despite the ability of these models to learn from multiple languages, evaluations typically focus on particular combinations of the same languages. In this study, we evaluate the similarity of programming languages by analyzing their representations using a CodeBERT-based model. Our experiments reveal that token representations in languages such as C++, Python, and Java exhibit proximity to one another, whereas the same tokens in languages such as Mathematica and R display significant dissimilarity. Our findings suggest that this phenomenon can potentially result in performance challenges when dealing with diverse languages. Thus, we recommend using our similarity measure to select a diverse set of programming languages when training and evaluating future models.

Recent & Upcoming Talks

Large Language Models, what are they good for?

An introduction to the use of Large Language Models in Software Engineering and their limitations.

Jun 2, 2023 CWI
