
==== Methods (High-Level) ====

The authors introduce iPrOp, an “interactive prompt optimizer” with a web interface. The basic workflow:
# User sets up the task
#* Starts a conversation, chooses an LLM, and uploads a small labeled dataset (e.g., texts tagged with emotions like joy or sadness).
#* Provides an initial simple prompt describing the task.
# System generates prompt variations
#* An LLM is asked to rephrase the initial prompt into alternative wordings (e.g., “Classification task with labels joy and sadness” → “Classify the emotion of the text into joy or sadness.”).
# System evaluates each prompt in two ways. For each candidate prompt:
#* It selects some example texts and asks the model to predict labels and explain its choices in plain language.
#* It uses another subset of the data to calculate an F1 score, a standard metric indicating how well the prompt+model combo is classifying.
# Human-in-the-loop decision
#* The user is shown, side by side:
#** The candidate prompts
#** Example texts
#** Model predictions and short explanations
#** Performance scores (F1)
#* The user picks the better prompt or edits a prompt manually and starts another round.
# Iteration
#* This loop continues until the user is satisfied with the prompt.
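
To make the F1 score in the evaluation step concrete, here is a minimal Python sketch of how it could be computed for a multi-label emotion task. The summary does not say which F1 averaging the authors use, so macro-averaging is an assumption here, and the function name <code>macro_f1</code> is hypothetical rather than taken from iPrOp.

```python
from collections import Counter


def macro_f1(gold, predicted):
    """Macro-averaged F1: compute F1 per label, then average with equal weight.

    Assumption: macro averaging. The source only says "F1 score".
    """
    labels = sorted(set(gold) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1   # correct prediction for label g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true label but was missed
    per_label = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the label never occurs
        per_label.append(2 * tp[label] / denom if denom else 0.0)
    return sum(per_label) / len(per_label) if per_label else 0.0
```

For example, with gold labels <code>["joy", "joy", "sadness", "sadness"]</code> and predictions <code>["joy", "sadness", "sadness", "sadness"]</code>, the per-label scores are 2/3 for joy and 4/5 for sadness, giving a macro F1 of 11/15.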

For the initial evaluation, the authors simulate the human choice using only the F1 score (i.e., automatically choosing the prompt with better performance), to test whether the process can, in principle, improve prompts.
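
The simulated selection described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: <code>rephrase</code>, <code>classify</code>, and <code>score</code> are hypothetical callables standing in for the LLM rephrasing step, the LLM classification call, and the F1 metric, and keeping the current prompt unless a candidate scores strictly higher is also an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (text, gold label)


def simulate_selection(
    initial_prompt: str,
    examples: Sequence[Example],
    rephrase: Callable[[str], List[str]],         # hypothetical: LLM rewrites a prompt
    classify: Callable[[str, str], str],          # hypothetical: (prompt, text) -> label
    score: Callable[[List[str], List[str]], float],  # hypothetical: e.g. an F1 function
    rounds: int = 3,
) -> Tuple[str, float]:
    """Replace the human choice with "pick the prompt with the higher score"."""
    gold = [label for _, label in examples]

    def evaluate(prompt: str) -> float:
        predicted = [classify(prompt, text) for text, _ in examples]
        return score(gold, predicted)

    best_prompt = initial_prompt
    best_score = evaluate(best_prompt)
    for _ in range(rounds):
        # Candidates for this round are rephrasings of the current best prompt.
        for candidate in rephrase(best_prompt):
            candidate_score = evaluate(candidate)
            # Assumption: keep the incumbent unless a candidate is strictly better.
            if candidate_score > best_score:
                best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score
```

Passing the model call and the metric in as parameters keeps the sketch independent of any particular LLM backend. In the interactive setting, the comparison inside the loop is the step a human would perform instead, using the explanations as well as the score.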

They use:
* Task type: Emotion classification on text
* Datasets: Three public emotion datasets (tweets and fairy tales)
* Model: llama3.1:8b-instruct-fp16