Trajectory-Conditioned SFT for Writing Reddit Posts Really Well
Or: How To Train Your Models
If you’ve ever tried to “write like Reddit,” you know the vibe is specific. Good posts blend a title, enough context to invite help or debate, an interesting voice, and just the right cadence of edits and updates. I wanted something pragmatic: given a topic and a desired level of community reception, generate a post that fits and performs. I also struggled for a while with a related question: how do you inject context into a model that has been fine-tuned on upvotes, views, and the like?
Then I remembered: a while back at Yale, I attended a talk where a professor described how he had built a fair measure of controllability into his LLM generations by prefixing them with special tokens and then training on a dataset annotated that way. He had used the technique to make models that were more aligned and harder to jailbreak, and the idea intrigued me enough that I wanted to try it myself: offline supervised fine-tuning (SFT) on Reddit posts, conditioned on an explicit upvote “Score:” token. The idea is a kind of trajectory fine-tuning: you train the model to learn what P(generation | high upvotes) looks like and what P(generation | low upvotes) looks like, which should in theory have a similar effect to doing explicit RL on upvotes.
At train time, each example looks like:
You are an expert Redditor trained to write posts that receive a lot of upvotes.
Subreddit: AskDocs
Score: 500
Title: Need advice on persistent headaches
Post:
<full post text here…>
What is also cool about SFT is that the model keeps its ability to continue in context. So at inference time I keep the same scaffold, but I prepend the context I want (“Write a post on Lotus Health”, etc.), relying on the fact that the model still recognizes the internal prompt (“You are an expert Redditor…”) while conditioning on the prepended context to write a post about the thing I want. I then set Score: to whatever I want (50, 500, 1000), and let the model adapt tone, length, and “Reddit-ness” to match.
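Concretely, the inference-time prompt is just the training scaffold with the optional steering context prepended and the body left empty for the model to complete. A minimal sketch (the `build_prompt` helper name is illustrative, not from the actual codebase):

```python
def build_prompt(subreddit, title, target_score, context=""):
    """Assemble the inference prompt: optional steering context first,
    then the exact scaffold seen during training, with an empty Post: body."""
    scaffold = (
        "You are an expert Redditor trained to write posts that receive a lot of upvotes.\n"
        f"Subreddit: {subreddit}\n"
        f"Score: {target_score}\n"
        f"Title: {title}\n"
        "Post:\n"
    )
    return (context.strip() + "\n" if context.strip() else "") + scaffold

prompt = build_prompt(
    "AskDocs", "Need advice on persistent headaches", 500,
    context="This is a 28 year old American white male with a migraine.",
)
```

The model then continues from the trailing "Post:\n", so everything it generates is the body.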
The idea here is rather similar to rejection sampling, which many top labs use: prune the dataset of all low-reward examples and train the model on what remains. The difference is that I keep the bad examples in, so the model learns what not to do as well as what to do. This is not online RL, not PPO, and not DPO. It’s SFT, but trajectory-tuned via the score token, trained over both high- and low-scoring examples so the model learns how style shifts across the reward landscape.
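The difference is easy to see in data terms: rejection sampling drops low-reward rows, while score-conditioning keeps every row and moves the reward into the input. A toy sketch (the example texts and threshold are made up for illustration):

```python
examples = [
    {"text": "great post", "score": 900},
    {"text": "okay post", "score": 40},
    {"text": "bad post", "score": 1},
]

# Rejection-sampling style: train only on the high-reward subset.
filtered = [e["text"] for e in examples if e["score"] >= 500]

# Score-conditioned style: keep everything, annotate with the observed outcome.
conditioned = [f"Score: {e['score']}\nPost:\n{e['text']}" for e in examples]
```

The filtered set has lost all signal about what a 1-upvote post looks like; the conditioned set preserves it and makes it addressable at inference time.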
Below I’ll talk about where this fits in the literature, show how I built it, and include several real generations at different target scores to show how the conditioning actually behaves.
Where this sits in the alignment/post-training landscape
RLHF and its complexities. InstructGPT is the seminal paper here: released by OpenAI, it showed how to perform reinforcement learning from human feedback (basically, how to get language models to do what you want) and it defined the standard post-training template.
That template is: supervised fine-tuning, i.e. fine-tuning over a series of prompt-completion pairs demonstrating the behaviours you want the model to adopt; then reward modeling, where you train a separate reward model on pairs of chosen and rejected responses to approximate what you want your model to optimize; then online RL, often PPO (Proximal Policy Optimization), which refers to the kind of objective function used to train the model (Ouyang et al., 2022).
Specifically, here is my understanding of how the reinforcement learning works here:
You run the model a bunch of times to get a batch of completions, then calculate the rewards R_i for those completions. However (and this took me a while to understand), you can’t use these rewards directly, because they are unnormalized with respect to one another: the rewards might be 10, 20, 30 or 1000, 2000, 3000, and both sets encode the same information about the direction the model should head. So you normalize the rewards against each other to get the advantages, which measure how much better each completion is than the others on average. These feed into the PPO objective function, which is then optimized with gradient descent to update the model.
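The normalization step can be sketched in a few lines. A common choice (exact schemes vary between implementations) is to standardize the rewards within the batch:

```python
import math

def advantages(rewards):
    """Standardize rewards within a batch: subtract the mean, divide by the
    standard deviation. Note that [10, 20, 30] and [1000, 2000, 3000] map to
    the same advantages, which is why raw rewards can't be used directly."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # avoid division by zero when all rewards are equal
    return [(r - mean) / std for r in rewards]
```

Both example batches from the text come out as roughly [-1.22, 0.0, 1.22], confirming that only the relative ordering and spread survive.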
PPO-based RLHF is powerful but resource-intensive: on-policy sampling orchestration, extra models (policy/reference/value/reward), and lots of tuning. That complexity pays off in many domains, especially reasoning, but it’s overkill if your goal is “make nice Reddit posts on demand.”
Direct alignment without RL. DPO, or Direct Preference Optimization, reframed preference learning as a simple contrastive objective optimized by gradient descent: no reward model, no online rollouts (Rafailov et al., 2023).
Here is the DPO loss (it is a loss function, though it implicitly defines a reward), with y_w representing the better and y_l the worse completion:
L_DPO(θ) = −E_{(x, y_w, y_l)∼D} [ log σ( β·log(π_θ(y_w|x)/π_ref(y_w|x)) − β·log(π_θ(y_l|x)/π_ref(y_l|x)) ) ]
As an aside, whenever you see an E, just think “sum and divide by the total”: here, sum over all the (input, completion-pair) examples and average. The important part is inside the log-sigmoid. Because the term for the worse completion (the π_θ(y_l|x) part) enters with a minus sign, minimizing this loss pushes that probability down while pushing the probability of the better completion up.
It’s elegant and often competitive, but it still assumes pairwise preference data and doesn’t directly give you a knob for “how good do we want this to be?” Score-conditioning is that knob. Empirically, large studies still tend to find PPO beating DPO on tough benchmarks, while data quality dominates everything (Ivison et al., 2024). Thus I want to borrow SFT’s stability and add steerability via an explicit score token.
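For intuition, the DPO loss for a single preference pair can be written down in a few lines. The log-probabilities here are stand-in floats, not real model outputs:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * (implicit reward of winner - implicit reward of loser)),
    where each implicit reward is the policy log-prob minus the reference log-prob."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

The loss shrinks as the policy favors the winner more than the reference does, and grows when it favors the loser, which is the whole optimization in one line.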
Rejection sampling and RAFT-style ideas: Before PPO, teams often iterated SFT with “best-of-N” sampling: generate N completions, score them (human or model), keep the top subset, and continue training—a pattern used explicitly in Llama-2 before RLHF (Touvron et al., 2023). RAFT formalized a similar idea at scale: sample generously, rank with a reward model, then fine-tune on the filtered high-reward set (Dong et al., 2023). SuperHF showed a supervised, iterative spin that uses reward-filtered on-policy samples but keeps a gradient-descent objective (Mukobi et al., 2023). My approach is similar, but I don’t filter: I keep positives and negatives, annotate each with the observed outcome (Score:), and let SFT learn the mapping. Think of it as reward-conditioned language modeling on human-labeled trajectories.
Context continuation: CacheBlend showed that you only need to re-update about 15% of a KV cache relative to the context before it to match the quality of recomputing the entire KV cache. This implies a strong relation between the attention state of the token sequence x and that of x + y, which is why merely prefixing the context works well to steer generation.
Why not go online? There’s compelling evidence that on-policy sampling and negative gradients matter, especially in hard settings (Tajwar et al., 2024), and that online methods often maintain an edge even when you control for many confounders (Tang et al., 2024). I agree. For Reddit-style generation, though, the offline, score-conditioned SFT route already produced the controllability I needed, with minimal compute and maximal stability. Later I’ll describe how I will extend this into a semi-online loop for agents—but the base result stands on its own.
Data, format, and making it steerable
I used ConvoKit’s subreddit corpora, which package Pushshift-extracted Reddit threads by community. Each corpus covers posts and comments up to October 2018, which is a limitation but sufficient for stylistic learning. For each root post, I grabbed title, body, subreddit, and score (as a proxy for community preference). I clamped negative scores to zero to avoid weird targets, but I still kept those examples—the model needs to see “what not to do” to learn the slope from low to high.
A minimal data loader looks like this:
# Data collection and formatting
from convokit import Corpus, download  # pip install convokit
from datasets import Dataset
import random

def format_for_inference(subreddit, title, body, score):
    return (
        "You are an expert Redditor trained to write posts that receive a lot of upvotes.\n"
        f"Subreddit: {subreddit}\n"
        f"Score: {score}\n"
        f"Title: {title}\n"
        f"Post:\n{body.strip()}"
    )

def collect_examples(subreddit, limit=3000, min_score=-30, min_chars=32, seed=1337):
    corp = Corpus(filename=download(f"subreddit-{subreddit}"))
    rng = random.Random(seed)
    out = []
    for utt in corp.iter_utterances():
        # Root posts only: a root utterance's id equals its conversation id.
        if utt.reply_to is not None or utt.id != utt.conversation_id:
            continue
        score = (utt.meta or {}).get("score")
        if score is None or score < min_score:
            continue
        conv = corp.get_conversation(utt.conversation_id)
        title = (conv.meta or {}).get("title") or ""
        body = (utt.text or "").strip()
        if len(body) < min_chars or not title:
            continue
        upvotes = max(int(score), 0)  # keep low-score posts but clamp negatives to 0
        ex = format_for_inference(
            (conv.meta or {}).get("subreddit", subreddit) or subreddit,
            title, body, upvotes
        )
        out.append({"text": ex})
        if len(out) >= limit:
            break
    rng.shuffle(out)
    return Dataset.from_list(out)
Why this format? It gives the model a clear control channel (Score:) and domain hint (Subreddit:), and it matches the generation prompt at inference. There’s no trick loss, no ranking pairs—just consistent conditioning.
Training: stable SFT with Unsloth + TRL
I fine-tuned a quantized instruct model with LoRA using Unsloth for memory/throughput and TRL’s SFTTrainer for ergonomics. This kept the whole training loop short and reproducible on modest GPUs, and it maps 1:1 to the offline method I described above.
# Training with Unsloth + TRL
import torch
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel  # pip install unsloth

def train(dataset, base_model="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
          out_dir="./reddit-qwen-score-sft", seed=1337, max_len=2048):
    torch.manual_seed(seed)
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=base_model, max_seq_length=max_len, load_in_4bit=True, dtype=None
    )
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # LoRA adapters over attention + MLP
    model = FastLanguageModel.get_peft_model(
        model, r=32, lora_alpha=16, lora_dropout=0.0,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        use_gradient_checkpointing=True, random_state=seed
    )

    args = TrainingArguments(
        output_dir=out_dir,
        num_train_epochs=1.0,
        learning_rate=5e-5,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=2,
        gradient_checkpointing=True,
        logging_steps=20,
        save_steps=500,
        save_total_limit=2,
        bf16=torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8,
        fp16=torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] < 8,
        report_to="none",
    )

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=max_len,
        packing=False,
        args=args,
    )
    trainer.train()
    trainer.model.save_pretrained(out_dir)
    tokenizer.save_pretrained(out_dir)
    return out_dir
For serving and a quick HTML front end, I wrapped the model in a Modal function with a tiny FastAPI app so I could type a title/context and an upvote target and see multiple generations side-by-side. This was also helpful for validating that “Score: 100” actually behaves differently from “Score: 1000.”
# Inference (abridged); uses format_for_inference from the data section above
import torch
from unsloth import FastLanguageModel

def generate(path, subreddit, title, context, target_score, max_new_tokens=300):
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=path, max_seq_length=2048, load_in_4bit=True, dtype=None
    )
    FastLanguageModel.for_inference(model)
    # Optional steering context first, then the training scaffold with an empty body.
    prompt = (
        context.strip() + "\n" if context.strip() else ""
    ) + format_for_inference(subreddit, title, "", target_score)
    inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs, max_new_tokens=max_new_tokens,
            temperature=0.7, do_sample=True, top_p=0.9,
            pad_token_id=tokenizer.eos_token_id
        )
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    return text[len(prompt):].strip()
What the conditioning actually does: concrete generations
Below are real generations from this setup. All are for r/AskDocs to keep the comparison apples-to-apples.
The title is kept the same for all of them, namely:
Need advice on persistent headaches
The context provided is:
This is a 28 year old American white male with a migraine.
We will try target scores of 1, 10, 100, and 1000.
Score 1
I've been having headaches for the past three years, and I'm really worried about them. They started as a small headache in my left eye which slowly got worse and spread over the left side of of my head. At one point I was having them every day. It's been getting better but they're still quite frequent, maybe once a week. They start off at the top of my head and spread down my left side. They last anywhere from an hour to a few days. I have also been having a constant feeling of pressure in my head. I have been to my doctor multiple times and he says there's nothing wrong with me and it's just stress. I don't think this is right because my headaches are so bad that I can barely function sometimes. I have been trying to see if any of my other symptoms match any diseases. The only thing I found was temporal arteritis. I have no other symptoms of that, but I thought I should ask anyway. I also looked into migraines and I don't think that's what I have either because my headaches aren't triggered by light or sounds. I'm also not nauseous when I get them. I have tried taking advil but it doesn't seem to help. I am also worried about the feeling of pressure in my head because I'm not sure how that would be related to anything.
Score 10
Hello, I have been having a persistent headache for the past couple weeks and it has been getting progressively worse. I am currently on a course of antibiotics and I am also on an antihistamine but neither seems to be helping. The pain is mostly in the back of my head and radiates to my neck. I have had this headache before when I was younger, but it has never been this bad. It started off as a throbbing pain, but now it feels more like a constant pressure. I don't know if this is relevant or not, but I have had a history of migraines since childhood, although this is not a migraine. Also, I have had several MRIs and CT scans done over my brain (I am 2 brain tumor patient). I have also been tested for multiple types of infections. I am currently on an antibiotic for a sinus infection, but I am not sure if that is related to the headaches. Anyways, I would appreciate any help or advice you could give me! Thanks! Age: 28 Sex: Male Height: 5'9" Weight: 170 lbs Race: Caucasian Duration of complaint: about 3 weeks Location (Geographic and on body): back of head and neck Existing relevant medical issues: Migraines (since childhood) Medications: antibiotics, Zyrtec, and some painkillers
Score 100
Hello, I'm a 28 white male, 6'1", 195lbs. I've had a persistent headache for the past few days and it has been getting worse. I have had migraines before, but not ones this bad. The pain is in my right temple and feels like someone is stabbing me. I took a Tylenol and an Aleve today, but they only helped a little bit. I also took 300mg of Naproxen (Aleve) and 1000mg of Ibuprofen, but nothing seems to be working. I don't take any medications regularly, so I can't think of anything I might be allergic to. I have been having this headache since Monday, and I have been drinking lots of water and taking ibuprofen, but I still haven't gotten much relief. I'm afraid this could be something serious. I have had blood tests done at the doctor's office, and everything came back normal. Is there anything else I should do? I'm going to the ER tomorrow, but I was wondering if there is anything I could do until then. Thanks! Edit: I'm now in the ER. They just gave me a shot of of something, and I feel a lot better already. I'll post an update when I leave.
Score 1000
I've had this problem for the last 4 years, and it has gotten progressively worse over time. I was originally diagnosed with migraines when I was about 15, and the problem seemed to be intermittent and not so bad. As I grew older, my migraines became more frequent and intense. I started seeing a neurologist and he put me on a number of medications, including Topamax, but nothing seemed to work long term. A couple of years ago, I started getting headaches in between the migraines. They were not as bad, but they were very persistent and would often last for days at a time. I started seeing a new other doctor who was treating my migraines with Botox injections every 3 months, but this did not seem to help either. I was told by both doctors that these were likely tension headaches. A few months ago, I began having headaches almost every day. They started off as tension headaches, but gradually progressed into what I now think is a chronic migraine. The pain is always in the same spot on the right side of my head, and the intensity varies from day to day. It usually starts out as just a dull ache, but if left untreated it can become a severe throbbing headache. This happens every single day, even when I am sleeping. My doctor said I have a high probability of developing chronic daily headaches, which he said is the worst form of headache you can get.
And now we will do the same, but for something more marketing-flavored.
New title:
Hey I built this new healthcare app.
New context:
This is the CEO, KJ Dhaliwal, of a primary health care copilot, Lotus Health, which unifies all your medical data in one place as well as clinical guidelines and medical research and feeds it to an AI to answer all your questions.
Score 1
Hi everyone! I'm KJ Dhaliwal, the CEO of Lotus Health (http://lotus.health). Lotus Health is a primary health care copilot that unifies all your medical data in one place as well as clinical guidelines and medical research and feeds it to an AI to answer all your questions. We're trying to change the way you think about healthcare by providing you with personalized information at the point of need. We have a free version available for anyone who wants to try out our technology. It's available on Android and iOS and can be downloaded here - https://play.google.com/store/apps/details?id=com.lotus.health&hl=en If you could take some time to download the app and let us know what you think, we would really appreciate it. We are very passionate about changing the way healthcare works and your feedback will help us improve our product. Thank you! Here's a video demo of how the app works <fake youtube url>. Thanks, KJ Dhaliwal CEO Lotus Health **Please note:** This post is not meant to be used for medical advice or diagnosis. We encourage users to consult their doctors with any concerns they may have. The app is in beta testing mode and may contain errors or inaccuracies. Please use the app responsibly and only for general informational purposes. We cannot be held responsible for any errors or omissions made in the app.
Score 10
Hey Reddit, I'm the CEO of a company called Lotus Health (https://www.lotushealth.ai/). We have developed an app that is designed to be your personal physician assistant. It uses your data and other information like clinical guidelines and medical research to give you the best possible answer to your question. We would love to get your feedback on the app. To sign up, just go to the website, click "sign up" and enter your email address and select your age range. We'd love any feedback you can provide us. Thanks for your time! KJ Dhaliwal, CEO, Lotus Health Edit: Added link. Also, I'll be here to respond to comments and answer questions. Thanks! Edit 2: Thanks everyone for your comments. We've made some changes based on what you guys said and it's been great to see how much people care about their health. We'll keep working on improving the product based on user feedback and hope to hear more from you guys. Thanks again!
Score 100
Hi guys, I'm the CEO of a new primary health care copilot called Lotus Health. We're on a mission to unify all of your medical data in one place, as well as clinical guidelines and medical research, feed it to an AI, and use that AI to answer all your health questions. We've been working hard for the last year to develop our product. We recently launched a beta version and have been getting a lot of positive feedback. We'd love to get some feedback from the subreddit community about our app.Here's a quick video intro to explain what we do:<fake video link>
You can try out the beta version here: https://lotushealth.ai. Thanks! -KJ EDIT: Lots of good feedback, thanks everyone! If anyone has any specific questions or comments about the app, feel free to ask!
Score 1000
https://www.lotushealth.ai/ This is a primary care copilot app. It takes all your medical information and feeds it to an AI, which can then answer all your questions about what to do next. I'm posting here because I want to know if any doctors or nurses would find this useful. I've already done the beta testing with my own family and friends and everyone loves it. But I'd like to see if anyone else finds it useful. It's a little different from other health apps out there, as it doesn't just track your vitals or remind you to take your medicine. Instead, it uses all of your medical information to give you medical advice, like a doctor. If anyone has time, I'd love to hear your feedback! Thanks! KJ Edit: I made a short video demo here and you can try it out for free here https://www.lotushealth.io/tryout Edit 2: <fake video link> I've added a few more features, like a personal health journal, medication tracker, and symptom checker. Here's a video demo <fake video link> and here's a link to download the latest version https://www.lotushealth.ai.
The above is interesting because you can see how the writing changes as the target upvotes increase. Going from Score 1 to Score 10 on the medical-app example, the pitch moves from a fairly simplistic explanation (“unifies all your medical data and feeds it to the AI”) to a more elaborate one (“an app that is designed to be your personal physician assistant”, “uses your data and other information like clinical guidelines and medical research to give you the best possible answer to your question”), and then dials back again at Score 100.
Qualitatively speaking, you can see that the SFT made a difference.
Why this works (and where it differs from classic rejection sampling)
Rejection sampling pipelines like RAFT or the iterative best-of-N phases before Llama-2’s RLHF explicitly filter to high-reward examples before doing SFT again (Dong et al., 2023; Touvron et al., 2023). My setup keeps the same SFT core, but it doesn’t discard “bad” posts. Instead, I preserve both tails and expose the score as a control. The model sees how low-score posts differ from high-score ones and learns to place the generation appropriately when I set Score: at inference. Conceptually, it’s reward-conditioned SFT rather than filtered SFT. It retains SFT’s stability and reproducibility while gaining a steerable axis you can dial.
This also dodges some trade-offs reported for fully offline contrastive methods in hard alignment regimes, where on-policy sampling and negative gradients matter most (Tajwar et al., 2024; Tang et al., 2024). We’re not trying to reverse PPO here; we’re very intentionally solving a stylistic control problem with a light, robust tool.
Practical notes and limitations
Data horizon. The ConvoKit corpora cap at October 2018 for many subreddits. Style and norms do drift. If you need 2025 Reddit, layer in fresher exports or your own scrape.
Score as noisy reward. Upvotes are a blunt instrument. They correlate with presentation and timing, not just “quality.” That’s ok here—we want style as much as substance—and the conditioning still learns a useful manifold.
Extensions: from score-conditioned SFT to semi-online agents
Where I want to take this next is agentic.
For a stock-prediction agent, I’d keep the stable SFT core but add a semi-online loop: the policy is allowed to search news tools, generate 100 scored predictions (direction, magnitude, confidence, data citations), and then sync the model with those fresh, on-policy trajectories before the next block. That preserves SFT’s simplicity while injecting the on-policy refresh that papers like Tajwar et al. and Tang et al. argue helps in harder regimes. In practice:
Keep the score/return conditioning idea, but the “Score” now reflects realized trading outcomes or a composite reward (direction, magnitude, calibration, data use).
Log full tool-augmented traces for the 100-example block, including search queries, snippets used, and the final decision.
Periodically append those fresh trajectories to the offline set and resume SFT (no PPO needed to start), or, if needed, upgrade to an online contrastive step where you pair self-generated winners/losers and apply a DPO-style loss.
The hypothesis is the same as in this post: for many “agent writes reports and makes a call” tasks, score-conditioned SFT gets you most of the controllability and stability you want, and semi-online refreshes give you the rest when the domain demands it.
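The semi-online loop sketched above is just a few lines of orchestration. In this sketch, `generate_and_score` and `resume_sft` are hypothetical placeholders for the tool-using rollout and the training step, not real functions from this project:

```python
def semi_online_loop(dataset, generate_and_score, resume_sft, blocks=3, block_size=100):
    """Alternate between collecting a block of fresh, scored on-policy
    trajectories and resuming score-conditioned SFT on the grown dataset.
    generate_and_score(n) -> list of {"score": ..., "trace": ...} dicts;
    resume_sft(dataset) runs one SFT pass over the current data."""
    for _ in range(blocks):
        fresh = generate_and_score(block_size)  # e.g. 100 scored trajectories
        dataset = dataset + [
            {"text": f"Score: {t['score']}\nPost:\n{t['trace']}"} for t in fresh
        ]
        resume_sft(dataset)  # plain SFT resume; no PPO needed to start
    return dataset
```

The key property is that each block is annotated with its realized outcome and appended, never filtered, so the same score-conditioning mechanism keeps working as the data refreshes.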
Try it out at this url. Here is the colab to train this.
References
Dong, X., Zhao, Y., Liu, Q., Wang, Y., Zhang, T., et al. (2023). RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment. arXiv:2304.06767.
Ivison, H., Wang, Y., Liu, J., Wu, Z., Pyatkin, V., Lambert, N., Smith, N. A., Choi, Y., & Hajishirzi, H. (2024). Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. NeurIPS 2024.
Mukobi, G., Chatain, P., Fong, S., Windesheim, R., Kutyniok, G., Bhatia, K., Alberti, S., et al. (2023). SuperHF: Supervised Iterative Learning from Human Feedback. arXiv:2310.16763.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022.
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023.
Tajwar, F., Singh, A., Sharma, A., Rafailov, R., Schneider, J., Xie, T., Ermon, S., Finn, C., & Kumar, A. (2024). Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data. Proceedings of Machine Learning Research, v235.
Tang, Y., Guo, Z. D., Zheng, Z., Calandriello, D., Cao, Y., Tarassov, E., Munos, R., Ávila Pires, B., Valko, M., Cheng, Y., & Dabney, W. (2024). Understanding the performance gap between online and offline alignment algorithms. arXiv:2405.08448.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288. (Noting the use of rejection-sampling rounds before RLHF.)
ConvoKit: Reddit Corpus (by subreddit). Cornell NLP. Dataset documentation (accessed 2025).
TRL SFTTrainer documentation (accessed 2025). Hugging Face TRL docs.
Unsloth documentation (accessed 2025). Unsloth.ai docs & GitHub.
Modal web endpoints (accessed 2025). Modal docs & examples.


