2084: War and Pitch
Some thoughts on Tolstoy, and notes on the approach I'm taking to create my pitch AI model.
I recently finished War and Peace. It’s a weird novel. Tolstoy seems to struggle throughout with the difference between art as a realistic depiction of the world and art as a moral parable. He usually strives to make his characters act realistically, but sometimes his opinions overwhelm his sense, and you get something like the Natasha ending, where a character is flattened into a two-dimensional ideal of a housewife, or the actions of Napoleon, or the deification of Kutuzov. The struggle between the two forms a large part of War and Peace, and it especially distorts his portrayal of history, with events made to conform to his ideas of historical necessity rather than the other way around. But even though the plot is so ideological, his descriptions of characters, especially their internal struggles, ring very true.
But anyway, besides reading War and Peace, I’ve also been working on the pitch AI model. First, with regard to the actual structure, I decided to go with the CM-BERT architecture, since what I have for each pitch is the audio file and a transcript of the file (which I can get from Deepgram), and I want to label this with either a) invested or not invested or b) the amount of seed investment, and that is essentially what CM-BERT does:
The model takes in text and runs it through BERT, takes in audio and runs it through a separate preprocessor, and then has a transformer layer which combines the outputs of the two to produce a single label from -3 to 3, indicating how positive or negative the audio and text transcript are, with -3 being quite negative and 3 being quite positive. This is perfect for my use case: all I need to do is replace the labels with a number indicating investment amount and retrain the model on a dataset consisting of pitches, pitch transcripts, and amounts invested.
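To make the shape of that concrete, here is a minimal sketch of the fusion idea, not the actual CM-BERT code (which uses a masked multimodal attention mechanism); the audio feature dimension, layer sizes, and mean-pooling here are my own assumptions:
import torch
import torch.nn as nn
from transformers import BertModel

class PitchFusionModel(nn.Module):
    def __init__(self, audio_dim=74, hidden=768):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Project audio features up to BERT's hidden width so both
        # modalities can share one transformer layer
        self.audio_proj = nn.Linear(audio_dim, hidden)
        fusion_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=1)
        self.head = nn.Linear(hidden, 1)  # regression output: investment amount

    def forward(self, input_ids, attention_mask, audio_feats):
        # input_ids/attention_mask: tokenized transcript
        # audio_feats: (batch, audio_len, audio_dim) preprocessed audio features
        text = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        audio = self.audio_proj(audio_feats)
        fused = self.fusion(torch.cat([text, audio], dim=1))
        return self.head(fused.mean(dim=1)).squeeze(-1)  # one number per pitch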
Now, the dataset is the tricky bit. Working off what was done in the paper Persuading Investors, I decided to use pytube, a very handy YouTube searcher and downloader, to scrape a bunch of pitch videos for Y Combinator, MassChallenge, Techstars, and AngelPad to use as data. These accelerators require pitch videos to be submitted for consideration, and whether a pitch video was successful can be determined by cross-referencing the startup’s name against the lists of accepted startups on the accelerators’ websites, along with info pulled from PitchBook and Crunchbase to get the amount invested for the successful ones. To extract startup names from the video titles, I actually used GPT-3 with the prompt “Name the startup from the following video title, give just the name”, which worked surprisingly well with only a few artifacts. The code is the following:
from pytube import Search, YouTube
import json
import os
import openai

searchqueries = [("Y", "YC application videos 2022", ["22", ]),
                 # ... lots more queries ...
                 ("AngelPad", "AngelPad Application Videos 2016", ["16", ]),
                 ]

outdata = []
outtitles = []


def checkifallowed(word):
    # Keep only titles that look like application/pitch videos
    # ("APPL" catches "application"/"apply")
    upword = str(word.upper())
    if "APPL" in upword or "BATCH" in upword or "PITCH" in upword:
        return True
    return False


for accelerator, query, checkarr in searchqueries:
    searcher = Search(query)
    print(query, searcher.results)
    for i in range(0, 5):  # page through five batches of results
        for yt in searcher.results:
            for check in checkarr:
                if (check in yt.title.upper()) and (yt.title not in outtitles) and checkifallowed(yt.title):
                    print(yt.title)
                    print(str(yt.watch_url))
                    print(str(yt.publish_date))
                    outdata.append({
                        "title": yt.title,
                        "url": yt.watch_url,
                        "author": yt.author,
                        "publish_date": str(yt.publish_date),
                        "description": yt.description,
                        "length": yt.length,
                        "views": yt.views,
                        "accelerator": accelerator,
                        "keywords": yt.keywords})
                    outtitles.append(yt.title)
        searcher.get_next_results()  # fetch the next page of search results

with open("youtubedatafilter.json", "w") as f:
    json.dump(outdata, f, indent=2)

openai.api_key = os.environ["OPENAI_API_KEY"]  # key from the environment, not hard-coded

with open("youtubedatafilter.json", "r") as f:
    with open("youtubetitles.txt", "w") as f2:
        obj = json.load(f)
        print(len(obj))
        for yt in obj:
            print(yt["title"])
            # Ask GPT-3 to pull the startup name out of the video title
            response = openai.Completion.create(
                model="text-davinci-003",
                prompt="Name the startup from the following video title, give just the name:\n" +
                yt["title"],
                temperature=0.7,
                max_tokens=256,
                top_p=1,
                frequency_penalty=0,
                presence_penalty=0
            )
            startupname = response.choices[0].text.strip(" \n")
            print(startupname)
            yt["startup"] = startupname
            f2.write(yt["title"] + ":" + startupname + "\n")

with open("youtubedata2.json", "w") as f:
    json.dump(obj, f, indent=2)
This data I still need to clean up a bit, and that will be manual; then I’ll use pytube to actually download the videos and extract the audio and transcripts, but it’s a start. The next step is to get the relevant lists of startups accepted by the accelerators and cross-reference, and then use Crunchbase’s API to get investment amounts, but I’ll do that sometime in the future.
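The download step itself should be simple with pytube; here’s a rough sketch of what I’m planning (the output path and filename scheme are my own choices, and the Deepgram transcription happens afterwards):
import json
from pytube import YouTube

with open("youtubedata2.json", "r") as f:
    videos = json.load(f)

for video in videos:
    try:
        # Grab the audio-only stream for each scraped URL
        stream = YouTube(video["url"]).streams.filter(only_audio=True).first()
        stream.download(output_path="audio", filename=video["startup"] + ".mp4")
    except Exception as e:
        print("failed:", video["url"], e)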
Now, the issue with this model is that it’ll end up being a bit of a black box: it’ll give you a number, but not much about what you can do to improve. The paper I cited earlier, by Professor Ma, also attempts to calculate a Pitch Factor for each pitch, but it does so by using a series of machine learning models to extract a bunch of features from the video and audio (Visual-Positive, Visual-Negative, Vocal-Positive, Vocal-Negative, Vocal-Arousal, Vocal-Valence, Verbal-Positive, etc.), performing factor analysis on these features for each pitch along with investment to get a pitch factor per pitch, and then showing that a higher pitch factor correlates well with acceptance and investment. This is a bit more explicit and therefore possibly easier for people to understand.
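As a sketch of that pipeline (the feature names below are paraphrased from the paper, and the random placeholder matrix stands in for real per-pitch scores):
import numpy as np
from sklearn.decomposition import FactorAnalysis

features = ["visual_positive", "visual_negative", "vocal_positive",
            "vocal_negative", "vocal_arousal", "vocal_valence", "verbal_positive"]
X = np.random.rand(200, len(features))  # placeholder: one row of extracted scores per pitch

# Fit a single common factor and use its score as the "pitch factor"
fa = FactorAnalysis(n_components=1)
pitch_factor = fa.fit_transform(X)[:, 0]  # one scalar per pitch
print(pitch_factor[:5])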
However, performing PCA on the whole dataset for an ongoing model would get computationally expensive fast. An alternate method that performs the same dimensionality reduction would be to train an autoencoder on the dataset, where the bottleneck is only one channel wide, and then use the encoder to generate the pitch factor. You can show that a linear autoencoder like this is mathematically almost the same as PCA, with the added advantage that it can estimate a pitch factor for a new pitch much more quickly, since you just need to run the pitch’s features through the encoder. So this is something I also want to look into, as it would probably provide a better way to give people feedback on what they’re doing.
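A minimal sketch of that idea: a linear autoencoder with a one-unit bottleneck, trained to reconstruct mean-centered features, spans the same subspace as the first principal component, so the encoder output can serve as the pitch factor. The dimensions and training details below are placeholders:
import torch
import torch.nn as nn

n_features = 7
encoder = nn.Linear(n_features, 1, bias=False)
decoder = nn.Linear(1, n_features, bias=False)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)
loss_fn = nn.MSELoss()

X = torch.randn(200, n_features)  # placeholder: per-pitch feature matrix
X = X - X.mean(dim=0)             # mean-center, as PCA does

for epoch in range(500):
    opt.zero_grad()
    loss = loss_fn(decoder(encoder(X)), X)  # reconstruct through the 1-wide bottleneck
    loss.backward()
    opt.step()

# One scalar per pitch; scoring a new pitch is just one matrix multiply
pitch_factor = encoder(X).detach().squeeze()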
But first, I’ve got to get the data clean. In 2084, assistant models like these will probably be everywhere - never go uncoached again! - as the technology for providing consulting and advice keeps getting better and better.