How to Make a Hit Visual Novel in the Western Market

by Andrew Heller


What are Visual Novels?

Visual novels are a modern storytelling medium, a sort of middle ground between traditional novels and video games. While the definition is broad, they are generally text-based, with the player progressing through the story by reading, making decisions, and interacting with characters or objects. Visual novels often feature anime-esque artwork and character sprites, voice acting, music, and sound effects to increase narrative immersion.

Visual novels are primarily Japanese in origin, and while western visual novels such as the atmospheric cyberpunk bartending sim VA-11 Hall-A or the deceptively innocent psychological horror Doki Doki Literature Club have found success in the market, the vast majority are translations of popular Japanese works. Some of these translations are done by fans and remain officially unreleased in the west, such as the origin of the global hit phenomenon that is the Fate franchise, Fate/stay night. The majority, however, are works that have been officially translated and released by the publisher to western markets, as is the case with the classic tearjerker CLANNAD. The medium's growing popularity in the west may be regarded as a side effect of the mass adoption of Japanese culture in the United States and other western countries, though fans of the medium often agree that it has a rich subculture of its own.

Where are we getting the data?

The data used here will be sourced from Visual Novel Database, or VNDB for short. From their site:

Our primary goal is building a large, comprehensive and up-to-date database for information about all existing visual novels. VNDB aims to be a central place to look up general information about the visual novels themselves, as well as practical information around it, like available releases, localisations and producers.

VNDB provides a complete dump of their database, updated daily, as a PostgreSQL database in the form of a compressed .tar.zst file, in addition to smaller dumps that contain only specific data such as characters, traits, and thumbnails. You can read more about their database dumps and access downloads here. In addition, they host their entire database online through SQLPad's interface, accessible for querying and downloading by the general public. More info on querying VNDB's dataset can be found here. We will primarily be taking advantage of the full data dump to get the most up-to-date and comprehensive data possible. Note that VNDB is a primarily English-language service, and as such serves mostly as a representation of the western market and other native English speakers.

Scraping and Organizing

First, let's set up our environment. We'll install a few services here and set up a dummy account that we can use to access the PostgreSQL database we'll be constructing.

In [1]:
%%capture
!pip install psycopg2==2.7.3.2
!sudo apt-get -y -qq update
!sudo apt-get -y -qq install postgresql
!sudo service postgresql start
!sudo -u postgres createuser --superuser vndb
!sudo -u postgres createdb vndb
!sudo -u postgres psql -c "ALTER USER vndb PASSWORD 'vndb'"

Next, we can download the dump, extract it, and use the included import script to construct the complete database right here.

In [2]:
%%capture
!curl -L https://dl.vndb.org/dump/vndb-db-latest.tar.zst -O
!sudo apt-get install zstd
!tar -I zstd -xvf vndb-db-latest.tar.zst
!PGPASSWORD=vndb psql -U vndb -h 127.0.0.1 vndb -f import.sql
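
If you want to verify that the import succeeded (an optional sanity check, not part of the original notebook), you can ask PostgreSQL to list the tables that were created:

!PGPASSWORD=vndb psql -U vndb -h 127.0.0.1 vndb -c "\dt"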

Now that we've got our database, we can query it for data that might be relevant to meta-analysis and visualization. The query below finds every all-ages (non-18+) release of every visual novel in the database, and returns the associated data in the following columns:

title
edition
olang
lang
freeware
engine
patch
official
length
c_rating
c_votecount
released
platform
developer

We'll also import some libraries that will help us with analysis.

In [3]:
#statistics libraries
import math
import pandas as pd
import sqlalchemy
import datetime
from collections import Counter
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.preprocessing import PolynomialFeatures, OrdinalEncoder
import matplotlib.pyplot as plt
import seaborn as sns

#visual configurations
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('retina')
pd.set_option('max_colwidth', 15)
pd.options.mode.chained_assignment = None

#connecting and querying database
engine = sqlalchemy.create_engine("postgresql://vndb:vndb@localhost:5432/vndb")
query = """
WITH r3 AS ( 
  WITH r2 AS (
    WITH r1 AS (
      WITH r0 AS (
        SELECT
          vn.title,
          r.title AS edition,
          vn.olang,
          vn.length,
          vn.c_rating,
          vn.c_votecount,
          r.patch,
          r.freeware,
          r.engine,
          r.official,
          r.id,
          r.released
        FROM
          releases r
          LEFT JOIN releases_vn rvn ON (r.id = rvn.id)
          JOIN vn ON (rvn.vid = vn.id)
        WHERE
          r.minage != '18'
      )
      SELECT
        r0.*,
        rp.platform
      FROM
        releases_platforms rp
        RIGHT JOIN r0 ON (rp.id = r0.id)
    )
    SELECT
      r1.*,
      rp.pid
    FROM
      releases_producers rp
      RIGHT JOIN r1 ON (rp.id = r1.id)
    )
  SELECT
    r2.*,
    rp.lang
  FROM
    releases_lang rp
    RIGHT JOIN r2 ON (rp.id = r2.id)
  )
SELECT
  r3.title,
  r3.edition,
  r3.olang,
  r3.lang,
  r3.freeware,
  r3.engine,
  r3.patch,
  r3.official,
  r3.length,
  r3.c_rating,
  r3.c_votecount,
  r3.released,
  r3.platform,
  p.name AS developer
FROM
  producers p
  RIGHT JOIN r3 ON (r3.pid = p.id)
"""

To visualize this, we can read the data into a pandas DataFrame and do some cleaning. This includes mapping the integer codes in the length column to readable labels, converting the integers in the released column into datetime objects for each release date, scaling c_rating to a score out of 10, and some sorting. In addition, we are using c_votecount in lieu of the c_popularity metric provided by VNDB. This is because VNDB's c_popularity calculation weights each vote by how many visual novels that user has read and voted on, and by how high their rating was compared to others'. Since we are analyzing the success of a visual novel rather than critical response and subjectivity, we want to be working with raw numbers here.

In [4]:
#reading query results via pandas
vns = pd.read_sql(query, engine)

#formatting
vns = vns.drop_duplicates().sort_values(by = "c_votecount", ascending = False).reset_index(drop = True)
lens = {5 : "Very Long", 4 : "Long", 3 : "Medium", 2 : "Short", 1 : "Very Short", 0 : "Unknown"}
vns["length"] = vns["length"].apply(lambda x : lens[x])
vns["c_rating"] = vns["c_rating"] / 100
vns["released"] = pd.to_datetime(vns["released"], format="%Y%m%d", errors="coerce")
display(vns)
title edition olang lang freeware engine patch official length c_rating c_votecount released platform developer
0 Katawa Shoujo Katawa Shou... en ja True Ren'Py False True Medium 7.84 13328 2011-04-16 win Honyaku-Subs
1 Katawa Shoujo Katawa Shou... en zh True Ren'Py False True Medium 7.84 13328 2010-06-25 mac Kawa Soft
2 Katawa Shoujo Katawa Shou... en tr True Ren'Py False False Medium 7.84 13328 2020-11-03 lin Yumeshi Nov...
3 Katawa Shoujo Katawa Shou... en tr True Ren'Py False False Medium 7.84 13328 2020-11-03 mac Yumeshi Nov...
4 Katawa Shoujo Katawa Shou... en en True Ren'Py False True Medium 7.84 13328 2010-06-25 lin Four Leaf S...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
47383 'Til Death When the Ni... en en False Ren'Py False True Very Short NaN 0 2021-08-06 mac Lunaris Games
47384 Possibly a ... When the Ni... en en False Ren'Py False True Very Short NaN 0 2021-08-06 win Lunaris Games
47385 Possibly a ... When the Ni... en en False Ren'Py False True Very Short NaN 0 2021-08-06 lin Lunaris Games
47386 Possibly a ... When the Ni... en en False Ren'Py False True Very Short NaN 0 2021-08-06 mac Lunaris Games
47387 Phase-D: Ko... Phase-D Com... ja ja False False True Unknown NaN 0 2013-03-28 psp BOOST ON

47388 rows × 14 columns

Analysis

Here, we'll take a look at the data we've collected and try to make some sense of it. An important thing to keep in mind is that there is a massive amount of shorter, independently developed and released visual novels. Many of these data points can be treated as noise, since their production and release neither respond to nor influence the market as a whole. This means that the trends we see may be smaller and less prominent, but they represent important correlations nonetheless.
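
If you want a rough sense of how much of the dataset sits at this low end, here is a quick, hedged sketch using the vns DataFrame built above (the thresholds are arbitrary, purely for illustration):

#share of titles whose releases have fewer than 10 recorded votes, and share
#of releases categorized as Very Short (both thresholds are illustrative only)
votes_per_title = vns.groupby("title")["c_votecount"].max()
print("Share of titles with fewer than 10 votes: %.1f%%" % (100 * (votes_per_title < 10).mean()))
print("Share of releases marked Very Short: %.1f%%" % (100 * (vns["length"] == "Very Short").mean()))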

Preliminary

Let's start by comparing some relevant metrics against the popularity of the associated visual novel. Here, we can plot the number of languages a visual novel has been translated into versus how popular the visual novel is. The raw scatterplot can be difficult to interpret, so we'll take the log of c_votecount to spread the data out while preserving the correlation, and overlay a regression line as well.

In [5]:
#group releases on a per-title basis
by_title = vns.groupby("title").agg(list)
by_title["numLangs"] = pd.Series(vns.groupby("title").agg(list)["lang"].apply(lambda x : len(Counter(x).keys())))
by_title["c_votecount"] = by_title["c_votecount"].apply(lambda x : x[0])
by_title = by_title[by_title["c_votecount"] < 10000]
by_title = by_title[by_title["numLangs"] < 15]
fig, ax = plt.subplots(figsize=(12, 8))

#create and display plot
x = by_title["c_votecount"].apply(lambda x: math.log(x) if x > 0 else x).to_numpy().reshape(-1, 1)
y = by_title["numLangs"].to_numpy().reshape(-1, 1)
model = LinearRegression().fit(x, y)
fig = sns.regplot(
    x = by_title["c_votecount"].apply(lambda x: math.log(x) if x > 0 else x), 
    y = by_title["numLangs"].to_numpy().reshape(-1, 1), 
    scatter_kws={"color": "red", "alpha" : .1}, 
    line_kws={"color": "black"},
    order = 3,
    truncate = True)
ax.set_xlabel("Log of Popularity")
ax.set_ylabel("Number of Languages")
ax.set_title("Popularity vs Languages")
Out[5]:
Text(0.5, 1.0, 'Popularity vs Languages')

As might seem intuitive, the popularity of a visual novel gives a good guess as to the number of languages it has been translated into. Keep in mind that these translations include both official translations and unofficial fan-translations, and the distinction matters. Since the vast majority of visual novels originate in Japan, fan-translations are typically made when no official English translation exists, so a visual novel's popularity in the western market may depend on the existence of a fan-translation. On the other hand, a Japanese developer might release an official English translation precisely to boost the visual novel's popularity in the western market. In other words, this may be a flawed metric to associate with popularity, since the causation between popularity and the presence of a translation can run in either direction depending on the translation's legitimacy. Let's look at the same plot, but this time excluding fan patches, as a rough proxy for official translations:

In [6]:
#group by title
by_title = vns[vns["patch"] == False].groupby("title").agg(list)
by_title["numLangs"] = pd.Series(by_title["lang"].apply(lambda x : len(Counter(x).keys())))
by_title["c_votecount"] = by_title["c_votecount"].apply(lambda x : x[0])
by_title = by_title[by_title["c_votecount"] < 10000]
by_title = by_title[by_title["numLangs"] < 15]

#create and display plot
fig, ax = plt.subplots(figsize=(12, 8))
fig = sns.regplot(
    x = by_title["c_votecount"].apply(lambda x: math.log(x) if x > 0 else x), 
    y = by_title["numLangs"].to_numpy().reshape(-1, 1), 
    scatter_kws={"color": "red", "alpha" : .1}, 
    line_kws={"color": "black"},
    order = 3,
    truncate = True)
ax.set_xlabel("Log of Popularity")
ax.set_ylabel("Number of Languages")
ax.set_title("Popularity vs Languages")
Out[6]:
Text(0.5, 1.0, 'Popularity vs Languages')

As we can see, the correlation is still there, but it is less prominent. The state of the market makes analysis through translation much more complex than simply numbers on a plot. Next, let's take a look at how popularity correlates with how many times the visual novel has been released to market (e.g. patches, special editions, remakes, remasters, releases in other languages, etc.).

In [7]:
#group by title
by_title = vns.groupby("title").agg(list)
by_title["numReleases"] = pd.Series(vns.groupby("title").agg(list)["released"].apply(lambda x : len(Counter(x).keys())))
by_title["c_votecount"] = by_title["c_votecount"].apply(lambda x : x[0])
by_title = by_title[by_title["numReleases"] < 35]

#create and display plot
fig, ax = plt.subplots(figsize=(12, 8))
fig = sns.regplot(
    x = by_title["c_votecount"].apply(lambda x: math.log(x) if x > 0 else 0), 
    y = by_title["numReleases"].to_numpy().reshape(-1, 1), 
    scatter_kws={"color": "red", "alpha" : .1}, 
    line_kws={"color": "black"},
    order = 3,
    truncate = True)
ax.set_xlabel("Log of Popularity")
ax.set_ylabel("Number of Releases")
ax.set_title("Popularity vs Releases")
Out[7]:
Text(0.5, 1.0, 'Popularity vs Releases')

This gives us a similarly nice correlation, this time with fewer extraneous variables to consider. Popularity generally tends to correlate positively with how many times the visual novel is released to market, which makes intuitive sense just as above. This is a good start to finding insight into what makes a visual novel popular, but next, we'll look at some more causal factors, and the market as a whole.

The Market Over Time

First, let's take a look at the change in reception of visual novels over the years.

In [8]:
#group by title
by_title = vns.dropna()
by_title = by_title[by_title["c_votecount"] < 10000]
by_title["released"] = pd.to_datetime(by_title["released"]).dt.year
by_title = by_title[by_title["released"] < 2020]

#display plot
fig, ax = plt.subplots(figsize=(12, 8))
fig = sns.regplot(
    x = by_title.groupby("released")["released"].mean(), 
    y = by_title.groupby("released")["c_votecount"].mean(), 
    scatter_kws={"color": "red", "alpha" : .5}, 
    line_kws={"color": "black"},
    order = 3,
    truncate = True)
ax.set_xlabel("Release Year")
ax.set_ylabel("Popularity")
ax.set_title("Popularity Over Time")
Out[8]:
Text(0.5, 1.0, 'Popularity Over Time')

Here we see that older visual novels tend to get less attention overall. After decades of near-zero vote counts, the years 1995 to 2005 saw explosive growth in the average number of votes per visual novel as the medium became more prevalent. Let's take a closer look at this time period to see if we can figure out what changed.

In [9]:
#trims dataframe to only include releases in 1995 to 2005
by_title = by_title[by_title["released"] > 1994]
by_title = by_title[by_title["released"] < 2006]

#create and display plot for popularity
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 18))
sns.regplot(
    x = by_title.groupby("released")["released"].mean(), 
    y = by_title.groupby("released")["c_votecount"].mean(),
    truncate = True,
    order = 1, 
    scatter_kws={"color": "red", "alpha" : .5}, 
    line_kws={"color": "black"},
    ax = ax1)
ax1.set_xlabel("Release Year")
ax1.set_ylabel("Popularity")
ax1.set_title("Popularity Over Time")

#display plot for rating
sns.regplot(
    x = by_title.groupby("released")["released"].mean(), 
    y = by_title.groupby("released")["c_rating"].mean(),
    truncate = True,
    order = 1, 
    scatter_kws={"color": "red", "alpha" : .5}, 
    line_kws={"color": "black"},
    ax = ax2)
ax2.set_xlabel("Release Year")
ax2.set_ylabel("Rating")
ax2.set_title("Rating Over Time")
Out[9]:
Text(0.5, 1.0, 'Rating Over Time')

It seems that this particular time period was a turning point in the industry, when visual novels, in the western market at least, began to become more commonplace. Based on the data above, this uptick is sandwiched between two flatter periods, so it can reasonably be read not just as a turning point but, more generally, as the medium's introduction to the west as a whole. But the question still remains: what changed?

Influential Factors

In this section we'll be figuring out which technical factors were most influential on the average popularity of the medium over the time period identified above. The factors we'll be considering here are:

num_langs
freeware
engine
length
num_platforms

To accomplish this we'll treat these as categorical data, encoding them to fit a random forest regression model. This will be done for every release in a given year, and we'll read the feature importances associated with each factor for each year into a table, ordered from highest to lowest. We'll omit patches and other non-official releases. This will give us an idea of which factors had the most influence on the mean popularity of visual novels that year.

In [10]:
#group by title
vns_t = vns.dropna().groupby("title").agg(list)
vns_t["num_langs"] = vns_t["lang"].apply(lambda x : len(Counter(x).keys()))
vns_t["freeware"] = vns_t["freeware"].apply(lambda x : x[0])
vns_t["engine"] = vns_t["engine"].apply(lambda x : x[0])
vns_t["length"] = vns_t["length"].apply(lambda x : x[0])
vns_t["num_platforms"] = vns_t["platform"].apply(lambda x : len(Counter(x).keys()))
vns_t["released"] = pd.to_datetime(vns_t["released"].apply(lambda x : x[0])).dt.year
vns_t["c_votecount"] = vns_t["c_votecount"].apply(lambda x : x[0])

#create categories and random forest machine learning regressor
cols = ["num_langs", "freeware", "engine", "length", "num_platforms"]
factors = pd.DataFrame(columns = ["year", "1", "2", "3", "4", "5"])
reg = RandomForestRegressor()
ord = OrdinalEncoder()

#for each year, encode categorical as integers, fit the random forest
#regressor to our data, and display the feature weights associated
#with that year's model
for i in range(1995, 2010):
  vns_x = vns_t[vns_t["released"] == i]
  x = ord.fit_transform(vns_x[cols])
  y = vns_x["c_votecount"]
  reg.fit(x, y)
  coefs = list(reg.feature_importances_)
  dictionary = dict(zip(cols, coefs))
  dictionary = dict(sorted(dictionary.items(), reverse = True, key=lambda item: item[1]))
  row = [str(i)]
  row.extend(dictionary.keys())
  factors.loc[len(factors.index)] = row
display(factors)
year 1 2 3 4 5
0 1995 length num_platforms num_langs engine freeware
1 1996 length num_platforms num_langs freeware engine
2 1997 length num_platforms num_langs engine freeware
3 1998 num_langs num_platforms length freeware engine
4 1999 num_platforms length num_langs freeware engine
5 2000 num_platforms num_langs length freeware engine
6 2001 length num_langs num_platforms engine freeware
7 2002 length num_platforms engine freeware num_langs
8 2003 num_platforms length freeware num_langs engine
9 2004 num_langs num_platforms length freeware engine
10 2005 num_platforms num_langs length engine freeware
11 2006 num_platforms num_langs length freeware engine
12 2007 num_platforms length num_langs engine freeware
13 2008 length num_langs num_platforms freeware engine
14 2009 num_platforms freeware length num_langs engine

Now that we've got our factors, let's take a look at how accurate our predictors are for each year. (Strictly speaking, reg.score returns the R² on the data the model was fit to, so these are in-sample figures.) This may give us some insight into how these factors' relevance has changed over time.

In [11]:
#repeat the same steps as above, instead printing out
#the accuracy scores generated by the model for each year
vns_t = vns.dropna().groupby("title").agg(list)
vns_t["num_langs"] = vns_t["lang"].apply(lambda x : len(Counter(x).keys()))
vns_t["freeware"] = vns_t["freeware"].apply(lambda x : x[0])
vns_t["engine"] = vns_t["engine"].apply(lambda x : x[0])
vns_t["length"] = vns_t["length"].apply(lambda x : x[0])
vns_t["num_platforms"] = vns_t["platform"].apply(lambda x : len(Counter(x).keys()))
vns_t["released"] = pd.to_datetime(vns_t["released"].apply(lambda x : x[0])).dt.year
vns_t["c_votecount"] = vns_t["c_votecount"].apply(lambda x : x[0])
reg = ExtraTreesRegressor()
scores = []
for i in range(1995, 2011):
  vns_x = vns_t[vns_t["released"] == i]
  x = ord.fit_transform(vns_x[cols])
  y = vns_x["c_votecount"]
  reg.fit(x, y)
  coefs = list(reg.feature_importances_)
  dictionary = dict(zip(cols, coefs))
  dictionary = dict(sorted(dictionary.items(), reverse = True, key=lambda item: item[1]))
  row = [str(i)]
  row.extend(dictionary.keys())
  factors.loc[len(factors.index)] = row
  print("The predictor for " + str(i) + " has an accuracy score of %.2f%%" % (100 * reg.score(x, y)))
  scores.append(100 * reg.score(x, y))
print("\nThe mean accuracy score is %.2f%%" % pd.Series(scores).mean())
The predictor for 1995 has an accuracy score of 99.99%
The predictor for 1996 has an accuracy score of 100.00%
The predictor for 1997 has an accuracy score of 83.76%
The predictor for 1998 has an accuracy score of 93.64%
The predictor for 1999 has an accuracy score of 47.27%
The predictor for 2000 has an accuracy score of 99.98%
The predictor for 2001 has an accuracy score of 52.53%
The predictor for 2002 has an accuracy score of 94.75%
The predictor for 2003 has an accuracy score of 22.27%
The predictor for 2004 has an accuracy score of 99.04%
The predictor for 2005 has an accuracy score of 97.95%
The predictor for 2006 has an accuracy score of 89.34%
The predictor for 2007 has an accuracy score of 86.09%
The predictor for 2008 has an accuracy score of 91.63%
The predictor for 2009 has an accuracy score of 99.38%
The predictor for 2010 has an accuracy score of 99.67%

The mean accuracy score is 84.83%
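
One caveat worth flagging: because these scores are computed on the same data each model was fit to, they are best-case figures. As a hedged sketch (not part of the original analysis), we could get a rougher out-of-sample estimate with cross-validation, reusing the vns_t, cols, and ord objects from above:

from sklearn.model_selection import cross_val_score

#out-of-sample check for a single year (2005 chosen arbitrarily); yearly samples
#are small, so expect these scores to be noisy and noticeably lower
vns_x = vns_t[vns_t["released"] == 2005]
x = ord.fit_transform(vns_x[cols])
y = vns_x["c_votecount"]
cv_scores = cross_val_score(ExtraTreesRegressor(), x, y, cv=5, scoring="r2")
print("Mean cross-validated R^2 for 2005: %.2f" % cv_scores.mean())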

Interpretation

Let's take a closer look at a few of these factors individually, starting with freeware. The data above implies that whether or not a visual novel is available for free has relatively little influence on its popularity, and this remained true even as time progressed through that 1995-2005 golden age and beyond. However, this is a boolean variable, so let's see if we can find out what's going on here by looking at the average c_votecount for free and paid visual novels separately.

In [12]:
#categorize and take the mean of the vote count column
vns_t = vns.dropna()[vns.dropna()["official"] == True]
vns_t["released"] = pd.to_datetime(vns_t["released"]).dt.year
vns_t = vns_t[vns_t["released"] > 1994]
vns_t = vns_t[vns_t["released"] < 2016]
vns_free = vns_t[vns_t["freeware"] == True]
vns_paid = vns_t[vns_t["freeware"] == False]
print("Mean vote count for free visual novels: " + str(vns_free["c_votecount"].mean()))
print("Mean vote count for paid visual novels: " + str(vns_paid["c_votecount"].mean()))
Mean vote count for free visual novels: 433.852164912886
Mean vote count for paid visual novels: 513.5534007608796

From this we can tell that paid visual novels tend to be somewhat more popular than free ones. Let's plot each of these by release year.

In [13]:
#create the plot for free visual novels
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 18))
sns.regplot(
    x = vns_free.groupby("released")["released"].mean(), 
    y = vns_free.groupby("released")["c_votecount"].mean(),
    truncate = True,
    order = 1, 
    scatter_kws={"color": "red", "alpha" : .5}, 
    line_kws={"color": "black"},
    ax = ax1)
ax1.set_xlabel("Release Year")
ax1.set_ylabel("Popularity")
ax1.set_title("Popularity of Free Visual Novels Over Time")
#create the plot for paid visual novels
sns.regplot(
    x = vns_paid.groupby("released")["released"].mean(), 
    y = vns_paid.groupby("released")["c_votecount"].mean(),
    truncate = True,
    order = 1, 
    scatter_kws={"color": "red", "alpha" : .5}, 
    line_kws={"color": "black"},
    ax = ax2)
ax2.set_xlabel("Release Year")
ax2.set_ylabel("Popularity")
ax2.set_title("Popularity of Paid Visual Novels Over Time")
Out[13]:
Text(0.5, 1.0, 'Popularity of Paid Visual Novels Over Time')

Interestingly, the popularity of free visual novels tends to vary more wildly, whereas paid visual novels have seen a steady rise going all the way back to 1995. This could be for several reasons:

  1. It is difficult to monetize aspects of a visual novel outside of the upfront price, since most video game monetization schemes rely on gameplay or aesthetics, whereas visual novels are a much more narrative-focused experience.
  2. Free visual novels intuitively tend to be of lower quality.
  3. Paid visual novels might be able to put forth a marketing budget, which can be very important for niche entertainment mediums.

Now we'll take a look at a factor that is much more influential on the popularity of a visual novel. length has five categories:

Very Short
Short
Medium
Long
Very Long

We can plot this over popularity to see if we can spot the trends that our regression model predicted.

In [14]:
#group by title
vns_t = vns.dropna().groupby("title").agg(list)
vns_t["olang"] = vns_t["olang"].apply(lambda x : x[0])
vns_t["freeware"] = vns_t["freeware"].apply(lambda x : x[0])
vns_t["engine"] = vns_t["engine"].apply(lambda x : x[0])
vns_t["length"] = vns_t["length"].apply(lambda x : x[0])
vns_t["num_platforms"] = vns_t["platform"].apply(lambda x : len(Counter(x).keys()))
vns_t["released"] = pd.to_datetime(vns_t["released"].apply(lambda x : x[0])).dt.year
vns_t["c_votecount"] = vns_t["c_votecount"].apply(lambda x : x[0])

#trim less relevant data
vns_t = vns_t[vns_t["released"] > 1995]
vns_t = vns_t[vns_t["c_votecount"] < 8000]
vns_t = vns_t[vns_t["num_platforms"] < 11]

#create and display strip plot
fig, ax = plt.subplots(figsize=(12, 8))
fig = sns.stripplot(
    x = vns_t["c_votecount"].apply(lambda x: math.log(x) if x > 0 else 0), 
    y = vns_t["length"],
    order = ["Very Long", "Long", "Medium", "Short", "Very Short"])
ax.set_ylabel("Length")
ax.set_xlabel("Log of Popularity")
ax.set_title("Popularity versus Length")
Out[14]:
Text(0.5, 1.0, 'Popularity versus Length')

There is a vague trend to be seen here, but let's see if we can find a way to better visualize this data. Here, we will plot the mean length, with 1 being Very Short and 5 being Very Long, against popularity.

In [15]:
#create and display plot
lens_u = dict((v,k) for k,v in lens.items())
fig, ax = plt.subplots(figsize=(12, 8))
fig = sns.regplot(
    x = vns_t["c_votecount"].apply(lambda x: math.log(x) if x > 0 else 0), 
    y = vns_t["length"].apply(lambda x: lens_u[x]), 
    scatter_kws={"color": "red", "alpha" : .1}, 
    line_kws={"color": "black"},
    order = 3,
    truncate = True)
ax.set_ylabel("Length")
ax.set_xlabel("Log of Popularity")
ax.set_title("Popularity versus Length")
Out[15]:
Text(0.5, 1.0, 'Popularity versus Length')

There is definitely a solid correlation here, with longer visual novels being quite a bit more popular than shorter ones. While the trend levels off somewhere between Medium and Long, this gives us a good idea of what the ideal length for a visual novel in the western market would be.

There is another variable, however, that has the special property of being both highly influential on popularity regardless of release date and non-boolean: the num_platforms factor, representing the total number of platforms the visual novel has been released on. Let's take a look at a visualization of a visual novel's popularity compared to how many platforms it has been released on.

In [16]:
#group by title
vns_t = vns.dropna().groupby("title").agg(list)
vns_t["num_langs"] = vns_t["lang"].apply(lambda x : len(Counter(x).keys()))
vns_t["freeware"] = vns_t["freeware"].apply(lambda x : x[0])
vns_t["engine"] = vns_t["engine"].apply(lambda x : x[0])
vns_t["length"] = vns_t["length"].apply(lambda x : x[0])
vns_t["num_platforms"] = vns_t["platform"].apply(lambda x : len(Counter(x).keys()))
vns_t["released"] = pd.to_datetime(vns_t["released"].apply(lambda x : x[0])).dt.year
vns_t["c_votecount"] = vns_t["c_votecount"].apply(lambda x : x[0])

#trim less relevant data
vns_t = vns_t[vns_t["released"] > 1995]
vns_t = vns_t[vns_t["c_votecount"] < 8000]
vns_t = vns_t[vns_t["num_platforms"] < 11]

#create and display plot
fig, ax = plt.subplots(figsize=(12, 8))
fig = sns.regplot(
    x = vns_t["c_votecount"].apply(lambda x: math.log(x) if x > 0 else 0), 
    y = vns_t["num_platforms"], 
    scatter_kws={"color": "red", "alpha" : .1}, 
    line_kws={"color": "black"},
    order = 2,
    truncate = True)
ax.set_xlabel("Number of Platforms")
ax.set_ylabel("Popularity")
ax.set_title("Popularity versus Number of Platforms Released on")
Out[16]:
Text(0.5, 1.0, 'Popularity versus Number of Platforms Released on')

As with the other factors we've examined so far, the number of platforms shows a positive correlation with popularity. Now that we've got these variables picked out, let's construct a heatmap for a more visual representation of how they relate to c_votecount.

In [17]:
#create and display heatmap
sns.heatmap(vns_t[["freeware", "c_votecount", "num_platforms"]].corr(), annot = True)
plt.show()

Here we can see the same sort of trends as above, with the number of platforms showing the dominant correlation. The low correlation between freeware and c_votecount echoes how close the mean vote counts of free and paid visual novels were earlier. It's also interesting to note the correlation between the number of platforms released on and whether or not the visual novel is free. It is very important to note, however, that the number of platforms a visual novel is currently available on is not indicative of how many platforms it was originally released on, so it's entirely possible that many visual novels launched on fewer platforms and were ported to more after they became popular.
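
One way to probe that caveat (a rough sketch, not part of the original analysis, reusing the vns DataFrame from earlier; titles with unknown release dates are simply skipped) is to compare how many platforms each title was on at its earliest release date versus across all of its releases:

#distinct platforms on each title's earliest release date vs. across all releases
first_date = vns.groupby("title")["released"].transform("min")
initial_platforms = vns[vns["released"] == first_date].groupby("title")["platform"].nunique()
total_platforms = vns.groupby("title")["platform"].nunique()
print("Mean platforms at first release: %.2f" % initial_platforms.mean())
print("Mean platforms across all releases: %.2f" % total_platforms.mean())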

Bringing it All Together

Building a Model

Now that we have a good understanding of the ways different factors influence the popularity of a visual novel, and know that our regression models are solid predictors of which factors are most indicative of popularity, let's see if we can build a model based on roughly the past decade of releases to predict what the ideal successful visual novel would look like today. We'll use a tree-ensemble regressor as we did before, and limit our data to releases from after 2010. First, let's organize our dataset by title and aggregate some important info, as we've done a few times above.

In [18]:
#group by title and trim less relevant data
vns_t = vns.dropna().groupby("title").agg(list)
vns_t["num_langs"] = vns_t["lang"].apply(lambda x : len(Counter(x).keys()))
vns_t["freeware"] = vns_t["freeware"].apply(lambda x : x[0])
vns_t["engine"] = vns_t["engine"].apply(lambda x : x[0])
vns_t["length"] = vns_t["length"].apply(lambda x : x[0])
vns_t["num_platforms"] = vns_t["platform"].apply(lambda x : len(Counter(x).keys()))
vns_t["released"] = pd.to_datetime(vns_t["released"].apply(lambda x : x[0])).dt.year
vns_t["c_votecount"] = vns_t["c_votecount"].apply(lambda x : x[0])
vns_t = vns_t[vns_t["released"] > 2010]
vns_t = vns_t.sort_values(by = "c_votecount", ascending = False).reset_index(drop = True)

Next, we can create our regressor from this and see how well it performs. For this one, we'll exclude the engine column, since in reality it shouldn't have a massive effect on popularity.

In [19]:
#create an extratrees regressor for score prediction, and fit data
cols = ["num_langs", "freeware", "length", "num_platforms"]
reg = ExtraTreesRegressor()
x = ord.fit_transform(vns_t[cols])
y = vns_t["c_votecount"]
reg.fit(x, y)
print("The predictor has an accuracy score of %.2f%%" % (100 * reg.score(x, y)))
The predictor has an accuracy score of 67.46%

This is a barely passable score, but we can use what we know to begin eliminating less relevant data that may be affecting the model. First, we'll eliminate visual novels with lengths below Medium, since shorter ones can vary wildly depending on variables beyond the scope of this dataset.

In [20]:
#trim, fit, and display new score
vns_t = vns_t[vns_t["length"].apply(lambda x: lens_u[x]) > 2]
x = ord.fit_transform(vns_t[cols])
y = vns_t["c_votecount"]
reg.fit(x, y)
print("The predictor has an accuracy score of %.2f%%" % (100 * reg.score(x, y)))
The predictor has an accuracy score of 71.81%

Next, despite knowing the number of platforms is a flawed metric, we can fairly safely assume that the number of platforms a visual novel is released on should be greater than one, so we'll eliminate visual novels that have only been released on a single platform.

In [21]:
#trim, fit, and display new score
vns_t = vns_t[vns_t["num_platforms"] > 1]
x = ord.fit_transform(vns_t[cols])
y = vns_t["c_votecount"]
reg.fit(x, y)
print("The predictor has an accuracy score of %.2f%%" % (100 * reg.score(x, y)))
The predictor has an accuracy score of 78.93%

Lastly, since most visual novels are originally released in Japanese and we're trying to release in the western market, let's limit our model to visual novels that are available, officially or unofficially, in at least two languages.

In [22]:
#trim, fit, and display new score
vns_t = vns_t[vns_t["num_langs"] > 1]
x = ord.fit_transform(vns_t[cols])
y = vns_t["c_votecount"]
reg.fit(x, y)
print("The predictor has an accuracy score of %.2f%%" % (100 * reg.score(x, y)))
The predictor has an accuracy score of 81.46%

This is a good enough accuracy that I think we can begin making some predictions with our model. To do this, we're going to loop over some possible feature combinations and make a ranking based on predicted popularity scores.

In [23]:
#iterate through every possible combination of categories for
#the dataframe trimmed in last few steps, output predicted
#score, add it to a ranking list, and sort
cats = list(ord.categories_)
scores = []
for num_langs in range(0, len(cats[0])):
  for freeware in range(0, len(cats[1])): 
    for length in range(0, len(cats[2])):
      for num_platforms in range(0, len(cats[3])):
        entry = [[num_langs, freeware, length, num_platforms]]
        factors = [cats[0][num_langs], 
                    cats[1][freeware],
                    cats[2][length], 
                    cats[3][num_platforms]]
        row = [reg.predict(entry)[0]] + factors
        scores.append(row)
scores = pd.DataFrame(scores, columns = ["score"] + cols)
scores = scores.sort_values(by = "score", ascending = False).reset_index(drop = True)
display(scores)
score num_langs freeware length num_platforms
0 13328.00 11 True Medium 5
1 13328.00 11 True Long 5
2 12555.75 11 True Long 6
3 12527.91 11 True Medium 6
4 11547.55 11 True Very Long 5
... ... ... ... ... ...
655 32.00 4 False Very Long 2
656 26.00 6 False Medium 5
657 16.00 3 True Medium 4
658 10.00 5 False Medium 3
659 5.50 3 True Very Long 2

660 rows × 5 columns

Extrapolation for Real World Scenarios

Now let's organize this data to better suit a real world scenario. Since we want to make a good profit, we're going to first eliminate all freeware entries, and then drop the column once we're finished.

In [24]:
#trim freeware from scores rankings
scores = scores[scores["freeware"] == False].drop(columns = ["freeware"]).reset_index(drop = True)
display(scores)
score num_langs length num_platforms
0 9869.00 6 Long 8
1 9750.00 3 Very Long 5
2 9657.74 11 Very Long 11
3 9434.99 6 Long 9
4 9422.29 11 Very Long 13
... ... ... ... ...
325 67.38 5 Very Long 2
326 37.00 2 Medium 7
327 32.00 4 Very Long 2
328 26.00 6 Medium 5
329 10.00 5 Medium 3

330 rows × 4 columns

Now that we're down to the three most important variables in our model, we should be able to safely look at our top five scorers and sort them by effort. That is, we'll sort them by the value of num_langs + num_platforms.

In [25]:
#sort scores ranking by effort (defined as num_langs + num_platforms),
#keep the top five scorers from the remaining results, and sort
scores["effort"] = scores["num_langs"] + scores["num_platforms"]
scores = scores[scores.index < 5].sort_values(by = "effort").reset_index(drop = True)
scores = scores.drop(columns = ["effort"]).reset_index(drop = True)
display(scores)
score num_langs length num_platforms
0 9750.00 3 Very Long 5
1 9869.00 6 Long 8
2 9434.99 6 Long 9
3 9657.74 11 Very Long 11
4 9422.29 11 Very Long 13

Typically, visual novels launch in a single language (two at most, unless the title is short or has a large budget) because they are so text-heavy, and generally on two or three platforms. While releasing on more platforms may increase popularity over time, they are played predominantly on Windows, with some opting to release on PlayStation and Nintendo Switch consoles as well. However, macOS and Linux releases of visual novels aren't uncommon either, so we can keep that in mind if we want to make some concessions in terms of how many platforms we end up releasing on. Lastly, we'll take the top scorer and see how our hypothetical ideal visual novel's score, num_langs, and num_platforms compare to the averages of our data's 95th percentile (the top 5% by vote count).

In [26]:
#show top scorer and category values
display(scores[scores.index == 0])
vns_t = vns_t[vns_t.index < .05 * len(vns_t.index)]

#compare top scorer to the mean of each category for the top 5 percent
#of the original sorted data set
print("95th percentile average num_langs: " + str(vns_t["num_langs"].mean()))
print("95th percentile average num_platforms: " + str(vns_t["num_platforms"].mean()))
print("95th percentile average score: " + str(vns_t["c_votecount"].mean()))
score num_langs length num_platforms
0 9750.0 3 Very Long 5
95th percentile average num_langs: 6.083333333333333
95th percentile average num_platforms: 6.166666666666667
95th percentile average score: 8330.25

Conclusions

Before discussing these results, let's take a broader look at the market as a whole, so as to contextualize the data we've been looking at above. To get a handle on a market like this, one must ask what customers are looking for, or looking to get out of, when they circulate their money in that market. In a more abstract sense, I think there are two important broader factors.

Quality

As the market has grown and the medium has matured, developers and creators have been able to work more freely with the knowledge that players will pay money for their works (as we've seen above), and therefore have more creative freedom and access to better resources throughout the process of making a visual novel. I imagine the boom of visual novels between 1995 and 2005 is the result of this positive feedback loop of rising popularity, more creative freedom, and higher quality visual novels all around. With this in mind, the same rules apply today. So long as interest is garnered and the final product delivers a satisfying experience, some form of success is likely to follow, provided the work has been put in to ensure the visual novel will be bought, talked about, and leave customers feeling as if they've gotten their money's worth.

Service

Contradictory as it may sound, the success of services like Steam and Netflix has proven that people are willing to put money down to ensure a smooth experience for consuming media. Through this lens, it makes sense that customers will pay for a better experience, especially when storefronts such as Steam also come with elements of advertising, marketing, and product discovery that work in developers' favor. And, as we've outlined extensively above, probably the most important takeaway of this analysis as a whole is that accessibility is a key part of service, and of the success of a visual novel in the western market.

With these in mind, do our findings regarding our model line up? Ideally, we'd have a longer narrative for readers to be invested in, released primarily in Japanese and English, plus other potentially large markets for this sort of media such as French, Korean, Chinese, and Russian. In addition, it would be released on several platforms, likely Windows, PlayStation, Nintendo Switch, Unix systems, and mobile devices (keeping in mind that concessions could be made through cost-benefit analysis outside the scope of this dataset). This lines up quite well from both a quality and a service standpoint. Now, there's obviously an element of luck to all of this, and quality writing most definitely helps the case, but shooting your shot is an important part of the business, and chances are, even if booming financial success is out of sight, there's a niche out there just waiting to read it.