Jeopardy Exploratory Analysis: Is There Any Correlation Between Question Value and Wording?

In [1]:
import pandas as pd
In [3]:
jeopardy = pd.read_csv("Data/JEOPARDY.csv")
print(jeopardy.head())
print(jeopardy.columns)
   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  
2  The city of Yuma in this state has a record av...     Arizona  
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's  
4  Signer of the Dec. of Indep., framer of the Co...  John Adams  
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

This dataset contains 216,930 Jeopardy questions, answers, and related data spanning 22 years of air time. The questions were obtained by crawling www.j-archive.com; the crawl was done by Reddit user trexmatt.

In [5]:
# Remove the leading spaces from the column names
jeopardy.columns = ["Show Number", "Air Date", "Round", "Category", "Value", "Question", "Answer"]
print(jeopardy.columns)
Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
In [12]:
# Strip punctuation from a string and convert it to lowercase
from string import punctuation
print(punctuation)
def cleanstr(text):
    for p in punctuation:
        text = text.replace(p, '')
    return text.lower()
# Add the cleaned question and answer as new columns
jeopardy["clean_question"] = jeopardy["Question"].apply(cleanstr)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(cleanstr)
print(jeopardy.head())
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
   Show Number    Air Date      Round                         Category Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY  $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES  $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...  $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE  $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES  $200   

                                            Question      Answer  \
0  For the last 8 years of his life, Galileo was ...  Copernicus   
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe   
2  The city of Yuma in this state has a record av...     Arizona   
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's   
4  Signer of the Dec. of Indep., framer of the Co...  John Adams   

                                      clean_question clean_answer  
0  for the last 8 years of his life galileo was u...   copernicus  
1  no 2 1912 olympian football star at carlisle i...   jim thorpe  
2  the city of yuma in this state has a record av...      arizona  
3  in 1963 live on the art linkletter show this c...    mcdonalds  
4  signer of the dec of indep framer of the const...   john adams  
In [15]:
# Convert a dollar value such as "$1,000" to an integer.
# Values that cannot be parsed (e.g. "None" for Final Jeopardy!) become 0.
def convert(text):
    text = cleanstr(text)  # removes "$" and ","
    try:
        value = int(text)
    except ValueError:
        value = 0
    return value
# Store the numeric value in a new column
jeopardy["clean_value"] = jeopardy["Value"].apply(convert)
# Convert the air date to datetime format
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])
print(jeopardy.head())
   Show Number   Air Date      Round                         Category Value  \
0         4680 2004-12-31  Jeopardy!                          HISTORY  $200   
1         4680 2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES  $200   
2         4680 2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...  $200   
3         4680 2004-12-31  Jeopardy!                 THE COMPANY LINE  $200   
4         4680 2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES  $200   

                                            Question      Answer  \
0  For the last 8 years of his life, Galileo was ...  Copernicus   
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe   
2  The city of Yuma in this state has a record av...     Arizona   
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's   
4  Signer of the Dec. of Indep., framer of the Co...  John Adams   

                                      clean_question clean_answer  clean_value  
0  for the last 8 years of his life galileo was u...   copernicus          200  
1  no 2 1912 olympian football star at carlisle i...   jim thorpe          200  
2  the city of yuma in this state has a record av...      arizona          200  
3  in 1963 live on the art linkletter show this c...    mcdonalds          200  
4  signer of the dec of indep framer of the const...   john adams          200  
In [19]:
print(jeopardy["Air Date"].dtype)
datetime64[ns]
In [23]:
# For each row, compute the fraction of answer words that also appear in the question
def match(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    match_count = 0
    # "the" is so common that counting it would inflate the match rate
    if "the" in split_answer:
        split_answer.remove("the")
    # Avoid dividing by zero when no answer words are left
    if len(split_answer) == 0:
        return 0
    else:
        for i in split_answer:
            if i in split_question:
                match_count += 1
        return match_count/len(split_answer)
# Apply the function row by row and store the result in the answer_in_question column
jeopardy["answer_in_question"] = jeopardy.apply(match, axis=1)
print(jeopardy["answer_in_question"].mean())
print(jeopardy.head())
0.05932504431848426
   Show Number   Air Date      Round                         Category Value  \
0         4680 2004-12-31  Jeopardy!                          HISTORY  $200   
1         4680 2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES  $200   
2         4680 2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...  $200   
3         4680 2004-12-31  Jeopardy!                 THE COMPANY LINE  $200   
4         4680 2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES  $200   

                                            Question      Answer  \
0  For the last 8 years of his life, Galileo was ...  Copernicus   
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe   
2  The city of Yuma in this state has a record av...     Arizona   
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's   
4  Signer of the Dec. of Indep., framer of the Co...  John Adams   

                                      clean_question clean_answer  \
0  for the last 8 years of his life galileo was u...   copernicus   
1  no 2 1912 olympian football star at carlisle i...   jim thorpe   
2  the city of yuma in this state has a record av...      arizona   
3  in 1963 live on the art linkletter show this c...    mcdonalds   
4  signer of the dec of indep framer of the const...   john adams   

   clean_value  answer_in_question  
0          200                 0.0  
1          200                 0.0  
2          200                 0.0  
3          200                 0.0  
4          200                 0.0  

On average, just under 6% of the words in an answer also appear in its question, so the question rarely gives the answer away. Studying is therefore necessary to score well in Jeopardy.
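To make the metric concrete, here is a minimal standalone sketch of the same idea applied to a single made-up question and answer pair (the strings are illustrative, not rows from the dataset). The helper name `answer_overlap` is hypothetical, and it differs slightly from `match` above: it uses `split()` without an argument, which discards the empty tokens left behind by removed punctuation, and it drops every "the" rather than only the first.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (ignoring "the") that also occur in the question."""
    # split() with no argument also discards empty tokens caused by double spaces
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0.0
    question_words = set(clean_question.split())
    hits = sum(1 for w in answer_words if w in question_words)
    return hits / len(answer_words)

# Made-up example: "river" appears in the question, "volga" does not
print(answer_overlap("this river is the longest river in europe", "the volga river"))
```

Here one of the two remaining answer words is found in the question, so the score is 0.5. A dataset-wide mean near 0.06 says that such overlaps are uncommon.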

In [45]:
jeo = jeopardy.sort_values("Air Date")
print(jeo.head(20))
print(jeo.shape)
       Show Number   Air Date             Round            Category   Value  \
84523            1 1984-09-10         Jeopardy!      LAKES & RIVERS    $100   
84565            1 1984-09-10  Double Jeopardy!           THE BIBLE   $1000   
84566            1 1984-09-10  Double Jeopardy!            '50'S TV   $1000   
84567            1 1984-09-10  Double Jeopardy!  NATIONAL LANDMARKS   $1000   
84568            1 1984-09-10  Double Jeopardy!           NOTORIOUS   $1000   
84569            1 1984-09-10  Double Jeopardy!      4-LETTER WORDS   $1000   
84570            1 1984-09-10   Final Jeopardy!            HOLIDAYS    None   
84538            1 1984-09-10         Jeopardy!      LAKES & RIVERS    $400   
84537            1 1984-09-10         Jeopardy!      ACTORS & ROLES    $300   
84536            1 1984-09-10         Jeopardy!     FOREIGN CUISINE    $300   
84564            1 1984-09-10  Double Jeopardy!      4-LETTER WORDS  $1,000   
84535            1 1984-09-10         Jeopardy!             ANIMALS    $300   
84533            1 1984-09-10         Jeopardy!      LAKES & RIVERS    $800   
84532            1 1984-09-10         Jeopardy!      ACTORS & ROLES    $200   
84531            1 1984-09-10         Jeopardy!     FOREIGN CUISINE    $200   
84530            1 1984-09-10         Jeopardy!             ANIMALS    $200   
84529            1 1984-09-10         Jeopardy!          INVENTIONS    $200   
84528            1 1984-09-10         Jeopardy!      LAKES & RIVERS    $200   
84527            1 1984-09-10         Jeopardy!      ACTORS & ROLES    $100   
84526            1 1984-09-10         Jeopardy!     FOREIGN CUISINE    $100   

                                                Question  \
84523            River mentioned most often in the Bible   
84565  According to 1st Timothy, it is the "root of a...   
84566  Name under which experimenter Don Herbert taug...   
84567    D.C. building shaken by November '83 bomb blast   
84568  After the deed, he leaped to the stage shoutin...   
84569  The president takes one before stepping into o...   
84570       The third Monday of January starting in 1986   
84538  American river only 33 miles shorter than the ...   
84537  He may "Never Say Never Again" when asked to b...   
84536                    Jewish crepe filled with cheese   
84564  It's the first 4-letter word in "The Star Span...   
84535  When husbands "pop" for an ermine coat, they'r...   
84533  River in <a href="http://www.j-archive.com/med...   
84532  2 "Saturday Night" alumni who tried "Trading P...   
84531  A British variety is called "bangers", a Mexic...   
84530  There are about 40,000 muscles & tendons in th...   
84529  In 1869 an American minister created this "ori...   
84528                             Scottish word for lake   
84527  Video in which Michael Jackson plays a werewol...   
84526                            The "coq" in coq au vin   

                           Answer  \
84523                  the Jordan   
84565           the love of money   
84566                  Mr. Wizard   
84567                 the Capitol   
84568           John Wilkes Booth   
84569                        oath   
84570      Martin Luther King Day   
84538                the Missouri   
84537                Sean Connery   
84536                    a blintz   
84564                        what   
84535                      weasel   
84533             the Volga River   
84532  Dan Aykroyd & Eddie Murphy   
84531                     sausage   
84530                   the trunk   
84529                the rickshaw   
84528                        loch   
84527                  "Thriller"   
84526                     chicken   

                                          clean_question  \
84523            river mentioned most often in the bible   
84565  according to 1st timothy it is the root of all...   
84566  name under which experimenter don herbert taug...   
84567       dc building shaken by november 83 bomb blast   
84568  after the deed he leaped to the stage shouting...   
84569  the president takes one before stepping into o...   
84570       the third monday of january starting in 1986   
84538  american river only 33 miles shorter than the ...   
84537  he may never say never again when asked to be ...   
84536                    jewish crepe filled with cheese   
84564  its the first 4letter word in the star spangle...   
84535  when husbands pop for an ermine coat theyre ac...   
84533  river in a hrefhttpwwwjarchivecommedia19840910...   
84532   2 saturday night alumni who tried trading places   
84531  a british variety is called bangers a mexican ...   
84530  there are about 40000 muscles  tendons in this...   
84529  in 1869 an american minister created this orie...   
84528                             scottish word for lake   
84527  video in which michael jackson plays a werewol...   
84526                              the coq in coq au vin   

                    clean_answer  clean_value  answer_in_question  
84523                 the jordan          100            0.000000  
84565          the love of money         1000            0.333333  
84566                  mr wizard         1000            0.000000  
84567                the capitol         1000            0.000000  
84568          john wilkes booth         1000            0.000000  
84569                       oath         1000            0.000000  
84570     martin luther king day            0            0.000000  
84538               the missouri          400            0.000000  
84537               sean connery          300            0.000000  
84536                   a blintz          300            0.000000  
84564                       what         1000            0.000000  
84535                     weasel          300            0.000000  
84533            the volga river          800            0.500000  
84532  dan aykroyd  eddie murphy          200            0.000000  
84531                    sausage          200            0.000000  
84530                  the trunk          200            0.000000  
84529               the rickshaw          200            0.000000  
84528                       loch          200            0.000000  
84527                   thriller          100            0.000000  
84526                    chicken          100            0.000000  
(216930, 11)
In [61]:
# Measure how often the longer words in a question have already appeared in earlier questions
terms_used = set()
question_overlap = []
for i, row in jeo.iterrows():
    split_question = row["clean_question"].split(" ")
    # Keep only words of 6 to 14 characters
    split_question = [word for word in split_question if 5 < len(word) < 15]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    # Turn the count into a fraction of the words kept for this question
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeo["question_overlap"] = question_overlap
print(jeo["question_overlap"].mean())
0.8814681418912709

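On average, about 88% of the longer words (6 to 14 characters) in a question have already appeared in an earlier question. That rate looks very high, but it is to be expected with more than 200,000 rows of data: by the later rows, most words have been seen many times. The measure looks only at individual words and ignores their order and how they connect with each other, so a high overlap does not by itself mean that whole questions are repeated.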
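The effect of an ever-growing vocabulary can be seen on a tiny scale. The sketch below is an illustration only: `overlap_by_position` is a hypothetical helper that repeats the loop above on a plain list of strings, and the three toy questions are made up.

```python
def overlap_by_position(questions, min_len=6, max_len=14):
    """For each question, the share of its longer words already seen earlier."""
    seen = set()
    shares = []
    for q in questions:
        # Same length filter as in the notebook: 6 to 14 characters
        words = [w for w in q.split() if min_len <= len(w) <= max_len]
        hits = 0
        for w in words:
            if w in seen:
                hits += 1
            seen.add(w)
        shares.append(hits / len(words) if words else 0.0)
    return shares

# Made-up questions, ordered as if sorted by air date
toy = [
    "scottish capital famous for its castle",
    "capital of france famous for its tower",
    "famous scottish castle",
]
shares = overlap_by_position(toy)
print(shares)
```

The first question has no overlap because nothing has been seen yet, while the last one consists entirely of words that appeared earlier. With hundreds of thousands of rows, most questions are in the position of that last one.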

In [53]:
# Flag questions worth more than $800 as high value (1); all others are low value (0)
def value(row):
    return 1 if row["clean_value"] > 800 else 0
jeo["high_value"] = jeo.apply(value, axis=1)
print(jeo.head())
       Show Number   Air Date             Round            Category  Value  \
84523            1 1984-09-10         Jeopardy!      LAKES & RIVERS   $100   
84565            1 1984-09-10  Double Jeopardy!           THE BIBLE  $1000   
84566            1 1984-09-10  Double Jeopardy!            '50'S TV  $1000   
84567            1 1984-09-10  Double Jeopardy!  NATIONAL LANDMARKS  $1000   
84568            1 1984-09-10  Double Jeopardy!           NOTORIOUS  $1000   

                                                Question             Answer  \
84523            River mentioned most often in the Bible         the Jordan   
84565  According to 1st Timothy, it is the "root of a...  the love of money   
84566  Name under which experimenter Don Herbert taug...         Mr. Wizard   
84567    D.C. building shaken by November '83 bomb blast        the Capitol   
84568  After the deed, he leaped to the stage shoutin...  John Wilkes Booth   

                                          clean_question       clean_answer  \
84523            river mentioned most often in the bible         the jordan   
84565  according to 1st timothy it is the root of all...  the love of money   
84566  name under which experimenter don herbert taug...          mr wizard   
84567       dc building shaken by november 83 bomb blast        the capitol   
84568  after the deed he leaped to the stage shouting...  john wilkes booth   

       clean_value  answer_in_question  high_value  
84523          100            0.000000           0  
84565         1000            0.333333           1  
84566         1000            0.000000           1  
84567         1000            0.000000           1  
84568         1000            0.000000           1  
In [56]:
# Drop the raw columns now that cleaned versions exist
jeo.drop(["Question", "Answer", "Value"], axis=1, inplace=True)
In [57]:
print(jeo.head())
       Show Number   Air Date             Round            Category  \
84523            1 1984-09-10         Jeopardy!      LAKES & RIVERS   
84565            1 1984-09-10  Double Jeopardy!           THE BIBLE   
84566            1 1984-09-10  Double Jeopardy!            '50'S TV   
84567            1 1984-09-10  Double Jeopardy!  NATIONAL LANDMARKS   
84568            1 1984-09-10  Double Jeopardy!           NOTORIOUS   

                                          clean_question       clean_answer  \
84523            river mentioned most often in the bible         the jordan   
84565  according to 1st timothy it is the root of all...  the love of money   
84566  name under which experimenter don herbert taug...          mr wizard   
84567       dc building shaken by november 83 bomb blast        the capitol   
84568  after the deed he leaped to the stage shouting...  john wilkes booth   

       clean_value  answer_in_question  high_value  
84523          100            0.000000           0  
84565         1000            0.333333           1  
84566         1000            0.000000           1  
84567         1000            0.000000           1  
84568         1000            0.000000           1  
In [59]:
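# Count how many high-value and low-value questions contain a given word
def valuecount(word):
    low_count = 0
    high_count = 0
    for i, row in jeo.iterrows():
        if word in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

# Observed (high, low) counts for a small sample of terms
obs_exp = []
# A set has no defined order, so these five terms are an arbitrary sample
terms_used = list(terms_used)
comparison_terms = terms_used[:5]
for term in comparison_terms:
    obs_exp.append(valuecount(term))
print(obs_exp)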
[(2, 3), (0, 1), (27, 92), (0, 1), (1, 0)]
In [60]:
print(comparison_terms)
['unattractive', 'frozenfood', 'guitar', 'plumpynut', 'hrefhttpwwwjarchivecommedia20081006dj29wmvdemonstrateda']
In [70]:
import numpy as np
from scipy.stats import chisquare
# Total number of high-value and low-value questions in the dataset
high_value_count = jeo[jeo["high_value"] == 1].shape[0]
low_value_count = jeo[jeo["high_value"] == 0].shape[0]
print(high_value_count, low_value_count)
chi_squared = []
for item in obs_exp:
    # Share of all questions that contain the term
    total = item[0] + item[1]
    total_prop = total/jeo.shape[0]
    print(total, total_prop)
    # Expected counts if the term occurred in high- and low-value questions
    # in proportion to their overall counts
    exp_high_value = total_prop * high_value_count
    exp_low_value = total_prop * low_value_count
    print(exp_high_value, exp_low_value)
    obs = np.array([item[0],item[1]])
    exp = np.array([exp_high_value, exp_low_value])
    chi_squared.append(chisquare(obs, exp))
chi_squared
61422 155508
5 2.3048909786567094e-05
1.415710136910524 3.5842898630894755
1 4.609781957313419e-06
0.2831420273821048 0.7168579726178952
119 0.0005485640529202969
33.693901258470476 85.30609874152954
1 4.609781957313419e-06
0.2831420273821048 0.7168579726178952
1 4.609781957313419e-06
0.2831420273821048 0.7168579726178952
Out[70]:
[Power_divergenceResult(statistic=0.3363947754070794, pvalue=0.56191765510245351),
 Power_divergenceResult(statistic=0.39497646423335131, pvalue=0.52969509124866954),
 Power_divergenceResult(statistic=1.8551293016977488, pvalue=0.1731880028017444),
 Power_divergenceResult(statistic=0.39497646423335131, pvalue=0.52969509124866954),
 Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751)]

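None of the p-values is below 0.05, so none of the results is statistically significant. For the five words we tested, there is no clear sign that word occurrence differs much between high- and low-value questions. Note, however, that four of the five terms occur five times or fewer, so their expected counts are far too small for the chi-squared test to be reliable; only "guitar" (119 occurrences) gives a meaningful test.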
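As a check on the numbers above, the chi-squared statistic can be recomputed by hand for a single term. The following is a minimal sketch, not part of the original analysis: `chi_squared_for_term` is a hypothetical helper, the two totals are copied from the printed output above, and the minimum expected count is returned so it can be compared against the common rule of thumb that every expected count should be at least 5.

```python
import numpy as np
from scipy.stats import chisquare

# Totals copied from the notebook output above
high_value_count, low_value_count = 61422, 155508
total_rows = high_value_count + low_value_count

def chi_squared_for_term(high_obs, low_obs):
    """Return (statistic, p-value, smallest expected count) for one term."""
    share = (high_obs + low_obs) / total_rows
    expected = np.array([share * high_value_count, share * low_value_count])
    observed = np.array([high_obs, low_obs])
    # Chi-squared statistic written out: sum of (observed - expected)^2 / expected
    statistic = ((observed - expected) ** 2 / expected).sum()
    # scipy is used here only for the p-value (one degree of freedom)
    pvalue = chisquare(observed, expected).pvalue
    return statistic, pvalue, expected.min()

# Observed counts for "guitar": 27 high-value and 92 low-value questions
stat, p, min_expected = chi_squared_for_term(27, 92)
print(stat, p, min_expected)
```

For "guitar" this reproduces the statistic of about 1.855 and the p-value of about 0.173 shown above, with both expected counts well above 5. Calling the same function with (0, 1) gives a smallest expected count below 1, which is why the results for the rare terms should not be read as evidence either way.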
