In [251]:
# from nltk.corpus import brown
 
data = []
 
# for fileid in brown.fileids():
#     document = ' '.join(brown.words(fileid))
#     data.append(document)

# with open("phil_paper_isaac.txt") as file:
with open("phil_corpa.txt") as file:
    data = file.readlines()

NO_DOCUMENTS = len(data)
print(NO_DOCUMENTS)
print(data[:5])
725
['Conscripted Choice\n', "      Democracy begins with a need for a governing body of collective life and it's empowered at the voting booth. A democracy is a form of government whose representatives are elected by the people governed. In On Democracy, Robert Dahl states that a successful democratic process is based on five principles: Effective participation, voting equality, enlightened understanding, control of the agenda, and inclusion of adults. (Dahl -) For such a democracy to establish itself as being as democratic as possible voting should be compulsory. All adult citizens have a duty to vote to ensure that their democracy equally and accurately represents those governed. Therefore, compulsory voting, where every citizen is required to vote, should be implemented.\n", '\tA citizen has a duty to vote as a member of society. In order for a "government of the people, by the people, for the people," as President Abraham Lincoln stated, to be truly representative of the people, those governed must vote. So too, as with jury duty and compulsory military service, a citizen has to perform an obligation to the government and society, which assists the lives of its fellow compatriots and denizens, respectively. For justice to be fairly implemented in society it follows that the society should deem its own constituents\' verdicts. To do so, citizens each have a duty to sit on juries and hear cases of fellow citizens. If a country has determined that military conscription is needed, it is a duty to serve in their armed forces, unless personal moral grounds forbid it. If a nation calls upon its citizens to protect the State, all duly citizens have a duty to enlist. For the military service to maintain equality in the service as well as perpetuate an equable society, no preference may be allowed in determining exemptions, like another citizen fighting in lieu of oneself or a being able to purchase exemptions. 
People with valid exemptions, like sickness or conscientious objectors, may forego service. Just as these duties are obligated by all and allow for progression towards equality, so to must the obligation to vote. For a body to be equally represented and accurately reflect the opinions of the constituents, a large and diverse turnout of voters is required, optimally all of them. To ensure the votes of others count, you have an obligation for your vote to count and to vote.\n', '\tCompulsory voting protects the right to vote and ensures the fewest people are marginalized out of a \'voice\'. If the baseline for all citizens is an obligation to vote, a government must heed to that right and allow for as much access and ease as possible. Minimum requirements to vote must be easy to ensure citizens comply with their obligation. The shear amount of voters compels a government to make the process just. It would also allow for equal opportunity to vote since the polling would cater to not just registered voters but to all citizens. Thus, progressing Dahl\'s criterion of "effective participation" where the ease of sharing one\'s ideas, here via voting, assists in the democratic process. (Dahl ) Local laws that hinder voting, in pursuit of, say, weeding-out voter fraud, would be more heavily scrutinized as they could curb votes nationally and obstruct the duty. As opposed to \'opt-in\' registering to vote system, an \'opt-out\' compulsory voting system with valid exemptions would promote and safeguard the right to cast a ballot.\n', '\tOver time, a compulsory voting system would improve the quality of the electorate and the candidates. As more citizens vote and are obligated to do so, the more involved they will be in their selection. Since everyone has to vote, each would likely pursue information on their choices. Therefore, the electorate would become more educated. An informed person makes better and more rational decisions, furthering his or her responsibility to vote. 
Also allowing for a framework of \'democratic culture,\' "democracy could not long exist unless its citizens manage to create and maintain a supportive political culture." (Dahl ) While the option to abstain must still be available, the anticipation is that those whom abstain would be dwarfed those who wouldn\'t have voted in a voluntary election. This also removes other representative precautions that could stray from a democratic process, like the Electoral College, which was formed to pad a perceived uneducated democracy in early America.\n']
In [252]:
import re
import nltk
# nltk.download('stopwords')
In [253]:
from gensim import models, corpora
from nltk import word_tokenize
from nltk.corpus import stopwords
 
NUM_TOPICS = 10
STOPWORDS = stopwords.words('english')
 
def clean_text(text):
    tokenized_text = word_tokenize(text.lower())
    # raw string avoids the "invalid escape sequence \-" DeprecationWarning
    cleaned_text = [t for t in tokenized_text if t not in STOPWORDS and re.match(r'[a-zA-Z\-][a-zA-Z\-]{2,}', t)]
    return cleaned_text
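A quick self-contained check of the cleaning logic. This sketch uses a hard-coded mini stopword list (a stand-in for NLTK's, so it runs without downloading corpora) but the same regex filter as `clean_text`:

```python
import re

MINI_STOPWORDS = {"the", "is", "a", "of", "and"}  # stand-in for NLTK's list

def clean_text_demo(text):
    # lowercase, split on whitespace, drop stopwords and any token that is
    # not at least 3 letter/hyphen characters (what the regex requires)
    tokens = text.lower().split()
    return [t for t in tokens
            if t not in MINI_STOPWORDS and re.match(r'[a-zA-Z\-][a-zA-Z\-]{2,}', t)]

print(clean_text_demo("The value of a life is a moral question"))
# ['value', 'life', 'moral', 'question']
```

Note that two-letter tokens like "it" are dropped by the regex even when they are not in the stopword list.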
In [254]:
tokenized_data = []
for text in data:
    tokenized_data.append(clean_text(text))
In [255]:
dictionary = corpora.Dictionary(tokenized_data)
In [256]:
corpus = [dictionary.doc2bow(text) for text in tokenized_data]
In [257]:
print(corpus[20])
[(35, 1), (113, 1), (141, 1), (149, 1), (164, 1), (200, 1), (226, 3), (229, 1), (234, 1), (287, 1), (307, 2), (326, 1), (352, 1), (435, 1), (441, 3), (445, 1), (467, 1), (483, 3), (501, 3), (507, 1), (508, 1), (509, 1), (510, 3), (511, 1), (512, 1), (513, 1), (514, 1), (515, 2), (516, 1), (517, 1), (518, 1), (519, 1), (520, 1), (521, 1), (522, 1), (523, 1), (524, 1), (525, 1), (526, 1), (527, 2), (528, 3), (529, 1), (530, 1), (531, 1), (532, 1), (533, 1), (534, 1), (535, 1), (536, 1), (537, 1), (538, 1)]
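Each entry of `corpus` is a list of `(token_id, count)` pairs. The same bag-of-words encoding that `dictionary.doc2bow` performs can be reproduced by hand with a plain dictionary (the three-word vocabulary here is a toy example, not the real `dictionary`):

```python
from collections import Counter

tokens = ["vote", "citizens", "vote", "duty"]            # a toy document
id2word = {0: "citizens", 1: "duty", 2: "vote"}          # toy id -> token mapping
word2id = {w: i for i, w in id2word.items()}

# map each token to its integer id, count occurrences, sort by id
bow = sorted(Counter(word2id[t] for t in tokens).items())
print(bow)  # [(0, 1), (1, 1), (2, 2)]
```

So `(226, 3)` above simply means the token with id 226 occurs three times in document 20.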
In [258]:
lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)
lsi_model = models.LsiModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)
In [259]:
print("LDA Model:")
 
for idx in range(NUM_TOPICS):
    # Print the 10 most heavily weighted words for each topic
    print("Topic #%s:" % idx, lda_model.print_topic(idx, 10))
 
print("=" * 20)
 
print("LSI Model:")
 
for idx in range(NUM_TOPICS):
    # Print the 10 most heavily weighted words for each topic
    print("Topic #%s:" % idx, lsi_model.print_topic(idx, 10))
 
print("=" * 20)
LDA Model:
Topic #0: 0.015*"one" + 0.012*"may" + 0.007*"virtuous" + 0.007*"would" + 0.006*"moral" + 0.005*"people" + 0.005*"person" + 0.005*"many" + 0.005*"action" + 0.004*"driver"
Topic #1: 0.014*"one" + 0.010*"virtue" + 0.010*"would" + 0.008*"moral" + 0.007*"virtuous" + 0.006*"may" + 0.006*"person" + 0.005*"life" + 0.005*"mind" + 0.005*"world"
Topic #2: 0.015*"one" + 0.009*"yoga" + 0.006*"mind" + 0.006*"public" + 0.005*"http" + 0.005*"would" + 0.004*"world" + 0.004*"must" + 0.004*"person" + 0.004*"new"
Topic #3: 0.016*"one" + 0.008*"would" + 0.008*"experience" + 0.007*"world" + 0.007*"person" + 0.005*"thought" + 0.005*"life" + 0.005*"mind" + 0.004*"action" + 0.004*"qualia"
Topic #4: 0.016*"would" + 0.009*"vote" + 0.008*"one" + 0.007*"voting" + 0.006*"citizens" + 0.006*"also" + 0.006*"must" + 0.004*"state" + 0.004*"thoreau" + 0.004*"people"
Topic #5: 0.009*"yoga" + 0.007*"would" + 0.006*"feuerstein" + 0.006*"mind" + 0.005*"one" + 0.005*"virtue" + 0.005*"philosophy" + 0.004*"majority" + 0.004*"world" + 0.004*"driver"
Topic #6: 0.019*"yoga" + 0.013*"one" + 0.009*"mind" + 0.007*"would" + 0.007*"life" + 0.006*"practice" + 0.006*"also" + 0.005*"yogic" + 0.005*"qualia" + 0.005*"person"
Topic #7: 0.012*"one" + 0.009*"would" + 0.008*"person" + 0.006*"people" + 0.005*"moral" + 0.005*"must" + 0.005*"mind" + 0.005*"state" + 0.005*"euthanasia" + 0.004*"virtuous"
Topic #8: 0.014*"would" + 0.012*"moral" + 0.008*"one" + 0.005*"yoga" + 0.005*"individual" + 0.005*"person" + 0.004*"singer" + 0.004*"obligation" + 0.004*"may" + 0.004*"virtuous"
Topic #9: 0.011*"one" + 0.010*"would" + 0.008*"world" + 0.008*"society" + 0.008*"mind" + 0.006*"government" + 0.005*"moral" + 0.005*"must" + 0.005*"people" + 0.005*"external"
====================
LSI Model:
Topic #0: -0.504*"one" + -0.382*"would" + -0.204*"person" + -0.185*"moral" + -0.174*"yoga" + -0.164*"mind" + -0.155*"may" + -0.148*"virtue" + -0.147*"virtuous" + -0.128*"life"
Topic #1: -0.696*"yoga" + -0.277*"yogic" + -0.258*"practice" + -0.175*"philosophy" + -0.143*"feuerstein" + 0.136*"one" + -0.113*"physical" + 0.109*"moral" + 0.092*"would" + 0.091*"person"
Topic #2: 0.381*"moral" + -0.340*"mind" + -0.322*"one" + 0.254*"would" + 0.243*"virtuous" + 0.233*"virtue" + -0.170*"person" + 0.167*"society" + -0.135*"world" + 0.120*"good"
Topic #3: 0.347*"world" + -0.263*"one" + 0.247*"reference" + 0.206*"tree" + 0.200*"external" + 0.193*"would" + -0.188*"person" + 0.181*"causal" + 0.174*"refer" + 0.168*"putnam"
Topic #4: -0.240*"vote" + 0.212*"virtuous" + -0.207*"citizens" + -0.202*"government" + 0.198*"mind" + 0.196*"virtue" + -0.185*"life" + -0.174*"euthanasia" + 0.156*"moral" + -0.155*"people"
Topic #5: -0.391*"life" + -0.287*"euthanasia" + -0.217*"velleman" + -0.217*"patient" + 0.193*"vote" + -0.190*"moral" + -0.182*"death" + 0.166*"citizens" + 0.160*"government" + -0.160*"option"
Topic #6: 0.463*"moral" + -0.344*"virtue" + -0.260*"virtuous" + 0.206*"good" + -0.184*"justice" + 0.154*"wolf" + 0.144*"money" + 0.131*"would" + -0.128*"virtues" + 0.125*"mind"
Topic #7: -0.350*"experience" + 0.246*"one" + -0.239*"qualia" + 0.199*"would" + 0.171*"reference" + 0.170*"tree" + -0.151*"consciousness" + 0.145*"bacteria" + -0.144*"society" + 0.132*"refer"
Topic #8: 0.350*"mind" + -0.289*"one" + 0.259*"person" + -0.240*"experience" + -0.204*"would" + -0.183*"qualia" + 0.174*"society" + -0.167*"driver" + 0.165*"people" + -0.159*"modesty"
Topic #9: 0.427*"hume" + 0.292*"god" + 0.212*"philo" + 0.197*"design" + 0.163*"dawkins" + -0.157*"experience" + 0.155*"universe" + 0.152*"nature" + 0.137*"analogy" + 0.136*"evolution"
====================
In [260]:
# with open("phil_paper_isaac.txt") as file:
with open("phil_corpa.txt") as file:
    line = file.read().replace("\n", " ")

print(line[:100])
Conscripted Choice       Democracy begins with a need for a governing body of collective life and it
In [261]:
bow = dictionary.doc2bow(clean_text(line))
In [262]:
print(lsi_model[bow])
[(0, -492.22389476077694), (1, -43.006687786655824), (2, -32.30280331960204), (3, 62.650528839237516), (4, -26.48137426895138), (5, 19.99613715718895), (6, 3.7891014249036616), (7, -51.82369657963511), (8, 16.669104902479695), (9, 19.550211241134818)]
In [263]:
print(lda_model[bow])
[(0, 0.052795976), (1, 0.08577935), (3, 0.23131272), (4, 0.06220368), (5, 0.018232808), (6, 0.27065048), (7, 0.09286079), (8, 0.068570435), (9, 0.109112345)]
In [264]:
from gensim import similarities
 
lda_index = similarities.MatrixSimilarity(lda_model[corpus])
In [265]:
sims = lda_index[lda_model[bow]]  # renamed to avoid shadowing the imported `similarities` module
In [266]:
sims = sorted(enumerate(sims), key=lambda item: -item[1])
In [267]:
print(sims[:10])
[(297, 0.8693008), (24, 0.86132455), (149, 0.8543368), (487, 0.8542441), (418, 0.77823895), (11, 0.7687921), (12, 0.7687921), (13, 0.7687921), (14, 0.7687921), (16, 0.7687921)]
In [268]:
document_id, similarity = sims[0]
print(data[document_id][:1000])
	When Mary came to encounter colors for the very first time she didn't learn anything new about the physical facts of the world. Her surprise is not the acquisition of new knowledge but is the novelty of knowing what its like from a first person perspective of seeing color. Qualia are "not a property of experience but a property of the way you're representing the world." (Jackson) They are akin to a mind's impressionist painting or a mental photograph, that is then examined rather than a subjective view of the objective experience. They are unique in the attributions and the symbolic ramifications of experience but in-and-of the experience itself. If they are representations of the external properties of the object viewed, they are themselves properties of the object. "This position, which locates qualia out there in the world, would enable us to reject qualia as privately introspectible qualities of inner experiences, making the approach particularly welcome to those were committed to
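`MatrixSimilarity` scores documents by cosine similarity between topic-space vectors. The score can be computed directly (a minimal sketch, assuming gensim's default L2 normalization):

```python
import numpy as np

def cosine(u, v):
    # cosine similarity: dot product of the two vectors after L2 normalization
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 0.0]))  # ~0.7071
```

A score of 1.0 means identical topic mixtures, so the 0.87 above indicates document 297 has a very similar topic profile to the full-corpus query.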
In [269]:
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
 
# NUM_TOPICS = 10
NUM_TOPICS = 5
 
vectorizer = CountVectorizer(min_df=5, max_df=0.9, 
                             stop_words='english', lowercase=True, 
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')  # raw string avoids the escape-sequence warning
data_vectorized = vectorizer.fit_transform(data)
 
In [270]:
# Build a Latent Dirichlet Allocation Model
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')
lda_Z = lda_model.fit_transform(data_vectorized)
print(lda_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)

# Build a Non-Negative Matrix Factorization Model
nmf_model = NMF(n_components=NUM_TOPICS)
nmf_Z = nmf_model.fit_transform(data_vectorized)
print(nmf_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)
 
# Build a Latent Semantic Indexing Model
lsi_model = TruncatedSVD(n_components=NUM_TOPICS)
lsi_Z = lsi_model.fit_transform(data_vectorized)
print(lsi_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)
 
 
# Let's see what the first document in the corpus looks like in different topic spaces
print(lda_Z[0])
print(nmf_Z[0])
print(lsi_Z[0])
(725, 5)
(725, 5)
(725, 5)
[0.10000378 0.10008534 0.1000017  0.59990525 0.10000393]
[0.         0.         0.         0.01198504 0.        ]
[ 0.00727261  0.01392889  0.01460413  0.04556619 -0.03139224]
In [271]:
def print_topics(model, vectorizer, top_n=10):
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        print([(vectorizer.get_feature_names()[i], topic[i])
                        for i in topic.argsort()[:-top_n - 1:-1]])
 
print("LDA Model:")
print_topics(lda_model, vectorizer)
print("=" * 20)
 
print("NMF Model:")
print_topics(nmf_model, vectorizer)
print("=" * 20)
 
print("LSI Model:")
print_topics(lsi_model, vectorizer)
print("=" * 20)
LDA Model:
Topic 0:
[('moral', 104.50708637844558), ('virtue', 72.0402184947965), ('virtuous', 66.542894392717), ('action', 48.2370058794312), ('driver', 41.17150417933021), ('good', 38.52489321223048), ('person', 33.21489325746063), ('virtues', 31.56701538805647), ('society', 31.16503471174718), ('singer', 30.162497759696738)]
Topic 1:
[('government', 35.52056440837926), ('people', 33.6709360806553), ('democracy', 31.549416175778074), ('thoreau', 29.194121612325535), ('citizens', 29.028942240091475), ('vote', 27.39838474799673), ('majority', 20.715004209399595), ('public', 20.546054037650585), ('state', 20.508988798498493), ('voting', 20.09470462122295)]
Topic 2:
[('yoga', 104.03909211369924), ('philosophy', 38.562870376132565), ('yogic', 34.62032534485544), ('practice', 34.29799742231105), ('hume', 28.510448047911787), ('web', 23.80887952591625), ('god', 21.381680237403383), ('feuerstein', 21.332071262710116), ('print', 21.030272921972745), ('http', 19.274745926563853)]
Topic 3:
[('life', 41.456181379478785), ('euthanasia', 26.31822581693425), ('velleman', 23.100127697172244), ('patient', 20.23907202996397), ('living', 16.89659146442302), ('person', 16.497813795335034), ('option', 15.548212430937207), ('existence', 13.940907834262129), ('death', 13.297064896089493), ('argues', 12.546180984697973)]
Topic 4:
[('mind', 75.93716204873004), ('world', 43.36569863185682), ('person', 36.870331271089995), ('experience', 36.22283805633201), ('qualia', 29.908777291291507), ('consciousness', 29.62304129069717), ('mental', 28.268834955887343), ('nagel', 22.078293608246696), ('thought', 21.299771806467586), ('minds', 20.23173898635481)]
====================
NMF Model:
Topic 0:
[('yoga', 5.017736671242561), ('yogic', 1.9373377185040568), ('practice', 1.868803342579247), ('philosophy', 1.240635166288687), ('feuerstein', 0.9853110734255115), ('physical', 0.7711717943344284), ('knowledge', 0.613258637985785), ('ashtanga', 0.585974136452922), ('hatha', 0.5758647248343806), ('body', 0.556235824314887)]
Topic 1:
[('moral', 2.6375063499741205), ('virtuous', 2.139636343618962), ('virtue', 2.1391317227229467), ('society', 1.1930630560880822), ('action', 1.0180716948970279), ('good', 0.8399280622869582), ('justice', 0.8116125254264723), ('virtues', 0.8014321168413483), ('person', 0.7080197143835876), ('people', 0.6855098528321404)]
Topic 2:
[('mind', 3.0363829544764096), ('person', 2.492342272596916), ('minds', 0.7894649413536312), ('frankfurt', 0.6805475682416562), ('nagel', 0.525204960762467), ('individual', 0.5112902557747512), ('separate', 0.4972546869533699), ('consciousness', 0.49521296928113784), ('god', 0.48798942842252285), ('patient', 0.48470833981913225)]
Topic 3:
[('life', 1.6805411421852354), ('euthanasia', 1.2096449942619067), ('vote', 1.0071019399692966), ('velleman', 0.9401880959292042), ('people', 0.9332278825813369), ('citizens', 0.8969788508296851), ('government', 0.8753616821535788), ('patient', 0.8527106449133084), ('society', 0.8362195396796288), ('right', 0.7298770068290754)]
Topic 4:
[('world', 2.388744665980558), ('reference', 1.2409479626789015), ('external', 1.2258382608078027), ('experience', 1.208770905860394), ('causal', 1.0138516078078577), ('qualia', 0.9325231293081245), ('refer', 0.899012012892776), ('connection', 0.8545443095736966), ('putnam', 0.823357892270661), ('mental', 0.8186057301811992)]
====================
LSI Model:
Topic 0:
[('yoga', 0.6359244555728466), ('yogic', 0.25208732128302597), ('practice', 0.24331205776837359), ('mind', 0.1755556677751072), ('philosophy', 0.16956474673736446), ('person', 0.1460616899845041), ('moral', 0.14364817256301765), ('feuerstein', 0.1289387135140411), ('physical', 0.1285054669408163), ('virtue', 0.1213432758057252)]
Topic 1:
[('moral', 0.3296308524886737), ('virtuous', 0.27886905992966937), ('virtue', 0.2737689850350468), ('person', 0.2638332088765181), ('society', 0.21121287959191704), ('life', 0.1540897406757256), ('people', 0.14078440088932576), ('action', 0.13493505666346575), ('individual', 0.12836442262983427), ('justice', 0.11665859632463967)]
Topic 2:
[('mind', 0.35706270899901943), ('world', 0.3280666688562226), ('experience', 0.18024132001043713), ('external', 0.16242245166459432), ('person', 0.15939032269719172), ('reference', 0.1474785606393726), ('qualia', 0.12846541172114845), ('mental', 0.12693471898594377), ('causal', 0.12604028516563773), ('consciousness', 0.12390640531672635)]
Topic 3:
[('life', 0.28254553937235466), ('vote', 0.24556937882360405), ('euthanasia', 0.22972795620956232), ('citizens', 0.20856745989757636), ('government', 0.20642592051827008), ('velleman', 0.1780481356560534), ('people', 0.17800719703765916), ('voting', 0.1571806232561634), ('right', 0.1526037550030851), ('patient', 0.1461718072942308)]
Topic 4:
[('world', 0.3041022220552565), ('reference', 0.19489713418194757), ('external', 0.17263929592993196), ('causal', 0.15173809558537227), ('moral', 0.1417330741074052), ('refer', 0.1405823279049722), ('people', 0.13263872562516923), ('experience', 0.13004787831486078), ('putnam', 0.12863712411801806), ('connection', 0.127877049580871)]
====================
In [272]:
text = line
x = nmf_model.transform(vectorizer.transform([text]))[0]
print(x)
[21.99692362 30.03699202 26.15070097 35.40712965 28.77010468]
In [273]:
from sklearn.metrics.pairwise import euclidean_distances
 
def most_similar(x, Z, top_n=5):
    dists = euclidean_distances(x.reshape(1, -1), Z)
    pairs = enumerate(dists[0])
    most_similar = sorted(pairs, key=lambda item: item[1])[:top_n]
    return most_similar
 
similarities = most_similar(x, nmf_Z)
document_id, similarity = similarities[0]
print(data[document_id][:1000])
      Velleman contends that the option of euthanasia undermines the value of life by questioning it. He also argues that dignity, defined, as "independence, physical strength, and youth" (Velleman ) is not worthy of our moral concern in euthanasia. I believe living with dignity is another goal of medicine and bioethics. An aspect of dignity is the capacity to allow for the fullest of patient's potential, without causing a patient harm, and to live as close to a healthy and unbridled human existence in one's circumstance, as possible. That means giving a paraplegic a wheelchair to assist in mobility and not removing a benign tumor from a  year-old since there is a high probability of a horrendous recovery or death. If a patient is living a life that person has determined is less than the quality of a decent human existence then perhaps that person should be eligible for euthanasia. Euthanasia is meant to end life before it becomes undignified, not to do so when it becomes undignified. 
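The `most_similar` helper ranks documents by Euclidean distance in NMF topic space (smallest distance first). A toy check of that ranking logic, using hypothetical 2-D topic vectors instead of `nmf_Z`:

```python
import numpy as np

def most_similar_demo(x, Z, top_n=2):
    # rank rows of Z by Euclidean distance to the query vector x
    dists = np.linalg.norm(Z - x, axis=1)
    order = np.argsort(dists)[:top_n]
    return [(int(i), float(dists[i])) for i in order]

Z = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
print(most_similar_demo(np.array([1.0, 0.0]), Z))
# [(1, 0.0), (0, 1.0)]
```

Unlike the cosine scores used in the gensim section, here smaller is better, which is why `most_similar` sorts ascending.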
In [274]:
import pandas as pd
from bokeh.io import push_notebook, show, output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, LabelSet
output_notebook()
Loading BokehJS ...
In [275]:
svd = TruncatedSVD(n_components=2)
documents_2d = svd.fit_transform(data_vectorized)
 
df = pd.DataFrame(columns=['x', 'y', 'document'])
df['x'], df['y'], df['document'] = documents_2d[:,0], documents_2d[:,1], range(len(data))
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="document", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)
Out[275]:

<Bokeh Notebook handle for In[275]>

In [276]:
words_2d = svd.fit_transform(data_vectorized.T)
 
df = pd.DataFrame(columns=['x', 'y', 'word'])
df['x'], df['y'], df['word'] = words_2d[:,0], words_2d[:,1], vectorizer.get_feature_names()
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="word", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)
Out[276]:

<Bokeh Notebook handle for In[276]>

In [277]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
 
NUM_TOPICS = 10
 
vectorizer = CountVectorizer(min_df=5, max_df=0.9, 
                             stop_words='english', lowercase=True, 
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')  # raw string avoids the escape-sequence warning
data_vectorized = vectorizer.fit_transform(data)
 
# Build a Latent Dirichlet Allocation Model
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')
lda_Z = lda_model.fit_transform(data_vectorized)
 
text = line
x = lda_model.transform(vectorizer.transform([text]))[0]
print(x, x.sum())
[0.00231842 0.05786318 0.11870392 0.09648282 0.18766431 0.09294328
 0.06217002 0.09221582 0.11741852 0.1722197 ] 1.0
In [278]:
import pyLDAvis.sklearn
 
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_model, data_vectorized, vectorizer, mds='tsne')
panel
Out[278]: