Processing HTML

Removing HTML tags

>>> raw = nltk.clean_html(html)
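
The call above relies on nltk.clean_html, which is available in NLTK 2.x; in NLTK 3 the function raises NotImplementedError and recommends BeautifulSoup's get_text() instead. Where neither is at hand, tag removal can be approximated with the standard library alone. The following is a minimal sketch, not NLTK's implementation: the names TagStripper and strip_tags are hypothetical, and it assumes Python 3.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}  # elements whose text is not page content

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; etc. into text
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_tags(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return " ".join(" ".join(parser.parts).split())
```

Calling strip_tags(html) on a decoded page yields a string that can be passed to nltk.word_tokenize in the same way as raw. Real pages contain navigation and other boilerplate that this sketch does not filter out.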

Example: from a New York Times article

>>> import nltk
>>> from urllib import urlopen  # Python 2; in Python 3 use urllib.request.urlopen
>>> url = "http://topics.nytimes.com/top/news/international/countriesandterritories/japan/index.html"
>>>
>>> html = urlopen(url).read()
>>> type(html)
<type 'str'>
>>> raw = nltk.clean_html(html)
>>> tokens = nltk.word_tokenize(raw)
>>>
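>>> text = nltk.Text(tokens)
>>> fd = nltk.FreqDist(text)  # fd is not defined in the original session; assumed to be the token frequency distribution
>>> sorted(set([w.lower() for w in set(text)
... if len(w) > 6 and fd[w] > 5 and w.isalpha()]))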
['according', 'already', 'american', 'another', 'appeared', 'authorities',
 'chernobyl', 'company', 'contaminated', 'country', 'crippled', 'daiichi',
 'damaged', 'decline', 'democrats', 'devastating', 'difficult', 'disaster',
 'earthquake', 'economic', 'economy', 'efforts', 'electric', 'emergency',
 'evacuation', 'experts', 'fukushima', 'further', 'government',
 'increasingly', 'japanese', 'largest', 'measures', 'military', 'minister',
 'northern', 'nuclear', 'officials', 'pacific', 'percent', 'political',
 'quarter', 'radiation', 'radioactive', 'reactor', 'reactors', 'release',
 'released', 'station', 'stricken', 'struggled', 'thousands', 'tsunami',
 'workers']
>>>
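
The filtering step in the session above (alphabetic words longer than 6 characters that occur more than 5 times) can also be tried without NLTK or network access. This is only an illustration under stated assumptions: frequent_long_words is a hypothetical helper, a simple regular expression stands in for nltk.word_tokenize, and collections.Counter stands in for the frequency distribution. Unlike the session, it lowercases tokens before counting, so capitalized and lowercase forms of a word are tallied together.

```python
import re
from collections import Counter


def frequent_long_words(raw, min_len=7, min_count=6):
    """Return sorted lowercase words with len >= min_len and count >= min_count."""
    # Stand-in tokenizer (assumption): runs of ASCII letters only,
    # so every token is alphabetic by construction.
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", raw)]
    fd = Counter(tokens)  # word -> number of occurrences
    return sorted(w for w in fd if len(w) >= min_len and fd[w] >= min_count)
```

The defaults min_len=7 and min_count=6 correspond to the conditions len(w) > 6 and fd[w] > 5 in the session; both can be adjusted to explore how the vocabulary changes.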