Wikipedia Bag-of-Words parser
// Written on September 24th, 2018
Wikipedia Bag of words as a service.
Suppose we’re interested in analyzing wikipedia’s article on bag of words.
Response Format
Use curl or requests to hit https://data.pengra.io/wikibags/Bag-of-words_model/. Responses will always be in JSON in the following format:
// Redirects to https://data.pengra.io/wikibags/14003441/
{
"updated_at": "2018-09-24T19:43:15.419147-07:00",
"page": "bag-of-words_model",
"wiki_id": 14003441,
"title": "Bag-of-words model",
"header_bag_size": 17,
"bag_size": 1191,
"header_bag": { "word": somecount, ... },
"bag": { ... }
}
All requests using page
will be redirected to its wiki_id
equivalent.
Random Page
Random wikibags (already cached) can be achieved with https://data.pengra.io/wikibags/_random/.
Data Source:
Wikipedia API provides the initial data, and then it is cached permanently.
Rapid Spawning:
Since cached wikibags are returned faster, running this script helps response times:
def query_wiki(wiki_id):
response = requests.get("https://data.pengra.io/wikibags/{}/".format(wiki_id))
return response.json()['bag']
def get_random_articles(limit):
response = requests.get("https://en.wikipedia.org/w/api.php?action=query&list=random&rnlimit={limit}&rnnamespace=0&format=json".format(limit=limit))
return [random['id'] for random in response.json()['query']['random']]