Python Twitter API - get tweets with a specific ID

Note: this page is an English rendering of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me) on StackOverflow.

Original question: http://stackoverflow.com/questions/28384588/
Asked by Crista23
I have a list of tweet ids for which I would like to download their text content. Is there any easy solution to do this, preferably through a Python script? I had a look at other libraries like Tweepy and things don't appear to work so simple, and downloading them manually is out of the question since my list is very long.
Accepted answer by Martijn Pieters
You can access specific tweets by their ID with the statuses/show/:id API route. Most Python Twitter libraries follow the exact same patterns, or offer 'friendly' names for the methods.
For example, Twython offers several show_* methods, including Twython.show_status(), which lets you load specific tweets:
from twython import Twython

CONSUMER_KEY = "<consumer key>"
CONSUMER_SECRET = "<consumer secret>"
OAUTH_TOKEN = "<application key>"
OAUTH_TOKEN_SECRET = "<application secret>"

twitter = Twython(
    CONSUMER_KEY, CONSUMER_SECRET,
    OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

tweet = twitter.show_status(id=id_of_tweet)
print(tweet['text'])
The returned dictionary follows the Tweet object definition given by the API.
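The shape of that dictionary can be explored offline; a minimal sketch using a hand-built sample in the shape of the API's Tweet object (all field values here are made up for illustration):

```python
# a hand-built sample shaped like the API's Tweet object (values are made up)
tweet = {
    'id_str': '561684861681729536',
    'text': 'hello world',
    'user': {'screen_name': 'example_user'},
}

# pull out the fields of interest, e.g. as a CSV-style line
print('%s,%s' % (tweet['id_str'], tweet['text']))
```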
The tweepy library uses tweepy.API.get_status():
import tweepy

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
api = tweepy.API(auth)

tweet = api.get_status(id_of_tweet)
print(tweet.text)
Here it returns a slightly richer object, but its attributes again reflect the published API.
Answer by verdverm
You can access tweets in bulk (up to 100 at a time) with the status/lookup endpoint: https://dev.twitter.com/rest/reference/get/statuses/lookup
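Since the endpoint caps each request at 100 IDs, a long list has to be split into batches on the client side; a minimal sketch of that chunking (the helper name is mine, not part of any library):

```python
def chunks(ids, size=100):
    """Yield successive batches of at most `size` tweet IDs."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

# e.g. 250 IDs become batches of 100, 100 and 50,
# each small enough for one statuses/lookup request
batches = list(chunks(list(range(250))))
```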
Answer by chrisinmtown
Sharing my work, which was vastly accelerated by the previous answers (thank you). This Python 2.7 script fetches the text for tweet IDs stored in a file. Adjust get_tweet_id() for your input data format; it was originally configured for the data at https://github.com/mdredze/twitter_sandy
Update, April 2018: responding late to @someone's bug report (thank you). This script no longer discards every 100th tweet ID (that was my bug). Please note that if a tweet is unavailable for whatever reason, the bulk fetch silently skips it. The script now warns if the response size differs from the request size.
'''
Gets text content for tweet IDs
'''
# standard
from __future__ import print_function
import getopt
import logging
import os
import sys
# import traceback
# third-party: `pip install tweepy`
import tweepy

# global logger level is configured in main()
Logger = None

# Generate your own at https://apps.twitter.com/app
CONSUMER_KEY = 'Consumer Key (API key)'
CONSUMER_SECRET = 'Consumer Secret (API Secret)'
OAUTH_TOKEN = 'Access Token'
OAUTH_TOKEN_SECRET = 'Access Token Secret'

# batch size depends on Twitter limit, 100 at this time
batch_size = 100


def get_tweet_id(line):
    '''
    Extracts and returns tweet ID from a line in the input.
    '''
    (tagid, _timestamp, _sandyflag) = line.split('\t')
    (_tag, _search, tweet_id) = tagid.split(':')
    return tweet_id


def get_tweets_single(twapi, idfilepath):
    '''
    Fetches content for tweet IDs in a file one at a time,
    which means a ton of HTTPS requests, so NOT recommended.

    `twapi`: Initialized, authorized API object from Tweepy
    `idfilepath`: Path to file containing IDs
    '''
    # process IDs from the file
    with open(idfilepath, 'rb') as idfile:
        for line in idfile:
            tweet_id = get_tweet_id(line)
            Logger.debug('get_tweets_single: fetching tweet for ID %s', tweet_id)
            try:
                tweet = twapi.get_status(tweet_id)
                print('%s,%s' % (tweet_id, tweet.text.encode('UTF-8')))
            except tweepy.TweepError as te:
                Logger.warn('get_tweets_single: failed to get tweet ID %s: %s', tweet_id, te.message)
                # traceback.print_exc(file=sys.stderr)


def get_tweet_list(twapi, idlist):
    '''
    Invokes bulk lookup method.
    Raises an exception if rate limit is exceeded.
    '''
    # fetch as little metadata as possible
    tweets = twapi.statuses_lookup(id_=idlist, include_entities=False, trim_user=True)
    if len(idlist) != len(tweets):
        Logger.warn('get_tweet_list: unexpected response size %d, expected %d', len(tweets), len(idlist))
    for tweet in tweets:
        print('%s,%s' % (tweet.id, tweet.text.encode('UTF-8')))


def get_tweets_bulk(twapi, idfilepath):
    '''
    Fetches content for tweet IDs in a file using the bulk request method,
    which vastly reduces the number of HTTPS requests compared to above;
    however, it does not warn about IDs that yield no tweet.

    `twapi`: Initialized, authorized API object from Tweepy
    `idfilepath`: Path to file containing IDs
    '''
    # process IDs from the file
    tweet_ids = list()
    with open(idfilepath, 'rb') as idfile:
        for line in idfile:
            tweet_id = get_tweet_id(line)
            Logger.debug('Enqueing tweet ID %s', tweet_id)
            tweet_ids.append(tweet_id)
            # API limits batch size
            if len(tweet_ids) == batch_size:
                Logger.debug('get_tweets_bulk: fetching batch of size %d', batch_size)
                get_tweet_list(twapi, tweet_ids)
                tweet_ids = list()
    # process remainder
    if len(tweet_ids) > 0:
        Logger.debug('get_tweets_bulk: fetching last batch of size %d', len(tweet_ids))
        get_tweet_list(twapi, tweet_ids)


def usage():
    print('Usage: get_tweets_by_id.py [options] file')
    print('    -s (single) makes one HTTPS request per tweet ID')
    print('    -v (verbose) enables detailed logging')
    sys.exit()


def main(args):
    logging.basicConfig(level=logging.WARN)
    global Logger
    Logger = logging.getLogger('get_tweets_by_id')
    bulk = True
    try:
        opts, args = getopt.getopt(args, 'sv')
    except getopt.GetoptError:
        usage()
    for opt, _optarg in opts:
        if opt == '-s':
            bulk = False
        elif opt == '-v':
            Logger.setLevel(logging.DEBUG)
            Logger.debug("main: verbose mode on")
        else:
            usage()
    if len(args) != 1:
        usage()
    idfile = args[0]
    if not os.path.isfile(idfile):
        print('Not found or not a file: %s' % idfile, file=sys.stderr)
        usage()
    # connect to twitter
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
    api = tweepy.API(auth)
    # hydrate tweet IDs
    if bulk:
        get_tweets_bulk(api, idfile)
    else:
        get_tweets_single(api, idfile)


if __name__ == '__main__':
    main(sys.argv[1:])
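The tab-separated input format that get_tweet_id() expects can be checked without touching the API; a quick sketch with a made-up line in the twitter_sandy style (the sample field values are assumptions, not real data):

```python
def get_tweet_id(line):
    # line layout assumed by the script: "<tag>:<search>:<tweet_id>\t<timestamp>\t<flag>"
    (tagid, _timestamp, _sandyflag) = line.split('\t')
    (_tag, _search, tweet_id) = tagid.split(':')
    return tweet_id

# made-up sample line for illustration only
sample = 'tag:search:1234567890\t2012-10-29T12:00:00Z\tY\n'
print(get_tweet_id(sample))  # 1234567890
```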
Answer by Someone
I don't have enough reputation to add an actual comment, so sadly this is the way to go:
I found a bug and a strange thing in chrisinmtown's answer:
Every 100th tweet ID will be skipped due to the bug. Here is a simple solution:
if len(tweet_ids) < 100:
    tweet_ids.append(tweet_id)
else:
    tweet_ids.append(tweet_id)
    get_tweet_list(twapi, tweet_ids)
    tweet_ids = list()
Using the call below is better, since the script keeps working even after it hits the rate limit:

api = tweepy.API(auth_handler=auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)