Streaming API Documentation
Introduction
The Twitter Streaming API allows near-realtime access to various subsets of Twitter public statuses. Client library developers, as well as developers integrating existing clients into their application, are encouraged to read this entire document thoroughly.
IMPORTANT NOTE: The Streaming API is currently under an alpha test
All developers using the Streaming API must tolerate possible unannounced and extended periods of unavailability, especially during off-hours, Pacific Time. New features, resources and policies are being deployed on very little, if any, notice. Access to restricted resources is extremely limited and is only granted on a case-by-case basis after acceptance of an additional terms of service document.
Concepts
Protected vs. Public
Only non-protected public accounts can create public statuses. Statuses, including replies and mentions, created by a public account are candidates for inclusion in the Streaming API. Statuses created by protected accounts are non-public and not available via the Streaming API. Direct Messages from all users are always considered non-public and are not available on the Streaming API.
Result Quality
The Streaming API filters statuses for quality using the same quality algorithms employed by Twitter Search. The Streaming API does not, however, filter or rank statuses for relevance. Quality filtering is on a per-user basis: low quality users will show up neither in Streaming nor in Search. This filtering also applies when users are specifically requested using the follow predicate. Removing the quality filter from streams with follow predicates is, quite reasonably, an often requested feature. If an expected user is not present in a stream, manually cross-check the user against Search results. If the user's statuses are also not returned in Search, you can assume that the user's statuses will not be returned by the Streaming API.
For more details see: http://help.twitter.com/forums/10713/entries/42646 which states, in part:
In order to keep your search results relevant, Twitter filters search results for quality. Our search results will not include suspended accounts, or accounts that may jeopardize search quality. Material that degrades search relevancy or creates a bad search experience for people using Twitter may be permanently removed.
Authentication
HTTP Basic Authentication is required to access the Streaming API methods. A client must provide the credentials of a valid Twitter account.
Access and Rate Limiting
All accounts may access the statuses/sample and statuses/filter methods at default access levels. Accounts may also be granted broader data access on these same methods on a case-by-case basis. Access to other methods requires a special arrangement with Twitter.
Each account may create only one standing connection to the Streaming API. Subsequent connections from the same account may cause previously established connections to be disconnected. Excessive connection attempts, regardless of success, will result in an automatic and temporary ban of the client's IP address.
Connecting
To connect to the Streaming API, form an HTTP request and consume the resulting stream. Our servers will hold the connection open indefinitely, barring server-side error, excessive client-side lag, network hiccups or duplicate logins.
The API expects HTTP Basic authentication with your screen name provided as the credential username. Email addresses are not accepted as valid usernames. Some HTTP clients are buggy when challenged with WWW-Authenticate: Basic and always return 401. A workaround is to force basic auth on the first round, that is, to send the credentials with the initial request instead of waiting for the challenge.
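One way to force basic auth on the first round is to set the Authorization header yourself. The following is a minimal sketch using Python's standard library; the helper name build_stream_request is hypothetical, and the request is only constructed here, not sent.

```python
import base64
import urllib.request


def build_stream_request(url, screen_name, password):
    """Build a request that carries Basic credentials on the first round.

    Hypothetical helper: it sets the Authorization header up front so the
    client never depends on handling a WWW-Authenticate challenge.
    """
    credentials = f"{screen_name}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", "Basic " + token)
    return request


request = build_stream_request(
    "http://stream.twitter.com/1/statuses/sample.json", "user", "pass"
)
```

Passing the resulting request to urllib.request.urlopen returns a response object that can be read incrementally.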
Some HTTP client libraries only return the response body after the connection has been closed by the server. These clients will not work for accessing the Streaming API. You must use an HTTP client that will return response data incrementally. Most robust HTTP client libraries will provide this functionality. The Apache HttpClient will handle this use case, for example.
There are four main reasons to have your connection closed:
Once a valid connection drops, reconnect immediately.
When a network error (TCP/IP level) is encountered, back off linearly. Perhaps start at 250 milliseconds, double, and cap at 16 seconds. Network layer problems are generally transitory and clear quickly.
When an HTTP error (> 200) is returned, back off exponentially. Perhaps start with a 10 second wait, double on each subsequent failure, and finally cap the wait at 240 seconds. Consider sending an alert to a human operator after multiple HTTP errors, as there is probably a client configuration issue that is unlikely to be resolved without human intervention. There's not much point in polling any faster in the face of HTTP error codes, and your client may run afoul of a rate limit.
The Streaming API service is fairly lenient. Clients are not banned for a few dozen bungled connections here and there. But, if you code anything in a while loop that also doesn't have a sleep, you will eventually get the hatchet for some small number of minutes. If you get the hatchet repeatedly, you'll be cut off for an indeterminate period of time.
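The reconnection waits described above can be sketched as a small generator. This is only an illustration: it uses the suggested starting values, doubling and caps from the text for both error classes, and the function names are hypothetical.

```python
from itertools import islice


def backoff_delays(initial, cap):
    """Yield wait times in seconds: start at `initial`, double, never exceed `cap`."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * 2, cap)


def network_error_delays():
    # Suggested values for TCP/IP level errors: 250 ms up to 16 s.
    return backoff_delays(0.25, 16.0)


def http_error_delays():
    # Suggested values for HTTP errors: 10 s up to 240 s.
    return backoff_delays(10.0, 240.0)


network_waits = list(islice(network_error_delays(), 8))
http_waits = list(islice(http_error_delays(), 7))
```

A client would call time.sleep() with the next value after each failed attempt and create a fresh generator once a connection succeeds, so that a reconnect loop never runs without a sleep.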
Parsing Responses
The Streaming API returns data in XML and JSON formats. We encourage clients to use the more compact JSON representation. Parsing is greatly simplified by the delimited parameter, described below.
Parsing JSON responses from the Streaming API is simple: every object is returned on its own line, and ends with a carriage return. Newline characters (\n) may occur in object elements (the text element of a status object, for example), but carriage returns (\r) should not.
Parsing XML is slightly more challenging, as objects include newlines and carriage returns. Use the length delimited format or an XML stream parser.
Parsers must be tolerant of occasional extra newline characters placed between statuses. These characters are placed as periodic "keep-alive" messages, should the stream of statuses temporarily pause. These keep-alives allow clients and NAT firewalls to determine that the connection is indeed still valid during low volume periods. Parsing either markup language may be easier and more efficient with the delimited query parameter, documented below.
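A minimal sketch of a JSON reader that follows these rules: it splits the byte stream on carriage returns and skips the empty keep-alive lines. The function name is hypothetical, and the input is assumed to be any iterable of byte chunks as they arrive from the socket.

```python
import json


def iter_json_objects(chunks):
    """Yield parsed JSON objects from an iterable of byte chunks.

    Each object is assumed to end with a carriage return; blank lines
    between objects are treated as keep-alives and skipped.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\r" in buffer:
            line, buffer = buffer.split(b"\r", 1)
            line = line.strip()  # removes keep-alive newlines around the object
            if line:
                yield json.loads(line.decode("utf-8"))


sample_chunks = [b'{"id": 1, "te', b'xt": "a"}\r\n\n\r\n{"id": 2}\r\n']
parsed = list(iter_json_objects(sample_chunks))
```

Objects that are split across chunk boundaries stay in the buffer until their terminating carriage return arrives.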
Streams may also contain status deletion notices. Clients are urged to honor deletion requests and discard deleted statuses immediately. At times, status deletion messages may arrive before the status. Even in this case, the late arriving status should be deleted from your backing store.
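One way to honor a deletion that arrives before its status is to remember deleted IDs. The sketch below uses an in-memory store purely for illustration; the class and method names are hypothetical, and a real client would apply the same idea to its own backing store.

```python
class StatusStore:
    """Toy backing store that honors deletions even when they arrive first."""

    def __init__(self):
        self.statuses = {}
        self.deleted_ids = set()

    def add_status(self, status):
        """Store a status unless a deletion for its ID was already seen."""
        if status["id"] in self.deleted_ids:
            return False  # late-arriving status: discard it
        self.statuses[status["id"]] = status
        return True

    def delete_status(self, status_id):
        """Remove the status if present and remember the ID for late arrivals."""
        self.deleted_ids.add(status_id)
        self.statuses.pop(status_id, None)


store = StatusStore()
store.delete_status(42)                      # deletion arrives first
kept = store.add_status({"id": 42, "text": "late"})
```

The set of remembered IDs grows without bound in this sketch; a production client would likely expire old entries.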
Track streams may also contain limitation notices, where the integer track is an enumeration of statuses that matched the track predicate but were administratively limited. These notices will be sent each time a limited stream becomes unlimited.
Additional objects may be introduced into the markup stream in future releases without changing the resource revisions. Ensure that your parser is tolerant of unexpected objects.
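A tolerant parser can classify each object and ignore anything it does not recognize. The sketch below assumes, for illustration only, that deletion notices carry a top-level "delete" key, that limitation notices carry a top-level "limit" key, and that statuses carry "id" and "text"; check these shapes against the actual stream before relying on them.

```python
def classify(obj):
    """Return 'delete', 'limit', 'status' or 'unknown' for a parsed object.

    The key names are assumptions made for this sketch, not a specification.
    """
    if not isinstance(obj, dict):
        return "unknown"
    if "delete" in obj:
        return "delete"
    if "limit" in obj:
        return "limit"
    if "id" in obj and "text" in obj:
        return "status"
    return "unknown"  # future object types are skipped rather than raising


kinds = [
    classify({"id": 1, "text": "hello"}),
    classify({"limit": {"track": 1234}}),
    classify({"some_future_object": {}}),
]
```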
The Streaming API presents events in best effort ordering. During periods of instability, statuses and status deletions may arrive significantly out of order.
An example delimited client loop in pseudocode:
while (true) {
  do {
    lengthBytes = readline()
  } while (lengthBytes.length < 1)
  enqueueForMarkupProcessor(read(parseInt(lengthBytes)))
}
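The same loop might look as follows in Python. This is a sketch, assuming a binary file-like object such as an HTTP response; the function name is hypothetical, and io.BytesIO stands in for the network stream.

```python
import io


def iter_delimited(stream):
    """Yield raw payloads (bytes) from a stream using delimited=length."""
    while True:
        line = stream.readline()
        if not line:              # end of stream
            return
        line = line.strip()
        if not line:              # keep-alive newline before a length
            continue
        length = int(line)
        payload = stream.read(length)
        if len(payload) < length:  # stream ended mid-payload
            return
        yield payload


fake_stream = io.BytesIO(b'\n9\n{"id": 1}\n\n9\n{"id": 2}')
payloads = list(iter_delimited(fake_stream))
```

Each payload can then be handed to a queue for markup processing, as in the pseudocode.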
Collecting and Processing
Repeated experience shows that clients must also plan for considerable growth and considerable temporary spikes in volume. Even at steady-state, the top tiers of the Streaming API will produce a lot of data, often more than can be reasonably processed on a single core. A prudent developer will plan for traffic to double every few months, and will test and provision to handle spikes of at least three times current daily peak volumes. Twitter is most useful to your end users during collective events. Plan accordingly.
To prevent latency problems and plan for scale, design your client with decoupled collection, processing and persistence components. The collection component should efficiently handle connecting to the Streaming API and retrieving responses, as well as reconnecting in the event of network failure, and hand-off statuses via an asynchronous queueing mechanism to application specific processing and persistence components. This component should be isolated from any subsequent downstream processing backlog or maintenance, otherwise queuing will occur in the Streaming API. Eventually your client will be disconnected, resulting in data loss.
For example, collect "raw" statuses (that is, not parsed or marshaled into your language's native object format) in one process, and pass each status into a queueing system, rotated flatfile, or database. In a second process, consume statuses from your queue or store of choice, parse them, extract the fields relevant to your application, etc. Consumers of high-volume streams should consider performing JSON and XML markup parsing in a parallel manner as the status volume is approaching the single processor throughput limit of some software stacks. End-to-end stress test your stack.
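The decoupling described above can be sketched with a queue between a collector and a processor. This is only an illustration: an in-process queue.Queue and a thread stand in for whatever queueing system, flatfile or database you choose, and all function names are hypothetical.

```python
import json
import queue
import threading


def collect(raw_statuses, out_queue):
    """Collector: pass raw, unparsed statuses along as fast as they arrive."""
    for raw in raw_statuses:
        out_queue.put(raw)
    out_queue.put(None)  # sentinel: no more input


def process(in_queue, results):
    """Processor: parse statuses and extract only the fields of interest."""
    while True:
        raw = in_queue.get()
        if raw is None:
            break
        status = json.loads(raw)
        results.append((status["id"], status["text"]))


def run_pipeline(raw_statuses):
    handoff = queue.Queue()
    results = []
    worker = threading.Thread(target=process, args=(handoff, results))
    worker.start()
    collect(raw_statuses, handoff)
    worker.join()
    return results


extracted = run_pipeline(['{"id": 1, "text": "a"}', '{"id": 2, "text": "b"}'])
```

The collector never parses anything, so slow processing only lengthens the queue instead of stalling reads from the stream.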
Quality of Service
The Streaming API Quality of Service (QoS) is:
This QoS implies that, on rare occasion and without notice, statuses may be missing from the delivered stream. During routine streaming, statuses will arrive in any order, but will tend to be k-sorted, where k is the number of tweets received in about 10 seconds. Upon client reconnect, and especially when using the count parameter, duplicate and non-k-sorted statuses may be delivered. Consuming applications must tolerate duplicate and out of order statuses and status deletions.
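A simple way to tolerate duplicates is to remember recently seen status IDs. The following is a sketch with a bounded memory; the class name and the default capacity are arbitrary assumptions, and the capacity should comfortably exceed the number of statuses you expect to be replayed on reconnect.

```python
from collections import OrderedDict


class RecentIdFilter:
    """Remember the most recently seen status IDs, up to a fixed capacity."""

    def __init__(self, capacity=100000):
        self.capacity = capacity
        self.seen = OrderedDict()

    def is_new(self, status_id):
        """Return True the first time an ID is seen within the window."""
        if status_id in self.seen:
            return False
        self.seen[status_id] = True
        if len(self.seen) > self.capacity:
            self.seen.popitem(last=False)  # evict the oldest remembered ID
        return True


recent = RecentIdFilter(capacity=2)
observed = [recent.is_new(i) for i in (1, 1, 2, 3, 1)]
```

In the example, ID 1 is reported as new again at the end because it was evicted once the two-entry window filled up.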
The Streaming API is in Alpha test. We are adding features and improving operational instrumentation on a continuous basis. Each deploy on our end will cause you to be disconnected, perhaps multiple times per day. Proper client coding will prevent data loss. You also must be willing to tolerate periods of inadvertent off-hours (PDT) downtime. We purposefully do not yet have Hosebird integrated with our operational monitoring systems. There is no 24x7 on-call response for service disruptions, other than to email support. To date we have had nearly 100% uptime, and we're working hard to continue this level of service.
Example
% curl http://stream.twitter.com/1/statuses/sample.json -uYOUR_TWITTER_USERNAME:YOUR_PASSWORD
Sampling
The statuses/sample feed is sampled from the Firehose stream of public statuses. The sampling algorithm is consistent; no additional data can be gleaned by consuming more than one sampled feed at a given access level or additional feeds at lower access levels. All feeds at a given access level are identical, and all lower access levels are strict subsets of higher access levels. Long-term consumption of duplicate data wastes limited resources and is generally discouraged.
The sampling algorithm, in conjunction with the statusID assignment algorithm, will tend to produce a random selection. A proportion of all statuses are selected, and the public statuses within that proportion are delivered. Therefore, as the rate of non-public statuses varies, the proportion of sampled statuses to total statuses will also vary. Additionally, the sampled rates exhibit the same strong diurnal and weekly patterns as the overall status creation rate.
Sampling proportions are subject to continuous unannounced refinement. Our goal is to provide useful low-latency streams without overwhelming clients or incurring excessive delivery cost.
Track Limiting
Reasonably focused track predicates will return all occurrences in the full Firehose stream of public statuses. Overly broad track predicates will cause the output to be periodically limited. After the limitation has expired, all matching statuses will once again be delivered, along with a limit message that enumerates the statuses that have been limited from the stream. Limit messages are described in Parsing Responses.
Track streams with identical predicates will produce identical streams. Limiter periodicity is aligned with statuses/sample sampling periodicity; thus broad predicates will produce limited streams that will tend to be a subset of the statuses/sample streams. Creating multiple track queries to gather more statuses than are available in the sampled feeds is likely to be fruitless and may result in automatic banning.
Updating Filter Predicates
Updating track and follow predicate parameters with low latency and low data loss is possible, but currently requires a bit of programming effort on the client.
If you do all this, you should be able to offer a low-latency user experience with nearly zero data loss.
By Language And Country
With some effort, it should be feasible to build a stream representing the vast majority of statuses in a given non-English language or from many non-English speaking countries. The following suggested approach is untested conjecture, and requires additional access and some non-trivial development effort.
Methods
Methods are versioned to allow backwards compatibility. The current version level is 1.
statuses/filter
Returns public statuses that match one or more filter predicates. At least one predicate parameter, track or follow, must be specified. Both parameters may be specified, which allows most clients to use a single connection to the Streaming API. Placing long parameters in the URL may cause the request to be rejected for excessive URL length. Use a POST request header parameter to avoid long URLs.
The default access level allows up to 200 track keywords and 400 follow userids. Increased access levels allow 80,000 follow userids ("shadow" role), 400,000 follow userids ("birddog" role), 10,000 track keywords ("restricted track" role), and 200,000 track keywords ("partner track" role). Increased track access levels also pass a higher proportion of statuses before limiting the stream.
URL: http://stream.twitter.com/1/statuses/filter.format
Formats: xml, json
statuses/firehose
Returns all public statuses. The Firehose is not a generally available resource. Few applications require this level of access. Creative use of a combination of other resources and various access levels can satisfy nearly every application use case.
URL: http://stream.twitter.com/1/statuses/firehose.format
Formats: xml, json
Method(s): GET
Parameters: count, delimited
Returns: stream of status elements
statuses/retweet
Returns all retweets. The retweet stream is not a generally available resource. Few applications require this level of access. Creative use of a combination of other resources and various access levels can satisfy nearly every application use case.
URL: http://stream.twitter.com/1/statuses/retweet.format
Formats: xml, json
Method(s): GET
Parameters: delimited
Returns: stream of status elements
statuses/sample
Returns a random sample of all public statuses. The default access level provides a small proportion of the Firehose. The "Gardenhose" access level provides a larger proportion, more suitable for data mining and research applications that need a statistically significant sample.
URL: http://stream.twitter.com/1/statuses/sample.format
Formats: xml, json
Method(s): GET
Parameters: count, delimited
Returns: stream of status elements
Query Parameters
count
Indicates the number of previous statuses to consider for delivery before transitioning to live stream delivery. On unfiltered streams, all considered statuses are delivered, so the number requested is the number returned. On filtered streams, the number requested is the number of statuses that are applied to the filter predicate, and not the number of statuses returned.
Firehose, Retweet, Link, Birddog and Shadow clients interested in capturing all statuses should maintain a current estimate of the number of statuses received per second and note the time that the last status was received. Upon a reconnect, the client can then estimate the appropriate backlog to request. The count parameter is not allowed on other resources and the default filter role.
Values: -150,000 to 150,000. This range is subject to change on short notice. Positive values transition seamlessly to the live stream. Negative values terminate when the historical stream has finished, useful for debugging.
Methods: statuses/firehose, statuses/filter
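The backlog estimate for a reconnect can be sketched as the observed rate multiplied by the length of the gap. In the sketch below, the function name and the safety_factor margin are assumptions added for illustration; max_count mirrors the upper bound of the range quoted above, which is subject to change.

```python
import math


def estimate_count(statuses_per_second, seconds_since_last_status,
                   safety_factor=1.5, max_count=150000):
    """Estimate a positive count value to request when reconnecting.

    safety_factor pads the estimate to allow for rate fluctuations; it is
    an arbitrary choice for this sketch.
    """
    missed = statuses_per_second * seconds_since_last_status * safety_factor
    return min(math.ceil(missed), max_count)


short_gap = estimate_count(100, 10)    # 100 statuses/s, 10 s disconnected
long_gap = estimate_count(5000, 60)    # estimate exceeds the allowed range
```

Because the estimate is padded, some of the backlog will overlap statuses already received, so the consuming code still needs to discard duplicates.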
delimited
Indicates that statuses should be delimited in the stream. Statuses are represented by a length, in bytes, a newline, and the status text that is exactly length bytes. Note that "keep-alive" newlines may be inserted before each length.
Values: length
Methods: (all)
Example: curl http://stream.twitter.com/1/statuses/sample.xml\?delimited=length -uAnyTwitterUser:Password
follow
Returns public statuses that reference the given set of users. Users are specified by a comma separated list.
References matched are statuses that were:
References unmatched are statuses that were:
Values: user IDs (integers), separated by commas
Methods: statuses/filter
Example: Create a file called 'following' that contains, exactly and excluding the quotation marks: "follow=12,13,15,16,20,87". Execute:
curl -d @following http://stream.twitter.com/1/statuses/filter.json -uAnyTwitterUser:Password
You will receive JSON updates from Jack, Biz, Crystal, Ev and Krissy, but not from Jeremy, as he's a private user.
track
Specifies keywords to track. Keywords are specified by a comma separated list. Queries are subject to Track Limitations, described in Track Limiting, and subject to access roles, described in the statuses/filter method. Track keywords are case-insensitive logical ORs. Terms are exact-matched, and also exact-matched ignoring punctuation. Phrases, that is, keywords with spaces, are not supported. Keywords containing punctuation will only exact match tokens. Some UTF-8 keywords will not match correctly; this is a known temporary defect.
Track examples: The keyword Twitter will match all public statuses with the following comma delimited tokens in their text field: TWITTER, twitter, "Twitter", twitter., #twitter and @twitter. The following tokens will not be matched: TwitterTracker and http://www.twitter.com. The phrase, excluding quotes, "hard alee" won't match anything. The keyword "helm's-alee" will match helm's-alee but not #helm's-alee.
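For client-side testing, the matching behavior in these examples can be approximated as follows. This is not the server's matcher, only a sketch that reproduces the examples above; it assumes tokens are separated by whitespace and that "ignoring punctuation" means ignoring punctuation around a token.

```python
import string


def matches_keyword(keyword, text):
    """Approximate track matching for a single keyword (illustrative only)."""
    keyword = keyword.lower()
    keyword_has_punctuation = any(ch in string.punctuation for ch in keyword)
    for token in text.lower().split():
        if token == keyword:
            return True
        # Keywords containing punctuation only exact-match tokens.
        if not keyword_has_punctuation and token.strip(string.punctuation) == keyword:
            return True
    return False


hits = [matches_keyword("Twitter", t)
        for t in ("TWITTER", '"Twitter"', "twitter.", "#twitter", "@twitter")]
misses = [matches_keyword("Twitter", t)
          for t in ("TwitterTracker", "http://www.twitter.com")]
```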
Values: Strings separated by commas. Each string must be between 1 and 30 bytes, inclusive.
Methods: statuses/filter
Example: Create a file called 'tracking' that contains, exactly and excluding the quotation marks: "track=basketball,football,baseball,footy,soccer". Execute:
curl -d @tracking http://stream.twitter.com/1/statuses/filter.json -uAnyTwitterUser:Password
You will receive JSON updates about various crucial sportsball topics and events.
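Validating keywords before connecting avoids a rejected request. The sketch below checks the documented limits (1 to 30 bytes per keyword, no phrases, 200 keywords at the default access level) and builds the POST body; the function name is hypothetical, and percent-encoding each keyword is an assumption made so that non-ASCII keywords survive form encoding.

```python
from urllib.parse import quote


def build_track_body(keywords, max_keywords=200):
    """Validate track keywords and return a POST body such as 'track=a,b'."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    if len(keywords) > max_keywords:
        raise ValueError("too many track keywords for this access level")
    for keyword in keywords:
        size = len(keyword.encode("utf-8"))
        if not 1 <= size <= 30:
            raise ValueError(f"keyword must be 1 to 30 bytes: {keyword!r}")
        if " " in keyword or "," in keyword:
            raise ValueError(f"phrases and commas are not supported: {keyword!r}")
    return "track=" + ",".join(quote(keyword, safe="") for keyword in keywords)


body = build_track_body(["basketball", "football", "baseball", "footy", "soccer"])
```

The resulting string matches the contents of the 'tracking' file in the example above.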
HTTP Response Codes
Most error codes are returned with a string with additional details. For all codes greater than 200, clients should wait before attempting another connection. See the Connecting section, above.
200 Success
401 Unauthorized
HTTP authentication failed due to either a non-existent username or an incorrect password. User authenticated properly but is not in a required role for this resource; contact the API team for appropriate access.
404 Unknown
Resource does not exist.
406 Not Acceptable
Parameter not allowed for resource, for example, track parameter specified on a sampled resource. Track keyword too long or too short. No predicates defined for filtered resource, for example, neither track nor follow parameter defined. Follow userid unparseable.
413 Too Long
A parameter list is too long, for example, track or follow parameter string too long. Too many track tokens specified for role; contact API team for increased access. Too many follow userids specified for role; contact API team for increased access.
416 Range Unacceptable
Count parameter is not allowed in role. Count parameter value is too large.
500 Server Internal Error
Should not occur, contact API team.
503 Service Overloaded
Should not occur, contact API team.
Pre-Launch Checklist