Streaming API Documentation

Page history last edited by John Kalucki 1 day ago

Introduction

The Twitter Streaming API allows near-realtime access to various subsets of Twitter public statuses. Client library developers, as well as developers integrating existing clients into their application, are encouraged to read this entire document thoroughly.

 

IMPORTANT NOTE: The Streaming API is currently under an alpha test

 

All developers using the Streaming API must tolerate possible unannounced and extended periods of unavailability, especially during off-hours, Pacific Time. New features, resources and policies are being deployed with very little, if any, notice. Access to restricted resources is extremely limited and is only granted on a case-by-case basis after acceptance of an additional terms of service document.

 

 

Concepts

 

Protected vs. Public

Only non-protected public accounts can create public statuses. Statuses, including replies and mentions, created by a public account are candidates for inclusion in the Streaming API.  Statuses created by protected accounts are non-public and not available via the Streaming API.  Direct Messages from all users are always considered non-public and are not available on the Streaming API.

 

Result Quality 

The Streaming API filters statuses for quality using the same quality algorithms employed by Twitter Search. The Streaming API does not, however, filter or rank statuses for relevance. Quality filtering is on a per-user basis: low quality users will show up in neither Streaming nor Search. This filtering also applies when users are specifically requested using the follow predicate. Removing the quality filter from streams with follow predicates is, quite reasonably, an often requested feature. If an expected user is not present in a stream, manually cross-check the user against Search results. If the user's statuses are also not returned in Search, you can assume that the user's statuses will not be returned by the Streaming API.

 

For more details see: http://help.twitter.com/forums/10713/entries/42646 which states, in part:

 

In order to keep your search results relevant, Twitter filters search results for quality.  Our search results will not include suspended accounts, or accounts that may jeopardize search quality.  Material that degrades search relevancy or creates a bad search experience for people using Twitter may be permanently removed.

 

Authentication

HTTP Basic Authentication is required to access the streaming API methods. A client must provide the credentials of a valid Twitter account.

 

Access and Rate Limiting

All accounts may access the statuses/sample and statuses/filter methods at default access levels. Accounts may also be granted broader data access on these same methods on a case-by-case basis. Access to other methods requires a special arrangement with Twitter.

 

Each account may create only one standing connection to the Streaming API. Subsequent connections from the same account may cause previously established connections to be disconnected. Excessive connection attempts, regardless of success, will result in an automatic and temporary ban of the client's IP address.

 

Connecting

To connect to the Streaming API, form an HTTP request and consume the resulting stream. Our servers will hold the connection open indefinitely, barring server-side error, excessive client-side lag, network hiccups or duplicate logins.

 

The API expects HTTP basic auth with your screen name provided as the credential username. Email addresses are not accepted as valid usernames. Some HTTP clients are buggy when challenged with WWW-Authenticate: Basic and always fail with a 401. A work-around is to force basic auth on the first round, that is, to send the credentials with the initial request rather than waiting for the challenge.
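
Forcing basic auth on the first round amounts to sending the Authorization header with the very first request instead of waiting for a 401 challenge. A minimal sketch in Python using only the standard library; the helper names basic_auth_header and build_request are hypothetical and not part of any Twitter library:

```python
import base64
import urllib.request


def basic_auth_header(screen_name: str, password: str) -> str:
    """Build the value of a pre-emptive HTTP Basic Authorization header."""
    token = base64.b64encode(f"{screen_name}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def build_request(url: str, screen_name: str, password: str) -> urllib.request.Request:
    """Create a request that carries credentials on the first round trip."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(screen_name, password))
    return request
```

The request object can then be passed to urllib.request.urlopen, and the response read incrementally rather than all at once.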

 

Some HTTP client libraries only return the response body after the connection has been closed by the server. These clients will not work for accessing the Streaming API. You must use an HTTP client that will return response data incrementally. Most robust HTTP client libraries will provide this functionality. The Apache HttpClient will handle this use case, for example.

 

There are four main reasons to have your connection closed:

  • Duplicate client logins (earlier connections terminated)
  • Hosebird server restarts (code deploys)
  • Lagging connection terminated (client too slow, or insufficient bandwidth)
  • General Twitter network maintenance (Load balancer restarts, network reconfigurations, other very very rare events)

 

Once a valid connection drops, reconnect immediately.

 

When a network error (TCP/IP level) is encountered, back off linearly. Perhaps start at 250 milliseconds, add 250 milliseconds on each subsequent failure, and cap at 16 seconds. Network layer problems are generally transitory and clear quickly.

 

When an HTTP error (> 200) is returned, back off exponentially. Perhaps start with a 10 second wait, double on each subsequent failure, and finally cap the wait at 240 seconds. Consider sending an alert to a human operator after multiple HTTP errors, as there is probably a client configuration issue that is unlikely to be resolved without human intervention. There's not much point in polling any faster in the face of HTTP error codes, and your client may run afoul of a rate limit.

 

The Streaming API service is fairly lenient. Clients are not banned for a few dozen bungled connections here and there. But, if you code anything in a while loop that also doesn't have a sleep, you will eventually get the hatchet for some small number of minutes. If you get the hatchet repeatedly, you'll be cut off for an indeterminate period of time.
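
The three reconnect policies above (immediate after a clean drop, linear for network errors, exponential for HTTP errors) can be sketched as one delay function. This is an illustration only: the function name next_delay, the kind labels, and the fixed 250 millisecond linear step are assumptions, while the starting values and caps are the suggestions given above.

```python
def next_delay(kind: str, failures: int) -> float:
    """Return the seconds to wait before the next connection attempt.

    kind is "disconnect" (a valid connection dropped), "network"
    (TCP/IP level error) or "http" (HTTP status above 200).
    failures counts consecutive failures of that kind, starting at 1.
    """
    if failures < 1:
        raise ValueError("failures must be at least 1")
    if kind == "disconnect":
        return 0.0  # reconnect immediately
    if kind == "network":
        # Linear: 0.25 s, 0.5 s, 0.75 s, ... capped at 16 s (assumed step).
        return min(0.25 * failures, 16.0)
    if kind == "http":
        # Exponential: 10 s, 20 s, 40 s, ... capped at 240 s.
        return min(10.0 * 2 ** (failures - 1), 240.0)
    raise ValueError(f"unknown failure kind: {kind}")
```

A client would sleep for next_delay(...) seconds inside its reconnect loop and reset the failure counter once a connection succeeds, which also guarantees the loop never spins without a sleep on repeated errors.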

 

Parsing Responses

The Streaming API returns data in XML and JSON formats. We encourage clients to use the more compact JSON representation. Parsing is greatly simplified by the delimited parameter, described below.

 

Parsing JSON responses from the Streaming API is simple: every object is returned on its own line, and ends with a carriage return. Newline characters (\n) may occur in object elements (the text element of a status object, for example), but carriage returns (\r) should not.

 

Parsing XML is slightly more challenging, as objects include newlines and carriage returns. Use the length delimited format or an XML stream parser.

 

Parsers must be tolerant of occasional extra newline characters placed between statuses. These characters are placed as periodic "keep-alive" messages, should the stream of statuses temporarily pause. These keep-alives allow clients and NAT firewalls to determine that the connection is indeed still valid during low volume periods. Parsing either markup language may be easier and more efficient with the delimited query parameter, documented below.
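
A minimal sketch of a non-delimited JSON reader that follows the rules above: objects end with a carriage return, and stray newlines between objects are keep-alives to be skipped. The function name iter_json_objects is hypothetical, and the input is assumed to be an iterable of raw byte chunks as an incremental HTTP client would deliver them.

```python
import json
from typing import Iterable, Iterator


def iter_json_objects(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield parsed JSON objects from a stream of raw byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Each object is terminated by a carriage return.
        while b"\r" in buffer:
            line, buffer = buffer.split(b"\r", 1)
            line = line.strip()  # drops keep-alive newlines around the object
            if line:
                yield json.loads(line.decode("utf-8"))
```

Buffering across chunks matters because a chunk boundary can fall in the middle of an object; only complete lines are decoded and parsed.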

 

Streams may also contain status deletion notices. Clients are urged to honor deletion requests and discard deleted statuses immediately. At times, status deletion messages may arrive before the status. Even in this case, the late arriving status should be deleted from your backing store.

  • XML:  <delete><status><id>1234</id><user_id>3</user_id></status></delete>
  • JSON: { "delete": { "status": { "id": 1234, "user_id": 3 } } }

 

Track streams may also contain limitation notices, where the integer track is an enumeration of statuses that matched the track predicate but were administratively limited. These notices will be sent each time a limited stream becomes unlimited.

  • XML:  <limit><track>1234</track></limit>
  • JSON: { "limit": { "track": 1234 } }

 

Additional objects may be introduced into the markup stream in future releases without changing the resource revisions. Ensure that your parser is tolerant of unexpected objects.
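
One way to handle statuses, deletion notices, limit notices and unexpected objects in a single place is sketched below. StatusStore is a hypothetical in-memory stand-in for a real backing store, and the sketch assumes that status objects carry top-level id and text fields. It remembers deleted ids so that a status arriving after its own deletion notice is discarded.

```python
class StatusStore:
    """In-memory stand-in for a backing store (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.statuses = {}        # status id -> status object
        self.deleted_ids = set()  # ids whose deletion notice has been seen
        self.last_limit = None    # value of the most recent limit notice
        self.ignored = 0          # count of objects of unrecognized type

    def handle(self, obj: dict) -> None:
        if "delete" in obj:
            status_id = obj["delete"]["status"]["id"]
            self.deleted_ids.add(status_id)
            self.statuses.pop(status_id, None)
        elif "limit" in obj:
            self.last_limit = obj["limit"]["track"]
        elif "id" in obj and "text" in obj:
            # A deletion notice may have arrived before the status itself.
            if obj["id"] not in self.deleted_ids:
                self.statuses[obj["id"]] = obj
        else:
            # Unknown object type: tolerate it rather than fail.
            self.ignored += 1
```

In a long-running client the set of deleted ids would need to be bounded or persisted; the unbounded set here only keeps the sketch short.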

 

The Streaming API presents events in best effort ordering. During periods of instability, statuses and status deletions may arrive significantly out of order.

 

An example delimited client loop in pseudocode:

 

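  while (true) {
    do {
      // skip blank "keep-alive" lines until a length line arrives
      lengthBytes = readline()
    } while (lengthBytes.length < 1)
    // read exactly that many bytes and hand them off for parsing
    enqueueForMarkupProcessor(read(parseInt(lengthBytes)))
  }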
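
The same loop as a runnable Python sketch. The function name iter_delimited is hypothetical, and the stream is assumed to be a buffered binary file-like object whose read(n) blocks until n bytes are available or the stream ends.

```python
import io
from typing import BinaryIO, Iterator


def iter_delimited(stream: BinaryIO) -> Iterator[bytes]:
    """Yield raw status payloads from a length-delimited stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream: the caller should reconnect
        line = line.strip()
        if not line:
            continue  # blank keep-alive line before the length
        length = int(line)
        payload = stream.read(length)
        if len(payload) < length:
            return  # stream ended mid-status; discard the fragment
        yield payload
```

For example, list(iter_delimited(io.BytesIO(b"\n4\nabcd"))) skips the keep-alive newline and returns the single four-byte payload.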

 

Collecting and Processing 

Repeated experience shows that clients must also plan for considerable growth and considerable temporary spikes in volume. Even at steady-state, the top tiers of the Streaming API will produce a lot of data, often more than can be reasonably processed on a single core. A prudent developer will plan for traffic to double every few months, and will test and provision to handle spikes of at least three times current daily peak volumes. Twitter is most useful to your end users during collective events. Plan accordingly.

 

To prevent latency problems and plan for scale, design your client with decoupled collection, processing and persistence components. The collection component should efficiently handle connecting to the Streaming API and retrieving responses, as well as reconnecting in the event of network failure, and hand-off statuses via an asynchronous queueing mechanism to application specific processing and persistence components. This component should be isolated from any subsequent downstream processing backlog or maintenance, otherwise queuing will occur in the Streaming API. Eventually your client will be disconnected, resulting in data loss.

 

For example, collect "raw" statuses (that is, not parsed or marshaled into your language's native object format) in one process, and pass each status into a queueing system, rotated flatfile, or database. In a second process, consume statuses from your queue or store of choice, parse them, extract the fields relevant to your application, etc. Consumers of high-volume streams should consider performing JSON and XML markup parsing in a parallel manner as the status volume is approaching the single processor throughput limit of some software stacks. End-to-end stress test your stack.
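
The decoupling described above can be illustrated with a collector that only enqueues raw statuses and a separate consumer that parses them. This is a single-process sketch using threads and an in-memory queue; the function names are hypothetical, and a production client would more likely use separate processes with a durable queue, rotated flatfile or database in between.

```python
import json
import queue
import threading
from typing import Iterable, List


def collect(raw_statuses: Iterable[str], out_queue: queue.Queue) -> None:
    """Collector: hand off raw statuses without parsing them."""
    for raw in raw_statuses:
        out_queue.put(raw)
    out_queue.put(None)  # sentinel: no more input


def process(in_queue: queue.Queue, results: List[str]) -> None:
    """Processor: parse each status and extract the fields of interest."""
    while True:
        raw = in_queue.get()
        if raw is None:
            break
        results.append(json.loads(raw)["text"])


def run_pipeline(raw_statuses: Iterable[str]) -> List[str]:
    # Unbounded queue, so a slow processor never blocks the collector.
    handoff = queue.Queue()
    results: List[str] = []
    worker = threading.Thread(target=process, args=(handoff, results))
    worker.start()
    collect(raw_statuses, handoff)
    worker.join()
    return results
```

Because the collector never waits on parsing, a processing backlog grows in the local queue rather than in the Streaming API, which is the failure mode the text warns about.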

 

Quality of Service

The Streaming API Quality of Service (QoS) is:

 

  •      Best-effort
  •      Unordered
  •      Generally at-least-once
  •      Currently in Alpha Test

 

This QoS implies that, on rare occasion and without notice, statuses may be missing from the delivered stream. During routine streaming, statuses will arrive in any order, but will tend to be k-sorted, where k is the number of tweets received in about 10 seconds. Upon client reconnect, and especially when using the count parameter, duplicate and non-k-sorted statuses may be delivered. Consuming applications must tolerate duplicate and out of order statuses and status deletions.
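
Tolerating duplicates can be as simple as remembering recently seen status ids. The sketch below keeps a bounded window of ids; the class name RecentIds and the choice of window size are assumptions, and the window should be sized to cover at least the reconnect lookback the client requests.

```python
from collections import OrderedDict


class RecentIds:
    """Bounded memory of recently seen status ids, for duplicate detection."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._seen = OrderedDict()

    def first_time(self, status_id: int) -> bool:
        """Return True if the id is new, False if it is a duplicate."""
        if status_id in self._seen:
            self._seen.move_to_end(status_id)  # refresh its position
            return False
        self._seen[status_id] = True
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict the oldest id
        return True
```

Out-of-order delivery is handled separately, for example by sorting on the status id or creation time in the downstream store rather than relying on arrival order.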

 

The Streaming API is in Alpha test. We are adding features and improving operational instrumentation on a continuous basis. Each deploy on our end will cause you to be disconnected, perhaps multiple times per day. Proper client coding will prevent data loss. You also must be willing to tolerate periods of inadvertent off-hours (PDT) downtime. We purposefully do not yet have Hosebird integrated with our operational monitoring systems. There is no 24x7 on-call response for service disruptions, other than to email support. To date we have had nearly 100% uptime, and we're working hard to continue this level of service.

 

Example

% curl http://stream.twitter.com/1/statuses/sample.json -uYOUR_TWITTER_USERNAME:YOUR_PASSWORD

 

 

Sampling

The statuses/sample feed is sampled from the Firehose stream of public statuses. The sampling algorithm is consistent; no additional data can be gleaned by consuming more than one sampled feed at a given access level or additional feeds at lower access levels. All feeds at a given access level are identical, and all lower access levels are strict subsets of higher access levels. Long-term consumption of duplicate data wastes limited resources and is generally discouraged.

 

The sampling algorithm, in conjunction with the statusID assignment algorithm, will tend to produce a random selection. A proportion of all statuses are selected, and the public statuses within that proportion are delivered. Therefore, as the rate of non-public statuses varies, the proportion of sampled statuses to total statuses will also vary. Additionally, the sampled rates exhibit the same strong diurnal and weekly patterns as the overall status creation rate.

 

Sampling proportions are subject to continuous unannounced refinement. Our goal is to provide useful low-latency streams without overwhelming clients or incurring excessive delivery cost.

 

Track Limiting

Reasonably focused track predicates will return all occurrences in the full Firehose stream of public statuses. Overly broad track predicates will cause the output to be periodically limited.  After the limitation has expired, all matching statuses will once again be delivered, along with a limit message that enumerates the statuses that have been limited from the stream. Limit messages are described in Parsing Responses.

 

Track streams with identical predicates will produce identical streams. Limiter periodicity is aligned with statuses/sample sampling periodicity; thus broad predicates will produce limited streams that will tend to be a subset of the statuses/sample streams. Creating multiple track queries to gather more statuses than are available in the sampled feeds is likely to be fruitless and may result in automatic banning.

 

Updating Filter Predicates

Updating track and follow predicate parameters with low latency and low data loss is possible, but currently requires a bit of programming effort on the client.

  • Reconnect only when you have a change and not on a fixed schedule. Keep change to an absolute minimum.
  • Upon a change, reconnect immediately if no changes have occurred for some time. For example, reconnect no more than twice every four minutes, or three times per six minutes, or some similar metric. Depending on your requirements and heuristics, many changes can then be applied with nearly no latency, while only some small proportion have to wait for an update window.
  • Connect with your new predicate, wait for the first response, then immediately disconnect the old connection. Keep the window where you are connected twice to an absolute minimum. Sometimes the Streaming API service will disconnect the old connection as it begins to feed the new connection. This step probably requires a multi-threaded development environment, or at the very least, inter-process-communication (IPC) of some sort. But, once this technique is working well, the lost tweets should be practically zero.
  • Reduce your loss window even further by using another user account with default access levels in parallel with your main account at higher access levels. Reconnect on the default access account for every change (within the time limits and other rules above). Every hour or so, reconnect the main stream with whatever deltas have accumulated from the default access stream. This keeps the majority of your feed at a low connection velocity and therefore low data loss, but allows low update latency. Please ensure that this secondary stream is always returning a disjoint result set so as not to waste bandwidth.
  • Higher access levels allow the follow predicate to be combined with the count parameter. At the expense of some minor latency, the resulting lookback will completely mask data loss resulting from a reasonable reconnection gap. The count parameter is not honored for track predicates. The count parameter is documented below.

 

If you do all this, you should be able to offer a low-latency user experience with nearly zero data loss.
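
The "no more than twice every four minutes" heuristic from the list above can be enforced with a small sliding-window limiter. The class name ReconnectLimiter is hypothetical, and the defaults simply mirror that example figure; tune them to your own requirements.

```python
from collections import deque


class ReconnectLimiter:
    """Allow at most max_reconnects predicate reconnects per time window."""

    def __init__(self, max_reconnects: int = 2, window_seconds: float = 240.0) -> None:
        self.max_reconnects = max_reconnects
        self.window_seconds = window_seconds
        self._times = deque()  # timestamps of recent reconnects

    def try_reconnect(self, now: float) -> bool:
        """Return True and record the reconnect if one is allowed at `now`."""
        # Forget reconnects that have aged out of the window.
        while self._times and now - self._times[0] >= self.window_seconds:
            self._times.popleft()
        if len(self._times) < self.max_reconnects:
            self._times.append(now)
            return True
        return False
```

When try_reconnect returns False, the client keeps accumulating predicate changes and applies them together at the next allowed reconnect, passing time.monotonic() as the clock in real use.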

 

By Language And Country 

With some effort, it should be feasible to build a stream representing the vast majority of statuses in a given non-English language or from many non-English speaking countries. The following suggested approach is untested conjecture, requires additional access and some non-trivial development effort.

 

  1. Investigate the newly proposed geo-location data fields and the existing self-reported location fields in statuses.
  2. Determine a set of stop words and other appropriate keywords. Use the track parameter to capture this stream. Track streams provide rate limiting feedback to allow aggressive query tuning right up to the stream's rate limit. Use an account with the restricted track role to allow a sufficient number of keywords and to also allow a larger proportion of statuses to pass should your keywords
  3. Use the results from #2 to determine the set of all targeted users. Heuristically rank them and follow the top 50,000 ids with the follow param on a "shadow" account. Once you throw out low value accounts, 50,000 accounts will probably constitute the vast majority of your target population by status volume. You can also salt your heuristic by the number of followers, available via the REST API, and with other signals of relevance. You will receive all statuses for these accounts, as there is no rate limiting on follow.
  4. Consume the Gardenhose to determine the completeness of the aggregate output of #2 and #3. You can use offline algorithms of arbitrary complexity on this result stream, and also leverage geo-location and self-reported location fields in statuses. Of course, feed back any newly found users into the user set in #3.
  5. Backfill critical missing data using the Search API. If an influential new account starts updating, it's OK not to detect it right away. When you find valuable accounts, you can get a week or so of historical data from the Search API. 

 

Methods

Methods are versioned to allow backwards compatibility. The current version level is 1.

 

statuses/filter

Returns public statuses that match one or more filter predicates. At least one predicate parameter, track or follow, must be specified. Both parameters may be specified, which allows most clients to use a single connection to the Streaming API. Placing long parameters in the URL may cause the request to be rejected for excessive URL length. Use a POST request header parameter to avoid long URLs.

 

The default access level allows up to 200 track keywords and 400 follow userids. Increased access levels allow 80,000 follow userids ("shadow" role), 400,000 follow userids ("birddog" role), 10,000 track keywords ("restricted track" role), and 200,000 track keywords ("partner track" role). Increased track access levels also pass a higher proportion of statuses before limiting the stream.

 

URL: http://stream.twitter.com/1/statuses/filter.format 

Formats: xml, json

Method(s): POST

Parameters: count, delimited, follow, track

Returns: stream of status elements

 

statuses/firehose

Returns all public statuses. The Firehose is not a generally available resource. Few applications require this level of access. Creative use of a combination of other resources and various access levels can satisfy nearly every application use case.

 

URL: http://stream.twitter.com/1/statuses/firehose.format

Formats: xml, json

Method(s): GET

Parameters: count, delimited

Returns: stream of status elements

 

statuses/retweet

Returns all retweets. The retweet stream is not a generally available resource. Few applications require this level of access. Creative use of a combination of other resources and various access levels can satisfy nearly every application use case.

 

URL: http://stream.twitter.com/1/statuses/retweet.format

Formats: xml, json

Method(s): GET

Parameters: delimited

Returns: stream of status elements

 

statuses/sample

Returns a random sample of all public statuses. The default access level provides a small proportion of the Firehose. The "Gardenhose" access level provides a larger proportion, more suitable for data mining and research applications that need a statistically significant sample.

 

URL: http://stream.twitter.com/1/statuses/sample.format

Formats: xml, json

Method(s): GET

Parameters: count, delimited

Returns: stream of status elements

 

 

Query Parameters

 

count

Indicates the number of previous statuses to consider for delivery before transitioning to live stream delivery. On unfiltered streams, all considered statuses are delivered, so the number requested is the number returned. On filtered streams, the number requested is the number of statuses that are applied to the filter predicate, and not the number of statuses returned.

 

Firehose, Retweet, Link, Birddog and Shadow clients interested in capturing all statuses should maintain a current estimate of the number of statuses received per second and note the time that the last status was received. Upon a reconnect, the client can then estimate the appropriate backlog to request. The count parameter is not allowed on other resources and the default filter role.

 

Values: -150,000 to 150,000. This range is subject to change on short notice. Positive values transition seamlessly to the live stream. Negative values terminate when the historical stream has finished, useful for debugging.

Methods: statuses/firehose, statuses/filter
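
The backlog estimate described above is a short calculation: the observed rate multiplied by the length of the gap, padded a little and capped at the maximum allowed value. A sketch, where the function name estimate_count and the safety_factor padding are assumptions, and the cap is the 150,000 upper bound quoted above:

```python
import math


def estimate_count(statuses_per_second: float,
                   last_received_at: float,
                   now: float,
                   safety_factor: float = 1.5,
                   max_count: int = 150000) -> int:
    """Estimate the count value to request when reconnecting.

    Times are in seconds on the same clock. safety_factor pads the
    estimate to allow for rate variation (an assumed heuristic).
    """
    gap_seconds = max(0.0, now - last_received_at)
    needed = math.ceil(statuses_per_second * gap_seconds * safety_factor)
    return min(needed, max_count)
```

For instance, at 50 statuses per second and a 30 second gap the padded estimate is 2,250 statuses. Over-requesting produces duplicates, so the consuming side still needs duplicate detection.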

 

delimited

Indicates that statuses should be delimited in the stream. Statuses are represented by a length, in bytes, a newline, and the status text that is exactly length bytes. Note that "keep-alive" newlines may be inserted before each length.

 

Values: length

Methods: (all)

Example: curl http://stream.twitter.com/1/statuses/sample.xml\?delimited=length -uAnyTwitterUser:Password

 

follow

Returns public statuses that reference the given set of users. Users are specified by a comma-separated list of user IDs.

 

References matched are statuses that were:

  • Created by a specified user
  • Explicitly in-reply-to a status created by a specified user (pressed reply "swoosh" button)
  • Explicitly retweeted by a specified user (pressed retweet button)
  • Created by a specified user and subsequently explicitly retweeted by any user

 

References unmatched are statuses that were:

  • Mentions ("Hello @user!")
  • Implicit replies ("@user Hello!", created without pressing a reply "swoosh" button to set the in_reply_to field)
  • Implicit retweets ("RT @user Says Helloes" without pressing a retweet button)

 

Values: user IDs (integers), separated by commas

Methods: statuses/filter

Example: Create a file called 'following' that contains, exactly and excluding the quotation marks: "follow=12,13,15,16,20,87". Execute: curl -d @following http://stream.twitter.com/1/statuses/filter.json -uAnyTwitterUser:Password. You will receive JSON updates from Jack, Biz, Crystal, Ev and Krissy, but not from Jeremy, as he's a protected user.

 

track

Specifies keywords to track. Keywords are specified by a comma-separated list. Queries are subject to Track Limitations, described in Track Limiting, and to access roles, described in the statuses/filter method. Track keywords are case-insensitive logical ORs. Terms are exact-matched, and also exact-matched ignoring punctuation. Phrases, that is, keywords with spaces, are not supported. Keywords containing punctuation will only exact match tokens. Some UTF-8 keywords will not match correctly; this is a known temporary defect.

 

Track examples: The keyword Twitter will match all public statuses with the following comma delimited tokens in their text field: TWITTER, twitter, "Twitter", twitter., #twitter and @twitter. The following tokens will not be matched: TwitterTracker and http://www.twitter.com. The phrase, excluding quotes, "hard alee" won't match anything. The keyword "helm's-alee" will match helm's-alee but not #helm's-alee.

 

Values: Strings separated by commas. Each string must be between 1 and 30 bytes, inclusive.

Methods: statuses/filter

Example: Create a file called 'tracking' that contains, exactly and excluding the quotation marks: "track=basketball,football,baseball,footy,soccer". Execute: curl -d @tracking http://stream.twitter.com/1/statuses/filter.json -uAnyTwitterUser:Password. You will receive JSON updates about various crucial sportsball topics and events.
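
Checking predicates on the client before connecting avoids burning connection attempts on 406 and 413 responses. The sketch below builds the POST body for statuses/filter; the function name build_filter_body is hypothetical, and the default limits of 200 track keywords and 400 follow userids are the default access level figures quoted in the statuses/filter method description.

```python
from typing import Iterable
from urllib.parse import urlencode


def build_filter_body(track: Iterable[str] = (),
                      follow: Iterable[int] = (),
                      max_track: int = 200,
                      max_follow: int = 400) -> str:
    """Validate predicates and return a form-encoded POST body."""
    track = list(track)
    follow = [int(user_id) for user_id in follow]
    if not track and not follow:
        raise ValueError("at least one of track or follow is required")
    if len(track) > max_track:
        raise ValueError("too many track keywords for this access level")
    if len(follow) > max_follow:
        raise ValueError("too many follow userids for this access level")
    for keyword in track:
        size = len(keyword.encode("utf-8"))
        if not 1 <= size <= 30:
            raise ValueError(f"keyword must be 1 to 30 bytes: {keyword!r}")
        if " " in keyword or "," in keyword:
            raise ValueError(f"phrases and commas are not supported: {keyword!r}")
    params = {}
    if follow:
        params["follow"] = ",".join(str(user_id) for user_id in follow)
    if track:
        params["track"] = ",".join(track)
    # Keep commas literal so the body matches the curl examples.
    return urlencode(params, safe=",")
```

The returned string has the same shape as the contents of the 'following' and 'tracking' files in the curl examples, so it can be sent as the body of the POST request.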

 

HTTP Response Codes

 

Most error codes are returned with a string with additional details. For all codes greater than 200, clients should wait before attempting another connection. See Connecting section, above.

 

200 Success

401 Unauthorized

  • HTTP authentication failed due to either a non-existent username or an incorrect password.
  • User authenticated properly but is not in a required role for this resource; contact the API team for appropriate access.

404 Unknown

  • Resource does not exist.

406 Not Acceptable

  • Parameter not allowed for resource, for example, track parameter specified on a sampled resource.
  • Track keyword too long or too short.
  • No predicates defined for filtered resource, for example, neither track nor follow parameter defined.
  • Follow userid unparseable.

413 Too Long

  • A parameter list is too long, for example, track or follow parameter string too long.
  • Too many track tokens specified for role; contact API team for increased access.
  • Too many follow userids specified for role; contact API team for increased access.

416 Range Unacceptable

  • Count parameter is not allowed in role.
  • Count parameter value is too large.

500 Server Internal Error

  • Should not occur, contact API team.

503 Service Overloaded

  • Should not occur, contact API team.

 

Pre-Launch Checklist

  1. Creating the minimal number of connections?
  2. Avoiding duplicate logins?
  3. Backing off from failures: none for first disconnect, seconds for repeated network (TCP/IP) level issues, minutes for repeated HTTP errors (4XX codes)?
  4. Using long-lived connections?
  5. Tolerant of other objects and newlines in markup stream? (Non <status> objects...)
  6. Not purposefully attempting to circumvent access limits and levels?

 
