Yahoo! Query Language thoughts [![yahoo logo](/space/yahoo_logo.gif)](http://developer.yahoo.com/yql/) I recently sat down and read through the docs for Yahoo's [Y!OS](http://developer.yahoo.com/yos/intro/), a collection of services and APIs that let developers integrate their web applications with Yahoo. Since I work on the [Google App Engine datastore](http://code.google.com/appengine/docs/datastore/overview.html), one part of Y!OS that immediately caught my eye was the [Yahoo! Query Language](http://developer.yahoo.com/yql/), or YQL. It's a SQL dialect and query processing engine that operates on data exposed by web services. Here's a brief description, along with my thoughts. See the [original post](/space/2008-12-16-yos_yahoo_open_strategy) for my thoughts on the rest of Y!OS. **Contents:** * [Introduction (and conclusion)](#intro) * [Related work](#related) * [Data model](#data_model) * [Filtering](#filtering) * [Local filters](#local) * [Remote filters](#remote) * [Post-query functions](#post_query) * [REST interface](#rest) ### Introduction (and conclusion) [![permalink](/Icon-Permalink.png)](#intro) As mentioned, YQL is a query language based on SQL. It serves primarily as a unified interface to Yahoo's various web properties, such as [Flickr](http://flickr.com/), [Yahoo! Maps](http://maps.yahoo.com/), and [Upcoming](http://upcoming.yahoo.com/). The query engine itself is hosted by Yahoo; users make queries by sending HTTP requests to a [REST endpoint](#rest). Currently, YQL is _not_ part of a cloud database. It's strictly a hosted service for query processing. On the plus side, this means that YQL isn't limited to a single data source. It's not even limited to Yahoo's own properties. YQL can operate on any third party data source, as long as it's in one of the usual suspect formats: RSS, ATOM, JSON, XML, etc. The drawback of YQL's generic, bring-your-own-data model is that it must often do full table scans and [process data in memory](#local). Data sources can only offer shallow, rudimentary indexed queries themselves: only equality filters, ie no range or bounding box queries, only on hard-coded parameters, and subject to a [4 second deadline](http://developer.yahoo.com/yql/guide/query-parameters.html). Worse, it's not clear that indexed queries are supported on third party data sources at all. All other operations are [done in memory](#local), subject to a [30 second processing deadline](http://developer.yahoo.com/yql/guide/query-parameters.html). This limits those operations to minor post-processing and cosmetic tweaks, which rules out many interesting use cases on even modest size data sets. Still, in line with the rest of Yahoo's philosophy of developer APIs and web services, YQL is straightforward and practical. At a high level, YQL is to [XPath](http://www.w3.org/TR/xpath) and [XQuery](http://www.w3.org/TR/xquery/) what JSON is to XML and what XML-RPC was (and REST now is) to SOAP. ### Related work [![permalink](/Icon-Permalink.png)](#related) YQL is similar to a number of modern query languages, including XPath/XQuery, Microsoft's [LINQ](http://en.wikipedia.org/wiki/Language_Integrated_Query), Google's [GQL](http://code.google.com/appengine/docs/datastore/gqlreference.html), Facebook's [FQL](/space/facebook+data+store+api+thoughts), and the query interfaces for [Amazon SimpleDB](/space/amazon+simpledb+thoughts) and [CouchDB](http://couchdb.org/). Since YQL doesn't actually store data, it's not really a competitor to the likes of SimpleDB, [Azure tables](/space/windows+azure+details#Tables), or the [Google App Engine datastore](http://code.google.com/appengine/docs/datastore/). Even so, if Yahoo does offer a cloud database in the future, it's a good bet that it will support YQL. YQL is notable in that it supports third party data sources. XQuery/XPath does too, but YQL's [remote filters](#remote) allow it to use more efficient query mechanisms exposed by those sources, which XQuery/XPath can't do without help. LINQ supports third party data sources, but only somewhat, and they must install a LINQ driver first. ### Data model [![permalink](/Icon-Permalink.png)](#data_model) Data is [interpreted and structured hierarchically](http://developer.yahoo.com/yql/guide/language-chapt.html), similar to XML. YQL uses a compiled form of XML internally to process data. Non-XML data sources such as JSON are converted to this format before processing. At a high level, YQL uses the familiar relational database model of tables and rows. However, to map this onto hierarchical data, YQL needs to know which level of the hierarchy represents a row. This is built in for native YQL data sources such as Flickr and Upcoming, and can be determined automatically for some external data sources. For others, though, the developer must specify it manually. YQL supports queries on schema metadata as well as data. `SHOW tables` lists the tables provided by a data source, and `DESC ` describes a given table's hierarchical structure, fields, and [input parameters](#remote), if any. ### Filtering [![permalink](/Icon-Permalink.png)](#filtering) YQL's syntax is similar to SQL: SELECT fields FROM table [WHERE filter [AND filter] ...] [function] [function] ... Notably, unlike most of the [related projects](#related), YQL also supports sub-selects: SELECT fields FROM table WHERE field IN (SELECT fields FROM other_table ...) This is particularly interesting because sub-selects may query different tables and data sources than the top-level query. This makes YQL a powerful tool for mashups. Otherwise, this is all familiar functionality. I'll focus on how the query planner implements it on YQL's hierarchical, unindexed data. #### Local filters [![permalink](/Icon-Permalink.png)](#local) YQL supports [two types of query filters](http://developer.yahoo.com/yql/guide/select-from-where-statements.html): _local_ and _remote_. Local filters are handled in memory on the YQL server, during query processing. They support all of the standard filters operators, including equality, inequality, `IN` for set membership, `LIKE` for string pattern matching, and others. One interesting feature of local filters is that they can traverse the entire data model, using dot notation to access inner fields, e.g. `WHERE person.address.zip = 90000`. #### Remote filters [![permalink](/Icon-Permalink.png)](#remote) Remote filters are handled by the data source, and may only use the equality operator. The fields that support remote filtering for a given table are called _input parameters_. Examples include event zip code on Upcoming and album id on Flickr. Data sources may require queries to provide filter values for some or all input parameters. Input parameters may overlap with fields in the table's rows, but they're not required to. That means that input parameters aren't always available for projection or local filtering. Remote filters are notable because they're the _only_ way to do efficient, indexed queries. All [local filters](#local), [post-query functions](#post_query), and other operations are evaluated after the initial results are returned by the data source. This means that remote filters cannot be used for any heavy lifting that impacts scalability. Along with filters, YQL can pass [paging parameters](http://developer.yahoo.com/yql/guide/limit-offset-statements.html) to the remote data source. These are similar to [`LIMIT` and `OFFSET`](#post_query), except they're evaluated at the data source, after the remote filters are applied but before the [local filters](#local) and [post-query functions](#post_query). This means that they'll usually be efficient, as opposed to `LIMIT` and `OFFSET`, which must calculate the result set and then slice it in memory. If paging parameters aren't specified, YQL defers to the provider's defaults. This means that queries will only include results from the first page returned by the data source. YQL lets you disable this behavior and fetch as much data as you need, but still, this will likely be a source of confusion for developers. #### Post-query functions [![permalink](/Icon-Permalink.png)](#post_query) YQL provides a number of functions for modifying the result set once it's been determined by the [remote](#remote) and [local](#local) filters. They include `sort`, `tail`, `truncate`, `reverse`, and `unique`. Like [local filters](#local), these are applied in memory. In particular, `sort` only sorts the result set itself, _not_ the entire data source. This is a necessary limitation, but it's still unfortunate, and will almost certainly be another source of confusion. YQL also supports slicing via the standard `LIMIT` and `OFFSET` keywords, which are applied in memory. Strangely, it seems like they do the same thing as `truncate` and `tail`. It's unclear why YQL has both. ### REST interface [![permalink](/Icon-Permalink.png)](#rest) YQL is exposed on a single REST endpoint, [`query.yahooapis.com/v1/yql`](http://developer.yahoo.com/yql/guide/query-parameters.html). It supports both `GET` and `POST`. Queries are specified either in a query parameter or in the request body, respectively. It looks like both queries and responses are encoded as UTF-8, but I'm not sure. The [response body](http://developer.yahoo.com/yql/guide/result-structure.html) is always an XML envelope. The results themselves are encoded in either XML or JSON, depending on what the developer asked for, and enclosed in the envelope. JSON is less expressive than XML - it doesn't have attributes, CDATA, namespaces, etc. - so converting results to JSON may be lossy. In practice, though, I doubt many users will notice or care. Developers may [sign requests](http://developer.yahoo.com/yql/guide/authorization-access.html) via Yahoo's [OAuth](http://oauth.org/) service to access private data or use higher, per-account quotas. Finally, the `v1` in the endpoint URL implies support for versioning. Yahoo will probably handle backward incompatible changes to YQL by adding a new endpoint with the next version number. **See also:** * [Windows Azure impressions](/space/windows+azure+impressions) * [Windows Azure details](/space/windows+azure+details) * [Windows Azure: Overview and Impressions](/space/azure_talk.html) (slides) * [Facebook Data Store API Thoughts](/space/facebook+data+store+api+thoughts) * [Amazon SimpleDB thoughts](/space/amazon+simpledb+thoughts) * [Google App Engine launched](/space/2008-04-07_google_app_engine_launched) * [Under the Covers of the App Engine datastore](/space/datastore_talk.html) (slides) * [Yahoo! Open Strategy thoughts](/space/2008-12-16-yos_yahoo_open_strategy)