Yahoo! Query Language thoughts
[](http://developer.yahoo.com/yql/)
I recently sat down and read through the docs for Yahoo's
[Y!OS](http://developer.yahoo.com/yos/intro/), a collection of services and
APIs that let developers integrate their web applications with Yahoo.
Since I work on the [Google App Engine
datastore](http://code.google.com/appengine/docs/datastore/overview.html), one
part of Y!OS that immediately caught my eye was the
[Yahoo! Query Language](http://developer.yahoo.com/yql/), or YQL. It's a
SQL dialect and query processing engine that operates on data exposed by web
services.
Here's a brief description, along with my thoughts. See the
[original post](/space/2008-12-16-yos_yahoo_open_strategy) for my thoughts on
the rest of Y!OS.
**Contents:**
* [Introduction (and conclusion)](#intro)
* [Related work](#related)
* [Data model](#data_model)
* [Filtering](#filtering)
* [Local filters](#local)
* [Remote filters](#remote)
* [Post-query functions](#post_query)
* [REST interface](#rest)
### Introduction (and conclusion) [](#intro)
As mentioned, YQL is a query language based on SQL. It serves primarily as a
unified interface to Yahoo's various web properties, such as
[Flickr](http://flickr.com/), [Yahoo! Maps](http://maps.yahoo.com/), and
[Upcoming](http://upcoming.yahoo.com/). The query engine itself is hosted by
Yahoo; users make queries by sending HTTP requests to a
[REST endpoint](#rest).
Currently, YQL is _not_ part of a cloud database. It's strictly a hosted
service for query processing. On the plus side, this means that YQL isn't
limited to a single data source. It's not even limited to Yahoo's own
properties. YQL can operate on any third party data source, as long as it's in
one of the usual suspect formats: RSS, ATOM, JSON, XML, etc.
The drawback of YQL's generic, bring-your-own-data model is that it
must often do full table scans and
[process data in memory](#local).
Data sources can only offer shallow, rudimentary indexed queries themselves:
only equality filters, ie no range or bounding box queries, only on
hard-coded parameters, and subject to a
[4 second deadline](http://developer.yahoo.com/yql/guide/query-parameters.html).
Worse, it's not clear that indexed queries are supported on third party data
sources at all.
All other operations are [done in memory](#local),
subject to a
[30 second processing deadline](http://developer.yahoo.com/yql/guide/query-parameters.html).
This limits those operations to minor post-processing and cosmetic tweaks,
which rules out many interesting use cases on even modest size data sets.
Still, in line with the rest of Yahoo's philosophy of developer APIs and web
services, YQL is straightforward and practical. At a high level, YQL is to
[XPath](http://www.w3.org/TR/xpath) and [XQuery](http://www.w3.org/TR/xquery/)
what JSON is to XML and what XML-RPC was (and REST now is) to SOAP.
### Related work [](#related)
YQL is similar to a number of modern query languages, including
XPath/XQuery,
Microsoft's [LINQ](http://en.wikipedia.org/wiki/Language_Integrated_Query),
Google's [GQL](http://code.google.com/appengine/docs/datastore/gqlreference.html),
Facebook's [FQL](/space/facebook+data+store+api+thoughts), and the
query interfaces for [Amazon SimpleDB](/space/amazon+simpledb+thoughts) and
[CouchDB](http://couchdb.org/).
Since YQL doesn't actually store data, it's not really a competitor to the
likes of SimpleDB,
[Azure tables](/space/windows+azure+details#Tables), or the
[Google App Engine datastore](http://code.google.com/appengine/docs/datastore/).
Even so, if Yahoo does offer a cloud database in the future, it's a
good bet that it will support YQL.
YQL is notable in that it supports third party data sources. XQuery/XPath does
too, but YQL's [remote filters](#remote) allow
it to use more efficient query mechanisms exposed by those sources, which
XQuery/XPath can't do without help. LINQ supports third party data sources,
but only somewhat, and they must install a LINQ driver first.
### Data model [](#data_model)
Data is [interpreted and structured
hierarchically](http://developer.yahoo.com/yql/guide/language-chapt.html),
similar to XML. YQL uses a compiled form of XML internally to process data.
Non-XML data sources such as JSON are converted to this format before
processing.
At a high level, YQL uses the familiar relational database model of tables and
rows. However, to map this onto hierarchical data, YQL needs to know which
level of the hierarchy represents a row. This is built in for native YQL data
sources such as Flickr and Upcoming, and can be determined automatically
for some external data sources. For others, though, the developer must specify
it manually.
YQL supports queries on schema metadata as well as data. `SHOW tables` lists
the tables provided by a data source, and `DESC
` describes a given
table's hierarchical structure, fields, and
[input parameters](#remote), if any.
### Filtering [](#filtering)
YQL's syntax is similar to SQL:
SELECT fields FROM table
[WHERE filter [AND filter] ...]
[function] [function] ...
Notably, unlike most of the
[related projects](#related),
YQL also supports sub-selects:
SELECT fields FROM table
WHERE field IN
(SELECT fields FROM other_table ...)
This is particularly interesting because sub-selects may query different
tables and data sources than the top-level query. This makes YQL a powerful
tool for mashups.
Otherwise, this is all familiar functionality. I'll focus on how the query
planner implements it on YQL's hierarchical, unindexed data.
#### Local filters [](#local)
YQL supports
[two types of query filters](http://developer.yahoo.com/yql/guide/select-from-where-statements.html):
_local_ and _remote_. Local filters are handled in memory on the YQL server,
during query processing. They support all of the standard filters operators,
including equality, inequality, `IN` for set membership, `LIKE` for string
pattern matching, and others.
One interesting feature of local filters is that they can traverse the entire
data model, using dot notation to access inner fields, e.g.
`WHERE person.address.zip = 90000`.
#### Remote filters [](#remote)
Remote filters are handled by the data source, and may only use the equality
operator. The fields that support remote filtering for a given table are called
_input parameters_. Examples include event zip code on Upcoming and album id on
Flickr. Data sources may require queries to provide filter values for some or
all input parameters.
Input parameters may overlap with fields in the table's rows, but they're not
required to. That means that input parameters aren't always available for
projection or local filtering.
Remote filters are notable because they're the _only_ way to do efficient,
indexed queries. All [local filters](#local),
[post-query functions](#post_query),
and other operations are evaluated after the initial results are returned by
the data source. This means that remote filters cannot be used for any heavy
lifting that impacts scalability.
Along with filters, YQL can pass
[paging parameters](http://developer.yahoo.com/yql/guide/limit-offset-statements.html)
to the remote data source. These are similar to
[`LIMIT` and `OFFSET`](#post_query), except
they're evaluated at the data source, after the remote filters are applied but
before the [local filters](#local) and
[post-query functions](#post_query). This means
that they'll usually be efficient, as opposed to `LIMIT` and `OFFSET`, which
must calculate the result set and then slice it in memory.
If paging parameters aren't specified, YQL defers to the provider's defaults.
This means that queries will only include results from the first page returned
by the data source. YQL lets you disable this behavior and fetch as much data as
you need, but still, this will likely be a source of confusion for developers.
#### Post-query functions [](#post_query)
YQL provides a number of functions for modifying the result set once it's been
determined by the [remote](#remote) and
[local](#local) filters. They include `sort`,
`tail`, `truncate`, `reverse`, and `unique`.
Like [local filters](#local), these are applied
in memory. In particular, `sort` only sorts the result set itself, _not_ the
entire data source. This is a necessary limitation, but it's still
unfortunate, and will almost certainly be another source of confusion.
YQL also supports slicing via the standard `LIMIT` and `OFFSET` keywords,
which are applied in memory. Strangely, it seems like they do the same thing
as `truncate` and `tail`. It's unclear why YQL has both.
### REST interface [](#rest)
YQL is exposed on a single REST endpoint,
[`query.yahooapis.com/v1/yql`](http://developer.yahoo.com/yql/guide/query-parameters.html).
It supports both `GET` and `POST`. Queries are specified either in a query
parameter or in the request body, respectively. It looks like both queries and
responses are encoded as UTF-8, but I'm not sure.
The [response body](http://developer.yahoo.com/yql/guide/result-structure.html)
is always an XML envelope. The results themselves are encoded in either XML or
JSON, depending on what the developer asked for, and enclosed in the envelope.
JSON is less expressive than XML - it doesn't have attributes, CDATA,
namespaces, etc. - so converting results to JSON may be lossy. In practice,
though, I doubt many users will notice or care.
Developers may
[sign requests](http://developer.yahoo.com/yql/guide/authorization-access.html)
via Yahoo's [OAuth](http://oauth.org/) service to access private data or use
higher, per-account quotas.
Finally, the `v1` in the endpoint URL implies support for versioning.
Yahoo will probably handle backward incompatible changes to YQL by adding a
new endpoint with the next version number.
**See also:**
* [Windows Azure impressions](/space/windows+azure+impressions)
* [Windows Azure details](/space/windows+azure+details)
* [Windows Azure: Overview and Impressions](/space/azure_talk.html) (slides)
* [Facebook Data Store API Thoughts](/space/facebook+data+store+api+thoughts)
* [Amazon SimpleDB thoughts](/space/amazon+simpledb+thoughts)
* [Google App Engine launched](/space/2008-04-07_google_app_engine_launched)
* [Under the Covers of the App Engine datastore](/space/datastore_talk.html) (slides)
* [Yahoo! Open Strategy thoughts](/space/2008-12-16-yos_yahoo_open_strategy)