Yahoo! Query Language thoughts

I recently sat down and read through the docs for Yahoo’s Y!OS, a collection of services and APIs that let developers integrate their web applications with Yahoo.

Since I work on the Google App Engine datastore, one part of Y!OS that immediately caught my eye was the Yahoo! Query Language, or YQL. It’s a SQL dialect and query processing engine that operates on data exposed by web services.

Here’s a brief description, along with my thoughts. See the original post for my thoughts on the rest of Y!OS.

Contents:

Introduction (and conclusion)
Related work
Data model
Filtering
REST interface

Introduction (and conclusion)

As mentioned, YQL is a query language based on SQL. It serves primarily as a unified interface to Yahoo’s various web properties, such as Flickr, Yahoo! Maps, and Upcoming. The query engine itself is hosted by Yahoo; users make queries by sending HTTP requests to a REST endpoint.

Currently, YQL is not part of a cloud database. It’s strictly a hosted service for query processing. On the plus side, this means that YQL isn’t limited to a single data source. It’s not even limited to Yahoo’s own properties. YQL can operate on any third party data source, as long as it’s in one of the usual suspect formats: RSS, ATOM, JSON, XML, etc.

The drawback of YQL’s generic, bring-your-own-data model is that it must often do full table scans and process data in memory. Data sources can only offer shallow, rudimentary indexed queries themselves: only equality filters, ie no range or bounding box queries, only on hard-coded parameters, and subject to a 4 second deadline. Worse, it’s not clear that indexed queries are supported on third party data sources at all.

All other operations are done in memory, subject to a 30 second processing deadline. This limits those operations to minor post-processing and cosmetic tweaks, which rules out many interesting use cases on even modest size data sets.

Still, in line with the rest of Yahoo’s philosophy of developer APIs and web services, YQL is straightforward and practical. At a high level, YQL is to XPath and XQuery what JSON is to XML and what XML-RPC was (and REST now is) to SOAP.

Related work

YQL is similar to a number of modern query languages, including XPath/XQuery, Microsoft’s LINQ, Google’s GQL, Facebook’s FQL, and the query interfaces for Amazon SimpleDB and CouchDB.

Since YQL doesn’t actually store data, it’s not really a competitor to the likes of SimpleDB, Azure tables, or the Google App Engine datastore. Even so, if Yahoo does offer a cloud database in the future, it’s a good bet that it will support YQL.

YQL is notable in that it supports third party data sources. XQuery/XPath does too, but YQL’s remote filters allow it to use more efficient query mechanisms exposed by those sources, which XQuery/XPath can’t do without help. LINQ supports third party data sources, but only somewhat, and they must install a LINQ driver first.

Data model

Data is interpreted and structured hierarchically, similar to XML. YQL uses a compiled form of XML internally to process data. Non-XML data sources such as JSON are converted to this format before processing.

At a high level, YQL uses the familiar relational database model of tables and rows. However, to map this onto hierarchical data, YQL needs to know which level of the hierarchy represents a row. This is built in for native YQL data sources such as Flickr and Upcoming, and can be determined automatically for some external data sources. For others, though, the developer must specify it manually.

YQL supports queries on schema metadata as well as data. SHOW tables lists the tables provided by a data source, and DESC <table> describes a given table’s hierarchical structure, fields, and input parameters, if any.

Filtering

YQL’s syntax is similar to SQL:

SELECT fields FROM table
[WHERE filter [AND filter] ...]
[function] [function] ...

Notably, unlike most of the related projects, YQL also supports sub-selects:

SELECT fields FROM table
WHERE field IN
  (SELECT fields FROM other_table ...)

This is particularly interesting because sub-selects may query different tables and data sources than the top-level query. This makes YQL a powerful tool for mashups.

Otherwise, this is all familiar functionality. I’ll focus on how the query planner implements it on YQL’s hierarchical, unindexed data.

Local filters

YQL supports two types of query filters: local and remote. Local filters are handled in memory on the YQL server, during query processing. They support all of the standard filters operators, including equality, inequality, IN for set membership, LIKE for string pattern matching, and others.

One interesting feature of local filters is that they can traverse the entire data model, using dot notation to access inner fields, e.g. WHERE person.address.zip = 90000.

Remote filters

Remote filters are handled by the data source, and may only use the equality operator. The fields that support remote filtering for a given table are called input parameters. Examples include event zip code on Upcoming and album id on Flickr. Data sources may require queries to provide filter values for some or all input parameters.

Input parameters may overlap with fields in the table’s rows, but they’re not required to. That means that input parameters aren’t always available for projection or local filtering.

Remote filters are notable because they’re the only way to do efficient, indexed queries. All local filters, post-query functions, and other operations are evaluated after the initial results are returned by the data source. This means that remote filters cannot be used for any heavy lifting that impacts scalability.

Along with filters, YQL can pass paging parameters to the remote data source. These are similar to LIMIT and OFFSET, except they’re evaluated at the data source, after the remote filters are applied but before the local filters and post-query functions. This means that they’ll usually be efficient, as opposed to LIMIT and OFFSET, which must calculate the result set and then slice it in memory.

If paging parameters aren’t specified, YQL defers to the provider’s defaults. This means that queries will only include results from the first page returned by the data source. YQL lets you disable this behavior and fetch as much data as you need, but still, this will likely be a source of confusion for developers.

Post-query functions

YQL provides a number of functions for modifying the result set once it’s been determined by the remote and local filters. They include sort, tail, truncate, reverse, and unique.

Like local filters, these are applied in memory. In particular, sort only sorts the result set itself, not the entire data source. This is a necessary limitation, but it’s still unfortunate, and will almost certainly be another source of confusion.

YQL also supports slicing via the standard LIMIT and OFFSET keywords, which are applied in memory. Strangely, it seems like they do the same thing as truncate and tail. It’s unclear why YQL has both.

REST interface

YQL is exposed on a single REST endpoint, query.yahooapis.com/v1/yql. It supports both GET and POST. Queries are specified either in a query parameter or in the request body, respectively. It looks like both queries and responses are encoded as UTF-8, but I’m not sure.

The response body is always an XML envelope. The results themselves are encoded in either XML or JSON, depending on what the developer asked for, and enclosed in the envelope. JSON is less expressive than XML – it doesn’t have attributes, CDATA, namespaces, etc. – so converting results to JSON may be lossy. In practice, though, I doubt many users will notice or care.

Developers may sign requests via Yahoo’s OAuth service to access private data or use higher, per-account quotas.

Finally, the v1 in the endpoint URL implies support for versioning. Yahoo will probably handle backward incompatible changes to YQL by adding a new endpoint with the next version number.

4 thoughts on “Yahoo! Query Language thoughts”

Sam Pullara

February 9, 2009

Great writeup!

The biggest difference between LIMIT and truncate() is that you can put truncate() at the end of a series of other pipes. Implementation wise they are essentially identical. For example:

select * from search.web(100) where query=’yql’ | sort(field=’date’) | truncate(count=1)

Which will find the oldest search result in the first 100 results.

Sam

Mentions

💬 Ryan Barrett
💬 greut
💬 gst

snarfed.org

Ryan Barrett's blog