
I recently sat down and read through the docs for Yahoo’s Y!OS, a collection of services and APIs that let developers integrate their web applications with Yahoo.
Since I work on the Google App Engine datastore, one part of Y!OS that immediately caught my eye was the Yahoo! Query Language, or YQL. It’s a SQL dialect and query processing engine that operates on data exposed by web services.
Here’s a brief description, along with my thoughts. See the original post for my thoughts on the rest of Y!OS.
Contents:
Introduction (and conclusion) 
As mentioned, YQL is a query language based on SQL. It serves primarily as a unified interface to Yahoo’s various web properties, such as Flickr, Yahoo! Maps, and Upcoming. The query engine itself is hosted by Yahoo; users make queries by sending HTTP requests to a REST endpoint.
Currently, YQL is not part of a cloud database. It’s strictly a hosted service for query processing. On the plus side, this means that YQL isn’t limited to a single data source. It’s not even limited to Yahoo’s own properties. YQL can operate on any third party data source, as long as it’s in one of the usual suspect formats: RSS, ATOM, JSON, XML, etc.
The drawback of YQL’s generic, bring-your-own-data model is that it must often do full table scans and process data in memory. Data sources can only offer shallow, rudimentary indexed queries themselves: only equality filters, ie no range or bounding box queries, only on hard-coded parameters, and subject to a 4 second deadline. Worse, it’s not clear that indexed queries are supported on third party data sources at all.
All other operations are done in memory, subject to a 30 second processing deadline. This limits those operations to minor post-processing and cosmetic tweaks, which rules out many interesting use cases on even modest size data sets.
Still, in line with the rest of Yahoo’s philosophy of developer APIs and web services, YQL is straightforward and practical. At a high level, YQL is to XPath and XQuery what JSON is to XML and what XML-RPC was (and REST now is) to SOAP.
Related work 
YQL is similar to a number of modern query languages, including XPath/XQuery, Microsoft’s LINQ, Google’s GQL, Facebook’s FQL, and the query interfaces for Amazon SimpleDB and CouchDB.
Since YQL doesn’t actually store data, it’s not really a competitor to the likes of SimpleDB, Azure tables, or the Google App Engine datastore. Even so, if Yahoo does offer a cloud database in the future, it’s a good bet that it will support YQL.
YQL is notable in that it supports third party data sources. XQuery/XPath does too, but YQL’s remote filters allow it to use more efficient query mechanisms exposed by those sources, which XQuery/XPath can’t do without help. LINQ supports third party data sources, but only somewhat, and they must install a LINQ driver first.
Data model 
Data is interpreted and structured hierarchically, similar to XML. YQL uses a compiled form of XML internally to process data. Non-XML data sources such as JSON are converted to this format before processing.
At a high level, YQL uses the familiar relational database model of tables and rows. However, to map this onto hierarchical data, YQL needs to know which level of the hierarchy represents a row. This is built in for native YQL data sources such as Flickr and Upcoming, and can be determined automatically for some external data sources. For others, though, the developer must specify it manually.
YQL supports queries on schema metadata as well as data. SHOW tables
lists
the tables provided by a data source, and DESC <table>
describes a given
table’s hierarchical structure, fields, and
input parameters, if any.
Filtering 
YQL’s syntax is similar to SQL:
SELECT fields FROM table
[WHERE filter [AND filter] ...]
[function] [function] ...
Notably, unlike most of the related projects, YQL also supports sub-selects:
SELECT fields FROM table
WHERE field IN
(SELECT fields FROM other_table ...)
This is particularly interesting because sub-selects may query different tables and data sources than the top-level query. This makes YQL a powerful tool for mashups.
Otherwise, this is all familiar functionality. I’ll focus on how the query planner implements it on YQL’s hierarchical, unindexed data.
Local filters 
YQL supports
two types of query filters:
local and remote. Local filters are handled in memory on the YQL server,
during query processing. They support all of the standard filters operators,
including equality, inequality, IN
for set membership, LIKE
for string
pattern matching, and others.
One interesting feature of local filters is that they can traverse the entire
data model, using dot notation to access inner fields, e.g.
WHERE person.address.zip = 90000
.
Remote filters 
Remote filters are handled by the data source, and may only use the equality operator. The fields that support remote filtering for a given table are called input parameters. Examples include event zip code on Upcoming and album id on Flickr. Data sources may require queries to provide filter values for some or all input parameters.
Input parameters may overlap with fields in the table’s rows, but they’re not required to. That means that input parameters aren’t always available for projection or local filtering.
Remote filters are notable because they’re the only way to do efficient, indexed queries. All local filters, post-query functions, and other operations are evaluated after the initial results are returned by the data source. This means that remote filters cannot be used for any heavy lifting that impacts scalability.
Along with filters, YQL can pass
paging parameters
to the remote data source. These are similar to
LIMIT
and OFFSET
, except
they’re evaluated at the data source, after the remote filters are applied but
before the local filters and
post-query functions. This means
that they’ll usually be efficient, as opposed to LIMIT
and OFFSET
, which
must calculate the result set and then slice it in memory.
If paging parameters aren’t specified, YQL defers to the provider’s defaults. This means that queries will only include results from the first page returned by the data source. YQL lets you disable this behavior and fetch as much data as you need, but still, this will likely be a source of confusion for developers.
Post-query functions 
YQL provides a number of functions for modifying the result set once it’s been
determined by the remote and
local filters. They include sort
,
tail
, truncate
, reverse
, and unique
.
Like local filters, these are applied
in memory. In particular, sort
only sorts the result set itself, not the
entire data source. This is a necessary limitation, but it’s still
unfortunate, and will almost certainly be another source of confusion.
YQL also supports slicing via the standard LIMIT
and OFFSET
keywords,
which are applied in memory. Strangely, it seems like they do the same thing
as truncate
and tail
. It’s unclear why YQL has both.
REST interface 
YQL is exposed on a single REST endpoint,
query.yahooapis.com/v1/yql
.
It supports both GET
and POST
. Queries are specified either in a query
parameter or in the request body, respectively. It looks like both queries and
responses are encoded as UTF-8, but I’m not sure.
The response body is always an XML envelope. The results themselves are encoded in either XML or JSON, depending on what the developer asked for, and enclosed in the envelope. JSON is less expressive than XML – it doesn’t have attributes, CDATA, namespaces, etc. – so converting results to JSON may be lossy. In practice, though, I doubt many users will notice or care.
Developers may sign requests via Yahoo’s OAuth service to access private data or use higher, per-account quotas.
Finally, the v1
in the endpoint URL implies support for versioning.
Yahoo will probably handle backward incompatible changes to YQL by adding a
new endpoint with the next version number.
Great writeup!
The biggest difference between LIMIT and truncate() is that you can put truncate() at the end of a series of other pipes. Implementation wise they are essentially identical. For example:
select * from search.web(100) where query=’yql’ | sort(field=’date’) | truncate(count=1)
Which will find the oldest search result in the first 100 results.
Sam