Embulk in TD, and in the future

Embulk Maintenance Goes Open

In a November 2022 TD Tech Talk, Dai, the long-time maintainer of Embulk, announced that future maintenance of Embulk would be open. This means that maintenance and design decisions that were once made exclusively by Treasure Data will now be evaluated through a proposal process called the Embulk Enhancement Proposal (EEP). The Tech Talk was given in Japanese, and this post revisits the talk in English.

What is Embulk?

Embulk is an open-source bulk data loader, initially developed and released in 2015 by Sada, co-founder of Treasure Data. Embulk is a pluggable framework: it works with input plugins to read data from a variety of sources and file formats, e.g. Amazon S3, SFTP, CSV files, JSON files, RDBMS, Elasticsearch, and several cloud services. It also works with output plugins to write into various destinations and file formats. Many Embulk plugins are also open-source, and they are developed and maintained by both Treasure Data and non-Treasure Data developers. Additionally, Embulk users may have their own closed-source plugins.

Embulk Inside Treasure Data

How is Embulk Used in TD?

The major uses of Embulk in Treasure Data are "Data Connector" and "Result Export":

  • Data Connector is Treasure Data's service to fetch data into TD from customer-specified data sources.
  • Result Export (a.k.a. Result Output) is a service to push query results from Hive and Presto to customer-specified destinations.

Plugins in TD

Data Connector runs with over 100 input-side Embulk plugins (as of December 2022), as well as TD's embulk-output-td-internal, to write the imported data directly into Plazma, the TD database.

Slide: A bunch of (OSS / internal) Embulk plugins

Similarly, Result Export runs with over 90 output-side Embulk plugins (as of December 2022), as well as embulk-input-s3 and embulk-parser-msgpack, to read query results from Hive and Presto.

Slide: A bunch of (OSS / internal) Embulk plugins

Data Connector and the Worker

Data Connector runs Embulk on top of our in-house job queue mechanism called (td-)worker, which is written in Ruby. It is a priority queue with per-customer parameters, such as the maximum number of concurrent jobs for the customer.
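To make the idea of "a priority queue with per-customer parameters" concrete, here is a minimal sketch in Java. It is not the actual Worker, which is written in Ruby and is not shown in this post. Every name in it (Job, CustomerLimitedQueue, setLimit, and so on) is hypothetical, and it assumes that a lower priority number means "run sooner" and that the only per-customer parameter is the maximum number of concurrent jobs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.PriorityQueue;

// Hypothetical job record. Assumption: a lower priority value runs sooner.
final class Job {
    final String id;
    final String customer;
    final int priority;

    Job(String id, String customer, int priority) {
        this.id = id;
        this.customer = customer;
        this.priority = priority;
    }
}

// Hypothetical queue illustrating per-customer concurrency limits.
final class CustomerLimitedQueue {
    private final PriorityQueue<Job> pending =
            new PriorityQueue<>(Comparator.comparingInt((Job j) -> j.priority));
    private final Map<String, Integer> limits = new HashMap<>();
    private final Map<String, Integer> running = new HashMap<>();
    private final int defaultLimit;

    CustomerLimitedQueue(int defaultLimit) {
        this.defaultLimit = defaultLimit;
    }

    void setLimit(String customer, int maxConcurrentJobs) {
        limits.put(customer, maxConcurrentJobs);
    }

    void submit(Job job) {
        pending.add(job);
    }

    // Picks the best-priority job whose customer is still below its limit.
    // Jobs of customers already at their limit are skipped and put back.
    Optional<Job> poll() {
        List<Job> skipped = new ArrayList<>();
        Optional<Job> picked = Optional.empty();
        while (!pending.isEmpty()) {
            Job job = pending.poll();
            int limit = limits.getOrDefault(job.customer, defaultLimit);
            if (running.getOrDefault(job.customer, 0) < limit) {
                running.merge(job.customer, 1, Integer::sum);
                picked = Optional.of(job);
                break;
            }
            skipped.add(job);
        }
        pending.addAll(skipped);
        return picked;
    }

    // Marks a previously picked job as done, freeing one slot for its customer.
    void finish(Job job) {
        running.computeIfPresent(job.customer, (k, v) -> v > 1 ? v - 1 : null);
    }
}
```

With a default limit of 1, a customer's second job waits even when it has a better priority than another customer's job, and becomes eligible again once finish is called for the first one.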

In presentations, the relationship between Embulk and the Worker is often described as follows:

  1. The Worker retrieves a job from the job queue
  2. It passes the job to Embulk
  3. It receives the result from Embulk
  4. And it returns the result to users.

Slide: A "presentation style" of Embulk use in TD

But in reality, the relationship is not as clean as it looks in a presentation. The boundary between the Worker and Embulk is blurred, with the Worker occasionally invading Embulk's domain.

Slide: A "real" of Embulk use in TD

The "Real"

Here are some examples of how Embulk and the Worker are mixed up.

The Worker's GitHub repository contained Embulk JAR files + Worker-related JAR files directly.

Slide: The "real" (2017)

Furthermore, the Worker abused the Guice DI container to reach inside Embulk. It managed to run Ruby require by extracting the JRuby engine instance out of the Embulk runtime. It also managed to manipulate Embulk's configuration objects.

Slide: The "real" (2019)

Slide: The "real" (2020)

Why did this happen? The Worker wanted to integrate Embulk with the Worker's own job management mechanism, such as job status and retries. These were quick hacks to reach that goal. But they are, indeed, hacks.

EmbulkEmbed

The standard way of running OSS Embulk is to run it as a single command, such as:

$ embulk run config.yml

The Embulk process starts from a bootstrap, and then a Java class, EmbulkEmbed, takes over to perform the main processing.

Slide: EmbulkEmbed

The Worker, however, does not run Embulk as a single command. Instead, the Worker calls EmbulkEmbed embedded in another Java application process. It depends on many "unofficial contracts" not provided by the Embulk framework.

Slide: The curse of EmbulkEmbed in TD Worker

The Curse of Embulk and Data Connector

As a result, Data Connector and Embulk fell into a difficult situation. When a change was made in OSS Embulk, it often did not work in Data Connector, even if it looked like it was working fine in OSS Embulk. Conversely, when we made a change for Data Connector, regressions often happened in OSS Embulk even though the change worked fine in Data Connector.

Slide: The curse of Embulk & Data Connector

Unintended dependencies through unofficial contracts were definitely one of the biggest causes. But there was also much more going on in the background.

Embulk v0.10: Development Series

Even apart from the TD-internal matters, Embulk could not move forward without compatibility-breaking changes. The blockers included, for example:

  • the "Java dependency hell" across the Embulk core and plugins
  • too much dependency on JRuby
  • etc.

The ongoing Embulk v0.10 ("development series") is the effort to complete those breaking changes. Treasure Data accepted the risk of taking a long time so that the work could be completed all at once, rather than repeating a cycle of many smaller breaking changes over and over again. See the articles on the OSS Embulk website for more details about Embulk v0.10.

At the same time, Treasure Data wanted to resolve the TD-internal chaos discussed above.

Embulk (OSS) and the TD team

I joined Treasure Data in 2016, and took over Data Connector. At the same time, Engineering Managers and Product Managers took on responsibility for the Data Connector service product in Treasure Data.

Embulk was released as OSS in 2015. Maintenance of Embulk as OSS was also assigned to the Data Connector team, as Data Connector was the primary user of Embulk in TD. All OSS matters were delegated to me, an individual contributor, since the EMs and PMs were unfamiliar with OSS.

Slide: Embulk (OSS) and TD org

I often had to wear two hats: one for the OSS Embulk community, and one for the Data Connector team. I often had to introduce changes in OSS Embulk for Data Connector reasons. I described those reasons to the OSS community by translating them into the OSS context. Similarly, I explained the "shoulds" of OSS Embulk to the EMs/PMs by translating them into the Data Connector context.

It worked for a while, but several negative effects gradually appeared.

Case #1: Maven-formed Plugin

In addition to the existing Ruby Gem format, Embulk began to support a new pure-Java Maven-based plugin format in 2018. A plugin in this format initializes quickly because it does not need the JRuby engine runtime, which takes seconds to bootstrap, and it uses far less memory. The format also enables installing multiple versions of the same plugin and choosing one at runtime.

Developers should find it useful. It has been working in Treasure Data for a while now.

Slide: Case #1: Maven-formed plugin

Nevertheless, it was not practically available to OSS Embulk users.

Why? OSS Embulk can run Maven-formed plugins installed in the file system. But Embulk has no mechanism to download and install Maven-formed plugins. Such an installer would be implemented in the "bootstrap" part of OSS Embulk. In contrast, Treasure Data's Data Connector does not use the "bootstrap" part. It runs EmbulkEmbed from its own Java application, and it installs Embulk plugins with TD's in-house deployment mechanism, not with Embulk's plugin installer. So such an installer would never be used in Data Connector.

Slide: Case #1: Maven-formed plugin installer

Normally, it would not have taken long to implement such an installer. But the EMs and PMs had no good reason to justify assigning the Data Connector team's development bandwidth to a project that Data Connector would never use.

Slide: Case #1: Maven-formed plugin installer

This was a very straightforward example, and we had more cases like it. Some features are still unimplemented. Some were implemented as part of Data Connector's requirements, but they often needed extra work to be consistent as features in OSS Embulk. The EMs and PMs were responsible for Data Connector, not for OSS Embulk. Such extra work was neither evaluated nor even recognized. It had to be done voluntarily, out-of-band, as unassigned work.

Case #2: Embulk v0.10 Catch-ups

The Embulk v0.10 efforts have been ongoing. In fact, I had pull requests ready to move OSS Embulk forward almost to the last stage of v0.10.

Slide: Case #2: Embulk v0.10 catch-ups

But they were neither merged nor released for a long time. Why?

TD had approximately 200 plugins running in Data Connector. Over 150 of them were in-house, and TD simply could not get that many Embulk plugins caught up with v0.10.

The incompatibility in v0.10 was unavoidable. I designed the catch-up process carefully so that the changes could be as mechanical as possible, and so that they could happen in parallel with the Embulk core's changes. However, there were too many things that had to happen.

I might have been able to go ahead with OSS Embulk alone. But it would have been much harder for Data Connector to catch up once Data Connector's Embulk fell behind the latest OSS Embulk. There were so many unexpected problems in updating Data Connector's Embulk, due to the unofficial contracts between Embulk and the Worker, that TD encountered "backport hell."

Slide: Case #2: Embulk v0.10 catch-ups

Had TD taken Embulk seriously as OSS, it might have gone ahead with OSS Embulk even though Data Connector would have lagged behind. But in fact, the same team was working on both Data Connector and OSS Embulk.

Slide: Case #2: Embulk v0.10 catch-ups

TD knew it would have a harder time if it moved OSS Embulk forward with Data Connector left behind. It was not realistic to make the decision to go ahead only with OSS Embulk.

Slide: Case #2: Embulk v0.10 catch-ups

Consequently, it is fair to say that Treasure Data substantially prevented Embulk from evolving.

Treasure Data and Open Source

I am a fan of open source, but not an open-source absolutist. To be quite honest, I sometimes thought that giving up maintaining Embulk as OSS could be a reasonable, well-and-good option to move the development of Data Connector forward. Everything in Data Connector might have gone ahead double-time or triple-time if TD had given up control.

It was, however, a hard option at the same time. One reason was that Treasure Data had introduced OSS Embulk as a way for TD's customers to push their data into TD. Such a push option is sometimes required because Data Connector runs on TD's servers to fetch data, and it cannot access customers' internal servers directly. If TD had stopped maintaining Embulk as OSS, its customers would have been in trouble. Nonetheless, maintaining both OSS Embulk and internal versions would have doubled the work.

The other reason was that Treasure Data had billed itself as an open-source company. TD often introduced itself with phrases like "Open Source is in our DNA" and "Most companies say they love open source. We also LIVE it". If TD had stopped maintaining Embulk as OSS, it would have been seen as unfaithful to its open-source commitment. It could have been a significant conflict.

Embulk is an eco-system, including open-source plugins and the community. Embulk always needs plugins to work. Embulk is not a standalone tool. Just maintaining the Embulk core as a single piece of software is not sufficient. Someone has to keep Embulk maintained as an eco-system.

But nobody in Treasure Data was aware of that. The organization was structured to maintain Embulk only for Data Connector. The maintainers always had to take care of the eco-system unofficially as individuals, without being evaluated or recognized, even though this open-source software was started by TD. On the other hand, the team often received internal requests that conflicted with the eco-system. The maintainers, as single individuals, were always in a double bind between the eco-system and the organization. A lot of effort was always needed to make design decisions that satisfied both, or to convince management to decline requests in order to protect the eco-system.

The maintenance for the open-source eco-system was, inevitably, limited.

Embulk and TD in the Future

Communications

There were lots of discussions once the conflict was made much more explicit. TD concluded that it needed more communication. (COVID-19 and rapid growth may have increased the miscommunication.) TD's CxOs, especially those with long tenure, thought that Treasure Data was obviously an open-source company, and that this view was no doubt shared throughout the company. Many of them had been open-source leaders in and out of Treasure Data. In contrast, people on the ground, including managers and directors, could not afford to value open-source work in their actual circumstances. There had been a gap.

Decisions

TD will need long-term efforts to build its open-source culture. But in the short term, it made two decisions about Embulk.

One decision was to separate the roles around Embulk and Data Connector. I (Dai) will be a maintainer of OSS Embulk, independent of Data Connector. The Data Connector team is going to be users of, and contributors to, Embulk. Now I am working on OSS Embulk half the time and on Backend CDP Applications the other half.

The second decision was to open Embulk maintenance to people outside of Treasure Data. As long as Embulk was maintained only by TD, conflicts would be unavoidable. Consequently, TD decided to stop being solely responsible and to involve key contributors from outside TD.

The community is much more critical for Embulk than for many other open-source projects because Embulk does not work by itself. That is, Embulk is an eco-system that always needs plugins to work. Many plugins have been maintained as open source outside of TD. One could say that Treasure Data was "free-riding" on its own open-source software, including its eco-system and its community. It was only natural to involve them in the project eventually.

Lessons Learned

Lesson #1: Technical points

On the technical side, it pays to define clear contracts, boundaries, and responsibilities for software.

Even if engineers know the inner workings of other software, they should not depend on its internal structure. It takes much work to catch up if that software changes its internal structure. In addition, owning both pieces of software while they depend on each other's internal structure prevents the ecosystem from evolving.
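As a rough illustration of what a "clear contract" can look like in code, here is a minimal Java sketch. It is not Embulk's or the Worker's actual API. The names BulkLoader, LoadResult, Host, and SimpleLoader, and the "records" config key, are all hypothetical and exist only for this example. The point is that the host depends on one small interface, so the loader's internals can change without breaking the host.

```java
import java.util.Map;

// Hypothetical contract: the only surface the host may depend on.
interface BulkLoader {
    LoadResult run(Map<String, String> config);
}

// Hypothetical result type that is part of the contract.
final class LoadResult {
    final boolean success;
    final long records;

    LoadResult(boolean success, long records) {
        this.success = success;
        this.records = records;
    }
}

// Hypothetical host, standing in for something like a job worker.
// It sees only BulkLoader and LoadResult, never the loader's internals.
final class Host {
    private final BulkLoader loader;

    Host(BulkLoader loader) {
        this.loader = loader;
    }

    String execute(Map<String, String> config) {
        LoadResult result = loader.run(config);
        return result.success ? "ok:" + result.records : "failed";
    }
}

// One possible implementation. How it works inside is free to change
// as long as it keeps honoring the BulkLoader contract.
final class SimpleLoader implements BulkLoader {
    @Override
    public LoadResult run(Map<String, String> config) {
        // Assumption for this sketch: the record count is given in the config.
        long records = Long.parseLong(config.getOrDefault("records", "0"));
        return new LoadResult(records >= 0, Math.max(records, 0));
    }
}
```

For example, new Host(new SimpleLoader()).execute(Map.of("records", "3")) yields "ok:3". Swapping in a different BulkLoader implementation requires no change to Host, which is the property the Worker lost when it reached into Embulk through the DI container.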

Lesson #2: Open-sourcing Strategy

If a company releases open-source software strategically, it needs to think about the software's position and its relationship with internal components. Its lifecycle as open source may not match its lifecycle as an internal component.

Slide: Learnings #2 : Open-sourcing strategy

Lesson #3: Organizational Points

Even if a company sets open source as one of its values, it can easily become "not open source" if it doesn't build the organization properly. First, it helps to have clear answers to questions like "Why will we do this as OSS?" Those answers will help in making decisions in the future.

In addition, it is recommended to avoid having one person hold two roles that might represent a conflict of interest. If one person takes ownership of the OSS, consider having someone else take the position of a user.

Conclusion: The Future

Treasure Data is changing and growing. Upcoming years will be our new decade in many aspects, which includes open source. Stay tuned!
