Open Chinese Convert 1.3.0.dev2+g90b2a0f
A project for conversion between Traditional and Simplified Chinese
Loading...
Searching...
No Matches
plugins Directory Reference

Detailed Description

OpenCC Plugins

plugins/ contains segmentation plugins that are built and distributed separately from the OpenCC core library.

The current plugin layout is:

  • plugins/<name>/src/: plugin implementation and exported entry point
  • plugins/<name>/include/: plugin-private headers
  • plugins/<name>/data/config/: plugin-backed JSON configs
  • plugins/<name>/data/<resource-dir>/: plugin resource files
  • plugins/<name>/tests/: integration tests for built plugin artifacts

Design Rules

  • The OpenCC core keeps built-in algorithms only. mmseg remains built in.
  • Any non-built-in segmentation.type is resolved through the plugin host.
  • A single JSON config must stay platform-neutral. Config files must not embed .so, .dylib, .dll, or platform-specific install paths.
  • Plugin resources belong to the plugin package, but resource names stay inside the normal OpenCC data layout.
  • Plugin tests should validate real built artifacts instead of directly testing private implementation classes.

Naming And Installation

Runtime naming follows the segmentation type:

  • Linux: libopencc-<type>.so
  • macOS: libopencc-<type>.dylib
  • Windows: opencc-<type>.dll

Windows loaders also accept the MSYS/MinGW runtime name msys-opencc-<type>.dll when that is the emitted DLL filename. On Windows, plugins must be built with an ABI-compatible toolchain/runtime as the host OpenCC binary. Mixing MSVC-built hosts with MinGW-built plugins, or the reverse, is unsupported.

For the jieba plugin, that means:

  • Linux: libopencc-jieba.so
  • macOS: libopencc-jieba.dylib
  • Windows: opencc-jieba.dll

MSYS/MinGW builds may emit msys-opencc-jieba.dll, which is also accepted by the loader.

CMake installs plugin binaries into the platform plugin directory and installs plugin configs/resources into the OpenCC data directory. Within a single plugin search directory, keep only one DLL for a given segmentation type. On Windows this applies to both opencc-<type>.dll and msys-opencc-<type>.dll. Multiple matching DLL names for the same type in one search directory are treated as an error.

Resource Resolution

Plugin JSON uses resource names rather than platform paths. Example:

{
"segmentation": {
"type": "jieba",
"resources": {
"dict_path": "jieba_dict/jieba.dict.utf8",
"model_path": "jieba_dict/hmm_model.utf8",
"user_dict_path": "jieba_dict/user.dict.utf8"
}
}
}

The core passes these values to the plugin host. The plugin is responsible for resolving them at runtime. Relative resource paths are expected to resolve within the existing OpenCC data layout rather than a plugin-specific ad hoc directory tree.

Segmentation ABI

The current segmentation plugin ABI entry point is:

  • opencc_get_segmentation_plugin_v2()

Segmentation results are returned as a sequence of segment lengths measured in Unicode code points, not as copied token strings. The ABI contract is:

  • input text is passed to the plugin as null-terminated UTF-8
  • the plugin returns segment_count plus codepoint_lengths
  • each element is the number of Unicode code points in the next segment
  • lengths must be positive and must cover the full input, in order
  • the host reconstructs segment boundaries from the original UTF-8 input

This keeps the ABI simpler and avoids allocating one string per token across the plugin boundary.

Customizing Jieba Dictionaries

When using the jieba plugin, you can add custom terminology to the segmenter by defining a custom user.dict.utf8 or editing the installed one.

Custom dictionaries must be encoded in UTF-8. Each line follows the format: [Word] [Frequency] [Part-of-Speech], separated by spaces. The frequency and POS tags are optional.

Example:

云计算 5 n
机器学习 8 n
区块链 10 nz

Testing

Each plugin should prefer integration tests that exercise:

  • the built opencc command or libopencc
  • the built plugin shared library
  • real plugin JSON configs
  • real installed or runfiles-based resource files

Current jieba targets:

  • CMake plugin target: opencc_jieba
  • CMake integration test: JiebaPluginIntegrationTest
  • Bazel plugin target: //plugins/jieba:opencc-jieba
  • Bazel integration test: //plugins/jieba:jieba_plugin_integration_test

Adding A New Plugin

  1. Create plugins/<name>/src, include, data, and tests.
  2. Export opencc_get_segmentation_plugin_v2().
  3. Name the output binary using the opencc-<type> convention.
  4. Keep JSON configs platform-neutral and resource-oriented.
  5. Add both CMake and Bazel build rules.
  6. Add an integration test that loads the built plugin through the real host.

Packaging for Distro Maintainers

To align with downstream Linux distribution packaging standards (e.g., Debian apt, Arch pacman), OpenCC plugins support decoupled compilation. This lets maintainers build and distribute the core opencc system separately from heavier third-party plugins such as opencc-jieba.

1. Build And Install Core OpenCC

Compile the main tree normally, but disable the optional jieba plugin:

mkdir build_core && cd build_core
cmake .. -DBUILD_OPENCC_JIEBA_PLUGIN=OFF -DCMAKE_INSTALL_PREFIX=/usr
make && make install

2. Build The Plugin Standalone

Plugins can detect standalone builds automatically. Build from the plugin directory and point OpenCC_DIR at the installed OpenCC CMake package:

cd plugins/jieba
mkdir build_plugin && cd build_plugin
cmake .. -DOpenCC_DIR=/usr/lib/cmake/opencc -DCMAKE_INSTALL_PREFIX=/usr
make && make install

Standalone default installation paths are intended to align with the core OpenCC layout:

  • Windows: bin/plugins
  • Linux/macOS: lib/opencc/plugins