@ioris/tokenizer-kuromoji

8beeeaaat · 221 · MIT · 0.3.3

A specialized tokenizer library for Japanese lyrics analysis that provides intelligent text segmentation using the Kuromoji morphological analyzer.

music, lyric, sync, iori

readme

Overview

@ioris/tokenizer-kuromoji integrates with the @ioris/core framework to provide advanced lyrics tokenization capabilities. The library focuses on natural phrase breaks and proper handling of mixed Japanese/English content, making it ideal for karaoke applications, music apps, and lyrics analysis tools.

Features

  • Intelligent Segmentation: Advanced rule-based system for natural phrase breaks
  • Mixed Language Support: Seamless handling of Japanese and English text
  • Lyrics-Optimized Rules: Specialized processing for parentheses, quotes, and repetitive patterns
  • Timeline Preservation: Maintains temporal relationships while adding logical segmentation
  • Part-of-Speech Analysis: Leverages Kuromoji's morphological analysis for accurate breaks
  • Extensible Rule System: Customizable rules for specific use cases

Installation

npm install @ioris/tokenizer-kuromoji @ioris/core kuromoji
# or
yarn add @ioris/tokenizer-kuromoji @ioris/core kuromoji

Basic Usage

import path from "path";
import { createParagraph } from "@ioris/core";
import { builder } from "kuromoji";
import { LineArgsTokenizer } from "@ioris/tokenizer-kuromoji";

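// Initialize kuromoji tokenizer.
// Note: __dirname is not defined in ES modules; there, derive it with
// path.dirname(fileURLToPath(import.meta.url)), importing fileURLToPath from "node:url".
const kuromojiBuilder = builder({
  dicPath: path.resolve(__dirname, "node_modules/kuromoji/dict")
});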

// Get kuromoji tokenizer (Promise)
const getTokenizer = () => new Promise((resolve, reject) => {
  kuromojiBuilder.build((err, tokenizer) => {
    // Settle exactly once: reject on error, otherwise resolve
    if (err) return reject(err);
    resolve(tokenizer);
  });
});

// Usage example
async function example() {
  // Get kuromoji tokenizer instance
  const tokenizer = await getTokenizer();

  // Prepare lyrics data with timeline information
  const lyricData = {
    position: 1,
    timelines: [
      {
        wordID: "",
        begin: 1,
        end: 5,
        text: "あの花が咲いたのは、そこに種が落ちたからで"
      }
    ]
  };

  // Create paragraph with custom tokenizer
  const paragraph = await createParagraph({
    ...lyricData,
    lineTokenizer: (lineArgs) => LineArgsTokenizer({
      lineArgs,
      tokenizer
    })
  });

  // Get processing results with natural breaks
  const lines = paragraph.lines;
  const lineText = lines[0].words
    .map(word => {
      let text = word.timeline.text;
      if (word.timeline.hasNewLine) text += '\n';
      return text;
    })
    .join('');

  console.log(lineText);
  // Output: "あの花が\n咲いたのは、\nそこに\n種が落ちたからで"
}

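// Report failures (e.g. a missing dictionary path) instead of leaving the rejection unhandled
example().catch(console.error);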

How It Works

The tokenizer runs each lyric line through Kuromoji's morphological analysis, then applies rule-based checks to the resulting tokens to decide where a natural phrase break belongs:

Intelligent Break Detection

  • Part-of-Speech Analysis: Uses Kuromoji's morphological analysis to identify grammatical boundaries
  • Context Awareness: Considers before/current/after token relationships for accurate segmentation
  • Length Optimization: Balances phrase length for optimal readability and singing
  • Mixed Language Handling: Seamlessly processes Japanese-English transitions
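The part-of-speech and context checks listed above can be pictured as a predicate that sees the previous, current, and next token. The following is a minimal sketch under stated assumptions, not the library's implementation: `SketchToken`, `BreakPredicate`, `breakAfterParticle`, and `segment` are hypothetical names, the tokens are written by hand rather than produced by Kuromoji, and the single rule shown is illustrative rather than one of the package's default rules.

```typescript
// Hypothetical token shape, reduced to the two Kuromoji fields used here.
interface SketchToken {
  surface_form: string;
  pos: string; // e.g. "名詞" (noun), "助詞" (particle), "動詞" (verb)
}

// Decides whether a line break belongs after `current`.
type BreakPredicate = (
  before: SketchToken | undefined,
  current: SketchToken,
  after: SketchToken | undefined,
) => boolean;

// Illustrative rule only: break after a particle unless another particle
// follows or the line ends there.
const breakAfterParticle: BreakPredicate = (_before, current, after) =>
  current.pos === "助詞" && after !== undefined && after.pos !== "助詞";

// Walks the tokens with a before/current/after window and joins the result.
function segment(tokens: SketchToken[], shouldBreak: BreakPredicate): string {
  return tokens
    .map((current, i) => {
      const br = shouldBreak(tokens[i - 1], current, tokens[i + 1]);
      return current.surface_form + (br ? "\n" : "");
    })
    .join("");
}

// Hand-written tokens for "あの花が咲いた" (not real analyzer output).
const sketchTokens: SketchToken[] = [
  { surface_form: "あの", pos: "連体詞" },
  { surface_form: "花", pos: "名詞" },
  { surface_form: "が", pos: "助詞" },
  { surface_form: "咲い", pos: "動詞" },
  { surface_form: "た", pos: "助動詞" },
];

const segmented = segment(sketchTokens, breakAfterParticle);
console.log(JSON.stringify(segmented)); // "あの花が\n咲いた"
```

Keeping the decision in a predicate that receives all three neighbours is what lets a rule refuse to split a particle chain or to break at the very end of a line.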

Special Lyrics Processing

  • Parentheses & Quotes: Preserves phrases enclosed in brackets, parentheses, or quotation marks
  • Repetitive Patterns: Handles repetitive expressions like "Baby Baby Baby" intelligently
  • Punctuation Sensitivity: Respects natural pauses indicated by punctuation marks
  • Timeline Preservation: Maintains original timing information while adding segmentation
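One way to picture the bracket handling described above is to track which tokens sit inside an unclosed bracket, so that break rules can skip them. This is a sketch only: `BRACKET_PAIRS` and `insideBrackets` are hypothetical names, the bracket set is an assumption, and the package may implement this differently. Quotation marks whose opening and closing characters are identical are not handled here.

```typescript
// Opening brackets mapped to their closers. The set is illustrative,
// not taken from the library.
const BRACKET_PAIRS = new Map<string, string>([
  ["(", ")"],
  ["（", "）"],
  ["「", "」"],
  ["『", "』"],
]);

// For each surface form, reports whether a break placed right after it
// would fall inside a bracket that is still open.
function insideBrackets(surfaces: string[]): boolean[] {
  const pendingClosers: string[] = []; // closers we are still waiting for
  return surfaces.map((surface) => {
    for (const ch of Array.from(surface)) {
      const closer = BRACKET_PAIRS.get(ch);
      if (closer !== undefined) {
        pendingClosers.push(closer);
      } else if (
        pendingClosers.length > 0 &&
        pendingClosers[pendingClosers.length - 1] === ch
      ) {
        pendingClosers.pop();
      }
    }
    return pendingClosers.length > 0;
  });
}

const bracketFlags = insideBrackets(["君", "と", "（", "ずっと", "）", "歩く"]);
console.log(bracketFlags); // [false, false, true, true, false, false]
```

A break rule could consult these flags and decline to insert a newline wherever the flag is `true`, which keeps the bracketed phrase on one line.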

Example Transformations

Input:  "あの花が咲いたのは、そこに種が落ちたからで"
Output: "あの花が\n咲いたのは、\nそこに\n種が落ちたからで"

Input:  "Baby Baby Baby 君を抱きしめていたい"
Output: "Baby\nBaby\nBaby\n君を抱きしめていたい"

Input:  "Oh, I can't help falling in love with you"
Output: "Oh,\nI can't help falling in love with you"
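The outputs above are shown as strings with `\n` at each break. A small helper along the following lines could produce that form from a line's words, mirroring the mapping in Basic Usage. `WordLike` and `renderLine` are hypothetical names, and the sample words are written by hand (the real tokenizer may group the Japanese part into different words).

```typescript
// Minimal shape of a word as read in the Basic Usage example:
// only timeline.text and timeline.hasNewLine are used.
interface WordLike {
  timeline: { text: string; hasNewLine?: boolean };
}

// Joins word texts, appending "\n" after every word flagged with hasNewLine.
function renderLine(words: WordLike[]): string {
  return words
    .map((w) => w.timeline.text + (w.timeline.hasNewLine ? "\n" : ""))
    .join("");
}

// Hand-written sample shaped like the second transformation above.
const sampleWords: WordLike[] = [
  { timeline: { text: "Baby", hasNewLine: true } },
  { timeline: { text: "Baby", hasNewLine: true } },
  { timeline: { text: "Baby", hasNewLine: true } },
  { timeline: { text: "君を抱きしめていたい" } },
];

const rendered = renderLine(sampleWords);
console.log(JSON.stringify(rendered)); // "Baby\nBaby\nBaby\n君を抱きしめていたい"
```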

API Reference

LineArgsTokenizer

The main tokenization function. It takes the timeline data for a line and returns it split at natural phrase breaks.

function LineArgsTokenizer(options: {
  lineArgs: CreateLineArgs;
  tokenizer: Tokenizer<IpadicFeatures>;
  brakeRules?: TokenizeRule[];
  whitespaceRules?: TokenizeRule[];
}): Promise<Map<number, CreateLineArgs>>

Parameters

  • lineArgs: Input timeline data containing text and timing information
  • tokenizer: Kuromoji tokenizer instance for morphological analysis
  • brakeRules: Optional custom rules for line breaks (defaults to DEFAULT_BRAKE_RULES)
  • whitespaceRules: Optional custom rules for whitespace handling (defaults to DEFAULT_WHITESPACE_RULES)

Returns

A Promise that resolves to a Map<number, CreateLineArgs> holding the segmented line data, with natural break points added and the original timing information preserved.
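When consuming a `Map<number, …>` like the one returned here, it can be convenient to read the values in ascending key order. The sketch below assumes the numeric keys express ordering, which this README does not state. `valuesInKeyOrder` is a hypothetical helper, and `{ text: string }` is a stand-in for `CreateLineArgs`, whose fields are not listed in this document.

```typescript
// Returns the Map's values sorted by ascending numeric key.
function valuesInKeyOrder<V>(segments: Map<number, V>): V[] {
  return Array.from(segments.entries())
    .sort(([a], [b]) => a - b)
    .map(([, value]) => value);
}

// Stand-in data: { text } is a placeholder, not the real CreateLineArgs shape.
const demoSegments = new Map<number, { text: string }>([
  [2, { text: "咲いたのは、" }],
  [1, { text: "あの花が" }],
]);

const orderedTexts = valuesInKeyOrder(demoSegments).map((v) => v.text);
console.log(orderedTexts); // ["あの花が", "咲いたのは、"]
```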

Custom Rules

You can extend the tokenizer with custom break point rules:

import { LineArgsTokenizer, DEFAULT_BRAKE_RULES } from "@ioris/tokenizer-kuromoji";

// Define custom rules
const customRules = [
  ...DEFAULT_BRAKE_RULES,
  {
    // Break after specific patterns.
    // "特定の文字列" is a placeholder meaning "specific string"; substitute your own text.
    current: {
      surface_form: [/^(.*特定の文字列).*$/]
    },
    after: {
      pos: [["名詞", false]]
    }
  }
];

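// Apply custom rules (lineArgs and tokenizer are obtained as in Basic Usage;
// run this inside an async function or an ES module with top-level await)
const result = await LineArgsTokenizer({
  lineArgs,
  tokenizer,
  brakeRules: customRules
});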

Rule Structure

Rules use the TokenizeRule interface with conditions for:

  • before: Conditions for the previous token
  • current: Conditions for the current token
  • after: Conditions for the next token
  • length: Length-based constraints
  • insert: Where to insert the break ("before" or "current")
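To make the before/current/after/insert structure concrete, here is a reduced matcher over plain surface strings. It is a sketch under several assumptions and not the package's `TokenizeRule` implementation: `SketchCondition`, `SketchRule`, `holds`, and `breakIndices` are hypothetical names, only `surface_form` patterns are modelled (the `pos` tuples and `length` constraints are left out because their exact semantics are not described here), and `insert` is assumed to mean "break ahead of the current token" for "before" and "break after the current token" for "current".

```typescript
// Hypothetical, reduced rule shape: only surface_form patterns and `insert`.
interface SketchCondition {
  surface_form?: RegExp[];
}
interface SketchRule {
  before?: SketchCondition;
  current?: SketchCondition;
  after?: SketchCondition;
  insert?: "before" | "current";
}

// A condition holds when it is absent, or when the token exists and
// matches at least one pattern. Avoid the /g flag: test() is stateful with it.
function holds(cond: SketchCondition | undefined, surface: string | undefined): boolean {
  if (cond === undefined || cond.surface_form === undefined) return true;
  if (surface === undefined) return false;
  return cond.surface_form.some((re) => re.test(surface));
}

// Returns the indices of tokens after which a break would be placed.
function breakIndices(surfaces: string[], rules: SketchRule[]): number[] {
  const found = new Set<number>();
  surfaces.forEach((current, i) => {
    for (const rule of rules) {
      const matched =
        holds(rule.before, surfaces[i - 1]) &&
        holds(rule.current, current) &&
        holds(rule.after, surfaces[i + 1]);
      if (!matched) continue;
      // Assumption: "before" breaks ahead of the current token,
      // anything else breaks after it.
      const at = rule.insert === "before" ? i - 1 : i;
      // Ignore breaks before the first token or after the last one.
      if (at >= 0 && at < surfaces.length - 1) found.add(at);
    }
  });
  return Array.from(found).sort((a, b) => a - b);
}

const sketchRules: SketchRule[] = [
  { current: { surface_form: [/^,$/] }, insert: "current" },
  { current: { surface_form: [/^you$/] }, insert: "before" },
];

const breaks = breakIndices(["Oh", ",", "I", "love", "you"], sketchRules);
console.log(breaks); // [1, 3]
```

Collecting break positions in a set means several rules can agree on the same position without producing duplicate breaks.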

Development

Building

npm run build        # Full build process
npm run build:types  # TypeScript declarations only
npm run build:esbuild # ESBuild compilation only

Testing

npm test            # Run all tests
npm run test -- --watch  # Watch mode

Code Quality

npm run lint        # Check code quality
npm run format      # Auto-fix formatting

Use Cases

  • Karaoke Applications: Generate natural phrase breaks for synchronized lyrics display
  • Music Apps: Improve lyrics readability with intelligent segmentation
  • Lyrics Analysis: Analyze song structure and linguistic patterns
  • Subtitle Generation: Create well-formatted subtitles for music videos
  • Language Learning: Study Japanese lyrics with proper phrase boundaries

Requirements

  • Node.js 16.0 or higher
  • TypeScript 5.0 or higher (for development)
  • @ioris/core ^0.3.2
  • kuromoji ^0.1.2

License

MIT

changelog

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[0.3.3] - 2025-07-27

Changed

  • Updated @ioris/core dependency from version 0.3.3 to 0.3.4 for latest features and improvements

Removed

  • Removed ts-node dependency from package.json as it has been replaced by tsx for better Node.js compatibility

Technical

  • Cleaned up package-lock.json by removing 162 lines of ts-node related dependencies
  • Reduced package size and improved dependency management by eliminating redundant TypeScript execution tools

[0.3.2] - 2025-07-27

Added

  • tsx package as development dependency for improved TypeScript execution in CI environments

Changed

  • Updated @ioris/core to 0.3.3
  • Updated Node.js version to 24.4 in CI workflows for improved compatibility and latest features
  • Enhanced CI/CD stability with specific Node.js version pinning instead of using version ranges
  • Improved build script to use tsx instead of ts-node for better Node.js 24.4 compatibility
  • Unified Node.js version to 24.4 across all CI workflow jobs (build and publish-npm)

Fixed

  • Resolved "Unknown file extension .ts" error in CI builds by switching from ts-node to tsx
  • Fixed Node.js version inconsistency between build and publish jobs in release workflow

Technical

  • Modified GitHub Actions workflows (development.yml and release.yml) to use Node.js 24.4 specifically
  • Improved build consistency by removing version range specification ('24.x' → '24.4')
  • Enhanced TypeScript execution with tsx for better ESM support and Node.js compatibility

[0.3.1] - 2025-07-27

Added

  • CLAUDE.md file with comprehensive project guidance for Claude Code integration
  • Detailed API reference section in README.md with TypeScript types and examples
  • Enhanced "How It Works" section explaining tokenizer behavior and linguistic processing
  • Example transformations showing input/output of the tokenization process
  • Development section with build, test, and code quality commands
  • Use cases section highlighting practical applications (karaoke, music apps, etc.)
  • Requirements section with specific version dependencies

Changed

  • Enhanced README.md: Complete overhaul with improved structure and detailed explanations
  • Updated usage examples to reflect current @ioris/core API (createParagraph instead of Paragraph constructor)
  • Improved feature descriptions with bullet-point formatting and clear benefits
  • Enhanced custom rules documentation with practical examples
  • Better organization of documentation sections for improved readability

Technical

  • Added comprehensive development guidelines in CLAUDE.md for future contributors
  • Documented high-level architecture patterns and core components
  • Included testing strategy and special considerations for lyrics processing
  • Enhanced project overview with focus on timeline-based processing and rule-based segmentation

[0.3.0] - 2025-07-27

Changed

  • Updated @ioris/core dependency from ^0.2.0 to ^0.3.0 for latest features and improvements
  • Refactored build configuration with enhanced TypeScript settings and module resolution
  • Improved tokenizer implementation with better code organization and maintainability

Technical

  • Updated development dependencies: Biome 2.1.2, esbuild 0.25.8, TypeScript 5.8.3, Vitest 3.2.4
  • Enhanced CI/CD pipelines with Node.js 24.x support for better performance and compatibility
  • Streamlined test suite with improved test structure and reduced complexity
  • Updated build system configuration for better module handling and optimization

Fixed

  • Improved TypeScript configuration for better type checking and module resolution
  • Enhanced build process reliability and output consistency

[0.2.0] - 2024-12-28

Changed

  • BREAKING: Migrated to ESM-only package format, dropping CommonJS support
  • Updated @ioris/core dependency to ^0.3.2 (ESM-only)
  • Modernized TypeScript configuration with "bundler" module resolution
  • Updated Node.js CI workflows to use version 24.x

Added

  • Comprehensive README.md with usage examples and API documentation
  • Support for new @ioris/core createParagraph API (replacing Paragraph constructor)

Technical

  • Enhanced build system with JSON import attributes for Node.js ESM compatibility
  • Updated development tooling: Biome 2.1.2, TypeScript 5.8.3, Vitest 3.2.4
  • Improved test coverage and updated test assertions for new API
  • Removed CommonJS build artifacts and dual module support

Fixed

  • TypeScript import resolution errors for @ioris/core types
  • Test compatibility with new @ioris/core data structures
  • Build configuration for ESM-only environment

[0.1.14] - 2024-12-27

Improvements

  • Minor improvements and bug fixes
  • Dependency updates for security and performance

[0.1.13] - 2024-12-26

Updates

  • Internal refactoring and optimization
  • Updated build processes

[0.1.12] - 2024-12-25

Enhancements

  • Package configuration improvements
  • Enhanced compatibility

[0.1.11] and earlier

For releases prior to 0.1.12, please refer to the git history and release tags.