@ioris/tokenizer-kuromoji

8beeeaaat | 221 | MIT | 0.3.3

A specialized tokenizer library for Japanese lyrics analysis that provides intelligent text segmentation using the Kuromoji morphological analyzer.

music, lyric, sync, iori

Overview

@ioris/tokenizer-kuromoji plugs into the @ioris/core framework as a line tokenizer for lyrics. It focuses on natural phrase breaks and on correct handling of mixed Japanese/English content, which suits karaoke applications, music apps, and lyrics analysis tools.

Features

Intelligent Segmentation: Advanced rule-based system for natural phrase breaks
Mixed Language Support: Seamless handling of Japanese and English text
Lyrics-Optimized Rules: Specialized processing for parentheses, quotes, and repetitive patterns
Timeline Preservation: Maintains temporal relationships while adding logical segmentation
Part-of-Speech Analysis: Leverages Kuromoji's morphological analysis for accurate breaks
Extensible Rule System: Customizable rules for specific use cases

Installation

npm install @ioris/tokenizer-kuromoji @ioris/core kuromoji
# or
yarn add @ioris/tokenizer-kuromoji @ioris/core kuromoji

Basic Usage

import path from "path";
import { createParagraph } from "@ioris/core";
import { builder } from "kuromoji";
import { LineArgsTokenizer } from "@ioris/tokenizer-kuromoji";

// Initialize kuromoji tokenizer
const kuromojiBuilder = builder({
  // Resolved against the current working directory; __dirname is not defined in ES modules
  dicPath: path.resolve("node_modules/kuromoji/dict")
});

// Wrap kuromoji's callback-style build() in a Promise
const getTokenizer = () => new Promise((resolve, reject) => {
  kuromojiBuilder.build((err, tokenizer) => {
    // Return after rejecting so resolve() is not also called on failure
    if (err) return reject(err);
    resolve(tokenizer);
  });
});

// Usage example
async function example() {
  // Get kuromoji tokenizer instance
  const tokenizer = await getTokenizer();

  // Prepare lyrics data with timeline information
  const lyricData = {
    position: 1,
    timelines: [
      {
        wordID: "",
        begin: 1,
        end: 5,
        text: "あの花が咲いたのは、そこに種が落ちたからで"
      }
    ]
  };

  // Create paragraph with custom tokenizer
  const paragraph = await createParagraph({
    ...lyricData,
    lineTokenizer: (lineArgs) => LineArgsTokenizer({
      lineArgs,
      tokenizer
    })
  });

  // Get processing results with natural breaks
  const lines = paragraph.lines;
  const lineText = lines[0].words
    .map(word => {
      let text = word.timeline.text;
      if (word.timeline.hasNewLine) text += '\n';
      return text;
    })
    .join('');

  console.log(lineText);
  // Output: "あの花が\n咲いたのは、\nそこに\n種が落ちたからで"
}

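// Report failures (for example, a missing dictionary) instead of leaving the rejection unhandled
example().catch(console.error);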

How It Works

The tokenizer analyzes lyrics using advanced linguistic rules to create natural phrase breaks:

Intelligent Break Detection

Part-of-Speech Analysis: Uses Kuromoji's morphological analysis to identify grammatical boundaries
Context Awareness: Considers before/current/after token relationships for accurate segmentation
Length Optimization: Balances phrase length for optimal readability and singing
Mixed Language Handling: Seamlessly processes Japanese-English transitions
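
The context-aware step above can be pictured as a sliding window over the token list. The following is a minimal sketch, not the library's implementation: the Token shape is a simplified stand-in for kuromoji's output, and shouldBreakAfter is a hypothetical rule (break after a particle when a noun follows) chosen only for illustration.

```typescript
// Simplified, hypothetical token shape; real kuromoji tokens carry more fields.
type Token = { surface_form: string; pos: string };

// Hypothetical rule: break after `current` when it is a particle (助詞),
// it has a preceding token, and the next token is a noun (名詞).
function shouldBreakAfter(
  before: Token | undefined,
  current: Token,
  after: Token | undefined
): boolean {
  if (before === undefined || after === undefined) return false;
  return current.pos === "助詞" && after.pos === "名詞";
}

// Walk the tokens with a before/current/after window and insert "\n" at breaks.
function segment(tokens: Token[]): string {
  let out = "";
  tokens.forEach((current, i) => {
    out += current.surface_form;
    if (shouldBreakAfter(tokens[i - 1], current, tokens[i + 1])) out += "\n";
  });
  return out;
}

// Hand-written tokens so the sketch runs without a dictionary.
const tokens: Token[] = [
  { surface_form: "君", pos: "名詞" },
  { surface_form: "を", pos: "助詞" },
  { surface_form: "空", pos: "名詞" },
  { surface_form: "へ", pos: "助詞" },
  { surface_form: "連れ", pos: "動詞" },
];
const segmented = segment(tokens);
console.log(JSON.stringify(segmented));
```

Only the particle followed by a noun produces a break here; the second particle is followed by a verb, so the phrase stays together.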

Special Lyrics Processing

Parentheses & Quotes: Preserves phrases enclosed in brackets, parentheses, or quotation marks
Repetitive Patterns: Handles repetitive expressions like "Baby Baby Baby" intelligently
Punctuation Sensitivity: Respects natural pauses indicated by punctuation marks
Timeline Preservation: Maintains original timing information while adding segmentation
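
One way to keep timing information while splitting a line is to divide the original time span among the segments in proportion to their character counts. The sketch below assumes that approach purely for illustration; the source does not state how the library distributes time, and splitTimeline and the Timeline type are hypothetical names.

```typescript
// Hypothetical timeline shape mirroring the begin/end/text fields used in Basic Usage.
type Timeline = { text: string; begin: number; end: number };

// Distribute the source duration across parts in proportion to character count.
function splitTimeline(source: Timeline, parts: string[]): Timeline[] {
  const totalChars = parts.reduce((n, p) => n + p.length, 0);
  if (totalChars === 0) return []; // nothing to distribute
  const duration = source.end - source.begin;
  let consumed = 0;
  return parts.map((text) => {
    const begin = source.begin + (duration * consumed) / totalChars;
    consumed += text.length;
    const end = source.begin + (duration * consumed) / totalChars;
    return { text, begin, end };
  });
}

const pieces = splitTimeline(
  { text: "あの花が咲いた", begin: 1, end: 8 },
  ["あの花が", "咲いた"]
);
console.log(pieces);
```

Each piece begins where the previous one ends, so the overall begin and end of the original timeline are unchanged.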

Example Transformations

Input:  "あの花が咲いたのは、そこに種が落ちたからで"
Output: "あの花が\n咲いたのは、\nそこに\n種が落ちたからで"

Input:  "Baby Baby Baby 君を抱きしめていたい"
Output: "Baby\nBaby\nBaby\n君を抱きしめていたい"

Input:  "Oh, I can't help falling in love with you"
Output: "Oh,\nI can't help falling in love with you"

API Reference

LineArgsTokenizer

The main tokenization function that processes timeline data with intelligent segmentation.

function LineArgsTokenizer(options: {
  lineArgs: CreateLineArgs;
  tokenizer: Tokenizer<IpadicFeatures>;
  brakeRules?: TokenizeRule[];
  whitespaceRules?: TokenizeRule[];
}): Promise<Map<number, CreateLineArgs>>

Parameters

lineArgs: Input timeline data containing text and timing information
tokenizer: Kuromoji tokenizer instance for morphological analysis
brakeRules: Optional custom rules for line breaks (defaults to DEFAULT_BRAKE_RULES). The identifier is spelled "brake" in the API, so use that spelling in code.
whitespaceRules: Optional custom rules for whitespace handling (defaults to DEFAULT_WHITESPACE_RULES)

Returns

A Promise that resolves to a Map<number, CreateLineArgs>. Each entry holds segmented line data with natural break points and preserved timing information.
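
As a rough illustration of consuming a result of this shape, the sketch below walks a Map in key order and summarizes each entry. The WordTimeline and LineArgs types are local stand-ins whose field names are assumed from the Basic Usage example, the Map is built by hand rather than returned by LineArgsTokenizer, and describeLines is a hypothetical helper.

```typescript
// Local stand-ins for the @ioris/core types (field names are assumptions).
type WordTimeline = {
  wordID: string;
  text: string;
  begin: number;
  end: number;
  hasNewLine?: boolean;
};
type LineArgs = { position: number; timelines: WordTimeline[] };

// Summarize each entry as "key: text (begin-end)", sorted by Map key.
function describeLines(result: Map<number, LineArgs>): string[] {
  return Array.from(result.entries())
    .sort(([a], [b]) => a - b)
    .map(([key, line]) => {
      const text = line.timelines.map((t) => t.text).join("");
      const begin = line.timelines[0]?.begin;
      const end = line.timelines[line.timelines.length - 1]?.end;
      return `${key}: ${text} (${begin}-${end})`;
    });
}

// Hand-built sample data, inserted out of order on purpose.
const sample = new Map<number, LineArgs>([
  [2, { position: 2, timelines: [{ wordID: "", text: "咲いたのは、", begin: 2, end: 3 }] }],
  [1, { position: 1, timelines: [{ wordID: "", text: "あの花が", begin: 1, end: 2 }] }],
]);
const described = describeLines(sample);
console.log(described);
```

Sorting by key is a defensive choice here, since a Map iterates in insertion order rather than numeric order.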

Custom Rules

You can extend the tokenizer with custom break point rules:

import { LineArgsTokenizer, DEFAULT_BRAKE_RULES } from "@ioris/tokenizer-kuromoji";

// Define custom rules
const customRules = [
  ...DEFAULT_BRAKE_RULES,
  {
    // Break after specific patterns.
    // "特定の文字列" is a placeholder meaning "specific string"; replace it with your own text.
    current: {
      surface_form: [/^(.*特定の文字列).*$/]
    },
    after: {
      pos: [["名詞", false]]
    }
  }
];

// Apply custom rules (lineArgs and tokenizer are obtained as in Basic Usage)
const result = await LineArgsTokenizer({
  lineArgs,
  tokenizer,
  brakeRules: customRules
});

Rule Structure

Rules use the TokenizeRule interface with conditions for:

before: Conditions for the previous token
current: Conditions for the current token
after: Conditions for the next token
length: Length-based constraints
insert: Where to insert the break ("before" or "current")

Development

Building

npm run build        # Full build process
npm run build:types  # TypeScript declarations only
npm run build:esbuild # ESBuild compilation only

Testing

npm test            # Run all tests
npm run test -- --watch  # Watch mode

Code Quality

npm run lint        # Check code quality
npm run format      # Auto-fix formatting

Use Cases

Karaoke Applications: Generate natural phrase breaks for synchronized lyrics display
Music Apps: Improve lyrics readability with intelligent segmentation
Lyrics Analysis: Analyze song structure and linguistic patterns
Subtitle Generation: Create well-formatted subtitles for music videos
Language Learning: Study Japanese lyrics with proper phrase boundaries

Requirements

Node.js 16.0 or higher
TypeScript 5.0 or higher (for development)
@ioris/core ^0.3.2
kuromoji ^0.1.2

License

MIT

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[0.3.3] - 2025-07-27

Changed

Updated @ioris/core dependency from version 0.3.3 to 0.3.4 for latest features and improvements

Removed

Removed ts-node dependency from package.json as it has been replaced by tsx for better Node.js compatibility

Technical

Cleaned up package-lock.json by removing 162 lines of ts-node related dependencies
Reduced package size and improved dependency management by eliminating redundant TypeScript execution tools

[0.3.2] - 2025-07-27

Added

tsx package as development dependency for improved TypeScript execution in CI environments

Changed

Updated @ioris/core dependency to 0.3.3
Updated Node.js version to 24.4 in CI workflows for improved compatibility and latest features
Enhanced CI/CD stability with specific Node.js version pinning instead of using version ranges
Improved build script to use tsx instead of ts-node for better Node.js 24.4 compatibility
Unified Node.js version to 24.4 across all CI workflow jobs (build and publish-npm)

Fixed

Resolved "Unknown file extension .ts" error in CI builds by switching from ts-node to tsx
Fixed Node.js version inconsistency between build and publish jobs in release workflow

Technical

Modified GitHub Actions workflows (development.yml and release.yml) to use Node.js 24.4 specifically
Improved build consistency by removing version range specification ('24.x' → '24.4')
Enhanced TypeScript execution with tsx for better ESM support and Node.js compatibility

[0.3.1] - 2025-07-27

Added

CLAUDE.md file with comprehensive project guidance for Claude Code integration
Detailed API reference section in README.md with TypeScript types and examples
Enhanced "How It Works" section explaining tokenizer behavior and linguistic processing
Example transformations showing input/output of the tokenization process
Development section with build, test, and code quality commands
Use cases section highlighting practical applications (karaoke, music apps, etc.)
Requirements section with specific version dependencies

Changed

Enhanced README.md: Complete overhaul with improved structure and detailed explanations
Updated usage examples to reflect current @ioris/core API (createParagraph instead of Paragraph constructor)
Improved feature descriptions with bullet-point formatting and clear benefits
Enhanced custom rules documentation with practical examples
Better organization of documentation sections for improved readability

Technical

Added comprehensive development guidelines in CLAUDE.md for future contributors
Documented high-level architecture patterns and core components
Included testing strategy and special considerations for lyrics processing
Enhanced project overview with focus on timeline-based processing and rule-based segmentation

[0.3.0] - 2025-07-27

Changed

Updated @ioris/core dependency from ^0.2.0 to ^0.3.0 for latest features and improvements
Refactored build configuration with enhanced TypeScript settings and module resolution
Improved tokenizer implementation with better code organization and maintainability

Technical

Updated development dependencies: Biome 2.1.2, esbuild 0.25.8, TypeScript 5.8.3, Vitest 3.2.4
Enhanced CI/CD pipelines with Node.js 24.x support for better performance and compatibility
Streamlined test suite with improved test structure and reduced complexity
Updated build system configuration for better module handling and optimization

Fixed

Improved TypeScript configuration for better type checking and module resolution
Enhanced build process reliability and output consistency

[0.2.0] - 2024-12-28

Changed

BREAKING: Migrated to ESM-only package format, dropping CommonJS support
Updated @ioris/core dependency to ^0.3.2 (ESM-only)
Modernized TypeScript configuration with "bundler" module resolution
Updated Node.js CI workflows to use version 24.x

Added

Comprehensive README.md with usage examples and API documentation
Support for new @ioris/core createParagraph API (replacing Paragraph constructor)

Technical

Enhanced build system with JSON import attributes for Node.js ESM compatibility
Updated development tooling: Biome 2.1.2, TypeScript 5.8.3, Vitest 3.2.4
Improved test coverage and updated test assertions for new API
Removed CommonJS build artifacts and dual module support

Fixed

TypeScript import resolution errors for @ioris/core types
Test compatibility with new @ioris/core data structures
Build configuration for ESM-only environment

[0.1.14] - 2024-12-27

Improvements

Minor improvements and bug fixes
Dependency updates for security and performance

[0.1.13] - 2024-12-26

Updates

Internal refactoring and optimization
Updated build processes

[0.1.12] - 2024-12-25

Enhancements

Package configuration improvements
Enhanced compatibility

[0.1.11] and earlier

For releases prior to 0.1.12, please refer to the git history and release tags.