mb_strtok implementation in PHP: a string tokenizer for multibyte strings

This is a simple function that implements a kind of mb_strtok() in PHP. As you may be aware, PHP does not provide an mb_strtok() function, so there is no built-in tokenizer for multibyte strings (for example, Unicode text encoded as UTF-8). This is my attempt to solve the problem. There is a known bug: the program halts if the input text is too long (exactly how long is not yet known). Perhaps you can improve it to give a better result.

Thank you and happy coding.

The PHP code, mb_strtok.php:

<html>
    <head>
        <title>String token for MB_STRING</title>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 
    </head>
    <body>
        <h1>String token for MB_STRING</h1>
       <h2><a href="http://kerul.net">kerul.net</a></h2>
 
<form method="GET" ACTION="">
Input text <br>
<textarea name="txtinput" cols=30 rows=10></textarea>
<br>
<input type="submit" >
</form>

<?php
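// Read the submitted text; default to an empty string before the form is sent.
$in=isset($_GET["txtinput"]) ? $_GET["txtinput"] : "";
$inputlen=mb_strlen($in, 'UTF-8');
echo ("Input length: $inputlen characters. <br>\n");

// Delimiters: space, newline, tab, question mark, apostrophe and full stop.
$tokens=mb_strtok(" \n\t?'.", $in);
echo ("List of TOKENS<br>\n");
//echo $tokens;
for($i=0; $i<count($tokens); $i++){
    // Escape each token so that user input cannot inject HTML into the page.
    echo ("[$i] -> ".htmlspecialchars($tokens[$i], ENT_QUOTES, 'UTF-8')." <br> \n");
}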

function mb_strtok($delimiters, $str=NULL)
{
    static $pos = 0; // Keep track of the position on the string for each subsequent call.
    static $string = "";
    static $listtoken=array();
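    // If a new string is passed, reset the static state, including the
    // token list, so tokens from an earlier call are not carried over.
    if($str!==NULL)
    {
        $pos = 0;
        $string = $str;
        $listtoken = array();
    }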
    
    // Initialize the token.
    $token = "";

    while ($pos < mb_strlen($string,'UTF-8'))//loop till end of input string
    {
        
        $char = mb_substr($string, $pos, 1, 'UTF-8');//fetch one character, pos = char position
        $pos++;
        //echo ("Char at $pos => $char <br>\n");//trace character at position

        if(mb_strpos($delimiters, $char, 0, 'UTF-8')===FALSE)//if character is not a delimiter
        {
            $token .= $char;//append character to the current token
        }
        else if($token!=="")
        {
            //on reaching a delimiter, push the token to listtoken;
            //empty tokens from consecutive delimiters are skipped
            array_push($listtoken, $token);
            $token="";//clear the current token
        }
    }
    //push the final token when the input does not end with a delimiter
    if($token!=="")
    {
        array_push($listtoken, $token);
    }
    // Return the list of tokens; this is an empty array when no token was found.
    return $listtoken;
}
?>
  </body>
</html>
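
As an aside to the listing above, a shorter way to get the same kind of token list is to let a Unicode-aware regular expression do the splitting. The sketch below is an alternative under stated assumptions, not the code from this post: the helper name mb_tokenize is hypothetical, and it relies on preg_split() with the u modifier so that both the pattern and the input are treated as UTF-8. Because it makes one pass over the input instead of calling mb_strlen() and mb_substr() for every character, it may also cope better with long text, although that has not been measured here.

```php
<?php
// Hypothetical helper (name and signature are illustrative, not from the post):
// split a UTF-8 string on any of the characters listed in $delimiters.
function mb_tokenize(string $delimiters, string $str): array
{
    // With no delimiters there is nothing to split on.
    if ($delimiters === '') {
        return $str === '' ? array() : array($str);
    }
    // preg_quote() escapes characters that are special in a pattern,
    // so the delimiters can be placed safely inside a character class.
    $class = preg_quote($delimiters, '/');
    // The "u" modifier makes the pattern and subject UTF-8 aware;
    // PREG_SPLIT_NO_EMPTY drops empty tokens between adjacent delimiters.
    $tokens = preg_split('/[' . $class . ']+/u', $str, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false on failure, e.g. for invalid UTF-8 input.
    return $tokens === false ? array() : $tokens;
}

// Usage: same delimiters as the page above.
$tokens = mb_tokenize(" \n\t?'.", "سلام دنيا. Apa khabar?");
// $tokens is array("سلام", "دنيا", "Apa", "khabar")
```

Unlike the static-variable version, this helper keeps no state between calls, so each call tokenizes only the string it is given.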

Here is another version of mb_strtok.php. This time the separators (. , ; :) are also stored in the list of tokens (listtoken).

<html>
    <head>
        <title>String token for MB_STRING</title>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 
    </head>
    <body>
        <h1>String token for MB_STRING</h1>
       <h2><a href="http://kerul.net">kerul.net</a></h2>
 
    <form method="GET" ACTION="">
    Input text <br>
    <textarea name="txtinput" cols=30 rows=10></textarea>
    <br>
    <input type="submit" >
    </form>

    <?php
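    // Read the submitted text; default to an empty string before the form is sent.
    $in=isset($_GET["txtinput"]) ? $_GET["txtinput"] : "";
    $inputlen=mb_strlen($in, 'UTF-8');
    echo ("Input length: $inputlen characters. <br>\n");

    $tokens=mb_strtok(" \n\t\f", $in);//delimit by whitespace only
    echo ("List of TOKENS<br>\n");
    //echo $tokens;
    for($i=0; $i<count($tokens); $i++){
        // Escape each token so that user input cannot inject HTML into the page.
        echo ("[$i] -> ".htmlspecialchars($tokens[$i], ENT_QUOTES, 'UTF-8')." <br> \n");
    }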

    function mb_strtok($delimiters, $str=NULL)
    {
        static $pos = 0; // Keep track of the position on the string for each subsequent call.
        static $string = "";
        static $listtoken=array();
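        // If a new string is passed, reset the static state, including the
        // token list, so tokens from an earlier call are not carried over.
        if($str!==NULL)
        {
            $pos = 0;
            $string = $str;
            $listtoken = array();
        }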
        
        // Initialize the token.
        $token = "";

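        while ($pos < mb_strlen($string,'UTF-8'))//loop till end of input string
        {
            
            $char = mb_substr($string, $pos, 1, 'UTF-8');//fetch one character, pos = char position
            
            echo ("Char at $pos => ".htmlspecialchars($char, ENT_QUOTES, 'UTF-8')." <br>\n");//trace character at position

            
            if(mb_strpos($delimiters, $char, 0, 'UTF-8')===FALSE)//if character is not a whitespace delimiter
            {
                if($char=="." || $char==";"||$char==":"||$char==","){
                    //punctuation ends the current token: store the token first,
                    //then store the punctuation mark as a token of its own
                    if($token!==""){
                        echo "Token detected ".htmlspecialchars($token, ENT_QUOTES, 'UTF-8')." <br>\n";
                        array_push($listtoken, $token);
                        $token="";//clear the current token
                    }
                    array_push($listtoken, $char);
                }else{
                    $token .= $char;//append character to the current token
                }
            }
            else if($token!=="")
            {
                //on reaching a delimiter, push the token to listtoken
                echo "Token detected ".htmlspecialchars($token, ENT_QUOTES, 'UTF-8')." <br>\n";
                array_push($listtoken, $token);
                $token="";//clear the current token
            }
            $pos++;
        }
        //push the final token when the input does not end with a delimiter
        if($token!==""){
            array_push($listtoken, $token);
        }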
        // Return the list of tokens; this is an empty array when no token was found.
        return $listtoken;
        
    }
    ?>
      </body>
    </html>
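
The idea of keeping the separators as tokens can also be sketched as a standalone function, without the HTML page or the static variables. This is only an illustration under assumptions: the name mb_tokenize_keep_punct and its default parameters are hypothetical, and the input is assumed to be valid UTF-8. It splits the string into characters once with preg_split('//u', ...) and then walks the resulting array, flushing the current word before each punctuation mark so the tokens come out in reading order.

```php
<?php
// Hypothetical helper (name and defaults are illustrative, not from the post).
function mb_tokenize_keep_punct(
    string $str,
    string $whitespace = " \n\t\f",
    string $punct = ".,;:"
): array {
    // Split the UTF-8 string into single characters once, up front.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return array(); // preg_split() fails on invalid UTF-8
    }

    $tokens = array();
    $token = "";
    foreach ($chars as $char) {
        $isSpace = mb_strpos($whitespace, $char, 0, 'UTF-8') !== false;
        $isPunct = mb_strpos($punct, $char, 0, 'UTF-8') !== false;

        if ($isSpace || $isPunct) {
            if ($token !== "") {
                $tokens[] = $token; // flush the word collected so far
                $token = "";
            }
            if ($isPunct) {
                $tokens[] = $char;  // keep the separator as its own token
            }
        } else {
            $token .= $char;        // still inside a word
        }
    }
    if ($token !== "") {
        $tokens[] = $token;         // last word, if no trailing delimiter
    }
    return $tokens;
}

$tokens = mb_tokenize_keep_punct("satu, dua; tiga.");
// $tokens is array("satu", ",", "dua", ";", "tiga", ".")
```

Whitespace is discarded while punctuation is kept, which matches the behaviour described for the second listing.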

mb_strtok implementation in PHP for Arabic unicode strings

Modified from the code at http://www.anastis.gr/mb_strtok-a-php-implementation/


about blog.kerul.net

This is the official blog for Khirulnizam Abd Rahman, the co-author of the book Android Development Tools for Eclipse. Khirulnizam is a Computer Science lecturer in the Faculty of Information Science and Technology, KUIS. He has been handling programming classes since 2000.

He started publishing Android apps in 2010. His apps include Malay Proverb Dictionary (Peribahasa) and m-Mathurat, and the Hijra of the Prophet Muhammad app is also in the Play Store. The most recently published app is SmartSolat.com.

He is also familiar with PHP and Java, and is currently venturing into Laravel.
