Skip to main content

mb_strtok implementation in PHP–String tokenizer for Multibyte

This is a simple function to implement some kind of mb_strtok() in PHP. As maybe you all are aware the mb_strtok function does not available for multibyte string (aka Unicode string). So this is my attempt to solve the problem. Anyway, there are bugs where the program halt if the input text is too long (how long? not sure yet). Maybe you could improve to provide better result.

Thank you and happy coding.

string-tokenizer-for-multibyte

The PHP code mb_strtok.php ;

<html>
<head>
<title>String token for MB_STRING</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<h1>String token for MB_STRING</h1>
<h2><a href="http://kerul.net">kerul.net</a></h2>

<form method="GET" ACTION="">
Input text <br>
<textarea name="txtinput" cols=30 rows=10></textarea>
<br>
<input type="submit" >
</form>

<?php
$in=$_GET["txtinput"];
$inputlen=mb_strlen($in, 'UTF-8');
echo ("Input length: $inputlen characters. <br>\n");

$tokens=mb_strtok(" /n/t?\'.", $in);
echo ("List of TOKENS<br>\n");
//echo $tokens;
for($i=0; $i<count($tokens); $i++){
echo ("[$i] -> ".$tokens[$i] ." <br> \n");
}

function mb_strtok($delimiters, $str=NULL)
{
static $pos = 0; // Keep track of the position on the string for each subsequent call.
static $string = "";
static $listtoken=array();
// If a new string is passed, reset the static parameters.
if($str!=NULL)
{
$pos = 0;
$string = $str;
}

// Initialize the token.
$token = "";

while ($pos < mb_strlen($string,'UTF-8'))//loop till end of input string
{

$char = mb_substr($string, $pos, 1);//fetch one character, pos = char position
$pos++;
//echo ("Char at $pos => $char <br>\n");//trace character at position

if(mb_strpos($delimiters, $char)===FALSE)//if character is not delimeter
{
$token .= $char;//put character in the token node
}
else
{
//if arrive at delimeter, push token to listtoken
array_push($listtoken, $token);
$token="";//clear the token node
}
}
// return the list of tokens
if ($listtoken!="")
{
return $listtoken;
}
else
{
return false;
}
}
?>
</body>
</html>


There is another one, this time the separator (.,;:) will be stored in the list of token (listtoken).


<html>
<head>
<title>String token for MB_STRING</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<h1>String token for MB_STRING</h1>
<h2><a href="http://kerul.net">kerul.net</a></h2>

<form method="GET" ACTION="">
Input text <br>
<textarea name="txtinput" cols=30 rows=10></textarea>
<br>
<input type="submit" >
</form>

<?php
$in=$_GET["txtinput"];
$inputlen=mb_strlen($in, 'UTF-8');
echo ("Input length: $inputlen characters. <br>\n");

$tokens=mb_strtok(" /n/t/f", $in);//delimeter by whitespace only
echo ("List of TOKENS<br>\n");
//echo $tokens;
for($i=0; $i<count($tokens); $i++){
echo ("[$i] -> ".$tokens[$i] ." <br> \n");
}

function mb_strtok($delimiters, $str=NULL)
{
static $pos = 0; // Keep track of the position on the string for each subsequent call.
static $string = "";
static $listtoken=array();
// If a new string is passed, reset the static parameters.
if($str!=NULL)
{
$pos = 0;
$string = $str;
}

// Initialize the token.
$token = "";

while ($pos < mb_strlen($string,'UTF-8'))//loop till end of input string
{

$char = mb_substr($string, $pos, 1, 'UTF-8');//fetch one character, pos = char position

echo ("Char at $pos => $char <br>\n");//trace character at position


if(mb_strpos($delimiters, $char)===FALSE)//if character is not delimeter
{
if($char=="." || $char==";"||$char==":"||$char==","){
echo "Token detected $token <br>\n";
array_push($listtoken, $char);
//$token="";//clear the token node
}else{
$token .= $char;//put character in the token node
}
}
else
{
//if arrive at delimeter, push token to listtoken
echo "Token detected $token <br>\n";
array_push($listtoken, $token);
$token="";//clear the token node
}
$pos++;
}
return $listtoken;
// return the list of tokens
if ($listtoken!="")
{
return $listtoken;
}
else
{
return false;
}

}
?>
</body>
</html>


mb_strtok implementation in PHP for Arabic unicode strings


Modified using the code by http://www.anastis.gr/mb_strtok-a-php-implementation/

Comments

Popular posts from this blog

Simple search – result appears in the same page

This PHP code is a simple search example. It needs the user to enter a name and the PHP script will to the searching by the name provided by the user in the text box. <form name='search' method='get' action=''> Insert the firstname <input type='text' name='txtname'> <input type='submit' value='search name'> </form> <?php //database connect $db=mysqli_connect("localhost","root","","mycompanyhr"); //checking database connection if ($db==false) { echo "Connect failed: ". mysqli_connect_error($db); exit(); } else { echo "Connection successful"; } $n=$_GET['txtname']; $sql="select EMPNO, FIRSTNAME, LASTNAME, WORKDEPT, PHONENO from employee where FIRSTNAME like '%$n%' "; $rs=mysqli_query($db,$sql); if($rs==false){ echo ("Query cannot be executed!<br>"); echo ("SQL Error : ".mysqli_error($db)); } else...

Several English proverbs and the Malay pair

Or you could download here for the Malay proverbs app – https://play.google.com/store/apps/details?id=net.kerul.peribahasa English proverbs and the Malay pair Corpus Reference: Amir Muslim, 2009. Peribahasa dan ungkapan Inggeris-Melayu. DBP, Kuala Lumpur http://books.google.com.my/books/about/Peribahasa_dan_ungkapan_Inggeris_Melayu.html?id=bgwwQwAACAAJ CTRL+F to search Proverbs in English Definition in English Similar Malay Proverbs Definition in Malay 1 Where there is a country, there are people. A country must have people. Ada air adalah ikan. Ada negeri adalah rakyatnya. 2 Dry bread at home is better than roast meat home's the best hujan emas di negeri orang,hujan batu di negeri sendiri Betapa baik pun tempat orang, baik lagi tempat sendiri. 3 There's no accounting for tastes We can't assume that every people have a same feel Kepala sama hitam hati lain-lain. Dalam kehidupan ini, setiap insan berbeza cara, kesukaan, perangai, tabia...

Bootstrap Template for PHP database system - MyCompanyHR

HTML without framework is dull. Doing hard-coded CSS and JS are quite difficult with no promising result on cross platform compatibility. So I decided to explore BootStrap as they said it is the most popular web framework. What is BootStrap? - Bootstrap is the most popular HTML, CSS, and JavaScript framework for developing responsive, mobile-first web sites. (  http://www.w3schools.com/bootstrap/   ) Available here -  http://getbootstrap.com/ Why you need Flat-UI? Seems like a beautiful theme to make my site look professional. Anyway you could get variety of BootStrap theme out there, feel free to select here  http://bootstraphero.com/the-big-badass-list-of-twitter-bootstrap-resources/ Flat-UI is from DesignModo -   http://designmodo.com/flat/ Web Programming MyCompanyHR – PHP & MySQL mini project (with Boostrap HTML framework) Template 1: Template for the Lab Exercise. This is a project sample of a staff record management system. It h...

Mobile Apps using HTML interface with jQueryMobile

This is the main tutorial for developing a mobile web app using the HTML  interface with jQueryMobile. The series contains tutorial on how to use HTML and jQueryMobile to construct the app’s interface, PHP+MySQL to run the back-end, AJAX, using phoneGap/Cordova to generate Android APK, and lastly to upload to Google Play Store. These are the link of completed tutorials related to Mobile Web Apps Development Mobile Apps using HTML interface with JQUERYMOBILE (this page) Complete code sample of Mobile Web Apps with FRONT-END and BACK-END Using AJAX to populate data from mobile web interface Generating APK using BUILD.PHONEGAP.com Generating Android Project using CORDOVA Publishing to Google Play Store Mobile Web Apps development TRAINING The sample app published in Google Play Store jQueryMobile is a JavaScript + CSS framework born from jQuery. The best place to learn jQueryMobile is by accessing the official website at http://demos.jquerymobile.com/ As the na...

The Challenges of Handling Proverbs in Malay-English Machine Translation – a research paper

*This paper was presented in the 14th International Conference on Translation 2013 ( http://ppa14atf.usm.my/ ). 27 – 29 August 2013, Universiti Sains Malaysia, Penang. The PDF version is here: www.scribd.com/doc/163669571/Khirulnizam-The-Challenges-of-Automated-Detection-and-Translation-of-Malay-Proverb The test data is here: http://www.scribd.com/doc/163669588/Test-Data-for-the-Research-Paper-the-Challenges-of-Handling-Proverbs-in-Malay-English-Machine-Translation Khirulnizam Abd Rahman, Faculty of Information Science & Technology, KUIS Abstract: Proverb is a unique feature of Malay language in which a message or advice is not communicated indirectly through metaphoric phrases. However, the use of proverb will cause confusion or misinterpretation if one does not familiar with the phrases since they cannot be translated literally. This paper will discuss the process of automated filtering of Malay proverb in Malay text. The next process is translation. In machine translatio...