Wednesday, October 2, 2013

An Automatic News-Clipping System for Headline Daily (《頭條日報》)

Sita Recognition in Headline Daily

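It was only after Sita passed away that I learned Hong Kong had such a fine singer, and only then did I start following news about her. For the clippings in the Sita Chan Completion Project (《陳僖儀補完計劃》), I want to include coverage from her lifetime, at least from Headline Daily and am730, both of which keep a large archive of back issues online. Checking those issues one day at a time would take more time than I can spare, so once again the best option is to let the computer do it for me!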

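To test whether the idea was feasible, I started with Headline Daily as an experiment, because each of its pages is split into a picture layer (JPG) and a text layer (PNG). A program that reads the text in the PNG layer can detect the right characters fairly reliably, for example 僖 and Sita. When an article turns out to contain one of these keywords, the program goes on to download the JPG picture layer and merges the two layers into a single high-resolution PNG image.

I spent an evening on a few simple tests, and once they confirmed that the program could pick out the keywords, I started the full search. The computer takes three hours to check one month of issues. Sita debuted in 2010, which is about 40 months back, so the whole run needs roughly 120 hours (40 months × 3 hours).

The image above is one of the results the computer found: part of page 25 of Headline Daily for March 18, 2013. That page has no photo of Sita; the keyword 陳僖儀 appears just once in the article. Had I been checking by hand, I doubt I would have noticed this report. Below is the keyword-detection code: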
//-----------------------------------------------------------------
//  Character recognition
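//  僖 (MingLiU typeface), 17 x 17 pixels
//  1 = same colour as the window's top-left pixel, 0 = different colour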
$characterPattern = array(
 1,1,1,1,0,0,1,1,1,1,0,0,1,1,1,1,1,
 1,1,1,0,0,0,1,1,1,1,0,0,1,1,0,0,0,
 1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 1,1,1,0,0,1,1,1,1,1,0,0,1,1,0,0,1,
 1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,
 1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,
 1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,
 1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,
 0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,1,1,
 0,0,1,0,0,1,1,0,0,0,0,1,0,0,0,0,0,
 1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 1,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,1,
 1,1,1,0,0,1,1,0,0,1,1,1,1,0,0,0,1,
 1,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,1,
 1,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,1,
 1,1,1,0,1,1,1,0,0,1,1,1,1,0,0,1,1);
$patternWidth = 17;
$patternHeight = 17;

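//  $image is assumed to be the page's PNG text layer, already loaded
//  as a GD image (for example with imagecreatefrompng())
//  $sx, $sy: margins skipped on every edge of the page
$sx = 50;
$sy = 50;
$width = imagesx($image);
$height = imagesy($image);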

for ($y=$sy; $y<$height-$sy; $y++)  {
 for ($x=$sx; $x<$width-$sx; $x++)  {

  //  40, transparency index; the window's top-left pixel is used as the colour key
  $x1 = $x;
  $y1 = $y;
  $colorKey = imagecolorat($image, $x1, $y1);

  $offset = 0;
  $notMatch = 0;
  for ($py=0; $py<$patternHeight; $py++)  {
   for ($px=0; $px<$patternWidth; $px++)  {

    $x1 = $x+$px;
    $y1 = $y+$py;
    $colorIndex = imagecolorat($image, $x1, $y1);
    //  1 if this pixel has the colour key's colour, otherwise 0
    $flag2 = ($colorIndex == $colorKey) ? 1 : 0;

    //  Leave both pattern loops at the first mismatching pixel
    $flag1 = $characterPattern[$offset];
    if ($flag1 != $flag2)  {
     $notMatch = 1;
     break 2;
    }

    $offset++;
   }
  }
  if ($notMatch == 1)  {continue;}

  //-----------------------------------------------------------------
  //  Match with pattern
  echo("\nMatch at ($x, $y)");

  //  Download JPG and merge to a single PNG image
 }
}
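
The scan above can be tried out without any newspaper page. The following is only a minimal sketch of the same window-by-window comparison, run on a plain PHP array instead of a GD image; the function name `findPattern`, the toy 2 x 2 pattern and the toy grid are all made up for illustration and are not part of my actual program.

```php
<?php
// Hypothetical helper (not from the original program): returns every (x, y)
// at which $pattern matches a window of $grid. As in the listing above,
// the window's top-left value acts as the colour key: pattern value 1 means
// "same value as the key", 0 means "different value".
function findPattern(array $grid, array $pattern, int $patternWidth, int $patternHeight): array
{
    $matches = [];
    $height = count($grid);
    $width = $height > 0 ? count($grid[0]) : 0;

    for ($y = 0; $y + $patternHeight <= $height; $y++) {
        for ($x = 0; $x + $patternWidth <= $width; $x++) {
            $colorKey = $grid[$y][$x];
            $offset = 0;
            $matched = true;
            for ($py = 0; $py < $patternHeight && $matched; $py++) {
                for ($px = 0; $px < $patternWidth; $px++) {
                    $flag = ($grid[$y + $py][$x + $px] === $colorKey) ? 1 : 0;
                    if ($flag !== $pattern[$offset]) {
                        $matched = false;   // stop at the first mismatch
                        break;
                    }
                    $offset++;
                }
            }
            if ($matched) {
                $matches[] = [$x, $y];
            }
        }
    }
    return $matches;
}

// Toy pattern, stored row by row like $characterPattern.
$pattern = [1, 0,
            0, 1];

// Toy "page": 9 stands for the background colour, 3 for ink.
$grid = [
    [9, 9, 9, 9],
    [9, 9, 3, 9],
    [9, 3, 9, 9],
    [9, 9, 9, 9],
];

$matches = findPattern($grid, $pattern, 2, 2);
foreach ($matches as [$mx, $my]) {
    echo "Match at ($mx, $my)\n";   // prints "Match at (1, 1)"
}
```

Because every pixel is compared with the window's own top-left pixel rather than with a fixed colour, the match depends only on which pixels share the key's colour, not on the actual colour values.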

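The listing stops at the "Download JPG and merge" comment. As a rough idea of what the merge step could look like with GD, here is a sketch under stated assumptions: the two layers are in-memory stand-ins rather than downloaded files (real pages would presumably be loaded with `imagecreatefromjpeg()` and `imagecreatefrompng()`), and the variable names, sizes and output path are made up for illustration.

```php
<?php
// Stand-in for the JPG picture layer: a plain light-grey page.
$photo = imagecreatetruecolor(100, 60);
imagefill($photo, 0, 0, imagecolorallocate($photo, 200, 200, 200));

// Stand-in for the PNG text layer: fully transparent, with one black bar
// playing the role of printed text.
$text = imagecreatetruecolor(100, 60);
imagealphablending($text, false);   // store alpha values as given
imagefill($text, 0, 0, imagecolorallocatealpha($text, 0, 0, 0, 127));
imagefilledrectangle($text, 10, 10, 30, 12, imagecolorallocatealpha($text, 0, 0, 0, 0));

// Composite the text layer over the picture layer. With alpha blending
// enabled on the destination, transparent pixels leave the picture untouched.
imagealphablending($photo, true);
imagecopy($photo, $text, 0, 0, 0, 0, imagesx($text), imagesy($text));

// Write the merged page as a single PNG (hypothetical output location).
$outputPath = sys_get_temp_dir() . '/merged_page.png';
imagepng($photo, $outputPath);

// Sample one pixel under the "text" and one in the untouched background.
$inkPixel = imagecolorat($photo, 20, 11) & 0xFFFFFF;
$bgPixel = imagecolorat($photo, 50, 50) & 0xFFFFFF;
```

PNG is lossless, so the text stays sharp in the merged file, which is presumably why it is the better output format here.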