"Improved Synonym Approach to Linguistic Steganography" Design and Proof-of-Concept Implementation


Aniket M. Nanhe, B.Tech Comp. Sci., College of Engineering, Pune, India. nanhe.aniket@gmail.com
Mayuresh P. Kunjir, B.Tech Comp. Sci., College of Engineering, Pune, India. mayuresh.kunjir@gmail.com
Sumedh V. Sakdeo, B.Tech Comp. Sci., College of Engineering, Pune, India. sumedhsakdeo@gmail.com

Abstract

This paper develops a linguistically robust linguistic steganography approach using synonym replacement, which converts a message into semantically innocuous text. Drawing upon linguistic criteria, this approach uses word replacement, with substitution classes based on traditional word replacement features (syntactic categories and subcategories), as well as features under-exploited in earlier works: semantic criteria, inflectional class, and frequency statistics. The original message is hidden through use of a cover text which is shared between sender and receiver. This paper also presents a new approach of sharing the cover text and changing it periodically to make the algorithm safe from steganalysis.

1. Introduction

While current encryption techniques are sufficiently advanced to make code-breaking practically impossible, one major drawback of current encryption methods is the ease of identifying an encrypted text: it does not resemble natural text in any way. Steganography attempts to answer this need, acting to conceal the message's existence in order to transmit encrypted messages without arousing suspicion. Steganography is the art and science of concealing a secret message inside a cover object. When the secret message is in digital form, there is an enormous choice of cover objects. For instance, one could hide digital information inside images, audio, binaries, videos, texts, etc. Steganography can be classified by the type of cover object used: images, audio, binaries and, as in our case, natural language text.

The study of this subject in scientific literature dates back to 1983, when Simmons formulated it as “The Prisoners' Problem”. It says, “Alice and Bob are in jail and wish to hatch an escape plan; their entire communication passes through the warden Wendy; and if Wendy detects any suspicious messages, she will frustrate their plan by throwing them into solitary confinement. So they must find some way of hiding their secret message in an innocuous-looking cover text”.

Information hiding has taken one form in image-based steganography, utilizing minimal changes in pixels or watermarking techniques. While text-based messages have also been used within image-based maneuvers, by modifying the white space between letters and by minutely changing the fonts, this has proved less fruitful because text can be retyped and is often altered in the conversion from one program version or platform to another. Proving more productive, as well as resistant to the difficulties surrounding the retyping of text-based messages, is lexical steganography, which uses linguistic structures to disguise encryption of text messages such that the appearance of the message remains semantically and syntactically innocent.

This paper presents a new approach to linguistic steganography using synonym replacement, a linguistically informed alternative to existing text-based steganography systems. This approach adds extra features such as inflectional class and frequency statistics, thereby producing semantically and syntactically correct text which appears more natural to the human eye.

There is one more area which is under-exploited in earlier works: sharing of the cover text and changing it periodically. The cover text is critical, as it is the text which is transformed and sent over the communication channel. Our aim is to make steganalysis difficult by altering the cover text frequently so that no suspicion is aroused. This paper discusses a new approach to achieving this by exploiting the part of the cover text which remains unchanged in the transformation.

This paper reviews past research in Section 2, presents the basic algorithm in Section 3, and discusses cover text selection in Section 4.
2. Past Research

Lexical steganography has had three main veins of research: watermarking techniques that manipulate sentences through syntactic transformations (a.k.a. ontological techniques), word replacement systems both with and without cover texts, and context-free grammars such as NICETEXT. We will see the work carried out in each of these techniques.

2.1. Syntactical Steganography

The approaches to syntactical steganography exploit the syntactic structures of a text. They make use of Context Free Grammars (CFG) to build syntactically correct sentences. The famous CFG-based Mimicry algorithm developed by Peter Wayner [1] comes under this category. Another famous algorithm, NICETEXT by Chapman et al. [8], is also based on CFG.

NICETEXT uses the cover text simply as a source of syntactic patterns: by running the cover text through a part-of-speech tagger, NICETEXT obtains a set of "sentence frames," e.g. [(noun) (verb) (prep) (det) (noun)] for ‘I sat in the tree.’ It also compiles a lexicon of words found in the cover text via part-of-speech tags, with each word in the lexicon associated (arbitrarily) with either of the binary digits 0 or 1. In encryption, the plain text message is converted into a sequence of binary digits. A random sentence frame is chosen and the part-of-speech tags in it are replaced by words in the lexicon according to the sequence of binary digits.

Although NICETEXT produces syntactically correct sentences, it fails on the count of semantics. The output text is almost always a set of ungrammatical and semantically anomalous sentences.

Another factor worth considering is the density of encryption within the cover text. Ideally, the cover text should work to hide the word frequencies and syntactic structure of the hidden plain text message. Steganographic goals encourage sparse encryption, which does not alter a majority of the text by the word replacement. NICETEXT encryption is maximally dense: every word within the final encrypted cover text conveys hidden information. Given that each encrypted word is part of the original information-bearing message and common word usage patterns are unavoidable, this is problematic for the original steganographic intent: avoiding detection and producing naturalistic text.

In general, syntactical steganography techniques produce text that is syntactically well-formed but not semantically well-formed, as can be seen from Chomsky's famous sentence “Colorless green ideas sleep furiously”.

2.2. Lexical Steganography

In lexical steganography, lexical units of natural language text, such as words, are used to hide secret bits. The most straightforward subliminal channel in natural language is probably the choice of words. A word can be replaced by a synonym, and which synonym is chosen from the list depends upon the secret bits. For example, consider the sentence:

    “Pune is a nice little city”

Now, suppose the list of synonyms for “nice” is {nice, wonderful, great, decent}. Each of the synonyms can be represented by two bits, as shown in the table:

    Word         Code
    Nice         00
    Wonderful    01
    Great        10
    Decent       11

    Table 1: Lexical code table

Depending upon the input secret bits, the appropriate synonym for ‘nice’ is selected and put in the stego text. So the possible stego texts are:

    a)   Pune is a nice little city.
    b)   Pune is a wonderful little city.
    c)   Pune is a decent little city.
    d)   Pune is a great little city.

The lexical techniques produce better quality text than syntactical techniques, and it is hard for statistical attacks to detect the presence of a hidden message. The replacements are the critical part of these techniques. To give an example, in the synonym replacement approach mentioned above, some words can have more than one sense. (The noun “bank” has two senses: “a long pile or heap” or “an institution for receiving, lending, and safeguarding
money and transacting other financial business”.) If we do not use a synonym having the same sense as the original word, the output will look suspicious.

    1. Bring those instruments.
    2. Bring those tool.

A further impediment to synonym-based word replacement is inflection classes (i.e., legal and illegal word combinations). Sentence (2) replaces the plural noun ‘instruments’ of (1) by its singular synonym ‘tool’, thus making the sentence grammatically incorrect.

2.3. Ontological Technique

Of the techniques considered herein, the ontological one is the most sophisticated approach with respect to modeling semantics. Instead of implicitly leaving semantics intact by replacing only synonymous words while embedding information into an innocuous text, an explicit model for “meaning” is used to evaluate equivalence between texts.

Atallah et al. [4] watermark texts by manipulating and exploiting the syntax (formal word order and grammatical voice) of sentences. Through common generative transformations (clefting (4), adjunct fronting (5), passivization (6), adverbial insertion (7)), the syntax of each sentence is altered:

3. The lion ate the food yesterday. (original sentence)
4. It was the lion that ate the food yesterday.
5. Yesterday, the lion ate the food.
6. The food was eaten by the lion yesterday.
7. Surprisingly, the lion ate the food yesterday.

The ontological techniques do have some problems, though. The transformations sometimes affect the semantics of a text. Newer theories of language argue for the interconnectedness of the semantic and syntactic levels, demonstrating that the syntactic pattern is itself inherently meaningful. Furthermore, statistically, various syntactic structures (word orders) are not equal in distribution: different genres of text have wildly different syntactic structures, and replacing such structures freely could create a text which is trivially broken by statistical methods, which is a security threat to the program.

3. Basic Algorithm

The algorithm replaces the nouns, adjectives, verbs and adverbs of the cover text by their respective synonyms. A semantically and orthographically correct text is used as cover text to hide messages. We use a word dictionary to get synonyms. The input text to be hidden is compressed using the Huffman compression algorithm, and a string of bits is generated. The input bits are consumed in the selection of synonyms.

The algorithm works in stages. The stages of the algorithm are:

3.1. Part-of-speech Tagging

The basic requirement of this algorithm is that a cover text be shared between sender and receiver. Natural language processing is done on the cover text in order to determine the part of speech of each word. This is an essential part of the algorithm, as we replace only common nouns, adjectives, adverbs and verbs in the cover text. A part-of-speech tagger is applied to the cover text and outputs each word followed by its part of speech.

3.2. Input Compression

The input secret text is treated as a binary bit string. These bits are used in the synonym replacement stage to choose the synonym that is to be used in place of a word. Using the standard ASCII representation, we need 7 bits for each character of input, so we need to hide (7 * ‘number of characters’) bits. We can improve on this number by exploiting characteristics of the English language. Some characters appear more frequently in normal English text than others. If we use fewer bits for such characters, we can easily reduce the number of input bits to be hidden. To achieve this, we use the Huffman compression algorithm.

On average, Huffman coding reduces the size of the input bit string to 33% of the original.

3.3. Synonym Replacement

This stage is the core of the algorithm. The actual task of hiding bits in a cover text is carried out here. The inputs to this stage are the compressed bit string and the tagged cover text at the sender's side; the receiver needs the encoded text (a.k.a. stego text) and the tagged cover text. This stage makes use of dictionaries to find replacements for a word.

3.3.1 Dictionaries

We use three dictionaries here:

a. WordNet 2.1 English dictionary
WordNet is an open source English dictionary containing almost all English words. We can get all synonyms of a word using this dictionary. A word may have more than one sense in which it can be used. WordNet provides output in terms of ‘synsets’. A synset defines a sense of a word. Each synset contains all the synonyms of the given word which can be used in that particular sense.

    e.g.: Synonym sets of the word “travelling” are:
    1. travel, go, move, locomote
    2. travel, journey
    3. travel, trip, jaunt
    4. travel, journey

WordNet also provides the frequency of occurrence of a word in normal English text. This information is very useful in our algorithm, as we use it to encode the synonyms. Huffman coding is used here again so that more frequently occurring synonyms get shorter codes and vice versa. This is important from the word sense disambiguation aspect, wherein WordNet's 1st Sense assigns some frequency to each synonym of the word. This frequency is assigned depending upon the use of that synonym for that particular word in normal English text. Assigning a shorter code to the most frequently used synonym helps maintain the proper word sense.

b. Verb Inflection dictionary

A verb can have many inflected forms, such as the present participle, past tense, past participle and base form. When we look up synonyms for a verb, WordNet always gives them in their base form, irrespective of the inflected form of the input verb. If we replace a verb by a synonym in its base form, the output sentence becomes grammatically incorrect. To avoid this situation, we use a separate verb dictionary.

We maintain a list of all verbs (about 16,064 verbs) along with all their inflected forms in a separate file. Before replacing a verb by its synonym, we check whether the inflected forms of the original verb and its synonym match. If they do not match, we select the appropriate inflected form of the synonym verb from the verb dictionary and replace the original by it.

    e.g.: Tagged Cover Text:
A/DT group/NN of/IN frogs/NNS were/VBD traveling/VBG through/IN the/DT woods/NNS ,/,

Suppose the synonym “go” is selected for replacement from WordNet as a synonym of “travelling”. The stego text should contain the present participle form of “go” to maintain the tense of the sentence. So the verb dictionary provides this inflected form of the base form “go” for the actual replacement.

c. Noun Inflection dictionary

A noun can be in either singular or plural form. As is the case with verbs, WordNet always gives the synonyms of a noun in their singular form. If we replace a plural noun by a singular synonym, we get a grammatically incorrect sentence. So we use a separate noun dictionary to avoid this situation.

We maintain a list of nouns (about 89,051 nouns) in their singular as well as plural forms. Before replacing a noun by its synonym, we check whether both are in the same form. If not, we select the appropriate form of the synonym noun from the noun dictionary and replace the original by it.

    e.g.: Tagged Cover Text:
A/DT group/NN of/IN frogs/NNS were/VBD traveling/VBG through/IN the/DT woods/NNS ,/,

The noun “frogs” is the plural form of the word “frog”. Suppose the synonym “Gaul” is selected by the input bit string; we need to ensure that it is plural, as ‘frogs’ is a plural. So we use this dictionary to obtain the singular and plural forms of the nouns present in the dictionary.

3.4. Synonym Replacement

Figure 1 shows the mechanism that is carried out at the sender end. As can be seen, the tagged text obtained from stage 1 is scanned. Whenever a noun, adjective, verb or adverb is found, its synonyms are obtained from WordNet. All synonyms are put in a frequency table; the frequencies are obtained from WordNet. Huffman coding is done on this frequency table to obtain codes for all synonyms. By using frequencies, we also achieve word sense disambiguation, as more frequently used senses get shorter codes and so have a higher probability of being used.

After building the encode table, we use the input bit string to select one of the synonyms from the table. If we are replacing a verb, the inflected forms are checked and the appropriate form of the verb is obtained from the verb dictionary. Similarly, if we are replacing a noun, the singular or plural form is selected from the noun dictionary in accordance with the original noun's form. Otherwise the selected synonym is put in place of the original word. The Appendix shows an example of stego text generated from a cover text by hiding the secret text.
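The synonym selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `synonyms` table and its frequencies are made-up stand-ins for WordNet lookups, the names `huffman_codes`, `embed` and `extract` are hypothetical, inflection handling is omitted, and the receiver is assumed to know the message length `n_bits` (the paper does not say how the end of the bit string is signalled). Both ends build the same code table because ties are broken deterministically.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Return {word: bit string}; more frequent words get shorter codes."""
    if len(freqs) < 2:
        return {}  # a word with no alternative cannot carry any bits
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {w: ""}) for w, f in sorted(freqs.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, zero = heapq.heappop(heap)
        f1, _, one = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in zero.items()}
        merged.update({w: "1" + c for w, c in one.items()})
        heapq.heappush(heap, (f0 + f1, next(tie), merged))
    return heap[0][2]


def embed(bits, cover_words, synonyms):
    """Sender side: pick, for each replaceable word, the synonym whose
    code is a prefix of the bits that are still to be hidden."""
    stego, pos = [], 0
    for word in cover_words:
        codes = huffman_codes(synonyms.get(word, {}))
        if not codes or pos >= len(bits):
            stego.append(word)  # nothing to hide here, keep the word
            continue
        longest = max(len(c) for c in codes.values())
        # Zero-pad the tail so that exactly one code matches the window.
        window = bits[pos:pos + longest].ljust(longest, "0")
        choice = next(w for w, c in codes.items() if window.startswith(c))
        stego.append(choice)
        pos += len(codes[choice])
    if pos < len(bits):
        raise ValueError("cover text too short for the message")
    return stego


def extract(stego_words, cover_words, synonyms, n_bits):
    """Receiver side: rebuild the same tables and read the codes back."""
    bits = []
    for cover, stego in zip(cover_words, stego_words):
        codes = huffman_codes(synonyms.get(cover, {}))
        if stego in codes:
            bits.append(codes[stego])
    return "".join(bits)[:n_bits]  # drop padding and unused capacity


# Hypothetical frequency table; real values would come from WordNet.
synonyms = {
    "nice": {"nice": 40, "great": 30, "wonderful": 20, "decent": 10},
    "city": {"city": 50, "metropolis": 5},
}
cover = "Pune is a nice little city".split()
stego = embed("101", cover, synonyms)
print(" ".join(stego))
print(extract(stego, cover, synonyms, n_bits=3))
```

Because the codes are variable-length, a frequent synonym consumes fewer secret bits than a rare one, which is the trade-off between capacity and naturalness that the frequency table is meant to capture.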
Figure 1: Sender end

Figure 2: Receiver End

The reverse algorithm is carried out at the receiver end. Figure 2 shows the mechanism. The tagged text is scanned for nouns, adjectives, verbs and adverbs. When one is found, its synonyms are obtained from WordNet. As at the sender end, the frequency table and then the encode table are formed using the frequencies of the words.

At the same time, the stego text is also scanned. From the stego text, we obtain the synonym selected at the sender's side. Then, from the table, its bits are obtained and appended to the output string. The output string, when decompressed, produces the original secret text.

4. Cover Text Selection

Steganalysis is the identification of the existence of a secret message. This follows from the fact that the field of steganography aims to conceal the existence of a message, not to scramble it. Our approach uses word replacement in the cover text. As only a few words are replaced by their synonyms, the majority of the text remains unchanged. If the same cover text is used again and again, a “Known Stego-Text” attack, in which an intruder keeps track of the text being sent over the communication medium, becomes possible. To protect the text from steganalysis, the cover text needs to be changed periodically. Our approach uses a book, i.e. a collection of different chapters. The book should be privately owned by sender and receiver. One of the chapters from the book is selected as cover text. The choice of chapter is made randomly by the sender. For the reverse transformation at the receiver end, the same chapter must be selected as cover text. To achieve this, we exploit the unchanged part of the cover text.

The receiver decides from the stego text which chapter is to be used as cover text. This approach calculates the difference between each chapter in the book and the stego text, and selects the chapter with minimum difference.

Initially, each chapter in the book is scanned once to obtain a code for each sentence. Words which are common nouns, adjectives, adverbs or verbs are ignored in the calculation of the code. Values of the other words are calculated using the ASCII values of their characters and the position of the word in that particular sentence. All these values are summed to determine the code for that sentence. Similarly, codes for all sentences are calculated.

e.g.: “This is important for me”.

    Code = ‘t’ * 1 + ‘h’ * 1 + ‘i’ * 1 + ‘s’ * 1 + ‘i’ * 2 + ‘s’ * 2 + ‘f’ * 4 + ‘o’ * 4 + ‘r’ * 4 + ‘m’ * 5 + ‘e’ * 5

“important” is not used in the calculation of the code for this sentence as it is an adjective.

A similar algorithm for code generation is applied to the stego text. The code of the stego text is compared with the codes for all chapters. Ideally, the codes for all sentences of the stego text should match the codes of the chapter which was used as cover text by the sender. But our algorithm allows compound word replacements (“travel” can be replaced by “move around”). This causes a difference between the codes of the stego text and of the chapter to be used as cover text. This
difference, however, is very small compared to the difference for other chapters. So the chapter with minimum difference is selected as cover text.

5. Conclusion

The field of linguistic steganography is very interesting, as it conceals the very existence of a secret message from an intruder, which is not achievable by cryptography. The synonym replacement approach used for linguistic steganography produces innocuous-looking English text, thereby making detection of the secret message very hard. The famous Stego-Turing Test states that it is very hard for a computer to alter a natural language text in a way that is undetectable to a human. Many approaches to this have been tried in the past, but none has been able to solve the problem. Though our solution does not solve the problem either, it produces better quality output than previous approaches.

We also give a new approach to dynamically choosing a cover text from a body of text shared by sender and receiver. This allows the user to use a different cover text for hiding a message each time, thus making steganalysis difficult.

The synonym replacement approach to linguistic steganography using inflection classes, frequency statistics and dynamic cover text selection is a new improvement in the field of linguistic steganography and provides a very good, efficient tool for information hiding.

6. References

[1] P. Wayner, “Mimic functions,” Cryptologia XVI, pp. 193–214, July 1992.
[2] P. Wayner, “Strong theoretical steganography,” Cryptologia XIX, pp. 285–299, July 1995.
[3] P. Wayner, “Disappearing Cryptography - Information Hiding: Steganography & Watermarking”, 2nd ed., Morgan Kaufmann Publishers, Los Altos, CA 94022, USA, 2002, pp. 67-126, pp. 303-314.
[4] M. J. Atallah, V. Raskin, M. Crogan, C. Hempelmann, F. Kerschbaum, D. Mohamed, and S. Naik, “Natural language watermarking: Design, analysis, and a proof-of-concept implementation,” in Information Hiding: Fourth International Workshop, I. S. Moskowitz, ed., Lecture Notes in Computer Science 2137, pp. 185–199, Springer, April 2001.
[5] M. J. Atallah, V. Raskin, C. F. Hempelmann, M. Karahan, R. Sion, U. Topkara, and K. E. Triezenberg, “Natural language watermarking and tamperproofing,” in Information Hiding: Fifth International Workshop, F. A. P. Petitcolas, ed., Lecture Notes in Computer Science 2578, pp. 196–212, Springer, October 2002.
[6] K. Bennett, “Linguistic steganography: Survey, analysis, and robustness concerns for hiding information in text,” Tech. Rep. TR 2004-13, Purdue CERIAS, May 2004.
[7] M. T. Chapman, “Hiding the hidden: A software system for concealing ciphertext as innocuous text,” Master's thesis, University of Wisconsin-Milwaukee, May 1997.
[8] M. T. Chapman and G. I. Davida, “Hiding the hidden: A software system for concealing ciphertext as innocuous text,” in Information and Communications Security: First International Conference, O. S. Q. Yongfei Han Tatsuaki, ed., Lecture Notes in Computer Science 1334, Springer, August 1997.
[9] M. T. Chapman, G. I. Davida, and M. Rennhard, “A practical and effective approach to large-scale automated linguistic steganography,” in Information Security: Fourth International Conference, G. I. Davida and Y. Frankel, eds., Lecture Notes in Computer Science 2200, p. 156ff, Springer, October 2001.
[10] R. Bergmair, “Towards linguistic steganography: A systematic investigation of approaches, systems, and issues.” Final year thesis, April 2004. Handed in in partial fulfillment of the degree requirements for the degree “B.Sc. (Hons.) in Computer Studies” to the University of Derby.
[11] I. A. Bolshakov, “A method of linguistic steganography based on collocationally-verified synonymy,” in Information Hiding: 6th International Workshop, J. J. Fridrich, ed., Lecture Notes in Computer Science 3200, pp. 180–191, Springer, May 2004.
[12] K. Winstein, “Lexical steganography through adaptive modulation of the word choice hash,” January 1999. Was disseminated during secondary education at the Illinois Mathematics and Science Academy. The paper won the third prize in the 2000 Intel Science Talent Search.
[13] A. J. Tenenbaum, “Linguistic steganography: Passing covert data using text-based mimicry.” Final year thesis, April 2002. Submitted in partial fulfillment of the requirements for the degree of “Bachelor of Applied Science” to the University of Toronto.
[14] Vineeta Chand and C. Orhan Orgun, “Exploiting linguistic features in Lexical Steganography”, Proceedings of the 39th Hawaii International Conference on System Sciences - 2006.
APPENDIX

Secret Text:

    Escape from jail today evening.

Sample Cover Text:

EVER since I have been scrutinizing political events, I have taken a tremendous interest in propagandist activity. I saw that the Socialist-Marxist organizations mastered and applied this instrument with astounding skill. And I soon realized that the correct use of propaganda is a true art which has remained practically unknown to the bourgeois parties. Only the Christian-Social movement, especially in Lueger's time, achieved a certain virtuosity on this instrument, to which it owed many of its successes.

But it was not until the War that it became evident what immense results could be obtained by a correct application of propaganda. Here again, unfortunately, all our studying had to be done on the enemy side, for the activity on our side was modest, to say the least. The total miscarriage of the German 'enlightenment ' service stared every soldier in the face, and this spurred me to take up the question of propaganda even more deeply than before.

There was often more than enough time for thinking, and the enemy offered practical instruction which, to our sorrow, was only too good.

Sample StegoText:

ever since I have been scrutinizing political cases, I have taken a tremendous interest in propagandist action. I experienced that the Socialist-Marxist organizations subdued and practiced this tool with astounding accomplishment. And I shortly recognized that the right use of propaganda is a truthful art which has stayed much unknown to the businessperson parties. merely the Christian-Social move, particularly in Lueger 's time, accomplished a sure virtuosity on this instrument, to which it owed many of its successes. But it was not until the War that it turned evident what immense effects could be found by a right application of propaganda. hither once more, unfortunately, all our considering had to be made on the enemy side, for the action on our side was modest, to say the least. The total stillbirth of the German ` Nirvana ' service starred every soldier in the face, and this spurred me to bring up the inquiry of propaganda even more deeply than ahead. There was frequently more than plenty sentence for mentation, and the opposition offered practical education which, to our regret, was only too good.
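Supplementary sketch for Section 4: the sentence code and the minimum-difference chapter selection could look roughly like this. It is an illustration under assumptions rather than the authors' code. Each sentence is given as a list of (word, replaceable) pairs, where the `replaceable` flag is assumed to come from the part-of-speech tagging stage; words are lower-cased to match the worked example in Section 4; and the chapter difference is taken to be the sum of absolute differences between sentence codes at the same index, which the paper does not spell out. The function names are hypothetical.

```python
from itertools import zip_longest


def sentence_code(sentence):
    """Sum of (character codes * word position) over non-replaceable words.

    `sentence` is a list of (word, replaceable) pairs in sentence order.
    Positions count every word, including the ones that are skipped.
    """
    total = 0
    for position, (word, replaceable) in enumerate(sentence, start=1):
        if replaceable:
            continue  # may differ between cover and stego text, so skip it
        total += position * sum(ord(ch) for ch in word.lower())
    return total


def chapter_difference(chapter, stego):
    """Assumed metric: sentence-by-sentence absolute difference of codes."""
    pairs = zip_longest(chapter, stego, fillvalue=[])
    return sum(abs(sentence_code(a) - sentence_code(b)) for a, b in pairs)


def select_chapter(book, stego):
    """Return the index of the chapter closest to the stego text."""
    return min(range(len(book)),
               key=lambda i: chapter_difference(book[i], stego))


# The worked example from Section 4; "important" is flagged as replaceable.
example = [("This", False), ("is", False), ("important", True),
           ("for", False), ("me", False)]
print(sentence_code(example))
```

A compound replacement such as “move around” for “travel” shifts the positions of the following words, so the matching chapter may show a small non-zero difference; taking the minimum rather than requiring an exact match absorbs this.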