#! /usr/bin/env perl
eval 'exec perl -S $0 ${1+"$@"}'
  if $running_under_some_shell;

# po4a-gettextize -- convert an original file to a PO file
#
# Copyright 2002-2020 by SPI, inc.
#
# This program is free software; you can redistribute it and/or modify it
# under the terms of GPL (see COPYING).

=encoding UTF-8

=head1 NAME

po4a-gettextize - convert an original file (and its translation) to a PO file

=head1 SYNOPSIS

B<po4a-gettextize> B<-f> I<fmt> B<-m> I<master.doc> [B<-l> I<XX.doc>] B<-p> I<XX.po>

(I<XX.po> is the output, all others are inputs)

=head1 DESCRIPTION

po4a (PO for anything) eases the maintenance of documentation translation using
the classical gettext tools. The main feature of po4a is that it decouples the
translation of content from its document structure.  Please refer to the page
L<po4a(7)> for a gentle introduction to this project.

The B<po4a-gettextize> script is in charge of converting documentation files into
PO files. You only need it to set up your translation project with po4a, never afterward.

If you start from scratch, B<po4a-gettextize> will extract the translatable
strings from the documentation and write a POT file. If you provide a previously
existing translated file with the B<-l> flag, B<po4a-gettextize> will try to use
the translations that it contains in the produced PO file. This process remains
tedious and manual, as explained in Section 'Converting a manual translation to
po4a' below.

If the master document has non-ASCII characters, the new generated PO file will
be in UTF-8. Otherwise (if the master document is completely in ASCII), the generated
PO will use the encoding of the translated input document, or UTF-8 if no
translated document is provided.
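
As a rough sketch of the rule above, the choice of output encoding can be
written as a small Perl function. The name C<po_charset> and its two arguments
are assumptions made for this illustration only; they do not exist in po4a.

    use strict;
    use warnings;

    # Hypothetical helper, not part of po4a: pick the charset of the
    # generated PO file from the two facts the rule depends on.
    sub po_charset {
        my ( $master_is_ascii, $localized_charset ) = @_;

        # A master document with non-ASCII characters always gives UTF-8.
        return 'UTF-8' unless $master_is_ascii;

        # Pure ASCII master: reuse the translation's charset when there is one.
        return defined $localized_charset ? $localized_charset : 'UTF-8';
    }

    print po_charset( 0, 'ISO-8859-1' ), "\n";    # UTF-8
    print po_charset( 1, 'ISO-8859-1' ), "\n";    # ISO-8859-1
    print po_charset( 1, undef ),        "\n";    # UTF-8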

=head1 OPTIONS

=over 4

=item B<-f>, B<--format>

Format of the documentation you want to handle. Use the B<--help-format>
option to see the list of available formats.

=item B<-m>, B<--master>

File containing the master document to translate. You can use this option
multiple times if you want to gettextize multiple documents.

=item B<-M>, B<--master-charset>

Charset of the file containing the document to translate.

=item B<-l>, B<--localized>

File containing the localized (translated) document. If you provided
multiple master files, you may wish to provide multiple localized files by
using this option more than once.

=item B<-L>, B<--localized-charset>

Charset of the file containing the localized document.

=item B<-p>, B<--po>

File where the message catalog should be written. If not given, the message
catalog will be written to the standard output.

=item B<-o>, B<--option>

Extra option(s) to pass to the format plugin. Specify each option in the
'I<name>B<=>I<value>' format. See the documentation of each plugin for more
information about the valid options and their meanings.

=item B<-h>, B<--help>

Show a short help message.

=item B<--help-format>

List the documentation formats understood by po4a.

=item B<-V>, B<--version>

Display the version of the script and exit.

=item B<-v>, B<--verbose>

Increase the verbosity of the program.

=item B<-d>, B<--debug>

Output some debugging information.

=item B<--msgid-bugs-address> I<email@address>

Set the report address for msgid bugs. By default, the created POT files
have no Report-Msgid-Bugs-To fields.

=item B<--copyright-holder> I<string>

Set the copyright holder in the POT header. The default value is
"Free Software Foundation, Inc."

=item B<--package-name> I<string>

Set the package name for the POT header. The default is "PACKAGE".

=item B<--package-version> I<string>

Set the package version for the POT header. The default is "VERSION".

=back

=head2 Converting a manual translation to po4a

B<po4a-gettextize> will try to extract the content of any provided translation
file, and use this content as msgstr in the produced PO file. Be warned that
this process is very fragile: the Nth string of the translated file is supposed
to be the translation of the Nth string in the original. This will naturally not
work unless both files share exactly the same structure.

Internally, each po4a parser reports the syntactical type of each extracted
string. This is how desynchronizations are detected during the gettextization.
For example, if the files have the following structure, it is very unlikely that
the 4th string in translation (of type 'chapter') is the translation of the 4th
string in original (of type 'paragraph'). It is more likely that a new
paragraph was added to the original, or that two original paragraphs were merged
together in the translation.

    Original         Translation

  chapter            chapter
    paragraph          paragraph
    paragraph          paragraph
    paragraph        chapter
  chapter              paragraph
    paragraph          paragraph

B<po4a-gettextize> will verbosely diagnose any detected structure
desynchronization. When this happens, you should manually edit the files (this
probably requires that you have some notions of the target language). You must
add fake paragraphs or remove some content in one of the documents (or both) to
fix the reported disparities, until the structures of both documents perfectly
match. Some tricks are given in the next section.
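
The type comparison described above can be illustrated with a short Perl
sketch that walks two lists of syntactical types in parallel and reports the
first position where they differ. The sub name C<first_desync> and the plain
array representation are assumptions made for this example; they are not part
of the po4a API, which works on PO entries rather than bare type lists.

    use strict;
    use warnings;

    # Hypothetical helper: return the index of the first position where the
    # two type lists disagree, or -1 if they match over their whole length.
    sub first_desync {
        my ( $orig, $trans ) = @_;
        my $max = @$orig > @$trans ? scalar @$orig : scalar @$trans;
        for my $i ( 0 .. $max - 1 ) {
            my $o = $i < @$orig  ? $orig->[$i]  : '(missing)';
            my $t = $i < @$trans ? $trans->[$i] : '(missing)';
            return $i if $o ne $t;
        }
        return -1;
    }

    # The structure from the example above.
    my @orig  = qw(chapter paragraph paragraph paragraph chapter paragraph);
    my @trans = qw(chapter paragraph paragraph chapter paragraph paragraph);

    my $where = first_desync( \@orig, \@trans );
    print "first mismatch at string number ", $where + 1, "\n";

With the structure shown above, the sketch reports the first mismatch at
string number 4.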

Even when the document is successfully processed, undetected disparities and
silent errors are still possible. That is why any translation associated
automatically by B<po4a-gettextize> is marked as I<fuzzy> to require a manual
inspection by humans. One has to check that each retrieved msgstr is actually
the translation of the associated msgid, and not the string before or after.
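
The positional pairing and the I<fuzzy> marking described above can be
pictured with the following Perl sketch. The sub C<pair_by_position> and the
hash keys are assumptions for this example; the script delegates the real work
to the C<gettextize> method of B<Locale::Po4a::Po>, whose internals are not
reproduced here.

    use strict;
    use warnings;

    # Hypothetical helper: pair the Nth original string with the Nth
    # translated string and flag every pair as fuzzy for human review.
    sub pair_by_position {
        my ( $orig, $trans ) = @_;
        die "string counts differ\n" unless @$orig == @$trans;
        return map {
            +{ msgid => $orig->[$_], msgstr => $trans->[$_], fuzzy => 1 }
        } 0 .. $#{$orig};
    }

    my @entries = pair_by_position( [ 'Hello', 'Bye' ], [ 'Bonjour', 'Au revoir' ] );
    print "$_->{msgid} => $_->{msgstr} (fuzzy)\n" for @entries;

Nothing in such a pairing checks the meaning of the strings, which is why
every pair still has to be reviewed by a human.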

As you can see, the key here is to have the exact same structure in the
translated document and in the original one. The best is to do the
gettextization on the exact version of F<master.doc> that was used for the
translation, and only update the PO file against the latest master file once the
gettextization was successful.

If you are lucky enough to have a perfect match in the file structures,
building a correct PO file is a matter of seconds. Otherwise, you will soon
understand why this process has such an ugly name :) But remember that this
grunt work is the price to pay to get the comfort of po4a afterward. Once
converted, the synchronization between master documents and translations will
always be fully automatic.

Even when things go wrong, gettextization often remains faster than translating
everything again. I was able to gettextize the existing French translation of
the whole Perl documentation in one day, even though the structure of many
documents was desynchronized. That was more than two megabytes of original text
(2 million characters): restarting the translation from scratch would have
required several months of work.

=head2 Hints and tricks for the gettextization process

The gettextization stops as soon as a desynchronization is detected. In theory,
it should probably be possible to resynchronize the gettextization later in the
documents using e.g. the same algorithm as the L<diff(1)> utility. But a manual
intervention would still be mandatory to match the elements that
couldn't be automatically matched, explaining why automatic resynchronization is
not implemented (yet?).

When this happens, the whole game comes down to the alignment of these damn
files' structures again through manual edits. B<po4a-gettextize> is rather
verbose about what went wrong when it happens. It reports the strings that don't
match, their positions in the text, and the type of each of them. Moreover, the
PO file generated so far is dumped as F<gettextization.failed.po> for further
inspection.

Here are some other tricks to help you in this tedious process:

=over

=item

Remove all extra content of the translations, such as the section giving credits
to the translators. You can add them back in po4a afterward, using an addendum
(see L<po4a(7)>).

=item

If you need to edit the files to align their structures, you should prefer
editing the translation if possible. Indeed, if the changes to the original are
too intrusive, the old and new versions will not be matched during the PO
update, and the corresponding translation will be dumped anyway. But do not
hesitate to also edit the original document if required: the important thing is
to get a first PO file to start with.

=item

Do not hesitate to kill any original content that would not exist in the
translated version. This content will be automatically reintroduced afterward,
when synchronizing the PO file with the document.

=item

You should probably inform the original author of any structural change in the
translation that seems justified. Issues in the original document should be reported
to the author. Fixing them in your translation only fixes them for a part of the
community. Plus, it is impossible to do so when using po4a ;)

=item

Sometimes, the paragraph contents match, but not their types. Fixing it is
rather format-dependent. In POD and man, it often comes from the fact that one
of them contains a line beginning with a white space while the other does not.
In those formats, such paragraphs cannot be wrapped and thus become a different
type. Just remove the space and you are fine. It may also be a typo in the tag
name in XML.

Likewise, two paragraphs may get merged together in POD when the separating
line contains some spaces, or when there is no empty line between the B<=item>
line and the content of the item.

=item

Sometimes, the desynchronization message seems odd because the translation is
attached to the wrong original paragraph. It is the sign of an undetected issue
earlier in the process. Search for the actual desynchronization point by
inspecting F<gettextization.failed.po>, and fix the problem where it really is.

=item

In some unfortunate settings, you will get the feeling that po4a ate some parts
of the text, either the original or the translation. F<gettextization.failed.po>
indicates that both files matched as expected up to the paragraph N. But then,
an (unsuccessful) attempt is made to match the N+1 paragraph in the original
file not with the N+1 paragraph in the translation as it should, but with the
N+2 paragraph. Just as if the N+1 paragraph that you see in the document simply
disappeared from the file during the process.

This unfortunate situation happens when the same paragraph is repeated over
the document. In that case, no new entry is created in the PO file, but a
new reference is added to the existing one instead.

So, the previous situation occurs when two similar but different paragraphs are
translated in the exact same way. This will apparently remove a paragraph of the
translation. To fix the problem, it is sufficient to slightly alter one of the
translations in the document. Alternatively, you can kill the second paragraph
in the original document.

Conversely, if the same paragraph appearing twice in the original document
is not translated in the exact same way at both locations, you will get the
feeling that one paragraph of the original document just vanished. Just copy the
best translation over the other one in the translated document to fix the
problem.

=item

As a final note, do not be too surprised if the first synchronization of your PO
file takes a long time. This is because most of the msgids of the PO file
resulting from the gettextization don't exactly match any element of the POT
file built from the recent master files. This forces gettext to search for the
closest one using a costly string proximity algorithm.

For example, the first B<po4a-updatepo> of the Perl documentation's French
translation (5.5 MB PO file) took about 48 hours (yes, two days) while the
subsequent ones only take about a dozen seconds.

=back
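
The "vanishing paragraph" effect described in the hints above can be
reproduced with a small Perl sketch. It assumes a deliberately simplified
model in which identical strings collapse into a single entry, as the text
above says happens in the PO file. The sub name C<unique_in_order> and the
sample strings are hypothetical, and this is not po4a's actual code.

    use strict;
    use warnings;

    # Hypothetical model: keep each distinct string once, in first-seen
    # order, the way repeated paragraphs share one PO entry.
    sub unique_in_order {
        my %seen;
        return grep { !$seen{$_}++ } @_;
    }

    # Two different original paragraphs translated in the exact same way.
    my @orig  = ( 'See above.', 'See the previous section.', 'Next topic.' );
    my @trans = ( 'Voir ci-dessus.', 'Voir ci-dessus.', 'Sujet suivant.' );

    my @o = unique_in_order(@orig);     # three distinct entries
    my @t = unique_in_order(@trans);    # two entries: one translation is gone
    printf "%d original entries vs %d translated entries\n",
        scalar @o, scalar @t;

In this model the second original string ends up facing C<Sujet suivant.>,
which is the N+1 versus N+2 mismatch described above.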

=head1 SEE ALSO

L<po4a(1)>,
L<po4a-normalize(1)>,
L<po4a-translate(1)>,
L<po4a-updatepo(1)>,
L<po4a(7)>.

=head1 AUTHORS

 Denis Barbier <barbier@linuxfr.org>
 Nicolas François <nicolas.francois@centraliens.net>
 Martin Quinson (mquinson#debian.org)

=head1 COPYRIGHT AND LICENSE

Copyright 2002-2020 by SPI, inc.

This program is free software; you may redistribute it and/or modify it
under the terms of GPL (see the COPYING file).

=cut

use 5.006;
use strict;
use warnings;

use Getopt::Long qw(GetOptions);

use Locale::Po4a::Chooser;
use Locale::Po4a::TransTractor;
use Locale::Po4a::Common;

use Pod::Usage qw(pod2usage);

Locale::Po4a::Common::textdomain('po4a');

sub show_version {
    Locale::Po4a::Common::show_version("po4a-gettextize");
    exit 0;
}

my %opts = (
    "verbose"            => 0,
    "debug"              => 0,
    "copyright-holder"   => undef,
    "msgid-bugs-address" => undef,
    "package-name"       => undef,
    "package-version"    => undef
);

my ($pofile) = ('-');
my ( @masterfile, @locfile, $help_fmt, $help, $type, @options );
my ( $mastchar, $locchar );
Getopt::Long::config( 'bundling', 'no_getopt_compat', 'no_auto_abbrev' );
GetOptions(
    'help|h'      => \$help,
    'help-format' => \$help_fmt,

    'master|m=s'    => \@masterfile,
    'localized|l=s' => \@locfile,
    'po|p=s'        => \$pofile,
    'format|f=s'    => \$type,

    'master-charset|M=s'    => \$mastchar,
    'localized-charset|L=s' => \$locchar,

    'option|o=s' => \@options,

    'copyright-holder=s'   => \$opts{"copyright-holder"},
    'msgid-bugs-address=s' => \$opts{"msgid-bugs-address"},
    'package-name=s'       => \$opts{"package-name"},
    'package-version=s'    => \$opts{"package-version"},

    'verbose|v' => \$opts{"verbose"},
    'debug|d'   => \$opts{"debug"},
    'version|V' => \&show_version
) or pod2usage();

# Argument check
$help && pod2usage( -verbose => 1, -exitval => 0 );
$help_fmt && Locale::Po4a::Chooser::list(0);
pod2usage() if ( scalar @ARGV > 1 ) || ( scalar @masterfile < 1 );

foreach (@options) {
    if (m/^([^=]*)=(.*)$/) {
        $opts{$1} = "$2";
    } else {
        $opts{$_} = 1;
    }
}

# Check file existence
foreach my $file ( @masterfile, @locfile ) {
    $file eq '-' || -e $file || die wrap_msg( gettext("File %s does not exist."), $file );
}

# Declare the TransTractor parsers
my ( $mastertt, $transtt ) = ( Locale::Po4a::Chooser::new( $type, %opts ), Locale::Po4a::Chooser::new( $type, %opts ) );

# Parse master file forcing conversion to utf if it's not in ascii
foreach my $file (@masterfile) {
    $mastertt->read( $file, $file );
}
$mastertt->{TT}{utf_mode} = 1;
if ( $mastertt->{TT}{ascii_input} ) {
    $mastertt->detected_charset('ascii');
} elsif ( defined($mastchar) ) {
    $mastertt->detected_charset($mastchar);
    $mastertt->{TT}{po_in}->set_charset($mastchar);
}
$mastertt->parse;

# Implementation note:
# In practice, po4a-gettextize uses the po4a parsers on both the original and the
# translation files to extract two PO files. A third PO file is built from them
# taking strings from the second as translation of strings from the first.

unless ( scalar @locfile >= 1 ) {

    # Ok, outputting the POT extracted from the original is enough
    $mastertt->writepo($pofile);
} else {

    # We have to merge two transtractor files

    foreach my $file (@locfile) {
        $transtt->read( $file, $file );
    }

    # We force the conversion to utf if the master document wasn't in ascii
    $transtt->{TT}{utf_mode} = !$mastertt->{TT}{ascii_input};
    $transtt->detected_charset($locchar);
    $transtt->{TT}{po_in}->set_charset($locchar);
    $transtt->parse;

    my $mergedpo = Locale::Po4a::Po->gettextize( $mastertt->getpoout(), $transtt->getpoout() );

    $mergedpo->write($pofile);
}

__END__