A program to convert text files from one specification to another

 
Greenhorn
Posts: 4
Please can anyone help? I was asked to write a program in Java to convert .xml, .dat, .html, .add, and .txt files (they are all compressed files in different folders) to XML format.

Requirements to output specification

The XML format to convert to:

<Patent number="patent number" kind="patent revision number" country="country" date="date the patent was granted, in YYYYMMDD format" lang="patent language" AppNumber="application number" AppKind="application revision number" DisclaimerDate="date the application was rejected">
<classificationipcmain section ="section letter" class = "class number" subclass = "subclass letter" main-group = "main group number" subgroup = "subgroup number"/>
<classificationipcadditional>
<classificationipc section ="section letter" class = "class number" subclass = "subclass letter" main-group = "main group number" subgroup = "subgroup number"/>
<classificationipc section ="section letter" class = "class number" subclass = "subclass letter" main-group = "main group number" subgroup = "subgroup number"/>
..............
</classificationipcadditional>
<classificationUSmain class = "class number" subclass = "subclass letter"/>
<classificationUSadditional>
<classificationUS class = "class number" subclass = "subclass letter"/>
<classificationUS class = "class number" subclass = "subclass letter"/>
...........
</classificationUSadditional>
<Title>Patent title</Title>
<TitleEng>Patent title in English</TitleEng>
<RelatesPatents> // UREF field in the .txt files
<RelatedPatent number="patent number" kind="patent kind" country="patent country" class="patent class" date="patent publication date"/>
<RelatedPatent number="patent number" kind="patent kind" country="patent country" class="patent class" date="patent publication date"/>
<RelatedPatent number="patent number" kind="patent kind" country="patent country" class="patent class" date="patent publication date"/>
</RelatesPatents>
<RelatesForeignPatents> // FREF field in the .txt files
<RelatedForeignPatent number="patent number" country="patent country" class="patent class" date="patent publication date"/>
<RelatedForeignPatent number="patent number" country="patent country" class="patent class" date="patent publication date"/>
<RelatedForeignPatent number="patent number" country="patent country" class="patent class" date="patent publication date"/>
</RelatesForeignPatents>
<Authors>
<Author Name="full name of the patent author" ></Author>
<Author>Full name of the patent author</Author>
<Author>Full name of the patent author</Author>
</Authors>
<Company>Name of the company that owns the patent</Company>
<Description>Patent description without HTML junk. Note that all HTML tags must be removed, except the <p> tag, which must be replaced with "\n". Clean text made up of sentences.</Description> // DETD field in the .txt files
<DescriptionShort>Short patent description without HTML junk. Note that all HTML tags must be removed, except the <p> tag, which must be replaced with "\n". Clean text made up of sentences.</DescriptionShort> // BSUM field in the .txt files
<Abstract>Patent abstract without HTML junk. Note that all HTML tags must be removed, except the <p> tag, which must be replaced with "\n". Clean text made up of sentences.</Abstract>
<AbstractEng>Patent abstract in English without HTML junk. Note that all HTML tags must be removed, except the <p> tag, which must be replaced with "\n". Clean text made up of sentences.</AbstractEng>
<Claims>
<Claim>Patent claim without HTML junk. Note that all HTML tags must be removed, except the <p> tag, which must be replaced with "\n". Clean text made up of sentences.</Claim>
<Claim>Patent claim without HTML junk. Note that all HTML tags must be removed, except the <p> tag, which must be replaced with "\n". Clean text made up of sentences.</Claim>
</Claims>
<Drawings> // DRWD field in the .txt files
<Drawing>Drawing title</Drawing>
<Drawing>Drawing title</Drawing>
<Drawing>Drawing title</Drawing>
</Drawings>
</Patent>

Note
After conversion, each patent must be in a separate .xml file. Files must be placed in folders that correspond to the names of the files. After converting each archive, archive the resulting folder, keep the archive, and delete the folder itself. If you do not, you will eventually end up with 600 gigabytes. When archiving, use the "-mx=9" switch for 7z for maximum compression.

After writing the converter, estimate the time needed to convert the entire data set, as well as the total size of the resulting files (in compressed form). In total there are about 10 million patents.

Please, I would appreciate it if anyone could provide me with assistance.
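
The cleaning rule in the specification above (delete every HTML tag except <p>, which becomes "\n") could be sketched in Java roughly like this. HtmlCleaner and clean are hypothetical names, not part of the assignment. The regular expressions assume reasonably well-formed markup, and character entities such as &amp; are left as they are; badly broken HTML would probably need a real HTML parser instead.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not from the thread): cleans one text field
// following the rule "strip all tags, turn <p> into a newline".
final class HtmlCleaner {
    // An opening <p> tag, with optional attributes, in any letter case.
    private static final Pattern P_OPEN = Pattern.compile("(?i)<p(\\s[^>]*)?>");
    // Any tag that is still left, including </p>.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static String clean(String html) {
        // Step 1: every opening <p> becomes a line break.
        String withBreaks = P_OPEN.matcher(html).replaceAll("\n");
        // Step 2: drop all remaining tags, then trim surrounding whitespace.
        return ANY_TAG.matcher(withBreaks).replaceAll("").trim();
    }
}
```

For example, HtmlCleaner.clean("<p>First.</p><p>Second <b>bold</b>.</p>") gives the two lines "First." and "Second bold." separated by a single newline; the leading newline is trimmed away.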
 
author & internet detective
Posts: 42135
937
Eclipse IDE VI Editor Java
This sounds like homework. Think about how you would design it and post that for feedback. Or ask about which part of the assignment you are stuck on.
 
James Ogar
Greenhorn
Posts: 4

Jeanne Boyarsky wrote: This sounds like homework. Think about how you would design it and post that for feedback. Or ask about which part of the assignment you are stuck on.



Yes, yes, I need some guidance, please.
 
Author and all-around good cowpoke
Posts: 13078
6
Harold's free online book is loaded with examples of doing various XML-related jobs with Java.

Bill
 
lowercase baba
Posts: 13091
67
Chrome Java Linux

James Ogar wrote: Yes, yes, I need some guidance, please.


1) Turn off your computer.
2) Get paper, pencils, and erasers.
3) Work out in English, Spanish, Russian, French, German or whatever natural language you prefer how to do it, step by step.
4) Revise those steps, adding in details and clarity.
5) Repeat step 4 until you could hand the directions to a 10-year-old child and expect them to be able to follow your steps.
6) ONLY when the above steps are complete should you consider writing a single line of Java.
 
Marshal
Posts: 80623
469
Looks like Russian to me

Welcome to the Ranch
 
Rancher
Posts: 43081
77
Creating the XML is probably not the major problem. Extracting structured information from the various source formats is likely the hard part. For that, Fred's advice is spot on: think about how to get at the information for each of the file formats you need to deal with, and only then consider how to represent it in memory and how to write it to an XML file.
 
James Ogar
Greenhorn
Posts: 4
Please, guys, I have made progress by using the open-source 7-Zip C# code to extract the files from the archives. For now, the only thing I need is some Java code to convert the various file formats to UTF-8 XML. Please, I need your help. Thanks.
 
Ulf Dittmer
Rancher
Posts: 43081
77
Which file formats have you yet to tackle? The API will likely be different for each format. Or you can give Apache Tika a try; it can extract text from a wide range of formats.

Creating XML can be done using any number of APIs, or just writing it directly to a text file.
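
As one illustration of the "any number of APIs" option, here is a minimal sketch using the JDK's own StAX writer (javax.xml.stream.XMLStreamWriter). The class name PatentXmlSketch and the choice of just two attributes plus the Title element are assumptions made for the example; a real converter would cover every field of the target format.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical sketch: writes a tiny subset of the <Patent> format.
final class PatentXmlSketch {
    static String write(String number, String country, String title) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartDocument("UTF-8", "1.0");
            xml.writeStartElement("Patent");
            xml.writeAttribute("number", number);
            xml.writeAttribute("country", country);
            xml.writeStartElement("Title");
            xml.writeCharacters(title);   // the writer escapes &, < and > itself
            xml.writeEndElement();        // closes Title
            xml.writeEndElement();        // closes Patent
            xml.writeEndDocument();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
        return out.toString();
    }
}
```

Letting the API do the escaping is the main advantage over writing the text by hand: a title such as "A & B" comes out as "A &amp; B" without any extra code. To produce an actual UTF-8 file, the same factory also offers createXMLStreamWriter(OutputStream, "UTF-8").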
 
James Ogar
Greenhorn
Posts: 4

Ulf Dittmer wrote: Which file formats have you yet to tackle? The API will likely be different for each format. Or you can give Apache Tika a try; it can extract text from a wide range of formats.

Creating XML can be done using any number of APIs, or just writing it directly to a text file.




I am converting the following file formats: .add, .dat, .txt
 
Ulf Dittmer
Rancher
Posts: 43081
77
.add and .dat are not standard file formats; you'll have to figure out on your own how to extract the data from them. .txt is easy to read, but there, too, you have to figure out its format, meaning where in the text file is stored which of the various pieces of data you intend to fill the XML with.
 
fred rosenberger
lowercase baba
Posts: 13091
67
Chrome Java Linux
The extension on a file name is really meaningless. It is a HINT - nothing more - to the computer on what program should be used to open the file. You can take any file on your computer and change its extension. This will most likely cause your computer to try to open the file with the wrong program - so a file with text in it will be opened by (say) iTunes, which will then complain it can't understand the content of the file.

So simply saying "I have a .dat file" tells us nothing. If that's all the info you have, you are in trouble. You will need to analyze the data in the file by opening it with something - possibly a hex viewer if the info is binary - and then pray you can make sense of it.
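
If no hex viewer is at hand, a few lines of Java can show the first bytes of a file. This is only a sketch: FilePeek, toHex and head are hypothetical names, and readNBytes(int) assumes Java 11 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for looking at what a file really contains,
// regardless of its extension.
final class FilePeek {
    // Formats bytes as two-digit upper-case hex values separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    // Reads at most 'count' bytes from the start of the file and hex-dumps them.
    static String head(Path file, int count) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return toHex(in.readNBytes(count)); // Java 11+
        }
    }
}
```

A file that starts with 3C 3F 78 6D 6C ("<?xml") is almost certainly XML text whatever it is called, while 50 4B 03 04 ("PK") marks a ZIP archive and 37 7A BC AF 27 1C a 7z archive.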
 
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu

James Ogar wrote: Please, guys, I have made progress by using the open-source 7-Zip C# code to extract the files from the archives.


Well that seems like an odd choice if this is supposed to be a Java application.

For now, the only thing I need is some Java code to convert the various file formats to UTF-8 XML.


No, you don't. And even if you did, that's not what we do here. We are NotACodeMill.

It sounds to me like you've been given an awful lot of instructions about HOW to do this, when what you really need to do is understand WHAT needs to be done.

My advice:
1. Follow Fred's advice.
2. Forget all about folders and compression and patents and archiving and concentrate on the problem:
How do I convert [some] file to XML?

Pick ONE file format and get some sample files, and make sure that you can convert uncompressed versions of that format to XML, file by file - i.e., write a program or method or class that converts ONE file at a time.

Once (and only once) you're satisfied that you can convert ANY file of that format, pick another format and do the same for that. In the process, you'll probably discover that there are several things that you're repeating, so refactor them into common methods (or possibly, a common class).
And when you've done that, repeat the process with a 3rd format...
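
One possible shape for that refactoring, offered as a sketch rather than the way to do it: each format gets its own parser that turns ONE file's content into a shared in-memory record, so the XML-writing code only has to exist once. PatentRecord, FormatParser and ConverterRegistry are hypothetical names, and the record carries just two of the target format's fields.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Shared in-memory form of a patent (only two fields shown).
final class PatentRecord {
    final String number;
    final String title;

    PatentRecord(String number, String title) {
        this.number = number;
        this.title = title;
    }
}

// One implementation per input format; each call handles ONE file's content.
interface FormatParser {
    PatentRecord parse(String fileContent);
}

// Common code lives here once, instead of being repeated in every converter.
final class ConverterRegistry {
    private final Map<String, FormatParser> parsers = new LinkedHashMap<>();

    void register(String formatName, FormatParser parser) {
        parsers.put(formatName, parser);
    }

    PatentRecord convert(String formatName, String fileContent) {
        FormatParser parser = parsers.get(formatName);
        if (parser == null) {
            throw new IllegalArgumentException("No parser for format: " + formatName);
        }
        return parser.parse(fileContent);
    }
}
```

The format name is whatever you decide after inspecting the sample files; it does not have to be the file extension. Adding the 3rd format then means writing one more FormatParser and one more register call.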

Finally, once you have all your converters written and working, THEN worry about compressing/uncompressing your input and output and putting it in the right folders.
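
When that last step does come, the archiving requirement from the original post (7z with "-mx=9") can be driven from Java with ProcessBuilder. A minimal sketch, assuming a 7z executable is on the PATH; SevenZip, command and compress are hypothetical names, and deleting the folder afterwards is left out.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the 7z command-line tool.
final class SevenZip {
    // Builds: 7z a -mx=9 <archive> <folder>
    static List<String> command(Path folder, Path archive) {
        return Arrays.asList("7z", "a", "-mx=9", archive.toString(), folder.toString());
    }

    // Runs 7z and fails loudly if it reports an error.
    static void compress(Path folder, Path archive) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(folder, archive))
                .inheritIO()   // show 7z's own output in the console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("7z exited with code " + exitCode);
        }
    }
}
```

Checking the exit code before deleting the source folder matters here: with about 10 million patents, a silently failed archive would be easy to miss.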

There really is no "magic bullet" to stuff like this. You need to break down the problem and understand it.

HIH

Winston
 